# CUDA problem with const int (if it is >200000 it fails for some reason)

KingBoo

August 30, 2012 4:40:51 AM

I want to compare the computing performance of my GTX 460 v2 against my CPU. The only way to do this easily is with sqrts (did I mention I love sqrts?)

Anyway, there is a const int that sets both the size of the array and the bound of the for loop. I am a complete CUDA newbie, but I do understand C++ to an extent (not so much vanilla C with its pointers).

I just can't wrap my head around why changing the const int from 200,000 to anything larger (e.g. 300,000) would change the result from, let's say, 6605 -> 1.000004 to 6605 -> 6605.

Thanks for your help

    #include <stdio.h>
    #include <iostream>
    #include <math.h>

    using namespace std;

    // Kernel that executes on the CUDA device
    __global__ void double_array( float *a, int N )
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if ( idx < N )
            a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
    }

    // Main routine that executes on the host
    int main( void )
    {
        cudaDeviceProp prop; // Struct that contains device properties
        int dev; // Integer to save the device number
        cudaGetDevice(&dev); // Get the number of the device in use
        cudaGetDeviceProperties(&prop, dev); // Get the properties

        // Print the number and name of the CUDA device in use
        cout << "Cuda device: " << dev << " with name: "
             << prop.name << endl << endl;

        float *a_h, *a_d; // Pointer to host & device arrays
        const int N = 200000; // Number of elements in arrays
        size_t size = N * sizeof( float );
        a_h = (float *)malloc( size ); // Allocate array on host
        cudaMalloc( (void **)&a_d, size ); // Allocate array on device

        // Initialize host array and copy it to CUDA device
        for ( int i = 0; i < N; i++ )
            a_h[i] = (float)i;
        cudaMemcpy( a_d, a_h, size, cudaMemcpyHostToDevice );

        // Do calculation on device:
        int block_size = 4;
        int n_blocks = N / block_size + ( N % block_size == 0 ? 0 : 1 );
        double_array <<< n_blocks, block_size >>> ( a_d, N );

        // Retrieve result from device and store it in host array
        cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );

        // Print results
        for ( int i = 0; i < N; i++ )
        {
            cout << i << " " << fixed << a_h[i] << endl;
        }

        // Free the memory on the host and the CUDA device
        free( a_h );
        cudaFree( a_d );
    }

for (int i = 0; i < 21; i++) a[idx] = sqrt(a[idx]);

Did you try running that code on the CPU?

Also, why are you doing this?

int n_blocks = N / block_size + ( N % block_size == 0 ? 0 : 1 );

Doesn't N already tell you how many elements the array has?

Lastly, checking this:

if (idx < N)

is pretty much useless. If that check is where the error shows up, it means you're calculating the index wrong.
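For context on the two lines questioned above: the `n_blocks` expression is a ceiling division, so that `n_blocks * block_size` is never smaller than N, and the `idx < N` guard only matters when that product is larger than N. A minimal host-only C++ sketch of the arithmetic follows (no GPU needed; `blocks_needed` and `spare_threads` are hypothetical helper names, not part of the posted code):

```cpp
#include <cassert>

// Hypothetical helper: number of blocks so that n_blocks * block_size >= n.
// Same arithmetic as the n_blocks line in the posted code.
int blocks_needed(int n, int block_size) {
    assert(n >= 0 && block_size > 0);
    return n / block_size + (n % block_size == 0 ? 0 : 1);
}

// Hypothetical helper: threads in the last block whose index would be >= n.
// These are the threads that the `if (idx < N)` guard keeps from writing
// past the end of the array.
int spare_threads(int n, int block_size) {
    return blocks_needed(n, block_size) * block_size - n;
}
```

With the values from the post, `blocks_needed(200000, 4)` is 50000 and `spare_threads(200000, 4)` is 0, so the guard never triggers there. With N = 200001 the same formula gives 50001 blocks and 3 spare threads, and it is those three that the guard protects against.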

KingBoo

September 1, 2012 2:49:15 AM

Now I understand the concept of blocks and threads. I looked up my gfx card info (here it is):

    --- General Information for device 0 ---
    Name: GeForce GTX 460 v2
    Compute capability: 2.1
    Clock rate: 1647000
    Device copy overlap: Enabled
    Kernel execition timeout : Enabled

    --- Memory Information for device 0 ---
    Total global mem: 1073414144
    Total constant Mem: 65536
    Max mem pitch: 2147483647
    Texture Alignment: 512

    --- MP Information for device 0 ---
    Multiprocessor count: 7
    Shared mem per mp: 49152
    Registers per mp: 32768
    Threads in warp: 32
    Max threads per block: 1024
    Max thread dimensions: (1024, 1024, 64)
    Max grid dimensions: (65535, 65535, 65535)

So I can have at most 65535 blocks and 1024 threads per block. For whatever reason, I have to use fewer than that to get correct results (about 50k blocks and about 500 threads).

Here is the revised version, which I actually understand now (the posted code was someone's example).

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <iostream>
    #include <math.h>
    #include <stdio.h>

    using namespace std;

    // Kernel that executes on the CUDA device
    __global__ void double_array( float *a, int N )
    {
        int idx = blockIdx.x;
        a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
    }

    // Main routine that executes on the host
    int main( void )
    {
        float *a_h, *a_d;
        const int N = 200000;
        a_h = (float *)malloc( N * sizeof(float) ); // Allocate array on host
        cudaMalloc( &a_d, N * sizeof(float) ); // Allocate array on device

        // Initialize host array and copy it to CUDA device
        for ( int i = 0; i < N; i++ )
            a_h[i] = (float)i;
        cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );

        double_array <<< 40000, 1 >>> ( a_d, N );

        // Retrieve result from device and store it in host array
        cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );

        // Print results
        for ( int i = 0; i < N; i++ )
        {
            cout << i << " " << fixed << a_h[i] << endl;
        }

        // Free the memory on the host and the CUDA device
        free( a_h );
        cudaFree( a_d );
    }

I changed it to 40k blocks of 1 thread each. Past index 40k I get the same thing as before, which explains why it happened.

This leaves me with the problem of how to calculate all 200k elements correctly. Will I need to use threads as well?

Thanks!!
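On the question of using threads as well: one common approach (an assumption here, nobody in the thread prescribes it) is to give each block several threads and compute the index as `blockIdx.x * blockDim.x + threadIdx.x`, as the first listing does. The host-only C++ sketch below models that index mapping with plain loops so it can be checked without a GPU. `simulate_launch` and `covers_exactly_once` are hypothetical names, and 256 threads per block is just an illustrative choice within the 1024 limit reported above.

```cpp
#include <cassert>
#include <vector>

// Host-only model of a 1D launch <<< n_blocks, threads_per_block >>>.
// Returns how many times each element index in [0, n) was visited.
std::vector<int> simulate_launch(int n, int n_blocks, int threads_per_block) {
    assert(n >= 0 && n_blocks >= 0 && threads_per_block > 0);
    std::vector<int> visits(n, 0);
    for (int block = 0; block < n_blocks; ++block) {                  // stands in for blockIdx.x
        for (int thread = 0; thread < threads_per_block; ++thread) {  // stands in for threadIdx.x
            int idx = block * threads_per_block + thread;             // blockIdx.x * blockDim.x + threadIdx.x
            if (idx < n) {                                            // skip the spare threads of the last block
                ++visits[idx];
            }
        }
    }
    return visits;
}

// True if every element was visited exactly once.
bool covers_exactly_once(const std::vector<int>& visits) {
    for (int v : visits) {
        if (v != 1) return false;
    }
    return true;
}
```

In this model, `simulate_launch(200000, 782, 256)` visits every element exactly once (782 is 200000 / 256 rounded up), while `simulate_launch(200000, 40000, 1)` leaves everything from index 40000 onward untouched, which matches what the revised program shows.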

KingBoo

September 1, 2012 3:02:30 PM

I understand the concept: split the array into 40k sections and feed them to the GPU. What I fail to understand is how you stop the GPU from calculating past the desired amount. For example, when the loop starts you are passing just "a_d" because i = 0, so the call would look like this (I got rid of N; I think it was an unnecessary remnant of something done previously):

double_array <<< 40000, 1 >>> (a_d);

Now that I think of it, this way SHOULD work, but it is inefficient. Correct me if I am wrong, but this way would start at 0 and go to N, then 40k to N, then 80k to N, and so on until the last pass, overwriting the previous incorrect calculations with correct ones.

Here is the output around the point where it stops working. I have no idea why it just stops working past 50k. Maybe you can make sense of it with the code and output provided.

Code:

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <iostream>
    #include <math.h>
    #include <stdio.h>
    #include <fstream>

    using namespace std;

    __global__ void double_array( float *a)
    {
        int idx = blockIdx.x;
        a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
    }

    int main( void )
    {
        float *a_h, *a_d;
        const int N = 200000;
        ofstream ofs("C:\\Test\\test.txt");

        a_h = (float *)malloc( N * sizeof(float) );
        cudaMalloc( &a_d, N * sizeof(float) );

        for ( int i = 0; i < N; i++ )
            a_h[i] = (float)i;
        cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );

        for (int i = 0; i < N / 40000 + 1; i++)
        {
            double_array <<< 40000, 1 >>> (a_d + i * 40000);
        }

        cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );

        for ( int i = 0; i < N; i++ )
        {
            ofs << i << " " << fixed << a_h[i] << endl;
        }

        free( a_h );
        cudaFree( a_d );
    }

Output:

    49990 1.000005
    49991 1.000005
    49992 1.000005
    49993 1.000005
    49994 1.000005
    49995 1.000005
    49996 1.000005
    49997 1.000005
    49998 1.000005
    49999 1.000005
    50000 1.000000
    50001 0.000000
    50002 1.000000
    50003 1.#QNAN0
    50004 1.000000
    50005 0.000000
    50006 1.000000
    50007 1.#QNAN0
    50008 1.000000
    50009 0.000000
    50010 1.000000
    50011 1.#QNAN0
    50012 1.000000
    50013 1.000000
    50014 1.#QNAN0
    50015 1.#QNAN0
    50016 0.000000
    50017 0.000000
    50018 0.000000
    50019 0.000000
    50020 0.000000
    50021 0.000000
    50022 0.000000
    50023 0.000000

I know I say it a lot but thank you so much, I really appreciate the time you are putting in to help me.
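One likely cause of the cutoff at exactly index 50000, offered as a hedged reading of the listing rather than something the thread confirms: the third argument of `cudaMemcpy` is a size in bytes, and the host-to-device copy passes `N` where the device-to-host copy passes `sizeof( float ) * N`. The host-only C++ sketch below works through that arithmetic, and also counts the passes of the chunk loop. All three helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// How many whole floats a copy of `byte_count` bytes covers.
std::size_t floats_covered(std::size_t byte_count) {
    return byte_count / sizeof(float);
}

// Number of passes made by `for (i = 0; i < n / chunk + 1; i++)`.
int passes_in_posted_loop(int n, int chunk) {
    assert(n >= 0 && chunk > 0);
    return n / chunk + 1;
}

// Offset (in elements) at which the last pass of that loop starts.
int last_pass_offset(int n, int chunk) {
    return (passes_in_posted_loop(n, chunk) - 1) * chunk;
}
```

Assuming 4-byte floats, `floats_covered(200000)` is 50000, so only the first 50000 elements would reach the device and the rest of the device array would hold whatever was in memory. `floats_covered(200000 * sizeof(float))` covers all 200000. Separately, `passes_in_posted_loop(200000, 40000)` is 6 and `last_pass_offset(200000, 40000)` is 200000, meaning the sixth pass would start at the end of the array. A bound of `N / 40000` would avoid that whenever N is a multiple of 40000.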

KingBoo

September 5, 2012 6:29:16 AM

But first, let's see if this helps. On this line:

double_array <<< 40000, 1 >>> (a_d + i * 40000);

the last < is underlined in red, and VS2010 says "Error: expected an expression".

mrmeister

December 9, 2012 12:36:45 AM

When kernel calls fail, nothing visible happens. The program doesn't exit with a failed status; the kernel just doesn't do anything. It's part of the failure model for GPUs, since graphics errors shouldn't bring down programs. If you ask for more blocks than the card can handle, or more threads, or do something else that terminates the kernel, it doesn't execute, so the array you fed it initially (a[i] = i in this case) is still in there. Hence 5061 -> 5061 instead of a sqrt. You're just reading the initial array back out, since nothing changed it.

Be careful using functions of N in your kernel call, since CUDA doesn't check that you asked for a sane number of blocks/threads at run time, and an invalid number will crash it (and sometimes the whole computer). Also, that many sqrts are going to go way past the precision of floats.
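Applied to the original listing, that explanation can be checked with plain arithmetic. The sketch below is host-only C++ and assumes the grid limit of 65535 from the device info posted earlier in the thread. `blocks_requested` and `launch_fits` are hypothetical helper names.

```cpp
#include <cassert>

// Max grid dimension in x, as reported for the GTX 460 v2 earlier in the thread.
const int kMaxGridDimX = 65535;

// Blocks requested by the n_blocks formula in the first listing.
int blocks_requested(int n, int block_size) {
    assert(n >= 0 && block_size > 0);
    return n / block_size + (n % block_size == 0 ? 0 : 1);
}

// True if a launch with that many blocks stays within the reported limit.
bool launch_fits(int n, int block_size) {
    return blocks_requested(n, block_size) <= kMaxGridDimX;
}
```

With `block_size = 4`, N = 200000 asks for 50000 blocks and fits, while N = 300000 asks for 75000 and does not, which would explain 6605 -> 6605. A larger block size such as 256 brings N = 300000 down to 1172 blocks. The runtime does record a rejected launch; calling `cudaGetLastError()` right after the `<<< >>>` line is the usual way to surface it.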

I have a newer card that can handle a bit more, and your code runs fine on mine.
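The precision remark can be tried on the CPU. The following is a minimal C++ sketch, assuming 21 applications of sqrt as in the loop suggested earlier in the thread; `repeated_sqrt` is a hypothetical helper, not part of the posted code.

```cpp
#include <cassert>
#include <cmath>

// Apply sqrt `times` times in single precision, mirroring the kernel's
// nested sqrt calls.
float repeated_sqrt(float x, int times) {
    assert(times >= 0);
    for (int i = 0; i < times; ++i) {
        x = std::sqrt(x);
    }
    return x;
}
```

`repeated_sqrt(6605.0f, 21)` lands within a few millionths above 1, in line with the 1.000004 quoted in the first post. Neighbouring inputs such as 6605 and 6606 end up closer together than a float can resolve near 1, so as a benchmark the kernel does real work, but its outputs say little about the inputs.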
