CUDA Problem with const int (if it is >200000 it fails for some reason)
KingBoo
I want to compare the computing performance of my GTX 460 v2 against my CPU. The only way to do this easily is with sqrts (did I mention I love sqrts?)
Anyway, there is a const int that sets both the size of the array and the bound of the for loop. I am a complete CUDA newbie, but I do understand C++ to an extent (not so much vanilla C with its pointers).
I just can't wrap my head around why changing the const int from 200,000 to anything larger (e.g. 300,000) would change the result from, let's say, 6605 > 1.000004 to 6605 > 6605.
Thanks for your help
#include <stdio.h>
#include <iostream>
#include <math.h>
using namespace std;

// Kernel that executes on the CUDA device
__global__ void double_array( float *a, int N )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx < N )
        a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
}

// Main routine that executes on the host
int main( void )
{
    cudaDeviceProp prop; // Struct that contains device properties
    int dev; // Integer to save the device number
    cudaGetDevice(&dev); // Get the number of the device in use
    cudaGetDeviceProperties(&prop, dev); // Get the properties
    // Print the number and name of the CUDA device in use
    cout << "Cuda device: " << dev << " with name: "
         << prop.name << endl << endl;

    float *a_h, *a_d; // Pointer to host & device arrays
    const int N = 200000; // Number of elements in arrays
    size_t size = N * sizeof( float );
    a_h = (float *)malloc( size ); // Allocate array on host
    cudaMalloc( (void **)&a_d, size ); // Allocate array on device

    // Initialize host array and copy it to CUDA device
    for ( int i = 0; i < N; i++ )
        a_h[i] = (float)i;
    cudaMemcpy( a_d, a_h, size, cudaMemcpyHostToDevice );

    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N / block_size + ( N % block_size == 0 ? 0 : 1 );
    double_array <<< n_blocks, block_size >>> ( a_d, N );

    // Retrieve result from device and store it in host array
    cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );

    // Print results
    for ( int i = 0; i < N; i++ )
    {
        cout << i << " " << fixed << a_h[i] << endl;
    }

    // Free the memory on the host and the CUDA device
    free( a_h );
    cudaFree( a_d );
}

Wouldn't it be easier if you did this?
for (int i = 0; i < 21; i++) a[idx] = sqrt(a[idx]);
Did you try doing that code on the CPU?
Also, why are you doing this?
int n_blocks = N / block_size + ( N % block_size == 0 ? 0 : 1 );
Doesn't N already mean how many different members in the array you have got?
Lastly, checking this:
if (idx < N)
only matters for the last block, when N is not a multiple of the block size. If the error comes from there, it means you're calculating the index wrong.
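For anyone following along, here is a small sketch of what that block-count expression works out to. It is plain integer ceiling division. The helper names `blocks_needed` and `fits_in_grid` are made up for illustration, and 65535 is the per-dimension grid limit the GTX 460 reports further down in this thread (other cards may report a different value).

```cuda
// Hypothetical helper: smallest number of blocks such that
// blocks * block_size >= n (integer ceiling division).
// Gives the same result as N / block_size + (N % block_size == 0 ? 0 : 1).
int blocks_needed(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

// Hypothetical helper: true if that block count fits in one grid dimension.
// max_grid_x is an assumption supplied by the caller, e.g. 65535 on this card.
bool fits_in_grid(int n, int block_size, int max_grid_x)
{
    return blocks_needed(n, block_size) <= max_grid_x;
}
```

With block_size = 4, `blocks_needed(200000, 4)` is 50000, which fits, while `blocks_needed(300000, 4)` is 75000, which is over 65535. When n is not a multiple of the block size (say n = 10 with 4 threads per block, giving 3 blocks and 12 threads), the last block has spare threads, and those are the ones an `if (idx < N)` guard keeps from writing past the end of the array.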
Hi, sorry for the late response. I wanted to do some more learning before responding so that I have a better understanding and don't waste people's time.
Now I understand the concept of blocks and threads. I looked up my gfx card info (here it is):
General Information for device 0
Name: GeForce GTX 460 v2
Compute capability: 2.1
Clock rate: 1647000
Device copy overlap: Enabled
Kernel execition timeout : Enabled
Memory Information for device 0
Total global mem: 1073414144
Total constant Mem: 65536
Max mem pitch: 2147483647
Texture Alignment: 512
MP Information for device 0
Multiprocessor count: 7
Shared mem per mp: 49152
Registers per mp: 32768
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 64)
Max grid dimensions: (65535, 65535, 65535)
So I can have at most 65535 blocks and 1024 threads per block. For whatever reason I have to use fewer than these to get correct results (about 50k blocks and about 500 threads).
Here is the revised version that I actually understand now (the posted code was someone's example):
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <math.h>
#include <stdio.h>
using namespace std;

// Kernel that executes on the CUDA device
__global__ void double_array( float *a, int N )
{
    int idx = blockIdx.x;
    a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
}

// Main routine that executes on the host
int main( void )
{
    float *a_h, *a_d;
    const int N = 200000;
    a_h = (float *)malloc( N * sizeof(float) ); // Allocate array on host
    cudaMalloc( &a_d, N * sizeof(float) ); // Allocate array on device

    // Initialize host array and copy it to CUDA device
    for ( int i = 0; i < N; i++ )
        a_h[i] = (float)i;
    cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );

    double_array <<< 40000, 1 >>> ( a_d, N );

    // Retrieve result from device and store it in host array
    cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );

    // Print results
    for ( int i = 0; i < N; i++ )
    {
        cout << i << " " << fixed << a_h[i] << endl;
    }

    // Free the memory on the host and the CUDA device
    free( a_h );
    cudaFree( a_d );
}
I made it 40k blocks with 1 thread each. Past element 40k I get the same thing as before, which explains why it happened.
This leaves me with the problem of how to calculate all 200k elements correctly, though. Will I need to use threads as well?
Thanks!! 
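One way to cover all 200k elements is to use blocks and threads together, so the block count stays far below the 65535 limit listed above. What follows is a minimal sketch under stated assumptions, not anyone's final code from this thread: the names `sqrt_chain` and `run_sqrt_chain` and the block size of 256 are chosen purely for illustration, and the loop stands in for the 21 nested sqrt calls.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Each thread handles one element. The guard skips the spare threads
// in the last block when n is not a multiple of blockDim.x.
__global__ void sqrt_chain(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = a[idx];
        for (int i = 0; i < 21; i++)  // 21 successive square roots
            v = sqrtf(v);
        a[idx] = v;
    }
}

// Hypothetical host wrapper: copies n floats to the device, runs the
// kernel in place, and copies the results back into host_data.
cudaError_t run_sqrt_chain(float *host_data, int n)
{
    float *dev = NULL;
    size_t bytes = (size_t)n * sizeof(float);  // size in bytes, not elements

    cudaError_t err = cudaMalloc((void **)&dev, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host_data, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int block_size = 256;  // assumption: any value up to 1024 is allowed on this card
        int n_blocks = (n + block_size - 1) / block_size;  // 782 blocks for n = 200000
        sqrt_chain<<<n_blocks, block_size>>>(dev, n);
        err = cudaGetLastError();  // reports an invalid launch configuration
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host_data, dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err;
}
```

Calling `run_sqrt_chain(a_h, N)` from main in place of the malloc/copy/launch sequence should work for N = 200000 or 300000 alike, since the block count grows with N / 256 instead of N.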
I ran that and am getting the same errors past 50k.
I understand the concept: split the array into 40k sections and feed them to the GPU. What I fail to understand is how you are stopping the GPU from calculating past your desired amount. I.e., when the loop starts out you are passing just "a_d" because i = 0, so the call would look like this (I got rid of N; I think it was an unnecessary remnant of something done previously):
double_array <<< 40000, 1 >>> (a_d);
Now that I think of it, this way SHOULD work, but is inefficient. Correct me if I am wrong, but this way would start at 0 and go to N, then 40k to N, then 80k to N... until the last loop, overwriting the previous incorrect calculations with correct ones.
Here is the output from when it works to when it errs, no idea why it just stops working past 50k...Maybe you can make sense of it with the code and output provided.
Code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <fstream>
using namespace std;

__global__ void double_array( float *a)
{
    int idx = blockIdx.x;
    a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
}

int main( void )
{
    float *a_h, *a_d;
    const int N = 200000;
    ofstream ofs("C:\\Test\\test.txt");
    a_h = (float *)malloc( N * sizeof(float) );
    cudaMalloc( &a_d, N * sizeof(float) );
    for ( int i = 0; i < N; i++ )
        a_h[i] = (float)i;
    cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );
    for (int i = 0; i < N / 40000 + 1; i++)
    {
        double_array <<< 40000, 1 >>> (a_d + i * 40000);
    }
    cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );
    for ( int i = 0; i < N; i++ )
    {
        ofs << i << " " << fixed << a_h[i] << endl;
    }
    free( a_h );
    cudaFree( a_d );
}
Output:
49990 1.000005
49991 1.000005
49992 1.000005
49993 1.000005
49994 1.000005
49995 1.000005
49996 1.000005
49997 1.000005
49998 1.000005
49999 1.000005
50000 1.000000
50001 0.000000
50002 1.000000
50003 1.#QNAN0
50004 1.000000
50005 0.000000
50006 1.000000
50007 1.#QNAN0
50008 1.000000
50009 0.000000
50010 1.000000
50011 1.#QNAN0
50012 1.000000
50013 1.000000
50014 1.#QNAN0
50015 1.#QNAN0
50016 0.000000
50017 0.000000
50018 0.000000
50019 0.000000
50020 0.000000
50021 0.000000
50022 0.000000
50023 0.000000
I know I say it a lot but thank you so much, I really appreciate the time you are putting in to help me. 
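One thing that might explain the break at exactly 50000 in the output above: the third argument of `cudaMemcpy` is a size in bytes, and the host-to-device copy in the code passes `N` instead of `N * sizeof(float)`. With 4-byte floats, 200000 bytes covers only the first 50000 elements, so the rest of the device array would still hold whatever was in memory. The loop bound `N / 40000 + 1` also gives six launches for N = 200000, and the sixth starts at element 200000, which is past the end of the array. Below is a hedged sketch of both fixes; the helper names `elements_copied`, `sqrt_chunk` and `run_in_chunks` are invented for illustration.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// How many floats a copy of `bytes` bytes actually covers.
size_t elements_copied(size_t bytes)
{
    return bytes / sizeof(float);
}

// One block per element, as in the code above, plus a count so a
// chunk never touches elements beyond the ones it was given.
__global__ void sqrt_chunk(float *a, int count)
{
    int idx = blockIdx.x;
    if (idx < count) {
        float v = a[idx];
        for (int i = 0; i < 21; i++)  // 21 successive square roots
            v = sqrtf(v);
        a[idx] = v;
    }
}

// Hypothetical wrapper: processes n floats in chunks of at most `chunk` blocks.
cudaError_t run_in_chunks(float *host_data, int n, int chunk)
{
    float *dev = NULL;
    size_t bytes = (size_t)n * sizeof(float);  // bytes, not element count

    cudaError_t err = cudaMalloc((void **)&dev, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(dev, host_data, bytes, cudaMemcpyHostToDevice);
    for (int start = 0; err == cudaSuccess && start < n; start += chunk) {
        // The last chunk may be shorter than the others.
        int count = (n - start < chunk) ? (n - start) : chunk;
        sqrt_chunk<<<count, 1>>>(dev + start, count);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host_data, dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err;
}
```

Here `elements_copied(200000)` is 50000, which lines up with where the output goes wrong. One block per element with a single thread each keeps the structure of the posted code, but it leaves most of each multiprocessor idle, so it is more of a debugging step than a performance test.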
I read a bit on breakpoints and understand them a bit, but not enough to know where to put them in a cuda application. Can you explain where you would put it first and why?
But first, let's see if this helps. On this code:
double_array <<< 40000, 1 >>> (a_d + i * 40000);
the last < is underlined in red and says "Error: expected an expression" in VS2010.
I don't know if you're still following this, but I have some insight.
When a kernel launch fails, nothing visible happens. The program doesn't exit with a failed status; the kernel just doesn't run. It's part of the failure model for GPUs, since graphics errors shouldn't bring down programs. If you ask for more blocks or threads than the card can handle, or do something else that terminates the kernel, it doesn't execute, so the array you fed it initially (a[x] = x in this case) is still in there. Hence 5061 > 5061 instead of a sqrt: you're just reading the initial array back out since nothing changed it.
Be careful using functions of N in your kernel call, since CUDA won't stop the program when you ask for an invalid number of blocks/threads; the launch simply fails, and a misbehaving kernel can sometimes take the whole computer down with it. Also, that many sqrts are going to far exceed the precision of floats.
I have a newer card that can handle a bit more, and your code runs fine on mine.
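The silent failure described above can be made visible by asking the runtime for its error status right after a launch. Here is a minimal sketch using the CUDA runtime calls `cudaGetLastError`, `cudaDeviceSynchronize` and `cudaGetErrorString`; the helper name `check_launch` and the empty `noop` kernel are made up for the example.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Empty kernel, only here so there is something to launch.
__global__ void noop() {}

// Hypothetical helper: call right after a kernel launch.
// Returns true if the launch was accepted and the kernel ran to completion.
bool check_launch(const char *label)
{
    cudaError_t err = cudaGetLastError();   // launch errors, e.g. too many blocks or threads
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // errors raised while the kernel was running
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

For example, `noop<<<1, 1>>>(); check_launch("small launch");` should return true, while `noop<<<1, 2048>>>(); check_launch("big launch");` should print an error and return false on a card whose limit is 1024 threads per block. Putting a check like this after the launch in the original code would have shown the failure at N = 300000 directly, instead of the unchanged array.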