CUDA problem with const int (if it is > 200,000 it fails for some reason)

I want to compare the computing performance of my GTX 460 v2 against the CPU. The only easy way I know to do this is with sqrts (did I mention I love sqrts?).

Anyway, there is a const int that sets the size of the array as well as the bound of the for loop. I am a complete CUDA newbie, but I do understand C++ to an extent (not so much vanilla C with its pointers).

I just can't wrap my head around why changing the const int from 200,000 to anything larger (e.g. 300,000) would change a result like 6605 -> 1.000004 into 6605 -> 6605.

Thanks for your help :)


#include <stdio.h>
#include <iostream>
#include <math.h>
 
using namespace std;
 
// Kernel that executes on the CUDA device
__global__ void double_array( float *a, int N )
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if ( idx < N )
      a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
}
 
// Main routine that executes on the host
int main( void )
{
 
   cudaDeviceProp prop; // Struct that contains device properties
   int dev; // Integer to save the device number
 
   cudaGetDevice(&dev); // Get the number of the device in use
   cudaGetDeviceProperties(&prop, dev); // Get the properties
 
   // Print the number and name of the CUDA device in use
   cout << "Cuda device: " << dev << " with name: " 
      << prop.name << endl << endl;
 
   float *a_h, *a_d; // Pointer to host & device arrays
   const int N = 200000; // Number of elements in arrays
   size_t size = N * sizeof( float );
   a_h = (float *)malloc( size );    // Allocate array on host
   cudaMalloc( (void **)&a_d, size ); // Allocate array on device
 
   // Initialize host array and copy it to CUDA device
   for ( int i = 0; i < N; i++ )
      a_h[i] = (float)i;
   cudaMemcpy( a_d, a_h, size, cudaMemcpyHostToDevice );
 
   // Do calculation on device:
   int block_size = 4;
   int n_blocks   = N / block_size + ( N % block_size == 0 ? 0 : 1 );
   double_array <<< n_blocks, block_size >>> ( a_d, N );
 
   // Retrieve result from device and store it in host array
   cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );
 
   // Print results
   for ( int i = 0; i < N; i++ )
   {
      cout << i << " " << fixed << a_h[i] << endl;
   }
 
   // Free the memory on the host and the CUDA device
   free( a_h );
   cudaFree( a_d );
}
  1. Wouldn't it be easier if you did this?

    
    for (int i = 0; i < 21; i++)
        a[idx] = sqrt(a[idx]);
    


    Did you try doing that code on the CPU?
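    For reference, the CPU version of the same workload is just a loop over the array; a rough sketch, reusing the a_h and N from your code:

    // CPU equivalent of the kernel, for a rough timing comparison (sketch)
    for ( int i = 0; i < N; i++ )
        for ( int j = 0; j < 21; j++ )
            a_h[i] = sqrt( a_h[i] );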

    Also, why are you doing this?
    int n_blocks  = N / block_size + ( N % block_size == 0 ? 0 : 1 );

    Doesn't N already tell you how many elements the array has?

    Lastly, checking this:
    if (idx < N)

    is pretty much useless. If it's giving the error on that, it means you're calculating the index wrong.
  2. Hi, sorry for the late response. I wanted to do some more learning before replying so I'd have a better understanding and not waste people's time.

    Now I understand the concept of blocks and threads. I looked up my gfx card info (here it is):
    
     --- General Information for device 0 ---
    Name: GeForce GTX 460 v2
    Compute capability: 2.1
    Clock rate: 1647000
    Device copy overlap: Enabled
    Kernel execution timeout: Enabled
     --- Memory Information for device 0 ---
    Total global mem: 1073414144
    Total constant Mem: 65536
    Max mem pitch: 2147483647
    Texture Alignment: 512
     --- MP Information for device 0 ---
    Multiprocessor count: 7
    Shared mem per mp: 49152
    Registers per mp: 32768
    Threads in warp: 32
    Max threads per block: 1024
    Max thread dimensions: (1024, 1024, 64)
    Max grid dimensions: (65535, 65535, 65535)
    


    So I can have a max of 65,535 blocks and 1,024 threads per block. For whatever reason I have to use fewer than these to get correct calculations (about 50k blocks and about 500 threads).

    Here is the revised version, which I actually understand now (the code I posted originally was someone else's example):

    
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    
    #include <iostream>
    #include <math.h>
    #include <stdio.h>
    using namespace std;
    
    
    // Kernel that executes on the CUDA device
    __global__ void double_array( float *a, int N )
    {
       int idx = blockIdx.x;
       a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
    }
     
    // Main routine that executes on the host
    int main( void )
    {
       float *a_h, *a_d;
       const int N = 200000;
    
       a_h = (float *)malloc( N * sizeof(float) );    // Allocate array on host
       cudaMalloc( &a_d, N * sizeof(float) ); // Allocate array on device
     
       // Initialize host array and copy it to CUDA device
       for ( int i = 0; i < N; i++ )
          a_h[i] = (float)i;
    
       cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );
     
    
       double_array <<< 40000, 1 >>> ( a_d, N );
     
       // Retrieve result from device and store it in host array
       cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );
     
       // Print results
       for ( int i = 0; i < N; i++ )
       {
          cout << i << " " << fixed << a_h[i] << endl;
       }
     
       // Free the memory on the host and the CUDA device
       free( a_h );
       cudaFree( a_d );
    }
    
    



    I made it 40k blocks with 1 thread each. When the calculation goes past 40k I get the same thing as before, which explains why it happened.

    This still leaves me with the problem of how to calculate all 200k elements correctly, though. Will I need to use threads as well?
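    My guess, going by the original example, is a launch with many threads per block and the idx < N guard kept in the kernel. A sketch of what I mean (untested):

    // Sketch: cover all N = 200000 elements using 256 threads per block
    int threads = 256;                             // well under the 1024-threads-per-block limit
    int blocks  = ( N + threads - 1 ) / threads;   // ceiling division -> 782 blocks, well under 65535
    double_array <<< blocks, threads >>> ( a_d, N );
    // with the kernel again computing: int idx = blockIdx.x * blockDim.x + threadIdx.x;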

    Thanks!!
  3. Why don't you just loop?

    for (int i = 0; i < N / 40000 + 1; i++)
    { 
        double_array <<< 40000, 1 >>> (a_d + i * 40000, N);
    }
  4. I ran that and I'm getting the same post-50k errors.

    I understand the concept: split the array into 40k sections and feed them to the GPU. What I fail to understand is how you stop the GPU from calculating past the desired amount. I.e., when the loop starts out you are passing just "a_d" because i = 0, so the call would look like this (I got rid of N; I think it was an unnecessary remnant of something done previously):
    
    double_array <<< 40000, 1 >>> (a_d);
    


    Now that I think of it, this way SHOULD work, but it is inefficient. Correct me if I am wrong, but this way would start at 0 and go to N, then 40k to N, then 80k to N... until the last loop, overwriting the previous incorrect calculations with correct ones.

    Here is the output from where it works to where it errs; I have no idea why it just stops working past 50k... Maybe you can make sense of it with the code and output provided.

    Code:
    
    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    #include <iostream>
    #include <math.h>
    #include <stdio.h>
    #include <fstream>
    using namespace std;
    
    
    __global__ void double_array( float *a)
    {
    	int idx = blockIdx.x;
    	a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
    }
    
    int main( void )
    {
    	float *a_h, *a_d;
    	const int N = 200000;
    
    	ofstream ofs("C:\\Test\\test.txt");
    
    	a_h = (float *)malloc( N * sizeof(float) );
    	cudaMalloc( &a_d, N * sizeof(float) );
    
    	for ( int i = 0; i < N; i++ )
    		a_h[i] = (float)i;
    
    	cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );
    
    
    	for (int i = 0; i < N / 40000 + 1; i++)
    	{
    		double_array <<< 40000, 1 >>> (a_d + i * 40000);
    	}
    
    
    	cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );
    
    
    	for ( int i = 0; i < N; i++ )
    	{
    		ofs << i << " " << fixed << a_h[i] << endl;
    	}
    
    	free( a_h );
    	cudaFree( a_d );
    }
    

    Output:
    
    49990 1.000005
    49991 1.000005
    49992 1.000005
    49993 1.000005
    49994 1.000005
    49995 1.000005
    49996 1.000005
    49997 1.000005
    49998 1.000005
    49999 1.000005
    50000 1.000000
    50001 0.000000
    50002 1.000000
    50003 1.#QNAN0
    50004 1.000000
    50005 0.000000
    50006 1.000000
    50007 1.#QNAN0
    50008 1.000000
    50009 0.000000
    50010 1.000000
    50011 1.#QNAN0
    50012 1.000000
    50013 1.000000
    50014 1.#QNAN0
    50015 1.#QNAN0
    50016 0.000000
    50017 0.000000
    50018 0.000000
    50019 0.000000
    50020 0.000000
    50021 0.000000
    50022 0.000000
    50023 0.000000
    


    I know I say it a lot, but thank you so much; I really appreciate the time you are putting in to help me.
  5. No, it would not do 1 to N, then 40,000 to N, then 80,000 to N, etc. It would do 1 to 40,000, then 40,001 to 80,000, etc.
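    Each launch shifts the base pointer, so blockIdx.x indexes into a different 40k window of the array; roughly:

    // i = 0: a_d + 0      -> elements      0 .. 39999
    // i = 1: a_d + 40000  -> elements  40000 .. 79999
    // ...
    // i = 4: a_d + 160000 -> elements 160000 .. 199999
    double_array <<< 40000, 1 >>> ( a_d + i * 40000 );

    (Side note: N / 40000 + 1 is 6 iterations, so with the idx < N guard removed, the last launch would index past the end of the array.)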

    I've no idea why it doesn't work though. Did you try debugging with breakpoints?
  6. I read a bit about breakpoints and understand them somewhat, but not enough to know where to put them in a CUDA application. Can you explain where you would put one first, and why?

    But first, let's see if this helps. On this code
    
    double_array <<< 40000, 1 >>> (a_d + i * 40000);
    


    the last < is underlined in red and VS2010 says "Error: expected an expression".
  7. Sorry, I am not really familiar with CUDA C++, so I cannot help you on that side. Did you try compiling it? IntelliSense is sometimes inaccurate when it comes to errors.
  8. I don't know if you're still following this, but I have some insight.

    When a kernel launch fails, nothing happens. The program doesn't exit with a failure status; the kernel just doesn't do anything. That's part of the failure model for GPUs, since graphics errors shouldn't bring down programs. If you ask for more blocks than the card can handle, or more threads, or do anything else that kills the launch, the kernel never executes, so the array you fed it initially (a[i] = i in this case) is still in there. Hence 5061 -> 5061 instead of a sqrt result: you're just reading the initial array back out, since nothing changed it.

    Be careful using functions of N in your kernel call, since CUDA doesn't check that you asked for a sane number of blocks/threads, and an invalid number will make the launch fail (and sometimes take down the whole computer). Also, that many sqrts far exceed the precision of floats: applying sqrt 21 times computes x^(1/2^21) = x^(1/2097152), so e.g. 6605 -> about 1.000004, and the correct output approaches 1.0 for every element.

    I have a newer card that can handle a bit more, and your code runs fine on mine.
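    One more thing: you can make a failed launch visible instead of silently reading stale data back. A minimal sketch against the original code, using the runtime's error query:

    // Check the launch status right after the kernel call (minimal sketch)
    double_array <<< n_blocks, block_size >>> ( a_d, N );
    cudaError_t err = cudaGetLastError();  // reports invalid configurations, e.g. too many blocks
    if ( err != cudaSuccess )
        cout << "Kernel launch failed: " << cudaGetErrorString( err ) << endl;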