
CUDA problem with const int (if it is >200,000 it fails for some reason)

August 30, 2012 4:40:51 AM

I want to compare the computing performance of my GTX 460 v2 vs. my CPU. The only way I could think of to do this easily is sqrts (did I mention I love sqrts?).

Anyway, there is a const int signifying the size of the array as well as the for loop. I am a complete CUDA newbie, but I do understand C++ to an extent (not so much vanilla C with its pointers).

I just can't wrap my head around why changing the const int from 200,000 to anything larger (e.g. 300,000) would change a result from, let's say, 6605 -> 1.000004 to 6605 -> 6605.

Thanks for your help :) 

  #include <stdio.h>
  #include <iostream>
  #include <math.h>

  using namespace std;

  // Kernel that executes on the CUDA device
  __global__ void double_array( float *a, int N )
  {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if ( idx < N )
          a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
  }

  // Main routine that executes on the host
  int main( void )
  {
      cudaDeviceProp prop; // Struct that contains device properties
      int dev;             // Integer to save the device number

      cudaGetDevice( &dev );                 // Get the number of the device in use
      cudaGetDeviceProperties( &prop, dev ); // Get the properties

      // Print the number and name of the CUDA device in use
      cout << "Cuda device: " << dev << " with name: "
           << prop.name << endl << endl;

      float *a_h, *a_d;     // Pointers to host & device arrays
      const int N = 200000; // Number of elements in arrays
      size_t size = N * sizeof( float );
      a_h = (float *)malloc( size );     // Allocate array on host
      cudaMalloc( (void **)&a_d, size ); // Allocate array on device

      // Initialize host array and copy it to CUDA device
      for ( int i = 0; i < N; i++ )
          a_h[i] = (float)i;
      cudaMemcpy( a_d, a_h, size, cudaMemcpyHostToDevice );

      // Do calculation on device:
      int block_size = 4;
      int n_blocks = N / block_size + ( N % block_size == 0 ? 0 : 1 );
      double_array <<< n_blocks, block_size >>> ( a_d, N );

      // Retrieve result from device and store it in host array
      cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );

      // Print results
      for ( int i = 0; i < N; i++ )
      {
          cout << i << " " << fixed << a_h[i] << endl;
      }

      // Free the memory on the host and the CUDA device
      free( a_h );
      cudaFree( a_d );
  }
August 30, 2012 5:17:15 AM

Wouldn't it be easier if you did this?

  for ( int i = 0; i < 21; i++ )
      a[idx] = sqrt( a[idx] );


Did you try running that code on the CPU?

Also, why are you doing this?
  int n_blocks = N / block_size + ( N % block_size == 0 ? 0 : 1 );

Doesn't N already tell you how many members the array has?

Lastly, checking this:
  if ( idx < N )

is pretty much useless. If it's giving the error on that, it means you're calculating the index wrong.

September 1, 2012 2:49:15 AM

Hi, sorry for the late response. I wanted to do some more learning before responding so I'd have a better understanding and not waste people's time.

Now I understand the concept of blocks and threads. I looked up my graphics card's info (here it is):
  --- General Information for device 0 ---
  Name: GeForce GTX 460 v2
  Compute capability: 2.1
  Clock rate: 1647000
  Device copy overlap: Enabled
  Kernel execution timeout: Enabled
  --- Memory Information for device 0 ---
  Total global mem: 1073414144
  Total constant Mem: 65536
  Max mem pitch: 2147483647
  Texture Alignment: 512
  --- MP Information for device 0 ---
  Multiprocessor count: 7
  Shared mem per mp: 49152
  Registers per mp: 32768
  Threads in warp: 32
  Max threads per block: 1024
  Max thread dimensions: (1024, 1024, 64)
  Max grid dimensions: (65535, 65535, 65535)


So I can have a max of 65535 blocks and 1024 threads per block. For whatever reason I have to use fewer than these to get correct calculations (about 50k blocks and about 500 threads).

Here is the revised version, which I actually understand now (the posted code was someone else's example).

  #include "cuda_runtime.h"
  #include "device_launch_parameters.h"

  #include <iostream>
  #include <math.h>
  #include <stdio.h>
  using namespace std;

  // Kernel that executes on the CUDA device
  __global__ void double_array( float *a, int N )
  {
      int idx = blockIdx.x;
      a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
  }

  // Main routine that executes on the host
  int main( void )
  {
      float *a_h, *a_d;
      const int N = 200000;

      a_h = (float *)malloc( N * sizeof(float) ); // Allocate array on host
      cudaMalloc( &a_d, N * sizeof(float) );      // Allocate array on device

      // Initialize host array and copy it to CUDA device
      for ( int i = 0; i < N; i++ )
          a_h[i] = (float)i;

      cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );

      double_array <<< 40000, 1 >>> ( a_d, N );

      // Retrieve result from device and store it in host array
      cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );

      // Print results
      for ( int i = 0; i < N; i++ )
      {
          cout << i << " " << fixed << a_h[i] << endl;
      }

      // Free the memory on the host and the CUDA device
      free( a_h );
      cudaFree( a_d );
  }



I made it 40k blocks with 1 thread each. When the calculation goes past 40k I get the same thing as before, which explains why it happened.

This still leaves me with the problem of how to calculate 200k elements correctly. Will I need to use threads as well?
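My guess is yes: from what I've read, the usual pattern is to launch many threads per block and compute the index from both, with a bounds check for the final partial block. A sketch of how I understand it (untested; the 256-threads-per-block choice is arbitrary):

```cuda
// Each thread handles one element; the guard covers the last,
// partially filled block when N is not a multiple of blockDim.x.
__global__ void double_array( float *a, int N )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ( idx < N )
        a[idx] = sqrt( a[idx] );
}

// Host side: ceiling division keeps the grid small.
// int block_size = 256;
// int n_blocks = ( N + block_size - 1 ) / block_size; // 782 blocks for N = 200000
// double_array <<< n_blocks, block_size >>> ( a_d, N );
```

That would need only 782 blocks for 200k elements instead of 200k blocks.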

Thanks!!
September 1, 2012 4:18:21 AM

Why don't you just loop?

  for ( int i = 0; i < N / 40000 + 1; i++ )
  {
      double_array <<< 40000, 1 >>> ( a_d + i * 40000, N );
  }
September 1, 2012 3:02:30 PM

I ran that and I'm getting the same post-50k errors.

I understand the concept: split the array into 40k sections and feed them to the GPU. What I fail to understand is how you are stopping the GPU from calculating past the desired amount. I.e., when the loop starts out you are passing just a_d, because i = 0, so the call would look like this (I got rid of N; I think it was an unnecessary remnant of something done previously):

  double_array <<< 40000, 1 >>> ( a_d );


Now that I think of it, this SHOULD work, but it's inefficient. Correct me if I am wrong, but this way it would start at 0 and go to N, then 40k to N, then 80k to N... until the last loop, overwriting the previous incorrect calculations with correct ones.

Here is the output from where it works to where it errs; I have no idea why it just stops working past 50k. Maybe you can make sense of it with the code and output provided.

Code:
  #include "cuda_runtime.h"
  #include "device_launch_parameters.h"
  #include <iostream>
  #include <math.h>
  #include <stdio.h>
  #include <fstream>
  using namespace std;

  __global__ void double_array( float *a )
  {
      int idx = blockIdx.x;
      a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
  }

  int main( void )
  {
      float *a_h, *a_d;
      const int N = 200000;

      ofstream ofs( "C:\\Test\\test.txt" );

      a_h = (float *)malloc( N * sizeof(float) );
      cudaMalloc( &a_d, N * sizeof(float) );

      for ( int i = 0; i < N; i++ )
          a_h[i] = (float)i;

      cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );

      for ( int i = 0; i < N / 40000 + 1; i++ )
      {
          double_array <<< 40000, 1 >>> ( a_d + i * 40000 );
      }

      cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );

      for ( int i = 0; i < N; i++ )
      {
          ofs << i << " " << fixed << a_h[i] << endl;
      }

      free( a_h );
      cudaFree( a_d );
  }

Output:
  49990 1.000005
  49991 1.000005
  49992 1.000005
  49993 1.000005
  49994 1.000005
  49995 1.000005
  49996 1.000005
  49997 1.000005
  49998 1.000005
  49999 1.000005
  50000 1.000000
  50001 0.000000
  50002 1.000000
  50003 1.#QNAN0
  50004 1.000000
  50005 0.000000
  50006 1.000000
  50007 1.#QNAN0
  50008 1.000000
  50009 0.000000
  50010 1.000000
  50011 1.#QNAN0
  50012 1.000000
  50013 1.000000
  50014 1.#QNAN0
  50015 1.#QNAN0
  50016 0.000000
  50017 0.000000
  50018 0.000000
  50019 0.000000
  50020 0.000000
  50021 0.000000
  50022 0.000000
  50023 0.000000


I know I say it a lot, but thank you so much; I really appreciate the time you're putting in to help me.
September 1, 2012 4:21:53 PM

No, it would not do 1 to N, then 40000 to N, then 80000 to N, etc. It would do 1 to 40000, then 40001 to 80000, etc.

I've no idea why it doesn't work though. Did you try debugging with breakpoints?
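To make the ranges concrete: with the pointer offset a_d + i * 40000, block b of launch i touches element i * 40000 + b, so consecutive launches tile the array without overlap. A plain C++ simulation of that indexing, just to check the arithmetic (host-only; no GPU involved):

```cpp
#include <vector>

// True if `launches` launches of `chunk` blocks each, offset by
// launch * chunk, touch every one of n elements exactly once.
bool tiles_exactly( int n, int chunk, int launches )
{
    std::vector<int> hits( n, 0 );
    for ( int i = 0; i < launches; i++ )
        for ( int b = 0; b < chunk; b++ )
        {
            int idx = i * chunk + b; // element block b of launch i writes
            if ( idx >= n )
                return false;        // this launch ran past the allocation
            hits[idx]++;
        }
    for ( int idx = 0; idx < n; idx++ )
        if ( hits[idx] != 1 )
            return false;
    return true;
}
```

For N = 200000 and a 40000-element chunk, five launches tile the array exactly; note that the N / 40000 + 1 loop bound gives six, and the sixth launch would index past the end of the allocation.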
September 5, 2012 6:29:16 AM

I read a bit on breakpoints and understand them somewhat, but not enough to know where to put them in a CUDA application. Can you explain where you would put one first, and why?

But first, let's see if this helps. In this code:

  double_array <<< 40000, 1 >>> ( a_d + i * 40000 );

the last < is underlined in red with "Error: expected an expression" in VS2010.
September 5, 2012 8:48:32 AM

Sorry, I am not really familiar with CUDA C++, so I cannot help you on that side. Did you try compiling it? IntelliSense is sometimes inaccurate when it comes to errors.
December 9, 2012 12:36:45 AM

I don't know if you're still following this, but I have some insight.

When kernel calls fail, nothing happens. The program doesn't exit with a failed status; the kernel just doesn't do anything. It's part of the failure model for GPUs, since graphics errors shouldn't bring down programs. If you launch more blocks than the card can handle, or more threads, or do something else that terminates the kernel, it doesn't execute, so the array you fed it initially (a[i] = i in this case) is still in there. Hence 5061 -> 5061 instead of a sqrt: you're just reading the initial array back out, since nothing changed it.
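You can make those silent failures visible, though. A sketch of the usual post-launch checks (I'm going from the CUDA runtime API docs here, not from running the exact code above):

```cuda
double_array <<< n_blocks, block_size >>> ( a_d, N );

// Invalid launch configurations (too many blocks or threads)
// are reported immediately after the launch...
cudaError_t err = cudaGetLastError();
if ( err != cudaSuccess )
    printf( "Launch failed: %s\n", cudaGetErrorString( err ) );

// ...while errors during execution surface once the kernel
// has actually finished.
err = cudaDeviceSynchronize();
if ( err != cudaSuccess )
    printf( "Kernel failed: %s\n", cudaGetErrorString( err ) );
```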

Be careful using functions of N in your kernel call, since CUDA doesn't check that you asked for a sane number of blocks/threads, and an invalid number will crash it (and sometimes the whole computer). Also, that many sqrts are going to far exceed the precision of floats.
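To put a number on the precision point: 21 nested sqrts compute x^(1/2^21), which squashes every value in the array toward 1.0 — that's why the healthy part of the output above reads 1.000005 regardless of the input. A quick host-side check:

```cpp
#include <cmath>

// Apply sqrt n times; equivalent to x^(1 / 2^n).
float nested_sqrt( float x, int n )
{
    for ( int i = 0; i < n; i++ )
        x = std::sqrt( x );
    return x;
}
```

nested_sqrt( 50000.0f, 21 ) lands near 1.0000052, matching the 1.000005 lines in the output; with six decimal places printed, virtually every element of the array looks the same.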

I have a newer card that can handle a bit more, and your code runs fine on mine.