CUDA problem with const int (if it is >200000 it fails for some reason)

KingBoo

Honorable
Aug 30, 2012
5
0
10,510
I want to compare the computing performance of my GTX 460 v2 against my CPU. The only way I know to do this easily is with sqrts (did I mention I love sqrts?).

Anyway, there is a const int that sets the size of the array as well as the bound of the for loop. I am a complete CUDA newbie, but I do understand C++ to an extent (not so much vanilla C with its pointers).

I just can't wrap my head around why changing the const int from 200,000 to anything larger (e.g. 300,000) would change the result for, let's say, 6605 from 6605 -> 1.000004 to 6605 -> 6605.

Thanks for your help :)

Code:
#include <stdio.h>
#include <iostream>
#include <math.h>
 
using namespace std;
 
// Kernel that executes on the CUDA device
__global__ void double_array( float *a, int N )
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if ( idx < N )
      a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
}
 
// Main routine that executes on the host
int main( void )
{
 
   cudaDeviceProp prop; // Struct that contains device properties
   int dev; // Integer to save the device number
 
   cudaGetDevice(&dev); // Get the number of the device in use
   cudaGetDeviceProperties(&prop, dev); // Get the properties
 
   // Print the number and name of the CUDA device in use
   cout << "Cuda device: " << dev << " with name: " 
      << prop.name << endl << endl;
 
   float *a_h, *a_d; // Pointer to host & device arrays
   const int N = 200000; // Number of elements in arrays
   size_t size = N * sizeof( float );
   a_h = (float *)malloc( size );    // Allocate array on host
   cudaMalloc( (void **)&a_d, size ); // Allocate array on device
 
   // Initialize host array and copy it to CUDA device
   for ( int i = 0; i < N; i++ )
      a_h[i] = (float)i;
   cudaMemcpy( a_d, a_h, size, cudaMemcpyHostToDevice );
 
   // Do calculation on device:
   int block_size = 4;
   int n_blocks   = N / block_size + ( N % block_size == 0 ? 0 : 1 );
   double_array <<< n_blocks, block_size >>> ( a_d, N );
 
   // Retrieve result from device and store it in host array
   cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );
 
   // Print results
   for ( int i = 0; i < N; i++ )
   {
      cout << i << " " << fixed << a_h[i] << endl;
   }
 
   // Free the memory on the host and the CUDA device
   free( a_h );
   cudaFree( a_d );
}
 

Sunius

Distinguished
Dec 19, 2010
390
0
19,060
Wouldn't it be easier if you did this?

Code:
for (int i = 0; i < 21; i++)
    a[idx] = sqrt(a[idx]);

Did you try doing that code on the CPU?

Also, why are you doing this?
Code:
int n_blocks  = N / block_size + ( N % block_size == 0 ? 0 : 1 );
Doesn't N already tell you how many elements the array has?

Lastly, checking this:
Code:
if (idx < N)
is pretty much useless. If it's giving the error on that, it means you're calculating the index wrong.
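
For what it's worth, the `n_blocks` expression rounds `N / block_size` up, so a final partial block is launched whenever N is not a multiple of the block size, and the `idx < N` test then keeps the surplus threads in that block from touching memory past the end of the array. A minimal host-side C++ sketch of that arithmetic (the function names `blocks_needed` and `surplus_threads` are made up for illustration and are not from the posted code):

```cpp
#include <cassert>

// Ceiling division: number of blocks needed so that
// n_blocks * block_size >= n. For n >= 0 this gives the same result as
// the posted N / block_size + (N % block_size == 0 ? 0 : 1).
int blocks_needed(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Threads launched beyond the end of the array; these are the ones
// the idx < N guard in the kernel has to turn away.
int surplus_threads(int n, int block_size)
{
    return blocks_needed(n, block_size) * block_size - n;
}
```

With the posted values, `blocks_needed(200000, 4)` is 50000 and there are no surplus threads. With N = 200001 it becomes 50001 blocks and 3 surplus threads, which is the case the guard is there for.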

 

KingBoo

Hi, sorry for the late response. I wanted to do some more learning before responding so that I have a better understanding and don't waste people's time.

Now I understand the concept of blocks and threads. I looked up my gfx card info (here it is):
Code:
 --- General Information for device 0 ---
Name: GeForce GTX 460 v2
Compute capability: 2.1
Clock rate: 1647000
Device copy overlap: Enabled
Kernel execition timeout : Enabled
 --- Memory Information for device 0 ---
Total global mem: 1073414144
Total constant Mem: 65536
Max mem pitch: 2147483647
Texture Alignment: 512
 --- MP Information for device 0 ---
Multiprocessor count: 7
Shared mem per mp: 49152
Registers per mp: 32768
Threads in warp: 32
Max threads per block: 1024
Max thread dimensions: (1024, 1024, 64)
Max grid dimensions: (65535, 65535, 65535)

So I can have at most 65535 blocks and 1024 threads per block. For whatever reason I have to use fewer than these to get correct results (about 50k blocks and about 500 threads).

Here is the revised version that I actually understand now (the code I posted before was someone else's example).

Code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <math.h>
#include <stdio.h>
using namespace std;


// Kernel that executes on the CUDA device
__global__ void double_array( float *a, int N )
{
   int idx = blockIdx.x;
   a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
}
 
// Main routine that executes on the host
int main( void )
{
   float *a_h, *a_d;
   const int N = 200000;

   a_h = (float *)malloc( N * sizeof(float) );    // Allocate array on host
   cudaMalloc( &a_d, N * sizeof(float) ); // Allocate array on device
 
   // Initialize host array and copy it to CUDA device
   for ( int i = 0; i < N; i++ )
      a_h[i] = (float)i;

   cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );
 

   double_array <<< 40000, 1 >>> ( a_d, N );
 
   // Retrieve result from device and store it in host array
   cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );
 
   // Print results
   for ( int i = 0; i < N; i++ )
   {
      cout << i << " " << fixed << a_h[i] << endl;
   }
 
   // Free the memory on the host and the CUDA device
   free( a_h );
   cudaFree( a_d );
}


I changed it to 40k blocks with 1 thread each. Past element 40k I get the same thing as before, which explains why it happened.

This leaves me with the problem of how to calculate all 200k elements correctly. Will I need to use threads as well?

Thanks!!
 

KingBoo

I ran that and am getting the same errors past 50k.

I understand the concept: split the array into 40k sections and feed them to the GPU. What I fail to understand is how you are stopping the GPU from calculating past your desired amount. That is, when the loop starts out you are passing just "a_d" because i = 0, so the call would look like this (I got rid of N; I think it was an unnecessary remnant of something done previously):
Code:
double_array <<< 40000, 1 >>> (a_d);

Now that I think of it, this way SHOULD work, but it is inefficient. Correct me if I am wrong, but this way would start at 0 and go to N, then 40k to N, then 80k to N, and so on until the last loop, overwriting the previous incorrect calculations with correct ones.

Here is the output around the point where it goes from working to failing. I have no idea why it just stops working past 50k. Maybe you can make sense of it with the code and output provided.

Code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <fstream>
using namespace std;


__global__ void double_array( float *a)
{
	int idx = blockIdx.x;
	a[idx] = sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(sqrt(a[idx])))))))))))))))))))));
}

int main( void )
{
	float *a_h, *a_d;
	const int N = 200000;

	ofstream ofs("C:\\Test\\test.txt");

	a_h = (float *)malloc( N * sizeof(float) );
	cudaMalloc( &a_d, N * sizeof(float) );

	for ( int i = 0; i < N; i++ )
		a_h[i] = (float)i;

	cudaMemcpy( a_d, a_h, N, cudaMemcpyHostToDevice );


	for (int i = 0; i < N / 40000 + 1; i++)
	{
		double_array <<< 40000, 1 >>> (a_d + i * 40000);
	}


	cudaMemcpy( a_h, a_d, sizeof( float ) * N, cudaMemcpyDeviceToHost );


	for ( int i = 0; i < N; i++ )
	{
		ofs << i << " " << fixed << a_h[i] << endl;
	}

	free( a_h );
	cudaFree( a_d );
}
Output:
Code:
49990 1.000005
49991 1.000005
49992 1.000005
49993 1.000005
49994 1.000005
49995 1.000005
49996 1.000005
49997 1.000005
49998 1.000005
49999 1.000005
50000 1.000000
50001 0.000000
50002 1.000000
50003 1.#QNAN0
50004 1.000000
50005 0.000000
50006 1.000000
50007 1.#QNAN0
50008 1.000000
50009 0.000000
50010 1.000000
50011 1.#QNAN0
50012 1.000000
50013 1.000000
50014 1.#QNAN0
50015 1.#QNAN0
50016 0.000000
50017 0.000000
50018 0.000000
50019 0.000000
50020 0.000000
50021 0.000000
50022 0.000000
50023 0.000000
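
One thing that may explain a cutoff at exactly 50000: the third argument of `cudaMemcpy` is a size in bytes, and the host-to-device copy in the code above passes `N` rather than `N * sizeof(float)`. Assuming a 4-byte `float`, 200000 bytes cover only the first 50000 elements, so everything after that would be whatever happened to be in device memory. A small host-side C++ sketch of the arithmetic (the helpers `floats_covered` and `bytes_for_floats` are hypothetical names used only for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// This sketch assumes the usual 4-byte float.
static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");

// Number of whole float elements that fit in a copy of `bytes` bytes.
std::size_t floats_covered(std::size_t bytes)
{
    return bytes / sizeof(float);
}

// Byte count needed to copy n float elements.
std::size_t bytes_for_floats(std::size_t n)
{
    // Guard against overflow of the multiplication.
    assert(n <= std::numeric_limits<std::size_t>::max() / sizeof(float));
    return n * sizeof(float);
}
```

Here `floats_covered(200000)` is 50000, while `bytes_for_floats(200000)` is 800000, which is the size the device-to-host copy in the posted code already passes as `sizeof( float ) * N`.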

I know I say it a lot but thank you so much, I really appreciate the time you are putting in to help me.
 

Sunius

No, it would not go 0 to N, then 40000 to N, then 80000 to N, etc. It would go 0 to 39999, then 40000 to 79999, etc.
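
To make the ranges concrete: indices are zero-based, so with a chunk of 40000 the launches cover 0 to 39999, 40000 to 79999, and so on. One detail worth checking is the loop bound `N / 40000 + 1`, which gives 6 iterations for N = 200000, and the sixth would start at offset 200000, one past the end of the array. A host-side C++ sketch that lists the chunks with the last one clipped to N (the `Chunk` struct and `make_chunks` are illustrative names, not part of the posted code):

```cpp
#include <cassert>
#include <vector>

// One kernel launch worth of work: elements [offset, offset + count).
struct Chunk {
    int offset;
    int count;
};

// Split n elements into consecutive chunks of at most chunk_size,
// never starting at or running past n.
std::vector<Chunk> make_chunks(int n, int chunk_size)
{
    assert(n >= 0 && chunk_size > 0);
    std::vector<Chunk> chunks;
    for (int offset = 0; offset < n; offset += chunk_size) {
        int remaining = n - offset;
        int count = remaining < chunk_size ? remaining : chunk_size;
        chunks.push_back({offset, count});
    }
    return chunks;
}
```

`make_chunks(200000, 40000)` yields 5 chunks, the last one starting at 160000. Each launch in the posted loop would then presumably take the form `double_array <<< c.count, 1 >>> (a_d + c.offset)` for a chunk `c`.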

I've no idea why it doesn't work though. Did you try debugging with breakpoints?
 

KingBoo

I read up a bit on breakpoints and understand them somewhat, but not enough to know where to put them in a CUDA application. Can you explain where you would put the first one and why?

But first, let's see if this helps. On this line:
Code:
double_array <<< 40000, 1 >>> (a_d + i * 40000);

The last < is underlined in red, and VS2010 says "Error: expected an expression".
 

mrmeister

Honorable
Dec 8, 2012
1
0
10,510
I don't know if you're still following this, but I have some insight.

When kernel calls fail, nothing happens. The program doesn't exit with a failed status; the kernel just doesn't do anything. It's part of the failure state for GPUs, since graphics errors shouldn't bring down programs. If you ask for more blocks or threads than the card can handle, or do something else that terminates the kernel, it doesn't execute, so the array you fed it initially (a[x] = x in this case) is still in there. Hence 5061 -> 5061 instead of a sqrt. You're just reading the initial array back out since nothing changed it.
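
That would also fit the original question. The device listing earlier in the thread reports a maximum grid dimension of 65535, and the first program used a block size of 4, so N = 200000 needs 50000 blocks while N = 300000 needs 75000, which is over the limit. A hedged host-side C++ sketch of a check one could run before launching (`launch_fits` is a made-up helper; 65535 is taken from the listing above, and a real program would presumably read the limit from `prop.maxGridSize[0]` instead of hard-coding it):

```cpp
#include <cassert>

// Returns true when n elements at block_size threads per block need no
// more blocks than max_grid_x (65535 on the card in this thread).
bool launch_fits(long long n, int block_size, int max_grid_x)
{
    assert(n >= 0 && block_size > 0 && max_grid_x > 0);
    long long n_blocks = (n + block_size - 1) / block_size;  // round up
    return n_blocks <= max_grid_x;
}
```

`launch_fits(200000, 4, 65535)` is true and `launch_fits(300000, 4, 65535)` is false, whereas 256 threads per block would need only 1172 blocks for 300000 elements. On the CUDA side, calling `cudaGetLastError()` right after the launch should return something other than `cudaSuccess` when the configuration is rejected, which is one way to notice the failure instead of reading back unchanged data.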

Be careful using functions of N in your kernel call, since CUDA doesn't check to make sure you asked for a sane number of blocks/threads when running, and an invalid number will crash it (and sometimes the whole computer). Also, that many sqrts are going to far exceed the precision of floats.
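
On the precision point: 21 nested square roots amount to raising the input to the power 1/2^21, so every positive input ends up extremely close to 1 (for 6605 the exact value is about 1.000004), and a `float` has only about 7 significant decimal digits with which to tell the results apart. A small host-side C++ sketch that could be used to compare against the GPU output (`repeated_sqrt` is an illustrative name, not from the posted code):

```cpp
#include <cassert>
#include <cmath>

// Apply sqrt `times` times in single precision, like the kernel's
// nested sqrt calls.
float repeated_sqrt(float x, int times)
{
    assert(x >= 0.0f && times >= 0);
    for (int i = 0; i < times; ++i)
        x = std::sqrt(x);
    return x;
}
```

Calling `repeated_sqrt(6605.0f, 21)` on the CPU should land within a few units in the last place of the 1.000004 reported for the GPU, so the benchmark mostly measures throughput rather than producing values that are interesting to compare.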

I have a newer card that can handle a bit more, and your code runs fine on mine.