Use a CUDA Thread Index as a number - C++

I am new to CUDA and GPGPU programming. I am trying to check properties of a large set of numbers (larger than 32 bits) and I would like to do this using my Windows 7 64-bit machine equipped with an NVIDIA GTX 1080:
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1080"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8192 MBytes (8589934592 bytes)
(20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1734 MHz (1.73 GHz)
Memory Clock rate: 5005 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
When I run the following code, the value of "sum" is nonsensical (28, 20, etc.), even though I can see that threadId goes from 0 to 4095:
#include <cuda.h>
#include <cuda_runtime.h>
#include "device_launch_parameters.h"
#include "stdio.h"
__global__ void Simple(unsigned long long int *sum)
{
    unsigned long long int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
    unsigned long long int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                                      + (threadIdx.z * (blockDim.x * blockDim.y))
                                      + (threadIdx.y * blockDim.x)
                                      + threadIdx.x;
    printf("threadId = %llu.\n", threadId);
    // Check threadId for property. Possibly introduce a grid stride for loop to give each thread a range to check.
    sum[0]++;
}
int main(int argc, char **argv)
{
    unsigned long long int sum[] = { 0 };
    unsigned long long int *dev_sum;
    cudaMalloc((void**)&dev_sum, sizeof(unsigned long long int));
    cudaMemcpy(dev_sum, sum, sizeof(unsigned long long int), cudaMemcpyHostToDevice);
    dim3 grid(2, 1, 1);
    dim3 block(1024, 1, 1);
    printf("--------- Start kernel ---------\n\n");
    Simple <<< grid, block >>> (dev_sum);
    cudaDeviceSynchronize();
    cudaMemcpy(sum, dev_sum, sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
    printf("sum = %llu.\n", sum[0]);
    cudaFree(dev_sum);
    getchar();
    return 0;
}
How would I modify this kernel launch so that the maximum number of threads (with my setup) operate over a range of numbers, say 0 to 10^12, by adding a grid-stride loop?
dim3 grid(2, 1, 1);
dim3 block(1024, 1, 1);
Simple <<< grid, block >>> (dev_sum);

All threads increment the same memory location, which results in a race condition; that is why the outcome is incorrect. Use atomic addition to get the correct result (CUDA provides atomicAdd for exactly this).
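For illustration, here is a minimal sketch of what that could look like, combining atomicAdd with a grid-stride loop so that a fixed 2 x 1024 launch covers an arbitrary range such as 0 to 10^12. The names CountMatches, checkProperty, rangeStart and rangeEnd are hypothetical placeholders, not part of your code:

__device__ bool checkProperty(unsigned long long int n)
{
    // Hypothetical placeholder: put the real test for your property here.
    return (n % 2ULL) == 0ULL;
}

__global__ void CountMatches(unsigned long long int *sum,
                             unsigned long long int rangeStart,
                             unsigned long long int rangeEnd)
{
    // Global thread index and total number of threads in the grid.
    unsigned long long int tid    = blockIdx.x * (unsigned long long int)blockDim.x + threadIdx.x;
    unsigned long long int stride = (unsigned long long int)gridDim.x * blockDim.x;

    // Grid-stride loop: each thread checks tid, tid + stride, tid + 2*stride, ...
    // On a WDDM device you may need to split the range into several launches to
    // avoid the run-time watchdog limit reported by deviceQuery.
    for (unsigned long long int n = rangeStart + tid; n < rangeEnd; n += stride) {
        if (checkProperty(n)) {
            atomicAdd(sum, 1ULL);   // atomic increment avoids the race condition
        }
    }
}

// Launch example: 2 * 1024 threads cover the whole range via the stride.
// CountMatches <<< grid, block >>> (dev_sum, 0ULL, 1000000000000ULL);

Note that a single global counter updated with atomicAdd can become a bottleneck; a per-block reduction followed by one atomicAdd per block is a common refinement.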

N-body OpenCL code: error CL_OUT_OF_HOST_MEMORY with GPU card NVIDIA A6000 [closed]

I would like to run an old N-body code which uses OpenCL.
I have 2 NVIDIA A6000 cards with NVLink, a component which binds these 2 GPU cards together from a hardware (and maybe software?) point of view.
But at execution, I get errors.
Here is the kernel code used (I have added the pragma that I believe is useful for NVIDIA cards):
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel
void
nbody_sim(
__global double4* pos ,
__global double4* vel,
int numBodies,
double deltaTime,
double epsSqr,
__local double4* localPos,
__global double4* newPosition,
__global double4* newVelocity)
{
unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// Gravitational constant
double G_constant = 227.17085e-74;
// Number of tiles we need to iterate
unsigned int numTiles = numBodies / localSize;
// position of this work-item
double4 myPos = pos[gid];
double4 acc = (double4) (0.0f, 0.0f, 0.0f, 0.0f);
for(int i = 0; i < numTiles; ++i)
{
// load one tile into local memory
int idx = i * localSize + tid;
localPos[tid] = pos[idx];
// Synchronize to make sure data is available for processing
barrier(CLK_LOCAL_MEM_FENCE);
// Calculate acceleration effect due to each body
// a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
for(int j = 0; j < localSize; ++j)
{
// Calculate acceleration caused by particle j on particle i
double4 r = localPos[j] - myPos;
double distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
double invDist = 1.0f / sqrt(distSqr + epsSqr);
double invDistCube = invDist * invDist * invDist;
double s = G_constant * localPos[j].w * invDistCube;
// accumulate effect of all particles
acc += s * r;
}
// Synchronize so that next tile can be loaded
barrier(CLK_LOCAL_MEM_FENCE);
}
double4 oldVel = vel[gid];
// updated position and velocity
double4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
newPos.w = myPos.w;
double4 newVel = oldVel + acc * deltaTime;
// write to global memory
newPosition[gid] = newPos;
newVelocity[gid] = newVel;
}
The part of the code which sets up the kernel is below:
int NBody::setupCL()
{
cl_int status = CL_SUCCESS;
cl_event writeEvt1, writeEvt2;
// The block is to move the declaration of prop closer to its use
cl_command_queue_properties prop = 0;
commandQueue = clCreateCommandQueue(
context,
devices[current_device],
prop,
&status);
CHECK_OPENCL_ERROR( status, "clCreateCommandQueue failed.");
...
// create a CL program using the kernel source
const char *kernelName = "NBody_Kernels.cl";
FILE *fp = fopen(kernelName, "r");
if (!fp) {
fprintf(stderr, "Failed to load kernel.\n");
exit(1);
}
char *source = (char*)malloc(10000);
int sourceSize = fread( source, 1, 10000, fp);
fclose(fp);
// Create a program from the kernel source
program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);
// Build the program
status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
// get a kernel object handle for a kernel with the given name
kernel = clCreateKernel(
program,
"nbody_sim",
&status);
CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");
status = waitForEventAndRelease(&writeEvt1);
CHECK_ERROR(status, NBODY_SUCCESS, "WaitForEventAndRelease(writeEvt1) Failed");
status = waitForEventAndRelease(&writeEvt2);
CHECK_ERROR(status, NBODY_SUCCESS, "WaitForEventAndRelease(writeEvt2) Failed");
return NBODY_SUCCESS;
}
So, the error occurs at kernel creation. Is there a way to treat the 2 GPUs as a single GPU via the NVLink component? I mean from a software point of view?
How can I fix this kernel creation error?
Update 1
I) I have deliberately restricted the number of GPU devices to a single GPU by modifying the loop below (only one iteration remains):
// Print device index and device names
//for(cl_uint i = 0; i < deviceCount; ++i)
for(cl_uint i = 0; i < 1; ++i)
{
char deviceName[1024];
status = clGetDeviceInfo(deviceIds[i], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
CHECK_OPENCL_ERROR(status, "clGetDeviceInfo failed");
std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceIds[i]<< std::endl;
}
// Set id = 0 for currentDevice with deviceType
*currentDevice = 0;
free(deviceIds);
return NBODY_SUCCESS;
}
and afterwards making the usual call:
status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
But the error remains.
II) If I don't modify this loop and instead apply the suggested solution, i.e. pass devices[current_device] instead of devices, I get a compilation error like this:
In file included from NBody.hpp:8,
from NBody.cpp:1:
/opt/AMDAPPSDK-3.0/include/CL/cl.h:863:16: note: initializing argument 3 of ‘cl_int clBuildProgram(cl_program, cl_uint, _cl_device_id* const*, const char*, void (*)(cl_program, void*), void*)’
const cl_device_id * /* device_list */,
How can I work around this compilation issue?
Update 2
I have printed the value of the status variable in this portion of my code, and I get status = -44. From CL/cl.h, this corresponds to a CL_INVALID_PROGRAM error. Then, when I execute the application, kernel creation fails again.
I wonder if I am missing a special pragma in the kernel code, since I am using OpenCL on NVIDIA cards?
By the way, what is the type of the devices variable? I can't manage to print it correctly.
Update 3
I have added the following lines, but I still get the -44 error at execution. Instead of pasting all of the relevant code, here are links to the source file: http://31.207.36.11/NBody.cpp and the Makefile used for compilation: http://31.207.36.11/Makefile . Maybe someone will spot some errors, but mostly I would like to know why I get this -44 error.
Update 4
I am taking over this project.
Here is the result of the clinfo command:
$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 3.0 CUDA 11.4.94
Platform Name: NVIDIA CUDA
Platform Vendor: NVIDIA Corporation
Platform Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info
Platform Name: NVIDIA CUDA
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 10deh
Max compute units: 84
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 1
Native vector width short: 1
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1800Mhz
Address bits: 64
Max memory allocation: 12762480640
Image support: Yes
Max number of images read arguments: 256
Max number of images write arguments: 32
Max image 2D width: 32768
Max image 2D height: 32768
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 16384
Max samplers within kernel: 32
Max size of kernel argument: 4352
Alignment (bits) of base address: 4096
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 128
Cache size: 2408448
Global memory size: 51049922560
Constant buffer size: 65536
Max number of constant args: 9
Local memory type: Scratchpad
Local memory size: 49152
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1000
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x1e97440
Name: NVIDIA RTX A6000
Vendor: NVIDIA Corporation
Device OpenCL C version: OpenCL C 1.2
Driver version: 470.57.02
Profile: FULL_PROFILE
Version: OpenCL 3.0 CUDA
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 10deh
Max compute units: 84
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 64
Max work group size: 1024
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 1
Native vector width short: 1
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1800Mhz
Address bits: 64
Max memory allocation: 12762578944
Image support: Yes
Max number of images read arguments: 256
Max number of images write arguments: 32
Max image 2D width: 32768
Max image 2D height: 32768
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 16384
Max samplers within kernel: 32
Max size of kernel argument: 4352
Alignment (bits) of base address: 4096
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 128
Cache size: 2408448
Global memory size: 51050315776
Constant buffer size: 65536
Max number of constant args: 9
Local memory type: Scratchpad
Local memory size: 49152
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1000
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x1e97440
Name: NVIDIA RTX A6000
Vendor: NVIDIA Corporation
Device OpenCL C version: OpenCL C 1.2
Driver version: 470.57.02
Profile: FULL_PROFILE
Version: OpenCL 3.0 CUDA
Extensions: cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info
So I have one platform with 2 A6000 GPU cards.
Since I want to run the original version of my code (i.e. using a single GPU card), I have to select only one device ID in the source NBody.cpp (I will look later at how to handle the 2 GPU cards). So, I have just made the following modification in this source.
Instead of:
// Print device index and device names
for(cl_uint i = 0; i < deviceCount; ++i)
{
char deviceName[1024];
status = clGetDeviceInfo(deviceIds[i], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
CHECK_OPENCL_ERROR(status, "clGetDeviceInfo failed");
std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceIds[i]<< std::endl;
}
I did:
// Print device index and device names
//for(cl_uint i = 0; i < deviceCount; ++i)
for(cl_uint i = 0; i < 1; ++i)
{
char deviceName[1024];
status = clGetDeviceInfo(deviceIds[i], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
CHECK_OPENCL_ERROR(status, "clGetDeviceInfo failed");
std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceIds[i]<< std::endl;
}
As you can see, I have forced it to take into account only deviceIds[0], that is to say, a single GPU card.
Another critical point is the program-building part.
// create a CL program using the kernel source
const char *kernelName = "NBody_Kernels.cl";
FILE *fp = fopen(kernelName, "r");
if (!fp) {
fprintf(stderr, "Failed to load kernel.\n");
exit(1);
}
char *source = (char*)malloc(10000);
int sourceSize = fread( source, 1, 10000, fp);
fclose(fp);
// Create a program from the kernel source
program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);
// Build the program
//status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
status = clBuildProgram(program, 1, &devices[current_device], NULL, NULL, NULL);
printf("status1 = %d\n", status);
//printf("devices = %d\n", devices[current_device]);
// get a kernel object handle for a kernel with the given name
kernel = clCreateKernel(
program,
"nbody_sim",
&status);
printf("status2 = %d\n", status);
CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");
At execution, I get the following values for status1 and status2:
Selected Platform Vendor : NVIDIA Corporation
deviceCount = 2
Device 0 : NVIDIA RTX A6000 Device ID is 0x55c38207cdb0
status1 = -44
devices = -2113661720
status2 = -44
clCreateKernel failed.
clSetKernelArg failed. (updatedPos)
clEnqueueNDRangeKernel failed.
clEnqueueNDRangeKernel failed.
clEnqueueNDRangeKernel failed.
clEnqueueNDRangeKernel failed.
The first error is the failed kernel creation. Here is my NBody_Kernels.cl source:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel
void
nbody_sim(
__global double4* pos ,
__global double4* vel,
int numBodies,
double deltaTime,
double epsSqr,
__local double4* localPos,
__global double4* newPosition,
__global double4* newVelocity)
{
unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// Gravitational constant
double G_constant = 227.17085e-74;
// Number of tiles we need to iterate
unsigned int numTiles = numBodies / localSize;
// position of this work-item
double4 myPos = pos[gid];
double4 acc = (double4) (0.0f, 0.0f, 0.0f, 0.0f);
for(int i = 0; i < numTiles; ++i)
{
// load one tile into local memory
int idx = i * localSize + tid;
localPos[tid] = pos[idx];
// Synchronize to make sure data is available for processing
barrier(CLK_LOCAL_MEM_FENCE);
// Calculate acceleration effect due to each body
// a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
for(int j = 0; j < localSize; ++j)
{
// Calculate acceleration caused by particle j on particle i
double4 r = localPos[j] - myPos;
double distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
double invDist = 1.0f / sqrt(distSqr + epsSqr);
double invDistCube = invDist * invDist * invDist;
double s = G_constant * localPos[j].w * invDistCube;
// accumulate effect of all particles
acc += s * r;
}
// Synchronize so that next tile can be loaded
barrier(CLK_LOCAL_MEM_FENCE);
}
double4 oldVel = vel[gid];
// updated position and velocity
double4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
newPos.w = myPos.w;
double4 newVel = oldVel + acc * deltaTime;
// write to global memory
newPosition[gid] = newPos;
newVelocity[gid] = newVel;
}
The modified source can be found here:
last modified code
I don't know how to fix this kernel creation failure and the resulting values status1 = -44 and status2 = -44.
Update 5
I have added clGetProgramBuildInfo to the code, as in the following snippet, to see what's wrong with the clCreateKernel failed error:
// Create a program from the kernel source
program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);
if (clBuildProgram(program, 1, devices, NULL, NULL, NULL) != CL_SUCCESS)
{
// Determine the size of the log
size_t log_size;
clGetProgramBuildInfo(program, devices[current_device], CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
// Allocate memory for the log
char *log = (char *) malloc(log_size);
cout << "size log =" << log_size << endl;
// Get the log
clGetProgramBuildInfo(program, devices[current_device], CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
// Print the log
printf("%s\n", log);
}
// get a kernel object handle for a kernel with the given name
kernel = clCreateKernel(
program,
"nbody_sim",
&status);
CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");
Unfortunately, clGetProgramBuildInfo only gives the following output:
Selected Platform Vendor : NVIDIA Corporation
Device 0 : NVIDIA RTX A6000 Device ID is 0x562857930980
size log =16
log =
clCreateKernel failed.
How can I print the content of the build log?
Update 6
If I add a printf after:
// Create a program from the kernel source
program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);
printf("status clCreateProgramWithSourceContext = %d\n", status);
I get status = -6, which corresponds to CL_OUT_OF_HOST_MEMORY.
What are possible ways to fix this?
Partial solution
When compiling with the Intel compilers (icc and icpc), compilation works and the code runs fine. I don't understand why it doesn't work with the GNU gcc/g++-8 compiler. If someone has an idea...
Your kernel code looks good and the cache-tiling implementation is correct. Just make sure that the number of bodies is a multiple of the local size, or alternatively also limit the inner for loop to the global size.
OpenCL allows the use of multiple devices in parallel, but you need to create a thread with a queue for each device separately, and you also need to take care of device-to-device communication and synchronization manually. Data transfer happens over PCIe (you can also do remote direct memory access), but you can't use NVLink with OpenCL. This should not be an issue in your case though, as you need only little data transfer compared to the amount of arithmetic.
A few more remarks:
In many cases N-body requires FP64 to sum up the forces and resolve positions at very different length scales. However on the A6000, FP64 performance is very poor, just like on GeForce Ampere. FP32 would be significantly (~64x) faster, but is likely insufficient in terms of accuracy here. For efficient FP64 you would need an A100 or MI100.
Instead of 1.0/sqrt, use rsqrt. This is hardware supported and almost as fast as a multiplication.
Make sure to use either FP32 float (1.0f) or FP64 double (1.0) literals consistently. Using double literals with float variables triggers double arithmetic and casting of the result back to float, which is much slower.
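For illustration, the body of your inner j-loop could be written with rsqrt and pure double literals like this (a sketch based on the kernel above, not a tested drop-in replacement):

// Inside the j-loop of nbody_sim: hardware rsqrt and consistent double literals.
double4 r = localPos[j] - myPos;
double distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
double invDist = rsqrt(distSqr + epsSqr);            // replaces 1.0f / sqrt(...)
double invDistCube = invDist * invDist * invDist;
double s = G_constant * localPos[j].w * invDistCube;
acc += s * r;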
EDIT: To help you out with the error message: most probably the error at clCreateKernel (what value does status have after calling clCreateKernel?) hints that program is invalid. This might be because you give clBuildProgram a vector of 2 devices, but set the number of devices to only 1 and also have a context for only 1 device. Try
status = clBuildProgram(program, 1, &devices[current_device], NULL, NULL, NULL);
with only a single device.
To go multi-GPU, create two threads on the CPU that run NBody::setupCL() independently for GPUs 0 and 1, and then do synchronization manually.
EDIT 2:
I don't see anywhere that you create context. Without a valid context, program will be invalid, so clBuildProgram will throw error -44.
Call
context = clCreateContext(0, 1, &devices[current_device], NULL, NULL, NULL);
before you do anything with context.
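A minimal sketch of that single-device path, assuming devices[] and current_device are filled in by clGetDeviceIDs exactly as in your code, and printing the build log on failure so an invalid program shows its compiler output instead of just -44:

cl_int status = CL_SUCCESS;

// 1. Create a context for exactly one device.
context = clCreateContext(NULL, 1, &devices[current_device], NULL, NULL, &status);

// 2. Create the program from source and build it for that same single device.
program = clCreateProgramWithSource(context, 1, (const char **)&source,
                                    (const size_t *)&sourceSize, &status);
status = clBuildProgram(program, 1, &devices[current_device], NULL, NULL, NULL);

if (status != CL_SUCCESS) {
    // 3. On failure, print the compiler log instead of only the error code.
    size_t log_size = 0;
    clGetProgramBuildInfo(program, devices[current_device],
                          CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char *)malloc(log_size + 1);
    clGetProgramBuildInfo(program, devices[current_device],
                          CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    log[log_size] = '\0';
    printf("Build log:\n%s\n", log);
    free(log);
}

kernel = clCreateKernel(program, "nbody_sim", &status);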

CUDA periodic execution time

I just started learning CUDA and I am having trouble interpreting my experiment results. I wanted to compare CPU vs. GPU in a simple program that adds two vectors together. The code is as follows:
__global__ void add(int *a, int *b, int *c, long long n) {
long long tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < n) {
c[tid] = a[tid] + b[tid];
}
}
void add_cpu(int* a, int* b, int* c, long long n) {
for (long long i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
void check_results(int* gpu, int* cpu, long long n) {
for (long long i = 0; i < n; i++) {
if (gpu[i] != cpu[i]) {
printf("Different results!\n");
return;
}
}
}
int main(int argc, char* argv[]) {
long long n = atoll(argv[1]);
int num_of_blocks = atoi(argv[2]);
int num_of_threads = atoi(argv[3]);
int* a = new int[n];
int* b = new int[n];
int* c = new int[n];
int* c_cpu = new int[n];
int *dev_a, *dev_b, *dev_c;
cudaMalloc((void **) &dev_a, n * sizeof(int));
cudaMalloc((void **) &dev_b, n * sizeof(int));
cudaMalloc((void **) &dev_c, n * sizeof(int));
for (long long i = 0; i < n; i++) {
a[i] = i;
b[i] = i * 2;
}
cudaMemcpy(dev_a, a, n * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, n * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);
StopWatchInterface *timer=NULL;
sdkCreateTimer(&timer);
sdkResetTimer(&timer);
sdkStartTimer(&timer);
add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();
sdkStopTimer(&timer);
float time = sdkGetTimerValue(&timer);
sdkDeleteTimer(&timer);
cudaMemcpy(c, dev_c, n * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
clock_t start = clock();
add_cpu(a, b, c_cpu, n);
clock_t end = clock();
check_results(c, c_cpu, n);
printf("%f %f\n", (double)(end - start) * 1000 / CLOCKS_PER_SEC, time);
return 0;
}
I ran this code in a loop with a bash script:
for i in {1..2560}
do
n="$((1024 * i))"
out=`./vectors $n $i 1024`
echo "$i $out" >> "./vectors.txt"
done
Here 2560 is the maximum number of blocks that my GPU supports, and 1024 is the maximum number of threads per block. So I ran it with the maximum block size, up to the maximum problem size my GPU can handle, in steps of 1 block (1024 ints in the vector).
Here is my GPU info:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 2070 SUPER"
CUDA Driver Version / Runtime Version 11.3 / 11.0
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 8192 MBytes (8589934592 bytes)
(040) Multiprocessors, (064) CUDA Cores/MP: 2560 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 65536 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.3, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS
After running the experiment I gathered the results and plotted them.
What bothers me is this 256-block-wide period in the GPU execution time. I have no clue why this happens. Why is executing 512 blocks much slower than executing 513 blocks of threads?
I also checked this with a constant number of blocks (2560) as well as with different block sizes, and it always gives this period of 256 * 1024 in vector size (so for block size 512 it is every 512 blocks, not every 256 blocks). So maybe this is something to do with memory, but I can't figure out what.
I would appreciate any ideas on why this is happening.
This is by no means a complete or precise answer. However, I believe the periodic pattern you are observing is at least partly due to a one-time or first-time kernel launch overhead. Good benchmarking practice is usually to do something other than what you are doing, for example running the kernel multiple times and taking an average, or using some other kind of statistical measurement.
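For example, a sketch of such a measurement using CUDA events, with one warm-up launch and an average over several timed launches (NUM_RUNS is just an illustrative value; the other variable names are taken from your code):

// Warm-up launch: absorbs one-time costs (module load, JIT, caching effects).
add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();

const int NUM_RUNS = 10;           // illustrative value
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int run = 0; run < NUM_RUNS; run++) {
    add<<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float total_ms = 0.0f;
cudaEventElapsedTime(&total_ms, start, stop);   // milliseconds between the two events
printf("average kernel time: %f ms\n", total_ms / NUM_RUNS);

cudaEventDestroy(start);
cudaEventDestroy(stop);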
When I run your code using your script on a GTX 960 GPU, I get the following graph (only plotting the GPU data, vertical axis is in milliseconds):
When I modify your code as follows:
cudaMemcpy(dev_c, c, n * sizeof(int), cudaMemcpyHostToDevice);
// next two lines added:
add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();
StopWatchInterface *timer=NULL;
sdkCreateTimer(&timer);
sdkResetTimer(&timer);
sdkStartTimer(&timer);
add <<<num_of_blocks, num_of_threads>>>(dev_a, dev_b, dev_c, n);
cudaDeviceSynchronize();
Doing a "warm-up" run first, then timing the second run, I witness data like this:
So the data without the warm-up shows a periodicity. After the warm-up, the periodicity disappears. I conclude that the periodicity is due to some kind of 1-time or first-time behavior. Some typical things that might be in this category are caching effects and cuda "lazy" initialization effects (for example, the time taken to JIT-compile the GPU code, which is certainly happening in your case, or the time to load the GPU code into GPU memory). I won't be able to go farther with any explanation of what kind of first-time effect exactly is giving rise to the periodicity.
Another observation is that while my data shows an expected "average slope" to each graph, indicating that the kernel duration associated with 2560 blocks is approximately 5 times the kernel duration associated with 512 blocks, I don't see that kind of trend in your data. It ought to be there, however. Your GPU will "saturate" at about 40 blocks. Thereafter, the average kernel duration should increase in approximately a linear fashion, such that the kernel duration associated with 2560 blocks is 4-5x the kernel duration associated with 512 blocks. I can't explain your data in this respect at all; I suspect a graphing or data processing error, or else a characteristic in your environment (e.g. shared GPU with other users, broken CUDA install, etc.) that is not present in my environment, and which I'm unable to guess at.
Finally, my conclusion is that GPU "expected" behavior is more evident in the presence of good benchmarking techniques.

Why did my CUDA program become slower when using more than 128 threads per block?

I have a simple cuda application with the following code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <stdint.h>

__global__
void daxpy(int n, int a, int *x, int *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    y[i] = x[i];
    int j;
    for(j = 0; j < 1024*10000; ++j) {
        y[i] += j%10;
    }
}

// debug time
void calc_time(struct timeval *start, const char *msg) {
    struct timeval end;
    gettimeofday(&end, NULL);
    uint64_t us = end.tv_sec * 1000000 + end.tv_usec - (start->tv_sec * 1000000 + start->tv_usec);
    printf("%s cost us = %llu\n", msg, us);
    memcpy(start, &end, sizeof(struct timeval));
}

void do_test() {
    unsigned long n = 1536;
    int *x, *y, a, *dx, *dy;
    a = 2.0;
    x = (int*)malloc(sizeof(int)*n);
    y = (int*)malloc(sizeof(int)*n);
    for(unsigned long i = 0; i < n; ++i) {
        x[i] = i;
    }
    cudaMalloc((void**)&dx, n*sizeof(int));
    cudaMalloc((void**)&dy, n*sizeof(int));
    struct timeval start;
    gettimeofday(&start, NULL);
    cudaMemcpy(dx, x, n*sizeof(int), cudaMemcpyHostToDevice);
    daxpy<<<1, 512>>>(n, a, dx, dy); // this line
    cudaThreadSynchronize();
    cudaMemcpy(y, dy, n*sizeof(int), cudaMemcpyDeviceToHost);
    calc_time(&start, "do_test ");
    cudaFree(dx);
    cudaFree(dy);
    free(x);
    free(y);
}

int main() {
    do_test();
    return 0;
}
The GPU kernel call is daxpy<<<1, 512>>>(n, a, dx, dy), and I performed some tests using different block sizes:
daxpy<<<1, 32>>>(n, a, dx, dy)
daxpy<<<1, 64>>>(n, a, dx, dy)
daxpy<<<1, 128>>>(n, a, dx, dy)
daxpy<<<1, 129>>>(n, a, dx, dy)
daxpy<<<1, 512>>>(n, a, dx, dy)
... and made the following observations:
Execution time is the same for 32, 64, and 128 block sizes,
Execution time differs for block sizes 128 and 129, in particular:
For 128 the execution time is 280ms,
For 129 the execution time is 386ms.
I would like to ask what is causing the difference in execution time for block sizes 128 and 129.
My GPU is a Tesla K80:
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Clock rate: 824 MHz (0.82 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 135 / 0
After you provided the exact time differences in one of the comments, i.e.:
280ms for up to 128 threads,
386ms for 129+ threads,
I think this indirectly supports my theory that the issue is related to warp scheduling. Look at the GK210 whitepaper, the chip used in the K80:
K80 SMX features a quad warp scheduler, see section Quad Warp Scheduler,
It means that the K80 SMX is able to schedule up to 128 threads at once (4 warps == 128 threads); these are then executed simultaneously,
Therefore, for 129 threads, scheduling cannot happen at once, because SMX has to schedule 5 warps, i.e. scheduling will happen in two steps.
If the above is true, then I would expect:
The execution time to be roughly the same for block sizes 1 - 128,
The execution time to be roughly the same for block sizes 129 - 192.
192 is the number of cores on the SMX (see the whitepaper). As a reminder, entire blocks are always scheduled onto a single SMX, so obviously if you spawn more than 192 threads they won't all be able to execute in parallel, and execution time should be higher for 193+ threads.
You can verify the above thesis by simplifying your kernel code to the point where it does almost nothing, so it should be more or less obvious whether the execution takes longer only due to scheduling (there will be no other limiting factors such as memory throughput).
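For instance, a stripped-down kernel along these lines (just a sketch; one store per thread, no loop) should make the comparison between <<<1, 128>>> and <<<1, 129>>> depend almost entirely on scheduling:

// Nearly-empty kernel: a single store per thread and no loop, so any timing jump
// between 128 and 129 threads should be dominated by scheduling rather than
// arithmetic or memory throughput.
__global__ void almost_nothing(int *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    y[i] = i;
}

// almost_nothing<<<1, 128>>>(dy);
// almost_nothing<<<1, 129>>>(dy);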
Disclaimer: the above are just my assumptions, as I don't have access to a K80 or any other GPU with a quad warp scheduler, so I cannot profile your code properly. But anyway, I believe this is a task for you: why not use nvprof and profile your code yourself? Then you should be able to see where the time difference lies.

Measuring OpenCL kernel's memory throughput

I read about global memory optimization in OpenCL. In one of the slide-shows, a very simple kernel (below) has been used to demonstrate the importance of memory coalescing.
__kernel void measure(__global float* idata, __global float* odata, int offset) {
int xid = get_global_id(0) + offset;
odata[xid] = idata[xid];
}
Please see my code below, which measures the running time of the kernel:
ret = clFinish(command_queue);
size_t local_item_size = MAX_THREADS;
size_t global_item_size = INPUTSIZE;
struct timeval t0,t1;
gettimeofday(&t0, 0 );
//ret = clFinish(command_queue);
ret = clEnqueueNDRangeKernel(command_queue, measure, 1, NULL,
&global_item_size, &local_item_size, 0, NULL, NULL);
ret = clFlush(command_queue);
ret = clFinish(command_queue);
gettimeofday(&t1,0);
double elapsed = (t1.tv_sec-t0.tv_sec)*1000000 + (t1.tv_usec-t0.tv_usec);
printf("time taken = %lf microseconds\n", elapsed);
I transfer around 0.5 GB of data:
#define INPUTSIZE 1024 * 1024 * 128
int main (int argc, char *argv[])
{
int offset = atoi(argv[1]);
float* input = (float*) malloc(sizeof(float) * INPUTSIZE);
Now, the results are a bit random. With offset = 0, I get times as low as 21 usecs. With offset = 1, I get times ranging from 53 usecs to 24400 usecs.
Can someone please tell me what is going on? I thought that offset = 0 would be the fastest, because all the threads would access consecutive locations, and hence the minimum number of memory transactions would take place.
Bandwidth is a measure of how fast data can be transferred, and is typically measured in bytes/second in these situations (usually GB/s for GPU memory bandwidth).
To compute the bandwidth of a compute kernel, you just need to know how much data the kernel reads/writes from/to memory, and then divide that by the time your kernel took to execute.
Your example kernel has each work-item (or CUDA thread) read a single float, and write a single float. If you launch this kernel to copy 2^20 floats, then you will be reading 2^20 * sizeof(float) bytes, and writing the same amount (so 8MB in total). If this kernel takes 1ms to execute, then you have achieved a bandwidth of 8MB / 0.001s = 8GB/s.
Your new code snippet that shows your kernel timing approach indicates that you are only timing the kernel enqueue, not the amount of time it actually takes to run the kernel. This is why you are getting very low kernel timings (0.5GB / 0.007ms ~= 71TB/s!). You should add calls to clFinish() to obtain proper timing. I typically also take timings over several runs, to allow the device to warm-up, which usually gives more consistent timing:
// Warm-up run (not timed)
clEnqueueNDRangeKernel(command_queue, ...);
clFinish(command_queue);
// start timing
start = ...
for (int i = 0; i < NUM_RUNS; i++)
{
clEnqueueNDRangeKernel(command_queue, ...);
}
clFinish(command_queue);
// stop timing
end = ...
// Compute time taken, bandwidth etc
average_time = (end-start)/NUM_RUNS;
...
Question from comment:
Why does offset=0 perform better than offset=1,4 or 6?
On NVIDIA GPUs, work-items are grouped into 'warps' of size 32, which execute in lockstep (other devices have similar approaches, just with different sizes). Memory transactions are aligned to multiples of the cacheline size (e.g. 64 bytes, 128 bytes, etc.). Consider what happens when each work-item in a warp attempts to read a single 4-byte value (assuming they are contiguous, as per your example), with a cacheline size of 64 bytes.
This warp is reading a total of 128 bytes of data. If the start of this 128-byte chunk is aligned to a 64-byte boundary (i.e. if offset=0), then this can be serviced in two 64-byte transactions. However, if this chunk is not aligned to a 64-byte boundary (offset=1,4,6, etc.), then this will require three memory transactions to fetch all of the data. This is where your performance difference comes from.
If you set the offset to be a multiple of the cacheline size (e.g. 64), then you will likely get performance equivalent to offset=0.
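To turn such a timing into a bandwidth figure for your copy kernel, the arithmetic is just the total bytes moved divided by the elapsed time. A sketch using your INPUTSIZE and an elapsed value in microseconds (as produced by your gettimeofday code, ideally averaged over several runs as above):

// Each work-item reads one float and writes one float, so the kernel moves
// 2 * INPUTSIZE * sizeof(float) bytes in total.
double bytes_moved = 2.0 * INPUTSIZE * sizeof(float);
double seconds     = elapsed * 1e-6;                 // elapsed is in microseconds
double gb_per_sec  = bytes_moved / seconds / 1e9;
printf("effective bandwidth = %.2f GB/s\n", gb_per_sec);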

Improving asynchronous execution in CUDA

I am currently writing a programme that performs large simulations on the GPU using the CUDA API. In order to accelerate the performance, I tried to run my kernels simultaneously and then asynchronously copy the results back into host memory. The code looks roughly like this:
#define NSTREAMS 8
#define BLOCKDIMX 16
#define BLOCKDIMY 16
void domainUpdate(float* domain_cpu, // pointer to domain on host
float* domain_gpu, // pointer to domain on device
const unsigned int dimX,
const unsigned int dimY,
const unsigned int dimZ)
{
dim3 blocks((dimX + BLOCKDIMX - 1) / BLOCKDIMX, (dimY + BLOCKDIMY - 1) / BLOCKDIMY);
dim3 threads(BLOCKDIMX, BLOCKDIMY);
for (unsigned int ii = 0; ii < NSTREAMS; ++ii) {
updateDomain3D<<<blocks,threads, 0, streams[ii]>>>(domain_gpu,
dimX, 0, dimX - 1, // dimX, minX, maxX
dimY, 0, dimY - 1, // dimY, minY, maxY
dimZ, dimZ * ii / NSTREAMS, dimZ * (ii + 1) / NSTREAMS - 1); // dimZ, minZ, maxZ
unsigned int offset = dimX * dimY * dimZ * ii / NSTREAMS;
cudaMemcpyAsync(domain_cpu + offset ,
domain_gpu+ offset ,
sizeof(float) * dimX * dimY * dimZ / NSTREAMS,
cudaMemcpyDeviceToHost, streams[ii]);
}
cudaDeviceSynchronize();
}
All in all it is just a simple for-loop, looping over all streams (8 in this case) and dividing the work. This actually is a good deal faster (up to 30% performance gain), although maybe less than I had hoped. I analysed a typical cycle in Nvidia's Compute Visual Profiler, and the execution looks like this:
As can be seen in the picture, the kernels do overlap, although never more than two kernels are running at the same time. I tried the same thing for different numbers of streams and different sizes of the simulation domain, but this is always the case.
So my question is: is there a way to encourage/force the GPU scheduler to run more than two things at the same time? Or is this a limitation dependent on the GPU device that cannot be represented in the code?
My system specifications are: 64-bit Windows 7, and a GeForce GTX 670 graphics card (that's Kepler architecture, compute capability 3.0).
Kernels overlap only if the GPU has resources left to run a second kernel. Once the GPU is fully loaded, there is no gain from running more kernels in parallel, so the driver does not do that.
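If you want to check what your device reports, you can query its properties at runtime. This is only a sketch; concurrentKernels merely says whether overlap is possible at all, not how many kernels will actually run together:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("concurrentKernels   : %d\n", prop.concurrentKernels);   // 1 = kernels may overlap
printf("multiProcessorCount : %d\n", prop.multiProcessorCount); // the resource pool kernels share

// With BLOCKDIMX * BLOCKDIMY = 256 threads per block, each kernel in the loop above
// launches blocks.x * blocks.y blocks; once those blocks occupy all SMs, additional
// kernels have no free resources and the scheduler serializes them.

If each kernel's grid is large enough to fill the GPU on its own, overlap will only ever occur between the tail of one kernel and the head of the next, which matches the at-most-two-kernels pattern you describe in the profiler.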