This is my first question, so I'll try to be as detailed as possible. I'm working on implementing a noise reduction algorithm in CUDA 6.5. My code is based on this MATLAB implementation: http://pastebin.com/HLVq48C1.
I'd love to use the new cuFFT Device Callbacks feature, but I'm stuck on cufftXtSetCallback. Every time, my cufftResult is CUFFT_NOT_IMPLEMENTED (14). Even the example provided by NVIDIA fails the same way...
My device callback testing code:
__device__ void noiseStampCallback(void *dataOut,
size_t offset,
cufftComplex element,
void *callerInfo,
void *sharedPointer) {
element.x = offset;
element.y = 2;
((cufftComplex*)dataOut)[offset] = element;
}
__device__ cufftCallbackStoreC noiseStampCallbackPtr = noiseStampCallback;
CUDA part of my code:
cufftHandle forwardFFTPlan;//RtC
//find how many windows there are
int batch = targetFile->getNbrOfNoiseWindows();
size_t worksize;
cufftCreate(&forwardFFTPlan);
cufftMakePlan1d(forwardFFTPlan, WINDOW, CUFFT_R2C, batch, &worksize); //WINDOW = 2048
//host memory, allocate
float *h_wave;
cufftComplex *h_complex_waveSpec;
unsigned int m_num_real_elems = batch*WINDOW*2;
h_wave = (float*)malloc(m_num_real_elems * sizeof(float));
h_complex_waveSpec = (cufftComplex*)malloc((m_num_real_elems/2+1)*sizeof(cufftComplex));
//init
memset(h_wave, 0, sizeof(float) * m_num_real_elems); //last window won't probably be full of file data, so fill memory with 0
memset(h_complex_waveSpec, 0, sizeof(cufftComplex) * (m_num_real_elems/2+1));
targetFile->getNoiseFile(h_wave); //fill h_wave with samples from sound file
//device memory, allocate, copy from host
float *d_wave;
cufftComplex *d_complex_waveSpec;
cudaMalloc((void**)&d_wave, m_num_real_elems * sizeof(float));
cudaMalloc((void**)&d_complex_waveSpec, (m_num_real_elems/2+1) * sizeof(cufftComplex));
cudaMemcpy(d_wave, h_wave, m_num_real_elems * sizeof(float), cudaMemcpyHostToDevice);
//prepare callback
cufftCallbackStoreC hostNoiseStampCallbackPtr;
cudaMemcpyFromSymbol(&hostNoiseStampCallbackPtr,
noiseStampCallbackPtr,
sizeof(hostNoiseStampCallbackPtr));
cufftResult status = cufftXtSetCallback(forwardFFTPlan,
(void **)&hostNoiseStampCallbackPtr,
CUFFT_CB_ST_COMPLEX,
NULL);
//always returns status 14 - CUFFT_NOT_IMPLEMENTED
//run forward plan
cufftResult result = cufftExecR2C(forwardFFTPlan, d_wave, d_complex_waveSpec);
//result seems to be okay without cufftXtSetCallback
I'm aware that I'm just a beginner in CUDA. My question is:
How can I call cufftXtSetCallback properly, and what is the cause of this error?
Referring to the documentation:
The callback API is available in the statically linked cuFFT library only, and only on 64 bit LINUX operating systems. Use of this API requires a current license. Free evaluation licenses are available for registered developers until 6/30/2015. To learn more please visit the cuFFT developer page.
I think you are getting the not implemented error because either you are not on a Linux 64 bit platform, or you are not explicitly linking against the CUFFT static library. The Makefile in the cufft callback sample will give the correct method to link.
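For reference, the build line ends up looking something like the following on 64-bit Linux - a hedged sketch only, with a placeholder file name and GPU architecture; the important parts are relocatable device code and linking against the static cufft_static plus culibos libraries instead of the shared cuFFT library:
nvcc -arch=sm_35 -rdc=true mycufftapp.cu -o mycufftapp -lcufft_static -lculibos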
Even if you fix that issue, you will likely run into a CUFFT_LICENSE_ERROR unless you have gotten one of the evaluation licenses.
Note that there are various device limitations as well for linking to the cufft static library. It should be possible to build a statically linked CUFFT application that will run on cc 2.0 and greater devices.
A newer (2019) possibility is cuFFT device extensions (cuFFTDx). As part of the Math Library Early Access program, these are device-side FFT functions that can be inlined into user kernels.
Announcement of cuFFTDx:
https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9240-cuda-new-features-and-beyond.pdf
Math Library Early Access:
https://developer.nvidia.com/cuda-math-library-early-access-program-page
Example Code:
https://github.com/mnicely/cufft_examples
I am implementing a file system on SPI flash memory using a w25qxx chip and an STM32F4xx in STM32CubeIDE. I have successfully created the basic I/O for the w25 over SPI and am able to write and read sectors at a time.
In my user_diskio.c I have implemented all of the needed I/O methods and have verified that they are properly linked and being called.
In my main.cpp I format the drive using f_mkfs(), then get the free space, and finally open and close a file. However, f_mkfs() keeps returning FR_MKFS_ABORTED. (FF_MAX_SS is set to 16384.)
fresult = FR_NO_FILESYSTEM;
if (fresult == FR_NO_FILESYSTEM)
{
BYTE work[FF_MAX_SS]; // Formats the drive if it has yet to be formatted
fresult = f_mkfs("0:", FM_ANY, 0, work, sizeof work);
}
f_getfree("", &fre_clust, &pfs);
total = (uint32_t)((pfs->n_fatent - 2) * pfs->csize * 0.5);
free_space = (uint32_t)(fre_clust * pfs->csize * 0.5);
fresult = f_open(&fil, "file67.txt", FA_OPEN_ALWAYS | FA_READ | FA_WRITE);
f_puts("This data is from the FILE1.txt. And it was written using ...f_puts... ", &fil);
fresult = f_close(&fil);
fresult = f_open(&fil, "file67.txt", FA_READ);
f_gets(buffer, f_size(&fil), &fil);
f_close(&fil);
Upon investigating my ff.c, it seems that the code is halting on line 5617:
if (fmt == FS_FAT12 && n_clst > MAX_FAT12) return FR_MKFS_ABORTED; /* Too many clusters for FAT12 */
n_clst is calculated a few lines up before some conditional logic, on line 5594:
n_clst = (sz_vol - sz_rsv - sz_fat * n_fats - sz_dir) / pau;
According to the values the debugger shows going into this calculation, n_clst ends up set to 4294935040; since the variable is unsigned, this is the wrap-around of what would be -32256 if it were signed. As you can imagine, this does not look like a correct result.
The device I am using has 16 Mbit (2 MB) of storage organized in 512 sectors of 4 KB each. The minimum erasable block size is 32 KB. If you need more info on the flash chip I am using, page 5 of this PDF outlines all of the specs.
This is what my USER_ioctl() looks like:
DRESULT USER_ioctl (
BYTE pdrv, /* Physical drive nmuber (0..) */
BYTE cmd, /* Control code */
void *buff /* Buffer to send/receive control data */
)
{
/* USER CODE BEGIN IOCTL */
UINT* result = (UINT*)buff;
HAL_GPIO_WritePin(GPIOE, GPIO_PIN_11, GPIO_PIN_SET);
switch (cmd) {
case GET_SECTOR_COUNT:
result[0] = 512; // Number of sectors on the device
return RES_OK;
case GET_SECTOR_SIZE:
result[0] = 4096;
return RES_OK;
case GET_BLOCK_SIZE:
result[0] = 32768;
return RES_OK;
}
return RES_ERROR;
/* USER CODE END IOCTL */
}
I have tried playing around with the parameters to f_mkfs(), swapping FM_ANY out for FM_FAT, FM_FAT32, and FM_EXFAT (along with enabling exFAT in my ffconf.h). I have also tried using several values for au rather than the default. For deeper documentation on the f_mkfs() method I am using, check here; there are a few variations of this method floating around out there.
Here:
fresult = f_mkfs("0:", FM_ANY, 0, work, sizeof work);
The second argument is not valid. It should be a pointer to a MKFS_PARM structure or NULL for default options, as described at http://elm-chan.org/fsw/ff/doc/mkfs.html.
You should have something like:
MKFS_PARM fmt_opt = {FM_ANY, 0, 0, 0, 0};
fresult = f_mkfs("0:", &fmt_opt, work, sizeof work);
except that it is unlikely for your media (SPI flash) that the default options are appropriate - the filesystem cannot obtain formatting parameters from the media as it would for an SD card, for example. You have to provide the necessary formatting information.
Given your erase block size I would guess:
MKFS_PARM fmt_opt = {FM_ANY, 0, 0, 0, 32768}; /* au_size = 32768 bytes, one erase block per cluster */
but to be clear I have never used the ELM FatFS (which STM32Cube incorporates) with SPI flash - there may be additional issues. I also do not use STM32CubeMX - it is possible I suppose that the version has a different interface, but I would recommend using the latest code from ELM rather than ST's possibly fossilised version.
Another consideration is that FatFs is not particularly suitable for your media due to wear-levelling issues. Also, ELM FatFs has no journalling or check/repair function, so it is not power-fail safe. That is particularly important for non-removable media that you cannot easily back up or repair.
You might consider a file system specifically designed for SPI NOR flash such as SPIFFS, or the power-fail safe LittleFS. Here is an example of LittleFS in STM32: https://uimeter.com/2018-04-12-Try-LittleFS-on-STM32-and-SPI-Flash/
OK, I think the real problem was that the GET_BLOCK_SIZE IOCTL call was returning the sector size instead of the block size expressed as a number of sectors, which is usually 1 for SPI flash.
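In terms of the USER_ioctl() shown above, a hedged sketch of what the corrected cases might look like (whether GET_BLOCK_SIZE should be 8 or 1 depends on whether you want the data area aligned to the 32 KB erase block; the types follow the usual FatFs diskio conventions):
case GET_SECTOR_COUNT:
    *(DWORD*)buff = 512;    /* 512 sectors x 4096 bytes = 2 MB */
    return RES_OK;
case GET_SECTOR_SIZE:
    *(WORD*)buff = 4096;    /* smallest erasable unit exposed to FatFs */
    return RES_OK;
case GET_BLOCK_SIZE:
    *(DWORD*)buff = 8;      /* erase block in sectors: 32768 / 4096 (use 1 if unknown) */
    return RES_OK;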
Is there a way to signal (success/failure) to the host at the end of kernel execution?
I am looking at an iterative process where calculations are made in device and after each iteration, a boolean variable is passed to host that tells if the process has converged. Based on the variable, host decides to either stop iterating or go through another round of iteration.
Copying a single boolean variable at the end of every iteration nullifies the time gain obtained through parallelization. Hence, I would like to find a way to let the host know the convergence status (success/failure) without having to call cudaMemcpy every time.
Note: the timing issue persists even after using pinned memory to transfer the data.
Alternatives that I have looked at:
1. asm("trap;") and assert(): these trigger, respectively, an unknown error and cudaErrorAssert on the host. Unfortunately, they are "sticky" in that the error cannot be cleared using cudaGetLastError; the only way is to reset the device using cudaDeviceReset().
2. Using cudaHostAllocMapped to avoid cudaMemcpy: this is of no use, as it does not offer any time-based advantage over standard pinned memory allocation + cudaMemcpy (p. 460, Multicore and GPU Programming: An Integrated Approach, Morgan Kaufmann, 2014).
Will appreciate other ways to overcome this issue.
I suspect the real issue here is that your iteration kernel run time is very short (on the order of 100us or less), meaning the work per iteration is very small. The best solution might be to try to increase the work per iteration (refactor your code/algorithm, tackle a larger problem, etc.)
However, here are some possibilities:
1. Use mapped/pinned memory. Your claim in item 2 of your question is unsupported, IMO, without a lot more context than a page reference to a book that many of us probably don't have available to look at. (A minimal sketch of this approach follows this list.)
2. Use dynamic parallelism. Move your kernel launch process to a CUDA parent kernel that issues child kernels. Whatever boolean is set by the child kernel will be immediately discoverable in the parent kernel, without any need for a cudaMemcpy operation or mapped/pinned memory.
3. Use a pipelined algorithm, and overlap a speculative kernel launch with the device->host copy of the boolean, for each pipeline stage.
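As a hedged sketch of item 1 (names like iteration_kernel, grid and block are placeholders, and mapped host memory support is assumed), the host reads the convergence flag directly from mapped pinned memory after each kernel completes, with no explicit cudaMemcpy:
// Minimal sketch, assuming the device supports mapped host memory
// (on older setups call cudaSetDeviceFlags(cudaDeviceMapHost) before any other CUDA work).
bool *h_stop, *d_stop;
cudaHostAlloc(&h_stop, sizeof(bool), cudaHostAllocMapped);  // pinned, device-mappable
cudaHostGetDevicePointer((void **)&d_stop, h_stop, 0);      // device alias of the same byte
*h_stop = false;
do {
    iteration_kernel<<<grid, block>>>(/* ..., */ d_stop);   // kernel sets *d_stop when converged
    cudaDeviceSynchronize();                                // after this, *h_stop is current
} while (!*h_stop);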
The first two items above are fairly straightforward (item 1 is sketched above), so I'll provide a worked example for item 3. The basic idea is that we will ping-pong between two streams, launching the kernel alternately into one stream and then the other. We will have a third stream so that we can overlap the device->host copy operations with the execution of the next launch. Because the D->H copy overlaps with kernel execution, there is effectively no "cost" for the copy operation; it is hidden by the kernel execution work.
Here's a fully worked example, plus an nvvp timeline:
$ cat t267.cu
#include <stdio.h>
const int stop_count = 5;
const long long tdelay = 1000000LL;
__global__ void test_kernel(int *icounter, bool *istop, int *ocounter, bool *ostop){
if (*istop) return;
long long start = clock64();
while (clock64() < tdelay+start);
int my_count = *icounter;
my_count++;
if (my_count >= stop_count) *ostop = true;
*ocounter = my_count;
}
int main(){
volatile bool *v_stop;
volatile int *v_counter;
bool *h_stop, *d_stop1, *d_stop2, *d_s1, *d_s2, *d_ss;
int *h_counter, *d_counter1, *d_counter2, *d_c1, *d_c2, *d_cs;
cudaStream_t s1, s2, s3, *sp1, *sp2, *sps;
cudaEvent_t e1, e2, *ep1, *ep2, *eps;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaStreamCreate(&s3);
cudaEventCreate(&e1);
cudaEventCreate(&e2);
cudaMalloc(&d_counter1, sizeof(int));
cudaMalloc(&d_stop1, sizeof(bool));
cudaMalloc(&d_counter2, sizeof(int));
cudaMalloc(&d_stop2, sizeof(bool));
cudaHostAlloc(&h_stop, sizeof(bool), cudaHostAllocDefault);
cudaHostAlloc(&h_counter, sizeof(int), cudaHostAllocDefault);
v_stop = h_stop;
v_counter = h_counter;
int n_counter = 1;
h_stop[0] = false;
h_counter[0] = 0;
cudaMemcpy(d_stop1, h_stop, sizeof(bool), cudaMemcpyHostToDevice);
cudaMemcpy(d_stop2, h_stop, sizeof(bool), cudaMemcpyHostToDevice);
cudaMemcpy(d_counter1, h_counter, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_counter2, h_counter, sizeof(int), cudaMemcpyHostToDevice);
sp1 = &s1;
sp2 = &s2;
ep1 = &e1;
ep2 = &e2;
d_c1 = d_counter1;
d_c2 = d_counter2;
d_s1 = d_stop1;
d_s2 = d_stop2;
test_kernel<<<1,1, 0, *sp1>>>(d_c1, d_s1, d_c2, d_s2);
cudaEventRecord(*ep1, *sp1);
cudaStreamWaitEvent(s3, *ep1, 0);
cudaMemcpyAsync(h_stop, d_s2, sizeof(bool), cudaMemcpyDeviceToHost, s3);
cudaMemcpyAsync(h_counter, d_c2, sizeof(int), cudaMemcpyDeviceToHost, s3);
while (v_stop[0] == false){
cudaStreamWaitEvent(*sp2, *ep1, 0);
sps = sp1; // ping-pong
sp1 = sp2;
sp2 = sps;
eps = ep1;
ep1 = ep2;
ep2 = eps;
d_cs = d_c1;
d_c1 = d_c2;
d_c2 = d_cs;
d_ss = d_s1;
d_s1 = d_s2;
d_s2 = d_ss;
test_kernel<<<1,1, 0, *sp1>>>(d_c1, d_s1, d_c2, d_s2);
cudaEventRecord(*ep1, *sp1);
while (n_counter > v_counter[0]);
n_counter++;
if(v_stop[0] == false){
cudaStreamWaitEvent(s3, *ep1, 0);
cudaMemcpyAsync(h_stop, d_s2, sizeof(bool), cudaMemcpyDeviceToHost, s3);
cudaMemcpyAsync(h_counter, d_c2, sizeof(int), cudaMemcpyDeviceToHost, s3);
}
}
cudaDeviceSynchronize(); // optional
printf("terminated at counter = %d\n", v_counter[0]);
}
$ nvcc -arch=sm_52 -o t267 t267.cu
$ ./t267
terminated at counter = 5
$
In the nvvp timeline, we can see that 5 kernel launches are evident (actually 6) and they are bouncing back and forth between two streams. (The 6th kernel launch, which we would expect from the code organization and pipelining, is a very short line at the end of stream15 above. This kernel launches but immediately sees that stop is true, so it exits.) The device->host copies are in a 3rd stream. If we zoom in closely at the handoff from one kernel iteration to the next, we see that even these very short D->H memcpy operations are essentially overlapped with the next kernel execution. For reference, the gap between kernel executions above is about 5 us.
Note that this was entirely done on linux. If you attempt this on windows WDDM, it may be difficult to achieve anything similar, due to WDDM command batching. Windows TCC should approximately duplicate linux behavior, however.
I am writing some code, and recently I found an error. The simplified version is shown below.
#include <stdio.h>
#include <cuda.h>
#define DEBUG 1
inline void check_cuda_errors(const char *filename, const int line_number)
{
#ifdef DEBUG
cudaThreadSynchronize();
cudaError_t error = cudaGetLastError();
if(error != cudaSuccess)
{
printf("CUDA error at %s:%i: %s\n", filename, line_number, cudaGetErrorString(error));
exit(-1);
}
#endif
}
__global__ void make_input_matrix_zp()
{
unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
printf("col: %d (%d*%d+%d) row: %d (%d*%d+%d) \n", col, blockIdx.x, blockDim.x, threadIdx.x, row, blockIdx.y, blockDim.y, threadIdx.y);
}
int main()
{
dim3 blockDim(16, 16, 1);
dim3 gridDim(6, 6, 1);
make_input_matrix_zp<<<gridDim, blockDim>>>();
//check_cuda_errors(__FILE__, __LINE__);
return 0;
}
The first inline function is for checking errors in CUDA.
The second, the kernel function, simply calculates the current thread's indices, stored in 'row' and 'col', and prints these values. I guess there is no problem in the inline function since it is from another, reliable source.
The problem is, when I run the program, it does not seem to execute the kernel function even though it is called in the main function. However, if I delete the comment notation '//' in front of
check_cuda_errors
the program does enter the kernel function and shows some values printed by printf. But it does not show the full combination of 'col' and 'row' indexes. In detail, blockIdx.y does not change much: it only shows values of 4 and 5, but not 0, 1, 2, 3.
The first thing I do not understand:
As far as I know, gridDim gives the dimensions of the grid of blocks. That means the block indices should cover the combinations (0,0)(0,1)(0,2)(0,3)(0,4)(0,5)(1,0)(1,1)(1,2)(1,3)... and so on, and the size of each block is 16 by 16. However, if you run this program, it does not show the full set of combinations. It just shows several combinations and then ends.
The second thing I do not understand:
Why does the kernel function depend on the function named 'check_cuda_errors'? When this function is present, the program at least runs, although imperfectly. However, when this error-checking function is commented out, the kernel function does not show any printed values.
This is very simple code, but I couldn't find the problem for several days. Is there anything that I missed, or is there something I misunderstand?
My working environment is like this.
"GeForce GT 630"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 2.1
Ubuntu 14.04
The CUDA GPU printf subsystem relies on a FIFO buffer to store printed output. If your output exceeds the size of the buffer, some or all of the previous content of the FIFO buffer will be overwritten by subsequent output. This is what will be happening in this case.
You can query and change the size of the buffer using the runtime API with cudaDeviceGetLimit and cudaDeviceSetLimit. If your device has the resources available to expand the limit, you should be able to see all the output your code emits.
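A hedged sketch of that query/adjustment (the factor of 8 is an arbitrary example); set the limit before launching the kernel whose output you want to keep:
size_t printf_fifo_size;
cudaDeviceGetLimit(&printf_fifo_size, cudaLimitPrintfFifoSize);     // current FIFO size in bytes
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * printf_fifo_size);  // ask for a larger FIFO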
As an aside, relying on the kernel printf feature for anything other than simple diagnostics or lightweight debugging is a terrible idea, and you have probably just proven to yourself that you should be looking at other methods of verifying the correctness of your code.
Regarding your second question, the printf buffer is flushed to output only when the host synchronizes with the device. For example, with a call to cudaDeviceSynchronize, cudaThreadSynchronize, cudaMemcpy, and others (see B.17.2 Limitations of the Formatted Output appendix).
When check_cuda_errors is uncommented, calling cudaThreadSynchronize is what triggers the buffer to be printed. When it is commented out, the main thread simply terminates before the kernel gets to run to completion, and nothing else happens.
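Putting the two points together, a minimal sketch of the corrected main() from the question - the only changes are synchronizing and checking for errors before returning:
int main()
{
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(6, 6, 1);
    make_input_matrix_zp<<<gridDim, blockDim>>>();
    cudaDeviceSynchronize();               // wait for the kernel and flush the printf buffer
    check_cuda_errors(__FILE__, __LINE__); // report any launch or execution error
    return 0;
}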
The basic problem was as follows:
When I run the kernel below with N threads and don't include the four lines that instantiate and populate the ScaledLLA variable, everything works fine.
When I run the kernel below with N threads and do include those four lines, the GPU locks up and Windows throws a "display driver not responding" error.
If I reduce the number of threads running by reducing the grid size, everything works fine.
I'm new to CUDA and have been incrementally building out some GIS functionality.
My host code looks like this at the kernel call:
MapperKernel << <g_CUDAControl->aGetGridSize(), g_CUDAControl->aGetBlockSize() >> >(g_Deltas.lat, g_Deltas.lon, 32.2,
g_DataReader->aGetMapper().aGetRPCBoundingBox()[0], g_DataReader->aGetMapper().aGetRPCBoundingBox()[1],
g_CUDAControl->aGetBlockSize().x,
g_CUDAControl->aGetThreadPitch(),
LLA_Offset,
LLA_ScaleFactor,
RPC_XN,RPC_XD,RPC_YN,RPC_YD,
Pixel_Offset, Pixel_ScaleFactor,
device_array);
cudaDeviceSynchronize(); //code crashes here
host_array = (point3D*)malloc(num_bytes);
cudaMemcpy(host_array, device_array, num_bytes, cudaMemcpyDeviceToHost);
The kernel that is being called looks like this:
__global__ void MapperKernel(double deltaLat, double deltaLon, double passedAlt,
double minLat, double minLon,
int threadsperblock,
int threadPitch,
point3D LLA_Offset,
point3D LLA_ScaleFactor,
double * RPC_XN, double * RPC_XD, double * RPC_YN, double * RPC_YD,
point2D pixelOffset, point2D pixelScaleFactor,
point3D * rValue)
{
//calculate thread's LLA
int latindex = threadIdx.x + blockIdx.x*threadsperblock;
int lonindex = threadIdx.y + blockIdx.y*threadsperblock;
point3D LLA;
LLA.lat = ((double)(latindex))*deltaLat + minLat;
LLA.lon = ((double)(lonindex))*deltaLon + minLon;
LLA.alt = passedAlt;
//scale threads LLA - adding these four lines is what causes the problem
point3D ScaledLLA;
ScaledLLA.lat = (LLA.lat - LLA_Offset.lat) * LLA_ScaleFactor.lat;
ScaledLLA.lon = (LLA.lon - LLA_Offset.lon) * LLA_ScaleFactor.lon;
ScaledLLA.alt = (LLA.alt - LLA_Offset.alt) * LLA_ScaleFactor.alt;
rValue[lonindex*threadPitch + latindex] = ScaledLLA; //if I assign LLA without calculating ScaledLLA everything works fine
}
If I assign LLA to rValue, then everything executes quickly and I get the expected behavior; however, when I add those four lines for ScaledLLA and try to assign it to rValue, CUDA takes too long for Windows's liking at the cudaDeviceSynchronize() call and I get a
"display driver not responding" error that then proceeds to reset the GPU. From looking around, the error appears to be a Windows thing that occurs when Windows believes the GPU isn't being responsive. I am certain that the kernel is running and performing the right calculations, because I have stepped through it with the NSight debugger.
Does anybody have a good explanation for why adding those four lines to the kernel would cause the execution time to spike?
I'm running Win7, VS 2013, and have NSight 4.5 installed.
For those who get here later via a search engine: it turns out the problem was the card running out of memory.
That should probably have been one of the first things to consider, since the problem occurred only after the instantiation was added.
The card only had so much memory (~2 GB) and my rValue buffer was taking up most (~1.5 GB) of it. With every thread trying to instantiate its own point3D variable, the card simply ran out of memory.
For those interested, NSight's profiler reported it as an unknown CUDA error.
The fix was to lower the number of threads running the kernel.
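As a hedged sketch (not part of the original fix), one way to catch this class of failure up front is to compare the requested allocation against the free device memory reported by cudaMemGetInfo before calling cudaMalloc:
size_t free_bytes, total_bytes;
cudaMemGetInfo(&free_bytes, &total_bytes);        // free/total device memory in bytes
if (num_bytes > free_bytes) {
    printf("Requested %llu bytes but only %llu bytes are free on the device\n",
           (unsigned long long)num_bytes, (unsigned long long)free_bytes);
    // reduce the grid size (and therefore the rValue buffer) before allocating
}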
What I mean is compiling code like this:
//*******************************************************************
// Demo OpenCL application to compute a simple vector addition
// computation between 2 arrays on the GPU
// ******************************************************************
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
// OpenCL source code
const char* OpenCLSource[] = {
"__kernel void VectorAdd(__global int* c, __global int* a,__global int* b)",
"{",
" // Index of the elements to add \n",
" unsigned int n = get_global_id(0);",
" // Sum the n’th element of vectors a and b and store in c \n",
" c[n] = a[n] + b[n];",
"}"
};
// Some interesting data for the vectors
int InitialData1[20] = {37,50,54,50,56,0,43,43,74,71,32,36,16,43,56,100,50,25,15,17};
int InitialData2[20] = {35,51,54,58,55,32,36,69,27,39,35,40,16,44,55,14,58,75,18,15};
// Number of elements in the vectors to be added
#define SIZE 2048
// Main function
// *********************************************************************
int main(int argc, char **argv)
{
// Two integer source vectors in Host memory
int HostVector1[SIZE], HostVector2[SIZE];
// Initialize with some interesting repeating data
for(int c = 0; c < SIZE; c++)
{
HostVector1[c] = InitialData1[c%20];
HostVector2[c] = InitialData2[c%20];
}
// Create a context to run OpenCL on our CUDA-enabled NVIDIA GPU
cl_context GPUContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
NULL, NULL, NULL);
// Get the list of GPU devices associated with this context
size_t ParmDataBytes;
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);
// Create a command-queue on the first GPU device
cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext,
GPUDevices[0], 0, NULL);
// Allocate GPU memory for source vectors AND initialize from CPU memory
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector1, NULL);
cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
CL_MEM_COPY_HOST_PTR, sizeof(int) * SIZE, HostVector2, NULL);
// Allocate output memory on GPU
cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
sizeof(int) * SIZE, NULL, NULL);
// Create OpenCL program with source code
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7,
OpenCLSource, NULL, NULL);
// Build the program (OpenCL JIT compilation)
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
// Create a handle to the compiled OpenCL function (Kernel)
cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "VectorAdd", NULL);
// In the next step we associate the GPU memory with the Kernel arguments
clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem),(void*)&GPUOutputVector);
clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector1);
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUVector2);
// Launch the Kernel on the GPU
size_t WorkSize[1] = {SIZE}; // one dimensional Range
clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
WorkSize, NULL, 0, NULL, NULL);
// Copy the output in GPU memory back to CPU memory
int HostOutputVector[SIZE];
clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
SIZE * sizeof(int), HostOutputVector, 0, NULL, NULL);
// Cleanup
free(GPUDevices);
clReleaseKernel(OpenCLVectorAdd);
clReleaseProgram(OpenCLProgram);
clReleaseCommandQueue(GPUCommandQueue);
clReleaseContext(GPUContext);
clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);
// Print out the results
for (int Rows = 0; Rows < (SIZE/20); Rows++, printf("\n")){
for(int c = 0; c <20; c++){
printf("%c",(char)HostOutputVector[Rows * 20 + c]);
}
}
return 0;
}
into an exe with VS (in my case 2008), can we be sure that wherever we run it, it will use the maximum of the PC's computing power? If not, how do we do that? (And how do we make it work with AMD cards and other non-CUDA GPUs, i.e. one program for all PCs?)
This code should run on any Windows PC with OpenCL capable GPU drivers.
However, to improve portability and possibly performance, you should be more careful about a few things.
You require a working OpenCL implementation. The program won't work at all if no OpenCL implementation is available; in that case the OpenCL DLL is missing.
You require a GPU implementation of OpenCL. It's a better idea to query all OpenCL platforms and devices and to prefer GPU devices.
You simply use the first available device, which might not be ideal. It's a better idea to determine the most powerful device via heuristics or to give the user a choice.
Furthermore, you are not using clCreateContextFromType correctly. On ICD-enabled OpenCL implementations you need to explicitly specify a platform ID (you can query the IDs with clGetPlatformIDs). All of today's OpenCL implementations use the ICD wrapper (well, except Apple's).
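A hedged sketch of that correction, using the single-platform case for brevity (a robust program would enumerate every platform with clGetPlatformIDs and pick one exposing GPU devices):
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);                        // take the first available platform
cl_context_properties props[] = {
    CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0  // bind the context to that platform
};
cl_context GPUContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU,
                                                NULL, NULL, NULL);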
By writing your program for OpenCL, it will run on any computer that supports your executable format and has an OpenCL driver. It might be GPU-accelerated and it might not, and it won't be quite as fast as hardware-specific machine code, but it should be fairly portable - if by portable you mean x86 CPU architecture and the Windows OS, with only the video card changing. For true portability, you'll need to distribute either source code or bytecode that is JIT-compiled - not C or C++ binaries.
In principle OpenCL code should run anywhere. But there are multiple versions of its implementation, 1.1 and 1.2 (the latter not yet on NVIDIA), with the 2.0 spec released and implementations due sometime. There are API incompatibilities between the OpenCL versions. AMD have provided a way for their OpenCL 1.2 implementation to use the 1.1 API calls.
There are important caveats, noted by other authors above. The OpenCL ICD mechanism (http://www.khronos.org/registry/cl/extensions/khr/cl_khr_icd.txt) supports running on different vendor platforms, but since there is no binary portability for compiled kernel code, you'll need to provide the kernel source code.
In the general case, check the OpenCL platform version and profile. Remember there can easily be multiple platforms and devices on a system. Then determine the minimum platform version and profile you can run on. If your code relies on specific device extensions, like double-precision floating point, you need to check for their presence on the device.
Use clGetPlatformInfo(...) to get the platform profile and version, and clGetDeviceInfo(...) to get the device-specific extensions.
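A hedged sketch of those queries, assuming the platform and GPUDevices variables from the snippets above and fixed-size buffers for brevity (a careful program would first query the required sizes):
char version[256], profile[256], extensions[2048];
clGetPlatformInfo(platform, CL_PLATFORM_VERSION, sizeof version, version, NULL);
clGetPlatformInfo(platform, CL_PLATFORM_PROFILE, sizeof profile, profile, NULL);
clGetDeviceInfo(GPUDevices[0], CL_DEVICE_EXTENSIONS, sizeof extensions, extensions, NULL);
// e.g. require the cl_khr_fp64 extension before relying on double precision (needs <string.h>)
if (strstr(extensions, "cl_khr_fp64") == NULL)
    printf("Device does not support double precision\n");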