For the purpose of testing printf() call on device, I wrote a simple program which copies an array of moderate size to device and print the value of device array to screen. Although the array is correctly copied to device, the printf() function does not work correctly, which lost the first several hundred numbers. The array size in the code is 4096. Is this a bug or I'm not using this function properly? Thanks in adavnce.
EDIT: My gpu is GeForce GTX 550i, with compute capability 2.1
My code:
#include<stdio.h>
#include<stdlib.h>
#define N 4096
__global__ void Printcell(float *d_Array , int n){
int k = 0;
printf("\n=========== data of d_Array on device==============\n");
for( k = 0; k < n; k++ ){
printf("%f ", d_Array[k]);
if((k+1)%6 == 0) printf("\n");
}
printf("\n\nTotally %d elements has been printed", k);
}
int main(){
int i =0;
float Array[N] = {0}, rArray[N] = {0};
float *d_Array;
for(i=0;i<N;i++)
Array[i] = i;
cudaMalloc((void**)&d_Array, N*sizeof(float));
cudaMemcpy(d_Array, Array, N*sizeof(float), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
Printcell<<<1,1>>>(d_Array, N); //Print the device array by a kernel
cudaDeviceSynchronize();
/* Copy the device array back to host to see if it was correctly copied */
cudaMemcpy(rArray, d_Array, N*sizeof(float), cudaMemcpyDeviceToHost);
printf("\n\n");
for(i=0;i<N;i++){
printf("%f ", rArray[i]);
if((i+1)%6 == 0) printf("\n");
}
}
printf from the device has a limited queue. It's intended for small scale debug-style output, not large scale output.
referring to the programmer's guide:
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten.
Your in-kernel printf output overran the buffer, and so the first printed elements were lost (overwritten) before the buffer was dumped into the standard I/O queue.
The linked documentation indicates that the buffer size can be increased, also.
Related
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 10 months ago.
Improve this question
I don't really have any experience with CUDA. I have C++ script that looks like the following
for (int i = 0; i < n; ++i) {
// out_data here is a pointer to some chunk of memory on a CPU
out_data[i] = manipulate_out_data_val(out_data[i]);
}
This is currently set up for CPUs. I would like to adapt this to work with GPU allocated arrays, i.e., if out_data was allocated on GPU, how can do I write the above loop?
I tried porting it over as is with a GPU-allocated array, and the program seg-faults.
I'm not sure if this is relevant, but manipulate_out_data_val applies a constant scaling factor to the input value and then adds a constant to the resulting scaled value.
So firstly, I will convert your function into a CUDA kernel which looks something like this.
__global__ void manipulate_out_data_val(int *array)
{
// Assuming `20` is just a scaling factor.
array[threadIdx.x] *= 20;
}
Please note that for loops will not be used anymore because of the threadIdx parameter that is provided by CUDA. The thread's index replaces the i from your for loop. Please refer to this document to learn more about the CUDA's threading model.
Lets assume the array can store up to 100 integers.
int n = 100;
int bytes = n * sizeof(int);
Initialise an array on the CPU first.
int *arr_cpu;
arr_cpu = (int *)malloc(bytes);
for(int i = 0;i < n;i++) {
arr_cpu[i] = i;
}
Allocate some memory on the GPU
int *arr_gpu;
cudaMalloc((void **)&arr_gpu, n*sizeof(int));
Now, you can copy your CPU array to this allocated GPU memory using the cudaMemcpu function. Note that Host indicates CPU and Device indicates GPU as stated here
cudaMemcpy(arr_gpu, arr_cpu, n * sizeof(int), cudaMemcpyHostToDevice);
Finally you can run your kernel. Note the number 1 in kernel syntax is number of blocks and n is number of threads per block.
manipulate_out_data_val<<<1, n>>>(arr_gpu);
Wait until the kernel is finished running
cudaDeviceSynchronize();
Finally, you can move the array from GPU back to CPU
cudaMemcpy(arr_cpu, arr_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
Please find the whole code here:
#include <iostream>
#include <cuda.h>
using namespace std;
__global__ void manipulate_out_data_val(int *array)
{
// Can add your constant scaling logic here
array[threadIdx.x] *= 20;
}
int main(int argc,char **argv)
{
int n = 100;
int bytes = n * sizeof(int);
int *arr_cpu;
arr_cpu = (int *)malloc(bytes);
for(int i=0;i<n;i++)
arr_cpu[i]=i;
int *arr_gpu;
cudaMalloc((void **)&arr_gpu, n*sizeof(int));
printf("Copying to device..\n");
cudaMemcpy(arr_gpu, arr_cpu, n * sizeof(int), cudaMemcpyHostToDevice);
manipulate_out_data_val<<<1, n>>>(arr_gpu);
cudaDeviceSynchronize();
cudaMemcpy(arr_cpu, arr_gpu, n * sizeof(int), cudaMemcpyDeviceToHost);
for(int i=0;i<n;i++)
printf("%d,", arr_cpu[i]);
cudaFree(arr_gpu);
return 0;
}
Build and run using:
# program.cu is the file containing the code
nvcc program.cu -o program
# Run
./program
The above code has been tested on CUDA 11.4.
I'm using CUDA Toolkit 8 with Visual Studio Community 2015. When I try simple vector addition from NVidia's PDF manual (minus error checking which I don't have the *.h's for) it always comes back as undefined values, which means the output array was never filled. When I pre-fill it with 0's, that's all I get at the end.
Others have had this problem and some people are saying it's caused by compiling for the wrong compute capability. However, I am using an NVidia GTX 750 Ti, which is supposed to be Compute Capability 5. I have tried compiling for Compute Capability 2.0 (the minimum for my SDK) and 5.0.
I also cannot make any of the precompiled examples work, such as vectoradd.exe which says, "Failed to allocate device vector A (error code initialization error)!" And oceanfft.exe says, "Error unable to find GLSL vertex and fragment shaders!" which doesn't make sense because GLSL and fragment shading are very basic features.
My driver version is 361.43 and other apps such as Blender Cycles in CUDA mode and Stellarium work perfectly.
Here is the code that should work:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>
#include <algorithm>
#define N 10
__global__ void add(int *a, int *b, int *c) {
int tid = blockIdx.x; // handle the data at this index
if (tid < N)
c[tid] = a[tid] + b[tid];
}
int main(void) {
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// allocate the memory on the GPU
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));
// fill the arrays 'a' and 'b' on the CPU
for (int i = 0; i<N; i++) {
a[i] = -i;
b[i] = i * i;
}
// copy the arrays 'a' and 'b' to the GPU
cudaMemcpy(dev_a, a, N * sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int),cudaMemcpyHostToDevice);
add << <N, 1 >> >(dev_a, dev_b, dev_c);
// copy the array 'c' back from the GPU to the CPU
cudaMemcpy(c, dev_c, N * sizeof(int),cudaMemcpyDeviceToHost);
// display the results
for (int i = 0; i<N; i++) {
printf("%d + %d = %d\n", a[i], b[i], c[i]);
}
// free the memory allocated on the GPU
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
I'm trying to develop CUDA apps so any help would be greatly appreciated.
This was apparently caused by using an incompatible driver version with the CUDA 8 toolkit. Installing the driver distributed with the version 8 toolkit solved thr problem.
[Answer assembled from comments and added as a community wiki entry to get the question off the unanswered queue for the CUDA tag]
In a simple test CUDA application, I have a pointer pointing to a list of class instances, and I copy that data to the GPU. I then run a kernel function many times. The kernel function then calls a __device__ member function for each class instance which increments a variable, profitLoss.
For some reason, profitLoss is not incrementing. Here is the code I have:
#include <stdio.h>
#include <stdlib.h>
#define N 200000
class Strategy {
private:
double profitLoss;
public:
__device__ __host__ Strategy() {
this->profitLoss = 0;
}
__device__ __host__ void backtest() {
this->profitLoss++;
}
__device__ __host__ double getProfitLoss() {
return this->profitLoss;
}
};
__global__ void backtestStrategies(Strategy *strategies) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N) {
strategies[i].backtest();
}
}
int main() {
int threadsPerBlock = 1024;
int blockCount = 32;
Strategy *devStrategies;
Strategy *strategies = (Strategy*)malloc(N * sizeof(Strategy));
int i = 0;
// Allocate memory for strategies on the GPU.
cudaMalloc((void**)&devStrategies, N * sizeof(Strategy));
// Initialize strategies on host.
for (i=0; i<N; i++) {
strategies[i] = Strategy();
}
// Copy strategies from host to GPU.
cudaMemcpy(devStrategies, strategies, N * sizeof(Strategy), cudaMemcpyHostToDevice);
for (i=0; i<363598; i++) {
backtestStrategies<<<blockCount, threadsPerBlock>>>(devStrategies);
}
// Copy strategies from the GPU.
cudaMemcpy(strategies, devStrategies, N * sizeof(Strategy), cudaMemcpyDeviceToHost);
// Display results.
for (i=0; i<N; i++) {
printf("%f\n", strategies[i].getProfitLoss());
}
// Free memory for the strategies on the GPU.
cudaFree(devStrategies);
return 0;
}
The output is as follows:
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
...
I would expect it to be:
363598.000000
363598.000000
363598.000000
363598.000000
363598.000000
363598.000000
363598.000000
363598.000000
...
I believe profitLoss is not incrementing due to the way I have initialized the objects (automatic storage duration), and I'm not sure of a better way to instantiate these objects and cudaMemcpy them over to the GPU:
strategies[i] = Strategy();
Can anyone offer any suggestions on how to fix this issue or what might be the cause? Thank you in advance!
UPDATE It appears that for the first 32768 output lines, there is data, and then after that, every line is zero. So I'm possibly hitting some kind of limit.
According to your grid dim blockCount and block dim threadsPerBlock settings, you only launch 32x1024 threads and each thread only updates one instance. That's why you only have 32768 non-zero results at the head of your vector.
To get the expected result, you could either increase the number of GPU threads by increasing the grid dim blockCount large enough to cover all N elements, or
You could use a for loop in the kernel function to let each GPU thread update several elements until all of them are updated.
The second way is preferred as it has much less block launching overhead. But you may still need a grid dim larger than 32 to fully utilize your GPU. You could find more details here.
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
It is a simple CUDA code for initializing a big matrix (filling in zeros).
I output the first 1*3 matrix, if the code works. It should be all zeros.
If I set the matrix size to be small, then the program works properly. But when I make the size larger (> 43200 * 2400), what is inside the matrix are all garbage.
I had cudaDeviceSynchronize() append at the end of each CUDA functions already.
I am using NVIDIA Quadro K4200, Xeon E5-2630 with Ubuntu 14.04.
Thanks for anyone helping me here.
Attached below is my full code.
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <cuComplex.h>
#define BLOCK_SIZE 16 // change it to 16 to get maximum performance
// populate the matrix using first row
__global__ void RepmatKernel (cuComplex *Mat, const unsigned int N, const unsigned int Cols)
{
unsigned int i = (unsigned int)blockIdx.x * (unsigned int)blockDim.x + (unsigned int)threadIdx.x;
if (i < N)
{
Mat[i].x = 0;
Mat[i].y = 0;
}
}
// main routine
int main ()
{
const unsigned int Rows = 43200;
const unsigned int Cols = 2400;
const unsigned int Num_thrd = 256; // max threads per block
unsigned int Mat_size = Rows * Cols; // size of array
cuComplex *vec; // supposedly the input
cuComplex *mat_debug; // for debug
vec = new cuComplex [Cols];
mat_debug = new cuComplex [Rows*Cols];
cuComplex *mat_in_d; // device array
//input in host array
for(unsigned int i = 0; i < Cols; i++)
{
vec[i].x = 3*i+4;
vec[i].y = 0.2*i+1;
}
const unsigned int size_mat_d = Rows * Cols * sizeof(cuComplex);
//create device array cudaMalloc ( (void **)&array_name, sizeofmatrixinbytes) ;
if (cudaMalloc((void **) &mat_in_d , size_mat_d) != cudaSuccess) std::cout<<"Error allocating GPU";
cudaDeviceSynchronize() ;
//copy host array to device array; cudaMemcpy ( dest , source , WIDTH , direction )
cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice ) ;
cudaDeviceSynchronize() ;
// ========================================================================
cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
cudaDeviceSynchronize() ;
std::cout<<"before repmat="<<std::endl;
std::cout<<"[";
for(unsigned int i = 0; i < 3; i++)
{
std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i, ";
std::cout<<";"<<std::endl;
}
std::cout<<"]"<<std::endl;
// ==========================================================================
RepmatKernel<<<(unsigned int)ceil((float)(Mat_size)/(float)(Num_thrd)),
(Num_thrd)>>>(mat_in_d,
Mat_size,
Cols);
cudaDeviceSynchronize();
// ========================================================================
cudaMemcpy(mat_debug , mat_in_d , size_mat_d , cudaMemcpyDeviceToHost) ;
cudaDeviceSynchronize() ;
std::cout<<"after repmat="<<std::endl;
std::cout<<"[";
for(unsigned int i = 0; i < 3; i++)
{
std::cout<< mat_debug[i * Cols].x <<"+"<<mat_debug[i * Cols].y <<"i, ";
std::cout<<";"<<std::endl;
}
std::cout<<"]"<<std::endl;
// ==========================================================================
cudaFree(mat_in_d);
delete [] vec;
delete [] mat_debug;
return 0;
}
Your call to cudaMalloc states that there is a problem, but doesn't actually terminate the computation. You should put a
if (cudaMalloc((void **) &mat_in_d , size_mat_d) != cudaSuccess)
{
std::cout<<"Error allocating GPU\n";
return 1;
}
so that the computation actually stops when you overflow the memory, rather than attempt to work anyway with only a warning to std::cout. Even better would be to use an error handling macro.
Another problem is here:
cudaMemcpy ( mat_in_d , vec , Cols , cudaMemcpyHostToDevice );
First, mat_in_d is size Rows * Cols * sizeof(cuComplex), but you are only copying Cols bytes into it. Even if you only wanted to copy vec into the first part of the mat_in_d vector, you'd need to change this to
cudaMemcpy ( mat_in_d , vec , Cols*sizeof(cuComplex) , cudaMemcpyHostToDevice );
At this point, you'd expect the first Cols entries of you matrix to be reasonable, at the rest to be garbage. (Making the suggested change shows that this is indeed the case; why you would want to do this is a better question).
Next comes your kernel call, whose entire goal is to set the entries of Mat to zero. This should be done with cudaMemset, i.e., just use
cudaMemset(mat_in_d, 0, Mat_size*sizeof(cuComplex));
We could look more carefully at the execution configuration to see what went wrong with your kernel call, but for now this fixes your problem.
For debugging CUDA errors; I find a header from samples, helper_cuda.h, quite convenient. I almost always include this header, which is located in the common directory of samples, in my projects.
Then, wrapping all CUDA calls with checkCudaErrors(), like checkCudaErrors(cudaMalloc((void **) &mat_in_d , size_mat_d)); gives explicit error messages.
In my case, since just mat_in_d is close to 1 GB and my GPU's memory is only 512 MB, it failed for sure and threw cudaErrorMemoryAllocation. However, an NVIDIA Quadro K4200 should not fail that easily!
Did you check the actual available memory information using cudaMemGetInfo ?
I'm trying to get the fft of a 2D array. The input is a NxM real matrix, therefore the output matrix is also a NxM matrix (2xNxM output matrix which is complex is saved in a NxM matrix using the property Hermitian symmetry).
So i want to know whether there is method to extract in cuda to extract real and complex matrices separately ? In opencv split function does the duty. So I'm looking for a similar function in cuda, but I couldn't find it yet.
Given below is my complete code
#define NRANK 2
#define BATCH 10
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cufft.h>
#include <stdio.h>
#include <iostream>
#include <vector>
using namespace std;
int main()
{
const size_t NX = 4;
const size_t NY = 5;
// Input array - host side
float b[NX][NY] ={
{0.7943 , 0.6020 , 0.7482 , 0.9133 , 0.9961},
{0.3112 , 0.2630 , 0.4505 , 0.1524 , 0.0782},
{0.5285 , 0.6541 , 0.0838 , 0.8258 , 0.4427},
{0.1656 , 0.6892 , 0.2290 , 0.5383 , 0.1067}
};
// Output array - host side
float c[NX][NY] = { 0 };
cufftHandle plan;
cufftComplex *data; // Holds both the input and the output - device side
int n[NRANK] = {NX, NY};
// Allocated memory and copy from host to device
cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*(NY/2+1));
for(int i=0; i<NX; ++i){
// Uses this because my actual array is a dynamically allocated.
// but here I've replaced it with a static 2D array to make it simple.
cudaMemcpy(reinterpret_cast<float*>(data) + i*NY, b[i], sizeof(float)*NY, cudaMemcpyHostToDevice);
}
// Performe the fft
cufftPlanMany(&plan, NRANK, n,NULL, 1, 0,NULL, 1, 0,CUFFT_R2C,BATCH);
cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(plan, (cufftReal*)data, data);
cudaThreadSynchronize();
cudaMemcpy(c, data, sizeof(float)*NX*NY, cudaMemcpyDeviceToHost);
// Here c is a NxM matrix. I want to split it to 2 seperate NxM matrices with each
// having the complex and real component of the output
// Here c is in
cufftDestroy(plan);
cudaFree(data);
return 0;
}
EDIT
As suggested by JackOLanter, I modified the code as below. But still the problem is not solved.
float real_vec[NX][NY] = {0}; // host vector, real part
float imag_vec[NX][NY] = {0}; // host vector, imaginary part
cudaError cudaStat1 = cudaMemcpy2D (real_vec, sizeof(real_vec[0]), data, sizeof(data[0]),NY*sizeof(float2), NX, cudaMemcpyDeviceToHost);
cudaError cudaStat2 = cudaMemcpy2D (imag_vec, sizeof(imag_vec[0]),data + 1, sizeof(data[0]),NY*sizeof(float2), NX, cudaMemcpyDeviceToHost);
The error i get is 'invalid pitch argument error'. But i can't understand why. For the destination I use a pitch size of 'float' while for the source i use size of 'float2'
Your question and your code do not make much sense to me.
You are performing a batched FFT, but it seems you are not foreseeing enough memory space neither for the input, nor for the output data;
The output of cufftExecR2C is a NX*(NY/2+1) float2 matrix, which can be interpreted as a NX*(NY+2) float matrix. Accordingly, you are not allocating enough space for c (which is only NX*NY float) for the last cudaMemcpy. You would need still one complex memory location for the continuous component of the output;
Your question does not seem to be related to the cufftExecR2C command, but is much more general: how can I split a complex NX*NY matrix into 2 NX*NY real matrices containing the real and imaginary parts, respectively.
If I correctly interpret your question, then the solution proposed by #njuffa at
Copying data to “cufftComplex” data struct?
could be a good clue to you.
EDIT
In the following, a small example on how "assembling" and "disassembling" the real and imaginary parts of complex vectors when copying them from/to host to/from device. Please, add your own CUDA error checking.
#include <stdio.h>
#define N 16
int main() {
// Declaring, allocating and initializing a complex host vector
float2* b = (float2*)malloc(N*sizeof(float2));
printf("ORIGINAL DATA\n");
for (int i=0; i<N; i++) {
b[i].x = (float)i;
b[i].y = 2.f*(float)i;
printf("%f %f\n",b[i].x,b[i].y);
}
printf("\n\n");
// Declaring and allocating a complex device vector
float2 *data; cudaMalloc((void**)&data, sizeof(float2)*N);
// Copying the complex host vector to device
cudaMemcpy(data, b, N*sizeof(float2), cudaMemcpyHostToDevice);
// Declaring and allocating space on the host for the real and imaginary parts of the complex vector
float* cr = (float*)malloc(N*sizeof(float));
float* ci = (float*)malloc(N*sizeof(float));
/*******************************************************************/
/* DISASSEMBLING THE COMPLEX DATA WHEN COPYING FROM DEVICE TO HOST */
/*******************************************************************/
float* tmp_d = (float*)data;
cudaMemcpy2D(cr, sizeof(float), tmp_d, 2*sizeof(float), sizeof(float), N, cudaMemcpyDeviceToHost);
cudaMemcpy2D(ci, sizeof(float), tmp_d+1, 2*sizeof(float), sizeof(float), N, cudaMemcpyDeviceToHost);
printf("DISASSEMBLED REAL AND IMAGINARY PARTS\n");
for (int i=0; i<N; i++)
printf("cr[%i] = %f; ci[%i] = %f\n",i,cr[i],i,ci[i]);
printf("\n\n");
/******************************************************************************/
/* REASSEMBLING THE REAL AND IMAGINARY PARTS WHEN COPYING FROM HOST TO DEVICE */
/******************************************************************************/
cudaMemcpy2D(tmp_d, 2*sizeof(float), cr, sizeof(float), sizeof(float), N, cudaMemcpyHostToDevice);
cudaMemcpy2D(tmp_d + 1, 2*sizeof(float), ci, sizeof(float), sizeof(float), N, cudaMemcpyHostToDevice);
// Copying the complex device vector to host
cudaMemcpy(b, data, N*sizeof(float2), cudaMemcpyHostToDevice);
printf("REASSEMBLED DATA\n");
for (int i=0; i<N; i++)
printf("%f %f\n",b[i].x,b[i].y);
printf("\n\n");
getchar();
return 0;
}