calling a host function from a global function is not allowed - c++

I'm compiling a CUDA 5.5 project in VS2010. I need to use the MPIR library because my project works with large numbers. When I use MPIR functions, this error appears, and I don't know how to fix it.
This program adds array A and array B using MPIR functions:
__global__
void vecAdd(mpz_t *A, mpz_t *B, mpz_t *C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < N)
        mpz_add(C[i], A[i], B[i]);
}

int main()
{
    mpz_t *h_A;
    h_A = (mpz_t*)malloc(5 * sizeof(mpz_t));
    mpz_array_init(h_A[0], 5, 16);
    mpz_set_si(h_A[0], 1);
    mpz_set_si(h_A[1], 2);
    mpz_set_si(h_A[2], 3);
    mpz_set_si(h_A[3], 4);
    mpz_set_si(h_A[4], 5);

    mpz_t *h_B;
    h_B = (mpz_t*)malloc(5 * sizeof(mpz_t));
    mpz_array_init(h_B[0], 5, 16);
    mpz_set_si(h_B[0], 1);
    mpz_set_si(h_B[1], 2);
    mpz_set_si(h_B[2], 3);
    mpz_set_si(h_B[3], 4);
    mpz_set_si(h_B[4], 5);

    mpz_t *h_C;
    h_C = (mpz_t*)malloc(5 * sizeof(mpz_t));
    mpz_array_init(h_C[0], 5, 16);

    int N = 5;
    int size = N * sizeof(mpz_t);

    mpz_t *d_A;
    d_A = (mpz_t*)malloc(5 * sizeof(mpz_t));
    mpz_array_init(d_A[0], 5, 16);

    mpz_t *d_B;
    d_B = (mpz_t*)malloc(5 * sizeof(mpz_t));
    mpz_array_init(d_B[0], 5, 16);

    mpz_t *d_C;
    d_C = (mpz_t*)malloc(5 * sizeof(mpz_t));
    mpz_array_init(d_C[0], 5, 16);

    cudaMalloc((void**)&d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_C, size);

    vecAdd<<<ceil(N/512.0), 512>>>(d_A, d_B, d_C, N);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    for (int i = 0; i < 5; i++)
    {
        mpz_out_str(stdout, 10, h_C[i]);
        printf("\n");
    }
    return 0;
}

You have to understand that functions called from the device have to be compiled to device code. Placing a __device__ qualifier in the declaration of a function makes it callable from the device side.
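For illustration, a minimal sketch of the distinction (the function names here are my own, not from the question):

__device__ int addOne(int x) { return x + 1; }  // compiled as device code, callable from kernels

__global__ void kernel(int *out)
{
    out[threadIdx.x] = addOne(threadIdx.x);  // OK: addOne is a __device__ function
    // out[threadIdx.x] = rand();            // error: rand() is a host-only function
}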
However, since mpz_add comes from the MPIR library, which was not built with CUDA support (as far as I'm aware), you're out of luck. I suggest you find a GPU implementation of arbitrary-precision numbers.

Related

CUDA - separating cpu code from cuda code

I was looking to use standard library functions (such as rand()) within the CUDA kernel; ideally, though, these would just run on the CPU. Can I separate the files (.cu and .cpp) while still making use of GPU array math? For example, something along these lines:
in main.cpp:
#include <vector>
#include <cstdlib>
#include <ctime>

void selfSquare(std::vector<int> arr, int n); // implemented in cudaFuncs.cu

int main(){
    std::vector<int> myVec;
    srand(time(NULL));
    for (int i = 0; i < 1024; i++){
        myVec.push_back(rand() % 26);
    }
    selfSquare(myVec, 1024);
}
and in cudaFuncs.cu:
#include <vector>

__global__ void selfSquare_cu(int *arr, int n){   // the parameter n needs a type
    int i = threadIdx.x;
    if (i < n){
        arr[i] = arr[i] * arr[i];
    }
}

void selfSquare(std::vector<int> arr, int n){
    int *cuArr;
    cudaMallocManaged(&cuArr, n * sizeof(int));
    for (int i = 0; i < n; i++){
        cuArr[i] = arr[i];
    }
    selfSquare_cu<<<1, n>>>(cuArr, n);            // launch syntax is <<< >>>, not << >>
}
What are the best practices in situations like these? Would it be a better idea to use cuRAND and write everything in the kernel? It looks to me like the above example has an extra step: taking the vector and copying it into the managed CUDA memory.
In this case, the only thing you need is to have the array initialized with random values; each value of the array can be initialized independently.
The CPU is involved in your code during the initialization and during the transfer of the data to the device and back to the host.
In your case: do you really need the CPU to initialize the data, only for all those values to then be moved to the GPU?
The best approach is to allocate some device memory and then initialize the values using a kernel, as in the sketch below. This will save time because:
the elements are initialized in parallel;
there is no memory transfer required from the host to the device.
As a rule of thumb, always avoid communication between host and device if possible.
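A minimal sketch of that approach, using the cuRAND device API (the kernel names and launch configuration are illustrative assumptions, not from the question):

#include <curand_kernel.h>

__global__ void initRandom(int *arr, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        curandState state;
        curand_init(seed, i, 0, &state);  // one RNG state per thread
        arr[i] = curand(&state) % 26;     // values in [0, 25], as in the question
    }
}

__global__ void selfSquare_cu(int *arr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) arr[i] = arr[i] * arr[i];
}

int main() {
    const int n = 1024;
    int *arr;
    cudaMalloc(&arr, n * sizeof(int));                     // data lives on the device only
    initRandom<<<(n + 255) / 256, 256>>>(arr, n, 1234ULL); // no host-to-device copy needed
    selfSquare_cu<<<(n + 255) / 256, 256>>>(arr, n);
    cudaDeviceSynchronize();
    cudaFree(arr);
    return 0;
}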

Cuda: pinned memory zero copy problems

I tried the code in this link: Is CUDA pinned memory zero-copy?
The person who asked claims the program worked fine for them, but it does not work the same way for me: the values do not change when I manipulate them in the kernel.
Basically, my problem is that my GPU memory is not enough, but I want to do calculations which require more memory. I want my program to use RAM (host memory) and still be able to use CUDA for the calculations. The program in the link seemed to solve my problem, but the code does not give the output shown by that poster.
Any help, or any working example of zero-copy memory, would be useful.
Thank you.
__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue + 10;
}

void test()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS, cudaHostAllocDefault);

    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<<numBlocks, threadsPerBlock>>>(pinnedHostPtr);

    //read output
    printf("Data after kernel execution: ");
    for (int i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");
}
First of all, to allocate zero-copy memory, you have to specify the cudaHostAllocMapped flag as an argument to cudaHostAlloc (note that your original call also allocated only THREADS bytes instead of THREADS * sizeof(double)):
cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);
Still, pinnedHostPtr can only be used to access the mapped memory from the host side. To access the same memory from the device, you have to get the device-side pointer to the memory, like this:
double* dPtr;
cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);
Pass this pointer as the kernel argument:
testPinnedMemory<<< numBlocks, threadsPerBlock>>>(dPtr);
Also, you have to synchronize the kernel execution with the host to read the updated values; just add cudaDeviceSynchronize() after the kernel call.
The code in the linked question works because the person who asked it was running on a 64-bit OS with a GPU of compute capability 2.0 and TCC enabled. That configuration automatically enables the Unified Virtual Addressing (UVA) feature of the GPU, in which the device sees host and device memory as a single large address space rather than as separate ones, and host pointers allocated using cudaHostAlloc can be passed directly to the kernel.
In your case, the final code will look like this:
#include <cstdio>

__global__ void testPinnedMemory(double * mem)
{
    double currentValue = mem[threadIdx.x];
    printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
    mem[threadIdx.x] = currentValue + 10;
}

int main()
{
    const size_t THREADS = 8;
    double * pinnedHostPtr;
    cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocMapped);

    //set memory values
    for (size_t i = 0; i < THREADS; ++i)
        pinnedHostPtr[i] = i;

    double* dPtr;
    cudaHostGetDevicePointer(&dPtr, pinnedHostPtr, 0);

    //call kernel
    dim3 threadsPerBlock(THREADS);
    dim3 numBlocks(1);
    testPinnedMemory<<<numBlocks, threadsPerBlock>>>(dPtr);
    cudaDeviceSynchronize();

    //read output
    printf("Data after kernel execution: ");
    for (int i = 0; i < THREADS; ++i)
        printf("%f ", pinnedHostPtr[i]);
    printf("\n");
    return 0;
}
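Note that on platforms without UVA, mapped pinned memory must be enabled before the CUDA context is created, and you can query support first. A minimal sketch, assuming device 0:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.canMapHostMemory)
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must run before any call that creates the CUDA context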

2d char array to CUDA kernel

I need help with transferring a char[][] array to a CUDA kernel. This is my code:
__global__
void kernel(char** BiExponent){
    for(int i = 0; i < 500; i++)
        printf("%c", BiExponent[1][i]); // I want to print line 1
}

int main(){
    char (*Bi2dChar)[500] = new char[5000][500];
    char **dev_Bi2dChar;
    ... //HERE I INPUT DATA TO Bi2dChar
    size_t host_orig_pitch = 500 * sizeof(char);
    size_t pitch;
    cudaMallocPitch((void**)&dev_Bi2dChar, &pitch, 500 * sizeof(char), 5000);
    cudaMemcpy2D(dev_Bi2dChar, pitch, Bi2dChar, host_orig_pitch, 500 * sizeof(char), 5000, cudaMemcpyHostToDevice);
    kernel<<<1, 512>>>(dev_Bi2dChar);
    free(Bi2dChar); cudaFree(dev_Bi2dChar);
}
I use:
nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2012 -ccbin
Thanks for help.
cudaMemcpy2D doesn't actually handle 2-dimensional (i.e. double-pointer, **) arrays in C.
Note that the documentation indicates it expects single pointers, not double pointers.
Generally speaking, moving an arbitrary double-pointer C array between the host and the device is more complicated than moving a single-pointer array.
If you really want to handle the double-pointer array, then search on "CUDA 2D Array" in the upper right hand corner of this page, and you'll find various examples of how to do it (for example, the answer given by @talonmies here).
Often, an easier approach is simply to "flatten" the array so it can be referenced by a single pointer, i.e. char[] instead of char[][], and then use index arithmetic to simulate 2-dimensional access.
Your flattened code would look something like this (the code you provided is an uncompilable, incomplete snippet, so mine is as well):
#define XDIM 5000
#define YDIM 500

__global__
void kernel(char* BiExponent){
    for(int i = 0; i < 500; i++)
        printf("%c", BiExponent[(1*YDIM)+i]); // print line 1: the row stride of the flattened array is YDIM
}

int main(){
    char (*Bi2dChar)[YDIM] = new char[XDIM][YDIM];
    char *dev_Bi2dChar;
    ... //HERE I INPUT DATA TO Bi2dChar
    cudaMalloc((void**)&dev_Bi2dChar, XDIM*YDIM * sizeof(char));
    cudaMemcpy(dev_Bi2dChar, &(Bi2dChar[0][0]), XDIM*YDIM * sizeof(char), cudaMemcpyHostToDevice);
    kernel<<<1, 512>>>(dev_Bi2dChar);
    delete[] Bi2dChar; cudaFree(dev_Bi2dChar); // Bi2dChar came from new[], so delete[] rather than free
}
If you want a pitched array, you can create it similarly, but you will still do so with single-pointer arrays, not double-pointer arrays; a minimal sketch follows.
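A minimal sketch of the pitched variant, reusing the XDIM/YDIM definitions from above: cudaMallocPitch returns the row pitch in bytes, and the kernel steps between rows through that pitch.

__global__
void kernelPitched(char* BiExponent, size_t pitch){
    const char *row1 = BiExponent + 1 * pitch;  // start of row 1 in the pitched layout
    for(int i = 0; i < YDIM; i++)
        printf("%c", row1[i]);
}

// host side:
// char *dev_Bi2dChar; size_t pitch;
// cudaMallocPitch((void**)&dev_Bi2dChar, &pitch, YDIM * sizeof(char), XDIM);
// cudaMemcpy2D(dev_Bi2dChar, pitch, Bi2dChar, YDIM * sizeof(char),
//              YDIM * sizeof(char), XDIM, cudaMemcpyHostToDevice);
// kernelPitched<<<1, 1>>>(dev_Bi2dChar, pitch);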
You can't use printf in a CUDA kernel on devices of compute capability below 2.0, because the code is being executed on the GPU and not on the host CPU (on compute capability 2.0 and newer, in-kernel printf is supported).
On older devices you can, however, use cuPrintf:
How do we use cuPrintf()?

Error compiling Cuda - expected primary-expression

This program seems to be fine, but I am still getting an error. Any suggestions?
Program:
#include "dot.h"
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
int main(int argc, char** argv)
{
int *a, *b, *c;
int *dev_a, *dev_b, *dev_c;
int size = N * sizeof(int);
cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, sizeof(int));
a = (int *)malloc (size);
b = (int *)malloc (size);
c = (int *)malloc (sizeof(int));
random_ints(a, N);
random_ints(b, N);
cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
int res = N/THREADS_PER_BLOCK;
dot<<< res, THREADS_PER_BLOCK >>> (dev_a, dev_b, dev_c);
//helloWorld<<< dimGrid, dimBlock >>>(d_str);
cudaMemcpy (c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
free(a); free(b); free(c);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
the error:
DotProductCuda.cpp:27: error: expected primary-expression before '<' token
DotProductCuda.cpp:27: error: expected primary-expression before '>' token
The <<< >>> syntax for calling a kernel is not standard C or C++. Such calls must be in a file compiled by the nvcc compiler; those files are normally named with a .cu extension. Other CUDA API calls, such as cudaMalloc, can be in regular .c or .cpp files.
nvcc uses the file extension to determine how to process the contents of the file. If a file contains CUDA launch syntax, it must have a .cu extension; otherwise nvcc simply passes the file untouched to the host compiler, resulting in the syntax error you are observing.
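So, assuming nothing else in the build needs to change, the fix is simply to rename the file and compile it with nvcc, for example:

nvcc -o DotProduct DotProductCuda.cu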
It seems the compiler does not recognize the <<<,>>> syntax. I have no experience with CUDA, but I guess you need to compile this file with a special compiler rather than an ordinary C compiler.
Maybe you are using a host function (printf, for example) inside the kernel?

cuda and c++ problem

Hi, I have a CUDA program which runs successfully.
Here is the code for the CUDA program:
#include <stdio.h>
#include <cuda.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] * a[idx];
}

int main(void)
{
    float *a_h, *a_d;
    const int N = 10;
    size_t size = N * sizeof(float);

    a_h = (float *)malloc(size);
    cudaMalloc((void **)&a_d, size);

    for (int i = 0; i < N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    int block_size = 4;
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
    square_array<<<n_blocks, block_size>>>(a_d, N);

    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Print results
    for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);

    free(a_h);
    cudaFree(a_d);
}
Now I want to split this code into two files: one .cpp (or .c) file for the host code and one .cu file for the kernel. I just want to do this for learning, and so that I don't have to write the same kernel code again and again. Can anyone tell me:
how to split this code into two different files?
how to compile it?
how to write a makefile for it?
Code which uses the CUDA C extensions has to be in a *.cu file; the rest can be in a C++ file.
So here your kernel code can be moved to a separate *.cu file.
To keep the main function in a C++ file, you need to wrap the kernel invocation (the code with square_array<<<...>>>(...);) in a C++ function whose implementation is in the *.cu file as well.
Functions such as cudaMalloc can be left in the C++ file as long as you include the proper CUDA headers.
The biggest obstacle you will most likely encounter is how to call your kernel from your .cpp file, since C++ will not understand the <<< >>> syntax. There are three ways of doing it:
Just write a small encapsulating host function in your .cu file.
Use the CUDA library functions (cudaConfigureCall, cudaFuncGetAttributes, cudaLaunch) --- check the CUDA Reference Manual for details, chapter "Execution Control" (online version). You can use those functions in plain C++ code, as long as you include the CUDA libraries.
Include PTX at runtime. This is harder, but allows you to manipulate PTX code at runtime. The JIT approach is explained in the CUDA Programming Guide (chapter 3.3.2) and in the CUDA Reference Manual (Module Management chapter, online version).
The encapsulating function could look like this, for example:
mystuff.cu:
... //your device square_array function
void host_square_array(dim3 grid, dim3 block, float *deviceA, int N) {
    square_array<<<grid, block>>>(deviceA, N);
}

mystuff.h:
#include <cuda.h>
void host_square_array(dim3 grid, dim3 block, float *deviceA, int N);

mymain.cpp:
#include "mystuff.h"
int main() { ... //your normal host code
}
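To compile and link the split project, a minimal sketch (assuming the filenames above and a Linux CUDA install; paths and flags vary by setup):

nvcc -c mystuff.cu -o mystuff.o
g++ -c mymain.cpp -o mymain.o -I/usr/local/cuda/include
g++ mymain.o mystuff.o -o myapp -L/usr/local/cuda/lib64 -lcudart

And a minimal makefile encoding the same rules:

myapp: mymain.o mystuff.o
	g++ mymain.o mystuff.o -o myapp -L/usr/local/cuda/lib64 -lcudart
mymain.o: mymain.cpp mystuff.h
	g++ -c mymain.cpp -o mymain.o -I/usr/local/cuda/include
mystuff.o: mystuff.cu mystuff.h
	nvcc -c mystuff.cu -o mystuff.o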