Error compiling Cuda - expected primary-expression - c++

This program seems fine, but I am still getting an error. Any suggestions?
#include "dot.h"
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
int main(int argc, char** argv)
{
    int *a, *b, *c;
    int *dev_a, *dev_b, *dev_c;
    int size = N * sizeof(int);
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, sizeof(int));
    a = (int *)malloc (size);
    b = (int *)malloc (size);
    c = (int *)malloc (sizeof(int));
    random_ints(a, N);
    random_ints(b, N);
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
    dot<<< res, THREADS_PER_BLOCK >>> (dev_a, dev_b, dev_c);
    //helloWorld<<< dimGrid, dimBlock >>>(d_str);
    cudaMemcpy (c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    free(a); free(b); free(c);
    return 0;
}
The error:
DotProductCuda.cpp:27: error: expected primary-expression before '<' token
DotProductCuda.cpp:27: error: expected primary-expression before '>' token

The <<< >>> syntax for calling a kernel is not standard C or C++. Those calls must be in a file compiled by the NVCC compiler. Those files are normally named with a .cu extension. Other API calls to CUDA such as cudaMalloc can be in regular .c or .cpp files.

nvcc uses the file extension to determine how to process the contents of the file. If you have CUDA syntax inside the file, it must have a .cu extension, otherwise nvcc will simply pass the file untouched to the host compiler, resulting in the syntax error you are observing.

It seems the compiler cannot recognize the <<<,>>> syntax. I have no experience with CUDA, but I guess you need to compile this file with a special compiler and not an ordinary C compiler.

Maybe you are calling a host function (printf, for example) inside the kernel?


cudaMalloc succeeded with static size but failed with dynamically calculated size

I was trying to allocate about 2.75G of memory on the GPU. It works when the size is 'static' (known at compile time), but fails when the size is 'dynamic' (calculated at run time).
I am on a box with CentOS 7.1, CUDA 7.5, 2 x Titan X cards, an Intel 4790K, and 32GB of memory.
The Code:
#include <cstdio>
#include <cuda_runtime.h>
int main() {
    int item_count = 21217344;
    int dim = 128;
    unsigned char * data_dev;
    size_t mem_size = item_count * dim * sizeof(unsigned char);
    printf("memory to alloc %u\n", mem_size);
    int r = cudaMalloc((void **)&data_dev, mem_size);
    if(r) {
        printf("memory alloc failed!\n");
    }
    size_t mem_size_static = 2715820032; // 21217344 * 128 = 2715820032;
    r = cudaMalloc((void **)&data_dev, mem_size_static);
    if(!r) {
        printf("memory alloc succeeded!\n");
    }
}
Save it to '' and then compile it:
And run it:
[root@localhost test]# ./a.out
memory to alloc 2715820032
memory alloc failed!
memory alloc succeeded!
So any idea about this?
int item_count = 21217344;
int dim = 128;
Those are ints, so their product is computed in int arithmetic. The mathematical result, 2715820032, exceeds INT_MAX (2147483647) and in practice wraps around to -1579147264 (strictly speaking, signed overflow is undefined behavior). Converted to size_t, that negative value becomes an enormous allocation request, so cudaMalloc fails.
What you want is to either declare those variables with a wider type (e.g. std::size_t) or cast either operand to such a wider type before the multiplication, and everything will work fine.
Side note: you would have spotted the bug immediately had you used C++'s std::cout instead of printf, or the proper size_t format specifier, %zu.
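The widening fix can be sketched in plain host C++ as follows. The helper names byte_count and fits_in_int are hypothetical, and the sketch assumes a 64-bit std::size_t, as on the asker's 64-bit Linux box.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: widen one operand BEFORE multiplying, so the whole
// product is computed in std::size_t rather than in int.
std::size_t byte_count(int item_count, int dim) {
    return static_cast<std::size_t>(item_count) * dim * sizeof(unsigned char);
}

// Hypothetical check: would item_count * dim still fit in a plain int?
// The product is formed in 64-bit arithmetic so the check itself cannot overflow.
bool fits_in_int(int item_count, int dim) {
    std::int64_t product = static_cast<std::int64_t>(item_count) * dim;
    return product >= INT_MIN && product <= INT_MAX;
}
```

With the values from the question, byte_count(21217344, 128) gives 2715820032 and fits_in_int(21217344, 128) is false. Printing the size with printf("%zu\n", ...) would then show the full value.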

cudaOccupancyMaxActiveBlocksPerMultiprocessor is undefined

I am trying to learn CUDA and use it efficiently. I found some code on NVIDIA's website that shows how to find out which block size to use for the most efficient use of the device. The code is as follows:
#include <iostream>

// Device code
__global__ void MyKernel(int *d, int *a, int *b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

// Host code
int main()
{
    int numBlocks;        // Occupancy in terms of active blocks
    int blockSize = 32;

    // These variables are used to convert occupancy to warps
    int device;
    cudaDeviceProp prop;
    int activeWarps;
    int maxWarps;

    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    // Ask the runtime how many blocks of this size can be active per multiprocessor
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, MyKernel, blockSize, 0);

    activeWarps = numBlocks * blockSize / prop.warpSize;
    maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;
    return 0;
}
However, when I compiled it, there is the following error :
Compile line :
nvcc -arch=sm_35 -rdc=true -lcublas -lcublas_device -lcudadevrt -o my
Error : error: identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined
1 error detected in the compilation of "/tmp/tmpxft_0000623d_00000000-8_ben_deneme2.cpp1.ii".
Should I include a library for this, though I could not find a library name for this on the internet? Or am I doing something else wrong?
Thanks in advance
The cudaOccupancyMaxActiveBlocksPerMultiprocessor function is included in CUDA 6.5. You do not have access to that function if you have an earlier version of CUDA installed; for example, it will not work with CUDA 5.5.
If you want to use that function, you must update your CUDA version to at least 6.5.
People using older versions usually use the CUDA Occupancy Calculator.
One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. -- CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
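The occupancy arithmetic from the question does not itself need a GPU, so it can be sketched in plain C++. The function name occupancy_percent and the sample numbers in the note below are hypothetical. On a real device the inputs would come from cudaDeviceProp and from the occupancy API.

```cpp
// Hypothetical helper mirroring the arithmetic in the question's host code.
// num_blocks:         active blocks per multiprocessor (as reported by the occupancy API)
// block_size:         threads per block
// warp_size:          threads per warp (prop.warpSize)
// max_threads_per_sm: prop.maxThreadsPerMultiProcessor
double occupancy_percent(int num_blocks, int block_size,
                         int warp_size, int max_threads_per_sm) {
    int active_warps = num_blocks * block_size / warp_size;
    int max_warps = max_threads_per_sm / warp_size;
    return static_cast<double>(active_warps) / max_warps * 100.0;
}
```

For instance, with the made-up inputs of 16 active blocks of 32 threads, a warp size of 32, and 2048 threads per multiprocessor, this gives 16 active warps out of 64, i.e. 25%.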

calling a host function from a global function is not allowed

I'm compiling a CUDA 5.5 project in VS2010. I need to use the MPIR library because my project works with large numbers. When I use MPIR functions, this error appears, and I do not know how to fix it.
This program adds array A and array B using MPIR functions.
void vecAdd(mpz_t *A,mpz_t *B,mpz_t *C,int N)
int i = threadIdx.x + blockDim.x * blockIdx.x;
int main()
mpz_t *h_A;
mpz_t *h_B;
mpz_t *h_C;
int N=5;
int size=N*sizeof(mpz_t);
mpz_t *d_A;
mpz_t *d_B;
mpz_t *d_C;
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
for(int i=0;i<5;i++)
return 0;
You have to understand that functions called from the device have to be compiled to device code. Adding __device__ to a function's declaration makes it callable from device code.
However, since mpz_add comes from the MPIR library, which was not built with CUDA compatibility in mind (as far as I'm aware), you're out of luck. I suggest you look for a GPU implementation of arbitrary-precision numbers.

2d char array to CUDA kernel

I need help transferring a char[][] to a CUDA kernel. This is my code:
__global__ void kernel(char** BiExponent){
    for(int i=0; i<500; i++)
        printf("%c",BiExponent[1][i]); // I want to print line 1
}
int main(){
    char (*Bi2dChar)[500] = new char [5000][500];
    char **dev_Bi2dChar;
    size_t host_orig_pitch = 500 * sizeof(char);
    size_t pitch;
    cudaMallocPitch((void**)&dev_Bi2dChar, &pitch, 500 * sizeof(char), 5000);
    cudaMemcpy2D(dev_Bi2dChar, pitch, Bi2dChar, host_orig_pitch, 500 * sizeof(char), 5000, cudaMemcpyHostToDevice);
    kernel <<< 1, 512 >>> (dev_Bi2dChar);
    free(Bi2dChar); cudaFree(dev_Bi2dChar);
}
I use:
nvcc.exe" -gencode=arch=compute_20,code=\"sm_20,compute_20\" --use-local-env --cl-version 2012 -ccbin
Thanks for help.
cudaMemcpy2D doesn't actually handle 2-dimensional (i.e. double pointer, **) arrays in C.
Note that the documentation indicates it expects single pointers, not double pointers.
Generally speaking, moving arbitrary double pointer C arrays between the host and the device is more complicated than a single pointer array.
If you really want to handle the double-pointer array, then search on "CUDA 2D Array" in the upper right hand corner of this page, and you'll find various examples of how to do it. (For example, the answer given by @talonmies here)
Often, an easier approach is simply to "flatten" the array so it can be referenced by a single pointer, i.e. char[] instead of char[][], and then use index arithmetic to simulate 2-dimensional access.
Your flattened code would look something like this:
(the code you provided is an uncompilable, incomplete snippet, so mine is also)
#define XDIM 5000
#define YDIM 500

__global__ void kernel(char* BiExponent){
    for(int i=0; i<YDIM; i++)
        printf("%c",BiExponent[(1*YDIM)+i]); // print line 1; each row holds YDIM chars
}

int main(){
    char (*Bi2dChar)[YDIM] = new char [XDIM][YDIM];
    char *dev_Bi2dChar;
    cudaMalloc((void**)&dev_Bi2dChar,XDIM*YDIM * sizeof(char));
    // cudaMemcpy takes (dst, src, byte count, direction); no pitch argument
    cudaMemcpy(dev_Bi2dChar, &(Bi2dChar[0][0]), XDIM*YDIM * sizeof(char), cudaMemcpyHostToDevice);
    kernel <<< 1, 512 >>> (dev_Bi2dChar);
    cudaDeviceSynchronize(); // wait for the kernel so its output is flushed
    delete[] Bi2dChar; cudaFree(dev_Bi2dChar); // memory from new[] is released with delete[]
}
If you want a pitched array, you can create it similarly, but you will still do so as single pointer arrays, not double pointer arrays.
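The index arithmetic behind the flattened layout can be sketched on the host in plain C++. The names flat_index and make_flat are hypothetical, and the sketch assumes a row-major layout in which each row holds `width` elements (500 in the question).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: row-major offset of element (row, col) in a flattened
// 2D array whose rows each hold `width` elements.
std::size_t flat_index(std::size_t row, std::size_t col, std::size_t width) {
    return row * width + col;
}

// Hypothetical helper: build a rows x width buffer in which every element of
// row r holds the character 'a' + (r % 26), so rows are easy to tell apart.
std::vector<char> make_flat(std::size_t rows, std::size_t width) {
    std::vector<char> buf(rows * width);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < width; ++c)
            buf[flat_index(r, c, width)] = static_cast<char>('a' + r % 26);
    return buf;
}
```

Line 1 of a 5000 x 500 array then occupies offsets flat_index(1, 0, 500) through flat_index(1, 499, 500), i.e. 500 through 999. The buffer's data() pointer is a single char* that could be copied to the device with one cudaMemcpy call.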
printf inside a CUDA kernel is only supported on devices of compute capability 2.0 or higher, because the code is executed on the GPU and not on the host CPU.
On older devices you can, however, use cuPrintf:
How do we use cuPrintf()?

cuda and c++ problem

Hi, I have a CUDA program which runs successfully.
Here is the code for the CUDA program:
#include <stdio.h>
#include <cuda.h>
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx<N)
        a[idx] = a[idx] * a[idx];
}

int main(void)
{
    float *a_h, *a_d;
    const int N = 10;
    size_t size = N * sizeof(float);
    a_h = (float *)malloc(size);
    cudaMalloc((void **) &a_d, size);
    for (int i=0; i<N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
    square_array <<< n_blocks, block_size >>> (a_d, N);
    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    // Print results
    for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
}
Now I want to split this code into two files: one for the C or C++ code and a .cu file for the kernel. I just want to do this for learning, and so that I don't have to write the same kernel code again and again.
Can anyone tell me how to do this?
How do I split this code into two different files?
Then how do I compile it?
How do I write a makefile for it?
how to
Code that uses CUDA C extensions has to be in a *.cu file; the rest can be in a C++ file.
So here your kernel code can be moved to a separate *.cu file.
To keep the main function in a C++ file, you need to wrap the kernel invocation (the code with square_array<<<...>>>(...);) in a C++ function whose implementation also lives in the *.cu file.
Functions such as cudaMalloc can stay in the C++ file as long as you include the proper CUDA headers.
The biggest obstacle that you will most likely encounter is how to call your kernel from your .cpp file. C++ will not understand the <<< >>> syntax. There are 3 ways of doing it:
1. Just write a small encapsulating host function in your .cu file.
2. Use CUDA library functions (cudaConfigureCall, cudaFuncGetAttributes, cudaLaunch); check the CUDA Reference Manual, chapter "Execution Control" (online version), for details. You can use those functions in plain C++ code, as long as you include the CUDA libraries.
3. Include PTX at runtime. It is harder, but allows you to manipulate PTX code at runtime. This JIT approach is explained in the CUDA Programming Guide (chapter 3.3.2) and in the CUDA Reference Manual ("Module Management" chapter, online version).
An encapsulating function could look like this, for example:
// In the .cu file:
... //your device square_array function
void host_square_array(dim3 grid, dim3 block, float *deviceA, int N) {
    square_array <<< grid, block >>> (deviceA, N);
}

// In mystuff.h:
#include <cuda_runtime.h> // declares dim3 for the host compiler
void host_square_array(dim3 grid, dim3 block, float *deviceA, int N);

// In the .cpp file:
#include "mystuff.h"
int main() { ... //your normal host code
}