cudaTextureObject_t tex1Dfetch doesn't compile - C++

This code doesn't compile with the CUDA Toolkit 7.5 on a GTX 980 (compute capability set to 5.2) in Visual Studio 2013.
__global__ void a_kernel(cudaTextureObject_t texObj)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int something = tex1Dfetch(texObj, thread_id);
}
Here is the error:
error : more than one instance of overloaded function "tex1Dfetch" matches the argument list:
This code also doesn't compile:
__global__ void another_kernel(cudaTextureObject_t texObj)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    float something = tex1Dfetch<float>(texObj, thread_id);
}
Here is that error:
error : type name is not allowed
Following this example and its comments, all of the above should work:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility/
Please let me know if you need additional info; I couldn't think of what else to provide.

Your first kernel doesn't compile because of a missing template type argument. This will compile:
__global__ void a_kernel(cudaTextureObject_t texObj)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int something = tex1Dfetch<int>(texObj, thread_id);
}
Your second kernel is correct, and it does compile for me using VS2012 with the CUDA 7.0 toolkit for every compute capability I tried (sm_30 through sm_52).
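For completeness, here is a minimal host-side sketch of how a texture object for the float kernel might be created, following the pattern in the linked post. The device buffer d_buffer, its length N, and the launch configuration are assumptions, and the usual <cuda_runtime.h>/<cstring> includes are implied.
// Sketch only: bind a linear device buffer of N floats to a texture object.
// The tex1Dfetch<T> template argument should match this channel format (float).
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = d_buffer;                      // assumed float* device allocation
resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
resDesc.res.linear.sizeInBytes = N * sizeof(float);

cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t texObj = 0;
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);

another_kernel<<<(N + 255) / 256, 256>>>(texObj);

cudaDestroyTextureObject(texObj);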

I reinstalled the CUDA toolkit and now the second piece of code (another_kernel) compiles. The first piece of code was incorrect in the first place, as per the first answer. As for why reinstalling the CUDA toolkit helped: I must have previously clobbered something in the SDK; I believe it was texture_indirect_functions.h.

Related

CUDA <<<X,X>>> gives "expected an expression" error

I am trying to compile and run the following program called test.cu:
#include <iostream>
#include <math.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

// Kernel function to add the elements of two arrays
__global__
void add(int n, float* x, float* y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1 << 20;
    float* x, * y;

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 2.0f;
        y[i] = 1.0f;
    }

    // Run kernel on 1M elements on the GPU
    add <<<1, 256>>> (N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    for (int i = 0; i < 10; i++)
        std::cout << y[i] << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}
I am using Visual Studio Community 2019, and it marks the "add <<<1, 256>>> (N, x, y);" line with an "expected an expression" error. I tried compiling it, and somehow it compiles without errors, but when running the .exe file it outputs a bunch of "1" instead of the expected "3".
I also tried compiling with "nvcc test.cu", but initially it said "nvcc fatal : Cannot find compiler 'cl.exe' in PATH", so I added "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\Hostx64\x64" to PATH, and now compiling with nvcc gives the same error as compiling with Visual Studio.
In both cases the program never enters the add kernel.
I am pretty sure the code is right and the problem has something to do with the installation, but I already tried reinstalling the CUDA toolkit and repairing Visual Studio, and it didn't work.
The kernel.cu example that appears when starting a new CUDA project in Visual Studio also didn't work. When run, it output "No kernel image available for execution on the device".
How can I solve this?
The nvcc version, if that helps:
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:35_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.relgpu_drvr445TC445_37.28845127_0
Visual Studio provides IntelliSense for C++. In the C++ language, the proper parsing of angle brackets is troublesome: you've got < as less-than and for templates, and << as shift. So, the fact is that the guys at NVIDIA chose the worst possible delimiter, <<<>>>, which makes it difficult for IntelliSense to work properly. The way to get full IntelliSense in CUDA is to switch from the Runtime API to the Driver API. The C++ is just C++, and the CUDA is still (sort of) C++; there is no <<<>>> badness for the language parsing to have to work around.
You could take a look at the difference between the matrixMul and matrixMulDrv samples. The <<<>>> syntax is handled by the compiler essentially just spitting out code that calls the Driver API. You'll link against cuda.lib instead of cudart.lib, and may have to deal with a "mixed mode" program if you use CUDA-RT-only libraries. You could refer to this link for more information.
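For illustration, here is a rough sketch of what launching the add kernel through the Driver API might look like. The PTX file name "add.ptx" and the kernel name "add" are assumptions; the kernel would be built separately (e.g. with nvcc -ptx) and declared extern "C" so its name is not mangled, and error checking is omitted.
#include <cuda.h>

// Rough Driver API sketch; "add.ptx" and the exported name "add" are assumptions.
CUdevice dev;
CUcontext ctx;
CUmodule mod;
CUfunction fn;

cuInit(0);
cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx, 0, dev);
cuModuleLoad(&mod, "add.ptx");        // PTX built separately, e.g. nvcc -ptx test.cu
cuModuleGetFunction(&fn, mod, "add"); // requires extern "C" __global__ void add(...)

int n = 1 << 20;
CUdeviceptr x, y;
cuMemAlloc(&x, n * sizeof(float));
cuMemAlloc(&y, n * sizeof(float));

void* args[] = { &n, &x, &y };
cuLaunchKernel(fn,
               1, 1, 1,      // grid dimensions
               256, 1, 1,    // block dimensions
               0, NULL,      // shared memory bytes, stream
               args, NULL);  // kernel parameters, extra
cuCtxSynchronize();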
Also, this link tells how to add Intellisense for CUDA in VS.

work group barrier not working

As a simple test to see whether or not the OpenCL 2.0 functions work for me, I've written a small kernel that calls work_group_barrier. However, for the life of me, I can't figure out why the kernel becomes invalid.
Considering that the kernel is valid if barrier is used, and work_group_barrier is just a renamed version of barrier, this doesn't make sense.
The kernel in question:
#pragma OPENCL EXTENSION cl_amd_printf : enable
#pragma OPENCL EXTENSION cl_khr_subgroups : enable
//pragmas go here

#define TRUE 1
#define FALSE 0

__kernel void my_dumb_test(
    __global float *in0,
    __global float *in1,
    __global float *out
){
    int global_num = get_global_id(0);
    int local_num = get_local_id(0);
    int local_size = get_local_size(0);
    int global_size = get_global_size(0);
    int group_id = get_group_id(0);
    int group_num = get_num_groups(0);

    local int a;
    int b = 2;

    //a = work_group_broadcast(b, local_num);
    //uint sub_group_size = get_sub_group_size();
    //printf("in0[%d]: %f\n", global_num, in0[global_num]);
    //printf("max sub group size: %d\n", sub_group_size);
    //work_group_barrier(CLK_GLOBAL_MEM_FENCE);
    //barrier(CLK_GLOBAL_MEM_FENCE);
    printf("global id: %d local id: %d group id: %d num groups %d\n", global_num, local_num, group_id, group_num);
}
The funny thing is, the host-side OpenCL 2.0 functions work. clCreateCommandQueueWithProperties returns successfully (in older versions of OpenCL, this function existed as clCreateCommandQueue), and CL_DEVICE_VERSION reports OpenCL 2.0. I'm running an AMD Radeon R9 290X (4 GB GDDR5) on Ubuntu 14.04, with the latest drivers and the AMD-APP-SDK 3.0 beta.
Any help is appreciated.
I have found the solution to my problem.
clBuildProgram defaults to the highest OpenCL C 1.x compiler if the option "-cl-std=CL2.0" is not specified in the options argument of the clBuildProgram API call.
The OpenCL C compiler is for the device-side kernel code and is separate from the host-side compilation. You must specify OpenCL 2.0 manually if you want to use it.
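For example, a minimal host-side sketch (the program and device variables are assumed to come from the existing host code, with the usual CL/cl.h, stdio.h and stdlib.h includes):
// Request the OpenCL 2.0 device compiler explicitly; print the build log on failure.
cl_int err = clBuildProgram(program, 1, &device, "-cl-std=CL2.0", NULL, NULL);
if (err != CL_SUCCESS) {
    size_t log_size = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char *)malloc(log_size);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    printf("%s\n", log);
    free(log);
}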

cudaOccupancyMaxActiveBlocksPerMultiprocessor is undefined

I am trying to learn CUDA and use it efficiently. I found some code on NVIDIA's website which shows how to determine the block size to use for the most efficient use of the device. The code is as follows:
#include <iostream>

// Device code
__global__ void MyKernel(int *d, int *a, int *b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

// Host code
int main()
{
    int numBlocks;       // Occupancy in terms of active blocks
    int blockSize = 32;

    // These variables are used to convert occupancy to warps
    int device;
    cudaDeviceProp prop;
    int activeWarps;
    int maxWarps;

    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks,
        MyKernel,
        blockSize,
        0);

    activeWarps = numBlocks * blockSize / prop.warpSize;
    maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;

    return 0;
}
However, when I compile it, I get the following error.
Compile line:
nvcc ben_deneme2.cu -arch=sm_35 -rdc=true -lcublas -lcublas_device -lcudadevrt -o my
Error:
ben_deneme2.cu(25): error: identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined
1 error detected in the compilation of "/tmp/tmpxft_0000623d_00000000-8_ben_deneme2.cpp1.ii".
Should I include a library for this? I could not find a library name for it on the internet. Or am I doing something else wrong?
Thanks in advance.
The cudaOccupancyMaxActiveBlocksPerMultiprocessor function was introduced in CUDA 6.5. You do not have access to that function if you have an earlier version of CUDA installed; for example, it will not work with CUDA 5.5.
If you want to use that function, you must update your CUDA version to at least 6.5.
People using older versions usually use the CUDA Occupancy Calculator.
One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. -- CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
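As a related sketch, the same occupancy API (also CUDA 6.5 and later) can suggest a launch configuration directly via cudaOccupancyMaxPotentialBlockSize; the problem size N below is an assumption, and d, a, b are the device pointers from the code above.
// Sketch: let the runtime pick a block size that maximizes occupancy for MyKernel.
int minGridSize = 0;   // minimum grid size needed to achieve maximum occupancy
int blockSize = 0;     // suggested block size
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);

int N = 1 << 20;                                  // assumed problem size
int gridSize = (N + blockSize - 1) / blockSize;   // round up to cover all elements
MyKernel<<<gridSize, blockSize>>>(d, a, b);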

CUDA kernel template instantiation causing compilation error

I am trying to define a template CUDA kernel for logical operations on an image. The code looks like this:
#define AND 1
#define OR 2
#define XOR 3
#define SHL 4
#define SHR 5

template<typename T, int opcode>
__device__ inline T operation_lb(T a, T b)
{
    switch(opcode)
    {
    case AND:
        return a & b;
    case OR:
        return a | b;
    case XOR:
        return a ^ b;
    case SHL:
        return a << b;
    case SHR:
        return a >> b;
    default:
        return 0;
    }
}

//Logical Operation With A Constant
template<typename T, int channels, int opcode>
__global__ void kernel_logical_constant(T* src, const T val, T* dst, int width, int height, int pitch)
{
    const int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    const int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    if(xIndex >= width || yIndex >= height) return;

    unsigned int tid = yIndex * pitch + (channels * xIndex);

    #pragma unroll
    for(int i=0; i<channels; i++)
        dst[tid + i] = operation_lb<T,opcode>(src[tid + i],val);
}
The problem is that when I instantiate the kernel for bit shifting, the following compilation error arises:
Error 1 error : Ptx assembly aborted due to errors
The kernel instantiations look like this:
template __global__ void kernel_logical_constant<unsigned char,1,SHL>(unsigned char*,unsigned char,unsigned char*,int,int,int);
There are 19 more instantiations like this for unsigned char and unsigned short, 1 and 3 channels, and all the logical operations. But only the bit-shifting instantiations, i.e. SHL and SHR, cause the error. When I remove these instantiations, the code compiles and works perfectly.
The code also works if I replace the bit shifting with any other operation inside the operation_lb device function.
I was wondering if this had anything to do with the amount of PTX code generated due to so many different instantiations of the kernel.
I am using CUDA 5.5, Visual Studio 2010, Windows 8 x64. Compiling for compute_1x, sm_1x.
Any help would be appreciated.
The original question specified that the poster was using compute_20, sm_20. With that, I was not able to reproduce the error using the code here. However, in the comments it was pointed out that actually sm_10 was being used. When I switch to compiling for sm_10 I am able to reproduce the error.
It appears to be a bug in the compiler. I say this simply because I do not believe that the compiler should generate code that the assembler cannot handle. However beyond that I have no knowledge of the underlying root cause. I have filed a bug report with NVIDIA.
In my limited testing, it seems to only happen with unsigned char, not int.
As a possible workaround, for cc2.0 and newer devices, specify -arch=sm_20 when compiling.
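Given the observation that int does not trigger the problem, one hypothetical (untested) workaround for sm_1x might be to perform the shift in a wider unsigned type and narrow the result back, along these lines:
// Hypothetical workaround sketch (untested): shift in unsigned int, then narrow back to T.
template<typename T>
__device__ inline T shift_left_via_uint(T a, T b)
{
    return (T)((unsigned int)a << (unsigned int)b);
}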

CUDA and C++ problem

Hi, I have a CUDA program which runs successfully.
Here is the code for the CUDA program:
#include <stdio.h>
#include <cuda.h>

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx<N)
        a[idx] = a[idx] * a[idx];
}

int main(void)
{
    float *a_h, *a_d;
    const int N = 10;
    size_t size = N * sizeof(float);

    a_h = (float *)malloc(size);
    cudaMalloc((void **) &a_d, size);

    for (int i=0; i<N; i++) a_h[i] = (float)i;
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
    square_array <<< n_blocks, block_size >>> (a_d, N);

    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // Print results
    for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);

    free(a_h);
    cudaFree(a_d);
}
Now I want to split this code into two files: one for the C++ (or C) code, and one .cu file for the kernel. I just want to do it for learning, and I don't want to write the same kernel code again and again.
Can anyone tell me how to do this?
How do I split the code into two different files?
Then how do I compile it?
How do I write a makefile for it?
Code which uses CUDA C extensions has to be in a *.cu file; the rest can be in a C++ file.
So here your kernel code can be moved to a separate *.cu file.
To have the main function implementation in a C++ file, you need to wrap the invocation of the kernel (the code with square_array<<<...>>>(...);) in a C++ function whose implementation is in the *.cu file as well.
Functions like cudaMalloc etc. can be left in the C++ file as long as you include the proper CUDA headers.
The biggest obstacle you will most likely encounter is how to call your kernel from your .cpp file. C++ will not understand your <<< >>> syntax. There are 3 ways of doing it:
1. Just write a small encapsulating host function in your .cu file.
2. Use CUDA library functions (cudaConfigureCall, cudaFuncGetAttributes, cudaLaunch) --- check the CUDA Reference Manual for details, chapter "Execution Control", online version. You can use those functions in plain C++ code, as long as you include the CUDA libraries.
3. Include PTX at runtime. It is harder, but allows you to manipulate PTX code at runtime. This JIT approach is explained in the CUDA Programming Guide (chapter 3.3.2) and in the CUDA Reference Manual (Module Management chapter), online version.
The encapsulating function could look like this, for example:
mystuff.cu:
... // your device square_array kernel

void host_square_array(dim3 grid, dim3 block, float *deviceA, int N) {
    square_array <<< grid, block >>> (deviceA, N);
}

mystuff.h:
#include <cuda.h>

void host_square_array(dim3 grid, dim3 block, float *deviceA, int N);

mymain.cpp:
#include "mystuff.h"

int main() {
    ... // your normal host code
}
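One possible way to build the two files is sketched below; the host compiler, flags, and the CUDA library path are assumptions that depend on your system. A makefile would simply encode these three steps as rules for the two object files and the final link.
nvcc -c mystuff.cu -o mystuff.o
g++ -c mymain.cpp -o mymain.o
g++ mymain.o mystuff.o -o myapp -L/usr/local/cuda/lib64 -lcudart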