CUDA kernel template instantiation causing compilation error

I am trying to define a template CUDA kernel for logical operations on an image. The code looks like this:
#define AND 1
#define OR 2
#define XOR 3
#define SHL 4
#define SHR 5
template<typename T, int opcode>
__device__ inline T operation_lb(T a, T b)
{
    switch (opcode)
    {
    case AND:
        return a & b;
    case OR:
        return a | b;
    case XOR:
        return a ^ b;
    case SHL:
        return a << b;
    case SHR:
        return a >> b;
    default:
        return 0;
    }
}
// Logical operation with a constant
template<typename T, int channels, int opcode>
__global__ void kernel_logical_constant(T* src, const T val, T* dst, int width, int height, int pitch)
{
    const int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    const int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    if (xIndex >= width || yIndex >= height) return;
    unsigned int tid = yIndex * pitch + (channels * xIndex);
#pragma unroll
    for (int i = 0; i < channels; i++)
        dst[tid + i] = operation_lb<T, opcode>(src[tid + i], val);
}
The problem is that when I instantiate the kernel for bit shifting, the following compilation error arises:
Error 1 error : Ptx assembly aborted due to errors
The kernel instantiations look like this:
template __global__ void kernel_logical_constant<unsigned char,1,SHL>(unsigned char*,unsigned char,unsigned char*,int,int,int);
There are 19 more instantiations like this for unsigned char and unsigned short, 1 and 3 channels, and all the logical operations. But only the bit-shifting instantiations, i.e. SHL and SHR, cause the error. When I remove these instantiations, the code compiles and works perfectly.
The code also works if I replace the bit shifting with any other operation inside the operation_lb device function.
I was wondering if this had anything to do with the amount of PTX code generated due to so many different instantiations of the kernel.
I am using CUDA 5.5, Visual Studio 2010, Windows 8 x64. Compiling for compute_1x, sm_1x.
Any help would be appreciated.

The original question specified that the poster was using compute_20, sm_20. With that, I was not able to reproduce the error using the code here. However, in the comments it was pointed out that actually sm_10 was being used. When I switch to compiling for sm_10 I am able to reproduce the error.
It appears to be a bug in the compiler. I say this simply because I do not believe that the compiler should generate code that the assembler cannot handle. However beyond that I have no knowledge of the underlying root cause. I have filed a bug report with NVIDIA.
In my limited testing, it seems to happen only with unsigned char, not int.
As a possible workaround, for cc2.0 and newer devices, specify -arch=sm_20 when compiling.
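For example (the file name here is just illustrative), the target is selected on the nvcc command line:
nvcc -arch=sm_20 -o logical_ops logical_ops.cu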

cudaTextureObject_t texFetch1D doesn't compile

This code doesn't compile with the CUDA 7.5 toolkit on a GTX 980 (compute capability set to 5.2) in Visual Studio 2013:
__global__ void a_kernel(cudaTextureObject_t texObj)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int something = tex1Dfetch(texObj, thread_id);
}
Here is the error:
error : more than one instance of overloaded function "tex1Dfetch" matches the argument list:
This code also doesn't compile:
__global__ void another_kernel(cudaTextureObject_t texObj)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    float something = tex1Dfetch<float>(texObj, thread_id);
}
Here is that error:
error : type name is not allowed
Following this example and the comments, all of the above should work:
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility/
Please let me know if you need additional info; I couldn't think of what else to provide.
Your first kernel doesn't compile because of a missing template type argument. This will compile:
__global__ void a_kernel(cudaTextureObject_t texObj)
{
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    int something = tex1Dfetch<int>(texObj, thread_id);
}
Your second kernel is correct, and it does compile for me using VS2012 with the CUDA 7.0 toolkit for every compute capability I tried (sm_30 through sm_52).
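For reference, a minimal host-side sketch of creating the texture object passed to such a kernel might look like this (d_buffer, n, grid, and block are placeholder names; error checking omitted):
// Bind an existing linear device buffer of n ints to a texture object.
cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = d_buffer;
resDesc.res.linear.desc = cudaCreateChannelDesc<int>();
resDesc.res.linear.sizeInBytes = n * sizeof(int);
cudaTextureDesc texDesc = {};
texDesc.readMode = cudaReadModeElementType;
cudaTextureObject_t texObj = 0;
cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);
a_kernel<<<grid, block>>>(texObj);
cudaDestroyTextureObject(texObj);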
I reinstalled the CUDA toolkit and now the second piece of code (another_kernel) compiles. The first piece of code was incorrect in the first place, as per the first answer. As for the reinstall: I must have previously clobbered something in the SDK; I believe it was texture_indirect_functions.h.

C++AMP exception in simple image processing example

I'm trying to teach myself C++AMP, and would like to start with a very simple task from my field, that is image processing. I'd like to convert a 24 Bit-per-pixel RGB image (a Bitmap) to a 8 Bit-per-Pixel grayscale one. The image data is available in unsigned char arrays (obtained from Bitmap::LockBits(...) etc.)
I know that C++ AMP for some reason cannot deal with char or unsigned char data via array or array_view, so I tried to use textures according to that blog, where it is explained how 8bpp textures are written to, although Visual Studio 2013 tells me writeonly_texture_view is deprecated.
My code throws a runtime exception, saying "Failed to dispatch kernel." The complete text of the exception is lengthy:
ID3D11DeviceContext::Dispatch: The Unordered Access View (UAV) in slot 0 of the Compute Shader unit has the Format (R8_UINT). This format does not support being read from a shader as a UAV. This mismatch is invalid if the shader actually uses the view (e.g. it is not skipped due to shader code branching). It was unfortunately not possible to have all hardware implementations support reading this format as a UAV, despite that the format can be written to as a UAV. If the shader only needs to perform reads but not writes to this resource, consider using a Shader Resource View instead of a UAV.
The code I use so far is this:
namespace gpu = concurrency;
gpu::extent<3> inputExtent(height, width, 3);
gpu::graphics::texture<unsigned int, 3> inputTexture(inputExtent, eight);
gpu::graphics::copy((void*)inputData24bpp, dataLength, inputTexture);
gpu::graphics::texture_view<unsigned int, 3> inputTexView(inputTexture);
gpu::graphics::texture<unsigned int, 2> outputTexture(width, height, eight);
gpu::graphics::writeonly_texture_view<unsigned int, 2> outputTexView(outputTexture);
gpu::parallel_for_each(outputTexture.extent,
    [inputTexView, outputTexView](gpu::index<2> pix) restrict(amp) {
        gpu::index<3> indR(pix[0], pix[1], 0);
        gpu::index<3> indG(pix[0], pix[1], 1);
        gpu::index<3> indB(pix[0], pix[1], 2);
        unsigned int sum = inputTexView[indR] + inputTexView[indG] + inputTexView[indB];
        outputTexView.set(pix, sum / 3);
    });
gpu::graphics::copy(outputTexture, outputData8bpp);
What's the reason for this exception, and what can I do for a workaround?
I've also been learning C++ AMP on my own and faced a very similar problem to yours, but in my case I needed to deal with a 16-bit image.
Likely the issue can be solved using textures, although I can't help you with that due to a lack of experience.
So what I did is basically based on bit masking.
First off, trick the compiler in order to let you compile:
unsigned int* sourceData = reinterpret_cast<unsigned int*>(source);
unsigned int* destData = reinterpret_cast<unsigned int*>(dest);
Next, your array_view has to see all of your data. Be aware that the view really thinks your data is 32 bits wide, so you have to convert the size (divide by 2 for 16 bits; use 4 for 8 bits).
concurrency::array_view<const unsigned int> source((size + 7) / 2, sourceData);
concurrency::array_view<unsigned int> dest((size + 7) / 2, destData); // note: destData, not sourceData
Now you are able to write a typical parallel_for_each block.
typedef concurrency::array_view<const unsigned int> OriginalImage;
typedef concurrency::array_view<unsigned int> ResultImage;
bool Filters::Filter_Invert()
{
    const int size = k_width * k_height;
    const int maxVal = GetMaxSize();
    OriginalImage& im_original = GetOriginal();
    ResultImage& im_result = GetResult();
    im_result.discard_data();
    parallel_for_each(
        concurrency::extent<2>(k_width, k_height),
        [=](concurrency::index<2> idx) restrict(amp)
        {
            const int pos = GetPos(idx);
            const int val = read_int16(im_original, pos);
            write_int16(im_result, pos, maxVal - val);
        });
    return true;
}
int Filters::GetPos(const concurrency::index<2>& idx) restrict(amp, cpu)
{
    return idx[0] * Filters::k_height + idx[1];
}
And here comes the magic:
template <typename T>
unsigned int read_int16(T& arr, int idx) restrict(amp, cpu)
{
    // idx >> 1 selects the 32-bit word; (idx & 0x1) << 4 selects the 16-bit half.
    return (arr[idx >> 1] & (0xFFFF << ((idx & 0x1) << 4))) >> ((idx & 0x1) << 4);
}
template<typename T>
void write_int16(T& arr, int idx, unsigned int val) restrict(amp, cpu)
{
    // The first XOR clears the old 16-bit half; the second XOR writes the new value.
    atomic_fetch_xor(&arr[idx >> 1], arr[idx >> 1] & (0xFFFF << ((idx & 0x1) << 4)));
    atomic_fetch_xor(&arr[idx >> 1], (val & 0xFFFF) << ((idx & 0x1) << 4));
}
Note that these methods are for 16 bits; they won't work for 8 bits as-is, but it shouldn't be too difficult to adapt them. In fact, this was based on an 8-bit version; unfortunately, I couldn't find the reference.
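For illustration, an untested 8-bit adaptation along the same lines might look like this (read_int8 and write_int8 are hypothetical names; four bytes fit in each 32-bit word, so the word index is idx >> 2 and the bit shift is (idx & 0x3) << 3):
template <typename T>
unsigned int read_int8(T& arr, int idx) restrict(amp, cpu)
{
    // idx >> 2 selects the 32-bit word; (idx & 0x3) << 3 selects the byte within it.
    return (arr[idx >> 2] >> ((idx & 0x3) << 3)) & 0xFF;
}
template<typename T>
void write_int8(T& arr, int idx, unsigned int val) restrict(amp, cpu)
{
    // Clear the old byte with one XOR, then XOR in the new value.
    atomic_fetch_xor(&arr[idx >> 2], arr[idx >> 2] & (0xFF << ((idx & 0x3) << 3)));
    atomic_fetch_xor(&arr[idx >> 2], (val & 0xFF) << ((idx & 0x3) << 3));
}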
Hope it helps.
David

cudaErrorLaunchFailure while trying to run simple templated kernel on a 64bit data type

I have this simple kernel code:
template<typename T> __global__ void CalcHamming(const T* pData, const uint64_t u64Count, const T Arg, uint32_t* pu32Results)
{
    uint64_t gidx = blockDim.x * blockIdx.x + threadIdx.x;
    while (gidx < u64Count)
    {
        pu32Results[gidx] += __popc(pData[gidx] ^ Arg);
        gidx += blockDim.x * gridDim.x;
    }
}
It works correctly unless I use it on a 64-bit unsigned int (uint64_t). In that case I get cudaErrorLaunchFailure. I figured that maybe the problem is in __popc(), which cannot handle 64-bit numbers, so I made a specialization to solve this:
template<> __global__ void CalcHamming<uint64_t>(const uint64_t* pData, const uint64_t u64Count, const uint64_t Arg, uint32_t* pu32Results)
{
    uint64_t gidx = blockDim.x * blockIdx.x + threadIdx.x;
    while (gidx < u64Count)
    {
        pu32Results[gidx] += __popcll(pData[gidx] ^ Arg);
        gidx += blockDim.x * gridDim.x;
    }
}
However the problem still remains. One thing to note is that my data are not in several arrays, like this:
Array1 (uint32_t): 100 items
Array2 (uint64_t): 200 items
But instead concatenated in one memory block:
Array: 100 items (uint32_t), 200 items (uint64_t)
And I am doing some pointer arithmetic to launch the kernel on the correct spot. I'm quite sure the calculations are correct. Also note that the above example is a simplified case; I have many more 'subarrays' of various integer types concatenated like this.
My guess is that this might be behind the issue, that CUDA somehow dislikes the alignment of the uint64_t array. However, fixing this requires quite a lot of effort, and I would like to be sure it will help before I do it. Or can I fix this just by modifying the kernel somehow? Will there be performance penalties?
uint64_t must be 8-byte aligned: see HERE.
So yes, CUDA "dislikes" misaligned types: it does not run at all with them.
However, I think you can avoid rearranging your data structure externally. It's enough to check the extremes of the array and treat them as uint32_t (or uint8_t for total generality!). That's quite common in optimized kernels, especially those using vector types such as float4, int4, ...
For some alignment tips see HERE.
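As a sketch of the idea (the names base and offsetBytes are illustrative), you can round the start of the uint64_t subarray up to the next 8-byte boundary and hand the unaligned head to a narrower kernel:
// base points at the start of the concatenated block; offsetBytes is where
// the uint64_t subarray begins within it.
uintptr_t addr    = reinterpret_cast<uintptr_t>(base) + offsetBytes;
uintptr_t aligned = (addr + 7) & ~static_cast<uintptr_t>(7);
size_t headBytes  = aligned - addr;  // process these leading bytes as uint8_t/uint32_t
const uint64_t* p64 = reinterpret_cast<const uint64_t*>(aligned);  // safe for the 64-bit kernel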

Function crashes when using _mm_load_pd

I have the following function:
template <typename T>
void SSE_vectormult(T* A, T* B, int size)
{
    __m128d a;
    __m128d b;
    __m128d c;
    double A2[2], B2[2], C[2];
    const double *A2ptr, *B2ptr;
    A2ptr = &A2[0];
    B2ptr = &B2[0];
    a = _mm_load_pd(A);
    for (int i = 0; i < size; i += 2)
    {
        std::cout << "In SSE_vectormult: i is: " << i << '\n';
        A2[0] = A[i];
        B2[0] = B[i];
        A2[1] = A[i+1];
        B2[1] = B[i+1];
        std::cout << "Values from A and B written to A2 and B2\n";
        a = _mm_load_pd(A2ptr);
        b = _mm_load_pd(B2ptr);
        std::cout << "Values converted to a and b\n";
        c = _mm_mul_pd(a, b);
        _mm_store_pd(C, c);
        A[i] = C[0];
        A[i+1] = C[1];
    }
    // const int mask = 0xf1;
    // __m128d res = _mm_dp_pd(a,b,mask);
    // r1 = _mm_mul_pd(a, b);
    // r2 = _mm_hadd_pd(r1, r1);
    // c = _mm_hadd_pd(r2, r2);
    // c = _mm_scale_pd(a, b);
    // _mm_store_pd(A, c);
}
When I call it on Linux, everything is fine, but when I call it on Windows, my program crashes with "program is not working anymore". What am I doing wrong, and how can I determine my error?
Your data is not guaranteed to be 16-byte aligned, as required by SSE loads. Either use _mm_loadu_pd:
a = _mm_loadu_pd(A);
...
a = _mm_loadu_pd(A2ptr);
b = _mm_loadu_pd(B2ptr);
or make sure that your data is correctly aligned where possible, e.g. for static or locals:
alignas(16) double A2[2], B2[2], C[2]; // C++11, or C11 with <stdalign.h>
or without C++11, using compiler-specific language extensions:
__attribute__ ((aligned(16))) double A2[2], B2[2], C[2]; // gcc/clang/ICC/et al
__declspec (align(16)) double A2[2], B2[2], C[2]; // MSVC
You could use #ifdef to #define an ALIGN(x) macro that works on the target compiler.
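For example, a minimal ALIGN(x) macro might look like this (a sketch; extend the #ifdef chain for any other compilers you target):
#if defined(_MSC_VER)
#define ALIGN(x) __declspec(align(x))
#else
#define ALIGN(x) __attribute__((aligned(x)))
#endif
ALIGN(16) double A2[2], B2[2], C[2]; // 16-byte aligned on MSVC and gcc/clang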
Let me try to answer why your code works on Linux and not Windows. Code compiled in 64-bit mode has the stack aligned to 16 bytes. However, code compiled in 32-bit mode is only 4-byte aligned on Windows and is not guaranteed to be 16-byte aligned on Linux.
GCC defaults to 64-bit mode on 64-bit systems, whereas MSVC defaults to 32-bit mode even on 64-bit systems. So I'm going to guess that you did not compile your code in 64-bit mode on Windows; _mm_load_pd and _mm_store_pd both need 16-byte aligned addresses, so the code crashes.
You have at least three different solutions to get your code working on Windows as well:
1. Compile your code in 64-bit mode.
2. Use unaligned loads and stores (e.g. _mm_storeu_pd).
3. Align the data yourself, as Paul R suggested.
The best solution is the third one, since then your code will work on 32-bit systems and on older systems where unaligned loads/stores are much slower.
If you look at http://msdn.microsoft.com/en-us/library/cww3b12t(v=vs.90).aspx you can see that the function _mm_load_pd is declared as:
__m128d _mm_load_pd (double *p);
So, in your code A should be of type double, but A is of type T, which is a template parameter. You should make sure that you are calling your SSE_vectormult function with the right template parameters, or just remove the template and use the double type instead.
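For instance, a correct invocation under those constraints might look like this (illustrative values):
alignas(16) double A[4] = {1.0, 2.0, 3.0, 4.0};
alignas(16) double B[4] = {5.0, 6.0, 7.0, 8.0};
SSE_vectormult<double>(A, B, 4); // A becomes {5, 12, 21, 32}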