OpenCL nested loop misalignment - c++

I'm trying to use the GPU for some image processing. In my kernel function I caught a "misalignment" exception:
The thread tried to read or write data that is misaligned on hardware that does not provide alignment. For example, 16-bit values must be aligned on 2-byte boundaries; 32-bit values on 4-byte boundaries, and so on.
I reduced the kernel code to loops only, but I still get this problem. My reduced kernel function:
__kernel void TestKernel(
    global const uchar* iImage,
    global uchar* oImage,
    uint width,
    uint heigth,
    uchar dif,
    float power)
{
    uint y = get_global_id(0);
    if (y >= heigth)
        return;

    for (uint x = 0; x < width; ++x) {
        for (uint i = 0; i < 5; ++i) {
            uint sum = 0;
            for (uint j = 0; j < 5; ++j) {
                sum += 3;
            }
        }
    }
}
(The program throws the exception in the second loop.)
I'm using the C++ wrapper to call my kernel:
kernel.setArg(iArg++, iImage);
kernel.setArg(iArg++, oImage);
kernel.setArg(iArg++, header.GetVal(header.Width));
kernel.setArg(iArg++, header.GetVal(header.Height));
kernel.setArg(iArg++, (unsigned char)10);
kernel.setArg(iArg++, saturation);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(header.GetVal(header.Height)), cl::NDRange(128));
oImage and iImage are cl::Buffer
saturation is float
header.GetVal() returns int
I'm using Visual Studio 2015 with the CodeXL plugin and run the program on an AMD Spectre (Radeon R7).
What can cause this problem?

Related

Using ulong instead of uint in OpenCL for device array indexing

I am programming a project for tomographic reconstruction in OpenCL. Until now, all my device structures had a length less than MAXUINT32. Now, for some big datasets, this is too restrictive and I would need the ability to index by UINT64, represented by the ulong type in OpenCL. Some of the kernels need the array size as an argument, and apparently it is forbidden to use size_t in kernel arguments, especially on NVIDIA platforms.
I have two use cases: code computing partial sums by two methods. The first does not have to use ulong in its kernel argument, since the block of memory partialFrameSize on which each instance works does not exceed MAXUINT32.
void kernel FLOATvector_SumPartial(global const float* restrict x,
                                   global float* restrict sumPartial,
                                   private uint partialFrameSize)
{
    uint gid = get_global_id(0);
    uint start = gid * partialFrameSize;
    uint end = start + partialFrameSize;
    float sum = 0.0f;
    float val;
    for (uint i = start; i < end; i++)
    {
        val = x[i];
        sum += val;
    }
    sumPartial[gid] = sum;
}
The second does the same thing using a fancier implementation and barrier calls. Because of the memory alignment, it needs the parameter private uint vecLength, which would have to be changed to private ulong vecLength.
void kernel FLOATvector_SumPartial_barrier(global const float* restrict x,
                                           global float* restrict partialSum,
                                           local float* loc,
                                           private uint vecLength)
{
    uint gid = get_global_id(0);
    uint gs = get_global_size(0);
    uint lid = get_local_id(0);
    uint ls = get_local_size(0);
    float val;
    if (gid < vecLength)
    {
        val = x[gid];
    } else
    {
        val = 0.0;
    }
    loc[lid] = val;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint stride = ls / 2; stride > 1; stride >>= 1) // Does the same as /=2
    {
        if (lid < stride)
        {
            loc[lid] += loc[lid + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
    {
        gid = get_group_id(0);
        partialSum[gid] = loc[0] + loc[1];
    }
}
I have the following questions:
How big will the overhead be, e.g. on the NVIDIA V100 architecture, if I simply replace all uint with ulong?
Will using size_t instead of uint in the first kernel come without any overhead?
How can this be solved in CUDA? Shall I switch?
If you want to use 64-bit indexing, you can use the unsigned long long type. This is a 64-bit type on any platform, and it is not implementation-defined, as far as the acceptable platforms for using OpenCL or CUDA on an NVIDIA GPU go.
How big will the overhead be, e.g. on the NVIDIA V100 architecture, if I simply replace all uint with ulong?
It should be simple enough just to test that.
Will using size_t instead of uint in the first kernel come without any overhead?
size_t, on a 64-bit platform (e.g. 64-bit OS), would have the same overhead as switching to 64-bit indexing using unsigned long long.
How can this be solved in CUDA? Shall I switch?
CUDA shouldn't be meaningfully different in this respect. It has no restrictions on the use of size_t for kernel arguments, and all current CUDA development is on 64-bit platforms, which means size_t is a 64-bit unsigned integer type, just like unsigned long long. If we compare OpenCL using unsigned long long with CUDA using unsigned long long, there should be no meaningful difference. And there would be no difference in CUDA between size_t and unsigned long long (again, for typical current development on a 64-bit platform).
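For concreteness, here is what the first kernel might look like with 64-bit indexing (a sketch of mine, not code from the question; it assumes the host passes the argument as cl_ulong, i.e. unsigned long long):
// Sketch only: the first kernel reworked to index with the 64-bit ulong type.
// The host side would call kernel.setArg() with a cl_ulong value.
void kernel FLOATvector_SumPartial_64(global const float* restrict x,
                                      global float* restrict sumPartial,
                                      private ulong partialFrameSize)
{
    ulong gid = get_global_id(0);
    ulong start = gid * partialFrameSize;
    ulong end = start + partialFrameSize;
    float sum = 0.0f;
    for (ulong i = start; i < end; i++)
    {
        sum += x[i];
    }
    sumPartial[gid] = sum;
}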

access violation _mm_store_si128 SSE Intrinsics

I want to create a histogram of vertical gradients in an 8 bit gray image.
The vertical distance to calculate the gradient can be specified.
I already managed to speed up another part of my code using intrinsics, but it does not work here.
The code runs without exception if the _mm_store_si128 is commented out.
When it is not commented out, I get an access violation.
What is going wrong here?
#define _mm_absdiff_epu8(a,b) _mm_adds_epu8(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a)) //from opencv

void CreateAbsDiffHistogramUnmanaged(void* source, unsigned int sourcestride, unsigned int height, unsigned int verticalDistance, unsigned int histogram[])
{
    unsigned int xcount = sourcestride / 16;
    __m128i absdiffData;
    unsigned char* bytes = (unsigned char*) _aligned_malloc(16, 16);
    __m128i* absdiffresult = (__m128i*) bytes;
    __m128i* sourceM = (__m128i*) source;
    __m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride;
    for (unsigned int y = 0; y < (height - verticalDistance); y++)
    {
        for (unsigned int x = 0; x < xcount; x++, ++sourceM, ++sourceVOffset)
        {
            absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
            _mm_store_si128(absdiffresult, absdiffData);
            //unroll loop
            histogram[bytes[0]]++;
            histogram[bytes[1]]++;
            histogram[bytes[2]]++;
            histogram[bytes[3]]++;
            histogram[bytes[4]]++;
            histogram[bytes[5]]++;
            histogram[bytes[6]]++;
            histogram[bytes[7]]++;
            histogram[bytes[8]]++;
            histogram[bytes[9]]++;
            histogram[bytes[10]]++;
            histogram[bytes[11]]++;
            histogram[bytes[12]]++;
            histogram[bytes[13]]++;
            histogram[bytes[14]]++;
            histogram[bytes[15]]++;
        }
    }
    _aligned_free(bytes);
}
Your function crashes during loading because the input data is not aligned properly. To solve this problem, you have to change your code from:
absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
to:
absdiffData = _mm_absdiff_epu8(_mm_loadu_si128(sourceM), _mm_loadu_si128(sourceVOffset));
Here I use unaligned loading.
P.S. I have implemented a similar function (SimdAbsSecondDerivativeHistogram) in the Simd Library. It has SSE2, AVX2, NEON and Altivec implementations. I hope that it will help you.
P.P.S. I would also strongly recommend checking this line:
__m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride;
It may result in a crash (access to memory outside the bounds of the input array). Maybe you had this in mind:
__m128i* sourceVOffset = (__m128i*)((char*)source + verticalDistance * sourcestride);
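Putting the two corrections together, the relevant part of the original function would look roughly like this (a sketch using the asker's variable names; only the offset computation and the loads change):
__m128i* sourceM       = (__m128i*) source;
__m128i* sourceVOffset = (__m128i*)((char*)source + verticalDistance * sourcestride);
for (unsigned int y = 0; y < (height - verticalDistance); y++)
{
    for (unsigned int x = 0; x < xcount; x++, ++sourceM, ++sourceVOffset)
    {
        // unaligned loads, so the source rows need not be 16-byte aligned
        absdiffData = _mm_absdiff_epu8(_mm_loadu_si128(sourceM), _mm_loadu_si128(sourceVOffset));
        _mm_store_si128(absdiffresult, absdiffData); // absdiffresult comes from _aligned_malloc, so the aligned store is fine
        // ... histogram update unrolled as before ...
    }
}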

CUDA - Optimize mean of matrix rows calculation using shared memory

I am trying to optimize the computation of the mean of each row in my 512w x 1024h image, and then subtract that mean from the row from which it was computed. I wrote a piece of code which does it in 1.86 ms, but I want it to be faster. This piece of code works fine, but it does not use shared memory and it uses for loops; I want to do away with them.
__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {
    // height = 1024, width = 512
    int tidy = threadIdx.x + blockDim.x * blockIdx.x;
    float sum = 0.0f;
    float sumDiv = 0.0f;
    if (tidy < height) {
        for (int c = 0; c < width; c++) {
            sum += img[tidy*width + c];
        }
        sumDiv = (sum/width)/2;
        //__syncthreads();
        for (int cc = 0; cc < width; cc++) {
            lineImg[tidy*width + cc] = img[tidy*width + cc] - sumDiv;
        }
    }
    __syncthreads();
}
I called the above kernel using:
subtractMean <<< 2, 512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);
However, the following code I wrote uses shared memory to optimize, but it does not work as expected. Any thoughts on what the problem might be?
__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {
    extern __shared__ float perRow[];
    int idx = threadIdx.x;   // set idx along x
    int stride = width/2;
    while (idx < width) {
        perRow[idx] = 0;
        idx += stride;
    }
    __syncthreads();
    int tidx = threadIdx.x;  // set idx along x
    int tidy = blockIdx.x;   // set idx along y
    if (tidy < height) {
        while (tidx < width) {
            perRow[tidx] = img[tidy*width + tidx];
            tidx += stride;
        }
    }
    __syncthreads();
    tidx = threadIdx.x;      // reset idx along x
    tidy = blockIdx.x;       // reset idx along y
    if (tidy < height) {
        float sumAllPixelsInRow = 0.0f;
        float sumDiv = 0.0f;
        while (tidx < width) {
            sumAllPixelsInRow += perRow[tidx];
            tidx += stride;
        }
        sumDiv = (sumAllPixelsInRow/width)/2;
        tidx = threadIdx.x;  // reset idx along x
        while (tidx < width) {
            lineImg[tidy*width + tidx] = img[tidy*width + tidx] - sumDiv;
            tidx += stride;
        }
    }
    __syncthreads();
}
The shared memory function was called using:
subtractMean <<< 1024, 256, sizeof(float)*512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);
Two blocks are hardly enough to saturate the GPU. You are heading in the right direction by using more blocks; however, you are using Kepler, and I would like to present an option that does not use shared memory at all.
Start with 32 threads in a block (this can be changed later using 2D blocks)
With those 32 threads you should do something along the lines of this:
int rowID = blockIdx.x;
int tid = threadIdx.x;
int stride = blockDim.x;
int index = threadIdx.x;
float sum = 0.0;
while (index < width) {
    sum += img[width*rowID + index];
    index += blockDim.x;
}
At this point you will have 32 threads, each holding a partial sum. You next need to add them all together. You can do this without using shared memory (since we are within a warp) by utilizing a shuffle reduction. For details, look here: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ What you want is the warp shuffle reduce, but you need to change it to use the full 32 threads.
Now that thread 0 in each warp has the sum of its row, you can divide it by the width cast to a float, and broadcast it to the rest of the warp using shfl(average, 0);. See http://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-description
With the average found and the warp synchronized implicitly and explicitly (with shfl), you can continue in a similar manner with the subtraction.
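A minimal sketch of the whole approach described above (my code, not the answer's; it uses the newer __shfl_*_sync intrinsics, whereas the non-_sync variants from the linked post behave the same way on Kepler):
__global__ void subtractMeanWarp(const float *__restrict__ img, float *lineImg,
                                 int height, int width)
{
    int rowID = blockIdx.x;    // one 32-thread block (one warp) per row
    int tid   = threadIdx.x;   // 0..31
    if (rowID >= height) return;

    // Each thread accumulates a strided partial sum over its row.
    float sum = 0.0f;
    for (int c = tid; c < width; c += 32)
        sum += img[rowID * width + c];

    // Warp shuffle reduction: afterwards lane 0 holds the full row sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    // Compute the (halved) mean as in the original kernel and broadcast it from lane 0.
    float sumDiv = (sum / width) / 2;
    sumDiv = __shfl_sync(0xffffffff, sumDiv, 0);

    // Subtract the broadcast value from every pixel of the row.
    for (int c = tid; c < width; c += 32)
        lineImg[rowID * width + c] = img[rowID * width + c] - sumDiv;
}

// Launched with one warp per row, e.g.:
// subtractMeanWarp <<< 1024, 32 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);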
Possible further optimizations would be to include more than one warp in a block to improve occupancy, and to manually unroll the loops over the width to improve instruction level parallelism.
Good Luck.

iOS - C/C++ - Speed up Integral Image calculation

I have a method which calculates an integral image (description here) commonly used in computer vision applications.
float *Integral(unsigned char *grayscaleSource, int height, int width, int widthStep)
{
    // convert the image to single channel 32f
    unsigned char *img = grayscaleSource;

    // set up variables for data access
    int step = widthStep / sizeof(float);
    uint8_t *data = (uint8_t *)img;
    float *i_data = (float *)malloc(height * width * sizeof(float));

    // first row only
    float rs = 0.0f;
    for (int j = 0; j < width; j++)
    {
        rs += (float)data[j];
        i_data[j] = rs;
    }

    // remaining cells are sum above and to the left
    for (int i = 1; i < height; ++i)
    {
        rs = 0.0f;
        for (int j = 0; j < width; ++j)
        {
            rs += data[i*step + j];
            i_data[i*step + j] = rs + i_data[(i-1)*step + j];
        }
    }

    // return the integral image
    return i_data;
}
I am trying to make it as fast as possible. It seems to me like this should be able to take advantage of Apple's Accelerate.framework, or perhaps ARM's NEON intrinsics, but I can't see exactly how. It seems like that nested loop is potentially quite slow (for real-time applications at least).
Does anyone think it is possible to speed this up using any other techniques?
You can certainly vectorize the row by row summation. That is vDSP_vadd(). The horizontal direction is vDSP_vrsum().
If you want to write your own vector code, the horizontal sum might be sped up by something like psadbw, but that is Intel. Also, take a look at prefix sum algorithms, which are famously parallelizable.
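To make that split concrete, here is a sketch of the same integral image restructured into two passes per row (my restructuring, not the answerer's code); each inner loop is then a plain streaming operation that maps onto the vDSP calls mentioned above, or onto NEON intrinsics:
#include <stdlib.h>

// Sketch: same integral image, restructured into two vectorizable passes per row.
float *IntegralTwoPass(const unsigned char *gray, int height, int width)
{
    float *i_data = (float *)malloc(height * width * sizeof(float));
    for (int i = 0; i < height; ++i)
    {
        // pass 1: horizontal running (prefix) sum of the current row
        float rs = 0.0f;
        for (int j = 0; j < width; ++j)
        {
            rs += (float)gray[i*width + j];
            i_data[i*width + j] = rs;
        }
        // pass 2: add the integral row above, a simple element-wise vector add
        if (i > 0)
            for (int j = 0; j < width; ++j)
                i_data[i*width + j] += i_data[(i-1)*width + j];
    }
    return i_data;
}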

Sometimes I get EXEC_BAD_ACCESS (Access violation) when reversing an array

I am loading an image using the OpenEXR library.
This works fine, except the image is loaded rotated 180 degrees. I use the loop shown below to reverse the array, but sometimes the program will quit and Xcode will give me an EXEC_BAD_ACCESS error (which I assume is the same as an access violation in MSVC). It does not happen every time, just once every 5-10 times.
Ideally I'd want to reverse the array in place, although that led to errors every time, and using memcpy would fail without causing an error, just producing a blank image. I'd like to know what's causing this problem first.
Here is the code I am using (Rgba is a struct of four Halfs: r, g, b, and a, defined in OpenEXR):
Rgba* readRgba(const char filename[], int& width, int& height){
    Rgba* pixelBuffer = new Rgba[width * height];
    Rgba* temp = new Rgba[width * height];

    // ....EXR Loading code....

    // TODO: *Sometimes* the following code results in a bad memory access error. No idea why.
    // Flip the image to conform with OpenGL coordinates.
    for (int i = 0; i < height; i++){
        for (int j = 0; j < width; j++){
            temp[(i*width)+j] = pixelBuffer[(width*height)-(i*width)+j];
        }
    }
    delete pixelBuffer;
    return temp;
}
Thanks in advance!
Change:
temp[(i*width)+j] = pixelBuffer[(width*height)-(i*width)+j];
to:
temp[(i*width)+j] = pixelBuffer[(width*height)-(i*width)+j - 1];
(Hint: think about what happens when i = 0 and j = 0: the index becomes width*height, which is one element past the end of the buffer!)
And here's how you can optimize this code, to save memory and cycles:
Rgba* readRgba(const char filename[], int& width, int& height)
{
    Rgba* pixelBuffer = new Rgba[width * height];
    Rgba tempPixel;

    // ....EXR Loading code....

    // Flip the image to conform with OpenGL coordinates.
    for (int i = 0; i <= height/2; i++)
        for (int j = 0; j < width && (i*width + j) <= (height*width/2); j++)
        {
            tempPixel = pixelBuffer[i*width + j];
            pixelBuffer[i*width + j] = pixelBuffer[height*width - (i*width + j) - 1];
            pixelBuffer[height*width - (i*width + j) - 1] = tempPixel;
        }

    return pixelBuffer;
}
Note that the optimal approach (from a memory-usage best-practices point of view) would be to pass pixelBuffer as a parameter, already allocated. It's good practice to allocate and release memory in the same piece of code.
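Following that note, a sketch of the caller-allocated variant could look like this (illustrative only; it assumes the caller already knows the image dimensions, e.g. from reading the EXR header separately):
// The caller owns the buffer; the function only fills it in.
void readRgba(const char filename[], int width, int height, Rgba* pixelBuffer)
{
    // ....EXR Loading code....
    // Flip in place as shown above.
}

// Hypothetical usage ("image.exr" is a made-up name):
// Rgba* pixels = new Rgba[width * height];
// readRgba("image.exr", width, height, pixels);
// ...
// delete[] pixels;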