I am programming a project for tomographic reconstruction in OpenCL. Until now, all my device structures had a length less than MAXUINT32. For some big datasets this is now too restrictive, and I need the ability to index by UINT64, represented by the ulong type in OpenCL. Some of the kernels need to take the array size as an argument, and apparently it is forbidden to use size_t in kernel arguments, especially on NVIDIA platforms.
I have two use cases: code computing partial sums by two methods. The first does not have to use ulong as a kernel argument, since the block of memory of size partialFrameSize that each work item processes does not exceed MAXUINT32.
void kernel FLOATvector_SumPartial(global const float* restrict x,
                                   global float* restrict sumPartial,
                                   private uint partialFrameSize)
{
    uint gid = get_global_id(0);
    uint start = gid * partialFrameSize;
    uint end = start + partialFrameSize;
    float sum = 0.0f;
    float val;
    for(uint i = start; i < end; i++)
    {
        val = x[i];
        sum += val;
    }
    sumPartial[gid] = sum;
}
The second does the same thing with a fancier implementation that uses barrier calls. Because of memory alignment, it needs the parameter private uint vecLength, which would have to be changed to private ulong vecLength.
void kernel FLOATvector_SumPartial_barrier(global const float* restrict x,
                                           global float* restrict partialSum,
                                           local float* loc,
                                           private uint vecLength)
{
    uint gid = get_global_id(0);
    uint gs = get_global_size(0);
    uint lid = get_local_id(0);
    uint ls = get_local_size(0);
    float val;
    if(gid < vecLength)
    {
        val = x[gid];
    } else
    {
        val = 0.0f;
    }
    loc[lid] = val;
    barrier(CLK_LOCAL_MEM_FENCE);
    for(uint stride = ls / 2; stride > 1; stride >>= 1) // Does the same as /= 2
    {
        if(lid < stride)
        {
            loc[lid] += loc[lid + stride];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if(lid == 0)
    {
        gid = get_group_id(0);
        partialSum[gid] = loc[0] + loc[1];
    }
}
I have the following questions:
How big will the overhead be, e.g. on the NVIDIA V100 architecture, when I simply replace all uint by ulong?
Will using size_t instead of uint in the first kernel incur any overhead?
How can this be solved in CUDA? Shall I switch?
If you want to use 64-bit indexing, you can use the unsigned long long type. This is a 64-bit type on any platform, and it is not implementation-defined, as far as the acceptable platforms for OpenCL or CUDA usage on an NVIDIA GPU go.
How big will the overhead be, e.g. on the NVIDIA V100 architecture, when I simply replace all uint by ulong?
It should be simple enough just to test that.
Will using size_t instead of uint in the first kernel incur any overhead?
size_t, on a 64-bit platform (e.g. a 64-bit OS), would have the same overhead as switching to 64-bit indexing using unsigned long long.
How can this be solved in CUDA? Shall I switch?
CUDA shouldn't be meaningfully different in this respect. It has no restrictions on the use of size_t for kernel arguments, and all current CUDA development is on a 64-bit platform, which means that size_t is a 64-bit unsigned integer type, just like unsigned long long. If we compare OpenCL using unsigned long long and CUDA using unsigned long long, there should be no meaningful difference. And there would be no difference in CUDA between size_t and unsigned long long (again, for typical current development on a 64-bit platform).
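As a minimal, untested sketch of the 64-bit variant (host-side variable names are illustrative), only the argument type and the index type change on the device side:

void kernel FLOATvector_SumPartial_barrier(global const float* restrict x,
                                           global float* restrict partialSum,
                                           local float* loc,
                                           private ulong vecLength)
{
    ulong gid = get_global_id(0); /* get_global_id returns size_t; storing it in ulong is lossless on a 64-bit device */
    /* ... body unchanged, comparisons such as gid < vecLength now happen in 64 bits ... */
}

On the host side the matching argument would then be passed as cl_ulong:

cl_ulong vecLength = totalLength; /* totalLength: illustrative 64-bit host variable */
clSetKernelArg(kernel, 3, sizeof(cl_ulong), &vecLength);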
Related
How do I initialize a device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize all values except 0. The code for cudaMemset looks like below, where value is initialized to 5.
cudaMemset(devPtr,value,number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset(void* devPtr, int value, size_t count)
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking for is that each byte of devPtr be set to 5. If devPtr were an array of integers, the result would be that each integer word has the value 84215045. This is probably not what you had in mind.
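(That value comes straight from the byte pattern: an int whose four bytes are each 0x05 holds the bit pattern 0x05050505, and 0x05050505 = 5 * (2^0 + 2^8 + 2^16 + 2^24) = 5 * 16843009 = 84215045.)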
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for(; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
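A hedged usage sketch for the kernel above (devPtr and number_bytes as in the earlier snippet; the block size of 256 is just one reasonable choice, and the grid-stride loop tolerates any grid size):

const size_t nwords = number_bytes / sizeof(int);
const int blockSize = 256;
size_t blocks = (nwords + blockSize - 1) / blockSize;
if (blocks > 65535) blocks = 65535;                                   // the grid-stride loop covers the remainder
initKernel<int><<<(unsigned)blocks, blockSize>>>(devPtr, 5, nwords);  // each int becomes 5
cudaDeviceSynchronize();                                              // or check cudaGetLastError() for launch errors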
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing, but for 16-bit and full 32-bit word types. If you need to set 64-bit or larger types (so doubles or vector types), your best option is to use your own kernel.
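A minimal sketch of that driver-API route (assuming a current toolkit where the runtime and driver APIs share the primary context; devPtr and nwords are the same illustrative names as above):

#include <cuda.h>

cuMemsetD32((CUdeviceptr)(uintptr_t)devPtr, 5u, nwords);  // set every 32-bit word to 5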
I also needed a solution to this question, and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid blocks with for(; tidx < nwords; tidx += stride), nor, for that matter, the kernel invocation and the counter-intuitive use of word sizes.
Therefore I created a much simpler monolithic generic kernel and customized it with strides, i.e. you may use it to initialize a matrix in multiple ways, e.g. setting rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid*incx < n) {
        a[tid*incx] = value;
    }
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
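A hedged usage sketch (BLOCK_SIZE is assumed to be defined, e.g. 256; d_A, lda, cols and r are illustrative names for a column-major matrix of doubles already allocated on the device, with r < lda):

// fill the whole matrix with zeros (contiguous, stride 1)
deviceInitializeArray<double>(d_A, 0.0, (size_t)lda * cols, 1);
// set row r to 1.0: start at &d_A[r] and step by the leading dimension
deviceInitializeArray<double>(d_A + r, 1.0, (size_t)lda * cols, lda);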
I was trying to run an OpenCL kernel on both an Adreno 630 and my laptop. It turns out that the kernel runs perfectly on mobile but crashes my laptop every single time. I am still trying to figure out the reason myself. Here's my kernel; I hope you can help me with it, thanks.
__kernel void gen_mapxy( __read_only image2d_t _disp, const float offsetX, __write_only image2d_t _mapxy )
{
    const int y = get_global_id(0);
    const int local_y = get_local_id(0);
    __local short temp[24][1080];
    const int imageWidth = get_image_width(_disp);
    for(int x = 0; x < imageWidth; ++x)
        temp[local_y][x] = 0;
    for(int x = imageWidth - 1; x >= 0; --x){
        int tempDisp = read_imagei(_disp, sampler_nearest, (int2)(x, y)).x;
        int newPos = clamp((int)(x + offsetX * (tempDisp) / 255), 0, imageWidth - 1);
        temp[local_y][newPos] = tempDisp;
        write_imagef(_mapxy, (int2)(newPos, y), (float4)(x, y, 0, 0));
    }
}
You are using a big local array.
__local short temp[24][1080]
2 bytes * 24 * 1080 = 50.6 kB. Some desktop GPUs (and their notebook counterparts) have lower local memory limits. For example, the GTX 1060 reports CL_DEVICE_LOCAL_MEM_SIZE as 49152 bytes. The Adreno 630 either silently ignores the oversized array usage or really supports larger local arrays, because there is a possibility that local arrays are emulated inside global memory (limited to hundreds of megabytes) on those chips. If they do have fast on-chip local memory, then the "silently ignoring" explanation is more likely, unless they really doubled the local memory limit compared to the last generation of Adrenos.
Even when a GPU supports exactly that amount, using all of it limits thread-level parallelism on each pipeline, which generally reduces the potential performance gain severely.
If the last generation of Adreno GPUs is the same,
https://compubench.com/device.jsp?benchmark=compu15m&os=Android&api=cs&D=Samsung+Galaxy+S7+%28SM-G930x%29&testgroup=info
this page says
CL_DEVICE_LOCAL_MEM_SIZE: 32768
CL_DEVICE_LOCAL_MEM_TYPE: CL_LOCAL
it is fast, but it is 32 kB, so either the error is being ignored, or you've missed adding the necessary error-checking logic, or both.
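A minimal sketch of the kind of host-side checks that would expose this, assuming standard OpenCL host API calls (device, queue, kernel, global and local are illustrative variables):

cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(localMemSize), &localMemSize, NULL);
printf("CL_DEVICE_LOCAL_MEM_SIZE = %llu bytes\n", (unsigned long long)localMemSize);

cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
if (err != CL_SUCCESS)
{
    /* CL_OUT_OF_RESOURCES is a typical result when the __local array does not fit */
    printf("clEnqueueNDRangeKernel failed: %d\n", err);
}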
I'm trying to use the GPU for some image processing. In my kernel function I caught a "misalignment" exception:
The thread tried to read or write data that is misaligned on hardware that does not provide alignment. For example, 16-bit values must be aligned on 2-byte boundaries; 32-bit values on 4-byte boundaries, and so on.
I reduced the kernel code to loops only, but I still got this problem. My reduced kernel function:
__kernel void TestKernel(
    global const uchar* iImage,
    global uchar* oImage,
    uint width,
    uint heigth,
    uchar dif,
    float power)
{
    uint y = get_global_id(0);
    if (y >= heigth)
        return;

    for (uint x = 0; x < width; ++x){
        for (uint i = 0; i < 5; ++i) {
            uint sum = 0;
            for (uint j = 0; j < 5; ++j) {
                sum += 3;
            }
        }
    }
}
(program throws exception in the second loop)
I'm using the C++ wrapper to call my kernel
kernel.setArg(iArg++, iImage);
kernel.setArg(iArg++, oImage);
kernel.setArg(iArg++, header.GetVal(header.Width));
kernel.setArg(iArg++, header.GetVal(header.Height));
kernel.setArg(iArg++, (unsigned char)10);
kernel.setArg(iArg++, saturation);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(header.GetVal(header.Height)), cl::NDRange(128));
oImage and iImage are cl::Buffer
saturation is float
header.GetVal() returns int
I'm using Visual Studio 2015 with the CodeXL plugin and run the program on an AMD Spectre (Radeon R7).
What can cause this problem?
I want to create a histogram of vertical gradients in an 8-bit gray image.
The vertical distance to calculate the gradient can be specified.
I already managed to speed up another part of my code using intrinsics, but it does not work here.
The code runs without exception if the _mm_store_si128 is commented out.
When it is not commented out, I get an access violation.
What is going wrong here?
#define _mm_absdiff_epu8(a,b) _mm_adds_epu8(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a)) //from opencv
void CreateAbsDiffHistogramUnmanaged(void* source, unsigned int sourcestride, unsigned int height, unsigned int verticalDistance, unsigned int histogram[])
{
    unsigned int xcount = sourcestride / 16;
    __m128i absdiffData;
    unsigned char* bytes = (unsigned char*) _aligned_malloc(16, 16);
    __m128i* absdiffresult = (__m128i*) bytes;
    __m128i* sourceM = (__m128i*) source;
    __m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride;
    for (unsigned int y = 0; y < (height - verticalDistance); y++)
    {
        for (unsigned int x = 0; x < xcount; x++, ++sourceM, ++sourceVOffset)
        {
            absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
            _mm_store_si128(absdiffresult, absdiffData);
            //unroll loop
            histogram[bytes[0]]++;
            histogram[bytes[1]]++;
            histogram[bytes[2]]++;
            histogram[bytes[3]]++;
            histogram[bytes[4]]++;
            histogram[bytes[5]]++;
            histogram[bytes[6]]++;
            histogram[bytes[7]]++;
            histogram[bytes[8]]++;
            histogram[bytes[9]]++;
            histogram[bytes[10]]++;
            histogram[bytes[11]]++;
            histogram[bytes[12]]++;
            histogram[bytes[13]]++;
            histogram[bytes[14]]++;
            histogram[bytes[15]]++;
        }
    }
    _aligned_free(bytes);
}
Your function crashed while loading because the input data was not aligned properly. In order to solve this problem you have to change your code from:
absdiffData = _mm_absdiff_epu8(*sourceM, *sourceVOffset);
to:
absdiffData = _mm_absdiff_epu8(_mm_loadu_si128(sourceM), _mm_loadu_si128(sourceVOffset));
Here I use unaligned loading.
P.S. I have implemented a similar function (SimdAbsSecondDerivativeHistogram) in Simd Library. It has SSE2, AVX2, NEON and Altivec implementations. I hope that it will help you.
P.P.S. I would also strongly recommend checking this line:
__m128i* sourceVOffset = (__m128i*)source + verticalDistance * sourcestride;
Pointer arithmetic on __m128i* advances in 16-byte steps, so this offset is 16 times larger than intended, and it may result in a crash (access to memory outside of the input array bounds). Maybe you had in mind this:
__m128i* sourceVOffset = (__m128i*)((char*)source + verticalDistance * sourcestride);
I am having a problem with an SSE method I am writing that performs audio processing. I have implemented an SSE random function based on Intel's paper here:
http://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor/
I also have a method that performs conversion from float to S16 using SSE; the conversion is done quite simply as follows:
unsigned int Float_S16LE(float *data, const unsigned int samples, uint8_t *dest)
{
    int16_t *dst = (int16_t*)dest;
    const __m128 mul = _mm_set_ps1((float)INT16_MAX);
    __m128 rand;
    const uint32_t even = samples & ~0x3;
    for(uint32_t i = 0; i < even; i += 4, data += 4, dst += 4)
    {
        /* random round to dither */
        FloatRand4(-0.5f, 0.5f, NULL, &rand);
        __m128 rmul = _mm_add_ps(mul, rand);
        __m128 in = _mm_mul_ps(_mm_load_ps(data), rmul);
        __m64 con = _mm_cvtps_pi16(in);
        memcpy(dst, &con, sizeof(int16_t) * 4);
    }
}
FloatRand4 is defined as follows:
static inline void FloatRand4(const float min, const float max, float result[4], __m128 *sseresult = NULL)
{
    const float delta = (max - min) / 2.0f;
    const float factor = delta / (float)INT32_MAX;
    ...
}
If sseresult != NULL, the __m128 result is returned and result is unused.
This works perfectly on the first loop iteration, but on the next iteration delta becomes -1.#INF instead of 1.0. If I comment out the line __m64 con = _mm_cvtps_pi16(in); the problem goes away.
I think that the FPU is getting into an unknown state or something.
Mixing MMX integer operations (here, _mm_cvtps_pi16, which produces an __m64 result) with regular x87 floating-point math can produce weird results, because the MMX registers alias the x87 register stack. If you use
_mm_empty()
the FPU is reset into a correct state (see the sketch after these notes). Microsoft has Guidelines for When to Use EMMS.
_mm_load_ps requires a 16-byte aligned address; float* data may only be aligned to 4 bytes instead of 16, so use _mm_loadu_ps in that case.
memcpy will probably kill the advantages achieved with SSE; you should use a store instruction for the __m64, but here again, take care of the alignment.
If it's impossible to do an unaligned stream or store of an __m64, I'd either keep it inside an __m128i and do a masked write with _mm_maskmoveu_si128, or store those 8 bytes by hand.
http://msdn.microsoft.com/en-us/library/bytwczae.aspx
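A minimal sketch of one possible placement of _mm_empty(), based on the loop from the question and assuming the __m64 conversion is kept (the load is also switched to the unaligned variant per the note above):

for(uint32_t i = 0; i < even; i += 4, data += 4, dst += 4)
{
    FloatRand4(-0.5f, 0.5f, NULL, &rand);      /* scalar float math: needs a clean x87 state */
    __m128 rmul = _mm_add_ps(mul, rand);
    __m128 in = _mm_mul_ps(_mm_loadu_ps(data), rmul);
    __m64 con = _mm_cvtps_pi16(in);            /* MMX result: leaves the x87 stack in MMX state */
    memcpy(dst, &con, sizeof(int16_t) * 4);
    _mm_empty();                               /* EMMS: restore the x87 state before the next iteration */
}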