Understanding the use of memset in CUDA device code - c++

I have a linear int array arr in CUDA global memory. I want to set sub-arrays of arr to defined values. The start index of each sub-array is given by the starts array, while the length of each sub-array is given in the counts array.
What I want to do is set sub-array i, starting at starts[i] and running for counts[i] elements, to the value starts[i]. That is, the operation is:
arr[starts[i]: starts[i]+counts[i]] = starts[i]
I thought of using memset() in the kernel to set the values. However, the values are not written correctly (the array elements are assigned seemingly random values). The code I am using is:
#include <stdlib.h>

__global__ void kern(int* starts, int* counts, int* arr, int* numels)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= numels[0])
        return;

    const int val = starts[idx];
    memset(&arr[val], val, sizeof(arr[0]) * counts[idx]);
    __syncthreads();
}
Please note that numels[0] contains the number of elements in starts array.
I have checked the code with cuda-memcheck but didn't get any errors. I am using PyCUDA, if that's relevant. I am probably misunderstanding the usage of memset here, as I am still learning CUDA.
Can you please suggest a way to correct this, or another efficient way of doing this operation?
P.S: I know that thrust::fill() can probably do this well, but since I am learning CUDA, I would like to know how to do this without using external libraries.

The memset and memcpy implementations in CUDA device code emit simple, serial, byte-wise operations (and note that memset can't set anything other than byte values, which is likely the source of the problem you see, since the values you are trying to set are wider than 8 bits).
You could replace the memset call with something like this:
const int val = starts[idx];
//memset(&arr[val], val, sizeof(arr[0]) * counts[idx]);
for (int i = 0; i < counts[idx]; i++)
    arr[val + i] = val;
The performance of that code will probably be better than the built-in memset.
Note also that the __syncthreads() call at the end of your kernel is both unnecessary and a potential source of deadlock, and should be removed. See here for more information.
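Putting it together, the corrected kernel would look something like this (a minimal sketch following the question's parameter names, with the memset and __syncthreads() removed):

__global__ void kern(int* starts, int* counts, int* arr, int* numels)
{
    unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= numels[0])
        return;

    const int val = starts[idx];
    // Element-wise int stores instead of memset's byte stores.
    for (int i = 0; i < counts[idx]; i++)
        arr[val + i] = val;
}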

Related

Problem: CUDA Naive sum reduction, but not desired result [duplicate]

Hello, I want to find the sum of array elements using CUDA.
__global__ void countZeros(int *d_A, int *B)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    B[0] = B[0] + d_A[index];
}
So in the end, B[0] is supposed to contain the sum of all elements, but I noticed that B[0] equals zero every time, so in the end it contains only the last element.
Why does B[0] become zero every time?
All of the threads are writing to B[0], and some may be attempting to write simultaneously. This line of code:
B[0] = B[0]+d_A[index];
requires a read and a write of B[0]. If multiple threads are doing this at the same time, you will get strange results.
You can make a simple fix by doing this:
atomicAdd(B, d_A[index]);
and you should get sensible results (assuming there are no errors elsewhere in the code you haven't shown). Be sure to initialize B[0] to a known value before calling this kernel.
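A minimal sketch of the fixed kernel (the element count n is an assumed extra parameter, since the original snippet launches with no bounds check):

__global__ void countZeros(int *d_A, int *B, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)                 // guard against out-of-range threads
        atomicAdd(B, d_A[index]); // serializes the read-modify-write
}

// Host side, before the launch:
// cudaMemset(B, 0, sizeof(int)); // initialize the accumulator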
If you want to do this efficiently, however, you should study the cuda reduction sample or just use CUB.
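In the spirit of that sample, a block-level shared-memory reduction followed by one atomicAdd per block looks roughly like this (a sketch, assuming a power-of-two block size and a launch with blockDim.x * sizeof(int) bytes of dynamic shared memory):

__global__ void reduceSum(const int *d_A, int *B, int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element (or 0 past the end) into shared memory.
    sdata[tid] = (i < n) ? d_A[i] : 0;
    __syncthreads();

    // Tree reduction within the block.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One atomic per block instead of one per element.
    if (tid == 0)
        atomicAdd(B, sdata[0]);
}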
And be sure to use proper CUDA error checking any time you are having trouble with CUDA code.
So, if you still can't get sensible results, please instrument your code with proper CUDA error checking before asking "I made this change but it still doesn't work, why?" I can't tell you why, because this is the only snippet of code that you've shown.
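Proper error checking typically amounts to wrapping every API call in a small macro along these lines (one common variant; a sketch, not the only way to do it):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",       \
                         cudaGetErrorString(err), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMemset(B, 0, sizeof(int)));
// kernel<<<grid, block>>>(...);
// CUDA_CHECK(cudaGetLastError());       // catch launch errors
// CUDA_CHECK(cudaDeviceSynchronize());  // catch async execution errors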

2D array access time comparison

I have two ways of constructing a 2D array:
int arr[NUM_ROWS][NUM_COLS];
//...
tmp = arr[i][j];
and a flattened array
int arr[NUM_ROWS*NUM_COLS];
//...
tmp = arr[i*NUM_COLS+j];
I am doing image processing, so even a small improvement in access time matters. Which one is faster? I am thinking the first, since the second needs a calculation, but then the first requires two address lookups, so I am not sure.
I don't think there is any performance difference. The system will allocate the same amount of contiguous memory in both cases. To compute i*NUM_COLS+j, either you do it yourself for the 1D declaration, or the compiler does it for you in the 2D case. The only real concern is ease of use.
You should trust the capabilities of your compiler in optimizing standard code, and you should trust modern CPUs to have fast integer multiplication instructions. Don't agonize over one form or the other!
Decades ago I optimized some code greatly by using pointers instead of 2D-array index calculations, but this will a) only be useful if storing the pointer is an option, e.g. in a loop, and b) have low impact, since I guess modern CPUs can do a 2D array access in very few cycles. Worth measuring! It may be related to the array size.
In any case, pointers using ptr++ or ptr += NUM_COLS will for sure be a little bit faster, where applicable.
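Measuring is straightforward; a rough harness along these lines lets you compare the two forms directly (a sketch; the sizes and timing method are illustrative, and on many compilers at -O2 the two loops compile to essentially the same addressing code):

#include <chrono>
#include <cstdio>

enum { NUM_ROWS = 2048, NUM_COLS = 2048 };

static int arr2d[NUM_ROWS][NUM_COLS];
static int arr1d[NUM_ROWS * NUM_COLS];

template <typename F>
static long long timeIt(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main()
{
    volatile long long sink = 0; // defeat dead-code elimination
    long long sum = 0;

    long long us2d = timeIt([&] {
        for (int i = 0; i < NUM_ROWS; i++)
            for (int j = 0; j < NUM_COLS; j++)
                sum += arr2d[i][j];
    });
    sink = sum;

    sum = 0;
    long long us1d = timeIt([&] {
        for (int i = 0; i < NUM_ROWS; i++)
            for (int j = 0; j < NUM_COLS; j++)
                sum += arr1d[i * NUM_COLS + j];
    });
    sink = sum;

    std::printf("2D: %lld us, flat: %lld us (checksum %lld)\n",
                us2d, us1d, (long long)sink);
}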
The first method will almost always be faster. IN GENERAL (because there are always corner cases), the processor and memory architecture, as well as the compiler, may have optimizations built in to aid with 2D arrays or other similar data structures. For example, GPUs are optimized for matrix (2D array) math.
So, again in general, I would allow the compiler and hardware to optimize your memory and address arithmetic if possible.
...also, I agree with @Paul R: there are much bigger considerations when it comes to performance than your array allocation and address arithmetic.
There are two cases to consider: compile-time definition and run-time definition of the array size. There is a big difference in performance.
Static allocation, global or file scope, fixed size array:
The compiler knows the size of the array and tells the linker to allocate space in the data / memory section. This is the fastest method.
Example:
#define ROWS 5
#define COLUMNS 6
int array[ROWS][COLUMNS];
int buffer[ROWS * COLUMNS];
Run time allocation, function local scope, fixed size array:
The compiler knows the size of the array, and tells the code to allocate space in the local memory (a.k.a. stack) for the array. In general, this means adding a value to a stack register. Usually one or two instructions.
Example:
void my_function(void)
{
unsigned short my_array[ROWS][COLUMNS];
unsigned short buffer[ROWS * COLUMNS];
}
Run Time allocation, dynamic memory, fixed size array:
Again, the compiler has already calculated the amount of memory required for the array, since it was declared with a fixed size. The compiler emits code to call the memory allocation function with the required amount (usually passed as a parameter). A little slower because of the function call and the overhead required to find free dynamic memory (the allocator's bookkeeping).
Example:
void another_function(void)
{
    unsigned char *array = new unsigned char[ROWS * COLUMNS];
    //...
    delete[] array;
}
Run Time allocation, dynamic memory, variable size:
Regardless of the dimensions of the array, the compiler must emit code to calculate the amount of memory to allocate. This quantity is then passed to the memory allocation function. A little slower than above because of the code required to calculate the size.
Example:
int * create_board(unsigned int rows, unsigned int columns)
{
    int *board = new int[rows * columns];
    return board;
}
Since your goal is image processing, I would assume your images are too large for static arrays. The question you should really be asking is about dynamically allocated arrays.
In C/C++ there are multiple ways to allocate a dynamic 2D array (see: How do I work with dynamic multi-dimensional arrays in C?). To make this work in both C and C++ we can use malloc with a cast (in C++ only, you can use new instead).
Method 1:
int** arr1 = (int**)malloc(NUM_ROWS * sizeof(int*));
for (int i = 0; i < NUM_ROWS; i++)
    arr1[i] = (int*)malloc(NUM_COLS * sizeof(int));
Method 2:
int** arr2 = (int**)malloc(NUM_ROWS * sizeof(int*));
int* arrflat = (int*)malloc(NUM_ROWS * NUM_COLS * sizeof(int));
for (int i = 0; i < NUM_ROWS; i++)
    arr2[i] = arrflat + (i * NUM_COLS);
Method 2 essentially creates a contiguous 2D array: i.e. arrflat[NUM_COLS*i+j] and arr2[i][j] should have identical performance. However, arrflat[NUM_COLS*i+j] and arr1[i][j] from Method 1 should not be expected to have identical performance, since arr1 is not contiguous. Method 1, however, seems to be the method most commonly used for dynamic arrays.
In general, I use arrflat[NUM_COLS*i+j] so I don't have to think about how to allocate dynamic 2D arrays.
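One detail worth spelling out for Method 2 is cleanup: there are exactly two allocations, so there are exactly two frees. A minimal sketch using the names from the snippets above (sizes are illustrative):

#include <assert.h>
#include <stdlib.h>

#define NUM_ROWS 4
#define NUM_COLS 8

int main(void)
{
    int** arr2 = (int**)malloc(NUM_ROWS * sizeof(int*));
    int* arrflat = (int*)malloc(NUM_ROWS * NUM_COLS * sizeof(int));
    for (int i = 0; i < NUM_ROWS; i++)
        arr2[i] = arrflat + (i * NUM_COLS);

    arr2[3][5] = 42;
    assert(arrflat[3 * NUM_COLS + 5] == 42); /* same element, same memory */

    free(arrflat); /* the contiguous data block */
    free(arr2);    /* the row-pointer table */
    return 0;
}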

Memset an int (16 bit) array to short's max value

Can't seem to find the answer to this anywhere,
How do I memset an array to the maximum value of the array's type?
I would have thought memset(ZBUFFER, 0xFFFF, size) would work, where ZBUFFER is a 16-bit integer array. Instead I get -1s throughout.
Also, the idea is to have this work as fast as possible (it's a zbuffer that needs to initialize every frame) so if there is a better way (and still as fast or faster), let me know.
Edit: as clarification, I do need a signed int array.
In C++, you would use std::fill, and std::numeric_limits.
#include <algorithm>
#include <iterator>
#include <limits>

template <typename IT>
void FillWithMax(IT first, IT last)
{
    typedef typename std::iterator_traits<IT>::value_type T;
    T const maxval = std::numeric_limits<T>::max();
    std::fill(first, last, maxval);
}

size_t const size = 32;
short ZBUFFER[size];
FillWithMax(ZBUFFER, &ZBUFFER[0] + size);
This will work with any type.
In C, you'd better keep away from memset, which sets byte values. To initialize an array of a type other than char (possibly unsigned), you have to resort to a manual for loop.
-1 and 0xFFFF are the same bit pattern in a 16-bit integer using a two's complement representation. You are only seeing -1 because either you declared your array as short instead of unsigned short, or because you are converting the values to signed when you output them.
BTW, your assumption that you can set something other than bytes using memset is wrong: memset(ZBUFFER, 0xFF, size) would have done the same thing.
In C++ you can fill an array with some value with the std::fill algorithm.
std::fill(ZBUFFER, ZBUFFER+size, std::numeric_limits<short>::max());
This is neither faster nor slower than your current approach. It does have the benefit of working, though.
Don't attribute speed to a language; that's a property of implementations. There are C compilers that produce fast, optimal machine code and C compilers that produce slow, suboptimal machine code, and likewise for C++. A "fast, optimal" implementation might be able to optimise code that seems slow, so it doesn't make sense to call one solution faster than another in the abstract. I'll talk about correctness first, and then about performance, however insignificant it is. It would be a better idea to profile your code to be sure that this is in fact the bottleneck, but let's continue.
Let us consider the most sensible option first: a loop that copies int values. It is clear just by reading the code that the loop will correctly assign SHRT_MAX to each int item. You can see a test case of this loop below, which attempts to use the largest array allocatable by malloc at the time.
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t size = SIZE_MAX;
    volatile int *array = malloc(size);

    /* Allocate largest array */
    while (array == NULL && size > 0) {
        size >>= 1;
        array = malloc(size);
    }

    printf("Copying into %zu bytes\n", size);

    for (size_t n = 0; n < size / sizeof *array; n++) {
        array[n] = SHRT_MAX;
    }

    puts("Done!");
    return 0;
}
I ran this on my system, compiled with various optimisations enabled (-O3 -march=core2 -funroll-loops). Here's the output:
Copying into 1073741823 bytes
Done!
Process returned 0 (0x0) execution time : 1.094 s
Press any key to continue.
Note the "execution time"... That's pretty fast! If anything, the bottleneck here is the cache locality of such a large array, which is why a good programmer will try to design systems that don't use so much memory... Well, then let us consider the memset option. Here's a quote from the memset manual:
The memset() function copies c (converted to an unsigned char) into
each of the first n bytes of the object pointed to by s.
Hence, it'll convert 0xFFFF to an unsigned char (truncating the value), and then assign the converted value to each of the first size bytes. This results in incorrect behaviour. I don't like relying on the value SHRT_MAX being represented as a particular sequence of bytes, because that's relying on coincidence. In other words, the main problem here is that memset isn't suitable for your task; don't use it. Having said that, here's a test, derived from the one above, which will be used to measure the speed of memset:
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* for memset */
#include <time.h>

int main(void) {
    size_t size = SIZE_MAX;
    volatile int *array = malloc(size);

    /* Allocate largest array */
    while (array == NULL && size > 0) {
        size >>= 1;
        array = malloc(size);
    }

    printf("Copying into %zu bytes\n", size);

    /* the cast drops volatile for the call; fine for a one-shot benchmark */
    memset((void *)array, 0xFFFF, size);

    puts("Done!");
    return 0;
}
A trivial byte-copying memset loop will iterate sizeof (int) times more than the loop in my first example. Considering that my implementation uses a fairly optimal memset, here's the output:
Copying into 1073741823 bytes
Done!
Process returned 0 (0x0) execution time : 1.060 s
Press any key to continue.
These timings are likely to vary between runs; I only ran them once each to get a rough idea. Hopefully you've come to the same conclusion that I have: common compilers are pretty good at optimising simple loops, and it's not worth speculating about micro-optimisations here.
In summary:
Don't use memset to fill ints with values (with an exception for the value 0), because it's not suitable.
Don't postulate about optimisations prior to running tests. Don't run tests until you have a working solution. By working solution I mean "A program that solves an actual problem". Once you have that, use your profiler to identify more significant opportunities to optimise!
This is because of two's complement. You have to change your array type to unsigned short to get the max value, or use 0x7FFF (SHRT_MAX):
for (int i = 0; i < SIZE / sizeof(short); ++i) {
    ZBUFFER[i] = SHRT_MAX;
}
Note this does not initialize the last couple of bytes if SIZE % sizeof(short) != 0.
In C, you can do it like Adrian Panasiuk said, and you can also unroll the copy loop. Unrolling means handling larger chunks at a time. The extreme end of loop unrolling is copying the whole frame over from a prefilled buffer, like this:
void init(void)
{
    for (int i = 0; i < sizeof(ZBUFFER) / sizeof(ZBUFFER[0]); ++i) {
        empty_ZBUFFER[i] = SHRT_MAX;
    }
}
actual clearing:
memcpy(ZBUFFER, empty_ZBUFFER, SIZE);
(You can experiment with different sizes of the empty ZBUFFER, from four bytes and up, and then have a loop around the memcpy.)
As always, test your findings to see a) whether it's worth optimizing this part of the program at all, and b) what difference the different initialization techniques make. It will depend on a lot of factors. For the last few percent of performance, you may have to resort to assembly code.
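Another chunking variant worth testing is the doubling trick: fill the first element, then repeatedly memcpy the already-initialized prefix onto the uninitialized remainder. A sketch (the fill_zbuffer name and the count parameter are illustrative):

#include <limits.h>
#include <string.h>

void fill_zbuffer(short *zbuf, size_t count)
{
    if (count == 0) return;
    zbuf[0] = SHRT_MAX;
    size_t filled = 1; /* elements initialized so far */
    while (filled < count) {
        /* copy as much as is ready, without running off the end */
        size_t chunk = (filled <= count - filled) ? filled : count - filled;
        memcpy(zbuf + filled, zbuf, chunk * sizeof zbuf[0]);
        filled += chunk; /* doubles each pass until the tail */
    }
}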
#include <algorithm>
#include <limits>
std::fill_n(ZBUFFER, size, std::numeric_limits<FOO>::max());
where FOO is the type of ZBUFFER's elements.
When you say "memset" do you actually have to use that function? That is only a byte-by-byte assign so it won't work with signed arrays.
If you want to set each value to the maximum you would use something like:
std::fill( ZBUFFER, ZBUFFER+len, std::numeric_limits<short>::max() )
where len is the number of elements (not the size in bytes) of your array

What's the fastest way to extract non-zero indices from a byte array in C++

I have a byte array
unsigned char* array=new unsigned char[4000000];
...
And I would like to get indices of all non-zero elements of the array.
Of course, I can do following
for (int i = 0; i < size; i++)
{
    if (array[i] != 0) somevector.push_back(i);
}
Is there any faster algorithm than this?
Update 1: I can see the majority answer is no. I hoped there was some magical bit operation I wasn't aware of. Some suggested sorting, but that's not feasible in this case. Thanks a lot for all your answers.
Update 2: Four years and four months after this question was posted, @wim suggested this answer, which looks promising.
Unless your array is sorted, this is the most efficient algorithm for what you want to do in a single-threaded program. You can try to optimize the data structure where you store your results, but in terms of time this is the best you can do.
With a byte array that is mostly zero (a sparse array), you can take advantage of a 32-bit CPU by doing comparisons 4 bytes at a time. The comparisons themselves are done 4 bytes at a time, but if any of the bytes is non-zero then you have to determine which bytes in the unsigned long are non-zero, and that takes more effort. If the array is really sparse, the time saved on the comparisons may compensate for the additional work of determining which bytes are non-zero.
The easiest approach would be to make the unsigned char array's size a multiple of 4 bytes, so that you do not need to worry about handling the last few bytes after the loop completes.
I would suggest doing a timing study on this, as it is purely conjectural, and there will be a point where the array becomes dense enough that this takes more time than a simple loop.
One question I would have is what you are doing with the vector of offsets of non-zero elements and whether you can do away with it. Another is whether, if you do need the vector, you can build it as you place elements into the array.
unsigned char* array = new unsigned char[4000000];
//......
// Assumes unsigned long is 4 bytes and the array size is a multiple of 4.
unsigned long *pUlaw = (unsigned long *)array;
for ( ; pUlaw < (unsigned long *)(array + 4000000); pUlaw++) {
    if (*pUlaw) {
        // at least one of these four bytes is non-zero
        unsigned char *pUlawByte = (unsigned char *)pUlaw;
        if (*pUlawByte)
            somevector.push_back(pUlawByte - array);
        if (*(pUlawByte + 1))
            somevector.push_back(pUlawByte - array + 1);
        if (*(pUlawByte + 2))
            somevector.push_back(pUlawByte - array + 2);
        if (*(pUlawByte + 3))
            somevector.push_back(pUlawByte - array + 3);
    }
}
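A more portable variant of the same word-at-a-time idea uses memcpy for the word loads, which sidesteps the aliasing and alignment questions raised by the cast above (a sketch; uint64_t is an assumption about a convenient word size):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

void nonZeroIndices(const unsigned char *a, std::size_t n,
                    std::vector<std::size_t> &out)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        std::uint64_t w;
        std::memcpy(&w, a + i, 8); // safe unaligned 8-byte load
        if (w == 0)
            continue;              // skip eight zero bytes at once
        for (std::size_t j = 0; j < 8; ++j)
            if (a[i + j] != 0)
                out.push_back(i + j);
    }
    for (; i < n; ++i)             // handle the tail bytes
        if (a[i] != 0)
            out.push_back(i);
}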
If the non-zero values are relatively rare, one trick you can use is a sentinel value:
unsigned char old_value = array[size-1];
array[size-1] = 1; // make sure we eventually find a non-zero
int i = 0;
for (;;) {
    while (array[i] == 0) ++i; // tighter inner loop
    if (i == size-1) break;
    somevector.push_back(i);
    ++i;
}
array[size-1] = old_value;
if (old_value != 0) {
    somevector.push_back(size-1);
}
This avoids having to check both the index and the value on each iteration.
The only thing you can do to improve the speed is to use concurrency.
This is not really an answer to your question, but I was trying to imagine what problem you are trying to solve.
Sometimes when performing operations on matrices (in mathematical sense), the operations can be improved when you know that the great majority of matrix elements will be zeros (a sparse matrix). You do such an optimization by not using a big array at all, but simply storing pairs {index, value} that indicate a non-zero element.
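For instance, the producer can record the pairs directly as it writes, so the scan for non-zero indices disappears entirely. A minimal sketch (the Entry layout is illustrative):

#include <cstddef>
#include <vector>

struct Entry {
    std::size_t index;   // position in the (implicit) dense array
    unsigned char value; // the non-zero value itself
};

int main()
{
    std::vector<Entry> sparse;
    // Wherever the original code would write array[i] = v with v != 0:
    sparse.push_back({42, 7});
    sparse.push_back({1000000, 3});
    // The non-zero indices are now just the stored Entry::index fields.
    return 0;
}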

Memcpy : Adding an int offset?

I was looking over some C++ code and I ran into this memcpy call. I understand what memcpy does, but they add an int to the source. I tried looking up the source code for memcpy, but I can't seem to understand what the addition does.
memcpy(Destination, SourceData + intSize, SourceDataSize);
In other words, I want to know what SourceData + intSize is doing. (I am trying to convert this to java.)
EDIT:
So here is my attempt at doing a memcpy function in java using a for loop...
for (int i = 0; i < SourceDataSize; i++) {
    Destination[i] = SourceData[i + 0x100];
}
It is the same thing as:
memcpy(&Destination[0], &SourceData[intSize], SourceDataSize);
This is basic pointer arithmetic. SourceData points to some data type, and adding n to it increases the address it's pointing to by n * sizeof(*SourceData).
For example, if SourceData is defined as:
uint32_t *SourceData;
and
sizeof(uint32_t) == 4
then adding 2 to SourceData would increase the address it holds by 8.
As an aside, if SourceData is defined as an array, then SourceData + n is the address of the nth element of the array. It's easy enough to see for n == 0; when n == 1, you'll get a memory address that is sizeof(*SourceData) bytes after the beginning of the array.
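A tiny standalone demonstration of that scaling (a sketch; the uint32_t type is just for illustration):

#include <cinttypes>
#include <cstdio>

int main()
{
    uint32_t data[4] = {10, 20, 30, 40};
    uint32_t *p = data;

    // p + 2 is 2 * sizeof(uint32_t) == 8 bytes past p:
    std::printf("p = %p, p + 2 = %p\n", (void *)p, (void *)(p + 2));

    // *(p + 2) and data[2] name the same element:
    std::printf("*(p + 2) == %" PRIu32 ", data[2] == %" PRIu32 "\n",
                *(p + 2), data[2]);
    return 0;
}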
SourceData + intSize is skipping intSize * sizeof(source data type) bytes at the beginning of SourceData. Maybe SourceDataSize is stored there or something like that.
The closest equivalent to memcpy in Java that you're probably going to get is System.arraycopy, since Java doesn't really have pointers in the same sense.
The add will change the address used for the source of the memory copy.
The amount the address changes will depend on the type of SourceData.
(See http://www.learncpp.com/cpp-tutorial/68-pointers-arrays-and-pointer-arithmetic/)
It might be trying to copy a section of an array SourceData starting at offset intSize and of length SourceDataSize/sizeof(*SourceData).
EDIT
So, for example, if the array was of integers of size 4 bytes, then the equivalent java code would look like:
for (int i = 0; i < SourceDataSize / 4; i++) {
    Destination[i] = SourceData[i + intSize];
}
Regarding doing this in Java:
Your loop
for (int i = 0; i < SourceDataSize; i++) {
    Destination[i] = SourceData[i + 0x100];
}
will always start copying data from 0x100 elements into SourceData; this may not be the desired behavior. (For instance, when i = 0, Destination[0] = SourceData[0 + 0x100], and so forth.) This would be what you wanted if you never want to copy SourceData[0]..SourceData[0xFF], but note that hard-coding the offset prevents this from being a drop-in replacement for memcpy.
The reason the intSize value is specified in the original code is likely that the first intSize elements are not part of the 'actual' data; those elements are used for bookkeeping somehow (like a record of the total size of the buffer). memcpy itself doesn't 'see' the offset; it only knows the pointer it's given. SourceData + intSize creates a pointer that points intSize elements past SourceData.
But, more importantly, what you are doing is likely to be extremely slow. memcpy is a very heavily optimized function that maps to carefully tuned assembly on most architectures, and replacing it with a simple per-element loop will dramatically impact the performance characteristics of the code. What you are doing is appropriate if you are trying to understand how memcpy and pointers work, but note that if you are porting existing code to Java for actual use, you will likely want a roughly equivalent Java function such as java.util.Arrays.copyOf.