How to store bool result of a CUDA kernel function - c++

Assume that we have 2^10 CUDA cores and 2^20 data points. I want a kernel that will process these points and will provide true/false for each of them. So I will have 2^20 bits. Example:
bool f(x) { return x % 2? true : false; }
void kernel(int* input, byte* output)
{
tidx = thread.x ...
output[tidx] = f(input[tidx]);
...or...
sharedarr[tidx] = f(input[tidx]);
sync()
output[blockidx] = reduce(sharedarr);
...or...
atomic_result |= f(input[tidx]) << tidx;
sync(..)
output[blckidx] = atomic_result;
}
Thrust/CUDA has some algorithms as "partitioning", "transformation" which provides similar alternatives.
My question is, when I write the relevant CUDA kernel with a predicate that is providing the corresponding bool result,
should I use one byte for each result and directly store the result in the output array? Performing one step for calculation and performing another step for reduction/partitioning later.
should I compact the output in the shared memory, using one byte for 8 threads and then at the end write the result from shared memory to output array?
should I use atomic variables?
What's the best way to write such a kernel and the most logical data structure to keep the results? Is it better to use more memory and simply do more writes to main memory instead of trying to deal with compacting the result before writing back to result memory area?

There is no tradeoff between speed and data size when using the __ballot() warp-voting intrinsic to efficiently pack the results.
Assuming that you can redefine output to be of uint32_t type, and that your block size is a multiple of the warp size (32), you can simply store the packed output using
output[tidx / warpSize] = __ballot(f(input[tidx]));
Note this makes all threads of the warp try to store the result of __ballot(). Only one thread of the warp will succeed, but as their results are all identical, it does not matter which one will.

Related

Are there any more efficient ways for atomically adding two floats?

I have a bundle of floats which get updated by various threads. Size of the array is much larger than the number of threads. Therefore simultaneous access on particular floats is rather rare. I need a solution for C++03.
The following code atomically adds a value to one of the floats (live demo). Assuming it works it might be the best solution.
The only alternative I can think of is dividing the array into bunches and protecting each bunch by a mutex. But I don't expect the latter to be more efficient.
My questions are as follows. Are there any alternative solutions for adding floats atomically? Can anyone anticipate which is the most efficient? Yes, I am willing to do some benchmarks. Maybe the solution below can be improved by relaxing the memorder constraints, i.e. exchanging __ATOMIC_SEQ_CST by something else. I have no experience with that.
void atomic_add_float( float *x, float add )
{
int *ip_x= reinterpret_cast<int*>( x ); //1
int expected= __atomic_load_n( ip_x, __ATOMIC_SEQ_CST ); //2
int desired;
do {
float sum= *reinterpret_cast<float*>( &expected ) + add; //3
desired= *reinterpret_cast<int*>( &sum );
} while( ! __atomic_compare_exchange_n( ip_x, &expected, desired, //4
/* weak = */ true,
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST ) );
}
This works as follows. At //1 the bit-pattern of x is interpreted as an int, i.e. I assume that float and int have the same size (32 bits). At //2 the value to be increased is loaded atomically. At //3 the bit-pattern of the int is interpreted as float and the summand is added. (Remember that expected contains a value found at ip_x == x.) This doesn't change the value under ip_x == x. At //4 the result of the summation is stored only at ip_x == x if no other thread changed the value, i.e. if expected == *ip_x (docu). If this is not the case the do-loop continues and expected contains the updated value found ad ip_x == x.
GCC's functions for atomic access (__atomic_load_n and __atomic_compare_exchange_n) can easily be exchanged by other compiler's implementations.
Are there any alternative solutions for adding floats atomically? Can anyone anticipate which is the most efficient?
Sure, there are at least few that come to mind:
Use synchronization primitives, i.e. spinlocks. Will be a bit slower than compare-exchange.
Transactional extension (see Wikipedia). Will be faster, but this solution might limit the portability.
Overall, your solution is quire reasonable: it is fast and yet will work on any platform.
In my opinion the needed memory orders are:
__ATOMIC_ACQUIRE -- when we read the value in __atomic_load_n()
__ATOMIC_RELEASE -- when __atomic_compare_exchange_n() is success
__ATOMIC_ACQUIRE -- when __atomic_compare_exchange_n() is failed
To make this function more efficient you may like to use __ATOMIC_ACQUIRE for __atomic_load_n and __ATOMIC_RELEASE and __ATOMIC_RELAXED for __atomic_compare_exchange_n success_memorder and failure_memorder respectively.
On x86-64 though that does not change the generated assembly because its memory model is relatively strong. Unlike for ARM with its weaker memory model.

Bool Flags vs. Unsigned Char Flags

Disclaimer: Please correct me in the event that I make any false claims in this post.
Consider a struct that contains eight bool member variables.
/*
* Struct uses one byte for each flag.
*/
struct WithBools
{
bool f0 = true;
bool f1 = true;
bool f2 = true;
bool f3 = true;
bool f4 = true;
bool f5 = true;
bool f6 = true;
bool f7 = true;
};
The space allocated to each variable is a byte in length, which seems like a waste if the variables are used solely as flags. One solution to reduce this wasted space, as far as the variables are concerned, is to encapsulate the eight flags into a single member variable of unsigned char.
/*
* Struct uses a single byte for eight flags; retrieval and
* manipulation of data is achieved through accessor functions.
*/
struct WithoutBools
{
unsigned char getFlag(unsigned index)
{
return flags & (1 << (index % 8));
}
void toggleFlag(unsigned index)
{
flags ^= (1 << (index % 8));
}
private:
unsigned char flags = 0xFF;
};
The flags are retrieved and manipulated via. bitwise operators, and the struct provides an interface for the user to retrieve and manipulate the flags. While flag sizes have been reduced, we now have the two additional methods that add to the size of the struct. I do not know how to benchmark this difference, therefore I could not be certain of any fluctuation between the above structs.
My questions are:
1) Would the difference in space between these two structs be negligible?
2) Generally, is this approach of "optimising" a collection of bools by compacting them into a single byte a good idea? Either in an embedded systems context or otherwise.
3) Would a C++ compiler make such an optimisation that compacts a collection of bools wherever possible and appropriate.
we now have the two additional methods that add to the size of the
struct
Methods are code and do not increase the size of the struct. Only data makes up size on the structure.
3) Would a C++ compiler make such an optimisation that compacts a
collection of bools wherever possible and appropriate.
That is a sound resounding no. The compiler is not allowed to change data types.
1) Would the difference in space between these two structs be
negligible?
No, there definitely is a size difference between the two approaches.
2) Generally, is this approach of "optimising" a collection of bools
by compacting them into a single byte a good idea? Either in an
embedded systems context or otherwise.
Generally yes, the idiomatic way to model flags is with bit-wise manipulation inside an unsigned integer. Depending on the number of flags needed you can use std::uint8_t, std::uint16_t and so on.
However the most common way to model this is not via index as you've done, but via masks.
Would the difference in space between these two structs be negligible?
That depends on how many values you are storing and how much space you have to store them in. The size difference is 1 to 8.
Generally, is this approach of "optimising" a collection of bools by compacting them into a single byte a good idea? Either in an embedded systems context or otherwise.
Again, it depends on how many values and how much space. Also note that dealing with bits instead of bytes increases code size and execution time.
Many embedded systems have relatively little RAM and plenty of Flash. Code is stored in Flash, so the increased code size can be ignored, and the saved memory could be important on small RAM systems.
Would a C++ compiler make such an optimisation that compacts a collection of bools wherever possible and appropriate.
Hypothetically it could. I would consider that an aggressive space optimization, at the expense of execution time.
STL has a specialization for vector<bool> that I frequently avoid for performance reasons - vector<char> is much faster.

Unaligned load versus unaligned store

The short question is that if I have a function that takes two vectors. One is input and the other is output (no alias). I can only align one of them, which one should I choose?
The longer version is that, consider a function,
void func(size_t n, void *in, void *out)
{
__m256i *in256 = reinterpret_cast<__m256i *>(in);
__m256i *out256 = reinterpret_cast<__m256i *>(out);
while (n >= 32) {
__m256i data = _mm256_loadu_si256(in256++);
// process data
_mm256_storeu_si256(out256++, data);
n -= 32;
}
// process the remaining n % 32 bytes;
}
If in and out are both 32-bytes aligned, then there's no penalty of using vmovdqu instead of vmovdqa. The worst case scenario is that both are unaligned, and one in four load/store will cross the cache-line boundary.
In this case, I can align one of them to the cache line boundary by processing a few elements first before entering the loop. However, the question is which should I choose? Between unaligned load and store, which one is worse?
Risking to state the obvious here: There is no "right answer" except "you need to benchmark both with actual code and actual data". Whichever variant is faster strongly depends on the CPU you are using, the amount of calculations you are doing on each package and many other things.
As noted in the comments, you should also try non-temporal stores. What also sometimes can help is to load the input of the following data packet inside the current loop, i.e.:
__m256i next = _mm256_loadu_si256(in256++);
for(...){
__m256i data = next; // usually 0 cost
next = _mm256_loadu_si256(in256++);
// do computations and store data
}
If the calculations you are doing have unavoidable data latencies, you should also consider calculating two packages interleaved (this uses twice as many registers though).

CUDA Non Atomic Write clash results

I am writing a function which needs to iterate until completion. I realise that I can use atomic operators, but speed is critical in this Kernel and I suspect they may not be required.
I have included a small piece of pseudo-code to demonstrate what I am intending to do
__global__ void TestKernel()
{
__shared__ bool lbRepeat[1];
do
{
lbRepeat=false;
__syncthreads();
if(Condition == true) lbRepeat=true;
__syncthreads();
}
while(lbRepeat);
}
If no thread has found the Condition to be true lbRepeat will be false.
If one thread has found the Condition to be true lbRepeat will be true.
What will the result be if multiple threads write true into lbRepeat at the same time?
I would like to extend this to copying integer values (unsigned 16 bit specifically). As well as checking the condition I would like to copy a unsigned 16 bit integer.
__global__ void TestKernel()
{
__shared__ unsigned short liValues[32*8];
__shared__ bool lbRepeat[1];
unsigned long tid = threadIdx.x+threadIdx.y*blockDim.x;
do
{
lbRepeat=false;
__syncthreads();
if(Condition == true)
{
liValue[tid] = liValue[Some_Value_In_Range];
lbRepeat=true;
}
__syncthreads();
}
while(lbRepeat);
}
If another thread is writing to the memory as it is read could this cause a neither the previous value or the new value to be returned? I do not mind if either the previous or the new value is returned (both will be valid) but a mixture of the bits of each would cause problems.
I thought this wouldn't be acceptable, but my testing seems to indicate that it works as desired. Is this because unsigned short copys are atomic in CUDA?
In Summary:
What is the result if two threads write the same value into one boolean memory location?
Can reading from a unsigned short memory location as another thread is writing a new value to the same location return a value which is neither the previous value or the new value in that memory location?
What is the result if two threads write the same value into one boolean memory location?
The end result will be that one of the written values will end up in that memory location. Which value is undefined. If all written values are the same, you can be sure that value will end up in that location.
Can reading from a unsigned short memory location as another thread is writing a new value to the same location return a value which is neither the previous value or the new value in that memory location?
Assuming these are the only two operations going on (one write, and one read), no. The read value will be either the value before the write has begun or the value after the write is complete. If you have multiple writes going on, then of course see the answer to the first question. The actual written value is undefined, except that it will be as if one of the writes succeeded and all others did not.
I'm making the above statements in the context of properly aligned 8, 16, or 32 bit datatypes, which your examples are.

Reading different data types in shared memory

I want to share some memory between different processes running a DLL. Therefore i create a memory-mapped-file by HANDLE hSharedFile = CreateFileMapping(...) then LPBYTE hSharedView = MapViewOfFile(...) and LPBYTE aux = hSharedView
Now I want to read a bool, a int, a float and a char from the aux array. Reading a bool and char is easy. But how would I go around reading a int or float? Notice that the int or float could start at position 9 e.g. a position that is not dividable by 4.
I know you can read a char[4] and then memcpy it into a float or int. But i really need this to be very fast. I am wondering if it is possible to do something with pointers?
Thanks in advance
If you know, for instance, that array elements aux[13..16] contain a float, then you can access this float in several ways:
float f = *(float*)&aux[13] ; // Makes a copy. The simplest solution.
float* pf = (float*)&aux[13] ; // Here you have to use *pf to access the float.
float& rf = *(float*)&aux[13] ; // Doesn't make a copy, and is probably what you want.
// (Just use rf to access the float.)
There is nothing wrong with grabbing an int at offset 9:
int* intptr = (int*) &data[9];
int mynumber = *intptr;
There might be a really tiny performance penalty for this "unaligned" access, but it will still work correctly, and the chances of you noticing any differences are slim.
First of all, I think you should measure. There are three options you can go with that I can think of:
with unaligned memory
with memcpy into buffers
with custom-aligned memory
Unaligned memory will work fine, it will just be slower than aligned. How slower is that, and does it matter to you? Measure to find out.
Copying into a buffer will trade off the slower unaligned accesses for additional copy operations. Measuring will tell you if it's worth it.
If using unaligned memory is too slow for you and you don't want to copy data around (perhaps because of the performance cost), then you can possibly do faster by wasting some memory space and increasing your program complexity. Don't use the mapped memory blindly: round your "base" pointer upwards to a suitable value (e.g. 8 bytes) and only do reads/writes at 8-byte increments of this "base" value. This will ensure that all your accesses will be aligned.
But do measure before you go into all this trouble.