CUDA, is there an atomicRead? - c++

I am working on a CUDA program where all blocks and threads need to dynamically determine the minimum step size for an iterative problem. I want the first thread in each block to be responsible for reading the global dz value into shared memory so the rest of the threads can do a reduction on it. Meanwhile, other threads in other blocks may be writing to it. Is there simply an atomicRead option in CUDA, or something equivalent? I guess I could do an atomic add with zero or something. Or is this even necessary?
template<typename IndexOfRefractionFunct>
__global__ void _step_size_kernel(IndexOfRefractionFunct n, double* dz, double z, double cell_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= cells * cells)
        return;

    int idy = idx / cells;
    idx %= cells;

    double x = cell_size * idx;
    double y = cell_size * idy;

    __shared__ double current_dz;
    if(threadIdx.x == 0)
        current_dz = atomicRead(dz);

    ...

    atomicMin(dz, calculated_min);
}
Also, I just realized that CUDA does not seem to support atomics on doubles. Is there any way around this?

Is there simply an atomicRead option in CUDA, or something equivalent?
The idea of an atomic operation is that it allows for combining multiple operations without the possibility of intervening operations from other threads. The canonical use is for a read-modify-write. All 3 steps of the RMW operation can be performed atomically, with respect to a given location in memory, without the possibility of intervening activity from other threads.
Therefore the concept of an atomic read (only, by itself) doesn't really have meaning in this context. It is only one operation. In CUDA, all properly aligned reads of basic types (int, float, double, etc.) occur atomically, i.e. all in one operation, without the possibility of other operations affecting that read, or parts of that read.
Based on what you have shown, it seems that the correctness of your use-case should be satisfied without any special behavior on the read operation. If you simply wanted to ensure that the current_dz value gets populated from the global value, before any threads have a chance to modify it, at the block level, this can be sorted out simply with __syncthreads():
__shared__ double current_dz;
if(threadIdx.x == 0)
    current_dz = *dz;
__syncthreads(); // no threads can proceed beyond this point until
                 // thread 0 has read the value of dz

...

atomicMin(dz, calculated_min);
If you need to make sure this behavior is enforced grid-wide, then my suggestion would be to have an initial value of dz that threads don't write to, with the atomicMin operation being done on another location (i.e. separate the write/output from the read/input at the kernel level).
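As a rough sketch of that suggestion (my illustration, not part of the original answer), the kernel could read from one location and atomically reduce into another; dz_in, dz_out and the per-thread computation below are assumptions:

// Sketch only: separate the read/input (dz_in) from the write/output (dz_out).
__device__ double atomicMin(double* address, double val);   // custom double version, defined further down in this answer

__global__ void step_size_kernel(const double* dz_in, double* dz_out, double cell_size)
{
    __shared__ double current_dz;
    if(threadIdx.x == 0)
        current_dz = *dz_in;             // ordinary read; nothing in this kernel writes dz_in
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    double calculated_min = current_dz - cell_size * idx;   // placeholder for the real computation

    atomicMin(dz_out, calculated_min);   // all writes go to the separate output location
}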
But, again, I'm not suggesting this is necessary for your use-case. If you simply want to pick up the current dz value, you can do this with an ordinary read. You will get a "coherent" value. At the grid level, some number of atomicMin operations may have occurred before that read, and some may have occurred after that read, but none of them will corrupt the read, leading you to read a bogus value. The value you read will be either the initial value that was there, or some value that was properly deposited by an atomicMin operation (based on the code you have shown).
Also, I just realized that CUDA does not seem to support atomics on doubles. Is there any way around this?
CUDA has support for a limited set of atomic operations on 64-bit quantities. In particular, there is a 64-bit atomicCAS operation. The programming guide demonstrates how to use this in a custom function to achieve an arbitrary 64-bit atomic operation; its example shows how to do a double atomicAdd. Here are examples of atomicMin and atomicMax operating on double:
__device__ double atomicMax(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    while(val > __longlong_as_double(old)) {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val));
    }
    return __longlong_as_double(old);
}

__device__ double atomicMin(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    while(val < __longlong_as_double(old)) {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val));
    }
    return __longlong_as_double(old);
}
As a good programming practice, atomics should be used sparingly, although Kepler global 32-bit atomics are pretty fast. But when using these types of custom 64-bit atomics, the advice is especially applicable; they will be noticeably slower than ordinary reads and writes.
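As a usage illustration (my sketch, not from the original answer), the location reduced with the custom double atomicMin has to be initialized to a suitably large value before the kernel runs; all names and launch dimensions here are made up:

// Assumes the custom __device__ double atomicMin above is defined in the same file.
__global__ void min_step_kernel(double* dz, double cell_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    double calculated_min = cell_size / (idx + 1.0);   // placeholder per-thread value
    atomicMin(dz, calculated_min);
}

int main()
{
    double h_dz = 1e300;                               // start the minimum effectively "at infinity"
    double* d_dz;
    cudaMalloc(&d_dz, sizeof(double));
    cudaMemcpy(d_dz, &h_dz, sizeof(double), cudaMemcpyHostToDevice);

    min_step_kernel<<<64, 256>>>(d_dz, 0.01);

    cudaMemcpy(&h_dz, d_dz, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_dz);
    return 0;
}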

Related

Are there any more efficient ways for atomically adding two floats?

I have a bundle of floats which get updated by various threads. The size of the array is much larger than the number of threads, so simultaneous access to a particular float is rather rare. I need a solution for C++03.
The following code atomically adds a value to one of the floats. Assuming it works, it might be the best solution.
The only alternative I can think of is dividing the array into bunches and protecting each bunch with a mutex. But I don't expect the latter to be more efficient.
My questions are as follows. Are there any alternative solutions for adding floats atomically? Can anyone anticipate which is the most efficient? Yes, I am willing to do some benchmarks. Maybe the solution below can be improved by relaxing the memorder constraints, i.e. replacing __ATOMIC_SEQ_CST with something else. I have no experience with that.
void atomic_add_float( float *x, float add )
{
    int *ip_x = reinterpret_cast<int*>( x );                      //1
    int expected = __atomic_load_n( ip_x, __ATOMIC_SEQ_CST );     //2
    int desired;
    do {
        float sum = *reinterpret_cast<float*>( &expected ) + add; //3
        desired = *reinterpret_cast<int*>( &sum );
    } while( ! __atomic_compare_exchange_n( ip_x, &expected, desired, //4
                                            /* weak = */ true,
                                            __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST ) );
}
This works as follows. At //1 the bit pattern of x is reinterpreted as an int, i.e. I assume that float and int have the same size (32 bits). At //2 the value to be increased is loaded atomically. At //3 the bit pattern of the int is reinterpreted as a float and the summand is added. (Remember that expected contains the value found at ip_x == x.) This doesn't change the value at ip_x == x. At //4 the result of the summation is stored at ip_x == x only if no other thread changed the value in between, i.e. if expected == *ip_x (see the documentation). If this is not the case, the do-loop continues and expected contains the updated value found at ip_x == x.
GCC's functions for atomic access (__atomic_load_n and __atomic_compare_exchange_n) can easily be exchanged for other compilers' implementations.
Are there any alternative solutions for adding floats atomically? Can anyone anticipate which is the most efficient?
Sure, there are at least a few that come to mind:
Use synchronization primitives, i.e. spinlocks. These will be a bit slower than compare-exchange.
Transactional memory extensions (see Wikipedia). These will be faster, but this solution might limit portability.
Overall, your solution is quite reasonable: it is fast and yet will work on any platform.
In my opinion the needed memory orders are:
__ATOMIC_ACQUIRE -- when we read the value in __atomic_load_n()
__ATOMIC_RELEASE -- when __atomic_compare_exchange_n() succeeds
__ATOMIC_ACQUIRE -- when __atomic_compare_exchange_n() fails
To make this function more efficient, you may want to use __ATOMIC_ACQUIRE for __atomic_load_n, and __ATOMIC_RELEASE and __ATOMIC_RELAXED for the __atomic_compare_exchange_n success_memorder and failure_memorder respectively.
On x86-64, though, that does not change the generated assembly, because its memory model is relatively strong. That is unlike ARM, with its weaker memory model.
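As a sketch of that suggestion (my addition, not from the answer), the questioner's function with the weaker orderings would look like this; everything else is unchanged from the original version:

// Same CAS loop as above, with the weaker memory orders suggested in the answer.
// The reinterpret_cast type punning is kept exactly as in the question.
void atomic_add_float_relaxed( float *x, float add )
{
    int *ip_x = reinterpret_cast<int*>( x );
    int expected = __atomic_load_n( ip_x, __ATOMIC_ACQUIRE );   // acquire the current value
    int desired;
    do {
        float sum = *reinterpret_cast<float*>( &expected ) + add;
        desired = *reinterpret_cast<int*>( &sum );
    } while( ! __atomic_compare_exchange_n( ip_x, &expected, desired,
                                            /* weak = */ true,
                                            __ATOMIC_RELEASE,     // success: publish the new value
                                            __ATOMIC_RELAXED ) ); // failure: expected is refreshed for the retry
}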

ULP comparison code

The following code snippet is scattered all over the web and seems to be used in multiple different projects with very few changes:
union Float_t {
    Float_t(float num = 0.0f) : f(num) {}
    // Portable extraction of components.
    bool Negative() const { return (i >> 31) != 0; }
    int RawMantissa() const { return i & ((1 << 23) - 1); }
    int RawExponent() const { return (i >> 23) & 0xFF; }

    int i;
    float f;
};
inline bool AlmostEqualUlpsAndAbs(float A, float B, float maxDiff, int maxUlpsDiff)
{
    // Check if the numbers are really close -- needed
    // when comparing numbers near zero.
    float absDiff = std::fabs(A - B);
    if (absDiff <= maxDiff)
        return true;

    Float_t uA(A);
    Float_t uB(B);

    // Different signs means they do not match.
    if (uA.Negative() != uB.Negative())
        return false;

    // Find the difference in ULPs.
    return (std::abs(uA.i - uB.i) <= maxUlpsDiff);
}
However, I don't understand what is going on here. To my (maybe naive) understanding, the floating-point member variable f is initialized in the constructor, but the integer member i is not.
I'm not terribly familiar with the binary operators that are used here, but I fail to understand how accesses of uA.i and uB.i produce anything but random numbers, given that no line in the code actually connects the values of f and i in any meaningful way.
If somebody could enlighten me on why (and how) exactly this code produces the desired result, I would be very delighted!
A lot of undefined behaviour is being exploited here. The first assumption is that the fields of a union can be accessed in place of each other, which is, in itself, UB. Furthermore, the author assumes that sizeof(int) == sizeof(float), that floats have a given length of mantissa and exponent, that all union members start at offset zero, and that the binary representation of float coincides with the binary representation of int in a very specific way. In short, this will work as long as you're on x86, have specific int and float types, and you say a prayer at every sunrise and sunset.
What you probably didn't note is that this is a union, so int i and float f are usually laid out in a specific manner in a common block of memory by most compilers. This is, in general, still UB, and you can't even safely assume that the same physical bits of memory will be used without restricting yourself to a specific compiler and a specific architecture. All that's guaranteed is that the address of both members will be the same (though there might be alignment and/or typedness issues). Assuming that your compiler uses the same physical bits (which is by no means guaranteed by the standard), and that both members start at offset 0 and have the same size, then i will represent the binary storage format of f... as long as nothing changes in your architecture.

Word of advice? Do not use this unless you have to. Stick to floating-point operations for AlmostEquals(); you can implement it that way. This kind of speciality is the very final pass of optimization, usually done in a separate branch; you shouldn't plan your code around it.
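For comparison (my addition, not part of either answer), the same bit pattern can be obtained without reading an inactive union member by copying the bytes with std::memcpy, which is the well-defined way to reinterpret an object's representation; float_bits is a made-up helper name:

#include <cstdint>
#include <cstring>

// Sketch, assuming a 32-bit IEEE 754 float: copy the float's bytes into an
// integer instead of reading a different member of a union.
inline std::int32_t float_bits(float f)
{
    static_assert(sizeof(std::int32_t) == sizeof(float), "size mismatch");
    std::int32_t i;
    std::memcpy(&i, &f, sizeof(f));   // plain byte copy; no aliasing violation
    return i;
}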

CUDA Non Atomic Write clash results

I am writing a function which needs to iterate until completion. I realise that I can use atomic operators, but speed is critical in this kernel and I suspect they may not be required.
I have included a small piece of pseudo-code to demonstrate what I am intending to do:
__global__ void TestKernel()
{
    __shared__ bool lbRepeat;
    do
    {
        lbRepeat = false;
        __syncthreads();
        if(Condition == true)
            lbRepeat = true;
        __syncthreads();
    }
    while(lbRepeat);
}
If no thread has found the Condition to be true lbRepeat will be false.
If one thread has found the Condition to be true lbRepeat will be true.
What will the result be if multiple threads write true into lbRepeat at the same time?
I would like to extend this to copying integer values (unsigned 16-bit specifically). As well as checking the condition, I would like to copy an unsigned 16-bit integer.
__global__ void TestKernel()
{
    __shared__ unsigned short liValues[32*8];
    __shared__ bool lbRepeat;
    unsigned long tid = threadIdx.x + threadIdx.y * blockDim.x;
    do
    {
        lbRepeat = false;
        __syncthreads();
        if(Condition == true)
        {
            liValues[tid] = liValues[Some_Value_In_Range];
            lbRepeat = true;
        }
        __syncthreads();
    }
    while(lbRepeat);
}
If another thread is writing to the memory location as it is read, could this cause neither the previous value nor the new value to be returned? I do not mind if either the previous or the new value is returned (both would be valid), but a mixture of the bits of each would cause problems.
I thought this wouldn't be acceptable, but my testing seems to indicate that it works as desired. Is this because unsigned short copies are atomic in CUDA?
In Summary:
What is the result if two threads write the same value into one boolean memory location?
Can reading from an unsigned short memory location, as another thread is writing a new value to the same location, return a value which is neither the previous value nor the new value in that memory location?
What is the result if two threads write the same value into one boolean memory location?
The end result will be that one of the written values will end up in that memory location. Which value is undefined. If all written values are the same, you can be sure that value will end up in that location.
Can reading from an unsigned short memory location, as another thread is writing a new value to the same location, return a value which is neither the previous value nor the new value in that memory location?
Assuming these are the only two operations going on (one write, and one read), no. The read value will be either the value before the write has begun or the value after the write is complete. If you have multiple writes going on, then of course see the answer to the first question. The actual written value is undefined, except that it will be as if one of the writes succeeded and all others did not.
I'm making the above statements in the context of properly aligned 8-, 16-, or 32-bit data types, which is what your examples use.

Is global memory write considered atomic in CUDA?

Is global memory write considered atomic or not in CUDA?
Considering the following CUDA kernel code:
int idx = blockIdx.x*blockDim.x+threadIdx.x;
int gidx = idx%1000;
globalStorage[gidx] = somefunction(idx);
Is the global memory write to globalStorage atomic? That is, are there no race conditions such that concurrent kernel threads write to the bytes of the same variable stored in globalStorage, which could mess the results up (e.g. partial writes)?
Note that I am not talking about atomic operations like add/sub/bit-wise etc here, just straight global write.
Edited: Rewrote the example code to avoid confusion.
Memory accesses in CUDA are not implicitly atomic. However, the code you originally showed isn't intrinsically a memory race, as long as idx has a unique value for each thread in the running kernel.
So your original code:
int idx = blockIdx.x*blockDim.x+threadIdx.x;
globalStorage[idx] = somefunction(idx);
would be safe if the kernel launch uses a 1D grid and globalStorage is suitably sized, whereas your second version:
int idx = blockIdx.x*blockDim.x+threadIdx.x;
int gidx = idx%1000;
globalStorage[gidx] = somefunction(idx);
would not be, because multiple threads could potentially write to the same entry in globalStorage. There are no atomic protections or serialisation mechanisms which would produce predictable results in such a case.
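As an illustration of that point (my addition, not from the answer): if the colliding writes in the second version were actually meant to combine values that map to the same gidx, an explicit atomic operation would make the result well defined; somefunction here is just a placeholder:

// Sketch: resolve the write collision on gidx with an atomic reduction.
__device__ int somefunction(int idx) { return idx * 2; }   // placeholder per-thread value

__global__ void kernel(int* globalStorage)
{
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    int gidx = idx % 1000;

    // Plain "globalStorage[gidx] = somefunction(idx);" is a race here;
    // an atomic combines the colliding contributions instead of keeping just one of them.
    atomicAdd(&globalStorage[gidx], somefunction(idx));
}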

Atomic counter in gcc

I must be just having a moment, because this should be easy but I can't seem to get it working right.
What's the correct way to implement an atomic counter in GCC?
i.e. I want a counter that runs from zero to 4 and is thread-safe.
I was doing this (which is further wrapped in a class, but not here):
static volatile int _count = 0;
const int limit = 4;

int get_count(){
    // Create a local copy of diskid
    int save_count = __sync_fetch_and_add(&_count, 1);
    if (save_count >= limit){
        __sync_fetch_and_and(&_count, 0); // Set it back to zero
    }
    return save_count;
}
But it's running from 1 to 4 inclusive and then wrapping around to zero. It should go from 0 to 3. Normally I'd do a counter with a mod operator, but I don't know how to do that safely.
Perhaps this version is better. Can you see any problems with it, or offer a better solution?
int get_count(){
    // Create a local copy of diskid
    int save_count = _count;
    if (save_count >= limit){
        __sync_fetch_and_and(&_count, 0); // Set it back to zero
        return 0;
    }
    return save_count;
}
Actually, I should point out that it's not absolutely critical that each thread gets a different value. If two threads happened to read the same value at the same time, that wouldn't be a problem. But the value can't exceed limit at any time.
Your code isn't atomic (and your second get_count doesn't even increment the counter value)!
Say count is 3 at the start and two threads simultaneously call get_count. One of them will get its atomic add done first, incrementing count to 4. If the second thread is fast enough, it can increment it to 5 before the first thread resets it to zero.
Also, in your wraparound processing, you reset count to 0 but not save_count. This is clearly not what's intended.
This is easiest if limit is a power of 2. Don't ever do the reduction yourself, just use
return (unsigned) __sync_fetch_and_add(&count, 1) % (unsigned) limit;
or alternatively
return __sync_fetch_and_add(&count, 1) & (limit - 1);
This only does one atomic operation per invocation, is safe and very cheap. For generic limits, you can still use %, but that will break the sequence if the counter ever overflows. You can try using a 64-bit value (if your platform supports 64-bit atomics) and just hope it never overflows; this is a bad idea though. The proper way to do this is using an atomic compare-exchange operation. You do this:
int old_count, new_count;
do {
    old_count = count;
    new_count = old_count + 1;
    if (new_count >= limit) new_count = 0; // or use %
} while (!__sync_bool_compare_and_swap(&count, old_count, new_count));
This approach generalizes to more complicated sequences and update operations too.
That said, this type of lockless operation is tricky to get right, relies on undefined behavior to some degree (all current compilers get this right, but no C/C++ standard before C++0x actually has a well-defined memory model) and is easy to break. I recommend using a simple mutex/lock unless you've profiled it and found it to be a bottleneck.
You're in luck, because the range you want happens to fit into exactly 2 bits.
Easy solution: Let the volatile variable count up forever. But after you read it, use just the lowest two bits (val & 3). Presto, atomic counter from 0-3.
It's impossible to create anything atomic in pure C, even with volatile. You need asm. C1x will have special atomic types, but until then you're stuck with asm.
You have two problems.
__sync_fetch_and_add will return the previous value (i.e., before adding one). So at the step where _count becomes 3, your local save_count variable is getting back 2. So you actually have to increment _count up to 4 before it'll come back as a 3.
But even on top of that, you're specifically looking for it to be >= 4 before you reset it back to 0. That's just a question of using the wrong limit if you're only looking for it to get as high as three.
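Putting the answers above together (my sketch, not from any of them), a corrected get_count needs only one atomic per call when limit is a power of two; the variable names follow the question, with the counter made unsigned so wraparound at the top of the range is well defined:

// Sketch based on the advice above: keep one ever-increasing counter and mask
// it down to 0..3 (limit - 1) on the way out, using a single atomic operation.
static volatile unsigned int _count = 0;
static const unsigned int limit = 4;          // must be a power of two for the mask

unsigned int get_count()
{
    return __sync_fetch_and_add(&_count, 1u) & (limit - 1);
}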