I am writing a function which needs to iterate until completion. I realise that I can use atomic operations, but speed is critical in this kernel and I suspect they may not be required.
I have included a small piece of pseudo-code to demonstrate what I am intending to do:
__global__ void TestKernel()
{
    __shared__ bool lbRepeat;        // single flag shared by the whole block
    do
    {
        lbRepeat = false;            // every thread writes the same value
        __syncthreads();
        if(Condition) lbRepeat = true;   // any number of threads may set it
        __syncthreads();
    }
    while(lbRepeat);
}
If no thread has found the Condition to be true lbRepeat will be false.
If one thread has found the Condition to be true lbRepeat will be true.
What will the result be if multiple threads write true into lbRepeat at the same time?
I would like to extend this to copying integer values (unsigned 16-bit specifically): as well as checking the condition, each thread would copy an unsigned 16-bit value.
__global__ void TestKernel()
{
    __shared__ unsigned short liValues[32*8];
    __shared__ bool lbRepeat;
    unsigned long tid = threadIdx.x + threadIdx.y*blockDim.x;
    do
    {
        lbRepeat = false;
        __syncthreads();
        if(Condition)
        {
            liValues[tid] = liValues[Some_Value_In_Range];   // 16-bit copy within shared memory
            lbRepeat = true;
        }
        __syncthreads();
    }
    while(lbRepeat);
}
If another thread is writing to the memory as it is read, could this return a value which is neither the previous value nor the new value? I do not mind whether the previous or the new value is returned (both will be valid), but a mixture of the bits of each would cause problems.
I thought this wouldn't be acceptable, but my testing seems to indicate that it works as desired. Is this because unsigned short copies are atomic in CUDA?
In Summary:
What is the result if two threads write the same value into one boolean memory location?
Can reading from an unsigned short memory location as another thread is writing a new value to the same location return a value which is neither the previous value nor the new value in that memory location?
What is the result if two threads write the same value into one boolean memory location?
The end result will be that one of the written values will end up in that memory location. Which value is undefined. If all written values are the same, you can be sure that value will end up in that location.
Can reading from an unsigned short memory location as another thread is writing a new value to the same location return a value which is neither the previous value nor the new value in that memory location?
Assuming these are the only two operations going on (one write, and one read), no. The read value will be either the value before the write has begun or the value after the write is complete. If you have multiple writes going on, then of course see the answer to the first question. The actual written value is undefined, except that it will be as if one of the writes succeeded and all others did not.
I'm making the above statements in the context of properly aligned 8, 16, or 32 bit datatypes, which your examples are.
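For illustration, here is a minimal sketch of such a block-wide repeat loop. The predicate is a made-up "propagate the block-wide maximum" rule standing in for your Condition, and a 32x8 block is assumed; the point is only to show the shared flag and the aligned 16-bit copies in action, not to reproduce your kernel:

__global__ void PropagateMaxKernel(unsigned short* data)   // launch as <<<1, dim3(32, 8)>>>
{
    __shared__ bool lbRepeat;                    // single flag shared by the whole block
    __shared__ unsigned short liValues[32*8];

    unsigned int tid = threadIdx.x + threadIdx.y * blockDim.x;
    liValues[tid] = data[tid];
    unsigned int neighbour = (tid + 1) % (32*8);

    do
    {
        __syncthreads();                         // nobody may still be reading the flag
        if(tid == 0) lbRepeat = false;           // one thread resets it
        __syncthreads();

        unsigned short other = liValues[neighbour];  // 16-bit read: old or new value, never torn
        if(other > liValues[tid])                // made-up condition: neighbour holds a larger value
        {
            liValues[tid] = other;               // 16-bit write to this thread's own slot
            lbRepeat = true;                     // many threads may write true; any one suffices
        }
        __syncthreads();
    }
    while(lbRepeat);

    data[tid] = liValues[tid];                   // every slot now holds the block-wide maximum
}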
I've been trying to build a simple (i.e. inefficient) MPMC queue using only std C++ but I'm having trouble getting the underlying array to synchronize between threads. A simplified version of the queue is:
#include <atomic>
#include <thread>

constexpr int POISON = 5000;

class MPMC {
    std::atomic_int mPushStartCounter;
    std::atomic_int mPushEndCounter;
    std::atomic_int mPopCounter;
    static constexpr int Size = 1 << 20;
    int mData[Size];
public:
    MPMC() {
        mPushStartCounter.store(0);
        mPushEndCounter.store(-1);
        mPopCounter.store(0);
        for (int i = 0; i < Size; i++) {
            // preset data with a poison flag to
            // detect race conditions
            mData[i] = POISON;
        }
    }

    void push(int x) {
        int index = mPushStartCounter.fetch_add(1);
        mData[index] = x; // Race condition
        std::atomic_thread_fence(std::memory_order_release);
        int expected = index - 1;
        while (!mPushEndCounter.compare_exchange_strong(expected, index, std::memory_order_acq_rel)) {
            std::this_thread::yield();
        }
    }

    int pop() {
        int index = mPopCounter.load();
        if (index <= mPushEndCounter.load(std::memory_order_acquire) &&
            mPopCounter.compare_exchange_strong(index, index + 1, std::memory_order_acq_rel)) {
            return mData[index]; // race condition
        } else {
            return pop();
        }
    }
};
It uses three atomic variables for synchronization:
mPushStartCounter that is used by push(int) to determine which location to write to.
mPushEndCounter that is used to signal to pop() that push(int) has finished writing up to that point in the array.
mPopCounter that is used by pop() to prevent double pops from occurring.
In push(), between writing to the array mData and updating mPushEndCounter I've put a release barrier in an attempt to force synchronization of the mData array.
The way I understood cppreference, this should force fence-atomic synchronization, where
the CAS in push() is an 'atomic store X',
the load of mPushEndCounter in pop() is an 'atomic acquire operation Y',
the release fence 'F' in push() is 'sequenced-before X'.
In that case, cppreference states that
In this case, all non-atomic and relaxed atomic stores that are sequenced-before F in thread A will happen-before all non-atomic and relaxed atomic loads from the same locations made in thread B after Y.
I interpreted this to mean that the write to mData from push() would be visible in pop(). This is, however, not the case: sometimes pop() reads uninitialized data. I believe this to be a synchronization issue, because if I check the contents of the queue afterwards, or via a breakpoint, it contains the correct data.
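To check that I am reading the rule right, here is the fence-atomic pattern in isolation, as a minimal standalone sketch (not the queue itself):

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                    // plain, non-atomic payload
std::atomic<int> ready{0};       // the atomic the fence pairs with

void writer() {
    data = 42;                                            // non-atomic store, sequenced before F
    std::atomic_thread_fence(std::memory_order_release);  // F
    ready.store(1, std::memory_order_relaxed);            // X
}

void reader() {
    while (ready.load(std::memory_order_acquire) != 1) {} // Y, observes the value written by X
    assert(data == 42);                                    // the non-atomic store must be visible here
}

int main() {
    std::thread a(writer), b(reader);
    a.join();
    b.join();
}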
I am using clang 6.0.1, and g++ 7.3.0.
I tried looking at the generated assembly, but it looks correct to me: the write to the array is followed by a lock cmpxchg and the read is preceded by a check on the same variable, which to the best of my limited knowledge should work as expected on x64 because:
Loads are not reordered with other loads, hence the load from the array cannot speculate ahead of reading the atomic counter.
Stores are not reordered with other stores, hence the cmpxchg always comes after the store to the array.
lock cmpxchg flushes the write buffer, cache, etc. Therefore, if another thread observes it as finished, one can rely on cache coherency to guarantee that the write to the array has finished. I am not too sure this last point is correct, however.
I've posted a runnable test on Github. The test code involves 16 threads, half of which push the numbers 0 to 4999 and the other half of which read back 5000 elements each. It then combines the results of all the readers and checks that we've seen all the numbers in [0, 4999] exactly 8 times (which fails), and scans the underlying array once more to see if it contains all the numbers in [0, 4999] exactly 8 times (which succeeds).
Assume that we have 2^10 CUDA cores and 2^20 data points. I want a kernel that will process these points and will provide true/false for each of them. So I will have 2^20 bits. Example:
bool f(int x) { return x % 2 ? true : false; }
void kernel(int* input, byte* output)
{
    tidx = thread.x ...

    // option 1: one byte per result, written straight to global memory
    output[tidx] = f(input[tidx]);

    // ...or option 2: stage the results in shared memory and reduce per block...
    sharedarr[tidx] = f(input[tidx]);
    sync();
    output[blockidx] = reduce(sharedarr);

    // ...or option 3: pack the bits with an atomic OR...
    atomic_result |= f(input[tidx]) << tidx;
    sync();
    output[blockidx] = atomic_result;
}
Thrust/CUDA provides some algorithms, such as partitioning and transformation, which offer similar alternatives.
My question is, when I write the relevant CUDA kernel with a predicate that provides the corresponding bool result:
should I use one byte for each result and directly store the result in the output array, performing one step for the calculation and another step for reduction/partitioning later?
should I compact the output in shared memory, using one byte for 8 threads, and then at the end write the result from shared memory to the output array?
should I use atomic variables?
What's the best way to write such a kernel, and what is the most logical data structure for the results? Is it better to use more memory and simply do more writes to main memory, instead of trying to compact the result before writing it back to the result memory area?
There is no tradeoff between speed and data size when using the __ballot() warp-voting intrinsic to efficiently pack the results.
Assuming that you can redefine output to be of uint32_t type, and that your block size is a multiple of the warp size (32), you can simply store the packed output using
output[tidx / warpSize] = __ballot(f(input[tidx]));
Note this makes all threads of the warp try to store the result of __ballot(). Only one thread of the warp will succeed, but as their results are all identical, it does not matter which one does.
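Putting it together, a minimal self-contained sketch might look like the following. It assumes the newer __ballot_sync form (CUDA 9.0 and later) with a full-warp participation mask, a block size that is a multiple of 32, and a grid that exactly covers the input; on older toolkits the plain __ballot(pred) shown above is the equivalent:

#include <cstdint>

__device__ bool f(int x) { return (x % 2) != 0; }

__global__ void pack_results(const int* input, uint32_t* output)
{
    unsigned tidx = blockIdx.x * blockDim.x + threadIdx.x;
    // every active lane gets the same 32-bit word of packed predicate results
    uint32_t votes = __ballot_sync(0xffffffffu, f(input[tidx]));
    // all 32 lanes store the identical value; any one of the stores winning is fine
    output[tidx / warpSize] = votes;
}

For the 2^20 points in the question this would be launched as, for example, pack_results<<<(1 << 20) / 256, 256>>>(d_input, d_output), producing 2^15 packed 32-bit words.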
I am working on a CUDA program where all blocks and threads need to dynamically determine the minimum step size for an iterative problem. I want the first thread in the block to be responsible for reading the global dz value into shared memory so the rest of the threads can do a reduction on it. Meanwhile, other threads in other blocks may be writing to it. Is there simply an atomicRead option in CUDA, or something equivalent? I guess I could do an atomic add with zero or something. Or is this even necessary?
template<typename IndexOfRefractionFunct>
__global__ void _step_size_kernel(IndexOfRefractionFunct n, double* dz, double z, double cell_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= cells * cells)
        return;

    int idy = idx / cells;
    idx %= cells;

    double x = cell_size * idx;
    double y = cell_size * idy;

    __shared__ double current_dz;
    if(threadIdx.x == 0)
        current_dz = atomicRead(dz);

    ...

    atomicMin(dz, calculated_min);
}
Also, I just realized that CUDA does not seem to support atomics on doubles. Any way around this?
Is there simply an atomicRead option in CUDA or something equivalent?
The idea of an atomic operation is that it allows for combining multiple operations without the possibility of intervening operations from other threads. The canonical use is for a read-modify-write. All 3 steps of the RMW operation can be performed atomically, with respect to a given location in memory, without the possibility of intervening activity from other threads.
Therefore the concept of an atomic read (only, by itself) doesn't really have meaning in this context. It is only one operation. In CUDA, all properly aligned reads of basic types (int, float, double, etc.) occur atomically, i.e. all in one operation, without the possibility of other operations affecting that read, or parts of that read.
Based on what you have shown, it seems that the correctness of your use-case should be satisfied without any special behavior on the read operation. If you simply wanted to ensure that the current_dz value gets populated from the global value, before any threads have a chance to modify it, at the block level, this can be sorted out simply with __syncthreads():
__shared__ double current_dz;
if(threadIdx.x == 0)
    current_dz = *dz;    // ordinary read of the global value
__syncthreads();         // no threads can proceed beyond this point until
                         // thread 0 has read the value of dz
...

atomicMin(dz, calculated_min);
If you need to make sure this behavior is enforced grid-wide, then my suggestion would be to have an initial value of dz that threads don't write to, followed by the atomicMin operation being done on another location (ie. separate the write/output from the read/input at the kernel level).
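A hypothetical sketch of that separation (the names dz_in/dz_out are made up, the per-thread work is a placeholder, and the double overload of atomicMin is the custom one given further down in this answer):

__global__ void step_size_pass(const double* dz_in, double* dz_out)
{
    __shared__ double current_dz;
    if(threadIdx.x == 0)
        current_dz = *dz_in;            // ordinary read; dz_in is never written during this launch
    __syncthreads();

    double calculated_min = current_dz; // placeholder for the real per-thread computation
    atomicMin(dz_out, calculated_min);  // all writes go to the separate output location
}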
But, again, I'm not suggesting this is necessary for your use-case. If you simply want to pick up the current dz value, you can do this with an ordinary read. You will get a "coherent" value. At the grid level, some number of atomicMin operations may have occurred before that read, and some may have occurred after that read, but none of them will corrupt the read, leading you to read a bogus value. The value you read will be either the initial value that was there, or some value that was properly deposited by an atomicMin operation (based on the code you have shown).
Also, I just realized that CUDA does not seem to support atomics on doubles. Any way around this?
CUDA has support for a limited set of atomic operations on 64-bit quantities. In particular, there is a 64-bit atomicCAS operation. The programming guide demonstrates how to use this in a custom function to achieve an arbitrary 64 bit atomic operation (e.g. 64-bit atomicMin on a double quantity). The example in the programming guide describes how to do a double atomicAdd operation. Here are examples of atomicMin and atomicMax operating on double:
__device__ double atomicMax(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    while(val > __longlong_as_double(old)) {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val));
    }
    return __longlong_as_double(old);
}

__device__ double atomicMin(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    while(val < __longlong_as_double(old)) {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val));
    }
    return __longlong_as_double(old);
}
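If you want to convince yourself that these behave as expected, a small hypothetical harness along the following lines can be used (it assumes the two functions above are in the same file and that the device supports managed memory):

#include <cstdio>
#include <cfloat>

__global__ void min_test(double* dz)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    double my_value = 1.0 + idx * 0.5;   // made-up per-thread candidate minimum
    atomicMin(dz, my_value);             // resolves to the custom double overload above
}

int main()
{
    double* dz;
    cudaMallocManaged(&dz, sizeof(double));
    *dz = DBL_MAX;                        // start above any candidate
    min_test<<<4, 256>>>(dz);
    cudaDeviceSynchronize();
    printf("grid minimum = %f\n", *dz);   // expect 1.0
    cudaFree(dz);
    return 0;
}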
As a good programming practice, atomics should be used sparingly, although Kepler global 32-bit atomics are pretty fast. But when using these types of custom 64-bit atomics, the advice is especially applicable; they will be noticeably slower than ordinary reads and writes.
In GotW #45, Herb Sutter states the following:
void String::AboutToModify(
    size_t n,
    bool bMarkUnshareable /* = false */
) {
    if( data_->refs > 1 && data_->refs != Unshareable ) {
        /* ... etc. ... */
This if-condition is not thread-safe. For one thing, evaluating even "data_->refs > 1" may not be atomic; if so, it's possible that if thread 1 tries to evaluate "data_->refs > 1" while thread 2 is updating the value of refs, the value read from data_->refs might be anything -- 1, 2, or even something that's neither the original value nor the new value.
Additionally, he points out that data_->refs may be modified in between comparing with 1 and comparing with Unshareable.
Further down, we find a solution:
void String::AboutToModify(
    size_t n,
    bool bMarkUnshareable /* = false */
) {
    int refs = IntAtomicGet( data_->refs );
    if( refs > 1 && refs != Unshareable ) {
        /* ... etc. ...*/
Now, I understand that the same refs is used for both comparisons, solving problem 2. But why the IntAtomicGet? I have turned up nothing in searches on the topic - all atomic operations focus on Read, Modify, Write operations, and here we just have a read. So can we just do...
int refs = data_->refs;
...which should probably just be one instruction in the end anyway?
Different platforms make different promises about the atomicity of read/write operations. x86, for example, guarantees that reading an aligned double word (4 bytes) is an atomic operation. However, you cannot assume that this will be true for every architecture, and it probably will not be.
If you plan to port your code to different platforms, such assumptions could get you into trouble and lead to strange race conditions in your code. Therefore, it's better to protect yourself and make read/write operations explicitly atomic.
Reading from shared memory (data_->refs) while another thread writes to it is the definition of a data race.
What happens when we non-atomically read from data_->refs while another thread is trying to write to it at the same time?
Imagine that thread A is doing ++data_->refs (write) while thread B is doing int x = data_->refs (read). Imagine that thread B reads the first few bytes from data_->refs and that thread A finishes writing its value to data_->refs before thread B is done reading. Thread B then reads the rest of the bytes at data_->refs.
You will get neither the original value, nor the new value; you will get a completely different value! This scenario is just to illustrate what is meant by:
[...] the value read from data_->refs might be anything -- 1, 2, or
even something that's neither the original value nor the new value.
The purpose of atomic operations is to ensure that an operation is indivisible: it is either observed as done or not done. Therefore, we use an atomic read operation to ensure that we get the value of data_->refs either before it is updated, or after (this depends on thread timings).
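For what it's worth, in modern C++ the role of IntAtomicGet is played by std::atomic, which did not exist when the article was written. A minimal sketch (StringBuf, Unshareable, and must_copy_before_modify are made-up names; the point is only that .load() is an indivisible read):

#include <atomic>
#include <climits>

struct StringBuf {
    std::atomic<unsigned> refs{1};
    static constexpr unsigned Unshareable = UINT_MAX;  // stand-in for the article's flag value
};

bool must_copy_before_modify(const StringBuf& data)
{
    unsigned refs = data.refs.load();   // atomic read: never a torn value
    return refs > 1 && refs != StringBuf::Unshareable;
}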
I have some questions about this Boost spinlock code:
class spinlock
{
public:
    spinlock()
        : v_(0)
    {
    }

    bool try_lock()
    {
        long r = InterlockedExchange(&v_, 1);
        _ReadWriteBarrier(); // 1. What does this mean?
        return r == 0;
    }

    void lock()
    {
        for (unsigned k = 0; !try_lock(); ++k)
        {
            yield(k);
        }
    }

    void unlock()
    {
        _ReadWriteBarrier();
        *const_cast<long volatile*>(&v_) = 0;
        // 2. Why is InterlockedExchange(&v_, 0) not needed here?
    }

private:
    long v_;
};
_ReadWriteBarrier() is a "memory barrier" (in this case for both reads and writes): a special instruction to the processor to ensure that any instructions resulting in memory operations have completed (load and store operations, or, for example on x86 processors, any operation which has a memory operand on either side). In this particular case, it makes sure that the InterlockedExchange(&v_, 1) has completed before we continue.
Because an InterlockedExchange would be less efficient: it takes more interaction with the other cores in the machine to ensure that they have all "let go" of the value, which makes no sense here, since (in correctly working code) we only unlock if we actually hold the lock, so no other processor will have a different value cached than what we're writing anyway. A plain volatile write to the memory will be just as good.
The barriers are there to ensure memory synchronization; without them, different threads may see modifications of memory in different orders.
And the InterlockedExchange isn't necessary in the second case because we're not interested in the previous value. The role of InterlockedExchange is doubtless to set the value and return the previous value. (And why v_ would be long, when it can only take the values 0 and 1, is beyond me.)
There are three issues with atomic access to variables:
First, ensuring that there is no thread switch in the middle of reading or writing a value; if this happens it's called "tearing", and the second thread can see a partly written value, which will usually be nonsensical.
Second, ensuring that all processors see the change that is being made with a write, or that the processor reading a value sees any previous changes to that value; this is called "cache coherency".
Third, ensuring that the compiler doesn't move code across the read or write; this is called "code motion".
InterlockedExchange does the first two; although the MSDN documentation is rather muddled, _ReadWriteBarrier does the third, and possibly the second.
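For comparison, here is a sketch of the same lock written against std::atomic (C++11 and later). This is only an illustration, not the Boost implementation; the memory orders take over the roles of the interlocked exchange and the barriers discussed above:

#include <atomic>
#include <thread>

class spinlock {
    std::atomic<long> v_{0};
public:
    bool try_lock() {
        // acquire plays the role of the _ReadWriteBarrier after InterlockedExchange
        return v_.exchange(1, std::memory_order_acquire) == 0;
    }
    void lock() {
        for (unsigned k = 0; !try_lock(); ++k)
            std::this_thread::yield();
    }
    void unlock() {
        // release plays the role of the _ReadWriteBarrier before the plain volatile write
        v_.store(0, std::memory_order_release);
    }
};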