Is global memory write considered atomic or not in CUDA?
Consider the following CUDA kernel code:
int idx = blockIdx.x*blockDim.x+threadIdx.x;
int gidx = idx%1000;
globalStorage[gidx] = somefunction(idx);
Is the global memory write to globalStorage atomic? That is, are there no race conditions in which concurrent kernel threads write to the bytes of the same variable stored in globalStorage, which could mess the results up (e.g. partial writes)?
Note that I am not talking about atomic operations like add/sub/bit-wise etc. here, just a straight global write.
Edited: Rewrote the example code to avoid confusion.
Memory accesses in CUDA are not implicitly atomic. However, the code you originally showed isn't intrinsically a memory race as long as idx has a unique value for each thread in the running kernel.
So your original code:
int idx = blockIdx.x*blockDim.x+threadIdx.x;
globalStorage[idx] = somefunction(idx);
would be safe if the kernel launch uses a 1D grid and globalStorage is suitably sized, whereas your second version:
int idx = blockIdx.x*blockDim.x+threadIdx.x;
int gidx = idx%1000;
globalStorage[gidx] = somefunction(idx);
would not be, because multiple threads could potentially write to the same entry in globalStorage. There are no atomic protections or serialisation mechanisms that would produce predictable results in such a case.
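For concreteness, a minimal sketch of the safe pattern described above (the kernel name, the int element type, and the launch configuration are illustrative assumptions):

// A stand-in for whatever per-thread computation the question's somefunction does:
__device__ int somefunction(int idx) { return idx * 2; }

__global__ void writeKernel(int* globalStorage, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                                // guard the stray threads of the last block
        globalStorage[idx] = somefunction(idx); // idx is unique per thread, so no two
                                                // threads ever write the same element
}

// Host side: launch a 1D grid with globalStorage holding at least n elements, e.g.
// writeKernel<<<(n + 255) / 256, 256>>>(d_storage, n);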
Related
I read in the en.cppreference.com documentation that relaxed operations on atomics:
"[...] only guarantee atomicity and modification order consistency."
So I was asking myself whether such 'modification order' holds only within a single atomic variable, or also across different ones.
In my code I have an atomic tree, where a low-priority, event-based message thread marks which node should be updated, storing some data in the red '1' atomic (see picture) using memory_order_relaxed. Then it continues by writing into the node's parent with fetch_or, to record which child atomic has been updated. Each atomic supports up to 64 bits, so I set bit 1 in the red operation '2'. It continues like this up to the root atomic, which is also flagged with fetch_or, but this time using memory_order_release.
Then a fast, real-time, unblockable thread loads the control atomic (with memory_order_acquire) and reads which bits are enabled. Then it recursively updates the child atomics with memory_order_relaxed. And that is how I sync my data with each cycle of the high-priority thread.
Since this thread does the updating, it is fine if child atomics are stored before their parent. The problem is if a parent is stored (with the bit of the child to update filled in) before the child's information is filled.
In other words, as the title says, can the relaxed stores be reordered among themselves before the release one? I don't mind if non-atomic variables are reordered. Pseudo-code, supposing [x, y, z, control] are atomic and have initial values of 0:
Event thread:
z = 1; // relaxed
y = 1; // relaxed
x = 1; // relaxed
control = 0; // release
Real time thread (loop):
load control; // acquire
load x; // relaxed
load y; // relaxed
load z; // relaxed
I wonder whether, in the real-time thread, x <= y <= z would always hold. To check that, I wrote this small program:
#define _ENABLE_ATOMIC_ALIGNMENT_FIX 1
#include <atomic>
#include <iostream>
#include <thread>
#include <assert.h>
#include <array>

using namespace std;

constexpr int numTries = 10000;
constexpr int arraySize = 10000;

array<atomic<int>, arraySize> tat;
atomic<int> tsync {0};

void writeArray()
{
    // Stores atomics in reverse order
    for (int j=0; j!=numTries; ++j)
    {
        for (int i=arraySize-1; i>=0; --i)
        {
            tat[i].store(j, memory_order_relaxed);
        }
        tsync.store(0, memory_order_release);
    }
}

void readArray()
{
    // Loads atomics in normal order
    for (int j=0; j!=numTries; ++j)
    {
        bool readFail = false;
        tsync.load(memory_order_acquire);
        int minValue = 0;
        for (int i=0; i!=arraySize; ++i)
        {
            int newValue = tat[i].load(memory_order_relaxed);
            // If it fails, it stops the execution
            if (newValue < minValue)
            {
                readFail = true;
                cout << "fail " << endl;
                break;
            }
            minValue = newValue;
        }
        if (readFail) break;
    }
}

int main()
{
    for (int i=0; i!=arraySize; ++i)
    {
        tat[i].store(0);
    }
    thread b(readArray);
    thread a(writeArray);
    a.join();
    b.join();
}
How it works: there is an array of atomics. One thread stores to them with relaxed ordering in reverse order, and finishes by storing to a control atomic with release ordering.
The other thread loads that control atomic with acquire ordering, then loads the rest of the values of the array with relaxed ordering. Since the parents mustn't be updated before the children, newValue should always be greater than or equal to the previous value (minValue in the code).
I've executed this program on my computer several times, in debug and release, and it never triggers the failure. I'm using a normal x64 Intel i7 processor.
So, is it safe to assume that relaxed stores to multiple atomics do keep the 'modification order', at least when they are synchronized with a control atomic and acquire/release?
Sadly, you will learn very little about what the Standard supports by experimenting on x86_64, because x86_64 is so well-behaved. In particular, unless you specify _seq_cst:
all reads are effectively _acquire
all writes are effectively _release
unless they cross a cache-line boundary. And:
all read-modify-write are effectively seq_cst
Except that the compiler is (also) allowed to re-order _relaxed operations.
You mention using _relaxed fetch_or... and if I understand correctly, you may be disappointed to learn that it is no less expensive than _seq_cst, and requires a LOCK-prefixed instruction, carrying the full overhead of that.
But, yes, _relaxed atomic operations are indistinguishable from ordinary operations as far as ordering is concerned. So yes, they may be reordered with respect to other _relaxed atomic operations as well as non-atomic ones -- by the compiler and/or the machine. [Though, as noted, on x86_64, not by the machine.]
And, yes, where a release operation in thread X synchronizes-with an acquire operation in thread Y, all writes in thread X which are sequenced-before the release will have happened-before the acquire in thread Y. So the release operation is a signal that all writes which precede it in X are "complete", and when the acquire operation sees that signal, Y knows it has synchronized and can read what was written by X (up to the release).
Now, the key thing to understand here is that simply doing a store _release is not enough: the value which is stored must be an unambiguous signal to the load _acquire that the store has happened. Otherwise, how can the load tell?
Generally a _release/_acquire pair like this are used to synchronize access to some collection of data. Once that data is "ready", a store _release signals that. Any load _acquire which sees the signal (or all loads _acquire which see the signal) know that the data is "ready" and they can read it. Of course, any writes to the data which come after the store _release may (depending on timing) also be seen by the load(s) _acquire. What I am trying to say here, is that another signal may be required if there are to be further changes to the data.
Your little test program:
initialises tsync to 0
in the writer: after all the tat[i].store(j, memory_order_relaxed), does tsync.store(0, memory_order_release)
so the value of tsync does not change!
in the reader: does tsync.load(memory_order_acquire) before doing tat[i].load(memory_order_relaxed)
and ignores the value read from tsync
I am here to tell you that the _release/_acquire pairs are not synchronizing -- all these stores/loads may as well be _relaxed. [I think your test will "pass" if the writer manages to stay ahead of the reader, because on x86-64 all writes are done in instruction order, as are all reads.]
For this to be a test of _release/_acquire semantics, I suggest (a sketch follows the list):
initialise tsync to 0 and tat[] to all zeros.
in the writer: run j = 1..numTries
after all the tat[i].store(j, memory_order_relaxed), write tsync.store(j, memory_order_release)
this signals that the pass is complete, and that all tat[] is now j.
in the reader: do j = tsync.load(memory_order_acquire)
a pass across tat[] should find j <= tat[i].load(memory_order_relaxed)
and after the pass, j == numTries signals that the writer has finished.
where the signal sent by the writer is that it has just completed writing j, and will continue with j+1, unless j == numTries. But this does not guarantee the order in which tat[] are written.
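Here is a sketch of writeArray/readArray revised along those lines, reusing the globals and headers from the program above (only the value stored in tsync and the failure check change):

void writeArray()
{
    for (int j = 1; j <= numTries; ++j)
    {
        for (int i = arraySize - 1; i >= 0; --i)
            tat[i].store(j, memory_order_relaxed);
        tsync.store(j, memory_order_release);   // signal: pass j is complete
    }
}

void readArray()
{
    int j = 0;
    while (j != numTries)
    {
        j = tsync.load(memory_order_acquire);   // which pass has completed?
        for (int i = 0; i != arraySize; ++i)
        {
            // every element must already be at pass j (or later)
            if (tat[i].load(memory_order_relaxed) < j)
            {
                cout << "fail" << endl;
                return;
            }
        }
    }
}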
If what you wanted was for the writer to stop after each pass, and wait for the reader to see it and signal the same -- then you need another signal, and you need the threads to wait for their respective "you may proceed" signals.
The quote about relaxed giving 'modification order consistency' only means that all threads can agree on a modification order for that one object, i.e. that an order exists. A later release-store that synchronizes with an acquire-load in another thread will guarantee that it's visible. https://preshing.com/20120913/acquire-and-release-semantics/ has a nice diagram.
Any time you're storing a pointer that other threads could load and deref, use at least mo_release if any of the pointed-to data has also been recently modified, if it's necessary that readers also see those updates. (This includes anything indirectly reachable, like levels of your tree.)
On any kind of tree / linked-list / pointer-based data structure, pretty much the only time you could use relaxed would be in newly-allocated nodes that haven't been "published" to the other threads yet. (Ideally you can just pass args to constructors so they can be initialized without even trying to be atomic at all; the constructor for std::atomic<T>() is not itself atomic. So you must use a release store when publishing a pointer to a newly-constructed atomic object.)
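For illustration, a minimal publish/consume sketch of that rule (the Node type and names are hypothetical):

#include <atomic>

struct Node { int payload; Node* next; };
std::atomic<Node*> head{nullptr};

void publisher() {
    Node* n = new Node{42, nullptr};          // ordinary writes; not yet visible to others
    head.store(n, std::memory_order_release); // publish: the writes above happen-before
                                              // any acquire load that sees this pointer
}

int consumer() {
    Node* n = head.load(std::memory_order_acquire);
    return n ? n->payload : -1;               // if n is non-null, payload is guaranteed to be 42
}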
On x86 / x86-64, mo_release has no extra cost; plain asm stores already have ordering as strong as release, so the compiler only needs to block compile-time reordering to implement var.store(val, mo_release). It's also pretty cheap on AArch64, especially if you don't do any acquire loads soon after.
It also means you can't test for relaxed being unsafe using x86 hardware; the compiler will pick one order for the relaxed stores at compile time, nailing them down into release operations in whatever order it picked. (And x86 atomic-RMW operations are always full barriers, effectively seq_cst. Making them weaker in the source only allows compile-time reordering. Some non-x86 ISAs can have cheaper RMWs as well as load or store for weaker orders, though, even acq_rel being slightly cheaper on PowerPC.)
I found an example of a race condition that I was able to reproduce under g++ on Linux. What I don't understand is how the order of operations matters in this example.
#include <iostream>
#include <thread>

int va = 0;

void fa() {
    for (int i = 0; i < 10000; ++i)
        ++va;
}

void fb() {
    for (int i = 0; i < 10000; ++i)
        --va;
}

int main() {
    std::thread a(fa);
    std::thread b(fb);
    a.join();
    b.join();
    std::cout << va;
}
I can understand that the order matters if I had used va = va + 1;, because then the RHS va could have changed before being assigned back to the LHS va. Can someone clarify?
The standard says (quoting the latest draft):
[intro.races]
Two expression evaluations conflict if one of them modifies a memory location ([intro.memory]) and the other one reads or modifies the same memory location.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below.
Any such data race results in undefined behavior.
Your example program has a data race, and the behaviour of the program is undefined.
What I don't understand is how the order of operations matters in this example.
The order of operations matters because the operations are not atomic, and they read and modify the same memory location.
can understand that the order matters if I had used va = va + 1; because then the RHS va could have changed before being assigned back to the LHS va
The same applies to the increment operator. The abstract machine will:
Read a value from memory
Increment the value
Write a value back to memory
There are multiple steps there that can interleave with operations in the other thread.
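Spelled out as C++ (an illustration of the abstract machine's steps, not literally what the compiler emits), ++va behaves like:

int tmp = va;   // 1. read va from memory
tmp = tmp + 1;  // 2. increment the value
va = tmp;       // 3. write the value back to memory
// A --va in the other thread can interleave between any of these
// steps, so one of the two updates can be lost.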
Even if there was a single operation per thread, there would be no guarantee of well defined behaviour unless those operations are atomic.
Note, outside of the scope of C++: a CPU might have a single instruction for incrementing an integer in memory. For example, x86 has such an instruction, and it can be invoked both atomically and non-atomically. It would be wasteful for the compiler to use the atomic instruction unless you explicitly use atomic operations in C++.
First of all, this is undefined behaviour since the two threads' reads and writes of the same non-atomic variable va are potentially concurrent and neither happens before the other.
With that being said, if you want to understand what your computer is actually doing when this program is run, it may help to assume that ++va is the same as va = va + 1. In fact, the standard says they are identical, and the compiler will likely compile them identically. Since your program contains UB, the compiler is not required to do anything sensible like using an atomic increment instruction. If you wanted an atomic increment instruction, you should have made va atomic. Similarly, --va is the same as va = va - 1. So in practice, various results are possible.
The important idea here is that when C++ is compiled it is "translated" to assembly language. The translation of ++va or --va results in assembly code that moves the value of va to a register, then stores the result of adding 1 to that register back to va in a separate instruction. In this way, it is exactly the same as va = va + 1;. It also means that the operation va++ is not necessarily atomic.
See here for an explanation of what the Assembly code for these instructions will look like.
To make the operations atomic, the variable needs some synchronization mechanism. You can get this by declaring an atomic variable (which will handle synchronization between threads for you):
std::atomic<int> va;
Reference: https://en.cppreference.com/w/cpp/atomic/atomic
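For illustration, a minimal corrected version of the program from the question (relaxed ordering is enough here because only the final count, observed after the joins, matters):

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> va{0};

void fa() {
    for (int i = 0; i < 10000; ++i)
        va.fetch_add(1, std::memory_order_relaxed); // atomic increment
}

void fb() {
    for (int i = 0; i < 10000; ++i)
        va.fetch_sub(1, std::memory_order_relaxed); // atomic decrement
}

int main() {
    std::thread a(fa);
    std::thread b(fb);
    a.join();
    b.join();
    std::cout << va.load(); // always prints 0
}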
I've been trying to build a simple (i.e. inefficient) MPMC queue using only std C++ but I'm having trouble getting the underlying array to synchronize between threads. A simplified version of the queue is:
constexpr int POISON = 5000;

class MPMC{
    std::atomic_int mPushStartCounter;
    std::atomic_int mPushEndCounter;
    std::atomic_int mPopCounter;

    static constexpr int Size = 1<<20;
    int mData[Size];

public:
    MPMC(){
        mPushStartCounter.store(0);
        mPushEndCounter.store(-1);
        mPopCounter.store(0);
        for(int i = 0; i < Size; i++){
            // preset data with a poison flag to
            // detect race conditions
            mData[i] = POISON;
        }
    }

    void push(int x) {
        int index = mPushStartCounter.fetch_add(1);
        mData[index] = x; // Race condition
        atomic_thread_fence(std::memory_order_release);
        int expected = index-1;
        while(!mPushEndCounter.compare_exchange_strong(expected, index, std::memory_order_acq_rel)){
            std::this_thread::yield();
        }
    }

    int pop(){
        int index = mPopCounter.load();
        if(index <= mPushEndCounter.load(std::memory_order_acquire)
           && mPopCounter.compare_exchange_strong(index, index+1, std::memory_order_acq_rel)){
            return mData[index]; // race condition
        }else{
            return pop();
        }
    }
};
It uses three atomic variables for synchronization:
mPushStartCounter that is used by push(int) to determine which location to write to.
mPushEndCounter that is used to signal to pop() that push(int) has finished writing up to that point in the array.
mPopCounter that is used by pop() to prevent double pops from occurring.
In push(), between writing to the array mData and updating mPushEndCounter, I've put a release barrier in an attempt to force synchronization of the mData array.
The way I understood cppreference, this should force a Fence-Atomic Synchronization, where
the CAS in push() is an 'atomic store X',
the load of mPushEndCounter in pop() is an 'atomic acquire operation Y' ,
The release barrier 'F' in push() is 'sequenced-before X'.
In which case cppreference states that
In this case, all non-atomic and relaxed atomic stores that are sequenced-before F in thread A will happen-before all non-atomic and relaxed atomic loads from the same locations made in thread B after Y.
Which I interpreted to mean that the write to mData from push() would be visible in pop(). This is, however, not the case: sometimes pop() reads uninitialized data. I believe this to be a synchronization issue, because if I check the contents of the queue afterwards, or via a breakpoint, it reads correct data instead.
I am using clang 6.0.1, and g++ 7.3.0.
I tried looking at the generated assembly, but it looks correct to me: the write to the array is followed by a lock cmpxchg, and the read is preceded by a check on the same variable. Which, to the best of my limited knowledge, should work as expected on x64 because:
Loads are not reordered with other loads, hence the load from the array cannot be speculated ahead of the read of the atomic counter.
Stores are not reordered with other stores, hence the cmpxchg always comes after the store to the array.
lock cmpxchg flushes the write buffer, cache, etc. Therefore, if another thread observes it as finished, one can rely on cache coherency to guarantee that the write to the array has finished. I am not too sure that this is correct, however.
I've posted a runnable test on Github. The test code involves 16 threads, half of which push the numbers 0 to 4999 and the other half read back 5000 elements each. It then combines the results of all the readers and checks that we've seen all the numbers in [0, 4999] exactly 8 times (which fails), and scans the underlying array once more to see if it contains all the numbers in [0, 4999] exactly 8 times (which succeeds).
I am working on a CUDA program where all blocks and threads need to determine the minimum step size for an iterative problem dynamically. I want the first thread in each block to be responsible for reading the global dz value into shared memory so the rest of the threads can do a reduction on it. Meanwhile, other threads in other blocks may be writing to it. Is there simply an atomicRead option in CUDA, or something equivalent? I guess I could do an atomic add with zero or something. Or is this even necessary?
template<typename IndexOfRefractionFunct>
__global__ void _step_size_kernel(IndexOfRefractionFunct n, double* dz, double z, double cell_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= cells * cells)
        return;

    int idy = idx / cells;
    idx %= cells;

    double x = cell_size * idx;
    double y = cell_size * idy;

    __shared__ double current_dz;
    if(threadIdx.x == 0)
        current_dz = atomicRead(dz);

    ...

    atomicMin(dz, calculated_min);
}
Also, I just realized that CUDA does not seem to support atomics on doubles. Any way around this?
Is there simply an atomicRead option in CUDA, or something equivalent?
The idea of an atomic operation is that it allows for combining multiple operations without the possibility of intervening operations from other threads. The canonical use is for a read-modify-write. All 3 steps of the RMW operation can be performed atomically, with respect to a given location in memory, without the possibility of intervening activity from other threads.
Therefore the concept of an atomic read (only, by itself) doesn't really have meaning in this context. It is only one operation. In CUDA, all properly aligned reads of basic types (int, float, double, etc.) occur atomically, i.e. all in one operation, without the possibility of other operations affecting that read, or parts of that read.
Based on what you have shown, it seems that the correctness of your use-case should be satisfied without any special behavior on the read operation. If you simply wanted to ensure that the current_dz value gets populated from the global value, before any threads have a chance to modify it, at the block level, this can be sorted out simply with __syncthreads():
__shared__ double current_dz;
if(threadIdx.x == 0)
    current_dz = *dz;  // ordinary read of the global value
__syncthreads();       // no threads can proceed beyond this point until
                       // thread 0 has read the value of dz
...
atomicMin(dz, calculated_min);
If you need to make sure this behavior is enforced grid-wide, then my suggestion would be to have an initial value of dz that threads don't write to, followed by the atomicMin operation being done on another location (i.e. separate the write/output from the read/input at the kernel level).
But, again, I'm not suggesting this is necessary for your use-case. If you simply want to pick up the current dz value, you can do this with an ordinary read. You will get a "coherent" value. At the grid level, some number of atomicMin operations may have occurred before that read, and some may have occurred after that read, but none of them will corrupt the read, leading you to read a bogus value. The value you read will be either the initial value that was there, or some value that was properly deposited by an atomicMin operation (based on the code you have shown).
Also, I just realized that CUDA does not seem to support atomics on doubles. Any way around this?
CUDA has support for a limited set of atomic operations on 64-bit quantities. In particular, there is a 64-bit atomicCAS operation. The programming guide demonstrates how to use this in a custom function to achieve an arbitrary 64 bit atomic operation (e.g. 64-bit atomicMin on a double quantity). The example in the programming guide describes how to do a double atomicAdd operation. Here are examples of atomicMin and atomicMax operating on double:
__device__ double atomicMax(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    while(val > __longlong_as_double(old)) {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val));
    }
    return __longlong_as_double(old);
}

__device__ double atomicMin(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    while(val < __longlong_as_double(old)) {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val));
    }
    return __longlong_as_double(old);
}
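For illustration, here is a hypothetical kernel that combines the custom double atomicMin above with the earlier suggestion of separating the read/input location from the write/output location (the kernel name, the n parameter, and the per-thread computation are stand-ins):

__global__ void step_size_kernel(const double* dz_in, double* dz_out, int n)
{
    __shared__ double current_dz;
    if (threadIdx.x == 0)
        current_dz = *dz_in;    // ordinary read; no thread ever writes dz_in
    __syncthreads();            // the whole block waits for thread 0's read

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n)
        return;

    double calculated_min = current_dz * 0.5;  // stand-in for the real computation
    atomicMin(dz_out, calculated_min);         // custom double atomicMin from above
}

// Host code would initialise *dz_out to a large value (e.g. DBL_MAX) before the
// launch, then read it back afterwards as the new step size.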
As a good programming practice, atomics should be used sparingly, although Kepler global 32-bit atomics are pretty fast. But when using these types of custom 64-bit atomics, the advice is especially applicable; they will be noticeably slower than ordinary reads and writes.
In C++11, can std::atomic be used to transmit non-atomic data between two threads? In detail, are the following 4 semantics all established by atomics?
all statements (when talking about execution, including all machine instructions generated by those C++ statements) before an atomic-write statement are executed before the atomic-write.
all statements (when talking about execution, including all machine instructions generated by those C++ statements) after an atomic-read are executed after the atomic-read.
all other memory-writes before writing an atomic are committed to main memory.
all other memory-reads after reading an atomic will read from main memory again (that means discarding the thread's cache).
I have seen an example here: http://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/
However, in the example the data is an atomic, so my question is: what if the data is non-atomic?
Here is some code, showing what I want:
common data:
std::atomic_bool ready;
char* data; // or data of any other non-atomic type
write thread:
data = new char[100];
data[0] = 1;
ready.store(true); // use the default memory order (memory_order_seq_cst), which I think is the most restrictive one
read thread:
if(ready.load()) { // use the default memory order (memory_order_seq_cst)
    assert(data[0] == 1); // both data (of type char*) and data[0..99] (each of type char) are loaded
}
I think you must use memory orders:
data = new char[100];
data[0] = 1;
ready.store(true, std::memory_order_release);

if(ready.load(std::memory_order_acquire)) {
    assert(data[0] == 1); // both data (of type char*) and data[0..99] (each of type char) are loaded
}
(The _explicit spelling belongs to the free functions, e.g. std::atomic_store_explicit(&ready, true, std::memory_order_release); the member store/load take the memory order as an argument.)
Actually your code is correct, but in this case there is no need for seq_cst memory order; acquire/release is sufficient.
With release, no write can be reordered after the atomic write, and with acquire, no load can be reordered before the atomic load, so all non-atomic stores before the atomic store will be visible to all loads after the atomic load.
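Putting it together, a minimal runnable sketch of the pattern under discussion; the reader loops on the flag so it is guaranteed to eventually observe it (the question's single if would simply skip the assert if the flag were not yet set):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
char* data = nullptr;

void writer() {
    data = new char[100];
    data[0] = 1;
    ready.store(true, std::memory_order_release); // publish
}

void reader() {
    while (!ready.load(std::memory_order_acquire)) // wait for the signal
        ;
    assert(data[0] == 1); // guaranteed: the release store synchronizes-with the acquire load
}

int main() {
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
    delete[] data;
}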