Does an atomic operation synchronize? - C++

Does an atomic operation synchronize between threads? I know that no thread can see such an operation half-done, but does it synchronize? For example, if I write to some variable in one thread and read it afterwards (in real time) from another, is there a possibility that I still see the old value?

Atomics provide sequential consistency (SC) by default. SC does not need to respect the real-time order: it can happen that after a write has executed (and even retired), a load on a different CPU still does not see that write. So in real time the load occurred after the write, but in the memory order it 'happened before' the write.
See the following answer for more info:
Does 'volatile' guarantee that any thread reads the most recently written value?

Related

Synchronization problem with std::atomic<>

I have basically two questions that are closely related and they are both based on this SO question:
Thread synchronization problem with c++ std::atomic variables
As cppreference.com explains:
For memory_order_acquire:
A load operation with this memory order performs the acquire operation
on the affected memory location: no reads or writes in the current
thread can be reordered before this load. All writes in other
threads that release the same atomic variable are visible in the
current thread
For memory_order_release: A store operation with this memory order
performs the release operation: no reads or writes in the current
thread can be reordered after this store. All writes in the current
thread are visible in other threads that acquire the same atomic
variable
Why do people say that memory_order_seq_cst MUST be used for that example to work properly? What's the purpose of memory_order_acquire if it doesn't work the way the official documentation says?
The documentation clearly says: All writes in other threads that release the same atomic variable are visible in the current thread.
Why should that example from the SO question never print "bad\n"? It just doesn't make any sense to me.
I did my homework by reading all the available documentation, SO questions/answers, googling, etc., but I'm still not able to understand some things.
Your linked question has two atomic variables, while your cppreference quote specifically mentions the "same atomic variable". That's why the reference text doesn't cover the linked question.
Quoting further from cppreference: memory_order_seq_cst : "...a single total order exists in which all threads observe all modifications in the same order".
So that does cover modifications to two atomic variables.
Essentially, the design problem with memory_order_release is that it's a data equivalent of GOTO, which we know is a problem since Dijkstra. And memory_order_acquire is the equivalent of a COMEFROM, which is usually reserved for April Fools. I'm not yet convinced that they're good additions to C++.

C++17 shared_mutex: why is a read lock even necessary for reading threads? [duplicate]

I have a class that has a state (a simple enum) and that is accessed from two threads. For changing the state I use a mutex (boost::mutex). Is it safe to check the state (e.g. compare state_ == ESTABLISHED), or do I have to use the mutex in this case too? In other words, do I need the mutex when I just want to read a variable that could be concurrently written by another thread?
It depends.
The C++ language says nothing about threads or atomicity.
But on most modern CPUs, reading an aligned integer is an atomic operation, which means you will always read a consistent value, even without a mutex.
However, without a mutex, or some other form of synchronization, the compiler and CPU are free to reorder reads and writes, so anything more complex, anything involving accessing multiple variables, is still unsafe in the general case.
Suppose the writer thread updates some data and then sets an integer flag to inform other threads that data is available: this could be reordered so that the flag is set before the data is updated, unless you use a mutex or another form of memory barrier.
So if you want correct behavior, you don't need a mutex as such, and it's no problem if another thread writes to the variable while you're reading it. It'll be atomic unless you're working on a very unusual CPU. But you do need a memory barrier of some kind to prevent reordering in the compiler or CPU.
If you have two threads exchanging information, then yes, you need a mutex, and you probably also need a condition variable.
In your example, the comparison state_ == ESTABLISHED indicates that thread #2 is waiting for thread #1 to establish a connection/state. Without a mutex or condition variables/events, thread #2 has to poll the status continuously.
Threads are used to increase performance (or improve responsiveness), but polling usually decreases performance, either by consuming a lot of CPU or by introducing latency due to the poll interval.
Yes. If thread A reads a variable while thread B is writing to it, you can read an undefined value. The read and write operations are not atomic, especially on a multi-processor system.
Generally speaking you don't, if your variable is declared with "volatile". And ONLY if it is a single variable - otherwise you should be really careful about possible races.
Actually, there is no reason to lock access to the object for reading; you only need to lock it while writing to it. This is exactly what a reader-writer lock does: it doesn't lock the object as long as there are no write operations, which improves performance and prevents deadlocks. See the following links for more elaborate explanations:
wikipedia
codeproject
Access to the enum (read or write) should be guarded.
Another thing: if thread contention is low and the threads belong to the same process, then a critical section would be better than a mutex.

How do warps work with atomic operations?

The threads in a warp run physically in parallel, so if one of them (call it thread X) starts an atomic operation, what will the others do? Wait? Does that mean all threads will be waiting while thread X is pushed into the atomic queue, gets access (the mutex), does some work on the memory protected by that mutex, and releases the mutex afterwards?
Is there any way to give the other threads some work, like reading some memory, so that the atomic operation can hide its latency? I mean, 15 idle threads is... not good, I guess. Atomics are really slow, aren't they? How can I accelerate them? Is there any pattern for working with them?
Does an atomic operation on shared memory lock a bank or the whole memory?
For example (without mutexes), there is __shared__ float smem[256];
Thread 1 runs atomicAdd(smem, 1);
Thread 2 runs atomicAdd(smem + 1, 1);
Those threads work with different banks, but in the same shared memory. Do they run in parallel or will they be queued? Does it make any difference whether Thread 1 and Thread 2 are from separate warps or the same one?
I count something like 10 questions. It makes it quite difficult to answer. It's suggested you ask one question per question.
Generally speaking, all threads in a warp are executing the same instruction stream. So there are two cases we can consider:
without conditionals (e.g. if...then...else) In this case, all threads are executing the same instruction, which happens to be an atomic instruction. Then all 32 threads will execute an atomic, although not necessarily on the same location. All of these atomics will get processed by the SM, and to some extent will serialize (they will completely serialize if they are updating the same location).
with conditionals For example, suppose we had if (!threadIdx.x) atomicAdd(data, 1); Then thread 0 would execute the atomic, and the others wouldn't. It might seem like we could get the others to do something else, but the lockstep warp execution doesn't allow this. Warp execution is serialized such that all threads taking the if (true) path will execute together, and all threads taking the if (false) path will execute together, but the true and false paths will be serialized. So again, we can't really have different threads in a warp executing different instructions simultaneously.
The net of it is, within a warp, we can't have one thread do an atomic while others do something else simultaneously.
A number of your other questions seem to expect that memory transactions are completed at the end of the instruction cycle in which they originated. This is not the case. With global and with shared memory, we must take special steps in the code to ensure that previous write transactions are visible to other threads (which could be argued as evidence that the transaction completed). One typical way to do this is to use barrier instructions, such as __syncthreads() or __threadfence(). But without those barrier instructions, threads are not "waiting" for writes to complete. A read (or an operation dependent on a read) can stall a thread. A write generally cannot stall a thread.
Now let's look at your questions:
so if one of them starts an atomic operation, what will the others do? Wait?
No, they don't wait. The atomic operation gets dispatched to a functional unit on the SM that handles atomics, and all threads continue, together, in lockstep. Since an atomic generally implies a read, yes, the read can stall the warp. But the threads do not wait until the atomic operation is completed (i.e., the write). However, a subsequent read of this location could stall the warp, again, waiting for the atomic (write) to complete. In the case of a global atomic, which is guaranteed to update global memory, it will invalidate the L1 in the originating SM (if enabled) and the L2, if they contain that location as an entry.
Is there any way to give the other threads some work, like reading some memory, so that the atomic operation can hide its latency?
Not really, for the reasons I stated at the beginning.
Atomics are really slow, aren't they? How can I accelerate them? Is there any pattern for working with them?
Yes, atomics can make a program run much more slowly if they dominate the activity (such as naive reductions or naive histogramming.) Generally speaking, the way to accelerate atomic operations is to not use them, or use them sparingly, in a way that doesn't dominate program activity. For example, a naive reduction would use an atomic to add every element to the global sum. A smart parallel reduction will use no atomics at all for the work done in the threadblock. At the end of the threadblock reduction, a single atomic might be used to update the threadblock partial sum into the global sum. This means that I can do a fast parallel reduction of an arbitrarily large number of elements with perhaps on the order of 32 atomic adds, or less. This sparing use of atomics will basically not be noticeable in the overall program execution, except that it enables the parallel reduction to be done in a single kernel call rather than 2.
Shared memory: do they run in parallel or will they be queued?
They will be queued. The reason for this is that there are a limited number of functional units that can process atomic operations on shared memory, not enough to service all the requests from a warp in a single cycle.
I've avoided trying to answer questions that relate to the throughput of atomic operations, because this data is not well specified in the documentation AFAIK. It may be that if you issue enough simultaneous or nearly-simultaneous atomic operations, that some warps will stall on the atomic instruction, due to the queues that feed the atomic functional units being full. I don't know this to be true and I can't answer questions about it.

C++11 : Atomic variable : lock_free property : What does it mean?

I wanted to understand what is meant by the lock_free property of atomic variables in C++11. I googled it and saw the other relevant questions on this forum, but I still have only a partial understanding. I'd appreciate it if someone could explain it end-to-end and in a simple way.
It's probably easiest to start by talking about what would happen if it was not lock-free.
The most obvious way to handle most atomic tasks is by locking. For example, to ensure that only one thread writes to a variable at a time, you can protect it with a mutex. Any code that's going to write to the variable needs to obtain the mutex before doing the write (and release it afterwards). Only one thread can own the mutex at a time, so as long as all the threads follow the protocol, no more than one can write at any given time.
If you're not careful, however, this can be open to deadlock. For example, let's assume you need to write to two different variables (each protected by a mutex) as an atomic operation -- i.e., you need to ensure that when you write to one, you also write to the other. In such a case, if you aren't careful, you can cause a deadlock. For example, let's call the two mutexes A and B. Thread 1 obtains mutex A, then tries to get mutex B. At the same time, thread 2 obtains mutex B, and then tries to get mutex A. Since each is holding one mutex, neither can obtain both, and neither can progress toward its goal.
There are various strategies to avoid deadlocks (e.g., all threads always obtain the mutexes in the same order, or, upon failure to obtain a mutex within a reasonable period of time, each thread releases the mutexes it holds, waits some random amount of time, and then tries again).
With lock-free programming, however, we (obviously enough) don't use locks. This means that a deadlock like above simply cannot happen. When done properly, you can guarantee that all threads continuously progress toward their goal. Contrary to popular belief, it does not mean the code will necessarily run any faster than well written code using locks. It does, however, mean that deadlocks like the above (and some other types of problems like livelocks and some types of race conditions) are eliminated.
Now, as to exactly how you do that: the answer is short and simple: it varies -- widely. In a lot of cases, you're looking at specific hardware support for doing certain specific operations atomically. Your code either uses those directly, or extends them to give higher level operations that are still atomic and lock free. It's even possible (though only rarely practical) to implement lock-free atomic operations without hardware support (but given its impracticality, I'll pass on trying to go into more detail about it, at least for now).
Jerry already mentioned common correctness problems with locks, i.e. they're hard to understand and program correctly.
Another danger with locks is that you lose determinism regarding your execution time: if a thread that has acquired a lock gets delayed (e.g. descheduled by the operating system, or "swapped out"), then it is possible that the entire program is delayed because it is waiting for the lock. By contrast, a lock-free algorithm is always guaranteed to make some progress, even if any number of threads are held up somewhere else.
By and large, lock-free programming is often slower (sometimes significantly so) than locked programming using non-atomic operations, because atomic operations cause a significant hit on caching and pipelining; however, it offers determinism and upper bounds on latency (at least overall latency of your process; as #J99 observed, individual threads may still be starved as long as enough other threads are making progress). Your program may get a lot slower, but it never locks up entirely and always makes some progress.
The very nature of hardware architectures allows for certain small operations to be intrinsically atomic. In fact, this is a very necessity of any hardware that supports multitasking and multithreading. At the very heart of any synchronisation primitive, such as a mutex, you need some sort of atomic instruction that guarantees correct locking behaviour.
So, with that in mind, we now know that it is possible for certain types like booleans and machine-sized integers to be loaded, stored and exchanged atomically. Thus when we wrap such a type into an std::atomic template, we can expect that the resulting data type will indeed offer load, store and exchange operations that do not use locks. By contrast, your library implementation is always allowed to implement an atomic Foo as an ordinary Foo guarded by a lock.
To test whether an atomic object is lock-free, you can use the is_lock_free member function. Additionally, there are ATOMIC_*_LOCK_FREE macros that tell you whether atomic primitive types potentially have a lock-free instantiation. If you are writing concurrent algorithms that you want to be lock-free, you should thus include an assertion that your atomic objects are indeed lock-free, or a static assertion on the macro to have value 2 (meaning that every object of the corresponding type is always lock-free).
Lock-free is one of the non-blocking techniques. For an algorithm, it involves a global progress property: whenever a thread of the program is active, it can make forward progress, either for itself or eventually for another thread.
Lock-free algorithms are supposed to behave better under heavy contention, where threads acting on a shared resource may spend a lot of time waiting for their next active time slice. They are also almost mandatory in contexts where you can't lock, such as interrupt handlers.
Implementations of lock-free algorithms almost always rely on compare-and-swap (some may use things like LL/SC), together with a strategy where the visible modification can be reduced to a single value change (mostly a pointer), making it a linearization point, and looping over this modification if the value has changed in the meantime. Most of the time, these algorithms try to complete the jobs of other threads when possible. A good example is the lock-free queue of Michael & Scott (http://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html).
For lower-level instructions like compare-and-swap, it means that the implementation (probably the micro-code of the corresponding instruction) is wait-free (see http://www.diku.dk/OLD/undervisning/2005f/dat-os/skrifter/lockfree.pdf).
For the sake of completeness, a wait-free algorithm enforces progress for every thread: each operation is guaranteed to terminate in a finite number of steps.
