boost vs std atomic sequential consistency semantics - c++

I'd like to write a C++ lock-free object where many logger threads log to a large global (non-atomic) ring buffer, with an occasional reader thread that wants to read as much data in the buffer as possible. I ended up with a global atomic counter from which loggers get locations to write to; each logger increments the counter atomically before writing. The reader reads the buffer and per-logger local (atomic) variables to know whether particular buffer entries are busy being written by some logger, so as to avoid using them.
So I have to synchronize between a pure reader thread and many writer threads. I sense that the problem can be solved without using locks, relying on the "happens after" relation to determine whether my program is correct.
I've tried weaker-than-seq_cst atomic operations, but they won't work: under release/acquire semantics, atomic stores are releases and loads are acquires, and the guarantee is that an acquire (and its subsequent work) always "happens after" some release (and its preceding work). That means there is no way for the reader thread (which does no store at all) to guarantee that something "happens after" the time it reads the buffer, which means I can't know whether some logger has overwritten part of the buffer while the thread is reading it.
So I turned to sequential consistency. For me, "atomic" means Boost.Atomic, whose documentation describes a "pattern" for its notion of sequential consistency:
The third pattern for coordinating threads via Boost.Atomic uses
seq_cst for coordination: If ...
thread1 performs an operation A,
thread1 subsequently performs any operation with seq_cst,
thread1 subsequently performs an operation B,
thread2 performs an operation C,
thread2 subsequently performs any operation with seq_cst,
thread2 subsequently performs an operation D,
then either "A happens-before D" or "C happens-before B" holds.
Note that the second and fifth lines say "any operation", without saying whether it modifies anything or what it operates on. This provides the guarantee that I wanted.
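To make the pattern concrete, here is a minimal store-buffer litmus test (my own sketch, not taken from the Boost docs), where each thread's seq_cst operation is its own store:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0), y(0);
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_seq_cst);  // operation A (itself seq_cst)
    r1 = y.load(std::memory_order_seq_cst); // operation B
}

void thread2() {
    y.store(1, std::memory_order_seq_cst);  // operation C (itself seq_cst)
    r2 = x.load(std::memory_order_seq_cst); // operation D
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join(); t2.join();
    // Per the pattern, either A happens-before D or C happens-before B,
    // so r1 == 0 && r2 == 0 is impossible. With only acquire/release,
    // both loads may miss both stores.
    assert(r1 == 1 || r2 == 1);
}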
All was well until I watched Herb Sutter's talk "atomic<> Weapons". What he implies is that seq_cst is just acq_rel with the additional guarantee of a consistent ordering of atomic stores. I turned to cppreference.com, which has a similar description.
So my questions:
Do C++11 and Boost.Atomic implement the same memory model?
If (1) is "yes", does it mean the "pattern" described by Boost is somehow implied by the C++11 memory model? How? Or does it mean the documentation of either Boost or C++11 in cppreference is wrong?
If (1) is "no", or (2) is "yes, but Boost documentation is incorrect", is there any way to achieve the effect I want in C++11, namely to have guarantee that (the work subsequent to) some atomic store happens after (the work preceding) some atomic load?

I saw no answer here, so I asked again in the Boost user mailing list.
I saw no answer there either (apart from a suggestion to look into
Boost lockfree), so I planned to ask Herb Sutter (expecting no answer
anyway). But before doing that, I Googled "C++ memory model" a little
more deeply. After reading a page of Hans Boehm
(http://www.hboehm.info/c++mm/), I could answer most of my own
question. I Googled a bit more, this time for "C++ Data Race", and
landed at a page by Bartosz Milewski
(http://bartoszmilewski.com/2014/10/25/dealing-with-benign-data-races-the-c-way/).
Then I could answer even more of my own question. Unluckily, I still
don't know how to do what I want to do given that knowledge. Perhaps
what I want to do is actually unachievable in standard C++.
My first part of the question: "Do C++11 and Boost.Atomic implement
the same memory model?" The answer is, mostly, "yes". My second part
of the question: "If (1) is 'yes', does it mean the 'pattern'
described by Boost is somehow implied by the C++11 memory model?" The
answer is, again, yes. "How?" is answered by a proof found here
(http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2392.html).
Essentially, for data race free programs, the little bit added to
acq_rel is sufficient to guarantee the behavior required by seq_cst.
So both pieces of documentation, although perhaps confusing, are correct.
Now the real problem: although both (1) and (2) get "yes" answers, my
original program is wrong! I neglected (actually, I was unaware of) an
important rule of C++: a program with a data race has undefined behavior
(rather than an "unspecified" or "implementation defined" one). That
is, the compiler guarantees behavior of my program only if my program
has absolutely no data race. Without a lock, my program contains a
data race: the pure reader thread can read any time, even at a time
when the logger thread is busy writing. This is "undefined behavior",
and the rule says that the computer can do anything (the "catch fire"
rule). To fix it, one has to use ideas found in the page of Bartosz
Milewski I mentioned earlier, i.e., change the ring buffer to contain
only atomic content, so that the compiler knows that its ordering is
important and must not be reordered with the operations marked to
require sequential consistency. If overhead minimization is desired,
one can write to it using relaxed atomic operations.
Unluckily, this applies to the reader thread too. I can no longer
just "memcpy" the whole memory buffer. Instead I must also use
relaxed atomic operations to read the buffer, one word after another.
This kills performance, but I actually have no choice. Luckily for
me, the dumper's performance is not important to me at all: it rarely
gets run anyway. But if I do want the performance of "memcpy", I
would get an answer of "no solution": C++ provides no semantics for "I
know there is a data race, you can return anything to me here but don't
screw up my program". Either you ensure that there is no data race
and pay the cost to get everything well defined, or you have a data
race and the compiler is allowed to put you to jail.
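For concreteness, here is a minimal sketch of the fix described above; the names are hypothetical, and the per-logger busy flags and wraparound handling from my actual design are omitted:

#include <atomic>
#include <cstddef>

constexpr std::size_t kSlots = 1024;
std::atomic<unsigned long> buffer[kSlots]; // atomic content: no data race
std::atomic<std::size_t> next_slot(0);     // global write counter

void log_entry(unsigned long value) {
    // Claim a slot with the seq_cst counter, then write the payload
    // with a relaxed store: atomicity alone removes the data race.
    std::size_t slot =
        next_slot.fetch_add(1, std::memory_order_seq_cst) % kSlots;
    buffer[slot].store(value, std::memory_order_relaxed);
}

void dump(unsigned long* out, std::size_t n) {
    // The reader must also touch the buffer one atomic word at a time;
    // no memcpy of the whole buffer is allowed.
    for (std::size_t i = 0; i < n && i < kSlots; ++i)
        out[i] = buffer[i].load(std::memory_order_relaxed);
}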

Synchronization problem with std::atomic<>

I have basically two questions that are closely related and they are both based on this SO question:
Thread synchronization problem with c++ std::atomic variables
As cppreference.com explains:
For memory_order_acquire:
A load operation with this memory order performs the acquire operation
on the affected memory location: no reads or writes in the current
thread can be reordered before this load. All writes in other
threads that release the same atomic variable are visible in the
current thread
For memory_order_release:
A store operation with this memory order performs the release operation: no reads or writes in the current thread can be reordered after this store. All writes in the current thread are visible in other threads that acquire the same atomic variable
Why do people say that memory_order_seq_cst MUST be used for that example to work properly? What's the purpose of memory_order_acquire if it doesn't work as the official documentation says?
The documentation clearly says: All writes in other threads that release the same atomic variable are visible in the current thread.
Why should that example from the SO question never print "bad\n"? It just doesn't make any sense to me.
I did my homework by reading all the available documentation, SO questions/answers, googling, etc. But I'm still not able to understand some things.
Your linked question has two atomic variables; your cppreference quote specifically mentions the "same atomic variable". That's why the reference text doesn't cover the linked question.
Quoting further from cppreference: memory_order_seq_cst : "...a single total order exists in which all threads observe all modifications in the same order".
So that does cover modifications to two atomic variables.
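The linked example is, in all likelihood, a variant of the classic four-thread litmus test (this sketch is adapted from cppreference's memory_order_seq_cst example, not copied from the linked question):

#include <atomic>
#include <thread>

std::atomic<bool> x(false), y(false);
std::atomic<int> z(0);

void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }

void read_x_then_y() {
    while (!x.load(std::memory_order_seq_cst)) {}
    if (y.load(std::memory_order_seq_cst)) ++z;
}

void read_y_then_x() {
    while (!y.load(std::memory_order_seq_cst)) {}
    if (x.load(std::memory_order_seq_cst)) ++z;
}

int main() {
    std::thread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
    a.join(); b.join(); c.join(); d.join();
    // With seq_cst, z == 0 is impossible: all four threads agree on a
    // single total order of the two stores. With acquire/release, the
    // two readers may observe the stores in opposite orders, so the
    // "bad" outcome z == 0 becomes possible.
}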
Essentially, the design problem with memory_order_release is that it's the data equivalent of a GOTO, which we have known to be a problem since Dijkstra. And memory_order_acquire is the equivalent of a COMEFROM, which is usually reserved for April Fools' jokes. I'm not yet convinced that they're good additions to C++.

How are MT programs proven correct with "non sequential" semantics?

This could be a language-neutral question, but in practice I'm interested in the C++ case: how are multithreaded programs, written in C++ versions that support MT programming (that is, modern C++ with a memory model), ever proven correct?
In old C++, MT programs were simply written in terms of pthread semantics and validated against the pthread rules, which was conceptually easy: use the primitives correctly and avoid data races.
Now, the C++ language semantics are defined in terms of a memory model, not in terms of the sequential execution of primitive steps. (The standard also mentions an "abstract machine", but I no longer understand what it means.)
How are C++ programs proven correct with these non-sequential semantics? How can anyone reason about a program that is not doing primitive steps one after the other?
It is "conceptually easier" with the C++ memory model than it was with pthreads prior to the C++ memory model. C++ prior to the memory model interacting with pthreads was loosely specified, and reasonable interpretations of the specification permitted the compiler to "introduce" data races, so it is extremely difficult (if possible at all) to reason about or prove correctness for MT algorithms in the context of older C++ with pthreads.
There seems to be a fundamental misunderstanding in the question in that C++ was never defined as a sequential execution of primitive steps. It has always been the case that there is a partial ordering between expression evaluations. And the compiler is allowed to move such expressions around unless constrained from doing so. This was unchanged by the introduction of the memory model. The memory model introduced a partial order for evaluations between separate threads of execution.
The advice "use the primitives correctly and avoid data races" still applies, but the C++ memory model more strictly and precisely constrains the interaction between the primitives and the rest of the language, allowing more precise reasoning.
In practice, it is not easy to prove correctness in either context. Most programs are not proven to be data race free. One tries to encapsulate as much as possible any synchronization so as to allow reasoning about smaller components, some of which can be proven correct. And one uses tools such as address sanitizer and thread sanitizer to catch data races.
On data races, POSIX says:
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads.... Applications may allow more than one thread of control to read a memory location simultaneously.
On data races, C++ says:
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
C++ defines more terms and tries to be more precise. The gist is that both forbid data races, which both define as conflicting accesses made without the use of synchronization primitives.
POSIX says that the pthread functions synchronize memory with respect to other threads. That is underspecified. One could reasonably interpret that as (1) the compiler cannot move memory accesses across such a function call, and (2) after calling such a function in one thread, the prior actions to memory from that thread will be visible to another thread after it calls such a function. This was a common interpretation, and this is easily accomplished by treating the functions as opaque and potentially clobbering all of memory.
As an example of problems with this loose specification, the compiler is still allowed to introduce or remove memory accesses (e.g. through register promotion and spillage) and to make larger accesses than necessary (e.g. touching adjacent fields in a struct). Therefore, the compiler completely correctly could "introduce" data races that weren't written in the source code directly. The C++11 memory model stops it from doing so.
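As a concrete illustration (my own sketch, in the spirit of the examples in Boehm's paper), consider register promotion:

// Shared; protected by a lock the compiler knows nothing about.
int count = 0;

void worker(bool flag, int n) {
    for (int i = 0; i < n; ++i)
        if (flag)
            ++count; // touched only when flag is true
}

// A pre-C++11 compiler could legally rewrite the loop as:
//
//   int tmp = count;              // reads count even if flag is false
//   for (int i = 0; i < n; ++i)
//       if (flag) ++tmp;
//   count = tmp;                  // writes count even if flag is false
//
// introducing a read and a write that the source never performs when
// flag is false -- a compiler-introduced data race with any other
// thread legitimately using count. The C++11 memory model forbids
// this transformation.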
C++ says, with regard to mutex lock:
Synchronization: Prior unlock() operations on the same object shall synchronize with this operation.
So C++ is a little more specific. You have to lock and unlock the same mutex to have synchronization. But given this, C++ says that operations before the unlock are visible to the new locker:
An evaluation A strongly happens before an evaluation D if ... there are evaluations B and C such that A is sequenced before B, B simply happens before C, and C is sequenced before D. [ Note: Informally, if A strongly happens before B, then A appears to be evaluated before B in all contexts. Strongly happens before excludes consume operations. — end note ]
(With B = unlock, C = lock, B simply happens before C because B synchronizes with C. Sequenced before is a concept in a single thread of execution, so for example, one full expression is sequenced before the next.)
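A minimal sketch of how this plays out with a mutex (the names are mine):

#include <mutex>

std::mutex m;
int data = 0; // plain, non-atomic

void producer() {
    m.lock();
    data = 42;  // A: sequenced before the unlock
    m.unlock(); // B
}

void consumer() {
    m.lock();        // C: the prior unlock synchronizes with this lock
    int seen = data; // D: sees 42 whenever producer ran first
    (void)seen;
    m.unlock();
}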
So, if you restrict yourself to the sorts of primitives (locks, condition variables, ...) that exist in pthread, and to the type of guarantees provided by pthread (sequential consistency), C++ should add no surprises. In fact, it removes some surprises, adds precision, and is more amenable to correctness proofs.
The article Foundations of the C++ Concurrency Memory Model is a great, expository read for anyone interested in this topic about the problems with the status quo at the time and the choices made to fix them in the C++11 memory model.
Edited to more clearly state that the premise of the question is flawed, that reasoning is easier with the memory model, and add a reference to the Boehm paper, which also shaped some of the exposition.

Is the meaning of "lock-free" even defined by the C++ standard?

I can't find a semantic difference between lock-based and lock-free atomics. So far as I can tell, the difference is semantically meaningless as far as the language is concerned, since the language doesn't provide any timing guarantees. The only guarantees I can find are memory ordering guarantees, which seem to be the same for both cases.
(How) can the lock-free-ness of atomics affect program semantics?
i.e., aside from calling is_lock_free or atomic_is_lock_free, is it possible to write a well-defined program whose behavior is actually affected by whether atomics are lock-free?
Do those functions even have a semantic meaning? Or are they just practical hacks for writing responsive programs even though the language never provides timing guarantees in the first place?
There is at least one semantic difference.
As per C++11 1.9 Program execution /6:
When the processing of the abstract machine is interrupted by receipt of a signal, the values of objects which are neither of type volatile std::sig_atomic_t nor lock-free atomic objects are unspecified during the execution of the signal handler, and the value of any object not in either of these two categories that is modified by the handler becomes undefined.
In other words, it's safe to muck about with those two categories of variables but any access or modification to all other categories should be avoided.
Of course you could argue that it's no longer a well defined program if you invoke unspecified/undefined behaviour like that but I'm not entirely sure whether you meant that or well-formed (i.e., compilable).
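A small sketch of how that difference shows up in code (assuming the std::atomic<int> instantiation is lock-free on the target platform):

#include <atomic>
#include <csignal>

std::atomic<int> interrupts(0); // assumed lock-free here

void on_signal(int) {
    // Well-defined only because the object is a lock-free atomic (a
    // volatile std::sig_atomic_t would also do). If the library
    // implemented this atomic with an internal lock, the handler
    // could interrupt the thread while that lock is held.
    interrupts.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::signal(SIGINT, on_signal);
    // ... main loop polls interrupts.load(std::memory_order_relaxed) ...
}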
But, even if you discount that semantic difference, a performance difference is worth having. If I had to have a value for communicating between threads, I'd probably choose, in order of preference:
the smallest adequate data type that was lock-free.
a larger data type than necessary, if it was lock-free and the smaller one wasn't.
a shared region fully capable of race conditions, but in conjunction with an atomic_flag (guaranteed to be lock-free) to control access.
This behaviour could be selected at compile or run time based on the ATOMIC_x_LOCK_FREE macros so that, even though the program behaves the same regardless, the optimal method for that behaviour is chosen, as in the sketch below.
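A hypothetical sketch of such a selection (the spinlock fallback is illustrative, not a complete atomic emulation):

#include <atomic>

#if ATOMIC_INT_LOCK_FREE == 2
std::atomic<int> shared_value(0); // always lock-free on this target

int  load_shared()       { return shared_value.load(); }
void store_shared(int v) { shared_value.store(v); }
#else
// Fallback: guard a plain int with a spinlock built on atomic_flag,
// the one type guaranteed by the standard to be lock-free.
std::atomic_flag guard = ATOMIC_FLAG_INIT;
int shared_value = 0;

int load_shared() {
    while (guard.test_and_set(std::memory_order_acquire)) {} // spin
    int v = shared_value;
    guard.clear(std::memory_order_release);
    return v;
}

void store_shared(int v) {
    while (guard.test_and_set(std::memory_order_acquire)) {} // spin
    shared_value = v;
    guard.clear(std::memory_order_release);
}
#endif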
In the C++11 Standard, the term "lock-free" was not well defined, as reported in issue LWG #2075.
The C++14 Standard defines what lock-free executions are in the C++ language (N3927, approved).
Quoting C++14 1.10 [intro.multithread], paragraph 4:
Executions of atomic functions that are either defined to be lock-free (29.7) or indicated as lock-free (29.4) are lock-free executions.
If there is only one unblocked thread, a lock-free execution in that thread shall complete. [ Note: Concurrently executing threads may prevent progress of a lock-free execution. For example, this situation can occur with load-locked store-conditional implementations. This property is sometimes termed obstruction-free. -- end note ]
When one or more lock-free executions run concurrently, at least one should complete. [ Note: It is difficult for some implementations to provide absolute guarantees to this effect, since repeated and particularly inopportune interference from other threads may prevent forward progress, e.g., by repeatedly stealing a cache line for unrelated purposes between load-locked and store-conditional instructions. Implementations should ensure that such effects cannot indefinitely delay progress under expected operating conditions, and that such anomalies can therefore safely be ignored by programmers. Outside this International Standard, this property is sometimes termed lock-free. -- end note ]
The above definition of "lock-free" depends on how an unblocked thread behaves. The C++ Standard does not define an unblocked thread directly, but 17.3.3 [defns.blocked] defines a blocked thread:
a thread that is waiting for some condition (other than the availability of a processor) to be satisfied before it can continue execution
(How) can the lock-free-ness of atomics affect program semantics?
I think the answer is NO, except for the signal-handler case in paxdiablo's answer, when "program semantics" means the side effects of atomic operations.
The lock-free-ness of atomics affects the strength of the progress guarantee for the whole multithreaded program.
When two (or more) threads concurrently execute lock-free atomic operations on the same object, at least one of these operations should complete under any thread scheduling, however adversarial.
In other words, an 'evil' thread scheduler could, in theory, intentionally block the progress of lock-based atomic operations.
Paxdiablo has answered pretty well, but some background might help.
"Lock-free atomic" is a bit of redundant terminology. The point of atomic variables, as they were originally invented, is to avoid locks by leveraging hardware guarantees. But, each platform has its own limitations, and C++ is highly portable. So the implementation has to emulate atomicity (usually via the library) using fine-grained locks for atomic types that don't really exist at the hardware level.
Behavioral differences between hardware atomics and "software atomics" are minimized, because differences would mean lost portability. On the other hand, a program should be able to avoid accidentally using mutexes, hence the introspection through the ATOMIC_x_LOCK_FREE macros, which are available to the preprocessor.

What is lock-free multithreaded programming?

I have seen people/articles/SO posts who say they have designed their own "lock-free" container for multithreaded usage. Assuming they haven't used a performance-hitting modulus trick (i.e. each thread can only insert based upon some modulo), how can data structures be multi-threaded but also lock-free?
This question is intended towards C and C++.
The key in lock-free programming is to use hardware-intrinsic atomic operations.
As a matter of fact, even locks themselves must use those atomic operations!
But the difference between locked and lock-free programming is that a lock-free program can never be stalled entirely by any single thread. If, in a locking program, one thread acquires a lock and then gets suspended indefinitely, the entire program is blocked and cannot make progress. By contrast, a lock-free program can make progress even if individual threads are suspended indefinitely.
Here's a simple example: A concurrent counter increment. We present two versions which are both "thread-safe", i.e. which can be called multiple times concurrently. First the locked version:
#include <mutex>

int counter = 0;
std::mutex counter_mutex;

void increment_with_lock()
{
    std::lock_guard<std::mutex> _(counter_mutex); // hold lock for the increment
    ++counter;
}
Now the lock-free version:
#include <atomic>

std::atomic<int> counter(0);

void increment_lockfree()
{
    ++counter; // atomic read-modify-write, no lock needed
}
Now imagine hundreds of threads all calling the increment_* function concurrently. In the locked version, no thread can make progress until the lock-holding thread unlocks the mutex. In the lock-free version, by contrast, all threads can make progress. If a thread is held up, it just won't do its share of the work, but everyone else gets to get on with their work.
It is worth noting that, in general, lock-free programming trades throughput and mean latency for predictable latency. That is, a lock-free program will usually get less done than a corresponding locking program if there is not too much contention (since atomic operations are slow and affect a lot of the rest of the system), but it guarantees never to produce unpredictably large latencies.
For locks, the idea is that you acquire a lock and then do your work knowing that nobody else can interfere, then release the lock.
For "lock-free", the idea is that you do your work somewhere else and then attempt to atomically commit this work to "visible state", and retry if you fail.
The problems with "lock-free" are that:
it's hard to design a lock-free algorithm for anything non-trivial. This is because there are only so many ways to do the "atomically commit" part (often relying on an atomic "compare and swap" that replaces a pointer with a different pointer; see the sketch after this list).
if there's contention, it performs worse than locks because you're repeatedly doing work that gets discarded/retried
it's virtually impossible to design a lock-free algorithm that is both correct and "fair". This means that (under contention) some tasks can be lucky (and repeatedly commit their work and make progress) and some can be very unlucky (and repeatedly fail and retry).
The combination of these things means that it's only good for relatively simple things under low contention.
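As an illustration of the commit-and-retry shape, here is a push onto a Treiber stack, sketched with C++11 atomics (the usual lock-free "hello world"):

#include <atomic>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head(nullptr);

void push(int v) {
    // Do the work off to the side: build the node privately.
    Node* n = new Node{v, head.load(std::memory_order_relaxed)};
    // Commit with a single CAS. If another thread moved head since we
    // read it, the CAS fails, n->next is refreshed to the current
    // head, and we retry; the work is re-committed, never lost.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}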
Researchers have designed things like lock-free linked lists (and FIFO/FILO queues) and some lock-free trees. I don't think there's anything more complex than those. As for how these things work: because the problem is hard, the algorithms are complicated. The most sane approach is to determine what type of data structure you're interested in, then search the web for relevant research into lock-free algorithms for that data structure.
Also note that there is something called "block free", which is like lock-free except that you know you can always commit the work and never need to retry. It's even harder to design a block-free algorithm, but contention doesn't matter so the other 2 problems with lock-free disappear. Note: the "concurrent counter" example in Kerrek SB's answer is not lock free at all, but is actually block free.
The idea of "lock free" is not really not having any lock, the idea is to minimize the number of locks and/or critical sections, by using some techniques that allow us not to use locks for most operations.
It can be achieved using optimistic design or transactional memory, where you do not lock the data for all operations, but only on some certain points (when doing the transaction in transactional memory, or when you need to roll-back in optimistic design).
Other alternatives are based on atomic implementations of some commands, such as CAS (Compare And Swap), that even allows us to solve the consensus problem given an implementation of it. By doing swap on references (and no thread is working on the common data), the CAS mechanism allows us to easily implement a lock-free optimistic design (swapping to the new data if and only if no one have changed it already, and this is done atomically).
However, to implement the underlying mechanism to one of these - some locking will most likely be used, but the amount of time the data will be locked is (supposed) to be kept to minimum, if these techniques are used correctly.
The new C and C++ standards (C11 and C++11) introduced threads, along with thread-shared atomic data types and operations. An atomic operation gives guarantees for operations that run into a race between two threads. Once a thread returns from such an operation, it can be sure that the operation has gone through in its entirety.
Typical processor support for such atomic operations exists on modern processors for compare and swap (CAS) or atomic increments.
In addition to being atomic, a data type can have the "lock-free" property. This should perhaps have been coined "stateless", since this property implies that an operation on such a type will never leave the object in an intermediate state, even when it is interrupted by an interrupt handler or a read by another thread falls in the middle of an update.
Several atomic types may (or may not) be lock-free; there are macros to test for that property. There is always one type that is guaranteed to be lock-free, namely atomic_flag.

C++11 : Atomic variable : lock_free property : What does it mean?

I wanted to understand what one means by the lock_free property of atomic variables in C++11. I googled and looked at the other relevant questions on this forum, but I still have only a partial understanding. I'd appreciate it if someone could explain it end-to-end and in a simple way.
It's probably easiest to start by talking about what would happen if it was not lock-free.
The most obvious way to handle most atomic tasks is by locking. For example, to ensure that only one thread writes to a variable at a time, you can protect it with a mutex. Any code that's going to write to the variable needs to obtain the mutex before doing the write (and release it afterwards). Only one thread can own the mutex at a time, so as long as all the threads follow the protocol, no more than one can write at any given time.
If you're not careful, however, this can be open to deadlock. For example, let's assume you need to write to two different variables (each protected by a mutex) as an atomic operation -- i.e., you need to ensure that when you write to one, you also write to the other. In such a case, if you aren't careful, you can cause a deadlock. For example, let's call the two mutexes A and B. Thread 1 obtains mutex A, then tries to get mutex B. At the same time, thread 2 obtains mutex B, and then tries to get mutex A. Since each is holding one mutex, neither can obtain both, and neither can progress toward its goal.
There are various strategies to avoid such deadlocks (e.g., all threads always try to obtain the mutexes in the same order, or upon failure to obtain a mutex within a reasonable period of time, each thread releases the mutex it holds, waits some random amount of time, and then tries again).
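For instance, C++11's std::lock applies such a deadlock-avoidance strategy for you (a minimal sketch):

#include <mutex>

std::mutex A, B;

void update_both() {
    // Acquires both mutexes without deadlocking, regardless of the
    // order in which other threads try to take them.
    std::lock(A, B);
    std::lock_guard<std::mutex> guard_a(A, std::adopt_lock);
    std::lock_guard<std::mutex> guard_b(B, std::adopt_lock);
    // ... write to both variables under both locks ...
}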
With lock-free programming, however, we (obviously enough) don't use locks. This means that a deadlock like above simply cannot happen. When done properly, you can guarantee that all threads continuously progress toward their goal. Contrary to popular belief, it does not mean the code will necessarily run any faster than well written code using locks. It does, however, mean that deadlocks like the above (and some other types of problems like livelocks and some types of race conditions) are eliminated.
Now, as to exactly how you do that: the answer is short and simple: it varies -- widely. In a lot of cases, you're looking at specific hardware support for doing certain specific operations atomically. Your code either uses those directly, or extends them to give higher level operations that are still atomic and lock free. It's even possible (though only rarely practical) to implement lock-free atomic operations without hardware support (but given its impracticality, I'll pass on trying to go into more detail about it, at least for now).
Jerry already mentioned common correctness problems with locks, i.e. they're hard to understand and program correctly.
Another danger with locks is that you lose determinism regarding your execution time: if a thread that has acquired a lock gets delayed (e.g. descheduled by the operating system, or "swapped out"), then it is possible that the entire program is delayed because it is waiting for the lock. By contrast, a lock-free algorithm is always guaranteed to make some progress, even if any number of threads are held up somewhere else.
By and large, lock-free programming is often slower (sometimes significantly so) than locked programming using non-atomic operations, because atomic operations cause a significant hit on caching and pipelining; however, it offers determinism and upper bounds on latency (at least overall latency of your process; as #J99 observed, individual threads may still be starved as long as enough other threads are making progress). Your program may get a lot slower, but it never locks up entirely and always makes some progress.
The very nature of hardware architectures allows certain small operations to be intrinsically atomic. In fact, this is a necessity for any hardware that supports multitasking and multithreading. At the very heart of any synchronisation primitive, such as a mutex, you need some sort of atomic instruction that guarantees correct locking behaviour.
So, with that in mind, we now know that it is possible for certain types like booleans and machine-sized integers to be loaded, stored and exchanged atomically. Thus when we wrap such a type into an std::atomic template, we can expect that the resulting data type will indeed offer load, store and exchange operations that do not use locks. By contrast, your library implementation is always allowed to implement an atomic Foo as an ordinary Foo guarded by a lock.
To test whether an atomic object is lock-free, you can use the is_lock_free member function. Additionally, there are ATOMIC_*_LOCK_FREE macros that tell you whether atomic primitive types potentially have a lock-free instantiation. If you are writing concurrent algorithms that you want to be lock-free, you should thus include an assertion that your atomic objects are indeed lock-free, or a static assertion on the macro to have value 2 (meaning that every object of the corresponding type is always lock-free).
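Such a check might look like this (a sketch; turnstile is a hypothetical shared variable):

#include <atomic>
#include <cassert>

// Compile-time: the macro expands to 2 when atomic<int> is always
// lock-free on this implementation.
static_assert(ATOMIC_INT_LOCK_FREE == 2,
              "need an always-lock-free atomic<int>");

std::atomic<int> turnstile(0);

void check_lock_freedom() {
    // Per-object run-time check, useful when the macro expands to 1
    // ("sometimes lock-free").
    assert(turnstile.is_lock_free());
}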
Lock-free is one of the non-blocking techniques. For an algorithm, it involves a global progress property: whenever a thread of the program is active, it can make a forward step in its action, either for itself or eventually for another thread.
Lock-free algorithms are supposed to have better behavior under heavy contention, where threads acting on a shared resource may spend a lot of time waiting for their next active time slice. They are also almost mandatory in contexts where you can't lock, like interrupt handlers.
Implementations of lock-free algorithms almost always rely on Compare-and-Swap (some may use things like LL/SC) and a strategy in which the visible modification can be reduced to a single value change (usually a pointer), making it a linearization point, and then looping over this modification if the value has changed in the meantime. Most of the time, these algorithms try to complete the jobs of other threads when possible. A good example is the lock-free queue of Michael & Scott (http://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html).
For lower-level instructions like Compare-and-Swap, it means that the implementation (probably the micro-code of the corresponding instruction) is wait-free (see http://www.diku.dk/OLD/undervisning/2005f/dat-os/skrifter/lockfree.pdf)
For the sake of completeness, a wait-free algorithm enforces progress for every thread: each operation is guaranteed to terminate in a finite number of steps.