What are the differences between linearizability and quiescent consistency?

I am reading the book "The Art of Multiprocessor Programming" by Maurice Herlihy and Nir Shavit, and am trying to understand Chapter 3, about concurrent objects.
Linearizability : "The basic idea behind linearizability is that every concurrent history is equiv- alent, in the following sense, to
some sequential history. The basic rule is that if one method call
precedes another, then the earlier call must have taken effect before
the later call. By contrast, if two method calls overlap, then their
order is ambiguous, and we are free to order them in any convenient
way."
Now, I was reading about quiescent consistency:
Method calls should appear to happen in a one-at-a-time, sequential
order.
Method calls separated by a period of quiescence should appear to take
effect in their real-time order.
I feel like both are the same. I also read What are the differences between sequential consistency and quiescent consistency?.
From that link:
Quiescent consistency : "requires non-overlapping operations to appear to take effect in their real-time order, but overlapping
operations might be reordered"
Can anybody explain how the two differ?
Thanks.

Quiescent consistency is a weaker constraint than linearizability. For example, suppose there are three concurrent method calls:
<---A---> <---B--->
<------C------>
For linearizability, you can map the concurrent execution to sequential ones such as ABC, ACB, CAB, etc. But you CANNOT put B before A, because that precedence order must be preserved.
This is not the case under quiescent consistency: you can reorder A and B, since there is no quiescent point between A and B (C is still pending). In other words, you can reorder the method calls in any manner you like as long as no quiescent point separates them.
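To make this concrete (my own enumeration, following the rules above): for the diagram, the linearizable sequential orders are exactly those that keep A before B, i.e. ABC, ACB, and CAB. Quiescent consistency additionally allows BAC, BCA, and CBA, because C is pending throughout the execution, so no quiescent period ever separates the calls, and all six orders are acceptable.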


Do I understand the semantics of std::memory_order correctly?

cppreference.com states about memory_order::seq_cst:
A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order.
[ Q1 ]: Does this mean that the single total order runs through every memory_order::seq_cst operation on all atomic variables (this one and others)?
[ Q2 ]: And release, acquire, and acq_rel operations are not included in the "single total order"?
I understood that, for a single variable, seq_cst is equivalent to the other three (release for a store, acquire for a load, and both for a read-modify-write), but I'm confused about whether seq_cst can also order operations on other atomic variables, not only the same one.
cppreference is only a summary of the C++ standard, and sometimes its text is less precise. The actual standard draft makes it clear: the C++20 final working draft N4861 states in [atomics.order], par. 4 (p. 1525):
There is a single total order S on all memory_order::seq_cst operations, including fences, that satisfies the following constraints [...]
This clearly says all seq_cst operations, not just all operations on a particular object.
And notes 6 and 7 further down emphasize that the order does not apply to weaker memory orders:
6 [Note: We do not require that S be consistent with “happens before” (6.9.2.1). This allows more efficient implementation of memory_order::acquire and memory_order::release on some machine architectures. It can produce surprising results when these are mixed with memory_order::seq_cst accesses. — end note]
7 [Note: memory_order::seq_cst ensures sequential consistency only for a program that is free of data races and uses exclusively memory_order::seq_cst atomic operations. Any use of weaker ordering will invalidate this guarantee unless extreme care is used. In many cases, memory_order::seq_cst atomic operations [...]
I find this part incomplete:
A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order.
If those things (release stores, acquire loads, and a total store order) were actually sufficient to give sequential consistency, that would imply that release and acquire operations on their own are more strongly ordered than they actually are.
Let's have a look at the following counterexample:
CPU1:
a.store(1, std::memory_order_release);      // release store
int r1 = b.load(std::memory_order_acquire); // acquire load
Then based on the above definition for SC (and the known properties sequential consistency must have to fit the name), I would presume that the store of a and the load of b can't be reordered:
we have a release-store and an acquire-load
we (can) have a total order over all loads/stores
So we have satisfied the above definition for sequential consistency.
But a release-store followed by an acquire-load to a different address can be reordered; the canonical example is Dekker's algorithm. Therefore the above definition of SC is broken, because it fails to require that the memory order preserve program order. Apart from a compiler messing things up, the typical cause of this violation is the store buffer that most modern CPUs have, which can cause an older store to be reordered with a newer load to a different address.
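To make this concrete, here is a minimal Dekker-style sketch (my own example, with assumed names): with release/acquire, each thread's store may be reordered past its subsequent load of the other flag, so both threads can end up in the critical section at once; replacing both orderings with std::memory_order_seq_cst forbids that outcome, because the four accesses then participate in the single total order S.

#include <atomic>

std::atomic<int> flag0{0}, flag1{0};

void thread0() {
    flag0.store(1, std::memory_order_release);        // announce intent
    if (flag1.load(std::memory_order_acquire) == 0) { // may be reordered before the store!
        /* critical section */
    }
}

void thread1() {
    flag1.store(1, std::memory_order_release);
    if (flag0.load(std::memory_order_acquire) == 0) {
        /* critical section */
    }
}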
The single total order is a different concern than CPU local instruction reordering as you can get with e.g. store buffers. It effectively means that there is some moment where an operation takes effect in the memory order and nobody should be able to disagree with that. The standard litmus test for this is the independent reads of independent writes (IRIW):
CPU1:
A=1
CPU2:
B=1
CPU3:
r1=A
[LoadLoad]
r2=B
CPU4:
r3=B
[LoadLoad]
r4=A
So could it be that CPU3 and CPU4 see the stores to different addresses in different orders? If the answer is yes, then no total order over the load/stores exist.
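As a hedged C++ rendering of the same litmus test (the names are mine): with seq_cst on every access, the outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 is forbidden, because all threads must agree on a single order of the two independent stores.

#include <atomic>

std::atomic<int> A{0}, B{0};
int r1, r2, r3, r4;

void cpu1() { A.store(1, std::memory_order_seq_cst); }
void cpu2() { B.store(1, std::memory_order_seq_cst); }

void cpu3() {
    r1 = A.load(std::memory_order_seq_cst);
    r2 = B.load(std::memory_order_seq_cst);
}

void cpu4() {
    r3 = B.load(std::memory_order_seq_cst);
    r4 = A.load(std::memory_order_seq_cst);
}

With acquire loads instead of seq_cst, the "disagreeing" outcome becomes observable on some architectures (POWER is the classic example).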
Another cause of not having a total order over the loads/stores is store-to-load forwarding (STLF).
CPU1:
A=1
r1=A
r2=B
CPU2:
B=1
r3=B
r4=A
Is it possible that r1=1, r2=0, r3=1 and r4=0?
On x86 this is possible due to store-to-load forwarding. If CPU1 does a store of A followed by a load of A, the CPU must look in its store buffer for the value of A. This means the store of A does not become visible to all observers at a single instant; the local CPU sees its own store early, and the consequence is that no total order over the loads/stores exists.
So instead of a total order over all loads/stores, the guarantee is reduced to a total order over the stores, and that is where the x86 memory model gets its name: TSO (Total Store Order).
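The same litmus test as a hedged C++ sketch (names are mine): with relaxed ordering, C++ permits this outcome outright, and at the hardware level it is exactly what x86 store-to-load forwarding produces.

#include <atomic>

std::atomic<int> A{0}, B{0};
int r1, r2, r3, r4;

void cpu1() {
    A.store(1, std::memory_order_relaxed);
    r1 = A.load(std::memory_order_relaxed); // may be forwarded from the store buffer
    r2 = B.load(std::memory_order_relaxed);
}

void cpu2() {
    B.store(1, std::memory_order_relaxed);
    r3 = B.load(std::memory_order_relaxed);
    r4 = A.load(std::memory_order_relaxed);
}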

Is the concept of release-sequence useful in practice?

C++ atomic semantics only guarantee visibility (through the happens-before relation) of memory operations performed by the last thread that did a release write (a simple write or a read-modify-write).
Consider
int x, y;
atomic<int> a;

Thread 1:
x = 1;
a.store(1, memory_order_release);

Thread 2:
y = 2;
if (a.load(memory_order_relaxed) == 1)
    a.store(2, memory_order_release);
Then the observation of a == 2 implies visibility of thread 2's operations (y == 2) but not thread 1's (one cannot even safely read x).
As far as I know, real implementations of multithreading use the concepts of fences (and sometimes release stores), but not happens-before or release-sequence, which are high-level C++ concepts; I fail to see which real hardware details these concepts map to.
How can a real implementation not guarantee visibility of thread 1 memory operations when the value of 2 in a is globally visible?
In other words, is there any good in the release-sequence definition? Why wouldn't the release-sequence extend to every subsequent modification in the modification order?
Consider in particular silly-thread 3:
if (a.load(memory_order_relaxed) == 2)
    a.store(2, memory_order_relaxed);
Can silly-thread 3 ever suppress any visibility guarantee on any real hardware? In other words, if value 2 is globally visible, how would making it again globally visible break any ordering?
Is my mental model of real multiprocessing incorrect? Can a value be partially visible, on some CPU but not another?
(Of course I assume non-crazy semantics for relaxed writes, as writes that go back in time make the language semantics of C++ absolutely nonsensical, unlike safe languages like Java, which always have bounded semantics. No real implementation can have crazy, non-causal relaxed semantics.)
Let's first answer your question:
Why wouldn't the release-sequence extend to every subsequent modification in the modification order?
Because if so, we would lose some potential optimizations. For example, consider the thread:
x = 1; // #1
a.store(1,memory_order_relaxed); // #2
Under the current rules, the compiler is able to reorder #1 and #2. However, with such an extension of release-sequence, the compiler would not be allowed to reorder the two lines, because another thread like your thread 2 might introduce a release sequence headed by #2 and tailed by a release operation; it would then be possible for some read-acquire operation in another thread to synchronize with #2.
You give a specific example and claim that all implementations would produce a specific outcome even though the language rules do not guarantee it. This is not a problem, because the language rules are intended to handle all cases, not just your specific example. Of course the language rules could be improved so that they guarantee the expected outcome for your specific example, but that is not trivial work. At least, as argued above, simply extending the definition of release-sequence is not an acceptable solution.
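For contrast, here is a hedged sketch (my own, not from the question) of what the release-sequence rule does guarantee: read-modify-write operations continue a release sequence, whereas the plain store in thread 2 above terminates it.

#include <atomic>

int x;
std::atomic<int> a{0};

void thread1() {
    x = 1;
    a.store(1, std::memory_order_release);  // heads a release sequence
}

void thread2() {
    int expected = 1;
    // An RMW (by any thread) continues the release sequence,
    // even with relaxed ordering.
    a.compare_exchange_strong(expected, 2, std::memory_order_relaxed);
}

void thread3() {
    if (a.load(std::memory_order_acquire) == 2) {
        // Synchronizes with thread1's release store via the release
        // sequence, so x == 1 is guaranteed to be visible here.
        int r = x;
        (void)r;
    }
}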

confused about atomic class: memory_order_relaxed

I am studying this site: https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync, which is very helpful for understanding atomics.
But this example about relaxed mode is hard to understand:
/* Thread 1: */
y.store (20, memory_order_relaxed);
x.store (10, memory_order_relaxed);

/* Thread 2: */
if (x.load (memory_order_relaxed) == 10)
{
    assert (y.load (memory_order_relaxed) == 20); /* assert A */
    y.store (10, memory_order_relaxed);
}

/* Thread 3: */
if (y.load (memory_order_relaxed) == 10)
    assert (x.load (memory_order_relaxed) == 10); /* assert B */
To me, assert B should never fail, since x must be 10 and y must be 10, because thread 2 stored 10 into y only after seeing x == 10.
But the website says either assert in this example can actually FAIL.
To me, assert B should never fail, since x must be 10 and y must be 10, because thread 2 stored 10 into y only after seeing x == 10.
In effect, your argument is that since in thread 2 the store of 10 into x occurred before the store of 10 into y, in thread 3 the same must be the case.
However, since you are only using relaxed memory operations, there is nothing in the code that requires two different threads to agree on the ordering between modifications of different variables. So indeed thread 2 may see the store of 10 into x before the store of 10 into y while thread 3 sees those two operations in the opposite order.
In order to ensure that assert B succeeds, you would, in effect, need to ensure that when thread 3 sees the value 10 of y, it also sees any other side effects performed by the thread that stored 10 into y before the time of the store. That is, you need the store of 10 into y to synchronize with the load of 10 from y. This can be done by having the store perform a release and the load perform an acquire:
// thread 2
y.store (10, memory_order_release);
// thread 3
if (y.load (memory_order_acquire) == 10)
A release operation synchronizes with an acquire operation that reads the value stored. Now because the store in thread 2 synchronizes with the load in thread 3, anything that happens after the load in thread 3 will see the side effects of anything that happens before the store in thread 2. Hence the assertion will succeed.
Of course, we also need to make sure assertion A succeeds, by making the x.store in thread 1 use release and the x.load in thread 2 use acquire.
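Putting both fixes together, a sketch of the fully synchronized version (my rendering of the fix, not the wiki's code):

/* Thread 1: */
y.store (20, memory_order_relaxed);
x.store (10, memory_order_release);   // release: publishes y == 20

/* Thread 2: */
if (x.load (memory_order_acquire) == 10)   // acquire: synchronizes with thread 1
{
    assert (y.load (memory_order_relaxed) == 20); /* assert A: cannot fail */
    y.store (10, memory_order_release);   // release: publishes everything seen so far
}

/* Thread 3: */
if (y.load (memory_order_acquire) == 10)  // acquire: synchronizes with thread 2
    assert (x.load (memory_order_relaxed) == 10); /* assert B: cannot fail */

Each release store publishes everything sequenced before it, and the transitivity of happens-before carries thread 1's stores through thread 2 to thread 3.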
I find it much easier to understand atomics with some knowledge of what might be causing their behavior, so here's some background knowledge. Know that these concepts are in no way stated in the C++ language itself, but they are some of the possible reasons why things are the way they are.
Compiler reordering
Compilers, often when optimizing, will choose to refactor the program as long as its effects are the same for a single-threaded program. This is circumvented with the use of atomics, which tell the compiler (among other things) that the variable might change at any moment and that its value might be read elsewhere.
Formally, atomics ensures one thing: there will be no data races. That is, accessing the variable will not make your computer explode.
CPU reordering
The CPU might reorder instructions as it executes them, which means the instructions might get reordered at the hardware level, independent of how you wrote the program.
Caching
Finally there are effects of caches, which are faster memory that sorta contains a partial copy of the global memory. Caches are not always in sync, meaning they don't always agree on what is "correct". Different threads may not be using the same cache, and due to this, they may not agree on what values variables have.
Back to the problem
What the above amounts to is pretty much what C++ says about the matter: unless explicitly stated otherwise, the ordering of the side effects of each instruction is totally and completely unspecified. It might not even be the same viewed from different threads.
Formally, the guarantee of an ordering of side effects is called a happens-before relation. Unless one side effect happens-before another, they are unordered. Loosely, we just call this synchronization.
Now, what is memory_order_relaxed? It tells the compiler to stop meddling, but doesn't worry about how the CPU and cache (and possibly other things) behave. Therefore, one possible explanation of why you see the "impossible" assert might be:
Thread 1 stores 20 to y and then 10 to x to its cache.
Thread 2 reads the new values and stores 10 to y to its cache.
Thread 3 didn't read the values from thread 1, but reads those of thread 2, and then the assert fails.
This might be completely different from what happens in reality; the point is that anything can happen.
To ensure a happens-before relation between the multiple reads and writes, see Brian's answer.
Another construct that provides happens-before relations is std::mutex, which is why they are free from such insanities.
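For example, a minimal sketch (hypothetical names) of the mutex guarantee: the unlock in one thread synchronizes with the next lock of the same mutex in another thread, so everything written before the unlock is visible after the lock.

#include <mutex>

std::mutex m;
int shared = 0;

void writer() {
    std::lock_guard<std::mutex> guard(m);
    shared = 42;    // happens-before any later lock of m
}

void reader() {
    std::lock_guard<std::mutex> guard(m);
    int r = shared; // sees 42 if writer ran first
    (void)r;
}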
The answer to your question is in the C++ standard.
The section [intro.races] is surprisingly clear (which is not the rule for this kind of normative text: consistent formalism often hurts readability).
I have read many books and tutorials that treat the subject of memory ordering, and they just confused me.
Finally I read the C++ standard; the section [intro.multithread] is the clearest treatment I have found. Taking the time to read it carefully (twice) may save you some time!
The answer to your question is in [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. [ Note: There is a separate order for each atomic object. There is no requirement that these can be combined into a single total order for all objects. In general this will be impossible since different threads may observe modifications to different objects in inconsistent orders. — end note ]
You were expecting a single total order on all atomic operations. There is such an order, but only for atomic operations that are memory_order_seq_cst as explained in [atomics.order]/3:
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happens before” order and modification orders for all affected locations [...]

C++ memory_order_consume, kill_dependency, dependency-ordered-before, synchronizes-with

I am reading C++ Concurrency in Action by Anthony Williams. Currently I am at the point where he describes memory_order_consume.
After that block there is:
Now that I’ve covered the basics of the memory orderings, it’s time to look at the more complex parts
It scares me a little bit, because I don't fully understand several things:
How does dependency-ordered-before differ from synchronizes-with? They both create a happens-before relationship. What is the exact difference?
I am confused about following example:
int global_data[] = { … };
std::atomic<int> index;

void f()
{
    int i = index.load(std::memory_order_consume);
    do_something_with(global_data[std::kill_dependency(i)]);
}
What does kill_dependency exactly do? Which dependency does it kill? Between which entities? And how can the compiler exploit that knowledge?
Can all occurrences of memory_order_consume be safely replaced with memory_order_acquire? I.e., is it stricter in all senses?
At Listing 5.9, can I safely replace
std::atomic<int> data[5]; // all accesses are relaxed
with
int data[5];
? I.e. can acquire and release be used to synchronize access to non-atomic data?
He describes relaxed, acquire, and release with some examples of men in cubicles. Are there similar simple descriptions of seq_cst and consume?
As to the next to last question, the answer takes a little more explanation. There are three things that can go wrong when multiple threads access the same data:
the system might switch threads in the middle of a read or write, producing a result that's half one value and half another.
the compiler might move code around, on the assumption that there is no other thread looking at the data that's involved.
the processor may be keeping a value in its local cache, without updating main memory after changing the value or re-reading it after another thread changed the value in main memory.
Memory order addresses only number 3. The atomic functions address 1 and 2 and, depending on the memory order argument, maybe 3 as well. So memory_order_relaxed means "don't bother with number 3"; the code still handles 1 and 2. In that case, you'd use acquire and release to ensure proper memory ordering.
How dependency-ordered-before differs from synchronizes-with?
From 1.10/10: "[ Note: The relation “is dependency-ordered before” is analogous to “synchronizes with”, but uses release/consume in place of release/acquire. — end note ]".
What does kill_dependency exactly do?
Some compilers do data-dependency analysis. That is, they trace changes to values in variables in order to better figure out what has to be synchronized. kill_dependency tells such compilers not to trace any further because there's something going on in the code that the compiler wouldn't understand.
Can all occurancies of memory_order_consume be safely replaced with
memory_order_acquire? I.e. is it stricter in all senses?
I think so, but I'm not certain.
memory_order_consume requires that the atomic operation happens-before all non-atomic operations that are data dependent on it. A data dependency is any dependency where you cannot evaluate an expression without using that data. For example, in x->y, there is no way to evaluate x->y without first evaluating x.
kill_dependency is a unique function. All other functions have a data dependency on their arguments; kill_dependency explicitly does not. It shows up when you know that the data itself is already synchronized, but the expression you need to get to the data may not be. In your example, do_something_with is allowed to assume any cached value of global_data[i] is safe to use, but i itself must actually be the correct atomic value.
memory_order_acquire is strictly stronger if all changes to the data are properly released with a matching memory_order_release.
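As an illustration, here is a minimal consume sketch under assumed names (a writer publishing a node), showing what the data dependency covers:

#include <atomic>

struct Node { int value; };
std::atomic<Node*> head{nullptr};

void writer() {
    Node* n = new Node{42};
    head.store(n, std::memory_order_release);   // publish the node
}

void reader() {
    Node* p = head.load(std::memory_order_consume);
    if (p) {
        int v = p->value; // data-dependent on p: guaranteed to see 42
        (void)v;
    }
}

Only reads that are data-dependent on p are ordered; an unrelated global read in reader() would not be, which is exactly where consume is weaker than acquire. (In practice, most compilers implement consume by simply promoting it to acquire.)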

Combining stores/loads of consecutive atomic variables

Referring to a (slightly dated) paper by Hans Boehm, under "Atomic Operations": it mentions that the memory model (as proposed at the time) would not prevent an optimizing compiler from combining a sequence of loads, or stores, on the same variable into a single load or store. His example is as follows (updated to, hopefully, correct current syntax):
Given
atomic<int> v;
The code
while( v.load( memory_order_acquire ) ) { ... }
Could be optimized to:
int a = v.load(memory_order_acquire);
while(a) { ... }
Obviously this would be bad, as he states. Now my question is, as the paper is a bit old, does the current C++0x memory model prevent this type of optimization, or is it still technically allowed?
My reading of the standard seems to lean towards its being disallowed, but the use of "acquire" semantics makes it less clear. For example, if it were "seq_cst", the model seems to guarantee that the load must partake in a total ordering on the accesses, and loading the value only once would thus seem to violate the ordering (as it breaks the sequenced-before relationship).
For acquire, I interpret 29.3.2 to mean that this optimization cannot occur, since any "release" operation must be observed by the "acquire" operation. Doing only one acquire would seem invalid.
So my question is whether the current model (in the pending standard) would disallow this type of optimization? And if yes, then which part specifically forbids it? If no, does using a volatile atomic solve the problem?
And for bonus, if the load operation has a "relaxed" ordering is the optimization then allowed?
The C++0x standard attempts to outlaw this optimization.
The relevant words are from 29.3p13:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
If the thread doing the load only ever issues one load instruction, then this is violated: if it misses the write the first time, it will never see it. It doesn't matter which memory ordering is used for the load; it is the same for both memory_order_seq_cst and memory_order_relaxed.
However, the following optimization is allowed, unless there is something in the loop that forces an ordering:
while( v.load( memory_order_acquire ) ) {
    for(unsigned __temp=0;__temp<100;++__temp) {
        // original loop body goes here
    }
}
i.e. the compiler can generate code that executes the actual loads arbitrarily infrequently, provided it still executes them. This is even permitted for memory_order_seq_cst unless there are other memory_order_seq_cst operations in the loop, since this is equivalent to running 100 iterations between any memory accesses by other threads.
As an aside, the use of memory_order_acquire doesn't have the effect you describe --- it is not required to see release operations (other than by 29.3p13 quoted above), just that if it does see the release operation then it imposes visibility constraints on other accesses.
Right from the very paper you're linking:
Volatiles guarantee that the right number of memory operations are performed.
The standard says essentially the same:
Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine.
This has always been the case, since the very first C compiler by Dennis Ritchie I think. It has to be this way because memory mapped I/O registers won't work otherwise. To read two characters from your keyboard, you need to read the corresponding memory mapped register twice. If the compiler had a different idea about the number of reads it has to perform, that would be too bad!
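As for the "volatile atomic" part of the question, a hedged sketch of what that looks like: because every volatile access is an observable side effect of the abstract machine, the compiler cannot fuse or hoist the loads out of the loop.

#include <atomic>

volatile std::atomic<int> v{1};

void spin() {
    // Each iteration must issue a real load: volatile accesses may not
    // be combined or elided by the compiler.
    while (v.load(std::memory_order_acquire)) {
        /* loop body */
    }
}

Note that volatile only pins down the number of accesses; the ordering and visibility guarantees still come from the memory order argument.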