What does "a single total order" mean in std::notify_one()?

What does "a single total order" mean in std::notify_one()? - c++

I have read Concurrency: Atomic and volatile in C++11 memory model and How std::memory_order_seq_cst works, it doesn't help much and answer my question directly.
From https://en.cppreference.com/w/cpp/thread/condition_variable/notify_one:
The effects of notify_one()/notify_all() and each of the three atomic
parts of wait()/wait_for()/wait_until() (unlock+wait, wakeup, and
lock) take place in a single total order that can be viewed as
modification order of an atomic variable: the order is specific to
this individual condition_variable. This makes it impossible for
notify_one() to, for example, be delayed and unblock a thread that
started waiting just after the call to notify_one() was made.
What does it mean by saying "take place in a single total order"? How is this related to the next sentence "This makes it impossible ..... was made."? (It seems that it's telling a cause and effect).
I read it word by word more than 10 times and don't understand what it's saying.. Definition of "total order" from Wikipedia can't help much.

What does it mean by saying "take place in a single total order"?
It means that every thread sees same sequence of operations. As an example, using multiple non-atomic variables, thread C can see the changes to int a caused by thread A, before it sees changes to int b caused by thread B, while thread D sees those of B before A. There are multiple incompatible timelines of which events occur before others, potentially every thread disagreeing with another. Without synchronisation mechanisms (like std::condition_variable) it can be impossible to prevent unwanted system behaviours.
A total order means that every element can be compared to every other element (contrast a partial order, where some pairs of elements are incomparable). In this case, there exists a timeline of events. It is single in that all threads agree on it.
How is this related to the next sentence "This makes it impossible for notify_one() to, for example, be delayed and unblock a thread that started waiting just after the call to notify_one() was made."?
Because all threads agree on the order that things happen, you can't anywhere observing an effect preceding it's cause.

Related

C++: std::memory_order in std::atomic_flag::test_and_set to do some work only once by a set of threads

Could you please help me to understand what std::memory_order should be used in std::atomic_flag::test_and_set to do some work only once by a set of threads and why? The work should be done by whatever thread gets to it first, and all other threads should just check as quickly as possible that someone is already going the work and continue working on other tasks.
In my tests of the example below, any memory order works, but I think that it is just a coincidence. I suspect that Release-Acquire ordering is what I need, but, in my case, only one memory_order can be used in both threads (it is not the case that one thread can use memory_order_release and the other can use memory_order_acquire since I do not know which thread will arrive to doing the work first).
#include <atomic>
#include <iostream>
#include <thread>
std::atomic_flag done = ATOMIC_FLAG_INIT;
const std::memory_order order = std::memory_order_seq_cst;
//const std::memory_order order = std::memory_order_acquire;
//const std::memory_order order = std::memory_order_relaxed;
void do_some_work_that_needs_to_be_done_only_once(void)
{ std::cout<<"Hello, my friend\n"; }
void run(void)
{
if(not done.test_and_set(order))
do_some_work_that_needs_to_be_done_only_once();
}
int main(void)
{
std::thread a(run);
std::thread b(run);
a.join();
b.join();
// expected result:
// * only one thread said hello
// * all threads spent as little time as possible to check if any
// other thread said hello yet
return 0;
}
Thank you very much for your help!

Following up on some things in the comments:
As has been discussed, there is a well-defined modification order M for done on any given run of the program. Every thread does one store to done, which means one entry in M. And by the nature of atomic read-modify-writes, the value returned by each thread's test_and_set is the value that immediately precedes its own store in the order M. That's promised in C++20 atomics.order p10, which is the critical clause for understanding atomic RMW in the C++ memory model.
Now there are a finite number of threads, each corresponding to one entry in M, which is a total order. Necessarily there is one such entry that precedes all the others. Call it m1. The test_and_set whose store is entry m1 in M must return the preceding value in M. That can only be the value 0 which initialized done. So the thread corresponding to m1 will see test_and_set return 0. Every other thread will see it return 1, because each of their modifications m2, ..., mN follows (in M) another modification, which must have been a test_and_set storing the value 1.
We may not be bothering to observe all of the total order M, but this program does determine which of its entries is first on this particular run. It's the unique one whose test_and_set returns 0. A thread that sees its test_and_set return 1 won't know whether it came 2nd or 8th or 96th in that order, but it does know that it wasn't first, and that's all that matters here.
Another way to think about it: suppose it were possible for two threads (tA, tB) both to load the value 0. Well, each one makes an entry in the modification order; call them mA and mB. M is a total order so one has to go before the other. And bearing in mind the all-important [atomics.order p10], you will quickly find there is no legal way for you to fill out the rest of M.
All of this is promised by the standard without any reference to memory ordering, so it works even with std::memory_order_relaxed. The only effect of relaxed memory ordering is that we can't say much about how our load/store will become visible with respect to operations on other variables. That's irrelevant to the program at hand; it doesn't even have any other variables.
In the actual implementation, this means that an atomic RMW really has to exclusively own the variable for the duration of the operation. We must ensure that no other thread does a store to that variable, nor the load half of a read-modify-write, during that period. In a MESI-like coherent cache, this is done by temporarily locking the cache line in the E state; if the system makes it possible for us to lose that lock (like an LL/SC architecture), abort and start again.
As to your comment about "a thread reading false from its own cache/buffer": the implementation mustn't allow that in an atomic RMW, not even with relaxed ordering. When you do an atomic RMW, you must read it while you hold the lock, and use that value in the RMW operation. You can't use some old value that happens to be in a buffer somewhere. Likewise, you have to complete the write while you still hold the lock; you can't stash it in a buffer and let it complete later.

relaxed is fine if you just need to determine the winner of the race to set the flag1, so one thread can start on the work and later threads can just continue on.
If the run_once work produces data that other threads need to be able to read, you'll need a release store after that, to let potential readers know that the work is finished, not just started. If it was instead just something like printing or writing to a file, and other threads don't care when that finishes, then yeah you have no ordering requirements between threads beyond the modification order of done which exists even with relaxed. An atomic RMW like test_and_set lets you determines which thread's modification was first.
BTW, you should check read-only before even trying to test-and-set; unless run() is only called very infrequently, like once per thread startup. For something like a static int foo = non_constant; local var, compilers use a guard variable that's loaded (with an acquire load) to see if init is already complete. If it's not, branch to code that uses an atomic RMW to modify the guard variable, with one thread winning, the rest effectively waiting on a mutex for that thread to init.
You might want something like that if you have data that all threads should read. Or just use a static int foo = something_to_run_once(), or some type other than int, if you actually have some data to init.
Or perhaps use C++11 std::call_once to solve this problem for you.
On normal systems, atomic_flag has no advantage over and atomic_bool. done.exchange(true) on a bool is equivalent to test_and_set of a flag. But atomic_bool is more flexible in terms of the operations it supports, like plain read that isn't part of an RMW test-and-set.
C++20 does add a test() method for atomic_flag. ISO C++ guarantees that atomic_flag is lock-free, but in practice so is std::atomic<bool> on all real-world systems.
Footnote 1: why relaxed guarantees a single winner
The memory_order parameter only governs ordering wrt. operations on other variables by the same thread.
Does calling test_and_set by a thread force somehow synchronization of the flag with values written by other threads?
It's not a pure write, it's an atomic read-modify-write, so the result of the one that went first is guaranteed to be visible to the one that happens to be second. That's the whole point of test-and-set as a primitive building block for mutual exclusion.
If two TAS operations could both load the original value (false), and then both store true, they would be atomic. They'd have overlapped with each other.
Two atomic RMWs on the same atomic object must happen in some order, the modification-order of that object. (Because they're not read-only: an RMW includes a modification. But also includes a read so you can see what the value was immediately before the new value; that read is tied to the modification order, unlike a plain read).
Every atomic object separately has a modification-order that all threads can agree on; this is guaranteed by ISO C++. (With orders less than seq_cst, ordering between objects can be different from source order, and not guaranteed that all threads even agree which store happened first, the IRIW problem.)
Being an atomic RMW guarantees that exactly one test_and_set will return false in thread A or B. Same for fetch_add with multiple threads incrementing a counter: the increments have to happen in some order (i.e. serialized with each other), and whatever that order is becomes the modification-order of that atomic object.
Atomic RMWs have to work this way to not lose counts. i.e. to actually be atomic.

Is it possible that a store with memory_order_relaxed never reaches other threads?

Suppose I have a thread A that writes to an atomic_int x = 0;, using x.store(1, std::memory_order_relaxed);. Without any other synchronization methods, how long would it take before other threads can see this, using x.load(std::memory_order_relaxed);? Is it possible that the value written to x stays entirely thread-local given the current definition of the C/C++ memory model that the standard gives?
The practical case that I have at hand is where a thread B reads an atomic_bool frequently to check if it has to quit; Another thread, at some point, writes true to this bool and then calls join() on thread B. Clearly I do not mind to call join() before thread B can even see that the atomic_bool was set, nor do I mind when thread B already saw the change and exited execution before I call join(). But I am wondering: using memory_order_relaxed on both sides, is it possible to call join() and block "forever" because the change is never propagated to thread B?
Edit
I contacted Mark Batty (the brain behind mathematically verifying and subsequently fixing the C++ memory model requirements). Originally about something else (which turned out to be a known bug in cppmem and his thesis; so fortunately I didn't make a complete fool of myself, and took the opportunity to ask him about this too; his answer was:
Q: Can it theoretically be that such a store [memory_order_relaxed without (any following) release operation] never reaches the other thread?
Mark: Theoretically, yes, but I don't think that has been observed.
Q: In other words, do relaxed stores make no sense
whatsoever unless you combine them with some release operation (and
acquire on the other thread), assuming you want another thread to
see it?
Mark: Nearly all of the use cases for them do use release and acquire, yes.

This is all the standard has to say on the matter, I believe:
[intro.multithread]/25 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

In practice
Without any other synchronization methods, how long would it take
before other threads can see this, using
x.load(std::memory_order_relaxed);?
No time. It's a normal write, it goes to the store buffer, so it will be available in the L1d cache in less time than a blink. But that's only when the assembly instruction is run.
Instructions can be reordered by the compiler, but no reasonable compiler would reorder atomic operation over arbitrarily long loops.
In theory
Q: Can it theoretically be that such a store [memory_order_relaxed
without (any following) release operation] never reaches the other
thread?
Mark: Theoretically, yes,
You should have asked him what would happen if the "following release fence" was added back. Or with atomic store release operation.
Why wouldn't these be reordered and delayed a loooong time? (so long that it seems like an eternity in practice)
Is it possible that the value written to x stays entirely thread-local
given the current definition of the C/C++ memory model that the
standard gives?
If an imaginary and especially perverse implementation wanted to delay the visibility of atomic operation, why would it do that only for relaxed operations? It could well do it for all atomic operations.
Or never run some threads.
Or run some threads so slowly that you would believe they aren't running.

This is what the standard says in 29.3.12:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
There is no guarantee a store will become visible in another thread, there is no guaranteed timing and there is no formal relationship with memory order.
Of course, on each regular architecture a store will become visible, but on rare platforms that do not support cache coherency, it may never become visible to a load.
In that case, you would have to reach for an atomic read-modify-write operation to get the latest value in the modification order.

Multithreading - hopefully a simple task

I have an iterative process coded in C++ which takes a long time and am considering converting my code to use multiple threads. But I am concerned that it could be very complicated and risk lock-ups and bugs. However I suspect that for this particular problem it may be trivial, but I would like confirmation.
I am hoping I can use threading code which is a s simple as this here.
My program employs large amounts of global arrays and structures. I assume that the individual threads need not concern themselves if other threads are attempting to read the same data at the same time.
I would also assume that if one thread wanted to increment a global float variable by say 1.5 and another thread wanted to decrement it by 0.1 then so long as I didn't care about the order of events then both threads would succeed in their task without any special code (like mutexs and locks etc) and the float would eventually end up larger by 1.4. If all my assumptions are correct then my task will be easy - Please advise.
EDIT: just to make it absolutely clear - it doesn't matter at all the order in which the float is incremented / decremented. So long as its value ends up larger by 1.4 then I am happy. The value of the float is not read until after all the threads have completed their task.
EDIT: As a more concrete example, imaging we had the task of finding the total donations made to a charity from different states in the US. We could have a global like this:
float total_donations= 0;
Then we could have 50 separate threads, each of which calculated a local float called donations_from_this_state. And each thread would separately perform:
total_donations += donations_from_this_state;
Obviously which order the threads performed their task in would make no difference to the end result.

I assume that the individual threads need not concern themselves if other threads are attempting to read the same data at the same time.
Correct. As long as all threads are readers no synchronization is needed as no values are changed in the shared data.
I would also assume that if one thread wanted to increment a global float variable by say 1.5 and another thread wanted to decrement it by 0.1 then so long as I didn't care about the order of events then both threads would succeed in their task without any special code (like mutexs and locks etc) and the float would eventually end up larger by 1.4
This assumption is not correct. If you have two or more threads writing to the same shared variable and that variable is not internally synchronized then you need external synchronization otherwise your code has undefined behavior per [intro.multithread]/21
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Where conflicting action is specified by [intro.multithread]/4
Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.

definition of wait-free (referring to parallel programming)

In Maurice Herlihy paper "Wait-free synchronization" he defines wait-free:
"A wait-free implementation of a concurrent data object is one that guarantees
that any process can complete any operation in a finite number of steps, regardless
the execution speeds on the other processes."
www.cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf
Let's take one operation op from the universe.
(1) Does the definition mean: "Every process completes a certain operation op in the same finite number n of steps."?
(2) Or does it mean: "Every process completes a certain operation op in any finite number of steps. So that a process can complete op in k steps another process in j steps, where k != j."?
Just by reading the definition i would understand meaning (2). However this makes no sense to me, since a process executing op in k steps and another time in k + m steps meets the definition, but m steps could be a waiting loop. If meaning (2) is right, can anybody explain to me, why this describes wait-free?
In contrast to (2), meaning (1) would guarantee that op is executed in the same number of steps k. So there can't be any additional steps m that are necessary e.g. in a waiting loop.
Which meaning is right and why?
Thanks a lot,
sema

The answer means definition (2). Consider that the waiting loop may potentially never terminate, if the process that is waited for runs indefinitely: “regardless the execution speeds on the other processes”.
So the infinite waiting loop effectively means that a given process may not be able to complete an operation in a finite number of steps.

When an author of a theoretical paper like this writes "a finite number of steps", it means that there exists some constant k (you do not necessarily know k), so that the number of steps is smaller than k (i.e. your waiting time surely won't be infinite).
I'm not sure what 'op' means in this context, but generally, when you have a multithreaded program, threads might wait for one another to do something.
Example: a thread has a lock, and other threads wait for this lock to be freed until they can operate.
This example is not wait free, since if the thread holding the lock does not get a chance to do any ops (this is bad, since the requirement here is that other threads will continue regardless of any other thread), other threads are doomed, and will never ever make any progress.
Other Example: there are several threads each trying to CAS on the same address
This example is wait free, because although all threads but one will fail in such an operation, there will always be progress no matter which threads are chosen to run.

It sounds like you're concerned that definition 2 would allow for an infinite wait loop, but such a loop—being infinite—would not satisfy the requirement for completion within a finite number of steps.
I take "wait-free" to mean that making progress does not require any participant to wait for another participant to finish. If such waiting was necessary, if one participant hangs or operates slowly, other participants suffer similarly.
By contrast, with a wait-free approach, each participant tries its operation and accommodates competitive interaction with other participants. For instance, each thread may try to advance some state, and if two try "at the same" time, only one should succeed, but there's no need for any participants that "failed" to retry. They merely recognize that someone else already got the job done, and they move on.
Rather than focusing on "waiting my turn to act", a wait-free approach encourages "trying to help", acknowledging that others may also be trying to help at the same time. Each participant has to know how to detect success, when to retry, and when to give up, confident that trying only failed because someone else got in there first. As long as the job gets done, it doesn't matter which thread got it done.

Wait-free essentially means that it needs no synchronization to be used in a multi-processing environment. The 'finite number of steps' refers to not having to wait on a synchronization device (e.g. a mutex) for an unknown -- and potentially infinite (deadlock) -- length of time while another process executes a critical section.

When is it necessary to implement locking when using pthreads in C++?

After posting my solution to my own problem regarding memory issues, nusi suggested that my solution lacks locking.
The following pseudo code vaguely represents my solution in a very simple way.
std::map<int, MyType1> myMap;
void firstFunctionRunFromThread1()
{
MyType1 mt1;
mt1.Test = "Test 1";
myMap[0] = mt1;
}
void onlyFunctionRunFromThread2()
{
MyType1 &mt1 = myMap[0];
std::cout << mt1.Test << endl; // Prints "Test 1"
mt1.Test = "Test 2";
}
void secondFunctionFromThread1()
{
MyType1 mt1 = myMap[0];
std::cout << mt1.Test << endl; // Prints "Test 2"
}
I'm not sure at all how to go about implementing locking, and I'm not even sure why I should do it (note the actual solution is much more complex). Could someone please explain how and why I should implement locking in this scenario?

One function (i.e. thread) modifies the map, two read it. Therefore a read could be interrupted by a write or vice versa, in both cases the map will probably be corrupted. You need locks.

Actually, it's not even just locking that is the issue...
If you really want thread two to ALWAYS print "Test 1", then you need a condition variable.
The reason is that there is a race condition. Regardless of whether or not you create thread 1 before thread 2, it is possible that thread 2's code can execute before thread 1, and so the map will not be initialized properly. To ensure that no one reads from the map until it has been initialized you need to use a condition variable that thread 1 modifies.
You also should use a lock with the map, as others have mentioned, because you want threads to access the map as though they are the only ones using it, and the map needs to be in a consistent state.
Here is a conceptual example to help you think about it:
Suppose you have a linked list that 2 threads are accessing. In thread 1, you ask to remove the first element from the list (at the head of the list), In thread 2, you try to read the second element of the list.
Suppose that the delete method is implemented in the following way: make a temporary ptr to point at the second element in the list, make the head point at null, then make the head the temporary ptr...
What if the following sequence of events occur:
-T1 removes the heads next ptr to the second element
- T2 tries to read the second element, BUT there is no second element because the head's next ptr was modified
-T1 completes removing the head and sets the 2nd element as the head
The read by T2 failed because T1 didn't use a lock to make the delete from the linked list atomic!
That is a contrived example, and isn't necessarily how you would even implement the delete operation; however, it shows why locking is necessary: it is necessary so that operations performed on data are atomic. You do not want other threads using something that is in an inconsistent state.
Hope this helps.

In general, threads might be running on different CPUs/cores, with different memory caches. They might be running on the same core, with one interrupting ("pre-empting" the other). This has two consequences:
1) You have no way of knowing whether one thread will be interrupted by another in the middle of doing something. So in your example, there's no way to be sure that thread1 won't try to read the string value before thread2 has written it, or even that when thread1 reads it, it is in a "consistent state". If it is not in a consistent state, then using it might do anything.
2) When you write to memory in one thread, there is no telling if or when code running in another thread will see that change. The change might sit in the cache of the writer thread and not get flushed to main memory. It might get flushed to main memory but not make it into the cache of the reader thread. Part of the change might make it through, and part of it not.
In general, without locks (or other synchronization mechanisms such as semaphores) you have no way of saying whether something that happens in thread A will occur "before" or "after" something that happens in thread B. You also have no way of saying whether or when changes made in thread A will be "visible" in thread B.
Correct use of locking ensures that all changes are flushed through the caches, so that code sees memory in the state you think it should see. It also allows you to control whether particular bits of code can run simultaneously and/or interrupt each other.
In this case, looking at your code above, the minimum locking you need is to have a synchronisation primitive which is released/posted by the second thread (the writer) after it has written the string, and acquired/waited on by the first thread (the reader) before using that string. This would then guarantee that the first thread sees any changes made by the second thread.
That's assuming the second thread isn't started until after firstFunctionRunFromThread1 has been called. If that might not be the case, then you need the same deal with thread1 writing and thread2 reading.
The simplest way to actually do this is to have a mutex which "protects" your data. You decide what data you're protecting, and any code which reads or writes the data must be holding the mutex while it does so. So first you lock, then read and/or write the data, then unlock. This ensures consistent state, but on its own it does not ensure that thread2 will get a chance to do anything at all in between thread1's two different functions.
Any kind of message-passing mechanism will also include the necessary memory barriers, so if you send a message from the writer thread to the reader thread, meaning "I've finished writing, you can read now", then that will be true.
There can be more efficient ways of doing certain things, if those prove too slow.

The whole idea is to prevent the program from going into an indeterminate/unsafe state due to multiple threads accessing the same resource(s) and/or updating/modifying the resource so that the subsequent state becomes undefined. Read up on Mutexes and Locking (with examples).

The set of instructions created as a result of compiling your code can be interleaved in any order. This can yield unpredictable and undesired results. For example, if thread1 runs before thread2 is selected to run, your output may look like:
Test 1
Test 1
Worse yet, one thread may get pre-empted in the middle of assigning - if assignment is not an atomic operation. In this case let's think of atomic as the smallest unit of work which can not be further split.
In order to create a logically atomic set of instructions -- even if they yield multiple machine code instructions in reality -- is to use a lock or mutex. Mutex stands for "mutual exclusion" because that's exactly what it does. It ensures exclusive access to certain objects or critical sections of code.
One of the major challenges in dealing with multiprogramming is identifying critical sections. In this case, you have two critical sections: where you assign to myMap, and where you change myMap[ 0 ]. Since you don't want to read myMap before writing to it, that is also a critical section.

The simplest answer is: you have to lock whenever there is an access to some shared resources, which are not atomics. In your case myMap is shared resource, so you have to lock all reading and writing operations on it.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js