System calls in UNIX-like OSes are reentrant (i.e. multiple system calls may be executed in parallel). Are there any ordering constraints on those system calls in the sense of the C/C++11 happens-before relation?
For example, let's consider the following program with 3 threads (in pseudo-code):
// thread 1
store x 1
store y 2
// thread 2
store y 1
store x 2
// thread 3
Thread.join(1 and 2)
wait((load x) == 1 && (load y) == 1)
Here, suppose x and y are shared locations, and all the loads and stores have relaxed ordering. (NOTE: with relaxed atomic accesses, races are not considered a bug; they are intentional in the sense of the C/C++11 semantics.) This program may terminate, since (1) the compiler may reorder store x 1 and store y 2, and then (2) the threads may execute store y 2, store y 1, store x 2, and then store x 1, so (3) thread 3 may read x = 1 and y = 1 at the same time.
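For concreteness, here is one possible C++11 rendering of the first program (a sketch only; the pseudo-code above remains authoritative, and the spin loop stands in for wait):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};

int main() {
    std::thread t1([] {                        // thread 1
        x.store(1, std::memory_order_relaxed);
        y.store(2, std::memory_order_relaxed);
    });
    std::thread t2([] {                        // thread 2
        y.store(1, std::memory_order_relaxed);
        x.store(2, std::memory_order_relaxed);
    });
    t1.join();                                 // thread 3: join 1 and 2...
    t2.join();
    while (!(x.load(std::memory_order_relaxed) == 1 &&
             y.load(std::memory_order_relaxed) == 1)) {
        // ...then wait; spins forever unless the reordering described above happened
    }
}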
I would like to know if the following program may also terminate. Here, system calls syscall1() & syscall2() are inserted into threads 1 & 2, respectively:
// thread 1
store x 1
syscall1()
store y 2
// thread 2
store y 1
syscall2()
store x 2
// thread 3
Thread.join(1 and 2)
wait((load x) == 1 && (load y) == 1)
It seems impossible for this program to terminate. However, in the absence of ordering constraints on the system calls invoked, I think it may terminate. Here is the reason. Suppose syscall1() and syscall2() are not serialized and may run in parallel. Then the compiler, with full knowledge of the semantics of syscall1() and syscall2(), may still reorder store x 1 & syscall1() with store y 2.
So I would like to ask whether there are any ordering constraints on system calls invoked by different threads. If possible, I would also like to know the authoritative source for this kind of question.
A system call (those listed in syscalls(2) ...) is an elementary operation from the point of view of an application program in user land.
Each system call is (by definition) a call into the kernel, through some single machine-code instruction (SYSENTER, SYSCALL, INT ...); the details depend upon the processor (its instruction set) and the ABI. The kernel does its business (processing your system call, which could succeed or fail), but your user program sees only an elementary step. Sometimes that step (during which control is given to the kernel) could last a long time (e.g. minutes or hours).
So your program in user land runs on a low-level virtual machine, provided by the user-mode machine instructions of your processor augmented by a single "virtual" system-call instruction (capable of performing any system call implemented by the kernel).
This does not prevent your program from being buggy because of race conditions.
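For illustration, a minimal sketch (assuming Linux with glibc) that performs a system call through the generic syscall(2) wrapper; from the program's point of view the whole call is one elementary step:

#include <sys/syscall.h>   // SYS_getpid
#include <unistd.h>        // syscall()
#include <cstdio>

int main() {
    // One trap into the kernel; the program observes only the result.
    long pid = syscall(SYS_getpid);
    std::printf("pid = %ld\n", pid);
}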
Consider this program:
-- Initially --
std::atomic<int> x{0};
int y{0};
-- Thread 1 --
y = 1; // A
x.store(1, std::memory_order_release); // B
x.store(3, std::memory_order_relaxed); // C
-- Thread 2 --
if (x.load(std::memory_order_acquire) == 3) // D
    print(y); // E
Under the C++11 memory model, if the program prints anything then it prints 1.
In the C++20 memory model, release sequences were changed to exclude writes performed by the same thread. How does that affect this program? Could it now have a data-race and print either 0 or 1?
Notes
This code appears in P0982R1: Weaken Release Sequences which I believe is the paper that resulted in the changes to the definition of release sequences in C++20. In that particular example, there is a third thread making a store to x which disrupts the release sequence in a way that is counter-intuitive. That motivates the need to weaken the release sequence definition.
From reading the paper my understanding is that with the C++20 changes, C will no longer form part of the release sequence headed by B, because C is not a Read-Modify-Write operation. Therefore C does not synchronize with D. Thus there is no happens-before relation between A and E.
Since B and C are stores to the same atomic variable and all threads must agree on the modification order of that variable, does the C++20 memory model allow us to infer anything about whether A happens-before E?
Your understanding is correct; the program has a data race. The store of 3 does not form part of any release sequence, so D does not synchronize with any release store. There is thus no way to establish a happens-before relationship between any two operations from the two different threads, and in particular, no happens-before between A and E.
I think the only thing you can infer from the load of 3 is that D definitely does not happen before C; if it did, then D would be obliged to load a value that was strictly earlier in the modification order of x [read-write coherence, intro.races p17]. That means in particular that E does not happen before A.
The modification order would come into play if you were to load from x again in Thread 2 somewhere after D. Then you would be guaranteed to load the value 3 again. That follows from read-read coherence [intro.races p16]. Your second load is not allowed to observe anything that preceded 3 in the modification order, so it cannot load the values 0 or 1. This would apply even if all the loads and stores in both threads were relaxed.
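As a sketch of that last guarantee (the second load is a hypothetical addition to the original program):

// Thread 2, extended with a second load after D:
if (x.load(std::memory_order_acquire) == 3) {       // D
    int again = x.load(std::memory_order_relaxed);  // must read 3 again
    // Read-read coherence [intro.races p16]: this load cannot observe a value
    // that precedes 3 in the modification order of x, so never 0 or 1.
    print(again);
}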
volatile bool b;

// Thread 1: only reads b
void f1() {
    while (1) {
        if (b) { /* do something */ }
        else { /* do something else */ }
    }
}

// Thread 2: only sets b to true if a certain condition is met
void f2() {
    while (1) {
        // some local condition evaluated -> local_cond
        if (!b && (local_cond == true)) b = true;
        // some other work
    }
}

// Thread 3: only sets b to false when it gets a message on a socket it is listening to
void f3() {
    while (1) {
        // select on the socket
        if (expected_message_came) b = false;  // pseudo-condition: the expected message arrived
        // do some other work
    }
}
If thread 2 updates b first at time t and thread 3 later updates b at time t+5:
will thread 1 see the latest value "in time" whenever it reads b?
For example: reads from t+delta to t+5+delta should read true, and
reads after t+5+delta should read false.
Here delta is the time it takes for a store to b to become visible in memory after thread 2 or 3 updates it.
The effect of the volatile keyword is principally two things (I avoid scientifically strict formulations here):
1) Its accesses can't be cached or combined. (UPD: at a commenter's suggestion, I underline that this means caching in registers or another compiler-provided location, not the RAM cache in the CPU.) For example, the following code:
x = 1;
x = 2;
for a volatile x will never be combined into a single x = 2, whatever optimization level is requested; but if x is not volatile, even low optimization levels will likely cause this collapse into a single write. The same goes for reads: each read operation will access the variable's value without any attempt to cache it.
2) All volatile operations are emitted at the machine-code level in the same order (to underline: only relative to other volatile operations) as they are defined in the source code.
But this is not true for ordering between non-volatile and volatile accesses. For the following code:
int *x;
volatile int *vy;

void foo()
{
    *x = 1;
    *vy = 101;
    *x = 2;
    *vy = 102;
}
gcc (9.4) with -O2 and clang (10.0) with -O produce something similar to:
movq x(%rip), %rax
movq vy(%rip), %rcx
movl $101, (%rcx)
movl $2, (%rax)
movl $102, (%rcx)
retq
so one access to x is already gone, despite its presence between two volatile accesses. If you need the first x = 1 to happen before the first write to vy, put an explicit barrier there (since C11, atomic_signal_fence is the platform-independent means for this).
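A minimal sketch of that fix, using the C++ spelling std::atomic_signal_fence (the fence constrains only the compiler, which is all that is needed here):

#include <atomic>

int *x;
volatile int *vy;

void foo()
{
    *x = 1;
    // Compiler-only barrier: *x = 1 can no longer be merged into *x = 2
    // or moved past the first volatile write.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    *vy = 101;
    *x = 2;
    *vy = 102;
}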
That was the common rule, but without regard to multithreading issues. What happens with multithreading?
Well, imagine that you declare that thread 2 writes true to b; so this is a write of the value 1 to a single-byte location. But it is an ordinary write, without any memory-ordering requirements. What volatile provides is that the compiler won't optimize it away. But what about the processor?
If this were a modern abstract processor, or one with relaxed rules, like ARM, I'd say nothing prevents it from postponing the real write for an indefinite time. (To clarify, "write" here means exposing the operation to the RAM-and-all-caches conglomerate.) It's fully up to the processor's deliberation. Well, processors are designed to flush their stockpile of pending writes as fast as possible. But you can't know what affects the real delay: for example, the processor could "decide" to fill the instruction cache with a few upcoming lines first, or flush other queued writes... lots of variants. The only thing we know is that it makes a "best effort" to flush all queued operations, to avoid getting buried under previous results. That's truly natural and nothing more.
With x86, there is an additional factor. Nearly every memory write (and, I guess, this one as well) is a "releasing" write on x86, so all previous reads and writes shall be completed before this write. But the key fact is that it is the operations before this write that must complete. So when you write true to volatile b, you can be sure all previous operations have already become visible to the other participants... but this write itself could still be postponed for a while... How long? Nanoseconds? Microseconds? Any other write to memory will flush, and thereby publish, this write to b... do you have writes in the loop iteration of thread 2?
The same affects thread 3. You can't be sure this b = false will be published to other CPUs when you need it. The delay is unpredictable. The only guarantee, unless this is a realtime-aware hardware system, is that the write becomes visible within some indefinite time; the ISA rules and barriers provide ordering, but not exact times. And x86 is definitely not built for that kind of realtime.
Well, all this means you also need an explicit barrier after the write, one which affects not only the compiler but the CPU as well: a barrier between the previous write and the following reads or writes. Among the C/C++ means, a full barrier satisfies this, so you have to add std::atomic_thread_fence(std::memory_order_seq_cst) or use an atomic variable (instead of a plain volatile one) with the same memory order for the write.
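A minimal sketch of that suggestion, applied to thread 2's update (an illustration of the fence placement, not a complete fix):

#include <atomic>

volatile bool b;

void thread2_update() {
    b = true;                                             // the volatile store
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier: constrains the CPU, not just the compiler
}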
And all this still won't provide you with exact timings like the ones you described ("t" and "t+5"), because the visible "timestamps" of the same operation can differ between CPUs! (Well, this resembles Einstein's relativity a bit.) All you can say in this situation is that something is written into memory, and typically (though not always) the inter-CPU order is what you expected (but an ordering violation will punish you).
But I can't catch the general idea of what you want to implement with this flag b. What do you want from it; what state should it reflect? Go back to the upper-level task and reformulate it. Is this (I'm just guessing, reading coffee grounds) a green light to do something, which can be cancelled by an external order? If so, an internal permission ("we are ready") from thread 2 must not override that cancellation. This can be done using different approaches, such as:
1) Just separate flags and a mutex/spinlock around setting them. Easy, but somewhat costly (or even substantially costly; I don't know your environment).
2) An atomically modified analog. For example, you can use a bitfield variable which is modified using compare-and-swap. Assign bit 0 to "ready" and bit 1 to "cancelled". For C, atomic_compare_exchange_strong is what you'll need here on x86 (and on most other ISAs). And volatile is no longer needed if you stay with memory_order_seq_cst (see the sketch below).
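A C++ sketch of option 2 (the names and the fetch_or shortcut on the cancel side are my own; all operations default to memory_order_seq_cst):

#include <atomic>

std::atomic<unsigned> flags{0};
constexpr unsigned READY     = 1u << 0;
constexpr unsigned CANCELLED = 1u << 1;

// Thread 2: set READY without disturbing a concurrent cancellation.
void set_ready() {
    unsigned old = flags.load();
    // On failure, compare_exchange_strong reloads 'old' and we retry.
    while (!flags.compare_exchange_strong(old, old | READY)) {}
}

// Thread 3: cancel; an atomic RMW cannot be lost to a concurrent set_ready().
void cancel() {
    flags.fetch_or(CANCELLED);
}

// Thread 1: act only if ready and not cancelled.
bool may_proceed() {
    unsigned f = flags.load();
    return (f & READY) && !(f & CANCELLED);
}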
Will thread1 see the latest value "in time" whenever it is reading b?
Yes: the volatile keyword denotes that the variable can be modified outside of the thread, or by hardware, without the compiler being aware of it. Thus every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization, and is evaluated strictly according to the rules of the abstract machine (that is, all writes are completed at some time before the next sequence point). This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated from the volatile access by a sequence point.
Unfortunately, the volatile keyword is not thread-safe, so such operations have to be handled with care; it is recommended to use atomic for this, unless you are in an embedded or bare-metal scenario.
Also, if the flag is part of a shared struct, the whole struct should be atomic: struct X { int a; volatile bool b; };.
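For reference, a minimal sketch of the recommended atomic replacement (the function and condition names are placeholders mirroring the question's three threads):

#include <atomic>

std::atomic<bool> b{false};        // replaces 'volatile bool b'

void f1_reader() {                 // thread 1
    if (b.load()) { /* do something */ }
    else { /* do something else */ }
}

void f2_setter(bool local_cond) {  // thread 2
    if (!b.load() && local_cond) b.store(true);
}

void f3_clearer() {                // thread 3: the expected message arrived
    b.store(false);
}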
Say I have a system with 2 cores. The first core runs thread 2, the second core runs thread 3.
reads from t+delta to t+5+delta should read true and reads after t+5+delta should read false.
The problem is that thread 1 may not get to read until t + 10000000, when the kernel decides one of the threads has run long enough and schedules a different thread. So it is likely that thread 1 will not see the change much of the time.
Note: this ignores all the additional problems of cache synchrony and observability. If the thread isn't even running, all of that becomes irrelevant.
When I run this program I get the output 10, which seems impossible to me. I'm running this on an x86_64 Core i3 under Ubuntu.
If the output is 10, then c must be 1 and d must be 0.
But in thread t[0], we assign c from b, and a is already 1, since a.store(1) occurs before c's store. For c to be 1, b must have been set to 1 by the other thread; so when that thread stores d, it should store 1, since a = 1 already happened.
Can the output 10 happen with memory_order_seq_cst? I tried inserting an atomic_thread_fence(seq_cst) in both threads between the 1st line (variable = 1) and the 2nd line, but it didn't make a difference.
Uncommenting both fences doesn't change the result.
Tried running with g++ and clang++. Both give the same result.
#include <thread>
#include <unistd.h>
#include <cstdio>
#include <atomic>
using namespace std;

atomic<int> a, b, c, d;

void foo() {
    a.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    c.store(b, memory_order_seq_cst);
}

void bar() {
    b.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    d.store(a, memory_order_seq_cst);
}

int main() {
    thread t[2];
    t[0] = thread(foo);
    t[1] = thread(bar);
    t[0].join();
    t[1].join();
    printf("%d%d\n", c.load(memory_order_seq_cst), d.load(memory_order_seq_cst));
}
bash$ while [ true ]; do ./a.out | grep "10" ; done
10
10
10
10
10 (c=1, d=0) is easily explained: bar happened to run first, and finished before foo read b.
Quirks of the inter-core communication needed to get threads started on different cores mean it's easily possible for this to happen even though thread(foo) was constructed first in the main thread. For example, maybe an interrupt arrived at the core the OS chose for foo, delaying it from actually getting into that code.¹
Remember that seq_cst only guarantees that some total order exists over all seq_cst operations, one compatible with the sequenced-before order within each thread (and with any other happens-before relationships established by other factors). So the following order of atomic operations is possible, without even breaking out the a.load² in bar separately from the d.store of the resulting int temporary:
b.store(1,memory_order_seq_cst); // bar1. b=1
d.store(a,memory_order_seq_cst); // bar2. a.load reads 0, d=0
a.store(1,memory_order_seq_cst); // foo1
c.store(b,memory_order_seq_cst); // foo2. b.load reads 1, c=1
// final: c=1, d=0
atomic_thread_fence(seq_cst) has no impact anywhere because all your operations are already seq_cst. A fence basically just stops reordering of this thread's operations; it doesn't wait for or sync with fences in other threads.
(Only a load that sees a value stored by another thread can create synchronization. But such a load doesn't wait for the other store; it has no way of knowing there is another store. If you want to keep loading until you see the value you expect, you have to write a spin-wait loop.)
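A minimal spin-wait sketch of that last point (hypothetical; not part of the question's program):

// e.g. wait until another thread's a.store(1) is visible to us:
while (a.load(std::memory_order_seq_cst) != 1) {
    // spin (optionally std::this_thread::yield() to be polite)
}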
Footnote 1:
Since all your atomic vars are probably in the same cache line, even if execution did reach the top of foo and bar at the same time on two different cores, false sharing would likely let both operations from one thread happen while the other core was still waiting to gain exclusive ownership. Although seq_cst stores are slow enough (on x86 at least) that hardware fairness mechanisms might relinquish exclusive ownership after committing the first store of 1. Anyway, there are lots of ways for both operations in one thread to happen before the other thread's, giving 10 or 01. It's even possible to get 11, if b=1 and then a=1 happen before either load; seq_cst does stop the hardware from doing a load early (before its own store is globally visible), so 11 is very possible.
Footnote 2: The lvalue-to-rvalue evaluation of bare a uses the overloaded conversion to int, which is equivalent to a.load(seq_cst). The operations from foo could happen between that load and the d.store that receives the temporary value from it. d.store(a) is not an atomic copy; it's equivalent to int tmp = a; d.store(tmp);. (That isn't necessary to explain your observations, though.)
The printf statements are unsynchronized, so an output of 10 can be just a reordered 01.
01 happens when the functions before the printf run serially.
I have a single-core CPU (ARM Cortex-M3, 32-bit) with two threads. Assume the following situation:
// Thread 1:
int16_t a = 1;
double b = 1.0;
// Do some other fancy stuff including starting Thread 2
for (;;) { std::cout << a << "," << b; }
// Thread 2:
a = 2;
b = 2.0;
I can handle the following outputs:
1,1
1,2
2,1
2,2
Can I be certain that the output will always be one of those (1/2) without using a mutex or other locking mechanisms? More specifically, is this compiler-dependent? And is the behavior different for int16_t and double?
It depends mostly on the CPU, though in theory anything involving multiple threads pre-C11 is at best implementation-defined and at worst undefined behavior, so the compiler might do just about anything.
If you can ignore crazy compilers that do silly things, and assume that the compiler will use the CPU's facilities in a reasonable way, it depends mostly on what the CPU supports.
The Cortex-M3 is a 32-bit CPU with a 32-bit bus and no FPU, so reads and writes of 32-bit and smaller values will generally be atomic. double, however, is 64 bits, so any read or write of a double will involve two instructions and be non-atomic. Thus if one thread reads while the other is writing, you might get half of one value and half of the other.
Now, in your specific example the values 1.0 and 2.0 both have an all-zero lower half, so such a 'mix' would be innocuous, but other values will not have that property.
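A small sketch of why this particular pair is harmless (the bits() helper is my own; the hex comments for 1.0 and 2.0 are their standard IEEE 754 encodings):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret a double's bit pattern as a 64-bit integer.
static uint64_t bits(double d) {
    uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

int main() {
    std::printf("1.0 = %016llx\n", (unsigned long long)bits(1.0)); // 3ff0000000000000
    std::printf("2.0 = %016llx\n", (unsigned long long)bits(2.0)); // 4000000000000000
    // Both lower 32-bit halves are zero, so a torn read still yields 1.0 or 2.0.
    // A value like 2.1 has a nonzero lower half; tearing between 1.0 and 2.1
    // could then produce a bit pattern that was never written at all.
    std::printf("2.1 = %016llx\n", (unsigned long long)bits(2.1));
}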
The evaluation order of the operands before the ; is not guaranteed to be left-to-right; and even if it were, the accesses are not atomic, so a read that races with a write can yield a torn value (such accesses can and do take multiple cycles to perform, and a context switch can interrupt them).
On ARM in particular, reads and writes go into a queue on the CPU, where they are free to be reordered (except across memory barriers); and even on a CPU that doesn't reorder memory, the compiler is free to reorder them. Nothing stops your assignment and read from being moved forward or back, so you cannot guarantee the state of any of the values or their ordering.
I was reading Bjarne Stroustrup's C++11 FAQ and I'm having trouble understanding an example in the memory model section.
He gives the following code snippet:
// start with x==0 and y==0
if (x) y = 1; // thread 1
if (y) x = 1; // thread 2
The FAQ says there is not a data race here. I don't understand. The memory location x is read by thread 1 and written to by thread 2 without any synchronization (and the same goes for y). That's two accesses, one of which is a write. Isn't that the definition of a data race?
Further, it says that "every current C++ compiler (that I know of) gives the one right answer." What is this one right answer? Couldn't the answer vary depending on whether one thread's comparison happens before or after the other thread's write (or if the other thread's write is even visible to the reading thread)?
// start with x==0 and y==0
if (x) y = 1; // thread 1
if (y) x = 1; // thread 2
Since neither x nor y is initially true, the other will never be set to true either. No matter the order in which the instructions are executed, the (correct) result is always that x remains 0 and y remains 0.
The memory location x is ... written to by thread 2
Is it really? Why do you say so?
If y is 0 then x is not written to by thread 2. And y starts out 0. Similarly, x cannot be non-zero unless somehow y is non-zero "before" thread 1 runs, and that cannot happen. The general point here is that conditional writes that don't execute don't cause a data race.
This is a non-trivial fact of the memory model, though, because a compiler that is not aware of threading would be permitted (assuming y is not volatile) to transform the code if (x) y = 1; to int tmp = y; y = 1; if (!x) y = tmp;. Then there would be a data race. I can't imagine why it would want to do that exact transformation, but that doesn't matter, the point is that optimizers for non-threaded environments can do things that would violate the threaded memory model. So when Stroustrup says that every compiler he knows of gives the right answer (right under C++11's threading model, that is), that's a non-trivial statement about the readiness of those compilers for C++11 threading.
A more realistic transformation of if (x) y = 1 would be y = x ? 1 : y;. I believe that this would cause a data race in your example, and that there is no special treatment in the standard for the assignment y = y that makes it safe to execute unsequenced with respect to a read of y in another thread. You might find it hard to imagine hardware on which it doesn't work, and anyway I may be wrong, which is why I used a different example above that's less realistic but has a blatant data race.
There has to be a total ordering of the writes, because no thread can write to x or y until some other thread has first written a 1 to one of them. In other words, there are basically three different scenarios:
thread 1 gets to write to y because x was written at some point before the if statement; then, if thread 2 comes later, it writes the same value 1 to x, leaving its previous value of 1 unchanged.
thread 2 gets to write to x because y was changed at some point before the if statement; then thread 1, if it comes later, writes the same value 1 to y.
if there are only these two threads, the if bodies are skipped entirely, because x and y remain 0.
Neither of the writes occurs, so there is no race. Both x and y remain zero.
(This is talking about the problem of phantom writes. Suppose one thread speculatively did the write before checking the condition, then attempted to correct things after. That would break the other thread, so it isn't allowed.)
(In another, unrelated sense, "memory model" can refer to a compiler/linker setting for the supported size of the code and data areas: before compiling and linking source code, one specifies the memory model, which sets the size limits for data and code.)