How to understand the result of incrementing and printing a global variable by two threads? - concurrency

In my main I have two threads running the following function:
#include <cstdio>

int i = 0;

void foo()
{
    ++i;              // unsynchronized read-modify-write of a shared global
    printf("%d", i);  // unsynchronized re-read of i
}
I ran it 10000 times and got 3 different results:
{1 2}, {1 1}, {2 1}
The first two I understand. The question is how the result can be {2 1}, and why {2 2} doesn't appear at all.
Thank you!

To understand why any particular instance of undefined behavior happens to execute in a particular way on a particular machine, you usually have to read the assembly code generated by the compiler. The code gcc 10.2 produces on x86-64 does roughly the following:
1. Load i from memory into a register.
2. Increment the register.
3. Store the register back to i.
4. Pass the contents of the register to printf.
So imagine that the threads happen to do those steps in the following order:
Thread A   Thread B
--------   --------
Step 1
Step 2
Step 3
           Step 1
           Step 2
           Step 3
           Step 4
Step 4
Then you can clearly see that the output will be 2 1.
As to why 2 2 never occurs, note that the value to be printed is loaded at step 1. One of the two threads must execute step 1 before the other, or at the same time. Thus at least one of them will load 0 at step 1, and therefore will print 1. Even if the other thread happens to have updated the value of i in memory before the first thread prints its value, it won't make any difference because the value to be printed is already in a register in the first thread.
(Of course it is entirely possible that some other compiler would generate code where i is reloaded from memory, and in that case 2 2 could occur. It just doesn't happen to be the case for this particular generated code.)
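For completeness, here is a minimal sketch of one way to make the behavior well defined, using std::atomic (an addition for illustration, not code from the question): fetch_add performs the increment as a single atomic read-modify-write and returns the old value, so each thread prints a distinct count.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> i{0};

void foo()
{
    // fetch_add atomically increments and returns the *previous* value,
    // so mine is this thread's own increment result: one thread gets 1,
    // the other gets 2, regardless of interleaving.
    int mine = i.fetch_add(1) + 1;
    printf("%d", mine);
}

int main()
{
    std::thread a(foo), b(foo);
    a.join();
    b.join();
}

With this version {1 2} and {2 1} remain possible (the printf calls can still run in either order), but {1 1} and {2 2} cannot occur.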

C++ atomics reading stale value

I'm reading the C++ Concurrency in Action book and I'm having trouble understanding the visibility of writes to atomic variables.
Let's say we have a
std::atomic<int> x = 0;
and we read/write with sequentially consistent ordering:
1. ++x;
// <-- thread 2
2. if (x == 1) {
// <-- thread 1
}
Suppose we have 2 threads that execute the code above.
Is it possible that thread 1 arrives at line 2. and reads x == 1 after thread 2 has already executed line 1.?
So does the sequentially consistent ++x of thread 2 instantly get propagated to thread 1, or is it possible that thread 1 reads a stale value x == 1?
I think the above situation is possible with relaxed ordering or acq/rel, but what about sequentially consistent ordering?
If you're thinking that multiple atomic operations are somehow safely grouped, you're wrong. They'll always occur in order within that thread, and they'll be visible in that order, but there is no guarantee that two separate operations will occur in one thread before either occurs in the other.
So for your specific question "Is it possible that thread 1 arrives at line 2. and reads x == 1, after thread 2 already executed line 1.?", the answer is yes, thread 1 could reach the x == 1 test after thread 2 has incremented x as well, so x would already be 2 and neither thread would see x == 1 as true.
The simplest way to think about this is to imagine a single processor system, and consider what happens if the running thread is switched out at any time aside from the middle of a single atomic operation.
So in this case, the operations (inc1 and test1 for thread 1 and inc2 and test2 for thread 2) could occur in any of the following orders:
inc1 test1 inc2 test2
inc1 inc2 test1 test2
inc1 inc2 test2 test1
inc2 inc1 test1 test2
inc2 inc1 test2 test1
inc2 test2 inc1 test1
As you can see, neither test can occur before the increment on its own thread, so no test can ever read 0. Nor can both tests pass, because the only way a test passes is if the increment associated with it on that thread has occurred but not the increment on the other thread. But there's no guarantee any test passes: both increments could precede both tests, causing both tests to compare against the value 2 and neither to pass. The race window is narrow, so most of the time you'd probably see exactly one test pass, but it's wholly possible to get unlucky and have neither pass.
If you want to make this work reliably, you need to make sure you both modify and test in a single operation, so exactly one thread will see the value as being 1:
if (++x == 1) { // The first thread to get here will do the stuff
// Do stuff
}
In this case, the increment and read are a single atomic operation, so the first thread to get to that line (which might be thread 1 or thread 2, no guarantees) will perform the first increment with ++x atomically returning the new value which is tested. Two threads can't both see x become 1, because we kept both increment and test as one operation.
That said, if you're relying on the contents of that if being completed before any thread executes code after the if, that won't work. The first thread could enter the if while the second thread arrives nanoseconds later, skips it (realizing it wasn't the first to get there), and immediately begins executing the code after the if, even if the first thread hasn't finished. Simple use of atomics like this is not suited for the "run only once" scenario people often write this code for, where the run-once code must complete before any dependent code executes.
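If the goal really is one-time initialization that dependent code must wait for, the standard tool is std::call_once, which does block other threads until the once-routine has completed. A minimal sketch (init and worker are illustrative names, not from the question):

#include <mutex>

std::once_flag init_flag;

void init()
{
    // one-time setup goes here
}

void worker()
{
    // Exactly one thread runs init(); every other thread that reaches
    // this line blocks until init() has fully completed.
    std::call_once(init_flag, init);
    // Code here may safely assume init() has finished.
}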
Let's simplify your question.
When 2 threads execute func():

#include <atomic>
#include <iostream>

std::atomic<int> x{0};

void func()
{
    ++x;
    std::cout << x;
}
Is the following result possible?
11
And the answer is NO!
"11" can never appear: whichever thread performs the second increment reads x only after its own increment, and at that point x is already 2 (and x never decreases). Note that "22" is possible in addition to "12" and "21", since both increments can complete before either thread reaches its read.
Sequential consistency on an atomic variable works as you'd want in this simple case.

output 10 with memory_order_seq_cst

When I run this program I get 10 as output, which seems impossible to me. I'm running this on x86_64 (Core i3, Ubuntu).
If the output is 10, then the 1 must have come from either c or d.
Also, in thread t[0], c gets assigned 1. By that point a is 1, since a = 1 occurs before c = b. And c equals b, which was set to 1 by thread t[1]. So when we store d, it should be 1, as a = 1.
Can output 10 happen with memory_order_seq_cst? I tried inserting an atomic_thread_fence(seq_cst) in both threads between the 1st line (variable = 1) and the 2nd line, but it still didn't work.
Uncommenting both fences doesn't work either.
I tried running with g++ and clang++. Both give the same result.
#include <thread>
#include <unistd.h>
#include <cstdio>
#include <atomic>
using namespace std;

atomic<int> a, b, c, d;

void foo(){
    a.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    c.store(b, memory_order_seq_cst);
}

void bar(){
    b.store(1, memory_order_seq_cst);
    // atomic_thread_fence(memory_order_seq_cst);
    d.store(a, memory_order_seq_cst);
}

int main(){
    thread t[2];
    t[0] = thread(foo); t[1] = thread(bar);
    t[0].join(); t[1].join();
    printf("%d%d\n", c.load(memory_order_seq_cst), d.load(memory_order_seq_cst));
}
bash$ while [ true ]; do ./a.out | grep "10" ; done
10
10
10
10
10 (c=1, d=0) is easily explained: bar happened to run first, and finished before foo read b.
Quirks of the inter-core communication needed to get threads started on different cores mean it's easily possible for this to happen even though thread(foo) ran first in the main thread. For example, maybe an interrupt arrived at the core the OS chose for foo, delaying it from actually getting into that code (footnote 1).
Remember that seq_cst only guarantees that some total order exists for all seq_cst operations which is compatible with the sequenced-before order within each thread (and with any other happens-before relationships established by other factors). So the following order of atomic operations is possible, without even breaking out the a.load (footnote 2) in bar separately from the d.store of the resulting int temporary:
b.store(1,memory_order_seq_cst); // bar1. b=1
d.store(a,memory_order_seq_cst); // bar2. a.load reads 0, d=0
a.store(1,memory_order_seq_cst); // foo1
c.store(b,memory_order_seq_cst); // foo2. b.load reads 1, c=1
// final: c=1, d=0
atomic_thread_fence(seq_cst) has no impact anywhere because all your operations are already seq_cst. A fence basically just stops reordering of this thread's operations; it doesn't wait for or sync with fences in other threads.
(Only a load that sees a value stored by another thread can create synchronization. But such a load doesn't wait for the other store; it has no way of knowing there is another store. If you want to keep loading until you see the value you expect, you have to write a spin-wait loop.)
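For illustration, a minimal sketch of such a spin-wait loop (not code from the question):

#include <atomic>

std::atomic<int> flag{0};

void wait_for_flag()
{
    // Keep re-loading until we observe the value another thread stored;
    // the load that finally sees 1 synchronizes with that store.
    while (flag.load(std::memory_order_seq_cst) != 1)
    {
        // spin (optionally yield or pause here)
    }
}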
Footnote 1:
Since all your atomic vars are probably in the same cache line, even if execution did reach the top of foo and bar at the same time on two different cores, false sharing is likely going to let both operations from one thread happen while the other core is still waiting to get exclusive ownership. Although seq_cst stores are slow enough (on x86 at least) that hardware fairness stuff might relinquish exclusive ownership after committing the first store of 1. Anyway, there are lots of ways for both operations in one thread to happen before the other thread's, giving 10 or 01. It's even possible to get 11, if b=1 and then a=1 both happen before either load; using seq_cst does stop the hardware from doing the load early (before its own store is globally visible), which makes 11 quite possible, and which is also what rules out 00.
Footnote 2: The lvalue-to-rvalue evaluation of bare a uses the overloaded (int) conversion which is equivalent to a.load(seq_cst). The operations from foo could happen between that load and the d.store that gets a temporary value from it. d.store(a) is not an atomic copy; it's equivalent to int tmp = a; d.store(tmp);. That isn't necessary to explain your observations.
The printf output is unsynchronized, so an output of 10 can be just a reordered 01.
01 happens when the functions run serially before the printf.

How can I optimize swapping elements of an array in C++

My task is to code the Enigma machine. I've nearly done it, but the whole program must execute in under 3 seconds in every case, and that's what's holding me back. There are thousands of variables and arrays of size 1000+, so I need to optimize my code. The first thing I'd like to change is my swapping function. I've heard about something called a circular array, but I'm not sure how it works. The point is that before encoding each character I need to rotate the rotor once; when the sizes run into the thousands, that can take a lot of time. Any ideas?
First, I have a class that stores only the rotors in use. I can declare any number of rotors, but before each encoding I declare how many are used and set their positions. For example: the normal alphabet is 1 2 3 4 and the rotor is set to 1 2 4 3; with every character encoded, the rotor rotates to 2 4 3 1, then 4 3 1 2, then 3 1 2 4, etc., and the same in reverse.
for (int i = 0; i < ile_rotorow_wiad; i++)
{
    rotory_uzyte[i].copy(rotory[wiadomosc.jakie_rotory[i]], ile_liter,
                         rotory_rev[wiadomosc.jakie_rotory[i]]);
    rotory_uzyte[i].Obroc(wiadomosc.pozycja[i], ile_liter);
}
That's the code that runs on every loop iteration while the "enigma message" isn't equal to 0.
rotory_uzyte[0].Obroc(2, ile_liter);
The function Obroc:
void Rotor::Obroc(int pozycja, int ile)
{
    // Each pass of the outer loop rotates both arrays left by one:
    // the inner loop bubbles element 0 to the end via adjacent swaps.
    // Total cost: O(pozycja * ile) swaps.
    for (int i = 0; i < pozycja - 1; i++)
    {
        for (int j = 0; j < ile - 1; j++)
        {
            swap(pozycje[j], pozycje[j + 1]);
            swap(pozycje_rev[j], pozycje_rev[j + 1]);
        }
    }
}
As you can see, for large sizes this is hopeless and takes far too much time. Any ideas how to optimize it, given that the rotation just goes around in circles? As I mentioned, would a circular array be a good fit here, and what should it look like? (A sketch of that idea follows below.)
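To sketch the circular-array idea the question asks about (illustrative code with my own names, assuming the rotor wiring is a plain int array): instead of physically moving elements, keep the data in place and maintain a single offset; rotating becomes O(1), and each lookup adds the offset modulo the size.

#include <cstddef>

// Hypothetical circular-rotor sketch: rotation just moves an offset.
struct CircularRotor
{
    int*        wiring;     // e.g. the contents of pozycje
    std::size_t size;       // e.g. ile_liter
    std::size_t offset = 0; // current rotation

    // O(1): no elements are moved at all.
    void rotate(std::size_t steps) { offset = (offset + steps) % size; }

    // Read logical position i, shifted by the current rotation.
    int at(std::size_t i) const { return wiring[(i + offset) % size]; }
};

Alternatively, note that Obroc(pozycja, ile) as written rotates the arrays left by pozycja - 1 positions, which a single std::rotate(pozycje, pozycje + pozycja - 1, pozycje + ile) call (from <algorithm>) would do in one O(ile) pass instead of O(pozycja * ile) swaps.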

Ordering specification of UNIX system calls

System calls in UNIX-like OSes are reentrant (i.e. multiple system calls may be executed in parallel). Are there any ordering constraints of those system calls in the sense of the C/C++11 happens-before relation?
For example, let's consider the following program with 3 threads (in pseudo-code):
// thread 1
store x 1
store y 2
// thread 2
store y 1
store x 2
// thread 3
Thread.join(1 and 2)
wait((load x) == 1 && (load y) == 1)
Here, suppose x and y are shared locations, and all the loads and stores have relaxed ordering. (NOTE: with relaxed atomic accesses, races are not considered a bug; they are intentional in the sense of C/C++11 semantics.) This program may terminate, since (1) the compiler may reorder store x 1 and store y 2, and then (2) the execution may proceed as store y 2, store y 1, store x 2, and then store x 1, so (3) thread 3 may read x = 1 and y = 1 at the same time.
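In actual C++11, the pseudo-code above would look roughly like this (a sketch for illustration; thread 3's wait is reduced to a final check after the joins):

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};

void t1()
{
    x.store(1, std::memory_order_relaxed);
    y.store(2, std::memory_order_relaxed); // may be reordered before x = 1
}

void t2()
{
    y.store(1, std::memory_order_relaxed);
    x.store(2, std::memory_order_relaxed);
}

int main()
{
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // Relaxed ordering permits the final state x == 1 && y == 1,
    // which is what lets thread 3's wait() succeed.
    bool done = (x.load(std::memory_order_relaxed) == 1 &&
                 y.load(std::memory_order_relaxed) == 1);
    (void)done;
}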
I would like to know whether the following program may also terminate. Here, system calls syscall1() & syscall2() are inserted in threads 1 & 2, respectively:
// thread 1
store x 1
syscall1()
store y 2

// thread 2
store y 1
syscall2()
store x 2

// thread 3
Thread.join(1 and 2)
wait((load x) == 1 && (load y) == 1)
This program seems unable to terminate. However, in the absence of ordering constraints on the system calls invoked, I think it may terminate. Here is the reason: suppose syscall1() and syscall2() are not serialized and may run in parallel. Then the compiler, with full knowledge of the semantics of syscall1() and syscall2(), may still reorder store x 1 & syscall1() with store y 2.
So I would like to ask if there are any ordering constraints on system calls invoked by different threads. If possible, I would like to know the authoritative source on this kind of question.
A system call (those listed in syscalls(2)...) is an elementary operation, from the point of view of an application program in user land.
Each system call is (by definition) a call into the kernel, through some single machine-code instruction (SYSENTER, SYSCALL, INT...); the details depend upon the processor (its instruction set) and the ABI. The kernel does its business (processing your system call, which could succeed or fail), but your user program sees only an elementary step. Sometimes that step (during which control is given to the kernel) could last a long time (e.g. minutes or hours).
So your program in user land runs in a low-level virtual machine, provided by the user-mode machine instructions of your processor, augmented by a single "virtual" system-call instruction (capable of performing any system call implemented by the kernel).
This does not prevent your program from being buggy because of race conditions.

What is the meaning of monotonicity in C++ n2660?

In "http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2660.htm"
That says
[monotonicity] Accesses to a single variable V of type T by a single thread X appear to occur in programme order. For example, if V is initially 0, then X writes 1, and then 2 to V, no thread (including but not limited to X) can read a value from V and then subsequently read a lower value from V. (Notice that this does not prevent arbitrary load and store reordering; it constrains ordering only between actions on a single memory location. This assumption is reasonable on all architectures that I currently know about. I suspect that the Java and CLR memory models require this assumption also.)
I can't understand the relationship between call_once and monotonicity, and I can't find any related documentation.
Please help.
It means that the compiler won't reorder actions done on the same memory location.
So if you write:
int i = 0;
i = 1;
i = 2;
There is no way your current thread, or another one, will read the variable i with the value 2 and then subsequently read the same variable and get 1 or 0.
In the linked paper it is used as a requirement for the given pthread_once implementation; if this principle is not respected, that implementation might not work. The reason for this added requirement seems to be to avoid a memory barrier, for performance.
Monotonicity means: If operation B is issued after operation A, then B cannot be executed before A.
The explanation given in the text stems from mathematics, where a monotonic sequence is one that only ever moves up or down: 1, 2, 7, 11 is monotonic (each value is bigger than the one before), as is 100, 78, 39, 12 (each value is smaller than the one before); 16, 5, 30 isn't monotonic.
If a value is modified in strictly ascending order, any two successive reads will give results a and then b with b >= a; monotonicity is preserved.
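As a concrete illustration of the guarantee (a sketch, not from the paper): with a writer storing increasing values to one atomic variable, a reader's second load can never observe a smaller value than its first, even with relaxed ordering.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> v{0};

void writer()
{
    v.store(1, std::memory_order_relaxed);
    v.store(2, std::memory_order_relaxed); // same location: order is kept
}

void reader()
{
    int a = v.load(std::memory_order_relaxed);
    int b = v.load(std::memory_order_relaxed);
    assert(b >= a); // monotonicity: never 2 then 1 (or 0)
}

int main()
{
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}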