operator ++ (prefix) with threads - c++

A bet between friends.
The sum variable is defined as a global, and we have 2 threads that each run a loop of 100 iterations, incrementing sum by 1 on every iteration.
What will be printed?
"sum = "?
#include <iostream>
#include <thread>

int sum = 0;

void func() {
    for (int i = 0; i < 100; i++) {
        sum++;  // data race: unsynchronized read-modify-write
    }
}

int main() {
    std::thread t1(func);  // a std::thread starts running on construction
    std::thread t2(func);
    t1.join();
    t2.join();
    std::cout << "sum = " << sum;
    return 0;
}

It is undefined behavior, so I am going to say 42. When you have more than one thread accessing a shared variable and at least one of them is a writer, you need synchronization. If you do not have that synchronization, you have undefined behavior, and nobody can tell you what will happen.
You could use a std::mutex or a std::atomic to get synchronization and make the program's behavior defined.

There is no single value for sum. If no increments are lost, the value will be 200. If an increment is lost on every iteration of the loop (unlikely), it could be as low as 100. Or it could be anywhere in between.
You probably think of sum++ as an atomic operation, but it is actually syntactic sugar for sum = sum + 1: a read, an add, and a write. There is the possibility of a race within this operation, so sum can be different every time you run the program.
Imagine the current value of sum is 10. t1 enters the loop and reads the value of sum (10), and is then stopped to let t2 run. t2 then reads the same value (10) of sum. When each thread increments, they both write 11. If no other increments are lost, the end value of sum would be 199.
Here's an even worse case. Imagine the current value of sum is 10 again. t1 enters the loop and reads the value of sum (10), then is stopped to let t2 run. t2, again, reads the value of sum (10) and is then itself stopped. Now t1 runs again and loops 10 more times, setting the value of sum to 20. Now t2 starts up again and writes 11, so you have actually decremented the value of sum.

Since the increment is not atomic, this results in undefined behaviour.

It will be a random value between 100 and 200. There is a race condition between the two threads because there is no mutual exclusion, so some ++ operations will be lost. This is why you get 100 when all the ++ operations of one thread are lost, and 200 when nothing is lost. Anything in between may happen.


what's the possible result of the following code in concurrency situation?

I came across the following interview question:
What's the possible range of the result of the following code:
#include <functional>
#include <iostream>
#include <thread>

void ThreadProc(int& sum)
{
    for (int i = 1; i <= 50; i++)
    {
        sum += 1;
    }
}

int main()
{
    int sum = 0;
    std::thread t1(ThreadProc, std::ref(sum));
    std::thread t2(ThreadProc, std::ref(sum));
    t1.join();
    t2.join();
    std::cout << sum << '\n';
    return 0;
}
The given answer is [50,100].
However, I thought it should be [2,100].
Given the sequence below, sum will be 2:
Thread t1 gets the CPU and loads the initial sum = 0 into a register (call the cached value c1; it is 0 now).
Thread t2 gets the CPU and increments 49 times; sum is now 49.
Thread t1 gets the CPU and stores sum = c1 + 1; sum is now 1.
Thread t2 gets the CPU, loads sum (= 1), and computes sum + 1, caching the result (call it c2; it is 2 now). Before c2 is written back to the variable sum, t2 is preempted.
Thread t1 gets the CPU and runs its remaining 49 increments [sum is now some value x; the value doesn't matter], then thread t1 finishes.
Thread t2 gets the CPU and writes its cached result c2 to sum.
Now sum is 2.
Am I right?
This code causes undefined behaviour because sum is modified from two different threads without any concurrency protection. This is called a data race in the C++ Standard.
Therefore any behaviour whatsoever is possible (including but not limited to all the cases you mention).
See the cppreference page about the C++ memory model.

thread synchronisation issue

In the following example I have called pthread_join() for both threads at the end (before I print the sum). Even though the sum is expected to be 0, it prints an arbitrary value. I know that if I do pthread_join(id1, NULL) just before the creation of the 2nd thread then it works fine (it does), but I don't understand why it should not work when I call join for both threads at the end.
Because sum is printed only after both threads must have finished executing completely: the first thread must have added 2000000 to the variable sum and the second thread must have subtracted 2000000 from it, so sum SHOULD be 0.
#include <pthread.h>
#include <iostream>

long long sum = 0;

void* counting_thread(void* arg)
{
    int offset = *(int*) arg;
    for (int i = 0; i < 2000000; i++)
    {
        sum = sum + offset;
    }
    pthread_exit(NULL);
}

int main(void)
{
    pthread_t id1;
    int offset1 = 1;
    pthread_create(&id1, NULL, counting_thread, &offset1);
    pthread_t id2;
    int offset2 = -1;
    pthread_create(&id2, NULL, counting_thread, &offset2);
    pthread_join(id1, NULL);
    pthread_join(id2, NULL);
    std::cout << sum;
}
The problem is that sum = sum + offset; is not thread safe.
This causes some updates not to be counted.
Since you specified C++, std::atomic<long long> sum; would help, but you need to use the += operator rather than the thread-unsafe sum = sum + offset;:
sum += offset;
A mutex around the update would also help.
Without these changes, the compiler can produce code which:
reads sum once at the beginning of the function, so that only one thread's changes are applied;
uses a stale value of sum for the addition;
sees incorrect state from the cache.
read optimization
The compiler can legitimately read the value of sum when the thread starts, add offset to it n times, and store the value at the end. This would mean effectively only one thread's work survives.
stale value
Consider the following assembly pseudo-code for a single increment:

read sum
add offset to sum
store sum

Now interleave the two threads:

   thread1              thread2
1  read sum
2  add offset to sum    read sum
3  store sum            add offset to sum
4  read sum             store sum
5  add offset to sum    read sum
6  store sum            add offset to sum

At step 3, thread 2 adds the offset to the stale value it read at step 2, so the store thread 1 performs at step 3 is lost: thread 2's store at step 4 overwrites it.
Incorrect state from cache
In multi-threaded systems, the cache may be inconsistent between the threads of a process.
That means that even after sum += offset has executed, another core/CPU may still see the pre-update value.
This allows the CPUs to run faster, as they can avoid sharing data between themselves. However, when 2 threads access the same data, this needs to be taken into account.
std::atomic / a mutex ensures that:
The value is modified atomically (as if sum = sum + offset were indivisible).
The value is visible consistently across the cores/CPUs.
The compiler doesn't re-order the loads/stores of sum as if it could not be changed by another thread.
You can end up with any result without synchronization, because the add operation is not atomic.
On the basic level
Your
sum=sum+offset;
is actually
fetch sum to register # tmp := sum
add offset # tmp := tmp + offset
store new value # sum := tmp
Now imagine 2 threads working simultaneously (sum starts at 1):

Thread1        Thread2        sum
tmp := 1       tmp := 1       1
tmp := 1+1     tmp := 1-1     1
-zzz-          sum := 0       0
sum := 2       -zzz-          2

In this series of computations, the result of the Thread 2 subtraction is lost.
If I change the timing a bit:

Thread1        Thread2        sum
sum := 2       -zzz-          2
-zzz-          sum := 0       0

then the Thread 1 addition is lost instead.
Add some optimizer
Now things get worse. If you do not synchronize, the compiler assumes that no race can happen (because the compiler always trusts you).
So it will skip the fetching and storing parts and just transform the code to
fetch sum to register # tmp := sum
add offset N times # for (i := 0; i < 2000000; i++) tmp := tmp + offset
store result # sum := tmp
or even
fetch sum to register # tmp := sum
add offset * N # tmp := tmp + 2000000 * offset
store tmp # sum := tmp
Now imagine two threads working simultaneously here.
Add some machine-dependent behaviour
The basic ideas are covered above, but the compiler is not the only thing to blame here: your platform itself can be, too. The caching mechanism allows faster data access, but if the caches are not kept in sync, different threads can read different values of the same variable.
You have no synchronization between the two threads that are concurrently modifying the global variable sum. You need a mutex around the code, or you need to use one of the platform-provided atomic increment/decrement functions.
When you fail to synchronize the threads properly, this code suffers from the 'lost update' problem. See this link about what Oracle terms Thread Interference: https://docs.oracle.com/javase/tutorial/essential/concurrency/interfere.html They're talking about Java, but the same holds true for C/C++. sum = sum + offset is not an atomic operation. Most platforms have operations to atomically update a variable, such as InterlockedIncrement on Windows and __sync_add_and_fetch() on Linux.
EDIT: This very program was also studied in detail in Anthony Williams's article "Avoiding the Perils of C++0x Data Races".

Why is my for loop of cilk_spawn doing better than my cilk_for loop?

I have
cilk_for (int i = 0; i < 100; i++)
x = fib(35);
the above takes 6.151 seconds
and
for (int i = 0; i < 100; i++)
x = cilk_spawn fib(35);
takes 5.703 seconds
The fib(x) is the horrible recursive Fibonacci function. If I dial down the fib argument, cilk_for does better than cilk_spawn, but it seems to me that regardless of how long fib(x) takes, cilk_for should do better than cilk_spawn.
What don't I understand?
Per comments, the issue was a missing cilk_sync. I'll expand on that to point out exactly how the ratio of time can be predicted with surprising accuracy.
On a system with P hardware threads (typically 8 on an i7), the for/cilk_spawn code will execute as follows:
The initial thread will execute the iteration for i=0, and leave a continuation that is stolen by some other thread.
Each thief will steal an iteration and leave a continuation for the next iteration.
When each thief finishes an iteration, it goes back to step 2, unless there are no more iterations to steal.
Thus the threads will execute the loop hand-over-hand, and the loop exits at a point where P-1 threads are still working on iterations. So the loop can be expected to finish after evaluating only 100 - (P-1) iterations.
So for 8 hardware threads, the for/cilk_spawn with missing cilk_sync should take about 93/100 of the time for the cilk_for, quite close to the observed ratio of about 5.703/6.151 = 0.927.
In contrast, in a "child steal" system such as TBB or PPL task_group, the loop will race to completion, generating 100 tasks, and then keep going until a call to task_group::wait. In that case, forgetting the synchronization would have led to a much more dramatic ratio of times.

Why does a for loop not reach the number set in the 2nd condition of the loop?

I don't understand why, in a for loop using an int, if the int is initialised to 0 and the loop is set to run while the int is less than some figure, incrementing it each time, then, if you output the int on each iteration, it never reaches the figure you told it to count up to.
I might not be phrasing myself very clearly here to best help you understand so please see the following link by way of an example:
http://codepad.org/doPC6kuI
The output in the link is 0,1,2,3,4. I realise it's 5 ints, and that int i begins at 0, but it is incremented before the output, so I don't understand why the first output is 0 and not 1, making it 1,2,3,4,5.
It's annoying me that I can't rationalise this or come up with a logical explanation.
The for loop in your example works as follows:
Initialise i = 0
Repeat while i < 5 is true:
Execute the body of the for loop
Increment i
The condition is checked at the beginning of each iteration and the increment happens at the end of each iteration.
So first i is initialised to 0, then it is checked that 0 < 5, which is fine, so we start the first iteration. At the end of the iteration, i gets incremented to 1, then it checks that 1 < 5 and we start the second iteration, and so on. At the end of the 5th iteration, i gets incremented to 5, and it checks that 5 < 5, which is not true any more, so no more iterations occur. So no iteration occurs in which i has value 5.
You said it yourself:
you set the loop to increment the int for integer less than a figure
If you do that, the loop will only run for values less than the limit, not the limit itself. The counter is incremented after each iteration, and before the condition is checked; so it's 0 for the first iteration, 1 for the second, and so on; then the condition stops the loop before the iteration in which it would be 5.
If you wanted to include the limit, then you'd loop while the counter is less than or equal to the limit.

performance difference in two almost same loop

I have two almost identical loops with a remarkable difference in performance. Both were tested with MSVC2010, on a system with ~2.4 GHz and 8 GB RAM.
The loop below takes around 2500 ms to execute:
for (double count = 0; count < ((2.9*4/555+3/9)*109070123123.8); count++)
;
And this loop executes in less than 1 ms:
for (double count = ((2.9*4/555+3/9)*109070123123.8); count >0; --count)
;
What is making such a huge difference here? One uses post-increment and the other pre-increment; can that result in such a huge difference?
You're compiling without optimizations, so the comparison is futile. (If you did have optimizations on, that code would just be cut out completely).
Without optimization, the computation is likely executed at each iteration in the first loop, whereas the second loop only does the computation once, when it first initializes count.
Try changing the first loop to
auto max = ((2.9*4/555+3/9)*109070123123.8);
for (double count = 0; count < max; count++)
;
and then stop profiling debug builds.
In the first loop, count < ((2.9*4/555+3/9)*109070123123.8) is computed every time round the loop, whereas in the second, count = ((2.9*4/555+3/9)*109070123123.8) is calculated once, and count is merely decremented each time round the loop.