Is there a better way to use C++ concurrency than below?

I am trying to learn the basics of threading, mutex, etc. Following the documentation and examples from here. In the below code I am getting the output as expected. Questions:
Want to confirm if there is any pitfall that I am missing? How can we improve the code below?
On which line does my thread try to take the mutex, or start waiting for it? (Is it after line 11 or on line 11?)
Is it ok to include cv.notify_all(); in the thread code?
#include <mutex>
#include <iostream>
#include <thread>
#include <condition_variable>

std::mutex mtx;
std::condition_variable cv;
int currentlyRequired = 1;

void work(int id){
    std::unique_lock<std::mutex> lock(mtx); // Line 11
    while(currentlyRequired != id){
        cv.wait(lock);
    }
    std::cout << "Thread # " << id << "\n";
    currentlyRequired++;
    cv.notify_all();
}

int main(){
    std::thread threads[10];
    for(int i = 0; i < 10; i++){
        threads[i] = std::thread(work, i + 1);
    }
    for(int i = 0; i < 10; i++){
        threads[i].join();
    }
    return 0;
}
/* Current Output:
Thread # 1
Thread # 2
Thread # 3
Thread # 4
Thread # 5
Thread # 6
Thread # 7
Thread # 8
Thread # 9
Thread # 10
Program ended with exit code: 0
*/

First and foremost I recommend you read the documentation here: cppreference.com
It's a little more discursive about the key points you need to conform to in order to use a condition variable for inter-thread waiting and notification.
I think your code measures up.
Your code obtains the lock on line 11, as you think. Other unique_lock constructors can adopt a mutex previously locked by the current thread, or associate a mutex not yet locked (by the current thread) and lock it later when requested (that would be lock.lock(); here).
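For illustration, the two constructor tags look like this (a minimal sketch; std::defer_lock and std::adopt_lock are the standard tags):

#include <mutex>

std::mutex m;

void defer_example() {
    std::unique_lock<std::mutex> lock(m, std::defer_lock); // associated but not locked
    lock.lock();                                           // locked when requested
}

void adopt_example() {
    m.lock();                                              // already locked by this thread
    std::unique_lock<std::mutex> lock(m, std::adopt_lock); // adopts ownership, no second lock
}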
What you have is right. You check the relevant data holding the lock.
wait() then unlocks the mutex (through the unique_lock) and awaits notification.
When notified it stops waiting, locks the mutex again and loops to check the condition.
Eventually the condition is true and (still holding the lock) each thread continues on to do its 'work'.
The 'waiting' side looks correct. The 'notifying' side also looks correct.
The data for the condition must be modified holding the mutex to ensure the correct synchronisation of checking the data and going into the waiting state that the condition-variable manages.
You correctly notify_all(). Even though logically (in this example) only one thread needs to wake up, there's no way of picking it out to be the target of notify_one().
So all the threads wake up (if suspended), check their condition and exactly one of them identifies its turn and runs.
Common wisdom (and my experience) is that it's better (and valid) to notify_all() while not holding the lock, because otherwise the waiting threads wake up only to block on the lock. But I'm told some platforms run better notifying under the lock. Welcome to the world of platform dependence...
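Applied to your work() that would look something like this (a sketch reusing your globals; the predicate overload of wait() also replaces the manual while loop):

void work(int id){
    {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [id]{ return currentlyRequired == id; }); // same as your while loop
        std::cout << "Thread # " << id << "\n";
        currentlyRequired++;
    }                // the unique_lock releases the mutex here
    cv.notify_all(); // notify without holding the lock
}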
So in terms of implementing a condition-variable I think that's valid and pretty much textbook.
It's good to see the join() as well. I have a bugbear about coders not joining their threads.
It's one of those things you get away with at small scale and load, but which can start to cause problems and confusion when the application scales up and experiences high load.
The problem I have with what you've done is that it has no parallelism.
What you've achieved is a daisy-chain. The very intention is to ensure only one thread is 'doing work' at once and they do so in strict order.
To take advantage of concurrency we want to maximise parallelism - the number of threads running 'the work' in parallel (i.e. not the housekeeping of inter-thread communication). Your code is literally (because you did it right!) guaranteed to ensure there is never more than one thread running, and the housekeeping makes it guaranteed to be slightly slower than a single-threaded application (which would be a for loop!).
So it gets top-marks for program correctness but no marks for being useful!
Both the examples on cplusplus.com and cppreference.com are little better in my view.
The best introductory example is some kind of producer consumer model.
That's closer to an actually useful pattern.
Try something as simple as one producer thread counting up through the integers while multiple consumer threads square them and output them.
A key point will be that if you're doing it right the squares won't in general come out in sequential order.
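A minimal sketch of that exercise might look like the below (the queue, the bound of 100, and the 4 consumers are all illustrative choices, not the only way to do it):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex qm, outm;
std::condition_variable qcv;
std::queue<int> q;
bool done = false;

void producer(){
    for (int i = 1; i <= 100; i++){
        { std::lock_guard<std::mutex> g(qm); q.push(i); }
        qcv.notify_one();
    }
    { std::lock_guard<std::mutex> g(qm); done = true; }
    qcv.notify_all(); // wake everyone so they can see 'done'
}

void consumer(){
    for (;;){
        std::unique_lock<std::mutex> lock(qm);
        qcv.wait(lock, []{ return !q.empty() || done; });
        if (q.empty()) return;      // producer finished and queue drained
        int n = q.front(); q.pop();
        lock.unlock();              // do the 'work' outside the queue lock
        std::lock_guard<std::mutex> g(outm);
        std::cout << n << " squared is " << n * n << "\n";
    }
}

int main(){
    std::thread p(producer);
    std::vector<std::thread> consumers;
    for (int i = 0; i < 4; i++) consumers.emplace_back(consumer);
    p.join();
    for (auto& c : consumers) c.join();
}

Run it a few times: the squares all arrive, but not in general in sequential order, which is the sign the consumers really are running in parallel.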
The examples on those sites are like teaching recursion with factorial. Sure it's recursive, but recursion is a terrible way to calculate factorial!
Your multi-threading, like theirs, is valid, but it's outright contrived to do nothing useful in parallel!
Please don't take that as a criticism. As a 'tooth cutting' exercise what you've got is top-class. The next task is to do something that uses concurrency usefully!
The problem is we don't want condition-variables! By definition they make threads wait for each other and that reduces parallelism. We need them and they do their job. But a design that reduces the amount of mutual waiting (either on locks or conditions) is usually better because the enemy here is waiting, blocking or (worst) spinning instead of suspended waiting/blocking.
Here's a better design for your task. Better because it entirely avoids the condition-variable!!
#include <mutex>
#include <iostream>
#include <thread>

std::mutex mtx;
int currentlyRequired = 1;

void work(int id){
    std::lock_guard<decltype(mtx)> guard(mtx);
    const auto old{currentlyRequired++};
    std::cout << "Thread # " << id << " " << old << "-> " << currentlyRequired << "\n";
}

int main(){
    std::thread threads[10];
    for(int i = 0; i < 10; i++){
        threads[i] = std::thread(work, i + 1);
    }
    for(int i = 0; i < 10; i++){
        threads[i].join();
    }
    std::cout << "Final result: " << currentlyRequired << std::endl;
    return 0;
}
Specimen output:
Thread # 7 1-> 2
Thread # 8 2-> 3
Thread # 9 3-> 4
Thread # 10 4-> 5
Thread # 6 5-> 6
Thread # 5 6-> 7
Thread # 4 7-> 8
Thread # 3 8-> 9
Thread # 2 9-> 10
Thread # 1 10-> 11
Final result: 11
Which thread does which increment will vary, but the final result is always 11.
It's still no good, though: there's still no parallelism, because the work is all done under a single lock. That's why we need a more interesting task.
Have fun.

Related

Why isn't my std::atomic<int> variable thread-safe?

I don't know why my code isn't thread-safe, as it outputs some inconsistent results.
value 48
value 49
value 50
value 54
value 51
value 52
value 53
My understanding of an atomic object is that it prevents its intermediate state from being exposed, so it should solve the problem when one thread is reading it while another thread is writing it.
I used to think I could use std::atomic without a mutex to solve the multi-threading counter-increment problem, but it doesn't look like that's the case.
I probably misunderstood what an atomic object is. Can someone explain?
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

void inc(std::atomic<int>& a)
{
    while (true) {
        a = a + 1;
        printf("value %d\n", a.load());
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}

int main()
{
    std::atomic<int> a(0);
    std::thread t1(inc, std::ref(a));
    std::thread t2(inc, std::ref(a));
    std::thread t3(inc, std::ref(a));
    std::thread t4(inc, std::ref(a));
    std::thread t5(inc, std::ref(a));
    std::thread t6(inc, std::ref(a));
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    t5.join();
    t6.join();
    return 0;
}
I used to think I could use std::atomic without a mutex to solve the multi-threading counter-increment problem, but it doesn't look like that's the case.
You can, just not the way you have coded it. You have to think about where the atomic accesses occur. Consider this line of code …
a = a + 1;
1. The value of a is fetched atomically. Let's say the value fetched is 50.
2. We add one to that value, getting 51.
3. We atomically store that value into a using the = operator.
4. a ends up being 51.
5. We atomically load the value of a by calling a.load().
6. We print the value we just loaded by calling printf().
So far so good. But between steps 1 and 3 some other threads may have changed the value of a - for example to the value 54. So, when step 3 stores 51 into a it overwrites the value 54 giving you the output you see.
As @Sopel and @Shawn suggest in the comments, you can atomically increment the value in a using one of the appropriate functions (like fetch_add) or operator overloads (like operator++ or operator+=). See the std::atomic documentation for details.
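For example, each of these performs the whole read-modify-write as a single atomic operation:

a.fetch_add(1);  // explicit atomic read-modify-write
++a;             // equivalent for std::atomic<int>
a += 1;          // also a single atomic RMW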
Update
I added steps 5 and 6 above. Those steps can also lead to results that may not look correct.
Between the store at step 3 and the call to a.load() at step 5, other threads can modify the contents of a. After our thread stores 51 in a at step 3, it may find that a.load() returns some different number at step 5. Thus the thread that set a to the value 51 may not pass the value 51 to printf().
Another source of problems is that nothing coordinates the execution of steps 5. and 6. between two threads. So, for example, imagine two threads X and Y running on a single processor. One possible execution order might be this …
1. Thread X executes steps 1 through 5 above, incrementing a from 50 to 51 and getting the value 51 back from a.load().
2. Thread Y executes steps 1 through 5 above, incrementing a from 51 to 52 and getting the value 52 back from a.load().
3. Thread Y executes printf(), sending 52 to the console.
4. Thread X executes printf(), sending 51 to the console.
We've now printed 52 on the console, followed by 51.
Finally, there's another problem lurking at step 6, because printf() doesn't make any promises about what happens if two threads call printf() at the same time (at least I don't think it does).
On a multiprocessor system threads X and Y above might call printf() at exactly the same moment (or within a few ticks of exactly the same moment) on two different processors. We can't make any prediction about which printf() output will appear first on the console.
Note The documentation for printf mentions a lock introduced in C++17 "… used to prevent data races when multiple threads read, write, position, or query the position of a stream." In the case of two threads simultaneously contending for that lock we still can't tell which one will win.
Besides the increment of a being done non-atomically, the fetch of the value to display after the increment is non-atomic with respect to the increment. It is possible that one of the other threads increments a after the current thread has incremented it but before the fetch of the value to display. This would possibly result in the same value being shown twice, with the previous value skipped.
Another issue here is that the threads do not necessarily run in the order they have been created. Thread 7 could execute its output before threads 4, 5, and 6, but after all four threads have incremented a. Since the thread that did the last increment displays its output earlier, you end up with the output not being sequential. This is more likely to happen on a system with fewer than six hardware threads available to run on.
Adding a small sleep between the various thread creates (e.g., sleep_for(10)) would make this less likely to occur, but would still not eliminate the possibility. The only sure way to keep the output ordered is to use some sort of exclusion (like a mutex) to ensure only one thread has access to the increment and output code, and treat both the increment and output code as a single transaction that must run together before another thread tries to do an increment.
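A sketch of that single-transaction approach (the mutex name is illustrative; with the mutex in place the counter wouldn't strictly need to be atomic, but it's kept to match the question):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex print_mutex;

void inc(std::atomic<int>& a)
{
    while (true) {
        {
            std::lock_guard<std::mutex> guard(print_mutex);
            int value = ++a;              // atomic increment
            printf("value %d\n", value);  // printed before anyone else can increment
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}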
The other answers point out the non-atomic increment and various problems. I mostly want to point out some interesting practical details about exactly what we see when running this code on a real system. (x86-64 Arch Linux, gcc9.1 -O3, i7-6700k 4c8t Skylake).
It can be useful to understand why certain bugs or design choices lead to certain behaviours, for troubleshooting / debugging.
Use int tmp = ++a; to capture the fetch_add result in a local variable instead of reloading it from the shared variable. (And as 1202ProgramAlarm says, you might want to treat the whole increment and print as an atomic transaction if you insist on having your counts printed in order as well as being done properly.)
Or you might want to have each thread record the values it saw in a private data structure to be printed later, instead of also serializing threads with printf during the increments. (In practice all trying to increment the same atomic variable will serialize them waiting for access to the cache line; ++a will go in order so you can tell from the modification order which thread went in which order.)
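The first suggestion looks like this in the question's inc() (a sketch):

void inc(std::atomic<int>& a)
{
    while (true) {
        int tmp = ++a;              // atomic RMW; tmp is this thread's own result
        printf("value %d\n", tmp);  // may still print out of order, but no count is lost
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}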
Fun fact: a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_release) is what you might do for a variable that was only written by 1 thread, but read by multiple threads. You don't need an atomic RMW because no other thread ever modifies it. You just need a thread-safe way to publish updates. (Or better, in a loop keep a local counter and just .store() it without loading from the shared variable.)
If you used the default a = ... for a sequentially-consistent store, you might as well have done an atomic RMW on x86: a seq-cst store compiles to an atomic xchg (or mov + mfence, which is as expensive or more).
What's interesting is that despite the massive problems with your code, no counts were lost or stepped on (no duplicate counts), merely printing reordered. So in practice the danger wasn't encountered because of other effects going on.
I tried it on my own machine and did lose some counts. But after removing the sleep, I just got reordering. (I copy-pasted about 1000 lines of the output into a file, and sort -u to uniquify the output didn't change the line count. It did move some late prints around though; presumably one thread got stalled for a while.) My testing didn't check for the possibility of lost counts, skipped by not saving the value being stored into a, and instead reloading it. I'm not sure there's a plausible way for that to happen here without multiple threads reading the same count, which would be detected.
Store + reload, even a seq-cst store which has to flush the store buffer before it can reload, is very fast compared to printf making a write() system call. (The format string includes a newline and I didn't redirect output to a file so stdout is line-buffered and can't just append the string to a buffer.)
(write() system calls on the same file descriptor are serializing in POSIX: write(2) is atomic. Also, printf(3) itself is thread-safe on GNU/Linux, as required by C++17, and probably by POSIX long before that.)
Stdio locking in printf happens to be enough serialization in almost all cases: the thread that just unlocked stdout and left printf can do the atomic increment and then try to take the stdout lock again.
The other threads were all blocked trying to take the lock on stdout. One (other?) thread can wake up and take the lock on stdout, but for its increment to race with the other thread it would have to enter and leave printf and load a the first time before that other thread commits its a = ... seq-cst store.
This does not mean it's actually safe
Just that testing this specific version of the program (at least on x86) doesn't easily reveal the lack of safety. Interrupts or scheduling variations, including competition from other things running on the same machine, certainly could block a thread at just the wrong time.
My desktop has 8 logical cores so there were enough for every thread to get one, not having to get descheduled. (Although normally that would tend to happen on I/O or when waiting on a lock anyway).
With the sleep there, it is not unlikely for multiple threads to wake up at nearly the same time and race with each other in practice on real x86 hardware. It's so long that timer granularity becomes a factor, I think. Or something like that.
Redirecting output to a file
With stdout open on a non-TTY file, it's full-buffered instead of line-buffered, and doesn't always make a system call while holding the stdout lock.
(I got a 17MiB file in /tmp from hitting control-C a fraction of a second after running ./a.out > output.)
This makes it fast enough for threads to actually race with each other in practice, showing the expected bugs of duplicate values. (A thread reads a but loses ownership of the cache line before it stores (tmp)+1, resulting in two or more threads doing the same increment. And/or multiple threads reading the same value when they reload a after flushing their store buffer.)
1228589 unique lines (sort -u | wc) out of 1291035 total lines, so ~5% of the output lines were duplicates.
I didn't check if it was usually one value duplicated multiple times or if it was usually only one duplicate. Or how far backward the value ever jumped. If a thread happened to be stalled by an interrupt handler after loading but before storing val+1, it could be quite far. Or if it actually slept or blocked for some reason, it could rewind indefinitely far.

Odd thread behaviors

The following code normally prints BA, but sometimes it can print BBAA, BAAB, ... How is it possible to get two As or Bs in a row?! This code never prints three As or Bs in a row, though. Both functions (produce and consume) are run on a lot of threads. Many thanks in advance.
#include <pthread.h>
#include <cstdio>

pthread_mutex_t mr1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mr2 = PTHREAD_MUTEX_INITIALIZER;
int permission;

void set_permission(int v) {
    permission = v;
    printf("%c", v + 'A');
    fflush(stdout); // the original had fflush(stdin), which is undefined behaviour
}

void* produce(void*) {
    for (;;) {
        pthread_mutex_lock(&mr1);
        set_permission(1);
        while (permission == 1);
        pthread_mutex_unlock(&mr1);
    }
}

void* consume(void*) {
    for (;;) {
        pthread_mutex_lock(&mr2);
        while (permission == 0);
        set_permission(0);
        pthread_mutex_unlock(&mr2);
    }
}
Your threads are not synchronized, as they are not using the same mutex.
The other thread can, by chance, manage only to set permission to 1 or 0 without managing to produce any output yet, in which case it appears as if the first thread ran two full rounds.
The write by the corresponding thread can also get entirely lost when the memory content is synchronised between cores and both threads wrote. The mutex also prevents this from happening, because it establishes a strict memory access order which, to put it simply, guarantees that everything which has happened under the protection of one mutex is fully visible to the next user of the same mutex.
Printing the same character 3 or more times would be very unlikely, as there is at most one write happening in between, so at most one lost write or one out-of-order output. That's not guaranteed though.
If you are working on a system with no implicit memory synchronisation at all, your code could also just outright deadlock, as the writes done under one mutex never propagate to the users of the other one. (That doesn't actually happen here because there is still some synchronisation introduced by the IO operations.)
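One conventional fix, for illustration, is a single shared mutex plus a condition variable instead of spinning (a sketch reusing set_permission() from the question; the broadcast wakes every waiter so each can re-check its own condition):

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

void* produce(void*) {
    for (;;) {
        pthread_mutex_lock(&m);
        while (permission == 1)            // wait until the consumer has taken its turn
            pthread_cond_wait(&cond, &m);  // atomically unlocks m and sleeps
        set_permission(1);                 // prints 'B'
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&m);
    }
}

void* consume(void*) {
    for (;;) {
        pthread_mutex_lock(&m);
        while (permission == 0)
            pthread_cond_wait(&cond, &m);
        set_permission(0);                 // prints 'A'
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&m);
    }
}

Because every read and write of permission now happens under the same mutex, the output alternates strictly: BABA...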

EnterSynchronizationBarrier hangs in Windows 8

I tried to use new API for synchronization barriers from Windows 8, but the following simple code sometimes hangs in Windows 8:
#undef WINVER
#define WINVER 0x0603
#include "windows.h"
#include <thread>
#include <vector>

int main()
{
    SYNCHRONIZATION_BARRIER barrier;
    int count = 32;
    InitializeSynchronizationBarrier(&barrier, count, -1);
    std::vector<std::thread> threads;
    for (int thr_num = 0; thr_num < count; thr_num++)
    {
        threads.emplace_back([&barrier] // the lambda must capture the barrier by reference
        {
            for (int i = 0; i < 100000; i++)
                EnterSynchronizationBarrier(&barrier, 0);
        });
    }
    for (auto &thr : threads)
        thr.join();
    return 0;
}
Tested on Windows 8.1 64-bit on a 32-core dual-Xeon E5 2630. It hangs roughly one time out of ten launches.
It seems that on Windows 10 it works normally (on another machine). Is this a bug in Windows 8 that got fixed, or is this not a correct usage of EnterSynchronizationBarrier (maybe you can't call it in a loop?)? There's not much information about this function; has anybody even used it?
Not that it matters years later, except perhaps to show that some problems are too obscure for Stack Overflow to deliver close attention in useful time, but your usage is correct, if extreme, and the stress you have put the called function under does look to have exposed its problems with memory barriers.
In your fragment, a synchronisation barrier is prepared for 32 threads and you create 32 threads which each proceed to 100000 phases of synchronised work. All 32 reach their call number N to EnterSynchronizationBarrier before all are released on their way to their call number N+1. It should work. It likely would if your phases had any substance.
The stress is that each phase between calls is just however few instructions are involved in looping back to repeat the call. While the last thread to end phase N is in its call, it signals the others to leave, and they have a good chance of leaving (and even of reentering the function to end their phase N+1) while the thread that ends phase N is still doing its internal bookkeeping.
In this bookkeeping are two counters. One, named Barrier according to Microsoft's symbol files, is decremented as threads enter the synchronisation barrier. The other, named LeftBarrier, is incremented as they leave it. The thread that ends a phase resets Barrier from LeftBarrier (which should be the count of all participating threads) and resets LeftBarrier to 1. Or so it goes as a simplification.
The complicated reality is that the Barrier count is overloaded: its high bit signifies the change of phase. If a thread that waits at the synchronisation barrier is spinning rather than blocking on an event, then what it checks for while spinning is whether the high bit in Barrier has changed. It therefore really matters exactly how the counters get reset in the ending thread's bookkeeping. The sequence is: read LeftBarrier; write LeftBarrier as 1; write Barrier as the old LeftBarrier with the high bit toggled.
What I think happens is that without a memory barrier, the Barrier count can be written before LeftBarrier, but because Barrier has a toggled high bit, a spinning thread comes out of its spin and increments LeftBarrier from another processor before the first resets it to 1. The increment gets lost, after which all bets are off because subsequent phases will find that LeftBarrier at the end of a phase is no longer the count of participating threads.
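In code form, the described bookkeeping and its hazard are roughly the following (a reconstruction from the symbol names above, not actual Windows source):

#include <windows.h>

volatile LONG Barrier;      // high bit doubles as the phase flag
volatile LONG LeftBarrier;  // incremented by each thread as it leaves

void end_phase_bookkeeping()
{
    LONG left = LeftBarrier;       // should equal the participant count
    LeftBarrier = 1;               // (a) reset; the ending thread counts itself
    Barrier = left ^ 0x80000000L;  // (b) the toggled high bit releases the spinners
    // Without a memory barrier between (a) and (b), (b) can become visible
    // first; a spinner then leaves and increments LeftBarrier before the
    // reset at (a) lands, and its increment is lost.
}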
Windows 8 and 8.1 have no memory barrier here. Windows 10 does, though I believe it's in the wrong place and that Windows Vista and Windows 7 had it correctly between the two writes. The implementation was anyway reworked completely for Version 1607 so that it now uses the WaitOnAddress functionality, much as sketched by a later Raymond Chen blog than the one cited by one of your correspondents. At the time of the cited blog, Microsoft, though possibly not Raymond, surely knew of the function's two earlier code changes regarding memory barriers.

how to solve busy waiting for many running threads

I have written a multithreaded program:
#include <Windows.h>
#include <process.h>
#include <stdio.h>
#include <fstream>
#include <iostream>
using namespace std;

ofstream myfile;
BYTE lockmem = 0x0;

unsigned int __stdcall mythreadA(void* data)
{
    __asm
    {
        mov DL, 0xFF
    mutex:
        mov AL, 0x0
        LOCK CMPXCHG lockmem, DL
        jnz mutex
    }
    // Enter Critical Section
    for (int i = 0; i < 100000; i++)
    {
        myfile << "." << i << endl;
    }
    // Exit Critical Section
    __asm
    {
        lock not lockmem
    }
    return 0;
}

int main(int argc, char* argv[])
{
    myfile.open("report.txt");
    HANDLE myhandleA[100]; // sized to match the loop below (the original declared only 10)
    for (int index = 0; index < 100; index++)
    {
        myhandleA[index] = (HANDLE)_beginthreadex(0, 0, &mythreadA, 0, 0, 0);
    }
    getchar();
    myfile.close();
    return 0;
}
At the critical section I wrote inline assembly to ensure that only one thread is in the critical section. (In this program I don't want to use APIs or library functions to enforce that, so I use inline assembly.) Now I have a busy-waiting problem: after one thread enters the critical section, the other threads spin in the loop before the critical section, so the CPU usage goes up and up! I am searching for ways to solve the busy-waiting problem. (I prefer assembly instructions over APIs and other functions, but I also want to know about them.)
What you are doing is basically called a spinlock, and it should not be used for long operations. Draining CPU time as you described is the expected result.
You may, however, build a mutex or a futex (fast user-space mutex) based on a spinlock and a condvar/event.
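For illustration, here is the same spinlock pattern in portable C++ (a sketch; the inline assembly above is doing essentially this test_and_set loop, and yielding is the cheapest mitigation short of a blocking kernel wait):

#include <atomic>
#include <thread>

std::atomic_flag lockmem = ATOMIC_FLAG_INIT;

void lock() {
    while (lockmem.test_and_set(std::memory_order_acquire)) // like the LOCK CMPXCHG loop
        std::this_thread::yield();  // give the timeslice away instead of burning CPU
}

void unlock() {
    lockmem.clear(std::memory_order_release);  // like the 'lock not lockmem'
}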
You can use a kernel API call that blocks the threads that must wait, or you can waste CPU cycles and memory bandwidth keeping your fans at full speed and your office warm.
There isn't any other choice.
If I understand your question and program correctly, there are a couple of problems in your program, as others have mentioned above.
Your actual critical-section code is terribly slow, as you are writing numbers on new lines of your file 100000 times. File I/O is very, very slow, and in your case that is all your thread function does. I am not very familiar with the assembly instructions used in your program, but it looks like this locking mechanism creates busy-waiting code for the remaining threads that have yet to be scheduled for execution. As mentioned above, you should use the EnterCriticalSection() and LeaveCriticalSection() APIs provided on Microsoft systems. These APIs internally take care of everything (which cannot in general be achieved with our own logic), so you would not have CPU spikes while waiting.
If you still want to do something with your current logic, you should use/implement some form of sleep() in the waiting loop. This would avoid the busy-CPU scenario caused by continuously checking a flag.
Calvin has also mentioned that you should try to distribute the locks instead of using one central lock. And we should remember one thing: if our task is simple and can be done easily with a conventional single-threaded approach, we should not go for a multi-threaded solution.
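For reference, the EnterCriticalSection() route mentioned above looks like this in use (a sketch; the waiting thread is suspended by the OS instead of spinning):

#include <Windows.h>

CRITICAL_SECTION cs;  // call InitializeCriticalSection(&cs); once, e.g. at the top of main

unsigned int __stdcall mythreadA(void* data)
{
    EnterCriticalSection(&cs);   // blocks without busy-waiting until the section is free
    // ... critical section: write to the file ...
    LeaveCriticalSection(&cs);
    return 0;
}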

Timing overhead binary semaphore vs mutex

I have run a sample C++ program on the vxWorks platform to test the timing difference between a mutex and a binary semaphore. The below program is the prototype:
SEM_ID semMutex;
UINT ITER = 10000;

taskIdOne = TASKSPAWN("t1", TASK_PRIORITY_2, 0, 8192, 0, (FUNCPTR)myMutexMethod, 0, 0);
taskIdTwo = TASKSPAWN("t2", TASK_PRIORITY_2, 0, 8192, 0, (FUNCPTR)myMutexMethod, 0, 0);

void myMutexMethod(void)
{
    int i;
    VKI_PRINTF("I'm (%s)\n", TASKNAME(0));
    myMutexTimer.start();
    for (i = 0; i < ITER; i++)
    {
        MUTEX_LOCK(semMutex, WAIT_FOREVER);
        ++global;
        MUTEX_UNLOCK(semMutex);
    }
    myMutexTimer.stop();
    myMutexTimer.show();
}
In the above program there is contention (2 tasks are trying to get the mutex). My timer printed 37.43 ms for the above program. With the same prototype, the binary semaphore program took just 2.8 ms. This is understood, because a binary semaphore is lightweight and does not have many of a mutex's features (priority inversion safety, ownership, etc.).
However, I then removed one task and ran the above program (without contention). Since there is no contention, the task t1 just gets the mutex, executes the critical section and then releases the mutex. Same with the binary semaphore.
For the timings, I got 3.35 ms for the mutex and 4 ms for the binary semaphore.
I'm surprised to see the mutex is faster than the binary semaphore when there is no contention.
Is this expected? Or am I missing something?
Any help is appreciated!
The mutex is probably faster in this case due to the fact that the same task is taking it over and over again with no other task getting involved. My guess is that mutex code is taking a shortcut to enable recursive mutex calls (i.e. the same task takes the same mutex twice). Even though your code is not technically a recursive mutex take, the code probably uses the same shortcut due to the fact that the semaphore owner was not overwritten by any other task taking the semaphore.
In other words you do:
1) semTake(semMutex)
2) ++global;
3) semGive(semMutex) // sem owner flag is not changed
4) semTake(semMutex) // from same task as previous semTake
...
Then in step 4 the semTake sees that sem owner == current task id (because the sem owner was set in step 1 and never changed to anything else), so it just marks the semaphore as taken and quickly jumps out.
Of course this is a guess; a quick look at the source code and some vxWorks shell breakpoints could confirm it, something I am unable to do because I no longer have access to vxWorks.
Additionally, look at the semMLib docs for some documentation on the recursive use of mutexes.