I am new to multithreading. While writing multithreaded code in C++11 using a condition variable, I use the following construct
while (predicate) {
    cond_var.wait(lock);
}
However, I have been reading Deitel's Operating Systems, third edition (ch. 6), where the following construct is used instead
if (predicate) {
    cond_var.wait(lock);
}
So, what's the difference? Why isn't the book using while? Aren't spurious wakeups an issue?
Spurious wakeup is always a potential issue. For example, look at the answers here: Do spurious wakeups actually happen? Perhaps Deitel's code is part of a larger loop that helps them deal with the spurious wakeup, or maybe it's just a typo.
In any case, there's never a (good) reason not to use your construct, and in fact the wait function has a variant that does it for you (http://en.cppreference.com/w/cpp/thread/condition_variable/wait).
template< class Predicate >
void wait( std::unique_lock<std::mutex>& lock, Predicate pred );
which is equivalent to:
while (!pred()) {
    wait(lock);
}
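For example, with the predicate overload the explicit loop disappears from your code entirely (m and data_ready are illustrative names here, not part of the question; data_ready is a shared flag guarded by m):

std::unique_lock<std::mutex> lock(m);
cond_var.wait(lock, []{ return data_ready; });   // re-checks data_ready on every wakeup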
People seem to be dealing with spurious wakeups exclusively, but there is a more fundamental reason why a while or an if is to be used in monitor procedures.
We would have to choose one or the other even if there were no spurious wakeups because monitor implementations may choose from a number of different signaling disciplines.
The following paper describes these:
John H. Howard. 1976. Signaling in monitors. In Proceedings of the 2nd international conference on Software engineering (ICSE '76). IEEE Computer Society Press, Los Alamitos, CA, USA, 47-52.
The point is that a monitor can be used by at most one process at a time, and there is a conflict when a waiting process is being woken up (signaled) by another process from inside the monitor. The problem is: which process may continue executing inside the monitor?
There are a number of different disciplines. The one originally proposed is the so-called signal-and-wait, where the signaled process continues immediately (the signaler has to wait). Using this discipline the
if (predicate) {
    cond_var.wait(lock);
}
form can be used, because the predicate must be true after waiting (provided it was true at the time of signaling).
Another discipline is signal-and-continue, where the signaling process continues and the signaled process is put into an entry queue of the monitor. Using this discipline necessitates the
while (predicate) {
    cond_var.wait(lock);
}
form, because the predicate can be invalidated by the time the signaled process gets a chance to execute, so it has to retest the condition.
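To make this concrete for std::condition_variable, which follows signal-and-continue and additionally permits spurious wakeups, here is a minimal consumer sketch (the queue and all names are illustrative, not from the question); the while re-test is what keeps it correct:

#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex m;
std::condition_variable cond_var;
std::queue<int> q;   // shared state, guarded by m

int consume()
{
    std::unique_lock<std::mutex> lock(m);
    while (q.empty()) {          // re-test after every wakeup: another consumer
        cond_var.wait(lock);     // may have emptied the queue before we ran
    }
    int v = q.front();
    q.pop();
    return v;
}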
I don't know why my code isn't thread-safe, as it outputs some inconsistent results.
value 48
value 49
value 50
value 54
value 51
value 52
value 53
My understanding of an atomic object is that it prevents its intermediate state from being exposed, so it should solve the problem when one thread is reading it while another thread is writing it.
I used to think I could use std::atomic, without a mutex, to solve the multithreaded counter-increment problem, but that doesn't seem to be the case.
I probably misunderstood what an atomic object is. Can someone explain?
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

void inc(std::atomic<int>& a)
{
    while (true) {
        a = a + 1;
        printf("value %d\n", a.load());
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}
int main()
{
    std::atomic<int> a(0);
    std::thread t1(inc, std::ref(a));
    std::thread t2(inc, std::ref(a));
    std::thread t3(inc, std::ref(a));
    std::thread t4(inc, std::ref(a));
    std::thread t5(inc, std::ref(a));
    std::thread t6(inc, std::ref(a));
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    t5.join();
    t6.join();
    return 0;
}
I used to think I could use std::atomic without a mutex to solve the multi-threading counter increment problem, and it didn't look like the case.
You can, just not the way you have coded it. You have to think about where the atomic accesses occur. Consider this line of code …
a = a + 1;
1. First the value of a is fetched atomically. Let's say the value fetched is 50.
2. We add one to that value, getting 51.
3. Finally we atomically store that value into a using the = operator.
4. a ends up being 51.
5. We atomically load the value of a by calling a.load().
6. We print the value we just loaded by calling printf().
So far so good. But between steps 1 and 3 some other threads may have changed the value of a - for example to the value 54. So, when step 3 stores 51 into a it overwrites the value 54 giving you the output you see.
As @Sopel and @Shawn suggest in the comments, you can atomically increment the value in a using one of the appropriate functions (like fetch_add) or operator overloads (like operator++ or operator+=). See the std::atomic documentation for details.
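For instance, a sketch of the loop with the increment done as a single atomic read-modify-write; capturing the returned value also guarantees that the number printed is the one this thread produced (the printed lines can still interleave out of order, for the reasons discussed below):

void inc(std::atomic<int>& a)
{
    while (true) {
        int value = ++a;   // atomic RMW, equivalent to a.fetch_add(1) + 1
        printf("value %d\n", value);
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}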
Update
I added steps 5 and 6 above. Those steps can also lead to results that may not look correct.
Between the store at step 3 and the call to a.load() at step 5, other threads can modify the contents of a. After our thread stores 51 into a at step 3, it may find that a.load() returns some different number at step 5. Thus the thread that set a to the value 51 may not pass the value 51 to printf().
Another source of problems is that nothing coordinates the execution of steps 5 and 6 between two threads. So, for example, imagine two threads X and Y running on a single processor. One possible execution order might be this:
1. Thread X executes steps 1 through 5 above, incrementing a from 50 to 51 and getting the value 51 back from a.load().
2. Thread Y executes steps 1 through 5 above, incrementing a from 51 to 52 and getting the value 52 back from a.load().
3. Thread Y executes printf(), sending 52 to the console.
4. Thread X executes printf(), sending 51 to the console.
We've now printed 52 on the console, followed by 51.
Finally, there's another problem lurking at step 6, because printf() doesn't make any promises about what happens if two threads call printf() at the same time (at least I don't think it does).
On a multiprocessor system threads X and Y above might call printf() at exactly the same moment (or within a few ticks of exactly the same moment) on two different processors. We can't make any prediction about which printf() output will appear first on the console.
Note: The documentation for printf mentions a lock introduced in C++17, "… used to prevent data races when multiple threads read, write, position, or query the position of a stream." In the case of two threads simultaneously contending for that lock, we still can't tell which one will win.
Besides the increment of a being done non-atomically, the fetch of the value to display after the increment is non-atomic with respect to the increment. It is possible that one of the other threads increments a after the current thread has incremented it but before the fetch of the value to display. This would possibly result in the same value being shown twice, with the previous value skipped.
Another issue here is that the threads do not necessarily run in the order they were created. Thread 6 could execute its output before threads 3, 4, and 5, but after all four threads have incremented a. Since the thread that did the last increment displays its output earlier, you end up with the output not being sequential. This is more likely to happen on a system with fewer than six hardware threads available to run on.
Adding a small sleep between the various thread creations (e.g., sleep_for(10)) would make this less likely to occur, but would still not eliminate the possibility. The only sure way to keep the output ordered is to use some sort of exclusion (like a mutex) so that only one thread at a time executes the increment and the output, treating both as a single transaction that must finish before another thread tries to increment.
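A sketch of that approach (io_mutex is an illustrative name): the increment and the print happen under one lock, so no other thread can increment between them:

std::mutex io_mutex;   // guards the counter update and the output together

void inc(std::atomic<int>& a)
{
    while (true) {
        {
            std::lock_guard<std::mutex> guard(io_mutex);
            printf("value %d\n", ++a);   // increment + output as one transaction
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}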
The other answers point out the non-atomic increment and various problems. I mostly want to point out some interesting practical details about exactly what we see when running this code on a real system. (x86-64 Arch Linux, gcc9.1 -O3, i7-6700k 4c8t Skylake).
It can be useful to understand why certain bugs or design choices lead to certain behaviours, for troubleshooting / debugging.
Use int tmp = ++a; to capture the fetch_add result in a local variable instead of reloading it from the shared variable. (And as 1202ProgramAlarm says, you might want to treat the whole increment and print as an atomic transaction if you insist on having your counts printed in order as well as being done properly.)
Or you might want to have each thread record the values it saw in a private data structure to be printed later, instead of also serializing the threads with printf during the increments. (In practice, having them all increment the same atomic variable will serialize them anyway, waiting for access to the cache line; ++a will go in order, so you can tell from the modification order which thread went when.)
Fun fact: a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_release) is what you might do for a variable that is only written by one thread but read by multiple threads. You don't need an atomic RMW because no other thread ever modifies it; you just need a thread-safe way to publish updates. (Or better, keep a local counter in a loop and just .store() it, without loading from the shared variable.)
If you used the default a = ... for a sequentially-consistent store, you might as well have done an atomic RMW on x86: the good ways to compile a seq-cst store are an atomic xchg, or mov plus mfence, which is as expensive (or more).
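A sketch of that single-writer pattern (do_work is a hypothetical stand-in for whatever produces the next value):

std::atomic<int> progress(0);   // written by one thread, read by many

void producer()
{
    for (int local = 1; ; ++local) {   // private counter stays in a register
        do_work();                     // hypothetical work function
        progress.store(local, std::memory_order_release);   // publish; no RMW needed
    }
}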
What's interesting is that despite the massive problems with your code, no counts were lost or stepped on (no duplicate counts), merely printing reordered. So in practice the danger wasn't encountered because of other effects going on.
I tried it on my own machine and did lose some counts. But after removing the sleep, I just got reordering. (I copy-pasted about 1000 lines of the output into a file, and sort -u to uniquify the output didn't change the line count. It did move some late prints around though; presumably one thread got stalled for a while.) My testing didn't check for the possibility of lost counts, skipped by not saving the value being stored into a, and instead reloading it. I'm not sure there's a plausible way for that to happen here without multiple threads reading the same count, which would be detected.
Store + reload, even a seq-cst store which has to flush the store buffer before it can reload, is very fast compared to printf making a write() system call. (The format string includes a newline and I didn't redirect output to a file so stdout is line-buffered and can't just append the string to a buffer.)
(write() system calls on the same file descriptor are serializing in POSIX: write(2) is atomic. Also, printf(3) itself is thread-safe on GNU/Linux, as required by C++17, and probably by POSIX long before that.)
Stdio locking in printf happens to be enough serialization in almost all cases: the thread that just unlocked stdout and left printf can do the atomic increment and then try to take the stdout lock again.
The other threads were all blocked trying to take the lock on stdout. One (other?) thread can wake up and take the lock on stdout, but for its increment to race with the other thread it would have to enter and leave printf and load a the first time before that other thread commits its a = ... seq-cst store.
This does not mean it's actually safe
Just that testing this specific version of the program (at least on x86) doesn't easily reveal the lack of safety. Interrupts or scheduling variations, including competition from other things running on the same machine, certainly could block a thread at just the wrong time.
My desktop has 8 logical cores so there were enough for every thread to get one, not having to get descheduled. (Although normally that would tend to happen on I/O or when waiting on a lock anyway).
With the sleep there, it is not unlikely for multiple threads to wake up at nearly the same time and race with each other in practice on real x86 hardware. The sleep is so long that timer granularity becomes a factor, I think. Or something like that.
Redirecting output to a file
With stdout open on a non-TTY file, it's full-buffered instead of line-buffered, and doesn't always make a system call while holding the stdout lock.
(I got a 17MiB file in /tmp from hitting control-C a fraction of a second after running ./a.out > output.)
This makes it fast enough for threads to actually race with each other in practice, showing the expected bugs of duplicate values. (A thread reads a but loses ownership of the cache line before it stores (tmp)+1, resulting in two or more threads doing the same increment. And/or multiple threads reading the same value when they reload a after flushing their store buffer.)
1228589 unique lines (sort -u | wc) out of 1291035 total lines, so ~5% of the output lines were duplicates.
I didn't check if it was usually one value duplicated multiple times or if it was usually only one duplicate. Or how far backward the value ever jumped. If a thread happened to be stalled by an interrupt handler after loading but before storing val+1, it could be quite far. Or if it actually slept or blocked for some reason, it could rewind indefinitely far.
I'm trying to demonstrate that it's a very bad idea not to use std::atomic<>s, but I can't manage to create an example that reproduces the failure. I have two threads, and one of them does:
{
    foobar = false;
}
and the other:
{
    if (foobar) {
        // ...
    }
}
the type of foobar is either bool or std::atomic_bool, and it's initialized to true. I'm using OS X Yosemite and even tried to use this trick to hint via CPU affinity that I want the threads to run on different cores. I run such operations in loops etc., and in any case there's no observable difference in execution. I ended up inspecting the generated assembly with clang (clang -std=c++11 -lstdc++ -O3 -S test.cpp), and I see that the asm differences on the read side are minor (without atomic on the left, with atomic on the right):
[assembly listings omitted]
No mfence or anything that "dramatic". On the write side, something more "dramatic" happens:
[assembly listings omitted]
As you can see, the atomic<> version uses xchgb, which has an implicit lock prefix. When I compile with a relatively old version of gcc (v4.5.2), I can see all sorts of mfences being added, which also indicates there's a serious concern.
I kind of understand that "x86 implements a very strong memory model" (ref), and that mfences might not be necessary, but does this mean that, unless I want to write cross-platform code that e.g. supports ARM, I don't really need to put in any atomic<>s unless I care about consistency at the nanosecond level?
I've watched "atomic<> Weapons" by Herb Sutter, but I'm still impressed by how difficult it is to create a simple example that reproduces those problems.
The big problem with data races is that they're undefined behavior, not guaranteed wrong behavior. And this, in conjunction with the general unpredictability of threads and the strength of the x64 memory model, means that it gets really hard to create reproducible failures.
A slightly more reliable failure mode is when the optimizer does unexpected things, because you can observe those in the assembly. Of course, the optimizer is notoriously finicky as well and might do something completely different if you change just one code line.
Here's an example failure that we had in our code at one point. The code implemented a sort of spin lock, but didn't use atomics.
bool operation_done;

void thread1() {
    while (!operation_done) {
        sleep();
    }
    // do something that depends on operation being done
}

void thread2() {
    // do the operation
    operation_done = true;
}
This worked fine in debug mode, but the release build got stuck. Debugging showed that execution of thread1 never left the loop, and looking at the assembly, we found that the condition was gone; the loop was simply infinite.
The problem was that the optimizer realized that under its memory model, operation_done could not possibly change within the loop (that would have been a data race), and thus it "knew" that once the condition was true once, it would be true forever.
Changing the type of operation_done to atomic_bool (or actually, a pre-C++11 compiler-specific equivalent) fixed the issue.
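In C++11 terms the fix looks like this sketch, keeping the structure of the snippet above (sleep() stands in for whatever wait the original used):

#include <atomic>

std::atomic<bool> operation_done(false);

void thread1() {
    while (!operation_done.load()) {   // an atomic load cannot be hoisted out of the loop
        sleep();
    }
    // do something that depends on the operation being done
}

void thread2() {
    // do the operation
    operation_done.store(true);
}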
This is my own version of @Sebastian Redl's answer that fits the question more closely. I will still accept his for credit, plus kudos to @HansPassant for his comment, which brought my attention back to writes and made everything clear: as soon as I observed that the compiler was adding synchronization on writes, the problem turned out to be that it wasn't optimizing the bool as much as one would expect.
I was able to write a trivial program that reproduces the problem:
#include <atomic>
#include <iostream>
#include <unistd.h>

std::atomic_bool foobar(true);
//bool foobar = true;
long long cnt = 0;
long long loops = 400000000ll;

void thread_1() {
    usleep(200000);
    foobar = false;
}

void thread_2() {
    while (loops--) {
        if (foobar) {
            ++cnt;
        }
    }
    std::cout << cnt << std::endl;
}
The main difference from my original code was that I used to have a usleep() inside the while loop. That was enough to prevent any optimizations within the loop. The cleaner code above yields the same asm for the write:
[assembly listings omitted]
but quite different asm for the read:
[assembly listings omitted]
We can see that in the bool case (left) clang brought the if (foobar) outside the loop. Thus when I run the bool case I get:
400000000
real 0m1.044s
user 0m1.032s
sys 0m0.005s
while when I run the atomic_bool case I get:
95393578
real 0m0.420s
user 0m0.414s
sys 0m0.003s
It's interesting that the atomic_bool case is faster; I guess that's because it does just 95 million increments of the counter, as opposed to 400 million in the bool case.
What is even more crazy-interesting, though, is this. If I move the std::cout << cnt << std::endl; out of the threaded code, to after pthread_join(), the loop in the non-atomic case becomes just this:
[assembly listing omitted]
i.e. there's no loop. It's just if (foobar != 0) cnt = loops;! Clever clang. Then the execution yields:
400000000
real 0m0.206s
user 0m0.001s
sys 0m0.002s
while the atomic_bool remains the same.
So, more than enough evidence that we should use atomics. The only thing to remember: don't put any usleep() in your benchmarks, because even if it's small, it will prevent quite a few compiler optimizations.
In general, it is very rare that the use of atomic types actually does anything useful for you in multithreaded situations. It is more useful to implement things like mutexes, semaphores and so on.
One reason why it's not very useful: As soon as you have two values that both need to be changed in an atomic way, you are absolutely stuck. You can't do it with atomic values. And it's quite rare that I want to change a single value in an atomic way.
In iOS and macOS, the three methods to use are: protecting the change using @synchronized; avoiding multithreaded access by running the code on a serial queue (possibly the main queue); or using mutexes.
I hope you are aware that atomicity for boolean values is rather pointless. What you have is a race condition: One thread stores a value, another reads it. Atomicity doesn't make a difference here. It makes (or might make) a difference if two threads accessing a variable at exactly the same time causes problems. For example, if a variable is incremented on two threads at exactly the same time, is it guaranteed that the final result is increased by two? That requires atomicity (or one of the methods mentioned earlier).
Sebastian makes the ridiculous claim that atomicity fixes the data race: that's nonsense. In a race like this, a reader will read the value either before or after it is changed; whether that value is atomic or not doesn't make any difference whatsoever. The reader will read the old value or the new value, so the behaviour is unpredictable. All that atomicity does is prevent the reader from seeing some in-between state, which doesn't fix the race condition.
I have run a sample C++ program on the VxWorks platform to test the timing difference between a mutex and a binary semaphore. The program below is the prototype:
SEM_ID semMutex;
UINT ITER = 10000;

taskIdOne = TASKSPAWN("t1", TASK_PRIORITY_2, 0, 8192, 0, (FUNCPTR)myMutexMethod, 0, 0);
taskIdTwo = TASKSPAWN("t2", TASK_PRIORITY_2, 0, 8192, 0, (FUNCPTR)myMutexMethod, 0, 0);

void myMutexMethod(void)
{
    int i;
    VKI_PRINTF("I'm (%s)\n", TASKNAME(0));
    myMutexTimer.start();
    for (i = 0; i < ITER; i++)
    {
        MUTEX_LOCK(semMutex, WAIT_FOREVER);
        ++global;
        MUTEX_UNLOCK(semMutex);
    }
    myMutexTimer.stop();
    myMutexTimer.show();
}
In the above program there is contention (two tasks are trying to get the mutex); my timer printed 37.43 ms for it. With the same prototype, the binary-semaphore program took just 2.8 ms. This is understood, because a binary semaphore is lightweight and does not have many of the features of a mutex (priority-inversion handling, ownership, etc.).
However, I then removed one task and ran the program again (without contention). Since there is no contention, task t1 just gets the mutex, executes the critical section, and then releases the mutex; the same goes for the binary semaphore.
For these timings I got 3.35 ms for the mutex and 4 ms for the binary semaphore.
I'm surprised to see that the mutex is faster than the binary semaphore when there is no contention.
Is this expected, or am I missing something?
Any help is appreciated!
The mutex is probably faster in this case due to the fact that the same task is taking it over and over again with no other task getting involved. My guess is that mutex code is taking a shortcut to enable recursive mutex calls (i.e. the same task takes the same mutex twice). Even though your code is not technically a recursive mutex take, the code probably uses the same shortcut due to the fact that the semaphore owner was not overwritten by any other task taking the semaphore.
In other words you do:
1) semTake(semMutex)
2) ++global;
3) semGive(semMutex) // sem owner flag is not changed
4) semTake(semMutex) // from same task as previous semTake
...
Then in step 4 the semTake sees that sem owner == current task id (because the sem owner was set in step 1 and never changed to anything else), so it just marks the semaphore as taken and quickly jumps out.
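In other words, the suspected fast path would look roughly like this (purely illustrative pseudo-code, not actual VxWorks source; taskIdSelf() returns the calling task's ID):

STATUS semTake(SEM_ID sem, int timeout)
{
    if (sem->owner == taskIdSelf())   /* owner still names this task from step 1 */
    {
        /* just mark the semaphore as taken again and return immediately */
        return OK;
    }
    /* ... full path: arbitration, pending with timeout, priority handling ... */
}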
Of course this is a guess; a quick look at the source code and some VxWorks shell breakpoints could confirm it, but I am unable to do that because I no longer have access to VxWorks.
Additionally look at the semMLib docs for some documentation on the recursive use of mutex.
I am trying to understand the term Lwt-supported.
So assume I have a piece of code which connects to a database and writes some data: Db.write conn data. It has nothing to do with Lwt yet, and each write takes 10 seconds.
Now, I would like to use lwt. Can I directly code like below?
let write_all data_list = Lwt_list.iter (Db.write conn) data_list
let _ = Lwt_main.run(write_all my_data_list)
Suppose there are 5 data items in my_data_list. Will all 5 items be written into the database sequentially, or in parallel?
Also, the Lwt manual and http://ocsigen.org/tutorial/application say
Using Lwt is very easy and does not cause troubles, provided you never
use blocking functions (non-cooperative functions). Blocking functions
can cause the entire server to hang!
I don't quite understand how to avoid using blocking functions. For each of my own functions, can I just use Lwt.return to make it Lwt-supported?
Yes, your code is correct. The principle of being Lwt-supported is that everything that can potentially take time in your code should return an Lwt value.
About Lwt_list.iter: you can choose whether you want the treatment to be parallel or sequential by choosing between iter_p and iter_s:
In iter_s f l, iter_s will call f on each element
of l, waiting for completion between each element. On the
contrary, in iter_p f l, iter_p will call f on all
elements of l, then wait for all the threads to terminate.
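So, assuming Db.write conn returns an Lwt value, the two behaviours would be requested like this (sketch):

let write_all_sequential data_list =
  Lwt_list.iter_s (Db.write conn) data_list   (* one write at a time *)

let write_all_parallel data_list =
  Lwt_list.iter_p (Db.write conn) data_list   (* start all writes, then wait for all *)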
About the non-blocking functions: the principle of lightweight threads is that they keep running until they reach a "cooperation point", i.e. a point where the thread can be safely interrupted or has nothing to do, as in a sleep.
But you have to declare that you are entering a "cooperation point" before actually doing the sleep. This is why the whole Unix library has been wrapped, so that when you want to do an operation that takes time (e.g. a write), a cooperation point is reached automatically.
For your own functions, if you use I/O operations from Unix, you should use the Lwt versions instead (Lwt_unix.sleep instead of Unix.sleep, for example).
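For example, a sketch of the difference:

let blocking_pause () =
  Unix.sleep 1; Lwt.return ()   (* blocks: no cooperation point, the whole program stalls *)

let cooperative_pause () =
  Lwt_unix.sleep 1.0            (* cooperation point: other threads keep running *)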
So here's my scenario. First, I have a structure -
struct interval
{
    double lower;
    double higher;
};
Now my thread function -
void* thread_function(void* i)
{
    interval* in = (interval*)i;
    double a = in->lower;
    cout << a;
    pthread_exit(NULL);
}
In main, let's say I create these 2 threads -
pthread_t one,two;
interval i;
i.lower = 0; i.higher = 5;
pthread_create(&one,NULL,thread_function,&i);
i.lower=10; i.higher = 20;
pthread_create(&two,NULL,thread_function, &i);
pthread_join(one,NULL);
pthread_join(two,NULL);
Here's the problem. Ideally, thread "one" should print out 0 and thread "two" should print out 10. However, this doesn't happen. Occasionally, I end up getting two 10s.
Is this by design? In other words, by the time the thread is created, the value in i.lower has been changed already in main, therefore both threads end up using the same value?
Is this by design?
Yes. It's unspecified when exactly the threads start and when they will access that value. You need to give each one of them their own copy of the data.
Your application is non-deterministic.
There is no telling when a thread will be scheduled to run.
Note: Creating a thread does not mean it will start executing immediately (or even first). The second thread created may actually start running before the first (it all depends on the OS and hardware).
To get deterministic behavior each thread must be given its own data (that is not modified by the main thread).
pthread_t one,two;
interval oneData, twoData;
oneData.lower = 0; oneData.higher = 5;
pthread_create(&one,NULL,thread_function,&oneData);
twoData.lower=10; twoData.higher = 20;
pthread_create(&two,NULL,thread_function, &twoData);
pthread_join(one,NULL);
pthread_join(two,NULL);
I would not call it by design.
I would rather refer to it as a side-effect of scheduling policy. But the observed behavior is what I would expect.
This is the classic 'race condition'; where the results vary depending on which thread wins the 'race'. You have no way of knowing which thread will 'win' each time.
Your analysis of the problem is correct; you simply don't have any guarantees that the first thread created will be able to read i.lower before the data is changed on the next line of your main function. This is in some sense the heart of why it can be hard to think about multithreaded programming at first.
The straightforward solution to your immediate problem is to keep different intervals with different data and pass a separate one to each thread, i.e.
interval i, j;
i.lower = 0; j.lower = 10;
pthread_create(&one,NULL,thread_function,&i);
pthread_create(&two,NULL,thread_function,&j);
This will of course solve your immediate problem. But soon you'll probably wonder what to do if you want multiple threads actually using the same data. What if thread 1 wants to make changes to i and thread 2 wants to take them into account? There would hardly be much point in doing multithreaded programming if each thread had to keep its memory separate from the others (leaving message passing out of the picture for now). Enter mutex locks! I thought I'd give you a heads-up that you'll want to look into this topic sooner rather than later, as it'll also help you understand the basics of threads in general and the required change in mentality that goes along with multithreaded programming.
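As a first taste (a minimal sketch with illustrative names), sharing one interval between threads safely means bracketing every access with the same mutex:

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
interval shared;   // now deliberately shared between threads

void* updater(void* unused)
{
    pthread_mutex_lock(&lock);    // only one thread touches shared at a time
    shared.lower += 1.0;
    shared.higher += 1.0;
    pthread_mutex_unlock(&lock);
    return NULL;
}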
I seem to recall that this is a decent short introduction to pthreads, including getting started with understanding locking etc.