I just tried the following code, but the result seems a little strange. It prints all the odd numbers first, and then the even numbers. I'm really confused by it: I had hoped it would output odd and even numbers alternately, like 1, 2, 3, 4, and so on. Can anyone help?
package main

import (
    "fmt"
    "time"
)

func main() {
    go sheep(1)
    go sheep(2)
    time.Sleep(100000)
}

func sheep(i int) {
    for ; ; i += 2 {
        fmt.Println(i, "sheeps")
    }
}
More than likely you are only running with one CPU thread, so it runs the first goroutine and then the second. If you tell Go it may run on multiple threads, both will be able to run simultaneously, provided the OS has spare time on a CPU. You can demonstrate this by setting GOMAXPROCS=2 before running your binary. Or you could try adding a runtime.Gosched() call in your sheep function and see if that triggers the runtime to let the other goroutine run.
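For example, a minimal sketch of the Gosched() variant (the same sheep loop from the question, with a yield added):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func sheep(i int) {
    for ; ; i += 2 {
        fmt.Println(i, "sheeps")
        runtime.Gosched() // yield the processor so the other goroutine gets a chance to run
    }
}

func main() {
    go sheep(1)
    go sheep(2)
    time.Sleep(time.Second) // crude stop condition, as in the question
}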
In general, though, it's better not to assume any ordering between operations in two goroutines unless you establish explicit synchronization points, using a sync.Mutex or by communicating between them on channels.
Unsynchronized goroutines execute in a completely undefined order. If you want to print out something like
1 sheeps
2 sheeps
3 sheeps
....
in that exact order, then goroutines are the wrong way to do it. Concurrency works well when you don't care so much about the order in which things occur.
You could impose an order in your program through synchronization (locking a mutex around the fmt.Println calls or using a channel), but it's pointless since you could more easily just write code that uses a single goroutine.
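For completeness, here is a sketch of what that channel-based ordering would look like (a hypothetical ping-pong over an unbuffered channel, and clearly more machinery than the problem deserves):

package main

import "fmt"

func main() {
    turn := make(chan int)
    done := make(chan bool)
    go func() { // prints the odd numbers
        for i := 1; i < 10; i += 2 {
            fmt.Println(i, "sheeps")
            turn <- i + 1 // hand the next number to the other goroutine
            <-done        // wait until it has printed
        }
        close(turn)
    }()
    for i := range turn { // prints the even numbers
        fmt.Println(i, "sheeps")
        done <- true
    }
}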
I'm looking at an online course for C++ multithreading, and it states that std::cout is thread-safe, yet using it from multiple threads can still produce race conditions.
If std::cout can have race conditions, then how is it thread-safe? Isn't thread-safety, by definition, free of race conditions?
It's distinguishing between the ordering of the calls to operator<< and the outputting of each character that results from those calls.
Specifically, there is no conflict between threads on a per-character basis but, if you have two threads output pax and diablo respectively, you may end up with any of (amongst others):
paxdiablo
diablopax
pdaixablo
padiablox
diapaxblo
What the quoted text is stating is that there will be no intermixing of the output of (for example) a p and a d that would cause a data race.
And a race condition isn't necessarily a problem, nor a thread-safety issue; it just means the behaviour can vary based on ordering. It can become an issue if the arbitrary ordering can corrupt data in some way, but that's not the case here (unless you consider badly formatted output to be corrupt).
It's similar to the statement:
result = a * 7 + b * 4;
ISO C++ doesn't actually mandate the order in which a * 7 and b * 4 are evaluated, so you could consider that a race condition (it's quite plausible, though unlikely, that the individual calculations could be handed off to separate CPUs for parallelism), but in no way is that going to be a problem.
Interestingly, ISO C++ makes no mention of race conditions (which may become a problem), only of data races (which are almost certainly a problem).
There are many levels of thread safety. Some programs must complete in exactly the same order every time. They must be as-if they were executed in a single thread. Others permit some level of flexibility in the order in which statements are executed. The spec's statements about std::cout state that it is in the latter category. The order that characters will be printed to the screen is unspecified.
You can see the two behaviors in the act of compiling. In most cases, we don't care about the order in which things get compiled. We happily type make -j8 and have 8 threads (processes) compiling in parallel. Whichever thread does the work is fine by us. On the other hand, consider the compiling requirements of NixOS. In that distro, you compile every application yourself and cross-check it against hashes posted by the authors. When compiling on NixOS, within a compilation unit (one .c or .cpp), you absolutely must operate as-if there is a single thread, because you need the exact same product deterministically -- every time.
There is a darker version of this kind of race behavior known as a "data race." This is when two writers try to write the same variable at the same time, or a reader tries to read it while a writer is writing to it, all without any synchronization. This is undefined behavior. Indeed, one of my favorite pages on programming goes painstakingly through all the things that can go wrong if there is a data race.
So what is the spec saying about needing to lock cout? What it's saying is that the interthread behavior is specified, and that there is no data race. Thus, you do not need to guard the output stream.
Typically, if you find you want to guard it, it's a good time to take a step back and evaluate your choices. Sure, you can put a mutex on your usage of cout, but you can't put it on any libraries that you are using, so you'll still have issues with interleaving if they print anything. It may be a good time to look at your output system. You may want to start passing messages to a centralized pool (protected by a mutex), and have one thread that writes them out to the screen.
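If you do go that route, a minimal sketch of the centralized-writer idea might look like this (LogQueue is a hypothetical helper, not any particular library):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class LogQueue {
public:
    void push(std::string msg) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(msg));
        }
        cv_.notify_one();
    }
    void writer_loop() { // run in exactly one thread
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            std::string msg = std::move(q_.front());
            q_.pop();
            lk.unlock();
            if (msg.empty()) return;  // empty message = shutdown signal
            std::cout << msg << '\n'; // only this thread ever touches cout
        }
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
};

int main() {
    LogQueue log;
    std::thread writer(&LogQueue::writer_loop, &log);
    std::thread a([&] { log.push("pax"); });
    std::thread b([&] { log.push("diablo"); });
    a.join();
    b.join();
    log.push(""); // tell the writer to stop
    writer.join();
}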
The text is there to warn you that building something entirely out of "thread-safe" components does not guarantee that the thing you built will be "thread-safe."
The std::cout stream is called "thread-safe" because using it in multiple threads without any other synchronization will not break any of its underlying mechanisms. The cout stream will behave in predictable, reasonable ways: everything that various threads write will appear in the output, nothing will appear that wasn't written by some thread, everything written by any one thread will appear in the order that thread wrote it, and so on. Also, it won't throw unexpected exceptions, segfault your program, or overwrite random variables.
But, if your program allows several threads to "race" to write std::cout, there is nothing that the authors of the library could do to help predict which thread will win. If you expect to see data written by several threads appear in some particular order, then it's your responsibility to ensure that the threads call the library in that same order.
Don't forget that a statement like this is not one std::cout function call; it's three separate calls.
std::cout << "there are " << number_of_doodles << " doodles here.\n";
If some other thread concurrently executes this statement (also three calls):
std::cout << "I like " << name_of_favorite_pet << ".\n";
Then the output of the six separate std::cout calls could be interleaved, e.g.:
I like there are 5 fido doodles here.
.
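If you need each full statement to come out in one piece, C++20 added std::osyncstream (in <syncstream>), which accumulates everything written to it and transfers it to the underlying stream in one atomic chunk when it is destroyed. A minimal sketch:

#include <iostream>
#include <syncstream>
#include <thread>

int main() {
    std::thread t1([] {
        std::osyncstream(std::cout) << "there are " << 5 << " doodles here.\n";
    });
    std::thread t2([] {
        std::osyncstream(std::cout) << "I like " << "fido" << ".\n";
    });
    t1.join();
    t2.join(); // each line appears whole; only the order of the two lines is unspecified
}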
I don't know why my code isn't thread-safe, as it outputs some inconsistent results.
value 48
value 49
value 50
value 54
value 51
value 52
value 53
My understanding of an atomic object is that it prevents its intermediate state from being exposed, so it should solve the problem when one thread is reading it while another thread is writing it.
I used to think I could use std::atomic without a mutex to solve the multi-threaded counter-increment problem, but it doesn't look like that's the case.
I probably misunderstood what an atomic object is. Can someone explain?
#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

void inc(std::atomic<int>& a)
{
    while (true) {
        a = a + 1;
        printf("value %d\n", a.load());
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}

int main()
{
    std::atomic<int> a(0);
    std::thread t1(inc, std::ref(a));
    std::thread t2(inc, std::ref(a));
    std::thread t3(inc, std::ref(a));
    std::thread t4(inc, std::ref(a));
    std::thread t5(inc, std::ref(a));
    std::thread t6(inc, std::ref(a));
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    t5.join();
    t6.join();
    return 0;
}
I used to think I could use std::atomic without a mutex to solve the multi-threaded counter-increment problem, but it doesn't look like that's the case.
You can, just not the way you have coded it. You have to think about where the atomic accesses occur. Consider this line of code …
a = a + 1;
1. First, the value of a is fetched atomically. Let's say the value fetched is 50.
2. We add one to that value, getting 51.
3. Finally, we atomically store that value into a using the = operator.
4. a ends up being 51.
5. We atomically load the value of a by calling a.load().
6. We print the value we just loaded by calling printf().
So far so good. But between steps 1 and 3, some other threads may have changed the value of a - for example, to the value 54. So, when step 3 stores 51 into a, it overwrites the 54, giving you the output you see.
As #Sopel and #Shawn suggest in the comments, you can atomically increment the value in a using one of the appropriate member functions (like fetch_add) or operator overloads (like operator++ or operator+=). See the std::atomic documentation for details.
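For example, a minimal sketch of that fix, capturing the increment's result so each thread prints the count it actually produced:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> a(0);
    std::vector<std::thread> threads;
    for (int i = 0; i < 6; ++i) {
        threads.emplace_back([&a] {
            int mine = a.fetch_add(1) + 1; // one atomic read-modify-write; returns the old value
            printf("value %d\n", mine);    // this thread's own count; line order may still vary
        });
    }
    for (auto& t : threads) t.join();
}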
Update
I added steps 5 and 6 above. Those steps can also lead to results that may not look correct.
Between the store in step 3 and the call to a.load() in step 5, other threads can modify the contents of a. After our thread stores 51 into a in step 3, it may find that a.load() returns some different number in step 5. Thus the thread that set a to 51 may not pass 51 to printf().
Another source of problems is that nothing coordinates the execution of steps 5. and 6. between two threads. So, for example, imagine two threads X and Y running on a single processor. One possible execution order might be this …
Thread X executes steps 1 through 5 above incrementing a from 50 to 51 and getting the value 51 back from a.load()
Thread Y executes steps 1 through 5 above incrementing a from 51 to 52 and getting the value 52 back from a.load()
Thread Y executes printf() sending 52 to the console
Thread X executes printf() sending 51 to the console
We've now printed 52 on the console, followed by 51.
Finally, there's another problem lurking at step 6, because printf() doesn't make any promises about what happens if two threads call printf() at the same time (at least I don't think it does).
On a multiprocessor system threads X and Y above might call printf() at exactly the same moment (or within a few ticks of exactly the same moment) on two different processors. We can't make any prediction about which printf() output will appear first on the console.
Note: The documentation for printf mentions a lock introduced in C++17, "… used to prevent data races when multiple threads read, write, position, or query the position of a stream." In the case of two threads simultaneously contending for that lock, we still can't tell which one will win.
Besides the increment of a being done non-atomically, the fetch of the value to display after the increment is non-atomic with respect to the increment. It is possible that one of the other threads increments a after the current thread has incremented it but before the fetch of the value to display. This would possibly result in the same value being shown twice, with the previous value skipped.
Another issue here is that the threads do not necessarily run in the order they were created. Thread 6 could execute its output before threads 3, 4, and 5, but after all four of those threads have incremented a. Since the thread that did the last increment displays its output earlier, you end up with the output not being sequential. This is more likely to happen on a system with fewer than six hardware threads available to run on.
Adding a small sleep between the various thread creations (e.g., sleep_for(10)) would make this less likely to occur, but would still not eliminate the possibility. The only sure way to keep the output ordered is to use some sort of exclusion (like a mutex) to ensure only one thread at a time can run the increment-and-output code, treating the increment and the output as a single transaction that must complete before another thread attempts its own increment.
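A sketch of that mutex-as-transaction approach (keeping the shape of the question's inc function, minus the infinite loop):

#include <atomic>
#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>

std::mutex io_mutex;

void inc(std::atomic<int>& a) {
    for (int n = 0; n < 3; ++n) {
        std::lock_guard<std::mutex> lock(io_mutex); // increment + print form one transaction
        int value = ++a;                            // atomic, but the lock is what orders the print
        printf("value %d\n", value);
    }
}

int main() {
    std::atomic<int> a(0);
    std::thread t1(inc, std::ref(a));
    std::thread t2(inc, std::ref(a));
    t1.join();
    t2.join();
}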
The other answers point out the non-atomic increment and various problems. I mostly want to point out some interesting practical details about exactly what we see when running this code on a real system. (x86-64 Arch Linux, gcc9.1 -O3, i7-6700k 4c8t Skylake).
It can be useful to understand why certain bugs or design choices lead to certain behaviours, for troubleshooting / debugging.
Use int tmp = ++a; to capture the fetch_add result in a local variable instead of reloading it from the shared variable. (And as 1202ProgramAlarm says, you might want to treat the whole increment and print as an atomic transaction if you insist on having your counts printed in order as well as being done properly.)
Or you might want to have each thread record the values it saw in a private data structure to be printed later, instead of also serializing the threads with printf during the increments. (In practice, all the threads trying to increment the same atomic variable will serialize anyway, waiting for access to the cache line; ++a will go in order, so you can tell from the modification order which thread went in which order.)
Fun fact: a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_release) is what you might do for a variable that is only written by one thread but read by multiple threads. You don't need an atomic RMW because no other thread ever modifies it; you just need a thread-safe way to publish updates. (Or better, in a loop, keep a local counter and just .store() it without loading from the shared variable.)
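A sketch of that single-writer publish pattern (assumption: exactly one thread ever stores to the variable):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> counter(0);

void writer() { // the only thread that ever stores to counter
    int local = 0;
    for (int i = 0; i < 1000; ++i) {
        ++local; // plain increment of a local; no atomic RMW needed
        counter.store(local, std::memory_order_release); // publish the new value
    }
}

void reader() {
    printf("saw %d\n", counter.load(std::memory_order_acquire));
}

int main() {
    std::thread w(writer);
    std::thread r1(reader);
    std::thread r2(reader);
    w.join();
    r1.join();
    r2.join();
}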
If you used the default a = ... for a sequentially consistent store, you might as well have done an atomic RMW on x86: one good way to compile a seq-cst store is with an atomic xchg, and mov + mfence is as expensive (or more).
What's interesting is that despite the massive problems with your code, no counts were lost or stepped on (no duplicate counts); merely the printing was reordered. So in practice, the danger wasn't encountered, because of other effects going on.
I tried it on my own machine and did lose some counts. But after removing the sleep, I just got reordering. (I copy-pasted about 1000 lines of the output into a file, and running sort -u to uniquify the output didn't change the line count. It did move some late prints around, though; presumably one thread got stalled for a while.) My testing didn't check for the possibility of skipped counts, caused by not saving the value being stored into a and instead reloading it. I'm not sure there's a plausible way for that to happen here without multiple threads reading the same count, which would be detected.
Store + reload, even a seq-cst store which has to flush the store buffer before it can reload, is very fast compared to printf making a write() system call. (The format string includes a newline and I didn't redirect output to a file so stdout is line-buffered and can't just append the string to a buffer.)
(write() system calls on the same file descriptor are serializing in POSIX: write(2) is atomic. Also, printf(3) itself is thread-safe on GNU/Linux, as required by C++17, and probably by POSIX long before that.)
Stdio locking in printf happens to be enough serialization in almost all cases: the thread that just unlocked stdout and left printf can do the atomic increment and then try to take the stdout lock again.
The other threads were all blocked trying to take the lock on stdout. One (other?) thread can wake up and take the lock on stdout, but for its increment to race with the other thread it would have to enter and leave printf and load a the first time before that other thread commits its a = ... seq-cst store.
This does not mean it's actually safe
Just that testing this specific version of the program (at least on x86) doesn't easily reveal the lack of safety. Interrupts or scheduling variations, including competition from other things running on the same machine, certainly could block a thread at just the wrong time.
My desktop has 8 logical cores so there were enough for every thread to get one, not having to get descheduled. (Although normally that would tend to happen on I/O or when waiting on a lock anyway).
With the sleep there, it is not unlikely for multiple threads to wake up at nearly the same time and race with each other in practice on real x86 hardware. The sleep is so long that timer granularity becomes a factor, I think.
Redirecting output to a file
With stdout open on a non-TTY file, it's full-buffered instead of line-buffered, and doesn't always make a system call while holding the stdout lock.
(I got a 17MiB file in /tmp from hitting control-C a fraction of a second after running ./a.out > output.)
This makes it fast enough for threads to actually race with each other in practice, showing the expected bugs of duplicate values. (A thread reads a but loses ownership of the cache line before it stores (tmp)+1, resulting in two or more threads doing the same increment. And/or multiple threads reading the same value when they reload a after flushing their store buffer.)
1228589 unique lines (sort -u | wc) out of 1291035 total lines, so ~5% of the output lines were duplicates.
I didn't check if it was usually one value duplicated multiple times or if it was usually only one duplicate. Or how far backward the value ever jumped. If a thread happened to be stalled by an interrupt handler after loading but before storing val+1, it could be quite far. Or if it actually slept or blocked for some reason, it could rewind indefinitely far.
Essentially I have two while True: loops in my code. Both of the loops are right at the end. However, when I run the code, only the first while True: loop runs, and the second one gets ignored.
For example:
while True:
    print "hi"

while True:
    print "bye"
Here, it will continuously print hi, but won't print bye at all (the actual code has a tracer.execute() for one loop, and the other is listening on a port; they both work on their own).
Is there any way to get both loops to work at the same time independently?
Yes, there is a way to get both loops to work at the same time, independently.
Your initial surprise is related to the way finite-state automata actually work:
[0]: any-processing-will-always-<START>-here
[1]: Read a next instruction
[2]: Execute the instruction
[3]: GO TO [1]
The stream of abstract instructions is executed in a pure-[SERIAL] manner, one after another. There has been no other way inside a CPU since the days of Turing.
Your desire to have more streams-of-instructions run at the same time independently is called [CONCURRENT] process-scheduling.
You have several tools for achieving a wanted modus-operandi:
Read about the weaker form first, which uses just thread-based concurrency. Due to the Python-specific GIL lock, threads still execute as [CONCURRENT] streams on the physical hardware rather than in parallel: the GIL (knowingly implemented as a very cheap form of collision avoidance for every case such [CONCURRENCY] might introduce) interleaves the [CONCURRENT] streams so that colliding access to any Python object at the same time is avoided by design. If you are fine with executing just one instruction-stream fragment at a time (with the actual order of the GIL-stepped execution round-robined), you can live in a safe and collision-free world.
Another tool Python may use is joblib.Parallel()( joblib.delayed() ), where you will have to master a few more details to make it work. It spawns a set of full subprocesses, each (yes, each) having a full copy of the Python state plus all variables (read: a lot of time and memory needed to spawn them) and no mutual coordination.
So decide which form is just enough for your use case, and check the overhead-strict re-formulation of Amdahl's Law carefully (for the implied costs of going distributed or parallel).
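For the question as asked, a minimal thread-based sketch (Python 3; loop_a and loop_b are placeholders for the tracer and the port listener):

import threading
import time

def loop_a():                # placeholder for tracer.execute()
    while True:
        print("hi")
        time.sleep(1)

def loop_b():                # placeholder for the port listener
    while True:
        print("bye")
        time.sleep(1)

threading.Thread(target=loop_a, daemon=True).start()
threading.Thread(target=loop_b, daemon=True).start()
time.sleep(5)                # keep the main thread alive while both loops run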
I am working on a multithreaded program for a Raspberry Pi, and I have noticed that our current code runs perfectly on my computer and on the computers of other colleagues, but it fails when running on ARM.
We are using C++11 for our project, and this is the output in our computer:
............
Success!
Test run complete. 12 tests run. 12 succeeded.
But when we try to run it on ARM, as you can see here: https://travis-ci.org/OpenStratos/server/builds/49297710
It says the following:
....
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
After some debugging, I have understood that the issue comes to this code: https://github.com/OpenStratos/server/blob/feature/temperature/serial/Serial.cpp#L91
this->open = false;
while (!this->stopped);
And there is another thread doing the opposite:
while (this->open)
{
    // Do stuff
}
this->stopped = true;
The first piece of code is called when I need to stop the thread, and the double flag is used so the thread can keep updating the current object even while it's stopping. Both variables are of type std::atomic_bool, but it seems that the while (!this->stopped); never sees the change, and it behaves like a while (true);.
Is this the case? How can it be solved? Why does it work differently on my x86_64 processor than on the ARM?
Thanks in advance.
The core guarantee made by std::atomic<T> is that you can always read a value. Consistency is not necessarily guaranteed.
Now, in this case you're relying on operator bool, which is equivalent to .load(memory_order_seq_cst), and operator=(x), which is .store(x, memory_order_seq_cst). This should give you sequentially consistent memory ordering.
The order you're observing on ARM appears sequentially consistent to me. The fact that you're not yet seeing stopped == true is OK; there's no time limit on that. The compiler cannot swap the memory operation with another memory operation, but it may delay it indefinitely.
The main question is why this thread should be stopped at all. If there was any real, observable work done in the loop body of that thread, that loop body could not be reordered relative to stopped==true check.
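For reference, a minimal self-contained version of that stop-flag handshake (a sketch, not the OpenStratos code):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> open_flag(true);
std::atomic<bool> stopped(false);

void worker() {
    while (open_flag.load()) { // seq_cst load, same as the implicit bool conversion
        std::this_thread::sleep_for(std::chrono::milliseconds(10)); // "do stuff"
    }
    stopped.store(true); // acknowledge the shutdown request
}

int main() {
    std::thread t(worker);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    open_flag.store(false);    // request the stop
    while (!stopped.load()) {} // busy-wait for the acknowledgement; join alone would be simpler
    t.join();
    std::cout << "worker stopped\n";
}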
In the end, the issue was that the environment I was creating on Travis CI was not working properly. On real ARM hardware it works properly.
Consider this program:
package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    x := 0
    go func() {
        time.Sleep(500 * time.Millisecond)
        x = 1
    }()
    for x == 0 {
        runtime.Gosched()
    }
    fmt.Println("it works!")
}
Why does it terminate locally but not on Playground? Does the termination of my program rely on undefined behavior?
The Go playground uses a special implementation of time.Sleep designed to prevent individual programs from monopolising the back end resources of the website.
As described in this article about how the playground is implemented, goroutines that call time.Sleep() are put to sleep. The playground back end waits until all other goroutines are blocked (what would otherwise be a deadlock), and then wakes the goroutine with the shortest timeout.
In your program, you have two goroutines: the main one, and one that calls time.Sleep. Since the main goroutine never blocks, the time.Sleep call will never return. The program continues until it exceeds the CPU time allocated to it and is then terminated.
The Go Memory Model does not guarantee that the value written to x in the goroutine will ever be observed by the main program. A similarly erroneous program is given as an example in the section on goroutine destruction. The Go Memory Model also specifically calls out busy waiting without synchronization as an incorrect idiom in this section.
You need to do some kind of synchronization in the goroutine in order to guarantee that x=1 happens before one of the iterations of the for loop in main.
Here is a version of the program that is guaranteed to work as intended.
http://play.golang.org/p/s3t5_-Q73W
package main

import (
    "fmt"
    "time"
)

func main() {
    c := make(chan bool)
    x := 0
    go func() {
        time.Sleep(500 * time.Millisecond)
        x = 1
        close(c) // 1
    }()
    for x == 0 {
        <-c // 2
    }
    fmt.Println("it works!")
}
The Go Memory Model guarantees that the line marked with // 1 happens before the line marked with // 2. As a result, the for loop is guaranteed to terminate before its second iteration.
That code doesn't offer many guarantees. It relies almost entirely on implementation details around undefined behavior.
In most multi-threaded systems, there's no guarantee that a change in one thread with no barrier will be seen in another. You've got a goroutine which could be running on another processor altogether writing a value to a variable nobody's ever guaranteed to read.
The for x == 0 { loop could quite easily be rewritten to for {, since there's never a guarantee that any change to that variable will become visible.
The race detector will also probably report this issue. You should really not expect this to work. If what you want is a sync.WaitGroup, you should just use one, as it properly coordinates across threads.
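A sketch of the sync.WaitGroup version:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var wg sync.WaitGroup
    x := 0
    wg.Add(1)
    go func() {
        defer wg.Done()
        time.Sleep(500 * time.Millisecond)
        x = 1
    }()
    wg.Wait() // guarantees the goroutine's write to x happens before this point
    fmt.Println("it works!", x)
}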