Several threads racing to set the same data to the same value - C++

I have a situation where I'm having to use a blackboxed wrapper for multithreading (which I suspect sits on top of a TBB Thread Pool).
I have a value that can only be obtained by an object that has an expensive constructor, and each thread needs a local instance of that, which is OK.
That object will produce a value that is guaranteed always identical across threads (all constructors take the same const forming argument from the main loop).
Each thread also has access to a shared struct for that argument as well as to save some results.
The value in question (an iteration range in the form of an unsigned int) is needed by the threads and used again later in the main loop, so if I can avoid it I'd rather not create another expensive instance of the above-mentioned object just to obtain the same value again.
My question is, on Windows with VC11, and Linux with GCC 4.8.2, on x86-64, is writing the same value to the same memory location (int in a struct the threads have a pointer to) from multiple threads a benign race? Is it a race I could just let happen without guarding the value with an expensive lock? From cursory testing it seems to be the case, but I'm not entirely sure whether under the hood the operation is atomic and safe, or if there's a potential for corruption that might show up under stress.

Whether a data race is "benign" depends on the compiler and runtime platform. Compilers assume programs are race-free, and behaviour resulting from race conditions is undefined. Using atomic operations does not incur much overhead and is recommended in this situation.
Some fringe cases and very good examples on what could go wrong can be found here:
https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
In his post, ThreadSanitizer developer Dmitry Vyukov writes: "So, if a data race involves non-atomic writes, it always can go wrong".
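For instance, here is a minimal sketch of that recommendation, where Shared and compute_range() are hypothetical stand-ins for the asker's shared struct and the expensive object's result: every thread stores the identical value through a std::atomic with relaxed ordering, which removes the undefined behaviour without paying for a lock.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct Shared {
    std::atomic<unsigned> range{0}; // iteration range, written by every worker
};

// Stand-in for the value produced by the expensive per-thread object.
unsigned compute_range() { return 42; }

int main()
{
    Shared shared;
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&shared] {
            // Every worker computes the identical value; a relaxed store
            // avoids the data race without adding any fence or lock.
            shared.range.store(compute_range(), std::memory_order_relaxed);
        });
    for (auto& t : workers) t.join(); // join() makes the stores visible here
    std::printf("%u\n", shared.range.load(std::memory_order_relaxed));
}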


Can multiple threads write the same value to the same variable at the same time safely?
For a specific example — is the below code guaranteed by the C++ standard to compile, run without undefined behavior and print "true", on every conforming system?
#include <cstdio>
#include <thread>

int main()
{
    bool x = false;
    std::thread one{[&]{ x = true; }};
    std::thread two{[&]{ x = true; }};
    one.join();
    two.join();
    std::printf(x ? "true" : "false");
}
This is a theoretical question; I want to know whether it definitely always works rather than whether it works in practice (or whether writing code like this is a good idea :)). I'd appreciate if someone could point to the relevant part of the standard. In my experience it always works in practice, but not knowing whether or not it's guaranteed to work I always use std::atomic instead - I'd like to know whether that's strictly necessary for this specific case.
No.
You need to synchronize access to those variables, either by using mutexes or by making them atomic.
There is no exemption for when the same value is being written. You don't know what steps are involved in writing that value (which is the underlying practical concern), and neither does the standard which is why code has undefined behaviour … which means your compiler can just make absolute mayhem with your program (and that's the real issue you need to avoid).
Someone's going to come along and tell you that such-and-such an architecture guarantees atomic writes to these sized variables. But that doesn't change the UB aspect.
The passages you're looking for are:
[intro.races/2]: Two expression evaluations conflict if one of them modifies a memory location ([intro.memory]) and the other one reads or modifies the same memory location.
[intro.races/21]: […] The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, […]. Any such data race results in undefined behavior.
… and the surrounding wording. That section is actually quite esoteric, but you don't really need to parse it as this is a classic, textbook data race that you can read about in any book on programming.
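For reference, a race-free version of the example needs nothing more than making x a std::atomic<bool>; with the joins providing the ordering, this sketch is guaranteed to print "true":

#include <atomic>
#include <cstdio>
#include <thread>

int main()
{
    std::atomic<bool> x{false};
    std::thread one{[&]{ x.store(true, std::memory_order_relaxed); }};
    std::thread two{[&]{ x.store(true, std::memory_order_relaxed); }};
    one.join();
    two.join();
    std::printf(x.load() ? "true" : "false"); // guaranteed "true"
}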
Lightness is correct and spot-on from a standards perspective.
But I'll give you another reason why this is not a good idea, from a hardware architecture perspective.
Without a memory barrier (atomic, mutex, etc.), you can encounter what's known as the cache coherency problem. On a multi-core or multi-processor machine, your two threads could both set x to true, yet your main thread could still print false even if the compiler didn't stash x in a register, because the hardware cache used by the main thread hasn't yet been updated to invalidate whatever cache line x is on. The atomic types and lock guards provided by C++ (along with countless OS primitives) are implemented to solve this issue.
In any case, google for Cache Coherence Problem and Cache Coherence Multicore. And for one architecture's implementation of atomic transactions, look up the Intel LOCK prefix.

A legitimate use case for volatile in C++?

I'm running a few threads that all return the same kind of object as a result. Then I wait for all of them to complete and read the results. To avoid needing synchronization, I figured I could just pre-allocate all the result objects in an array or vector and give each thread a pointer to one. At a high level, the code is something like this (simplified):
std::vector<Foo> results(2);
RunThread1(&results[0]);
RunThread2(&results[1]);
WaitForAll();
// Read results
cout << results[0].name << results[1].name;
Basically I'd like to know if there's anything potentially unsafe about this code. One thing I was wondering is whether the vector should be declared volatile so that the reads at the end aren't optimized away and don't output an incorrect value.
The short answer to your question is no, the array should not be declared volatile. For two simple reasons:
Using volatile is not necessary. Every sane multithreading platform provides synchronization primitives with well-defined semantics. If you use them, you don't need volatile.
Using volatile isn't sufficient. Since volatile doesn't have defined multithread semantics on any platform you are likely to use, it alone is not enough to provide synchronization.
Most likely, whatever you do in WaitForAll will be sufficient. For example, if it uses an event, mutex, condition variable, or almost anything like that, it will have defined multithreading semantics that are sufficient to make this safe.
Update: "Just out curiosity, what would be an example of something that happens in the WaitForAll that guarantees safety of the read? Wouldn't it need to effectively tell the compiler somehow to "flush" the cache or avoid optimizations of subsequent read operations?"
Well, if you're using pthreads, then if it uses pthread_join that would be an example that guarantees safety of the read because the documentation says that anything the thread does is visible to a thread that joins it after pthread_join returns.
How it does it is an implementation detail. In practice, on modern systems, there are no caches to flush nor are there any optimizations of subsequent reads that are possible but need to be avoided.
Consider if somewhere deep inside WaitForAll, there's a call to pthread_join. Generally, you simply don't let the compiler see into the internals of pthread_join and thus the compiler has to assume that pthread_join might do anything that another thread could do. So keeping information that another thread might modify in a register across a call to pthread_join would be illegal because pthread_join itself might access or modify that data.
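A self-contained sketch of the asker's pattern using std::thread, where join() plays the role of WaitForAll(); Foo's contents and the thread bodies are illustrative stand-ins:

#include <iostream>
#include <string>
#include <thread>
#include <vector>

struct Foo { std::string name; };

int main()
{
    std::vector<Foo> results(2);
    // Each thread writes only to its own element, so there is no race
    // between the threads themselves.
    std::thread t1([&results] { results[0].name = "first"; });
    std::thread t2([&results] { results[1].name = "second"; });
    t1.join(); // join() synchronizes-with the thread's completion,
    t2.join(); // so every write made by the threads is visible below
    std::cout << results[0].name << results[1].name << '\n';
}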
One thing I was wondering is whether the vector should be declared volatile so that the reads at the end aren't optimized away and output an incorrect value.
No. If there was problem with lack of synchronisation, then volatile would not help.
But there is no problem with lack of synchronisation, since according to your description, you don't access the same object from multiple threads - until you've waited for the threads to complete, which is something that synchronises the threads.
There is a potential performance problem if the objects are small (less than about 64 bytes, depending on CPU architecture): the objects in the array then share a cache line, and access to them can become effectively serialised by write contention (false sharing). This only matters if the threads write to their output objects a lot relative to the operations that don't touch them.
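If that contention matters, padding each result to its own cache line sidesteps it. A sketch assuming a 64-byte line size (where available, C++17's std::hardware_destructive_interference_size can replace the hard-coded constant):

// Each slot is aligned to a full cache line so that neighbouring slots
// never share one, eliminating false sharing between the writer threads.
struct alignas(64) PaddedResult {
    unsigned value = 0;
};

PaddedResult results[2]; // results[0] and results[1] live on separate lines

static_assert(sizeof(PaddedResult) == 64, "one slot per cache line");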
It depends on what's in WaitForAll(). If it's a proper synchronization, all is good. Mutexes, for example, or thread join, will result in the proper memory synchronization being put in.
Volatile would not help. It may prevent compiler optimizations, but it would not affect anything happening at the CPU level, such as caches not being updated. Use proper synchronization, like mutexes or thread join, and then the result will be valid (sequentially coherent). Don't count on volatile as a silver bullet: compilers and CPUs are now complex enough that it isn't guaranteed.
Other answers will elaborate on the memory fences and other instructions that the synchronization will put in. :-)

how to declare and use "one writer, many readers, one process, simple type" variable?

I have a really simple question. I have a simple type variable (like int). I have one process, one writer thread, and several "read-only" threads. How should I declare the variable?
volatile int
std::atomic<int>
int
I expect that when the "writer" thread modifies the value, all "reader" threads should see the fresh value ASAP.
It's OK to read and write the variable at the same time, but I expect the reader to obtain either the old value or the new value, not some "intermediate" value.
I'm using a single-CPU Xeon E5 v3 machine. I do not need to be portable; I run the code only on this server and compile with -march=native -mtune=native. Performance is very important, so I do not want to add "synchronization overhead" unless absolutely required.
If I just use int and one thread writes a value, is it possible that in another thread I do not see the "fresh" value for a while?
Just use std::atomic.
Don't use volatile, and don't use a plain int as it is: neither gives the necessary synchronisation. Modifying the variable in one thread and accessing it from another without synchronisation gives undefined behaviour.
If you have unsynchronized access to a variable with one or more writers, then your program has undefined behavior. Somehow you have to guarantee that while a write is happening, no other write or read can happen. This is called synchronization. How you achieve it depends on the application.
For something like this, where we have one writer, several readers, and a TriviallyCopyable datatype, a std::atomic<> will work. Under the hood, the atomic variable makes sure each read and write is indivisible.
If you do not have a TriviallyCopyable type, or you do not want to use a std::atomic, you can use a conventional std::mutex with a std::lock_guard to control access:
{ // enter locking scope
    std::lock_guard<std::mutex> lock(mutx); // lock the mutex for this scope
    some_variable = some_value;             // do work
} // end of scope: the lock is destroyed and mutx is released
An important thing to keep in mind with this approach is to keep the // do work section as short as possible, since no other thread can enter it while the mutex is locked.
Another option is a std::shared_timed_mutex (C++14) or std::shared_mutex (C++17), which allows multiple readers to share the mutex; when you need to write, you can still lock the mutex exclusively and write the data.
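A brief sketch of that reader/writer variant with C++17's std::shared_mutex; the names are illustrative:

#include <shared_mutex>

std::shared_mutex rw_mutex;
int shared_value = 0;

// Writer takes the mutex exclusively: no readers or writers may overlap.
void write_value(int v)
{
    std::unique_lock<std::shared_mutex> lock(rw_mutex);
    shared_value = v;
}

// Readers take the mutex shared: any number may read concurrently.
int read_value()
{
    std::shared_lock<std::shared_mutex> lock(rw_mutex);
    return shared_value;
}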
You do not want to use volatile to control synchronization as jalf states in this answer:
For thread-safe accesses to shared data, we need a guarantee that:
the read/write actually happens (that the compiler won't just store the value in a register instead and defer updating main memory until much later);
that no reordering takes place. Assume that we use a volatile variable as a flag to indicate whether or not some data is ready to be read. In our code, we simply set the flag after preparing the data, so all looks fine. But what if the instructions are reordered so the flag is set first?
volatile does guarantee the first point. It also guarantees that no reordering occurs between different volatile reads/writes; all volatile memory accesses will occur in the order in which they're specified. That is all we need for what volatile is intended for: manipulating I/O registers or memory-mapped hardware. But it doesn't help us in multithreaded code, where the volatile object is often only used to synchronize access to non-volatile data; those accesses can still be reordered relative to the volatile ones.
As always, measure: if the performance is lacking you can try a different solution, but make sure to re-measure and compare after changing.
Lastly, Herb Sutter has an excellent presentation he gave at C++ and Beyond 2012, called atomic<> Weapons:
This is a two-part talk that covers the C++ memory model, how locks and atomics and fences interact and map to hardware, and more. Even though we’re talking about C++, much of this is also applicable to Java and .NET which have similar memory models, but not all the features of C++ (such as relaxed atomics).
I'll complete the previous answers a little bit.
As stated previously, just using int (or even volatile int) is not enough, for various reasons (even with the memory-order constraints of Intel processors).
So, yes, you should use atomic types for this, but with extra consideration: atomic types guarantee coherent access, but if you have visibility concerns you need to specify a memory barrier (memory order).
Barriers enforce visibility and ordering between threads; on Intel and most modern architectures they force cache synchronization, so updates become visible to every core. The problem is that this may be expensive if you're not careful.
The possible memory orders are:
relaxed: no special barrier; only coherent reads/writes of the variable itself are enforced;
sequential consistency: the strongest possible constraint (and the default);
acquire: no reads or writes after the current load may be reordered before it, and the required barrier is added so that released stores become visible;
consume: a weakened version of acquire that only constrains operations that depend on the loaded value;
release: no reads or writes before the current store may be reordered after it, and the writes made before it become visible to loads performing an acquire barrier.
So, if you want to be sure that updates to the variable are visible to readers, you need to flag your store with (at least) a release memory order and, on the reader side, use (at least) an acquire memory order. Otherwise, readers may not see the current version of the integer (they will at least see a coherent version, that is, the old value or the new one, but never an ugly mix of the two).
Of course, the default behavior (full sequential consistency) will also give you correct results, but at the expense of a lot of synchronization: each barrier forces a cache synchronization, which is almost as expensive as several cache misses (and thus reads/writes to main memory).
So, in short you should declare your int as atomic and use the following code for store and load:
// Your variable
std::atomic<int> v;
// Read
x = v.load(std::memory_order_acquire);
// Write
v.store(x, std::memory_order_release);
And just to complete the picture: sometimes (more often than you'd think) you don't really need sequential consistency, or even release/acquire consistency, because the visibility of updates is relative. In concurrent code an update takes effect not when the write is performed but when others see the change, so reading the old value for a little while is often not a problem!
I strongly recommend reading articles related to relativistic programming and RCU, here are some interesting links:
Relativistic Programming wiki: http://wiki.cs.pdx.edu/rp/
Structured Deferral: Synchronization via Procrastination: https://queue.acm.org/detail.cfm?id=2488549
Introduction to RCU Concepts: http://www.rdrop.com/~paulmck/RCU/RCU.LinuxCon.2013.10.22a.pdf
Let's start with plain int. In general, on a single-processor, single-core machine this should be sufficient, assuming the int is the same size as (or smaller than) the CPU word (like a 32-bit int on a 32-bit CPU). In this case, assuming correctly aligned addresses (which a high-level language should ensure by default), the write/read operations should be atomic. This is guaranteed by Intel, as stated in [1]. However, per the C++ specification, simultaneous reading and writing from different threads is undefined behaviour.
§1.10/6: Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.
Now volatile. This keyword disables almost every optimization on accesses to the variable, which is the reason it is used. For example, when optimizing, the compiler may decide that a variable you only read in one thread is constant there and simply replace it with its initial value; volatile prevents such problems. However, it does not make access to the variable atomic. Also, in most cases it is simply unnecessary, because proper multithreading tools, like a mutex or memory barrier, achieve the same effect as volatile on their own, as described for instance in [2].
While this may be sufficient for most uses, there are other operations that are not guaranteed to be atomic, incrementation being one. This is where std::atomic comes in: it has those operations defined, like the mentioned increments in [3]. It is also well defined when reading and writing from different threads [4].
In addition, as stated in the answers in [5], there are many other factors that may (negatively) influence the atomicity of operations, from loss of cache coherency between multiple cores to hardware details that change how operations are performed.
To summarize, std::atomic is created to support accesses from different threads and it is highly recommended to use it when multithreading.
[1] http://www.intel.com/Assets/PDF/manual/253668.pdf see section 8.1.1.
[2] https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
[3] http://en.cppreference.com/w/cpp/atomic/atomic/operator_arith
[4] http://en.cppreference.com/w/cpp/atomic/atomic
[5] Are C++ Reads and Writes of an int Atomic?
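To illustrate the increment point from [3], here is a sketch contrasting atomic read-modify-writes with what a plain int could not guarantee:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> counter{0};

int main()
{
    // Both forms are atomic read-modify-writes; with a plain int the same
    // loops would be a data race and could lose increments.
    std::thread a([] { for (int i = 0; i < 100000; ++i) counter.fetch_add(1); });
    std::thread b([] { for (int i = 0; i < 100000; ++i) ++counter; });
    a.join();
    b.join();
    std::printf("%d\n", counter.load()); // always 200000
}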
The other answers, which say to use atomic and not volatile, are correct when portability matters. If you’re asking this question, and it’s a good question, that’s the practical answer for you, not, “But, if the standard library doesn’t provide one, you can implement a lock-free, wait-free data structure yourself!” Nevertheless, if the standard library doesn’t provide one, you can implement a lock-free data structure yourself that works on a particular compiler and a particular architecture, provided that there’s only one writer. (Also, somebody has to implement those atomic primitives in the standard library.) If I’m wrong about this, I’m sure someone will kindly inform me.
If you absolutely need an algorithm guaranteed to be lock-free on all platforms, you might be able to build one with atomic_flag. If even that doesn’t suffice, and you need to roll your own data structure, you can do that.
Since there’s only one writer thread, your CPU might guarantee that certain operations on your data will still work atomically even if you just use normal accesses instead of locks or even compare-and-swaps. This is not safe according to the language standard, because C++ has to work on architectures where it isn’t, but it can be safe, for example, on an x86 CPU if you guarantee that the variable you’re updating fits into a single cache line that it doesn’t share with anything else, and you might be able to ensure this with nonstandard extensions such as __attribute__ (( aligned (x) )).
Similarly, your compiler might provide some guarantees: g++ in particular makes guarantees about how the compiler will not assume that the memory referenced by a volatile* hasn’t changed unless the current thread could have changed it. It will actually re-read the variable from memory each time you dereference it. That is in no way sufficient to ensure thread-safety, but it can be handy if another thread is updating the variable.
A real-world example might be: the writer thread maintains some kind of pointer (on its own cache line) which points to a consistent view of the data structure that will remain valid through all future updates. It updates its data with the RCU pattern, making sure to use a release operation (implemented in an architecture-specific way) after updating its copy of the data and before making the pointer to that data globally visible, so that any other thread that sees the updated pointer is guaranteed to see the updated data as well. The reader then makes a local copy (not volatile) of the current value of the pointer, getting a view of the data which will stay valid even after the writer thread updates again, and works with that. You want to use volatile on the single variable that notifies the readers of the updates, so they can see those updates even if the compiler “knows” your thread couldn’t have changed it. In this framework, the shared data just needs to be constant, and readers will use the RCU pattern. That’s one of the two ways I’ve seen volatile be useful in the real world (the other being when you don’t want to optimize out your timing loop).
There also needs to be some way, in this scheme, for the program to know when nobody’s using an old view of the data structure any longer. If that’s a count of readers, that count needs to be atomically modified in a single operation at the same time as the pointer is read (so getting the current view of the data structure involves an atomic CAS). Or this might be a periodic tick when all the threads are guaranteed to be done with the data they’re working with now. It might be a generational data structure where the writer rotates through pre-allocated buffers.
Also observe that a lot of things your program might do could implicitly serialize the threads: those atomic hardware instructions lock the processor bus and force other CPUs to wait, those memory fences could stall your threads, or your threads might be waiting in line to allocate memory from the heap.
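A compact sketch of the pointer-publication part of that scheme, using a portable C++ release store in place of the architecture-specific operation; Snapshot is hypothetical, and the safe reclamation of old snapshots (the hard part of RCU, as noted above) is deliberately omitted:

#include <atomic>

struct Snapshot { int a, b; }; // hypothetical immutable view of the data

std::atomic<Snapshot*> current{new Snapshot{0, 0}};

void writer_update()
{
    Snapshot* next = new Snapshot{1, 2}; // build the new view privately
    // Release store: any reader that sees the new pointer also sees the
    // fully constructed data it points to.
    current.store(next, std::memory_order_release);
    // NOTE: the old snapshot leaks here; real RCU defers its reclamation.
}

void reader_use()
{
    // Local copy of the pointer: this view stays valid (for this reader)
    // even if the writer publishes another update afterwards.
    Snapshot* view = current.load(std::memory_order_acquire);
    int sum = view->a + view->b;
    (void)sum;
}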
Unfortunately it depends.
When a variable is read and written by multiple threads, there are two possible failures:
1) Tearing, where half the data is pre-change and half is post-change.
2) Stale data, where the data read has some older value.
int, volatile int and std::atomic<int> all avoid tearing (for aligned, word-sized types on mainstream hardware; only std::atomic is guaranteed by the standard).
Stale data is a different issue; still, every value read will have existed at some point, so it can be regarded as a valid, if old, value.
volatile tells the compiler neither to cache the data nor to re-order operations around it. This improves coherence between threads by ensuring that all operations in a thread occur either before, at, or after the volatile access.
This means that in

volatile int x;
int y;

y = 5;
x = 7;

the write of x = 7 will be issued after y = 5.
Unfortunately, the CPU is also capable of re-ordering operations, so another thread may still see x == 7 before y == 5.
std::atomic<int> x; would give the guarantee that after seeing x == 7, another thread will see y == 5 (assuming no other thread is modifying y).
So all reads of int, volatile int and std::atomic<int> will show some previous valid value of x; volatile and atomic strengthen the ordering guarantees.
See kernel.org barriers
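A sketch of that guarantee: the release store to x publishes the earlier plain write to y, and the acquire load on the reader side makes it visible (assuming no other thread modifies y):

#include <atomic>
#include <cassert>
#include <thread>

int y = 0;
std::atomic<int> x{0};

void writer()
{
    y = 5;                                 // plain write...
    x.store(7, std::memory_order_release); // ...published by the release store
}

void reader()
{
    while (x.load(std::memory_order_acquire) != 7) {} // spin on the flag
    assert(y == 5); // guaranteed: the acquire load synchronizes with the release store
}

int main()
{
    std::thread w(writer), r(reader);
    w.join();
    r.join();
}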
I have a simple type variable (like int). I have one process, one writer thread, several "readonly" threads. How should I declare the variable?
volatile int
std::atomic<int>
int
Use std::atomic with memory_order_relaxed for the store and load.
It's quick and, from your description of your problem, safe. E.g.:
void func_fast()
{
    std::atomic<int> a;
    a.store(1, std::memory_order_relaxed);
}
Compiles to:
func_fast():
    movl $1, -24(%rsp)
    ret
This assumes you don't need to guarantee that any other data is seen to be written before the integer is updated, and therefore the slower and more complicated synchronisation is unnecessary.
If you use the atomic naively, like this:
void func_slow()
{
    std::atomic<int> b;
    b = 1;
}
you get an MFENCE instruction, because plain assignment with no memory_order_* specification defaults to sequential consistency. That is massively slower (100 cycles or more, vs. just 1 or 2 for the bare MOV):
func_slow():
    movl $1, -24(%rsp)
    mfence
    ret
See http://goo.gl/svPpUa
(Interestingly on Intel the use of memory_order_release and _acquire for this code results in the same assembly language. Intel guarantees that writes and reads happen in order when using the standard MOV instruction).
Here is my attempt at the bounty:
a. The general answer already given above is "use atomics". This is the correct answer; volatile is not enough.
b. If you dislike that answer, and you are on Intel, and you have a properly aligned int, and you love unportable solutions, you can get away with plain volatile, relying on Intel's strong memory-ordering guarantees.
TL;DR: Use std::atomic<int>, with a mutex around it if you read it multiple times.
It depends on how strong a guarantee you want.
First, volatile is a compiler hint, and you shouldn't count on it doing anything helpful.
If you use a plain int you can suffer from memory aliasing. Say you have something like

struct {
    int x;
    bool q;
};

Depending on how this is aligned in memory, and on the exact implementation of the CPU and memory bus, it's possible that writing to q will actually overwrite x when the line is copied from the CPU cache back to RAM. So unless you know how much memory to allocate around your int, it's not guaranteed that your writer can write without being overwritten by some other thread.
Also, even after you write, you depend on the processor to reload the data into the caches of the other cores, so there's no guarantee that your other thread will see the new value.
std::atomic<int> basically guarantees that you will always allocate sufficient, properly aligned memory, so that you don't suffer from aliasing. Depending on the memory order requested, it will also disable a bunch of optimizations, like caching, so everything runs slightly slower.
This still doesn't guarantee that if you read the variable multiple times you'll get the same value. The only way to do that is to put a mutex around it, blocking the writer from changing it.
Still, you're better off finding a library that already solves the problem you have and has been tested by others to make sure it works well.

Dangers of simultaneous write and read of a boolean in a simple situation

I've read some similar questions but the situations described there are bit more complicated.
I have a bool b, initialized as false on the heap, and two threads. I do understand that operations on bools are not atomic, but please read the question till the end.
First thread can set b = true only once and doesn't do anything else with it.
Second thread checks b in a loop and if it's true does some actions.
Do I need to use some synchronization mechanism(like mutexes) to protect b?
What can happen if I don't? With ints I can obviously get arbitrary values when I read and write at the same time. But with bools there are just true and false, and I don't mind occasionally getting false instead of true. Is it a potential SIGSEGV?
Data races result in undefined behavior. As far as the standard is concerned, a conforming implementation is permitted to segfault.
In practice the main danger is that without synchronization, the compiler will observe enough of the code in the reader loop to judge that b "never changes", and optimize out all but the first read of the value. It can do this because if it observes that there is no synchronization in the loop, then it knows that any write to the value would be a data race. The optimizer is permitted to assume that your program does not provoke undefined behavior, so it is permitted to assume that there are no writes from other threads.
Marking b as volatile will prevent this particular optimization in practice, but even on volatile objects data races are undefined behavior. Calling into code that the optimizer "can't see" will also prevent the optimization in practice, since it doesn't know whether that code modifies b. Of course with link-time/whole-program optimization there is less that the optimizer can't see, than with compile-time-only optimization.
Anyway, preventing the optimization from being made in software doesn't prevent the equivalent thing happening in hardware on a system with non-coherent caches (at least, so I claim: other people argue that this is not correct, and that volatile accesses are required to read/write through caches. Some implementations do behave that way). If you're asking about what the standard says then it doesn't really matter whether or not the hardware shows you a stale cache indefinitely, since behavior remains undefined and so the implementation can break your code regardless of whether this particular optimization is the thing that breaks it.
The problem you might get is that we don't know how long it takes for the reader thread to see the changed value. If they are on different CPUs, with separate caches, there are no guarantees unless you use a memory barrier to synchronize the caches.
On an x86 this is handled automatically by the hardware protocol, but not on some other systems.
Do I need to use some synchronization mechanism(like mutexes) to protect b?
If you don't, you have a data race. Programs with data races have undefined behaviour. The answer to this question is the same as the answer to the question "Do you want your program to have well-defined behaviour?"
What can happen if I don't?
Theoretically, anything can happen. That's what undefined behaviour means. The most likely bad thing that can happen is that the "second thread" may never see a true value.
The compiler can assume that a program has no data races (if it has the behaviour is not defined by the standard, so behaving as if it didn't is fine). Since the second thread only ever reads from a variable that has the value false, and there's no synchronization that affects those reads, the logical conclusion is that the value never changes, and thus the loop is infinite. (and some infinite loops have undefined behaviour in C++11!)
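The conventional fix is to make the flag a std::atomic<bool>; sketched below, the load then cannot be hoisted out of the loop and there is no undefined behaviour:

#include <atomic>
#include <thread>

std::atomic<bool> b{false};

void writer()
{
    // ... prepare whatever data the flag guards ...
    b.store(true, std::memory_order_release); // set once
}

void reader()
{
    while (!b.load(std::memory_order_acquire)) // re-read every iteration
        std::this_thread::yield();
    // safe to act on the data the writer prepared before setting the flag
}

int main()
{
    std::thread r(reader), w(writer);
    w.join();
    r.join();
}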
Here are a few alternative solutions:
Use a Mutex, details have been covered in the other answers above.
Consider using a read/write lock which will manage/protect simultaneous reads and writes. The pthread lib provides an implementation: pthread_rwlock_t
Depending on what your application is doing, consider using a condition variable (pthread lib impl: pthread_cond_t). This is effectively a signal from one thread to another, which could allow you to remove your while loop and bool checking.
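For the last option, a sketch using std::condition_variable rather than the pthread API (the shape is the same): the mutex orders all accesses, so a plain bool is fine, and the reader sleeps instead of spinning:

#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool b = false; // plain bool is fine: every access happens under the mutex

void writer()
{
    {
        std::lock_guard<std::mutex> lock(m);
        b = true; // set once
    }
    cv.notify_one(); // wake the waiting reader
}

void reader()
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return b; }); // no busy loop; handles spurious wakeups
    // b is true here, and the mutex provides the synchronization
}

int main()
{
    std::thread r(reader), w(writer);
    w.join();
    r.join();
}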
Making the boolean volatile will suffice (on x86 architecture), no mutex needed:
volatile bool b;