I’ve rarely thought about what happens between two consecutive expressions, between the call to a function and the execution of its body's first expression, or between a call to a constructor and the execution of its initializer. Then I started reading about concurrency...
1.) In two consecutive calls to std::thread’s constructor with the same callable (e.g. function, functor, lambda), whose body begins with a std::lock_guard initialization with the same std::mutex object, does the standard guarantee the thread corresponding to the first thread constructor call executes the lock-protected code first?
2.) If the standard doesn’t make the guarantee, then is there any theoretical or practical possibility the thread corresponding to the second thread constructor call executes the protected code first? (e.g. heavy system load during the execution of the initializer or body of the first thread constructor call)
Here’s a global std::mutex object m and a global unsigned num initialized to 1. There is nothing but whitespace between function foo’s body’s opening brace { and the std::lock_guard. In main, there are two std::threads t1 and t2. t1 calls the thread constructor first. t2 calls the thread constructor second. Each thread is constructed with a pointer to foo. t1 calls foo with unsigned argument 1. t2 calls foo with unsigned argument 2. Depending on which thread locks the mutex first, num’s value will be either a 4 or a 3 after both threads have executed the lock-protected code. num will equal 4 if t1 beats t2 to the lock. Otherwise, num will equal 3. I ran 100,000 trials of this by looping and resetting num to 1 at the end of each loop. (As far as I know, the results don’t and shouldn’t depend on which thread is join()ed first.)
#include <thread>
#include <mutex>
#include <iostream>
std::mutex m;
unsigned short num = 1;
void foo(unsigned short par) {
std::lock_guard<std::mutex> guard(m);
if (1 == num)
num += par;
else
num *= par;
}
int main() {
unsigned count = 0;
for (unsigned i = 0; i < 100000; ++i) {
std::thread t1(foo, 1);
std::thread t2(foo, 2);
t1.join();
t2.join();
if (4 == num) {
++count;
}
num = 1;
}
std::cout << count << std::endl;
}
In the end, count equals 100000, so it turns out t1 wins the race every time. But these trials don’t prove anything.
3.) Does the standard mandate “first to call thread constructor” always implies “first to call the callable passed to the thread constructor”?
4.) Does the standard mandate “first to call the callable passed to the thread constructor” always implies “first to lock the mutex”; provided that within the callable’s body, there exists no code dependent upon the parameter(s) passed to the callable prior to the line with the std::lock_guard initialization? (Also rule out any callable’s local static variable, like a counter of number of times called, which can be used to intentionally delay certain calls.)
1.) No, the standard doesn't guarantee that the first thread gets the lock first. Basically, if you need to impose an ordering between threads, you'll need to synchronize between these threads. Even if the first thread gets to call the mutex lock function first, the second thread may acquire the lock first.
2.) Absolutely. For example, there may be just one core available to your application at the time the threads are spawned, and if the spawning thread decides to wait on something after the second thread is spawned, the scheduler may decide to run the latest thread it has seen, which is the second thread. Even if there are many cores available, there are plenty of reasons the second thread may be faster.
3.) No, why would it! The first step is to spawn a thread and carry on. By the time the first function object is called, the second thread can already be running and calling its function object.
4.) No. There are no ordering guarantees between threads unless you explicitly impose them yourself, as they would defeat the purpose of concurrency.
I have a piece of multithreaded code, and I'm not sure whether it is liable to a data race because of compiler reordering.
Here is a minimal example:
#include <thread>
int main()
{
int x = 0;
x = 5;
auto t = std::thread([&x]()
{
++x;
});
t.join();
return 0;
}
Is the assignment of x = 5 guaranteed to be before the thread start?
Short answer: The code will work as expected. No reordering will take place.
Long answer:
Compile time reordering
Let's consider what's going on.
1. You put a variable in automatic storage (x)
2. You create an object that holds a reference to this variable (the lambda)
3. You pass that object to an external function (the thread constructor)
The compiler does escape analysis during optimization. Due to this sequence of events, the variable x has escaped once point 3 is reached. Which means from the compiler's point of view, any external function (except those marked as pure) may read or modify the variable. Therefore its value has to be stored to the stack before each function call and has to be loaded from stack after the function.
You did not make x an atomic variable, so the compiler is free to ignore any potential multithreading effects. Therefore the value need not be reloaded from memory between calls to external functions. It may still be reloaded if the compiler decides not to keep the value in a register between uses.
Let's annotate and expand your source code to show it:
int main()
{
int x = 0;
x = 5; // stores on stack for use by external function in next line
auto t = std::thread([&x]() mutable
{
++x;
});
int x1 = x; // loads x from stack after thread constructor may (in theory) have modified it
int x2 = x; // probably no reload because not an atomic variable
x = 7; // new value stored on stack because join() could access it (in theory)
t.join();
int x3 = x; // reload from stack because join() could have changed it
return 0;
}
Again, this has nothing to do with multithreading. Escape analysis and external function calls are sufficient.
Any access from main() between thread creation and joining would also be undefined behavior because it would be a data-race on a non-atomic variable. But that's just a side-note.
This takes care of the compiler behavior. But what about the CPU? May it reorder instructions?
Run time reordering
For this, we can look at the C++ standard Section 32.4.2.2 [thread.thread.constr] clause 7:
Synchronization: The completion of the invocation of the constructor synchronizes with the beginning of the invocation of the copy of f.
The constructor means the thread constructor. f is the thread function, meaning the lambda in your case. So this means that any memory effects are synchronized properly.
The join() call also synchronizes. Therefore access to x after the join can not suffer from runtime-reordering.
The completion of the thread represented by *this synchronizes with (6.9.2) the corresponding successful join() return.
Side note
Unlike suggested in some comments, the compiler will not optimize the thread creation away for two reasons: 1. No compiler is sufficiently magical to figure this out. 2. The thread creation may fail, which is defined behavior. Therefore it has to be included in the runtime.
I read a lot & I'm still unsure if I understood it or not (I'm a woodworker).
Let's suppose that I have a function:
void class_test::example_1(int a, int b, char c)
{
//do stuff
int v;
int k;
char z;
if(condition)
{
std::thread thread_in_example (&class_test::example_1, & object, v ,k ,z);
thread_in_example.detach();
}
}
Now if I call it:
std::thread example (&class_test::example_1, &object, a, b, c);
example.detach();
Question: What happens to thread_in_example when example completes and "deletes" itself? Is thread_in_example going to lose access to its parameters?
I thought that std::thread made a copy of its arguments unless they are passed by reference, but on http://en.cppreference.com/w/cpp/thread/thread I can't really understand this part (due to my lack of knowledge of programming/English/computer science semantics):
std::thread objects may also be in the state that does not represent any thread (after default construction, move from, detach, or join), and a thread of execution may be not associated with any thread objects (after detach).
and this one too:
No two std::thread objects may represent the same thread of execution;
std::thread is not CopyConstructible or CopyAssignable, although it is
MoveConstructible and MoveAssignable.
So I've doubts on how it really works.
From this std::thread::detach reference:
Separates the thread of execution from the thread object, allowing execution to continue independently. Any allocated resources will be freed once the thread exits.
[Emphasis mine]
Among those "allocated resources" will be the arguments, which means you can still safely use the arguments in the detached thread.
Unless, of course, the arguments are references or pointers to objects that are destroyed independently of the detached thread, for example locals of the thread that created the detached thread.
I am confused with the description of thread_local in C++11. My understanding is, each thread has unique copy of local variables in a function. The global/static variables can be accessed by all the threads (possibly synchronized access using locks). And the thread_local variables are visible to all the threads but can only modified by the thread for which they are defined? Is it correct?
Thread-local storage duration is a term used to refer to data that is seemingly global or static storage duration (from the viewpoint of the functions using it) but, in actual fact, there is one copy per thread.
It adds to the current options:
automatic (exists during a block or function);
static (exists for the program duration); and
dynamic (exists on the heap between allocation and deallocation).
Something that is thread-local is brought into existence at thread creation time and disposed of when the thread finishes.
For example, think of a random number generator where the seed must be maintained on a per-thread basis. Using a thread-local seed means that each thread gets its own random number sequence, independent of all other threads.
If your seed was a local variable within the random function, it would be initialised every time you called it, giving you the same number each time. If it was a global, threads would interfere with each other's sequences.
Another example is something like strtok where the tokenisation state is stored on a thread-specific basis. That way, a single thread can be sure that other threads won't screw up its tokenisation efforts, while still being able to maintain state over multiple calls to strtok - this basically renders strtok_r (the thread-safe version) redundant.
Yet another example would be something like errno. You don't want separate threads modifying errno after one of your calls fails, but before you've had a chance to check the result.
This site has a reasonable description of the different storage duration specifiers.
When you declare a variable thread_local then each thread has its own copy. When you refer to it by name, then the copy associated with the current thread is used. e.g.
thread_local int i=0;
void f(int newval){
i=newval;
}
void g(){
std::cout<<i;
}
void threadfunc(int id){
f(id);
++i;
g();
}
int main(){
i=9;
std::thread t1(threadfunc,1);
std::thread t2(threadfunc,2);
std::thread t3(threadfunc,3);
t1.join();
t2.join();
t3.join();
std::cout<<i<<std::endl;
}
This code will output "2349", "3249", "4239", "4329", "2439" or "3429", but never anything else. Each thread has its own copy of i, which is assigned to, incremented and then printed. The thread running main also has its own copy, which is assigned to at the beginning and then left unchanged. These copies are entirely independent, and each has a different address.
It is only the name that is special in that respect --- if you take the address of a thread_local variable then you just have a normal pointer to a normal object, which you can freely pass between threads. e.g.
thread_local int i=0;
void thread_func(int*p){
*p=42;
}
int main(){
i=9;
std::thread t(thread_func,&i);
t.join();
std::cout<<i<<std::endl;
}
Since the address of i is passed to the thread function, then the copy of i belonging to the main thread can be assigned to even though it is thread_local. This program will thus output "42". If you do this, then you need to take care that *p is not accessed after the thread it belongs to has exited, otherwise you get a dangling pointer and undefined behaviour just like any other case where the pointed-to object is destroyed.
thread_local variables are initialized "before first use", so if they are never touched by a given thread then they are not necessarily ever initialized. This is to allow compilers to avoid constructing every thread_local variable in the program for a thread that is entirely self-contained and doesn't touch any of them. e.g.
struct my_class{
my_class(){
std::cout<<"hello";
}
~my_class(){
std::cout<<"goodbye";
}
};
void f(){
thread_local my_class unused;
}
void do_nothing(){}
int main(){
std::thread t1(do_nothing);
t1.join();
}
In this program there are 2 threads: the main thread and the manually-created thread. Neither thread calls f, so the thread_local object is never used. It is therefore unspecified whether the compiler will construct 0, 1 or 2 instances of my_class, and the output may be "", "hellohellogoodbyegoodbye" or "hellogoodbye".
Thread-local storage is in every aspect like static (= global) storage, only that each thread has a separate copy of the object. The object's lifetime starts either at thread start (for global variables) or at first initialization (for block-local statics), and ends when the thread ends (i.e. when the thread's function returns, which is what join() waits for).
Consequently, only variables that could also be declared static may be declared as thread_local, i.e. global variables (more precisely: variables "at namespace scope"), static class members, and block-static variables (in which case static is implied).
As an example, suppose you have a thread pool and want to know how well your work load was being balanced:
thread_local Counter c;
void do_work()
{
c.increment();
// ...
}
int main()
{
std::thread t(do_work); // your thread-pool would go here
t.join();
}
This would print thread usage statistics, e.g. with an implementation like this:
struct Counter
{
unsigned int c = 0;
void increment() { ++c; }
~Counter()
{
std::cout << "Thread #" << std::this_thread::get_id() << " was called "
<< c << " times" << std::endl;
}
};
Derived from this question and related to this question:
If I construct an object in one thread and then convey a reference/pointer to it to another thread, is it thread un-safe for that other thread to access the object without explicit locking/memory-barriers?
// thread 1
Obj obj;
anyLeagalTransferDevice.Send(&obj);
while(1); // never let obj go out of scope
// thread 2
anyLeagalTransferDevice.Get()->SomeFn();
Alternatively: is there any legal way to convey data between threads that doesn't enforce memory ordering with regards to everything else the thread has touched? From a hardware standpoint I don't see any reason it shouldn't be possible.
To clarify: the question is with regard to cache coherency, memory ordering and whatnot. Can Thread 2 get and use the pointer before Thread 2's view of memory includes the writes involved in constructing obj? To misquote Alexandrescu(?), "Could a malicious CPU designer and compiler writer collude to build a standard-conforming system that makes that break?"
Reasoning about thread-safety can be difficult, and I am no expert on the C++11 memory model. Fortunately, however, your example is very simple. I rewrite the example, because the constructor is irrelevant.
Simplified Example
Question: Is the following code correct? Or can the execution result in undefined behavior?
// Legal transfer of pointer to int without data race.
// The receive function blocks until send is called.
void send(int*);
int* receive();
// --- thread A ---
/* A1 */ int* pointer = receive();
/* A2 */ int answer = *pointer;
// --- thread B ---
int answer;
/* B1 */ answer = 42;
/* B2 */ send(&answer);
// wait forever
Answer: There may be a data race on the memory location of answer, and thus the execution results in undefined behavior. See below for details.
Implementation of Data Transfer
Of course, the answer depends on the possible and legal implementations of the functions send and receive. I use the following data-race-free implementation. Note that only a single atomic variable is used, and all memory operations use std::memory_order_relaxed. Basically this means, that these functions do not restrict memory re-orderings.
std::atomic<int*> transfer{nullptr};
void send(int* pointer) {
transfer.store(pointer, std::memory_order_relaxed);
}
int* receive() {
while (transfer.load(std::memory_order_relaxed) == nullptr) { }
return transfer.load(std::memory_order_relaxed);
}
Order of Memory Operations
On multicore systems, a thread can see memory changes in a different order than other threads see them. In addition, both compilers and CPUs may reorder memory operations within a single thread for efficiency - and they do this all the time. Atomic operations with std::memory_order_relaxed do not participate in any synchronization and do not impose any ordering.
In the above example, the compiler is allowed to reorder the operations of thread B, and execute B2 before B1, because the reordering has no effect on the thread itself.
// --- valid execution of operations in thread B ---
int answer;
/* B2 */ send(&answer);
/* B1 */ answer = 42;
// wait forever
Data Race
C++11 defines a data race as follows (N3290 C++11 Draft): "The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior." And the term happens before is defined earlier in the same document.
In the above example, B1 and A2 are conflicting and non-atomic operations, and neither happens before the other. This is obvious, because I have shown in the previous section, that both can happen at the same time.
That's the only thing that matters in C++11. In contrast, the Java Memory Model also tries to define the behavior if there are data races, and it took them almost a decade to come up with a reasonable specification. C++11 didn't make the same mistake.
Further Information
I'm a bit surprised that these basics are not well known. The definitive source of information is the section Multi-threaded executions and data races in the C++11 standard. However, the specification is difficult to understand.
A good starting point is Hans Boehm's talks - e.g. available as online videos:
Threads and Shared Variables in C++11
Getting C++ Threads Right
There are also a lot of other good resources, I have mentioned elsewhere, e.g.:
std::memory_order - cppreference.com
There is no parallel access to the same data, so there is no problem:
Thread 1 starts execution of Obj::Obj().
Thread 1 finishes execution of Obj::Obj().
Thread 1 passes reference to the memory occupied by obj to thread 2.
Thread 1 never does anything else with that memory (soon after, it falls into infinite loop).
Thread 2 picks-up the reference to memory occupied by obj.
Thread 2 presumably does something with it, undisturbed by thread 1 which is still infinitely looping.
The only potential problem is if Send didn't act as a memory barrier, but then it wouldn't really be a "legal transfer device".
As others have alluded to, the only way in which a constructor is not thread-safe is if something somehow gets a pointer or reference to it before the constructor is finished, and the only way that would occur is if the constructor itself has code that registers the this pointer to some type of container which is shared across threads.
Now in your specific example, Branko Dimitrijevic gave a good complete explanation how your case is fine. But in the general case, I'd say to not use something until the constructor is finished, though I don't think there's anything "special" that doesn't happen until the constructor is finished. By the time it enters the (last) constructor in an inheritance chain, the object is pretty much fully "good to go" with all of its member variables being initialized, etc. So no worse than any other critical section work, but another thread would need to know about it first, and the only way that happens is if you're sharing this in the constructor itself somehow. So only do that as the "last thing" if you are.
It is only safe (sort of) if you wrote both threads, and know the first thread is not accessing it while the second thread is. For example, if the thread constructing it never accesses it after passing the reference/pointer, you would be OK. Otherwise it is thread unsafe. You could change that by making all methods that access data members (read or write) take a lock.
I only read this question just now, but I'll still post my comments:
Static Local Variable
There is a reliable way to construct objects when you are in a multi-thread environment, that is using a static local variable (static local variable-CppCoreGuidelines),
From the above reference: "This is one of the most effective solutions to problems related to initialization order. In a multi-threaded environment the initialization of the static object does not introduce a race condition (unless you carelessly access a shared object from within its constructor)."
Also note from the reference, if the destruction of X involves an operation that needs to be synchronized you can create the object on the heap and synchronize when to call the destructor.
Below is an example I wrote to show the Construct On First Use Idiom, which is basically what the reference talks about.
#include <chrono>
#include <iostream>
#include <thread>
#include <utility>
#include <vector>
class ThreadConstruct
{
public:
ThreadConstruct(int a, float b) : _a{a}, _b{b}
{
std::cout << "ThreadConstruct construct start" << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(2));
std::cout << "ThreadConstruct construct end" << std::endl;
}
void get()
{
std::cout << _a << " " << _b << std::endl;
}
private:
int _a;
float _b;
};
struct Factory
{
template<class T, typename ...ARGS>
static T& get(ARGS&&... args) // forwarding references, so std::forward below is meaningful
{
//thread safe object instantiation
static T instance(std::forward<ARGS>(args)...);
return instance;
}
};
//thread pool
class Threads
{
public:
Threads()
{
for (size_t num_threads = 0; num_threads < 5; ++num_threads) {
thread_pool.emplace_back(&Threads::run, this);
}
}
void run()
{
//thread safe constructor call
ThreadConstruct& thread_construct = Factory::get<ThreadConstruct>(5, 10.1);
thread_construct.get();
}
~Threads()
{
for(auto& x : thread_pool) {
if(x.joinable()) {
x.join();
}
}
}
private:
std::vector<std::thread> thread_pool;
};
int main()
{
Threads thread;
return 0;
}
Output:
ThreadConstruct construct start
ThreadConstruct construct end
5 10.1
5 10.1
5 10.1
5 10.1
5 10.1