Data race in parallelized std::for_each - c++

On the cpp reference website on execution policy there is an example like this:
std::atomic<int> x{0};
int a[] = {1, 2};
std::for_each(std::execution::par, std::begin(a), std::end(a), [&](int) {
    x.fetch_add(1, std::memory_order_relaxed);
    while (x.load(std::memory_order_relaxed) == 1) { } // Error: assumes execution order
});
As you can see, it is an example of (supposedly) erroneous code. But I do not really understand what the error is here; it does not seem to me that any part of the code assumes an execution order. AFAIK, the first thread to fetch_add will wait for the second one, but that's it, no problematic behaviour. Am I missing something, or is there really an error here?

The execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be parallelized. The invocations of element access functions in parallel algorithms invoked with this policy (usually specified as std::execution::par) are permitted to execute in either the invoking thread or in a thread implicitly created by the library to support parallel algorithm execution. Any such invocations executing in the same thread are indeterminately sequenced with respect to each other.
As far as I can see, the issue here is that there is no guarantee about how many threads are used. If the implementation uses a single thread, there is an endless loop: while (x.load(std::memory_order_relaxed) == 1) { } never completes.
So I guess the comment means that this code wrongly relies on multiple threads executing the invocations concurrently, so that fetch_add gets called a second time while the first invocation is still spinning.
The only guarantee you get is that for each thread, the invocations are not interleaved.
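For contrast, here is a minimal sketch (mine, not from cppreference) of the same loop with the ordering assumption removed: each invocation only increments, so the code is correct no matter how the library distributes the invocations across threads.

#include <algorithm>
#include <atomic>
#include <execution>
#include <iterator>

int main()
{
    std::atomic<int> x{0};
    int a[] = {1, 2};
    // No invocation waits on another, so nothing depends on how many threads
    // the library uses or which thread runs which invocation.
    std::for_each(std::execution::par, std::begin(a), std::end(a), [&](int) {
        x.fetch_add(1, std::memory_order_relaxed);
    });
    return x.load() == 2 ? 0 : 1;
}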

What do the execution policies in std::copy_n really mean?

I just discovered that std::copy_n provides overloads for different execution policies. Yet I find cppreference quite hard to understand here as (I suppose) it is kept very general. So I have difficulties putting together what actually goes on.
I don't really understand the explanation of the first policy:
The execution policy type used as a unique type to disambiguate parallel algorithm overloading and require that a parallel algorithm's execution may not be parallelized. The invocations of element access functions in parallel algorithms invoked with this policy (usually specified as std::execution::seq) are indeterminately sequenced in the calling thread.
To my understanding this means that we don't parallelize (multithread) here and each element access is sequential like in strcpy. This basically means to me that one thread runs through the function and I'm done. But then there is
invocations of element access functions in parallel algorithms.
What now? Are there still parallel algorithms? How?
The second execution policy states that:
Any such invocations executing in the same thread are indeterminately sequenced with respect to each other.
What I imagine that means is this: Each thread starts at a different position, e.g. the container is split up into multiple segments and each thread copies one of those segments. The threads are created by the library just to run the algorithm. Am I correct in assuming so?
From the third policy:
The invocations of element access functions in parallel algorithms invoked with this policy are permitted to execute in an unordered fashion in unspecified threads, and unsequenced with respect to one another within each thread.
Does this mean the above-mentioned container "segments" need not be copied one after another but can be copied in a random fashion? If so, why is that important enough to justify an extra policy? When I have multiple threads, they will need to be somewhat intermixed anyway to keep synchronisation to a minimum, no?
So here's my probably incorrect current understanding of the policies. Please correct me!
sequenced_policy: 1 thread executes the algorithm and copies everything from A - Z.
parallel_policy: Lib creates new threads specifically for copying, whereas each thread's copied segment has to follow the other (sequenced)?
parallel_unsequenced_policy: try to multithread and SIMD. Copied segments can be intermixed by thread (it doesn't matter where you start).
unsequenced_policy: Try to use SIMD but only singlethreaded.
Your summary of the basic idea of each policy is basically correct.
Does this mean the above mentioned container "segments" need not be copied one after another but can be copied in random fashion? If so, why is this so important to justify an extra policy?
The extra policies for unsequenced_policy and parallel_unsequenced_policy are necessary because they impose an extra requirement on calling code [1]:
The behavior of a program is undefined if it invokes a vectorization-unsafe standard library function from user code called from a execution::unsequenced_policy algorithm.
[and a matching restriction for parallel_unsequenced_policy.]
These four execution policies are used for algorithms in general. The mention of user code called from execution of the algorithm mostly applies to things like std::for_each, or std::generate, where you tell the algorithm to invoke a function. Here's one of the examples from the standard:
int a[] = {0, 1};
std::vector<int> v;
std::for_each(std::execution::par, std::begin(a), std::end(a), [&](int i) {
    v.push_back(i*2+1); // incorrect: data race
});
This particular example shows a problem created by parallel execution: you might have two threads trying to invoke push_back on v concurrently, giving a data race.
If you use for_each with one of the unsequenced policies, that imposes a further constraint on what your code can do.
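To make the unsequenced restriction concrete, here is a sketch (my example, in the spirit of the standard's illustration of vectorization-unsafety; the mutex and the names are not from the quoted text) of code that is fine under par but would be undefined under par_unseq, because lock() and unlock() may end up unsequenced within a single thread:

#include <algorithm>
#include <execution>
#include <mutex>
#include <vector>

int main()
{
    std::vector<int> v(1000, 1);
    int sum = 0;
    std::mutex m;
    // Fine with std::execution::par: each invocation runs to completion on some thread,
    // so every lock() is paired with its unlock() before that thread starts another invocation.
    std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
        std::lock_guard<std::mutex> guard(m);
        sum += x;
    });
    // With std::execution::par_unseq the same lambda would be undefined behaviour:
    // invocations may be interleaved within one thread, so that thread could try to
    // acquire the mutex again before releasing it.
    return sum == 1000 ? 0 : 1;
}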
When we look specifically at std::copy_n, that's probably less of a problem as a rule, because we're not passing it some code to be invoked. Well, we're not doing so directly, anyway. In reality, we are potentially doing so indirectly though. std::copy_n uses the assignment operator for the item being copied. So, for example, consider something like this:
#include <algorithm>
#include <execution>
#include <vector>

class foo {
    static int copy_count;
    int data;
public:
    foo &operator=(foo const &other) {
        data = other.data;
        ++copy_count;   // unsynchronized write to shared static state
        return *this;
    }
};

int foo::copy_count;

int main() {
    std::vector<foo> a;
    std::vector<foo> b;
    // code to fill a with data goes here
    b.resize(a.size());
    std::copy_n(std::execution::par, a.begin(), a.size(), b.begin());
}
Our copy assignment operator accesses copy_count without synchronization. If we specify sequential execution, that's fine, but if we specify parallel execution we're now (potentially) invoking it concurrently on two or more threads, so we have a data race.
I'd probably have to work harder to put together a somewhat coherent reason for an assignment operator to do something that was vectorization-unsafe, but that doesn't mean it doesn't exist.
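For completeness, here is a minimal sketch (my addition, not part of the original answer) of one way to make a modified version of the class above safe under parallel execution: make copy_count a std::atomic<int> and bump it with a relaxed increment, which is enough because we only count and rely on no further ordering.

#include <atomic>

class foo {
    static std::atomic<int> copy_count;  // atomic: concurrent increments are no longer a data race
    int data;
public:
    foo &operator=(foo const &other) {
        data = other.data;
        copy_count.fetch_add(1, std::memory_order_relaxed); // just counting, no ordering needed
        return *this;
    }
};

std::atomic<int> foo::copy_count{0};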
Summary
We have four separate execution policies because each imposes unique constraints on what you can do in your code. In the specific cases of std::copy or std::copy_n, those constraints apply primarily to the assignment operator for the items in the collection being copied.
[1] N4835, section [algorithms.parallel.exec]

C++ member update visibility inside a critical section when not atomic

I stumbled across the following Code Review Stack Exchange post and decided to read it for practice. In the code, there is the following:
Note: I am not looking for a code review, and this is just a copy-paste of the code from the link so you can focus on the issue at hand without the other code interfering. I am not interested in implementing a 'smart pointer', just in understanding the memory model:
// Copied from the link provided (all inside a class)
unsigned int count;
mutex m_Mutx;

void deref()
{
    m_Mutx.lock();
    count--;
    m_Mutx.unlock();
    if (count == 0)
    {
        delete rawObj;
        count = 0;
    }
}
Seeing this makes me immediately think: "What if two threads enter when count == 1 and neither sees the update of the other? Can both end up seeing count as zero and double-delete? And is it possible for two threads to cause count to become -1 so that deletion never happens?"
The mutex will make sure only one thread enters the critical section at a time; however, does this guarantee that all threads will see the updated value? What does the C++ memory model tell me, so I can say whether this is a race condition or not?
I looked at the cppreference pages on the memory model and on std::memory_order, but the latter seems to deal with a parameter for atomics. I didn't find the answer I was looking for, or maybe I misread it. Can anyone tell me whether what I said is wrong or right, and whether or not this code is safe?
For correcting the code if it is broken:
Is the correct answer to turn count into an atomic member? Or does the mutex already work here, with all threads seeing the value after the lock is released?
I'm also curious if this would be considered the correct answer:
Note: I am not looking for a code review; I am just trying to see whether this kind of solution would solve the issue with respect to the C++ memory model.
#include <atomic>
#include <mutex>

struct ClassNameHere {
    int* rawObj;
    std::atomic<unsigned int> count;
    std::mutex mutex;
    // ...
    void deref()
    {
        std::scoped_lock lock{mutex};
        count--;
        if (count == 0)
            delete rawObj;
    }
};
"what if two threads enter when count == 1" -- if that happens, something else is fishy. The idea behind smart pointers is that the refcount is bound to an object's lifetime (scope). The decrement happens when the object (via stack unrolling) is destroyed. If two threads trigger that, the refcount can not possibly be just 1 unless another bug is present.
However, what could happen is that two threads enter this code when count = 2. In that case, the decrement operation is locked by the mutex, so it can never reach negative values. Again, this assumes non-buggy code elsewhere. Since all this does is to delete the object (and then redundantly set count to zero), nothing bad can happen.
What can happen is a double delete though. If two threads at count = 2 decrement the count, they could both see the count = 0 afterwards. Just determine whether to delete the object inside the mutex as a simple fix. Store that info in a local variable and handle accordingly after releasing the mutex.
Concerning your third question, turning the count into an atomic is not going to fix things magically. Also, the point behind atomics is that you don't need a mutex, because locking a mutex is an expensive operation. With atomics, you can combine operations like decrement and check for zero, which is similar to the fix proposed above. Atomics are typically slower than "normal" integers. They are still faster than a mutex though.
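A minimal sketch of that combined decrement-and-check idea (my illustration, not code from the review): fetch_sub returns the previous value, so exactly one thread observes the transition from 1 to 0 and performs the delete, with no mutex involved.

#include <atomic>

struct RefCounted {
    int* rawObj = nullptr;
    std::atomic<unsigned int> count{1};

    void deref()
    {
        // fetch_sub returns the value *before* the decrement, so only the
        // thread that takes the count from 1 to 0 deletes the object.
        if (count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            delete rawObj;
        }
    }
};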
In both cases there's a data race. Thread 1 decrements the counter to 1, and just before the if statement a thread switch occurs. Thread 2 decrements the counter to 0 and then deletes the object. Thread 1 resumes, sees that count is 0, and deletes the object again.
Move the unlock() to the end of the function. Or, better, use std::lock_guard to do the locking; its destructor will unlock the mutex even if the delete call throws an exception.
If two threads potentially* enter deref() concurrently, then, regardless of the previous or previously expected value of count, a data race occurs, and your entire program, even the parts that you would expect to be chronologically prior, has undefined behavior as stated in the C++ standard in [intro.multithread/20] (N4659):
Two actions are potentially concurrent if
(20.1) they are performed by different threads, or
(20.2) they are unsequenced, at least one is performed by a signal handler, and they are not both performed by the same signal handler invocation.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
The potentially concurrent actions in this case, of course, are the read of count outside of the locked section, and the write of count within it.
*) That is, if current inputs allow it.
UPDATE 1: The section you reference, describing atomic memory order, explains how atomic operations synchronize with each other and with other synchronization primitives (such as mutexes and memory barriers). In other words, it describes how atomics can be used for synchronization so that some operations aren't data races. It does not apply here. The standard takes a conservative approach here: Unless other parts of the standard explicitly make clear that two conflicting accesses are not concurrent, you have a data race, and hence UB (where conflicting means same memory location, and at least one of them isn't read-only).
Your lock prevents the count-- operation from getting into a mess when performed concurrently in different threads. It does not, however, guarantee that the value of count is synchronized outside the critical section, so repeated reads outside a single critical section still carry the risk of a data race.
You could rewrite it as follows:
void deref()
{
bool isLast;
m_Mutx.lock();
--count;
isLast = (count == 0);
m_Mutx.unlock();
if (isLast) {
delete rawObj;
}
}
This way, the lock makes sure that access to count is synchronized and always in a valid state. That state is carried over to the non-critical section through a local variable (with no race condition), and the critical section can be kept rather short.
A simpler version would be to synchronize the complete function body; this can be a disadvantage if you want to do more elaborate things than just delete rawObj:
void deref()
{
    std::lock_guard<std::mutex> lock(m_Mutx);
    if (! --count) {
        delete rawObj;
    }
}
BTW: std::atomic alone will not solve this issue, as it synchronizes only each single access, not the whole "transaction". Therefore your scoped_lock is necessary, and, since it then spans the complete function, the std::atomic becomes superfluous.

Does tbb::parallel_for always utilize the calling thread

I have a piece of code where I am using tbb::parallel_for to multithread a loop, which is called from the main thread. In that loop I need the main thread to update the UI to reflect the progress. From what I have observed, tbb::parallel_for always uses the calling thread plus N worker threads. However, I wonder whether the use of the calling thread is guaranteed, or whether it just happens to be the case.
Here is the sample code:
static thread_local bool _mainThread = false; // false in all threads

_mainThread = true; // now true in main thread, but false in others
tbb::parallel_for(start, end, *this);

void Bender::processor::operator()(size_t i) const
{
    ...
    if(_mainThread) // only main thread will issue events
        ProgressUpdatedEvent(progress);
}
Thanks!
Strictly speaking, I don't think there is any guarantee in TBB about what any given thread is supposed to run (basic principles of TBB are optional parallelism and random work-stealing). Even task affinity in TBB is "soft", since it is not guaranteed that a specific worker will take an affinitized task.
Practically speaking, the way parallel_for is implemented implies that the calling thread will run at least one task before switching to something else and exiting parallel_for. Thus, at least for simple cases, it is expected to work well enough.
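A hedged sketch of one way to avoid depending on that implementation detail (my variation on the question's code, not part of the answer; ProgressUpdatedEvent is a stand-in for the real UI callback): every worker updates an atomic progress counter, the calling thread reports progress only when it happens to run an iteration, and a final update after parallel_for returns covers the case where it never does.

#include <atomic>
#include <cstddef>
#include <tbb/parallel_for.h>

std::atomic<std::size_t> g_done{0};             // progress shared by all workers
static thread_local bool t_mainThread = false;  // true only in the calling thread

void ProgressUpdatedEvent(std::size_t /*done*/) // stub for the UI callback
{
    // update the UI here
}

void process(std::size_t start, std::size_t end)
{
    t_mainThread = true;                        // mark the calling thread
    tbb::parallel_for(start, end, [](std::size_t /*i*/) {
        // ... per-element work ...
        std::size_t done = g_done.fetch_add(1, std::memory_order_relaxed) + 1;
        if (t_mainThread)                       // opportunistic update; not guaranteed to run
            ProgressUpdatedEvent(done);
    });
    // Guaranteed final update, even if the calling thread never ran an iteration.
    ProgressUpdatedEvent(g_done.load(std::memory_order_relaxed));
}

int main()
{
    process(0, 1000);
}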

How to synchronize threads/CPUs without mutexes if sequence of access is known to be safe?

Consider the following:
// these services are running on different threads that are started a long time ago
std::vector<io_service&> io_services;
struct A {
std::unique_ptr<Object> u;
} a;
io_services[0].post([&io_services, &a] {
std::unique_ptr<Object> o{new Object};
a.u = std::move(o);
io_services[1].post([&a] {
// as far as I know changes to `u` isn't guaranteed to be seen in this thread
a.u->...;
});
});
The actual code passes a struct to a bunch of different boost::asio::io_service objects and each field of struct is filled by a different service object (the struct is never accessed from different io_service objects/threads at the same time, it is passed between the services by reference until the process is done).
As far as I know, I always need some kind of explicit synchronization/memory flushing when I pass anything between threads, even if there is no read/write race (as in simultaneous access). What is the correct way of doing that in this case?
Note that Object does not belong to me and it is not trivially copyable or movable. I could use a std::atomic<Object*> (if I am not wrong), but I would rather use the smart pointer. Is there a way to do that?
Edit:
It seems like std::atomic_thread_fence is the tool for the job, but I cannot really wrap my head around the 'memory model' concepts well enough to use it safely.
My understanding is that the following lines are needed for this code to work correctly. Is it really the case?
// these services are running on different threads that are started a long time ago
std::vector<io_service&> io_services;
struct A {
    std::unique_ptr<Object> u;
} a;

io_services[0].post([&io_services, &a] {
    std::unique_ptr<Object> o{new Object};
    a.u = std::move(o);
    std::atomic_thread_fence(std::memory_order_release);
    io_services[1].post([&a] {
        std::atomic_thread_fence(std::memory_order_acquire);
        a.u->...;
    });
});
Synchronisation is only needed when there would be a data race without it. A data race is defined as unsequenced access by different threads.
You have no such unsequenced access. The t.join() guarantees that all statements that follow are sequenced strictly after all statements that run as part of t. So no synchronisation is required.
ELABORATION: (To explain why thread::join has the above claimed properties) First, description of thread::join from standard [thread.thread.member]:
void join();
Requires: joinable() is true.
Effects: Blocks until the thread represented by *this has completed.
Synchronization: The completion of the thread represented by *this synchronizes with (1.10) the corresponding successful join() return.
a). The above shows that join() provides synchronisation (specifically: the completion of the thread represented by *this synchronises with the outer thread calling join()). Next [intro.multithread]:
An evaluation A inter-thread happens before an evaluation B if
(13.1) — A synchronizes with B, or ...
Which shows that, because of a), we have that the completion of t inter-thread happens before the return of the join() call.
Finally, [intro.multithread]:
Two actions are potentially concurrent if
(23.1) — they are performed by different threads, or
(23.2) — they are unsequenced, and at least one is performed by a signal handler.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other ...
Above the required conditions for a data race are described. The situation with t.join() does not meet these conditions because, as shown, the completion of t does in fact happen-before the return of join().
So there is no data race, and all data accesses are guaranteed well-defined behaviour.
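A minimal sketch of the guarantee described above (mine, using std::thread directly rather than the asker's io_service setup): everything the worker thread wrote is visible after join() returns, with no atomics or fences needed.

#include <cassert>
#include <memory>
#include <thread>

struct Object { int value = 42; };

int main()
{
    std::unique_ptr<Object> u;
    std::thread t([&] {
        u = std::make_unique<Object>();   // plain, non-atomic write in the worker thread
    });
    t.join();                             // completion of t synchronizes-with the return of join()
    assert(u->value == 42);               // the write to u happens-before this read: no data race
}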
(I'd like to remark that you appear to have changed your question in some significant way since #Smeeheey answered it; essentially, he answered your originally-worded question but cannot get credit for it since you asked two different questions. This is poor form – in the future, please just post a new question so the original answerer can get credit as due.)
If multiple threads read/write a variable, even if you know said variable is accessed in a defined sequence, you must still inform the compiler of that. The correct way to do this necessarily involves synchronization, atomics, or something documented to perform one of those itself (such as std::thread::join). Presuming the synchronization route is both obvious in implementation and undesirable:
Addressing this with atomics may simply consist of std::atomic_thread_fence; however, an acquire fence in C++ cannot synchronize with a release fence on its own: an actual atomic object must be modified. Consequently, if you want to use fences alone you'll need to specify std::memory_order_seq_cst; that done, your code will work as shown otherwise.
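To illustrate the point that an actual atomic object must be modified, here is the standard fence-based release/acquire pattern (a sketch of mine with plain std::thread and a hypothetical Object, not the asker's io_service code): the fences provide the ordering, but the relaxed atomic flag is what lets the two fences synchronize.

#include <atomic>
#include <memory>
#include <thread>

struct Object { int value = 0; };

std::unique_ptr<Object> u;
std::atomic<bool> ready{false};

void producer()
{
    u = std::make_unique<Object>();                       // non-atomic write
    std::atomic_thread_fence(std::memory_order_release);  // release fence ...
    ready.store(true, std::memory_order_relaxed);         // ... followed by an atomic store
}

void consumer()
{
    while (!ready.load(std::memory_order_relaxed)) { }    // atomic load observes the store ...
    std::atomic_thread_fence(std::memory_order_acquire);  // ... then the acquire fence: the fences synchronize
    int v = u->value;                                      // the write to u happens-before this read
    (void)v;
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}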
If you want to stick with release/acquire semantics, fortunately even the simplest atomic will do – std::atomic_flag:
std::vector<io_service&> io_services;
struct A {
    std::unique_ptr<Object> u;
} a;
std::atomic_flag a_initialized = ATOMIC_FLAG_INIT;

io_services[0].post([&io_services, &a, &a_initialized] {
    std::unique_ptr<Object> o{new Object};
    a_initialized.clear(std::memory_order_release);         // initiates release sequence (RS)
    a.u = std::move(o);
    a_initialized.test_and_set(std::memory_order_relaxed);  // continues RS
    io_services[1].post([&a, &a_initialized] {
        while (!a_initialized.test_and_set(std::memory_order_acquire)) ; // completes RS
        a.u->...;
    });
});
For information on release sequences, see here.

Is there a race condition in the `latch` sample in N3600?

Proposed for inclusion in C++14 (aka C++1y) are some new thread synchronization primitives: latches and barriers. The proposals are
N3600: C++ Latches and Barriers
N3666: C++ Latches and Barriers, revised
It sounds like a good idea and the samples make it look very programmer-friendly. Unfortunately, I think the sample code invokes undefined behavior. The proposal says of latch::~latch():
Destroys the latch. If the latch is destroyed while other threads are in wait(), or are invoking count_down(), the behaviour is undefined.
Note that it says "in wait()" and not "blocked in wait()", as the description of count_down() uses.
Then the following sample is provided:
An example of the second use case is shown below. We need to load data and then process it using a number of threads. Loading the data is I/O bound, whereas starting threads and creating data structures is CPU bound. By running these in parallel, throughput can be increased.
void DoWork()
{
    latch start_latch(1);
    vector<thread*> workers;
    for (int i = 0; i < NTHREADS; ++i) {
        workers.push_back(new thread([&] {
            // Initialize data structures. This is CPU bound.
            ...
            start_latch.wait();
            // perform work
            ...
        }));
    }
    // Load input data. This is I/O bound.
    ...
    // Threads can now start processing
    start_latch.count_down();
}
Isn't there a race condition between the threads waking and returning from wait(), and destruction of the latch when it leaves scope? Besides that, all the thread objects are leaked. If the scheduler doesn't run all worker threads before count_down returns and the start_latch object leaves scope, then I think undefined behavior will result. Presumably the fix is to iterate the vector and join() and delete all the worker threads after count_down but before returning.
Is there a problem with the sample code?
Do you agree that a proposal should show a complete, correct example, even if the task is extremely simple, in order for reviewers to see what the user experience will be like?
Note: It appears possible that one or more of the worker threads haven't yet begun to wait, and will therefore call wait() on a destroyed latch.
Update: There's now a new version of the proposal, but the representative example is unchanged.
Thanks for pointing this out. Yes, I think that the sample code (which, in its defense, was intended to be concise) is broken. It should probably wait for the threads to finish.
Any implementation that allows threads to be blocked in wait() is almost certainly going to involve some kind of condition variable, and destroying the latch while a thread has not yet exited wait() is potentially undefined.
I don't know if there's time to update the paper, but I can make sure that the next version is fixed.
Alasdair
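For reference, a minimal sketch of the fix suggested in the question and confirmed above: join the workers after count_down(), before the latch goes out of scope. It is written against the C++20 std::latch interface (whose count_down()/wait() match the proposal) and owns the threads by value so nothing leaks; the worker bodies are reduced to placeholder comments.

#include <latch>
#include <thread>
#include <vector>

constexpr int NTHREADS = 4;

void DoWork()
{
    std::latch start_latch(1);
    std::vector<std::thread> workers;
    for (int i = 0; i < NTHREADS; ++i) {
        workers.emplace_back([&] {
            // Initialize data structures (CPU bound) ...
            start_latch.wait();
            // ... perform work ...
        });
    }
    // Load input data (I/O bound) ...
    start_latch.count_down();
    // Join the workers before start_latch goes out of scope, so no thread can
    // still be inside wait() (or count_down()) when the latch is destroyed.
    for (auto& w : workers)
        w.join();
}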