C++17: how to control number of threads in execution policy? - c++

The C++17 standard introduced an execution policy parameter (e.g. std::execution::par_unseq) which can be passed to some of the functions in the std library to make them execute in parallel, e.g.:
std::copy(std::execution::par_unseq, obj1.begin(), obj1.end(), obj2.begin())
In other frameworks like OpenMP, it’s possible to set the maximum number of threads that will be used, e.g. #pragma omp parallel num_threads(<desired_number>) to set it locally within the section, or omp_set_num_threads(<desired_number>) to set it within the calling scope.
I’m wondering how this can be achieved in standard C++ for the execution policies.

This is a good question. That said, unfortunately, I don't think it's possible. [execpol.general]/1 says:
This subclause describes classes that are execution policy types. An
object of an execution policy type indicates the kinds of
parallelism allowed in the execution of an algorithm and expresses
the consequent requirements on the element access functions.
(emphasis mine)
Moreover, the rest of [execpol] only deals with is_execution_policy, the (disambiguating) policy types, and the execution policy objects.
In other words, execution policies only bring the possibility of parallelism, at the cost of constrained element access functions. How these policies are carried out is not really specified. To me, it seems even less possible to control the details of the parallelism, the number of threads being one example.
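In practice, if you need a hard cap on the number of threads, you have to step outside the execution-policy mechanism. Below is a minimal sketch of one such workaround (the chunking scheme, the helper name, and the thread count are my own assumptions, not any standard facility): split the range yourself across a chosen number of std::threads and run the plain sequential algorithm on each chunk.

#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical helper: copy src into dst using at most num_threads threads.
void copy_with_thread_cap(const std::vector<int>& src, std::vector<int>& dst,
                          unsigned num_threads)
{
    if (num_threads == 0) num_threads = 1;
    dst.resize(src.size());
    const std::size_t n = src.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        workers.emplace_back([&, begin, end] {
            // Each thread runs the ordinary sequential algorithm on its chunk.
            std::copy(src.begin() + begin, src.begin() + end, dst.begin() + begin);
        });
    }
    for (auto& t : workers) t.join();
}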

Related

What do the execution policies in std::copy_n really mean?

I just discovered that std::copy_n provides overloads for different execution policies. Yet I find cppreference quite hard to understand here as (I suppose) it is kept very general. So I have difficulties putting together what actually goes on.
I don't really understand the explanation of the first policy:
The execution policy type used as a unique type to disambiguate
parallel algorithm overloading and require that a parallel algorithm's
execution may not be parallelized. The invocations of element access
functions in parallel algorithms invoked with this policy (usually
specified as std::execution::seq) are indeterminately sequenced in the
calling thread.
To my understanding this means that we don't parallelize (multithread) here and each element access is sequential like in strcpy. This basically means to me that one thread runs through the function and I'm done. But then there is
invocations of element access functions in parallel algorithms.
What now? Are there still parallel algorithms? How?
The second execution policy states that:
Any such invocations executing in the same thread are indeterminately
sequenced with respect to each other.
What I imagine that means is this: Each thread starts at a different position, e.g. the container is split up into multiple segments and each thread copies one of those segments. The threads are created by the library just to run the algorithm. Am I correct in assuming so?
From the third policy:
The invocations of element access functions in parallel algorithms
invoked with this policy are permitted to execute in an unordered
fashion in unspecified threads, and unsequenced with respect to one
another within each thread.
Does this mean the above-mentioned container "segments" need not be copied one after another but can be copied in a random fashion? If so, why is this so important to justify an extra policy? When I have multiple threads they will need to be somewhat intermixed to keep synchronisation to a minimum, no?
So here's my probably incorrect current understanding of the policies. Please correct me!
sequenced_policy: 1 thread executes the algorithm and copies everything from A - Z.
parallel_policy: Lib creates new threads specifically for copying, whereas each thread's copied segment has to follow the other (sequenced)?
parallel_unsequenced_policy: try to multithread and SIMD. Copied segments can be intermixed by thread (it doesn't matter where you start).
unsequenced_policy: Try to use SIMD but only singlethreaded.
Your summary of the basic idea of each policy is basically correct.
Does this mean the above mentioned container "segments" need not be copied one after another but can be copied in random fashion? If so, why is this so important to justify an extra policy?
The extra policies for unsequenced_policy and parallel_unsequenced_policy are necessary because they impose an extra requirement on calling code1:
The behavior of a program is undefined if it invokes a vectorization-unsafe standard library function from user code called from an execution::unsequenced_policy algorithm.
[and a matching restriction for parallel_unsequenced_policy.]
These four execution policies are used for algorithms in general. The mention of user code called from execution of the algorithm mostly applies to things like std::for_each, or std::generate, where you tell the algorithm to invoke a function. Here's one of the examples from the standard:
int a[] = {0,1};
std::vector<int> v;
std::for_each(std::execution::par, std::begin(a), std::end(a), [&](int i) {
    v.push_back(i*2+1); // incorrect: data race
});
This particular example shows a problem created by parallel execution: you might have two threads trying to invoke push_back on v concurrently, giving a data race.
If you use for_each with one of the unsequenced policies, that imposes a further constraint on what your code can do.
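To make that extra constraint concrete, here is a sketch in the same spirit as the standard's own examples (the values and the accumulation are made up): taking a lock inside the callable is allowed under std::execution::par, but not under the unsequenced policies, because invocations may be interleaved on a single thread and acquiring a mutex is a vectorization-unsafe operation.

#include <algorithm>
#include <execution>
#include <mutex>
#include <vector>

int main() {
    std::vector<int> v(1000, 1);
    std::mutex m;
    int sum = 0;

    // OK with std::execution::par: each invocation runs entirely on some
    // thread, so locking a mutex inside the callable is merely slow.
    std::for_each(std::execution::par, v.begin(), v.end(), [&](int i) {
        std::scoped_lock lk(m);
        sum += i;
    });

    // NOT OK with std::execution::par_unseq (or unseq): invocations may be
    // interleaved on one thread, so a second lock attempt can deadlock;
    // the mutex operations make the callable vectorization-unsafe, so the
    // behavior would be undefined.
    // std::for_each(std::execution::par_unseq, v.begin(), v.end(), [&](int i) {
    //     std::scoped_lock lk(m);
    //     sum += i;
    // });
}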
When we look specifically at std::copy_n, that's probably less of a problem as a rule, because we're not passing it some code to be invoked. Well, we're not doing so directly, anyway. In reality, we are potentially doing so indirectly though. std::copy_n uses the assignment operator for the item being copied. So, for example, consider something like this:
#include <algorithm>
#include <execution>
#include <vector>

class foo {
    static int copy_count;
    int data;
public:
    foo &operator=(foo const &other) {
        data = other.data;
        ++copy_count;   // unsynchronized access to shared state
        return *this;
    }
};

int foo::copy_count = 0;

std::vector<foo> a;
std::vector<foo> b;

void do_copy() {
    // code to fill a with data goes here
    b.resize(a.size());  // the parallel overload needs a forward iterator, not std::back_inserter
    std::copy_n(std::execution::par, a.begin(), a.size(), b.begin());
}
Our copy assignment operator accesses copy_count without synchronization. If we specify sequential execution, that's fine, but if we specify parallel execution we're now (potentially) invoking it concurrently on two or more threads, so we have a data race.
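One straightforward remedy (my own sketch, not something the standard mandates) is to make copy_count a std::atomic<int>; the increment then no longer constitutes a data race, although it still adds synchronization cost to every assignment.

#include <atomic>

class foo {
    static std::atomic<int> copy_count;   // atomic: concurrent increments are well-defined
    int data;
public:
    foo &operator=(foo const &other) {
        data = other.data;
        ++copy_count;                      // atomic read-modify-write, no data race
        return *this;
    }
};

std::atomic<int> foo::copy_count{0};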
I'd probably have to work harder to put together a somewhat coherent reason for an assignment operator to do something that was vectorization-unsafe, but that doesn't mean it doesn't exist.
Summary
We have four separate execution policies because each imposes unique constraints on what you can do in your code. In the specific cases of std::copy or std::copy_n, those constraints apply primarily to the assignment operator for the items in the collection being copied.
N4835, section [algorithms.parallel.exec]

Thread Safety: Multiple threads reading from a single const source

What should I be concerned about as far as thread safety and undefined behavior go in a situation where multiple threads are reading from a single source that is constant?
I am working on a signal processing model that allows for parallel execution of independent processes. These processes may share an input buffer, but the process that fills the input buffer will always be complete before the next stage of possibly parallel processes will execute.
Do I need to worry about thread safety issues in this situation? And what could I do about it?
I would like to note that a lock-free solution would be best, if possible.
but the process that fills the input buffer will always be complete before the next stage of possibly parallel processes will execute
If this is guaranteed, then there is no problem with multiple threads reading const objects.
I don't have the official standard, so the following is from N4296:
17.6.5.9 Data race avoidance
3 A C++ standard library function shall not directly or indirectly modify objects (1.10) accessible by threads
other than the current thread unless the objects are accessed directly or indirectly via the function’s non-const
arguments, including this.
4 [ Note: This means, for example, that implementations can’t use a static object for internal purposes without
synchronization because it could cause a data race even in programs that do not explicitly share objects
between threads. —end note ]
Here is the Herb Sutter video where I first learned about the meaning of const in the C++11 standard. (see around 7:00 to 10:30)
No, you are OK. Multiple reads from the same constant source are OK and do not pose any risk in any threading model I know of (namely, POSIX and Windows).
However,
but the process that fills the input buffer will always be complete
What are the guarantees here? How do you really know this is the case? Do you have synchronization in place?
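As a minimal sketch of the pattern being described (the buffer type, the fill step, and the reader computations are my own illustrations, not the asker's actual code): filling the buffer completes before the reader threads are even constructed, and thread construction gives the necessary happens-before edge, so the readers can share the buffer with no locking at all.

#include <algorithm>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<float> input(1024);
    std::iota(input.begin(), input.end(), 0.0f);   // the "fill the input buffer" stage

    const std::vector<float>& shared = input;      // later stages only read

    float sum = 0.0f, max = 0.0f;                  // each thread writes only its own result

    // The threads are created after the fill completed, which establishes
    // happens-before; from then on both threads only read `shared`, so no
    // locking is needed (a lock-free arrangement, as requested).
    std::thread t1([&] { sum = std::accumulate(shared.begin(), shared.end(), 0.0f); });
    std::thread t2([&] { max = *std::max_element(shared.begin(), shared.end()); });

    t1.join();
    t2.join();
    (void)sum; (void)max;   // results are safely visible here because join() synchronizes
}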

What is Clojure volatile?

There has been an addition in the recent Clojure 1.7 release: volatile!
volatile is already used in many languages, including Java, but what are the semantics in Clojure?
What does it do? When is it useful?
The new volatile is as close to a real "variable" (as known from many other programming languages) as it gets in Clojure.
From the announcement:
there are a new set of functions (volatile!, vswap!, vreset!, volatile?) to create and use volatile "boxes" to hold state in stateful transducers. Volatiles are faster than atoms but give up atomicity guarantees so should only be used with thread isolation.
For instance, you can set/get and update them just like you would do with a variable in C.
The only addition (and hence the name) is the volatile keyword on the underlying field of the Java object.
This prevents certain JVM optimizations and makes sure the memory location is actually read every time it is accessed.
From the JIRA ticket:
Clojure needs a faster variant of Atom for managing state inside transducers. That is, Atoms do the job, but they provide a little too much capability for the purposes of transducers. Specifically the compare and swap semantics of Atoms add too much overhead. Therefore, it was determined that a simple volatile ref type would work to ensure basic propagation of its value to other threads and reads of the latest write from any other thread. While updates are subject to race conditions, access is controlled by JVM guarantees.
Solution overview: Create a concrete type in Java, akin to clojure.lang.Box, but volatile inside supports IDeref, but not watches etc.
This means a volatile! can still be accessed by multiple threads (which is necessary for transducers), but it must not be changed by those threads at the same time, since it gives you no atomic updates.
The semantics of volatile are explained very well in a Java answer:
there are two aspects to thread safety: (1) execution control, and (2) memory visibility. The first has to do with controlling when code executes (including the order in which instructions are executed) and whether it can execute concurrently, and the second to do with when the effects in memory of what has been done are visible to other threads. Because each CPU has several levels of cache between it and main memory, threads running on different CPUs or cores can see "memory" differently at any given moment in time because threads are permitted to obtain and work on private copies of main memory.
Now let's see why not use var-set or transients:
Volatile vs var-set
Rich Hickey didn't want to give truly mutable variables:
Without mutable locals, people are forced to use recur, a functional
looping construct. While this may seem strange at first, it is just as
succinct as loops with mutation, and the resulting patterns can be
reused elsewhere in Clojure, i.e. recur, reduce, alter, commute etc
are all (logically) very similar.
[...]
In any case, Vars
are available for use when appropriate.
And thus with-local-vars, var-set, etc. were created.
The problem with these is that they're true vars and the doc string of var-set tells you:
The var must be thread-locally bound.
This is, of course, not an option for core.async which potentially executes on different threads. They're also much slower because they do all those checks.
Why not use transients
Transients are similar in that they don't allow concurrent access and optimize mutating a data structure.
The problem is that transients only work with collections that implement IEditableCollection. That is, they exist simply to avoid expensive intermediate representations of the collection data structures. Also remember that transients are not bashed into place, so you still need some memory location to store the actual transient.
Volatiles are often used to simply hold a flag or the value of the last element (see partition-by for instance)
Summary:
Volatiles are nothing but a wrapper around Java's volatile and thus have exactly the same semantics.
Don't ever share them. Use them only very carefully.
Volatiles are a "faster atom" with no atomicity guarantees. They were introduced as atoms were considered too slow to hold state in transducers.

Why map is not multithread safe in C++?

I ran into this problem when I tried to solve a concurrency issue in my code. In the original code, we only use a unique lock to lock the write operation on a cache, which is an STL map. But there are no restrictions on read operations on the cache. So I was thinking of adding a shared lock for the read operations and keeping the unique lock for writes. But someone told me that it's not safe to do multithreading on a map due to some internal caching it does itself.
Can someone explain the reason in detail? What does the internal caching do?
Implementations of std::map must all meet the usual guarantees: if all you do is read, then there is no need for external synchronization, but as soon as one thread modifies the map, all accesses must be synchronized.
It's not clear to me what you mean by "shared lock"; the standard itself only gained one later (std::shared_timed_mutex in C++14 and std::shared_mutex in C++17). But if any one thread is writing, you must ensure that no other thread reads at the same time. (Something like POSIX's pthread_rwlock fills the same role.)
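As a minimal sketch of the reader/writer arrangement the asker describes, assuming C++17's std::shared_mutex (the cache type and member names are invented for illustration): readers take a shared lock, the single writer takes an exclusive lock, so no read ever overlaps a modification of the map.

#include <map>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>

class Cache {
    std::map<std::string, std::string> data_;
    mutable std::shared_mutex mtx_;
public:
    // Many readers may hold the shared lock at once.
    std::optional<std::string> get(const std::string& key) const {
        std::shared_lock lock(mtx_);
        auto it = data_.find(key);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }

    // A writer excludes both readers and other writers, so the tree can
    // rebalance without any reader observing an inconsistent state.
    void put(const std::string& key, std::string value) {
        std::unique_lock lock(mtx_);
        data_[key] = std::move(value);
    }
};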
Since C++11 at least, a const operation on a standard library class is guaranteed to be thread safe (assuming const operations on objects stored in it are thread safe).
All const member functions of std types can be safely called from multiple threads in C++11 without explicit synchronization. In fact, any type that is ever used in conjunction with the standard library (e.g. as a template parameter to a container) must fulfill this guarantee.
Clarification: The standard guarantees that your program will have the desired behaviour as long as you never cause a write and any other access to the same data location without a synchronization point in between. The rationale behind this is that modern CPUs don't have strictly sequentially consistent memory models, which would limit scalability and performance. Under the hood, your compiler and standard library will emit appropriate memory fences at places where stronger memory orderings are needed.
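A small sketch of what this guarantee buys you (my own example, not from the answer above): as long as every thread only calls const members of the map and nobody writes, no lock is needed.

#include <map>
#include <string>
#include <thread>

int main() {
    std::map<std::string, int> m{{"a", 1}, {"b", 2}, {"c", 3}};
    const auto& cm = m;   // all further access goes through const operations

    // Both threads only call const members (find, at); per the container
    // thread-safety rules this requires no external synchronization.
    std::thread t1([&] { auto it = cm.find("a"); (void)it; });
    std::thread t2([&] { int v = cm.at("b"); (void)v; });

    t1.join();
    t2.join();
}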
I really don't see why there would be any caching issue...
If I refer to the STL definition of a map, it should be implemented as a binary search tree.
A binary search tree is simply a tree of key-value nodes. Those nodes are sorted following the natural order of their keys and, to avoid any problem, keys must be unique. So no internal caching is needed at all.
As no internal caching is required, read operations are safe in a multithreaded context. But it's not the same story for write operations: for those you must provide your own synchronization mechanism, as for any non-thread-aware data structure.
Just be aware that you must also forbid any read operations while a write operation is performed by a thread, because that write operation can trigger a slow, complete rebalancing of the binary tree; a quick read operation during a long write operation would return a wrong result.

std::async - std::launch::async | std::launch::deferred

I understand what std::async does with the following parameters.
std::launch::async
std::launch::deferred
However, what happens with std::launch::async | std::launch::deferred?
A launch policy of std::launch::async | std::launch::deferred means that the implementation can choose whether to apply a policy of std::launch::async or std::launch::deferred. This choice may vary from call to call, and may not be decided immediately.
An implementation that always chooses one or the other is thus legal (which is what gcc does, always choosing deferred), as is one that chooses std::launch::async until some limit is reached, and then switches to std::launch::deferred.
It also means that the implementation can defer the choice until later. This means that the implementation may wait to make a decision until its hand is forced by a call that has visibly distinct effects from deferred and async tasks, or until the number of running tasks is less than the internal task limit. This is what just::thread does.
The functions that force the decision are: wait(), get(), wait_for(), wait_until(), and the destructor of the last future object referencing the result.
Section 30.6.8 of ISO/IEC 14882:2011 explains that launch::async | launch::deferred means implementations should defer invocation, or the selection of the policy, when no more concurrency can be effectively exploited (the same as calling async without a policy parameter).
In practice it means that the C++ runtime may start a new thread for each async call as long as there are unused CPU cores.
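As a hedged illustration (the lambda and the printed messages are made up), one way to observe which policy an implementation actually picked for the combined flag is a zero-length wait_for, which is one of the forcing calls listed above: a deferred task reports std::future_status::deferred, while an async task reports timeout or ready.

#include <chrono>
#include <future>
#include <iostream>

int main() {
    // Let the implementation choose between async and deferred.
    auto fut = std::async(std::launch::async | std::launch::deferred,
                          [] { return 42; });

    // A zero-length wait forces the decision without running a deferred task:
    // future_status::deferred means the callable will only run when get() or
    // wait() is called on this thread.
    if (fut.wait_for(std::chrono::seconds(0)) == std::future_status::deferred)
        std::cout << "implementation chose deferred\n";
    else
        std::cout << "implementation chose (or already started) async\n";

    std::cout << fut.get() << '\n';   // forces execution if it was deferred
}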