Can std::transform or std::generate without ExecutionPolicy be parallel? - c++

In C++17, parallel std algorithms were introduced (overloads with ExecutionPolicy arguments), with strict rules defined for execution order, interleaving, and parallelization, for example ([algorithm.parallel.exec/3]):
The invocations of element access functions in parallel algorithms invoked with an execution policy object of
type execution::sequenced_policy all occur in the calling thread of execution. [ Note: The invocations are not interleaved; see 4.6. — end note ]
(same thing in current draft)
The problem is that I can't find any such requirement for old, non-parallel overloads of these algorithms.
Question: Does this mean that library implementers have been free, ever since C++11 introduced the term thread of execution, to implement std::transform and std::generate using SIMD, multithreading, or other techniques? If not, what prevents them?

[res.on.data.races]/8 Unless otherwise specified, C++ standard library functions shall perform all operations solely within the current thread if those operations have effects that are visible (4.7) to users.
This precludes any kind of behind-the-scenes multithreading that touches any user-defined entities.
I suppose, in principle, something like std::sort working on a vector<int> could prove that no user-defined class is involved and send work to multiple threads. That's rather far-fetched, though; it's difficult to imagine any implementation doing this in practice.

Related

Const-ness affecting concurrent algorithms?

Are there any examples of the absence or presence of const affecting the concurrency of any C++ Standard Library algorithms or containers? If there aren't, is there any reason using the const-ness in this way is not permitted?
To clarify, I'm assuming concurrent const access to objects is data-race free, as advocated in several places, including GotW 6.
By analogy, the noexcept-ness of move operations can affect the performance of std::vector's methods such as resize.
I skimmed through several of the C++17 parallel algorithms, hoping to find an example, but I didn't find anything. Algorithms like transform don't require the unary_op or binary_op function objects to be const. I did find that for_each takes a MoveConstructible function object in the original, non-execution-policy version, and a CopyConstructible function object in the C++17 execution-policy version; I thought this might be an example where the programmer is manually forced to select one or the other based on whether the function object can be safely copied (a const operation)... but at the moment I suspect this requirement is just there to support the return type (UnaryFunction vs. void).

Does `std::sort()` use threading to increase its performance?

Does std::sort() typically use threading to increase its performance? I realize that this could vary from implementation to implementation. If not, why not?
[res.on.data.races]/8 Unless otherwise specified, C++ standard library functions shall perform all operations solely within the current thread if those operations have effects that are visible (4.7) to users.
/9 [ Note: This allows implementations to parallelize operations if there are no visible side effects. —end note ]
std::sort could, in principle, use parallel execution when sorting elements of fundamental type (it wouldn't be observable whether it does), but not elements of user-defined type (unless explicitly given permission via an execution policy parameter, of course). The type's operator< may not be thread-safe.

Are c++ standard library containers like std::queue guaranteed to be reentrant?

I've seen people suggest that I should wrap standard containers such as std::queue and std::vector in a mutex lock or similar if I wish to use them.
I read that as needing a lock for each individual instance of a container being accessed by multiple threads, not per type or per any use of the C++ standard library. But this assumes that the standard containers and the standard library are guaranteed to be re-entrant.
Is there such a guarantee in the language?
The standard says:
Except where explicitly specified in this standard, it is implementation-defined which functions in the Standard C++ library may be recursively reentered.
Then it proceeds to specify that a function must be reentrant in, if I count them correctly, zero cases.
If one is to strictly follow the standard in this regard, the standard library suddenly becomes rather limited in its usefulness. A huge number of library functions call user-supplied functions. Writers of these functions, especially if those are themselves released as a library, in general don't know where they will be called from.
It is completely reasonable to assume that e.g. any constructor may be called from emplace_back of any standard container; if the user wishes to eliminate any uncertainty, he must refrain from any calls to emplace_back in any constructor. Any copy constructor is callable from e.g. vector::resize or sort, so one cannot manage vectors or do sorting in copy constructors. And so on, ad libitum.
This includes calling any third party component that might reasonably be using the standard library.
All these restrictions together probably mean that a large part of the standard library cannot be used in real world programs at all.
Update: this doesn't even start taking threads into consideration. With multiple threads, at least functions that deal with containers and algorithms must be reentrant. Imagine that std::vector::operator[] is not reentrant. This would mean that one cannot access two different vectors at the same time from two different threads! This is clearly not what the standard intends. I understand that this is your primary interest. To reiterate, no, I don't think there is reentrancy guarantee; and no, I don't think absence of such guarantee is reasonable in any way. --- end update.
My conclusion is that this is probably an oversight. The standard should mandate that all standard functions must be reentrant, unless otherwise specified.
I would
completely ignore the possibility of any standard function being non-reentrant, except when it is clear that the function cannot be reasonably made reentrant.
raise an issue with the standards committee.
[Answer left for historical purposes, but see n.m.'s answer. There's no requirement on individual functions, but there is a single global non-requirement]
Yes, the standard guarantees reentrancy of member functions of standard containers.
Let me define what (non)-reentrancy means for functions. A reentrant function can be called with well-defined behavior on a thread while it is already on the call stack of that thread, i.e. executing. Obviously, this can only happen if the control flow temporarily left the reentrant function via a function call. If the behavior is not well-defined, the function is not reentrant.
(Leaf functions can't be said to be reentrant or non-reentrant, as the flow of control can only leave a leaf function by returning, but this isn't critical to the analysis).
Example:
int fac(int n) { return n==0 ? 1 : n * fac(n-1); }
The behavior of fac(3) is to return 6, even while fac(4) is running. Therefore, fac is reentrant.
The C++ Standard does define the behavior of member functions of standard containers. It also defines all restrictions under which such behavior is guaranteed. None of the member functions of standard containers have restrictions with respect to reentrancy. Therefore, any implementation which would restrict reentrancy is non-conformant.

What are the constraints on the user using STL's parallel algorithms?

At the Jacksonville meeting the proposal P0024r2 effectively adopting the specifications from the Parallelism TS was accepted into the C++17 (draft). This proposal adds overloads for many of the algorithms taking an execution policy argument to indicate what kind of parallelism should be considered. There are three execution policies already defined in <execution> (20.19.2 [execution]):
std::execution::sequenced_policy (20.19.4 [execpol.seq]) with a constexpr object std::execution::seq (20.19.7 [parallel.execpol.objects]) to indicate sequential execution similar to calling the algorithms without an execution policy.
std::execution::parallel_policy (20.19.5 [execpol.par]) with a constexpr object std::execution::par (20.19.7 [parallel.execpol.objects]) to indicate execution of algorithms potentially using multiple threads.
std::execution::parallel_unsequenced_policy (20.19.6 [execpol.vec]) with a constexpr object std::execution::par_unseq (20.19.7 [parallel.execpol.objects]) to indicate execution of algorithms potentially using vector execution and/or multiple threads.
The STL algorithms generally take user-defined objects (iterators, function objects) as arguments. What are the constraints on user-defined objects to make them usable with the parallel algorithms using the standard execution policies?
For example, when using an algorithm like in the example below, what are the implications for FwdIt and Predicate?
template <typename FwdIt, typename Predicate>
FwdIt call_remove_if(FwdIt begin, FwdIt end, Predicate predicate) {
    return std::remove_if(std::execution::par, begin, end, predicate);
}
The short answer is that the element access functions (essentially the operations required by the algorithms on the various arguments; see below for details) used with algorithms using the execution policy std::execution::parallel_policy are not allowed to cause data races or dead-locks. The element access functions used with algorithms using the execution policy std::execution::parallel_unsequenced_policy additionally can't use any blocking synchronisation.
The Details
The description is based on the ballot document N4604. I haven't verified if some of the clauses were modified in response to national body comments (a cursory check seems to imply that there were no edits so far).
Section 25.2 [algorithms.parallel] specifies the semantics of the parallel algorithms. There are multiple constraints which do not apply to the algorithms not taking an execution policy, broken down in multiple sections:
25.2.2 [algorithms.parallel.user] constrains what predicate functions can do to their arguments:
Function objects passed into parallel algorithms as objects of type Predicate, BinaryPredicate, Compare, and BinaryOperation shall not directly or indirectly modify objects via their arguments.
The way the clause is written it seems that the objects themselves can be modified as long as the other constraints (see below) are obeyed. Note that this constraint is independent of the execution policy and, thus, applies even when using std::execution::sequenced_policy. The full answer is more complicated than that and it seems the specification is currently unintentionally over constrained (see the last paragraph below).
25.2.3 [algorithms.parallel.exec] adds constraints on element access functions (see below) which are specific to the different execution policies:
When using std::execution::sequenced_policy the element access functions are all invoked from the same thread, i.e., the execution is not interleaved in any form.
When using std::execution::parallel_policy the element access functions may be invoked concurrently from different threads. Invocations of element access functions from different threads are not allowed to cause data races or dead-locks. However, invocations of element access functions from the same thread are [indeterminately] sequenced, i.e., there are no interleaved invocations of element access functions from the same thread. For example, if a Predicate used with std::execution::par counts how often it is called, the corresponding count will need to be appropriately synchronized.
When using std::execution::parallel_unsequenced_policy the invocation of element access functions can be interleaved both between different threads as well as within one thread of execution. That is, use of a blocking synchronization primitive (like a std::mutex) may cause dead-lock as the same thread may try to synchronize multiple times (and, e.g., try to lock the same mutex multiple times). When using standard library functions for element access functions the constraint in the standard is (25.2.3 [algorithms.parallel.exec] paragraph 4):
A standard library function is vectorization-unsafe if it is specified to synchronize with another function invocation, or another function invocation is specified to synchronize with it, and if it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions may not be invoked by user code called from execution::parallel_unsequenced_policy algorithms.
What happens when using implementation defined execution policies is, unsurprisingly, implementation defined.
In 25.2.4 [algorithm.parallel.exception] the use of exceptions thrown from element access functions is sort of constrained: when an element access function throws an exception, std::terminate() is called. That is, it is legal to throw an exception but it is unlikely that the outcome is desirable. Note that std::terminate() will be called even when using std::execution::sequenced_policy.
Element Access Functions
The constraints above use the term element access function. This term is defined in 25.2.1 [algorithm.parallel.defns] paragraph 2. There are four groups of functions classified as element access functions:
All operations of the categories of the iterators that the algorithm is instantiated with.
Operations on those sequence elements that are required by its specification.
User-provided function objects to be applied during the execution of the algorithm, if required by the specification.
Operations on those function objects required by the specification.
Essentially, element access functions are all the operations which the standard explicitly refers to in the specification of the algorithms or of the concepts used with these algorithms. Other functions, e.g., ones merely detected to be present (say, using SFINAE), are not constrained and, effectively, can't be called by the parallel algorithms while imposing synchronisation constraints on their use.
The Problem
It is slightly concerning that there seems to be no guarantee that the objects the [mutating] element access functions are applied to are different between different threads. In particular, I can't see any guarantee that the iterator operations applied to an iterator object cannot be applied to the same iterator object from two different threads! The implication is that, e.g., operator++() on an iterator object would need to somehow synchronise its state. I can't see how, e.g., operator==() could do anything useful if the object is modified in a different thread. It seems unintentional that operations on the same object need to be synchronised, as it doesn't make any sense to apply [mutating] element access functions concurrently to an object. However, I can't see any text stating that different objects are used (I guess I need to raise a defect for this).

Parallel implementations of std algorithms and side effects

Going through the standard documentation for std::transform, I noticed that until C++11 the functor argument was required not to have side effects, while from C++11 onwards the requirement has been made less restrictive - "op and binary_op shall not invalidate iterators or subranges, or modify elements in the ranges". See
http://en.cppreference.com/w/cpp/algorithm/transform
and section 25.3.4 of the standard. The webpage on cppreference.com mentions as well that "The intent of these requirements is to allow parallel or out-of-order implementations of std::transform".
I do not understand then if this snippet of code is legal or not in C++11:
std::vector<int> v(/* fill it with something */), v_transformed;
int foo = 0;
std::transform(v.begin(), v.end(), std::back_inserter(v_transformed),
               [&foo](const int &n) -> int {
                   foo += 1;
                   return n * 2;
               });
Clearly, if std::transform is parallelised behind the scenes, we will have multiple concurrent calls to foo += 1, which is going to be UB. But the functor itself does not seem to violate the requirements outlined in the standard.
This question can be asked for other standard algorithms (except I think std::for_each, which explicitly states that the iteration is to be performed in-order).
Did I misunderstand something?
As far as I understand the C++11 spec, all standard library functions have to perform all their operations sequentially, if their effects are visible to the user. In particular, all "mutating sequence operations" have to be performed sequentially.
The relevant part of the standard is §17.6.5.9/8:
Unless otherwise specified, C++ standard library functions shall perform all operations solely within the current thread if those operations have effects that are visible (1.10) to users.
The way the algorithms are currently defined they have to be executed sequentially unless the implementation can prove that executing it concurrently doesn't change the semantics. I could imagine a future addition for algorithms which are explicitly allowed to be executed concurrently but they would be different algorithms.
So C++11 now allows std::transform to be parallelised, but that's not a guarantee that your own code will be made parallelisation-safe. Now, yes, I suppose you have to protect your shared variables. I can imagine a lot of MT bugs arising from this, if implementations ever do actually parallelise std::transform.