Parallel Aggregration of a collection using PPL or TBB

Parallel Aggregration of a collection using PPL or TBB - c++

I've decided to write an algorithm to utilize parallel aggregation. Here is the single threaded code that I want to transform.
vector<vector<double>> sum;
for (const auto* fold : _obj.GetFolds())
sum.push_back(move(vector<double>(fold->GetSize())));
for (int index : sequence)
{
vector<vector<double>> values = Calculate(vec1[index], vec2[index]);
for (int i = 0; i < sum.size(); i++)
{
for (int j = 0; j < sum[i].size(); j++)
sum[i][j] += values[i][j];
}
}
I looked at the MSDN page http://msdn.microsoft.com/en-us/library/gg663540.aspx which covers parallel_for with combinable, and http://msdn.microsoft.com/en-us/library/dd470426.aspx#map_reduce_example showing parallel_transform with parallel_reduce, but they are simple examples with only a counter.
vector<int> sequence = ...
combinable<int> count([]() { return 0; });
parallel_for_each(sequence.cbegin(), sequence.cend(),
[&count](int i)
{
count.local() += IsPrime(i) ? 1 : 0;
});
return count.combine(plus<int>());
I'm having a difficult time finding examples where I would aggregate with a parallel loop the vector<vector<double>> sum as outlined above.
Also, I'm looking for advice on whether to use parallel_for and combinable or parallel_transform with parallel_reduce? The first link above states:
The parallel_reduce function is usually the recommended approach
whenever you need to apply the Parallel Aggregation pattern within
applications that use PPL. Its declarative nature makes it less prone
to error than other approaches, and its performance on multicore
computers is competitive with them. Implementing parallel aggregation
with parallel_reduce doesn't require adding locks in your code.
Instead, all the synchronization occurs internally. Of course, if
parallel_reduce doesn't meet your needs or if you prefer a less
declarative style of coding, you can also use the combinable class
with parallel_for or parallel_for_each to implement the parallel
aggregation.
You should be aware that parallel_for and parallel_for_each add
overhead due to their support of features such as cancellation and
dynamic range stealing. Also, a call to the combinable::local() method
inside of a parallel loop adds the cost of a hash table lookup to each
iteration of the loop. In general, use parallel aggregation to
increase performance when iterations perform complex computations.
Thanks.

Related

Increasing array index in openMP

I am new to using OpenMP. I am trying to parallelize a nested loop, and so far I have something of this form...
#pragma omp parallel for
for (j=0;j <m; j++) {
some work;
for (i= 0; i < n ; i++) {
p =b[i];
if (P< 0 && k < m) {
a[k] = c[i]; k++ ;
} else {
x=c[i];
}
}
some work
}
The outer loop is in parallel, and the inner loop updates k. The current value of k is needed for the other threads to update a[k] correctly. The problem is that all of the threads are updating a[k], but the proper order of k is not kept.
Some threads will update k and a[k], and some will not. How do I communicate the latest k between threads to update a[k] properly, since c[i] will have different values for each thread?
For example, when it runs serially, the program might set the first seven values of a to {1,3,5,7,3,9,13} and terminate with k equal to 7, but when done parallel, produces different results, or results in a different (therefore wrong) order.
How do I keep the same order and ensure parallelism at the same time?

Note: this answer was completely rewritten in light of OP clarifications. The original answer text is at the end.
How do I keep the same order and ensure parallelism at the same time?
Order dependency is antithetical to parallelism, as running operations in parallel inherently entails relaxing the relative order in which they are performed. Not all computations can be effectively parallelized.
Your case is not an exception. The second and each subsequent iteration of your outer loop needs to use the final value of k (among other things) computed by the previous iteration. How can it get that? Only by performing the previous iteration first. What room does that leave for concurrent operation? None. Concurrency is not the same thing as parallelism, but it is one of the main motivations for parallelism, because that's how parallelism yields improvements in elapsed time.
With no scope for concurrency, parallelism is actively counterproductive for you. Suppose you made the whole body of the outer loop a critical section, so that there was no concurrency in fact (as your present code requires) and no data races involving k. Then you would still pay the overhead for parallelism, get no speedup in return, and probably still get the wrong results because of evaluations of the outer-loop body being performed in the wrong order.
It may be that the whole thing can be rewritten to reduce or remove the data dependencies that prevent effective parallelization of the computation, or it may not. We haven't enough information to determine, as it depends in part on the details of "some work" and on the significance of the data. Probably you would need an altogether different algorithm for producing the desired results.
> Instead of giving a[n]={0,1,2,3,.......n} , it gives me garbage values for a when I use the reduction clause. I need the total sum of K, hence the reduction clause.
There is a closed-form equation for the sum of consecutive integers, and it has especially simple form when the first integer in the list is 0 or 1. In particular, the sum of the integers from 0 to n, inclusive, is n * (n + 1) / 2. You do not need a reduction for this.
If you wanted to use a reduction anyway, then you need to understand that it doesn't work the way you seem to think it does. What you get is a separate, private copy of the reduction variable for each thread executing the parallel construct, with the per thread (not per iteration) final values of those independant variables combined according to the reduction operator. Thus, if you really want to do the computation via an OpenMP reduction, then you would need to restructure the loop something like this:
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
a[i] = i;
k += i;
}
That assumes that the value of k is 0 immediately prior to the loop, as you indeed seem to be doing. If that were not a safe assumption then you would need something like
type_of_k k0 = k;
k = 0;
#pragma omp parallel for reduction (+:k)
for (i = 0; i < 10; i++) {
a[k0 + i] = i;
k += k0 + i;
}
Note that in either case, not only does that set up the reduction correctly, but it also breaks the data dependency between loop iterations that was previously carried by the expression k++.

It sounds like you're essentially filling in a with a filter of entries from c, and want to preserve their order. If this is the only use k has, some other methods spring to mind:
Always write a[i], but use a mark indicating unused values where the P predicate wasn't satisfied. This preserves order, but requires a larger a you can compact in a second pass.
Write an a_i array storing which index each entry belonged to. This still requires a #pragma omp atomic k_local = k++ access to k, and a second sort to restore order. And you'd need both a and a_i to be the full size again, or you might miss entries, so in all a terrible workaround.
Even with some sequential dependencies you can do optimizations, e.g. a scan to calculate what k would be for each i could be done in O(log n) rather than O(n). E.g. parallel prefix sum, openmp discussion on stack overflow. This sort of thing is what OpenMP's ordered depend is for, I believe. Anyhow, this leads to the third solution:
Generate a k array, holding the values k will have for each iteration, such that those threads that will write write to the correct places. This requires scanning the predicate.
It is useful to have higher level constructs like map, scan and reduce when planning out algorithms.

parallel programming in OpenMP

I have the following piece of code.
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
where offset is an array of size n containing values in the range [0, m) and cnt is an array of size m initialized to 0. I use OpenMP to parallelize it as follows.
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
According to the discussion in this post, if offset[i1] == offset[i2] for i1 != i2, the above piece of code may result in incorrect cnt. What can I do to avoid this?

This code:
#pragma omp parallel for shared(cnt, offset) private(i)
for (i = 0; i < n; ++i) {
++cnt[offset[i]];
}
contains a race-condition during the updates of the array cnt, to solve it you need to guarantee mutual exclusion of those updates. That can be achieved with (for instance) #pragma omp atomic update but as already pointed out in the comments:
However, this resolves just correctness and may be terribly
inefficient due to heavy cache contention and synchronization needs
(including false sharing). The only solution then is to have each
thread its private copy of cnt and reduce these copies at the end.
The alternative solution is to have a private array per thread, and at end of the parallel region you perform the manual reduction of all those arrays into one. An example of such approach can be found here.
Fortunately, with OpenMP 4.5 you can reduce arrays using a dedicate pragma, namely:
#pragma omp parallel for reduction(+:cnt)
You can have look at this example on how to apply that feature.
Worth mentioning that regarding the reduction of arrays versus the atomic approach as kindly point out by #Jérôme Richard:
Note that this is fast only if the array is not huge (the atomic based
solution could be faster in this specific case regarding the platform
and if the values are not conflicting). So that is m << n. –
As always profiling is the key!; Hence, you should test your code with aforementioned approaches to find out which one is the most efficient.

OpenMP parallel thread

I need to parallelize this loop, I though that to use was a good idea, but I never studied them before.
#pragma omp parallel for
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it){
worst_q = std::min(worst_q, mesh->element_quality(*it));
}
In this case the loop is not parallelized because it uses iterator and the compiler cannot
understand how to slit it.
Can You help me?

OpenMP requires that the controlling predicate in parallel for loops has one of the following relational operators: <, <=, > or >=. Only random access iterators provide these operators and hence OpenMP parallel loops work only with containers that provide random access iterators. std::set provides only bidirectional iterators. You may overcome that limitation using explicit tasks. Reduction can be performed by first partially reducing over private to each thread variables followed by a global reduction over the partial values.
double *t_worst_q;
// Cache size on x86/x64 in number of t_worst_q[] elements
const int cb = 64 / sizeof(*t_worst_q);
#pragma omp parallel
{
#pragma omp single
{
t_worst_q = new double[omp_get_num_threads() * cb];
for (int i = 0; i < omp_get_num_threads(); i++)
t_worst_q[i * cb] = worst_q;
}
// Perform partial min reduction using tasks
#pragma omp single
{
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it) {
size_t elem = *it;
#pragma omp task
{
int tid = omp_get_thread_num();
t_worst_q[tid * cb] = std::min(t_worst_q[tid * cb],
mesh->element_quality(elem));
}
}
}
// Perform global reduction
#pragma omp critical
{
int tid = omp_get_thread_num();
worst_q = std::min(worst_q, t_worst_q[tid * cb]);
}
}
delete [] t_worst_q;
(I assume that mesh->element_quality() returns double)
Some key points:
The loop is executed serially by one thread only, but each iteration creates a new task. These are most likely queued for execution by the idle threads.
Idle threads waiting at the implicit barrier of the single construct begin consuming tasks as soon as they are created.
The value pointed by it is dereferenced before the task body. If dereferenced inside the task body, it would be firstprivate and a copy of the iterator would be created for each task (i.e. on each iteration). This is not what you want.
Each thread performs partial reduction in its private part of the t_worst_q[].
In order to prevent performance degradation due to false sharing, the elements of t_worst_q[] that each thread accesses are spaced out so to end up in separate cache lines. On x86/x64 the cache line is 64 bytes, therefore the thread number is multiplied by cb = 64 / sizeof(double).
The global min reduction is performed inside a critical construct to protect worst_q from being accessed by several threads at once. This is for illustrative purposes only since the reduction could also be performed by a loop in the main thread after the parallel region.
Note that explicit tasks require compiler which supports OpenMP 3.0 or 3.1. This rules out all versions of Microsoft C/C++ Compiler (it only supports OpenMP 2.0).

Random-Access Container
The simplest solution is to just throw everything into a random-access container (like std::vector) and use the index-based loops that are favoured by OpenMP:
// Copy elements
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, complexCalc(neListVector[i]));
}
Apart from being incredibly simple, in your situation (tiny elements of type size_t that can easily be copied) this is also the solution with the best performance and scalability.
Avoiding copies
However, in a different situation than yours you may have elements that aren't copied as easily (larger elements) or cannot be copied at all. In this case you can just throw the corresponding pointers in a random-access container:
// Collect pointers
std::vector<const nonCopiableObjectType *> neListVector;
for (const auto &entry : mesh->NEList[vid]) {
neListVector.push_back(&entry);
}
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, mesh->element_quality(*neListVector[i]));
}
This is slightly more complex than the first solution, still has the same good performance on small elements and increased performance on larger elements.
Tasks and Dynamic Scheduling
Since someone else brought up OpenMP Tasks in his answer, I want to comment on that to. Tasks are a very powerful construct, but they have a huge overhead (that even increases with the number of threads) and in this case just make things more complex.
For the min reduction the use of Tasks is never justified because the creation of a Task in the main thread costs much more than just doing the std::min itself!
For the more complex operation mesh->element_quality you might think that the dynamic nature of Tasks can help you with load-balancing problems, in case that the execution time of mesh->element_quality varies greatly between iterations and you don't have enough iterations to even it out. But even in that case, there is a simpler solution: Simply use dynamic scheduling by adding the schedule(dynamic) directive to your parallel for line in one of my previous solutions. It achieves the same behaviour which far less overhead.

Finding minimal element in array, and its index

With OpenMP 3.1, it is possible to have a reduction clause with min:
double m;
#pragma omp parallel for reduction(min:m)
for (int i=0;i< n; i++){
if (a[i]*2 < m) {
m = a[i] * 2;
}
return m;
Suppose I also need the index of the minimal element; is there a way to use the reduction clause for this? I believe the alternative is writing the reduction manually using nowait and critical.

Suppose I also need the index of the minimal element; is there a way to use the reduction clause for this?
Unfortunately, no. the list of possible reductions in OpenMP is very … small. In particular, min and max are the only “higher-level” functions and they aren’t customisable. At all.
I have to admit that I don’t like OpenMP’s approach to reductions, precisely because it’s not extensible in the slightest, it is designed only to work on special cases. Granted, those are interesting special cases but it’s still fundamentally a bad approach.
For such operations, you need to implement the reduction yourself by accumulating thread-local results into thread-local variables and combining them at the end.
The easiest way of doing this (and indeed quite close to how OpenMP implements reductions) is to have an array with elements for each thread, and using omp_get_thread_num() to access an element. Note however that this will lead to a performance degradation due to false sharing if the elements in the array share a cache line. To mitigate this, pad the array:
struct min_element_t {
double min_val;
size_t min_index;
};
size_t const CACHE_LINE_SIZE = 1024; // for example.
std::vector<min_element_t> mins(threadnum * CACHE_LINE_SIZE);
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
size_t const index = omp_get_thread_num() * CACHE_LINE_SIZE;
// operate on mins[index] …
}

C++ OpenMP directives for a parallel for loop?

I am trying OpenMP on a particular code snippet. Not sure if the snippet needs a revamp, perhaps it is set up too rigidly for sequential implementation. Anyway here is the (pseudo-)code that I'm trying to parallelize:
#pragma omp parallel for private(id, local_info, current_local_cell_id, local_subdomain_size) shared(cells, current_global_cell_id, global_id)
for(id = 0; id < grid_size; ++id) {
local_info = cells.get_local_subdomain_info(id);
local_subdomain_size = local_info.size();
...do other stuff...
do {
current_local_cell_id = cells.get_subdomain_cell_id(id);
global_id.set(id, current_global_cell_id + current_local_cell_id);
} while(id < local_subdomain_size && ++id);
current_global_cell_id += local_subdomain_size;
}
This makes complete sense (after staring at it for some time) in a sequential sense, which also might mean that it needs to be re-written for OpenMP. My concern is that current_local_cell_id and local_subdomain_size are private, but current_global_cell_id and global_id are shared.
Hence the statement current_global_cell_id += local_subdomain_size after the inner loop:
do {
...
} while(...)
current_global_cell_id += local_subdomain_size;
might lead to errors in the OpenMP setting, I suspect. I would greatly appreciate if any of the OpenMP experts out there can provide some pointers on any of the special OMP directives I can use to make minimum changes to the code but still avail of OpenMP for such a type of for loop.

I'm not sure I understand your code. However, I think you really want some kind of parallel accumulation.
You could use a pattern like
size_t total = 0;
#pragma omp parallel for shared(total) reduction (+:total)
for (int i=0; i<MAXITEMS; i++)
{
total += getvalue(i); // TODO replace with your logic
}
// total has been 'magically' combined by OMP
On a related note, when you use gcc you can just use the __gnu_parallel::accumulate drop-in replacement for std::accumulate, which does exactly the same. See Chapter 18. Parallel Mode
size_t total = __gnu_parallel::accumulate(c.begin(), c.end(), 0, &myvalue_accum);
You can even compile with -D_GLIBCXX_PARALLEL which will make all use of std algorithms automatically parallellized if possible. Don't use that unless you know what you're doing! Frequently, performance just suffers and the chance of introducing bugs due to unexpected parallelism is real

changing id inside the loop is not correct. There is no way to dispatch the loop to different thread, as loop step does not produce a predictable id value.
Why are you using the id inside that do while loop?

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Parallel Aggregration of a collection using PPL or TBB - c++

Related

Increasing array index in openMP

parallel programming in OpenMP

OpenMP parallel thread

Finding minimal element in array, and its index

C++ OpenMP directives for a parallel for loop?

Categories

Resources