Is it possible to do a reduction on an array with openmp? - c++

Does OpenMP natively support reduction of a variable that represents an array?
This would work something like the following...
float* a = (float*) calloc(4*sizeof(float));
omp_set_num_threads(13);
#pragma omp parallel reduction(+:a)
for(i=0;i<4;i++){
a[i] += 1; // Thread-local copy of a incremented by something interesting
}
// a now contains [13 13 13 13]
Ideally, there would be something similar for an omp parallel for, and if you have a large enough number of threads for it to make sense, the accumulation would happen via binary tree.

Array reduction is now possible with OpenMP 4.5 for C and C++. Here's an example:
#include <iostream>
int main()
{
int myArray[6] = {};
#pragma omp parallel for reduction(+:myArray[:6])
for (int i=0; i<50; ++i)
{
double a = 2.0; // Or something non-trivial justifying the parallelism...
for (int n = 0; n<6; ++n)
{
myArray[n] += a;
}
}
// Print the array elements to see them summed
for (int n = 0; n<6; ++n)
{
std::cout << myArray[n] << " " << std::endl;
}
}
Outputs:
100
100
100
100
100
100
I compiled this with GCC 6.2. You can see which common compiler versions support the OpenMP 4.5 features here: https://www.openmp.org/resources/openmp-compilers-tools/
Note from the comments above that while this is convenient syntax, it may invoke a lot of overheads from creating copies of each array section for each thread.

Only in Fortran in OpenMP 3.0, and probably only with certain compilers.
See the last example (Example 3) on:
http://wikis.sun.com/display/openmp/Fortran+Allocatable+Arrays

Now the latest openMP 4.5 spec has supports of reduction of C/C++ arrays.
http://openmp.org/wp/2015/11/openmp-45-specs-released/
And latest GCC 6.1 also has supported this feature.
http://openmp.org/wp/2016/05/gcc-61-released-supports-openmp-45/
But I didn't give it a try yet. Wish others can test this feature.

OpenMP cannot perform reductions on array or structure type variables (see restrictions).
You also might want to read up on private and shared clauses. private declares a variable to be private to each thread, where as shared declares a variable to be shared among all threads. I also found the answer to this question very useful with regards to OpenMP and arrays.

OpenMP can perform this operation as of OpenMP 4.5 and GCC 6.3 (and possibly lower) supports it. An example program looks as follows:
#include <vector>
#include <iostream>
int main(){
std::vector<int> vec;
#pragma omp declare reduction (merge : std::vector<int> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
#pragma omp parallel for default(none) schedule(static) reduction(merge: vec)
for(int i=0;i<100;i++)
vec.push_back(i);
for(const auto x: vec)
std::cout<<x<<"\n";
return 0;
}
Note that omp_out and omp_in are special variables and that the type of the declare reduction must match the vector you are planning to reduce on.

Related

How to properly use #pragma omp simd?

I'm trying to understand whether the code below is OpenMP standard compliant. The main concern here is the args object that contains an offset field that is modified inside a loop to which #pragma omp simd is applied. Is this a legit use case?
#include <cstdio>
struct args_t {
int offset;
};
const int n = 10;
float data1[n];
float data2[n];
void foo(float &res, const args_t& args) {
res = res + data2[args.offset];
}
int main() {
printf("Original arrays:\n");
for (int i = 0; i < n; i++) {
data1[i] = (float)i / 2.0f;
printf("%f ", data1[i]);
}
printf("\n");
for (int i = 0; i < n; i++) {
data2[i] = (float)i / 3.0f;
printf("%f ", data2[i]);
}
printf("\n");
args_t args;
args.offset = 0;
#pragma omp simd
for (int i = 0; i < n; i++) {
foo(data1[i], args);
args.offset++;
}
printf("Sum of two arrays:\n");
for (int i = 0; i < n; i++)
printf("%f ", data1[i]);
printf("\n");
return 0;
}
TL;DR answer: OpenMP specification is not specific in this respect, which means that the answer depends on the actual implementation. In practice, your code properly vectorized (at least on newest gcc/clang on x86-64 platform), but you can specify that your variable is modified inside the loop by using the linear clause.
Detailed answer:
In the OpenMP specification the execution model of the simd construct is quite vaguely described:
The simd construct can be applied to a loop to indicate that the
loop can be transformed into a SIMD loop...
This gives a lot of flexibility/freedom to the compiler, and also raises many questions - like yours. The last paragraph of this document is much more clear:
OpenMP provides directives to improve the capabilities of the
compiler’s auto-vectorization pass by providing it with information
that cannot be determined through compile-time static-analysis. This
allows the programmer to effectively vectorize previously problematic
sections of code and have it run efficiently on several computer
architectures and accelerators...
This practically means that the OpenMP simd directives provide only information to the compiler for auto-vectorization, but how auto-vectorization is actually performed depends on the the implementation .
So, based on the above mentioned references and some tests with Compiler Explorer (gcc and clang on x86-64 platform) I always found that if you do not provide enough information for vectorization the worst case is that the loop won't be vectorized, but it will not result incorrect code.
I have also found that using #pragma omp simd without any additional clause or directive is practically equivalent to the use of #pragma GCC ivdep (or #pragma clang loop vectorize(assume_safety) for clang), but it is much more portable.
In the following code, the compiler generated code first checks the value of k to determine if it is safe to vectorize, but if #pragma omp simd is added this check is omitted:
void vec_dep(int *a, int k, int c, int m) {
for (int i = 0; i < m; i++)
a[i] = a[i + k] * c;
}
Consider the following example:
int foo(int* A){
int sum=0;
#pragma omp simd reduction(+:sum)
for(int i=0;i<1024;++i)
sum+=A[i];
return sum;
}
In this example #pragma omp simd reduction(+:sum) is the absolutely correct form, but using #pragma omp simd or #pragma GCC ivdep or not using anything at all gives similar (correctly vectorized) code. Note that this is not the case if #pragma omp parallel for reduction(+:sum) is used, in this case reduction is absolutely necessary to avoid race condition. (Well, this raises the obvious question why the compiler does not give a warning in such a case.)
Similarly it is not necessary to use linear clause (the compiler can find this linear dependence):
#pragma omp simd linear(b:1)
for (int i=0;i<N;++i) array[i]=b++;
Note that, however, if #pragma omp parallel for simd linear(b) is used the linear(b) cannot be omitted otherwise the result will be incorrect, because the OpenMP calculates the initial b value for each thread using this linear relationship.
So, to answer your question, your code will compile to properly vectorized code (at least on compilers I have tested), even though the linear relationship is not specified. To specify this linear relationship you have to use the linear clause. The first idea to use #pragma omp simd linear(args.offset), but it can't compile becasue the following error: linear clause applied to non-integral non-pointer variable with 'args_t' type. The workaround is to use a reference to args.offset and change the function foo accordingly:
void foo(float &res, const int& offset) {
res = res + data2[offset];
}
...
int& p=args.offset;
#pragma omp simd linear(p)
for (int i = 0; i < n; i++) {
foo(data1[i], p);
p++;
}

OpenMP reduction on container elements

I have a nested loop, with few outer, and many inner iterations. In the inner loop, I need to calculate a sum, so I want to use an OpenMP reduction. The outer loop is on a container, so the reduction is supposed to happen on an element of that container.
Here's a minimal contrived example:
#include <omp.h>
#include <vector>
#include <iostream>
int main(){
constexpr int n { 128 };
std::vector<int> vec (4, 0);
for (unsigned int i {0}; i<vec.size(); ++i){
/* this does not work */
//#pragma omp parallel for reduction (+:vec[i])
//for (int j=0; j<n; ++j)
// vec[i] +=j;
/* this works */
int* val { &vec[0] };
#pragma omp parallel for reduction (+:val[i])
for (int j=0; j<n; ++j)
val[i] +=j;
/* this is allowed, but looks very wrong. Produces wrong results
* for std::vector, but on an Eigen type, it worked. */
#pragma omp parallel for reduction (+:val[i])
for (int j=0; j<n; ++j)
vec[i] +=j;
}
for (unsigned int i=0; i<vec.size(); ++i) std::cout << vec[i] << " ";
std::cout << "\n";
return 0;
}
The problem is, that if I write the reduction clause as (+:vec[i]), I get the error ‘vec’ does not have pointer or array type, which is descriptive enough to find a workaround. However, that means I have to introduce a new variable and somewhat change the code logic, and I find it less obvious to see what the code is supposed to do.
My main question is, whether there is a better/cleaner/more standard way to write a reduction for container elements.
I'd also like to know why and how the third way shown in the code above somewhat works. I'm actually working with the Eigen library, on whose containers that variant seems to work just fine (haven't extensively tested it though), but on std::vector, it produces results somewhere between zero and the actual result (8128). I thought it should work, because vec[i] and val[i] should both evaluate to dereferencing the same address. But alas, apparently not.
I'm using OpenMP 4.5 and gcc 9.3.0.
I'll answer your question in three parts:
1. What is the best way to perform to OpenMP reductions in your example above with a std::vec ?
i) Use your approach, i.e. create a pointer int* val { &vec[0] };
ii) Declare a new shared variable like #1201ProgramAlarm answered.
iii) declare a user defined reduction (which is not really applicable in your simple case, but see 3. below for a more efficient pattern).
2. Why doesn't the third loop work and why does it work with Eigen ?
Like the previous answer states you are telling OpenMP to perform a reduction sum on a memory address X, but you are performing additions on memory address Y, which means that the reduction declaration is ignored and your addition is subjected to the usual thread race conditions.
You don't really provide much detail into your Eigen venture, but here are some possible explanations:
i) You're not really using multiple threads (check n = Eigen::nbThreads( ))
ii) You didn't disable Eigen's own parallelism which can disrupt your own usage of OpenMP, e.g. EIGEN_DONT_PARALLELIZE compiler directive.
iii) The race condition is there, but you're not seeing it because Eigen operations take longer, you're using a low number of threads and only writing a low number of values => lower occurrence of threads interfering with each other to produce the wrong result.
3. How should I parallelize this scenario using OpenMP (technically not a question you asked explicitly) ?
Instead of parallelizing only the inner loop, you should parallelize both at the same time. The less serial code you have, the better. In this scenario each thread has its own private copy of the vec vector, which gets reduced after all the elements have been summed by their respective thread. This solution is optimal for your presented example, but might run into RAM problems if you're using a very large vector and very many threads (or have very limited RAM).
#pragma omp parallel for collapse(2) reduction(vsum : vec)
for (unsigned int i {0}; i<vec.size(); ++i){
for (int j = 0; j < n; ++j) {
vec[i] += j;
}
}
where vsum is a user defined reduction, i.e.
#pragma omp declare reduction(vsum : std::vector<int> : std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<int>())) initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))
Declare the reduction before the function where you use it, and you'll be good to go
For the second example, rather than storing a pointer then always accessing the same element, just use a local variable:
int val = vec[i];
#pragma omp parallel for reduction (+:val)
for (int j=0; j<n; ++j)
val +=j;
vec[i] = val;
With the 3rd loop, I suspect that the problem is because the reduction clause names a variable, but you never update that variable by that name in the loop so there is nothing that the compiler sees to reduce. Using Eigen may make the code a bit more complicate to analyze, resulting in the loop working.

Performance of matrix multiplications remains unchanged with OpenMP in C++

auto t1 = chrono::steady_clock::now();
#pragma omp parallel
{
for(int i=0;i<n;i++)
{
#pragma omp for collapse(2)
for(int j=0;j<n;j++)
{
for(int k=0;k<n;k++)
{
C[i][j]+=A[i][k]*B[k][j];
}
}
}
}
auto t2 = chrono::steady_clock::now();
auto t = std::chrono::duration_cast<chrono::microseconds>( t2 - t1 ).count();
With and without the parallelization the variable t remains fairly constant. I am not sure why this is happening. Also once in a while t is outputted as 0.
One more problem I am facing is that if I increase value of n to something like 500, the compiler is unable to run the program.(Here I've take n=100)
I am using code::blocks with the GNU GCC compiler.
The proposed OpenMP parallelization is not correct and may lead to wrong results. When specifying collapse(2), threads execute "simultaneously" the (j,k) iterations. If two (or more) threads work on the same j but different k, they accumulate the result of A[i][k]*B[k][j] to the same array location C[i][j]. This is a so called race condition, i.e. "two or more threads can access shared data and they try to change it at the same time" (What is a race condition?). Data races do not necessarily lead to wrong results despite the code is not OpenMP valid and can produce wrong results depending on several factors (scheduling, compiler implementation, number of threads,...). To fix the problem in the code above, OpenMP offers the reduction clause:
#pragma omp parallel
{
for(int i=0;i<n;i++) {
#pragma omp for collapse(2) reduction(+:C)
for(int j=0;j<n;j++) {
for(int k=0;k<n;k++) {
C[i][j]+=A[i][k]*B[k][j];
so that "a private copy is created in each implicit task (...) and is initialized with the initializer value of the reduction-identifier. After the end of the region, the original list item is updated with the values of the private copies using the combiner associated with the reduction-identifier" (http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf). Note that the reduction on arrays in C is directly supported by the standard since OpenMP 4.5 (check if the compiler support it, otherwise there are old manual ways to achieve it, Reducing on array in OpenMp).
However, for the given code, it should be probably more adequate to avoid the parallelization of the innermost loop so that the reduction is not needed at all:
#pragma omp parallel
{
#pragma omp for collapse(2)
for(int i=0;i<n;i++) {
for(int j=0;j<n;j++) {
for(int k=0;k<n;k++) {
C[i][j]+=A[i][k]*B[k][j];
Serial can be faster than OpenMP version for small sizes of matrices and/or small number of threads.
On my Intel machine using up to 16 cores, n=1000, GNU compiler v6.1 the break even is around 4 cores when the -O3 optimization is activated while the break even is around 2 cores compiling with -O0. For clarity I report the performances I measured:
Serial 418020
----------- WRONG ORIG -- +REDUCTION -- OUTER.COLLAPSE -- OUTER.NOCOLLAPSE -
OpenMP-1 1924950 2841993 1450686 1455989
OpenMP-2 988743 2446098 747333 745830
OpenMP-4 515266 3182262 396524 387671
OpenMP-8 280285 5510023 219506 211913
OpenMP-16 2227567 10807828 150277 123368
Using reduction the performance loss is dramatic (reversed speed-up). The outer parallelization (w or w/o collapse) is the best option.
As concerns your failure with large matrices, a possible reason is related to the size of the available stack. Try to enlarge both the system and OpenMP stack sizes, i.e.
ulimit -s unlimited
export OMP_STACKSIZE=10000000
The collapse directive may actually be responsible for this, because the index j is recreated using divide/mod operations.
Did you try without collapse?

Openmp and reduction on std::vector?

I want to make this code parallel:
std::vector<float> res(n,0);
std::vector<float> vals(m);
std::vector<float> indexes(m);
// fill indexes with values in range [0,n)
// fill vals and indexes
for(size_t i=0; i<m; i++){
res[indexes[i]] += //something using vas[i];
}
In this article it's suggested to use:
#pragma omp parallel for reduction(+:myArray[:6])
In this question the same approach is proposed in the comments section.
I have two questions:
I don't know m at compile time, and from these two examples it seems that's required. Is it so? Or if I can use it for this case, what do I have to replace ? with in the following command #pragma omp parallel for reduction(+:res[:?]) ? m or n?
Is it relevant that the indexes of the for are relative to indexes and vals and not to res, especially considering that reduction is done on the latter one?
However, If so, how can I solve this problem?
It is fairly straight forward to do a user declared reduction for C++ vectors of a specific type:
#include <algorithm>
#include <vector>
#pragma omp declare reduction(vec_float_plus : std::vector<float> : \
std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<float>())) \
initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))
std::vector<float> res(n,0);
#pragma omp parallel for reduction(vec_float_plus : res)
for(size_t i=0; i<m; i++){
res[...] += ...;
}
1a) Not knowing m at compile time is not a requirement.
1b) You cannot use the array section reduction on std::vectors, because they are not arrays (and std::vector::data is not an identifier). If it were possible, you'd have to use n, as this is the number of elements in the array section.
2) As long as you are only reading indexes and vals, there is no issue.
Edit: The original initializer caluse was simpler: initializer(omp_priv = omp_orig). However, if the original copy is then not full of zeroes, the result will be wrong. Therefore, I suggest the more complicated initializer which always creates zero-element vectors.

C++ OpenMP writing to specific element of a shared array/vector

I have a long-running simulation program and I plan to use OpenMP for paralleling some codes for speedup. I'm new to OpenMP and have the following question.
Given that the simulation is a stochastic one, I have following data structure and I need to capture age-specific count of seeded agents [Edited: some code edited]:
class CAgent {
int ageGroup;
bool isSeed;
/* some other stuff */
};
class Simulator {
std::vector<int> seed_by_age;
std::vector<CAgent> agents;
void initEnv();
/* some other stuff */
};
void Simulator::initEnv() {
std::fill(seed_by_age.begin(), seed_by_age.end(), 0);
#pragma omp parallel
{
#pragma omp for
for (size_t i = 0; i < agents.size(); i++)
{
agents[i].setup(); // (a)
if (someRandomCondition())
{
agents[i].isSeed = true;
/* (b) */
seed_by_age[0]++; // index = 0 -> overall
seed_by_age[ agents[i].ageGroup - 1 ]++;
}
}
} // end #parallel
} // end Simulator::initEnv()
As the variable seed_by_age is shared across threads, I know I have to protect it properly. So in (b), I used #pragma omp flush(seed_by_age[agents[i].ageGroup]) But the compiler complains "error: expected ')' before '[' token"
I'm not doing reduction, and I try to avoid 'critical' directive if possible. So, am I missing something here? How can I properly protect a particular element of the vector?
Many thanks and I appreciate any suggestions.
Development box: 2 core CPU, target platform 4-6 cores
Platform: Windows 7, 64bits
MinGW 4.7.2 64 bits (rubenvb build)
You can only use flush with variables, not elements of arrays and definitely not with elements of C++ container classes. The indexing operator for std::vector results in a call to operator[], an inline function, but still a function.
Because in your case std::vector::operator[] returns a reference to a simple scalar type, you can use the atomic update construct to protect the updates:
#pragma omp atomic update
seed_by_age[0]++; // index = 0 -> overall
#pragma omp atomic update
seed_by_age[ agents[i].ageGroup - 1 ]++;
As for not using reduction, each thread touches seed_by_age[0] when the condition inside the loop is met thereby invalidating the same cache line in all other cores. Access to the other vector elements also leads to mutual cache invalidation but assuming that agents are more or less equally distributed among the age groups, it would not be that severe as in the case with the first element in the vector. Therefore I would propose that you do something like:
int total_seed_by_age = 0;
#pragma omp parallel for schedule(static) reduction(+:total_seed_by_age)
for (size_t i = 0; i < agents.size(); i++)
{
agents[i].setup(); // (a)
if (someRandomCondition())
{
agents[i].isSeed = true;
/* (b) */
total_seed_by_age++;
#pragma omp atomic update
seed_by_age[ agents[i].ageGroup - 1 ]++;
}
}
seed_by_age[0] = total_seed_by_age;
#pragma omp flush(seed_by_age[agents[i]].ageGroup)
try to close all your bracket, it will fix the compiler error.
I am afraid, that your #pragma omp flush statement is not sufficient to protect your data and prevent a race condition here.
If someRandomCondition() is true in only a very limited number of cases you could use a critical section for the update of your vector without loosing too much speed. Alternatively, if the size of your vector seed_by_age is not too large (which I assume) than it could be efficient to have a private version of the vector for each thread which you merge right before leaving the parallel block.