OpenMP reduction after parallel region declared outside function - C++

Apologies if this has already been asked, I can't find an answer to my specific question easily.
I have code that I am parallelising. I want to declare a parallel region outside a function call, but inside the function I need to do some reduction operations.
The basic form of the code is:
#pragma omp parallel
{
    for(j=0; j<time_limit; j++)
    {
        //do some parallel loops
        do_stuff(arg1, arg2);
    }
}
...
...
void do_stuff(int arg1, int arg2)
{
    int sum = 0;
    #pragma omp for reduction(+:sum) //the sum must be shared between all threads
    for(int i=0; i<arg1; i++)
    {
        sum += something;
    }
}
When I try to compile, the reduction clause causes an error because the variable sum is private to each thread (obviously, since it is declared inside the parallel region).
Is there a way to do this reduction (or something with the same end result) without having to declare the parallel region inside the function do_stuff?

If you only want the reduction in the function, you can use static storage. From section 2.14.1.2 of the OpenMP 4.0.0 specification:
Variables with static storage duration that are declared in called routines in the region are shared.
#include <stdio.h>

void do_stuff(int arg1, int arg2)
{
    static int sum = 0;
    #pragma omp for reduction(+:sum)
    for(int i=0; i<arg1; i++) sum += arg2;
    printf("sum %d\n", sum);
}

int main(void) {
    const int time_limit = 10;
    int x[time_limit];
    for(int i=0; i<time_limit; i++) x[i] = i;
    #pragma omp parallel
    {
        for(int j=0; j<time_limit; j++) do_stuff(10, x[j]);
    }
}
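If you would rather not rely on static storage, one alternative (a sketch of my own, not part of the original answer: the reference parameter and the explicit atomic merge are choices I am making here) is to let the caller own the accumulator, declared before the parallel region, and merge per-thread partial sums manually:
#include <stdio.h>

// Hypothetical variant: 'sum' is owned by the caller and passed by reference, so it is
// shared; each thread merges its private partial sum with one atomic update per call.
void do_stuff(int arg1, int arg2, int &sum)
{
    int partial = 0;                  // private to each thread
    #pragma omp for
    for(int i=0; i<arg1; i++) partial += arg2;
    #pragma omp atomic                // one synchronized update per thread and call
    sum += partial;
}

int main(void) {
    const int time_limit = 10;
    int x[time_limit];
    for(int i=0; i<time_limit; i++) x[i] = i;
    int sum = 0;                      // declared outside the parallel region => shared
    #pragma omp parallel
    {
        for(int j=0; j<time_limit; j++) do_stuff(10, x[j], sum);
    }
    printf("sum %d\n", sum);
}
The worksharing for-loop keeps the threads in step through its implicit barrier, and the single atomic per thread is cheap compared to synchronizing inside the loop.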

Related

Is a race condition for parallel OpenMP threads reading the same shared data possible?

Here is a piece of code:
#include <iostream>
#include <array>
#include <limits>
#include <random>
#include <omp.h>

class DBase
{
public:
    DBase()
    {
        delta = (xmax - xmin) / n;
        for(int i=0; i<n+1; ++i) x.at(i) = xmin + i*delta;
        y = {1.0, 3.0, 9.0, 15.0, 20.0, 17.0, 13.0, 9.0, 5.0, 4.0, 1.0};
    }
    double GetXmax() { return xmax; }
    double interpolate(double xx)
    {
        int bin = xx/delta;
        if(bin<0 || bin>n-1) return 0.0;
        double res = y.at(bin) + (y.at(bin+1) - y.at(bin)) * (xx - x.at(bin))
                     / (x.at(bin+1) - x.at(bin));
        return res;
    }
private:
    static constexpr int n = 10;
    double xmin = 0.0;
    double xmax = 10.0;
    double delta;
    std::array<double, n+1> x;
    std::array<double, n+1> y;
};

int main(int argc, char *argv[])
{
    DBase dbase;
    const int N = 10000;
    std::array<double, N> rnd{0.0};
    std::array<double, N> src{0.0};
    std::array<double, N> res{0.0};
    unsigned seed = 1;
    std::default_random_engine generator(seed);
    for(int i=0; i<N; ++i)
        rnd.at(i) = std::generate_canonical<double, std::numeric_limits<double>::digits>(generator);
    #pragma omp parallel for
    for(int i=0; i<N; ++i)
    {
        src.at(i) = rnd.at(i) * dbase.GetXmax();
        res.at(i) = dbase.interpolate(rnd.at(i) * dbase.GetXmax());
    }
    for(int i=0; i<N; ++i)
        std::cout << "(" << src.at(i) << " , " << res.at(i) << ") " << std::endl;
    return 0;
}
It seems to work properly either with #pragma omp parallel for or without it (I checked the output). But I can't understand the following things:
1) Different parallel threads access the same arrays x and y of the dbase object of class DBase (I understand that OpenMP implicitly makes the dbase object shared, i.e. #pragma omp parallel for shared(dbase)). The threads do not write to these arrays, they only read from them. But when they read, can there be a race condition on their reads from x and y or not? If not, how is it organized so that at any moment only one thread reads from x and y in interpolate() and the threads do not disturb each other? Or is there perhaps a local copy of the dbase object and its x and y arrays in each OpenMP thread (which would be equivalent to #pragma omp parallel for private(dbase))?
2) Should I write #pragma omp parallel for shared(dbase) in such code, or is #pragma omp parallel for enough?
3) I think that if I placed a single random number generator inside the for-loop, then to make it work properly (i.e. not to let its internal state get into a race condition) I would have to write
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
    src.at(i) = rnd.at(i) * dbase.GetXmax();
    #pragma omp atomic
    std::generate_canonical<double, std::numeric_limits<double>::digits>(generator);
    res.at(i) = dbase.interpolate(rnd.at(i) * dbase.GetXmax());
}
The #pragma omp atomic would destroy the performance gain from #pragma omp parallel for (it would make the threads wait on each other). So the only correct way to use random numbers inside a parallel region is to have a separate generator (or seed) for each thread, or to prepare all the needed random numbers before the for-loop. Is that correct?
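Regarding point 3: one common pattern, shown as a minimal sketch below (seeding by thread number is just one simple choice I am assuming, not something the original code does), is to give each thread its own generator inside the parallel region so that no generator state is shared:
#include <array>
#include <limits>
#include <random>
#include <omp.h>

int main()
{
    const int N = 10000;
    std::array<double, N> rnd{};

    #pragma omp parallel
    {
        // One engine per thread; different seeds so the threads do not produce identical streams.
        std::default_random_engine generator(1 + omp_get_thread_num());
        #pragma omp for
        for(int i=0; i<N; ++i)
            rnd.at(i) = std::generate_canonical<double, std::numeric_limits<double>::digits>(generator);
    }
    return 0;
}
Simply offsetting the seed does not guarantee statistically independent streams, but it removes the race on shared generator state without any locking inside the loop.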

How to read lock a multithreaded C++ program

For the following program, multithreaded with OpenMP, what can I do to prevent other threads from reading the "stuff" vector while one thread is writing to "stuff"?
vector<int> stuff; //Global vector

void loop() {
    #pragma omp parallel for
    for(int i=0; i < 8000; i++){
        func(i);
    }
}

void func(int& i) {
    vector<int> local(stuff.begin() + i, stuff.end()); //Reading and copying global vector "stuff"
    //Random function calls here
    #pragma omp critical
    stuff.assign(otherstuff.begin(), otherstuff.end()); //Writing to global vector "stuff"
}
You can use mutexes to synchronize access to data shared among several threads:
#include <mutex>

std::mutex g_stuff_mutex;
vector<int> stuff; //Global vector

void loop() {
    #pragma omp parallel for
    for(int i=0; i < 8000; i++){
        func(i);
    }
}

void func(int& i) {
    g_stuff_mutex.lock();
    vector<int> local(stuff.begin() + i, stuff.end()); //Reading and copying global vector "stuff"
    g_stuff_mutex.unlock();
    //Random function calls here
    g_stuff_mutex.lock();
    stuff.assign(otherstuff.begin(), otherstuff.end()); //Writing to global vector "stuff"
    g_stuff_mutex.unlock();
}
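Since the title asks for a read lock specifically: if several threads only need to read "stuff" at the same time, a reader/writer lock lets the reads proceed concurrently and only serializes the writer. A minimal sketch assuming C++17's std::shared_mutex (not part of the original answer; otherstuff is assumed to exist as in the question):
#include <mutex>
#include <shared_mutex>
#include <vector>

std::shared_mutex g_stuff_mutex;
std::vector<int> stuff;      // Global vector
std::vector<int> otherstuff; // Assumed to exist, as in the question

void func(int i) {
    std::vector<int> local;
    {
        std::shared_lock<std::shared_mutex> read_lock(g_stuff_mutex); // many readers may hold this at once
        local.assign(stuff.begin() + i, stuff.end());
    }
    // Random function calls here
    {
        std::unique_lock<std::shared_mutex> write_lock(g_stuff_mutex); // exclusive access for the writer
        stuff.assign(otherstuff.begin(), otherstuff.end());
    }
}
The lock guards also release the mutex automatically at the end of each scope, which avoids forgetting an unlock on an early return.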

“private variable cannot be reduction” although that variable is defined outside the SIMD block

I have a C++ project which uses OpenMP, and I am trying to compile it with LLVM on Blue Gene/Q. There is one function that, stripped down, looks like this:
template <typename FT, int veclen>
inline void xmyNorm2Spinor(FT *res,
                           FT *x,
                           FT *y,
                           double &n2res,
                           int n,
                           int n_cores,
                           int n_simt,
                           int n_blas_simt) {
#if defined(__GNUG__) && !defined(__INTEL_COMPILER)
    double norm2res __attribute__((aligned(QPHIX_LLC_CACHE_ALIGN))) = 0;
#else
    __declspec(align(QPHIX_LLC_CACHE_ALIGN)) double norm2res = 0;
#endif
    #pragma omp parallel shared(norm_array)
    {
        // […]
        if (smtid < n_blas_simt) {
            // […]
            double lnorm = 0;
            //#pragma prefetch x,y,res
            //#pragma vector aligned(x,y,res)
            #pragma omp simd aligned(res, x, y : veclen) reduction(+ : lnorm)
            for (int i = low; i < hi; i++) {
                res[i] = x[i] - y[i];
                double tmpd = (double)res[i];
                lnorm += (tmpd * tmpd);
            }
            // […]
        }
    }
    // […]
}
The error is this right here:
In file included from /homec/hbn28/hbn28e/Sources/qphix/tests/timeDslashNoQDP.cc:6:
In file included from /homec/hbn28/hbn28e/Sources/qphix/include/qphix/blas.h:8:
/homec/hbn28/hbn28e/Sources/qphix/include/qphix/blas_c.h:156:54: error: private variable cannot be reduction
#pragma omp simd aligned(res,x,y:veclen) reduction(+:lnorm)
^
/homec/hbn28/hbn28e/Sources/qphix/include/qphix/blas_c.h:151:12: note: predetermined as private
double lnorm=0;
^
Due to the outer omp parallel block, the variable lnorm is defined for every thread. Then there is an additional SIMD section where each thread uses a SIMD lane. The reduction should be done within the thread, so the scope of the variables looks right. Still the compiler does not want it this way.
What is wrong here?
The problem seems to be that the private attribute attached to the lnorm variable by the omp parallel block conflicts with the requirements imposed by the OpenMP reduction() clause on its argument variable (even though lnorm is not private with respect to the nested omp simd block to which the reduction() clause applies).
You can try solving that problem by extracting the lnorm calculation code into a function of its own:
template <typename FT, int veclen>
inline double compute_res_and_lnorm(FT *res,
                                     FT *x,
                                     FT *y,
                                     int low,
                                     int hi)
{
    double lnorm = 0;
    #pragma omp simd aligned(res, x, y : veclen) reduction(+ : lnorm)
    for (int i = low; i < hi; i++) {
        res[i] = x[i] - y[i];
        double tmpd = (double)res[i];
        lnorm += (tmpd * tmpd);
    }
    return lnorm;
}

template <typename FT, int veclen>
inline void xmyNorm2Spinor(FT *res,
                           FT *x,
                           FT *y,
                           double &n2res,
                           int n,
                           int n_cores,
                           int n_simt,
                           int n_blas_simt) {
#if defined(__GNUG__) && !defined(__INTEL_COMPILER)
    double norm2res __attribute__((aligned(QPHIX_LLC_CACHE_ALIGN))) = 0;
#else
    __declspec(align(QPHIX_LLC_CACHE_ALIGN)) double norm2res = 0;
#endif
    #pragma omp parallel shared(norm_array)
    {
        // […]
        if (smtid < n_blas_simt) {
            // […]
            double lnorm = compute_res_and_lnorm<FT, veclen>(res, x, y, low, hi);
            // […]
        }
    }
    // […]
}

OpenMP - using functions

When I use OpenMP with reduction(+ : sum) and without function calls, the OpenMP version works fine.
#include <iostream>
#include <omp.h>
using namespace std;

int sum = 0;

void summation()
{
    sum = sum + 1;
}

int main()
{
    int i, sum;
    #pragma omp parallel for reduction (+ : sum)
    for(i = 0; i < 1000000000; i++)
        summation();
    #pragma omp parallel for reduction (+ : sum)
    for(i = 0; i < 1000000000; i++)
        summation();
    #pragma omp parallel for reduction (+ : sum)
    for(i = 0; i < 1000000000; i++)
        summation();
    std::cerr << "Sum is=" << sum << std::endl;
}
But when I call the function summation on a global variable, as above, the OpenMP version takes even more time than the sequential version.
I would like to know the reason for this and the changes that should be made.
The summation function doesn't use the OMP shared variable that you are reducing to. Fix it:
#include <iostream>
#include <omp.h>

void summation(int& sum) { sum++; }

int main()
{
    int sum = 0;
    #pragma omp parallel for reduction (+ : sum)
    for(int i = 0; i < 1000000000; ++i)
        summation(sum);
    std::cerr << "Sum is=" << sum << '\n';
}
In your original code, the time taken to synchronize access to that one global variable far exceeds anything you gain from using multiple cores: the threads end up waiting on each other, because there is only one variable and only one core can usefully update it at a time. That design is not capable of concurrency, and all the synchronization you pay for just increases the run time.
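For intuition, reduction(+ : sum) behaves roughly like the following manual pattern (a sketch for illustration, not literally what the runtime does): each thread accumulates into a private variable with no synchronization inside the loop, and only one synchronized merge per thread happens at the end, instead of contending on the shared variable in every iteration.
#include <iostream>
#include <omp.h>

// Manual equivalent of reduction(+:sum), for illustration only.
int main()
{
    int sum = 0;
    #pragma omp parallel
    {
        int local = 0;                 // private: no sharing, no waiting inside the loop
        #pragma omp for
        for(int i = 0; i < 1000000000; ++i)
            ++local;
        #pragma omp atomic             // one synchronized merge per thread
        sum += local;
    }
    std::cerr << "Sum is=" << sum << '\n';
}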

pointers with OpenMP

I am trying to use OpenMP in my program (I am a newbie with OpenMP) and the program produces errors in two places.
Here is an example of the code:
#include <iostream>
#include <cstdint>
#include <vector>
#include <boost/multi_array.hpp>
#include <omp.h>

class CNachbarn {
public:
    CNachbarn () { a = 0; }
    uint32_t Get_Next_Neighbor() { return a++; }
private:
    uint32_t a;
};

class CNetwork {
public:
    CNetwork ( uint32_t num_elements_ );
    ~CNetwork();
    void Validity();
    void Clean();
private:
    uint32_t num_elements;
    uint32_t nachbar;
    std::vector<uint32_t> remove_node_v;
    CNachbarn *Nachbar;
};

CNetwork::CNetwork( uint32_t num_elements_ ) {
    num_elements = num_elements_;
    Nachbar = new CNachbarn();
    remove_node_v.reserve( num_elements );
}

CNetwork::~CNetwork() {
    delete Nachbar;
}

inline void CNetwork::Validity() {
    #pragma omp parallel for
    for ( uint32_t i = 0 ; i < num_elements ; i++ ) {
        #pragma omp critical
        remove_node_v.push_back(i);
    }
}

void CNetwork::Clean () {
    #pragma omp parallel for
    for ( uint8_t j = 0 ; j < 2 ; j++ ) {
        nachbar = Nachbar->Get_Next_Neighbor();
        std::cout << "i: " << i << ", neighbor: " << nachbar << std::endl;
    }
    remove_node_v.clear();
}

int main() {
    uint32_t num_elements = 1u << 3;
    uint32_t i = 0;
    CNetwork Network( num_elements );
    do {
        Network.Validity();
        Network.Clean();
    } while (++i < 2);
    return 0;
}
I would like to know:
Is #pragma omp critical a good solution for push_back()? (Does it solve the problem?) Would it be better to give each thread its own vector and then combine them (using insert())? Or some kind of lock?
In my original code I get a runtime error at: nachbar = Nachbar->Get_Next_Neighbor( &remove_node_v[i] ); but not in this example. Nevertheless, I would like OpenMP to use as many CNachbarn instances as there are cores, since CNachbarn does a recursive computation and should not be influenced by the other threads. The question is how to do this smartly? (I don't think it is smart to construct a CNachbarn each time I enter the for-loop, since I call this function more than a million times in my simulation and time is important.)
Concerning your first problem:
Your function Validity is a perfect way to achieve below-serial performance in a parallel loop. However, you already gave the correct answer yourself: fill an independent vector for each thread and merge them afterwards.
inline void CNetwork::Validity() {
    #pragma omp parallel for
    for ( uint32_t i = 0 ; i < num_elements ; i++ ) {
        #pragma omp critical
        remove_node_v.push_back(i);
    }
}
EDIT: A possible remedy could look like this (if you require serial access to your elements, you need to change the loop a bit)
inline void CNetwork::Validity() {
    remove_node_v.reserve(num_elements);
    #pragma omp parallel
    {
        std::vector<uint32_t> remove_node_v_thread_local;
        uint32_t thread_id = omp_get_thread_num();
        uint32_t n_threads = omp_get_num_threads();
        for ( uint32_t i = thread_id ; i < num_elements ; i += n_threads )
            remove_node_v_thread_local.push_back(i);
        #pragma omp critical
        remove_node_v.insert(remove_node_v.end(),
                             remove_node_v_thread_local.begin(),
                             remove_node_v_thread_local.end());
    }
}
Your second problem could be solved by defining an array of CNachbarn with as many elements as the maximum possible number of OpenMP threads, and letting each thread access its own element of the array:
CNachbarn* mein_nachbar = alle_meine_nachbarn[omp_get_thread_num()];
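A minimal, self-contained sketch of that idea (the container name, the loop bound, and the printing are illustrative choices of mine, not taken from the original code): construct one CNachbarn per possible thread once, before the hot loop, and let each thread index the container with its own thread number.
#include <cstdint>
#include <iostream>
#include <vector>
#include <omp.h>

// Sketch only: 'alle_meine_nachbarn' and the loop bound are illustrative names/values.
class CNachbarn {
public:
    CNachbarn () { a = 0; }
    uint32_t Get_Next_Neighbor() { return a++; }
private:
    uint32_t a;
};

int main() {
    // One instance per thread, constructed once and reused across all later parallel loops.
    std::vector<CNachbarn> alle_meine_nachbarn( omp_get_max_threads() );

    #pragma omp parallel for
    for ( int j = 0 ; j < 8 ; j++ ) {
        CNachbarn& mein_nachbar = alle_meine_nachbarn[ omp_get_thread_num() ];
        uint32_t nachbar = mein_nachbar.Get_Next_Neighbor();
        #pragma omp critical
        std::cout << "thread: " << omp_get_thread_num() << ", neighbor: " << nachbar << std::endl;
    }
    return 0;
}
If the per-thread objects are updated very frequently, padding them to separate cache lines avoids false sharing, but that is beyond the scope of this sketch.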