Replacing TBB parallel_for with OpenMP - c++

I'm trying to come up with an equivalent replacement for an Intel TBB parallel_for loop that uses a tbb::blocked_range, using OpenMP. Digging around online, I've only managed to find mention of one other person doing something similar: a patch submitted to the Open Cascade project, wherein the TBB loop appeared as follows (but did not use a tbb::blocked_range):
tbb::parallel_for_each (aFaces.begin(), aFaces.end(), *this);
and the OpenMP equivalent was:
int i, n = aFaces.size();
#pragma omp parallel for private(i)
for (i = 0; i < n; ++i)
Process (aFaces[i]);
Here is the TBB loop I'm trying to replace:
tbb::parallel_for( tbb::blocked_range<size_t>( 0, targetList.size() ), DoStuff( targetList, data, vec, ptr ) );
It uses the DoStuff class to carry out the work:
class DoStuff
{
private:
List& targetList;
Data* data;
vector<things>& vec;
Worker* ptr;
public:
DoStuff( List& pass_targetList,
Data* pass_data,
vector<things>& pass_vec,
Worker* pass_worker)
: targetList(pass_targetList), data(pass_data), vec(pass_vec), ptr(pass_worker)
{
}
void operator() ( const tbb::blocked_range<size_t>& range ) const
{
for ( size_t idx = range.begin(); idx != range.end(); ++idx )
{
ptr->PerformWork(&targetList[idx], data->getData(), &vec);
}
}
};
My understanding, based on this reference, is that TBB divides the blocked range into smaller sub-ranges and gives each thread one of them to loop through. Each thread gets its own copy of the DoStuff functor, but since the class holds only references and pointers, the threads are essentially sharing those underlying resources.
Here's what I've come up with as an equivalent replacement in OpenMP:
int index = 0;
#pragma omp parallel for private(index)
for (index = 0; index < targetList.size(); ++index)
{
ptr->PerformWork(&targetList[index], data->getData(), &vec);
}
Because of circumstances outside my control (this is merely one component in a much larger system that spans 5+ computers), stepping through the code with a debugger to see exactly what's happening is... unlikely. I'm working on getting remote debugging going, but it's not looking very promising. All I know for sure is that the OpenMP code above is somehow doing something differently than TBB was, and the expected results after calling PerformWork for each index are not obtained.
Given the information above, does anyone have any ideas on why the OpenMP and TBB code are not functionally equivalent?

Following Ben and Rick's advice, I tested the following loop without the omp pragma (serially) and obtained my expected results (very slowly). After adding the pragma back in, the parallel code also performs as expected. Looks like the problem was either declaring the index outside the loop and marking it private, or calling targetList.size() directly in the loop condition instead of hoisting it into numTargets. Or both.
int numTargets = targetList.size();
#pragma omp parallel for
for (int index = 0; index < numTargets; ++index)
{
ptr->PerformWork(&targetList[index], data->getData(), &vec);
}
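For reference, the fixed pattern can be reduced to a minimal self-contained sketch; the List, Worker, and PerformWork types from the question are replaced here with hypothetical stand-ins (a plain vector<int> and a doubling function):

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for Worker::PerformWork: doubles one element.
static void perform_work(std::vector<int>& targets, int index) {
    targets[index] *= 2;
}

// Each iteration touches a distinct element, so no synchronization is needed.
void process_all(std::vector<int>& targets) {
    const int numTargets = static_cast<int>(targets.size());
    #pragma omp parallel for
    for (int index = 0; index < numTargets; ++index) {
        perform_work(targets, index);
    }
}
```

Declaring index in the for-init makes it implicitly private, and hoisting size() into a signed local avoids both the signed/unsigned comparison and calling into the container from every thread.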

Related

How do I get a private vector while still using functions with OpenMP?

I'm trying to parallelize my code with OpenMP.
I have a global vector, so I can access it from my functions.
Is there a way to assign a copy of the vector to every thread so each can do its own work on it?
Here is some pseudocode to describe my problem:
double var = 1;
std::vector<double> vec;
void function()
{
vec.push_back(var);
return;
}
int main()
{
omp_set_num_threads(2);
#pragma omp parallel
{
#pragma omp for private(vec)
for (int i = 0; i < 4; i++)
{
function();
}
}
return 0;
}
Notes:
I want each thread to have its own vector, to save specific values which later only that same thread needs to access
each thread calls a function (sometimes the same one) which then does some work on the vector (changing specific values)
(in my original code there are many vectors and functions; I've just tried to break the problem down)
I've tried #pragma omp threadprivate(), but that only works for variables and not for vectors.
Also, redeclaring the vector inside the parallel region doesn't help, as my function always works with the global vector, which leads to problems when different threads call it at the same time.
Is there a way that I can assign a copy of the vector to every thread
so they can do stuff with it?
Yes, the firstprivate clause does this:
The firstprivate clause declares one or more list items to be private
to a task, and initializes each of them with the value that the
corresponding original item has when the construct is encountered.
So it creates a private copy of the variable for each thread, but the scope of this private copy is the structured block following the OpenMP construct. Outside this block you access the global variable:
#pragma omp ... firstprivate(vec)
{
vec.push_back(...); // private copy is changed here, which is threadsafe
}
void function()
{
vec.push_back(var); // the global variable is changed here, which is not threadsafe
return;
}
If you wish to use the private copy of your variable in a function, you have to pass it to the function by reference:
void function(std::vector<double>& x, double y)
{
x.push_back(y);
return;
}
...
#pragma omp for firstprivate(vec)
for (int i = 0; i < 4; i++)
{
function(vec, 1);
}
Note, however, that as pointed out and explained by @JeromeRichard, you should not use global variables in your code.
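A related sketch (my variation, not part of the answer above) sidesteps the global entirely: construct the vector inside the parallel region so each thread owns one, pass it to the function by reference, and merge the per-thread results at the end:

```cpp
#include <cassert>
#include <vector>

// The function works on whatever vector it is handed, never on a global.
void function(std::vector<double>& x, double y) {
    x.push_back(y);
}

// Sums i*i for i in [0, n) using one private vector per thread.
double sum_with_thread_local_vectors(int n) {
    double total = 0.0;
    #pragma omp parallel
    {
        std::vector<double> vec;           // constructed per thread: inherently private
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            function(vec, static_cast<double>(i) * i);  // fill the private copy
        double s = 0.0;
        for (double v : vec) s += v;       // reduce this thread's values
        #pragma omp atomic
        total += s;                        // merge per-thread partial sums
    }
    return total;
}
```

This gives the same result whether or not the code is compiled with OpenMP, since each thread only ever touches its own vector and the merge is atomic.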

Locking nested function with thread-unsafe static variable

I have a function memoize that reads and writes a static std::map, e.g.:
int memoize(int i)
{
static std::map<int,int> memoizer;
const auto iterator = memoizer.find(i);
if (iterator == memoizer.end())
return memoizer[i] = i + 1;
else
return iterator->second;
}
memoize is called by other functions, which are called by other functions, which are called... by functions in main. Now if in main I have something like
#pragma omp parallel for
for (int i = 0; i < 1000; i++)
f(i); // function calling memoize
for (int i = 0; i < 1000; i++)
g(i); // function calling memoize
then I have a problem in the first loop, since std::map is not thread-safe. I am trying to figure out a way to lock the static map only if OpenMP is used (thus only for the first loop). I would rather avoid having to rewrite all the functions to take an extra omp_lock_t argument.
What's the best way to achieve that? Hopefully with the least amount of macros possible.
A very simple solution to your problem would be to protect both the read and the update parts of your code with a critical OpenMP directive. Of course, to reduce the risk of unwanted interaction/collision with some other critical you might already have elsewhere in your code, you should give it a name to identify it clearly. And should your OpenMP implementation permit it (i.e., the version of the standard being high enough), and if you have the corresponding knowledge, you can add a hint about how much contention you expect on this update among the threads, for example.
Then, I would suggest you to check whether this simple implementation gives you sufficient performance, meaning whether the impact of the critical directive isn't too much performance-wise. If you're happy with the performance, then just leave it as is. If not, then come back with a more precise question and we'll see what can be done.
For the simple solution, the code could look like this (with a hint suggesting that high contention is expected, which is only an example for your sake, not a true expectation of mine):
int memoize( int i ) {
static std::map<int,int> memoizer;
bool found = false;
int value;
#pragma omp critical( memoize ) hint( omp_sync_hint_contended )
{
const auto it = memoizer.find( i );
if ( it != memoizer.end() ) {
value = it->second;
found = true;
}
}
if ( !found ) {
// here, we didn't find i in memoizer, so we compute it
value = compute_actual_value_for_memoizer( i );
#pragma omp critical( memoize ) hint( omp_sync_hint_contended )
memoizer[i] = value;
}
return value;
}
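Should the named critical ever become inconvenient (for instance if the function must also be safe outside OpenMP), an equivalent guard with a function-local std::mutex is a possible substitution; note this sketch holds the lock while computing, which is acceptable only because the computation here is trivial:

```cpp
#include <cassert>
#include <map>
#include <mutex>

int memoize(int i) {
    static std::map<int, int> memoizer;
    static std::mutex m;                  // guards every access to the map
    std::lock_guard<std::mutex> guard(m); // released automatically on return
    const auto it = memoizer.find(i);
    if (it != memoizer.end())
        return it->second;                // cache hit
    return memoizer[i] = i + 1;           // cache miss: compute and store
}
```

Repeated calls with the same argument return the cached value without recomputing it.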

Building a custom aggregator based on CPP iterator with OpenMP

Following scenario:
Simulation dealing with data on two different spatial resolutions
An existing lookup table mapping the finer resolution data to the coarser data
The recurring job of aggregating or distributing the data (different data to be exact) to the other scale
Instead of having to write the same two for loops etc. over and over again, I wanted to be able to write something like the following:
for (const auto& gridcell : iterator) {
container[gridcell.id] = gridcell.map([](CellRef cell) { return cell->getSpecificData(); });
}
or with openMP
#pragma omp parallel for
for (int j = 0; j < iterator.size(); ++j) { ... }
For example, for every 0.5° grid cell I want to apply a lambda function to all the 5' cells belonging to it, retrieving the needed data and summing/averaging it up afterwards.
So far so good. I've implemented it, and it is way slower than using two for loops (both implementations with #pragma omp parallel for).
All I've done so far is implement a custom iterator that stores the lookup-table information and returns a new GridItem on each iteration.
The implementation of the GridItem looks something like that:
class FiveDegreeToZeroFiveItem {
private:
AssociatedNodes nodes;
//Is called in public constructor not included here
static AssociatedNodes getData(){
AssociatedNodes out;
std::vector<int> tmp;
try { tmp = lu_table.at(pos); }
catch (const std::out_of_range& oor) { return out; } // No 5' nodes in this cell
int i = 0;
for (int id : tmp) {
i = id_table.at(id);
out.push_back(vector->at(i).get());
}
return out;
}
//Other implementations like averaging etc. left out
template <
class ElemT,
class FunctionT,
class T = decltype(std::declval<FunctionT>()(std::declval<ElemT>())),
typename = typename std::enable_if<!std::is_void<T>::value>::type
>
const T iterate(FunctionT fun) const {
T val{}; // value-initialize so += starts from a defined value
for (int j = 0; j < nodes.size(); ++j) {
val += fun(nodes.at(j));
}
return val;
}
public:
//Constructor with call to private getData() omitted for readability
template <
class ElemT,
class FunctionT
>
void map(FunctionT fun) const {
iterate<ElemT>(std::forward<FunctionT>(fun));
}
};
I left out some of the code, reducing it to the essentials (if I missed something important, I'd be happy to include it later on). It achieves what it was built for but, as mentioned, is way too slow.
Is there a general flaw in my approach? Or are there reasons why it is so much slower than two for loops? Is (wrong use of) OpenMP the reason?

C++ OpenMP parallel with a read-only reference variable

I'm trying to run code in parallel with GCC 4.4.7. I used the OpenMP library. I have a read-only variable (a pointer to a class) which is shared by all the threads.
The code compiles and executes without any errors, but I do not obtain the same results as when running it serially (the serial results are correct).
The code looks like:
#include <class1>
#include <class2>
int main(){
string a;
int b1,b2,d;
class1 c1(a,b1);
c1.compute(d);
int n_thread = 10;
int i,n=10;
vector<vector<int> > res(n);
omp_set_dynamic(0);
omp_set_num_threads(n_thread);
#pragma omp parallel for num_threads(n_thread) private(i) shared(res)
for(i=0;i<n;i++)
{
class1 c(c1.tab,b2);
c.compute(d);
class2 toto(b1,b2);
toto.getvect(c1.tab, c.tab); // Inside toto, c1.tab is read-only
#pragma omp critical
{
res[i] = vector<int> (toto.p);
}
}
//the rest of the program when I use the c1 var and the res matrix.
}
My first thought was that the problem comes from the two variables c and toto, but both are created inside the loop, so they are private to each thread.
I tried to declare c1 as threadprivate, but that gave a compilation error. Declaring c1 as shared does not change the output. Maybe the problem is multiple accesses to the same variable at the same time? How can I solve the problem?

std::vector push_back fails when used in a parallel for loop

I have code that looks like this (simplified):
for( int i = 0; i < input.rows; i++ )
{
if (IsGoodMatch(input[i]))
{
Newvalues newValues;
newValues.x1 = input[i].x1;
newValues.x2 = input[i].x1 * 2;
output.push_back( newValues);
}
}
This code works well, but if I make it parallel using omp parallel for, I get an error on output.push_back, and it seems the memory gets corrupted during a vector resize.
What is the problem and how can I fix it?
How can I make sure only one thread inserts a new item into the vector at any time?
The simple answer is that std::vector::push_back is not thread-safe.
In order to safely do this in parallel you need to synchronize in order to ensure that push_back isn't called from multiple threads at the same time.
Synchronization in C++11 can easily be achieved by using an std::mutex.
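A minimal sketch of that mutex approach, with a hypothetical is_good_match predicate standing in for the question's IsGoodMatch:

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the question's IsGoodMatch.
static bool is_good_match(int value) { return value % 2 == 0; }

std::vector<int> collect_matches(const std::vector<int>& input) {
    std::vector<int> output;
    std::mutex m;
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(input.size()); i++) {
        if (is_good_match(input[i])) {
            std::lock_guard<std::mutex> guard(m);  // one thread in push_back at a time
            output.push_back(input[i]);
        }
    }
    return output;
}
```

The order of the matches depends on thread scheduling, so sort the result afterwards if order matters.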
std::vector's push_back cannot guarantee correct behavior when called concurrently like you are doing now (there is no thread-safety).
However, since the elements don't depend on each other, it would be very reasonable to resize the vector first and modify the elements inside the loop separately:
output.resize(input.rows);
int k = 0;
#pragma omp parallel for shared(k, input)
for( int i = 0; i < input.rows; i++ )
{
if (IsGoodMatch(input[i]))
{
Newvalues newValues;
...
// ! prevent other threads from modifying k !
output[k] = newValues;
k++;
// ! allow other threads to modify k again !
}
}
output.resize(k);
since direct access using operator[] doesn't depend on other members of std::vector which might cause inconsistencies between the threads. However, this solution might still need explicit synchronization (i.e., a mechanism such as a mutex) to ensure that a correct value of k is used.
"How can I make sure only one thread inserting a new item into vector at any time?"
You don't need to. Threads will be modifying different elements (that reside in different parts of memory). You just need to make sure that the element each thread tries to modify is the correct one.
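That idea can be made concrete with a lock-free sketch (my illustration, reusing a hypothetical is_good_match predicate): each thread writes only its own slot i, and a serial pass compacts the matches afterwards, preserving input order:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the question's IsGoodMatch.
static bool is_good_match(int value) { return value % 2 == 0; }

std::vector<int> collect_matches_lockfree(const std::vector<int>& input) {
    const int n = static_cast<int>(input.size());
    std::vector<int> slots(n);
    std::vector<char> matched(n, 0);  // char, not vector<bool>: element writes are independent
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {     // each thread writes only slots[i] and matched[i]
        if (is_good_match(input[i])) {
            slots[i] = input[i];
            matched[i] = 1;
        }
    }
    std::vector<int> output;          // serial compaction: no synchronization needed
    for (int i = 0; i < n; i++)
        if (matched[i]) output.push_back(slots[i]);
    return output;
}
```

Unlike the shared-counter version, the result here is deterministic regardless of how the iterations are scheduled.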
Use a concurrent vector:
#include <concurrent_vector.h>
Concurrency::concurrent_vector<int> comes from Microsoft's PPL (Intel TBB offers the equivalent tbb::concurrent_vector); it is not part of standard C++11 itself.
It is a thread-safe version of vector.
Put a #pragma omp critical before the push_back.
I solved a similar problem by deriving from the standard std::vector class, just to implement an atomic_push_back method suitable for the OpenMP paradigm.
Here is my "OpenMP-safe" vector implementation:
template <typename T>
class omp_vector : public std::vector<T>
{
private:
omp_lock_t lock;
public:
omp_vector()
{
omp_init_lock(&lock);
}
~omp_vector()
{
omp_destroy_lock(&lock);
}
void atomic_push_back(T const &p)
{
omp_set_lock(&lock);
std::vector<T>::push_back(p);
omp_unset_lock(&lock);
}
};
Of course you have to include omp.h. Then your code could be just as follows:
omp_vector<...> output;
#pragma omp parallel for shared(input,output)
for( int i = 0; i < input.rows; i++ )
{
if (IsGoodMatch(input[i]))
{
Newvalues newValues;
newValues.x1 = input[i].x1;
newValues.x2 = input[i].x1 * 2;
output.atomic_push_back(newValues);
}
}
If you still need the output vector somewhere else in a non-parallel section of the code, you could just use the normal push_back method.
You can try to use a mutex to fix the problem.
Usually I prefer to implement such a thing myself:
#include <atomic>

static std::atomic<int> mutex(1); // plain int would be a data race here

int signal(std::atomic<int>& x)
{
x.fetch_add(1);
return 0;
}
int wait(std::atomic<int>& x)
{
int v = x.load();
for (;;)
{
// decrement only when the count is positive; otherwise keep spinning
if (v > 0 && x.compare_exchange_weak(v, v - 1))
return 0;
v = x.load();
}
}
for( int i = 0; i < input.rows; i++ )
{
if (IsGoodMatch(input[i]))
{
Newvalues newValues;
newValues.x1 = input[i].x1;
newValues.x2 = input[i].x1 * 2;
wait(mutex);
output.push_back( newValues);
signal(mutex);
}
}
Hope this helps.