Assignment of boost::ptr_vector in OpenMP loop

Is it possible to fill a boost::ptr_vector inside an OpenMP loop? The only way I can see to add a new entry to a ptr_vector is via push_back(), which I assume is not thread-safe.
See the example below (gcc compilation: g++ ptr_vector.cpp -fopenmp -DOPTION=1). Currently only g++ ptr_vector.cpp -DOPTION=2 works.
#include <boost/ptr_container/ptr_vector.hpp>
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

int main() {
    boost::ptr_vector<double> v;
    int n = 10;
#if OPTION==1
    v.resize(n);
#endif
    int i;
#ifdef _OPENMP
    // note: a barrier outside a parallel region has no effect
#pragma omp barrier
#pragma omp parallel for private(i) schedule(runtime)
#endif
#if OPTION==1
    for ( i=0; i<n; ++i ) {
        // Broken: this only reassigns the local pointer vi (and leaks the
        // allocation); the element stored in the vector never changes.
        double * vi = &v[i];
        vi = new double(i);
    }
#elif OPTION==2
    for ( i=0; i<n; ++i )
        v.push_back(new double(i));  // not thread-safe when run in parallel
#endif
    for ( int i=0; i<n; ++i )
        std::cout << "v[" << i << "] = " << v[i] << std::endl;
}
Thanks for any help!

To answer my own question, the solution is to use the replace() function:
int i;
#ifdef _OPENMP
#pragma omp barrier
#pragma omp parallel for private(i) schedule(runtime)
#endif
for ( i=0; i<n; ++i ) {
    v.replace(i, new double(i));
}
This seems to work for this example, but is this method thread-safe in general?
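For reference, the pieces combined into one complete program (a sketch: the barrier and the schedule(runtime) clause from the original are dropped for simplicity, and whether Boost guarantees that concurrent replace() calls on distinct indices are race-free is an assumption, as the question itself notes):

#include <boost/ptr_container/ptr_vector.hpp>
#include <iostream>

int main() {
    boost::ptr_vector<double> v;
    int n = 10;
    v.resize(n);  // preallocate so every iteration touches a distinct slot
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // replace() swaps the pointer stored at index i; the old element is
        // returned as an owning handle and destroyed at the end of the statement.
        v.replace(i, new double(i));
    }
    for (int i = 0; i < n; ++i)
        std::cout << "v[" << i << "] = " << v[i] << std::endl;
}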

Related

Avoid calling omp_get_thread_num() in parallel for loop with simd

What is the performance cost of calling omp_get_thread_num(), compared to looking up the value of a variable?
How can I avoid calling omp_get_thread_num() many times in a simd OpenMP loop?
I can use #pragma omp parallel, but will that make a simd loop?
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    auto a_size = a.size();
    #pragma omp for simd
    for (int i = 0; i < a_size; ++i) {
        a[i] = omp_get_thread_num();
    }
}
I wouldn't be too worried about the cost of the call, but for code clarity you can do:
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    auto a_size = a.size();
    #pragma omp parallel
    {
        const auto threadId = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < a_size; ++i) {
            a[i] = threadId;
        }
    }
}
As long as you use #pragma omp for (and don't put an extra `parallel` in there! Otherwise each of your n threads will spawn n more threads, which is bad), the for loop inside your parallel region will be split up amongst the n threads. Make sure the OpenMP compiler flag (e.g. -fopenmp) is turned on.
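For illustration, here is a sketch (not from the original answer) of the nested-parallel mistake described above:

#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    int a_size = static_cast<int>(a.size());
    #pragma omp parallel
    {
        // WRONG: a second "parallel" inside an existing parallel region gives
        // every outer thread its own inner team (or, with nesting disabled, a
        // team of one), so each outer thread runs the full loop redundantly.
        #pragma omp parallel for
        for (int i = 0; i < a_size; ++i) {
            a[i] = omp_get_thread_num();
        }
    }
}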

How does pragma and omp make a difference in these two codes producing same output?

Initially the value of ab is 10; then, after a delay created by the for loop, ab is set to 55 and printed:
#include <iostream>
using namespace std;

int main()
{
    long j, i;
    int ab = 10;
    for (i = 0; i < 1000000000; i++) ;
    ab = 55;
    cout << "\n----------------\n";
    for (j = 0; j < 100; j++)
        cout << endl << ab;
    return 0;
}
The purpose of this code is the same, but the expectation was that, with two threads, the second pragma block would print 10 before the delay finishes and 55 afterwards. Instead, the second pragma block prints only after the delay created by the first for loop, and it prints only 55.
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    long j, i;
    int ab = 10;
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (i = 0; i < 1000000000; i++) ;
            ab = 55;
        }
        #pragma omp barrier
        cout << "\n----------------\n";
        #pragma omp single
        {
            for (j = 0; j < 100; j++)
                cout << endl << ab;
        }
    }
    return 0;
}
So you want to "observe race conditions" by changing the value of a variable in a first region and printing the value from the second region.
There are a couple of things that prevent you achieving this.
The first (and explicitly stated) one is the #pragma omp barrier. This directive makes every thread executing the #pragma omp parallel region wait until all threads in the team have arrived. The barrier forces the two threads to meet at that point, so by then ab already has the value 55.
The second (implicitly stated) one is the #pragma omp single construct, which carries an implicit barrier at its end unless a nowait clause is given, so the team of threads running the parallel region waits until the single region has finished. Again, this means that ab has the value 55 after the first region has finished.
To try to observe the race (note the "try": the outcome will vary from run to run, depending on several factors such as OS thread scheduling, OpenMP thread scheduling, and the hardware resources available), you can give this alternative version a try:
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
    long j, i;
    int ab = 10;
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            for (i = 0; i < 1000000000; i++) ;
            ab = 55;
        }
        cout << "\n----------------\n";
        #pragma omp single
        {
            for (j = 0; j < 100; j++)
                cout << endl << ab;
        }
    }
    return 0;
}
BTW, rather than spinning through a long trip-count loop for the delay, you could use calls such as sleep/usleep.
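For example, the busy-wait could be replaced with a standard C++ sleep (a sketch of the alternative version above; the 2-second delay is arbitrary):

#include <chrono>
#include <iostream>
#include <thread>
#include <omp.h>
using namespace std;

int main()
{
    int ab = 10;
    omp_set_num_threads(2);
    #pragma omp parallel
    {
        #pragma omp single nowait
        {
            // A real sleep instead of an empty spin loop.
            this_thread::sleep_for(chrono::seconds(2));
            ab = 55;
        }
        cout << "\n----------------\n";
        #pragma omp single
        {
            for (int j = 0; j < 100; j++)
                cout << endl << ab;
        }
    }
    return 0;
}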

OpenMP Parallelizing for loop with map

I am trying to parallelize a for-loop which scans a std::map. Below is my toy program:
#include <iostream>
#include <cstdio>
#include <map>
#include <string>
#include <cassert>
#include <omp.h>

#define NUM 100000
using namespace std;

int main()
{
    omp_set_num_threads(16);
    int realThreads = 0;
    string arr[] = {"0", "1", "2"};
    std::map<int, string> myMap;
    for (int i = 0; i < NUM; ++i)
        myMap[i] = arr[i % 3];
    string is[NUM];
    #pragma omp parallel for
    for (map<int, string>::iterator it = myMap.begin(); it != myMap.end(); it++)
    {
        is[it->first] = it->second;
        if (omp_get_thread_num() == 0)
            realThreads = omp_get_num_threads();
    }
    printf("First for-loop with %d threads\n", realThreads);
    realThreads = 0;
    #pragma omp parallel for
    for (int i = 0; i < NUM; ++i)
    {
        assert(is[i] == arr[i % 3]);
        if (omp_get_thread_num() == 0)
            realThreads = omp_get_num_threads();
    }
    printf("Second for-loop with %d threads\n", realThreads);
    return 0;
}
Compilation command:
icc -fopenmp foo.cpp
The output of the above code block is:
First for-loop with 1 threads
Second for-loop with 16 threads
Why am I not able to parallelize the first for-loop?
std::map does not provide random-access iterators, only the usual bidirectional ones. OpenMP requires that the iterators in parallel loops be of random-access type. With other kinds of iterators, explicit tasks should be used instead:
#pragma omp parallel
{
    #pragma omp master
    realThreads = omp_get_num_threads();
    #pragma omp single
    for (map<int, string>::iterator it = myMap.begin(); it != myMap.end(); it++)
    {
        #pragma omp task
        is[it->first] = it->second;
    }
}
Note that in this case a separate task is created for each member of the map. Since the task body is computationally very simple, the OpenMP tasking overhead will be relatively high in this particular case.
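One common way to amortize that overhead (a sketch, not part of the original answer; the chunk size of 1024 is an arbitrary tuning parameter, and the realThreads bookkeeping is omitted) is to create one task per chunk of map entries instead of one per entry. This drops in for the first loop of the question's main():

#pragma omp parallel
{
    #pragma omp single
    {
        map<int, string>::iterator it = myMap.begin();
        while (it != myMap.end()) {
            map<int, string>::iterator first = it;
            int count = 0;
            while (it != myMap.end() && count < 1024) { ++it; ++count; }
            // One task per chunk; 'first' and 'it' are captured by value
            // (firstprivate) at task creation time.
            #pragma omp task firstprivate(first, it)
            for (map<int, string>::iterator p = first; p != it; ++p)
                is[p->first] = p->second;
        }
    }
}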

OpenMP - using functions

When I use OpenMP without function calls, with the reduction(+ : sum) clause, the OpenMP version works fine.
#include <iostream>
#include <omp.h>
using namespace std;

int sum = 0;

void summation()
{
    sum = sum + 1;  // updates the global sum
}

int main()
{
    int i, sum;  // note: this local sum shadows the global one that summation() updates
    #pragma omp parallel for reduction (+ : sum)
    for (i = 0; i < 1000000000; i++)
        summation();
    #pragma omp parallel for reduction (+ : sum)
    for (i = 0; i < 1000000000; i++)
        summation();
    #pragma omp parallel for reduction (+ : sum)
    for (i = 0; i < 1000000000; i++)
        summation();
    std::cerr << "Sum is=" << sum << std::endl;
}
But when I call a function that does the summation on a global variable, the OpenMP version takes even more time than the sequential version.
I would like to know the reason for this and the changes that should be made.
The summation function doesn't use the OpenMP variable that you are reducing into. Fix it:
#include <iostream>
#include <omp.h>

void summation(int& sum) { sum++; }

int main()
{
    int sum = 0;  // must be initialized: the reduction adds the partial sums to the original value
    #pragma omp parallel for reduction (+ : sum)
    for (int i = 0; i < 1000000000; ++i)
        summation(sum);  // inside the region, 'sum' refers to the thread-private copy
    std::cerr << "Sum is=" << sum << '\n';
}
In the original design, the time taken to synchronize access to this one variable is way in excess of what you gain by using multiple cores: there is only one variable, only one core can access it at a time, and the threads end up endlessly waiting on each other. That design is not capable of concurrency, and all the synchronization you pay for just increases the run-time.
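To see the effect, compare the reduction version with a sketch (not from the original answer) in which every increment synchronizes on the same shared variable; it is correct but typically slower than the sequential loop:

#include <iostream>
#include <omp.h>

int main()
{
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < 1000000000; ++i) {
        // Every iteration contends for the same variable: the atomic update
        // serializes the threads and shuttles the cache line between cores.
        #pragma omp atomic
        sum++;
    }
    std::cerr << "Sum is=" << sum << '\n';
}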

pointers with OpenMP

I am trying to use OpenMP in my program (I am a newbie with OpenMP), and the program produces errors in two places. Here is an example:
#include <iostream>
#include <cstdint>
#include <vector>
#include <boost/multi_array.hpp>
#include <omp.h>

class CNachbarn {
public:
    CNachbarn () { a = 0; }
    uint32_t Get_Next_Neighbor() { return a++; }
private:
    uint32_t a;
};

class CNetwork {
public:
    CNetwork ( uint32_t num_elements_ );
    ~CNetwork();
    void Validity();
    void Clean();
private:
    uint32_t num_elements;
    uint32_t nachbar;
    std::vector<uint32_t> remove_node_v;
    CNachbarn *Nachbar;
};

CNetwork::CNetwork( uint32_t num_elements_ ) {
    num_elements = num_elements_;
    Nachbar = new CNachbarn();
    remove_node_v.reserve( num_elements );
}

CNetwork::~CNetwork() {
    delete Nachbar;
}

inline void CNetwork::Validity() {
    #pragma omp parallel for
    for ( uint32_t i = 0 ; i < num_elements ; i++ ) {
        #pragma omp critical
        remove_node_v.push_back(i);
    }
}

void CNetwork::Clean () {
    #pragma omp parallel for
    for ( uint8_t j = 0 ; j < 2 ; j++ ) {
        nachbar = Nachbar->Get_Next_Neighbor();
        std::cout << "j: " << j << ", neighbor: " << nachbar << std::endl;  // was "i", which is not in scope here
    }
    remove_node_v.clear();
}

int main() {
    uint32_t num_elements = 1u << 3;
    uint32_t i = 0;
    CNetwork Network( num_elements );
    do {
        Network.Validity();
        Network.Clean();
    } while (++i < 2);
    return 0;
}
I would like to know:
Is #pragma omp critical a good solution for push_back() (does it solve the problem)? Would it be better to give each thread its own vector and then combine them (using insert())? Or some kind of lock?
In my original code I get a runtime error at nachbar = Nachbar->Get_Next_Neighbor( &remove_node_v[i] );, but not in this example. Nevertheless, I would like OpenMP to use as many CNachbarn objects as there are cores, since CNachbarn performs a recursive computation and should not be influenced by the other threads. The question is how to do this smartly. (I don't think it is smart to construct a CNachbarn every time I enter the for-loop, since I call this function more than a million times in my simulation and time is important.)
Concerning your first problem:
Your function Validity is a perfect way to achieve below-serial performance in a parallel loop. However, you already gave the correct answer: you should fill an independent vector per thread and merge them afterwards.
inline void CNetwork::Validity() {
    #pragma omp parallel for
    for ( uint32_t i = 0 ; i < num_elements ; i++ ) {
        #pragma omp critical
        remove_node_v.push_back(i);
    }
}
EDIT: A possible remedy could look like this (if you require the elements in serial order, you need to change the loop a bit):
inline void CNetwork::Validity() {
    remove_node_v.reserve(num_elements);
    #pragma omp parallel
    {
        std::vector<uint32_t> remove_node_v_thread_local;
        uint32_t thread_id = omp_get_thread_num();
        uint32_t n_threads = omp_get_num_threads();
        for ( uint32_t i = thread_id ; i < num_elements ; i += n_threads )
            remove_node_v_thread_local.push_back(i);
        #pragma omp critical
        remove_node_v.insert(remove_node_v.end(),
                             remove_node_v_thread_local.begin(),
                             remove_node_v_thread_local.end());
    }
}
Your second problem can be solved by defining an array of CNachbarn whose size is the maximum possible number of OpenMP threads, and having each thread access only its own element:
CNachbarn* meine_nachbarn = alle_meine_nachbarn[omp_get_thread_num()];
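A minimal sketch of that pattern, assuming the CNachbarn class from the question (the vector-of-objects variant and the loop bounds here are illustrative):

#include <cstdint>
#include <iostream>
#include <vector>
#include <omp.h>

// CNachbarn as in the question.
class CNachbarn {
public:
    CNachbarn() { a = 0; }
    uint32_t Get_Next_Neighbor() { return a++; }
private:
    uint32_t a;
};

int main()
{
    // One CNachbarn per possible thread, constructed once up front and
    // reused across the (potentially millions of) parallel loops.
    std::vector<CNachbarn> alle_meine_nachbarn(omp_get_max_threads());
    #pragma omp parallel for
    for (int i = 0; i < 8; ++i) {
        // Each thread only ever touches its own element, so nothing is shared.
        CNachbarn& mein_nachbar = alle_meine_nachbarn[omp_get_thread_num()];
        uint32_t nachbar = mein_nachbar.Get_Next_Neighbor();
        #pragma omp critical
        std::cout << "i: " << i << ", neighbor: " << nachbar << std::endl;
    }
    return 0;
}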