Is this for-loop valid using OpenMP - c++

I am in the process of learning OpenMP. This is the for loop I am using:
std::string result;
#pragma omp parallel
{
    #pragma omp parallel for public(local_arg) reduction(+:result)
    for(int i = 0; i < Myvector.size(); i++)
    {
        result = result + someMethod(urn, Myvector[i]);
    }
}
Now someMethod(urn, Myvector[i]), which will be called by multiple threads in the above code, returns a string that needs to be appended to the result string. My question is: do I need to put a lock on the statement in the for loop? Is there a better approach? Any suggestions?

This isn't perfect (and it's been a while since I've used OpenMP), but the idea is basic divide-and-conquer.
std::vector<std::string> results;
// omp_get_num_threads() returns 1 outside a parallel region, so ask for the
// maximum number of threads instead, and double it to be safe.
int n = 2 * omp_get_max_threads();
results.reserve(n);
// Reserve a small string for each prospective worker
const std::size_t worker_reserve = 4096; // per-worker guess; tune as needed
for(int i = 0; i < n; ++i){
    std::string str{};
    str.reserve(worker_reserve);
    results.push_back(std::move(str));
}
// Let each worker grab and mutate the string
// corresponding to its worker ID
#pragma omp parallel for
for(int i = 0; i < Myvector.size(); ++i)
{
    auto &str = results[omp_get_thread_num()];
    str.append(someMethod(urn, Myvector[i]));
}
// Measure the total size of the result
std::string end_result;
size_t total_len = 0;
for(auto &res : results){
    total_len += res.length();
}
// Reserve and combine
end_result.reserve(total_len + 1);
for(auto &res : results){
    end_result.append(res);
}
However, there is still the issue of heap contention.
Also, omp_get_max_threads() only gives an upper bound on the number of threads the parallel region will actually use (for example when dynamic thread adjustment is enabled), but a few extra empty strings are harmless here.
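If the compiler supports OpenMP 4.0 or newer, another option is to declare a user-defined reduction for std::string and let the runtime keep one private string per thread. A minimal sketch, assuming the order in which the per-thread partial strings get concatenated does not matter (OpenMP leaves the combination order unspecified):
#pragma omp declare reduction(concat : std::string : omp_out += omp_in)

std::string result;
#pragma omp parallel for reduction(concat : result)
for(int i = 0; i < (int)Myvector.size(); ++i)
{
    result += someMethod(urn, Myvector[i]);
}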

Related

How can I parallelize DFS using OpenMP?

I am trying to figure this out with OpenMP: I need to parallelize a depth-first traversal.
This is the algorithm:
void dfs(int v){
    visited[v] = true;
    for (int i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            dfs(g[v][i]);
        }
    }
}
I tried:
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <queue>
#include <sstream>
#include <omp.h>
#include <fstream>
#include <vector>
using namespace std;
vector<int> output;
vector<bool> visited;
vector < vector <int> >g;
int global = 0;
void dfs(int v)
{
    printf(" potoki %i", omp_get_thread_num());
    //cout<<endl;
    visited[v] = true;
    /*for(int i =0;i<visited.size();i++){
        cout <<visited[i]<< " ";
    }*/
    //cout<<endl;
    //global++;
    output.push_back(v);
    int i;
    //printf(" potoki %i",omp_get_num_threads());
    //cout<<endl;
    for (i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            #pragma omp task shared(visited)
            {
                #pragma omp critical
                {
                    dfs(g[v][i]);
                }
            }
        }
    }
}
int main(){
    omp_set_num_threads(5);
    int length = 1000;
    int e = 4;
    for (int i = 0; i < length; i++) {
        visited.push_back(false);
    }
    int limit = (length / 2) - 1;
    g.resize(length);
    for (int x = 0; x < g.size(); x++) {
        int p = 0;
        while (p < e) {
            int new_e = rand() % length;
            if (new_e != x) {
                bool check = false;
                for (int c = 0; c < g[x].size(); c++) {
                    if (g[x][c] == new_e) {
                        check = true;
                    }
                }
                if (check == false) {
                    g[x].push_back(new_e);
                    p++;
                }
            }
        }
    }
    ofstream fin("input.txt");
    for (int i = 0; i < g.size(); i++)
    {
        for (int j = 0; j < g[i].size(); j++)
        {
            fin << g[i][j] << " ";
        }
        fin << endl;
    }
    fin.close();
    /*for (int x = 0; x < g.size(); x++) {
        for(int j=0;j<g[x].size();j++){
            printf(" %i ", g[x][j]);
        }
        printf(" \n ");
    }*/
    double start;
    double end;
    start = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            dfs(0);
        }
    }
    end = omp_get_wtime();
    cout << endl;
    printf("Work took %f seconds\n", end - start);
    cout << global;
    ofstream fout("output.txt");
    for (int i = 0; i < output.size(); i++) {
        fout << output[i] << " ";
    }
    fout.close();
}
The graph g is generated and written to the file input.txt; the result of the traversal is written to the file output.txt.
But this does not work well for any number of threads, and it is much slower than the sequential version.
I tried to use taskwait, but in that case only one thread does the work.
A critical section protects a block of code so that no more than one thread can execute it at any given time. Having the recursive call to dfs() inside a critical section means that no two tasks could make that call simultaneously. Moreover, since dfs() is recursive, any top-level task will have to wait for the entire recursion to finish before it could exit the critical section and allow a task in another thread to execute.
You need to synchronise where it will not interfere with the recursive call, and only protect updates to shared data that does not provide its own internal synchronisation. This is the original code:
void dfs(int v){
    visited[v] = true;
    for (int i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            dfs(g[v][i]);
        }
    }
}
A naive but still parallel version would be:
void dfs(int v){
    #pragma omp critical
    {
        visited[v] = true;
        for (int i = 0; i < g[v].size(); ++i) {
            if (!visited[g[v][i]]) {
                #pragma omp task
                dfs(g[v][i]);
            }
        }
    }
}
Here, the code leaves the critical section as soon as the tasks are created. The problem here is that the entire body of dfs() is one critical section, which means that even if there are 1000 recursive calls in parallel, they will execute one after another sequentially and not in parallel. It will even be slower than the sequential version because of the constant cache invalidation and the added OpenMP overhead.
One important note is that OpenMP critical sections, just as regular OpenMP locks, are not re-entrant, so a thread could easily deadlock itself due to encountering the same critical section in a recursive call from inside that same critical section, e.g., if a task gets executed immediately instead of being postponed. It is therefore better to implement a re-entrant critical section using OpenMP nested locks.
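For illustration, a re-entrant guard built from an OpenMP nested lock could look like the following sketch (the lock name and its placement are illustrative; this only addresses the self-deadlock issue, the serialisation problem discussed above remains):
omp_nest_lock_t dfs_lock; // initialise once with omp_init_nest_lock(&dfs_lock)

void dfs(int v){
    omp_set_nest_lock(&dfs_lock); // the same thread may re-acquire it safely
    visited[v] = true;
    for (int i = 0; i < g[v].size(); ++i) {
        if (!visited[g[v][i]]) {
            #pragma omp task
            dfs(g[v][i]);
        }
    }
    omp_unset_nest_lock(&dfs_lock);
}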
The reason for that code being slower than sequential is that it does nothing else except traversing the graph. If it was doing some additional work at each node, e.g., accessing data or computing node-local properties, then this work could be inserted between updating visited and the loop over the unvisited neighbours:
void dfs(int v){
    #pragma omp critical
    visited[v] = true;

    // DO SOME WORK

    #pragma omp critical
    {
        for (int i = 0; i < g[v].size(); ++i) {
            if (!visited[g[v][i]]) {
                #pragma omp task
                dfs(g[v][i]);
            }
        }
    }
}
The parts in the critical sections will still execute sequentially, but the processing represented by // DO SOME WORK will overlap in parallel.
There are tricks to speed things up by reducing the lock contention introduced by having one big lock / critical section. One could, e.g., use a set of OpenMP locks and map the index of visited onto those locks using simple modulo arithmetic, as described here and sketched below. It is also possible to stop creating tasks at a certain level of recursion and call a sequential version of dfs() instead.
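As an illustration of the lock-striping idea, a sketch along these lines (NUM_LOCKS and mark_visited() are made-up names; it also assumes visited is a std::vector<char> or similar, since concurrent writes to different elements of std::vector<bool> are not safe because of its bit packing):
constexpr int NUM_LOCKS = 64; // illustrative choice
omp_lock_t locks[NUM_LOCKS];  // call omp_init_lock() on each lock before use

// Test-and-set visited[v] under the lock that v maps onto.
// Returns true if this call was the one that marked v.
bool mark_visited(int v){
    omp_lock_t &l = locks[v % NUM_LOCKS];
    omp_set_lock(&l);
    bool already = visited[v];
    if (!already) visited[v] = true;
    omp_unset_lock(&l);
    return !already;
}

void dfs(int v){
    if (!mark_visited(v)) return; // another task got here first
    // DO SOME WORK
    for (int i = 0; i < g[v].size(); ++i) {
        #pragma omp task
        dfs(g[v][i]);
    }
}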
void p_dfs(int v)
{
    #pragma omp critical
    visited[v] = true;

    #pragma omp parallel for
    for (int i = 0; i < graph[v].size(); ++i)
    {
        #pragma omp critical
        if (!visited[graph[v][i]])
        {
            #pragma omp task
            p_dfs(graph[v][i]);
        }
    }
}
OpenMP is good for data-parallel code, where the amount of work is known in advance. It doesn't work well for graph algorithms like this one.
If the only thing you do is what's in your code (push elements into a vector), parallelism will make it slower. Even if you have many gigabytes of graph data, the bottleneck is memory, not compute, so multiple CPU cores won't help. Also, if all threads push results into the same vector, you'll need synchronization, and reading memory recently written by another CPU core is expensive on modern processors, even more so than a cache miss.
If you have some substantial CPU-bound work besides just copying integers, look for alternatives to OpenMP. On Windows, I usually use the CreateThreadpoolWork and SubmitThreadpoolWork APIs. On iOS and OS X, see Grand Central Dispatch. On Linux, see cp_thread_pool_create(3), but unlike the other two I don't have any hands-on experience with it, I've just found the docs.
Regardless of the thread pool implementation you use, you'll then be able to post work to the thread pool dynamically as you traverse the graph. OpenMP also has a thread pool under the hood, but its API is not flexible enough for this kind of dynamic parallelism.

OpenMP: increment an int counter among threads?

Hey,
I'm trying to multithread the function below, but I fail to get the counter properly shared among the OpenMP threads. I tried making it atomic and a plain int, but neither seems to work. I'm not sure what I'm missing; how can I solve this?
std::vector<myStruct> _myData(100);
int counter = 0;
int index;
#pragma omp parallel for private(index)
for (index = 0; index < 500; ++index) {
    if (data[index].type == "xx") {
        myStruct s;
        s.innerData = data[index].rawData;
        processDataA(s); // processDataA(myStruct &data)
        processDataB(s);
        _myData[counter++] = s; // each thread should get a unique index, not going past the 100 items initially allocated in _myData
    }
}
Edit: updated bad syntax / missing parts.
If you cannot use OpenMP atomic capture, I would try:
std::vector<myStruct> _myData(100);
int counter = 0;
#pragma omp parallel for schedule(dynamic)
for (int index = 0; index < 500; ++index) {
    if (data[index].type == "xx") {
        myStruct s;
        s.innerData = data[index].rawData;
        processDataA(s);
        processDataB(s);
        int temp;
        #pragma omp critical
        temp = counter++;
        assert(temp < _myData.size());
        _myData[temp] = s;
    }
}
Or:
#pragma omp parallel for schedule(dynamic,c)
and experiment with chunk size c.
However, atomics would likely be more efficient than critical sections. There should be some form of atomics supported by your compiler.
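For instance, with OpenMP 3.1 atomic capture available, the critical section can be swapped for an atomic capture; this is a sketch, the rest of the loop is unchanged from the version above:
std::vector<myStruct> _myData(100);
int counter = 0;
#pragma omp parallel for schedule(dynamic)
for (int index = 0; index < 500; ++index) {
    if (data[index].type == "xx") {
        myStruct s;
        s.innerData = data[index].rawData;
        processDataA(s);
        processDataB(s);
        int temp;
        #pragma omp atomic capture
        temp = counter++; // atomically fetch the old value and increment
        assert(temp < _myData.size());
        _myData[temp] = s;
    }
}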
Note that your solution is somewhat fragile, since it works only if the condition inside the loop evaluates to true at most 100 times. That's why I added the assertion to the code. Maybe a better solution:
std::vector<myStruct> _myData;
size_t size = 0;
#pragma omp parallel for reduction(+:size)
for (int index = 0; index < data.size(); ++index)
    if (data[index].type == "xx") size++;
_myData.resize(size);
...
Then, you don't need to care about the vector size and also don't waste memory space.

Simple task-based OpenMP application hangs

The following small program (online version) attempts to calculate the area of a 64 by 64 square by recursively dividing it into four squares until the smallest square has unit length (hardly optimal). But for some reason the program hangs. What am I doing wrong?
#include <iostream>

unsigned compute( unsigned length )
{
    if( length == 1 ) return length * length;
    unsigned a[4] , area = 0 , len = length/2;
    for( unsigned i = 0; i < 4; ++i )
    {
        #pragma omp task
        {
            a[i] = compute( len );
        }
        #pragma omp single
        {
            area += a[i];
        }
    }
    return area;
}

int main()
{
    unsigned area , length = 64;
    #pragma omp parallel
    {
        area = compute( length );
    }
    std::cout << area << std::endl;
}
The single construct acts as an implicit barrier for all threads in the team. However, not all threads in the team do encounter this single block, because different threads are working at different recursion depths. This is why your application hangs.
In any case your code is not correct. After your task block, a[i] is not yet assigned, so you cannot immediately use it! You must wait for the task to be completed. Of course you shouldn't do that inside the loop, otherwise the tasking wouldn't exploit any parallelism. The solution is to do this at the end of the loop. Also you must specify a as shared for the output to become visible:
for( unsigned i = 0; i < 4; ++i )
{
    #pragma omp task shared(a)
    {
        a[i] = compute( len );
    }
}
#pragma omp taskwait
for( unsigned i = 0; i < 4; ++i )
{
    area += a[i];
}
Note that the reduction is not wrapped in a single construct! compute() is executed inside a task, so only one thread ever works on its local area and no extra protection is needed there. However, you need one single construct before you first spawn any tasks:
#pragma omp parallel
#pragma omp single
{
    area = compute( length );
}
Simply speaking this opens a parallel region with a team of threads, and only one thread begins the initial computation. The other threads will pick up the tasks that are later spawned by this initial thread with the task construct. This is what tasking is all about.
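Putting the pieces together, a complete corrected version might look like the following (a sketch assembled from the fragments above, not verbatim from the answer; the expected output is 4096):
#include <iostream>

unsigned compute( unsigned length )
{
    if( length == 1 ) return length * length;
    unsigned a[4] , area = 0 , len = length/2;
    for( unsigned i = 0; i < 4; ++i )
    {
        #pragma omp task shared(a)
        {
            a[i] = compute( len );
        }
    }
    #pragma omp taskwait
    for( unsigned i = 0; i < 4; ++i )
    {
        area += a[i];
    }
    return area;
}

int main()
{
    unsigned area = 0, length = 64;
    #pragma omp parallel
    #pragma omp single
    {
        area = compute( length );
    }
    std::cout << area << std::endl;
}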
Motivated by the discussion about taskwait and how it can be avoided, I show below a slightly modified version of the original code. Please note that the implied barrier at the end of the single construct is really necessary in this case.
unsigned tp_area = 0;
#pragma omp threadprivate(tp_area)

void compute (unsigned length)
{
    if (length == 1)
    {
        tp_area += 1;
        return;
    }
    unsigned len = length / 2;
    for (unsigned i = 0; i < 4; ++i)
    {
        #pragma omp task
        {
            compute (len);
        }
    }
}

int main ()
{
    unsigned area = 0, length = 64;
    #pragma omp parallel
    {
        #pragma omp single
        {
            compute (length);
        }
        #pragma omp atomic
        area += tp_area;
    }
    std::cout << area << std::endl;
}

#pragma omp parallel for schedule crashes my program

I am building a plugin for Autodesk Maya 2013 in C++. I have to solve a set of optimization problems as fast as I can, and I am using OpenMP for this task. The problem is that I don't have much experience with parallel computing. I tried to use:
#pragma omp parallel for schedule (static)
on my for loops (without enough understanding of how it's supposed to work), and it worked very well for some of my code but crashed another portion of it.
Here is an example of a function that crashes because of the omp directive:
void PlanarizationConstraint::fillSparseMatrix(const Optimizer& opt, vector<T>& elements, double mu)
{
    int size = 3;
    #pragma omp parallel for schedule (static)
    for(int i = 0; i < opt.FVIc.outerSize(); i++)
    {
        int index = 3*i;
        Eigen::Matrix<double,3,3> Qxyz = Eigen::Matrix<double,3,3>::Zero();
        for(SpMat::InnerIterator it(opt.FVIc,i); it; ++it)
        {
            int face = it.row();
            for(int n = 0; n < size; n++)
            {
                Qxyz.row(n) += N(face,n)*N.row(face);
                elements.push_back(T(index+n,offset+face,(1 - mu)*N(face,n)));
            }
        }
        for(int n = 0; n < size; n++)
        {
            for(int k = 0; k < size; k++)
            {
                elements.push_back(T(index+n,index+k,(1-mu)*Qxyz(n,k)));
            }
        }
    }
    #pragma omp parallel for schedule (static)
    for(int j = 0; j < opt.VFIc.outerSize(); j++)
    {
        elements.push_back(T(offset+j,offset+j,opt.fvi[j]));
        for(SpMat::InnerIterator it(opt.VFIc,j); it; ++it)
        {
            int index = 3*it.row();
            for(int n = 0; n < size; n++)
            {
                elements.push_back(T(offset+j,index+n,N(j,n)));
            }
        }
    }
}
And here is an example of code that works very well with those directives (and is faster because of it)
Eigen::MatrixXd Optimizer::OptimizeLLGeneral()
{
    ConstraintsManager manager;
    SurfaceConstraint surface(1,true);
    PlanarizationConstraint planarization(1,true,3^Nv,Nf);
    manager.addConstraint(&surface);
    manager.addConstraint(&planarization);
    double mu = mu0;
    for(int k = 0; k < iterations; k++)
    {
        #pragma omp parallel for schedule (static)
        for(int j = 0; j < VFIc.outerSize(); j++)
        {
            manager.calcVariableMatrix(*this,j);
        }
        #pragma omp parallel for schedule (static)
        for(int i = 0; i < FVIc.outerSize(); i++)
        {
            Eigen::MatrixXd A = Eigen::Matrix<double, 3, 3>::Zero();
            Eigen::MatrixXd b = Eigen::Matrix<double, 1, 3>::Zero();
            manager.addLocalMatrixComponent(*this,i,A,b,mu);
            Eigen::VectorXd temp = b.transpose();
            Q.row(i) = A.colPivHouseholderQr().solve(temp);
        }
        mu = r*mu;
    }
    return Q;
}
My question is: what makes one function work so well with the OpenMP directive, and what makes the other one crash? What is the difference that makes the directive behave differently?
Before using OpenMP, you pushed data into the vector elements one element at a time. With OpenMP, however, several threads run the code in the loop body in parallel. When more than one thread pushes data into elements at the same time, and nothing ensures that one thread finishes pushing before another starts, problems will happen. That's why your code crashes.
To solve this problem, you could use local buffer vectors: each thread first pushes its data into its own private buffer, and afterwards you concatenate these buffers into a single vector (see the sketch after the update below).
Note that this method does not maintain the original order of the data elements in elements. If you need that, you can calculate the expected index of each data element and assign it to the right position directly.
Update:
OpenMP provides APIs to let you know how many threads you use and which thread you are using. See omp_get_max_threads() and omp_get_thread_num() for more info.
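For illustration, a minimal sketch of that per-thread buffer pattern applied to the first loop of fillSparseMatrix() (the loop body is abbreviated; the names follow the question, and schedule (static) is kept from the original code):
std::vector<std::vector<T>> buffers(omp_get_max_threads());
#pragma omp parallel for schedule (static)
for(int i = 0; i < opt.FVIc.outerSize(); i++)
{
    std::vector<T> &local = buffers[omp_get_thread_num()];
    // ... same body as before, but push_back into `local` instead of `elements`
}
// Merge the per-thread buffers into `elements` on a single thread
for(const auto &buf : buffers)
    elements.insert(elements.end(), buf.begin(), buf.end());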

OpenMP Ordered Parallelization

I'm trying to parallelize the following function (pseudocode):
vector<int32> out;
for (int32 i = 0; i < 10; ++i)
{
    int32 result = multiplyStuffByTwo(i);
    // Push to results
    out.push_back(result);
}
When I now parallelize the for loop and put the push_back part in a critical section, I encounter the problem that (of course) the order of the results in out is not always right. How can I make the threads execute the last line of the loop body in the correct order? Thanks!
You can set the size of the out vector in advance by calling out.resize(), and then assign each value by index instead of using push_back():
Pseudo-code:
vector<int32> out;
out.resize(10);
#pragma omp parallel for
for (int32 i = 0; i < 10; ++i)
{
    int32 result = multiplyStuffByTwo(i);
    // set the result by index; no push_back, so no ordering problem
    out[i] = result;
}
But I'd recommend using "classic" arrays. They're much faster and not really harder to manage.
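In the same pseudocode style, a sketch of that fixed-size array version (each iteration writes its own slot, so no ordering or locking is needed):
int32 out[10];
#pragma omp parallel for
for (int32 i = 0; i < 10; ++i)
{
    out[i] = multiplyStuffByTwo(i);
}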
vector<int32> out;
#pragma omp parallel for ordered
for (int32 i = 0; i < 10; ++i)
{
    int32 result = multiplyStuffByTwo(i); // this will run in parallel
    #pragma omp ordered
    // Push to results
    out.push_back(result); // this will run sequentially, in iteration order
}
This can be helpful:
http://openmp.org/mp-documents/omp-hands-on-SC08.pdf