I am writing a simple parallel program in C++ using OpenMP.
I am working on Windows 7 with Microsoft Visual Studio 2010 Ultimate.
I changed the Language property of the project to "Yes (/openmp)" to enable OpenMP support.
Here is the code:
#include <iostream>
#include <cstdlib>   // for EXIT_SUCCESS
#include <omp.h>

using namespace std;

double sum;
int i;
int n = 800000000;

int main(int argc, char *argv[])
{
    omp_set_dynamic(0);
    omp_set_num_threads(4);
    sum = 0;
    #pragma omp for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += i / (n / 10);
    cout << "sum=" << sum << endl;
    return EXIT_SUCCESS;
}
But I couldn't get any speedup by changing the x in omp_set_num_threads(x);
it doesn't matter whether I use OpenMP or not, the calculation time is the same, about 7 seconds.
Does someone know what the problem is?
Your pragma statement is missing the parallel specifier:
#include <iostream>
#include <cstdlib>   // for EXIT_SUCCESS
#include <omp.h>

using namespace std;

double sum;
int i;
int n = 800000000;

int main(int argc, char *argv[])
{
    omp_set_dynamic(0);
    omp_set_num_threads(4);
    sum = 0;
    #pragma omp parallel for reduction(+:sum) // add "parallel"
    for (i = 0; i < n; i++)
        sum += i / (n / 10);
    cout << "sum=" << sum << endl;
    return EXIT_SUCCESS;
}
Sequential:
sum=3.6e+009
2.30071
Parallel:
sum=3.6e+009
0.618365
Here's a version that gets some additional speedup from Hyperthreading. I had to increase the number of iterations by 10x and bump the datatypes to long long:
#include <iostream>
#include <cstdlib>   // for system()
#include <omp.h>

using namespace std;

double sum;
long long i;
long long n = 8000000000LL;

int main(int argc, char *argv[])
{
    omp_set_dynamic(0);
    omp_set_num_threads(8);
    double start = omp_get_wtime();
    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += i / (n / 10);
    cout << "sum=" << sum << endl;
    double end = omp_get_wtime();
    cout << end - start << endl;
    system("pause");
    return EXIT_SUCCESS;
}
Threads: 1
sum=3.6e+010
13.0541
Threads: 2
sum=3.6e+010
6.62345
Threads: 4
sum=3.6e+010
3.85687
Threads: 8
sum=3.6e+010
3.285
Apart from the error pointed out by Mystical, you seem to assume that OpenMP can just do magic. At best it can use all the cores on your machine. If you have 2 cores, it may reduce the execution time by a factor of two if you call omp_set_num_threads(np) with np >= 2, but for np much larger than the number of cores the code becomes inefficient due to parallelization overhead.
The example from Mystical was apparently run on at least 4 cores with np = 4.
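If you're not sure how many cores OpenMP actually sees on your machine, here is a minimal sketch using standard OpenMP API calls to check before picking a thread count:

#include <iostream>
#include <omp.h>

int main()
{
    // omp_get_num_procs() reports the logical processors available to OpenMP;
    // requesting many more threads than this only adds scheduling overhead
    std::cout << "logical processors:  " << omp_get_num_procs() << std::endl;
    std::cout << "default max threads: " << omp_get_max_threads() << std::endl;
    return 0;
}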
What is the performance cost of calling omp_get_thread_num(), compared to looking up the value of a variable?
How can I avoid calling omp_get_thread_num() many times in a simd OpenMP loop?
I can use #pragma omp parallel, but will that make a simd loop?
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    auto a_size = a.size();
    #pragma omp for simd
    for (int i = 0; i < a_size; ++i) {
        a[i] = omp_get_thread_num();
    }
}
I wouldn't be too worried about the cost of the call, but for code clarity you can do:
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    auto a_size = a.size();
    #pragma omp parallel
    {
        const auto threadId = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < a_size; ++i) {
            a[i] = threadId;
        }
    }
}
As long as you use #pragma omp for (and don't put an extra `parallel` in there! Otherwise each of your n threads will spawn n more threads, which is bad), it will ensure that inside your parallel region the for loop is split up amongst the n threads. Make sure the OpenMP compiler flag is turned on.
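To address the simd part of the question: since OpenMP 4.0 the parallel, for, and simd constructs can be combined into one directive, which both splits the loop across threads and vectorizes each thread's chunk. A hedged sketch (the omp_get_thread_num() call is dropped from the body, since a function call inside a simd loop would typically inhibit vectorization):

#include <cstddef>
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    const std::size_t a_size = a.size();
    // OpenMP 4.0: thread-parallel AND vectorized in one directive
    #pragma omp parallel for simd
    for (std::size_t i = 0; i < a_size; ++i) {
        a[i] = static_cast<int>(i); // plain arithmetic body, friendly to SIMD
    }
}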
I'm currently trying to parallelize a particle swarm algorithm and can't find an efficient way to handle an assignment (=) inside a loop. I didn't go for the most simple form of just putting #pragma omp parallel for at the start of the loop, since I expect that would just lead to false sharing problems. The whole thing is complicated by the variables being matrices and vectors.
I think this minimal example (using Armadillo, a linear algebra library with MATLAB-like element-wise operators) shows it better than I can describe it:
#include <omp.h>
#include <math.h>
#include <cstdlib>   // for std::rand()
#include <armadillo>

using namespace std;

int main(int argc, char** argv) {
    arma::uword dimensions = 10, particleCount = 40;
    // Matrix with dimensions rows and particleCount columns, initialized with 1
    arma::Mat<double> positions = arma::ones(dimensions, particleCount);
    // same as above, but initialized with 2; +,/,*,- are elementwise, like in MATLAB
    arma::Mat<double> velocities = arma::ones(dimensions, particleCount) * 2;

    for (arma::uword n = 0; n < particleCount; n++) {
        // .col(n) gets the nth column of the matrix
        arma::Col<double> newVelocity = std::rand() * velocities.col(n);
        // there is a lot more math done here, but all of it is read-only access to other variables
        positions.col(n) += newVelocity;  // again elementwise
        velocities.col(n) = newVelocity;
    }
    return 0;
}
My first idea was to do something like this, but it's horribly inefficient:
#include <omp.h>
#include <math.h>
#include <cstdlib>   // for std::rand()
#include <armadillo>

using namespace std;

int main(int argc, char** argv) {
    arma::uword dimensions = 10, particleCount = 40;
    // these two variables cannot be moved inside the parallel region :-/
    arma::Mat<double> positions = arma::ones(dimensions, particleCount);
    arma::Mat<double> velocities = arma::ones(dimensions, particleCount) * 2;

    #pragma omp parallel
    {
        arma::Mat<double> velocity_private = arma::zeros(dimensions, particleCount);
        #pragma omp for
        for (arma::uword n = 0; n < particleCount; n++) {
            arma::Col<double> newVelocity = std::rand() * velocities.col(n);
            velocity_private.col(n) = newVelocity;
        }
        #pragma omp single
        {
            // first part of workaround for '='
            velocities = arma::zeros(dimensions, particleCount);
        }
        #pragma omp critical
        {
            for (arma::uword n = 0; n < particleCount; n++) {
                positions.col(n) += velocity_private.col(n);
                // second part of workaround for '='
                velocities.col(n) += velocity_private.col(n);
            }
        }
    } // end omp parallel
    return 0;
}
I thought about using user-defined reductions, but I didn't find any example for assignments; all of them were for additions or multiplications, and unfortunately not very accessible.
Any suggestions or advice appreciated! :)
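For what it's worth, a user-defined reduction may not be needed here at all: every iteration reads and writes only its own column n, so the writes are disjoint and a plain #pragma omp parallel for is race-free, provided the random number generator is thread-safe. A sketch under that assumption, swapping std::rand() (which is not guaranteed thread-safe) for a per-thread C++11 generator; the seeding is just for illustration:

#include <omp.h>
#include <random>
#include <armadillo>

int main() {
    arma::uword dimensions = 10, particleCount = 40;
    arma::Mat<double> positions = arma::ones(dimensions, particleCount);
    arma::Mat<double> velocities = arma::ones(dimensions, particleCount) * 2;

    #pragma omp parallel
    {
        // one generator per thread: no shared RNG state to contend on
        std::mt19937 gen(omp_get_thread_num());
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        #pragma omp for
        for (arma::uword n = 0; n < particleCount; n++) {
            // each iteration touches only column n, so no two threads
            // write the same memory: plain assignment needs no workaround
            arma::Col<double> newVelocity = dist(gen) * velocities.col(n);
            positions.col(n) += newVelocity;
            velocities.col(n) = newVelocity;
        }
    }
    return 0;
}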
In the following example the C++11 threads take about 50 seconds to execute, but the OpenMP threads only 5 seconds. Any ideas why? (I can assure you it still holds true if you are doing real work instead of doNothing, or if you do it in a different order, etc.) I'm on a 16-core machine, too.
#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>

using namespace std;

void doNothing() {}

int run(int algorithmToRun)
{
    auto startTime = std::chrono::system_clock::now();
    for (int j = 1; j < 100000; ++j)
    {
        if (algorithmToRun == 1)
        {
            vector<thread> threads;
            for (int i = 0; i < 16; i++)
            {
                threads.push_back(thread(doNothing));
            }
            for (auto& thread : threads) thread.join();
        }
        else if (algorithmToRun == 2)
        {
            #pragma omp parallel for num_threads(16)
            for (unsigned i = 0; i < 16; i++)
            {
                doNothing();
            }
        }
    }
    auto endTime = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = endTime - startTime;
    return elapsed_seconds.count();
}

int main()
{
    int cppt = run(1);
    int ompt = run(2);
    cout << cppt << endl;
    cout << ompt << endl;
    return 0;
}
OpenMP uses thread pools for its pragmas. Spinning up and tearing down threads is expensive. OpenMP avoids this overhead, so all it's doing is the actual work and the minimal shared-memory shuttling of the execution state. In your C++11 threads code, you are spinning up and tearing down a new set of 16 threads on every iteration.
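A minimal sketch of what that reuse buys you (a hypothetical restructuring, not the asker's code): create the 16 threads once and let each of them run the whole j loop itself, so the creation cost is paid once rather than on every iteration:

#include <thread>
#include <vector>

void doNothing() {}

int main() {
    std::vector<std::thread> threads;
    // spawn the 16 threads once, up front
    for (int i = 0; i < 16; ++i) {
        threads.emplace_back([] {
            // each thread runs all iterations itself,
            // instead of being re-created 100000 times
            for (int j = 1; j < 100000; ++j)
                doNothing();
        });
    }
    for (auto& t : threads) t.join();
    return 0;
}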
I tried the code with a loop of 100 from
Choosing the right threading framework, and it took
OpenMP 0.0727, Intel TBB 0.6759, and the C++ thread library 0.5962 milliseconds.
I also applied what AruisDante suggested:
void nested_loop(int max_i, int band)
{
    for (int i = 0; i < max_i; i++)
    {
        doNothing(band);
    }
}
...
else if (algorithmToRun == 5)
{
    thread bristle(nested_loop, max_i, band);
    bristle.join();
}
This code seems to take less time than your original C++11 thread section.
I decided to calculate e as the sum of a series, to get 2.718....
My code without OpenMP works perfectly, and I measured the time it takes for the calculation. When I used OpenMP to parallelize the calculation, however, I got an error. I am running the program on a Core i7 (4 physical cores, 8 logical). From what people say, I should get a time about twice as fast as without OpenMP. Below is my code:
#include <iostream>
#include <time.h>
#include <math.h>
#include "fact.h"
#include <cstdlib>
#include <conio.h>

using namespace std;

int main()
{
    clock_t t1, t2;
    int n;
    long double exp = 0;
    long double y;
    int p;
    cout << "Enter n:";
    cin >> n;
    t1 = clock();
    #pragma omp parallel for num_threads(2);
    for (int i = 1; i < n; i++)
    {
        p = i + 1;
        exp = exp + (1 / (fact(p)));
    }
    t2 = clock();
    double total_clock;
    total_clock = t2 - t1;
    long double total_exp;
    total_exp = exp + 2;
    cout << total_clock << "\n the time used for the parallel calculation" << endl;
    cout << total_exp << endl;
    cin.get();
    getch();
    return 0;
}
fact() is the function used to calculate the factorial of a number:
long double fact(int N)
{
    if (N < 0)
        return 0;
    if (N == 0)
        return 1;
    else
        return N * fact(N - 1);
}
Error 3 error C3005: ;: unexpected token in directive OpenMP "parallel for" c:\users\александр\documents\visual studio 2012\projects\consoleapplication1\consoleapplication1\openmp.cpp 18
OpenMP pragmas must not end with a semicolon, hence
#pragma omp parallel for num_threads(2);
should be
#pragma omp parallel for num_threads(2)
without the ;.
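Beyond the semicolon, note that the loop also has data races (exp and p are written by all threads). A hedged sketch of a corrected version, going beyond what this answer states: exp becomes a reduction variable, p becomes loop-local, and omp_get_wtime() replaces clock(), since clock() sums CPU time across threads on many platforms:

#include <iostream>
#include <omp.h>
#include "fact.h"

using namespace std;

int main()
{
    int n;
    cout << "Enter n:";
    cin >> n;

    double t1 = omp_get_wtime();
    long double exp = 0;

    // no trailing semicolon; exp is reduced, p is private to each iteration
    #pragma omp parallel for num_threads(2) reduction(+ : exp)
    for (int i = 1; i < n; i++)
    {
        int p = i + 1;
        exp = exp + (1 / fact(p));
    }

    double t2 = omp_get_wtime();
    cout << t2 - t1 << " seconds for the parallel calculation" << endl;
    cout << exp + 2 << endl;
    return 0;
}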
When I use OpenMP with reduction(+ : sum) and do the summation inline, without calling a function, the OpenMP version works fine.
#include <iostream>
#include <omp.h>

using namespace std;

int sum = 0;

void summation()
{
    sum = sum + 1;
}

int main()
{
    int i, sum;
    #pragma omp parallel for reduction (+ : sum)
    for (i = 0; i < 1000000000; i++)
        summation();
    #pragma omp parallel for reduction (+ : sum)
    for (i = 0; i < 1000000000; i++)
        summation();
    #pragma omp parallel for reduction (+ : sum)
    for (i = 0; i < 1000000000; i++)
        summation();
    std::cerr << "Sum is=" << sum << std::endl;
}
But when I call the function summation on a global variable, the OpenMP version takes even more time than the sequential version.
I would like to know the reason for this and what changes should be made.
The summation function doesn't use the OpenMP reduction variable that you are reducing into. Fix it:
#include <iostream>
#include <omp.h>

void summation(int& sum) { sum++; }

int main()
{
    int sum = 0;  // must be initialized: the reduction adds into its original value
    #pragma omp parallel for reduction (+ : sum)
    for (int i = 0; i < 1000000000; ++i)
        summation(sum);
    std::cerr << "Sum is=" << sum << '\n';
}
The time taken to synchronize access to that one global variable is far in excess of what you gain by using multiple cores: the threads all end up waiting on each other, because there is only one variable and only one core can access it at a time. That design is not capable of concurrency, and all the synchronization you're paying for just increases the run-time.
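To make the contrast concrete, here is a hypothetical micro-benchmark sketch: incrementing one shared variable atomically serializes the threads, while a reduction gives each thread a private copy that is merged once at the end:

#include <iostream>
#include <omp.h>

int main()
{
    const int n = 100000000;

    long long shared_sum = 0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        #pragma omp atomic   // every increment contends on the same variable
        ++shared_sum;
    }
    double t1 = omp_get_wtime();

    long long reduced_sum = 0;
    #pragma omp parallel for reduction(+ : reduced_sum)
    for (int i = 0; i < n; ++i)
        ++reduced_sum;       // private per-thread copy, merged once at the end
    double t2 = omp_get_wtime();

    std::cout << "atomic:    " << shared_sum << " in " << t1 - t0 << " s\n";
    std::cout << "reduction: " << reduced_sum << " in " << t2 - t1 << " s\n";
    return 0;
}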