I've spent time going over other posts, but I still can't get this simple program to work.
#include<iostream>
#include<cmath>
#include<omp.h>
using namespace std;
int main()
{
int threadnum =4;//want manual control
int steps=100000,cumulative=0, counter;
int a,b,c;
float dum1, dum2, dum3;
float pos[10000][3] = {0};
float non=0;
//RNG declared
#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction (+: non, cumulative) num_threads(threadnum)
{
for(int dummy=0;dummy<(10000/threadnum);dummy++)
{
dum1=0,dum2=0,dum3=0;
a=0,b=0,c=0;
for (counter=0;counter<steps;counter++)
{
dum1 = somefunct1()+rand();
dum2=somefunct2()+rand();
dum3 = somefunct3(dum1, dum2, ...);
a += somefunct4(dum1,dum2,dum3, ...);
b += somefunct5(dum1,dum2,dum3, ...);
c += somefunct6(dum1,dum2,dum3, ...);
cumulative++; //count number of loops executed
}
pos[dummy][0] = a;//saves results of second loop to array
pos[dummy][1] = b;
pos[dummy][2] = c;
non+= pos[dummy][0];//holds the summed a values
}
}
}
I've cut the program down so it fits here. With many of the changes I've tried, the inner loop does not execute the correct number of times, and cumulative ends up at something like 32,532,849 instead of 1 billion. Scaling is about 2x for the code above, but it should be much higher.
I want the code to simply split the first (10000-iteration) for loop so that each thread runs a share of the iterations in parallel (if this could be dynamic, that would be nice) and saves the result of each iteration's inner for loop to the results array. The second for loop consists of dependent steps and cannot be split. Currently the order of the 'dummy' iterations does not matter (pos[345] can be swapped with pos[3456] as long as all three components are swapped), but I will have to modify it later so that it does matter.
The numerous variables and initializations in the inner loop are confusing me terribly. There are a lot of random calls and functions/math functions in the inner loop - is there overhead here that is causing a problem? I'm using GNU 4.9.2 on windows.
Any help would be greatly appreciated.
Edit: finally fixed. Moved the RNG declaration inside the first for loop. Now I get 3.75x scaling going to 4 threads and 5.72x scaling on 8 threads (hyperthreads). Not perfect but I will take it. I still think there is an issue with thread locking and syncing.
......
float non=0;
#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction (+: non, cumulative) num_threads(threadnum)
{
//RNG declared
#pragma omp for
for(int dummy=0;dummy<10000;dummy++) //omp for splits the full range across the threads
{
....
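A minimal sketch of the fix described in the edit. It assumes a standard <random> engine in place of the unnamed RNG and a toy step of -1, 0 or +1 in place of somefunct1..6; run_walks and WalkResult are made-up names, not the original code. Each walk builds its own engine inside the outer loop, so no generator state is shared between threads, and the outer loop covers the full range so that the work-sharing construct can split it:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct WalkResult {
    long long cumulative;    // total inner-loop iterations executed
    double non;              // sum of the a values over all walks
    std::vector<float> pos;  // three values per walk, stored flat
};

// Hypothetical stand-in for the original program.
WalkResult run_walks(int walks, int steps, unsigned seed) {
    assert(walks >= 0 && steps >= 0);
    std::vector<float> pos(static_cast<std::size_t>(walks) * 3, 0.0f);
    long long cumulative = 0;
    double non = 0.0;

    #pragma omp parallel for schedule(dynamic) reduction(+ : cumulative, non)
    for (int walk = 0; walk < walks; ++walk) {
        // Declared inside the loop body, so every iteration owns its engine.
        std::mt19937 rng(seed + static_cast<unsigned>(walk));
        std::uniform_int_distribution<int> step(-1, 1);
        int a = 0, b = 0, c = 0;
        for (int s = 0; s < steps; ++s) {  // dependent steps, stays serial
            a += step(rng);
            b += step(rng);
            c += step(rng);
            ++cumulative;
        }
        pos[3 * walk + 0] = static_cast<float>(a);
        pos[3 * walk + 1] = static_cast<float>(b);
        pos[3 * walk + 2] = static_cast<float>(c);
        non += a;
    }
    return WalkResult{cumulative, non, pos};
}
```

Seeding per walk rather than per thread is a design choice of this sketch: the results then do not depend on which thread happens to pick up which walk. Calling run_walks(10000, 100000, seed) should report cumulative equal to 1 billion for any thread count.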
I am trying to add OpenMP parallelization to quite a big project, and I found that OpenMP does too much synchronization outside the parallel blocks.
This synchronization is done for all of the variables, even those not used in the parallel block, and it is done continuously, not only before entering the block.
I made an example proving this:
#include <cmath>
int main()
{
double dummy1 = 1.234;
int const size = 1000000;
int const size1 = 2500;
int const size2 = 500;
for(unsigned int i=0; i<size; ++i){
//for (unsigned int j=0; j<size1; j++){
// dummy1 = pow(dummy1/2 + 1, 1.5);
//}
#pragma omp parallel for
for (unsigned int j=0; j<size2; j++){
double dummy2 = 2.345;
dummy2 = pow(dummy2/2 + 1, 1.5);
}
}
}
If I run this code (with the for cycle commented), the runtimes are 6.75s with parallelization and 30.6s without. Great.
But if I uncomment the for cycle and run it again, the excessive synchronization kicks in and I get results 67.9s with parallelization and 73s without. If I increase size1 I even get slower results with parallelization than without it.
Is there a way to disable this synchronization and force it only before the second for cycle? Or any other way how to improve the speed?
Note that neither the outer loop nor the first inner loop is parallelizable in the real code. The outer one is in fact an ODE solver, and the first inner one updates a large number of internal values.
I am using gcc (SUSE Linux) 4.8.5
Thanks for your answers.
In the end the solution for my problem was specifying number of threads = number of processor cores. It seems the hyperthreading was causing the problems. So using (my processor has 4 real cores)
#pragma omp parallel for num_threads(4)
I get times 8.7s without the first for loop and 51.9s with it. There is still about 1.2s overhead, but that is acceptable. Using default (8 threads)
#pragma omp parallel for
the times are 6.65s and 68s. Here the overhead is about 19s.
So hyperthreading helps if no other code is present, but when other code is present it might not always be a good idea to use it.
Perhaps the solution to my problem is obvious to someone with experience with OpenMP, but I don't have it. I want to accelerate the following subroutine using OpenMP:
void Build_ERIS(vector<double> &eris, vector<Atomic_Orbital> &Basis)
{
int basis_size = Basis.size();
int m = basis_size*(basis_size+1)/2;
eris.resize(m*(m+1)/2);
bool compute;
std::fill(eris.begin(), eris.end(), 0);
int i_orbital,j_orbital, k_orbital,l_orbital, i_primitive, j_primitive, k_primitive,l_primitive,ij,kl, ijkl,ijij,klkl;
#pragma omp parallel
{
#pragma omp for ordered
for(i_orbital=0; i_orbital<basis_size; i_orbital++){
for(j_orbital=0; j_orbital<i_orbital+1; j_orbital++){
ij = i_orbital*(i_orbital+1)/2 + j_orbital;
for(k_orbital=0; k_orbital<basis_size; k_orbital++){
for(l_orbital=0; l_orbital<k_orbital+1; l_orbital++){
kl = k_orbital*(k_orbital+1)/2 + l_orbital;
if (ij >= kl) {
ijkl = composite_index(i_orbital,j_orbital,k_orbital,l_orbital);
ijij = composite_index(i_orbital,j_orbital,i_orbital,j_orbital);
klkl = composite_index(k_orbital,l_orbital,k_orbital,l_orbital);
for(i_primitive=0; i_primitive<Basis[i_orbital].contraction.size; i_primitive++)
for(j_primitive=0; j_primitive<Basis[j_orbital].contraction.size; j_primitive++)
for(k_primitive=0; k_primitive<Basis[k_orbital].contraction.size; k_primitive++)
for(l_primitive=0; l_primitive<Basis[l_orbital].contraction.size; l_primitive++)
eris[ijkl] +=
normconst(Basis[i_orbital].contraction.exponent[i_primitive],Basis[i_orbital].angular.l, Basis[i_orbital].angular.m, Basis[i_orbital].angular.n)*
normconst(Basis[j_orbital].contraction.exponent[j_primitive],Basis[j_orbital].angular.l, Basis[j_orbital].angular.m, Basis[j_orbital].angular.n)*
normconst(Basis[k_orbital].contraction.exponent[k_primitive],Basis[k_orbital].angular.l, Basis[k_orbital].angular.m, Basis[k_orbital].angular.n)*
normconst(Basis[l_orbital].contraction.exponent[l_primitive],Basis[l_orbital].angular.l, Basis[l_orbital].angular.m, Basis[l_orbital].angular.n)*
Basis[i_orbital].contraction.coef[i_primitive]*
Basis[j_orbital].contraction.coef[j_primitive]*
Basis[k_orbital].contraction.coef[k_primitive]*
Basis[l_orbital].contraction.coef[l_primitive]*
ERI_int(Basis[i_orbital].contraction.center.x, Basis[i_orbital].contraction.center.y, Basis[i_orbital].contraction.center.z, Basis[i_orbital].contraction.exponent[i_primitive],Basis[i_orbital].angular.l, Basis[i_orbital].angular.m, Basis[i_orbital].angular.n,
Basis[j_orbital].contraction.center.x, Basis[j_orbital].contraction.center.y, Basis[j_orbital].contraction.center.z, Basis[j_orbital].contraction.exponent[j_primitive],Basis[j_orbital].angular.l, Basis[j_orbital].angular.m, Basis[j_orbital].angular.n,
Basis[k_orbital].contraction.center.x, Basis[k_orbital].contraction.center.y, Basis[k_orbital].contraction.center.z, Basis[k_orbital].contraction.exponent[k_primitive],Basis[k_orbital].angular.l, Basis[k_orbital].angular.m, Basis[k_orbital].angular.n,
Basis[l_orbital].contraction.center.x, Basis[l_orbital].contraction.center.y, Basis[l_orbital].contraction.center.z, Basis[l_orbital].contraction.exponent[l_primitive],Basis[l_orbital].angular.l, Basis[l_orbital].angular.m, Basis[l_orbital].angular.n);
/**/
}
}
}
}
}
}
}
My concern is how best to make sure that, after the OpenMP parallelization, the accumulation into eris[ijkl] still gives the same values as the serial version of the routine. How can I fuse the loops in a way that is numerically safe?
Several things I see.
1) The ordered clause on #pragma omp for only matters when the loop body contains a #pragma omp ordered region, which is then executed in iteration order, one thread at a time. Your loop has no such region, so the clause buys you nothing and can only constrain scheduling. Remove it.
2) You have not declared any of your variables shared or private. Variables declared outside the parallel region are shared by default (only the loop variable of the omp for loop, i_orbital, is made private automatically), so ij and kl, for instance, are accessible by every thread working on any iteration. You can no doubt see how this causes a race condition if, say, iteration 100 changes ij while iteration 1 is still using it.
3) Your variable eris[ijkl] as you rightly noted must be reduced properly. If ijkl can never be the same value for two different iterations in your i_orbital loop, then you're fine as-is; no two threads will ever be changing the same variable eris[ijkl] potentially at the same time. If it can be the same value, then you have to carefully handle reduction on the array.
4) Here's what you should work with for starters. This is assuming that ijkl will never be the same value for two different iterations, and your functions do not take in any non-constant references (potentially changing what I'm assuming input variables to output variables).
#pragma omp parallel for private(i_orbital, j_orbital, ij, k_orbital, l_orbital, kl, ijkl, ijij, klkl, i_primitive, j_primitive, k_primitive, l_primitive)
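As a rough sketch of points 2) to 4), here is a toy version of the same loop nest. toy_integral, pair_index and build_table are hypothetical names, and the kernel is a placeholder for normconst/ERI_int. The index variables are declared inside the loops, which makes them private without a long private(...) list, and each thread sums the primitives for an element in the same order as the serial code before storing it, so the result should match the serial values bit for bit, under the stated assumption that no two outer iterations share an ijkl:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the integral kernel; any pure function works.
double toy_integral(int i, int j, int k, int l) {
    return 1.0 / (1 + i + 2 * j + 3 * k + 4 * l);
}

// Triangular pair index, the same formula as ij and kl in the question.
inline int pair_index(int i, int j) { return i * (i + 1) / 2 + j; }

std::vector<double> build_table(int n, int primitives) {
    assert(n >= 0 && primitives >= 0);
    const int m = n * (n + 1) / 2;
    std::vector<double> table(static_cast<std::size_t>(m) * (m + 1) / 2, 0.0);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            const int ij = pair_index(i, j);  // local, hence private
            for (int k = 0; k < n; ++k) {
                for (int l = 0; l <= k; ++l) {
                    const int kl = pair_index(k, l);
                    if (ij < kl) continue;
                    // Distinct (ij, kl) pairs with ij >= kl give distinct slots,
                    // and every ij belongs to exactly one value of i.
                    const int ijkl = pair_index(ij, kl);
                    double sum = 0.0;
                    for (int p = 0; p < primitives; ++p)
                        sum += toy_integral(i, j, k, l) * (p + 1);
                    table[ijkl] = sum;
                }
            }
        }
    }
    return table;
}
```

If two different outer iterations could hit the same ijkl, this would no longer be safe, and the accumulation would need a per-thread copy of the array or an atomic update instead.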
I am trying to parallelize my own C++ implementation of Travelling Salesman Problem using OpenMP.
I have a function to calculate cost of road cost() and vector [0,1,2,...,N], where N is a number of nodes of the road.
In main(), I am trying to find the best road:
do
{
cost();
} while (std::next_permutation(permutation_base, permutation_base + operations_number));
I was trying to use #pragma omp parallel to parallelize that code, but it only made it more time consuming.
Is there any way to parallelize that code?
#pragma omp parallel doesn't automatically divide the computation among separate threads. If you want to divide the computation, you additionally need to use #pragma omp for; otherwise the whole computation is done multiple times, once per thread. For instance, the following code prints "Hello World!" four times on my laptop, since it has 4 cores.
int main(int argc, char* argv[]){
#pragma omp parallel
cout << "Hello World!\n";
}
The same thing happens to your code if you simply write #pragma omp parallel. Your code gets executed multiple times, once for each thread, and therefore your program won't be faster. If you want to divide the work among the threads (so that each thread does different things), you have to use something like #pragma omp parallel for.
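To make the difference concrete, here is a small sketch that counts how often a loop body runs under each directive. The two function names are made up for this illustration:

```cpp
#include <cassert>

// Plain "parallel": every thread runs the whole loop.
long long executions_with_parallel(int n) {
    assert(n >= 0);
    long long calls = 0;
    #pragma omp parallel
    {
        for (int i = 0; i < n; ++i) {
            #pragma omp atomic
            ++calls;  // shared counter, so the update must be atomic
        }
    }
    return calls;  // n times the number of threads
}

// "parallel for": the iterations are divided among the threads.
long long executions_with_parallel_for(int n) {
    assert(n >= 0);
    long long calls = 0;
    #pragma omp parallel for reduction(+ : calls)
    for (int i = 0; i < n; ++i) {
        ++calls;
    }
    return calls;  // always n
}
```

With 4 threads, executions_with_parallel(10) should return 40 while executions_with_parallel_for(10) returns 10; compiled without OpenMP, both return 10.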
Now we can look at your code. It isn't suited for parallelization as written. Let's see why. You start with your array permutation_base and calculate the cost. Then you modify permutation_base with next_permutation. You have to wait for the cost computation to finish before you are allowed to modify the array, because otherwise the cost computation would be wrong. So the whole thing wouldn't work on separate threads.
One possible solution would be to keep multiple copies of your array permutation_base, with each copy running through only a part of all permutations. For instance:
vector<int> permutation_base{1, 2, 3, 4};
int n = permutation_base.size();
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    // Make a copy of permutation_base
    auto perm = permutation_base;
    // rotate the i'th element to the front
    // keep the other elements sorted
    std::rotate(perm.begin(), perm.begin() + i, perm.begin() + i + 1);
    // Now go through all permutations of the last `n-1` elements.
    // Keep the first element fixed.
    do {
        cost(); // must evaluate this thread's own copy, perm
    }
    while (std::next_permutation(perm.begin() + 1, perm.end()));
}
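A self-contained sketch of the same idea, with the pieces the snippet leaves out filled in by assumption. tour_cost and cheapest_tour are hypothetical names, the cost is the length of the closed tour through a distance matrix, and the per-thread minima are combined with reduction(min : ...), which needs OpenMP 3.1 or later:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Length of the closed tour that visits the nodes in the order given by perm.
double tour_cost(const Matrix& d, const std::vector<int>& perm) {
    double total = 0.0;
    for (std::size_t k = 0; k + 1 < perm.size(); ++k)
        total += d[perm[k]][perm[k + 1]];
    return total + d[perm.back()][perm.front()];
}

double cheapest_tour(const Matrix& d) {
    const int n = static_cast<int>(d.size());
    assert(n >= 2);
    std::vector<int> base(n);
    std::iota(base.begin(), base.end(), 0);  // 0, 1, ..., n-1 (sorted)

    double best = std::numeric_limits<double>::infinity();
    #pragma omp parallel for reduction(min : best)
    for (int i = 0; i < n; ++i) {
        std::vector<int> perm = base;  // private copy for this iteration
        // Move element i to the front; the rest stays sorted.
        std::rotate(perm.begin(), perm.begin() + i, perm.begin() + i + 1);
        do {
            best = std::min(best, tour_cost(d, perm));
        } while (std::next_permutation(perm.begin() + 1, perm.end()));
    }
    return best;
}
```

For four nodes on a square with side 1 (and 2 for the two diagonal entries, to keep the arithmetic exact), cheapest_tour returns 4. Note that this gives at most n parallel tasks, so it only scales up to n threads.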
Most definitely.
The big problem with parallelizing these permutation problems is that in order to parallelize well, you need to "index" into an arbitrary permutation. In short, you need to find the kth permutation. You can take advantage of some cool math properties and you'll find this:
std::vector<int> kth_perm(long long k, std::vector<int> V) {
long long int index;
long long int next;
std::vector<int> new_v;
while(V.size()) {
index = k / fact(V.size() - 1);
new_v.push_back(V.at(index));
next = k % fact(V.size() - 1);
V.erase(V.begin() + index);
k = next;
}
return new_v;
}
So then your logic might look something like this:
long long int start = (numperms*threadnum)/ numthreads;
long long int end = threadnum == numthreads-1 ? numperms : (numperms*(threadnum+1))/numthreads;
perm = kth_perm(start, perm); // perm is the sorted list you are permuting
for (long long int j = start; j < end; ++j){
    if (is_valid_tour(adj_list, perm, startingVertex, endingVertex)) {
        isValidTour=true;
        return perm;
    }
    std::next_permutation(perm.begin(),perm.end());
}
isValidTour = false;
return perm;
Obviously there's a lot of code, but the idea of parallelizing it can be captured by the little code I've posted. You can visualize "indexing" like this:
|--------------------------------|
^ ^ ^
t1 t2 ... tn
Find the ith permutation and let a thread call std::next_permutation until it finds the starting point of the next thread.
Note that you'll want to wrap the function that contains the bottom code in #pragma omp parallel.
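Here is a compact, runnable sketch of that chunking scheme. The fact() helper is not shown above, so factorial() below is an assumed implementation, and count_matching with its toy predicate (first element smaller than last) is purely illustrative, standing in for is_valid_tour:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

long long factorial(int n) {
    long long f = 1;
    for (int i = 2; i <= n; ++i) f *= i;
    return f;
}

// k-th permutation (0-based, lexicographic order) of a sorted list.
std::vector<int> kth_permutation(long long k, std::vector<int> items) {
    assert(k >= 0 && k < factorial(static_cast<int>(items.size())));
    std::vector<int> result;
    while (!items.empty()) {
        const long long block = factorial(static_cast<int>(items.size()) - 1);
        const long long index = k / block;
        result.push_back(items[index]);
        items.erase(items.begin() + index);
        k %= block;
    }
    return result;
}

// Counts permutations of 0..n-1 whose first element is smaller than the last,
// walking each chunk with next_permutation from its own starting point.
long long count_matching(int n, int chunks) {
    assert(n >= 1 && n <= 15 && chunks >= 1);
    std::vector<int> base(n);
    std::iota(base.begin(), base.end(), 0);
    const long long total = factorial(n);
    long long matches = 0;

    #pragma omp parallel for reduction(+ : matches)
    for (int c = 0; c < chunks; ++c) {
        const long long start = total * c / chunks;
        const long long end = total * (c + 1) / chunks;
        if (start == end) continue;  // more chunks than permutations
        std::vector<int> perm = kth_permutation(start, base);
        for (long long k = start; k < end; ++k) {
            if (perm.front() < perm.back()) ++matches;
            std::next_permutation(perm.begin(), perm.end());
        }
    }
    return matches;
}
```

Because the chunks tile the range 0..n!-1 without gaps or overlap, the count is the same for any number of chunks; for n = 5 it is 60, half of the 120 permutations.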
I am trying to increase performance of a rather complex iteration algorithm by parallelizing matrix multiplication, which is being called on each iteration.
The algorithm takes 500 iterations and approximately 10 seconds. But after parallelizing matrix multiplication it slows down to 13 seconds.
However, when I tested matrix multiplication of the same dimension alone, there was an increase in speed. (I am talking about 100x100 matrices.)
Finally, I switched off any parallelizing inside the algorithm and added on each iteration the following piece of code, which does absolutely nothing and presumably shouldn't take long:
int j;
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
And again, there is a 30% slowdown compared to the same algorithm without this piece of code.
Thus, invoking any OpenMP parallelization 500 times inside the main algorithm somehow slows things down. This behavior looks very strange to me. Does anybody have a clue what the problem is?
The main algorithm is being called by a desktop application, compiled by VS2010, Win32 Release.
I work on Intel Core i3 (parallelization creates 4 threads), 64 bit Windows 7.
Here is a structure of a program:
int internal_method(..)
{
...//no openmp here
// the following code does nothing, has nothing to do with the rest of the program and shouldn't take long,
// but somehow adding of this code caused a 3 sec slowdown of the Huge_algorithm()
double sum;
#pragma omp parallel for private(sum)
for (int i = 0; i < 10; i++)
sum = i*i*i / (1.0 + i*i*i*i);
...//no openmp here
}
int Huge_algorithm(..)
{
...//no openmp here
for (int i = 0; i < 500; i++)
{
.....// no openmp
internal_method(..);
......//no openmp
}
...//no openmp here
}
So, the final point is:
calling the parallel piece of code 500 times alone (when the rest of the algorithm is omitted) takes less than 0.01 sec, but when you call it 500 times inside a huge algorithm it causes 3 sec delay of the entire algorithm.
And what I don't understand is how the small parallel part affects the rest of the algorithm?
For 10 iterations and a simple assignment, I guess there is too much OpenMP overhead compared to the computation itself. What looks lightweight here is actually managing and synchronizing multiple threads which may not even come from a thread pool. There might be some locking involved, and I don't know how good MSVC is at estimating whether to parallelize at all.
Try with bigger loop bodies or a bigger amount of iterations (say 1024*1024 iterations, just for starters).
Example OpenMP Magick:
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
This might be approximately expanded by a compiler to:
const unsigned __cpu_count = __get_cpu_count();
unsigned *__j = alloca (sizeof (unsigned) * __cpu_count);
__thread *__threads = alloca (sizeof (__thread) * __cpu_count);
for (unsigned u=0; u!=__cpu_count; ++u) {
    __init_thread (__threads+u);
    __run_thread ([u]{for (int i=u; i<10; i+=__cpu_count)
                          __j[u] = i;}); // assume lambdas
}
for (unsigned u=0; u!=__cpu_count; ++u)
    __join (__threads+u);
with __init_thread(), __run_thread() and __join() being non-trivial functions that invoke certain system calls.
In case thread pools are used, you would replace the first alloca() by something like __pick_from_pool() or so.
(Note that all of this, names and emitted code alike, is imaginary; an actual implementation will look different.)
Regarding your updated question:
You seem to be parallelizing at the wrong granularity. Put as much workload as possible in a thread, so instead of
for (...) {
    #pragma omp parallel for ...
    for (...) {}
}
try
#pragma omp parallel for ...
for (...) {
    for (...) {}
}
Rule of thumb: Keep workloads big enough per thread so as to reduce relative overhead.
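A small sketch of the two granularities, assuming the outer iterations are independent (which is what makes the second form legal). cell, sum_fine and sum_coarse are made-up names, and integer arithmetic is used so that both versions give exactly the same answer:

```cpp
#include <cassert>

// Hypothetical per-element work.
long long cell(int i, int j) { return (static_cast<long long>(i) * j) % 7; }

// Fine-grained: one parallel region (fork, barrier, join) per outer iteration.
long long sum_fine(int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    long long total = 0;
    for (int i = 0; i < rows; ++i) {
        #pragma omp parallel for reduction(+ : total)
        for (int j = 0; j < cols; ++j)
            total += cell(i, j);
    }
    return total;
}

// Coarse-grained: a single parallel region, each thread gets whole rows.
long long sum_coarse(int rows, int cols) {
    assert(rows >= 0 && cols >= 0);
    long long total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j)
            total += cell(i, j);
    }
    return total;
}
```

Both functions return the same sum; the difference is that sum_fine pays the region overhead rows times, while sum_coarse pays it once.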
Maybe just j=i does not give each core enough work to be worthwhile. Maybe you should try a heavier calculation (for example, computing i*i*i*i*i*i and dividing it by i+i+i).
Are you running this on a multi-core CPU or a GPU?
I'm trying to compute a distance matrix in parallel using OpenMP: I calculate the distance between each point and all the other points, so the best algorithm I have thought of so far costs O(n^2). The performance of my OpenMP version, using 10 threads on an 8-processor machine, is no better than the serial approach in terms of running time. This is my first time using OpenMP, so if there is any mistake in my approach, or any better (faster) approach, please let me know. The following is my code, where "dat" is a vector that contains the data points.
map <int, map< int, double> > dist; //construct the distance matrix
int c=count(dat.at(0).begin(),dat.at(0).end(),delm)+1;
#pragma omp parallel for shared (c,dist)
for(int p=0;p<dat.size();p++)
{
for(int j=p+1;j<dat.size();j++)
{
double ecl=0;
string line1=dat.at(p);
string line2=dat.at(j);
for (int i=0;i<c;i++)
{
double num1=atof(line1.substr(0,line1.find_first_of(delm)).c_str());
line1=line1.substr(line1.find_first_of(delm)+1).c_str();
double num2=atof(line2.substr(0,line2.find_first_of(delm)).c_str());
line2=line2.substr(line2.find_first_of(delm)+1).c_str();
ecl += (num1-num2)*(num1-num2);
}
ecl=sqrt(ecl);
#pragma omp critical
{
dist[p][j]=ecl;
dist[j][p]=ecl;
}
}
}
#pragma omp critical has the effect of serializing your loop so getting rid of that should be your first goal. This should be a step in the right direction:
ptrdiff_t const c = count(dat[0].begin(), dat[0].end(), delm) + 1;
vector<vector<double> > dist(dat.size(), vector<double>(dat.size()));
#pragma omp parallel for
for (size_t p = 0; p != dat.size(); ++p)
{
for (size_t j = p + 1; j != dat.size(); ++j)
{
double ecl = 0.0;
string line1 = dat[p];
string line2 = dat[j];
for (ptrdiff_t i = 0; i != c; ++i)
{
double const num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
double const num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());
line1 = line1.substr(line1.find_first_of(delm) + 1);
line2 = line2.substr(line2.find_first_of(delm) + 1);
ecl += (num1 - num2) * (num1 - num2);
}
ecl = sqrt(ecl);
dist[p][j] = ecl;
dist[j][p] = ecl;
}
}
There are a few other obvious things that could be done to make this faster overall, but fixing your parallelization is the most important thing.
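One such thing, sketched under assumptions: the loop above re-parses both strings with substr and atof for every pair, so each line is parsed about n times. Parsing every line once up front removes that work from the O(n^2) part. The helper names are hypothetical, delm is assumed to be a single char, and std::stod (unlike atof) throws on a malformed field:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split one delimited line into numbers.
std::vector<double> parse_row(const std::string& line, char delm) {
    std::vector<double> values;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, delm))
        values.push_back(std::stod(field));
    return values;
}

// Parse the whole data set once, before any distance is computed.
std::vector<std::vector<double>> parse_all(const std::vector<std::string>& dat,
                                           char delm) {
    std::vector<std::vector<double>> points;
    points.reserve(dat.size());
    for (const std::string& line : dat)
        points.push_back(parse_row(line, delm));
    return points;
}

// Euclidean distance between two already-parsed points.
double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k)
        sum += (a[k] - b[k]) * (a[k] - b[k]);
    return std::sqrt(sum);
}
```

The pair loop then reduces to dist[p][j] = dist[j][p] = euclidean(points[p], points[j]), with no string handling inside the parallel region.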
As already pointed out, using critical sections will slow things down, as only one thread is allowed in that section at a time. With the original std::map the protection was actually needed, because concurrent insertion into a map is not safe. Once dist is a preallocated array, as in the code above, there is no need for critical sections, because each thread writes to mutually exclusive elements; reading unmodified data obviously doesn't need protection.
My suspicion as to the slowness of the code comes down to uneven work distribution over the threads. By default I think openmp divides the iterations equally among threads. As an example, consider when you have 8 threads and 8 points:
-thread 0 will get 7 distance calculations
-thread 1 will get 6 distance calculations
...
-thread 7 will get 0 distance calculations
Even with more iterations, a similar inequality still exists. If you need to convince yourself, make a thread private counter to track how many distance calculations are actually done by each thread.
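The imbalance can also be worked out without running any threads. The sketch below assumes the common static default, in which each thread gets one contiguous block of about n/threads values of p (the exact default is implementation-defined); pairs_per_thread is a made-up helper name:

```cpp
#include <cassert>
#include <vector>

// Distance calculations per thread when rows 0..n-1 are split into
// contiguous blocks and row p has to handle n-1-p pairs (j = p+1 .. n-1).
std::vector<long long> pairs_per_thread(int n, int threads) {
    assert(n >= 0 && threads >= 1);
    const int block = (n + threads - 1) / threads;  // ceil(n / threads)
    std::vector<long long> work(threads, 0);
    for (int p = 0; p < n; ++p)
        work[p / block] += n - 1 - p;
    return work;
}
```

For 8 points and 8 threads this reproduces the 7, 6, ..., 0 split above. For 1000 points and 8 threads, the first thread gets 117125 pairs and the last one 7750, so the first thread has roughly 15 times as much work.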
With work-sharing constructs like parallel for, you can specify various work distribution strategies. In your case, probably best to go with
#pragma omp for schedule(guided)
When a thread requests iterations of the for loop, it gets roughly the number of remaining iterations (those not already given to a thread) divided by the number of threads. So initially you get big blocks, and later you get smaller blocks. It's a form of automatic load balancing, though there is some (probably small) overhead in dynamically allocating iterations to the threads.
To avoid the first thread getting an unfairly large amount of work, your looping structure should be changed so that lower iterations have fewer calculations, e.g. change the inner for loop to
for (j=0; j<p; j++)
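Putting the two suggestions together, a sketch might look like the following. It assumes the points have already been converted to numbers, and Points and distance_matrix are hypothetical names. The inner loop runs over j < p, so every pair is visited exactly once and the cheap rows come first, where guided scheduling hands out its largest blocks:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Points = std::vector<std::vector<double>>;

// Returns the full n x n matrix in row-major order.
std::vector<double> distance_matrix(const Points& pts) {
    const int n = static_cast<int>(pts.size());
    std::vector<double> dist(static_cast<std::size_t>(n) * n, 0.0);

    #pragma omp parallel for schedule(guided)
    for (int p = 0; p < n; ++p) {
        for (int j = 0; j < p; ++j) {  // lower triangle: row p costs p pairs
            assert(pts[p].size() == pts[j].size());
            double sum = 0.0;
            for (std::size_t k = 0; k < pts[p].size(); ++k) {
                const double diff = pts[p][k] - pts[j][k];
                sum += diff * diff;
            }
            const double d = std::sqrt(sum);
            // Each (p, j) pair is owned by exactly one iteration of p,
            // so these two writes need no critical section.
            dist[static_cast<std::size_t>(p) * n + j] = d;
            dist[static_cast<std::size_t>(j) * n + p] = d;
        }
    }
    return dist;
}
```

Whether guided beats schedule(dynamic) or a plain static schedule here depends on n and on the machine, so it is worth timing the alternatives.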
Another thing to consider is when working with a lot of cores, memory can become the bottleneck. You have 8 processors fighting for probably 2 or maybe 3 channels of DRAM (separate memory sticks on the same channel still compete for bandwidth). On-chip CPU cache is at best shared between all the processors, so you still have no more cache than the serial version of this program.