I am trying to improve the performance of a rather complex iterative algorithm by parallelizing the matrix multiplication that is called on each iteration.
The algorithm takes 500 iterations and approximately 10 seconds. But after parallelizing matrix multiplication it slows down to 13 seconds.
However, when I tested matrix multiplication of the same dimension alone, there was an increase in speed. (I am talking about 100x100 matrices.)
Finally, I switched off any parallelizing inside the algorithm and added on each iteration the following piece of code, which does absolutely nothing and presumably shouldn't take long:
int j;
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
And again, there is a 30% slowdown compared to the same algorithm without this piece of code.
Thus, calling any parallelization using OpenMP 500 times inside the main algorithm somehow slows things down. This behavior looks very strange to me. Does anybody have a clue what the problem is?
The main algorithm is being called by a desktop application, compiled by VS2010, Win32 Release.
I work on Intel Core i3 (parallelization creates 4 threads), 64 bit Windows 7.
Here is the structure of the program:
int internal_method(..)
{
...//no openmp here
// the following code does nothing, has nothing to do with the rest of the program and shouldn't take long,
// but somehow adding this code caused a 3 sec slowdown of Huge_algorithm()
double sum;
#pragma omp parallel for private(sum)
for (int i = 0; i < 10; i++)
sum = i*i*i / (1.0 + i*i*i*i);
...//no openmp here
}
int Huge_algorithm(..)
{
...//no openmp here
for (int i = 0; i < 500; i++)
{
.....// no openmp
internal_method(..);
......//no openmp
}
...//no openmp here
}
So, the final point is:
calling the parallel piece of code 500 times alone (when the rest of the algorithm is omitted) takes less than 0.01 sec, but calling it 500 times inside a huge algorithm causes a 3 sec delay of the entire algorithm.
What I don't understand is how the small parallel part can affect the rest of the algorithm.
For 10 iterations and a simple assignment, I guess there is too much OpenMP overhead compared to the computation itself. What looks lightweight here is actually managing and synchronizing multiple threads which may not even come from a thread pool. There might be some locking involved, and I don't know how good MSVC is at estimating whether to parallelize at all.
Try with bigger loop bodies or a bigger amount of iterations (say 1024*1024 iterations, just for starters).
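One way to act on this without removing the pragma is OpenMP's if() clause, which makes the region run on a single thread when the condition is false. Below is a minimal sketch; the function name sum_of_squares and the cut-off kParallelThreshold are made up for illustration, and a sensible threshold has to be measured on the actual machine.

```cpp
#include <cassert>

// Hypothetical cut-off (an assumption): below this trip count the loop is
// treated as too small to be worth a multi-threaded region.
const int kParallelThreshold = 100000;

// Sums i*i over [0, n). When the if() condition is false, the region is
// executed by one thread, so tiny loops avoid most of the fork/join cost.
long long sum_of_squares(int n)
{
    assert(n >= 0);
    long long total = 0;
    #pragma omp parallel for reduction(+:total) if(n >= kParallelThreshold)
    for (int i = 0; i < n; i++)
        total += static_cast<long long>(i) * i;
    return total;
}
```

The code also compiles without OpenMP enabled; the pragma is then ignored and the loop simply runs serially.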
Example OpenMP Magick:
#pragma omp parallel for private(j)
for (int i = 0; i < 10; i++)
j = i;
This might be approximately expanded by a compiler to:
const unsigned __cpu_count = __get_cpu_count();
unsigned *__j = alloca (sizeof (unsigned) * __cpu_count);
__thread *__threads = alloca (sizeof (__thread) * __cpu_count);
for (unsigned u=0; u!=__cpu_count; ++u) {
__init_thread (__threads+u);
__run_thread ([u]{for (int i=u; i<10; i+=__cpu_count)
__j[u] = i;}); // assume lambdas
}
for (unsigned u=0; u!=__cpu_count; ++u)
__join (__threads+u);
with __init_thread(), __run_thread() and __join() being non-trivial functions that invoke certain system calls.
In case thread pools are used, you would replace the first alloca() by something like __pick_from_pool() or so.
(Note that all of this, names and emitted code alike, is imaginary; an actual implementation will look different.)
Regarding your updated question:
You seem to be parallelizing at the wrong granularity. Put as much workload as possible in a thread, so instead of
for (...) {
#pragma omp parallel for ...
for (...) {}
}
try
#pragma omp parallel for ...
for (...) {
for (...) {}
}
Rule of thumb: Keep workloads big enough per thread so as to reduce relative overhead.
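Applied to the matrix multiplication from the question, this rule would mean one parallel region for the whole product, with each thread computing complete rows. The following is only a sketch under assumptions: the question does not show how the matrices are stored, so row-major std::vector<double> storage and the function name multiply are invented here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplies two n x n row-major matrices (assumed layout). The pragma sits
// on the outermost loop, so the threads are started once per product and
// each thread owns whole rows of c.
std::vector<double> multiply(const std::vector<double>& a,
                             const std::vector<double>& b, int n)
{
    assert(n >= 0);
    assert(a.size() == static_cast<std::size_t>(n) * n && b.size() == a.size());
    std::vector<double> c(a.size(), 0.0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            const double aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
    return c;
}
```

Whether this pays off for 100x100 matrices still has to be measured; the point is only that the pragma is on the loop with the most work per iteration.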
Maybe just j=i is not enough work to keep the cores busy. Maybe you should try a more demanding calculation (for example, taking i*i*i*i*i*i and dividing it by i+i+i).
Are you running this on a multi-core CPU or a GPU?
I have a C++ code that performs a time evolution of four variables that live on a 2D spatial grid. To save some time, I tried to parallelise my code with OpenMP but I just cannot get it to work: No matter how many cores I use, the runtime stays basically the same or increases. (My code does use 24 cores or however many I specify, so the compilation is not a problem.)
I have the feeling that the runtime for one individual time-step is too short and the overhead of producing threads kills the potential speed-up.
The layout of my code is:
for (int t = 0; t < max_time_steps; t++) {
// do some book-keeping
...
// perform time step
// (1) calculate righthand-side of ODE:
for (int i = 0; i < nr; i++) {
for (int j = 0; j < ntheta; j++) {
rhs[0][i][j] = A0[i][j] + B0[i][j] + ...;
rhs[1][i][j] = A1[i][j] + B1[i][j] + ...;
rhs[2][i][j] = A2[i][j] + B2[i][j] + ...;
rhs[3][i][j] = A3[i][j] + B3[i][j] + ...;
}
}
// (2) perform Euler step (or Runge-Kutta, ...)
for (int d = 0; d < 4; d++) {
for (int i = 0; i < nr; i++) {
for (int j = 0; j < ntheta; j++) {
next[d][i][j] = current[d][i][j] + time_step * rhs[d][i][j];
}
}
}
}
I thought this code should be fairly easy to parallelise... I put "#pragma omp parallel for" in front of the (1) and (2) loops, and I also specified the number of cores (e.g. 4 cores for loop (2) since there are four variables) but there is simply no speed-up whatsoever.
I have found that OpenMP is fairly smart about when to create/destroy the threads. I.e. it realises that threads are required again soon, and then they're only put to sleep to save overhead time.
I think one "problem" is that my time step is coded in a subroutine (I'm using RK4 instead of Euler) and the computation of the righthand-side is again in another subroutine that is called by the time_step() function. So, I believe that due to this, OpenMP cannot see that the threads should be kept open for longer and hence the threads are created and destroyed at every time step.
Would it be helpful to put a "#pragma omp parallel" in front of the time-loop so that the threads are created at the very beginning? And then do the actual parallelisation for the righthand-side (1) and the Euler step (2)? But how do I do that?
I have found numerous examples for how to parallelise nested for loops, but none of them were concerned with the setup where the inner loops have been moved out to separate modules. Would this be an obstacle for parallelising?
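For what it's worth, OpenMP does allow a work-sharing "#pragma omp for" inside a function that is called from a parallel region (a so-called orphaned directive), so subroutines are not an obstacle in themselves. A minimal sketch of that layout follows; the right-hand side du/dt = -u and the names compute_rhs, euler_step and integrate are placeholders, not the real code.

```cpp
#include <cassert>
#include <vector>

// Placeholder right-hand side: du/dt = -u. The "omp for" has no enclosing
// "parallel" in this function; it binds to the parallel region of the caller.
void compute_rhs(const std::vector<double>& u, std::vector<double>& rhs)
{
    const int n = static_cast<int>(u.size());
    #pragma omp for
    for (int i = 0; i < n; i++)
        rhs[i] = -u[i];
}   // implicit barrier here: rhs is complete before any thread continues

void euler_step(std::vector<double>& u, const std::vector<double>& rhs, double dt)
{
    const int n = static_cast<int>(u.size());
    #pragma omp for
    for (int i = 0; i < n; i++)
        u[i] += dt * rhs[i];
}   // implicit barrier here as well

// Runs `steps` Euler steps inside one parallel region.
void integrate(std::vector<double>& u, int steps, double dt)
{
    assert(steps >= 0);
    std::vector<double> rhs(u.size());
    #pragma omp parallel
    {
        // Every thread executes the time loop; t is private to each thread.
        for (int t = 0; t < steps; t++) {
            compute_rhs(u, rhs);
            euler_step(u, rhs, dt);
        }
    }
}
```

Book-keeping that must run only once per time step could go into a "#pragma omp single" block inside the time loop. Whether this layout is actually faster than repeated "parallel for" regions depends on the runtime and has to be measured.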
I have now removed the d loops (by making the indices explicit) and collapsed the i and j loops (by running over the entire 2D array with one variable only).
The code looks like:
for (int t = 0; t < max_time_steps; t++) {
// do some book-keeping
...
// perform time step
// (1) calculate righthand-side of ODE:
#pragma omp parallel for
for (int i = 0; i < nr*ntheta; i++) {
rhs[0][0][i] = A0[0][i] + B0[0][i] + ...;
rhs[1][0][i] = A1[0][i] + B1[0][i] + ...;
rhs[2][0][i] = A2[0][i] + B2[0][i] + ...;
rhs[3][0][i] = A3[0][i] + B3[0][i] + ...;
}
// (2) perform Euler step (or Runge-Kutta, ...)
#pragma omp parallel for
for (int i = 0; i < nr*ntheta; i++) {
next[0][0][i] = current[0][0][i] + time_step * rhs[0][0][i];
next[1][0][i] = current[1][0][i] + time_step * rhs[1][0][i];
next[2][0][i] = current[2][0][i] + time_step * rhs[2][0][i];
next[3][0][i] = current[3][0][i] + time_step * rhs[3][0][i];
}
}
The size of nr*ntheta is 400*40=16000 and I make max_time_steps=1000 time steps. Still, the parallelisation does not result in a speed-up:
Runtime without OpenMP (result of time on the command line):
real 0m23.597s
user 0m23.496s
sys 0m0.076s
Runtime with OpenMP (24 cores)
real 0m23.162s
user 7m47.026s
sys 0m0.905s
I do not understand what's happening here.
One peculiarity that I don't show in my code snippet above is that my variables are not actually doubles but a self-defined struct of two doubles which represent the real and imaginary parts. But I think this should not make a difference.
Just wanted to report some success after I left the parallelisation alone for a while. The code evolved for a year and now I went back to parallelisation. This time, I can say that OpenMP does its job and reduces the required walltime.
While the code evolved overall, this particular loop that I've shown above did not really change; merely two things: a) The resolution is higher so that it covers about 10 times as many points and b) the number of calculations per loop also is about 10-fold (maybe even more).
My only explanation why it works now and didn't work a little over a year ago, is that, when I tried to parallelise the code last time, it wasn't computationally expensive enough and the speed-up was killed by the OpenMP overhead. One single loop now requires about 200-300ms whereas that time required must have been in the single digit ms last time.
I can see such an effect when comparing gcc and the Intel compiler (which do very different jobs when vectorizing):
a) Using gcc, one loop needs about 300ms without OpenMP, and on two cores only 52% of the time is required --> near perfect optimization.
b) Using icpc, one loop needs about 160ms without OpenMP, and on two cores it needs 60% of the time --> good optimization but about 20% less effective.
When going for more than two cores, the speed-up is not large enough to make it worthwhile.
I am trying to add OpenMP parallelization to quite a big project and I found out that OpenMP does too much synchronization outside the parallel blocks.
This synchronization is done for all of the variables, even those not used in the parallel block and it is done continuously, not only before entering the block.
I made an example proving this:
#include <cmath>
int main()
{
double dummy1 = 1.234;
int const size = 1000000;
int const size1 = 2500;
int const size2 = 500;
for(unsigned int i=0; i<size; ++i){
//for (unsigned int j=0; j<size1; j++){
// dummy1 = pow(dummy1/2 + 1, 1.5);
//}
#pragma omp parallel for
for (unsigned int j=0; j<size2; j++){
double dummy2 = 2.345;
dummy2 = pow(dummy2/2 + 1, 1.5);
}
}
}
If I run this code (with the for cycle commented), the runtimes are 6.75s with parallelization and 30.6s without. Great.
But if I uncomment the for cycle and run it again, the excessive synchronization kicks in and I get results 67.9s with parallelization and 73s without. If I increase size1 I even get slower results with parallelization than without it.
Is there a way to disable this synchronization and force it only before the second for cycle? Or any other way how to improve the speed?
Note that neither the outer loop nor the first inner for cycle is parallelizable in the real example. The outer one is in fact an ODE solver and the first inner one updates loads of internal values.
I am using gcc (SUSE Linux) 4.8.5
Thanks for your answers.
In the end the solution for my problem was specifying number of threads = number of processor cores. It seems the hyperthreading was causing the problems. So using (my processor has 4 real cores)
#pragma omp parallel for num_threads(4)
I get times 8.7s without the first for loop and 51.9s with it. There is still about 1.2s overhead, but that is acceptable. Using default (8 threads)
#pragma omp parallel for
the times are 6.65s and 68s. Here the overhead is about 19s.
So hyperthreading helps if no other code is present, but when there is other code, it might not always be a good idea to use it.
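To make the team size a run-time parameter rather than a hard-coded 4, the num_threads clause also accepts an expression. A rough sketch, assuming the caller knows the number of physical cores; the function name repeated_pow_sum and its parameters are made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Repeats a small parallel loop with an explicit team size. `threads` is
// supplied by the caller (an assumption: e.g. the number of physical cores).
double repeated_pow_sum(int repeats, int inner, int threads)
{
    assert(repeats >= 0 && inner >= 0 && threads >= 1);
    double total = 0.0;
    for (int r = 0; r < repeats; r++) {
        double partial = 0.0;
        #pragma omp parallel for num_threads(threads) reduction(+:partial)
        for (int j = 0; j < inner; j++)
            partial += std::pow(2.345 / 2 + 1, 1.5);
        total += partial;
    }
    return total;
}
```

Setting the OMP_NUM_THREADS environment variable before starting the program is another standard way to limit the team size without recompiling.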
I am trying out OpenMP and getting strange results.
The parallel "for" runs faster with OpenMP, as expected. But the serial "for" runs much faster when OpenMP is disabled (without the /openmp option, VS 2013).
Test code
const int n = 5000;
const int m = 2000000;
vector <double> a(n, 0);
double start = omp_get_wtime();
#pragma omp parallel for shared(a)
for (int i = 0; i < n; i++)
{
double StartVal = i;
for (int j = 0; j < m; ++j)
{
a[i] = (StartVal + log(exp(exp((double)i))));
}
}
cout << "omp Time: " << (omp_get_wtime() - start) << endl;
start = omp_get_wtime();
for (int i = 0; i < n; i++)
{
double StartVal = i;
for (int j = 0; j < m; ++j)
{
a[i] = (StartVal + log(exp(exp((double)i))));
}
}
cout << "serial Time: " << (omp_get_wtime() - start) << endl;
Output without /openmp option
0
omp Time: 6.4389
serial Time: 6.37592
Output with /openmp option
0
1
2
3
omp Time: 1.84636
serial Time: 16.353
Are these results correct? Or am I doing something wrong?
I believe part of the answer lies hidden in the architecture of the computer you run on. I tried running the same code on another machine (GCC 4.8 on GNU+Linux, quad Core2 CPU), and over many runs, found a slightly odd thing: while the time for both loops varied, and OpenMP with many threads always ran faster, the second loop never ran significantly faster than the first, even without OpenMP.
The next step was to try to eliminate a dependency between the loops, allocating a second vector for the second loop. It still ran no faster than the first. So I tried reversing them, running the OpenMP loop after the serial one; and while it still ran fast when multithreaded, it would now see delays when the first loop didn't. It's looking more like an operating system behaviour at this point; long-lived threads simply seem more likely to get interrupted. I had taken some measures to reduce interruptions (niceness -15, specific cpu set) but this is not a system dedicated to benchmarking.
None of my results were anywhere near as extreme as yours, however. My first guess as to what caused your large difference was that you reused the same array and ran the parallel loop first. This would distribute the array into caches on all cores, causing a slight dilemma of whether to migrate the thread to the data or the other way around; and OpenMP may have chosen any distribution, including iteration i to thread i%threads (as with schedule(static,1)), which probably would hurt multithreaded runtime, or one cacheline each, which would hurt later single-threaded reading if it fit in per-core caches. However, all of the array accesses are writes, so the processor shouldn't need to wait for them in the first place.
In summary, your results are certainly platform dependent and unexpected. I would suggest rerunning the test with swapped order, the two loops operating on different arrays, and placed in different compilation units, and of course to verify the written results. It is possible you've found a flaw in your compiler.
I am trying to parallelize a code for particle-based simulations and experiencing poor performance of an OpenMP based approach. By that I mean:
According to the Linux tool top, the CPUs running OpenMP threads have an average usage of about 50%.
With increasing number of threads, speed up converges to a factor of about 1.6. Convergence is quite fast, i.e. I reach a speed up of 1.5 using 2 threads.
The following pseudo code illustrates the basic template for all parallel regions implemented.
Note that during a single time step, 5 parallel regions of the kind shown below are executed. Basically, the force acting on a particle i < N is a function of several field properties of neighboring particles j < NN(i).
omp_set_num_threads(ncpu);
#pragma omp parallel shared( quite_a_large_amount_of_readonly_data, force )
{
int i,j,N,NN;
#pragma omp for
for( i=0; i<N; i++ ){ // Looping over all particles
for ( j=0; j<NN(i); j++ ){ // Nested loop over all neighbors of i
// No communication between threads, atomic regions,
// barriers whatsoever.
force[i] += function(j);
}
}
}
I am trying to sort out the cause for the observed bottleneck. My naive initial guess for an explanation:
As stated, there is a large amount of memory being shared between threads for read-only access. It is quite possible that different threads try to read the same memory location at the same time. Is this causing a bottleneck? Should I rather let OpenMP allocate private copies?
How large is N, and how intensive is NN(i)?
You say nothing is shared, but force[i] is probably within the same cache line as force[i+1]. This is what's known as false sharing and can be pretty detrimental. OpenMP should batch things together to compensate for this, so with a large enough N I don't think this would be your problem.
If NN(i) isn't very CPU intensive, you might have a simple memory bottleneck -- in which case throwing more cores at it won't solve anything.
Assuming that force[i] is plain array of 4 or 8 byte data, you definitely have false sharing, no doubt about it.
Assuming that function(j) is independently calculated, you may want to do something like this:
for( i=0; i<N; i+=STEP ){ // Looping over all particles
for ( j=0; j<NN(i); j+=STEP ){ // Nested loop over all neighbors of i
// No communication between threads, atomic regions,
// barriers whatsoever.
calc_next(i, j);
}
}
void calc_next(int i, int j)
{
int ii, jj;
for(ii = 0; ii < STEP; ii++)
{
for(jj = 0; jj < STEP; jj++)
{
force[i+ii] += function(j+jj);
}
}
}
That way, you calculate a bunch of things on one thread, and a bunch of things on the next thread, and each bunch is far enough apart that you don't get false sharing.
If you can't do it this way, try to split it up in some other way that leads to larger sections being calculated each time.
As the others stated, false sharing on force could be a reason. Try this simple change:
#pragma omp for
for( i=0; i<N; i++ ){
int sum = force[i];
for ( j=0; j<NN(i); j++ ){
sum += function(j);
}
force[i] = sum;
}
Technically, it's possible that force[i] = sum still causes false sharing. But that's highly unlikely, because with a static schedule (the usual default) the other threads work on indices that are roughly N/omp_get_num_threads() or more away from i, which is pretty far from force[i].
If scalability is still poor, try a profiler such as Intel Parallel Amplifier (or VTune) to see how much memory bandwidth is needed per thread. If bandwidth turns out to be the limit, put some more DRAMs in your computer :) That will really boost memory bandwidth.
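For reference, here is a self-contained version of the local-accumulator idea that can be compiled and timed. Everything specific in it is assumed: pair_force is a stand-in for the real function(j), and the neighbor lists are stored as a vector of index vectors, which the question does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the pair interaction; the real function is not shown.
static double pair_force(int i, int j) { return 0.5 * (j - i); }

// neighbors[i] lists the neighbor indices of particle i (assumed layout).
std::vector<double> accumulate_forces(const std::vector<std::vector<int> >& neighbors)
{
    const int n = static_cast<int>(neighbors.size());
    std::vector<double> force(neighbors.size(), 0.0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double sum = 0.0;                  // private to the thread running iteration i
        for (std::size_t k = 0; k < neighbors[i].size(); k++) {
            const int j = neighbors[i][k];
            assert(j >= 0 && j < n);       // neighbor index must be a valid particle
            sum += pair_force(i, j);
        }
        force[i] = sum;                    // a single write to shared memory per particle
    }
    return force;
}
```

The shared force array is written once per particle instead of once per neighbor, which is the whole point of the change.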
I'm trying to compute a distance matrix in parallel using OpenMP: I calculate the distance between each point and all the other points, so the best algorithm I have come up with so far costs O(n^2). The OpenMP version, using 10 threads on an 8-processor machine, isn't any faster than the serial approach. This is my first time using OpenMP, so if there is a mistake in my approach, or a better (faster) approach, please let me know. The following is my code, where "dat" is a vector that contains the data points.
map <int, map< int, double> > dist; //construct the distance matrix
int c=count(dat.at(0).begin(),dat.at(0).end(),delm)+1;
#pragma omp parallel for shared (c,dist)
for(int p=0;p<dat.size();p++)
{
for(int j=p+1;j<dat.size();j++)
{
double ecl=0;
string line1=dat.at(p);
string line2=dat.at(j);
for (int i=0;i<c;i++)
{
double num1=atof(line1.substr(0,line1.find_first_of(delm)).c_str());
line1=line1.substr(line1.find_first_of(delm)+1).c_str();
double num2=atof(line2.substr(0,line2.find_first_of(delm)).c_str());
line2=line2.substr(line2.find_first_of(delm)+1).c_str();
ecl += (num1-num2)*(num1-num2);
}
ecl=sqrt(ecl);
#pragma omp critical
{
dist[p][j]=ecl;
dist[j][p]=ecl;
}
}
}
#pragma omp critical has the effect of serializing your loop so getting rid of that should be your first goal. This should be a step in the right direction:
ptrdiff_t const c = count(dat[0].begin(), dat[0].end(), delm) + 1;
vector<vector<double> > dist(dat.size(), vector<double>(dat.size()));
#pragma omp parallel for
for (size_t p = 0; p < dat.size(); ++p) // OpenMP before 5.0 does not accept != in the loop test
{
for (size_t j = p + 1; j != dat.size(); ++j)
{
double ecl = 0.0;
string line1 = dat[p];
string line2 = dat[j];
for (ptrdiff_t i = 0; i != c; ++i)
{
double const num1 = atof(line1.substr(0, line1.find_first_of(delm)).c_str());
double const num2 = atof(line2.substr(0, line2.find_first_of(delm)).c_str());
line1 = line1.substr(line1.find_first_of(delm) + 1);
line2 = line2.substr(line2.find_first_of(delm) + 1);
ecl += (num1 - num2) * (num1 - num2);
}
ecl = sqrt(ecl);
dist[p][j] = ecl;
dist[j][p] = ecl;
}
}
There are a few other obvious things that could be done to make this faster overall, but fixing your parallelization is the most important thing.
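One of those things is presumably the string handling: every pair re-parses both lines with substr and atof. A sketch of parsing each line once up front follows, assuming delm is a single char (as the count() call suggests); the helper names parse_point and euclidean are invented for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Splits one delimiter-separated line into its numeric coordinates.
std::vector<double> parse_point(const std::string& line, char delm)
{
    std::vector<double> coords;
    std::istringstream fields(line);
    std::string field;
    while (std::getline(fields, field, delm))
        coords.push_back(std::atof(field.c_str()));
    return coords;
}

// Euclidean distance between two already-parsed points.
double euclidean(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); i++)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}
```

With the points parsed once into a vector of coordinate vectors, the double loop only has to call euclidean(points[p], points[j]), and no strings are copied inside the parallel region.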
As already pointed out, using critical sections will slow things down, as only 1 thread is allowed in that section at a time. Once dist is a preallocated vector as above, there is absolutely no need for critical sections, because each thread writes to mutually exclusive sections of data, and reading unmodified data obviously doesn't need protection. (Concurrent insertion into the original std::map would not be safe without protection.)
My suspicion as to the slowness of the code comes down to uneven work distribution over the threads. By default I think openmp divides the iterations equally among threads. As an example, consider when you have 8 threads and 8 points:
-thread 0 will get 7 distance calculations
-thread 1 will get 6 distance calculations
...
-thread 7 will get 0 distance calculations
Even with more iterations, a similar inequality still exists. If you need to convince yourself, make a thread private counter to track how many distance calculations are actually done by each thread.
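Such a counter could look roughly like this. The name pairs_per_thread and the 64-byte padding (a typical cache line size, assumed here) are illustrative; the loop structure mirrors the triangular p/j loops from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// One counter per thread, padded to an assumed 64-byte cache line so that
// the counters of different threads never share a line.
struct PaddedCounter {
    long long value;
    char pad[64 - sizeof(long long)];
};

// Returns how many (p, j) pairs with j > p each thread handled for n points.
std::vector<long long> pairs_per_thread(int n)
{
    assert(n >= 0);
    int max_threads = 1;
#ifdef _OPENMP
    max_threads = omp_get_max_threads();
#endif
    std::vector<PaddedCounter> counters(max_threads);   // value-initialized to zero
    #pragma omp parallel for
    for (int p = 0; p < n; p++) {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        for (int j = p + 1; j < n; j++)
            counters[tid].value++;                       // each thread touches only its own slot
    }
    std::vector<long long> result(max_threads);
    for (int t = 0; t < max_threads; t++)
        result[t] = counters[t].value;
    return result;
}
```

Printing the returned vector for a realistic n shows directly how uneven the split is under the schedule in use.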
With work-sharing constructs like parallel for, you can specify various work distribution strategies. In your case, probably best to go with
#pragma omp for schedule(guided)
When a thread requests iterations of the for loop, it gets roughly the number of remaining iterations (those not already given to a thread) divided by the number of threads. So initially you get big blocks, and later you get smaller ones. It's a form of automatic load balancing, though there's some (probably small) overhead in dynamically allocating iterations to the threads.
To avoid the first thread getting an unfairly large amount of work, your looping structure should be changed so that lower iterations have fewer calculations, e.g. change the inner for loop to
for (j=0; j<p; j++)
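Put together, the guided schedule and the reversed triangular loop might look like the sketch below. It assumes the points have already been parsed into coordinate vectors and stores the matrix as a flat row-major vector; both choices, and the name distance_matrix, are assumptions rather than the original code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fills a flat n x n distance matrix from already-parsed points.
std::vector<double> distance_matrix(const std::vector<std::vector<double> >& pts)
{
    const int n = static_cast<int>(pts.size());
    std::vector<double> dist(pts.size() * pts.size(), 0.0);
    // Early (large) chunks contain the cheap rows, late (small) chunks the
    // expensive ones, because row p needs p distance evaluations.
    #pragma omp parallel for schedule(guided)
    for (int p = 0; p < n; p++) {
        for (int j = 0; j < p; j++) {
            assert(pts[p].size() == pts[j].size());
            double sum = 0.0;
            for (std::size_t k = 0; k < pts[p].size(); k++) {
                const double d = pts[p][k] - pts[j][k];
                sum += d * d;
            }
            const double ecl = std::sqrt(sum);
            dist[p * n + j] = ecl;   // each (p, j) pair is written by exactly one thread
            dist[j * n + p] = ecl;
        }
    }
    return dist;
}
```

No critical section is needed here because the matrix is allocated before the parallel region and every element is written by only one iteration of the outer loop.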
Another thing to consider is when working with a lot of cores, memory can become the bottleneck. You have 8 processors fighting for probably 2 or maybe 3 channels of DRAM (separate memory sticks on the same channel still compete for bandwidth). On-chip CPU cache is at best shared between all the processors, so you still have no more cache than the serial version of this program.