Recently I started using OpenMP. I am doing a numerical calculation involving 3D matrices, created in C++ as nested vectors, and I used parallel for loops to speed up the code. But it runs slower than the serial code. I compile the code using Code::Blocks on Windows 7. The code is something like this:
int main(){
    vector<vector<vector<float> > > Dx;
    /* create 3d array Dx[IE][JE][KE] as vectors */
    Dx.resize(IE);
    for (int i = 0; i < IE; ++i) {
        Dx[i].resize(JE);
        for (int j = 0; j < JE; ++j){
            Dx[i][j].resize(KE);
        }
    }
    // declare and initialize more matrices like this
    .
    .
    .
    double wtime = omp_get_wtime(); // start time
    // matrix calculations using a parallel for loop
    #pragma omp parallel for
    for (int i=1; i < IE; ++i ) {
        for (int j=1; j < JE; ++j ) {
            for (int k=1; k < KE; ++k ) {
                curl_h = ( Hz[i][j][k] - Hz[i][j-1][k] - Hy[i][j][k] + Hy[i][j][k-1]);
                idxl[i][j][k] = idxl[i][j][k] + curl_h;
                Dx[i][j][k] = gj3[j]*gk3[k]*dx[i][j][k]
                            + gj2[j]*gk2[k]*.5*(curl_h + gi1[i]*idxl[i][j][k]);
            }
        }
    }
    wtime = omp_get_wtime() - wtime; // elapsed time
}
But the code with parallel loops runs slower than the serial code. Any ideas?
Thanks.
The loop uses the variable curl_h, which is not declared as thread private. This is both a bug, and also the reason for your perceived performance problem:
As there is only one place in memory where curl_h is stored, all threads constantly and concurrently try to read and write it. One CPU core will load the value into its cache, the next one will issue a write to it, invalidating the cache of the first CPU, which will again grab the cacheline when it itself tries to use curl_h (read or write, both will require the cacheline to be in the local cache).
The point is, that the fierce pretense put up by the hardware that there is only one memory location called curl_h demands its tribute. You get a huge amount of chatter in the cache coherency protocol, and keep your memory buses busy with constantly refetching the same cacheline from memory. All your threads are really doing is fighting over that one cacheline.
Of course, the constant races between the threads are a serious bug as well, since no thread can be certain that the value it is currently using is actually the one it calculated in the statement above.
So, just make curl_h private to each thread, either with a private(curl_h) clause on your omp parallel for statement or by declaring the variable inside the loop body, and you'll fix both the bug and the performance issue.
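A minimal sketch of that fix, assuming the grids are nested std::vector<float> as in the question. The helper make_grid and the function update_idxl are hypothetical names introduced only for this illustration, and only the idxl update is shown. Declaring curl_h inside the loop body has the same effect as a private(curl_h) clause: each thread works on its own copy.

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<std::vector<float>>>;

// Hypothetical helper: allocates an ni x nj x nk grid filled with `value`.
Grid make_grid(int ni, int nj, int nk, float value) {
    return Grid(ni, std::vector<std::vector<float>>(nj, std::vector<float>(nk, value)));
}

// Adds the curl term to idxl. curl_h is declared inside the loop body,
// so it is private to each thread and no cache line is fought over.
void update_idxl(const Grid& Hz, const Grid& Hy, Grid& idxl) {
    assert(!idxl.empty() && Hz.size() == idxl.size() && Hy.size() == idxl.size());
    const int IE = static_cast<int>(idxl.size());
    const int JE = static_cast<int>(idxl[0].size());
    const int KE = static_cast<int>(idxl[0][0].size());

    #pragma omp parallel for
    for (int i = 1; i < IE; ++i) {
        for (int j = 1; j < JE; ++j) {
            for (int k = 1; k < KE; ++k) {
                const float curl_h = Hz[i][j][k] - Hz[i][j-1][k]
                                   - Hy[i][j][k] + Hy[i][j][k-1];
                idxl[i][j][k] += curl_h;
            }
        }
    }
}
```

Each value of i is handled by exactly one thread, so the writes to idxl[i][j][k] do not overlap between threads.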
Related
I have a C++ code that performs a time evolution of four variables that live on a 2D spatial grid. To save some time, I tried to parallelise my code with OpenMP but I just cannot get it to work: No matter how many cores I use, the runtime stays basically the same or increases. (My code does use 24 cores or however many I specify, so the compilation is not a problem.)
I have the feeling that the runtime for one individual time-step is too short and the overhead of producing threads kills the potential speed-up.
The layout of my code is:
for (int t = 0; t < max_time_steps; t++) {
    // do some book-keeping
    ...
    // perform time step
    // (1) calculate righthand-side of ODE:
    for (int i = 0; i < nr; i++) {
        for (int j = 0; j < ntheta; j++) {
            rhs[0][i][j] = A0[i][j] + B0[i][j] + ...;
            rhs[1][i][j] = A1[i][j] + B1[i][j] + ...;
            rhs[2][i][j] = A2[i][j] + B2[i][j] + ...;
            rhs[3][i][j] = A3[i][j] + B3[i][j] + ...;
        }
    }
    // (2) perform Euler step (or Runge-Kutta, ...)
    for (int d = 0; d < 4; d++) {
        for (int i = 0; i < nr; i++) {
            for (int j = 0; j < ntheta; j++) {
                next[d][i][j] = current[d][i][j] + time_step * rhs[d][i][j];
            }
        }
    }
}
I thought this code should be fairly easy to parallelise... I put "#pragma omp parallel for" in front of the (1) and (2) loops, and I also specified the number of cores (e.g. 4 cores for loop (2) since there are four variables) but there is simply no speed-up whatsoever.
I have found that OpenMP is fairly smart about when to create/destroy the threads. I.e. it realises that threads will be required again soon, so they are only put to sleep to save overhead time.
I think one "problem" is that my time step is coded in a subroutine (I'm using RK4 instead of Euler) and the computation of the righthand-side is again in another subroutine that is called by the time_step() function. So, I believe that due to this, OpenMP cannot see that the threads should be kept open for longer and hence the threads are created and destroyed at every time step.
Would it be helpful to put a "#pragma omp parallel" in front of the time-loop so that the threads are created at the very beginning? And then do the actual parallelisation for the righthand-side (1) and the Euler step (2)? But how do I do that?
I have found numerous examples of how to parallelise nested for loops, but none of them were concerned with a setup where the inner loops have been moved out into separate modules. Would this be an obstacle for parallelising?
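One way this could look is sketched below, under simplifying assumptions: a flat array of doubles instead of the four 2D fields, a made-up right-hand side (rhs = -current), and an in-place Euler update. The function names compute_rhs, euler_step and evolve are hypothetical. The point it illustrates is that an "omp for" inside a called function (an orphaned worksharing construct) binds to whatever parallel region is active in the caller, so splitting the code across subroutines is not an obstacle by itself.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the right-hand side; the real terms are elided in the post.
void compute_rhs(const std::vector<double>& current, std::vector<double>& rhs) {
    const int n = static_cast<int>(current.size());
    #pragma omp for  // orphaned: shares the work among the caller's thread team
    for (int i = 0; i < n; ++i) {
        rhs[i] = -current[i];
    }  // implicit barrier: rhs is complete before any thread continues
}

// Explicit Euler update, done in place for brevity.
void euler_step(std::vector<double>& current, const std::vector<double>& rhs, double dt) {
    const int n = static_cast<int>(current.size());
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        current[i] += dt * rhs[i];
    }
}

void evolve(std::vector<double>& current, int max_time_steps, double dt) {
    assert(max_time_steps >= 0);
    std::vector<double> rhs(current.size());  // shared by all threads
    #pragma omp parallel  // the thread team is set up once, before the time loop
    {
        for (int t = 0; t < max_time_steps; ++t) {  // t is private to each thread
            #pragma omp single
            {
                // book-keeping goes here; executed by one thread only,
                // with an implicit barrier at the end of the block
            }
            compute_rhs(current, rhs);
            euler_step(current, rhs, dt);
        }
    }
}
```

Whether this beats two separate "parallel for" loops depends on the runtime. As noted above, many implementations keep their threads alive between regions, so the gain may be small when the work per time step is tiny.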
I have now removed the d loops (by making the indices explicit) and collapsed the i and j loops (by running over the entire 2D array with one variable only).
The code looks like:
for (int t = 0; t < max_time_steps; t++) {
    // do some book-keeping
    ...
    // perform time step
    // (1) calculate righthand-side of ODE:
    #pragma omp parallel for
    for (int i = 0; i < nr*ntheta; i++) {
        rhs[0][0][i] = A0[0][i] + B0[0][i] + ...;
        rhs[1][0][i] = A1[0][i] + B1[0][i] + ...;
        rhs[2][0][i] = A2[0][i] + B2[0][i] + ...;
        rhs[3][0][i] = A3[0][i] + B3[0][i] + ...;
    }
    // (2) perform Euler step (or Runge-Kutta, ...)
    #pragma omp parallel for
    for (int i = 0; i < nr*ntheta; i++) {
        next[0][0][i] = current[0][0][i] + time_step * rhs[0][0][i];
        next[1][0][i] = current[1][0][i] + time_step * rhs[1][0][i];
        next[2][0][i] = current[2][0][i] + time_step * rhs[2][0][i];
        next[3][0][i] = current[3][0][i] + time_step * rhs[3][0][i];
    }
}
The size of nr*ntheta is 400*40=16000 and I run max_time_steps=1000 time steps. Still, the parallelisation does not result in a speed-up:
Runtime without OpenMP (result of time on the command line):
real 0m23.597s
user 0m23.496s
sys 0m0.076s
Runtime with OpenMP (24 cores)
real 0m23.162s
user 7m47.026s
sys 0m0.905s
I do not understand what's happening here.
One peculiarity that I don't show in my code snippet above is that my variables are not actually doubles but a self-defined struct of two doubles which represent the real and imaginary parts. But I think this should not make a difference.
Just wanted to report some success after I left the parallelisation alone for a while. The code evolved for a year and now I went back to parallelisation. This time, I can say that OpenMP does its job and reduces the required walltime.
While the code evolved overall, the particular loop that I've shown above did not really change, except for two things: a) the resolution is higher, so that it covers about 10 times as many points, and b) the number of calculations per loop iteration is also about 10 times larger (maybe even more).
My only explanation why it works now and didn't work a little over a year ago, is that, when I tried to parallelise the code last time, it wasn't computationally expensive enough and the speed-up was killed by the OpenMP overhead. One single loop now requires about 200-300ms whereas that time required must have been in the single digit ms last time.
I can see a similar effect when comparing gcc and the Intel compiler (which vectorize very differently):
a) Using gcc, one loop needs about 300ms without OpenMP, and on two cores only 52% of the time is required --> near-perfect scaling.
b) Using icpc, one loop needs about 160ms without OpenMP, and on two cores it needs 60% of the time --> good scaling but about 20% less effective.
When going for more than two cores, the speed-up is not large enough to make it worthwhile.
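If the amount of work per loop varies between runs, OpenMP's if clause offers one way to skip the parallel overhead for small problems. The sketch below is only an illustration: the function axpy and the threshold kMinParallelSize are hypothetical, and a sensible threshold has to be found by measurement on the target machine.

```cpp
#include <cassert>
#include <vector>

// Hypothetical threshold; tune it by timing the loop on the target machine.
constexpr int kMinParallelSize = 10000;

// Computes y[i] += a * x[i] for all i.
void axpy(std::vector<double>& y, const std::vector<double>& x, double a) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(y.size());
    // When the condition is false, the region runs on a single thread,
    // so short loops do not pay for distributing work that is too small.
    #pragma omp parallel for if(n >= kMinParallelSize)
    for (int i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```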
Currently, somewhere deep in my code, I am working with a nested for-loop (N1 = ~10000, N2 = ~500, x,y = 10-50). I used #pragma omp parallel for to have OpenMP distribute my calculation over several cores.
#pragma omp parallel for
for (int i = 0; i < N1; ++i)
{
    for (int j = 0; j < N2; ++j)
    {
        for (int k = x; k <= y; ++k)
        {
            // calculation
        }
    }
}
Now, my two inner loops become conditional:
#pragma omp parallel for
for (int i = 0; i < N1; ++i)
{
    if (toExecute[i])
    {
        for (int j = 0; j < N2; ++j)
        {
            for (int k = x; k <= y; ++k)
            {
                // calculation
            }
        }
    }
}
The inner nested loop either takes a long time or is done immediately. Of course I can get rid of the if-statement by replacing the outer loop and the if-statement with a shorter loop plus a lookup table for the indexing.
My question is: Is OpenMP smart enough to handle the if-statement within my outer loop, or do I have to do something manually?
I am currently using C++ in Visual Studio 2017 if that matters (I think the OpenMP version is a bit behind).
Ideally, you should let OpenMP handle that for you. But as always with performance work, you have to try things out to see what works best for you. Indeed, you can gain a great speedup by doing things manually. OpenMP is not omniscient; it does not know the details of your calculation.
If every executed iteration of your calculation involves the same amount of work, then your condition is likely to produce an uneven workload across the iterations of the outermost loop. So, theoretically, dynamic scheduling should be a better fit:
#pragma omp parallel for schedule(dynamic)
You could also try static or guided scheduling, which might fit your calculation (I don't know its details so I cannot say), and play with the chunk size.
Another test, if you can afford it (i.e. if the inner loops are parallelizable), is to move the parallelization into the inner loops.
You can even nest the parallelization; it sometimes gives a nice speedup. Try and tune step by step, and take the time to see what gives you the best result. Just a reminder: these tweaks often do not carry over across different architectures, so aim for a good tradeoff between performance and code reusability.
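The manual variant mentioned in the question (a shorter loop plus an index lookup) can be combined with dynamic scheduling. The following is a sketch under assumptions: heavy_work is a hypothetical stand-in for the real calculation, and run_selected is an illustrative name. The loop counter is a signed int, which older OpenMP versions such as the one shipped with Visual Studio require.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the expensive inner calculation.
double heavy_work(int i, int N2, int x, int y) {
    double acc = 0.0;
    for (int j = 0; j < N2; ++j)
        for (int k = x; k <= y; ++k)
            acc += static_cast<double>(i + j + k);
    return acc;
}

std::vector<double> run_selected(const std::vector<char>& toExecute, int N2, int x, int y) {
    assert(N2 >= 0);
    const int N1 = static_cast<int>(toExecute.size());
    std::vector<double> result(N1, 0.0);

    // Serial pre-pass: collect only the indices that actually have work.
    std::vector<int> active;
    for (int i = 0; i < N1; ++i)
        if (toExecute[i]) active.push_back(i);

    const int nActive = static_cast<int>(active.size());
    // Every remaining iteration is expensive; dynamic scheduling hands
    // iterations to threads as they become free.
    #pragma omp parallel for schedule(dynamic)
    for (int n = 0; n < nActive; ++n) {
        const int i = active[n];
        result[i] = heavy_work(i, N2, x, y);
    }
    return result;
}
```

With the plain if-statement version, schedule(dynamic) alone may already balance the load well enough; the index list mainly helps when a static schedule is preferred.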
I need some help with OpenMP. When a thread finishes its share of a for loop, is it possible for it to help another thread by taking over part of that thread's work? I have a loop within a loop that contains breaks, so the threads don't finish at the same time: some threads still have a lot of work while others are already done (so there are unused cores). I run my program on a Core i7, and it seems that OpenMP divides the loop among 8 threads. But the utilization starts to drop after some time, once the first thread has finished its job.
#pragma omp parallel for
for(i = 0; i < Vector.size(); i++) {
    for(j = 0; j < othervector.size(); j++) {
        {some code}
        if(sth is true) break;
    }
}
Thank you.
The default division (schedule) of the loop iterations in a for loop is implementation dependent. In your case, when using omp parallel for, the default schedule may be STATIC, which means that, depending on the size of your vector, each thread gets assigned a fixed chunk of data. Since the workload apparently can't be balanced by dividing it statically, you should check out the DYNAMIC, GUIDED and RUNTIME schedule kinds and see if they help you restore a high utilization of your (virtual) cores. Depending on the chunk size this will of course cause additional overhead, but that may be negligible compared with the time your cores spend idle under static scheduling.
To answer the original question: I don't think that you can tell a thread to continue the work of another one. When the work gets assigned each thread has to deal with it on its own. Here is what I would try out.
#define CHUNKSIZE 100
// nowait is not allowed on the combined parallel for construct, so it is omitted
#pragma omp parallel for schedule(dynamic, CHUNKSIZE)
for(i = 0; i < Vector.size(); i++) {
    for(j = 0; j < othervector.size(); j++) {
        {some code}
        if(sth is true) break;
    }
}
Actually Hristo Iliev wrote a very nice answer to a similar question some time ago.
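To make the idea concrete, here is a small self-contained sketch of a loop whose inner part breaks early, so the cost per outer iteration varies. The function first_at_least and the chunk size kChunkSize are made up for this illustration; a good chunk size has to be determined by measurement.

```cpp
#include <cassert>
#include <vector>

constexpr int kChunkSize = 4;  // hypothetical value; tune by measurement

// For each values[i], finds the index of the first element of `candidates`
// that is >= values[i], or -1 if there is none.
std::vector<int> first_at_least(const std::vector<int>& values,
                                const std::vector<int>& candidates) {
    const int n = static_cast<int>(values.size());
    const int m = static_cast<int>(candidates.size());
    std::vector<int> found(n, -1);

    // Chunks of kChunkSize iterations are handed out on demand, so a thread
    // that finishes early simply fetches the next chunk instead of idling.
    #pragma omp parallel for schedule(dynamic, kChunkSize)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < m; ++j) {
            if (candidates[j] >= values[i]) {
                found[i] = j;
                break;  // leaving the inner loop early is allowed
            }
        }
    }
    assert(found.size() == values.size());
    return found;
}
```

Note that break is only used in the inner loop; breaking out of the loop that carries the OpenMP pragma is not permitted.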
I am trying to parallelize a code for particle-based simulations and experiencing poor performance of an OpenMP based approach. By that I mean:
Displaying CPU usage with the Linux tool top shows that the CPUs running OpenMP threads have an average usage of 50%.
With increasing number of threads, speed up converges to a factor of about 1.6. Convergence is quite fast, i.e. I reach a speed up of 1.5 using 2 threads.
The following pseudo code illustrates the basic template for all parallel regions implemented.
Note that during a single time step, 5 parallel regions of the form shown below are executed. Basically, the force acting on a particle i < N is a function of several field properties of neighboring particles j < NN(i).
omp_set_num_threads(ncpu);
#pragma omp parallel shared( quite_a_large_amount_of_readonly_data, force )
{
    int i, j; // private loop counters; N and NN(i) are defined outside the region
    #pragma omp for
    for( i=0; i<N; i++ ){ // Looping over all particles
        for ( j=0; j<NN(i); j++ ){ // Nested loop over all neighbors of i
            // No communication between threads, atomic regions,
            // barriers whatsoever.
            force[i] += function(j);
        }
    }
}
I am trying to sort out the cause for the observed bottleneck. My naive initial guess for an explanation:
As stated, there is a large amount of memory being shared between threads for read-only access. It is quite possible that different threads try to read the same memory location at the same time. Is this causing a bottleneck? Should I rather let OpenMP allocate private copies?
How large is N, and how intensive is NN(i)?
You say nothing is shared, but force[i] is probably in the same cache line as force[i+1]. This is what's known as false sharing and can be pretty detrimental. OpenMP should batch iterations together to compensate for this, so with a large enough N I don't think this would be your problem.
If NN(i) isn't very CPU intensive, you might have a simple memory bottleneck -- in which case throwing more cores at it won't solve anything.
Assuming that force[i] is a plain array of 4 or 8 byte data, you definitely have false sharing, no doubt about it.
Assuming that function(j) is independently calculated, you may want to do something like this:
for( i=0; i<N; i+=STEP ){ // Looping over all particles
    for ( j=0; j<NN(i); j+=STEP ){ // Nested loop over all neighbors of i
        // No communication between threads, atomic regions,
        // barriers whatsoever.
        calc_next(i, j);
    }
}
void calc_next(int i, int j)
{
    int ii, jj;
    for(ii = 0; ii < STEP; ii++)
    {
        for(jj = 0; jj < STEP; jj++)
        {
            // accumulate, as in the original loop
            force[i+ii] += function(j+jj);
        }
    }
}
That way, you calculate a bunch of things on one thread, and a bunch of things on the next thread, and each bunch is far enough apart that you don't get false sharing.
If you can't do it this way, try to split it up in some other way that leads to larger sections being calculated each time.
As the others stated, false sharing on force could be a reason. Try this simple approach:
#pragma omp for
for( i=0; i<N; i++ ){
    int sum = force[i];
    for ( j=0; j<NN(i); j++ ){
        sum += function(j);
    }
    force[i] = sum;
}
Technically, it's possible that force[i] = sum still causes false sharing. But that is highly unlikely, because with a static schedule (the usual default) the other threads work on elements such as force[i + N/omp_get_num_threads()], which are pretty far from force[i].
If scalability is still poor, try a profiler such as Intel Parallel Amplifier (or VTune) to see how much memory bandwidth each thread needs. If memory bandwidth turns out to be the limit, put some more DRAM in your computer :) Populating more memory channels can really boost memory bandwidth.
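A self-contained version of the local-accumulator idea might look as follows. It is a sketch under assumptions: pair_term is a made-up stand-in for function(j), the forces are taken to be doubles, and the neighbor counts are passed in as a plain vector instead of NN(i).

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the pair interaction; the real function is not shown.
double pair_term(int i, int j) {
    return static_cast<double>(i + 1) * 0.5 * static_cast<double>(j);
}

void accumulate_forces(std::vector<double>& force,
                       const std::vector<int>& neighbour_count) {
    assert(force.size() == neighbour_count.size());
    const int N = static_cast<int>(force.size());

    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        double sum = force[i];  // thread-local accumulator (register or stack)
        for (int j = 0; j < neighbour_count[i]; ++j) {
            sum += pair_term(i, j);
        }
        force[i] = sum;  // one write to the shared array per particle
    }
}
```

The shared array is written once per particle instead of once per neighbor, which reduces the traffic on cache lines that two threads might both touch.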
Could someone please provide some suggestions on how I can decrease the following for loop's runtime through multithreading? Suppose I also have two vectors called 'a' and 'b'.
for (int j = 0; j < 8000; j++){
    // Perform an operation and store in the vector 'a'
    // Add 'a' to 'b' coefficient wise
}
This for loop is executed many times in my program. The two operations in the for loop above are already optimized, but they only run on one core. However, I have 16 cores available and would like to make use of them.
I've tried modifying the loop as follows. Instead of having the vector 'a', I have 16 vectors, and suppose that the i-th one is called a[i]. My for loop now looks like
for (int j = 0; j < 500; j++){
    for (int i = 0; i < 16; i++){
        // Perform an operation and store in the vector 'a[i]'
    }
    for (int i = 0; i < 16; i++){
        // Add 'a[i]' to 'b' coefficient wise
    }
}
I use OpenMP on each of the inner for loops by adding '#pragma omp parallel for' before each of them. All of my processors are in use, but my runtime increases significantly instead of decreasing. Does anyone have any suggestions on how I can decrease the runtime of this loop? Thank you in advance.
OpenMP starts a team of threads wherever you insert the pragma, so here it does so for each of the inner loops. With your method, 16 threads are started, each one does a single operation, and then the team is shut down again, and this is repeated in every one of the 500 outer iterations. Starting and stopping threads takes a lot of time, so your method increases the overall runtime even though it uses all 16 cores. You did not have to create the inner loops: just put #pragma omp parallel for before your 8000-iteration loop. Distributing the iterations among the threads is OpenMP's job, and that is what you did by hand when you created the second loop. That way OpenMP sets up the threads only once, each thread processes about 500 iterations, and all of them finish after that (saving 499 rounds of thread creation and destruction).
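One caveat when parallelizing the 8000-iteration loop directly: if all threads wrote to the same 'a' and 'b', they would race. A possible way around this is sketched below, assuming the iterations are independent of each other. The function fill_a is a hypothetical stand-in for the unspecified operation, and accumulate is an illustrative name. Each thread gets its own scratch vector and its own partial sum, and the partial sums are merged into b at the end.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for "perform an operation and store in the vector 'a'".
void fill_a(int j, std::vector<double>& a) {
    for (std::size_t c = 0; c < a.size(); ++c)
        a[c] = static_cast<double>(j) + static_cast<double>(c);
}

void accumulate(std::vector<double>& b, int iterations) {
    assert(iterations >= 0);
    #pragma omp parallel  // one parallel region for the whole loop
    {
        std::vector<double> a(b.size());             // private scratch vector
        std::vector<double> local_b(b.size(), 0.0);  // private partial sum

        #pragma omp for nowait
        for (int j = 0; j < iterations; ++j) {
            fill_a(j, a);
            for (std::size_t c = 0; c < a.size(); ++c)
                local_b[c] += a[c];
        }

        // Merge the partial sums; only one thread at a time updates b.
        #pragma omp critical
        for (std::size_t c = 0; c < b.size(); ++c)
            b[c] += local_b[c];
    }
}
```

Because the partial sums are combined in a different order than in the serial loop, floating-point results can differ in the last bits.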
Actually, I am going to put these comments in an answer.
Forking threads for trivial operations just adds overhead.
First, make sure your compiler is using vector instructions to implement your loop. (If it does not know how to do this, you might have to code with vector instructions yourself; try searching for "SSE intrinsics". But for this sort of simple addition of vectors, automatic vectorization ought to be possible.)
Assuming your compiler is a reasonably modern GCC, invoke it with:
gcc -O3 -march=native ...
Add -ftree-vectorizer-verbose=2 (or, on newer GCC versions, -fopt-info-vec and -fopt-info-vec-missed) to find out whether or not it auto-vectorized your loop and why.
If you are already using vector instructions, then it is possible you are saturating your memory bandwidth. Modern CPU cores are pretty fast... If so, you need to restructure at a higher level to get more operations inside each iteration of the loop, finding ways to perform lots of operations on blocks that fit inside the L1 cache.
Does anyone have any suggestions on how I can decrease the runtime of this loop?
for (int j = 0; j < 500; j++){ // outer loop
for (int i = 0; i < 16; i++){ // inner loop
Try to give the outer loop fewer iterations than the inner loop. This saves you from initializing the inner loop that many times. In the code above, the inner loop's i = 0 is executed 500 times. Now consider:
for (int i = 0; i < 16; i++){ // outer loop
for (int j = 0; j < 500; j++){ // inner loop
Now the inner loop's j = 0 is executed only 16 times!
Try modifying your code accordingly and see whether it makes any difference.