OpenMP nested loops with code between each `for` loop - c++

For some reason, I have to put some code between each for statement in nested loops, like this:
for (int i = 0; i < n; ++i) {
    // I have to put some code here
    do_something_1();
    for (int j = 0; j < n; ++j) {
        // I have to put some code here
        do_something_2();
        for (int k = 0; k < n; ++k) {
            do_something_3();
        }
    }
}
------Update 20:11 6.17 2016----------------------
I found that it was not the nested loops that made my OpenMP program crash. I use std::vector with the push_back() method inside the loop, and that is really dangerous when using OpenMP.
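To illustrate the update: concurrent push_back() calls on one shared std::vector are a data race, because one thread may reallocate the buffer while another is writing to it. Below is a minimal sketch of one common workaround, not necessarily what the asker ended up doing. The function name collect_even_squares and its workload are made up for illustration. Each thread fills its own vector and the partial results are merged inside a critical section.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical example: collect i*i for every even i in [0, n).
std::vector<long> collect_even_squares(int n) {
    assert(n >= 0);
    std::vector<long> out;  // shared; only modified inside the critical section

    #pragma omp parallel
    {
        std::vector<long> local;  // one vector per thread, so push_back is safe

        #pragma omp for nowait
        for (int i = 0; i < n; ++i) {
            if (i % 2 == 0) {
                local.push_back(static_cast<long>(i) * i);
            }
        }

        // Only one thread at a time appends its partial result.
        #pragma omp critical
        out.insert(out.end(), local.begin(), local.end());
    }

    // The merge order depends on thread timing, so sort to get a stable result.
    std::sort(out.begin(), out.end());
    return out;
}
```

If the code is compiled without OpenMP support, the pragmas are ignored and the function simply runs serially.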

You can in principle parallelize this code using OpenMP. Here is how you can do it using Visual Studio (other development environments have similar settings):
In the property page, enable OpenMP:
Configuration Properties -> C/C++ -> Language -> Open MP Support
Include the OpenMP header file (strictly needed only if you call OpenMP runtime functions such as omp_get_thread_num()):
#include <omp.h>
Put the following line in front of your first for loop:
#pragma omp parallel for
This will parallelize your outer loop, which is what you want in most cases if n is significantly larger than the number of cores.
Be aware that when you run your loop in parallel, the different iterations need to be independent of each other. If you are not yet familiar with parallel processing, you might want to look at some OpenMP tutorials to avoid the typical pitfalls. I found these slides pretty helpful.
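As a rough sketch of what this looks like for the loop nest in the question: the function nested_work and its arithmetic are invented placeholders for something_1, something_2 and something_3. The point is only that code between the loops is fine as long as every variable it writes is either declared inside the loop body (and therefore private) or indexed by i.

```cpp
#include <cassert>
#include <vector>

// Hypothetical workload: out[i] = sum over j and k of (i + j) * k.
std::vector<long> nested_work(int n) {
    assert(n >= 0);
    std::vector<long> out(n);  // value-initialized to 0

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        long row_sum = 0;                // "something_1": declared here, so private
        for (int j = 0; j < n; ++j) {
            const long weight = i + j;   // "something_2": also private
            for (int k = 0; k < n; ++k) {
                row_sum += weight * k;   // "something_3"
            }
        }
        out[i] = row_sum;                // each iteration writes only its own slot
    }
    return out;
}
```

Only the outer loop is split across threads; the loops over j and k run in full inside each outer iteration.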

Related

Aspects that affect the efficiency of OpenMP parallelism

I would like to parallelize a big loop using OpenMP to improve its efficiency. Here is the main part of the toy code:
vector<int> config;
config.resize(indices.size());
omp_set_num_threads(2);
#pragma omp parallel for schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
    #pragma omp simd
    for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
        config[j] = ref_table[i][indices[j]];
    }
    int index = GetIndex(config); // do simple computations on the picked values to get the index
    #pragma omp atomic
    result[index]++;
}
Then I found I cannot get improvements in efficiency if I use 2, 4, or 8 threads. The execution time of the parallel versions is generally greater than that of the sequential version. The outer loop has 10000 iterations and they are independent so I want multiple threads to execute those iterations in parallel.
I guess the reasons for the performance decrease may include: private copies of config, random access to ref_table, or the expensive atomic operation. So what are the exact reasons for the performance decrease? More importantly, how can I get a shorter execution time?
Private copies of config and random access to ref_table are not problematic. I think the workload is very small, and there are 2 potential issues which prevent efficient parallelization:
The atomic operation is too expensive.
The overheads are bigger than the workload (which simply means that it is not worth parallelizing with OpenMP).
I do not know which one is more significant in your case, so it is worth trying to get rid of the atomic operation. There are 2 cases:
a) If the result array is zero-initialized, use
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
where N is the size of the result array, and delete #pragma omp atomic. Note that array-section reductions like this require OpenMP 4.5 or later. It is also worth removing #pragma omp simd for a loop of only 2-10 iterations. So, your code should look like this:
#pragma omp parallel for reduction(+:result[0:N]) schedule(static, 5000) firstprivate(config)
for (int i = 0; i < 10000; ++i) { // the outer loop that I would like to parallel
    for (int j = 0; j < indices.size(); ++j) { // pick some columns from a big ref_table
        config[j] = ref_table[i][indices[j]];
    }
    int index = GetIndex(config); // do simple computations on the picked values to get the index
    result[index]++;
}
b) If the result array is not zero-initialized, the solution is very similar, but use a temporary zero-initialized array in the loop and add it to the result array afterwards.
If the speed does not increase, then your code is not worth parallelizing with OpenMP on your hardware.
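Case b) could be sketched roughly as follows. The names accumulate_histogram and bin_of are invented for this illustration, and bin_of is only a stand-in for GetIndex(); values are assumed to be non-negative so that the bin index is valid. The array section is taken from a raw pointer because the base of an array section must be an array or a pointer, not a std::vector.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for GetIndex(): maps a non-negative value to a bin.
inline int bin_of(int value, int num_bins) { return value % num_bins; }

// Adds the histogram of `values` on top of whatever `result` already holds.
void accumulate_histogram(const std::vector<int>& values, std::vector<int>& result) {
    assert(!result.empty());
    const int num_bins = static_cast<int>(result.size());
    const int count = static_cast<int>(values.size());

    std::vector<int> temp(result.size(), 0);  // temporary zero-initialized array
    int* t = temp.data();                     // pointer base for the array section

    // Every thread works on a private, zeroed copy of t[0:num_bins]; the copies
    // are summed into temp at the end of the loop. Needs OpenMP 4.5 or later.
    #pragma omp parallel for reduction(+ : t[0:num_bins])
    for (int i = 0; i < count; ++i) {
        t[bin_of(values[i], num_bins)]++;
    }

    // Serial step: add the temporary counts to the existing result.
    for (int b = 0; b < num_bins; ++b) {
        result[b] += temp[b];
    }
}
```

No atomic is needed inside the loop, at the cost of one private copy of the array per thread, so this only pays off when the array is reasonably small.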

OpenMP: Is array reduction always needed for updating an array in parallel?

I am quite new to OpenMP. I have the following simple loop that I want to run in parallel with OpenMP:
double rij[3];
double r;
#ifdef _OPENMP
#pragma omp parallel for private(rij,r)
#endif
for (int i=0; i<n; ++i)
{
    for (int j=0; j<n; ++j)
    {
        if (i != j)
        {
            distance(X,rij,r,i,j);
            V[i] += ke * Q[j] / r;
            for (int k=0; k<3; ++k)
            {
                F[3*i+k] += ke * Q[j] * rij[k] / pow(r,3);
            }
        }
    }
}
From what I understood, variables are shared by default which is why I only declared private(rij,r). But according to these questions (first second third), I should do array reduction in this case.
It's clear to me that if many threads need to sum to the same variable, this has to be done with #pragma omp parallel for reduction(+:A[:n]) for summing to array A of size n. This is what I do in another part of my code, and it works as expected.
However, in this case workers never have to sum to the same variable: every worker performs the sum on its own index i. Is it correct to do as I do in this case, i.e. not doing any array reduction and not using any critical section?
If my implementation is correct, I believe it would avoid the overhead of the critical section while being simpler code. Feel free to give your advice on how this could be better optimized.
Thank you
You don't need a reduction. A reduction is a convenience feature that saves you from writing the same code over and over again for a recurring problem (try to think of how you would implement a sum reduction without OpenMP).
What you do right now is work on parallel data (V[i]) that does not overlap between iterations (as you state in the question), because each iteration writes only to its own index i. Furthermore, the writes to F[...] don't overlap either, because the index depends only on i and k.
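To make the argument concrete, here is a simplified 1D sketch of the same access pattern. The function potentials, the 1D distance and the dropped constant ke are simplifications for illustration, not the asker's actual code. Each outer iteration accumulates into a local variable and then writes exactly one element, V[i], so no two threads ever touch the same element.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Simplified 1D version: V[i] = sum over j != i of Q[j] / |X[i] - X[j]|.
// Assumes all positions in X are distinct, so r is never zero.
std::vector<double> potentials(const std::vector<double>& X,
                               const std::vector<double>& Q) {
    assert(X.size() == Q.size());
    const int n = static_cast<int>(X.size());
    std::vector<double> V(X.size(), 0.0);

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;  // declared in the loop body, so private to the thread
        for (int j = 0; j < n; ++j) {
            if (i != j) {
                const double r = std::fabs(X[i] - X[j]);
                sum += Q[j] / r;
            }
        }
        V[i] = sum;  // only iteration i writes V[i]: no reduction, no critical
    }
    return V;
}
```

X and Q are only read inside the loop, so sharing them between threads is safe.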

Is the OpenMP scheduling still efficient with a conditional inner loop?

Currently, somewhere deep in my code, I am working with a nested for loop (N1 = ~10000, N2 = ~500, x, y = 10-50). I used #pragma omp parallel for to have OpenMP distribute my calculation over several cores.
#pragma omp parallel for
for (int i = 0; i < N1; ++i)
{
    for (int j = 0; j < N2; ++j)
    {
        for (int k = x; k <= y; ++k)
        {
            // calculation
        }
    }
}
Now my two inner loops have become conditional:
#pragma omp parallel for
for (int i = 0; i < N1; ++i)
{
    if (toExecute[i])
    {
        for (int j = 0; j < N2; ++j)
        {
            for (int k = x; k <= y; ++k)
            {
                // calculation
            }
        }
    }
}
The inner nested loop either takes a long time, or is immediately done. Of course I can omit the if-statement by replacing the outer-loop and if-statement with a shorter loop and lookup for the later indexing.
My question is: Is OpenMP smart enough to handle the if-statement within my outer loop, or do I have to do something manually?
I am currently using C++ in Visual Studio 2017 if that matters (I think the OpenMP version is a bit behind).
Ideally, you should let OpenMP handle that for you. But as always with performance work, you have to try things to see what is best in your case. Indeed, you can gain a great speedup by doing things manually. OpenMP is not omniscient: it does not know the details of your calculation.
If every executed iteration involves the same amount of work, then your condition is likely to produce an uneven workload across the iterations of the outermost loop. So theoretically, dynamic scheduling should be a better fit:
#pragma omp parallel for schedule(dynamic)
You could also try static or guided scheduling, which might fit your calculation (I don't know the details of your calculation so I cannot say), and play with the chunk size.
Another test to do, if you can afford it (i.e. if it is parallelizable), is to move the parallelization into the inner loops.
You can even nest the parallelization; it sometimes gives a nice speedup. Try and tune step by step, and take time to see what gives you the best result. Just a reminder: these tweaks are often not portable across different architectures, so aim for a good tradeoff between performance and code reusability.
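The two main options discussed here could be sketched like this. heavy_work is an invented stand-in for the two inner loops, and the function names are hypothetical. The first variant keeps the if and relies on dynamic scheduling; the second gathers the active indices up front, as the question suggests, so that every parallel iteration does real work. Which one is faster has to be measured.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the two inner loops of the question.
inline long heavy_work(int i, int n2) {
    long acc = 0;
    for (int j = 0; j < n2; ++j) {
        acc += static_cast<long>(i) * j;
    }
    return acc;
}

// Variant 1: keep the if, let dynamic scheduling balance the load.
std::vector<long> run_dynamic(const std::vector<char>& toExecute, int n2) {
    assert(n2 >= 0);
    const int n1 = static_cast<int>(toExecute.size());
    std::vector<long> out(toExecute.size());  // value-initialized to 0

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n1; ++i) {
        if (toExecute[i]) {
            out[i] = heavy_work(i, n2);
        }
    }
    return out;
}

// Variant 2: gather the active indices first, then loop only over those.
std::vector<long> run_compacted(const std::vector<char>& toExecute, int n2) {
    assert(n2 >= 0);
    const int n1 = static_cast<int>(toExecute.size());
    std::vector<int> active;
    for (int i = 0; i < n1; ++i) {
        if (toExecute[i]) active.push_back(i);  // serial, so push_back is safe
    }
    const int m = static_cast<int>(active.size());
    std::vector<long> out(toExecute.size());

    #pragma omp parallel for
    for (int a = 0; a < m; ++a) {
        const int i = active[a];
        out[i] = heavy_work(i, n2);
    }
    return out;
}
```

Both variants use only OpenMP 2.0 features, so they should also be accepted by the older OpenMP support in Visual Studio.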

OpenMP parallel for loop speedup issues

Recently I started using OpenMP. I am doing a numerical calculation involving 3D matrices, created in C++ as vectors, and I used parallel for loops to speed up the code. But it runs slower than the serial code. I compile the code using Code::Blocks on Windows 7. The code is something like this:
int main(){
    vector<vector<vector<float> > > Dx;
    /* create 3D array Dx[IE][JE][KE] as vectors */
    Dx.resize(IE);
    for (int i = 0; i < IE; ++i) {
        Dx[i].resize(JE); // size the second dimension before indexing Dx[i][j]
        for (int j = 0; j < JE; ++j){
            Dx[i][j].resize(KE);
        }
    }
    // declare and initialize more matrices like this
    .
    .
    .
    double wtime = omp_get_wtime(); // start time
    // matrix calculations using a parallel for loop
    #pragma omp parallel for
    for (int i=1; i < IE; ++i ) {
        for (int j=1; j < JE; ++j ) {
            for (int k=1; k < KE; ++k ) {
                curl_h = ( Hz[i][j][k] - Hz[i][j-1][k] - Hy[i][j][k] + Hy[i][j][k-1]);
                idxl[i][j][k] = idxl[i][j][k] + curl_h;
                Dx[i][j][k] = gj3[j]*gk3[k]*dx[i][j][k]
                    + gj2[j]*gk2[k]*.5*(curl_h + gi1[i]*idxl[i][j][k]);
            }
        }
    }
    wtime = omp_get_wtime() - wtime; // elapsed time
}
But the code with parallel loops runs slower than the serial code. Any ideas?
Thxs.
The loop uses the variable curl_h, which is not declared as thread private. This is both a bug, and also the reason for your perceived performance problem:
As there is only one place in memory where curl_h is stored, all threads constantly and concurrently try to read and write it. One CPU core will load the value into its cache, the next one will issue a write to it, invalidating the cache of the first CPU, which will again grab the cacheline when it itself tries to use curl_h (read or write, both will require the cacheline to be in the local cache).
The point is, that the fierce pretense put up by the hardware that there is only one memory location called curl_h demands its tribute. You get a huge amount of chatter in the cache coherency protocol, and keep your memory buses busy with constantly refetching the same cacheline from memory. All your threads are really doing is fighting over that one cacheline.
Of course, the constant races between the threads are a big bug, as no thread can be certain that the value it's currently using is actually the one it calculated in the statement above.
So, just add the correct private() declarations to your omp parallel for statement (here private(curl_h), or declare curl_h inside the loop body), and you'll fix both the bug and the performance issue.
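A reduced 2D sketch of the fix, with made-up names (update, Grid) and a much simpler formula than the one in the question: the temporary is declared inside the loop body, which makes it private to each thread without any clause. If the variable has to stay declared outside the loop, adding private(curl_h) to the pragma has the same effect.

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<float>>;

// Simplified update: D[i][j] += H[i][j] - H[i][j-1].
// Assumes every row of D is at least as long as the matching row of H.
void update(Grid& D, const Grid& H) {
    assert(D.size() == H.size());
    const int ie = static_cast<int>(H.size());

    #pragma omp parallel for
    for (int i = 0; i < ie; ++i) {
        const int je = static_cast<int>(H[i].size());
        for (int j = 1; j < je; ++j) {
            // Declared inside the loop body: each thread has its own curl_h,
            // so there is no race and no cache line bouncing between cores.
            const float curl_h = H[i][j] - H[i][j - 1];
            D[i][j] += curl_h;
        }
    }
}
```

The loop counters of the inner loops are private for the same reason: they are declared inside the parallel region.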

Parallel Sum for Vectors

Could someone please provide some suggestions on how I can decrease the following for loop's runtime through multithreading? Suppose I also have two vectors called 'a' and 'b'.
for (int j = 0; j < 8000; j++){
    // Perform an operation and store in the vector 'a'
    // Add 'a' to 'b' coefficient wise
}
This for loop is executed many times in my program. The two operations in the for loop above are already optimized, but they only run on one core. However, I have 16 cores available and would like to make use of them.
I've tried modifying the loop as follows. Instead of having the vector 'a', I have 16 vectors, and suppose that the i-th one is called a[i]. My for loop now looks like
for (int j = 0; j < 500; j++){
    for (int i = 0; i < 16; i++){
        // Perform an operation and store in the vector 'a[i]'
    }
    for (int i = 0; i < 16; i++){
        // Add 'a[i]' to 'b' coefficient wise
    }
}
I use OpenMP on each of the inner for loops by adding '#pragma omp parallel for' before each of them. All of my processors are in use, but my runtime increases significantly. Does anyone have any suggestions on how I can decrease the runtime of this loop? Thank you in advance.
OpenMP starts a team of threads wherever you insert the pragma, so with your method a team is started for each inner loop: 16 threads are created, each one does a single operation, and then all of them are shut down again. Creating and destroying threads takes a lot of time, so your method increases the overall run time even though it uses all 16 cores. You don't have to create the inner for loops yourself: just put #pragma omp parallel for before your 8000-iteration loop. Distributing the iterations among the threads is OpenMP's job, so the second loop you created is unnecessary. That way OpenMP creates the threads only once, each thread processes about 500 iterations, and all of them finish after that (499 fewer thread creations and destructions).
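One caveat when putting the pragma on the 8000-iteration loop: 'a' and 'b' must not be shared blindly, or the threads will overwrite each other. A possible sketch, assuming a made-up operation fill() in place of the real one, gives each thread its own scratch vector and partial sum and merges into 'b' once per thread:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for "perform an operation and store in a".
inline void fill(std::vector<long>& a, int j) {
    for (std::size_t c = 0; c < a.size(); ++c) {
        a[c] = j + static_cast<long>(c);
    }
}

// Adds the vectors produced by fill(a, j) for j in [0, iterations) onto b.
void accumulate(std::vector<long>& b, int iterations) {
    assert(iterations >= 0);
    const std::size_t len = b.size();

    #pragma omp parallel
    {
        // Per-thread scratch vector and partial sum: nothing is shared in the loop.
        std::vector<long> a(len);
        std::vector<long> partial(len, 0L);

        #pragma omp for nowait
        for (int j = 0; j < iterations; ++j) {
            fill(a, j);
            for (std::size_t c = 0; c < len; ++c) {
                partial[c] += a[c];
            }
        }

        // Merge once per thread instead of once per iteration.
        #pragma omp critical
        for (std::size_t c = 0; c < len; ++c) {
            b[c] += partial[c];
        }
    }
}
```

There is a single parallel region for the whole loop, so the fork/join cost is paid once rather than once per inner loop.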
Actually, I am going to put these comments in an answer.
Forking threads for trivial operations just adds overhead.
First, make sure your compiler is using vector instructions to implement your loop. (If it does not know how to do this, you might have to code with vector instructions yourself; try searching for "SSE intrinsics". But for this sort of simple addition of vectors, automatic vectorization ought to be possible.)
Assuming your compiler is a reasonably modern GCC, invoke it with:
gcc -O3 -march=native ...
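Add -ftree-vectorizer-verbose=2 to find out whether or not it auto-vectorized your loop and why. (That option is deprecated in newer GCC releases; use -fopt-info-vec and -fopt-info-vec-missed there instead.)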
If you are already using vector instructions, then it is possible you are saturating your memory bandwidth. Modern CPU cores are pretty fast... If so, you need to restructure at a higher level to get more operations inside each iteration of the loop, finding ways to perform lots of operations on blocks that fit inside the L1 cache.
Does anyone have any suggestions on how I can decrease the runtime of this loop?
for (int j = 0; j < 500; j++){ // outer loop
    for (int i = 0; i < 16; i++){ // inner loop
Always try to give the outer loop fewer iterations than the inner loop. This saves you from initializing the inner loop that many times. In the code above, the inner loop's i = 0; is initialized 500 times. Now,
for (int i = 0; i < 16; i++){ // outer loop
    for (int j = 0; j < 500; j++){ // inner loop
Now the inner loop's j = 0; is initialized only 16 times!
Try modifying your code accordingly and see whether it makes any difference.