Reducing time by using OpenMP parallelization - c++

I'm trying to reduce the computation-time of my algorithm by using OpenMP parallelization (C++).
I tried simple things, but I don't quite understand how it works...
Here is my code:
int nthread = omp_get_max_threads();
#pragma omp parallel for num_threads(nthread)
for(int i=0;i<24;++i)
std::cout << omp_get_thread_num() << std::endl;
On my computer, nthread = 6. I don't understand why the output is:
0
0
0
... (24 times)
Why doesn't it give my numbers from 0 to 5 ?
If I understand it well (correct me if I'm wrong), in this code, there are 6 threads which will execute the std::cout command.
Then, why do I have only "0" as an output ?
Second thing: I would like to execute in each thread a certain part of the loop. I want to divide my loop in 6 (nthread) different parts, so that each can be executed by a different thread.
Here, I want each of my 6 threads to execute
std::cout << omp_get_thread_num() << std::endl;
4 times.
How can I do it? I tried this:
#pragma omp parallel for num_threads(nthread)
for(int i=omp_get_thread_num()*(24/nthread);i<(omp_get_thread_num()+1)*(24/nthread);++i)
std::cout << omp_get_thread_num() << std::endl;
Is it right? The output I have is:
0
0
0
0
Is it normal to have only the "0" thread and no other in the terminal?
Thank you

Only a partial answer but I couldn't stay silent on this
I tried this:
for(int i=omp_get_thread_num()*(24/nthread);i<(omp_get_thread_num()+1)*(24/nthread);++i)
std::cout << omp_get_thread_num() << std::endl;
Is it right?
No, it's not right, not right at all ! The code is doing the work of dividing iterations across threads, a better model to follow would be
for(int i=0;i<max_iters;++i)
do work depending on i
and the compiler/run-time will take care of dividing the work across threads. Each thread will get its own set of values of i to work on.
This simple pattern is only correct if each task inside the loop is independent of every other task, so no dependencies between work(i) and work(i-1) say. But at the beginning that's probably enough to get you started.
As for the rest of your question it looks as if you aren't actually running the code in parallel. I suggest replacing
int nthread = omp_get_max_threads();
#pragma omp parallel for num_threads(nthread)
with
#pragma omp parallel for
that is, leave the number of threads to the default set up. If that doesn't work, edit your question with the results of your further investigations. And have a look round at SO, I'm fairly sure that you'll find a duplicate.

RyanP, you were absolutely right, I missed the keyword openmp.
I added it and now it works well! Thanks a lot.
Also thank you High Performance Mark for your answer,
#pragma omp parallel for
was enough for what I wanted to do.
I knew that
for(int i=omp_get_thread_num()*(24/nthread);i<(omp_get_thread_num()+1)*(24/nthread);++i)
std::cout << omp_get_thread_num() << std::endl;
was wrong, but since the other things I tried didn't work, I tried insane things. Thanks for your explanation, it's clearer now.
To resolve my problem, I simply add to my CMakeList.txt the following lines:
find_package(OpenMP)
if (OPENMP_FOUND)
set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
endif()
And it works well.
Thank you all

Related

OpenMP 4.5 task dependencies and execution order

I am trying to get OpenMP task dependencies to work, to no avail.
Let's take this simplified example:
int main()
{
int x=0;
#pragma omp parallel
#pragma omp single
{
#pragma omp task depend(in:x)
{
#pragma omp critical
cout << "Read the value " << x << "\n";
}
#pragma omp task depend(out:x)
{
x = 3;
#pragma omp critical
cout << "Set the value to " << x << "\n";
}
}
return 0;
}
As far as I understand (frome OpenMP specs), the depend(in:x) tasks should only be executed after all depend(out:x) tasks have been resolved, and so the expected output is
Set the value to 3
Read the value 3
However, the tasks are instead executed in semantic order, disregarding the depend clauses, and what I get is this:
Read the value 0
Set the value to 3
I am compiling using g++-7 (SUSE Linux) 7.3.1 20180323 [gcc-7-branch revision 258812] with the -fopenmp flag. This version of the compiler should have access to OpenMP 4.5.
Is this a misunderstanding of task dependencies on my side, or is there anything else at play here?
The concept of task dependencies can be misleading.
The best way to put it is to think about them as a way to indicate how different tasks access data and not as a way to control execution order.
The order of the tasks in the source code, together with the depend clause, describe one of the possibile 4 scenarios: read after write, write after read, write after write and read after read.
In your example, you are describing a write after read case: you are telling the compiler that the second task will overwrite the variable x, but the first task takes x as an input, as indicated by depend(in:x). Therefore, the software will execute the first task before the second to prevent overwriting the initial value.
If you have a look at Intel's documentation here there's a brief example which shows how tasks order (in the source code) still plays a role in the determination of the dependency graph (and therefore, on the execution order).
Another informative page on this matter is available here.

Openmp performance with omp_get_max_threads greater than number of cores

I am novice is parallel programming. I running a my own Gibbs sampler written in C++. The overview of program look some thing like this.
for(int iter=0; iter <=itermax; iter++){ //loop1
#pragma omp parallel for schedule(dynamic)
for(int jobs= 0; jobs<=1000; jobs++){ // loop2
small_job();
#pragma omp critical(dataupdate){
data_updates()
}
}
jobs_that_cannot_be_parallelized();
}
I am running in a machine with 64 cores. Since small_job are of variable length and small I was assigning omp_get_max_threads = 128. The number of cores used seems to be correct (see fig load last hour).. Each of peaks belongs to loop2.
However when I look to the actual cpu usage (see fig it seems lot of of cpu is used by system and only 20% is used by user. Is it because I am spawning lots of threads at loop2. What are best practices to decide on omp_get_max_threads? I know I have not given enough information but I will really appreciate any other recommendation to make the program faster.

Strange ratio in speedup between release and debug builds in game "Life"

I wrote classic game "Life" with 4-sided neighbors. When I run it in debug, it says:
Consecutive version: 4.2s
Parallel version: 1.5s
Okey, it's good. But if I run it in release, it says:
Consecutive version: 0.46s
Parallel version: 1.23s
Why? I run it on the computer with 4 kernels. I run 4 threads in parallel section. Answer is correct. But somethere is leak and I don't know that place. Can anybody help me?
I try to run it in Visual Studio 2008 and 2012. The results are same. OMP is enabled in the project settings.
To repeat my problem, you can find defined constant PARALLEL and set it to 1 or 0 to enable and disable OMP correspondingly. Answer will be in the out.txt (out.txt - right answer example). The input must be in in.txt (my input - in.txt). There are some russian symbols, you don't need to understand them, but the first number in in.txt means number of threads to run in parallel section (it's 4 in the example).
The main part is placed in the StartSimulation function. If you run the program, you will see some russian text with running time in the console.
The program code is big enough, so I add it with file hosting - main.cpp (l2 means "lab 2" for me)
Some comments about StartSimulation function. I cuts 2D surface with cells into small rectangles. It is done by AdjustKernelsParameters function.
I do not find the ratio so strange. Having multiple threads co-operate is a complex business and has overheads.
Access to shared memory needs to be serialized which normally involves some form of locking mechanism and contention between threads where they have to wait for the lock to be released.
Such shared variables need to be synchronized between the processor cores which can give significant slowdowns. Also the compiler needs to treat these critical areas differently as a "sequence point".
All this reduces the scope for per thread optimization both in the processor hardware and the compiler for each thread when it is working with the shared variable.
It seems that in this case the overheads of parallelization outweigh the optimization possibilities for the single threaded case.
If there were more work for each thread to do independently before needed to access a shared variable then these overheads would be less significant.
You are using guided loop schedule. This is a very bad choice given that you are dealing with a regular problem where each task can easily do exactly the same amount of work as any other if the domain is simply divided into chunks of equal size.
Replace schedule(guided) with schedule(static). Also employ sum reduction over livingCount instead of using locked increments:
#if PARALLEL == 1
#pragma omp parallel for schedule(static) num_threads(kernelsCount) \
reduction(+:livingCount)
#endif
for (int offsetI = 0; offsetI < n; offsetI += kernelPartSizeN)
{
for (int offsetJ = 0; offsetJ < m; offsetJ += kernelPartSizeM)
{
int boundsN = min(kernelPartSizeN, n - offsetI),
boundsM = min(kernelPartSizeM, m - offsetJ);
for (int kernelOffsetI = 0; kernelOffsetI < boundsN; ++kernelOffsetI)
{
for (int kernelOffsetJ = 0; kernelOffsetJ < boundsM; ++kernelOffsetJ)
{
if(BirthCell(offsetI + kernelOffsetI, offsetJ + kernelOffsetJ))
{
++livingCount;
}
}
}
}
}

Make g++ produce a program that can use multiple cores?

I have a c++ program with multiple For loops; each one runs about 5 million iterations. Is there any command I can use with g++ to make the resulting .exe will use multiple cores; i.e. make the first For loop run on the first core and the second For loop run on the second core at the same time? I've tried -O3 and -O3 -ftree-vectorize, but in both cases, my cpu usage still only hovers at around 25%.
EDIT:
Here is my code, in case in helps. I'm basically just making a program to test the speed capabilities of my computer.
#include <iostream>
using namespace std;
#include <math.h>
int main()
{
float *bob = new float[50102133];
float *jim = new float[50102133];
float *joe = new float[50102133];
int i,j,k,l;
//cout << "Starting test...";
for (i=0;i<50102133;i++)
bob[i] = sin(i);
for (j=0;j<50102133;j++)
bob[j] = sin(j*j);
for (k=0;k<50102133;k++)
bob[k] = sin(sqrt(k));
for (l=0;l<50102133;l++)
bob[l] = cos(l*l);
cout << "finished test.";
cout << "the 100120 element is," << bob[1001200];
return 0;
}
The most obviously choice would be to use OpenMP. Assuming your loop is one that's really easy to execute multiple iterations in parallel, you might be able to just add:
#pragma openmp parallel for
...immediately before the loop, and get it to execute in parallel. You'll also have to add -fopenmp when you compile.
Depending on the content of the loop, that may give anywhere from a nearly-linear speedup to slowing the code down somewhat. In the latter cases (slowdown or minimal speedup) there may be other things you can do with OpenMP to help speed it up, but without knowing at least a little about the code itself, it's hard to guess what to do or what improvement you may be able to expect at maximum.
The other advice you're getting ("Use threads") may be suitable. OpenMP is basically an automated way of putting threads to use for specific types of parallel code. For a situation such as you describe (executing multiple iterations of a loop in parallel) OpenMP is generally preferred--it's quite a bit simpler to implement, and may well give better performance unless you know multithreading quite well and/or expend a great deal of effort on parallelizing the code.
Edit:
The code you gave in the question probably won't benefit from multiple threads. The problem is that it does very little computation on each data item before writing the result out to memory. Even a single core can probably do the computation fast enough that the overall speed will be limited by the bandwidth to memory.
To stand a decent chance of getting some real benefit from multiple threads, you probably want to write some code that does more computation and less just reading and writing memory. For example, if we collapse your computations together, and do all of them on a single item, then sum the results:
double total = 0;
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
By adding a pragma:
#pragma omp parallel for reduction(+:total)
...just before the for loop, we stand a good chance of seeing a substantial improvement in execution speed. Without OpenMP, I get a time like this:
Real 16.0399
User 15.9589
Sys 0.0156001
...but with the #pragma and OpenMP enabled when I compile, I get a time like this:
Real 8.96051
User 17.5033
Sys 0.0468003
So, on my (dual core) processor, time has dropped from 16 to 9 seconds--not quite twice as fast, but pretty close. Of course, a lot of the improvement you get will depend on exactly how many cores you have available. For example, on my other computer (with an Intel i7 CPU), I get a rather larger improvement because it has more cores.
Without OpenMP:
Real 15.339
User 15.3281
Sys 0.015625
...and with OpenMP:
Real 3.09105
User 23.7813
Sys 0.171875
For completeness, here's the final code I used:
#include <math.h>
#include <iostream>
static const int size = 1024 * 1024 * 128;
int main(){
double total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < size; i++)
total += sin(i) + sin(i*i) + sin(sqrt(i)) + cos(i*i);
std::cout << total << "\n";
}
The compiler has no way to tell if your code inside the loop can be safely executed on multiple cores. If you want to use all your cores, use threads.
Use Threads or Processes, you may want to look to OpenMp
C++11 got support for threading but c++ compilers won't/can't do any threading on their own.
As others have pointed out, you can manually use threads to achieve this. You might look at libraries such as libdispatch (aka. GCD) or Intel's TBB to help you do this with the least pain.
The -ftree-vectorize option you mention is for targeting SIMD vector processor units on CPUs such as ARM's NEON or Intel's SSE. The code produced is not thread-parallel, but rather operation parallel using a single thread.
The code example posted above is highly amenable to parallelism on SIMD systems as the body of each loop very obviously has no dependancies on the previous iteration, and the operations in the loop are linear.
On some ARM Cortex A series systems at least, you may need to accept slightly reduced accuracy to get the full benefits.

OpenMP in Qt on VirtualBox uses only one thread

I am trying to parallelise part of a C++ program using OpenMP, in QtCreator in Linux on VirtualBox. The host system has 4core cpu. Since my initial attempts at using openmp pragmas didn't seem to work (the code with openmp took almost the same time as that without), I went back to OpenMP wiki and tried to run this simple example.
int main(void)
{
#pragma omp parallel
printf("Hello, world.\n");
return 0;
}
and the output is just
'Hello, world'.
I also tried to run this piece of code
int main () {
int thread_number;
#pragma omp parallel private(thread_number)
{
#pragma omp for schedule(static) nowait
for (int i = 0; i < 50; i++) {
thread_number = omp_get_thread_num();
cout << "Thread " << thread_number << " says " << i << endl;
}
}
return 0;
}
and the output is:
Thread 0 says 0
Thread 0 says 1
Thread 0 says 2
.
.
.
.
Thread 0 says 49
So it looks like there is no parallelising happening after all. I have set QMAKE_CXXFLAGS+= -fopenmp
QMAKE_LFLAGS += -fopenmp in the .pro file. Is this happening because I am running it from a virtual machine? How do I make multithreading work here? I would really appreciate any suggestions/pointers. Thank you.
Your problem is that VirtualBox always defaults to a machine with one core. Go to Settings/System/Processor and increase the number of CPUs to the number of hardware threads (4 in your case or eight if you have hyperthreading). If you have hyperthreading VirtualBox will warn you that you chose more CPUs than physical CPUs. Ignore the warning.
I set my CPUs to eight. When I use OpenMP in GCC on Windows I get eight threads.
Edit: According to VirtualBox's manaual you should set the number of threads to the number of physical cores not the number of hyper-threads.
You should not, however, configure virtual machines to use more CPU cores than you have available physically (real cores, no hyperthreads).
Try setting the environment variable OMP_NUM_THREADS. The default may be 1 if your virtual machine says it has a single core (this was happening to me).