I am trying to parallelise part of a C++ program using OpenMP, in Qt Creator on Linux in VirtualBox. The host system has a 4-core CPU. Since my initial attempts at using OpenMP pragmas didn't seem to work (the code with OpenMP took almost the same time as the code without), I went back to the OpenMP wiki and tried to run this simple example:
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    printf("Hello, world.\n");
    return 0;
}
and the output is just
'Hello, world'.
I also tried to run this piece of code
#include <iostream>
#include <omp.h>
using namespace std;

int main () {
    int thread_number;
    #pragma omp parallel private(thread_number)
    {
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < 50; i++) {
            thread_number = omp_get_thread_num();
            cout << "Thread " << thread_number << " says " << i << endl;
        }
    }
    return 0;
}
and the output is:
Thread 0 says 0
Thread 0 says 1
Thread 0 says 2
...
Thread 0 says 49
So it looks like no parallelisation is happening after all. I have set QMAKE_CXXFLAGS += -fopenmp and QMAKE_LFLAGS += -fopenmp in the .pro file. Is this happening because I am running it from a virtual machine? How do I make multithreading work here? I would really appreciate any suggestions/pointers. Thank you.
Your problem is that VirtualBox always defaults to a machine with one core. Go to Settings/System/Processor and increase the number of CPUs to the number of hardware threads (four in your case, or eight if you have hyperthreading). If you have hyperthreading, VirtualBox will warn you that you chose more CPUs than physical CPUs. Ignore the warning.
I set my CPUs to eight. When I use OpenMP in GCC on Windows I get eight threads.
Edit: According to VirtualBox's manual, you should set the number of CPUs to the number of physical cores, not the number of hyper-threads:
You should not, however, configure virtual machines to use more CPU cores than you have available physically (real cores, no hyperthreads).
Try setting the environment variable OMP_NUM_THREADS. The default may be 1 if your virtual machine says it has a single core (this was happening to me).
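To see what the OpenMP runtime actually detects inside the VM, a small diagnostic can help. This is only a sketch: RuntimeInfo, query_runtime and print_runtime_info are hypothetical names made up for illustration, and the fallback to 1 when the code is built without -fopenmp is my own choice.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper for illustration: what the OpenMP runtime sees.
struct RuntimeInfo {
    int processors;   // processors visible to the runtime
    int max_threads;  // threads a parallel region would get by default
};

RuntimeInfo query_runtime() {
#ifdef _OPENMP
    return { omp_get_num_procs(), omp_get_max_threads() };
#else
    return { 1, 1 };  // built without -fopenmp: the pragmas are ignored
#endif
}

void print_runtime_info() {
    RuntimeInfo info = query_runtime();
    std::printf("processors: %d, max threads: %d\n",
                info.processors, info.max_threads);
}
```

Calling print_runtime_info() at the start of main shows whether the guest exposes more than one processor. If it reports a single processor, raising the CPU count in the VirtualBox settings is the real fix; running the program as OMP_NUM_THREADS=4 ./program only changes how many threads are requested.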
I am trying to get OpenMP task dependencies to work, to no avail.
Let's take this simplified example:
#include <iostream>
using namespace std;

int main()
{
    int x=0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(in:x)
        {
            #pragma omp critical
            cout << "Read the value " << x << "\n";
        }
        #pragma omp task depend(out:x)
        {
            x = 3;
            #pragma omp critical
            cout << "Set the value to " << x << "\n";
        }
    }
    return 0;
}
As far as I understand (from the OpenMP specs), the depend(in:x) tasks should only be executed after all depend(out:x) tasks have been resolved, and so the expected output is
Set the value to 3
Read the value 3
However, the tasks are instead executed in the order in which they appear in the source, seemingly disregarding the depend clauses, and what I get is this:
Read the value 0
Set the value to 3
I am compiling using g++-7 (SUSE Linux) 7.3.1 20180323 [gcc-7-branch revision 258812] with the -fopenmp flag. This version of the compiler should have access to OpenMP 4.5.
Is this a misunderstanding of task dependencies on my side, or is there anything else at play here?
The concept of task dependencies can be misleading.
The best way to put it is to think about them as a way to indicate how different tasks access data and not as a way to control execution order.
The order of the tasks in the source code, together with the depend clause, describes one of four possible scenarios: read after write, write after read, write after write and read after read.
In your example, you are describing a write after read case: you are telling the compiler that the second task will overwrite the variable x, but the first task takes x as an input, as indicated by depend(in:x). Therefore, the runtime will execute the first task before the second to prevent overwriting the initial value.
If you have a look at Intel's documentation here, there's a brief example which shows how task order (in the source code) still plays a role in determining the dependency graph (and therefore the execution order).
Another informative page on this matter is available here.
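To get the output the question expected, the writing task has to come first in the source, which turns the pair into a read after write dependency. A minimal sketch, assuming a hypothetical function read_after_write (not code from the question) that returns what the reader task observed:

```cpp
// The writer task is generated first, so the later depend(in: x) task must
// wait for it. Returns the value the reader task saw.
int read_after_write() {
    int x = 0;
    int seen = -1;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x) shared(x)
        x = 3;

        #pragma omp task depend(in: x) shared(x, seen)
        seen = x;
    }
    // The implicit barrier at the end of the parallel region guarantees
    // that both tasks have finished before we get here.
    return seen;
}
```

With the two tasks swapped back to the order used in the question, the reader would be free to run first and see 0, which matches the behaviour described there.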
I wrote the classic game "Life" with 4-sided neighbors. When I run it in a debug build, it reports:
Consecutive version: 4.2s
Parallel version: 1.5s
OK, that's good. But if I run it in a release build, it reports:
Consecutive version: 0.46s
Parallel version: 1.23s
Why? I run it on a computer with 4 cores, and I run 4 threads in the parallel section. The answer is correct, but somewhere time is being lost and I can't find where. Can anybody help me?
I tried running it in Visual Studio 2008 and 2012. The results are the same. OpenMP is enabled in the project settings.
To reproduce my problem, find the defined constant PARALLEL and set it to 1 or 0 to enable or disable OpenMP, respectively. The answer is written to out.txt (out.txt - right answer example). The input must be in in.txt (my input - in.txt). There are some Russian characters; you don't need to understand them, but the first number in in.txt is the number of threads to run in the parallel section (it's 4 in the example).
The main part is in the StartSimulation function. If you run the program, you will see some Russian text with the running time in the console.
The program code is fairly large, so I have put it on a file host: main.cpp (l2 means "lab 2" for me).
Some comments about the StartSimulation function: I cut the 2D surface of cells into small rectangles. This is done by the AdjustKernelsParameters function.
I do not find the ratio so strange. Having multiple threads co-operate is a complex business and has overheads.
Access to shared memory needs to be serialized which normally involves some form of locking mechanism and contention between threads where they have to wait for the lock to be released.
Such shared variables need to be synchronized between the processor cores which can give significant slowdowns. Also the compiler needs to treat these critical areas differently as a "sequence point".
All this reduces the scope for per thread optimization both in the processor hardware and the compiler for each thread when it is working with the shared variable.
It seems that in this case the overheads of parallelization outweigh the optimization possibilities for the single threaded case.
If there were more work for each thread to do independently before it needed to access a shared variable, then these overheads would be less significant.
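One way to give each thread more independent work is to accumulate into a thread-private variable and touch the shared one only once per thread. The following is a sketch under assumptions, not the poster's code: count_positive and the flat cells vector are hypothetical stand-ins for the real grid.

```cpp
#include <vector>

// Counts positive entries. Each thread counts into a private variable and
// updates the shared total once, instead of synchronizing on every element.
long count_positive(const std::vector<int>& cells) {
    long total = 0;  // shared between threads
    #pragma omp parallel
    {
        long local = 0;  // private: declared inside the parallel region
        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(cells.size()); ++i) {
            if (cells[i] > 0) ++local;
        }
        #pragma omp atomic
        total += local;  // one synchronized update per thread
    }
    return total;
}
```

With this shape the number of synchronized operations depends on the thread count rather than on the number of cells.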
You are using a guided loop schedule. This is a very bad choice given that you are dealing with a regular problem where each task can easily do exactly the same amount of work as any other if the domain is simply divided into chunks of equal size.
Replace schedule(guided) with schedule(static). Also employ sum reduction over livingCount instead of using locked increments:
#if PARALLEL == 1
#pragma omp parallel for schedule(static) num_threads(kernelsCount) \
        reduction(+:livingCount)
#endif
for (int offsetI = 0; offsetI < n; offsetI += kernelPartSizeN)
{
    for (int offsetJ = 0; offsetJ < m; offsetJ += kernelPartSizeM)
    {
        int boundsN = min(kernelPartSizeN, n - offsetI),
            boundsM = min(kernelPartSizeM, m - offsetJ);
        for (int kernelOffsetI = 0; kernelOffsetI < boundsN; ++kernelOffsetI)
        {
            for (int kernelOffsetJ = 0; kernelOffsetJ < boundsM; ++kernelOffsetJ)
            {
                if(BirthCell(offsetI + kernelOffsetI, offsetJ + kernelOffsetJ))
                {
                    ++livingCount;
                }
            }
        }
    }
}
I'm currently parallelizing a program using OpenMP on a 4-core Phenom II. However, I noticed that my parallelization does not do anything for the performance. Naturally I assumed I had missed something (false sharing, serialization through locks, ...), but I was unable to find anything like that. Furthermore, from the CPU utilization it seemed like the program was executed on only one core. From what I found, sched_getcpu() should give me the ID of the core on which the calling thread is currently scheduled. So I wrote the following test program:
#include <iostream>
#include <sstream>
#include <omp.h>
#include <sched.h>   // declares sched_getcpu()
#include <random>
int main(){
    #pragma omp parallel
    {
        std::default_random_engine rand;
        int num = 0;
        #pragma omp for
        for(size_t i = 0; i < 1000000000; ++i) num += rand();
        auto cpu = sched_getcpu();
        std::ostringstream os;
        os<<"\nThread "<<omp_get_thread_num()<<" on cpu "<<sched_getcpu()<<std::endl;
        std::cout<<os.str()<<std::flush;
        std::cout<<num;
    }
}
On my machine this gives the following output(the random numbers will vary of course):
Thread 2 on cpu 0 num 127392776
Thread 0 on cpu 0 num 1980891664
Thread 3 on cpu 0 num 431821313
Thread 1 on cpu 0 num -1976497224
From this I assume that all threads execute on the same core (the one with ID 0). To be more certain I also tried the approach from this answer. The results were the same. Additionally, using #pragma omp parallel num_threads(1) didn't make the execution slower (slightly faster, in fact), lending credibility to the theory that all threads use the same CPU; however, the fact that the CPU is always displayed as 0 makes me kind of suspicious. I also checked GOMP_CPU_AFFINITY, which was initially not set, so I tried setting it to 0 1 2 3, which from what I understand should bind each thread to a different core. However, that didn't make a difference.
Since I develop on a Windows system, I use Linux in VirtualBox for my development. So I thought that maybe the virtual system couldn't access all cores. However, checking the VirtualBox settings showed that the virtual machine should get all 4 cores, and executing my test program 4 times at the same time seems to use all 4 cores, judging from the CPU utilization (and the fact that the system became very unresponsive).
So my question is basically: what exactly is going on here? More to the point:
Is my deduction that all threads use the same core correct? If it is, what could be the reasons for that behaviour?
After some experimentation I found out that the problem was that I was starting my program from inside the Eclipse IDE, which seemingly set the affinity to use only one core. I thought I had the same problem when starting from outside the IDE, but a repeated test showed that the program works just fine when started from the terminal instead of from inside the IDE.
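A quick way to confirm this kind of problem is to ask the kernel which CPUs the process is allowed to run on, since a launcher such as an IDE can pass a restricted affinity mask down to the programs it starts. The sketch below is Linux/glibc specific, and allowed_cpu_count is a hypothetical helper name.

```cpp
#include <sched.h>   // sched_getaffinity, cpu_set_t, CPU_COUNT (Linux/glibc)

// Number of CPUs the calling process may run on, or -1 if the query fails.
int allowed_cpu_count() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    return CPU_COUNT(&mask);
}
```

If this returns 1 when the program is started from the IDE and 4 when it is started from a terminal, the launcher is restricting the affinity rather than OpenMP or the VM.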
I compiled your program using g++ 4.6 on Linux
g++ --std=c++0x -fopenmp test.cc -o test
The output was, unsurprisingly:
Thread 2 on cpu 2
Thread 3 on cpu 1
910270973
Thread 1 on cpu 3
910270973
Thread 0 on cpu 0
910270973910270973
The fact that 4 threads are started (if you have not set the number of threads in any way, e.g. using OMP_NUM_THREADS) should imply that the program is able to see 4 usable CPUs. I cannot guess why it is not using them but I suspect a problem in your hardware/software setting, in some environment variable, or in the compiler options.
You should use #pragma omp parallel for
And yes, you're right about not needing OMP_NUM_THREADS. omp_set_num_threads(4); should also have done fine.
If you are running on Windows, try this:
c:\windows\system32\cmd.exe /C start /affinity F path\to\your\program.exe
/affinity 1 uses CPU0
/affinity 2 uses CPU1
/affinity 3 uses CPU0 and CPU1
/affinity 4 uses CPU2
/affinity F uses all 4 cores
The number is a hexadecimal bit mask: written in binary and read from the right, each set bit selects one core (bit 0 is CPU0, bit 1 is CPU1, and so on).
You can verify the affinity while the program is running using Task Manager.
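The mask arithmetic can be checked with a few lines of code. This is just an illustrative sketch; cpus_in_mask is a hypothetical helper, not part of any Windows API.

```cpp
#include <string>
#include <vector>

// Decodes an affinity mask written in hexadecimal (as passed to /affinity)
// into the list of CPU indices whose bits are set.
std::vector<int> cpus_in_mask(const std::string& hex_mask) {
    unsigned long long mask = std::stoull(hex_mask, nullptr, 16);
    std::vector<int> cpus;
    for (int bit = 0; mask != 0; ++bit, mask >>= 1) {
        if (mask & 1ULL) cpus.push_back(bit);
    }
    return cpus;
}
```

For example, cpus_in_mask("3") gives CPUs 0 and 1, and cpus_in_mask("F") gives CPUs 0 to 3, matching the list above.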
I believe I am experiencing false sharing using OpenMP. Is there any way to identify it and fix it?
My code is: https://github.com/wchan/libNN/blob/master/ResilientBackpropagation.hpp line 36.
Using a 4-core CPU instead of the single-threaded 1-core version yielded only 10% additional performance. On a NUMA system with 32 physical (64 virtual) CPUs, the CPU utilization is stuck at around 1.5 cores. I think this is a direct symptom of false sharing and the resulting inability to scale.
I also tried running it with the Intel VTune profiler; it reported that most of the time is spent in the "f()" and "+=" functions. I believe this is reasonable and doesn't really explain why I am getting such poor scaling...
Any ideas/suggestions?
Thanks.
Use reduction instead of explicitly indexing an array based on the thread ID. That array virtually guarantees false sharing.
i.e. replace this
#pragma omp parallel for
clones[omp_get_thread_num()]->mse() += norm_2(dedy);
for (int i = 0; i < omp_get_max_threads(); i++) {
neural_network->mse() += clones[i]->mse();
with this:
#pragma omp parallel for reduction(+ : mse)
mse += norm_2(dedy);
neural_network->mse() = mse;
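For reference, here is the same reduction pattern as a self-contained function. It is a sketch under assumptions: sum_squared_error and the errors vector are hypothetical stand-ins, since the real loop body lives in the linked repository.

```cpp
#include <vector>

// Sums squared errors with an OpenMP reduction: each thread accumulates into
// its own private copy of mse, and the copies are combined after the loop.
double sum_squared_error(const std::vector<double>& errors) {
    double mse = 0.0;
    #pragma omp parallel for reduction(+ : mse)
    for (long i = 0; i < static_cast<long>(errors.size()); ++i) {
        mse += errors[i] * errors[i];
    }
    return mse;
}
```

Because every thread works on its own private accumulator, no two threads write to neighbouring elements of a shared array, which is what caused the false sharing in the per-thread clones approach.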
One way of knowing for sure is to look at cache statistics with a tool like cachegrind:
valgrind --tool=cachegrind [command]
I have a small C++ program using OpenMP. It works fine on Windows 7 on a Core i7 with Visual Studio 2010. On an iMac with a Core i7 and g++ v4.2.1, the code runs much more slowly using 4 threads than it does with just one. The same 'slower' behavior is exhibited on 2 other Red Hat machines using g++.
Here is the code:
int iHundredMillion = 100000000;
int iNumWorkers = 4;
std::vector<Worker*> workers;
for(int i=0; i<iNumWorkers; ++i)
{
    Worker * pWorker = new Worker();
    workers.push_back(pWorker);
}
int iThr;
#pragma omp parallel for private (iThr) // Parallel run
for(int k=0; k<iNumWorkers; ++k)
{
    iThr = omp_get_thread_num();
    workers[k]->Run( (3)*iHundredMillion, iThr );
}
I'm compiling with g++ like this:
g++ -fopenmp -O2 -o a.out *.cpp
Can anyone tell me what silly mistake I'm making on the *nix platform?
I'm thinking that the g++ compiler is not optimizing as well as the visual studio compiler. Can you try other optimization levels (like -O3) and see if it makes a difference?
Or you could try some other compiler. Intel offers free compilers for Linux for non-commercial purposes.
http://software.intel.com/en-us/articles/non-commercial-software-development/
It's impossible to answer given the information provided, but one guess could be that your code is designed so it can't be executed efficiently on multiple threads.
I haven't worked a lot with OpenMP, but I believe it is allowed to use fewer worker threads than specified. In that case, some implementations could be clever enough to realize that the code can't be efficiently parallelized and just run it on a single thread, while others naively try to run it on 4 cores and suffer the performance penalty (due to false (or real) sharing, for example).
Some of the information that'd be necessary in order to give you a reasonable answer is:
the actual timings (how long does the code take to run on a single thread? How long with 4 threads using OpenMP? How long with 4 threads using "regular" threads?)
the data layout: which data is allocated where, and when is it accessed?
what actually happens inside the loop? All we can see at the moment is a multiplication and a function call. As long as we don't know what happens inside the function, you might as well have posted this code: foo(42) and asked why it doesn't return the expected result.
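When collecting the timings asked for above, it is worth measuring wall-clock time. std::clock() reports CPU time for the whole process, which on Linux and macOS adds up across threads and can make a parallel run look slower than it really is. A minimal sketch, with seconds_for being a hypothetical helper name:

```cpp
#include <chrono>
#include <functional>

// Wall-clock seconds taken by one call to work().
double seconds_for(const std::function<void()>& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping the parallel loop from the question in a lambda and passing it to seconds_for, once with 1 thread and once with 4, would give comparable numbers on all three platforms. Whether this explains the slowdown in the question is only a guess without seeing how the original timings were taken.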