OpenMP threads executing on the same CPU core - C++

I'm currently parallelizing a program using OpenMP on a 4-core Phenom II. However, I noticed that my parallelization does not do anything for the performance. Naturally I assumed I had missed something (false sharing, serialization through locks, ...), but I was unable to find anything like that. Furthermore, from the CPU utilization it seemed like the program was executed on only one core. From what I found, sched_getcpu() should give me the ID of the core the thread executing the call is currently scheduled on. So I wrote the following test program:
#include <iostream>
#include <sstream>
#include <omp.h>
#include <utmpx.h>
#include <random>

int main(){
    #pragma omp parallel
    {
        std::default_random_engine rand;
        int num = 0;
        #pragma omp for
        for(size_t i = 0; i < 1000000000; ++i) num += rand();
        auto cpu = sched_getcpu();
        std::ostringstream os;
        os<<"\nThread "<<omp_get_thread_num()<<" on cpu "<<sched_getcpu()<<std::endl;
        std::cout<<os.str()<<std::flush;
        std::cout<<num;
    }
}
On my machine this gives the following output (the random numbers will vary, of course):
Thread 2 on cpu 0 num 127392776
Thread 0 on cpu 0 num 1980891664
Thread 3 on cpu 0 num 431821313
Thread 1 on cpu 0 num -1976497224
From this I assume that all threads execute on the same core (the one with ID 0). To be more certain I also tried the approach from this answer. The results were the same. Additionally, using #pragma omp parallel num_threads(1) didn't make the execution slower (it was slightly faster, in fact), lending credibility to the theory that all threads use the same CPU. However, the fact that the CPU is always displayed as 0 makes me somewhat suspicious. Additionally I checked GOMP_CPU_AFFINITY, which was initially not set, so I tried setting it to 0 1 2 3, which should bind each thread to a different core from what I understand. However, that didn't make a difference.
Since I develop on a Windows system, I use Linux in VirtualBox for my development. So I thought that maybe the virtual system couldn't access all cores. However, checking the VirtualBox settings showed that the virtual machine should get all 4 cores, and executing my test program 4 times at the same time seems to use all 4 cores, judging from the CPU utilization (and the fact that the system was getting very unresponsive).
So my question is basically: what exactly is going on here? More to the point:
Is my deduction that all threads use the same core correct? If it is, what could be the reasons for that behaviour?

After some experimentation I found out that the problem was that I was starting my program from inside the Eclipse IDE, which seemingly set the affinity to use only one core. I thought I got the same problem when starting from outside of the IDE, but a repeated test showed that the program works just fine when started from the terminal instead of from inside the IDE.
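For anyone hitting the same symptom: one way to check whether the launching environment (an IDE, a batch system, ...) has restricted the process affinity is to query it directly. A minimal sketch using the Linux sched_getaffinity() call (the output format is my own):

// Sketch: print which CPUs the current process is allowed to run on.
// CPU_* macros are glibc extensions; g++ on Linux enables them by default.
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {   // 0 = current process
        std::perror("sched_getaffinity");
        return 1;
    }
    std::printf("Allowed CPUs:");
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &mask))
            std::printf(" %d", cpu);
    std::printf("\n");   // e.g. "Allowed CPUs: 0" would match the behaviour above
    return 0;
}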

I compiled your program using g++ 4.6 on Linux
g++ --std=c++0x -fopenmp test.cc -o test
The output was, unsurprisingly:
Thread 2 on cpu 2
Thread 3 on cpu 1
910270973
Thread 1 on cpu 3
910270973
Thread 0 on cpu 0
910270973910270973
The fact that 4 threads are started (if you have not set the number of threads in any way, e.g. using OMP_NUM_THREADS) should imply that the program is able to see 4 usable CPUs. I cannot guess why it is not using them, but I suspect a problem in your hardware/software setup, in some environment variable, or in the compiler options.

You should use #pragma omp parallel for
And yes, you're right about not needing OMP_NUM_THREADS. omp_set_num_threads(4); should also have worked fine.
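For completeness, a minimal sketch of requesting the thread count from code rather than from the environment (illustrative only):

#include <omp.h>
#include <cstdio>

int main() {
    omp_set_num_threads(4);   // request 4 threads for subsequent parallel regions
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}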

If you are running on Windows, try this:
c:\windows\system32\cmd.exe /C start /affinity F path\to\your\program.exe
/affinity 1 uses CPU0
/affinity 2 uses CPU1
/affinity 3 uses CPU0 and CPU1
/affinity 4 uses CPU2
/affinity F uses all 4 cores
Convert the number to hex and look at the bits from the right; each set bit selects a core to be used.
You can verify the affinity while it's running using Task Manager.
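If the bit arithmetic is unclear, here is a small sketch that decodes an affinity mask into core numbers (the helper name is made up):

#include <cstdio>

// Bit 0 of the mask selects CPU0, bit 1 selects CPU1, and so on.
void printAffinity(unsigned mask) {
    std::printf("mask 0x%X ->", mask);
    for (int cpu = 0; mask != 0; ++cpu, mask >>= 1)
        if (mask & 1u)
            std::printf(" CPU%d", cpu);
    std::printf("\n");
}

int main() {
    printAffinity(0x1); // CPU0
    printAffinity(0x3); // CPU0 and CPU1
    printAffinity(0xF); // all 4 cores
    return 0;
}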

Related

OpenMP task directive slower multithreaded than singlethreaded

I've encountered a problem where the task directive seems to slow down the execution time of the code the more threads I have. I have removed all of the unnecessary stuff from my code that isn't related to the problem, since the problem still occurs even for this slimmed-down piece of code that doesn't really do anything. The general idea of this code is that the master thread generates tasks for all the other worker threads to execute.
#ifndef _REENTRANT
#define _REENTRANT
#endif
#include <vector>
#include <iostream>
#include <random>
#include <sched.h>
#include <semaphore.h>
#include <time.h>
#include <bits/stdc++.h>
#include <sys/times.h>
#include <stdio.h>
#include <stdbool.h>
#include <omp.h>
#include <chrono>

#define MAXWORKERS 16

using namespace std;

int nbrThreads = MAXWORKERS; // Number of threads

void busyWait() {
    for (int i = 0; i < 999; i++) {}
}

void generatePlacements() {
    #pragma omp parallel
    {
        #pragma omp master
        {
            int j = 0;
            while (j < 8*7*6*5*4*3*2) {
                #pragma omp task
                {
                    busyWait();
                }
                j++;
            }
        }
    }
}

int main(int argc, char const *argv[])
{
    for (int i = 1; i <= MAXWORKERS; i++) {
        int nbrThreads = i;
        omp_set_num_threads(nbrThreads);
        auto begin = omp_get_wtime();
        generatePlacements();
        double elapsed;
        auto end = omp_get_wtime();
        auto diff = end - begin;
        cout << "Time taken for " << nbrThreads << " threads to execute was " << diff << endl;
    }
    return 0;
}
And I get the following output from running the program:
Time taken for 1 threads to execute was 0.0707005
Time taken for 2 threads to execute was 0.0375168
Time taken for 3 threads to execute was 0.0257982
Time taken for 4 threads to execute was 0.0234329
Time taken for 5 threads to execute was 0.0208451
Time taken for 6 threads to execute was 0.0288127
Time taken for 7 threads to execute was 0.0380352
Time taken for 8 threads to execute was 0.0403016
Time taken for 9 threads to execute was 0.0470985
Time taken for 10 threads to execute was 0.0539719
Time taken for 11 threads to execute was 0.0582986
Time taken for 12 threads to execute was 0.051923
Time taken for 13 threads to execute was 0.571846
Time taken for 14 threads to execute was 0.569011
Time taken for 15 threads to execute was 0.562491
Time taken for 16 threads to execute was 0.562118
Most notable is that from 6 threads on the time seems to get slower, and going from 12 threads to 13 threads has the biggest performance hit, becoming a whopping 10 times slower. Now I know that this issue revolves around the OpenMP task directive, since if I remove the busyWait() function the performance stays the same as seen above. But if I also remove the #pragma omp task directive along with the busyWait() call, I don't get any slowdown whatsoever, so the slowdown can't depend on the thread creation. I have no clue what the problem here is.
First of all, the for (int i=0; i < 999; i++){} loop can be optimized away by the compiler when optimization flags like -O2 or -O3 are enabled. In fact, mainstream compilers like Clang and GCC remove it at -O2. Profiling a non-optimized build is a waste of time and should never be done unless you have a very good reason to do so.
Assuming you enabled optimizations, the created tasks will be empty, which means you are measuring the time to create many tasks. The thing is, creating tasks is slow, and creating many tasks that do nothing causes contention which makes the creation even slower. The task granularity should be carefully tuned so as not to put too much pressure on the OpenMP runtime. Assuming you did not enable optimizations, even a loop of 999 iterations is not enough to keep the runtime from being under pressure (it should last less than 1 µs on mainstream machines). Tasks should last at least a few microseconds for the overhead not to be the main bottleneck. On mainstream servers with many cores, it should be at least dozens of microseconds. For the overhead to be negligible, tasks should last even longer. Task scheduling is powerful but expensive.
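To illustrate the granularity point, one common remedy is to have each task cover a batch of iterations instead of a single busyWait() call. A minimal sketch (doWork() and the batch size of 64 are placeholders you would tune):

#include <omp.h>
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Stand-in for real per-iteration work that the compiler cannot remove.
static double doWork(int j) { return std::sin(j) * std::cos(j); }

int main() {
    const int total = 8*7*6*5*4*3*2;   // same iteration count as above
    const int batch = 64;              // tuning knob: iterations per task
    std::vector<double> results(total);

    #pragma omp parallel
    #pragma omp master
    for (int start = 0; start < total; start += batch) {
        const int end = std::min(start + batch, total);
        // One task now covers 'batch' iterations, so the per-task scheduling
        // overhead is amortised over more work. start/end are firstprivate in
        // the task by default; 'results' stays shared.
        #pragma omp task
        for (int j = start; j < end; ++j)
            results[j] = doWork(j);
    }   // the barrier at the end of the parallel region waits for all tasks

    std::printf("%f\n", results[0]);   // keep the results observable
    return 0;
}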
Due to the use of shared data structures protected with atomics and locks in OpenMP runtimes, the contention tends to grow with the number of cores. On NUMA systems, it can be significantly higher when using multiple NUMA nodes due to NUMA effects. AMD processors with 16 cores typically have multiple NUMA nodes. Using SMT (multiple hardware threads per physical core) does not significantly speed up this operation and adds more pressure on the OpenMP scheduler and the OS scheduler, so it is generally not a good idea to use more threads than cores in this case (it can be worth it when the tasks' computational work can benefit from SMT, for latency-bound tasks for example, and when the overhead is small).
For more information about the overhead of mainstream OpenMP runtimes, please consider reading "On the Impact of OpenMP Task Granularity".

MPI & OpenMP: omp_get_max_threads returns half of the true thread capacity

For an unknown reason, when I compile with MPI, the number of threads returned by omp_get_max_threads() (12) is half of my computer's capacity (24 hardware threads, 12 cores). This strange behaviour appeared without apparent reason two days ago, while everything was working well before. I have installed MPI from source. I tried to install it again in the same way, but still faced the same problem.
I have read several posts and seen the oversubscribe and omp_set_dynamic() solutions, but I am not satisfied with either of these, because I am running a cluster with different types of machines, and I really want to determine the maximum number of threads dynamically.
How can I find the variables driving the default result of omp_get_max_threads()?
#include <omp.h>
#include <mpi.h>
#include <iostream>

int main(int argc, char** argv){
    int provided;
    MPI_Init_thread(NULL, NULL, MPI_THREAD_SINGLE, &provided);
    //int num_cpu = omp_get_max_threads();
    //omp_set_dynamic(0);
    //omp_set_num_threads(4);
    std::cout << omp_get_max_threads() << std::endl;
    MPI_Finalize();
    return 0;
}
I compile it with mpicxx not_fun.cpp -fopenmp -o not_fun. If I execute it with ./not_fun the result is 24, which is correct. If I execute it with mpiexec -np 4 ./not_fun the result is 12, which is not correct.
Additional information
I am not sure if this is related.
I did have a problem with my RAM: one stick was not working properly, so I removed it. It could perhaps be related, but I don't think so. I installed MPI again with the new RAM configuration and still got the same problem.
You're misunderstanding what a thread is. A thread is a software construct, having nothing to do with hardware. Try writing a program that prints out the number of threads, and run OMP_NUM_THREADS=321 ./yourprogram. It will report both the maximum and actual number of threads to be 321. If you want something related to the number of cores, use omp_get_num_procs. (And just to be clear: the number of threads comes from the OMP_NUM_THREADS environment variable, or any explicit override in your source.)
If you write an MPI program and do the same thing, you'll find that (probably) each MPI process gets an equal share of "procs", and the product of the MPI processes (on one node) and the OMP procs per process will be less than or equal to the core count. But that may depend on your implementation.
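To see what each rank is actually given, a small check like the following can help (a sketch; the exact numbers depend on how your MPI launcher binds processes):

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // omp_get_num_procs() reflects the CPUs this process is allowed to use;
    // many MPI launchers restrict that per rank through their binding policy.
    std::printf("rank %d: num_procs=%d, max_threads=%d\n",
                rank, omp_get_num_procs(), omp_get_max_threads());
    MPI_Finalize();
    return 0;
}

If the per-rank num_procs is 12 rather than 24, the launcher's binding is the likely cause; assuming your MPI is Open MPI, running mpiexec --bind-to none is one way to test that hypothesis.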

OpenMP in Qt on VirtualBox uses only one thread

I am trying to parallelise part of a C++ program using OpenMP, in Qt Creator on Linux in VirtualBox. The host system has a 4-core CPU. Since my initial attempts at using OpenMP pragmas didn't seem to work (the code with OpenMP took almost the same time as the code without), I went back to the OpenMP wiki and tried to run this simple example:
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    printf("Hello, world.\n");
    return 0;
}
and the output is just
'Hello, world'.
I also tried to run this piece of code
#include <iostream>
#include <omp.h>
using namespace std;

int main () {
    int thread_number;
    #pragma omp parallel private(thread_number)
    {
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < 50; i++) {
            thread_number = omp_get_thread_num();
            cout << "Thread " << thread_number << " says " << i << endl;
        }
    }
    return 0;
}
and the output is:
Thread 0 says 0
Thread 0 says 1
Thread 0 says 2
.
.
.
.
Thread 0 says 49
So it looks like there is no parallelising happening after all. I have set QMAKE_CXXFLAGS += -fopenmp and QMAKE_LFLAGS += -fopenmp in the .pro file. Is this happening because I am running it from a virtual machine? How do I make multithreading work here? I would really appreciate any suggestions/pointers. Thank you.
Your problem is that VirtualBox always defaults to a machine with one core. Go to Settings/System/Processor and increase the number of CPUs to the number of hardware threads (4 in your case, or 8 if you have hyperthreading). If you have hyperthreading, VirtualBox will warn you that you chose more CPUs than physical CPUs. Ignore the warning.
I set my CPUs to eight. When I use OpenMP in GCC on Windows I get eight threads.
Edit: According to VirtualBox's manual, you should set the number of CPUs to the number of physical cores, not the number of hyper-threads:
You should not, however, configure virtual machines to use more CPU cores than you have available physically (real cores, no hyperthreads).
Try setting the environment variable OMP_NUM_THREADS. The default may be 1 if your virtual machine says it has a single core (this was happening to me).
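A quick way to check what the guest actually exposes to OpenMP is to print the proc and thread counts (a small sketch, to be run inside the VM):

#include <omp.h>
#include <cstdio>

int main() {
    // If the VM is configured with a single virtual CPU, both values
    // will typically be 1, which would explain the single-threaded output.
    std::printf("omp_get_num_procs():   %d\n", omp_get_num_procs());
    std::printf("omp_get_max_threads(): %d\n", omp_get_max_threads());
    return 0;
}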

Why is this OpenMP program slower than single-thread?

Please look at this code.
Single-threaded program: http://pastebin.com/KAx4RmSJ. Compiled with:
g++ -lrt -O2 main.cpp -o nnlv2
Multithread with openMP: http://pastebin.com/fbe4gZSn
Compiled with:
g++ -lrt -fopenmp -O2 main_openmp.cpp -o nnlv2_openmp
I tested it on a dual-core system (so we have two threads running in parallel). But the multi-threaded version is slower than the single-threaded one (and shows unstable timing; try running it a few times). What's wrong? Where did I make a mistake?
Some tests:
Single-thread:
Layers Neurons Inputs --- Time (ns)
10 200 200 --- 1898983
10 500 500 --- 11009094
10 1000 1000 --- 48116913
Multi-thread:
Layers Neurons Inputs --- Time (ns)
10 200 200 --- 2518262
10 500 500 --- 13861504
10 1000 1000 --- 53446849
I don't understand what is wrong.
Is your goal here to study OpenMP, or to make your program faster? If the latter, it would be more worthwhile to write multiply-add code, reduce the number of passes, and incorporate SIMD.
Step 1: Combine loops and use multiply-add:
// remove the variable 'temp' completely
for(int i=0;i<LAYERS;i++)
{
    for(int j=0;j<NEURONS;j++)
    {
        outputs[j] = 0;
        for(int k=0,l=0;l<INPUTS;l++,k++)
        {
            outputs[j] += inputs[l] * weights[i][k];
        }
        outputs[j] = sigmoid(outputs[j]);
    }
    std::swap(inputs, outputs);
}
Compiling with -static and -p, running, and then parsing gmon.out with gprof, I got:
45.65% gomp_barrier_wait_end
That's a lot of time in OpenMP's barrier routine; that is the time spent waiting for the other threads to finish. Since you're running the parallel for loops many times (LAYERS), you lose the advantage of running in parallel: every time a parallel for loop is finished, there is an implicit barrier which won't return until all other threads finish.
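One way to reduce the per-layer parallel-region overhead is to create the thread team once and keep only the per-layer work-sharing inside it. A rough sketch of the idea, reusing the combined loop from the earlier answer (the sigmoid helper, the function signature, and the weights layout are my assumptions, since the full program is only linked as a pastebin; a barrier per layer still remains because of the buffer swap):

#include <omp.h>
#include <cmath>
#include <utility>

// Hypothetical stand-in for the program's activation function.
static inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Sketch: one parallel region for all layers instead of LAYERS separate ones.
void forward(float*& inputs, float*& outputs, float** weights,
             int LAYERS, int NEURONS, int INPUTS)
{
    #pragma omp parallel
    for (int i = 0; i < LAYERS; ++i)
    {
        #pragma omp for                // implicit barrier keeps the layers in order
        for (int j = 0; j < NEURONS; ++j)
        {
            outputs[j] = 0;
            for (int k = 0, l = 0; l < INPUTS; ++l, ++k)
                outputs[j] += inputs[l] * weights[i][k];
            outputs[j] = sigmoid(outputs[j]);
        }
        #pragma omp single             // one thread swaps the buffers for the next layer
        std::swap(inputs, outputs);
    }
}

The team is created once, so what remains is the per-layer barrier rather than repeated region start-up; whether that helps noticeably depends on how much work each layer actually does.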
Before anything else, run the test in the multi-threaded configuration and MAKE SURE that procexp or Task Manager shows 100% CPU usage for it. If it doesn't, then you aren't using multiple threads or multiple processor cores.
Also, taken from wiki:
Environment variables
A method to alter the execution features of OpenMP applications. Used to control loop iterations scheduling, default number of threads, etc. For example OMP_NUM_THREADS is used to specify number of threads for an application.
I don't see where you have actually used OpenMP - try #pragma omp parallel for above the main loop... (documented here, for example)
The slowness possibly comes from including OpenMP and its initialisation, adding code bloat, or otherwise changing the compilation as a result of the compiler flags you introduced to enable it. Alternatively, the loops are so small and simple that the overhead of threading far exceeds the performance gain.

OpenMP 'slower' on iMac? (C++)

I have a small C++ program using OpenMP. It works fine on Windows 7 (Core i7, Visual Studio 2010). On an iMac with a Core i7 and g++ v4.2.1, the code runs much more slowly using 4 threads than it does with just one. The same 'slower' behavior is exhibited on 2 other Red Hat machines using g++.
Here is the code:
int iHundredMillion = 100000000;
int iNumWorkers = 4;

std::vector<Worker*> workers;
for(int i=0; i<iNumWorkers; ++i)
{
    Worker * pWorker = new Worker();
    workers.push_back(pWorker);
}

int iThr;
#pragma omp parallel for private (iThr) // Parallel run
for(int k=0; k<iNumWorkers; ++k)
{
    iThr = omp_get_thread_num();
    workers[k]->Run( (3)*iHundredMillion, iThr );
}
I'm compiling with g++ like this:
g++ -fopenmp -O2 -o a.out *.cpp
Can anyone tell me what silly mistake I'm making on the *nix platform?
I'm thinking that the g++ compiler is not optimizing as well as the Visual Studio compiler. Can you try other optimization levels (like -O3) and see if it makes a difference?
Or you could try some other compiler. Intel offers free compilers for Linux for non-commercial purposes.
http://software.intel.com/en-us/articles/non-commercial-software-development/
It's impossible to answer given the information provided, but one guess could be that your code is designed so it can't be executed efficiently on multiple threads.
I haven't worked a lot with OMP, but I believe it is allowed to use fewer worker threads than specified. In that case, some implementations could be clever enough to realize that the code can't be efficiently parallelized and just run it on a single thread, while others naively try to run it on 4 cores and suffer a performance penalty (due to false (or real) sharing, for example).
Some of the information that'd be necessary in order to give you a reasonable answer is:
the actual timings (how long does the code take to run on a single thread? How long with 4 threads using OMP? How long with 4 threads using "regular" threads?)
the data layout: which data is allocated where, and when is it accessed?
what actually happens inside the loop? All we can see at the moment is a multiplication and a function call. As long as we don't know what happens inside the function, you might as well have posted this code: foo(42) and asked why it doesn't return the expected result.