I use Ubuntu and wrote several lines of code, but only one thread is created. When I run the nproc command in my terminal, the output is 2. My code is below:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int nthreads, tid;
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread = %d\n", tid);
        /* only the master thread does this */
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }
    return 0;
}
The output:
Thread = 0
Number of threads = 1
How can I get this to run in parallel?
If you are using gcc/g++, you must make sure you enable OpenMP support with the -fopenmp compiler and linker option. Specifying it during linking links in the appropriate runtime library (-lgomp).
Compile with something like:
g++ -fopenmp myfile.c -o exec
or:
g++ -c myfile.c -fopenmp
g++ -o exec myfile.o -fopenmp
If you leave out the -fopenmp compile option, your program will still compile, but it will run as if OpenMP weren't being used. If your program doesn't call omp_set_num_threads to set the number of threads, it can be set from the command line:
OMP_NUM_THREADS=8 ./exec
I think the default is generally the number of cores on a particular system.
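For completeness, here is a minimal sketch of the omp_set_num_threads alternative mentioned above, setting the thread count in code rather than via the environment (the count of 4 is just an example; compile with -fopenmp):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(4); /* request 4 threads for subsequent parallel regions */
    #pragma omp parallel
    {
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}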
I am trying to run the following example MPI code that launches 20 threads and keeps them busy for a while. However, when I check the CPU utilization using a tool like nmon or top, I see that only a single thread is being used.
#include <iostream>
#include <thread>
#include <cstdlib> // for exit()
#include <mpi.h>

using namespace std;

int main(int argc, char *argv[]) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided != MPI_THREAD_FUNNELED)
        exit(1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    auto f = [](float x) {
        float result = 0;
        for (float i = 0; i < x; i++) { result += 10 * i + x; }
        cout << "Result: " << result << endl;
    };

    thread threads[20];
    for (int i = 0; i < 20; ++i)
        threads[i] = thread(f, 100000000.f); // do some work
    for (auto& th : threads)
        th.join();

    MPI_Finalize();
    return 0;
}
I compile this code using mpicxx: mpicxx -std=c++11 -pthread example.cpp -o example and run it using mpirun: mpirun -np 1 example.
I am using Open MPI version 4.1.4 that is compiled with posix thread support (following the explanation from this question).
$ mpicxx --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
$ mpirun --version
mpirun (Open MPI) 4.1.4
$ ompi_info | grep -i thread
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
FT Checkpoint support: no (checkpoint thread: no)
$ mpicxx -std=c++11 -pthread example.cpp -o example
$ ./example
My CPU has 10 cores and 20 threads and runs the example code above without MPI on all 20 threads. So, why does the code with MPI not run on all threads?
I suspect I might need to do something with MPI bindings, which I see being mentioned in some answers on the same topic (1, 2), but other answers entirely exclude these options, so I'm unsure whether this is the correct approach.
mpirun -np 1 ./example assigns a single core to your program (so the 20 threads end up time-sharing it): this is the default behavior for Open MPI (i.e., 1 core per MPI process when running with -np 1 or -np 2).
./example (i.e., singleton mode) should use all the available cores, unless you are already running on a subset.
If you want to use all the available cores with mpirun, you can
mpirun --bind-to none -np 1 ./example
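To verify what mpirun is actually binding to, Open MPI also provides the --report-bindings option (assuming your version supports it), which prints the binding of each rank at startup:

mpirun --report-bindings --bind-to none -np 1 ./example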
The library PETSc runs some test programs during configuration while checking the environment. One of those test programs is the following (shown with its two relative includes removed):
#include <stdlib.h>
#include <mpi.h>
int main() {
    int size;
    int ierr;
    MPI_Init(0, 0);
    ierr = MPI_Type_size(MPI_LONG_DOUBLE, &size);
    if (ierr || (size == 0)) exit(1);
    MPI_Finalize();
    return 0;
}
Configuration fails due to a timeout. When debugging the program, it gets stuck at the line MPI_Init(0, 0);, even though this call should be perfectly legal. I am using Open MPI 2 with g++ 9.2.1, running on openSUSE Tumbleweed.
The program is compiled using
mpicxx -O0 -g mpi_test.cpp -o mpi_test
I have a program that does independent computations on a bunch of images, which seems like a good fit for OpenMP:
//file: WoodhamData.cpp
#include <omp.h>
...
void WoodhamData::GenerateLightingDirection() {
    int imageWidth = (this->normalMap)->width();
    int imageHeight = (this->normalMap)->height();
    #pragma omp paralell for num_threads(2)
    for (int r = 0; r < RadianceMaps.size(); r++) {
        if (omp_get_thread_num() == 0) {
            std::cout << "threads=" << omp_get_num_threads() << std::endl;
        }
        ...
    }
}
In order to use OpenMP, I add -fopenmp to my makefile, so the compile command is:
g++ -g -o test.exe src/test.cpp src/WoodhamData.cpp -pthread -L/usr/X11R6/lib -fopenmp --std=c++0x -lm -lX11 -Ilib/eigen/ -Ilib/CImg
However, I am sad to say, my program reports threads=1 (run from the terminal as ./test.exe ...).
Does anyone know what might be wrong? This is the slowest part of my program, and it would be great to speed it up a bit.
Your OpenMP directive is wrong - it is "parallel" not "paralell".
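With the spelling corrected, the directive would read:

#pragma omp parallel for num_threads(2)

Note that compilers typically ignore unrecognized pragmas silently, which is why the misspelling compiled without complaint; a warning flag such as GCC's -Wunknown-pragmas (enabled by -Wall) may help catch this kind of typo.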
I have a basic loop:
int i, n=50000000;
for (i=0 ; i<n ; i++)
{
register float val = rand()/(float)RAND_MAX;
}
That I want to accelerate with OpenMP. I previously set:
omp_set_dynamic(0);
omp_set_num_threads(nths);
with nths=4
And the final loop is:
int i, n=50000000;
#pragma omp parallel for firstprivate(n) default(shared)
for (i=0 ; i<n ; i++)
{
register float val = rand()/(float)RAND_MAX;
}
The non-parallelized loop takes 1.12 s to execute and the parallel one takes 21.04 s (it can vary a lot depending on my Linux process priority). I am on an x86 platform with Ubuntu and 4 CPUs with 1 thread each. I compile with g++ (which I need), pass the -fopenmp flag, and link with -lgomp.
Why doesn't OpenMP accelerate this basic loop?
EDIT:
Regarding the answers I changed the inside of the loop to be:
for (i=0 ; i<n ; i++)
{
a[i]=i;
b[i]=i;
c[i]=a[i]+b[i];
}
with n=500000 and the pragma:
#pragma omp parallel for firstprivate(n) default(shared) schedule(dynamic) num_threads(4)
I also changed the code to compile with plain gcc, and I have the same problem:
With 1 Thread
Test ms = 0.003000
Test Omp ms = 19.695000
With 4 Threads
Test ms = 0.003000
Test Omp ms = 240.990000
EDIT2:
I changed the way I was measuring time when using OpenMP: instead of the clock() function I used omp_get_wtime(), and the results are much better.
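That is expected: clock() measures CPU time accumulated across all threads of the process, so a region keeping 4 cores busy can report roughly 4x the elapsed time, while omp_get_wtime() measures wall-clock time. A minimal timing sketch (the array names mirror the edited example above; the size is illustrative):

#include <omp.h>
#include <stdio.h>

#define N 50000000

static float a[N], b[N], c[N];

int main(void)
{
    double t0 = omp_get_wtime(); /* wall-clock start */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i;
        c[i] = a[i] + b[i];
    }
    double t1 = omp_get_wtime(); /* wall-clock end */
    printf("elapsed: %f s (c[1] = %f)\n", t1 - t0, c[1]);
    return 0;
}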
I quickly ran your code on my system.
First of all, in the array-addition case, 50M iterations is barely enough to show a win, but it does show one - if OpenMP is set up correctly.
In your case, the schedule(dynamic) clause is killing you - it tells the runtime to hand work out to the team on the fly. That would make sense if you could not predetermine your workload, but here the workload is perfectly predictable, as the effort per iteration is exactly the same.
I get the following results after editing your example (see below) and running on a hyperthreaded CPU with the cores all fixed on the lowest frequency. I compiled using gcc 4.9.3:
time ./testseq && time ./testpar
real 0m0.576s
user 0m0.504s
sys 0m0.072s
real 0m0.285s
user 0m0.968s
sys 0m0.123s
As you can see, the real value, which is the "wallclock time", roughly halves. The user time increases, because of thread startup and shutdown.
The parallelized results change considerably if I add the schedule(dynamic) clause:
real 0m4.181s
user 0m14.886s
sys 0m1.283s
All of the extra workload is spent on threads finishing a small batch of work and then looking for the next one. That requires taking a lock - and that kills your second example. Please only use schedule(dynamic) when you have load-balancing issues, i.e., where the amount of work per iteration varies wildly.
For full disclosure, I ran with the following Makefile:
CXXFLAGS=-std=c++11 -I. -Wall -Wextra -g -pthread

all: testseq testpar

testpar: test.cpp
	${CXX} -o $@ $^ -fopenmp ${CXXFLAGS}

testseq: test.cpp
	${CXX} -o $@ $^ ${CXXFLAGS}

clean:
	rm -f *.o *~ testseq testpar
and test.cpp:
#include <omp.h>

constexpr int n = 50 * 1000 * 1000;
float a[n];
float b[n];
float c[n];

int main(void) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        a[i] = i;
        b[i] = i;
        c[i] = a[i] + b[i];
    }
}
Note that I also took away the other clauses to your parallel for - you need none of them.
rand() is not a thread-safe function. There is a PRNG inside that has state and therefore cannot be invoked by multiple threads without synchronization. Use a different PRNG (C++11 has a bunch of them, as does Boost), use one generator per thread, and don't forget to seed them with different values (beware of time(NULL)).
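A minimal sketch of the one-generator-per-thread idea (the seed offset and the distribution here are illustrative choices, not prescriptive; compile with -fopenmp):

#include <omp.h>
#include <random>
#include <cstdio>

int main()
{
    #pragma omp parallel
    {
        /* one generator per thread, seeded differently for each thread */
        std::mt19937 gen(12345u + omp_get_thread_num());
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);

        float sum = 0.0f;
        #pragma omp for
        for (int i = 0; i < 50000000; i++)
            sum += dist(gen);
        std::printf("thread %d partial sum: %f\n",
                    omp_get_thread_num(), sum);
    }
    return 0;
}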
UPDATE
n being 500k may be too small to see any speedup due to the overhead of thread creation. Could you test with, e.g., n set to 500M? Moreover, the times without OpenMP are suspiciously low (maybe not for such a small n). What are you doing with the a, b, and c arrays after the loop? If nothing, the compiler could optimize the whole loop away in the sequential code. Do something with these arrays, e.g., print out their sum (outside the measured section).
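For example, a sketch of that suggestion, meant to go after the timed loop in the question (sum is a new, illustrative variable):

double sum = 0;                  /* consume the results so the     */
for (int i = 0; i < n; i++)      /* compiler cannot drop the loop  */
    sum += c[i];
printf("checksum = %f\n", sum);  /* keep this outside the timing   */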
My problem is that I get no parallelization with OpenMP.
My system:
Ubuntu 11.04
Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz
Compiler:
g++ Version: 4.5.2
with flag -fopenmp
With this code I see that there is only one thread:
int nthreads, tid, procs, maxt, inpar, dynamic, nested;

// Start parallel region
#pragma omp parallel private(nthreads, tid) {
    // Obtain thread number
    tid = omp_get_thread_num();

    // Only master thread does this
    if (tid == 0)
    {
        printf("Thread %d getting environment info...\n", tid);

        // Get environment information
        procs = omp_get_num_procs();
        nthreads = omp_get_num_threads();
        maxt = omp_get_max_threads();
        inpar = omp_in_parallel();
        dynamic = omp_get_dynamic();
        nested = omp_get_nested();

        // Print environment information
        printf("Number of processors = %d\n", procs);
        printf("Number of threads = %d\n", nthreads);
        printf("Max threads = %d\n", maxt);
        printf("In parallel? = %d\n", inpar);
        printf("Dynamic threads enabled? = %d\n", dynamic);
        printf("Nested parallelism supported? = %d\n", nested);
    }
}
because I see the following output:
Number of processors = 4
Number of threads = 1
Max threads = 4
In parallel? = 0
Dynamic threads enabled? = 0
Nested parallelism supported? = 0
What is the problem?
Can someone help, please?
Your code works for me on Ubuntu 11.04 with g++ version 4.5.2; however, I had to change
#pragma omp parallel private(nthreads, tid) {
to
#pragma omp parallel private(nthreads, tid)
{
for it to compile successfully.
EDIT: If fixing the syntax doesn't work, my next question would be: what exact command are you using to compile the code?
#pragma omp parallel private(nthreads, tid) {
is incorrect syntax, as noted by hrandjet. The pragma must end with a newline, so the { should be on the next line:
#pragma omp parallel private(nthreads, tid)
{
This works for me on Windows XP.
Is the output prefaced by
Thread 0 getting environment info...
If not, the problem is as stated above - the opening brace ({) must be on its own line. To test this further, try initializing
int tid = 1;
and see if the output still shows up. If not, the #pragma is being ignored by your compiler (probably because of the brace issue).