I am trying to run the following example MPI code, which launches 20 threads and keeps them busy for a while. However, when I check CPU utilization with a tool like nmon or top, I see that only a single thread is being used.
#include <iostream>
#include <thread>
#include <mpi.h>
using namespace std;
int main(int argc, char *argv[]) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided != MPI_THREAD_FUNNELED)
        exit(1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    auto f = [](float x) {
        float result = 0;
        for (float i = 0; i < x; i++) { result += 10 * i + x; }
        cout << "Result: " << result << endl;
    };

    thread threads[20];
    for (int i = 0; i < 20; ++i)
        threads[i] = thread(f, 100000000.f); // do some work
    for (auto& th : threads)
        th.join();

    MPI_Finalize();
    return 0;
}
I compile this code using mpicxx: mpicxx -std=c++11 -pthread example.cpp -o example and run it using mpirun: mpirun -np 1 example.
I am using Open MPI version 4.1.4 that is compiled with posix thread support (following the explanation from this question).
$ mpicxx --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
$ mpirun --version
mpirun (Open MPI) 4.1.4
$ ompi_info | grep -i thread
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
FT Checkpoint support: no (checkpoint thread: no)
$ mpicxx -std=c++11 -pthread example.cpp -o example
$ ./example
My CPU has 10 cores and 20 hardware threads, and when I run the example code above without MPI it uses all 20 threads. So why does the code not run on all threads when launched through MPI?
I suspect I might need to do something with MPI process binding, which I see mentioned in some answers on the same topic (1, 2), but other answers rule these options out entirely, so I'm unsure whether this is the correct approach.
mpirun -np 1 ./example assigns a single core to your program (so the 20 threads end up time-sharing it): this is the default binding behavior for Open MPI (one core per MPI process when running with -np 1 or -np 2).
./example (i.e. singleton mode) should use all the available cores, unless you are already running on a subset of them.
If you want to use all the available cores with mpirun, you can
mpirun --bind-to none -np 1 ./example
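You can verify what the launcher actually did by adding --report-bindings to the mpirun command line, which prints the binding of each rank. As an extra check from inside the program, here is a small Linux-specific sketch (my addition, not part of the original answer) that counts the CPUs in the process affinity mask; run it with and without --bind-to none to see the effect of the binding policy.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        // With the default binding and -np 1 this typically prints 1;
        // with --bind-to none it should print all hardware threads.
        std::printf("CPUs in affinity mask: %d\n", CPU_COUNT(&mask));
    }
    return 0;
}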
Related
I have a small function written in Haskell with the following type:
foreign export ccall sget :: Ptr CInt -> CSize -> Ptr CSize -> IO (Ptr CInt)
I am calling this from multiple C++ threads running concurrently (via
TBB). During this part of the execution of my program I can barely get a
load average above 1.4 even though I'm running on a six-core CPU (12
logical cores). I therefore suspect that either the calls into Haskell all
get funnelled through a single thread, or there is some significant
synchronization going on.
I am not doing any such thing explicitly, all the function does is operate
on the incoming data (after storing it into a Data.Vector.Storable), and
return the result back as a newly allocated array (from Data.Marshal.Array).
Is there anything I need to do to fully enable concurrent calls like this?
I am using GHC 8.6.5 on Debian Linux (bullseye/testing), and I am compiling with -threaded -O2.
Looking forward to reading some advice,
Sebastian
Using the simple example at the end of this answer, if I compile with:
$ ghc -O2 Worker.hs
$ ghc -O2 -threaded Worker.o caller.c -lpthread -no-hs-main -o test
then running it with ./test occupies only one core at 100%. I need to run it with ./test +RTS -N, and then on my 4-core desktop, it runs at 400% with a load average of around 4.0.
So, the RTS -N flag controls the number of parallel threads that can simultaneously run an exported Haskell function, and there is no special action required (other than compiling with -threaded and running with +RTS -N) to fully utilize all available cores.
So, there must be something about your example that's causing the problem. It could be contention between threads over some shared data structure. Or, maybe parallel garbage collection is causing problems; I've observed parallel GC causing worse performance with increasing -N in a simple test case (details forgotten, sadly), so you could try turning off parallel GC with -qg or limiting the number of cores involved with -qn2 or something. To enable these options, you need to call hs_init_with_rtsopts() in place of the usual hs_init() as in my example.
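For example, a hypothetical invocation (flag values picked arbitrarily) that gives the runtime four capabilities but turns off parallel GC would be:
./test +RTS -N4 -qg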
If that doesn't work, I think you'll have to try to narrow down the problem and post a minimal example that illustrates the performance issue to get more help.
My example:
caller.c
#include "HsFFI.h"
#include "Rts.h"
#include "Worker_stub.h"
#include <pthread.h>
#define NUM_THREAD 4
void*
work(void* arg)
{
    for (;;) {
        fibIO(30);
    }
}

int
main(int argc, char **argv)
{
    hs_init_with_rtsopts(&argc, &argv);

    pthread_t threads[NUM_THREAD];
    for (int i = 0; i < NUM_THREAD; ++i) {
        int rc = pthread_create(&threads[i], NULL, work, NULL);
    }
    for (int i = 0; i < NUM_THREAD; ++i) {
        pthread_join(threads[i], NULL);
    }

    hs_exit();
    return 0;
}
Worker.hs
module Worker where
import Foreign
fibIO :: Int -> IO Int
fibIO = return . fib
fib :: Int -> Int
fib n | n > 1     = fib (n-1) + fib (n-2)
      | otherwise = 1
foreign export ccall fibIO :: Int -> IO Int
The library PETSc runs some test programs during configuration to check the environment. One of those test programs is the following (shown here with two relative #include lines removed):
#include <stdlib.h>
#include <mpi.h>
int main() {
    int size;
    int ierr;
    MPI_Init(0, 0);
    ierr = MPI_Type_size(MPI_LONG_DOUBLE, &size);
    if (ierr || (size == 0)) exit(1);
    MPI_Finalize();
    ;
    return 0;
}
Configuration fails due to a timeout. When debugging the program, it gets stuck at the line MPI_Init(0, 0);, even though this call should be perfectly legal. I am using Open MPI 2 with G++ 9.2.1, running on openSUSE Tumbleweed.
The program is compiled using
mpicxx -O0 -g mpi_test.cpp -o mpi_test
Q: Is it possible to improve the IO performance of this code with LLVM Clang under OS X?
test_io.cpp:
#include <iostream>
#include <string>
constexpr int SIZE = 1000*1000;
int main(int argc, const char * argv[]) {
    std::ios_base::sync_with_stdio(false);
    std::cin.tie(nullptr);

    std::string command(argv[1]);
    if (command == "gen") {
        for (int i = 0; i < SIZE; ++i) {
            std::cout << 1000*1000*1000 << " ";
        }
    } else if (command == "read") {
        int x;
        for (int i = 0; i < SIZE; ++i) {
            std::cin >> x;
        }
    }
}
Compile:
clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
Benchmark:
> time ./test_io gen | ./test_io read
real 0m2.961s
user 0m3.675s
sys 0m0.012s
Apart from the sad fact that reading a 10 MB file takes 3 seconds, it's also much slower than the binary built with g++ (installed via homebrew):
> gcc-6 -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io
> time ./test_io gen | ./test_io read
real 0m0.149s
user 0m0.167s
sys 0m0.040s
My clang version is Apple LLVM version 7.0.0 (clang-700.0.72). Clang versions installed from homebrew (3.7 and 3.8) also produce slow IO. Clang installed on Ubuntu (3.8) generates fast IO. Apple LLVM version 8.0.0 also generates slow IO (reported by two people).
I also dtrussed it a bit (sudo dtruss -c "./test_io gen | ./test_io read") and found that the clang version makes 2686 write_nocancel syscalls, while the gcc version makes 2079 writev syscalls, which probably points to the root of the problem.
The issue is in libc++, which does not implement sync_with_stdio.
Your command line clang++ -x c++ -lstdc++ -std=c++11 -O2 test_io.cpp -o test_io does not actually use libstdc++; it uses libc++. To force libstdc++ you need to pass -stdlib=libstdc++.
Minimal example if you have the input file ready:
#include <iostream>

constexpr int SIZE = 1000*1000;

int main(int argc, const char * argv[]) {
    std::ios_base::sync_with_stdio(false);
    int x;
    for (int i = 0; i < SIZE; ++i) {
        std::cin >> x;
    }
}
Timings:
$ clang++ test_io.cpp -o test -O2 -std=c++11
$ time ./test read < input
real 0m2.802s
user 0m2.780s
sys 0m0.015s
$ clang++ test_io.cpp -o test -O2 -std=c++11 -stdlib=libstdc++
clang: warning: libstdc++ is deprecated; move to libc++
$ time ./test read < input
real 0m0.185s
user 0m0.169s
sys 0m0.012s
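If switching the standard library is not an option, one possible workaround (my suggestion, not part of the original answer) is to bypass iostreams in the hot loop and read with the C stdio functions, whose speed does not depend on how the C++ library implements sync_with_stdio:
#include <cstdio>

int main() {
    int x;
    long long sum = 0;
    // std::scanf goes straight through stdio, so it is unaffected by the
    // iostream/stdio synchronization behaviour of libc++.
    while (std::scanf("%d", &x) == 1) {
        sum += x;  // use the value so the loop is not optimized away
    }
    std::printf("%lld\n", sum);
    return 0;
}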
I am running this
#include <boost/mpi.hpp>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <time.h>
namespace mpi = boost::mpi;
int main()
{
    mpi::environment env;
    mpi::communicator world;

    srand(time(NULL));
    std::srand(time(0) + world.rank());
    int my_number = std::rand();

    if (world.rank() == 0) {
        std::vector<int> all_numbers;
        gather(world, my_number, all_numbers, 0);
        for (int proc = 0; proc < world.size(); ++proc)
            std::cout << "Process #" << proc << " thought of "
                      << all_numbers[proc] << std::endl;
    } else {
        gather(world, my_number, 0);
    }
    return 0;
}
to generate random numbers in a distributed way; however, it gives me numbers of roughly the same magnitude every time....
dhcp-18-189-66-216:ising2 myname$ make
mpic++ -I/usr/local/include/boost -L/usr/local/lib -lboost_mpi -lboost_serialization main.cpp -o main
mpirun -n 4 main
Process #0 thought of 238772362
Process #1 thought of 238789169
Process #2 thought of 238805976
Process #3 thought of 238822783
dhcp-18-189-66-216:ising2 myname$ make
mpic++ -I/usr/local/include/boost -L/usr/local/lib -lboost_mpi -lboost_serialization main.cpp -o main
mpirun -n 4 main
Process #0 thought of 238805976
Process #1 thought of 238822783
Process #2 thought of 238839590
Process #3 thought of 238856397
dhcp-18-189-66-216:ising2 myname$ make
mpic++ -I/usr/local/include/boost -L/usr/local/lib -lboost_mpi -lboost_serialization main.cpp -o main
mpirun -n 4 main
Process #0 thought of 238856397
Process #1 thought of 238873204
Process #2 thought of 238890011
Process #3 thought of 238906818
dhcp-18-189-66-216:ising2 myname$
On the website http://www.boost.org/doc/libs/1_55_0/doc/html/mpi/tutorial.html, others say they get:
Process #0 thought of 332199874
Process #1 thought of 20145617
Process #2 thought of 1862420122
Process #3 thought of 480422940
Process #4 thought of 1253380219
Process #5 thought of 949458815
Process #6 thought of 650073868
I am very confused.... Any help? Thank you.
Your problem is the rand() function from cstdlib. It is not a good random number generator. If you want proper random numbers in C++, use the facilities from the C++11 <random> header or an external random number generator (e.g. a Mersenne Twister).
But nonetheless, using random numbers in parallel programs is no easy task. You should use random number generators that are specialised for that (e.g. r250_omp.h).
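As a rough sketch (my addition, not from the original answer), this is one way to seed a C++11 engine per process so that neighbouring ranks do not end up with nearly identical seeds; the rank is a plain int here so the snippet compiles without Boost.MPI, but in the question's code it would come from world.rank():
#include <ctime>
#include <iostream>
#include <random>

int main() {
    int rank = 0;  // stand-in for world.rank() from the question

    // Mix the time and the rank through seed_seq so each rank gets a
    // well-scrambled engine state instead of a seed that differs by 1.
    std::seed_seq seed{static_cast<unsigned>(std::time(nullptr)),
                       static_cast<unsigned>(rank)};
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 1000000000);

    std::cout << "Process #" << rank << " thought of " << dist(gen) << std::endl;
    return 0;
}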
The problem is likely to be caused by your rand. See the discussion and answers for this question: First random number is always smaller than rest
It seems that the numbers generated from neighboring seeds (your case) can be quite heavily correlated. rand's implementations vary, and for some of them this phenomenon seems to be much more pronounced.
I think your random gen should be like this:
int max=100, min=0;
srand(time(NULL));
int random = (rand() % (max-min)) + min;
I use Ubuntu and wrote several lines of code, but only one thread is created. When I run the nproc command in my terminal, the output is 2. My code is below:
int nthreads, tid;
#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    printf("Thread = %d\n", tid);

    /* for only main thread */
    if (tid == 0)
    {
        nthreads = omp_get_num_threads();
        printf("Number of threads = %d\n", nthreads);
    }
}
The output:
Thread = 0
Number of threads = 1
How can I do parallelism?
If you are using gcc/g++, you must make sure you enable the OpenMP extensions with the -fopenmp compiler and linker option. Specifying it during linking links in the appropriate library (-lgomp).
Compile with something like:
g++ -fopenmp myfile.c -o exec
or:
g++ -c myfile.c -fopenmp
g++ -o exec myfile.o -fopenmp
If you leave out the -fopenmp compile option, your program will compile but it will run as if OpenMP weren't being used. If your program doesn't use omp_set_num_threads to set the number of threads, they can be set from the command line:
OMP_NUM_THREADS=8 ./exec
I think the default is generally the number of cores on the system.
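For completeness, here is a self-contained version of the snippet from the question (my reconstruction; the surrounding main, the includes, and the file name are assumptions), which can be built and run with the commands above:
// omp_hello.cpp -- hypothetical file name; build with: g++ -fopenmp omp_hello.cpp -o exec
#include <omp.h>
#include <cstdio>

int main() {
    int nthreads, tid;

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        std::printf("Thread = %d\n", tid);

        /* for only main thread */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            std::printf("Number of threads = %d\n", nthreads);
        }
    }
    return 0;
}
Running it as OMP_NUM_THREADS=2 ./exec should then print one "Thread = ..." line per thread and "Number of threads = 2".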