I am running this
#include <boost/mpi.hpp>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <time.h>

namespace mpi = boost::mpi;

int main()
{
    mpi::environment env;
    mpi::communicator world;

    srand(time(NULL));
    std::srand(time(0) + world.rank());
    int my_number = std::rand();
    if (world.rank() == 0) {
        std::vector<int> all_numbers;
        gather(world, my_number, all_numbers, 0);
        for (int proc = 0; proc < world.size(); ++proc)
            std::cout << "Process #" << proc << " thought of "
                      << all_numbers[proc] << std::endl;
    } else {
        gather(world, my_number, 0);
    }
    return 0;
}
to generate random numbers in a distributed way; however, every run it gives me numbers of around the same magnitude:
dhcp-18-189-66-216:ising2 myname$ make
mpic++ -I/usr/local/include/boost -L/usr/local/lib -lboost_mpi -lboost_serialization main.cpp -o main
mpirun -n 4 main
Process #0 thought of 238772362
Process #1 thought of 238789169
Process #2 thought of 238805976
Process #3 thought of 238822783
dhcp-18-189-66-216:ising2 myname$ make
mpic++ -I/usr/local/include/boost -L/usr/local/lib -lboost_mpi -lboost_serialization main.cpp -o main
mpirun -n 4 main
Process #0 thought of 238805976
Process #1 thought of 238822783
Process #2 thought of 238839590
Process #3 thought of 238856397
dhcp-18-189-66-216:ising2 myname$ make
mpic++ -I/usr/local/include/boost -L/usr/local/lib -lboost_mpi -lboost_serialization main.cpp -o main
mpirun -n 4 main
Process #0 thought of 238856397
Process #1 thought of 238873204
Process #2 thought of 238890011
Process #3 thought of 238906818
dhcp-18-189-66-216:ising2 myname$
On the website http://www.boost.org/doc/libs/1_55_0/doc/html/mpi/tutorial.html, others say they get:
Process #0 thought of 332199874
Process #1 thought of 20145617
Process #2 thought of 1862420122
Process #3 thought of 480422940
Process #4 thought of 1253380219
Process #5 thought of 949458815
Process #6 thought of 650073868
I am very confused.... Any help? Thank you.
Your problem is the rand() function from cstdlib. It is not a good random number generator. If you want proper random numbers in C++, use the <random> header from C++11 or an external random number generator (e.g. a Mersenne Twister).
Nonetheless, using random numbers in parallel programs is no easy task. You should use random number generators that are specialised for that (e.g. r250_omp.h).
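As a minimal sketch (not the tutorial's code), the gather example could look like this with C++11's <random>, giving each rank its own std::mt19937 whose seed mixes in the rank so that neighbouring processes don't get correlated streams:

#include <boost/mpi.hpp>
#include <iostream>
#include <random>
#include <vector>

namespace mpi = boost::mpi;

int main()
{
    mpi::environment env;
    mpi::communicator world;

    // Mix a non-deterministic seed with the rank so every process
    // gets its own, uncorrelated stream.
    std::seed_seq seq{std::random_device{}(),
                      static_cast<unsigned int>(world.rank())};
    std::mt19937 gen(seq);
    std::uniform_int_distribution<int> dist(0, 1000000000);
    int my_number = dist(gen);

    if (world.rank() == 0) {
        std::vector<int> all_numbers;
        gather(world, my_number, all_numbers, 0);
        for (int proc = 0; proc < world.size(); ++proc)
            std::cout << "Process #" << proc << " thought of "
                      << all_numbers[proc] << std::endl;
    } else {
        gather(world, my_number, 0);
    }
    return 0;
}

Seeding from std::random_device plus the rank avoids the near-identical seeds that time(0) + rank produces when all processes start within the same second.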
The problem is likely to be caused by your rand. See the discussion and answers for this question: First random number is always smaller than rest
It seems that the numbers generated from neighbouring seeds (your case) can be quite heavily correlated. rand's implementation varies between platforms, and for some implementations the phenomenon seems to be much more pronounced.
I think your random generator should be like this:
int max = 100, min = 0;
srand(time(NULL));
int random = (rand() % (max - min)) + min;
Related
I am trying to run the following example MPI code that launches 20 threads and keeps those threads busy for a while. However, when I check the CPU utilization using a tool like nmon or top I see that only a single thread is being used.
#include <iostream>
#include <thread>
#include <mpi.h>

using namespace std;

int main(int argc, char *argv[]) {
    int provided, rank;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided != MPI_THREAD_FUNNELED)
        exit(1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    auto f = [](float x) {
        float result = 0;
        for (float i = 0; i < x; i++) { result += 10 * i + x; }
        cout << "Result: " << result << endl;
    };

    thread threads[20];
    for (int i = 0; i < 20; ++i)
        threads[i] = thread(f, 100000000.f); // do some work
    for (auto& th : threads)
        th.join();

    MPI_Finalize();
    return 0;
}
I compile this code using mpicxx: mpicxx -std=c++11 -pthread example.cpp -o example and run it using mpirun: mpirun -np 1 example.
I am using Open MPI version 4.1.4 that is compiled with posix thread support (following the explanation from this question).
$ mpicxx --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
$ mpirun --version
mpirun (Open MPI) 4.1.4
$ ompi_info | grep -i thread
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
FT Checkpoint support: no (checkpoint thread: no)
$ mpicxx -std=c++11 -pthread example.cpp -o example
$ ./example
My CPU has 10 cores and 20 threads and runs the example code above without MPI on all 20 threads. So, why does the code with MPI not run on all threads?
I suspect I might need to do something with MPI bindings, which I see being mentioned in some answers on the same topic (1, 2), but other answers entirely exclude these options, so I'm unsure whether this is the correct approach.
mpirun -np 1 ./example assigns a single core to your program (so the 20 threads end up time sharing): this is the default behavior for Open MPI (e.g. 1 core per MPI process when running with -np 1 or -np 2).
./example (i.e. singleton mode) should use all the available cores, unless you are already running on a subset.
If you want to use all the available cores with mpirun, you can
mpirun --bind-to none -np 1 ./example
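If you want to verify what the launcher actually did, Open MPI's --report-bindings option prints each rank's binding (the exact report format varies between Open MPI versions), e.g.:

$ mpirun --report-bindings -np 1 ./example                 # shows the default binding
$ mpirun --report-bindings --bind-to none -np 1 ./example  # should report the process as unbound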
My code runs perfectly on my machine, but it fails even on the sample test cases on HackerRank. How can the same code yield different results in two different environments?
#include <bits/stdc++.h>
#include <algorithm>
using namespace std;

int maxum(int arr[], int n)
{
    int t[100][100];
    for (int i = 0; i < n + 1; i++)
        for (int j = 0; j < n + 1; j++)
        {
            if (i == 0)
                t[i][j] = 0;
            else
                t[i][j] = max(t[i-1][j] + arr[i-1], t[i-1][j]);
        }
    int maximum = 0;
    for (int i = 0; i < n + 1; i++)
        for (int j = 0; j < n + 1; j++)
        {
            if (t[i][j] > maximum)
                maximum = t[i][j];
        }
    return maximum;
}

int main()
{
    int arr[100], i, n;
    cin >> n;
    for (i = 0; i < n; i++)
        cin >> arr[i];
    cout << maxum(arr, n);
    return 0;
}
The algorithm is just wrong. For instance given the input from the first test case
5
3 7 4 6 5
The correct answer is 13 (7+6) but your code outputs 25 (3+7+4+6+5).
It seems you haven't implemented the requirement that the members of the maximal subset be non-adjacent.
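For reference, here is a minimal sketch of the usual dynamic programming solution for that requirement (assuming the task is the classic "maximum sum of non-adjacent elements" and, as in the sample, the inputs are non-negative; the helper name maxNonAdjacentSum is just for illustration):

#include <algorithm>
#include <iostream>
#include <vector>

// Classic DP: for each element either skip it (keep the previous best)
// or take it (best up to two elements back, plus this element).
int maxNonAdjacentSum(const std::vector<int>& arr)
{
    int incl = 0; // best sum of a subset that includes the current element
    int excl = 0; // best sum of a subset that excludes the current element
    for (int x : arr) {
        int newIncl = excl + x;
        excl = std::max(incl, excl);
        incl = newIncl;
    }
    return std::max(incl, excl);
}

int main()
{
    int n;
    std::cin >> n;
    std::vector<int> arr(n);
    for (int& x : arr)
        std::cin >> x;
    std::cout << maxNonAdjacentSum(arr) << std::endl;
    return 0;
}

For the sample input 3 7 4 6 5 this prints 13 (7 + 6).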
Your code exhibits undefined behaviour at least for larger inputs. Example:
g++ -O3 -Wall -Wextra -pedantic -std=c++17 -fno-exceptions -fsanitize=address a.cpp
yes 100 | ./a.out
Triggers asan to complain, although the input is valid according to the constraints specified in the link you provide.
==298664==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffce142a650 at pc 0x55697440f74d bp 0x7ffce14209c0 sp 0x7ffce14209b8
WRITE of size 4 at 0x7ffce142a650 thread T0
#0 0x55697440f74c in maxum(int*, int) (/tmp/a.out+0x174c)
#1 0x55697440f26c in main (/tmp/a.out+0x126c)
#2 0x7fc97926acc9 in __libc_start_main ../csu/libc-start.c:308
#3 0x55697440f3a9 in _start (/tmp/a.out+0x13a9)
[rest of asan output omitted]
I have a small function in written Haskell with the following type:
foreign export ccall sget :: Ptr CInt -> CSize -> Ptr CSize -> IO (Ptr CInt)
I am calling this from multiple C++ threads running concurrently (via
TBB). During this part of the execution of my program I can barely get a
load average above 1.4 even though I'm running on a six-core CPU (12
logical cores). I therefore suspect that either the calls into Haskell all
get funnelled through a single thread, or there is some significant
synchronization going on.
I am not doing any such thing explicitly; all the function does is operate
on the incoming data (after storing it in a Data.Vector.Storable) and
return the result back as a newly allocated array (from Foreign.Marshal.Array).
Is there anything I need to do to fully enable concurrent calls like this?
I am using GHC 8.6.5 on Debian Linux (bullseye/testing), and I am compiling with -threaded -O2.
Looking forward to reading some advice,
Sebastian
Using the simple example at the end of this answer, if I compile with:
$ ghc -O2 Worker.hs
$ ghc -O2 -threaded Worker.o caller.c -lpthread -no-hs-main -o test
then running it with ./test occupies only one core at 100%. I need to run it with ./test +RTS -N, and then on my 4-core desktop, it runs at 400% with a load average of around 4.0.
So, the RTS -N flag affects the number of parallel threads that can simultaneously run an exported Haskell function, and there is no special action required (other than compiling with -threaded and running with +RTS -N) to fully utilize all available cores.
So, there must be something about your example that's causing the problem. It could be contention between threads over some shared data structure. Or, maybe parallel garbage collection is causing problems; I've observed parallel GC causing worse performance with increasing -N in a simple test case (details forgotten, sadly), so you could try turning off parallel GC with -qg or limiting the number of cores involved with -qn2 or something. To enable these options, you need to call hs_init_with_rtsopts() in place of the usual hs_init() as in my example.
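For example, these are standard GHC RTS flags, so (assuming the binary calls hs_init_with_rtsopts, as in my example below) they can be passed on the command line:

$ ./test +RTS -N          # use all cores
$ ./test +RTS -N -qg      # all cores, parallel GC disabled
$ ./test +RTS -N -qn2     # all cores, at most 2 of them used for parallel GC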
If that doesn't work, I think you'll have to try to narrow down the problem and post a minimal example that illustrates the performance issue to get more help.
My example:
caller.c
#include "HsFFI.h"
#include "Rts.h"
#include "Worker_stub.h"
#include <pthread.h>
#define NUM_THREAD 4
void*
work(void* arg)
{
    for (;;) {
        fibIO(30);
    }
}

int
main(int argc, char **argv)
{
    hs_init_with_rtsopts(&argc, &argv);

    pthread_t threads[NUM_THREAD];
    for (int i = 0; i < NUM_THREAD; ++i) {
        int rc = pthread_create(&threads[i], NULL, work, NULL);
    }
    for (int i = 0; i < NUM_THREAD; ++i) {
        pthread_join(threads[i], NULL);
    }

    hs_exit();
    return 0;
}
Worker.hs
module Worker where
import Foreign
fibIO :: Int -> IO Int
fibIO = return . fib
fib :: Int -> Int
fib n | n > 1     = fib (n-1) + fib (n-2)
      | otherwise = 1
foreign export ccall fibIO :: Int -> IO Int
Here is a C++ program that runs 10 times with 5 different threads, and each thread increments the value of the counter, so the final output should be 500, which is exactly what the program outputs. But I can't understand why it gives 500 every time; the output should vary, since the increment operation is not atomic and no locks are used, so the program should give a different output in each run.
Edit: to increase the probability of a race condition I increased the loop count, but I still couldn't see any varying output.
#include <iostream>
#include <thread>
#include <vector>

struct Counter {
    int value;

    Counter() : value(0) {}

    void increment() {
        value = value + 1000;
    }
};

int main() {
    int n = 50000;
    while (n--) {
        Counter counter;

        std::vector<std::thread> threads;
        for (int i = 0; i < 5; ++i) {
            threads.push_back(std::thread([&counter]() {
                for (int i = 0; i < 1000; ++i) {
                    counter.increment();
                }
            }));
        }

        for (auto& thread : threads) {
            thread.join();
        }

        std::cout << counter.value << std::endl;
    }
    return 0;
}
You're just lucky :)
Compiling with clang++ my output is not always 500:
500
425
470
500
500
500
500
500
432
440
Note
Using g++ with -fsanitize=thread -static-libtsan:
WARNING: ThreadSanitizer: data race (pid=13871)
Read of size 4 at 0x7ffd1037a9c0 by thread T2:
#0 Counter::increment() <null> (Test+0x000000509c02)
#1 main::{lambda()#1}::operator()() const <null> (Test+0x000000507ed1)
#2 _M_invoke<> /usr/include/c++/5/functional:1531 (Test+0x0000005097d7)
#3 operator() /usr/include/c++/5/functional:1520 (Test+0x0000005096b2)
#4 _M_run /usr/include/c++/5/thread:115 (Test+0x0000005095ea)
#5 <null> <null> (libstdc++.so.6+0x0000000b8c7f)
Previous write of size 4 at 0x7ffd1037a9c0 by thread T1:
#0 Counter::increment() <null> (Test+0x000000509c17)
#1 main::{lambda()#1}::operator()() const <null> (Test+0x000000507ed1)
#2 _M_invoke<> /usr/include/c++/5/functional:1531 (Test+0x0000005097d7)
#3 operator() /usr/include/c++/5/functional:1520 (Test+0x0000005096b2)
#4 _M_run /usr/include/c++/5/thread:115 (Test+0x0000005095ea)
#5 <null> <null> (libstdc++.so.6+0x0000000b8c7f)
shows the race condition. (Also, on my system the output shows results different than 500).
The options for g++ are explained in the documentation for g++ (e.g. man g++). See also: https://github.com/google/sanitizers/wiki#threadsanitizer.
Just because your code has race conditions does not mean they occur. That is the hard part about them. A lot of times they only occur when something else changes and timing is different.
There are several issues: incrementing to 100 can be done really fast, so your threads may already be halfway done before the second one is even started, and the same goes for the next thread, etc. So you never know whether you really had 5 running in parallel.
You should create a barrier at the beginning of each thread to make sure they all start at the same time; a sketch follows below.
Also, maybe try a bit more than 100 iterations with only 5 threads. But it all depends on the system, load, timing, etc.
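A rough sketch of such a start barrier, using C++20's std::latch (with the question's C++11 you could build the same gate from a std::condition_variable):

#include <iostream>
#include <latch>
#include <thread>
#include <vector>

int main() {
    int counter = 0;              // deliberately unsynchronized, like the question's Counter
    std::latch start_gate(5);     // all 5 threads must arrive here before any of them proceeds

    std::vector<std::thread> threads;
    for (int i = 0; i < 5; ++i) {
        threads.emplace_back([&] {
            start_gate.arrive_and_wait();   // line the threads up
            for (int j = 0; j < 100000; ++j)
                counter = counter + 1;      // racy increment
        });
    }
    for (auto& t : threads)
        t.join();
    std::cout << counter << std::endl;      // typically well below 500000 once the race bites
    return 0;
}

Compile with -std=c++20. Even lined up like this, the data race is still undefined behavior, so the output is merely likely, not guaranteed, to vary.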
to increase the probability of a race condition I increased the loop count
but still couldn't see any varying output
Strictly speaking, you have a data race in this code, which is undefined behavior, and therefore you cannot reliably reproduce it.
But you can rewrite Counter to some "equivalent" code with artificial delays in increment:
struct Counter {
    int value;

    Counter() : value(0) {}

    void increment() {
        int val = value;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        ++val;
        value = val;
    }
};
I got the following output with this counter, which is far less than 500:
100
100
100
100
100
101
100
100
101
100
I have a real puzzler for you folks.
Below is a small, self-contained, simple 40-line program that calculates partial sums of a bunch of numbers and routinely (but stochastically) crashes nodes on a distributed memory cluster that I'm using. If I spawn 50 PBS jobs that run this code, between 0 and 4 of them will crash their nodes. It happens on a different repeat of the main loop each time and on different nodes each time; there is no discernible pattern. The nodes just go "down" on the ganglia report and I can't ssh to them ("no route to host"). If, instead of submitting jobs, I ssh onto one of the nodes and run my program there and I'm unlucky enough that it crashes, I just stop seeing text and then see that the node is dead on ganglia.
The program is threaded with OpenMP, and the crashes only happen when a large number of threads are spawned (like 12).
The cluster it's killing is a RHEL 5 cluster with nodes that have 2 6-core x5650 processors:
[jamelang#hooke ~]$ tail /etc/redhat-release
Red Hat Enterprise Linux Server release 5.7 (Tikanga)
I have tried enabling core dumps (ulimit -c unlimited) but no files show up. This is the code, with comments:
#include <cstdlib>
#include <cstdio>
#include <omp.h>

int main() {
    const unsigned int numberOfThreads = 12;
    const unsigned int numberOfPartialSums = 30000;
    const unsigned int numbersPerPartialSum = 40;

    // make some numbers
    srand(0); // every instance of program should get same results
    const unsigned int totalNumbersToSum = numbersPerPartialSum * numberOfPartialSums;
    double * inputData = new double[totalNumbersToSum];
    for (unsigned int index = 0; index < totalNumbersToSum; ++index) {
        inputData[index] = rand() / double(RAND_MAX);
    }

    omp_set_num_threads(numberOfThreads);

    // prepare a place to dump output
    double * partialSums = new double[numberOfPartialSums];

    // do the following algorithm many times to induce a problem
    for (unsigned int repeatIndex = 0; repeatIndex < 100000; ++repeatIndex) {
        if (repeatIndex % 1000 == 0) {
            printf("Absurd testing is on repeat %06u\n", repeatIndex);
        }

#pragma omp parallel for
        for (unsigned int partialSumIndex = 0; partialSumIndex < numberOfPartialSums;
             ++partialSumIndex) {
            // get this partial sum's limits
            const unsigned int beginIndex = numbersPerPartialSum * partialSumIndex;
            const unsigned int endIndex   = numbersPerPartialSum * (partialSumIndex + 1);

            // we just sum the 40 numbers, can't get much simpler
            double sumOfNumbers = 0;
            for (unsigned int index = beginIndex; index < endIndex; ++index) {
                // only reading, thread-safe
                sumOfNumbers += inputData[index];
            }

            // writing to non-overlapping indices (guaranteed by omp),
            // should be thread-safe.
            // at worst we would have false sharing, but that would just affect
            // performance, not throw sigabrts.
            partialSums[partialSumIndex] = sumOfNumbers;
        }
    }

    delete[] inputData;
    delete[] partialSums;

    return 0;
}
I compile it with the following:
/home/jamelang/gcc-4.8.1/bin/g++ -O3 -Wall -fopenmp Killer.cc -o Killer
It seems to be linking against the right shared objects:
[jamelang#hooke Killer]$ ldd Killer
linux-vdso.so.1 => (0x00007fffc0599000)
libstdc++.so.6 => /home/jamelang/gcc-4.8.1/lib64/libstdc++.so.6 (0x00002b155b636000)
libm.so.6 => /lib64/libm.so.6 (0x0000003293600000)
libgomp.so.1 => /home/jamelang/gcc-4.8.1/lib64/libgomp.so.1 (0x00002b155b983000)
libgcc_s.so.1 => /home/jamelang/gcc-4.8.1/lib64/libgcc_s.so.1 (0x00002b155bb92000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003293a00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003292e00000)
/lib64/ld-linux-x86-64.so.2 (0x0000003292a00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003298600000)
Some Notes:
1. On osx lion with gcc 4.7, this code will throw a SIGABRT, similar to this question: Why is this code giving SIGABRT with openMP?. Using gcc 4.8 seems to fix the issue on OSX. However, using gcc 4.8 on the RHEL5 machine does not fix it. The RHEL5 machine has GLIBC version 2.5, and it seems that yum doesn't provide a later one, so the admins are sticking with 2.5.
2. If I define a SIGABRT signal handler, it doesn't catch the problem on the RHEL5 machine, but it does catch it on OSX with gcc47.
3. I believe that no variables should need to be shared in the omp clause because they can all have private copies, but adding them as shared does not change the behavior.
4. The killing of nodes occurs regardless of the level of optimization used.
5. The killing of nodes occurs even if I run the program from within gdb (i.e. put "gdb -batch -x gdbCommands Killer" in the pbs file) where "gdbCommands" is a file with one line: "run"
6. This example spawns threads on every repeat. One strategy would be to make a parallel block that contains the repeats loop in order to prevent this. However, that does not help me - this example is only representative of a much larger research code in which I cannot use that strategy.
I'm all out of ideas, at my last straw, at my wit's end, ready to pull my hair out, etc with this. Does anyone have suggestions or ideas?
You are trying to parallelize nested for loops. In this case you need to make the variables of the inner loop private, so that each thread has its own copy. That can be done with the private clause, as in the example below.
#pragma omp parallel for private(j)
for (i = 0; i < height; i++)
    for (j = 0; j < width; j++)
        c[i][j] = 2;
In your case, index and sumOfNumbers need to be private.