I have an application that I run with MPI for distributed computing. Let's say two MPI ranks are started on a single machine; I start my target application on rank-0, which then spawns a few threads. I want each of these threads to access a simple block of data (an array) that was created by rank-1.
How can I do this? Shared memory? (Is it the only way?) Can I use something in MPI? (I'm a beginner.)
Thanks!
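For illustration only, here is a minimal sketch of one possible MPI-only route, assuming an MPI-3 library and that both ranks really live on the same machine: rank 1 exposes its array through a shared-memory window, and rank 0 (and any threads it spawns) reads it through the pointer returned by MPI_Win_shared_query. The array size and contents below are made up for the example.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    /* Rank 0's threads will touch the shared block, so request thread support. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 100;                      /* example array size */
    int *base = NULL;
    MPI_Win win;

    /* Rank 1 owns the block; rank 0 contributes 0 bytes but joins the window.
       This works here because both ranks run on one machine. */
    MPI_Aint bytes = (rank == 1) ? N * (MPI_Aint)sizeof(int) : 0;
    MPI_Win_allocate_shared(bytes, sizeof(int), MPI_INFO_NULL,
                            MPI_COMM_WORLD, &base, &win);

    MPI_Win_fence(0, win);
    if (rank == 1)
        for (int i = 0; i < N; ++i) base[i] = i;   /* rank 1 fills its array */
    MPI_Win_fence(0, win);                         /* writes now visible */

    if (rank == 0) {
        MPI_Aint qsize; int disp; int *rank1_data;
        /* Get a direct pointer to rank 1's block; threads on rank 0 can read it. */
        MPI_Win_shared_query(win, 1, &qsize, &disp, &rank1_data);
        printf("rank 0 sees rank 1's first element: %d\n", rank1_data[0]);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}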
Related
I am building an application using RMA in MPI. I am stuck on how to achieve the following: suppose I have 2 windows, win1 and win2. I want to write data into both windows, but I want this to be atomic, so that until both elements are written into their respective windows, no other process accesses the windows at the same target process.
Is this possible to achieve in MPI? Making a write to a single window atomic is possible with an exclusive lock, but is there any way I can lock multiple windows together?
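Not an answer to the two-window part, but for reference this is roughly what the single-window exclusive-lock case mentioned above looks like (the target rank, displacement and data type are placeholders):

#include <mpi.h>

/* Sketch: exclusively lock one window at the target and write a value into it.
   While the lock is held, no other process can access 'win' at 'target'. */
void put_with_exclusive_lock(MPI_Win win, int target, double value)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Put(&value, 1, MPI_DOUBLE, target, 0 /* displacement */, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);   /* the put is complete at the target after this */
}

The lock is tied to one window, which is exactly why it does not cover win1 and win2 together.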
I have a multithreaded C++ program. I understand that each process has a PID and that multiple threads have unique TIDs. LWP IDs are kernel-level IDs for the threads a process owns, and pthread IDs are user-level identifiers.
I traced my program and could see two traces with the same pthreadId but different lwpIds. How is this possible? (I am running this program on Linux.)
20170807 04:48:01.743 [pid:32174,pthreadId:139630838007552,lwpId:589][work] starter function
20170807 04:48:01.753 [pid:32174,pthreadId:139630838007552,lwpId:590][work] starter function
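For reference, a small sketch of where the two IDs in such a trace typically come from on Linux: the pthread ID is what pthread_self() returns (a user-level identifier), while the LWP ID is the kernel thread ID obtained via syscall(SYS_gettid). The logging format below is made up.

#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

/* Each thread prints its own (pthreadId, lwpId) pair. */
static void *work(void *)
{
    unsigned long pthreadId = (unsigned long)pthread_self(); /* user-level ID */
    pid_t lwpId = (pid_t)syscall(SYS_gettid);                /* kernel-level LWP ID */
    std::printf("[pid:%d,pthreadId:%lu,lwpId:%d] starter function\n",
                (int)getpid(), pthreadId, (int)lwpId);
    return NULL;
}

int main()
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}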
Hi, I have started working on a project where I use parallel computing to split job loads among multiple machines, such as hashing and other forms of mathematical calculations. I'm using C++.
It runs on a master/slave (or server/client, if you prefer) model, where every client connects to the server and waits for a job. The server can then take a job and split it depending on the number of clients:
1000 jobs --> 3 clients
i.e.: Client 1 --> calculate(0 to 333)
Client 2 --> calculate(334 to 666)
Client 3 --> calculate(667 to 999)
I wanted to further increase speed by creating multiple threads on every running client. But since the machines are almost certainly not going to have the same hardware, I cannot arbitrarily decide on a number of threads to run on every client.
I would like to know if anyone knows a way to evaluate the load a thread puts on the CPU and extrapolate the number of threads that can be run concurrently on the machine.
There are two ways I see of doing this:
I start threads one by one, evaluating the CPU load every time, and stop when I reach a certain preset ceiling (50%, 75%, etc.), but this has the flaw that I'll have to stop and re-split the job every time I start a new thread.
(And this is the more complex option:) run some kind of test thread, calculate its impact on the CPU's base load, extrapolate the number of threads that can be run on the machine, and then start threads and split the jobs accordingly.
Any ideas or pointers are welcome. Thanks in advance!
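Not a full answer, but as a starting point for either idea above: if C++11 is available on the clients, std::thread::hardware_concurrency() reports how many hardware threads the machine supports, which gives an upper bound before any load measuring. A rough sketch (the 75% ceiling is just an example figure taken from the question):

#include <thread>
#include <algorithm>
#include <cstdio>

int main()
{
    /* Number of hardware threads this machine supports; may return 0 if unknown. */
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0)
        hw = 1;                                   /* conservative fallback */

    /* Example policy: target roughly 75% of the hardware threads, at least one. */
    unsigned workers = std::max(1u, (hw * 3u) / 4u);

    std::printf("hardware threads: %u, worker threads to start: %u\n", hw, workers);
    return 0;
}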
I have an application using pthreads; it predates C++11. We have several worker threads assigned to several purposes, and tasks get distributed in a producer-consumer fashion through a shared circular pool of task data. POSIX semaphores have been used for inter-thread synchronization in wait/notify mode, as well as mutex locks for shared data to ensure mutual exclusion.
Recently I have noticed a strange problem with large volumes of data: the program seems to hang with signal 1 received. Signal 1 is SIGHUP, which means hang-up; this signal is usually used to report that the user's terminal is disconnected, perhaps because a network or telephone connection was broken.
Can this be caused by the parent terminal timing out? If so, can nohup help?
This occurs only with large volumes of data (I didn't notice it with smaller volumes), and the application is being run from the command line in a Solaris terminal (telnet session).
Thoughts welcome.
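If the terminal-disconnect theory is what you want to rule out: besides launching the program under nohup, the process itself can ignore SIGHUP. A minimal sketch (whether ignoring it is appropriate for your application is a separate question):

#include <signal.h>

int main()
{
    /* Ignore SIGHUP so a dropped telnet session does not terminate the process. */
    signal(SIGHUP, SIG_IGN);

    /* ... rest of the application ... */
    return 0;
}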
Actually I have 3 questions. Any input is appreciated. Thank you!
1) How do I run exactly 1 process on each host? My application uses TBB for multithreading. Does that mean I should run exactly 1 process on each host for best performance?
2) My cluster has heterogeneous hosts. Some hosts have better CPUs and more memory than others. How do I map process ranks to real hosts for work-distribution purposes? I am thinking of using the hostname. Is there a better way to do it?
3) How are process ranks assigned? Which process gets 0?
1) TBB splits loops into several threads of a thread pool to utilize all processors of one machine. So you should only run one process per machine. More processes would fight with each other for processor time. The number of processes per machine is given by options in your hostfile:
# my_hostfile
192.168.0.208 slots=1 max_slots=1
...
2) Giving each machine an appropriate amount of work according to its performance is not trivial.
The easiest approach is to split the workload into small pieces of work, send them to the slaves, collect their answers, and give them new pieces of work, until you are done. There is an example on my website (in German). You can also find some references to manuals and tutorials there.
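A rough sketch of that master/worker loop, assuming for the example that a job is just an integer index and the answer a single double (tags and message layout are made up):

#include <mpi.h>
#include <stdio.h>

/* Placeholder for the real computation: a job is just an integer index here. */
static double do_work(int job) { return job * 2.0; }

int main(int argc, char **argv)
{
    const int TAG_JOB = 1, TAG_RESULT = 2, TAG_STOP = 3;
    const int numJobs = 1000;                  /* example job count */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                           /* master */
        int nextJob = 0, active = 0, dummy = 0;

        /* Give every slave an initial piece of work (or a stop message if
           there are more slaves than jobs). */
        for (int dest = 1; dest < size; ++dest) {
            if (nextJob < numJobs) {
                MPI_Send(&nextJob, 1, MPI_INT, dest, TAG_JOB, MPI_COMM_WORLD);
                ++nextJob; ++active;
            } else {
                MPI_Send(&dummy, 1, MPI_INT, dest, TAG_STOP, MPI_COMM_WORLD);
            }
        }

        /* Collect answers and hand out new pieces until all jobs are done. */
        while (active > 0) {
            double result;
            MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &st);
            --active;
            printf("result from rank %d: %f\n", st.MPI_SOURCE, result);
            if (nextJob < numJobs) {
                MPI_Send(&nextJob, 1, MPI_INT, st.MPI_SOURCE, TAG_JOB, MPI_COMM_WORLD);
                ++nextJob; ++active;
            } else {
                MPI_Send(&dummy, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            }
        }
    } else {                                   /* slave */
        for (;;) {
            int job;
            MPI_Status st;
            MPI_Recv(&job, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP)
                break;
            double result = do_work(job);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}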
3) Each process gets a number (processID) in your program by
MPI_Comm_rank(MPI_COMM_WORLD, &processID);
The master has processID == 0. Maybe the others are given the slots in the order of your hostfile. Another possibility is that they are assigned in the order in which the connections to the slaves are established; I don't know.
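Put together, the usual skeleton is something like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int processID, numProcesses;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &processID);    /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &numProcesses); /* total number of processes */

    if (processID == 0)
        printf("I am the master of %d processes\n", numProcesses);
    else
        printf("I am slave %d\n", processID);

    MPI_Finalize();
    return 0;
}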