Message Passing Interface performance on shared memory systems - C++

As far as I know, there are two ways to do parallel processing: the Message Passing Interface (MPI) and multithreading. Multithreading cannot be used for distributed-memory systems without MPI, but MPI can be used for both shared-memory and distributed-memory systems. My question is about the performance of code that is parallelized with MPI and run on a shared-memory system. Is the performance of such code in the same range as that of code parallelized with multithreading?
Update:
My job is such that the processes need to communicate with each other repeatedly, and the communication array can be a 200*200 matrix.

The answer is: it depends. MPI processes are predominantly separate OS processes, and communication between them occurs through some sort of shared-memory IPC technique when the communicating processes run on the same shared-memory node. Being separate OS processes, MPI processes in general do not share data, and sometimes data has to be replicated in each process, which leads to less than optimal memory usage. On the other hand, threads can share lots of data and can benefit from cache reuse, especially on multicore CPUs that have large shared last-level caches (e.g. the L3 cache on current-generation x86 CPUs). Cache reuse, combined with more lightweight methods of data exchange between threads (usually just synchronisation, since the work data is already shared), can lead to better performance than what is achievable by separate processes.
But once again - it depends.

Let's assume we only consider MPI and OpenMP, since they are the two major representatives of the two parallel programming families you mention. For distributed systems, MPI is the only option between nodes. Within a single node, however, as you say, you can still use MPI and you can use OpenMP too. Which one performs better really depends on the application you are running, and specifically on its computation/communication ratio. Here you can see a comparison of MPI and OpenMP on a multicore processor, which confirms the same observation.
You can go a step further and use a hybrid approach. Use MPI between the nodes and then use OpenMP within nodes. This is called hybrid MPI+OpenMP parallel programming. You can also apply this within a node that contains a hybrid CMP+SMT processor.
You can check some information here and here. Moreover, this paper compares an MPI approach with a hybrid MPI+OpenMP one.
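To give a flavour of the hybrid style, here is a minimal MPI+OpenMP sketch (the problem size and per-element work are made up for the example): MPI splits the data across processes, and OpenMP parallelizes the local loop within each process.

```cpp
// Minimal hybrid MPI+OpenMP sketch: MPI splits the data across
// processes; OpenMP parallelizes the local loop within each process.
// Build with something like: mpicxx -fopenmp hybrid.cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    int provided;
    // FUNNELED: only the main thread will make MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 1 << 20;            // made-up global problem size
    const int local_n = N / size;     // assume N divisible by size
    std::vector<double> local(local_n, 1.0);

    double local_sum = 0.0;
    // The OpenMP threads share 'local' within this MPI process.
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < local_n; ++i)
        local_sum += local[i] * local[i];

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0) std::printf("sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```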

In my opinion, they're simply better at different jobs. The Actor model is great at asynchronously performing many different tasks at different times, whereas the OpenMP/TBB/PPL model is great for performing one task in parallel very simply and reliably.

Related

Ways to process data that is coming in at double my processing speed

I was asked this question by someone and am a bit confused about it.
Q: How will you process data that is coming in at double your processing speed?
I thought of the following:
Using a queue to handle this. But if I simply use a queue, the required
queue size will grow without bound and I will still lag behind: in every
interval t, half of the incoming data is left unprocessed, so the
backlog keeps growing.
Using one thread for reading data and two more for processing. But
suppose my data has to be processed serially; what happens then?
I am still confused, and any help on similar problems is welcome. I know there might be a standard solution for this, but I am unaware of it.
I would like to implement this in C/C++.
Short answer: you'll need some kind of parallel processing. It's not easy.
Long answer: depending on your workload requirements, and whether the bottleneck is in I/O or in the CPU, the solution might simply be multithreading on a single core, on a multicore processor, on a shared-memory multiprocessor, or even distributed between multiple nodes. It can be just a matter of distributing and balancing your work between the worker units if the problem is simple enough (embarrassingly parallel), or you'll need to do some explicit parallel programming. There are fundamentally two parallel programming models: OpenMP, for multithreading on multicore systems with shared memory (either symmetric or non-uniform access), and MPI, for distributed processing over a low-latency, high-bandwidth network. To complicate things further, OpenMP and MPI can perfectly well run together in a hybrid parallel programming environment: OpenMP distributes and coordinates the parallel compute load between the cores on each node, and MPI does so between the nodes. Be aware, it is very tough work.
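As an illustration of the single-node multithreaded case, here is a minimal C++11 producer-consumer sketch (process() is a hypothetical stand-in for the real per-item work). Note that it only keeps up if the items can be processed independently and the aggregate processing rate matches the input rate:

```cpp
// Minimal C++11 producer-consumer sketch: one reader thread feeds a
// queue, two worker threads drain it. Only helps if the items can be
// processed independently of each other.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<int> q;
std::mutex m;
std::condition_variable cv;
bool done = false;

void process(int /*item*/) { /* hypothetical per-item work */ }

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !q.empty() || done; });
        if (q.empty()) return;         // finished and fully drained
        int item = q.front();
        q.pop();
        lock.unlock();                 // don't hold the lock while working
        process(item);
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 2; ++i) workers.emplace_back(worker);

    for (int i = 0; i < 1000; ++i) {   // stand-in for the incoming stream
        { std::lock_guard<std::mutex> lock(m); q.push(i); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();
    for (auto& t : workers) t.join();
}
```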

Executing a C++ program on a multiple-processor machine

I developed a program in C++ for research purposes. It takes several days to complete.
Now I am executing it on our lab's 8-core server machine to get results quickly, but I see that the machine assigns only one core to my program, which stays at about 13% processor usage (even though I set the process priority to high and the affinity to all 8 cores).
(It is a simple object-oriented program without any parallelism or multithreading.)
How can I get true benefit from the powerful server machine?
Thanks in advance.
Partition your code into chunks you can execute in parallel.
You need to go read about data parallelism and task parallelism. Then you can use OpenMP or MPI to break up your program.
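For example, the data-parallel case with OpenMP can be as small as this sketch (the loop body is a placeholder for the real per-element work):

```cpp
// Minimal OpenMP data-parallel sketch: the iterations must be
// independent. Build with something like: g++ -fopenmp main.cpp
#include <vector>

int main() {
    std::vector<double> data(10000000, 1.0);
    #pragma omp parallel for           // spread iterations across cores
    for (long i = 0; i < (long)data.size(); ++i)
        data[i] = data[i] * 2.0 + 1.0; // placeholder per-element work
    return 0;
}
```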
(It is a simple object-oriented program without any parallelism or multithreading.)
How can I get true benefit from the powerful server machine?
By using more threads. No matter how powerful the computer is, it cannot spread a thread across more than one processor. Find independent portions of your program and run them in parallel.
C++0x threads
Boost threads
OpenMP
I personally consider OpenMP a toy. You should probably go with one of the other two.
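For the C++0x (now C++11) threads option, a minimal sketch that splits an array into one independent chunk per thread might look like this (the chunking and workload are illustrative):

```cpp
// Minimal C++11 std::thread sketch: split a range into one chunk per
// thread; each chunk is processed independently.
#include <functional>
#include <thread>
#include <vector>

void work_on(std::vector<double>& v, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        v[i] *= 2.0;                   // placeholder for the real work
}

int main() {
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 8;   // fallback if the count is unknown

    std::vector<double> data(8000000, 1.0);
    std::vector<std::thread> pool;
    const std::size_t chunk = data.size() / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == nthreads) ? data.size() : begin + chunk;
        pool.emplace_back(work_on, std::ref(data), begin, end);
    }
    for (auto& th : pool) th.join();
    return 0;
}
```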
You have to exploit parallelism explicitly by splitting your code into multiple tasks that can be executed independently, and then either use thread primitives directly or a higher-level parallelization framework, such as OpenMP.
If you don't want to make your program itself use multithreaded libraries or techniques, you might be able to break your work up into several independent chunks. Then run multiple copies of your program, each assigned to a different chunk via different command-line parameters.
As for generally improving a program's performance: there are profiling tools that can help you find the bottlenecks in memory usage, I/O, or CPU:
https://stackoverflow.com/questions/tagged/c%2b%2b%20profiling
This won't help split your work across cores, but an 8x speedup in an algorithm might help more than multithreading on 8 cores would. Just something else to consider.

What are the recommended C++ parallelization libraries for large data processing

Can someone recommend approaches to parallelization in C++ when the data to be acted upon is huge? I have been reading about OpenMP and Intel's TBB for parallelization in C++, but have not experimented with them yet. Which of these is better for parallel data processing? Any other libraries/approaches?
"large" and "data processing" cover a lot of ground here, and it's hard to give a sensible answer without more information.
If the data processing is "embarrassingly parallel" -- if it involves doing lots and lots of calculations that are completely independent of each other -- then there's a million things that will work, and it's just a matter of finding something that matches your code and background.
If it isn't embarrassingly parallel, but nearly so -- the computations take a big chunk of data but just distill it into a handful of numbers -- there are fewer, but still lots of, options.
If the calculation is more tightly coupled than this -- where you need the processors to work in tandem on big chunks of data -- then you're probably stuck with the standbys: the OpenMP features of your compiler if it will run on a single machine (there's TBB, too, but usually for number crunching OpenMP is faster and easier), or MPI if it needs several machines simultaneously. You mentioned C++; Boost has a very nice MPI layer.
But thinking about which library to use for parallelization is probably thinking about the wrong end of the problem first. In many cases, you don't necessarily need to deal with these layers directly. If the number crunching involves lots of linear algebra (for instance), then PLASMA (for multicore machines - http://icl.cs.utk.edu/plasma/ ) or PETSc, which has support for distributed-memory machines, i.e. multiple computers ( http://www.mcs.anl.gov/petsc/petsc-as/ ), are good choices, which can completely hide the actual details of the parallel implementation from you. Other sorts of techniques have their own libraries, too. It's probably best to think about what sort of analysis you need to do, and look to see whether existing toolkits provide the amount of parallelization you need. Only once you've determined the answer is no should you start to worry about how to roll your own.
Both OpenMP and Intel TBB are for local use as they help in writing multithreaded applications.
If you have truly huge datasets, you may need to split the load over several machines -- and then libraries like Open MPI, for parallel programming with MPI, come into play. Open MPI has a C++ interface, but you now also face a networking component and some administrative issues you do not have with a single computer.
MPI is also useful on a single local machine. It will run a job across multiple cores/CPUs; while this is probably overkill compared to threading, it does mean you can move the job to a cluster with no changes. Most MPI implementations also optimize a local job to use shared memory instead of TCP for data connections.
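For a flavour of the TBB side, here is a minimal parallel loop sketch (the per-element work is a placeholder; header names follow classic TBB):

```cpp
// Minimal Intel TBB sketch: parallel_for over a blocked range. TBB
// chooses the grain size and schedules the chunks onto worker threads.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>

int main() {
    std::vector<double> data(10000000, 1.0);
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, data.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0;        // placeholder per-element work
        });
    return 0;
}
```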

Threading Building Blocks versus MPI: which one fits my need better?

I currently have a serial C++ solver for optimization problems, and I am supposed to parallelize it with different parameters to see whether that can improve its performance. Now I am not sure whether I should use TBB or MPI. From a TBB book I read, I feel TBB is more suitable for loops or fine-grained code. Since I do not have much experience with TBB, I find it difficult to divide my code into small parts in order to realize the parallelization. In addition, in the literature I find that many authors have used MPI to parallelize several solvers and make them cooperate, so I guess maybe MPI fits my needs better. Since I do not have much knowledge of either TBB or MPI: can anyone tell me whether my feeling is right? Will MPI fit me better? If so, what material is good for starting to learn MPI? I have no experience with MPI, and I use Windows and C++. Thanks a lot.
The basic decision you need to make is between shared memory and distributed memory.
Shared memory is when you have more than one process (normally more than one thread within a process) that can access a common memory. This can be quite fine-grained, and it is normally simpler to adapt a single-threaded program to have several threads. You will need to design the program so that the threads work most of the time on separate parts of the memory (exploiting data parallelism) and so that the shared part is protected against concurrent accesses using locks.
Distributed memory means that you have different processes that might execute on one or several distributed computers, but these processes together have a common goal and share data through message passing (data communication). There is no common memory space, and all the data one process needs from another process requires communication.
It is a more general approach but, because of the communication requirements, it calls for coarse grains.
TBB is a library supporting thread-based shared-memory parallelism, while MPI is a library for distributed-memory parallelism (it has simple primitives for communication, plus scripts for launching several processes on different nodes).
The most important thing is for you to identify the parallelism within your solver and then choose the best solution. Do you have data parallelism (different threads/processes could work in parallel on different chunks of data without needing to communicate or share parts of that data)? Task parallelism (different threads/processes could perform a different transformation on your data, or a different step of the data processing, in a pipeline or graph fashion)? A minimal message-passing example follows below.
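To make the message-passing model above concrete, here is a minimal MPI point-to-point sketch (a trivial stand-in for a solver's real data exchanges):

```cpp
// Minimal MPI point-to-point sketch: rank 0 sends a buffer to rank 1.
// Run with something like: mpirun -np 2 ./a.out
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 100;
    std::vector<double> buf(n, 0.0);
    if (rank == 0) {
        for (int i = 0; i < n; ++i) buf[i] = i;  // data only rank 0 has
        MPI_Send(buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        std::printf("rank 1 received %g ... %g\n", buf[0], buf[n - 1]);
    }
    MPI_Finalize();
    return 0;
}
```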

shared memory, MPI and queuing systems

My Unix/Windows C++ app is already parallelized using MPI: the job is split over N CPUs, each chunk is executed in parallel, it's quite efficient with very good speed scaling, and the job is done right.
But some of the data is replicated in each process, and for technical reasons this data cannot easily be split over MPI (...).
For example:
5 GB of static data, the exact same thing loaded for each process
4 GB of data that can be distributed over MPI; the more CPUs are used, the smaller this per-CPU RAM is.
On a 4-CPU job, this would mean at least a 20 GB RAM load; most of the memory is 'wasted', which is awful.
I'm thinking of using shared memory to reduce the overall load; the "static" chunk would be loaded only once per computer.
So, the main question is:
Is there any standard MPI way to share memory on a node? Some kind of readily available, free library?
If not, I would use boost.interprocess and use MPI calls to distribute local shared memory identifiers.
The shared memory would be read by a "local master" on each node and shared read-only. There's no need for any kind of semaphore/synchronization, because it won't change.
Any performance hit or particular issues to be wary of?
(There won't be any "strings" or overly weird data structures; everything can be brought down to arrays and structure pointers.)
The job will be executed in a PBS (or SGE) queuing system; in the case of an unclean process exit, I wonder whether those will clean up the node-specific shared memory.
One increasingly common approach in High Performance Computing (HPC) is hybrid MPI/OpenMP programs. I.e. you have N MPI processes, and each MPI process has M threads. This approach maps well to clusters consisting of shared memory multiprocessor nodes.
Changing to such a hierarchical parallelization scheme obviously requires some more or less invasive changes, OTOH if done properly it can increase the performance and scalability of the code in addition to reducing memory consumption for replicated data.
Depending on the MPI implementation, you may or may not be able to make MPI calls from all threads. This is specified by the required and provided arguments to the MPI_Init_thread() function, which you must call instead of MPI_Init(). The possible values are:
MPI_THREAD_SINGLE: only one thread will execute.
MPI_THREAD_FUNNELED: the process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are "funneled" to the main thread).
MPI_THREAD_SERIALIZED: the process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are "serialized").
MPI_THREAD_MULTIPLE: multiple threads may call MPI, with no restrictions.
In my experience, modern MPI implementations like Open MPI support the most flexible MPI_THREAD_MULTIPLE. If you use older MPI libraries, or some specialized architecture, you might be worse off.
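In code, requesting a level and checking what the library actually provides looks roughly like this (a minimal sketch; how to react to a lower level is up to the application):

```cpp
// Requesting a thread level and checking what the library provides.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        // The library offers less than requested; fall back to funneled
        // calls or abort, depending on what the application can tolerate.
        std::fprintf(stderr, "wanted MPI_THREAD_MULTIPLE, got %d\n",
                     provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... multithreaded MPI work ... */
    MPI_Finalize();
    return 0;
}
```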
Of course, you don't need to do your threading with OpenMP; that's just the most popular option in HPC. You could use e.g. the Boost threads library, the Intel TBB library, or straight pthreads or Windows threads, for that matter.
I haven't worked with MPI, but if it's like other IPC libraries I've seen that hide whether other threads/processes are on the same or a different machine, then it won't be able to guarantee shared memory. Yes, it could handle shared memory between two nodes on the same machine, if that machine provided shared memory itself. But trying to share memory between nodes on different machines would be very difficult at best, due to the complex coherency issues involved. I'd expect it to simply be unimplemented.
In all practicality, if you need to share memory between nodes, your best bet is to do that outside MPI. I don't think you need to use boost.interprocess-style shared memory, since you aren't describing a situation where the different nodes are making fine-grained changes to the shared memory; it's either read-only or partitioned.
John's and deus's answers cover how to map in a file, which is definitely what you want to do for the 5 GB of static data. The per-CPU data sounds like the same thing; you just need to send a message to each node telling it which part of the file it should grab. The OS should take care of mapping virtual memory to physical memory and to the files.
As for cleanup... I would assume the queuing system doesn't do any cleanup of shared memory, but mmapped files should be cleaned up, since files are closed (which should release their memory mappings) when a process is torn down. I have no idea what caveats CreateFileMapping etc. have.
Actual "shared memory" (i.e. boost.interprocess) is not cleaned up when a process dies. If possible, I'd recommend killing a process and seeing what is left behind.
With MPI-2 you have RMA (remote memory access) via functions such as MPI_Put and MPI_Get. Using these features, if your MPI installation supports them, would certainly help you reduce the total memory consumption of your program. The cost is added coding complexity, but that's part of the fun of parallel programming. Then again, it does keep you in the domain of MPI.
MPI-3 offers shared memory windows (see e.g. MPI_Win_allocate_shared()), which allows usage of on-node shared memory without any additional dependencies.
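A rough sketch of how that looks (assuming an MPI-3 library; the block size is a small stand-in for the real static data):

```cpp
// MPI-3 shared-memory window sketch: rank 0 on each node allocates the
// static block once; the other ranks on that node map the same memory.
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // One communicator per shared-memory node.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    const MPI_Aint bytes = 1 << 20;    // stand-in for the real 5 GB block
    double* base = nullptr;
    MPI_Win win;
    // Only node-rank 0 allocates; everyone else asks for 0 bytes and
    // then queries rank 0's base address.
    MPI_Win_allocate_shared(node_rank == 0 ? bytes : 0, sizeof(double),
                            MPI_INFO_NULL, node_comm, &base, &win);
    if (node_rank != 0) {
        MPI_Aint size;
        int disp_unit;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &base);
    }
    // 'base' now points at the same node-local memory on every rank.

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```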
I don't know much about Unix, and I don't know what MPI is. But in Windows, what you are describing is an exact match for a file mapping object.
If this data is embedded in your .EXE or a .DLL that it loads, then it will automatically be shared between all processes. Teardown of your process, even as the result of a crash, will not cause any leaks or unreleased locks on your data. However, a 9 GB .DLL sounds a bit iffy, so this probably doesn't work for you.
However, you could put your data into a file, then use CreateFileMapping and MapViewOfFile on it. The mapping can be read-only, and you can map all or part of the file into memory. All processes will share pages that are mapped from the same underlying CreateFileMapping object. It's good practice to unmap views and close handles, but if you don't, the OS will do it for you on teardown.
Note that unless you are running x64, you won't be able to map a 5 GB file into a single view (even a 2 GB file won't fit; 1 GB might work). But given that you are talking about having this already working, I'm guessing that you are already x64-only.
If you store your static data in a file, you can use mmap on Unix to get random access to the data. Data will be paged in as and when you need access to a particular bit of it. All you will need to do is overlay any binary structures over the file data. This is the Unix equivalent of the CreateFileMapping and MapViewOfFile mentioned above.
Incidentally, glibc uses mmap when malloc is called for a sufficiently large block of data.
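A minimal read-only mmap sketch, with the file name made up and error handling trimmed:

```cpp
// Read-only mmap of a data file (POSIX). Pages of the same file mapped
// by several processes on one machine are shared via the OS page cache.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "static_data.bin";  // made-up file name
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    // Overlay the real binary structures over 'p' here, e.g.:
    // const double* table = static_cast<const double*>(p);

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```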
I had some projects with MPI at SHUT.
As far as I know, there are many ways to distribute a problem using MPI; maybe you can find another solution that does not require shared memory.
My project was solving a system of 7,000,000 equations in 7,000,000 variables.
If you can explain your problem, I will try to help you.
I ran into this problem on a small scale when I used MPI a few years ago.
I am not certain that SGE understands memory-mapped files. If you are distributing against a Beowulf cluster, I suspect you're going to have coherency issues. Could you say a little about your multiprocessor architecture?
My draft approach would be to set up an architecture where each part of the data is owned by a defined CPU. There would be two threads: one thread acting as an MPI two-way talker and one thread computing the result. Note that MPI and threads don't always play well together.