parallel processing library - c++

I would like to know which parallel processing library would be best under these configurations:
A single quad-core machine. I would like to execute four instances of the same function, one on each core, each with different arguments.
A cluster of 4 machines, each with multiple cores. I would like to execute the same functions, but n ways in parallel (4 machines * number of cores per machine). So I want it to scale.
Program details:
C++ program. There is no dependency between functions. The same function gets executed with different sets of inputs, more than 100 times in total.
There is no shared memory, as each function takes its own data and its own inputs.
No function needs to wait for the others to complete; there is no need for fork or join.
For the above scenarios, what are the best parallel libraries to use? MPI, Boost.MPI, OpenMP, or other libraries?
My preference would be Boost.MPI, but I want some recommendations. I am not sure whether MPI can be used on a single multi-core machine.
Thanks.

What you have here is an embarrassingly parallel problem (http://en.wikipedia.org/wiki/Embarrassingly_parallel). While MPI can definitely be used on a multi-core machine, it could be overkill for the problem at hand. If your tasks are completely separate, you could just compile them into separate executables, or into a single executable invoked with different inputs, and use "make -j [n]" (see http://www.gnu.org/software/make/manual/html_node/Parallel.html) to execute them in parallel.
If MPI comes naturally to you, by all means use it. OpenMP probably won't cut it if you want to control computation across separate computers within a cluster.

Related

How to use multi-core instead of multi-thread for programming?

I'm working on a project (hardware: Raspberry Pi 3B+) that involves a lot of computation and parallel processing. At present, I'm noticing some lag in the code's performance, so I'm constantly looking for efficient ways to improve my code.
Currently, I'm using the C language (because I can access and manipulate lower-level drivers easily) and developing my own set of functions, libraries, and drivers, which run faster than any pre-defined or ready-made libraries or plugins.
Now, instead of software-based multi-threading (pthreads), I want to use separate cores for the corresponding tasks. Any suggestions or guidelines on how I can use the different cores of the Raspberry Pi?
Moreover, how can I check CPU utilization to choose the best core for a certain task?
Thanking with regards,
Aatif Shaikh
At the C/C++ level you cannot choose which CPU core will run which thread. Just use the C++11 standard threads and let the OS scheduler decide which thread runs where.
That said, Linux has the taskset tool to set or check a process's CPU affinity, and there's also the sched_setaffinity() function.

Software parallelization with OpenMP versus Affinity scheduling?

Scenario: I have a program that can be easily parallelized using OpenMP; let's say the main loop of the program is a for loop with independent data within it, so parallelizing it would be trivial. However, I currently don't parallelize it, and instead use affinity scheduling.
This program performs work on input files in a folder specified on the command line. To run this program in parallel, someone can create a bat file like so:
start /affinity 1 "1" bat1
start /affinity 2 "2" bat2
start /affinity 3 "3" bat3
start /affinity 4 "4" bat4
where bat1 through bat4 are batch files that each call main.exe with a different input folder. So in this case there would be 4 instances of main.exe running on input_folder1, input_folder2, input_folder3, and input_folder4 respectively.
What would the benefits of using a library like OpenMP be, instead of affinity scheduling? I figure:
Less memory usage, single stack and heap for a single program instance as opposed to n instances of a program for n cores
Better scaling
But should I actually expect to see a performance boost? If so, why?
If your problem is simply parallel, with no interaction among the data in the separate input files, then you would probably not see a speedup with OpenMP, and might even see a slowdown, since memory allocation and various other things then have to be thread-safe. Single-threaded processes can gain a lot of efficiency, and in fact do with GNU libc, where linking in POSIX threads support means you also get a slower implementation of malloc.

OpenCL: running parallel tasks on a data-parallel kernel

I'm currently reading up on the OpenCL framework for my thesis work. What I've come across so far is that you can run kernels either data-parallel or task-parallel. Now I have a question whose answer I can't manage to find.
Q: Say that you have a vector that you want to sum up. You can do that in OpenCL by writing a kernel for a data-parallel process and just running it. Fairly simple.
However, now say that you have 10+ different vectors that also need to be summed up. Is it possible to run these 10+ different vectors in task-parallel fashion, while still using a kernel that processes each of them as "data parallel"?
So you would basically parallelize tasks, which in a sense are themselves run in parallel? Because what I've come to understand is that you can EITHER run the tasks in parallel, OR run one task itself in parallel.
The whole task-parallel/data-parallel distinction in OpenCL was a mistake. We deprecated clEnqueueTask in OpenCL 2.0 because it had no meaning.
All enqueued entities in OpenCL can be viewed as tasks. Those tasks may be run concurrently, they may be run in parallel, they may be serialized. You may need multiple queues to run them concurrently, or a single out-of-order queue, this is all implementation-defined to be fully flexible.
Those tasks may be data-parallel, if they are made of multiple work-items working on different data elements within the same task. They may not be, consisting of only one work-item. This last definition is what clEnqueueTask used to provide - however, because it had no meaning whatsoever compared with clEnqueueNDRangeKernel with a global size of (1,1,1), and it was not checked against anything in the kernel code, deprecating it was the safer option.
So yes, if you enqueue multiple NDRanges, you can have multiple tasks in parallel, each one of which is data-parallel.
You can also process all of those vectors at once inside one data-parallel kernel, if you are careful with the way you pass them in. One option would be to launch a range of work-groups, where each one iterates through a single vector (that might well be the fastest way on a CPU, for cache-prefetching reasons). You could have each work-item handle one element, using some complex lookup to see which vector it belongs to, but that would likely have high overhead. Or you can just launch multiple parallel kernels, one per vector, and let the runtime decide whether it can run them together.
If your 10+ different vectors are close to the same size, it becomes a data parallel problem.
The task parallel nature of OpenCL is more suited for CPU implementations. GPUs are more suited for data parallel work. Some high-end GPUs can have a handful of kernels in-flight at once, but their real efficiency is in large data parallel jobs.

MPI Fundamentals

I have a basic question regarding MPI, to get a better understanding of it (I am new to MPI and multiple processes, so please bear with me on this one). I am using a simulation environment in C++ (RepastHPC) that makes extensive use of MPI (via the Boost libraries) to allow parallel operations. In particular, the simulation consists of multiple instances of the respective classes (i.e. agents) that are supposed to interact with each other, exchange information, etc. Now, given that this takes place across multiple processes (and given my rudimentary understanding of MPI), the natural question or fear I have is that agents on different processes can no longer interact with each other because they cannot connect (I know this contradicts the entire idea of MPI).
After reading the manual, my understanding is this: the available Boost.MPI libraries (and also the libraries of the above-mentioned package) take care of all the communication, sending packets back and forth between processes; i.e. each process has copies of the instances from other processes (I guess this is some form of call-by-value, because the original instance cannot be changed from a process that has only a copy), then an update takes place to ensure that the copies of the instances have the same information as the originals, and so on.
Does this mean that, in terms of the final outcomes of the simulation runs, I get the same result as if I were doing the entire thing in one process? Put differently, the multiple processes are just supposed to speed things up but not change the design of the simulation (so I don't have to worry about it)?
I think you have a fundamental misunderstanding of MPI here. MPI is not an automatic parallelization library. It isn't a distributed shared memory mechanism. It doesn't do any magic for you.
What it does do is make it simpler to communicate between different processes on the same or different machines. Each process has its own address space which does not overlap with the other processes (unless you're doing something else outside of MPI). Assuming you set up your MPI installation correctly, it will do all of the pain of setting up the communication channels between your processes for you. It also gives you some higher level abstractions like collective communication.
When you use MPI, you compile your code differently than normal. Instead of using g++ -o code code.cpp (or whatever your compiler is), you use mpicxx -o code code.cpp. This will automatically link in all of the necessary MPI machinery. Then when you run your application, you use mpiexec -n <num_processes> ./code (mpiexec takes other arguments too, but they aren't required). The num_processes argument tells MPI how many processes to launch; this isn't decided at compile/link time.
You will also have to rewrite your code to use MPI. MPI has lots of functions (the standard is available here and there are lots of tutorials available on the web that are easier to understand) that you can use. The basics are MPI_Send() and MPI_Recv(), but there's lots and lots more. You'll have to find a tutorial for that.

Executing C++ program on multiple processor machine

I developed a program in C++ for research purposes. It takes several days to complete.
Now I am executing it on our lab's 8-core server machine to get results more quickly, but I see that the machine assigns only one processor to my program and it stays at 13% processor usage (even though I set the process priority to high and the affinity to all 8 cores).
(It is a simple object-oriented program without any parallelism or multithreading.)
How can I get true benefit from the powerful server machine?
Thanks in advance.
Partition your code into chunks you can execute in parallel. You need to go read about data parallelism and task parallelism. Then you can use OpenMP or MPI to break up your program.
(It is a simple object-oriented program without any parallelism or multithreading.)
How can I get true benefit from the powerful server machine?
By using more threads. No matter how powerful the computer is, it cannot spread a single thread across more than one processor. Find independent portions of your program and run them in parallel.
C++0x threads
Boost threads
OpenMP
I personally consider OpenMP a toy. You should probably go with one of the other two.
You have to exploit parallelism explicitly by splitting your code into multiple tasks that can be executed independently, and then either use thread primitives directly or a higher-level parallelization framework such as OpenMP.
If you don't want to make your program itself use multithreaded libraries or techniques, you might be able to break your work up into several independent chunks. Then run multiple copies of your program, each assigned to a different chunk via different command-line parameters.
As for generally improving a program's performance: there are profiling tools that can help you find bottlenecks in memory usage, I/O, and CPU:
https://stackoverflow.com/questions/tagged/c%2b%2b%20profiling
This won't help split your work across cores, but if you can get an 8x speedup in an algorithm, that might help more than multithreading would on 8 cores. Just something else to consider.