Package for distributing calculations - C++

Do you know of any package for distributing calculations across several computers and/or several cores on each computer? The calculation code is in C++, and the package needs to be able to cope with data >2 GB and work on a Windows x64 machine. Shareware would be nice, but isn't a requirement.

A suitable solution would depend on the type of calculation and data you wish to process, the granularity of parallelism you wish to achieve, and how much effort you are willing to invest in it.
The simplest would be to just use a suitable solver/library that supports parallelism (e.g. ScaLAPACK). Or, if you wish to roll your own solvers, you can squeeze some parallelisation out of your current code using OpenMP or compilers that provide automatic parallelisation (e.g. the Intel C/C++ compiler). All of these will give you a reasonable performance boost without requiring massive restructuring of your code.
At the other end of the spectrum, you have the MPI option. It can give you the biggest performance boost if your algorithm parallelises well, but it will require a fair bit of re-engineering.
Another alternative would be to go down the threading route. There are libraries and tools out there that will make this less of a nightmare; Boost's threading library and Intel Threading Building Blocks are worth a look.

You may want to look at OpenMP
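For example, here is a minimal OpenMP sketch (the loop and data are made up for illustration); compile with an OpenMP-capable compiler such as MSVC (/openmp), GCC (-fopenmp), or the Intel compiler:

    // Parallelize an independent-iteration loop across the cores of one machine.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<double> data(n, 1.0);

        double sum = 0.0;
        // OpenMP splits the iteration range across threads; the reduction
        // clause combines the per-thread partial sums at the end.
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += data[i] * data[i];

        std::printf("sum = %f on up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }

Note that OpenMP only parallelizes across the cores of a single machine; to use several computers you would combine it with something like MPI (see below).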

There's an MPI library and the DVM system working on top of MPI. These are generic tools widely used for parallelizing a variety of tasks.
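As a rough sketch of what the MPI route looks like (the partial-sum work below is just a placeholder), build against an MPI distribution for Windows x64 such as MS-MPI or MPICH and launch with mpiexec:

    // Each process computes its own slice of the work; the results are
    // combined on rank 0 with a reduction.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Placeholder work: each rank sums a strided subset of terms.
        double local = 0.0;
        for (int i = rank; i < 1000000; i += size)
            local += 1.0 / (1.0 + i);

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("total = %f using %d processes\n", total, size);

        MPI_Finalize();
        return 0;
    }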

Related

Sharing multi-threaded code between standard C++ and OpenCL

I am trying to develop a C++ application that will have to run many calculations in parallel. The algorithms will be pretty large, but they will be purely mathematical and operate on floating-point numbers. They should therefore work on the GPU using OpenCL. I would like the program to work on systems that do not support OpenCL, but also to be able to use the GPU for extra speed on systems that do.
My problem is that I do not want to have to maintain two sets of code (standard C++, probably using std::thread, and OpenCL). I do realize that I will be able to share a lot of the actual code, but the only obvious way to do so is to manually copy the shareable parts between the files, and that is really not something I want to do, for maintainability reasons and to prevent errors.
I would like to be able to use only one set of files. Is there any proper way to do this? Would OpenCL emulation be an option in a way?
PS. I know of the AMD OpenCL emulator but it seems to be only for development and also only for Windows. Or am I wrong?
OpenCL can use the CPU as a compute device, so OpenCL code can run on platforms without a GPU. However, the architectural differences between a GPU and a CPU will most likely require you to still maintain two code bases to get optimal performance in both situations.
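A minimal sketch of that idea, assuming the OpenCL headers and an ICD are installed: ask the first platform for a GPU device and fall back to a CPU device, so the same kernel source can be used either way (a real program would check every platform and handle errors properly):

    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_uint num_platforms = 0;
        clGetPlatformIDs(0, nullptr, &num_platforms);
        if (num_platforms == 0) {
            std::printf("No OpenCL platform present; take the plain C++ path.\n");
            return 0;
        }

        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);

        // Prefer a GPU, but fall back to the CPU device of the same platform.
        cl_device_id device;
        cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
        if (err != CL_SUCCESS)
            err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, nullptr);

        if (err == CL_SUCCESS) {
            char name[256] = {0};
            clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
            std::printf("Using OpenCL device: %s\n", name);
        } else {
            std::printf("No usable OpenCL device; take the plain C++ path.\n");
        }
        return 0;
    }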

Intel TBB vs CilkPlus [closed]

I am developing time-demanding simulations in C++ targeting Intel x86_64 machines.
After researching a little, I found two interesting libraries to enable parallelization:
Intel Threading Building Blocks
Intel Cilk Plus
As stated in the docs, they both target parallelism on multicore processors, but I still haven't figured out which one is best. AFAIK Cilk Plus simply implements three keywords for easier parallelism (which requires GCC to be rebuilt to support those keywords), while TBB is just a library that promotes better parallel development.
Which one would you recommend?
Consider that I am having many, many problems installing Cilk Plus (still trying and still screaming). So I was wondering: should I go check out TBB first? Is Cilk Plus better than TBB? What would you recommend?
Are they compatible?
Should I manage to install Cilk Plus (still praying for this), would it be possible to use TBB together with it? Can they work together? Has anyone had experience developing software with both Cilk Plus and TBB? Would you recommend working with them together?
Thank you
Here is some FAQ-type information related to the question in the original post.
Cilk Plus vs. TBB vs. Intel OpenMP
In short, it depends on what type of parallelization you are trying to implement and how your application is coded.
I can answer this question in the context of TBB. The pros of using TBB are:
No compiler support needed to run the code.
TBB's generic C++ algorithms let users create their own objects and map them to threads as tasks.
The user doesn't need to worry about thread management: the built-in task scheduler automatically detects the number of available hardware threads. However, the user can choose to fix the number of threads for performance studies.
Flow graphs for creating tasks that respect dependencies let the user easily exploit functional as well as data parallelism.
TBB is naturally scalable, obviating the need for code modification when migrating to larger systems.
An active forum, and documentation that is continually updated.
With Intel compilers, the latest version of TBB performs really well.
The cons can be:
A low user base in the open-source community, making it difficult to find examples.
The examples in the documentation are very basic, and in older versions some are even wrong. However, the Intel forum is always ready to extend support to resolve issues.
The abstraction in the template classes is very high, making the learning curve steep.
The overhead of creating tasks is high. The user has to make sure that the problem size is large enough for the partitioner to create tasks of optimal grain size.
I have not worked with Cilk either, but it is apparent that, of the users in the two camps, the majority are on the TBB side. If Intel keeps pushing TBB with updated documentation and free support, the TBB user community is likely to keep growing.
They (Cilk and TBB) can be used to complement each other, and usually that's best. But from my experience, you will use TBB the most.
TBB and Cilk will scale automatically with the number of cores (by creating a tree of tasks and then recursing at run time).
TBB is a runtime library for C++ that uses programmer-defined task patterns instead of threads. TBB decides at run time on the optimal number of threads, task granularity, and performance-oriented scheduling (automatic load balancing through task stealing, cache efficiency, and memory reuse). Tasks are created recursively (for a tree this is logarithmic in the number of tasks).
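A rough sketch of that task model using TBB's parallel_for (the array transformation is just a placeholder):

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <vector>

    int main() {
        std::vector<double> data(1 << 20, 1.0);

        // parallel_for splits the range into chunks (tasks); the scheduler maps
        // them onto worker threads and load-balances by work stealing.
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, data.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    data[i] = data[i] * 2.0 + 1.0;
            });

        return 0;
    }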
Cilk (Plus) is a C/C++ language extension that requires compiler support.
Code may therefore not be portable across compilers and operating systems. It supports fork-join parallelism, and it is extremely easy to parallelize recursive algorithms with it. Lastly, it offers a few keywords (cilk_spawn, cilk_sync) with which you can parallelize code very easily (not a lot of rewriting is needed!).
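For instance, a minimal fork-join sketch with the Cilk Plus keywords (needs a Cilk Plus-enabled compiler, e.g. the cilkplus GCC branch or the Intel compiler):

    #include <cilk/cilk.h>
    #include <cstdio>

    static long fib(int n) {
        if (n < 2) return n;
        long a = cilk_spawn fib(n - 1); // may run in parallel with the line below
        long b = fib(n - 2);
        cilk_sync;                      // wait for the spawned call to finish
        return a + b;
    }

    int main() {
        std::printf("fib(30) = %ld\n", fib(30));
        return 0;
    }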
Other differences that might be interesting:
a) Cilk uses random work stealing to keep waiting workers busy.
b) TBB steals from the most heavily loaded worker.
Is there a reason you can't use the pre-built GCC binaries we make available at https://www.cilkplus.org/download#gcc-development-branch ? It's built from the cilkplus_4-8_branch, and should be reasonably current.
Which solution you choose is up to you. Cilk provides a very natural way to express recursive algorithms, and its child-stealing scheduler can be very cache friendly if you use cache-oblivious algorithms. If you have questions about Cilk Plus, you'll get the best response to them in the Intel Cilk Plus forum at http://software.intel.com/en-us/forums/intel-cilk-plus/.
Cilk Plus and TBB are aware of each other, so they should play well together if you mix them. Instead of getting a combinatorial explosion of threads you'll get at most the number of threads in the TBB thread pool plus the number of Cilk worker threads. Which usually means you'll get 2P threads (where P is the number of cores) unless you change the defaults with library calls or environment variables. You can use the vectorization features of Cilk Plus with either threading library.
- Barry Tannenbaum
Intel Cilk Plus developer
So, as requested by the OP:
I have used TBB before and I'm happy with it. It has good docs and the forum is active. It's not rare to see the library developers answering questions. Give it a try. (I've never used Cilk Plus, so I can't speak to it.)
I've worked with it on both Ubuntu and Windows. On Ubuntu you can install the packages via the package manager or build the sources yourself; either way it shouldn't be a problem. On Windows I built TBB with MinGW under the Cygwin environment.
As for compatibility issues, there shouldn't be any. TBB works fine with Boost.Thread or OpenMP, for example; it was designed so it could be mixed with other threading solutions.

Libraries for parallel distributed Cholesky decomposition in C/C++ in an MPI environment?

What libraries are available for parallel distributed Cholesky decomposition of dense matrices in C/C++ in an MPI environment?
I've found the ScaLAPACK library, and this might be the solution I'm looking for. It seems that it's a bit fiddly to call though, lots of Fortran <-> C conversions to do, which makes me think that maybe it is not widely used, and therefore maybe there are some other libraries that are used instead?
Alternatively, are there some wrappers for ScaLAPACK that make it relatively not too painful to use in a C or C++ environment, when one is already using MPI, and MPI has already been initialized in the program?
Are these dense or sparse matrices?
Trilinos is a huge library for parallel scientific computation. The sub-package Amesos can link to ScaLAPACK for parallel, direct solution of dense systems and to UMFPACK, SuperLU or MUMPS for sparse systems. Trilinos is mostly in C++, but there are Python bindings if that's your taste. It might be overkill, but it'll get the job done.
Intel MKL might also be a choice, since it includes ScaLAPACK. Note that Intel supports student use of this library, but in that case you have to use an open-source MPI implementation. Also, the Intel forum is very helpful.
Elemental is also an option. It is written in C++, which is surely a big advantage when you want to integrate it with your C/C++ application, and the project leader, Jack Poulson, is very friendly and helps a lot.
OpenBLAS, SuperLU and PETSc are also interesting and you may want to read more in my answer.

Data parallel libraries in C/C++

I have a C# prototype that is heavily data-parallel, and I've had extremely successful order-of-magnitude speed-ups using the Parallel.For construct in .NET 4. Now I need to write the program in native code, and I'm wondering what my options are. I would prefer something somewhat portable across various operating systems and compilers, but if push comes to shove, I can go with something Windows/VC++.
I've looked at OpenMP, which looks interesting, but I have no idea whether this is looked upon as a good solution by other programmers. I'd also love suggestions for other portable libraries I can look at.
If you're happy with the level of parallelism you're getting from Parallel.For, OpenMP is probably a pretty good solution for you -- it does roughly the same kinds of things. There's also work and research being done on parallelized implementations of the algorithms in the standard library. Code that uses the standard algorithms heavily can gain from this with even less work.
Some of this is done using OpenMP, while some uses hand-written code to (attempt to) provide greater benefits. In the long term, we'll probably see more of the latter, but for now, the OpenMP route is probably a bit more practical.
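For what it's worth, this is the shape those parallelized standard-library algorithms eventually took; the sketch below assumes a C++17 toolchain with execution-policy support (newer than the compilers discussed here):

    #include <algorithm>
    #include <execution>
    #include <functional>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<int> v(1 << 20);
        std::iota(v.begin(), v.end(), 0);

        // Same call as the serial version plus an execution policy; the
        // library decides how to split the work across threads.
        std::sort(std::execution::par, v.begin(), v.end(), std::greater<int>());
        return 0;
    }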
If you're using Parallel.For in .NET 4.0, you should also look at the Parallel Patterns Library (PPL) in VS2010 for C++; many things are similar and it is library-based.
If you need to run on a platform besides Windows, the PPL is also implemented in Intel's Threading Building Blocks, which has an open-source version.
Regarding the other comment on similarities/differences vs. .NET 4.0, the parallel loops in the PPL (parallel_for, parallel_for_each, parallel_invoke) are virtually identical to the .NET 4.0 loops. The task model is slightly different, but should be straightforward.
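A small sketch of that PPL loop (the body is a placeholder); it ships with VS2010's ppl.h, and tbb::parallel_for has a near-identical index-based overload if you need to go beyond Windows:

    #include <ppl.h>
    #include <vector>

    int main() {
        std::vector<double> data(1 << 20, 1.0);

        // The lambda is invoked for each index, distributed across worker
        // threads managed by the Concurrency Runtime.
        Concurrency::parallel_for(std::size_t(0), data.size(), [&](std::size_t i) {
            data[i] = data[i] * 2.0 + 1.0;
        });

        return 0;
    }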
You should check out Intel's Threading Building Blocks. Visual Studio 2010 also offers a native Concurrency Runtime. The .NET 4.0 libraries and the ConcRT are designed very similarly, if not identically.
If you want something versatile, as in portable across various operating systems and environments, it would be very difficult not to consider Java. And it is very similar to C#, so it would be a very easy transition.
Unless you want to pull out your ninja scalpel and make your code extremely efficient, I would say Java over VC++ or C++.

Is it worth learning AMD-specific APIs?

I'm currently learning the APIs related to Intel's parallelization libraries such as TBB, MKL and IPP. I was wondering, though, whether it's also worth looking at AMD's part of the puzzle. Or would that just be a waste of time? (I must confess, I have no clue about AMD's library support - at all - so would appreciate any advice you might have.)
Just to clarify, the reason I'm going the Intel way is because 1) the APIs are very nice; and 2) Intel seems to be taking tool support as seriously as API support. (Once again, I have no clue how AMD is doing in this department.)
The MKL and IPP libraries will perform (nearly) as well on AMD machines. My guess is that TBB will also run just fine on AMD boxes. If I had to suggest a technology that would be beneficial and useful to both, it would be to master the OpenMP libraries. The Intel compiler with the OpenMP extensions is stunningly fast and works with AMD chips also.
Only worth it if you are specifically interested in building something like video games, operating systems, database servers, or virtualization software. In other words: if you have a segment where you care enough about performance to take the time to do it (and do it right) in assembler. The same is true for Intel.
If your company sells packages of just Intel Servers with your software, then you shouldn't bother learning the AMD approach. But if you're going to have to offer software for both (or many) different platforms, then it might be worth looking into the different technologies. It will be very difficult to create the wrappers for the hardware-specific libraries. (Especially since threading is involved.)
And you definitely don't want to write a completely separate implementation for each hardware configuration. In fact, if your software is to be consumed by a generic user, then you may want to abandon the Intel technology and use standard threading techniques. I don't mean to be discouraging, but I believe that the Intel threading libraries are a bit ahead of their time, for all intents and purposes.