C++ Parallelization Libraries: OpenMP vs. Thread Building Blocks [closed] - c++

I'm going to retrofit my custom graphics engine so that it takes advantage of multicore CPUs. More specifically, I am looking for a library to parallelize loops.
It seems to me that both OpenMP and Intel's Threading Building Blocks are very well suited for the job. Both are supported by Visual Studio's C++ compiler and by most other popular compilers, and both libraries seem quite straightforward to use.
So, which one should I choose? Has anyone tried both libraries and can give me some cons and pros of using either library? Also, what did you choose to work with in the end?
Thanks,
Adrian

I haven't used TBB extensively, but my impression is that the two complement each other more than they compete. TBB provides thread-safe containers and some parallel algorithms, whereas OpenMP is more a way to parallelise existing code.
Personally I've found OpenMP very easy to drop into existing code where you have a parallelisable loop or a bunch of sections that can run in parallel. It doesn't help you much, however, when you need to modify some shared data - which is where TBB's concurrent containers might be exactly what you want.
If all you want is to parallelise loops whose iterations are independent (or can fairly easily be made so), I'd go for OpenMP. If you need more interaction between the threads, I think TBB may offer a little more in that regard.
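As a small illustration of how the two can even be combined (my own sketch, not from either library's documentation - the find_roots function and its threshold are invented for the example): OpenMP parallelises the loop, while TBB's concurrent_vector makes the shared results container safe to push into from several threads.
#include <cmath>
#include <tbb/concurrent_vector.h>

// OpenMP splits the loop across threads; TBB's concurrent_vector makes the
// shared output container safe to push into from all of them at once.
void find_roots(const double* values, int n, tbb::concurrent_vector<double>& roots)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double r = std::sqrt(values[i]);   // stand-in for real per-element work
        if (r > 1.0)
            roots.push_back(r);            // thread-safe concurrent growth
    }
}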

From Intel's software blog: Compare Windows* threads, OpenMP*, Intel® Threading Building Blocks for parallel programming
It is also a matter of style - for me, TBB is very C++-like, while I don't like OpenMP pragmas that much (they reek of C a bit; I would use them if I had to write in C).
I would also consider the existing knowledge and experience of the team. Learning a new library (especially where threading/concurrency is concerned) does take some time. I think that for now OpenMP is more widely known and deployed than TBB (but this is just my opinion).
Yet another factor is portability - though considering the most common platforms, that is probably not an issue. The license, however, might be.
TBB incorporates some nice ideas originating from academic research, for example its recursive data-parallel approach.
There is some work on cache-friendliness, for example.
Reading the Intel blog post linked above is really interesting.

In general I have found that using TBB requires much more time-consuming changes to the code base with a high payoff, while OpenMP gives a quick but moderate payoff. If you are starting a new module from scratch and thinking long term, go with TBB. If you want small but immediate gains, go with OpenMP.
Also, TBB and OpenMP are not mutually exclusive.

I've actually used both, and my general impression is that if your algorithm is fairly easy to make parallel (e.g. loops of even size, not too much data interdependence) OpenMP is easier, and quite nice to work with. In fact, if you find you can use OpenMP, it's probably the better way to go, if you know your platform will support it. I haven't used OpenMP's new Task structures, which are much more general than the original loop and section options.
TBB gives you more data structures up front, but definitely requires more work up front. As a plus, it might be better at making you aware of race-condition bugs. What I mean is that in OpenMP it is fairly easy to introduce a race condition by failing to mark something as shared (or private) when it should be, and you only see this when you get bad results. I think that is a bit less likely to occur with TBB.
Overall my personal preference was for OpenMP, especially given its increased expressiveness with tasks.
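For reference, the task syntax looks roughly like this (a minimal sketch only; it needs an OpenMP 3.0 compiler such as a recent GCC or Intel compiler, so it won't build with Visual Studio 2008's OpenMP 2.0 support):
// Compile with an OpenMP 3.0 compiler, e.g. g++ -fopenmp
// Naive Fibonacci, purely to show the task pragmas; the cut-off keeps
// tiny subproblems from being turned into tasks.
long fib(int n)
{
    if (n < 20)
        return n < 2 ? n : fib(n - 1) + fib(n - 2);   // serial cut-off
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    return a + b;
}

long parallel_fib(int n)
{
    long result = 0;
    #pragma omp parallel
    #pragma omp single          // one thread spawns the initial tasks
    result = fib(n);
    return result;
}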

As far as I know, TBB (an open-source version is available under GPLv2) addresses the C++ area more than C. These days it is hard to find parallelization information specific to C++ and general OOP; most of it addresses functional, C-style code (the same is true of CUDA or OpenCL). If you need C++ support for parallelization, go for TBB!

Yes, TBB is much more C++ friendly, while OpenMP is more appropriate for FORTRAN-style C code given its design. The new task feature in OpenMP looks very interesting, and at the same time lambdas and function objects in C++0x may make TBB easier to use.
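For example, with a compiler that supports lambdas, a tbb::parallel_for call can be written without a separate function object (a small sketch; the scale function and factor parameter are made up for illustration):
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Scale every element of v in parallel; the lambda replaces the hand-written
// function object that TBB otherwise requires.
void scale(std::vector<double>& v, double factor)
{
    tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= factor;
        });
}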

In Visual Studio 2008 (with OpenMP support enabled via the /openmp compiler option), you can add the following line to parallelize a "for" loop whose iterations are independent. It even works with multiple nested for loops. Here is an example:
#pragma omp parallel for private(i,j)
for (i = 0; i < num_particles; i++)
{
    p[i].fitness = fitnessFunction(p[i].present);
    if (p[i].fitness > p[i].pbestFitness)
    {
        p[i].pbestFitness = p[i].fitness;
        for (j = 0; j < p[i].numVars; j++)
            p[i].pbest[j] = p[i].present[j];
    }
}
gbest = pso_get_best(num_particles, p);
After we added the #pragma omp parallel for line, both cores on my Core 2 Duo were used to their maximum capacity, so total CPU usage went from 50% to 100%.

Related

On the implementation of multiple simultaneous tasks of a compiler [closed]

In my struggle to part ways with interpreters, and while attempting to advance my knowledge of C++, I recently purchased a copy of "C++ In a Nutshell: A Desktop Quick Reference (In a Nutshell (O'Reilly))" and "Writing Compilers and Interpreters (Wiley)". While my college course on C++ taught me how to sort stacks and lists, it taught me nothing about compilers. I have decided to write a compiler tailored to my unique habits and coding style.
Given the recent advent of real multiprocessing capability, is making the compiler itself take advantage of it worth the required effort? I am fully aware of a multitude of libraries capable of providing threading and multiprocessing, but simply falling back on pre-existing code would not be an efficient way to learn, because I would lose that intimate connection with code written by my own hand.
Compilers are generally implemented as a pipeline, where source code goes in one end and a number of processing steps are applied in sequence, and an object file comes out the other end. This kind of processing does not generally lend itself well to multithreading.
With the speed of today's computers, it is rare that a compiler needs to do so much processing that it would benefit from a multithreaded implementation. Within a pipeline, there's only a few things you could usefully do anyway, such as running per-function optimisations in parallel.
A much easier gain can be had by simply running the compiler in parallel against more than one source file. This can be done without any multithreading support in the compiler: just write a single-threaded compiler (much less prone to error), and let your build system take care of the parallelism.
Writing a compiler is hard; languages are complex, people want good code, and they don't want to wait long to get it. This should be obvious from your own experience with using C++ compilers. Writing a C++ compiler is especially hard, because the language is especially complex (the new C++11 standard makes it considerably more so), and people expect really good code from C++ compilers.
All this complexity, and your lack of background in compilers, suggests that you are not likely to write a C++ compiler, let alone a parallel one. The GCC and CLANG communities have hundreds of people and decades of elapsed development time behind them. It may still be worth your effort to build a toy compiler, to understand the issues better.
Regarding parallelism in the compiler itself, one can take several approaches.
The first is to use a standard library, as you have suggested, and attempt to retrofit an existing compiler with parallelism. It is odd that few seem to have attempted this, given that GCC and CLANG are open-source tools. But it is also very difficult to parallelize a program that was designed without parallelism in mind. Where does one find units of parallelism (processing individual methods?)? How do you insert the parallelism? How do you ensure that the now-retrofitted compiler doesn't have synchronization problems (if the compiler processes all methods in parallel, where's the guarantee that a signature from one method is actually ready and available to the other methods being compiled)? Finally, how does one guarantee that the parallel work dominates the additional thread initialization/teardown/synchronization overhead, so that the compiler is actually faster given multiple CPUs? In addition, thread libraries are a bit difficult to use, and it is relatively easy to make a dumb mistake when coding a threading call. If you have lots of these calls in your compiler, you have a high probability of such a dumb mistake, and debugging will be hard.
The second is to build a new compiler from scratch, using such libraries. This requires a lot of work just to get the basic compiler in place, but has the advantage that the individual elements of the compiler can be considered during design, and parallel interlocks designed in. I don't know of any compilers built this way (surely there are some research compilers like this), but it is a lot of work, and clearly more work than just writing a non-parallel compiler. You still suffer from the clumsiness of thread libraries, too.
The third is to find a parallel programming language and write the compiler in that. This makes it easier to write parallel code without error, and can allow you to implement kinds of parallelism that might not be possible with a thread library (e.g. dynamic teams of computation, partial orders, ...). It also has the advantage that the parallel-language compiler can see the parallelism in the code, and can thus generate lower-overhead thread operations. Because compilers do many computations of varying duration, you don't want a synchronized-data-parallel language; you want one with task parallelism.
Our PARLANSE language is such a programming language, designed with the goal of doing parallel symbolic (i.e. non-numeric) computations appropriate for compilers. But you still need a compiler for it, and the energy to build a new compiler from scratch.
The fourth approach is to use a parallel language and a predefined library of compiler-typical activities (parsing, symbol-table lookup, flow analysis, code generation), and build your compiler that way. Then you don't have to reinvent the basic facilities of the compiler and can get on with building the compiler itself.
Our DMS Software Reengineering Toolkit is exactly such a set of tools and libraries, designed to allow one to build complex code generation/transformation or analysis tools. DMS has a full C++11 front end available that uses all the DMS (parallel) support machinery.
We've used DMS to carry out massive C++ analysis and transformation tasks. The parallelism is there, and it works; we could do more if we tried. We have not attempted to build a real parallel compiler; we don't see the market for it, considering that other C++ compilers are free and well established. I expect that someday somebody will find a niche where a parallel C++ compiler is needed, and then machinery like this is likely to be the foundation; it's too much work to start from scratch.
It's unlikely that you're going to build something that's substantially useful to you in real coding, especially as a first learning experience -- or, in other words, the value is in the journey, not in what you reach at the end. This is a learning experience.
Thus, I would suggest that you take a couple of things that are of interest to you and that annoy you about existing languages and compilers, and try to improve on them (or at least try out something different). Beyond that, though, you want to write something that's as simple as possible, so that you can actually complete it.
If you are doing enough multithreaded programming that that's a critical part of what you're thinking about, then it may be worth trying to write something that does multiprocessing. Otherwise, I would strongly recommend that you leave it out of your first compiler project; it's a big hairy nest of complication, and a compiler is a big enough and hairy enough project without it.
And, if you decide that multithreading is important, you can always add it later. Better to have something that works first, though, so that you have the fun of using it to keep you from getting bogged down while you add more to it.
Oh, and you definitely don't want to make your first compiler multithreaded internally! Start with simple, then add complexity once you've done it once and know how much effort is involved.
It depends on the type of language you wish to compile.
If you are thinking of creating a language today, it should be based on modules rather than the header/source split that C++ uses. With such languages, the compiler is often given a single "target" (a module) and automatically compiles its dependencies (other modules); this is, for example, what the Python compiler does.
In this case, multithreading yields an immediate gain: you can simply parallelize the compilation of the various modules. This is relatively easy - just think about leaving a "mark" that you are compiling a given module, to avoid doing it several times in parallel.
Going further is not immediately useful. While optimizations could probably be applied in parallel, for example, this would require paying attention to the synchronization of the modules they are applied to. The nice thing about compiling multiple modules in parallel is that they are mostly independent during compilation and read-only once ready, which alleviates most synchronization issues.
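As a rough sketch of the "mark" idea (my own illustration using C++11 facilities, not code from any real compiler): a mutex-protected set records which modules have been claimed, and each worker skips modules another thread is already handling.
#include <future>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <vector>

std::mutex mark_mutex;
std::set<std::string> claimed;   // the "mark": modules already being compiled

// Stand-in for the real per-module compilation work.
void compile_module(const std::string& name)
{
    std::cout << "compiling " << name << "\n";
}

// Compile a module only if no other thread has claimed it yet.
void compile_once(const std::string& name)
{
    {
        std::lock_guard<std::mutex> lock(mark_mutex);
        if (!claimed.insert(name).second)
            return;               // already claimed by someone else
    }
    compile_module(name);
}

// Launch one asynchronous job per requested module and wait for them all.
void compile_all(const std::vector<std::string>& modules)
{
    std::vector<std::future<void>> jobs;
    for (const auto& m : modules)
        jobs.push_back(std::async(std::launch::async, compile_once, m));
    for (auto& j : jobs)
        j.get();
}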

How to write fast (low level) code? [closed]

I would like to learn more about low level code optimization, and how to take advantage of the underlying machine architecture. I am looking for good pointers on where to read about this topic.
More details:
I am interested in optimization in the context of scientific computing (which is a lot of number crunching but not only) in low level languages such as C/C++. I am in particular interested in optimization methods that are not obvious unless one has a good understanding of how the machine works (which I don't---yet).
For example, it's clear that a better algorithm is faster, without knowing anything about the machine it's run on. It's not at all obvious that it matters if one loops through the columns or the rows of a matrix first. (It's better to loop through the matrix so that elements that are stored at adjacent locations are read successively.)
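(For instance, a tiny sketch of that point in C/C++, assuming a row-major N-by-N array; the size and function names are just for illustration:)
static const int N = 1024;   // example size

// Row-major storage: m[i][j] and m[i][j+1] are adjacent in memory.
double sum_rows_first(double m[N][N])      // cache-friendly traversal
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += m[i][j];
    return s;
}

double sum_cols_first(double m[N][N])      // strided access, many cache misses
{
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += m[i][j];
    return s;
}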
Basic advice on the topic or pointers to articles are most welcome.
Answers
Got answers with lots of great pointers, a lot more than I'll ever have time to read. Here's a list of all of them:
The software optimization cookbook from Intel (book)
What every programmer should know about memory (pdf book)
Write Great Code, Volume 2: Thinking Low-Level, Writing High-Level (book)
Software optimization resources by Agner Fog (five detailed pdf manuals)
I'll need a bit of skim time to decide which one to use (not having time for all).
Drepper's What Every Programmer Should Know About Memory [pdf] is a good reference to one aspect of low-level optimisation.
For Intel architectures this is priceless: The Software Optimization Cookbook, Second Edition
It's been a few years since I read it, but Write Great Code, Volume 2: Thinking Low-Level, Writing High-Level by Randall Hyde was quite good. It gives good examples of how C/C++ code translates into assembly, e.g. what really happens when you have a big switch statement.
Also, altdevblogaday.com is focused on game development, but the programming articles might give you some ideas.
An interesting book about bit manipulation and smart ways of doing low-level things is Hacker's Delight.
This is definitely worth a read for everyone interested in low-level coding.
Check out: http://www.agner.org/optimize/
C and C++ are usually the languages used for this because of their speed (ignoring Fortran, since you didn't mention it). One thing you can take advantage of (which the icc compiler does a lot) is the SSE instruction sets for heavy floating-point number crunching. Another possibility is using CUDA or the Stream API (for Nvidia and ATI respectively) to do very fast floating-point operations on the graphics card while leaving the CPU free to do the rest of the work.
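To give a flavour of what hand-written SSE looks like (a minimal sketch using the standard intrinsics; in practice icc's auto-vectoriser will often generate this kind of code for you):
#include <xmmintrin.h>   // SSE intrinsics

// Add two float arrays four elements at a time.
// Assumes n is a multiple of 4; the unaligned loads/stores avoid
// imposing any alignment requirements on the pointers.
void add_arrays_sse(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);                 // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);                 // load 4 floats from b
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));      // out[i..i+3] = a + b
    }
}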
Another approach to this is hands-on comparison. You can get a library like Blitz++ (http://www.oonumerics.org/blitz/) which - I've been told - implements aggressive optimisations for numeric/scientific computing, then write some simple programs doing operations of interest to you (e.g. matrix multiplications). Use Blitz++ to perform them, write your own class that does the same, and if Blitz++ proves faster, start investigating its implementation until you realise why. (And if yours is significantly faster, you can tell the Blitz++ developers!)
You should end up learning about a lot of things, for example:
memory cache access patterns
expression templates (there are some bad links atop Google search results on expression templates - the key property you want to find discussion of is that they can encode many successive steps in a chain of operations so that they are all applied during one loop over a data set; a toy sketch follows after this list)
some CPU-specific instructions (though I haven't checked they've used such non-portable techniques)...
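Here is a toy expression-template sketch (my own illustration of the idea, nothing like Blitz++'s actual implementation): the sum a + b + c builds a lightweight expression object instead of temporary vectors, and the whole chain is evaluated in a single loop when it is assigned.
#include <cstddef>
#include <vector>

// Node representing "left + right"; evaluation is deferred until indexing.
template <typename L, typename R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning any expression evaluates it in one pass, with no temporaries.
    template <typename Expr>
    Vec& operator=(const Expr& e) {
        for (std::size_t i = 0; i < size(); ++i)
            data[i] = e[i];
        return *this;
    }
};

template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r)
{
    Sum<L, R> s = { l, r };
    return s;
}

// Usage:
//   Vec a(1000), b(1000), c(1000), result(1000);
//   result = a + b + c;   // one loop over the data, not three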
I learned a lot from the book Inner Loops. It's ancient now, in computer terms, but it's very well written and Rick Booth is so enthusiastic about his subject I would still say it's worth looking at to see the kind of mindset you need to make a CPU fly.

Do Fortran 95 constructs such as WHERE, FORALL and SPREAD generally result in faster parallel code?

I have read through the Fortran 95 book by Metcalf, Reid and Cohen, and Numerical Recipes in Fortran 90. They recommend using WHERE, FORALL and SPREAD amongst other things to avoid unnecessary serialisation of your program.
However, I stumbled upon this answer which claims that FORALL is good in theory, but pointless in practice - you might as well write loops as they parallelise just as well and you can explicitly parallelise them using OpenMP (or automatic features of some compilers such as Intel).
Can anyone verify from experience whether they have generally found these constructs to offer any advantages over explicit loops and if statements in terms of parallel performance?
And are there any other parallel features of the language which are good in principle but not worth it in practice?
I appreciate that the answers to these questions are somewhat implementation dependent, so I'm most interested in gfortran, Intel CPUs and SMP parallelism.
As I said in my answer to the other question, there is a general belief that FORALL has not been as useful as was hoped when it was introduced to the language. As already explained in other answers, it has restrictive requirements and a limited role, and compilers have become quite good at optimizing regular loops. Compilers keep getting better, and capabilities vary from compiler to compiler. Another clue is that Fortran 2008 is trying again... besides adding explicit parallelization to the language (co-arrays, already mentioned), there is also "do concurrent", a new loop form whose restrictions should better allow the compiler to perform automatic parallelization optimizations, yet should be sufficiently general to be useful -- see ftp://ftp.nag.co.uk/sc22wg5/N1701-N1750/N1729.pdf.
In terms of obtaining speed, mostly I select good algorithms and program for readability and maintainability. Only if the program is too slow do I locate the bottlenecks and recode or implement multithreading (OpenMP). It will be a rare case where FORALL or WHERE versus an explicit do loop makes a meaningful speed difference -- I'd look more at how clearly they state the intent of the program.
I've looked shallowly into this and, sad to report, generally find that writing my loops explicitly results in faster programs than the parallel constructs you write about. Even simple whole-array assignments such as A = 0 are generally outperformed by do-loops.
I don't have any data to hand and if I did it would be out of date. I really ought to pull all this into a test suite and try again, compilers do improve (sometimes they get worse too).
I do still use the parallel constructs, especially whole-array operations, when they are the most natural way to express what I'm trying to achieve. I haven't ever tested these constructs inside OpenMP workshare constructs. I really ought to.
FORALL is a generalised masked assignment statement (as is WHERE). It is not a looping construct.
Compilers can parallelise FORALL/WHERE using SIMD instructions (SSE2, SSE3, etc.), which is very useful for getting a bit of low-level parallelisation. Of course, some poorer compilers don't bother and just serialise the code as a loop.
OpenMP and MPI are more useful at a coarser level of granularity.
In theory, using such assignments lets the compiler know what you want to do and should allow it to optimize it better. In practice, see the answer from Mark... I also think it's useful if the code looks cleaner that way. I have used things such as FORALL myself a couple of times, but didn't notice any performance changes over regular DO loops.
As for optimization, what kind of parallelism do you intend to use? I very much dislike OpenMP, but I guess if you intend to use it, you should test these constructs first.
This should be a comment, not an answer, but it won't fit into that little box, so I'm putting it here. Don't hold it against me :-) Anyway, to continue somewhat on #steabert's comment on his answer: OpenMP and MPI are two different things; one rarely gets to choose between the two, since the choice is dictated more by your architecture than by personal preference. As far as learning the concepts of parallelism goes, I would recommend OpenMP any day; it is simpler, and one easily continues the transition to MPI later on.
But that's not what I wanted to say. This is: a few days ago, Intel announced that it has started supporting co-arrays, an F2008 feature previously supported only by g95. They're not intending to put down g95, but the fact remains that Intel's compiler is more widely used for production code, so this is definitely an interesting line of development. They also changed some things in their Visual Fortran Compiler (the name, for a start :-)
More info after the link: http://software.intel.com/en-us/articles/intel-compilers/

Why do books on concurrent programming always ignore data parallelism? [closed]

There has been a significant shift towards data-parallel programming via systems like OpenCL and CUDA over the last few years, and yet books published even within the last six months never mention the topic of data-parallel programming.
It's not suitable for every problem, but it seems that there is a significant gap here that isn't being addressed.
First off, I'll point out that concurrent programming is not necessarily synonymous with parallel programming. Concurrent programming is about constructing applications from loosely-coupled tasks. For instance, a dialog window could have interactions with each control implemented as a separate task. Parallel programming, on the other hand, is explicitly about spreading the solution of some computational task across more than a single piece of execution hardware, essentially always for performance reasons of some sort (note: even too little RAM is a performance reason when the alternative is swapping).
So, I have to ask in return: What books are you referring to? Are they about concurrent programming (I have a few of these, there's a lot of interesting theory there), or about parallel programming?
If they really are about parallel programming, I'll make a few observations:
CUDA is a rapidly moving target, and has been since its release. A book written about it today would be half-obsolete by the time it made it into print.
OpenCL's standard was released just under a year ago. Stable implementations came out over the last 8 months or so. There's simply not been enough time to get a book written yet, let alone revised and published.
OpenMP is covered in at least a few of the parallel programming textbooks that I've used. Up to version 2 (v3 was just released), it was essentially all about data parallel programming.
I think those working with parallel computing academically today are usually coming from the cluster computing field. OpenCL and CUDA use graphics processors, which more or less inadvertently have evolved into general purpose processors along with the development of more advanced graphics rendering algorithms.
However, the graphics people and the high-performance computing people have been "discovering" each other for some time now, and a lot of research is being done into using GPUs for general-purpose computing.
"always" is a bit strong; there are resources out there (example) that include data parallelism topics.
The classic book "The Connection Machine" by Hillis was all data parallelism. It's one of my favorites.

Getting started with Parallel programming [closed]

So it looks like multicore and all its associated complications are here to stay. I am planning a software project that will definitely benefit from parallelism. The problem is that I have very little experience writing concurrent software. I studied it at university and understand the concepts and theory very well, but I have had zero hands-on experience building software to run on multiple processors since school.
So my question is, what is the best way to get started with multiprocessor programming?
I am mostly familiar with Linux development in C/C++ and Obj-C on Mac OS X, with almost zero Windows experience. Also, my planned software project will require FFTs and probably floating-point comparisons of a lot of data.
There is OpenCL, OpenMP, MPI, POSIX threads, etc... What technologies should I start with?
Here are a couple of stack options I am considering, though I'm not sure if they will let me experiment in working towards my goal:
Should I get Snow Leopard and try to get OpenCL Obj-C programs to run on the ATI X1600 GPU in my laptop?
Should I get a PlayStation and try writing C code to throw across its six available Cell SPE cores?
Should I build out a Linux box with an Nvidia card and try working with CUDA?
Thanks in advance for your help.
I'd suggest going OpenMP and MPI initially, not sure it matters which you choose first, but you definitely ought to want (in my opinion :-) ) to learn both shared and distributed memory approaches to parallel computing.
I suggest avoiding OpenCL, CUDA, POSIX threads, at first: get a good grounding in the basics of parallel applications before you start to wrestle with the sub-structure. For example, it's much easier to learn to use broadcast communications in MPI than it is to program them in threads.
I'd stick with C/C++ on your Mac since you are already familiar with them, and there are good open-source OpenMP and MPI libraries for that platform and those languages.
And, for some of us it's a big plus: whatever you learn about C/C++ and MPI (and to a lesser extent this is true of OpenMP too) will serve you well when you graduate to real supercomputers.
All subjective and argumentative, so ignore this if you wish.
If you're interested in parallelism in OS X, make sure to check out Grand Central Dispatch, especially since the tech has been open-sourced and may soon see much wider adoption.
The traditional, imperative "shared state with locks" isn't your only choice. Rich Hickey, the creator of Clojure (a Lisp for the JVM), makes a very compelling argument against shared state. He basically argues that it's almost impossible to get right. You may want to read up on message passing à la Erlang actors, or on STM libraries.
You should Learn You Some Erlang. For great good.
You don't need special hardware like graphics cards and Cells to do parallel programming. Your plain multi-core CPU will also profit from parallel programming. If you have experience with C/C++ and Objective-C, start with one of those and learn to use threads. Start with simple examples like matrix multiplication or maze solving and you'll learn about those pesky problems (parallel software is non-deterministic and full of Heisenbugs).
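As a starting point for the matrix-multiplication suggestion, here is a bare-bones sketch using C++11's std::thread (pthreads would look much the same if your toolchain predates C++11); all names here are invented for the example.
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Multiply n x n matrices A and B into C for rows [row_begin, row_end).
// Matrices are stored row-major in flat vectors; a sketch, not tuned code.
void matmul_rows(const std::vector<double>& A, const std::vector<double>& B,
                 std::vector<double>& C, int n, int row_begin, int row_end)
{
    for (int i = row_begin; i < row_end; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

// Split the rows of the result across num_threads worker threads.
void parallel_matmul(const std::vector<double>& A, const std::vector<double>& B,
                     std::vector<double>& C, int n, int num_threads)
{
    std::vector<std::thread> threads;
    int rows_per_thread = (n + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        int begin = t * rows_per_thread;
        int end   = std::min(n, begin + rows_per_thread);
        if (begin < end)
            threads.emplace_back(matmul_rows, std::cref(A), std::cref(B),
                                 std::ref(C), n, begin, end);
    }
    for (auto& th : threads)
        th.join();
}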
If you want to go into massive multiparallelism, I'd choose OpenCL, as it's the most portable. CUDA still has a larger community, more documentation and examples, and is a bit easier, but you'd need an Nvidia card.
Maybe your problem is suitable for the MapReduce paradigm. It automatically takes care of load balancing and concurrency issues, and the research paper from Google is already a classic. There is a single-machine implementation called Mars that runs on GPUs, which may work fine for you. There is also Phoenix, which runs map-reduce on multicore and symmetric multiprocessors.
I would start with MPI as you learn how to deal with distributed memory. Pacheco's book is an oldie but a goodie, and MPI runs fine out of the box on OS X now giving pretty good multicore performance.