As far as I understand, C++17 will come with parallelism support. What I could not work out is whether this is tied to specific hardware (the CPU by default), or whether it can be extended to any hardware with multiple computation units.
In other words, will we see something like, for example, an "nVidia C++ standard compiler" that compiles the parallel parts to be executed on GPUs?
Will it be a more standardized alternative to OpenCL, for example?
Note: I am absolutely not asking "Will nVidia do that?". I am asking whether the C++17 standard allows it and whether it is theoretically possible.
The question provides a link to the paper proposing this change, and, with respect to the parallelism aspects, there haven't been substantial changes to what's proposed. Yes, the compiler can do whatever makes sense for the target hardware to parallelize the execution of various algorithms, provided only that it gets the right answer (with some reservations) and that it doesn't impose unneeded overhead (again, with some reservations).
There are a couple of important points to understand.
First, C++17 parallelism is not a general parallel programming mechanism. It provides parallel versions of many of the STL algorithms, nothing more. So it's not a replacement for more powerful mechanisms like OpenCL, TBB, etc.
Second, there are inherent limitations when you try to parallelize algorithms, and that's why I added those two parenthesized qualifications. For example, std::reduce, the parallel counterpart of std::accumulate (which itself has no parallel overload), will produce the same result as the sequential version only if the function being applied to the input range is commutative and associative. The most obvious problem area here is floating-point values, where math operations are not associative, so the result might differ. Similarly, some algorithms actually impose more overhead when parallelized; you get a net speedup, but there is more total work done, so the speedup for those algorithms will not be linear in the number of processing units. std::partial_sum is an example (its parallel counterparts are std::inclusive_scan and std::exclusive_scan): each output value depends on the preceding value, so it's not simple to parallelize the algorithm. There are ways to do it, but you end up applying the combiner function more times than the non-parallel algorithm would. In general, there are relaxations of the complexity requirements for algorithms in order to reflect this reality.
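To make the floating-point point concrete, here is a minimal sketch that sums the same four doubles with two different groupings. The names left_fold_sum, pairwise_sum and sample_values are made up for this illustration, and the pairwise grouping is only one of many orders a parallel reduction might pick; it is not what any particular standard library does.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sequential left fold, the order std::accumulate guarantees:
// (((0 + v[0]) + v[1]) + v[2]) + ...
double left_fold_sum(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Pairwise (tree-shaped) sum over the half-open range [first, last).
// This mimics one grouping that a parallel reduction is free to choose.
double pairwise_sum(const std::vector<double>& v, std::size_t first, std::size_t last) {
    assert(first <= last && last <= v.size());
    if (last - first == 0) return 0.0;
    if (last - first == 1) return v[first];
    const std::size_t mid = first + (last - first) / 2;
    return pairwise_sum(v, first, mid) + pairwise_sum(v, mid, last);
}

// Hypothetical input chosen so that the grouping matters in double precision.
std::vector<double> sample_values() {
    return {1e20, 1.0, -1e20, 1.0};
}
```

With this input the left fold gives 1.0 and the pairwise sum gives 0.0, because 1.0 added to 1e20 is lost to rounding. Both are "correct" floating-point sums of the same values, just grouped differently.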
I understand the basic difference between concurrency and parallelism; I often use std::async in my programs, which gives me concurrency.
Are there any reliable/notable libraries that can provide parallelism in C++? (I know this is a likely feature of C++17.) If so, what are your experiences with them?
Thanks!
Barbra
Threading Building Blocks (TBB) is a templated C++ library for task parallelism. The library contains various algorithms and data structures specialized for task parallelism. I have had success with using parallel_for as well as parallel_pipeline to greatly speed up computations. With a little bit of extra coding, TBB's parallel_for can take a serial for loop that is appropriate for being executed in parallel and make it execute as such (See example here). TBB's parallel_pipeline has the ability to execute a chain of dependent tasks with the option of each being executed in parallel or serial (See example here). There are many more examples on the web especially at software.intel.com and here on stackoverflow (see here).
OpenMP is an API for thread parallelism that is accessed primarily through compiler directives. Although I prefer to use the richer feature set provided by TBB, OpenMP can be a quick way of testing out parallel algorithms and code (just add a pragma and set some build settings). Once things have been tested and experimented with, I have found that converting certain uses of OpenMP to TBB can be done fairly easily. This isn't to say that OpenMP is not meant for serious coding. In fact, there may be instances in which one would prefer OpenMP over TBB (one is that, because it relies primarily on pragmas, switching to serial execution can be easier than with TBB). A number of open source projects that utilize OpenMP can be found in this discussion. There are a number of examples (e.g., on wikipedia) and tutorials on the web for OpenMP, including many questions here on stackoverflow.
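As a rough illustration of the "just add a pragma" point, here is a small sketch of a dot product with an OpenMP reduction. The function name dot is made up for this example. The build flags mentioned in the comment are the usual ones for GCC/Clang and MSVC, but check your own toolchain.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product of two equally sized vectors.
// With OpenMP enabled (typically -fopenmp on GCC/Clang, /openmp on MSVC) the
// iterations are split across threads and the per-thread partial sums are
// combined by the reduction clause. Without OpenMP the pragma is ignored and
// the loop simply runs serially.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double total = 0.0;
    const long long n = static_cast<long long>(a.size());
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        total += a[k] * b[k];
    }
    return total;
}
```

Because the same source compiles with or without OpenMP, going back to serial execution is a build-setting change rather than a code change. Note that the reduction may add the partial sums in a different order than the serial loop, so floating-point results can differ slightly.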
I previously neglected a discussion on SIMD (single instruction, multiple data), which provides data parallelism. As pointed out in the below comments, OpenMP is an option for exploring SIMD (check this link). Extensions to instruction sets such as SSE and AVX (both extensions to the x86 instruction set architecture) as well as NEON (ARM architecture) are also worthwhile to explore. I have had good and bad experience with using SSE and AVX. The good is that they can provide a nice speed up to certain algorithms (in particular I have used Intel intrinsics). The bad is that the ability to use these instructions depends on specific CPU support, so running the code on a CPU that lacks them can fail unexpectedly at runtime (typically with an illegal-instruction fault).
Specifically with respect to parallelism and mathematics, I have had good experiences using Intel MKL (which now has a no cost option) as well as OpenBLAS. These libraries provide optimized, parallel, and/or vectorized implementations of common mathematical functions/routines (e.g., BLAS and LAPACK). There are many more libraries available that deal specifically with mathematics and involve optimized parallelism to some extent. While they may not provide lower level building blocks of parallelism (e.g., ability to manipulate threads, schedule tasks), it is very worthwhile to utilize (and contribute to) the immense amount of research and work in the field of computational mathematics. A similar statement could be made for areas of interest outside of mathematics.
I am writing a scientific application in C++ for my Maths PhD. It is based on some heavy linear algebra, mostly BLAS level 3 routines. The sizes of the matrices vary considerably; ideally I would like to be able to deal with very large matrices of order 10000 and higher. So far I have used multi-threaded Intel MKL, which scales nicely onto 8 cores. My algorithm produces the correct results but is very unstable in double-precision arithmetic, because of errors that accumulate when high powers are taken. Additionally, as I have access to a large supercomputer cluster and my algorithm can easily be scaled across multiple nodes, I would like to employ MPI to scale the application across hundreds of nodes.
My goal is to find a templated BLAS library that:
Supports Multiple Precision Arithmetic,
Supports Multi-threading,
Supports MPI
My findings so far:
MTL4 - Matrix Template library 4 seems to do all of the above, however the open source edition will only run on one core, and the supercomputing edition is quite costly.
Eigen - appears not to support multicore? Does it support multicore and MPI if linked with MKL?
Armadillo - does all the above?
I would greatly appreciate any insights and recommendations
Kind Regards,
Maria
Depending on your matrix problem, the Tpetra package of Trilinos might be worth a look. It's templated on the scalar type, so you might be able to use multiple-precision types. It targets large-scale applications on supercomputers, so one can expect good parallel performance.
Hope it helps!
Edit: and it's free!
I'm developing a non-interactive, CPU-bound application that only does computation, with almost no I/O. Currently it runs for too long, and while I'm working on improving the algorithm, I also wonder whether changing the language or platform could give any benefit. Currently it is C++ (no OOP, so it is almost C) on Windows, compiled with the Intel C++ compiler. Can switching to ASM help, and how much? Can switching to Linux and GCC help?
Just to be thorough: the first thing to do is to gather profile data and the second thing to do is consider your algorithms. I'm sure you know that, but they've got to be #included into any performance-programming discussion.
To be direct about your question "Can switching to ASM help?" the answer is "If you don't know the answer to that, then probably not." Unless you're very familiar with the CPU architecture and its ins and outs, it's unlikely that you'll do a significantly better job than a good optimizing C/C++ compiler on your code.
The next point to make is that significant speed-ups in your code (aside from algorithmic improvements) will almost certainly come from parallelism, not linear increases. Desktop machines can now throw 4 or 8 cores at a task, which has much more performance potential than a slightly better code generator. Since you're comfortable with C/C++, OpenMP is pretty much a no-brainer; it's very easy to use to parallelize your loops (obviously, you have to watch loop-carried dependencies, but it's definitely "the simplest parallelism that could possibly work").
Having said all that, code generation quality does vary between C/C++ compilers. The Intel C++ compiler is well-regarded for its optimization quality and has full support not just for OpenMP but for other technologies such as the Threading Building Blocks.
Moving into the question of what programming languages might be even better than C++, the answer would be "programming languages that actively promote / facilitate concepts of parallelism and concurrent programming." Erlang is the belle of the ball in that regard, and is a "hot" language right now and most people interested in performance programming are paying at least some attention to it, so if you want to improve your skills in that area, you might want to check it out.
It's always algorithm, rarely language. Here's my clue: "while I'm working on improving the algorithm".
Tweaking may not be enough.
Consider radical changes to the algorithm. You've got to eliminate processing, not make the processing go faster. The culprit is often "search" -- looping through data looking for something. Find ways to eliminate search. If you can't eliminate it, replace linear search with some kind of tree search or a hash map of some kind.
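As a small sketch of what "replace linear search with a hash map" can look like, assuming the data is a list of keyed records (the Record type and the function names below are invented for this illustration):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record type used only for this illustration.
struct Record {
    std::string key;
    int value;
};

// Linear search: every lookup walks the vector, O(n) per query.
const Record* find_linear(const std::vector<Record>& records, const std::string& key) {
    for (const Record& r : records) {
        if (r.key == key) return &r;
    }
    return nullptr;
}

// Build the index once, O(n) in total. For duplicate keys the first record
// wins, which matches what find_linear returns.
std::unordered_map<std::string, int> build_index(const std::vector<Record>& records) {
    std::unordered_map<std::string, int> index;
    index.reserve(records.size());
    for (const Record& r : records) {
        index.emplace(r.key, r.value);
    }
    assert(index.size() <= records.size());
    return index;
}

// Hash lookup: O(1) on average per query.
const int* find_indexed(const std::unordered_map<std::string, int>& index,
                        const std::string& key) {
    const auto it = index.find(key);
    return it == index.end() ? nullptr : &it->second;
}
```

The index only pays off when there are many lookups against the same data; for a handful of queries over a small vector, the linear scan may well be faster. Profile before and after.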
Switching to ASM is not going to help much, unless you're very good at it and/or have a specific critical path routine which you know you can do better. As several people have remarked, modern compilers are just better in most cases at taking advantage of caching etc. than anyone can be by hand.
I'd suggest:
Try a different compiler, and/or different optimization options
Run a code coverage/analysis utility, and figure out where the critical paths are, and work on optimizing those in the code
C++ should be able to give you very near the best possible performance from the code, so I wouldn't recommend switching the language. As another suggestion, depending on the app, you may be able to get better performance on multi-core/multi-processor systems by using multiple threads.
While just switching to asm won't give any benefits, since the Intel C++ Compiler is likely better at optimizing than you, you can try one of the following options:
Try a compiler that will parallelize your code, like the VectorC compiler.
Try to switch to asm with heavy use of MMX, 3DNow!, SSE or whatever fits your needs (and your CPU). This will give more of a benefit than pure asm.
You can also try GPGPU, i.e. execute large parts of your algorithm on a GPU instead of a CPU. Depending on your algorithm, it can be dramatically faster.
Edit: I also second the profile approach. I recommend AQTime, which supports the Intel C++ compiler.
Personally I'd look at languages which allow you to take advantage of parallelism most easily, unless it's a thoroughly non-parallelisable situation. Being able to bolt on some extra cores and get (if possible!) near-linear improvement may well be a lot more cost-effective than squeezing the extra few percent of efficiency out.
When it comes to parallelisation, I believe functional languages are often regarded as the best way to go, or you could look at OpenMP for C/C++. (Personally, as a managed language guy, I'd be looking at libraries for Java/.NET, but I quite understand that not everyone has the same preferences!)
Try Fortran 77 - when it comes to computations still nothing beats the granddaddy of programming languages. Also, try it with OpenMP to take advantage of multiple cores.
Hand optimizing your ASM code compared to what C++ can do for you is rarely cost effective.
If you've done everything you can to the algorithm from a traditional algorithmic view, and you've also eliminated excesses, then you may either be SOL, or you can consider optimizing your program from a hardware point of view.
For example, any time you follow a pointer around the heap you are paying a huge cost due to cache misses, possibly paging, etc., which all affect branching predictions. Most programmers (even C gurus) tend to look at the CPU from the functional standpoint rather than what happens behind the scenes. Sometimes reorganizing memory, for example by "flattening" or manually allocating memory to fit on the same page can obtain ENORMOUS speedups. I managed to get 2X speedups on graph traversals just by flattening my structures.
These are not things that your compiler will do for you since they are based on your high-level understanding of the program.
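I can't show the actual structures from that project, but one common way to "flatten" a graph is to pack all adjacency lists into two contiguous arrays (often called CSR, compressed sparse row). The sketch below assumes nodes are numbered 0..n-1; FlatGraph, flatten, degree and neighbor_id_sum are names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flattened adjacency storage: the neighbours of node i live in
// neighbors[offsets[i]] .. neighbors[offsets[i + 1] - 1], back to back.
struct FlatGraph {
    std::vector<std::size_t> offsets;    // size = node count + 1
    std::vector<std::size_t> neighbors;  // all adjacency lists concatenated
};

// Convert a vector-of-vectors adjacency list (one heap block per node)
// into the two contiguous arrays above.
FlatGraph flatten(const std::vector<std::vector<std::size_t>>& adjacency) {
    FlatGraph g;
    g.offsets.reserve(adjacency.size() + 1);
    g.offsets.push_back(0);
    for (const auto& list : adjacency) {
        g.neighbors.insert(g.neighbors.end(), list.begin(), list.end());
        g.offsets.push_back(g.neighbors.size());
    }
    return g;
}

// Number of neighbours of a node.
std::size_t degree(const FlatGraph& g, std::size_t node) {
    assert(node + 1 < g.offsets.size());
    return g.offsets[node + 1] - g.offsets[node];
}

// Example traversal step: walks one contiguous slice instead of chasing a
// pointer to a separately allocated list.
std::size_t neighbor_id_sum(const FlatGraph& g, std::size_t node) {
    assert(node + 1 < g.offsets.size());
    std::size_t sum = 0;
    for (std::size_t k = g.offsets[node]; k < g.offsets[node + 1]; ++k) {
        sum += g.neighbors[k];
    }
    return sum;
}
```

Whether this helps, and by how much, depends on the access pattern and the size of the graph, so measure it on your own data.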
As lobrien said, you haven't given us any information to tell you if hand-optimized ASM code would help... which means the answer is probably, "not yet."
Have you run your code with a profiler?
Do you know if the code is slow because of memory constraints or processor constraints?
Are you using all your available cores?
Have you identified any algorithms you're using that aren't O(1)? Can you get them to O(1)? If not, why not?
If you've done all that, how much control do you have over the environment your program is running in? (Presumably a lot if you're thinking of switching operating systems.) Can you disable other processes, give your process highest priority, etc.? What about just finding a machine with a faster processor, more cores, or more memory (depending on what you're constrained on)?
And on and on.
If you've already done all that and more, it's certainly possible you'll get to a point where you think, "I wonder if these few lines of code right here could be optimized better than the assembly that I'm looking at in the debugger right now?" And at that point you can ask specifically.
Good luck! You're solving a problem that's fun to solve.
Sometimes you can find libraries that have optimized implementations of the algorithms you care about. Often times they will have done the multithreading for you.
For example switching from LINPACK to LAPACK got us a 10x speed increase in LU factorization/solve with a good BLAS library.
First, figure out if you can change the algorithm, as S.Lott suggested.
Assuming the algorithm choice is correct, you might look at the memory access patterns, if you have a lot of data you are processing. A lot of number-crunching applications these days are bound by the memory bus, not by the ALU(s). I recently optimized some code that was of the form:
// Assume N is a big number
// Pass 1: compute and store every element
for (int i=0; i<N; i++) {
    myArray[i] = dosomething(i);
}
// Pass 2: reload every element from memory and transform it
for (int i=0; i<N; i++) {
    myArray[i] = somethingElse(myArray[i]);
}
...
and converted it to look like:
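// Single pass: the intermediate value lives in a local,
// and each element is written to memory only once
for (int i=0; i<N; i++) {
    double tmp = dosomething(i);
    tmp = somethingElse(tmp);
    ...
    myArray[i] = tmp;
}
...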
In this particular case, this yielded about a 2x speedup.
As Oregonghost already hinted, the VectorC compiler might help. It does not really parallelize the code, though; instead, you can use it to take advantage of extended instruction sets such as MMX or SSE. I used it for the most time-critical parts of a software rendering engine, and it resulted in a speedup of about 150%-200% on most processors.
For an alternative approach, you could look into Distributed Computing which sounds like it could suit your needs.
If you're sticking with C++ on the intel compiler, take a look at the compiler intrinsics (full reference here). I know that VC++ has similar functionality, and I'm sure you can do the same thing with gcc. These can let you take full advantage of the parallelism built into your CPU. You can use the MMX, SSE and SSE2 instructions to improve performance to a degree. Like others have said, you're probably best looking at the algorithm first.
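To give an idea of what intrinsics code looks like, here is a minimal sketch that adds two arrays of doubles two elements at a time using SSE2. The function name add_arrays and the EXAMPLE_HAS_SSE2 macro are made up for this example; the intrinsics themselves (_mm_loadu_pd, _mm_add_pd, _mm_storeu_pd from emmintrin.h) are the standard SSE2 ones. On targets without SSE2 the sketch falls back to a plain loop.

```cpp
#include <cassert>
#include <cstddef>

// Detect SSE2 at compile time (GCC/Clang define __SSE2__; MSVC x64 always has it).
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define EXAMPLE_HAS_SSE2 1
#else
#define EXAMPLE_HAS_SSE2 0
#endif

// Computes out[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const double* a, const double* b, double* out, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    std::size_t i = 0;
#if EXAMPLE_HAS_SSE2
    // Process two doubles per iteration; unaligned loads/stores keep it simple.
    for (; i + 2 <= n; i += 2) {
        const __m128d va = _mm_loadu_pd(a + i);
        const __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
#endif
    // Scalar tail (or the whole loop when SSE2 is unavailable).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

For a loop this simple a good optimizing compiler will usually vectorize the scalar version on its own, so treat this purely as a demonstration of the syntax and check the generated code before hand-writing intrinsics.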
I suggest you rethink your algorithm, or maybe even better, your approach. On the other hand maybe what you are trying to calculate just takes a lot of computing time. Have you considered making it distributed so it can run in a cluster of some sort? If you want to focus on pure code optimization by introducing Assembler for your inner loops then often that can be very beneficial (if you know what you're doing).
For modern processors, learning ASM will take you a long time. Further, with all the different versions of SSE around, your code will end up very processor dependent.
I do quite a lot of CPU-bound work, and have found that the difference between Intel's C++ compiler and g++ usually isn't that big (at most 15% or so), and there is no measurable difference between Mac OS X, Windows and Linux.
You are going to have to optimise your code and improve your algorithm by hand. There is no "magic fairy dust" which can make existing code that much faster I'm afraid.
If you haven't yet, and you care about performance, you MUST run your code through a good profiler (personally, I like kcachegrind & valgrind on Linux, or Shark on Mac OS X; I don't know what is good for Windows, I'm afraid).
Based on my past experience, there is a very good chance you'll find some method is taking 95% of your CPU time, and some simple change or addition of caching will make a massive improvement to your performance. On a similar note, if some method is only taking 1% of your CPU time, no amount of optimising is going to gain you anything.
The 2 obvious answers to "CPU-bound" are:
1. Use more CPU (core)s
2. Use something else.
Using 2 threads instead of 1 will cut the time spent by up to 50%. In comparison, C++ to ASM rarely gives you 5% (and for novice ASM programmers, it's often -5%!). Some problems scale well, and may benefit from 8 or 16 cores. That kind of hardware is still pretty mainstream, so see if your problems fall in that category.
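The "up to" can be made a bit more precise with Amdahl's law. This is a simplified model, assuming a fraction p of the single-threaded run time T(1) parallelizes perfectly over n cores and the rest stays serial:

```latex
T(n) = T(1)\left[(1 - p) + \frac{p}{n}\right],
\qquad
S(n) = \frac{T(1)}{T(n)} = \frac{1}{(1 - p) + \dfrac{p}{n}}
```

With p = 1 and n = 2 you get T(2) = T(1)/2, the 50% best case. With p = 0.9 and n = 8 the speedup is 1 / (0.1 + 0.1125), roughly 4.7x rather than 8x, which is why it matters whether your problem is in the "scales well" category.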
The other solution is to throw more specialized hardware at the task. This could be the vector unit of your CPU - considering Windows=x86/x64, that's going to be a flavor of SSE. Another kind of vector hardware is the modern GPU. The GPU also has its own memory bus, which is quite speedy.
First get the lead out. Then if it's as fast as it can possibly be without going to ASM, so be it. But thinking you have to go to ASM assumes you know what's making it slow, and I'll bet a donut that you're guessing.
If you feel you have optimized your code to the point where no further improvement is possible, increase the number of CPUs. This can be done on different platforms. One I develop with is Appistry. A few links:
http://www.appistry.com/resource-library/index.html
and you can download the product free from here:
http://www.appistry.com/developers/
I work for Appistry and we have done many installations for tasks that were CPU-bound by spreading work out over tens or hundreds of machines.
Hope this helps,
-Brett
Probable small help:
Optimization of 64-bit programs
AMD64 (EM64T) architecture
Debugging and optimization of multi-thread OpenMP-programs
Introduction into the problems of developing parallel programs
Development of Resource-intensive Applications in Visual C++
Linux
Switching to Linux can help, if you strip it down to only the parts you actually need.
CrowdProcess has about 2000 workers you can use to compute your algorithm. The API is extremely simple and we've been observing speedups close to the number of workers. Also, you can write JavaScript, which should make you more productive than C++ or ASM.
So if you're in between C++ or ASM, I'd say you should first use all your CPU cores, then if it's not enough, CrowdProcess should be an interesting platform.
Disclaimer: I built CrowdProcess.
It is hard to produce ASM code that is faster than naive C or C++ code. Even if you do the job really well, you will probably gain no more than a few percent; a 10% speedup is considered a great success, and in most cases it is simply not achievable.
Compilers are capable of understanding how to compile efficiently. You should profile in order to figure out where to optimize.