C++ 17 parallelism hardware implementation


As far as I understand, C++17 will come with parallelism. What I could not work out is whether it targets specific hardware (the CPU by default), or whether it can be extended to any hardware with multiple computation units.
In other words, will we see something like, for example, an "nVidia C++ standard compiler" that compiles the parallel parts to be executed on GPUs?
Will it be some more standardized alternative to OpenCL for example?
Note: Absolutely, I am not asking "Will nVidia do that?". I am asking if C++ 17 standards allow that and if it is theoretically possible.

The question provides a link to the paper proposing this change, and, with respect to the parallelism aspects, there haven't been substantial changes to what's proposed. Yes, the compiler can do whatever makes sense for the target hardware to parallelize the execution of various algorithms, provided only that it gets the right answer (with some reservations) and that it doesn't impose unneeded overhead (again, with some reservations).
There are a couple of important points to understand.
First, C++17 parallelism is not a general parallel programming mechanism. It provides parallel versions of many of the STL algorithms, nothing more. So it's not a replacement for more powerful mechanisms like OpenCL, TBB, etc.
Second, there are inherent limitations when you try to parallelize algorithms, and that's why I added those two parenthesized qualifications. For example, std::reduce, the parallel counterpart of std::accumulate (which itself has no execution-policy overload), will produce the same result as std::accumulate only if the function being applied to the input range is commutative and associative. The most obvious problem area here is floating-point values, where math operations are not associative, so the result might differ. Similarly, some algorithms actually impose more overhead when parallelized; you get a net speedup, but there is more total work done, so the speedup for those algorithms will not be linear in the number of processing units. std::partial_sum, whose parallel counterpart is std::inclusive_scan, is an example: each output value depends on the preceding value, so it's not simple to parallelize the algorithm. There are ways to do it, but you end up applying the combiner function more times than the non-parallel algorithm would. In general, there are relaxations of the complexity requirements for algorithms in order to reflect this reality.

Related

How is scheduling handled in C++17 STL parallel algorithms?

Is there a standard scheduler specification for the C++17 STL parallel algorithms, or is it entirely implementation-dependent? The serial algorithms have complexity guarantees, but the scheduler implementation is critical for performance with non-uniform task loads. Does the specification address this? It seems like it would be hard to guarantee cross-platform performance without a standardized scheduler.

As far as I can tell from the wording, such details are completely within the domain of implementation specification, as one would expect. The standard generally makes no effort to guarantee absolute performance of any kind, only complexity requirements, as you're seeing in this case.
Ultimately, though your source code can now take advantage of parallelism while being completely standard-defined, the actual practical outcome of running your program is up to your implementation, and I think that still makes sense. The goal of standardising features is not cross-platform performance, but portable code that can be proven correct in a vacuum.
I'd expect your toolchain to give further information on how this sort of thing works, and that may even influence your choice of toolchain! But it does make sense for them to have freedom in that regard, as they do in other areas. After all, there is a multitude of target platforms out there (theoretically infinite), all with their own potential and quirks.
It could be that a future standard emplaces further constraints on scheduling in order to kick implementers up the backside a little, but personally I wouldn't count on it.

Scheduling for C++17 STL algorithms is implementation-defined.
Moreover, C++17 doesn't guarantee parallel execution. It just allows parallelism.
The class execution::parallel_policy is an execution policy type used
as a unique type to disambiguate parallel algorithm overloading and
indicate that a parallel algorithm’s execution may be parallelized

Writing optimal code in FORTRAN using array expressions

I am looking for a way to write fast code and be able to use builtin vector operations (for the sake of readability).
FORTRAN seems to be the good candidate. However, almost all resources I find on the web are about writing code without array expressions, and have only trivial examples of vector operations.
I feel strong need in some good resource which can cover caveats and give some insight into optimizations of code with vector expressions.
Example:
currently I am not even able to predict the behavior of such code:
! a = [0], indices = [1, 1]
a(indices) = a(indices) + 1
After compiling I get a = [2], but is this correct? If I use OpenMP, will it behave like this?
Personally, I would be very happy to have something like following examples on numpy:
100 numpy exercises
numpy: tips and tricks to work with data
Getting the Best Performance out of NumPy

Your code is not standard conforming:
Fortran 2008 6.5.3.3.2.3:
If a vector subscript has two or more elements with the same value,
an array section with that vector subscript shall not appear in a
variable definition context (16.6.7). NOTE 6.15
Therefore the result of your operation is not defined by the standard.
Other parts of your question appear to be too broad to treat them here. There are many books about scientific programming in Fortran 90 and later.
Also be aware that by vectorization most people in Fortran and C or C++ mean the usage of SIMD instructions and not the vectorized expressions from NumPy. These are just array expressions in Fortran.

I have scanned many sources (~20 books and dozens of web pages), so it is unlikely that I missed something really important. The question I posted is indeed incorrect and comes from my initially high expectations of array operations in fortran.
The answer I would expect is: there are no tools to write short, readable code in fortran with automatic parallelization (to be more precise: there are, but those are proprietary libraries).
The list of intrinsic functions available in fortran is quite short
(link), and consists only of functions easily mapped to SIMD ops.
There are lots of functions that one will be missing.
While this could be resolved by a separate library with a separate implementation for each platform, fortran doesn't provide one. There are commercial options (see this thread).
Brief examples of missing functions:
no built-in array sort or unique. The proposed way is to use this library, which provides single-threaded code (forget threads and CUDA)
cumulative sum / running sum. One trivially can implement it, but the resulting code will never work fine on threads/CUDA/Xeon Phi/whatever comes next.
bincount, numpy.ufunc.at, numpy.ufunc.reduceat (which is very useful in many applications)
In most cases fortran provides a 2x speed-up even with simple implementations, but the code written will always be single-threaded, while matlab/numpy functions can be reimplemented for GPU or another parallel platform without any effort from the user's side (which occasionally happened with MATLAB; also see gnumpy, theano and parakeet).
To conclude, this is bad news for me. Fortran developers really care about having fast programs today, not in the future. I also can't lock my code into proprietary software. And I'm still looking for an appropriate tool. (Julia is the current candidate.)
See also:
STL analogue in fortran
where ready-to-use algorithms are asked.
Numerical recipes: the art of parallel programming author implements basic MATLAB-like operations to have more expressive code
I also find useful these notes to see recommended ways of code optimizations (to see there is no place for vector operations)
numpy, fortran, blitz++: a case study
discussion about implementing unique in fortran, where proprietary tools are recommended.

Fast trigonometric functions using only integer in c++ for arm target

I am writing code for an ARM-Target which uses a lot of floating point operations and trigonometric functions. AFAIK floating point calculations are MUCH slower than int (especially on ARM). Accuracy is not crucial.
I thought about implementing my own trigonometric functions using a scaling factor (e.g. the range 0*pi to 2*pi becomes int 0 to 1024) and lookup tables. Is that a good approach?
Are there any alternatives?
Target platform is an Odroid U2 (Exynos4412) running ubuntu and lots of other stuff (webserver etc...).
(c++11 and boost/libraries allowed)

If your target platform has a math library, use it. If it is any good, it was written by experts who were considering speed. You should not base code design on guesses about what is fast or slow. If you do not have actual measurements or processor specifications, and you do not know trigonometric functions in your application are consuming a lot of time, then you do not have good reason for replacing the math libraries.
Floating-point instructions typically have longer latencies than integer instructions, but they are pipelined so that throughput may be comparable. (E.g., a floating-point unit might have four stages to do the work, so an instruction takes four cycles to work through all the stages, but you can push a new instruction into the first stage in each cycle.) Whether the pipelining is sufficient to provide performance on a par with an integer implementation depends greatly on the target processor, the algorithm being used, and the skill of the implementor.
If it is beneficial in your case to use custom implementations of the math routines, then how they should be designed is hugely dependent on circumstances. Proper advice depends on the domain to support (Just 0 to 2π? –2π to +2π? Possibly larger values, which have to be folded to -π to π?), what special cases needed to be supported (Propagate NaNs?), the accuracy required, what else is happening in the processor (Is a lot of memory in use or can we rely on a lookup table remaining in cache?), and more.
A significant part of the trigonometric routines is handling various cases (NaNs, infinities, small values) and reducing arguments modulo 2π. It may be possible to implement stripped-down routines that do not handle special cases or perform argument reduction but still use floating-point.

Exynos 4412 uses the Cortex-A9 core[1], which has fully pipelined single- and double-precision floating-point. There is no reason to resort to integer operations, as there was with some older ARM cores.
Depending on your specific accuracy requirements (and especially if you can guarantee that the inputs fall into a limited range), you may be able to use approximations that are significantly faster than the implementations available in the standard library. More information about your exact usage would be necessary to give sound advice.
[1] http://en.wikipedia.org/wiki/Exynos_(system_on_chip)

One possible alternative is trigint:
trigint download
trigint doxygen

You should use "fixed point" math rather than floating point.
Most ARM processors (7 and above) allow for 32 bits of resolution in the fixed point. So you could go to 1E-3 radians quite easily. But the real question is how much accuracy do you need in the results?
Whether to use lookup tables, lookup tables with interpolation or functions depends on how much data space you have on your system. Lookup tables are fastest execution, but use the most data space. Functions use the least amount of data but require the most execution time. Interpolation may be a mitigation that allows smaller tables and some extra processing.

Why don't games use expression templates for math?

I can imagine expression templates doing awful things to compile times for things as pervasive as vectors/matrices/quaternions etc, but if it is such a great speed boost why don't games use it? It's quite obvious that SIMD instructions can exploit data level parallelism to great effect. Expression templates and lazy evaluation together seem to make sense, at least when it comes to eliminating temporaries.
So while libraries like Eigen advertise such features, I don't see this done commonly in middleware (e.g. Havok) or games where things are extremely speed critical. Can anyone shed some light on this? Does it have to do with non-determinism or branch prediction?

I can think of a lot of reasons:
It hurts compile-times. Longer compile-times means that testing any change you made to the code takes longer. It hurts productivity.
It's complex. Most likely, many developers on the team are not familiar with expression templates, and will have a hard time reading and debugging them.
Games often have to work on multiple platforms, with various compilers which may have a wide range of shortcomings, which might for example make advanced template trickery problematic.
It's generally not necessary. You can write efficient code without expression templates. It just gets more verbose, and you have to do more hand-holding for the compiler.
Game developers are extremely skeptical of anything that wasn't already used in games 10 years ago. It's not long ago that several major developers stuck to C: not because C++ wasn't good enough, but because it was "new". Game developers are conservative as hell.
And of course, the obvious question: where would they use expression templates? Is there enough complex math to really make it worthwhile? Games tend to rely on a fairly small number of linear algebra operations, which will typically be heavily hand-tuned in any case.

I want to add one more reason not stated in the above answers. Apologies if it is and I missed it.
Adding templates to math based classes, such as a vec3 class, can change the meaning of operators and lead to functions that are invalid for some template types.
Take for instance,
vec3<int> myVec( 3, 5, 4 );
myVec.Normalize();
What would normalize mean to an integer vector? All of a sudden, when we add templates to math constructs, we invalidate many existing functions, such as the example described above.
Also, another thing worth mentioning is that many math constructs are optimized with certain types because optimization is so important in games. GPUs are floating-point calculating machines. Doubles take up double the space of floats and are quite a bit slower to compute with, even though it may seem like an obvious use case to a new game developer.
I hope this example makes sense. Templates are a great tool but math constructs in games are just not the right place to use them.

Typically the parts of a game that are both performance sensitive and math heavy and still tend to run on the CPU rather than the GPU are applying the same basic operations to large numbers of elements. Some examples are animation blending, physics calculations, visibility tests, etc.
The best approach to optimizing these sorts of problems on current console hardware is generally to try and batch as much work together as possible and to aim for maximum data locality to avoid expensive cache misses. The actual math can then be optimized using SIMD intrinsics and will typically be carefully hand optimized. The kind of optimizations that expression templates give you can be performed relatively easily during that hand optimization phase but there are various other important optimizations that are also likely to be performed that expression templates won't give you. Often this critical code will have sections with custom optimizations for each target platform and won't be very portable.
I think the reason that expression templates aren't widely used is that they add software complexity (for all the reasons described by jalf) to non performance critical code that doesn't really warrant it while not covering all the optimizations that are necessary for the really performance critical code that shows up at the top of profiles.

What are the functions in the standard library that can be implemented faster with programming hacks? [closed]

I have recently read an article about fast sqrt calculation. Therefore, I have decided to ask SO community and its experts to help me find out, which STL algorithms or mathematical calculations can be implemented faster with programming hacks?
It would be great if you can give examples or links.
Thanks in advance.

System library developers have more concerns than just performance in mind:
Correctness and standards compliance: Critical!
General use: No optimisations are introduced, unless they benefit the majority of users.
Maintainability: Good hand-written assembly code can be faster, but you don't see much of it. Why?
Portability: Decent libraries should be portable to more than just Windows/x86/32bit.
Many optimisation hacks that you see around violate one or more of the requirements above.
In addition, optimisations that will be useless or even break when the next generation CPU comes around the corner are not a welcome thing.
If you don't have profiler evidence on it being really useful, don't bother optimising the system libraries. If you do, work on your own algorithms and code first, anyway...
EDIT:
I should also mention a couple of other all-encompassing concerns:
The cost/effort to profit/result ratio: Optimisations are an investment. Some of them are seemingly-impressive bubbles. Others are deeper and more effective in the long run. Their benefits must always be considered in relation to the cost of developing and maintaining them.
The marketing people: No matter what you think, you'll end up doing whatever they want - or think they want.

Probably all of them can be made faster for a specific problem domain.
Now the real question is, which ones should you hack to make faster? None, until the profiler tells you to.

Several of the algorithms in <algorithm> can be optimized for vector<bool>::[const_]iterator. These include:
find
count
fill
fill_n
copy
copy_backward
move // C++0x
move_backward // C++0x
swap_ranges
rotate
equal
I've probably missed some. But all of the above algorithms can be optimized to work on many bits at a time instead of just one bit at a time (as would a naive implementation).
This is an optimization that I suspect is sorely missing from most STL implementations. It is not missing from this one:
http://libcxx.llvm.org/

This is where you really need to listen to project managers and MBAs. What you're suggesting is re-implementing parts of the STL and or standard C library. There is an associated cost in terms of time to implement and maintenance burden of doing so, so you shouldn't do it unless you really, genuinely need to, as John points out. The rule is simple: is this calculation you're doing slowing you down (a.k.a. you are bound by the CPU)? If not, don't create your own implementation just for the sake of it.
Now, if you're really interested in fast maths, there are a few places you can start. The GNU Multiple Precision Arithmetic Library (GMP) implements many algorithms from Modern Computer Arithmetic and Seminumerical Algorithms, which are all about doing maths on arbitrary-precision integers and floats insanely fast. The people who write it optimise in assembly per build platform; it is about as fast as you can get in single-core mode. This is the most general case I can think of for optimised maths, i.e. one that isn't specific to a certain domain.
Bringing my first and second paragraphs together with what thkala has said, consider that GMP/MPIR have optimised assembly versions per CPU architecture and OS they support. Really. It's a big job, but it is what makes those libraries so fast on a specific, small subset of programming problems.
Sometimes domain-specific enhancements can be made. This is about understanding the problem in question. For example, when doing finite field arithmetic in Rijndael's finite field you can, based on the knowledge that the field has characteristic 2 and that its elements are polynomials with 8 terms, assume that your values fit in a uint8_t and that addition and subtraction are both equivalent to xor. How does this work? Well, if you add or subtract two elements, each pair of corresponding coefficients is either zero or one. If they're both zero or both one, the result is always zero. If they are different, the result is one. Term by term, that is equivalent to xor across an 8-bit binary string, where each bit represents a term in the polynomial. Multiplication is also relatively efficient. You can bet that Rijndael was designed to take advantage of this kind of result.
That's a very specific result. It depends entirely on what you're doing to make things efficient. I can't imagine many STL functions are purely optimised for cpu speed, because amongst other things STL provides: collections via templates, which are about memory, file access which is about storage, exception handling etc. In short, being really fast is a narrow subset of what STL does and what it aims to achieve. Also, you should note that optimisation has different views. For example, if your app is heavy on IO, you are IO bound. Having a massively efficient square root calculation isn't really helpful since "slowness" really means waiting on the disk/OS/your file parsing routine.
In short, you as a developer of an STL library are trying to build an "all round" library for many different use cases.
But, since these things are always interesting, you might well be interested in bit twiddling hacks. I can't remember where I saw that, but I've definitely stolen that link from somebody else on here.

Almost none. The standard library is designed the way it is for a reason.
Taking sqrt, which you mention as an example, the standard library version is written to be as fast as possible, without sacrificing numerical accuracy or portability.
The article you mention is really beyond useless. There are some good articles floating around the 'net, describing more efficient ways to implement square roots. But this article isn't among them (it doesn't even measure whether the described algorithms are faster!). Carmack's trick is slower than std::sqrt on a modern CPU, as well as being less accurate.
It was used in a game something like 12 years ago, when CPUs had very different performance characteristics. It was faster then, but CPUs have changed, and today it's both slower and less accurate than the CPU's built-in sqrt instruction.
You can implement a square root function which is faster than std::sqrt without losing accuracy, but then you lose portability, as it'll rely on CPU features not present on older CPUs.
Speed, accuracy, portability: choose any two. The standard library tries to balance all three, which means that the speed isn't as good as it could be if you were willing to sacrifice accuracy or portability, and accuracy is good, but not as good as it could be if you were willing to sacrifice speed, and so on.
In general, forget any notion of optimizing the standard library. The question you should be asking is whether you can write more specialized code.
The standard library has to cover every case. If you don't need that, you might be able to speed up the cases that you do need. But then it is no longer a suitable replacement for the standard library.
Now, there are no doubt parts of the standard library that could be optimized. The C++ IOStreams library in particular comes to mind. It is often naively, and very inefficiently, implemented. The C++ committee's technical report on C++ performance has an entire chapter dedicated to exploring how IOStreams could be implemented to be faster.
But that's I/O, where performance is often considered to be "unimportant".
For the rest of the standard library, you're unlikely to find much room for optimization.
