How is scheduling handled in C++17 STL parallel algorithms?

How is scheduling handled in C++17 STL parallel algorithms? - c++

Is there a standard scheduler specification for the C++17 STL parallel algorithms or is it entirely implementation dependant? The serial algorithms have complexity guarantees but the scheduler implementation is critical for performance with non uniform task loads, does the specification address this? It seems like it would be hard to guarantee cross-platform performance without a standardized scheduler.

As far as I can tell from the wording, such details are completely within the domain of implementation specification, as one would expect. The standard generally makes no effort to guarantee absolute performance of any kind, only complexity requirements, as you're seeing in this case.
Ultimately, though your source code can now take advantage of parallelism while being completely standard-defined, the actual practical outcome of running your program is up to your implementation, and I think that still makes sense. The goal of standardising features is not cross-platform performance, but portable code that can be proven correct in a vacuum.
I'd expect your toolchain to give further information on how this sort of thing works, and that may even influence your choice of toolchain! But it does make sense for them to have freedom in that regard, as they do in other areas. After all, there is a multitude of target platforms out there (theoretically infinite), all with their own potential and quirks.
It could be that a future standard emplaces further constraints on scheduling in order to kick implementers up the backside a little, but personally I wouldn't count on it.

Scheduling for C++17 STL algorithms is implementation-defined.
Moreover, C++17 doesn't guarantee parallel execution. It just allows parallelism.
The class execution::parallel_policy is an execution policy type used
as a unique type to disambiguate parallel algorithm overloading and
indicate that a parallel algorithm’s execution may be parallelized

Related

About time/space complexity in C/C++ standards

Recently I've read things about abstract machine and as-if rule (What exactly is the "as-if" rule?), and the requirements on time complexity of standard library (like this one: Is list::size() really O(n)?).
Are the time/space complexity requirements on standard library in terms of abstract machine or in terms of real concrete machine?
If these are in terms of abstract machine, it seems an implementation can actually generate less efficient code in terms of complexity even though it seems not to be practical.
Did the standards mention anything about time/space complexity for non-standard-library code?
e.g. I may write a custom sorting code and expect O(n log n) time, but if an implementation just treats this as code in abstract machine, it is allowed to generate a slower sorting in assembly and machine code, like changing it to O(n^2) sort, even though it unlikely will do that in real situation.
Or maybe I missed something about the transformation requirements between abstract machine and real concrete machine. Can you help me to clarify? :)
Even thought I mainly read things about C++ standard, I also want to know the situation about C standard. So this question tags both.

Are the time/space complexity requirements on standard library in terms of abstract machine or in terms of real concrete machine?
The complexity requirements are in terms of the abstract machine:
[intro.abstract] The semantic descriptions in this document define a parameterized nondeterministic abstract machine...
Did the standards mention anything about time/space complexity for non-standard-library code?
No. The only complexity requirements in the standard are for standard containers and algorithms.
if an implementation just treats this as code in abstract machine, it is allowed to generate a slower sorting in assembly and machine code, like changing it to O(n^2) sort
That's not the worst thing it can do. An implementation can put the CPU to sleep for a year between every instruction. As long as you're patient enough, the program would have same observable behaviour as the abstract machine, so it would be conforming.

Many of the complexity requirements in the C++ standard are in terms of specific counts of particular operations. These do constrain the implementation.
E.g. std::find_if
At most last - first applications of the predicate.
This is more specific than "O(N), where N = std::distance(first, last)", as it specifies a constant factor of 1.
And there are others that have Big-O bounds, defining what operation(s) are counted
E.g. std::sort
O(N·log(N)), where N = std::distance(first, last) comparisons.
What this doesn't constrain includes how slow a comparison is, nor how many swaps occur. If your model of computation has fast comparison and slow swapping, you don't get a very useful analysis.

As you've been told in comments, the standards don't have any requirements regarding time or space complexity. And addressing your additional implicit question, yes, a compiler can change your O(n log n) code to run in O(n²) time. Or in O(n!) if it wants to.
The underlying explanation is that the standard defines correct programs, and a program is correct regardless of how long it takes to execute or how much memory it uses. These details are left to the implementation.
Specific implementations can compile your code in whichever way achieves correct behavior. It would be completely permissible, for instance, for an implementation to add a five-second delay between every line of code you wrote — the program is still correct. It would also be permissible for the compiler to figure out a better way of doing what you wrote and rewriting your entire program, as long as the observable behavior is the same.
However, the fact that an implementation is compliant doesn't mean it is perfect. Adding five-second delays wouldn't affect the implementation's compliance, but nobody would want to use that implementation. Compilers don't do these things because they are ultimately tools, and as such, their writers expect them to be useful to those who use them, and making your code intentionally worse is not useful.
TL;DR: bad performance (time complexity, memory complexity, etc.) doesn't affect compliance, but it will make you look for a new compiler.

C++ 17 parallelism hardware implementation

As I could understand, C++ 17 will come with Parallelism. However, what I could not understand is it a specific hardware parallelism (CPU by default)? Or it can be extended to any hardware with multiple computation units?
In other words, will we see something like,for example, "nVidia C++ standard compiler" which is going to compile the parallel parts to be executed on GPUs?
Will it be some more standardized alternative to OpenCL for example?
Note: Absolutely, I am not asking "Will nVidia do that?". I am asking if C++ 17 standards allow that and if it is theoretically possible.

The question provides a link to the paper proposing this change, and, with respect to the parallelism aspects, there haven't been substantial changes to what's proposed. Yes, the compiler can do whatever makes sense for the target hardware to parallelize the execution of various algorithms, provided only that it gets the right answer (with some reservations) and that it doesn't impose unneeded overhead (again, with some reservations).
There are a couple of important points to understand.
First, C++17 parallelism is not a general parallel programming mechanism. It provides parallel versions of many of the STL algorithms, nothing more. So it's not a replacement for more powerful mechanisms like OpenCL, TBB, etc.
Second, there are inherent limitations when you try to parallelize algorithms, and that's why I added those two parenthesized qualifications. For example, the parallel version of std::accumulate will produce the same result as the non-parallel version only if the function being applied to the input range is commutative and associative. The most obvious problem area here is floating-point values, where math operations are not associative, so the result might differ. Similarly, some algorithms actually impose more overhead when parallelized; you get a net speedup, but there is more total work done, so the speedup for those algorithms will not be linear in the number of processing units. std::partial_sum is an example: each output value depends on the preceding value, so it's not simple to parallelize the algorithm. There are ways to do it, but you end up applying the combiner function more times than the non-parallel algorithm would. In general, there are relaxations of the complexity requirements for algorithms in order to reflect this reality.

Performances of reverse_iterator vs iterator?

I am designing a very efficient iterator for a High Performance Computing / Supercomputer application. I was wondering whether the std::reverse_iterator adaptor will introduce some overhead or whether it is likely to compile and achieve the exact same performances as the iterator type it is applied to.

When it comes to performance, don't guess, don't ask, measure. And don't measure with artificial small test programs. Measure speed with your real code, your real application, your real data on real target machines.
C++ as a language does not make any guarantees as to whether std::reverse_iterator introduces overhead or not. The question would be a bit more answerable if you at least told us which compiler and which compiler options you are using, because the performance of standard-library facilities (algorithmic complexity requirements aside) is a quality-of-implementation (QoI) issue. It's unlikely that you would experience a difference in performance with std::reverse_iterator, but who knows?
You also have not told us anything about the underlying iterator that you want to adapt. It may, for instance, be an iterator into a custom container and have different performance characteristics for -- and ++.

What are the functions in the standard library that can be implemented faster with programming hacks? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I have recently read an article about fast sqrt calculation. Therefore, I have decided to ask SO community and its experts to help me find out, which STL algorithms or mathematical calculations can be implemented faster with programming hacks?
It would be great if you can give examples or links.
Thanks in advance.

System library developers have more concerns than just performance in mind:
Correctness and standards compliance: Critical!
General use: No optimisations are introduced, unless they benefit the majority of users.
Maintainability: Good hand-written assembly code can be faster, but you don't see much of it. Why?
Portability: Decent libraries should be portable to more than just Windows/x86/32bit.
Many optimisation hacks that you see around violate one or more of the requirements above.
In addition, optimisations that will be useless or even break when the next generation CPU comes around the corner are not a welcome thing.
If you don't have profiler evidence on it being really useful, don't bother optimising the system libraries. If you do, work on your own algorithms and code first, anyway...
EDIT:
I should also mention a couple of other all-encompassing concerns:
The cost/effort to profit/result ratio: Optimisations are an investment. Some of them are seemingly-impressive bubbles. Others are deeper and more effective in the long run. Their benefits must always be considered in relation to the cost of developing and maintaining them.
The marketing people: No matter what you think, you'll end up doing whatever they want - or think they want.

Probably all of them can be made faster for a specific problem domain.
Now the real question is, which ones should you hack to make faster? None, until the profiler tells you to.

Several of the algorithms in <algorithm> can be optimized for vector<bool>::[const_]iterator. These include:
find
count
fill
fill_n
copy
copy_backward
move // C++0x
move_backward // C++0x
swap_ranges
rotate
equal
I've probably missed some. But all of the above algorithms can be optimized to work on many bits at a time instead of just one bit at a time (as would a naive implementation).
This is an optimization that I suspect is sorely missing from most STL implementations. It is not missing from this one:
http://libcxx.llvm.org/

This is where you really need to listen to project managers and MBAs. What you're suggesting is re-implementing parts of the STL and or standard C library. There is an associated cost in terms of time to implement and maintenance burden of doing so, so you shouldn't do it unless you really, genuinely need to, as John points out. The rule is simple: is this calculation you're doing slowing you down (a.k.a. you are bound by the CPU)? If not, don't create your own implementation just for the sake of it.
Now, if you're really interested in fast maths, there are a few places you can start. The gnu multi-precision library implements many algorithms from modern computer arithmetic and semi numerical algorithms that are all about doing maths on arbitrary precision integers and floats insanely fast. The guys who write it optimise in assembly per build platform - it is about as fast as you can get in single core mode. This is the most general case I can think of for optimised maths i.e. that isn't specific to a certain domain.
Bringing my first paragraph and second in with what thkala has said, consider that GMP/MPIR have optimised assembly versions per cpu architecture and OS they support. Really. It's a big job, but it is what makes those libraries so fast on a specific small subset of problems that are programming.
Sometimes domain specific enhancements can be made. This is about understanding the problem in question. For example, when doing finite field arithmetic under rijndael's finite field you can, based on the knowledge that the characteristic polynomial is 2 with 8 terms, assume that your integers are of size uint8_t and that addition/subtraction are equivalent to xor operations. How does this work? Well basically if you add or subtract two elements of the polynomial, they contain either zero or one. If they're both zero or both one, the result is always zero. If they are different, the result is one. Term by term, that is equivalent to xor across a 8-bit binary string, where each bit represents a term in the polynomial. Multiplication is also relatively efficient. You can bet that rijndael was designed to take advantage of this kind of result.
That's a very specific result. It depends entirely on what you're doing to make things efficient. I can't imagine many STL functions are purely optimised for cpu speed, because amongst other things STL provides: collections via templates, which are about memory, file access which is about storage, exception handling etc. In short, being really fast is a narrow subset of what STL does and what it aims to achieve. Also, you should note that optimisation has different views. For example, if your app is heavy on IO, you are IO bound. Having a massively efficient square root calculation isn't really helpful since "slowness" really means waiting on the disk/OS/your file parsing routine.
In short, you as a developer of an STL library are trying to build an "all round" library for many different use cases.
But, since these things are always interesting, you might well be interested in bit twiddling hacks. I can't remember where I saw that, but I've definitely stolen that link from somebody else on here.

Almost none. The standard library is designed the way it is for a reason.
Taking sqrt, which you mention as an example, the standard library version is written to be as fast as possible, without sacrificing numerical accuracy or portability.
The article you mention is really beyond useless. There are some good articles floating around the 'net, describing more efficient ways to implement square roots. But this article isn't among them (it doesn't even measure whether the described algorithms are faster!) Carmack's trick is slower than std::sqrt on a modern CPU, as well as being less accurate.
It was used in a game something like 12 years ago, when CPUs had very different performance characteristics. It was faster then, but CPU's have changed, and today, it's both slower and less accurate than the CPU's built-in sqrt instruction.
You can implement a square root function which is faster than std::sqrt without losing accuracy, but then you lose portability, as it'll rely on CPU features not present on older CPU's.
Speed, accuracy, portability: choose any two. The standard library tries to balance all three, which means that the speed isn't as good as it could be if you were willing to sacrifice accuracy or portability, and accuracy is good, but not as good as it could be if you were willing to sacrifice speed, and so on.
In general, forget any notion of optimizing the standard library. The question you should be asking is whether you can write more specialized code.
The standard library has to cover every case. If you don't need that, you might be able to speed up the cases that you do need. But then it is no longer a suitable replacement for the standard library.
Now, there are no doubt parts of the standard library that could be optimized. the C++ IOStreams library in particular comes to mind. It is often naively, and very inefficiently, implemented. The C++ committee's technical report on C++ performance has an entire chapter dedicated to exploring how IOStreams could be implemented to be faster.
But that's I/O, where performance is often considered to be "unimportant".
For the rest of the standard library, you're unlikely to find much room for optimization.

I need high performance. Will there be a difference if I use C or C++?

I need to write a program (a project for university) that solves (approx) an NP-hard problem.
It is a variation of Linear ordering problems.
In general, I will have very large inputs (as Graphs) and will try to find the best solution
(based on a function that will 'rate' each solution)
Will there be a difference if I write this in C-style code (one main, and functions)
or build a Solver class, create an instance and invoke a 'run' method from a main (similar to Java)
Also, there will be alot of floating point math going on in each iteration.
Thanks!

No.
The biggest performance gains/flaws will be on the algorithm you implement, and how much unneeded work you perform (Unneeded work could be everything from recalculating a previous value that could have been cached, to using too many malloc/free's vs using memory pools,
passing large immutable data by value instead of reference)

The biggest roadblock to optimal code is no longer the language (for properly compiled languages), but rather the programmer.

No, unless you are using virtual functions.
Edit: If you have a case where you need run-time dynamism, then yes, virtual functions are as fast or faster than a manually constructed if-else statement. However, if you drop in the virtual keyword in front of a method, but you don't actually need the polymorphism, then you will be paying an unnecessary overhead. The compiler won't optimize it away at compile time. I am just pointing this out because it's one of the features of C++ that breaks the 'zero-overhead principle` (quoting Stroustrup).
As a side note, since you mention heavy use of fp math:
The following gcc flags may help you speed things up (I'm sure there are equivalent ones for visual C++, but I don't use it): -mfpmath=sse, -ffast-math and -mrecip (The last two are 'slightly dangerous', meaning that they could give you weird results in edge cases in exchange for the speed. The first one reduces precision by a bit -- you have 64-bit doubles instead of 80-bit ones -- but this extra precision is often unneeded.) These flags would work equally well for C and C++ compilers.
Depending on your processor, you may also find that simulating true INFINITY with a large-but-not-infinite value gives you a good speed boost. This is because true INFINITY has to be handled as a special case by the processor.

Rule of thumb - do not optimize until you know what to optimize. So start with C++ and have some working prototype. Then profile it and rewrite bottle necks in assembly. But as others noted, chosen algorithm will have much greater impact than the language.

When speaking of performance, anything you can do in C can be done in C++.
For example, virtual methods are known to be “slow”, but if it's really a problem, you can still resort to C idioms.
C++ also brings templates, which lead to better performance than using void* for generic programming.

The Solver class will be constructed once, I take it, and the run method executed once... in that kind of environment, you won't see a difference. Instead, here are things to watch out for:
Memory management is hellishly expensive. If you need to do lots of little malloc()s, the operating system will eat your lunch. Make a determined effort to re-use whatever data structures you create if you know you'll be doing the same kind of thing again soon!
Instantiating classes generally means... allocating memory! Again, there's practically no cost for instantiating a handful of objects and re-using them. But beware of creating objects only to tear them down and rebuild them soon after!
Choose the right flavor of floating point for your architecture, insofar as the problem permits. It's possible that double will end up being faster than float, although it will need more memory. You should experiment to fine-tune this. Ideally, you'll use a #define or typedef to specify the type so you can change it easily in one place.
Integer calculations are probably faster than floating point. Depending on the numeric range of your data, you may also consider doing it with integers treated as fixed-point decimals. If you need 3 decimal places, you could use ints and just consider them "milli-somethings". You'll have to remember to shift decimals after division and multiplication... but no big deal. If you use any math functions beyond the basic arithmetic, of course, that would of course kill this possibility.

Since both are compiled, and the compilers now are very good at how to handle C++, I think the only problem would come from how well optimized your code is. I think it would be easier to write slower code in C++, but that depends on which style your model fits into best.
When it comes down to it, I doubt there will be any real difference, assuming both are well-written, any libraries you use, how well written they are, if you are measuring on the same computer.

Function call vs. member function call overhead is unlikely to be the limiting factor, compared to file input and the algorithm itself. C++ iostreams are not necessarily super high speed. C has 'restrict' if you're really optimizing, in C++ it's easier to inline function calls. Overall, C++ offers more options for organizing your code clearly, but if it's not a big program, or you're just going to write it in a similar manner whether it's C or C++, then the portability of C libraries becomes more important.

As long as you don't use any virtual functions etc. you won't note any considerable performance differences. Early C++ was compiled to C, so as long as you know the pinpoints where this creates any considerable overhead (such as with virtual functions) you can clearly calculate for the differences.
In addition I want to note that using C++ can give you a lot to gain if you use the STL and Boost Libraries. Especially the STL provides very efficient and proven implementations of the most important data structures and algorithms, so you can save a lot of development time.
Effectively it also depends on the compiler you will be using and how it will optimize the code.

first, writing in C++ doesn't imply using OOP, look at the STL algorithms.
second, C++ can be even slightly faster at runtime (the compilation times can be terrible compared to C, but that's because modern C++ tends to rely heavily on abstractions that tax the compiler).
edit: alright, see Bjarne Stroustrup's discussion of qsort and std::sort, and the article that FAQ mentions (Learning Standard C++ as a New Language), where he shows that C++-style code can be not only shorter and more readable (because of higher abstractions), but also somewhat faster.

Another aspect:
C++ templates can be an excellent tool to generate type-specific /
optimized code variations.
For example, C qsort requires a function call to the comparator, whereas std::sort can inline the functor passed. This can make a significant difference when compare and swap themselves are cheap.
Note that you could generate "custom qsorts" optimized for various types with a barrage of defines or a code generator, or by hand - you could do these optimizations in C, too, but at much higher cost.
(It's not a general weapon, templates help only in sepcific scenarios - usually a single algorithm applied to different data types or with differing small pieces of code injected.)

Good answers. I would put it like this:
Make the algorithm as efficient as possible in terms of its formal structure.
C++ will be as fast as C, except that it will tempt you to do dumb things, like constructing objects that you don't have to, so don't take the bait. Things like STL container classes and iterators may look like the latest-and-greatest thing, but they will kill you in a hotspot.
Even so, single-step it at the disassembly level. You should see it very directly working on your problem. If it is spending lots of cycles getting in and out of routines, try some in-lining (or macros). If it is wandering off into memory allocation and freeing, for much of the time, put a stop to that. If it's got inner loops where the loop overhead is a large percentage, try unrolling the loop.
That's how you can make it about as fast as possible.

I would go with C++ definitely. If you are careful about your design and avoid creating heavy objects inside hotspots you should not see any performance difference but the code is going to be much simpler to understand, maintain, and expand.
Use templates and classes judiciously. avoid unnecessary object creation by passing objects by reference. Avoid excessive memory allocation, if needed, allocate memory in advance of hotspots. Use restrict keyword on memory pointers to tell compiler whenever pointers overlap or not.
As far as optimization, pay careful attention to memory alignment. Assuming you are working on Intel processor, you can make use of vector instructions, provided you tell the compiler through pragma's about your memory alignment and aliased pointers. you can also use vector instructions directly via intrinsics.
you can also automatically create hotspot code using templates and let compiler optimize it out if you have things like short loops of different sizes. To find out performance and to drill down to your bottlenecks, Intel vtune or oprofile are extremely helpful.
hope that helps

I do some DSP coding, where it still pays off to go to assembly language sometimes. I'd say use C or C++, either one, and be prepared to go to assembly language when you need to, especially to exploit SIMD instructions.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js