Why don't likely/unlikely show performance improvements? - c++

I have many validation checks in my code that crash the program if any check fails, so the checks themselves are very unlikely to be true:
if ((msg = newMsg()) == (void *)0) // this is very unlikely
{
    panic(); // crash
}
So I have used the unlikely macro, which hints the compiler for branch prediction. But I have seen no improvement from this (I have some performance tests). I am using gcc 4.6.3.
Why is there no improvement? Is it because there is no else case? Should I use any optimization flag while building my application?

Should I use any optimization flag while building my application?
Absolutely! Even the optimizations turned on at the lowest level, -O1 for GCC/Clang/ICC, are likely to outperform most of your manual optimization efforts, and essentially for free, so why not?
I am using gcc4.6.3.
GCC 4.6 is old. You should consider working with modern tools, unless you're constrained otherwise.
But I have seen no improvement with this(I have some performance tests).
You haven't seen visible performance improvements, which is very common with micro-optimizations like these. Unfortunately, achieving visible improvements is not easy on today's hardware: components have become unbelievably faster than they used to be, so saving a few cycles is not as noticeable as it once was.
It's worth noting, though, that micro-optimizations can still make your code much faster in tight loops. Avoiding stalls and branch mispredictions and maximizing cache use do make a difference when processing chunks of data, and SO's most voted question ("Why is processing a sorted array faster than processing an unsorted array?") shows that clearly.
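A condensed sketch of the effect that question demonstrates (not code from the thread; the sizes and threshold are arbitrary, and note that on some compilers auto-vectorization can hide the effect):

#include <algorithm>
#include <cstdlib>
#include <vector>

// The filter branch below is almost perfectly predictable on sorted
// data and essentially random on unsorted data, which typically makes
// the sorted run several times faster.
long sum_big(const std::vector<int>& data) {
    long sum = 0;
    for (int v : data)
        if (v >= 128) // predictable when data is sorted
            sum += v;
    return sum;
}

int main() {
    std::vector<int> data(1 << 20);
    for (int& v : data) v = std::rand() % 256;
    // std::sort(data.begin(), data.end()); // uncomment and compare timings
    long total = 0;
    for (int i = 0; i < 100; ++i) total += sum_big(data);
    return static_cast<int>(total & 1); // keep the result alive
}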
Regarding the usefulness of such hints in general, it's even stated in the GCC manual:
— Built-in Function: long __builtin_expect (long exp, long c)
You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.
(emphasis mine)
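For reference, the unlikely macro the asker mentions is conventionally a thin wrapper over __builtin_expect; this is a sketch similar to the Linux kernel's macros, and the asker's actual definition is an assumption:

#include <cstdio>
#include <cstdlib>

// Conventional wrapper; !!(x) normalizes the condition to 0 or 1.
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define unlikely(x) (x)
#endif

static void panic() { std::abort(); }             // stand-in for the asker's panic()
static void* newMsg() { return std::malloc(64); } // stand-in allocator

int main() {
    void* msg;
    if (unlikely((msg = newMsg()) == nullptr)) {
        panic(); // cold path; GCC can move it out of the hot code
    }
    std::free(msg);
    return 0;
}

Note that the hint only affects code layout when optimizations are enabled (e.g. g++ -O2); at -O0 it is essentially a no-op, which alone can explain a flat benchmark.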

See other answers related to this on SO:
likely(x) and __builtin_expect((x),1)
Why do we use __builtin_expect when a straightforward way is to use if-else
etc.
Search for it: https://stackoverflow.com/search?q=__builtin_expect

Related

Elimination of if-else for improving performance in pipelined architecture

My professor insists on writing code which is "truly sequential" by omitting use of "if-else" constructs or "looping" constructs.
His argument is that any branch instruction causes a pipeline flush and is inefficient.
He suggested use of signals and exception handling.
He also suggested using certain flags, viz. the overflow flag, sign flag, and carry flag, to replace if-else conditions.
My question is whether such a program is feasible. If yes, is it really efficient? Examples would be helpful.
This kind of micro-optimization makes sense in highly compute-intensive loops, where shaving off a few cycles can have a noticeable effect. There are a few buts:
if the code is not mature enough, such optimization can be premature and counterproductive;
it is of little use to optimize code that already runs fast;
use profiling to be sure where the bottlenecks are;
only assembly language gives you access to the arithmetic flags;
good compilers know the tricks and do the work for you to some extent;
if you care about branches, also care about divisions and the memory access patterns.
With the current level of sophistication of modern processors and high-level languages, be they compiled or interpreted, you have less and less control over the code actually generated.
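As a small illustration of the "good compilers know the tricks" point above, a plain conditional is often compiled to a branchless conditional move, so hand-coding flag tricks is rarely needed (a sketch; the function names are made up):

#include <cstdint>
#include <cstdio>

// At -O2, GCC and Clang typically compile this ternary to a
// branchless cmov/select on x86-64 -- no manual flag fiddling needed.
int32_t clamp_to_zero(int32_t x) {
    return x < 0 ? 0 : x;
}

// Hand-written branchless version using a sign mask. Relies on
// arithmetic right shift of negative values, which is
// implementation-defined but universal on mainstream compilers.
int32_t clamp_to_zero_mask(int32_t x) {
    return x & ~(x >> 31);
}

int main() {
    std::printf("%d %d\n", clamp_to_zero(-5), clamp_to_zero_mask(7)); // prints "0 7"
}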

Performance penalty for "if error then fail fast" in C++?

Is there any performance difference (in C++) between the two styles of writing if-else, as shown below (logically equivalent code) for the likely1 == likely2 == true path (likely1 and likely2 are meant here as placeholders for some more elaborate conditions)?
// Case (1):
if (likely1) {
    Foo();
    if (likely2) {
        Bar();
    } else {
        Alarm(2);
    }
} else {
    Alarm(1);
}
vs.
// Case (2):
if (!likely1) {
    Alarm(1);
    return;
}
Foo();
if (!likely2) {
    Alarm(2);
    return;
}
Bar();
I'd be very grateful for information on as many compilers and platforms as possible (but with gcc/x86 highlighted).
Please note I'm not interested in readability opinions on those two styles, neither in any "premature optimisation" claims.
EDIT: In other words, I'd like to ask if the two styles are at some point considered fully-totally-100% equivalent/transparent by a compiler (e.g. bit-by-bit equivalent AST at some point in a particular compiler), and if not, then what are the differences? For any (with a preference towards "modern" and gcc) compiler you know.
And, to make it clearer: I don't expect this to give me much of a performance improvement, and it would usually be premature optimization, but I am interested in whether, and by how much, it can improve or degrade anything.
It depends greatly on the compiler and the optimization settings. If the difference is crucial, implement both, and either analyze the assembly or run benchmarks.
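A minimal benchmarking sketch along those lines, assuming trivial stand-ins for Foo, Bar, and Alarm (with such tiny bodies the measurement mostly exercises branch layout, so treat the numbers with care):

#include <chrono>
#include <cstdio>

static volatile int sink; // defeat dead-code elimination

static void Foo()        { sink += 1; }
static void Bar()        { sink += 2; }
static void Alarm(int n) { sink += n; }

// Case (2) style, annotated with GCC's branch hint.
static void process(bool likely1, bool likely2) {
    if (__builtin_expect(!likely1, 0)) { Alarm(1); return; }
    Foo();
    if (__builtin_expect(!likely2, 0)) { Alarm(2); return; }
    Bar();
}

int main() {
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < 100000000; ++i)
        process(i % 1000 != 0, i % 997 != 0); // conditions true ~99.9% of the time
    auto stop = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    std::printf("%lld ms\n", static_cast<long long>(ms.count()));
    return 0;
}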
I have no answers for specific platforms, but I can make a few general points:
The traditional answer on older processors without branch prediction is that the first is likely to be more efficient, since in the common case it takes fewer branches. But you seem interested in modern compilers and processors.
On modern processors, generally speaking, short forward branches are not expensive, whereas mispredicted branches may be expensive. By "expensive" I of course mean a few cycles.
Quite aside from this, the compiler is entitled to order basic blocks however it likes provided it doesn't change the logic. So when you write if (blah) {foo();} else {bar();}, the compiler is entitled to emit code like:
    evaluate condition blah
    jump_if_true taken_label
    bar()            ; the else branch, emitted first
    jump endif_label
taken_label:
    foo()
endif_label:
On the whole, gcc tends to emit things in roughly the order you write them, all else being equal. There are various things which make all else not equal, for example if you have the logical equivalent of bar(); return in two different places in your function, gcc might well coalesce those blocks, emit only one call to bar() followed by return, and jump or fall through to that from two different places.
There are two kinds of branch prediction - static and dynamic. Static means that the CPU instructions for the branch specify whether the condition is "likely", so that the CPU can optimize for the common case. Compilers might emit static branch predictions on some platforms, and if you're optimizing for that platform you might write code to take account of that. You can take account of it either by knowing how your compiler treats the various control structures, or by using compiler extensions. Personally I don't think it's consistent enough to generalize about what compilers will do. Look at the disassembly.
Dynamic branch prediction means that in hot code, the CPU will keep statistics for itself on how likely branches are to be taken, and optimize for the common case. Modern processors use various dynamic branch prediction techniques: http://en.wikipedia.org/wiki/Branch_predictor. Performance-critical code pretty much is hot code, and as long as the dynamic branch prediction strategy works, it very rapidly optimizes hot code. There might be certain pathological cases that confuse particular strategies, but in general you can say that anything in a tight loop with a bias towards taken/not-taken will be correctly predicted most of the time.
Sometimes it doesn't even matter whether the branch is correctly predicted or not, since some CPUs in some cases will include both possibilities in the instruction pipeline while it's waiting for the condition to be evaluated, and ditch the unnecessary option. Modern CPUs get complicated. Even much simpler CPU designs have ways of avoiding the cost of branching, though, such as conditional instructions on ARM.
Calls out of line to other functions will upset all such guesswork anyway. So in your example there may be small differences, and those differences may depend on the actual code in Foo, Bar and Alarm. Unfortunately it's not possible to distinguish between significant and insignificant differences, or to account for details of those functions, without getting into the "premature optimization" accusations you're not interested in.
It's almost always premature to micro-optimize code that isn't written yet. It's very hard to predict the performance of functions named Foo and Bar. Presumably the purpose of the question is to discern whether there's any common gotcha that should inform coding style. To which the answer is that, thanks to dynamic branch prediction, there is not. In hot code it makes very little difference how your conditions are arranged, and where it does make a difference that difference isn't as easily predictable as "it's faster to take / not take the branch in an if condition".
If this question was intended to apply to just one single program with this code proven to be hot, then of course it can be tested, there's no need to generalize.
It is compiler-dependent. Check out the GCC documentation on __builtin_expect. Your compiler may have something similar. Note that you really should be concerned about premature optimization.
The answer depends a lot on the type of "likely". If it is an integer constant expression, the compiler can optimize it away and both cases will be equivalent. Otherwise, it will be evaluated at runtime and can't be optimized much.
Thus, case 2 is generally more efficient than case 1.
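A trivial sketch of the integer-constant-expression case mentioned above, where the dead branch disappears entirely:

// With an integer constant expression as the condition, both styles
// compile to the same straight-line code; the dead branch is removed.
constexpr bool likely1 = true;

int f() {
    if (likely1)
        return 1; // the only path emitted
    else
        return 2; // eliminated at compile time
}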
Speaking from real-time embedded systems, which I work with, your "case 2" is often the norm for code that is safety- and/or performance-critical. Style guides for safety-critical embedded systems often allow this syntax so a function can quit quickly upon errors.
Generally, style guides will frown upon the "case 2" syntax, but make an exception to allow several returns in one function either if
1) the function needs to quit quickly and handle the error, or
2) if one single return at the end of the function leads to less readable code, which is often the case for various protocol and data parsers.
If you are this concerned about performance, I assume you are using profile guided optimization.
If you are using profile guided optimization, the two variants you have proposed are exactly the same.
In any event, the performance of what you are asking about is completely overshadowed by the performance characteristics of things not evident in your code samples, so we really cannot answer this. You have to test the performance of both.
Though I'm with everyone else here insofar as optimizing a branch makes no sense without having profiled and actually having found a bottleneck... if anything, it makes sense to optimize for the likely case.
Both likely1 and likely2 are likely, as their names suggest. Thus, ruling out the (also likely) combination of both being true first would likely be fastest:
if (likely1 && likely2)
{
    ... // happens most of the time
}
else
{
    if (likely1)
        ...
    if (likely2)
        ...
    else if (!likely1 && !likely2) // happens almost never
        ...
}
Note that the second else is probably not necessary; a decent compiler will figure out that the last if clause cannot be true if the previous one was, even if you don't tell it explicitly.

How to know what optimizations are done automatically by my compiler

I was going through this link Will it optimize and wondered how we can know what optimizations are done by a particular compiler.
Like does VC8.0 convert if-else statements to switch-case?
Is such information available on msdn?
As everyone seems bent on telling the OP that he shouldn't worry about it, here is some useful (although not as specific as the OP requested) information about compiler optimization options.
You'll have to figure out what flags you're using, especially for MSVC and Intel (a typical GCC release build uses -O2), but here are the links:
GCC
MSVC
Intel
This is about as close as you'll get before disassembling your binary after compilation.
It depends on the level of optimization you choose for the compiler.
You can find a very nice article about it here.
First of all, if optimization took place, your program should usually run faster. After that, you can inspect the disassembly to find out what kind of optimizations were performed.
I don't know anything about VC8.0, so I'm not sure how you would access that information. However, if you are generally interested in the kinds of optimisations that go on and want to experiment, I recommend you use LLVM. You can look at the unoptimised, disassembled byte code generated from the default C front end, and then run various optimiser passes over it to see what the effect is each time. Because it's a nicer, abstract assembly code, it tends to be a little easier to figure out what is an optimisation derivable from the code and what is a machine-specific optimisation.
Like does VC8.0 convert if-else statements to switch-case?
Compilers do not magically rewrite your source code. And even if they did, what would that tell you? What you really want to know is whether the compiler compiled it into a jump table or into multiple compare operations. Any disassembler will tell you that.
To clarify my point: writing a switch-case statement does not necessarily imply that there will be a jump table in the binary. Not needing to worry about this is the whole point of having compilers.
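If you want to see which form the compiler chose, compile a small test case and inspect the assembly; a minimal sketch (the file and function names are made up):

// switch_test.cpp -- inspect with: g++ -O2 -S switch_test.cpp
// Dense, consecutive case values like these often become a jump
// table; sparse values (e.g. 1, 100, 5000) usually become a chain of
// compares or a binary search. The disassembly is the ground truth.
int classify(int x) {
    switch (x) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        case 3: return 40;
        case 4: return 50;
        case 5: return 60;
        default: return -1;
    }
}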
Instead of figuring out which optimizations are done by the compiler in general, it's probably better to NOT have any dependencies on such compiler-specific knowledge.
Instead start out with a good design and algorithm, writing (as much as possible) portable code that's easy to follow. Then profile the code if it's too slow and fix the actual hotspots. Compiler optimizations are useful no doubt, but better is to apply some investigation to what's actually happening in the code. Algorithmic/design improvements at the source level will typically help performance more than the presence or absence of optimizations like transforming if/else into switch-case.
I'm not sure what "convert if/else to switch/case" means. My processor doesn't have a hardware switch/case instruction.
Typical compilers have several different ways to implement switch/case. A well-known one is using a jump table, but this is only done if appropriate.
For if/else, certainly it is normal for compilers to analyse a digraph of execution flow. I would expect a compiler to notice if each condition references the same variable, and I would expect the compiler to treat equivalent forms of conditionals the same way in general. But this isn't something I'd worry about.
IIRC, the general policy in GCC is that regressions in optimisation are tolerable so long as preferred improvements result. Optimisation is complex and what is "generally" a good optimisation isn't always that great. Plus for perfect optimisation, the compiler would have to know things it can't know (e.g. what inputs it will encounter in real life).
The point is that it really isn't worthwhile knowing that much about specific optimisations unless you happen to be a compiler developer. If you depend on something being optimised by V8, that particular optimisation might not happen in V9 or V10.

C/C++ compiler feedback optimization

Has anyone seen real-world numbers for different programs using the feedback optimization that C/C++ compilers offer to support branch prediction, cache preloading, etc.?
I searched for it and, amazingly, not even the popular interpreter development groups seem to have checked the effect. Increasing Ruby, Python, PHP, etc. performance by 10% or so should be considered useful.
Is there really no benefit, or is the whole developer community just too lazy to use it?
10% is a good ballpark figure. That said, ...
You have to REALLY care about performance to go this route. The product I work on (DB2) uses PGO and other invasive and aggressive optimizations. Among the costs are significantly longer build times (triple on some platforms) and development and support nightmares.
When something goes wrong it can be non-trivial to map the fault location in the optimized code back to the source. Developers don't usually expect that functions in different modules can end up merged and inlined and this can have "interesting" effects.
Problems with pointer aliasing, which are nasty to track down, also usually show up with these sorts of optimizations. You have the additional fun of non-deterministic builds (an aliasing problem can show up in Monday's build, vanish again till Thursday's, ...).
The line between what is correct or incorrect compiler behaviour under these sorts of aggressive optimizations also becomes fairly blurred. Even with the luxury of having our compiler guys in house (literally) the optimization issues (either in our source or the compiler) are still not easy to understand and resolve.
From unladen-swallow (a project optimizing the CPython VM):
For us, the final nail in PyBench's coffin was when experimenting with gcc's feedback-directed optimization tools, we were able to produce a universal 15% performance increase across our macrobenchmarks; using the same training workload, PyBench got 10% slower.
So some people are at least looking at it. That said, PGO sets some pretty tricky requirements on the build environment that are hard to satisfy for open-source projects meant to be built by a distributed, heterogeneous group of people. Heavy optimization also creates difficult-to-debug heisenbugs. It's less work to give the compiler explicit hints for the performance-critical parts.
That said, I expect significant performance increases from runtime profile guided optimization. JIT'ing allows the optimizer to cope with the profile of data changing across the execution of a program and do many extremely runtime data specific optimizations that would explode the code size for static compilation. Especially dynamic languages need good runtime data based optimization to perform well. With dynamic language performance getting significant attention lately (JavaScript VM's, MS DLR, JSR-292, PyPy and so on) there's a lot of work being done in this area.
Traditionally, compiler efficiency is improved via profiling using performance analysis tools. However, how the data from such tools is used in optimization still depends on the compiler. For example, GCC is a framework being worked on to produce compilers for different domains, and providing a profiling mechanism in such a compiler framework is extremely difficult.
We can rely on statistical data for certain optimizations. For instance, GCC unrolls a loop if the loop count is less than a constant (say 7). How it fixes that constant is based on statistical results about the code size generated for different target architectures.
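For example, a short fixed-count loop like the following is a typical candidate for complete unrolling (a trivial sketch; whether it actually happens depends on the optimization level, e.g. -O3 or -funroll-loops):

// A small constant trip count makes complete unrolling cheap: the
// compiler can replace the loop with four multiply-adds and no branch.
int dot4(const int* a, const int* b) {
    int s = 0;
    for (int i = 0; i < 4; ++i)
        s += a[i] * b[i];
    return s;
}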
Profile-guided optimizations track specific areas of the source. Details from previous runs need to be stored, which is an overhead. The training input, on the other hand, must be statistically representative of the target application's real workloads, so the complexity rises with the number of different inputs and outputs. In short, profile-guided optimization requires extensive data collection, and automating or embedding such profiling needs careful monitoring; if not, the entire result will be awry, and in our effort to swim we will actually drown.
However, experimentation in this regard is ongoing. Just have a look at POGO.

profile-guided optimization (C)

Anyone know this compiler feature? It seems GCC supports it. How does it work? What is the potential gain? In which cases is it good? Inner loops?
(this question is specific, not about optimization in general, thanks)
It works by placing extra code to count the number of times each code path is taken. When you compile a second time, the compiler uses the knowledge gained about the execution of your program that it could only guess at before. There are a couple of things PGO can work toward:
Deciding which functions should be inlined or not depending on how often they are called.
Deciding which branch of an "if" statement should be predicted taken, based on the percentage of executions going one way or the other.
Deciding how to optimize loops based on how many iterations get taken each time that loop is called.
You never really know how much these things can help until you test it.
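With GCC specifically, the two-pass workflow looks like this (a sketch using GCC's real -fprofile-generate/-fprofile-use flags; the file name and the toy branch are made up):

// pgo_demo.cpp
//
// Pass 1 -- build instrumented and run on representative input:
//   g++ -O2 -fprofile-generate pgo_demo.cpp -o pgo_demo
//   ./pgo_demo          # writes .gcda profile data files
//
// Pass 2 -- rebuild using the collected profile:
//   g++ -O2 -fprofile-use pgo_demo.cpp -o pgo_demo
#include <cstdio>

int main() {
    long taken = 0;
    for (long i = 0; i < 100000000; ++i)
        if (i % 97 != 0) // heavily biased branch PGO can learn about
            ++taken;
    std::printf("%ld\n", taken);
    return 0;
}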
PGO gives about a 5% speed boost when compiling x264, the project I work on, and we have a built-in system for it (make fprofiled). It's a nice free speed boost in some cases, and it probably helps more in applications that, unlike x264, are less made up of handwritten assembly.
Jason's advice is right on. The best speedups come from "discovering" that you let an O(n²) algorithm slip into an inner loop somewhere, or that you can cache certain computations outside of expensive functions.
Compared to the micro-optimizations that PGO can trigger, these are the big winners. Once you've done that level of optimization, PGO might be able to help. We never had much luck with it, though: the cost of the instrumentation was such that our application became unusably slow (by several orders of magnitude).
I like using Intel VTune as a profiler, primarily because it is non-invasive compared to instrumenting profilers, which change behaviour too much.
The fun thing about optimization is that speed gains are found in the unlikeliest of places.
It's also the reason you need a profiler, rather than guessing where the speed problems are.
I recommend starting with a profiler (gprof if you're using GCC) and just poking around the results of running your application through some normal operations.