Any way to handle "predictable branches" faster? - c++

I have some code with two or three branches whose direction you can't know up front, but after the first time they are hit it is either 100% certain, or close to it, that the same path will be taken again. I have noticed that the likely/unlikely hint (__builtin_expect) doesn't do much in terms of avoiding branch misses. And even though branch prediction does a good job when my function is called repeatedly in a short time span, as soon as there is other stuff going on between calls to my function, performance degrades substantially. What are some ways around this, or some techniques I can look into? Is there any way to somehow "tag" these branches so they predict well when they are reached again after some time away?

You could use templates to generate a different version of the function for each code path, then use a function pointer to select one at runtime when you find out which way the condition goes.
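A rough sketch of that idea, with made-up names (the real condition and work would be whatever the hot branch actually depends on):

#include <cstdio>

// One instantiation per outcome of the "predictable" condition; inside each,
// the branch is a compile-time constant, so the optimizer removes the test entirely.
template <bool FormatA>
void do_work(int x)
{
    if (FormatA)
        std::printf("format A: %d\n", x);
    else
        std::printf("format B: %d\n", x);
}

// Selected once, the first time the condition's value becomes known.
void (*do_work_ptr)(int) = nullptr;

void on_first_call(bool format_a)
{
    do_work_ptr = format_a ? &do_work<true> : &do_work<false>;
}

// Later calls just go through the pointer: do_work_ptr(x);

The indirect call itself still has to be predicted, but since the pointer never changes after the first call it tends to behave much better than a data-dependent branch.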

The branch predictor and compiler intrinsics are all you've got. At best, you can look at the assembly and try to hand-roll some optimization yourself, but you won't find much.

Related

Simple profiling of single C++ function in Windows

There are times, particularly when I'm writing a new function, where I would like to profile a single piece of code but a full profile run is not really necessary, and possibly too slow.
I'm using VS 2008 and have used the AMD profiler on C++ with good results, but I'm looking for something a little more lightweight.
What tools do you use to profile single functions? Perhaps something that is a macro which gets excluded when you're not in DEBUG mode. I could write my own, but I wanted to know if there were any built in that I'm missing. I was thinking of something like:
void FunctionToTest()
{
    PROFILE_ENTER("FunctionToTest")
    // Do some stuff
    PROFILE_EXIT()
}
Which would simply print in the debug output window how long the function took.
If I want to get maximum speed from a particular function, I wrap it in a nice long-running loop and use this technique.
I really don't care too much about the time it takes. That's just the result.
What I really need to know is what must I do to make it take less time.
See the difference?
After finding and fixing the speed bugs, when the outer loop is removed, it flies.
Also I don't follow the orthodoxy of only tuning optimized code, because that assumes the code is already nearly as tight as possible.
In fact, in programs of any significant size, there is usually stupid stuff going on, like calling a subfunction over and over with the same arguments, or new-ing objects repeatedly, when prior copies could be re-used.
The compiler's optimizer might be able to clean up some such problems, but I need to clean up every one, because the ones left behind will dominate.
What the optimizer can do is make them harder to find, by scrambling the code.
When I've gotten all the stupid stuff out (making it much faster), then I turn on the optimizer.
You might think "Well I would never put stupid stuff in my code."
Right.
And you'd never put in bugs either.
None of us try to make mistakes, but we all do, if we're working.
This code by Jeff Preshing should do the trick:
http://preshing.com/20111203/a-c-profiling-module-for-multithreaded-apis
Measure the time - which the code in the link does too - using either clock() or one of the OS-provided high-resolution timers. With C++11 you can use the timers from the <chrono> header.
Please note that you should always measure a Release build, not a Debug build, to get proper timings.
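For the kind of macro the question describes, a small RAII timer built on <chrono> is usually enough. The sketch below is illustrative only: the ScopedProfile/PROFILE_SCOPE names are made up, and it prints to stderr (on Windows you could route the message through OutputDebugString instead to land in the VS output window).

#include <chrono>
#include <cstdio>

struct ScopedProfile
{
    const char* name;
    std::chrono::steady_clock::time_point start;

    explicit ScopedProfile(const char* n)
        : name(n), start(std::chrono::steady_clock::now()) {}

    ~ScopedProfile()
    {
        auto elapsed = std::chrono::steady_clock::now() - start;
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
        std::fprintf(stderr, "%s took %lld us\n", name, static_cast<long long>(us));
    }
};

// One macro instead of an ENTER/EXIT pair: the destructor reports when the scope ends.
#define PROFILE_SCOPE(tag) ScopedProfile profileScopeLocal(tag)

void FunctionToTest()
{
    PROFILE_SCOPE("FunctionToTest");
    // Do some stuff
}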

Best way to profile loops

I have to modify a C/C++ program that has a lot of loops inside one function, and I need to add CUDA functions.
Before I start making changes, I wanted to measure the time spent in every loop it contains, but I didn't find any profiling tool that does exactly that. What is the best way to do this? I'm on Linux. If you have any solutions, let me know.
Here is a paper describing a tool that does exactly what I want, but I haven't been able to find it, or anything like it: http://carbon.ucdenver.edu/~dconnors/papers/wbia06-loopprof.pdf
I would use gperftools and figure out where the code is spending most of its time. Once you have identified a function or part of a function, you're probably done. Understanding exactly which instructions are the "heaviest" in a function will require a long-running test case for that particular loop, so that the profiler can get sufficient data for each instruction (or at least most instructions) in the loop. But profiling down to individual instructions is probably not relevant if you are looking to replace the code with another technology - it is unlikely that replacing one loop of a few lines of code will help much, since there'd be too much overhead. Instead, you want to take a larger block and move that across to CUDA.
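If gperftools fits, you can also restrict the profile to just the loop-heavy function by starting and stopping its CPU profiler around the call; heavy_loops below is a stand-in for your real function.

#include <gperftools/profiler.h>   // link with -lprofiler

void heavy_loops();                // the function containing all the loops

int main()
{
    ProfilerStart("loops.prof");   // begin sampling; samples go to loops.prof
    heavy_loops();
    ProfilerStop();                // stop sampling and flush the file
}

// Afterwards, inspect the result with something like: pprof --text ./a.out loops.prof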

Looking for opinions on when to unwind loops with if statements

I'm wondering when (if ever) I should pull if statements out of substantial loops in order to help optimize for speed?
for(i=0; i<File.NumBits; i++)
    if(File.Format.Equals(FileFormats.A))
        ProcessFormatA(File[i]);
    else
        ProcessFormatB(File[i]);
into
if(File.Format.Equals(FileFormats.A))
    for(i=0; i<File.NumBits; i++)
        ProcessFormatA(File[i]);
else
    for(i=0; i<File.NumBits; i++)
        ProcessFormatB(File[i]);
I'm not sure if the compiler will do this type of optimization for me, or if this is considered good coding practice because I would imagine it would make code much harder to read / maintain if the loops were more complex.
Thanks for any input / suggestions.
When you have finished the code and the profiler tells you that the for loops are a bottleneck. No sooner.
If you're actually processing a file (i.e. reading and/or writing) within your functions, optimizing the if is going to be pointless; the file operations will take so much longer, comparatively, than the if that you won't even notice the speed improvement.
I would expect that a decent compiler might be able to do this optimization for you (hoisting a loop-invariant condition out of the loop this way is usually called loop unswitching) - however, it would need to be certain that File.Format can't change in the loop, which could be a big ask.
As I always like to say, write first, optimize later!
Definitely code for maintainability and correctness first. In this case I would be inclined to suggest neither:
for(...)
ProcessFormat(File, Format);
Much harder to mess up if all the checks are in one place. You do a better job of confusing your optimizer this way, but generally you want correct code to run slowly rather than incorrect code running quickly. You can always optimize later if you want.
The two perform the same and read the same, so I'd pick the one that has less code. In fact, the example you give hints that polymorphism might be a good fit here, to make the code simpler and shorter still.
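A sketch of what that polymorphic version might look like (the types below are invented; the real ProcessFormatA/ProcessFormatB calls would live inside the overrides):

#include <cstddef>
#include <vector>

struct FormatProcessor
{
    virtual void process(int item) = 0;
    virtual ~FormatProcessor() {}
};

struct FormatAProcessor : FormatProcessor
{
    void process(int item) { /* ProcessFormatA(item) */ }
};

struct FormatBProcessor : FormatProcessor
{
    void process(int item) { /* ProcessFormatB(item) */ }
};

// The format check happens once, when the processor is chosen;
// the loop itself no longer branches on the format.
void processAll(const std::vector<int>& items, FormatProcessor& processor)
{
    for (std::size_t i = 0; i < items.size(); ++i)
        processor.process(items[i]);
}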

Should I use a function in a situation where it would be called an extreme number of times?

I have a section of my program that contains a large amount of math with some rather long equations. It's long and unsightly, and I wish to replace it with a function. However, this chunk of code is used an extreme number of times in my code and also requires a lot of variables to be initialized.
If I'm worried about speed, is the cost of calling the function and initializing the variables negligible here, or should I stick to directly coding it in each time?
Thanks,
-Faken
Most compilers are smart about inlining reasonably small functions to avoid the overhead of a function call. For functions big enough that the compiler won't inline them, the overhead for the call is probably a very small fraction of the total execution time.
Check your compiler documentation to understand its specific approach. Some older compilers required, or could benefit from, hints that a function is a candidate for inlining.
Either way, stick with functions and keep your code clean.
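To make that concrete, here's a toy version of the situation (the math is made up): a small helper taking several parameters that a modern optimizer will almost certainly inline at -O2, so the call itself adds nothing.

// A small, header-visible helper; once inlined, the "function call"
// disappears from the generated code entirely.
inline double term(double a, double b, double c, double d)
{
    return (a * b + c) / (d + 1.0);
}

double accumulate(const double* xs, int n, double c, double d)
{
    double total = 0.0;
    for (int i = 0; i < n; ++i)
        total += term(xs[i], 0.5 * xs[i], c, d);
    return total;
}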
Are you asking if you should optimize prematurely?
Code it in a maintainable manner first; if you then find that this section is a bottleneck in the overall program, worry about tuning it at that point.
You don't know where your bottlenecks are until you profile your code. Anything you can assume about your code hot spots is likely to be wrong. I remember once I wanted to optimize some computational code. I ran a profiler and it turned out that 70 % of the running time was spent zeroing arrays. Nobody would have guessed it by looking at the code.
So, first code clean, then run a profiler, then optimize the rough spots. Not earlier. If it's still slow, change algorithm.
Modern C++ compilers generally inline small functions to avoid function call overhead. As far as the cost of variable initialization, one of the benefits of inlining is that it allows the compiler to perform additional optimizations at the call site. After performing inlining, if the compiler can prove that you don't need those extra variables, the copying will likely be eliminated. (I assume we're talking about primitives, not things with copy constructors.)
The only way to answer that is to test it. Without knowing more about the proposed function, nobody can really say whether the compiler can/will inline that code or not. This may/will also depend on the compiler and compiler flags you use. Depending on the compiler, if you find that it's really a problem, you may be able to use different flags, a pragma, etc., to force it to be generated inline even if it wouldn't be otherwise.
Without knowing how big the function would be, and/or how long it'll take to execute, it's impossible to guess how much effect on speed it'll have if it isn't generated inline.
With both of those being unknown, none of us can really guess at how much effect moving the code into a function will have. There might be none, or little or huge.
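If you do measure and find the compiler refuses to inline something that matters, most compilers offer a non-standard way to insist. Which spelling applies depends on your toolchain, so treat the macro below as an illustration rather than a recipe.

#if defined(_MSC_VER)
  #define ALWAYS_INLINE __forceinline
#elif defined(__GNUC__)
  #define ALWAYS_INLINE inline __attribute__((always_inline))
#else
  #define ALWAYS_INLINE inline
#endif

// Stand-in for the "long equations" from the question.
ALWAYS_INLINE double lengthy_math(double x, double y)
{
    return x * x + 2.0 * x * y + y * y;
}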

profile-guided optimization (C)

Anyone know this compiler feature? It seems GCC supports it. How does it work? What is the potential gain? In which cases is it good? Inner loops?
(this question is specific, not about optimization in general, thanks)
It works by placing extra code to count the number of times each code path is taken. When you compile a second time, the compiler uses the knowledge it gained about the execution of your program, which it could only guess at before. There are a couple of things PGO can work toward:
Deciding which functions should be inlined or not depending on how often they are called.
Deciding how to place hints about which branch of an "if" statement should be predicted, based on the percentage of calls going one way or the other.
Deciding how to optimize loops based on how many iterations get taken each time that loop is called.
You never really know how much these things can help until you test it.
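With GCC the whole cycle is driven by two flags; -fprofile-generate and -fprofile-use are real GCC options, while the file names and workload below are just examples.

gcc -O2 -fprofile-generate -o app main.c     # build an instrumented binary
./app typical_workload.dat                   # run it on representative input; counters land in .gcda files
gcc -O2 -fprofile-use -o app main.c          # rebuild, letting GCC use the collected profile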
PGO gives about a 5% speed boost when compiling x264, the project I work on, and we have a built-in system for it (make fprofiled). It's a nice free speed boost in some cases, and probably helps more in applications that, unlike x264, are less made up of handwritten assembly.
Jason's advice is right on. The best speedups you are going to get come from "discovering" that you let an O(n^2) algorithm slip into an inner loop somewhere, or that you can cache certain computations outside of expensive functions.
Compared to the micro-optimizations that PGO can trigger, these are the big winners. Once you've done that level of optimization, PGO might be able to help. We never had much luck with it, though - the cost of the instrumentation was such that our application became unusably slow (by several orders of magnitude).
I like using Intel VTune as a profiler primarily because it is non-invasive compared to instrumenting profilers which change behaviour too much.
The fun thing about optimization is that speed gains are found in the unlikeliest of places.
It's also the reason you need a profiler, rather than guessing where the speed problems are.
I recommend starting with a profiler (gprof if you're using GCC) and just start poking around the results of running your application through some normal operations.