Prevent code being moved by GCC in benchmark code - c++

I'm trying to fine tune some benchmark code we are using and am wondering if there is a way to communicate to GCC explicitly how to order certain bits of code. For example, given these blocks of code:
1. Pre
2. Start-Timer
3. Body
4. Stop-Timer
5. Post
I wish to tell GCC that these blocks must be kept in the above order, without any instructions leaking from one block into another. Ideally the timer would measure only Step 3; however, for practical reasons, measuring at least Step 3 and at most Steps 2-4 will suffice. I just want to make sure I'm not measuring any part of Step 1 or Step 5.
Currently I use a __sync_synchronize in the timer functions to issue a full memory fence. My hope is that, in addition to emitting a fence, this function also prevents the compiler from reordering code around it.
Is this call to __sync_synchronize sufficient? And would the C++11 fences also suffice according to the text of the standard?
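For context, here is a minimal sketch of the timer setup being described, with the C++11 fence noted as the alternative the question asks about (the function names are illustrative, not the actual benchmark code):

#include <atomic>
#include <chrono>

using Clock = std::chrono::steady_clock;

Clock::time_point startTimer()
{
    __sync_synchronize();   // GCC builtin: full memory barrier
    return Clock::now();
}

Clock::time_point stopTimer()
{
    // The C++11 spelling of a full fence would be:
    // std::atomic_thread_fence(std::memory_order_seq_cst);
    __sync_synchronize();
    return Clock::now();
}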

If the Start-Timer is a function call and the Stop-Timer is another function call, the optimizer has little opportunity to move the Body around, or spill material from Pre or Post into Body.
All the side-effects from Pre must be complete before the Start-Timer function is called (there's a sequence point there). All the side effects of Stop-Timer must be complete before executing Post (there's a sequence point there, too). So the compiler would have to have the code for Start-Timer and Stop-Timer visible to monkey with the generated code, spilling material around, and I'm not convinced it could do so even then.
So, in summary, I don't think you have to worry about it if you use function calls to start and stop the timer.
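As a rough sketch of that structure under GCC (noinline is a belt-and-braces way to keep the timer calls opaque even when they live in the same translation unit; the names are placeholders):

#include <chrono>

using Clock = std::chrono::steady_clock;

// Ideally defined in another translation unit so the optimizer cannot see them.
__attribute__((noinline)) Clock::time_point startTimer() { return Clock::now(); }
__attribute__((noinline)) Clock::time_point stopTimer()  { return Clock::now(); }

double run()
{
    // Pre
    auto begin = startTimer();
    // Body: the code being measured
    auto end = stopTimer();
    // Post
    return std::chrono::duration<double>(end - begin).count();
}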

Make two versions of the code: one with the real code you want to measure, one with stubs. Measure both. Subtract. Then, I think, you needn't care what GCC does.
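A minimal sketch of that idea, assuming a steady-clock harness and placeholder workloads:

#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

void body() { /* the real code under test */ }
void stub() { /* same shape, but with the interesting work removed */ }

template <typename F>
double measure(F f, int iterations)
{
    auto t0 = Clock::now();
    for (int i = 0; i < iterations; ++i)
        f();
    auto t1 = Clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    const int n = 1000000;
    double real  = measure(body, n);
    double empty = measure(stub, n);
    std::printf("net: %.9f s\n", real - empty);   // harness overhead cancels out
}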

Related

If vs function call performance when handling different scenarios

Inspired by my favorite Stack Overflow question, I'm trying to decide what is more 'convenient' when facing a classic programming problem, i.e. executing one piece of code or another according to the current scenario.
To make it brief, imagine an application that can be in 2 different states and has to execute some code according to that state.
(switching to a C++ context, which is what I'm more interested in)
The usual approach would be to store the status in a bool and use an if to decide which branch to execute.
A different approach is to wrap the 2 branches of code in 2 different functions and store the status in a function pointer: whenever the appropriate code has to be executed, the application just calls through the pointer, no if involved.
Now, in addition to lazily asking your opinions on which of the two strategies is more 'convenient', I approximated 'convenient' == 'fastest', so I set up a benchmark whose code you can read here, and also ran it on my machine compiled with MSVC 2017.
Each compiler returns different results (with my MSVC the function-pointer strategy is slower, with online GCC it is faster), but what surprises me is that the function-pointer strategy is always slower when the status keeps switching, while I was expecting times very similar to the non-switching case: why?
Thank you.
NOTE: increase the value std::uint64_t N = 100'000'000; for significant results, as the online compiler would otherwise time out.
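For reference, a minimal sketch of the two strategies being compared (the names are illustrative, not the benchmark's actual code):

#include <cstdio>

void doA() { std::puts("A"); }
void doB() { std::puts("B"); }

int main()
{
    bool state = true;

    // Strategy 1: branch on a bool at every call site.
    if (state) doA(); else doB();

    // Strategy 2: keep a function pointer in sync with the state
    // and call through it, no if at the call site.
    void (*action)() = state ? &doA : &doB;
    action();

    // Switching the state means updating the pointer once...
    state = false;
    action = state ? &doA : &doB;
    action();   // ...and later calls stay branch-free at the call site.
}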

Efficiently handling options

I have a function, which is executed hundreds of millions of times in a typical program run. This function performs a certain main task, but, if the user so desires, it should perform some slight variations of that main task. The obvious way to implement this would be something like this:
void f(bool do_option)
{
    // Do the first part
    if (do_option)
    {
        // Optional extra code
    }
    // Continue normal execution
}
However, this is not very elegant, since the value of do_option does not change during a program run. The if statement is unnecessarily being performed very often.
I solved it by turning do_option into a template parameter. I recompile the program every time I want to change it. Right now, this workflow is acceptable: I don't change these options very often and I am the sole user/developer. In the future though, both these things will change, so I want a single binary with command-line switches.
Question is: what is the best or most elegant way to deal with this situation? I don't mind having a large binary with many copies of f. I could create a map from a set of command-line parameters to a function to execute, or perhaps use a switch. But I don't want to maintain that map by hand -- there will probably be more than five such parameters.
By the way, I know somebody is going to say that I'm prematurely optimizing. That is correct, I tested it. In my specific case, the performance of runtime ifs is not much worse than my template construction. That doesn't mean I'm not interested if nicer solutions are possible.
On a modern (non-embedded) CPU, the branch predictor will be smart enough to recognize that the same path is taken every time, so an if statement is a perfectly acceptable (and readable) way of handling your situation.
On an embedded processor, compiler optimizations should be smart enough to get rid of most of the overhead of the if statement.
If you're really picky, you can use the template method that you mentioned earlier, and have an if statement select which version of the function to execute.
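A minimal sketch of that last suggestion, reusing the f from the question (the main function and its loop are illustrative):

template <bool DoOption>
void f_impl()
{
    // Do the first part
    if (DoOption)   // constant-folded away when DoOption is false
    {
        // Optional extra code
    }
    // Continue normal execution
}

int main(int argc, char **)
{
    bool do_option = argc > 1;   // stand-in for a command-line switch

    // Select the instantiation once, then reuse it in the hot loop.
    void (*f)() = do_option ? &f_impl<true> : &f_impl<false>;
    for (long i = 0; i < 100000000; ++i)
        f();   // no per-iteration branch on do_option
}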

Is ref-copying a compiler optimization, and can I avoid it?

I dislike pointers, and generally try to write as much code as I can using refs instead.
I've written a very rudimentary "vertical layout" system for a small Win32 app. Most of the Layout methods look like this:
void Control::DoLayout(int availableWidth, int &consumedYAmt)
{
    textYPosition = consumedYAmt;
    consumedYAmt += measureText(font, availableWidth);
}
They are looped through like so:
int innerYValue = 0;
for (auto &control : controls) {
    control->DoLayout(availableWidth, innerYValue);
}
int heightOfControl = innerYValue;
It's not drawing its content here, just calculating exactly how much space this control will require (usually it's adding padding too, etc.). This has worked great for me... in debug mode.
I found that in Release mode, I could suddenly see tangible, loggable issues where, when I'm looping through controls and calling DoLayout(), the consumedYAmt variable actually stays at 0 in the outside loop. The most annoying part is that if I put in breakpoints and walk through the code line by line, this stops happening and parts of it are properly updated by the inside "add" methods.
I'm wondering whether this is some compiler optimization where the compiler assumes I'm only passing the ints by reference as a memory optimization, or whether there's any possibility this actually works differently from how it seems.
I would give a minimum reproducible example, but I wasn't able to do so with a tiny commandline app. I get the sense that if this is an optimization, it only kicks in for larger code blocks and indirections.
EDIT: Again sorry for generally low information, but I'm now getting hints that this might be some kind of linker issue. I skipped one part of the inheritance model in my pseudocode: The calling class actually calls "Layout()", which is a non-virtual function on the root definition of the class. This function performs some implementation-neutral logic, and then calls DoLayout() with the same arguments. However, I'm now noticing that if I try adding a breakpoint to Layout(), Visual Studio claims that "The breakpoint will not be hit. No executable code of the debugger's target code type is associated with this line." I am able to add breakpoints to certain other lines, but I'm beginning to notice weird stepping logic where it refuses to go inside certain functions, like Layout. Already tried completely clearing the build folders and rebuilding. I'm going to have to keep looking, since I have to admit this isn't a lot to go on.
Also, random addition: The "controls" list is a vector containing shared_ptr objects. I hadn't suspected the looping mechanism previously but now I'm looking more closely.
"the consumedYAmt variable actually stays at 0"
The behavior you describe is typical for a specific optimization that's more due to the CPU than the compiler. I suspect you're logging consumedYAmt from another thread. The updates to consumedYAmt simply don't make it to that other thread.
This is legal for the CPU because the C++ compiler didn't put in memory fences, and the compiler didn't put in fences because the variable isn't atomic.
In a small program without threads, this simply doesn't show up, nor does it show in debug mode.
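To make that suspicion concrete, a minimal sketch (the separate logging thread is an assumption about the asker's setup, not something shown in the question):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int plainCounter = 0;                 // data race: updates may never be seen elsewhere
std::atomic<int> atomicCounter{0};    // updates are guaranteed to become visible

int main()
{
    std::thread logger([] {
        for (int i = 0; i < 5; ++i) {
            std::printf("plain=%d atomic=%d\n", plainCounter, atomicCounter.load());
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    });

    for (int i = 0; i < 1000000; ++i) {
        ++plainCounter;
        atomicCounter.fetch_add(1, std::memory_order_relaxed);
    }
    logger.join();
}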
Written by OP
Okay, eventually figured this one out. As simple as the issue was, pinning it down became difficult because of Release mode's debugger seemingly acting in inconsistent ways. When I changed tactic to adding Logging statements in lots of places, I found that my Control class had an "mShowing" variable that was uninitialized in its constructor. In debug mode, it apparently retained uninitialized memory which I guess made it "true" - but in release mode, my best analysis is that memory protections made it default to "false", which as it turns out skipped the main body of the DoLayout method most of the time.
Since responders were low on information to work with throughout the process (it certainly could have been easier if I had posted a longer example), I simply upvoted each comment that mentioned uninitialized variables.
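For anyone hitting something similar, a minimal sketch of the kind of bug described (the class and member names come from the description above; the rest is illustrative):

// mShowing is never initialized, so its value is whatever happens to be in
// memory: "true" in the debug runs, "false" in release, which silently
// skipped the body of DoLayout.
class Control
{
public:
    Control() {}                       // bug: mShowing left uninitialized
    // Control() : mShowing(true) {}   // fix: initialize it explicitly

    void DoLayout(int availableWidth, int &consumedYAmt)
    {
        if (!mShowing)
            return;                       // taken most of the time in release builds
        consumedYAmt += availableWidth;   // stand-in for the real layout work
    }

private:
    bool mShowing;                     // or: bool mShowing = true;
};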

How to track down a SIGFPE/Arithmetic exception

I have a C++ application cross-compiled for Linux running on an ARM CortexA9 processor which is crashing with a SIGFPE/Arithmetic exception. Initially I thought that it's because of some optimizations introduced by the -O3 flag of gcc but then I built it in debug mode and it still crashes.
I debugged the application with gdb, which catches the exception, but unfortunately the operation triggering the exception seems to also trash the stack, so I cannot get any detailed information about the place in my code that causes it. The only detail I could finally get was the operation triggering the exception (from the following piece of stack trace):
3 raise() 0x402720ac
2 __aeabi_uldivmod() 0x400bb0b8
1 __divsi3() 0x400b9880
__aeabi_uldivmod() performs an unsigned long long division and remainder, so I tried the brute-force approach and searched my code for places that might use that operation, but without much success, as it proved to be a daunting task. I also tried to check for potential divisions by zero, but again the code base is pretty large and checking every division operation is a cumbersome and somewhat dumb approach. So there must be a smarter way to figure out what's happening.
Are there any techniques to track down the causes of such exceptions when the debugger cannot do much to help?
UPDATE: After crunching on hex numbers, dumping memory, and doing stack forensics (thanks Crashworks), I came across this gem in the ARM Compiler documentation (even though I'm not using the ARM Ltd. compiler):
Integer division-by-zero errors can be trapped and identified by re-implementing the appropriate C library helper functions. The default behavior when division by zero occurs is that when the signal function is used, or __rt_raise() or __aeabi_idiv0() are re-implemented, __aeabi_idiv0() is called. Otherwise, the division function returns zero.
__aeabi_idiv0() raises SIGFPE with an additional argument, DIVBYZERO.
So I put a breakpoint at __aeabi_idiv0 (__aeabi_ldiv0) and voilà! I had my complete stack trace before it got completely trashed. Thanks everybody for the very informative answers!
Disclaimer: the "winning" answer was chosen solely and subjectively, taking into account the weight of its suggestions in my debugging efforts, because more than one answer was informative and really helpful.
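For reference, a minimal sketch of re-implementing the division-by-zero helpers mentioned in the quoted documentation, which gives a convenient place to break, log, or abort; the signatures follow my reading of the ARM RTABI and should be checked against your toolchain:

#include <csignal>
#include <cstdio>

extern "C" int __aeabi_idiv0(int return_value)
{
    std::fprintf(stderr, "integer division by zero\n");
    std::raise(SIGFPE);        // or abort() to force a core dump right here
    return return_value;
}

extern "C" long long __aeabi_ldiv0(long long return_value)
{
    std::fprintf(stderr, "64-bit integer division by zero\n");
    std::raise(SIGFPE);
    return return_value;
}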
My first suggestion would be to open a memory window looking at the region around your stack pointer, and go digging through it to see if you can find uncorrupted stack frames nearby that might give you a clue as to where the crash was. Usually stack-trashes only burn a couple of the stack frames, so if you look upwards a few hundred bytes, you can get past the damaged area and get a general sense of where the code was. You can even look down the stack, on the assumption that the dead function might have called some other function before it died, and thus there might be an old frame still in memory pointing back at the current IP.
In the comments, I linked some presentation slides that illustrate the technique on a PowerPC — look at around #73-86 for a case study in a similar botched-stack crash. Obviously your ARM's stack frames will be laid out differently, but the general principle holds.
(Using the basic idea from Fedor Skrynnikov, but with compiler help instead)
Compile your code with -pg. This will insert calls to mcount and mcountleave() in every function. Do not link against the GCC profiling lib, but provide your own. The only thing you want to do in your mcount and mcountleave() is to keep a copy of the current stack, so just copy the top 128 bytes or so of the stack to a fixed buffer. Both the stack and the buffer will be in cache all the time so it's fairly cheap.
You can implement special guards in the functions that can cause the exception. A guard is a simple class: in the constructor you put the name of the file and the line (__FILE__, __LINE__) into a file/array/whatever. The main condition is that this storage should be shared by all instances of the class (a kind of stack). In the destructor you remove that entry. To make it work you need to put the creation of this guard on the first line of each function and create it only on the stack, so that the destructor is called when control leaves the current block. So at the moment of your exception you will know, from this improvised call stack, which function is causing the problem.
Of course you may put the creation of this class under a debug-only condition.
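A minimal sketch of such a guard (all names are illustrative; the stack below is not thread-safe as written):

#include <cstdio>
#include <vector>

// Improvised call stack: each guard pushes its location on construction
// and pops it on destruction.
struct CallGuard
{
    static std::vector<const char *> &stack()
    {
        static std::vector<const char *> s;
        return s;
    }
    explicit CallGuard(const char *where) { stack().push_back(where); }
    ~CallGuard() { stack().pop_back(); }

    static void dump()   // call this when the exception fires
    {
        for (const char *frame : stack())
            std::fprintf(stderr, "  %s\n", frame);
    }
};

#define STR2(x) #x
#define STR(x) STR2(x)
#define FUNCTION_GUARD() CallGuard guard_(__FILE__ ":" STR(__LINE__))

int divideStuff(int a, int b)
{
    FUNCTION_GUARD();   // first line of every function of interest
    return a / b;       // crashes here if b == 0, with the guard stack intact
}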
Enable generation of core files, and open the core file with the debugger.
Since it uses raise() to raise the exception, I would expect that signal() should be able to catch it. Is this not the case?
Alternatively, you can set a conditional breakpoint at __aeabi_uldivmod to break when divisor (r1) is 0.

Iterating without incurring the cost of IF statements

My question is based on curiosity and not whether there is another approach to the problem or not. It is a strange/interesting question, so please read it with an open mind.
Let's assume there is a game loop that is being called every frame. The game loop in turn calls several functions through a myriad of if statements. For example, if the user has set GUI to false, then don't refresh the GUI; otherwise call RefreshGui(). There are many other if statements in the loop, and they call their respective functions if they are true. Some are if/else-if.../else chains, which are more costly in the worst case. Even the functions that are called, if the if statement is true, have such logic: if the user wants raypicking on all objects, call FunctionA(); if the user wants raypicking on lights, call FunctionB(); ...; else call all functions. Hopefully you get the idea.
My point is, that is a lot of redundant if statements. So I decided to use function pointers instead. Now my assumption is that a function pointer is always going to be faster than an if statement. It is a replacement for if/else. So if the user wants to switch between two different camera modes, he/she presses the C key to toggle between them. The callback function for the keyboard changes the function pointer to the correct UpdateCamera function (in this case, the function pointer can point to either UpdateCameraFps() or UpdateCameraArcBall() )... you get the gist of it.
Now to the question itself. What if I have several update functions, all with the same signature (let's say void (*Update)(float time)), so that a function pointer can potentially point to any one of them? Then I have a vector which is used to store the pointers, and in my main update loop I go through the vector and call each update function. I can remove/add and even change the order of the updates without changing the underlying code. In the best case I might only be calling one update function, or in the worst case all of them, all with a very clean while loop and no nasty (potentially nested) if statements. I have implemented this part and it works great. I am aware that, with each iteration of the while loop responsible for iterating through the vector, I am checking whether itrBegin == itrEnd, more specifically while (itrBegin != itrEnd). Is there any way to avoid those if statements? Can I use branch prediction to my advantage (or am I taking advantage of it already without knowing)?
Again, please take the question as-is, i.e. I am not looking for a different approach (although you are more than welcome to give one).
EDIT: A few replies state that this is an unneeded premature optimization and I should not be focusing on it and that the if-statement(s) cost is minuscule compared to the work done in all the separate update functions. Very true, and I completely agree, but that was not the point of the question and I apologize if I did not make the question clearer. I did learn quite a few new things with all the replies though!
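For reference, a minimal sketch of the vector-of-update-functions dispatch described above (the function names are illustrative):

#include <vector>

void updatePhysics(float) { /* ... */ }
void updateCameraFps(float) { /* ... */ }
void refreshGui(float) { /* ... */ }

using UpdateFn = void (*)(float);

int main()
{
    std::vector<UpdateFn> updates = { &updatePhysics, &updateCameraFps };
    updates.push_back(&refreshGui);   // toggling a feature edits the vector, not the loop

    float dt = 0.016f;
    for (UpdateFn fn : updates)       // the only remaining test is the loop's end check
        fn(dt);
}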
there is a game loop that is being called every frame
That's a backwards way of describing it. A game loop doesn't run during a frame, a frame is handled in the body of the game loop.
my assumption is that a function pointer is always going to be faster than an if statement
Have you tested that? It's not likely to be true, especially if you're changing the pointer frequently (which really messes with the CPU's branch prediction).
Can I use branch prediction to my advantage (or am I taking advantage of it already without knowing)?
This is just wishful thinking. By having one indirect call inside your loop calling a bunch of different functions you are definitely working against the CPU branch prediction logic.
More specifically while (itrBegin != itrEnd). Is there any way to avoid those if statements?
One thing you could do in order to avoid conditionals as you iterate the chain of functions is to use a linked list. Then each function can call the next one unconditionally, and you simply install your termination logic as the last function in the chain (longjmp or something). Or you could hopefully just never terminate, include glSwapBuffers (or the equivalent for your graphics API) in the list and just link it back to the beginning.
First, profile your code. Then optimize the parts that need it.
"if" statements are the least of your concerns. Typically, with optimization, you focus on loops, I/O operations, API calls (e.g. SQL), containers/algorithms that are inefficient and used frequently.
Using function pointers to try to optimize is typically the worst thing you can do. You kill any chance at code readability and work against the CPU and compiler. I recommend using polymorphism or just use the "if" statements.
To me, this is asking for an event-driven approach. Rather than checking every time if you need to do something, monitor for the incoming request to do something.
I don't know if you consider it a deviation from your approach, but it would reduce the number of if...then statements to 1.
while( active )
{
    // check message queue
    if( messages )
    {
        // act on each message and update flags accordingly
    }
    // draw based on flags (whether or not they changed is irrelevant)
}
EDIT: Also I agree with the poster who stated that the loop should not be based on frames; the frames should be based on the loop.
If the conditions checked by your ifs are not changing during the loop, you could check them all once, and set a function pointer to the function you'd like to call in that case. Then in the loop call the function the function pointer points to.
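A minimal sketch of that suggestion (names are placeholders):

void updateCameraFps(float) { /* ... */ }
void updateCameraArcBall(float) { /* ... */ }

void runLoop(bool useFps, float dt, int frames)
{
    // Decide once, before the loop...
    void (*updateCamera)(float) = useFps ? &updateCameraFps : &updateCameraArcBall;

    for (int i = 0; i < frames; ++i)
        updateCamera(dt);   // ...no per-frame branch on useFps
}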