If vs function call performance when handling different scenarios - c++

Inspired by my favorite Stack Overflow question, I'm trying to decide what is more 'convenient' when facing a classic problem in programming, i.e. executing one piece of code rather than another depending on the current scenario.
To make it brief, imagine an application that can be in 2 different states and has to execute some code according to that state.
(switching to a C++ context, which is what I'm more interested in)
The usual approach would be to store the status in a bool and use an if to decide which branch to execute.
A different approach is to wrap the 2 branches of code in 2 different functions and store the status in a function pointer: whenever the appropriate code has to be executed, the application just calls the pointed-to function; no if is involved.
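For reference, a minimal sketch of the two strategies (illustrative only; the names are made up and the actual benchmark is in the linked code):

#include <cstdint>

std::uint64_t counter = 0;

void onStateA() { counter += 1; }
void onStateB() { counter += 2; }

// Strategy 1: store the status in a bool and branch with an if.
bool stateIsA = true;
void tickIf()
{
    if (stateIsA)
        onStateA();
    else
        onStateB();
}

// Strategy 2: store the status in a function pointer and just call it.
void (*currentHandler)() = &onStateA;
void tickPtr()
{
    currentHandler();   // no if involved
}

// Switching status means flipping the bool or repointing the pointer:
//   stateIsA = false;        or        currentHandler = &onStateB;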
Now, in addition to lazily asking your opinions on which of the two strategies is more 'convenient', I approximated 'convenient' == 'fastest', so I set up a benchmark whose code you can read here, and also ran it on my machine compiling with MSVC 2017.
Each compiler returns different results (with my MSVC the function-pointer strategy is slower, with the online GCC it is faster), but what surprises me is that the function-pointer strategy is always slower when the status switches, whereas I was expecting the times to be very similar to the non-switching situation: why?
Thank you.
NOTE: increase the value std::uint64_t N = 100'000'000; to get significant results, as the online compiler would otherwise time out.

Related

Efficiently handling options

I have a function, which is executed hundreds of millions of times in a typical program run. This function performs a certain main task, but, if the user so desires, it should perform some slight variations of that main task. The obvious way to implement this would be something like this:
void f(bool do_option)
{
    // Do the first part
    if (do_option)
    {
        // Optional extra code
    }
    // Continue normal execution
}
However, this is not very elegant, since the value of do_option does not change during a program run. The if statement is unnecessarily being performed very often.
I solved it by turning do_option into a template parameter. I recompile the program every time I want to change it. Right now, this workflow is acceptable: I don't change these options very often and I am the sole user/developer. In the future though, both these things will change, so I want a single binary with command-line switches.
Question is: what is the best or most elegant way to deal with this situation? I don't mind having a large binary with many copies of f. I could create a map from a set of command-line parameters to a function to execute, or perhaps use a switch. But I don't want to maintain that map by hand -- there will probably be more than five such parameters.
By the way, I know somebody is going to say that I'm prematurely optimizing. That is correct, I tested it. In my specific case, the performance of runtime ifs is not much worse than my template construction. That doesn't mean I'm not interested if nicer solutions are possible.
On a modern (non-embedded) CPU, the branch predictor will be smart enough to recognize that the same path is taken every time, so an if statement is a perfectly acceptable (and readable) way of handling your situation.
On an embedded processor, compiler optimizations should be smart enough to get rid of most of the overhead of the if statement.
If you're really picky, you can use the template method that you mentioned earlier, and have an if statement select which version of the function to execute.
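As a rough sketch of that last suggestion (the dispatching code and the loop are illustrative, not from the question), the option can stay a template parameter while a single runtime if picks the instantiation once, outside the hot path:

template <bool DoOption>
void f()
{
    // Do the first part
    if (DoOption)            // compile-time constant: the untaken branch is compiled out
    {
        // Optional extra code
    }
    // Continue normal execution
}

int main()
{
    bool do_option = true;   // in reality parsed from a command-line switch

    // Select the instantiation once, outside the hot loop.
    void (*fn)() = do_option ? &f<true> : &f<false>;

    for (long i = 0; i < 100000000L; ++i)
        fn();                // the hot path contains no runtime check inside f
}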

Is ref-copying a compiler optimization, and can I avoid it?

I dislike pointers, and generally try to write as much code as I can using refs instead.
I've written a very rudimentary "vertical layout" system for a small Win32 app. Most of the Layout methods look like this:
void Control::DoLayout(int availableWidth, int &consumedYAmt)
{
    textYPosition = consumedYAmt;
    consumedYAmt += measureText(font, availableWidth);
}
They are looped through like so:
int innerYValue = 0;
for (const auto &control : controls) {
    control->DoLayout(availableWidth, innerYValue);
}
int heightOfControl = innerYValue;
It's not drawing its content here, just calculating exactly how much space this control will require (usually it's adding padding too, etc). This has worked great for me... in debug mode.
I found that in Release mode, I could suddenly see tangible, loggable issues where, when I'm looping through controls and calling DoLayout(), the consumedYAmt variable actually stays at 0 in the outside loop. The most annoying part is that if I put in breakpoints and walk through the code line by line, this stops happening and parts of it are properly updated by the inside "add" methods.
I'm wondering whether this is some compiler optimization that treats an int passed by reference as if it were passed by value as a memory optimization, or whether there's any possibility this actually works in a way different from how it seems.
I would give a minimum reproducible example, but I wasn't able to do so with a tiny commandline app. I get the sense that if this is an optimization, it only kicks in for larger code blocks and indirections.
EDIT: Again sorry for generally low information, but I'm now getting hints that this might be some kind of linker issue. I skipped one part of the inheritance model in my pseudocode: The calling class actually calls "Layout()", which is a non-virtual function on the root definition of the class. This function performs some implementation-neutral logic, and then calls DoLayout() with the same arguments. However, I'm now noticing that if I try adding a breakpoint to Layout(), Visual Studio claims that "The breakpoint will not be hit. No executable code of the debugger's target code type is associated with this line." I am able to add breakpoints to certain other lines, but I'm beginning to notice weird stepping logic where it refuses to go inside certain functions, like Layout. Already tried completely clearing the build folders and rebuilding. I'm going to have to keep looking, since I have to admit this isn't a lot to go on.
Also, random addition: The "controls" list is a vector containing shared_ptr objects. I hadn't suspected the looping mechanism previously but now I'm looking more closely.
"the consumedYAmt variable actually stays at 0"
The behavior you describe is typical for a specific optimization that's more due to the CPU than the compiler. I suspect you're logging consumedYAmt from another thread. The updates to consumedYAmt simply don't make it to that other thread.
This is legal for the CPU, because the C++ compiler didn't put in memory fences. And the compiler didn't put in fences because the variable isn't atomic.
In a small program without threads, this simply doesn't show up, nor does it show in debug mode.
Written by OP
Okay, eventually figured this one out. As simple as the issue was, pinning it down became difficult because of Release mode's debugger seemingly acting in inconsistent ways. When I changed tactic to adding Logging statements in lots of places, I found that my Control class had an "mShowing" variable that was uninitialized in its constructor. In debug mode, it apparently retained uninitialized memory which I guess made it "true" - but in release mode, my best analysis is that memory protections made it default to "false", which as it turns out skipped the main body of the DoLayout method most of the time.
Since through the process, responders were low on information to work with (certainly could've been easier if I posted a longer example), I instead simply upvoted each comment that mentioned uninitialized variables.
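For what it's worth, a minimal sketch of the kind of fix involved: giving the member a deterministic value (e.g. a C++11 in-class initializer, or setting it in every constructor) so debug and release builds behave the same. The surrounding class is obviously abbreviated:

class Control
{
    bool mShowing = true;    // in-class initializer: no longer depends on stack garbage
    int  textYPosition = 0;
    // ... rest of the control's state ...
};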

Make access to unsigned char thread-safe (atomic)

I am well aware that similar questions have been asked before, and I'm also aware that the operation is most likely not atomic at all, but I'm still asking out of idle curiosity and in hope that there's some way to MAKE it atomic.
The situation: Inside a struct, there's an unsigned char variable called Busy. (It can be moved out of there and stand on its own though).
This variable Busy is modified by two concurrent threads, one who sets bits on scheduling and one who clears them upon completion of the scheduled action.
Right now, the scheduling looks like this:
while (SEC.Busy & (1 << SEC.ReqID))
    if (++SEC.ReqID == 5) SEC.ReqID = 0;
sQuery.cData[2] = SEC.ReqID;
while the clearing of the bitmask looks like this:
SEC.Busy &= ~(1 << sQuery->cData[2]);
cData[2] basically carries the information about which slot is used over the network and comes back via a callback in another thread.
Now the question: How can I make sure that SEC.Busy (which is the only variable in this troubled situation) doesn't get torn apart by two threads trying to alter it at the same time without using a mutex, critical section or anything of the likes if possible?
I've also tried assigning the content of SEC.Busy to a local variable, alter that and then write the variable back, but unfortunately this operation also doesn't seem atomic.
I'm using Borland C++ Builder 6 at the moment, though a GCC solution would also be just fine.
Thank you very much.
Neither C++03 nor C99 says anything about atomicity at all. Assignment is atomic (= everybody sees either the old or the new value) on most platforms, but because it is not synchronized (= anybody may see the old value after they saw the new value for other updates) on any of them, it is useless anyway. Any other operation like increment, set bit and such is likely not even atomic.
C++11 defines the std::atomic template, which ensures both atomicity and synchronization, so that is what you need to use. Boost provides a compatible implementation for most C++03 compilers, and GCC has had built-in support since 4.2, which is being replaced by the more advanced support needed by C++11 in GCC 4.7.
The Windows API has had "Interlocked" operations for a long time. The Unix alternative required assembly (which several libraries provided) before the introduction of the GCC __sync functions.
There are three potential problems when accessing shared data from multiple threads. First, you might get a thread switch in the middle of a memory access that requires more than one bus cycle; this is called "tearing". Second, each processor has its own memory cache, and writing data to one cache does not automatically write to other caches, so a different thread can see stale data. Third, the compiler can move instructions around, so that another processor might see a later store to one variable without seeing a preceding store to another one.
Using a variable of type unsigned char almost certainly removes the first one, but it doesn't have any effect on the other two.
To avoid all three problems, use atomic<unsigned char> from C++11 or whatever synchronizing techniques your compiler and OS provide.
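For the C++11 route, here is a minimal sketch of what the Busy field and the two operations could look like (the struct name and helper functions are illustrative; the bit layout follows the question):

#include <atomic>

struct SchedulerState              // illustrative stand-in for the struct holding Busy
{
    std::atomic<unsigned char> Busy{0};
    unsigned char ReqID = 0;
};

// Scheduling thread: set the bit for the chosen slot.
void markBusy(SchedulerState &SEC, unsigned slot)
{
    SEC.Busy.fetch_or(static_cast<unsigned char>(1u << slot));
}

// Completion callback (other thread): clear the bit for that slot.
void markFree(SchedulerState &SEC, unsigned slot)
{
    SEC.Busy.fetch_and(static_cast<unsigned char>(~(1u << slot)));
}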

Prevent code being moved by GCC in benchmark code

I'm trying to fine tune some benchmark code we are using and am wondering if there is a way to communicate to GCC explicitly how to order certain bits of code. For example, given these blocks of code:
Pre
Start-Timer
Body
Stop-Timer
Post
I wish to tell GCC that each block must be kept in the above order without any instruction leaking into the other blocks. Ideally the timer would measure only Step 3; however, for practical reasons, measuring at least Step 3 and at most Steps 2-4 will suffice. I just want to make sure I'm not measuring any part of Step 1 or 5.
Currently I use a __sync_synchronize in the timer functions to issue a full memory fence. My hope is that, in addition to being a fence, this function is marked to prevent reordering.
Is this call to __sync_synchronize sufficient? Also logically, would the C++11 fence commands also suffice according to the text of the standard?
If the Start-Timer is a function call and the Stop-Timer is another function call, the optimizer has little opportunity to move the Body around, or spill material from Pre or Post into Body.
All the side-effects from Pre must be complete before the Start-Timer function is called (there's a sequence point there). All the side effects of Stop-Timer must be complete before executing Post (there's a sequence point there, too). So the compiler would have to have the code for Start-Timer and Stop-Timer visible to monkey with the generated code, spilling material around, and I'm not convinced it could do so even then.
So, in summary, I don't think you have to worry about it if you use function calls to start and stop the timer.
Make two versions of the code: one with the real code you want to measure, one with stubs. Measure both. Subtract. Then, I think, you needn't care what GCC does.
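To make the structure concrete, here is a rough sketch of the Pre / Start-Timer / Body / Stop-Timer / Post pattern using GCC-style barriers; this is an illustration, not the asker's actual harness:

#include <chrono>

// Opaque compiler barrier: the optimizer may not move memory accesses across it.
inline void compilerBarrier() { __asm__ __volatile__("" ::: "memory"); }

template <typename Body>
std::chrono::nanoseconds timeIt(Body body)
{
    // Pre work happens before this function is called.
    compilerBarrier();                              // or __sync_synchronize()
    auto start = std::chrono::steady_clock::now();  // Start-Timer
    compilerBarrier();

    body();                                         // Body: the code under test

    compilerBarrier();
    auto stop = std::chrono::steady_clock::now();   // Stop-Timer
    compilerBarrier();
    // Post work happens after this function returns.
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}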

How to track down a SIGFPE/Arithmetic exception

I have a C++ application cross-compiled for Linux running on an ARM CortexA9 processor which is crashing with a SIGFPE/Arithmetic exception. Initially I thought that it's because of some optimizations introduced by the -O3 flag of gcc but then I built it in debug mode and it still crashes.
I debugged the application with gdb, which catches the exception, but unfortunately the operation triggering the exception seems to also trash the stack, so I cannot get any detailed information about the place in my code that causes it. The only detail I could finally get was the operation triggering the exception (from the following piece of stack trace):
3 raise() 0x402720ac
2 __aeabi_uldivmod() 0x400bb0b8
1 __divsi3() 0x400b9880
The __aeabi_uldivmod() is performing an unsigned long long division and remainder, so I tried the brute-force approach and searched my code for places that might use that operation, but without much success, as it proved to be a daunting task. I also tried to check for potential divisions by zero, but again the code base is pretty large and checking every division operation is a cumbersome and somewhat dumb approach. So there must be a smarter way to figure out what's happening.
Are there any techniques to track down the causes of such exceptions when the debugger cannot do much to help?
UPDATE: After crunching on hex numbers, dumping memory and doing stack forensics (thanks Crashworks), I came across this gem in the ARM Compiler documentation (even though I'm not using the ARM Ltd. compiler):
Integer division-by-zero errors can be trapped and identified by re-implementing the appropriate C library helper functions. The default behavior when division by zero occurs is that when the signal function is used, or __rt_raise() or __aeabi_idiv0() are re-implemented, __aeabi_idiv0() is called. Otherwise, the division function returns zero.
__aeabi_idiv0() raises SIGFPE with an additional argument, DIVBYZERO.
So I put a breakpoint at __aeabi_idiv0 (__aeabi_ldiv0) et voilà!, I had my complete stack trace before it was completely trashed. Thanks everybody for their very informative answers!
Disclaimer: the "winning" answer was chosen solely and subjectively taking into account the weight of its suggestions into my debugging efforts, because more than one was informative and really helpful.
My first suggestion would be to open a memory window looking at the region around your stack pointer, and go digging through it to see if you can find uncorrupted stack frames nearby that might give you a clue as to where the crash was. Usually stack-trashes only burn a couple of the stack frames, so if you look upwards a few hundred bytes, you can get past the damaged area and get a general sense of where the code was. You can even look down the stack, on the assumption that the dead function might have called some other function before it died, and thus there might be an old frame still in memory pointing back at the current IP.
In the comments, I linked some presentation slides that illustrate the technique on a PowerPC — look at around #73-86 for a case study in a similar botched-stack crash. Obviously your ARM's stack frames will be laid out differently, but the general principle holds.
(Using the basic idea from Fedor Skrynnikov, but with compiler help instead)
Compile your code with -pg. This will insert calls to mcount and mcountleave() in every function. Do not link against the GCC profiling lib, but provide your own. The only thing you want to do in your mcount and mcountleave() is to keep a copy of the current stack, so just copy the top 128 bytes or so of the stack to a fixed buffer. Both the stack and the buffer will be in cache all the time so it's fairly cheap.
You can implement special guards in the functions that can cause the exception. A guard is a simple class: in the constructor of this class you put the name of the file and the line (__FILE__, __LINE__) into a file/array/whatever. The main condition is that this storage should be the same for all instances of this class (a kind of stack). In the destructor you remove this entry. To make it work you need to put the creation of this guard on the first line of each function and create it only on the stack. When you leave the current block the destructor will be called, so at the moment of your exception you will know from this improvised call stack which function is causing the problem.
Of course you may put the creation of this class under a debug condition.
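A rough sketch of the kind of guard described above (names are illustrative; in a multi-threaded program the storage would need to be thread-local):

#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

struct CallGuard
{
    // Shared "improvised call stack" of (file, line) entries.
    static std::vector<std::pair<const char*, int> > stack;

    CallGuard(const char *file, int line) { stack.push_back(std::make_pair(file, line)); }
    ~CallGuard()                          { stack.pop_back(); }

    static void dump()                    // e.g. call this from a SIGFPE handler
    {
        for (std::size_t i = 0; i < stack.size(); ++i)
            std::fprintf(stderr, "  %s:%d\n", stack[i].first, stack[i].second);
    }
};
std::vector<std::pair<const char*, int> > CallGuard::stack;

void riskyDivision(int a, int b)
{
    CallGuard guard(__FILE__, __LINE__);  // first line of every suspect function
    int r = a / b;                        // division that may raise SIGFPE when b == 0
    (void)r;
}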
Enable generation of core files, and open the core file with the debugger.
Since it uses raise() to raise the exception, I would expect that signal() should be able to catch it. Is this not the case?
Alternatively, you can set a conditional breakpoint at __aeabi_uldivmod to break when divisor (r1) is 0.