optimizing branching by re-ordering - c++

I have this sort of C function -- that is being called a zillion times:
void foo ()
{
if (/*condition*/)
{
}
else if(/*another_condition*/)
{
}
else if (/*another_condition_2*/)
{
}
/*And so on, I have 4 of them, but we can generalize it*/
else
{
}
}
I have a good test-case that calls this function, causing certain if-branches to be called more than the others.
My goal is to figure out the best way to arrange the if statements to minimize the branching.
The only way I can think of is to write to a file for every if branch taken, thereby creating a histogram. This seems tedious. Is there a better way, or better tools?
I am building it on AS3 Linux, using gcc 3.4; using oprofile (opcontrol) for profiling.

It's not portable, but many versions of GCC support a function called __builtin_expect() that can be used to tell the compiler what we expect a value to be:
if(__builtin_expect(condition, 0)) {
// We expect condition to be false (0), so we're less likely to get here
} else {
// We expect to get here more often, so GCC produces better code
}
The Linux kernel uses these as macros to make them more intuitive, cleaner, and more portable (i.e. redefine the macros on non-GCC systems):
#ifdef __GNUC__
# define likely(x) __builtin_expect(!!(x), 1)
# define unlikely(x) __builtin_expect(!!(x), 0)
#else
# define likely(x) (x)
# define unlikely(x) (x)
#endif
With this, we can rewrite the above:
if(unlikely(condition)) {
// we're less likely to get here
} else {
// we expect to get here more often
}
Of course, this is probably unnecessary unless you're aiming for raw speed and/or you've profiled and found that this is a problem.

Try a profiler (gprof?) - it will tell you how much time is spent. I don't recall if gprof counts branches, but if not, just call a separate empty method in each branch.

Running your program under Callgrind (with --branch-sim=yes) will give you branch information. Also, I hope you profiled and actually determined that this piece of code is problematic, as this seems like a micro-optimization at best. If it is able to, the compiler will generate a branch table from the if/else if/else chain, which replaces the chain of conditional branches with a single indirect jump (this depends on what the conditionals are, obviously). Even failing that, the branch predictor on your processor (assuming this is not for embedded work; if it is, feel free to ignore me) is pretty good at determining the target of branches.

It doesn't actually matter what order you change them round to, IMO. The branch predictor will store the most common branch and auto take it anyway.
That said, there is something you could try ... You could maintain a set of job queues and then, based on the if statements, assign each job to the correct queue before executing them one after another at the end.
This could be further optimised by using conditional moves and so forth (this does require assembler though, AFAIK). This could be done by conditionally moving a 1 into a register, which is initialised to 0, on condition a. Place the pointer value at the end of the queue and then decide whether to increment the queue counter by adding that conditional 1 or 0 to the counter.
Suddenly you have eliminated all branches and it becomes immaterial how many branch mispredictions there are. Of course, as with any of these things, you are best off profiling because, though it seems like it would provide a win ... it may not.
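A minimal sketch of the queue append described above, written in portable C++ rather than assembler. The names Job, JobQueue, push_if and classify are hypothetical, and whether the compiler really emits a flag-setting or conditional-move instruction instead of a branch depends on the compiler and its flags, so check the disassembly before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical job type; a real one would carry whatever the branch needs.
using Job = int;

// Fixed-capacity queue with one spare slot, so the unconditional write in
// push_if() stays in bounds even when the queue is logically full.
struct JobQueue {
    std::vector<Job> slots;
    std::size_t count = 0;

    explicit JobQueue(std::size_t capacity) : slots(capacity + 1) {}

    // Always store the job at the end, then advance the counter by 0 or 1.
    // A rejected job is simply overwritten by the next push.
    void push_if(bool cond, Job job) {
        // Debug-only capacity check; compiled out when NDEBUG is defined.
        assert(!cond || count + 1 < slots.size());
        slots[count] = job;
        count += static_cast<std::size_t>(cond);
    }
};

// Sorts values into two queues by sign without an if on the data itself.
void classify(const std::vector<int>& values, JobQueue& negatives, JobQueue& others) {
    for (int v : values) {
        const bool is_negative = v < 0;
        negatives.push_if(is_negative, v);
        others.push_if(!is_negative, v);
    }
}
```

Each queue can then be drained in its own tight loop, so the per-item decision no longer sits in front of the work.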

We use a mechanism like this:
// pseudocode
class ProfileNode
{
public:
inline ProfileNode( const char * name ) : m_name(name)
{ }
inline ~ProfileNode()
{
s_ProfileDict.Find(m_name).Value() += 1; // as if Value returns a nonconst ref
}
static DictionaryOfNodesByName_t s_ProfileDict;
const char * m_name;
};
And then in your code
void foo ()
{
if (/*condition*/)
{
ProfileNode node("Condition A"); // named, so it lives until the end of the branch
// ...
}
else if(/*another_condition*/)
{
ProfileNode node("Condition B");
// ...
} // etc..
else
{
ProfileNode node("Condition C");
// ...
}
}
void dumpinfo()
{
ProfileNode::s_ProfileDict.PrintEverything();
}
And you can see how it's easy to put a stopwatch timer in those nodes too and see which branches are consuming the most time.
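As a rough sketch of the stopwatch idea, the probe below counts hits and accumulates elapsed time per branch name. ScopedProbe, BranchStats and dump_stats are hypothetical names, a std::map stands in for the dictionary type above, and the code is not thread-safe.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <map>
#include <string>

// Accumulated statistics for one named branch.
struct BranchStats {
    long hits = 0;
    std::chrono::steady_clock::duration total{};
};

// Scoped probe: construction starts the stopwatch, destruction records
// one hit and the elapsed time under the probe's name.
class ScopedProbe {
public:
    explicit ScopedProbe(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}
    ~ScopedProbe() {
        BranchStats& s = stats()[name_];
        s.hits += 1;
        s.total += std::chrono::steady_clock::now() - start_;
    }
    ScopedProbe(const ScopedProbe&) = delete;
    ScopedProbe& operator=(const ScopedProbe&) = delete;

    // Function-local static avoids static-initialisation-order problems.
    static std::map<std::string, BranchStats>& stats() {
        static std::map<std::string, BranchStats> table;
        return table;
    }

private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// Example function with three branches, each instrumented by a named probe.
void foo(int x) {
    if (x < 0) {
        ScopedProbe probe("negative");
        // ... work for this branch ...
    } else if (x == 0) {
        ScopedProbe probe("zero");
    } else {
        ScopedProbe probe("positive");
    }
}

// Prints one line per branch: hit count and total time in nanoseconds.
void dump_stats() {
    for (const auto& entry : ScopedProbe::stats()) {
        const long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                                 entry.second.total).count();
        std::printf("%-10s hits=%ld time=%lld ns\n",
                    entry.first.c_str(), entry.second.hits, ns);
    }
}
```

The probe has to be a named local: an unnamed temporary is destroyed at the end of its own statement, so it would still count the hit but measure almost no time.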

Some counters may help. Once you have seen the counters, and if there are large differences between them, you can sort the conditions in decreasing order of frequency.
static int cond_1, cond_2, cond_3, ...
void foo (){
if (condition){
cond_1 ++;
...
}
else if(/*another_condition*/){
cond_2 ++;
...
}
else if (/*another_condition*/){
cond_3 ++;
...
}
else{
cond_N ++;
...
}
}
EDIT: a "destructor" can print the counters at the end of a test run:
void cond_print(void) __attribute__((destructor));
void cond_print(void){
printf( "cond_1: %6i\n", cond_1 );
printf( "cond_2: %6i\n", cond_2 );
printf( "cond_3: %6i\n", cond_3 );
printf( "cond_4: %6i\n", cond_4 );
}
I think it is enough to modify only the file that contains the foo() function.

Wrap the code in each branch into a function and use a profiler to see how many times each function is called.

Line-by-line profiling gives you an idea which branches are called more often.
Using something like LLVM could make this optimization automatically.

As a profiling technique, this is what I rely on.
What you want to know is: Is the time spent in evaluating those conditions a significant fraction of execution time?
The samples will tell you that, and if not, it just doesn't matter.
If it does matter, for example if the conditions include function calls that are on the stack a significant part of the time, what you want to avoid is spending much time in comparisons that are false. The way you tell this is, if you often see it calling a comparison function from, say, the first or second if statement, then catch it in such a sample and step out of it to see if it returns false or true. If it typically returns false, it should probably go farther down the list.

Related

Optimal place to call __syncthreads()

Given that the code is correct, is there some potential performance benefit in calling __syncthreads as late as possible, as early as possible, or does it not matter? Here's an example with comments that demonstrate the question:
__global__ void kernel(const float* data) {
__shared__ float shared_data[64];
if (threadIdx.x < 64) {
shared_data[threadIdx.x] = data[threadIdx.x];
}
// Option #1: Place the call to `__syncthreads()` here?
// Here is a lot of code that doesn't use `shared_data`.
// Option #2: Place the call to `__syncthreads()` here?
// Here is some code that uses `shared_data`.
}
What you are facing is a split between where the writes are made and where they should be visible to the entire block.
NVIDIA has recently introduced a mechanism for just that: arrive + wait.
You start with initializing a barrier:
void __mbarrier_init(__mbarrier_t* bar, uint32_t expected_count);
Then you arrive at your "option 1" position, with the bar token you initialized:
__mbarrier_token_t __mbarrier_arrive(__mbarrier_t* bar);
then you have your unrelated code, and then finally, wait for everyone to arrive at your "option 2" position:
bool __mbarrier_test_wait(__mbarrier_t* bar, __mbarrier_token_t token);
... but note that this call doesn't block, i.e you'll have to actively "wait".
Alternatively, you can use NVIDIA's C++ wrappers for this mechanism, presented here.
Note that this functionality is relatively new, with Compute Capability at least 7.0 required, and 8.0 or later recommended.

please solve this memory leak problem in c++

class A
{
public:
unique_ptr<int> m_pM;
A() { m_pM = make_unique<int>(5); };
~A() { };
public:
void loop() { while (1) {}; } // it means just activating some works. for simplifying
};
int main()
{
_CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
A a;
a.loop(); // if i use force quit while activating this function, it causes memory leak
}
Is there any way to avoid the memory leak when I force quit the program while this function is running?
a.loop() is an infinite loop so everything after that is unreachable, so the compiler is within its right to remove all code after the call to a.loop(). See the compiler explorer for proof.
I believe that outside of some niche and very rare scenarios truly infinite loops like the one you wrote here are pretty useless, since they literally mean “loop indefinitely”. So what’s the compiler supposed to do? In some sense it just postpones the destruction of your object until some infinite time in the future.
What you usually do is use break inside such loop and break when some condition is met. A simplified example: https://godbolt.org/z/sxr7eG4W1
Here you can see the unique_ptr::default_delete in the disassembly and also see that the compiler is actually checking the condition inside the loop.
Note: extern volatile is used to ensure the compiler doesn’t optimise away the flag, since it’s a simplified example and the compiler is smart enough to figure out that the flag is not changed. In real code I’d advise against using volatile. Just check for the stop condition. That’s it.
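A minimal sketch of the "check for the stop condition" advice, using std::atomic<bool> instead of volatile. The names Worker, g_stop_requested, g_live_workers and run_once are hypothetical, and the max_iterations parameter exists only so the sketch terminates on its own; in a real program another thread would set the flag. This covers cooperative shutdown only: if the process is killed outright, no destructors run and the operating system reclaims the memory.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical stop flag; in a real program another thread would set it.
std::atomic<bool> g_stop_requested{false};

// Number of Worker objects currently alive, so the cleanup can be observed.
int g_live_workers = 0;

class Worker {
public:
    Worker() : value_(std::make_unique<int>(5)) { ++g_live_workers; }
    ~Worker() { --g_live_workers; }

    // Runs until a stop is requested and returns the number of iterations done.
    long loop(long max_iterations) {
        assert(max_iterations > 0);
        long done = 0;
        while (!g_stop_requested.load()) {
            ++done;  // stands in for the real work
            if (done == max_iterations) {
                g_stop_requested.store(true);  // simulated stop request
            }
        }
        return done;
    }

private:
    std::unique_ptr<int> value_;
};

// Because loop() returns, the Worker destructor (and with it the
// unique_ptr's deleter) runs when w goes out of scope.
long run_once(long max_iterations) {
    g_stop_requested.store(false);
    Worker w;
    return w.loop(max_iterations);
}
```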

Use of goto in this very specific case... alternatives?

I have a question about the possible use of goto in C++ code: I know that goto should be avoided as much as possible, but in this very particular case I'm having some difficulty finding good alternatives that avoid multiple nested if-else blocks and/or additional binary flags...
The code is like the following one (only the relevant parts are reported):
// ... do initializations, variable declarations, etc...
while(some_flag) {
some_flag=false;
if(some_other_condition) {
// ... do few operations (20 lines of code)
return_flag=foo(input_args); // Function that can find an error, returning false
if(!return_flag) {
// Print error
break; // jump out of the main while loop
}
// ... do other more complex operations
}
index=0;
while(index<=SOME_VALUE) {
// ... do few operations (about 10 lines of code)
return_flag=foo(input_args); // Function that can find an error, returning false
if(!return_flag) {
goto end_here; // <- 'goto' statement
}
// ... do other more complex operations (including some if-else and the possibility to set some_flag to true or leave it false)
// ... get a "value" to be compared with a saved one in order to decide whether to continue looping or not
if(value<savedValue) {
// Do other operations (about 20 lines of code)
some_flag=true;
}
// ... handle 'index'
it++; // Increase number of iterations
}
// ... when going out from the while loop, some other operations must be done, at the moment no matter the value of some_flag
return_flag=foo(input_args);
if(!return_flag) {
goto end_here; // <- 'goto' statement
}
// ... other operations here
// ... get a "value" to be compared with a saved one in order to decide whether to continue looping or not
if(value<savedValue) {
// Do other operations (about 20 lines of code)
some_flag=true;
}
// Additional termination constraint
if(it>MAX_ITERATIONS) {
some_flag=false;
}
end_here:
// The code after end_here checks for some_flag, and executes some operations that must always be done,
// no matter if we arrive here due to 'goto' or due to normal execution.
}
}
// ...
Every time foo() returns false, no more operations should be executed, and the code should execute the final operations as soon as possible. Another requirement is that this code, mainly the part inside the while(index<=SOME_VALUE) shall run as fast as possible to try to have a good overall performance.
Is using a 'try/catch' block, with the try{} enclosing lots of code (while, actually, the function foo() can generate errors only when called, that is, at two distinct points in the code) a possible alternative? Is it better in this case to use separate 'try/catch' blocks?
Are there other better alternatives?
Thanks a lot in advance!
Three obvious choices:
1. Stick with goto.
2. Associate the cleanup code with the destructor of some RAII class. (You can probably write it as a lambda used as the deleter of a std::unique_ptr.)
3. Rename your function to foo_internal, and change it to just return. Then write the cleanup in a new foo function which calls foo_internal.
So:
return_t foo(Args...) {
const auto result = foo_internal(Args...);
// cleanup
return result;
}
In general, your function looks too long, and needs decomposing into smaller bits.
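The RAII option can be sketched with a std::unique_ptr whose deleter is a lambda, so the final operations run on every exit path. The names step, process and the log vector are hypothetical stand-ins for the question's foo() and its cleanup code; this is one possible shape, not the only one.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the question's foo(): fails on negative input.
bool step(int input) { return input >= 0; }

// Processes inputs until step() fails. The "final operations" live in the
// deleter of a unique_ptr, so they run on normal exit and on early return.
bool process(const std::vector<int>& inputs, std::vector<std::string>& log) {
    auto finalize = [&log](void*) { log.push_back("final operations"); };
    // The pointer only has to be non-null for the deleter to fire;
    // it is never dereferenced.
    std::unique_ptr<void, decltype(finalize)> guard(&log, finalize);

    for (int value : inputs) {
        if (!step(value)) {
            log.push_back("error");
            return false;  // guard's deleter still runs after this
        }
        log.push_back("ok");
    }
    return true;
}
```

Unlike the goto version, the cleanup also runs if something inside the loop throws.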
One way you can do it is to use another dummy loop and break like so
int state = FAIL_STATE;
do {
if(!operation()) {
break;
}
if(!other_operation()) {
break;
}
// ...
state = OK_STATE;
} while(false);
// check for state here and do necessary cleanups
That way you can avoid deep nesting levels in your code beforehand.
It's C++! Use exceptions for non-local jumps:
try {
if(some_result() < threshold) throw false;
}
catch(bool) {
handleErrors();
}
// Here follows mandatory cleanup for both successes and failures

Performance function calls per frame

I'm a game developer, so performance is really important to me.
My simple question:
I have a lot of checks (button clicks, collisions, whatever) running per frame, but I don't want to put everything in one function, so I would split them into separate functions and just call them:
void Tick()
{
//Check 1 ..... lots of code
//Check 2 ...... lots of code
//Check 3 ..... lots of code
}
to
void Tick()
{
funcCheck1();
funcCheck2();
funcCheck3();
}
void funcCheck1()
{
//check1 lots of code
}
void funcCheck2()
{
//check2 lots of code
}
void funcCheck3()
{
//check3 lots of code
}
Does the function call per frame have any performance impact? (not inlined)
Clearly the second version is much more readable.
If you don't pass any complex objects by value, the overhead of calling several functions instead of putting all the code in one function should be negligible (e.g. putting the function parameters on the stack, reserving space for the return value, and jumping to the beginning of the called function's code).
You cannot say for sure, particularly because the compiler may inline small functions automatically. The only way to be sure is to use a profiler and compare the two scenarios.

Use of goto for cleanly exiting a loop

I have a question about use of the goto statement in C++. I understand that this topic is controversial, and am not interested in any sweeping advice or arguments (I usually stray from using goto). Rather, I have a specific situation and want to understand whether my solution, which makes use of the goto statement, is a good one or not. I would not call myself new to C++, but would not classify myself as a professional-level programmer either. The part of the code which has generated my question spins in an infinite loop once started. The general flow of the thread in pseudocode is as follows:
void ControlLoop::main_loop()
{
InitializeAndCheckHardware(pHardware); //pHardware is a pointer given from outside
//The main loop
while (m_bIsRunning)
{
simulated_time += time_increment; //this will probably be += 0.001 seconds
ReadSensorData();
if (data_is_bad) {
m_bIsRunning = false;
goto loop_end;
}
ApplyFilterToData();
ComputeControllerOutput();
SendOutputToHardware();
ProcessPendingEvents();
while ( GetWallClockTime() < simulated_time ) {}
if ( end_condition_is_satisified ) m_bIsRunning = false;
}
loop_end:
DeInitializeHardware(pHardware);
}
The pHardware pointer is passed in from outside the ControlLoop object and has a polymorphic type, so it doesn't make much sense for me to make use of RAII and to create and destruct the hardware interface itself inside main_loop. I suppose I could have pHardware create a temporary object representing a sort of "session" or "use" of the hardware which could be automatically cleaned up at exit of main_loop, but I'm not sure whether that idea would make it clearer to somebody else what my intent is. There will only ever be three ways out of the loop: the first is if bad data is read from the external hardware; the second is if ProcessPendingEvents() indicates a user-initiated abort, which simply causes m_bIsRunning to become false; and the last is if the end-condition is satisfied at the bottom of the loop. I should maybe also note that main_loop could be started and finished multiple times over the life of the ControlLoop object, so it should exit cleanly with m_bIsRunning = false afterwards.
Also, I realize that I could use the break keyword here, but most of these pseudocode function calls inside main_loop are not really encapsulated as functions, simply because they would need to either have many arguments or they would all need access to member variables. Both of these cases would be more confusing, in my opinion, than simply leaving main_loop as a longer function, and because of the length of the big while loop, a statement like goto loop_end seems to read clearer to me.
Now for the question: Would this solution make you uncomfortable if you were to write it in your own code? It does feel a little wrong to me, but then I've never made use of the goto statement before in C++ code -- hence my request for help from experts. Are there any other basic ideas which I am missing that would make this code clearer?
Thanks.
Avoiding the use of goto is a pretty solid thing to do in object oriented development in general.
In your case, why not just use break to exit the loop?
while (true)
{
if (condition_is_met)
{
// cleanup
break;
}
}
As for your question: your use of goto would make me uncomfortable. The only reason that break seems less readable is that, by your own admission, you are not a strong C++ developer. To any seasoned developer of a C-like language, break will both read better and provide a cleaner solution than goto.
In particular, I simply do not agree that
if (something)
{
goto loop_end;
}
is more readable than
if (something)
{
break;
}
which literally says the same thing with built-in syntax.
With your single condition that causes the loop to exit early, I would simply use a break. There is no need for a goto; that's what break is for.
However, if any of those function calls can throw an exception, or if you end up needing multiple breaks, I would prefer an RAII-style container. This is the exact sort of thing destructors are for. You always perform the call to DeInitializeHardware, so...
// todo: add error checking if needed
class HardwareWrapper {
public:
HardwareWrapper(Hardware *pH)
: _pHardware(pH) {
InitializeAndCheckHardware(_pHardware);
}
~HardwareWrapper() {
DeInitializeHardware(_pHardware);
}
const Hardware *getHardware() const {
return _pHardware;
}
const Hardware *operator->() const {
return _pHardware;
}
const Hardware& operator*() const {
return *_pHardware;
}
private:
Hardware *_pHardware;
// if you don't want to allow copies...
HardwareWrapper(const HardwareWrapper &other);
HardwareWrapper& operator=(const HardwareWrapper &other);
};
// ...
void ControlLoop::main_loop()
{
HardwareWrapper hw(pHardware);
// code
}
Now, no matter what happens, you will always call DeInitializeHardware when that function returns.
UPDATE
If your main concern is that the while loop is too long, then you should aim to make it shorter. C++ is an OO language, and OO is about splitting things into small pieces and components; even in non-OO languages we generally still think a long method or loop should be broken into smaller ones to keep it short and easy to read. If a loop has 300 lines in it, choosing between break and goto doesn't really save you any time, does it?
UPDATE
I'm not against goto, but I wouldn't use it the way you do here; I would prefer to just use break. A developer who sees a break there generally knows it means jumping to the end of the while loop, and with m_bIsRunning = false next to it, he can tell within seconds that it actually exits the loop. Yes, a goto may save a few seconds in understanding it, but it may also make people nervous about your code.
The one place I can imagine using a goto would be to exit a two-level loop:
while(running)
{
...
while(running2)
{
if(bad_data)
{
goto loop_end;
}
}
...
}
loop_end:
Instead of using goto, you should use break; to escape loops.
There are several alternatives to goto: break, continue and return, depending on the situation.
However, you need to keep in mind that both break and continue are limited in that they only affect the innermost loop. return, on the other hand, is not affected by this limitation.
In general, if you use a goto to exit a particular scope, then you can refactor using another function and a return statement instead. It is likely that it will make the code easier to read as a bonus:
// Original
void foo() {
DoSetup();
while (...) {
for (;;) {
if () {
goto X;
}
}
}
X: DoTearDown();
}
// Refactored
void foo_in() {
while (...) {
for (;;) {
if () {
return;
}
}
}
}
void foo() {
DoSetup();
foo_in();
DoTearDown();
}
Note: if your function body cannot fit comfortably on your screen, you are doing it wrong.
Goto is not good practice for exiting from loop when break is an option.
Also, in complex routines, it is good to have only one exit logic (with cleaning up) placed at the end. Goto is sometimes used to jump to the return logic.
Example from QEMU vmdk block driver:
static int vmdk_open(BlockDriverState *bs, int flags)
{
int ret;
BDRVVmdkState *s = bs->opaque;
if (vmdk_open_sparse(bs, bs->file, flags) == 0) {
s->desc_offset = 0x200;
} else {
ret = vmdk_open_desc_file(bs, flags, 0);
if (ret) {
goto fail;
}
}
/* try to open parent images, if exist */
ret = vmdk_parent_open(bs);
if (ret) {
goto fail;
}
s->parent_cid = vmdk_read_cid(bs, 1);
qemu_co_mutex_init(&s->lock);
/* Disable migration when VMDK images are used */
error_set(&s->migration_blocker,
QERR_BLOCK_FORMAT_FEATURE_NOT_SUPPORTED,
"vmdk", bs->device_name, "live migration");
migrate_add_blocker(s->migration_blocker);
return 0;
fail:
vmdk_free_extents(bs);
return ret;
}
I'm seeing loads of people suggesting break instead of goto. But break is no "better" (or "worse") than goto.
The inquisition against goto effectively started with Dijkstra's "Go To Statement Considered Harmful" paper back in 1968, when spaghetti code was the rule and things like block-structured if and while statements were still considered cutting-edge. ALGOL 60 had them, but it was essentially a research language used by academics (cf. ML today); Fortran, one of the dominant languages at the time, would not get them for another 9 years!
The main points in Dijkstra's paper are:
1. Humans are good at spatial reasoning, and block-structured programs capitalise on that because program actions that occur near each other in time are described near each other in "space" (program code);
2. If you avoid goto in all its various forms, then it's possible to know things about the possible states of variables at each lexical position in the program. In particular, at the end of a while loop, you know that that loop's condition must be false. This is useful for debugging. (Dijkstra doesn't quite say this, but you can infer it.)
break, just like goto (and early returns, and exceptions...), reduces (1) and eliminates (2). Of course, using break often lets you avoid writing convoluted logic for the while condition, getting you a net gain in understandability -- and exactly the same applies for goto.
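To make the second point concrete, here is a small illustration (my own, not taken from Dijkstra's paper) of the same linear search written with and without break. The function names are hypothetical. Both return the same index; what differs is what a reader can conclude from the loop condition alone once the loop has ended.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// No break: once the loop ends, its condition is known to be false,
// so either i == v.size() or v[i] == target.
std::size_t find_structured(const std::vector<int>& v, int target) {
    std::size_t i = 0;
    while (i < v.size() && v[i] != target) {
        ++i;
    }
    return i;
}

// With break: after the loop, i < v.size() may still be true, so the loop
// condition alone no longer tells the reader which exit was taken.
// In exchange, the condition itself is simpler.
std::size_t find_with_break(const std::vector<int>& v, int target) {
    std::size_t i = 0;
    while (i < v.size()) {
        if (v[i] == target) {
            break;
        }
        ++i;
    }
    return i;
}
```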