Complicated code for obvious operations - C++

Sometimes, mainly for optimization purposes, very simple operations are implemented as complicated and clumsy code.
One example is this integer initialization function:
void assign( int* arg )
{
    __asm__ __volatile__ ( "mov %%eax, %0" : "=m" (*arg));
}
Then:
int a;
assign ( &a );
But I honestly don't understand why it is written this way...
Have you seen any examples where there were real reasons to do so?

In the case of your example, I think it is a result of the fallacious assumption that writing code in assembly is automatically faster.
The problem is that the person who wrote this didn't understand WHY assembly can sometimes run faster. That is, you sometimes know more about what you are trying to do than the compiler does, and can use that knowledge to write lower-level code that performs better because it doesn't have to make the conservative assumptions the compiler must.
In the case of a simple variable assignment, I seriously doubt that holds true, and the code is likely to perform slower because it has the additional overhead of managing the assign function on the stack. Mind you, it won't be noticeably slower; the main cost here is code that is less readable and maintainable.
This is a textbook example of why you shouldn't implement optimizations without understanding WHY it is an optimization.

It seems that the intent of the assembly code was to ensure that the assignment to the *arg location is done every time, deliberately preventing the compiler from optimizing it away.
Usually the volatile keyword is used in C++ (and C) to tell the compiler that a value must not be kept in a register and reused from there (an optimization that makes access faster), because it can change asynchronously (modified by an external module, an assembly routine, an interrupt handler, etc.).
For instance, in a function
int a = 36;
g(a);
a = 21;
f(a);
in this case the compiler knows that the variable a is local to the function and is not modified outside it (no pointer to a is passed to any call, for instance), so it may keep a in a processor register instead of memory.
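For contrast, here is a minimal sketch of the kind of situation volatile is meant for; the memory-mapped "done" flag is my illustrative example, not the original poster's code:
volatile int done = 0;          // may be set asynchronously (e.g. by an interrupt handler)

void wait_for_completion()
{
    while (done == 0)           // volatile forces a fresh load from memory on every pass;
    {                           // without it, the compiler could cache 'done' in a register
        /* spin */              // and turn this into an infinite loop
    }
}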
In conclusion, that asm statement seems to be injected into the C++ code in order to prevent the compiler from performing such optimizations on that variable.

While there are several reasonable justifications for writing something in assembly, in my experience those are rarely the actual reason. Where I've been able to study the rationale, the real reasons boil down to:
Age: The code was written so long ago that it was the most reasonable option for dealing with compilers of the era. Typically, anything from before about 1990 can be justified, IMHO.
Control freak: Some programmers have trust issues with the compiler, but aren't inclined to investigate its actual behavior.
Misunderstanding: A surprisingly widespread and persistent myth is that anything written in assembly language is inherently more efficient than code produced by a "clumsy" compiler, with all its mysterious function entry/exit code, etc. Certainly a few compilers deserved this reputation.
To be "cool": When time and money are not factors, what better way to strut a programmer's significantly elevated hormone levels than some macho, preferably inscrutable, assembly language?

The example you give seems flawed, in that the assign() function is liable to be slower than directly assigning the variable: calling a function with arguments involves stack usage, whereas just writing int a = x is liable to compile to efficient code without touching the stack.
The only times I have benefited from using assembler is by hand optimising the assembler output produced by the compiler, and that was in the days where processor speeds were often in the single megahertz range. Algorithmic optimisation tends to give a better return on investment as you can gain orders of magnitudes in improvement rather than small multiples. As others have already said, the only other times you go to assembler is if the compiler or language doesn't do something you need to do. With C and C++ this is very rarely the case any more.
It could well be someone showing off that they know how to write some trivial assembler code, making the next programmer's job more difficult, and possibly as a half-assed measure to protect their own job. For the example given, the code is confusing, possibly slower than native C, less portable, and should probably be removed. Certainly if I see any inline assembler in any modern C code, I'd expect copious comments explaining why it is absolutely necessary.

Let compilers optimize for you. There's no possible way this kind of "optimization" will ever help anything... ever!

Related

Will compiler automatically optimize repeating code?

If I have some code with simple arithmetic that repeats several times, will the compiler automatically optimize it?
Here is the example:
someArray[index + 1] = 5;
otherArray[index + 1] = 7;
Does it make sense to introduce a variable nextIndex = index + 1 from the performance point of view (not from the point of view of readable and maintainable code), or will the compiler do such an optimization automatically?
You should not worry about trivial optimizations like this, because almost all compilers have been doing them for the last 10-15 years or more.
But if you have a really critical place in your code and want to get maximal running speed, then you can check the generated assembler code for these lines to be sure the compiler did this trivial optimization.
In some cases one extra arithmetic addition can be faster than saving the value to a register or to memory, and compilers know about this. You can make your code slower if you try to optimize trivial cases manually.
And you can use online services like https://gcc.godbolt.org to check the generated code (they support gcc, clang, and icc in several versions).
The old adage "suck it and see" seems to be appropriate here. We often forget that by far the most common processors are 4/8/16 bit micros with weird and wonderful application specific architectures and suitably odd vendor specific compilers to go with them. They frequently have compiler extensions to "aid" (or confuse) the compiler into producing "better" code.
One DSP from the early 2000s carried out 8 instructions per clock cycle in parallel in a pipeline (complex - "load+increment+multiply+add+round"). The precondition for this to work was that everything had to be preloaded into the registers beforehand. This meant that registers were obviously at a premium (as always). With this architecture it was frequently better to bin results to free up registers, and to use free slots that couldn't be parallelised (some instructions precluded the use of others in the same cycle) to recalculate them later. Did the compiler get this "right"? Yes, it often kept a result to reuse later, with the consequence that it stalled the pipeline due to lack of registers, which resulted in slower execution speed.
So you compiled it, examined it, profiled it, etc., so that when the compiler got it "right" you could go in and fix it. Without additional semantic information, which is not supported by the language, it is really hard to know what "right" is.
Conclusion: Suck it and see
Yes. It's a common optimization. https://en.wikipedia.org/wiki/Common_subexpression_elimination
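To illustrate what common subexpression elimination does, here is a rough conceptual sketch of the transformation (the compiler performs it on its intermediate representation, not on your source):
// What you write:
someArray[index + 1] = 5;
otherArray[index + 1] = 7;

// Roughly what the optimizer turns it into:
const int nextIndex = index + 1;   // the shared subexpression is computed once
someArray[nextIndex] = 5;
otherArray[nextIndex] = 7;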

C++ - put expression to register and use it in assembly

How can I evaluate an expression and put the result in a register, use it in inline assembly, and then use it again and store it somewhere?
For example:
EAX=a[i]; //Any expression that is valid in C++
__asm xor eax,0xFFFF //Do something with it
b[i]=EAX; //And then store it in some variable.
By the way, the reason is for performance.
Several compilers have compiler specific ways of accomplishing this. But it's almost never worth doing.
Here is a list of reasons why this is almost never worth doing:
The compiler will usually generate better code than you can write by hand.
Even if it doesn't, you can frequently tweak your code slightly to convince the compiler to write code that's at least as good as you could write, and still have your program remain portable.
The code that has the perceived performance issue is not actually critical to performance, because the program spends 0.01% of its time there.
You want your program to stay standard C++ and don't want to clutter it with tons of #ifdef guards.
The example you've shown is not very compelling.
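As an illustration of one such compiler-specific mechanism, here is a minimal sketch using GCC/Clang extended asm on x86; this is my example, not the questioner's code, and the "+r" constraint lets the compiler pick the register and keeps the value usable afterwards:
#include <cstddef>

void transform(const int* a, int* b, std::size_t i)
{
    int tmp = a[i];                  // any expression that is valid in C++
    __asm__ ("xorl $0xFFFF, %0"      // do something with it in assembly
             : "+r" (tmp));          // "+r": read-write operand kept in a register
    b[i] = tmp;                      // and then store the result from C++
}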

Increase Program Speed By Avoiding Functions? (C++)

When it comes to procedural programming, functional decomposition is ideal for maintaining complicated code. However, functions are expensive: adding to the call stack, passing parameters, storing return addresses... all of this takes extra time! When speed is crucial, how can I get the best of both worlds? I want a highly decomposed program without the unnecessary overhead introduced by function calls. I'm familiar with the keyword inline, but that seems to be only a suggestion to the compiler, and if used incorrectly by the programmer it will yield an even slower program. I'm using g++, so will the -O3 flag optimize away my functions that call functions that call functions?
I just wanted to know, if my concerns are valid and if there are any methods to combat this issue.
First, as always when dealing with performance issues, you should try and measure what are your bottlenecks with a profiler. The first thing coming out is usually not function calls and by a large margin. If you did this, then please read on.
Then, you can hint at which functions you want inlined by using the inline keyword. The compiler is usually smart enough to know what to inline and what not to inline (it can inline functions you forgot, and it may decline to inline some you marked if it thinks that won't help).
If (really) you still want to improve the performance of function calls and want to force inlining, some compilers allow you to do so (see this question). Please consider that massive inlining may actually decrease performance: your code will use a lot of memory and you may get more instruction cache misses than before (which is not good).
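For reference, a small sketch of the compiler-specific force-inline spellings (the GCC/Clang attribute and the MSVC keyword; illustrative only, check your compiler's documentation):
// GCC / Clang
__attribute__((always_inline)) inline int square(int x) { return x * x; }

// MSVC equivalent (commented out so the snippet compiles with one toolchain):
// __forceinline int square(int x) { return x * x; }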
If it's a specific piece of code you're worried about you can measure the time yourself. Just run it in a loop a large number of times and get the system time before and after. Use the difference to find the average time of each call.
As always the numbers you get are subjective, since they will vary depending on your system and compiler. You can compare the times you get from different methods to see which is generally faster, such as replacing the function with a macro. My guess is however you won't notice much difference, or at the very least it will be inconsequential.
If you don't know where the slowdown is follow J.N's advice and use a code profiler and optimise where it's needed. As a rule of thumb always pass large objects to functions by reference or const reference to avoid copy times.
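A minimal sketch of that measurement approach using std::chrono; work() is just an illustrative stand-in for the code being measured:
#include <chrono>
#include <iostream>

int work(int x) { return x * x + 1; }    // placeholder for the code under test

int main()
{
    constexpr int iterations = 10'000'000;
    volatile int sink = 0;               // keep results live so the loop isn't optimized away
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        sink = work(i);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    std::cout << "average per call: " << elapsed.count() / iterations << " ns\n";
}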
I highly doubt speed is that crucial, but my suggestion would be to use preprocessor macros.
For example
#define max(a,b) ( ((a) > (b)) ? (a) : (b) )
This would seem obvious to me, but I don't consider myself an expert in C++, so I may have misunderstood the question.
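If you do go this route, note that a function-style macro evaluates its arguments more than once (max(i++, j) increments i twice); here is a sketch of an inline function template that avoids that problem (the name max_of is mine, chosen to avoid clashing with the macro):
template <typename T>
inline const T& max_of(const T& a, const T& b)
{
    return (a > b) ? a : b;   // each argument expression is evaluated exactly once at the call site
}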

Performance penalty for "if error then fail fast" in C++?

Is there any performance difference (in C++) between the two styles of writing if-else, as shown below (logically equivalent code) for the likely1 == likely2 == true path (likely1 and likely2 are meant here as placeholders for some more elaborate conditions)?
// Case (1):
if (likely1) {
    Foo();
    if (likely2) {
        Bar();
    } else {
        Alarm(2);
    }
} else {
    Alarm(1);
}
vs.
// Case (2):
if (!likely1) {
    Alarm(1);
    return;
}
Foo();
if (!likely2) {
    Alarm(2);
    return;
}
Bar();
I'd be very grateful for information on as many compilers and platforms as possible (but with gcc/x86 highlighted).
Please note I'm not interested in readability opinions on those two styles, neither in any "premature optimisation" claims.
EDIT: In other words, I'd like to ask if the two styles are at some point considered fully-totally-100% equivalent/transparent by a compiler (e.g. bit-by-bit equivalent AST at some point in a particular compiler), and if not, then what are the differences? For any (with a preference towards "modern" and gcc) compiler you know.
And, to make it more clear, I too don't suppose that it's going to give me much of a performance improvement, and that it usually would be premature optimization, but I am interested in whether and how much it can improve/degrade anything?
It depends greatly on the compiler, and the optimization settings. If the difference is crucial - implement both, and either analyze the assembly, or do benchmarks.
I have no answers for specific platforms, but I can make a few general points:
The traditional answer, on older processors without branch prediction, is that the first is likely to be more efficient, since in the common case it takes fewer branches. But you seem interested in modern compilers and processors.
On modern processors, generally speaking, short forward branches are not expensive, whereas mispredicted branches may be expensive. By "expensive" I of course mean a few cycles.
Quite aside from this, the compiler is entitled to order basic blocks however it likes provided it doesn't change the logic. So when you write if (blah) {foo();} else {bar();}, the compiler is entitled to emit code like:
    evaluate condition blah
    jump_if_true else_label
    bar()
    jump endif_label
else_label:
    foo()
endif_label:
On the whole, gcc tends to emit things in roughly the order you write them, all else being equal. There are various things which make all else not equal, for example if you have the logical equivalent of bar(); return in two different places in your function, gcc might well coalesce those blocks, emit only one call to bar() followed by return, and jump or fall through to that from two different places.
There are two kinds of branch prediction - static and dynamic. Static means that the CPU instructions for the branch specify whether the condition is "likely", so that the CPU can optimize for the common case. Compilers might emit static branch predictions on some platforms, and if you're optimizing for that platform you might write code to take account of that. You can take account of it either by knowing how your compiler treats the various control structures, or by using compiler extensions. Personally I don't think it's consistent enough to generalize about what compilers will do. Look at the disassembly.
Dynamic branch prediction means that in hot code, the CPU keeps statistics for itself on how likely each branch is to be taken, and optimizes for the common case. Modern processors use various dynamic branch prediction techniques: http://en.wikipedia.org/wiki/Branch_predictor. Performance-critical code pretty much is hot code, and as long as the dynamic branch prediction strategy works, it very rapidly adapts to hot code. There might be certain pathological cases that confuse particular strategies, but in general anything in a tight loop with a bias towards taken/not-taken will be correctly predicted most of the time.
Sometimes it doesn't even matter whether the branch is correctly predicted or not, since some CPUs in some cases will include both possibilities in the instruction pipeline while it's waiting for the condition to be evaluated, and ditch the unnecessary option. Modern CPUs get complicated. Even much simpler CPU designs have ways of avoiding the cost of branching, though, such as conditional instructions on ARM.
Calls out of line to other functions will upset all such guesswork anyway. So in your example there may be small differences, and those differences may depend on the actual code in Foo, Bar and Alarm. Unfortunately it's not possible to distinguish between significant and insignificant differences, or to account for details of those functions, without getting into the "premature optimization" accusations you're not interested in.
It's almost always premature to micro-optimize code that isn't written yet. It's very hard to predict the performance of functions named Foo and Bar. Presumably the purpose of the question is to discern whether there's any common gotcha that should inform coding style. To which the answer is that, thanks to dynamic branch prediction, there is not. In hot code it makes very little difference how your conditions are arranged, and where it does make a difference that difference isn't as easily predictable as "it's faster to take / not take the branch in an if condition".
If this question was intended to apply to just one single program with this code proven to be hot, then of course it can be tested, there's no need to generalize.
It is compiler dependent. Check out the gcc documentation on __builtin_expect. Your compiler may have something similar. Note that you really should be concerned about premature optimization.
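For illustration, here is a minimal sketch of how __builtin_expect is commonly wrapped and applied to the shape of "case 2"; Foo and Alarm are the question's placeholders, defined trivially here so the snippet compiles:
void Alarm(int) {}                       // stand-ins for the question's functions
void Foo() {}

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

void process(bool likely1)
{
    if (UNLIKELY(!likely1)) {            // tell gcc the error path is the cold one
        Alarm(1);
        return;
    }
    Foo();                               // the common path can be laid out as straight fall-through
}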
The answer depends a lot on the type of "likely". If it is an integer constant expression, the compiler can optimize it away and both cases will be equivalent. Otherwise, it will be evaluated at runtime and can't be optimized much.
Thus, case 2 is generally more efficient than case 1.
As input from real-time embedded systems, which I work with, your "case 2" is often the norm for code that is safety- and/or performance critical. Style guides for safety-critical embedded systems often allow this syntax so a function can quit quickly upon errors.
Generally, style guides will frown upon the "case 2" syntax, but make an exception to allow several returns in one function either if
1) the function needs to quit quickly and handle the error, or
2) if one single return at the end of the function leads to less readable code, which is often the case for various protocol and data parsers.
If you are this concerned about performance, I assume you are using profile guided optimization.
If you are using profile guided optimization, the two variants you have proposed are exactly the same.
In any event, the performance of what you are asking about is completely overshadowed by performance characteristics of things not evident in your code samples, so we really can not answer this. You have to test the performance of both.
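For reference, a rough sketch of the gcc profile-guided-optimization workflow being referred to; file and workload names are illustrative:
g++ -O2 -fprofile-generate main.cpp -o app   # 1. build an instrumented binary
./app typical_workload.dat                   # 2. run it on representative input to collect a profile
g++ -O2 -fprofile-use main.cpp -o app        # 3. rebuild, letting gcc optimize with the collected profile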
Though I'm with everyone else here insofar as optimizing a branch makes no sense without having profiled and actually having found a bottleneck... if anything, it makes sense to optimize for the likely case.
Both likely1 and likely2 are likely, as their names suggest. Thus checking for the (also likely) combination of both being true first would likely be fastest:
if (likely1 && likely2)
{
    ...                                // happens most of the time
}
else
{
    if (likely1)
        ...
    if (likely2)
        ...
    else if (!likely1 && !likely2)     // happens almost never
        ...
}
Note that the second else is probably not necessary; a decent compiler will figure out that the last if clause cannot possibly be true if the previous one was, even if you don't explicitly tell it.

Should I use a function in a situation where it would be called an extreme number of times?

I have a section of my program that contains a large amount of math with some rather long equations. It's long and unsightly and I wish to replace it with a function. However, this chunk of code is used an extreme number of times in my code and also requires a lot of variables to be initialized.
If I'm worried about speed, is the cost of calling the function and initializing the variables negligible here, or should I stick to coding it in directly each time?
Thanks,
-Faken
Most compilers are smart about inlining reasonably small functions to avoid the overhead of a function call. For functions big enough that the compiler won't inline them, the overhead for the call is probably a very small fraction of the total execution time.
Check your compiler documentation to understand its specific approach. Some older compilers required, or could benefit from, hints that a function is a candidate for inlining.
Either way, stick with functions and keep your code clean.
Are you asking if you should optimize prematurely?
Code it in a maintainable manner first; if you then find that this section is a bottleneck in the overall program, worry about tuning it at that point.
You don't know where your bottlenecks are until you profile your code. Anything you can assume about your code hot spots is likely to be wrong. I remember once I wanted to optimize some computational code. I ran a profiler and it turned out that 70 % of the running time was spent zeroing arrays. Nobody would have guessed it by looking at the code.
So, first code clean, then run a profiler, then optimize the rough spots. Not earlier. If it's still slow, change algorithm.
Modern C++ compilers generally inline small functions to avoid function call overhead. As far as the cost of variable initialization, one of the benefits of inlining is that it allows the compiler to perform additional optimizations at the call site. After performing inlining, if the compiler can prove that you don't need those extra variables, the copying will likely be eliminated. (I assume we're talking about primitives, not things with copy constructors.)
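As a small sketch of the kind of function being discussed: with optimization enabled, a compiler will typically inline it at the call site and fold away the local temporaries (the function and its coefficients are illustrative, not the poster's actual math):
inline double evaluate(double x, double a, double b, double c)
{
    const double x2 = x * x;             // local temporaries like this one
    return a * x2 + b * x + c;           // are usually eliminated after inlining
}

// called from a hot loop, e.g.:
// for (std::size_t i = 0; i < n; ++i) out[i] = evaluate(in[i], 2.0, -1.0, 0.5);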
The only way to answer that is to test it. Without knowing more about the proposed function, nobody can really say whether the compiler can/will inline that code or not. This may/will also depend on the compiler and compiler flags you use. Depending on the compiler, if you find that it's really a problem, you may be able to use different flags, a pragma, etc., to force it to be generated inline even if it wouldn't be otherwise.
Without knowing how big the function would be, and/or how long it'll take to execute, it's impossible to guess how much effect on speed it'll have if it isn't generated inline.
With both of those being unknown, none of us can really guess at how much effect moving the code into a function will have. There might be none, or little or huge.