Code optimization - C++

When I'm trying to optimize my code, I often run into a dilemma:
I have an expression like this:
int x = 5 + y * y;
int z = sqrt(12) + y * y;
Is it worth making a new integer variable to store y * y for these two instances, or should I just leave the expressions alone?
int s = y * y;
int x = 5 + s;
int z = sqrt(12) + s;
If not, how many instances would it take for it to be worth it?

Trying to optimize your code most often means giving the compiler permission (through flags) to do its own optimization. Trying to do it yourself will, more often than not, either just be a waste of time (no improvement over the compiler) or make things worse.
In your specific example, I seriously doubt there is anything you can do to change the performance.

One of the older compiler optimisations is "common subexpression elimination" - in this case y * y is such a common subexpression.
It may still make sense to show a reader of the code that the expression only needs calculating once, but any compiler produced in the last ten years will calculate this perfectly fine without repeating the multiplication.
Trying to "beat the compiler on it's own game" is often futile, and certainly needs measuring to ensure you get a better result than the compiler. Adding extra variables MAY cause the compiler to produce worse code, because it gets "confused", so it may not help at all.
And ALWAYS when it comes to performance (or code size) results from varying optimizations: measure, measure again, and measure a third time to make sure you get the results you expect. It's not very easy to predict from looking at code which variant is faster and which is slower. But I'd definitely be surprised if y * y were calculated twice even at a low optimisation level in your compiler.
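For instance, a very rough timing harness (my sketch, not from the answer; the volatile read is there only to keep the compiler from folding the whole loop away) could look like this:
#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    volatile int y = 123;   // volatile keeps the loop from being folded away
    auto t0 = std::chrono::steady_clock::now();
    long long acc = 0;
    for (int i = 0; i < 100000000; ++i) {
        int x = 5 + y * y;
        int z = static_cast<int>(std::sqrt(12.0)) + y * y;
        acc += x + z;
    }
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("acc=%lld  elapsed=%lld ms\n", acc, static_cast<long long>(ms));
}
Compare the timings with and without the manual temporary, at the optimisation levels you actually ship with.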

You don't need a temporary variable:
int z = y * y;
int x = z + 5;
z = z + sqrt(12);
but the only way to be sure if this is (a) faster and (b) truly where you should focus your attention, is to use a profiler and benchmark your entire application.

Related

Is a boolean expression as onerous as branching with if or switch?

Often I convert some if statements into boolean expressions for code compactness. For instance, if I have something like
int foo(int x)
{
if (x > 5) return 100 + 5;
return 100;
}
I'll do it like
int foo(int x)
{
return 100 + (x > 5) * 5;
}
This case is very simple, so it's no problem; the thing is that when I have multiple tests, I can greatly simplify them this way (at the expense of readability, but that's a different issue).
So the question is whether that (x > 5) evaluation is as onerous as explicitly branching on it.
In both cases the expression (x > 5) has to be evaluated. And as demonstrated already, both versions compile to the same assembly even without any optimization enabled.
However, the Philosophy section of the C++ Core Guidelines has these two rules you would do well to pay heed to:
P.1: Express ideas directly in code
P.3: Express intent
Though these rules cannot be enforced in any way, adhering to them will make you adopt the version with the if statement.
Doing so will make things less onerous for those who have to maintain the code after you, and even for yourself a few months later.
You seem to be conflating C++ language constructs with patterns in the assembly. It may have been viable to reason about code on this level given the compilers of the late eighties or early nineties. At this point, however, compilers apply a lot of optimizations and transformations whose correctness or utility is not even obvious to the average programmer. A very simple example is the common beginner's mistake of assuming the following equivalences:
std::uint16_t a = ...;
a *= 2; // a multiplication in assembly
a *= 17; // ditto
a /= 3; // a division in assembly
They may then be surprised to find out that their compiler of choice translates these into the assembly equivalent of e.g.:
a <<= 1u;
a = (a << 4u) + a; // or even (a << 4u) | a if a < 16
a *= 43691u;
Note that the last transformation is only allowed if a is known to be a multiple of the divisor, so you may not see this kind of optimization all too often. How does it even work? In mathematical terms, uint16_t can be thought of as the residue class ring Z/(2^16)Z, and in this ring, there exists a multiplicative inverse for any element that is coprime to 2^16 (i.e. not divisible by 2). If d (e.g. 3) is coprime to 2, it has such an inverse, and then dividing by d is simply equivalent to multiplying by the inverse of d if the remainder is known to be zero. (I won't go into how this inverse can be calculated here.)
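As a quick sanity check of that claim (my example, not from the answer): 3 * 43691 = 131073 = 2 * 65536 + 1, so multiplying by 43691 and letting the result wrap modulo 2^16 undoes a multiplication by 3:
#include <cassert>
#include <cstdint>

int main() {
    std::uint16_t a = 3 * 7;   // a multiple of 3, so the remainder is zero
    std::uint16_t q = static_cast<std::uint16_t>(a * 43691u);   // wraps modulo 2^16
    assert(q == 7);   // same result as a / 3
}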
Here is another surprising optimization:
long arithsum(long n)
{
long result = 0;
for (long i=0; i<=n; ++i)
result += i;
return result;
}
GCC with -O3 rather mundanely translates this into an unrolled loop of additions. My version (9.0.0svn-something) of Clang, however, will pull a Gauss on you if you do this, and translate this into something like:
long arithsum(long n)
{
return (n * (n+1)) >> 1;
}
Anyway, the same caveats apply to if/switch etc. – while these are control flow structures, and so you'd think they correspond to branching, this may not be so. Likewise, what appears to be a non-branching operation might be translated to a branching operation if the compiler has an optimization rule under which this seems beneficial, or even if it is just unable to translate its own AST or intermediate representation into machine code without use of branching (on the given architecture).
TL;DR: Before you try to outsmart your compiler, figure out which assembly the compiler produces for the straightforward / readable code in the first place. If that assembly is good, there is no point in making the code more subtle / less readable.
Assuming that by onerous you mean the 1/0 result: sure, it might work in C/C++ thanks to the implicit bool-to-int conversion, but it might not in other languages. If that's what you want to achieve, why not use the ternary operator (? :), which also makes the code more readable:
int foo(int x) {
return (x > 5) ? (100 + 5) : 100;
}
Also read this Stack Overflow article: bool to int conversion.
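For what it's worth, the 1/0 result is not a quirk but a guarantee of the C++ standard, which you can document in code:
static_assert(static_cast<int>(true) == 1, "true converts to exactly 1");
static_assert(static_cast<int>(false) == 0, "false converts to exactly 0");
static_assert((5 > 3) == 1, "comparisons yield bool, which promotes to 1/0");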

How to replace __ieee754_exp_avx calls from source code or library?

__ieee754_exp_avx from libm*.so is being used intensively in a certain piece of source code, and I would like to replace it with a faster exp(x) implementation.
custom exp(x):
inline
double exp2(double x) {
x = 1.0 + x / 1024;
x *= x; x *= x; x *= x; x *= x;
x *= x; x *= x; x *= x; x *= x;
x *= x; x *= x;
return x;
}
What gcc flags should I use to make gcc automatically use a custom exp(x) implementation? If it is not possible with gcc, how can I do it?
https://codingforspeed.com/using-faster-exponential-approximation/
Don't. This function is slower than the native implementation of exp, and is an extremely poor approximation.
First, the speed. My benchmarking indicates that, depending on your compiler and CPU, this implementation of exp2 may be anywhere between 1.5x and 4.5x slower than the native exp. I'm not sure where the web site got their figures -- "360 times faster than the traditional exp" seems absurd, and is completely inconsistent with my tests.
Second, the accuracy. exp2(x) is reasonably close to exp(x) for x ≤ 1, but fails badly for larger values. For instance:
exp(1) = 2.7182818
exp2(1) = 2.7169557 (0.05% too low)
exp(2) = 7.3890561
exp2(2) = 7.3746572 (0.20% too low)
exp(5) = 148.41316
exp2(5) = 146.61829 (1.21% too low)
exp(10) = 22026.466
exp2(10) = 20983.411 (4.74% too low)
exp(20) = 4.851652e+08
exp2(20) = 4.0008755e+08 (17.5% too low)
While the web site you got this function from claims that there is "very good agreement for input smaller than 5", this is simply not true. A 1.21% difference (for x=5) is huge, and is likely to cause significant errors in any calculations using this approximation.
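These figures are easy to reproduce (a sketch; the approximation is renamed exp_approx here to avoid clashing with the standard exp2, and the squaring loop makes the underlying identity (1 + x/1024)^1024 ≈ e^x explicit):
#include <cmath>
#include <cstdio>
#include <initializer_list>

inline double exp_approx(double x) {
    x = 1.0 + x / 1024;
    for (int i = 0; i < 10; ++i) x *= x;   // ten squarings: (1 + x/1024)^1024
    return x;
}

int main() {
    for (double v : {1.0, 2.0, 5.0, 10.0, 20.0})
        std::printf("exp(%g) = %.8g   approx = %.8g   (%.2f%% low)\n",
                    v, std::exp(v), exp_approx(v),
                    100.0 * (1.0 - exp_approx(v) / std::exp(v)));
}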
Simply don't. That function looks way slower than the built-in code, and it's definitely not OK with respect to precision.
If you need SIMD (single instruction, multiple data) optimized exp functionality, i.e. you're not calculating a single value but a series of them, there are C libraries that do that for you. I'd like to highlight VOLK, the Vector Optimized Library of Kernels, a spin-off of the DSP-intense GNU Radio project.
It implements its own expf (single-precision exponential – if you're willing to accept errors, there's certainly no reason to lug double-precision floats around); here's how that compares on my machine:
RUN_VOLK_TESTS: volk_32f_expfast_32f(131071,1987)
a_avx completed in 60.119ms
a_sse4_1 completed in 62.052ms
u_avx completed in 60.376ms
u_sse4_1 completed in 62.131ms
generic completed in 2383.73ms
So, for 1987 iterations over a vector of 131071 elements, all the SIMD-optimized kernels were faster by a factor of 40 – that's pretty OK, but it's far away from the audacious 360x claim of the website you quote.
The source code of the expfast functions used can be found here.
In its core, that implementation relies on the floating point representation – which is a pretty good idea.
It admits it has a 7% error bound – that's quite a lot!
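To give a flavour of that representation trick (my sketch of Schraudolph's method from memory, not VOLK's actual code; the constants are approximate and the error is at the percent level): exp(x) = 2^(x / ln 2), and the exponent field of an IEEE 754 float is just an integer bitfield, so scaling x into exponent units and writing it straight into the bit pattern yields a crude exp:
#include <cstdint>
#include <cstdio>
#include <cstring>

static float fast_expf(float x) {
    // 2^23 / ln(2) ≈ 12102203 maps x into exponent-field units; the addend is
    // the IEEE 754 bias (127 << 23) minus a small error-correction constant.
    std::int32_t i = static_cast<std::int32_t>(12102203.0f * x) + 1064866805;
    float f;
    std::memcpy(&f, &i, sizeof f);   // reinterpret the bits as a float
    return f;
}

int main() {
    std::printf("fast_expf(1) = %f   (exp(1) = 2.718282)\n", fast_expf(1.0f));
}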
This is more of a workaround (a kludge, really):
Place exp2 definition in a .h file:
// exp2.h
#if !defined(__EXP2__H__)
#define __EXP2__H__
inline double exp2(double x) {
x = 1.0 + x / 1024;
x *= x; x *= x; x *= x; x *= x;
x *= x; x *= x; x *= x; x *= x;
x *= x; x *= x;
return x;
}
#endif //__EXP2__H__
Now, this file must end up included (whether directly or indirectly) in all the .c(xx) files that call exp - which might be a painful job if the existing codebase is large.
Then, when compiling the code, pass a -D (preprocessor definition) option to gcc (I don't know the minimum version that supports this form; v5.4.0 does), like this: -D'exp(X)=exp2(X)'.
Note: you no longer need libm.so.* (-lm) at link time (at least not as far as exp is concerned), so you can remove it. In fact, it would be a good idea to remove it (temporarily if you're using other math functions, permanently otherwise), so that any .c(xx) file that doesn't include exp2.h will trigger an exp-related undefined reference error from the linker; otherwise you might end up with a mixture of exp/exp2 calls in the code. (If you do use other math functions, add -lm back once all these errors have been resolved by including exp2.h in the appropriate .c(xx) files.)
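Expressed in source rather than on the command line, the definition amounts to this (illustrative only):
// In-source equivalent of -D'exp(X)=exp2(X)':
#include "exp2.h"
#define exp(X) exp2(X)

#include <cstdio>
int main() {
    std::printf("%f\n", exp(1.0));   // textually rewritten to exp2(1.0)
}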

Why was there no intrinsic access to the CPU's status register in the design of both C and C++?

In the case of the overflow flag, it would seem that access to this flag would be a great boon to cross-architecture programming. It would provide a safe alternative to relying on undefined behaviour to check for signed integer overflow such as:
if(a < a + 100) //detect overflow
I do understand that there are safe alternatives such as:
if(a > (INT_MAX - 100)) //detected overflow
However, it would seem that access to the status register or the individual flags within it is missing from both the C and C++ languages. Why was this feature not included or what language design decisions were made that prohibited this feature from being included?
Because C and C++ are designed to be platform independent. The status register is not.
These days, two's complement is universally used to implement signed integer arithmetic, but it was not always the case. One's complement or sign and absolute value used to be quite common. And when C was first designed, such CPUs were still in common use. E.g. COBOL distinguishes negative and positive 0, which existed on those architectures. Obviously overflow behaviour on these architectures is completely different!
By the way, you can't rely on undefined behaviour for detecting overflow, because reasonable compilers upon seeing
if(a < a + 100)
will write a warning and compile
if(true)
... (provided optimizations are turned on and the particular optimization is not turned off).
And note that you can't rely on the warning. The compiler will only emit the warning when the condition ends up as a plain true or false after equivalent transformations, but there are many cases where the condition will be modified in the presence of overflow without ending up as plain true/false.
Because C++ is designed as a portable language, i.e. one that compiles on many CPUs (e.g. x86, ARM, LSI-11/2, with devices like Game Boys, Mobile Phones, Freezers, Airplanes, Human Manipulation Chips and Laser Swords).
the available flags may differ widely across CPUs
even within the same CPU, flags may differ (take x86 scalar vs. vector instructions)
some CPUs may not even have the flag you desire at all
The question would have to be answered: should the compiler always deliver/enable that flag even when it can't determine whether it is used at all? That does not conform to the unwritten but sacrosanct law of both C and C++: pay only for what you use.
Because compilers would have to be forbidden from optimizing and e.g. reordering code in order to keep those flags valid
Example for the latter:
int x = 7;
x += z;
int y = 2;
y += z;
The optimizer may transform this into the following pseudo assembly code:
alloc_stack_frame 2*sizeof(int)
load_int 7, $0
load_int 2, $1
add z, $0
add z, $1
which in turn would be more similar to
int x = 7;
int y = 2;
x += z;
y += z;
Now if you query the registers in between:
int x = 7;
x += z;
if (check_overflow($0)) {...}
int y = 2;
y += z;
then after optimizing and disassembling you might end up with this:
int x = 7;
int y = 2;
x += z;
y += z;
if (check_overflow($0)) {...}
which is then incorrect.
More examples could be constructed, like what happens with an overflow during compile-time constant folding.
Side note: I remember an old Borland C++ compiler having a small API to read the current CPU registers. However, the argument above about optimization still applies.
On another side note, to check for overflow:
// desired expression: int z = x + y
would_overflow = x > MAX-y;
More concretely:
auto would_overflow = x > std::numeric_limits<int>::max()-y;
Or better, more generically:
auto would_overflow = x > std::numeric_limits<decltype(x+y)>::max()-y;
I can think of the following reasons.
By allowing access to the register flags, portability of the language across platforms would be severely limited.
The optimizer can change expressions drastically, and render your flags useless.
It would make the language more complex
Most compilers have a big set of intrinsic functions to do the most common operations (e.g. addition with carry) without resorting to flags; see the sketch after this list.
Most expressions can be rewritten in a safe way to avoid overflows.
You can always fall back to inline assembly if you have very specific needs
Access to status registers simply does not seem needed enough to justify a standardization effort.
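As an example of the intrinsics route mentioned above (my sketch; __builtin_add_overflow exists in GCC 5+ and Clang, and MSVC offers comparable intrinsics):
#include <cstdio>

int main() {
    int a = 2000000000, b = 200000000, sum;
    if (__builtin_add_overflow(a, b, &sum))   // returns true if the sum wrapped
        std::puts("overflow detected");
    else
        std::printf("sum = %d\n", sum);
}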

Compiler optimization causing the performance to slow down

I have one strange problem. I have following piece of code:
template<class index, class policy>
inline int CBase<index,policy>::func(const A& test_in, int* srcPtr, int* dstPtr)
{
int width = test_in.width();
int height = test_in.height();
double d = 0.0; //here is the problem
for(int y = 0; y < height; y++)
{
//Pointer initializations
//multiplication involving y
//ex: int z = someBigNumber*y + someOtherBigNumber;
for(int x = 0; x < width; x++)
{
//multiplication involving x
//ex: int z = someBigNumber*x + someOtherBigNumber;
if(someCondition)
{
// floating point calculations
}
*dstPtr++ = array[*srcPtr++];
}
}
}
The inner loop gets executed nearly 200,000 times and the entire function takes 100 ms to complete (profiled using AQTimer).
I found an unused variable, double d = 0.0;, outside the outer loop and removed it. After this change, the method suddenly takes 500 ms for the same number of executions (5 times slower).
This behavior is reproducible in different machines with different processor types.
(Core2, dualcore processors).
I am using VC6 compiler with optimization level O2.
Following are the other compiler options used:
-MD -O2 -Z7 -GR -GX -G5 -X -GF -EHa
I suspected the compiler optimizations and removed the optimization flag /O2. After that the function behaved normally, taking 100 ms as with the old code.
Could anyone throw some light on this strange behavior?
Why should compiler optimization slow down performance when I remove an unused variable?
Note: the assembly code (before and after the change) looked the same.
If the assembly code looks the same before and after the change, the error is somehow connected to how you time the function.
VC6 is buggy as hell. It is known to generate incorrect code in several cases, and its optimizer isn't all that advanced either. The compiler is over a decade old, and hasn't even been supported for many years.
So really, the answer is "you're using a buggy compiler. Expect buggy behavior, especially when optimizations are enabled."
I don't suppose upgrading to a modern compiler (or simply testing the code on one) is an option?
Obviously, the generated assembly cannot be the same, or there would be no performance difference.
The only question is where the difference lies. And with a buggy compiler, it may well be some completely unrelated part of the code that suddenly gets compiled differently and breaks. Most likely though, the assembly code generated for this function is not the same, and the differences are just so subtle you didn't notice them.
Declare width and height as const (unsigned) ints. (Unsigned should be used, since heights and widths are never negative.)
const int width = test_in.width();
const int height = test_in.height();
This helps the compiler with optimizing. With the values as const, it can place them in the code or in registers, knowing that they won't change. Also, it relieves the compiler of having to guess whether the variables are changing or not.
I suggest printing out the assembly code of the versions with the unused double and without. This will give you an insight into the compiler's thought process.

Compiler optimization of references

I often use references to simplify the appearance of code:
vec3f& vertex = _vertices[index];
// Calculate the vertex position
vertex[0] = startx + col * colWidth;
vertex[1] = starty + row * rowWidth;
vertex[2] = 0.0f;
Will compilers recognize and optimize this so it is essentially the following?
_vertices[index][0] = startx + col * colWidth;
_vertices[index][1] = starty + row * rowWidth;
_vertices[index][2] = 0.0f;
Yes. This is a basic optimization that any modern (and even ancient) compiler will make.
In fact, I don't think it's really accurate to call what you've written an optimisation, since the most straightforward way to translate it to assembly involves a store to the _vertices address, plus index, plus {0,1,2} (multiplied by the appropriate sizes, of course).
In general though, modern compilers are amazing. Almost any optimization you can think of will be implemented. You should always write your code in a way that emphasizes readability unless you know that one way has significant performance benefits for your code.
As a simple example, code like this:
int func() {
int x;
int y;
int z;
int a;
x = 5*5;
y = x;
z = y;
a = 100 * 100 * 100 * 100;
return z;
}
Will be optimized to this:
int func() {
return 25;
}
Additionally, the compiler will also inline the function so that no call is actually made. Instead, everywhere 'func()' appears will just be replaced with '25'.
This is just a simple example. There are many more complex optimizations a modern compiler implements.
Compilers will even do more clever stuff than this. Maybe they'll do
float* vertex = &_vertices[index][0];   // assuming contiguous float components
*vertex++ = startx + col * colWidth;
*vertex++ = starty + row * rowWidth;
*vertex++ = 0.0f;
Or even other variations …
Depending on the types of your variables, what you've described is a pessimization.
If _vertices is a class type, then your original form makes a single call to operator[] and reuses the returned reference. Your second form makes three separate calls. It can't necessarily be inferred that the returned reference will refer to the same object each time.
The cost of a reference is probably not material in comparison to repeated lookups in the original _vertices object.
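A concrete case (my example, not the answer's): std::map::operator[] default-inserts missing keys, so each call is potentially observable and repeated lookups cannot simply be merged:
#include <map>
#include <string>

int main() {
    std::map<int, std::string> m;
    std::string& s = m[7];   // one lookup (and possibly one insertion)
    s = "a";                 // no further lookups
    s += "b";
    // versus: m[7] = "a"; m[7] += "b";  -- two separate operator[] calls
}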
Except in limited cases, the compiler cannot optimize out (or pessimize in) extra function calls, unless the change introduced is not detectable by a conforming program. Often this requires visibility of an inline definition.
This is known as the "as if" rule. So long as the code behaves as if the language rules have been followed exactly, the implementation may make any optimizations it sees fit.
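For instance (my illustration), a compiler is free to replace this whole loop with a precomputed constant, because no conforming program can observe the difference:
#include <cstdio>

int main() {
    long long sum = 0;
    for (long long i = 0; i < 1000000; ++i)
        sum += i;                    // no observable effect per iteration
    std::printf("%lld\n", sum);      // commonly folded to the constant 499999500000
}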