How much can we trust compiler optimization? - C++

I wrote the following code to calculate the intersection points of two circles. The code is simple and fast enough. Not that I need more speed, but I can think of ways to optimize this code more aggressively. For example, h/d and 1.0/d are each calculated twice (let's forget about compiler optimizations for a moment).
const std::array<point,2> intersect(const circle& a,
                                    const circle& b) {
    std::array<point,2> intersect_points;
    const float d2 = squared_distance(a.center, b.center);
    const float d = std::sqrt(d2);
    const float r12 = std::pow(a.radious, 2);
    const float r22 = std::pow(b.radious, 2);
    const float l = (r12 - r22 + d2) / (2*d);
    const float h = std::sqrt(r12 - std::pow(l,2));
    const float termx1 = (1.0/d) * (b.center.x - a.center.x) + a.center.x;
    const float termx2 = (h/d)*(b.center.y - a.center.y);
    const float termy1 = (1.0/d) * (b.center.y - a.center.y) + a.center.y;
    const float termy2 = (h/d)*(b.center.x - a.center.x);
    intersect_points[0].x = termx1 + termx2;
    intersect_points[0].y = termy1 - termy2;
    intersect_points[1].x = termx1 - termx2;
    intersect_points[1].y = termy1 + termy2;
    return intersect_points;
}
My question is how far we can trust C++ compilers (g++ here) to understand the code and optimize the final binary. Can g++ avoid computing 1.0/d twice? More precisely, I want to know where the line is: when should we leave fine-tuning to the compiler, and when should we optimize by hand?

Popular compilers are pretty good at optimization nowadays.
It is very likely that the optimizer detects common expressions like 1.0/d, so don't worry about this one.
It is much less likely that the optimizer replaces std::pow(x, 2) with x * x.
This depends on the exact function you use, the compiler version you are using and the optimization command-line switches, so in this case you're better off writing x * x.
It's hard to say how far an optimizer can go and when you as a human must take over; this depends on how "smart" the optimizer is. But as a rule of thumb, the compiler can only optimize things it can deduce from the lines of code.
Example:
The compiler will know that this expression is always false: 1 == 2
But it can't know that this is always false as well: 1 == nextPrimeAfter(1), because for that it would need to know what the function nextPrimeAfter() does.
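To make the point concrete, here is one way the questioner's function could be tuned by hand (a sketch only, reusing the point/circle types and squared_distance from the question): the reciprocal 1.0/d and h/d are computed once and reused, and std::pow(x, 2) is spelled as x * x.
const std::array<point,2> intersect(const circle& a,
                                    const circle& b) {
    std::array<point,2> intersect_points;
    const float d2    = squared_distance(a.center, b.center);
    const float d     = std::sqrt(d2);
    const float inv_d = 1.0f / d;                 // computed once, reused below
    const float r12   = a.radious * a.radious;    // x * x instead of std::pow(x, 2)
    const float r22   = b.radious * b.radious;
    const float l     = (r12 - r22 + d2) * (0.5f * inv_d);
    const float h     = std::sqrt(r12 - l * l);
    const float h_d   = h * inv_d;                // h/d computed once
    const float dx    = b.center.x - a.center.x;
    const float dy    = b.center.y - a.center.y;
    const float termx1 = inv_d * dx + a.center.x;
    const float termx2 = h_d * dy;
    const float termy1 = inv_d * dy + a.center.y;
    const float termy2 = h_d * dx;
    intersect_points[0].x = termx1 + termx2;
    intersect_points[0].y = termy1 - termy2;
    intersect_points[1].x = termx1 - termx2;
    intersect_points[1].y = termy1 + termy2;
    return intersect_points;
}
With -O2, a modern g++ will most likely produce essentially the same code for both versions, which is exactly the question's point.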

Related

Subtract extremely small number from one in C++

I need to subtract an extremely small double number x from 1, i.e. to calculate 1-x in C++ for 0 < x < 1e-16. Because of machine-precision restrictions, for small enough x I will always get 1-x = 1. A simple solution is to switch from double to some more precise format like long. But because of some restrictions I can't switch to more precise number formats.
What is the most efficient way to get an accurate value of 1-x, where x is an extremely small double, if I can't use more precise formats and I need to store the result of the subtraction as a double? In practice I would like to avoid percentage errors greater than 1% (between the double representation of 1-x and its actual value).
P.S. I am using Rcpp to calculate quantiles of the standard normal distribution via the qnorm function. This function is symmetric around 0.5 and much more accurate for values close to 0. Therefore, instead of qnorm(1-(1e-30)) I would like to calculate -qnorm(1e-30), but to recover 1e-30 from 1-(1e-30) I need to deal with this precision problem. The restriction to double is due to the fact that, as far as I know, it is not safe to use more precise numeric formats in Rcpp. Note that my inputs to qnorm are sort of exogenous, so I can't derive 1-x from x during some preliminary calculations.
Simple solution is to switch from double to some more precise format like long [presumably, long double]
In that case you have no solution. long double is an alias for double on all modern machines. I stand corrected: gcc and icc still support an extended-precision long double; it is only cl (MSVC) that has treated long double as plain double for a long time.
So you have two solutions, and they're not mutually exclusive:
Use an arbitrary precision library instead of the built-in types. They're orders of magnitude slower, but if that's the best your algorithm can work with then that's that.
Use a better algorithm, or at least rearrange your equation variables, so that you don't have this need in the first place. Use distribution and cancellation rules to avoid the problem entirely. Without a more in-depth description of your problem we can't help you further, but I can tell you with certainty that double is more than enough to allow us to model airplane AI and flight parameters anywhere in the world.
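For the qnorm case in the question's P.S., one such rearrangement is to never form 1 - x at all: the standard normal quantile satisfies qnorm(1 - x) == -qnorm(x), and R's qnorm also takes a lower-tail flag, so the upper-tail quantile can be requested from the small probability directly. A sketch, assuming the input is available as x before 1 - x is ever computed, and assuming Rcpp's R::qnorm wrapper, which (as far as I know) mirrors Rmath's qnorm(p, mu, sigma, lower_tail, log_p):
#include <Rcpp.h>

// [[Rcpp::export]]
double upper_tail_quantile(double x) {
    // Instead of qnorm(1 - x), which collapses to qnorm(1) once x < ~1e-16,
    // request the upper-tail quantile of x directly (lower_tail = 0) ...
    return R::qnorm(x, /*mu=*/0.0, /*sigma=*/1.0, /*lower_tail=*/0, /*log_p=*/0);
    // ... or, equivalently, exploit the symmetry of the standard normal:
    //     return -R::qnorm(x, 0.0, 1.0, /*lower_tail=*/1, /*log_p=*/0);
}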
Rather than resorting to an arbitrary precision solution (which, as others have said, would potentially be extremely slow), you could simply create a class that extends the inherent precision of the double type by a factor of (approximately) two. You would then only need to implement the operations that you actually need: in your case, this may only be subtraction (and possibly addition), which are both reasonably easy to achieve. Such code will still be considerably slower than using native types, but likely much faster than libraries that use unnecessary precision.
Such an implementation is available (as open-source) in the QD_Real class, created some time ago by Yozo Hida (a PhD Student, at the time, I believe).
The linked repository contains a lot of code, much of which is likely unnecessary for your use-case. Below, I have shown an extremely trimmed-down version, which allows creation of data with the required precision, shows an implementation of the required operator-() and a test case.
#include <iostream>

class ddreal {
private:
    static inline double Plus2(double a, double b, double& err) {
        double s = a + b;
        double bb = s - a;
        err = (a - (s - bb)) + (b - bb);
        return s;
    }
    static inline void Plus3(double& a, double& b, double& c) {
        double t3, t2, t1 = Plus2(a, b, t2);
        a = Plus2(c, t1, t3);
        b = Plus2(t2, t3, c);
    }
public:
    double x[2];

    ddreal() { x[0] = x[1] = 0.0; }
    ddreal(double hi) { x[0] = hi; x[1] = 0.0; }
    ddreal(double hi, double lo) { x[0] = Plus2(hi, lo, x[1]); }

    ddreal& operator -= (ddreal const& b) {
        double t1, t2, s2;
        x[0] = Plus2(x[0], -b.x[0], s2);
        t1 = Plus2(x[1], -b.x[1], t2);
        x[1] = Plus2(s2, t1, t1);
        t1 += t2;
        Plus3(x[0], x[1], t1);
        return *this;
    }

    inline double toDouble() const { return x[0] + x[1]; }
};

inline ddreal operator-(ddreal const& a, ddreal const& b)
{
    ddreal retval = a;
    return retval -= b;
}

int main()
{
    double sdone{ 1.0 };
    double sdwee{ 1.0e-42 };
    double sdval = sdone - sdwee;
    double sdans = sdone - sdval;
    std::cout << sdans << "\n"; // Gives zero, as expected

    ddreal ddone{ 1.0 };
    ddreal ddwee{ 1.0e-42 };
    ddreal ddval = ddone - ddwee; // Can actually hold 1 - 1.0e-42 ...
    ddreal ddans = ddone - ddval;
    std::cout << ddans.toDouble() << "\n"; // Gives 1.0e-42

    ddreal ddalt{ 1.0, -1.0e-42 }; // Alternative initialization ...
    ddreal ddsec = ddone - ddalt;
    std::cout << ddsec.toDouble() << "\n"; // Gives 1.0e-42
    return 0;
}
Note that I have deliberately neglected error-checking and other overheads that would be needed for a more general implementation. Also, the code I have shown has been 'tweaked' to work more optimally on x86/x64 CPUs, so you may need to delve into the code at the linked GitHub, if you need support for other platforms. (However, I think the code I have shown will work for any platform that conforms strictly to the IEEE-754 Standard.)
I have tested this implementation, extensively, in code I use to generate and display the Mandelbrot Set (and related fractals) at very deep zoom levels, where use of the raw double type fails completely.
Note that, though you may be tempted to 'optimize' some of the seemingly pointless operations, doing so will break the system. Also, this must be compiled using the /fp:precise (or /fp:strict) flags (with MSVC), or the equivalent(s) for other compilers; using /fp:fast will break the code, completely.

Code optimization

When I'm trying to optimize my code, I often run into a dilemma.
I have an expression like this:
int x = 5 + y * y;
int z = sqrt(12) + y * y;
Is it worth making a new variable to store y * y for these two statements, or should I just leave them alone?
int s = y* y;
int x = 5 + s;
int z = sqrt(12) + s;
If not, how many occurrences would it take for it to be worth it?
Trying to optimize your code most often means giving the compiler permission (through flags) to do its own optimization. Trying to do it yourself will, more often than not, either just be a waste of time (no improvement over the compiler) or make things worse.
In your specific example, I seriously doubt there is anything you can do to change the performance.
One of the older compiler optimisations is "common subexpression elimination" - in this case y * y is such a common subexpression.
It may still make sense to show a reader of the code that the expression only needs calculating once, but any compiler produced in the last ten years will calculate this perfectly fine without repeating the multiplication.
Trying to "beat the compiler on it's own game" is often futile, and certainly needs measuring to ensure you get a better result than the compiler. Adding extra variables MAY cause the compiler to produce worse code, because it gets "confused", so it may not help at all.
And ALWAYS when it comes to performance (or code size) results from varying optimizations, measure, measure again, and measure a third time to make sure you get the results you expect. It's not very easy to predict from looking at code which is faster, and which is slower. But it'd definitely be surprised if y * y is calculated twice even with a low level of optimisation in your compiler.
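If you do want to check, measuring is straightforward. Below is a minimal timing sketch (the volatile sink, the iteration count and the steady_clock choice are mine, not from the question); with -O2 you should expect both variants to come out essentially identical, precisely because of common subexpression elimination.
#include <chrono>
#include <cmath>
#include <iostream>

volatile int sink;   // forces the result to be stored, so the work isn't deleted outright

template <typename F>
long long time_us(F f, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int y = 0; y < iterations; ++y)
        sink = f(y);                       // store to volatile so the call is not elided
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

int main() {
    auto repeated = [](int y) {            // y * y written out twice
        int x = 5 + y * y;
        int z = static_cast<int>(std::sqrt(12.0)) + y * y;
        return x + z;
    };
    auto hoisted = [](int y) {             // y * y hoisted into s by hand
        int s = y * y;
        int x = 5 + s;
        int z = static_cast<int>(std::sqrt(12.0)) + s;
        return x + z;
    };
    std::cout << "repeated y*y: " << time_us(repeated, 100000000) << " us\n";
    std::cout << "hoisted y*y:  " << time_us(hoisted, 100000000) << " us\n";
}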
You don't need a temporary variable:
int z = y * y;
int x = z + 5;
z = z + sqrt(12);
but the only way to be sure whether this is (a) faster and (b) truly where you should focus your attention is to use a profiler and benchmark your entire application.

Question about optimization in C++

I've read that the C++ standard allows optimization to the point where it can actually interfere with expected functionality. I'm talking about return value optimization, where you might actually have some logic in the copy constructor, yet the compiler optimizes the call away.
I find this somewhat bad, because someone who doesn't know about it might spend quite some time tracking down a bug that results from it.
What I want to know is whether there are any other situations where over-optimization from the compiler can change functionality.
For example, something like:
int x = 1;
x = 1;
x = 1;
x = 1;
might be optimized to a single x=1;
Suppose I have:
class A;
A a = b;
a = b;
a = b;
Could this possibly also be optimized? Probably not the best example, but I hope you know what I mean...
Eliding copy operations is the only case where a compiler is allowed to optimize to the point where side effects visibly change. Do not rely on copy constructors being called, the compiler might optimize away those calls.
For everything else, the "as-if" rule applies: The compiler might optimize as it pleases, as long as the visible side effects are the same as if the compiler had not optimized at all.
("Visible side effects" include, for example, stuff written to the console or the file system, but not runtime and CPU fan speed.)
It might be optimized, yes. But you still have some control over the process. For example, consider this code:
int x = 1;
x = 1;
x = 1;
x = 1;
volatile int y = 1;
y = 1;
y = 1;
y = 1;
Provided that neither x nor y is used below this fragment, VS 2010 generates this code:
int x = 1;
x = 1;
x = 1;
x = 1;
volatile int y = 1;
010B1004 xor eax,eax
010B1006 inc eax
010B1007 mov dword ptr [y],eax
y = 1;
010B100A mov dword ptr [y],eax
y = 1;
010B100D mov dword ptr [y],eax
y = 1;
010B1010 mov dword ptr [y],eax
That is, the optimizer strips all the lines involving x and leaves all four assignments to y. This is how volatile works, but the point is that you still have some control over what the compiler does for you.
Whether it is a class or a primitive type, it all depends on the compiler and on how sophisticated its optimization capabilities are.
Another code fragment for study:
class A
{
private:
    int c;
public:
    A(int b)
    {
        *this = b;
    }
    A& operator = (int b)
    {
        c = b;
        return *this;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    int b = 0;
    A a = b;
    a = b;
    a = b;
    return 0;
}
Visual Studio 2010's optimizer strips all of this code down to nothing: in a release build with "full optimization", _tmain does nothing and immediately returns zero.
This will depend on how class A is implemented, whether the compiler can see the implementation, and whether it is smart enough. For example, if operator=() in class A has side effects, optimizing the calls away would change the program's behavior and is not allowed.
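To make that concrete, here is a hedged variation on the code above: give operator=() a visible side effect and the assignments can no longer be stripped, because removing them would change the program's output.
#include <iostream>

class A
{
private:
    int c;
public:
    A(int b) { *this = b; }
    A& operator = (int b)
    {
        std::cout << "assign " << b << "\n";   // visible side effect
        c = b;
        return *this;
    }
};

int main()
{
    int b = 0;
    A a = b;   // prints "assign 0"
    a = b;     // prints "assign 0"
    a = b;     // prints "assign 0" -- all three calls must survive optimization
    return 0;
}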
Strictly speaking, optimization does not "remove calls to copies or assignments".
It converts one finite state machine into another finite state machine with the same external behaviour.
Now, if you repeatedly call
a = b; a = b; a = b;
what the compiler does depends on what operator= actually is.
If the compiler finds that a call has no chance of altering the state of the program (where the "state of the program" means "everything that lives longer than a scope and that the scope can access"), it will strip the call out.
If this cannot be "demonstrated", the call will stay in place.
Whatever the compiler does, don't worry too much about it: the compiler cannot (by contract) change the external logic of a program or of any part of it.
I don't know C++ that much, but I am currently reading Compilers: Principles, Techniques, and Tools.
Here is a snippet from their section on code optimization:
the machine-independent code-optimization phase attempts to improve intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power. For example, a straightforward algorithm generates the intermediate code (1.3), using an instruction for each operator in the tree representation that comes from the semantic analyzer. A simple intermediate code generation algorithm followed by code optimization is a reasonable way to generate good target code. The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is used only once to transmit its value to id1, so the optimizer can transform (1.3) into the shorter sequence (1.4).
(1.3)
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

(1.4)
t1 = id3 * 60.0
id1 = id2 + t1
All in all, I mean to say that code optimization happens at a much deeper level, and because your code is in such a simple state it doesn't affect what your code does.
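For reference, the fragment the book is describing corresponds roughly to its classic position = initial + rate * 60 example; in C++ terms the point is simply that the conversion is folded at compile time (a sketch of the idea, not the book's code):
float update_position(float initial, float rate) {
    // The integer literal 60 is converted to 60.0f once, at compile time,
    // so the generated code contains no run-time int-to-float conversion.
    return initial + rate * 60;
}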
I had some trouble with const variables and const_cast. The compiler produced incorrect results when the variable was used to calculate something else: the const variable was optimized away, and its old value was baked in as a compile-time constant. Truly "unexpected behavior". Okay, perhaps not ;) After all, modifying an object that was defined const is undefined behavior.
Example:
const int x = 2;
const_cast<int&>(x) = 3;  // undefined behaviour: modifying an object defined const
int y = x * 2;
cout << y << endl;        // typically prints 4 -- the compiler folded x as the constant 2

C++ / VS2008: Performance of Macros vs. Inline functions

All,
I'm writing some performance sensitive code, including a 3d vector class that will be doing lots of cross-products. As a long-time C++ programmer, I know all about the evils of macros and the various benefits of inline functions. I've long been under the impression that inline functions should be approximately the same speed as macros. However, in performance testing macro vs inline functions, I've come to an interesting discovery that I hope is the result of me making a stupid mistake somewhere: the macro version of my function appears to be over 8 times as fast as the inline version!
First, a ridiculously trimmed down version of a simple vector class:
class Vector3d
{
public:
    double m_tX, m_tY, m_tZ;

    Vector3d() : m_tX(0), m_tY(0), m_tZ(0) {}

    Vector3d(const double &tX, const double &tY, const double &tZ):
        m_tX(tX), m_tY(tY), m_tZ(tZ) {}

    static inline void CrossAndAssign ( const Vector3d& cV1, const Vector3d& cV2, Vector3d& cV )
    {
        cV.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY;
        cV.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ;
        cV.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX;
    }

#define FastVectorCrossAndAssign(cV1,cV2,cVOut) { \
    cVOut.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY; \
    cVOut.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ; \
    cVOut.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX; }
};
Here's my sample benchmarking code:
Vector3d right;
Vector3d forward(1.0, 2.2, 3.6);
Vector3d up(3.2, 1.4, 23.6);

clock_t start = clock();
for (long l = 0; l < 100000000; l++)
{
    Vector3d::CrossAndAssign(forward, up, right); // static inline version
}
clock_t end = clock();
std::cout << end - start << endl;

clock_t start2 = clock();
for (long l = 0; l < 100000000; l++)
{
    FastVectorCrossAndAssign(forward, up, right); // macro version
}
clock_t end2 = clock();
std::cout << end2 - start2 << endl;
The end result: With optimizations turned completely off, the inline version takes 3200 ticks, and the macro version 500 ticks... With optimization turned on (/O2, maximize speed, and other speed tweaks), I can get the inline version down to 1100 ticks, which is better but still not the same.
So I appeal to all of you: is this really true? Have I made a stupid mistake somewhere? Or are inline functions really this much slower -- and if so, why?
NOTE: After posting this answer, the original question was edited to remove this problem. I'll leave the answer as it is instructive on several levels.
The loops differ in what they do!
If we manually expand the macro, we get:
for (long l=0; l<100000000; l++)
right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;
Note the absence of curly brackets. So the compiler sees this as:
for (long l=0; l<100000000; l++)
{
    right.m_tX = forward.m_tY * up.m_tZ - forward.m_tZ * up.m_tY;
}
right.m_tY = forward.m_tZ * up.m_tX - forward.m_tX * up.m_tZ;
right.m_tZ = forward.m_tX * up.m_tY - forward.m_tY * up.m_tX;
Which makes it obvious why the second loop is so much faster.
Update: This is also a good example of why macros are evil :)
Please note that the inline keyword is only a hint to the compiler. If you turn optimizations off, the compiler may well not inline the function at all. You should go to Project Settings / C++ / Optimization and make sure Optimization is turned on. What settings have you used for "Inline Function Expansion"?
It also depends on the optimization and compiler settings. Also look for your compiler's support for an always-inline / force-inline declaration; inlining is as fast as a macro.
By default the keyword is only a hint; force-inline / always-inline (for the most part) gives the programmer back the control that the keyword originally intended.
Finally, gcc (for example) can be directed to inform you when such a function is not inlined as directed.
Apart from what Philipp mentioned, if you're using MSVC you can use __forceinline, or the gcc __attribute__((always_inline)) equivalent, to correct the problems with inlining.
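A sketch of what those spellings look like in practice (reusing the Vector3d class from the question; exact behaviour still depends on compiler version and optimization settings):
#if defined(_MSC_VER)
    #define FORCE_INLINE __forceinline
#elif defined(__GNUC__)
    #define FORCE_INLINE inline __attribute__((always_inline))
#else
    #define FORCE_INLINE inline
#endif

FORCE_INLINE void CrossAndAssignForced(const Vector3d& cV1, const Vector3d& cV2, Vector3d& cV)
{
    cV.m_tX = cV1.m_tY * cV2.m_tZ - cV1.m_tZ * cV2.m_tY;
    cV.m_tY = cV1.m_tZ * cV2.m_tX - cV1.m_tX * cV2.m_tZ;
    cV.m_tZ = cV1.m_tX * cV2.m_tY - cV1.m_tY * cV2.m_tX;
}
As far as I know, gcc's -Winline is the "inform you when not inlined" switch the previous answer alludes to.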
However, there is another possible problem lurking, using a macro will cause the parameters of the macro to be re-evaluated at each point, so if you call the macro like so:
FastVectorCrossAndAssign(getForward(), up, right);
it will expand to:
right.m_tX = getForward().m_tY * up.m_tZ - getForward().m_tZ * up.m_tY;
right.m_tY = getForward().m_tZ * up.m_tX - getForward().m_tX * up.m_tZ;
right.m_tZ = getForward().m_tX * up.m_tY - getForward().m_tY * up.m_tX;
Not what you want when you're concerned with speed :) (especially if getForward() isn't a lightweight function, or increments something on each call; if it is an inline function the compiler might be able to reduce the number of calls, provided nothing volatile is involved, but even that won't fix everything).

Compiler optimization of references

I often use references to simplify the appearance of code:
vec3f& vertex = _vertices[index];
// Calculate the vertex position
vertex[0] = startx + col * colWidth;
vertex[1] = starty + row * rowWidth;
vertex[2] = 0.0f;
Will compilers recognize and optimize this so it is essentially the following?
_vertices[index][0] = startx + col * colWidth;
_vertices[index][1] = starty + row * rowWidth;
_vertices[index][2] = 0.0f;
Yes. This is a basic optimization that any modern (and even ancient) compilers will make.
In fact, I don't think it's really accurate to call what you've written an optimisation, since the most straightforward way to translate it to assembly involves a store to the _vertices address, plus the index, plus {0, 1, 2} (multiplied by the appropriate sizes, of course).
In general though, modern compilers are amazing. Almost any optimization you can think of will be implemented. You should always write your code in a way that emphasizes readability unless you know that one way has significant performance benefits for your code.
As a simple example, code like this:
int func() {
    int x;
    int y;
    int z;
    int a;
    x = 5*5;
    y = x;
    z = y;
    a = 100 * 100 * 100 * 100;
    return z;
}
Will be optimized to this:
int func() {
    return 25;
}
Additionally, the compiler will also inline the function so that no call is actually made. Instead, everywhere 'func()' appears will just be replaced with '25'.
This is just a simple example. There are many more complex optimizations a modern compiler implements.
Compilers will even do more clever stuff than this. Maybe they'll do
float* vertex = &_vertices[index][0];   // treating the vec3f as three contiguous floats
*vertex++ = startx + col * colWidth;
*vertex++ = starty + row * rowWidth;
*vertex++ = 0.0f;
Or even other variations …
Depending on the types of your variables, what you've described is a pessimization.
If _vertices is a class type then your original form makes a single call to operator[] and reuses the returned reference. Your second form makes three separate calls. It can't necessarily be inferred that the returned reference will refer to the same object each time.
The cost of a reference is probably not material in comparison to repeated lookups into the original _vertices object.
Except in limited cases, the compiler cannot optimize out (or pessimize in) extra function calls, unless the change introduced is not detectable by a conforming program. Often this requires visibility of an inline definition.
This is known as the "as if" rule. So long as the code behaves as if the language rules have been followed exactly, the implementation may make any optimizations it sees fit.
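As a sketch of why that matters for class types (the LoggingVertices container below is hypothetical, not from the question): give operator[] a visible side effect and the as-if rule obliges the compiler to keep all three calls in the repeated-lookup form, so the two spellings are no longer interchangeable.
#include <array>
#include <cstddef>
#include <iostream>

using vec3f = std::array<float, 3>;

struct LoggingVertices {
    std::array<vec3f, 16> data{};
    vec3f& operator[](std::size_t i) {
        std::cout << "lookup " << i << "\n";   // visible side effect
        return data[i];
    }
};

int main() {
    LoggingVertices _vertices;
    const std::size_t index = 0;

    vec3f& vertex = _vertices[index];   // one "lookup" line
    vertex[0] = 1.0f;
    vertex[1] = 2.0f;
    vertex[2] = 0.0f;

    _vertices[index][0] = 1.0f;         // three more "lookup" lines --
    _vertices[index][1] = 2.0f;         // the compiler cannot merge these calls
    _vertices[index][2] = 0.0f;
}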