I am working with a Pin tool that simulates a processor, and I am having a very strange problem.
In the code snippet below, Router::Evaluate() is called many times. After it has been called several million times, strange behavior occurs intermittently: "_cycles != 0" evaluates to true in the first IF statement and to false in the immediately following IF statement, falling into the ELSE block.
void Router::Evaluate( )
{
    //---------debug print code---------
    if (_cycles != 0) {
        cout << "not a zero" << endl;
        if (_cycles != 0) cout << "---not a zero" << endl;
        else cout << "---zero" << endl;
    }
    //----------------------------------
    _cycles += _speedup;
    while ( _cycles >= 1.0 ) {
        _Step();
        _cycles -= 1.0;
    }
}
//class definition
class Router : public TimedModule {
protected:
    double _speedup; //initialized to 1.0
    double _cycles;  //initialized to 0.0
    ...
};
Below is the output of the code, where "not a zero" followed by "---zero" is printed from time to time, seemingly at random.
not a zero
---zero
(...some other output...)
not a zero
---zero
(...some other output...)
How could this possibly happen? This is not a multi-threaded program, so synchronization is not an issue. The program is compiled with gcc 4.2.4 and executed on 32-bit CentOS. Does anybody have a clue?
Thanks.
--added--
I should have mentioned this, too. I did try printing the value of _cycles each time, and it is always 0.0, which should not be possible...
I also used the following g++ options: "-MM -MG -march=i686 -g -ggdb -g1 -finline-functions -O3 -fPIC"
Unless you have a horrible compiler bug, I would guess something like this is happening:
_cycles has some small fraction remaining after the subtractions. As long as the compiler knows nothing else is changing its contents, it keeps the value in a higher-precision floating-point register. When it sees the I/O operation, it cannot be certain the value of _cycles won't be needed elsewhere, so it stores the register's contents back to the double-precision memory location, rounding off the extra bits that were in the register. The next check pessimistically assumes the value might have changed during the I/O operation and loads it back from memory, now without the extra bits that made it non-zero in the previous test.
As Daniel Fischer mentioned in a comment, using -ffloat-store inhibits the use of high-precision registers. If the problem goes away when using this option then the scenario I described is very likely. Check the assembly output of Router::Evaluate to be sure.
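To make the mechanism concrete, here is a small self-contained sketch (my own, not the asker's code) that mirrors the pattern above. It assumes a 32-bit x86 build using the x87 FPU (e.g. g++ -O2 -m32 -mfpmath=387) and may or may not reproduce on any given compiler version; the point is only to show how a value can test non-zero while held in an 80-bit register and then become 0.0 once spilled to a 64-bit memory slot around an I/O call.

#include <iostream>
#include <limits>

int main()
{
    // denorm_min() is the smallest positive double; a quarter of it is not
    // representable as a 64-bit double at all, but it fits in the wider
    // exponent range of an 80-bit x87 register.
    volatile double tiny = std::numeric_limits<double>::denorm_min();
    double cycles = tiny / 4.0;                   // may live in an x87 register

    if (cycles != 0.0) {                          // compares the register value
        std::cout << "not a zero" << std::endl;   // the call can force a spill to a
                                                  // 64-bit slot, rounding to exactly 0.0
        if (cycles != 0.0) std::cout << "---not a zero" << std::endl;
        else               std::cout << "---zero" << std::endl;
    } else {
        std::cout << "zero" << std::endl;         // e.g. with -ffloat-store or SSE math
    }
    return 0;
}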
I want to use this question to improve my general understanding of how computers work, since I'll probably never have the chance to study this in a deep and thorough manner. Sorry in advance if the question is silly and not generally useful, but I prefer to learn this way.
I am learning C++, and I found online a piece of code that implements the Newton-Raphson method for finding the root of a function. The code is pretty simple: as you can see, it first asks for the required tolerance, and if I give it a "decent" number it works fine. If instead, when it asks for the tolerance, I write something like 1e-600, the program breaks down immediately and the output is Enter starting value x: Failed to converge after 100 iterations.
The failed-convergence output should be a consequence of running the loop for more than 100 iterations, but that isn't the case here, since the loop doesn't even start. It looks like the program already knows it won't reach that level of tolerance.
Why does this happen? How can the program write that output even though it didn't try the loop 100 times?
Edit: It seems that anything meaningless (numbers that are too small, words) I write when it asks for the tolerance produces pnew = 0.25, and then the code runs 100 times and fails.
The code is the following:
#include <iostream>
#include <cmath>
using namespace std;

#define N 100 // Maximum number of iterations

int main() {
    double p, pnew;
    double f, dfdx;
    double tol;
    int i;

    cout << "Enter tolerance: ";
    cin >> tol;
    cout << "Enter starting value x: ";
    cin >> pnew;

    // Main Loop
    for (i = 0; i < N; i++) {
        p = pnew;
        // Evaluate the function and its derivative
        f = 4*p - cos(p);
        dfdx = 4 + sin(p);
        // The Newton-Raphson step
        pnew = p - f/dfdx;
        // Check for convergence and quit if done
        if (abs(p-pnew) < tol) {
            cout << "Root is " << pnew << " to within " << tol << "\n";
            return 0;
        }
    }
    // We reach this point only if the iteration failed to converge
    cerr << "Failed to converge after " << N << " iterations.\n";
    return 1;
}
1e-600 is not representable by most implementations of double. std::cin will fail to convert your input to double and will go into a failed state. This means that, unless you clear the error state, every subsequent std::cin extraction also fails automatically, without waiting for user input.
From cppreference (since C++17):
If extraction fails, zero is written to value and failbit is set. If extraction results in the value too large or too small to fit in value, std::numeric_limits<T>::max() or std::numeric_limits<T>::min() is written and failbit flag is set.
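For completeness, a minimal sketch (mine, not the original program) of how one might guard the extraction and recover from that failed state, using only standard stream facilities:

#include <iostream>
#include <limits>

int main() {
    double tol;
    std::cout << "Enter tolerance: ";
    if (!(std::cin >> tol)) {            // extraction failed, e.g. for 1e-600
        std::cin.clear();                // clear failbit so cin is usable again
        std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
        std::cerr << "Invalid tolerance.\n";
        return 1;
    }
    std::cout << "tol = " << tol << "\n";
    return 0;
}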
As mentioned, 1e-600 is not a valid double value. However, there's more to it than the value simply being out of range. What's likely happening is that 1 is scanned into tol, and then some portion of e-600 is scanned into pnew; that's why the program ends immediately instead of asking for input for pnew.
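If you want to see how your particular library splits that input, a quick diagnostic (hypothetical, not part of the original program) makes it visible:

#include <iostream>

int main() {
    double tol = -1.0, pnew = -1.0;
    std::cin >> tol;
    std::cout << "tol = "  << tol  << ", stream ok: " << std::cin.good() << "\n";
    std::cin >> pnew;
    std::cout << "pnew = " << pnew << ", stream ok: " << std::cin.good() << "\n";
    return 0;
}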
Like François said, you cannot exceed 2^64 when you work on a 64-bit machine (with a corresponding OS) and 2^32 on a 32-bit machine; you can use SSE, whose 128-bit registers hold four 32-bit values used for floating-point representation. In your program the test fails at every iteration, so your "if" is skipped and the function never returns before the loop ends.
Editor's clarification: When this was originally posted, there were two issues:
Test performance drops by a factor of three if seemingly inconsequential statement added
Time taken to complete the test appears to vary randomly
The second issue has been solved: the randomness only occurs when running under the debugger.
The remainder of this question should be understood as being about the first bullet point above, and in the context of running in VC++ 2010 Express's Release Mode with optimizations "Maximize Speed" and "favor fast code".
There are still some comments in the comment section talking about the second point, but they can now be disregarded.
I have a simulation where, if I add a simple if statement to the while loop that runs the actual simulation, the performance drops by about a factor of three, even though the if statement is almost never executed (and I run a lot of calculations in the while loop: n-body gravity for the solar system, among other things):
if (time - cb_last_orbital_update > 5000000)
{
    cb_last_orbital_update = time;
}
with time and cb_last_orbital_update both being of type double and defined at the beginning of the main function, where this if statement also sits. Usually there are computations I want to run there too, but it makes no difference if I delete them; the if statement as shown above has the same effect on performance.
The variable time is the simulation time; it increases in 0.001 steps at the beginning, so it takes a really long time until the if statement is executed for the first time (I also added a printed message to check whether it is being executed, and it is not, or at least only when it's supposed to be). Regardless, the performance drops by a factor of 3 even in the first minutes of the simulation, when the statement hasn't been executed even once. If I comment out the line
cb_last_orbital_update = time;
then it runs faster again, so it's not the check
time - cb_last_orbital_update > 5000000
either; it's definitely the simple act of writing the current simulation time into this variable.
Also, if I write the current time into another variable instead of cb_last_orbital_update, the performance does not drop. So this might be an issue with assigning a new value to a variable that is used to decide whether the "if" body should be executed? These are all shots in the dark, though.
Disclaimer: I am pretty new to programming, and sorry for all that text.
I am using Visual C++ 2010 Express, deactivating the stdafx.h precompiled header function didn't make a difference either.
EDIT: Basic structure of the program. Note that time is changed nowhere except at the end of the while loop (time += time_interval;). Also, cb_last_orbital_update occurs in only three places: its declaration/initialization, plus the two uses in the if statement that is causing the problem.
int main(void)
{
    ...
    double time = 0;
    double time_interval = 0.001;
    double cb_last_orbital_update = 0;

    F_Rocket_Preset(time, time_interval, ...);

    while (conditions)
    {
        Rocket[active].Stage[Rocket[active].r_stage].F_Update_Stage_Performance(time, time_interval, ...);
        Rocket[active].F_Calculate_Aerodynamic_Variables(time);
        Rocket[active].F_Calculate_Gravitational_Forces(cb_mu, cb_pos_d, time);
        Rocket[active].F_Update_Rotation(time, time_interval, ...);
        Rocket[active].F_Update_Position_Velocity(time_interval, time, ...);
        Rocket[active].F_Calculate_Orbital_Elements(cb_mu);

        F_Update_Celestial_Bodies(time, time_interval, ...);

        if (time - cb_last_orbital_update > 5000000.0)
        {
            cb_last_orbital_update = time;
        }

        Rocket[active].F_Check_Apoapsis(time, time_interval);
        Rocket[active].F_Status_Check(time, ...);
        Rocket[active].F_Update_Mass(time_interval, time);
        Rocket[active].F_Staging_Check(time, time_interval);

        time += time_interval;

        if (time > 3.1536E8)
        {
            std::cout << "\n\nBreak main loop! Sim Time: " << time << std::endl;
            break;
        }
    }
    ...
}
EDIT 2:
Here is the difference in the assembly code. On the left is the fast code with the line
cb_last_orbital_update = time;
commented out; on the right is the slow code with the line included.
EDIT 4:
So, I found a workaround that seems to work just fine so far:
int cb_orbit_update_counter = 1; // before the while loop

if (time - cb_orbit_update_counter * 5E6 > 0)
{
    cb_orbit_update_counter++;
}
EDIT 5:
While that workaround does work, it only works in combination with __declspec(noinline). I just removed those from the function declarations again to see if that changes anything, and it does.
EDIT 6: Sorry, this is getting confusing. I tracked the culprit for the lower performance when removing __declspec(noinline) down to this function, which is executed inside the if:
__declspec(noinline) std::string F_Get_Body_Name(int r_body)
{
    switch (r_body)
    {
        case 0:  return ("the Sun");
        case 1:  return ("Mercury");
        case 2:  return ("Venus");
        case 3:  return ("Earth");
        case 4:  return ("Mars");
        case 5:  return ("Jupiter");
        case 6:  return ("Saturn");
        case 7:  return ("Uranus");
        case 8:  return ("Neptune");
        case 9:  return ("Pluto");
        case 10: return ("Ceres");
        case 11: return ("the Moon");
        default: return ("unnamed body");
    }
}
The if also now does more than just increase the counter:
if (time - cb_orbit_update_counter * 1E7 > 0)
{
    F_Update_Orbital_Elements_Of_Celestial_Bodies(args);
    std::cout << F_Get_Body_Name(3) << " SMA: " << cb_sma[3]
              << "\tPos Earth: " << cb_pos_d[3][0] << " / " << cb_pos_d[3][1] << " / " << cb_pos_d[3][2]
              << "\tAlt: " << sqrt(pow(cb_pos_d[3][0] - cb_pos_d[0][0], 2)
                                 + pow(cb_pos_d[3][1] - cb_pos_d[0][1], 2)
                                 + pow(cb_pos_d[3][2] - cb_pos_d[0][2], 2)) << std::endl;
    std::cout << "Time: " << time << "\tcb_o_h[3]: " << cb_o_h[3] << std::endl;
    cb_orbit_update_counter++;
}
If I remove __declspec(noinline) from the function F_Get_Body_Name alone, the code gets slower. Similarly, if I remove the call to this function or add __declspec(noinline) back, the code runs faster. All other functions still have __declspec(noinline).
EDIT 7:
So I changed the switch function to
const std::string cb_names[] = {"the Sun", "Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn",
                                "Uranus", "Neptune", "Pluto", "Ceres", "the Moon", "unnamed body"}; // global definition
const int cb_number = 12; // global definition

std::string F_Get_Body_Name(int r_body)
{
    if (r_body >= 0 && r_body < cb_number)
    {
        return (cb_names[r_body]);
    }
    else
    {
        return (cb_names[cb_number]);
    }
}
and also made another part of the code slimmer. The program now runs fast without any __declspec(noinline). As ElderBug suggested, is this then an issue with the CPU instruction cache / the code getting too big?
I'd put my money on Intel's branch predictor. http://en.wikipedia.org/wiki/Branch_predictor
The processor assumes (time - cb_last_orbital_update > 5000000) to be false most of the time and fills the execution pipeline accordingly.
Once the condition (time - cb_last_orbital_update > 5000000) comes true, the misprediction penalty hits you, and you may lose 10 to 20 cycles.
if (time - cb_last_orbital_update > 5000000)
{
    cb_last_orbital_update = time;
}
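If you want to feel the cost of mispredictions in isolation, here is an unrelated, self-contained toy (not the asker's code). Timings vary by CPU, and a compiler that turns the branch into a conditional move will hide the effect, but when a real branch is emitted, the unsorted pass is typically several times slower than the sorted one:

#include <iostream>
#include <vector>
#include <algorithm>
#include <cstdlib>
#include <ctime>

long long sum_if_big(const std::vector<int>& v)
{
    long long sum = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] >= 128)            // hard to predict on random data
            sum += v[i];
    return sum;
}

int main()
{
    std::vector<int> data(20000000);
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = std::rand() % 256;

    std::clock_t t0 = std::clock();
    long long s1 = sum_if_big(data);              // random order
    std::clock_t t1 = std::clock();

    std::sort(data.begin(), data.end());          // now the branch is predictable

    std::clock_t t2 = std::clock();
    long long s2 = sum_if_big(data);
    std::clock_t t3 = std::clock();

    std::cout << "unsorted: " << double(t1 - t0) / CLOCKS_PER_SEC << " s\n"
              << "sorted:   " << double(t3 - t2) / CLOCKS_PER_SEC << " s\n"
              << "(checks: " << s1 << " == " << s2 << ")\n";
    return 0;
}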
Something is happening that you don't expect.
One candidate is an uninitialised variable hanging around somewhere, which has different values depending on the exact code that you are running. For example, you might have uninitialised memory that is sometimes a denormalised floating-point number and sometimes is not.
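To see why a stray denormal matters for speed, here is a self-contained toy (again, not your code). On many x86 CPUs an operation whose input or output is a subnormal double takes a microcode assist of a hundred cycles or more, unless flush-to-zero is enabled (e.g. via -ffast-math), so the two loops below usually differ wildly in run time:

#include <iostream>
#include <ctime>

// The operand is reloaded through a volatile pointer each iteration so the
// compiler can neither hoist the multiply nor fold the whole loop away.
double accumulate(volatile double* px, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; ++i)
        sum += *px * 0.5;
    return sum;
}

int main()
{
    const long n = 50000000;
    volatile double normal_value   = 1e-100;   // ordinary double
    volatile double denormal_value = 1e-310;   // subnormal double

    std::clock_t t0 = std::clock();
    double a = accumulate(&normal_value, n);
    std::clock_t t1 = std::clock();
    double b = accumulate(&denormal_value, n);
    std::clock_t t2 = std::clock();

    std::cout << "normal operand:   " << double(t1 - t0) / CLOCKS_PER_SEC << " s\n"
              << "denormal operand: " << double(t2 - t1) / CLOCKS_PER_SEC << " s\n"
              << "(results: " << a << ", " << b << ")\n";
    return 0;
}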
I think it should be clear that your code doesn't do what you expect it to do. So try debugging your code, compile with all warnings enabled, and make sure you use the same compiler options (optimised vs. non-optimised can easily be a factor of 10). Check that you get the same results.
Especially when you say "it runs faster again (this doesn't always work though, but i can't see a pattern). Also worked with changing 5000000 to 5E6 once. It only runs fast once though, recompiling causes the performance to drop again without changing anything. One time it ran slower only after recompiling twice.", it looks quite likely that you are using different compiler options.
I will try another guess. This is hypothetical, and would be mostly due to the compiler.
My guess is that you use a lot of floating-point calculations, and the introduction and use of double values in your main makes the compiler run out of XMM registers (the floating-point SSE registers). This forces the compiler to use memory instead of registers and induces a lot of swapping between memory and registers, greatly reducing performance. It would happen mainly because the computation functions are inlined, since function calls preserve registers.
The solution would be to add __declspec(noinline) to ALL your computation function declarations.
I suggest using the Microsoft Profile Guided Optimizer -- if the compiler is making the wrong assumption for this particular branch it will help, and it will in all likelihood improve speed for the rest of the code as well.
Workaround, try 2:
The code is now looking like this:
int cb_orbit_update_counter = 1; // before while loop

if (time - cb_orbit_update_counter * 5E6 > 0)
{
    cb_orbit_update_counter++;
}
So far it runs fast, plus the code is being executed when it should, as far as I can tell. Again, this is only a workaround, but if it proves to work all around then I'm satisfied.
After some more testing, seems good.
My guess is that this happens because the variable cb_last_orbital_update is otherwise read-only, so assigning to it inside the if destroys some optimizations the compiler applies to read-only variables (e.g. perhaps it's now stored in memory instead of a register).
Something to try (although this might still not work) is to make a third variable that is initialized via cb_last_orbital_update or time, depending on whether the condition is true, and to use that one instead. Presumably the compiler would then still treat the original variable as a constant, but I'm not sure.
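A literal reading of that suggestion, as a stripped-down, hypothetical skeleton (the real simulation work is elided and the step size is enlarged so the sketch finishes quickly): cb_last_orbital_update itself stays read-only and only a separate marker is ever written.

#include <iostream>

int main()
{
    const double cb_last_orbital_update = 0.0;         // never written now
    double cb_update_marker = cb_last_orbital_update;  // written instead
    double time = 0.0;
    const double time_interval = 1.0;                  // enlarged step for the sketch

    while (time < 2.0E7)
    {
        // ... the heavy per-step simulation work would go here ...

        if (time - cb_update_marker > 5000000.0)
        {
            cb_update_marker = time;                   // only the copy changes
        }

        time += time_interval;
    }

    std::cout << "last update marker: " << cb_update_marker << std::endl;
    return 0;
}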
I am trying to benchmark a series of libraries for matrix-vector computations. For that I just write a large loop and, inside it, call the routine I want to time. Very simple. However, I sometimes see that when I increase the compiler's optimization level, the time drops to zero no matter how large the loop is. See the example below, where I try to time a C macro that computes cross products. What is the compiler doing, and how can I avoid it while still allowing maximum optimization for floating-point arithmetic? Thank you in advance.
The example below was compiled using g++ 4.7.2 on a computer with an Intel i5 processor.
Using optimization level 1 (-O1) it takes 0.35 seconds. For level two or higher the time drops to zero. Remember, I want to time this, so I want the computations to actually happen, even if they are unnecessary for this simple test.
#include <iostream>
#include <ctime>   // clock(), CLOCKS_PER_SEC
using namespace std;

typedef double Vector[3];

#define VecCross(A,assign_op,B,dummy_op,C) \
( A[0] assign_op (B[1] * C[2]) - (B[2] * C[1]), \
  A[1] assign_op (B[2] * C[0]) - (B[0] * C[2]), \
  A[2] assign_op (B[0] * C[1]) - (B[1] * C[0]) \
)

double get_time(){
    return clock()/(double)CLOCKS_PER_SEC;
}

int main()
{
    unsigned long n = 1000000000u;
    double start;
    { // C macro cross product
        Vector u = {1,0,0};
        Vector v = {1,1,0};
        Vector w = {1.2,1.2,1.2};
        start = get_time();
        for(unsigned long i=0;i<n;i++){
            VecCross (w, =, u, X, v);
        }
        cout << "C macro cross product: " << get_time()-start << endl;
    }
    return 0;
}
Ask yourself, what does your program actually do, in terms of what is visible to the end-user?
It displays the result of a calculation: get_time()-start. The contents of your loop have no bearing on the outcome of that calculation, because you never actually use the variables being modified inside the loop.
Therefore, the compiler optimises out the entire loop since it is irrelevant.
One solution is to output the final state of the variables modified in the loop as part of your cout statement, thus forcing the compiler to compute the loop. However, a smart compiler could also figure out that the loop always calculates the same thing and simply insert the result directly into your cout statement, because there's no need to actually calculate it at run-time. As a workaround for that, you could, for example, require that one of the inputs to the loop be provided at run-time (e.g. read it from a file, a command-line argument, cin, etc.).
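Here is one hedged way to apply both fixes to the benchmark above (my variation, not the asker's original intent): one input comes from the command line so the result cannot be pre-computed, the macro accumulates with += so the iterations cannot be legally collapsed without fast-math, and w is printed, so the loop has an observable effect.

#include <iostream>
#include <cstdlib>   // atof
#include <ctime>     // clock, CLOCKS_PER_SEC
using namespace std;

typedef double Vector[3];

#define VecCross(A,assign_op,B,dummy_op,C) \
( A[0] assign_op (B[1] * C[2]) - (B[2] * C[1]), \
  A[1] assign_op (B[2] * C[0]) - (B[0] * C[2]), \
  A[2] assign_op (B[0] * C[1]) - (B[1] * C[0]) \
)

double get_time(){
    return clock()/(double)CLOCKS_PER_SEC;
}

int main(int argc, char* argv[])
{
    unsigned long n = 1000000000u;
    double s = (argc > 1) ? atof(argv[1]) : 1.0;   // runtime input

    Vector u = {1, 0, s};
    Vector v = {1, 1, 0};
    Vector w = {1.2, 1.2, 1.2};

    double start = get_time();
    for (unsigned long i = 0; i < n; i++){
        VecCross(w, +=, u, X, v);                  // accumulate into w
    }
    double elapsed = get_time() - start;

    // Printing w makes the loop's result observable, so it cannot be removed.
    cout << "C macro cross product: " << elapsed
         << "  (w = " << w[0] << ", " << w[1] << ", " << w[2] << ")" << endl;
    return 0;
}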
For more (and possibly better) solutions, check out this duplicate thread: Force compiler to not optimize side-effect-less statements
Sorry if this is maybe too abstract a question, but for me it is quite practical, and maybe some experts have had similar experience and can explain it.
I have a big code base, about 10000 lines in size.
I noticed that if in a certain place I put
if ( expression ) continue;
where expression is always false (double-checked against the logic of the code and with cout), but depends on unknown parameters (so the compiler can't simply get rid of this line during compilation), the speed of the program increases by 25% (the results of the calculation are the same). If I measure the speed of the loop itself, the speed-up factor is bigger than 3.
Why can this happen, and what are possible ways to obtain this speed-up without such tricks?
P.S. I use gcc 4.7.3 with -O3 optimisation.
More info:
I have tried two different expressions; both work.
If I change the line to:
if ( expression ) { cout << " HELLO " << endl; continue; };
the speed up is gone.
If I change the line to:
expression;
the speed up is gone.
The code surrounding the line looks like this:
for ( int i = a; ; ) {
    do {
        i += d;
        if ( d*i > d*ilast ) break;
        // small amount of calculations, and conditional calls of continue;
    } while ( expression0 );
    if ( d*i > dir*ilast ) break;
    if ( expression ) continue;
    // very big amount of calculations, and conditional calls of continue;
}
The for loop looks strange; that is because I have modified the loops in order to isolate this bottleneck. Initially expression was equal to expression0, and instead of the do-loop I had only this continue.
I tried using __builtin_expect in order to understand branch prediction. With
// the expression (= false) is supposed to be true by branch prediction.
if ( __builtin_expect( !!(expression), 1) ) continue;
the speed up is 25%.
// the expression (= false) is supposed to be false by branch prediction.
if ( __builtin_expect( !!(expression), 0) ) continue;
the speed up is gone.
If I use -O2 instead of -O3, the effect is gone; the code is slightly (~3%) slower than the fast O3 version with the false condition.
The same holds for "-O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -ftree-vectorize". With one more option, "-O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -ftree-vectorize -fipa-cp-clone", the effect is amplified: with "the line" the speed is the same, and without "the line" the code is 75% slower.
Found it.
The reason was in the conditional statement immediately following it. So the code looks like this:
for ( int i = a; ; ) {
    // small amount of calculations, and conditional calls of continue;
    if ( expression ) continue;
    // calculations1
    if ( expression2 ) {
        // calculations2
    }
    // very big amount of calculations, and conditional calls of continue;
}
The value of expression2 is almost always false. So I changed it like this:
for ( int i = a; ; ) {
    // small amount of calculations, and conditional calls of continue;
    // if ( expression ) continue; // don't need this anymore
    // calculations1
    if ( __builtin_expect( !!(expression2), 0 ) ) { // assume expression2 == false
        // calculations2
    }
    // very big amount of calculations, and conditional calls of continue;
}
And I got the desired 25% speed-up, even a little bit more, and the behaviour no longer depends on the critical line.
I'm not sure how to explain it and can't find enough material on branch prediction.
But I guess the point is that calculations2 should be skipped, but the compiler doesn't know this and assumes expression2 == true by default.
Meanwhile, in the simple continue-check
if ( expression ) continue;
it assumes that expression == false, and so it nicely skips calculations2, as has to be done in any case.
In the case where the if contains more complicated operations (for example cout), it assumes that expression is true and the trick doesn't work.
If somebody knows material that can explain this behaviour without guesswork, I will be very glad to read it and accept their answer.
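For reference, here is a self-contained toy of the same hint pattern (GCC/Clang only; the likely/unlikely macro names and the heavy_work function are mine, not part of the code above). Annotating the rare branch lets the optimizer lay the hot path out as straight-line, fall-through code:

#include <iostream>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

double heavy_work(double x)          // stand-in for the "very big amount of calculations"
{
    return x * 1.0000001 + 0.5;
}

int main()
{
    double acc = 0.0;
    for (int i = 0; i < 100000000; ++i)
    {
        if (unlikely(i == 12345678))        // almost never taken
        {
            std::cout << "rare case at i = " << i << std::endl;
            continue;
        }
        acc = heavy_work(acc);              // hot path
    }
    std::cout << "acc = " << acc << std::endl;
    return 0;
}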
The introduction of that impossible-to-reach branch breaks up the flow graph. Normally the compiler knows that the flow of execution goes from the top of the loop straight to the exit test and back to the start again. Now there's an extra node in the graph where the flow can leave the loop, so the loop body needs to be compiled differently, in two parts.
This almost always results in worse code. Why it doesn't here, I can only offer one guess: you didn't compile with profiling information. Hence, the compiler has to make assumptions, in particular about the probability that the branch will be taken at runtime.
Clearly, since the assumptions it must make are different, it's quite possible that the resulting code differs in speed.
I hate to say it, but the answer is going to be pretty technical and, more importantly, very specific to your code. So much so that probably nobody other than yourself is going to invest the time to investigate the root of your question. As others have suggested, it will quite possibly depend on branch prediction and other post-compile optimizations related to pipelining.
The only thing I can suggest to help you narrow down whether this is a compiler-optimization issue or a post-compile (CPU) optimization is to compile your code again with -O2 vs. -O3, but this time adding the additional options -fverbose-asm -S. Pipe each of the outputs to its own file, and then run something like sdiff to compare them. You should see a lot of differences.
Unfortunately, without a good understanding of assembly code it will be tough to make heads or tails of it, and honestly, not many people on Stack Overflow have the patience (or time) to spend more than a few minutes on this issue. If you aren't fluent in assembly (presumably x86), I would suggest finding a coworker or friend who is, to help you parse the assembly output.
I am trying to debug the problem I posted earlier here:
C++ and pin tool -- very weird DOUBLE variable issue with IF statement.
I tracked down the moment when the weird behavior occurs using gdb. What I found is shown in the figure below, a gdb screenshot displaying the disassembled code and the floating-point register values. (larger image here)
The left-hand image shows the state before the highlighted FLDZ instruction is executed, and the right-hand image shows the state after the instruction has been executed. I looked up the x86 ISA, and FLDZ is supposed to load +0.0 into ST(0). However, what I get is -nan instead of +0.0.
Does anybody know why this happens?
The system I am using is an Intel Xeon 5645 running 64-bit CentOS, but the target program I am trying to debug is a 32-bit application. Also, as I mentioned in the earlier post, I tried two versions of gcc, 4.2.4 and 4.1.2, and observed the same problem.
Thanks.
--added--
By the way, below is the source code.
void Router::Evaluate( )
{
    if (_id == 0) aaa++;

    if ( _partial_internal_cycles != 0 )
    {
        aaa += 12345;
        cout << "this is not a zero : " << endl;
        on = true;
    }

    _partial_internal_cycles += (double) 1.0;
    if ( _partial_internal_cycles >= (double)1.0 ) {
        _InternalStep( );
        _partial_internal_cycles -= (double)1.0;
    }

    if (GetSimTime() > 8646000 && _id == 0) cout << "aaa = " << aaa << endl;

    if ( on )
    {
        cout << "break. id = " << _id << endl;
        assert(false);
    }
}
An exception was generated (notice the I bit is set in the stat field). As the documentation says:
If the ST(7) data register which would become the new ST(0) is not empty, both a Stack Fault and an Invalid operation exceptions are detected, setting both flags in the Status Word. The TOP register pointer in the Status Word would still be decremented and the new value in ST(0) would be the INDEFINITE NAN.
By the way, your underlying issue comes down to the nature of floating point: it's not exact. See, for example, this gcc bug report -- and this one.