Cache Poisoning Issue for deep nested loop - c++

I am writing code for a mathematical method (Incomplete Cholesky) and I have hit a curious roadblock. Please see the following simplified code.
for(k=0;k<nosUnknowns;k++)
{
    //Pieces of code
    for(i=k+1;i<nosUnknowns;i++)
    {
        // more code
    }
    for(j=k+1;j<nosUnknowns;j++)
    {
        for(i=j;i<nosUnknowns;i++)
        {
            //Some more code
            if(xOk && yOk && zOk)
            {
                if(xDF == 1 && yDF == 0 && zDF == 0)
                {
                    for(row=0;row<3;row++)
                    {
                        for(col=0;col<3;col++)
                        {
                            // All 3x3 static arrays. This is the line:
                            statObj->A1_[row][col] -= localFuncArr[row][col];
                        }
                    }
                }
            }
        }//Inner loop i ends here
    }//Inner loop j ends here
}//outer loop k ends here
For context,
statObj is an object containing a number of 3x3 static double arrays. I initialize statObj with a call to new and then populate the arrays inside it using some mathematical functions. One such array is A1_. The value of the variable nosUnknowns is around 3000. The array localFuncArr is a double array generated earlier by matrix multiplication.
Now this is my problem:
1. When I use the line as shown in the code, the code runs extremely sluggishly: about 245 seconds for the whole function.
2. When I comment out that line, the code runs extremely fast, taking about 6 seconds.
3. When I replace that line with localFuncArr[row][col] += 3.0, the code again runs at the same speed as in case (2) above.
Clearly something about the call to statObj->A1_ is making the code run slow.
My question(s):
Is Cache Poisoning the reason why this is happening ?
If so, what could be changed in terms of array initialization/object initialization/loop unrolling or for that matter any form of code optimization that can speed this up ?
Any insights to this from experienced folks is highly appreciated.
EDIT: Changed the description to be more verbose and redress some of the points mentioned in the comments.

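If the conditions are mostly true, your line of code is executed at most 3000x3000x3000x3x3 times, which is about 243 billion. Because j starts at k+1 and i starts at j, the loops actually visit only about a sixth of those index triples, so the line runs roughly 40 billion times. On a 2GHz processor, 245 seconds is about 490 billion cycles, or around 12 cycles per update including the surrounding loop work, which can be a very reasonable timing depending on your hardware architecture. In any case there isn't anything in the code that suggests cache poisoning.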
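The loop bounds in the question are triangular (j starts at k+1, i starts at j), so the number of (k, j, i) triples is smaller than a full 3000x3000x3000 cube. The following is a minimal sketch for checking that count; the function names are hypothetical, and it assumes the innermost 3x3 update runs for every triple, which is an upper bound since the question guards it with conditions.

```cpp
#include <cassert>
#include <cstdint>

// Brute-force count of the (k, j, i) triples visited by the loop nest
// in the question. Only practical for small n.
std::uint64_t count_triples_brute(std::uint64_t n) {
    std::uint64_t count = 0;
    for (std::uint64_t k = 0; k < n; ++k)
        for (std::uint64_t j = k + 1; j < n; ++j)
            for (std::uint64_t i = j; i < n; ++i)
                ++count;
    return count;
}

// Closed form for the same count: (n + 1) * n * (n - 1) / 6.
std::uint64_t count_triples_closed(std::uint64_t n) {
    assert(n < 2000000 && "product would overflow 64 bits");
    if (n == 0) return 0;
    return (n + 1) * n * (n - 1) / 6;
}

// Upper bound on executions of the 3x3 update line (9 per triple).
std::uint64_t max_updates(std::uint64_t n) {
    return 9 * count_triples_closed(n);
}
```

For n = 3000, max_updates returns 40499995500, i.e. about 40 billion updates rather than 243 billion.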

Related

Does the continue statement really increase the speed of a loop in C++?

I am new to online competitive programming, and I came across some code where I use an if-else statement inside a for loop. I want to increase the speed of the loop, and after doing some research I came across the break and continue statements.
So my question is: does using continue really increase the speed of the loop or not?
CODE :
int even_sum = 0;
for(int i=0;i<200;i++){
    if(i%4 == 0){
        even_sum +=i;
        continue;
    }else{
        //do other stuff when sum of multiple of 4 is not calculated
    }
}
In the specific code in the question, the code has the identical meaning with and without the continue: In either case, after execution leaves even_sum +=i;, it flows to the closing } of the for statement. Any compiler of even modest quality should treat the two options identically.
The intended purpose of continue is not to speed up code by requesting a jump the compiler is going to make anyway but to skip code that is undesired in the current loop iteration—it acts as if the remaining code had been enclosed in an else clause but may be more visually appealing and less disruptive to human perception of the code.
It is conceivable a very rudimentary compiler, or even a decent compiler but with optimization disabled, might generate a jump instruction for the continue and also a jump instruction for the “then” clause of the if statement to jump over the else clause. The latter would never be executed and would have no direct effect on program execution time, but it would increase the size of the program and thus could have indirect effects. This possibility is of negligible concern in typical modern environments, where you are unlikely to encounter such a rudimentary compiler.
No, there's no speed advantage when using continue here. Both versions of your code mean the same thing, and a compiler will typically produce the same machine code for them.
However, sometimes continue can make your code a lot more efficient, if you have structured your loop in a specific way, e.g.
This:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
        continue;
    }
    if (huge_computation_but_always_false_when_multiple_of_4(i)) {
        // do stuff
    }
}
is a lot more efficient, than:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
    }
    if (huge_computation_but_always_false_when_multiple_of_4(i)) {
        // do stuff
    }
}
because the former doesn't have to execute the huge_computation_but_always_false_when_multiple_of_4() function every time.
So even though both of these codes would always produce the same result (given that huge_computation_but_always_false_when_multiple_of_4() has no side effects), the first one, which uses continue, would be a lot faster.
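To make the difference concrete, here is a small sketch that counts how often the expensive predicate is actually called in each version. CallCounter is a hypothetical stand-in for huge_computation_but_always_false_when_multiple_of_4(); it does no real work and only records its calls.

```cpp
#include <cassert>

// Hypothetical stand-in for the expensive predicate: it only counts calls.
struct CallCounter {
    int calls = 0;
    bool operator()(int i) {
        ++calls;
        return i % 4 == 1;  // never true for a multiple of 4
    }
};

struct LoopResult {
    int even_sum;
    int predicate_calls;
    int stuff_done;
};

// Version with continue: multiples of 4 never reach the predicate.
LoopResult with_continue() {
    CallCounter expensive;
    int even_sum = 0, stuff = 0;
    for (int i = 0; i < 200; i++) {
        if (i % 4 == 0) {
            even_sum += i;
            continue;
        }
        if (expensive(i)) ++stuff;
    }
    return {even_sum, expensive.calls, stuff};
}

// Version without continue: the predicate runs on every iteration.
LoopResult without_continue() {
    CallCounter expensive;
    int even_sum = 0, stuff = 0;
    for (int i = 0; i < 200; i++) {
        if (i % 4 == 0) {
            even_sum += i;
        }
        bool hit = expensive(i);
        assert(!(i % 4 == 0 && hit));  // the stated precondition
        if (hit) ++stuff;
    }
    return {even_sum, expensive.calls, stuff};
}
```

Both functions return the same even_sum and stuff_done, but the version with continue calls the predicate 150 times instead of 200.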

Precalculate data vs sequential processing

I have the following sequential code:
1.
ProcessImage(){
    for_each_line
    {
        for_each_pixel_of_line()
        {
            A = ComputeA();
            B = ComputeB();
            DoBiggerWork();
        }
    }
}
Now I changed it to precalculate all the A and B values of each line up front, as below.
2.
ProcessImage(){
    for_each_line
    {
        A = ComputeAinLine();
        B = ComputeBinLine();
        for_each_pixel_of_line()
        {
            Ai = A[i];
            Bi = B[i];
            DoBiggerWork();
        }
    }
}
The result shows that the 2nd block of code runs about 10% slower than the 1st block of code.
I am wondering whether this is a cache miss issue in the 2nd block of code.
I am going to use SIMD to parallelize the precalculation in the 2nd block of code. Is it worth trying?
It all depends on how you implemented your functions. Try profiling your code to determine where the bottlenecks are.
If there is no benefit in calculating the values once per row, then don't do it. You need the A and B values only inside the per-pixel routine. In the second block of code you pass over the line once to calculate the values, then pass over it again for DoBiggerWork(), and each time you retrieve the values from the prepared arrays. That costs more CPU time.
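The two structures from the question can be sketched side by side as follows. The helpers compute_a, compute_b and do_bigger_work are hypothetical placeholders, since the question does not show the real ones. The point is only that the precomputed variant makes extra passes over the line and writes and re-reads two temporary buffers, while the fused variant keeps A and B in registers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-pixel helpers standing in for ComputeA/ComputeB/DoBiggerWork.
inline int compute_a(int px) { return px * 3 + 1; }
inline int compute_b(int px) { return px ^ 0x55; }
inline long long do_bigger_work(int a, int b) {
    return static_cast<long long>(a) * b;
}

// Variant 1: compute A and B right where they are used.
long long process_line_fused(const std::vector<int>& line) {
    long long acc = 0;
    for (int px : line) {
        acc += do_bigger_work(compute_a(px), compute_b(px));
    }
    return acc;
}

// Variant 2: precompute A and B for the whole line, then consume them.
long long process_line_precomputed(const std::vector<int>& line) {
    std::vector<int> a(line.size()), b(line.size());
    for (std::size_t i = 0; i < line.size(); ++i) a[i] = compute_a(line[i]);
    for (std::size_t i = 0; i < line.size(); ++i) b[i] = compute_b(line[i]);
    assert(a.size() == b.size());
    long long acc = 0;
    for (std::size_t i = 0; i < line.size(); ++i) {
        acc += do_bigger_work(a[i], b[i]);  // extra loads from the buffers
    }
    return acc;
}
```

Both variants return the same result. Whether the second one is slower or faster in practice depends on how expensive the real helpers are and whether the precomputation vectorizes, so it is worth measuring both.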

Rarely executed and almost empty if statement drastically reduces performance in C++

Editor's clarification: When this was originally posted, there were two issues:
Test performance drops by a factor of three if seemingly inconsequential statement added
Time taken to complete the test appears to vary randomly
The second issue has been solved: the randomness only occurs when running under the debugger.
The remainder of this question should be understood as being about the first bullet point above, and in the context of running in VC++ 2010 Express's Release Mode with optimizations "Maximize Speed" and "favor fast code".
There are still some Comments in the comment section talking about the second point but they can now be disregarded.
I have a simulation where if I add a simple if statement into the while loop that runs the actual simulation, the performance drops about a factor of three (and I run a lot of calculations in the while loop, n-body gravity for the solar system besides other things) even though the if statement is almost never executed:
if (time - cb_last_orbital_update > 5000000)
{
    cb_last_orbital_update = time;
}
with time and cb_last_orbital_update being both of type double and defined in the beginning of the main function, where this if statement is too. Usually there are computations I want to run there too, but it makes no difference if I delete them. The if statement as it is above has the same effect on the performance.
The variable time is the simulation time, it increases in 0.001 steps in the beginning so it takes a really long time until the if statement is executed for the first time (I also included printing a message to see if it is being executed, but it is not, or at least only when it's supposed to). Regardless, the performance drops by a factor of 3 even in the first minutes of the simulation when it hasn't been executed once yet. If I comment out the line
cb_last_orbital_update = time;
then it runs faster again, so it's not the check for
time - cb_last_orbital_update > 5000000
either, it's definitely the simple act of writing current simulation time into this variable.
Also, if I write the current time into another variable instead of cb_last_orbital_update, the performance does not drop. So this might be an issue with assigning a new value to a variable that is used to check if the "if" should be executed? These are all shots in the dark though.
Disclaimer: I am pretty new to programming, and sorry for all that text.
I am using Visual C++ 2010 Express, deactivating the stdafx.h precompiled header function didn't make a difference either.
EDIT: Basic structure of the program. Note that nowhere besides at the end of the while loop (time += time_interval;) is time changed. Also, cb_last_orbital_update has only 3 occurrences: Declaration / initialization, plus the two times in the if statement that is causing the problem.
int main(void)
{
    ...
    double time = 0;
    double time_interval = 0.001;
    double cb_last_orbital_update = 0;
    F_Rocket_Preset(time, time_interval, ...);
    while(conditions)
    {
        Rocket[active].Stage[Rocket[active].r_stage].F_Update_Stage_Performance(time, time_interval, ...);
        Rocket[active].F_Calculate_Aerodynamic_Variables(time);
        Rocket[active].F_Calculate_Gravitational_Forces(cb_mu, cb_pos_d, time);
        Rocket[active].F_Update_Rotation(time, time_interval, ...);
        Rocket[active].F_Update_Position_Velocity(time_interval, time, ...);
        Rocket[active].F_Calculate_Orbital_Elements(cb_mu);
        F_Update_Celestial_Bodies(time, time_interval, ...);
        if (time - cb_last_orbital_update > 5000000.0)
        {
            cb_last_orbital_update = time;
        }
        Rocket[active].F_Check_Apoapsis(time, time_interval);
        Rocket[active].F_Status_Check(time, ...);
        Rocket[active].F_Update_Mass (time_interval, time);
        Rocket[active].F_Staging_Check (time, time_interval);
        time += time_interval;
        if (time > 3.1536E8)
        {
            std::cout << "\n\nBreak main loop! Sim Time: " << time << std::endl;
            break;
        }
    }
    ...
}
EDIT 2:
Here is the difference in the assembly code. On the left is the fast code with the line
cb_last_orbital_update = time;
commented out; on the right is the slow code with the line.
EDIT 4:
So, I found a workaround that seems to work just fine so far:
int cb_orbit_update_counter = 1; // before while loop
if(time - cb_orbit_update_counter * 5E6 > 0)
{
    cb_orbit_update_counter++;
}
EDIT 5:
While that workaround does work, it only works in combination with using __declspec(noinline). I just removed those from the function declarations again to see if that changes anything, and it does.
EDIT 6: Sorry this is getting confusing. I tracked down the culprit for the lower performance when removing __declspec(noinline) to this function, that is being executed inside the if:
__declspec(noinline) std::string F_Get_Body_Name(int r_body)
{
    switch (r_body)
    {
        case 0:
        {
            return ("the Sun");
        }
        case 1:
        {
            return ("Mercury");
        }
        case 2:
        {
            return ("Venus");
        }
        case 3:
        {
            return ("Earth");
        }
        case 4:
        {
            return ("Mars");
        }
        case 5:
        {
            return ("Jupiter");
        }
        case 6:
        {
            return ("Saturn");
        }
        case 7:
        {
            return ("Uranus");
        }
        case 8:
        {
            return ("Neptune");
        }
        case 9:
        {
            return ("Pluto");
        }
        case 10:
        {
            return ("Ceres");
        }
        case 11:
        {
            return ("the Moon");
        }
        default:
        {
            return ("unnamed body");
        }
    }
}
The if also now does more than just increase the counter:
if(time - cb_orbit_update_counter * 1E7 > 0)
{
    F_Update_Orbital_Elements_Of_Celestial_Bodies(args);
    std::cout << F_Get_Body_Name(3) << " SMA: " << cb_sma[3] << "\tPos Earth: " << cb_pos_d[3][0] << " / " << cb_pos_d[3][1] << " / " << cb_pos_d[3][2] <<
        "\tAlt: " << sqrt(pow(cb_pos_d[3][0] - cb_pos_d[0][0],2) + pow(cb_pos_d[3][1] - cb_pos_d[0][1],2) + pow(cb_pos_d[3][2] - cb_pos_d[0][2],2)) << std::endl;
    std::cout << "Time: " << time << "\tcb_o_h[3]: " << cb_o_h[3] << std::endl;
    cb_orbit_update_counter++;
}
If I remove __declspec(noinline) from the function F_Get_Body_Name alone, the code gets slower. Similarly, if I remove the call to this function or add __declspec(noinline) again, the code runs faster. All other functions still have __declspec(noinline).
EDIT 7:
So I changed the switch function to
const std::string cb_names[] = {"the Sun","Mercury","Venus","Earth","Mars","Jupiter","Saturn","Uranus","Neptune","Pluto","Ceres","the Moon","unnamed body"}; // global definition
const int cb_number = 12; // global definition
std::string F_Get_Body_Name(int r_body)
{
    if (r_body >= 0 && r_body < cb_number)
    {
        return (cb_names[r_body]);
    }
    else
    {
        return (cb_names[cb_number]);
    }
}
and also made another part of the code slimmer. The program now runs fast without any __declspec(noinline). So, as ElderBug suggested, was this an issue with the CPU instruction cache, i.e. the code getting too big?
I'd put my money on Intel's branch predictor. http://en.wikipedia.org/wiki/Branch_predictor
The processor assumes (time - cb_last_orbital_update > 5000000) to be false most of the time and loads up the execution pipeline accordingly.
Once the condition (time - cb_last_orbital_update > 5000000) becomes true, the misprediction delay hits you. You may lose 10 to 20 cycles.
if (time - cb_last_orbital_update > 5000000)
{
    cb_last_orbital_update = time;
}
Something is happening that you don't expect.
One candidate is some uninitialised variables hanging around somewhere, which have different values depending on the exact code that you are running. For example, you might have uninitialised memory that is sometimes a denormalised floating point number, and sometimes not.
I think it should be clear that your code doesn't do what you expect it to do. So try debugging your code, compile with all warnings enabled, make sure you use the same compiler options (optimised vs. non-optimised can easily be a factor 10). Check that you get the same results.
Especially when you say "it runs faster again (this doesn't always work though, but i can't see a pattern). Also worked with changing 5000000 to 5E6 once. It only runs fast once though, recompiling causes the performance to drop again without changing anything. One time it ran slower only after recompiling twice." it looks quite likely that you are using different compiler options.
I will try another guess. This is hypothetical, and would be mostly due to the compiler.
My guess is that you use a lot of floating point calculations, and the introduction and use of double values in your main makes the compiler run out of XMM registers (the floating point SSE registers). This forces the compiler to use memory instead of registers and induces a lot of swapping between memory and registers, greatly reducing performance. This would happen mainly because the computation functions are inlined, since function calls preserve registers.
The solution would be to add __declspec(noinline) to ALL of your computation function declarations.
I suggest using the Microsoft Profile Guided Optimizer -- if the compiler is making the wrong assumption for this particular branch it will help, and it will in all likelihood improve speed for the rest of the code as well.
Workaround, try 2:
The code is now looking like this:
int cb_orbit_update_counter = 1; // before while loop
if(time - cb_orbit_update_counter * 5E6 > 0)
{
    cb_orbit_update_counter++;
}
So far it runs fast, and the code is executed when it should be, as far as I can tell. Again, this is only a workaround, but if it proves to work all around then I'm satisfied.
After some more testing, seems good.
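As a rough check that the counter-based workaround triggers about as often as the original timestamp-based test, the two conditions can be compared in isolation. This is only a sketch: the function names are hypothetical, and the step size in the usage note is chosen to keep the run short rather than to match the simulation's 0.001 step.

```cpp
#include <cassert>

// Original form: remember the time of the last update.
int count_updates_timestamp(double end_time, double dt, double period) {
    assert(dt > 0.0);
    double last_update = 0.0;
    int fired = 0;
    for (double time = 0.0; time <= end_time; time += dt) {
        if (time - last_update > period) {
            last_update = time;
            ++fired;
        }
    }
    return fired;
}

// Workaround form: count completed periods instead of storing a time.
int count_updates_counter(double end_time, double dt, double period) {
    assert(dt > 0.0);
    int update_counter = 1;
    int fired = 0;
    for (double time = 0.0; time <= end_time; time += dt) {
        if (time - update_counter * period > 0) {
            ++update_counter;
            ++fired;
        }
    }
    return fired;
}
```

With end_time = 3.0e7, dt = 1000.0 and period = 5.0e6, both functions report 5 updates. The firing times are not identical, though: the timestamp form restarts its period from the step at which it fired, so it drifts by one step per update, while the counter form fires just after fixed multiples of the period.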
My guess is that this is because the variable cb_last_orbital_update is otherwise read-only, so when you assign to it inside the if, it destroys some optimizations that the compiler has for read-only variables (e.g. perhaps it's now stored in memory instead of a register).
Something to try (although this might still not work) is to make a third variable that is initialized via cb_last_orbital_update and time depending on whether the condition is true, and using that one instead. Presumably, the compiler would now treat that variable as a constant, but I'm not sure.

How to understand the tricky speed up

Sorry if this question is too abstract, but for me it is quite practical, and maybe some experts have had a similar experience and can explain it.
I have a big program, about 10000 lines long.
I noticed that if in a certain place I put
if ( expression ) continue;
where expression is always false (double checked with the logic of the code and with cout) but depends on unknown parameters (so the compiler can't simply get rid of this line during compilation), the speed of the program increases by 25% (the results of the calculation are the same). If I measure the speed of the loop itself, the speed-up factor is greater than 3.
Why can this happen, and what are possible ways to get this speed-up without such tricks?
P.S. I use gcc 4.7.3, -O3 optimisation.
More info:
I have tried two different expressions, both works.
If I change the line to:
if ( expression ) { cout << " HELLO " << endl; continue; };
the speed up is gone.
If I change the line to:
expression;
the speed up is gone.
The code, which surrounds the line looks like this:
for ( int i = a; ; ) {
    do {
        i += d;
        if ( d*i > d*ilast ) break;
        // small amount of calculations, and conditional calls of continue;
    } while ( expression0 );
    if ( d*i > dir*ilast ) break;
    if ( expression ) continue;
    // very big amount calculations, and conditional calls of continue;
}
The for loop looks strange because I modified the loops in order to catch this bottleneck. Initially expression was equal to expression0, and instead of the do-loop I had only this continue.
I tried using __builtin_expect in order to understand branch prediction. With
// the expression (= false) is supposed to be true by branch prediction.
if ( __builtin_expect( !!(expression), 1) ) continue;
the speed up is 25%.
// the expression (= false) is supposed to be false by branch prediction.
if ( __builtin_expect( !!(expression), 0) ) continue;
the speed up is gone.
If I use -O2 instead of -O3 the effect is gone. The code is slightly (~3%) slower than the fast O3-version with the false condition.
Same for "-O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -ftree-vectorize". With one more option: "-O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -ftree-vectorize -fipa-cp-clone" the effect is amplified. With "the line" the speed is same, without "the line" the code is 75% slower.
Found it.
The cause was the conditional statement that immediately follows. The code looks like this:
for ( int i = a; ; ) {
    // small amount of calculations, and conditional calls of continue;
    if ( expression ) continue;
    // calculations1
    if ( expression2 ) {
        // calculations2
    }
    // very big amount calculations, and conditional calls of continue;
}
The value of expression2 is almost always false. So I changed it like this:
for ( int i = a; ; ) {
    // small amount of calculations, and conditional calls of continue;
    // if ( expression ) continue; // don't need this anymore
    // calculations1
    if ( __builtin_expect( !!(expression2), 0 ) ) { // suppose expression2 == false
        // calculations2
    }
    // very big amount calculations, and conditional calls of continue;
}
And I got the desired 25% speed up, even a little more. The behaviour no longer depends on the critical line.
I'm not sure how to explain it and can't find enough material on branch prediction.
But I guess the point is that calculations2 should be skipped, but the compiler doesn't know this and assumes expression2 == true by default.
Meanwhile it assumes that in the simple continue-check
if ( expression ) continue;
expression == false, and nicely skips calculations2, as has to be done in any case.
When the if guards more complicated operations (for example cout), it assumes that expression is true and the trick doesn't work.
If somebody knows materials, which can explain this behaviour without guesses I will be very glad to read and accept their answer.
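For reference, the hint used above is often wrapped in a small macro so that the code still builds on compilers without the builtin. The following is a minimal sketch, assuming GCC or Clang for the real hint; UNLIKELY and sum_with_rare_branch are hypothetical names, and the loop body is only a stand-in for calculations1 and calculations2.

```cpp
#include <cassert>

// __builtin_expect is a GCC/Clang builtin; other compilers get a pass-through.
#if defined(__GNUC__) || defined(__clang__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

// Sums the absolute values of data[0..n). Negative inputs are assumed rare,
// so the correction branch is marked as unlikely.
long long sum_with_rare_branch(const int* data, int n) {
    assert(n >= 0);
    long long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += data[i];                // stands in for calculations1
        if (UNLIKELY(data[i] < 0)) {   // stands in for expression2
            sum -= 2LL * data[i];      // stands in for calculations2
        }
    }
    return sum;
}
```

The hint does not change the result. It only tells the compiler which path to lay out as the fall-through case, so any speed-up has to be confirmed by measurement.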
The introduction of that impossible-to-reach branch breaks up the flow graph. Normally the compiler knows that the flow of execution is from the top of the loop straight to the exit test and back to the start again. Now there's an extra node in the graph, where the flow can leave the loop. It now needs to compile the loop body differently, in two parts.
This almost always results in worse code. Why it doesn't here, I can only offer one guess: you didn't compile with profiling information. Hence, the compiler has to make assumptions. In particular, it must make assumptions about the possibility that the branch will be taken at runtime.
Clearly, since the assumptions it must make are different, it's quite possible that the resulting code differs in speed.
I hate to say it, but the answer is going to be pretty technical, and more importantly, very specific to your code. So much so, that probably nobody outside of yourself is going to invest the time to investigate the root of your question. As others have suggested, it will quite possibly depend on branch prediction and other post-compile optimizations related to pipelining.
The only thing that I can suggest to help you narrow down if this is a compiler optimization issue, or a post-compile (CPU) optimization, is to compile your code again, with -O2 vs -O3, but this time add the following additional options: -fverbose-asm -S. Pipe each of the outputs to two different files, and then run something like sdiff to compare them. You should see a lot of differences.
Unfortunately, without a good understanding of assembly code, it will be tough to make heads or tails of it, and honestly, not many people on Stack Overflow have the patience (or time) to spend more than a few minutes on this issue. If you aren't fluent in assembly (presumably x86), then I would suggest finding a coworker or friend who is, to help you parse the assembly output.

Is a std::vector lookup faster than performing a simple operation?

I'm trying to optimize some C++ code for speed, and not concerned about memory usage. If I have some function that, for example, tells me if a character is a letter:
bool letterQ ( char letter ) {
    return (letter>=65 && letter<=90) ||
           (letter>=97 && letter<=122);
}
Would it be faster to just create a lookup table, i.e.
int lookupTable[128];
for (int i = 0 ; i < 128 ; i++) {
    lookupTable[i] = // some int value that tells what it is
}
and then modifying the letterQ function above to be
bool letterQ ( char letter ) {
    return lookupTable[letter]==LETTER_VALUE;
}
I'm trying to optimize for speed in this simple region, because these functions are called a lot, so even a small increase in speed would accumulate into long-term gain.
EDIT:
I did some testing, and it seems like a lookup array performs significantly better than a lookup function if the lookup array is cached. I tested this by trying
for (int i = 0 ; i < size ; i++) {
    if ( lookupfunction( text[i] ) )
        // do something
}
against
bool lookuptable[128];
for (int i = 0 ; i < 128 ; i++) {
    lookuptable[i] = lookupfunction( (char)i );
}
for (int i = 0 ; i < size ; i++) {
    if (lookuptable[(int)text[i]])
        // do something
}
Turns out that the second one is considerably faster - about a 3:1 speedup.
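A slightly more defensive version of that table can be sketched as follows. It is an illustration rather than the code that was benchmarked: the names are hypothetical, the table has 256 entries instead of 128, and the index goes through unsigned char so that a char with a negative value or a value above 127 cannot index out of bounds.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Comparison-based check, using the same ASCII constants as the question.
bool letter_by_compare(char c) {
    return (c >= 65 && c <= 90) || (c >= 97 && c <= 122);
}

// Build a 256-entry table so that every possible char value has a slot.
std::array<bool, 256> build_letter_table() {
    std::array<bool, 256> table{};
    for (int i = 0; i < 256; ++i) {
        table[static_cast<std::size_t>(i)] =
            letter_by_compare(static_cast<char>(i));
    }
    // Sanity check; assumes an ASCII character set, as the constants above do.
    assert(table[static_cast<unsigned char>('a')]);
    return table;
}

// Table-based check: the cast keeps the index in the range 0..255.
bool letter_by_table(const std::array<bool, 256>& table, char c) {
    return table[static_cast<unsigned char>(c)];
}
```

Building the table once and reusing it across calls matters here, since the measured speed-up applies only while the table stays in cache.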
About the only possible answer is "maybe" -- and you can find out by running a profiler or something else to time the code. At one time, it would have been pretty easy to give "yes" as the answer with little or no qualification. Now, given how much faster CPUs have gotten than memory, it's a lot less certain -- you can do a lot of computation in the time it takes to fill one cache line from main memory.
Edit: I should add that in either C or C++, it's probably best to at least start with the functions (or macros) built into the standard library. These are often fairly carefully optimized for the target and (more importantly for most people) support things like switching locales, so you won't be stuck trying to explain to your German users that 'ß' isn't really a letter (and I doubt many will be much amused by "but that's really two letters, not one!").
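As a minimal sketch of the standard-library route, std::isalpha from <cctype> can replace the hand-written range check. The wrapper names below are hypothetical. The cast to unsigned char is needed because passing a negative value other than EOF to std::isalpha is undefined behavior.

```cpp
#include <cassert>
#include <cctype>

// Locale-aware letter test built on the standard library.
bool is_letter(char c) {
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}

// Counts the letters in a null-terminated string.
int count_letters(const char* text) {
    assert(text != nullptr);
    int n = 0;
    for (const char* p = text; *p != '\0'; ++p) {
        if (is_letter(*p)) ++n;
    }
    return n;
}
```

In the default "C" locale this classifies exactly the letters A-Z and a-z as alphabetic; other locales may add more.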
First, I assume you have profiled the code and verified that this particular function is consuming a noticeable amount of CPU time over the runtime of the program?
I wouldn't create a vector as you're dealing with a very fixed data size. In fact, you could just create a regular C++ array and initialize it at program startup. With a really modern compiler that supports array initializers you can even do something like this:
bool lookUpTable[128] = { false, false, false, ..., true, true, ... };
Admittedly I'd probably write a small script that generates the code rather than doing it all manually.
For a simple calculation like this, the memory access (caused by a lookup table) may well be more expensive than just doing the calculation every time, particularly when the table is not already in cache.