I have a strange problem. I have the following piece of code:
template<class index, class policy>
inline int CBase<index,policy>::func(const A& test_in, int* srcPtr, int* dstPtr)
{
int width = test_in.width();
int height = test_in.height();
double d = 0.0; //here is the problem
for(int y = 0; y < height; y++)
{
//Pointer initializations
//multiplication involving y
//ex: int z = someBigNumber*y + someOtherBigNumber;
for(int x = 0; x < width; x++)
{
//multiplication involving x
//ex: int z = someBigNumber*x + someOtherBigNumber;
if(someCondition)
{
// floating point calculations
}
*dstPtr++ = array[*srcPtr++];
}
}
}
The inner loop gets executed nearly 200,000 times and the entire function takes 100 ms to complete (profiled using AQTimer).
I found an unused variable, double d = 0.0;, outside the outer loop and removed it. After this change, the method suddenly takes 500 ms for the same number of executions (5 times slower).
This behavior is reproducible in different machines with different processor types.
(Core2, dualcore processors).
I am using VC6 compiler with optimization level O2.
The following other compiler options are used:
-MD -O2 -Z7 -GR -GX -G5 -X -GF -EHa
I suspected compiler optimizations and removed the /O2 optimization. After that, the function became normal again and takes 100 ms, as the old code did.
Could anyone throw some light on this strange behavior?
Why should compiler optimization slow down performance when I remove an unused variable?
Note: The assembly code (before and after the change) looked same.
If the assembly code looks the same before and after the change, the error is somehow connected to how you time the function.
VC6 is buggy as hell. It is known to generate incorrect code in several cases, and its optimizer isn't all that advanced either. The compiler is over a decade old, and hasn't even been supported for many years.
So really, the answer is "you're using a buggy compiler. Expect buggy behavior, especially when optimizations are enabled."
I don't suppose upgrading to a modern compiler (or simply testing the code on one) is an option?
Obviously, the generated assembly cannot be the same, or there would be no performance difference.
The only question is where the difference lies. And with a buggy compiler, it may well be some completely unrelated part of the code that suddenly gets compiled differently and breaks. Most likely though, the assembly code generated for this function is not the same, and the differences are just so subtle you didn't notice them.
Declare width and height as const ints (arguably const unsigned ints, since heights and widths are never negative):
const int width = test_in.width();
const int height = test_in.height();
This helps the compiler with optimizing. With the values as const, it can place them in the code or in registers, knowing that they won't change. Also, it relieves the compiler of having to guess whether the variables are changing or not.
I suggest printing out the assembly code of the versions with the unused double and without. This will give you an insight into the compiler's thought process.
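With MSVC-family compilers (VC6 included), the /FA family of switches produces such a listing; /FAs interleaves the source with the generated assembly. A minimal invocation (the file name is a placeholder):
cl /O2 /FAs yourfile.cpp
Generate one .asm listing per version and diff the two.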
I give the following example to illustrate my question:
#include <iostream>

void fun(int i, float *pt)
{
// do something based on i
std::cout<<*(pt+i)<<std::endl;
}
const unsigned int LOOP = 2000000007;
void fun_without_optimization()
{
float *example;
example = new float [LOOP];
for(unsigned int i=0; i<LOOP; i++)
{
fun(i,example);
}
delete []example;
}
void fun_with_optimization()
{
float *example;
example = new float [LOOP];
unsigned int unit_loop = LOOP/10;
unsigned int left_loop = LOOP%10;
float *pt = example;
for(unsigned int i=0; i<unit_loop; i++)
{
fun(0,pt);
fun(1,pt);
fun(2,pt);
fun(3,pt);
fun(4,pt);
fun(5,pt);
fun(6,pt);
fun(7,pt);
fun(8,pt);
fun(9,pt);
pt=pt+10;
}
delete []example;
}
As far as I understand, fun_without_optimization() and fun_with_optimization() should perform the same. The only argument for why the second function is better than the first is that the pointer calculation in fun becomes simpler. Are there any other arguments why the second function is better?
Unrolling a loop in which I/O is performed is like moving a B747's landing strip an inch: the I/O cost dwarfs any loop overhead you save.
Re: "Any other arguments why the second function is better?" - would you accept the answer explaining why it is NOT better?
Manually unrolling a loop is error-prone, as your code clearly illustrates: you forgot to process the tail, left_loop.
For at least a couple of decades, compilers have done this optimization for you.
How do you know the optimal number of iterations to put in that unrolled loop? Do you target a specific cache size and calculate the lengths of the assembly instructions in bytes? The compiler might.
Your messing with the otherwise clean loop can prevent other optimizations, such as the use of SIMD.
The bottom line is: if you know something that your compiler doesn't (specific pattern of the run-time data, details of the targeted execution environment, etc.), and you know what you are doing - you can try manual loop unrolling. But even then - profile.
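For reference, a correct manual unroll also has to process the remainder iterations. A minimal sketch, reusing fun and LOOP from the question:
void fun_with_tail_handling()
{
    float *example = new float[LOOP];
    float *pt = example;
    const unsigned int unit_loop = LOOP / 10;
    const unsigned int left_loop = LOOP % 10;
    for(unsigned int i = 0; i < unit_loop; i++)
    {
        fun(0,pt); fun(1,pt); fun(2,pt); fun(3,pt); fun(4,pt);
        fun(5,pt); fun(6,pt); fun(7,pt); fun(8,pt); fun(9,pt);
        pt += 10;
    }
    for(unsigned int i = 0; i < left_loop; i++) // the tail the original code forgot
        fun(i, pt);
    delete [] example;
}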
The technique you describe is called loop unrolling; potentially this increases performance, as the time spent evaluating the control structures (updating the loop variable and checking the termination condition) becomes smaller. However, decent compilers can do this for you, and maintainability of the code decreases if it is done manually.
This is an optimization technique used for parallel architectures (architectures that support VLIW instructions). Depending on the number of DALU (most commonly 4) and ALU (most commonly 2) units the architecture supports, and the degree of "parallelization" the code permits, multiple instructions can be executed in one cycle.
So this code:
for (int i = 0; i < n; i++) // n multiple of 4, for simplicity
    a += temp; // just a random instruction
Will actually execute faster on a parallel architecture if rewritten like:
for (int i = 0; i < n; i += 4)
{
    temp0 = temp0 + temp1; // reads and additions can be executed in parallel
    temp1 = temp2 + temp3;
    a = temp0 + temp1 + a;
}
There is a limit to how much you can parallelize your code, a limit imposed by the physical ALUs/DALUs the CPU has. That's why it's important to know your architecture before you attempt to (properly) optimize your code.
It does not stop here: the code you want to optimize has to be a continuous block of code, meaning no jumps (no function calls, no change-of-flow instructions), for maximum efficiency.
Writing your code, like:
for(unsigned int i=0; i<unit_loop; i++)
{
fun(0,pt);
fun(1,pt);
fun(2,pt);
fun(3,pt);
fun(4,pt);
fun(5,pt);
fun(6,pt);
fun(7,pt);
fun(8,pt);
fun(9,pt);
pt=pt+10;
}
would not do much, unless the compiler inlines the function calls; and it looks like too many instructions anyway...
On a different note: while it's true that you ALWAYS have to work with the compiler when optimizing your code, you should NEVER rely only on it when you want to get the maximum optimization out of your code. Remember, the compiler handles the general case while you are likely interested in a particular situation; that's why some compilers have special directives to help with the optimization process.
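For instance, several compilers accept unrolling hints as pragmas. These are compiler-specific extensions, not standard C++ (#pragma GCC unroll needs GCC 8 or newer; Clang spells a similar hint #pragma unroll):
#pragma GCC unroll 10 // hint: unroll the following loop by a factor of 10
for(unsigned int i = 0; i < LOOP; i++)
    fun(i, example);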
I am thinking about heavy memory/cache optimization and would like some feedback.
Consider this example:
class example
{
float phase1;
float phaseInc;
float factor;
public:
void process(float* buffer,unsigned int iSamples)//<-high prio audio thread
{
for(unsigned int i = 0; i < iSamples; i++)// mostly iSamples is 32
{
phase1 += phaseInc;
float f1 = sinf(phase1);//<-sinf is just an example!
buffer[i] = f1*factor;
}
}
};
optimization idea:
void example::process(float* buffer,unsigned int iSamples)
{
float stackMemory[3];// should fit in L1
memcpy(stackMemory,&phase1,sizeof(float)*3);// get all members at once (assumes phase1, phaseInc and factor are laid out contiguously)
for(unsigned int i = 0; i < iSamples; i++)
{
stackMemory[0] += stackMemory[1];
float f1 = sinf(stackMemory[0]);
buffer[i] = f1*stackMemory[2];
}
memcpy(&phase1,stackMemory,sizeof(float)*1);// write back only the changed memory
}
Note that the real sample loop will contain thousands of operations.
So the stackMemory can become quite big.
I think it will be no more than 32 KB (are there any smaller L1 caches out there?).
Does the order of the variables used in this stackMemory matter?
I hope not, because I'd like to order them so that I can reduce the write-back size.
Or does the L1 cache have the same cache-line behaviour that RAM has?
I have the feeling that I am somehow doing what prefetch is made for, but everything I have read about prefetch is relatively vague about how to use it efficiently. Trial and error is not an option with 5000+ lines of code.
Code will run on Win,Mac and iOS.
Any ARM<->Intel issues to expect ?
Is it possible that this kind of optimization is useless, since all the memory is accessed and transferred to L1 on the first iteration of the loop anyway?
Thanks for any hints and ideas.
At first I thought there was a good chance that the second one could be slower as a result of additional memory access and instructions required for memcpy, while the first could simply work directly with these three class members already loaded into registers.
Nevertheless, I tried fiddling with the code in GCC 5.2 with both -O2 and -O3 and found that, no matter what I tried, I got identical assembly instructions for both. This is pretty amazing considering all the extra conceptual work that memcpy typically has to do that apparently got squashed away to zilch.
The one case I can think of where your second version might be faster in some scenario, on some compiler, is if the aliasing involved to access this->data_member interfered with an optimization and caused redundant loads and stores to/from registers.
It would have nothing to do with the L1 cache in that case and everything to do with register allocation on the compiler side. Caches are largely irrelevant here, since you're loading the same memory (member variables) for a contiguous chunk of data regardless; it has entirely to do with registers. Nevertheless, I couldn't construct a single scenario where the compiler did a worse job with one version than the other -- every case I tested yielded identical results. In a sufficiently complex real-world case, perhaps there might be a difference.
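To make that aliasing hazard concrete, here is the question's first version annotated (a hypothetical, compiler-dependent scenario):
void example::process(float* buffer, unsigned int iSamples)
{
    for (unsigned int i = 0; i < iSamples; i++)
    {
        phase1 += phaseInc; // loads of this->phase1 and this->phaseInc...
        float f1 = sinf(phase1);
        buffer[i] = f1 * factor;
        // ...may have to be repeated every iteration if the compiler cannot
        // prove that this store to buffer[i] never writes into *this.
    }
}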
Then again, in such a case, it should be on the safer side to simply do:
void process(float* buffer,unsigned int iSamples)
{
    const float pInc = phaseInc; // local copies the optimizer can keep in registers
    const float fact = factor;
    float p1 = phase1;
    for(unsigned int i = 0; i < iSamples; i++)
    {
        p1 += pInc;
        float f1 = sinf(p1);
        buffer[i] = f1*fact;
    }
    phase1 = p1; // write back the only member that changed
}
There's no need to jump through hoops with memcpy to store the results into an array and back. That puts additional strain on the optimizer, even if, in my findings, the optimizer managed to eliminate the overhead typically associated with it.
I realize your example is simplified, but there should not be a need to reduce the structure down to such a primitive array no matter how many data members you're dealing with (unless such an array actually is the most convenient representation). From a performance standpoint, a compiler will have an "easier" time (even if optimizers today are pretty amazing and can handle this) optimizing if you just use local variables instead of an array to which you memcpy aggregate data members in and out.
When I'm trying to optimize my code, I often run into a dilemma:
I have an expression like this:
int x = 5 + y * y;
int z = sqrt(12) + y * y;
Is it worth making a new integer variable to store y*y for the two instances, or should I just leave them alone?
int s = y* y;
int x = 5 + s;
int z = sqrt(12) + s;
If not, how many instances would it take to make it worth it?
Trying to optimize your code most often means giving the compiler permission (through flags) to do its own optimization. Trying to do it yourself will, more often than not, either be a waste of time (no improvement over the compiler) or make things worse.
In your specific example, I seriously doubt there is anything you can do to change the performance.
One of the older compiler optimisations is "common subexpression elimination" - in this case y * y is such a common subexpression.
It may still make sense to show a reader of the code that the expression only needs calculating once, but any compiler produced in the last ten years will calculate this perfectly fine without repeating the multiplication.
Trying to "beat the compiler on it's own game" is often futile, and certainly needs measuring to ensure you get a better result than the compiler. Adding extra variables MAY cause the compiler to produce worse code, because it gets "confused", so it may not help at all.
And ALWAYS when it comes to performance (or code size) results from varying optimizations, measure, measure again, and measure a third time to make sure you get the results you expect. It's not very easy to predict from looking at code which is faster, and which is slower. But it'd definitely be surprised if y * y is calculated twice even with a low level of optimisation in your compiler.
You don't need a temporary variable:
int z = y * y;
int x = z + 5;
z = z + sqrt(12);
but the only way to be sure if this is (a) faster and (b) truly where you should focus your attention, is to use a profiler and benchmark your entire application.
According to the GCC manual, the -fipa-pta optimization does:
-fipa-pta: Perform interprocedural pointer analysis and interprocedural modification and reference analysis. This option can cause excessive memory and compile-time usage on large compilation units. It is not enabled by default at any optimization level.
What I assume is that GCC tries to differentiate mutable and immutable data based on pointers and references used in a procedure. Can someone with more in-depth GCC knowledge explain what -fipa-pta does?
I think the word "interprocedural" is the key here.
I'm not intimately familiar with gcc's optimizer, but I've worked on optimizing compilers before. The following is somewhat speculative; take it with a small grain of salt, or confirm it with someone who knows gcc's internals.
An optimizing compiler typically performs analysis and optimization only within each individual function (or subroutine, or procedure, depending on the language). For example, given code like this contrived example:
double *ptr = ...;
void foo(void) {
...
*ptr = 123.456;
some_other_function();
printf("*ptr = %f\n", *ptr);
}
the optimizer will not be able to determine whether the value of *ptr has been changed by the call to some_other_function().
If interprocedural analysis is enabled, then the optimizer can analyze the behavior of some_other_function(), and it may be able to prove that it can't modify *ptr. Given such analysis, it can determine that the expression *ptr must still evaluate to 123.456, and in principle it could even replace the printf call with puts("*ptr = 123.456000");.
(In fact, with a small program similar to the above code snippet I got the same generated code with -O3 and -O3 -fipa-pta, so I'm probably missing something.)
Since a typical program contains a large number of functions, with a huge number of possible call sequences, this kind of analysis can be very expensive.
As quoted from this article:
The "-fipa-pta" optimization takes the bodies of the called functions into account when doing the analysis, so compiling
void __attribute__((noinline))
bar(int *x, int *y)
{
*x = *y;
}
int foo(void)
{
int a, b = 5;
bar(&a, &b);
return b + 10;
}
with -fipa-pta makes the compiler see that bar does not modify b, and the compiler optimizes foo by changing b+10 to 15:
int foo(void)
{
int a, b = 5;
bar(&a, &b);
return 15;
}
A more relevant example is the “slow” code from the “Integer division is slow” blog post:
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);
for (int i = 0; i < 1000000000; i++) {
volatile auto r = theIntDist(randGenerator);
}
Compiling this with -fipa-pta makes the compiler see that theIntDist is not modified within the loop, and the inlined code can thus be constant-folded in the same way as the “fast” version – with the result that it runs four times faster.
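For comparison, the “fast” version that post refers to gets the same effect without -fipa-pta: if the distribution is local to the loop body, no pointer to it escapes, so the compiler can already see it is unmodified. A sketch of that pattern (my paraphrase, not a verbatim quote from the post):
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
for (int i = 0; i < 1000000000; i++) {
    // Local distribution: its state visibly cannot be modified elsewhere,
    // so the bounds (0, 99) can be constant-folded into the inlined code.
    std::uniform_int_distribution<int> theIntDist(0, 99);
    volatile auto r = theIntDist(randGenerator);
}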
I just stumbled upon a change that seems to have counterintuitive performance ramifications. Can anyone provide a possible explanation for this behavior?
Original code:
for (int i = 0; i < ct; ++i) {
// do some stuff...
int iFreq = getFreq(i);
double dFreq = iFreq;
if (iFreq != 0) {
// do some stuff with iFreq...
// do some calculations with dFreq...
}
}
While cleaning up this code during a "performance pass," I decided to move the definition of dFreq inside the if block, as it was only used inside the if. There are several calculations involving dFreq, so I didn't eliminate it entirely, as it does save the cost of multiple run-time conversions from int to double. I expected no performance difference, or at most a negligible improvement. However, the performance decreased by nearly 10%. I have measured this many times, and this is indeed the only change I've made. The code snippet shown above executes inside a couple of other loops. I get very consistent timings across runs and can definitely confirm that the change I'm describing decreases performance by ~10%. I would expect performance to increase, because the int-to-double conversion would only occur when iFreq != 0.
Changed code:
for (int i = 0; i < ct; ++i) {
// do some stuff...
int iFreq = getFreq(i);
if (iFreq != 0) {
// do some stuff with iFreq...
double dFreq = iFreq;
// do some stuff with dFreq...
}
}
Can anyone explain this? I am using VC++ 9.0 with /O2. I just want to understand what I'm not accounting for here.
You should put the conversion to dFreq immediately inside the if(), before doing the calculations with iFreq. The conversion may execute in parallel with the integer calculations if the instruction appears earlier in the code. A good compiler might be able to push it up itself; a not-so-good one may just leave it where it falls. Since you moved it to after the integer calculations, it may not get to run in parallel with the integer code, leading to a slowdown. If it does run in parallel, then there may be little to no improvement at all, depending on the CPU (issuing an FP instruction whose result is never used has little effect in the original version).
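In code, the suggested placement would look like this (a sketch built from the question's own snippet):
for (int i = 0; i < ct; ++i) {
    // do some stuff...
    int iFreq = getFreq(i);
    if (iFreq != 0) {
        double dFreq = iFreq; // convert first: the int-to-double conversion
                              // can overlap with the integer work below
        // do some stuff with iFreq...
        // do some calculations with dFreq...
    }
}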
If you really want to improve performance, a number of people have done benchmarks and rank the following compilers in this order:
1) ICC - Intel compiler
2) GCC - A good second place
3) MSVC - generated code can be quite poor compared to the others.
You may also want to try -O3 if they have it.
Maybe the result of getFreq is kept in a register in the first case and written to memory in the second case? It might also be that the performance decrease has to do with CPU mechanisms such as pipelining and/or branch prediction.
You could check the generated assembly code.
This looks to me like a pipeline stall.
int iFreq = getFreq(i);
double dFreq = iFreq;
if (iFreq != 0) {
This ordering allows the conversion to double to happen in parallel with other code, since dFreq is not used immediately. It gives the compiler something to do between storing iFreq and using it, so the conversion is most likely "free".
But
int iFreq = getFreq(i);
if (iFreq != 0) {
// do some stuff with iFreq...
double dFreq = iFreq;
// do some stuff with dFreq...
}
could be hitting a store/reference stall after the conversion to double, since you begin using the double value right away.
Modern processors can do multiple things per clock cycle, but only when the things are independent. Two consecutive instructions that reference the same register often result in a stall. The actual conversion to double may take 3 clocks, but all but the first clock can be done in parallel with other work, provided you don't refer to the result of the conversion for an instruction or two.
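A hypothetical illustration of that independence (a made-up function, not from the question): summing with one accumulator creates a serial dependency chain, while two accumulators let consecutive adds overlap:
float sum_two_accumulators(const float* a, unsigned int n)
{
    // sum0 and sum1 form independent dependency chains, so the two adds in
    // the loop body can be in flight at the same time. With a single 'sum',
    // each add would have to wait for the previous one's result.
    float sum0 = 0.0f, sum1 = 0.0f;
    unsigned int i = 0;
    for (; i + 1 < n; i += 2)
    {
        sum0 += a[i];
        sum1 += a[i + 1];
    }
    if (i < n) sum0 += a[i]; // odd-length tail
    return sum0 + sum1;
}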
C++ compilers are getting pretty good at re-ordering instructions to take advantage of this; it looks like your change defeated a nice optimization.
One other (less likely) possibility is that when the conversion to double was before the branch, the compiler was able to remove the branch entirely. Branchless code is often a major performance win on modern processors.
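As a hypothetical sketch of the branchy-vs-branchless difference (scale is a made-up parameter, not from the question), a selection written as an expression often compiles to a conditional move instead of a jump:
double with_branch(int iFreq, double scale)
{
    double d = 0.0;
    if (iFreq != 0)       // a conditional jump the branch predictor must guess
        d = iFreq * scale;
    return d;
}

double without_branch(int iFreq, double scale)
{
    // Often compiled to a conditional move (e.g. cmov on x86):
    // both values exist, one is selected, and nothing is mispredicted.
    return (iFreq != 0) ? iFreq * scale : 0.0;
}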
It would be interesting to see what instructions the compiler actually emitted for these two cases.
Try moving the definition of dFreq outside of the for loop but keep the assignment inside the for loop/if block.
Perhaps the creation of dFreq on the stack inside the if, on every pass of the for loop, is causing the issue (although the compiler should take care of that). Perhaps there is a regression in the compiler: if the dFreq variable is outside the for loop, it is created once; inside the if inside the for, it is created every time.
double dFreq;
int iFreq;
for (int i = 0; i < ct; ++i)
{
// do some stuff...
iFreq = getFreq(i);
if (iFreq != 0)
{
// do some stuff with iFreq...
dFreq = iFreq;
// do some stuff with dFreq...
}
}
Maybe the compiler is optimizing by taking the definition outside the for loop, and when you put it in the if, the compiler's optimizations aren't doing that.
There's a likelihood that this change caused your compiler to disable some optimizations. What happens if you move the declarations above the loop?
I once read a document about optimization that said defining variables just before their use, and not earlier, was good practice, because compilers can optimize code better when you follow that advice.
This article (a bit old but still valid) says something similar (with statistics): http://www.tantalon.com/pete/cppopt/asyougo.htm#PostponeVariableDeclaration
It's easy enough to find out. Just take 20 stackshots of the slow version, and of the fast version. In the slow version you will see, on roughly 2 of the shots, what it is doing that it is not doing in the fast version. You will see a subtle difference in where it halts in the assembly language.