I'd like to write a function with some optional code that is executed or not depending on user settings. The function is CPU-intensive, and having ifs in it would be slow, since the branch predictor is not that good.
My idea is to make a copy of the function in memory and replace the NOPs with a jump when I don't want some code to execute. My working example goes like this:
int Test()
{
    int x = 2;
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        __asm {NOP}; // to skip it, replace this
        __asm {NOP}; // by JMP 2 (after the goto)
        x *= 2;      // op to skip or not
        x *= 2;
    }
    return x;
}
In my test's main, I copy this function into newly allocated executable memory and replace the NOPs by a JMP 2, so that the following x *= 2 is not executed. JMP 2 really means "skip the next 2 bytes".
The problem is that I would have to change the JMP operand every time I edit the code to be skipped, since its size changes.
An alternative that would fix this problem would be:
__asm {NOP};     // to skip it, replace this
__asm {NOP};     // by JMP 2 (after the goto)
goto dont_do_it;
x *= 2;          // op to skip or not
dont_do_it:
x *= 2;
I would then want to skip (or not) the goto, which has a fixed size. Unfortunately, in full optimization mode, the goto and the x *= 2 are removed because they are unreachable at compile time. Hence the need to keep that dead code.
I'm using Visual Studio 2008.
You can cut the cost of the branch by a factor of up to 10 just by moving it out of the loop:
int Test()
{
    int x = 2;
    if (should_skip) {
        for (int i = 0; i < 10; i++)
        {
            x *= 2;
            x *= 2;
        }
    } else {
        for (int i = 0; i < 10; i++)
        {
            x *= 2;
            x *= 2;
            x *= 2;
        }
    }
    return x;
}
In this case, and others like it, this might also provoke the compiler into doing a better job of optimising the loop body, since it considers the two possibilities separately rather than trying to optimise conditional code, and it won't optimise anything away as dead.
If this results in too much duplicated code to be maintainable, use a template that takes x by reference:
int x = 2;
if (should_skip) {
    doLoop<true>(x);
} else {
    doLoop<false>(x);
}
And check that the compiler inlines it.
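A minimal sketch of what that template might look like (doLoop is just the name used above; the flag becomes a compile-time constant, so each instantiation contains no runtime branch):
template <bool Skip>
inline void doLoop(int &x)
{
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        if (!Skip)     // compile-time constant: folded away per instantiation
            x *= 2;    // op to skip or not
        x *= 2;
    }
}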
Obviously this increases code size a bit, which will occasionally be a concern. Whichever way you do it though, if this change doesn't produce a measurable performance improvement then I'd guess that yours won't either.
If the number of permutations for the code is reasonable, you can define your code as C++ templates and generate all variants.
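For instance (a hedged sketch with made-up names), two independent boolean options give four instantiations and a single dispatch outside the hot code:
template <bool OptA, bool OptB>
int kernel(int x)
{
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        if (OptA) x *= 2;  // both conditions are compile-time constants
        if (OptB) x *= 2;
    }
    return x;
}

int runKernel(int x, bool a, bool b)
{
    if (a) return b ? kernel<true, true>(x)  : kernel<true, false>(x);
    else   return b ? kernel<false, true>(x) : kernel<false, false>(x);
}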
You do not specify what compiler and platform you are using, which will prevent most people from being able to help you. For example, on some platforms, the code section is not going to be writeable, so you won't be able to replace the NOPs with a JMP.
You are trying to pick and choose the optimizations offered by the compiler and second-guess it. In general, that's a bad idea. Either write the whole inner loop block in assembly, which would prevent the compiler from eliminating it as dead code, or put the damn if statement in there and let the compiler do its thing.
I'm also dubious that the branch prediction is bad enough that you will gain any sort of net win from what you're proposing. Are you sure this isn't a case of premature optimization? Have you written the code in the most obvious way possible and only then determined that its performance isn't good enough? That would be my suggested start.
Here's an actual answer to the actual question!
volatile int y = 0;

int Test()
{
    int x = 2;
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        __asm {NOP}; // to skip it, replace this
        __asm {NOP}; // by JMP 2 (after the goto)
        goto dont_do_it;
    keep_my_code:
        x *= 2;      // op to skip or not
    dont_do_it:
        x *= 2;
    }
    // y is volatile, so the compiler cannot prove this branch is never taken;
    // keep_my_code stays reachable and the "dead" code survives optimization.
    if (y) goto keep_my_code;
    return x;
}
Is this x64? You might be able to use function pointers and a conditional move to avoid the branch predictor. Load the address of the procedure based on the user settings; one of the procedures could be a dummy that does nothing. You should be able to do this without any inline ASM at all.
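A hedged sketch of that idea (the names are illustrative, not from the question): select the operation once from user settings and call it through a pointer, paying an indirect call instead of a conditional branch. Note that the indirect call has its own cost, so this still needs measuring:
typedef int (*StepFn)(int);

int doubleIt(int x) { return x * 2; } // op to skip or not
int identity(int x) { return x; }     // the dummy that does nothing

int Test(bool skip)
{
    StepFn step = skip ? identity : doubleIt; // chosen once, outside the loop
    int x = 2;
    for (int i = 0; i < 10; i++)
    {
        x *= 2;
        x = step(x);
        x *= 2;
    }
    return x;
}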
This may give insight:
#pragma optimize for Visual Studio.
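For example, a sketch of how that pragma can fence off a function so the optimizer doesn't strip the "dead" code out of it (the empty string with off disables the listed optimizations; on restores the command-line settings):
#pragma optimize("", off)
int Test()
{
    // ... code whose unreachable parts must survive ...
}
#pragma optimize("", on)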
That said, for this particular problem I would hand-code into ASM, using the VS asm output as a reference point.
At the meta level, I would have to be very certain this was the best design & algorithm for what I was doing before I started optimizing for the CPU pipe.
If you get this to work, then I would profile it to make sure that it really is faster for you. On modern CPUs there is very little you can do that is slower than modifying code that is already in the CPU cache, or worse, the CPU pipeline. The CPU basically has to throw out all the work in the pipeline and start again.
Related
So, I am new to online competitive programming, and I came across some code where I am using an if/else statement inside a for loop. I want to increase the speed of the loop, and after doing some research I came across the break and continue statements.
So my question is: does using continue really increase the speed of the loop or not?
Code:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
        continue;
    } else {
        // do other stuff when the sum of the multiple of 4 is not calculated
    }
}
In the specific code in the question, the code has identical meaning with and without the continue: in either case, after execution leaves even_sum += i;, it flows to the closing } of the for statement. Any compiler of even modest quality should treat the two options identically.
The intended purpose of continue is not to speed up code by requesting a jump the compiler is going to make anyway but to skip code that is undesired in the current loop iteration—it acts as if the remaining code had been enclosed in an else clause but may be more visually appealing and less disruptive to human perception of the code.
It is conceivable a very rudimentary compiler, or even a decent compiler but with optimization disabled, might generate a jump instruction for the continue and also a jump instruction for the “then” clause of the if statement to jump over the else clause. The latter would never be executed and would have no direct effect on program execution time, but it would increase the size of the program and thus could have indirect effects. This possibility is of negligible concern in typical modern environments, where you are unlikely to encounter such a rudimentary compiler.
No, there's no speed advantage to using continue here. Both of your snippets are equivalent, and even without optimizations they produce the same machine code.
However, sometimes continue can make your code a lot more efficient, if you have structured your loop in a specific way, e.g.
This:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
        continue;
    }
    if (huge_computation_but_always_false_when_multiple_of_4(i)) {
        // do stuff
    }
}
is a lot more efficient than:
int even_sum = 0;
for (int i = 0; i < 200; i++) {
    if (i % 4 == 0) {
        even_sum += i;
    }
    if (huge_computation_but_always_false_when_multiple_of_4(i)) {
        // do stuff
    }
}
because the former doesn't have to call huge_computation_but_always_false_when_multiple_of_4() on the iterations where i is a multiple of 4.
So even though both of these codes would always produce the same result (given that huge_computation_but_always_false_when_multiple_of_4() has no side effects), the first one, which uses continue, would be a lot faster.
I'm programming some ring buffers, and this question has come to me several times.
Suppose we have a counter and we need to reset it after a certain count.
I've seen several examples of ring buffers (mostly audio, wrapping around r/w pointers) that do this:
x++;
if (x == SOME_NUMBER) { // resetting counter
    x -= x;
}
Is there any difference/preference compared to doing this:
x++;
if (x == SOME_NUMBER) { // resetting counter
    x = 0;
}
?
This question applies to almost all kinds of variable resets. In my case, besides ring buffers, I'm also resetting a counter that computes an average, so after I've taken all my measurements, I reset that counter.
Besides the fact that the result may be the same (x resetting to zero), there may be some difference between one approach and the other. Is there any preference?
Consider these slightly modified versions of your snippets:
void f(int n)
{
    int x = 0;
    for (;;)
    {
        ++x;
        if (x == n) { // resetting counter
            x -= x;
        }
        // ending condition to avoid UB
        if (x == 42)
            return;
    }
}
void g(int n)
{
    int x = 0;
    for (;;)
    {
        ++x;
        if (x == n) {
            x = 0;
        }
        if (x == 42)
            return;
    }
}
If you look at the generated assembly (e.g. using Compiler Explorer) you'll notice how modern optimizing compilers can take advantage of the as-if rule.
Clang (with -O2) generates the same machine code for both functions. It uses
xor eax, eax
to load a zero into a register, and then
cmove ecx, eax
to "reset" the other register when needed.
GCC just compiles f(), and then g() becomes
jmp f(int)
That said
Is there any preference?
A common guideline is to write the more readable and maintainable code and to explore possible optimizations only after having profiled it.
In most cases I'd use the x = 0; version, because it conveys the intent better, IMHO. I can only think of a couple of reasons to adopt the x -= x; one:
It does not rely on "magic numbers". However, that concern would apply to the 42 literal in my snippet too; 0 is an exceptional case.
It doesn't need any implicit conversion. Consider any case where x is not an int.
There may be some architectures/toolchains where it actually delivers faster code. I can't think of any, but that's immaterial.
The difference is in the number of operations: x -= x is a subtraction plus an assignment, whereas x = 0 is just an assignment. Beyond the number of CPU cycles, this affects behavior if x is accessible from other threads.
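For instance (a sketch I'm adding for illustration, not from the original answer): with std::atomic<int>, x = 0 is a single atomic store, while x -= x compiles to an atomic load followed by a separate atomic fetch_sub, so another thread can slip in between the two:
#include <atomic>

std::atomic<int> x{5};

void reset_store()    { x = 0; }  // one atomic store
void reset_subtract() { x -= x; } // x.fetch_sub(x.load()): two distinct atomic
                                  // operations, not one; another thread may
                                  // modify x between the load and the subtract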
A simple assignment x = 0 is much clearer as well IMO.
In the following two code snippets, is there actually any difference in the speed of compiling or running?
for (int i = 0; i < 50; i++)
{
    if (i % 3 == 0)
        continue;
    printf("Yay");
}
and
for (int i = 0; i < 50; i++)
{
    if (i % 3 != 0)
        printf("Yay");
}
Personally, in situations where there is a lot more than a print statement, I've been using the first method to reduce the amount of indentation for the contained code. I've been wondering for a while, so I figured it was about time to ask whether it actually has any effect beyond the visual one.
Reply to Alf (I couldn't get code working in comments...):
More accurate to my usage is something along the lines of a "handleObjectMovement" function which would include
for each object
    if object position is static
        continue
    deal with velocity and jazz
compared with
for each object
    if object position is not static
        deal with velocity and jazz
Hence me not using return. Essentially "if it's not relevant to this iteration, move on"
The behaviour is the same, so the runtime speed should be the same unless the compiler does something stupid (or unless you disable optimisation).
It's impossible to say whether there's a difference in compilation speed, since it depends on the details of how the compiler parses, analyses and translates the two variations.
If speed is important, measure it.
If you know which branch of the condition has the higher probability, you may use GCC's likely/unlikely macros.
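These aren't built into the language; they are conventionally defined on top of GCC's __builtin_expect, roughly like this (a sketch; whether the hint actually helps here is exactly the kind of thing to measure):
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

for (int i = 0; i < 50; i++)
{
    if (unlikely(i % 3 == 0)) // hint: taken only one iteration in three
        continue;
    printf("Yay");
}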
How about getting rid of the check altogether? For t = 0 to 32, i = t + (t >> 1) + 1 enumerates exactly the 33 numbers below 50 that are not multiples of 3:
for (int t = 0; t < 33; t++)
{
    int i = t + (t >> 1) + 1; // yields 1, 2, 4, 5, 7, 8, 10, ...
    printf("%d\n", i);
}
Would there be any noticeable speed difference between these two snippets of code? Naively, I think the second snippet would be faster because branch instructions are encountered far less often, but on the other hand the branch predictor should solve this problem. Or will it have noticeable overhead despite the predictable pattern? Assume that no conditional move instruction is used.
Snippet 1:
for (int i = 0; i < 100; i++) {
    if (a == 3)
        output[i] = 1;
    else
        output[i] = 0;
}
Snippet 2:
if (a == 3) {
    for (int i = 0; i < 100; i++)
        output[i] = 1;
} else {
    for (int i = 0; i < 100; i++)
        output[i] = 0;
}
I'm not intending to optimise these cases myself, but I would like to know more about the overhead of branches even with a predictable pattern.
Since a remains unchanged once you enter the loop, there shouldn't be much difference between the two code snippets.
Personally, I would prefer the former, unless the branch predictor fails to predict the branch, which is really unlikely given that a remains unchanged in the loop.
Moreover, the compiler may perform loop unswitching, hoisting the loop-invariant condition out of the loop and effectively turning Snippet 1 into Snippet 2, thereby making both code snippets emit exactly the same machine instructions.
You asked a performance question without specifying hardware (although from the question we can infer that it's one of the architectures that have branch prediction), toolchain, or compile options.
Overall, this is just another space vs speed tradeoff, where space often itself affects speed (CPU instruction and microcode caches).
The only reasonable answer is "Performance will vary depending on processor hardware and compiler optimizations."
Just calculating the sum of two arrays, with a slight modification of the code:
int main()
{
    int a[10000] = {0}; // initialize something
    int b[10000] = {0}; // initialize something
    int sumA = 0, sumB = 0;

    for (int i = 0; i < 10000; i++)
    {
        sumA += a[i];
        sumB += b[i];
    }

    printf("%d %d", sumA, sumB);
}
OR
int main()
{
    int a[10000] = {0}; // initialize something
    int b[10000] = {0}; // initialize something
    int sumA = 0, sumB = 0;

    for (int i = 0; i < 10000; i++)
    {
        sumA += a[i];
    }
    for (int i = 0; i < 10000; i++)
    {
        sumB += b[i];
    }

    printf("%d %d", sumA, sumB);
}
Which code will be faster?
There is only one way to know, and that is to test and measure. You need to work out where your bottleneck is (CPU, memory bandwidth, etc.).
The size of the data in your arrays (ints in your example) affects the result, since it has an impact on the use of the processor cache. Often you will find that example 2 is faster, which basically means your memory bandwidth is the limiting factor (example 2 accesses memory in a more efficient way).
Here's some code with timing, built using VS2005:
#include <windows.h>
#include <iostream>

using namespace std;

int main()
{
    LARGE_INTEGER
        start,
        middle,
        end;

    const int
        count = 1000000;

    int
        *a = new int [count],
        *b = new int [count],
        *c = new int [count],
        *d = new int [count],
        suma = 0,
        sumb = 0,
        sumc = 0,
        sumd = 0;

    QueryPerformanceCounter(&start);

    for (int i = 0; i < count; ++i)
    {
        suma += a[i];
        sumb += b[i];
    }

    QueryPerformanceCounter(&middle);

    for (int i = 0; i < count; ++i)
    {
        sumc += c[i];
    }

    for (int i = 0; i < count; ++i)
    {
        sumd += d[i];
    }

    QueryPerformanceCounter(&end);

    cout << "Time taken = " << (middle.QuadPart - start.QuadPart) << endl;
    cout << "Time taken = " << (end.QuadPart - middle.QuadPart) << endl;
    cout << "Done." << endl << suma << sumb << sumc << sumd;

    return 0;
}
Running this, the latter version is usually faster.
I tried writing some assembler to beat the second loop but my attempts were usually slower. So I decided to see what the compiler had generated. Here's the optimised assembler produced for the main summation loop in the second version:
00401110 mov edx,dword ptr [eax-0Ch]
00401113 add edx,dword ptr [eax-8]
00401116 add eax,14h
00401119 add edx,dword ptr [eax-18h]
0040111C add edx,dword ptr [eax-10h]
0040111F add edx,dword ptr [eax-14h]
00401122 add ebx,edx
00401124 sub ecx,1
00401127 jne main+110h (401110h)
Here's the register usage:
eax = used to index the array
ebx = the grand total
ecx = loop counter
edx = sum of the five integers accessed in one iteration of the loop
There are a few interesting things here:
The compiler has unrolled the loop five times.
The order of memory access is not contiguous.
It updates the array index in the middle of the loop.
It sums five integers then adds that to the grand total.
To really understand why this is fast, you'd need to use Intel's VTune performance analyser to see where the CPU and memory stalls are as this code is quite counter-intuitive.
In theory, due to cache optimizations, the second one should be faster.
Caches are optimized to bring in and keep chunks of data, so that the first access pulls a big chunk of the first array into the cache. In the first code, accessing the second array may force some of the first array's data out of the cache, requiring more accesses.
In practice both approaches will take more or less the same time, with the first a little better, given the size of actual caches and the likelihood that no data at all gets evicted.
Note: this sounds a lot like homework. In real life, for those sizes the first option will be slightly faster, but this only applies to this concrete example; nested loops, bigger arrays, or especially smaller cache sizes would have a significant impact on performance depending on the order.
The first one will be faster. The compiler will not need to run through the loop twice. Although it's not much work, some cycles are lost on incrementing the loop variable and checking the exit condition.
For me (GCC -O3) measuring shows that the second version is faster by some 25%, which can be explained with more efficient memory access pattern (all memory accesses are close to each other, not all over the place). Of course you'll need to repeat the operation thousands of times before the difference becomes significant.
I also tried std::accumulate from the <numeric> header, which is the simple way to implement the second version, and it was in turn a tiny amount faster than the second version (probably due to a more compiler-friendly looping mechanism?):
sumA = std::accumulate(a, a + 10000, 0);
sumB = std::accumulate(b, b + 10000, 0);
The first one will be faster because you loop from 1 to 10000 only once.
The C++ standard says nothing about it; it is implementation-dependent. It looks like you are trying to do premature optimization. It shouldn't bother you unless it's a bottleneck in your program. If it is, you should use a profiler to find out which one is faster on your platform.
Until then, I'd prefer the first variant because it looks more readable (or, better, std::accumulate).
If the arrays are large enough that the cache cannot hold both at once (as in example 1) but can hold a single one (as in example 2), then the code of the first example will be slower than the code of the second example.
Otherwise, the code of the first example will be faster.
The first one will probably be faster. The memory access pattern will allow the (modern) CPU to manage the caches efficiently (prefetch), even while accessing two arrays.
Much faster if your CPU allows it and the arrays are aligned: use SSE2 instructions to process 4 ints at a time.
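A hedged sketch of that idea using SSE2 intrinsics (packed 32-bit adds have been available since SSE2's PADDD; this assumes 16-byte-aligned arrays and a count divisible by 4):
#include <emmintrin.h>

int sumArray(const int *a, int count) // count % 4 == 0, a 16-byte aligned
{
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < count; i += 4)
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(a + i)));

    // horizontal sum of the four 32-bit lanes
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}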
If you meant a[i] instead of a[10000] (and for b, respectively), and if your compiler performs loop distribution optimizations, the first one will be exactly the same as the second. If not, the second will perform slightly better.
If a[10000] is intended, then both loops will perform exactly the same (with trivial cache and flow optimizations).
Food for thought for some answers that were voted up: how many additions are performed in each version of the code?