I was messing around with the worst code I could write (basically trying to break things), and I noticed that this piece of code:
for(int i = 0; i < N; ++i)
    tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
std::cout << x;
where N is a global variable runs significantly slower than:
int N = 10000;
for(int i = 0; i < N; ++i)
    tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
std::cout << x;
What happens with a global variable that makes it run slower?
tl;dr: The local version keeps N in a register; the global version doesn't. Declare constants with const and the code will be fast no matter where you declare them.
Here's the example code I used:
#include <iostream>
#include <math.h>

void first(){
    int x = 1;
    int N = 10000;
    for(int i = 0; i < N; ++i)
        tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
    std::cout << x;
}

int N = 10000;

void second(){
    int x = 1;
    for(int i = 0; i < N; ++i)
        tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
    std::cout << x;
}

int main(){
    first();
    second();
}
(named test.cpp).
To look at the assembler code generated I ran g++ -S test.cpp.
I got a huge file but with some smart searching (I searched for tan), I found what I wanted:
from the first function:
Ltmp2:
movl $1, -4(%rbp)
movl $10000, -8(%rbp) ; N is here !!!
movl $0, -12(%rbp) ;initial value of i is here
jmp LBB1_2 ;goto the 'for' code logic
LBB1_1: ;the loop is this segment
movl -4(%rbp), %eax
cvtsi2sd %eax, %xmm0
movl -4(%rbp), %eax
addl $1, %eax
movl %eax, -4(%rbp)
callq _tan
callq _tan
callq _tan
callq _tan
callq _tan
callq _tan
callq _tan
movl -12(%rbp), %eax
addl $1, %eax
movl %eax, -12(%rbp)
LBB1_2:
movl -12(%rbp), %eax ;load i into a register
movl -8(%rbp), %ecx ;load N from its stack slot
cmpl %ecx, %eax ;comparing N and i here
jl LBB1_1 ;if less, then go into loop code
movl -4(%rbp), %eax
from the second function:
Ltmp13:
movl $1, -4(%rbp) ;x
movl $0, -8(%rbp) ;i
jmp LBB5_2
LBB5_1: ;loop is here
movl -4(%rbp), %eax
cvtsi2sd %eax, %xmm0
movl -4(%rbp), %eax
addl $1, %eax
movl %eax, -4(%rbp)
callq _tan
callq _tan
callq _tan
callq _tan
callq _tan
callq _tan
callq _tan
movl -8(%rbp), %eax
addl $1, %eax
movl %eax, -8(%rbp)
LBB5_2:
movl _N(%rip), %eax ;loading N from globals at every iteration, instead of keeping it in a register
movl -8(%rbp), %ecx
So from the assembler code you can see (or not) that, in the local version, N is kept close at hand during the entire calculation, whereas in the global version, N is reread from the global at each iteration.
I imagine the main reason this happens is that the compiler can't be sure N isn't modified behind its back, for example by another thread or by the calls made inside the loop, so it can't cache it.
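A minimal sketch of the same effect (touch() is a hypothetical function, assumed to be defined in another translation unit so the optimizer can't see into it):

int N = 10000;
void touch();   // opaque to the compiler: it could legally modify N

int count(){
    int total = 0;
    for(int i = 0; i < N; ++i){   // N must be reloaded every iteration,
        touch();                  // because touch() might have changed it
        ++total;
    }
    return total;
}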
If you add const to the declaration of N (const int N = 10000), though, it'll be even faster than the local version:
movl -8(%rbp), %eax
addl $1, %eax
movl %eax, -8(%rbp)
LBB5_2:
movl -8(%rbp), %eax
cmpl $9999, %eax ;the compiler rewrote i < 10000 as i <= 9999, hence the jle below
jle LBB5_1
N is replaced by a constant, so the comparison no longer touches memory at all.
The non-const global version can't be optimized to keep N in a register.
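For completeness, a minimal sketch of the const variant (same harness and includes as the functions above):

const int N = 10000;   // the compiler may fold this straight into the comparison

void third(){
    int x = 1;
    for(int i = 0; i < N; ++i)
        tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
    std::cout << x;
}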
I did a little experiment with the question and with #rtpg's answer,
experimenting with the question
In the file main1.h, the global variable N:
int N = 10000;
Then, in the main1.c file, 1000 runs of the comparison:
#include <stdio.h>
#include <sys/time.h>
#include <math.h>
#include "main1.h"

extern int N;

int main(){
    int k = 0;
    struct timeval static_start, static_stop;
    int x = 0;
    int y = 0;
    struct timeval start, stop;
    int M = 10000;
    while(k <= 1000){
        gettimeofday(&static_start, NULL);
        for (int i = 0; i < N; ++i){
            tan(tan(tan(tan(tan(tan(tan(tan(x++))))))));
        }
        gettimeofday(&static_stop, NULL);

        gettimeofday(&start, NULL);
        for (int j = 0; j < M; ++j){
            tan(tan(tan(tan(tan(tan(tan(tan(y++))))))));
        }
        gettimeofday(&stop, NULL);

        int first_interval = static_stop.tv_usec - static_start.tv_usec;
        int last_interval = stop.tv_usec - start.tv_usec;
        if(first_interval >= 0 && last_interval >= 0){
            printf("%d, %d\n", first_interval, last_interval);
        }
        k++;
    }
    return 0;
}
The results are shown in the following histogram (frequency/microseconds):
The red boxes are the for loop bounded by N (the global), and the transparent green boxes the for loop bounded by M (the local).
There is evidence to suspect that the extern global variable is a little slower.
experimenting with the answer
#rtpg's reasoning is very strong. In this sense, a global variable could be slower.
Speed of accessing local vs. global variables in gcc/g++ at different optimization levels
To test this premise, I used a register-bound global variable to test the performance.
This was my main1.h with global variable
int N asm ("myN") = 10000;
The new results histogram:
Conclusion: there is a performance improvement when the global variable is kept in a register. There is no inherent "global" or "local" variable problem; the performance depends on how the variable is accessed.
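One caveat worth noting: int N asm ("myN") only renames N's assembler symbol. A true global register variable is a separate GCC extension, sketched below; the register choice is platform-specific, and such a variable cannot have an initializer, so it must be assigned at run time:

// GCC extension (x86-64 shown): permanently reserve r13 for N.
// No initializer is allowed here; assign N = 10000; early in main() instead.
register int N asm ("r13");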
I'm assuming the optimizer doesn't know the contents of the tan function when compiling the above code.
I.e., what tan does is unknown -- all the compiler knows is to push arguments onto the stack, jump to some address, then clean up the stack afterwards.
In the global variable case, the compiler doesn't know what tan does to N. In the local case, there are no "loose" pointers or references to N that tan could legitimately get at: so the compiler knows what values N will take.
The compiler can also flatten the loop -- anywhere from completely (one flat block of 10000 lines), to partially (a 100-iteration loop, each iteration with 100 lines), to not at all (a 10000-iteration loop of 1 line each), or anything in between.
The compiler knows way more when your variables are local, because when they are global it has very little knowledge about how they change, or who reads them. So few assumptions can be made.
Amusingly, this is also why it is hard for humans to reason about globals.
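A minimal sketch of that distinction (mystery() is a hypothetical opaque function, assumed defined in another translation unit):

void mystery(int* p);   // might remember the pointer and write through it later

int f(){
    int n = 10000;       // n's address never escapes, so the compiler
    int total = 0;       // knows its exact value throughout the loop
    for(int i = 0; i < n; ++i) total += i;
    return total;
}

int g(){
    int n = 10000;
    mystery(&n);         // n's address escapes here...
    int total = 0;
    for(int i = 0; i < n; ++i){
        mystery(nullptr);   // ...so every opaque call may modify n,
        total += i;         // forcing a reload of n on each iteration
    }
    return total;
}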
I think this may be a reason:
Global variables are stored in static (data segment) memory rather than on the stack, so your code needs a memory access each time instead of working from a register.
That may be why the code runs slower.
Related
What is the cleanest way to write a function (not a procedure)?
Is the 2nd solution considered to have a "side effect"?
struct myArea
{
int t[10][10]; // it could be 100x100...
};
Solution 1 : pass by value
double mySum1(myArea a)
{
// compute and return the sum of elements
}
Solution 2 : pass by const reference
double mySum2(const myArea & a)
{
// compute and return the sum of elements
}
My preferred one is the first (a clean function), although it is less efficient: when there is a lot of data to copy, it can be time-consuming.
Thank you for your feedback.
I have a number of quibbles with your terminology:
There's no such thing as a "procedure" in C or C++. At best, there are functions that return no value: "void"
Your example has no "side effect".
I'm not sure what you mean by "clean function" ... but I HOPE you don't mean "less source == cleaner code". Nothing could be further from the truth :(
TO ANSWER YOUR ORIGINAL QUESTION:
In your example, double mySum1(myArea a) incurs the space and CPU overhead of a COMPLETELY UNNECESSARY COPY. Don't do it :)
To my mind, double mySum1(myArea & a) or double mySum1(myArea * a) are equivalent. Personally, I'd prefer double mySum1(myArea * a) ... but most C++ developers would (rightly!) prefer double mySum1(myArea & a).
double mySum1 (const myArea & a) is best of all: it has the runtime efficiency of 2), and it signals your intent that it WON'T modify the array.
PS:
I generated assembly output from the following test:
struct myArea {
int t[10][10];
};
double mySum1(myArea a) {
double sum = 0.0;
for (int i=0; i < 10; i++)
for (int j=0; j<10; j++)
sum += a.t[i][j];
return sum;
}
double mySum2(myArea & a) {
double sum = 0.0;
for (int i=0; i < 10; i++)
for (int j=0; j<10; j++)
sum += a.t[i][j];
return sum;
}
double mySum3(myArea * a) {
double sum = 0.0;
for (int i=0; i < 10; i++)
for (int j=0; j<10; j++)
sum += a->t[i][j];
return sum;
}
double mySum4(const myArea & a) {
double sum = 0.0;
for (int i=0; i < 10; i++)
for (int j=0; j<10; j++)
sum += a.t[i][j];
return sum;
}
mySum1, as you'd expect, had extra code to do the extra copy.
The output for mySum2, mySum3 and mySum4, however, was IDENTICAL:
_Z6mySum2R6myArea:
.LFB1:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %rbx
movq %rdi, -32(%rbp)
movl $0, %eax
movq %rax, -24(%rbp)
movl $0, -16(%rbp)
jmp .L8
.cfi_offset 3, -24
.L11:
movl $0, -12(%rbp)
jmp .L9
.L10:
movl -16(%rbp), %eax
movl -12(%rbp), %edx
movq -32(%rbp), %rcx
movslq %edx, %rbx
movslq %eax, %rdx
movq %rdx, %rax
salq $2, %rax
addq %rdx, %rax
addq %rax, %rax
addq %rbx, %rax
movl (%rcx,%rax,4), %eax
cvtsi2sd %eax, %xmm0
movsd -24(%rbp), %xmm1
addsd %xmm1, %xmm0
movsd %xmm0, -24(%rbp)
addl $1, -12(%rbp)
.L9:
cmpl $9, -12(%rbp)
setle %al
testb %al, %al
jne .L10
addl $1, -16(%rbp)
.L8:
cmpl $9, -16(%rbp)
setle %al
testb %al, %al
jne .L11
movq -24(%rbp), %rax
movq %rax, -40(%rbp)
movsd -40(%rbp), %xmm0
popq %rbx
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
mySum3 and mySum4 had different labels ... but identical instructions!
It's also worth noting that one of the benefits of "const" is that it can help the compiler perform several different kinds of optimizations, whenever possible. For example:
What kind of optimization does const offer in C/C++? (if any)
The C++ 'const' Declaration: Why & How
Please note there is no such thing as "procedure" in C++. Functions that don't return anything are still functions.
Now to the question: If your parameter is an output parameter or an input/output parameter, that is, you want the caller to see changes made inside the function to the object passed to it, then pass by reference. Otherwise if the type is small/trivially cheap to copy, pass by value. Otherwise, pass by reference to const. In your case I'd pass by reference to const.
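A sketch of those three cases (the signatures are illustrative, not from the question):

#include <vector>

struct myArea { int t[10][10]; };

void fill(std::vector<int>& out);        // caller must see the changes: non-const reference
double scale(double x, double factor);   // small, trivially cheap to copy: by value
double mySum(const myArea& a);           // large, read-only input: reference to const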
Parameter-passing is not a side effect in itself.
If a function does anything observable that isn't just returning a value, that would be a side effect.
(For example modifying a reference parameter, printing something, modifying any global state...)
That is, even if you pass by non-const reference, the presence of side effects depends on whether you modify the referenced object.
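A minimal sketch of that distinction, using the myArea struct from the question (the function names are hypothetical):

double sum(const myArea& a);   // only reads a and returns a value: no side effects

void zero(myArea& a){
    a.t[0][0] = 0;             // modifies the caller's object: a side effect
}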
This question already has answers here: Efficiency: arrays vs pointers (14 answers). Closed 7 years ago.
Is it faster to do something like
for ( int * pa(arr), * pb(arr+n); pa != pb; ++pa )
{
// do something with *pa
}
than
for ( size_t k = 0; k < n; ++k )
{
// do something with arr[k]
}
???
I understand that arr[k] is equivalent to *(arr+k), but in the first method you are using the current pointer, which has been incremented by 1, while in the second case you are using a pointer which is offset from arr by successively larger amounts. Maybe hardware has special ways of incrementing by 1 and so the first method is faster? Or not? Just curious. Hope my question makes sense.
If the compiler is smart enough (and most compilers are), then the performance of both loops should be roughly equal.
For example, I compiled the following code with gcc 5.1.0, generating assembly:
int __attribute__ ((noinline)) compute1(int* arr, int n)
{
int sum = 0;
for(int i = 0; i < n; ++i)
{
sum += arr[i];
}
return sum;
}
int __attribute__ ((noinline)) compute2(int* arr, int n)
{
int sum = 0;
for(int * pa(arr), * pb(arr+n); pa != pb; ++pa)
{
sum += *pa;
}
return sum;
}
And the resulting assembly is:
compute1(int*, int):
testl %esi, %esi
jle .L4
leal -1(%rsi), %eax
leaq 4(%rdi,%rax,4), %rdx
xorl %eax, %eax
.L3:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdx, %rdi
jne .L3
rep ret
.L4:
xorl %eax, %eax
ret
compute2(int*, int):
movslq %esi, %rsi
xorl %eax, %eax
leaq (%rdi,%rsi,4), %rdx
cmpq %rdx, %rdi
je .L10
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
.L10:
rep ret
main:
xorl %eax, %eax
ret
As you can see, the heaviest part (the loop) of both functions is identical:
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
But in more complex examples, or with another compiler, the results might differ. So you should test and measure; still, most compilers generate similar code.
The full code sample: https://goo.gl/mpqSS0
This cannot be answered. It depends on your compiler AND on your machine.
A very naive compiler would translate the code as is to machine code. Most machines indeed provide an increment operation that is very fast. They normally also provide relative addressing for an address with an offset. This could take a few cycles more than absolute addressing. So, yes, the version with pointers could potentially be faster.
But take into account that every machine is different AND that compilers are allowed to optimize as long as the observable behavior of your program doesn't change. Given that, I would suggest a reasonable compiler will create code from both versions that doesn't differ in performance.
Any reasonable compiler will generate identical code inside the loop for these two choices. I looked at the code generated for iterating over a std::vector, using a for loop with an integer index versus a for( auto i: vec ) style construct [std::vector internally keeps two pointers to the begin and end of the stored values, much like your pa and pb]. Both gcc and clang generate identical code inside the loop itself [the exact details of the loop differ subtly between the compilers, but other than that, there's no difference]. The setup of the loop was subtly different, but unless you OFTEN run loops of fewer than 5 items [and if so, why do you worry?], the actual content of the loop is what matters, not the bit just before it.
As with ALL code where performance is important, the exact code, compiler make and version, compiler options, processor make and model, will make a difference to how the code performs. But for the vast majority of processors and compilers, I'd expect no measurable difference. If the code is really critical, measure different alternatives and see what works best in your case.
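A sketch of the two shapes being compared (with optimization enabled, both typically compile to the same pointer-bumping loop):

#include <cstddef>
#include <vector>

int sum_indexed(const std::vector<int>& v){
    int s = 0;
    for(std::size_t k = 0; k < v.size(); ++k) s += v[k];   // index form
    return s;
}

int sum_ranged(const std::vector<int>& v){
    int s = 0;
    for(int x : v) s += x;   // range form: built on begin()/end(), like pa/pb
    return s;
}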
Very much like this question, except that instead of vector<int> I have vector<struct myType>.
If I want to reset (or for that matter, set to some value) myType.myVar for every element in the vector, what's the most efficient method?
Right now I'm iterating through:
for(int i=0; i<myVec.size(); i++) myVec.at(i).myVar = 0;
But since vectors are guaranteed to be stored contiguously, there's surely a better way?
Resetting will need to traverse every element of the vector, so it will need at least O(n) complexity. Your current algorithm takes O(n).
In this particular case you can use operator[] instead of at (that might throw an exception). But I doubt that's the bottleneck of your application.
On this note you should probably use std::fill; note that for a vector<myType> the fill value must itself be a myType, which resets whole elements rather than a single member:
std::fill(myVec.begin(), myVec.end(), myType{});
But unless you want to go byte level and set a chunk of memory to 0, which will not only cause you headaches but will also make you lose portability in most cases, there's nothing to improve here.
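To reset just the one member of each element (which std::fill alone can't do), a sketch with std::for_each, using myType/myVar as in the question:

#include <algorithm>
#include <vector>

struct myType { int myVar; /* ...other members are left untouched... */ };

void reset(std::vector<myType>& myVec){
    std::for_each(myVec.begin(), myVec.end(),
                  [](myType& e){ e.myVar = 0; });   // one store per element
}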
Instead of the below code
for(int i=0; i<myVec.size(); i++) myVec.at(i).myVar = 0;
do it as follows:
size_t sz = myVec.size();
for(size_t i = 0; i < sz; ++i) myVec[i].myVar = 0;
As the "at" method internally checks whether index is out of range or not. But as your loop index is taking care(myVec.size()), you can avoid the extra check. Otherwise this is the fastest way to do it.
EDIT
In addition to that, we can store the size() of the vector before executing the for loop. This ensures there are no further calls to size() inside the loop.
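Equivalently, an iterator-based sketch avoids indexing altogether; end() is hoisted out of the loop just like sz above:

for(std::vector<myType>::iterator it = myVec.begin(), end = myVec.end();
    it != end; ++it)
    it->myVar = 0;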
One of the fastest ways would be to perform loop unwinding and break the speed limit posed by conventional for loops that cause great cache spills. In your case, as it's a run-time thing, there's no way to apply template metaprogramming, so a variation on good old Duff's device will do the trick:
#include <iostream>
#include <vector>
using namespace std;
struct myStruct {
int a;
double b;
};
int main()
{
std::vector<myStruct> mV(20);
double val(20); // the new value you want to reset to
int n = (mV.size()+7) / 8; // 8 is the size of the unrolled block (assumes mV is non-empty)
auto to = mV.begin();
switch(mV.size()%8)
{
case 0: do { (*to++).b = val;
case 7: (*to++).b = val;
case 6: (*to++).b = val;
case 5: (*to++).b = val;
case 4: (*to++).b = val;
case 3: (*to++).b = val;
case 2: (*to++).b = val;
case 1: (*to++).b = val;
} while (--n>0);
}
// just printing to verify that the value is set
for (auto i : mV) std::cout << i.b << std::endl;
return 0;
}
Here I chose to perform an 8-block unwind, to reset (say) the value b of a myStruct structure. The block size can be tweaked, and the loop is effectively unrolled. Remember this is the underlying technique in memcpy, and loop unrolling in general is one of the optimizations a compiler will attempt (actually they're quite good at this, so we might as well let them do their job).
In addition to what's been said before, you should consider that if you turn on optimizations, the compiler will likely perform loop-unrolling which will make the loop itself faster.
Also, pre-increment (++i) can take a few instructions fewer than post-increment (i++).
Explanation here
Beware of spending a lot of time thinking about optimization details which the compiler will just take care of for you.
Here are four implementations of what I understand the OP to be asking, along with the code generated using gcc 4.8 with --std=c++11 -O3 -S
Declarations:
#include <algorithm>
#include <vector>
struct T {
int irrelevant;
int relevant;
double trailing;
};
Explicit loop implementations, roughly from the answers and comments provided to the OP. Both produced identical machine code, aside from labels.
void clear_relevant(std::vector<T>* vecp) {
    for(unsigned i=0; i<vecp->size(); i++) {
        vecp->at(i).relevant = 0;
    }
}

void clear_relevant2(std::vector<T>* vecp) {
    std::vector<T>& vec = *vecp;
    auto s = vec.size();
    for (unsigned i = 0; i < s; ++i) {
        vec[i].relevant = 0;
    }
}

The common generated assembly:

.cfi_startproc
movq (%rdi), %rsi
movq 8(%rdi), %rcx
xorl %edx, %edx
xorl %eax, %eax
subq %rsi, %rcx
sarq $4, %rcx
testq %rcx, %rcx
je .L1
.p2align 4,,10
.p2align 3
.L5:
salq $4, %rdx
addl $1, %eax
movl $0, 4(%rsi,%rdx)
movl %eax, %edx
cmpq %rcx, %rdx
jb .L5
.L1:
rep ret
.cfi_endproc
Two other versions, one using std::for_each and the other using the range-for syntax. Here there is a subtle difference in the code for the two versions (other than the labels):
void clear_relevant3(std::vector<T>* vecp) {
    for (auto& p : *vecp) p.relevant = 0;
}

.cfi_startproc
movq 8(%rdi), %rdx
movq (%rdi), %rax
cmpq %rax, %rdx
je .L17
.p2align 4,,10
.p2align 3
.L21:
movl $0, 4(%rax)
addq $16, %rax
cmpq %rax, %rdx
jne .L21
.L17:
rep ret
.cfi_endproc
void clear_relevant4(std::vector<T>* vecp) {
    std::for_each(vecp->begin(), vecp->end(),
                  [](T& o){o.relevant=0;});
}

.cfi_startproc
movq 8(%rdi), %rdx
movq (%rdi), %rax
cmpq %rdx, %rax
je .L12
.p2align 4,,10
.p2align 3
.L16:
movl $0, 4(%rax)
addq $16, %rax
cmpq %rax, %rdx
jne .L16
.L12:
rep ret
.cfi_endproc
Should one use dynamic memory allocation when one knows that a variable will not be needed before it goes out of scope?
For example in the following function:
void func(){
    int i = 56;
    // do something with i; i is not needed past this point
    for(int t = 0; t < 1000000; t++){
        // code
    }
}
say one only needs i for a small section of the function; is it worthwhile deleting i, given that it is not needed during the very long for loop?
As Borgleader said:

A) This is micro (and most probably premature) optimization, meaning don't worry about it. B) In this particular case, dynamically allocating i might even hurt performance. tl;dr: profile first, optimize later
As an example, I compiled the following two programs into assembly (using the g++ -S flag with no optimisation enabled).
Creating i on the stack:
int main(void)
{
int i = 56;
i += 5;
for(int t = 0; t<1000; t++) {}
return 0;
}
Dynamically:
int main(void)
{
int* i = new int(56);
*i += 5;
delete i;
for(int t = 0; t<1000; t++) {}
return 0;
}
The first program compiled to:
movl $56, -8(%rbp) # Store 56 on stack (int i = 56)
addl $5, -8(%rbp) # Add 5 to i (i += 5)
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
And the second:
subq $16, %rsp # Allocate memory (new)
movl $4, %edi
call _Znwm
movl $56, (%rax) # Store 56 in *i
movq %rax, -16(%rbp)
movq -16(%rbp), %rax # Add 5
movl (%rax), %eax
leal 5(%rax), %edx
movq -16(%rbp), %rax
movl %edx, (%rax)
movq -16(%rbp), %rax # Free memory (delete)
movq %rax, %rdi
call _ZdlPv
movl $0, -4(%rbp) # Initialize loop index (int t = 0)
jmp .L2 # Begin loop (goto .L2.)
.L3:
addl $1, -4(%rbp) # Increment index (t++)
.L2:
cmpl $999, -4(%rbp) # Check loop condition (t<1000)
setle %al
testb %al, %al
jne .L3 # If (t<1000) goto .L3.
movl $0, %eax # return 0
In the above assembly output, you can see straight away that there is a significant difference in the number of instructions executed. If I compile the same programs with optimisation turned on, the first program produces:
xorl %eax, %eax # Equivalent to return 0;
The second produced:
movl $4, %edi
call _Znwm
movl $61, (%rax) # A smart compiler knows 56+5 = 61
movq %rax, %rdi
call _ZdlPv
xorl %eax, %eax
addq $8, %rsp
With optimisation on, the compiler becomes a pretty powerful tool for improving your code; in certain cases it can even detect that a program only returns 0 and strip out all the unnecessary code. When you use dynamic memory as in the code above, the program still has to request and then free the dynamic memory; it can't be optimised out.
That is, what is the fastest way to do the test
if( a >= ( b - c > 0 ? b - c : 0 ) &&
a <= ( b + c < 255 ? b + c : 255 ) )
...
if a, b, and c are all unsigned char, aka BYTE. I am trying to optimize an image-scanning process to find a sub-image, and a comparison such as this is done about 3 million times per scan, so even minor optimizations could be helpful.
Not sure, but maybe some sort of bitwise operation? Maybe adding 1 to c and testing for less-than and greater-than without the or-equal-to part? I don't know!
Well, first of all let's see what you are trying to check without all kinds of over/underflow checks:
a >= b - c
a <= b + c
subtract b from both:
a - b >= -c
a - b <= c
Now that is equal to
abs(a - b) <= c
And in code:
(a>b ? a-b : b-a) <= c
Now, this code is a tad faster and doesn't contain (or need) complicated underflow/overflow checks.
I've profiled mine and 6502's code with 1000000000 repetitions and there officially was no difference whatsoever. I would suggest picking the most elegant solution (which is IMO mine, but opinions differ), since performance is not an argument here.
However, there was a notable difference between my code and the asker's. This is the profiling code I used:
#include <iostream>

int main(int argc, char *argv[]) {
    bool prevent_opti = false;
    for (int ai = 0; ai < 256; ++ai) {
        for (int bi = 0; bi < 256; ++bi) {
            for (int ci = 0; ci < 256; ++ci) {
                unsigned char a = ai;
                unsigned char b = bi;
                unsigned char c = ci;
                if ((a>b ? a-b : b-a) <= c) prevent_opti = true;
            }
        }
    }
    std::cout << prevent_opti << "\n";
    return 0;
}
With my if statement this took 120 ms on average, and the asker's if statement took 135 ms on average.
I think you will get the best performance by writing it in the clearest way possible and then turning on the compiler's optimizer. The compiler is rather good at this kind of optimization and will beat you most of the time (in the worst case it will equal you).
My preference would be:
int min = (b-c) > 0 ? (b-c) : 0 ;
int max = (b+c) < 255? (b+c) : 255;
if ((a >= min) && ( a<= max))
The original code (in assembly):
movl %eax, %ecx
movl %ebx, %eax
subl %ecx, %eax
movl $0, %edx
cmovs %edx, %eax
cmpl %eax, %r12d
jl L13
leal (%rcx,%rbx), %eax
cmpl $255, %eax
movb $-1, %dl
cmovg %edx, %eax
cmpl %eax, %r12d
jmp L13
My code (in assembly):
movl %eax, %ecx
movl %ebx, %eax
subl %ecx, %eax
movl $0, %edx
cmovs %edx, %eax
cmpl %eax, %r12d
jl L13
leal (%rcx,%rbx), %eax
cmpl $255, %eax
movb $-1, %dl
cmovg %edx, %eax
cmpl %eax, %r12d
jg L13
nightcracker's code (in assembly):
movl %r12d, %edx
subl %ebx, %edx
movl %ebx, %ecx
subl %r12d, %ecx
cmpl %ebx, %r12d
cmovle %ecx, %edx
cmpl %eax, %edx
jg L16
Just using plain ints for a, b and c will allow you to change the code to the simpler
if (a >= b - c && a <= b + c) ...
Also, as an alternative, 256*256*256 is just 16M and a map of 16M bits is 2 MBytes. This means that it's feasible to use a lookup table like
int index = (a<<16) + (b<<8) + c;
if (lookup_table[index>>3] & (1<<(index&7))) ...
but I think that the cache thrashing will make this much slower, even if modern processors hate conditionals...
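For illustration, a sketch of how such a table could be built once up front (build_lookup is a hypothetical helper; the bit layout matches the index expression above):

#include <vector>

// Bit (a<<16)+(b<<8)+c is set iff a lies within [b-c, b+c] clamped to
// 0..255, i.e. iff |a-b| <= c.
std::vector<unsigned char> build_lookup(){
    std::vector<unsigned char> table(256*256*256/8, 0);   // 2 MB of bits
    for(int a = 0; a < 256; ++a)
        for(int b = 0; b < 256; ++b)
            for(int c = 0; c < 256; ++c)
                if((a > b ? a - b : b - a) <= c){
                    int index = (a << 16) + (b << 8) + c;
                    table[index >> 3] |= 1 << (index & 7);
                }
    return table;
}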
Another alternative is to use a bit of algebra
b - c <= a <= b + c
iff
- c <= a - b <= c (subtracted b from all terms)
iff
0 <= a - b + c <= 2*c (added c to all terms)
this allows using just one test
if ((unsigned)(a - b + c) <= 2*c) ...
assuming that a, b and c are plain ints. The reason is that if a - b + c is negative, unsigned arithmetic will make it much bigger than 2*c (if c is 0..255). For example, with a = 0, b = 255, c = 1, a - b + c is -254, which as a 32-bit unsigned becomes 4294967042, far bigger than 2*c = 2.
This should generate efficient machine code with a single branch if the processor has dedicated signed/unsigned comparison instructions, as x86 does (ja/jg).
#include <stdio.h>

int main()
{
    int err = 0;
    for (int ia=0; ia<256; ia++)
        for (int ib=0; ib<256; ib++)
            for (int ic=0; ic<256; ic++)
            {
                unsigned char a = ia;
                unsigned char b = ib;
                unsigned char c = ic;
                int res1 = (a >= ( b - c > 0 ? b - c : 0 ) &&
                            a <= ( b + c < 255 ? b + c : 255 ));
                int res2 = (unsigned(a - b + c) <= 2*c);
                err += (res1 != res2);
            }
    printf("Errors = %i\n", err);
    return 0;
}
On x86 with g++ the assembler code generated for the res2 test only includes one conditional instruction.
The assembler code for the following loop is
void process(unsigned char *src, unsigned char *dst, int sz)
{
    for (int i=0; i<sz; i+=3)
    {
        unsigned char a = src[i];
        unsigned char b = src[i+1];
        unsigned char c = src[i+2];
        dst[i] = (unsigned(a - b + c) <= 2*c);
    }
}
.L3:
movzbl 2(%ecx,%eax), %ebx ; This loads c
movzbl (%ecx,%eax), %edx ; This loads a
movzbl 1(%ecx,%eax), %esi ; This loads b
leal (%ebx,%edx), %edx ; This computes a + c
addl %ebx, %ebx ; This is c * 2
subl %esi, %edx ; This is a - b + c
cmpl %ebx, %edx ; Comparison
setbe (%edi,%eax) ; Set 0/1 depending on result
addl $3, %eax ; next group
cmpl %eax, 16(%ebp) ; did we finish ?
jg .L3 ; if not loop back for next
Using instead dst[i] = ((a<b ? b-a : a-b) <= c); the code becomes much longer:
.L9:
movzbl %dl, %edx
andl $255, %esi
subl %esi, %edx
.L4:
andl $255, %edi
cmpl %edi, %edx
movl 12(%ebp), %edx
setle (%edx,%eax)
addl $3, %eax
cmpl %eax, 16(%ebp)
jle .L6
.L5:
movzbl (%ecx,%eax), %edx
movb %dl, -13(%ebp)
movzbl 1(%ecx,%eax), %esi
movzbl 2(%ecx,%eax), %edi
movl %esi, %ebx
cmpb %bl, %dl
ja .L9
movl %esi, %ebx
movzbl %bl, %edx
movzbl -13(%ebp), %ebx
subl %ebx, %edx
jmp .L4
.p2align 4,,7
.p2align 3
.L6:
And I'm way too tired now to try to decipher it (2:28 AM here)
Anyway, longer doesn't necessarily mean slower (at first sight it seems g++ decided to unroll the loop by writing a few elements at a time in this case).
As I said before, you should do some actual profiling with your real computation and your real data. Note that if true performance is needed, the best strategy may differ depending on the processor.
For example, Linux during bootstrap runs a test to decide the fastest way to perform a certain computation needed in the kernel. The variables are just too many (cache size/levels, RAM speed, CPU clock, chipset, CPU type...).
Rarely does embedding the ternary operator in another statement improve performance :)
If every single opcode matters, write the opcodes yourself: use assembler. Also consider using SIMD instructions if possible. I'd also be interested in the target platform; ARM assembler loves compares of this sort and has opcodes to speed up saturated math of this type.
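As a portable illustration of the saturated-math idea (a sketch only, not ARM-specific; in_band is a hypothetical name):

#include <algorithm>

// Clamp b-c and b+c to the 0..255 byte range, then test a against the band.
inline bool in_band(unsigned char a, unsigned char b, unsigned char c){
    int lo = std::max(int(b) - int(c), 0);     // saturate the low end at 0
    int hi = std::min(int(b) + int(c), 255);   // saturate the high end at 255
    return lo <= a && a <= hi;
}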