Ordered Binary Search w/ Assembly | Recursive vs. Iterative vs. Library - c++

Which is better?
According to Savitch, each recursive call is saved on the stack in the form of an activation frame, which has overhead. However, the recursive version takes a few fewer lines of code to write. Which one is better to turn in for an interview? The code for both is below.
#include <iostream>
using namespace std;

const int SIZE = 10;
int array[ SIZE ] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
int answer = -1;  // use -1, not NULL: NULL is a null-pointer constant, not an int sentinel

// Note: neither version below handles a value that is absent from the
// array; the recursion never reaches a base case and the loop never
// exits. They work here because main() searches for a value that exists.
void binary_search_recursive( int array[], int start, int end, int value, int& answer )
{
    int mid = ( start + end ) / 2;
    if ( array[ mid ] == value )
    {
        answer = mid;
    }
    else if ( array[ mid ] < value )
    {
        binary_search_recursive( array, mid + 1, end, value, answer );
    }
    else
    {
        binary_search_recursive( array, start, mid - 1, value, answer );
    }
}

void binary_search_iterative( int array[], int start, int end, int value, int& answer )
{
    int mid = ( start + end ) / 2;
    while ( array[ mid ] != value )
    {
        if ( array[ mid ] < value )
        {
            start = mid;
            mid = ( ( mid + 1 ) + end ) / 2;
        }
        else
        {
            end = mid;
            mid = ( start + ( mid - 1 ) ) / 2;
        }
    }
    answer = mid;
}

int main()
{
    binary_search_iterative( array, 0, SIZE - 1, 4, answer );
    cout << answer;
    return 0;
}

Recursive versions of algorithms are often shorter in lines of code, but iterative versions of the same algorithm are often faster because they avoid the function-call overhead of the recursive version.
For binary search specifically, the faster implementations are iterative. Jon Bentley's published version of binary search, for example, is iterative.
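For reference, here is a sketch of the canonical iterative form (in the spirit of Bentley's published version, though not a transcription of it); unlike the code in the question, it also terminates when the value is absent:

```cpp
// Classic iterative binary search: returns the index of value in the
// sorted range a[0..n-1], or -1 if it is not present.
int binary_search(const int a[], int n, int value)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;  // written this way to avoid overflow of lo + hi
        if (a[mid] == value)
            return mid;
        else if (a[mid] < value)
            lo = mid + 1;   // value, if present, is in the upper half
        else
            hi = mid - 1;   // value, if present, is in the lower half
    }
    return -1;  // search range is empty: value not found
}
```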

In the case of binary search, recursion does not help you express your intent any better than iteration does, so an iterative approach is better.
I think the best approach for an interview would be to submit a solution that calls std::lower_bound: it shows the interviewer that you not only know some basic syntax and how to code a freshman-year algorithm, but also that you don't waste time rewriting boilerplate code.
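A lower_bound-based search might look like this (the helper name find_index is mine, just for illustration):

```cpp
#include <algorithm>
#include <vector>

// Returns the index of value in the sorted vector, or -1 if absent.
// std::lower_bound returns the first position where value could be
// inserted without breaking the ordering, so we still have to check
// that the element at that position actually equals value.
int find_index(const std::vector<int>& v, int value)
{
    auto it = std::lower_bound(v.begin(), v.end(), value);
    if (it != v.end() && *it == value)
        return static_cast<int>(it - v.begin());
    return -1;
}
```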

For an interview, I'd start by mentioning that both recursive and iterative solutions are possible and similarly trivial to write. Recursive versions have a potential issue with nested stack frames using, or even exhausting, stack memory (and faulting more pages into cache), but compilers often perform tail-call optimisation, effectively producing an iterative implementation. Recursive functions tend to be more self-evidently correct and concise, but they aren't as widely applicable in day-to-day C++ programming, so they may be a little less familiar and comfortable for maintenance programmers.
Unless there's a reason not to, in a real project I'd use std::binary_search from <algorithm> (http://www.sgi.com/tech/stl/binary_search.html).
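Note that std::binary_search only answers the membership question; it returns a bool, not a position (a small sketch, with a wrapper name of my own):

```cpp
#include <algorithm>

// std::binary_search reports presence only; use std::lower_bound or
// std::equal_range when you also need the element's position.
bool contains(const int* first, const int* last, int value)
{
    return std::binary_search(first, last, value);
}
```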
To illustrate tail recursion, your binary_search_recursive algorithm was compiled to the assembly below by g++ -O4 -S. Notes:
To get an impression of the code you don't need to understand every line, but the following helps:
movl instructions are moves (assignments) between registers and memory (the trailing "l", for "long", reflects the number of bits in the registers/memory locations).
subl, shrl, sarl, and cmpl are subtraction, shift-right, arithmetic-shift-right, and compare instructions; the important thing to note is that, as a side effect, they set a few flags (such as "equal" when they produce a 0 result) that are consulted by je (jump if equal), jge (jump if greater or equal), and jne (jump if not equal).
The answer = mid termination condition is handled at L10, while the recursive steps are handled by the code at L14 and L4 and may jump back up to L12.
Here's the disassembly of your binary_search_recursive function (the name is mangled in C++ style)...
__Z23binary_search_recursivePiiiiRi:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $4, %esp
movl 24(%ebp), %eax
movl 8(%ebp), %edi
movl 12(%ebp), %ebx
movl 16(%ebp), %ecx
movl %eax, -16(%ebp)
movl 20(%ebp), %esi
.p2align 4,,15
L12:
leal (%ebx,%ecx), %edx
movl %edx, %eax
shrl $31, %eax
leal (%edx,%eax), %eax
sarl %eax
cmpl %esi, (%edi,%eax,4)
je L10
L14:
jge L4
leal 1(%eax), %ebx
leal (%ebx,%ecx), %edx
movl %edx, %eax
shrl $31, %eax
leal (%edx,%eax), %eax
sarl %eax
cmpl %esi, (%edi,%eax,4)
jne L14
L10:
movl -16(%ebp), %ecx
movl %eax, (%ecx)
popl %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.p2align 4,,7
L4:
leal -1(%eax), %ecx
jmp L12

You use iteration if speed is an issue or if the stack size is a constraint, because, as you said, recursion calls the function repeatedly and so occupies more space on the stack. As for answering in the interview, I would go for whichever I feel is simplest to do correctly at the time, for obvious reasons :))


Is it faster to iterate through the elements of an array with pointers incremented by 1? [duplicate]

Is it faster to do something like
for ( int * pa(arr), * pb(arr+n); pa != pb; ++pa )
{
    // do something with *pa
}
than
for ( size_t k = 0; k < n; ++k )
{
    // do something with arr[k]
}
?
I understand that arr[k] is equivalent to *(arr+k), but in the first method you are using the current pointer which has incremented by 1, while in the second case you are using a pointer which is incremented from arr by successively larger numbers. Maybe hardware has special ways of incrementing by 1 and so the first method is faster? Or not? Just curious. Hope my question makes sense.
If the compiler is smart enough (and most compilers are), the performance of both loops should be roughly equal.
For example, I compiled the code with gcc 5.1.0 and generated assembly:
int __attribute__ ((noinline)) compute1(int* arr, int n)
{
    int sum = 0;
    for(int i = 0; i < n; ++i)
    {
        sum += arr[i];
    }
    return sum;
}

int __attribute__ ((noinline)) compute2(int* arr, int n)
{
    int sum = 0;
    for(int * pa(arr), * pb(arr+n); pa != pb; ++pa)
    {
        sum += *pa;
    }
    return sum;
}
And the result assembly is:
compute1(int*, int):
testl %esi, %esi
jle .L4
leal -1(%rsi), %eax
leaq 4(%rdi,%rax,4), %rdx
xorl %eax, %eax
.L3:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdx, %rdi
jne .L3
rep ret
.L4:
xorl %eax, %eax
ret
compute2(int*, int):
movslq %esi, %rsi
xorl %eax, %eax
leaq (%rdi,%rsi,4), %rdx
cmpq %rdx, %rdi
je .L10
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
.L10:
rep ret
main:
xorl %eax, %eax
ret
As you can see, the heaviest part (the loop) of both functions is identical:
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
But in more complex examples or with another compiler the results might differ, so you should test and measure. Still, most compilers generate similar code.
The full code sample: https://goo.gl/mpqSS0
This cannot be answered. It depends on your compiler AND on your machine.
A very naive compiler would translate the code as is to machine code. Most machines indeed provide an increment operation that is very fast. They normally also provide relative addressing for an address with an offset. This could take a few cycles more than absolute addressing. So, yes, the version with pointers could potentially be faster.
But take into account that every machine is different AND that compilers are allowed to optimize as long as the observable behavior of your program doesn't change. Given that, I would suggest a reasonable compiler will create code from both versions that doesn't differ in performance.
Any reasonable compiler will generate identical code inside the loop for these two choices. I looked at the code generated for iterating over a std::vector, using a for loop with an integer index versus a for( auto i: vec ) construct [std::vector internally keeps two pointers to the beginning and end of the stored values, much like your pa and pb]. Both gcc and clang generate identical code inside the loop itself [the exact details of the loop differ subtly between the compilers, but other than that there's no difference]. The setup of the loop was subtly different, but unless you OFTEN run loops of fewer than about 5 items [and if so, why worry?], the actual content of the loop is what matters, not the bit just before it.
As with ALL code where performance is important, the exact code, compiler make and version, compiler options, and processor make and model will all affect how the code performs. But for the vast majority of processors and compilers, I'd expect no measurable difference. If the code is really critical, measure different alternatives and see what works best in your case.
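A minimal harness for such measurements might look like this (a rough sketch; the helper names are mine, and serious benchmarking would also need repeated runs and guards against the compiler optimising the work away):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Two equivalent summation loops: index-based and pointer-based.
long long sum_index(const std::vector<int>& a)
{
    long long s = 0;
    for (std::size_t k = 0; k < a.size(); ++k) s += a[k];
    return s;
}

long long sum_pointer(const std::vector<int>& a)
{
    long long s = 0;
    for (const int *pa = a.data(), *pb = a.data() + a.size(); pa != pb; ++pa)
        s += *pa;
    return s;
}

// Rough wall-clock timing of a callable, in microseconds.
template <typename F>
long long time_us(F f)
{
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Time each of sum_index and sum_pointer over a large vector and compare; on most optimising compilers the numbers should be indistinguishable.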

When/why does (a < 0) potentially branch in an expression?

After reading many of the comments on this question, there are a couple people (here and here) that suggest that this code:
int val = 5;
int r = (0 < val) - (val < 0); // this line here
will cause branching. Unfortunately, none of them gives any justification or says why it would cause branching (tristopia suggests it requires a cmov-like instruction or predication, but doesn't really say why).
Are these people right in that "a comparison used in an expression will not generate a branch" is actually myth instead of fact? (assuming you're not using some esoteric processor) If so, can you give an example?
I would've thought there wouldn't be any branching (given that there's no logical "short circuiting"), and now I'm curious.
To simplify matters, consider just one part of the expression: val < 0. Essentially, this means “if val is negative, return 1, otherwise 0”; you could also write it like this:
val < 0 ? 1 : 0
How this is translated into processor instructions depends heavily on the compiler and the target processor. The easiest way to find out is to write a simple test function, like so:
int compute(int val) {
    return val < 0 ? 1 : 0;
}
and review the assembler code that is generated by the compiler (e.g., with gcc -S -o - example.c). For my machine, it does it without branching. However, if I change it to return 5 instead of 1, there are branch instructions:
...
cmpl $0, -4(%rbp)
jns .L2
movl $5, %eax
jmp .L3
.L2:
movl $0, %eax
.L3:
...
So, “a comparison used in an expression will not generate a branch” is indeed a myth. (But “a comparison used in an expression will always generate a branch” isn’t true either.)
Addition in response to this extension/clarification:
I'm asking if there's any (sane) platform/compiler for which a branch is likely. MIPS/ARM/x86(_64)/etc. All I'm looking for is one case that demonstrates that this is a realistic possibility.
That depends on what you consider a “sane” platform. If the venerable 6502 CPU family is sane, I think there is no way to calculate val > 0 on it without branching. Most modern instruction sets, on the other hand, provide some type of set-on-X instruction.
(val < 0 can actually be computed without branching even on 6502, because it can be implemented as a bit shift.)
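On two's-complement targets that shift really is all it takes; a small sketch in C++ (the helper name is mine):

```cpp
#include <cstdint>

// On two's-complement hardware, (val < 0) is just the value's sign bit.
// Shifting the value as unsigned avoids the implementation-defined
// behaviour of right-shifting a negative signed integer.
int is_negative(std::int32_t val)
{
    return static_cast<std::uint32_t>(val) >> 31;
}
```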
Empiricism for the win:
int sign(int val) {
    return (0 < val) - (val < 0);
}
compiled with optimisations. gcc (4.7.2) produces
sign:
.LFB0:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
setg %al
shrl $31, %edi
subl %edi, %eax
ret
.cfi_endproc
no branch. clang (3.2):
sign: # #sign
.cfi_startproc
# BB#0:
movl %edi, %ecx
shrl $31, %ecx
testl %edi, %edi
setg %al
movzbl %al, %eax
subl %ecx, %eax
ret
neither. (on x86_64, Core i5)
This is actually architecture-dependent. If there exists an instruction to set the value to 0/1 depending on the sign of another value, there will be no branching. If there's no such instruction, branching would be necessary.

Is it a bug in g++? [closed]

#include <stdint.h>
#include <iostream>
using namespace std;

uint32_t k[] = {0, 1, 17};

template <typename T>
bool f(T *data, int i) {
    return data[0] < (T)(1 << k[i]);
}

int main() {
    uint8_t v = 0;
    cout << f(&v, 2) << endl;
    cout << (0 < (uint8_t)(1 << 17)) << endl;
    return 0;
}
g++ a.cpp && ./a.out
1
0
Why am I getting these results?
It looks like gcc reverses the shift and applies it to the other side of the comparison, and I guess this is a bug.
In C (instead of C++) the same thing happens, and C translated to asm is easier to read, so I'm using C here; I also reduced the test case (dropping the templates and the k array).
foo() is the original buggy f() function; foo1() is what foo() actually behaves like under gcc (but shouldn't); and bar() shows what foo() should look like, apart from the pointer read.
I'm on 64-bit, but 32-bit is the same apart from the parameter handling and finding k.
#include <stdint.h>
#include <stdio.h>

uint32_t k = 17;

char foo(uint8_t *data) {
    return *data < (uint8_t)(1<<k);
    /*
    with gcc -O3 -S: (gcc version 4.7.2 (Debian 4.7.2-5))
        movzbl  (%rdi), %eax
        movl    k(%rip), %ecx
        shrb    %cl, %al
        testb   %al, %al
        sete    %al
        ret
    */
}

char foo1(uint8_t *data) {
    return (((uint32_t)*data) >> k) < 1;
    /*
        movzbl  (%rdi), %eax
        movl    k(%rip), %ecx
        shrl    %cl, %eax
        testl   %eax, %eax
        sete    %al
        ret
    */
}

char bar(uint8_t data) {
    return data < (uint8_t)(1<<k);
    /*
        movl    k(%rip), %ecx
        movl    $1, %eax
        sall    %cl, %eax
        cmpb    %al, %dil
        setb    %al
        ret
    */
}

int main() {
    uint8_t v = 0;
    printf("All should be 0: %i %i %i\n", foo(&v), foo1(&v), bar(v));
    return 0;
}
If your int is 16 bits wide, you're running into undefined behaviour and either result is "OK": shifting an N-bit integer by N or more bit positions, left or right, is undefined.
Since this happens with 32-bit ints too, it is a bug in the compiler.
Here are some more data points:
Basically, it looks like gcc optimizes (even when the -O flag is off and -g is on):
[variable] < (type-cast)(1 << [variable2])
to
((type-cast)[variable] >> [variable2]) == 0
and
[variable] >= (type-cast)(1 << [variable2])
to
((type-cast)[variable] >> [variable2]) != 0
where [variable] needs to be an array access.
I guess the advantage here is that it doesn't have to load the literal 1 into a register, which saves a register.
So here are the data points:
changing 1 to a number > 1 forces it to implement the correct version.
changing any of the variables to a literal forces it to implement the correct version
changing [variable] to a non array access forces it to implement the correct version
[variable] > (type-cast)(1 << [variable2]) implements the correct version.
I suspect this is all about trying to save a register: when [variable] is an array access, it needs to also keep an index. Someone probably thought this was clever, until it turned out to be wrong.
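To see concretely why the transformation is not equivalence-preserving for narrow types, here is my own reconstruction of the two forms side by side:

```cpp
#include <cstdint>

// What the source says: compare against (uint8_t)(1 << k).
// With k = 17, (uint8_t)(1 << 17) wraps to 0, and x < 0 is always
// false for an unsigned x.
bool original_form(std::uint8_t x, unsigned k)
{
    return x < static_cast<std::uint8_t>(1u << k);
}

// What the buggy optimisation computes instead: (x >> k) < 1.
// With k = 17 this is true for every 8-bit x, so the two forms
// disagree on every input.
bool transformed_form(std::uint8_t x, unsigned k)
{
    return (static_cast<std::uint32_t>(x) >> k) < 1u;
}
```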
Using code from the bug report http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56051
#include <stdio.h>

int main(void)
{
    int a, s = 8;
    unsigned char data[1] = {0};
    a = data[0] < (unsigned char) (1 << s);
    printf("%d\n", a);
    return 0;
}
compiled with gcc -O2 -S
.globl main
.type main, #function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ecx
subl $8, %esp
pushl $1 ***** seems it already precomputed the result to be 1
pushl $.LC0
pushl $1
call __printf_chk
xorl %eax, %eax
movl -4(%ebp), %ecx
leave
leal -4(%ecx), %esp
ret
compile with just gcc -S
.globl main
.type main, #function
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %ecx
subl $16, %esp
movl $8, -12(%ebp)
movb $0, -17(%ebp)
movb -17(%ebp), %dl
movl -12(%ebp), %eax
movb %dl, %bl
movb %al, %cl
shrb %cl, %bl ****** (unsigned char)data[0] >> s => %bl
movb %bl, %al %bl => %al
testb %al, %al %al = 0?
sete %dl
movl $0, %eax
movb %dl, %al
movl %eax, -16(%ebp)
movl $.LC0, %eax
subl $8, %esp
pushl -16(%ebp)
pushl %eax
call printf
addl $16, %esp
movl $0, %eax
leal -8(%ebp), %esp
addl $0, %esp
popl %ecx
popl %ebx
popl %ebp
leal -4(%ecx), %esp
ret
I guess the next step is to dig through gcc's source code.
I first suspected undefined behaviour here, but converting a "large" integer to a smaller unsigned type is actually well-defined: the value wraps modulo 2^N. 131072 definitely doesn't fit in a uint8_t, so (uint8_t)(1 << 17) is simply 0 and the comparison should be false.
Looking at the code generated, though, I'd say it's probably not quite right, since it does "sete" rather than "setb"? That seems very suspicious to me.
If I turn the expression around:
return (T)(1<<k[i]) > data[0];
then it uses a "seta" instruction, which is what I'd expect. I'll do a bit more digging, but something seems wrong.

Tail call recursion

I'm implementing a function as following:
void Add(list* node)
{
    if(this->next == NULL)
        this->next = node;
    else
        this->next->Add(node);
}
It seems Add will be tail-called at every step of the recursion.
I could also implement it as:
void Add(list *node)
{
    list *curr = this;
    while(curr->next != NULL) curr = curr->next;
    curr->next = node;
}
This will not use recursion at all.
Which version of this is better? (in stack size or speed)
Please don't give the "Why don't use STL/Boost/whatever?" comments/answers.
They probably will be the same performance-wise, since the compiler will probably optimise them into the exact same code.
However, if you compile on Debug settings, the compiler will not optimise for tail-recursion, so if the list is long enough, you can get a stack overflow. There is also the (very small) possibility that a bad compiler won't optimise the recursive version for tail-recursion. There is no risk of that in the iterative version.
Pick whichever one is clearer and easier for you to maintain taking the possibility of non-optimisation into account.
I tried it out, making three files to test your code:
node.hh:
struct list {
    list *next;
    void Add(list *);
};
tail.cc:
#include "node.hh"

void list::Add(list* node)
{
    if(!this->next)
        this->next = node;
    else
        this->next->Add(node);
}
loop.cc:
#include "node.hh"

void list::Add(list *node)
{
    list *curr = this;
    while(curr->next) curr = curr->next;
    curr->next = node;
}
I compiled both files with G++ 4.3 for IA-32, with -O3 and -S to give assembly output rather than object files.
Results:
tail.s:
_ZN4list3AddEPS_:
.LFB0:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
movl 8(%ebp), %eax
.p2align 4,,7
.p2align 3
.L2:
movl %eax, %edx
movl (%eax), %eax
testl %eax, %eax
jne .L2
movl 12(%ebp), %eax
movl %eax, (%edx)
popl %ebp
ret
.cfi_endproc
loop.s:
_ZN4list3AddEPS_:
.LFB0:
.cfi_startproc
.cfi_personality 0x0,__gxx_personality_v0
pushl %ebp
.cfi_def_cfa_offset 8
movl %esp, %ebp
.cfi_offset 5, -8
.cfi_def_cfa_register 5
movl 8(%ebp), %edx
jmp .L3
.p2align 4,,7
.p2align 3
.L6:
movl %eax, %edx
.L3:
movl (%edx), %eax
testl %eax, %eax
jne .L6
movl 12(%ebp), %eax
movl %eax, (%edx)
popl %ebp
ret
.cfi_endproc
Conclusion: the output is similar enough (the core loop/recursion becomes movl, movl, testl, jne in both) that it really isn't worth worrying about. There's one less unconditional jump in the recursive version, although I wouldn't bet either way on which is faster, if the difference is even measurable. Pick whichever is most natural for expressing the algorithm at hand; even if you later decide that was a bad choice, it's not too hard to switch.
Adding -g to the compilation doesn't change the actual implementation with g++ either, although there is the added complication that breakpoints no longer behave as you would expect: in my tests with GDB, a breakpoint on the tail-call line gets hit at most once (and not at all for a one-element list), regardless of how deep the recursion actually goes.
Timings:
Out of curiosity I ran some timings with the same variant of g++. I used:
#include <cstring>
#include "node.hh"

static const unsigned int size = 2048;
static const unsigned int count = 10000;

int main() {
    list nodes[size];
    for (unsigned int i = 0; i < count; ++i) {
        std::memset(nodes, 0, sizeof(nodes));
        for (unsigned int j = 1; j < size; ++j) {
            nodes[0].Add(&nodes[j]);
        }
    }
}
This was run 200 times with each of the loop and tail-call versions. The results with this compiler on this platform were fairly conclusive: tail had a mean of 40.52 seconds, whereas loop had a mean of 66.93 (standard deviations 0.45 and 0.47 respectively).
So I certainly wouldn't be scared of using tail-call recursion if it seems the nicer way to express the algorithm, but I probably wouldn't go out of my way to use it either, since I suspect these timing observations would vary across platforms, compilers, and compiler versions.

What is the best way (performance-wise) to test whether a value falls within a threshold?

That is, what is the fastest way to do the test
if( a >= ( b - c > 0 ? b - c : 0 ) &&
a <= ( b + c < 255 ? b + c : 255 ) )
...
if a, b, and c are all unsigned char aka BYTE. I am trying to optimize an image scanning process to find a sub-image, and a comparison such as this is done about 3 million times per scan, so even minor optimizations could be helpful.
Not sure, but maybe some sort of bitwise operation? Maybe adding 1 to c and testing for less-than and greater-than without the or-equal-to part? I don't know!
Well, first of all let's see what you are trying to check without all kinds of over/underflow checks:
a >= b - c
a <= b + c
subtract b from both:
a - b >= -c
a - b <= c
Now that is equal to
abs(a - b) <= c
And in code:
(a>b ? a-b : b-a) <= c
Now, this code is a tad faster and doesn't contain (or need) complicated underflow/overflow checks.
I've profiled my code and 6502's with 1000000000 repetitions and there was officially no difference whatsoever. I would suggest picking the most elegant solution (which is IMO mine, but opinions differ), since performance is not a deciding argument here.
However, there was a notable difference between my and the asker's code. This is the profiling code I used:
#include <iostream>

int main(int argc, char *argv[]) {
    bool prevent_opti = false;  // initialised so we never read an indeterminate value
    for (int ai = 0; ai < 256; ++ai) {
        for (int bi = 0; bi < 256; ++bi) {
            for (int ci = 0; ci < 256; ++ci) {
                unsigned char a = ai;
                unsigned char b = bi;
                unsigned char c = ci;
                if ((a>b ? a-b : b-a) <= c) prevent_opti = true;
            }
        }
    }
    std::cout << prevent_opti << "\n";
    return 0;
}
With my if statement this took 120 ms on average; the asker's if statement took 135 ms on average.
I think you will get the best performance by writing it in the clearest way possible and then turning on the compiler's optimizer. The compiler is rather good at this kind of optimization and will beat you most of the time (in the worst case it will equal you).
My preference would be:
int min = (b - c) > 0 ? (b - c) : 0;
int max = (b + c) < 255 ? (b + c) : 255;
if ((a >= min) && (a <= max))
The original code (in assembly):
movl %eax, %ecx
movl %ebx, %eax
subl %ecx, %eax
movl $0, %edx
cmovs %edx, %eax
cmpl %eax, %r12d
jl L13
leal (%rcx,%rbx), %eax
cmpl $255, %eax
movb $-1, %dl
cmovg %edx, %eax
cmpl %eax, %r12d
jmp L13
My code (in assembly):
movl %eax, %ecx
movl %ebx, %eax
subl %ecx, %eax
movl $0, %edx
cmovs %edx, %eax
cmpl %eax, %r12d
jl L13
leal (%rcx,%rbx), %eax
cmpl $255, %eax
movb $-1, %dl
cmovg %edx, %eax
cmpl %eax, %r12d
jg L13
nightcracker's code (in assembly):
movl %r12d, %edx
subl %ebx, %edx
movl %ebx, %ecx
subl %r12d, %ecx
cmpl %ebx, %r12d
cmovle %ecx, %edx
cmpl %eax, %edx
jg L16
Just using plain ints for a, b and c lets you simplify the code to
if (a >= b - c && a <= b + c) ...
Also, as an alternative, 256*256*256 is just 16M, and a map of 16M bits is 2 MB. This means it's feasible to use a lookup table like
int index = (a<<16) + (b<<8) + c;
if (lookup_table[index>>3] & (1<<(index&7))) ...
but I think cache thrashing will make this much slower, even though modern processors hate conditionals...
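The table described above can be precomputed once; a sketch using the same indexing scheme (helper names are mine):

```cpp
#include <cstdint>
#include <vector>

// Precompute a 16M-entry bitmap answering "is |a - b| <= c ?" for all
// (a, b, c) byte triples, packed 8 answers per byte: 2 MB total.
// Whether this beats the direct comparison depends entirely on cache
// behaviour, as noted above.
std::vector<std::uint8_t> build_lookup_table()
{
    std::vector<std::uint8_t> table(1 << 21, 0);  // 16M bits = 2 MB
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b)
            for (int c = 0; c < 256; ++c)
            {
                int index = (a << 16) + (b << 8) + c;
                bool in_range = (a > b ? a - b : b - a) <= c;
                if (in_range)
                    table[index >> 3] |= 1 << (index & 7);
            }
    return table;
}

// Query with the same (a<<16) + (b<<8) + c indexing as the answer.
bool lookup(const std::vector<std::uint8_t>& t, int a, int b, int c)
{
    int index = (a << 16) + (b << 8) + c;
    return t[index >> 3] & (1 << (index & 7));
}
```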
Another alternative is to use a bit of algebra
b - c <= a <= b + c
iff
- c <= a - b <= c (subtracted b from all terms)
iff
0 <= a - b + c <= 2*c (added c to all terms)
this allows using just one test
if ((unsigned)(a - b + c) <= 2*c) ...
assuming that a, b and c are plain ints. The reason is that if a - b + c is negative, unsigned arithmetic will make it much bigger than 2*c (given c is 0..255).
This should generate efficient machine code with a single branch if the processor has dedicated signed/unsigned comparison instructions, as x86 does (ja/jg).
#include <stdio.h>

int main()
{
    int err = 0;
    for (int ia=0; ia<256; ia++)
        for (int ib=0; ib<256; ib++)
            for (int ic=0; ic<256; ic++)
            {
                unsigned char a = ia;
                unsigned char b = ib;
                unsigned char c = ic;
                int res1 = (a >= ( b - c > 0 ? b - c : 0 ) &&
                            a <= ( b + c < 255 ? b + c : 255 ));
                int res2 = (unsigned(a - b + c) <= 2*c);
                err += (res1 != res2);
            }
    printf("Errors = %i\n", err);
    return 0;
}
On x86 with g++ the assembler code generated for the res2 test only includes one conditional instruction.
The assembler code for the following loop is
void process(unsigned char *src, unsigned char *dst, int sz)
{
    for (int i=0; i<sz; i+=3)
    {
        unsigned char a = src[i];
        unsigned char b = src[i+1];
        unsigned char c = src[i+2];
        dst[i] = (unsigned(a - b + c) <= 2*c);
    }
}
.L3:
movzbl 2(%ecx,%eax), %ebx ; This loads c
movzbl (%ecx,%eax), %edx ; This loads a
movzbl 1(%ecx,%eax), %esi ; This loads b
leal (%ebx,%edx), %edx ; This computes a + c
addl %ebx, %ebx ; This is c * 2
subl %esi, %edx ; This is a - b + c
cmpl %ebx, %edx ; Comparison
setbe (%edi,%eax) ; Set 0/1 depending on result
addl $3, %eax ; next group
cmpl %eax, 16(%ebp) ; did we finish ?
jg .L3 ; if not loop back for next
Using instead dst[i] = (a<b ? b-a : a-b); the code becomes much longer
.L9:
movzbl %dl, %edx
andl $255, %esi
subl %esi, %edx
.L4:
andl $255, %edi
cmpl %edi, %edx
movl 12(%ebp), %edx
setle (%edx,%eax)
addl $3, %eax
cmpl %eax, 16(%ebp)
jle .L6
.L5:
movzbl (%ecx,%eax), %edx
movb %dl, -13(%ebp)
movzbl 1(%ecx,%eax), %esi
movzbl 2(%ecx,%eax), %edi
movl %esi, %ebx
cmpb %bl, %dl
ja .L9
movl %esi, %ebx
movzbl %bl, %edx
movzbl -13(%ebp), %ebx
subl %ebx, %edx
jmp .L4
.p2align 4,,7
.p2align 3
.L6:
And I'm way too tired now to try to decipher it (2:28 AM here).
Anyway, longer doesn't necessarily mean slower (at first sight it seems g++ decided to unroll the loop, writing a few elements at a time, in this case).
As I said before, you should do some actual profiling with your real computation and your real data. Note that if true performance is needed, the best strategy may differ depending on the processor.
For example, Linux during boot runs a test to decide the fastest way to perform a certain computation needed in the kernel. There are just too many variables (cache size/levels, RAM speed, CPU clock, chipset, CPU type...).
Rarely does embedding the ternary operator in another statement improve performance :)
If every single opcode matters, write the opcodes yourself: use assembler. Also consider using SIMD instructions if possible. I'd also be interested in the target platform; ARM assembler loves comparisons of this sort and has opcodes to speed up saturated math of this type.
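As an illustration of the SIMD point on x86, SSE2's saturating byte arithmetic handles exactly this kind of range test; a sketch (x86-specific, intrinsics from <emmintrin.h>, and the function name is mine):

```cpp
#include <emmintrin.h>

// Tests |a - b| <= c for 16 byte lanes at once using SSE2 saturating
// arithmetic: |a - b| = subs(a,b) | subs(b,a) because one of the two
// saturating subtractions clamps to 0, and |a - b| <= c exactly when
// subs(|a - b|, c) saturates to 0.
__m128i in_range_16(__m128i a, __m128i b, __m128i c)
{
    __m128i absdiff = _mm_or_si128(_mm_subs_epu8(a, b), _mm_subs_epu8(b, a));
    // Each lane becomes 0xFF if in range, 0x00 otherwise.
    return _mm_cmpeq_epi8(_mm_subs_epu8(absdiff, c), _mm_setzero_si128());
}
```

With 16 comparisons per instruction sequence, the 3 million tests per scan mentioned in the question shrink to roughly 190 thousand vector operations.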