After reading many of the comments on this question, there are a couple of people (here and here) who suggest that this code:
int val = 5;
int r = (0 < val) - (val < 0); // this line here
will cause branching. Unfortunately, none of them give any justification or say why it would cause branching (tristopia suggests it requires a cmove-like instruction or predication, but doesn't really say why).
Are these people right that "a comparison used in an expression will not generate a branch" is actually a myth rather than a fact (assuming you're not using some esoteric processor)? If so, can you give an example?
I would've thought there wouldn't be any branching (given that there's no logical "short circuiting"), and now I'm curious.
To simplify matters, consider just one part of the expression: val < 0. Essentially, this means “if val is negative, return 1, otherwise 0”; you could also write it like this:
val < 0 ? 1 : 0
How this is translated into processor instructions depends heavily on the compiler and the target processor. The easiest way to find out is to write a simple test function, like so:
int compute(int val) {
    return val < 0 ? 1 : 0;
}
and review the assembler code generated by the compiler (e.g., with gcc -S -o - example.c). On my machine, it does this without branching. However, if I change it to return 5 instead of 1, branch instructions appear:
...
cmpl $0, -4(%rbp)
jns .L2
movl $5, %eax
jmp .L3
.L2:
movl $0, %eax
.L3:
...
So, “a comparison used in an expression will not generate a branch” is indeed a myth. (But “a comparison used in an expression will always generate a branch” isn’t true either.)
Addition in response to this extension/clarification:
I'm asking if there's any (sane) platform/compiler for which a branch is likely. MIPS/ARM/x86(_64)/etc. All I'm looking for is one case that demonstrates that this is a realistic possibility.
That depends on what you consider a “sane” platform. If the venerable 6502 CPU family is sane, I think there is no way to calculate val > 0 on it without branching. Most modern instruction sets, on the other hand, provide some type of set-on-X instruction.
(val < 0 can actually be computed without branching even on 6502, because it can be implemented as a bit shift.)
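To make the bit-shift idea concrete, here is a minimal C++ sketch of computing val < 0 without a comparison, assuming a 32-bit two's-complement int (true on practically all current targets):

#include <cstdint>

// (val < 0) computed as the sign bit shifted down to bit 0.
int is_negative(std::int32_t val)
{
    return (int)((std::uint32_t)val >> 31);
}

(This is exactly the shrl $31 that shows up in the compiler output below.)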
Empiricism for the win:
int sign(int val) {
    return (0 < val) - (val < 0);
}
compiled with optimisations. gcc (4.7.2) produces
sign:
.LFB0:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
setg %al
shrl $31, %edi
subl %edi, %eax
ret
.cfi_endproc
no branch. clang (3.2):
sign: # #sign
.cfi_startproc
# BB#0:
movl %edi, %ecx
shrl $31, %ecx
testl %edi, %edi
setg %al
movzbl %al, %eax
subl %ecx, %eax
ret
neither. (on x86_64, Core i5)
This is actually architecture-dependent. If there exists an instruction to set the value to 0/1 depending on the sign of another value, there will be no branching. If there's no such instruction, branching would be necessary.
I am currently reading CSAPP and have a question about an example of assembly code. This is the example from CSAPP; the code follows:
long pcount_goto(unsigned long x) {
    long result = 0;
loop:
    result += x & 0x1;
    x >>= 1;
    if (x) goto loop;
    return result;
}
And the corresponding assembly code is:
movl $0, %eax # result = 0
.L2: # loop:
movq %rdi, %rdx
andl $1, %edx # t = x & 0x1
addq %rdx, %rax # result += t
shrq %rdi # x >>= 1
jne .L2 # if (x) goto loop
rep; ret
The questions I have may look naive since I am very new to assembly code, but I will be grateful if someone can help me with these questions.
What's the difference between %eax and %rax (also %edx and %rdx)? I have seen them occur in the assembly code, but they seem to refer to the same space/address. What's the point of using two different names?
In the code
andl $1, %edx # t = x & 0x1
I understand that %edx now stores t, but where does x go then?
In the code
shrq %rdi
I think
shrq 1, %rdi
would be better?
For
jne .L2 # if (x) goto loop
Where did the if (x) go? I can't see any comparison or test.
These are really basic questions, a little research of your own should have answered all of them. Anyway,
The e registers are the low 32 bits of the r registers. You pick one depending on what size you need. There are also 16 and 8 bit registers. Consult a basic architecture manual.
The and instruction modifies its argument, it's not a = b & c, it's a &= b.
That would be shrq $1, %rdi which is valid, and shrq %rdi is just an alias for it.
jne examines the zero flag, which shrq sets automatically if its result was zero.
Is it faster to do something like
for ( int * pa(arr), * pb(arr+n); pa != pb; ++pa )
{
    // do something with *pa
}
than
for ( size_t k = 0; k < n; ++k )
{
    // do something with arr[k]
}
???
I understand that arr[k] is equivalent to *(arr+k), but in the first method you are using the current pointer, which has been incremented by 1 each time, while in the second case you are using a pointer that is offset from arr by successively larger amounts. Maybe hardware has special ways of incrementing by 1, and so the first method is faster? Or not? Just curious. Hope my question makes sense.
If the compiler is smart enough (and most compilers are), then the performance of both loops should be roughly equal.
For example, I compiled the following code with gcc 5.1.0, generating assembly:
int __attribute__ ((noinline)) compute1(int* arr, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i)
    {
        sum += arr[i];
    }
    return sum;
}

int __attribute__ ((noinline)) compute2(int* arr, int n)
{
    int sum = 0;
    for (int * pa(arr), * pb(arr + n); pa != pb; ++pa)
    {
        sum += *pa;
    }
    return sum;
}
And the resulting assembly is:
compute1(int*, int):
testl %esi, %esi
jle .L4
leal -1(%rsi), %eax
leaq 4(%rdi,%rax,4), %rdx
xorl %eax, %eax
.L3:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdx, %rdi
jne .L3
rep ret
.L4:
xorl %eax, %eax
ret
compute2(int*, int):
movslq %esi, %rsi
xorl %eax, %eax
leaq (%rdi,%rsi,4), %rdx
cmpq %rdx, %rdi
je .L10
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
.L10:
rep ret
main:
xorl %eax, %eax
ret
As you can see, the heaviest part (the loop) of both functions is essentially identical:
.L9:
addl (%rdi), %eax
addq $4, %rdi
cmpq %rdi, %rdx
jne .L9
rep ret
But in more complex examples or with other compilers the results might differ, so you should test and measure. Still, most compilers generate similar code for the two forms.
The full code sample: https://goo.gl/mpqSS0
This cannot be answered. It depends on your compiler AND on your machine.
A very naive compiler would translate the code as is to machine code. Most machines indeed provide an increment operation that is very fast. They normally also provide relative addressing for an address with an offset. This could take a few cycles more than absolute addressing. So, yes, the version with pointers could potentially be faster.
But take into account that every machine is different AND that compilers are allowed to optimize as long as the observable behavior of your program doesn't change. Given that, I would suggest a reasonable compiler will create code from both versions that doesn't differ in performance.
Any reasonable compiler will generate code that is identical inside the loop for these two choices. I looked at the code generated for iterating over a std::vector, using a for-loop with an integer index or a for (auto i : vec) construct [std::vector internally holds two pointers for the begin and end of the stored values, much like your pa and pb]. Both gcc and clang generate identical code inside the loop itself [the exact details of the loop differ subtly between the compilers, but other than that, there's no difference]. The setup of the loop was subtly different, but unless you OFTEN run loops of fewer than 5 items [and if so, why worry?], the actual content of the loop is what matters, not the bit just before it.
As with ALL code where performance is important, the exact code, compiler make and version, compiler options, processor make and model, will make a difference to how the code performs. But for the vast majority of processors and compilers, I'd expect no measurable difference. If the code is really critical, measure different alternatives and see what works best in your case.
Beating the dead horse here. A typical (and fast) way of doing integer powers in C is this classic:
int64_t ipow(int64_t base, int exp){
    int64_t result = 1;
    while (exp) {
        if (exp & 1)
            result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}
However I needed a compile time integer power so I went ahead and made a recursive implementation using constexpr:
constexpr int64_t ipow_(int base, int exp){
    return exp > 1 ? ipow_(base, (exp>>1) + (exp&1)) * ipow_(base, exp>>1) : base;
}

constexpr int64_t ipow(int base, int exp){
    return exp < 1 ? 1 : ipow_(base, exp);
}
The second function is only to handle exponents less than 1 in a predictable way. Passing exp<0 is an error in this case.
The recursive version is 4 times slower
I generate a vector of 10^6 random-valued bases and exponents in the range [0,15] and time both algorithms on the vector (after doing a non-timed run to try to remove any caching effects). Without optimization, the recursive method is twice as fast as the loop. But with -O3 (GCC) the loop is 4 times faster than the recursive method.
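For reference, a minimal sketch of the kind of harness described above; the function name bench, the RNG seed, and the base range are my own arbitrary choices, not from the question:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

// Times pow_fn over 10^6 random (base, exp) pairs with exp in [0,15].
template <class F>
void bench(const char* name, F pow_fn)
{
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> base_dist(0, 100);  // arbitrary base range
    std::uniform_int_distribution<int> exp_dist(0, 15);

    std::vector<std::pair<std::int64_t, int>> input(1000000);
    for (auto& p : input)
        p = { base_dist(gen), exp_dist(gen) };

    std::int64_t sink = 0;  // accumulated so the calls can't be optimized away
    for (auto& p : input)
        sink += pow_fn(p.first, p.second);  // untimed warm-up pass

    auto t0 = std::chrono::steady_clock::now();
    for (auto& p : input)
        sink += pow_fn(p.first, p.second);
    auto t1 = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%s: %lld ms (checksum %lld)\n", name, (long long)ms, (long long)sink);
}

Each variant would then be run through bench once, wrapping it in a lambda if both implementations are in scope under the same name.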
My question to you guys is this: Can any one come up with a faster ipow() function that handles exponent and bases of 0 and can be used as a constexpr?
(Disclaimer: I don't need a faster ipow, I'm just interested to see what the smart people here can come up with).
A good optimizing compiler will transform tail-recursive functions to run as fast as imperative code. You can transform this function to be tail recursive by threading an accumulator parameter through the calls. GCC 4.8.1 compiles this test program:
#include <cstdint>

constexpr int64_t ipow(int64_t base, int exp, int64_t result = 1) {
    return exp < 1 ? result : ipow(base*base, exp/2, (exp % 2) ? result*base : result);
}

int64_t foo(int64_t base, int exp) {
    return ipow(base, exp);
}
into a loop (See this at gcc.godbolt.org):
foo(long, int):
testl %esi, %esi
movl $1, %eax
jle .L4
.L3:
movq %rax, %rdx
imulq %rdi, %rdx
testb $1, %sil
cmovne %rdx, %rax
imulq %rdi, %rdi
sarl %esi
jne .L3
rep; ret
.L4:
rep; ret
vs. your while loop implementation:
ipow(long, int):
testl %esi, %esi
movl $1, %eax
je .L4
.L3:
movq %rax, %rdx
imulq %rdi, %rdx
testb $1, %sil
cmovne %rdx, %rax
imulq %rdi, %rdi
sarl %esi
jne .L3
rep; ret
.L4:
rep; ret
Instruction-by-instruction identical is good enough for me.
It seems that this is a standard problem with constexpr and template programming in C++. Because of the restrictions on compile-time evaluation, the constexpr version is slower than a normal version when executed at runtime, and overload resolution gives you no way to choose the correct version for each context. The standardization committee is working on this issue; see for example the following working document: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3583.pdf
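To make that concrete, here is a small sketch, using the ipow from this thread, of when the constexpr path is guaranteed and when the recursion really runs at runtime:

// Evaluated by the compiler: initializing a constexpr variable forces a
// constant expression, so there is no runtime cost at all.
constexpr int64_t at_compile_time = ipow(3, 7);

// Ordinary call with runtime arguments: the same constexpr function now
// executes its recursion at runtime, which is where the loop version wins.
int64_t at_runtime(int base, int exp) {
    return ipow(base, exp);
}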
Which is better?
According to Savitch, each recursive call is saved on the stack in the form of an activation frame, and this has overhead. However, the recursive version takes a few fewer lines of code to write. Which one is better to turn in for an interview? The code for both is below.
#include <iostream>
using namespace std;

const int SIZE = 10;
int array[ SIZE ] = { 0,1,2,3,4,5,6,7,8,9 };
int answer = 0;

void binary_search_recursive( int array[], int start, int end, int value, int& answer )
{
    int mid = ( start + end ) / 2;
    if ( array[ mid ] == value )
    {
        answer = mid;
    }
    else if ( array[ mid ] < value )
    {
        binary_search_recursive( array, mid + 1, end, value, answer );
    }
    else
    {
        binary_search_recursive( array, start, mid - 1, value, answer );
    }
}

void binary_search_iterative( int array[], int start, int end, int value, int& answer )
{
    int mid = ( ( start + end ) / 2 );
    while ( array[ mid ] != value )
    {
        if ( array[ mid ] < value )
        {
            start = mid;
            mid = ( ( ( mid + 1 ) + end ) / 2 );
        }
        else
        {
            end = mid;
            mid = ( ( start + ( mid - 1 ) ) / 2 );
        }
    }
    answer = mid;
}

int main()
{
    binary_search_iterative( array, 0, SIZE - 1, 4, answer );
    cout << answer;
    return 0;
}
Recursive versions of algorithms are often shorter in lines of code but iterative versions of the same algorithm are often faster because of the function call overhead of the recursive version.
Regarding the binary search algorithm, the faster implementations are written as iterative. For example Jon Bentley's published version of the binary search is iterative.
In case of a binary search recursion does not help you express your intent any better than the iteration does, so an iterative approach is better.
I think the best approach for an interview would be to submit a solution that calls lower_bound: it shows the interviewer that you not only know some basic-level syntax and how to code a freshman-year algorithm, but that you do not waste time re-writing boilerplate code.
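For illustration, a minimal sketch of what such a submission might look like (the wrapper name and the -1 not-found convention are my own choices, not part of the question):

#include <algorithm>

// Returns the index of value in the sorted range [arr, arr + n), or -1.
int binary_search_index(const int* arr, int n, int value)
{
    const int* it = std::lower_bound(arr, arr + n, value);
    return (it != arr + n && *it == value) ? (int)(it - arr) : -1;
}

std::lower_bound does the same O(log n) halving internally, so nothing is lost versus the hand-written versions.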
For an interview, I'd start by mentioning that both recursive and iterative solutions are possible and similarly trivial to write. Recursive versions have a potential issue with nested stack frames using or even exhausting stack memory (and faulting different pages into cache), but compilers tend to provide tail recursive optimisations that effectively create an iterative implementation. Recursive functions tend to be more self-evidently correct and concise, but aren't as widely applicable in day to day C++ programming so may be a little less familiar and comfortable to maintenance programmers.
Unless there's a reason not to, in a real project I'd use std::binary_search from <algorithm> (http://www.sgi.com/tech/stl/binary_search.html).
To illustrate tail recursion elimination, your binary_search_recursive algorithm was compiled to the assembly below by g++ -O4 -S. Notes:
to get an impression of the code, you don't need to understand every line, but the following helps:
movl instructions are moves (assignments) between registers and memory (the trailing "l", for "long", indicates the operand width, here 32 bits)
subl, shrl, sarl and cmpl are subtract, shift right, shift arithmetic right, and compare instructions. The important thing to note is that, as a side effect, they set a few flags (such as "equal" if they produce a 0 result), which are consulted by je (jump if equal), jge (jump if greater or equal) and jne (jump if not equal).
the answer = mid termination condition is handled at L10, while the recursive steps are instead handled by the code at L14 and L4 and may jump back up to L12.
Here's the disassembly of your binary_search_recursive function (the name is mangled in C++ style)...
__Z23binary_search_recursivePiiiiRi:
pushl %ebp
movl %esp, %ebp
pushl %edi
pushl %esi
pushl %ebx
subl $4, %esp
movl 24(%ebp), %eax
movl 8(%ebp), %edi
movl 12(%ebp), %ebx
movl 16(%ebp), %ecx
movl %eax, -16(%ebp)
movl 20(%ebp), %esi
.p2align 4,,15
L12:
leal (%ebx,%ecx), %edx
movl %edx, %eax
shrl $31, %eax
leal (%edx,%eax), %eax
sarl %eax
cmpl %esi, (%edi,%eax,4)
je L10
L14:
jge L4
leal 1(%eax), %ebx
leal (%ebx,%ecx), %edx
movl %edx, %eax
shrl $31, %eax
leal (%edx,%eax), %eax
sarl %eax
cmpl %esi, (%edi,%eax,4)
jne L14
L10:
movl -16(%ebp), %ecx
movl %eax, (%ecx)
popl %eax
popl %ebx
popl %esi
popl %edi
popl %ebp
ret
.p2align 4,,7
L4:
leal -1(%eax), %ecx
jmp L12
You use iteration if speed is an issue or if the stack size is constraining because, as you said, recursion involves calling the function repeatedly, which makes it occupy more space on the stack. As for answering in the interview, I would go for whichever one I feel is simplest to get correct at the time, for obvious reasons :))
I'd like a shortcut for the following little function, where performance is very important (the function is called more than 10,000,000 times):
inline int len(uint32 val)
{
    if (val <= 0x000000ff) return 1;
    if (val <= 0x0000ffff) return 2;
    if (val <= 0x00ffffff) return 3;
    return 4;
}
Does anyone have any idea... a cool bit-operation trick?
Thanks for your help in advance!
How about this one?
inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
        ;
}
Removing the inline keyword, g++ -O2 compiles this to the following branchless code:
movl 8(%ebp), %edx
movl %edx, %eax
andl $-16777216, %eax
cmpl $1, %eax
sbbl %eax, %eax
addl $4, %eax
xorl %ecx, %ecx
testl $-65536, %edx
sete %cl
subl %ecx, %eax
andl $-256, %edx
sete %dl
movzbl %dl, %edx
subl %edx, %eax
If you don't mind machine-specific solutions, you can use the bsr instruction, which finds the index of the highest set bit. Then you simply divide by 8 to convert bits to bytes and add 1 to shift the range 0..3 to 1..4:
int len(uint32 val)
{
    asm("mov 8(%ebp), %eax");
    asm("or $255, %eax");
    asm("bsr %eax, %eax");
    asm("shr $3, %eax");
    asm("inc %eax");
    asm("mov %eax, 8(%ebp)");
    return val;
}
Note that I am not an inline assembly god, so maybe there's a better solution for accessing val instead of addressing the stack explicitly. But you should get the basic idea.
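For what it's worth, here is a sketch of the same idea using GCC's extended-asm constraints, so the compiler chooses the registers instead of relying on a particular stack layout (an untested sketch, assuming x86 and a GCC-compatible compiler; the or with 255 pins inputs 0..255 to bit index 7):

inline int len(uint32 val)
{
    uint32 bit;
    // bsr: index of the highest set bit; val | 255 guarantees a set bit
    __asm__("bsrl %1, %0" : "=r"(bit) : "r"(val | 255u));
    return (int)(bit >> 3) + 1;
}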
The GNU compiler also has an interesting built-in function called __builtin_clz:
inline int len(uint32 val)
{
    return ((__builtin_clz(val | 255) ^ 31) >> 3) + 1;
}
This looks much better than the inline assembly version to me :)
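As an aside: if C++20 is available (an assumption on my part; this thread predates it), std::countl_zero from <bit> is a portable spelling of the same trick:

#include <bit>
#include <cstdint>

inline int len(std::uint32_t val)
{
    // countl_zero(x) ^ 31 == index of the highest set bit for 32-bit x
    return ((std::countl_zero(val | 255u) ^ 31) >> 3) + 1;
}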
I did a mini unscientific benchmark, just measuring the difference between GetTickCount() calls when calling the function in a loop running ULONG_MAX times under the VS 2010 compiler.
Here's what I saw:
This took 11497 ticks
inline int len(uint32 val)
{
    if (val <= 0x000000ff) return 1;
    if (val <= 0x0000ffff) return 2;
    if (val <= 0x00ffffff) return 3;
    return 4;
}
While this took 14399 ticks
inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
        ;
}
Edit: my idea about why one was faster must be wrong, because:
inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
        ;
}
This version used only 11107 ticks. Perhaps because + is faster than -? I'm not sure.
Even faster, though, was the binary search, at 7161 ticks:
inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000) ? 4 : 3;
    return (val & 0x0000ff00) ? 2 : 1;
}
And fastest so far is using the MS intrinsic function, at 4399 ticks
#pragma intrinsic(_BitScanReverse)
inline int len2(uint32 val)
{
    DWORD index;
    _BitScanReverse(&index, val);
    return (index >> 3) + 1;
}
For reference, here's the code I used to profile:
int _tmain(int argc, _TCHAR* argv[])
{
    int j = 0;
    DWORD t1, t2;
    t1 = GetTickCount();
    for (ULONG i = 0; i < -1; i++)  // -1 converts to ULONG_MAX, so this loops ~4 billion times
        j = len(i);
    t2 = GetTickCount();
    _tprintf(_T("%ld ticks %ld\n"), t2 - t1, j);
    t1 = GetTickCount();
    for (ULONG i = 0; i < -1; i++)
        j = len2(i);
    t2 = GetTickCount();
    _tprintf(_T("%ld ticks %ld\n"), t2 - t1, j);
}
Had to print j to prevent the loops from being optimized out.
Do you really have profile evidence that this is a significant bottleneck in your application? Just do it the most obvious way and only if profiling shows it to be a problem (which I doubt), then try to improve things. Most likely you'll get the best improvement by reducing the number of calls to this function than by changing something within it.
Binary search MIGHT save a few cycles, depending on the processor architecture.
inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000) ? 4 : 3;
    return (val & 0x0000ff00) ? 2 : 1;
}
Or, finding out which is the most common case might bring down the average number of cycles, if most inputs are one byte (e.g., when building UTF-8 encodings, though then your break points wouldn't be 32/24/16/8):
inline int len(uint32 val)
{
    if (val & 0xffffff00) {
        if (val & 0xffff0000) {
            if (val & 0xff000000) return 4;
            return 3;
        }
        return 2;
    }
    return 1;
}
Now the short case does the fewest conditional tests.
If bit ops are faster than comparison on your target machine you can do this:
inline int len(uint32 val)
{
    if (val & 0xff000000) return 4;
    if (val & 0x00ff0000) return 3;
    if (val & 0x0000ff00) return 2;
    return 1;
}
You can avoid the conditional branches that can be costly if the distribution of your numbers does not make prediction easy:
return 4 - (val <= 0x000000ff) - (val <= 0x0000ffff) - (val <= 0x00ffffff);
Changing the <= to a & will not change anything much on a modern processor. What is your target platform?
Here is the generated code for x86-64 with gcc -O:
cmpl $255, %edi
setg %al
movzbl %al, %eax
addl $3, %eax
cmpl $65535, %edi
setle %dl
movzbl %dl, %edx
subl %edx, %eax
cmpl $16777215, %edi
setle %dl
movzbl %dl, %edx
subl %edx, %eax
There are comparison instructions cmpl of course, but these are followed by setg or setle instead of conditional branches (as would be usual). It's the conditional branch that is expensive on a modern pipelined processor, not the comparison. So this version saves the expensive conditional branches.
My attempt at hand-optimizing gcc's assembly:
cmpl $255, %edi
setg %al
addb $3, %al
cmpl $65535, %edi
setle %dl
subb %dl, %al
cmpl $16777215, %edi
setle %dl
subb %dl, %al
movzbl %al, %eax
You may have a more efficient solution depending on your architecture.
MIPS has a "CLZ" instruction that counts the number of leading zero-bits of a number. What you are looking for here is essentially 4 - (CLZ(x) / 8) (where / is integer division). PowerPC has the equivalent instruction cntlz, and x86 has BSR. This solution should simplify down to 3-4 instructions (not counting function call overhead) and zero branches.
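A sketch of that formula using GCC's __builtin_clz (my adaptation, not from the answer; or-ing in 255 both avoids the undefined __builtin_clz(0) and pins all one-byte inputs to a result of 1):

#include <cstdint>

inline int len(std::uint32_t val)
{
    // leading zeros of (val | 255) is at most 24, so the result is 1..4
    return 4 - (__builtin_clz(val | 255u) / 8);
}

(This matches the earlier __builtin_clz answer, just with the divide-by-8 formulation.)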
On some architectures this could be quicker:
inline int len(uint32_t val) {
    // floor of the log base 256 of val, plus 1; note this assumes val >= 1
    return (int)( log(val) / log(256) ) + 1;
}
This may also be slightly faster (if comparison takes longer than bitwise and):
inline int len(uint32_t val) {
    if (val & ~0x00FFffFF) {
        return 4;
    }
    if (val & ~0x0000ffFF) {
        return 3;
    }
    if (val & ~0x000000FF) {
        return 2;
    }
    return 1;
}
If you are on an 8 bit microcontroller (like an 8051 or AVR) then this will work best:
inline int len(uint32_t val) {
    union int_char {
        uint32_t u;
        uint8_t a[4];
    } x;
    x.u = val; // doing it this way rather than taking the address of val often prevents
               // the compiler from doing dumb things.
    if (x.a[0]) {
        return 4;
    } else if (x.a[1]) {
        return 3;
    ...
EDIT by tristopia: an endianness-aware version of the last variant:
int len(uint32_t val)
{
    union int_char {
        uint32_t u;
        uint8_t a[4];
    } x;
    const uint16_t w = 1;
    x.u = val;
    if (((uint8_t *)&w)[1]) {  // BIG ENDIAN (Sparc, m68k, ARM, Power)
        if (x.a[0]) return 4;
        if (x.a[1]) return 3;
        if (x.a[2]) return 2;
    }
    else {                     // LITTLE ENDIAN (x86, 8051, ARM)
        if (x.a[3]) return 4;
        if (x.a[2]) return 3;
        if (x.a[1]) return 2;
    }
    return 1;
}
Because of the const, any compiler worth its salt will only generate the code for the right endianness.
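(In C++20 the same compile-time dispatch can be written without the pointer trick, via std::endian from <bit>; this is an assumption about the toolchain, since this answer long predates C++20:)

#include <bit>

// true exactly when the target is big-endian, usable in if constexpr
constexpr bool is_big_endian = (std::endian::native == std::endian::big);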
to Pascal Cuoq and the 35 other people who up-voted his comment:
"Wow! More than 10 million times... You mean that if you squeeze three cycles out of this function, you will save as much as 0.03s? "
Such a sarcastic comment is at best rude and offensive.
Optimization is frequently the cumulative result of 3% here, 2% there. 3% in overall capacity is nothing to be sneezed at. Suppose this was an almost saturated and unparallelizable stage in a pipe, and suppose CPU utilization went from 99% to 96%. Simple queueing theory tells you that such a reduction in CPU utilization would reduce the average queue length by over 75%, since queue length grows like load/(1 - load): 0.99/0.01 = 99 at 99% utilization versus 0.96/0.04 = 24 at 96%.
Such a reduction can frequently make or break a particular hardware configuration as this has feed back effects on memory requirements, caching the queued items, lock convoying, and (horror of horrors should it be a paged system) even paging. It is precisely these sorts of effects that cause bifurcated hysteresis loop type system behavior.
Arrival rates of anything tend to go up over time, and field replacement of a particular CPU, or buying a faster box, is frequently just not an option.
Optimization is not just about wall clock time on a desktop. Anyone who thinks that it is has much reading to do about the measurement and modelling of computer program behavior.
Pascal Cuoq owes the original poster an apology.
Just to illustrate, based on FredOverflow's answer (which is nice work, kudos and +1), a common pitfall regarding branches on x86. Here's FredOverflow's assembly as output by gcc:
movl 8(%ebp), %edx #1/.5
movl %edx, %eax #1/.5
andl $-16777216, %eax #1/.5
cmpl $1, %eax #1/.5
sbbl %eax, %eax #8/6
addl $4, %eax #1/.5
xorl %ecx, %ecx #1/.5
testl $-65536, %edx #1/.5
sete %cl #5
subl %ecx, %eax #1/.5
andl $-256, %edx #1/.5
sete %dl #5
movzbl %dl, %edx #1/.5
subl %edx, %eax #1/.5
# sum total: 29/21.5 cycles
(the latency, in cycles, is to be read as Prescott/Northwood)
Pascal Cuoq's hand-optimized assembly (also kudos):
cmpl $255, %edi #1/.5
setg %al #5
addb $3, %al #1/.5
cmpl $65535, %edi #1/.5
setle %dl #5
subb %dl, %al #1/.5
cmpl $16777215, %edi #1/.5
setle %dl #5
subb %dl, %al #1/.5
movzbl %al, %eax #1/.5
# sum total: 22/18.5 cycles
Edit: FredOverflow's solution using __builtin_clz():
movl 8(%ebp), %eax #1/.5
popl %ebp #1/.5
orb $-1, %al #1/.5
bsrl %eax, %eax #16/8
sarl $3, %eax #1/4
addl $1, %eax #1/.5
ret
# sum total: 20/13.5 cycles
and the gcc assembly for your code:
movl $1, %eax #1/.5
movl %esp, %ebp #1/.5
movl 8(%ebp), %edx #1/.5
cmpl $255, %edx #1/.5
jbe .L3 #up to 9 cycles
cmpl $65535, %edx #1/.5
movb $2, %al #1/.5
jbe .L3 #up to 9 cycles
cmpl $16777216, %edx #1/.5
sbbl %eax, %eax #8/6
addl $4, %eax #1/.5
.L3:
ret
# sum total: 16/10 cycles - 34/28 cycles
in which the instruction cache line fetches that come as a side effect of the jcc instructions probably cost nothing for such a short function.
Branches can be a reasonable choice, depending on the input distribution.
Edit: added FredOverflow's solution using __builtin_clz().
OK, one more version. Similar to Fred's, but with fewer operations:
inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
        ;
}
This one gives you fewer comparisons, but it may be less efficient if a memory access costs more than a couple of comparisons:
int precalc[1 << 16];
int precalchigh[1 << 16];

// must be called once before the first use of len()
void doprecalc()
{
    for (int i = 0; i < 1 << 16; i++) {
        precalc[i] = (i < (1 << 8) ? 1 : 2);
        precalchigh[i] = precalc[i] + 2;
    }
}

inline int len(uint32 val)
{
    return (val & 0xffff0000 ? precalchigh[val >> 16] : precalc[val]);
}
The minimum number of bits required to store a positive integer n is:
int minbits = (int)floor( log10(n) / log10(2) ) + 1;
The number of bytes is:
int minbytes = (int)floor( log10(n) / log10(2) / 8 ) + 1;
This is an entirely FPU bound solution, performance may or may not be better than a conditional test, but worth investigation perhaps.
[EDIT]
I did the investigation: a simple loop of ten million iterations of the above took 918 ms, whereas FredOverflow's accepted solution took just 49 ms (VC++ 2010). So this is not an improvement in terms of performance, though it might remain useful if it were the number of bits that was required, and further optimisations are possible.
If I remember 80x86 asm right, I'd do something like:
; Assume value in EAX; count goes into ECX
cmp eax,16777216 ; Carry set if less
sbb ecx,ecx      ; Load -1 if less, 0 if not
cmp eax,65536
sbb ecx,0        ; Subtract 1 if less; 0 if not
cmp eax,256
sbb ecx,-4       ; Add 3 if less, 4 if not
Six instructions. I think the same approach would also work for six instructions on the ARM I use.
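In portable terms, those six instructions compute the same thing as the comparison-sum versions above, counting down from 4; a C++ sketch of the equivalence (the compiler may or may not reproduce the cmp/sbb pattern):

inline int len(uint32 val)
{
    // each (val < bound) contributes -1, matching one cmp/sbb pair
    return 4 - (val < 0x1000000) - (val < 0x10000) - (val < 0x100);
}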