c++11 fast constexpr integer powers

Beating the dead horse here. A typical (and fast) way of doing integer powers in C is this classic:
int64_t ipow(int64_t base, int exp){
    int64_t result = 1;
    while(exp){
        if(exp & 1)
            result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}
However, I needed a compile-time integer power, so I went ahead and made a recursive implementation using constexpr:
constexpr int64_t ipow_(int base, int exp){
    return exp > 1 ? ipow_(base, (exp>>1) + (exp&1)) * ipow_(base, exp>>1) : base;
}

constexpr int64_t ipow(int base, int exp){
    return exp < 1 ? 1 : ipow_(base, exp);
}
The second function is only to handle exponents less than 1 in a predictable way. Passing exp<0 is an error in this case.
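To see that it really evaluates at compile time, here is a quick static_assert sanity check (the bases and exponents are just example values):

static_assert(ipow(2, 0)  == 1,    "2^0");
static_assert(ipow(2, 10) == 1024, "2^10");
static_assert(ipow(3, 4)  == 81,   "3^4");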
The recursive version is 4 times slower
I generate a vector of 10^6 random bases and exponents in the range [0,15] and time both algorithms on that vector (after doing an untimed run to try to remove any caching effects). Without optimization the recursive method is twice as fast as the loop, but with -O3 (GCC) the loop is 4 times faster than the recursive method.
My question to you guys is this: can anyone come up with a faster ipow() function that handles exponents and bases of 0 and can be used as a constexpr?
(Disclaimer: I don't need a faster ipow, I'm just interested to see what the smart people here can come up with).

A good optimizing compiler will transform tail-recursive functions to run as fast as imperative code. You can transform this function into a tail-recursive one by accumulating the result in an extra parameter. GCC 4.8.1 compiles this test program:
#include <cstdint>

constexpr int64_t ipow(int64_t base, int exp, int64_t result = 1) {
    return exp < 1 ? result : ipow(base*base, exp/2, (exp % 2) ? result*base : result);
}

int64_t foo(int64_t base, int exp) {
    return ipow(base, exp);
}
into a loop (See this at gcc.godbolt.org):
foo(long, int):
    testl   %esi, %esi
    movl    $1, %eax
    jle     .L4
.L3:
    movq    %rax, %rdx
    imulq   %rdi, %rdx
    testb   $1, %sil
    cmovne  %rdx, %rax
    imulq   %rdi, %rdi
    sarl    %esi
    jne     .L3
    rep; ret
.L4:
    rep; ret
vs. your while loop implementation:
ipow(long, int):
    testl   %esi, %esi
    movl    $1, %eax
    je      .L4
.L3:
    movq    %rax, %rdx
    imulq   %rdi, %rdx
    testb   $1, %sil
    cmovne  %rdx, %rax
    imulq   %rdi, %rdi
    sarl    %esi
    jne     .L3
    rep; ret
.L4:
    rep; ret
Instruction-by-instruction identical is good enough for me.

It seems that this is a standard problem with constexpr and template programming in C++. Due to compile-time constraints, the constexpr version is slower than a normal version when executed at runtime, but overloading doesn't let you pick the correct version for each context. The standardization committee is working on this issue; see for example the following working document: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2013/n3583.pdf
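As a sketch of where that work eventually ended up (this goes beyond the discussion above: C++20 added std::is_constant_evaluated(), declared in <type_traits>), one can now pick a constexpr-friendly path during constant evaluation and the fast loop at runtime inside a single function:

#include <cstdint>
#include <type_traits>   // std::is_constant_evaluated (C++20)

constexpr int64_t ipow(int64_t base, int exp) {
    if (std::is_constant_evaluated()) {
        // Constant-evaluation path: plain recursion is fine for the compiler.
        return exp < 1 ? 1 : base * ipow(base, exp - 1);
    }
    // Runtime path: the classic square-and-multiply loop from the question.
    int64_t result = 1;
    while (exp > 0) {
        if (exp & 1) result *= base;
        exp >>= 1;
        base *= base;
    }
    return result;
}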

Related

Understanding of the following assembly code from CSAPP

I am currently reading CSAPP and I have a question about an example of assembly code. This is the example from CSAPP; the code follows:
long pcount_goto(unsigned long x) {
    long result = 0;
loop:
    result += x & 0x1;
    x >>= 1;
    if(x) goto loop;
    return result;
}
And the corresponding assembly code is:
    movl    $0, %eax        # result = 0
.L2:                        # loop:
    movq    %rdi, %rdx
    andl    $1, %edx        # t = x & 0x1
    addq    %rdx, %rax      # result += t
    shrq    %rdi            # x >>= 1
    jne     .L2             # if (x) goto loop
    rep; ret
The questions I have may look naive since I am very new to assembly code, but I will be grateful if someone can help me with them.
What's the difference between %eax and %rax (also %edx and %rdx)? I have seen them occur in the assembly code, but they seem to refer to the same space/address. What's the point of using two different names?
In the code
andl $1, %edx # t = x & 0x1
I understand that %edx now stores t, but where does x go then?
In the code
shrq %rdi
I think
shrq 1, %rdi
should be better?
For
jne .L2 # if (x) goto loop
Where does the if (x) go? I can't see any comparison.
These are really basic questions; a little research of your own should have answered all of them. Anyway:
The e registers are the low 32 bits of the corresponding r registers. You pick one depending on what operand size you need. There are also 16- and 8-bit registers. Consult a basic architecture manual.
The and instruction modifies its destination operand; it's not a = b & c, it's a &= b. Here x was first copied into %rdx (movq %rdi, %rdx), so the original value in %rdi is untouched.
That would be shrq $1, %rdi, which is valid, and shrq %rdi is just an alias for it.
jne examines the zero flag, which shrq sets automatically when its result is zero, so the jump back to .L2 is taken exactly while x is still nonzero.
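If it helps to see the first point in action, here is a small sketch (mine, not from CSAPP) that you can compile with gcc -O1 -S: the loop in the 32-bit version should operate on x via %edi, while the 64-bit version uses %rdi, i.e. the same register at two operand widths.

#include <stdint.h>

long pcount32(uint32_t x) {
    long result = 0;
    do {
        result += x & 0x1;   // add the low bit
        x >>= 1;             // 32-bit shift
    } while (x);
    return result;
}

long pcount64(uint64_t x) {
    long result = 0;
    do {
        result += x & 0x1;   // add the low bit
        x >>= 1;             // 64-bit shift
    } while (x);
    return result;
}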

Is it faster to iterate through the elements of an array with pointers incremented by 1? [duplicate]

Is it faster to do something like
for ( int * pa(arr), * pb(arr+n); pa != pb; ++pa )
{
// do something with *pa
}
than
for ( size_t k = 0; k < n; ++k )
{
// do something with arr[k]
}
???
I understand that arr[k] is equivalent to *(arr+k), but in the first method you are using the current pointer, which has been incremented by 1 each iteration, while in the second case you are using a pointer computed from arr plus a successively larger offset. Maybe hardware has special ways of incrementing by 1 and so the first method is faster? Or not? Just curious. Hope my question makes sense.
If the compiler is smart enough (and most compilers are), the performance of both loops should be roughly equal.
For example, I compiled the following code with gcc 5.1.0, generating assembly:
int __attribute__ ((noinline)) compute1(int* arr, int n)
{
    int sum = 0;
    for(int i = 0; i < n; ++i)
    {
        sum += arr[i];
    }
    return sum;
}

int __attribute__ ((noinline)) compute2(int* arr, int n)
{
    int sum = 0;
    for(int * pa(arr), * pb(arr+n); pa != pb; ++pa)
    {
        sum += *pa;
    }
    return sum;
}
And the resulting assembly is:
compute1(int*, int):
    testl   %esi, %esi
    jle     .L4
    leal    -1(%rsi), %eax
    leaq    4(%rdi,%rax,4), %rdx
    xorl    %eax, %eax
.L3:
    addl    (%rdi), %eax
    addq    $4, %rdi
    cmpq    %rdx, %rdi
    jne     .L3
    rep ret
.L4:
    xorl    %eax, %eax
    ret

compute2(int*, int):
    movslq  %esi, %rsi
    xorl    %eax, %eax
    leaq    (%rdi,%rsi,4), %rdx
    cmpq    %rdx, %rdi
    je      .L10
.L9:
    addl    (%rdi), %eax
    addq    $4, %rdi
    cmpq    %rdi, %rdx
    jne     .L9
    rep ret
.L10:
    rep ret

main:
    xorl    %eax, %eax
    ret
As you can see, the heaviest part (the loop) of both functions is identical:
.L9:
    addl    (%rdi), %eax
    addq    $4, %rdi
    cmpq    %rdi, %rdx
    jne     .L9
    rep ret
But in more complex examples, or with a different compiler, the results might differ. So you should test and measure, but most compilers generate similar code.
The full code sample: https://goo.gl/mpqSS0
This cannot be answered. It depends on your compiler AND on your machine.
A very naive compiler would translate the code as is to machine code. Most machines indeed provide an increment operation that is very fast. They normally also provide relative addressing for an address with an offset. This could take a few cycles more than absolute addressing. So, yes, the version with pointers could potentially be faster.
But take into account that every machine is different AND that compilers are allowed to optimize as long as the observable behavior of your program doesn't change. Given that, I would suggest a reasonable compiler will create code from both versions that doesn't differ in performance.
Any reasonable compiler will generate code that is identical inside the loop for these two choices. I looked at the code generated for iterating over a std::vector, using a for-loop with an integer iterator or using a for (auto i : vec) style construct [std::vector internally has two pointers for the begin and end of the stored values, so like your pa and pb]. Both gcc and clang generate identical code inside the loop itself [the exact details of the loop are subtly different between the compilers, but other than that, there's no difference]. The setup of the loop was subtly different, but unless you OFTEN do loops of fewer than 5 items [and if so, why do you worry?], the actual content of the loop is what matters, not the bit just before it.
As with ALL code where performance is important, the exact code, compiler make and version, compiler options, processor make and model, will make a difference to how the code performs. But for the vast majority of processors and compilers, I'd expect no measurable difference. If the code is really critical, measure different alternatives and see what works best in your case.
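For concreteness, here is a minimal reconstruction (mine, not the answerer's actual test code) of the two std::vector loop styles being compared above:

#include <cstddef>
#include <vector>

int sum_indexed(const std::vector<int>& vec) {
    int sum = 0;
    for (std::size_t i = 0; i < vec.size(); ++i)   // integer index
        sum += vec[i];
    return sum;
}

int sum_ranged(const std::vector<int>& vec) {
    int sum = 0;
    for (auto v : vec)   // range-for: walks the begin()/end() pointers internally
        sum += v;
    return sum;
}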

Fastest way to reset a value in every struct element of a vector?

Very much like this question, except that instead of vector<int> I have vector<struct myType>.
If I want to reset (or for that matter, set to some value) myType.myVar for every element in the vector, what's the most efficient method?
Right now I'm iterating through:
for(int i=0; i<myVec.size(); i++) myVec.at(i).myVar = 0;
But since vectors are guaranteed to be stored contiguously, there's surely a better way?
Resetting needs to touch every element of the vector, so it requires at least O(n) work; your current algorithm is already O(n).
In this particular case you can use operator[] instead of at (that might throw an exception). But I doubt that's the bottleneck of your application.
On this note, if you wanted to reset each element as a whole (rather than a single member), you could use std::fill:
std::fill(myVec.begin(), myVec.end(), myType());
But unless you want to go down to the byte level and set a chunk of memory to 0, which will not only cause you headaches but also cost you portability in most cases, there's nothing to improve here.
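To make the distinction concrete, here is a minimal sketch (the extra member is a placeholder, not from the question) of a per-member reset next to a whole-element std::fill:

#include <algorithm>
#include <vector>

struct myType {
    int    myVar;
    double other;   // hypothetical extra member, for illustration only
};

void reset_member(std::vector<myType>& v) {
    // reset only myVar, leaving the other members untouched
    for (myType& e : v) e.myVar = 0;
}

void reset_whole(std::vector<myType>& v) {
    // std::fill overwrites every element with one value
    std::fill(v.begin(), v.end(), myType());
}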
Instead of the below code
for(int i=0; i<myVec.size(); i++) myVec.at(i).myVar = 0;
do it as follows:
size_t sz = myVec.size();
for(size_t i=0; i<sz; ++i) myVec[i].myVar = 0;
The at method internally checks whether the index is out of range, but since your loop condition already takes care of that (i < myVec.size()), you can avoid the extra check. Otherwise this is about as fast as it gets.
EDIT
In addition to that, we can store the vector's size() before executing the for loop. This ensures there is no further call to size() inside the loop.
One of the fastest ways would be to perform loop unwinding and break the speed limit posed by conventional for loops that cause great cache spills. In your case, as it's a run-time thing, there's no way to apply template metaprogramming, so a variation on good old Duff's device would do the trick:
#include <iostream>
#include <vector>
using namespace std;

struct myStruct {
    int    a;
    double b;
};

int main()
{
    std::vector<myStruct> mV(20);
    double val(20);                 // the new value you want to reset to
    int n = (mV.size()+7) / 8;      // 8 is the size of the block unroll
    auto to = mV.begin();

    switch(mV.size()%8)
    {
        case 0: do { (*to++).b = val;
        case 7:      (*to++).b = val;
        case 6:      (*to++).b = val;
        case 5:      (*to++).b = val;
        case 4:      (*to++).b = val;
        case 3:      (*to++).b = val;
        case 2:      (*to++).b = val;
        case 1:      (*to++).b = val;
                   } while (--n>0);
    }

    // just printing to verify that the value is set
    for (auto i : mV) std::cout << i.b << std::endl;

    return 0;
}
Here I choose to perform an 8-block unwind, to reset the value (let's say) b of a myStruct structure. The block size can be tweaked and loops are effectively unrolled. Remember this is the underlying technique in memcpy and one of the optimizations (loop unrolling in general) a compiler will attempt (actually they're quite good at this so we might as well let them do their job).
In addition to what's been said before, you should consider that if you turn on optimizations, the compiler will likely perform loop-unrolling which will make the loop itself faster.
Also, pre-increment ++i can take a few instructions fewer than post-increment i++. Explanation here.
Beware of spending a lot of time thinking about optimization details which the compiler will just take care of for you.
Here are four implementations of what I understand the OP to be asking for, along with the code generated using gcc 4.8 with --std=c++11 -O3 -S.
Declarations:
#include <algorithm>
#include <vector>

struct T {
    int irrelevant;
    int relevant;
    double trailing;
};
Explicit loop implementations, roughly from answers and comments provided to OP. Both produced identical machine code, aside from labels.
void clear_relevant(std::vector<T>* vecp) {
    for(unsigned i=0; i<vecp->size(); i++) {
        vecp->at(i).relevant = 0;
    }
}

void clear_relevant2(std::vector<T>* vecp) {
    std::vector<T>& vec = *vecp;
    auto s = vec.size();
    for (unsigned i = 0; i < s; ++i) {
        vec[i].relevant = 0;
    }
}

The shared assembly (identical for both, up to labels):

    .cfi_startproc
    movq    (%rdi), %rsi
    movq    8(%rdi), %rcx
    xorl    %edx, %edx
    xorl    %eax, %eax
    subq    %rsi, %rcx
    sarq    $4, %rcx
    testq   %rcx, %rcx
    je      .L1
    .p2align 4,,10
    .p2align 3
.L5:
    salq    $4, %rdx
    addl    $1, %eax
    movl    $0, 4(%rsi,%rdx)
    movl    %eax, %edx
    cmpq    %rcx, %rdx
    jb      .L5
.L1:
    rep ret
    .cfi_endproc
Two other versions, one using std::for_each and the other one using the range for syntax. Here there is a subtle difference in the code for the two versions (other than the labels):
void clear_relevant3(std::vector<T>* vecp) {
    for (auto& p : *vecp) p.relevant = 0;
}

    .cfi_startproc
    movq    8(%rdi), %rdx
    movq    (%rdi), %rax
    cmpq    %rax, %rdx
    je      .L17
    .p2align 4,,10
    .p2align 3
.L21:
    movl    $0, 4(%rax)
    addq    $16, %rax
    cmpq    %rax, %rdx
    jne     .L21
.L17:
    rep ret
    .cfi_endproc

void clear_relevant4(std::vector<T>* vecp) {
    std::for_each(vecp->begin(), vecp->end(),
                  [](T& o){o.relevant=0;});
}

    .cfi_startproc
    movq    8(%rdi), %rdx
    movq    (%rdi), %rax
    cmpq    %rdx, %rax
    je      .L12
    .p2align 4,,10
    .p2align 3
.L16:
    movl    $0, 4(%rax)
    addq    $16, %rax
    cmpq    %rax, %rdx
    jne     .L16
.L12:
    rep ret
    .cfi_endproc

When/why does (a < 0) potentially branch in an expression?

After reading many of the comments on this question, there are a couple people (here and here) that suggest that this code:
int val = 5;
int r = (0 < val) - (val < 0); // this line here
will cause branching. Unfortunately, none of them give any justification or say why it would cause branching (tristopia suggests it requires a cmove-like instruction or predication, but doesn't really say why).
Are these people right that "a comparison used in an expression will not generate a branch" is actually myth rather than fact (assuming you're not using some esoteric processor)? If so, can you give an example?
I would've thought there wouldn't be any branching (given that there's no logical "short circuiting"), and now I'm curious.
To simplify matters, consider just one part of the expression: val < 0. Essentially, this means “if val is negative, return 1, otherwise 0”; you could also write it like this:
val < 0 ? 1 : 0
How this is translated into processor instructions depends heavily on the compiler and the target processor. The easiest way to find out is to write a simple test function, like so:
int compute(int val) {
return val < 0 ? 1 : 0;
}
and review the assembler code that is generated by the compiler (e.g., with gcc -S -o - example.c). For my machine, it does it without branching. However, if I change it to return 5 instead of 1, there are branch instructions:
    ...
    cmpl    $0, -4(%rbp)
    jns     .L2
    movl    $5, %eax
    jmp     .L3
.L2:
    movl    $0, %eax
.L3:
    ...
So, “a comparison used in an expression will not generate a branch” is indeed a myth. (But “a comparison used in an expression will always generate a branch” isn’t true either.)
Addition in response to this extension/clarification:
I'm asking if there's any (sane) platform/compiler for which a branch is likely. MIPS/ARM/x86(_64)/etc. All I'm looking for is one case that demonstrates that this is a realistic possibility.
That depends on what you consider a “sane” platform. If the venerable 6502 CPU family is sane, I think there is no way to calculate val > 0 on it without branching. Most modern instruction sets, on the other hand, provide some type of set-on-X instruction.
(val < 0 can actually be computed without branching even on 6502, because it can be implemented as a bit shift.)
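For example, a branch-free way to obtain val < 0 as 0 or 1 on a 32-bit two's-complement int is to shift the sign bit down (a sketch; the cast to unsigned keeps the shift well defined):

#include <stdint.h>

static inline int is_negative(int32_t val) {
    return (int)((uint32_t)val >> 31);   // 1 if the sign bit is set, else 0
}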
Empiricism for the win:
int sign(int val) {
return (0 < val) - (val < 0);
}
compiled with optimisations. gcc (4.7.2) produces
sign:
.LFB0:
    .cfi_startproc
    xorl    %eax, %eax
    testl   %edi, %edi
    setg    %al
    shrl    $31, %edi
    subl    %edi, %eax
    ret
    .cfi_endproc
no branch. clang (3.2):
sign:                           # #sign
    .cfi_startproc
# BB#0:
    movl    %edi, %ecx
    shrl    $31, %ecx
    testl   %edi, %edi
    setg    %al
    movzbl  %al, %eax
    subl    %ecx, %eax
    ret
neither. (on x86_64, Core i5)
This is actually architecture-dependent. If there exists an instruction to set the value to 0/1 depending on the sign of another value, there will be no branching. If there's no such instruction, branching would be necessary.

Is there a more efficient way to get the length of a 32bit integer in bytes?

I'd like a shortcut for the following little function, where performance is very important (the function is called more than 10,000,000 times):
inline int len(uint32 val)
{
    if(val <= 0x000000ff) return 1;
    if(val <= 0x0000ffff) return 2;
    if(val <= 0x00ffffff) return 3;
    return 4;
}
Does anyone have any idea... a cool bit-operation trick?
Thanks for your help in advance!
How about this one?
inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
        ;
}
Removing the inline keyword, g++ -O2 compiles this to the following branchless code:
    movl    8(%ebp), %edx
    movl    %edx, %eax
    andl    $-16777216, %eax
    cmpl    $1, %eax
    sbbl    %eax, %eax
    addl    $4, %eax
    xorl    %ecx, %ecx
    testl   $-65536, %edx
    sete    %cl
    subl    %ecx, %eax
    andl    $-256, %edx
    sete    %dl
    movzbl  %dl, %edx
    subl    %edx, %eax
If you don't mind machine-specific solutions, you can use the bsr (bit scan reverse) instruction, which finds the index of the highest set bit. Then you simply divide by 8 to convert bits to bytes and add 1 to shift the range 0..3 to 1..4:
int len(uint32 val)
{
    asm("mov 8(%ebp), %eax");
    asm("or  $255, %eax");
    asm("bsr %eax, %eax");
    asm("shr $3, %eax");
    asm("inc %eax");
    asm("mov %eax, 8(%ebp)");
    return val;
}
Note that I am not an inline assembly god, so maybe there's a better solution for accessing val instead of addressing the stack explicitly. But you should get the basic idea.
The GNU compiler also has an interesting built-in function called __builtin_clz:
inline int len(uint32 val)
{
return ((__builtin_clz(val | 255) ^ 31) >> 3) + 1;
}
This looks much better than the inline assembly version to me :)
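To unpack that expression step by step (my annotation using the question's uint32 typedef, not part of the original answer):

inline int len_steps(uint32 val)
{
    // val | 255          : force a bit into the low byte, so clz never sees 0
    // __builtin_clz(...) : leading-zero count, 0..24 after the OR
    // ^ 31               : same as 31 - clz, i.e. the index of the highest set bit
    // >> 3               : bit index -> byte index (0..3)
    // + 1                : byte index -> length 1..4
    int highest_bit = __builtin_clz(val | 255) ^ 31;
    return (highest_bit >> 3) + 1;
}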
I did a mini unscientific benchmark just measuring the difference in GetTickCount() calls when calling the function in a loop from 0 to MAX_LONG times under the VS 2010 compiler.
Here's what I saw:
This took 11497 ticks
inline int len(uint32 val)
{
    if(val <= 0x000000ff) return 1;
    if(val <= 0x0000ffff) return 2;
    if(val <= 0x00ffffff) return 3;
    return 4;
}
While this took 14399 ticks
inline int len(uint32 val)
{
    return 4
        - ((val & 0xff000000) == 0)
        - ((val & 0xffff0000) == 0)
        - ((val & 0xffffff00) == 0)
        ;
}
edit: my idea about why one was faster is wrong because:
inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
        ;
}
This version used only 11107 ticks. Perhaps because + is faster than -? I'm not sure.
Even faster though was the binary search at 7161 ticks
inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000)? 4: 3;
    return (val & 0x0000ff00)? 2: 1;
}
And fastest so far is using the MS intrinsic function, at 4399 ticks
#pragma intrinsic(_BitScanReverse)

inline int len2(uint32 val)
{
    DWORD index;
    _BitScanReverse(&index, val);
    return (index>>3)+1;
}
For reference - here's the code I used to profile:
int _tmain(int argc, _TCHAR* argv[])
{
    int j = 0;
    DWORD t1,t2;

    t1 = GetTickCount();
    for(ULONG i=0; i<-1; i++)   // i < (ULONG)-1, i.e. almost all 32-bit values
        j=len(i);
    t2 = GetTickCount();
    _tprintf(_T("%ld ticks %ld\n"), t2-t1, j);

    t1 = GetTickCount();
    for(ULONG i=0; i<-1; i++)
        j=len2(i);
    t2 = GetTickCount();
    _tprintf(_T("%ld ticks %ld\n"), t2-t1, j);
}
Had to print j to prevent the loops from being optimized out.
Do you really have profile evidence that this is a significant bottleneck in your application? Just do it the most obvious way and only if profiling shows it to be a problem (which I doubt), then try to improve things. Most likely you'll get the best improvement by reducing the number of calls to this function than by changing something within it.
Binary search MIGHT save a few cycles, depending on the processor architecture.
inline int len(uint32 val)
{
    if (val & 0xffff0000) return (val & 0xff000000)? 4: 3;
    return (val & 0x0000ff00)? 2: 1;
}
Or, finding out which is the most common case might bring down the average number of cycles, if most inputs are one byte (eg when building UTF-8 encodings, but then your break points wouldn't be 32/24/16/8):
inline int len(uint32 val)
{
    if (val & 0xffffff00) {
        if (val & 0xffff0000) {
            if (val & 0xff000000) return 4;
            return 3;
        }
        return 2;
    }
    return 1;
}
Now the short case does the fewest conditional tests.
If bit ops are faster than comparison on your target machine you can do this:
inline int len(uint32 val)
{
    if(val & 0xff000000) return 4;
    if(val & 0x00ff0000) return 3;
    if(val & 0x0000ff00) return 2;
    return 1;
}
You can avoid the conditional branches that can be costly if the distribution of your numbers does not make prediction easy:
return 4 - (val <= 0x000000ff) - (val <= 0x0000ffff) - (val <= 0x00ffffff);
Changing the <= to a & will not change anything much on a modern processor. What is your target platform?
Here is the generated code for x86-64 with gcc -O:
    cmpl    $255, %edi
    setg    %al
    movzbl  %al, %eax
    addl    $3, %eax
    cmpl    $65535, %edi
    setle   %dl
    movzbl  %dl, %edx
    subl    %edx, %eax
    cmpl    $16777215, %edi
    setle   %dl
    movzbl  %dl, %edx
    subl    %edx, %eax
There are comparison instructions cmpl of course, but these are followed by setg or setle instead of conditional branches (as would be usual). It's the conditional branch that is expensive on a modern pipelined processor, not the comparison. So this version saves the expensive conditional branches.
My attempt at hand-optimizing gcc's assembly:
    cmpl    $255, %edi
    setg    %al
    addb    $3, %al
    cmpl    $65535, %edi
    setle   %dl
    subb    %dl, %al
    cmpl    $16777215, %edi
    setle   %dl
    subb    %dl, %al
    movzbl  %al, %eax
You may have a more efficient solution depending on your architecture.
MIPS has a "CLZ" instruction that counts the number of leading zero-bits of a number. What you are looking for here is essentially 4 - (CLZ(x) / 8) (where / is integer division). PowerPC has the equivalent instruction cntlz, and x86 has BSR. This solution should simplify down to 3-4 instructions (not counting function call overhead) and zero branches.
On some architectures this could be quicker:
inline int len(uint32_t val) {
    // log base 256 of val, rounded down, plus one (assumes val > 0)
    return (int)( log((double)val) / log(256.0) ) + 1;
}
This may also be slightly faster (if comparison takes longer than bitwise and):
inline int len(uint32_t val) {
    if (val & ~0x00FFffFF) {
        return 4;
    }
    if (val & ~0x0000ffFF) {
        return 3;
    }
    if (val & ~0x000000FF) {
        return 2;
    }
    return 1;
}
If you are on an 8 bit microcontroller (like an 8051 or AVR) then this will work best:
inline int len(uint32_t val) {
    union int_char {
        uint32_t u;
        uint8_t  a[4];
    } x;
    x.u = val;   // doing it this way rather than taking the address of val often
                 // prevents the compiler from doing dumb things.
    if (x.a[0]) {
        return 4;
    } else if (x.a[1]) {
        return 3;
    ...
EDIT by tristopia: endianness aware version of the last variant
int len(uint32_t val)
{
    union int_char {
        uint32_t u;
        uint8_t  a[4];
    } x;
    const uint16_t w = 1;
    x.u = val;
    if( ((uint8_t *)&w)[1]) {      // BIG ENDIAN (Sparc, m68k, ARM, Power)
        if(x.a[0]) return 4;
        if(x.a[1]) return 3;
        if(x.a[2]) return 2;
    }
    else {                         // LITTLE ENDIAN (x86, 8051, ARM)
        if(x.a[3]) return 4;
        if(x.a[2]) return 3;
        if(x.a[1]) return 2;
    }
    return 1;
}
Because of the const, any compiler worth its salt will only generate the code for the right endianness.
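On newer compilers the same dispatch can be made explicitly compile time with C++20's std::endian (a sketch that goes beyond what the answer used; it also assumes uint8_t aliases unsigned char, as it does on common platforms):

#include <bit>       // std::endian (C++20)
#include <cstdint>

inline int len(uint32_t val)
{
    const uint8_t* a = reinterpret_cast<const uint8_t*>(&val);
    if constexpr (std::endian::native == std::endian::big) {
        if (a[0]) return 4;
        if (a[1]) return 3;
        if (a[2]) return 2;
    } else {
        if (a[3]) return 4;
        if (a[2]) return 3;
        if (a[1]) return 2;
    }
    return 1;
}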
to Pascal Cuoq and the 35 other people who up-voted his comment:
"Wow! More than 10 million times... You mean that if you squeeze three cycles out of this function, you will save as much as 0.03s? "
Such a sarcastic comment is at best rude and offensive.
Optimization is frequently the cumulative result of 3% here, 2% there. 3% in overall capacity is nothing to be sneezed at. Suppose this was an almost saturated and unparallelizable stage in a pipe, and suppose CPU utilization went from 99% to 96%. Simple queuing theory tells you that such a reduction in CPU utilization would reduce the average queue length by over 75% [queue length grows roughly as load / (1 - load)].
Such a reduction can frequently make or break a particular hardware configuration as this has feed back effects on memory requirements, caching the queued items, lock convoying, and (horror of horrors should it be a paged system) even paging. It is precisely these sorts of effects that cause bifurcated hysteresis loop type system behavior.
Arrival rates of anything seem to tend to go up and field replacement of a particular CPU or buying a faster box is frequently just not an option.
Optimization is not just about wall clock time on a desktop. Anyone who thinks that it is has much reading to do about the measurement and modelling of computer program behavior.
Pascal Cuoq owes the original poster an apology.
Just to illustrate, based on FredOverflow's answer (which is nice work, kudos and +1), a common pitfall regarding branches on x86. Here's FredOverflow's assembly as output by gcc:
    movl    8(%ebp), %edx       # 1/.5
    movl    %edx, %eax          # 1/.5
    andl    $-16777216, %eax    # 1/.5
    cmpl    $1, %eax            # 1/.5
    sbbl    %eax, %eax          # 8/6
    addl    $4, %eax            # 1/.5
    xorl    %ecx, %ecx          # 1/.5
    testl   $-65536, %edx       # 1/.5
    sete    %cl                 # 5
    subl    %ecx, %eax          # 1/.5
    andl    $-256, %edx         # 1/.5
    sete    %dl                 # 5
    movzbl  %dl, %edx           # 1/.5
    subl    %edx, %eax          # 1/.5
    # sum total: 29/21.5 cycles
(the latency, in cycles, is to be read as Prescott/Northwood)
Pascal Cuoq's hand-optimized assembly (also kudos):
    cmpl    $255, %edi          # 1/.5
    setg    %al                 # 5
    addb    $3, %al             # 1/.5
    cmpl    $65535, %edi        # 1/.5
    setle   %dl                 # 5
    subb    %dl, %al            # 1/.5
    cmpl    $16777215, %edi     # 1/.5
    setle   %dl                 # 5
    subb    %dl, %al            # 1/.5
    movzbl  %al, %eax           # 1/.5
    # sum total: 22/18.5 cycles
Edit: FredOverflow's solution using __builtin_clz():
    movl    8(%ebp), %eax       # 1/.5
    popl    %ebp                # 1.5
    orb     $-1, %al            # 1/.5
    bsrl    %eax, %eax          # 16/8
    sarl    $3, %eax            # 1/4
    addl    $1, %eax            # 1/.5
    ret
    # sum total: 20/13.5 cycles
and the gcc assembly for your code:
    movl    $1, %eax            # 1/.5
    movl    %esp, %ebp          # 1/.5
    movl    8(%ebp), %edx       # 1/.5
    cmpl    $255, %edx          # 1/.5
    jbe     .L3                 # up to 9 cycles
    cmpl    $65535, %edx        # 1/.5
    movb    $2, %al             # 1/.5
    jbe     .L3                 # up to 9 cycles
    cmpl    $16777216, %edx     # 1/.5
    sbbl    %eax, %eax          # 8/6
    addl    $4, %eax            # 1/.5
.L3:
    ret
    # sum total: 16/10 cycles - 34/28 cycles
in which the instruction-cache line fetches that come as a side effect of the jcc instructions probably cost nothing for such a short function.
Branches can be a reasonable choice, depending on the input distribution.
Edit: added FredOverflow's solution which is using __builtin_clz().
OK, one more version. Similar to Fred's, but with fewer operations.
inline int len(uint32 val)
{
    return 1
        + (val > 0x000000ff)
        + (val > 0x0000ffff)
        + (val > 0x00ffffff)
        ;
}
The following gives you fewer comparisons, but it may be less efficient if a memory access costs more than a couple of comparisons.
int precalc[1<<16];
int precalchigh[1<<16];

void doprecalc()
{
    for(int i = 0; i < 1<<16; i++) {
        precalc[i] = (i < (1<<8) ? 1 : 2);
        precalchigh[i] = precalc[i] + 2;
    }
}

inline int len(uint32 val)
{
    return (val & 0xffff0000 ? precalchigh[val >> 16] : precalc[val]);
}
The minimum number of bits required to store a positive integer n is:
int minbits = (int)floor( log10(n) / log10(2) ) + 1 ;
The number of bytes is then:
int minbytes = ((int)floor( log10(n) / log10(2) )) / 8 + 1 ;
This is an entirely FPU-bound solution; performance may or may not be better than a conditional test, but it is perhaps worth investigating.
[EDIT]
I did the investigation; a simple loop of ten million iterations of the above took 918ms whereas FredOverflow's accepted solution took just 49ms (VC++ 2010). So this is not an improvement in terms of performance, though may remain useful if it were the number of bits that were required, and further optimisations are possible.
If I remember 80x86 asm right, I'd do something like:
    ; Assume value in EAX; count goes into ECX
    cmp     eax, 16777215   ; Carry set if less
    sbb     ecx, ecx        ; Load -1 if less, 0 if greater
    cmp     eax, 65535
    sbb     ecx, 0          ; Subtract 1 if less; 0 if greater
    cmp     eax, 255
    sbb     ecx, -4         ; Add 3 if less, 4 if greater
Six instructions. I think the same approach would also work for six instructions on the ARM I use.