Find the k'th non-zero bit in a 32-bit int

I am looking for an efficient way of calculating the kth non-zero bit in a 32 bit int.
The best I can think of is a serial algorithm, using ctz (count trailing zeros):
uint test = 0x88;
int pos = -1;
for (int p = 0; p < k; ++p) {
    int z = ctz(test);  // distance to the next set bit
    pos += z + 1;       // accumulate its absolute position
    test >>= z + 1;     // consume that bit
}
// pos now holds the zero-based position of the kth set bit
but I am looking for something more parallel. This is for an OpenCL kernel.
Edit:
For the above example, the first non-zero bit is at position 3 (zero based)
and the second non-zero bit is at position 7. There are no other non-zero bits.
Thanks!

The question is somewhat unclear but I presume that your desired output is the bit position of the Nth set bit in the input integer.
Alas I am not certain there is much room to speed this up. However we can try bringing the complexity down from a linear to a logarithmic operation, which may or may not help depending on the typical bit number sought after. The integer width is small and constant here so the scope for improvement is limited.
The idea is to recursively perform population counts on the lower half of the input integer. If the target bit index exceeds the count of set bits in the lower half, select the upper half and recurse into it; otherwise recurse into the lower half.
A neat benefit is that partial totals of the traditional recursive population count trick can be reused in the search.
Something along these, virtually untested, lines:
#include <stdint.h>

unsigned int position_of_nth_set_bit(uint_fast32_t input, unsigned int bitno) {
    // Perform the common population-count trick of recursively building up the
    // sub-counts as hierarchical bit-fields
    const uint_fast32_t mask1 = UINT32_C(0x55555555);
    const uint_fast32_t mask2 = UINT32_C(0x33333333);
    const uint_fast32_t mask4 = UINT32_C(0x0F0F0F0F);
    const uint_fast32_t mask8 = UINT32_C(0x00FF00FF);
    const uint_fast32_t pop1 = input;
    const uint_fast32_t pop2 = ((pop1 >> 1) & mask1) + (pop1 & mask1);
    const uint_fast32_t pop4 = ((pop2 >> 2) & mask2) + (pop2 & mask2);
    const uint_fast32_t pop8 = ((pop4 >> 4) & mask4) + (pop4 & mask4);
    const uint_fast32_t pop16 = ((pop8 >> 8) & mask8) + (pop8 & mask8);
    unsigned int bitpos = 0;
    // Recursively check whether our target bit falls into the upper or lower
    // half-space, and shift down accordingly
    unsigned int halfspace;
    if (halfspace = (pop16 & 31), bitno > halfspace) { // a 16-bit half can hold up to 16 set bits
        bitno -= halfspace;
        bitpos += 16;
    }
    if (halfspace = (pop8 >> bitpos) & 15, bitno > halfspace) { // an 8-bit half can hold up to 8
        bitno -= halfspace;
        bitpos += 8;
    }
    if (halfspace = (pop4 >> bitpos) & 7, bitno > halfspace) {
        bitno -= halfspace;
        bitpos += 4;
    }
    if (halfspace = (pop2 >> bitpos) & 3, bitno > halfspace) {
        bitno -= halfspace;
        bitpos += 2;
    }
    if (halfspace = (pop1 >> bitpos) & 1, bitno > halfspace)
        bitpos += 1;
    return bitpos;
}
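For example, with the 0x88 input from the question (set bits at zero-based positions 3 and 7) and bitno counted from 1, a quick sanity check might look like:

#include <stdio.h>

int main(void) {
    printf("%u\n", position_of_nth_set_bit(0x88, 1)); // prints 3
    printf("%u\n", position_of_nth_set_bit(0x88, 2)); // prints 7
    return 0;
}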

Related

Implement a function that blends two colors encoded with RGB565 using Alpha blending

I am trying to implement a function that blends two colors encoded with RGB565 using alpha blending:
Crgb565 = (1 - a) * Argb565 + a * Brgb565
where a is the alpha parameter, and the alpha blending value of 0.0-1.0 is mapped to an unsigned char value in the range 0-32.
We can choose to use a five-bit representation for a instead, thus restricting it to the range 0-31 (effectively mapping to an alpha blending value of 0.0-0.96875).
Below is the code I am trying to implement. Can you suggest a better way with respect to fewer temp variables and memory optimization (number of multiplications and required memory accesses)? Is my logic for alpha blending correct? I am not getting the correct/expected output, so it seems I am missing something; please review the code. Every suggestion is appreciated. I have some doubts about the alpha parameter, which I have put in the code comments. Is there any way to shorten the alpha blending equations (the division operation)?
=====================================================
unsigned short blend_rgb565(unsigned short A, unsigned short B, unsigned char Alpha)
{
    unsigned short res = 0;
    // Alpha converted from [0..255] to [0..31] (8 bit to 5 bit)
    /* I want the alpha parameter in 0-32; do I need to add something to Alpha before the right shift? */
    Alpha = Alpha >> 3;
    // Split image A into R, G, B components
    /* Do I need to take these as unsigned short, or would uint8_t also work fine? */
    unsigned short A_r = A >> 11;
    unsigned short A_g = (A >> 5) & ((1u << 6) - 1); // ((1u << 6) - 1) --> 00000000 00111111
    unsigned short A_b = A & ((1u << 5) - 1);        // ((1u << 5) - 1) --> 00000000 00011111
    // Split image B into R, G, B components
    unsigned short B_r = B >> 11;
    unsigned short B_g = (B >> 5) & ((1u << 6) - 1);
    unsigned short B_b = B & ((1u << 5) - 1);
    // Alpha blend components
    /* Do I need to use 255 (8 bit) instead of 32 (5 bit)? Why are we dividing by it? I took the
       reference from the internet but need a little more clarification. */
    unsigned short uiC_r = (A_r * Alpha + B_r * (32 - Alpha)) / 32;
    unsigned short uiC_g = (A_g * Alpha + B_g * (32 - Alpha)) / 32;
    unsigned short uiC_b = (A_b * Alpha + B_b * (32 - Alpha)) / 32;
    // Pack result
    res = (unsigned short)((uiC_r << 11) | (uiC_g << 5) | uiC_b);
    return res;
}
=====================
EDIT:
Adding method 2. Is this approach correct?
Method 2:
// rrrrrggggggbbbbb
#define RB_MASK     63519   // 0b1111100000011111 --> hex: F81F
#define G_MASK      2016    // 0b0000011111100000 --> hex: 07E0
#define RB_MUL_MASK 2032608 // 0b111110000001111100000 --> hex: 1F03E0
#define G_MUL_MASK  64512   // 0b000001111110000000000 --> hex: FC00
unsigned short blend_rgb565(unsigned short A, unsigned short B, unsigned char Alpha) {
    // Alpha converted from [0..255] to [0..31]
    Alpha = Alpha >> 3;
    uint8_t beta = 32 - Alpha;
    // Alpha + beta == 32, so the per-channel weighted sums stay in range before the >> 5
    return (unsigned short)
    (
        (
            ( ( Alpha * (uint32_t)( A & RB_MASK ) + beta * (uint32_t)( B & RB_MASK )) & RB_MUL_MASK )
            |
            ( ( Alpha * ( A & G_MASK ) + beta * ( B & G_MASK )) & G_MUL_MASK )
        )
        >> 5 // removing the 5 alpha bits
    );
}
It's possible to reduce the multiplies from 6 to 2 if you space out the RGB values into 2 32-bit integers before multiplying:
unsigned short blend_rgb565(unsigned short A, unsigned short B, unsigned char Alpha)
{
    unsigned short res = 0;
    // Alpha converted from [0..255] to [0..31] (8 bit to 5 bit)
    Alpha = Alpha >> 3;
    // Alpha = (Alpha + (Alpha >> 5)) >> 3; // map from 0-255 to 0-32 (if Alpha is unsigned short or larger)
    // Space out A and B from RRRRRGGGGGGBBBBB to 00000RRRRR00000GGGGGG00000BBBBB
    // 31 = 11111 binary
    // 63 = 111111 binary
    unsigned int A32 = (unsigned int)A;
    unsigned int A_spaced = A32 & 31;     // B
    A_spaced |= (A32 & (63 << 5)) << 5;   // G
    A_spaced |= (A32 & (31 << 11)) << 11; // R
    unsigned int B32 = (unsigned int)B;
    unsigned int B_spaced = B32 & 31;     // B
    B_spaced |= (B32 & (63 << 5)) << 5;   // G
    B_spaced |= (B32 & (31 << 11)) << 11; // R
    // multiply and add the alpha to give a result RRRRRrrrrrGGGGGGgggggBBBBBbbbbb,
    // where RGB are the most significant bits we want to keep
    unsigned int C_spaced = (A_spaced * Alpha) + (B_spaced * (32 - Alpha));
    // remap back to RRRRRGGGGGGBBBBB
    res = (unsigned short)(((C_spaced >> 5) & 31) + ((C_spaced >> 10) & (63 << 5)) + ((C_spaced >> 16) & (31 << 11)));
    return res;
}
You need to profile this to see if it is actually faster; it assumes that the multiplications you save are slower than the extra bit manipulations that replace them.
Can you suggest a better way with respect to fewer temp variables?
There is no advantage to removing temporary variables from the implementation. When you compile with optimizations turned on (e.g. -O2 or /O2), those temp variables will get optimized away.
A few adjustments I would make to your code:
Use uint16_t instead of unsigned short. For most platforms it won't matter, since sizeof(uint16_t) == sizeof(unsigned short), but it helps to be definitive.
There is no point in converting alpha from an 8-bit value to a 5-bit value. You'll get better accuracy with blending if you let alpha have the full range.
Some of your bit-shifting looks odd. It might work, but I use a simpler approach below.
Here's an adjustment to your implementation:
#include <stdint.h>

#define MAKE_RGB565(r, g, b) ((r << 11) | (g << 5) | (b))

uint16_t blend_rgb565(uint16_t a, uint16_t b, uint8_t Alpha)
{
    const uint8_t invAlpha = 255 - Alpha;
    uint16_t A_r = a >> 11;
    uint16_t A_g = (a >> 5) & 0x3f;
    uint16_t A_b = a & 0x1f;
    uint16_t B_r = b >> 11;
    uint16_t B_g = (b >> 5) & 0x3f;
    uint16_t B_b = b & 0x1f;
    uint32_t C_r = (A_r * invAlpha + B_r * Alpha) / 255;
    uint32_t C_g = (A_g * invAlpha + B_g * Alpha) / 255;
    uint32_t C_b = (A_b * invAlpha + B_b * Alpha) / 255;
    return MAKE_RGB565(C_r, C_g, C_b);
}
But the bigger issue is that this function works on exactly one pair of pixel colors. If you are invoking this function across an entire image or pair of images, the overhead of the function call is going to be a major performance issue, even with compiler optimizations and inlining. So if you are calling this function row x col times, you should probably manually inline the code into the loop that enumerates every pixel of an image (or pair of images), along the lines of the sketch below.
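For illustration, a manually inlined version over whole buffers might look something like this untested sketch (the flat uint16_t arrays of length count are my assumption, not from the question):

#include <stddef.h>
#include <stdint.h>

void blend_rgb565_buffer(const uint16_t *a, const uint16_t *b,
                         uint16_t *out, size_t count, uint8_t alpha)
{
    const uint8_t invAlpha = 255 - alpha;
    for (size_t i = 0; i < count; i++) {
        // Same per-channel blend as above, with no per-pixel call overhead
        uint32_t r = ((a[i] >> 11) * invAlpha + (b[i] >> 11) * alpha) / 255;
        uint32_t g = (((a[i] >> 5) & 0x3f) * invAlpha + ((b[i] >> 5) & 0x3f) * alpha) / 255;
        uint32_t bl = ((a[i] & 0x1f) * invAlpha + (b[i] & 0x1f) * alpha) / 255;
        out[i] = (uint16_t)((r << 11) | (g << 5) | bl);
    }
}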
In the same vein as #samgak's answer, you can implement this more efficiently on a 64-bit architecture by "post-masking", as follows:
rrrrrggggggbbbbb
Replicate to a long long (by shifting or mapping the long long to four shorts)
---------------- rrrrrggggggbbbbb rrrrrggggggbbbbb rrrrrggggggbbbbb
Mask out the useless bits
---------------- rrrrr----------- -----gggggg----- -----------bbbbb
Multiply by α
-----------rrrrr rrrrr----------- ggggggggggg----- ------bbbbbbbbbb
Mask out the low order bits
-----------rrrrr ---------------- gggggg---------- ------bbbbb-----
Pack
rrrrrggggggbbbbb
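Here is a rough, untested C sketch of that diagram; the lane offsets, masks and repacking shifts are my own reading of it, using the 0..32 alpha convention from the question:

#include <stdint.h>

uint16_t blend_rgb565_wide(uint16_t A, uint16_t B, uint8_t Alpha)
{
    uint32_t a = Alpha >> 3;  // 0..31; use the 0..32 mapping discussed above if preferred
    uint32_t b = 32 - a;
    // Replicate each pixel into three 16-bit lanes, keeping one channel per lane:
    // R in bits 43..47, G in bits 21..26, B in bits 0..4
    uint64_t As = (((uint64_t)A << 32) & (0xF800ull << 32))
                | (((uint64_t)A << 16) & (0x07E0ull << 16))
                | (A & 0x001Fu);
    uint64_t Bs = (((uint64_t)B << 32) & (0xF800ull << 32))
                | (((uint64_t)B << 16) & (0x07E0ull << 16))
                | (B & 0x001Fu);
    // Two multiplies blend all three channels; each weighted sum stays
    // inside its own lane, so no carries cross between channels
    uint64_t C = As * a + Bs * b;
    // Drop the 5 low-order alpha bits of each lane and repack
    return (uint16_t)(((C >> 37) & 0xF800) | ((C >> 21) & 0x07E0) | ((C >> 5) & 0x001F));
}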
Another saving is possible by rewriting
(1 - α) X + α Y
as
X + α (Y - X)
(or X - α (X - Y) to avoid negatives). This spares a multiply (at the expense of a comparison).
Update:
The "saving" above cannot work because the negatives should be handled component-wise.

Generate random numbers in a given range with AVX2, faster than SVML _mm256_rem_epu32 remainder?

I'm currently trying to implement an xorshift random number generator using AVX2; it's actually quite easy and very fast. However, I need to be able to specify a range, which usually requires modulo.
This is a major problem for me for 2 reasons:
Adding the _mm256_rem_epu32() / _mm256_rem_epi32() SVML function to my code takes the run time of my loop from around 270ms to 1.8 seconds. Ouch!
SVML is only available on MSVC and Intel Compilers
Are there any significantly faster ways to do modulo using AVX2?
Non Vector Code:
std::srand(std::time(nullptr));
std::mt19937_64 e(std::rand());
uint32_t seed = static_cast<uint32_t>(e());
for (; i != end; ++i)
{
    seed ^= (seed << 13u);
    seed ^= (seed >> 7u);
    seed ^= (seed << 17u);
    arr[i] = static_cast<T>(low + (seed % ((up + 1u) - low)));
}//End for
Vectorized:
constexpr uint32_t thirteen = 13u;
constexpr uint32_t seven = 7u;
constexpr uint32_t seventeen = 17u;

const __m256i _one = _mm256_set1_epi32(1);
const __m256i _lower = _mm256_set1_epi32(static_cast<uint32_t>(low));
const __m256i _upper = _mm256_set1_epi32(static_cast<uint32_t>(up));

__m256i _temp = _mm256_setzero_si256();
__m256i _res = _mm256_setzero_si256();

__m256i _seed = _mm256_set_epi32(
    static_cast<uint32_t>(e()),
    static_cast<uint32_t>(e()),
    static_cast<uint32_t>(e()),
    static_cast<uint32_t>(e()),
    static_cast<uint32_t>(e()),
    static_cast<uint32_t>(e()),
    static_cast<uint32_t>(e()),
    static_cast<uint32_t>(e())
);

for (; (i + 8uz) < end; i += 8)
{
    //Generate Random Numbers
    _temp = _mm256_slli_epi32(_seed, thirteen);
    _seed = _mm256_xor_si256(_seed, _temp);
    _temp = _mm256_srli_epi32(_seed, seven); // logical shift, matching the scalar seed >> 7u
    _seed = _mm256_xor_si256(_seed, _temp);
    _temp = _mm256_slli_epi32(_seed, seventeen);
    _seed = _mm256_xor_si256(_seed, _temp);
    //Narrow
    _temp = _mm256_add_epi32(_upper, _one);
    _temp = _mm256_sub_epi32(_temp, _lower);
    _temp = _mm256_rem_epu32(_seed, _temp); //Comment this line out for a massive speed up but incorrect results
    _res = _mm256_add_epi32(_lower, _temp);
    _mm256_store_si256((__m256i*) &arr[i], _res);
}//End for
If your range is smaller than ~16.7 million, and you don't need cryptography-grade quality of the distribution, an easy and relatively fast method of narrowing these random numbers is FP32 math.
Here’s an example, untested.
The function below takes an integer vector with random bits, and converts these bits into integer numbers in the [ 0 .. range - 1 ] interval.
#include <immintrin.h>
#include <assert.h>

// Ideally, make sure this function is inlined,
// by applying __forceinline for vc++ or __attribute__((always_inline)) for gcc/clang
inline __m256i narrowRandom( __m256i bits, int range )
{
    assert( range > 1 );
    // Convert the random bits into an FP32 number in the [ 1 .. 2 ) interval
    const __m256i mantissaMask = _mm256_set1_epi32( 0x7FFFFF );
    const __m256i mantissa = _mm256_and_si256( bits, mantissaMask );
    const __m256 one = _mm256_set1_ps( 1 );
    __m256 val = _mm256_or_ps( _mm256_castsi256_ps( mantissa ), one );
    // Scale the number from [ 1 .. 2 ) into [ 0 .. range ),
    // the formula is ( val * range ) - range
    const __m256 rf = _mm256_set1_ps( (float)range );
    val = _mm256_fmsub_ps( val, rf, rf );
    // Convert to integers
    // The instruction below always truncates towards 0 regardless of the MXCSR register.
    // If you want ranges like [ -10 .. +10 ], use _mm256_add_epi32 afterwards
    return _mm256_cvttps_epi32( val );
}
When inlined, it should compile into 4 instructions: vpand, vorps, vfmsub132ps, vcvttps2dq. Probably an order of magnitude faster than _mm256_rem_epu32 in your example.
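For example, wiring it into the loop from the question might look like this (untested; assumes (up + 1u) - low stays below the ~16.7 million limit mentioned above):

const int range = (int)((up + 1u) - low);
for (; (i + 8uz) < end; i += 8)
{
    // Same xorshift state update as in the question
    _temp = _mm256_slli_epi32(_seed, thirteen);
    _seed = _mm256_xor_si256(_seed, _temp);
    _temp = _mm256_srli_epi32(_seed, seven);
    _seed = _mm256_xor_si256(_seed, _temp);
    _temp = _mm256_slli_epi32(_seed, seventeen);
    _seed = _mm256_xor_si256(_seed, _temp);
    // Narrow without the expensive vector remainder
    _res = _mm256_add_epi32(_lower, narrowRandom(_seed, range));
    _mm256_store_si256((__m256i*) &arr[i], _res);
}//End for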

Getting min short value in a __m128i vector with SSE?

This question seems similar to Getting max value in a __m128i vector with SSE? but with shorts and minimum instead of integer + maximum. This is what I came up with:
typedef short int weight;
weight horizontal_min_Vec4i(__m128i x) {
    __m128i max1 = _mm_shufflehi_epi16(x, _MM_SHUFFLE(0, 0, 3, 2));
    __m128i max1b = _mm_shufflelo_epi16(x, _MM_SHUFFLE(0, 0, 3, 2));
    __m128i max2 = _mm_min_epi16(max1, max1b);
    //max2 = _mm_min_epi16(max2, x);
    max1 = _mm_shufflehi_epi16(max2, _MM_SHUFFLE(0, 0, 0, 1));
    max1b = _mm_shufflelo_epi16(max2, _MM_SHUFFLE(0, 0, 0, 1));
    __m128i max3 = _mm_min_epi16(max1, max1b);
    max2 = _mm_min_epi16(max2, max3);
    return min(_mm_extract_epi16(max2, 0), _mm_extract_epi16(max2, 4));
}
The function basically does the same as the answer in https://stackoverflow.com/a/18616825/1500111 for the upper and lower parts of x. So I know the minimum value is in either position 0 or position 4 of the __m128i variable max2. Although it is much faster than the non-SIMD function horizontal_min_Vec4i_Plain(__m128i x) shown below, I am afraid the bottleneck is the _mm_extract_epi16 operation on the last line. Is there a better way to achieve this, for a better speed up? I am using Haswell, so I have access to the latest SSE extensions.
weight horizontal_min_Vec4i_Plain(__m128i x) {
    weight result[8] __attribute__((aligned(16)));
    _mm_store_si128((__m128i *) result, x);
    weight myMin = result[0];
    for (int l = 1; l < 8; l++) {
        if (myMin > result[l]) {
            myMin = result[l];
        }
    }
    return myMin;
}
Signed and unsigned comparison are almost the same, except that the range with the top bit set is treated as bigger than the range with the top bit not set in unsigned comparisons, and as smaller in signed comparisons. That means signed and unsigned comparisons can be converted into each other by these rules:
x <s y = (x ^ signbit) <u (y ^ signbit)
x <u y = (x ^ signbit) <s (y ^ signbit)
This property transfers directly to min and max, so:
min_s(x, y) = min_u(x ^ signbit, y ^ signbit) ^ signbit
And then we can use _mm_minpos_epu16 to handle the horizontal minimum, to get, in total, something like
__m128i xs = _mm_xor_si128(x, _mm_set1_epi16(0x8000));
return _mm_extract_epi16(_mm_minpos_epu16(xs), 0) - 0x8000;
The - 0x8000 is ^ 0x8000 and sign-extension (extract zero-extends) rolled into one.
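Wrapped up as a complete function (untested), using _mm_minpos_epu16 from SSE4.1, which Haswell supports:

#include <smmintrin.h> // SSE4.1

short horizontal_min_epi16(__m128i x) {
    // Flip the sign bits so the signed minimum becomes an unsigned minimum
    __m128i xs = _mm_xor_si128(x, _mm_set1_epi16((short)0x8000));
    // _mm_minpos_epu16 leaves the minimum in the low word; extract and undo the bias
    return (short)(_mm_extract_epi16(_mm_minpos_epu16(xs), 0) - 0x8000);
}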

Shift masked bits to the lsb

When you AND some data with a mask, you get a result of the same size as the data/mask.
What I want to do is take the masked bits of the result (where there was a 1 in the mask) and shift them to the right so they are next to each other, so I can perform a CTZ (count trailing zeroes) on them.
I didn't know how to name such a procedure, so Google has failed me. The operation should preferably not be a loop; this has to be as fast an operation as possible.
And here is an incredible image made in MS Paint.
This operation is known as compress right. It is implemented as part of BMI2 as the PEXT instruction, in Intel processors as of Haswell.
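For example, with BMI2 available (compile with -mbmi2 or your compiler's equivalent), the whole operation is a single intrinsic; the wrapper name here is just for illustration:

#include <immintrin.h>

unsigned compress_right(unsigned x, unsigned m) {
    return _pext_u32(x, m); // gathers the bits of x selected by m into the lsb end
}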
Unfortunately, without hardware support it is quite an annoying operation. Of course there is an obvious solution, just moving the bits one by one in a loop; here is the one given by Hacker's Delight:
unsigned compress(unsigned x, unsigned m) {
    unsigned r, s, b; // Result, shift, mask bit.
    r = 0;
    s = 0;
    do {
        b = m & 1;
        r = r | ((x & b) << s);
        s = s + b;
        x = x >> 1;
        m = m >> 1;
    } while (m != 0);
    return r;
}
But there is another way, also given by Hacker's Delight, which does less looping (the number of iterations is logarithmic in the number of bits) but does more per iteration:
unsigned compress(unsigned x, unsigned m) {
    unsigned mk, mp, mv, t;
    int i;
    x = x & m;      // Clear irrelevant bits.
    mk = ~m << 1;   // We will count 0's to right.
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1); // Parallel prefix.
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m; // Bits to move.
        m = m ^ mv | (mv >> (1 << i)); // Compress m.
        t = x & mv;
        x = x ^ t | (t >> (1 << i));   // Compress x.
        mk = mk & ~mp;
    }
    return x;
}
Notice that a lot of the values there depend only on m. Since you only have 512 different masks, you could precompute those and simplify the code to something like this (not tested):
unsigned compress(unsigned x, int maskindex) {
    unsigned t;
    int i;
    x = x & masks[maskindex][0];
    for (i = 0; i < 5; i++) {
        t = x & masks[maskindex][i + 1];
        x = x ^ t | (t >> (1 << i));
    }
    return x;
}
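The per-mask table itself could be filled by recording the intermediate mv values from the loop above, along the lines of this untested sketch (masks[maskindex][0] is the mask, masks[maskindex][1..5] the per-iteration move masks):

void precompute_mask(unsigned m, unsigned out[6]) {
    unsigned mk, mp, mv;
    int i;
    out[0] = m;       // used to clear irrelevant bits
    mk = ~m << 1;     // we will count 0's to the right
    for (i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1); // parallel prefix, as above
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;         // bits to move in iteration i
        m = m ^ mv | (mv >> (1 << i));
        out[i + 1] = mv;     // recorded for the table-driven version
        mk = mk & ~mp;
    }
}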
Of course all of these can be turned into "not a loop" by unrolling; the second and third ways are probably more suitable for that. That's a bit of a cheat, however.
You can use the pack-by-multiplication technique similar to the one described here. This way you don't need any loop and can mix the bits in any order.
For example, with the mask 0b10101001 == 0xA9 like above and 8-bit data abcdefgh (where a-h are the 8 bits), you can use the expression below to get 0000aceh:
uint8_t compress_maskA9(uint8_t x)
{
    const uint8_t mask1 = 0xA9 & 0xF0;
    const uint8_t mask2 = 0xA9 & 0x0F;
    return (((x & mask1)*0x03000000 >> 28) & 0x0C) | ((x & mask2)*0x50000000 >> 30);
}
In this specific case there are some overlaps of the 4 bits while adding (which would incur unexpected carries) during the multiplication step, so I've split them into 2 parts: the first one extracts bits a and c, then e and h are extracted in the latter part. There are other ways to split the bits as well, like a & h then c & e. You can see the results compared to Harold's function live on ideone.
An alternate way with only one multiplication:
const uint32_t X = (x << 8) | x;
return (X & 0x8821)*0x12050000 >> 28;
I got this by duplicating the bits so that they're spaced out farther, leaving enough space to avoid the carry. This is often better than splitting into 2 multiplications.
If you want the result's bits reversed (i.e. heca0000), you can easily change the magic numbers accordingly:
// result: he00 | 00ca;
return (((x & 0x09)*0x88000000 >> 28) & 0x0C) | (((x & 0xA0)*0x04800000) >> 30);
or you can also extract the 3 bits e, c and a at the same time, leaving h separate (as I mentioned above, there are often multiple solutions), and you need only one multiplication:
return ((x & 0xA8)*0x12400000 >> 29) | (x & 0x01) << 3; // result: 0eca | h000
But there might be a better alternative, like the second snippet above:
const uint32_t X = (x << 8) | x;
return (X & 0x2881)*0x80290000 >> 28;
Correctness check: http://ideone.com/PYUkty
For a larger number of masks you can precompute the magic numbers corresponding to those masks and store them in an array, so that you can look them up immediately for use. I calculated those masks by hand, but you can do that automatically.
Explanation
We have abcdefgh & mask1 = a0c00000. Multiply it with magic1
........................a0c00000
× 00000011000000000000000000000000 (magic1 = 0x03000000)
────────────────────────────────
a0c00000........................
+ a0c00000......................... (the leading "a" bit is outside int's range
──────────────────────────────── so it'll be truncated)
r1 = acc.............................
=> (r1 >> 28) & 0x0C = 0000ac00
Similarly we multiply abcdefgh & mask2 = 0000e00h with magic2
........................0000e00h
× 01010000000000000000000000000000 (magic2 = 0x50000000)
────────────────────────────────
e00h............................
+ 0h..............................
────────────────────────────────
r2 = eh..............................
=> (r2 >> 30) = 000000eh
Combine them together we have the expected result
((r1 >> 28) & 0x0C) | (r2 >> 30) = 0000aceh
And here's the demo for the second snippet
abcdefghabcdefgh
& 1000100000100001 (0x8821)
────────────────────────────────
a000e00000c0000h
× 00010010000001010000000000000000 (0x12050000)
────────────────────────────────
000h
00e00000c0000h
+ 0c0000h
a000e00000c0000h
────────────────────────────────
= acehe0h0c0c00h0h
& 11110000000000000000000000000000
────────────────────────────────
= aceh
For the reversed order case:
abcdefghabcdefgh
& 0010100010000001 (0x2881)
────────────────────────────────
00c0e000a000000h
x 10000000001010010000000000000000 (0x80290000)
────────────────────────────────
000a000000h
00c0e000a000000h
+ 0e000a000000h
h
────────────────────────────────
hecaea00a0h0h00h
& 11110000000000000000000000000000
────────────────────────────────
= heca
Related:
How to create a byte out of 8 bool values (and vice versa)?
Redistribute least significant bits from a 4-byte array to a nibble

performance of log10 function returning an int

Today I needed a cheap log10 function, of which I only use the int part. Assume the result is floored, so the log10 of 999 would be 2. Would it be beneficial to write a function myself? And if so, which way would be best to go? Assume the code would not be optimized.
The alternatives to log10 I've thought of:
use a for loop dividing or multiplying by 10;
use a string parser (probably extremely expensive);
use an integer log2() function and multiply by a constant.
Thanks in advance :)
The operation can be done in (fast) constant time on any architecture that has a count-leading-zeros or similar instruction (which is most architectures). Here's a C snippet I have sitting around to compute the number of digits in base ten, which is essentially the same task (assumes a gcc-like compiler and 32-bit int):
unsigned int baseTwoDigits(unsigned int x) {
    return x ? 32 - __builtin_clz(x) : 0;
}

static unsigned int baseTenDigits(unsigned int x) {
    static const unsigned char guess[33] = {
        0, 0, 0, 0, 1, 1, 1, 2, 2, 2,
        3, 3, 3, 3, 4, 4, 4, 5, 5, 5,
        6, 6, 6, 6, 7, 7, 7, 8, 8, 8,
        9, 9, 9
    };
    static const unsigned int tenToThe[] = {
        1, 10, 100, 1000, 10000, 100000,
        1000000, 10000000, 100000000, 1000000000,
    };
    unsigned int digits = guess[baseTwoDigits(x)];
    return digits + (x >= tenToThe[digits]);
}
GCC and clang compile this down to ~10 instructions on x86. With care, one can make it faster still in assembly.
The key insight is to use the (extremely cheap) base-two logarithm to get a fast estimate of the base-ten logarithm; at that point we only need to compare against a single power of ten to decide if we need to adjust the guess. This is much more efficient than searching through multiple powers of ten to find the right one.
If the inputs are overwhelmingly biased to one- and two-digit numbers, a linear scan is sometimes faster; for all other input distributions, this implementation tends to win quite handily.
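Since the question asks for the floored base-ten logarithm rather than the digit count, note that for x >= 1 it is simply one less, e.g. 999 gives 2:

unsigned int intLog10(unsigned int x) {
    return baseTenDigits(x) - 1; // valid for x >= 1
}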
One way to do it would be a loop subtracting powers of 10. These powers can be computed and stored in a table. Here is an example in Python:
table = [10**i for i in range(1, 10)]
# [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000]

def fast_log10(n):
    for i, k in enumerate(table):
        if n - k < 0:
            return i
Usage example:
>>> fast_log10(1)
0
>>> fast_log10(10)
1
>>> fast_log10(100)
2
>>> fast_log10(999)
2
>>> fast_log10(1000)
3
You may also use binary search with this table. Then the algorithm complexity would be only O(lg(n)), where n is the number of digits.
Here is an example with binary search in C:
long int table[] = {10, 100, 1000, 10000, 100000, 1000000};
#define TABLE_LENGTH (sizeof(table) / sizeof(long int))

int bisect_log10(long int n, int s, int e) {
    int a = (e - s) / 2 + s;
    if (s >= e)
        return s;
    if ((table[a] - n) <= 0)
        return bisect_log10(n, a + 1, e);
    else
        return bisect_log10(n, s, a);
}

int fast_log10(long int n) {
    return bisect_log10(n, 0, TABLE_LENGTH);
}
Note that for small numbers this method would be slower than the method above.
Full code here.
Well, there's the old standby - the "poor man's log function".
(If you want to handle more than 63 integer digits, change the first "if" to a "while".)
n = 1;
if (v >= 1e32){n += 32; v /= 1e32;}
if (v >= 1e16){n += 16; v /= 1e16;}
if (v >= 1e8){n += 8; v /= 1e8;}
if (v >= 1e4){n += 4; v /= 1e4;}
if (v >= 1e2){n += 2; v /= 1e2;}
if (v >= 1e1){n += 1; v /= 1e1;}
so if you feed in 123456.7, here's how it goes:
n = 1;
if (v >= 1e32) no
if (v >= 1e16) no
if (v >= 1e8) no
if (v >= 1e4) yes, so n = 5, v = 12.34567
if (v >= 1e2) no
if (v >= 1e1) yes, so n = 6, v = 1.234567
so result is n = 6
Here's a variation that uses multiplication, rather than division:
int n = 1;
double d = 1, temp;
temp = d * 1e32; if (v >= temp){n += 32; d = temp;}
temp = d * 1e16; if (v >= temp){n += 16; d = temp;}
temp = d * 1e8; if (v >= temp){n += 8; d = temp;}
temp = d * 1e4; if (v >= temp){n += 4; d = temp;}
temp = d * 1e2; if (v >= temp){n += 2; d = temp;}
temp = d * 1e1; if (v >= temp){n += 1; d = temp;}
and an execution looks like this
v = 123456.7
n = 1
d = 1
temp = 1e32, if (v >= 1e32) no
temp = 1e16, if (v >= 1e16) no
temp = 1e8, if (v >= 1e8) no
temp = 1e4, if (v >= 1e4) yes, so n = 5, d = 1e4;
temp = 1e6, if (v >= 1e6) no
temp = 1e5, if (v >= 1e5) yes, so n = 6, d = 1e5;
If you want to have a faster log function, you need to approximate the result. E.g. the exp function can be approximated using a 'short' Taylor approximation. You can find example approximations for exp, log, root and power here.
edit:
You can find a short performance comparison here.
Because an unsigned < or >= test is done simply by subtracting and checking the carry flag, it is possible to put both arrays (guess and negated tenToThe) into a single table of 64-bit values, combine both array lookups into one, and use the carry from the 32-bit addition to adjust the guess. The high 32 bits of guess[n] provide the value of log10(2^n*2-1), while the low 32 bits contain -10^log10(2^n*2-1). Note that, unlike the version above, this returns floor(log10(x)) directly rather than the digit count.
#include <stdint.h>

static unsigned int baseTwoDigits(unsigned int x) {
    return x ? 32 - __builtin_clz(x) : 0;
}

unsigned int baseTenDigits(unsigned int x) {
    static const uint64_t guess[33] = {
        /* 1 */         0, 0, 0,
        /* 8 */         (1ull<<32)-10, (1ull<<32)-10, (1ull<<32)-10,
        /* 64 */        (2ull<<32)-100, (2ull<<32)-100, (2ull<<32)-100,
        /* 512 */       (3ull<<32)-1000, (3ull<<32)-1000, (3ull<<32)-1000,
                        (3ull<<32)-1000,
        /* 8192 */      (4ull<<32)-10000, (4ull<<32)-10000, (4ull<<32)-10000,
        /* 65536 */     (5ull<<32)-100000, (5ull<<32)-100000, (5ull<<32)-100000,
        /* 524288 */    (6ull<<32)-1000000, (6ull<<32)-1000000, (6ull<<32)-1000000,
                        (6ull<<32)-1000000,
        /* 8388608 */   (7ull<<32)-10000000, (7ull<<32)-10000000,
                        (7ull<<32)-10000000,
        /* 67108864 */  (8ull<<32)-100000000, (8ull<<32)-100000000,
                        (8ull<<32)-100000000,
        /* 536870912 */ (9ull<<32)-1000000000, (9ull<<32)-1000000000,
                        (9ull<<32)-1000000000, (9ull<<32)-1000000000, // 33rd entry covers x >= 2^31
    };
    uint64_t adjust = guess[baseTwoDigits(x)];
    return (adjust + x) >> 32;
}
Without any specifications, I will just give a general answer:
The log function will be pretty efficient in most languages, as it is such a basic function.
The fact that you are only interested in integers could give you some leverage, but this is probably not enough to easily beat the built-in standard solutions.
One of the few things I can think of that can be faster than a built-in function is a table lookup: if you are only interested in numbers up to 10000, for instance, you could simply create a table and use it to look up any of these values when you need them.
Obviously this solution will not scale well, but it may be just what you need.
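For instance, a minimal sketch of that idea, assuming inputs below 10000 (all names here are illustrative):

static unsigned char log10_table[10000]; // log10_table[i] == floor(log10(i)) for i >= 1

void init_log10_table(void) {
    for (int i = 1; i < 10000; i++)
        log10_table[i] = (i >= 1000) ? 3 : (i >= 100) ? 2 : (i >= 10) ? 1 : 0;
}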
Sidenote: If you are importing the data, for example, it may actually be faster to look at the string length directly (rather than first converting the string to a number and then looking at its value). Of course this will require the input to be stored in just the right format; otherwise it won't gain you anything.