Query points on the vertices of a Hamming cube - c++

I have N points that lie only on the vertices of a cube, of dimension D, where D is something like 3.
A vertex may not contain any point. So every point has coordinates in {0, 1}D. I am only interested in query time, as long as the memory cost is reasonable ( not exponential in N for example :) ).
Given a query that lies on one of the cube's vertices and an input parameter r, find all the vertices (thus points) that have hamming distance <= r with the query.
What's the way to go in a c++ environment?
I am thinking of a kd-tree, but I am not sure and want help, any input, even approximative, would be appreciated! Since hamming distance comes into play, bitwise manipulations should help (e.g. XOR).

There is a nice bithack to go from one bitmask with k bits set to the lexicographically next permutation, which means it's fairly simple to loop through all masks with k bits set. XORing these masks with an initial value gives all the values at hamming distance exactly k away from it.
So for D dimensions, where D is less than 32 (otherwise change the types),
uint32_t limit = (1u << D) - 1;
for (int k = 1; k <= r; k++) {
uint32_t diff = (1u << k) - 1;
while (diff <= limit) {
// v is the input vertex
uint32_t vertex = v ^ diff;
// use it
diff = nextBitPermutation(diff);
}
}
Where nextBitPermutation may be implemented in C++ as something like (if you have __builtin_ctz)
uint32_t nextBitPermutation(uint32_t v) {
// see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
uint32_t t = v | (v - 1);
return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
}
Or for MSVC (not tested)
uint32_t nextBitPermutation(uint32_t v) {
// see https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
uint32_t t = v | (v - 1);
unsigned long tzc;
_BitScanForward(&tzc, v); // v != 0 so the return value doesn't matter
return (t + 1) | (((~t & -~t) - 1) >> (tzc + 1));
}
If D is really low, 4 or lower, the old popcnt-with-pshufb works really well and generally everything just lines up well, like this:
uint16_t query(int vertex, int r, int8_t* validmask)
{
// validmask should be array of 16 int8_t's,
// 0 for a vertex that doesn't exist, -1 if it does
__m128i valid = _mm_loadu_si128((__m128i*)validmask);
__m128i t0 = _mm_set1_epi8(vertex);
__m128i r0 = _mm_set1_epi8(r + 1);
__m128i all = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
__m128i popcnt_lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
__m128i dist = _mm_shuffle_epi8(popcnt_lut, _mm_xor_si128(t0, all));
__m128i close_enough = _mm_cmpgt_epi8(r0, dist);
__m128i result = _mm_and_si128(close_enough, valid);
return _mm_movemask_epi8(result);
}
This should be fairly fast; fast compared to the bithack above (nextBitPermutation, which is fairly heavy, is used a lot there) and also compared to looping over all vertices and testing whether they are in range (even with builtin popcnt, that automatically takes at least 16 cycles and the above shouldn't, assuming everything is cached or even permanently in a register). The downside is the result is annoying to work with, since it's a mask of which vertices both exist and are in range of the queried point, not a list of them. It would combine well with doing some processing on data associated with the points though.
This also scales down to D=3 of course, just make none of the points >= 8 valid. D>4 can be done similarly but it takes more code then, and since this is really a brute force solution that is only fast due to parallelism it fundamentally gets slower exponentially in D.

Related

can i speed up more than _mm256_i32gather_epi32

I made a gamma conversion code for 4k video
/** gamma0
input range : 0 ~ 1,023
output range : 0 ~ ?
*/
v00 = _mm256_unpacklo_epi16(v0, _mm256_setzero_si256());
v01 = _mm256_unpackhi_epi16(v0, _mm256_setzero_si256());
v10 = _mm256_unpacklo_epi16(v1, _mm256_setzero_si256());
v11 = _mm256_unpackhi_epi16(v1, _mm256_setzero_si256());
v20 = _mm256_unpacklo_epi16(v2, _mm256_setzero_si256());
v21 = _mm256_unpackhi_epi16(v2, _mm256_setzero_si256());
v00 = _mm256_i32gather_epi32(csv->gamma0LUT, v00, 4);
v01 = _mm256_i32gather_epi32(csv->gamma0LUT, v01, 4);
v10 = _mm256_i32gather_epi32(csv->gamma0LUTc, v10, 4);
v11 = _mm256_i32gather_epi32(csv->gamma0LUTc, v11, 4);
v20 = _mm256_i32gather_epi32(csv->gamma0LUTc, v20, 4);
v21 = _mm256_i32gather_epi32(csv->gamma0LUTc, v21, 4);
I want to implement a "10-bit input to 10~13bit output" LUT(look-up table), but only 32-bit commands are supported by AVX2.
So, it was unavoidably extended to 32bit and implemented using the _mm256_i32gather_epi32 command.
The performance bottleneck in this area is the most severe, is there any way to improve this?
Since the context of your question is still a bit vague for me, just some general ideas you could try (some may be just slightly better or even worse compared to what you have at the moment, all code below is untested):
LUT with 16 bit values using _mm256_i32gather_epi32
Even though it loads 32bit values, you can still use a multiplier of 2 as last argument of _mm256_i32gather_epi32. You should make sure that 2 bytes before and after your LUT are readable.
static const int16_t LUT[1024+2] = { 0, val0, val1, ..., val1022, val1023, 0};
__m256i high_idx = _mm256_srli_epi32(v, 16);
__m256i low_idx = _mm256_blend_epi16(v, _mm256_setzero_si256(), 0xAA);
__m256i high_val = _mm256_i32gather_epi32((int const*)(LUT+0), high_idx, 2);
__m256i low_val = _mm256_i32gather_epi32((int const*)(LUT+1), low_idx, 2);
__m256i values = _mm256_blend_epi16(low_val, high_val, 0xAA);
Join two values into one LUT-entry
For small-ish LUTs, you could calculate an index from two neighboring indexes as (idx_hi << 10) + idx_low and look up the corresponding tuple directly. However, instead of 2KiB you would have a 4 MiB LUT in your case, which likely hurts caching -- but you only have half the number of gather instructions.
Polynomial approximation
Mathematically, all continuous functions on a finite interval can be approximated by a polynomial. You could either convert your values to float evaluate the polynomial and convert it back, or do it directly with fixed-point multiplications (note that _mm256_mulhi_epi16/_mm256_mulhi_epu16 compute (a * b) >> 16, which is convenient if one factor is actually in [0, 1).
8 bit, 16 entry LUT with linear interpolation
SSE/AVX2 provides a pshufb instruction which can be used as a 8bit LUT with 16 entries (and an implicit 0 entry).
Proof-of-concept implementation:
__m256i idx = _mm256_srli_epi16(v, 6); // shift highest 4 bits to the right
idx = _mm256_mullo_epi16(idx, _mm256_set1_epi16(0x0101)); // duplicate idx, maybe _mm256_shuffle_epi8 is better?
idx = _mm256_sub_epi8(idx, _mm256_set1_epi16(0x0001)); // subtract 1 from lower idx, 0 is mapped to 0xff
__m256i lut_vals = _mm256_shuffle_epi8(LUT, idx); // implicitly: LUT[-1] = 0
// get fractional part of input value:
__m256i dv = _mm256_and_si256(v, _mm256_set1_epi8(0x3f)); // lowest 6 bits
dv = _mm256_mullo_epi16(dv, _mm256_set1_epi16(0xff01)); // dv = [-dv, dv]
dv = _mm256_add_epi8(dv, _mm256_set1_epi16(0x4000)); // dv = [0x40-(v&0x3f), (v&0x3f)];
__m256i res = _mm256_maddubs_epi16(lut_vals, dv); // switch order depending on whether LUT values are (un)signed.
// probably shift res to the right, depending on the scale of your LUT values
You could also combine this with first doing a linear or quadratic approximation and just calculating the difference to your target function.

Grid nearest neighbour BFS slow

Im trying to upsample my image. I fill the upsampled version with corresponding pixels in this way.
pseudocode:
upsampled.getPixel(((int)(x * factorX), (int)(y * factorY))) = old.getPixel(x, y)
as a result i end up with the bitmap that is not completely filled and I try to fill each not filled pixel with it's nearest filled neighbor.
I use this method for nn search and call it for each unfilled pixel. I do not flag unfilled pixel as filled after changing its value as it may create some weird patterns. The problem is that - it works but very slow. Execution time on my i7 9700k for 2500 x 3000 img scaled by factor x = 1,5 and y = 1,5 takes about 10 seconds.
template<typename T>
std::pair<int, int> cn::Utils::nearestNeighbour(const Bitmap <T> &bitmap, const std::pair<int, int> &point, int channel, const bool *filledArr) {
auto belongs = [](const cn::Bitmap<T> &bitmap, const std::pair<int, int> &point){
return point.first >= 0 && point.first < bitmap.w && point.second >= 0 && point.second < bitmap.h;
};
if(!(belongs(bitmap, point))){
throw std::out_of_range("This point does not belong to bitmap!");
}
auto hash = [](std::pair<int, int> const &pair){
std::size_t h1 = std::hash<int>()(pair.first);
std::size_t h2 = std::hash<int>()(pair.second);
return h1 ^ h2;
};
//from where, point
std::queue<std::pair<int, int>> queue;
queue.push(point);
std::unordered_set<std::pair<int, int>, decltype(hash)> visited(10, hash);
while (!queue.empty()){
auto p = queue.front();
queue.pop();
visited.insert(p);
if(belongs(bitmap, p)){
if(filledArr[bitmap.getDataIndex(p.first, p.second, channel)]){
return {p.first, p.second};
}
std::vector<std::pair<int,int>> neighbors(4);
neighbors[0] = {p.first - 1, p.second};
neighbors[1] = {p.first + 1, p.second};
neighbors[2] = {p.first, p.second - 1};
neighbors[3] = {p.first, p.second + 1};
for(auto n : neighbors) {
if (visited.find(n) == visited.end()) {
queue.push(n);
}
}
}
}
return std::pair<int, int>({-1, -1});
}
the bitmap.getDataIndex() works in O(1) time. Here's its implementation
template<typename T>
int cn::Bitmap<T>::getDataIndex(int col, int row, int depth) const{
if(col >= this->w or col < 0 or row >= this->h or row < 0 or depth >= this->d or depth < 0){
throw std::invalid_argument("cell does not belong to bitmap!");
}
return depth * w * h + row * w + col;
}
I have spent a while on debugging this but could not really find what makes it so slow.
Theoretically when scaling by factor x = 1,5, y = 1,5, the filled pixel should be no further than 2 pixels from unfilled one, so well implemented BFS wouldn't take long.
Also i use such encoding for bitmap, example for 3x3x3 image
* (each row and channel is in ascending order)
* {00, 01, 02}, | {09, 10, 11}, | {18, 19, 20},
c0 {03, 04, 05}, c1{12, 13, 14}, c2{21, 22, 23},
* {06, 07, 08}, | {15, 16, 17}, | {24, 25, 26},
the filled pixel should be no further than 2 pixels from unfilled one, so well implemented BFS wouldn't take long.
Sure, doing it once won’t take long. But you need to do this for almost every pixel in the output image, and doing lots of times something that doesn’t take long will still take long.
Instead of searching for a set pixel, use the information you have about the earlier computation to directly find the values you are looking for.
For example, in your output image, and set pixel, is at ((int)(x * factorX), (int)(y * factorY)), for integer x and y. So for a non-set pixel (a, b), you can find the nearest set pixel by ((int)(round(a/factorX)*factorX), (int)(round(b/factorY)*factorY)).
However, you are much better off directly upsampling the image in a simpler way: don’t loop over the input pixels, instead loop over the output pixels, and find the corresponding input pixel.

Tiles Stacking Problem to build a stable stack

I am studying Dynamic Programming on GeeksForGeeks and have a problem with Tiles Stacking Problem and the way it is solved
A stable tower of height n is a tower consisting of exactly n tiles of unit height stacked vertically in such a way, that no bigger tile is placed on a smaller tile. An example is shown below :
We have infinite number of tiles of sizes 1, 2, …, m. The task is calculate the number of different stable tower of height n that can be built from these tiles, with a restriction that you can use at most k tiles of each size in the tower.
Note: Two tower of height n are different if and only if there exists a height h (1 <= h <= n), such that the towers have tiles of different sizes at height h.
For example:
Input : n = 3, m = 3, k = 1.
Output : 1
Possible sequences: { 1, 2, 3}.
Hence answer is 1.
Input : n = 3, m = 3, k = 2.
Output : 7
{1, 1, 2}, {1, 1, 3}, {1, 2, 2},
{1, 2, 3}, {1, 3, 3}, {2, 2, 3},
{2, 3, 3}.
The way to solve is to count number of decreasing sequences of length n using numbers from 1 to m where every number can be used at most k times. We can recursively compute count for n using count for n-1.
Declare a 2D array dp[][], where each state dp[i][j] denotes the number of decreasing sequences of length i using numbers from j to m. We need to take care of the fact that a number can be used a most k times. This can be done by considering 1 to k occurrences of a number. Hence our recurrence relation becomes:
Also, we can use the fact that for a fixed j we are using the consecutive values of previous k values of i. Hence, we can maintain a prefix sum array for each state. Now we have got rid of the k factor for each state.
I have read this algorithm for many times but I don't understand it and how to prove the accuracy of it. I have tried to find the guide on the internet but only its variations. Please help me to explain it.
Observe that the largest size tile (m) can appear only at the bottom.
Its appearances are consecutive
Your recurrence becomes:
T(n,m,k) = SIGMA_{i=0,...,k} T(n-i,m-1,k)
Then you have to define the base cases of the recurrence:
T(n,m,1) = // can you tell what this is?
T(n,1,k) = // can you tell what this is?
T(1,m,k) = m // this is easy
We can prove it by forming a logical recurrence:
(A) If the maximum stack height, given m and k is smaller than n, we cannot create any stack.
(B) If only one tile is allowed, we can choose m different sizes for that tile.
(C) If only one size is allowed, if k is greater than or equal to n, we can construct one stack of n tiles of size 1; otherwise, zero stacks.
(D) For each possible count, x, of tiles of size m stacked, we have one way that is multiplied by the number of ways to stack (n - x) tiles, using sizes of at most (m - 1) since we used m.
To convert the recurrence to bottom-up dynamic programming, we initialise the matrix using the base cases of the recurrence, and fill in subsequent entries using its general-case logical branch.
Here's a demonstration of the recurrence in JavaScript (sorry I'm not versed in C++ but the first function, f, which calculates just the count, should be very easy to convert):
// Returns the count
function f(n, m, k){
if (n > m * k)
return 0;
if (n == 1)
return m;
if (m == 1)
return n <= k ? 1 : 0;
let result = 0;
for (let x=0; x<=k; x++)
result += f(n - x, m - 1, k);
return result;
}
// Returns the sequences
function g(n, m, k){
if (n > m * k)
return [];
if (n == 1)
return new Array(m).fill(0).map((_, i) => [i + 1]);
if (m == 1)
return n <= k ? [new Array(n).fill(1)] : [];
let result = [];
for (let x=0; x<=k; x++){
const pfx = new Array(x).fill(m);
const prev = g(n - x, m - 1, k);
for (let s of prev)
result.push(pfx.concat(s));
}
return result;
}
var inputs = [
[3, 3, 1],
[3, 3, 2],
[1, 2, 2]
];
for (let args of inputs){
console.log('' + args);
console.log(f(...args));
console.log(JSON.stringify(g(...args)));
console.log('');
}

AVX, Horizontal Sum of Single Precision Complex Numbers?

I have a 256 bit AVX register containing 4 single precision complex numbers stored as real, imaginary, real, imaginary, etc. I'm currently writing the entire 256 bit register back to memory and summing it there, but that seems inefficient.
How can the complex number horizontal sum be performed using AVX (or AVX2) intrinsics? I would accept an answer using assembly if there is not an answer with comparable efficiency using intrinsics.
Edit: To clarify, if the register contains AR, AI, BR, BI, CR, CI, DR, DI, I want to compute the complex number (AR + BR + CR + DR, AI + BI + CI + DI). If the result is in a 256 bit register, I can extract the 2 single precision floating point numbers.
Edit2: Potential solution, though not necessarily optimal...
float hsum_ps_sse3(__m128 v) {
__m128 shuf = _mm_movehdup_ps(v); // broadcast elements 3,1 to 2,0
__m128 sums = _mm_add_ps(v, shuf);
shuf = _mm_movehl_ps(shuf, sums); // high half -> low half
sums = _mm_add_ss(sums, shuf);
return _mm_cvtss_f32(sums);
}
float sumReal = 0.0;
float sumImaginary = 0.0;
__m256i mask = _mm256_set_epi32 (7, 5, 3, 1, 6, 4, 2, 0);
// Separate real and imaginary.
__m256 permutedSum = _mm256_permutevar8x32_ps(sseSum0, mask);
__m128 realSum = _mm256_extractf128_ps(permutedSum , 0);
__m128 imaginarySum = _mm256_extractf128_ps(permutedSum , 1);
// Horizontally sum real and imaginary.
sumReal = hsum_ps_sse3(realSum);
sumImaginary = hsum_ps_sse3(imaginarySum);
One fairly straightforward solution which requires only AVX (not AVX2):
__m128i v0 = _mm256_castps256_ps128(v); // get low 2 complex values
__m128i v1 = _mm256_extractf128_ps(v, 1); // get high 2 complex values
v0 = _mm_add_ps(v0, v1); // add high and low
v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(1, 0, 3, 2));
v0 = _mm_add_ps(v0, v1); // combine two halves of result
The result will be in v0 as { sum.re, sum.im, sum.re, sum.im }.

performance of log10 function returning an int

Today I needed a cheap log10 function, of which I only used the int part. Assuming the result is floored, so the log10 of 999 would be 2. Would it be beneficial writing a function myself? And if so, which way would be the best to go. Assuming the code would not be optimized.
The alternatives to log10 I've though of;
use a for loop dividing or multiplying by 10;
use a string parser(probably extremely expensive);
using an integer log2() function multiplying by a constant.
Thank you on beforehand:)
The operation can be done in (fast) constant time on any architecture that has a count-leading-zeros or similar instruction (which is most architectures). Here's a C snippet I have sitting around to compute the number of digits in base ten, which is essentially the same task (assumes a gcc-like compiler and 32-bit int):
unsigned int baseTwoDigits(unsigned int x) {
return x ? 32 - __builtin_clz(x) : 0;
}
static unsigned int baseTenDigits(unsigned int x) {
static const unsigned char guess[33] = {
0, 0, 0, 0, 1, 1, 1, 2, 2, 2,
3, 3, 3, 3, 4, 4, 4, 5, 5, 5,
6, 6, 6, 6, 7, 7, 7, 8, 8, 8,
9, 9, 9
};
static const unsigned int tenToThe[] = {
1, 10, 100, 1000, 10000, 100000,
1000000, 10000000, 100000000, 1000000000,
};
unsigned int digits = guess[baseTwoDigits(x)];
return digits + (x >= tenToThe[digits]);
}
GCC and clang compile this down to ~10 instructions on x86. With care, one can make it faster still in assembly.
The key insight is to use the (extremely cheap) base-two logarithm to get a fast estimate of the base-ten logarithm; at that point we only need to compare against a single power of ten to decide if we need to adjust the guess. This is much more efficient than searching through multiple powers of ten to find the right one.
If the inputs are overwhelmingly biased to one- and two-digit numbers, a linear scan is sometimes faster; for all other input distributions, this implementation tends to win quite handily.
One way to do it would be loop with subtracting powers of 10. This powers could be computed and stored in table. Here example in python:
table = [10**i for i in range(1, 10)]
# [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000]
def fast_log10(n):
for i, k in enumerate(table):
if n - k < 0:
return i
Usage example:
>>> fast_log10(1)
0
>>> fast_log10(10)
1
>>> fast_log10(100)
2
>>> fast_log10(999)
2
fast_log10(1000)
3
Also you may use binary search with this table. Then algorithm complexity would be only O(lg(n)), where n is number of digits.
Here example with binary search in C:
long int table[] = {10, 100, 1000, 10000, 1000000};
#define TABLE_LENGHT sizeof(table) / sizeof(long int)
int bisect_log10(long int n, int s, int e) {
int a = (e - s) / 2 + s;
if(s >= e)
return s;
if((table[a] - n) <= 0)
return bisect_log10(n, a + 1, e);
else
return bisect_log10(n, s, a);
}
int fast_log10(long int n){
return bisect_log10(n, 0, TABLE_LENGHT);
}
Note for small numbers this method would slower then upper method.
Full code here.
Well, there's the old standby - the "poor man's log function".
(If you want to handle more than 63 integer digits, change the first "if" to a "while".)
n = 1;
if (v >= 1e32){n += 32; v /= 1e32;}
if (v >= 1e16){n += 16; v /= 1e16;}
if (v >= 1e8){n += 8; v /= 1e8;}
if (v >= 1e4){n += 4; v /= 1e4;}
if (v >= 1e2){n += 2; v /= 1e2;}
if (v >= 1e1){n += 1; v /= 1e1;}
so if you feed in 123456.7, here's how it goes:
n = 1;
if (v >= 1e32) no
if (v >= 1e16) no
if (v >= 1e8) no
if (v >= 1e4) yes, so n = 5, v = 12.34567
if (v >= 1e2) no
if (v >= 1e1) yes, so n = 6, v = 1.234567
so result is n = 6
Here's a variation that uses multiplication, rather than division:
int n = 1;
double d = 1, temp;
temp = d * 1e32; if (v >= temp){n += 32; d = temp;}
temp = d * 1e16; if (v >= temp){n += 16; d = temp;}
temp = d * 1e8; if (v >= temp){n += 8; d = temp;}
temp = d * 1e4; if (v >= temp){n += 4; d = temp;}
temp = d * 1e2; if (v >= temp){n += 2; d = temp;}
temp = d * 1e1; if (v >= temp){n += 1; d = temp;}
and an execution looks like this
v = 123456.7
n = 1
d = 1
temp = 1e32, if (v >= 1e32) no
temp = 1e16, if (v >= 1e16) no
temp = 1e8, if (v >= 1e8) no
temp = 1e4, if (v >= 1e4) yes, so n = 5, d = 1e4;
temp = 1e6, if (v >= 1e6) no
temp = 1e5, if (v >= 1e5) yes, so n = 6, d = 1e5;
If you want to have a faster log function you need to approximate their result. E.g. the exp function can be approximated using a 'short' taylor approximation. You can find example approximations for exp, log, root and power here
edit:
You can find a short performance comparsion here
Because an unsigned < or >= test is done simply by subtracting and checking the carry flag, it is possible to put both arrays (guess and negated tenToThe) in a single 64-bit value, combine both array lookups into one, and use the carry from 32-bit addition to adjust the guess. The high 32 bits of guess[n] provide the value of log10(2^n*2-1), while the low 32 bits contain -10^log10(2^n*2-1).
static unsigned int baseTwoDigits(unsigned int x) {
return x ? 32 - __builtin_clz(x) : 0;
}
unsigned int baseTenDigits(unsigned int x) {
static uint64_t guess[33] = {
/* 1 */ 0, 0, 0,
/* 8 */ (1ull<<32)-10, (1ull<<32)-10, (1ull<<32)-10,
/* 64 */ (2ull<<32)-100, (2ull<<32)-100, (2ull<<32)-100,
/* 512 */ (3ull<<32)-1000, (3ull<<32)-1000, (3ull<<32)-1000,
(3ull<<32)-1000,
/* 8192 */ (4ull<<32)-10000, (4ull<<32)-10000, (4ull<<32)-10000,
/* 65536 */ (5ull<<32)-100000, (5ull<<32)-100000, (5ull<<32)-100000,
/* 524288 */ (6ull<<32)-1000000, (6ull<<32)-1000000, (6ull<<32)-1000000,
(6ull<<32)-1000000,
/* 8388608 */ (7ull<<32)-10000000, (7ull<<32)-10000000,
(7ull<<32)-10000000,
/* 67108864 */ (8ull<<32)-100000000, (8ull<<32)-100000000,
(8ull<<32)-100000000,
/* 536870912 */ (9ull<<32)-1000000000, (9ull<<32)-1000000000,
(9ull<<32)-1000000000,
};
uint64_t adjust = guess[baseTwoDigits(x)];
return (adjust + x) >> 32;
}
Without any specifications, I will just give a general answer:
The log function will be pretty efficient in most languages as it is such a basic function.
The fact that you are only interested in integers could give you some leverage, but probably this is not enough to easily beat the builtin standard solutions.
One of the few things that I can think of to be faster than a builtin function is a table lookup, so if you are only interested in the numbers upto 10000 for instance, you could simply create a table that you could use to lookup any of these values when you need them.
Obviously this solution will not scale well, but it may be just what you need.
Sidenote: If you are importing the data for example, it may actually be faster to look at the string diecty length (rather than first converting the string to a number and than looking at the value of the string). Of course this will require the input to be stored in just the right format, otherwise it won't gain you anything.