I have two functions, add and sub, which accept two 16 bit arguments and return 17 bits (result and carry/borrow).
Can I build a bitwise "and" function from these?
(Reasonably small lookup tables, <300 bytes, allowed. Runtime proportional to number of bits is fine.)
I find it very hard to guess what your CPU does and does not have, since your question makes it sound like it has basically nothing, but then your follow-up comment seems to take it for granted that it has all the basics.
So, I'll assume the following:
add and sub, as provided in the question.
adequate mechanisms to create a C-style "function", such as:
a way to save registers in memory so that they don't get stomped by the function you're calling.
a way to take arguments by value, that you can destroy without affecting the caller.
a way to return a result to the caller.
a way to skip a section of logic if one value is less than another (analogous to X86's jl "jump if less").
I'll write this as if (a >= b) { ... }, meaning "if a is less than b, then jump past the next few instructions; otherwise, run ...".
enough of the basics to support lookup tables of up to 299 bytes, as specified in the question.
Given that, we can write something like this (in C notation):
static int const single_bit_values[] = {
0x8000, 0x4000, 0x2000, 0x1000, 0x0800, 0x0400, 0x0200, 0x0100,
0x0080, 0x0040, 0x0020, 0x0010, 0x0008, 0x0004, 0x0002, 0x0001
};
int bitwise_and(int operand1, int operand2) {
int accumulator = 0;
for (int i = 0; i < 16; ++i) {
// set accumulator's bit #i if appropriate:
if (operand1 >= single_bit_values[i] && operand2 >= single_bit_values[i]) {
accumulator += single_bit_values[i];
}
// clear operands' bit #i:
if (operand1 >= single_bit_values[i]) {
operand1 -= single_bit_values[i];
}
if (operand2 >= single_bit_values[i]) {
operand2 -= single_bit_values[i];
}
}
return accumulator;
}
Note that, although the above uses && and for-loops, I don't actually assume support for either of those; rather, if (... && ...) can easily be expanded into nested ifs, and the for-loop can easily be completely unrolled. But the above version is easier for humans to read.
This iterates over the single-bit values from the high-order bit 10000000 00000000 down to the low-order bit 00000000 00000001, and for each one it sets the corresponding bit in the accumulator if that bit is set in both operands. The only tricky part is checking whether a given bit is set in both operands. To make that possible, we clear each bit in the operands as we complete the corresponding iteration (for example, 11110000 00001111 becomes 00010000 00001111 after three iterations). Once all higher bits have been cleared, an operand is at least single_bit_values[i] exactly when its bit #i is set, so operand1 >= single_bit_values[3] means "operand1 has bit #3 (the fourth bit) set".
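The routine above can be sketched as a self-contained C++ function. This is only an illustration under assumptions: add16 and sub16 are hypothetical stand-ins for the question's add and sub primitives (the carry/borrow bit is dropped because the algorithm never needs it), and the comparisons stand in for the "jump if less" mechanism.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the question's primitives. Only the low 16 bits
// of each result are kept; the carry/borrow bit is not needed here.
static std::uint16_t add16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a + b);
}
static std::uint16_t sub16(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a - b);
}

// 32-byte lookup table of single-bit values, high-order bit first.
static const std::uint16_t single_bit_values[16] = {
    0x8000, 0x4000, 0x2000, 0x1000, 0x0800, 0x0400, 0x0200, 0x0100,
    0x0080, 0x0040, 0x0020, 0x0010, 0x0008, 0x0004, 0x0002, 0x0001
};

// Bitwise AND built only from add, sub, and "greater or equal" comparisons.
std::uint16_t bitwise_and16(std::uint16_t operand1, std::uint16_t operand2) {
    std::uint16_t accumulator = 0;
    for (int i = 0; i < 16; ++i) {
        const std::uint16_t bit = single_bit_values[i];
        // All higher bits are already cleared, so ">= bit" tests bit #i.
        const bool in1 = operand1 >= bit;
        const bool in2 = operand2 >= bit;
        if (in1 && in2) accumulator = add16(accumulator, bit);
        if (in1) operand1 = sub16(operand1, bit);  // clear bit #i
        if (in2) operand2 = sub16(operand2, bit);  // clear bit #i
    }
    return accumulator;
}
```

For example, bitwise_and16(0xFFFF, 0x1234) yields 0x1234, and bitwise_and16(0xF00F, 0x0FF0) yields 0.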
I have a 64-bit unsigned integer. I want to check the 6th bit of each byte and return a single byte representing those 6th bits.
The obvious, "brute force" solution is:
inline const unsigned char Get6thBits(unsigned long long num) {
unsigned char byte(0);
for (int i = 7; i >= 0; --i) {
byte <<= 1;
byte |= bool((0x20ULL << (8 * i)) & num); // ULL: a plain int would overflow when shifted this far
}
return byte;
}
I could unroll the loop into a bunch of concatenated | statements to avoid the int allocation, but that's still pretty ugly.
Is there a faster, more clever way to do it? Maybe use a bitmask to get the 6th bits, 0x2020202020202020 and then somehow convert that to a byte?
If _pext_u64 is available (BMI2: Intel Haswell and newer; it is very slow on AMD Ryzen before Zen 3), you could write this:
int extracted = _pext_u64(num, 0x2020202020202020);
This is a very literal way to implement it. pext takes a value (first argument) and a mask (second argument). At every position where the mask has a set bit, it takes the corresponding bit from the value, and all of those bits are packed together into the low bits of the result.
_mm_movemask_epi8 (SSE2) is more widely available; you could use it like this:
__m128i n = _mm_set_epi64x(0, num);
int extracted = _mm_movemask_epi8(_mm_slli_epi64(n, 2));
pmovmskb takes the high bit of every byte in its input vector and concatenates them. The bits we want are not the high bit of every byte, so I move them up two positions with psllq (of course you could shift num directly). The _mm_set_epi64x is just some way to get num into a vector.
Don't forget to #include <intrin.h> (MSVC) or <immintrin.h> (GCC, Clang). None of this was tested.
Codegen seems reasonable enough
A weirder option is gathering the bits with a multiplication (only slightly tested):
int extracted = (num & 0x2020202020202020) * 0x08102040810204 >> 56;
The idea here is that num & 0x2020202020202020 only has very few bits set, so we can arrange a product that never carries into bits that we need (or indeed at all). The multiplier is constructed to do this:
a0000000b0000000c0000000d0000000e0000000f0000000g0000000h0000000 +
0b0000000c0000000d0000000e0000000f0000000g0000000h00000000000000 +
00c0000000d0000000e0000000f0000000g0000000h000000000000000000000 etc..
Then the top byte will have all the bits "compacted" together. The lower bytes actually have something like that too, but they're missing bits that would have to come from "higher" (bits can only move to the left in a multiplication).
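To check the multiplication trick without any special instructions, it can be compared against a plain loop. The sketch below is portable C++; the function names sixth_bits_loop and sixth_bits_mul are made up for this illustration, and the multiplier is the same constant as above written out to 16 hex digits.

```cpp
#include <cstdint>

// Reference version: walk the bytes from most to least significant and
// collect bit 5 (mask 0x20) of each one.
inline std::uint8_t sixth_bits_loop(std::uint64_t num) {
    std::uint8_t out = 0;
    for (int i = 7; i >= 0; --i) {
        out = static_cast<std::uint8_t>(out << 1);
        out |= static_cast<std::uint8_t>((num >> (8 * i + 5)) & 1u);
    }
    return out;
}

// Multiplication version: byte k's bit sits at position 8k+5 and must land
// at position 56+k, i.e. it needs a left shift of 51-7k. The multiplier is
// the sum of 2^(51-7k) for k = 0..7, and no two partial products collide.
inline std::uint8_t sixth_bits_mul(std::uint64_t num) {
    const std::uint64_t picked = num & 0x2020202020202020ULL;
    return static_cast<std::uint8_t>((picked * 0x0008102040810204ULL) >> 56);
}
```

Both functions put the bit from the most significant byte into the most significant bit of the result, so they should agree for every input.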
I'm writing a program where I need to produce strings of binary numbers that not only have a specific length, but also a specific number of 1s and 0s. In addition, the strings that are produced are compared against a lower and an upper bound to see whether they fall within that range. The issue is that I'm dealing with 64-bit unsigned integers, so very large numbers that require all 64 bits sometimes produce a huge number of permutations of binary strings for values that are not in the range at all, and it takes a ton of time.
I'm curious if it is possible for an algorithm to take in two bound values, a number of ones, and only produce binary strings in between the bound values with that specific number of ones.
This is what I have so far, but it's producing way too many numbers.
void generatePermutations(int no_ones, int length, uint64_t smaller, uint64_t larger, uint64_t& accum){
char charArray[length+1];
for(int i = length - 1; i > -1; i--){
if(no_ones > 0){
charArray[i] = '1';
no_ones--;
}else{
charArray[i] = '0';
}
}
charArray[length] = '\0';
do {
std::string val(charArray);
uint64_t num = convertToNum(val);
if(num >= smaller && num <= larger){
accum ++;
}
} while ( std::next_permutation(charArray, (charArray + length)));
}
(Note: The number of 1-bits in a binary value is generally called the population count -- popcount, for short -- or Hamming weight.)
There is a well-known bit-hack to cycle through all binary words with the same population count, which basically does the following:
Find the longest suffix of the word consisting of a 0, a non-empty sequence of 1s, and finally a possibly empty sequence of 0s.
Change the first 0 to a 1 and the following 1 to a 0, then shift all the other 1s (if any) to the end of the word.
Example:
00010010111100
       ^-------- beginning of the suffix
00010011         0 becomes 1
        0        1 becomes 0
         00111   remaining 1s right-shifted to the end
That can be done quite rapidly by using the fact that the lowest-order set bit in x is x & -x (where - represents the 2s-complement negative of x). To find the beginning of the suffix, it suffices to add the lowest-order set bit to the number, and then find the new lowest-order set bit. (Try this with a few numbers and you should see how it works.)
The biggest problem is performing the right shift, since we don't actually know the bit count. The traditional solution is to do the right shift with a division (by the original low-order 1 bit), but it turns out that division on modern hardware is really slow relative to other operations. Looping a one-bit shift is generally faster than dividing, but in the code below I use gcc's __builtin_ffsll, which normally compiles into an appropriate opcode if one exists on the target hardware. (See man ffs for details; I use the builtin to avoid feature-test macros, but it's a bit ugly and limits the range of compilers you can use. OTOH, ffsll is also an extension.)
I've included the division-based solution as well for portability; however, it takes almost three times as long on my i5 laptop.
template<typename UInt>
static inline UInt last_one(UInt ui) { return ui & -ui; }
// next_with_same_popcount(ui) finds the next larger integer with the same
// number of 1-bits as ui. If there isn't one (within the range
// of the unsigned type), it returns 0.
template<typename UInt>
UInt next_with_same_popcount(UInt ui) {
UInt lo = last_one(ui);
UInt next = ui + lo;
UInt hi = last_one(next);
if (next) next += (hi >> __builtin_ffsll(lo)) - 1;
return next;
}
/*
template<typename UInt>
UInt next_with_same_popcount(UInt ui) {
UInt lo = last_one(ui);
UInt next = ui + lo;
UInt hi = last_one(next) >> 1;
if (next) next += hi/lo - 1;
return next;
}
*/
The only remaining problem is to find the first number with the correct popcount inside of the given range. To help with this, the following simple algorithm can be used:
Start with the first value in the range.
As long as the popcount of the value is too high, eliminate the last run of 1s by adding the low-order 1 bit to the number (using exactly the same x&-x trick as above). Since this works right-to-left, it cannot loop more than 64 times, once per bit.
While the popcount is too small, add the smallest possible bit by changing the low-order 0 bit to a 1. Since this adds a single 1-bit on each loop, it also cannot loop more than k times (where k is the target popcount), and it is not necessary to recompute the population count on each loop, unlike the first step.
In the following implementation, I again use a GCC builtin, __builtin_popcountll. This one doesn't have a corresponding Posix function. See the Wikipedia page for alternative implementations and a list of hardware which does support the operation. Note that it is possible that the value found will exceed the end of the range; also, the function might return a value less than the supplied argument, indicating that there is no appropriate value. So you need to check that the result is inside the desired range before using it.
// next_with_popcount_k returns the smallest integer >= ui whose popcnt
// is exactly k. If ui has exactly k bits set, it is returned. If there
// is no such value, returns the smallest integer with exactly k bits.
template<typename UInt>
UInt next_with_popcount_k(UInt ui, int k) {
int count;
while ((count = __builtin_popcountll(ui)) > k)
ui += last_one(ui);
for (int i = count; i < k; ++i)
ui += last_one(~ui);
return ui;
}
It's possible to make this slightly more efficient by changing the first loop to:
while ((count = __builtin_popcountll(ui)) > k) {
UInt lo = last_one(ui);
ui += last_one(ui - lo) - lo;
}
That shaved about 10% off of the execution time, but I doubt whether the function will be called often enough to make that worthwhile. Depending on how efficiently your CPU implements the POPCOUNT opcode, it might be faster to do the first loop with a single bit sweep in order to be able to track the popcount instead of recomputing it. That will almost certainly be the case on hardware without a POPCOUNT opcode.
Once you have those two functions, iterating over all k-bit values in a range becomes trivial:
void all_k_bits(uint64_t lo, uint64_t hi, int k) {
uint64_t i = next_with_popcount_k(lo, k);
if (i >= lo) {
for (; i > 0 && i < hi; i = next_with_same_popcount(i)) {
// Do what needs to be done
}
}
}
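Putting the pieces together, here is a compiler-independent sketch that counts the values with exactly k one-bits in an inclusive range, which is what the question's accum does. It is an illustration under assumptions: popcount64 is a simple loop standing in for __builtin_popcountll, the successor function is the division-based variant, the helper names are made up for this example, and k is assumed to be at least 1.

```cpp
#include <cstdint>

// Lowest set bit of x (0 when x == 0); same role as last_one above.
static inline std::uint64_t lowest_bit(std::uint64_t x) { return x & (~x + 1); }

// Portable popcount standing in for __builtin_popcountll.
static inline int popcount64(std::uint64_t x) {
    int n = 0;
    for (; x != 0; x &= x - 1) ++n;  // clears one set bit per iteration
    return n;
}

// Division-based successor with the same popcount; returns 0 on overflow.
static inline std::uint64_t next_same_popcount(std::uint64_t ui) {
    const std::uint64_t lo = lowest_bit(ui);
    std::uint64_t next = ui + lo;
    const std::uint64_t hi = lowest_bit(next) >> 1;
    if (next != 0) next += hi / lo - 1;  // next != 0 implies lo != 0
    return next;
}

// Smallest value >= ui with exactly k bits set. The result may wrap around
// to a smaller value, so the caller has to check it.
static inline std::uint64_t first_with_popcount(std::uint64_t ui, int k) {
    int count;
    while ((count = popcount64(ui)) > k) ui += lowest_bit(ui);
    for (int i = count; i < k; ++i) ui += lowest_bit(~ui);
    return ui;
}

// Hypothetical wrapper: number of values in [smaller, larger] (inclusive)
// that have exactly k one-bits, assuming k >= 1.
std::uint64_t count_k_bit_values(std::uint64_t smaller, std::uint64_t larger, int k) {
    std::uint64_t accum = 0;
    std::uint64_t i = first_with_popcount(smaller, k);
    if (i < smaller) return 0;  // wrapped around: nothing in range
    for (; i != 0 && i <= larger; i = next_same_popcount(i)) ++accum;
    return accum;
}
```

Note that this sketch treats the upper bound as inclusive (i <= larger) to match the question's num <= larger test, whereas all_k_bits above stops at i < hi.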
I have the following function:
int GetGroup(unsigned bitResult, int iStartPos, int iNumOfBites)
{
return (bitResult >> (iStartPos + 1- iNumOfBites)) & ~(~0 << iNumOfBites);
}
The function returns a group of bits from a byte.
e.g. if bitResult=102 (binary 01100110), iStartPos=5, iNumOfBites=3
Output: 4 (binary 100)
For iStartPos=7, iNumOfBites=4
Output: 6 (binary 0110)
I'm looking for a better, more "friendly" way to do that, e.g. with bitset or something like that. Any suggestions?
(src >> start) & ((1UL << len)-1) // or 1ULL << if you need a 64-bit mask
is one way to express extraction of len bits, starting at start. (In this case, start is the LSB of the range you want. Your function requires the MSB as input.) This expression is from Wikipedia's article on the x86 BMI1 instruction set extensions.
Both ways of producing the mask look risky when len is the full width of the type, though (the corner case of extracting all the bits). A shift by the full width of the type can produce either zero or the unchanged value. (It is actually undefined behaviour, but this is what happens in practice if the compiler can't see the shift count at compile time. x86, for example, masks the shift count down to the 0-31 range for 32-bit shifts.) With 32-bit ints:
If 1 << 32 produces 1, then 1-1 = 0, so the result will be zero.
If ~0 << 32 produces ~0, rather than 0, the mask will be zero.
Remember that 1<<len is undefined behaviour for len too large: unlike writing it as 0x3ffffffffff or whatever, no automatic promotion to long long happens, so the type of the 1 matters.
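One way to sidestep the full-width corner case is to special-case it before building the mask. The following is a minimal sketch, assuming 32-bit inputs and the LSB/len convention; extract_bits is a hypothetical name, not something from the question.

```cpp
#include <cstdint>

// Hypothetical helper: extract `len` bits of `src`, starting at bit `lsb`
// (bit 0 is the least significant), into the low bits of the result.
inline std::uint32_t extract_bits(std::uint32_t src, unsigned lsb, unsigned len) {
    if (len == 0 || lsb >= 32) return 0;  // nothing to extract
    // Shifting a 32-bit value by 32 is undefined, so handle len >= 32 apart.
    const std::uint32_t mask =
        (len >= 32) ? ~std::uint32_t{0} : ((std::uint32_t{1} << len) - 1);
    return (src >> lsb) & mask;
}
```

With this convention, extract_bits(102, 3, 3) gives 4 and extract_bits(102, 4, 4) gives 6, the same fields that GetGroup(102, 5, 3) and GetGroup(102, 7, 4) select with an MSB-based start position.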
I think from your examples you want the bits [iStartPos : iStartPos - iNumOfBites + 1], where bits are numbered from zero.
The main thing I'd change in your function is the naming of the function and variables, and add a comment.
bitResult is the input to the function; don't use "result" in its name.
iStartPos ok, but a little verbose
iNumOfBites Computers have bits and bytes. If you're dealing with bites, you need a doctor (or a dentist).
Also, the return type should probably be unsigned.
// extract bits [msb : msb-len+1] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
    // ~0u keeps the shift unsigned; len must be less than the width of unsigned
    return (input >> (msb-len + 1)) & ~(~0u << len);
}
If your start-position parameter was the lsb, rather than msb, the expression would be simpler, and the code would be smaller and faster (unless that just makes extra work for the caller). With LSB as a param, BitExtract is 7 instructions, vs. 9 if it's MSB (on x86-64, gcc 5.2).
There's also a machine instruction (introduced with Intel Haswell, and AMD Piledriver) that does this operation. You will get somewhat smaller and slightly faster code by using it. It also uses the LSB, len position convention, not MSB, so you get shorter code with LSB as an argument.
Intel CPUs only know the version that would require loading an immediate into a register first, so when the values are compile-time constants, it doesn't save much compared to simply shifting and masking. e.g. see this post about using it or pextr for RGB32 -> RGB16. And of course it doesn't matter whether the parameter is the MSB or LSB of the desired range, if start and len are both compile time constants.
Only AMD implements a version of bextr that can take the control mask as an immediate constant, but unfortunately it seems gcc 5.2 doesn't use the immediate version for code that uses the intrinsic, even with -march=bdver2 (i.e. Bulldozer v2, aka Piledriver). (It will generate bextr with an immediate argument on its own in some cases with -march=bdver2.)
I tested it out on godbolt to see what kind of code you'd get with or without bextr.
#include <immintrin.h>
// Intel ICC uses different intrinsics for bextr
// extract bits [msb : msb-len+1] from input into the low bits of the result
unsigned BitExtract(unsigned input, int msb, int len)
{
#ifdef __BMI__ // probably also need to check for __GNUC__
    return __builtin_ia32_bextr_u32(input, (len<<8) | (msb-len+1) );
#else
    return (input >> (msb-len + 1)) & ~(~0u << len);
#endif
}
It would take an extra instruction (a movzx) to implement a (msb-len+1)&0xff safety check that keeps the start byte from spilling into the length byte. I left it out because it's nonsense to ask for a starting bit outside the 0-31 range, let alone the 0-255 range. Since it won't crash, and will just return some other nonsense result, there's not much point.
Anyway, bextr saves quite a few instructions (if BMI2 shlx / shrx isn't available either! -march=native on godbolt is Haswell, and thus includes BMI2 as well.)
But bextr on Intel CPUs decodes to 2 uops (http://agner.org/optimize/), so it's not very useful at all compared to shrx / and, except for saving some code size. pext is actually better for throughput (1 uop / 3c latency), even though it's a way more powerful instruction. It is worse for latency, though. And AMD CPUs run pext very slowly, but bextr as a single uop.
I would probably do something like the following in order to provide additional protections around errors in arguments and to reduce the amount of shifting.
I am not sure if I understood the meaning of the arguments you are using so this may require a bit of tweaking.
And I am not sure if this is necessarily any more efficient since there are a number of decisions and range checks made in the interests of safety.
/*
* Arguments: const unsigned bitResult byte containing the bit field to extract
* const int iStartPos zero based offset from the least significant bit
* const int iNumOfBites number of bits to the right of the starting position
*
* Description: Extract a bitfield beginning at the specified position for the specified
* number of bits returning the extracted bit field right justified.
*/
int GetGroup(const unsigned bitResult, const int iStartPos, const int iNumOfBites)
{
// masks to remove any leading bits that need to disappear.
// we change starting position to be one based so the first element is unused.
const static unsigned bitMasks[] = {0x01, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff};
int iStart = (iStartPos > 7) ? 8 : iStartPos + 1;
int iNum = (iNumOfBites > 8) ? 8 : iNumOfBites;
unsigned retVal = (bitResult & bitMasks[iStart]);
if (iStart > iNum) {
retVal >>= (iStart - iNum);
}
return retVal;
}
#include <stdint.h>
#pragma pack(push, 1)
struct Bit
{
    union
    {
        uint8_t _value;
        struct { // anonymous struct: a common compiler extension in C++
            uint8_t _bit0:1; // each bit-field is one bit wide
            uint8_t _bit1:1;
            uint8_t _bit2:1;
            uint8_t _bit3:1;
            uint8_t _bit4:1;
            uint8_t _bit5:1;
            uint8_t _bit6:1;
            uint8_t _bit7:1;
        };
    };
};
#pragma pack(pop)
typedef Bit bit;
struct B
{
union
{
uint32_t _value;
bit bytes[1]; // 1 for Single Byte
};
};
With a struct and a union you can set B's _value to your result, then read bytes[0]._bit0 through bytes[0]._bit7 to get each bit as 0 or 1, and vice versa: set each bit, and the result will be in _value. Note that the order of bit-fields within a byte is implementation-defined.
If, for example, I have the number 20:
0001 0100
I want to set the highest valued 1 bit, the left-most, to 0.
So
0001 0100
will become
0000 0100
I was wondering which is the most efficient way to achieve this.
Preferably in C++.
I tried subtracting the largest power of two from the original number, like this:
unsigned long long int originalNumber;
unsigned long long int x=originalNumber;
x--;
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x++;
x >>= 1;
originalNumber ^= x;
but I need something more efficient.
The tricky part is finding the most significant bit, or counting the number of leading zeroes. Everything else can be done more or less trivially: shift 1 left to the position of that bit, then either subtract 1 to get a mask of all the lower bits or negate it to build an inverse mask, and apply the mask with the & operator.
The well-known bit hacks site has several implementations for the problem of finding the most significant bit, but it is also worth looking into compiler intrinsics, as all mainstream compilers have an intrinsic for this purpose, which they implement as efficiently as the target architecture will allow (I tested this a few years ago using GCC on x86, came out as single instruction). Which is fastest is impossible to tell without profiling on your target architecture (fewer lines of code, or fewer assembly instructions are not always faster!), but it is a fair assumption that compilers implement these intrinsics not much worse than you'll be able to implement them, and likely faster.
Using an intrinsic with a somewhat intelligible name may also turn out to be easier to comprehend than some bit hack when you look at it 5 years from now.
Unfortunately, although this is a fairly common operation, there was for a long time no standardized function for it in the C or C++ libraries; C++20 added std::countl_zero and std::bit_width in <bit>.
For GCC, you're looking for __builtin_clz (__builtin_clzll for 64-bit values), Visual Studio calls it _BitScanReverse, and Intel's compiler calls it _bit_scan_reverse.
As an alternative to counting leading zeroes, you may look into what the same Bit Twiddling site has under "Round up to the next power of two", which you would only need to follow up with a right shift by 1 and an AND-NOT operation (x & ~mask). Exact powers of two need special handling, since they round up to themselves. Note that the 5-step implementation given on the site is for 32-bit integers; for 64-bit values you need one additional step (a shift by 32).
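As a rough sketch of both routes for 64-bit values: the function below uses the GCC/Clang builtin when it is available and otherwise falls back to smearing the highest set bit downwards. The name clear_highest_bit is made up for this example, and the fallback is a variation on the bit-smearing idea rather than the exact code from the Bit Twiddling site.

```cpp
#include <cstdint>

// Clears the most significant set bit of x; returns 0 for x == 0.
inline std::uint64_t clear_highest_bit(std::uint64_t x) {
    if (x == 0) return 0;  // __builtin_clzll(0) is undefined
#if defined(__GNUC__) || defined(__clang__)
    const int msb = 63 - __builtin_clzll(x);  // index of the highest set bit
    return x & ~(std::uint64_t{1} << msb);
#else
    // Portable fallback: copy the highest set bit into every lower position,
    // then shift right by one to get a mask of everything below that bit.
    std::uint64_t m = x;
    m |= m >> 1;
    m |= m >> 2;
    m |= m >> 4;
    m |= m >> 8;
    m |= m >> 16;
    m |= m >> 32;  // the extra step needed for 64-bit values
    return x & (m >> 1);
#endif
}
```

For the question's example, clear_highest_bit(20) returns 4.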
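#include <limits.h>
#include <stdint.h>
uint32_t unsetHighestBit(uint32_t val) {
    // i must be signed: with an unsigned i, "i >= 0" is always true and the
    // loop never terminates when val == 0
    for(int i = sizeof(uint32_t) * CHAR_BIT - 1; i >= 0; i--) {
        if(val & (UINT32_C(1) << i)) {
            val &= ~(UINT32_C(1) << i);
            break;
        }
    }
    return val;
}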
Explanation
Here we take the size of the type uint32_t, which is 4 bytes. Each byte has 8 bits, so we iterate at most 32 times, with i running from 31 down to 0.
In each iteration we shift the value 1 by i to the left and then bitwise-and (&) it with our value. If this returns a value != 0, the bit at i is set. Once we find a bit that is set, we bitwise-and (&) our initial value with the bitwise negation (~) of the bit that is set.
For example if we have the number 44, its binary representation would be 0010 1100. The first set bit that we find is bit 5, resulting in the mask 0010 0000. The bitwise negation of this mask is 1101 1111. Now when bitwise and-ing & the initial value with this mask, we get the value 0000 1100.
In C++ with templates
This is an example of how this can be solved in C++ using a template:
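#include <limits>
// T is assumed to be an unsigned integer type
template<typename T> T unsetHighestBit(T val) {
    // numeric_limits<T>::digits is the number of value bits in T;
    // i is signed so that the loop terminates when val == 0
    for(int i = std::numeric_limits<T>::digits - 1; i >= 0; i--) {
        if(val & (T(1) << i)) {
            val &= ~(T(1) << i);
            break;
        }
    }
    return val;
}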
If you're constrained to 8 bits (as in your example), then just precalculate all possible values in an array (byte[256]) using any algorithm, or just type it in by hand.
Then you just look up the desired value:
x = lookup[originalNumber]
Can't be much faster than that. :-)
UPDATE: so I read the question wrong.
But if you are using 64-bit values, break the value apart into 8 bytes, maybe by casting it to a byte[8], overlaying it in a union, or something more clever. After that, find the first byte that is not zero and do as in my answer above with that particular byte. Not as efficient, I'm afraid, but it is still at most 8 tests (4.5 on average) plus one lookup.
Of course, creating a byte[65536] lookup will double the speed.
The following code will turn off the left-most set bit (x is assumed to be a 32-bit integer):
bool found = false;
int bit, bitCounter = 31;
while (!found) {
bit = x & (1 << bitCounter);
if (bit != 0) {
x &= ~(1 << bitCounter);
found = true;
}
else if (bitCounter == 0)
found = true;
else
bitCounter--;
}
I know a method to set the right-most nonzero bit to 0:
a & (a - 1)
It is from the book Hacker's Delight by Henry S. Warren, Jr.
You can reverse your bits, clear the right-most set bit, and reverse back. But I do not know an efficient way to reverse the bits in your case.
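For completeness, the reverse, clear, reverse-back idea can be sketched with the usual divide-and-conquer bit reversal. This is an illustration only: reverse_bits64 and clear_highest_via_reverse are hypothetical names, and the two reversals make this slower than finding the highest bit directly.

```cpp
#include <cstdint>

// Reverses the bit order of a 64-bit value by swapping ever larger groups.
inline std::uint64_t reverse_bits64(std::uint64_t v) {
    v = ((v >> 1)  & 0x5555555555555555ULL) | ((v & 0x5555555555555555ULL) << 1);
    v = ((v >> 2)  & 0x3333333333333333ULL) | ((v & 0x3333333333333333ULL) << 2);
    v = ((v >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((v & 0x0F0F0F0F0F0F0F0FULL) << 4);
    v = ((v >> 8)  & 0x00FF00FF00FF00FFULL) | ((v & 0x00FF00FF00FF00FFULL) << 8);
    v = ((v >> 16) & 0x0000FFFF0000FFFFULL) | ((v & 0x0000FFFF0000FFFFULL) << 16);
    return (v >> 32) | (v << 32);
}

// Clears the highest set bit: after reversal it is the lowest set bit,
// which a & (a - 1) removes.
inline std::uint64_t clear_highest_via_reverse(std::uint64_t x) {
    std::uint64_t r = reverse_bits64(x);
    r &= r - 1;
    return reverse_bits64(r);
}
```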
I have written the code below. It checks the first (most significant) bit of every byte. As long as that bit is 0, it concatenates the byte's value with the previous bytes and stores the result in a variable var1. Here pos points to the bytes of an integer. An integer in my implementation is a uint64_t and can occupy up to 8 bytes.
uint64_t func(const unsigned char* pos)
{
    uint64_t var1 = 0; int i=0;
    while ((pos[i] >> 7) == 0)
    {
        var1 = (var1 << 7) | (pos[i]);
        i++;
    }
    return var1;
}
Since I am calling func() trillions of times for trillions of integers, it runs slowly. Is there a way to optimize this code?
EDIT: Thanks to Joe Z.; it is indeed a form of uleb128 unpacking.
I have only tested this minimally; I am happy to fix glitches with it. With modern processors, you want to bias your code heavily toward easily predicted branches. And, if you can safely read the next 10 bytes of input, there's nothing to be saved by guarding their reads by conditional branches. That leads me to the following code:
// fast uleb128 decode
// assumes you can read all 10 bytes at *data safely.
// assumes standard uleb128 format, with LSB first, and
// ... bit 7 indicating "more data in next byte"
uint64_t unpack( const uint8_t *const data )
{
uint64_t value = ((data[0] & 0x7F ) << 0)
| ((data[1] & 0x7F ) << 7)
| ((data[2] & 0x7F ) << 14)
| ((data[3] & 0x7F ) << 21)
| ((data[4] & 0x7Full) << 28)
| ((data[5] & 0x7Full) << 35)
| ((data[6] & 0x7Full) << 42)
| ((data[7] & 0x7Full) << 49)
| ((data[8] & 0x7Full) << 56)
| ((data[9] & 0x7Full) << 63);
if ((data[0] & 0x80) == 0) value &= 0x000000000000007Full; else
if ((data[1] & 0x80) == 0) value &= 0x0000000000003FFFull; else
if ((data[2] & 0x80) == 0) value &= 0x00000000001FFFFFull; else
if ((data[3] & 0x80) == 0) value &= 0x000000000FFFFFFFull; else
if ((data[4] & 0x80) == 0) value &= 0x00000007FFFFFFFFull; else
if ((data[5] & 0x80) == 0) value &= 0x000003FFFFFFFFFFull; else
if ((data[6] & 0x80) == 0) value &= 0x0001FFFFFFFFFFFFull; else
if ((data[7] & 0x80) == 0) value &= 0x00FFFFFFFFFFFFFFull; else
if ((data[8] & 0x80) == 0) value &= 0x7FFFFFFFFFFFFFFFull;
return value;
}
The basic idea is that small values are common (and so most of the if-statements won't be reached), but assembling the 64-bit value that needs to be masked is something that can be efficiently pipelined. With a good branch predictor, I think the above code should work pretty well. You might also try removing the else keywords (without changing anything else) to see if that makes a difference. Branch predictors are subtle beasts, and the exact character of your data also matters. If nothing else, you should be able to see that the else keywords are optional from a logic standpoint, and are there only to guide the compiler's code generation and provide an avenue for optimizing the hardware's branch predictor behavior.
Ultimately, whether or not this approach is effective depends on the distribution of your dataset. If you try out this function, I would be interested to know how it turns out. This particular function focuses on standard uleb128, where the value gets sent LSB first, and bit 7 == 1 means that the data continues.
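For reference, the standard uleb128 format described here (LSB first, bit 7 set meaning "more data follows") can be sketched as a simple encoder and a byte-at-a-time decoder. This is a baseline to compare against, not the optimized routine: encode_uleb128 and decode_uleb128 are hypothetical names, and the decoder assumes well-formed input of at most 10 bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: writes v as uleb128 into out (room for 10 bytes is
// assumed) and returns the number of bytes written.
inline std::size_t encode_uleb128(std::uint64_t v, std::uint8_t* out) {
    std::size_t n = 0;
    do {
        std::uint8_t byte = static_cast<std::uint8_t>(v & 0x7F);
        v >>= 7;
        if (v != 0) byte |= 0x80;  // bit 7: more data in the next byte
        out[n++] = byte;
    } while (v != 0);
    return n;
}

// Straightforward decoder: one branch per input byte.
inline std::uint64_t decode_uleb128(const std::uint8_t* data) {
    std::uint64_t value = 0;
    unsigned shift = 0;
    for (;;) {
        const std::uint8_t byte = *data++;
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        // Stop at the last byte, or after 10 bytes to keep the shift in range.
        if ((byte & 0x80) == 0 || shift >= 63) break;
        shift += 7;
    }
    return value;
}
```

Encoding 624485 this way produces the three bytes 0xE5 0x8E 0x26, and decoding them gives back 624485.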
There are SIMD approaches, but none of them lend themselves readily to 7-bit data.
Also, if you can mark this inline in a header, then that may also help. It all depends on how many places this gets called from, and whether those places are in a different source file. In general, though, inlining when possible is highly recommended.
Your code is problematic
uint64_t func(const unsigned char* pos)
{
uint64_t var1 = 0; int i=0;
while ((pos[i] >> 7) == 0)
{
var1 = (var1 << 7) | (pos[i]);
i++;
}
return var1;
}
First a minor thing: i should be unsigned.
Second: You don't assert that you don't read beyond the boundary of pos. E.g. if all values of your pos array are 0, then you will reach pos[size] where size is the size of the array, hence you invoke undefined behaviour. You should pass the size of your array to the function and check that i is smaller than this size.
Third: If pos[i] has its most significant bit equal to zero for i=0,...,k with k>10, then previous work gets discarded (as you push the old value out of var1).
The third point actually helps us:
uint64_t func(const unsigned char* pos, size_t size)
{
size_t i(0);
while ( i < size && (pos[i] >> 7) == 0 )
{
++i;
}
// At this point, i is either equal to size or
// i is the index of the first pos value you don't want to use.
// Therefore we want to use the values
// pos[i-10], pos[i-9], ..., pos[i-1]
// if i is less than 10, we obviously need to ignore some of the values
const size_t start = (i >= 10) ? (i - 10) : 0;
uint64_t var1 = 0;
for ( size_t j(start); j < i; ++j )
{
var1 <<= 7;
var1 += pos[j];
}
return var1;
}
In conclusion: we separated the logic and got rid of all discarded entries. The speed-up depends on the actual data you have. If lots of entries are discarded, then you save a lot of writes to var1 with this approach.
Another thing: usually, if one function is called a massive number of times, the best optimization you can do is to call it less often. Perhaps you can come up with an additional condition that makes calling this function unnecessary.
Keep in mind that if you actually use 10 values, the first value ends up being truncated.
64 bits means that 9 values are represented with their full 7 bits of information, leaving exactly one bit for the tenth. You might want to switch to a 128-bit type (for example, unsigned __int128 on GCC and Clang).
A small optimization would be:
while ((pos[i] & 0x80) == 0)
A bitwise AND may be cheaper than a shift on some platforms. This of course depends on the platform, and it's also quite possible that the compiler will do this optimization itself.
Can you change the encoding?
Google came across the same problem, and Jeff Dean describes a really cool solution on slide 55 of his presentation:
http://research.google.com/people/jeff/WSDM09-keynote.pdf
http://videolectures.net/wsdm09_dean_cblirs/
The basic idea is that reading the first bit of several bytes is poorly supported on modern architectures. Instead, let's take 8 of these bits, and pack them as a single byte preceding the data. We then use the prefix byte to index into a 256-item lookup table, which holds masks describing how to extract numbers from the rest of the data.
Note that this is not how protocol buffers are encoded on the wire: they use the continuation-bit varint format discussed in this question.
Can you change your encoding? As you've discovered, using a bit on each byte to indicate if there's another byte following really sucks for processing efficiency.
A better way to do it is to model UTF-8, which encodes the length of the full int into the first byte:
0xxxxxxx // one byte with 7 bits of data
110xxxxx 10xxxxxx // two bytes with 11 bits of data
1110xxxx 10xxxxxx 10xxxxxx // three bytes with 16 bits of data
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx // four bytes with 21 bits of data
// etc.
But UTF-8 has special properties to make it easier to distinguish from ASCII. This bloats the data and you don't care about ASCII, so you'd modify it to look like this:
0xxxxxxx // one byte with 7 bits of data
10xxxxxx xxxxxxxx // two bytes with 14 bits of data.
110xxxxx xxxxxxxx xxxxxxxx // three bytes with 21 bits of data
1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // four bytes with 28 bits of data
// etc.
This has the same compression level as your method (up to 64 bits = 9 bytes), but is significantly easier for a CPU to process.
From this you can build a lookup table for the first byte which gives you a mask and length:
// byte_counts[255] contains the number of additional
// bytes if the first byte has a value of 255.
uint8_t const byte_counts[256]; // a global constant.
// byte_masks[255] contains a mask for the useful bits in
// the first byte, if the first byte has a value of 255.
uint8_t const byte_masks[256]; // a global constant.
And then to decode:
// the resulting value.
uint64_t v = 0;
// mask off the data bits in the first byte.
v = *data & byte_masks[*data];
// read in the rest.
switch(byte_counts[*data])
{
case 3: v = v << 8 | *++data;
case 2: v = v << 8 | *++data;
case 1: v = v << 8 | *++data;
case 0: return v;
default:
// If you're on VC++, this'll make it take one less branch.
// Better make sure you've got all the valid inputs covered, though!
__assume(0);
}
No matter the size of the integer, this hits only one branch point: the switch, which will likely be put into a jump table. You can potentially optimize it even further for ILP by not letting each case fall through.
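To make the modified prefix scheme concrete, here is a sketch that builds the two lookup tables at startup and round-trips values through a matching encoder. Everything here is an assumption made for illustration: the names PrefixTables, make_prefix_tables, encode_prefixed and decode_prefixed are invented, the scheme is extended so that a first byte of 0xFF means 8 more bytes follow, and the decoder uses a short loop where the answer uses a switch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct PrefixTables {
    std::array<std::uint8_t, 256> extra;  // additional bytes after the first
    std::array<std::uint8_t, 256> mask;   // data bits held in the first byte
};

inline PrefixTables make_prefix_tables() {
    PrefixTables t{};
    for (int b = 0; b < 256; ++b) {
        int n = 0;
        while (n < 8 && (b & (0x80 >> n))) ++n;  // count leading 1 bits
        t.extra[b] = static_cast<std::uint8_t>(n);
        t.mask[b] = static_cast<std::uint8_t>(0xFF >> (n + 1));
    }
    return t;
}

// Writes v into out (room for 9 bytes is assumed); returns the byte count.
inline std::size_t encode_prefixed(std::uint64_t v, std::uint8_t* out) {
    unsigned n = 0;  // n extra bytes give 7 + 7n data bits (64 when n == 8)
    while (n < 8 && (v >> (7 + 7 * n)) != 0) ++n;
    const std::uint8_t prefix = static_cast<std::uint8_t>((0xFF00u >> n) & 0xFFu);
    const std::uint64_t top = (n < 8) ? (v >> (8 * n)) : 0;
    out[0] = static_cast<std::uint8_t>(prefix | top);
    for (unsigned i = 1; i <= n; ++i)  // remaining bytes, most significant first
        out[i] = static_cast<std::uint8_t>(v >> (8 * (n - i)));
    return n + 1;
}

inline std::uint64_t decode_prefixed(const std::uint8_t* data, const PrefixTables& t) {
    std::uint64_t v = data[0] & t.mask[data[0]];
    const unsigned extra = t.extra[data[0]];
    for (unsigned i = 1; i <= extra; ++i) v = (v << 8) | data[i];
    return v;
}
```

In this sketch 300 encodes as the two bytes 0x81 0x2C, and any 64-bit value fits in at most 9 bytes.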
First, rather than shifting, you can do a bitwise test on the
relevant bit. Second, you can use a pointer, rather than
indexing (but the compiler should do this optimization itself).
Thus:
uint64_t
readUnsignedVarLength( unsigned char const* pos )
{
uint64_t results = 0;
while ( (*pos & 0x80) == 0 ) {
results = (results << 7) | *pos;
++ pos;
}
return results;
}
At least, this corresponds to what your code does. For variable
length encoding of unsigned integers, it is incorrect, since
1) variable length encodings are little endian, and your code is
big endian, and 2) your code never ORs in the high order byte.
Finally, the Wiki page suggests that you've got the test
inverted. (I know this format mainly from BER encoding and
Google protocol buffers, both of which set bit 7 to indicate
that another byte will follow.)
The routine I use is:
uint64_t
readUnsignedVarLen( unsigned char const* source )
{
    int shift = 0;
    uint64_t results = 0;
    uint8_t tmp = *source ++;
    while ( ( tmp & 0x80 ) != 0 ) {
        // widen before shifting so that bits above bit 31 are not lost
        results |= static_cast<uint64_t>( tmp & 0x7F ) << shift;
        shift += 7;
        tmp = *source ++;
    }
    return results | (static_cast<uint64_t>( tmp ) << shift);
}
For the rest, this wasn't written with performance in mind, but
I doubt that you could do significantly better. An alternative
solution would be to pick up all of the bytes first, then
process them in reverse order:
#include <cassert>
#include <iterator>
uint64_t
readUnsignedVarLen( unsigned char const* source )
{
    unsigned char buffer[10];
    unsigned char* p = std::begin( buffer );
    while ( p != std::end( buffer ) && (*source & 0x80) != 0 ) {
        *p = *source & 0x7F;
        ++ p;
        ++ source;   // advance to the next input byte
    }
    assert( p != std::end( buffer ) );
    *p = *source;
    ++ p;
    uint64_t results = 0;
    while ( p != std::begin( buffer ) ) {
        -- p;
        results = (results << 7) + *p;
    }
    return results;
}
The necessity of checking for buffer overrun will likely make
this slightly slower, but on some architectures, shifting by
a constant is significantly faster than shifting by a variable,
so this could be faster on them.
Globally, however, don't expect miracles. The motivation for
using variable length integers is to reduce data size, at
a cost in runtime for decoding and encoding.