Efficient way to test bits in a byte array using a union? - c++

I have data that is sometimes best viewed as an array of 10 bytes, sometimes as an array of 80 bits. Maybe a job for a union?
After filling the array with 10 bytes, I scan through the 80 bits and test if set. The scan is advanced bit-by-bit in an ISR, so efficiency is key.
Right now I do this at each interrupt:
volatile uint8_t bit_array[10]; // external to ISR
volatile uint8_t bit_idx;
volatile uint8_t byte_idx;
// -----ISR---------
static uint8_t abyte; // temp byte from array
if (bit_idx == 0) { // at each new byte
bit_idx = 1; // begin at the lowest bit
abyte = bit_array[byte_idx];
}
if (abyte & bit_idx) {
// << do the thing >>
}
if ((bit_idx *= 2) == 0) { // idx << and test for done
if (++byte_idx > 9) { // try next byte
byte_idx = 0;
fill_array_again();
}
}
I have a sense that there's a way to create a union that would allow a straightforward scan of the bits using a single index 0..79, but I don't know enough to try it.
The questions are: can I do that? and: can it be efficient?

You can use the 0 ... 79 range for your index without the need for a union¹. You can get the byte index in your array using index / 8 and the bit position (within that byte) using index % 8.
This would certainly simplify your code; however, whether it will be significantly more efficient will depend on a number of factors, like what the target CPU is and how smart your compiler is. But note that the division and remainder operations with 8 as their RHS are trivial for most compilers/architectures and reduce to a bit-shift and a simple mask, respectively.
Here's a possible outline implementation:
uint8_t data[10]; // The 10 bytes
uint8_t index = 0; // index of bits in 0 .. 79 range
void TestISR()
{
// Test the indexed bit using combination of division and remainder ...
if (data[index / 8] & (1 << (index % 8))) {
// Do something
}
// Increment index ...
if (++index > 79) {
index = 0;
refill_array();
}
}
For any compiler that fails to implement the optimized division and remainder operations, the if statement can be re-written thus:
if (data[index >> 3] & (1 << (index & 7))) {
// ...
¹ Note that any attempt to actually use a union will likely exhibit undefined behaviour: in C++, reading from a member of a union that wasn't the last one written is UB (although it's acceptable and well-defined in C).

Related

how to split a bitset to an array

I have an enum like below:
enum types : uint16_t
{
A = 1 << 0,
B = 1 << 1,
C = 1 << 2,
D = 1 << 3,
E = 1 << 4,
F = 1 << 5,
G = 1 << 6
};
Assume I have a number:
uint16_t val = A | C | F;
How do I split val into an array? I know I can do this with a for loop:
for(int i=0;i<7;++i){
if(val & (1 << i)){
//push_back(1 << i)
}
}
but what if the enum has 1000 rows?
Is there any simpler and faster way to do this?
Use a std::bitset<N> to store bit flags like that. Instead of powers of two like 1 << 5, make the enum a normal incrementing enum. Then you use the bitset like this:
std::bitset<1000> myBits;
myBits.set(A);
if(myBits[A]) {
// do some bit flag logic.
}
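For concreteness, here is a minimal sketch of that redesign (the re-declared enum, the TYPE_COUNT sentinel and the collection loop are my own illustration, not code from the question):

#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

enum types : uint16_t { A, B, C, D, E, F, G, TYPE_COUNT }; // plain 0, 1, 2, ... values

// Collect the set flags back into an array, as the question asked.
std::vector<types> set_flags(const std::bitset<TYPE_COUNT>& bits)
{
    std::vector<types> res;
    for (std::size_t i = 0; i < bits.size(); ++i)
        if (bits[i])
            res.push_back(static_cast<types>(i));
    return res;
}

Usage: after myBits.set(A); myBits.set(C); myBits.set(F);, calling set_flags(myBits) yields {A, C, F}.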
but what if the enum has 1000 rows?
C++ doesn't have a primitive type with 1000 bits (even __int128 only has 128), so we will have to make our own.
struct Bits1000
{
uint8_t bits[1000 / 8] = {};
};
Is there any simpler and faster way to do this?
One way or the other, you will end up using a loop; the other optimizations only improve what happens inside it. The reason is that C++ has no way to index individual bits (bits[0] does not refer to the first bit, unless we're talking about std::bitset), and even a struct containing a bit-field gets padded out to whole bytes. And since an if condition is what decides whether a certain enum value is present, this check has to happen at runtime.
Optimizations
Since we are looking at the powers of two, we can advance the index by multiplying it by two. This way we don't need a bit-shift operation on top of the index increment, which might help reduce the total CPU instruction cycles.
/**
* Helper to calculate the power.
*/
template<size_t Base, size_t Exponent>
struct Power
{
static constexpr size_t Value = Base * Power<Base, Exponent - 1>::Value;
};
/**
* Helper template specialization.
*/
template<size_t Base>
struct Power<Base, 1>
{
static constexpr size_t Value = Base;
};
// ...
for (int i = 1; i < Power<2, sizeof(uint16_t) * 8>::Value; i *= 2) // match val's 16-bit width
{
if(val & i)
{
// Do your work here.
}
}
Another optimization would be to precompute the powers of two and store them in an array, then index that array to get each power.
uint64_t powers[64] = { 1, 2, 4, 8, ... };
for (int i = 0; i < sizeof(uint16_t) * 8; i++)
{
if (val & powers[i])
{
// Do your work here.
}
}
The trick here is that with standard types the loop runs a small, fixed number of iterations (8, 16, 32, or 64), so it's effectively constant time and the performance hit is pretty low. Even with custom types, we will be using the absolute minimum number of iterations available to us.
I should also note that if we're dealing with very large values (for example 1000 bits), a single iterating integer cannot cover the whole range (nor can bit-shifting, because the widest native result is uint64_t, depending on the build). In that case, iterate in chunks of the largest available word (4 bytes on x86, 8 bytes on x64).
Bit scanning is a technique to efficiently do this sort of thing. To do that you'll need bit-level operations. Traditionally, that has involved compiler extensions (like GCC's builtins). Starting with C++20 you'll find what you need in the <bit> header.
Example using your types type:
#include <bit>
#include <cstdint>
#include <vector>

std::vector<types> parse_num(uint16_t num)
{
std::vector<types> res;
while (num) {
auto bit = std::countr_zero(num);
auto mask = 1 << bit;
res.push_back(static_cast<types>(mask));
num &= ~mask;
}
return res;
}
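Applied to the value from the question, a quick usage sketch:

uint16_t val = A | C | F;
auto parts = parse_num(val); // yields {A, C, F}: three iterations instead of seven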
This is something you may want to reach for in the "the enum has 1000 rows" kind of case - especially if the data is sparse - rather than the simple example in the question where your loop is simpler and possibly more performant. As always, it depends.
For a (non-portable) pre-C++20 solution you can:
Replace std::countr_zero(num) with __builtin_ctz(num) for GCC/Clang.
Use _BitScanForward for MSVC.
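If you need one function that works in both worlds, a hypothetical wrapper could look like the following (countr_zero_compat is my own name; it assumes GCC/Clang builtins or the MSVC intrinsic, and the argument must be non-zero):

#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

inline int countr_zero_compat(unsigned v)
{
#ifdef _MSC_VER
    unsigned long idx;
    _BitScanForward(&idx, v); // undefined result if v == 0
    return static_cast<int>(idx);
#else
    return __builtin_ctz(v); // undefined result if v == 0
#endif
}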

How to build N bits variables in C++?

I am dealing with a very large list of booleans in C++: around 2^N items of N booleans each. Because memory is critical in such a situation (i.e. exponential growth), I would like to build an N-bit-long variable to store each element.
For small N, for example 24, I am just using unsigned long int. It takes 64MB ((2^24)*32/8/1024/1024). But I need to go up to 36. The only option with a built-in type is unsigned long long int, but it takes 512GB ((2^36)*64/8/1024/1024/1024), which is a bit too much.
With a 36-bits variable, it would work for me because the size drops to 288GB ((2^36)*36/8/1024/1024/1024), which fits on a node of my supercomputer.
I tried std::bitset, but std::bitset< N > creates an element of at least 8 bytes.
So a list of std::bitset< 1 > is much larger than a list of unsigned long int.
That is because std::bitset just changes the representation, not the container.
I also tried boost::dynamic_bitset<> from Boost, but the result is even worse (at least 32 bytes!), for the same reason.
I know an option is to write all elements as one long chain of booleans, 2473901162496 (2^36*36) bits, then store them in 38654705664 (2473901162496/64) unsigned long long int, which gives 288GB (38654705664*64/8/1024/1024/1024). Accessing an element is then just a matter of finding which words the 36 bits are stored in (it can be either one or two). But that means a lot of rewriting of the existing code (3000 lines), because direct mapping becomes impossible, and because adding and deleting items during execution in some functions would surely be complicated, confusing and challenging, and the result would most likely not be efficient.
How can I build an N-bit variable in C++?
How about a struct with 5 chars (and perhaps some fancy operator overloading, as needed, to keep it compatible with the existing code)? A struct with a long and a char probably won't work because of padding / alignment...
Basically your own mini BitSet optimized for size:
struct Bitset40 {
unsigned char data[5];
bool getBit(int index) {
return (data[index / 8] & (1 << (index % 8))) != 0;
}
void setBit(int index, bool newVal) { // void: there is nothing meaningful to return
if (newVal) {
data[index / 8] |= (1 << (index % 8));
} else {
data[index / 8] &= ~(1 << (index % 8));
}
}
};
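A quick usage sketch (my own illustration; the element count is deliberately small here):

#include <vector>

std::vector<Bitset40> items(1000); // 5 bytes per element, ideally no padding
items[42].setBit(35, true);        // set bit 35 of element 42
bool b = items[42].getBit(35);     // true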
Edit: As geza has also pointed out in the comments, the "trick" here is to get as close as possible to the minimum number of bytes needed, without wasting memory through alignment losses, padding or pointer indirection (see http://www.catb.org/esr/structure-packing/).
Edit 2: If you feel adventurous, you could also try a bit field (and please let us know how much space it actually consumes):
struct Bitset36 {
unsigned long long data:36;
};
I'm not an expert, but this is what I would "try": find the size of the smallest type your compiler supports (it should be char). You can check with sizeof; you should get 1, meaning 1 byte, so 8 bits.
So if you wanted a 24-bit type, you would need 3 chars. For 36 you would need a 5-char array, and you would have 4 bits of wasted padding at the end. This can easily be accounted for.
i.e.
char typeSize[3] = {0}; // should hold 24 bits
Now make a bit mask to access each position of typeSize.
const unsigned char one = 0b0000'0001;
const unsigned char two = 0b0000'0010;
const unsigned char three = 0b0000'0100;
const unsigned char four = 0b0000'1000;
const unsigned char five = 0b0001'0000;
const unsigned char six = 0b0010'0000;
const unsigned char seven = 0b0100'0000;
const unsigned char eight = 0b1000'0000;
Now you can use bitwise OR to set the bits to 1 where needed:
typeSize[1] |= four;
typeSize[0] |= (four | five);
To turn bits off, use the & operator with an inverted mask:
typeSize[0] &= ~four;
typeSize[2] &= ~(four | five);
You can test each bit with the & operator:
typeSize[0] & four
Bear in mind, I don't have a compiler handy to try this out so hopefully this is a useful approach to your problem.
Good luck ;-)
You can use an array of unsigned long int and store and retrieve the needed bit chains with bitwise operations. This approach avoids space overhead.
Simplified example for unsigned byte array B[] and 12-bit variables V (represented as ushort):
Set V[0]:
B[0] = V & 0xFF; //low byte
B[1] = B[1] & 0xF0; // clear low nibble
B[1] = B[1] | (V >> 8); //fill low nibble of the second byte with the highest nibble of V
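Get V[0] back (a sketch using the same layout as above):

V = B[0] | ((B[1] & 0x0F) << 8); // low byte, plus the low nibble of B[1] as bits 8-11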

Two values in one byte

In a single nibble (0-F) I can store one number from 0 to 15. In one byte, I can store a single number from 0 to 255 (00 - FF).
Can I use a byte (00-FF) to store two different numbers each in the range 0-127 (00 - 7F)?
The answer to your question is NO. You can split a single byte into two numbers, but the sum of the bits in the two numbers must be <= 8. Since the range 0-127 requires 7 bits, the other number in the byte can only have 1 bit, i.e. the range 0-1.
For obvious cardinality reasons, you cannot store two small integers in the 0 ... 127 range in one byte of 0 ... 255 range. In other words, the Cartesian product [0;127]×[0;127] has 2^14 elements, which is bigger than 2^8 (the cardinality of the [0;255] interval, for bytes).
(If you can afford losing precision, which you didn't mention, you could, e.g., store only the highest bits...)
Perhaps your question is: could I store two small integers from [0;15] in a byte? Then of course you could:
typedef unsigned unibble_t; // unsigned nibble in [0;15]
uint8_t make_from_two_nibbles(unibble_t l, unibble_t r) {
assert(l<=15);
assert(r<=15);
return (l<<4) | r;
}
unibble_t left_nibble (uint8_t x) { return x >> 4; }
unibble_t right_nibble (uint8_t x) { return x & 0xf; }
But I don't think you always should do that. First, you might use bit-fields in a struct. Then (and most importantly), dealing with nibbles that way can be less efficient and less readable than using whole bytes.
And updating a single nibble, e.g. with
void update_left_nibble (uint8_t*p, unibble_t l) {
assert (p);
assert (l<=15);
*p = ((l<<4) | ((*p) & 0xf));
}
is sometimes expensive (it involves a memory load and a memory store, so it touches the CPU cache and the cache-coherence machinery), and most importantly it is generally a non-atomic operation (what would happen if two different threads called update_left_nibble simultaneously on the same address p, i.e. with pointer aliasing, is undefined behaviour).
As a rule of thumb, avoid packing more than one data item in a byte unless you are sure it is worthwhile (e.g. you have a billion of such data items).
One byte is not enough for two values in 0…127, because each of those values needs log2(128) = 7 bits, for a total of 14, but a byte is only 8 bits.
You can declare variables with bit-packed storage using the C and C++ bitfield syntax:
struct packed_values {
uint8_t first : 7;
uint8_t second : 7;
uint8_t third : 2;
};
In this example, only 16 bits are used despite there being three fields, so sizeof(packed_values) could be as small as 2; whether bit-fields may straddle byte boundaries is implementation-defined, though, so some compilers will pad the struct to 3 bytes.
This is simpler than using bitwise arithmetic with << and & operators, but it's still not quite the same as ordinary variables: bit-fields have no addresses, so you can't have a pointer (or C++ reference) to one.
Can I use a byte to store two numbers in the range 0-127?
Of course you can:
uint8_t storeTwoNumbers(unsigned a, unsigned b) {
return ((a >> 4) & 0x0f) | (b & 0xf0);
}
void retrieveTwoNumbers(uint8_t byte, unsigned *a, unsigned *b) {
*b = byte & 0xf0;
*a = (byte & 0x0f) << 4;
}
Numbers are still in range 0...127 (0...255, actually). You just lose some precision, similar to floating point types: their values increment in steps of 16.
You can store two data items in the range 0-15 in a single byte, but you should not (one variable per datum is a better design).
If you must, you can use bit-masks and bit-shifts to access to the two data in your variable.
uint8_t var; /* range 0-255 */
data1 = (var & 0x0F); /* range 0-15 */
data2 = (var & 0xF0) >> 4; /* range 0-15 */
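And to pack the two values back into the byte (a sketch with the same masks):

var = (data2 << 4) | (data1 & 0x0F); /* data1, data2 in range 0-15 */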

Get Integer From Bits Inside `std::vector<char>`

I have a vector<char> and I want to be able to get an unsigned integer from a range of bits within the vector, but I can't seem to write the correct operations to get the desired output. My intended algorithm goes like this:
& the first byte with (0xff >> unused bits in byte on the left)
<< the result left the number of output bytes * number of bits in a byte
| this with the final output
For each subsequent byte:
<< left by the (byte width - index) * bits per byte
| this byte with the final output
| the final byte (not shifted) with the final output
>> the final output by the number of unused bits in the byte on the right
And here is my attempt at coding it, which does not give the correct result:
#include <vector>
#include <iostream>
#include <cstdint>
#include <bitset>
template<class byte_type = char>
class BitValues {
private:
std::vector<byte_type> bytes;
public:
static const auto bits_per_byte = 8;
BitValues(std::vector<byte_type> bytes) : bytes(bytes) {
}
template<class return_type>
return_type get_bits(int start, int end) {
auto byte_start = (start - (start % bits_per_byte)) / bits_per_byte;
auto byte_end = (end - (end % bits_per_byte)) / bits_per_byte;
auto byte_width = byte_end - byte_start;
return_type value = 0;
unsigned char first = bytes[byte_start];
first &= (0xff >> start % 8);
return_type first_wide = first;
first_wide <<= byte_width;
value |= first_wide;
for(auto byte_i = byte_start + 1; byte_i <= byte_end; byte_i++) {
auto byte_offset = (byte_width - byte_i) * bits_per_byte;
unsigned char next_thin = bytes[byte_i];
return_type next_byte = next_thin;
next_byte <<= byte_offset;
value |= next_byte;
}
value >>= (((byte_end + 1) * bits_per_byte) - end) % bits_per_byte;
return value;
}
};
int main() {
BitValues<char> bits(std::vector<char>({'\x78', '\xDA', '\x05', '\x5F', '\x8A', '\xF1', '\x0F', '\xA0'}));
std::cout << bits.get_bits<unsigned>(15, 29) << "\n";
return 0;
}
(In action: http://coliru.stacked-crooked.com/a/261d32875fcf2dc0)
I just can't seem to wrap my head around these bit manipulations, and I find debugging very difficult! If anyone can correct the above code, or help me in any way, it would be much appreciated!
Edit:
My bytes are 8 bits long
The integer to return could be 8, 16, 32 or 64 bits wide
The integer is stored in big endian
You made two primary mistakes. The first is here:
first_wide <<= byte_width;
You should be shifting by a bit count, not a byte count. Corrected code is:
first_wide <<= byte_width * bits_per_byte;
The second mistake is here:
auto byte_offset = (byte_width - byte_i) * bits_per_byte;
It should be
auto byte_offset = (byte_end - byte_i) * bits_per_byte;
The value in parentheses needs to be the number of bytes to shift right by, which is also the number of bytes byte_i is away from the end. The value byte_width - byte_i has no semantic meaning (one is a delta, the other is an index).
The rest of the code is fine. Though, this algorithm has two issues with it.
First, when using your result type to accumulate bits, you assume you have room on the left to spare. This isn't the case if there are set bits near the right boundary and the choice of range causes the bits to be shifted out. For example, try running
bits.get_bits<uint16_t>(11, 27);
You'll get the result 42, which corresponds to the bit string 00000000 00101010. The correct result is 53290, with the bit string 11010000 00101010. Notice how the rightmost 4 bits got zeroed out. This is because you start off by overshifting your value variable, causing those four bits to be shifted out of the variable. When shifting back at the end, this results in those bits being zeroed out.
The second problem has to do with the right shift at the end. If the rightmost bit of the value variable happens to be a 1 before the right shift at the end, and the template parameter is a signed type, then the right shift that is done is an 'arithmetic' right shift, which causes bits on the right to be 1-filled, leaving you with an incorrect negative value.
Example, try running:
bits.get_bits<int16_t>(5, 21);
The expected result should be 6976 with the bit string 00011011 01000000, but the current implementation returns -1216 with the bit string 11111011 01000000.
I've put my implementation of this below which builds the bit string from the right to the left, placing bits in their correct positions to start with so that the above two problems are avoided:
template<class ReturnType>
ReturnType get_bits(int start, int end) {
int max_bits = kBitsPerByte * sizeof(ReturnType);
if (end - start > max_bits) {
start = end - max_bits;
}
int inclusive_end = end - 1;
int byte_start = start / kBitsPerByte;
int byte_end = inclusive_end / kBitsPerByte;
// Put in the partial-byte on the right
uint8_t first = bytes_[byte_end];
int bit_offset = (inclusive_end % kBitsPerByte);
first >>= 7 - bit_offset;
bit_offset += 1;
ReturnType ret = 0 | first;
// Add the rest of the bytes
for (int i = byte_end - 1; i >= byte_start; i--) {
ReturnType tmp = (uint8_t) bytes_[i];
tmp <<= bit_offset;
ret |= tmp;
bit_offset += kBitsPerByte;
}
// Mask out the partial byte on the left
int shift_amt = (end - start);
if (shift_amt < max_bits) {
ReturnType mask = ((ReturnType)1 << shift_amt) - 1; // cast so the shift is done at the return type's width
ret &= mask;
}
return ret;
}
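For reference, here is a sketch of the surrounding class this method assumes; note that the member names bytes_ and kBitsPerByte differ from the question's bytes and bits_per_byte:

#include <cstdint>
#include <vector>

template<class byte_type = char>
class BitValues {
private:
    std::vector<byte_type> bytes_;
    static const int kBitsPerByte = 8;
public:
    BitValues(std::vector<byte_type> bytes) : bytes_(bytes) {}
    // ReturnType get_bits(int start, int end) as above
};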
There is one thing I think you have missed: the way you index the bits in the vector differs from what you were given in the problem. That is, with the algorithm you outlined, the order of the bits will be 7 6 5 4 3 2 1 0 | 15 14 13 12 11 10 9 8 | 23 22 21 .... Frankly, I didn't read through your whole algorithm, but this is already off in the very first step.
Interesting problem. I've done similar, for some systems work.
Your char is 8 bits wide? Or 16? How big is your integer? 32 or 64?
Ignore the vector complexity for a minute.
Think about it as just an array of bits.
How many bits do you have? You have 8*number of chars
You need to calculate a starting char, number of bits to extract, ending char, number of bits there, and number of chars in the middle.
You will need bitwise-and & for the first partial char
you will need bitwise-and & for the last partial char
you will need left-shift << (or right-shift >>), depending upon which order you start from
what is the endian-ness of your Integer?
At some point you will calculate an index into your array that is bitindex/char_bit_width. You gave 171 as your bitindex and 8 as your char_bit_width, so you will end up with these useful values:
171/8 = 21 // index of the first byte
171%8 = 3 // bit offset into the first byte
8 - 171%8 = 5 // bits used from the first byte
sizeof(integer) = 4
sizeof(integer) + ( (171%8)>0?1:0 ) // how many array positions to examine
Some assembly required...
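Here is a sketch of that assembly (my own illustration, for a 32-bit big-endian integer and 8-bit chars; bounds checking omitted):

#include <cstddef>
#include <cstdint>

uint32_t extract32(const unsigned char* a, std::size_t bitindex)
{
    std::size_t byte = bitindex / 8; // 21 for bitindex 171
    int skip = bitindex % 8;         // 3 leading bits to discard
    uint64_t acc = 0;
    for (int i = 0; i < 5; ++i)      // 5 bytes always cover skip + 32 bits
        acc = (acc << 8) | a[byte + i];
    return (uint32_t)(acc >> (8 - skip)); // drop the trailing bits; the cast drops the leading ones
}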

how to optimize C++/C code for a large number of integers

I have written the code below. It checks the top bit of every byte: while that bit is equal to 0, it concatenates the byte's value onto the previously read bytes and accumulates the result in the variable var1. Here data points to the bytes of an integer. An integer in my implementation is uint64_t and can occupy up to 8 bytes.
uint64_t func(char* data)
{
uint64_t var1 = 0; int i=0;
while ((data[i] >> 7) == 0)
{
var1 = (var1 << 7) | (data[i]);
i++;
}
return var1;
}
I am calling func() repeatedly, trillions of times for trillions of integers, so it runs slow. Is there a way to optimize this code?
EDIT: Thanks to Joe Z.; it's indeed a form of uleb128 unpacking.
I have only tested this minimally; I am happy to fix glitches with it. With modern processors, you want to bias your code heavily toward easily predicted branches. And, if you can safely read the next 10 bytes of input, there's nothing to be saved by guarding their reads by conditional branches. That leads me to the following code:
// fast uleb128 decode
// assumes you can read all 10 bytes at *data safely.
// assumes standard uleb128 format, with LSB first, and
// ... bit 7 indicating "more data in next byte"
uint64_t unpack( const uint8_t *const data )
{
uint64_t value = ((data[0] & 0x7F ) << 0)
| ((data[1] & 0x7F ) << 7)
| ((data[2] & 0x7F ) << 14)
| ((data[3] & 0x7F ) << 21)
| ((data[4] & 0x7Full) << 28)
| ((data[5] & 0x7Full) << 35)
| ((data[6] & 0x7Full) << 42)
| ((data[7] & 0x7Full) << 49)
| ((data[8] & 0x7Full) << 56)
| ((data[9] & 0x7Full) << 63);
if ((data[0] & 0x80) == 0) value &= 0x000000000000007Full; else
if ((data[1] & 0x80) == 0) value &= 0x0000000000003FFFull; else
if ((data[2] & 0x80) == 0) value &= 0x00000000001FFFFFull; else
if ((data[3] & 0x80) == 0) value &= 0x000000000FFFFFFFull; else
if ((data[4] & 0x80) == 0) value &= 0x00000007FFFFFFFFull; else
if ((data[5] & 0x80) == 0) value &= 0x000003FFFFFFFFFFull; else
if ((data[6] & 0x80) == 0) value &= 0x0001FFFFFFFFFFFFull; else
if ((data[7] & 0x80) == 0) value &= 0x00FFFFFFFFFFFFFFull; else
if ((data[8] & 0x80) == 0) value &= 0x7FFFFFFFFFFFFFFFull;
return value;
}
The basic idea is that small values are common (and so most of the if-statements won't be reached), but assembling the 64-bit value that needs to be masked is something that can be efficiently pipelined. With a good branch predictor, I think the above code should work pretty well. You might also try removing the else keywords (without changing anything else) to see if that makes a difference. Branch predictors are subtle beasts, and the exact character of your data also matters. If nothing else, you should be able to see that the else keywords are optional from a logic standpoint, and are there only to guide the compiler's code generation and provide an avenue for optimizing the hardware's branch predictor behavior.
Ultimately, whether or not this approach is effective depends on the distribution of your dataset. If you try out this function, I would be interested to know how it turns out. This particular function focuses on standard uleb128, where the value gets sent LSB first, and bit 7 == 1 means that the data continues.
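A quick sanity check of the function (a sketch; the buffer is zero-padded so all 10 bytes are readable):

uint8_t buf[10] = { 0x96, 0x01 }; // uleb128 for 150: 0x96 says "more", 0x01 stops
uint64_t v = unpack(buf);         // v == 150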
There are SIMD approaches, but none of them lend themselves readily to 7-bit data.
Also, if you can mark this inline in a header, then that may also help. It all depends on how many places this gets called from, and whether those places are in a different source file. In general, though, inlining when possible is highly recommended.
Your code is problematic
uint64_t func(const unsigned char* pos)
{
uint64_t var1 = 0; int i=0;
while ((pos[i] >> 7) == 0)
{
var1 = (var1 << 7) | (pos[i]);
i++;
}
return var1;
}
First a minor thing: i should be unsigned.
Second: You don't assert that you don't read beyond the boundary of pos. E.g. if all values of your pos array are 0, then you will reach pos[size] where size is the size of the array, hence you invoke undefined behaviour. You should pass the size of your array to the function and check that i is smaller than this size.
Third: If pos[i] has its most significant bit equal to zero for i=0,..,k with k>10, then earlier work gets discarded (as you push the old value out of var1).
The third point actually helps us:
uint64_t func(const unsigned char* pos, size_t size)
{
size_t i(0);
while ( i < size && (pos[i] >> 7) == 0 )
{
++i;
}
// At this point, i is either equal to size or
// i is the index of the first pos value you don't want to use.
// Therefore we want to use the values
// pos[i-10], pos[i-9], ..., pos[i-1]
// if i is less than 10, we obviously need to ignore some of the values
const size_t start = (i >= 10) ? (i - 10) : 0;
uint64_t var1 = 0;
for ( size_t j(start); j < i; ++j )
{
var1 <<= 7;
var1 += pos[j];
}
return var1;
}
In conclusion: we separated the logic and got rid of all discarded entries. The speed-up depends on the actual data you have. If lots of entries are discarded, then you save a lot of writes to var1 with this approach.
Another thing: mostly, if one function is called massively, the best optimization you can do is to call it less. Perhaps you can come up with an additional condition that makes the call to this function unnecessary.
Keep in mind that if you actually use 10 bytes, the first byte ends up being truncated: 64 bits means only 9 bytes can contribute their full 7 bits of information, leaving exactly one bit for the tenth. You might want to switch to uint128_t.
A small optimization would be:
while ((pos[i] & 0x80) == 0)
Bitwise AND is generally faster than a shift. This of course depends on the platform, and it's also possible that the compiler will do this optimization itself.
Can you change the encoding?
Google came across the same problem, and Jeff Dean describes a really cool solution on slide 55 of his presentation:
http://research.google.com/people/jeff/WSDM09-keynote.pdf
http://videolectures.net/wsdm09_dean_cblirs/
The basic idea is that reading the first bit of several bytes is poorly supported on modern architectures. Instead, let's take 8 of these bits, and pack them as a single byte preceding the data. We then use the prefix byte to index into a 256-item lookup table, which holds masks describing how to extract numbers from the rest of the data.
I believe it's how protocol buffers are currently encoded.
Can you change your encoding? As you've discovered, using a bit on each byte to indicate if there's another byte following really sucks for processing efficiency.
A better way to do it is to model UTF-8, which encodes the length of the full int into the first byte:
0xxxxxxx // one byte with 7 bits of data
110xxxxx 10xxxxxx // two bytes with 11 bits of data
1110xxxx 10xxxxxx 10xxxxxx // three bytes with 16 bits of data
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx // four bytes with 21 bits of data
// etc.
But UTF-8 has special properties to make it easier to distinguish from ASCII. This bloats the data and you don't care about ASCII, so you'd modify it to look like this:
0xxxxxxx // one byte with 7 bits of data
10xxxxxx xxxxxxxx // two bytes with 14 bits of data.
110xxxxx xxxxxxxx xxxxxxxx // three bytes with 21 bits of data
1110xxxx xxxxxxxx xxxxxxxx xxxxxxxx // four bytes with 28 bits of data
// etc.
This has the same compression level as your method (up to 64 bits = 9 bytes), but is significantly easier for a CPU to process.
From this you can build a lookup table for the first byte which gives you a mask and length:
// byte_counts[255] contains the number of additional
// bytes if the first byte has a value of 255.
uint8_t const byte_counts[256]; // a global constant.
// byte_masks[255] contains a mask for the useful bits in
// the first byte, if the first byte has a value of 255.
uint8_t const byte_masks[256]; // a global constant.
And then to decode:
// the resulting value.
uint64_t v = 0;
// mask off the data bits in the first byte.
v = *data & byte_masks[*data];
// read in the rest.
switch(byte_counts[*data])
{
case 3: v = v << 8 | *++data;
case 2: v = v << 8 | *++data;
case 1: v = v << 8 | *++data;
case 0: return v;
default:
// If you're on VC++, this'll make it take one less branch.
// Better make sure you've got all the valid inputs covered, though!
__assume(0);
}
No matter the size of the integer, this hits only one branch point: the switch, which will likely be put into a jump table. You can potentially optimize it even further for ILP by not letting each case fall through.
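For example (a sketch of that last idea), the three-byte case could assemble its value in a single expression, letting the byte loads overlap instead of cascading through the smaller cases:

case 3: return v << 24 | (uint64_t)data[1] << 16 | (uint64_t)data[2] << 8 | data[3];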
First, rather than shifting, you can do a bitwise test on the relevant bit. Second, you can use a pointer rather than indexing (but the compiler should do this optimization itself). Thus:
uint64_t
readUnsignedVarLength( unsigned char const* pos )
{
uint64_t results = 0;
while ( (*pos & 0x80) == 0 ) {
results = (results << 7) | *pos;
++ pos;
}
return results;
}
At least, this corresponds to what your code does. For variable-length encoding of unsigned integers, it is incorrect, since 1) variable-length encodings are little endian, and your code is big endian, and 2) your code doesn't OR in the high-order byte. Finally, the Wiki page suggests that you've got the test inverted. (I know this format mainly from BER encoding and Google protocol buffers, both of which set bit 7 to indicate that another byte will follow.)
The routine I use is:
uint64_t
readUnsignedVarLen( unsigned char const* source )
{
int shift = 0;
uint64_t results = 0;
uint8_t tmp = *source ++;
while ( ( tmp & 0x80 ) != 0 ) {
results |= static_cast<uint64_t>( tmp & 0x7F ) << shift;
shift += 7;
tmp = *source ++;
}
return results | ( static_cast<uint64_t>( tmp ) << shift );
}
For the rest, this wasn't written with performance in mind, but I doubt that you could do significantly better. An alternative solution would be to pick up all of the bytes first, then process them in reverse order:
uint64_t
readUnsignedVarLen( unsigned char const* source )
{
unsigned char buffer[10];
unsigned char* p = std::begin( buffer );
while ( p != std::end( buffer ) && (*source & 0x80) != 0 ) {
*p = *source & 0x7F;
++ p;
++ source;
}
assert( p != std::end( buffer ) );
*p = *source;
++ p;
uint64_t results = 0;
while ( p != std::begin( buffer ) ) {
-- p;
results = (results << 7) + *p;
}
return results;
}
The necessity of checking for buffer overrun will likely make this slightly slower, but on some architectures, shifting by a constant is significantly faster than shifting by a variable, so this could be faster on them.
Globally, however, don't expect miracles. The motivation for using variable-length integers is to reduce data size, at a cost in runtime for decoding and encoding.