When implementing streaming compression algorithms, one usually needs a super-fast FIFO bit container class with the following functions:
AddBits(UINT n, UINT nBits); // Add lower nBits bits of n
GetBitCount(); // Get the number of bits currently stored
GetBits(BYTE* n, UINT nBits); // Extract nBits bits into n, and remove them
The number of bits is bounded by a relatively small size (the 'packet' size or a bit more).
I'm looking for a small C++ class that implements this functionality.
Yes, I can write one (and know how to do it), but probably someone wrote it already...
Note: I don't want to add boost / whatever-big-lib to my project just for this.
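For reference, here is a minimal sketch of that interface (the class name BitFifo and its internals are my own invention, not taken from any existing library); it favors clarity over the "super-fast" requirement, storing one deque entry per bit:
#include <cstdint>
#include <cstddef>
#include <deque>

class BitFifo {
public:
    void AddBits(uint32_t n, uint32_t nBits) {        // append the low nBits of n, LSB first
        for (uint32_t i = 0; i < nBits; ++i)
            bits_.push_back(((n >> i) & 1u) != 0);
    }
    std::size_t GetBitCount() const { return bits_.size(); }
    void GetBits(uint8_t* out, uint32_t nBits) {      // pop nBits into out, LSB first per byte
        for (uint32_t i = 0; i < nBits; ++i) {
            if (i % 8 == 0) out[i / 8] = 0;
            out[i / 8] |= static_cast<uint8_t>(bits_.front()) << (i % 8);
            bits_.pop_front();
        }
    }
private:
    std::deque<bool> bits_;   // one entry per bit: simple and correct, not tuned for speed
};
A real implementation would pack the bits into words, but this shows the contract the question asks for.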
One approach I've used in embedded systems, when I always wanted to read 16 bits or less at a time, was to keep a 32-bit long which holds the current partial 16-bit word, and the next whole one. Then the code was something like:
/* ui = unsigned 16-bit, ul = unsigned 32-bit */
typedef unsigned short ui;
typedef unsigned long ul;

ul bit_buff;    /* low bits: current partial word; above that: the next whole one */
ui buff_count;  /* number of valid bits currently in bit_buff */
ui *bit_src;    /* pointer to the next 16-bit source word */

unsigned int readbits(int numbits)  /* numbits must be 16 or less */
{
    unsigned int result;
    if (buff_count < numbits)
    {
        bit_buff |= ((ul)(*bit_src++)) << buff_count;
        buff_count += 16;
    }
    result = bit_buff & ((1u << numbits) - 1);
    bit_buff >>= numbits;   /* discard the bits just returned */
    buff_count -= numbits;
    return result;
}
That could probably be readily adapted to use a 64-bit long long, allowing withdrawal of up to 32 bits at a time.
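A sketch of that adaptation, assuming the same global-state style as above but with 32-bit source words (untested, so treat it as an outline rather than a drop-in replacement):
typedef unsigned int u32;
typedef unsigned long long u64;

u64 bit_buff64;        /* partial word plus the next whole 32-bit word */
unsigned buff_count64; /* number of valid bits in bit_buff64 */
u32 *bit_src32;        /* pointer to the next 32-bit source word */

unsigned int readbits32(int numbits)  /* numbits must be 32 or less */
{
    unsigned int result;
    if (buff_count64 < (unsigned)numbits)
    {
        bit_buff64 |= ((u64)(*bit_src32++)) << buff_count64;
        buff_count64 += 32;
    }
    result = (unsigned int)(bit_buff64 & ((1ull << numbits) - 1));
    bit_buff64 >>= numbits;
    buff_count64 -= numbits;
    return result;
}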
I know you don't want to, but you could use a boost::dynamic_bitset and provide the FIFO capability on top of it using supplier/consumer semantics.
I didn't want to use boost either, but it's really not a big deal. You don't have to link against anything; dynamic_bitset is header-only, so you just need Boost available on your build system and the right include files.
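A rough sketch of that supplier/consumer idea (the BitQueue wrapper and its read cursor are my own; only push_back, size and operator[] come from boost::dynamic_bitset):
#include <boost/dynamic_bitset.hpp>
#include <cstdint>
#include <cstddef>

struct BitQueue {
    boost::dynamic_bitset<> bits;   // the supplier appends here
    std::size_t read_pos = 0;       // the consumer reads from here

    void AddBits(uint32_t n, uint32_t nBits) {
        for (uint32_t i = 0; i < nBits; ++i)
            bits.push_back(((n >> i) & 1u) != 0);
    }
    std::size_t GetBitCount() const { return bits.size() - read_pos; }
    uint32_t GetBits(uint32_t nBits) {   // returns the extracted bits, LSB first
        uint32_t v = 0;
        for (uint32_t i = 0; i < nBits; ++i)
            v |= uint32_t(bits[read_pos++]) << i;
        return v;
    }
};
The prefix of already-consumed bits grows until you reset or compact it, which is acceptable given the bounded packet size mentioned in the question.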
Related
It seems for std::bitset<1 to 32>, the size is set to 4 bytes. For sizes 33 to 64, it jumps straight up to 8 bytes. There can't be any overhead because std::bitset<32> is an even 4 bytes.
I can see aligning to byte length when dealing with bits, but why would a bitset need to align to word length, especially for a container most likely to be used in situations with a tight memory budget?
This is under VS2010.
The most likely explanation is that bitset is using a whole number of machine words to store the array.
This is probably done for memory bandwidth reasons: it is typically relatively cheap to read/write a word that's aligned at a word boundary. On the other hand, reading (and especially writing!) an arbitrarily-aligned byte can be expensive on some architectures.
Since we're talking about a fixed-size penalty of a few bytes per bitset, this sounds like a reasonable tradeoff for a general-purpose library.
I assume that indexing into the bitset is done by grabbing a 32-bit value and then isolating the relevant bit because this is fastest in terms of processor instructions (working with smaller-sized values is slower on x86). The two indexes needed for this can also be calculated very quickly:
int wordIndex = index >> 5;   // index / 32: which 32-bit word holds the bit
int bitIndex = index & 0x1f;  // index % 32: the bit's position within that word
And then you can do this, which is also very fast:
int word = m_pStorage[wordIndex];
bool bit = ((word & (1 << bitIndex)) >> bitIndex) == 1;
Also, a maximum waste of 3 bytes per bitset is not exactly a memory concern IMHO. Consider that a bitset is already the most efficient data structure to store this type of information, so you would have to evaluate the waste as a percentage of the total structure size.
For 1025 bits this approach uses up 132 bytes instead of 129, for 2.3% overhead (and this goes down as the bitset size goes up). Sounds reasonable considering the likely performance benefits.
The memory system on modern machines cannot fetch anything smaller than a word from memory; narrower accesses are implemented by extracting the desired bits from a fetched word. Hence, keeping bitsets word-aligned makes them a lot faster to handle, because the unused bits of the word can simply be kept zero and you do not need to mask them out on every access. If the unused bits were not masked (or kept clean), doing something like
bitset<4> foo = 0;
if (foo.any()) {
// ...
}
would most likely give the wrong answer. Apart from that, I remember reading some time ago that there was a way to cram several bitsets together, but I don't remember exactly how. I think it was that several bitsets placed together in a structure can share storage, which is not applicable to most use cases of bitsets.
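To make that failure mode concrete, here is a small illustration of my own (not from the answer above) of what goes wrong when stale bits in a shared storage word are not masked off:
// Suppose a bitset<4> shares a storage word whose upper bits hold garbage.
unsigned storage = 0xF0;                  // the 4 owned bits (low nibble) are all zero
bool any_naive  = (storage != 0);         // true: wrongly reports "some bit set"
bool any_masked = (storage & 0x0Fu) != 0; // false: correct once masked to the 4 owned bits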
I ran into the same thing with the AIX and Linux implementations. On AIX, the internal bitset storage is char-based:
typedef unsigned char _Ty;
....
_Ty _A[_Nw + 1];
On Linux, the internal storage is long-based:
typedef unsigned long _WordT;
....
_WordT _M_w[_Nw];
For compatibility reasons, we modified the Linux version to use char-based storage.
Check which implementation you are using inside bitset.h.
Because a 32-bit Intel-compatible processor cannot access bytes individually (or rather, it can, but only by implicitly applying some bit mask and shifts); it accesses 32-bit words at a time.
if you declare
bitset<4> a,b,c;
even if the library implemented it as char, a, b and c would each be 32-bit aligned, so the same wasted space would exist. But the processor would also be forced to pre-mask the bytes before letting the bitset code do its own masking.
For this reason, MS used an int[1 + (N - 1) / 32] as the container for the bits.
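As a quick illustration of that formula (a snippet of my own, just modeling the sizes reported in the question rather than quoting the actual header):
#include <cstdio>

int main() {
    // With an int[1 + (N - 1) / 32] container, the size grows in 4-byte steps.
    for (unsigned N : {1u, 32u, 33u, 64u, 65u})
        std::printf("bitset<%u> -> %u bytes\n", N, 4u * (1u + (N - 1u) / 32u));
    return 0;
}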
Maybe because it's using int by default, and switches to long long if it overflows? (Just a guess...)
If your std::bitset< 8 > was a member of a structure, you might have this:
struct A
{
std::bitset< 8 > mask;
void * pointerToSomething;
};
If bitset<8> was stored in one byte (and the structure packed on 1-byte boundaries) then the pointer following it in the structure would be unaligned, which would be A Bad Thing. The only time when it would be safe and useful to have a bitset<8> stored in one byte would be if it was in a packed structure and followed by some other one-byte fields with which it could be packed together. I guess this is too narrow a use case for it to be worthwhile providing a library implementation.
Basically, in your octree, a single byte bitset would only be useful if it was followed in a packed structure by another one to three single-byte members. Otherwise, it would have to be padded to four bytes anyway (on a 32-bit machine) to ensure that the following variable was word-aligned.
My official question is: "Is there a clean way to use data types to encode and compress data, rather than using messy bit masking?" The hope is to save space by compressing, while using native data types, structures, and arrays to improve readability over bit masking. I am proficient in bit masking from my assembly background, but I am learning C++ and OOP. We can store a lot of information in a 32-bit register by using individual bits, and I feel I am trying to get back to that low-level environment while keeping the readability of C++ code.
I am attempting to save some space because I am working with huge resource requirements. I am still learning how C++ treats the bool data type. I realize that memory is stored in byte-sized chunks rather than individual bits, and I believe a bool usually occupies one byte and is masked somehow. In my head, I could fit 8 bool values into one byte.
If I allocate an array of 2 bool elements in C++, does it allocate two bytes or just one?
Example: we will use DNA, since it can be encoded into two bits per base to represent A, C, G and T. I make a struct with an array of two bools called DNA_Base, and then an array of those (here seven, one per base of GATTACA):
struct DNA_Base{ bool Bit_1; bool Bit_2; };
DNA_Base DNA_Sequence[7] = {false};
cout << sizeof(DNA_Base)<<sizeof(DNA_Sequence)<<endl;
//Yields a 2 and a 14.
//I would like this to say 1 and 2.
In my example I would also consider the case where the DNA sequence is 20 bases long, which would require 40 bits to encode. GATTACA should only take up a maximum of 2 bytes, right? I suppose an alternative question would have been "How do I make C++ do the bit masking for me in a more readable way?", or should I just make my own data type and classes and implement the bit masking using operator overloading?
Not exactly what you want, but you can use bit-fields:
struct DNA_Base
{
unsigned char Bit_1 : 1;
unsigned char Bit_2 : 1;
};
DNA_Base DNA_Sequence[7];
So sizeof(DNA_Base) == 1 and sizeof(DNA_Sequence) == 7.
You still have to pack more bases into each DNA_Base to avoid losing space to padding, something like:
struct DNA_Base_4
{
unsigned char base1 : 2; // may have value 0 1 2 or 3
unsigned char base2 : 2;
unsigned char base3 : 2;
unsigned char base4 : 2;
};
So sizeof(DNA_Base_4) == 1
std::bitset is another alternative, but you have to do the interpretation job yourself.
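For instance, a minimal sketch of that "interpretation job" (the encode/pack helpers are my own, not part of any library): with 2 bits per base, GATTACA fits in 14 bits.
#include <bitset>
#include <cstddef>
#include <string>

unsigned encode(char base) {             // A=0, C=1, G=2, T=3
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;              // 'T'
    }
}

std::bitset<16> pack(const std::string& dna) {   // up to 8 bases
    std::bitset<16> bits;
    for (std::size_t i = 0; i < dna.size(); ++i) {
        unsigned code = encode(dna[i]);
        bits[2 * i]     = (code & 1u) != 0;      // low bit of the base code
        bits[2 * i + 1] = (code >> 1) != 0;      // high bit of the base code
    }
    return bits;                                  // e.g. pack("GATTACA")
}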
An array of bools will occupy N elements × sizeof(bool).
If your goal is to save space in registers, don't bother: it is actually more efficient to use the processor's word size than a single byte, the compiler will prefer to use a word anyway, and in a struct/class the bool will usually be expanded to a 32-bit or 64-bit native word.
Now, if you would like to save room on disk or in RAM because you need to store LOTS of bools, go ahead, but it isn't going to save room in all cases unless you actually pack the structure, and on some architectures packing can also have a performance impact because the CPU will have to perform unaligned or byte-by-byte accesses.
A bitmask (or bit-field), on the other hand, is performant, efficient, and as dense as possible, and uses a single bitwise operation. I would look at one of the abstract data types that provide bit fields.
The standard library has bitset http://www.cplusplus.com/reference/bitset/bitset/ which can be as long as you want.
Boost also has something I'm sure.
Unless you are on a 4-bit machine, the final result will involve bit arithmetic. Whether you do it explicitly, have the compiler do it via bit-fields, or use a bit container, there will be bit manipulation.
I suggest the following:
Use existing compression libraries.
Use the method that is most readable or best understood by people other than yourself.
Use the method that is most productive (in terms of development time).
Use the method with which you will inject the fewest defects.
Edit 1:
Write each method up as a separate function.
Tell the compiler to generate the assembly language for each function.
Compare the assembly language of each function to each other.
My belief is that they will be similar enough that discussing the differences is not worth the time.
You can't operate on bits directly, but you can treat the smallest addressable unit available to you as a store for multiple values, and define
enum class DNAx4 : uint8_t {
    AAAA = 0x00, AAAC = 0x01, AAAG = 0x02, AAAT = 0x03,
    // .... and the rest of them
    TTTA = 0xFC, TTTC = 0xFD, TTTG = 0xFE, TTTT = 0xFF
};
I'd actually go further, and create a structure DNAx16 or DNAx32 to efficiently use the native word size on your machine.
You can then define functions on the data type, which will have to use the underlying bit representation, but at least it allows you to encapsulate this and build higher level operations from these primitives.
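One such primitive might look like this (a hypothetical helper of my own, assuming the first base sits in the two most significant bits, as the enum values above imply):
#include <cstdint>

// Extract base number `index` (0..3) from a packed DNAx4 value;
// the result is 0 = A, 1 = C, 2 = G, 3 = T.
inline unsigned base_at(DNAx4 packed, unsigned index) {
    return (static_cast<unsigned>(packed) >> (2u * (3u - index))) & 0x3u;
}
For example, base_at(DNAx4::AAAC, 3) yields 1, i.e. a C in the last position.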
Before any question is asked: I am dealing with actual hardware.
I am searching for a meta-language that would allow me to specify data structure contents where fields have different bit lengths (including fields 1, 3, 24 or 48 bits long), with respect to endianness, and would generate C++ code for accessing the data.
The question was put on hold due to being too vague, so I'll try to make it as clear as possible:
I am searching for a language that:
accepts a simple structure description and generates useful C++ code,
would allow me to precisely specify integers ranging from 1 bit to multiple (up to 8) bytes long, along with data (typically strings),
would isolate me from the need to convert endianness,
produces exact, predictable output that does not come with overhead (as protocol buffers do).
ASN.1 sounds almost right for the purpose, but it adds its own overhead (meaning I cannot produce a simple structure that has 2 bytes split into 4 nibbles) - what I'm looking for is a language that offers an exact representation of the structure.
For example, I would want to abstract this:
struct Command {
struct Record {
int8_t track;
int8_t point;
int8_t index;
int16_t start_position; // big endian, misaligned
int32_t length; // big endian, misaligned;
} __attribute__((packed)); // structure length = 9 bytes.
int8_t current : 1;
int8_t command : 7;
int8_t reserved;
int16_t side : 3; // entire int16_t needs to be
int16_t layer : 3; // converted from big endian, because
int16_t laser_mark : 3; // this field spans across bytes.
int16_t laser_power : 3;
int16_t reserved_pad : 2;
int16_t laser_tag : 2;
int32_t mode_number : 8; // again, entire 32 bit field needs to be converted
int32_t record_count : 24; // from big endian to read this count properly.
Record records[];
} __attribute__((packed));
the above needs to pack exactly into a structure carrying 8 + record_count * 9 bytes, all fields formed accurately, with no additional data and no extra bits or bytes set.
The above is just an example; it has been kept simple so that I don't clog the site with the actual structures, which often have hundreds of fields. It has been simplified, but it shows many of the features I am looking for (the two remaining features are 48- or 64-bit integers and plain data (bytes[])).
If this question is still too vague, please explain in the comments what I should add. Thanks!
A simple table that tracks individual field sizes and is used to spin out offsets of each element into your structure sounds like the easiest solution. This won't scale to deeply nested structures, but could be tuned to support handling of the unassigned bit cases you identify.
Then you can use this to generate constants or even named property accessors to extract and update the individual fields. Given the size of the individual elements, macros are likely to make life even harder, but any mainstream compiler should inline the code. Your mileage could vary with a template-based implementation.
It would help if you could use a common representation for both sides of the application (host and device), to further reduce the likelihood of transcription errors.
The PLC world has a number of different mechanisms for layout, but these are all hardwired into their ecosystems and so would not really help.
Alternatively, if you have the tooling available, you could consider something like ASN.1 structures for the representation. In the extreme, you could even use an open-source generator to come up with an unencoded representation directly from the MIB.
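As a concrete (hypothetical) sketch of that field-size table: the names, widths and helper below are mine, modeled on the example structure above; offsets fall out by accumulating the widths.
#include <cstdint>
#include <cstddef>

struct FieldSpec { const char* name; unsigned bits; };

// Bit widths of the fixed part of Command, in wire order (sums to 64 bits).
constexpr FieldSpec kCommandFields[] = {
    {"current", 1}, {"command", 7}, {"reserved", 8},
    {"side", 3}, {"layer", 3}, {"laser_mark", 3}, {"laser_power", 3},
    {"reserved_pad", 2}, {"laser_tag", 2},
    {"mode_number", 8}, {"record_count", 24},
};

// Big-endian, bit-granular extraction: bit 0 is the MSB of byte 0.
uint64_t get_field(const uint8_t* buf, std::size_t bit_offset, unsigned bits) {
    uint64_t v = 0;
    for (unsigned i = 0; i < bits; ++i) {
        std::size_t pos = bit_offset + i;
        v = (v << 1) | ((buf[pos / 8] >> (7 - pos % 8)) & 1u);
    }
    return v;
}
A generator (or constexpr code) can walk kCommandFields, accumulate bit offsets, and emit one named accessor per entry, keeping the host and device descriptions in a single table.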
What is the fastest way to write a bitstream on x86/x86-64? (codeword <= 32bit)
By writing a bitstream I refer to the process of concatenating variable-bit-length symbols into a contiguous memory buffer.
Currently I've got a standard container with a 32-bit intermediate buffer to write to:
void write_bits(SomeContainer<unsigned int>& dst, unsigned int& buffer,
                unsigned int& bits_left_in_buffer, int codeword, short bits_to_write)
{
    if (bits_to_write < bits_left_in_buffer) {
        buffer |= codeword << (32 - bits_left_in_buffer);
        bits_left_in_buffer -= bits_to_write;
    } else {
        unsigned int full_bits = bits_to_write - bits_left_in_buffer;
        unsigned int towrite = buffer | (codeword << (32 - bits_left_in_buffer));
        buffer = full_bits ? (codeword >> bits_left_in_buffer) : 0;
        dst.push_back(towrite);
        bits_left_in_buffer = 32 - full_bits;
    }
}
Does anyone know of any nice optimizations, fast instructions or other info that may be of use?
Cheers,
I once wrote a quite fast implementation, but it has several limitations: it works on 32-bit x86 for both writing and reading the bitstream, and I don't check for buffer limits here; I was allocating a larger buffer and checking it from time to time from the calling code.
unsigned char* membuff;
unsigned bit_pos; // current BIT position in the buffer, so its max size is 512Mb

// lookup table of low-bit masks: BIT_MASKS[n] == (1u << n) - 1, for n = 0..17
extern const unsigned int BIT_MASKS[18];

// input bit buffer: we round the byte address down so that it's even, so the DWORD
// read from that address is guaranteed to contain at least 17 unread bits
inline unsigned int get_bits(unsigned int bit_cnt){ // bit_cnt MUST be in range 0..17
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1u; // round down to an even byte address
    unsigned int bits = *(unsigned int*)(membuff + byte_offset);
    bits >>= bit_pos & 0xF;
    bit_pos += bit_cnt;
    return bits & BIT_MASKS[bit_cnt];
}

// output buffer: the whole destination should be memset'ed to 0 beforehand
inline void put_bits(unsigned int val, unsigned int bit_cnt){
    unsigned int byte_offset = bit_pos >> 3;
    byte_offset &= ~1u;
    *(unsigned int*)(membuff + byte_offset) |= val << (bit_pos & 0xf);
    bit_pos += bit_cnt;
}
It's hard to answer in general because it depends on many factors such as the distribution of bit-sizes you are reading, the call pattern in the client code and the hardware and compiler. In general, the two possible approaches for reading (writing) from a bitstream are:
Using a 32-bit or 64-bit buffer and conditionally reading (writing) from the underlying array when you need more bits. That's the approach your write_bits method takes.
Unconditionally reading (writing) from the underlying array on every bitstream read (write) and then shifting and masking the resultant values.
The primary advantages of (1) include:
Only reads from the underlying buffer the minimally required number of times in an aligned fashion.
The fast path (no array read) is somewhat faster since it doesn't have to do the read and associated addressing math.
The method is likely to inline better since it doesn't have reads - if you have several consecutive read_bits calls, for example, the compiler can potentially combine a lot of the logic and produce some really fast code.
The primary advantage of (2) is that it is completely predictable - it contains no unpredictable branches.
Just because there is only one advantage for (2) doesn't mean it's worse: that advantage can easily overwhelm everything else.
In particular, you can analyze the likely branching behavior of your algorithm based on two factors:
How often will the bitstream need to read from the underlying buffer?
How predictable is the number of calls before a read is needed?
For example, if you are reading 1 bit 50% of the time and 2 bits 50% of the time, you will do 64 / 1.5 = ~42 reads (if you can use a 64-bit buffer) before requiring an underlying read. This favors method (1) since reads of the underlying are infrequent, even if mis-predicted. On the other hand, if you are usually reading 20+ bits, you will read from the underlying every few calls. This is likely to favor approach (2), unless the pattern of underlying reads is very predictable. For example, if you always read between 22 and 30 bits, you'll perhaps always take exactly three calls to exhaust the buffer and read the underlying1 array. So the branch will be well-predicted and (1) will stay fast.
Similarly, it depends on how you call these methods, and how the compiler can inline and simplify the code. Especially if you ever call the methods repeatedly with a compile-time constant size, a lot of simplification is possible. Little to no simplification is available when the codeword size is only known at runtime.
Finally, you may be able to get increased performance by offering a more complex API. This mostly applies to implementation option (1). For example, you can offer an ensure_available(unsigned size) call which ensures that at least size bits (usually limited to the buffer size) are available to read. Then you can read up to that number of bits using unchecked calls that don't check the buffer size. This can help you reduce mis-predictions by forcing the buffer fills to a predictable schedule and lets you write simpler unchecked methods.
1 This depends on exactly how your "read from underlying" routine is written, as there are a few options here: Some always fill to 64-bits, some fill to between 57 and 64-bits (i.e., read an integral number of bytes), and some may fill between 32 or 33 and 64-bits (like your example which reads 32-bit chunks).
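A rough sketch of that ensure_available split (my own names and layout, with byte-at-a-time refills; an outline rather than a definitive implementation):
#include <cstdint>

struct BitReader {
    uint64_t buffer = 0;            // bit accumulator, LSB first
    unsigned count = 0;             // number of valid bits in buffer
    const uint8_t* src = nullptr;   // underlying byte stream

    // Guarantee at least `bits` (<= 57) buffered bits; refill byte by byte.
    void ensure_available(unsigned bits) {
        while (count < bits) {
            buffer |= uint64_t(*src++) << count;
            count += 8;
        }
    }
    // Caller must have called ensure_available(bits) first; bits <= 32.
    uint32_t read_unchecked(unsigned bits) {
        uint32_t v = uint32_t(buffer & ((uint64_t(1) << bits) - 1));
        buffer >>= bits;
        count -= bits;
        return v;
    }
};
A decoder can call ensure_available on a fixed schedule and use read_unchecked for the individual codewords, which keeps the refill branch predictable as described above.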
You'll probably have to wait until 2013 to get hold of real HW, but the "Haswell" new instructions will bring proper vectorised shifts (ie the ability to shift each vector element by different amounts specified in another vector) to x86/AVX. Not sure of details (plenty of time to figure them out), but that will surely enable a massive performance improvement in bitstream construction code.
I don't have the time to write it for you (I'm not too sure your sample is actually complete enough to do so), but if you must, I can think of:
using translation tables for the various input/output bit-shift offsets; this optimization would make sense for fixed units of n bits (with n sufficiently large (8 bits?) to expect performance gains)
In essence, you'd be able to do
destloc &= (lookuptable[bits_left_in_buffer][input_offset][codeword]);
disclaimer: this is very sloppy pseudo-code, I just hope it conveys my idea of a lookup table to prevent bit-shift arithmetic
writing it in assembly (I know i386 has XLAT, but then again, a good compiler might already use something like that)
; Also, XLAT seems limited to 8 bits and the AL register, so it's not really versatile
Update
Warning: be sure to use a profiler and test your optimization for correctness and speed. Using a lookup table can result in poorer performance in light of locality of reference. So, you might need to pin the bit-streaming thread to a single core (set thread affinity) to get the benefits, and you might have to adapt the lookup table size to the processor's L2 cache.
Also, have a look at SIMD, SSE4 or GPU (CUDA) instruction sets if you know you'll have certain features at your disposal.