Advice needed for an API for reading bits - C++

I found a wonderful project called python-bitstring, and I believe a C++ port could be very helpful in quite a few situations (certainly in some projects of mine).
While porting the read/write/patch byte methods, I didn't run into any problems at all; it was as easy as translating Python to C++.
Anyway, now I'm getting to the bits methods and I'm not really sure how to express that functionality.
For example, let's say I want to create a method like:
readBits(uint64_t n_bits_to_read, uint64_t n_bits_to_skip = 0) {...}
Let's suppose, for the sake of this example, that this->data is a chunk of memory (void *) holding the entire data from which I'm reading.
So, the method will receive a number of bits to read and an optional number of bits to skip.
this->readBits(5, 2);
That way I'll be reading bits from position 2 to position 6 inclusive (forget little/big endian for the sake of this example).
0 1 1 0 1 0 1 1
    ‾ ‾ ‾ ‾ ‾
I can't return anything smaller than a byte (or can I?), so even if I actually read 5 bits, I'll still be returning 8. But what if I read 14 bits and skip 1? Is there any other way I could return only those bits in some more useful way?
I'm thinking about a few common situations, for example:
Do the first 14 bits match "010101....."
Do the next 13 bits after skipping 2 match "00011010....."
Read the first 5 bits and convert them to an int/float
Read 7 bits after skipping 5 and convert them to an int/float
My question is: what type of data/structure/methods should I return/expose in order to make working with bits easier (or at least easier for the situations described above)?
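One possible direction, sketched below under the assumption that a single read never exceeds 64 bits: return a small value type that carries both the extracted bits and how many of them are meaningful, so callers can compare against a pattern or convert to an integer without caring about the unused high bits. The BitField name and its members are hypothetical, not something taken from python-bitstring.

#include <cstdint>
#include <string>

// Hypothetical return type for readBits(): up to 64 extracted bits,
// right-aligned, plus a count of how many of them are valid.
struct BitField {
    uint64_t value;  // the bits, right-aligned (LSB side)
    uint64_t count;  // number of valid low bits

    // Interpret the bits as an unsigned integer.
    uint64_t toUint() const { return value; }

    // Compare against an MSB-first pattern string such as "10101".
    bool matches(const std::string& pattern) const {
        if (pattern.size() != count) return false;
        for (uint64_t i = 0; i < count; ++i) {
            bool bit = (value >> (count - 1 - i)) & 1u;
            if ((pattern[i] == '1') != bit) return false;
        }
        return true;
    }
};

With something like this, this->readBits(5, 2).matches("10101") or .toUint() would cover the pattern-check and conversion cases listed above; reads longer than 64 bits would need a different representation, for example a vector of bytes plus a bit count.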

Related

How do I read and write a stream 3 bits at a time?

I'm trying to make an ultra-compressed variant of brainfuck, which is an esoteric programming language with 8 instructions. Since 3 bits is the minimum amount of storage to store 8 values, I went with that. The part I'm stuck on is how to read a number of bits that isn't a power of 2.
I tried using std::bitset, but that just serializes to a string that is 1 byte per bit, which is the opposite of what I want. How would I go about this?
Read 3 bytes at a time, and split those into 8 packs of 3 bits each using the >> and & operators. Put these in an ordinary uint8_t array to simplify later access and jumps.
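A rough sketch of that approach, assuming the three bytes are packed MSB-first (the function name is just for illustration):

#include <cstdint>

// Unpack 3 bytes (24 bits) into 8 values of 3 bits each, MSB-first.
void unpack3(const uint8_t in[3], uint8_t out[8]) {
    uint32_t chunk = (uint32_t(in[0]) << 16) | (uint32_t(in[1]) << 8) | in[2];
    for (int i = 0; i < 8; ++i)
        out[i] = (chunk >> (21 - 3 * i)) & 0x7;  // peel off 3 bits at a time
}

Packing goes the other way: shift each 3-bit instruction into a 24-bit accumulator and emit it as 3 bytes once 8 instructions have been collected.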
You don't read bits from a stream, you read bytes from a stream.
So, you must do so, then shuffle the component bits around as you wish using bitwise arithmetic.
By the way, the fact that computers work in bytes also means that many of your programs (any whose instruction count isn't a multiple of 8) will necessarily have some wasted space.

C++ Byte is implementation dependent

I have been reading C Primer Plus.
It is said that:
Note that the meaning of byte is implementation dependent. So a 2-byte int could be 16 bits on one system and 32 bits on another.
Here I am not sure about this. From my understanding, 1 byte always = 8 bits, so it makes sense that a 2-byte int = 2 * 8 = 16 bits. But from this statement it sounds like some systems define 1 byte = 16 bits. Is that correct?
In general, how should I understand this statement?
The C++ standard, section 1.7 point 1 confirms this:
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation defined. (...) The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.
Bytes are always composed of at least 8 bits. They can be larger than 8 bits, though this is fairly uncommon.
One byte is not always 8-bits. Before octets (the term you want to use if you want to explicitly refer to an 8-bit byte), there were 4-, 6-, and 7-bit bytes. For the purposes of [modern] programming (in pretty much any language), you can assume it is at least 8 bits.
Historically, a byte was not always 8 bits. Today it is, but a long time ago it could be 6, 7, 8, 9 ... so to have a language that could exploit the specifics of the hardware (for efficiency) while still letting the user express themselves in a somewhat higher-level language, they had to make sure the int type was mapped onto the most natural fit for the hardware.
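If you want to check this on your own platform, CHAR_BIT from <climits> tells you how many bits a byte has; a quick sanity check might look like this:

#include <climits>
#include <cstdio>

int main() {
    // CHAR_BIT is the number of bits per byte on this implementation;
    // the standard guarantees it is at least 8.
    std::printf("bits per byte: %d\n", CHAR_BIT);
    std::printf("bits per int:  %zu\n", sizeof(int) * CHAR_BIT);
}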

Saving binary data into a file in C++

My algorithm produces a stream of 9-bit and 17-bit values, and I need to find a way to store this data in a file, but I can't just store the 9-bit values as int and the 17-bit values as int32_t.
For example, if my algorithm produces 10 x 9-bit and 5 x 17-bit values, the output file size needs to be 22 bytes.
Another big problem to solve is that the output file can be very big and its size is unknown.
The only idea I have right now is to use bool *vector;
If you have to save dynamic bits, then you should probably save two values: The first being either the number of bits (if bits are consecutive from 0 to x), or a bitmask to say which bits are valid; The second being the 32-bit integer representing your bits.
Taking your example literally: if you want to store 175 bits and it consists of an unknown number of entities of two different lengths, then the file absolutely cannot be only 22 bytes. You need to know what is ahead of you in the file, so you need the lengths. Since you have only two possible sizes, the length marker can be a single bit: 0 means 9 bits, 1 means 17 bits.
|0|9bit|0|9bit|1|17bit|0|9bit|1|17bit|1|17bit|...
So for your example, you would need 10*(1+9) + 5*(1+17) = 190 bits ~ 24 bytes. The outstanding 2 bits need to be padded with 0's so that you align at a byte boundary. The fact that you will go on reading the file as if there were another entity (because you said you don't know how long the file is) shouldn't be a problem, because the final padding will always be less than 9 bits; upon reaching end of file, you can throw away the last incomplete reading.
This approach indeed requires implementing bit-level manipulation on top of a byte-level stream, which means careful masking and shifting. Base64 is exactly that, only simpler than your case: it consists only of fixed 6-bit entities, stored in a text file.
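A minimal sketch of such a bit-level writer, assuming MSB-first packing within each byte; the class and function names here are made up for illustration:

#include <cstdint>
#include <ostream>

// Accumulates bits MSB-first and flushes whole bytes to a stream.
class BitWriter {
public:
    explicit BitWriter(std::ostream& os) : os_(os) {}

    void writeBits(uint32_t value, int nbits) {
        for (int i = nbits - 1; i >= 0; --i) {
            buffer_ = (buffer_ << 1) | ((value >> i) & 1u);
            if (++filled_ == 8) {
                os_.put(static_cast<char>(buffer_));
                buffer_ = 0;
                filled_ = 0;
            }
        }
    }

    // Pad the final partial byte with 0's, as described above.
    void flush() {
        if (filled_ > 0) {
            os_.put(static_cast<char>(buffer_ << (8 - filled_)));
            buffer_ = 0;
            filled_ = 0;
        }
    }

private:
    std::ostream& os_;
    uint8_t buffer_ = 0;
    int filled_ = 0;
};

// Write one entity with its 1-bit length tag: 0 means 9 bits, 1 means 17 bits.
void writeEntity(BitWriter& w, uint32_t value, bool is17bit) {
    w.writeBits(is17bit ? 1u : 0u, 1);
    w.writeBits(value, is17bit ? 17 : 9);
}

The reader does the mirror image: pull one tag bit, then either 9 or 17 more, and discard the last incomplete reading at end of file, as described above.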

How to write non-aligned data to a binary stream in C++

In order to reduce the data size over the network, I would like to write only enough bits to the network to hold the value. For example, if 40 bits can hold the value, I want to write 40 bits to the stream and not 64 bits. Or if the value can be stored in 3 bits, I would simply like to write 3 bits to the binary stream and not 8 bits with 5 bits as 0.
My question is: how do I write non-aligned data to a binary stream in C++?
The stream works with bytes, not bits, so you'll have to work with multiples of 8 bits. You can write 40 bits to the stream because that's exactly 5 bytes.
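For the 40-bit case specifically, that can be as simple as writing the low five bytes of a 64-bit value; a sketch (the byte order here is big-endian by choice, and both ends must agree on it):

#include <cstdint>
#include <ostream>

// Write the low 40 bits of 'value' to the stream as exactly 5 bytes.
void write40(std::ostream& os, uint64_t value) {
    for (int shift = 32; shift >= 0; shift -= 8)
        os.put(static_cast<char>((value >> shift) & 0xFF));
}

Anything that isn't a multiple of 8 bits (such as the 3-bit case) has to be accumulated in a buffer and flushed a byte at a time, with the last byte padded.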
You are inventing your own compression scheme and will almost certainly do worse than the experts have done.
Your network may also already be doing compression, so you might be doing work that is already being done.
Your question is sorely lacking in detail, which makes a better answer impossible.

Compression for a unique stream of data

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.
Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?
Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.
If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.
Take an input stream like:
1101
1101
1110
1110
0110
and output:
1101
0000
0010
0000
1000
A bit of pseudocode, shown here as the equivalent C++ loop:
compressed[0] = uncompressed[0];
for (size_t i = 1; i < n; ++i)
    compressed[i] = uncompressed[i - 1] ^ uncompressed[i];
We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.
When you want to decompress:
uncompressed[0] = compressed[0];
for (size_t i = 1; i < n; ++i)
    uncompressed[i] = uncompressed[i - 1] ^ compressed[i];
This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.
Have you considered Run-length encoding?
Or try this: Instead of storing the numbers themselves, you store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'll encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high-bit of the 8-bit integer to say "this number requires the next 8 bits as well".
You can combine that with run-length encoding for even better compression ratios, depending on your data.
Neither of these options is particularly hard to implement, and both run very fast with very little memory (as opposed to, say, bzip).
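A sketch of that difference-plus-continuation-bit idea; the 7-bit grouping and the helper name below are one illustrative layout, not the only one:

#include <cstdint>
#include <vector>

// Append one non-negative difference as 7-bit groups, least-significant
// group first; the high bit of each byte means "more bytes follow".
void encodeDelta(std::vector<uint8_t>& out, uint32_t delta) {
    while (delta >= 0x80) {
        out.push_back(static_cast<uint8_t>(delta & 0x7F) | 0x80);
        delta >>= 7;
    }
    out.push_back(static_cast<uint8_t>(delta));
}

Negative differences would need an extra convention (a sign bit or a zigzag mapping), which the description above glosses over.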
You want to preprocess your data first -- reversibly transform it into some form that is better suited to your back-end data compression method. The details will depend on both the back-end compression method and (more critically) on the properties you expect from the data you're compressing.
In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.
Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...
Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.
Break each of your ints up into 4 bytes, so i0, i1, i2, ..., in becomes b0,0, b0,1, b0,2, b0,3, b1,0, b1,1, b1,2, b1,3, ..., bn,0, bn,1, bn,2, bn,3. Then write out all the bi,0s, followed by the bi,1s, bi,2s, and bi,3s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like Run-length Encoding or zlib. This is my favourite of the methods I present.
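A sketch of that byte-plane split, assuming 32-bit integers (the function name is illustrative, and byte 0 is taken as the least significant here):

#include <cstdint>
#include <vector>

// Reorder n 32-bit integers into 4 planes: all byte 0s, then all byte 1s,
// and so on, so slowly-changing bytes form long runs for RLE/zlib.
std::vector<uint8_t> splitPlanes(const std::vector<uint32_t>& ints) {
    std::vector<uint8_t> out;
    out.reserve(ints.size() * 4);
    for (int plane = 0; plane < 4; ++plane)       // byte position 0..3
        for (uint32_t v : ints)
            out.push_back(static_cast<uint8_t>(v >> (8 * plane)));
    return out;
}

The transform is trivially reversible as long as the element count is known.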
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.
If you have various bits differing, you may still have largish numeric differences, but if those large differences usually correspond to only one or two differing bits, you may be better off with a scheme where you create a byte array: use the first 4 bytes to encode the first integer, and then for each subsequent entry use 0 or more bytes to indicate which bits should be flipped, storing 0, 1, 2, ..., or 31 in each byte, with a sentinel (say 32) to indicate when you're done. This could bring the raw number of bytes needed to represent an integer down to something close to 2 on average, with most bytes coming from a limited set (0-32). Run that stream through zlib, and maybe you'll be pleasantly surprised.
Did you try bzip2 for this?
http://bzip.org/
It's always worked better than zlib for me.
Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.
A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences, perhaps using an 8-bit byte, something like this as a starting point:
1 bit to indicate whether a complete new integer follows, or whether this byte encodes a difference from the last integer;
1 bit to indicate that there are more bytes following, recording more single-bit differences for the same integer;
6 bits to record the number of the bit to flip relative to your previous integer.
If there are more than 4 bits different, then store the integer.
This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.
"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.
One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).
Output the first integer (32 bits).
Output the number of bit changes (n=0-3, 2 bits).
Output n bit specifiers (0-31, 5 bits each).
Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).
I say "tends towards" since the first integer is always 32 bits but, for any decent sized array, that first integer would be negligable.
Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).
Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.
Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be
single-bit changes (2+5 = 7 bits) : 80% of the transitions.
double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.
This means your average would have to be 1.2 bit changes per integer to make this worthwhile.
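To make the cost model above concrete, here is a sketch of the encoding step; it emits to a std::vector<bool> purely for illustration (a real implementation would pack into bytes), and it assumes, as the scheme does, at most 3 changed bits per step:

#include <cstdint>
#include <vector>

// Append 'nbits' bits of 'value' (MSB-first) to a bit vector.
static void appendBits(std::vector<bool>& bits, uint32_t value, int nbits) {
    for (int i = nbits - 1; i >= 0; --i)
        bits.push_back(((value >> i) & 1u) != 0);
}

// First integer verbatim (32 bits), then for each following integer a
// 2-bit count of changed bits (0-3) and a 5-bit position per changed bit.
std::vector<bool> encode(const std::vector<uint32_t>& in) {
    std::vector<bool> bits;
    if (in.empty()) return bits;
    appendBits(bits, in[0], 32);
    for (size_t i = 1; i < in.size(); ++i) {
        uint32_t diff = in[i - 1] ^ in[i];
        std::vector<int> positions;
        for (int b = 0; b < 32; ++b)
            if ((diff >> b) & 1u) positions.push_back(b);
        appendBits(bits, static_cast<uint32_t>(positions.size()), 2);
        for (int p : positions)
            appendBits(bits, static_cast<uint32_t>(p), 5);
    }
    return bits;
}

Decoding reads the fields back in the same order and XORs the flipped bits into the previous integer.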
One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).
I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.