Bit ordering / endianness in FLAC decoding - C++

I'm writing a FLAC-to-WAV transcoder as an exercise in C++, and I'm struggling a bit with the wording of the FLAC format regarding bit ordering.
Here is the (little) section talking about ordering:
All numbers used in a FLAC bitstream are integers; there are no floating-point representations. All numbers are big-endian coded. All numbers are unsigned unless otherwise specified.
Does this apply to bit-ordering, as well as byte-ordering?
More specifically, if I read, say, a 7-bit value, do I get the most significant bit first?

Bit ordering should never be an issue unless you're using a struct with bitfields (which is a good reason to avoid them).
Also, you can only read data one byte at a time. If you want to read 7 bits out of a byte, you need to apply a bitmask to the byte value.
For example, if a byte contains one value in the high order bit and another in the low order 7 bits, you would extract them as follows:
field1 = (byte & 0x80) >> 7; // isolate the high bit and shift it down to bit 0
field2 = byte & 0x7f;        // mask off the low-order 7 bits
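To make the multi-bit case concrete: FLAC packs values most significant bit first, so reading a 7-bit value does give you the MSB first. A minimal sketch of an MSB-first reader over a byte buffer (readBits, buf, and pos are illustrative names, not from any FLAC library):

#include <cstdint>
#include <cstddef>

// Read `count` bits starting at absolute bit offset `pos` in `buf`,
// most significant bit first; advances `pos`. Illustrative helper only.
uint32_t readBits(const uint8_t* buf, size_t& pos, unsigned count) {
    uint32_t value = 0;
    for (unsigned i = 0; i < count; ++i, ++pos) {
        unsigned bit = (buf[pos / 8] >> (7 - pos % 8)) & 1u;
        value = (value << 1) | bit;  // first bit read ends up most significant
    }
    return value;
}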

Related

Is this the correct way of writing bits to big endian?

Currently, it's for a Huffman compression algorithm that assigns binary codes to characters used in a text file: fewer bits for more frequent characters and more bits for less frequent ones.
Currently, I'm trying to save the binary code big-endian in a byte.
So let's say I'm using an unsigned char to hold it.
00000000
And I want to store some binary code that's 1101.
In advance, I want to apologize if this seems trivial or is a dupe but I've browsed dozens of other posts and can't seem to find what I need. If anyone could link or quickly explain, it'd be greatly appreciated.
Would this be the correct syntax?
I'll have some external method like
int length = 0;
unsigned char byte = (some default value);

void pushBit(unsigned int bit) {
    if (bit == 1) {
        byte |= 1;
    }
    byte <<= 1;
    length++;
    if (length == 8) {
        // Output the byte
        length = 0;
    }
}
I've seen some videos explaining endianness, and my understanding is that the most significant bit (the first one) is placed in the lowest memory address.
Some videos showed the byte from left to right, which makes me think I need to left-shift everything over, but whenever I set, toggle, or erase a bit, it's from the rightmost position, is it not? I'm sorry once again if this is trivial.
So after my method finishes pushing 1101, byte would be something like 00001101. Is this big endian? My knowledge of address locations is very weak and I'm not sure whether
-->00001101 or 00001101<--
is considered the most significant end.
Would I need to left shift the remaining amount?
So since I used 4 bits, I would left shift 4 bits to make 11010000. Is this big endian?
First off, as Killzone Kid noted, endianness and the bit ordering of a binary code are two entirely different things. Endianness refers to the order in which the bytes of a multi-byte integer are stored in memory. For little endian, the least significant byte is stored first. For big endian, the most significant byte is stored first. The bits in the bytes don't change order. Endianness has nothing to do with what you're asking.
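To see the byte order concretely, here is a small self-contained sketch that dumps the bytes of a 32-bit value (generic C++, nothing Huffman-specific):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    uint32_t v = 0x11223344;
    unsigned char bytes[sizeof v];
    std::memcpy(bytes, &v, sizeof v);  // copy out the in-memory representation
    // A little-endian machine prints "44 33 22 11";
    // a big-endian machine prints "11 22 33 44".
    for (unsigned char b : bytes)
        std::printf("%02X ", b);
    std::printf("\n");
}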
As for accumulating bits until you have a byte's worth to write, you have the basic idea, but your code is incorrect. You need to shift first, and then OR in the bit. The way you're doing it, you lose the first bit you put in off the top, and the low bit of what you write is always zero. Just put the byte <<= 1; before the if.
You also need to deal with ending the stream somehow, writing out the last bits if there are fewer than eight left. So you'll need a flushBits() to write out your bit buffer if it has any bits in it. Your bit stream would need to be self-terminating, or you need to first send the number of bits, so that you don't misinterpret the filler bits in the last byte as a code or codes.
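A corrected sketch along those lines (shift first, then OR; flushBits pads the last partial byte with zero bits; outputByte is a hypothetical stand-in for however you emit a finished byte):

int length = 0;
unsigned char byte = 0;

void pushBit(unsigned int bit) {
    byte <<= 1;              // make room first
    if (bit == 1) {
        byte |= 1;           // then OR in the new bit
    }
    if (++length == 8) {
        outputByte(byte);    // hypothetical output routine
        byte = 0;
        length = 0;
    }
}

void flushBits() {
    if (length > 0) {
        byte <<= (8 - length);  // pad the remaining low bits with zeros
        outputByte(byte);
        byte = 0;
        length = 0;
    }
}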
There are two main types of endianness, big-endian and little-endian (technically there are more, like middle-endian, but big and little are by far the most common). If you want the big-endian format (as it seems you do), then the most significant byte comes first; with little-endian, the least significant byte comes first.
Wikipedia has some good examples
It looks like what you are trying to do is store the bits within the byte in reverse order, which is not what you want. A byte is endian-agnostic and does not need to be flipped. Multi-byte types such as uint32_t may need their byte order changed, depending on the endianness you want to achieve.
Maybe what you are referring to is bit numbering, in which case the code you have should largely work (although you should compare length to 7, not 8). The order you place the bits in pushBit would end up with the first bit you pass being the most significant bit.
Bits aren't addressable by definition (if we're talking about standard C++, not C51 or its C++ successor), so from the point of view of the high-level language, or even of assembler pseudo-code, whatever the physical LSB-to-MSB direction is, a bitwise << shifts from LSB toward MSB. Bit order is referred to as bit numbering and is a separate feature from endianness, related to the hardware implementation.
Bit fields in C++ may change order because in the most common use cases bits do have the opposite order, e.g. in network communication; but in fact the way bit fields are packed into a byte is implementation-dependent, with no guarantee that there are no gaps or that the order is preserved.
The minimal addressable unit of memory in C++ is a char, and that's where your concern with endianness ends. In the rare case where you actually have to change bit order (when? working with some incompatible hardware?), you have to do so explicitly.
Note that when working with Ethernet or another network protocol you should not do so; any order change is done by the hardware (the first bit sent over the wire is the least significant one on the platform).
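For that rare explicit case, a small sketch of reversing the bit order within a byte (a generic technique, not tied to any particular hardware):

#include <cstdint>

// Reverse the bit order of a single byte: bit 7 swaps with bit 0, etc.
uint8_t reverseBits(uint8_t b) {
    uint8_t result = 0;
    for (int i = 0; i < 8; ++i) {
        result = (result << 1) | (b & 1);  // take b's current LSB, append it
        b >>= 1;
    }
    return result;
}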

Read any number of bits from ifstream

I'm currently working with SWF files.
The SWF header contains a RECT, which is built from five fields. The first is a 5-bit field (nBits) used to specify the length of the other fields.
What should a method look like that takes one argument (how many bits to read) and reads that many bits from an ifstream?
SWF File format specification
Thanks, S.
C++ file streams are byte-oriented. You can't read specific numbers of bits from them (unless the number is a multiple of 8 of course).
To get just 5 bits, you'll have to read an entire byte and then mask off the 5 bits of interest. If that byte also holds another field you'll have to keep it around for use later. If you make this generic enough, you could write your own "bit stream" class that buffers unused portions of bytes internally (see the sketch after the snippets below).
To obtain the low-order (least significant) 5 bits of a byte:
unsigned char bits = byte & 0x1F; // note 0x1F = binary 00011111
To obtain the high-order (most significant) 5 bits:
unsigned char bits = byte >> 3; // shift off the unused 3 low bits
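A minimal sketch of such a bit-stream class, reading MSB-first (as SWF packs its bit values); the class name and interface are illustrative, not from any SWF library, and error handling is omitted:

#include <cstdint>
#include <fstream>

// Reads up to 32 bits at a time from an ifstream, MSB-first,
// buffering the unused remainder of the current byte.
class BitReader {
public:
    explicit BitReader(std::ifstream& in) : in_(in), buffer_(0), bitsLeft_(0) {}

    uint32_t readBits(unsigned count) {
        uint32_t value = 0;
        while (count--) {
            if (bitsLeft_ == 0) {                          // refill from the stream
                buffer_ = static_cast<uint8_t>(in_.get());
                bitsLeft_ = 8;
            }
            --bitsLeft_;
            value = (value << 1) | ((buffer_ >> bitsLeft_) & 1u);
        }
        return value;
    }

private:
    std::ifstream& in_;
    uint8_t buffer_;
    unsigned bitsLeft_;
};

For the RECT in the question, you would call readBits(5) once to get nBits, then call readBits(nBits) four times for the Xmin/Xmax/Ymin/Ymax fields (per the SWF spec those are signed, so you would still need to sign-extend the results).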

How to write non-aligned data to a binary stream in C++

In order to reduce the data size over the network, I would like to write only as many bits to the network as are needed to hold the value. For example, if 40 bits can hold the value, I want to write 40 bits to the stream and not 64. Or if the value can be stored in 3 bits, I would like to write just 3 bits to the binary stream, not 8 bits with the top 5 as 0.
My question is: how do I write non-aligned data to a binary stream in C++?
The stream works with bytes, not bits, so you'll have to work with multiples of 8 bits. You can write 40 bits to the stream because that's exactly 5 bytes.
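If several values need sub-byte granularity, the usual workaround is to buffer bits yourself and still only ever write whole bytes to the stream. A minimal sketch (class and method names are illustrative):

#include <cstdint>
#include <ostream>

// Accumulates bits MSB-first and writes each completed byte to the stream.
class BitWriter {
public:
    explicit BitWriter(std::ostream& out) : out_(out), buffer_(0), count_(0) {}

    // Append the low `width` bits of `value` to the stream.
    void writeBits(uint64_t value, unsigned width) {
        for (unsigned i = width; i-- > 0; ) {
            buffer_ = static_cast<uint8_t>((buffer_ << 1) | ((value >> i) & 1u));
            if (++count_ == 8) {
                out_.put(static_cast<char>(buffer_));
                buffer_ = 0;
                count_ = 0;
            }
        }
    }

    // Pad the final partial byte with zero bits and write it out.
    void flush() {
        if (count_ > 0) {
            buffer_ <<= (8 - count_);
            out_.put(static_cast<char>(buffer_));
            buffer_ = 0;
            count_ = 0;
        }
    }

private:
    std::ostream& out_;
    uint8_t buffer_;
    unsigned count_;
};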
You are inventing your own compression scheme and will almost certainly do worse than the experts have done.
Your network may also already be doing compression, so you might be duplicating work that is already being done.
Your question is sorely lacking in detail, which makes a better answer impossible.

Is there a name for this compression algorithm?

Say you have a four byte integer and you want to compress it to fewer bytes. You are able to compress it because smaller values are more probable than larger values (i.e., the probability of a value decreases with its magnitude). You apply the following scheme, to produce a 1, 2, 3 or 4 byte result:
Note that in the description below, bits are numbered one-based from most significant to least significant: the first bit is the most significant bit, the second bit is the next most significant, and so on.
If n < 128, you encode it as a single byte with the first bit set to zero.
If n >= 128 and n < 16,384, you use a two-byte integer. You set the first bit to one (to indicate a longer encoding) and the second bit to zero, then use the remaining 14 bits to encode the number n.
If n >= 16,384 and n < 2,097,152, you use a three-byte integer. You set the first bit to one, the second bit to one, and the third bit to zero, and use the remaining 21 bits to encode n.
If n >= 2,097,152 and n < 268,435,456, you use a four-byte integer. You set the first three bits to one and the fourth bit to zero, and use the remaining 28 bits to encode n.
If n >= 268,435,456 and n < 4,294,967,296, you use a five-byte integer. You set the first four bits to one and use the following 32 bits to hold the exact value of n, as a four-byte integer; the remaining four bits are unused.
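To make the scheme concrete, a minimal sketch of an encoder (the function name is illustrative; the five-byte case pads the first byte's unused low bits with zeros, which is one reasonable reading of the description):

#include <cstdint>
#include <vector>

// Encode n with the scheme described above; appends 1-5 bytes to `out`.
void encode(uint32_t n, std::vector<uint8_t>& out) {
    if (n < 128) {                       // 0xxxxxxx
        out.push_back(static_cast<uint8_t>(n));
    } else if (n < 16384) {              // 10xxxxxx + 1 more byte
        out.push_back(static_cast<uint8_t>(0x80 | (n >> 8)));
        out.push_back(static_cast<uint8_t>(n));
    } else if (n < 2097152) {            // 110xxxxx + 2 more bytes
        out.push_back(static_cast<uint8_t>(0xC0 | (n >> 16)));
        out.push_back(static_cast<uint8_t>(n >> 8));
        out.push_back(static_cast<uint8_t>(n));
    } else if (n < 268435456) {          // 1110xxxx + 3 more bytes
        out.push_back(static_cast<uint8_t>(0xE0 | (n >> 24)));
        out.push_back(static_cast<uint8_t>(n >> 16));
        out.push_back(static_cast<uint8_t>(n >> 8));
        out.push_back(static_cast<uint8_t>(n));
    } else {                             // 11110000 + the full 4-byte value
        out.push_back(0xF0);             // unused low bits set to zero
        out.push_back(static_cast<uint8_t>(n >> 24));
        out.push_back(static_cast<uint8_t>(n >> 16));
        out.push_back(static_cast<uint8_t>(n >> 8));
        out.push_back(static_cast<uint8_t>(n));
    }
}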
Is there a name for this algorithm?
This is quite close to variable-length quantity encoding or base-128. The latter name stems from the fact that each 7-bit unit in your encoding can be considered a base-128 digit.
It sounds very similar to Dlugosz' Variable-Length Integer Encoding.
Huffman coding refers to using fewer bits to store more common data in exchange for using more bits to store less common data.
Your scheme is similar to UTF-8, which is an encoding scheme used for Unicode text data.
The chief difference is that every byte in a UTF-8 stream indicates whether it is a lead or a trailing byte, so a sequence can be read starting from the middle. With your scheme, a missing lead byte makes the rest of the file completely unreadable if a series of such values is stored, and reading such a sequence must start at the beginning rather than at an arbitrary location.
Varint
Using the high bit of each byte to indicate "continue" or "stop", with the remaining bits (7 bits of each byte in the sequence) interpreted as plain binary encoding the actual value:
This sounds like the "Base 128 Varint" as used in Google Protocol Buffers.
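For comparison, a minimal sketch of that continuation-bit varint (least significant 7-bit group first, as protobuf does it; the function name is illustrative):

#include <cstdint>
#include <vector>

// Base-128 varint: emit 7 bits per byte, low groups first;
// the high bit of each byte means "more bytes follow".
void encodeVarint(uint32_t n, std::vector<uint8_t>& out) {
    while (n >= 0x80) {
        out.push_back(static_cast<uint8_t>(n) | 0x80);  // continue
        n >>= 7;
    }
    out.push_back(static_cast<uint8_t>(n));             // stop
}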
Related ways of compressing integers:
In summary, this code represents an integer in two parts: a first part in unary code that indicates how many bits will be needed to read in the rest of the value, and a second part (of the indicated width in bits) in more-or-less plain binary that encodes the actual value.
This particular code interleaves ("threads") the unary code with the binary code, but other, similar codes pack the complete unary code first and then the binary code afterwards, such as Elias gamma coding.
I suspect this code is one of the family of "start/stop codes" as described in:
Steven Pigeon, "Start/Stop Codes", Proceedings of the Data Compression Conference 2001, IEEE Computer Society Press, 2001.

Why is floating point byte swapping different from integer byte swapping?

I have a binary file of doubles that I need to load using C++. However, my problem is that it was written in big-endian format but the fstream >> operator will then read the number wrong because my machine is little-endian. It seems like a simple problem to resolve for integers, but for doubles and floats the solutions I have found won't work. How can I (or should I) fix this?
I read this as a reference for integer byte swapping:
How do I convert between big-endian and little-endian values in C++?
EDIT: Though these answers are enlightening, I have found that my problem is with the file itself and not the format of the binary data. I believe my byte swapping does work, I was just getting confusing results. Thanks for your help!
The most portable way is to serialize in a textual format so that you don't have byte order issues. This is how operator>> works, so you shouldn't be having any endian issues with >>. The principal problem with binary formats (which would explain endian problems) is that floating-point numbers consist of a number of mantissa bits, a number of exponent bits and a sign bit. The exponent may use an offset. This means that a straight byte re-ordering may not be sufficient, depending on the source and target formats.
If you are using IEEE-754 on both machines then you may be OK with a straight byte reversal, as this standard specifies a bit-string interchange format that should be portable (byte order issues aside).
If you have to convert between two machine architectures and you have to use a raw byte memory dump, then so long as the basic number format is the same (i.e. they have the same bit counts in each part of the number), you can read the data into an array of unsigned char, use some basic byte and bit swapping routines to correct the storage format and then copy the raw bytes into a variable of the appropriate type.
The standard conversion operators do not work with binary data, so it's not clear how you got where you are.
However, since byte swapping operates on bytes, not numbers, you perform it on data destined to be floats just as you do on data which will be integers.
And since text is so inefficient and floating-point data sets tend to be so large, it's entirely reasonable to want this.
int32_t raw_bytes;
stream.read( reinterpret_cast< char * >( & raw_bytes ), sizeof raw_bytes ); // not an int, just 32 bits of bytes
my_byte_swap( raw_bytes ); // swap 'em
float f;
std::memcpy( & f, & raw_bytes, sizeof f ); // reinterpret the swapped bits as a float (needs <cstring>; avoids aliasing UB)
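Putting this together for the doubles in the question, a minimal sketch of reading one big-endian double on a little-endian machine (assuming IEEE-754 on both ends; the function name is illustrative):

#include <algorithm>
#include <cstring>
#include <fstream>

// Read one big-endian IEEE-754 double from `in` and return it
// in the native byte order of a little-endian machine.
double readBigEndianDouble(std::ifstream& in) {
    unsigned char bytes[sizeof(double)];
    in.read(reinterpret_cast<char*>(bytes), sizeof bytes);
    std::reverse(bytes, bytes + sizeof bytes);  // big-endian -> little-endian
    double d;
    std::memcpy(&d, bytes, sizeof d);           // type-pun safely
    return d;
}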