gzip compression on non-byte-aligned data

Is bit packing detrimental to the performance of gzip?
Assume that I have 7-bit values and pack them in the following way:
Byte1 Byte2 Byte3 Byte4
[aaaaaaab][bbbbbbcc][cccccddd][dddd...
As far as I understand, LZ compression works on a byte basis.
Any repeating pattern in the 7-bit values will be obscured.
Is it advisable to stuff an extra bit per value for byte alignment to help LZ?
Byte1 Byte2 Byte3 Byte4
[aaaaaaa0][bbbbbbb0][ccccccc0][ddddddd0][...
Are there any results on that in literature?

Likely, yes. If your a, b, c, d values have repeating patterns or a statistical bias in their frequencies, then it should be better to stuff the zero bits.
The way to know is to simply test it.
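A minimal sketch of such a test, assuming zlib is available (its compress() uses DEFLATE, the same algorithm gzip uses); the sample data and value distribution here are purely made up for illustration:
#include <zlib.h>
#include <cstddef>
#include <cstdio>
#include <vector>

static unsigned long deflatedSize(const std::vector<unsigned char>& in) {
    uLongf outLen = compressBound(in.size());
    std::vector<unsigned char> out(outLen);
    compress(out.data(), &outLen, in.data(), in.size());
    return outLen;
}

int main() {
    // Hypothetical input: 7-bit values with a strongly repeating pattern.
    std::vector<unsigned char> values(100000);
    for (std::size_t i = 0; i < values.size(); ++i)
        values[i] = static_cast<unsigned char>(i % 16);   // fits in 7 bits

    // Layout 1: tightly packed, 7 bits per value, no byte alignment.
    std::vector<unsigned char> packed;
    unsigned acc = 0;
    int nbits = 0;
    for (unsigned char v : values) {
        acc = (acc << 7) | v;
        nbits += 7;
        while (nbits >= 8) {
            packed.push_back(static_cast<unsigned char>(acc >> (nbits - 8)));
            nbits -= 8;
        }
    }
    if (nbits > 0)
        packed.push_back(static_cast<unsigned char>(acc << (8 - nbits)));

    // Layout 2: one value per byte, [aaaaaaa0], padded with a zero bit.
    std::vector<unsigned char> padded;
    for (unsigned char v : values)
        padded.push_back(static_cast<unsigned char>(v << 1));

    std::printf("packed: %zu -> %lu bytes compressed\n", packed.size(), deflatedSize(packed));
    std::printf("padded: %zu -> %lu bytes compressed\n", padded.size(), deflatedSize(padded));
}
With strongly biased input like this you would typically see the padded layout compress noticeably better, but the only trustworthy numbers are the ones from your own data.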

Bit ordering / Endianness flac decoding

I'm currently trying to write a FLAC to WAV transcoder as an exercise in C++, and I'm struggling a bit with the wording of the FLAC format regarding bit ordering.
Here is the (little) section talking about ordering:
All numbers used in a FLAC bitstream are integers; there are no floating-point representations. All numbers are big-endian coded. All numbers are unsigned unless otherwise specified.
Does this apply to bit-ordering, as well as byte-ordering?
More specifically, if I read, say, a 7-bit value, do I get the most significant bit first?
Bit ordering should never be an issue unless you're using a struct with bitfields (which is a good reason to avoid them).
Also, you can only read data one byte at a time. If you want to read 7 bits out of a byte, you need to apply a bitmask to the byte value.
For example, if a byte contains one value in the high order bit and another in the low order 7 bits, you would extract them as follows:
field1 = (byte & 0x80) >> 7;
field2 = byte & 0x7f;
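If you need arbitrary widths, as FLAC does, the usual approach is a small bit reader that consumes the buffer one whole byte at a time and hands out the most significant remaining bits first. A minimal sketch, assuming MSB-first ordering within each byte (the BitReader name and interface are just for illustration, not part of libFLAC):
#include <cstdint>
#include <cstddef>

// Reads values MSB-first from a byte buffer, one whole byte at a time.
struct BitReader {
    const uint8_t* data;
    size_t pos = 0;        // index of the next byte to fetch
    int bitsLeft = 0;      // bits still unread in 'cur'
    uint8_t cur = 0;

    explicit BitReader(const uint8_t* d) : data(d) {}

    // Read the next n bits (1 <= n <= 32) as an unsigned value, MSB first.
    uint32_t read(int n) {
        uint32_t value = 0;
        while (n > 0) {
            if (bitsLeft == 0) { cur = data[pos++]; bitsLeft = 8; }
            int take = n < bitsLeft ? n : bitsLeft;
            // Peel off the 'take' most significant of the remaining bits.
            value = (value << take) |
                    ((cur >> (bitsLeft - take)) & ((1u << take) - 1u));
            bitsLeft -= take;
            n -= take;
        }
        return value;
    }
};
Reading a 7-bit field is then read(7), and the first bit you get back is the most significant bit of that field.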

Is this the correct way of writing bits to big endian?

Currently, it's for a Huffman compression algorithm that assigns binary codes to characters used in a text file: fewer bits for more frequent characters and more bits for less frequent ones.
I'm trying to save the binary code big-endian in a byte.
So let's say I'm using an unsigned char to hold it.
00000000
And I want to store some binary code that's 1101.
In advance, I want to apologize if this seems trivial or is a dupe but I've browsed dozens of other posts and can't seem to find what I need. If anyone could link or quickly explain, it'd be greatly appreciated.
Would this be the correct syntax?
I'll have some external method like
int length = 0;
unsigned char byte = (some default value);
void pushBit(unsigned int bit){
    if (bit == 1){
        byte |= 1;
    }
    byte <<= 1;
    length++;
    if (length == 8) {
        //Output the byte
        length = 0;
    }
}
I've seen some videos explaining endianness, and my understanding is that the most significant bit (the first one) is placed in the lowest memory address.
Some videos showed the byte from left to right, which makes me think I need to left-shift everything over, but whenever I set, toggle, or erase a bit, it's from the rightmost position, is it not? I'm sorry once again if this is trivial.
So after my method finishes pushing 1101, byte would be something like 00001101. Is this big-endian? My knowledge of address locations is very weak and I'm not sure which
-->00001101 or 00001101<--
location is considered the most significant.
Would I need to left shift the remaining amount?
So since I used 4 bits, I would left shift 4 bits to make 11010000. Is this big endian?
First off, as the Killzone Kid noted, endianness and the bit ordering of a binary code are two entirely different things. Endianness refers to the order in which a multi-byte integer is stored in the bytes of memory. For little endian, the least significant byte is stored first. For big endian, the most significant byte is stored first. The bits in the bytes don't change order. Endianness has nothing to do with what you're asking.
As for accumulating bits until you have a byte's worth to write, you have the basic idea, but your code is incorrect. You need to shift first, and then OR in the bit. The way you're doing it, you are losing the first bit you put in off the top, and the low bit of what you write is always zero. Just put the byte <<= 1; before the if.
You also need to deal with ending the stream somehow, writing out the last bits if there are fewer than eight left. So you'll need a flushBits() to write out your bit buffer if it still has bits in it. Your bit stream would need to be self-terminating, or you need to first send the number of bits, so that you don't misinterpret the filler bits in the last byte as a code or codes.
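Putting that advice together, a corrected sketch of the pushBit routine (shift first, then OR, and flush whatever is left at the end, padding the last byte with zeros on the right) might look like this; outputByte() is a stand-in for whatever output routine you actually use:
#include <cstdio>

static unsigned char byte = 0;
static int length = 0;

void outputByte(unsigned char b) {   // placeholder for the real output
    std::fputc(b, stdout);
}

void pushBit(unsigned int bit) {
    byte <<= 1;                      // make room first...
    if (bit == 1)
        byte |= 1;                   // ...then OR in the new bit
    if (++length == 8) {
        outputByte(byte);
        byte = 0;
        length = 0;
    }
}

void flushBits() {
    if (length > 0) {
        byte <<= (8 - length);       // left-align the remaining bits
        outputByte(byte);            // the filler zeros end up in the low bits
        byte = 0;
        length = 0;
    }
}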
There are two types of endianness, Big-endian and Little-endian (technically there are more, like middle-endian, but big and little are the most common). If you want to have the big-endian format, (as it seems like you do), then the most significant byte comes first, with little-endian the least significant byte comes first.
Wikipedia has some good examples
It looks like what you are trying to do is store the bits within the byte in reverse order, which is not what you want. A byte is endian-agnostic and does not need to be flipped. Multi-byte types such as uint32_t may need their byte order changed, depending on what endianness you want to achieve.
Maybe what you are referring to is bit numbering, in which case the code you have should largely work (although you should compare length to 7, not 8). The order you place the bits in pushBit would end up with the first bit you pass being the most significant bit.
Bits aren't addressable by definition (if we're talking about C++, not C51 or its C++ successor), so from the point of view of a high-level language, or even of assembler pseudo-code, it doesn't matter which direction LSB -> MSB runs in: a bit-wise << always shifts from LSB towards MSB. Bit order is referred to as bit numbering and is a separate feature from endianness, related to the hardware implementation.
Bit fields in C++ may change order because in the most common use cases (e.g. network communication) bits often do have the opposite order, but in fact the way bit fields are packed into a byte is implementation-dependent; there is no guarantee that there are no gaps or that the order is preserved.
The minimal addressable unit of memory in C++ is the size of a char, and that's where your concern with endianness ends. In the rare case that you actually have to change the bit order (when? working with some incompatible hardware?), you have to do so explicitly.
Note that when working with Ethernet or another network protocol you should not do so; the order change is done by the hardware (the first bit sent over the wire is the least significant one on the platform).

How to write EXACTLY n bits to UDP packet c++

As a way to learn my university content, I decided to start writing a networking library in c++11 using UDP and the new threading memory model. Here is my biggest roadblock at the moment.
The size of a byte is platform-specific, and memcpy copies and writes in units of that platform's bytes. So how do I go about making platform-agnostic code that writes exactly N bits to my UDP packets? N will typically be a multiple of 8, but I need to be sure that when I deal with N bytes, the same number of bits is modified, regardless of platform.
The closest I think I have come to a solution is to make a base struct of 32 bits, and access it as groups of 8. E.g.
struct Data
{
    char a : 8;
    char b : 8;
    char c : 8;
    char d : 8;
};
This way, I know that each char will be limited to 8 bits. My messages would all be a multiple of 32 - that is no problem (this struct would actually help when dealing with endianness). But how can I be sure that the compiler won't pad the structure to the platform's native boundary? Will this work?
Your data structure is not adequate. The char type may be signed (7 value bits, 1 sign bit) or unsigned (8 value bits), depending on your compiler.
Most computers sending out packets use unsigned char or uint8_t for the octets.
Also, beware of multibyte ordering, also known as endianness. Big endian is where the most significant byte comes first; little endian is where the least significant byte comes first.
Many messaging schemes will pad remaining bits with zeros to make them come out to 8, 16 or 32 bit quantities.
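For instance, assuming the platform provides uint8_t (which is what the answer above suggests using for octets), one portable way to put a 32-bit value on the wire in big-endian (network) byte order, independent of the host's endianness, is to mask and shift it into octets yourself; a minimal sketch with illustrative function names:
#include <cstdint>

// Write 'value' into 4 octets, most significant octet first (big-endian).
void putUint32BE(uint8_t* out, uint32_t value) {
    out[0] = static_cast<uint8_t>((value >> 24) & 0xFF);
    out[1] = static_cast<uint8_t>((value >> 16) & 0xFF);
    out[2] = static_cast<uint8_t>((value >> 8) & 0xFF);
    out[3] = static_cast<uint8_t>(value & 0xFF);
}

// Read it back the same way; shifts work on values, not representations,
// so this is correct on both big-endian and little-endian hosts.
uint32_t getUint32BE(const uint8_t* in) {
    return (uint32_t(in[0]) << 24) | (uint32_t(in[1]) << 16) |
           (uint32_t(in[2]) << 8) | uint32_t(in[3]);
}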
You're probably looking for the #pragma pack(1) directive (or #pragma pack(push, 1)). As someone pointed out though, you're still going to be dealing with multiples of 8 bits.
char is always exactly 1 byte by definition (sizeof(char) == 1), although that byte is not guaranteed to be 8 bits wide.

Bit order in C/C++

I have to implement a protocol which defines data in 8-bit words that start with the least significant bit (LSB) first. I want to represent this data with unsigned char, but I don't know what the bit order of the LSB and the most significant bit (MSB) is in C/C++, which could possibly require swapping the bits.
Can anybody explain to me how an unsigned char is encoded: MSB-LSB or LSB-MSB?
Example:
unsigned char b = 1;
MSB-LSB: 0000 0001
LSB-MSB: 1000 0000
Endianness is platform-dependent. Anyway, you don't have to worry about the actual bit order unless you are serializing the bytes, which you may be trying to do. In that case, you still don't need to worry about how individual bytes are stored while they're on the machine, since you will have to dig the bits out individually anyway. Fortunately, if you bitwise AND with 1, you get the LSB, regardless of storage order; bitwise AND with 2 and you get the next most significant bit, and so on. The compiler will sort out what constants to generate in the machine code, so that level of detail is abstracted away.
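As an illustration of that point, here is a sketch that serializes a byte LSB-first (as the protocol in the question requires) without ever caring how the machine stores it; putBit() is a made-up sink that just prints each bit:
#include <cstdint>
#include <cstdio>

// Hypothetical sink: here it just prints each bit as it is "sent".
void putBit(unsigned bit) { std::printf("%u", bit); }

// Emit the bits of 'value' least significant bit first, as the protocol
// specifies, independent of how the byte is stored internally.
void sendLsbFirst(uint8_t value) {
    for (int i = 0; i < 8; ++i) {
        putBit(value & 1u);   // AND with 1 always yields the LSB
        value >>= 1;          // the next bit becomes the new LSB
    }
}

int main() {
    sendLsbFirst(1);          // prints 10000000, i.e. the LSB-MSB view
    std::printf("\n");
}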
There is no such thing in C/C++. The least significant bit is -- well -- the least significant bit. Since the bits don't have addresses, there is no other ordering.

Why are all datatypes a power of 2?

Why are all data type sizes always a power of 2?
Let's take two examples:
short int 16
char 8
Why are they not like the following?
short int 12
That's an implementation detail, and it isn't always the case. Some exotic architectures have non-power-of-two data types. For example, 36-bit words were common at one stage.
The reason powers of two are almost universal these days is that it typically simplifies internal hardware implementations. As a hypothetical example (I don't do hardware, so I have to confess that this is mostly guesswork), the portion of an opcode that indicates how large one of its arguments is might be stored as the power-of-two index of the number of bytes in the argument, thus two bits is sufficient to express which of 8, 16, 32 or 64 bits the argument is, and the circuitry required to convert that into the appropriate latching signals would be quite simple.
The reason why builtin types are those sizes is simply that this is what CPUs support natively, i.e. it is the fastest and easiest. No other reason.
As for structs, you can have variables in there which have (almost) any number of bits, but you will usually want to stay with integral types unless there is a really urgent reason for doing otherwise.
You will also usually want to group identical-size types together and start a struct with the largest types (usually pointers). That will avoid needless padding and it will make sure you don't have access penalties that some CPUs exhibit with misaligned fields (some CPUs may even trigger an exception on unaligned access, but in this case the compiler would add padding to avoid it, anyway).
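For example, reordering members from largest to smallest removes the padding; the member names are made up, and the sizes shown assume a typical 64-bit ABI, so treat this as a sketch rather than a guarantee:
#include <cstdint>
#include <cstdio>

struct Unordered {       // likely 24 bytes on a typical 64-bit ABI
    uint8_t flag;        // 1 byte + 7 bytes padding before 'counter'
    uint64_t counter;    // 8 bytes, must be 8-byte aligned
    uint16_t id;         // 2 bytes + 6 bytes tail padding
};

struct Ordered {         // likely 16 bytes on the same ABI
    uint64_t counter;    // largest member first
    uint16_t id;
    uint8_t flag;        // only tail padding remains
};

int main() {
    std::printf("Unordered: %zu bytes, Ordered: %zu bytes\n",
                sizeof(Unordered), sizeof(Ordered));
}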
The size of char, short, int, long etc differ depending on the platform. 32 bit architectures tend to have char=8, short=16, int=32, long=32. 64 bit architectures tend to have char=8, short=16, int=32, long=64.
Many DSPs don't have power of 2 types. For example, Motorola DSP56k (a bit dated now) has 24 bit words. A compiler for this architecture (from Tasking) has char=8, short=16, int=24, long=48. To make matters confusing, they made the alignment of char=24, short=24, int=24, long=48. This is because it doesn't have byte addressing: the minimum accessible unit is 24 bits. This has the exciting (annoying) property of involving lots of divide/modulo 3 when you really do have to access an 8 bit byte in an array of packed data.
You'll only find non-power-of-2 in special purpose cores, where the size is tailored to fit a special usage pattern, at an advantage to performance and/or power. In the case of 56k, this was because there was a multiply-add unit which could load two 24 bit quantities and add them to a 48 bit result in a single cycle on 3 buses simultaneously. The entire platform was designed around it.
The fundamental reason most general purpose architectures use powers of 2 is because they standardized on the octet (8-bit byte) as the minimum size type (aside from flags). There's no reason it couldn't have been 9 bits, and as pointed out elsewhere 24 and 36 bits were common. This would permeate the rest of the design: if x86 had 9-bit bytes, we'd have 36-octet cache lines, 4608-octet pages, and 569KB would be enough for everyone :) We probably wouldn't have 'nibbles' though, as you can't divide a 9-bit byte in half.
This is pretty much impossible to do now, though. It's all very well having a system designed like this from the start, but inter-operating with data generated by 8 bit byte systems would be a nightmare. It's already hard enough to parse 8 bit data in a 24 bit DSP.
Well, they are powers of 2 because they are multiples of 8, and this comes (simplifying a little) from the fact that usually the atomic allocation unit in memory is a byte, which (often, but not always) is made up of 8 bits.
Bigger data sizes are made taking multiple bytes at a time.
So you could have 8,16,24,32... data sizes.
Then, for the sake of memory access speed, only powers of 2 are used as a multiplier of the minimum size (8), so you get data sizes along these lines:
8 => 8 * 2^0 bits => char
16 => 8 * 2^1 bits => short int
32 => 8 * 2^2 bits => int
64 => 8 * 2^3 bits => long long int
8 bits is the most common size for a byte (but not the only size, examples of 9 bit bytes and other byte sizes are not hard to find). Larger data types are almost always multiples of the byte size, hence they will typically be 16, 32, 64, 128 bits on systems with 8 bit bytes, but not always powers of 2, e.g. 24 bits is common for DSPs, and there are 80 bit and 96 bit floating point types.
The sizes of standard integral types are defined as multiple of 8 bits, because a byte is 8-bits (with a few extremely rare exceptions) and the data bus of the CPU is normally a multiple of 8-bits wide.
If you really need 12-bit integers then you could use bit fields in structures (or unions) like this:
struct mystruct
{
    short int twelveBitInt : 12;
    short int threeBitInt : 3;
    short int bitFlag : 1;
};
This can be handy in embedded/low-level environments - but bear in mind that the overall size of the structure will still be padded out to a whole multiple of the underlying type's size.
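A quick usage sketch of that struct (the sizeof shown is what compilers typically produce, but bit-field layout and padding are implementation-defined, so check rather than assume):
#include <cstdio>

struct mystruct
{
    short int twelveBitInt : 12;
    short int threeBitInt : 3;
    short int bitFlag : 1;
};

int main() {
    mystruct m{};
    m.twelveBitInt = 1234;   // fits in 12 signed bits (roughly -2048..2047)
    // Typically sizeof(mystruct) == 2 here, since all 16 bits fit in one short.
    std::printf("sizeof = %zu, twelveBitInt = %d\n",
                sizeof(mystruct), static_cast<int>(m.twelveBitInt));
}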
They aren't necessarily. On some machines and compilers, sizeof(long double) == 12 (96 bits).
It's not necessary that all data types use a power of 2 as their number of bits. For example, long double uses 80 bits (though how many bits are actually allocated for it is implementation-dependent).
One advantage you gain by using powers of 2 is that larger data types can be composed of smaller ones. For example, 4 chars (8 bits each) can make up an int (32 bits). In fact, some compilers used to simulate 64-bit numbers using two 32-bit numbers.
Most of the time, your computer tries to keep all data formats at either a whole multiple (2, 3, 4...) or a whole fraction (1/2, 1/3, 1/4...) of the machine data size. It does this so that each time it loads N data words it loads an integer number of data items for you. That way, it doesn't have to recombine parts later on.
You can see this in the x86 for example:
a char is 1/4th of 32-bits
a short is 1/2 of 32-bits
an int / long are a whole 32 bits
a long long is 2x 32 bits
a float is a single 32-bits
a double is two times 32-bits
a long double may be either three or four times 32 bits, depending on your compiler settings. This is because on 32-bit machines it's three native machine words (so no overhead) to load 96 bits. On 64-bit machines it's 1.5 native machine words, so 128 bits would be more efficient (no recombining). The actual data content of a long double on x86 is 80 bits, so both of these are already padded.
As a last aside, the computer doesn't always load in its native data size. It first fetches a cache line and then reads from that in native machine words. The cache line is larger, usually around 64 or 128 bytes. It's very useful to have a meaningful piece of data fit into this and not straddle the edge, as you'd then have to load two whole cache lines to read it. That's why most computer structures are a power of two in size; they will fit in any power-of-two-sized storage either half, completely, double or more - you're guaranteed to never end up on a boundary.
There are a few cases where integral types must be an exact power of two. If the exact-width types in <stdint.h> exist, such as int16_t or uint32_t, their widths must be exactly that size, with no padding. Floating-point math that declares itself to follow the IEEE standard forces float and double to be powers of two (although long double often is not). There are additionally types char16_t and char32_t in the standard library now, or built-in to C++, defined as exact-width types. The requirements about support for UTF-8 in effect mean that char and unsigned char have to be exactly 8 bits wide.
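As a small illustration of the exact-width guarantee (assuming these types exist on the platform at all, which implies they have no padding bits):
#include <cstdint>
#include <climits>

// If an exact-width type exists, its width is exactly the advertised one,
// so size-in-bytes times bits-per-byte must come out to that width.
static_assert(sizeof(std::int16_t) * CHAR_BIT == 16, "int16_t is exactly 16 bits");
static_assert(sizeof(std::uint32_t) * CHAR_BIT == 32, "uint32_t is exactly 32 bits");

int main() {}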
In practice, a lot of legacy code would already have broken on any machine that didn’t support types exactly 8, 16, 32 and 64 bits wide. For example, any program that reads or writes ASCII or tries to connect to a network would break.
Some historically-important mainframes and minicomputers had native word sizes that were multiples of 3, not powers of two, particularly the DEC PDP-6, PDP-8 and PDP-10.
This was the main reason that base 8 used to be popular in computing: since each octal digit represented three bits, a 9-, 12-, 18- or 36-bit pattern could be represented more neatly by octal digits than decimal or hex. For example, when using base-64 to pack characters into six bits instead of eight, each packed character took up two octal digits.
The two most visible legacies of those architectures today are that, by default, character escapes such as '\123' are interpreted as octal rather than decimal in C, and that Unix file permissions/masks are represented as three or four octal digits.