Currently, I'm working on a Huffman compression algorithm that assigns binary codes to the characters used in a text file: fewer bits for more frequent characters and more bits for less frequent ones.
Currently, I'm trying to save the binary code big-endian in a byte.
So let's say I'm using an unsigned char to hold it.
00000000
And I want to store some binary code that's 1101.
In advance, I want to apologize if this seems trivial or is a dupe but I've browsed dozens of other posts and can't seem to find what I need. If anyone could link or quickly explain, it'd be greatly appreciated.
Would this be the correct syntax?
I'll have some external method like
int length = 0;
unsigned char byte = (some default value);

void pushBit(unsigned int bit){
    if (bit == 1){
        byte |= 1;
    }
    byte <<= 1;
    length++;
    if (length == 8) {
        //Output the byte
        length = 0;
    }
}
I've seen some videos explaining endianness and my understanding is that the most significant bit (the first one) is placed in the lowest memory address.
Some videos showed the byte from left to right, which makes me think I need to left-shift everything over, but whenever I set, toggle, or erase a bit, it's from the rightmost position, is it not? I'm sorry once again if this is trivial.
So after my method finishes pushing 1101, byte would be something like 00001101. Is this big endian? My knowledge of address locations is very weak and I'm not sure which location,
**-->00001101 or 00001101<--**
is considered the most significant.
Would I need to left shift the remaining amount?
So since I used 4 bits, I would left shift 4 bits to make 11010000. Is this big endian?
First off, as the Killzone Kid noted, endianness and the bit ordering of a binary code are two entirely different things. Endianness refers to the order in which a multi-byte integer is stored in the bytes of memory. For little endian, the least significant byte is stored first. For big endian, the most significant byte is stored first. The bits in the bytes don't change order. Endianness has nothing to do with what you're asking.
As for accumulating bits until you have a byte's worth to write, you have the basic idea, but your code is incorrect. You need to shift first, and then OR in the bit. The way you're doing it, you are losing the first bit you put in off the top, and the low bit of what you write is always zero. Just put the `byte <<= 1;` before the `if`.
You also need to deal with ending the stream somehow, writing out the last bits if there are fewer than eight left. So you'll need a flushBits() to write out your bit buffer if it has any bits in it. Your bit stream would need to be self-terminating, or you need to first send the number of bits, so that you don't misinterpret the filler bits in the last byte as a code or codes.
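For example, one possible sketch of the corrected routine; the output call (std::putchar) is just a placeholder for however the program actually emits bytes:

    #include <cstdio>

    static unsigned char byte = 0;   // bit accumulator
    static int length = 0;           // number of bits currently in the accumulator

    // Shift first, then OR in the new bit, so the first bit pushed
    // ends up as the most significant bit of the emitted byte.
    void pushBit(unsigned int bit) {
        byte = (unsigned char)((byte << 1) | (bit & 1));
        if (++length == 8) {
            std::putchar(byte);      // placeholder for "output the byte"
            byte = 0;
            length = 0;
        }
    }

    // Write out any remaining bits, padding the low end with zeros.
    void flushBits(void) {
        if (length > 0) {
            byte = (unsigned char)(byte << (8 - length));
            std::putchar(byte);
            byte = 0;
            length = 0;
        }
    }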
There are two types of endianness, big-endian and little-endian (technically there are more, like middle-endian, but big and little are the most common). If you want the big-endian format (as it seems you do), then the most significant byte comes first; with little-endian, the least significant byte comes first.
Wikipedia has some good examples
It looks like what you are trying to do is store the bits within the byte in reverse order, which is not what you want. A byte is endian-agnostic and does not need to be flipped. Multi-byte types such as uint32_t may need their byte order changed, depending on what endianness you want to achieve.
Maybe what you are referring to is bit numbering, in which case the code you have should largely work (although you should compare length to 7, not 8). The order you place the bits in pushBit would end up with the first bit you pass being the most significant bit.
Bits aren't addressable by definition (if we're talking about C++, not C51 or its C++ successor), so from the point of view of a high-level language, or even of assembler pseudo-code, no matter which physical direction LSB -> MSB runs, the bit-wise << operator shifts from LSB toward MSB. Bit order is referred to as bit numbering and is a separate feature from endianness, related to the hardware implementation.
Bit fields in C++ may change order because in the most common use cases bits do have an opposite order, e.g. in network communication, but in fact the way bit fields are packed into a byte is implementation-dependent; there is no guarantee that there are no gaps or that the order is preserved.
The minimal addressable unit of memory in C++ is of char size, and that's where your concern with endianness ends. In the rare case that you actually have to change the bit order (when? working with some incompatible hardware?), you have to do so explicitly.
Note that when working with Ethernet or another network protocol you should not do so; the order change is done by the hardware (the first bit sent over the wire is the least significant one of each octet).
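If you ever do need to reverse the bit order explicitly, a minimal sketch (purely illustrative, not part of the answer above) could look like:

    #include <cstdint>

    // Reverse the bit order within a single byte, e.g. 0xD0 (11010000) -> 0x0B (00001011).
    std::uint8_t reverse_bits(std::uint8_t b) {
        std::uint8_t r = 0;
        for (int i = 0; i < 8; ++i) {
            r = static_cast<std::uint8_t>((r << 1) | (b & 1));
            b >>= 1;
        }
        return r;
    }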
So... wrestling with bits and bytes, it occurred to me that if I say "first bit of the nth byte", it might not mean what I think it means. So far I have assumed that if I have some data like this:
00000000 00000001 00001000
then the
First byte is the leftmost of the groups and has the value of 0
First bit is the leftmost of all 0's and has the value of 0
Last byte is the rightmost of the groups and has the value of 8
Last bit of the second byte is the rightmost of the middle group and has the value of 1
Then I learned that the byte order in a typed collection of bytes is determined by the endianness of the system. In my case it should be little endian (Windows, Intel, right?), which would mean that something like 01 10 as a 16-bit unsigned integer should be 2551, while in most programs dealing with memory it would be represented as 265... no idea what's going on there.
I also learned that bits in a byte could be ordered as whatever and there seems to be no clear answer as to which bit is the actual first one since they could also be subject to bit-endianness and people's definitions about what is first differ. For me it's left to right; for somebody else it might be what first appears when you add 1 to 0, or right to left.
Why does any of this matter? Well, curiosity mostly, but I was also trying to write a class that would be able to extract X number of bits, starting from bit-address Y. I envisioned it sort of like a .NET string where I can go and type ".SubArray(12 (position), 5 (length))"; then, in the case of data like at the top of this post, it would retrieve "0001 0", or 2.
So could somebody clarify what is first and last in terms of bits and bytes in my environment? Does it go right to left or left to right, or both? And why does this question exist in the first place, why couldn't the coding ancestors have agreed on something and stuck with it?
A shift is an arithmetic operation, not a memory-based operation: it is intended to work on the value, rather than on its representation. Shifting left by one is equivalent to a multiplication by two, and shifting right by one is equivalent to a division by two. These rules hold first, and if they conflict with the arrangement of the bits of a multibyte type in memory, then so much for the arrangement in memory. (Since shifts are the only way to examine bits within one byte, this is also why there is no meaningful notion of bit order within one byte.)
As long as you keep your operations to within a single data type (rather than byte-shifting long integers and then examining them as character sequences), the results will stay predictable. Examining the same chunk of memory through different integer types is, in this case, a bit like performing integer operations and then reading the bits as a float; there will be some change, but it's not the place of the integer arithmetic definitions to say exactly what. It's out of their scope.
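A small illustration of that point, assuming a typical platform where unsigned int is at least 32 bits: the shift produces the same value everywhere, while inspecting the same object byte by byte depends on the machine.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        std::uint32_t x = 0x11223344;

        // Arithmetic view: the shifted value is the same on every platform.
        std::printf("x >> 8 = 0x%08X\n", static_cast<unsigned>(x >> 8));  // always 0x00112233

        // Representation view: which byte sits at the lowest address
        // depends on the machine's endianness.
        unsigned char bytes[sizeof x];
        std::memcpy(bytes, &x, sizeof x);
        std::printf("first byte in memory = 0x%02X\n",
                    static_cast<unsigned>(bytes[0]));  // 0x44 little-endian, 0x11 big-endian
        return 0;
    }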
You have some understanding, but a couple misconceptions.
First off, arithmetic operations such as shifting are not concerned with the representation of the bits in memory, they are dealing with the value. Where memory representation comes into play is usually in distributed environments where you have cross-platform communication in the mix, where the data on one system is represented differently on another.
Your first comment...
I also learned that bits in a byte could be ordered as whatever and there seems to be no clear answer as to which bit is the actual first one since they could also be subject to bit-endianness and people's definitions about what is first differ
This isn't entirely true; though the bits are only given meaning by the reader and the writer of the data, bits within an 8-bit byte are generally always read from left (MSB) to right (LSB). The byte order is what is determined by the endianness of the system architecture. It has to do with the representation of the data in memory, not with the arithmetic operations.
Second...
And why does this question exist in the first place, why couldn't the coding ancestors have agreed on something and stuck with it?
From Wikipedia:
The initial endianness design choice was (is) mostly arbitrary, but later technology revisions and updates perpetuate the same endianness (and many other design attributes) to maintain backward compatibility. As examples, the Intel x86 processor represents a common little-endian architecture, and IBM z/Architecture mainframes are all big-endian processors. The designers of these two processor architectures fixed their endiannesses in the 1960s and 1970s with their initial product introductions to the market. Big-endian is the most common convention in data networking (including IPv6), hence its pseudo-synonym network byte order, and little-endian is popular (though not universal) among microprocessors in part due to Intel's significant historical influence on microprocessor designs. Mixed forms also exist, for instance the ordering of bytes within a 16-bit word may differ from the ordering of 16-bit words within a 32-bit word. Such cases are sometimes referred to as mixed-endian or middle-endian. There are also some bi-endian processors which can operate either in little-endian or big-endian mode.
Finally...
Why does any of this matter? Well, curiosity mostly, but I was also trying to write a class that would be able to extract X number of bits, starting from bit-address Y. I envisioned it sort of like a .NET string where I can go and type ".SubArray(12 (position), 5 (length))"; then, in the case of data like at the top of this post, it would retrieve "0001 0", or 2.
Many programming languages and libraries offer functions that allow you to convert to/from network (big endian) and host order (system dependent) so that you can ensure data you're dealing with is in the proper format, if you need to care about it. Since you're asking specifically about bit shifting, it doesn't matter in this case.
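For example, a minimal sketch using the classic byte-order helpers htonl/ntohl, assuming a POSIX system where they live in <arpa/inet.h> (Windows provides equivalents through Winsock):

    #include <arpa/inet.h>   // htonl / ntohl on POSIX systems
    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint32_t host_value = 0x12345678;

        // Convert to network (big-endian) order before sending...
        std::uint32_t wire_value = htonl(host_value);

        // ...and back to host order after receiving.
        std::uint32_t round_trip = ntohl(wire_value);

        std::printf("round trip ok: %s\n", round_trip == host_value ? "yes" : "no");
        return 0;
    }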
Read this post for more info
So from http://en.wikipedia.org/wiki/RGBA_color_space, I learned that the byte order for ARGB is, from lowest address to highest address, BGRA, on a little endian machine in certain interpretations.
How does this affect the naming convention of packed data, e.g. a uint8_t ar[]={R,G,B,R,G,B,R,G,B}?
Little endian by definition stores the bytes of a number least significant first, i.e. in reverse of the order we write them. This is not strictly necessary if you are treating them as byte arrays; however, any vaguely efficient code base will actually treat the 4 bytes as a 32-bit unsigned integer. This will speed up software blitting by a factor of almost 4.
Now the real question is why. This comes from the fact that when treating a pixel as a 32-bit int as described above, coders want to be able to run arithmetic and shifts in a predictable way. This relies on the bytes being in that reversed order.
In short, this is not actually odd: on little-endian machines the last byte (highest address) is the most significant byte and the first the least significant. Thus a field like this will naturally be stored in reverse order, so it is the correct way around when treated as a number (as a number it will appear as ARGB, but as a byte array it will appear as BGRA).
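A quick sketch of that relationship, assuming a 32-bit 0xAARRGGBB pixel on a little-endian host:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        // Pack A, R, G, B into one 32-bit value: 0xAARRGGBB.
        std::uint32_t pixel = (0xFFu << 24) | (0x11u << 16) | (0x22u << 8) | 0x33u;

        unsigned char bytes[4];
        std::memcpy(bytes, &pixel, 4);

        // On a little-endian machine this prints 33 22 11 FF, i.e. B G R A
        // from lowest address to highest, even though the value "reads" ARGB.
        std::printf("%02X %02X %02X %02X\n",
                    static_cast<unsigned>(bytes[0]), static_cast<unsigned>(bytes[1]),
                    static_cast<unsigned>(bytes[2]), static_cast<unsigned>(bytes[3]));
        return 0;
    }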
Sorry if this is unclear, but I hope it helps. If you do not understand or I have missed something please comment.
If you are storing data in a byte array like you have specified, you are using BGR format which is basically RGB reversed:
[image: BGR color space diagram]
I'm looking for something resembling an STL vector but can handle integers that are, for example, 12, 16, 20, 24, 32, and 40 bits long. The 16-bit and 32-bit cases are handled wonderfully by vector<uint16_t> and vector<uint32_t>, but I haven't been able to find any way to handle the other ones. Note that the whole purpose of going this way is to save memory and bandwidth, so padding is not an option.
My data structure can infer the most significant bits of the integers (which are int64s), hence I only want to store the LSBs. The bits-per-integer and number of integers are known at creation time, but not compile time. Ideally bits per integer can be any value between 12 and 40, but tiers are ok for performance reasons or to work with a structure where bits-per-integer needs to be set at compile time.
vector<bool> and dynamic_bitset can create bitfields, but they're limited to 1-bit integers. Anyone know of something else out there?
There is none in the STL.
I understand your performance concern, and notably the memory issue. However, unaligned reads (requiring bit-shifting) can be slower than regular ones (on byte boundaries), so I would suggest keeping to whole bytes. This means:
12 bits, 16 bits: uint16_t
20 bits, 24 bits: either 3 uint8_t or 1 uint32_t (tradeoff speed/memory)
32 bits: uint32_t
40 bits: either 5 uint8_t, 3 uint16_t or 1 uint64_t
Assuming you choose the easier solution (i.e., always bumping to the next available integer size), you could use something like:
boost::variant<
    std::vector<uint16_t>,
    std::vector<uint32_t>,
    std::vector<uint64_t>
>
This lets you choose the exact vector to be used at runtime, based on the number of bytes you need to store.
If you want to save as much memory as possible, then you simply need to add a std::vector<uint8_t> to the list. But I would refrain from doing so, and perhaps only "pack" the 40-bit case (into uint16_t), as the others never waste much memory anyway... and I would probably profile to check both alternatives.
Note: you might also wish to tune the vector memory usage by checking that the capacity does not exceed the size too much.
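As a rough sketch of the same idea, here using std::variant (C++17) in place of boost::variant; all names are illustrative only:

    #include <cstdint>
    #include <type_traits>
    #include <variant>
    #include <vector>

    // Element width is chosen once, at runtime, by picking the variant alternative.
    using PackedVector = std::variant<std::vector<std::uint16_t>,
                                      std::vector<std::uint32_t>,
                                      std::vector<std::uint64_t>>;

    PackedVector make_packed(unsigned bits_per_int) {
        if (bits_per_int <= 16) return std::vector<std::uint16_t>{};
        if (bits_per_int <= 32) return std::vector<std::uint32_t>{};
        return std::vector<std::uint64_t>{};
    }

    // Append a value regardless of which alternative is active
    // (assumes the caller only passes values that fit the chosen width).
    void push_value(PackedVector& pv, std::uint64_t value) {
        std::visit([value](auto& vec) {
            using T = typename std::decay_t<decltype(vec)>::value_type;
            vec.push_back(static_cast<T>(value));
        }, pv);
    }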
Every now and then, someone on SO points out that char (aka 'byte') isn't necessarily 8 bits.
It seems that 8-bit char is almost universal. I would have thought that for mainstream platforms, it is necessary to have an 8-bit char to ensure its viability in the marketplace.
Both now and historically, what platforms use a char that is not 8 bits, and why would they differ from the "normal" 8 bits?
When writing code, and thinking about cross-platform support (e.g. for general-use libraries), what sort of consideration is it worth giving to platforms with non-8-bit char?
In the past I've come across some Analog Devices DSPs for which char is 16 bits. DSPs are a bit of a niche architecture I suppose. (Then again, at the time hand-coded assembler easily beat what the available C compilers could do, so I didn't really get much experience with C on that platform.)
char is also 16 bit on the Texas Instruments C54x DSPs, which turned up for example in OMAP2. There are other DSPs out there with 16 and 32 bit char. I think I even heard about a 24-bit DSP, but I can't remember what, so maybe I imagined it.
Another consideration is that POSIX mandates CHAR_BIT == 8. So if you're using POSIX you can assume it. If someone later needs to port your code to a near-implementation of POSIX, that just so happens to have the functions you use but a different size char, that's their bad luck.
In general, though, I think it's almost always easier to work around the issue than to think about it. Just write CHAR_BIT wherever you would otherwise write 8. If you want an exact 8-bit type, use int8_t. Your code will noisily fail to compile on implementations which don't provide one, instead of silently using a size you didn't expect. At the very least, if I hit a case where I had a good reason to assume it, then I'd assert it.
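A small sketch of both approaches, using CHAR_BIT from <climits> and asserting the assumption when you do rely on it:

    #include <climits>   // CHAR_BIT
    #include <cstdint>   // uint32_t, int8_t

    // Fails to compile (noisily) on platforms where char is not 8 bits.
    static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

    // Extract the most significant byte without hard-coding 8 or 24.
    unsigned char top_byte(std::uint32_t v) {
        return static_cast<unsigned char>(v >> ((sizeof v - 1) * CHAR_BIT));
    }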
When writing code, and thinking about cross-platform support (e.g. for general-use libraries), what sort of consideration is it worth giving to platforms with non-8-bit char?
It's not so much that it's "worth giving consideration" to something as it is playing by the rules. In C++, for example, the standard says all bytes will have "at least" 8 bits. If your code assumes that bytes have exactly 8 bits, you're violating the standard.
This may seem silly now -- "of course all bytes have 8 bits!", I hear you saying. But lots of very smart people have relied on assumptions that were not guarantees, and then everything broke. History is replete with such examples.
For instance, most early-90s developers assumed that a particular no-op CPU timing delay taking a fixed number of cycles would take a fixed amount of clock time, because most consumer CPUs were roughly equivalent in power. Unfortunately, computers got faster very quickly. This spawned the rise of boxes with "Turbo" buttons -- whose purpose, ironically, was to slow the computer down so that games using the time-delay technique could be played at a reasonable speed.
One commenter asked where in the standard it says that char must have at least 8 bits. It's in section 5.2.4.2.1. This section defines CHAR_BIT, the number of bits in the smallest addressable entity, and lists 8 as its value. It also says:
Their implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
So any number equal to 8 or higher is suitable for substitution by an implementation into CHAR_BIT.
Machines with 36-bit architectures have 9-bit bytes. According to Wikipedia, machines with 36-bit architectures include:
Digital Equipment Corporation PDP-6/10
IBM 701/704/709/7090/7094
UNIVAC 1103/1103A/1105/1100/2200
A few of which I'm aware:
DEC PDP-10: variable, but most often 7-bit chars packed 5 per 36-bit word, or else 9 bit chars, 4 per word
Control Data mainframes (CDC-6400, 6500, 6600, 7600, Cyber 170, Cyber 176, etc.): 6-bit chars, packed 10 per 60-bit word.
Unisys mainframes: 9 bits/byte
Windows CE: simply doesn't support the `char` type at all -- requires 16-bit wchar_t instead
There is no such thing as a completely portable code. :-)
Yes, there may be various byte/char sizes. Yes, there may be C/C++ implementations for platforms with highly unusual values of CHAR_BIT and UCHAR_MAX. Yes, sometimes it is possible to write code that does not depend on char size.
However, almost no real code is standalone. E.g. you may be writing code that sends binary messages to a network (the exact protocol is not important). You may define structures that contain the necessary fields. Then you have to serialize them. Just binary-copying a structure into an output buffer is not portable: generally you know neither the byte order for the platform nor the alignment of the structure members, so the structure just holds the data but does not describe the way the data should be serialized.
OK. You may perform byte-order transformations and move the structure members (e.g. uint32_t or similar) into the buffer using memcpy. Why memcpy? Because there are a lot of platforms where it is not possible to write a 32-bit value (16-bit, 64-bit -- no difference) when the target address is not aligned properly.
So, you have already done a lot to achieve portability.
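For instance, a minimal sketch of that kind of explicit serialization; the Message layout here is made up purely for illustration:

    #include <cstddef>
    #include <cstdint>

    struct Message {           // hypothetical message with two fields
        std::uint32_t id;
        std::uint16_t flags;
    };

    // Serialize in a fixed (big-endian) wire order, one byte at a time,
    // so neither host byte order nor struct padding leaks into the buffer.
    std::size_t serialize(const Message& m, unsigned char* out) {
        out[0] = static_cast<unsigned char>(m.id >> 24);
        out[1] = static_cast<unsigned char>(m.id >> 16);
        out[2] = static_cast<unsigned char>(m.id >> 8);
        out[3] = static_cast<unsigned char>(m.id);
        out[4] = static_cast<unsigned char>(m.flags >> 8);
        out[5] = static_cast<unsigned char>(m.flags);
        return 6;              // number of bytes written
    }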
And now the final question. We have a buffer. The data from it is sent to a TCP/IP network. Such a network assumes 8-bit bytes. The question is: what type should the buffer be? What if your chars are 9-bit? If they are 16-bit? 24? Maybe each char corresponds to one 8-bit byte sent to the network, and only 8 bits of it are used? Or maybe multiple network bytes are packed into 24/16/9-bit chars? That's a question, and it is hard to believe there is a single answer that fits all cases. A lot of things depend on the socket implementation for the target platform.
So, here is what I am getting at. Usually code can relatively easily be made portable to a certain extent. It's very important to do so if you expect the code to be used on different platforms. However, improving portability beyond that measure requires a lot of effort and often gives little, as real code almost always depends on other code (the socket implementation in the example above). I am sure that for about 90% of code, the ability to work on platforms with bytes other than 8 bits is almost useless, because it uses an environment that is bound to 8-bit bytes. Just check the byte size and perform a compile-time assertion. You will almost surely have to rewrite a lot for a highly unusual platform.
But if your code is highly "standalone" -- why not? You may write it in a way that allows different byte sizes.
It appears that you can still buy an IM6100 (i.e. a PDP-8 on a chip) out of a warehouse. That's a 12-bit architecture.
Many DSP chips have 16- or 32-bit char. TI routinely makes such chips for example.
The C and C++ programming languages, for example, define byte as "addressable unit of data large enough to hold any member of the basic character set of the execution environment" (clause 3.6 of the C standard). Since the C char integral data type must contain at least 8 bits (clause 5.2.4.2.1), a byte in C is at least capable of holding 256 different values. Various implementations of C and C++ define a byte as 8, 9, 16, 32, or 36 bits
Quoted from http://en.wikipedia.org/wiki/Byte#History
Not sure about other languages though.
http://en.wikipedia.org/wiki/IBM_7030_Stretch#Data_Formats
Defines a byte on that machine to be variable length
The DEC PDP-8 family had a 12-bit word, although you usually used 8-bit ASCII for output (on a Teletype mostly). However, there was also a 6-bit character code that allowed you to encode 2 chars in a single 12-bit word.
For one, Unicode characters are longer than 8 bits. As someone mentioned earlier, the C spec defines data types by their minimum sizes. Use sizeof and the values in limits.h if you want to interrogate your data types and discover exactly what size they are for your configuration and architecture.
For this reason, I try to stick to data types like uint16_t when I need a data type of a particular bit length.
Edit: Sorry, I initially misread your question.
The C spec says that a char object is "large enough to store any member of the execution character set". limits.h lists a minimum size of 8 bits, but the definition leaves the max size of a char open.
Thus, a char is at least as long as the largest character from your architecture's execution set (typically rounded up to the nearest 8-bit boundary). If your architecture has longer opcodes, your char size may be longer.
Historically, the x86 platform's opcode was one byte long, so char was initially an 8-bit value. Current x86 platforms support opcodes longer than one byte, but the char is kept at 8 bits in length since that's what programmers (and the large volumes of existing x86 code) are conditioned to.
When thinking about multi-platform support, take advantage of the types defined in stdint.h. If you use (for instance) a uint16_t, then you can be sure that this value is an unsigned 16-bit value on whatever architecture, whether that 16-bit value corresponds to a char, short, int, or something else. Most of the hard work has already been done by the people who wrote your compiler/standard libraries.
If you need to know the exact size of a char because you are doing some low-level hardware manipulation that requires it, I typically use a data type that is large enough to hold a char on all supported platforms (usually 16 bits is enough) and run the value through a convert_to_machine_char routine when I need the exact machine representation. That way, the platform-specific code is confined to the interface function and most of the time I can use a normal uint16_t.
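A rough sketch of what such an interface might look like; convert_to_machine_char is the answer's own hypothetical routine, and this is just one way to shape it:

    #include <cstdint>

    // Portable code carries character data in a uint16_t; only this routine
    // needs to care how wide the machine's char actually is.
    unsigned char convert_to_machine_char(std::uint16_t portable_value) {
        // The cast truncates to whatever width unsigned char has on this
        // platform (8 bits on mainstream hardware, 16 on some DSPs).
        return static_cast<unsigned char>(portable_value);
    }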
what sort of consideration is it worth giving to platforms with non-8-bit char?
Magic numbers occur e.g. when shifting; most of these can be handled quite simply by using CHAR_BIT and e.g. UCHAR_MAX instead of 8 and 255 (or similar). Hopefully your implementation defines those :)
Those are the "common" issues.
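As a tiny sketch of the CHAR_BIT/UCHAR_MAX point above (splitting a value into chars without hard-coding 8 or 255; names are illustrative):

    #include <climits>   // CHAR_BIT, UCHAR_MAX
    #include <cstdint>

    // Split a 32-bit value into chars, least significant first.
    // On an 8-bit-char platform this yields 4 chars; on a 16-bit-char DSP, 2.
    int split_into_chars(std::uint32_t v, unsigned char out[], int max_out) {
        const int chars_needed = (32 + CHAR_BIT - 1) / CHAR_BIT;
        int n = 0;
        for (int i = 0; i < chars_needed && n < max_out; ++i) {
            out[n++] = static_cast<unsigned char>(v & UCHAR_MAX);
            if (i + 1 < chars_needed)
                v >>= CHAR_BIT;   // guarded so we never shift past the type's width
        }
        return n;   // number of chars written
    }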
Another indirect issue: say you have:

    struct xyz {
        uchar baz;
        uchar blah;
        uchar buzz;
    };
This might "only" take (best case) 24 bits on one platform, but might take e.g. 72 bits elsewhere. If each uchar held "bit flags", each uchar only had 2 "significant" bits or flags that you were currently using, and you only organized them into 3 uchars for "clarity", then it might be relatively "more wasteful" on, say, a platform with 24-bit uchars. Nothing bitfields can't solve, but they have other things to watch out for. In this case, a single enum might be a way to get the "smallest"-sized integer you actually need.

Perhaps not a real example, but stuff like this "bit" me when porting / playing with some code. Just the fact that a uchar is thrice as big as what is "normally" expected means 100 such structures might waste a lot of memory on some platforms, where "normally" it is not a big deal. So things can still be "broken", or in this case "waste a lot of memory very quickly", due to an assumption that a uchar is "not very wasteful" on one platform, relative to the RAM available, compared with another platform.

The problem might be more prominent for ints as well, or other types: e.g. you have some structure that needs 15 bits, so you stick it in an int, but on some other platform an int is 48 bits or whatever. "Normally" you might break it into 2 uchars, but with a 24-bit uchar you'd only need one, so an enum might be a better "generic" solution. It depends on how you are accessing those bits, though :)

So there might be "design flaws" that rear their head, even if the code might still work/run fine regardless of the size of a uchar or uint. There are things like this to watch out for, even though there are no "magic numbers" in your code. Hope this makes sense :)
The weirdest one I saw was the CDC computers: 6-bit characters, but with 65 encodings. [There was also more than one character set -- you chose the encoding when you installed the OS.]
If a 60-bit word ended with 12, 18, 24, 30, 36, 40, or 48 bits of zero, that was the end-of-line character (e.g. '\n').
Since the 00 (octal) character was : in some code sets, that meant BNF that used ::= was awkward if the :: fell in the wrong column. [This long preceded C++ and other common uses of ::.]
ints used to be 16 bits (PDP-11, etc.). Going to 32-bit architectures was hard. People are getting better: hardly anyone assumes a pointer will fit in a long any more (you don't, right?). Or file offsets, or timestamps, or...
8-bit characters are already somewhat of an anachronism. We already need 32 bits to hold all the world's character sets.
The Univac 1100 series had two operational modes: 6-bit FIELDATA and 9-bit 'ASCII', which packed 6 or 4 characters respectively into 36-bit words. You chose the mode at program execution time (or compile time). It's been a lot of years since I actually worked on them.