I have to implement a protocol that defines data in 8-bit words that start with the least significant bit (LSB). I want to represent this data with unsigned char, but I don't know the bit order of the LSB and the most significant bit (MSB) in C/C++, which could possibly require swapping the bits.
Can anybody explain how to find out how an unsigned char is encoded: MSB-LSB or LSB-MSB?
Example:
unsigned char b = 1;
MSB-LSB: 0000 0001
LSB-MSB: 1000 0000
Endian-ness is platform dependent. Anyway, you don't have to worry about actual bit order unless you are serializing the bytes, which you may be trying to do. In which case, you still don't need to worry about how individual bytes are stored while they're on the machine, since you will have to dig the bits out individually anyway. Fortunately, if you bitwise AND with 1, you get the LSB, regardless of storage order; bit-AND with 2 and you get the next most significant bit, and so on. The compiler will sort out what constants to generate in the machine code, so that level of detail is abstracted away.
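The masking idea described above can be sketched as follows. This is only an illustration under the assumption that the protocol wants each 8-bit word emitted LSB first; the helper name byte_to_bits_lsb_first is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper (name invented for illustration): split one byte
   into its eight bits in the order an LSB-first protocol would send them.
   bits[0] receives the least significant bit, bits[7] the most significant. */
void byte_to_bits_lsb_first(uint8_t value, uint8_t bits[8])
{
    for (int i = 0; i < 8; i++) {
        /* Shifting right by i and masking with 1 selects bit i of the
           value, independent of how the hardware stores the byte. */
        bits[i] = (uint8_t)((value >> i) & 1u);
    }
}
```

For unsigned char b = 1, bits[0] is 1 and all other entries are 0, on any platform.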
There is no such thing in C/C++. The least significant bit is -- well -- the least significant bit. Since the bits don't have addresses, there is no other ordering.
Currently, it's for a Huffman compression algorithm that assigns binary codes to characters used in a text file. Fewer bits for more frequent- and more bits for less-frequent characters.
Currently, I'm trying to save the binary code big-endian in a byte.
So let's say I'm using an unsigned char to hold it.
00000000
And I want to store some binary code that's 1101.
In advance, I want to apologize if this seems trivial or is a dupe but I've browsed dozens of other posts and can't seem to find what I need. If anyone could link or quickly explain, it'd be greatly appreciated.
Would this be the correct syntax?
I'll have some external method like
int length = 0;
unsigned char byte = (some default value);
void pushBit(unsigned int bit){
    if (bit == 1){
        byte |= 1;
    }
    byte <<= 1;
    length++;
    if (length == 8) {
        //Output the byte
        length = 0;
    }
}
I've seen some videos explaining endianness and my understanding is the most significant bit (the first one) is placed in the lowest memory address.
Some videos showed the byte from left to right which makes me think I need to left shift everything over but whenever I set, toggle, erase a bit, it's from the rightmost is it not? I'm sorry once again if this is trivial.
So after my method finishes pushing the 1101 into this method, byte would be something like 00001101. Is this big endian? My knowledge of address locations is very weak and I'm not sure whether
**-->00001101 or 00001101<-- **
location is considered the most significant.
Would I need to left shift the remaining amount?
So since I used 4 bits, I would left shift 4 bits to make 11010000. Is this big endian?
First off, as the Killzone Kid noted, endianness and the bit ordering of a binary code are two entirely different things. Endianness refers to the order in which a multi-byte integer is stored in the bytes of memory. For little endian, the least significant byte is stored first. For big endian, the most significant byte is stored first. The bits in the bytes don't change order. Endianness has nothing to do with what you're asking.
As for accumulating bits until you have a byte's worth to write, you have the basic idea, but your code is incorrect. You need to shift first, and then OR in the bit. The way you're doing it, you are losing the first bit you put in off the top, and the low bit of what you write is always zero. Just put the byte <<= 1; before the if.
You also need to deal with ending the stream somehow, writing out the last bits if there are fewer than eight left. So you'll need a flushBits() to write out your bit buffer if it has at least one bit in it. Your bit stream would need to be self-terminating, or you need to first send the number of bits, so that you don't misinterpret the filler bits in the last byte as a code or codes.
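A minimal sketch of the corrected accumulator with a flush step might look like this. The BitWriter struct, the fixed 64-byte output buffer, and the choice to left-align the final partial byte with zero filler bits are all assumptions made for illustration, not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit writer; struct layout and buffer size are assumptions. */
typedef struct {
    uint8_t out[64];  /* completed bytes */
    size_t  count;    /* number of completed bytes in out */
    uint8_t byte;     /* accumulator for the byte being built */
    int     length;   /* number of bits currently in the accumulator */
} BitWriter;

/* Append one bit. The first bit pushed ends up as the MSB of the byte. */
void push_bit(BitWriter *w, unsigned bit)
{
    /* Shift first, then OR in the new bit, so no bit is lost off the top. */
    w->byte = (uint8_t)((w->byte << 1) | (bit & 1u));
    w->length++;
    if (w->length == 8) {
        if (w->count < sizeof w->out) {
            w->out[w->count++] = w->byte;
        }
        w->byte = 0;
        w->length = 0;
    }
}

/* Write out a partial byte, padded on the right with zero filler bits. */
void flush_bits(BitWriter *w)
{
    if (w->length > 0) {
        w->byte = (uint8_t)(w->byte << (8 - w->length));
        if (w->count < sizeof w->out) {
            w->out[w->count++] = w->byte;
        }
        w->byte = 0;
        w->length = 0;
    }
}
```

Pushing 1, 1, 0, 1 and then flushing produces the single byte 11010000. The reader still needs a bit count or a self-terminating code to ignore the four filler bits.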
There are two main types of endianness, big-endian and little-endian (technically there are more, like middle-endian, but big and little are the most common). If you want the big-endian format (as it seems you do), then the most significant byte comes first; with little-endian, the least significant byte comes first.
Wikipedia has some good examples
It looks like you are trying to store the bits within the byte in reverse order, which is not what you want. A byte is endian-agnostic and does not need to be flipped. Multi-byte types such as uint32_t may need their byte order changed, depending on what endianness you want to achieve.
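To illustrate the multi-byte case, here is a small sketch that writes and reads a uint32_t in big-endian byte order using only shifts, so it behaves the same on any host. The function names store_be32 and load_be32 are invented for this example.

```c
#include <stdint.h>

/* Hypothetical helper: write v into out[0..3], most significant byte first. */
void store_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Hypothetical helper: read a big-endian 32-bit value from in[0..3]. */
uint32_t load_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) |
           ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |
            (uint32_t)in[3];
}
```

Only the order of the four bytes changes here; the bits inside each byte are never touched.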
Maybe what you are referring to is bit numbering, in which case the code you have should largely work once the shift is done before the OR (comparing length to 8 after the increment is correct, since eight bits have been pushed by then). The order in which you place the bits in pushBit would make the first bit you pass the most significant bit.
Bits aren't addressable by definition (if we're talking about C++, not C51 or its C++ successor), so from the point of view of a high-level language, and even of assembler pseudo-code, it doesn't matter how the hardware lays out its bits: the bit-wise << always shifts from the LSB towards the MSB. Bit order is referred to as bit numbering; it is a feature separate from endianness and relates to the hardware implementation.
Bit fields in C++ may appear in the opposite order because in the most common use cases, e.g. network communication, bits usually do have the opposite order. In fact, how bit fields are packed into a byte is implementation dependent: there is no guarantee that there are no gaps or that the order is preserved.
The minimal addressable unit of memory in C++ has the size of char, and that's where your concern with endianness ends. In the rare case where you actually have to change the bit order (when? working with some incompatible hardware?), you have to do so explicitly.
Note that when working with Ethernet or another network protocol you should not do so; the order change is done by hardware (the first bit sent over the wire is the least significant one on the platform).
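For the rare case where the bit order really does have to be reversed explicitly, a simple loop is enough. This is a sketch with an invented function name, not a recommendation to reverse bits for ordinary network code.

```c
#include <stdint.h>

/* Hypothetical helper: return b with its eight bits in reverse order,
   so bit 0 becomes bit 7, bit 1 becomes bit 6, and so on. */
uint8_t reverse_bits8(uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1u)); /* move the lowest bit of b into r */
        b >>= 1;
    }
    return r;
}
```

With this, 0000 0001 becomes 1000 0000, matching the two notations in the first question.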
I'm learning about CRCs, and search engines and SO turn up nothing on this....
Why do we have "Normal" and "Reversed" and "Reciprocal" Polynomials? Does one favor Big Endian, Little Endian, or something else?
The classic definition of a CRC would use a non-reflected polynomial, which shifts the CRC left. If the word size being used for the calculation is larger than the CRC, then you would need an operation at the end to clear the high bits that were shifted into (e.g. & 0xffff for a 16-bit CRC).
You can flip the whole thing, use a reflected polynomial, and shift right instead of left. That gives the same CRC properties, but the bits from the message are effectively operated on from least to most significant bit, instead of most to least significant bit. Since you are shifting right, the extraneous bits get dropped off the bottom into oblivion, and there is no need for the additional operation. This may have been one of the early motivations to use a very slightly faster and more compact implementation.
Sometimes the specification from the original hardware is that the bits are processed from least to most significant, so then you have to use the reflected version.
No, none of this favors little or big endian. Either kind of CRC can be computed just as easily in little-endian or big-endian architectures.
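The two styles can be compared side by side in a bit-at-a-time sketch. The parameter sets below (polynomial 0x1021 with zero initial value, commonly catalogued as CRC-16/XMODEM, and its reflected counterpart 0x8408, commonly catalogued as CRC-16/KERMIT) are chosen only as examples; the left-shifting version deliberately uses a 32-bit register to show why the final mask is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-reflected CRC-16, polynomial 0x1021, initial value 0.
   Message bits are processed from most to least significant. */
uint16_t crc16_normal(const uint8_t *data, size_t len)
{
    uint32_t crc = 0; /* wider than the CRC on purpose */
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 8;
        for (int k = 0; k < 8; k++) {
            crc = (crc & 0x8000u) ? (crc << 1) ^ 0x1021u : crc << 1;
        }
    }
    return (uint16_t)(crc & 0xffffu); /* clear the bits shifted above bit 15 */
}

/* Reflected CRC-16, polynomial 0x8408 (0x1021 bit-reversed), initial value 0.
   Message bits are processed from least to most significant. */
uint16_t crc16_reflected(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++) {
            /* Bits shifted right simply fall off the bottom: no mask needed. */
            crc = (uint16_t)((crc & 1u) ? (crc >> 1) ^ 0x8408u : crc >> 1);
        }
    }
    return crc;
}
```

For the ASCII message "123456789", the first function gives 0x31C3 and the second 0x2189, the published check values for those two parameter sets. Both run unchanged on little-endian and big-endian hosts.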
I am trying to parse a bitstream, and I am having trouble getting my head around endianness. I have a byte buffer, and I need to be able to read bitfields out which are of varying lengths, anywhere from 1 bit to 8 bits mostly.
My problem comes with the endianness of the bytes. When I step through with a debugger, the bottom 4 bits appear to be in the top portion of the byte. That is, where I am expecting the first two bits to be 10 (they must be 10), however, the first byte in the bitstream is 0xA3, or 1010 0011, when checking with the debugger. Meaning, assuming that the bits are in the "correct" order, the first two bits are in fact 11 (reading right to left).
It would seem, however, that if the bits were not in the right order, and should be 0x3A, or 0011 1010, I then have 10 as my expected first two bits.
This confuses me, because it doesn't seem to be a matter of bit order, MSb to LSb/LSb to MSb, but rather nibble order. How does this happen? That seems to just be the way it came out of the file. There is a possibility this is an invalid bitstream, but I have seen this kind of thing before when reading files in Hex Editors, nibbles seemingly in the "wrong" order.
I am just confused and would like some help understanding what's going on. I don't often deal with things at this level.
You don't need to be concerned about the bit order, because in C/C++ there is no way to iterate through the bits using pointer arithmetic. You can only manipulate the bits using bit-wise operators, which are independent of the bit order of the local machine. What you mentioned in the OP is just a matter of visualization. Different debuggers may choose different ways to visualize the bits in a byte. There is no right or wrong here, just preference. What really matters is the byte order.
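A sequential reader built only from shifts and masks could look like the sketch below. It assumes the stream format defines the "first" bit of each byte as its most significant bit; if the format is LSB-first, the shift amount changes. The BitReader type and read_bits name are invented for illustration, and reading past the end yields zero bits in this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sequential bit reader over a byte buffer. */
typedef struct {
    const uint8_t *data;
    size_t size;    /* buffer length in bytes */
    size_t bitpos;  /* index of the next bit to read, counted from the start */
} BitReader;

/* Read n bits (n <= 32), treating the MSB of each byte as its first bit. */
uint32_t read_bits(BitReader *r, unsigned n)
{
    uint32_t value = 0;
    while (n-- > 0) {
        size_t byte_index = r->bitpos / 8;
        unsigned shift = 7u - (unsigned)(r->bitpos % 8);
        unsigned bit = 0;
        if (byte_index < r->size) {
            bit = (r->data[byte_index] >> shift) & 1u;
        }
        value = (value << 1) | bit;
        r->bitpos++;
    }
    return value;
}
```

With the byte 0xA3 (1010 0011), reading two bits this way yields binary 10, whatever the debugger display looks like.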
So, wrestling with bits and bytes, it occurred to me that if I say "First bit of nth byte", it might not mean what I think it means. So far I have assumed that if I have some data like this:
00000000 00000001 00001000
then the
First byte is the leftmost of the groups and has the value of 0
First bit is the leftmost of all 0's and has the value of 0
Last byte is the rightmost of the groups and has the value of 8
Last bit of the second byte is the rightmost of the middle group and has the value of 1
Then I learned that the byte order in a typed collection of bytes is determined by the endianness of the system. In my case it should be little endian (Windows, Intel, right?), which would mean that something like 01 10 as a 16 bit unsigned integer should be 2551, while in most programs dealing with memory it would be represented as 265. No idea what's going on there.
I also learned that bits in a byte could be ordered as whatever and there seems to be no clear answer as to which bit is the actual first one since they could also be subject to bit-endianess and peoples definition about what is first differs. For me its left to right, for somebody else it might be what first appears when you add 1 to 0 or right to left.
Why does any of this matter? Well, curiosity mostly but I was also trying to write a class that would be able to extract X number of bits, starting from bit-address Y. I envisioned it sorta like .net string where i can go and type ".SubArray(12(position), 5(length))" then in case of data like in the top of this post it would retrieve "0001 0" or 2.
So could somebody clarify what is first and last in terms of bits and bytes in my environment: does it go right to left or left to right or both, wut? And why does this question exist in the first place, why couldn't the coding ancestors have agreed on something and stuck with it?
A shift is an arithmetic operation, not a memory-based operation: it is intended to work on the value, rather than on its representation. Shifting left by one is equivalent to a multiplication by two, and shifting right by one is equivalent to a division by two. These rules hold first, and if they conflict with the arrangement of the bits of a multibyte type in memory, then so much for the arrangement in memory. (Since shifts are the only way to examine bits within one byte, this is also why there is no meaningful notion of bit order within one byte.)
As long as you keep your operations within a single data type (rather than byte-shifting long integers and then examining them as character sequences), the results will stay predictable. Examining the same chunk of memory through different integer types is, in this case, a bit like performing integer operations and then reading the bits as a float; there will be some change, but it's not the place of the integer arithmetic definitions to say exactly what. It's out of their scope.
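The difference between working on the value and looking at the representation can be sketched like this. The function names are invented; memcpy is used to inspect the bytes so the example does not rely on pointer casts.

```c
#include <stdint.h>
#include <string.h>

/* Value-based: these give the same answer on every platform. */
uint8_t high_byte(uint16_t x) { return (uint8_t)(x >> 8); }
uint8_t low_byte(uint16_t x)  { return (uint8_t)(x & 0xFFu); }

/* Representation-based: the result depends on the host's byte order.
   Returns 1 if the least significant byte is stored at the lowest address. */
int host_is_little_endian(void)
{
    uint16_t x = 0x0110;
    uint8_t bytes[2];
    memcpy(bytes, &x, sizeof x); /* view the same memory as raw bytes */
    return bytes[0] == 0x10;
}
```

high_byte(0x0110) is 0x01 everywhere, while host_is_little_endian() returns 1 on x86 and 0 on a big-endian machine.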
You have some understanding, but a couple of misconceptions.
First off, arithmetic operations such as shifting are not concerned with the representation of the bits in memory, they are dealing with the value. Where memory representation comes into play is usually in distributed environments where you have cross-platform communication in the mix, where the data on one system is represented differently on another.
Your first comment...
I also learned that bits in a byte could be ordered as whatever and there seems to be no clear answer as to which bit is the actual first one since they could also be subject to bit-endianess and peoples definition about what is first differs
This isn't entirely true. Although the bits are only given meaning by the reader and the writer of the data, the bits within an 8-bit byte are conventionally written from left (MSB) to right (LSB). The byte order is what is determined by the endianness of the system architecture. It has to do with the representation of the data in memory, not with the arithmetic operations.
Second...
And why does this question exist in the first place, why couldn't the coding ancestors have agreed on something and stuck with it?
From Wikipedia:
The initial endianness design choice was (is) mostly arbitrary, but later technology revisions and updates perpetuate the same endianness (and many other design attributes) to maintain backward compatibility. As examples, the Intel x86 processor represents a common little-endian architecture, and IBM z/Architecture mainframes are all big-endian processors. The designers of these two processor architectures fixed their endiannesses in the 1960s and 1970s with their initial product introductions to the market. Big-endian is the most common convention in data networking (including IPv6), hence its pseudo-synonym network byte order, and little-endian is popular (though not universal) among microprocessors in part due to Intel's significant historical influence on microprocessor designs. Mixed forms also exist, for instance the ordering of bytes within a 16-bit word may differ from the ordering of 16-bit words within a 32-bit word. Such cases are sometimes referred to as mixed-endian or middle-endian. There are also some bi-endian processors which can operate either in little-endian or big-endian mode.
Finally...
Why does any of this matter? Well, curiosity mostly but I was also trying to write a class that would be able to extract X number of bits, starting from bit-address Y. I envisioned it sorta like .net string where i can go and type ".SubArray(12(position), 5(length))" then in case of data like in the top of this post it would retrieve "0001 0" or 2.
Many programming languages and libraries offer functions that allow you to convert to/from network (big endian) and host order (system dependent) so that you can ensure data you're dealing with is in the proper format, if you need to care about it. Since you're asking specifically about bit shifting, it doesn't matter in this case.
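The ".SubArray(position, length)" idea from the question can be sketched with shifts alone. This assumes bit 0 is the leftmost (most significant) bit of the first byte, as in the question's own numbering, that length is at most 32, and that the caller keeps position + length within the buffer; extract_bits is an invented name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return len bits starting at bit index pos,
   where bit 0 is the MSB of buf[0]. No bounds checking is done. */
uint32_t extract_bits(const uint8_t *buf, size_t pos, unsigned len)
{
    uint32_t value = 0;
    for (unsigned i = 0; i < len; i++) {
        size_t bit_index = pos + i;
        uint8_t byte = buf[bit_index / 8];
        unsigned shift = 7u - (unsigned)(bit_index % 8);
        value = (value << 1) | ((byte >> shift) & 1u);
    }
    return value;
}
```

For the bytes 00000000 00000001 00001000, extract_bits(data, 12, 5) collects the bits 0001 0 and returns 2, as the question expected, regardless of the host's endianness.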
Read this post for more info
Why is int typically 32 bit on 64 bit compilers? When I was starting programming, I was taught that int is typically the same width as the underlying architecture. And I agree that this makes sense; I find it logical for an unspecified-width integer to be as wide as the underlying platform (unless we are talking about 8 or 16 bit machines, where such a small range for int would be barely usable).
Later on I learned int is typically 32 bit on most 64 bit platforms. So I wonder what is the reason for this. For storing data I would prefer an explicitly specified width of the data type, so this leaves generic usage for int, which doesn't offer any performance advantages, at least on my system I have the same performance for 32 and 64 bit integers. So that leaves the binary memory footprint, which would be slightly reduced, although not by a lot...
Bad choices on the part of the implementors?
Seriously, according to the standard, "Plain ints have the natural size suggested by the architecture of the execution environment", which does mean a 64 bit int on a 64 bit machine. One could easily argue that anything else is non-conformant. But in practice, the issues are more complex: switching from 32 bit int to 64 bit int would not allow most programs to handle large data sets or whatever (unlike the switch from 16 bits to 32); most programs are probably constrained by other considerations. And it would increase the size of the data sets, and thus reduce locality and slow the program down.
Finally (and probably most importantly), if int were 64 bits, short would have to be either 16 bits or 32 bits, and you'd have no way of specifying the other (except with the typedefs in <stdint.h>, and the intent is that these should only be used in very exceptional circumstances). I suspect that this was the major motivation.
The history, trade-offs and decisions are explained by The Open Group at http://www.unix.org/whitepapers/64bit.html. It covers the various data models, their strengths and weaknesses and the changes made to the Unix specifications to accommodate 64-bit computing.
Because there is no advantage to a lot of software to have 64-bit integers.
Using 64-bit ints to calculate things that can be calculated in a 32-bit integer will not help anything; for many purposes, values up to 4 billion (or +/- 2 billion) are sufficient.
Using a bigger integer will, however, have a negative effect on how many integer-sized "things" fit in the cache on the processor. So making them bigger will make calculations that involve large numbers of integers (e.g. arrays) take longer, because fewer of them fit in the cache at once.
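The cache argument is simple arithmetic, sketched below. The 64-byte cache line is an assumption (a common size, but not universal), and both function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumption: typical, but hardware dependent */

/* How many elements of the given size fit in one cache line. */
size_t elements_per_cache_line(size_t element_size)
{
    return CACHE_LINE_BYTES / element_size;
}

/* Total memory used by an array of count elements of the given size. */
size_t array_bytes(size_t count, size_t element_size)
{
    return count * element_size;
}
```

Under that assumption, one line holds 16 int32_t values but only 8 int64_t values, and an array of a million elements doubles in size when the element type goes from int32_t to int64_t.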
That int is the natural size of the machine word isn't something stipulated by the C++ standard. In the days when most machines were 16 or 32 bit, it made sense to make it either 16 or 32 bits, because that is a very efficient size for those machines. When it comes to 64 bit machines, that no longer "helps". So staying with 32 bit int makes more sense.
Edit:
Interestingly, when Microsoft moved to 64-bit, they didn't even make long 64-bit, because it would break too many things that relied on long being a 32-bit value (or more importantly, they had a bunch of things that relied on long being a 32-bit value in their API, where sometimes client software uses int and sometimes long, and they didn't want that to break).
ints have been 32 bits on most major architectures for so long that changing them to 64 bits will probably cause more problems than it solves.
I originally wrote this up in response to this question. While I've modified it some, it's largely the same.
To get started, it is possible to have plain ints wider than 32 bits, as the C++ draft says:
Note: Plain ints are intended to have the natural size suggested by the architecture of the execution environment; the other signed integer types are provided to meet special needs. — end note
Emphasis mine
This would ostensibly seem to say that on my 64 bit architecture (and everyone else's) a plain int should have a 64 bit size; that's a size suggested by the architecture, right? However, I must assert that the natural size even for a 64 bit architecture is 32 bits. The quote in the specs is mainly there for cases where 16 bit plain ints are desired, which is the minimum size the specifications allow.
The largest factor is convention, going from a 32 bit architecture with a 32 bit plain int and adapting that source for a 64 bit architecture is simply easier if you keep it 32 bits, both for the designers and their users in two different ways:
The first is that the fewer differences there are across systems, the easier it is for everyone. Discrepancies between systems have only been headaches for most programmers: they only serve to make it harder to run code across systems. It would even add to the relatively rare cases where you're not able to run the same code across computers with the same distribution, just 32 bit versus 64 bit. However, as John Kugelman pointed out, architectures have already gone from a 16 bit to a 32 bit plain int, so going through that hassle could be done again today, which ties into his next point:
The more significant component is the gap it would cause in integer sizes, or the new type that would be required. Because sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long) is in the actual specification, a gap is forced if the plain int is moved to 64 bits. It starts with shifting long. If a plain int is adjusted to 64 bits, the constraint that sizeof(int) <= sizeof(long) would force long to be at least 64 bits, and from there there's an intrinsic gap in sizes. Since long or a plain int is usually used as a 32 bit integer and neither of them could be one now, only one data type is left that could: short. Because short has a minimum of 16 bits, it could become 32 bits if you simply discard that size and theoretically fill the gap; however, short is intended to be optimized for space, so it should be kept as it is, and there are use cases for small, 16 bit integers as well. No matter how you arrange the sizes, one width is lost, and the use cases for an integer of that width become entirely unavailable. A bigger width doesn't necessarily mean it's better.
This would imply that the specifications need to change, but even if a designer goes rogue, it's highly likely the system would be damaged by or grow obsolete from the change. Designers of long-lasting systems have to work with an entire base of entwined code (their own code in the system, dependencies, and the users' code they'll want to run), and doing a huge amount of work to change it without considering the repercussions is simply unwise.
As a side note, if your application is incompatible with a >32 bit integer, you can use static_assert(sizeof(int) * CHAR_BIT <= 32, "Int wider than 32 bits!"); (CHAR_BIT comes from <climits>). However, who knows, maybe the specifications will change and 64 bit plain ints will be implemented, so if you want to be future proof, don't do the static assert.
The main reason is backward compatibility. Moreover, there is already a 64 bit integer type (long on many platforms, long long everywhere), and the same goes for the floating-point types: float and double. Changing the sizes of these basic types for different architectures would only introduce complexity. Moreover, a 32 bit integer meets many needs in terms of range.
The C++ standard does not say exactly how much memory the int type should use; it only specifies the minimum. In many programming environments with 32-bit pointers, int and long are both 32 bits long.
Since no one pointed this out yet.
int is guaranteed to cover at least the range -32767 to 32767 (16 bits). That's required by the standard. If you want to support 64 bit numbers on all platforms, I suggest using the right type, long long, which supports at least -9223372036854775807 to 9223372036854775807.
int is allowed to be anything so long as it provides the minimum range required by the standard.
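These minimum guarantees can be checked at compile time with the macros from <limits.h>. The sketch below uses C11's _Static_assert; the helper int_width_bits is an invented name that reports what the current compiler actually chose.

```c
#include <limits.h>

/* Minimum ranges required by the standard. These hold on any conforming
   compiler, whatever width it picked for int. */
_Static_assert(INT_MAX >= 32767, "int must reach at least 32767");
_Static_assert(INT_MIN <= -32767, "int must reach at least -32767");
_Static_assert(LLONG_MAX >= 9223372036854775807LL,
               "long long must hold at least 64-bit values");

/* Hypothetical helper: storage width of int in bits on this implementation. */
int int_width_bits(void)
{
    return (int)(sizeof(int) * CHAR_BIT);
}
```

On most current 64-bit platforms int_width_bits() returns 32, but code that needs an exact width is better served by the fixed-width types in <stdint.h>.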