I've come across many comments on various questions regarding bitfields asserting that bitfields are non-portable, but I've never been able to find a source explaining precisely why.
At face value, I would have presumed all bitfields merely compile to variations of the same bitshifting code, but evidently there must be more to it than that, or there would not be such vehement dislike for them.
So my question is what is it that makes bitfields non-portable?
Bit fields are non-portable in the same sense as integers are non-portable. You can use integers to write a portable program, but you cannot expect to send a binary representation of int as is to a remote machine and expect it to interpret the data correctly.
This is because (1) processor word lengths differ, and because of that the sizes of the integer types differ (byte length can differ too, but that is rare these days outside embedded systems), and (2) byte endianness differs across processors.
These problems are easy to overcome. Native endianness can easily be converted to an agreed-upon endianness (big endian is the de facto standard for network communication), sizes can be inspected at compile time, and fixed-width integer types are available these days. Therefore integers can be used to communicate across a network, as long as these details are taken care of.
Bit fields build upon regular integer types, so they have the same problems with endianness and integer sizes. But they have even more implementation-defined behaviour:
Everything about the actual allocation details of bit fields within the class object. For example, on some platforms bit fields don't straddle bytes, on others they do; also, on some platforms bit fields are packed left-to-right, on others right-to-left.
Whether char, short, int, long, and long long bit fields are signed or unsigned (when not declared so explicitly).
Unlike endianness, it is not trivial to convert "everything about the actual allocation details" to a canonical form.
Also, while endianness is specific to the CPU architecture, the bit field details are specific to the compiler implementer. So bit fields are not portable for communication even between separate processes within the same computer, unless we can guarantee that they were compiled using the same (or binary-compatible) compiler.
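To make that concrete, here is a minimal sketch (struct and field names invented for illustration) that dumps the bytes a compiler happens to choose for a small bit field struct; the program is well-formed everywhere, but its output can legitimately differ between compilers and ABIs:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical message layout: three fields packed into one 16-bit unit.
struct Packed {
    std::uint16_t version : 4;
    std::uint16_t type    : 6;
    std::uint16_t length  : 6;
};

int main() {
    Packed p{};
    p.version = 0x3;
    p.type    = 0x15;
    p.length  = 0x2A;

    unsigned char bytes[sizeof p];
    std::memcpy(bytes, &p, sizeof p);   // inspect the object representation the compiler chose

    for (unsigned char b : bytes)
        std::printf("%02X ", b);        // byte values may differ across compilers/targets
    std::printf("\n");
}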
TL;DR bit fields are not a portable way to communicate between computers. Integers aren't either, but their non-portability is easy to work around.
Bit fields are non-portable in the sense that the ordering of the bits is unspecified. So the bit at index 0 with one compiler could very well be the last bit with another compiler.
This prevents the use of bit fields in applications like toggling bits in memory-mapped hardware registers.
But you will see hardware vendors use bitfields in the code they release (Microchip, for instance). Usually it's because they also release the compiler with it or target a single compiler. In Microchip's case, for instance, the licence for their source code mandates that you use their own compiler (for 8-bit low-end devices).
The link pointed to by @Pharap contains an extract of the (C++14) standard related to this unspecified ordering: is-there-a-portable-alternative-to-c-bitfields
Related
This is purely a theoretical problem, nothing I have really found myself in, but it has piqued my curiosity and wanted to see if anyone has a better solution for it:
How do you portably guarantee that a specific file format / network protocol or whatever conforms to a specific bit pattern.
Say we have a file format that uses a 64 bit header struct immediately followed by a variable length array of 32 bit structures:
Header: magic : 32 bit
        count : 32 bit
Field:  id    : 16 bit
        data  : 16 bit
My first instinct would be to write something like:
struct Field
{
    uint16_t id;
    uint16_t data;
};
Except that our compiler may decide that padding is advisable and we end up with a 64 bit structure. So our next bet is:
using Field = uint16_t[2];
and work on that.
That is, unless someone has carefully read the standard and noticed that uint16_t is optional. At this point our next best friend is uint_least16_t, which is guaranteed to be at least 16 bits long, but for all we know could be 20 bits long on a processor with 10-bit chars.
At this point, the only real solution I can come up with is some sort of bit stream, capable of reading and writing specific amounts of bits, and adaptable by std::numeric_limits.
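A rough sketch of what such a bit stream might look like (the class name and interface are invented placeholders; it emits values in 8-bit groups for the wire):

#include <cstdint>
#include <vector>

// Minimal big-endian bit writer: accumulates bits and flushes them in 8-bit groups.
class BitWriter {
public:
    void write(std::uint64_t value, unsigned bits) {   // assumes bits <= 64
        for (unsigned i = bits; i-- > 0; )
            put_bit((value >> i) & 1u);
    }
    const std::vector<std::uint8_t>& bytes() const { return out_; }
private:
    void put_bit(unsigned bit) {
        acc_ = static_cast<std::uint8_t>((acc_ << 1) | bit);
        if (++count_ == 8) { out_.push_back(acc_); acc_ = 0; count_ = 0; }
    }
    std::vector<std::uint8_t> out_;
    std::uint8_t acc_ = 0;
    unsigned count_ = 0;
};

// Usage sketch: write the 16-bit id and data fields of the example format.
// BitWriter w; w.write(id, 16); w.write(data, 16);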
So, is there someone out there who has very carefully read the standard and found the point I'm missing? Or is this the only real way of getting a portable guarantee?
Notes:
- I've just realized that endianness would probably add another layer of complexity.
- I'm using the current working draft of the ISO standard (N3797).
How do you portably guarantee that a specific file format / network protocol or whatever conforms to a specific bit pattern.
You can't. Not in C++, which was standardized against an abstract platform where little more than the existence of a "byte" made up of bits can be assumed. We can't even say for certain, looking only at the Standard, how many bits are in a char. You can use bitfields for everything, as bits are indivisible, but then you'll have padding to contend with at the least.
Sometimes it is best to give up on absolute Standards conformance for its own sake, and look to other means to get the job done efficiently and effectively. In this case, platform specifics in combination with almost-absolute Standards conformance (aka good programming practices) will set you free.
Every platform I work on regularly (Linux & Windows) provides a means to regulate the padding the compiler will actually apply. For network communications, under Linux & Windows I use:
#pragma pack (push, 1)
as a preface to all the data structures I'm going to send over the wire. Endianness is indeed another challenge, but one more or less easily dealt with using other resources provided by every platform: ntohl and the like.
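For illustration only, a sketch of how that combination might look on those platforms (struct and field names invented; on Windows the htonl declaration comes from winsock2.h instead):

#include <cstdint>
#include <arpa/inet.h>   // htonl on Linux; use <winsock2.h> on Windows

#pragma pack(push, 1)    // ask the compiler not to insert padding
struct WireHeader {
    std::uint32_t magic;
    std::uint32_t count;
};
#pragma pack(pop)

// Convert host representation to the agreed wire representation before sending.
inline WireHeader to_wire(std::uint32_t magic, std::uint32_t count) {
    WireHeader h;
    h.magic = htonl(magic);
    h.count = htonl(count);
    return h;
}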
Standards conformance is a laudable goal, and indeed in a code review I would reject most code that is non-conformant. The lack of conformance is really just a moniker for the rejection however; not the reason itself. The actual reason for the rejection is in large part difficulty in maintaining and porting non-conformant code when moving to another platform, or indeed even just upgrading the compiler on the same platform. Non-conformant code might compile and even appear to work, but it will very often fail in subtle and miserable ways when you least expect it, even after thorough testing.
The moral of the story is:
You should always write Standards-conformant code, except when you
shouldn't.
This really is just a re-imagining of Einstein's articulation of Occam's Razor:
Make everything as simple as possible, but no simpler.
If you want to ensure portability to everything standard-conforming, including platforms for which CHAR_BIT isn't 8, well, you've got your work cut out for you.
If you are comfortable limiting yourself to 98% of the computers you'll ever program, I recommend writing explicit serialization for anything that has to adhere to a particular wire-format. That includes breaking integers into bytes, etc.
Write appropriate abstractions around things and the code won't be too bad. Don't put shifts and masks everywhere. Encapsulate it.
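A minimal sketch of that kind of encapsulation, assuming 8-bit bytes on the wire and big-endian field order (the helper names are invented):

#include <cstdint>
#include <vector>

// Append a 16-bit value to a buffer in big-endian (network) order.
inline void put_u16_be(std::vector<unsigned char>& buf, std::uint16_t v) {
    buf.push_back(static_cast<unsigned char>(v >> 8));
    buf.push_back(static_cast<unsigned char>(v & 0xFF));
}

// Read it back; the shifts and masks live in exactly one place.
inline std::uint16_t get_u16_be(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}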
I would use network types and network byte order. See this link: http://www.beej.us/guide/bgnet/output/html/multipage/htonsman.html. The example uses uint16_t. You can write the values a field at a time to prevent padding.
Or, if you want to read and write the entire structure at once, see this link: C++ struct alignment question
Make the structure easy for the program to use.
Provide input methods that extract data from the input and write to the data members. This removes the issue of padding, alignment boundaries and endianness. Similarly with output.
For example, if your input data is 16-bits wide, but your platform is 32-bits wide, declare the structure using 32-bit fields. Copy the 16 bits from the input into the 32-bit fields.
Most programs read into a structure fewer times than they access the data members. Your program is not reading the input 100% of the time.
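For example, a hypothetical read method along those lines (type and member names invented), turning two 16-bit big-endian wire fields into wider in-memory members:

#include <cstdint>
#include <istream>

struct Record {
    std::uint32_t id;    // stored wider than the 16-bit wire field
    std::uint32_t data;

    // Read one 16-bit big-endian field from the stream.
    static std::uint32_t read_u16(std::istream& in) {
        unsigned char b[2];
        in.read(reinterpret_cast<char*>(b), 2);
        return static_cast<std::uint32_t>(b[0]) << 8 | b[1];
    }

    void read(std::istream& in) {   // padding/alignment of Record never touches the wire
        id   = read_u16(in);
        data = read_u16(in);
    }
};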
Possible Duplicate:
When is it worthwhile to use bit fields?
I was looking up bitwise operators recently and stumbled upon the concept of the bitfield. It seems interesting and is a very cool concept, but when and/or why would a person use this in their code?
I know it's used quite a bit in embedded systems programming, but why (I can't seem to find anything about why it's useful)? Are there any advantages to it? And where are some other places bitfields are useful?
In general, use bitfields when you don't care about speed and you don't care about memory layout. If you care about these things, then don't use bitfields.
If you have a set of boolean flags, you can pack them using bitfields (reducing the size needed to store them). However, access those flags only through the bitfield.
It is the classic size vs. speed problem.
An additional caveat is that if you have a set of bitfields that are smaller than the native word, then your compiler will probably try to pad and align the bitfield struct. So you end up having to #pragma pack the struct or use at least a native word. So if you are on a 32-bit machine and you happen to have 32 boolean flags that are only used internally, then this would be a good use of bitfields.
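A sketch of that internal-only flag pack (field names invented; the size check typically holds on mainstream ABIs but is not guaranteed by the standard):

#include <cstdint>

// 32 internal boolean flags packed into one 32-bit word.
// Fine for in-process use; the exact bit layout is the compiler's business.
struct Flags {
    std::uint32_t dirty    : 1;
    std::uint32_t visible  : 1;
    std::uint32_t selected : 1;
    std::uint32_t cached   : 1;
    std::uint32_t reserved : 28;   // remaining flags, collapsed here for brevity
};

static_assert(sizeof(Flags) == 4, "expected the flags to fit one 32-bit word");

// Usage sketch: Flags f{}; f.dirty = 1; if (f.visible) { /* ... */ }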
Some uses that immediately come to mind are:
implementing communications protocols;
storing user data in objects where you have limited space;
extending data structures in existing protocols (similar to the above);
performing multiple tests in a single operation.
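The last item is usually done with plain masks rather than named bit fields; a small sketch with invented flag constants:

#include <cstdint>

// Invented status bits for illustration.
constexpr std::uint8_t READY = 1u << 0;
constexpr std::uint8_t ERROR = 1u << 1;
constexpr std::uint8_t BUSY  = 1u << 2;

// One AND plus one compare tests both conditions at once.
inline bool ready_and_idle(std::uint8_t status) {
    return (status & (READY | BUSY)) == READY;   // READY set, BUSY clear
}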
I have used bitfields as parts of unions to describe registers in embedded systems, i.e. control registers of microcontrollers and codecs. They are very useful in depicting the physical layout of registers as software constructs, thereby improving readability. They were commonly used for device driver implementations. A few years back, 8-bit micros with very little flash and RAM memory were common, and therefore bitfields were common. These days, 32-bit micros with lots of RAM/flash mean that bitfields are not necessary.
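A sketch of that vendor-header idiom (register and field names invented). Note that it depends on the specific compiler's bit field layout, and reading one union member after writing the other is type punning, which is exactly why vendors tie such headers to their own compiler:

#include <cstdint>

// Hypothetical 8-bit control register of a peripheral.
union ControlReg {
    std::uint8_t raw;            // whole-register access
    struct {
        std::uint8_t enable : 1;
        std::uint8_t mode   : 2;
        std::uint8_t irq_en : 1;
        std::uint8_t unused : 4;
    } bits;                      // per-field access; layout is compiler-specific
};

// Usage sketch (the address is made up):
// volatile ControlReg* ctrl = reinterpret_cast<volatile ControlReg*>(0x40000000);
// ctrl->bits.enable = 1;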
The new C++ standard still refuses to specify the binary representation of integer types. Is this because there are real-world implementations of C++ that don't use 2's complement arithmetic? I find that hard to believe. Is it because the committee feared that future advances in hardware would render the notion of 'bit' obsolete? Again hard to believe. Can anyone shed any light on this?
Background: I was surprised twice in one comment thread (Benjamin Lindley's answer to this question). First, from piotr's comment:
Right shift on signed type is undefined behaviour
Second, from James Kanze's comment:
when assigning to a long, if the value doesn't fit in a long, the results are implementation defined
I had to look these up in the standard before I believed them. The only reason for them is to accommodate non-2's-complement integer representations. WHY?
(Edit: C++20 now mandates two's complement representation. Note that overflow of signed arithmetic is still undefined, and shifts continue to have undefined and implementation-defined behaviour in some cases.)
A major problem with defining something which is currently undefined is that compilers were built assuming it is undefined. Changing the standard won't change the compilers, and reviewing them to find out where the assumption was made is a difficult task.
Even on two's complement machines, you may have more variety than you think. Two examples: some don't have a sign-preserving right shift, just a right shift which introduces zeros; and a common feature in DSPs is saturating arithmetic, where assigning an out-of-range value clips it at the maximum rather than just dropping the high-order bits.
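To make the right-shift point concrete, here is a small sketch of a helper that gives sign-preserving shift behaviour without relying on what the implementation does with negative operands (the function name is made up; it assumes 0 <= n < 32):

#include <cstdint>

// Sign-preserving ("arithmetic") right shift with fully defined behaviour,
// written only in terms of unsigned shifts and in-range conversions.
inline std::int32_t asr(std::int32_t x, unsigned n) {
    if (x >= 0)
        return static_cast<std::int32_t>(static_cast<std::uint32_t>(x) >> n);
    // For negative x: floor(x / 2^n) == -((-x - 1) / 2^n) - 1 in exact integer maths.
    std::uint32_t mag = ~static_cast<std::uint32_t>(x);   // equals -x - 1, fits in 31 bits
    return -static_cast<std::int32_t>(mag >> n) - 1;
}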
I suppose it is because the Standard says, in 3.9.1[basic.fundamental]/7
this International Standard permits 2’s complement, 1’s complement and signed magnitude representations for integral types.
which, I am willing to bet, came along from the C programming language, which lists sign and magnitude, two's complement, and one's complement as the only allowed representations in 6.2.6.2/2. And there sure were 1's complement systems around when C was wide-spread: UNIVACs are the most often mentioned, it seems.
It seems to me that, even today, if you are writing a broadly-applicable C++ library that you expect to run on any machine, then 2's complement cannot be assumed. C++ is just too widely used to be making assumptions like that.
Most people don't write those sorts of libraries, though, so if you want to take a dependency on 2's complement you should just go ahead.
Many aspects of the language standard are as they are because the Standards Committee has been extremely loath to forbid compilers from behaving in ways that existing code may rely upon. If code exists which would rely upon one's complement behavior, then requiring that compilers behave as though the underlying hardware uses two's complement would make it impossible for the older code to run using newer compilers.
The solution, which the Standards Committee has alas not yet seen fit to implement, would be to allow code to specify the desired semantics for things in a fashion independent of the machine's word size or hardware characteristics. If support for code which relies upon ones'-complement behavior is deemed important, design a means by which code could expressly demand one's-complement behavior regardless of the underlying hardware platform. If desired, to avoid overly complicating every single compiler, specify that certain aspects of the standard are optional, but conforming compilers must document which aspects they support. Such a design would allow compilers for ones'-complement machines to support both two's-complement behavior and ones'-complement behavior depending upon the needs of the program. Further, it would make it possible to port the code to two's-complement machines with compilers that happened to include ones'-complement support.
I'm not sure exactly why the Standards Committee has as yet not allowed any way by which code can specify behavior in a fashion independent of the underlying architecture and word size (so that code wouldn't have some machines use signed semantics for comparisons where other machines would use unsigned semantics), but for whatever reason they have yet to do so. Support for ones'-complement representation is but a part of that.
This question already has answers here:
What platforms have something other than 8-bit char?
All the time I read sentences like
don't rely on 1 byte being 8 bit in size
use CHAR_BIT instead of 8 as a constant to convert between bits and bytes
et cetera. What real life systems are there today, where this holds true?
(I'm not sure if there are differences between C and C++ regarding this, or if it's actually language agnostic. Please retag if necessary.)
On older machines, codes smaller than 8 bits were fairly common, but most of those have been dead and gone for years now.
C and C++ have mandated a minimum of 8 bits for char, at least as far back as the C89 standard. [Edit: For example, C90, §5.2.4.2.1 requires CHAR_BIT >= 8 and UCHAR_MAX >= 255. C89 uses a different section number (I believe that would be §2.2.4.2.1) but identical content]. They treat "char" and "byte" as essentially synonymous [Edit: for example, CHAR_BIT is described as: "number of bits for the smallest object that is not a bitfield (byte)".]
There are, however, current machines (mostly DSPs) where the smallest type is larger than 8 bits -- a minimum of 12, 14, or even 16 bits is fairly common. Windows CE does roughly the same: its smallest type (at least with Microsoft's compiler) is 16 bits. They do not, however, treat a char as 16 bits -- instead they take the (non-conforming) approach of simply not supporting a type named char at all.
TODAY, in the world of C++ on x86 processors, it is pretty safe to rely on one byte being 8 bits. Processors where the word size is not a power of 2 (8, 16, 32, 64) are very uncommon.
IT WAS NOT ALWAYS SO.
The Control Data 6600 (and its brothers) Central Processor used a 60-bit word, and could only address a word at a time. In one sense, a "byte" on a CDC 6600 was 60 bits.
The DEC-10 byte pointer hardware worked with arbitrary-size bytes. The byte pointer included the byte size in bits. I don't remember whether bytes could span word boundaries; I think they couldn't, which meant that you'd have a few wasted bits per word if the byte size was not 3, 4, 9, or 18 bits. (The DEC-10 used a 36-bit word.)
Unless you're writing code that could be useful on a DSP, you're completely entitled to assume bytes are 8 bits. All the world may not be a VAX (or an Intel), but all the world has to communicate, share data, establish common protocols, and so on. We live in the internet age built on protocols built on octets, and any C implementation where bytes are not octets is going to have a really hard time using those protocols.
It's also worth noting that both POSIX and Windows have (and mandate) 8-bit bytes. That covers 100% of interesting non-embedded machines, and these days a large portion of non-DSP embedded systems as well.
From Wikipedia:
The size of a byte was at first selected to be a multiple of existing teletypewriter codes, particularly the 6-bit codes used by the U.S. Army (Fieldata) and Navy. In 1963, to end the use of incompatible teleprinter codes by different branches of the U.S. government, ASCII, a 7-bit code, was adopted as a Federal Information Processing Standard, making 6-bit bytes commercially obsolete. In the early 1960s, AT&T introduced digital telephony first on long-distance trunk lines. These used the 8-bit µ-law encoding. This large investment promised to reduce transmission costs for 8-bit data. The use of 8-bit codes for digital telephony also caused 8-bit data "octets" to be adopted as the basic data unit of the early Internet.
As an average programmer on mainstream platforms, you do not need to worry too much about one byte not being 8 bit. However, I'd still use the CHAR_BIT constant in my code and assert (or better static_assert) any locations where you rely on 8 bit bytes. That should put you on the safe side.
(I am not aware of any relevant platform where it doesn't hold true).
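A minimal sketch of that guard, for code that simply requires octets:

#include <climits>

// Fail the build, rather than misbehave at run time, on platforms without 8-bit bytes.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");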
Firstly, the number of bits in char does not formally depend on the "system" or on "machine", even though this dependency is usually implied by common sense. The number of bits in char depends only on the implementation (i.e. on the compiler). There's no problem implementing a compiler that will have more than 8 bits in char for any "ordinary" system or machine.
Secondly, there are several embedded platforms where sizeof(char) == sizeof(short) == sizeof(int), each having 16 bits (I don't remember the exact names of these platforms). Also, the well-known Cray machines had similar properties with all these types having 32 bits in them.
I do a lot of embedded work and am currently working on DSP code with a CHAR_BIT of 16.
Historically, there have existed a bunch of odd architectures that did not use native word sizes that were multiples of 8. If you ever come across any of these today, let me know.
The first commercial CPU by Intel was the Intel 4004 (4-bit)
PDP-8 (12-bit)
The size of the byte has historically been hardware dependent and no definitive standards exist that mandate the size.
It might just be a good thing to keep in mind if you're doing lots of embedded stuff.
Adding one more as a reference, from Wikipedia entry on HP Saturn:
The Saturn architecture is nibble-based; that is, the core unit of data is 4 bits, which can hold one binary-coded decimal (BCD) digit.
I need to pack some bits in a byte in this fashion:
struct
{
    char bit0 : 1;
    char bit1 : 1;
} a;
if( a.bit1 ) /* etc */
or, with a declared as a plain unsigned char instead:
if( a & 0x2 ) /* etc */
From the source code clarity it's pretty obvious to me that bitfields are neater. But which option is faster? I know the speed difference won't be too much if any, but as I can use any of them, if one's faster, better.
On the other hand, I've read that bitfields are not guaranteed to arrange bits in the same order across platforms, and I want my code to be portable.
Notes: If you plan to answer 'Profile', OK, I will, but as I'm lazy, if someone already has the answer, so much the better.
The code may be wrong, you can correct me if you want, but remember what the point to this question is and please try and answer it too.
Bitfields make the code much clearer if they are used appropriately. I would use bitfields as a space saving device only. One common place I've seen them used is in compilers: Often type or symbol information consists of a bunch of true/false flags. Bitfields are ideal here since a typical program will have many thousands of these nodes created when it is compiled.
I wouldn't use bitfields to do a common embedded programming job: reading and writing device registers. I prefer using shifts and masks here because you get exactly the bits the documentation tells you you need and you don't have to worry about differences in various compilers implementation of bitfields.
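A small sketch of that mask-and-shift style for a hypothetical status register, with the bit positions written exactly as an imagined datasheet would list them:

#include <cstdint>

// Bit positions exactly as the (hypothetical) datasheet lists them.
constexpr std::uint32_t STATUS_READY_BIT  = 1u << 0;
constexpr std::uint32_t STATUS_MODE_SHIFT = 4;
constexpr std::uint32_t STATUS_MODE_MASK  = 0x7u << STATUS_MODE_SHIFT;

inline std::uint32_t read_mode(std::uint32_t reg) {
    return (reg & STATUS_MODE_MASK) >> STATUS_MODE_SHIFT;
}

inline std::uint32_t set_mode(std::uint32_t reg, std::uint32_t mode) {
    // Classic read-modify-write: clear the field, then insert the new value.
    return (reg & ~STATUS_MODE_MASK) | ((mode << STATUS_MODE_SHIFT) & STATUS_MODE_MASK);
}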
As for speed, a good compiler will generate the same code for bitfields as it will for masking.
I would rather use the second example for maximum portability. As Neil Butterworth pointed out, using bitfields is only for the native processor. OK, think about this: what happens if Intel's x86 goes out of business tomorrow? The code will be stuck, which means having to re-implement the bitfields for another processor, say a RISC chip.
You have to look at the bigger picture and ask how OpenBSD managed to port their BSD systems to a lot of platforms using one codebase. OK, I'll admit that is a bit over the top, debatable, and subjective, but realistically speaking, if you want to port the code to another platform, the way to do it is by using the second example from your question.
Not only that, compilers for different platforms have their own ways of padding and aligning bitfields for the processor the compiler targets. And furthermore, what about the endianness of the processor?
Never rely on bitfields as a magic bullet. If you want speed for the processor and will be fixed on it, i.e. no intention of porting, then feel free to use bitfields. You cannot have both!
C bitfields were stillborn from the moment they were invented, for unknown reasons. People didn't like them and used bitwise operators instead. You have to expect other developers to not understand C bitfield code.
With regard to which is faster: Irrelevant. Any optimizing compiler (which means practically all) will make the code do the same thing in whatever notation. It's a common misconception of C programmers that the compiler will just search-and-replace keywords into assembly. Modern compilers use the source code as a blueprint to what shall be achieved and then emit code that often looks very different but achieves the intended result.
The first one is explicit, and whatever the speed, the second expression is error-prone because any change to your struct might make it wrong.
So use the first.
If you want portability, avoid bitfields. And if you are interested in performance of specific code, there is no alternative to writing your own tests. Remember, bitfields will be using the processor's bitwise instructions under the hood.
I think a C programmer would tend toward the second option of using bit masks and logic operations to deduce the value of each bit. Rather than having the code littered with hex values, enumerations would be set up, or, usually when more complex operations are involved, macros to get/set specific bits. I've heard on the grapevine that struct-implemented bitfields are slower.
Don't read too much into the "non-portable bit-fields" warnings. There are two aspects of bit fields which are implementation-defined, the signedness and the layout, and one unspecified one, the alignment of the allocation unit in which they are packed. If you don't need anything other than the packing effect, using them is as portable (provided you specify a signed keyword explicitly where needed) as function calls, which also have unspecified properties.
Concerning the performance, profiling is the best answer you can get. In a perfect world, there would be no difference between the two writings. In practice there can be some, but I can think of as many reasons in one direction as in the other. And it can be very sensitive to context (a logically meaningless difference between unsigned and signed, for instance), so measure in context...
To sum up, the difference is mostly a style difference in cases where you truly have the choice (i.e. not when the precise layout is important). In those cases, it is an optimization (in size, not in speed), and so I'd tend to write the code without it first and add it afterwards if needed. So bit fields are the obvious choice there (the modifications needed to achieve the result are the smallest ones and are contained to the single place of definition instead of being spread over all the places of use).