Issues with C++ bitfields

I have to write a file header with a specific data format. For simplicity, let's just assume it is:
bits [0-7]: index a
bits [8-9]: index b
bits [10-15]: index c
All of them are simple unsigned integers. I thought I might use bit fields to get a nice syntax. I defined
struct Foo {
    unsigned int a : 8, b : 2, c : 6;
};
However, I get sizeof(Foo) == 4. Why is that so? I expected a 2-byte structure here. Is the compiler adding padding between my fields? If I use unsigned char as my member type, I get a size of 2 bytes.
On cppreference, it says:
Multiple adjacent bit fields are usually packed together (although
this behavior is implementation-defined).
Does that mean that I cannot rely on the fields being packed together? Eventually, I will use memcpy to turn this struct into a stream of bytes and write that to a file. Is that not a good use of bit fields? This will only work if these bits are guaranteed to be packed together.
EDIT: The actual header relates to the GIF format. Many indexes are packed into just a few bytes. Some of them are made up of 1, 2, 3 or more bits.

From [class.bit]/1 [extract]:
[...] Allocation of bit-fields within a class object is implementation-defined. Alignment of bit-fields is implementation-defined.
and, from [defns.impl.defined]:
implementation-defined behavior
behavior, for a well-formed program construct and correct data, that
depends on the implementation and that each implementation documents
Thus, for a portable implementation you cannot rely on any specific kind of behaviour for implementation-defined behaviour. If you are developing for a particular platform and compiler, however, you could rely on documented implementation-defined behaviour to a certain extent.
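For a header like the one in the question, a portable alternative is to compose the bytes yourself with shifts and masks, so the on-disk layout never depends on implementation-defined bit-field allocation. A minimal sketch (packFoo and writeLE16 are illustrative names, not a standard API; GIF stores multi-byte values little-endian):

#include <cstdint>

// Pack the three indexes into 16 bits with an explicit, documented layout.
std::uint16_t packFoo(unsigned a, unsigned b, unsigned c) {
    return static_cast<std::uint16_t>( (a & 0xFFu)           // bits 0-7:   index a
                                     | ((b & 0x03u) << 8)    // bits 8-9:   index b
                                     | ((c & 0x3Fu) << 10)); // bits 10-15: index c
}

// Write the 16-bit value as two little-endian bytes, independent of host order.
void writeLE16(std::uint16_t v, unsigned char out[2]) {
    out[0] = static_cast<unsigned char>(v & 0xFFu);
    out[1] = static_cast<unsigned char>(v >> 8);
}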

Related

Is Byte Really The Minimum Addressable Unit?

Section 3.6 of the C11 standard defines "byte" as an "addressable unit of data storage ... to hold ... character".
Section 1.7 of the C++11 standard defines "byte" as "the fundamental storage unit in the C++ memory model ... to contain ... character".
Neither definition says that a "byte" is the minimum addressable unit. Is this because the standards intentionally want to abstract away from any specific machine? Can you provide a real example of a machine where the C/C++ implementation chose a "byte" longer or shorter than the minimum addressable unit?
A byte is the smallest addressable unit in strictly conforming C code. Whether the machine on which the C implementation executes a program supports addressing smaller units is irrelevant to this; the C implementation must present a view in which bytes are the smallest addressable unit in strictly conforming C code.
A C implementation may support addressing smaller units as an extension, such as simply by defining the results of certain pointer operations that are otherwise undefined by the C standard.
One example of a real machine and its compiler where the minimal addressable unit is smaller than a byte is the 8051 family. One compiler I used is Keil C51.
The minimal addressable unit is a bit. You can define a variable of this type, you can read and write it. However, the syntax to define the variable is non-standard. Of course, C51 needs several extensions to support all of this. BTW, pointers to bits are not allowed.
For example:
unsigned char bdata bitsAdressable;    /* placed in the bit-addressable RAM area */
sbit bitAddressed = bitsAdressable^5;  /* names bit 5 of bitsAdressable */

void f(void) {
    bitAddressed = 1;                  /* writes that single bit */
}

bit singleBit;                         /* a standalone one-bit variable */

void g(bit value) {
    singleBit = value;
}
Neither definition says that "byte" is the minimum addressable unit.
That's because they don't need to. Byte-wise types (char, unsigned char, std::byte, etc) have sufficient restrictions that enforce this requirement.
The size of byte-wise types is explicitly defined to be precisely 1:
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1.
The alignment of byte-wise types is the smallest alignment possible:
Furthermore, the narrow character types (6.9.1) shall have the weakest alignment requirement
This doesn't have to be an alignment of 1, of course. Except... it does.
See, if the alignment were higher than 1, that would mean that a simple byte array wouldn't work. Array indexing is based on pointer arithmetic, and pointer arithmetic determines the next address based on sizeof(T). But if alignof(T) is greater than sizeof(T), then the second element in any array of T would be misaligned. That's not allowed.
So even though the standard doesn't explicitly say that the alignment of bytewise types is 1, other requirements ensure that it must be.
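Both facts are directly checkable on any C++11 implementation:

static_assert(sizeof(char) == 1, "a char is always exactly one byte");
static_assert(alignof(char) == 1, "its alignment therefore cannot exceed 1");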
Overall, this means that every pointer to an object has an alignment at least as restrictive as a byte-wise type. So no object pointer can be misaligned, relative to the alignment of byte-wise types. All valid, non-null pointers (pointers to a live object, or one past the end of an array) must therefore be aligned at least well enough to point to a char.
Similarly, the difference between two pointers is defined in C++ as the difference between the array indices of the elements pointed to by those pointers (pointer arithmetic in C++ requires that the two pointers point into the same array). Additive pointer arithmetic is, as previously stated, based on the sizeof of the pointed-to type.
Given all of these facts, even if an implementation has pointers whose addresses can address values smaller than char, it is functionally impossible for the C++ abstract model to generate a pointer and still have that pointer count as valid (pointing to an object/function, a past-the-end of an array, or be NULL). You could create such a pointer value with a cast from an integer. But you would be creating an invalid pointer value.
So while technically there could be smaller addresses on the machine, you could never actually use them in a valid, well-formed C++ program.
Obviously compiler extensions could do anything. But as far as conforming programs are concerned, it simply isn't possible to generate valid pointers that are misaligned for byte-wise types.
I programmed both the TMS34010 and its successor TMS34020 graphics chips back in the early 1990's and they had a flat address space and were bit addressable i.e. addresses indexed each bit. This was very useful for computer graphics of the time and back when memory was a lot more precious.
The embedded C compiler didn't really have a way to access individual bits directly, since from a (standard) C language point of view the byte was still the smallest unit, as pointed out in a previous post.
Thus if you want to read/write a stream of bits in C, you need to read/write (at least) a byte at a time and buffer, for example when writing an arithmetic or Huffman coder.
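A minimal sketch of such a bit buffer in C++ (BitWriter is an illustrative name, not part of any library):

#include <cstdint>
#include <vector>

// Accumulate bits into a byte and emit each byte once full, since a byte is
// the smallest unit that standard C/C++ can address or write.
class BitWriter {
    std::vector<std::uint8_t> out_;
    std::uint8_t cur_ = 0;
    int nbits_ = 0;
public:
    void putBit(bool bit) {
        cur_ = static_cast<std::uint8_t>((cur_ << 1) | (bit ? 1u : 0u));
        if (++nbits_ == 8) { out_.push_back(cur_); cur_ = 0; nbits_ = 0; }
    }
    void flush() {                    // zero-pad any partial final byte
        while (nbits_ != 0) putBit(false);
    }
    const std::vector<std::uint8_t>& bytes() const { return out_; }
};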
(Thank you everyone who commented and answered, every word helps)
The memory model of a programming language and the memory model of the target machine are different things.
Yes, the byte is the minimum addressable unit in the context of the memory model of the programming language.
No, the byte is not the minimum addressable unit in the context of the memory model of the machine. For example, there are machines where the programming language's "byte" is longer or shorter than the machine's minimum addressable unit:
longer: HP Saturn - 8-bit byte vs 4-bit unit, under gcc (thanks Nate).
shorter: IBM 360 - 6-bit byte vs 36-bit unit (thanks Antti).
longer: Intel 8051 - 8-bit byte vs 1-bit unit (thanks Busybee).
longer: Ti TMS34010 - 8-bit byte vs 1-bit unit (thanks Wcochran).

How to guarantee a C++ type's number of bits

I am looking to typedef my own arithmetic types (e.g. Byte8, Int16, Int32, Float754, etc) with the intention of ensuring they comprise a specific number of bits (and in the case of the float, adhere to the IEEE754 format). How can I do this in a completely cross-platform way?
I have seen snippets of the C/C++ standards here and there and there is a lot of:
"type is at least x bytes"
and not very much of:
"type is exactly x bytes".
Given that typedef Int16 unsigned short int may not necessarily result in a 16-bit Int16, is there a cross-platform way to guarantee my types will have specific sizes?
You can use the exact-width integer types int8_t, int16_t, int32_t, int64_t declared in <cstdint>. That way the sizes are fixed on every platform that provides them.
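For example, a sketch using the asker's names (the IEEE 754 assumption for float is separate and must be verified, e.g. via std::numeric_limits):

#include <cstdint>
#include <limits>

using Byte8 = std::uint8_t;
using Int16 = std::int16_t;
using Int32 = std::int32_t;
using Float754 = float;  // assumption: float is IEEE 754 binary32 on this platform

static_assert(sizeof(Int16) == 2 && sizeof(Int32) == 4, "unexpected type sizes");
static_assert(std::numeric_limits<Float754>::is_iec559, "float is not IEEE 754 here");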
The only available way to truly guarantee an exact number of bits is to use a bit-field:
struct X {
    int abc : 14; // exactly 14 bits, regardless of platform
};
There is some upper limit on the size you can specify this way -- at least 16 bits for int, and 32 bits for long (but a modern platform may easily allow up to 64 bits for either). Note, however, that while this guarantees that arithmetic on X::abc will use (or at least emulate) exactly 14 bits, it does not guarantee that the size of a struct X is the minimum number of bytes necessary to provide 14 bits (e.g., given 8-bit bytes, its size could easily be 4 or 8 instead of the 2 that are absolutely necessary).
The C and C++ standards both now include a specification for fixed-size types (e.g., int8_t, int16_t), but no guarantee that they'll be present. They're required if the platform provides the right type, but otherwise won't be present. If memory serves, these are also required to use a 2's complement representation, so a platform with a 16-bit 1's complement integer type (for example) still won't define int16_t.
Have a look at the types declared in stdint.h. This is part of the standard library, so it is expected (though technically not guaranteed) to be available everywhere. Among the types declared here are int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, and uint64_t. Local implementations will map these types to the appropriate-width type for the given compiler and architecture.
This is not possible.
There are platforms where char is 16 or even 32 bits.
Note that I'm not saying there are in theory platforms where this happens... it is a real and quite concrete possibility (e.g. DSPs).
On that kind of hardware there is just no way to use only 8 bits for an operation; if, for example, you need 8-bit modular arithmetic, then the only way is to do a masking operation yourself.
The C language doesn't provide this kind of emulation for you...
With C++ you could try to build a class that behaves like the expected native elementary type in most cases (with the exclusion of sizeof, obviously). The result will, however, have truly horrible performance.
I can think of no use case in which forcing the hardware this way against its nature would be a good idea.
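To illustrate the masking approach mentioned above (a sketch; u8_add is an illustrative name):

// Emulate 8-bit modular arithmetic on hardware with no 8-bit type by
// masking after each operation, so the value wraps at 256 like a uint8_t.
unsigned u8_add(unsigned a, unsigned b) {
    return (a + b) & 0xFFu;
}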
It is possible to use C++ templates at compile time to check for, and create, new types that fit your requirements, specifically that sizeof() of the type is the size you want.
Take a look at this code: Compile time "if".
Do note that if the requested type is not available then your program will simply not compile. It depends on whether that works for you or not!
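A sketch of that idea with standard C++11 facilities (ExactInt is an illustrative name): nested std::conditional picks the first standard integer type with the requested width, and a static_assert stops compilation if there is none.

#include <climits>
#include <cstddef>
#include <type_traits>

template <std::size_t Bits>
using ExactInt = typename std::conditional<
    sizeof(signed char) * CHAR_BIT == Bits, signed char,
    typename std::conditional<
        sizeof(short) * CHAR_BIT == Bits, short,
        typename std::conditional<
            sizeof(int) * CHAR_BIT == Bits, int,
            typename std::conditional<
                sizeof(long) * CHAR_BIT == Bits, long,
                long long>::type>::type>::type>::type;

static_assert(sizeof(ExactInt<16>) * CHAR_BIT == 16,
              "no standard integer type with exactly 16 bits on this platform");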

Is the byte alignment requirement of a given data type guaranteed to be a power of 2?

Is the byte alignment requirement of a given data type guaranteed to be a power of 2?
Is there something that provides this guarantee other than it "not making sense otherwise" because it wouldn't line up with system page sizes?
(background: C/C++, so feel free to assume data type is a C or C++ type and give C/C++ specific answers.)
Alignment requirements are based on the hardware. Most, if not all, "modern" chips have word sizes divisible by 8 bits, not just a power of 2. In the past there were chips whose word sizes were not divisible by 8 (I know of a 36-bit architecture).
Things you can assume about alignment, per the C standard:
The alignment requirement of any type divides the size of that type (as determined by sizeof).
The character types char, signed char, and unsigned char have no alignment requirement. (This is actually just a special case of the first point.)
In the modern real world, integer and pointer types have sizes that are powers of two, and their alignment requirements are usually equal to their sizes (the only exception being long long on 32-bit machines). Floating point is a bit less uniform. On 32-bit machines, all floating point types typically have an alignment of 4, whereas on 64-bit machines, the alignment requirement of floating point types is typically equal to the size of the type (4, 8, or 16).
The alignment requirement of a struct should be the least common multiple of the alignment requirements of its members, but a compiler is allowed to impose stricter alignment. However, normally each cpu architecture has an ABI standard that includes alignment rules, and compilers which do not adhere to the standard will generate code that cannot be linked with code built by compilers which follow the ABI standard, so it would be very unusual for a compiler to break from the standard except for very special-purpose use.
By the way, a useful macro that will work on any sane compiler is:
#define alignof(T) ((char *)&((struct { char x; T t; } *)0)->t - (char *)0)
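Note that in C++11 and later, alignof is a keyword, so the macro would need another name there; the built-in operator reports the same information:

// C++11: the language provides alignof directly.
static_assert(alignof(double) <= sizeof(double), "alignment never exceeds size");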
The alignment of a field inside a "struct", optimized for size, could very well be on an odd boundary. Other than that, your "it wouldn't make sense" would probably apply, but I think there is NO guarantee, especially if the program was built with a small memory model, optimized for size. - Joe
The standard doesn't require any particular alignment, but allows structs/unions/bit-fields to silently add padding bytes to get correct alignment. The compiler is also free to align all your data types on even addresses should it desire.
That being said, this is CPU dependent, and I don't believe there exists a CPU that has an alignment requirement on odd addresses. There are plenty of CPUs with no alignment requirements however, and the compiler may then place variables at any address.
In short, no. It depends on the hardware.
However, most modern CPUs either do byte alignment (e.g., Intel x86 CPUs), or word alignment (e.g., Motorola, IBM/390, RISC, etc.).
Even with word alignment, it can be complicated. For example, a 16-bit word would be aligned on a 2-byte (even) address, a 32-bit word on a 4-byte boundary, but a 64-bit value may only require 4-byte alignment instead of an 8-byte aligned address.
For byte-aligned CPUs, it's also a function of the compiler options. The default alignment for struct members can usually be specified (usually also with a compiler-specific #pragma).
For basic data types (ints, floats, doubles) usually the alignment matches the size of the type. For classes/structs, the alignment is at least the lowest common multiple of the alignment of all its members (that's the standard)
In Visual Studio you can set your own alignment for a type, but it has to be a power of 2, between 1 and 8192.
In GCC there is a similar mechanism, but it has no such requirement (at least in theory)
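In standard C++11 the same request is spelled alignas, and the value must likewise be a power of two (a minimal sketch):

// Request 16-byte alignment, e.g. so the struct can be used with SIMD loads.
struct alignas(16) Vec4 { float x, y, z, w; };
static_assert(alignof(Vec4) == 16, "extended alignment was not honored");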

Why the sizeof(bool) is not defined to be one, by the Standard itself?

Size of char, signed char and unsigned char is defined to be 1 byte, by the C++ Standard itself. I'm wondering why it didn't define the sizeof(bool) also?
C++03 Standard §5.3.3/1 says:
sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1; the result of sizeof applied to any other fundamental type (3.9.1) is implementation-defined. [Note: in particular, sizeof(bool) and sizeof(wchar_t) are implementation-defined.]
I understand the rationale that sizeof(bool) cannot be less than one byte. But is there any rationale why it should be greater than 1 byte either? I'm not saying that implementations define it to be greater than 1, but the Standard left it to be defined by implementation as if it may be greater than 1.
If there is no reason for sizeof(bool) to be greater than 1, then I don't understand why the Standard didn't define it as just 1 byte, as it did for sizeof(char) and all its variants.
The other likely size for it is that of int, being the "efficient" integer type for the platform.
On architectures where it makes any difference whether the implementation chooses 1 or sizeof(int) there could be a trade-off between size (but if you're happy to waste 7 bits per bool, why shouldn't you be happy to waste 31? Use bitfields when size matters) vs. performance (but when is storing and loading bool values going to be a genuine performance issue? Use int explicitly when speed matters). So implementation flexibility wins - if for some reason 1 would be atrocious in terms of performance or code size, it can avoid it.
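To illustrate the "use bitfields when size matters" suggestion, a sketch that packs several flags into single bits rather than whole bools:

// Three independent flags stored in single bits instead of three bools.
struct Flags {
    unsigned visible : 1;
    unsigned dirty   : 1;
    unsigned locked  : 1;
};
// On mainstream ABIs all three bits share one unsigned-sized allocation unit.
static_assert(sizeof(Flags) <= sizeof(unsigned), "flags did not pack");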
As MSalters pointed out, some platforms work more efficiently with larger data items.
Many "RISC" CPUs (e.g., MIPS, PowerPC, early versions of the Alpha) have/had a considerably more difficult time working with data smaller than one word, so they do the same. IIRC, with at least some compilers on the Alpha a bool actually occupied 64 bits.
gcc for PowerPC Macs defaulted to using 4 bytes for a bool, but had a switch to change that to one byte if you wanted to.
Even for the x86, there's some advantage to using a 32-bit data item. gcc for the x86 has (or at least used to have -- I haven't looked recently at all) a define in one of its configuration files for BOOL_TYPE_SIZE (going from memory, so I could have that name a little wrong) that you could set to 1 or 4, and then re-compile the compiler to get a bool of that size.
Edit: As for the reason behind this, I'd say it's a simple reflection of a basic philosophy of C and C++: leave as much room for the implementation to optimize/customize its behavior as reasonable. Require specific behavior only when/if there's an obvious, tangible benefit, and unlikely to be any major liability, especially if the change would make it substantially more difficult to support C++ on some particular platform (though, of course, if the platform is sufficiently obscure, it might get ignored).
Many platforms cannot effectively load values smaller than 32 bits. They have to load 32 bits, and use a shift-and-mask operation to extract 8 bits. You wouldn't want this for single bools, but it's OK for strings.
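Concretely, the shift-and-mask extraction looks like this (extract_byte is an illustrative name):

#include <cstdint>

// Pull byte n (0..3) out of a loaded 32-bit word.
std::uint8_t extract_byte(std::uint32_t word, unsigned n) {
    return static_cast<std::uint8_t>((word >> (8 * n)) & 0xFFu);
}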
The result of sizeof is measured in MADUs (minimum addressable units), not bytes. On the Texas Instruments C54x/C55x processor families, 1 MADU = 2 octets (16 bits).
For those platforms, sizeof(bool) = sizeof(char) = 1 MADU = 16 bits.
This does not violate the C++ standard, but it does clarify the situation.

How does endianness affect enumeration values in C++?

How does endianness affect enumeration values in C++?
Is the size of an enum dependent upon how many enumerators there are, so that some enums are 1 byte while others are 2 or 4 bytes?
How do I put an enumeration value into network byte order from host byte order?
Endianness affects enumerations no more or less than it does other integer types.
The only guarantee on size is that it must be possible for an enum to hold values of int; the compiler is free to choose the actual size based on the defined members (disclaimer: this is certainly the case for C, not 100% sure for C++; I'll check...)
Enums depend on the compiler. They can be 1, 2, or 4 bytes. They should have the same endianness as the platform they are used on.
To put an enum value into a specific byte order you would need to know what your system is and what the network expects. Then treat it as you would an int.
Same way it affects everything else.
The compiler is free to choose the minimum required space.
htons(), or if you know you have more than 64k values, htonl().
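Putting the three answers together, a sketch of round-tripping an enum over the network (Color, to_wire, and from_wire are illustrative names; htonl/ntohl come from the POSIX header <arpa/inet.h>):

#include <arpa/inet.h>  // htonl, ntohl
#include <cstdint>

enum Color { Red = 1, Green = 2, Blue = 3 };

// Go through a fixed-width integer so neither the enum's implementation-
// defined size nor the host's endianness leaks into the wire format.
std::uint32_t to_wire(Color c) {
    return htonl(static_cast<std::uint32_t>(c));
}

Color from_wire(std::uint32_t wire) {
    return static_cast<Color>(ntohl(wire));
}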