Union initialization - c++

What rules govern the uninitialized bytes of a union (assuming some bytes are initialized)?
Below is a 32-byte union of which I initialize only the first 16 bytes via the first member.
It seems the remaining bytes are zero-initialized. That's great for my use case, but I am wondering what rule is behind this; I was expecting garbage.
#include <cstdint>
#include <iostream>
using namespace std;
union Blah {
struct {
int64_t a;
int64_t b;
};
int64_t c[4];
};
int main()
{
Blah b = {{ 1, 2 }}; // initialize first member, so only the first 16 bytes.
// prints 1, 2, 0, 0 -- not 1, 2, <garbage>, <garbage>
cout << b.c[0] << ", " << b.c[1] << ", " << b.c[2] << ", " << b.c[3] << '\n';
return 0;
}
I've compiled on GCC 4.7.2 with -O3 -Wall -Wextra -pedantic (that last one required giving a name to the anonymous struct). That should hopefully rule out my simply getting lucky.
I've also tried to overlay two variables with two different scopes on the stack, but gcc didn't give them the same address.
I've also tried replacing the array with another struct, in case that mattered, but it didn't change anything.
I can't access online compilers from here, they're blocked by my work.

The most pertinent part of the C11 standard 6.2.6.1.7, while not speaking specifically to initialization:
When a value is stored in a member of an object of union type, the
bytes of the object representation that do not correspond to that
member but do correspond to other members take unspecified values.
Section 6.7.9.17 says:
Each brace-enclosed initializer list has an associated current object.
When no designations are present, subobjects of the current object are
initialized in order according to the type of the current object:
array elements in increasing subscript order, structure members in
declaration order, and the first named member of a union.
but doesn't explicitly come out and say the other bits are not initialized. For static unions, 6.7.9.10 says:
the first named member is initialized (recursively) according to these
rules, and any padding is initialized to zero bits;
so the first named member and any padding bits would be zero-initialized, but the bits corresponding to other (by implication, larger) members of the union would be unspecified.
So you cannot count on those extra bytes being initialized to zero.
Note that technically, even if you do initialize your c array to zero, the moment you store something in your struct those excess bits become unspecified again, and you can't count on them still being zero. There's a lot of code out there which assumes this is true (e.g. putting a char array in a union to access the individual bytes), and in reality it probably will be, but the standard doesn't guarantee it.
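If the zeros matter, one way to stop relying on unspecified bytes is to initialize through the largest member. Below is a minimal sketch, assuming you are free to reorder the union so the array comes first; the names Blah2, Pair and make_blah are hypothetical, not from the question. Aggregate initialization of an array value-initializes the elements you leave out, so the zeros are guaranteed and the array is the active member when you read it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical variant of the question's union: the array is the first
// member and the struct is named, so no compiler extension is needed.
union Blah2 {
    std::int64_t c[4];
    struct Pair {
        std::int64_t a;
        std::int64_t b;
    } s;
};

// Aggregate initialization targets the first member (the array).
// Elements without an explicit initializer are value-initialized (zero).
inline Blah2 make_blah(std::int64_t a, std::int64_t b) {
    Blah2 u = {{a, b}};
    return u;
}

inline void check_blah() {
    Blah2 u = make_blah(1, 2);
    assert(u.c[0] == 1 && u.c[1] == 2);
    assert(u.c[2] == 0 && u.c[3] == 0);  // guaranteed, not luck
}
```

This only covers the state right after initialization. As noted above, storing to the smaller member later leaves the remaining bytes unspecified again.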

Brace-enclosed initializers for a union are only permitted to initialize the first member. This is fine, and your initializer does initialize the anonymous struct, and causes the first member to be the active member.
In C++ only one member of a union may be active at any time. Trying to read the other members via the union causes undefined behaviour. Trying to read them by aliasing them as a character type gives unspecified values.

So I would say that the observed behavior is backed by the standard.
ISO/IEC 9899:201x in 6.7.9 (Initialization) statement 12 says:
If there are fewer initializers in a brace-enclosed list than there are elements or members
of an aggregate, or fewer characters in a string literal used to initialize an array of known
size than there are elements in the array, the remainder of the aggregate shall be
initialized implicitly the same as objects that have static storage duration.
Static objects are initialized to 0 (see 6.7.9.10 or The initialization of static variables in C).

Related

Is the handling of [[no_unique_address]] implementation-defined?

Below is excerpted from cppref but reduced to demo:
#include <iostream>
struct Empty {}; // empty class
struct W
{
char c[2];
[[no_unique_address]] Empty e1, e2;
};
int main()
{
std::cout << std::boolalpha;
// e1 and e2 cannot have the same address, but one of them can share with
// c[0] and the other with c[1]
std::cout << "sizeof(W) == 2 is " << (sizeof(W) == 2) << '\n';
}
The documentation says the output might be:
sizeof(W) == 2 is true
However, both gcc and clang output as follows:
sizeof(W) == 2 is false
Is the handling of [[no_unique_address]] implementation-defined?
See [intro.object]/8:
An object has nonzero size if ... Otherwise, if the object is a base class subobject of a standard-layout class type with no non-static data members, it has zero size.
Otherwise, the circumstances under which the object has zero size are implementation-defined.
Empty base class optimization became mandatory for standard-layout classes in C++11 (see here for discussion). Empty member optimization is never mandatory. It is implementation-defined, as you suspected.
Yes, virtually all aspects of no_unique_address are implementation-defined. It is a tool to allow for optimizations, not to enforce them.
That being said, you should never assume that no_unique_address will work when you attempt to have two subobjects with the same type. The standard still requires that all distinct subobjects of the same type have different addresses, no_unique_address or not. And while it is possible that the compiler could assign these empty subobjects distinct addresses by radically reordering the members... they're pretty much not going to do that.
Your best bet for reasonably taking advantage of no_unique_address optimizations is to never have two subobjects of the same type, and try to put all possibly empty members first. That is, you should expect implementations to assign an empty no_unique_address member to the offset of the next member (or to the offset of the containing struct as a whole).
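To make that advice concrete, here is a small sketch contrasting the two layouts. The type names (Empty1, Empty2, SameType, DistinctTypes) are hypothetical. The only thing the code relies on is the rule quoted above, that two subobjects of the same type must have distinct addresses; the resulting sizes are implementation-defined, so inspect them on your compiler rather than assuming a value.

```cpp
#include <cstddef>

struct Empty1 {};
struct Empty2 {};

// Two empty members of the same type: they must get distinct addresses,
// so the compiler may have to grow the struct to place them.
struct SameType {
    char c[2];
    [[no_unique_address]] Empty1 e1, e2;
};

// Distinct empty types placed first: each is allowed (not required)
// to share its address with another member.
struct DistinctTypes {
    [[no_unique_address]] Empty1 e1;
    [[no_unique_address]] Empty2 e2;
    char c[2];
};

// Always true: distinct subobjects of the same type never share an address.
inline bool same_type_members_distinct() {
    SameType s{};
    return static_cast<const void*>(&s.e1) != static_cast<const void*>(&s.e2);
}

// Implementation-defined results; print or inspect them on your compiler.
inline std::size_t size_same_type() { return sizeof(SameType); }
inline std::size_t size_distinct_types() { return sizeof(DistinctTypes); }
```

Printing size_same_type() and size_distinct_types() on your toolchain shows whether the optimization was actually applied.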

Do data member addresses lie between (this) and (this+1)?

Suppose that we have the following two inequalities inside a member function
this <= (void *) &this->data_member
and
&this->data_member < (void *) (this+1)
Are they guaranteed to be true?
(They seem to be true in a few cases that I checked.)
Edit: I missed ampersands, now it's the correct form of inequalities.
From the C++ standard draft N4713:
6.6.2 Object model [intro.object]/7
An object of trivially copyable or standard-layout type (6.7) shall occupy contiguous bytes of storage.
12.2 Class members [class.mem]/18
Non-static data members of a (non-union) class with the same access control (Clause 14) are allocated so that later members have higher addresses within a class object.
12.2 Class members [class.mem]/25
If a standard-layout class object has any non-static data members, its address is the same as the address of its first non-static data member. Otherwise, its address is the same as the address of its first base class subobject (if any).
Taking all the above together, we can say the first inequality holds for at least trivially copyable objects.
Also from the online cpp reference:
The result of comparing two pointers to objects (after conversions) is defined as follows:
1) If two pointers point to different elements of the same array, or to subobjects within different elements of the same array, the pointer to the element with the higher subscript compares greater. In other words, the results of comparing the pointers is the same as the result of comparing the indexes of the elements they point to.
2) If one pointer points to an element of an array, or to a subobject of the element of the array, and another pointer points one past the last element of the array, the latter pointer compares greater. Pointers to single objects are treated as pointers to arrays of one: &obj+1 compares greater than &obj (since C++17)
So if your data_member is not a pointer and has not been allocated memory separately, the inequalities you have posted hold for at least trivially copyable objects.
The full standard text amounts to this:
[expr.rel] - 4: The result of comparing unequal pointers to objects (footnote 82) is defined in terms of a partial order consistent with the following rules:
We are dealing with a partial order here, not a total order. That does mean a < b and b < c implies a < c, but not much else.
(Note 82 states that non-array objects are considered elements of a single-element array for this purpose, with the intuitive meaning/behavior of "pointer to element one past the end").
(4.1)
If two pointers point to different elements of the same array, or to subobjects thereof, the pointer to the element with the higher subscript is required to compare greater.
Pointers to different members are not pointers to (subobjects of) elements of the same array. This rule does not apply.
(4.2)
If two pointers point to different non-static data members of the same object, or to subobjects of such members, recursively, the pointer to the later declared member is required to compare greater provided the two members have the same access control ([class.access]), neither member is a subobject of zero size, and their class is not a union.
This rule only relates pointers to data members of the same object, not of a different object.
(4.3)
Otherwise, neither pointer is required to compare greater than the other.
Thus, you do not get any guarantees from the standard. Whether you can find a real-world system where you get a different result than you expect is another question.
The value of an object is within its representation, and this representation is a sequence of unsigned char: [basic.types]/4
The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T).
The value representation of an object of type T is the set of bits that participate in representing a value of type T.[...]
So for formalism fundamentalists, it is true that value is not defined but the term appears in the definition of access:[defns.access]:
read or modify the value of an object
So is the value of a subobject part of the value of a complete object? I suppose this is what is intended by the standard.
The comparison should be true if you cast object pointers to unsigned char*. (This is common practice, though it runs into an under-specification tracked as core issue #1701.)
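A minimal sketch of that unsigned char* check, for a plain class without virtual bases. The struct Plain and its member function are hypothetical names. std::less is used because it yields a total order over pointers even where the built-in < gives no guarantee; whether the byte pointers really denote elements of the object representation is the under-specified part mentioned above, so treat this as what works in practice rather than as a strict guarantee.

```cpp
#include <functional>

struct Plain {
    int first;
    double second;

    // Checks begin <= &second < end, comparing byte pointers.
    bool second_is_inside() const {
        using Byte = const unsigned char*;
        Byte begin  = reinterpret_cast<Byte>(this);
        Byte end    = reinterpret_cast<Byte>(this + 1);
        Byte member = reinterpret_cast<Byte>(&second);
        std::less<Byte> before;  // total order over pointers
        return !before(member, begin) && before(member, end);
    }
};
```

Calling second_is_inside() on a Plain object is expected to return true on mainstream implementations.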
No, here's a counterexample
#include <iostream>
#include <memory> // std::addressof
struct A
{
int a_member[10];
};
struct B : public virtual A
{
int b_member[10];
void print_b() { std::cout << static_cast<void*>(this) << " " << static_cast<void*>(std::addressof(this->a_member)) << " " << static_cast<void*>(this + 1) << std::endl; }
};
struct C : public virtual A
{
int c_member[10];
void print_c() { std::cout << static_cast<void*>(this) << " " << static_cast<void*>(std::addressof(this->a_member)) << " " << static_cast<void*>(this + 1) << std::endl; }
};
struct D : public B, public C
{
void print_d()
{
print_b();
print_c();
}
};
int main()
{
D d;
d.print_d();
}
With the possible output (as seen here)
0x7fffc6bf9fb0 0x7fffc6bfa010 0x7fffc6bfa008
0x7fffc6bf9fe0 0x7fffc6bfa010 0x7fffc6bfa038
Note that the a_member is outside of the B pointed to by this in print_b

Do array elements count as a common initial sequence?

Sort of related to my previous question:
Do elements of arrays count as a common initial sequence?
struct arr4 { int arr[4]; };
struct arr2 { int arr[2]; };
union U
{
arr4 _arr4;
arr2 _arr2;
};
U u;
u._arr4.arr[0] = 0; //write to active
u._arr2.arr[0]; //read from inactive
According to this cppreference page:
In a standard-layout union with an active member of non-union class type T1, it is permitted to read a non-static data member m of another union member of non-union class type T2 provided m is part of the common initial sequence of T1 and T2....
Would this be legal, or would it also be illegal type punning?
C++11 says (9.2):
If a standard-layout union contains two or more standard-layout structs that share a common initial sequence,
and if the standard-layout union object currently contains one of these standard-layout structs, it is permitted
to inspect the common initial part of any of them. Two standard-layout structs share a common initial
sequence if corresponding members have layout-compatible types and either neither member is a bit-field or
both are bit-fields with the same width for a sequence of one or more initial members.
As to whether arrays of different size form a valid common initial sequence, 3.9 says:
If two types T1 and T2 are the same type, then T1 and T2 are layout-compatible types
These arrays are not the same type, so this doesn't apply. There is no special further exception for arrays, so the arrays may not be layout-compatible and do not form a common initial sequence.
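For contrast, here is a sketch of a case the common initial sequence rule does cover: the leading members are individual ints rather than arrays, so corresponding members have the same type. The names Two, Four, U and read_common_prefix are hypothetical.

```cpp
// Both structs are standard-layout and begin with two int members,
// so {int, int} is their common initial sequence.
struct Two  { int x0; int x1; };
struct Four { int x0; int x1; int x2; int x3; };

union U {
    Four four;
    Two  two;
};

inline int read_common_prefix() {
    U u = {{10, 20, 30, 40}};    // initializes u.four, the active member
    return u.two.x0 + u.two.x1;  // permitted: reads stay within the common prefix
}
```

Here read_common_prefix() returns 30. Reading anything in u.two beyond the shared prefix would not be covered by the rule.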
In practice, though, I know of a compiler (GCC) which:
ignores the "common initial sequence" rule, and
allows type punning anyway, but only when accesses are "via the union type" (as in your example), in which case the "common initial sequence" rule is obeyed indirectly (because a "common initial sequence" implies a common initial layout on the architectures the compiler supports).
I suspect many other compilers take a similar approach. In your example, where you type-pun via the union object, such compilers will give you the expected result: reading from the inactive member should give you the value written via the active member.
The C Standard would allow an implementation to vary the placement of an array object within a structure based upon the number of elements. Among other things, there may be some circumstances where it may be useful to word-align a byte array which would occupy exactly one word, but not to word-align arrays of other sizes. For example, on a system with 8-bit char and 32-bit words, processing a structure such as:
struct foo {
char header;
char dat[4];
};
in a manner that word-aligns dat may allow an access to dat[i] to be processed by loading a word and shifting it right by 0, 8, 16, or 24 bits, but such advantages might not be applicable had the structure instead been:
struct foo {
char header;
char dat[5];
};
The Standard was clearly not intended to forbid implementations from laying out structures in such ways, on platforms where doing so would be useful. On the other hand, when the Standard was written, compilers which would place arrays within a structure at offsets that were unaffected by the arrays' sizes would unanimously behave as though array elements that were present in two structures were part of the same Common Initial Sequence, and nothing in the published Rationale for the Standard suggests any intention to discourage such implementations from continuing to behave in such fashion. Code which relied upon such treatment would have been "non-portable", but correct on all implementations which followed common struct layout practices.

C struct elements alignment (ansi)

Just a simple question: what does the standard say about the alignment of structure members?
for example with this one:
struct
{
uint8_t a;
uint8_t b;
/* other members */
} test;
Is it guaranteed that b is at offset 1 from the start of the struct?
Thanks
The standard (as of C99) doesn't really say anything.
The only real guarantees are that (void *)&test == (void *)&test.a, and that a is at a lower address than b. Everything else is up to the implementation.
C11 6.7.2.1 Structure and union specifiers p14 says
Each non-bit-field member of a structure or union object is aligned in
an implementation-defined manner appropriate to its type.
meaning that you can't make any portable assumptions about the difference between the addresses of a and b.
It should be possible to use offsetof to determine the offset of members.
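A short sketch of the offsetof approach, written in C++ (in C the macro comes from <stddef.h> instead). The struct name Sample and the helper functions are hypothetical. Only the first assertion is a language guarantee; the other offsets are whatever your implementation chooses, so the code queries them instead of assuming values.

```cpp
#include <cstddef>
#include <cstdint>

struct Sample {
    std::uint8_t  a;
    std::uint8_t  b;
    std::uint32_t c;
};

// Guaranteed: no padding before the first member of a standard-layout struct.
static_assert(offsetof(Sample, a) == 0, "first member starts the struct");

// Implementation-defined values: query them rather than hard-coding them.
inline std::size_t offset_of_b() { return offsetof(Sample, b); }
inline std::size_t offset_of_c() { return offsetof(Sample, c); }
```

The standard only promises that the offsets increase in declaration order, so offset_of_b() is at least 1 and offset_of_c() is larger than offset_of_b().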
For C, the alignment is implementation-defined. We can see that in the draft C99 standard, section 6.7.2.1 Structure and union specifiers, paragraph 12 (in C11 it would be paragraph 14), which says:
Each non-bit-field member of a structure or union object is aligned in an implementation defined manner appropriate to its type.
and paragraph 13 says:
Within a structure object, the non-bit-field members and the units in which bit-fields
reside have addresses that increase in the order in which they are declared. A pointer to a
structure object, suitably converted, points to its initial member (or if that member is a
bit-field, then to the unit in which it resides), and vice versa. There may be unnamed
padding within a structure object, but not at its beginning.
For C++ we have similar quotes from the draft standard. Section 9.2 Class members, paragraph 13 says:
Nonstatic data members of a (non-union) class with the same access control (Clause 11) are allocated so that later members have higher addresses within a class object. The order of allocation of non-static data members with different access control is unspecified (Clause 11). Implementation alignment requirements might cause two adjacent members not to be allocated immediately after each other;
and paragraph 19 says:
A pointer to a standard-layout struct object, suitably converted using a reinterpret_cast, points to its
initial member (or if that member is a bit-field, then to the unit in which it resides) and vice versa. [ Note:
There might therefore be unnamed padding within a standard-layout struct object, but not at its beginning,
as necessary to achieve appropriate alignment. —end note ]
The case you're using is not really an edge case: both uint8_t members are small enough to fit in the same word in memory, and there would be no point in putting each uint8_t in a uint16_t.
A more critical case would be something like:
struct
{
uint8_t a;
uint8_t b;
uint32_t c; // where is c: at offset 2 or at offset 4?
/* other members */
} test;
In any case, this will always depend on the target architecture and your compiler.
K&R second edition (ANSI C) in chapter 6.4 (page 138) says:
Don't assume, however, that the size of a structure is the sum of the sizes of its members. Because of alignment requirements for different objects, there may be unnamed "holes" in a structure.
So no, ANSI C does not guarantee that b is at offset 1.
A compiler is even allowed to put b at offset sizeof(int) so that it is aligned to a machine word, which can be easier to deal with on some targets.
Some compilers support pack-pragmas so that you can force that there are no such "holes" in the struct, but that is not portable.
What is guaranteed by the C-Standard already had been mentioned by other answers.
However, to make sure b is at offset 1, your compiler might offer options to "pack" the structure, that is, to explicitly add no padding.
For gcc this can be achieved with #pragma pack().
#pragma pack(1)
struct
{
uint8_t a; /* Guaranteed to be at offset 0. */
uint8_t b; /* Guaranteed to be at offset 1. */
/* other members are guaranteed to start at offset 2. */
} test_packed;
#pragma pack()
struct
{
uint8_t a; /* Guaranteed to be at offset 0. */
uint8_t b; /* NOT guaranteed to be at offset 1. */
/* other members are NOT guaranteed to start at offset 2. */
} test_unpacked;
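Since packing is a compiler extension, it can help to have the build fail loudly if the layout is not what you expect. Below is a sketch, assuming a compiler that supports the push/pop form of #pragma pack (GCC, Clang and MSVC do); the struct name Packed and the helper packed_size are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)   // non-portable: request 1-byte packing
struct Packed {
    std::uint8_t  a;
    std::uint8_t  b;
    std::uint32_t c;
};
#pragma pack(pop)       // restore the previous packing

// Compilation fails here if the packing request was not honoured.
static_assert(offsetof(Packed, b) == 1, "b expected at offset 1");
static_assert(offsetof(Packed, c) == 2, "c expected at offset 2");
static_assert(sizeof(Packed) == 6, "no padding expected");

inline std::size_t packed_size() { return sizeof(Packed); }
```

The same static_assert lines, without the pragma, also work as a check on an unpacked struct when you merely want to document a layout assumption.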
A portable (and safe) solution would be to simply use an array:
struct
{
uint8_t ab[2]; /* ab[0] is guaranteed to be at offset 0. */
/* ab[1] is guaranteed to be at offset 1. */
/* other members are NOT guaranteed to start at offset 2. */
} test_array;

Are anonymous unions acceptable for aliasing member variables in a struct?

Let's say that I have the following C++ code:
struct something
{
// ...
union { int size, length; };
// ...
};
This would create two members of the struct which access the same value: size and length.
Would treating the two members as complete aliases (i.e. setting the size, then accessing the length, and vice versa) be undefined behaviour? Is there a "better" way to implement this type of behaviour, or is this an acceptable implementation?
Yes, this is allowed and well-defined. According to §3.10 [basic.lval]:
10/ If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
— the dynamic type of the object
[...]
Since here we store an int and read through an int, we access the object through a glvalue of the same dynamic type as the object, thus things are fine.
There even is a special caveat in the Standard for structures that share the same prefix. Or, in standardese, standard-layout types that share a common initial sequence.
§9.2/18 If a standard-layout union contains two or more standard-layout structs that share a common initial sequence, and if the standard-layout union object currently contains one of these standard-layout structs, it is permitted to inspect the common initial part of any of them. Two standard-layout structs share a common initial sequence if corresponding members have layout-compatible types and either neither member is a bit-field or both are bit-fields with the same width for a sequence of one or more initial members.
That is:
struct A { unsigned size; char type; };
struct B { unsigned length; unsigned capacity; };
union { A a; B b; } x;
assert(x.a.size == x.b.length);
EDIT: Given that int is not a struct (nor a class) I am afraid it's actually not formally defined (I certainly could not see anything in the Standard), but should be safe in practice... I've brought the matters to the isocpp forums; you might have found a hole.
EDIT: Following the above-mentioned discussion, I have been shown §3.10/10.
It is not undefined behavior. Both of the aliases in the union will be accessing the same location in the memory. See below:
§9.2/18 If a standard-layout union contains two or more
standard-layout structs that share a common initial sequence, and if
the standard-layout union object currently contains one of these
standard-layout structs, it is permitted to inspect the common initial
part of any of them. Two standard-layout structs share a common
initial sequence if corresponding members have layout-compatible types
and either neither member is a bit-field or both are bit-fields with
the same width for a sequence of one or more initial members.
The behavior is undefined if the types do not share a common initial sequence.
The values will be the same. If you assign 5 to size, then length will also be 5.
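On the "better way" part of the question, one option that sidesteps the union discussion entirely is to keep a single data member and expose the second name through an accessor. This is a sketch of an alternative, not what the question's code does; Something and alias_demo are hypothetical names, and the cost is that length is spelled as a function call.

```cpp
struct Something {
    int size;

    // length() is just another name for size; no union involved.
    int&       length()       { return size; }
    const int& length() const { return size; }
};

inline int alias_demo() {
    Something s{5};
    s.length() = 7;   // write through the alias
    return s.size;    // read through the real member
}
```

Here alias_demo() returns 7, since both names refer to the same int object.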