Unions, aliasing and type-punning in practice: what works and what does not? - c++

I have a problem understanding what can and cannot be done using unions with GCC. I read the questions (in particular here and here) about it but they focus on the C++ standard, and I feel there's a mismatch between the C++ standard and the practice (the commonly used compilers).
In particular, I recently found confusing information in the GCC online doc while reading about the compilation flag -fstrict-aliasing. It says:
-fstrict-aliasing
Allow the compiler to assume the strictest aliasing rules applicable to the language being compiled. For C (and C++), this activates optimizations based on the type of expressions. In particular, an object of one type is assumed never to reside at the same address as an object of a different type, unless the types are almost the same.
For example, an unsigned int can alias an int, but not a void* or a double. A character type may alias any other type.
Pay special attention to code like this:
union a_union {
int i;
double d;
};
int f() {
union a_union t;
t.d = 3.0;
return t.i;
}
The practice of reading from a different union member than the one most recently written to (called “type-punning”) is common.
Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type. So, the code above works as expected.
This is what I think I understood from this example and my doubts:
1) aliasing only works between similar types, or char
Consequence of 1): aliasing - as the word suggests - is when you have one value and two members to access it (i.e. the same bytes);
Doubt: are two types similar when they have the same size in bytes? If not, what are similar types?
Consequence of 1) for non similar types (whatever this means), aliasing does not work;
2) type punning is when we read a different member than the one we wrote to; it's common and it works as expected as long as the memory is accessed through the union type;
Doubt: is aliasing a specific case of type-punning where types are similar?
I get confused because it says unsigned int and double are not similar, so aliasing does not work; then in the example it's aliasing between int and double and it clearly says it works as expected, but calls it type-punning:
not because types are or are not similar, but because it's reading from a member it did not write. But reading from a member it did not write is what I understood aliasing is for (as the word suggests). I'm lost.
The questions:
can someone clarify the difference between aliasing and type-punning and what uses of the two techniques are working as expected in GCC? And what does the compiler flag do?

Aliasing can be taken literally for what it means: it is when two different expressions refer to the same object. Type-punning is to "pun" a type, i.e. to use an object of some type as a different type.
Formally, type-punning is undefined behaviour with only a few exceptions. It happens commonly when you fiddle with bits carelessly:
int mantissa(float f)
{
return (int&)f & 0x7FFFFF; // Accessing a float as if it's an int
}
The exceptions are (simplified)
Accessing integers as their unsigned/signed counterparts
Accessing anything as a char, unsigned char or std::byte
This is known as the strict-aliasing rule: the compiler can safely assume two expressions of different types never refer to the same object (except for the exceptions above) because they would otherwise have undefined behaviour. This facilitates optimizations such as
void transform(float* dst, const int* src, int n)
{
for(int i = 0; i < n; i++)
dst[i] = src[i]; // Can be unrolled and use vector instructions
// If dst and src alias the results would be wrong
}
What gcc says is it relaxes the rules a bit, and allows type-punning through unions even though the standard doesn't require it to
union {
int64_t num;
struct {
int32_t hi, lo;
} parts;
} u = {42};
u.parts.hi = 420;
This is the type-pun gcc guarantees will work. Other cases may appear to work but may one day silently be broken.
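As a sketch of the second exception listed above, the bytes of any object can be read through an unsigned char pointer without violating the strict-aliasing rule. The helper names byte_at and bytes_match_memcpy are made up for this illustration and are not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns the byte at position `index` of the
// object representation of `value`.
unsigned char byte_at(const float& value, std::size_t index)
{
    assert(index < sizeof(float));
    // Reading any object through unsigned char is one of the
    // permitted exceptions to the strict-aliasing rule.
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&value);
    return bytes[index];
}

// Hypothetical cross-check: the bytes seen through the unsigned char
// pointer are the same ones std::memcpy copies out of the object.
bool bytes_match_memcpy(float value)
{
    unsigned char copy[sizeof(float)];
    std::memcpy(copy, &value, sizeof(float));
    for (std::size_t i = 0; i < sizeof(float); ++i)
        if (byte_at(value, i) != copy[i])
            return false;
    return true;
}
```

Going the other way (treating an unsigned char buffer as if it were a float) is not covered by this exception.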

Terminology is a great thing, I can use it however I want, and so can everyone else!
are two types similar when they have the same size in bytes? If not, what are similar types?
Roughly speaking, types are similar when they differ by constness or signedness. Size in bytes alone is definitely not sufficient.
is aliasing a specific case of type-punning where types are similar?
Type punning is any technique that circumvents the type system.
Aliasing is a specific case of that which involves placing objects of different types at the same address. Aliasing is generally allowed when types are similar, and forbidden otherwise. In addition, one may access an object of any type through a char (or similar to char) lvalue, but doing the opposite (i.e. accessing an object of type char through a dissimilar type lvalue) is not allowed. This is guaranteed by both C and C++ standards, GCC simply implements what the standards mandate.
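A minimal sketch of what "similar" buys you, assuming nothing beyond the rules just described: an int may be accessed through its unsigned counterpart, so the compiler has to treat an int* and an unsigned* as possibly pointing at the same object. The function names are invented for the example.

```cpp
#include <cassert>

// Accessing an int through the corresponding unsigned type is allowed.
unsigned read_as_unsigned(const int& value)
{
    return *reinterpret_cast<const unsigned*>(&value);
}

// Because int and unsigned may alias, the compiler must reload *i
// after the store through u instead of assuming it is still 1.
int store_then_load(int* i, unsigned* u)
{
    assert(i != nullptr && u != nullptr);
    *i = 1;
    *u = 2u;
    return *i; // 2 when both pointers refer to the same object
}
```

Replace unsigned with float in store_then_load and the same call pattern becomes undefined behaviour, because int and float are not similar.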
GCC documentation seems to use "type punning" in a narrow sense of reading a union member other than the one last written to. This kind of type punning is allowed by the C standard even when types are not similar. OTOH the C++ standard does not allow this. GCC may or may not extend the permission to C++, the documentation is not clear on this.
Without -fstrict-aliasing, GCC apparently relaxes these requirements, but it isn't clear to what exact extent. Note that -fstrict-aliasing is enabled by default in optimised builds (at -O2, -O3 and -Os).
Bottom line, just program to the standard. If GCC relaxes the requirements of the standard, it isn't significant and isn't worth the trouble.

In ANSI C (AKA C89) you have (section 3.3.2.3 Structure and union members):
if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined
In C99 you have (section 6.5.2.3 Structure and union members):
If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.
IOW, union-based type punning is allowed in C, although the actual semantics may be different, depending on the language standard supported (note that the C99 semantics is narrower than the C89's implementation-defined).
In C99 you also have (section 6.5 Expressions):
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
And there's a section (6.2.7 Compatible type and composite type) in C99 that describes compatible types:
Two types have compatible type if their types are the same. Additional rules for
determining whether two types are compatible are described in 6.7.2 for type specifiers,
in 6.7.3 for type qualifiers, and in 6.7.5 for declarators. ...
And then (6.7.5.1 Pointer declarators):
For two pointer types to be compatible, both shall be identically qualified and both shall be pointers to compatible types.
Simplifying it a bit, this means that in C by using a pointer you can access signed ints as unsigned ints (and vice versa) and you can access individual chars in anything. Anything else would amount to aliasing violation.
You can find similar language in the various versions of the C++ standard. However, as far as I can see in C++03 and C++11 union-based type punning isn't explicitly allowed (unlike in C).

According to the footnote 88 in the C11 draft N1570, the "strict aliasing rule" (6.5p7) is intended to specify the circumstances in which compilers must allow for the possibility that things may alias, but makes no attempt to define what aliasing is. Somewhere along the line, a popular belief has emerged that accesses other than those defined by the rule represent "aliasing", and those allowed don't, but in fact the opposite is true.
Given a function like:
int foo(int *p, int *q)
{ *p = 1; *q = 2; return *p; }
Section 6.5p7 doesn't say that p and q won't alias if they identify the same storage. Rather, it specifies that they are allowed to alias.
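To make that concrete, here is foo again together with a small check function (demo_foo is a name added for this illustration). Both calls are well defined; the second one passes the same address twice, so the final read of *p has to observe the store made through q.

```cpp
#include <cassert>

int foo(int* p, int* q)
{
    *p = 1;
    *q = 2;
    return *p; // must be re-read: q is allowed to alias p
}

// Hypothetical check function, not part of the original example.
void demo_foo()
{
    int a = 0, b = 0;
    assert(foo(&a, &b) == 1); // distinct objects: *p keeps its value
    assert(foo(&a, &a) == 2); // same object: the store through q wins
}
```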
Note that not all operations which involve accessing storage of one type as another represent aliasing. An operation on an lvalue which is freshly visibly derived from another object doesn't "alias" that other object. Instead, it is an operation upon that object. Aliasing occurs if, between the time a reference to some storage is created and the time it is used, the same storage is referenced in some way not derived from the first, or code enters a context wherein that occurs.
Although the ability to recognize when an lvalue is derived from another is a Quality of Implementation issue, the authors of the Standard must have expected implementations to recognize some constructs beyond those mandated. There is no general permission to access any of the storage associated with a struct or union by using an lvalue of member type, nor does anything in the Standard explicitly say that an operation involving someStruct.member must be recognized as an operation on a someStruct. Instead, the authors of the Standard expected that compiler writers who make a reasonable effort to support constructs their customers need should be better placed than the Committee to judge the needs of those customers and fulfill them. Since any compiler that makes an even-remotely-reasonable effort to recognize derived references would notice that someStruct.member is derived from someStruct, the authors of the Standard saw no need to explicitly mandate that.
Unfortunately, the treatment of constructs like:
actOnStruct(&someUnion.someStruct);
int q=*(someUnion.intArray+i);
has evolved from "It's sufficiently obvious that actOnStruct and the pointer dereference should be expected to act upon someUnion (and consequently all the members thereof) that there's no need to mandate such behavior" to "Since the Standard doesn't require that implementations recognize that the actions above might affect someUnion, any code relying upon such behavior is broken and need not be supported". Neither of the above constructs is reliably supported by gcc or clang except in -fno-strict-aliasing mode, even though most of the "optimizations" that would be blocked by supporting them would generate code that is "efficient" but useless.
If you're using -fno-strict-aliasing on any compiler having such an option, almost anything will work. If you're using -fstrict-aliasing on icc, it will try to support constructs that use type punning without aliasing, though I don't know if there's any documentation about exactly what constructs it does or does not handle. If you use -fstrict-aliasing on gcc or clang, anything at all that works is purely by happenstance.

I think it's good to add a complementary answer, simply because when I asked the question I did not know how to fulfill my needs without using UNION: I got stubborn on using it because it seemed to answer precisely my needs.
The good way to do type punning and to avoid possible consequences of undefined behavior (depending on the compiler and other env. settings) is to use std::memcpy and copy the memory bytes from one type to another. This is explained - for example - here and here.
I've also read that often when a compiler produces valid code for type punning using unions, it produces the same binary code as if std::memcpy was used.
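A minimal sketch of the std::memcpy approach, assuming float is a 32-bit IEEE-754 type (the static_assert only checks the size, the format is an assumption). The names float_bits, mantissa and demo_mantissa are chosen for this example only.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copies the object representation of a float into an integer of the
// same size. No object is ever accessed through the wrong type.
std::uint32_t float_bits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Low 23 bits of the representation (the IEEE-754 fraction field).
std::uint32_t mantissa(float f)
{
    return float_bits(f) & 0x7FFFFFu;
}

// Hypothetical check, valid under the IEEE-754 assumption above.
void demo_mantissa()
{
    assert(float_bits(1.0f) == 0x3F800000u);
    assert(mantissa(1.0f) == 0u);
    assert(mantissa(1.5f) == 0x400000u);
}
```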
Finally, even if this information does not directly answer my original question it's so strictly related that I felt it was useful to add it here.

Related

Is it possible to auto cast (long *) to (int *)

I'm working on an existing code base, so the answer "do it right, just use one type" doesn't work because they already didn't do it right, I just have to live with it.
I know that coercion works so that this works:
int a = 1;
long b = a;
even though int and long are different base types. However, this doesn't work:
int a;
long *b = &a;
because there's no auto conversion between "pointer to int" and "pointer to long".
If, rather than base types, I was working with classes, I could get it to work by providing a conversion. My question is this. Is there a way to provide conversions for base types (or rather pointers to base types) so I could start the process of converting my code to use a single 32-bit integral type? As it stands, I either need to do an "all or nothing" edit OR provide a boatload of explicit casts.
My question is this. Is there a way to provide conversions for base types (or rather pointers to base types) so I could start the process of converting my code to use a single 32-bit integral type?
No, there is no way to influence the set of implicit conversions between pointer types aside from inheritance relations between classes.
Even if you add a (long*) cast or reinterpret_cast<long*> everywhere, as mentioned in the comments, accessing the value through that pointer will be an aliasing violation and therefore cause undefined behavior. This is not related to the size, alignment or the representation of the int and long types. Rather, compilers are explicitly allowed to make optimizations that assume that a long pointer can never be pointing to an int object, and compilers will perform such optimizations, which will break code in possibly very subtle ways.
Note that this is different for casts between e.g. signed int* and unsigned int*. Signed and unsigned variants of the same integral type are allowed to alias one another, so that either pointer type can be used to access the object. The compiler is not allowed to perform optimizations in this case that assume that pointers of the two types don't point to the same address at the same time.
GCC and Clang offer the -fno-strict-aliasing option to disable optimizations based on the aliasing rules (still assuming that the types do actually have the same size, alignment and compatible representations), but I don't know whether MSVC has a similar option. Some compilers may also explicitly allow additional types to alias that the standard does not allow to alias, but I would only rely on that if the compiler documents these clearly. I don't know whether MSVC makes any such guarantees for int and long.
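One way to migrate incrementally without pointer casts is to round-trip through a real object of the type the callee expects. This is only a sketch: legacy_increment stands in for whatever existing function takes a long*, and increment_int is a hypothetical adapter, neither comes from the code base in question.

```cpp
#include <cassert>

// Stand-in for an existing function that insists on long*.
void legacy_increment(long* value)
{
    *value += 1;
}

// Hypothetical adapter: lets int callers use the long* API by copying
// into a genuine long and back, so no object is accessed through a
// pointer of the wrong type.
void increment_int(int* value)
{
    long tmp = *value;               // implicit int -> long conversion
    legacy_increment(&tmp);          // the callee sees a real long
    *value = static_cast<int>(tmp);  // narrow back explicitly
}

void demo_adapter()
{
    int a = 41;
    increment_int(&a);
    assert(a == 42);
}
```

The copy costs a few instructions but keeps the behaviour defined regardless of the optimisation level.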

Is std::byte well defined?

C++17 introduces the std::byte type: a library type that can (supposedly) be used to access raw memory, but stands separate from the character types and represents a mere lump of bits.
So far so good. But the definition has me slightly worried. As given in [cstddef.syn]:
enum class byte : unsigned char {};
I have seen two answers on SO which seem to imply different things about the robustness of the above. This answer argues (without reference) that an enumeration with an underlying type has the same size and alignment requirements as said type. Intuitively this seems correct, since specifying an underlying type allows for opaque enum declarations.
However, this answer argues that the standard only guarantees that two enumerations with the same underlying type are layout compatible, and no more.
When reading [dcl.enum] I couldn't help but notice that indeed, the underlying type is only used to specify the range of the enumerators. There is no mention of size or alignment requirements.
What am I missing?
Essentially there is special wording all around the c++17 draft standard that gives std::byte the same properties with regard to aliasing as char and unsigned char.
To give you an example, §6.10 of N4659 states
8 If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined.
[...]
(8.8) — a char, unsigned char, or std::byte type.
I didn't do an exhaustive search, but essentially anywhere that char gets special treatment in the standard, the same is given to std::byte. As far as accessing memory is concerned, it seems irrelevant that it is defined as an enum or what its underlying type is.
EDIT
Maybe I understood your question wrongly: if you are asking whether the standard guarantees that sizeof(std::byte) == alignof(std::byte) == 1, then I believe this is not the case, as there seems to be no wording about how those properties depend on the underlying type of a scoped enum and I couldn't find special wording for std::byte in that regard. As @T.C. mentions in the comments, this is probably a defect in the language.
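A short sketch of both points, assuming a C++17 compiler: the object representation of an unsigned is walked through a std::byte pointer (allowed by the aliasing wording quoted above), and a static_assert records the size assumption that the standard text arguably does not spell out but that mainstream implementations satisfy. The function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>

// Assumption made explicit: the pointer arithmetic below only walks
// byte by byte if std::byte really occupies one byte.
static_assert(sizeof(std::byte) == 1, "std::byte is expected to be one byte");

// Counts the non-zero bytes in the object representation of an object.
std::size_t count_nonzero_bytes(const void* object, std::size_t size)
{
    const std::byte* bytes = static_cast<const std::byte*>(object);
    std::size_t count = 0;
    for (std::size_t i = 0; i < size; ++i)
        if (std::to_integer<unsigned>(bytes[i]) != 0u)
            ++count;
    return count;
}

void demo_byte()
{
    unsigned value = 0;
    assert(count_nonzero_bytes(&value, sizeof value) == 0);
    value = 1;
    assert(count_nonzero_bytes(&value, sizeof value) == 1);
}
```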
(Documenting the comments made by @T.C. that ultimately answer my question)
(I will remove this if T.C. ever wishes to reformulate his own answer.)
Oddly enough, N2213 had wording that guarantees identical
representation to underlying type, but that wording was removed in
N2347. In fact, it even removed the C++03 wording providing for
identical sizeof without any obvious replacement.
The more general question regarding enums and their underlying types
is probably worth a core issue, given that CWG approved this
formulation of std::byte and presumably thought that the
size/alignment relationship exists. As a practical matter, the clear
intent is for std::byte to take up, well, one byte; no sane
implementer would do it differently.

union 'punning' structs w/ "common initial sequence": Why does C (99+), but not C++, stipulate a 'visible declaration of the union type'?

Background
Discussions on the mostly un-or-implementation-defined nature of type-punning via a union typically quote the following bits, here via @ecatmur ( https://stackoverflow.com/a/31557852/2757035 ), on an exemption for standard-layout structs having a "common initial sequence" of member types:
C11 (6.5.2.3 Structure and union members; Semantics):
[...] if a union contains several structures that share a common initial sequence (see below), and if the union object currently
contains one of these structures, it is permitted to inspect the
common initial part of any of them anywhere that a declaration of
the completed type of the union is visible. Two structures share a
common initial sequence if corresponding members have compatible types (and, for bit-fields, the same widths) for a sequence of one or
more initial members.
C++03 ([class.mem]/16):
If a POD-union contains two or more POD-structs that share a common initial sequence, and if the POD-union object currently contains one
of these POD-structs, it is permitted to inspect the common initial
part of any of them. Two POD-structs share a common initial sequence
if corresponding members have layout-compatible types (and, for
bit-fields, the same widths) for a sequence of one or more initial
members.
Other versions of the two standards have similar language; since C++11
the terminology used is standard-layout rather than POD.
Since no reinterpretation is required, this isn't really type-punning, just name substitution applied to union member accesses. A proposal for C++17 (the infamous P0137R1) makes this explicit using language like 'the access is as if the other struct member was nominated'.
But please note the clause "anywhere that a declaration of the completed type of the union is visible", which exists in C11 but nowhere in the C++ drafts for 2003, 2011, or 2014 (all nearly identical, but later versions replace "POD" with the new term standard layout). In any case, the 'visible declaration of union type' bit is totally absent in the corresponding section of any C++ standard.
@loop and @Mints97, here - https://stackoverflow.com/a/28528989/2757035 - show that this line was also absent in C89, first appearing in C99 and remaining in C since then (though, again, never filtering through to C++).
Standards discussions around this
[snipped - see my answer]
Questions
From this, then, my questions were:
What does this mean? What is classed as a 'visible declaration'? Was this clause intended to narrow down - or expand up - the range of contexts in which such 'punning' has defined behaviour?
Are we to assume that this omission in C++ is very deliberate?
What is the reason for C++ differing from C? Did C++ just 'inherit' this from C89 and then either decide - or worse, forget - to update alongside C99?
If the difference is intentional, then what benefits or drawbacks are there to the 2 different treatments in C vs C++?
What, if any, interesting ramifications does it have at compile- or runtime? For example, @ecatmur, in a comment replying to my pointing this out on his original answer (link as above), speculated as follows.
I'd imagine it permits more aggressive optimization; C can assume that
function arguments S* s and T* t do not alias even if they share a
common initial sequence as long as no union { S; T; } is in view,
while C++ can make that assumption only at link time. Might be worth
asking a separate question about that difference.
Well, here I am, asking! I'm very interested in any thoughts about this, especially: other relevant parts of the (either) Standard, quotes from committee members or other esteemed commentators, insights from developers who might have noticed a practical difference due to this - assuming any compiler even bothers to enforce C's added clause - and etc. The aim is to generate a useful catalogue of relevant facts about this C clause and its (intentional or not) omission from C++. So, let's go!
I've found my way through the labyrinth to some great sources on this, and I think I've got a pretty comprehensive summary of it. I'm posting this as an answer because it seems to explain both the (IMO very misguided) intention of the C clause and the fact that C++ does not inherit it. This will evolve over time if I discover further supporting material or the situation changes.
This is my first time trying to sum up a very complex situation, which seems ill-defined even to many language architects, so I'll welcome clarifications/suggestions on how to improve this answer - or simply a better answer if anyone has one.
Finally, some concrete commentary
Through vaguely related threads, I found the following answer by @tab - and much appreciated the contained links to (illuminating, if not conclusive) GCC and Working Group defect reports: answer by tab on StackOverflow
The GCC link contains some interesting discussion and reveals a sizeable amount of confusion and conflicting interpretations on part of the Committee and compiler vendors - surrounding the subject of union member structs, punning, and aliasing in both C and C++.
At the end of that, we're linked to the main event - another BugZilla thread, Bug 65892, containing an extremely useful discussion. In particular, we find our way to the first of two pivotal documents:
Origin of the added line in C99
C proposal N685 is the origin of the added clause regarding visibility of a union type declaration. Through what some claim (see GCC thread #2) is a total misinterpretation of the "common initial sequence" allowance, N685 was indeed intended to allow relaxation of aliasing rules for "common initial sequence" structs within a TU aware of some union containing instances of said struct types, as we can see from this quote:
The proposed solution is to require that a union declaration be visible
if aliases through a common initial sequence (like the above) are possible.
Therefore the following TU provides this kind of aliasing if desired:
union utag {
struct tag1 { int m1; double d2; } st1;
struct tag2 { int m1; char c2; } st2;
};
int similar_func(struct tag1 *pst2, struct tag2 *pst3) {
pst2->m1 = 2;
pst3->m1 = 0; /* might be an alias for pst2->m1 */
return pst2->m1;
}
Judging by the GCC discussion and comments below such as @ecatmur's, this proposal - which seems to mandate speculatively allowing aliasing for any struct type that has some instance within some union visible to this TU - seems to have received great derision and rarely been implemented.
It's obvious how difficult it would be to satisfy this interpretation of the added clause without totally crippling many optimisations - for little benefit, as few coders would want this guarantee, and those who do can just turn on -fno-strict-aliasing (which IMO indicates larger problems). If implemented, this allowance is more likely to catch people out and spuriously interact with other declarations of unions, than to be useful.
Omission of the line from C++
Following on from this and a comment I made elsewhere, @Potatoswatter in this answer here on SO states that:
The visibility part was purposely omitted from C++ because it's widely considered to be ludicrous and unimplementable.
In other words, it looks like C++ deliberately avoided adopting this added clause, likely due to its widely perceived absurdity. On asking for an "on the record" citation of this, Potatoswatter provided the following key info about the thread's participants:
The folks in that discussion are essentially "on the record" there. Andrew Pinski is a hardcore GCC backend guy. Martin Sebor is an active C committee member. Jonathan Wakely is an active C++ committee member and language/library implementer. That page is more authoritative, clear, and complete than anything I could write.
Potatoswatter, in the same SO thread linked above, concludes that C++ deliberately excluded this line, leaving no special treatment (or, at best, implementation-defined treatment) for pointers into the common initial sequence. Whether their treatment will in future be specifically defined, versus any other pointers, remains to be seen; compare to my final section below about C. At present, though, it is not (and again, IMO, this is good).
What does this mean for C++ and practical C implementations?
So, with the nefarious line from N685... 'cast aside'... we're back to assuming pointers into the common initial sequence are not special in terms of aliasing. Still, it's worth confirming what this paragraph in C++ means without it. Well, the 2nd GCC thread above links to another gem:
C++ defect 1719. This proposal has reached DRWP status: "A DR issue whose resolution is reflected in the current Working Paper. The Working Paper is a draft for a future version of the Standard" - cite. This is either post C++14 or at least after the final draft I have here (N3797) - and puts forward a significant, and in my opinion illuminating, rewrite of this paragraph's wording, as follows. I'm bolding what I consider to be the important changes, and {these comments} are mine:
In a standard-layout union with an active member {"active" indicates a union instance, not just type} (9.5 [class.union])
of struct type T1, it is permitted to read {formerly "inspect"} a non-static data member m
of another union member of struct type T2 provided m is part of the
common initial sequence of T1 and T2. [Note: Reading a volatile object
through a non-volatile glvalue has undefined behavior (7.1.6.1
[dcl.type.cv]). —end note]
This seems to clarify the meaning of the old wording: to me, it says that any specifically allowed 'punning' among union member structs with common initial sequences must be done via an instance of the parent union - rather than being based on the type of the structs (e.g. pointers to them passed to some function). This wording seems to rule out any other interpretation, a la N685. C would do well to adopt this, I'd say. Hey, speaking of which, see below!
The upshot is that - as nicely demonstrated by @ecatmur and in the GCC tickets - this leaves such union member structs by definition in C++, and practically in C, subject to the same strict aliasing rules as any other 2 officially unrelated pointers. The explicit guarantee of being able to read the common initial sequence of inactive union member structs is now more clearly defined, not including vague and unimaginably tedious-to-enforce "visibility" as attempted by N685 for C. By this definition, the main compilers have been behaving as intended for C++. As for C?
Possible reversal of this line in C / clarification in C++
It's also very worth noting that C committee member Martin Sebor is looking to get this fixed in that fine language, too:
Martin Sebor 2015-04-27 14:57:16 UTC If one of you can explain the problem with it I'm willing to write up a paper and submit it to WG14 and request to have the standard changed.
Martin Sebor 2015-05-13 16:02:41 UTC I had a chance to discuss this issue with Clark Nelson last week. Clark has worked on improving the aliasing parts of the C specification in the past, for example in N1520 (http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1520.htm). He agreed that like the issues pointed out in N1520, this is also an outstanding problem that would be worth for WG14 to revisit and fix.
Potatoswatter inspiringly concludes:
The C and C++ committees (via Martin and Clark) will try to find a consensus and hammer out wording so the standard can finally say what it means.
We can only hope!
Again, all further thoughts are welcome.
I suspect it means that the access to these common parts is permitted not only through the union type, but outside of the union. That is, suppose we have this:
union u {
struct s1 m1;
struct s2 m2;
};
Now suppose that in some function we have a struct s1 *p1 pointer which we know was lifted from the m1 member of such a union. We can cast this to a struct s2 * pointer and still access the members which are in common with struct s1. But somewhere in the scope, a declaration of union u has to be visible. And it has to be the complete declaration, which informs the compiler that the members are struct s1 and struct s2.
The likely intent is that if there is such a type in scope, then the compiler has knowledge that struct s1 and struct s2 are aliased, and so an access through a struct s1 * pointer is suspected of really accessing a struct s2 or vice versa.
In the absence of any visible union type which joins those types this way, there is no such knowledge; strict aliasing can be applied.
Since the wording is absent from C++, then to take advantage of the "common initial members relaxation" rule in that language, you have to route the accesses through the union type, as is commonly done anyway:
union u *ptr_any;
// ...
ptr_any->m1.common_initial_member = 42;
fun(ptr_any->m2.common_initial_member); // pass 42 to fun
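A self-contained sketch of that union-routed access. The member lists of s1 and s2 are assumptions made up for the example (the text above does not define them); the only requirement is that both are standard-layout structs starting with the same member type.

```cpp
#include <cassert>

// Assumed definitions: both structs begin with an int, which forms
// their common initial sequence.
struct s1 { int common_initial_member; double d; };
struct s2 { int common_initial_member; char c; };

union u {
    s1 m1;
    s2 m2;
};

// Reads the common initial member through m2, whichever of the two
// structs is currently active. The access goes through the union.
int read_common(const u& any)
{
    return any.m2.common_initial_member;
}

void demo_common()
{
    u any = {{42, 1.0}}; // aggregate initialization makes m1 active
    assert(read_common(any) == 42);
}
```

Reading any.m2.c here would not be covered, since c lies outside the common initial sequence.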

What is the difference between a proper defined union and a reinterpret_cast?

Can you propose at least 1 scenario where there is a substantial difference between
union {
T var_1;
U var_2;
}
and
var_2 = reinterpret_cast<U> (var_1)
?
The more I think about this, the more they look like the same thing to me, at least from a practical viewpoint.
One difference that I found is that while the union is as big as its biggest member, the reinterpret_cast as described in this post can lead to a truncation, so the plain old C-style union is even safer than the newer C++ cast.
Can you outline the differences between these two?
Contrary to what the other answers state, from a practical point of view there is a huge difference, although there might not be such a difference in the standard.
From the standard point of view, reinterpret_cast is only guaranteed to work for roundtrip conversions and only if the alignment requirements of the intermediate pointer type are not stronger than those of the source type. You are not allowed (*) to write through one pointer type and read through another pointer type.
At the same time, the standard requires similar behavior from unions, it is undefined behavior to read out of a union member other than the active one (the member that was last written to)(+).
Yet compilers often provide additional guarantees for the union case, and all compilers I know of (VS, g++, clang++, xlC_r, intel, Solaris CC) guarantee that you can read out of an union through an inactive member and that it will produce a value with exactly the same bits set as those that were written through the active member.
This is particularly important with high optimizations when reading from network:
double ntohdouble(const char *buffer) { // [1]
    union {
        int64_t i;
        double f;
    } data;
    memcpy(&data.i, buffer, sizeof(int64_t));
    data.i = ntohll(data.i);
    return data.f;
}
double ntohdouble(const char *buffer) { // [2]
    int64_t data;
    double dbl;
    memcpy(&data, buffer, sizeof(int64_t));
    data = ntohll(data);
    dbl = *reinterpret_cast<double*>(&data);
    return dbl;
}
The implementation in [1] is sanctioned by all compilers I know (gcc, clang, VS, sun, ibm, hp), while the implementation in [2] is not and will fail horribly in some of them when aggressive optimizations are used. In particular, I have seen gcc reorder the instructions and read into the dbl variable before evaluating ntohll, thus producing the wrong results.
(*) With the exception that you are always allowed to read through a [signed|unsigned] char* regardless of what the real object (original pointer type) was.
(+) Again with some exceptions: if the active member shares a common prefix with another member, you can read that prefix through the compatible member.
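For comparison, here is a sketch of a variant that needs neither the union nor the cast: it builds the 64-bit pattern from the bytes and then copies it into a double with memcpy, which the char exception in (*) covers. The function name ntohdouble_memcpy is made up for this example, and the code assumes that double is a 64-bit IEEE-754 type stored in the same byte order as uint64_t on the host. It also does the byte-order conversion by hand instead of calling ntohll, since that function is not available everywhere.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Decodes a big-endian (network order) 64-bit floating point value from
// the first 8 bytes of buffer. Hypothetical helper, not from the answer above.
double ntohdouble_memcpy(const char *buffer) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits = 0;
    // Shifting in one byte at a time gives the same result on little-endian
    // and big-endian hosts.
    for (int i = 0; i < 8; ++i) {
        bits = (bits << 8) | static_cast<unsigned char>(buffer[i]);
    }
    double result;
    // Copy the object representation instead of aliasing it.
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

Compilers typically turn the fixed-size memcpy into a plain register move, so this form is usually no slower than the union in [1].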
There are some technical differences between a proper union and (let's assume) a proper and safe reinterpret_cast. However, I can't think of any of these differences which cannot be overcome.
The real reason to prefer a union over reinterpret_cast in my opinion isn't a technical one. It's for documentation.
Supposing you are designing a bunch of classes to represent a wire protocol (which I guess is the most common reason to use type-punning in the first place), and that wire protocol consists of many messages, submessages and fields. If some of those fields are common, such as msg type, seq#, etc, using a union simplifies tying these elements together and helps to document exactly how the protocol appears on the wire.
Using reinterpret_cast does the same thing, obviously, but in order to really know what's going on you have to examine the code that advances from one packet to the next. Using a union you can just take a look at the header and get an idea what's going on.
In C++11, a union is a class type and can hold a member with non-trivial member functions. You can't simply cast from one member to another.
§ 9.5.3
[ Example: Consider the following union:
union U {
int i;
float f;
std::string s;
};
Since std::string (21.3) declares non-trivial versions of all of the special member functions, U will have
an implicitly deleted default constructor, copy/move constructor, copy/move assignment operator, and destructor. To use U, some or all of these member functions must be user-provided. — end example ]
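To make the "must be user-provided" part concrete, here is a rough sketch of one way to wrap such a union. The class IntOrString, its member names and the bool tag are assumptions for illustration and do not come from the standard's example; the point is only that the string member has to be constructed and destroyed by hand.

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

// Hypothetical wrapper that remembers which union member is active.
class IntOrString {
public:
    explicit IntOrString(int value) : is_string_(false) { storage_.i = value; }
    explicit IntOrString(std::string value) : is_string_(true) {
        // Placement new starts the lifetime of the std::string member.
        new (&storage_.s) std::string(std::move(value));
    }
    // Copying is left out to keep the sketch short.
    IntOrString(const IntOrString &) = delete;
    IntOrString &operator=(const IntOrString &) = delete;
    ~IntOrString() {
        // Only the active member may be destroyed.
        if (is_string_) storage_.s.~basic_string();
    }
    bool holds_string() const { return is_string_; }
    int as_int() const { return storage_.i; }
    const std::string &as_string() const { return storage_.s; }

private:
    union Storage {
        int i;
        std::string s;
        Storage() {}   // user-provided: constructs no member
        ~Storage() {}  // user-provided: the wrapper destroys s when needed
    } storage_;
    bool is_string_;
};
```

The accessors do not check the tag, so a caller is expected to ask holds_string() first; a production version would probably assert or throw instead.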
From a practical point of view, they're most probably 100% identical, at least on real, non-fictional computers. You take the binary representation of one type and stuff it into another type.
From a language lawyer point of view, using reinterpret_cast is well-defined for some occasions (e.g. pointer to integer conversions) and implementation-specific otherwise.
Union type punning, on the other hand, is very clearly undefined behaviour, always (though undefined does not necessarily mean "doesn't work"). The standard says that the value of at most one of the non-static data members can be stored in a union at any time. This means that if you set var_1 then var_1 is valid, but var_2 is not.
However, since var_1 and var_2 are stored at the same memory location, you can of course still read and write any of the types as you like, and assuming they have the same storage size, no bits are "lost".

How is long long implemented in C++?

In C++, as far as I know, all data types are implemented as classes. ( Don't know if it is right, but I read it as a justification for statements such as int a(5); which calls the parametric constructor of int. )
If so, how are long and short implemented? I just found out that long long and short short are valid types but short long and long short are not (Checked the latter ones just because it sounds funny!)
Similarly, how are signed and unsigned implemented?
PS. By implemented, what I mean is "Is written using C/C++ features or is it written at a lower level in the compiler itself".
So the equivalent parts of declaration of variable of a basic type and a userdefined object or variable is (read downwards)
auto|register|static|extern <=> auto|register|static|extern
const <=> const
(signed|unsigned)(long|short)datatype <=> class name etc
variable name <=> object/variable name
Is that assumption correct?
On the particular question, long long is implemented by the compiler in a compiler + platform specific way (usually more platform than compiler dependent).
As to the original misconception, no, not all types are classes in C++. The language tries to provide a uniform syntax for all types as far as possible, so that all types behave similarly and can be used in a similar way. As a matter of fact, it is actually quite the other way around: C++ tries as far as possible to have classes behave like primitive types (value semantics).
The particular reason to be able to initialize an integer that way is actually quite related to classes, just in a different way. In a class constructor definition there are initializer lists that define how each one of the members is initialized before the constructor block is executed. The syntax for each initializer element in the list is basically (I would have to look up the exact definition): member_name( initializer ), so for example you would get:
class my_int_vector {
    int * p;
    int size;
public:
    my_int_vector() : p(0), size(0) {}
    //...
};
Both pointers and integers are fundamental types, but they can be initialized in a way similar to that of classes. If that type of initialization was not allowed for fundamental types, and only the type name = value; syntax was allowed, the initializer list syntax would have to be extended, and you would not be able to seamlessly change those types at a later time (say, change the int to an atomic_int).
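As a small illustration of that last point, the sketch below uses the same member_name( initializer ) form for a plain int and for std::atomic_int. The class template hit_counter and its members are hypothetical names invented for this example.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical counter whose storage type can be swapped without touching
// the constructor: hits(0) is valid for int and for std::atomic_int alike.
template <typename Counter>
class hit_counter {
    Counter hits;
public:
    hit_counter() : hits(0) {}
    void record() { ++hits; }
    int total() const { return hits; }
};
```

Both hit_counter<int> and hit_counter<std::atomic_int> compile from the same source, which would not be the case if fundamental types needed a different initialization syntax.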
Built-in data types are not implemented as classes. They just have some syntactic similarities.
short, long, int and every other built-in type name are keywords which get treated specially by the compiler. So, in a nutshell, they're implemented as magic. They're not classes, they're just themselves.
So no, these types cannot be implemented in terms of other language features, the way std::string or other standard library components can. The built-in types are a fundamental part of the core language.
Not all data types are classes in C++. The primitive C data types are not (they're called scalars). They're not "implemented" at all, but rather they form a core feature of the language that has a direct translation to machine code. The syntax int i(5); is equivalent to C's int i = 5; and initializes the variable at declaration time.
You're laboring under a number of misconceptions:
Not all data types are implemented as classes in C++. In fact, only class types are implemented as classes. Enums, pointers and arrays, in addition to the fundamental types, are not implemented as classes.
short short is not a valid type. The signed integral types in C++03 are signed char, short, int, and long. Period. C++11 adds long long, and allows the implementation to add others.
I'm not sure what you mean by "implemented" here. On almost all modern machines, all of the integral types (signed or unsigned) are directly supported in hardware; the C++ compiler just generates the appropriate hardware instructions. (On older machines, long, and sometimes even int or short, often required function calls for the basic operations, and on some rare and exotic machines, unsigned arithmetic requires added instructions.)
How you write a definition is a question of syntax, not of implementation. Both T v(i); and T v = i; are legal if the type supports copy; for all but class types, they are perfectly identical. (With a modern compiler; I've used some compilers in the past that had bugs in this regard. But that's a fairly distant past.)
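The points above can be checked at compile time with the standard type traits. This is only a sketch; the enum color, the struct point and the two helper functions are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

enum color { red, green };       // hypothetical enum
struct point { int x; int y; };  // hypothetical class type

// Fundamental types, enums, pointers and arrays are not class types.
static_assert(!std::is_class<int>::value, "int is not a class");
static_assert(!std::is_class<long long>::value, "long long is not a class");
static_assert(std::is_fundamental<long long>::value, "long long is fundamental");
static_assert(!std::is_class<color>::value, "an enum is not a class");
static_assert(!std::is_class<int *>::value, "a pointer is not a class");
static_assert(!std::is_class<int[4]>::value, "an array is not a class");

// Only class types are classes.
static_assert(std::is_class<point>::value, "a struct is a class type");
static_assert(std::is_class<std::string>::value, "std::string is a class type");

// T v(i); and T v = i; initialize a non-class type identically.
inline int direct_init() { int v(5); return v; }
inline int copy_init() { int v = 5; return v; }
```

If any of the static_assert lines were wrong, the translation unit would fail to compile, so simply building it confirms the classification.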