Is the code below well formed, in particular regarding aliasing rules? - c++

The template function below is part of a sequence generator. Instead of manual shifts, I came up with the following union-based solution to make the operations more explicit. It works great on all the compilers tested. Godbolt link.
However, despite working in practice, I am afraid that aliasing rules are being violated, which means it might not work in the future or in a compiler other than GCC and Clang.
Strictly in view of the C++ standard: is the code below well formed? Does it incur undefined behavior?
template <int BITS>
uint64_t flog2(uint64_t num) {
    constexpr uint64_t MAXNUM = (uint64_t(1) << BITS);
    if (num < MAXNUM) return num;
    union FP {
        double dbl;
        struct {
            uint64_t man: 52;
            uint32_t exp: 11;
            uint32_t sign: 1;
        };
        struct {
            uint64_t xman: 52-BITS;
            uint32_t xexp: 11+BITS;
            uint32_t xsgn: 1;
        };
    };
    FP fp;
    fp.dbl = num;
    fp.exp -= 1023-1+BITS;
    return fp.xexp;
}
Thanks!

First of all, the program is syntactically ill-formed in ISO standard C++. Anonymous struct members are not standard C++ (in contrast to C). They are an extension. In ISO standard C++ the struct must be named and accessed through that name.
I'll ignore that for the rest of the answer and pretend you were accessing through such a name.
Technically it is not an aliasing violation, but it is undefined behavior to read an inactive member of the union object in
fp.exp -= 1023-1+BITS;
The types don't really matter for this (in contrast to aliasing). There is always at most one active member of a union, which is the last one that was either explicitly created or written with a member access/assignment expression. In your case fp.dbl = num; means that dbl is the active member and the only one that may be read from.
There is one exception in the standard for accessing the common initial sequence of standard-layout class type members of a union, in which case the non-active one may be accessed as if it were the active one. But even your two struct members have a non-empty common initial sequence only for BITS == 0.
However, in practice compilers typically explicitly support this kind of type punning, probably already for C compatibility where it is allowed.
Of course, even setting all of this aside, the layout of bitfields and the representations of the involved types are completely implementation-defined and you can't expect this to be generally portable.

It is undefined behaviour to read from a union member that was not most recently written.
Furthermore, the layout of bit-fields is implementation-defined.
Hence, from a strict C++ Standard view, this code invokes both undefined behaviour (by reading exp after writing dbl) and relies on implementation-defined behaviour in assuming that the bit-field layout corresponds to the double floating point representation (which by the way is also implementation-defined).
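For comparison, the same computation can be sketched with the manual shifts the question set out to avoid, which sidesteps both the inactive-member read and the bit-field layout. This is only a sketch, not the asker's code: the names double_bits and flog2_shift are made up for illustration, and it assumes that double is a 64-bit IEEE 754 (binary64) type, which the static_asserts check.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: copy the object representation of a double into an integer.
inline std::uint64_t double_bits(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes a 64-bit double");
    static_assert(std::numeric_limits<double>::is_iec559, "assumes IEEE 754 doubles");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // well-defined, unlike reading an inactive union member
    return bits;
}

// Hypothetical shift-based counterpart of flog2 from the question.
template <int BITS>
std::uint64_t flog2_shift(std::uint64_t num) {
    static_assert(BITS >= 0 && BITS <= 52, "BITS must leave a non-negative mantissa width");
    constexpr std::uint64_t MAXNUM = std::uint64_t(1) << BITS;
    if (num < MAXNUM) return num;

    std::uint64_t bits = double_bits(static_cast<double>(num));
    // Same adjustment as `fp.exp -= 1023-1+BITS`; the exponent field starts at bit 52.
    // Since num >= 2^BITS, the biased exponent is at least 1023+BITS, so this cannot borrow.
    bits -= std::uint64_t(1023 - 1 + BITS) << 52;
    // Equivalent of reading `xexp`: 11+BITS bits starting at bit 52-BITS.
    constexpr std::uint64_t MASK = (std::uint64_t(1) << (11 + BITS)) - 1;
    return (bits >> (52 - BITS)) & MASK;
}
```

With BITS = 2, flog2_shift<2>(10) yields 9: 10.0 is 1.25 * 2^3, so the adjusted exponent is 2 and the top two mantissa bits are 01. The representation of double is still implementation-defined, so this remains non-portable to platforms without IEEE 754 doubles, but there it fails to compile instead of misbehaving.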

Related

C++ standard for member offsets of standard layout struct

Does the C++11 standard guarantee that all compilers will choose the same memory offsets for all members in a given standard layout struct, assuming all members have guaranteed sizes (e.g. int32_t instead of int)?
That is, for a given member in a standard layout struct, does C++11 guarantee that offsetof will give the same value across all compilers?
If so, is there any specification of what that value would be, e.g. as a function of size, alignment, and order of the struct members?
There is no guarantee that offsetof will yield the same values across compilers.
There are guarantees about minimum sizes of types (e.g., char >= 8 bits, short, int >= 16 bits, long >= 32 bits, long long >= 64 bits), and the relationship between sizes [1] (sizeof(char) <= sizeof(short) <= sizeof(int) <= sizeof(long) <= sizeof(long long)).
For most types, it's guaranteed that the alignment requirement is no greater than the size of the type.
For any struct/class, the first (non-static [2]) element must be at the beginning of the class/struct, and in the absence of changes to the visibility, order is guaranteed to be in the order of definition. For example:
struct { // same if you use `class`
    int a;
    int b;
};
Since these are both public, a and b must be in that order. But:
struct {
    int a;
    int b;
private:
    int c;
};
The first element (a) is required to be at the beginning of the struct, but because of the change from public to private, the compiler is (theoretically) allowed to arrange c before b.
This rule has changed over time though. In C++98, even a vacuous visibility specifier allowed rearrangement of members.
struct A {
    int a;
    int b;
public:
    int c;
};
The public allows rearranging b and c even though they're both public. Since then it's been tightened up so it's only elements with differing visibility, and in C++23 the whole idea of rearranging elements based on visibility is gone (and long past time, in my opinion: I don't think anybody ever used it, so it's always been a rule you sort of needed to know, but that did nobody any real good).
[1] If you want to get really technical, the requirement isn't really on the size, but on the range, so in theory the relationship between sizes isn't quite guaranteed, but for most practical purposes, it is.
[2] A static element isn't normally allocated as part of the class/struct object at all. A static member is basically allocated as a global variable, but with some extra rules about visibility of its name.
No, there are no such guarantees. The C++ standard explicitly provides for type-specific padding and alignment requirements, for one thing, and that automatically dissolves this kind of guarantee.
It might be reasonable to anticipate uniform padding and alignment requirements for a specific hardware platform, that all compilers on that platform will implement, but that again is not guaranteed.
Absolutely not. Memory layout is completely up to the C++ implementation. The only exception, only for standard-layout classes, is that the first non-static data member or base class subobject(s) have zero offset. There are also some other constraints, e.g. due to sizes and alignment of subobjects and constraints on ordering of addresses of subobjects, but nothing that determines concrete offsets of subobjects.
However, typically compilers follow some ABI specification on any given architecture/platform, so that compilers for the same architecture/platform will likely use the same ABI and same memory layout (e.g. the SysV x86-64 ABI together with the Itanium C++ ABI on Linux x86-64 at least for both GCC and Clang).
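If code does depend on a particular layout, one option is to state the assumption where the compiler can check it. The sketch below uses a hypothetical struct WireHeader (not from the question). Only the offset-zero check is backed by the standard for standard-layout types; the remaining offsets are what common ABIs such as SysV x86-64 produce, and are assumptions that simply stop compilation on a platform that lays the struct out differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical example type; the names are illustrative only.
struct WireHeader {
    std::uint32_t id;
    std::uint16_t kind;
    std::uint16_t flags;
    std::uint64_t length;
};

// offsetof is only unconditionally supported for standard-layout types.
static_assert(std::is_standard_layout<WireHeader>::value, "WireHeader must be standard-layout");

// Guaranteed by the standard: the first non-static data member is at offset 0.
static_assert(offsetof(WireHeader, id) == 0, "first member at offset 0");

// NOT guaranteed by the standard: these hold on common ABIs and are checked, not assumed.
static_assert(offsetof(WireHeader, kind) == 4, "unexpected padding before kind");
static_assert(offsetof(WireHeader, flags) == 6, "unexpected padding before flags");
static_assert(offsetof(WireHeader, length) == 8, "unexpected padding before length");
static_assert(sizeof(WireHeader) == 16, "unexpected trailing padding");
```

This does not make the layout portable; it only turns a silent mismatch into a build error.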

Does this type aliasing using union invoke undefined behavior?

For example,
#include <cstdint>
#include <cstdio>
struct ipv4addr {
    union {
        std::uint32_t value;
        std::uint8_t parts[4];
    };
};
int main() {
    ipv4addr addr;
    addr.value = static_cast<std::uint32_t>(-1);
    std::printf("%hhu.%hhu.%hhu.%hhu",
        addr.parts[0], addr.parts[1], addr.parts[2], addr.parts[3]);
}
Per cppref,
The details of that allocation are implementation-defined, and it's
undefined behavior to read from the member of the union that wasn't
most recently written.
So looks like the code invokes undefined behavior. But the page also says
If two union members are standard-layout types, it's well-defined to
examine their common subsequence on any compiler.
I don't quite understand this. Does it make the code behavior well-defined?
Also note cppref's description on type aliasing.
Whenever an attempt is made to read or modify the stored value of an
object of type DynamicType through a glvalue of type AliasedType, the
behavior is undefined unless one of the following is true:
[...]
AliasedType is std::byte, char, or unsigned char: this permits examination of the object representation of any object as an array of
bytes.
I guess this applies to std::uint8_t as well. No?
Common initial subsequence has a ridiculously specific definition. int and struct foo{int x;} do not have a common initial subsequence.
struct foo{int x;} and struct bar{int y;} do have a common initial subsequence.
Reading memory through an unrelated type is not the same as reading from a union alternative. That text doesn't do anything there.
You can do (std::uint8_t const*)&addr.value and treat it as a 4 byte array, assuming your platform has uint8_t. The byte values you get are implementation defined.
You cannot, under the standard, read from parts[i] however (when value exists).
Compilers are free to specify behaviour when the standard states it is undefined under the standard, except during a compile time constexpr evaluation.
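Reading the contents of a union member that was not most recently assigned is indeed undefined behavior. However, almost all compilers I've used have non-standard extensions to allow it.
Technically, if you want this to be safe with no chance of failure, you should store the IP as an int32_t and read the individual bytes through an unsigned char pointer obtained with reinterpret_cast (the aliasing rule exempts only char, unsigned char and std::byte; int8_t is normally signed char, which is not on that list), like so:
int32_t ip = 185734;
unsigned char *ip_bytes = reinterpret_cast<unsigned char*>(&ip);
// ip_bytes[0] through ip_bytes[3] are the bytes of ip in memory order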
However, you should keep in mind the endianness of your platform will impact the byte ordering for any 32-bit read/write. Therefore it may be safer to scrap the int32_t idea entirely and just use an array of bytes. It all depends on what you need and/or whatever any library you might be using requires.
That's well defined to do exactly what fwrite followed by four get calls to read it back does. The actual result is machine dependent. "Get me the bytes for this uint32" in whatever storage mode the CPU prefers.
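The two well-defined routes mentioned in the answers above can be sketched side by side. The function names object_bytes and octets_msb_first are made up for this example. The first copies the object representation, so its result depends on the platform's byte order; the second works on the value, so it gives the same octets everywhere.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Object representation: bytes in whatever order the platform stores them.
inline std::array<unsigned char, 4> object_bytes(std::uint32_t value) {
    std::array<unsigned char, 4> out{};
    std::memcpy(out.data(), &value, sizeof value);
    return out;
}

// Value-based split: most significant octet first, independent of endianness.
inline std::array<std::uint8_t, 4> octets_msb_first(std::uint32_t value) {
    return {{
        static_cast<std::uint8_t>(value >> 24),
        static_cast<std::uint8_t>(value >> 16),
        static_cast<std::uint8_t>(value >> 8),
        static_cast<std::uint8_t>(value)
    }};
}
```

For an address such as 192.168.0.1 stored as 0xC0A80001, octets_msb_first returns 192, 168, 0, 1 on any platform, while object_bytes returns them reversed on a little-endian machine.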

How safe is it to have references to union members in C++?

Imagine this situation:
union Reg16
{
    uint16_t word;
    struct
    {
        uint8_t bottom;
        uint8_t top;
    };
};
Let's say I had something like this somewhere that uses this union:
Reg16 reg_AF;
uint8_t& reg_A = reg_AF.top;
uint8_t& reg_F = reg_AF.bottom;
Is it safe to keep a reference to the respective halves of the union even though the union does not technically always refer to a u8? I am not sure if this is a violation of strict aliasing rules, but it is worth noting that there is no reference anywhere to the word, only to top and bottom. Especially in this particular situation, the references are supposed to point to the top and bottom halves of the entire u16, so there isn't any UB because I won't be getting any value I don't expect to get by reading them versus if I had a union of a float and an int and made a reference to each of those.
Thanks in advance.
The first thing to say is that there is explicit language in the strict aliasing rule (section 3.10 of the C++03 standard)† which permits access to a value via unsigned char. If uint8_t is a typedef for unsigned char then this usage will be unequivocally legal.
However, uint8_t can be typedef for an implementation defined type which is not unsigned char (although careful reading of the standard shows that if uint8_t exists, it must be the same size as unsigned char). Further, this question can still be of interest for other types (eg uint32_t and uint16_t).
There is definitely no harm in the reference existing - the only danger could arise if the reference is used when the union contains word rather than top + bottom. On the other hand, if that happens (and top/bottom are not unsigned char) then the strict aliasing rule has been violated.
Given that the C standard explicitly permits type punning via unions (unlike C++), there is a significant chance that your C++ compiler will allow it too.
Aside: strictly speaking the sample is not legal; anonymous structs are an extension supported by GCC, Clang, and MSVC (so the question is, which compilers don't support them).
† The wording has changed slightly in later versions, and the section number may have changed - but the principle stays the same.
Assuming your code is actually:
union Reg16
{
    uint16_t word;
    struct
    {
        uint8_t bottom;
        uint8_t top;
    } bytes;
};
This is perfectly fine as a declaration:
Reg16 reg_AF;
uint8_t& reg_A = reg_AF.bytes.top;
uint8_t& reg_F = reg_AF.bytes.bottom;
You can use the references exactly as you could use directly the struct members.
For example reg_AF.bytes.bottom = 'A'; std::cout << reg_F; is perfectly fine: the union currently contains the struct and you correctly use a reference to a struct member.
But this reg_AF.word = 0x4142; std::cout << reg_A; is UB(*) exactly as is reg_AF.word = 0x4142; std::cout << reg_AF.bytes.bottom;
(*) Per the strict aliasing rule you should not access a value through a type different from the one that was used to write it. As struct bytes is different from uint8_t, the aliasing is incorrect. That being said, common compilers currently accept it as an extension for unions and even have options to ignore the strict aliasing rule. The code is simply not standard conformant, and another compiler could behave differently.
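If the goal is just a 16-bit register with addressable halves, one way to stay within the standard is to drop the union and derive the halves from the word with shifts. The class below is a sketch with made-up names (Reg16Value and its accessors), not a drop-in replacement: it cannot hand out uint8_t& references, so callers use the accessor functions instead.

```cpp
#include <cstdint>

// Hypothetical union-free register: only one uint16_t object ever exists,
// so there is no inactive member to read.
class Reg16Value {
public:
    std::uint16_t word() const { return word_; }
    void set_word(std::uint16_t w) { word_ = w; }

    // High and low halves are computed from the value, not from memory layout.
    std::uint8_t top() const { return static_cast<std::uint8_t>(word_ >> 8); }
    std::uint8_t bottom() const { return static_cast<std::uint8_t>(word_ & 0xFF); }

    void set_top(std::uint8_t v) {
        word_ = static_cast<std::uint16_t>((word_ & 0x00FF) | (v << 8));
    }
    void set_bottom(std::uint8_t v) {
        word_ = static_cast<std::uint16_t>((word_ & 0xFF00) | v);
    }

private:
    std::uint16_t word_ = 0;
};
```

A side effect of this design is that top and bottom mean the same thing on big-endian and little-endian machines, which the union version does not promise.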

With strict aliasing in C++11, is it defined to _write_ to a char*, then _read_ from an aliased nonchar*?

There are many discussions of strict aliasing (notably "What is the strict aliasing rule?" and "Strict aliasing rule and 'char *' pointers"), but this is a corner case I don't see explicitly addressed.
Consider this code:
int x;
char *x_alias = reinterpret_cast<char *>(&x);
x = 1;
*x_alias = 2; // [alias-write]
printf("x is now %d\n", x);
Must the printed value reflect the change in [alias-write]? (Clearly there are endianness and representation considerations, that's not my concern here.)
The famous [basic.lval] clause of the C++11 spec uses this language (emphasis mine):
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
... various other conditions ...
a char or unsigned char type.
I can't figure out whether "access" refers only to read operations (read chars from a nonchar object) or also to write operations (write chars onto a nonchar object). If there's a formal definition of "access" in the spec, I can't find it, but in other places the spec seems to use "access" for reads and "update" for writes.
This is of particular interest when deserializing; it's convenient and efficient to bring data directly from a wire into an object, without requiring an intermediate memcpy() from a char-buffer into the object.
is it defined to _write_ to a char*, then _read_ from an aliased nonchar*?
Yes.
Must the printed value reflect the change in [alias-write]?
Yes.
Strict aliasing says char* and unsigned char* can alias anything. The word "access" means both read and write operations.
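Both directions can be sketched briefly: writing into an int through an unsigned char pointer, and the memcpy route from a byte buffer that the question mentions for deserialization. The function names below are invented for the example, and which part of the value changes still depends on the platform's byte order.

```cpp
#include <cstring>

// Write one byte of an int through an unsigned char lvalue, then read the int.
// The character-type exemption in [basic.lval] covers this access.
inline int overwrite_first_byte(int x, unsigned char b) {
    unsigned char* p = reinterpret_cast<unsigned char*>(&x);
    p[0] = b;   // modifies the first byte in memory order
    return x;   // must reflect the write above
}

// Deserialization via memcpy: buf must point to at least sizeof(int) bytes.
inline int load_int(const unsigned char* buf) {
    int v;
    std::memcpy(&v, buf, sizeof v);
    return v;
}
```

Note that the exemption only runs one way: a char buffer that was never an int object may not simply be read through an int*, which is why load_int copies instead of casting.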
The authors of the C89 Standard wanted to allow e.g.
int thing;
unsigned char *p = (unsigned char*)&thing;
int i;
for (i=0; i<sizeof thing; i++)
    p[i] = getbyte();
and
int thing = somevalue();
unsigned char *p = (unsigned char*)&thing;
int i;
for (i=0; i<sizeof thing; i++)
    putbyte(p[i]);
but not to require that compilers handle any possible aliasing given something
like:
/* global definitions */
int thing;
double *p;
int x(double *p)
{
    thing = 1;
    *p = 1.0;
    return thing;
}
There are two ways in which the supported and non-supported cases differ: (1) in the cases to be supported, the access is made using a character-type pointer rather than some other type, and (2) after the address of the thing in question is converted to another type, all accesses to the storage using that pointer are made before the next access using the original lvalue. The authors of the Standard unfortunately regarded only the first as significant, even though the second would have been a much more reliable way of identifying cases where aliasing may be important. If the Standard had focused on the second, it might not have required compilers to recognize aliasing in your example. As it is, though, the Standard requires that compilers recognize aliasing any time programs use character types, despite the needless impact on the performance of code that is processing actual character data.
Rather than fixing this fundamental mistake, other standards for both C and C++ have simply kept on with the same broken approach.

Double data type represented by int data type (low, high) struct in a union

Is the following a valid representation? I'm aware of byte order, this is a Windows environment. If I define Int32Double myVar; will myVar.int32.low always be the same if myVar.d is a computed value?
E.G: myVar.d = 0.4 * log(4); printf("%08X\n", myVar.int32.low);
union Int32Double
{
    struct
    {
        int low;
        int high;
    } int32;
    double d;
};
No, it's undefined behavior writing into d and reading from int32.
Firstly, the object representations of integral types and floating-point types are typically very different. Reinterpreting any part of double object as an int object will not usually produce any value that would resemble the original double value. The result will not be meaningful, unless you really know what you are doing. And if one does know what one's doing, one uses unsigned integral types for reinterpretation.
Secondly, using unions for memory reinterpretation is illegal in C++. It leads to undefined behavior. One of the latest technical corrigendums to C99 specification actually made it legal in C language (with implementation-defined behavior, of course, and as long as we don't attempt to access a trap representation). But AFAIK it is not in C++ yet. So, use at your own risk.
P.S. I'm not sure what you mean by your "will always be the same"...
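A well-defined way to get the two halves is to copy the bytes of the double into an unsigned 64-bit integer and split that. The sketch below uses invented names (DoubleHalves, split_double) and assumes a 64-bit IEEE 754 double, which the static_asserts check; it follows the advice above to use unsigned types for the reinterpretation.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical result type for the two 32-bit halves.
struct DoubleHalves {
    std::uint32_t low;
    std::uint32_t high;
};

inline DoubleHalves split_double(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "assumes a 64-bit double");
    static_assert(std::numeric_limits<double>::is_iec559, "assumes IEEE 754 doubles");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // no union, no inactive-member read
    // The halves are taken from the integer value rather than from memory positions.
    return DoubleHalves{
        static_cast<std::uint32_t>(bits & 0xFFFFFFFFu),
        static_cast<std::uint32_t>(bits >> 32)
    };
}
```

For example, split_double(1.0) gives high == 0x3FF00000 and low == 0. The same double value always produces the same halves, but two computations that look mathematically equal can still round to different doubles.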