Using memcpy to switch active member of union in C++

Using memcpy to switch active member of union in C++ - c++

I know about the memcpy/memmove to a union member, does this set the 'active' member? question , but I guess my question is different. So:
Suppose sizeof( int ) == sizeof( float ) and I have the following code snippet:
union U{
int i;
float f;
};
U u;
u.i = 1; //i is the active member of u
::std::memcpy( &u.f, &u.i, sizeof( u ) ); //copy memory content of u.i to u.f
My questions:
Does the code lead to an undefined behaviour (UB)? If yes why?
If the code does not lead to an UB, what is the active member of u after the memcpy call and why?
What would be the answer to previous two questions if sizeof( int ) != sizeof( float ) and why?

Regardless of the union, the behaviour of std::memcpy is undefined if the source and destination overlap. This is the case for every member of the union, and it would not be different if the sizes weren't the same.
If you were to use std::memmove instead, there is no longer an issue due to the overlap, and it also doesn't matter that you copy from a member of a union. Since both types are trivially copyable, the behaviour is defined and u.f becomes the active member of the union, but the union holds the same bytes as before in practice.
The only issue would arise if sizeof(U) was larger than sizeof(int), because you would be copying potentially uninitialized bytes. This is undefined behaviour.

You are not allowed to use memcpy to copy overlapping regions of memory:
If the objects overlap, the behavior is undefined.
Your code has undefined behavior because of this violation of memcpy's precondition, as u.f and u.i occupy the same address in memory.

You can use std::bit_cast to switch the active members in a union. Additionally, it's constexpr so you can even use the union in a core constant calculation.
#include <memory>
#include <bit>
union U {
int i;
float f;
constexpr void switch_to_int() { this->i = std::bit_cast<int>(f); }
constexpr void switch_to_float() { this->f = std::bit_cast<float>(i); }
};
constexpr int foo() {
U u{};
u.f = 2.0f;
u.switch_to_int();
return u.i;
}
int main()
{
constexpr int i = foo();
}
Compiler Explorer
std::bit_cast uses memcpy under the hood and compilers do a great job of optimizing the code. In this case memcpy is used to create an r-value that is then written to the new, active union element. This is what's left of the call to foo() at runtime.
mov eax,40000000h

Yes, it's undefined behaviour (UB). Because &u.f and &u.i points to the same start address. See the definition of memcpy:
void * memcpy (void *__restrict, const void *__restrict, size_t);
The C99 keyword restrict is an indication to the compiler that different object pointer types and function parameter arrays do not point to overlapping regions of memory.
This enables the compiler to perform optimizations that might otherwise be prevented because of possible aliasing.
It is your responsibility to ensure that restrict-qualified pointers do not point to overlapping regions of memory.
__restrict, permitted in C90 and C++, is a synonym for restrict.
Because it is UB. The result are undefined.
Also UB. Because &u.f and &u.i always points to the same start address, regardless equality of their lengths. sizeof(U) will get the maximum size of all the members of the union.

Related

Struct offsets and pointer safety in C++

This question is about pointers derived using pointer arithmetic with struct offsets.
Consider the following program:
#include <cstddef>
#include <iostream>
#include <new>
struct A {
float a;
double b;
int c;
};
static constexpr auto off_c = offsetof(A, c);
int main() {
A * a = new A{0.0f, 0.0, 5};
char * a_storage = reinterpret_cast<char *>(a);
int * c = reinterpret_cast<int *>(a_storage + off_c));
std::cout << *c << std::endl;
delete a;
}
This program appears to work and give expected results on compilers that I tested, using default settings and C++11 standard.
(A closely related program, where we use void * instead of char * and static_cast instead of reinterpret_cast,
is not universally accepted. gcc 5.4 issues a warning about pointer arithmetic with void pointers, and clang 6.0 says that pointer
arithmetic with void * is an error.)
Does this program have well-defined behavior according to the C++ standard?
Does the answer depend on whether the implementation has relaxed or strict pointer safety ([basic.stc.dynamic.safety])?

There are no fundamental errors in your code.
If A isn't plain old data, the above is UB (prior to C++17) and conditionally supported (after C++17).
You might want to replace char* and int* with auto*, but that is a style thing.
Note that pointers to members do this exact same thing in a type-safe manner. Most compilers implement a pointer to member ... as the offset of the member in the type. They do, however, work everywhere even on non-pod structures.
Aside: I don't see a guarantee that offsetof is constexpr in the standard. ;)
In any case, replace:
static constexpr auto off_c = offsetof(A, c);
with
static constexpr auto off_c = &A::c;
and
auto* a_storage = static_cast<char *>(a);
auto* c = reinterpret_cast<int *>(a_storage + off_c));
with
auto* c = &(a->*off_c);
to do it the C++ way.

It is safe in your specific example, but only because your struct is a standard layout, which you can double-check using std::is_standard_layout<>.
Trying to apply this to a struct such as:
struct A {
float a;
double b;
int c;
std::string str;
};
Would be illegal, even if the string is past the part of the struct that's relevant.
Edit
Heres what im concerned abt: in 3.7.4.3 [basic.stc.dynamic.safety] it says pointer is safely derived only if (conditions) and if we have strict pointer safety then a pointer is invalid if it doesnt come from such a place. In 5.7 pointer arithmetic, it says i can do usual arithmetic within an array but i dont see anything there telling me struct offset arithmetic is ok. Im trying to figure out if this isnt relevant in the way i think it is, or if struct offset arithmetic is not ok in the hypothetical "strict" impls, or if i read 5.7 wrong (n4296)
When you are doing the pointer arithmatic, you are performing it on an array of char, the size of which is at least sizeof(A), so that's fine.
Then, when you cast back into the second member, you are covered under (2.4):
— the result of a well-defined pointer conversion (4.10, 5.4) of a safely-derived pointer value;

you should examine your assumptions.
Assumption #1) offsetof gives the correct offset in bytes. This is only guaranteed if the class is considered "standard-layout", which has a number of restrictions, such as no virtual methods, avoids multiple inheritances, etc. In this case, it should be fine, but in general you can't be sure.
Assumption #2) A char is the same size as a byte. In C, this is by definition, so you are safe.
Assumption #3) The offsetof gives the correct offset from the pointer to the class, not from the beginning of the data. This is basically the same as #1, but a vtable could certainly be a problem. Again, only works with standard-layout.

Can a pointer be volatile?

Consider the following code:
int square(volatile int *p)
{
return *p * *p;
}
Now, the volatile keyword indicates that the value in a
memory location can be altered in ways unknown to the compiler or have
other unknown side effects (e.g. modification via a signal interrupt,
hardware register, or memory mapped I/O) even though nothing in the
program code modifies the contents.
So what exactly happens when we declare a pointer as volatile?
Will the above mentioned code always work, or is it any different from this:
int square(volatile int *p)
{
int a = *p;
int b = *p
return a*b;
}
Can we end up multiplying different numbers, as pointers are volatile?
Or is there better way to do so?

Can a pointer be volatile?
Absolutely; any type, excluding function and references, may be volatile-qualified.
Note that a volatile pointer is declared T *volatile, not volatile T*, which instead declares a pointer-to-volatile.
A volatile pointer means that the pointer value, that is its address and not the value pointed to by, may have side-effects that are not visible to the compiler when it's accessed; therefore, optimizations deriving from the "as-if rule" may not be taken into account for those accesses.
int square(volatile int *p) { return *p * *p; }
The compiler cannot assume that reading *p fetches the same value, so caching its value in a variable is not allowed. As you say, the result may vary and not be the square of *p.
Concrete example: let's say you have two arrays of ints
int a1 [] = { 1, 2, 3, 4, 5 };
int a2 [] = { 5453, -231, -454123, 7565, -11111 };
and a pointer to one of them
int * /*volatile*/ p = a1;
with some operation on the pointed elements
for (int i = 0; i < sizeof(a1)/sizeof(a1[0]); ++i)
*(p + i) *= 2;
here p has to be read each iteration if you make it volatile because, perhaps, it may actually point to a2 due to external events.

Yes, you can of course have a volatile pointer.
Volatile means none more and none less than that every access on the volatile object (of whatever type) is treated as a visible side-effect, and is therefore exempted from optimization (in particular, this means that accesses may not be reordered or collapsed or optimized out alltogether). That's true for reading or writing a value, for calling member functions, and of course for dereferencing, too.
Note that when the previous paragraph says "reordering", a single thread of execution is assumed. Volatile is no substitute for atomic operations or mutexes/locks.
In more simple words, volatile generally translates to roughly "Don't optimize, just do exactly as I say".
In the context of a pointer, refer to the exemplary usage pattern given by Chris Lattner's well-known "What every programmer needs to know about Undefined Behavior" article (yes, that article is about C, not C++, but the same applies):
If you're using an LLVM-based compiler, you can dereference a "volatile" null pointer to get a crash if that's what you're looking for, since volatile loads and stores are generally not touched by the optimizer.

Yes. int * volatile.
In C++, keywords according to type/pointer/reference go after the token, like int * const is constant pointer to integer, int const * is pointer to constant integer, int const * const is constant pointer to constant integer e.t.c. You can write keyword before the type only if it's for the first token: const int x is equal to int const x.

The volatile keyword is a hint for the compiler (7.1.6.1/7):
Note:
volatile
is a hint to the implementation to avoid aggressive optimization involving the object
because the value of the object might be changed by means undetectable by an implementation. Furthermore,
for some implementations,
volatile
might indicate that special hardware instructions are required to access
the object. See
1.9
for detailed semantics. In general, the semantics of
volatile
are intended to be the
same in C
++
as they are in C.
— end note
]
What does it mean? Well, take a look at this code:
bool condition = false;
while(!condition)
{
...
}
by default, the compiler will easilly optimize the condition out (it doesn't change, so there is no need to check it at every iteration). If you, however, declare the condition as volatile, the optimization will not be made.
So of course you can have a volatile pointer, and it is possible to write code that will crash because of it, but the fact that a variable is volative doesn't mean that it is necessarily going to be changed due to some external interference.

Yes, a pointer can be volatile if the variable that it points to can change unexpectedly even though how this might happen is not evident from the code.
An example is an object that can be modified by something that is external to the controlling thread and that the compiler should not optimize.
The most likely place to use the volatile specifier is in low-level code that deals directly with the hardware and where unexpected changes might occur.

You may be end up multiplying different numbers because it's volatile and could be changed unexpectedly. So, you can try something like this:
int square(volatile int *p)
{
int a = *p;
return a*a;
}

int square(volatile int *p)
{
int a = *p;
int b = *p
return a*b;
}
Since it is possible for the value of *ptr to change unexpectedly, it is possible for a and b to be different. Consequently, this code could return a number that is not a square! The correct way to code this is:
long square(volatile int *p)
{
int a;
a = *p;
return a * a;
}

Is it legal to use address of one field of a union to access another field?

Consider following code:
union U
{
int a;
float b;
};
int main()
{
U u;
int *p = &u.a;
*(float *)p = 1.0f; // <-- this line
}
We all know that addresses of union fields are usually same, but I'm not sure is it well-defined behavior to do something like this.
So, question is: Is it legal and well-defined behavior to cast and dereference a pointer to union field like in the code above?
P.S. I know that it's more C than C++, but I'm trying to understand if it's legal in C++, not C.

All members of a union must reside at the same address, that is guaranteed by the standard. What you are doing is indeed well-defined behavior, but it shall be noted that you cannot read from an inactive member of a union using the same approach.
Accessing inactive union member - undefined behavior?
Note: Do not use c-style casts, prefer reinterpret_cast in this case.
As long as all you do is write to the other data-member of the union, the behavior is well-defined; but as stated this changes which is considered to be the active member of the union; meaning that you can later only read from that you just wrote to.
union U {
int a;
float b;
};
int main () {
U u;
int *p = &u.a;
reinterpret_cast<float*> (p) = 1.0f; // ok, well-defined
}
Note: There is an exception to the above rule when it comes to layout-compatible types.
The question can be rephrased into the following snippet which is semantically equivalent to a boiled down version of the "problem".
#include <type_traits>
#include <algorithm>
#include <cassert>
int main () {
using union_storage_t = std::aligned_storage<
std::max ( sizeof(int), sizeof(float)),
std::max (alignof(int), alignof(float))
>::type;
union_storage_t u;
int * p1 = reinterpret_cast< int*> (&u);
float * p2 = reinterpret_cast<float*> (p1);
float * p3 = reinterpret_cast<float*> (&u);
assert (p2 == p3); // will never fire
}
What does the Standard (n3797) say?
9.5/1 Unions [class.union]
In a union, at most one of the non-static data members can be
active at any time, that is, the value of at most one of the
non-static dat amembers ca nbe stored in a union at any time.
[...] The size of a union is sufficient to contain the largest of
its non-static data members. Each non-static data member is
allocated as if it were the sole member of a struct. All non-static data members of a union object have the same address.
Note: The wording in C++11 (n3337) was underspecified, even though the intent has always been that of C++14.

Yes, it is legal. Using explicit casts, you can do almost anything.
As other comments have stated, all members in a union start at the same address / location so casting a pointer to a different member is pointless.
The assembly language will be the same. You want to make the code easy to read so I don't recommend the practice. It is confusing and there is no benefit.
Also, I recommend a "type" field so that you know when the data is in float format versus int format.

Can Aliasing Problems be Avoided with const Variables

My company uses a messaging server which gets a message into a const char* and then casts it to the message type.
I've become concerned about this after asking this question. I'm not aware of any bad behavior in the messaging server. Is it possible that const variables do not incur aliasing problems?
For example say that foo is defined in MessageServer in one of these ways:
As a parameter: void MessageServer(const char* foo)
Or as const variable at the top of MessageServer: const char* foo = PopMessage();
Now MessageServer is a huge function, but it never assigns anything to foo, however at 1 point in MessageServer's logic foo will be cast to the selected message type.
auto bar = reinterpret_cast<const MessageJ*>(foo);
bar will only be read from subsequently, but will be used extensively for object setup.
Is an aliasing problem possible here, or does the fact that foo is only initialized, and never modified save me?
EDIT:
Jarod42's answer finds no problem with casting from a const char* to a MessageJ*, but I'm not sure this makes sense.
We know this is illegal:
MessageX* foo = new MessageX;
const auto bar = reinterpret_cast<MessageJ*>(foo);
Are we saying this somehow makes it legal?
MessageX* foo = new MessageX;
const auto temp = reinterpret_cast<char*>(foo);
auto bar = reinterpret_cast<const MessageJ*>(temp);
My understanding of Jarod42's answer is that the cast to temp makes it legal.
EDIT:
I've gotten some comments with relation to serialization, alignment, network passing, and so on. That's not what this question is about.
This is a question about strict aliasing.
Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias eachother.)
What I'm asking is: Will the initialization of a const object, by casting from a char*, ever be optimized below where that object is cast to another type of object, such that I am casting from uninitialized data?

First of all, casting pointers does not cause any aliasing violations (although it might cause alignment violations).
Aliasing refers to the process of reading or writing an object through a glvalue of different type than the object.
If an object has type T, and we read/write it via a X& and a Y& then the questions are:
Can X alias T?
Can Y alias T?
It does not directly matter whether X can alias Y or vice versa, as you seem to focus on in your question. But, the compiler can infer if X and Y are completely incompatible that there is no such type T that can be aliased by both X and Y, therefore it can assume that the two references refer to different objects.
So, to answer your question, it all hinges on what PopMessage does. If the code is something like:
const char *PopMessage()
{
static MessageJ foo = .....;
return reinterpret_cast<const char *>(&foo);
}
then it is fine to write:
const char *ptr = PopMessage();
auto bar = reinterpret_cast<const MessageJ*>(foo);
auto baz = *bar; // OK, accessing a `MessageJ` via glvalue of type `MessageJ`
auto ch = ptr[4]; // OK, accessing a `MessageJ` via glvalue of type `char`
and so on. The const has nothing to do with it. In fact if you did not use const here (or you cast it away) then you could also write through bar and ptr with no problem.
On the other hand, if PopMessage was something like:
const char *PopMessage()
{
static char buf[200];
return buf;
}
then the line auto baz = *bar; would cause UB because char cannot be aliased by MessageJ. Note that you can use placement-new to change the dynamic type of an object (in that case, char buf[200] is said to have stopped existing, and the new object created by placement-new exists and its type is T).

My company uses a messaging server which gets a message into a const char* and then casts it to the message type.
So long as you mean that it does a reinterpret_cast (or a C-style cast that devolves to a reinterpret_cast):
MessageJ *j = new MessageJ();
MessageServer(reinterpret_cast<char*>(j));
// or PushMessage(reinterpret_cast<char*>(j));
and later takes that same pointer and reinterpret_cast's it back to the actual underlying type, then that process is completely legitimate:
MessageServer(char *foo)
{
if (somehow figure out that foo is actually a MessageJ*)
{
MessageJ *bar = reinterpret_cast<MessageJ*>(foo);
// operate on bar
}
}
// or
MessageServer()
{
char *foo = PopMessage();
if (somehow figure out that foo is actually a MessageJ*)
{
MessageJ *bar = reinterpret_cast<MessageJ*>(foo);
// operate on bar
}
}
Note that I specifically dropped the const's from your examples as their presence or absence doesn't matter. The above is legitimate when the underlying object that foo points at actually is a MessageJ, otherwise it is undefined behavior. The reinterpret_cast'ing to char* and back again yields the original typed pointer. Indeed, you could reinterpret_cast to a pointer of any type and back again and get the original typed pointer. From this reference:
Only the following conversions can be done with reinterpret_cast ...
6) An lvalue expression of type T1 can be converted to reference to another type T2. The result is an lvalue or xvalue referring to the same object as the original lvalue, but with a different type. No temporary is created, no copy is made, no constructors or conversion functions are called. The resulting reference can only be accessed safely if allowed by the type aliasing rules (see below) ...
Type aliasing
When a pointer or reference to object of type T1 is reinterpret_cast (or C-style cast) to a pointer or reference to object of a different type T2, the cast always succeeds, but the resulting pointer or reference may only be accessed if both T1 and T2 are standard-layout types and one of the following is true:
T2 is the (possibly cv-qualified) dynamic type of the object ...
Effectively, reinterpret_cast'ing between pointers of different types simply instructs the compiler to reinterpret the pointer as pointing at a different type. More importantly for your example though, round-tripping back to the original type again and then operating on it is safe. That is because all you've done is instructed the compiler to reinterpret a pointer as pointing at a different type and then told the compiler again to reinterpret that same pointer as pointing back at the original, underlying type.
So, the round trip conversion of your pointers is legitimate, but what about potential aliasing problems?
Is an aliasing problem possible here, or does the fact that foo is only initialized, and never modified save me?
The strict aliasing rule allows compilers to assume that references (and pointers) to unrelated types do not refer to the same underlying memory. This assumption allows lots of optimizations because it decouples operations on unrelated reference types as being completely independent.
#include <iostream>
int foo(int *x, long *y)
{
// foo can assume that x and y do not alias the same memory because they have unrelated types
// so it is free to reorder the operations on *x and *y as it sees fit
// and it need not worry that modifying one could affect the other
*x = -1;
*y = 0;
return *x;
}
int main()
{
long a;
int b = foo(reinterpret_cast<int*>(&a), &a); // violates strict aliasing rule
// the above call has UB because it both writes and reads a through an unrelated pointer type
// on return b might be either 0 or -1; a could similarly be arbitrary
// technically, the program could do anything because it's UB
std::cout << b << ' ' << a << std::endl;
return 0;
}
In this example, thanks to the strict aliasing rule, the compiler can assume in foo that setting *y cannot affect the value of *x. So, it can decide to just return -1 as a constant, for example. Without the strict aliasing rule, the compiler would have to assume that altering *y might actually change the value of *x. Therefore, it would have to enforce the given order of operations and reload *x after setting *y. In this example it might seem reasonable enough to enforce such paranoia, but in less trivial code doing so will greatly constrain reordering and elimination of operations and force the compiler to reload values much more often.
Here are the results on my machine when I compile the above program differently (Apple LLVM v6.0 for x86_64-apple-darwin14.1.0):
$ g++ -Wall test58.cc
$ ./a.out
0 0
$ g++ -Wall -O3 test58.cc
$ ./a.out
-1 0
In your first example, foo is a const char * and bar is a const MessageJ * reinterpret_cast'ed from foo. You further stipulate that the object's underlying type actually is a MessageJ and that no reads are done through the const char *. Instead, it is only casted to the const MessageJ * from which only reads are then done. Since you do not read nor write through the const char * alias, then there can be no aliasing optimization problem with your accesses through your second alias in the first place. This is because there are no potentially conflicting operations performed on the underlying memory through your aliases of unrelated types. However, even if you did read through foo, then there could still be no potential problem as such accesses are allowed by the type aliasing rules (see below) and any ordering of reads through foo or bar would yield the same results because there are no writes occurring here.
Let us now drop the const qualifiers from your example and presume that MessageServer does do some write operations on bar and furthermore that the function also reads through foo for some reason (e.g. - prints a hex dump of memory). Normally, there might be an aliasing problem here as we have reads and writes happening through two pointers to the same memory through unrelated types. However, in this specific example, we are saved by the fact that foo is a char*, which gets special treatment by the compiler:
Type aliasing
When a pointer or reference to object of type T1 is reinterpret_cast (or C-style cast) to a pointer or reference to object of a different type T2, the cast always succeeds, but the resulting pointer or reference may only be accessed if both T1 and T2 are standard-layout types and one of the following is true: ...
T2 is char or unsigned char
The strict-aliasing optimizations that are allowed for operations through references (or pointers) of unrelated types are specifically disallowed when a char reference (or pointer) is in play. The compiler instead must be paranoid that operations through the char reference (or pointer) can affect and be affected by operations done through other references (or pointers). In the modified example where reads and writes operate on both foo and bar, you can still have defined behavior because foo is a char*. Therefore, the compiler is not allowed to optimize to reorder or eliminate operations on your two aliases in ways that conflict with the serial execution of the code as written. Similarly, it is forced to be paranoid about reloading values that may have been affected by operations through either alias.
The answer to your question is that, so long as your functions are properly round tripping pointers to a type through a char* back to its original type, then your function is safe, even if you were to interleave reads (and potentially writes, see caveat at end of EDIT) through the char* alias with reads+writes through the underlying type alias.
These two technical references (3.10.10) are useful for answering your question. These other references help give a better understanding of the technical information.
====
EDIT: In the comments below, zmb objects that while char* can legitimately alias a different type, that the converse is not true as several sources seem to say in varying forms: that the char* exception to the strict aliasing rule is an asymmetric, "one-way" rule.
Let us modify my above strict-aliasing code example and ask would this new version similarly result in undefined behavior?
#include <iostream>
char foo(char *x, long *y)
{
// can foo assume that x and y cannot alias the same memory?
*x = -1;
*y = 0;
return *x;
}
int main()
{
long a;
char b = foo(reinterpret_cast<char*>(&a), &a); // explicitly allowed!
// if this is defined behavior then what must the values of b and a be?
std::cout << (int) b << ' ' << a << std::endl;
return 0;
}
I argue that this is defined behavior and that both a and b must be zero after the call to foo. From the C++ standard (3.10.10):
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:^52
the dynamic type of the object ...
a char or unsigned char type ...
^52: The intent of this list is to specify those circumstances in which an object may or may not be aliased.
In the above program, I am accessing the stored value of an object through both its actual type and a char type, so it is defined behavior and the results have to comport with the serial execution of the code as written.
Now, there is no general way for the compiler to always statically know in foo that the pointer x actually aliases y or not (e.g. - imagine if foo was defined in a library). Maybe the program could detect such aliasing at run time by examining the values of the pointers themselves or consulting RTTI, but the overhead this would incur wouldn't be worth it. Instead, the better way to generally compile foo and allow for defined behavior when x and y do happen to alias one another is to always assume that they could (i.e. - disable strict alias optimizations when a char* is in play).
Here's what happens when I compile and run the above program:
$ g++ -Wall test59.cc
$ ./a.out
0 0
$ g++ -O3 -Wall test59.cc
$ ./a.out
0 0
This output is at odds with the earlier, similar strict-aliasing program's. This is not dispositive proof that I'm right about the standard, but the different results from the same compiler provides decent evidence that I may be right (or, at least that one important compiler seems to understand the standard the same way).
Let's examine some of the seemingly conflicting sources:
The converse is not true. Casting a char* to a pointer of any type other than a char* and dereferencing it is usually in volation of the strict aliasing rule. In other words, casting from a pointer of one type to pointer of an unrelated type through a char* is undefined.
The bolded bit is why this quote doesn't apply to the problem addressed by my answer nor the example I just gave. In both my answer and the example, the aliased memory is being accessed both through a char* and the actual type of the object itself, which can be defined behavior.
Both C and C++ allow accessing any object type via char * (or specifically, an lvalue of type char). They do not allow accessing a char object via an arbitrary type. So yes, the rule is a "one way" rule."
Again, the bolded bit is why this statement doesn't apply to my answers. In this and similar counter-examples, an array of characters is being accessed through a pointer of an unrelated type. Even in C, this is UB because the character array might not be aligned according to the aliased type's requirements, for example. In C++, this is UB because such access does not meet any of the type aliasing rules as the underlying type of the object actually is char.
In my examples, we first have a valid pointer to a properly constructed type that is then aliased by a char* and then reads and writes through these two aliased pointers are interleaved, which can be defined behavior. So, there seems to be some confusion and conflation out there between the strict aliasing exception for char and not accessing an underlying object through an incompatible reference.
int value;
int *p = &value;
char *q = reinterpret_cast<char*>(&value);
Both p and p refer to the same address, they are aliasing the same memory. What the language does is provide a set of rules defining the behaviors that are guaranteed: write through p read through q fine, other way around not fine.
The standard and many examples clearly state that "write through q, then read through p (or value)" can be well defined behavior. What is not as abundantly clear, but what I'm arguing for here, is that "write through p (or value), then read through q" is always well defined. I claim even further, that "reads and writes through p (or value) can be arbitrarily interleaved with reads and writes to q" with well defined behavior.
Now there is one caveat to the previous statement and why I kept sprinkling the word "can" throughout the above text. If you have a type T reference and a char reference that alias the same memory, then arbitrarily interleaving reads+writes on the T reference with reads on the char reference is always well defined. For example, you might do this to repeatedly print out a hex dump of the underlying memory as you modify it multiple times through the T reference. The standard guarantees that strict aliasing optimizations will not be applied to these interleaved accesses, which otherwise might give you undefined behavior.
But what about writes through a char reference alias? Well, such writes may or may not be well defined. If a write through the char reference violates an invariant of the underlying T type, then you can get undefined behavior. If such a write improperly modified the value of a T member pointer, then you can get undefined behavior. If such a write modified a T member value to a trap value, then you can get undefined behavior. And so on. However, in other instances, writes through the char reference can be completely well defined. Rearranging the endianness of a uint32_t or uint64_t by reading+writing to them through an aliased char reference is always well defined, for example. So, whether such writes are completely well defined or not depends on the particulars of the writes themselves. Regardless, the standard guarantees that its strict aliasing optimizations will not reorder or eliminate such writes w.r.t. other operations on the aliased memory in a manner that itself could lead to undefined behavior.

So my understanding is that you are doing something like that:
enum MType { J,K };
struct MessageX { MType type; };
struct MessageJ {
MType type{ J };
int id{ 5 };
//some other members
};
const char* popMessage() {
return reinterpret_cast<char*>(new MessageJ());
}
void MessageServer(const char* foo) {
const MessageX* msgx = reinterpret_cast<const MessageX*>(foo);
switch (msgx->type) {
case J: {
const MessageJ* msgJ = reinterpret_cast<const MessageJ*>(foo);
std::cout << msgJ->id << std::endl;
}
}
}
int main() {
const char* foo = popMessage();
MessageServer(foo);
}
If that is correct, then the expression msgJ->id is ok (as would be any access to foo), as msgJ has the correct dynamic type. msgx->type on the other hand does incur UB, because msgx has a unrelated type. The fact that the the pointer to MessageJ was cast to const char* in between is completely irrelevant.
As was cited by others, here is the relevant part in the standard (the "glvalue" is the result of dereferencing the pointer):
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:52
the dynamic type of the object,
a cv-qualified version of the dynamic type of the object,
a type similar (as defined in 4.4) to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
an aggregate or union type that includes one of the aforementioned types among its elements or nonstatic data members (including, recursively, an element or non-static data member of a subaggregate or contained union),
a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
a char or unsigned char type.
As far as the discussion "cast to char*" vs "cast from char*" is concerned:
You might know that the standard doesn't talk about strict aliasing as such, it only provides the list above. Strict aliasing is one analysis technique based on that list for compilers to determine which pointers can potentially alias each other. As far as optimizations are concerned, it doesn't make a difference, if a pointer to a MessageJ object was cast to char* or vice versa. The compiler cannot (without further analysis) assume that a char* and MessageX* point to distinct objects and will not perform any optimizations (e.g. reordering) based on that.
Of course that doesn't change the fact that accessing a char array via a pointer to a different type would still be UB in C++ (I assume mostly due to alignment issues) and the compiler might perform other optimizations that could ruin your day.
EDIT:
What I'm asking is: Will the initialization of a const object, by
casting from a char*, ever be optimized below where that object is
cast to another type of object, such that I am casting from
uninitialized data?
No it will not. Aliasing analysis doesn't influence how the pointer itself is handled, but the access through that pointer. The compiler will NOT reorder the write access (store memory address in the pointer variable) with the read access (copy to other variable / load of address in order to access the memory location) to the same variable.

There is no aliasing problem as you use (const)char* type, see the last point of:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
the dynamic type of the object,
a cv-qualified version of the dynamic type of the object,
a type similar (as defined in 4.4) to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to the dynamic type of the object,
a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
an aggregate or union type that includes one of the aforementioned types among -its elements or non-static data members (including, recursively, an element or non-static data member of a subaggregate or contained union),
a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
a char or unsigned char type.

The other answer answered the question well enough (it's a direct quotation from the C++ standard in https://isocpp.org/files/papers/N3690.pdf page 75), so I'll just point out other problems in what you're doing.
Note that your code may run into alignment problems. For example, if the alignment of MessageJ is 4 or 8 bytes (typical on 32-bit and 64-bit machines), strictly speaking, it is undefined behaviour to access an arbitrary character array pointer as a MessageJ pointer.
You won't run into any problems on x86/AMD64 architectures as they allow unaligned access. However, someday you may find that the code you're developing is ported to a mobile ARM architecture and the unaligned access would be a problem then.
It therefore seems you're doing something you shouldn't be doing. I would consider using serialization instead of accessing a character array as a MessageJ type. The only problem isn't potential alignment problems, an additional problem is that the data may have a different representation on 32-bit and 64-bit architectures.

What are 'partially overlapping objects'?

I was just going through all the possible Undefined Behaviours in this thread, and one of them is
The result of assigning to partially overlapping objects
I wondered if anyone could give me a definition of what "partially overlapping objects" are and an example in code of how that could possibly be created?

As pointed out in other answers, a union is the most obvious way to arrange this.
This is an even clearer example of how partially overlapping objects might arise with the built in assignment operator. This example would not otherwise exhibit UB if it were not for the partially overlapping object restrictions.
union Y {
int n;
short s;
};
void test() {
Y y;
y.s = 3; // s is the active member of the union
y.n = y.s; // Although it is valid to read .s and then write to .x
// changing the active member of the union, .n and .s are
// not of the same type and partially overlap
}
You can get potential partial overlap even with objects of the same type. Consider this example in the case where short is strictly larger than char on an implementation that adds no padding to X.
struct X {
char c;
short n;
};
union Y {
X x;
short s;
};
void test() {
Y y;
y.s = 3; // s is the active member of the union
y.x.n = y.s; // Although it is valid to read .s and then write to .x
// changing the active member of the union, it may be
// that .s and .x.n partially overlap, hence UB.
}

A union is a good example for that.
You can create a structure of memory with overlapping members.
for example (from MSDN):
union DATATYPE // Declare union type
{
char ch;
int i;
long l;
float f;
double d;
} var1;
now if you use assign the char member all other member are undefined. That's because they are at the same memory block, and you've only set an actual value to a part of it:
DATATYPE blah;
blah.ch = 4;
If you will then try to access blah.i or blah.d or blah.f they will have an undefined value. (because only first byte, which is a char, had its value set)

This refers to the problem of pointer aliasing, which is forbidden in C++ to give compilers an easier time optimizing. A good explanation of the problem can be found in this thread

May be he mean a strict aliasing rule? Object in memory should not overlap with object of other type.
"Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias each other.)"

The canonical example is using memcpy:
char *s = malloc(100);
int i;
for(i=0; i != 100;++i) s[i] = i; /* just populate it with some data */
char *t = s + 10; /* s and t may overlap since s[10+i] = t[i] */
memcpy(t, s, 20); /* if you are copying at least 10 bytes, there is overlap and the behavior is undefined */
The reason why memcpy is undefined behavior is because there's no required algorithm for performing the copy. In this circumstance, memmove was introduced as a safe alternative.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Using memcpy to switch active member of union in C++ - c++

You are not allowed to use memcpy to copy overlapping regions of memory: If the objects overlap, the behavior is undefined. Your code has undefined behavior because of this violation of memcpy's precondition, as u.f and u.i occupy the same address in memory.

Related

Struct offsets and pointer safety in C++

Can a pointer be volatile?

Is it legal to use address of one field of a union to access another field?

Can Aliasing Problems be Avoided with const Variables

What are 'partially overlapping objects'?

Categories

Resources