Can you access the object representation of any object through a char*? - c++

I have stumbled upon a reddit thread in which a user has found an interesting detail of the C++ standard. The thread has not spawned much constructive discussion, therefore I will retell my understanding of the problem here:
OP wants to reimplement memcpy in a standard-compliant way
They attempt to do so by using reinterpret_cast<char*>(&foo), which is an allowed exception to the strict aliasing restrictions, in which reinterpreting as char is allowed to access the "object representation" of an object.
[expr.reinterpret.cast] says that doing so results in static_­cast<cv T*>(static_­cast<cv void*>(v)), so reinterpret_cast in this case is equivalent to static_cast'ing first to void * and then to char *.
[expr.static.cast] in combination with [basic.compound]
A prvalue of type “pointer to cv1 void” can be converted to a prvalue of type “pointer to cv2 T”, where T is an object type and cv2 is the same cv-qualification as, or greater cv-qualification than, cv1. [...] if the original pointer value points to an object a, and there is an object b of type T (ignoring cv-qualification) that is pointer-interconvertible with a, the result is a pointer to b. [...] [emphasis mine]
Consider now the following union class:
union Foo{
char c;
int i;
};
// the OP has used union, but iiuc,
// it can also be a struct for the problem to arise.
OP has thus come to the conclusion that reinterpreting a Foo* as char* in this case yields a pointer pointing to the first char member of the union (or its object representation), rather than to the object representation of the union itself, i.e. it points only to the member. While this appears superficially to be the same, and corresponds to the same memory address, the standard seems to differentiate between the "value" of a pointer and its corresponding address, in that on the abstract C++ machine, a pointer belongs to a certain object only. Incrementing it beyond that object (compare with end() of an array) is undefined behavior.
OP thus argues that if the standard forces the char* to be associated with the objects's first member instead of the object representation of the whole union object, dereferencing it after one incrementation is UB, which allows a compiler to optimize as if it were impossible for the resultant char* to ever access the following bytes of the int member. This implies that it is not possible to legally access the complete object representation of a class object which is pointer-interconvertible with a char member.
The same would, if I understand correctly apply if "union" was simply replaced with "struct", but I have taken this example from the original thread.
What do you think? Is this a standard defect? Is it a misinterpretation?

This video, linked in the comments (now chat) by #KonradRudolph is likely the answer to the problem.
At around the 40min mark, Timur Doumler, who is a member of the ISO C++ commitee, discusses the possibility of accessing byte representations. The summary is that any attempt of accessing byte representation except memcpy is UB. The situation in the OP does not even arise without making use of UB because the very act of using a pointer to an object like an array, or doing any pointer arithmetic on it is UB, as these operations are only well-defined when dealing with actual array objects, as far as the abstract machine is concerned.
Also, while reinterpreting a pointer as a char* does not on its own violate aliasing rules, there is technically no guarantee that the resulting char* will point to the first byte of the object.
The only legal way of accessing byte representations is to memcpy the object into a char array. This means that reimplementing memcpy is impossible.
Timur Doumler additionally describes this as a wording defect that will hopefully be fixed in C++23 and presents a paper that proposes a fix to this.

Related

How can the alignment requirement be satisfied?

I think that I am misreading the standard quotation, hence I do not fully understand what's the exact intent of the wording.
Firstly, I am already aware of what alignment requirement is, but I can't figure out the exact relation between alignment requirement and casting in general, and what're the points I should care about, regarding alignment requirement, when I perform static_casting or reinterpret_casting. I think now the reader got my first question.
Secondly, there're some words in the standard quotation I spend two days to understand them but I don't. From the paragraph in:
N4885: 7.6.1.9 Static cast [expr.static.cast]
A prvalue of type “pointer to cv1 void” can be converted to a prvalue
of type “pointer to cv2 T”, where T is an object type and cv2 is the
same cv-qualification as, or greater cv-qualification than, cv1. If
the original pointer value represents the address A of a byte in
memory and A does not satisfy the alignment requirement of T, then the
resulting pointer value is unspecified.
Here, they said "if the original pointer value doesn't satisfy the alignment requirement of T, then the resulting pointer value is unspecified". What really does that mean?
What's I can't understand is when does this "original pointer value" satisfies the alignment requirement of T, and when does not, to avoid such unspecified pointer value. I just need someone to explain the bold part from the above quote with simple examples; what I have to know, as a programmer, from that bold part and what I have to avoid. For example:
int i = 12;
double *pd = static_cast<double *>(static_cast<void *>(&i)); // does 'pd' has unspecified address value?.
short *ps = static_cast<short *>(static_cast<void *>(&i)); // does 'ps' has unspecified address value?.
Finally, there's a relatively same sentence I need to understand in:
N4885: 7.6.1.10 Reinterpret cast [expr.reinterpret.cast]
An object pointer can be explicitly converted to an object pointer of
a different type.61 When a prvalue v of object pointer type is
converted to the object pointer type “pointer to cv T”, the result is
static_cast<cv T*>(static_cast<cv void*>(v)). [Note 7: Converting a
pointer of type “pointer to T1” that points to an object of type T1 to
the type “pointer to T2” (where T2 is an object type and the
alignment requirements of T2 are no stricter than those of T1) and
back to its original type yields the original pointer value. — end
note].
What does the standard mean by this sentence "the alignment requirements of T2 are no stricter than those of T1", what does the word "stricter than" mean.
I think if I have this static_assert expression, then maybe the alignment requirements of T2 would not be stricter than those of T1: static_assert(alignof(T1) >= alignof(T2)); Or this assertion is not true for some cases.
int i = 34;
double *pd = reinterpret_cast<double *>(&i); // does 'pd' has unspecified address value?.
short *ps = reinterpret_cast<short *>(&i); // does 'ps' has unspecified address value?.
I am added these example to just clear what my problem lies, not to just answer the questions in the // comments
While common implementations have integer-like pointers (such that reinterpret_cast behaves like memcpy between pointers and integers and arithmetic on the pointers is reflected in the integer values), the standard as usual provides only weak guarantees to support less common architectures where pointers have other formats and/or special registers. As such, it’s impossible to observe alignment of a dynamic pointer value: the unspecified value applies if alignof(expression_type)<alignof(cast_type) unless the pointer refers to an object declared with alignas or to an object whose actual type is more strongly aligned.
This means that double* is a poor type to use for the sort of “temporary pointer storage” for which reinterpret_cast exists; fortunately, most C++ code uses templates (so this doesn’t come up), old C code uses char* (whose alignment is 1), and other code uses void* (which has no alignment), so there’s rarely an actual issue here.

char* conversion and aliasing rules

According to strict aliasing rules:
struct B { virtual ~B() {} };
struct D : public B { };
D d;
char *c = reinterpret_cast<char*>(&d);
A char* to any object of different type is valid. But now the question is, will it point to the same address of &d? what is the guarantee made by C++ Standard that it will return the same address?
c and &d do indeed have the same value, and if you reinterpret-cast c back to a D* you get a valid pointer that you may dereference. Furthermore, you can treat c as (pointer to the first element of) an opaque array char[sizeof(D)] -- this is indeed the main purpose of casting pointers to char pointers: To allow (de)serialization (e.g. ofile.write(c, sizeof(D));), although you should generally only do this for primitive types (and arrays thereof), since the binary layout of of compound types is not generally specified in a portable fashion.
As #Oli rightly points out and would like me to reinforce, you should really never serialize compound types as a whole. The result will almost never be deserializable, since the implementation of polymorphic classes and padding between data fields is not specified and not accessible to you.
Note that reinterpret_cast<char*>(static_cast<B*>(&d)) may be treated as an opaque array char[sizeof(B)] by similar reasoning.
Section 5.2.10, point 7 of the 2003 C++ Standard says:
A pointer to an object can be explicitly converted to a pointer to an
object of different type. Except that converting an rvalue of type
“pointer to T1” to the type “pointer to T2” (where T1 and T2 are
object types and where the alignment requirements of T2 are no
stricter than those of T1) and back to its original type yields the
original pointer value, the result of such a pointer conversion is
unspecified.
If by "same address" you mean "original pointer value," then this entry says "yes."
The intent is clear (and not something that needs to be debated):
reinterpret_cast never changes the value of an address, unless the target type cannot represent all address values (like a small integer type, on a pointer type with intrinsic alignment: f.ex. a pointer that can only represent even addresses, or pointers to object and pointers to functions cannot be mixed...).
The wording of the standard fails to capture that, but that doesn't mean there is a real practical issue here.
char *c = reinterpret_cast<char*>(&d);
c will point to the first byte of d, always.

Is the aliasing rule symmetric?

I had a discussion with someone on IRC and this question turned up. We are allowed by the Standard to change an object of type int by a char lvalue.
int a;
char *b = (char*) &a;
*b = 0;
Would we be allowed to do this in the opposite direction, if we know that the alignment is fine?
The issue I'm seeing is that the aliasing rule does not cover the simple case of the following, if one considers the aliasing rule as a non-symmetric relation
int a;
a = 0;
The reason is, that each object contains a sequence of sizeof(obj) unsigned char objects (called the "object representation"). If we change the int, we will change some or all of those objects. However, the aliasing rule only states we are allowed to change a int by an char or unsigned char, but not the other way around. Another example
int a[1];
int *ra = a;
*ra = 0;
Only one direction is described by 3.10/15 ("An aggregate or union type that includes..."), but this time we need the other way around ("A type that is the element or non-static data member type of an aggregate...").
Is the other direction implied? This question also applies to C.
The aliasing rule simply states that there's one "effective type" (C99 6.5.7, plus footnote 73) for any given object in memory, and any accesses to such an object go through one of:
A type compatible with the effective type (qualifiers such as const and restrict, as well signed/unsigned-ness may vary)
A struct or union containing one such type
A character type
The effective type isn't specified in advanced, of course - it's just a construct that is used to specify aliasing. But the intent is simply that you don't access the same object with two different non-character types.
So the answer is, yes, you can indeed go the other direction.
The standard (C99 6.3.2.3 §7) defines pointer casts like this as "just fine", the casted pointer will point at the same address. (Unless the CPU has alignment which makes the cast impossible, then it is undefined behavior.)
That is, the actual cast in itself is fine. What will happen if you start to manipulate the data... now that's another implementation-defined story.
Here's from the standard:
"A pointer to an object or incomplete type may be converted to a pointer to a different object or incomplete type. If the resulting pointer is not correctly aligned (57) for the pointed-to type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.
When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers
to the remaining bytes of the object."
"57) In general, the concept ‘‘correctly aligned’’ is transitive: if a pointer to type A is correctly aligned for a pointer to type B, which in turn is correctly aligned for a pointer to type C, then a pointer to type A is
correctly aligned for a pointer to type C."
I think I'm a bit confused by the question, but the 2nd and 3rd examples are accessing the int through an lvalue that has the object's type (int in the examples).
C++ 3.10/15 states as it's first item that it's OK to access the object through an lvalue that has the the type of "the dynamic type of the object".
What am I misunderstanding in the question?

What wording in the C++ standard allows static_cast<non-void-type*>(malloc(N)); to work?

As far as I understand the wording in 5.2.9 Static cast, the only time the result of a void*-to-object-pointer conversion is allowed is when the void* was a result of the inverse conversion in the first place.
Throughout the standard there is a bunch of references to the representation of a pointer, and the representation of a void pointer being the same as that of a char pointer, and so on, but it never seems to explicitly say that casting an arbitrary void pointer yields a pointer to the same location in memory, with a different type, much like type-punning is undefined where not punning back to an object's actual type.
So while malloc clearly returns the address of suitable memory and so on, there does not seem to be any way to actually make use of it, portably, as far as I have seen.
C++0x standard draft has in 5.2.9/13:
An rvalue of type “pointer to cv1
void” can be converted to an rvalue of
type “pointer to cv2 T,” where T is an
object type and cv2 is the same
cv-qualification as, or greater
cv-qualification than, cv1. The null
pointer value is converted to the null
pointer value of the destination type.
A value of type pointer to object
converted to “pointer to cv void” and
back, possibly with different
cv-qualification, shall have its
original value.
But also note that the cast doesn't necessarily result in a valid object:
std::string* p = static_cast<std::string*>(malloc(sizeof(*p)));
//*p not a valid object
C++03, §20.4.6p2
The contents are the same as the Standard C library header <stdlib.h>, with the following changes: [list of changes that don't apply here]
C99, §7.20.3.3p2-3
(Though C++03 is based on C89, I only have C99 to quote. However, I believe this section is semantically unchanged. §7.20.3p1 may also be useful.)
The malloc function allocates space for an object whose size is specified by size and
whose value is indeterminate.
The malloc function returns either a null pointer or a pointer to the allocated space.
From these two quotes, malloc allocates an uninitialized object and returns a pointer to it, or returns a null pointer. A pointer to an object which you have as a void pointer can be converted to a pointer to that object (first sentence of C++03 §5.2.9p13, mentioned in the previous answer).
This should be less "handwaving", which you complained of, but someone might argue I'm "interpreting" C's definition of malloc as I wish, by, for example, noticing C says "to the allocated space" rather than "to the allocated object". To those people: first realize that "space" and "object" are synonyms in C, and second please file a defect report with the standard committees, because not even I am pedantic enough to continue. :)
I'll give you the benefit of the doubt and believe you got tripped up in the cross-references, cross-interpretation, and sometimes-confused integration between the standards, rather than "space" vs "object".
Throughout the standard there is a bunch of references to the representation of a pointer, and the representation of a void pointer being the same as that of a char pointer,
Yes, indeed.
So while malloc clearly returns the address of suitable memory and so on, there does not seem to be any way to actually make use of it, portably, as far as I have seen.
Of course there is:
void *vp = malloc (1);
char *cp;
memcpy (&cp, &vb, sizeof cp);
*cp = ' ';
There is one tiny problem : it does not work for any other type. :(

casting via void* instead of using reinterpret_cast [duplicate]

This question already has answers here:
Should I use static_cast or reinterpret_cast when casting a void* to whatever
(9 answers)
Closed 1 year ago.
I'm reading a book and I found that reinterpret_cast should not be used directly, but rather casting to void* in combination with static_cast:
T1 * p1=...
void *pv=p1;
T2 * p2= static_cast<T2*>(pv);
Instead of:
T1 * p1=...
T2 * p2= reinterpret_cast<T2*>(p1);
However, I can't find an explanation why is this better than the direct cast. I would very appreciate if someone can give me an explanation or point me to the answer.
Thanks in advance
p.s. I know what is reinterpret_cast used for, but I never saw that is used in this way
For types for which such cast is permitted (e.g. if T1 is a POD-type and T2 is unsigned char), the approach with static_cast is well-defined by the Standard.
On the other hand, reinterpret_cast is entirely implementation-defined - the only guarantee that you get for it is that you can cast a pointer type to any other pointer type and then back, and you'll get the original value; and also, you can cast a pointer type to an integral type large enough to hold a pointer value (which varies depending on implementation, and needs not exist at all), and then cast it back, and you'll get the original value.
To be more specific, I'll just quote the relevant parts of the Standard, highlighting important parts:
5.2.10[expr.reinterpret.cast]:
The mapping performed by reinterpret_cast is implementation-defined. [Note: it might, or might not, produce a representation different from the original value.] ... A pointer to an object can be explicitly converted to a pointer to an object of different type.) Except that converting an rvalue of type “pointer to T1” to the type “pointer to T2” (where T1 and T2 are object types and where the alignment requirements of T2 are no stricter than those of T1) and back to its original type yields the original pointer value, the result of such a pointer conversion is unspecified.
So something like this:
struct pod_t { int x; };
pod_t pod;
char* p = reinterpret_cast<char*>(&pod);
memset(p, 0, sizeof pod);
is effectively unspecified.
Explaining why static_cast works is a bit more tricky. Here's the above code rewritten to use static_cast which I believe is guaranteed to always work as intended by the Standard:
struct pod_t { int x; };
pod_t pod;
char* p = static_cast<char*>(static_cast<void*>(&pod));
memset(p, 0, sizeof pod);
Again, let me quote the sections of the Standard that, together, lead me to conclude that the above should be portable:
3.9[basic.types]:
For any object (other than a base-class subobject) of POD type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array of char or unsigned char. If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.
The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T).
3.9.2[basic.compound]:
Objects of cv-qualified (3.9.3) or cv-unqualified type void* (pointer to void), can be used to point to objects of unknown type. A void* shall be able to hold any object pointer. A cv-qualified or cv-unqualified (3.9.3) void* shall have the same representation and alignment requirements as a cv-qualified or cv-unqualified char*.
3.10[basic.lval]:
If a program attempts to access the stored value of an object through an lvalue of other than one of the following types the behavior is undefined):
...
a char or unsigned char type.
4.10[conv.ptr]:
An rvalue of type “pointer to cv T,” where T is an object type, can be converted to an rvalue of type “pointer to cv void.” The result of converting a “pointer to cv T” to a “pointer to cv void” points to the start of the storage location where the object of type T resides, as if the object is a most derived object (1.8) of type T (that is, not a base class subobject).
5.2.9[expr.static.cast]:
The inverse of any standard conversion sequence (clause 4), other than the lvalue-to-rvalue (4.1), array-topointer (4.2), function-to-pointer (4.3), and boolean (4.12) conversions, can be performed explicitly using static_cast.
[EDIT] On the other hand, we have this gem:
9.2[class.mem]/17:
A pointer to a POD-struct object, suitably converted using a reinterpret_cast, points to its initial member (or if that member is a bit-field, then to the unit in which it resides) and vice versa. [Note: There might therefore be unnamed padding within a POD-struct object, but not at its beginning, as necessary to achieve appropriate alignment. ]
which seems to imply that reinterpret_cast between pointers somehow implies "same address". Go figure.
There is not the slightest doubt that the intent is that both forms are well defined, but the wording fails to capture that.
Both forms will work in practice.
reinterpret_cast is more explicit about the intent and should be preferred.
The real reason this is so is because of how C++ defines inheritance, and because of member pointers.
With C, pointer is pretty much just an address, as it should be. In C++ it has to be more complex because of some of its features.
Member pointers are really an offset into a class, so casting them is always a disaster using C style.
If you have multiply inherited two virtual objects that also have some concrete parts, that's also a disaster for C style. This is the case in multiple inheritance that causes all the problems, though, so you should not ever want to use this anyway.
Really hopefully you never use these cases in the first place. Also, if you are casting a lot that's another sign you are messing up in in your design.
The only time I end up casting is with the primitives in areas C++ decides are not the same but where obviously they have to be. For actual objects, any time you want to cast something, start to question your design because you should be 'programming to the interface' most of the time. Of course, you can't change how 3rd party APIs work so you don't always have much choice.