The result of some pointer casts are described as unspecified. For example, [expr.static.cast]/13:
A prvalue of type “pointer to cv1 void” can be converted to a prvalue of type “pointer to cv2 T,” [...] If the original pointer value represents the address A of a byte in memory and A satisfies the alignment requirement of T, then the resulting pointer value represents the same address as the original pointer value, that is, A. The result of any other such pointer conversion is unspecified.
My question is: in the case where alignment is not satisfied, what are the possible results?
For example, are the following results permitted?
a null pointer
an invalid pointer value (i.e. pointer which does not point to allocated storage of size T)
a valid pointer to a T in a completely separate part of memory
Code sample for reference:
#include <iostream>
int main(int argc, char **argv)
{
int *b = (int *)"Hello, world"; // (1)
*b = -1; // (2)
std::cout << argc << '\n';
}
Line (1) triggers my above quote from [expr.static.cast]/13 because it is a reinterpret_cast which is covered by [expr.reinterpret.cast]/7 which defines the conversion in terms of static_casting through void *.
If the unspecified result may be an invalid pointer value, then line (1) may cause a hardware trap. (Reference: N4430 which clarifies similar wording that was in C++14 and C++11).
Corollary question: is there any case in which line 1 would cause undefined behaviour? (I don't think so at this stage; since C++14 invalid pointer value reading is implementation-defined or causes a hardware trap).
Also interesting is that line (2) would in most cases be undefined behaviour due to strict aliasing violation (and perhaps other reasons too), however if the unspecified result may be &argc then this program could output -1 without triggering undefined behaviour!
My question is: in the case where alignment is not satisfied, what are the possible results?
As far as I can tell N4303: Pointer safety and placement new partially answers this question, although somewhat indirectly. This paper refers to CWG issue 1412: Problems in specifying pointer conversions which brought about the changes to [expr.static.cast]/13 that you reference, specifically adding:
[...]If the original pointer value represents the address A of a byte in memory and A satisfies the alignment requirement of T, then the resulting pointer value represents the same address as the original pointer value, that is, A. The result of any other such pointer conversion is unspecified.[...]
In reference to this change N4303 says (emphasis mine):
Prior to the adoption of the resolution for DR 1412 [CWG1412], the value of bp is unspecified at the point of its initialization and its subsequent passing to operator new via the new-expression. Said pointer may be null, insufficiently aligned or otherwise dangerous to use.
So an unspecified conversion can results in:
A null pointer
An insufficiently aligned pointer
A pointer that is dangerous to use
Related
Consider this example
int main(){
std::intptr_t value = /* a special integer value */;
int* ptr = reinterpret_cast<int*>(value ); // #1
int v = *ptr; // #2
}
[expr.reinterpret.cast] p5 says
A value of integral type or enumeration type can be explicitly converted to a pointer. A pointer converted to an integer of sufficient size (if any such exists on the implementation) and back to the same pointer type will have its original value; mappings between pointers and integers are otherwise implementation-defined.
At least, step #1 is implementation-defined. For step #2, in my opinion, I think it has four possibilities, which are
The implementation does not support such a conversion, and implementation does anything for it.
The pointer value is exactly the address of an object of type int, the result is well-formed.
The pointer value is the address of an object other than the int type, the result is UB.
The pointer value is an invalid pointer value, the indirection is UB.
It means what the behavior of the indirection through the pointer ptr will be depends on the implementation. It is not definitely UB if the implementation takes option 2. So, I wonder whether this case is not definitely UB or is definitely UB? If it is latter, which provisions strongly state the behavior?
The standard has nothing more to say on it than what you quoted. The standard only guarantees the meaning of a integer-to-pointer cast if that integer value was taken from a pointer-to-integer cast. The meaning of all other integer-to-pointer conversions are implementation defined.
And that means everything about them is "implementation defined": what they result in and what using those results will do. After all, the pointer-to-integer-to-pointer round trip spells out that you get the "original value" back. The fact that the resulting pointer has the "original value" means that it will behave exactly like you had copied the original pointer itself. So the standard needs say nothing more on the matter.
The behavior of the pointer taken from an implementation-defined integer-to-pointer cast is... implementation-defined. If an implementation says that supports such conversions, it must spell out what the result of supported conversions are. That is, if there is some "a special integer value" for which a cast to an int* is supported by the implementation, it must say what the result of that cast is. This includes things like whether it pointer to an actual int or whatever.
Is performing indirection from a pointer acquired from converting an integer value definitely UB?
Not always. Here is an example that is definitely not UB:
int i = 42;
std::intptr_t value = reinterpret_cast<std::intptr_t>(&i);
int* ptr = reinterpret_cast<int*>(value);
int v = *ptr;
This is because converting a pointer to an integer of sufficient size and back to the same pointer type is guaranteed to yield the same pointer value as stated in the rule you quoted. Since the original pointer value was valid for indirection, so is the converted one.
I have stumbled upon a reddit thread in which a user has found an interesting detail of the C++ standard. The thread has not spawned much constructive discussion, therefore I will retell my understanding of the problem here:
OP wants to reimplement memcpy in a standard-compliant way
They attempt to do so by using reinterpret_cast<char*>(&foo), which is an allowed exception to the strict aliasing restrictions, in which reinterpreting as char is allowed to access the "object representation" of an object.
[expr.reinterpret.cast] says that doing so results in static_cast<cv T*>(static_cast<cv void*>(v)), so reinterpret_cast in this case is equivalent to static_cast'ing first to void * and then to char *.
[expr.static.cast] in combination with [basic.compound]
A prvalue of type “pointer to cv1 void” can be converted to a prvalue of type “pointer to cv2 T”, where T is an object type and cv2 is the same cv-qualification as, or greater cv-qualification than, cv1. [...] if the original pointer value points to an object a, and there is an object b of type T (ignoring cv-qualification) that is pointer-interconvertible with a, the result is a pointer to b. [...] [emphasis mine]
Consider now the following union class:
union Foo{
char c;
int i;
};
// the OP has used union, but iiuc,
// it can also be a struct for the problem to arise.
OP has thus come to the conclusion that reinterpreting a Foo* as char* in this case yields a pointer pointing to the first char member of the union (or its object representation), rather than to the object representation of the union itself, i.e. it points only to the member. While this appears superficially to be the same, and corresponds to the same memory address, the standard seems to differentiate between the "value" of a pointer and its corresponding address, in that on the abstract C++ machine, a pointer belongs to a certain object only. Incrementing it beyond that object (compare with end() of an array) is undefined behavior.
OP thus argues that if the standard forces the char* to be associated with the objects's first member instead of the object representation of the whole union object, dereferencing it after one incrementation is UB, which allows a compiler to optimize as if it were impossible for the resultant char* to ever access the following bytes of the int member. This implies that it is not possible to legally access the complete object representation of a class object which is pointer-interconvertible with a char member.
The same would, if I understand correctly apply if "union" was simply replaced with "struct", but I have taken this example from the original thread.
What do you think? Is this a standard defect? Is it a misinterpretation?
This video, linked in the comments (now chat) by #KonradRudolph is likely the answer to the problem.
At around the 40min mark, Timur Doumler, who is a member of the ISO C++ commitee, discusses the possibility of accessing byte representations. The summary is that any attempt of accessing byte representation except memcpy is UB. The situation in the OP does not even arise without making use of UB because the very act of using a pointer to an object like an array, or doing any pointer arithmetic on it is UB, as these operations are only well-defined when dealing with actual array objects, as far as the abstract machine is concerned.
Also, while reinterpreting a pointer as a char* does not on its own violate aliasing rules, there is technically no guarantee that the resulting char* will point to the first byte of the object.
The only legal way of accessing byte representations is to memcpy the object into a char array. This means that reimplementing memcpy is impossible.
Timur Doumler additionally describes this as a wording defect that will hopefully be fixed in C++23 and presents a paper that proposes a fix to this.
Try as I might, the closest answer I've seen is this, with two completely opposing answers(!)
The question is simple, is this legal?
auto p = reinterpret_cast<int*>(0xbadface);
*p; // legal?
My take on the matter
Casting integer to pointer: no restrictions on what may be casted
Indirection: only states the result is a lvalue.
Lifetimes: only states what can't be done on objects, there is no object here
Expression statements: *p is a discarded value expression
Discarded value expressions: no lvalue-to-rvalue conversion occurs
Undefined-ness of lvalues: aka strict aliasing rule, only if the lvalue is converted to a rvalue
So I conclude there is nothing explicitly saying this is undefined behaviour. Yet I distinctively remember that some platforms trap on indirection for invalid pointers. What went wrong with my reasoning?
[basic.compound] says:
Every value of pointer type is one of the following:
a pointer to an object or function (the pointer is said to point to the object or function), or
a pointer past the end of an object ([expr.add]), or
the null pointer value ([conv.ptr]) for that type, or
an invalid pointer value.
By the process of elimination we can deduce that p is an invalid pointer value.
[basic.stc] says:
Indirection through an invalid pointer value and passing an invalid
pointer value to a deallocation function have undefined behavior. Any
other use of an invalid pointer value has implementation-defined
behavior.
As indirection operator is said to perform indirection by [expr.unary.op], I would say, that expression *p causes UB no matter if the result is used or not.
... some platforms trap on indirection for invalid pointers.
Most platforms trap on invalid address access. This does not contradict the issue in any way. The question of what happens in *p; boils down to whether an attempt to actually fetch at an invalid address takes place or not.
The question of fetching is very similar to the core issue 232 (indirection through a null pointer). As you have already pointed out, *p; is a discarded value expression, and as such no lvalue-to-rvalue conversion ("fetching") takes place:
Tom Plum:
...it is only the act of "fetching", of lvalue-to-rvalue conversion, that triggers the ill-formed or undefined behavior.
And subsequently:
Notes from the October 2003 meeting:
We agreed that the approach in the standard seems okay: p = 0; *p; is
not inherently an error. An lvalue-to-rvalue conversion would give it
undefined behavior.
As to whether or not reinterpret_cast<int*>(0xbadface) produces a valid pointer, indeed in implementations with strict pointer safety, it wouldn't be a safely-derived pointer, and as such is invalid and any use of it is UB.
But in case of relaxed pointer safety the resulting pointer is valid (otherwise it would be impossible to use pointers returned from binary libraries and components written in C or other languages).
In the C++14 standard (n3797), the section on lvalue to rvalue conversions reads as follows (emphasis mine):
4.1 Lvalue-to-rvalue-conversion [conv.lval]
A glvalue (3.10) of a non-function, non-array type T can be converted to a prvalue. If T is an incomplete type, a program that necessitates this conversion is ill-formed. If T is a non-class type, the type of the prvalue is the cv-unqualified version of T. Otherwise the type of the prvalue is T.
When an lvalue-to-rvalue conversion occurs in an unevaluated operand
or a subexpression thereof (Clause 5) the value contained in the
referenced object is not accessed. In all other cases, the result of the
conversion is determined according to the following rules:
If T is a (possibly cv-qualified) std::nullptr_t then the result is a null pointer constant.
Otherwise, if T has class type, the conversion copy-initializes a temporary of type T from the glvalue and the result of the conversion is a prvalue for the temporary.
Otherwise, if the object to which the glvalue refers contains an invalid pointer value, the behavior is implementation-defined.
Otherwise, if T is a (possibly cv-qualified) unsigned character type, and the object to which the glvalue refers contains an indeterminate value, and that object does not have automatic storage duration or the glvalue was the operand of a unary & operator or it was bound to a reference, the result is an unspecified value.
Otherwise, if the object to which the glvalue refers has an indeterminate value, the behavior is undefined.
Otherwise, the object indicated by the glvalue is the prvalue result.
[Note: See also 3.10]
What's the significance of this paragraph (in bold)?
If this paragraph were not here, then the situations in which it applies would lead to undefined behavior. Normally, I would expect that accessing an unsigned char value while it has an indeterminate value leads to undefined behavior. But, with this paragraph it means that
If I'm not actually accessing the character value, i.e. I'm immediately passing it to & or binding it to a reference, or
If the unsigned char does not have automatic storage duration,
then the conversion yields an unspecified value, and not undefined behavior.
Am I correct to conclude that this program:
#include <new>
#include <iostream>
// using T = int;
using T = unsigned char;
int main() {
T * array = new T[500];
for (int i = 0; i < 500; ++i) {
std::cout << static_cast<int>(array[i]) << std::endl;
}
delete[] array;
}
is well-defined by the standard, and must output a sequence of 500 unspecified ints, while the same program where T = int, would have undefined behavior?
IIUC, one of the reasons to make it UB to read things with indeterminate values, is to allow aggressive dead store elimination by the optimizer. So, this paragraph may mean that a conforming compiler can't do as much optimization when working with unsigned char or arrays of unsigned char.
Assuming I understand correctly, what is the rationale for this rule? When is it useful to be able to read unsigned char that have indeterminate values, and get unspecified results instead of UB? I have this feeling that if they put this much effort into crafting this part of the rule, they had some motivation to help certain code examples that they cared about, or to be consistent with some other part of the standard, or simplify some other issue. But I have no idea what that might be.
In many situations, code will write some parts of a PODS or array without writing everything, and then use functions like memcpy or fwrite to copy or write the entire thing without regard for which parts had assigned values and which did not. Although it is not terribly common for C++ code to use byte-based operations to copy or write out the contents of aggregates, the ability to do so is a fundamental part of the language. Requiring that a program write definite values to all portions of an object, including those nothing will ever "care" about, would needlessly impair efficiency.
As far as I understand the wording in 5.2.9 Static cast, the only time the result of a void*-to-object-pointer conversion is allowed is when the void* was a result of the inverse conversion in the first place.
Throughout the standard there is a bunch of references to the representation of a pointer, and the representation of a void pointer being the same as that of a char pointer, and so on, but it never seems to explicitly say that casting an arbitrary void pointer yields a pointer to the same location in memory, with a different type, much like type-punning is undefined where not punning back to an object's actual type.
So while malloc clearly returns the address of suitable memory and so on, there does not seem to be any way to actually make use of it, portably, as far as I have seen.
C++0x standard draft has in 5.2.9/13:
An rvalue of type “pointer to cv1
void” can be converted to an rvalue of
type “pointer to cv2 T,” where T is an
object type and cv2 is the same
cv-qualification as, or greater
cv-qualification than, cv1. The null
pointer value is converted to the null
pointer value of the destination type.
A value of type pointer to object
converted to “pointer to cv void” and
back, possibly with different
cv-qualification, shall have its
original value.
But also note that the cast doesn't necessarily result in a valid object:
std::string* p = static_cast<std::string*>(malloc(sizeof(*p)));
//*p not a valid object
C++03, §20.4.6p2
The contents are the same as the Standard C library header <stdlib.h>, with the following changes: [list of changes that don't apply here]
C99, §7.20.3.3p2-3
(Though C++03 is based on C89, I only have C99 to quote. However, I believe this section is semantically unchanged. §7.20.3p1 may also be useful.)
The malloc function allocates space for an object whose size is specified by size and
whose value is indeterminate.
The malloc function returns either a null pointer or a pointer to the allocated space.
From these two quotes, malloc allocates an uninitialized object and returns a pointer to it, or returns a null pointer. A pointer to an object which you have as a void pointer can be converted to a pointer to that object (first sentence of C++03 §5.2.9p13, mentioned in the previous answer).
This should be less "handwaving", which you complained of, but someone might argue I'm "interpreting" C's definition of malloc as I wish, by, for example, noticing C says "to the allocated space" rather than "to the allocated object". To those people: first realize that "space" and "object" are synonyms in C, and second please file a defect report with the standard committees, because not even I am pedantic enough to continue. :)
I'll give you the benefit of the doubt and believe you got tripped up in the cross-references, cross-interpretation, and sometimes-confused integration between the standards, rather than "space" vs "object".
Throughout the standard there is a bunch of references to the representation of a pointer, and the representation of a void pointer being the same as that of a char pointer,
Yes, indeed.
So while malloc clearly returns the address of suitable memory and so on, there does not seem to be any way to actually make use of it, portably, as far as I have seen.
Of course there is:
void *vp = malloc (1);
char *cp;
memcpy (&cp, &vb, sizeof cp);
*cp = ' ';
There is one tiny problem : it does not work for any other type. :(