I've stumbled upon the following code:
#include <bitset>
#include <iostream>
int main() {
int x = 8;
void *w = &x;
bool val = *reinterpret_cast<const unsigned char*>(&x);
bool *z = static_cast<bool *>(w);
std::cout << "z (" << z << ") is " << *z << ": " << std::bitset<8>(*z) << "\n";
std::cout << "val is " << val << ": " << std::bitset<8>(val) << "\n";
}
With -O3, this produced output:
z (0x7ffcaef0dba4) is 8: 00001000
val is 1: 00000001
However, with -O0, this produced output:
z (0x7ffe8c6c914c) is 0: 00000000
val is 1: 00000001
I know that dereferencing z invokes undefined behavior, which is why we are seeing inconsistent results. However, it seems that dereferencing the reinterpret_cast result to initialize val does not invoke undefined behavior, and reliably produces {0,1} values.
Via (https://godbolt.org/z/f6s11Kr96), we see that gcc for x86 produces:
lea rax, [rbp-16]
movzx eax, BYTE PTR [rax]
test al, al
setne al
mov BYTE PTR [rbp-9], al
The effect of the test and setne instructions is to convert nonzero values to 1 (and keep 0 values at 0). Is there some rule that states that reinterpret_casting from void * to const unsigned char * should have this behavior?
Reading the value of *z causes undefined behaviour due to a strict aliasing violation (C++20 [basic.lval]/11). The expression has type bool but the object at the memory location has type int. Only certain pairs of types are permitted to alias, and bool and int are not one of those pairs.
The val part of the code is not UB because const unsigned char is allowed to alias other types. The initializer for val will produce an unsigned char value whose memory representation is the same as the contents of the first byte of x.
Then, that result is converted (not reinterpreted) to bool producing either false if 0 or true otherwise.
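To make the two steps (read a byte, then convert the value) concrete, here is a minimal sketch. The helper name first_byte_nonzero is made up for illustration and is not from the question.

```cpp
// Hypothetical helper (name is an assumption, not from the question).
// Step 1: read the first byte of the int's object representation through
//         an unsigned char lvalue, which the aliasing rule permits.
// Step 2: convert that value to bool: 0 becomes false, anything else true.
inline bool first_byte_nonzero(const int& x) {
    const unsigned char first = *reinterpret_cast<const unsigned char*>(&x);
    return first;  // value conversion, not a reinterpretation of the bits
}
```

Which byte counts as "first" depends on endianness, so for x = 8 the result can differ between platforms, but it is always exactly true or false.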
Accessing (i.e. reading) the value of *z (not merely forming the dereference expression) causes undefined behavior because it is an aliasing violation. (z points to an object of type int, but the access is through an lvalue of type bool.)
Access through an lvalue of type unsigned char is specifically exempt from being an aliasing violation. (see [basic.lval]/11.3)
However, technically, it is still not specified what the result of accessing the int object through an unsigned char lvalue should be. The intent is that it gives the first byte of the object representation of the int object, but the standard currently is defective in not specifying that behavior. The paper P1839 attempts to resolve this defect.
After reading this first byte from the object representation as an unsigned char value you convert it implicitly to bool when initializing bool val from it. The conversion from unsigned char to bool is a conversion of values, not reinterpretation of object representation. It is specified that a zero value is converted to false and anything else to true. (see [conv.bool])
Whether you cast through void* explicitly or directly cast the int* to unsigned char* or bool* doesn't matter at all. reinterpret_cast between pointers is actually specified to be equivalent to static_cast<void*> followed by static_cast to the target pointer type. (In your code static_cast and reinterpret_cast are interchangeable.)
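A small sketch of that equivalence, assuming nothing beyond the code in the question (the function name casts_agree is invented here):

```cpp
// Hypothetical check: both spellings of the cast should yield the same
// pointer value, since reinterpret_cast between object pointer types is
// specified in terms of the static_cast chain through void*.
inline bool casts_agree(int& x) {
    unsigned char* direct = reinterpret_cast<unsigned char*>(&x);
    unsigned char* via_void = static_cast<unsigned char*>(static_cast<void*>(&x));
    return direct == via_void;
}
```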
This is undefined behavior.
According to -fsanitize=undefined:
/app/example.cpp:9:41: runtime error: load of value 8, which is not a valid value for type 'bool'
/app/example.cpp:9:70: runtime error: load of value 8, which is not a valid value for type 'bool'
https://godbolt.org/z/vM5MxT4Md
In particular, -fsanitize=undefined is complaining that bool should hold either true or false and anything else is UB. (I believe the bit representation of true and false are implementation-defined.)
As for expr.reinterpret.cast
reinterpret_cast<T>(v)
An object pointer can be explicitly converted to an object pointer of a different type. When a prvalue v of object pointer type is converted to the object pointer type “pointer to cv T”, the result is static_cast<cv T*>(static_cast<cv void*>(v)).
[Note 6: Converting a pointer of type “pointer to T1” that points to an object of type T1 to the type “pointer to T2” (where T2 is an object type and the alignment requirements of T2 are no stricter than those of T1) and back to its original type yields the original pointer value. — end note].
The expression bool val = *reinterpret_cast<const unsigned char*>(&x); is valid code.
Related
Suppose we take a very big array of unsigned chars.
std::array<uint8_t, 100500> blob;
// ... fill array ...
(Note: it is aligned already, question is not about alignment.)
Then we treat it as uint64_t[] and try to access it:
const auto ptr = reinterpret_cast<const uint64_t*>(blob.data());
std::cout << ptr[7] << std::endl;
Casting to uint64_t* and then reading through it looks suspicious to me.
But neither UBSan nor -Wstrict-aliasing flags it.
Google uses this technique in FlatBuffers.
Cap'n Proto uses it too.
Is it undefined behavior?
You cannot access the value of an unsigned char object through a glvalue of another type. But the opposite is authorized: you can access the value of any object through an unsigned char glvalue [basic.lval]:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined: [...]
a char, unsigned char, or std::byte type.
So, to be 100% standard compliant, the idea is to reverse the reinterpret_cast:
uint64_t i;
std::memcpy(&i, blob.data() + 7*sizeof(uint64_t), sizeof(uint64_t));
std::cout << i << std::endl;
With optimizations enabled, it will typically produce the exact same assembly.
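Wrapped up as a reusable function, the memcpy approach could look like the sketch below. The name load_u64 and its signature are my own choice for illustration; the caller is assumed to guarantee that the buffer holds at least (index + 1) * 8 bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads the index-th 64-bit word out of a byte buffer
// without ever forming a uint64_t lvalue that refers to the bytes.
// Assumption: the buffer contains at least (index + 1) * sizeof(uint64_t) bytes.
inline std::uint64_t load_u64(const std::uint8_t* bytes, std::size_t index) {
    std::uint64_t value;
    std::memcpy(&value, bytes + index * sizeof value, sizeof value);
    return value;
}
```

With the array from the question, load_u64(blob.data(), 7) would stand in for ptr[7]. The bytes are interpreted in the host byte order, exactly as the cast version would.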
The cast itself is well defined (a reinterpret_cast never has UB), but the lvalue to rvalue conversion in expression "ptr[7]" would be UB if no uint64_t object has been constructed in that address.
As "// ... fill array ..." is not shown, there could have been constructed a uint64_t object in that address (assuming as you say, the address has sufficient alignment):
const uint64_t* p = new (blob.data() + 7 * sizeof(uint64_t)) uint64_t();
If a uint64_t object has been constructed in that address, then the code in question has well defined behaviour.
I was bitten by an unintended C++ function match by MSVC. I can reduce it to the following test case:
#include <iostream>
enum Code { aaa, bbb };
struct MyVal {
Code c;
MyVal(Code c): c(c) { }
};
void test(int i, MyVal val) {
std::cout << "case " << i << ": value " << val.c << std::endl;
}
void test(int i, double* f) {
std::cout << "case " << i << ": WRONG" << std::endl;
}
const Code v1 = aaa;
Code v2 = aaa;
const Code v3 = bbb;
int main() {
const Code w1 = aaa;
Code w2 = aaa;
const Code w3 = bbb;
test(1, v1); // unexpected MSVC WRONG
test(2, v2);
test(3, v3);
test(4, aaa);
test(5, w1); // unexpected MSVC WRONG
test(6, w2);
test(7, w3);
return 0;
}
I expected that all 7 invocations of test would match the first overload, and GCC (live example) and Clang (live example) match this as intended:
case 1: value 0
case 2: value 0
case 3: value 1
case 4: value 0
case 5: value 0
case 6: value 0
case 7: value 1
But MSVC (live example) matches cases 1 and 5 to the "wrong" overload (I found this behavior in MSVC 2013 and 2015):
case 1: WRONG
case 2: value 0
case 3: value 1
case 4: value 0
case 5: WRONG
case 6: value 0
case 7: value 1
It seems that the conversion to a pointer is preferred by MSVC for a const enum variable with (accidental) value 0. I would have expected this behavior with a literal 0, but not with an enum variable.
My questions: Is the MSVC behavior standard-conformant? (Perhaps for an older version of C++?) If not, is this a known extension or bug?
You don't name any standards, but let's see what the differences are:
[C++11: 4.10/1]: A null pointer constant is an integral constant expression (5.19) prvalue of integer type that evaluates to zero or a prvalue of type std::nullptr_t. A null pointer constant can be converted to a pointer type; the result is the null pointer value of that type and is distinguishable from every other value of object pointer or function pointer type. Such a conversion is called a null pointer conversion. Two null pointer values of the same type shall compare equal. The conversion of a null pointer constant to a pointer to cv-qualified type is a single conversion, and not the sequence of a pointer conversion followed by a qualification. [..]
[C++11: 5.19/3]: A literal constant expression is a prvalue core constant expression of literal type, but not pointer type. An integral constant expression is a literal constant expression of integral or unscoped enumeration type. [..]
And:
[C++03: 4.10/1]: A null pointer constant is an integral constant expression (5.19) rvalue of integer type that evaluates to zero. A null pointer constant can be converted to a pointer type; the result is the null pointer value of that type and is distinguishable from every other value of pointer to object or pointer to function type. Two null pointer values of the same type shall compare equal. The conversion of a null pointer constant to a pointer to cv-qualified type is a single conversion, and not the sequence of a pointer conversion followed by a qualification conversion (4.4).
[C++03: 5.19/2]: Other expressions are considered constant-expressions only for the purpose of non-local static object initialization (3.6.2). Such constant expressions shall evaluate to one of the following:
a null pointer value (4.10),
a null member pointer value (4.11),
an arithmetic constant expression,
an address constant expression,
a reference constant expression,
an address constant expression for a complete object type, plus or minus an integral constant expression, or
a pointer to member constant expression.
The key here is that the standard language changed between C++03 and C++11, with the latter introducing the requirement that a null pointer constant of this form be a literal.
(They always needed to actually be constants and evaluate to 0, so you can remove v2, v3, w2 and w3 from your testcase.)
A null pointer constant can convert to a double* more easily than going through your user-defined conversion, so…
I believe MSVC is implementing the C++03 rules.
Amusingly, though, if I put GCC in C++03 mode, its behaviour isn't changed, which is technically non-compliant. I suspect the change in the language stemmed from the behaviour of common implementations at the time, rather than the other way around. I can see some evidence that GCC was [allegedly] non-conforming in this regard as early as 2004, so it may also just be that the standard wording change fortuitously un-bugged what had been a GCC bug.
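If you need the same result from every compiler regardless of which rules it follows, one option (my suggestion, not something the standard quotes above prescribe) is to construct MyVal explicitly at the call site. The sketch below reuses the types from the question but returns a tag string instead of printing; call_explicit is a made-up name.

```cpp
#include <string>

enum Code { aaa, bbb };

struct MyVal {
    Code c;
    MyVal(Code c) : c(c) {}
};

// Same overload set as in the question, returning a tag instead of printing.
inline std::string test(int, MyVal) { return "MyVal"; }
inline std::string test(int, double*) { return "pointer"; }

const Code v1 = aaa;

// The argument already has type MyVal, so the first overload is an exact
// match and the double* overload is not viable at all (MyVal has no
// conversion to a pointer). No null pointer conversion can be considered.
inline std::string call_explicit() { return test(1, MyVal(v1)); }
```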
Consider the following code.
#include <stdio.h>
int main() {
typedef int T;
T a[] = { 1, 2, 3, 4, 5, 6 };
T(*pa1)[6] = (T(*)[6])a;
T(*pa2)[3][2] = (T(*)[3][2])a;
T(*pa3)[1][2][3] = (T(*)[1][2][3])a;
T *p = a;
T *p1 = *pa1;
//T *p2 = *pa2; //error in c++
//T *p3 = *pa3; //error in c++
T *p2 = **pa2;
T *p3 = ***pa3;
printf("%p %p %p %p %p %p %p %p\n", a, pa1, pa2, pa3, p, p1, p2, p3);
printf("%d %d %d %d %d %d %d %d\n", a[5], (*pa1)[5],
(*pa2)[2][1], (*pa3)[0][1][2], p[5], p1[5], p2[5], p3[5]);
return 0;
}
The above code compiles and runs in C, producing the expected results. All the pointer values are the same, as are all the int values. I think the result will be the same for any type T, but int is the easiest to work with.
I confess to being initially surprised that dereferencing a pointer-to-array yields an identical pointer value, but on reflection I think that is merely the converse of the array-to-pointer decay we know and love.
[EDIT: The commented out lines trigger errors in C++ and warnings in C. I find the C standard vague on this point, but this is not the real question.]
In this question, it was claimed to be Undefined Behaviour, but I can't see it. Am I right?
Code here if you want to see it.
Right after I wrote the above it dawned on me that those errors are because there is only one level of pointer decay in C++. More dereferencing is needed!
T *p2 = **pa2; //no error in c or c++
T *p3 = ***pa3; //no error in c or c++
And before I managed to finish this edit, @AntonSavin provided the same answer. I have edited the code to reflect these changes.
This is a C-only answer.
C11 (n1570) 6.3.2.3 p7
A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned*) for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer.
*) In general, the concept “correctly aligned” is transitive: if a pointer to type A is correctly aligned for a pointer to type B, which in turn is correctly aligned for a pointer to type C, then a pointer to type A is correctly aligned for a pointer to type C.
The standard is a little vague about what happens if we use such a pointer (strict aliasing aside) for anything other than converting it back, but the intent and widespread interpretation is that such pointers should compare equal (and have the same numerical value, e.g. they should also be equal when converted to uintptr_t). As an example, think about (void *)array == (void *)&array (converting to char * instead of void * is explicitly guaranteed to work).
T(*pa1)[6] = (T(*)[6])a;
This is fine, the pointer is correctly aligned (it’s the same pointer as &a).
T(*pa2)[3][2] = (T(*)[3][2])a; // (i)
T(*pa3)[1][2][3] = (T(*)[1][2][3])a; // (ii)
Iff T[6] has the same alignment requirements as T[3][2], (i) is safe; iff it has the same alignment requirements as T[1][2][3], (ii) is safe. It would be strange if they differed, but I cannot find a guarantee in the standard that they have the same alignment requirements.
T *p = a; // safe, of course
T *p1 = *pa1; // *pa1 has type T[6], after lvalue conversion it's T*, OK
T *p2 = **pa2; // **pa2 has type T[2], or T* after conversion, OK
T *p3 = ***pa3; // ***pa3, has type T[3], T* after conversion, OK
Ignoring the UB caused by passing int * where printf expects void *, let’s look at the expressions in the arguments for the next printf, first the defined ones:
a[5] // OK, of course
(*pa1)[5]
(*pa2)[2][1]
(*pa3)[0][1][2]
p[5] // same as a[5]
p1[5]
Note, that strict aliasing isn’t a problem here, no wrongly-typed lvalue is involved, and we access T as T.
The following expressions depend on the interpretation of out-of-bounds pointer arithmetic, the more relaxed interpretation (allowing container_of, array flattening, the “struct hack” with char[], etc.) allows them as well; the stricter interpretation (allowing a reliable run-time bounds-checking implementation for pointer arithmetic and dereferencing, but disallowing container_of, array flattening (but not necessarily array “lifting”, what you did), the struct hack, etc.) renders them undefined:
p2[5] // UB, p2 points to the first element of a T[2] array
p3[5] // UB, p3 points to the first element of a T[3] array
The only reason your code compiles in C is that your default compiler setup allows the compiler to implicitly perform some illegal pointer conversions. Formally, this is not allowed by the C language. These lines
T *p2 = *pa2;
T *p3 = *pa3;
are ill-formed in C++ and produce constraint violations in C. In casual parlance, these lines are errors in both C and C++ languages.
Any self-respecting C compiler will issue (is actually required to issue) diagnostic messages for these constraint violations. GCC compiler, for one example, will issue "warnings" telling you that pointer types in the above initializations are incompatible. While "warnings" are perfectly sufficient to satisfy standard requirements, if you really want to use GCC compiler's ability to recognize constraint violating C code, you have to run it with -pedantic-errors switch and, preferably, explicitly select standard language version by using -std= switch.
In your experiment, C compiler performed these implicit conversions for you as a non-standard compiler extension. However, the fact that GCC compiler running under ideone front completely suppressed the corresponding warning messages (issued by the standalone GCC compiler even in its default configuration) means that ideone is a broken C compiler. Its diagnostic output cannot be meaningfully relied upon to tell valid C code from invalid one.
As for the conversion itself... It is not undefined behavior to perform this conversion. But it is undefined behavior to access array data through the converted pointers.
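If the goal is just to view the flat array as a 3x2 (or 2x3) matrix, computing the index by hand avoids both the conversion and the question of whether the access is defined. A minimal C++ sketch, with the helper name at being an assumption of mine:

```cpp
#include <cstddef>

// Hypothetical helper: treats a flat array as a matrix with Cols columns
// by computing the flat index, so every access goes through the original
// T* and stays inside the original array.
template <std::size_t Cols, class T>
T& at(T* flat, std::size_t row, std::size_t col) {
    return flat[row * Cols + col];
}
```

For the array in the question, at<2>(a, 2, 1) names the same element that (*pa2)[2][1] was meant to reach. The caller is responsible for keeping row and col in range.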
UPDATE: The following applies to C++ only, for C scroll down.
In short, there's no UB in C++ and there is UB in C.
8.3.4/7 says:
A consistent rule is followed for multidimensional arrays. If E is an n-dimensional array of rank i x j x ... x k,
then E appearing in an expression that is subject to the array-to-pointer conversion (4.2) is converted to a
pointer to an (n - 1)-dimensional array with rank j x ... x k. If the * operator, either explicitly or implicitly
as a result of subscripting, is applied to this pointer, the result is the pointed-to (n - 1)-dimensional array,
which itself is immediately converted into a pointer.
So this won't produce an error in C++ (and will work as expected):
T *p2 = **pa2;
T *p3 = ***pa3;
Regarding whether this is UB or not. Consider the very first conversion:
T(*pa1)[6] = (T(*)[6])a;
In C++ it's in fact
T(*pa1)[6] = reinterpret_cast<T(*)[6]>(a);
And this is what the standard says about reinterpret_cast:
An object pointer can be explicitly converted to an object pointer of a different type. When a prvalue
v of type “pointer to T1” is converted to the type “pointer to cv T2”, the result is static_cast< cv
T2 * >(static_cast< cv void * >(v)) if both T1 and T2 are standard-layout types (3.9) and the alignment
requirements of T2 are no stricter than those of T1, or if either type is void.
So a is converted to pa1 through static_cast to void* and back. The static_cast to void* is guaranteed to return the real address of a, as stated in 4.10/2:
A prvalue of type “pointer to cv T,” where T is an object type, can be converted to a prvalue of type “pointer
to cv void”. The result of converting a non-null pointer value of a pointer to object type to a “pointer to
cv void” represents the address of the same byte in memory as the original pointer value.
Next static cast to T(*)[6] is again guaranteed to return the same address as stated in 5.2.9/13:
A prvalue of type “pointer to cv1 void” can be converted to a prvalue of type “pointer to cv2 T,” where T is
an object type and cv2 is the same cv-qualification as, or greater cv-qualification than, cv1. The null pointer
value is converted to the null pointer value of the destination type. If the original pointer value represents
the address A of a byte in memory and A satisfies the alignment requirement of T, then the resulting pointer
value represents the same address as the original pointer value, that is, A.
So pa1 is guaranteed to point to the same byte in memory as a, and so access to data through it is perfectly valid because the alignment of arrays is the same as the alignment of the underlying type.
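The alignment claim can be checked at compile time, since alignof applied to an array type yields the alignment of its element type. The sketch below is only an illustration of that point; same_address is a name I made up.

```cpp
using T = int;

// alignof on an array type reports the alignment of the element type,
// so all of these array shapes share the alignment of T itself.
static_assert(alignof(T[6]) == alignof(T), "T[6] is aligned like T");
static_assert(alignof(T[3][2]) == alignof(T), "T[3][2] is aligned like T");
static_assert(alignof(T[1][2][3]) == alignof(T), "T[1][2][3] is aligned like T");

// Hypothetical check that the converted pointer still represents the
// address of the first element of the original array.
inline bool same_address() {
    T a[] = {1, 2, 3, 4, 5, 6};
    T (*pa1)[6] = reinterpret_cast<T(*)[6]>(a);
    return static_cast<void*>(pa1) == static_cast<void*>(a);
}
```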
What about C?
Consider again:
T(*pa1)[6] = (T(*)[6])a;
In C11 standard, 6.3.2.3/7 states the following:
A pointer to an object type may be converted to a pointer to a different object type. If the
resulting pointer is not correctly aligned for the referenced type, the behavior is
undefined. Otherwise, when converted back again, the result shall compare equal to the
original pointer. When a pointer to an object is converted to a pointer to a character type,
the result points to the lowest addressed byte of the object. Successive increments of the
result, up to the size of the object, yield pointers to the remaining bytes of the object.
It means that unless the conversion is to char*, the value of the converted pointer is not guaranteed to be equal to the value of the original pointer, resulting in undefined behavior when accessing data through the converted pointer. In order to make it work, the conversion has to be done explicitly through void*:
T(*pa1)[6] = (T(*)[6])(void*)a;
Conversions back to T*
T *p = a;
T *p1 = *pa1;
T *p2 = **pa2;
T *p3 = ***pa3;
All of these are conversions from array of T to pointer to T, which are valid in both C++ and C, and no UB is triggered by accessing the data through converted pointers.
We can look at the representation of an object of type T by converting a T* that points at that object into a char*. At least in practice:
int x = 511;
unsigned char* cp = (unsigned char*)&x;
std::cout << std::hex << std::setfill('0');
for (std::size_t i = 0; i < sizeof(int); i++) {
std::cout << std::setw(2) << (int)cp[i] << ' ';
}
This outputs the representation of 511 on my system: ff 01 00 00.
There is (surely) some implementation defined behaviour occurring here. Which of the casts is allowing me to convert an int* to an unsigned char* and which conversions does that cast entail? Am I invoking undefined behaviour as soon as I cast? Can I cast any T* type like this? What can I rely on when doing this?
Which of the casts is allowing me to convert an int* to an unsigned char*?
That C-style cast in this case is the same as reinterpret_cast<unsigned char*>.
Can I cast any T* type like this?
Yes and no. The yes part: You can safely cast any pointer type to a char* or unsigned char* (with the appropriate const and/or volatile qualifiers). The result is implementation-defined, but it is legal.
The no part: The standard explicitly allows char* and unsigned char* as the target type. However, you cannot (for example) safely cast a double* to an int* and read through the result. Do this and you've crossed the boundary from implementation-defined behavior to undefined behavior. It violates the strict aliasing rule.
Your cast maps to:
unsigned char* cp = reinterpret_cast<unsigned char*>(&x);
The underlying representation of an int is implementation defined, and viewing it as characters allows you to examine that. In your case, it is 32-bit little endian.
There is nothing special here -- this method of examining the internal representation is valid for any data type.
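As a sketch of that generality, the loop from the question can be turned into a template that works for any object type. The name hex_bytes is hypothetical, and the output is of course platform-dependent for anything wider than one byte.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: returns the object representation of any object as
// space-separated two-digit hex bytes, in memory order.
template <class T>
std::string hex_bytes(const T& object) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&object);
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        if (i != 0) out << ' ';
        // Cast to unsigned so the byte prints as a number, not a character.
        out << std::setw(2) << static_cast<unsigned>(bytes[i]);
    }
    return out.str();
}
```

On the asker's system hex_bytes(511) would be expected to give the same ff 01 00 00 shown in the question. For class types with padding, the padding bytes have no meaningful value, so this is best kept to scalar types.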
C++03 5.2.10.7: A pointer to an object can be explicitly converted to a pointer to an object of different type. Except that converting an rvalue of type "pointer to T1" to the type "pointer to T2" (where T1 and T2 are object types and where the alignment requirements of T2 are no stricter than those of T1) and back to its original type yields the original pointer value, the result of such a pointer conversion is unspecified.
This suggests that the cast results in unspecified behavior. But pragmatically speaking, casting from any pointer type to char* will always allow you to examine (and modify) the internal representation of the referenced object.
The C-style cast in this case is equivalent to reinterpret_cast. The Standard describes the semantics in 5.2.10. Specifically, in paragraph 7:
"A pointer to an object can be explicitly converted to a pointer to a
different object type. When a prvalue v of type “pointer to T1” is
converted to the type “pointer to cv T2”, the result is
static_cast<cv T2*>(static_cast<cv void*>(v)) if both T1 and T2 are
standard-layout types (3.9) and the alignment requirements of T2 are
no stricter than those of T1. Converting a prvalue of type “pointer to
T1” to the type “pointer to T2” (where T1 and T2 are object types and
where the alignment requirements of T2 are no stricter than those of
T1) and back to its original type yields the original pointer value.
The result of any other such pointer conversion is unspecified."
What it means in your case, the alignment requirements are satisfied, and the result is unspecified.
The implementation-defined behaviour in your example is the endianness of your system; in this case your CPU is little-endian.
As for the type casting: when you cast an int* to char*, all you are doing is telling the compiler to interpret what cp points to as a char, so it will read only the first byte and interpret it as a character.
Casts between pointers are themselves always possible, since all pointers are nothing more than memory addresses, and any type, in memory, can always be thought of as a sequence of bytes.
But -of course- the way the sequence is formed depends on how the decomposed type is represented in memory, and that's out of the scope of the C++ specifications.
That said, except in very pathological cases, you can expect that representation to be the same in all the code produced by the same compiler for all the machines of the same platform (or family), and you should not expect the same results on different platforms.
In general one thing to avoid is to express the relation between type sizes as "predefined":
in your sample you assume sizeof(int) == 4*sizeof(char): that's not necessarily always true.
But it is always true that sizeof(T) == N*sizeof(char), hence any T can always be seen as an integer number of chars.
Unless you have a cast operator, a cast simply tells the compiler to "see" that memory area in a different way. Nothing really fancy, I would say.
Then, you are reading the memory area byte-by-byte; as long as you do not change it, it is just fine. Of course, the result of what you see depends a lot on the platform: think about endianness, word size, padding, and so on.
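In code, the portable way to express this is to use sizeof and CHAR_BIT rather than a hard-coded 4 or 8. A tiny sketch (bits_in is a made-up name):

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// sizeof is measured in units of char, so this holds on every implementation.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");

// Hypothetical helper: number of bits in the object representation of T,
// computed from sizeof and CHAR_BIT instead of assumed constants.
template <class T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;
}
```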
Just reverse the byte order then it becomes
00 00 01 ff
Which is 256 (01) + 255 (ff) = 511
This is because your platform is little endian.
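The same byte-inspection trick gives a quick run-time endianness check. This is only a sketch with an invented function name; it reports whether the least significant byte is stored first.

```cpp
#include <cstdint>

// Hypothetical helper: true if the least significant byte of a multi-byte
// integer is stored at the lowest address (little endian).
inline bool is_little_endian() {
    const std::uint32_t probe = 1;
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}
```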
I know this is a bizarre thing to do, and it's not portable. But I have an allocated array of unsigned ints, and I occasionally want to "store" a float in it. I don't want to cast the float or convert it to the closest equivalent int; I want to store the exact bit image of the float in the allocated space of the unsigned int, such that I could later retrieve it as a float and it would retain its original float value.
This can be achieved through a simple copy:
uint32_t dst;
float src = get_float();
// copy the bytes of src into dst (source range first, destination last)
const char * const p = reinterpret_cast<const char*>(&src);
std::copy(p, p + sizeof(float), reinterpret_cast<char *>(&dst));
// now read dst
Copying backwards works similarly.
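Here is a sketch of both directions as a pair of functions. It uses std::memcpy instead of std::copy, which does the same byte-wise copy; the function names are my own, and the static_assert encodes the assumption that float is 32 bits wide.

```cpp
#include <cstdint>
#include <cstring>

// Assumption: float and uint32_t have the same size on this platform.
static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits wide");

// Hypothetical helper: stores the exact bit image of a float in a uint32_t.
inline std::uint32_t float_to_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Hypothetical helper: recovers the float from its stored bit image.
inline float bits_to_float(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Storing would then be array[i] = float_to_bits(f) and retrieving bits_to_float(array[i]). On a platform with IEEE-754 floats, float_to_bits(10.0f) should give 1092616192.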
Just do a reinterpret cast of the respective memory location:
float f = 0.5f;
unsigned int i = *reinterpret_cast<unsigned int*>(&f);
or the more C-like version:
unsigned int i = *(unsigned int*)&f;
From your question text I assume you are aware that this breaks if float and unsigned int don't have the same size, but on most usual platforms both should be 32-bit.
EDIT: As Kerrek pointed out, this seems to be undefined behaviour. But I still stand by my answer, as it is short and precise and should indeed work on any practical compiler (convince me of the opposite). But look at Kerrek's answer if you want a UB-free answer.
You can use reinterpret_cast if you really have to. You don't even need to play with pointers/addresses as other answers mention. For example
int i;
reinterpret_cast<float&>(i) = 10;
std::cout << std::endl << i << " " << reinterpret_cast<float&>(i) << std::endl;
also works (and prints 1092616192 10 if you are curious ;).
EDIT:
From C++ standard (about reinterpret_cast):
5.2.10.7 A pointer to an object can be explicitly converted to a pointer to an object of different type. Except that converting an
rvalue of type “pointer to T1” to the type “pointer to T2” (where T1
and T2 are object types and where the alignment requirements of T2 are
no stricter than those of T1) and back to its original type yields the
original pointer value, the result of such a pointer conversion is
unspecified.
5.2.10.10 An lvalue expression of type T1 can be cast to the type “reference to T2” if an expression of type “pointer to T1” can be
explicitly converted to the type “pointer to T2” using a
reinterpret_cast. That is, a reference cast reinterpret_cast<T&>(x)
has the same effect as the conversion
*reinterpret_cast<T*>(&x) with the built-in & and * operators. The result is an lvalue that refers to the same object as the source
lvalue, but with a different type. No temporary is created, no copy is
made, and constructors (12.1) or conversion functions (12.3) are not
called.
So it seems that consistently reinterpreting pointers is not undefined behavior, and using references has the same result as taking the address, reinterpreting it, and dereferencing the resulting pointer. I still claim that this is not undefined behavior.