I'm reading type aliasing rules but can't figure out if this code has UB in it:
std::vector<std::byte> vec = {std::byte{'a'}, std::byte{'b'}};
auto sv = std::string_view(reinterpret_cast<char*>(vec.data()), vec.size());
std::cout << sv << '\n';
I'm fairly sure it does not, but I often get surprised by C++.
Is reinterpret_cast between char*, unsigned char* and std::byte* always allowed?
Additionally, is addition of const legal in such cast, e.g:
std::array<char, 2> arr = {'a', 'b'};
auto* p = reinterpret_cast<const std::byte*>(arr.data());
Again, I suspect it is legal since it says
AliasedType is the (possibly cv-qualified) signed or unsigned variant of DynamicType
but I would like to be sure with reinterpret_casting once and for all.
The code is ok.
char*, unsigned char* and std::byte* are allowed by the standard to alias any object type. (Be careful: the reverse is not true.)
([basic.types]/2):
For any object (other than a base-class subobject) of trivially
copyable type T, whether or not the object holds a valid value of type
T, the underlying bytes ([intro.memory]) making up the object can be
copied into an array of char, unsigned char, or std::byte
([cstddef.syn]). If the content of that array is copied back into
the object, the object shall subsequently hold its original value.
([basic.lval]/8.8):
If a program attempts to access the stored value of an object through
a glvalue of other than one of the following types the behavior is
undefined:
a char, unsigned char, or std::byte type.
And yes you can add const.
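A minimal sketch of both casts from the question, assuming C++17 (for std::byte and std::string_view). The helper names as_text and first_byte are hypothetical and only serve the illustration.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <vector>

// View a buffer of std::byte as text. Reading the std::byte objects
// through a char glvalue is allowed because char may alias any object.
std::string_view as_text(const std::vector<std::byte>& bytes) {
    return std::string_view(reinterpret_cast<const char*>(bytes.data()),
                            bytes.size());
}

// Adding const in the same cast is fine: the char objects are only read
// through the resulting const std::byte*.
int first_byte(const std::array<char, 2>& arr) {
    const auto* p = reinterpret_cast<const std::byte*>(arr.data());
    return std::to_integer<int>(p[0]);
}
```

Calling as_text on the vector from the question should give a view over the same two characters, without copying them.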
In order to serialize components in my game, I need to be able to access the data in various vectors only given a pointer and a size for the vector.
I want to get the data() pointer from a vector if I have only a void * pointing to the vector. I am attempting to convert from std::vector<T> to std::vector<char> to get the data() pointer. I want to know if the following code is defined behavior and not going to act any different in different situations.
#include <iostream>
#include <vector>
int main()
{
std::vector<int> ints = { 0, 1, 2, 3, 4 };
std::vector<char>* memory = reinterpret_cast<std::vector<char>*>(&ints);
int *intArray = reinterpret_cast<int *>(memory->data());
std::cout << intArray[0] << intArray[1] << intArray[2] << intArray[3] << intArray[4] << std::endl; //01234 Works on gcc and vc++
std::getchar();
}
This seems to work in this isolated case, but I don't know if it will give errors or undefined behavior inside the serialization code.
This is an aliasing violation:
std::vector<char>* memory = reinterpret_cast<std::vector<char>*>(&ints);
int *intArray = reinterpret_cast<int *>(memory->data());
Per [basic.life], accessing memory->data() here has undefined behavior.
The way to get around this is to call ints.data() to obtain an int* pointer to the underlying contiguous array. Afterwards, you are allowed to cast it to void*, char*, or unsigned char* (or std::byte* in C++17).
From there you can cast back to int* to access the elements again.
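A rough sketch of that approach for the serialization case, assuming the serializer only needs a pointer and a size. RawSpan, erase_type and sum_ints are hypothetical names introduced for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type-erased view: a pointer to the elements plus a byte count.
struct RawSpan {
    void* data;
    std::size_t size_bytes;
};

// Take the element pointer from the vector itself via data(), instead of
// pretending the vector is a vector<char>.
RawSpan erase_type(std::vector<int>& ints) {
    return RawSpan{ints.data(), ints.size() * sizeof(int)};
}

// Cast back to the original element type before reading the values.
int sum_ints(RawSpan span) {
    const int* p = static_cast<const int*>(span.data);
    int total = 0;
    for (std::size_t i = 0; i < span.size_bytes / sizeof(int); ++i) {
        total += p[i];
    }
    return total;
}
```

The vector object itself is never reinterpreted here; only the pointer to its int elements travels through void*.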
I don't think that it is UB.
With reinterpret_cast<std::vector<char>*>(&ints), you are casting a pointer to a vector<int> object to a pointer to a vector object of a different (and actually incompatible) type, vector<char>. Yet you do not dereference the resulting pointer, and - as both vector objects will very likely have the same alignment requirements - the cast will be OK (cf., for example, the online C++ draft). Note that a vector does not store its elements "in place" but holds a pointer to them.
5.2.10 Reinterpret cast
(7) An object pointer can be explicitly converted to an object pointer of
a different type. When a prvalue v of type “pointer to T1” is
converted to the type “pointer to cv T2”, the result is static_cast<cv T2*>(static_cast<cv void*>(v)) if both T1 and T2 are standard-layout
types ([basic.types]) and the alignment requirements of T2 are no
stricter than those of T1, or if either type is void. Converting a
prvalue of type “pointer to T1” to the type “pointer to T2” (where T1
and T2 are object types and where the alignment requirements of T2 are
no stricter than those of T1) and back to its original type yields the
original pointer value. The result of any other such pointer
conversion is unspecified.
So casting a vector object forth and back should work in a defined manner here.
Second, you cast a pointer that originally points (and is aliased to) int "back" to its original type int. So aliasing is obviously not violated.
I don't see any UB here (unless a vector<char> object had stricter alignment requirements than a vector<int> object, which is very likely not the case).
Suppose we take a very big array of unsigned chars.
std::array<uint8_t, 100500> blob;
// ... fill array ...
(Note: it is aligned already, question is not about alignment.)
Then we treat it as a uint64_t[] and try to access it:
const auto ptr = reinterpret_cast<const uint64_t*>(blob.data());
std::cout << ptr[7] << std::endl;
Casting to uint64_t* and then reading through it looks suspicious to me.
But neither UBSan nor -Wstrict-aliasing complains about it.
Google uses this technique in FlatBuffers.
Also, Cap'n'Proto uses this too.
Is it undefined behavior?
You cannot access the value of an unsigned char object through a glvalue of another type. But the opposite is allowed: you can access the value of any object through an unsigned char glvalue [basic.lval]:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined: [...]
a char, unsigned char, or std::byte type.
So, to be 100% standard compliant, the idea is to go the other way and copy the bytes into a uint64_t object:
uint64_t i;
std::memcpy(&i, blob.data() + 7*sizeof(uint64_t), sizeof(uint64_t));
std::cout << i << std::endl;
With optimizations enabled, compilers typically generate the same assembly as for the cast version.
The cast itself is well defined (a reinterpret_cast never has UB), but the lvalue to rvalue conversion in expression "ptr[7]" would be UB if no uint64_t object has been constructed in that address.
As "// ... fill array ..." is not shown, there could have been constructed a uint64_t object in that address (assuming as you say, the address has sufficient alignment):
const uint64_t* p = new (blob.data() + 7 * sizeof(uint64_t)) uint64_t();
If a uint64_t object has been constructed in that address, then the code in question has well defined behaviour.
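A small sketch of that situation, assuming suitably aligned unsigned char storage. make_u64_at is a hypothetical helper; note that the newly constructed object holds the value passed in, not whatever bytes were in the storage before.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Construct a uint64_t inside raw byte storage and return a pointer to it.
// The storage must be suitably aligned and large enough; the caller is
// responsible for both.
std::uint64_t* make_u64_at(unsigned char* storage, std::size_t index,
                           std::uint64_t value) {
    return new (storage + index * sizeof(std::uint64_t)) std::uint64_t(value);
}
```

Reading through the pointer returned by the placement new expression is fine, because it points to a real uint64_t object.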
C++ (and C) strict aliasing rules include that a char* and unsigned char* may alias any other pointer.
AFAIK there is no analogous rule for uint8_t*.
Thus my question: What are the aliasing rules for a std::byte pointer?
The C++ reference currently just specifies:
Like the character types (char, unsigned char, signed char) it can be used to access raw memory occupied by other objects (object representation), but unlike those types, it is not a character type and is not an arithmetic type.
From the current Standard draft ([basic.types]/2):
For any object (other than a base-class subobject) of trivially
copyable type T, whether or not the object holds a valid value of type
T, the underlying bytes ([intro.memory]) making up the object can be
copied into an array of char, unsigned char, or std::byte
([cstddef.syn]). If the content of that array is copied back into
the object, the object shall subsequently hold its original value.
So yes, the same aliasing rules apply for the three types, just as cppreference sums up.
It also might be valuable to mention ([basic.lval]/8.8):
If a program attempts to access the stored value of an object through
a glvalue of other than one of the following types the behavior is
undefined:
a char, unsigned char, or std::byte type.
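To illustrate, here is a sketch that reads an object's representation through a const std::byte*, assuming C++17. count_nonzero_bytes is a hypothetical helper, not something from the standard library.

```cpp
#include <cstddef>

// Count the nonzero bytes in the object representation of a value by
// reading it through a const std::byte*.
template <typename T>
std::size_t count_nonzero_bytes(const T& object) {
    const auto* bytes = reinterpret_cast<const std::byte*>(&object);
    std::size_t count = 0;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        if (std::to_integer<unsigned>(bytes[i]) != 0) {
            ++count;
        }
    }
    return count;
}
```

Because std::byte is not an arithmetic type, the comparison goes through std::to_integer rather than comparing the byte with 0 directly.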
I'm trying to convert some objects (of known size) pointed to by a void* to a char array, byte for byte, in C++. I'm considering using a union with a char array so that I don't need to worry too much about the casting. However, since the type of the object is unknown, I don't know how to define this union.
Just wondering if there is any other better way to deal with this?
PS: edited to avoid confusion. For instance, an integer could be cast to a 4-character array.
Thanks!
In the link I put in the comments, the accepted answer goes into great detail about type punning and why you can't do it in c++.
What you can do is safely inspect any object with a char* (signed or unsigned) by using reinterpret_cast.
char* ptr = reinterpret_cast<char*>(&object);
for (std::size_t x = 0; x < sizeof(object); ++x)
std::cout << ptr[x]; //Or something less slow but this is an example
If you want to actually move the object into a char[], you should use std::memcpy.
If you are not worried about a bit of extra memory, you can use memcpy.
int i = 10;
char carray[sizeof(i)];
memcpy(carray, &i, sizeof(i));
However, remember that carray won't be a null terminated string. It will be just an array of chars. It will be better to use unsigned char since the value in one of those bytes might be too large for char if char is a signed type on your platform.
int i = 10;
unsigned char carray[sizeof(i)];
memcpy(carray, &i, sizeof(i));
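The copy also works in the other direction, which gives a simple round trip. A minimal sketch, with int_to_bytes and bytes_to_int as hypothetical helper names:

```cpp
#include <array>
#include <cstring>

// Copy an int's object representation into a byte array.
std::array<unsigned char, sizeof(int)> int_to_bytes(int value) {
    std::array<unsigned char, sizeof(int)> bytes{};
    std::memcpy(bytes.data(), &value, sizeof(value));
    return bytes;
}

// Copy the bytes back into an int; the original value is restored.
int bytes_to_int(const std::array<unsigned char, sizeof(int)>& bytes) {
    int value;
    std::memcpy(&value, bytes.data(), sizeof(value));
    return value;
}
```

The order of the bytes inside the array depends on the platform's endianness, but the round trip on the same platform restores the value.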
Why do you feel you need to worry about the casting?
Just reinterpret_cast the void pointer to a char* and iterate over each character up to the size of the original object. Keep in mind that the char* pointer is not a null-terminated string and may or may not contain null characters in the middle of the data, so do not process it like a C string.
From 5.2.10 Reinterpret cast:
An object pointer can be explicitly converted to an object pointer of a different type. When a prvalue
v of type “pointer to T1” is converted to the type “pointer to cv T2”, the result is static_cast<cv T2*>(static_cast<cv void*>(v)) if both T1 and T2 are standard-layout types (3.9) and the alignment
requirements of T2 are no stricter than those of T1, or if either type is void.
So you simply want to use:
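char* my_bytes = reinterpret_cast<char*>(my_pointer);
// sizeof(my_pointer) would be the size of the pointer itself, so the size
// of the pointed-to object (object_size here) has to be known separately.
size_t num_bytes = object_size;
for(size_t i = 0; i < num_bytes; ++i) {
// *(my_bytes + i) is the i-th byte in memory order (depends on endianness)
}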
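Since the object's type is unknown behind a void*, its size has to travel with the pointer. A possible sketch, where to_hex is a hypothetical helper that formats the bytes in memory order:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Format the first `size` bytes at `object` as two-digit hex values in
// memory order, separated by spaces.
std::string to_hex(const void* object, std::size_t size) {
    const auto* bytes = static_cast<const unsigned char*>(object);
    std::string out;
    for (std::size_t i = 0; i < size; ++i) {
        char buf[3];
        std::snprintf(buf, sizeof(buf), "%02x",
                      static_cast<unsigned>(bytes[i]));
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

Using unsigned char avoids negative values for bytes above 0x7f on platforms where char is signed.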
We can look at the representation of an object of type T by converting a T* that points at that object into a char*. At least in practice:
int x = 511;
unsigned char* cp = (unsigned char*)&x;
std::cout << std::hex << std::setfill('0');
for (int i = 0; i < sizeof(int); i++) {
std::cout << std::setw(2) << (int)cp[i] << ' ';
}
This outputs the representation of 511 on my system: ff 01 00 00.
There is (surely) some implementation defined behaviour occurring here. Which of the casts is allowing me to convert an int* to an unsigned char* and which conversions does that cast entail? Am I invoking undefined behaviour as soon as I cast? Can I cast any T* type like this? What can I rely on when doing this?
Which of the casts is allowing me to convert an int* to an unsigned char*?
That C-style cast in this case is the same as reinterpret_cast<unsigned char*>.
Can I cast any T* type like this?
Yes and no. The yes part: You can safely cast any pointer type to a char* or unsigned char* (with the appropriate const and/or volatile qualifiers). The result is implementation-defined, but it is legal.
The no part: The standard explicitly allows char* and unsigned char* as the target type. However, you cannot (for example) safely cast a double* to an int* and access the double through it. Do this and you've crossed the boundary from implementation-defined behavior to undefined behavior. It violates the strict aliasing rule.
Your cast maps to:
unsigned char* cp = reinterpret_cast<unsigned char*>(&x);
The underlying representation of an int is implementation defined, and viewing it as characters allows you to examine that. In your case, it is 32-bit little endian.
There is nothing special here -- this method of examining the internal representation is valid for any data type.
C++03 5.2.10.7: A pointer to an object can be explicitly converted to a pointer to an object of different type. Except that converting an rvalue of type "pointer to T1" to the type "pointer to T2" (where T1 and T2 are object types and where the alignment requirements of T2 are no stricter than those of T1) and back to its original type yields the original pointer value, the result of such a pointer conversion is unspecified.
This suggests that the cast results in unspecified behavior. But pragmatically speaking, casting from any pointer type to char* will always allow you to examine (and modify) the internal representation of the referenced object.
The C-style cast in this case is equivalent to reinterpret_cast. The Standard describes the semantics in 5.2.10. Specifically, in paragraph 7:
"A pointer to an object can be explicitly converted to a pointer to a
different object type. When a prvalue v of type “pointer to T1” is
converted to the type “pointer to cv T2”, the result is
static_cast<cv T2*>(static_cast<cv void*>(v)) if both T1 and T2 are
standard-layout types (3.9) and the alignment requirements of T2 are
no stricter than those of T1. Converting a prvalue of type “pointer to
T1” to the type “pointer to T2” (where T1 and T2 are object types and
where the alignment requirements of T2 are no stricter than those of
T1) and back to its original type yields the original pointer value.
The result of any other such pointer conversion is unspecified."
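What it means in your case: unsigned char has the weakest alignment requirement, so the alignment condition is satisfied and the result is the static_cast chain quoted above rather than an unspecified value.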
The implementation-defined behaviour in your example is the endianness of your system; in this case your CPU is little-endian.
About the type casting: when you cast an int* to char*, all you are doing is telling the compiler to interpret what cp points to as a char, so it will read the first byte only and interpret it as a character.
Casts between object pointers are themselves always possible, since all pointers are nothing more than memory addresses and any type, in memory, can always be thought of as a sequence of bytes.
But, of course, the way the sequence is formed depends on how the decomposed type is represented in memory, and that is outside the scope of the C++ specification.
That said, except in very pathological cases, you can expect that representation to be the same in all the code produced by the same compiler for all the machines of the same platform (or family), and you should not expect the same results on different platforms.
In general one thing to avoid is to express the relation between type sizes as "predefined":
in your sample you assume sizeof(int) == 4*sizeof(char): that's not necessarily always true.
But it is always true that sizeof(T) == N*sizeof(char) for some integer N, hence any T can always be seen as an integer number of chars.
Unless a cast operator is defined, a cast simply tells the compiler to "see" that memory area in a different way. Nothing really fancy, I would say.
Then, you are reading the memory area byte by byte; as long as you do not change it, it is just fine. Of course, what you see depends a lot on the platform: think about endianness, word size, padding, and so on.
Just reverse the byte order then it becomes
00 00 01 ff
Which is 256 (01) + 255 (ff) = 511
This is because your platform is little endian.
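The same arithmetic can be written out in code. A minimal sketch, assuming 8-bit bytes and at most four of them; from_little_endian is a hypothetical helper name:

```cpp
#include <cstddef>
#include <cstdint>

// Combine bytes stored least-significant-first into one unsigned value.
// Only the first four bytes are used, since the result is 32 bits wide.
std::uint32_t from_little_endian(const unsigned char* bytes,
                                 std::size_t count) {
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < count && i < 4; ++i) {
        value |= static_cast<std::uint32_t>(bytes[i]) << (8 * i);
    }
    return value;
}
```

For the bytes ff 01 00 00 this computes 0xff + (0x01 << 8) = 255 + 256 = 511, matching the value of x in the question.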