reinterpret_cast a slice of byte array? - c++

If there is a buffer that is supposed to pack 3 integer values, and you want to increment the one in the middle, the following code works as expected:
#include <iostream>
#include <cstring>
int main()
{
    char buffer[] = {'\0','\0','\0','\0','A','\0','\0','\0','\0','\0','\0','\0'};
    int tmp;
    memcpy(&tmp, buffer + 4, 4);   // unpack buffer[5:8] to tmp
    std::cout << buffer[4];        // prints A
    tmp++;
    memcpy(buffer + 4, &tmp, 4);   // pack tmp value back to buffer[5:8]
    std::cout << buffer[4];        // prints B
    return 0;
}
To me this looks like too many operations for what is merely modifying some data in a buffer: pushing a new variable onto the stack, copying the specific region of the buffer into that variable, incrementing it, then copying it back into the buffer.
I was wondering whether it's possible to cast the 5:8 range from the byte array to an int* variable and increment it, for example:
int *tmp = reinterpret_cast < int *>(buffer[5:8]);
(*tmp)++;
It would be more efficient this way; no need for the two memcpy calls.

The latter approach is technically undefined, though it's likely to work on any sane implementation. Your syntax is slightly off, but something like this will probably work:
int* tmp = reinterpret_cast<int*>(buffer + 4);
(*tmp)++;
The problem is that it runs afoul of C++'s strict aliasing rules. Essentially, you're allowed to treat any object as an array of char, but you're not allowed to treat an array of char as anything else. Thus to be fully compliant you need to take the approach you did in the first snippet: treat an int as an array of char (which is allowed) and copy the bytes from the array into it, manipulate it as desired, and then copy back.
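If this pattern comes up often, the copy-out/copy-back can be wrapped in small helpers; here is a minimal sketch of that idea (the helper names are mine, not from the question, and the bytes are assumed to hold a valid int representation):
#include <cstddef>
#include <cstring>
// Read an int stored at byte offset off in buf.
int load_int(const char *buf, std::size_t off)
{
    int v;
    std::memcpy(&v, buf + off, sizeof v);
    return v;
}
// Write v back into the buffer at byte offset off.
void store_int(char *buf, std::size_t off, int v)
{
    std::memcpy(buf + off, &v, sizeof v);
}
// Usage: store_int(buffer, 4, load_int(buffer, 4) + 1);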
I would note that if you're concerned with runtime efficiency, you probably shouldn't be. Compilers are very good at optimizing these sorts of things, and will likely end up just manipulating the bytes in place. For instance, clang with -O2 compiles your first snippet (with std::cout replaced with printf to avoid stream I/O overhead) down to:
mov edi, 65
call putchar
mov edi, 66
call putchar
Demo
Remember, when writing C++ you are describing the behavior of the program you want the compiler to write, not writing the instructions the machine will execute.

Simply change buffer[5:8] to buffer + 4, just like in your memcpy() calls, and then it will likely work the way you want:
int *tmp = reinterpret_cast<int*>(buffer + 4 /* or: &buffer[4] */);
(*tmp)++;
Alternatively, you can use a reference instead of a pointer:
int &tmp = reinterpret_cast<int&>(buffer[4] /* or: *(buffer+4) */);
tmp++;
However, note that either approach is technically undefined behavior, as accessing the array like this violates the Strict Aliasing rules. The memcpy() approach is the safe and standard way to go, and compilers are very good about optimizing memcpy() calls.
But, the reinterpret_cast approach will likely work nonetheless, depending on your compiler.

Related

Passing pointers to arrays of unrelated, but compatible, types without copying?

(Disclaimer: At this point, this is mostly academic interest.)
Imagine I have an external interface like this, that is, I do not control its code:
// Provided externally: Cannot (easily) change this:
// fill buffer with n floats:
void data_source_external(float* pDataOut, size_t n);
// send n data words from pDataIn:
void data_sink_external(const uint32_t* pDataIn, size_t n);
Is it possible within standard C++ to "move" / "stream" data between these two interfaces without copying?
That is, is there any way to make the following be non-UB, without copying of the data between two correctly typed buffers?
int main()
{
    constexpr size_t n = 64;
    float fbuffer[n];
    data_source_external(fbuffer, n);

    // These hold and can be checked statically:
    static_assert(sizeof(float) == sizeof(uint32_t), "same size");
    static_assert(alignof(float) == alignof(uint32_t), "same alignment");
    static_assert(std::numeric_limits<float>::is_iec559 == true, "IEEE 754");

    // This is clearly UB. Any way to make this work without copying the data?
    const uint32_t* buffer_alias = static_cast<uint32_t*>(static_cast<void*>(fbuffer));
    // **Note**:
    // + reinterpret_cast would also be UB.
    data_sink_external(buffer_alias, n);
    // ...
As far as I can tell the following would be defined behavior, at least with regard to strict aliasing:
    ...
    uint32_t ibuffer[n];
    std::memcpy(ibuffer, fbuffer, n * sizeof(uint32_t));
    data_sink_external(ibuffer, n);
but given that the ibuffer will have exactly the same bits as the fbuffer this seems quite insane.
Or would we expect optimizing compilers to optimize even this copy away? (In a now deleted comment-like answer a user posted a godbolt link that seems to indicate, at least on first glance, that clang 11 indeed would be able to optimize out the memcpy.)
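For reference, here is the memcpy variant above as a self-contained sketch (the two external functions are only declared as stand-ins, the wrapper name is mine, and whether the copy is actually elided depends on the compiler):
#include <cstddef>
#include <cstdint>
#include <cstring>
void data_source_external(float* pDataOut, std::size_t n);            // stand-in declaration
void data_sink_external(const std::uint32_t* pDataIn, std::size_t n); // stand-in declaration
void pump_once()
{
    constexpr std::size_t n = 64;
    float fbuffer[n];
    data_source_external(fbuffer, n);
    // Well-defined: copy the object representation into a correctly typed buffer.
    std::uint32_t ibuffer[n];
    std::memcpy(ibuffer, fbuffer, sizeof ibuffer);
    data_sink_external(ibuffer, n);
}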
I didn't test this and can't comment yet (because not enough reputation), but reinterpret_cast may help in this situation.
Documentation
Basically, it tells the compiler: treat this pointer as if it were of the type specified in the cast.

Test of 8 subsequent bytes isn't translated into a single compare instruction

Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):
#include <cstring>

bool f1(const char *ptr)
{
    for (int i = 0; i < 8; i++)
        if (ptr[i])
            return false;
    return true;
}

bool f2(const char *ptr)
{
    bool res = true;
    for (int i = 0; i < 8; i++)
        res &= (ptr[i] == 0);
    return res;
}

bool f3(const char *ptr)
{
    static const char tmp[8]{};
    return !std::memcmp(ptr, tmp, 8);
}
Though I would expect the same assembly outcome with optimizations enabled, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a rolled or an unrolled loop. Moreover, this holds for GCC, Clang, and the Intel compiler, all with -O3.
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seems to be a pretty straightforward optimization to me.
Live demo: https://godbolt.org/z/j48366
Is there any reason why f1 and f2 cannot be optimized into a single compare instruction (possibly with an additional unaligned load)? It seems to be a pretty straightforward optimization to me.
In f1 the loop stops as soon as ptr[i] is non-zero, so it is not always equivalent to examining all 8 elements, as the other two functions do, or to directly comparing an 8-byte word, when the array is shorter than 8 bytes (the compiler does not know the size of the array):
f1("\000\001"); // no access out of the array
f2("\000\001"); // access out of the array
f3("\000\001"); // access out of the array
For f2, I agree it can be replaced by an 8-byte comparison, on the condition that the CPU allows reading an 8-byte word from any address alignment, which is the case on x64; but that can introduce unusual situations, as explained in Unusual situations where this wouldn't be safe in x86 asm.
First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as #bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.
You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)
GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if you made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)
Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitation, because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2; thanks to #Tharwen for spotting that.
Recognizing loop patterns is not that simple, and takes compile time to look for. I don't know how valuable this optimization would be in practice; that's what compiler devs need to trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)
The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.
In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.
Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.
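For illustration, a sketch of that memcpy-based load (the function name is mine; it keeps the same all-zero semantics as f1..f3):
#include <cstdint>
#include <cstring>
bool f3_memcpy(const char *ptr)
{
    std::uint64_t v;
    std::memcpy(&v, ptr, sizeof v);   // aliasing-safe unaligned 8-byte load
    return v == 0;                    // true if all 8 bytes are zero
}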
Or in GNU C++, use a typedef to express unaligned may-alias loads:
bool f4(const char *ptr) {
    typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
    auto val = *(const aliasing_unaligned_u64*)ptr;
    return val != 0;   // note: inverted sense vs. f1..f3 (true if the 8 bytes are not all zero)
}
Compiles on Godbolt with GCC10 -O3:
f4(char const*):
cmp QWORD PTR [rdi], 0
setne al
ret
Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.
And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.
https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointers/
Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior? is another example of using may_alias (without aligned(1) in that case, because implicit-length strings could end at any point, so you need to do aligned loads to make sure that your chunk containing at least 1 valid string byte doesn't cross a page boundary).
You need to help your compiler a bit to get exactly what you want... If you want to compare 8 bytes in one CPU operation, you'll need to change your char pointer so it points to something that's actually 8 bytes long, like a uint64_t pointer.
If your compiler does not support uint64_t, you can use unsigned long long* instead:
#include <cstdint>
inline bool EightBytesNull(const char *ptr)
{
    return *reinterpret_cast<const uint64_t*>(ptr) == 0;
}
Note that this will work on x86, but may not on ARM targets that require strict integer memory alignment.

Manipulating byte vector through float pointer

Is it possible to manipulate an std::vector<unsigned char> through its data pointer as if it were a container of float?
Here is an example that compiles and (seemingly?) runs as desired (GCC 4.8, C++11):
#include <iostream>
#include <vector>
int main()
{
    std::vector<unsigned char> bytes(2 * sizeof(float));
    auto ptr = reinterpret_cast<float *>(bytes.data());
    ptr[0] = 1.1;
    ptr[1] = 1.2;
    std::cout << ptr[0] << ", " << ptr[1] << std::endl;
    return 0;
}
This snippet successfully writes/reads data from the byte buffer as if it were an array of float. From reading about reinterpret_cast I'm afraid that this might be undefined behavior. My confidence in understanding the type aliasing details is too little for me to be sure.
Is the code snippet undefined behavior as outlined above? If so, is there another way to achieve this sort of byte manipulation?
Legal answer
No, this is not permitted.
C++ isn't just "a load of bytes": the compiler (and, more abstractly, the language) has been told that you have a container of unsigned chars, not a container of floats. No floats exist, and you can't pretend that they do.
The rule you're looking for, which is known as strict aliasing, may be found under [basic.lval]/8.
The opposite would work, because it is permitted (via a special rule in that same paragraph) to examine the bytes of any type via an unsigned char*. But in your case, the quickest safe and correct way to "get" a float from something that starts life as unsigned char is to std::memcpy or std::copy those bytes into an actual float that exists:
std::vector<unsigned char> bytes(2 * sizeof(float));
float f1, f2;

// Extracting values
std::memcpy(
    reinterpret_cast<unsigned char*>(&f1),
    bytes.data(),
    sizeof(float)
);
std::memcpy(
    reinterpret_cast<unsigned char*>(&f2),
    bytes.data() + sizeof(float),
    sizeof(float)
);

// Putting them back
f1 = 1.1;
f2 = 1.2;
std::memcpy(
    bytes.data(),
    reinterpret_cast<unsigned char*>(&f1),
    sizeof(float)
);
std::memcpy(
    bytes.data() + sizeof(float),
    reinterpret_cast<unsigned char*>(&f2),
    sizeof(float)
);
This is fine as long as those bytes form a valid representation of float on your system. Granted it looks a little unwieldy, but a quick wrapper function will make short work of it.
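For instance, a minimal sketch of such a wrapper (the function names are mine):
#include <cstddef>
#include <cstring>
#include <vector>
// Read the i-th float stored in the byte buffer.
float read_float(const std::vector<unsigned char> &bytes, std::size_t i)
{
    float f;
    std::memcpy(&f, bytes.data() + i * sizeof(float), sizeof f);
    return f;
}
// Write a float into the byte buffer at index i.
void write_float(std::vector<unsigned char> &bytes, std::size_t i, float f)
{
    std::memcpy(bytes.data() + i * sizeof(float), &f, sizeof f);
}
// Usage: write_float(bytes, 0, 1.1f); float x = read_float(bytes, 0);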
A common alternative, assuming you only care about floats and don't need a resizable buffer, is to produce some std::aligned_storage then do a bunch of placement new into the resulting buffer. Since C++17, you could alternatively play around with std::launder, though resizing the vector (read: reallocating its buffer) would also be inadvisable in that scenario.
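A minimal sketch of that aligned-storage-plus-placement-new idea, fixed to two floats (variable names are illustrative):
#include <new>
#include <type_traits>
int main()
{
    // Storage suitably sized and aligned for two floats:
    std::aligned_storage_t<2 * sizeof(float), alignof(float)> storage;
    // Create real float objects in that storage with placement new:
    unsigned char *p = reinterpret_cast<unsigned char *>(&storage);
    float *f0 = new (p) float(1.1f);
    float *f1 = new (p + sizeof(float)) float(1.2f);
    *f0 += *f1;   // use them through the pointers returned by placement new
    return 0;
}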
Also, these approaches are quite involved and result in complex code that not all your readers will be able to follow. If you can launder your data such that it "is" a sequence of floats, you may as well just make yourself a nice std::vector<float> in the first place. Per the above, it is permitted to get and use an unsigned char* to that buffer if you wish.
It ought to be noted that there is much code out there in the wild that uses your original approach (particularly in older projects with a barebones C heritage). On many implementations, it may appear to work. But it is a common misconception that it is valid and/or safe, and you're prone to instruction "re-ordering" (or other optimisations) if you rely on it.
Hedge-betting answer
For what it's worth, if you disable strict aliasing (GCC permits this as an extension, and LLVM doesn't even implement it), then you can probably get away with your original code. Just be careful.
Is it possible to manipulate an std::vector through its data pointer as if it were a container of float?
Not quite. Your example has UB indeed.
However, you can reuse the storage of those bytes to create the floats there. Example:
float* ptr = std::launder(reinterpret_cast<float*>(bytes.data()));
std::uninitialized_fill_n(ptr, 2, 0.0f);
After this, the lifetime of the unsigned char objects has ended, and there are floats there instead. Using ptr is well defined.
Whether this would be useful for you is another matter. Start with a simpler design first: Why not simply use std::vector<float>?

Is it possible to write to an array second element by overflowing the first element in C?

In low-level languages it is possible to mov a dword (32-bit) to the first array element; this will overflow into the second, third, and fourth elements. Or you can mov a word (16-bit) to the first element and it will overflow into the second.
How can the same effect be achieved in C? When trying, for example:
char txt[] = {0, 0};
txt[0] = 0x4142;
it gives a warning [-Woverflow], the value of txt[1] doesn't change, and txt[0] is set to 0x42.
How to get the same behavior as in assembly:
mov word [txt], 0x4142
the previous assembly instruction will set the first element [txt+0] to 0x42 and the second element [txt+1] to 0x41.
EDIT
What about this suggestion?
define the array as a single variable.
uint16_t txt;
txt = 0x4142;
and accessing the elements with ((uint8_t*) &txt)[0] for the first element and ((uint8_t*) &txt)[1] for the second element.
If you are totally sure this will not cause a segmentation fault, which you must be, you can use memcpy()
uint16_t n = 0x4142;
memcpy((void *)txt, (void *)&n, sizeof(uint16_t));
By using void pointers, this is the most versatile solution, generalizable to all the cases beyond this example.
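A self-contained version of that snippet, for reference (valid as both C and C++; the byte values noted assume a little-endian target):
#include <string.h>
#include <stdint.h>
int main(void)
{
    char txt[] = {0, 0};
    uint16_t n = 0x4142;
    memcpy(txt, &n, sizeof n);   /* little-endian: txt[0] == 0x42 ('B'), txt[1] == 0x41 ('A') */
    return 0;
}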
txt[0] = 0x4142; is an assignment to a char object, so the right-hand side is implicitly converted to char after being evaluated.
The NASM equivalent is mov byte [rsp-4], 'BA'. Assembling that with NASM gives you the same warning as your C compiler:
foo.asm:1: warning: byte data exceeds bounds [-w+number-overflow]
Also, modern C is not a high-level assembler. C has types, NASM doesn't (operand-size is on a per-instruction basis only). Don't expect C to work like NASM.
C is defined in terms of an "abstract machine", and the compiler's job is to make asm for the target CPU which produces the same observable results as if the C was running directly on the C abstract machine. Unless you use volatile, actually storing to memory doesn't count as an observable side-effect. This is why C compilers can keep variables in registers.
And more importantly, things that are undefined behaviour according to the ISO C standard may still be undefined when compiling for x86. For example, x86 asm has well-defined behaviour for signed overflow: it wraps around. But in C, it's undefined behaviour, so compilers can exploit this to make more efficient code for for (int i=0 ; i<=len ;i++) arr[i] *= 2; without worrying that i<=len might always be true, giving an infinite loop. See What Every C Programmer Should Know About Undefined Behavior.
Type-punning by pointer-casting other than to char* or unsigned char* (or __m128i* and other Intel SSE/AVX intrinsic types, because they're also defined as may_alias types) violates the strict-aliasing rule. txt is a char array, but I think it's still a strict-aliasing violation to write it through a uint16_t* and then read it back via txt[0] and txt[1].
Some compilers may define the behaviour of *(uint16_t*)txt = 0x4142, or happen to produce the code you expect in some cases, but you shouldn't count on it always working and being safe when other code also reads and writes txt[].
Compilers (i.e. C implementations, to use the terminology of the ISO standard) are allowed to define behaviour that the C standard leaves undefined. But in a quest for higher performance, they choose to leave a lot of stuff undefined. This is why compiling C for x86 is not similar to writing in asm directly.
Many people consider modern C compilers to be actively hostile to the programmer, looking for excuses to "miscompile" your code. See the 2nd half of this answer on gcc, strict-aliasing, and horror stories, and also the comments. (The example in that answer is safe with a proper memcpy; the problem was a custom implementation of memcpy that copied using long*.)
Here's a real-life example of a misaligned pointer leading to a fault on x86 (because gcc's auto-vectorization strategy assumed that some whole number of elements would reach a 16-byte alignment boundary. i.e. it depended on the uint16_t* being aligned.)
Obviously if you want your C to be portable (including to non-x86), you must use well-defined ways to type-pun. In ISO C99 and later, writing one union member and reading another is well-defined. (And in GNU C++, and GNU C89).
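For illustration, a minimal sketch of the union approach (well-defined in ISO C99; a documented extension in GNU C++, not portable ISO C++):
#include <stdint.h>
#include <stdio.h>
int main(void)
{
    union { uint16_t word; char bytes[2]; } u;
    u.word = 0x4142;   /* write one member... */
    /* ...and read the other; on a little-endian machine this prints "AB" */
    printf("%c%c\n", u.bytes[1], u.bytes[0]);
    return 0;
}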
In ISO C++, the only well-defined way to type-pun is with memcpy or other char* accesses, to copy object representations.
Modern compilers know how to optimize away memcpy for small compile-time constant sizes.
#include <string.h>
#include <stdint.h>
void set2bytes_safe(char *p) {
    uint16_t src = 0x4142;
    memcpy(p, &src, sizeof(src));
}

void set2bytes_alias(char *p) {
    *(uint16_t*)p = 0x4142;
}
Both functions compile to the same code with gcc, clang, and ICC for x86-64 System V ABI:
# clang++6.0 -O3 -march=sandybridge
set2bytes_safe(char*):
mov word ptr [rdi], 16706
ret
Sandybridge-family doesn't have LCP decode stalls for 16-bit mov immediate, only for 16-bit immediates with ALU instructions. This is an improvement over Nehalem (See Agner Fog's microarch guide), but apparently gcc8.1 -march=sandybridge doesn't know about it because it still likes to:
# gcc and ICC
mov eax, 16706
mov WORD PTR [rdi], ax
ret
define the array as a single variable.
... and accessing the elements with ((uint8_t*) &txt)[0]
Yes, that's fine, assuming that uint8_t is unsigned char, because char* is allowed to alias anything.
This is the case on almost any implementation that supports uint8_t at all, but it's theoretically possible to build one where it's not, and char is a 16 or 32-bit type, and uint8_t is implemented with a more expensive read/modify/write of the containing word.
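A sketch of that pattern (the printed byte order assumes a little-endian target):
#include <stdint.h>
#include <stdio.h>
int main(void)
{
    uint16_t txt = 0x4142;
    uint8_t *bytes = (uint8_t *)&txt;   /* fine if uint8_t is unsigned char, as noted above */
    printf("%02x %02x\n", bytes[0], bytes[1]);   /* little-endian: prints "42 41" */
    return 0;
}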
One option is to Trust Your Compiler(tm) and just write proper code.
With this test code:
#include <iostream>

int main() {
    char txt[] = {0, 0};
    txt[0] = 0x41;
    txt[1] = 0x42;
    std::cout << txt;
}
Clang 6.0 produces:
int main() {
00E91020 push ebp
00E91021 mov ebp,esp
00E91023 push eax
00E91024 lea eax,[ebp-2]
char txt[] = {0, 0};
00E91027 mov word ptr [ebp-2],4241h <-- Combined write, without any tricks!
txt[0] = 0x41;
txt[1] = 0x42;
std::cout << txt;
00E9102D push eax
00E9102E push offset cout (0E99540h)
00E91033 call std::operator<<<std::char_traits<char> > (0E91050h)
00E91038 add esp,8
}
00E9103B xor eax,eax
00E9103D add esp,4
00E91040 pop ebp
00E91041 ret
You're looking to do a deep copy, which you'll need a loop to accomplish (or a function that does the loop for you internally: memcpy).
Simply assigning 0x4142 to a char means the value has to be truncated to fit in the char. This should produce a warning, as the outcome is implementation-specific, but typically the least significant bits are retained.
In any case, if you know the numbers you want to assign you could just construct using them: const char txt[] = { '\x41', '\x42' };
I'd suggest doing this with an initializer list; obviously, it's on you to make sure the initializer list is at least as long as size(txt). For example:
copy_n(begin({ '\x41', '\x42' }), size(txt), begin(txt));
Live Example

Is pointer arithmetic in iterations overflow-safe?

I've often seen array iterations done with plain pointer arithmetic, even in newer C++ code. I wonder how safe they really are and whether it's a good idea to use them. Consider this snippet (it also compiles in C if you put calloc in place of new):
int8_t *buffer = new int8_t[16];
for (int8_t *p = buffer; p < buffer + 16; p++) {
...
}
Wouldn't this kind of iteration result in an overflow, and the loop being skipped completely, when buffer happens to be allocated at address 0xFFFFFFF0 (in a 32-bit address space) or 0xFFFFFFFFFFFFFFF0 (64-bit)?
As far as I know, this would be an exceptionally unlucky, but still possible circumstance.
This is safe. The C and C++ standards explicitly allow you to calculate a pointer value that points one item beyond the end of an array, and to compare a pointer that points within the array to that value.
An implementation that had an overflow problem in the situation you describe would simply not be allowed to place an array right at the end of memory like that.
In practice, a more likely problem is buffer + 16 comparing equal to NULL, but this is not allowed either and again a conforming implementation would need to leave an empty place following the end of the array.
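To make that guarantee concrete, here is a small sketch of the same loop with the one-past-the-end pointer named explicitly (the loop body is a placeholder):
#include <cstdint>
void fill_zero(std::int8_t *buffer)
{
    std::int8_t *end = buffer + 16;   // one-past-the-end pointer: valid to form and compare, never dereferenced
    for (std::int8_t *p = buffer; p != end; ++p)
        *p = 0;   // placeholder body
}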