Manipulating byte vector through float pointer - c++

Is it possible to manipulate an std::vector<unsigned char> through its data pointer as if it were a container of float?
Here is an example that compiles and (seemingly?) runs as desired (GCC 4.8, C++11):
#include <iostream>
#include <vector>
int main()
{
std::vector<unsigned char> bytes(2 * sizeof(float));
auto ptr = reinterpret_cast<float *>(bytes.data());
ptr[0] = 1.1;
ptr[1] = 1.2;
std::cout << ptr[0] << ", " << ptr[1] << std::endl;
return 0;
}
This snippet successfully writes/reads data from the byte buffer as if it were an array of float. From reading about reinterpret_cast I'm afraid that this might be undefined behavior. My confidence in understanding the type aliasing details is too little for me to be sure.
Is the code snippet undefined behavior as outlined above? If so, is there another way to achieve this sort of byte manipulation?

Legal answer
No, this is not permitted.
C++ isn't just "a load of bytes" — the compiler (and, more abstractly, the language) have been told that you have a container of unsigned chars, not a container of floats. No floats exist, and you can't pretend that they do.
The rule you're looking for, which is known as strict aliasing, may be found under [basic.lval]/8.
The opposite would work, because it is permitted (via a special rule in that same paragraph) to examine the bytes of any type via an unsigned char*. But in your case, the quickest safe and correct way to "get" a float from something that starts life as unsigned char is to std::memcpy or std::copy those bytes into an actual float that exists:
std::vector<unsigned char> bytes(2 * sizeof(float));
float f1, f2;
// Extracting values
std::memcpy(
reinterpret_cast<unsigned char*>(&f1),
bytes.data(),
sizeof(float)
);
std::memcpy(
reinterpret_cast<unsigned char*>(&f2),
bytes.data() + sizeof(float),
sizeof(float)
);
// Putting them back
f1 = 1.1;
f2 = 1.2;
std::memcpy(
bytes.data(),
reinterpret_cast<unsigned char*>(&f1),
sizeof(float)
);
std::memcpy(
bytes.data() + sizeof(float),
reinterpret_cast<unsigned char*>(&f2),
sizeof(float)
);
This is fine as long as those bytes form a valid representation of float on your system. Granted it looks a little unwieldy, but a quick wrapper function will make short work of it.
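The wrapper mentioned above could look roughly like the following. This is only a sketch: the names load_float and store_float are made up for illustration, the index is counted in floats rather than bytes, and the caller is assumed to have sized the buffer correctly.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy the bytes for slot `i` (counted in floats) out of
// the buffer into a real float object. No bounds checking is done here.
float load_float(const std::vector<unsigned char>& bytes, std::size_t i)
{
    float value;
    std::memcpy(&value, bytes.data() + i * sizeof(float), sizeof(float));
    return value;
}

// Hypothetical helper: copy a float's object representation into slot `i`.
void store_float(std::vector<unsigned char>& bytes, std::size_t i, float value)
{
    std::memcpy(bytes.data() + i * sizeof(float), &value, sizeof(float));
}
```

With these, the original example becomes store_float(bytes, 0, 1.1f); followed by load_float(bytes, 0), and no float pointer into the byte buffer is ever formed.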
A common alternative, assuming you only care about floats and don't need a resizable buffer, is to produce some std::aligned_storage then do a bunch of placement new into the resulting buffer. Since C++17, you could alternatively play around with std::launder, though resizing the vector (read: reallocating its buffer) would also be inadvisable in that scenario.
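A minimal sketch of the placement-new idea, assuming the caller supplies raw storage that is large enough and suitably aligned for float (for example an alignas(float) unsigned char array). The helper name make_float_in is hypothetical.

```cpp
#include <new>

// Hypothetical helper: start the lifetime of a float inside raw storage.
// `storage` must point to at least sizeof(float) bytes aligned for float.
// The returned pointer, not the original byte pointer, is what you use.
float* make_float_in(void* storage, float value)
{
    return ::new (storage) float(value);
}
```

Because float is trivially destructible, nothing needs to be destroyed explicitly before the storage goes away or is reused.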
Also, these approaches are quite involved and result in complex code that not all your readers will be able to follow. If you can launder your data such that it "is" a sequence of floats, you may as well just make yourself a nice std::vector<float> in the first place. Per the above, it is permitted to get and use an unsigned char* to that buffer if you wish.
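To illustrate the permitted direction, here is a small sketch that starts from a std::vector<float> and inspects its buffer through an unsigned char pointer. The function name bytes_of is an assumption made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: copy out the object representation of a float buffer.
// Examining existing floats through unsigned char* is the allowed direction.
std::vector<unsigned char> bytes_of(const std::vector<float>& values)
{
    const auto* first = reinterpret_cast<const unsigned char*>(values.data());
    return std::vector<unsigned char>(first, first + values.size() * sizeof(float));
}
```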
It ought to be noted that there is much code out there in the wild that uses your original approach (particularly in older projects with a barebones C heritage). On many implementations, it may appear to work. But it is a common misconception that it is valid and/or safe, and you're prone to instruction "re-ordering" (or other optimisations) if you rely on it.
Hedge-betting answer
For what it's worth, if you disable strict aliasing (GCC and Clang both accept -fno-strict-aliasing as an extension), then you can probably get away with your original code. Just be careful.

Is it possible to manipulate an std::vector<unsigned char> through its data pointer as if it were a container of float?
Not quite. Your example has UB indeed.
However, you can reuse the storage of those bytes to create the floats there. Example:
float* ptr = std::launder(reinterpret_cast<float*>(bytes.data()));
std::uninitialized_fill_n(ptr, 2, 0.0f);
After this, the lifetime of the unsigned char objects has ended, and there are floats there instead. Using ptr is well defined.
Whether this would be useful for you is another matter. Start with a simpler design first: Why not simply use std::vector<float>?

Related

reinterpret_cast a slice of byte array?

If there is a buffer that is supposed to pack 3 integer values, and you want to increment the one in the middle, the following code works as expected:
#include <iostream>
#include <cstring>
int main()
{
char buffer[] = {'\0','\0','\0','\0','A','\0','\0','\0','\0','\0','\0','\0'};
int tmp;
memcpy(&tmp, buffer + 4, 4); // unpack buffer[5:8] to tmp
std::cout<<buffer[4]; // prints A
tmp++;
memcpy(buffer + 4, &tmp, 4); // pack tmp value back to buffer[5:8]
std::cout<<buffer[4]; // prints B
return 0;
}
To me this looks like too many operations are taking place for a simple action of merely modifying some data in a buffer array, i.e. pushing a new variable to the stack, copying the specific region from the buffer to that var, incrementing it, then copying it back to the buffer.
I was wondering whether it's possible to cast the 5:8 range from the byte array to an int* variable and increment it, for example:
int *tmp = reinterpret_cast < int *>(buffer[5:8]);
(*tmp)++;
It's more efficient this way, no need for the 2 memcpy calls.
The latter approach is technically undefined, though it's likely to work on any sane implementation. Your syntax is slightly off, but something like this will probably work:
int* tmp = reinterpret_cast<int*>(buffer + 4);
(*tmp)++;
The problem is that it runs afoul of C++'s strict aliasing rules. Essentially, you're allowed to treat any object as an array of char, but you're not allowed to treat an array of char as anything else. Thus to be fully compliant you need to take the approach you did in the first snippet: treat an int as an array of char (which is allowed) and copy the bytes from the array into it, manipulate it as desired, and then copy back.
I would note that if you're concerned with runtime efficiency, you probably shouldn't be. Compilers are very good at optimizing these sorts of things, and will likely end up just manipulating the bytes in place. For instance, clang with -O2 compiles your first snippet (with std::cout replaced with printf to avoid stream I/O overhead) down to:
mov edi, 65
call putchar
mov edi, 66
call putchar
Demo
Remember, when writing C++ you are describing the behavior of the program you want the compiler to write, not writing the instructions the machine will execute.
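If the two memcpy calls feel noisy at the call site, they can be tucked into a small function. The following is a sketch only; increment_int_at is a made-up name, and the caller is assumed to guarantee that offset + sizeof(int) stays inside the buffer.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: increment the int whose bytes start at `offset`.
// The value is copied out, modified, and copied back, so no int* into the
// char array is ever dereferenced.
void increment_int_at(char* buffer, std::size_t offset)
{
    int tmp;
    std::memcpy(&tmp, buffer + offset, sizeof tmp);
    ++tmp;
    std::memcpy(buffer + offset, &tmp, sizeof tmp);
}
```

An optimizing compiler is free to turn this into an in-place increment, which is the point made above.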
Simply change buffer[5:8] to buffer + 4, just like in your memcpy() calls, and then it will likely work the way you want:
int *tmp = reinterpret_cast<int*>(buffer + 4 /* or: &buffer[4] */);
(*tmp)++;
Alternatively, you can use a reference instead of a pointer:
int &tmp = reinterpret_cast<int&>(buffer[4] /* or: *(buffer+4) */);
tmp++;
However, note that either approach is technically undefined behavior, as accessing the array like this violates the Strict Aliasing rules. The memcpy() approach is the safe and standard way to go, and compilers are very good about optimizing memcpy() calls.
But, the reinterpret_cast approach will likely work nonetheless, depending on your compiler.

Passing pointers to arrays of unrelated, but compatible, types without copying?

(Disclaimer: At this point, this is mostly academic interest.)
Imagine I have an external interface like this, that is, I do not control its code:
// Provided externally: Cannot (easily) change this:
// fill buffer with n floats:
void data_source_external(float* pDataOut, size_t n);
// send n data words from pDataIn:
void data_sink_external(const uint32_t* pDataIn, size_t n);
Is it possible within standard C++ to "move" / "stream" data between these two interfaces without copying?
That is, is there any way to make the following be non-UB, without copying of the data between two correctly typed buffers?
int main()
{
constexpr size_t n = 64;
float fbuffer[n];
data_source_external(fbuffer, n);
// These hold and can be checked statically:
static_assert(sizeof(float) == sizeof(uint32_t), "same size");
static_assert(alignof(float) == alignof(uint32_t), "same alignment");
static_assert(std::numeric_limits<float>::is_iec559 == true, "IEEE 754");
// This is clearly UB. Any way to make this work without copying the data?
const uint32_t* buffer_alias = static_cast<uint32_t*>(static_cast<void*>(fbuffer));
// **Note**:
// + reinterpret_cast would also be UB.
data_sink_external(buffer_alias, n);
// ...
As far as I can tell the following would be defined behavior, at least with regard to strict aliasing:
...
uint32_t ibuffer[n];
std::memcpy(ibuffer, fbuffer, n * sizeof(uint32_t));
data_sink_external(ibuffer, n);
but given that the ibuffer will have exactly the same bits as the fbuffer this seems quite insane.
Or would we expect optimizing compilers to optimize even this copy away? (In a now deleted comment-like answer a user posted a godbolt link that seems to indicate, at least on first glance, that clang 11 indeed would be able to optimize out the memcpy.)
I didn't test and can't comment yet (cause not enough reputation). But reinterpret_cast may help in this situation.
Documentation
Basically it tells the compiler, hey treat this pointer as if it was the specified type in the cast.
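Since the question itself notes that reinterpret_cast would also be UB here, the well-defined fallback remains the memcpy route it describes. A minimal sketch, assuming float and uint32_t have the same size (checked statically); the helper name as_words is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(float) == sizeof(std::uint32_t), "same size");

// Hypothetical helper: copy the object representations of n floats into a
// correctly typed uint32_t buffer that can be handed to the sink.
std::vector<std::uint32_t> as_words(const float* data, std::size_t n)
{
    std::vector<std::uint32_t> words(n);
    if (n != 0) {
        std::memcpy(words.data(), data, n * sizeof(std::uint32_t));
    }
    return words;
}
```

Whether the compiler elides the copy depends on what it can see of the surrounding code, so it is worth checking the generated assembly rather than assuming.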

Additional questions on memory alignment

There have previously been some great answers on memory alignment, but I feel they don't completely answer some questions.
E.g.:
What is data alignment? Why and when should I be worried when typecasting pointers in C?
What is aligned memory allocation?
I have an example program:
#include <iostream>
#include <vector>
#include <cstring>
int32_t cast_1(int offset) {
std::vector<char> x = {1,2,3,4,5};
return reinterpret_cast<int32_t*>(x.data()+offset)[0];
}
int32_t cast_2(int offset) {
std::vector<char> x = {1,2,3,4,5};
int32_t y;
std::memcpy(reinterpret_cast<char*>(&y), x.data() + offset, 4);
return y;
}
int main() {
std::cout << cast_1(1) << std::endl;
std::cout << cast_2(1) << std::endl;
return 0;
}
The cast_1 function outputs a ubsan alignment error (as expected) but cast_2 does not. However, cast_2 looks much less readable to me (requires 3 lines). cast_1 looks perfectly clear on the intent, even though it is UB.
Questions:
1) Why is cast_1 UB, when the intent is perfectly clear? I understand that there may be performance issues with alignment.
2) Is cast_2 a correct approach to fixing the UB of cast_1?
1) Why is cast_1 UB?
Because the language rules say so. Multiple rules in fact.
The offset where you access the object does not meet the alignment requirements of int32_t (except on systems where the alignment requirement is 1). No objects can be created without conforming to the alignment requirement of the type.
A char pointer may not be aliased by an int32_t pointer.
2) Is cast_2 a correct approach to fixing the UB of cast_1?
cast_2 has well defined behaviour. The reinterpret_cast in that function is redundant, and it is bad to use magic constants (use sizeof).
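Applying both suggestions (drop the redundant cast, use sizeof instead of the literal 4) gives something like the sketch below. The name read_int32 and the choice to pass the vector in as a parameter are assumptions for illustration; the caller must still ensure that offset + sizeof(std::int32_t) does not exceed the vector's size.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical rewrite of cast_2: memcpy takes void*, so no cast is needed,
// and sizeof keeps the byte count tied to the destination type.
std::int32_t read_int32(const std::vector<char>& bytes, std::size_t offset)
{
    std::int32_t value;
    std::memcpy(&value, bytes.data() + offset, sizeof value);
    return value;
}
```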
WRT the first question, it would be trivial for the compiler to handle that for you, true. All it would have to do is pessimize every other non-char load in the program.
The alignment rules were written precisely so the compiler can generate code that performs well on the many platforms where aligned memory access is a fast native op, and misaligned access is the equivalent of your memcpy. Except where it could prove alignment, the compiler would have to handle every load the slow & safe way.

Are casts as safe as unions?

I want to split large variables like floats into byte segments and send these serially byte by byte via UART. I'm using C/C++.
One method could be to deepcopy the value I want to send to a union and then send it. I think that would be 100% safe but slow. The union would look like this:
union mySendUnion
{
mySendType sendVal;
char sendArray[sizeof(mySendType)];
};
Another option could be to cast the pointer to the value I want to send, into a pointer to a particular union. Is this still safe?
The third option could be to cast the pointer to the value I want to send to a char, and then increment a pointer like this:
sendType myValue = 443.2;
char* sendChar = (char*)myValue;
for(char i=0; i< sizeof(sendType) ; i++)
{
Serial.write(*(sendChar+i), 1);
}
I've had success with the above pointer arithmetic, but I'm not sure if it's safe under all circumstances. My concern is: what if, for instance, we are using a 32-bit processor and want to send a float? Suppose the compiler chooses to store this 32-bit float in one memory cell, but stores only a single char in each 32-bit cell.
Each counter increment would then make the program pointer increment one whole memory cell, and we would miss the float.
Is there something in the C standard that prevents this, or could this be an issue with a certain compiler?
First off, you can't write your code in "C/C++". There's no such language as "C/C++", as they are fundamentally different languages. As such, the answer regarding unions differs radically.
As to the title:
Are casts as safe as unions?
No, generally they aren't, because of the strict aliasing rule. That is, if you type-pun a pointer of one certain type with a pointer to an incompatible type, it will result in undefined behavior. The only exception to this rule is when you read or manipulate the byte-wise representation of an object by aliasing it through a pointer to (signed or unsigned) char. As in your case.
Unions, however, are quite different bastards. Type punning via copying to and reading from unions is permitted in C99 and later, but results in undefined behavior in C89 and all versions of C++.
In one direction, you can also safely type pun (in C99 and later) using a pointer to union, if you have the original union as an actual object. Like this:
union p {
char c[sizeof(float)];
float f;
} pun;
union p *punPtr = &pun;
punPtr->f = 3.14;
send_bytes(punPtr->c, sizeof(float));
Because "a pointer to a union points to all of its members and vice versa" (C99, I don't remember the exact paragraph, it's around 6.2.5, IIRC). This isn't true in the other direction, though:
float f = 3.14;
union p *punPtr = &f;
send_bytes(punPtr->c, sizeof(float)); // triggers UB!
To sum up: the following code snippet is valid in all of C89, C99, C11 and C++:
float f = 3.14;
char *p = (char *)&f;
size_t i;
for (i = 0; i < sizeof f; i++) {
send_byte(p[i]); // hypothetical function
}
The following is only valid in C99 and later:
union {
char c[sizeof(float)];
float f;
} pun;
pun.f = 3.14;
send_bytes(pun.c, sizeof(float)); // another hypothetical function
The following, however, would not be valid:
float f = 3.14;
unsigned *u = (unsigned *)&f;
printf("%u\n", *u); // undefined behavior triggered!
Another solution that is always guaranteed to work is memcpy(). The memcpy() function does a bytewise copying between two objects. (Don't get me started on it being "slow" -- in most modern compilers and stdlib implementations, it's an intrinsic function).
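A rough C++ sketch of the memcpy route for a float follows. FloatBytes, to_bytes and from_bytes are illustrative names, not part of any library, and the byte order is simply whatever the host uses, so both ends of the link are assumed to agree on it.

```cpp
#include <array>
#include <cstring>

using FloatBytes = std::array<unsigned char, sizeof(float)>;

// Hypothetical helper: copy a float's object representation into a byte array
// that can then be sent one byte at a time.
FloatBytes to_bytes(float value)
{
    FloatBytes out;
    std::memcpy(out.data(), &value, sizeof value);
    return out;
}

// Hypothetical helper: rebuild the float on the receiving side.
float from_bytes(const FloatBytes& in)
{
    float value;
    std::memcpy(&value, in.data(), sizeof value);
    return value;
}
```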
General advice when sending floating point data on a byte stream would be to use some serialization technology, to ensure that the data format is well defined (and preferably architecture neutral, beware of endianness issues!).
You could use XDR -or perhaps ASN1- which is a binary format (see xdr(3) for more). For C++, see also libs11n
Unless speed or data size is very critical, I would suggest instead a textual format like JSON or perhaps YAML (textual formats are more verbose, but easier to debug and to document). There are several good libraries supporting it (e.g. jsoncpp for C++ or jansson for C).
Notice that serial ports are quite slow (w.r.t. CPU). So the serialization processing time is negligible.
Whatever you do, please document the serialization format (even for an internal project).
The cast to [[un]signed] char [const] * is legal and it won't cause issues when reading, so that is a fine option (that is, after fixing char *sendChar = reinterpret_cast<char*>(&myValue);, and since you are at it, make it const)
Now the next problem comes on the other side, when reading, as you cannot safely use the same approach for reading. In general, the cost of copying the variables is much less than the cost of sending over the UART, so I would just use the union when reading out of the serial.

Casting/dereferencing char pointers to a double array

Is there anything wrong with casting a double pointer to a char pointer? The goal in the following code is to change element 1 in three different ways.
double vec1[100];
double *vp = vec1;
char *yp = (char*) vp;
vp++;
vec1[1] = 19.0;
*vp = 12.0;
*((double*) (yp + (1*sizeof (vec1[0])))) = 34.0;
Casts of this type fall into the category of "OK if you know what you're doing but dangerous if you don't".
For example, in this case you already know the pointer value of "yp" (it was pointing to a double) so it is technically safe to increase its value by the size of a double and re-cast back to a double*.
A counter-example: suppose you didn't know where the char* came from...say, it was given to you as a function parameter. Now, your cast would be a big problem: since char* is technically 1-byte-aligned and a double is usually 8-byte-aligned, you can't be sure if you were given an 8-byte-aligned address. If it's aligned, your arithmetic would produce a valid double*; if not, it would crash when dereferenced.
This is just one example of how casts can go wrong. What you're doing (at first glance) looks like it will work but in general you really have to pay attention when you cast things.
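The alignment concern in the counter-example can be checked at run time. The sketch below uses a made-up helper name and assumes that converting a pointer to std::uintptr_t yields its numeric address, which is implementation-defined but holds on mainstream platforms. Passing this check only rules out misalignment; the cast is still only meaningful if a double really lives at that address.

```cpp
#include <cstdint>

// Hypothetical helper: true if `p` is suitably aligned to hold a double.
bool is_aligned_for_double(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(double) == 0;
}
```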
With newer Intel processors the main problem you can run into is alignment. Say you were to write something like this:
*((double*) (yp + 4)) = 34.0;
Then you are likely to have a runtime error because a double should be aligned on 8 bytes. This was also true on processors such as 68k, or MIPS.
This is similar to having a structure and doing casts on that structure. You are not unlikely to break things.
In most cases, if you can avoid such, your code will be a lot stronger. Personally, I do not even use such casts when reading a file. Instead, I get the data from the file and put it in a structure as required. Say I read 4 bytes in a buffer to convert to an integer, I'd write something like this:
unsigned char buf[4];
...
fread(buf, 1, 4, f);
my_struct.integer = buf[0] | (buf[1] << 8) | (buf[2] << 16) | ((unsigned)buf[3] << 24); // cast avoids shifting into the sign bit of int
Now I did not do an ugly cast, and I control the endianness of the integer in the file regardless of the endianness of the processor the code runs on.
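The same byte-by-byte decoding can be wrapped in a function. This is a sketch with an invented name, read_u32_le, assuming the file stores the least significant byte first; each byte is widened to an unsigned 32-bit type before shifting so the top byte never lands in a sign bit.

```cpp
#include <cstdint>

// Hypothetical helper: decode 4 bytes stored least-significant byte first,
// independent of the host processor's byte order.
std::uint32_t read_u32_le(const unsigned char* buf)
{
    return  static_cast<std::uint32_t>(buf[0])
         | (static_cast<std::uint32_t>(buf[1]) << 8)
         | (static_cast<std::uint32_t>(buf[2]) << 16)
         | (static_cast<std::uint32_t>(buf[3]) << 24);
}
```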