Wrong value assigned after type casting - casting

I am using a native unsigned long variable as a buffer to hold two unsigned short values. From my knowledge of C++ this should be a valid method; I have used it many times to store two unsigned char values inside one unsigned short without any problem. Unfortunately, on a different architecture it behaves strangely: the value only seems to update after a second assignment. The (Overflow) case is there simply to demonstrate it. Can anyone shed some light on why it behaves that way?
unsigned long dwTest = 0xFFEEDDCC;
printf("sizeof(unsigned short) = %d\n", sizeof(unsigned short));
printf("dwTest = %08X\n", dwTest);
//Address + values
printf("Addresses + Values: %08X <- %08X, %08X <- %08X\n", (DWORD)(&((unsigned short*)&dwTest)[0]), (((unsigned short*)&dwTest)[0]), (DWORD)(&((unsigned short*)&dwTest)[1]), (((unsigned short*)&dwTest)[1]) );
((unsigned short*)&dwTest)[0] = (WORD)0xAAAA;
printf("dwTest = %08X\n", dwTest);
((unsigned short*)&dwTest)[1] = (WORD)0xBBBB;
printf("dwTest = %08X\n", dwTest);
//(Overflow)
((unsigned short*)&dwTest)[2] = (WORD)0x9999;
printf("dwTest = %08X\n", dwTest);
Visual C++ 2010 output (OK):
sizeof(unsigned short) = 2
dwTest = FFEEDDCC
Addresses + Values: 0031F728 <- 0000DDCC, 0031F72A <- 0000FFEE
dwTest = FFEEAAAA
dwTest = BBBBAAAA
dwTest = BBBBAAAA
ARM9 GCC Crosstool output (Doesn't work):
sizeof(unsigned short) = 2
dwTest = FFEEDDCC
Addresses + Values: 7FAFECD8 <- 0000DDCC, 7FAFECDA <- 0000FFEE
dwTest = FFEEDDCC
dwTest = FFEEAAAA
dwTest = BBBBAAAA

What you are trying to do is called type-punning. There are two traditional ways to do it.
One way to do it is via pointers (what you have done). Unfortunately, this conflicts with the optimizer. You see, due to the halting problem, the optimizer cannot know in the general case that two pointers don't alias each other. This means that the compiler has to reload any value that may have been modified via a pointer, resulting in tons of potentially unnecessary reloads.
So, the strict-aliasing rule was introduced. It basically says that two pointers can only alias each other when they point to compatible types. As a special rule, a char * can alias any other pointer (but not the other way around). This breaks type-punning via pointers, and lets the compiler generate more efficient code. That is most likely why your ARM build appears to lag one assignment behind: the compiler is allowed to assume that stores through an unsigned short * do not touch dwTest. When gcc detects type-punning and has warnings enabled, it will warn you thus:
warning: dereferencing type-punned pointer will break strict-aliasing rules
Another way to do type-punning is via the union:
union {
int i;
short s[2];
} u;
u.i = 0xDEADBEEF;
u.s[0] = 0xBABE;
....
This opens up a whole new can of worms. In the best case, this is implementation dependent. Now, I don't have access to the C89 standard, but C99 originally stated that the value of a union member other than the last one stored into is unspecified. This was changed in a TC to state that the values of bytes that don't correspond to the last stored-into member are unspecified, and that the bytes that do correspond to the last stored-into member are reinterpreted as per the new type (something which is obviously implementation dependent).
For C++, I can't find the language about the union hack in the standard. Anyway, C++ has reinterpret_cast<>, which is the explicit way to write type-punning in C++ (use the reference variant of reinterpret_cast<>); note that it remains subject to the strict-aliasing rule.
Anyway, you probably shouldn't be using type-punning (it is implementation dependent), and you should build up your values manually via bit-shifting.
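As a minimal sketch of the bit-shifting approach, the helpers below pack and unpack two 16-bit halves of a 32-bit value without any pointer casts. The function names are hypothetical, and fixed-width types are assumed instead of unsigned long and unsigned short so the widths do not depend on the platform.

```cpp
#include <cstdint>

// Pack two 16-bit halves into one 32-bit value using shifts only.
// `lo` ends up in bits 0..15 and `hi` in bits 16..31 on any endianness.
std::uint32_t pack_u16_pair(std::uint16_t lo, std::uint16_t hi) {
    return (static_cast<std::uint32_t>(hi) << 16) | lo;
}

// Extract the halves again.
std::uint16_t low_half(std::uint32_t v) {
    return static_cast<std::uint16_t>(v & 0xFFFFu);
}

std::uint16_t high_half(std::uint32_t v) {
    return static_cast<std::uint16_t>(v >> 16);
}

// Replace only the low half, leaving the high half untouched.
std::uint32_t with_low_half(std::uint32_t v, std::uint16_t lo) {
    return (v & 0xFFFF0000u) | lo;
}
```

With the question's starting value, with_low_half(0xFFEEDDCC, 0xAAAA) gives 0xFFEEAAAA regardless of compiler or optimization level, because no aliasing is involved.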

Related

Using mmap with /dev/mem - Is it okay to use reinterpret_cast

I'm using mmap with /dev/mem. I've seen examples in C use the following pattern:
#define OFFSET ...
int fd = 0;
void* base;
fd = open("/dev/mem", ...);
base = mmap(..., fd, ...);
// Below is line of interest.
*((uint32_t*)(base + OFFSET)) = 23;
First question - What is happening here?
It looks like we are adding an offset value to a void*, then casting it to uint32_t* and then assigning a number to it. Why can't we just declare base as uint32_t*? Why cast it just before assigning it?
Second question - How would I do this in C++?
The following works from bits and pieces on the net. But it's basically me trying whack-a-mole with reinterpret_cast and static_cast and seeing which one gives me the right result and doesn't throw errors or warnings. Also replaced void* with uint8_t* to prevent compiler warning me of arithmetic on void pointer. I don't know why it works or if it's even the right way to do it. Help me not shoot myself in the foot?
#define OFFSET ...
int fd = 0;
uint8_t* base;
fd = open("/dev/mem", ...);
base = reinterpret_cast<uint8_t*>(mmap(..., fd, ...));
*(reinterpret_cast<uint32_t*>(base + OFFSET)) = 23;
base + OFFSET is a gcc extension where pointer arithmetic works on void *. "Why can't we just declare base as uint32_t*?" Because the arithmetic would come out wrong. void * arithmetic is in bytes.
There seems to be some confusion as to the meaning of arithmetic in bytes. OFFSET is the number of bytes (not 32 bit integers) from base where the value 23 needs to go; so casting to uint32_t* first would move four times as far. This kind of code is typical of code that writes to heterogeneous data structures directly rather than using a struct. There are pros and cons of declaring a struct vs. writing the accesses out directly. In the ancient days they usually declared a struct, but the pendulum has swung back because of alignment issues. Writing the accesses out longhand makes it easier to ensure the compiler doesn't hose the layout by inserting padding into a struct definition.
base = reinterpret_cast<uint8_t*>(mmap(..., fd, ...));
*(reinterpret_cast<uint32_t*>(base + OFFSET)) = 23;
This is indeed what you'd have to do to say it in standard C++; though for code like this people tend to use C style casts in C++ (not going to debate it; just pointing it out).
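To make the "arithmetic in bytes" point concrete, here is a small sketch that uses an ordinary buffer in place of the mmap'd region. The helper names are hypothetical, and memcpy stands in for the cast-and-assign so the example stays valid for plain memory; real device registers would normally need a volatile access instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Write `value` at a byte offset from `base`, mirroring the
// *(uint32_t*)(base + OFFSET) = value pattern on a plain buffer.
void write_u32_at(unsigned char* base, std::size_t byte_offset, std::uint32_t value) {
    std::memcpy(base + byte_offset, &value, sizeof value);
}

// Number of bytes that `ptr + n` advances when ptr is a byte pointer.
std::size_t step_as_bytes(std::size_t n) {
    return n * sizeof(unsigned char);
}

// Number of bytes that `ptr + n` advances when ptr is a uint32_t pointer.
std::size_t step_as_uint32(std::size_t n) {
    return n * sizeof(std::uint32_t);
}
```

Adding 8 to a byte pointer moves 8 bytes, while adding 8 to a uint32_t pointer moves 32, which is why the offset has to be applied before the cast.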

The best way in C++ to cast between types of different signedness?

There is a uint64_t data field sent by the communication peer; it carries an order ID that I need to store in a PostgreSQL 11 DB, which does NOT support unsigned integer types. Although real data may exceed 2^63, I think an INT8 field in PostgreSQL 11 can hold it if I do the casting carefully.
Let's say we have:
uint64_t order_id = 123; // received
int64_t to_db; // to be written into db
I plan to use one of the following methods to cast an uint64_t value into an int64_t value:
to_db = order_id; // directly assigning;
to_db = (int64_t)order_id; //c-style casting;
to_db = static_cast<int64_t>(order_id);
to_db = *reinterpret_cast<const int64_t*>( &order_id );
and when I need to load it from the db, I can do a reversed casting.
I know they all work; I'm just interested in which one conforms to the C++ standard most closely.
In other words, which method will always work on any 64-bit platform with any compiler?
It depends on where the code is compiled and run: none of these is fully portable without C++20 support.
Without that, the safest way is to do the conversion yourself by shifting the range of values, something like this:
int64_t to_db = (order_id > (uint64_t)LLONG_MAX)
? int64_t(order_id - (uint64_t)LLONG_MAX - 1)
: int64_t(order_id) - LLONG_MAX - 1;
uint64_t from_db = (to_db < 0)
? uint64_t(to_db - LLONG_MIN)
: uint64_t(to_db) + (uint64_t)LLONG_MAX + 1;
If order_id is greater than 2^63 - 1, then order_id - (uint64_t)LLONG_MAX - 1 yields a non-negative value that fits in int64_t. If not, the cast to signed is well defined, and subtracting LLONG_MAX + 1 shifts the value into the negative range.
During the reverse conversion, to_db - LLONG_MIN places a negative value into the [0, LLONG_MAX] range, and the other branch covers the rest, up to ULLONG_MAX.
The database platform or compiler you use may do something awful with the binary representation of unsigned values when converting them to signed, not to mention that different signed formats do exist.
For the same reason, inter-platform protocols often use string formatting, or a "least bit's value" scheme that represents floating point values as integers, i.e. as encoded fixed point.
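The range-shifting idea can be sketched as a pair of functions. The names to_signed_offset and from_signed_offset are hypothetical, and the sketch assumes only that int64_t and uint64_t exist. Note that this mapping subtracts 2^63 rather than reinterpreting the bits, so it preserves ordering but stores a different bit pattern than a plain cast would.

```cpp
#include <cstdint>
#include <limits>

// Map [0, 2^64) onto [-2^63, 2^63) by subtracting 2^63, without any
// out-of-range conversion or signed overflow along the way.
std::int64_t to_signed_offset(std::uint64_t u) {
    const std::int64_t max = std::numeric_limits<std::int64_t>::max();
    const std::uint64_t half = static_cast<std::uint64_t>(max) + 1; // 2^63
    if (u >= half) {
        return static_cast<std::int64_t>(u - half); // lands in [0, 2^63 - 1]
    }
    return static_cast<std::int64_t>(u) - max - 1;  // lands in [-2^63, -1]
}

// Inverse mapping: add 2^63 back.
std::uint64_t from_signed_offset(std::int64_t s) {
    const std::int64_t min = std::numeric_limits<std::int64_t>::min();
    const std::uint64_t half =
        static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max()) + 1;
    if (s < 0) {
        return static_cast<std::uint64_t>(s - min); // s + 2^63, in [0, 2^63 - 1]
    }
    return static_cast<std::uint64_t>(s) + half;     // in [2^63, 2^64 - 1]
}
```

Because the mapping is monotonic, sorting by the stored signed column gives the same order as sorting by the original unsigned IDs.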
I would go with memcpy. It avoids (? see comments) undefined behavior and typically compilers optimize any byte copying away:
int64_t uint64_t_to_int64_t(uint64_t u)
{
int64_t i;
memcpy(&i, &u, sizeof(int64_t));
return i;
}
to_db = uint64_t_to_int64_t(order_id);
GCC with -O2 generated the optimal assembly for uint64_t_to_int64_t:
mov rax, rdi
ret
Live demo: https://godbolt.org/z/Gbvhzh
All four methods will always work, as long as the value is within range. The first will generate warnings on many compilers, so should probably not be used. The second is more a C idiom than a C++ idiom, but is widely used in C++. The last one is ugly and relies on subtle details from the standard, and should not be used.
This function seems UB-free
int64_t fromUnsignedTwosComplement(uint64_t u)
{
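if (u <= static_cast<uint64_t>(std::numeric_limits<int64_t>::max())) return static_cast<int64_t>(u);
// -u - 1 is at most 2^63 - 1 here, so the cast and the negation stay in range even for u == 2^63
else return -static_cast<int64_t>(-u - 1) - 1;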
}
It reduces to a no-op under optimisations.
Conversion in the other direction is a straight cast to uint64_t. It is always well-defined.

How to cast char array to int at non-aligned position?

Is there a way in C/C++ to cast a char array to an int at any position?
I tried the following, but it automatically aligns to the nearest 32 bits (on a 32 bit architecture) if I try to use pointer arithmetic with non-const offsets:
unsigned char data[8];
data[0] = 0; data[1] = 1; ... data[7] = 7;
int32_t p = 3;
int32_t d1 = *((int*)(data+3)); // = 0x03040506 CORRECT
int32_t d2 = *((int*)(data+p)); // = 0x00010203 WRONG
Update:
As stated in the comments, the input comes in tuples of 3 and I cannot change that.
I want to convert 3 values to an int for further processing, and this conversion should be as fast as possible.
The solution does not have to be cross platform. I am working with a very specific compiler and processor, so it can be assumed that it is a 32 bit architecture with big endian.
The lowest byte of the result does not matter to me (see above).
My main questions at the moment are: Why does d1 have the correct value while d2 does not? Is this also true for other compilers? Can this behavior be changed?
No you can't do that in a portable way.
The behaviour encountered when attempting a cast from char* to int* is undefined in both C and C++ (possibly for the very reasons that you've spotted: ints are possibly aligned on 4 byte boundaries and data is, of course, contiguous.)
(The fact that data+3 works but data+p doesn't is possibly due to compile time vs. runtime evaluation.)
Also note that the signedness of char is not specified in either C or C++, so you should use signed char or unsigned char if you're writing code like this.
Your best bet is to use the bitwise shift operators (>> and <<) and bitwise | and & to absorb char values into an int. Also consider using int32_t in case you build to targets with 16 or 64 bit ints.
There is no way, converting a pointer to a wrongly aligned one is undefined.
You can use memcpy to copy the char array into an int32_t.
int32_t d = 0;
memcpy(&d, data+3, sizeof d); // copies exactly the 4 bytes of an int32_t
Most compilers have built-in functions for memcpy with a constant size argument, so it's likely that this won't produce any runtime overhead.
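A minimal sketch of the memcpy approach wrapped in a helper that accepts a runtime offset, which is the case that failed in the question. The name load_i32 is hypothetical; the result is in host byte order, just as the pointer cast would have produced on a machine that tolerates it.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read a 32-bit value starting at any byte offset, aligned or not.
// The bytes are interpreted in the host's byte order.
std::int32_t load_i32(const unsigned char* data, std::size_t offset) {
    std::int32_t value;
    std::memcpy(&value, data + offset, sizeof value);
    return value;
}
```

Calling load_i32(data, p) with p = 3 behaves the same as with a literal 3, since no alignment assumption is made about data + offset.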
Even though a cast like you've shown is allowed for correctly aligned pointers, dereferencing such a pointer is a violation of strict aliasing. An object with an effective type of char[] must not be accessed through an lvalue of type int.
In general, type-punning is endianness-dependent, and converting a char array representing RGB colours is probably easier to do in an endianness-agnostic way, something like
int32_t d = (int32_t)data[2] << 16 | (int32_t)data[1] << 8 | data[0];
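For the layout the question asks for (three input bytes, most significant first, lowest byte of the result unused), a shift-based sketch could look like the following. The function name pack_be24 is hypothetical, and the choice to leave the low byte as zero is an assumption based on the question's update.

```cpp
#include <cstddef>
#include <cstdint>

// Combine three consecutive bytes into the top 24 bits of a 32-bit word,
// most significant byte first. The low byte of the result is left as zero.
std::uint32_t pack_be24(const unsigned char* data, std::size_t pos) {
    return (static_cast<std::uint32_t>(data[pos])     << 24) |
           (static_cast<std::uint32_t>(data[pos + 1]) << 16) |
           (static_cast<std::uint32_t>(data[pos + 2]) << 8);
}
```

With data holding 0 through 7, pack_be24(data, 3) yields 0x03040500 on both big-endian and little-endian hosts.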

Are casts as safe as unions?

I want to split large variables like floats into byte segments and send these serially byte by byte via UART. I'm using C/C++.
One method could be to deepcopy the value I want to send to a union and then send it. I think that would be 100% safe but slow. The union would look like this:
union mySendUnion
{
mySendType sendVal;
char sendArray[sizeof(mySendType)];
};
Another option could be to cast the pointer to the value I want to send, into a pointer to a particular union. Is this still safe?
The third option could be to cast the pointer to the value I want to send to a char, and then increment a pointer like this:
sendType myValue = 443.2;
char* sendChar = (char*)myValue;
for(char i=0; i< sizeof(sendType) ; i++)
{
Serial.write(*(sendChar+i), 1);
}
I've had success with the above pointer arithmetic, but I'm not sure if it's safe under all circumstances. My concern is this: what if we are, for instance, using a 32 bit processor and want to send a float? Suppose the compiler chooses to store this 32 bit float in one memory cell, but stores only a single char in each 32 bit cell.
Each counter increment would then make the pointer advance one whole memory cell, and we would miss the float.
Is there something in the C standard that prevents this, or could this be an issue with a certain compiler?
First off, you can't write your code in "C/C++". There's no such language as "C/C++", as they are fundamentally different languages. As such, the answer regarding unions differs radically.
As to the title:
Are casts as safe as unions?
No, generally they aren't, because of the strict aliasing rule. That is, if you type-pun a pointer of one certain type with a pointer to an incompatible type, it will result in undefined behavior. The only exception to this rule is when you read or manipulate the byte-wise representation of an object by aliasing it through a pointer to (signed or unsigned) char. As in your case.
Unions, however, are quite different bastards. Type punning via copying to and reading from unions is permitted in C99 and later, but results in undefined behavior in C89 and all versions of C++.
In one direction, you can also safely type pun (in C99 and later) using a pointer to union, if you have the original union as an actual object. Like this:
union p {
char c[sizeof(float)];
float f;
} pun;
union p *punPtr = &pun;
punPtr->f = 3.14;
send_bytes(punPtr->c, sizeof(float));
Because "a pointer to a union points to all of its members and vice versa" (C99, I don't remember the exact paragraph, it's around 6.2.5, IIRC). This isn't true in the other direction, though:
float f = 3.14;
union p *punPtr = (union p *)&f;
send_bytes(punPtr->c, sizeof(float)); // triggers UB!
To sum up: the following code snippet is valid in all of C89, C99, C11 and C++:
float f = 3.14;
char *p = (char *)&f;
size_t i;
for (i = 0; i < sizeof f; i++) {
send_byte(p[i]); // hypothetical function
}
The following is only valid in C99 and later:
union {
char c[sizeof(float)];
float f;
} pun;
pun.f = 3.14;
send_bytes(pun.c, sizeof(float)); // another hypothetical function
The following, however, would not be valid:
float f = 3.14;
unsigned *u = (unsigned *)&f;
printf("%u\n", *u); // undefined behavior triggered!
Another solution that is always guaranteed to work is memcpy(). The memcpy() function does a bytewise copying between two objects. (Don't get me started on it being "slow" -- in most modern compilers and stdlib implementations, it's an intrinsic function).
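A minimal sketch of the memcpy() route for the float case, covering both the sending and the receiving side. The function names are hypothetical, and the bytes are in host order, so both ends are assumed to share endianness and float format.

```cpp
#include <cstddef>
#include <cstring>

// Copy the object representation of a float into a byte buffer,
// ready to be sent one byte at a time.
void float_to_bytes(float value, unsigned char (&out)[sizeof(float)]) {
    std::memcpy(out, &value, sizeof value);
}

// Rebuild the float from received bytes (same byte order assumed).
float bytes_to_float(const unsigned char (&in)[sizeof(float)]) {
    float value;
    std::memcpy(&value, in, sizeof value);
    return value;
}
```

The receiving side can fill a plain unsigned char array from the UART and call bytes_to_float on it, which avoids casting the byte buffer to float*.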
A general advice when sending floating point data on a byte stream would be to use some serialization technology, to ensure that the data format is well defined (and preferably architecture neutral, beware of endianness issues!).
You could use XDR -or perhaps ASN1- which is a binary format (see xdr(3) for more). For C++, see also libs11n
Unless speed or data size is very critical, I would suggest instead a textual format like JSON or perhaps YAML (textual formats are more verbose, but easier to debug and to document). There are several good libraries supporting it (e.g. jsoncpp for C++ or jansson for C).
Notice that serial ports are quite slow (w.r.t. CPU). So the serialization processing time is negligible.
Whatever you do, please document the serialization format (even for an internal project).
The cast to [[un]signed] char [const] * is legal and it won't cause issues when reading, so that is a fine option (that is, after fixing char *sendChar = reinterpret_cast<char*>(&myValue);, and since you are at it, make it const)
Now the next problem comes on the other side, when reading, as you cannot safely use the same approach for reading. In general, the cost of copying the variables is much less than the cost of sending over the UART, so I would just use the union when reading out of the serial.

C++: how to cast 2 bytes in an array to an unsigned short

I have been working on a legacy C++ application and am definitely outside of my comfort-zone (a good thing). I was wondering if anyone out there would be so kind as to give me a few pointers (pun intended).
I need to cast 2 bytes in an unsigned char array to an unsigned short. The bytes are consecutive.
For an example of what I am trying to do:
I receive a string from a socket and place it in an unsigned char array. I can ignore the first byte and then the next 2 bytes should be converted to an unsigned short. This will be on Windows only so there are no Big/Little Endian issues (that I am aware of).
Here is what I have now (not working obviously):
//packetBuffer is an unsigned char array containing the string "123456789" for testing
//I need to convert bytes 2 and 3 into the short, 2 being the most significant byte
//so I would expect to get 515 (2*256 + 3) instead all the code I have tried gives me
//either errors or 2 (only converting one byte)
unsigned short myShort;
myShort = static_cast<unsigned short>(packetBuffer[1]);
Well, you are widening the char into a short value. What you want is to interpret two bytes as a short. static_cast cannot cast from unsigned char* to unsigned short*. You have to cast to void*, then to unsigned short*:
unsigned short *p = static_cast<unsigned short*>(static_cast<void*>(&packetBuffer[1]));
Now, you can dereference p and get the short value. But the problem with this approach is that you cast from unsigned char*, to void* and then to some different type. The Standard doesn't guarantee the address remains the same (and in addition, dereferencing that pointer would be undefined behavior). A better approach is to use bit-shifting, which will always work:
unsigned short p = (packetBuffer[1] << 8) | packetBuffer[2];
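The bit-shifting approach can be wrapped in a small helper. This is a sketch with a hypothetical name, read_be16, and it assumes the first of the two bytes is the most significant one, as the question states.

```cpp
#include <cstdint>

// Interpret two consecutive bytes as a 16-bit value, most significant
// byte first. Each operand is promoted to int before the shift, so no
// bits are lost, and no alignment requirement applies to p.
std::uint16_t read_be16(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}
```

For a buffer holding the byte values 1, 2, 3, read_be16(&packetBuffer[1]) gives 515 (2*256 + 3), matching the result the question expects.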
This is probably well below what you care about, but keep in mind that you could easily get an unaligned access doing this. x86 is forgiving and the abort that the unaligned access causes will be caught internally and will end up with a copy and return of the value so your app won't know any different (though it's significantly slower than an aligned access). If, however, this code will run on a non-x86 (you don't mention the target platform, so I'm assuming x86 desktop Windows), then doing this will cause a processor data abort and you'll have to manually copy the data to an aligned address before trying to cast it.
In short, if you're going to be doing this access a lot, you might look at adjusting the code so as not to have unaligned reads, and you'll see a performance benefit.
unsigned short myShort = *(unsigned short *)&packetBuffer[1];
The bit shift above can go wrong on some compilers:
unsigned short p = (packetBuffer[1] << 8) | packetBuffer[2];
If packetBuffer holds bytes (8 bits wide) and the compiler performs the shift at that width, the shift turns packetBuffer[1] into zero, leaving you with only packetBuffer[2]. A conforming C or C++ compiler promotes the operand to int before shifting, so this only bites on toolchains that skip that promotion.
Despite that, this is still preferable to pointers. To avoid the problem, I spend a few extra lines of code; with anything other than literally zero optimization, it results in the same machine code:
unsigned short p;
p = packetBuffer[1]; p <<= 8; p |= packetBuffer[2];
Or to save some clock cycles and not shift the bits off the end:
unsigned short p;
p = (((unsigned short)packetBuffer[1])<<8) | packetBuffer[2];
You have to be careful with pointers, the optimizer will bite you, as well as memory alignments and a long list of other problems. Yes, done right it is faster, done wrong the bug can linger for a long time and strike when least desired.
Say you were lazy and wanted to do some 16 bit math on an 8 bit array. (little endian)
unsigned short *s;
unsigned char b[10];
s=(unsigned short *)&b[0];
if(b[0]&7)
{
*s = *s+8;
*s &= ~7;
}
do_something_With(b);
*s=*s+8;
do_something_With(b);
*s=*s+8;
do_something_With(b);
There is no guarantee that a perfectly bug free compiler will create the code you expect. The byte array b sent to the do_something_with() function may never get modified by the *s operations. Nothing in the code above says that it should. If you don't optimize your code then you may never see this problem (until someone does optimize or changes compilers or compiler versions). If you use a debugger you may never see this problem (until it is too late).
The compiler doesn't see the connection between s and b, they are two completely separate items. The optimizer may choose not to write *s back to memory because it sees that *s has a number of operations so it can keep that value in a register and only save it to memory at the end (if ever).
There are three basic ways to fix the pointer problem above:
Declare s as volatile.
Use a union.
Use a function or functions whenever changing types.
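As a sketch of the third option, the 16-bit arithmetic from the example above can go through small load/store functions instead of a long-lived unsigned short pointer. The helper names are hypothetical, memcpy is one possible way to implement them, and the value is read in host byte order like the original *s access.

```cpp
#include <cstdint>
#include <cstring>

// Read the 16-bit value stored at p (host byte order, any alignment).
std::uint16_t load_u16(const unsigned char* p) {
    std::uint16_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// Write a 16-bit value back into the byte array.
void store_u16(unsigned char* p, std::uint16_t v) {
    std::memcpy(p, &v, sizeof v);
}

// Round the 16-bit value at the start of b up to a multiple of 8.
// Every change goes through store_u16, so the byte array is always
// up to date before it is handed to another function.
void align_up_8(unsigned char* b) {
    std::uint16_t v = load_u16(b);
    if (v & 7u) {
        v = static_cast<std::uint16_t>((v + 8u) & ~7u);
        store_u16(b, v);
    }
}
```

Since the compiler sees the writes going into b itself, it has no room to keep a stale copy in a register across the call that consumes b.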
You should not cast an unsigned char pointer into an unsigned short pointer (or, for that matter, cast from a pointer to a smaller data type to a pointer to a larger one). This is because the address is then assumed to be correctly aligned. A better approach is to shift the bytes into a real unsigned short object, or memcpy to an unsigned short array.
No doubt, you can adjust the compiler settings to get around this limitation, but this is a very subtle thing that will break in the future if the code gets passed around and reused.
Maybe this is a very late solution, but I just want to share it with you. When you want to convert primitives or other types you can use a union. See below:
union CharToStruct {
char charArray[2];
unsigned short value;
};
short toShort(char* value){
CharToStruct cs;
cs.charArray[0] = value[1]; // most significant byte of the short is not the first byte of the char array
cs.charArray[1] = value[0];
return cs.value;
}
When you create an array with the hex values below and call the toShort function, you will get a short with the value 3 (on a little-endian machine).
char array[2];
array[0] = 0x00;
array[1] = 0x03;
short i = toShort(array);
cout << i << endl; // or printf("%hd", i);
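static_cast has a different syntax, plus you need to work with pointers. static_cast itself cannot convert unsigned char* to unsigned short*, so a reinterpret_cast is needed here (with the alignment and aliasing caveats raised in the other answers). What you want to do is:
unsigned short *myShort = reinterpret_cast<unsigned short*>(&packetBuffer[1]);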
Did nobody see the input was a string!
/* If it is a string as explicitly stated in the question.
*/
int byte1 = packetBuffer[1] - '0'; // convert 1st byte from char to number.
int byte2 = packetBuffer[2] - '0';
unsigned short result = (byte1 * 256) + byte2;
/* Alternatively if is an array of bytes.
*/
int byte1 = packetBuffer[1];
int byte2 = packetBuffer[2];
unsigned short result = (byte1 * 256) + byte2;
This also avoids the problems with alignment that most of the other solutions may have on certain platforms. Note: a short is at least two bytes. Many systems will give you a memory error if you try to dereference a short pointer that is not 2 byte aligned (or whatever sizeof(short) is on your system)!
char packetBuffer[] = {1, 2, 3};
unsigned short myShort = * reinterpret_cast<unsigned short*>(&packetBuffer[1]);
I (had to) do this all the time. Big endian is an obvious problem. What really will get you is incorrect data when the machine dislikes misaligned reads (and writes)!
You may want to write a test case and an assert to see if it reads properly. Then, when run on a big endian machine, or more importantly on a machine that dislikes misaligned reads, an assert error will occur instead of a weird, hard to trace 'bug' ;)
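One way such a check could look is sketched below. The function names are hypothetical: host_is_little_endian probes the byte order at runtime, and read_le16 is an assumed shift-based decoder that a startup assert can compare the raw read against.

```cpp
#include <cstdint>
#include <cstring>

// True when the least significant byte of a multi-byte integer
// is stored at the lowest address.
bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Decode two bytes that were written least significant byte first,
// independent of the host's own byte order or alignment rules.
std::uint16_t read_le16(const unsigned char* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
```

A startup check such as assert(host_is_little_endian()) documents the assumption a raw cast relies on, while read_le16 works either way.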
On windows you can use:
unsigned short i = MAKEWORD(lowbyte,hibyte);
I realize this is an old thread, and I can't say that I tried every suggestion made here. I'm just making myself comfortable with MFC, and I was looking for a way to convert a uint to two bytes, and back again at the other end of a socket.
There are a lot of bit shifting examples you can find on the net, but none of them seemed to actually work. A lot of the examples seem overly complicated; I mean we're just talking about grabbing 2 bytes out of a uint, sending them over the wire, and plugging them back into a uint at the other end, right?
This is the solution I finally came up with:
class ByteConverter
{
public:
static void uIntToBytes(unsigned int theUint, char* bytes)
{
unsigned int tInt = theUint;
void *uintConverter = &tInt;
char *theBytes = (char*)uintConverter;
bytes[0] = theBytes[0];
bytes[1] = theBytes[1];
}
static unsigned int bytesToUint(char *bytes)
{
unsigned theUint = 0;
void *uintConverter = &theUint;
char *thebytes = (char*)uintConverter;
thebytes[0] = bytes[0];
thebytes[1] = bytes[1];
return theUint;
}
};
Used like this:
unsigned int theUint;
char bytes[2];
CString msg;
ByteConverter::uIntToBytes(65000,bytes);
theUint = ByteConverter::bytesToUint(bytes);
msg.Format(_T("theUint = %u"), theUint);
AfxMessageBox(msg, MB_ICONINFORMATION | MB_OK);
Hope this helps someone out.