When does casting change a value's bits in C++?

I have a C++ unsigned int which is actually storing a signed value. I want to cast this variable to a signed int, so that the unsigned and signed values have the same binary value.
unsigned int lUnsigned = 0x80000001;
int lSigned1 = (int)lUnsigned; // Does lSigned1 == 0x80000001?
int lSigned2 = static_cast<int>(lUnsigned); // Does lSigned2 == 0x80000001?
int lSigned3 = reinterpret_cast<int>(lUnsigned); // Compiler didn't like this
When do casts change the bits of a variable in C++? For example, I know that casting from an int to a float will change the bits because int is twos-complement and float is floating-point. But what about other scenarios? I am not clear on the rules for this in C++.
In section 6.3.1.3 of the C99 spec it says that converting an out-of-range value to a signed integer type is implementation-defined!

A type conversion can
keep the conceptual value (the bitpattern may have to be changed), or
keep the bitpattern (the conceptual value may have to be changed).
The only C++ cast that is guaranteed to always keep the bitpattern is const_cast.
A reinterpret_cast is, as its name suggests, intended to keep the bitpattern and simply reinterpret it. But the standard allows an implementation very much leeway in how to implement reinterpret_cast. In some cases a reinterpret_cast may change the bitpattern.
A dynamic_cast generally changes both bitpattern and value, since it generally delves into an object and returns a pointer/reference to a sub-object of requested type.
A static_cast may change the bitpattern both for integers and pointers, but, nearly all extant computers use a representation of signed integers (called two's complement) where static_cast will not change the bitpattern. Regarding pointers, suffice it to say that, for example, when a base class is non-polymorphic and a derived class is polymorphic, using static_cast to go from pointer to derived to pointer to base, or vice versa, may change the bitpattern (as you can see when comparing the void* pointers). Now, integers...
With n value bits, an unsigned integer type has 2^n values, in the range 0 through 2^n-1 (inclusive).
The C++ standard guarantees that any result of the type is wrapped into that range by adding or subtracting a suitable multiple of 2^n.
Actually that's how the C standard describes it; the C++ standard just says that operations are modulo 2^n, which means the same.
With two's complement form a signed value -x has the same bitpattern as the unsigned value -x+2^n. That is, the same bitpattern as the C++ standard guarantees that you get by converting -x to unsigned type of the same size. That's the simple basics of two's complement form, that it is precisely the guarantee that you're seeking. :-)
And nearly all extant computers use two's complement form.
Hence, in practice you're guaranteed an unchanged bitpattern for your examples.
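A minimal sketch of that guarantee (a two's complement implementation is assumed, where the unsigned-to-signed conversion simply wraps; C++20 makes that mandatory):
unsigned int lUnsigned = 0x80000001u;
int lSigned = static_cast<int>(lUnsigned);                // wraps to a negative value on a two's complement machine
unsigned int lBack = static_cast<unsigned int>(lSigned);  // signed-to-unsigned is always defined modulo 2^32: 0x80000001 again
// lBack == lUnsigned, and in practice lSigned holds the very same bit pattern as well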

If you cast from a smaller signed integral type to a larger signed integral type, copies of the original most significant bit (1 in the case of a negative number) will be prepended as necessary to preserve the integer's value.
If you cast an object pointer to a pointer of one of its superclasses, the bits can change, especially if there is multiple inheritance or virtual superclasses.
You're kind of asking for the difference between static_cast and reinterpret_cast.
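A small sketch of that sign extension (the fixed-width types are used here only for concreteness; nothing else is taken from the question):
#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main() {
    std::int8_t narrow = -2;    // 8-bit pattern 0xFE
    std::int32_t wide = narrow; // the sign bit is copied into the new high bits
    std::printf("%08" PRIx32 "\n", static_cast<std::uint32_t>(wide));  // prints fffffffe
}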

If your implementation uses 2's complement for signed integer types, then casting from signed to unsigned integer types of the same width doesn't change the bit pattern.
Casting from unsigned to signed could in theory do all sorts of things when the value is out of range of the signed type, because it's implementation-defined. But the obvious thing for a 2's complement implementation to do is to use the same bit pattern.
If your implementation doesn't use 2's complement, then casting between signed and unsigned values will change the bit pattern, when the signed value involved is negative. Such implementations are rare, though (I don't specifically know of any use of non-2's complement in C++ compilers).

Using a C-style cast, or a static_cast, to cast an unsigned int to a signed int performs a value conversion, and when the unsigned value is larger than what the signed int can hold the result is implementation-defined, so the bits are not guaranteed to be preserved. A reinterpret_cast of a pointer or reference should work though, or you can type-pun using a pointer instead:
unsigned int lUnsigned = 0x80000001;
int lSigned1 = *((int*)&lUnsigned);
int lSigned2 = *(reinterpret_cast<int*>(&lUnsigned));
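If you would rather not rely on the pointer cast at all, a memcpy-based sketch (nothing beyond the standard library is assumed) copies the bit pattern with no aliasing concerns:
#include <cstring>

int main() {
    unsigned int lUnsigned = 0x80000001u;
    int lSigned3;
    std::memcpy(&lSigned3, &lUnsigned, sizeof lSigned3);  // copies the object representation verbatim
    // lSigned3 now holds whatever value those bits mean as an int
}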

unsigned int is always the same size as int. And every computer on the planet uses 2's complement these days. So none of your casts will change the bit representation.

You're looking for int lSigned = reinterpret_cast<int&>(lUnsigned);
You don't want to reinterpret the value of lUnsigned, you want to reinterpret the object lUnsigned. Hence, the cast to a reference type.
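A minimal sketch of that reference form (the object really is an unsigned int here, and int and unsigned int of the same width are allowed to alias, so reading through the reference is fine):
unsigned int lUnsigned = 0x80000001u;
int& lView = reinterpret_cast<int&>(lUnsigned);  // the same object, viewed as an int
int lSigned = lView;                             // copy out the reinterpreted value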

Casting is just a way to override the type-checker; it shouldn't actually modify the bits themselves.

Related

Can I reinterpret cast in GLSL?

In C++ you can take a pointer to an unsigned int and cast it to a pointer to a signed int (reinterpret_cast).
unsigned int a = 200;
int b = *(reinterpret_cast<int *>(&a));
I need to store an int generated in a shader as an unsigned int, to be written to a texture with an unsigned integer internal format. Is there any similar alternative to C++'s reinterpret_cast in GLSL?
In C++ (pre-20), signed and unsigned integers are permitted to be represented in very different ways. C++ does not require signed integers to be two's complement; implementations are allowed to use ones' complement, or other representations. The only requirement C++ has on signed vs. unsigned is that conversion of all non-negative (non-trap) signed values to unsigned values is possible.
And FYI: your code yields UB for violating the strict aliasing rule (accessing an object of type X through a pointer to an unrelated object of type Y). Though this is somewhat common in low-level code, the C++ object model does not really allow it. But I digress.
I brought up all the signed-vs-unsigned stuff because GLSL actually defines the representation of signed integers. In GLSL, a signed integer is two's complement. Because of that, GLSL can define how conversion from the entire range of unsigned values goes to signed values and vice-versa, simply by preserving the bitpattern of the value.
And that's exactly what it does. So instead of having to use casting gymnastics, you simply do a signed-to-unsigned conversion (or the other way around), just as you would for float-to-signed or whatever:
int i = ...
uint j = uint(i);
This conversion preserves the bit-pattern.
Oh, and C++20 seems to be getting on-board with this too.
GLSL does not support this kind of casting (nor does it support pointers at all). Instead, in GLSL you construct values of a different type with constructor-style syntax:
int a = 5; // set an int to a constant
uint b = uint(a); // "cast" that int to a uint by constructing a uint from it.

Why cast a pointer to a float into a pointer to a long, then dereference?

I was going through this example which has a function outputting a hex bit pattern to represent an arbitrary float.
void ExamineFloat(float fValue)
{
printf("%08lx\n", *(unsigned long *)&fValue);
}
Why take the address of fValue, cast to unsigned long pointer, then dereference? Isn't all that work just equivalent to a direct cast to unsigned long?
printf("%08lx\n", (unsigned long)fValue);
I tried it and the answer isn't the same, so I'm confused.
(unsigned long)fValue
This converts the float value to an unsigned long value, according to the "usual arithmetic conversions".
*(unsigned long *)&fValue
The intention here is to take the address at which fValue is stored, pretend that there is not a float but an unsigned long at this address, and to then read that unsigned long. The purpose is to examine the bit pattern which is used to store the float in memory.
As shown, this causes undefined behavior though.
Reason: You may not access an object through a pointer to a type that is not "compatible" to the object's type. "Compatible" types are for example (unsigned) char and every other type, or structures that share the same initial members (speaking of C here). See §6.5/7 N1570 for the detailed (C11) list (Note that my use of "compatible" is different - more broad - than in the referenced text.)
Solution: Cast to unsigned char *, access the individual bytes of the object and assemble an unsigned long out of them:
unsigned long pattern = 0;
unsigned char * access = (unsigned char *)&fValue;
for (size_t i = 0; i < sizeof(float); ++i) {
    pattern <<= CHAR_BIT;  /* make room before adding the next byte */
    pattern |= *access;
    ++access;
}
Note that (as #CodesInChaos pointed out) the above treats the floating point value as being stored with its most significant byte first ("big endian"). If your system uses a different byte order for floating point values you'd need to adjust to that (or rearrange the bytes of above unsigned long, whatever's more practical to you).
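Alternatively, a memcpy-based sketch (written in C++ syntax here; the same approach works in C with <string.h>, <stdint.h> and <inttypes.h>) avoids both the aliasing problem and the manual byte reassembly, and keeps the native byte order:
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <cstring>

void ExamineFloatPortable(float fValue) {
    std::uint32_t bits;
    static_assert(sizeof bits == sizeof fValue, "float is assumed to be 32 bits wide");
    std::memcpy(&bits, &fValue, sizeof bits);  // well-defined copy of the object representation
    std::printf("%08" PRIx32 "\n", bits);      // prints the bit pattern as one 32-bit number
}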
Floating-point values have memory representations: for example the bytes can represent a floating-point value using IEEE 754.
The first expression *(unsigned long *)&fValue will interpret these bytes as if it was the representation of an unsigned long value. In fact in C standard it results in an undefined behavior (according to the so-called "strict aliasing rule"). In practice, there are issues such as endianness that have to be taken into account.
The second expression (unsigned long)fValue is C standard compliant. It has a precise meaning:
C11 (n1570), § 6.3.1.4 Real floating and integer
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
*(unsigned long *)&fValue is not equivalent to a direct cast to an unsigned long.
The conversion to (unsigned long)fValue converts the value of fValue into an unsigned long, using the normal rules for conversion of a float value to an unsigned long value. The representation of that value in an unsigned long (for example, in terms of the bits) can be quite different from how that same value is represented in a float.
The conversion *(unsigned long *)&fValue formally has undefined behaviour. It interprets the memory occupied by fValue as if it is an unsigned long. Practically (i.e. this is what often happens, even though the behaviour is undefined) this will often yield a value quite different from fValue.
Typecasting in C does both a type conversion and a value conversion. The floating point → unsigned long conversion truncates the fractional portion of the floating point number and restricts the value to the possible range of an unsigned long. Converting from one type of pointer to another has no required change in value, so using the pointer typecast is a way to keep the same in-memory representation while changing the type associated with that representation.
In this case, it's a way to be able to output the binary representation of the floating point value.
As others have already noted, casting a pointer to a non-char type to a pointer to a different non-char type and then dereferencing is undefined behavior.
That printf("%08lx\n", *(unsigned long *)&fValue) invokes undefined behavior does not necessarily mean that running a program that attempts to perform such a travesty will result in hard drive erasure or make nasal demons erupt from one's nose (the two hallmarks of undefined behavior). On a computer in which sizeof(unsigned long)==sizeof(float) and on which both types have the same alignment requirements, that printf will almost certainly do what one expects it to do, which is to print the hex representation of the floating point value in question.
This shouldn't be surprising. The C standard openly invites implementations to extend the language. Many of these extensions are in areas that are, strictly speaking, undefined behavior. For example, the POSIX function dlsym returns a void*, but this function is typically used to find the address of a function rather than a global variable. This means the void pointer returned by dlsym needs to be cast to a function pointer and then dereferenced to call the function. This is obviously undefined behavior, but it nonetheless works on any POSIX compliant platform. This will not work on a Harvard architecture machine on which pointers to functions have different sizes than do pointers to data.
Similarly, casting a pointer to a float to a pointer to an unsigned integer and then dereferencing happens to work on almost any computer with almost any compiler in which the size and alignment requirements of that unsigned integer are the same as that of a float.
That said, using unsigned long might well get you into trouble. On my computer, an unsigned long is 64 bits long and has 64 bit alignment requirements. This is not compatible with a float. It would be better to use uint32_t -- on my computer, that is.
The union hack is one way around this mess:
typedef union {
    float fval;
    uint32_t ival;
} float_uint32_t;
Assigning to a float_uint32_t's fval member and reading its ival member used to be undefined behavior. That is no longer the case in C. No compiler that I know of blows nasal demons for the union hack. This was not UB in C++; it was illegal. Until C++11, a compliant C++ compiler had to complain.
An even better way around this mess is to use the %a format, which has been part of the C standard since 1999:
printf ("%a\n", fValue);
This is simple, easy, portable, and there is no chance of undefined behavior. This prints the hexadecimal/binary representation of the double precision floating point value in question. Since printf is a variadic function, all float arguments are converted to double prior to the call to printf. This conversion must be exact per the 1999 version of the C standard. One can pick up that exact value via a call to scanf or its sisters.
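A small sketch of that round trip (the value here is made up; %a writes the exact hexadecimal floating point text, and strtof reads it back without loss):
#include <cstdio>
#include <cstdlib>

int main() {
    float fValue = 0.1f;
    char buf[64];
    std::snprintf(buf, sizeof buf, "%a", fValue);  // e.g. "0x1.99999ap-4"
    float back = std::strtof(buf, nullptr);        // parse the exact value back
    std::printf("%s round-trips exactly: %s\n", buf, back == fValue ? "yes" : "no");
}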

Casting positive 'int' to 'size_t'

The difference between size_t and int is well-documented, which I recapitulate: the former is an alias to some unsigned integer type that's implementation-dependent, whereas the latter is signed; the former is preferable for memory declarations, whereas the latter is better for arithmetic operations.
My question is, if I do some arithmetic computations to give an int (which is never too large and is always positive) and assign it to a size_t variable (that's used for accessing array locations), is there any situation in which a problem may arise?
Assigning a signed type to an unsigned type is always well-defined (even for negative values). If the signed type is no larger than the unsigned type and the value is non-negative, the value will not change in such a conversion.
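A short sketch of the pattern the question describes (the names and the arithmetic are made up for illustration):
#include <cstddef>

int main() {
    int data[100] = {};
    int computed = 3 * 7 + 2;                                 // some arithmetic known to be non-negative and small
    std::size_t index = static_cast<std::size_t>(computed);   // value-preserving for non-negative ints
    data[index] = 42;                                          // fine as long as index stays below 100
}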

Aliasing of otherwise equivalent signed and unsigned types

The C and C++ standards both allow signed and unsigned variants of the same integer type to alias each other. For example, unsigned int* and int* may alias. But that's not the whole story because they clearly have a different range of representable values. I have the following assumptions:
If an unsigned int is read through an int*, the value must be within the range of int or an integer overflow occurs and the behaviour is undefined. Is this correct?
If an int is read through an unsigned int*, negative values wrap around as if they were casted to unsigned int. Is this correct?
If the value is within the range of both int and unsigned int, accessing it through a pointer of either type is fully defined and gives the same value. Is this correct?
Additionally, what about compatible but not equivalent integer types?
On systems where int and long have the same range, alignment, etc., can int* and long* alias? (I assume not.)
Can char16_t* and uint_least16_t* alias? I suspect this differs between C and C++. In C, char16_t is a typedef for uint_least16_t (correct?). In C++, char16_t is its own primitive type, which is compatible with uint_least16_t. Unlike C, C++ seems to have no exception allowing compatible but distinct types to alias.
If an unsigned int is read through an int*, the value must be within the range of int or an integer overflow occurs and the behaviour is undefined. Is this correct?
Why would it be undefined? There is no integer overflow, since no conversion or computation is done. We take the object representation of an unsigned int object and see it through an int. In what way the value of the unsigned int object maps to the value of an int is completely implementation-defined.
If an int is read through an unsigned int*, negative values wrap around as if they were casted to unsigned int. Is this correct?
Depends on the representation. With two's complement and equivalent padding, yes. Not with signed magnitude though - a cast from int to unsigned is always defined through a congruence:
If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type). [ Note: In a two's complement representation, this conversion is conceptual and there is no change in the bit pattern (if there is no truncation). — end note ]
And now consider
10000000 00000001 // -1 in signed magnitude for 16-bit int
This would certainly be 2^15+1 if interpreted as an unsigned. A cast would yield 2^16-1 though.
If the value is within the range of both int and unsigned int, accessing it through a pointer of either type is fully defined and gives the same value. Is this correct?
Again, with two's complement and equivalent padding, yes. With signed magnitude we might have -0.
On systems where int and long have the same range, alignment, etc., can int* and long* alias? (I assume not.)
No. They are independent types.
Can char16_t* and uint_least16_t* alias?
Technically not, but that seems to be an unnecessary restriction of the standard.
Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.
So it should be practically possible without any risks (since there shouldn't be any padding).
If an int is read through an unsigned int*, negative values wrap around as if they were casted to unsigned int. Is this correct?
For a system using two's complement, type-punning and signed-to-unsigned conversion are equivalent, for example:
int n = ...;
unsigned u1 = (unsigned)n;
unsigned u2 = *(unsigned *)&n;
Here, both u1 and u2 have the same value. This is by far the most common setup (e.g. Gcc documents this behaviour for all its targets). However, the C standard also addresses machines using ones' complement or sign-magnitude to represent signed integers. In such an implementation (assuming no padding bits and no trap representations), the result of a conversion of an integer value and type-punning can yield different results. As an example, assume sign-magnitude and n being initialized to -1:
int n = -1; /* 10000000 00000001 assuming 16-bit integers*/
unsigned u1 = (unsigned)n; /* 11111111 11111111
effectively 2's complement, UINT_MAX */
unsigned u2 = *(unsigned *)&n; /* 10000000 00000001
only reinterpreted, the value is now INT_MAX + 2u */
Conversion to an unsigned type means adding/subtracting one more than the maximum value of that type until the value is in range. Dereferencing a converted pointer simply reinterprets the bit pattern. In other words, the conversion in the initialization of u1 is a no-op on 2's complement machines, but requires some calculations on other machines.
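A worked sketch of that rule (in C++ syntax; the C cast behaves identically, and the congruence holds on every conforming implementation regardless of representation):
#include <climits>
#include <cstdio>

int main() {
    int n = -1;
    unsigned u = static_cast<unsigned>(n);   // -1 + (UINT_MAX + 1), i.e. UINT_MAX
    std::printf("%u\n%u\n", u, UINT_MAX);    // both lines print the same number
}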
If an unsigned int is read through an int*, the value must be within the range of int or an integer overflow occurs and the behaviour is undefined. Is this correct?
Not exactly. The bit pattern must represent a valid value in the new type, it doesn't matter if the old value is representable. From C11 (n1570) [omitted footnotes]:
6.2.6.2 Integer types
For unsigned integer types other than unsigned char, the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2^(N-1), so that objects of that type shall be capable of representing values from 0 to 2^N - 1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified.
For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits; signed char shall not have any padding bits. There shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed type and N in the unsigned type, then M≤N). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:
the corresponding value with sign bit 0 is negated (sign and magnitude);
the sign bit has the value -(2^M) (two's complement);
the sign bit has the value -(2^M - 1) (ones' complement).
Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for ones' complement), is a trap representation or a normal value. In the case of sign and magnitude and ones' complement, if this representation is a normal value it is called a negative zero.
E.g., an unsigned int could have value bits where the corresponding signed type (int) has a padding bit, so something like unsigned u = ...; int n = *(int *)&u; may result in a trap representation on such a system (reading of which is undefined behaviour), but not the other way round.
If the value is within the range of both int and unsigned int, accessing it through a pointer of either type is fully defined and gives the same value. Is this correct?
I think the standard would allow for one of the types to have a padding bit, which is always ignored (thus, two different bit patterns can represent the same value and that bit may be set on initialization) but be an always-trap-if-set bit for the other type. This leeway, however, is limited at least by ibid. p5:
The values of any padding bits are unspecified. A valid (non-trap) object representation of a signed integer type where the sign bit is zero is a valid object representation of the corresponding unsigned type, and shall represent the same value. For any integer type, the object representation where all the bits are zero shall be a representation of the value zero in that type.
On systems where int and long have the same range, alignment, etc., can int* and long* alias? (I assume not.)
Sure they can, if you don't use them ;) But no, the following is invalid on such platforms:
int n = 42;
long l = *(long *)&n; // UB
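If you do need the bits of an int in a long on such a platform, a memcpy-based sketch stays within the rules (the equal-size premise is the question's; with a wider long, which bytes receive the copy depends on endianness):
#include <cstring>

int main() {
    int n = 42;
    long l = 0;
    std::memcpy(&l, &n, sizeof n);   // defined: copies int's object representation into l's storage
}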
Can char16_t* and uint_least16_t* alias? I suspect this differs between C and C++. In C, char16_t is a typedef for uint_least16_t (correct?). In C++, char16_t is its own primitive type, which compatible with uint_least16_t. Unlike C, C++ seems to have no exception allowing compatible but distinct types to alias.
I'm not sure about C++, but at least for C, char16_t is a typedef, but not necessarily for uint_least16_t, it could very well be a typedef of some implementation-specific __char16_t, some type incompatible with uint_least16_t (or any other type).
It is not defined what happens, since the C standard does not exactly define how signed integers should be stored, so you cannot rely on the internal representation. Also, no overflow occurs: if you just typecast a pointer, nothing happens other than another interpretation of the binary data in the following calculations.
Edit
Oh, I misread the phrase "but not equivalent integer types", but I keep the paragraph for your interest:
Your second question has much more trouble in it. Many machines can only read from correctly aligned addresses, where the data has to lie on multiples of the type's width. If you read an int32 from an address that is not divisible by 4 (because you cast a 2-byte int pointer), your CPU may crash.
You should not rely on the sizes of types. If you chose another compiler or platform your long and int may not match anymore.
Conclusion:
Do not do this. You wrote highly platform dependent (compiler, target machine, architecture) code that hides its errors behind casts that suppress any warnings.
Concerning your questions regarding unsigned int* and int*: if the value in the actual type doesn't fit in the type you're reading, the behavior is undefined, simply because the standard neglects to define any behavior in this case, and any time the standard fails to define behavior, the behavior is undefined. In practice, you'll almost always obtain a value (no signals or anything), but the value will vary depending on the machine: a machine with signed magnitude or 1's complement, for example, will result in different values (both ways) from the usual 2's complement.
For the rest, int and long are different types, regardless of their representations, and int* and long* cannot alias. Similarly, as you say, char16_t is a distinct type in C++, but a typedef in C (so the rules concerning aliasing are different).

How does casting to "signed int" and back to "signed short" work for values larger than 32,767?

Code:
typedef signed short SIGNED_SHORT; //16 bit
typedef signed int SIGNED_INT; //32 bit
SIGNED_SHORT x;
x = (SIGNED_SHORT)(SIGNED_INT) 45512; //or any value over 32,767
Here is what I know:
Signed 16 bits:
Signed: From −32,768 to 32,767
Unsigned: From 0 to 65,535
Don't expect 45512 to fit into x as x is declared a 16 bit signed integer.
How and what does the double casting above do?
Thank You!
typedef signed short SIGNED_SHORT; //16 bit
typedef signed int SIGNED_INT; //32 bit
These typedefs are not particularly useful. A typedef does nothing more than provide a new name for an existing type. Type signed short already has a perfectly good name: "signed short"; calling it SIGNED_SHORT as well doesn't buy you anything. (It would make sense if it abstracted away some information about the type, or if the type were likely to change -- but using the name SIGNED_SHORT for a type other than signed short would be extremely confusing.)
Note also that short and int are both guaranteed to be at least 16 bits wide, and int is at least as wide as short, but different sizes are possible. For example, a compiler could make both short and int 16 bits -- or 64 bits for that matter. But I'll assume the sizes for your compiler are as you state.
In addition, signed short and short are names for the same type, as are signed int and int.
SIGNED_SHORT x;
x = (SIGNED_SHORT)(SIGNED_INT) 45512; //or any value over 32,767
A cast specifies a conversion to a specified type. Two casts specify two such conversions. The value 45512 is converted to signed int, and then to signed short.
The constant 45512 is already of type int (another name for signed int), so the innermost cast is fairly pointless. (Note that if int is only 16 bits, then 45512 will be of type long.)
When you assign a value of one numeric type to an object of another numeric type, the value is implicitly converted to the object's type, so the outermost cast is also redundant.
So the above code snippet is exactly equivalent to:
short x = 45512;
Given the ranges of int and short on your system, the mathematical value 45512 cannot be represented in type short. The language rules state that the result of such a conversion is implementation-defined, which means that it's up to each implementation to determine what the result is, and it must document that choice, but different implementations can do it differently. (Actually that's not quite the whole story; the 1999 ISO C standard added permission for such a conversion to raise an implementation-defined signal. I don't know of any compiler that does this.)
The most common semantics for this kind of conversion is that the result gets the low-order bits of the source value. This will probably result in the value -20024 being assigned to x. But you shouldn't depend on that if you want your program to be maximally portable.
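A short sketch of that typical behaviour (assuming a 16-bit short on a two's complement implementation that keeps the low-order bits, as most do):
#include <cstdio>

int main() {
    int value = 45512;
    short x = static_cast<short>(value);   // implementation-defined: commonly the low 16 bits are kept
    std::printf("%d\n", x);                // typically prints -20024, i.e. 45512 - 65536
}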
When you cast twice, the casts are applied in sequence.
int a = 45512;
int b = (int) a;
short x = (short) b;
Since 45512 does not fit in a short on most (but not all!) platforms, the cast overflows on those platforms. This will either raise an implementation-defined signal or result in an implementation-defined value.
In practice, many platforms define the result as the truncated value, which is -20024 in this case. However, there are platforms which raise a signal, which will probably terminate your program if uncaught.
Citation: n1525 §6.3.1.3
Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
The double casting is equivalent to:
short x = static_cast<short>(static_cast<int>(45512));
which is equivalent to:
short x = 45512;
which will likely wrap around so x equals -20024, but technically it's implementation-defined behavior if a short has a maximum value less than 45512 on your platform. The literal 45512 is of type int.
You can assume it does two type conversions (although signed int and int are only separated once in the C standard, IIRC).
If SIGNED_SHORT is too small to handle 45512, the result is either implementation-defined or an implementation-defined signal is raised. (In C++ only the former applies.)