Conversion of float to integer on an ARM-based system - C++

I have the following piece of code, called main.cpp, that converts an IEEE 754 32-bit hex value to float and then converts it to unsigned short.
#include <iostream>
using namespace std;
int main() {
    unsigned int input_val = 0xc5dac022;
    float f;
    *((int*) &f) = input_val;
    unsigned short val = (unsigned short) f;
    cout << "Val = 0x" << std::hex << val << endl;
}
I build and run the code using the following command:
g++ main.cpp -o main
./main
When I run this code on my normal PC, I get the expected answer, which is 0xe4a8. But when I run the same code on an ARM processor, it gives an output of 0x0.
Is this happening because I am building the code with the regular g++ instead of an aarch64 toolchain? The code gives the correct output for some other test cases on the ARM processor but gives an incorrect output for this particular test value. How can I solve this issue?

First, your "type pun" via pointers violates the strict aliasing rule, as mentioned in comments. You can fix that by switching to memcpy.
Next, the bit pattern 0xc5dac022 as an IEEE-754 single-precision float corresponds to a value of about -7000, if my test is right. This is truncated to -7000, which, being negative, cannot be represented in an unsigned short. As such, attempting to convert it to unsigned short has undefined behavior, per [7.3.10 p1] in the C++ standard (C++20 N4860). Note that this is different from the situation of converting a signed or unsigned integer to unsigned short, which would have well-defined "wrapping" behavior.
So there is no "correct answer" here. Printing 0 is a perfectly legal result, and is also logical in some sense, as 0 is the closest unsigned short value to -7000. But it's also not surprising that the result would vary between platforms / compilers / optimization options, as this is common for UB.
There is actually a difference between ARM64 and x86-64 that explains why this is the particular behavior you see.
When compiling without optimization, in both cases, gcc emits instructions to actually convert the float value to unsigned short at runtime.
ARM64 has a dedicated instruction fcvtzu that converts a float to a 32-bit unsigned int, so gcc emits that instruction, and then extracts the low 16 bits of the integer result. The behavior of fcvtzu with a negative input is to output 0, and so that's the value that you get.
x86-64 doesn't have such an instruction. The nearest thing is cvttss2si, which converts a single-precision float to a signed 32-bit integer. So gcc emits that instruction, then uses the low 16 bits of it as the unsigned short value. This gives the right answer whenever the input float is in the range [0, 65536), because all these values fit in the range of a 32-bit signed integer. GCC doesn't care what it does in all other cases, because they are UB according to the C++ standard. But it so happens that, since your value -7000 does fit in signed int, cvttss2si returns the signed integer -7000, which is 0xffffe4a8. Extracting the low 16 bits gives you the 0xe4a8 that you observed.
When optimizing, gcc on both platforms optimizes the value into a constant 0. Which is also perfectly legal.

Related

Cast from unsigned long long to double and vice versa changes the value

While writing some C++ code I suddenly realised that my numbers are incorrectly converted from double to unsigned long long.
To be specific, I use the following code:
#define _CRT_SECURE_NO_WARNINGS
#include <iostream>
#include <limits>
using namespace std;
int main()
{
    unsigned long long ull = numeric_limits<unsigned long long>::max();
    double d = static_cast<double>(ull);
    unsigned long long ull2 = static_cast<unsigned long long>(d);
    cout << ull << endl << d << endl << ull2 << endl;
    return 0;
}
Ideone live example.
When this code is executed on my computer, I have the following output:
18446744073709551615
1.84467e+019
9223372036854775808
Press any key to continue . . .
I expected the first and third numbers to be exactly the same (just like on Ideone) because I was sure that long double took 10 bytes and stored the mantissa in 8 of them. I would understand if the third number were truncated compared to the first one - just in case I'm wrong about the floating-point format. But here the values differ by a factor of two!
So, the main question is: why? And how can I predict such situations?
Some details: I use Visual Studio 2013 on Windows 7, compile for x86, and sizeof(long double) == 8 for my system.
18446744073709551615 is not exactly representable in double (in IEEE 754). This is not unexpected, as a 64-bit floating-point type obviously cannot represent all integers that are representable in 64 bits.
According to the C++ Standard, it is implementation-defined whether the next-highest or next-lowest double value is used. Apparently on your system, it selects the next highest value, which seems to be 1.8446744073709552e19. You could confirm this by outputting the double with more digits of precision.
Note that this is larger than the original number.
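A sketch of how you could print it with enough digits to see the rounding (the printed value assumes IEEE-754 doubles):
#include <iomanip>
#include <iostream>
#include <limits>

int main() {
    unsigned long long ull = std::numeric_limits<unsigned long long>::max();
    double d = static_cast<double>(ull);
    std::cout << std::setprecision(20) << d << '\n';  // 18446744073709551616, i.e. 2^64,
                                                      // one more than the original ull
}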
When you convert this double to integer, the behaviour is covered by [conv.fpint]/1:
A prvalue of a floating point type can be converted to a prvalue of an integer type. The conversion truncates; that is, the fractional part is discarded. The behavior is undefined if the truncated value cannot be represented in the destination type.
So this code potentially causes undefined behaviour. When undefined behaviour has occurred, anything can happen, including (but not limited to) bogus output.
The question was originally posted with long double, rather than double. On my gcc, the long double case behaves correctly, but on OP's MSVC it gave the same error. This could be explained by gcc using 80-bit long double, but MSVC using 64-bit long double.
It's due to the double approximation of long long. Its limited precision means an error of a few thousand units at magnitudes around 10^19 (adjacent doubles there are 2048 apart); as you try to convert values around the upper limit of the unsigned long long range, the result overflows. Try converting a value 10000 lower instead :)
BTW, on Cygwin the third printed value is zero.
The problem is surprisingly simple. This is what is happening in your case:
18446744073709551615, when converted to a double, is rounded up to the nearest number that the floating-point type can represent (the closest representable number is larger).
When that's converted back to an unsigned long long, it's larger than max(). Formally, the behaviour of converting this back to an unsigned long long is undefined, but what appears to be happening in your case is a wrap-around.
The observed significantly smaller number is the result of this.
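If you need to predict and avoid the undefined conversion, one possible approach (a sketch, assuming an IEEE-754 double and a 64-bit unsigned long long) is to range-check the double before converting, since any finite non-negative double strictly below 2^64 truncates to a representable value:
#include <cmath>

// Hypothetical helper: performs the conversion only when it is well-defined.
bool to_ull_checked(double d, unsigned long long& out) {
    const double two_to_64 = std::ldexp(1.0, 64);   // exactly 2^64 as a double
    if (d >= 0.0 && d < two_to_64) {                // also rejects NaN
        out = static_cast<unsigned long long>(d);   // truncated value fits, no UB
        return true;
    }
    return false;
}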

Casting both bitwidth and signed/unsigned, which conversion is executed first?

Consider the following code:
int32_t x = -2;
cout << uint64_t(x) << endl;
The cast in the second line essentially involves two steps: the increase in width from 32 bits to 64 bits and the change of interpretation from signed to unsigned. If one compiles this with g++ and executes it, one gets 18446744073709551614. This suggests that the increase in width is processed first (as a sign extension) and the change of signed/unsigned interpretation thereafter, i.e. that the code above is equivalent to writing:
int32_t x = -2;
cout << uint64_t(int64_t(x)) << endl;
What confuses me is that one could also first interpret x as an unsigned 32-bit value and then zero-extend it to 64 bits, i.e.
int32_t x = -2;
cout << uint64_t(uint32_t(x)) << endl;
This would yield 4294967294. Would someone please confirm that the behavior of g++ is required by the standard and is not implementation-defined? I would be most excited if you could refer me to the wording in the standard that actually covers the issue at hand. I tried to find it but failed bitterly.
Thanks in advance!
You are looking for Standard section 4.7. In particular, paragraph 2 says:
If the destination type is unsigned, the resulting value is the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type).
In the given example, we have that 18446744073709551614 = -2 mod 2^64.
As said by @aschepler, standard 4.7 §2 (Integral conversions) ensures that the result will be the least unsigned integer congruent to the source integer (modulo 2^n where n is the number of bits used to represent the unsigned type).
So in your case, it will be 0xFFFFFFFFFFFFFFFE == 18446744073709551614
But this is a single conversion as specified by the standard (what the compiler actually does under the hood is out of scope).
If you want a conversion to uint32_t first and then a conversion to uint64_t, you have to write the two conversions explicitly: static_cast<uint64_t>(static_cast<uint32_t>(-2)).
Per 4.7 §2, the inner cast gives 0xFFFFFFFE = 4294967294, and since that number is already a valid uint64_t it is unchanged by the outer conversion.
What you observed is required by the standard and will be observable on any conformant compiler (provided uint32_t and uint64_t are defined, since those typedefs are optional ...)
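A small self-contained demo of the two orderings discussed in this question (the printed values assume the usual two's-complement representation):
#include <cstdint>
#include <iostream>

int main() {
    std::int32_t x = -2;
    std::cout << std::uint64_t(x) << '\n';                  // 18446744073709551614: -2 mod 2^64
    std::cout << std::uint64_t(std::uint32_t(x)) << '\n';   // 4294967294: -2 mod 2^32, then widened
}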
This is an old question, but I recently ran into this problem. I was using char, which happens to be signed on my computer. I wanted to multiply two values with:
char a, b;
uint16 ans = uint16(a) * uint16(b);
However, because of the conversion, when a < 0, the answer is wrong.
Since the signedness of char is implementation-dependent, maybe we should use uint8 instead of char whenever possible.
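A sketch of the pitfall and the unsigned fix (this assumes plain char is signed, and uses the standard std::uint8_t/std::uint16_t spellings in place of the uint8/uint16 typedefs above):
#include <cstdint>
#include <iostream>

int main() {
    char a = -3, b = 5;                                          // char assumed signed here
    std::uint16_t bad  = std::uint16_t(a) * std::uint16_t(b);    // -3 converts to 65533 first: 65521
    std::uint16_t good = std::uint8_t(a) * std::uint8_t(b);      // bytes treated as 0..255: 1265
    std::cout << bad << ' ' << good << '\n';
}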

U suffix in the variable declaration

I know that if a number is followed by a U suffix it is treated as unsigned. But why does the following program print the correct value of variable i even though it is initialized with a negative value? (Compiled with gcc 4.9.2, 4.8.2, & 4.7.1.)
Program1.cpp
#include <iostream>
int main()
{
    int i = -5U;
    std::cout << i; // prints -5 on gcc 4.9.2, 4.8.2, & 4.7.1
}
Program2.cpp
#include <iostream>
int main()
{
    auto i = -5U;
    std::cout << i; // prints large positive number as output
}
But if I use the auto keyword (the type-deduction feature introduced in C++11), it gives me a large positive number, as expected.
Please correct me if I am misunderstanding something.
-5U is not a negative literal with a U suffix. It is -(5U): the minus sign is a negation operator applied to 5U, not the first character of an integer literal.
When you negate an unsigned number, the result is equivalent to subtracting the current value from 2^n, where n is the number of bits in the integer type. That explains the second program. As for the first: when you convert an unsigned integer to a signed integer (as you are doing by assigning it to an int) and the value is out of range, the result is implementation-defined (C++20 finally defines it as wrapping modulo 2^n), and in practice* the value is simply reinterpreted as a two's-complement signed integer. Since unsigned negation happens to behave the same as two's-complement signed negation, the result is the same as if the negation had happened in a signed context.
* Note: do not let that lull you into treating signed/unsigned mix-ups as purely academic. Signed overflow in arithmetic is genuine undefined behavior; compilers can and do assume it never happens (particularly when the result is then used in a loop), and there are known instances of this assumption turning carelessly written code into buggy programs.
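A small demonstration of the pieces described above (the concrete values assume a 32-bit unsigned int and the usual two's-complement int):
#include <iostream>

int main() {
    unsigned int u = -5U;   // negation of 5U: 2^32 - 5 = 4294967291
    int i = -5U;            // out-of-range conversion back to int; in practice this yields -5
    auto a = -5U;           // a is deduced as unsigned int, so it keeps the value 4294967291
    std::cout << u << ' ' << i << ' ' << a << '\n';
}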

c++ portable conversion of long to double

I need to accurately convert a long representing bits to a double, and my solution needs to be portable to different architectures (being standard across compilers such as g++ and clang++ would be great too).
I'm writing a fast approximation of the exp function, as suggested in the answers to this question.
double fast_exp(double val)
{
    double result = 0;
    unsigned long temp = (unsigned long)(1512775 * val + 1072632447);
    /* to convert from long bits to double,
       but must check if they have the same size... */
    temp = temp << 32;
    memcpy(&result, &temp, sizeof(temp));
    return result;
}
and I'm using the suggestion found here to convert the long into a double. The issue I'm facing is that whereas I got the following results for int values in [-5, 5] under OS X with clang++ and libc++:
0.00675211846828461
0.0183005779981613
0.0504353642463684
0.132078289985657
0.37483024597168
0.971007823944092
2.7694206237793
7.30961990356445
20.3215942382812
54.8094177246094
147.902587890625
I always get 0 under Ubuntu with clang++ (3.4, same version) and libstdc++. The compiler there even tells me (through a warning) that the shift is problematic, since the long's width is less than or equal to the shift amount (suggesting that long and double probably do not have the same size there).
Am I doing something wrong, and/or is there a better way to solve the problem that is as portable as possible?
First off, using "long" isn't portable. Use the fixed length integer types found in stdint.h. This will alleviate the need to check for the same size, since you'll know what size the integer will be.
The reason you are getting a warning is that left-shifting a 32-bit integer by 32 bits is undefined behavior. See: What's bad about shifting a 32-bit variable 32 bits?
Also see this answer: Is it safe to assume sizeof(double) >= sizeof(void*)? It should be safe to assume that a double is 64 bits, and then you can use a uint64_t to store the raw bits. No need to check for sizes, and everything is portable.
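Putting that advice together, here is a sketch of the same approximation written with fixed-width types and memcpy (it assumes a 64-bit IEEE-754 double, is only meaningful for moderate inputs such as your [-5, 5] range, and should reproduce the numbers you saw on OS X):
#include <cstdint>
#include <cstring>

double fast_exp(double val)
{
    static_assert(sizeof(double) == sizeof(std::int64_t), "needs a 64-bit double");
    // Build the magic constant in the upper 32 bits of the double's bit pattern.
    std::int64_t bits = static_cast<std::int64_t>(1512775.0 * val + 1072632447.0) << 32;
    double result;
    std::memcpy(&result, &bits, sizeof result);   // reinterpret the bits as a double
    return result;
}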

Curious arithmetic error- 255x256x256x256=18446744073692774400

I encountered a strange thing when I was programming in C++. It's about a simple multiplication.
Code:
unsigned __int64 a1 = 255*256*256*256;
unsigned __int64 a2 = 255 << 24; // same as the above
cerr << "a1 is: " << a1 << endl;
cerr << "a2 is: " << a2 << endl;
Interestingly, the result is:
a1 is: 18446744073692774400
a2 is: 18446744073692774400
whereas it should be (a calculator confirms):
4278190080
Can anybody tell me how could it be possible?
255*256*256*256
All operands are int, so you are overflowing int. Overflow of a signed integer is undefined behavior in C and C++.
EDIT:
note that the expression 255 << 24 in your second declaration also invokes undefined behavior if your int type is 32-bit. 255 x (2^24) is 4278190080 which cannot be represented in a 32-bit int (the maximum value is usually 2147483647 on a 32-bit int in two's complement representation).
C and C++ both say for E1 << E2 that if E1 is of a signed type with a non-negative value and E1 x (2^E2) cannot be represented in the type of E1, the program invokes undefined behavior. Here ^ is the mathematical power operator.
Your literals are int. This means that all the operations are actually performed in int, and promptly overflow. This overflowed value, when converted to an unsigned 64-bit int, is the value you observe.
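A sketch of the usual fix: force the arithmetic to be done in a 64-bit unsigned type by making the first operand 64-bit, e.g. with a ULL suffix (std::uint64_t is the portable spelling of MSVC's unsigned __int64):
#include <cstdint>
#include <iostream>

int main() {
    std::uint64_t a1 = 255ULL * 256 * 256 * 256;   // arithmetic done in unsigned long long: 4278190080
    std::uint64_t a2 = 255ULL << 24;               // likewise well-defined: 4278190080
    std::cout << a1 << '\n' << a2 << '\n';
}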
It is perhaps worth explaining what happened to produce the number 18446744073692774400. Technically speaking, the expressions you wrote trigger "undefined behavior" and so the compiler could have produced anything as the result; however, assuming int is a 32-bit type, which it almost always is nowadays, you'll get the same "wrong" answer if you write
uint64_t x = (int) (255u*256u*256u*256u);
and that expression does not trigger undefined behavior. (The conversion from unsigned int to int involves implementation-defined behavior, but as nobody has produced a ones-complement or sign-and-magnitude CPU in many years, all implementations you are likely to encounter define it exactly the same way.) I have written the cast in C style because everything I'm saying here applies equally to C and C++.
First off, let's look at the multiplication. I'm writing the right hand side in hex because it's easier to see what's going on that way.
255u * 256u = 0x0000FF00u
255u * 256u * 256u = 0x00FF0000u
255u * 256u * 256u * 256u = 0xFF000000u (= 4278190080)
That last result, 0xFF000000u, has the highest bit of a 32-bit number set. Casting that value to a signed 32-bit type therefore causes it to become negative as-if 2^32 had been subtracted from it (that's the implementation-defined operation I mentioned above).
(int) (255u*256u*256u*256u) = 0xFF000000 = -16777216
I write the hexadecimal number there, sans u suffix, to emphasize that the bit pattern of the value does not change when you convert it to a signed type; it is only reinterpreted.
Now, when you assign -16777216 to a uint64_t variable, it is back-converted to unsigned as-if by adding 2^64. (Unlike the unsigned-to-signed conversion, this semantic is prescribed by the standard.) This does change the bit pattern, setting all of the high 32 bits of the number to 1 instead of 0 as you had expected:
(uint64_t) (int) (255u*256u*256u*256u) = 0xFFFFFFFFFF000000u
And if you write 0xFFFFFFFFFF000000 in decimal, you get 18446744073692774400.
As a closing piece of advice, whenever you get an "impossible" integer out of C or C++, try printing it in hexadecimal; it's much easier to spot the oddities of two's-complement fixed-width arithmetic that way.
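For instance, a minimal sketch of that habit applied to the value from this question:
#include <cstdint>
#include <iostream>

int main() {
    std::uint64_t mystery = 18446744073692774400ULL;   // the "impossible" result
    std::cout << std::hex << mystery << '\n';          // prints ffffffffff000000: the sign-extended
                                                       // copy of 0xff000000 is immediately visible
}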
The answer is simple: overflow.
The overflow occurred in int arithmetic, and when you assign the result to an unsigned 64-bit integer it is converted to 18446744073692774400 instead of 4278190080.