Casting `f32::MAX` to `u128` results in unexpected value

Executing this code (Playground):
println!("u128 max: {}", u128::max_value());
println!("f32 max: {}", std::f32::MAX);
println!("f32 as u128: {}", std::f32::MAX as u128);
... prints:
u128 max: 340282366920938463463374607431768211455
f32 max: 340282350000000000000000000000000000000
f32 as u128: 340282346638528859811704183484516925440
Judging from this output, we can deduce that u128::max_value() > f32::MAX and that f32::MAX is an integer (without fractional part). Indeed, Wikipedia agrees with the max values and says that f32::MAX is (2 − 2^−23) × 2^127, which is slightly less than 2^128. Given that, I would think that f32::MAX is exactly representable as u128. But as you can see, casting it via as gives an entirely different value.
Why does the result of the cast differ from the original value?
(I know that floats are very strange beasts and all. But I hope that there is an answer to this question that contains more information than "floats are strange duh")

This is due to how the float is formatted when you print it. By default, the formatter prints the shortest decimal representation that still round-trips to the same value, which here shows only 8 significant figures. Explicitly specifying the precision in the format string yields the same result for lines 2 and 3.
println!("u128 max: {}", u128::max_value());
println!("f32 max: {:.0}", std::f32::MAX);
println!("f32 as u128: {}", std::f32::MAX as u128);
Outputs:
u128 max: 340282366920938463463374607431768211455
f32 max: 340282346638528859811704183484516925440
f32 as u128: 340282346638528859811704183484516925440

How to add/sub int to a BYTE type?

I've a struct defined this way:
struct IMidiMsg {
    int mOffset;
    BYTE mStatus, mData1, mData2;
};
and I set mData2 to int values between 0 and 127 without any problem. But if I add/subtract an int:
pNoteOff->mData2 -= 150;
pNoteOff->mData2 += 150;
I get weird results. I think it's due to the different type: it's BYTE, not int, of course. (Note: BYTE is from minwindef.h.)
Let's say mData2 has the value 114: how would you first subtract 150 and then add 150, getting 114 again?
You cannot have negative values in a BYTE, as it is defined as typedef unsigned char BYTE; - it can only hold numbers between 0 and 255.
If you want to have negative values too, use signed char; that holds values between -128 and +127. Or just use a normal int. Saving bytes is rarely worth it nowadays; on 32- or 64-bit architectures it makes calculations slower if anything.
how would you first subtract 150 and then add 150, getting 114 again?
That is exactly what is supposed to happen. You start with 114, and you subtract 150 from a type capped at 255. Since 114 is less than 150, subtraction results in a "borrow", i.e. (256+114)-150=220.
When you add 150 to 220, you get 370. Now the "carry" is dropped, so you get 370-256=114.
The same mechanism is at work here as in modulo arithmetic with any other cap. For example, if you consider single-digit decimal numbers, doing something like
3-6+6=3
3-6 --> (borrow) 13-6 --> 7
7+6 --> 13 --> 3 (drop tens)
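A minimal sketch of that wraparound, assuming BYTE is typedef'd as unsigned char (as noted above):
#include <iostream>

int main() {
    unsigned char data2 = 114;         // stands in for BYTE mData2

    data2 -= 150;                      // wraps modulo 256: 114 - 150 + 256 = 220
    std::cout << int(data2) << '\n';   // prints 220

    data2 += 150;                      // wraps again: 220 + 150 - 256 = 114
    std::cout << int(data2) << '\n';   // prints 114
}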

C++: will exact division lose precision?

Q1: Will dividing an integer by its divisor lose precision?
int a = M*N, b = N; // M and N are random non-zero integers.
float c = float(a)/b;
if (c == M)
    cout << "accurate" << endl;
Q2: Will passing a float value lose precision?
float a = K; // K is a random float.
if (a == K)
    cout << "accurate" << endl;
Q1: Will dividing an integer by its divisor lose precision?
Yes. I used the following program to come up with some numbers:
#include <iostream>
#include <climits>

int main()
{
    int M = 10;
    int N = 7;
    int inaccurateCount = 0;
    for (; M < INT_MAX && inaccurateCount < 10; ++M)
    {
        int a = M*N;
        float c = float(a)/N;
        if (c != M)
        {
            std::cout << "Not accurate for M: " << M << " and N: " << N << std::endl;
            inaccurateCount++;
        }
    }
    return 0;
}
and here's the output:
Not accurate for M: 2396747 and N: 7
Not accurate for M: 2396749 and N: 7
Not accurate for M: 2396751 and N: 7
Not accurate for M: 2396753 and N: 7
Not accurate for M: 2396755 and N: 7
Not accurate for M: 2396757 and N: 7
Not accurate for M: 2396759 and N: 7
Not accurate for M: 2396761 and N: 7
Not accurate for M: 2396763 and N: 7
Not accurate for M: 2396765 and N: 7
Q2: Will passing a float value lose precision?
No, it shouldn't.
Q1: Will dividing an integer by its divisor lose precision?
You actually asked whether converting an int to a float will lose precision.
Yes, it will typically do that. On today's 32-bit (or wider) computer architectures an int stores 32 bits of data: 1 sign bit plus 31 value bits. A float also stores 32 bits of data, but these are: 1 sign bit, 8 exponent bits, and a 23-bit fractional part, cf. the IEEE 754 single-precision floating point format. (It might not lose precision on a 16-bit architecture, but I can't check that.)
Depending on the floating point number it will be stored in different representations; one is the normalized form, where the fractional part is prefixed by a hidden one, so that we get a 24-bit significand. This is less than what an int stores.
For example, the integer 01010101 01010101 01010101 01010101 (binary, spaces only for readability) cannot be expressed as a float without losing precision. In normalized form this would be 1.010101 01010101 01010101 01010101 * 2^30. So we have 30 significant binary digits after the binary point, which cannot be stored in the 23-bit fractional part without losing precision. The active rounding mode defines how the value is shortened.
Note that it does not depend on whether the value is actually "large". The integer 01000000 00000000 00000000 00000000 is in normalized form 1.000000 00000000 00000000 00000000 * 2^30. This number has zero significant bits after the binary point and can be stored without losing precision.
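As a concrete illustration, the smallest positive integer a 32-bit float cannot hold exactly is 2^24 + 1; a minimal sketch:
#include <iostream>

int main() {
    // 2^24 + 1 needs 25 significant bits, one more than float's 24-bit significand.
    int a = 16777217;
    float f = static_cast<float>(a);   // rounds to 16777216

    std::cout << a << " -> " << static_cast<int>(f) << '\n';   // prints "16777217 -> 16777216"
}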
Q2: Will passing a float value lose precision?
No.
Q1: Will dividing an integer by its divisor lose precision?
If a is too large it might lose precision; otherwise (if a is small enough to be exactly represented as a float) it will not. The loss of precision may actually happen already when you convert a. The division can also lose precision, but sometimes these losses of precision will cancel each other out.
For example, take M = 8388609 and N = 5. The (binary) significand of M is 100...001; multiplying by 101 gives 101000...0000101, but the last two bits are rounded to zero, so you get an error in (float)(M*N). When you then divide by five, you get 1000...00 with a remainder of 100, which means it rounds up one step and you get back the original number.
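A small sketch of that cancellation, using the same values (M = 8388609, N = 5):
#include <iostream>

int main() {
    int M = 8388609;           // 2^23 + 1, needs 24 significant bits
    int N = 5;

    int a = M * N;             // 41943045, needs 26 bits
    float c = float(a) / N;    // float(a) rounds down to 41943044, but the division rounds back up

    std::cout << (c == M ? "accurate" : "not accurate") << '\n';   // prints "accurate"
}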
Q2: Will passing a float value lose precision?
No, it will not lose precision. However, your code could still fail to identify it as accurate.
That can happen if K is a NaN (for example 0.0/0.0): then a will also become a NaN, but NaNs never compare equal. In this case one could argue that you lost precision, and I agree, but it's not the assignment a = K that loses precision; you already lost precision when producing K.
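A tiny sketch of that corner case:
#include <iostream>

int main() {
    float zero = 0.0f;
    float K = zero / zero;     // NaN, as in the 0.0/0.0 example above
    float a = K;               // the copy itself loses nothing...

    std::cout << std::boolalpha << (a == K) << '\n';   // ...but prints "false": NaN never compares equal
}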
It will not be exact, but to get more accurate answers you can use the value types double and long.
Case 1: Yes, it loses precision in some cases. For small values of M it will be accurate.
Case 2: No, it doesn't lose precision.

Large number received when trying to do simple float division

float taxednumber = number * (float)(20/100.0);
taxednumber = number - taxednumber;
If the input number is 135, I get taxednumber as 1966805346. Why is it becoming such a large number? I have tried many different kinds of division to fix this problem, and after trying most of them this is my final result. Inspecting the numbers showed the values as normal as far as I could see, so the problem must occur in the calculation.
EDIT: The compiler I am using is GNU GCC within Code::Blocks.
You don't show the data type of number. If you intend floating point, use floating point everywhere: 20 is an integer, 20.0 is floating point. If number is not a float, it will cause issues with both lines of code shown.
Edit: since the OP said he was using gcc, I wrote a little code.
#include <stdio.h>

int main(void)
{
    int number = 135;
    float taxednumber = number * (float)(20/100.0);

    printf("number: %d\n", number);
    printf("taxednumber: %f\n", taxednumber);
    printf("int calc: %d\n", 20/100.0);               /* %d with a double argument: wrong format, prints garbage */
    printf("float calc: %f\n", (float)(20/100.0));
    printf("float with cast calc: %f\n", (float)((float)20/100.0));
    return 0;
}
Output:
number: 135
taxednumber: 27.000000
int calc: -1717986918
float calc: 0.200000
float with cast calc: 0.200000
There is little information in the original question; it could be as simple as the OP printing the value with the wrong output format (see the "int calc" line).
Environment used: gcc on VirtualBox running Ubuntu.

Why do I get two different outputs here?

The following two pieces of code produce two different outputs.
//this one gives incorrect output
cpp_dec_float_50 x = log(2);
std::cout << std::setprecision(std::numeric_limits<cpp_dec_float_50>::digits) << x << std::endl;
The output it gives is
0.69314718055994528622676398299518041312694549560547
which is only correct up to the 15th decimal place. Had x been a double, we'd still have got the first 15 digits correct. It seems the extra precision is being lost somewhere, though I don't see why it should be: cpp_dec_float_50 is supposed to have 50 digits of precision.
//this one gives correct output
cpp_dec_float_50 x = 2;
std::cout << std::setprecision(std::numeric_limits<cpp_dec_float_50>::digits) << log(x) << std::endl;
The output it gives is
0.69314718055994530941723212145817656807550013436026
which is correct according to WolframAlpha.
When you do log(2), you're using the implementation of log in the standard library, which takes a double and returns a double, so the computation is carried out to double precision.
Only after that's computed (to, as you noted, a mere 15 digits of precision) is the result converted to your 50-digit extended precision number.
When you do:
cpp_dec_float_50 x=2;
/* ... */ log(x);
You're passing an extended precision number to start with, so (apparently) an extended precision overload of log is being selected, so it computes the result to the 50 digit precision you (apparently) want.
This is really just a complex version of:
float a = 1 / 2;
Here, 1 / 2 is integer division because the parameters are integers. It's only converted to a float to be stored in a after the result is computed.
C++ rules for how to compute a result do not depend on what you do with that result. So the actual calculation of log(2) is the same whether you store it in an int, a float, or a cpp_dec_float_50.
Your second bit of code is the equivalent of:
float b = 1;
float c = 2;
float a = b / c;
Now, you're calling / on a float, so you get floating point division. C++'s rules do take into account the types of arguments and parameters. That's complex enough, and trying to also take into account what you do with the result would make C++'s already overly complex rules incomprehensible to mere mortals.
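Put differently, the fix is to construct the extended-precision value first and only then call log, so the multiprecision overload is selected. A minimal sketch using Boost.Multiprecision:
#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iostream>
#include <iomanip>
#include <limits>

using boost::multiprecision::cpp_dec_float_50;

int main() {
    cpp_dec_float_50 two = 2;        // build the 50-digit value before taking the log
    cpp_dec_float_50 x = log(two);   // the multiprecision log overload is found via argument-dependent lookup

    std::cout << std::setprecision(std::numeric_limits<cpp_dec_float_50>::digits10)
              << x << std::endl;
}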

Summing up float number loses precision with type conversion

I tried to add two numbers with different weights. Here is my code:
void onTimeStepOp::updatePointsType1_2(boost::tuples::tuple<float,int,int,int> &_prev,
                                       boost::tuples::tuple<float,int,int,int> &_result,
                                       boost::tuples::tuple<float,float> weights)
{
    _result.get<0>() = _result.get<0>() * weights.get<0>() + _prev.get<0>() * weights.get<1>();
    std::cout << "deb:" << (float)_result.get<2>() * weights.get<0>() << " " << (float)_prev.get<2>() * weights.get<1>() << std::endl;
    _result.get<2>() = (int)((float)(_result.get<2>()) * weights.get<0>() + (float)(_prev.get<2>()) * weights.get<1>());
    std::cout << "deb2:" << (float)_result.get<3>() * weights.get<0>() << " " << (float)_prev.get<3>() * weights.get<1>() << std::endl;
    _result.get<3>() = (int)((float)(_result.get<3>()) * weights.get<0>() + (float)(_prev.get<3>()) * weights.get<1>());
}
weights.get<0>() is 0.3 and weights.get<1>() is 0.7.
The output I get looks like this:
resultBefore=36.8055 4 69 91 previousPPos=41.192 4 69 91
deb:20.7 48.3
deb2:27.3 63.7
resultAfter=39.8761 4 **68** 91
The third number should be 69 (69 * 0.3 + 69 * 0.7). However, it is 68 instead. What's the problem with the type conversion expression?
Conversion to int truncates, so the slightest rounding error could cause you to be one off. Rather than converting directly to int, you might want to use the function round.
I might add that weights.get<0> is certainly not 0.3, and weights.get<1> is certainly not 0.7, since neither 0.3 nor 0.7 are representable in machine floating point (at least not on any machine you're likely to be using).
You should use round() instead of just casting to int. Casting trims everything after the decimal point, and due to rounding error the number may be something like 68.99999999991 (just an example, but it gives the idea).
Casting to int keeps only the part before the decimal point, so 68.1..68.9 will all become 68, as written before.
Another solution, which is not so nice, is to add 0.5 to your float value before casting. Then 68.1 becomes 68.6, which still truncates to 68, but 68.5 becomes 69.0, which truncates to 69.
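A tiny sketch of the difference between truncating and rounding, using a hypothetical value just below 69:
#include <cmath>
#include <iostream>

int main() {
    // A weighted sum that is "really" 69 but carries a small downward rounding error.
    float weighted = 68.99999f;

    std::cout << "truncated: " << static_cast<int>(weighted) << '\n';               // prints 68
    std::cout << "rounded:   " << static_cast<int>(std::lround(weighted)) << '\n';  // prints 69
}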