I am facing the following issue.
When I multiply two numbers, I get different results depending on the values of those numbers. I tried experimenting with the types but didn't get the expected result.
#include <stdio.h>
#include <iostream>
#include <fstream>
#include <iomanip>
#include <math.h>
int main()
{
const double value1_39 = 1.39;
const long long m_100000 = 100000;
const long long m_10000 = 10000;
const double m_10000double = 10000;
const long long longLongResult_1 = value1_39 * m_100000;
const double doubleResult_1 = value1_39 * m_100000;
const long long longLongResult_2 = value1_39 * m_10000;
const double doubleResult_2 = value1_39 * m_10000;
const long long longLongResult_3 = value1_39 * m_10000double;
const double doubleResult_3 = value1_39 * m_10000double;
std::cout << std::setprecision(6) << value1_39 << '\n';
std::cout << std::setprecision(6) << longLongResult_1 << '\n';
std::cout << std::setprecision(6) << doubleResult_1 << '\n';
std::cout << std::setprecision(6) << longLongResult_2 << '\n';
std::cout << std::setprecision(6) << doubleResult_2 << '\n';
std::cout << std::setprecision(6) << longLongResult_3 << '\n';
std::cout << std::setprecision(6) << doubleResult_3 << '\n';
return 0;
}
result seen in debugger
Variable Value
value1_39 1.3899999999999999
m_100000 100000
m_10000 10000
m_10000double 10000
longLongResult_1 139000
doubleResult_1 139000
longLongResult_2 13899
doubleResult_2 13899.999999999998
longLongResult_3 13899
doubleResult_3 13899.999999999998
result seen in cout
1.39
139000
139000
13899
13900
13899
13900
I know that the problem lies in how computers store floating-point numbers: the data is kept as fractions in base 2.
My question is: how do I get 1.39 * 10 000 as 13900? (I do get 139000 when multiplying the same value by 100000.) Is there any trick that can help me achieve this?
I have some ideas in mind, but I am not sure they are good enough:
1) parse the string to get the numbers to the left and to the right of the dot
2) multiply the number by 100 and divide by 100 when the calculation is done
Each of these solutions has its drawbacks, so I am wondering whether there is a nicer trick for this.
As the comments already said: no, there is no solution. This problem is due to floating-point numbers being stored in base 2 (as you already said). The common floating-point formats are defined in IEEE 754. A value that is not a finite sum of powers of two can't be stored exactly in base 2.
To be more specific
You CAN store:
1.25 (2^0 + 2^-2)
0.75 (2^-1 + 2^-2)
because there is an exact representation.
You CAN'T store:
1.1
1.4
because they would need an infinitely repeating fraction in base 2 (just as 1/3 does in base 10). You can try to round, or use some sort of arbitrary-precision floating-point library (but even those have their limits [memory/speed]) with a much greater precision than float, and then cast back to float after the multiplication.
There are also a lot of other related problems when it comes to floating points. You will find out that the result of 10^20 + 2 is only 10^20 because you have a fixed digit resolution (6-7 digits for float and 15-16 digits for double). When you calculate with numbers that have huge differences in magnitude the smaller ones will just "disappear".
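A minimal sketch of the two effects described above, assuming IEEE 754 double arithmetic. The helper names are made up for this illustration; the `volatile` is only there to force the sum to be rounded to a double before comparing.

```cpp
// Hypothetical helper names, used only for this illustration.

// True if adding `small` to `big` leaves `big` unchanged, i.e. the
// smaller value is completely absorbed by rounding.
bool is_absorbed(double big, double small) {
    volatile double sum = big + small;  // force rounding to double
    return sum == big;
}

// True if 0.1 + 0.2 compares equal to 0.3 in double arithmetic.
// None of the three values is exactly representable in base 2.
bool tenth_plus_fifth_is_exact() {
    volatile double sum = 0.1 + 0.2;
    return sum == 0.3;
}
```

Calling `is_absorbed(1e20, 2.0)` shows the "disappearing" addend from the 10^20 + 2 example, while an addend such as 1e5 is large enough to change the sum.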
Question: Why does multiplying 1.39 by 10^5 give 139000, while multiplying 1.39 by 10^4 does not give 13900?
This comes down to how each product is rounded to the nearest representable double. 1.39 is stored as a value slightly below 1.39 (the debugger shows 1.3899999999999999), so both exact products are slightly too small. Near 13900 the representable doubles are spaced closely enough that the nearest one to the product is 13899.999999999998, which truncates to 13899. Near 139000 the doubles are spaced more widely, and the nearest one to the product happens to be exactly 139000. (This is just one reason for this. Compiler, OS and other factors might also play a role.)
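If the goal is simply the nearest whole number, rounding instead of truncating is usually enough. The following is a small sketch, assuming the mathematically exact product is meant to be an integer that fits in a long long; `scale_and_round` is a hypothetical name, not something from the question.

```cpp
#include <cmath>

// Hypothetical helper: multiply and round to the nearest integer
// instead of truncating toward zero as a plain conversion would.
long long scale_and_round(double value, long long factor) {
    return std::llround(value * static_cast<double>(factor));
}
```

With this, `scale_and_round(1.39, 10000)` yields 13900 even though the intermediate double is 13899.999999999998.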
Related
Given a float, I want to round the result to 4 decimal places using half-even rounding, i.e., the round-half-to-even method. For example, when I have the following code snippet:
#include <iostream>
#include <iomanip>
int main(){
float x = 70.04535;
std::cout << std::fixed << std::setprecision(4) << x << std::endl;
}
The output is 70.0453, but I want it to be 70.0454. I could not find anything in the standard library; is there a function to achieve this? If not, what would a custom function look like?
If you use float, you're kind of screwed here. There is no such value as 70.04535, because it's not representable in IEEE 754 binary floating point.
Easy demonstration with Python's decimal.Decimal class, which will try to reproduce the actual float (well, Python float is a C double, but it's the same principle) value out to 30 digits of precision:
>>> import decimal
>>> decimal.Decimal(70.04535)
Decimal('70.0453499999999991132426657713949680328369140625')
So your actual value doesn't end in a 5, it ends in 49999... (the closest to 70.04535 a C double can get; C float is even less precise); even banker's rounding would round it down. If this is important to your program, you need to use an equivalent C or C++ library that matches "human" (base-10) math expectations, e.g. libmpdec (which is what Python's decimal.Decimal uses under the hood).
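The same check can be done from C++ by printing more digits than the default. This is only a sketch; `stored_digits` is a hypothetical helper, and it assumes the C library prints the exact decimal expansion of the stored value (glibc does).

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format x with `places` digits after the decimal
// point, enough to see the value that is actually stored.
std::string stored_digits(double x, int places) {
    char buf[128];
    std::snprintf(buf, sizeof buf, "%.*f", places, x);
    return std::string(buf);
}
```

Calling `stored_digits(70.04535, 20)` shows a string of 9s after 70.04534 rather than a trailing 5, and passing the float literal `70.04535f` shows that the float is even further below the intended value.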
I'm sure someone can improve this, but it gets the job done.
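#include <cmath>
#include <iostream>
// Round x to p decimal places by adding half a unit and truncating.
// This works on the stored binary value, so inputs that are not
// exactly representable can still round the "wrong" way.
double round_p( double x, int p ){
    double d = std::pow(10,p);
    return std::floor(x*d + 0.5)/d;
}
int main(){
    double x = 70.04535;
    std::cout << "value " << x << " rounded " << round_p(x,4) << std::endl;
    std::cout << "CHECK " << (round_p(x,4) == 70.0454) << std::endl;
}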
I was writing a little function to calculate the binomial coefficient using the tgamma function provided by C++. tgamma returns floating-point values, but I wanted to return an integer. Please take a look at this example program comparing three ways of converting the floating-point result back to an int:
#include <iostream>
#include <cmath>
int BinCoeffnear(int n,int k){
return std::nearbyint( std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1)) );
}
int BinCoeffcast(int n,int k){
return static_cast<int>( std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1)) );
}
int BinCoeff(int n,int k){
return (int) std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1));
}
int main()
{
int n = 7;
int k = 2;
std::cout << "Correct: " << std::tgamma(7+1) / (std::tgamma(2+1)*std::tgamma(7-2+1)); //returns 21
std::cout << " BinCoeff: " << BinCoeff(n,k); //returns 20
std::cout << " StaticCast: " << BinCoeffcast(n,k); //returns 20
std::cout << " nearby int: " << BinCoeffnear(n,k); //returns 21
return 0;
}
Why is it that, even though the calculation returns a float equal to 21, 'normal' conversion fails and only nearbyint returns the correct value? What is the nicest way to implement this?
EDIT: according to the C++ documentation, tgamma(int) returns a double.
From this std::tgamma reference:
If arg is a natural number, std::tgamma(arg) is the factorial of arg-1. Many implementations calculate the exact integer-domain factorial if the argument is a sufficiently small integer.
It seems that the compiler you're using is doing that, calculating the factorial of 7 for the expression std::tgamma(7+1).
The result might differ between compilers, and also between optimization levels. As demonstrated by Jonas there is a big difference between optimized and unoptimized builds.
The remark by @nos is on point. Note that the first line
std::cout << "Correct: " <<
std::tgamma(7+1) / (std::tgamma(2+1)*std::tgamma(7-2+1));
Prints a double value and does not perform a floating point to integer conversion.
The result of your calculation in floating point is indeed less than 21, yet this double precision value is printed by cout as 21.
On my machine (x86_64, gnu libc, g++ 4.8, optimization level 0) setting cout.precision(18) makes the results explicit.
Correct: 20.9999999999999964 BinCoeff: 20 StaticCast: 20 nearby int: 21
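The effect can be reproduced without tgamma. Below is a minimal sketch; `default_format` is a hypothetical helper that formats a double the way an unconfigured std::cout would, with the default precision of 6 significant digits.

```cpp
#include <cmath>
#include <sstream>
#include <string>

// Hypothetical helper: format v like std::cout does by default
// (6 significant digits), which rounds for display only.
std::string default_format(double v) {
    std::ostringstream out;
    out << v;
    return out.str();
}
```

For a value just below 21, such as 20.999999999999996, `default_format` produces "21", `static_cast<int>` truncates to 20, and `std::lround` gives 21.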
In this case it is practical to replace integer operations with floating-point operations, but one has to keep in mind that the result must be an integer. The function that expresses this intent is std::round.
The problem with std::nearbyint is that it may produce different results depending on the current rounding mode.
std::fesetround(FE_DOWNWARD);
std::cout << " nearby int: " << BinCoeffnear(n,k);
would return 20.
So with std::round the BinCoeff function might look like
int BinCoeffRound(int n,int k){
return static_cast<int>(
std::round(
std::tgamma(n+1) /
(std::tgamma(k+1)*std::tgamma(n-k+1))
));
}
Floating-point numbers have rounding errors associated with them. Here is a good article on the subject: What Every Computer Scientist Should Know About Floating-Point Arithmetic.
In your case the floating-point number holds a value very close but less than 21. Rules for implicit floating–integral conversions say:
The fractional part is truncated, that is, the fractional part is
discarded.
Whereas std::nearbyint:
Rounds the floating-point argument arg to an integer value in floating-point format, using the current rounding mode.
In this case the floating-point number will be exactly 21 and the following implicit conversion would return 21.
The first cout outputs 21 because of rounding that happens in cout by default. See std::setprecision.
Here's a live example.
What is the nicest way to implement this?
Use the exact integer factorial function that takes and returns unsigned int instead of tgamma.
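One way to stay in integers, sketched here with the multiplicative formula rather than explicit factorials (a swapped-in variant that overflows later than computing n! directly). The name `binomial` and the choice of unsigned long long are assumptions for illustration; very large inputs can still overflow.

```cpp
// Hypothetical helper: exact binomial coefficient C(n, k) using only
// integer arithmetic. Intermediate values can overflow for large n.
unsigned long long binomial(unsigned int n, unsigned int k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;  // use symmetry: C(n, k) == C(n, n - k)
    unsigned long long result = 1;
    for (unsigned int i = 1; i <= k; ++i) {
        // After this step result == C(n - k + i, i), so the division
        // is always exact.
        result = result * (n - k + i) / i;
    }
    return result;
}
```

Here `binomial(7, 2)` is 21 with no floating-point conversion involved.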
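The problem is in how floats are handled.
A computed result that should be 2 may be stored as something like 1.99999.
Converting to int then drops the fractional part.
So instead of converting to int immediately, first round the value up by calling the ceil function, which is declared in cmath (or math.h). Keep in mind that ceil would overshoot if the computed value came out slightly above the true integer.
With this change the code returns 21 in all cases: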
#include <iostream>
#include <cmath>
int BinCoeffnear(int n,int k){
return std::nearbyint( std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1)) );
}
int BinCoeffcast(int n,int k){
return static_cast<int>( ceil(std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1))) );
}
int BinCoeff(int n,int k){
return (int) ceil(std::tgamma(n+1) / (std::tgamma(k+1)*std::tgamma(n-k+1)));
}
int main()
{
int n = 7;
int k = 2;
std::cout << "Correct: " << (std::tgamma(7+1) / (std::tgamma(2+1)*std::tgamma(7-2+1))); //returns 21
std::cout << " BinCoeff: " << BinCoeff(n,k); //now returns 21
std::cout << " StaticCast: " << BinCoeffcast(n,k); //now returns 21
std::cout << " nearby int: " << BinCoeffnear(n,k); //returns 21
std::cout << "\n" << (int)(2.9995) << "\n";
}
I have two timestamps (in µs), which are stored as uint64_t.
My goal is to get the difference between these timestamps in ms, as a float with 2 decimals.
For example, I'd like the result to be 6.520000 ms.
However, I can't seem to get the cast to work correctly.
I've tried the following without any luck:
uint64_t diffus = ms.getT1 - ms.getT2;
float diff = static_cast<float>(diffus);
float diffMS = diff / 1000;
std::cout << diffMS << " ms" << std::endl;
I know this won't limit the float to two decimals, but I can't even get the basic conversion to work.
I seem to get the same result all the time, even though I vary T1 and T2 with
srand(time(NULL));
usleep((rand() % 25) * 1000);
The output keeps being:
1.84467e+16 ms
1.84467e+16 ms
1.84467e+16 ms
1.84467e+16 ms
What is happening, and what can I do? :-)
Best regards.
I assumed that ms.getT1 and ms.getT2 mean that T1 is earlier in time than T2.
In that case T1 - T2 would be negative, but uint64_t is unsigned, so the subtraction wraps around to a huge positive value (close to 2^64, about 1.8e19). Converting that to float and dividing by 1000 gives the 1.84467e+16 you see.
The following tests confirm my assumption:
// Force the subtraction to go below zero (it wraps around, since uint64_t is unsigned).
uint64_t diffus = 20 - 30;
float diff = static_cast<float>(diffus);
float diffMS = diff / 1000;
std::cout << diffMS << " ms" << std::endl;
// Result of casting the wrapped-around value to float.
1.84467e+16 ms
// Force diffus to be a positive number.
uint64_t diffus = 30 - 20;
float diff = static_cast<float>(diffus);
float diffMS = diff / 1000;
std::cout << diffMS << " ms" << std::endl;
// Result of casting positive integer to float.
0.01 ms
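A minimal sketch of a difference that cannot wrap, assuming the timestamps are plain std::uint64_t microsecond counts. `elapsed_ms` is a hypothetical helper, not part of the asker's code.

```cpp
#include <cstdint>

// Hypothetical helper: signed difference (end - start) in milliseconds
// between two microsecond timestamps stored as unsigned 64-bit values.
double elapsed_ms(std::uint64_t start_us, std::uint64_t end_us) {
    if (end_us >= start_us) {
        return static_cast<double>(end_us - start_us) / 1000.0;
    }
    // Subtract in the order that cannot wrap, then negate.
    return -(static_cast<double>(start_us - end_us) / 1000.0);
}
```

The branch keeps every unsigned subtraction non-negative, so swapping the two arguments only flips the sign of the result instead of producing a value near 2^64.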
A float is normally a 32-bit number with roughly 7 significant decimal digits, so consider the consequences of converting a 64-bit value like this in further applications:
uint64_t diffus = ms.getT1 - ms.getT2;
float diff = static_cast<float>(diffus);
On the other hand, a floating-point number can be printed in several ways (scientific notation, for example); that only affects how the number looks, not the value it holds.
3.1
3.14
3.14159
could all be the same number pi, printed in different formats according to the needs of the application.
If your problem is about how the float is displayed, then consider setting the precision of the cout object with std::cout.precision:
std::cout.precision(2);
std::cout << diffMS << " ms" << std::endl;
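Note that the precision alone counts significant digits; to get a fixed number of decimals it has to be combined with std::fixed. A small sketch with two hypothetical helpers that make the difference visible:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: two digits after the decimal point.
std::string format_ms_fixed(double ms) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(2) << ms << " ms";
    return out.str();
}

// Hypothetical helper: precision 2 without std::fixed, which means
// two significant digits in the default format.
std::string format_ms_default(double ms) {
    std::ostringstream out;
    out << std::setprecision(2) << ms << " ms";
    return out.str();
}
```

For 6.52 the fixed variant yields "6.52 ms", while the default-format variant yields "6.5 ms".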
What is the most efficient way to get the n leftmost non-zero digits of a floating-point number (number >= 0.0)?
For example,
if n = 1:
0.014568 -> 0.01
0.246456 -> 0.2
if n = 2:
0.014568 -> 0.014
0.246456 -> 0.24
After @schil227's comment:
Currently I am doing multiplications and divisions (by 10) as necessary in order to have the n digits at the decimal number field.
Code could use sprintf(buf, "%e",...) to do most of the heavy lifting.
There are so many corner cases that other direct code may fail; sprintf() is likely to be, at least, a good, solid reference solution.
This code prints the double to DBL_DECIMAL_DIG places to ensure there is no rounding in digits that would make a difference. Then it zeros out various digits depending on n.
See @Mark Dickinson's comment for reasons to use a greater value than DBL_DECIMAL_DIG, perhaps on the order of DBL_DECIMAL_DIG*2. As mentioned above, there are many corner cases.
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
double foo(double x, int n) {
if (!isfinite(x)) {
return x;
}
printf("%g\n", x);
char buf[DBL_DECIMAL_DIG + 11];
sprintf(buf, "%+.*e", DBL_DECIMAL_DIG, x);
//puts(buf);
assert(n >= 1 && n <= DBL_DECIMAL_DIG + 1);
memset(buf + 2 + n, '0', DBL_DECIMAL_DIG - n + 1);
//puts(buf);
char *endptr;
x = strtod(buf, &endptr);
printf("%g\n", x);
return x;
}
int main() {
foo(0.014568, 1);
foo(0.246456, 1);
foo(0.014568, 2);
foo(0.246456, 2);
return 0;
}
Output
0.014568
0.01
0.246456
0.2
0.014568
0.014
0.246456
0.24
This answer assumes OP does not want a rounded answer. Re: 0.246456 -> 0.24
If you want the result as a string, you should probably print to a string with extra precision, then chop that off yourself. (See @chux's answer for details on how much extra precision you need for IEEE 64-bit double to avoid rounding up from a string of 9s, since you want truncation but all the usual to-string functions round to nearest.)
If you want a double result, then are you sure you really want this? Rounding / truncating early in the middle of a calculation usually just worsens the accuracy of the final result. Of course, there are uses in real algorithms for floor/ceil, trunc, and nearbyint, and this is just a scaled version of trunc.
If you just want a double, you can get fairly good results without ever going to a string. Use ndigits and floor(log10(fabs(x))) to work out a scale factor, then truncate the scaled value to an integer, then scale back.
Tested and working (with and without -ffast-math). See the asm on the Godbolt compiler explorer. This might run reasonably efficiently, especially with -ffast-math -msse4.1 (so floor and trunc can inline to roundsd).
If you care about speed, look into replacing pow() with something that takes advantage of the fact that the exponent is a small integer. I'm not sure how fast library pow() implementations are in that case. GNU C __builtin_powi(x, n) trades accuracy for speed, for integer exponents, doing a multiplication tree, which is less accurate than what pow() does.
#include <float.h>
#include <math.h>
#include <stdio.h>
double truncate_n_digits(double x, int digits)
{
if (x==0 || !isfinite(x))
return x; // good idea stolen from Chux's answer :)
double l10 = log10(fabs(x));
double scale = pow(10., floor(l10) + (1 - digits)); // floor rounds towards -Inf
double scaled = x / scale;
double scaletrunc = trunc(scaled); // trunc rounds towards zero
double truncated = scaletrunc * scale;
#if 1 // debugging code
printf("%2d %24.14g =>\t%24.14g\t scale=%g, scaled=%.30g\n", digits, x, truncated, scale, scaled);
// print with more accuracy to reveal the real behaviour
printf(" %24.20g =>\t%24.20g\n", x, truncated);
#endif
return truncated;
}
test cases:
int main() {
truncate_n_digits(0.014568, 1);
truncate_n_digits(0.246456, 1);
truncate_n_digits(0.014568, 2);
truncate_n_digits(-0.246456, 2);
truncate_n_digits(1234567, 2);
truncate_n_digits(99999999999, 6);
truncate_n_digits(-99999999999, 6);
truncate_n_digits(99999, 10);
truncate_n_digits(-0.0000000001234567, 3);
truncate_n_digits(1000, 6);
truncate_n_digits(0.001, 6);
truncate_n_digits(1e-312, 2); // denormal, and not exactly representable: 9.999...e-313
truncate_n_digits(nextafter(1e-312, INFINITY), 2); // denormal, just above 1.00000e-312
return 0;
}
Each result is shown twice: first with only %.14g so rounding gives the string we want, then again with %.20g to show enough places to reveal the realities of floating-point math. Most numbers are not exactly representable, so even with perfect rounding it's impossible to return a double that exactly represents the truncated decimal string. (Integers up to about the size of the mantissa are exactly representable, and so are fractions where the denominator is a power of 2.)
1 0.014568 => 0.01 scale=0.01, scaled=1.45679999999999987281285029894
0.014567999999999999353 => 0.010000000000000000208
1 0.246456 => 0.2 scale=0.1, scaled=2.46456000000000008398615136684
0.2464560000000000084 => 0.2000000000000000111
2 0.014568 => 0.014 scale=0.001, scaled=14.5679999999999996163069226895
0.014567999999999999353 => 0.014000000000000000291
2 -0.246456 => -0.24 scale=0.01, scaled=-24.6456000000000017280399333686
-0.2464560000000000084 => -0.23999999999999999112
3 1234.56789 => 1230 scale=10, scaled=123.456789000000000555701262783
1234.567890000000034 => 1230
6 1234.56789 => 1234.56 scale=0.01, scaled=123456.789000000004307366907597
1234.567890000000034 => 1234.5599999999999454
6 99999999999 => 99999900000 scale=100000, scaled=999999.999990000040270388126373
99999999999 => 99999900000
6 -99999999999 => -99999900000 scale=100000, scaled=-999999.999990000040270388126373
-99999999999 => -99999900000
10 99999 => 99999 scale=1e-05, scaled=9999900000
99999 => 99999.000000000014552
3 -1.234567e-10 => -1.23e-10 scale=1e-12, scaled=-123.456699999999983674570103176
-1.234566999999999879e-10 => -1.2299999999999998884e-10
6 1000 => 1000 scale=0.01, scaled=100000
1000 => 1000
6 0.001 => 0.001 scale=1e-08, scaled=100000
0.0010000000000000000208 => 0.0010000000000000000208
2 9.9999999999847e-313 => 9.9999999996388e-313 scale=1e-314, scaled=100.000000003458453079474566039
9.9999999999846534143e-313 => 9.9999999996388074622e-313
2 1.0000000000034e-312 => 9.0000000001196e-313 scale=1e-313, scaled=9.9999999999011865980946822674
1.0000000000034059979e-312 => 9.0000000001195857973e-313
Since the result you want will often not be exactly representable, (and because of other rounding errors) the resulting double will sometimes be below the result you want, so printing it with full precision might give 1.19999999 instead of 1.20000011. You might want to use nextafter(result, copysign(INFINITY, original)) to get a result that's more likely to have a higher magnitude than what you want.
Of course, that could just make things worse in some cases. But since we truncate towards zero, most often we get a result that's just below (in magnitude) the unrepresentable exact value.
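As a rough sketch of the idea mentioned above of exploiting the small integer exponent instead of calling pow(): the hypothetical helper below computes a power of ten by repeated squaring. This is the same kind of multiplication tree as __builtin_powi, so it shares the accuracy caveat; results are exact only while every intermediate power of ten is exactly representable (up to 10^22 for double).

```cpp
// Hypothetical helper: 10 raised to a small integer exponent, computed
// by repeated squaring instead of calling pow().
double pow10_int(int exponent) {
    unsigned int n = exponent < 0 ? 0u - static_cast<unsigned int>(exponent)
                                  : static_cast<unsigned int>(exponent);
    double result = 1.0;
    double base = 10.0;
    while (n != 0) {
        if (n & 1u) result *= base;  // include this power of the base
        base *= base;                // 10, 100, 1e4, 1e8, ...
        n >>= 1;
    }
    // For negative exponents, divide once at the end.
    return exponent < 0 ? 1.0 / result : result;
}
```

Whether this is actually faster than the library pow() depends on the platform and would need measuring.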
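OK, another one like @Peter Cordes's, but more generic (and rounding instead of truncating).
#include <cmath>
#include <concepts>
#include <iomanip>
#include <iostream>
#include <type_traits>
using namespace std;
/** Return \c x rounded to \c digits significant digits.
\tparam T Type of number \c x; can be floating point or integral.
\param x The number.
\param digits The requested number of significant digits of number \c x.
\return The number with only \c digits significant digits of number \c x. */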
template<typename T>
requires(std::integral<T> || std::floating_point<T>)
T roundn(T x, unsigned int digits)
{
if (!x || !std::isfinite(x)) return x;
typedef std::conditional_t<std::floating_point<T>, T, double> Tp;
Tp mul = pow(10, floor(digits - log10(abs(x))));
Tp y = round(x * mul) / mul;
if constexpr (std::floating_point<T>) return y;
else return round(y);
}
int main()
{
cout << setprecision(100);
cout << roundn(123.456789, 1) << "\n";
cout << roundn(123.456789, 2) << "\n";
cout << roundn(123.456789, 3) << "\n";
cout << roundn(123.456789, 4) << "\n";
cout << roundn(123.456789, 5) << "\n";
cout << roundn(-123.456789, 1) << "\n";
cout << roundn(-123.456789, 2) << "\n";
cout << roundn(-123.456789, 3) << "\n";
cout << roundn(-123.456789, 4) << "\n";
cout << roundn(-123.456789, 5) << "\n";
cout << roundn(-123.456789, 15) << "\n";
cout << roundn(123456, 1) << "\n";
cout << roundn(123456, 2) << "\n";
cout << roundn(123456, 3) << "\n";
cout << roundn(123456, 10) << "\n";
cout << roundn(-123456, 1) << "\n";
cout << roundn(-123456, 2) << "\n";
cout << roundn(-123456, 3) << "\n";
cout << roundn(-123456, 10) << "\n";
cout << roundn(0.0123456789, 1) << "\n";
cout << roundn(0.0123456789, 2) << "\n";
cout << roundn(-0.0123456789, 1) << "\n";
cout << roundn(-0.0123456789, 2) << "\n";
return 0;
}
It returns
99.9999999999999857891452847979962825775146484375
120
123
123.5
123.4599999999999937472239253111183643341064453125
-99.9999999999999857891452847979962825775146484375
-120
-123
-123.5
-123.4599999999999937472239253111183643341064453125
-123.4567890000000005557012627832591533660888671875
100000
120000
123000
123456
-100000
-120000
-123000
-123456
0.01000000000000000020816681711721685132943093776702880859375
0.0120000000000000002498001805406602215953171253204345703125
-0.01000000000000000020816681711721685132943093776702880859375
-0.0120000000000000002498001805406602215953171253204345703125
I have not yet run it through enough tests; however, for some reason, with certain non-negative input values this function sometimes passes back a negative value. I have done a lot of manual testing in a calculator with different values, but I have yet to see it display this same behavior.
I was wondering if someone would take a look and see if I am missing something.
float calcPop(int popRand1, int popRand2, int popRand3, float pERand, float pSRand)
{
return ((((((23000 * popRand1) * popRand2) * pERand) * pSRand) * popRand3) / 8);
}
The variables all contain randomly generated values:
popRand1: between 1 and 30
popRand2: between 10 and 30
popRand3: between 50 and 100
pSRand: between 1 and 1000
pERand: between 1.0f and 5500.0f which is then multiplied by 0.001f before being passed to the function above
Edit:
Alright, so after following the execution a bit more closely, it is not the fault of this function directly. It produces an extremely large positive float, which then flips negative when I use this code later on:
pPMax = (int)pPStore;
pPStore is a float that holds calcPop's return value.
So the question now is, how do I stop the formula from doing this? Testing even with very high values in Calculator has never displayed this behavior. Is there something in how the compiler processes the order of operations that is causing this or are my values simply just going too high?
In this case it seems that when you convert back to an int after the function returns, the value can exceed the maximum value of an int. My suggestion is to use a type that can represent a greater range of values.
#include <iostream>
#include <limits>
#include <boost/multiprecision/cpp_int.hpp>
int main(int argc, char* argv[])
{
std::cout << "int min: " << std::numeric_limits<int>::min() << std::endl;
std::cout << "int max: " << std::numeric_limits<int>::max() << std::endl;
std::cout << "long min: " << std::numeric_limits<long>::min() << std::endl;
std::cout << "long max: " << std::numeric_limits<long>::max() << std::endl;
std::cout << "long long min: " << std::numeric_limits<long long>::min() << std::endl;
std::cout << "long long max: " << std::numeric_limits<long long>::max() << std::endl;
boost::multiprecision::cpp_int bigint = 113850000000;
int smallint = 113850000000;
std::cout << bigint << std::endl;
std::cout << smallint << std::endl;
std::cin.get();
return 0;
}
As you can see here, there are other types which have a bigger range. If these do not suffice I believe the latest boost version has just the thing for you.
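Throw an exception (INT_MAX and std::overflow_error need <climits> and <stdexcept>):
// With a 32-bit int, static_cast<float>(INT_MAX) rounds up to 2147483648.0f,
// which itself no longer fits in an int, hence >= rather than >.
if (pPStore >= static_cast<float>(INT_MAX)) {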
throw std::overflow_error("exceeds integer size");
} else {
pPMax = static_cast<int>(pPStore);
}
or use float instead of int.
When you multiply the maximum values of each term together you get a value around 1.42312e+12 which is somewhat larger than a 32 bit integer can hold, so let's see what the standard has to say about floating point-to-integer conversions, in 4.9/1:
A prvalue of a floating point type can be converted to a prvalue of an
integer type. The conversion truncates; that is, the fractional part
is discarded. The behavior is undefined if the truncated value cannot
be represented in the destination type.
So we learn that for a large segment of possible result values your function can generate, the conversion back to a 32 bit integer would be undefined, which includes making negative numbers.
You have a few options here. You could use a 64-bit integer type (long long, or long on platforms where it is 64 bits wide) to hold the value instead of truncating down to int.
Alternately you could scale down the results of your function by a factor of around 1000 or so, to keep the maximal results within the range of values that a 32 bit integer could hold.
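If neither option fits, another possibility is to clamp the value before converting, so the conversion never hits the undefined case. This is only a sketch: `to_int_clamped` is a hypothetical helper, mapping NaN to 0 is an arbitrary choice, and it assumes int is at most 53 bits wide so that its limits are exact as doubles.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: convert to int, clamping out-of-range values
// instead of invoking undefined behavior. NaN maps to 0 by choice.
int to_int_clamped(float value) {
    if (std::isnan(value)) return 0;
    const double v = value;  // float -> double is exact
    const double lo = static_cast<double>(std::numeric_limits<int>::min());
    const double hi = static_cast<double>(std::numeric_limits<int>::max());
    if (v <= lo) return std::numeric_limits<int>::min();
    if (v >= hi) return std::numeric_limits<int>::max();
    return static_cast<int>(v);  // in range, so truncation is well defined
}
```

With the formula's maximum of about 1.42312e+12, this returns the largest int rather than an arbitrary (possibly negative) value.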