I have a function that takes a long as an argument, and I want it to return that number as a float with seven decimals.
This long is passed to the function: 631452947, and I want the function to convert it and return this float: 63.1452947
How can I do this?
I have tried this:
float makeLatLon (long val) {
float tzt = (float)val/10000000.0;
return tzt;
}
but it does not work.
Seven digits after the comma means nine digits of precision total, and you can only expect seven digits of precision in a float on platforms where that's an IEEE 32-bit FP type (practically everywhere). Use a double:
long n = 631452947;
float f = n / 10000000.f;
double d = n / 10000000.;
std::cout << std::setprecision(9)
<< f << std::endl
<< d << std::endl;
On my box, that prints
63.1452942
63.1452947
So you see that using a float causes a round-off error.
IEEE-754 double (and its variants) cannot guarantee that seven decimals come out exact for every number either, because representable values are not evenly spaced and most decimal fractions have no exact binary representation, so double is not a good choice here if the digits must be exact.
You may want to consider building your own fixed-precision math that works with integers only, using a structure like:
typedef struct { int int_part; unsigned long dec_part; } myfloat;
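As a rough sketch of the integer-only idea, assuming the input is always the value multiplied by 10^7 (as in the question), the number can stay an integer and only be split into its integer and decimal parts when it has to be shown. The name formatScaled7 is made up for this illustration:

```cpp
#include <string>

// Hypothetical helper: formats an integer that holds a value scaled by 10^7
// (e.g. 631452947 means 63.1452947) without ever going through float/double.
std::string formatScaled7(long long val) {
    const unsigned long long scale = 10000000ULL;  // 10^7
    const bool negative = val < 0;
    // Take the magnitude in unsigned arithmetic so the most negative value is safe too.
    const unsigned long long mag =
        negative ? 0ULL - static_cast<unsigned long long>(val)
                 : static_cast<unsigned long long>(val);
    std::string frac = std::to_string(mag % scale);
    frac = std::string(7 - frac.size(), '0') + frac;  // left-pad to exactly 7 decimals
    return (negative ? "-" : "") + std::to_string(mag / scale) + "." + frac;
}
```

With this, formatScaled7(631452947) yields the text 63.1452947, and all arithmetic on the coordinate can be done on the underlying integer.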
I have a function which takes two strings(floating point) , operation and floating point bit-width:
EvaluateFloat(const string &str1, const string &str2, enum operation/*add,subtract, multiply,div*/, unsigned int bit-width, string &output)
input str1 and str2 could be float(32 bit) or double (64 bit).
Is it fine if I store the inputs in double and perform the operation in double irrespective of bit-width, and then, depending on bit-width, typecast the result to float if it was 32 bit?
e.g
double num1 = atof(str1);
double num2 = atof(str2);
double result = num1 operation num2; //! operation will resolved using switch
if(32 == bit-width)
{
float f_result = result;
output = std::to_string(f_result);
}
else
{
output = std::to_string(result);
}
Can I safely assume f_result will be exactly the same as if I had performed the operation using float type for float operations, i.e.
float f_num1 = num1;
float f_num2 = num2;
float f_result = f_num1 operation f_num2
PS:
We assume there won't be any cascaded operations, i.e. out = a + b + c
will instead be transformed to: temp = a + b, out = temp + c
I'm not concerned about inf and nan values.
I'm trying to avoid code redundancy; otherwise I have to do the same operation
twice, once for float and once for double.
C++ does not specify which formats are used for float or double. If IEEE-754 binary32 and binary64 are used, then double-rounding errors do not occur for +, -, *, /, or sqrt. Given float x and float y, the following hold (float arithmetic on the left, double on the right):
x+y = (float) ((double) x + (double) y).
x-y = (float) ((double) x - (double) y).
x*y = (float) ((double) x * (double) y).
x/y = (float) ((double) x / (double) y).
sqrt(x) = (float) sqrt((double) x).
This is per the dissertation A Rigorous Framework for Fully Supporting the IEEE Standard for Floating-Point Arithmetic in High-Level Programming Languages by Samuel A. Figueroa del Cid, January 2000, New York University. Essentially, double has so many digits (bits) beyond float that the rounding to double never conceals the information needed to round correctly to float for results of these operations. (This cannot hold for operations in general; it depends on properties of these operations.) On page 57, Figueroa del Cid gives a table showing that, if the float format has p bits, then, to avoid double rounding errors, double must have 2p+1 bits for addition or subtraction, 2p for multiplication and division, and 2p+2 for sqrt. Since binary32 has 24 bits in the significand and double has 53, these are satisfied. (See the paper for details. There are some caveats, such as that p must be at least 2 or 4 for the various operations.)
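The five identities can be spot-checked with a small helper. This is only a sketch: the name narrowingAgrees is invented here, and it assumes float is IEEE-754 binary32, double is binary64, and float expressions are evaluated in float precision (no extended-precision evaluation, no fast-math flags).

```cpp
#include <cmath>

// Hypothetical check: returns true when plain float arithmetic and
// "compute in double, then narrow to float" agree for all five operations.
// Intended for x >= 0 and y != 0, so that no NaN takes part in the comparison.
bool narrowingAgrees(float x, float y) {
    const double dx = x, dy = y;
    return x + y == static_cast<float>(dx + dy)
        && x - y == static_cast<float>(dx - dy)
        && x * y == static_cast<float>(dx * dy)
        && x / y == static_cast<float>(dx / dy)
        && std::sqrt(x) == static_cast<float>(std::sqrt(dx));
}
```

Under the stated assumptions the result should be true for every such pair; a false result would point at a platform that does not evaluate float expressions in binary32.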
According to the standard, a floating point operation on doubles is equivalent to doing the operation in infinite precision and then rounding the result to a double. If we convert it to float we have now rounded it twice. In general this is not equivalent to just rounding to a float in the first place. For example, 0.47 rounds to 0.5, which rounds to 1, but 0.47 rounds directly to 0. As mentioned by chtz, the product of two floats is always exactly representable as a double (using IEEE math where double has more than twice the precision of float), so when we cast to a float we have still only lost precision once and so the result should be the same. Likewise addition and subtraction should not be a problem.
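The decimal double-rounding example can be reproduced directly. The two function names below are made up for this sketch; it rounds in decimal rather than in binary, purely to make the effect visible:

```cpp
#include <cmath>

// Rounds to the nearest integer in a single step.
double roundOnce(double v) {
    return std::round(v);
}

// Rounds to one decimal place first, then to the nearest integer.
double roundTwice(double v) {
    const double oneDecimal = std::round(v * 10.0) / 10.0;  // 0.47 becomes 0.5
    return std::round(oneDecimal);                          // 0.5 becomes 1
}
```

For 0.47, roundOnce gives 0 while roundTwice gives 1, which is the kind of discrepancy that has to be ruled out when narrowing a double result to float.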
Division cannot be exactly represented in a double (not even 1/3), so we may think there is a problem with division. However, I have run the sample code overnight, trying over 3 trillion cases, and have not found any case where running the original divide as a double gives a different answer.
#include <cstdlib> // rand, RAND_MAX
#include <iostream>
int main() {
long i=0;
while (1) {
float x = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
float y = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
float f = x / y;
double d = (double)x / (double)y;
if(++i % 10000000 == 0) { std::cout << i << "\t" << x << "," << y << std::endl; }
if ((float(d) != f)) {
std::cout << std::endl;
std::cout << x << "," << y << std::endl;
std::cout << std::hex << *(int*)&x << "," << std::hex << *(int*)&y << std::endl;
std::cout << float(d) - f << std::endl;
return 1;
}
}
}
It's C++ code written in Visual Studio 2015, something like the following:
LPSTR endPtr; // LPSTR is already char*, so &endPtr has the char** type that strtod expects
string strDouble = "0.03456";
double valDouble = strtod(strDouble.c_str(), &endPtr);
Now the output in valDouble is "0.0345555566", something like that.
I want the value in valDouble to be exactly "0.03456".
Basically the value of "0.0345555566" needs to be rounded to say "0.03456".
Is there a way it can be achieved?
BTW, the value in strDouble changes all the time, so it's not possible to set the precision to, say, 5 or something like that. Below are a few examples of what goes into strDouble.
string strDouble = "0.1889";
string strDouble = "0.00883342";
string strDouble = "0.2111907";
string strDouble = "3.0045";
string strDouble = "1.45";
I want the value in valDouble to be exactly "0.03456".
That's not possible, unless you target a system whose double floating point representation can represent that number.
There exists no representation for 0.03456 in the ubiquitous IEEE 754 binary64 standard which your CPU probably uses. The closest representable number is 3.45600000000000004418687638008E-2. That's the number that you should get whether you use strtod, stod or a character stream to convert the string.
Is there a way it can be achieved?
In order to represent 0.03456 exactly on a system whose floating point cannot represent that number, you must use integers to represent the number. You can implement arbitrary precision arithmetic, fixed-point arithmetic or a decimal floating point using integers.
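A minimal sketch of the integer-based route, assuming plain decimal input with an optional sign and at most one decimal point. The Decimal struct and parseDecimal are hypothetical names, and overflow of the 64-bit integer is not checked:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical decimal type: the represented value is unscaled / 10^scale.
struct Decimal {
    long long unscaled;
    int scale;
};

// Parses text such as "0.03456" or "-1.45" exactly, without any floating point.
Decimal parseDecimal(const std::string& s) {
    Decimal d{0, 0};
    bool negative = false, afterPoint = false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char c = s[i];
        if (i == 0 && (c == '-' || c == '+')) { negative = (c == '-'); continue; }
        if (c == '.' && !afterPoint) { afterPoint = true; continue; }
        if (c < '0' || c > '9') throw std::invalid_argument("not a plain decimal: " + s);
        d.unscaled = d.unscaled * 10 + (c - '0');
        if (afterPoint) ++d.scale;  // one more digit behind the decimal point
    }
    if (negative) d.unscaled = -d.unscaled;
    return d;
}
```

Here "0.03456" becomes unscaled 3456 with scale 5, which holds the value exactly; arithmetic then has to be written in terms of these two integers.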
Basically the value ... needs to be rounded to say "0.03456".
You can round the output when you convert the non-exact float into a string:
std::cout << std::setprecision(4) << 0.03456;
BTW, the value in strDouble changes all the time. So it's not possible to set precision to say 5 or something like that.
Then you have to record the number of significant digits in the input string in order to use the same precision in output.
Here's an example function for that purpose:
template<class Range>
auto get_precision(const Range& r)
{
auto is_significant = [](auto c) {
return std::isdigit(c) && c != '0';
};
auto first = std::find_if(std:: begin(r), std:: end(r), is_significant);
auto last = std::find_if(std::rbegin(r), std::rend(r), is_significant).base();
return std::count_if(first, last, [](auto c) {
return std::isdigit(c);
});
}
// demo
std::cout << get_precision("0.03456"); // 4
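To show how the recorded precision could be used on output, here is a self-contained sketch. It uses its own simplified digit counter (significantDigits, a made-up name, which unlike get_precision above also counts trailing zeros) and a hypothetical formatLikeInput; exponent notation in the input is not handled.

```cpp
#include <cctype>
#include <iomanip>
#include <sstream>
#include <string>

// Counts digits starting at the first non-zero digit (simplified, hypothetical helper).
int significantDigits(const std::string& s) {
    int count = 0;
    bool seenNonZero = false;
    for (unsigned char c : s) {
        if (!std::isdigit(c)) continue;  // skip sign and decimal point
        if (c != '0') seenNonZero = true;
        if (seenNonZero) ++count;
    }
    return count > 0 ? count : 1;
}

// Converts the text to double, then prints it with as many significant digits as the input had.
std::string formatLikeInput(const std::string& s) {
    const double value = std::stod(s);
    std::ostringstream out;
    out << std::setprecision(significantDigits(s)) << value;
    return out.str();
}
```

The double in between is still inexact; the text comes back unchanged only because the output is rounded to the same number of significant digits that went in, which stays reliable up to about 15 digits.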
Assuming that you want to keep some percentage of the digits after the decimal point, you could do something like this:
Calculate the number of digits after the decimal point. Let it be n.
Now convert the string to a double just like you are doing. Let this be d.
Now if you want 50% of the decimal places to be retained, you could use an old trick:
double d_new = round(d * pow(10.0, 5)) / pow(10.0, 5), assuming precision up to 5 digits.
Note: Unlike the other answers, here you are rounding the original value itself, not just printing the rounded value to stdout.
Example:
#include<stdio.h>
#include<cmath>
int main(){
double a = 0.0345555566;
double b = 0.00883342;
double c = 0.2111907;
double a_new = round(a * pow(10.0, 5)) / pow(10.0, 5);
double b_new = round(b * pow(10.0, 4)) / pow(10.0, 4);
double c_new = round(c * pow(10.0, 3)) / pow(10.0, 3);
printf("%.10f\n", a_new);
printf("%.10f\n", b_new);
printf("%.10f\n", c_new);
}
See the 50% precision
Results:
0.0345600000
0.0088000000
0.2110000000
Use string stream instead of strtod:
#include <iostream>
#include <sstream>
#include <string>
double convert(std::string string) {
std::stringstream s(string);
double ret = 0;
s >> ret;
return ret;
}
int main() {
std::cerr << convert("0.03456") << std::endl;
std::cerr << convert("0.1889") << std::endl;
std::cerr << convert("0.00883342") << std::endl;
std::cerr << convert("0.2111907") << std::endl;
std::cerr << convert("3.0045") << std::endl;
std::cerr << convert("1.45") << std::endl;
return 0;
}
On my system, this gives:
0.03456
0.1889
0.00883342
0.211191
3.0045
1.45
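As some have pointed out in the comments, not all numbers can be represented exactly with doubles, and strictly speaking none of the ones you listed can be. They print cleanly here because the stream rounds to six significant digits by default, which is also why 0.2111907 comes out as 0.211191.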
I'm trying to improve the performance of surf.cpp. From line 140, you can find this function:
inline float calcHaarPattern( const int* origin, const SurfHF* f, int n )
{
double d = 0;
for( int k = 0; k < n; k++ )
d += (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w;
return (float)d;
}
Running an Intel Advisor Vectorization analysis, it shows that "1 Data type conversions present" which could be inefficient (especially in vectorization).
But my question is: looking at this function, why would the authors have created d as a double and then cast it to float? If they wanted a decimal number, float would be OK. The only reason that comes to my mind is that since double is more precise than float, it can represent smaller numbers, while the final value is big enough to be stored in a float, but I didn't run any tests on the value of d.
Any other possible reason?
Because the author wants higher precision during the calculation and only rounds the final result. This is the same as preserving more significant digits during a calculation.
More precisely, errors can accumulate during addition and subtraction. This error can be considerable when a large number of floating point values is involved.
You questioned the answer saying it's to use higher precision during the summation, but I don't see why. That answer is correct. Consider this simplified version with completely made-up numbers:
#include <iostream>
#include <iomanip>
float w = 0.012345;
float calcFloat(const int* origin, int n )
{
float d = 0;
for( int k = 0; k < n; k++ )
d += origin[k] * w;
return (float)d;
}
float calcDouble(const int* origin, int n )
{
double d = 0;
for( int k = 0; k < n; k++ )
d += origin[k] * w;
return (float)d;
}
int main()
{
int o[] = { 1111, 22222, 33333, 444444, 5555 };
std::cout << std::setprecision(9) << calcFloat(o, 5) << '\n';
std::cout << std::setprecision(9) << calcDouble(o, 5) << '\n';
}
The results are:
6254.77979
6254.7793
So even though the inputs are the same in both cases, you get a different result using double for the intermediate summation. Changing calcDouble to use (double)w doesn't change the output.
This suggests that the calculation of (origin[f[k].p0] + origin[f[k].p3] - origin[f[k].p1] - origin[f[k].p2])*f[k].w is high-enough precision, but the accumulation of errors during the summation is what they're trying to avoid.
This is because of how errors are propagated when working with floating point numbers. Quoting The Floating-Point Guide: Error Propagation:
In general:
Multiplication and division are “safe” operations
Addition and subtraction are dangerous, because when numbers of different magnitudes are involved, digits of the smaller-magnitude number are lost.
So you want the higher-precision type for the sum, which involves addition. Multiplying the integer by a double instead of a float doesn't matter nearly as much: you will get something that is approximately as accurate as the float value you start with (as long as the result isn't very very large or very very small). But summing float values that could have very different orders of magnitude, even when the individual numbers themselves are representable as float, will accumulate errors and deviate further and further from the true answer.
To see that in action:
float f1 = 1e4, f2 = 1e-4;
std::cout << std::setprecision(9) << (f1 + f2) << '\n'; // the default 6 digits would print 10000 for both lines
std::cout << (double(f1) + f2) << '\n';
Or equivalently, but closer to the original code:
float f1 = 1e4, f2 = 1e-4;
float f = f1;
f += f2;
double d = f1;
d += f2;
std::cout << std::setprecision(9) << f << '\n'; // the default 6 digits would print 10000 for both lines
std::cout << d << '\n';
The result is:
10000
10000.0001
Adding the two floats loses precision. Adding the float to a double gives the right answer, even though the inputs were identical. You need nine significant digits to represent the correct value, and that's too many for a float.
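The effect compounds over a loop, which is the situation in calcHaarPattern. A small sketch with made-up function names, assuming IEEE-754 float and double and float expressions evaluated in float precision:

```cpp
// Adds `term` to `start` `count` times, keeping the running sum in a float.
float accumulateInFloat(float start, float term, int count) {
    float acc = start;
    for (int i = 0; i < count; ++i) acc += term;  // each tiny term may be rounded away
    return acc;
}

// Same inputs, but the running sum is kept in a double.
double accumulateInDouble(float start, float term, int count) {
    double acc = start;
    for (int i = 0; i < count; ++i) acc += term;
    return acc;
}
```

Starting at 1e4f and adding 1e-4f ten thousand times, the float accumulator never moves away from 10000, because every single addition is rounded back to it, while the double accumulator ends up close to 10001.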
I have written the following routine, which is supposed to truncate a C++ double at the n'th decimal place.
double truncate(double number_val, int n)
{
double factor = 1;
double previous = std::trunc(number_val); // remove integer portion
number_val -= previous;
for (int i = 0; i < n; i++) {
number_val *= 10;
factor *= 10;
}
number_val = std::trunc(number_val);
number_val /= factor;
number_val += previous; // add back integer portion
return number_val;
}
Usually, this works great... but I have found that some numbers, most notably those that do not seem to have an exact representation within double, have issues.
For example, if the input is 2.0029, and I want to truncate it at the fifth place, internally, the double appears to be stored as something somewhere between 2.0028999999999999996 and 2.0028999999999999999, and truncating this at the fifth decimal place gives 2.00289, which might be right in terms of how the number is being stored, but is going to look like the wrong answer to an end user.
If I were rounding instead of truncating at the fifth decimal, everything would be fine, of course, and if I give a double whose decimal representation has more than n digits past the decimal point it works fine as well, but how do I modify this truncation routine so that inaccuracies due to imprecision in the double type and its decimal representation will not affect the result that the end user sees?
I think I may need some sort of rounding/truncation hybrid to make this work, but I'm not sure how I would write it.
Edit: thanks for the responses so far but perhaps I should clarify that this value is not producing output necessarily but this truncation operation can be part of a chain of many different user specified actions on floating point numbers. Errors that accumulate within the double precision over multiple operations are fine, but no single operation, such as truncation or rounding, should produce a result that differs from its actual ideal value by more than half of an epsilon, where epsilon is the smallest magnitude represented by the double precision with the current exponent. I am currently trying to digest the link provided by iinspectable below on floating point arithmetic to see if it will help me figure out how to do this.
Edit: well the link gave me one idea, which is sort of hacky but it should probably work which is to put a line like number_val += std::numeric_limits<double>::epsilon() right at the top of the function before I start doing anything else with it. Dunno if there is a better way, though.
Edit: I had an idea while I was on the bus today, which I haven't had a chance to thoroughly test yet, but it works by rounding the original number to 16 significant decimal digits, and then truncating that:
double truncate(double number_val, int n)
{
bool negative = false;
if (number_val == 0) {
return 0;
} else if (number_val < 0) {
number_val = -number_val;
negative = true;
}
int pre_digits = std::log10(number_val) + 1;
if (pre_digits < 17) {
int post_digits = 17 - pre_digits;
double factor = std::pow(10, post_digits);
number_val = std::round(number_val * factor) / factor;
factor = std::pow(10, n);
number_val = std::trunc(number_val * factor) / factor;
} else {
number_val = std::round(number_val);
}
if (negative) {
number_val = -number_val;
}
return number_val;
}
Since a double precision floating point number only can have about 16 digits of precision anyways, this just might work for all practical purposes, at a cost of at most only one digit of precision that the double would otherwise perhaps support.
I would like to further note that this question differs from the suggested duplicate above in that a) this is using C++, and not Java... I don't have a DecimalFormatter convenience class, and b) I am wanting to truncate, not round, the number at the given digit (within the precision limits otherwise allowed by the double datatype), and c) as I have stated before, the result of this function is not supposed to be a printable string... it is supposed to be a native floating point number that the end user of this function might choose to further manipulate. Accumulated errors over multiple operations due to imprecision in the double type are acceptable, but any single operation should appear to perform correctly to the limits of the precision of the double datatype.
OK, if I understand this right, you've got a floating point number and you want to truncate it to n digits:
10.099999
^^ n = 2
becomes
10.09
^^
But your function is truncating the number to an approximately close value:
10.08999999
^^
Which is then displayed as 10.08?
How about you keep your truncate formula, which does truncate as well as it can, and use std::setprecision and std::fixed to round the truncated value to the required number of decimal places? (Assuming it is std::cout you're using for output?)
#include <iostream>
#include <iomanip>
using std::cout;
using std::setprecision;
using std::fixed;
using std::endl;
int main() {
double foo = 10.08995; // let's imagine this is the output of `truncate`
cout << foo << endl; // displays 10.0899
cout << setprecision(2) << fixed << foo << endl; // rounds to 10.09
}
I've set up a demo on wandbox for this.
I've looked into this. It's hard because you have inaccuracies due to the floating point representation, then further inaccuracies due to the decimal. 0.1 cannot be represented exactly in binary floating point. However you can use the built-in function sprintf with a %g argument that should round accurately for you.
char out[64];
double x = 0.11111111;
int n = 3;
double xrounded;
sprintf(out, "%.*g", n, x);
xrounded = strtod(out, 0);
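The same round trip through text can be bent to truncate rather than round. The following is only a sketch under assumptions: truncateDecimal is a made-up name, n is taken to be non-negative, the C locale is assumed to use '.' as the decimal point, and the value is first rendered with 15 significant digits so that an input like 2.0029 shows up as 2.00290000000000 before any digits are cut.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: truncates x after n decimals by cutting its decimal text.
double truncateDecimal(double x, int n) {
    if (x == 0.0 || !std::isfinite(x)) return x;
    // Number of digits in front of the decimal point (at least one).
    const int intDigits =
        std::max(1, static_cast<int>(std::floor(std::log10(std::fabs(x)))) + 1);
    const int fracDigits = 15 - intDigits;  // keep 15 significant digits in total
    if (fracDigits <= n) return x;          // no trustworthy digits beyond the n-th decimal
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f", fracDigits, x);  // rounds away representation noise
    std::string text(buf);
    const std::size_t point = text.find('.');
    text.erase(point + 1 + n);              // drop every digit after the n-th decimal
    return std::strtod(text.c_str(), nullptr);
}
```

With this, truncating 2.0029 at the fifth decimal gives back 2.0029, while 2.002999 becomes 2.00299. The price is the one the question already accepts: digits beyond the 15th significant one are treated as noise.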
Get double as a string
If you are looking just to print the output, then it is very easy and straightforward using stringstream:
#include <cmath>
#include <iostream>
#include <iomanip>
#include <limits>
#include <sstream>
using namespace std;
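string truncateAsString(double n, int precision) {
stringstream ss;
// precision is a power of ten, so its trailing zero bits (GCC/Clang builtin) equal its decimal zeroes
int digits = __builtin_ctz(precision);
int remainder = static_cast<int>(floor((n - floor(n)) * precision)) % precision;
ss << setprecision(numeric_limits<double>::max_digits10 + digits) << floor(n);
if (remainder)
ss << "." << setw(digits) << setfill('0') << remainder; // zero-pad so .059 is not printed as .59
cout << ss.str() << endl;
return ss.str();
}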
int main(void) {
double a = 9636346.59235;
int precision = 1000; // as many digits as you add zeroes. 3 zeroes means precision of 3.
string s = truncateAsString(a, precision);
return 0;
}
Getting the divided floating point with an exact value
If you are looking for the true value of your floating point number, you can use the Boost.Multiprecision library:
The Boost.Multiprecision library can be used for computations requiring precision exceeding that of standard built-in types such as float, double and long double. For extended-precision calculations, Boost.Multiprecision supplies a template data type called cpp_dec_float. The number of decimal digits of precision is fixed at compile-time via template parameter.
Demonstration
#include <boost/math/constants/constants.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iostream>
#include <limits>
#include <cmath>
#include <iomanip>
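using boost::multiprecision::cpp_dec_float_50;
using std::cout; using std::endl; using std::setprecision; using std::numeric_limits; // used unqualified below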
cpp_dec_float_50 truncate(cpp_dec_float_50 n, int precision) {
cpp_dec_float_50 remainder = static_cast<cpp_dec_float_50>((int)floor((n - floor(n)) * precision) % precision) / static_cast<cpp_dec_float_50>(precision);
return floor(n) + remainder;
}
int main(void) {
int precision = 100000; // as many digits as you add zeroes. 5 zeroes means precision of 5.
cpp_dec_float_50 n = 9636346.59235789;
n = truncate(n, precision); // first part is remainder, floor(n) is int value truncated.
cout << setprecision(numeric_limits<cpp_dec_float_50> ::max_digits10 + __builtin_ctz(precision)) << n << endl; // __builtin_ctz(precision) will equal the number of trailing 0, exactly the precision we need!
return 0;
}
Output:
9636346.59235
NB: Requires sudo apt-get install libboost-all-dev
I found two ways of converting a number from any base to base 10. The first one is the usual one taught in college, like 521 (base 15) ---> (5*15^2)+(2*15^1)+(1*15^0) = 1125+30+1 = 1156 (base 10). My problem is that I applied both methods to the number 1023456789ABCDE (base 15), but I am getting different results. Google Code Jam accepts only the value generated by the second method for this particular number. For all other cases both methods generate the same result. What is the big deal with this special number? Can anybody suggest something?
#include <iostream>
#include <math.h>
using namespace std;
int main()
{ //number in base 15 is 1023456789ABCDE
int value[15]={1,0,2,3,4,5,6,7,8,9,10,11,12,13,14};
int base =15;
unsigned long long sum=0;
for (int i=0;i<15;i++)
{
sum+=(pow(base,i)*value[14-i]);
}
cout << sum << endl;
//this prints 29480883458974408
sum=0;
for (int i=0;i<15;i++)
{
sum=(sum*base)+value[i];
}
cout << sum << endl;
//this prints 29480883458974409
return 0;
}
Consider using std::stol to convert a string into a long.
It lets you choose the base to use; here is an example for your number with base 15.
int main()
{
std::string s = "1023456789ABCDE";
long n = std::stol(s,0,15);
std::cout<< s<<" in base 15: "<<n<<std::endl;
// -> 1023456789ABCDE in base 15: 29480883458974409
}
pow(base, i) uses floating point, so you lose some precision on some numbers.
Exceeded double precision.
The double returned by pow() is precise to at least DBL_DIG significant decimal digits. DBL_DIG is at least 10 and is typically 15 for IEEE 754 double-precision binary.
The desired number 29480883458974409 is 17 digits, so some calculation error should be expected.
In particular, sum += pow(base,i)*value[14-i] is done as unsigned long long = unsigned long long + (double * int), which results in unsigned long long = double. The nearest double to 29480883458974409 is 29480883458974408. So it is not an imprecise value from pow() that causes the issue here, but an imprecise sum from the addition.
@Mooing Duck in a comment references code to avoid using pow() and its double limitation. Following is a slight variant.
unsigned long long ullongpow(unsigned value, unsigned exp) {
unsigned long long result = !!value;
while (exp-- > 0) {
result *= value;
}
return result;
}
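For completeness, the question's second loop can be wrapped into an integer-only conversion that never touches double. This is a sketch with a made-up name (parseInBase), assuming digits 0-9 and letters A-Z in the default C locale, with an explicit overflow check added:

```cpp
#include <cctype>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts digit text in the given base (2..36) using integers only.
unsigned long long parseInBase(const std::string& digits, unsigned base) {
    if (base < 2 || base > 36) throw std::invalid_argument("base must be in [2, 36]");
    const unsigned long long maxValue = std::numeric_limits<unsigned long long>::max();
    unsigned long long sum = 0;
    for (unsigned char c : digits) {
        unsigned digit;
        if (std::isdigit(c)) digit = c - '0';
        else if (std::isalpha(c)) digit = std::toupper(c) - 'A' + 10;
        else throw std::invalid_argument("unexpected character");
        if (digit >= base) throw std::invalid_argument("digit out of range for base");
        // sum * base + digit must not exceed maxValue.
        if (sum > (maxValue - digit) / base) throw std::overflow_error("value does not fit");
        sum = sum * base + digit;  // Horner step, same as the question's second loop
    }
    return sum;
}
```

Called with "1023456789ABCDE" and base 15, this returns 29480883458974409, the value the second method in the question produces.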