Returning double with precision - c++

Say I have a method returning a double, but I want to control the precision after the decimal point of the value to be returned. I don't know the value of the double variable in advance.
Example:
double i = 3.365737;
return i;
I want the return value to have a precision of 3 digits after the dot.
Meaning: the return value is 3.365.
Another example:
double i = 4644.322345;
return i;
I want the return value to be: 4644.322

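What you want is truncation of decimal digits after a certain digit. You can easily do that with std::trunc from <cmath> (C++11), or with the floor function from <math.h> if the value is non-negative:
#include <cmath>

double TruncateNumber(double In, unsigned int Digits)
{
    double f = std::pow(10.0, Digits);
    // std::trunc drops the fractional part without the overflow that a
    // cast to int causes for large values.
    return std::trunc(In * f) / f;
}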
Still, I think that in some cases you may get strange results (the last digit being off by one) due to how floating point works internally.
On the other hand, most of the time you just pass the double around as is and limit its precision only when writing it to a stream, which the right stream flags do automatically (note that streams round rather than truncate).
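A minimal sketch of that formatting-on-output approach follows. The helper name format_fixed is made up for illustration; std::fixed and std::setprecision are the standard stream manipulators. The stored double is not changed, only its textual form.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: render a double with a fixed number of digits
// after the decimal point. The stream rounds; it does not truncate.
std::string format_fixed(double value, int places)
{
    std::ostringstream out;
    out << std::fixed << std::setprecision(places) << value;
    return out.str();
}
```

With this, format_fixed(3.365737, 3) yields "3.366" rather than "3.365", so it only fits if rounding is acceptable.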

You are going to need to take care with the borderline cases. Any implementation based solely on pow and casting or fmod will occasionally give wrong results, particularly so an implementation based on pow(- PRECISION).
The safest bet is to implement something that neither C nor C++ provides: a fixed point arithmetic capability. Lacking that, you will need to find the representations of the pertinent borderline cases. This question is similar to the question of how Excel does rounding. Adapting my answer there ("How does Excel successfully Rounds Floating numbers even though they are imprecise?") to this problem:
#include <cmath>
// Compute 10 to some positive integral power.
// Dealing with overflow (exponent > 308) is an exercise left to the reader.
double pow10 (unsigned int exponent) {
double result = 1.0;
double base = 10.0;
while (exponent > 0) {
if ((exponent & 1) != 0) result *= base;
exponent >>= 1;
base *= base;
}
return result;
}
// Truncate number to some precision.
// Dealing with nonsense such as nplaces=400 is an exercise left to the reader.
double truncate (double x, int nplaces) {
bool is_neg = false;
// Things will be easier if we only have to deal with positive numbers.
if (x < 0.0) {
is_neg = true;
x = -x;
}
// Construct the supposedly truncated value (round down) and the nearest
// truncated value above it.
double round_down, round_up;
if (nplaces < 0) {
double scale = pow10 (-nplaces);
round_down = std::floor (x / scale);
round_up = (round_down + 1.0) * scale;
round_down *= scale;
}
else {
double scale = pow10 (nplaces);
round_down = std::floor (x * scale);
round_up = (round_down + 1.0) / scale;
round_down /= scale;
}
// Usually the round_down value is the desired value.
// On rare occasions it is the rounded-up value that is.
// This is one of those cases where you do want to compare doubles by ==.
if (x != round_up) x = round_down;
// Correct the sign if needed.
if (is_neg) x = -x;
return x;
}
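The fixed point idea mentioned above can be sketched as follows. This assumes the value is already available as an exact integer count of millionths (for example parsed from text), not as a double; the type Micro and both function names are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Hypothetical fixed-point type: the value is stored as an integer number
// of millionths, so 3.365737 is held exactly as 3365737.
struct Micro {
    std::int64_t units;
};

// Truncate toward zero to three decimal places. Integer division is
// exact, so there are no borderline cases to special-case.
Micro truncate_to_millis(Micro m)
{
    return Micro{m.units / 1000 * 1000};
}

// Render with exactly three digits after the decimal point (truncated).
std::string to_string_3(Micro m)
{
    const std::int64_t millis = m.units / 1000;
    const std::int64_t whole = millis / 1000;
    std::int64_t frac = millis % 1000;
    if (frac < 0) frac = -frac;
    std::string digits = std::to_string(frac);
    digits.insert(0, 3 - digits.size(), '0');  // left-pad to three digits
    // std::to_string(whole) loses the sign when whole is zero.
    const bool needs_sign = (millis < 0 && whole == 0);
    return (needs_sign ? "-" : "") + std::to_string(whole) + "." + digits;
}
```

Here to_string_3(Micro{3365737}) gives "3.365" and to_string_3(Micro{4644322345}) gives "4644.322". The price is that every value entering the system has to be converted to integer units once, at the boundary.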

You cannot "remove" precision from a double. You could have: 4644.322000. It's a different number but the precision is the same.
As @David Heffernan said, do it when you convert it to a string for display.

If you want to truncate your double to n decimal places, you can use this function:
#include <cmath>
double truncate_to_places(double d, int n) {
    return d - std::fmod(d, std::pow(10.0, -n));
}

Instead of multiplying and dividing by powers of 10 like the other answers, you can use the fmod function to find the digits after the precision you want, and then subtract to remove them.
#include <math.h>
#define PRECISION 0.001
double truncate(double x) {
x -= fmod(x,PRECISION);
return x;
}

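There is no good way to do this with plain doubles, but you can write a class or simply a struct like
struct lim_prec_float {
    double value;   // double, so that brace-initializing from a double does not narrow
    int precision;  // number of significant digits to print
};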
then have your function
lim_prec_float testfn() {
double i = 3.365737;
return lim_prec_float{i, 4};
}
(4 = 1 before point + 3 after. This uses a C++11 initialization list, it would be better if lim_prec_float was a class with proper constructors.)
When you now want to output the variable, do this with a custom
std::ostream &operator<<(std::ostream &tgt, const lim_prec_float &v) {
std::stringstream s;
s << std::setprecision(v.precision) << v.value;
return (tgt << s.str());
}
Now you can, for instance,
int main() {
std::cout << testfn() << std::endl
<< lim_prec_float{4644.322345, 7} << std::endl;
return 0;
}
which will output
3.366
4644.322
This is because std::setprecision rounds to the desired number of digits, which is likely what you really want. If you actually mean truncation, you can modify the operator<< to use one of the truncation functions given in the other answers.

In the same way you format a date before displaying it, you should do the same with double.
However, here are two approaches I have used for rounding.
double roundTo3Places(double d) {
return round(d * 1000) / 1000.0;
}
double roundTo3Places(double d) {
return (long long) (d * 1000 + (d > 0 ? 0.5 : -0.5)) / 1000.0;
}
The latter is faster; however, it only works while d * 1000 fits in a long long, so the numbers cannot be larger than about 9e15.

Related

How to increase precision of std::sin function on iOS

I have a cross-platform audio application, which therefore makes heavy use of sine waves, std::sin() and the other trigonometric functions.
I noticed that particularly on the iOS platform, the precision of the std::sin() is extremely poor. I wrote the following test:
void TestSineZeroCrossings()
{
const static float kTwoPi = 6.28318530718f;
const static float epsilon = 1e-5f;
for (int ii = 0; ii < 10000; ++ii)
{
const float difference = std::abs(std::sin(kTwoPi * static_cast<float>(ii)));
if (difference > epsilon)
printf("Zero crossing fail, difference: %f\n", difference);
}
}
On Windows and Mac OS X this passes (i.e. no print-outs), but on iOS this fails on pretty much every iteration. In fact, it only succeeds with an epsilon > 0.004f. That results in clearly audible noise in my application.
Is there a way to tell the compiler to use a better implementation that's not as lossy?
I would assume the implementation is quite accurate.
Your actual problem is that kTwoPi * static_cast<float>(ii) gets rounded to the next float. E.g., for ii=10000 the value is (if I did not miscalculate): 62831.8515625
If you subtract 10000*2*pi in exact math from that you get approximately: -0.001509... And the sine of that value is approximately the same (and not 0). It is "relatively" close to zero but far away from your desired 10e-6 "accuracy".
If you want to have more accurate values for sin(x*pi), have a look at boost::math::sin_pi:
https://www.boost.org/doc/libs/1_69_0/libs/math/doc/html/math_toolkit/powers/sin_pi.html
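To check this explanation on a given platform, the following sketch compares the question's float argument with the same product kept in double. The function names are made up for illustration; the constant is the one from the question. The sine itself is evaluated in double in both cases, so any difference comes from the rounded argument alone.

```cpp
#include <cmath>

// Constant from the question, once as float and once as double.
const float  kTwoPiF = 6.28318530718f;
const double kTwoPiD = 6.28318530718;

// |sin| of the argument as the question computes it: the product is
// rounded to float before the sine is taken.
double zero_crossing_error_float(int ii)
{
    const float arg = kTwoPiF * static_cast<float>(ii);
    return std::abs(std::sin(static_cast<double>(arg)));
}

// Same computation with the argument kept in double throughout.
double zero_crossing_error_double(int ii)
{
    return std::abs(std::sin(kTwoPiD * ii));
}
```

For ii = 10000 the float variant is off by more than 1e-3, while the double variant stays well below the question's epsilon of 1e-5.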
If you want more precision, use double or long double rather than float.
For instance,
replace
const static float kTwoPi = 6.28318530718f;
const static float epsilon = 10e-6f;
with
const static double kTwoPi = 6.28318530718;
const static double epsilon = 10e-6;
and
const float difference = std::abs(std::sin(kTwoPi * static_cast<float>(ii)));
with
const double difference = std::abs(std::sin(kTwoPi * ii));
At the risk of repetition, your problem is obviously the use of float rather than double or long double.
You could verify this by doing
cout << kTwoPi << endl ;
and seeing how many digits get printed out and how they compare to your original value.
const static float kTwoPi = 6.28318530718f;
is roughly equivalent to
const static float kTwoPi = 6.283185 ;
on many (most?) systems. Your delta is way too small for a single precision value. Float is useless for most applications because of its usual lack of precision.

Doubles rounding again

In my program there are some precisions (each a positive integer; in most cases it is supposed to be of the form ) for some doubles, so that double * precision should become an integer.
But as we all know floating point numbers are inaccurate, so, for example 1.3029515 could be saved as 1.3029514999999998..., and in my program I need to write such floating point number to a file, but I want this 1.3029515 to be written instead of something like 1.3029514999999998....
Previously only precisions of the form 10^n were used in my program, and I achieved the desired result with a piece of code like the one below:
// I have a function for doubles equality check
inline bool sameDoubles(const double& lhs, const double& rhs, const double& epsilon) {
return fabs(lhs - rhs) < epsilon;
}
inline void roundDownDouble(double& value, const unsigned int& numberOfDigitsInFraction = 6) {
assert(numberOfDigitsInFraction <= 9);
double factor = pow(10.0, numberOfDigitsInFraction);
double oldValue = value;
value = (((int)(value * factor)) / factor);
// when, for example, 1.45 is stored as 1.4499999999..., we can get wrong value, so, need to do the check below
double diff = pow(10.0, 0.0 - numberOfDigitsInFraction);
if(sameDoubles(diff, fabs(oldValue - value), 1e-9)) {
value += diff;
}
};
But now I can't achieve the desired results with the same technique. I tried the function below, but have not succeeded:
// calculates logarithm of number with given base
double inline logNbase(double number, double base) {
return log(number)/log(base);
}
// sameDoubles function is the same as in above case
inline void roundDownDouble(double& value, unsigned int precision = 1e+6) {
if(sameDoubles(value, 0.0)) { value = 0; return; }
double oldValue = value;
value = ((long int)(value * precision) / (double)precision);
// when, for example, 1.45 is stored as 1.4499999999..., we can get wrong value, so, need to do the check below
int pwr = (int)(logNbase((double)precision, 10.0));
long int coeff = precision / pow(10, pwr);
double diff = coeff * pow(10, -pwr);
if(sameDoubles(diff, fabs(oldValue - value), diff / 10.0)) {
if(value > 0.0) {
value += diff;
} else {
value -= diff;
}
}
}
For 1.3029515 value and precision = 2000000 this function returns incorrect 1.302951 value (expression (long int)(value * precision) becomes equal to 2605902 instead of 2605903).
How can I fix this? Or maybe there is some smart way to make this rounding happen correctly?
You're doing your rounding the hard way. Do it the easy way instead:
double rounding = 0.5;
if (value < 0.0) rounding = -0.5;
value = ((long int)(value * precision + rounding) / (double)precision);
Now there's no need for the rest of the code.
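As an alternative sketch of the same round-to-nearest idea, std::llround from <cmath> (C++11) can replace the manual sign handling; it rounds halfway cases away from zero. The helper names scaled_units and round_to_precision are hypothetical, and the code assumes value * precision fits in a long long.

```cpp
#include <cmath>

// Hypothetical helper: the number of 1/precision steps nearest to value.
long long scaled_units(double value, unsigned int precision)
{
    return std::llround(value * precision);
}

// Round value to the nearest multiple of 1/precision.
double round_to_precision(double value, unsigned int precision)
{
    return static_cast<double>(scaled_units(value, precision)) / precision;
}
```

For the failing case from the question, scaled_units(1.3029515, 2000000) is 2605903, not 2605902.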

float value issue

I am facing a problem using float:
in a loop its value gets stuck at 8388608.00
int count=0;
long X=10;
cout.precision(flt::digits10);
cout<<"Iterration #"<<setw(15)<<"Add"<<setw(21)<<"Mult"<<endl;
float Start=0.0;
float Multiplication = Addition * N;
long i = 1;
for (i; i <= N; i++){
float temp = Start + Addition;
Start=temp;
count++;
if(count%X==0 && count!=0)
{
X*=10;
cout<<i;
cout<<fixed<<setw(30)<<Start<<setw(20)<<fixed<<i*Addition<<endl;
}
}
What should I do?
Floating point addition doesn't work when you're adding a (relatively) small number to a (relatively) big one. This is caused by the way a float is stored in memory.
You may try replacing single precision floating point (float) with double precision floating point (double), but if that doesn't work you'll probably need to implement a hack like this:
// Let's say
double OriginalAddition = 0.123;
int Addition = 1;
// You just use a basic math substitution:
// Addition = OriginalAddition
int Start = 0;
int temp = Start + Addition; // Treat the floating point value as fixed point
                             // with step 0.123, so 1 = 0.123
// And when displaying the result (transform back into the original floating point):
printf("%f", (double)temp * OriginalAddition);
This needs a lot of thought to find a substitution that doesn't cause data loss, covers the required precision and won't cause the int to overflow. Try searching for fixed point in C to get a better idea of what to do.
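The effect and the substitution can be reproduced with the sketch below. The question does not show the value of Addition, so the step of 0.5 used in the usage note is an assumption, chosen because it reproduces the stuck value 8388608; the function names are hypothetical.

```cpp
// Naive accumulation in float, as in the question's loop.
float accumulate_float(long n, float step)
{
    float sum = 0.0f;
    for (long i = 0; i < n; ++i) {
        sum += step;  // stops changing once step is at most half the spacing of floats near sum
    }
    return sum;
}

// Substitution from the answer: count the steps in an integer and
// convert to floating point once, at the end.
double accumulate_counted(long n, double step)
{
    long count = 0;
    for (long i = 0; i < n; ++i) {
        count += 1;
    }
    return static_cast<double>(count) * step;
}
```

With n = 20000000 and a step of 0.5, accumulate_float returns 8388608 (2^23, where adjacent floats are 1 apart, so adding 0.5 rounds back to the same value), while accumulate_counted returns 10000000.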

How can you convert a std::bitset<64> to a double?

Is there a way to convert a std::bitset<64> to a double without using any external library (Boost, etc.)? I am using a bitset to represent a genome in a genetic algorithm and I need a way to convert a set of bits to a double.
The C++11 road:
union Converter { uint64_t i; double d; };
double convert(std::bitset<64> const& bs) {
Converter c;
c.i = bs.to_ullong();
return c.d;
}
EDIT: As noted in the comments, we can use char* aliasing as it is unspecified instead of being undefined.
double convert(std::bitset<64> const& bs) {
static_assert(sizeof(uint64_t) == sizeof(double), "Cannot use this!");
uint64_t const u = bs.to_ullong();
double d;
// Aliases to `char*` are explicitly allowed in the Standard (and only them)
char const* cu = reinterpret_cast<char const*>(&u);
char* cd = reinterpret_cast<char*>(&d);
// Copy the bitwise representation from u to d
memcpy(cd, cu, sizeof(u));
return d;
}
C++11 is still required for to_ullong.
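A round trip is an easy way to sanity-check the conversion. The sketch below assumes double is a 64-bit IEEE 754 type; to_bits is a hypothetical inverse, and from_bits repeats the memcpy idea above so that the snippet compiles on its own.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>

// Assumption: double is 64 bits wide (IEEE 754 binary64 on most platforms).
static_assert(sizeof(std::uint64_t) == sizeof(double), "double must be 64 bits");

// Hypothetical inverse: the raw bit pattern of a double as a bitset.
std::bitset<64> to_bits(double d)
{
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return std::bitset<64>(u);
}

// Reinterpret the 64 bits as a double, copying the bytes with memcpy.
double from_bits(const std::bitset<64>& bs)
{
    const std::uint64_t u = bs.to_ullong();
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}
```

On an IEEE 754 platform, to_bits(1.0).to_ullong() is 0x3FF0000000000000 and from_bits(to_bits(x)) returns x unchanged for any non-NaN x.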
Most people are trying to provide answers that let you treat the bit-vector as though it directly contained an encoded int or double.
I would advise you completely avoid that approach. While it does "work" for some definition of working, it introduces hamming cliffs all over the place. You usually want your encoding to arrange things so that if two decoded values are near to one another, then their encoded values are near to one another as well. It also forces you to use 64-bits of precision.
I would manage the conversion manually. Say you have three variables to encode, x, y, and z. Your domain expertise can be used to say, for example, that -5 <= x < 5, 0 <= y < 100, and 0 <= z < 1, where you need 8 bits of precision for x, 12 bits for y, and 10 bits for z. This gives you a total search space of only 30 bits. You can have a 30 bit string, treat the first 8 as encoding x, the next 12 as y, and the last 10 as z. You are also free to gray code each one to remove the hamming cliffs.
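The range mapping and Gray coding described above can be sketched compactly with integer bit operations. All four function names are hypothetical, and the sketch assumes fewer than 32 bits per parameter and lo < hi.

```cpp
#include <cstdint>

// Reflected binary Gray code: neighbouring integers differ in one bit.
std::uint32_t binary_to_gray(std::uint32_t b)
{
    return b ^ (b >> 1);
}

// Inverse of binary_to_gray, computed as a prefix XOR over the bits.
std::uint32_t gray_to_binary(std::uint32_t g)
{
    std::uint32_t b = g;
    for (std::uint32_t shift = 1; shift < 32; shift <<= 1) {
        b ^= b >> shift;
    }
    return b;
}

// Map x in [lo, hi] onto an integer level in [0, 2^bits - 1].
std::uint32_t quantize(double x, double lo, double hi, unsigned bits)
{
    const double max_val = static_cast<double>((1u << bits) - 1u);
    return static_cast<std::uint32_t>((x - lo) * max_val / (hi - lo) + 0.5);
}

// Inverse mapping: integer level back to a double in [lo, hi].
double dequantize(std::uint32_t level, double lo, double hi, unsigned bits)
{
    const double max_val = static_cast<double>((1u << bits) - 1u);
    return lo + (hi - lo) * level / max_val;
}
```

For example, 7 and 8 differ in four bits in plain binary (0111 vs 1000), but their Gray codes 4 and 12 (0100 vs 1100) differ in only one, which is what removes the Hamming cliff.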
I've personally done the following in the past:
inline void binary_encoding::encode(const vector<double>& params)
{
unsigned int start=0;
for(unsigned int param=0; param<params.size(); ++param) {
// m_bpp[i] = number of bits in encoding of parameter i
unsigned int num_bits = m_bpp[param];
// map the double onto the appropriate integer range
// m_range[i] is a pair of (min, max) values for ith parameter
pair<double,double> prange=m_range[param];
double range=prange.second-prange.first;
double max_bit_val=pow(2.0,static_cast<double>(num_bits))-1;
int int_val=static_cast<int>((params[param]-prange.first)*max_bit_val/range+0.5);
// convert the integer to binary
vector<int> result(m_bpp[param]);
for(unsigned int b=0; b<num_bits; ++b) {
result[b]=int_val%2;
int_val/=2;
}
if(m_gray) {
for(unsigned int b=0; b<num_bits-1; ++b) {
result[b]=!(result[b]==result[b+1]);
}
}
// insert the bits into the correct spot in the encoding
copy(result.begin(),result.end(),m_genotype.begin()+start);
start+=num_bits;
}
}
inline void binary_encoding::decode()
{
unsigned int start = 0;
// for each parameter
for(unsigned int param=0; param<m_bpp.size(); param++) {
unsigned int num_bits = m_bpp[param];
unsigned int intval = 0;
if(m_gray) {
// convert from gray to binary
vector<int> binary(num_bits);
binary[num_bits-1] = m_genotype[start+num_bits-1];
intval = binary[num_bits-1];
for(int i=num_bits-2; i>=0; i--) {
binary[i] = !(binary[i+1] == m_genotype[start+i]);
intval += intval + binary[i];
}
}
else {
// convert from binary encoding to integer
for(int i=num_bits-1; i>=0; i--) {
intval += intval + m_genotype[start+i];
}
}
// convert from integer to double in the appropriate range
pair<double,double> prange = m_range[param];
double range = prange.second - prange.first;
double m = range / (pow(2.0,double(num_bits)) - 1.0);
// m_phenotype is a vector<double> containing all the decoded parameters
m_phenotype[param] = m * double(intval) + prange.first;
start += num_bits;
}
}
Note that for reasons that probably don't matter to you, I wasn't using bit vectors, just an ordinary vector<int> to encode things. And of course, there's a bunch of stuff tied into this code that isn't shown here, but you can probably get the basic idea.
One other note, if you're doing GPU calculations or if you have a particular problem such that 64 bits are the appropriate size anyway, it may be worth the extra overhead to stuff everything into native words. Otherwise, I would guess that the overhead you add to the search process will probably overwhelm whatever benefits you get by faster encoding and decoding.
Edit: I've decided that I was being a bit silly with this. While you do end up with a double, it assumes that the bitset holds an integer, which is a big assumption to make. You will end up with a predictable and repeatable value per bitset, but I still don't think that this is what the author intended.
Well if you iterate over the bit values and do
output_double += pow( 2, 64-(bit_position+1) ) * bit_value;
That would work. As long as it is big-endian

How to check for undefined value of a double?

I am trying to figure out a way to check for an undefined value of a slope, in which case the line would be vertical. I have tried using NULL but that doesn't seem to work.
double Point::Slope(Point &p2)
{
double slop = 0;
slop = (y - p2.y) / (x - p2.x);
if (slop == NULL)
{
slop = 10e100;
}
return slop;
}
If you mean nan ('not a number') with "undefined", you should avoid computing one in the first place, i.e. by checking that the denominator of a '/' operation is not zero. Second, you can always check for nan by
#include <cmath>
bool std::isnan(x); // since C++11
bool isnan(x); // pre C++11, from the C math library, defined as macro
see the man pages, or cppreference.
In C++, NULL == 0. This is not what you seek.
Maybe this may help you : http://www.gnu.org/s/hello/manual/libc/Infinity-and-NaN.html
Try the isnan(float) function.
I'd recommend avoiding the divide-by-zero altogether (by the way... why don't you call it slope instead of slop?):
double Point::Slope(Point&p2)
{
double slope = 0;
double xDelta = x - p2.x;
double yDelta = y - p2.y;
if (xDelta != 0)
{
slope = yDelta / xDelta;
}
return slope;
}
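One caveat with returning 0 for a vertical line is that the caller cannot tell it apart from a horizontal one. A possible variation, sketched here as a hypothetical free function and not as the Point class from the question, reports the vertical case as infinity and two identical points as NaN, both of which can be tested with std::isinf and std::isnan from <cmath>.

```cpp
#include <cmath>
#include <limits>

// Hypothetical free-function version of Point::Slope.
double slope(double x1, double y1, double x2, double y2)
{
    const double dx = x1 - x2;
    const double dy = y1 - y2;
    if (dx == 0.0) {
        // Vertical line, or two identical points: report it explicitly
        // instead of dividing by zero.
        return dy == 0.0 ? std::numeric_limits<double>::quiet_NaN()
                         : std::numeric_limits<double>::infinity();
    }
    return dy / dx;
}
```

A caller would then write something like if (std::isinf(s)) { /* vertical */ } before using the value.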