Rounding to use for int -> float -> int round-trip conversion - C++

I'm writing a set of numeric type conversion functions for a database engine, and I'm concerned about the behavior of converting large integral floating-point values to integer types with greater precision.
Take for example converting a 32-bit int to a 32-bit single-precision float. The 23-bit significand of the float yields about 7 decimal digits of precision, so converting any int with more than about 7 digits will result in a loss of precision (which is fine and expected). However, when you convert such a float back to an int, you end up with artifacts of its binary representation in the low-order digits:
#include <iostream>
#include <iomanip>

using namespace std;

int main()
{
    int a = 2147483000;
    cout << a << endl;

    float f = (float)a;
    cout << setprecision(10) << f << endl;

    int b = (int)f;
    cout << b << endl;
    return 0;
}
This prints:
2147483000
2147483008
2147483008
The trailing 008 is beyond the precision of the float, and therefore seems undesirable to retain in the int, since in a database application, users are primarily concerned with decimal representation, and trailing 0's are used to indicate insignificant digits.
So my questions are: Are there any well-known existing systems that perform decimal significant digit rounding in float -> int (or double -> long long) conversions, and are there any well-known, efficient algorithms for doing so?
(Note: I'm aware that some systems have decimal floating-point types, such as those defined by IEEE 754-2008. However, they don't have mainstream hardware support and aren't built into C/C++. I might want to support them down the road, but I still need to handle binary floats intuitively.)

std::numeric_limits<float>::digits10 says you only get 6 precise digits for float.
Pick an efficient algorithm for your language, processor, and data distribution to calculate the decimal length of an integer. Then subtract the number of digits that digits10 says are precise to get the number of digits to cull. Use that as an index to look up a power of 10 to use as a modulus for rounding, as sketched below.
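A minimal sketch of that recipe (my helper names, not from any library; it assumes an IEEE 754 float, a 32-bit int, and rounds the culled digits half away from zero):

#include <limits>

// Decimal length of an int (simple loop; swap in one of the faster
// techniques mentioned above if this is on a hot path).
static int decimal_length(int v)
{
    unsigned u = (v < 0) ? 0u - static_cast<unsigned>(v) : static_cast<unsigned>(v);
    int len = 1;
    while (u >= 10) { u /= 10; ++len; }
    return len;
}

// Zero out (by rounding) every digit beyond the float's precise range.
int cull_to_float_digits(int v)
{
    const int precise = std::numeric_limits<float>::digits10; // 6 for IEEE 754 float
    int cull = decimal_length(v) - precise;
    if (cull <= 0)
        return v; // already within the precise range
    static const long long pow10[] = { 1, 10, 100, 1000, 10000 };
    long long m = pow10[cull];                    // cull <= 4 for a 32-bit int
    long long x = v;
    long long half = (x >= 0) ? m / 2 : -(m / 2); // round half away from zero
    return static_cast<int>((x + half) / m * m);
}

For the question's example, cull_to_float_digits(2147483008) returns 2147480000: digits10 promises only six precise digits, so the culled span covers the artifact digits (and the seventh digit along with them).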
One concern: Let's say you convert a float to a decimal and perform this sort of rounding or truncation. Then convert that "adjusted" decimal to a float and back to a decimal with the same rounding/truncation scheme. Do you get the same decimal value? Hopefully yes.
This isn't really what you're looking for but may be interesting reading: A Proposal to add a max significant decimal digits value to the C++ Standard Library Numeric limits

Naturally, 2147483008 has trailing zeros if you write it in binary (1111111111111111111110110000000) or hexadecimal (0x7FFFFD80). The most "correct" thing to do would be to track insignificant digits in any of those forms instead.
Alternatively, you could just zero all digits after the first seven significant ones in the int (ideally by rounding) after converting to it from a float, since the float contains approximately seven significant digits.

Related

Are doubles able to represent every int64_t value? [duplicate]

This question already has answers here:
Representing integers in doubles
(4 answers)
Closed 5 years ago.
My question is whether all integer values are guaranteed to have a perfect double representation.
Consider the following code sample that prints "Same":
// Example program
#include <iostream>
#include <string>

int main()
{
    int a = 3;
    int b = 4;
    double d_a(a);
    double d_b(b);
    double int_sum = a + b;
    double d_sum = d_a + d_b;
    if (double(int_sum) == d_sum)
    {
        std::cout << "Same" << std::endl;
    }
}
Is this guaranteed to be true for any architecture, any compiler, any values of a and b? Will any integer i, converted to double, always be represented as i.0000000000000 and not, for example, as i.000000000001?
I tried it for some other numbers and it always was true, but was unable to find anything about whether this is coincidence or by design.
Note: This is different from this question (aside from the language) since I am adding the two integers.
Disclaimer (as suggested by Toby Speight): Although IEEE 754 representations are quite common, an implementation is permitted to use any other representation that satisfies the requirements of the language.
The doubles are represented in the form mantissa * 2^exponent, i.e. some of the bits are used for the non-integer part of the double number.
            bits  range                  precision
float         32  1.5E-45 .. 3.4E38       7-8  digits
double        64  5.0E-324 .. 1.7E308    15-16 digits
long double   80  1.9E-4951 .. 1.1E4932  19-20 digits
The fractional part can also be used to represent an integer by using an exponent that removes all the digits after the dot.
E.g. 2.9979 · 10^4 = 29979.
Since int is usually 32 bits, you can represent all ints as double; for 64-bit integers, of course, this is no longer true. To be more precise (as LThode noted in a comment): IEEE 754 double precision can guarantee this for up to 53 bits (52 bits of significand + the implicit leading 1 bit).
Answer: yes for 32 bit ints, no for 64 bit ints.
(This is correct for server/desktop general-purpose CPU environments, but other architectures may behave differently.)
Practical answer, as Malcom McLean puts it: 64-bit doubles are an adequate integer type for almost all integers that are likely to count things in real life.
For the empirically inclined, try this:
#include <iostream>
#include <limits>

using namespace std;

int main() {
    double test;
    volatile int test_int;
    for (int i = 0; i < std::numeric_limits<int>::max(); i++) {
        test = i;
        test_int = test;
        // compare int with int:
        if (test_int != i)
            std::cout << "found integer i=" << i << ", test=" << test << std::endl;
    }
    return 0;
}
(Run on an online compiler: success, time 0.85 s; it completes without printing any mismatch.)
Subquestion:
Regarding the question about fractional differences: is it possible to have an integer which converts to a double that is just off the correct value by a fraction, but which converts back to the same integer due to rounding?
The answer is no, because any integer which converts back and forth to the same value actually represents the same integer value in double. For me the simplest explanation (suggested by ilkkachu) is that with the form mantissa * 2^exponent, the step width between adjacent doubles must always be a power of two. Therefore, beyond the largest 53-bit integer (52 explicit significand bits plus the implicit leading bit), there are never two double values with a distance smaller than 2, which removes the rounding issue.
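A quick empirical check of that step-width claim, assuming IEEE 754 doubles:

#include <cmath>
#include <iostream>

int main() {
    double big = std::ldexp(1.0, 53); // 2^53, just past the exactly-representable range
    std::cout << (big == big + 1.0) << '\n';               // 1: 2^53 + 1 rounds back to 2^53
    std::cout << std::nextafter(big, 1e300) - big << '\n'; // 2: the step width above 2^53
    return 0;
}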
No. Suppose you have a 64-bit integer type and a 64-bit floating-point type (which is typical for a double). There are 2^64 possible values for that integer type and there are 2^64 possible values for that floating-point type. But some of those floating-point values (in fact, most of them) do not represent integer values, so the floating-point type can represent fewer integer values than the integer type can.
The answer is no. This only works if ints are 32 bit, which, while true on most platforms, isn't guaranteed by the standard.
The two integers can share the same double representation.
For example, this
#include <cstdint>
#include <iostream>

int main() {
    int64_t n = 2397083434877565865;
    if (static_cast<double>(n) == static_cast<double>(n - 1)) {
        std::cout << "n and (n-1) share the same double representation\n";
    }
}
will print
n and (n-1) share the same double representation
I.e. both 2397083434877565865 and 2397083434877565864 will convert to the same double.
Note that I used int64_t here to guarantee 64-bit integers, which - depending on your platform - might also be what int is.
You have 2 different questions:
Are all integer values perfectly represented as doubles?
That was already answered by other people (TL;DR: it depends on the precision of int and double).
Consider the following code sample that prints "Same": [...] Is this guaranteed to be true for any architecture, any compiler, any values of a and b?
Your code adds two ints and then converts the result to double. The sum of ints will overflow for certain values (which is undefined behaviour for signed types), while the sum of the two separately-converted doubles will not overflow; it rounds instead. For those values the results will differ.
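Even without overflow the two routes can disagree once the integers are 64-bit, because each separate conversion rounds. A sketch, assuming IEEE 754 doubles:

#include <cstdint>
#include <iostream>

int main() {
    int64_t a = (int64_t(1) << 53) + 1;   // 2^53 + 1: not exactly representable as double
    int64_t b = 1;
    int64_t int_sum = a + b;              // 2^53 + 2: exactly representable (even)
    double d_sum = double(a) + double(b); // double(a) rounds down to 2^53; the sum rounds there too
    std::cout << (double(int_sum) == d_sum ? "Same" : "Different") << '\n'; // prints Different
}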
The short answer is "possibly". The portable answer is "not everywhere".
It really depends on your platform, and in particular, on
the size and representation of double
the range of int
For platforms using IEEE-754 doubles, it may be true if int is 53-bit or smaller. For platforms where int is wider than double's significand, it's obviously false.
You may be able to investigate the properties of your runtime host using std::numeric_limits and std::nextafter.


How to shift a floating-point value to the nearest one that can be represented exactly in a specific number of decimal places?

Is there an algorithm in C++ that will allow me to, given a floating-point value V of type T (e.g. double or float), find the closest value to V in a given direction (up or down) that can be represented exactly in less than or equal to a specified number of decimal places D?
For example, given
T = double
V = 670000.08267799998
D = 6
For direction = towards +inf I would like the result to be 670000.082678, and for direction = towards -inf I would like the result to be 670000.082677
This is somewhat similar to std::nexttoward(), but with the restriction that the 'next' value needs to be exactly representable using at most D decimal places.
I've considered a naive solution involving separating out the fractional portion and scaling it by 10^D, truncating it, and scaling it again by 10^-D and tacking it back onto the whole number portion, but I don't believe that guarantees that the resulting value will be exactly representable in the underlying type.
I'm hopeful that there's a way to do this properly, but so far I've been unable to find one.
Edit: I think my original explanation didn't properly convey my requirements. At the suggestion of @Patricia Shanahan I'll try describing my higher-level goal and then reformulate the problem a little differently in that context.
At the highest level, the reason I need this routine is due to some business logic wherein I must take in a double value K and a percentage P, split it into two double components V1 and V2 where V1 ~= P percent of K and V1 + V2 ~= K. The catch is that V1 is used in further calculations before being sent to a 3rd party over a wire protocol that accepts floating-point values in string format with a max of D decimal places. Because the value sent to the 3rd party (in string format) needs to be reconcilable with the results of the calculations made using V1 (in double format), I need to "adjust" V1 using some function F() so that it is as close as possible to being P percent of K while still being exactly representable in string format using at most D decimal places. V2 has none of the restrictions of V1, and can be calculated as V2 = K - F(V1) (it is understood and acceptable that this may result in V2 such that V1 + V2 is very close to but not exactly equal to K).
At the lower level, I'm looking to write that routine to 'adjust' V1 as something with the following signature:
double F(double V, unsigned int D, bool roundUpIfTrueElseDown);
where the output is computed by taking V and (if necessary, and in the direction specified by the bool param) rounding it to the Dth decimal place.
My expectation would be that when V is serialized out as follows
const auto maxD = std::numeric_limits<double>::digits10;
assert(D <= maxD); // D will be less than maxD... e.g. typically 1-6, definitely <= 13
std::cout << std::fixed
<< std::setprecision(maxD)
<< F(V, D, true);
then the output contains only zeros beyond the Dth decimal place.
It's important to note that, for performance reasons, I am looking for an implementation of F() that does not involve conversion back and forth between double and string format. Though the output may eventually be converted to a string format, in many cases the logic will early-out before this is necessary and I would like to avoid the overhead in that case.
This is a sketch of a program that does what is requested. It is presented mainly to find out whether that is really what is wanted. I wrote it in Java, because that language has some guarantees about floating point arithmetic on which I wanted to depend. I only use BigDecimal to get exact display of doubles, to show that the answers are exactly representable with no more than D digits after the decimal point.
Specifically, I depended on double behaving according to IEEE 754 64-bit binary arithmetic. That is likely, but not guaranteed by the standard, for C++. I also depended on Math.pow being exact for simple exact cases, on exactness of division by a power of two, and on being able to get exact output using BigDecimal.
I have not handled edge cases. The big missing piece is dealing with large magnitude numbers with large D. I am assuming that the bracketing binary fractions are exactly representable as doubles. If they have more than 53 significant bits that will not be the case. It also needs code to deal with infinities and NaNs. The assumption of exactness of division by a power of two is incorrect for subnormal numbers. If you need your code to handle them, you will have to put in corrections.
It is based on the concept that a number that is both exactly representable as a decimal with no more than D digits after the decimal point and is exactly representable as a binary fraction must be representable as a fraction with denominator 2 raised to the D power. If it needs a higher power of 2 in the denominator, it will need more than D digits after the decimal point in its decimal form. If it cannot be represented at all as a fraction with a power-of-two denominator, it cannot be represented exactly as a double.
Although I ran some other cases for illustration, the key output is:
670000.082678 to 6 digits Up: 670000.09375 Down: 670000.078125
Here is the program:
import java.math.BigDecimal;

public class Test {
    public static void main(String args[]) {
        testIt(2, 0.000001);
        testIt(10, 0.000001);
        testIt(6, 670000.08267799998);
    }

    private static void testIt(int d, double in) {
        System.out.print(in + " to " + d + " digits");
        System.out.print(" Up: " + new BigDecimal(roundUpExact(d, in)).toString());
        System.out.println(" Down: " + new BigDecimal(roundDownExact(d, in)).toString());
    }

    public static double roundUpExact(int d, double in) {
        double factor = Math.pow(2, d);
        double roundee = factor * in;
        roundee = Math.ceil(roundee);
        return roundee / factor;
    }

    public static double roundDownExact(int d, double in) {
        double factor = Math.pow(2, d);
        double roundee = factor * in;
        roundee = Math.floor(roundee);
        return roundee / factor;
    }
}
In general, decimal fractions are not precisely representable as binary fractions, although there are exceptions, like 0.5 (½) and 16.375 (16⅜). Going the other way, every binary fraction is precisely representable as a decimal fraction. (That's because 2 is a factor of 10, but 10 is not a factor of any power of two.) If a number is not a multiple of some (possibly negative) power of 2, its binary representation will be an infinitely-long cyclic sequence, like the representation of ⅓ in decimal (.333...).
The standard C library provides the macro DBL_DIG (normally 15); any decimal number with that many decimal digits of precision can be converted to a double (for example, with scanf) and then converted back to a decimal representation (for example, with printf). To go in the opposite direction without losing information -- start with a double, convert it to decimal and then convert it back -- you need 17 decimal digits (DBL_DECIMAL_DIG). (The values I quote are based on IEEE-754 64-bit doubles).
One way to provide something close to the question would be to consider a decimal number with no more than DBL_DIG digits of precision to be an "exact-but-not-really-exact" representation of a floating point number if that floating point number is the floating point number which comes closest to the value of the decimal number. One way to find that floating point number would be to use scanf or strtod to convert the decimal number to a floating point number, and then try the floating point numbers in the vicinity (using nextafter to explore) to find which ones convert to the same representation with DBL_DIG digits of precision.
If you trust the standard library implementation to not be too far off, you could convert your double to a decimal number using sprintf, increment the decimal string at the desired digit position (which is just a string operation), and then convert it back to a double with strtod.
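A sketch of that string-based approach (not the author's code; it assumes a correctly-rounded "%.*f" and strtod, and that |V| is small enough for the Dth decimal place to be meaningful at its magnitude):

#include <cmath>
#include <cstdio>
#include <cstdlib>

// Round v to the closest double representing a D-digit decimal, in the
// requested direction.
double round_decimal(double v, int d, bool round_up)
{
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*f", d, v); // nearest D-digit decimal
    double r = std::strtod(buf, nullptr);
    double step = std::pow(10.0, -d);             // one unit in the Dth decimal place
    if (round_up ? (r < v) : (r > v)) {
        // The nearest decimal fell on the wrong side of v: move one decimal
        // step and re-parse, so r is again the closest double to a D-digit
        // decimal string.
        std::snprintf(buf, sizeof buf, "%.*f", d, round_up ? r + step : r - step);
        r = std::strtod(buf, nullptr);
    }
    return r;
}

As the answers above note, the result is the double closest to a D-digit decimal string, which is generally the best a binary type can do.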
Total re-write.
Based on OP's new requirement, and using power-of-2 scaling as suggested by @Patricia Shanahan, a simple C solution:
double roundedV = ldexp(round(ldexp(V, D)),-D); // for nearest
double roundedV = ldexp(ceil (ldexp(V, D)),-D); // at or just greater
double roundedV = ldexp(floor(ldexp(V, D)),-D); // at or just less
The only thing added here beyond @Patricia Shanahan's fine solution is C code to match OP's tag.
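A quick check, assuming IEEE 754 doubles, reproduces the key output of the Java program above:

#include <cmath>
#include <cstdio>

int main() {
    double V = 670000.08267799998;
    int D = 6;
    std::printf("Up:   %.17g\n", std::ldexp(std::ceil(std::ldexp(V, D)), -D));  // 670000.09375
    std::printf("Down: %.17g\n", std::ldexp(std::floor(std::ldexp(V, D)), -D)); // 670000.078125
}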
In C++ integers must be represented in binary, but floating point types can have a decimal representation.
If FLT_RADIX from <limits.h> is 10, or some multiple of 10, then your goal of exact representation of decimal values is attainable.
Otherwise, in general, it's not attainable.
So, as a first step, try to find a C++ implementation where FLT_RADIX is 10.
I wouldn't worry about algorithm or efficiency thereof until the C++ implementation is installed and proved to be working on your system. But as a hint, your goal seems to be suspiciously similar to the operation known as “rounding”. I think, after obtaining my decimal floating point C++ implementation, I’d start by investigating techniques for rounding, e.g., googling that, maybe Wikipedia, …

Exact decimal datatype for C++?

PHP has a decimal type, which doesn't have the "inaccuracy" of floats and doubles, so that 2.5 + 2.5 = 5 and not 4.999999999978325 or something like that.
So I wonder if there is such a data type implementation for C or C++?
The Boost.Multiprecision library has a decimal based floating point template class called cpp_dec_float, for which you can specify any precision you want.
#include <iostream>
#include <iomanip>
#include <boost/multiprecision/cpp_dec_float.hpp>

int main()
{
    namespace mp = boost::multiprecision;
    // Here I'm using a predefined type that stores 100 digits, but you
    // can create custom types very easily with any level of precision
    // you want.
    typedef mp::cpp_dec_float_100 decimal;

    decimal tiny("0.0000000000000000000000000000000000000000000001");
    decimal huge("100000000000000000000000000000000000000000000000");
    decimal a = tiny;

    while (a != huge)
    {
        std::cout.precision(100);
        std::cout << std::fixed << a << '\n';
        a *= 10;
    }
}
Yes:
There are arbitrary precision libraries for C++.
A good example is The GNU Multiple Precision arithmetic library.
If you are looking for data type supporting money / currency then try this:
https://github.com/vpiotr/decimal_for_cpp
(it's header-only solution)
There will always be some imprecision. On any computer, in any number representation, there will always be numbers that can be represented accurately and other numbers that can't.
Computers use a base-2 system. Numbers such as 0.5 (2^-1), 0.125 (2^-3) and 0.375 (2^-2 + 2^-3) will be represented accurately (as 0.1, 0.001, and 0.011 for the above cases).
In a base 3 system those numbers cannot be represented accurately (half would be 0.111111...), but other numbers can be accurate (e.g. 2/3 would be 0.2)
Even in the human base-10 system there are numbers which can't be represented accurately, for example 1/3.
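To make the base-2 cases concrete, printing a few constants at full precision (assuming IEEE 754 doubles) shows which ones survive:

#include <cstdio>

int main() {
    std::printf("%.17g\n", 0.5);   // 0.5                 (exact: 2^-1)
    std::printf("%.17g\n", 0.375); // 0.375               (exact: 2^-2 + 2^-3)
    std::printf("%.17g\n", 0.1);   // 0.10000000000000001 (1/10 has no finite binary form)
}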
You can use a rational number representation, and all of the above will be accurate (1/2, 1/3, 3/8, etc.), but there will always be irrational numbers too. You are also practically limited by the sizes of the integers in this representation.
For every non-representable number you can extend the representation to include it explicitly (e.g. compare rational numbers with a representation a/b + c/d*sqrt(2)), but there will always be more numbers that still cannot be represented accurately. There is a mathematical proof that says so: any finite representation covers only countably many values, while the reals are uncountable.
So - let me ask you this: what exactly do you need? Maybe precise computation on decimal-based numbers, e.g. in some monetary calculation?
What you're asking is anti-physics.
What Python (and C++ as well) does is hide the inaccuracy by rounding the result at the time it is printed, reducing the number of significant digits:
double x = 2.5;
x += 2.5;
std::cout << x << std::endl;
just causes x to be printed with 6 significant digits of precision (while x itself carries more than 12), so it comes out rounded to 5, cutting away the imprecision.
The alternative is not to use floating point at all, and instead implement data types that do "scaled" integer arithmetic: 25/10 + 25/10 = 50/10 (a sketch follows below).
Note, however, that this reduces the upper limit each integer type can represent. The gain in precision (and exactness) means overflow is reached sooner.
Rational arithmetic is also possible (each number represented by a numerator and a denominator), with no precision loss on division (which, in fact, is not performed unless exact), but again the values grow as the number of operations grows (the less "rational" the number, the bigger its numerator and denominator), with a greater risk of overflow.
In other words, using a finite number of bits (no matter how they are organized) always costs something, paid either on the side of small numbers or on the side of big numbers.
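A minimal sketch of the scaled-integer idea (the Scaled2 type is purely illustrative, not a library type; for brevity the printing only handles non-negative values):

#include <cstdint>
#include <iostream>

// Fixed-point value stored as integer hundredths: 2.50 is held as 250,
// so 2.50 + 2.50 is the exact integer addition 250 + 250 = 500.
struct Scaled2 {
    std::int64_t hundredths;
};

Scaled2 operator+(Scaled2 a, Scaled2 b) {
    return { a.hundredths + b.hundredths }; // exact, but overflows sooner than double
}

std::ostream& operator<<(std::ostream& os, Scaled2 v) {
    return os << v.hundredths / 100 << '.'
              << (v.hundredths % 100 < 10 ? "0" : "") << v.hundredths % 100;
}

int main() {
    Scaled2 a{ 250 }, b{ 250 };  // 2.50 and 2.50
    std::cout << a + b << '\n';  // prints 5.00, exactly
}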
I presume you are talking about the Binary Calculator in PHP. No, there isn't one in the C runtime or STL. But you can write your own if you are so inclined.
Here is a C++ version of BCMath compiled using Facebook's HipHop for PHP:
http://fossies.org/dox/facebook-hiphop-php-cf9b612/dir_2abbe3fda61b755422f6c6bae0a5444a.html
Being a higher-level language, PHP just cuts off what you call "inaccuracy", but it's certainly there. In C/C++ you can achieve a similar effect by casting the result to an integer type.

double type digits in C++

IEEE 754 (64-bit) floating point is supposed to correctly represent 15 significant digits, although the internal representation maintains 17 digits. Is there a way to force the 16th and 17th digits to zero?
Ref: http://msdn.microsoft.com/en-us/library/system.double(VS.80).aspx :
[...] Remember that a floating-point number can only approximate a decimal number, and that the precision of a floating-point number determines how accurately that number approximates a decimal number. By default, a Double value contains 15 decimal digits of precision, although a maximum of 17 digits is maintained internally. The precision of a floating-point number has several consequences: [...]
Example nos:
d1 = 97842111437.390091
d2 = 97842111437.390076
d1 and d2 differ in the 16th and 17th significant digits, which are not supposed to be significant. Looking for ways to force them to zero, i.e.
d1 = 97842111437.390000
d2 = 97842111437.390000
No. Counter-example: the two closest floating-point numbers to a rational
1.11111111111118
(which has 15 decimal digits) are
1.1111111111111799942818834097124636173248291015625
1.1111111111111802163264883347437717020511627197265625
In other words, there is no floating-point number that starts with 1.1111111111111800.
This question is a little malformed. The hardware stores the numbers in binary, not decimal, so in the general case you can't do precise math in base 10. Some decimal numbers (0.1 is one of them!) do not even have a non-repeating representation in binary. If you have precision requirements like this, where you care about the number being of known precision to exactly 15 decimal digits, you will need to pick another representation for your numbers.
No, but I wonder if this is relevant to any of your issues (GCC specific):
GCC Documentation
-ffloat-store
Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory. This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a double is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.
You should be able to directly modify the bits in your number by creating a union with a field for the floating-point number and an integral type of the same size. Then you can access the bits you want and set them however you want. (Type-punning through a union is well-defined in C, but formally undefined behaviour in C++; there, memcpy between a double and an integer is the safe equivalent.) Here is an example where I whack the sign bit; you can choose any field you want, of course.
#include <stdio.h>

union double_int {
    double fp;
    unsigned long long integer;
};

int main(int argc, const char *argv[])
{
    union double_int my_union;
    my_union.fp = 1325.34634;

    /* print original numbers */
    printf("Float %f\n", my_union.fp);
    printf("Integer %llx\n", my_union.integer);

    /* whack the sign bit to 1 */
    my_union.integer |= 1ULL << 63;

    /* print modified numbers */
    printf("Negative float %f\n", my_union.fp);
    printf("Negative integer %llx\n", my_union.integer);
    return 0;
}
Generally speaking, people only care about something like this ("I only want the first x digits") when displaying the number. That's relatively easy with stringstreams or sprintf.
If you're concerned about comparing numbers with ==, you really can't do that with floating-point numbers. Instead you want to see if the numbers are close enough (say, within an epsilon of each other), as in the sketch below.
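For instance, a relative-tolerance comparison might look like this (one common convention, not the only one; the right tolerance depends on the computation that produced the values):

#include <cmath>
#include <limits>

// True if x and y agree to within rel_tol of the larger magnitude.
bool nearly_equal(double x, double y,
                  double rel_tol = 8 * std::numeric_limits<double>::epsilon())
{
    return std::fabs(x - y) <= rel_tol * std::fmax(std::fabs(x), std::fabs(y));
}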
Playing with the bits of the number directly isn't a great idea.