Floating point computation changes if stored in intermediate "double" variable - c++

I am trying to write a simple log base 2 method. I understand that representing something like std::log(8.0) and std::log(2.0) on a computer is difficult. I also understand std::log(8.0) / std::log(2.0) may result in a value very slightly lower than 3.0. What I do not understand is why putting the result of the calculation below into a double and making it an lvalue, then casting it to an unsigned int, would change the result compared to casting the formula directly. The following code shows my test case, which repeatedly fails on my 32 bit debian wheezy machine but passes repeatedly on my 64 bit debian wheezy machine.
#include <cmath>
#include "assert.h"
int main () {
    int n = 8;
    // Cast the quotient directly.
    unsigned int i =
        static_cast<unsigned int>(std::log(static_cast<double>(n)) /
                                  std::log(static_cast<double>(2)));
    // Store the quotient in a double first, then cast.
    double d =
        std::log(static_cast<double>(n)) / std::log(static_cast<double>(2));
    unsigned int j = static_cast<unsigned int> (d);
    assert (i == j);
}
I also know I can use bit shifting to come up with my result in a more predictable way. I am mostly curious why casting the double that results from the operation is any different from sticking that value into a double on the stack and casting the double on the stack.

In C++, floating point is allowed to do this sort of thing.
One possible explanation would be that the result of the division is calculated internally in a higher precision than double, and kept in a register with higher precision than double (on 32-bit x86 this is typically one of the 80-bit x87 registers).
Converting this directly to unsigned int can give a different result from first rounding it to double and then converting to unsigned int: the extended-precision value may sit just below 3.0 and truncate to 2, while rounding it to double first can land on exactly 3.0.
To see exactly what is going on, it might be helpful to look at the assembly output generated by your compiler for the 32-bit case.
Needless to say, you shouldn't write code that relies on exactness of floating point operations.
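Since the question mentions bit shifting as the more predictable route, here is a minimal sketch of an integer-only floor(log2(n)). The name floorLog2 is made up for illustration, and the function assumes n is nonzero. It never touches floating point, so the 32-bit and 64-bit builds cannot disagree.

```cpp
#include <cassert>

// Floor of log2(n) using only integer shifts (hypothetical helper name).
// Precondition: n != 0, because log2(0) is undefined.
unsigned int floorLog2(unsigned int n) {
    assert(n != 0);
    unsigned int result = 0;
    // Each shift discards one binary digit; count how many shifts
    // it takes until nothing is left above the lowest bit.
    while (n >>= 1) {
        ++result;
    }
    return result;
}
```

With this, floorLog2(8) is 3 regardless of how the platform evaluates floating point expressions.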


Problem in conversion of decimal to binary number by using bit manipulation [duplicate]

For some values (like 9) it works perfectly but, for most (like 7, 19 or 6), it subtracts 1 from the return (binary) value.
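#include <iostream>
#include <cmath>
using namespace std;
int decimaltobinary(int);
int main()
{
    int num;
    cout<<"Enter the number: ";
    cin>>num;
    cout<<num<<" in decimal = "<<decimaltobinary(num)<<" in binary.";
    return 0;
}
int decimaltobinary(int num)
{
    int remainder,i=0,binary=0;
    while(num!=0)
    { remainder=num%2; num/=2; binary=binary+remainder*pow(10,i); i++; }
    return binary;
}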
There are two main problems with the shown code:
The shown code attempts to build a binary version of the input number in decimal producing, for example, the result of 111 for the number 7. That's an integer value of one hundred and eleven.
With a 32 bit integer, the largest number that can be "converted" this way is 1023. 1024 is 10000000000 in binary, and ten billion exceeds the capacity of a 32 bit integer. An unsigned 32 bit integer's maximum value is 4294967295 (and half of that for its plain, signed, int value, but either signed or unsigned, you're out of gas at this point).
Any use of pow() with two values that are integers is automatically broken by default, because floating point math is broken. pow() does not multiply integers together. Here's what pow() conceptually does: a) it takes the natural logarithm of its first parameter, b) multiplies the result from step a by its 2nd parameter, c) raises e to the power resulting from step b. Does this sound like something you expected to do here?
And since pow() takes floating point parameters, and the result is a floating point, the end result of the shown code is a bunch of needless conversions between floating point and integer values, and non-specific rounding errors as a result of imprecise floating point exponential math.
But the main flaw in the shown code is the attempt to use a plain int to assemble a decimal number whose digits spell out a binary value; an int simply doesn't have enough digits for this. Switching to long long int won't be much of a help. Counting things off on my fingers, a signed 64 bit value holds 19 such digits, so you'll be able to go up only to 524287 that way (or slightly north of a million with unsigned long long). A completely different approach must be taken for the described programming task.
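One possible "completely different approach" is to build the digits as text instead of as a decimal integer. The sketch below is only an illustration, not code from the question; toBinaryString is a hypothetical name and the input is assumed to be non-negative.

```cpp
#include <cassert>
#include <string>

// Returns the binary digits of num as text, e.g. 7 -> "111".
// A string has no fixed digit limit, so 2048 and beyond work fine.
std::string toBinaryString(unsigned int num) {
    if (num == 0) {
        return "0";
    }
    std::string digits;
    while (num != 0) {
        // Prepend the lowest bit so the most significant digit ends up first.
        digits.insert(digits.begin(), static_cast<char>('0' + (num % 2)));
        num /= 2;
    }
    return digits;
}
```

For example, assert(toBinaryString(7) == "111") holds, and no floating point is involved at any step.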
Your problem is that binary+remainder*pow(10,i); is all done in floating-point arithmetic and only converted to int at the assignment. Since pow is not exact, you may get the result slightly below the exact value, in which case the conversion truncates it and makes 1 less than the desired result.
While there are various better ways to achieve your goal, the immediate fix is to use std::round() and then cast the result to int:
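A minimal sketch of that fix, assuming the loop in decimaltobinary extracts one binary digit per iteration (decimalToBinaryRounded is a hypothetical name, and the loop shape is an assumption based on the expression quoted above):

```cpp
#include <cassert>
#include <cmath>

// Same digit-by-digit idea as the question, but the power of ten is
// rounded to the nearest integer before it is used, so a pow() result
// such as 99.999... can no longer be truncated to 99.
int decimalToBinaryRounded(int num) {
    int binary = 0;
    int i = 0;
    while (num != 0) {
        const int remainder = num % 2;
        num /= 2;
        const int placeValue = static_cast<int>(std::round(std::pow(10.0, i)));
        binary += remainder * placeValue;
        ++i;
    }
    return binary;
}
```

This only removes the off-by-one; the int still overflows for inputs above 1023, which is the other problem described above.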

Is hardcoding least significant byte of a double a good rounding strategy?

I have a function doing some mathematical computation and returning a double. It ends up with different results under Windows and Android due to the std::exp implementation being different (Why do I get platform-specific result for std::exp?). The e-17 rounding difference gets propagated and in the end it's not just a rounding difference that I get (results can change from 2.36 to 2.47 in the end). As I compare the result to some expected values, I want this function to return the same result on all platforms.
So I need to round my result. The simplest solution to do this is apparently (as far as I could find on the web) to do std::ceil(d*std::pow<double>(10,precision))/std::pow<double>(10,precision). However, I feel like this could still end up with different results depending on the platform (and moreover, it's hard to decide what precision should be).
I was wondering if hard-coding the least significant byte of the double could be a good rounding strategy.
This quick test seems to show that "yes":
#include <cmath>
#include <iostream>
#include <iomanip>
double roundByCast( double d )
{
    double rounded = d;
    unsigned char* temp = (unsigned char*) &rounded;
    // changing least significant byte to be always the same
    temp[0] = 128;
    return rounded;
}
void showRoundInfo( double d, double rounded )
{
    double diff = std::abs(d-rounded);
    std::cout << "cast: " << d << " rounded to " << rounded << " (diff=" << diff << ")" << std::endl;
}
void roundIt( double d )
{
    showRoundInfo( d, roundByCast(d) );
}
int main( int argc, char* argv[] )
{
    roundIt( 7.87234042553191493141184764681 );
    roundIt( 0.000000000000000000000184764681 );
    roundIt( 78723404.2553191493141184764681 );
    return 0;
}
This outputs:
cast: 7.87234 rounded to 7.87234 (diff=2.66454e-14)
cast: 1.84765e-22 rounded to 1.84765e-22 (diff=9.87415e-37)
cast: 7.87234e+07 rounded to 7.87234e+07 (diff=4.47035e-07)
My question is:
Is unsigned char* temp = (unsigned char*) &rounded safe or is there an undefined behaviour here, and why?
If there is no UB (or if there is a better way to do this without UB), is such a round function safe and accurate for all input?
Note: I know floating point numbers are inaccurate. Please don't mark as duplicate of Is floating point math broken? or Why Are Floating Point Numbers Inaccurate?. I understand why results are different, I'm just looking for a way to make them be identical on all targeted platforms.
Edit: let me reformulate my question, as people are asking why I have different values and why I want them to be the same.
Let's say you get a double from a computation that could end up with a different value due to platform specific implementations (like std::exp). If you want to fix those different doubles so they end up having the exact same memory representation (1) on all platforms, and you want to lose as little precision as possible, then is fixing the least significant byte a good approach? (I feel that rounding to an arbitrary given precision is likely to lose more information than this trick.)
(1) By "same representation", I mean that if you transform it to a std::bitset, you want to see the same bit sequence on all platforms.
No, rounding is not a strategy for removing small errors, or guaranteeing agreement with calculations performed with errors.
For any slicing of the number line into ranges, you will successfully eliminate most slight deviations (by placing them in the same bucket and clamping to the same value), but you greatly increase the deviation if your original pair of values straddle a boundary.
In your particular case of hardcoding the least significant byte, two very near values whose bit patterns end in ...0ff and ...100 (hexadecimal) have a deviation of only one ULP... but after your rounding, they differ by 256 ULP. Oops!
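The boundary effect can be checked directly. The sketch below is an illustration under a few assumptions: double is a 64-bit IEEE 754 type whose byte order matches that of std::uint64_t, and all helper names (bitsOf, fromBits, roundLowByte, ulpDistance) are made up here. It forces the low byte through an integer mask rather than through temp[0], so it does not depend on endianness.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");

// Copy the object representation of a double into an integer.
std::uint64_t bitsOf(double d) {
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return bits;
}

// Inverse of bitsOf.
double fromBits(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Same idea as roundByCast: force the lowest 8 significand bits to 0x80.
double roundLowByte(double d) {
    return fromBits((bitsOf(d) & ~std::uint64_t{0xFF}) | 0x80);
}

// Distance in ULPs between two finite doubles of the same sign.
std::uint64_t ulpDistance(double a, double b) {
    const std::uint64_t x = bitsOf(a);
    const std::uint64_t y = bitsOf(b);
    return x > y ? x - y : y - x;
}
```

Two neighbours such as fromBits(0x401F7D6B3C0000FF) and fromBits(0x401F7D6B3C000100) are 1 ULP apart before roundLowByte and 256 ULP apart after it, because they fall on opposite sides of a bucket boundary.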
Is unsigned char* temp = (unsigned char*) &rounded safe or is there an undefined behaviour here, and why?
It is well defined, as aliasing through unsigned char is allowed.
is such a round function safe and accurate for all input?
No. You cannot perfectly fix this problem with truncating/rounding. Consider that one implementation gives 0x.....0ff, and the other 0x.....100. Setting the least significant byte to 0x00 turns the original 1 ulp difference into 256 ulps.
No rounding algorithm can fix this.
You have two options:
don't use floating point, use some other way (for example, fixed point)
embed a floating point library into your application, which only uses basic floating point arithmetic (+, -, *, /, sqrt), and don't use -ffast-math, or any equivalent option. This way, if you're on an IEEE-754 compatible platform, floating point results should be the same, as IEEE-754 mandates that basic operations be calculated "perfectly", that is, as if the operation were calculated at infinite precision and then rounded to the resulting representation.
Btw, if an input 1e-17 difference means a huge output difference, then your problem/algorithm is ill-conditioned, which generally should be avoided, as it usually doesn't give you meaningful results.
What you are doing is totally, totally misguided.
Your problem is not that you are getting different results (2.36 vs. 2.47). Your problem is that at least one of these results, and likely both, have massive errors. Your Windows and Android results are not just different, they are WRONG. (At least one of them, and you have no idea which one).
Find out why you get these massive errors and change your algorithms to not increase tiny rounding errors massively. Or you have a problem that is inherently chaotic, in which case the difference between results is actually very useful information.
What you are trying just makes the rounding errors 256 times bigger, and if two different results end in ....1ff and ....200 hexadecimal, then you change these to ....180 and ....280, so even the difference between slightly different numbers can grow by a factor of 256.
And on a big-endian machine your code will just go kaboom!!!
Your function won't work because of aliasing.
double roundByCast( double d )
{
    double rounded = d;
    unsigned char* temp = (unsigned char*) &rounded;
    // changing least significant byte to be always the same
    temp[0] = 128;
    return rounded;
}
Casting to unsigned char* for temp is allowed, because char* casts are the exception to the aliasing rules. That's necessary for functions like read, write, memcpy, etc, so that they can copy values to and from byte representations.
However, you aren't allowed to write to temp[0] and then assume that rounded changed. You must create a new double variable (on the stack is fine) and memcpy temp back to it.

Using scientific notation in for loops

I've recently come across some code which has a loop of the form
for (int i = 0; i < 1e7; i++){
}
I question the wisdom of doing this since 1e7 is a floating point type, and will cause i to be promoted when evaluating the stopping condition. Should this be a cause for concern?
The elephant in the room here is that the range of an int could be as small as -32767 to +32767, and the behaviour on assigning a larger value than this to such an int is undefined.
But, as for your main point, indeed it should concern you as it is a very bad habit. Things could go wrong as yes, 1e7 is a floating point double type.
The fact that i will be converted to a floating point due to type promotion rules is somewhat moot: the real damage is done if there is unexpected truncation of the apparent integral literal. By way of a "proof by example", consider first the loop
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 18446744073709551615ULL; ){
    std::cout << i << "\n";
}
This outputs every consecutive value of i in the range, as you'd expect. Note that std::numeric_limits<std::uint64_t>::max() is 18446744073709551615ULL, which is 1 less than the 64th power of 2. (Here I'm using a slide-like "operator" ++< which is useful when working with unsigned types. Many folk consider --> and ++< as obfuscating but in scientific programming they are common, particularly -->.)
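Now on my machine, a double is an IEEE754 64 bit floating point. (Such a scheme is particularly good at representing powers of 2 exactly - an IEEE754 double can represent powers of 2 up to the 1023rd exactly.) So 18,446,744,073,709,551,616 (the 64th power of 2) can be represented exactly as a double. Just below it, representable doubles are 2048 apart: the nearest one is 18,446,744,073,709,549,568, and 18,446,744,073,709,550,592 (which is 1024 less than the 64th power of 2) sits exactly halfway between the two, so under the default round-to-nearest-even mode it converts to the 64th power of 2.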
So now let's write the loop as
for (std::uint64_t i = std::numeric_limits<std::uint64_t>::max() - 1024; i ++< 1.8446744073709551615e19; ){
    std::cout << i << "\n";
}
On my machine that will only output one value of i: 18,446,744,073,709,550,592 (the number that we've already seen). This proves that 1.8446744073709551615e19 is a floating point type. If the compiler were allowed to treat the literal as an integral type then the output of the two loops would be equivalent.
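The same point can be checked without running the loop. This is a small sketch, assuming a 64-bit IEEE 754 double and round-to-nearest conversions; literalIsTwoToThe64 and comparesLess are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <type_traits>

// An unsuffixed floating literal has type double, however integral it looks.
static_assert(std::is_same<decltype(1e7), double>::value, "1e7 is a double");

// True when the literal from the loop condition was rounded to exactly 2^64.
bool literalIsTwoToThe64() {
    return 1.8446744073709551615e19 == std::ldexp(1.0, 64);
}

// Mirrors the loop condition: i is converted to double before the comparison.
bool comparesLess(std::uint64_t i) {
    return static_cast<double>(i) < 1.8446744073709551615e19;
}
```

The starting value max() - 1024 still compares less than the literal, while max() itself does not, because as a double it also becomes the 64th power of 2.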
It will work, assuming that your int is at least 32 bits.
However, if you really want to use exponential notation, it is better to define an integer constant outside the loop and use proper casting, like this:
const int MAX_INDEX = static_cast<int>(1.0e7);
for (int i = 0; i < MAX_INDEX; i++) {
}
Considering this, I'd say it is much better to write
const int MAX_INDEX = 10000000;
or if you can use C++14
const int MAX_INDEX = 10'000'000;
1e7 is a literal of type double, and usually double is 64-bit IEEE 754 format with a 52-bit mantissa. Roughly every tenth power of 2 corresponds to a third power of 10, so double should be able to represent integers up to at least 10^(5*3) = 10^15, exactly. And if int is 32-bit then int has roughly 10^(3*3) = 10^9 as max value (asking Google search it says that "2**31 - 1" = 2 147 483 647, i.e. twice the rough estimate).
So, in practice it's safe on current desktop systems and larger.
But C++ allows int to be just 16 bits, and on e.g. an embedded system with that small int, one would have Undefined Behavior.
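If the code has to be portable to such a platform, one option is to make the assumption explicit so that a too-narrow int fails at compile time instead of at run time. A minimal sketch, with kIterations and countIterations as hypothetical names:

```cpp
#include <cassert>
#include <limits>

// Loop bound written as an integer; long is guaranteed to be at least 32 bits.
constexpr long kIterations = 10000000;

// Refuse to compile where int cannot hold the loop bound (e.g. a 16-bit int).
static_assert(std::numeric_limits<int>::max() >= kIterations,
              "int is too narrow for this loop bound");

// Counts the iterations of the loop from the question, with an integer bound.
long countIterations() {
    long count = 0;
    for (int i = 0; i < static_cast<int>(kIterations); ++i) {
        ++count;
    }
    return count;
}
```

On a platform with a 32-bit int this compiles and the loop runs exactly ten million times; on a 16-bit int the static_assert stops the build.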
If the intention is to loop for an exact integer number of iterations, for example when iterating over exactly all the elements of an array, then comparing against a floating point value is maybe not such a good idea, solely for accuracy reasons; since the implicit cast of an integer to float will truncate integers toward zero there's no real danger of out-of-bounds access, it will just cut the loop short.
Now the question is: when do these effects actually kick in? Will your program experience them? The floating point representation usually used these days is IEEE 754. As long as the exponent is 0 a floating point value is essentially an integer. C double precision floats have 52 bits for the mantissa, which gives you integer precision up to a value of 2^52, which is on the order of 1e15. Unless you specify with a suffix f that you want a floating point literal to be interpreted as single precision, the literal will be double precision and the implicit conversion will target that as well. So as long as your loop end condition is less than 2^52 it will work reliably!
Now one question you have to think about on the x86 architecture is efficiency. The very first 80x87 FPUs came in a different package, and later a different chip, and as a result getting values into the FPU registers is a bit awkward on the x86 assembly level. Depending on what your intentions are it might make a difference in runtime for a realtime application; but that's premature optimization.
TL;DR: Is it safe to do? Most certainly yes. Will it cause trouble? It could cause numerical problems. Could it invoke undefined behavior? Depends on how you use the loop end condition, but if i is used to index an array and for some reason the array length ended up in a floating point variable, always truncating toward zero, it's not going to cause a logical problem. Is it a smart thing to do? Depends on the application.

Float to int number conversion in c++

The following C++ code:
union float2bin{
    float f;
    int i;
};
float2bin obj;
obj.f = 2.243;
cout << obj.i;
gives some garbage value as output, while
union float2bin{
    float f;
    float i;
};
float2bin obj;
obj.f = 2.243;
cout << obj.i;
gives the same output as the value of f, i.e. 2.243.
With GCC, int and float have the same size (4 bytes), so what's the reason behind this difference in output?
The reason is because it is undefined behavior. In practice,
you'll get away with reading an int from something that was
stored as a float on most machines, but you'll read garbage
values unless you know what to expect. Doing it in the other
direction will likely cause the program to crash for certain
values of int.
Under the hood, of course, integral values and floating point
values have different representations, at least on most
machines. (On some Unisys mainframes, your code would do what
you expect. But they're not the most common systems around, and
you probably don't have one on your desktop.) Basically,
regardless of the type, you have a sequence of bits, which will
be interpreted by the hardware in some way. C++ requires
integers to use a pure binary representation, which constrains
the representation somewhat. It also requires a very large
range for floating point values, and more or less requires some
form of exponential notation, with some bits representing the
exponent, and others the mantissa. With different encodings for
The reason is because floating point values are stored in a more complicated way, partitioning the 32 bits into a sign, an exponent and a fraction. If these bits are read as an integer straight off, it will look like a very different value.
The important point here is that if you create a union, you are saying that it is one contiguous block of memory that can be interpreted in two different ways. Nowhere in this mechanism does it account for a safe conversion between float and int, in which case some kind of rounding occurs.
Update: What you might want is
float f = 10.25f;
int i = (int)f;
// Will give you i = 10
However, the union approach is closer to this:
float f = 10.25f;
int i = *((int *)&f);
// Will give you some seemingly arbitrary value
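If the goal really is to look at the bits of a float, copying them with std::memcpy avoids both the union and the pointer cast. A minimal sketch, assuming a 32-bit IEEE 754 float (floatBits is a hypothetical name):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");

// Returns the raw bit pattern of f: sign, exponent and fraction packed together.
std::uint32_t floatBits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Under that assumption floatBits(10.25f) is 0x41240000, which is what looks like an arbitrary value when printed as an integer, while static_cast<int>(10.25f) is the numeric conversion and yields 10.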

char* to double and back to char* again ( 64 bit application)

I am trying to convert a char* to double and back to char* again. The following code works fine if the application you created is 32-bit but doesn't work for a 64-bit application. The problem occurs when you try to convert back to char* from int. For example, if hello = 0x000000013fcf7888 then converted is 0x000000003fcf7888; only the last 32 bits are right.
#include <iostream>
#include <stdlib.h>
#include <tchar.h>
using namespace std;
int _tmain(int argc, _TCHAR* argv[]){
    char* hello = "hello";
    unsigned int hello_to_int = (unsigned int)hello;
    double hello_to_double = (double)hello_to_int;
    unsigned int converted_int = (unsigned int)hello_to_double;
    char* converted = reinterpret_cast<char*>(converted_int);
    return 0;
}
On 64-bit Windows pointers are 64-bit while int is 32-bit. This is why you're losing data in the upper 32-bits while casting. Instead of int use long long to hold the intermediate result.
char* hello = "hello";
unsigned long long hello_to_int = (unsigned long long)hello;
Make similar changes for the reverse conversion. But this is not guaranteed to make the conversions function correctly because a double can easily represent the entire 32-bit integer range without loss of precision but the same is not true for a 64-bit integer.
Also, this isn't going to work
unsigned int converted_int = (unsigned int)hello_to_double;
That conversion will simply truncate any digits after the decimal point in the floating point representation. The problem exists even if you change the data type to unsigned long long. You'll need to reinterpret the bits of the double as an unsigned long long (reinterpret_cast<unsigned long long&>, or better, memcpy) to make it work.
Even after all that you may still run into trouble depending on the value of the pointer. The conversion to double may cause the value to be a signalling NaN for instance, in which case your code might throw an exception.
Simple answer is, unless you're trying this out for fun, don't do conversions like these.
You can't cast a char* to int on 64-bit Windows because an int is 32 bits, while a char* is 64 bits because it's a pointer. Since a double is typically 64 bits, you might be able to get away with casting between a double and char*.
A couple of issues with encoding any integer (specifically, a collection of bits) into a floating point value:
Conversions from 64-bit integers to doubles can be lossy. A double has 53 bits of actual precision, so integers above 2^53 will not necessarily be represented precisely.
If you decide to reinterpret the bits of a pointer as a double instead (via union or reinterpret_cast) you will still have issues if you happen to encode a pointer as a set of bits that are not a valid double representation. Unless you can guarantee that the double value never gets written back by the FPU, the FPU can silently transform an invalid double into another invalid double (see NaN), i.e., a double value that represents the same value but has different bits. (See this for issues related to using floating point formats as bits.)
You can probably safely get away with encoding a 32-bit pointer in a double, as that will definitely fit within the 53-bit precision range.
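The precision limit is easy to probe. The sketch below assumes a 64-bit IEEE 754 double; survivesDoubleRoundTrip is a hypothetical helper, and its input is kept below 2^63 so that converting back to an integer cannot go out of range.

```cpp
#include <cassert>
#include <cstdint>

// True when value -> double -> integer gives back the original value.
bool survivesDoubleRoundTrip(std::uint64_t value) {
    // Stay below 2^63 so the rounded double always fits in a uint64_t.
    assert(value < (std::uint64_t{1} << 63));
    const double asDouble = static_cast<double>(value);
    return static_cast<std::uint64_t>(asDouble) == value;
}
```

The pointer value from the question, 0x000000013fcf7888, is far below 2^53 and survives the numeric round trip, whereas 2^53 + 1 or a value like 0x0123456789ABCDEF does not. So the high bits in the question were lost by the 32-bit unsigned int, not by the double.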
only the last 32 bits are right.
That's because an int in your platform is only 32 bits long. Note that reinterpret_cast only guarantees that you can convert a pointer to an int of sufficient size (not your case), and back.
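An integer type of sufficient size is available as std::uintptr_t from <cstdint> (optional in the standard, but provided on mainstream platforms). A minimal sketch of the pointer-to-integer-and-back round trip, with hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>

// Converts a pointer to an integer wide enough to hold it, and back again.
const char* roundTripThroughInteger(const char* p) {
    const std::uintptr_t asInteger = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<const char*>(asInteger);
}

// True when the pointer comes back unchanged.
bool pointerSurvivesRoundTrip(const char* p) {
    return roundTripThroughInteger(p) == p;
}
```

Unlike unsigned int on 64-bit Windows, std::uintptr_t keeps all the bits of the pointer, so pointerSurvivesRoundTrip("hello") holds on both 32-bit and 64-bit builds.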
If it works on any system, anywhere, just call yourself lucky and move on. Converting a pointer to an integer is one thing (as long as the integer is large enough, you can get away with it), but a double is a floating point number - what you are doing simply doesn't make any sense, because a double is NOT necessarily capable of representing any random number. A double has range and precision limitations, and limits on how it represents things. It can represent numbers across a wide range of values, but it can't represent EVERY number in that range.
Remember that a double has two components: the mantissa and the exponent. Together, these allow you to represent either very big or very small numbers, but the mantissa has limited number of bits. If you run out of bits in the mantissa, you're going to lose some bits in the number you are trying to represent.
Apparently you got away with it under certain circumstances, but you're asking it to do something it wasn't made for, and for which it is manifestly inappropriate.
Just don't do that - it's not supposed to work.
This is as expected.
Typically a char* is going to be 32 bits on a 32-bit system, 64 bits on a 64-bit system; double is typically 64 bits on both systems. (These sizes are typical, and probably correct for Windows; the language permits a lot more variations.)
Conversion from a pointer to a floating-point type is, as far as I know, undefined. That doesn't just mean that the result of the conversion is undefined; the behavior of a program that attempts to perform such a conversion is undefined. If you're lucky, the program will crash or fail to compile.
But you're converting from a pointer to an integer (which is permitted, but implementation-defined) and then from an integer to a double (which is permitted and meaningful for meaningful numeric values -- but converted pointer values are not numerically meaningful). You're losing information because not all of the 64 bits of a double are used to represent the magnitude of the number; typically 11 or so bits are used to represent the exponent.
What you're doing quite simply makes no sense.
What exactly are you trying to accomplish? Whatever it is, there's surely a better way to do it.