I would like to have the closest number below 1.0 as a floating-point value. By reading Wikipedia's article on IEEE 754 I have managed to find out that the binary representation of 1.0 is 0x3FF0000000000000, so the closest double value below it is actually 0x3FEFFFFFFFFFFFFF.
The only way I know of to initialize a double with this binary data is this:
double a;
*((unsigned*)(&a) + 1) = 0x3FEFFFFF;
*((unsigned*)(&a) + 0) = 0xFFFFFFFF;
Which is rather cumbersome to use.
Is there any better way to define this double number, if possible as a constant?
Hexadecimal float and double literals do exist.
The syntax is 0x1.(mantissa)p(exponent in decimal)
In your case the syntax would be
double x = 0x1.fffffffffffffp-1;
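For a quick sanity check, here is a minimal sketch (assuming a C++17 compiler, since hexadecimal floating literals only became standard C++ in C++17, and glibc-style %a output):
#include <cstdio>

int main() {
    double x = 0x1.fffffffffffffp-1; // largest double below 1.0
    // %a prints the value back in hexadecimal floating form
    std::printf("%a\n%.17g\n", x, x); // 0x1.fffffffffffffp-1 and 0.99999999999999989
}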
It's not safe, but something like:
double a;
*(reinterpret_cast<uint64_t *>(&a)) = 0x3FEFFFFFFFFFFFFFL;
However, this relies on a particular endianness of floating-point numbers on your system, so don't do this!
Instead, just put DBL_EPSILON from <cfloat> (or, as pointed out in another answer, std::numeric_limits<double>::epsilon()) to good use. Note that epsilon() is the gap between 1.0 and the next representable value above 1.0; the gap below 1.0 is half that, so the exact predecessor of 1.0 is 1.0 - epsilon()/2.
#include <iostream>
#include <iomanip>
#include <limits>
using namespace std;

int main()
{
    // epsilon() is the spacing just above 1.0; the spacing just below 1.0
    // is epsilon()/2, which yields 0x3FEFFFFFFFFFFFFF exactly
    double const x = 1.0 - numeric_limits< double >::epsilon() / 2;
    cout
        << setprecision( numeric_limits< double >::digits10 + 1 ) << fixed << x
        << endl;
}
If you write a bit_cast helper and use fixed-width integer types, it can be done safely:
template <typename R, typename T>
R bit_cast(const T& pValue)
{
    // static assert R and T are POD types
    // reinterpret_cast is implementation defined,
    // but likely does what you expect
    return reinterpret_cast<const R&>(pValue);
}
const uint64_t target = 0x3FEFFFFFFFFFFFFFL;
double result = bit_cast<double>(target);
Though you can probably just subtract epsilon from it.
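Since C++20 the standard library ships this exact utility as std::bit_cast in <bit>; it is constexpr and statically checks that the two types have the same size. A minimal sketch:
#include <bit>
#include <cstdint>
#include <iostream>

int main() {
    // both types are trivially copyable and the same size, as std::bit_cast requires
    double d = std::bit_cast<double>(std::uint64_t{0x3FEFFFFFFFFFFFFF});
    std::cout.precision(17);
    std::cout << d << '\n'; // 0.99999999999999989
}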
It's a little archaic, but you can use a union.
Assuming a long long and a double are both 8 bytes long on your system:
#include <iostream>

typedef union { long long a; double b; } my_union;

int main()
{
    my_union c;
    c.b = 1.0;
    c.a--; // step the bit pattern down by one
    std::cout << "Double value is " << c.b << std::endl;
    std::cout << "Long long value is " << c.a << std::endl;
}
Here you don't need to know ahead of time what the bit representation of 1.0 is.
This 0x1.fffffffffffffp-1 syntax is great, but only available in C99 or C++17.
But there is a workaround: no (pointer) casting, no UB/IB, just simple math.
double x = (double)0x1fffffffffffff / (1LL << 53); // (2^53 - 1) / 2^53 == 1 - 2^-53
If I need pi, and pi as a double is 0x1.921fb54442d18p1 in hex, I just write
const double PI = (double)0x1921fb54442d18 / (1LL << 51);
If your constant has a large or small exponent, you could use the function exp2 instead of the shift, but exp2 is C99/C++11... pow to the rescue!
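Along the same lines, std::ldexp from <cmath> scales by a power of two exactly and has been available since C++98, so it avoids both the shift-range limit and the exp2 availability question. A sketch:
#include <cmath>

// 0x1921fb54442d18 is pi's 53-bit significand; scaling by 2^-51
// reconstructs 0x1.921fb54442d18p1 exactly, with no division
const double PI = std::ldexp(static_cast<double>(0x1921fb54442d18), -51);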
Rather than all the bit juggling, the most direct solution is to use nextafter() from math.h. Thus:
#include <math.h>
double a = nextafter(1.0, 0.0);
Read this as: the next floating-point value after 1.0 in the direction of 0.0; an almost direct encoding of "the closest number below 1.0" from the original question.
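To confirm this lands on the exact value from the question, you can print it back in hex float form (a small check; std::hexfloat is C++11):
#include <cmath>
#include <iostream>

int main() {
    double a = std::nextafter(1.0, 0.0);
    std::cout << std::hexfloat << a << '\n'; // 0x1.fffffffffffffp-1
}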
Here is a part of a physics engine.
The simplified function centerOfMass calculates the 1D center of mass of two rigid bodies:
#include <iostream>
#include <iomanip>

float centerOfMass(float pos1, float m1, float pos2, float m2) {
    return (pos1*m1 + pos2*m2) / (m1 + m2);
}

int main() {
    float a = 5.55709743f;
    float b = centerOfMass(a, 50, 0, 0);
    std::cout << std::setprecision(9) << a << '\n'; // 5.55709743
    std::cout << std::setprecision(9) << b << '\n'; // 5.55709696
}
I need b to be precisely = 5.55709743.
The tiny difference can sometimes (in my real case, 5%) introduce a nasty physics divergence.
There are some ways to solve it, e.g. heavy conditional checking.
However, that is very error-prone for me.
Question: How can I solve the calculation error while keeping the code clean, fast, and easy to maintain?
By the way, if it can't be done elegantly, I will probably need to make the caller more resistant to such numerical error.
Edit
(clarify duplicate question)
Yes, the cause is the precision error from the storage/computing format (mentioned in Is floating point math broken?).
However, this question asks how to neutralize the symptom in a very specific case.
You are trying to get 9 decimal digits of precision, but the datatype float has a precision of only about 7 decimal digits.
Use double instead.
Use double, not float. IEEE 754 double has about 16 decimal places of precision.
#include <iostream>
#include <iomanip>

double centerOfMass(double pos1, double m1, double pos2, double m2) {
    return (pos1*m1 + pos2*m2) / (m1 + m2);
}

int main() {
    double a = 5.55709743;
    double b = centerOfMass(a, 50, 0, 0);
    std::cout << std::setprecision(16) << a << '\n';       // 5.55709743
    std::cout << std::setprecision(16) << b << '\n';       // 5.55709743
    std::cout << std::setprecision(16) << (b - a) << '\n'; // 0
}
For the example given, centerOfMass(a, 50, 0, 0), the following will give exact results for all values of a, because with m2 == 0 the factor m1/divisor evaluates to exactly 1.0; but of course the example does not look realistic.
double centerOfMass(double pos1, double m1, double pos2, double m2) {
    double divisor = m1 + m2;
    return pos1*(m1/divisor) + pos2*(m2/divisor);
}
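A quick check of the rearranged version, reusing the numbers from the question (a small sketch):
#include <iostream>

double centerOfMass(double pos1, double m1, double pos2, double m2) {
    double divisor = m1 + m2;
    return pos1*(m1/divisor) + pos2*(m2/divisor);
}

int main() {
    double a = 5.55709743;
    // with m2 == 0, m1/divisor is exactly 1.0, so a comes back unchanged
    std::cout << std::boolalpha << (centerOfMass(a, 50, 0, 0) == a) << '\n'; // true
}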
I have a number stored as a ulong. I want the bits stored in memory to be interpreted in a 2's-complement fashion, so I want the first bit to be the sign bit etc. If I want to convert to a long so that the number is interpreted correctly as 2's complement, how do I do this?
I tried creating pointers of different data types that all pointed to the same buffer. I then stored the ulong into the buffer and dereferenced a long pointer. This, however, is giving me a bad result.
I did:
#include <iostream>
using namespace std;

int main() {
    unsigned char converter_buffer[4];
    unsigned long *pulong;
    long *plong;

    pulong = (unsigned long*)&converter_buffer;
    plong = (long*)&converter_buffer;

    unsigned long ulong_num = 65535; // this has a 1 as the first bit
    *pulong = ulong_num;

    std::cout << "the number as a long is" << *plong << std::endl;
    return 0;
}
For some reason this is giving me the same positive number.
Would casting help?
Actually, using pointers was a good start, but you have to cast your unsigned long* to void* first; then you can cast the result to long* and dereference it:
#include <iostream>
#include <climits>

int main() {
    unsigned long ulongValue = ULONG_MAX;
    long longValue = *((long*)((void*)&ulongValue));

    std::cout << "ulongValue: " << ulongValue << std::endl;
    std::cout << "longValue: " << longValue << std::endl;
    return 0;
}
The code above results in the following:
ulongValue: 18446744073709551615
longValue: -1
With templates you can make it more readable in your code:
#include <iostream>
#include <climits>

template<typename T, typename U>
T unsafe_cast(const U& from) {
    return *((T*)((void*)&from));
}

int main() {
    unsigned long ulongValue = ULONG_MAX;
    long longValue = unsafe_cast<long>(ulongValue);

    std::cout << "ulongValue: " << ulongValue << std::endl;
    std::cout << "longValue: " << longValue << std::endl;
    return 0;
}
Keep in mind that this solution is absolutely unsafe, due to the fact that you can cast anything to void*. This practice was common in C, but I do not recommend using it in C++. Consider the following cases:
#include <iostream>

template<typename T, typename U>
T unsafe_cast(const U& from) {
    return *((T*)((void*)&from));
}

int main() {
    std::cout << std::hex << std::showbase;

    float fValue = 3.14;
    int iValue = unsafe_cast<int>(fValue); // OK, they have the same size.
    std::cout << "Hexadecimal representation of " << fValue
              << " is: " << iValue << std::endl;
    std::cout << "Converting back to float results: "
              << unsafe_cast<float>(iValue) << std::endl;

    double dValue = 3.1415926535;
    int lossyValue = unsafe_cast<int>(dValue); // Bad, they have different sizes.
    std::cout << "Lossy hexadecimal representation of " << dValue
              << " is: " << lossyValue << std::endl;
    std::cout << "Converting back to double results: "
              << unsafe_cast<double>(lossyValue) << std::endl;
    return 0;
}
For me, the code above produces the following:
Hexadecimal representation of 3.14 is: 0x4048f5c3
Converting back to float results: 3.14
Lossy hexadecimal representation of 3.14159 is: 0x54411744
Converting back to double results: 6.98387e-315
And for the last line you can get anything, because the conversion reads garbage from memory.
Edit
As lorro commented below, using memcpy() is safer and can prevent overflow. So, here is another version of the type cast which is safer:
#include <cstring>

template<typename T, typename U>
T safer_cast(const U& from) {
    T to;
    std::memcpy(&to, &from, (sizeof(T) > sizeof(U) ? sizeof(U) : sizeof(T)));
    return to;
}
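Usage stays the same as with unsafe_cast above (a small sketch):
#include <climits>
#include <iostream>

int main() {
    long longValue = safer_cast<long>(ULONG_MAX);
    std::cout << "longValue: " << longValue << std::endl; // -1
}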
You can do this:
uint32_t u;
int32_t& s = (int32_t&) u;
Then you can use s and u interchangeably with 2's complement, e.g.:
s = -1;
std::cout << u << '\n'; // 4294967295
In your question you ask about 65535 but that is a positive number. You could do:
uint16_t u;
int16_t& s = (int16_t&) u;
u = 65535;
std::cout << s << '\n'; // -1
Note that assigning 65535 (a positive number) to int16_t would be implementation-defined behaviour; it does not necessarily give -1.
The problem with your original code is that it is not permitted to alias a char buffer as long. (And that you might overflow your buffer). However, it is OK to alias an integer type as its corresponding signed/unsigned type.
In general, when you have two arithmetic types that are the same size and you want to reinterpret the bit representation of one using the type of the other, you do it with a union:
#include <stdint.h>

union reinterpret_u64_d_union {
    uint64_t u64;
    double d;
};

double
reinterpret_u64_as_double(uint64_t v)
{
    union reinterpret_u64_d_union u;
    u.u64 = v;
    return u.d;
}
For the special case of turning an unsigned number into a signed type with the same size (or vice versa), however, you can just use a traditional cast:
int64_t
reinterpret_u64_as_i64(uint64_t v)
{
    return (int64_t)v;
}
(The cast is not strictly required for [u]int64_t, but if you don't explicitly write a cast, and the types you're converting between are small, the "integer promotions" may get involved, which is usually undesirable.)
The way you were trying to do it violates the pointer-aliasing rules and provokes undefined behavior.
In C++, note that reinterpret_cast<> does not do what the union does; it is the same as static_cast<> when applied to arithmetic types.
In C++, also note that the use of a union above relies on a rule in the 1999 C standard (with corrigenda) that has not been officially incorporated into the C++ standard last I checked; however, all compilers I am familiar with will do what you expect.
And finally, in both C and C++, long and unsigned long are guaranteed to be able to represent at least −2,147,483,647 ... 2,147,483,647 and 0 ... 4,294,967,295, respectively. Your test program used 65535, which is guaranteed to be representable by both long and unsigned long, so the value would have been unchanged however you did it. Well, unless you used invalid pointer aliasing and the compiler decided to make demons fly out of your nose instead.
I have the above statement in a file I am referring to. The expected output is a double. I could not find anything relevant to my problem.
I found this
Passing a structure through Sockets in C
but I don't know if it's relevant.
I am not reading that int64 value. I am getting it from another process, and that is the way it is designed.
Does anyone have any theory about serialization and deserialization of ints?
There is exactly one defined way to bitwise-copy one type into another in C++: memcpy.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <memory>
#include <type_traits>

template<class Out, class In, std::enable_if_t<(sizeof(In) == sizeof(Out))>* = nullptr>
Out mangle(const In& in)
{
    Out result;
    std::memcpy(std::addressof(result), std::addressof(in), sizeof(Out));
    return result;
}

int main()
{
    double a = 1.1;
    auto b = mangle<std::uint64_t>(a);
    auto c = mangle<double>(b);
    std::cout << a << " " << std::hex << b << " " << c << std::endl;
}
example output:
1.1 3ff199999999999a 1.1
How about reading that 64-bit number and using reinterpret_cast to convert it to the bitwise-equivalent floating-point number:
int64_t a = 121314;
double b = *reinterpret_cast<double*>(&a);
int64_t c = *reinterpret_cast<int64_t*>(&b);
assert(a==c);
From a .c file of another guy, I saw this:
const float c = 0.70710678118654752440084436210485f;
where he wants to avoid the computation of sqrt(1/2).
Can this really be stored somehow with plain C/C++? I mean, without losing precision. It seems impossible to me.
I am using C++, but I do not believe that the precision difference between these two languages is too big (if any); that's why I did not test it.
So, I wrote these few lines, to have a look at the behaviour of the code:
std::cout << "Number: 0.70710678118654752440084436210485\n";
const float f = 0.70710678118654752440084436210485f;
std::cout << "float: " << std::setprecision(32) << f << std::endl;
const double d = 0.70710678118654752440084436210485; // no f extension
std::cout << "double: " << std::setprecision(32) << d << std::endl;
const double df = 0.70710678118654752440084436210485f;
std::cout << "doublef: " << std::setprecision(32) << df << std::endl;
const long double ld = 0.70710678118654752440084436210485;
std::cout << "l double: " << std::setprecision(32) << ld << std::endl;
const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;
The output is this:
* ** ***
v v v
Number: 0.70710678118654752440084436210485 // 32 decimal digits
float: 0.707106769084930419921875 // 24 decimal digits
double: 0.70710678118654757273731092936941
doublef: 0.707106769084930419921875 // same as float
l double: 0.70710678118654757273731092936941 // same as double
l doublel: 0.70710678118654752438189403651592 // suffix l
where * is the last accurate digit of float, ** the last accurate digit of double and *** the last accurate digit of long double.
The output of double has 32 decimal digits, since I have set the precision of std::cout at that value.
float output has 24, as expected, as said here:
float has 24 binary bits of precision, and double has 53.
I would expect the last output to be the same as the second-to-last one, i.e. that the f suffix would not prevent the number from becoming a double. I think that when I write this:
const double df = 0.70710678118654752440084436210485f;
what happens is that the number first becomes a float and is then stored as a double, so after the 24th decimal digit it has zeroes, and that's why the double precision stops there.
Am I correct?
From this answer I found some relevant information:
float x = 0 has an implicit typecast from int to float.
float x = 0.0f does not have such a typecast.
float x = 0.0 has an implicit typecast from double to float.
[EDIT]
About __float128, it is not standard, thus it's out of the competition. See more here.
From the standard:
There are three floating point types: float, double, and long double.
The type double provides at least as much precision as float, and the
type long double provides at least as much precision as double. The
set of values of the type float is a subset of the set of values of
the type double; the set of values of the type double is a subset of
the set of values of the type long double. The value representation of
floating-point types is implementation-defined.
So you can see your issue with this question: the standard doesn't actually say how precise floats are.
In terms of standard implementations, you need to look at IEEE754, which means the other two answers from Irineau and Davidmh are perfectly valid approaches to the problem.
As to suffix letters to indicate type, again looking at the standard:
The type of a floating literal is double unless explicitly specified by
a suffix. The suffixes f and F specify float, the suffixes l and L specify
long double.
So your attempt to create a long double will just have the same precision as the double literal you are assigning to it unless you use the L suffix.
I understand that some of these answers may not seem satisfactory, but there is a lot of background reading to be done on the relevant standards before you can dismiss answers. This answer is already longer than intended so I won't try and explain everything here.
And as a final note: Since the precision is not clearly defined, why not have a constant that's longer than it needs to be? Seems to make sense to always define a constant that is precise enough to always be representable regardless of type.
Python's numerical library, numpy, has a very convenient float info function. All the types are equivalent to C's:
For C's float:
print numpy.finfo(numpy.float32)
Machine parameters for float32
---------------------------------------------------------------------
precision= 6 resolution= 1.0000000e-06
machep= -23 eps= 1.1920929e-07
negep = -24 epsneg= 5.9604645e-08
minexp= -126 tiny= 1.1754944e-38
maxexp= 128 max= 3.4028235e+38
nexp = 8 min= -max
---------------------------------------------------------------------
For C's double:
print numpy.finfo(numpy.float64)
Machine parameters for float64
---------------------------------------------------------------------
precision= 15 resolution= 1.0000000000000001e-15
machep= -52 eps= 2.2204460492503131e-16
negep = -53 epsneg= 1.1102230246251565e-16
minexp= -1022 tiny= 2.2250738585072014e-308
maxexp= 1024 max= 1.7976931348623157e+308
nexp = 11 min= -max
---------------------------------------------------------------------
And for C's long double:
print numpy.finfo(numpy.float128)
Machine parameters for float128
---------------------------------------------------------------------
precision= 18 resolution= 1e-18
machep= -63 eps= 1.08420217249e-19
negep = -64 epsneg= 5.42101086243e-20
minexp=-16382 tiny= 3.36210314311e-4932
maxexp= 16384 max= 1.18973149536e+4932
nexp = 15 min= -max
---------------------------------------------------------------------
So, not even long double (128 bits here) will give you the 32 digits you want. But do you really need them all?
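For comparison, the same parameters can be queried directly in C++ via std::numeric_limits; a small sketch (the printed values are typical for x86, not guaranteed):
#include <iostream>
#include <limits>

template <typename T>
void info(const char* name) {
    std::cout << name
              << ": digits10 = " << std::numeric_limits<T>::digits10
              << ", epsilon = " << std::numeric_limits<T>::epsilon() << '\n';
}

int main() {
    info<float>("float");             // typically 6, ~1.19e-07
    info<double>("double");           // typically 15, ~2.22e-16
    info<long double>("long double"); // typically 18, ~1.08e-19 on x86
}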
Some compilers have an implementation of the binary128 floating-point format, standardized by IEEE 754-2008. Using gcc, for example, the type is __float128. That floating-point format has about 34 decimal digits of precision (log(2^113)/log(10)).
You can use the Boost Multiprecision library and its float128 wrapper. That implementation will either use native types, if available, or a drop-in replacement.
Let's extend your experiment with that new non-standard type __float128, with a recent g++ (4.8):
// Compiled with g++ -Wall -lquadmath essai.cpp
#include <iostream>
#include <iomanip>
#include <quadmath.h>
#include <sstream>

std::ostream& operator<<(std::ostream& out, __float128 f) {
    char buf[200];
    std::ostringstream format;
    format << "%." << (std::min)(190L, out.precision()) << "Qf";
    quadmath_snprintf(buf, 200, format.str().c_str(), f);
    out << buf;
    return out;
}

int main() {
    std::cout.precision(32);
    std::cout << "Number: 0.70710678118654752440084436210485\n";
    const float f = 0.70710678118654752440084436210485f;
    std::cout << "float: " << std::setprecision(32) << f << std::endl;
    const double d = 0.70710678118654752440084436210485; // no f extension
    std::cout << "double: " << std::setprecision(32) << d << std::endl;
    const double df = 0.70710678118654752440084436210485f;
    std::cout << "doublef: " << std::setprecision(32) << df << std::endl;
    const long double ld = 0.70710678118654752440084436210485;
    std::cout << "l double: " << std::setprecision(32) << ld << std::endl;
    const long double ldl = 0.70710678118654752440084436210485l; // l suffix!
    std::cout << "l doublel: " << std::setprecision(32) << ldl << std::endl;
    const __float128 f128 = 0.70710678118654752440084436210485;
    const __float128 f128f = 0.70710678118654752440084436210485f; // f suffix
    const __float128 f128l = 0.70710678118654752440084436210485l; // l suffix
    const __float128 f128q = 0.70710678118654752440084436210485q; // q suffix
    std::cout << "f128: " << f128 << std::endl;
    std::cout << "f f128: " << f128f << std::endl;
    std::cout << "l f128: " << f128l << std::endl;
    std::cout << "q f128: " << f128q << std::endl;
}
The output is:
* ** *** ****
v v v v
Number: 0.70710678118654752440084436210485
float: 0.707106769084930419921875
double: 0.70710678118654757273731092936941
doublef: 0.707106769084930419921875
l double: 0.70710678118654757273731092936941
l doublel: 0.70710678118654752438189403651592
f128: 0.70710678118654757273731092936941
f f128: 0.70710676908493041992187500000000
l f128: 0.70710678118654752438189403651592
q f128: 0.70710678118654752440084436210485
where * is the last accurate digit of float, ** the last accurate digit of
double, *** the last accurate digit of long double, and **** is the
last accurate digit of __float128.
As said by another answer, the C++ standard does not say what the precision of the various floating-point types is (just as it does not say what the sizes of the integral types are). It only specifies their minimal precision/size. But the IEEE 754 norm does specify all that! The FPUs of a lot of architectures implement that norm, and recent versions of gcc implement the binary128 type of the norm with the extension __float128.
As for the explanation of your code, or mine: an expression like 0.70710678118654752440084436210485f is a floating-point literal. It has a type, which is defined by its suffix, here f for float. The value of the literal is therefore the nearest value of the given type to the given number. That explains why, for example, the precision of "doublef" is the same as that of "float" in your code. In recent gcc versions, there is an extension that allows defining floating-point literals of type __float128, with the Q suffix (quadruple precision).
Please have a look at the following code
#include <iostream>
#include <iomanip>
#include <cmath>
using namespace std;

int main()
{
    //int side1 = 0;
    //int side2 = 0;
    //int rightSide = 0;
    cout << "Right Side" << setw(10) << "Side1" << setw(10) << "Side2" << endl;
    for(int i=1; i<=500; i++)
    {
        //side1++;
        //cout << side1 << endl;
        for(int a=1; a<=500; a++)
        {
            //side2++;
            //cout << "side 2 " << side2 << endl;
            for(int c=1; c<=500; c++)
            {
                //rightSide++;
                int rightSideSqr = pow(c,c);
                int side1Sqr = pow(i,i);
                int side2Sqr = pow(a,a);
                if(rightSideSqr == (side1Sqr+side2Sqr))
                {
                    cout << rightSideSqr << setw(15) << i << setw(10) << a << endl;
                }
            }
        }
    }
}
This gives the error "PythagorialTriples.cpp:28: error: call of overloaded `pow(int&, int&)' is ambiguous". This doesn't happen if I simply use a manual power like i*i instead of the function. Can someone please explain why this is happening? I am new to C++ anyway. Thanks.
There are multiple overloads for pow defined in <cmath>. In your code, these 3 are all equally valid, therefore the compiler is having trouble choosing the right one:
pow(float, int);
pow(double, int);
pow(long double, int);
The simplest solution is to use static_cast on the first argument, to remove any ambiguity. e.g.
int side1Sqr = pow(static_cast<double>(i), i);
int side2Sqr = pow(static_cast<double>(a), a);
Whoa! pow(x,y) is x raised to the yth power (in mathematical terms: x^y)!! NOT x*y.
So you're trying to compute i^i inside a 500^3 nested loop. Probably not what you want. Replace it with pow(i,2) for your desired behavior.
Note: @Mooing Duck raises an excellent point about x^y in C++, where ^ is the XOR operator. But I think you sort of figured that out if you're already using pow anyway.
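Putting the two fixes together (the cast to resolve the ambiguity, and squaring instead of pow(x,x)), a corrected sketch of the loop, using the plain integer multiplication the question already mentions:
#include <iostream>
#include <iomanip>
using namespace std;

int main()
{
    cout << "Right Side" << setw(10) << "Side1" << setw(10) << "Side2" << endl;
    for (int i = 1; i <= 500; i++)
        for (int a = 1; a <= 500; a++)
            for (int c = 1; c <= 500; c++)
            {
                // squares via plain integer multiplication: no pow overloads,
                // no ambiguity, and exact for values up to 500*500
                int rightSideSqr = c * c;
                int side1Sqr = i * i;
                int side2Sqr = a * a;
                if (rightSideSqr == side1Sqr + side2Sqr)
                    cout << rightSideSqr << setw(15) << i << setw(10) << a << endl;
            }
}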
It cannot figure out which overloaded function to use.
Try
pow(double(i), i);
or
pow(double(i), 2);
Since that looks like what you want.
Are you sure that you can handle such a big pow?
pow(x,y) is x^y. Look at the reference:
http://www.cplusplus.com/reference/clibrary/cmath/pow/
double pow ( double base, double exponent );
long double pow ( long double base, long double exponent );
float pow ( float base, float exponent );
double pow ( double base, int exponent );
long double pow ( long double base, int exponent );
There is no int version. Your compiler didn't know which one is correct; you have to tell it by using static_cast, like:
int side1Sqr = pow(static_cast<double>(i), i);
For high-precision calculation you can use:
http://gmplib.org/
There is also a solution from Boost.