Why this comparison to zero is not working properly? - c++

I have this code:
double A = doSomethingWonderful(); // same as doing A = 0;
if(A == 0)
{
fprintf(stderr,"A=%llx\n", A);
}
and this output:
A=7f35959a51c0
how is this possible?
I checked the value of 7f35959a51c0 and seems to be something like 6.91040329973658785751176861252E-310, which is very small, but not zero.
EDIT:
Ok I understood that that way of printing the hex value for a double is not working. I need to find a way to print the bytes of the double.
Following the comments I modified my code:
A = doSomethingWonderful();// same as doing A = 0;
if(A == 0)
{
char bytes[8];
memcpy(bytes, &A, sizeof(double));
for(int i = 0; i < 8; i++)
fprintf(stderr," %x", bytes[i]);
}
and I get this output:
0 0 0 0 0 0 0 0
So finally it seems that the comparison is working properly but I was doing a bad print.

IEEE 754 precision floating point values use a bias in the exponent value in order to fully represent both positive and negative exponents. For double-precision values, that bias is 1023[source], which happens to be 0x3ff in hex, which matches the hex value of A that you printed for 1, or 0e0.
Two other small notes:
When printing bytes, you can use %hhx to get it to only print 2 hex digits instead of sign-extending to 8.
You can use a union to reliably print the double value as an 8-byte integer.
double A = 0;
if(A == 0)
{
A = 1; // Note that you were setting A to 1 here!
char bytes[8];
memcpy(bytes, &A, sizeof(double));
for(int i = 0; i < 8; i++)
printf(" %hhx", bytes[i]);
}
int isZero;
union {
unsigned long i;
double d;
} u;
u.d = 0;
isZero = (u.d == 0.0);
printf("\n============\n");
printf("hex = %lx\nfloat = %f\nzero? %d\n", u.i, u.d, isZero);
Result:
0 0 0 0 0 0 f0 3f
============
hex = 0
float = 0.000000
zero? 1
So in the first line, we see that 1.0 is 0e0 (i.e., 00).
In the following lines, we see that when you use a union to print the hex value of the double 0.0, you get 0 as expected.

When you pass your double to printf(), you pass it as a floating point value. However, since the "%x" format is an integer format, your printf() implementation will try to read an integer argument. Due to this fundamental type mismatch, it is possible, for instance, that the calling code places your double value in a floating point register, while the printf() implementation tries to read it from an integer register. Details depend on your ABI, but apparently the bits that you see are not the bits that you passed. From a language point of view, you have undefined behavior the moment that you have a type mismatch between one printf() argument and its corresponding format specification.
Apart from that, +0.0 is indeed represented as all bits zero, both in single and in double precision formats. However, this is only positive zero, -0.0 is represented with the sign bit set.
In your last bit of code, you are inspecting the bit pattern of 1.0, because you overwrite the value of A before you do the conversion. Note also that you get fffffff0 instead of f0 for the seventh byte because of sign extension. For correct output, use an array of unsigned bytes.
The pattern that you are seeing decodes like this:
00 00 00 00 00 00 f0 3f
to big endian:
3f f0 00 00 00 00 00 00
decode fields:
sign: 0 (1 bit)
exponent: 01111111111 (11 bit), value = 1023
exponent = value - bias = 1023 - 1023 = 0
mantissa: 0...0 (52 bit), value with implicit leading 1 bit: 1.0000...
entire value: -1^0 * 2^0 * 1.0 = 1.0

Related

not able to shift hex data in a unsigned long

i am trying to convert IEEE 754 Floating Point Representation to its Decimal Equivalent so i have an example data [7E FF 01 46 4B CD CC CC CC CC CC 10 40 1B 7E] which is in hex.
char strResponseData[STATUS_BUFFERSIZE]={0};
unsigned long strData = (((strResponseData[12] & 0xFF)<< 512 ) |((strResponseData[11] & 0xFF) << 256) |((strResponseData[10] & 0xFF)<< 128 ) |((strResponseData[9] & 0xFF)<< 64) |((strResponseData[8] & 0xFF)<< 32 ) |((strResponseData[7]& 0xFF) << 16) |((strResponseData[6] & 0xFF )<< 8) |(strResponseData[5] & 0xFF));
value = IEEEHexToDec(strData,1);
then i am passing this value to this function
IEEEHexToDec(unsigned long number, int isDoublePrecision)
{
int mantissaShift = isDoublePrecision ? 52 : 23;
unsigned long exponentMask = isDoublePrecision ? 0x7FF0000000000000 : 0x7f800000;
int bias = isDoublePrecision ? 1023 : 127;
int signShift = isDoublePrecision ? 63 : 31;
int sign = (number >> signShift) & 0x01;
int exponent = ((number & exponentMask) >> mantissaShift) - bias;
int power = -1;
double total = 0.0;
for ( int i = 0; i < mantissaShift; i++ )
{
int calc = (number >> (mantissaShift-i-1)) & 0x01;
total += calc * pow(2.0, power);
power--;
}
double value = (sign ? -1 : 1) * pow(2.0, exponent) * (total + 1.0);
return value;
}
but in return am getting value 0, also when am trying to print strData it is giving me only CCCCCD.
i am using eclipse ide.
please i need some suggestion
((strResponseData[12] & 0xFF)<< 512 )
First, the << operator takes a number of bits to shift, you seem to be confusing it with multiplication by the resulting power of two - while it has the same effect, you need to supply the exponent. Given that you have no typical data types of 512 bit width, it's fairly certain that this should actually be.
((strResponseData[12] & 0xFF)<< 9 )
Next, it's necessary for the value to be shifted to be of a sufficient type to hold the result before you do the shift. A char is obviously not sufficient, so you need to explicitly cast the value to a sufficient type to hold the result before you perform the shift.
Additionally keep in mind that depending on your platform an unsigned long may be either a 32 bit or 64 bit type, so if you were doing an operation with a bit shift where the result would not fit in 32 bits, you may want to use an unsigned long long or better yet make things unambiguous, for example with #include <stdint.h> and type such as uint32_t or uint64_t. Given that your question is tagged "embedded" this is especially important to keep in mind as you might be targeting a 32 (or even 8) bit processor, but sometimes building algorithms to test on the development machine instead.
Further, a char can be either a signed or an unsigned type. Before shifting, you should make that explicit. Given that you are combining multiple pieces of something, it is almost certain that at least most of these should be treated as unsigned.
So probably you want something like
((uint32_t)(strResponseData[12] & 0xFF)<< 9 )
Unless you are on an odd platform where char is not 8 bits (for example some TI DSP's) you probably don't need to pre-mask with 0xff, but it's not hurting anything
Finally it is not 100% clear what you are staring with:
i have an example data [7E FF 01 46 4B CD CC CC CC CC CC 10 40 1B 7E] which is in hex.
Is ambiguous as it is not clear if you mean
[0x7e, 0xff, 0x01, 0x46...]
Which would be an array of byte values which debugging code has printed out in hex for human convenience, or if you actually mean that you something such as
"[7E FF 01 46 .... ]"
Which string of text containing a human readable representation of hex digits as printable characters. In the latter case, you'd first have to convert the character representation of hex digits or octets into into numeric values.

C++ char arithmetic overflow

#include <stdio.h>
int main()
{
char a = 30;
char b = 40;
char c = 10;
printf ("%d ", char(a*b));
char d = (a * b) / c;
printf ("%d ", d);
return 0;
}
The above code yields normal int value if 127 > x > -127
and a overflow value if other. I can't understand how the overflow value is calculated. As -80 in this case.
Thanks
The trick here is how numbers are represented. Look into 2's complement. 30 * 40 in binary is 1200 or 10010110000 base 2. But our char is only 8 bits so we chop off the leading 100 (and all the implied 0s before that). This leaves us with 1011000.
Note the leading 1. In 2s complement, how your computer probably stores the values, this indicates a negative number. 11111111 is -1, 11111110 is -2 and so on. If go down to 1011000 we get to -80.
That is, if we convert 1011000 to 2s complement we're left with -80.
You can do 2s complement by hand. Take the value, drop the leading sign bit and swap the other values. In this case 10110000 turns into 01001111 in binary this would be 79. Turn it negative and remove one more because we don't start at zero and we're at -80.
Char has only 1 byte. In this case 1200 is 0100 1011 0000 (binary).
For one byte you can only assign 8 bit, in your case: 1011 0000 (first 4 bits will be deleted). Now you have -80 (first bit shows if negative (1) or positive (0)).
Try with your calculator (programmer) and type 1200 decimal and switch from Qword to Byte and you can see what happens with your number.

Best way of checking if a floating point is an integer

[There are a few questions on this but none of the answers are particularly definitive and several are out of date with the current C++ standard].
My research shows these are the principal methods used to check if a floating point value can be converted to an integral type T.
if (f >= std::numeric_limits<T>::min() && f <= std::numeric_limits<T>::max() && f == (T)f))
using std::fmod to extract the remainder and test equality to 0.
using std::remainder and test equality to 0.
The first test assumes that a cast from f to a T instance is defined. Not true for std::int64_t to float, for example.
With C++11, which one is best? Is there a better way?
Conclusion:
The answer is use std::trunc(f) == f the time difference is insignificant when comparing all these methods. Even if the specific IEEE unwinding code we write in the example below is technically twice is fast we are only talking about 1 nano second faster.
The maintenance costs in the long run though would be significantly higher. So use a solution that is easier to read and understand by the maintainer is better.
Time in microseconds to complete 12,000,000 operations on a random set of numbers:
IEEE breakdown: 18
std::trunc(f) == f 32
std::floor(val) - val == 0 35
((uint64_t)f) - f) == 0.0 38
std::fmod(val, 1.0) == 0 87
The Working out of the conclusion.
A floating point number is two parts:
mantissa: The data part of the value.
exponent: a power to multiply it by.
such that:
value = mantissa * (2^exponent)
So the exponent is basically how many binary digits we are going to shift the "binary point" down the mantissa. A positive value shifts it right a negative value shifts it left. If all the digits to the right of the binary point are zero then we have an integer.
If we assume IEEE 754
We should note that this representation the value is normalized so that the most significant bit in the mantissa is shifted to be 1. Since this bit is always set it is not actually stored (the processor knows its there and compensates accordingly).
So:
If the exponent < 0 then you definitely do not have an integer as it can only be representing a fractional value. If the exponent >= <Number of bits In Mantissa> then there is definately no fractual part and it is an integer (though you may not be able to hold it in an int).
Otherwise we have to do some work. if the exponent >= 0 && exponent < <Number of bits In Mantissa> then you may be representing an integer if the mantissa is all zero in the bottom half (defined below).
Additional as part of the normalization 127 is added to the exponent (so that there are no negative values stored in the 8 bit exponent field).
#include <limits>
#include <iostream>
#include <cmath>
/*
* Bit 31 Sign
* Bits 30-23 Exponent
* Bits 22-00 Mantissa
*/
bool is_IEEE754_32BitFloat_AnInt(float val)
{
// Put the value in an int so we can do bitwise operations.
int valAsInt = *reinterpret_cast<int*>(&val);
// Remember to subtract 127 from the exponent (to get real value)
int exponent = ((valAsInt >> 23) & 0xFF) - 127;
int bitsInFraction = 23 - exponent;
int mask = exponent < 0
? 0x7FFFFFFF
: exponent > 23
? 0x00
: (1 << bitsInFraction) - 1;
return !(valAsInt & mask);
}
/*
* Bit 63 Sign
* Bits 62-52 Exponent
* Bits 51-00 Mantissa
*/
bool is_IEEE754_64BitFloat_AnInt(double val)
{
// Put the value in an long long so we can do bitwise operations.
uint64_t valAsInt = *reinterpret_cast<uint64_t*>(&val);
// Remember to subtract 1023 from the exponent (to get real value)
int exponent = ((valAsInt >> 52) & 0x7FF) - 1023;
int bitsInFraction = 52 - exponent;
uint64_t mask = exponent < 0
? 0x7FFFFFFFFFFFFFFFLL
: exponent > 52
? 0x00
: (1LL << bitsInFraction) - 1;
return !(valAsInt & mask);
}
bool is_Trunc_32BitFloat_AnInt(float val)
{
return (std::trunc(val) - val == 0.0F);
}
bool is_Trunc_64BitFloat_AnInt(double val)
{
return (std::trunc(val) - val == 0.0);
}
bool is_IntCast_64BitFloat_AnInt(double val)
{
return (uint64_t(val) - val == 0.0);
}
template<typename T, bool isIEEE = std::numeric_limits<T>::is_iec559>
bool isInt(T f);
template<>
bool isInt<float, true>(float f) {return is_IEEE754_32BitFloat_AnInt(f);}
template<>
bool isInt<double, true>(double f) {return is_IEEE754_64BitFloat_AnInt(f);}
template<>
bool isInt<float, false>(float f) {return is_Trunc_64BitFloat_AnInt(f);}
template<>
bool isInt<double, false>(double f) {return is_Trunc_64BitFloat_AnInt(f);}
int main()
{
double x = 16;
std::cout << x << "=> " << isInt(x) << "\n";
x = 16.4;
std::cout << x << "=> " << isInt(x) << "\n";
x = 123.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 0.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 2.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 4.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 5.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 1.0;
std::cout << x << "=> " << isInt(x) << "\n";
}
Results:
> ./a.out
16=> 1
16.4=> 0
123=> 1
0=> 1
2=> 1
4=> 1
5=> 1
1=> 1
Running Some Timing tests.
Test data was generated like this:
(for a in {1..3000000};do echo $RANDOM.$RANDOM;done ) > test.data
(for a in {1..3000000};do echo $RANDOM;done ) >> test.data
(for a in {1..3000000};do echo $RANDOM$RANDOM0000;done ) >> test.data
(for a in {1..3000000};do echo 0.$RANDOM;done ) >> test.data
Modified main() to run tests:
int main()
{
// ORIGINAL CODE still here.
// Added this trivial speed test.
std::ifstream testData("test.data"); // Generated a million random numbers
std::vector<double> test{std::istream_iterator<double>(testData), std::istream_iterator<double>()};
std::cout << "Data Size: " << test.size() << "\n";
int count1 = 0;
int count2 = 0;
int count3 = 0;
auto start = std::chrono::system_clock::now();
for(auto const& v: test)
{ count1 += is_IEEE754_64BitFloat_AnInt(v);
}
auto p1 = std::chrono::system_clock::now();
for(auto const& v: test)
{ count2 += is_Trunc_64BitFloat_AnInt(v);
}
auto p2 = std::chrono::system_clock::now();
for(auto const& v: test)
{ count3 += is_IntCast_64BitFloat_AnInt(v);
}
auto end = std::chrono::system_clock::now();
std::cout << "IEEE " << count1 << " Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(p1 - start).count() << "\n";
std::cout << "Trunc " << count2 << " Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(p2 - p1).count() << "\n";
std::cout << "Int Cast " << count3 << " Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - p2).count() << "\n"; }
The tests show:
> ./a.out
16=> 1
16.4=> 0
123=> 1
0=> 1
2=> 1
4=> 1
5=> 1
1=> 1
Data Size: 12000000
IEEE 6000199 Time: 18
Trunc 6000199 Time: 32
Int Cast 6000199 Time: 38
The IEEE code (in this simple test) seem to beat the truncate method and generate the same result. BUT the amount of time is insignificant. Over 12 million calls we saw a difference in 14 milliseconds.
Use std::fmod(f, 1.0) == 0.0 where f is either a float, double, or long double. If you're worried about spurious effects of unwanted floating point promotions when using floats, then use either 1.0f or the more comprehensive
std::fmod(f, static_cast<decltype(f)>(1.0)) == 0.0
which will force, obviously at compile time, the correct overload to be called. The return value of std::fmod(f, ...) will be in the range [0, 1) and it's perfectly safe to compare to 0.0 to complete your integer check.
If it turns out that f is an integer, then make sure it's within the permitted range of your chosen type before attempting a cast: else you risk invoking undefined behaviour. I see that you're already familiar with std::numeric_limits which can help you here.
My reservations against using std::remainder are possibly (i) my being a Luddite and (ii) it not being available in some compilers partially implementing the C++11 standard, such as MSVC12. I don't like solutions involving casts since the notation hides that reasonably expensive operation and you need to check in advance for safety. If you must adopt your first choice, at least replace the C-style cast with static_cast<T>(f);
This test is good:
if ( f >= std::numeric_limits<T>::min()
&& f <= std::numeric_limits<T>::max()
&& f == (T)f))
These tests are incomplete:
using std::fmod to extract the remainder and test equality to 0.
using std::remainder and test equality to 0.
They both fail to check that the conversion to T is defined. Float-to-integral conversions that overflow the integral type result in undefined behaviour, which is even worse than roundoff.
I would recommend avoiding std::fmod for another reason. This code:
int isinteger(double d) {
return std::numeric_limits<int>::min() <= d
&& d <= std::numeric_limits<int>::max()
&& std::fmod(d, 1.0) == 0;
}
compiles (gcc version 4.9.1 20140903 (prerelease) (GCC) on x86_64 Arch Linux using -g -O3 -std=gnu++0x) to this:
0000000000400800 <_Z9isintegerd>:
400800: 66 0f 2e 05 10 01 00 ucomisd 0x110(%rip),%xmm0 # 400918 <_IO_stdin_used+0x18>
400807: 00
400808: 72 56 jb 400860 <_Z9isintegerd+0x60>
40080a: f2 0f 10 0d 0e 01 00 movsd 0x10e(%rip),%xmm1 # 400920 <_IO_stdin_used+0x20>
400811: 00
400812: 66 0f 2e c8 ucomisd %xmm0,%xmm1
400816: 72 48 jb 400860 <_Z9isintegerd+0x60>
400818: 48 83 ec 18 sub $0x18,%rsp
40081c: d9 e8 fld1
40081e: f2 0f 11 04 24 movsd %xmm0,(%rsp)
400823: dd 04 24 fldl (%rsp)
400826: d9 f8 fprem
400828: df e0 fnstsw %ax
40082a: f6 c4 04 test $0x4,%ah
40082d: 75 f7 jne 400826 <_Z9isintegerd+0x26>
40082f: dd d9 fstp %st(1)
400831: dd 5c 24 08 fstpl 0x8(%rsp)
400835: f2 0f 10 4c 24 08 movsd 0x8(%rsp),%xmm1
40083b: 66 0f 2e c9 ucomisd %xmm1,%xmm1
40083f: 7a 22 jp 400863 <_Z9isintegerd+0x63>
400841: 66 0f ef c0 pxor %xmm0,%xmm0
400845: 31 c0 xor %eax,%eax
400847: ba 00 00 00 00 mov $0x0,%edx
40084c: 66 0f 2e c8 ucomisd %xmm0,%xmm1
400850: 0f 9b c0 setnp %al
400853: 0f 45 c2 cmovne %edx,%eax
400856: 48 83 c4 18 add $0x18,%rsp
40085a: c3 retq
40085b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400860: 31 c0 xor %eax,%eax
400862: c3 retq
400863: f2 0f 10 0d bd 00 00 movsd 0xbd(%rip),%xmm1 # 400928 <_IO_stdin_used+0x28>
40086a: 00
40086b: e8 20 fd ff ff callq 400590 <fmod#plt>
400870: 66 0f 28 c8 movapd %xmm0,%xmm1
400874: eb cb jmp 400841 <_Z9isintegerd+0x41>
400876: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40087d: 00 00 00
The first five instructions implement the range check against std::numeric_limits<int>::min() and std::numeric_limits<int>::max(). The rest is the fmod test, accounting for all the misbehaviour of a single invocation of the fprem instruction (400828..40082d) and some case where a NaN somehow arose.
You get similar code by using remainder.
Some other options to consider (different compilers / libraries may produce different intrinsic sequences for these tests and be faster/slower):
bool is_int(float f) { return floor(f) == f; }
This is in addition to the tests for overflow you have...
If you are looking to really optimize, you could try the following (works for positive floats, not thoroughly tested): This assumes IEEE 32-bit floats, which are not mandated by the C++ standard AFAIK.
bool is_int(float f)
{
const float nf = f + float(1 << 23);
const float bf = nf - float(1 << 23);
return f == bf;
}
I'd go deep into the IEE 754 standard and keep thinking only in terms of this type and I'll be assuming 64 bit integers and doubles.
The number is a whole number iff:
the number is zero (regardless on the sign).
the number has mantisa not going to binary fractions (regardless on the sing), while not having any undefined digits for least significant bits.
I made following function:
#include <stdio.h>
int IsThisDoubleAnInt(double number)
{
long long ieee754 = *(long long *)&number;
long long sign = ieee754 >> 63;
long long exp = ((ieee754 >> 52) & 0x7FFLL);
long long mantissa = ieee754 & 0xFFFFFFFFFFFFFLL;
long long e = exp - 1023;
long long decimalmask = (1LL << (e + 52));
if (decimalmask) decimalmask -= 1;
if (((exp == 0) && (mantissa != 0)) || (e > 52) || (e < 0) || ((mantissa & decimalmask) != 0))
{
return 0;
}
else
{
return 1;
}
}
As a test of this function:
int main()
{
double x = 1;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1.5;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 2;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 2.000000001;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1e60;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1e-60;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1.0/0.0;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = x/x;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 0.99;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1LL << 52;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = (1LL << 52) + 1;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
}
The result is following:
x = 1.000000e+00 is int.
x = 1.500000e+00 is not int.
x = 2.000000e+00 is int.
x = 2.000000e+00 is not int.
x = 1.000000e+60 is not int.
x = 1.000000e-60 is not int.
x = inf is not int.
x = nan is not int.
x = 9.900000e-01 is not int.
x = 4.503600e+15 is int.
x = 4.503600e+15 is not int.
The condition in the method is not very clear, thus I'm posting the less obfuscated version with commented if/else structure.
int IsThisDoubleAnIntWithExplanation(double number)
{
long long ieee754 = *(long long *)&number;
long long sign = ieee754 >> 63;
long long exp = ((ieee754 >> 52) & 0x7FFLL);
long long mantissa = ieee754 & 0xFFFFFFFFFFFFFLL;
if (exp == 0)
{
if (mantissa == 0)
{
// This is signed zero.
return 1;
}
else
{
// this is a subnormal number
return 0;
}
}
else if (exp == 0x7FFL)
{
// it is infinity or nan.
return 0;
}
else
{
long long e = exp - 1023;
long long decimalmask = (1LL << (e + 52));
if (decimalmask) decimalmask -= 1;
printf("%f: %llx (%lld %lld %llx) %llx\n", number, ieee754, sign, e, mantissa, decimalmask);
// number is something in form (-1)^sign x 2^exp-1023 x 1.mantissa
if (e > 63)
{
// number too large to fit into integer
return 0;
}
else if (e > 52)
{
// number too large to have all digits...
return 0;
}
else if (e < 0)
{
// number too large to have all digits...
return 0;
}
else if ((mantissa & decimalmask) != 0)
{
// number has nonzero fraction part.
return 0;
}
}
return 1;
}
Personally I would recommend using the trunc function introduced in C++11 to check if f is integral:
#include <cmath>
#include <type_traits>
template<typename F>
bool isIntegral(F f) {
static_assert(std::is_floating_point<F>::value, "The function isIntegral is only defined for floating-point types.");
return std::trunc(f) == f;
}
It involves no casting and no floating point arithmetics both of which can be a source of error. The truncation of the decimal places can surely be done without introducing a numerical error by setting the corresponding bits of the mantissa to zero at least if the floating point values are represented according to the IEEE 754 standard.
Personally I would hesitate to use fmod or remainder for checking whether f is integral because I am not sure whether the result can underflow to zero and thus fake an integral value. In any case it is easier to show that trunc works without numerical error.
None of the three above methods actually checks whether the floating point number f can be represented as a value of type T. An extra check is necessary.
The first option actually does exactly that: It checks whether f is integral and can be represented as a value of type T. It does so by evaluating f == (T)f. This check involves a cast. Such a cast is undefined according to §1 in section 4.9 of the C++11 standard "if the truncated value cannot be represented in the destination type". Thus if f is e.g. larger or equal to std::numeric_limits<T>::max()+1 the truncated value will certainly have an undefined behavior as a consequence.
That is probably why the first option has an additional range check (f >= std::numeric_limits<T>::min() && f <= std::numeric_limits<T>::max()) before performing the cast. This range check could also be used for the other methods (trunc, fmod, remainder) in order to determine whether f can be represented as a value of type T. However, the check is flawed since it can run into undefined behavior:
In this check the limits std::numeric_limits<T>::min/max() get converted to the floating point type for applying the equality operator. For example if T=uint32_t and f being a float, std::numeric_limits<T>::max() is not representable as a floating point number. The C++11 standard then states in section 4.9 §2 that the implementation is free to choose the next lower or higher representable value. If it chooses the higher representable value and f happens to be equal to the higher representable value the subsequent cast is undefined according to §1 in section 4.9 since the (truncated) value cannot be represented in the destination type (uint32_t).
std::cout << std::numeric_limits<uint32_t>::max() << std::endl; // 4294967295
std::cout << std::setprecision(20) << static_cast<float>(std::numeric_limits<uint32_t>::max()) << std::endl; // 4294967296 (float is a single precision IEEE 754 floating point number here)
std::cout << static_cast<uint32_t>(static_cast<float>(std::numeric_limits<uint32_t>::max())) << std::endl; // Could be for example 4294967295 due to undefined behavior according to the standard in the cast to the uint32_t.
Consequently, the first option would establish that f is integral and representable as uint32_t even though it is not.
Fixing the range check in general is not easy. The fact that signed integers and floating point numbers do not have a fixed representation (such as two's complement or IEEE 754) according to the standard do not make things easier. One possibility is to write non-portable code for the specific compiler, architecture and types you use. A more portable solution is to use Boost's NumericConversion library:
#include <boost/numeric/conversion/cast.hpp>
template<typename T, typename F>
bool isRepresentableAs(F f) {
static_assert(std::is_floating_point<F>::value && std::is_integral<T>::value, "The function isRepresentableAs is only defined for floating-point as integral types.");
return boost::numeric::converter<T, F>::out_of_range(f) == boost::numeric::cInRange && isIntegral(f);
}
Then you can finally perform the cast safely:
double f = 333.0;
if (isRepresentableAs<uint32_t>(f))
std::cout << static_cast<uint32_t>(f) << std::endl;
else
std::cout << f << " is not representable as uint32_t." << std::endl;
// Output: 333
what about converting types like this?
bool can_convert(float a, int i)
{
int b = a;
float c = i;
return a == c;
}
The problem with:
if ( f >= std::numeric_limits<T>::min()
&& f <= std::numeric_limits<T>::max()
&& f == (T)f))
is that if T is (for example) 64 bits, then the max will be rounded when converting to your usual 64 bit double :-( Assuming 2's complement, the same is not true of the min, of course.
So, depending on the number of bits in the mantisaa, and the number of bits in T, you need to mask off the LS bits of std::numeric_limits::max()... I'm sorry, I don't do C++, so how best to do that I leave to others. [In C it would be something along the lines of LLONG_MAX ^ (LLONG_MAX >> DBL_MANT_DIG) -- assuming T is long long int and f is double and that these are both the usual 64 bit values.]
If the T is constant, then the construction of the two floating point values for min and max will (I assume) be done at compile time, so the two comparisons are pretty straightforward. You don't really need to be able to float T... but you do need to know that its min and max will fit in an ordinary integer (long long int, say).
The remaining work is converting the float to integer, and then floating that back up again for the final comparison. So, assuming f is in range (which guarantees (T)f does not overflow):
i = (T)f ; // or i = (long long int)f ;
ok = (i == f) ;
The alternative seems to be:
i = (T)f ; // or i = (long long int)f ;
ok = (floor(f) == f) ;
as noted elsewhere. Which replaces the floating of i by floor(f)... which I'm not convinced is an improvement.
If f is NaN things may go wrong, so you might want to test for that too.
You could try unpacking f with frexp() and extract the mantissa as (say) a long long int (with ldexp() and a cast), but when I started to sketch that out it looked ugly :-(
Having slept on it, a simpler way of dealing with the max issue is to do: min <= f < ((unsigned)max+1) -- or min <= f < (unsigned)min -- or (double)min <= f < -(double)min -- or any other method of constructing -2^(n-1) and +2^(n-1) as floating point values, where n is the number of bits in T.
(Serves me right for getting interested in a problem at 1:00am !)
First of all, I want to see if I got your question right. From what I've read, it seems that you want to determine if a floating-point is actually simply a representation of an integral type in floating-point.
As far as I know, performing == on a floating-point is not safe due to floating-point inaccuracies. Therefore I am proposing the following solution,
template<typename F, typename I = size_t>
bool is_integral(F f)
{
return fabs(f - static_cast<I>(f)) <= std::numeric_limits<F>::epsilon;
}
The idea is to simply find the absolute difference between the original floating-point and the floating-point casted to the integral type, and then determine if it is smaller than the epsilon of the floating-point type. I'm assuming here that if it is smaller than epsilon, the difference is of no importance to us.
Thank you for reading.
Use modf() which breaks the value into integral and fractional parts. From this direct test, it is known if the double is a whole number or not. After this, limit tests against the min/max of the target integer type can be done.
#include <cmath>
bool IsInteger(double x) {
double ipart;
return std::modf(x, &ipart) == 0.0; // Test if fraction is 0.0.
}
Note modf() differs from the similar named fmod().
Of the 3 methods OP posted, the cast to/from an integer may perform a fair amount of work doing the casts and compare. The other 2 are marginally the same. They work, assuming no unexpected rounding mode effects from dividing by 1.0. But do an unnecessary divide.
As to which is fastest likely depends on the mix of doubles used.
OP's first method has a singular advantage: Since the goal is to test if a FP may convert exactly to a some integer, and likely then if the result is true, the conversion needs to then occur, OP's first method has already done the conversion.
Here is what I would try:
float originalNumber;
cin >> originalNumber;
int temp = (int) originalNumber;
if (originalNumber-temp > 0)
{
// It is not an integer
}
else
{
// It is an integer
}
If your question is "Can I convert this double to int without loss of information?" then I would do something simple like :
template <typename T, typename U>
bool CanConvert(U u)
{
return U(T(u)) == u;
}
CanConvert<int>(1.0) -- true
CanConvert<int>(1.5) -- false
CanConvert<int>(1e9) -- true
CanConvert<int>(1e10)-- false

Convert binary to hex value (32 bits)

i am trying to convert binary value as hex value and i got the following code and it is working well up to 28 bit but not for 32 bit.
The code is as follows.
int main()
{
long int longint=0;
string buf;
cin>>buf;
int len=buf.size();
for(int i=0;i<len;i++)
{
longint+=( buf[len-i-1]-48) * pow((double)2,i);
}
cout<<setbase(16);
cout<<longint;
return 0;
}
If I input 28 '1' (111111111111111111111111) then the output is fffffff
but if i input 32 '1' (11111111111111111111111111111111) then the output is 80000000.
Can anyone please explain why this is happenning and also in the above code why 48 is subtracted .
The problem seems to be with the use of pow, which uses floating-point math, if I recall correctly.. You may be running into issues with overflow.
A more elegant way to calculate powers of two is by using bit-shifts:
2^0 = 1 << 0 = 1
2^1 = 1 << 1 = 2
2^2 = 1 << 2 = 4
2^n = 1 << n
You are using int32 and it is getting out of range when you use it for 32 bytes,try using int64 i.e long long
unsigned long long longint=0; //Change Here
string buf;
cin>>buf;
int len=buf.length();
for(int i=0;i<len;i++)
{
longint+=( buf[len-i-1]-48) * pow((double)2,i);
}
cout<<setbase(16);
cout<<longint;
As Nathan's post, it'll display correct when changing your code like this,
longint += (buf[len-i-1]-'0') << i;
This is because you have forced ( buf[len-i-1]-48) * pow((double)2,i) to be converted to double, and double is 8 bytes long, but it has to store extra information, it can not be full charged to represent 0x80000000, you can find more information here and your last ( buf[len-i-1]-48) * pow((double)2,i) (when i is 31) expression already overflow.
But something weard happend when converting from 4294967295.0000000(which is 0xffffffff) to int , it just come out 0x80000000 , but I am very sorry I don't know why(Please reference the comment from TonyK).
You can change it to
longint+=(long int)(( buf[len-i-1]-48) * pow(2,i));
Why minus 48 ? because ascii for '0' is 48, you want to convert from literal '0' to numeric 0, you have to do it.
The reason is that float-value is with limit precision and pow() computer use numerical calculation approximation which isn't definitely precise. To get a precise value, you should use bit-wise ">>" instead.
You can see how pow() function works as below.
I changed your code
longint+= ( buf[len-i-1]-48) * pow((double)2,i)
to the code below, which is equal, becase pow() return a double value.
double temp = ( buf[len-i-1]-48) * pow((double)2,i);
longint+= temp;
cout<<setbase(16);
cout<<"temp"<<endl;
cout<<temp<<endl;
cout<<longint<<endl;
the output is as below
temp
1.34218e+08
fffffff
temp
2.68435e+08
1fffffff
temp
5.36871e+08
3fffffff
temp
1.07374e+09
7fffffff
temp
2.14748e+09
80000000
final
80000000
Which shows clearly pow() is with limited precision. 2.14748e+09 is not equal to (2^31).
You should use ">>" which is the best or just use conversion to integer which isn't 100 percently correclty, either.
You can see conversion as below.
when I change
double temp = ( buf[len-i-1]-48) * pow((double)2,i);
to
int temp = ( buf[len-i-1]-48) * pow((double)2,i);
the result is
temp
8000000
fffffff
temp
10000000
1fffffff
temp
20000000
3fffffff
temp
40000000
7fffffff
temp
80000000
ffffffff
final
ffffffff
Which works correctly.
substract 48
You got a char from standard input, for example, you got '1' from terminal instead of 1. In order to get 1. You should use '1'-'0'.
The reason : computer store '0'~'9' as a byte with value(48~57). As a result, '1' - '0' equals '1' - 48.

Infinity in MSVC++

I'm using MSVC++, and I want to use the special value INFINITY in my code.
What's the byte pattern or constant to use in MSVC++ for infinity?
Why does 1.0f/0.0f appear to have the value 0?
#include <stdio.h>
#include <limits.h>
int main()
{
float zero = 0.0f ;
float inf = 1.0f/zero ;
printf( "%f\n", inf ) ; // 1.#INF00
printf( "%x\n", inf ) ; // why is this 0?
printf( "%f\n", zero ) ; // 0.000000
printf( "%x\n", zero ) ; // 0
}
Use numeric_limits:
#include <limits>
float maxFloat = std::numeric_limits<float>::infinity();
printf("%x\n", inf) expects an integer (32 bit on MSVC), but receives a double. Hilarity will ensue. Err, I mean: undefined behavior.
(And yes, it receives a double since for a variable argument list, floats are promoted to double).
Edit anyways, you should use numeric_limits, as the other reply says, too.
In the variable arguments list to printf, floats get promoted to doubles. The little-endian byte representation of infinity as a double is 00 00 00 00 00 00 F0 7F.
As peterchen mentioned, "%x" expects an int, not a double. So printf looks at only the first sizeof(int) bytes of the argument. No version of MSVC++ defines int to be larger than 4 bytes, so you get all zeros.
Take a look at numeric_limits::infinity.
That's what happens when you lie to printf(), it gets it wrong. When you use the %x format specifier, it expects an integer to be passed on the stack, not a float passed on the FPU stack. Fix:
printf( "%x\n", *(__int32*)&inf ) ;
You can get infinity out of the <limits> C++ header file:
float inf = std::numeric_limits<float>::infinity().