Best way of checking if a floating point is an integer - c++
[There are a few questions on this but none of the answers are particularly definitive and several are out of date with the current C++ standard].
My research shows these are the principal methods used to check if a floating point value can be converted to an integral type T.
if (f >= std::numeric_limits<T>::min() && f <= std::numeric_limits<T>::max() && f == (T)f)
using std::fmod to extract the remainder and test equality to 0.
using std::remainder and test equality to 0.
The first test assumes that a cast from f to a T instance is defined. Not true for std::int64_t to float, for example.
With C++11, which one is best? Is there a better way?
Conclusion:
The answer is: use std::trunc(f) == f; the time difference is insignificant when comparing all these methods. Even though the specific IEEE bit-unwinding code we write in the example below is technically twice as fast, we are only talking about a nanosecond or so faster per call.
The maintenance costs in the long run, though, would be significantly higher, so a solution that is easier for a maintainer to read and understand is better.
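In code, the recommended test is simply the one-liner below (a sketch; the function name is mine, and if you also intend to convert the result you still need the range check from the question):
#include <cmath>
bool is_whole_number(double f)
{
    return std::trunc(f) == f;   // true when f has no fractional part (note: also true for +/-infinity)
}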
Time in milliseconds to complete 12,000,000 operations on a random set of numbers:
IEEE breakdown: 18
std::trunc(f) == f 32
std::floor(val) - val == 0 35
((uint64_t)f - f) == 0.0 38
std::fmod(val, 1.0) == 0 87
Working out the conclusion:
A floating point number has two parts:
mantissa: The data part of the value.
exponent: a power to multiply it by.
such that:
value = mantissa * (2^exponent)
So the exponent is basically how many binary digits we are going to shift the "binary point" down the mantissa. A positive value shifts it right, a negative value shifts it left. If all the digits to the right of the binary point are zero then we have an integer.
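For example, 6.5 is stored as 1.101 (binary) * 2^2; shifting the binary point two places right gives 110.1, which still has a non-zero bit after the point, so 6.5 is not an integer. 6.0 is 1.100 (binary) * 2^2, which gives 110.0, so it is.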
If we assume IEEE 754
We should note that in this representation the value is normalized so that the most significant bit of the mantissa is 1. Since this bit is always set it is not actually stored (the processor knows it's there and compensates accordingly).
So:
If the exponent < 0 then you definitely do not have an integer, as it can only be representing a fractional value. If the exponent >= <Number of bits in Mantissa> then there is definitely no fractional part and it is an integer (though you may not be able to hold it in an int).
Otherwise we have to do some work: if exponent >= 0 && exponent < <Number of bits in Mantissa> then you may be representing an integer, and it is one exactly when the low-order bits of the mantissa that fall below the binary point (the mask computed in the code below) are all zero.
Additionally, as part of the normalization, 127 is added to the exponent (so that no negative values are stored in the 8-bit exponent field).
#include <limits>
#include <iostream>
#include <cmath>
/*
* Bit 31 Sign
* Bits 30-23 Exponent
* Bits 22-00 Mantissa
*/
bool is_IEEE754_32BitFloat_AnInt(float val)
{
    // Put the value in an int so we can do bitwise operations.
    // (Type punning via reinterpret_cast technically violates strict aliasing;
    //  copying the bytes with std::memcpy is the well-defined alternative.)
    int valAsInt = *reinterpret_cast<int*>(&val);
    // Remember to subtract 127 from the exponent (to get its real value).
    int exponent = ((valAsInt >> 23) & 0xFF) - 127;
    int bitsInFraction = 23 - exponent;
    int mask = exponent < 0
                   ? 0x7FFFFFFF
                   : exponent > 23
                         ? 0x00
                         : (1 << bitsInFraction) - 1;
    return !(valAsInt & mask);
}
/*
* Bit 63 Sign
* Bits 62-52 Exponent
* Bits 51-00 Mantissa
*/
bool is_IEEE754_64BitFloat_AnInt(double val)
{
    // Put the value in a uint64_t so we can do bitwise operations.
    uint64_t valAsInt = *reinterpret_cast<uint64_t*>(&val);
    // Remember to subtract 1023 from the exponent (to get its real value).
    int exponent = ((valAsInt >> 52) & 0x7FF) - 1023;
    int bitsInFraction = 52 - exponent;
    uint64_t mask = exponent < 0
                        ? 0x7FFFFFFFFFFFFFFFLL
                        : exponent > 52
                            ? 0x00
                            : (1LL << bitsInFraction) - 1;
    return !(valAsInt & mask);
}
bool is_Trunc_32BitFloat_AnInt(float val)
{
return (std::trunc(val) - val == 0.0F);
}
bool is_Trunc_64BitFloat_AnInt(double val)
{
return (std::trunc(val) - val == 0.0);
}
bool is_IntCast_64BitFloat_AnInt(double val)
{
return (uint64_t(val) - val == 0.0);
}
template<typename T, bool isIEEE = std::numeric_limits<T>::is_iec559>
bool isInt(T f);
template<>
bool isInt<float, true>(float f) {return is_IEEE754_32BitFloat_AnInt(f);}
template<>
bool isInt<double, true>(double f) {return is_IEEE754_64BitFloat_AnInt(f);}
template<>
bool isInt<float, false>(float f) {return is_Trunc_32BitFloat_AnInt(f);}
template<>
bool isInt<double, false>(double f) {return is_Trunc_64BitFloat_AnInt(f);}
int main()
{
double x = 16;
std::cout << x << "=> " << isInt(x) << "\n";
x = 16.4;
std::cout << x << "=> " << isInt(x) << "\n";
x = 123.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 0.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 2.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 4.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 5.0;
std::cout << x << "=> " << isInt(x) << "\n";
x = 1.0;
std::cout << x << "=> " << isInt(x) << "\n";
}
Results:
> ./a.out
16=> 1
16.4=> 0
123=> 1
0=> 1
2=> 1
4=> 1
5=> 1
1=> 1
Running Some Timing tests.
Test data was generated like this:
(for a in {1..3000000};do echo $RANDOM.$RANDOM;done ) > test.data
(for a in {1..3000000};do echo $RANDOM;done ) >> test.data
(for a in {1..3000000};do echo $RANDOM$RANDOM0000;done ) >> test.data
(for a in {1..3000000};do echo 0.$RANDOM;done ) >> test.data
Modified main() to run tests:
#include <fstream>
#include <vector>
#include <iterator>
#include <chrono>

int main()
{
// ORIGINAL CODE still here.
// Added this trivial speed test.
std::ifstream testData("test.data"); // The 12,000,000 random numbers generated above
std::vector<double> test{std::istream_iterator<double>(testData), std::istream_iterator<double>()};
std::cout << "Data Size: " << test.size() << "\n";
int count1 = 0;
int count2 = 0;
int count3 = 0;
auto start = std::chrono::system_clock::now();
for(auto const& v: test)
{ count1 += is_IEEE754_64BitFloat_AnInt(v);
}
auto p1 = std::chrono::system_clock::now();
for(auto const& v: test)
{ count2 += is_Trunc_64BitFloat_AnInt(v);
}
auto p2 = std::chrono::system_clock::now();
for(auto const& v: test)
{ count3 += is_IntCast_64BitFloat_AnInt(v);
}
auto end = std::chrono::system_clock::now();
std::cout << "IEEE " << count1 << " Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(p1 - start).count() << "\n";
std::cout << "Trunc " << count2 << " Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(p2 - p1).count() << "\n";
std::cout << "Int Cast " << count3 << " Time: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - p2).count() << "\n"; }
The tests show:
> ./a.out
16=> 1
16.4=> 0
123=> 1
0=> 1
2=> 1
4=> 1
5=> 1
1=> 1
Data Size: 12000000
IEEE 6000199 Time: 18
Trunc 6000199 Time: 32
Int Cast 6000199 Time: 38
The IEEE code (in this simple test) seems to beat the truncate method and generates the same result. But the amount of time is insignificant: over 12 million calls we saw a difference of 14 milliseconds.
Use std::fmod(f, 1.0) == 0.0 where f is either a float, double, or long double. If you're worried about spurious effects of unwanted floating point promotions when using floats, then use either 1.0f or the more comprehensive
std::fmod(f, static_cast<decltype(f)>(1.0)) == 0.0
which will force, obviously at compile time, the correct overload to be called. The return value of std::fmod(f, 1.0) has the same sign as f and lies strictly inside (-1, 1); since -0.0 compares equal to 0.0, it's perfectly safe to compare the result to 0.0 to complete your integer check.
If it turns out that f is an integer, then make sure it's within the permitted range of your chosen type before attempting a cast: else you risk invoking undefined behaviour. I see that you're already familiar with std::numeric_limits which can help you here.
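A minimal sketch of that combination (the function name is mine, not from this answer; note that the limits-based range check still carries the representability caveat discussed in a later answer below):
#include <cmath>
#include <limits>

template <typename T>
bool isIntegralAndInRange(double f)
{
    // Range check first so that a later static_cast<T>(f) is defined,
    // then the fmod test for a zero fractional part.
    return f >= std::numeric_limits<T>::min()
        && f <= std::numeric_limits<T>::max()
        && std::fmod(f, 1.0) == 0.0;
}

// Usage: if (isIntegralAndInRange<int>(f)) { int i = static_cast<int>(f); ... }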
My reservations against using std::remainder are possibly (i) my being a Luddite and (ii) it not being available in some compilers partially implementing the C++11 standard, such as MSVC12. I don't like solutions involving casts since the notation hides that reasonably expensive operation and you need to check in advance for safety. If you must adopt your first choice, at least replace the C-style cast with static_cast<T>(f);
This test is good:
if ( f >= std::numeric_limits<T>::min()
&& f <= std::numeric_limits<T>::max()
&& f == (T)f)
These tests are incomplete:
using std::fmod to extract the remainder and test equality to 0.
using std::remainder and test equality to 0.
They both fail to check that the conversion to T is defined. Float-to-integral conversions that overflow the integral type result in undefined behaviour, which is even worse than roundoff.
I would recommend avoiding std::fmod for another reason. This code:
int isinteger(double d) {
return std::numeric_limits<int>::min() <= d
&& d <= std::numeric_limits<int>::max()
&& std::fmod(d, 1.0) == 0;
}
compiles (gcc version 4.9.1 20140903 (prerelease) (GCC) on x86_64 Arch Linux using -g -O3 -std=gnu++0x) to this:
0000000000400800 <_Z9isintegerd>:
400800: 66 0f 2e 05 10 01 00 ucomisd 0x110(%rip),%xmm0 # 400918 <_IO_stdin_used+0x18>
400807: 00
400808: 72 56 jb 400860 <_Z9isintegerd+0x60>
40080a: f2 0f 10 0d 0e 01 00 movsd 0x10e(%rip),%xmm1 # 400920 <_IO_stdin_used+0x20>
400811: 00
400812: 66 0f 2e c8 ucomisd %xmm0,%xmm1
400816: 72 48 jb 400860 <_Z9isintegerd+0x60>
400818: 48 83 ec 18 sub $0x18,%rsp
40081c: d9 e8 fld1
40081e: f2 0f 11 04 24 movsd %xmm0,(%rsp)
400823: dd 04 24 fldl (%rsp)
400826: d9 f8 fprem
400828: df e0 fnstsw %ax
40082a: f6 c4 04 test $0x4,%ah
40082d: 75 f7 jne 400826 <_Z9isintegerd+0x26>
40082f: dd d9 fstp %st(1)
400831: dd 5c 24 08 fstpl 0x8(%rsp)
400835: f2 0f 10 4c 24 08 movsd 0x8(%rsp),%xmm1
40083b: 66 0f 2e c9 ucomisd %xmm1,%xmm1
40083f: 7a 22 jp 400863 <_Z9isintegerd+0x63>
400841: 66 0f ef c0 pxor %xmm0,%xmm0
400845: 31 c0 xor %eax,%eax
400847: ba 00 00 00 00 mov $0x0,%edx
40084c: 66 0f 2e c8 ucomisd %xmm0,%xmm1
400850: 0f 9b c0 setnp %al
400853: 0f 45 c2 cmovne %edx,%eax
400856: 48 83 c4 18 add $0x18,%rsp
40085a: c3 retq
40085b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
400860: 31 c0 xor %eax,%eax
400862: c3 retq
400863: f2 0f 10 0d bd 00 00 movsd 0xbd(%rip),%xmm1 # 400928 <_IO_stdin_used+0x28>
40086a: 00
40086b: e8 20 fd ff ff callq 400590 <fmod#plt>
400870: 66 0f 28 c8 movapd %xmm0,%xmm1
400874: eb cb jmp 400841 <_Z9isintegerd+0x41>
400876: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40087d: 00 00 00
The first five instructions implement the range check against std::numeric_limits<int>::min() and std::numeric_limits<int>::max(). The rest is the fmod test: the loop at 400826..40082d re-executes the partial-remainder instruction fprem until it reports completion, plus a fallback path that calls the library fmod for the case where a NaN somehow arose.
You get similar code by using remainder.
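For comparison, here is a trunc-based variant (my sketch, not from this answer); with SSE4.1 enabled, current gcc and clang typically inline std::trunc to a single roundsd instruction, avoiding the fprem loop above:
#include <cmath>
#include <limits>

int isinteger_trunc(double d) {
    return std::numeric_limits<int>::min() <= d
        && d <= std::numeric_limits<int>::max()
        && std::trunc(d) == d;
}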
Some other options to consider (different compilers / libraries may produce different intrinsic sequences for these tests and be faster/slower):
bool is_int(float f) { return floor(f) == f; }
This is in addition to the tests for overflow you have...
If you are looking to really optimize, you could try the following (works for positive floats, not thoroughly tested): This assumes IEEE 32-bit floats, which are not mandated by the C++ standard AFAIK.
bool is_int(float f)
{
    // Adding 2^23 pushes f into [2^23, 2^24), where floats are spaced exactly
    // 1 apart, so the addition rounds f to a whole number; subtracting 2^23
    // back is exact. f survives the round trip only if it was already whole.
    const float nf = f + float(1 << 23);
    const float bf = nf - float(1 << 23);
    return f == bf;
}
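A double-precision analogue of the same trick (my sketch, with the same caveats: IEEE 754 binary64, default round-to-nearest, non-negative values below 2^52):
bool is_int(double d)
{
    // In [2^52, 2^53) doubles are spaced exactly 1 apart, so adding 2^52
    // rounds d to a whole number; subtracting it back is exact.
    const double nd = d + double(1LL << 52);
    const double bd = nd - double(1LL << 52);
    return d == bd;
}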
I'd go deep into the IEEE 754 standard and keep thinking only in terms of this type; I'll be assuming 64 bit integers and doubles.
The number is a whole number iff:
the number is zero (regardless of the sign).
the number's mantissa contributes no binary fraction (regardless of the sign), and the exponent is not so large that the least significant digits fall outside the mantissa.
I made following function:
#include <stdio.h>
int IsThisDoubleAnInt(double number)
{
long long ieee754 = *(long long *)&number;
long long sign = ieee754 >> 63;
long long exp = ((ieee754 >> 52) & 0x7FFLL);
long long mantissa = ieee754 & 0xFFFFFFFFFFFFFLL;
long long e = exp - 1023;
long long decimalmask = (1LL << (e + 52));
if (decimalmask) decimalmask -= 1;
if (((exp == 0) && (mantissa != 0)) || (e > 52) || (e < 0) || ((mantissa & decimalmask) != 0))
{
return 0;
}
else
{
return 1;
}
}
As a test of this function:
int main()
{
double x = 1;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1.5;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 2;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 2.000000001;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1e60;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1e-60;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1.0/0.0;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = x/x;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 0.99;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = 1LL << 52;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
x = (1LL << 52) + 1;
printf("x = %e is%s int.\n", x, IsThisDoubleAnInt(x)?"":" not");
}
The result is following:
x = 1.000000e+00 is int.
x = 1.500000e+00 is not int.
x = 2.000000e+00 is int.
x = 2.000000e+00 is not int.
x = 1.000000e+60 is not int.
x = 1.000000e-60 is not int.
x = inf is not int.
x = nan is not int.
x = 9.900000e-01 is not int.
x = 4.503600e+15 is int.
x = 4.503600e+15 is not int.
The condition in the method is not very clear, thus I'm posting the less obfuscated version with commented if/else structure.
int IsThisDoubleAnIntWithExplanation(double number)
{
long long ieee754 = *(long long *)&number;
long long sign = ieee754 >> 63;
long long exp = ((ieee754 >> 52) & 0x7FFLL);
long long mantissa = ieee754 & 0xFFFFFFFFFFFFFLL;
if (exp == 0)
{
if (mantissa == 0)
{
// This is signed zero.
return 1;
}
else
{
// this is a subnormal number
return 0;
}
}
else if (exp == 0x7FFL)
{
// it is infinity or nan.
return 0;
}
else
{
long long e = exp - 1023;
long long decimalmask = (1LL << (e + 52));
if (decimalmask) decimalmask -= 1;
printf("%f: %llx (%lld %lld %llx) %llx\n", number, ieee754, sign, e, mantissa, decimalmask);
// number is something in form (-1)^sign x 2^exp-1023 x 1.mantissa
if (e > 63)
{
// number too large to fit into integer
return 0;
}
else if (e > 52)
{
// number too large to have all digits...
return 0;
}
else if (e < 0)
{
// magnitude is less than 1 (but nonzero), so it cannot be a whole number
return 0;
}
else if ((mantissa & decimalmask) != 0)
{
// number has nonzero fraction part.
return 0;
}
}
return 1;
}
Personally I would recommend using the trunc function introduced in C++11 to check if f is integral:
#include <cmath>
#include <type_traits>
template<typename F>
bool isIntegral(F f) {
static_assert(std::is_floating_point<F>::value, "The function isIntegral is only defined for floating-point types.");
return std::trunc(f) == f;
}
It involves no casting and no floating point arithmetic, both of which can be a source of error. The truncation of the decimal places can surely be done without introducing a numerical error by setting the corresponding bits of the mantissa to zero, at least if the floating point values are represented according to the IEEE 754 standard.
Personally I would hesitate to use fmod or remainder for checking whether f is integral because I am not sure whether the result can underflow to zero and thus fake an integral value. In any case it is easier to show that trunc works without numerical error.
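For illustration, a few example calls to the isIntegral template above (my examples; the NaN and infinity results follow from IEEE 754 comparison semantics, and the last two lines need #include <limits>):
isIntegral(2.0);    // true
isIntegral(2.5);    // false
isIntegral(-3.0f);  // true
isIntegral(std::numeric_limits<double>::quiet_NaN()); // false: NaN never compares equal to anything
isIntegral(std::numeric_limits<double>::infinity());  // true: trunc(inf) == inf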
None of the three above methods actually checks whether the floating point number f can be represented as a value of type T. An extra check is necessary.
The first option actually does exactly that: It checks whether f is integral and can be represented as a value of type T. It does so by evaluating f == (T)f. This check involves a cast. Such a cast is undefined according to §1 in section 4.9 of the C++11 standard "if the truncated value cannot be represented in the destination type". Thus if f is, for example, larger than or equal to std::numeric_limits<T>::max()+1, the cast certainly has undefined behavior as a consequence.
That is probably why the first option has an additional range check (f >= std::numeric_limits<T>::min() && f <= std::numeric_limits<T>::max()) before performing the cast. This range check could also be used for the other methods (trunc, fmod, remainder) in order to determine whether f can be represented as a value of type T. However, the check is flawed since it can run into undefined behavior:
In this check the limits std::numeric_limits<T>::min/max() get converted to the floating point type for applying the equality operator. For example if T=uint32_t and f being a float, std::numeric_limits<T>::max() is not representable as a floating point number. The C++11 standard then states in section 4.9 §2 that the implementation is free to choose the next lower or higher representable value. If it chooses the higher representable value and f happens to be equal to the higher representable value the subsequent cast is undefined according to §1 in section 4.9 since the (truncated) value cannot be represented in the destination type (uint32_t).
std::cout << std::numeric_limits<uint32_t>::max() << std::endl; // 4294967295
std::cout << std::setprecision(20) << static_cast<float>(std::numeric_limits<uint32_t>::max()) << std::endl; // 4294967296 (float is a single precision IEEE 754 floating point number here)
std::cout << static_cast<uint32_t>(static_cast<float>(std::numeric_limits<uint32_t>::max())) << std::endl; // Could be for example 4294967295 due to undefined behavior according to the standard in the cast to the uint32_t.
Consequently, the first option would establish that f is integral and representable as uint32_t even though it is not.
Fixing the range check in general is not easy. The fact that signed integers and floating point numbers do not have a fixed representation (such as two's complement or IEEE 754) according to the standard does not make things easier. One possibility is to write non-portable code for the specific compiler, architecture and types you use. A more portable solution is to use Boost's NumericConversion library:
#include <boost/numeric/conversion/cast.hpp>
template<typename T, typename F>
bool isRepresentableAs(F f) {
static_assert(std::is_floating_point<F>::value && std::is_integral<T>::value, "The function isRepresentableAs is only defined for floating-point to integral types.");
return boost::numeric::converter<T, F>::out_of_range(f) == boost::numeric::cInRange && isIntegral(f);
}
Then you can finally perform the cast safely:
double f = 333.0;
if (isRepresentableAs<uint32_t>(f))
std::cout << static_cast<uint32_t>(f) << std::endl;
else
std::cout << f << " is not representable as uint32_t." << std::endl;
// Output: 333
what about converting types like this?
bool can_convert(float a)
{
    int b = a;    // truncate to int (undefined if a is outside int's range)
    float c = b;  // convert back to float
    return a == c;
}
The problem with:
if ( f >= std::numeric_limits<T>::min()
&& f <= std::numeric_limits<T>::max()
&& f == (T)f)
is that if T is (for example) 64 bits, then the max will be rounded when converting to your usual 64 bit double :-( Assuming 2's complement, the same is not true of the min, of course.
So, depending on the number of bits in the mantissa, and the number of bits in T, you need to mask off the LS bits of std::numeric_limits<T>::max()... I'm sorry, I don't do C++, so how best to do that I leave to others. [In C it would be something along the lines of LLONG_MAX ^ (LLONG_MAX >> DBL_MANT_DIG) -- assuming T is long long int and f is double and that these are both the usual 64 bit values.]
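In C++ the same idea might look like this (a sketch under the same assumptions: T is long long, f is double, both the usual 64-bit IEEE/two's-complement types; the constant name is mine):
#include <cfloat>
#include <limits>

const long long double_safe_max =
    std::numeric_limits<long long>::max()
    ^ (std::numeric_limits<long long>::max() >> DBL_MANT_DIG);
// == 0x7FFFFFFFFFFFFC00, i.e. 2^63 - 1024: the largest long long value that
// converts to a 53-bit-significand double without rounding.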
If the T is constant, then the construction of the two floating point values for min and max will (I assume) be done at compile time, so the two comparisons are pretty straightforward. You don't really need to be able to float T... but you do need to know that its min and max will fit in an ordinary integer (long long int, say).
The remaining work is converting the float to integer, and then floating that back up again for the final comparison. So, assuming f is in range (which guarantees (T)f does not overflow):
i = (T)f ; // or i = (long long int)f ;
ok = (i == f) ;
The alternative seems to be:
i = (T)f ; // or i = (long long int)f ;
ok = (floor(f) == f) ;
as noted elsewhere. Which replaces the floating of i by floor(f)... which I'm not convinced is an improvement.
If f is NaN things may go wrong, so you might want to test for that too.
You could try unpacking f with frexp() and extract the mantissa as (say) a long long int (with ldexp() and a cast), but when I started to sketch that out it looked ugly :-(
Having slept on it, a simpler way of dealing with the max issue is to do: min <= f < ((unsigned)max+1) -- or min <= f < (unsigned)min -- or (double)min <= f < -(double)min -- or any other method of constructing -2^(n-1) and +2^(n-1) as floating point values, where n is the number of bits in T.
(Serves me right for getting interested in a problem at 1:00am !)
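A sketch of that bound construction (my code, assuming T is a two's-complement signed integer type and f is a double; it also rejects NaN, since every comparison with NaN is false):
#include <limits>

template <typename T>
bool fits_in(double f)
{
    // numeric_limits<T>::min() is -2^(n-1), an exact power of two, so it
    // converts to double without rounding; -lo is +2^(n-1).
    const double lo = static_cast<double>(std::numeric_limits<T>::min());
    return f >= lo && f < -lo;
}

// e.g. if (fits_in<long long>(f) && std::trunc(f) == f) { long long i = (long long)f; ... }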
First of all, I want to see if I got your question right. From what I've read, it seems that you want to determine if a floating-point is actually simply a representation of an integral type in floating-point.
As far as I know, performing == on a floating-point is not safe due to floating-point inaccuracies. Therefore I am proposing the following solution,
template<typename F, typename I = size_t>
bool is_integral(F f)
{
    return fabs(f - static_cast<I>(f)) <= std::numeric_limits<F>::epsilon();
}
The idea is to simply find the absolute difference between the original floating-point and the floating-point casted to the integral type, and then determine if it is smaller than the epsilon of the floating-point type. I'm assuming here that if it is smaller than epsilon, the difference is of no importance to us.
Thank you for reading.
Use modf() which breaks the value into integral and fractional parts. From this direct test, it is known if the double is a whole number or not. After this, limit tests against the min/max of the target integer type can be done.
#include <cmath>
bool IsInteger(double x) {
double ipart;
return std::modf(x, &ipart) == 0.0; // Test if fraction is 0.0.
}
Note modf() differs from the similar named fmod().
Of the 3 methods OP posted, the cast to/from an integer may perform a fair amount of work doing the casts and compare. The other 2 are marginally the same. They work, assuming no unexpected rounding mode effects from dividing by 1.0. But do an unnecessary divide.
As to which is fastest likely depends on the mix of doubles used.
OP's first method has a singular advantage: since the goal is to test whether a FP value converts exactly to some integer, and if the test is true the conversion will likely then be needed, OP's first method has already done that conversion.
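A sketch (mine, not from this answer) of the modf-then-limit-then-convert flow described at the start of this answer, for a long long target:
#include <cmath>
#include <limits>

bool ToInteger(double x, long long& out)
{
    double ipart;
    if (std::modf(x, &ipart) != 0.0)      // nonzero fraction (also rejects NaN)
        return false;
    // -2^63 converts to double exactly; every integral double in [-2^63, 2^63)
    // fits in a long long. Infinities fail this range test.
    const double lo = static_cast<double>(std::numeric_limits<long long>::min());
    if (!(ipart >= lo && ipart < -lo))
        return false;
    out = static_cast<long long>(ipart);
    return true;
}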
Here is what I would try:
float originalNumber;
cin >> originalNumber;
int temp = (int) originalNumber;
if (originalNumber - temp != 0) // a nonzero difference means a fractional part (works for negative values too)
{
// It is not an integer
}
else
{
// It is an integer
}
If your question is "Can I convert this double to int without loss of information?" then I would do something simple like :
template <typename T, typename U>
bool CanConvert(U u)
{
return U(T(u)) == u;
}
CanConvert<int>(1.0) -- true
CanConvert<int>(1.5) -- false
CanConvert<int>(1e9) -- true
CanConvert<int>(1e10)-- false
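A guarded variant (my sketch, for signed integral T and floating-point U) that avoids the undefined cast for out-of-range inputs such as CanConvert<int>(1e10):
#include <limits>
#include <type_traits>

template <typename T, typename U>
bool CanConvertSafely(U u)
{
    static_assert(std::is_integral<T>::value && std::is_signed<T>::value, "T must be a signed integer type");
    static_assert(std::is_floating_point<U>::value, "U must be a floating-point type");
    // -2^(n-1) is an exact power of two, so it converts to U without rounding.
    const U lo = static_cast<U>(std::numeric_limits<T>::min());
    return u >= lo && u < -lo && U(T(u)) == u;
}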