64-bit overflow math conversion

I have a conversion I am trying to perform:
uint64_t factor = 2345345345; // Actually calculated at runtime, but roughly this magnitude
uint64_t Convert(uint64_t num)
{
return num * 1000ULL / factor;
}
For the largest num values the multiplication wraps before the division by factor. Changing the order to num / factor * 1000ULL loses accuracy, which is not acceptable.
I'd like to rewrite Convert() to handle all possible num values:
uint64_t Convert(uint64_t num)
{
if(num > MAX_UINT64/1000ULL) // pseudo code
{
// Not sure what to put here
}
else
{
return num * 1000ULL / factor;
}
}
We considered using 128-bit math, but would like to avoid it if possible.
What is the most efficient way to implement Convert() so that it handles the largest possible num and still produces the correct result?

A little old-school math: you can use % to calculate the remainder and scale it separately:
uint64_t Convert(uint64_t num)
{
uint64_t m = 1000;
uint64_t a = num / factor;
uint64_t t = num % factor;
uint64_t h = m * t / factor;
return a * m + h;
}
Example:
uint64_t Convert2(uint64_t num)
{
return num * 1000ULL / factor;
}
uint64_t Convert3(uint64_t num)
{
return num / factor * 1000ULL;
}
int main()
{
cout << Convert(std::numeric_limits<uint64_t>::max()) << endl;
cout << Convert2(std::numeric_limits<uint64_t>::max()) << endl;
cout << Convert3(std::numeric_limits<uint64_t>::max()) << endl;
}
Output:
7865257077400 <--- // The correct one //
7865257077 <--- // Multiplication wrapped before the division //
7865257077000 <--- // Low accuracy, loses the remainder //

Factorize your division:
r = 1000*(n/factor) + ((n%factor)*1000)/factor
You can still run into problems if (n%factor)*1000 overflows (when factor is large), but as long as factor is less than MAX_UINT64/1000 you are OK.
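The split above can be packaged as a small function. The following is a minimal sketch, not a definitive implementation: the name MulDiv1000, the explicit factor parameter and the bool return value are assumptions made for this example (the question uses a global factor). It refuses to compute when factor is large enough for the remainder term to wrap.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: computes num * 1000 / factor without 128-bit math.
// Returns false if factor is 0, or so large that (num % factor) * 1000 could wrap.
// Assumes the true result itself fits in 64 bits (it may not when factor < 1000).
bool MulDiv1000(std::uint64_t num, std::uint64_t factor, std::uint64_t& result)
{
    const std::uint64_t m = 1000;
    if (factor == 0 || factor > std::numeric_limits<std::uint64_t>::max() / m)
        return false;
    const std::uint64_t whole = num / factor;  // integer part of num / factor
    const std::uint64_t rest = num % factor;   // rest < factor, so rest * m cannot wrap
    result = whole * m + rest * m / factor;
    return true;
}
```

With the factor from the question, MulDiv1000(UINT64_MAX, 2345345345, r) yields the same 7865257077400 reported as the correct output above.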

Related

how can i get numerator and denominator from a fractional number?

How can I get the numerator and denominator from a fractional number? For example, from "1.375" I want to get "1375/1000" or "11/8" as a result. How can I do it in C++?
I have tried to do it by separating the numbers before the point and after the point but it doesn't give any idea how to get my desired output.

You didn't really specify whether you need to convert a floating point or a string to ratio, so I'm going to assume the former one.
Instead of trying string or arithmetic-based approaches, you can directly use properties of IEEE-754 encoding.
Floats (called binary32 by the standard) are encoded in memory like this:
S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
^ ^
bit 31 bit 0
where S is sign bit, Es are exponent bits (8 of them) Ms are mantissa bits (23 bits).
The number can be decoded like this:
value = (-1)^S * significand * 2 ^ exponent
where:
significand = 1.MMMMMMMMMMMMMMMMMMMMMMM (as binary)
exponent = EEEEEEEE (as binary) - 127
(note: this is for so called "normal numbers", there are also zeroes, subnormals, infinities and NaNs - see Wikipedia page I linked)
This can be used here. We can rewrite the equation above like this:
(-1)^S * significand * 2 ^ exponent = (-1)^S * (significand * 2^23) * 2 ^ (exponent - 23)
The point is that significand * 2^23 is an integer (equal to 1MMMMMMMMMMMMMMMMMMMMMMM in binary: multiplying by 2^23 moves the binary point 23 places to the right). 2 ^ (exponent - 23) is a power of two, so it is either an integer (when exponent - 23 >= 0) or the reciprocal of one (when exponent - 23 < 0).
In other words: we can write the number as:
(significand * 2^23) / 2^(-(exponent - 23)) (when exponent - 23 < 0)
or
[(significand * 2^23) * 2^(exponent - 23)] / 1 (when exponent - 23 >= 0)
So we have both numerator and denominator - directly from binary representation of the number.
All of the above could be implemented like this in C++:
struct Ratio
{
int64_t numerator; // numerator includes sign
uint64_t denominator;
float toFloat() const
{
return static_cast<float>(numerator) / denominator;
}
static Ratio fromFloat(float v)
{
// First, obtain bitwise representation of the value
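// (std::memcpy from <cstring> avoids the undefined behaviour of casting float* to uint32_t*)
uint32_t bitwiseRepr;
std::memcpy(&bitwiseRepr, &v, sizeof(bitwiseRepr));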
// Extract sign, exponent and mantissa bits (as stored in memory) for convenience:
const uint32_t signBit = bitwiseRepr >> 31u;
const uint32_t expBits = (bitwiseRepr >> 23u) & 0xffu; // 8 bits set
const uint32_t mntsBits = bitwiseRepr & 0x7fffffu; // 23 bits set
// Handle some special cases:
if(expBits == 0 && mntsBits == 0)
{
// special case: +0 and -0
return {0, 1};
}
else if(expBits == 255u && mntsBits == 0)
{
// special case: +inf, -inf
// Let's agree that infinity is always represented as 1/0 in Ratio
return {signBit ? -1 : 1, 0};
}
else if(expBits == 255u)
{
// special case: nan
// Let's agree that NaN is represented as max int64_t over 0
return {std::numeric_limits<int64_t>::max(), 0};
}
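// restore the implicit leading 1 bit in front of the 23 mantissa bits
uint32_t significand = (1u << 23u) | mntsBits;
const int64_t signFactor = signBit ? -1 : 1;
const int32_t exp = static_cast<int32_t>(expBits) - 127 - 23;
if(exp < 0)
{
// 64-bit shift, because -exp can exceed 31 (shifts of 64 or more are still not handled)
return {signFactor * static_cast<int64_t>(significand), uint64_t{1} << static_cast<uint32_t>(-exp)};
}
else
{
// widen before shifting so that the product is not truncated to 32 bits
return {signFactor * (static_cast<int64_t>(significand) << static_cast<uint32_t>(exp)), 1};
}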
}
};
(hopefully comments and description above are understandable - let me know, if there's something to improve)
I've omitted checks for out of range values for simplicity.
We can use it like this:
float fv = 1.375f;
Ratio rv = Ratio::fromFloat(fv);
std::cout << "fv = " << fv << ", rv = " << rv.numerator << "/" << rv.denominator << ", rv.toFloat() = " << rv.toFloat() << "\n";
And the output is:
fv = 1.375, rv = 11534336/8388608, rv.toFloat() = 1.375
As you can see, exactly the same values on both ends.
The problem is that the numerator and denominator are big. This is because the code always multiplies the significand by 2^23, even if a smaller value would be enough to make it an integer (this is equivalent to writing 0.2 as 2000000/10000000 instead of 2/10: it's the same thing, only written differently).
This can be solved by scaling the significand by the smallest power of two that makes it an integer (and adjusting the exponent to match), like this (ellipsis stands for parts which are the same as above):
// counts number of subsequent least significant bits equal to 0
// example: for 1001000 (binary) returns 3
uint32_t countTrailingZeroes(uint32_t v)
{
uint32_t counter = 0;
while(counter < 32 && (v & 1u) == 0)
{
v >>= 1u;
++counter;
}
return counter;
}
struct Ratio
{
...
static Ratio fromFloat(float v)
{
...
uint32_t significand = (1u << 23u) | mntsBits;
const uint32_t nTrailingZeroes = countTrailingZeroes(significand);
significand >>= nTrailingZeroes;
const int64_t signFactor = signBit ? -1 : 1;
const int32_t exp = static_cast<int32_t>(expBits) - 127 - 23 + static_cast<int32_t>(nTrailingZeroes);
if(exp < 0)
{
// 64-bit shift, because -exp can exceed 31 (shifts of 64 or more are still not handled)
return {signFactor * static_cast<int64_t>(significand), uint64_t{1} << static_cast<uint32_t>(-exp)};
}
else
{
// widen before shifting so that the product is not truncated to 32 bits
return {signFactor * (static_cast<int64_t>(significand) << static_cast<uint32_t>(exp)), 1};
}
}
}
};
And now, for the following code:
float fv = 1.375f;
Ratio rv = Ratio::fromFloat(fv);
std::cout << "fv = " << fv << ", rv = " << rv.numerator << "/" << rv.denominator << ", rv.toFloat() = " << rv.toFloat() << "\n";
We get:
fv = 1.375, rv = 11/8, rv.toFloat() = 1.375
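The same decomposition can also be obtained without touching the bit pattern, by letting std::frexp split the value into mantissa and exponent. This is a swapped-in technique, not the answer's own code, and only a sketch: the Fraction type and the toFraction name are assumptions for this example, and it assumes a finite input whose reduced fraction fits in 64 bits.

```cpp
#include <cmath>
#include <cstdint>

struct Fraction                 // hypothetical type for this sketch
{
    std::int64_t numerator;     // carries the sign
    std::uint64_t denominator;
};

// Assumes v is finite and that the reduced fraction fits in 64 bits.
Fraction toFraction(float v)
{
    if (v == 0.0f)
        return {0, 1};
    int e = 0;
    const float m = std::frexp(v, &e);   // v == m * 2^e with 0.5 <= |m| < 1
    // A float has 24 significant bits, so m * 2^24 is an exact integer.
    std::int64_t n = static_cast<std::int64_t>(std::ldexp(m, 24));
    e -= 24;                              // now v == n * 2^e
    while (e < 0 && n % 2 == 0)           // cancel common factors of two
    {
        n /= 2;
        ++e;
    }
    if (e < 0)
        return {n, std::uint64_t{1} << -e};
    return {n * (std::int64_t{1} << e), 1};
}
```

For 1.375f the loop strips twenty factors of two from 11534336/8388608 and leaves 11/8, the same reduced result as above.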

In C++ you can use the Boost rational class, but you need to give it a numerator and a denominator.
For this you need to find the number of digits after the decimal point in the input string. You can do this with string manipulation functions: read the input character by character and count the characters after the '.'.
char inputstr[30];
int noint=0, nodec=0;
char intstr[30], dec[30];
int decimalfound = 0;
int denominator = 1;
int numerator;
scanf("%29s",inputstr); // bounded read: inputstr holds at most 29 characters plus '\0'
int len = strlen(inputstr);
for (int i=0; i<len; i++)
{
if (decimalfound ==0)
{
if (inputstr[i] == '.')
{
decimalfound = 1;
}
else
{
intstr[noint++] = inputstr[i];
}
}
else
{
dec[nodec++] = inputstr[i];
denominator *=10;
}
}
dec[nodec] = '\0';
intstr[noint] = '\0';
numerator = atoi(dec) + (atoi(intstr) * denominator); // scale the integer part by 10^(number of decimals)
// You can now use the numerator and denominator as the fraction,
// either in the Rational class or you can find gcd and divide by
// gcd.

What about this simple code:
double n = 1.375;
int num = 1, den = 1;
double frac = (num * 1.0 / den);
double margin = 0.000001;
while (fabs(frac - n) > margin){ // fabs from <cmath>; plain abs may truncate to int
if (frac > n){
den++;
}
else{
num++;
}
frac = (num * 1.0 / den);
}
I haven't really tested it much; it's only an idea.

I hope I'll be forgiven for posting an answer which uses "only the C language". I know you tagged the question with C++ - but I couldn't turn down the bait, sorry. This is still valid C++ at least (although it does, admittedly, use mainly C string-processing techniques).
int num_string_float_to_rat(char *input, long *num, long *den) {
char *tok = NULL, *end = NULL;
char buf[128] = {'\0'};
long a = 0, b = 0;
int den_power = 1;
strncpy(buf, input, sizeof(buf) - 1);
tok = strtok(buf, ".");
if (!tok) return 1;
a = strtol(tok, &end, 10);
if (*end != '\0') return 2;
tok = strtok(NULL, ".");
if (!tok) return 1;
den_power = strlen(tok); // Denominator power of 10
b = strtol(tok, &end, 10);
if (*end != '\0') return 2;
*den = static_cast<int>(pow(10.00, den_power));
*num = a * *den + b;
num_simple_fraction(num, den);
return 0;
}
Sample usage:
int rc = num_string_float_to_rat("0015.0235", &num, &den);
// Check return code -> should be 0!
printf("%ld/%ld\n", num, den);
Output:
30047/2000
Full example at http://codepad.org/CFQQEZkc .
Notes:
strtok() is used to parse the input in to tokens (no need to reinvent the wheel in that regard). strtok() modifies its input - so a temporary buffer is used for safety
it checks for invalid characters - and will return a non-zero return code if found
strtol() has been used instead of atoi() - as it can detect non-numeric characters in the input
scanf() has not been used to slurp the input - due to rounding issues with floating point numbers
the base for strtol() has been explicitly set to 10 to avoid problems with leading zeros (otherwise a leading zero will cause the number to be interpreted as octal)
it uses a num_simple_fraction() helper (not shown) - which in turn uses a gcd() helper (also not shown) - to convert the result to a simple fraction
log10() of the denominator is determined by calculating the length of the token after the decimal point
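Since the answer does not show num_simple_fraction(), here is one way it might look. This is a hypothetical stand-in, not the author's helper: it keeps the pointer-based signature used in the call above and relies on std::gcd, which requires C++17.

```cpp
#include <numeric>   // std::gcd (C++17)

// Hypothetical stand-in for the helper that the answer does not show:
// reduces *num / *den to lowest terms in place.
void num_simple_fraction(long *num, long *den)
{
    const long g = std::gcd(*num, *den);   // gcd of the absolute values
    if (g > 1)
    {
        *num /= g;
        *den /= g;
    }
}
```

For the sample input, 150235/10000 has a greatest common divisor of 5, which gives the 30047/2000 shown in the output.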

I'd do this in three steps.
1) find the decimal point, so that you know how large the denominator has to be.
2) get the numerator. That's just the original text with the decimal point removed.
3) get the denominator. If there was no decimal point, the denominator is 1. Otherwise, the denominator is 10^n, where n is the number of digits to the right of the (now-removed) decimal point.
struct fraction {
std::string num, den;
};
fraction parse(std::string input) {
// 1: find the decimal point
std::size_t dec_point = input.find('.');
// 2: remove it; what is left is the numerator
std::size_t digits_after = 0;
if (dec_point != std::string::npos) {
digits_after = input.length() - dec_point - 1;
input.erase(dec_point, 1);
}
// 3: the denominator is 10^digits_after
int denom = 1;
for (std::size_t i = 0; i < digits_after; ++i)
denom *= 10;
fraction result = { input, std::to_string(denom) };
return result;
}

How to calculate a sum of sequence e^(-x) with accuracy E=0.0001?

So far I can calculate the sum of the sequence, but without the accuracy E.
int t=1, x, k;
float sum, a, result, factorial=1, E=0.0001;
for(k=0;k<=(n);k++){
while(t<=n){
factorial*=t;
t++;
}
sum=(pow(-x,k))/factorial;
sum+=sum;
//while(fabs(sum-???)<E){
// result=sum;
//}
}
So I know the sum of the sequence, sum(k). But to calculate it with accuracy E, I must know the sum of the previous elements, sum(k-1). How do I get sum(k-1) from the for loop?
Sorry for english.

Is this a Taylor series for e^(-x)? If so, you've written it out wrong. I don't think what you've got will converge.
http://www.efunda.com/math/taylor_series/exponential.cfm
e ^ (-x) is 1 + (-x) + (-x)^2/2! + (-x)^3/3! + ...
double calculate_power_of_e(double xx, double accuracy) {
double sum(1.0);
double term(1.0);
for (long kk=1; true; ++kk) {
term *= (-xx) / kk;
sum += term;
if (fabs(term) < accuracy)
break;
}
return sum;
}
printf("e^(-x) = %.4f\n", calculate_power_of_e(5.0, .0001));

First a remark about the power formula that you apply: according to wikipedia you should add the terms pow(-x,k)/(k!) and not pow(-x,k)/(n!).
This leads to a small optimisation of your code: as k! = k * (k-1)! we can avoid the inner while loop and a lot of useless multiplications.
By the way, there is also an error in the way you build the sum: you always erase the previous result, and then add a second time the current term.
Once this is corrected, you just have to take care of an additional variable:
double myexpo(double x, int n=100) {
int k;
double sum = 1.0, pvsum, factorial = 1.0, E = 0.0001;
for (k = 1; k <= (n); k++){ // start with 1
pvsum = sum;
factorial *= k; // don't calculate factorial for 0.
sum += (pow(-x, k)) / factorial;
if (k > 1 && fabs(sum - pvsum) < E) { // check if diff is small enough
cout << k << " iterations" << endl;
break; // interrupt the for loop if it's precise enough
}
}
return sum; // at the end of the loop sum is the best approximation
}
You can test this function with this:
double x;
do {
cout << "Enter number: ";
cin >> x;
cout << myexpo(x) << endl;
cout << exp(-x) << endl;
} while (x > 0);
Remark: I'd suggest either using double or adding the f suffix to the float literals (e.g. 0.0001f), even if it works as is.

Check when the absolute value of the term becomes smaller than your desired accuracy.
double sum = 0, x = 1, k = 0, E = 0.0001, fact = 1;
while(true){
double term = pow(-x, k) / fact;
if(fabs(term) < E)
break;
sum += term;
fact *= (++k);
}
printf("e^(-x) = %.4f", sum);

When the term is insignificant compared to 1.0, stop looping.
By using recursion, and provided |x| is not too big, the smallest terms are summed first.
e(x) = 1 + x/1! + x*x/2! + x*x*x/3! + ...
double my_exp_term(double x, double term, unsigned n) {
if (term + 1.0 == 1.0) return term;
n++;
return term + my_exp_term(x, term*x/n, n);
}
double my_exp(double x) {
return 1.0 + my_exp_term(x, x, 1);
}
double y = my_exp(-1);
Exponential function

Division of a big number of 100 digits stored as string

I have a 100 digit number stored as string. I want to divide this number with an integer less than 10. How do I efficiently divide a big integer stored as a string with an integer?

You can check the big integer library.
You can use this library in a C++ program to do arithmetic on integers of size limited only by your computer's memory. The library provides BigUnsigned and BigInteger classes that represent nonnegative integers and signed integers, respectively. Most of the C++ arithmetic operators are overloaded for these classes, so big-integer calculations are as easy as:
#include "BigIntegerLibrary.hh"
BigInteger a = 65536;
cout << (a * a * a * a * a * a * a * a);
(prints 340282366920938463463374607431768211456)
Also check GMP

@WasimThabraze - what is your understanding of the longhand division method? Since the divisor is less than 1/2 the size of an integer, you can use something like this for each divide:
char array[10] = {9,8,7,6,5,4,3,2,1,0};
void divide(int dvsr)
{
int rem = 0;
int dvnd;
int quot;
int i;
for(i = 0; i < (sizeof(array)/sizeof(array[0])) ; i++){
dvnd = (rem * 10) + array[i];
rem = dvnd % dvsr;
quot = dvnd / dvsr;
array[i] = quot;
}
}
int main(void)
{
divide(8);
return (0);
}

I hope this helps you, because not all online judges allow BigIntegerLibrary. I have assumed some arbitrary input.
string input="123456789";
int n=input.size();
string final(n,'0');
string::const_iterator p=input.begin(),q=input.end();
string::iterator f=final.begin();
void divide(int divisor)
{
int reminder = 0,dividend,quotient;
/*repeatedly divide each element*/
for(; p!=q ; p++,f++){
dividend = (reminder * 10) + (*p-'0');
reminder = dividend % divisor;
quotient = dividend / divisor;
*f = quotient + '0';
}
/*remove any leading zeroes from the result*/
string::size_type pos = final.find_first_not_of("0"); // size_type, so the npos comparison is well defined
if (pos != string::npos)
{
final = final.substr(pos);
}
std::cout << final ;
}
int main(){
int x;
std::cin >> x;
divide(x);
return 0;
}
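The same digit-by-digit long division can be written as a self-contained function that takes the number as a parameter instead of using globals. This is only a sketch under stated assumptions: the name divideBySmall and the optional remainder output are made up for this example, and the input is assumed to be a non-empty string of decimal digits with a divisor between 1 and 9.

```cpp
#include <string>

// Hypothetical helper: divides a non-negative decimal string by a divisor in 1..9.
// Assumes `digits` is non-empty and contains only '0'..'9'.
std::string divideBySmall(const std::string& digits, int divisor, int* remainderOut = nullptr)
{
    std::string quotient;
    quotient.reserve(digits.size());
    int remainder = 0;
    for (char c : digits)
    {
        // Bring down the next digit, exactly as in longhand division.
        const int dividend = remainder * 10 + (c - '0');
        quotient.push_back(static_cast<char>('0' + dividend / divisor));
        remainder = dividend % divisor;
    }
    // Strip leading zeroes, but keep a single "0" for a zero quotient.
    const std::string::size_type firstNonZero = quotient.find_first_not_of('0');
    if (firstNonZero == std::string::npos)
        quotient = "0";
    else
        quotient.erase(0, firstNonZero);
    if (remainderOut)
        *remainderOut = remainder;
    return quotient;
}
```

Because the running remainder is always smaller than the divisor, the intermediate dividend never exceeds 89, so a plain int is enough no matter how many digits the input has.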

Finding square root without using sqrt function?

I was looking for an algorithm to find the square root without using the sqrt function, and then tried to implement it. I ended up with this working code in C++:
#include <iostream>
using namespace std;
double SqrtNumber(double num)
{
double lower_bound=0;
double upper_bound=num;
double temp=0; /* ek edited this line */
int nCount = 50;
while(nCount != 0)
{
temp=(lower_bound+upper_bound)/2;
if(temp*temp==num)
{
return temp;
}
else if(temp*temp > num)
{
upper_bound = temp;
}
else
{
lower_bound = temp;
}
nCount--;
}
return temp;
}
int main()
{
double num;
cout<<"Enter the number\n";
cin>>num;
if(num < 0)
{
cout<<"Error: Negative number!";
return 0;
}
cout<<"Square roots are: +"<<SqrtNumber(num)<<" and -"<<SqrtNumber(num);
return 0;
}
Now the problem is the number of iterations nCount set in the declaration (here it is 50). For example, finding the square root of 36 takes 22 iterations, so no problem, whereas finding the square root of 15625 takes more than 50 iterations, so the function returns the value of temp after 50 iterations. Please give a solution for this.

There is a better algorithm, which needs at most 6 iterations to converge to maximum precision for double numbers:
#include <math.h>
double sqrt(double x) {
if (x <= 0)
return 0; // if negative number throw an exception?
int exp = 0;
x = frexp(x, &exp); // extract binary exponent from x
if (exp & 1) { // we want exponent to be even
exp--;
x *= 2;
}
double y = (1+x)/2; // first approximation
double z = 0;
while (y != z) { // yes, we CAN compare doubles here!
z = y;
y = (y + x/y) / 2;
}
return ldexp(y, exp/2); // multiply answer by 2^(exp/2)
}
The algorithm starts with 1 as the first approximation of the square root (the initial y = (1+x)/2 in the code is already the result of the first step from 1).
Then, on each step, it improves the approximation by taking the average of the current value y and x/y. If y = sqrt(x), the value stays the same. If y > sqrt(x), then x/y < sqrt(x) by about the same amount. In other words, it converges very fast.
UPDATE: To speed up convergence on very large or very small numbers, I changed the sqrt() function to extract the binary exponent and compute the square root of a number in the [0.5, 2) range. It now needs frexp() from <math.h> to get the binary exponent, but it is possible to get this exponent by extracting bits from the IEEE-754 number format without using frexp().
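As a rough illustration of the last remark, the exponent can be read straight from the bit pattern of a double. This is a sketch, not part of the answer's code: the function name binaryExponent is an assumption, and it only covers normal values in IEEE-754 binary64 format (no zeroes, subnormals, infinities or NaNs).

```cpp
#include <cstdint>
#include <cstring>

// Assumes IEEE-754 binary64 doubles and a normal (finite, non-zero, non-subnormal) x.
// Returns e such that 2^e <= |x| < 2^(e+1); frexp() would report e + 1 for the same x.
int binaryExponent(double x)
{
    std::uint64_t bits = 0;
    std::memcpy(&bits, &x, sizeof bits);            // copy the raw bit pattern
    const int biased = static_cast<int>((bits >> 52) & 0x7ffu);  // 11 exponent bits
    return biased - 1023;                           // remove the exponent bias
}
```

The off-by-one relative to frexp() comes from the different mantissa conventions: IEEE-754 stores a significand in [1, 2), while frexp() returns one in [0.5, 1).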

Why not try the Babylonian method for finding a square root?
Here is my code for it:
double sqrt(double number)
{
double error = 0.00001; //define the precision of your result
double s = number;
while (fabs(s - number / s) > error) //loop until precision satisfied; fabs (from <cmath>) also covers number < 1, where s starts below the root
{
s = (s + number / s) / 2;
}
return s;
}
Good luck!

Remove your nCount altogether (as there are some roots that this algorithm will take many iterations for).
double SqrtNumber(double num)
{
double lower_bound=0;
double upper_bound=num;
double temp=0;
while(fabs(num - (temp * temp)) > SOME_SMALL_VALUE)
{
temp = (lower_bound+upper_bound)/2;
if (temp*temp >= num)
{
upper_bound = temp;
}
else
{
lower_bound = temp;
}
}
return temp;
}

I know this question is old and has many answers, but I have an answer which is simple and works well.
#define EPSILON 0.0000001 // least minimum value for comparison
double SquareRoot(double _val) {
double low = 0;
double high = _val < 1 ? 1 : _val; // for _val < 1 the root is larger than _val, so search up to 1
double mid = 0;
while (high - low > EPSILON) {
mid = low + (high - low) / 2; // finding mid value
if (mid*mid > _val) {
high = mid;
} else {
low = mid;
}
}
return mid;
}
I hope it will be helpful for future users.

If you need to find the square root without using sqrt(), use root = pow(x, 0.5),
where x is the value whose square root you need to find.

//long division method.
#include<iostream>
using namespace std;
int main() {
int n, i = 1, divisor, dividend, j = 1, digit;
cin >> n;
while (i * i <= n) { // <= so that perfect squares are not undershot
i = i + 1;
}
i = i - 1;
cout << i << '.';
divisor = 2 * i;
dividend = n - (i * i );
while( j <= 5) {
dividend = dividend * 100;
digit = 0;
while ((divisor * 10 + digit) * digit <= dividend) { // <= so that an exact digit is accepted
digit = digit + 1;
}
digit = digit - 1;
cout << digit;
dividend = dividend - ((divisor * 10 + digit) * digit);
divisor = divisor * 10 + 2*digit;
j = j + 1;
}
cout << endl;
return 0;
}

Here is a very simple but unsafe approach to find the square root of a number.
It is unsafe because it only works for natural numbers whose square root is also a natural number; for any other input the loop never finds a match. I had to use it for a task where I was allowed to use neither the <cmath> library nor pointers.
potency = base ^ exponent
// FUNCTION: square-root
int sqrt(int x)
{
int quotient = 0;
int i = 0;
bool resultfound = false;
while (resultfound == false) {
if (i*i == x) {
quotient = i;
resultfound = true;
}
i++;
}
return quotient;
}

This is a very simple recursive approach.
double mySqrt(double v, double test) {
if (fabs(test * test - v) < 0.0001) { // fabs from <cmath>; plain abs may truncate to int
return test;
}
double highOrLow = v / test;
return mySqrt(v, (test + highOrLow) / 2.0);
}
double mySqrt(double v) {
return mySqrt(v, v/2.0);
}

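Here is code based on the fast inverse square root bit trick. It can be faster than the library sqrt function on some platforms, at the cost of accuracy, and it assumes 32-bit int and IEEE-754 float.
#include <cstring>
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i;
std::memcpy(&i, &x, sizeof(i)); // read the float's bits as an int without a pointer cast
i = 0x5f375a86 - (i>>1); // magic-constant initial guess for 1/sqrt(x)
std::memcpy(&x, &i, sizeof(x));
x = x*(1.5f - xhalf*x*x); // Newton iterations refine 1/sqrt(x)
x = x*(1.5f - xhalf*x*x);
x = x*(1.5f - xhalf*x*x);
x=1/x; // invert to get sqrt(x)
return x;
}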

After looking at the previous responses, I hope this will help resolve any ambiguities. In case the similarities between the previous solutions and mine are elusive, or this method of solving for roots is unclear, I've also made a graph which can be found here.
This is a working root function capable of solving for any nth-root
(default is square root for the sake of this question)
#include <cmath>
// for "pow" function
double sqrt(double A, double root = 2) {
const double e = 2.71828182846;
return pow(e,(pow(10.0,9.0)/root)*(1.0-(pow(A,-pow(10.0,-9.0)))));
}
Explanation:
This works via Taylor series, logarithmic properties, and a bit of algebra.
Take, for example:
log_x(A) = N
*Note: for square-root, N = 2; for any other root you only need to change the one variable, N.
1) Change the base, converting the base-'x' log function to natural log,
log_x(A) => ln(A)/ln(x) = N
2) Rearrange to isolate ln(x), and eventually just 'x',
ln(A)/N = ln(x)
3) Set both sides as exponents of 'e',
e^(ln(A)/N) = e^(ln(x)) >~{ e^ln(x) == x }~> e^(ln(A)/N) = x
4) Taylor series represents "ln" as an infinite series,
ln(x) = Sigma (k=1 to infinity): (1/k)((-1)^(k+1))(x-1)^k
<~~~ expanded ~~~>
[(x-1)] - [(1/2)(x-1)^2] + [(1/3)(x-1)^3] - [(1/4)(x-1)^4] + . . .
*Note: Continue the series for increased accuracy. For brevity, 10^9 is used in my function which expresses the series convergence for the natural log with about 7 digits, or the 10-millionths place, for precision,
ln(x) = 10^9(1-x^(-10^(-9)))
5) Now, just plug in this equation for natural log into the simplified equation obtained in step 3.
e^[((10^9)/N)(1-A^(-10^-9))] = nth-root of (A)
6) This implementation might seem like overkill; however, its purpose is to demonstrate how you can solve for roots without having to guess and check. Also, it would enable you to replace the pow function from the cmath library with your own pow function:
double power(double base, double exponent) {
if (exponent == 0) return 1;
int wholeInt = (int)exponent;
double decimal = exponent - (double)wholeInt;
if (decimal) {
int powerInv = 1/decimal;
if (!wholeInt) return root(base,powerInv);
else return power(root(base,powerInv),wholeInt,true);
}
return power(base, exponent, true);
}
double power(double base, int exponent, bool flag) {
if (exponent < 0) return 1/power(base,-exponent,true);
if (exponent > 0) return base * power(base,exponent-1,true);
else return 1;
}
int root(int A, int root) {
return power(E,(1000000000000/root)*(1-(power(A,-0.000000000001))));
}

How to improve fixed point square-root for small values

I am using Anthony Williams' fixed point library described in the Dr Dobb's article "Optimizing Math-Intensive Applications with Fixed-Point Arithmetic" to calculate the distance between two geographical points using the Rhumb Line method.
This works well enough when the distance between the points is significant (greater than a few kilometers), but is very poor at smaller distances. The worst case being when the two points are equal or near equal, the result is a distance of 194 meters, while I need precision of at least 1 metre at distances >= 1 metre.
By comparison with a double precision floating-point implementation, I have located the problem to the fixed::sqrt() function, which performs poorly at small values:
x        std::sqrt(x)   fixed::sqrt(x)   error
----------------------------------------------------
0        0              3.05176e-005     3.05176e-005
1e-005   0.00316228     0.00316334       1.06005e-006
2e-005   0.00447214     0.00447226       1.19752e-007
3e-005   0.00547723     0.0054779        6.72248e-007
4e-005   0.00632456     0.00632477       2.12746e-007
5e-005   0.00707107     0.0070715        4.27244e-007
6e-005   0.00774597     0.0077467        7.2978e-007
7e-005   0.0083666      0.00836658       1.54875e-008
8e-005   0.00894427     0.00894427       1.085e-009
Correcting the result for fixed::sqrt(0) is trivial by treating it as a special case, but that will not solve the problem for small non-zero distances, where the error starts at 194 metres and converges toward zero with increasing distance. I probably need at least an order of magnitude improvement in precision toward zero.
The fixed::sqrt() algorithm is briefly explained on page 4 of the article linked above, but I am struggling to follow it, let alone determine whether it is possible to improve it. The code for the function is reproduced below:
fixed fixed::sqrt() const
{
unsigned const max_shift=62;
uint64_t a_squared=1LL<<max_shift;
unsigned b_shift=(max_shift+fixed_resolution_shift)/2;
uint64_t a=1LL<<b_shift;
uint64_t x=m_nVal;
while(b_shift && a_squared>x)
{
a>>=1;
a_squared>>=2;
--b_shift;
}
uint64_t remainder=x-a_squared;
--b_shift;
while(remainder && b_shift)
{
uint64_t b_squared=1LL<<(2*b_shift-fixed_resolution_shift);
int const two_a_b_shift=b_shift+1-fixed_resolution_shift;
uint64_t two_a_b=(two_a_b_shift>0)?(a<<two_a_b_shift):(a>>-two_a_b_shift);
while(b_shift && remainder<(b_squared+two_a_b))
{
b_squared>>=2;
two_a_b>>=1;
--b_shift;
}
uint64_t const delta=b_squared+two_a_b;
if((2*remainder)>delta)
{
a+=(1LL<<b_shift);
remainder-=delta;
if(b_shift)
{
--b_shift;
}
}
}
return fixed(internal(),a);
}
Note that m_nVal is the internal fixed point representation value, it is an int64_t and the representation uses Q36.28 format (fixed_resolution_shift = 28). The representation itself has enough precision for at least 8 decimal places, and as a fraction of equatorial arc is good for distances of around 0.14 metres, so the limitation is not the fixed-point representation.
Use of the rhumb line method is a standards body recommendation for this application so cannot be changed, and in any case a more accurate square-root function is likely to be required elsewhere in the application or in future applications.
Question: Is it possible to improve the accuracy of the fixed::sqrt() algorithm for small non-zero values while still maintaining its bounded and deterministic convergence?
Additional Information
The test code used to generate the table above:
#include <cmath>
#include <iostream>
#include "fixed.hpp"
int main()
{
double error = 1.0 ;
for( double x = 0.0; error > 1e-8; x += 1e-5 )
{
double fixed_root = sqrt(fixed(x)).as_double() ;
double std_root = std::sqrt(x) ;
error = std::fabs(fixed_root - std_root) ;
std::cout << x << '\t' << std_root << '\t' << fixed_root << '\t' << error << std::endl ;
}
}
Conclusion
In the light of Justin Peel's solution and analysis, and comparison with the algorithm in "The Neglected Art of Fixed Point Arithmetic", I have adapted the latter as follows:
fixed fixed::sqrt() const
{
uint64_t a = 0 ; // root accumulator
uint64_t remHi = 0 ; // high part of partial remainder
uint64_t remLo = m_nVal ; // low part of partial remainder
uint64_t testDiv ;
int count = 31 + (fixed_resolution_shift >> 1); // Loop counter
do
{
// get 2 bits of arg
remHi = (remHi << 2) | (remLo >> 62); remLo <<= 2 ;
// Get ready for the next bit in the root
a <<= 1;
// Test radical
testDiv = (a << 1) + 1;
if (remHi >= testDiv)
{
remHi -= testDiv;
a += 1;
}
} while (count-- != 0);
return fixed(internal(),a);
}
While this gives far greater precision, it still does not achieve the improvement I needed. The Q36.28 format alone just about provides the precision I need, but it is not possible to perform a sqrt() without losing a few bits of precision. However, some lateral thinking provides a better solution. My application tests the calculated distance against some distance limit. The rather obvious solution, in hindsight, is to test the square of the distance against the square of the limit!
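A minimal sketch of that comparison, assuming the distance is built from two components dx and dy: the function name withinLimit and its parameters are hypothetical, all three values are raw fixed-point integers sharing the same scale, limit is non-negative, and the magnitudes are assumed small enough that the squares and their sum fit in 64 bits.

```cpp
#include <cstdint>

// Hypothetical sketch: dx, dy and limit are raw fixed-point values with the same scale.
// Assumes limit >= 0 and that the squares and their sum fit in 64 bits.
bool withinLimit(std::int64_t dx, std::int64_t dy, std::int64_t limit)
{
    // Compare squared distance with squared limit, so no sqrt() is needed.
    const std::uint64_t distSq = static_cast<std::uint64_t>(dx * dx)
                               + static_cast<std::uint64_t>(dy * dy);
    const std::uint64_t limitSq = static_cast<std::uint64_t>(limit * limit);
    return distSq <= limitSq;
}
```

Since both sides of the comparison are squared with the same scale factor, the scale cancels and no shifting back is required.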

Given that sqrt(ab) = sqrt(a)sqrt(b), then can't you just trap the case where your number is small and shift it up by a given number of bits, compute the root and shift that back down by half the number of bits to get the result?
I.e.
sqrt(n) = sqrt(n.2^k)/sqrt(2^k)
= sqrt(n.2^k).2^(-k/2)
E.g. Choose k = 28 for any n less than 2^8.
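For Q36.28 specifically, the shift trick collapses into a single integer square root: if raw = value * 2^28, then sqrt(raw * 2^28) = sqrt(value) * 2^28, which is already the raw Q36.28 result. The sketch below illustrates this under stated assumptions: the names isqrt64 and sqrtSmallQ28 are made up, the integer square root is a standard bit-by-bit routine rather than the library's fixed::sqrt(), and raw must be below 2^36 (value below 2^8, as in the answer) so that the shift does not overflow.

```cpp
#include <cstdint>

// Bit-by-bit integer square root: returns floor(sqrt(n)).
std::uint64_t isqrt64(std::uint64_t n)
{
    std::uint64_t res = 0;
    std::uint64_t bit = 1ULL << 62;      // highest power of four in 64 bits
    while (bit > n)
        bit >>= 2;
    while (bit != 0)
    {
        if (n >= res + bit)
        {
            n -= res + bit;
            res = (res >> 1) + bit;
        }
        else
        {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}

// Hypothetical helper: raw is a Q36.28 value below 2^36 (i.e. value < 2^8).
// Returns sqrt(value) as a raw Q36.28 number.
std::uint64_t sqrtSmallQ28(std::uint64_t raw)
{
    return isqrt64(raw << 28);           // sqrt(raw * 2^28) == sqrt(value) * 2^28
}
```

The result is truncated rather than rounded, so it is accurate to within one unit in the last place of the Q36.28 representation.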

The original implementation obviously has some problems. I became frustrated with trying to fix them all with the way the code is currently done and ended up going at it with a different approach. I could probably fix the original now, but I like my way better anyway.
I treat the input number as being in Q64 to start, which is the same as shifting by 28 and then shifting back by 14 afterwards (the sqrt halves it). However, if you just do that, then the accuracy is limited to 1/2^14 = 6.1035e-5 because the last 14 bits will be 0. To remedy this, I then shift a and remainder correctly and run the loop again to keep filling in digits. The code can be made more efficient and cleaner, but I'll leave that to someone else. The accuracy shown below is pretty much as good as you can get with Q36.28. If you compare the fixed point sqrt with the floating point sqrt of the input number after it has been truncated by fixed point (convert it to fixed point and back), then the errors are around 2e-9 (I didn't do this in the code below, but it requires a one-line change). This is right in line with the best accuracy for Q36.28, which is 1/2^28 = 3.7253e-9.
By the way, one big mistake in the original code is that the term where m = 0 is never considered so that bit can never be set. Anyway, here is the code. Enjoy!
#include <iostream>
#include <cmath>
#include <stdint.h> // provides uint64_t (unsigned long is only 32 bits on some platforms)
uint64_t sqrt(uint64_t in_val)
{
const uint64_t fixed_resolution_shift = 28;
const unsigned max_shift=62;
uint64_t a_squared=1ULL<<max_shift;
unsigned b_shift=(max_shift>>1) + 1;
uint64_t a=1ULL<<(b_shift - 1);
uint64_t x=in_val;
while(b_shift && a_squared>x)
{
a>>=1;
a_squared>>=2;
--b_shift;
}
uint64_t remainder=x-a_squared;
--b_shift;
while(remainder && b_shift)
{
uint64_t b_squared=1ULL<<(2*(b_shift - 1));
uint64_t two_a_b=(a<<b_shift);
while(b_shift && remainder<(b_squared+two_a_b))
{
b_squared>>=2;
two_a_b>>=1;
--b_shift;
}
uint64_t const delta=b_squared+two_a_b;
if((remainder)>=delta && b_shift)
{
a+=(1ULL<<(b_shift - 1));
remainder-=delta;
--b_shift;
}
}
a <<= (fixed_resolution_shift/2);
b_shift = (fixed_resolution_shift/2) + 1;
remainder <<= (fixed_resolution_shift);
while(remainder && b_shift)
{
uint64_t b_squared=1ULL<<(2*(b_shift - 1));
uint64_t two_a_b=(a<<b_shift);
while(b_shift && remainder<(b_squared+two_a_b))
{
b_squared>>=2;
two_a_b>>=1;
--b_shift;
}
uint64_t const delta=b_squared+two_a_b;
if((remainder)>=delta && b_shift)
{
a+=(1ULL<<(b_shift - 1));
remainder-=delta;
--b_shift;
}
}
return a;
}
double fixed2float(uint64_t x)
{
return static_cast<double>(x) * pow(2.0, -28.0);
}
uint64_t float2fixed(double f)
{
return static_cast<uint64_t>(f * pow(2, 28.0));
}
void finderror(double num)
{
double root1 = fixed2float(sqrt(float2fixed(num)));
double root2 = pow(num, 0.5);
std::cout << "input: " << num << ", fixed sqrt: " << root1 << " " << ", float sqrt: " << root2 << ", finderror: " << root2 - root1 << std::endl;
}
int main()
{
finderror(0);
finderror(1e-5);
finderror(2e-5);
finderror(3e-5);
finderror(4e-5);
finderror(5e-5);
finderror(pow(2.0,1));
finderror(1ULL<<35);
}
The output of the program is:
input: 0, fixed sqrt: 0 , float sqrt: 0, finderror: 0
input: 1e-05, fixed sqrt: 0.00316207 , float sqrt: 0.00316228, finderror: 2.10277e-07
input: 2e-05, fixed sqrt: 0.00447184 , float sqrt: 0.00447214, finderror: 2.97481e-07
input: 3e-05, fixed sqrt: 0.0054772 , float sqrt: 0.00547723, finderror: 2.43815e-08
input: 4e-05, fixed sqrt: 0.00632443 , float sqrt: 0.00632456, finderror: 1.26255e-07
input: 5e-05, fixed sqrt: 0.00707086 , float sqrt: 0.00707107, finderror: 2.06055e-07
input: 2, fixed sqrt: 1.41421 , float sqrt: 1.41421, finderror: 1.85149e-09
input: 3.43597e+10, fixed sqrt: 185364 , float sqrt: 185364, finderror: 2.24099e-09
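
The comparison against the truncated input mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the names kFractionBits, to_fixed, to_double, quantize and sqrt_error_vs_quantized are hypothetical and do not appear in the answer, and the input is assumed to be non-negative and small enough to fit in Q36.28.

```cpp
#include <cmath>
#include <cstdint>

// Q36.28: 36 integer bits and 28 fractional bits, as in the answer above.
// All names in this sketch are hypothetical helpers, not part of the answer.
constexpr int kFractionBits = 28;

// Convert a non-negative double to Q36.28, truncating toward zero.
std::uint64_t to_fixed(double value)
{
    return static_cast<std::uint64_t>(std::ldexp(value, kFractionBits));
}

// Convert a Q36.28 value back to double.
double to_double(std::uint64_t fixed)
{
    return std::ldexp(static_cast<double>(fixed), -kFractionBits);
}

// Round-trip through fixed point: the largest Q36.28 value not above `value`.
double quantize(double value)
{
    return to_double(to_fixed(value));
}

// Error of a fixed-point root measured against the floating-point sqrt of the
// *quantized* input, so input truncation is not counted as sqrt error.
double sqrt_error_vs_quantized(double value, std::uint64_t fixed_root)
{
    return std::sqrt(quantize(value)) - to_double(fixed_root);
}
```

For small inputs such as 1e-5 most of the error in the table comes from truncating the input itself, which is why measuring against the quantized input gives a much smaller figure.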

I'm not sure how you're getting the numbers from fixed::sqrt() shown in the table.
Here's what I do:
#include <stdio.h>
#include <math.h>
#define __int64 long long // gcc doesn't know __int64
typedef __int64 fixed;
#define FRACT 28
#define DBL2FIX(x) ((fixed)((double)(x) * (1LL << FRACT)))
#define FIX2DBL(x) ((double)(x) / (1LL << FRACT))
// De-++-ified code from
// http://www.justsoftwaresolutions.co.uk/news/optimizing-applications-with-fixed-point-arithmetic.html
fixed sqrtfix0(fixed num)
{
static unsigned const fixed_resolution_shift=FRACT;
unsigned const max_shift=62;
unsigned __int64 a_squared=1LL<<max_shift;
unsigned b_shift=(max_shift+fixed_resolution_shift)/2;
unsigned __int64 a=1LL<<b_shift;
unsigned __int64 x=num;
unsigned __int64 remainder;
while(b_shift && a_squared>x)
{
a>>=1;
a_squared>>=2;
--b_shift;
}
remainder=x-a_squared;
--b_shift;
while(remainder && b_shift)
{
unsigned __int64 b_squared=1LL<<(2*b_shift-fixed_resolution_shift);
int const two_a_b_shift=b_shift+1-fixed_resolution_shift;
unsigned __int64 two_a_b=(two_a_b_shift>0)?(a<<two_a_b_shift):(a>>-two_a_b_shift);
unsigned __int64 delta;
while(b_shift && remainder<(b_squared+two_a_b))
{
b_squared>>=2;
two_a_b>>=1;
--b_shift;
}
delta=b_squared+two_a_b;
if((2*remainder)>delta)
{
a+=(1LL<<b_shift);
remainder-=delta;
if(b_shift)
{
--b_shift;
}
}
}
return (fixed)a;
}
// Adapted code from
// http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Digit-by-digit_calculation
fixed sqrtfix1(fixed num)
{
fixed res = 0;
fixed bit = (fixed)1 << 62; // The second-to-top bit is set
int s = 0;
// Scale num up to get more significant digits
while (num && num < bit)
{
num <<= 1;
s++;
}
if (s & 1)
{
num >>= 1;
s--;
}
s = 14 - (s >> 1);
while (bit != 0)
{
if (num >= res + bit)
{
num -= res + bit;
res = (res >> 1) + bit;
}
else
{
res >>= 1;
}
bit >>= 2;
}
if (s >= 0) res <<= s;
else res >>= -s;
return res;
}
int main(void)
{
double testData[] =
{
0,
1e-005,
2e-005,
3e-005,
4e-005,
5e-005,
6e-005,
7e-005,
8e-005,
};
int i;
for (i = 0; i < sizeof(testData) / sizeof(testData[0]); i++)
{
double x = testData[i];
fixed xf = DBL2FIX(x);
fixed sqf0 = sqrtfix0(xf);
fixed sqf1 = sqrtfix1(xf);
double sq0 = FIX2DBL(sqf0);
double sq1 = FIX2DBL(sqf1);
printf("%10.8f: "
"sqrtfix0()=%10.8f / err=%e "
"sqrt()=%10.8f "
"sqrtfix1()=%10.8f / err=%e\n",
x,
sq0, fabs(sq0 - sqrt(x)),
sqrt(x),
sq1, fabs(sq1 - sqrt(x)));
}
printf("sizeof(double)=%d\n", (int)sizeof(double));
return 0;
}
And here's what I get (with gcc and Open Watcom):
0.00000000: sqrtfix0()=0.00003052 / err=3.051758e-05 sqrt()=0.00000000 sqrtfix1()=0.00000000 / err=0.000000e+00
0.00001000: sqrtfix0()=0.00311279 / err=4.948469e-05 sqrt()=0.00316228 sqrtfix1()=0.00316207 / err=2.102766e-07
0.00002000: sqrtfix0()=0.00445557 / err=1.656955e-05 sqrt()=0.00447214 sqrtfix1()=0.00447184 / err=2.974807e-07
0.00003000: sqrtfix0()=0.00543213 / err=4.509667e-05 sqrt()=0.00547723 sqrtfix1()=0.00547720 / err=2.438148e-08
0.00004000: sqrtfix0()=0.00628662 / err=3.793423e-05 sqrt()=0.00632456 sqrtfix1()=0.00632443 / err=1.262553e-07
0.00005000: sqrtfix0()=0.00701904 / err=5.202484e-05 sqrt()=0.00707107 sqrtfix1()=0.00707086 / err=2.060551e-07
0.00006000: sqrtfix0()=0.00772095 / err=2.501943e-05 sqrt()=0.00774597 sqrtfix1()=0.00774593 / err=3.390476e-08
0.00007000: sqrtfix0()=0.00836182 / err=4.783859e-06 sqrt()=0.00836660 sqrtfix1()=0.00836649 / err=1.086198e-07
0.00008000: sqrtfix0()=0.00894165 / err=2.621519e-06 sqrt()=0.00894427 sqrtfix1()=0.00894409 / err=1.777289e-07
sizeof(double)=8
EDIT:
I missed the fact that sqrtfix1() above won't work well with large arguments. It can be fixed by appending 28 zero bits to the argument and essentially calculating the exact integer square root of that. This comes at the expense of doing the internal calculations in 128-bit arithmetic, but it's pretty straightforward:
fixed sqrtfix2(fixed num)
{
unsigned __int64 numl, numh;
unsigned __int64 resl = 0, resh = 0;
unsigned __int64 bitl = 0, bith = (unsigned __int64)1 << 26;
numl = num << 28;
numh = num >> (64 - 28);
while (bitl | bith)
{
unsigned __int64 tmpl = resl + bitl;
unsigned __int64 tmph = resh + bith + (tmpl < resl);
tmph = numh - tmph - (numl < tmpl);
tmpl = numl - tmpl;
if (tmph & 0x8000000000000000ULL)
{
resl >>= 1;
if (resh & 1) resl |= 0x8000000000000000ULL;
resh >>= 1;
}
else
{
numl = tmpl;
numh = tmph;
resl >>= 1;
if (resh & 1) resl |= 0x8000000000000000ULL;
resh >>= 1;
resh += bith + (resl + bitl < resl);
resl += bitl;
}
bitl >>= 2;
if (bith & 1) bitl |= 0x4000000000000000ULL;
if (bith & 2) bitl |= 0x8000000000000000ULL;
bith >>= 2;
}
return resl;
}
And it gives pretty much the same results (slightly better for 3.43597e+10) as this answer:
0.00000000: sqrtfix0()=0.00003052 / err=3.051758e-05 sqrt()=0.00000000 sqrtfix2()=0.00000000 / err=0.000000e+00
0.00001000: sqrtfix0()=0.00311279 / err=4.948469e-05 sqrt()=0.00316228 sqrtfix2()=0.00316207 / err=2.102766e-07
0.00002000: sqrtfix0()=0.00445557 / err=1.656955e-05 sqrt()=0.00447214 sqrtfix2()=0.00447184 / err=2.974807e-07
0.00003000: sqrtfix0()=0.00543213 / err=4.509667e-05 sqrt()=0.00547723 sqrtfix2()=0.00547720 / err=2.438148e-08
0.00004000: sqrtfix0()=0.00628662 / err=3.793423e-05 sqrt()=0.00632456 sqrtfix2()=0.00632443 / err=1.262553e-07
0.00005000: sqrtfix0()=0.00701904 / err=5.202484e-05 sqrt()=0.00707107 sqrtfix2()=0.00707086 / err=2.060551e-07
0.00006000: sqrtfix0()=0.00772095 / err=2.501943e-05 sqrt()=0.00774597 sqrtfix2()=0.00774593 / err=3.390476e-08
0.00007000: sqrtfix0()=0.00836182 / err=4.783859e-06 sqrt()=0.00836660 sqrtfix2()=0.00836649 / err=1.086198e-07
0.00008000: sqrtfix0()=0.00894165 / err=2.621519e-06 sqrt()=0.00894427 sqrtfix2()=0.00894409 / err=1.777289e-07
2.00000000: sqrtfix0()=1.41419983 / err=1.373327e-05 sqrt()=1.41421356 sqrtfix2()=1.41421356 / err=1.851493e-09
34359700000.00000000: sqrtfix0()=185363.69654846 / err=5.097361e-06 sqrt()=185363.69655356 sqrtfix2()=185363.69655356 / err=1.164153e-09
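
The same idea (append 28 zero bits, then take the exact integer square root) can be sketched more compactly where the compiler offers a 128-bit integer type. This is a minimal sketch, not a replacement for sqrtfix2(): it assumes `unsigned __int128` is available (a GCC/Clang extension on 64-bit targets, not standard C++), and the names u128 and sqrt_q28 are hypothetical.

```cpp
#include <cstdint>

// Assumption: the compiler provides unsigned __int128 (GCC and Clang on
// 64-bit targets do; this is an extension, not standard C++).
using u128 = unsigned __int128;

// Square root of a non-negative Q36.28 value, returned in Q36.28.
// Computes floor(sqrt(num * 2^28)) with the digit-by-digit integer method.
std::uint64_t sqrt_q28(std::uint64_t num)
{
    u128 n = static_cast<u128>(num) << 28;  // append 28 zero bits (up to 92 bits)
    u128 res = 0;
    u128 bit = static_cast<u128>(1) << 90;  // largest power of four below 2^92

    while (bit != 0)
    {
        if (n >= res + bit)
        {
            n -= res + bit;
            res = (res >> 1) + bit;
        }
        else
        {
            res >>= 1;
        }
        bit >>= 2;
    }
    return static_cast<std::uint64_t>(res);  // the root fits in 46 bits
}
```

The starting value of bit corresponds to bith = 1 << 26 in sqrtfix2(), since the high word there represents bits 64 and up.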

Many many years ago I worked on a demo program for a small computer our outfit had built. The computer had a built-in square-root instruction, and we built a simple program to demonstrate the computer doing 16-bit add/subtract/multiply/divide/square-root on a TTY. Alas, it turned out that there was a serious bug in the square root instruction, but we had promised to demo the function. So we created an array of the squares of the values 1-255, then used a simple lookup to match the value typed in to one of the array values. The index was the square root.
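
A table lookup of that kind might look like the sketch below. It is only an illustration of the idea, not the original demo program: the names make_square_table and lookup_sqrt are hypothetical, index 0 is included so that the index equals the root, and the binary search is a choice made here.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Table of squares for 0..255; the index of an entry is its square root.
// 255 * 255 = 65025 still fits in 16 bits.
std::array<std::uint16_t, 256> make_square_table()
{
    std::array<std::uint16_t, 256> table{};
    for (unsigned i = 0; i < table.size(); ++i)
        table[i] = static_cast<std::uint16_t>(i * i);
    return table;
}

// Return the root of a 16-bit perfect square, or -1 if `value` is not one.
int lookup_sqrt(std::uint16_t value)
{
    static const std::array<std::uint16_t, 256> table = make_square_table();
    // The table is sorted, so a binary search finds the matching entry.
    auto it = std::lower_bound(table.begin(), table.end(), value);
    if (it == table.end() || *it != value)
        return -1;
    return static_cast<int>(it - table.begin());
}
```

Such a table only answers for perfect squares, which is enough for a demo where the inputs are chosen in advance.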

