Multiplication between big integers and doubles - c++

I am managing some big (128~256-bit) integers with GMP. It has come to a point where I would like to multiply them by a double close to 1 (0.1 < double < 10), the result still being an approximated integer. A good example of the operation I need to do is the following:
int i = 1000000000000000000 * 1.23456789
I searched in the gmp documentation but I didn't find a function for this, so I ended up writing this code which seems to work well:
void mpz_mult_d(mpz_class &r, const mpz_class &i, double d, int prec = 10) {
    if (prec > 15) prec = 15; // avoids overflows
    uint_fast64_t m = (uint_fast64_t) floor(d);
    r = i * m;
    uint_fast64_t pos = 1;
    for (uint_fast8_t j = 0; j < prec; j++) {
        const double posd = (double) pos;
        m = ((uint_fast64_t) floor(d * posd * 10.)) -
            ((uint_fast64_t) floor(d * posd)) * 10;
        pos *= 10;
        r += (i * m) / pos;
    }
}
Can you please tell me what you think? Do you have any suggestions to make it more robust or faster?

this is what you wanted:
// BYTE lint[_N] ... lint[0]=MSB, lint[_N-1]=LSB
void mul(BYTE *c, BYTE *a, double b) // c[_N] = a[_N] * b
{
    int i; DWORD cc;
    double q[_N + 1], aa, bb;
    for (q[0] = 0.0, i = 0; i < _N;) // multiply, carry down
    {
        bb = double(a[i]) * b; aa = floor(bb); bb -= aa;
        q[i] += aa; i++;
        q[i] = bb * 256.0;
    }
    cc = 0; if (q[_N] > 127.0) cc = 1; // round
    for (i = _N - 1; i >= 0; i--) // carry up
    {
        cc += DWORD(q[i]);
        c[i] = cc & 255;
        cc >>= 8;
    }
}
_N is the number of bytes (bits/8) per large int; the large int is an array of _N BYTEs where the first byte is the MSB (most significant byte) and the last byte is the LSB (least significant byte).

The function does not handle the signum, but that is only one if plus a few xor/inc operations to add.

The trouble is that a double has low precision even for your number 1.23456789! Due to precision loss the result is not exactly what it should be (1234387129122386944 instead of 1234567890000000000). I think my code is much quicker and even more precise than yours, because I do not need to mul/mod/div numbers by 10; instead I use bit shifting where possible, and work not in base-10 digits but in base-256 digits (8 bits). If you need more precision, use long arithmetic. You can also speed this code up by using larger digits (16, 32, ... bits).

My long arithmetics for precise astro computations are usually fixed-point 256.256-bit numbers consisting of 2*8 DWORDs plus signum, but that is of course much slower, and some goniometric (trigonometric) functions are really tricky to implement. If you want just the basic operations, though, coding your own long arithmetic is not that hard.

Also, if you often want the numbers in readable form, it is a good compromise between speed and size to consider not binary-coded numbers but BCD-coded numbers.

I am not familiar enough with either C++ or GMP to suggest source code without syntax errors, but what you are doing is more complicated than it should be and can introduce unnecessary approximation.
Instead, I suggest you write the function mpz_mult_d() like this:
mpz_mult_d(mpz_class & r, const mpz_class & i, double d) {
    d = ldexp(d, 52); /* exact, no overflow because 1 <= d <= 10 */
    unsigned long long l = d; /* exact because d is an integer */
    p = l * i; /* exact, in GMP */
    (quotient, remainder) = p / 2^52; /* in GMP */
And now the next step depends on the kind of rounding you wish. If you wish the multiplication of d by i to give a result rounded toward -inf, just return quotient as result of the function. If you wish a result rounded to the nearest integer, you must look at remainder:
    assert(0 <= remainder); /* proper Euclidean division */
    assert(remainder < 2^52);
    if (remainder < 2^51) return quotient;
    if (remainder > 2^51) return quotient + 1; /* in GMP */
    if (remainder == 2^51) return quotient + (quotient & 1); /* in GMP, round to "even" */
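A concrete sketch of this recipe in GMP's C++ interface (my translation, floor rounding only; it assumes 1 <= d < 10 so that ldexp(d, 52) fits in an unsigned long on a 64-bit platform):

#include <gmpxx.h>
#include <cmath>

mpz_class mul_by_double(const mpz_class &i, double d) {
    unsigned long l = (unsigned long)std::ldexp(d, 52); // exact: scale by 2^52
    mpz_class p = i * l;                                // exact big-int product
    mpz_class q;
    mpz_fdiv_q_2exp(q.get_mpz_t(), p.get_mpz_t(), 52);  // floor(p / 2^52)
    return q;
}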
PS: I found your question by random browsing but if you had tagged it “floating-point”, people more competent than me could have answered it quickly.

Try this strategy:
Convert the integer value to a big float: mpf_set_z(...)
Convert the double value to a big float: mpf_set_d(...)
Make the product: mpf_mul(...)
Convert the result back to an integer: mpz_set_f(...)
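Putting those four calls together (a sketch; the 256-bit working precision is my choice, enough to cover the asker's integer sizes):

#include <gmp.h>

void mpz_mul_double(mpz_t r, const mpz_t z, double d) {
    mpf_t fz, fd;
    mpf_init2(fz, 256);   // working precision in bits
    mpf_init2(fd, 64);
    mpf_set_z(fz, z);     // integer -> big float
    mpf_set_d(fd, d);     // double -> big float
    mpf_mul(fz, fz, fd);  // product
    mpz_set_f(r, fz);     // truncate back to integer
    mpf_clear(fz);
    mpf_clear(fd);
}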

Related

Efficient way of checking the length of a double in C++

Say I have a number, 100000. I can use some simple maths to check its size, i.e. log(100000) -> 5 (base 10 logarithm). There's also another way of doing this, which is quite slow: std::string num = std::to_string(100000); num.size(). Is there a way to mathematically determine the length of a number? (not just 100000, but for things like 2313455, 123876132, etc.)
Why not use ceil? It rounds up to the nearest whole number - you can just wrap that around your log function, and add a check afterwards to catch the fact that a power of 10 would return 1 less than expected.
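For illustration, a sketch of the floor-based variant (my code): floor(log10(x)) + 1 sidesteps the power-of-10 edge case that the ceil version needs an extra check for.

#include <cmath>
#include <cstdio>

int digitCount(double x) {   // x > 0, integer-valued
    return (int)std::floor(std::log10(x)) + 1;
}

int main() {
    std::printf("%d %d %d\n", digitCount(100000),
                digitCount(2313455), digitCount(123876132)); // prints 6 7 9
}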
Here is a solution to the problem using single precision floating point numbers in O(1):
#include <cstdio>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>

int main() {
    float x = 500; // to be converted
    uint32_t f;
    std::memcpy(&f, &x, sizeof(uint32_t)); // convert float into a manageable int
    uint8_t exp = (f & (0b11111111 << 23)) >> 23; // get the exponent
    exp -= 127; // floating-point bias
    exp /= 3.32; // this will round, but for this case it should be fine (log2(10))
    std::cout << std::to_string(exp) << std::endl;
}
For a number in scientific notation a*10^e this will return e (when 1 <= a < 10), so the length of the number (if it has an absolute value larger than 1) will be exp + 1.
For double precision this works, but you have to adapt it (bias is 1023 I think, and bit alignment is different. Check this)
This only works for floating point numbers, though so probably not very useful in this case. The efficiency in this case relative to the logarithm will also be determined by the speed at which int -> float conversion can occur.
Edit:
I just realised the question was about double. The modified result is:
int16_t getLength(double a) {
    uint64_t bits;
    std::memcpy(&bits, &a, sizeof(uint64_t));
    int16_t exp = (bits >> 52) & 0b11111111111; // there is no 11-bit int type, so this has to do
    exp -= 1023;
    exp /= 3.32;
    return exp + 1;
}
There are some changes so that it behaves better (and also less shifting).
You can also use frexp() to get the exponent without bias.
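A sketch of the frexp() route (mine, not the answerer's). Like the bit-twiddling version above, it estimates via exp * log10(2), so it can still be off by one near digit boundaries:

#include <cmath>

int digitCountFrexp(double x) {
    int e;
    std::frexp(x, &e);               // x = m * 2^e with m in [0.5, 1)
    return (int)(e * 0.30103) + 1;   // 0.30103 ~ log10(2)
}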
If the number is whole, keep dividing by 10, until you're at 0. You'd have to divide 100000 6 times, for example. For the fractional part, you need to keep multiplying by 10 until trunc(f) == f.

single-word division algorithm

I develop software for an embedded platform and need a single-word division algorithm.
The problem is as follows:
given a large integer represented by a sequence of 32-bit words (can be many),
we need to divide it by another 32-bit word, i.e. compute the quotient (also a large integer)
and the remainder (32 bits).
Certainly, if I were developing this algorithm on x86, I could simply take GNU MP,
but this library is way too large for an embedded platform. Furthermore, our processor
does not have a hardware integer divider (integer division is performed in software).
However, the processor has a quite fast FPU, so the trick is to use floating-point arithmetic wherever possible.
Any ideas how to implement this?
Sounds like a classic optimization. Instead of dividing by D, multiply by 0x100000000/D and then divide by 0x100000000. The latter is just a wordshift, i.e. trivial. Calculating the multiplier is a bit harder, but not a lot.
See also this article for a far more detailed background.
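For illustration, a sketch of that trick for one 32-bit word (my code, not from the article): multiply by a precomputed m ~ 0x100000000/D, shift right 32, then correct for the truncated multiplier.

#include <cstdint>
#include <cstdio>

uint32_t div_by_const(uint32_t x, uint32_t D, uint64_t m) {
    uint32_t q = (uint32_t)(((uint64_t)x * m) >> 32); // multiply, then wordshift
    while ((uint64_t)(q + 1) * D <= x) q++;           // fix the truncation error
    return q;
}

int main() {
    uint32_t D = 9;
    uint64_t m = (1ull << 32) / D;  // 0x100000000 / D, computed once per divisor
    std::printf("%u\n", div_by_const(1000000007u, D, m)); // prints 111111111
}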
Take a look at this one: the algorithm divides an integer a[0..n-1] by a single word 'c'
using floating-point for the 64x32->32 division. The limbs of the quotient 'q' are just printed in a loop; you can save them in an array if you like. Note that you don't need GMP to run the algorithm - I use it just to compare the results.
#include <stdio.h>
#include <math.h>
#include <gmp.h>

// divides a multi-precision integer a[0..n-1] by a single word c
void div_by_limb(const unsigned *a, unsigned n, unsigned c) {
    typedef unsigned long long uint64;
    unsigned c_norm = c, sh = 0;
    while ((c_norm & 0xC0000000) == 0) { // make sure the 2 MSBs are set
        c_norm <<= 1; sh++;
    }
    // precompute the inverses of 'c_norm' and 'c'
    double inv1 = 1.0 / (double)c_norm, inv2 = 1.0 / (double)c;
    unsigned i, r = 0;
    printf("\nquotient: "); // quotient is printed in a loop
    for (i = n - 1; (int)i >= 0; i--) { // start from the most significant digit
        unsigned u1 = r, u0 = a[i];
        union {
            struct { unsigned u0, u1; };
            uint64 x;
        } s = {u0, u1}; // treat [u1, u0] as a 64-bit int (little-endian assumed)
        // divide the 2-word number [u1, u0] by 'c_norm' using floating-point
        unsigned q = floor((double)s.x * inv1), q2;
        r = u0 - q * c_norm;
        // divide again: this time by 'c'
        q2 = floor((double)r * inv2);
        q = (q << sh) + q2; // reconstruct the quotient
        printf(i + 1 == n ? "%x" : "%08x", q); // zero-pad all but the leading limb
    }
    r %= c; // adjust the residue after normalization
    printf("; residue: %x\n", r);
}
int main() {
    mpz_t z, quo, rem;
    mpz_init(z); // this is the dividend
    mpz_set_str(z, "9999999999999999999999999999999", 10);
    unsigned div = 9; // this is the divisor
    // note: the cast below assumes GMP was built with 32-bit limbs
    div_by_limb((unsigned *)z->_mp_d, mpz_size(z), div);
    mpz_init(quo); mpz_init(rem);
    mpz_tdiv_qr_ui(quo, rem, z, div); // divide 'z' by 'div'
    gmp_printf("compare: Quo: %Zx; Rem %Zx\n", quo, rem);
    mpz_clear(quo);
    mpz_clear(rem);
    mpz_clear(z);
    return 0;
}
I believe that a look-up table and Newton Raphson successive approximation is the canonical choice used by hardware designers (who generally can't afford the gates for a full hardware divide). You get to choose the trade off the between accuracy and execution time.
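For reference, a sketch of the Newton-Raphson part (my illustration, not hardware pseudocode): each iteration of x = x*(2 - d*x) roughly doubles the number of correct bits of 1/d, so a small look-up table seed plus a few iterations reaches full precision.

#include <cstdio>

double refine_reciprocal(double d, double seed) {
    double x = seed;              // e.g. from a small look-up table
    for (int i = 0; i < 5; i++)
        x = x * (2.0 - d * x);    // Newton step for f(x) = 1/x - d
    return x;
}

int main() {
    std::printf("%.17g\n", refine_reciprocal(9.0, 0.1)); // -> ~0.111111111111...
}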

Generating random floating-point values based on random bit stream

Given a random source (a generator of random bit stream), how do I generate a uniformly distributed random floating-point value in a given range?
Assume that my random source looks something like:
unsigned int GetRandomBits(char* pBuf, int nLen);
And I want to implement
double GetRandomVal(double fMin, double fMax);
Notes:
I don't want the result precision to be limited (for example only 5 digits).
Strict uniform distribution is a must
I'm not asking for a reference to an existing library. I want to know how to implement it from scratch.
For pseudo-code / code, C++ would be most appreciated
I don't think I'll ever be convinced that you actually need this, but it was fun to write.
#include <stdint.h>
#include <cmath>
#include <cstdio>
FILE* devurandom;
bool geometric(int x) {
    // returns true with probability min(2^-x, 1)
    if (x <= 0) return true;
    while (1) {
        uint8_t r;
        fread(&r, sizeof r, 1, devurandom);
        if (x < 8) {
            return (r & ((1 << x) - 1)) == 0;
        } else if (r != 0) {
            return false;
        }
        x -= 8;
    }
}

double uniform(double a, double b) {
    // requires IEEE doubles and 0.0 < a < b < inf and a normal
    // implicitly computes a uniform random real y in [a, b)
    // and returns the greatest double x such that x <= y
    union {
        double f;
        uint64_t u;
    } convert;
    convert.f = a;
    uint64_t a_bits = convert.u;
    convert.f = b;
    uint64_t b_bits = convert.u;
    uint64_t mask = b_bits - a_bits;
    mask |= mask >> 1;
    mask |= mask >> 2;
    mask |= mask >> 4;
    mask |= mask >> 8;
    mask |= mask >> 16;
    mask |= mask >> 32;
    int b_exp;
    frexp(b, &b_exp);
    while (1) {
        // sample uniform x_bits in [a_bits, b_bits)
        uint64_t x_bits;
        fread(&x_bits, sizeof x_bits, 1, devurandom);
        x_bits &= mask;
        x_bits += a_bits;
        if (x_bits >= b_bits) continue;
        double x;
        convert.u = x_bits;
        x = convert.f;
        // accept x with probability proportional to 2^x_exp
        int x_exp;
        frexp(x, &x_exp);
        if (geometric(b_exp - x_exp)) return x;
    }
}

int main() {
    devurandom = fopen("/dev/urandom", "r");
    for (int i = 0; i < 100000; ++i) {
        printf("%.17g\n", uniform(1.0 - 1e-15, 1.0 + 1e-15));
    }
}
Here is one way of doing it.
The IEEE Std 754 double format is as follows:
[s][ e ][ f ]
where s is the sign bit (1 bit), e is the biased exponent (11 bits) and f is the fraction (52 bits).
Beware that the layout in memory will be different on little-endian machines.
For 0 < e < 2047, the number represented is
(-1)**(s) * 2**(e - 1023) * (1.f)
By setting s to 0, e to 1023 and f to 52 random bits from your bit stream, you get a random double in the interval [1.0, 2.0). This interval is unique in that it contains 2 ** 52 doubles, and these doubles are equidistant. If you then subtract 1.0 from the constructed double, you get a random double in the interval [0.0, 1.0). Moreover, the property of being equidistant is preserved.
From there you should be able to scale and translate as needed.
I'm surprised that for question this old, nobody had actual code for the best answer. User515430's answer got it right--you can take advantage of IEEE-754 double format to directly put 52 bits into a double with no math at all. But he didn't give code. So here it is, from my public domain ojrandlib:
double ojr_next_double(ojr_generator *g) {
    uint64_t r = (OJR_NEXT64(g) & 0xFFFFFFFFFFFFFull) | 0x3FF0000000000000ull;
    return *(double *)(&r) - 1.0;
}
NEXT64() gets a 64-bit random number. If you have a more efficient way of getting only 52 bits, use that instead.
This is easy, as long as you have an integer type with as many bits of precision as a double. For instance, an IEEE double-precision number has 53 bits of precision, so a 64-bit integer type is enough:
#include <limits.h>

double GetRandomVal(double fMin, double fMax) {
    unsigned long long n;
    GetRandomBits((char*)&n, sizeof(n));
    return fMin + (n * (fMax - fMin)) / ULLONG_MAX;
}
This is probably not the answer you want, but the specification here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3225.pdf
in sections [rand.util.canonical] and [rand.dist.uni.real], contains sufficient information to implement what you want, though with slightly different syntax. It isn't easy, but it is possible. I speak from personal experience. A year ago I knew nothing about random numbers, and I was able to do it. Though it took me a while... :-)
The question is ill-posed. What does uniform distribution over floats even mean?
Taking our cue from discrepancy, one way to operationalize your question is to define that you want the distribution that minimizes the following value:

D = sup over t in [fMin, fMax] of | P(x <= t) - (t - fMin) / (fMax - fMin) |

where x is the random variable you are sampling with your GetRandomVal(double fMin, double fMax) function, and P(x <= t) means the probability that a random x is smaller than or equal to t.
And now you can go on and try to evaluate e.g. a dabbler's answer. (Hint: all the answers that fail to use the whole precision and stick to e.g. 52 bits will fail this minimization criterion.)
However, if you just want to be able to generate all float bit patterns that fall into your specified range with equal probability, even if that means that e.g. asking for GetRandomVal(0,1000) will create more values between 0 and 1.5 than between 1.5 and 1000, that's easy: any interval of IEEE floating-point numbers, when interpreted as bit patterns, maps easily to a very small number of intervals of unsigned int64. See e.g. this question. Generating equally distributed random values of unsigned int64 in any given interval is easy.
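A sketch of that last step (mine; the clzll builtin is GCC/Clang-specific, and rand64() is a placeholder for your bit source): draw masked values and reject until one lands in range, which keeps the distribution exactly uniform.

#include <cstdint>

uint64_t uniform_u64(uint64_t lo, uint64_t hi, uint64_t (*rand64)()) {
    uint64_t range = hi - lo;                             // hi exclusive, range >= 1
    uint64_t mask = ~0ull >> __builtin_clzll(range | 1);  // smallest 2^k - 1 >= range - 1
    for (;;) {
        uint64_t x = rand64() & mask;  // uniform on [0, mask]
        if (x < range) return lo + x;  // reject values past the range
    }
}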
I may be misunderstanding the question, but what stops you from simply sampling the next n bits from the random bit stream and converting that to a base-10 number ranged 0 to 2^n - 1?
To get a random value in [0..1[ you could do something like:

double value = 0;
for (int i = 0; i < 53; i++)
    value = 0.5 * (value + random_bit()); // insert 1 random bit
    // or value = ldexp(value + random_bit(), -1);
    // or group several bits into one single ldexp
return value;

How can I write a power function myself?

I was always wondering how I can make a function which calculates the power (e.g. 2^3) myself. In most languages these are included in the standard library, mostly as pow(double x, double y), but how can I write it myself?
I was thinking about for loops, but I think my brain got in a loop (when I wanted to do a power with a non-integer exponent, like 5^4.5, or negatives like 2^-21) and I went crazy ;)
So, how can I write a function which calculates the power of a real number? Thanks
Oh, maybe important to note: I cannot use functions which use powers (e.g. exp), which would make this ultimately useless.
Negative powers are not a problem, they're just the inverse (1/x) of the positive power.
Floating point powers are just a little bit more complicated; as you know, a fractional power is equivalent to a root (e.g. x^(1/2) == sqrt(x)) and you also know that multiplying powers with the same base is equivalent to adding their exponents.
With all the above, you can:
Decompose the exponent into an integer part and a rational part.
Calculate the integer power with a loop (you can optimise it by decomposing into factors and reusing partial calculations).
Calculate the root with any algorithm you like (any iterative approximation like bisection or the Newton method could work).
Multiply the results.
If the exponent was negative, apply the inverse. (A sketch of these steps follows the example below.)
Example:
2^(-3.5) = (2^3 * 2^(1/2))^-1 = 1 / (2*2*2 * sqrt(2))
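A sketch of those steps in C++ (my code, handling only exponents that are multiples of 1/2 so the root step stays trivial; a bisection root-finder would generalize it):

#include <cmath>
#include <cstdio>

double powHalfSteps(double x, double e) {
    bool neg = (e < 0);
    if (neg) e = -e;
    long ip = (long)e;                      // integer part of the exponent
    double r = 1.0;
    for (long k = 0; k < ip; k++) r *= x;   // integer power with a loop
    if (e - ip >= 0.5) r *= std::sqrt(x);   // the fractional part (here: 1/2)
    return neg ? 1.0 / r : r;               // negative exponent: invert
}

int main() {
    std::printf("%g\n", powHalfSteps(2.0, -3.5)); // 1/(8*sqrt(2)) ~ 0.0883883
}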
A^B = Log^-1(Log(A)*B)
Edit: yes, this definition really does provide something useful. For example, on an x86, it translates almost directly to FYL2X (Y * Log2(X)) and F2XM1 (2^x - 1):
fyl2x
fld st(0)
frndint
fsubr st(1),st
fxch st(1)
fchs
f2xm1
fld1
faddp st(1),st
fscale
fstp st(1)
The code ends up a little longer than you might expect, primarily because F2XM1 only works with numbers in the range -1.0..1.0. The fld st(0)/frndint/fsubr st(1),st piece subtracts off the integer part, so we're left with only the fraction. We apply F2XM1 to that, add the 1 back on, then use FSCALE to handle the integer part of the exponentiation.
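A C++ rendering of the same sequence (a sketch; it leans on the library's log2/exp2/ldexp purely to mirror the instructions, which the asm computes itself):

#include <cmath>

double pow_via_log2(double x, double y) {
    double t = y * std::log2(x);      // FYL2X
    double ti = std::nearbyint(t);    // FRNDINT: split off the integer part
    double tf = t - ti;               // fraction, within F2XM1's domain
    return std::ldexp(std::exp2(tf), (int)ti); // F2XM1 (+1), then FSCALE
}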
Typically the implementation of the pow(double, double) function in math libraries is based on the identity:
pow(x,y) = pow(a, y * log_a(x))
Using this identity, you only need to know how to raise a single number a to an arbitrary exponent, and how to take a logarithm base a. You have effectively turned a complicated multi-variable function into two functions of a single variable and a multiplication, which is pretty easy to implement. The most commonly chosen values of a are e or 2 -- e because e^x and log_e(1+x) have some very nice mathematical properties, and 2 because it has some nice properties for implementation in floating-point arithmetic.
The catch of doing it this way is that (if you want to get full accuracy) you need to compute the log_a(x) term (and its product with y) to higher accuracy than the floating-point representation of x and y. For example, if x and y are doubles, and you want to get a high accuracy result, you'll need to come up with some way to store intermediate results (and do arithmetic) in a higher-precision format. The Intel x87 format is a common choice, as are 64-bit integers (though if you really want a top-quality implementation, you'll need to do a couple of 96-bit integer computations, which are a little bit painful in some languages). It's much easier to deal with this if you implement powf(float,float), because then you can just use double for intermediate computations. I would recommend starting with that if you want to use this approach.
The algorithm that I outlined is not the only possible way to compute pow. It is merely the most suitable for delivering a high-speed result that satisfies a fixed a priori accuracy bound. It is less suitable in some other contexts, and is certainly much harder to implement than the repeated-square[root]-ing algorithm that some others have suggested.
If you want to try the repeated square[root] algorithm, begin by writing an unsigned integer power function that uses repeated squaring only. Once you have a good grasp on the algorithm for that reduced case, you will find it fairly straightforward to extend it to handle fractional exponents.
There are two distinct cases to deal with: Integer exponents and fractional exponents.
For integer exponents, you can use exponentiation by squaring.
def pow(base, exponent):
    if exponent == 0:
        return 1
    elif exponent < 0:
        return 1 / pow(base, -exponent)
    elif exponent % 2 == 0:
        half_pow = pow(base, exponent // 2)
        return half_pow * half_pow
    else:
        return base * pow(base, exponent - 1)
The second "elif" is what distinguishes this from the naïve pow function. It allows the function to make O(log n) recursive calls instead of O(n).
For fractional exponents, you can use the identity a^b = C^(b*log_C(a)). It's convenient to take C=2, so a^b = 2^(b * log2(a)). This reduces the problem to writing functions for 2^x and log2(x).
The reason it's convenient to take C=2 is that floating-point numbers are stored in base-2 floating point. log2(a * 2^b) = log2(a) + b. This makes it easier to write your log2 function: You don't need to have it be accurate for every positive number, just on the interval [1, 2). Similarly, to calculate 2^x, you can multiply 2^(integer part of x) * 2^(fractional part of x). The first part is trivial to store in a floating point number, for the second part, you just need a 2^x function over the interval [0, 1).
The hard part is finding a good approximation of 2^x and log2(x). A simple approach is to use Taylor series.
Per definition:
a^b = exp(b ln(a))
where exp(x) = 1 + x + x^2/2 + x^3/3! + x^4/4! + x^5/5! + ...
where n! = 1 * 2 * ... * n.
In practice, you could store an array of the first 10 values of 1/n!, and then approximate
exp(x) = 1 + x + x^2/2 + x^3/3! + ... + x^10/10!
because 10! is a huge number, so 1/10! is very small (2.7557319224⋅10^-7).
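As a sketch (my code), the running term can also be built incrementally instead of from a stored table of 1/n!, since x^n/n! = (x^(n-1)/(n-1)!) * (x/n):

double myExp(double x) {   // truncated Taylor series, terms up to x^10/10!
    double term = 1.0, sum = 1.0;
    for (int n = 1; n <= 10; n++) {
        term *= x / n;     // term is now x^n / n!
        sum += term;
    }
    return sum;
}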
Wolfram functions gives a wide variety of formulae for calculating powers. Some of them would be very straightforward to implement.
For positive integer powers, look at exponentiation by squaring and addition-chain exponentiation.
Using three self-implemented functions iPow(x, n), Ln(x) and Exp(x), I'm able to compute fPow(x, a), x and a being doubles. None of the functions below use library functions, just iteration.
Some explanation about functions implemented:
(1) iPow(x, n): x is double, n is int. This is a simple iteration, as n is an integer.
(2) Ln(x): This function uses the Taylor series iteration. The series used is ln(x) = 2 * Σ (from i = 0 to n) {(1 / (2 * i + 1)) * ((x - 1) / (x + 1)) ^ (2 * i + 1)}. The symbol ^ denotes the power function Pow(x, n) implemented in the 1st function, which uses simple iteration.
(3) Exp(x): This function, again, uses the Taylor series iteration. The series used is Σ (from i = 0 to n) {x^i / i!}. Here, the ^ again denotes a power, but it is not computed by calling the 1st Pow(x, n) function; instead it is implemented within the 3rd function, concurrently with the factorial, using d *= x / i. I had to use this trick because the iteration takes more steps in this function than in the others, and the factorial (i!) overflows most of the time. By iterating the power concurrently with the factorial, the term never overflows.
(4) fPow(x, a): x and a are both doubles. This function does nothing but just call the other three functions implemented above. The main idea in this function depends on some calculus: fPow(x, a) = Exp(a * Ln(x)). And now, I have all the functions iPow, Ln and Exp with iteration already.
n.b. I used a constant MAX_DELTA_DOUBLE in order to decide in which step to stop the iteration. I've set it to 1.0E-15, which seems reasonable for doubles. So, the iteration stops if (delta < MAX_DELTA_DOUBLE) If you need some more precision, you can use long double and decrease the constant value for MAX_DELTA_DOUBLE, to 1.0E-18 for example (1.0E-18 would be the minimum).
Here is the code, which works for me.
#include <stdio.h>
#include <stdlib.h>

#define MAX_DELTA_DOUBLE 1.0E-15
#define EULERS_NUMBER 2.718281828459045

double MathAbs_Double(double x) {
    return ((x >= 0) ? x : -x);
}

int MathAbs_Int(int x) {
    return ((x >= 0) ? x : -x);
}

double MathPow_Double_Int(double x, int n) {
    double ret;
    if ((x == 1.0) || (n == 1)) {
        ret = x;
    } else if (n < 0) {
        ret = 1.0 / MathPow_Double_Int(x, -n);
    } else {
        ret = 1.0;
        while (n--) {
            ret *= x;
        }
    }
    return (ret);
}

double MathLn_Double(double x) {
    double ret = 0.0, d;
    if (x > 0) {
        int n = 0;
        do {
            int a = 2 * n + 1;
            d = (1.0 / a) * MathPow_Double_Int((x - 1) / (x + 1), a);
            ret += d;
            n++;
        } while (MathAbs_Double(d) > MAX_DELTA_DOUBLE);
    } else {
        printf("\nerror: x < 0 in ln(x)\n");
        exit(-1);
    }
    return (ret * 2);
}

double MathExp_Double(double x) {
    double ret;
    if (x == 1.0) {
        ret = EULERS_NUMBER;
    } else if (x < 0) {
        ret = 1.0 / MathExp_Double(-x);
    } else {
        int n = 2;
        double d;
        ret = 1.0 + x;
        do {
            d = x;
            for (int i = 2; i <= n; i++) {
                d *= x / i;
            }
            ret += d;
            n++;
        } while (d > MAX_DELTA_DOUBLE);
    }
    return (ret);
}

double MathPow_Double_Double(double x, double a) {
    double ret;
    if ((x == 1.0) || (a == 1.0)) {
        ret = x;
    } else if (a < 0) {
        ret = 1.0 / MathPow_Double_Double(x, -a);
    } else {
        ret = MathExp_Double(a * MathLn_Double(x));
    }
    return (ret);
}
It's an interesting exercise. Here are some suggestions, which you should try in this order:
Use a loop.
Use recursion (not better, but interesting nonetheless).
Optimize your recursion vastly by using divide-and-conquer techniques (see the sketch below).
Use logarithms.
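A minimal sketch of the divide-and-conquer step (my code): halving the exponent turns O(n) multiplications into O(log n).

double ipow(double x, unsigned n) {
    if (n == 0) return 1.0;
    double h = ipow(x, n / 2);          // recurse on half the exponent
    return (n % 2) ? h * h * x : h * h; // square, times x for odd exponents
}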
You can build the pow function like this (positive integer exponents only):

static double pows(double p_nombre, double p_puissance)
{
    double nombre = p_nombre;
    double i = 0;
    for (i = 0; i < (p_puissance - 1); i++) {
        nombre = nombre * p_nombre;
    }
    return (nombre);
}
You can build the floor function like this:

static double floors(double p_nomber)
{
    double x = p_nomber;
    long partent = (long) x;
    if (x < 0 && x != partent)  // negative non-integers round down
    {
        return (partent - 1);
    }
    else
    {
        return (partent);
    }
}
Best regards
A better algorithm to efficiently calculate positive integer powers is to repeatedly square the base, while keeping track of the extra remainder multiplicands. Here is a sample solution in Python that should be relatively easy to understand and translate into your preferred language:
def power(base, exponent):
    remaining_multiplicand = 1
    result = base
    while exponent > 1:
        remainder = exponent % 2
        if remainder > 0:
            remaining_multiplicand = remaining_multiplicand * result
        exponent = (exponent - remainder) // 2
        result = result * result
    return result * remaining_multiplicand
To make it handle negative exponents, all you have to do is calculate the positive version and divide 1 by the result, so that should be a simple modification to the above code. Fractional exponents are considerably more difficult, since it means essentially calculating an nth-root of the base, where n = 1/abs(exponent % 1) and multiplying the result by the result of the integer portion power calculation:
power(base, exponent - (exponent % 1))
You can calculate roots to a desired level of accuracy using Newton's method. Check out wikipedia article on the algorithm.
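A sketch of Newton's method for the n-th root (my code): to solve r^n = base, iterate r := r - (r^n - base) / (n * r^(n-1)). For instance, nthRoot(2.0, 2) converges to 1.4142135623730951 in a handful of iterations.

double nthRoot(double base, int n) {    // base > 0, n >= 1
    double r = base / n + 1.0;          // crude initial guess above the root
    for (int it = 0; it < 100; it++) {
        double rn1 = 1.0;               // r^(n-1)
        for (int i = 1; i < n; i++) rn1 *= r;
        double next = r - (rn1 * r - base) / (n * rn1);
        if (next == r) break;           // converged to double precision
        r = next;
    }
    return r;
}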
I am using fixed-point long arithmetic and my pow is log2/exp2 based. Numbers consist of:
int sig = { -1, +1 } ... signum
DWORD a[A+B] ... the number
A is the number of DWORDs for the integer part of the number
B is the number of DWORDs for the fractional part
My simplified solution is this:
//---------------------------------------------------------------------------
longnum exp2(const longnum &x)
{
    int i, j;
    longnum c, d;
    c.one();
    if (x.iszero()) return c;
    i = x.bits() - 1;
    for (d = 2, j = _longnum_bits_b; j <= i; j++, d *= d)
        if (x.bitget(j))
            c *= d;
    for (i = 0, j = _longnum_bits_b - 1; i < _longnum_bits_b; j--, i++)
        if (x.bitget(j))
            c *= _longnum_log2[i];
    if (x.sig < 0) { d.one(); c = d / c; }
    return c;
}
//---------------------------------------------------------------------------
longnum log2(const longnum &x)
{
    int i, j;
    longnum c, d, dd, e, xx;
    c.zero(); d.one(); e.zero(); xx = x;
    if (xx.iszero()) return c;  //**** error: log2(0) = infinity
    if (xx.sig < 0) return c;   //**** error: log2(negative x) ... no result possible
    if (d.geq(x, d) == 0) { xx = d / xx; xx.sig = -1; }
    i = xx.bits() - 1;
    e.bitset(i); i -= _longnum_bits_b;
    for (; i > 0; i--, e >>= 1) // integer part
    {
        dd = d * e;
        j = dd.geq(dd, xx);
        if (j == 1) continue;   // dd >  xx
        c += i; d = dd;
        if (j == 2) break;      // dd == xx
    }
    for (i = 0; i < _longnum_bits_b; i++) // fractional part
    {
        dd = d * _longnum_log2[i];
        j = dd.geq(dd, xx);
        if (j == 1) continue;   // dd >  xx
        c.bitset(_longnum_bits_b - i - 1); d = dd;
        if (j == 2) break;      // dd == xx
    }
    c.sig = xx.sig;
    c.iszero();
    return c;
}
//---------------------------------------------------------------------------
longnum pow (const longnum &x, const longnum &y)
{
    //x^y = exp2(y*log2(x))
    int ssig = +1; longnum c; c = x;
    if (y.iszero()) { c.one(); return c; } // ?^0 = 1
    if (c.iszero()) return c;              // 0^? = 0
    if (c.sig < 0)
    {
        c.overflow(); c.sig = +1;
        if (y.isreal()) { c.zero(); return c; } //**** error: negative x ^ noninteger y
        if (y.bitget(_longnum_bits_b)) ssig = -1;
    }
    c = exp2(log2(c) * y); c.sig = ssig; c.iszero();
    return c;
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
where:
_longnum_bits_a = A*32
_longnum_bits_b = B*32
_longnum_log2[i] = 2 ^ (1/2^(i+1)) ... precomputed sqrt table:
_longnum_log2[0] = sqrt(2)
_longnum_log2[1] = sqrt(_longnum_log2[0])
_longnum_log2[i] = sqrt(_longnum_log2[i-1])
longnum::zero() sets *this = 0
longnum::one() sets *this = +1
bool longnum::iszero() returns (*this == 0)
bool longnum::isnonzero() returns (*this != 0)
bool longnum::isreal() returns true if the fractional part != 0
bool longnum::isinteger() returns true if the fractional part == 0
int longnum::bits() returns the number of used bits in the number, counted from the LSB
longnum::bitget()/bitset()/bitres()/bitxor() are bit access
longnum::overflow() rounds the number up if there was an overflow: X.FFFFFFFFFF...FFFFFFFFF??h -> (X+1).0000000000000...000000000h
int longnum::geq(x,y) compares |x| and |y|; returns 0,1,2 for (<,>,==)
All you need to understand this code is that a number in binary form consists of a sum of powers of 2, so when you need to compute 2^num it can be rewritten as

2^(b(-n)*2^(-n) + ... + b(+m)*2^(+m))

where n are the fractional bits and m are the integer bits. Multiplication/division by 2 in binary form is simple bit shifting, so if you put it all together you get code for exp2 similar to mine. log2 is based on binary search, changing the result bits from MSB to LSB until it matches the searched value (a very similar algorithm to fast sqrt computation). Hope this helps clarify things.
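A double-precision illustration of the fractional half of that idea (mine, not the longnum code): walk the fractional bits of x from the MSB down, multiplying in the precomputed 2^(1/2^(k+1)) table factors.

#include <cmath>

double exp2_frac(double frac) {           // frac in [0, 1)
    double result = 1.0;
    double factor = 1.4142135623730951;   // 2^(1/2), the table's first entry
    for (int k = 0; k < 53; k++) {
        frac *= 2.0;                      // expose the next fractional bit
        if (frac >= 1.0) { result *= factor; frac -= 1.0; }
        factor = std::sqrt(factor);       // next table entry: 2^(1/2^(k+2))
    }
    return result;
}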
A lot of approaches are given in other answers. Here is something that I thought may be useful in the case of integral powers.

In the case of an integer power x of n, i.e. n^x, the straightforward approach would take x-1 multiplications. In order to optimize this, we can use dynamic programming and reuse an earlier multiplication result to avoid all x multiplications. For example, for 5^9 we can, say, make batches of 3, i.e. calculate 5^3 once, get 125, and then cube 125 using the same logic, taking only 4 multiplications in the process instead of 8 multiplications with the straightforward way.

The question is what is the ideal size of the batch b so that the number of multiplications is minimum. So let's write the equation for this. If f(x,b) is the function representing the number of multiplications entailed in calculating n^x using the above method, then

f(x,b) = (x/b - 1) + (b - 1) = x/b + b - 2

Explanation: A product of a batch of p numbers will take p-1 multiplications. If we divide x multiplications into b batches, there would be (x/b)-1 multiplications required inside each batch, and b-1 multiplications required for all b batches.

Now we can calculate the first derivative of this function with respect to b and equate it to 0 to get the b for the least number of multiplications:

df/db = 1 - x/b^2 = 0  =>  b = sqrt(x)

Now put back this value of b into the function f(x,b) to get the least number of multiplications:

f(x, sqrt(x)) = 2*sqrt(x) - 2

For all positive x (except x = 1), this value is less than x-1, the multiplications by the straightforward way. (Check with the 5^9 example: x = 9, b = 3, f = 9/3 + 3 - 2 = 4.)
Maybe you can use a Taylor series expansion. The Taylor series of a function is an infinite sum of terms that are expressed in terms of the function's derivatives at a single point. For most common functions, the function and the sum of its Taylor series are equal near this point. Taylor series are named after Brook Taylor, who introduced them in 1715.

How do I convert double to string using only math.h?

I am trying to convert a double to a string in a native NT application, i.e. an application that only depends on ntdll.dll. Unfortunately, ntdll's version of vsnprintf does not support %f et al., forcing me to implement the conversion on my own.
The aforementioned ntdll.dll exports only a few of the math.h functions (floor, ceil, log, pow, ...). However, I am reasonably sure that I can implement any of the unavailable math.h functions if necessary.
There is an implementation of floating-point conversion in GNU's libc, but the code is extremely dense and difficult to comprehend (the GNU indentation style does not help here).
I've already implemented the conversion by normalizing the number (i.e. multiplying/dividing the number by 10 until it's in the interval [1, 10)) and then generating each digit by cutting the integral part off with modf and multiplying the fractional part by 10. This works, but there is a loss of precision (only the first 15 digits are correct). The loss of precision is, of course, inherent to the algorithm.
I'd settle with 17 digits, but an algorithm that would be able to generate an arbitrary number of digits correctly would be preferred.
Could you please suggest an algorithm or point me to a good resource?
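To make the setup concrete, here is a minimal sketch of the normalize-and-peel approach described above (a simplified reconstruction, not my exact code); the ~15-digit precision decay shows up directly in its output:

#include <cmath>
#include <cstdio>

void print_double(double v, int digits) {   // v > 0
    int e = 0;
    while (v >= 10.0) { v /= 10.0; e++; }   // normalize into [1, 10)
    while (v <  1.0)  { v *= 10.0; e--; }
    for (int i = 0; i < digits; i++) {
        double ip;
        v = std::modf(v, &ip) * 10.0;       // cut one digit off, shift up
        std::printf("%d", (int)ip);
    }
    std::printf("e%+d\n", e);
}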
Double-precision numbers do not have more than 15 significant (decimal) figures of precision. There is absolutely no way you can get "an arbitrary number of digits correctly"; doubles are not bignums.
Since you say you're happy with 17 significant figures, use long double; on Windows, I think, that will give you 19 significant figures.
I've thought about this a bit more. You lose precision because you normalize by multiplying by some power of 10 (you chose [1,10) rather than [0,1), but that's a minor detail). If you did so with a power of 2, you'd lose no precision, but then you'd get "decimal digits"*2^e; you could implement bcd arithmetic and compute the product yourself, but that doesn't sound like fun.
I'm pretty confident that you could split the double g=m*2^e into two parts: h=floor(g*10^k) and i=modf(g*10^k) for some k, and then separately convert to decimal digits and then stitch them together, but how about a simpler approach: use "long double" (80 bits, but I've heard that Visual C++ may not support it?) with your current approach and stop after 17 digits.
_gcvt should do it (edit - it's not in ntdll.dll, it's in some msvcrt*.dll?)
As for decimal digits of precision, IEEE binary64 has 52 binary digits. 52*log10(2)=15.65... (edit: as you pointed out, to round trip, you need more than 16 digits)
After a lot of research, I found a paper titled Printing Floating-Point Numbers Quickly and Accurately. It uses exact rational arithmetic to avoid precision loss. It cites a little older paper: How to Print Floating-Point Numbers Accurately, which however seems to require ACM subscription to access.
Since the former paper was reprinted in 2006, I am inclined to believe that it is still current. The exact rational arithmetic (which requires dynamic allocation) seems to be a necessary evil.
A complete implementation of the C code for the fastest known (as of today) algorithm:
http://code.google.com/p/double-conversion/downloads/list
It even includes a test suite.
This is the C code behind the algorithm described in this PDF:
Printing Floating-Point Numbers Quickly and Accurately
http://www.cs.indiana.edu/~burger/FP-Printing-PLDI96.pdf
#include <cstdint>
#include <limits>

// --------------------------------------------------------------------------
// Return number of decimal digits of a given unsigned integer
// N is uint8_t/uint16_t/uint32_t/uint64_t
template <class N> inline uint8_t GetUnsignedDecDigits(const N n)
{
    static_assert(std::numeric_limits<N>::is_integer && !std::numeric_limits<N>::is_signed,
                  "GetUnsignedDecDigits: unsigned integer type expected");
    const uint8_t anMaxDigits[] = {3, 5, 8, 10, 13, 15, 17, 20};
    const uint8_t nMaxDigits = anMaxDigits[sizeof(N)-1];
    uint8_t nDigits = 1;
    N nRoof = 10;
    while ((n >= nRoof) && (nDigits < nMaxDigits))
    {
        nDigits++;
        nRoof *= 10;
    }
    return nDigits;
}
// --------------------------------------------------------------------------
// Convert floating-point value to NULL-terminated string representation
TCHAR* DoubleToStr(double f,        // [i ]
                   TCHAR* pczStr,   // [i/o] caller should allocate enough space
                   int nDigitsI,    // [i ] digits of integer part including sign / <1: auto
                   int nDigitsF)    // [i ] digits of fractional part / <0: auto
{
    switch (_fpclass(f))
    {
        case _FPCLASS_SNAN:
        case _FPCLASS_QNAN: _tcscpy_s(pczStr, 5, _T("NaN" )); return pczStr;
        case _FPCLASS_NINF: _tcscpy_s(pczStr, 5, _T("-INF")); return pczStr;
        case _FPCLASS_PINF: _tcscpy_s(pczStr, 5, _T("+INF")); return pczStr;
    }

    if (nDigitsI > 18) nDigitsI = 18;  if (nDigitsI < 1) nDigitsI = -1;
    if (nDigitsF > 18) nDigitsF = 18;  if (nDigitsF < 0) nDigitsF = -1;

    bool bNeg = (f < 0);
    if (f < 0)
        f = -f;

    int nE = 0; // exponent (displayed if != 0)
    if ( ((-1 == nDigitsI) && (f >= 1e18)) || // large value: switch to scientific representation
         ((-1 != nDigitsI) && (f >= pow(10., nDigitsI))) )
    {
        nE = (int)log10(f);
        f /= (double)pow(10., nE);
        if (-1 != nDigitsF)
            nDigitsF = __max(nDigitsF, nDigitsI + nDigitsF - (bNeg ? 2 : 1) - 4);
        nDigitsI = (bNeg ? 2 : 1);
    }
    else if (f > 0)
        if ((-1 == nDigitsF) && (f <= 1e-10)) // small value: switch to scientific representation
        {
            nE = (int)log10(f) - 1;
            f /= (double)pow(10., nE);
            if (-1 != nDigitsF)
                nDigitsF = __max(nDigitsF, nDigitsI + nDigitsF - (bNeg ? 2 : 1) - 4);
            nDigitsI = (bNeg ? 2 : 1);
        }

    double fI;
    double fF = modf(f, &fI); // fI: integer part, fF: fractional part

    if (-1 == nDigitsF) // figure out the number of meaningful digits in fF
    {
        double fG, fGI, fGF;
        do
        {
            nDigitsF++;
            fG = fF * pow(10., nDigitsF);
            fGF = modf(fG, &fGI);
        }
        while (fGF > 1e-10);
    }

    const double afPower10[20] = {1e0 , 1e1 , 1e2 , 1e3 , 1e4 , 1e5 , 1e6 , 1e7 , 1e8 , 1e9 ,
                                  1e10, 1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19};

    uint64_t uI = (uint64_t)round(fI);
    uint64_t uF = (uint64_t)round(fF * afPower10[nDigitsF]);

    if (uF)
        if (GetUnsignedDecDigits(uF) > nDigitsF) // X.99999 was rounded to X+1
        {
            uF = 0;
            uI++;
            if (nE)
            {
                uI /= 10;
                nE++;
            }
        }

    uint8_t nRealDigitsI = GetUnsignedDecDigits(uI);
    if (bNeg)
        nRealDigitsI++;

    int nPads = 0;
    if (-1 != nDigitsI)
    {
        nPads = nDigitsI - nRealDigitsI;
        for (int i = nPads - 1; i >= 0; i--) // leading spaces
            pczStr[i] = _T(' ');
    }

    if (bNeg) // minus sign
    {
        pczStr[nPads] = _T('-');
        nRealDigitsI--;
        nPads++;
    }

    for (int j = nRealDigitsI - 1; j >= 0; j--) // digits of integer part
    {
        pczStr[nPads + j] = (uint8_t)(uI % 10) + _T('0');
        uI /= 10;
    }
    nPads += nRealDigitsI;

    if (nDigitsF)
    {
        pczStr[nPads++] = _T('.'); // decimal point
        for (int k = nDigitsF - 1; k >= 0; k--) // digits of fractional part
        {
            pczStr[nPads + k] = (uint8_t)(uF % 10) + _T('0');
            uF /= 10;
        }
    }
    nPads += nDigitsF;

    if (nE)
    {
        pczStr[nPads++] = _T('e'); // exponent sign
        if (nE < 0)
        {
            pczStr[nPads++] = _T('-');
            nE = -nE;
        }
        else
            pczStr[nPads++] = _T('+');
        for (int l = 2; l >= 0; l--) // digits of exponent
        {
            pczStr[nPads + l] = (uint8_t)(nE % 10) + _T('0');
            nE /= 10;
        }
        pczStr[nPads + 3] = 0;
    }
    else
        pczStr[nPads] = 0;

    return pczStr;
}
Does vsnprintf support I64?
double x = SOME_VAL; // allowed to be from -1.e18 to 1.e18
bool sign = (x < 0);
if (sign) x = -x;
__int64 i = static_cast<__int64>(x);
double xm = x - static_cast<double>(i);
__int64 w = static_cast<__int64>(xm * pow(10.0, DIGITS_VAL)); // DIGITS_VAL: how many digits after the decimal point you want
char out[100];
_snprintf(out, sizeof out, "%s%I64d.%I64d", (sign ? "-" : ""), i, w); // note: the fractional part should really be zero-padded to DIGITS_VAL digits
Another option is to try to find an implementation of gcvt.
Have you looked at the uClibc implementation of printf?