Exact value of a floating-point number as a rational - c++

I'm looking for a method to convert the exact value of a floating-point number to a rational quotient of two integers, i.e. a / b, where b is not larger than a specified maximum denominator b_max. If satisfying the condition b <= b_max is impossible, then the result falls back to the best approximation which still satisfies the condition.
Hold on. There are a lot of questions/answers here about the best rational approximation of a truncated real number which is represented as a floating-point number. However I'm interested in the exact value of a floating-point number, which is itself a rational number with a different representation. More specifically, the mathematical set of floating-point numbers is a subset of rational numbers. In case of IEEE 754 binary floating-point standard it is a subset of dyadic rationals. Anyway, any floating-point number can be converted to a rational quotient of two finite precision integers as a / b.
So, for example assuming IEEE 754 single-precision binary floating-point format, the rational equivalent of float f = 1.0f / 3.0f is not 1 / 3, but 11184811 / 33554432. This is the exact value of f, which is a number from the mathematical set of IEEE 754 single-precision binary floating-point numbers.
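A minimal sketch of how that exact quotient can be read off with frexp, assuming FLT_RADIX == 2 and a magnitude moderate enough that both parts fit in int64_t. The helper name exact_fraction is hypothetical and does not come from any library.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Hypothetical helper: returns {numerator, denominator} whose quotient is exactly f.
// Assumes binary floating point, finite f, and f == 0 or 2^-38 <= |f| < 2^62,
// so that both parts fit in int64_t.
std::pair<std::int64_t, std::int64_t> exact_fraction(float f)
{
    int e = 0;
    const float m = std::frexp(f, &e);   // f == m * 2^e, 0.5 <= |m| < 1 (or m == 0)
    // Scale the fraction to a 24-bit integer significand; this step is exact.
    auto num = static_cast<std::int64_t>(std::ldexp(m, 24));
    e -= 24;                             // now f == num * 2^e
    // Cancel common factors of two while the denominator is still above 1.
    while (num != 0 && num % 2 == 0 && e < 0) { num /= 2; ++e; }
    if (num == 0) return {0, 1};
    if (e >= 0) return {num * (std::int64_t{1} << e), 1};
    return {num, std::int64_t{1} << -e};
}
```

Calling exact_fraction(1.0f / 3.0f) yields the pair 11184811 and 33554432 quoted above. The b_max constraint is deliberately left out of this sketch.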
Based on my experience, traversing (by binary search of) the Stern-Brocot tree is not useful here, since that is more suitable for approximating the value of a floating-point number, when it is interpreted as a truncated real instead of an exact rational.
Possibly, continued fractions are the way to go.
Another problem here is integer overflow. Suppose we want to represent the rational as the quotient of two int32_t values, with maximum denominator b_max = INT32_MAX. We cannot rely on a stopping criterion like b > b_max, because b would already have overflowed by the time the test is evaluated. So the algorithm must either never overflow or detect overflow.
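One way to detect overflow without resorting to wider integers is to test the bound by division before multiplying. The sketch below assumes non-negative operands, as in a continued-fraction recurrence of the form k_new = a * k1 + k0. The function name checked_step is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Computes a * k1 + k0 for non-negative a, k1, k0.
// Returns false (leaving `out` untouched) if the result would exceed `limit`.
bool checked_step(std::int32_t a, std::int32_t k1, std::int32_t k0,
                  std::int32_t limit, std::int32_t& out)
{
    if (k0 > limit) return false;
    // For k1 > 0:  a * k1 + k0 <= limit  <=>  a <= (limit - k0) / k1
    // (integer division), and the right-hand side cannot overflow.
    if (k1 > 0 && a > (limit - k0) / k1) return false;
    out = a * k1 + k0;   // safe after the check above
    return true;
}
```

With limit set to INT32_MAX (or to b_max), a false return signals that the next convergent is out of range and the previous one should be kept.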
What I found so far is an algorithm from Rosetta Code, which is based on continued fractions, but its source mentions it is "still not quite complete". Some basic tests gave good results, but I cannot confirm its overall correctness and I think it can easily overflow.
// https://rosettacode.org/wiki/Convert_decimal_number_to_rational#C
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h>
/* f : number to convert.
* num, denom: returned parts of the rational.
* md: max denominator value. Note that machine floating point number
* has a finite resolution (10e-16 ish for 64 bit double), so specifying
* a "best match with minimal error" is often wrong, because one can
* always just retrieve the significand and return that divided by
* 2**52, which is in a sense accurate, but generally not very useful:
* 1.0/7.0 would be "2573485501354569/18014398509481984", for example.
*/
void rat_approx(double f, int64_t md, int64_t *num, int64_t *denom)
{
/* a: continued fraction coefficients. */
int64_t a, h[3] = { 0, 1, 0 }, k[3] = { 1, 0, 0 };
int64_t x, d, n = 1;
int i, neg = 0;
if (md <= 1) { *denom = 1; *num = (int64_t) f; return; }
if (f < 0) { neg = 1; f = -f; }
while (f != floor(f)) { n <<= 1; f *= 2; }
d = f;
/* continued fraction and check denominator each step */
for (i = 0; i < 64; i++) {
a = n ? d / n : 0;
if (i && !a) break;
x = d; d = n; n = x % n;
x = a;
if (k[1] * a + k[0] >= md) {
x = (md - k[0]) / k[1];
if (x * 2 >= a || k[1] >= md)
i = 65;
else
break;
}
h[2] = x * h[1] + h[0]; h[0] = h[1]; h[1] = h[2];
k[2] = x * k[1] + k[0]; k[0] = k[1]; k[1] = k[2];
}
*denom = k[1];
*num = neg ? -h[1] : h[1];
}

All finite double values are rational numbers, as the OP correctly stated.
Use frexp() to break the number into its fraction and exponent. The end result still needs to use double to represent whole-number values because of range requirements. Some numbers are too small (x smaller than 1.0/pow(2.0, DBL_MAX_EXP)), and infinity and not-a-number are issues.
The frexp functions break a floating-point number into a normalized fraction and an integral power of 2. ... interval [1/2, 1) or zero ...
C11 §7.12.6.4 2/3
#include <math.h>
#include <float.h>
#include <stdio.h>
_Static_assert(FLT_RADIX == 2, "TBD code for non-binary FP");
// Return error flag
int split(double x, double *numerator, double *denominator) {
if (!isfinite(x)) {
*numerator = *denominator = 0.0;
if (x > 0.0) *numerator = 1.0;
if (x < 0.0) *numerator = -1.0;
return 1;
}
int bdigits = DBL_MANT_DIG;
int expo;
*denominator = 1.0;
*numerator = frexp(x, &expo) * pow(2.0, bdigits);
expo -= bdigits;
if (expo > 0) {
*numerator *= pow(2.0, expo);
}
else if (expo < 0) {
expo = -expo;
if (expo >= DBL_MAX_EXP-1) {
*numerator /= pow(2.0, expo - (DBL_MAX_EXP-1));
*denominator *= pow(2.0, DBL_MAX_EXP-1);
return fabs(*numerator) < 1.0;
} else {
*denominator *= pow(2.0, expo);
}
}
while (*numerator && fmod(*numerator,2) == 0 && fmod(*denominator,2) == 0) {
*numerator /= 2.0;
*denominator /= 2.0;
}
return 0;
}
void split_test(double x) {
double numerator, denominator;
int err = split(x, &numerator, &denominator);
printf("e:%d x:%24.17g n:%24.17g d:%24.17g q:%24.17g\n",
err, x, numerator, denominator, numerator/ denominator);
}
int main(void) {
volatile float third = 1.0f/3.0f;
split_test(third);
split_test(0.0);
split_test(0.5);
split_test(1.0);
split_test(2.0);
split_test(1.0/7);
split_test(DBL_TRUE_MIN);
split_test(DBL_MIN);
split_test(DBL_MAX);
return 0;
}
Output
e:0 x: 0.3333333432674408 n: 11184811 d: 33554432 q: 0.3333333432674408
e:0 x: 0 n: 0 d: 9007199254740992 q: 0
e:0 x: 1 n: 1 d: 1 q: 1
e:0 x: 0.5 n: 1 d: 2 q: 0.5
e:0 x: 1 n: 1 d: 1 q: 1
e:0 x: 2 n: 2 d: 1 q: 2
e:0 x: 0.14285714285714285 n: 2573485501354569 d: 18014398509481984 q: 0.14285714285714285
e:1 x: 4.9406564584124654e-324 n: 4.4408920985006262e-16 d: 8.9884656743115795e+307 q: 4.9406564584124654e-324
e:0 x: 2.2250738585072014e-308 n: 2 d: 8.9884656743115795e+307 q: 2.2250738585072014e-308
e:0 x: 1.7976931348623157e+308 n: 1.7976931348623157e+308 d: 1 q: 1.7976931348623157e+308
Leave the b_max consideration for later.
More expedient code is possible by replacing pow(2.0, expo) with ldexp(1, expo) (@gammatester) or exp2(expo) (@Bob__).
The loop while (*numerator && fmod(*numerator,2) == 0 && fmod(*denominator,2) == 0) could also use some performance improvements. But first, let us get the functionality as needed.

Related

Calculator with specific methods only. Normal and recursive

So at the moment I'm stuck with my calculator. It is only allowed to use the following methods:
int succ(int x){
return ++x;
}
int neg(int x){
return -x;
}
What I already have is +, - and *, both iterative and recursive (so I can also use them if needed).
Now I'm stuck on the divide method because I don't know how to deal with the decimal places and the logic behind it. To give an idea of what working with succ() and neg() looks like, here's an example of subtraction, iterative and recursive:
int sub(int x, int y){
if (y > 0){
y = neg(y);
x = add(x, y);
return x;
}
else if (y < 0){
y = neg(y);
x = add(x, y);
return x;
}
else if (y == 0) {
return x;
}
}
int sub_recc(int x, int y){
if (y < 0){
y = neg(y);
x = add_recc(x, y);
return x;
} else if (y > 0){
x = sub_recc(x, y - 1);
x = x - 1;
return x;
}else if( y == 0) {
return x;
}
}
If you can subtract and add, then you can handle integer division. In pseudo code it is just:
division y/x is:
First handle signs because we will only divide positive integers
set sign = 0
if y < 0 then y = neg(y), sign = 1 - sign
if x < 0 then x = neg(x), sign = 1 - sign
ok, if sign is 0 nothing to do, if sign is 1, we will negate the result
Now the quotient is just the number of times you can subtract the divisor:
set quotient = 0
while y >= x do
y = y - x
quotient = quotient + 1
Ok we have the absolute value of the quotient, now for the sign:
if sign == 1, then quotient = neg(quotient)
The correct translation into C++ as well as the recursive part are left as an exercise...
Hint for recursion: y/x == 1 + (y-x)/x while y >= x
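For readers who want to check their attempt, here is one possible iterative translation into C++. It negates an operand only when it is negative and loops while y >= x, so that exact multiples divide correctly. The helpers add and sub are built here from succ and neg as an assumption about what the OP's versions do; the sketch assumes x != 0 and small magnitudes, since every step counts one by one.

```cpp
#include <cassert>

int succ(int x) { return ++x; }
int neg(int x) { return -x; }

// x + y using only succ, neg and comparisons.
int add(int x, int y)
{
    if (y < 0) return neg(add(neg(x), neg(y)));   // x + y == -((-x) + (-y))
    for (int i = 0; i != y; i = succ(i)) x = succ(x);
    return x;
}

int sub(int x, int y) { return add(x, neg(y)); }

// Truncating integer division y / x by repeated subtraction; assumes x != 0.
int divide(int y, int x)
{
    bool negative = false;
    if (y < 0) { y = neg(y); negative = !negative; }
    if (x < 0) { x = neg(x); negative = !negative; }
    int quotient = 0;
    while (y >= x) {          // count how many times x fits into y
        y = sub(y, x);
        quotient = succ(quotient);
    }
    return negative ? neg(quotient) : quotient;
}
```

For example, divide(7, 2) gives 3 and divide(-7, 2) gives -3, matching the truncating behaviour of the built-in / operator.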
Above was the integer part. Integer arithmetic is nice and easy because it gives exact operations. A floating point representation in a given base is always something close to mantissa * base^exp, where the mantissa is either an integer with a maximum number of digits or a number between 0 and 1 (the so-called normalized representation). You can pass from one representation to the other by changing the exponent by the number of digits of the mantissa: 2.5 is 25 * 10^-1 (integer mantissa) or .25 * 10^1 (0 <= mantissa < 1).
So if you want to operate base 10 floating point numbers you should:
convert an integer to a floating point (mantissa + exponent) representation
for addition and subtraction, the result exponent is a priori the greater of the exponents. Both mantissas shall be scaled to that exponent and added/subtracted. Then the final exponent must be adjusted because the operation may have added an additional digit (7 + 9 = 16) or have caused the highest order ones to vanish (101 - 98 = 3)
for product, you add the exponents and multiply the mantissas, and then normalize (adjust exponent) the result
for division, you scale the mantissa by the maximum number of digits, perform the division with the integer division algorithm, and again normalize. For example 1/3 with a precision of 6 digits is obtained with:
1/3 = (1 * 10^6 / 3) * 10^-6 = (1000000/3) * 10^-6
it gives 333333 * 10^-6, so .333333 in normalized form
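The division step can be sketched in C++ as follows, using the integer-mantissa form. The struct Decimal and the function decimal_divide are illustrative names, and the built-in / stands in for whatever integer division routine is available. Normalization of the result is left out.

```cpp
#include <cassert>

// value == mantissa * 10^exponent (integer-mantissa form); illustrative type.
struct Decimal {
    long long mantissa;
    int exponent;
};

// Divide a by b, keeping `digits` extra decimal digits in the quotient.
// Assumes b.mantissa != 0 and that a.mantissa * 10^digits fits in long long.
Decimal decimal_divide(Decimal a, Decimal b, int digits)
{
    long long scale = 1;
    for (int i = 0; i < digits; ++i) scale *= 10;
    // (a.m * 10^digits / b.m) * 10^(a.e - b.e - digits)
    return Decimal{a.mantissa * scale / b.mantissa,
                   a.exponent - b.exponent - digits};
}
```

Dividing {1, 0} by {3, 0} with 6 digits gives mantissa 333333 and exponent -6, the value worked out above.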
Ok, it will be a lot of boilerplate code, but nothing really hard.
Long story short: just remember how you learned that with paper and pencil...

Difference between ldexp(1, x) and exp2(x)

It seems if the floating-point representation has radix 2 (i.e. FLT_RADIX == 2) both std::ldexp(1, x) and std::exp2(x) raise 2 to the given power x.
Does the standard define or mention any expected behavioral and/or performance difference between them? What is the practical experience over different compilers?
exp2(x) and ldexp(x,i) perform two different operations. The former computes 2^x, where x is a floating-point number, while the latter computes x*2^i, where i is an integer. For integer values of x, exp2(x) and ldexp(1,int(x)) would be equivalent, provided the conversion of x to integer doesn't overflow.
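A small sketch of the distinction, assuming a radix-2 double. The wrapper names pow2_int and pow2_real are invented here for illustration.

```cpp
#include <cassert>
#include <cmath>

// 2^i for an integer exponent: a pure scaling by a power of two, so the
// result is exact whenever it is representable.
double pow2_int(int i) { return std::ldexp(1.0, i); }

// 2^x for an arbitrary real exponent: a transcendental function whose
// result is subject to the library's rounding error.
double pow2_real(double x) { return std::exp2(x); }
```

pow2_int(10) is exactly 1024.0, while pow2_real(0.5) is only an approximation of the square root of 2. std::ldexp also accepts a significand other than 1, for example std::ldexp(3.0, 4) is 48.0.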
The question about the relative efficiency of these two functions doesn't have a clear-cut answer. It will depend on the capabilities of the hardware platform and the details of the library implementation. While conceptually, ldexpf() looks like simple manipulation of the exponent part of a floating-point operand, it is actually a bit more complicated than that, once one considers overflow and gradual underflow via denormals. The latter case involves the rounding of the significand (mantissa) part of the floating-point number.
As ldexp() is generally an infrequently used function, it is in my experience fairly common that less of an optimization effort is applied to it by math library writers than to other math functions.
On some platforms, ldexp(), or a faster (custom) version of it, will be used as a building block in the software implementation of exp2(). The following code provides an exemplary implementation of this approach for float arguments:
#include <cmath>
/* Compute exponential base 2. Maximum ulp error = 0.86770 */
float my_exp2f (float a)
{
const float cvt = 12582912.0f; // 0x1.8p23
const float large = 1.70141184e38f; // 0x1.0p127
float f, r;
int i;
// exp2(a) = exp2(i + f); i = rint (a)
r = (a + cvt) - cvt;
f = a - r;
i = (int)r;
// approximate exp2(f) on interval [-0.5,+0.5]
r = 1.53720379e-4f; // 0x1.426000p-13f
r = fmaf (r, f, 1.33903872e-3f); // 0x1.5f055ep-10f
r = fmaf (r, f, 9.61817801e-3f); // 0x1.3b2b20p-07f
r = fmaf (r, f, 5.55036031e-2f); // 0x1.c6af7ep-05f
r = fmaf (r, f, 2.40226522e-1f); // 0x1.ebfbe2p-03f
r = fmaf (r, f, 6.93147182e-1f); // 0x1.62e430p-01f
r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+00f
// exp2(a) = 2**i * exp2(f);
r = ldexpf (r, i);
if (!(fabsf (a) < 150.0f)) {
r = a + a; // handle NaNs
if (a < 0.0f) r = 0.0f;
if (a > 0.0f) r = large * large; // + INF
}
return r;
}
Most real-life implementations of exp2() do not invoke ldexp(), but a custom version, for example when fast bit-wise transfer between integer and floating-point data is supported, here represented by internal functions __float_as_int() and __int_as_float() that re-interpret an IEEE-754 binary32 as an int32 and vice versa:
/* For a in [0.5, 4), compute a * 2**i, -250 < i < 250 */
float fast_ldexpf (float a, int i)
{
int ia = (i << 23) + __float_as_int (a); // scale by 2**i
a = __int_as_float (ia);
if ((unsigned int)(i + 125) > 250) { // |i| > 125
i = (i ^ (125 << 23)) - i; // ((i < 0) ? -125 : 125) << 23
a = __int_as_float (ia - i); // scale by 2**(+/-125)
a = a * __int_as_float ((127 << 23) + i); // scale by 2**(+/-(i%125))
}
return a;
}
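__float_as_int() and __int_as_float() above are internal functions of a particular platform. A portable stand-in, assuming float is IEEE-754 binary32 and int32_t exists, is std::memcpy. The sketch below uses it to scale a normal number by a power of two in the simple case where neither overflow nor underflow occurs. The names float_as_int, int_as_float and scale_pow2 are chosen here for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a binary32 float as a 32-bit integer.
std::int32_t float_as_int(float f)
{
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}

// Reinterpret the bits of a 32-bit integer as a binary32 float.
float int_as_float(std::int32_t i)
{
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}

// a * 2^i by adding i to the exponent field (bits 23..30).
// Only valid when both a and the result are normal numbers.
float scale_pow2(float a, int i)
{
    return int_as_float(float_as_int(a) + i * (1 << 23));
}
```

This covers only the fast path; the range handling in fast_ldexpf above is what deals with exponents near the limits.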
On other platforms, the hardware provides a single-precision version of exp2() as a fast hardware instruction. Internal to the processor these are typically implemented by a table lookup with linear or quadratic interpolation. On such hardware platforms, ldexp(float) may be implemented in terms of exp2(float), for example:
float my_ldexpf (float x, int i)
{
float r, fi, fh, fq, t;
fi = (float)i;
/* NaN, Inf, zero require argument pass-through per ISO standard */
if (!(fabsf (x) <= 3.40282347e+38f) || (x == 0.0f) || (i == 0)) {
r = x;
} else if (abs (i) <= 126) {
r = x * exp2f (fi);
} else if (abs (i) <= 252) {
fh = (float)(i / 2);
r = x * exp2f (fh) * exp2f (fi - fh);
} else {
fq = (float)(i / 4);
t = exp2f (fq);
r = x * t * t * t * exp2f (fi - 3.0f * fq);
}
return r;
}
Lastly, there are platforms that basically provide both exp2() and ldexp() functionality in hardware, such as the x87 instructions F2XM1 and FSCALE on x86 processors.

Division with negative dividend, but rounded towards negative infinity?

Consider the following code (in C++11):
int a = -11, b = 3;
int c = a / b;
// now c == -3
C++11 specification says that division with a negative dividend is rounded toward zero.
It is quite useful for there to be an operator or function to do division with rounding toward negative infinity (e.g. for consistency with positive dividends when iterating a range), so is there a function or operator in the standard library that does what I want? Or perhaps a compiler-defined function/intrinsic that does it in modern compilers?
I could write my own, such as the following (works only for positive divisors):
int div_neg(int dividend, int divisor){
if(dividend >= 0) return dividend / divisor;
else return (dividend - divisor + 1) / divisor;
}
But it would not be as descriptive of my intent, and possibly not as optimized as a standard library function or compiler intrinsic (if one exists).
I'm not aware of any intrinsics for it. I would simply apply a correction to standard division retrospectively.
int div_floor(int a, int b)
{
int res = a / b;
int rem = a % b;
// Correct division result downwards if up-rounding happened,
// (for non-zero remainder of sign different than the divisor).
int corr = (rem != 0 && ((rem < 0) != (b < 0)));
return res - corr;
}
Note that it also works for pre-C99 and pre-C++11, i.e. without standardization of division rounding towards zero.
Here's another possible variant, valid for positive divisors and arbitrary dividends.
int div_floor(int n, int d) {
return n >= 0 ? n / d : -1 - (-1 - n) / d;
}
Explanation: in the case of negative n, write q for (-1 - n) / d, then -1 - n = qd + r for some r satisfying 0 <= r < d. Rearranging gives n = (-1 - q)d + (d - 1 - r). It's clear that 0 <= d - 1 - r < d, so d - 1 - r is the remainder of the floor division operation, and -1 - q is the quotient.
Note that the arithmetic operations here are all safe from overflow, regardless of the internal representation of signed integers (two's complement, ones' complement, sign-magnitude).
Assuming two's complement representation for signed integers, a good compiler should optimise the two -1-* operations to bitwise negation operations. On my x86-64 machine, the second branch of the conditional gets compiled to the following sequence:
notl %edi
movl %edi, %eax
cltd
idivl %esi
notl %eax
The standard library has only one function that can be used to do what you want: floor. The division you're after can be expressed as floor((double) n / d). However, this assumes that double has enough precision to represent both n and d exactly. If not, then this may introduce rounding errors.
Personally, I'd go with a custom implementation. But you can use the floating point version too, if that's easier to read and you've verified that the results are correct for the ranges you're calling it for.
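To illustrate the precision caveat: with 64-bit operands, the conversion to double can lose the low bits. A sketch, assuming a 64-bit integer type and an IEEE-754 double with a 53-bit significand; both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Floor division through floating point; exact only while the operands and
// the quotient are representable in double.
std::int64_t floor_div_fp(std::int64_t n, std::int64_t d)
{
    return static_cast<std::int64_t>(
        std::floor(static_cast<double>(n) / static_cast<double>(d)));
}

// Floor division in integer arithmetic: truncate, then correct downwards when
// the remainder is non-zero and its sign differs from the divisor's.
std::int64_t floor_div_int(std::int64_t n, std::int64_t d)
{
    std::int64_t q = n / d;
    const std::int64_t r = n % d;
    if (r != 0 && ((r < 0) != (d < 0))) --q;
    return q;
}
```

Both agree on small inputs such as -11 / 3, which gives -4. For n = 9007199254740993 (2^53 + 1) and d = 1, the integer version returns n unchanged, whereas the floating-point version returns 9007199254740992 because n is not representable as a double.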
C++11 has a std::div(a, b) that returns both a % b and a / b in a struct with rem and quot members (so both remainder and quotient primitives), and for which modern processors have a single instruction. C++11 does truncated division.
To do floored division for both the remainder and the quotient, you can write:
#include <cassert>
#include <cstdlib>
// http://stackoverflow.com/a/4609795/819272
auto signum(int n) noexcept
{
return static_cast<int>(0 < n) - static_cast<int>(n < 0);
}
auto floored_div(int D, int d) // Throws: Nothing.
{
assert(d != 0);
auto const divT = std::div(D, d);
auto const I = signum(divT.rem) == -signum(d) ? 1 : 0;
auto const qF = divT.quot - I;
auto const rF = divT.rem + I * d;
assert(D == d * qF + rF);
assert(abs(rF) < abs(d));
assert(signum(rF) == signum(d));
return std::div_t{qF, rF};
}
Finally, it's handy to also have Euclidean division (for which the remainder is always non-negative) in your own library:
auto euclidean_div(int D, int d) // Throws: Nothing.
{
assert(d != 0);
auto const divT = std::div(D, d);
auto const I = divT.rem >= 0 ? 0 : (d > 0 ? 1 : -1);
auto const qE = divT.quot - I;
auto const rE = divT.rem + I * d;
assert(D == d * qE + rE);
assert(abs(rE) < abs(d));
assert(signum(rE) != -1);
return std::div_t{qE, rE};
}
There is a Microsoft research paper discussing the pros and cons of the 3 versions.
When the operands are both positive, the / operator does floored division.
When the operands are both negative, the / operator does floored division.
When exactly one of the operands is negative, the / operator does ceiling division.
For the last case, the quotient must be decremented when exactly one operand is negative and there is a remainder (with no remainder, floored division and ceiling division give the same result).
int floored_div(int numer, int denom) {
int div = numer / denom;
int n_negatives = (numer < 0) + (denom < 0);
div -= (n_negatives == 1) && (numer % denom != 0);
return div;
}

Sum exceeding permissible value in looping floats

I recently created this simple program to find average velocity.
Average velocity = Δx / Δt
I chose x as a function of t as x = t^2
Therefore v = 2t
also, avg v = (x2 - x1) / (t2 - t1)
I chose the interval to be t = 1s to 4s. Implies x goes from 1 to 16
Therefore avg v = (16 - 1) / (4 - 1) = 5
Now the program :
#include <iostream>
using namespace std;
int main() {
float t = 1, v = 0, sum = 0, n = 0; // t = time, v = velocity, sum = Sigma v, n = Sigma 1
float avgv = 0;
while( t <= 4 ) {
v = 2*t;
sum += v;
t += 0.0001;
n++;
}
avgv = sum/n;
cout << "\n----> " << avgv << " <----\n";
return 0;
}
I used very small increments of time to calculate velocity at many moments. Now, if the increment of t is 0.001, the avg v calculated is 4.99998.
Now if I set the increment of t to 0.0001, the avg v becomes 5.00007!
Further decreasing increment to 0.00001 yields avg v = 5.00001
Why is that so?
Thank you.
In base 2, 0.0001 and 0.001 are periodic numbers, so they don't have an exact representation. One of them is rounded up and the other is rounded down, so when you sum lots of them you get different values.
This is the same thing that happens in decimal representation, if you choose the numbers to sum accordingly (assume each variable can hold 3 decimal digits).
Compare:
a = 1 / 3; // a becomes 0.333
b = a * 6; // b becomes 1.998
with:
a = 2 / 3; // a becomes 0.667
b = a * 3; // b becomes 2.001
Both should (theoretically) result in 2, but because of rounding error they give different results.
In the decimal system, since 10 factors into the primes 2 and 5, only fractions whose denominator has no prime factors other than 2 and 5 can be represented with a finite number of decimal digits (all other fractions are periodic). In base 2, only fractions whose denominator is a power of 2 can be represented exactly. Try using 1.0/512.0 and 1.0/1024.0 as steps in your loop. Also, be careful: if you choose a step that is too small, you may not have enough digits to represent it in the float datatype (i.e., use doubles).
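A common remedy that goes slightly beyond the advice above is to derive t from an integer counter instead of accumulating the step, and to sum in double. The sketch below assumes that approach; the function name average_velocity and its parameters are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Average of v(t) = 2t sampled at steps + 1 evenly spaced points in [t0, t1].
// t is recomputed from the integer counter each time, so rounding errors in
// the step do not accumulate in t and the sample count is always exact.
double average_velocity(double t0, double t1, int steps)
{
    double sum = 0.0;
    for (int i = 0; i <= steps; ++i) {
        const double t = t0 + (t1 - t0) * i / steps;
        sum += 2.0 * t;
    }
    return sum / (steps + 1);
}
```

For the interval from 1 to 4 the result stays very close to 5 whether 3000 or 30000 steps are used, because the samples are symmetric around t = 2.5.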

Constructing fractions Interview challenge

I recently came across the following interview question. I was wondering if a dynamic programming approach would work, and/or if there was some kind of mathematical insight that would make the solution easier... It's very similar to how IEEE 754 doubles are constructed.
Question:
There is a vector V of N double values, where the value at the ith index of the vector is equal to 1/2^(i+1), e.g. 1/2, 1/4, 1/8, 1/16, etc.
You're to write a function that takes one double 'r' as input, where 0 < r < 1, and outputs to stdout the indexes of V whose values, when summed, give a value closer to 'r' than any other combination of indexes from the vector V.
Furthermore the number of indexes should be a minimum, and in the event there are two solutions, the solution closest to zero should be preferred.
void getIndexes(std::vector<double>& V, double r)
{
....
}
int main()
{
std::vector<double> V;
// populate V...
double r = 0.3;
getIndexes(V,r);
return 0;
}
Note: It seems like there are a few SO'ers that aren't in the mood of reading the question completely. So lets all note the following:
The solution, i.e. the sum, may be larger than r; hence any strategy that incrementally subtracts fractions from r until it hits zero or near zero is wrong.
There are examples of r where there will be 2 solutions, that is |r-s0| == |r-s1| and s0 < s1. In this case s0 should be selected, which makes the problem slightly more difficult, as knapsack-style solutions tend to greedily overestimate first.
If you believe this problem is trivial, you most likely haven't understood it. Hence it would be a good idea to read the question again.
EDIT (Matthieu M.): 2 examples for V = {1/2, 1/4, 1/8, 1/16, 1/32}
r = 0.3, S = {1, 3}
r = 0.256652, S = {1}
Algorithm
Consider a target number r and a set F of fractions {1/2, 1/4, ... 1/(2^N)}. Let the smallest fraction, 1/(2^N), be denoted P.
Then the optimal sum will be equal to:
S = P * round(r/P)
That is, the optimal sum S will be some integer multiple of the smallest fraction available, P. The maximum error, err = r - S, is ± 1/2 * 1/(2^N). No better solution is possible because this would require the use of a number smaller than 1/(2^N), which is the smallest number in the set F.
Since the fractions F are all power-of-two multiples of P = 1/(2^N), any integer multiple of P can be expressed as a sum of the fractions in F. To obtain the list of fractions that should be used, encode the integer round(r/P) in binary and read off 1 in the kth binary place as "include the kth fraction in the solution".
Example:
Take r = 0.3 and F as {1/2, 1/4, 1/8, 1/16, 1/32}.
Multiply the entire problem by 32.
Take r = 9.6, and F as {16, 8, 4, 2, 1}.
Round r to the nearest integer.
Take r = 10.
Encode 10 as a binary integer (five places)
10 = 0b 0 1 0 1 0 ( 8 + 2 )
        ^ ^ ^ ^ ^
        | | | | |
        | | | | 1
        | | | 2
        | | 4
        | 8
        16
Associate each binary bit with a fraction.
   = 0b 0 1 0 1 0 ( 1/4 + 1/16 = 0.3125 )
        ^ ^ ^ ^ ^
        | | | | |
        | | | | 1/32
        | | | 1/16
        | | 1/8
        | 1/4
        1/2
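The rounding-and-binary-encoding idea can be sketched in C++ as follows. The function name closest_indexes is hypothetical, and the tie rule (prefer fewer fractions, otherwise the smaller sum) is one reading of the problem statement rather than something the algorithm above prescribes. The sketch assumes 0 < r < 1 and 1 <= n <= 52.

```cpp
#include <bitset>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Indexes i such that the chosen fractions 1/2^(i+1), i in [0, n), sum as
// close as possible to r.
std::vector<int> closest_indexes(double r, int n)
{
    const std::uint64_t max = (std::uint64_t{1} << n) - 1;  // all fractions used
    const double scaled = std::ldexp(r, n);                  // r * 2^n, exact
    std::uint64_t s = static_cast<std::uint64_t>(std::floor(scaled));
    const double frac = scaled - std::floor(scaled);
    if (s < max) {
        if (frac > 0.5) {
            ++s;  // the upper neighbour is strictly closer
        } else if (frac == 0.5 &&
                   std::bitset<64>(s + 1).count() < std::bitset<64>(s).count()) {
            ++s;  // tie: prefer fewer fractions, otherwise keep the smaller sum
        }
    }
    std::vector<int> indexes;
    for (int i = 0; i < n; ++i)  // index i corresponds to bit n - 1 - i of s
        if (s & (std::uint64_t{1} << (n - 1 - i))) indexes.push_back(i);
    return indexes;
}
```

With n = 5, r = 0.3 gives the indexes 1 and 3 (1/4 + 1/16), and r = 0.256652 gives index 1 alone, in line with the examples in the question.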
Proof
Consider transforming the problem by multiplying all the numbers involved by 2**N so that all the fractions become integers.
The original problem:
Consider a target number r in the range 0 < r < 1, and a list of fractions {1/2, 1/4, ..., 1/(2**N)}. Find the subset of the list of fractions that sums to S such that error = r - S is minimised.
Becomes the following equivalent problem (after multiplying by 2**N):
Consider a target number r in the range 0 < r < 2**N and a list of integers {2**(N-1), 2**(N-2), ... , 4, 2, 1}. Find the subset of the list of integers that sums to S such that error = r - S is minimised.
Choosing powers of two that sum to a given number (with as little error as possible) is simply binary encoding of an integer. This problem therefore reduces to binary encoding of an integer.
Existence of solution: Any positive floating point number r, 0 < r < 2**N, can be cast to an integer and represented in binary form.
Optimality: The maximum error in the integer version of the solution is the round-off error of ±0.5. (In the original problem, the maximum error is ±0.5 * 1/2**N.)
Uniqueness: for any positive (floating point) number there is a unique integer representation and therefore a unique binary representation. (Possible exception of 0.5: see below.)
Implementation (Python)
This function converts the problem to the integer equivalent, rounds off r to an integer, then reads off the binary representation of r as an integer to get the required fractions.
def conv_frac(r, N):
    # Convert to equivalent integer problem.
    R = r * 2**N
    S = int(round(R))
    # Convert integer S to N-bit binary representation (i.e. a character string
    # of 1's and 0's.) Note use of [2:] to trim leading '0b' and zfill() to
    # zero-pad to required length.
    bin_S = bin(S)[2:].zfill(N)
    nums = list()
    for index, bit in enumerate(bin_S):
        k = index + 1
        if bit == '1':
            print("%i : 1/%i or %f" % (index, 2**k, 1.0/(2**k)))
            nums.append(1.0/(2**k))
    S = sum(nums)
    e = r - S
    print("""
Original number `r` : %f
Number of fractions `N` : %i (smallest fraction 1/%i)
Sum of fractions `S` : %f
Error `e` : %f
""" % (r,N,2**N,S,e))
Sample output:
>>> conv_frac(0.3141,10)
1 : 1/4 or 0.250000
3 : 1/16 or 0.062500
8 : 1/512 or 0.001953
Original number `r` : 0.314100
Number of fractions `N` : 10 (smallest fraction 1/1024)
Sum of fractions `S` : 0.314453
Error `e` : -0.000353
>>> conv_frac(0.30,5)
1 : 1/4 or 0.250000
3 : 1/16 or 0.062500
Original number `r` : 0.300000
Number of fractions `N` : 5 (smallest fraction 1/32)
Sum of fractions `S` : 0.312500
Error `e` : -0.012500
Addendum: the 0.5 problem
If r * 2**N ends in 0.5, then it could be rounded up or down. That is, there are two possible representations as a sum-of-fractions.
If, as in the original problem statement, you want the representation that uses fewest fractions (i.e. the least number of 1 bits in the binary representation), just try both rounding options and pick whichever one is more economical.
Perhaps I am dumb...
The only trick I can see here is that the sum of (1/2)^(i+1) for i in [0..n), where n tends towards infinity, gives 1. This simple fact proves that (1/2)^i is always greater than the sum of (1/2)^j for j in [i+1, n), whatever n is.
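Spelled out, the bound follows from the finite geometric sum (for n > i + 1):

```latex
\sum_{j=i+1}^{n-1} \left(\tfrac{1}{2}\right)^{j}
  = \left(\tfrac{1}{2}\right)^{i} - \left(\tfrac{1}{2}\right)^{n-1}
  < \left(\tfrac{1}{2}\right)^{i}
```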
So, when looking for our indices, it does not seem we have much choice. Let's start with i = 0
either r is greater than 2^-(i+1) and thus we need it
or it is smaller and we need to choose whether 2^-(i+1) OR the sum of 2^-j for j in [i+2, N] is closest (deferring to the latter in case of equality)
The only step that could be costly is obtaining the sum, but it can be precomputed once and for all (and even precomputed lazily).
// The resulting vector contains at index i the sum of 2^-j for j in [i+1, N]
// and is padded with one 0 to get the same length as `v`
static std::vector<double> partialSums(std::vector<double> const& v) {
std::vector<double> result;
// When summing doubles, we need to start with the smaller ones
// because of the precision of representations...
double sum = 0;
BOOST_REVERSE_FOREACH(double d, v) {
sum += d;
result.push_back(sum);
}
result.pop_back(); // there is a +1 offset in the indexes of the result
std::reverse(result.begin(), result.end());
result.push_back(0); // pad the vector to have the same length as `v`
return result;
}
// The resulting vector contains the indexes elected
static std::vector<size_t> getIndexesImpl(std::vector<double> const& v,
std::vector<double> const& ps,
double r)
{
std::vector<size_t> indexes;
for (size_t i = 0, max = v.size(); i != max; ++i) {
if (r >= v[i]) {
r -= v[i];
indexes.push_back(i);
continue;
}
// We favor the closest to 0 in case of equality
// which is the sum of the tail as per the theorem above.
if (std::fabs(r - v[i]) < std::fabs(r - ps[i])) {
indexes.push_back(i);
return indexes;
}
}
return indexes;
}
std::vector<size_t> getIndexes(std::vector<double>& v, double r) {
std::vector<double> const ps = partialSums(v);
return getIndexesImpl(v, ps, r);
}
The code runs (with some debug output) at ideone. Note that for 0.3 it gives:
0.3:
1: 0.25
3: 0.0625
=> 0.3125
which is slightly different from the other answers.
At the risk of downvotes, this problem seems to be rather straightforward. Just start with the largest and smallest numbers you can produce out of V, adjust each index in turn until you have the two possible closest answers. Then evaluate which one is the better answer.
Here is untested code (in a language that I don't write):
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Defined before getIndexes so that it is visible at the call sites.
void printIndexes(const std::vector<int>& ind)
{
    for (std::size_t i = 0; i < ind.size(); i++)
    {
        std::cout << ind[i];
    }
}

void getIndexes(std::vector<double>& V, double r)
{
    double v_lower = 0;
    // Sum of all elements of V. C++ has no ** operator, so use std::pow.
    double v_upper = 1.0 - std::pow(0.5, static_cast<double>(V.size()));
    std::vector<int> index_lower;
    std::vector<int> index_upper;
    if (v_upper <= r)
    {
        // The answer is trivial.
        for (std::size_t i = 0; i < V.size(); i++)
            std::cout << i;
        return;
    }
    for (std::size_t i = 0; i < V.size(); i++)
    {
        if (v_lower + V[i] <= r)
        {
            v_lower += V[i];
            index_lower.push_back(static_cast<int>(i));
        }
        if (r <= v_upper - V[i])
            v_upper -= V[i];
        else
            index_upper.push_back(static_cast<int>(i));
    }
    if (r - v_lower < v_upper - r)
        printIndexes(index_lower);
    else if (v_upper - r < r - v_lower)
        printIndexes(index_upper);
    else if (index_upper.size() < index_lower.size())  // tie: fewer indexes wins
        printIndexes(index_upper);
    else
        printIndexes(index_lower);
}
Did I get the job! :D
(Please note, this is horrible code that relies on our knowing exactly what V has in it...)
I will start by saying that I do believe that this problem is trivial...
(waits until all stones have been thrown)
Yes, I did read the OP's edit that says that I have to re-read the question if I think so. Therefore I might be missing something that I fail to see - in this case please excuse my ignorance and feel free to point out my mistakes.
I don't see this as a dynamic programming problem. At the risk of sounding naive, why not try keeping two estimations of r while searching for indices - namely an under-estimation and an over-estimation. After all, if r does not equal any sum that can be computed from elements of V, it will lie between some two sums of the kind. Our goal is to find these sums and to report which is closer to r.
I threw together some quick-and-dirty Python code that does the job. The answer it reports is correct for the two test cases that the OP provided. Note that the return is structured such that at least one index always has to be returned, even if the best estimation is no indices at all.
def estimate(V, r):
    lb = 0  # under-estimation (lower-bound)
    lbList = []
    ub = 1 - 0.5**len(V)  # over-estimation = sum of all elements of V
    ubList = list(range(len(V)))
    # calculate closest under-estimation and over-estimation
    for i in range(len(V)):
        if r == lb + V[i]:
            return (lbList + [i], lb + V[i])
        elif r == ub:
            return (ubList, ub)
        elif r > lb + V[i]:
            lb += V[i]
            lbList += [i]
        elif lb + V[i] < ub:
            ub = lb + V[i]
            ubList = lbList + [i]
    return (ubList, ub) if ub - r < r - lb else (lbList, lb) if lb != 0 else ([len(V) - 1], V[len(V) - 1])

# populate V
N = 5  # number of elements
V = []
for i in range(1, N + 1):
    V += [0.5**i]

# test
r = 0.484375  # this value is equidistant from both under- and over-estimation
print("r:", r)
result = estimate(V, r)
print("Indices:", result[0])
print("Estimate:", result[1])
Note: after finishing writing my answer I noticed that this answer follows the same logic. Alas!
I don't know if you have test cases, try the code below. It is a dynamic-programming approach.
1] exp: given elements of the form 1/2^i, take the largest i as exp. E.g. 1/32 gives 5.
2] max: 10^exp.
3] create an array of size max+1 to hold all possible sums of the elements of V.
Actually the array holds the indexes, since that's what you want.
4] dynamically compute the sums (all invalids remain null)
5] the last while loop finds the nearest correct answer.
Here is the code:
import java.util.ArrayList;
import java.util.List;

public class Subset {

    public static List<Integer> subsetSum(double[] V, double r) {
        int exp = exponent(V);
        int max = (int) Math.pow(10, exp);

        // indexes[s] holds the indices of a subset of V whose scaled sum is s (null if unreachable)
        @SuppressWarnings("unchecked")
        List<Integer>[] indexes = new ArrayList[max + 1];
        indexes[0] = new ArrayList<Integer>(); // base case

        // dynamically compute the sums
        for (int x = 0; x < V.length; x++) {
            int u = (int) (max * V[x]);
            for (int i = max; i >= u; i--) {
                if (null != indexes[i - u]) {
                    List<Integer> tmp = new ArrayList<Integer>(indexes[i - u]);
                    tmp.add(x);
                    indexes[i] = tmp;
                }
            }
        }

        // find the best answer by walking outwards from the scaled target
        int i = (int) (max * r);
        int j = i;
        while (null == indexes[i] && null == indexes[j]) {
            i--;
            j++;
        }
        return indexes[i] == null || indexes[i].isEmpty() ? indexes[j] : indexes[i];
    } // subsetSum

    // returns i for the smallest element 1/2^i (assumed to be the last element of V)
    private static int exponent(double[] V) {
        double d = V[V.length - 1];
        int i = (int) (1 / d);
        String s = Integer.toString(i, 2);
        return s.length() - 1;
    } // exponent

    public static void main(String[] args) {
        double[] V = {1/2., 1/4., 1/8., 1/16., 1/32.};
        double r = 0.6, s = 0.3, t = 0.256652;
        System.out.println(subsetSum(V, r)); // [0, 3, 4]
        System.out.println(subsetSum(V, s)); // [1, 3]
        System.out.println(subsetSum(V, t)); // [1]
    }
} // class
Here are results of running the code:
For 0.600000 get 0.593750 => [0, 3, 4]
For 0.300000 get 0.312500 => [1, 3]
For 0.256652 get 0.250000 => [1]
For 0.700000 get 0.687500 => [0, 2, 3]
For 0.710000 get 0.718750 => [0, 2, 3, 4]
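For what it's worth, results like these can be cross-checked with a brute-force search over all non-empty subsets. This is only a sketch for small N (it is exponential in N), and the name brute_force_closest is made up for illustration; it is not part of any of the answers.

```python
from itertools import combinations

def brute_force_closest(V, r):
    """Try every non-empty subset of V; return (indices, sum) closest to r."""
    best = None
    for k in range(1, len(V) + 1):
        for idx in combinations(range(len(V)), k):
            s = sum(V[i] for i in idx)
            # strict '<' keeps the first subset found when two are equally close
            if best is None or abs(s - r) < abs(best[1] - r):
                best = (list(idx), s)
    return best

V = [0.5 ** i for i in range(1, 6)]  # 1/2, 1/4, ..., 1/32
for r in (0.6, 0.3, 0.256652, 0.7, 0.71):
    indices, s = brute_force_closest(V, r)
    print("For %f get %f => %s" % (r, s, indices))
```

Since every element is a power of two, the subset sums are exact in binary floating point, so the comparison is not disturbed by rounding.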
This solution implements a polynomial-time approximation algorithm. The output of the program is the same as the output of the other solutions.
#include <math.h>
#include <stdio.h>
#include <vector>
#include <algorithm>
#include <functional>
#include <iterator>

// Fill vec with 1/2, 1/4, ..., 1/2^count.
void populate(std::vector<double> &vec, int count)
{
    double val = .5;
    vec.clear();
    for (int i = 0; i < count; i++) {
        vec.push_back(val);
        val *= .5;
    }
}
// Append to res every value of vec whose distance to r is below max_error.
void remove_values_with_large_error(const std::vector<double> &vec, std::vector<double> &res, double r, double max_error)
{
    std::vector<double>::const_iterator iter;
    double min_err, err;

    min_err = 1.0;
    for (iter = vec.begin(); iter != vec.end(); ++iter) {
        err = fabs(*iter - r);
        if (err < max_error) {
            res.push_back(*iter);
        }
        min_err = std::min(err, min_err);  // tracked, but not used afterwards
    }
}
void find_partial_sums(const std::vector<double> &vec, std::vector<double> &res, double r)
{
    std::vector<double> svec, tvec, uvec;
    std::vector<double>::const_iterator iter;
    int step = 0;

    svec.push_back(0.);
    for (iter = vec.begin(); iter != vec.end(); ++iter) {
        step++;
        printf("step %d, svec.size() %zu\n", step, svec.size());
        // tvec = every current partial sum plus the next element
        // (lambda used instead of std::bind2nd, which was removed in C++17)
        tvec.clear();
        const double add = *iter;
        std::transform(svec.begin(), svec.end(), std::back_inserter(tvec),
                       [add](double s) { return s + add; });
        // uvec = sorted, de-duplicated union of svec and tvec
        uvec.clear();
        uvec.insert(uvec.end(), svec.begin(), svec.end());
        uvec.insert(uvec.end(), tvec.begin(), tvec.end());
        std::sort(uvec.begin(), uvec.end());
        uvec.erase(std::unique(uvec.begin(), uvec.end()), uvec.end());
        // keep only the sums that are still close enough to r
        svec.clear();
        remove_values_with_large_error(uvec, svec, r, *iter * 4);
    }
    std::sort(svec.begin(), svec.end());
    svec.erase(std::unique(svec.begin(), svec.end()), svec.end());
    res.clear();
    res.insert(res.end(), svec.begin(), svec.end());
}
// Return the element of sums that is closest to r.
double find_closest_value(const std::vector<double> &sums, double r)
{
    std::vector<double>::const_iterator iter;
    double min_err, res, err;

    min_err = fabs(sums.front() - r);
    res = sums.front();
    for (iter = sums.begin(); iter != sums.end(); ++iter) {
        err = fabs(*iter - r);
        if (err < min_err) {
            min_err = err;
            res = *iter;
        }
    }
    printf("found value %lf with err %lf\n", res, min_err);
    return res;
}
// Print the indexes of the elements of vec that add up to value.
void print_indexes(const std::vector<double> &vec, double value)
{
    std::vector<double>::const_iterator iter;
    int index = 0;

    printf("indexes: [");
    for (iter = vec.begin(); iter != vec.end(); ++iter, ++index) {
        if (value >= *iter) {
            printf("%d, ", index);
            value -= *iter;
        }
    }
    printf("]\n");
}
int main(int argc, char **argv)
{
    std::vector<double> vec, sums;
    double r = .7;
    int n = 5;
    double value;

    populate(vec, n);
    find_partial_sums(vec, sums, r);
    value = find_closest_value(sums, r);
    print_indexes(vec, value);
    return 0;
}
Sort the vector and search for the closest fraction available to r. Store that index, subtract the value from r, and repeat with the remainder of r. Iterate until r is reached, or until no such index can be found.
Example:
0.3 - the biggest value available would be 0.25 (index 2). The remainder is now 0.05.
0.05 - the biggest value available would be 0.03125. The remainder will be 0.01875.
etc.
Every step would be an O(log N) search in a sorted array. The number of steps will also be O(log N), so the total complexity will then be O((log N)^2).
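A minimal sketch of this greedy idea, assuming the "biggest value available" means the largest unused element that does not exceed the remainder (as in the example above). The function name greedy_indices is hypothetical, and the indices it returns are 0-based positions in the original V. Note that it only ever under-estimates r, so it is not guaranteed to return the closest sum.

```python
from bisect import bisect_right

def greedy_indices(V, r):
    # Sort ascending, but remember where each value came from.
    order = sorted(range(len(V)), key=lambda i: V[i])
    values = [V[i] for i in order]
    chosen, total, remainder = [], 0.0, r
    hi = len(values)  # only values[0:hi] are searched, so nothing is reused
    while remainder > 0 and hi > 0:
        # binary search for the largest value <= remainder
        pos = bisect_right(values, remainder, 0, hi) - 1
        if pos < 0:
            break  # every remaining value is larger than the remainder
        chosen.append(order[pos])
        total += values[pos]
        remainder -= values[pos]
        hi = pos
    return chosen, total

V = [0.5 ** i for i in range(1, 6)]
print(greedy_indices(V, 0.3))  # picks 0.25, then 0.03125
```

For r = 0.3 this stops at 0.28125, whereas 0.3125 is closer, so a final comparison against the next larger sum would still be needed.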
This is not a dynamic programming question.
The output should rather be a vector of ints (indexes), not a vector of doubles.
This might be off by 0-2 in the exact values; this is just the concept:
A) output zero index until the r0 (r - index values already output) is bigger than 1/2
B) Inspect the internal representation of the r0 double and:
x (1st bit shift) = -Exponent; // The bigger the exponent, the smaller the numbers (the bigger the x in 1/2^(x) you begin with)
Inspect the bit representation of the fraction part of the float in a cycle with body:
(direction depends on little/big endian)
{
    if (bit is 1)
        output index x;
    x++;
}
The complexity of each step is constant, so overall it is O(n), where n is the size of the output.
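One way to try this out without touching raw bytes is math.frexp, which splits a double into mantissa and exponent. The sketch below assumes 0 < r < 1, V[i] = 1/2^(i+1) with 0-based i, and simply truncates after N bits (no rounding). The name indices_from_bits is made up for illustration.

```python
import math

def indices_from_bits(r, N):
    """0-based indices of the one bits of r (0 < r < 1), truncated to N bits."""
    m, e = math.frexp(r)           # r == m * 2**e with 0.5 <= m < 1
    mant = int(m * (1 << 53))      # 53-bit integer mantissa, top bit set
    indices = []
    for j in range(53):
        i = -e + j                 # mantissa bit j (from the top) weighs 1/2**(i+1)
        if i >= N:
            break                  # past the precision that V offers
        if mant & (1 << (52 - j)):
            indices.append(i)
    return indices

print(indices_from_bits(0.6, 5))
print(indices_from_bits(0.3, 5))
```

Because this truncates, 0.3 comes out as 1/4 + 1/32 rather than the closer 1/4 + 1/16; a rounding step would have to be added on top.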
To paraphrase the question, what are the one bits in the binary representation of r (after the binary point)? N is the 'precision', if you like.
In C-ish pseudo-code:
for (int i=0; i<N; i++) {
    if (r >= V[i]) {   /* >= so that an exact match takes V[i] itself */
        print(i);
        r -= V[i];
    }
}
You could add an extra test for r == 0 to terminate the loop early.
Note that this gives the least binary number closest to 'r', i.e. the one closer to zero if there are two equally 'right' answers.
If the Nth digit was a one, you'll need to add '1' to the 'binary' number obtained and check both against the original 'r'. (Hint: construct vectors a[N], b[N] of 'bits', and set '1' bits instead of 'print'ing above. Set b = a and do a manual add, digit by digit from the end of 'b', until you stop carrying. Convert to double and choose whichever is closer.)
Note that a[] <= r <= a[] + 1/2^N and that b[] = a[] + 1/2^N.
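A rough sketch of the two candidates a and b, assuming 0 <= r < 1 and V[i] = 1/2^(i+1). Instead of bit vectors with a manual carry it uses plain integers, and as a simplification it always compares both candidates. The name closest_indices is hypothetical.

```python
import math

def closest_indices(r, N):
    scale = 1 << N
    a = math.floor(r * scale)   # truncated N-bit value: a/2^N <= r
    b = a + 1                   # next value up: b/2^N = a/2^N + 1/2^N
    best = a                    # ties go to the value closer to zero
    # the largest available sum is (2^N - 1)/2^N, so b must stay below 2^N
    if b < scale and (b / scale - r) < (r - a / scale):
        best = b
    # bit k from the top of `best` corresponds to V[k]
    indices = [k for k in range(N) if best & (1 << (N - 1 - k))]
    return indices, best / scale

print(closest_indices(0.3, 5))
print(closest_indices(0.71, 5))
```

Multiplying by 2^N is exact for doubles in this range, so the floor is taken of the true scaled value.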
The 'least number of indexes [sic]' is a red herring.