So my goal here is to have a function whose signature looks like this:
template<typename int_t, int_t numerator, int_t denominator>
int_t Multiply(int_t x);
The type is an integral type, which is both the type of the one parameter and the return type.
The other two template parameters are the numerator and denominator of a fraction.
The goal of this function is to multiply a number "x" by an arbitrary fraction, which is given by the two template values. In general the answer should be:
floor(x*n/d) mod (int_t_max+1)
The naive way to do this is to first multiply "x" by the numerator and then divide.
Looking at a specific case, let's say that int_t=uint8_t, "x" is 30, and the numerator and denominator are 119 and 255 respectively.
Taking this naive route fails because (30*119)mod 256 = 242, which divided by 255 and then floored is 0. The real answer should be 14.
The next step would be to just use a bigger integer size for the intermediate values. So instead of doing the 30*119 calculation in mod 256 we would do it in mod 65536. This does work to a certain extent, but it fails when we try to use the maximum integer size in the Multiply function.
The next step would be to just use some BigInt type to hold the values so that it can't overflow. This also would work, but the whole reason for having the template arguments is so that this can be extremely fast, and using a BigInt would probably defeat that purpose.
So here is the question:
Is there an algorithm that only involves shifts, multiplication, division, addition, subtraction, and remainder operators, that can perform this mathematical function without causing overflow issues?
For the Windows platform I urge you to look into this article on large integers, which currently includes support for up to 128-bit integer values. You can specialize your template based on the bit-width of your int_t to serve as a proxy to those OS functions.
Implementing "shift-and-add" for the multiplication may provide a good enough alternative, but a division will certainly negate any performance gains you could hope for.
Then there are "shortcuts" like trying to see if the numerator and denominator can be simplified by fraction reduction, e.g. multiplying by 35/49 is the same as multiplying by 5/7.
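A compile-time sketch of that reduction (assuming C++17's std::gcd and the Multiply template from the question):

#include <numeric>  // std::gcd (C++17)

// Fold the greatest common divisor out of the template arguments, so the
// runtime work happens with the smallest equivalent numerator/denominator.
template<typename int_t, int_t n, int_t d>
int_t MultiplyReduced(int_t x)
{
    return Multiply<int_t, n / std::gcd(n, d), d / std::gcd(n, d)>(x);
}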
Another alternative that comes to mind is to "gradually" multiply by "fractions". This one will need some explanation though:
Suppose you are multiplying by 1234567/89012. I'll use decimal notation for readability, but the same is (naturally) applicable to binary math.
So what we have is a value x that needs to be multiplied by that fraction. Since we are dealing with integer arithmetic let's repackage that fraction a bit:
1234567/89012 = A + B/10 + C/100 + D/1000...
where A = 1234567/89012 = 13 (remainder 77411), B = (77411*10)/89012 = 8 (remainder 62014), C = (62014*10)/89012 = 6 (remainder 86068), D = (86068*10)/89012 = 9, and so on, with all divisions rounding down. That is:
= 13 + 8/10 + 6/100 + 9/1000...
In fact at this point your main question is "how precise do I want to be in my calculations?". The answer to that question determines how many terms of that long sequence you keep.
That will give you the desired precision and provide a generic "no overflow" method for computing the product, but at what computational cost?
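For completeness, here is a minimal sketch of the binary version of this idea, assuming an unsigned int_t: process the bits of x from high to low while doing the division by d on the fly, so no intermediate value ever exceeds the type (the quotient wraps mod int_t_max+1, matching the spec above):

#include <cstdint>

template<typename int_t, int_t n, int_t d>
int_t Multiply(int_t x)
{
    static_assert(d != 0, "denominator must be nonzero");
    const int bits = (int)(sizeof(int_t) * 8);
    int_t q = 0;  // running quotient; wraps mod (int_t_max+1) as specified
    int_t r = 0;  // running remainder; invariant: 0 <= r < d
    for (int i = bits - 1; i >= 0; --i) {
        // Double the partial product: (q*d + r) *= 2, keeping r in [0, d).
        bool carry = (r >> (bits - 1)) & 1;      // would 2*r overflow int_t?
        q <<= 1;
        r <<= 1;
        if (carry || r >= d) { r -= d; ++q; }    // 2r < 2d, so one subtraction
        if ((x >> i) & 1) {
            // Add n to the partial product, split as n = (n/d)*d + (n%d).
            q += n / d;
            int_t old = r;
            r += n % d;
            if (r < old || r >= d) { r -= d; ++q; }  // r + n%d < 2d
        }
    }
    return q;  // floor(x*n/d) mod (int_t_max+1)
}

With the example above, Multiply<uint8_t, 119, 255>(30) performs the long division of 30*119 = 3570 by 255 bit by bit and returns 14. Unsigned wraparound makes the r -= d step come out right even when the doubled remainder momentarily exceeds the type.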
I have a problem with understanding fixed-point arithmetic and its implementation in C++. I was trying to understand this code:
#define scale 16

int DoubleToFixed(double num) {
    return num * ((double)(1 << scale));
}

double FixedToDouble(int num) {
    return (double)num / (double)(1 << scale);
}

int IntToFixed(int num) {
    return num << scale;
}
I am trying to understand exactly why we shift. I know that shifting to the left is basically multiplying that number by 2^x, where x is by how many positions we want to shift or scale, and shifting to the right is basically division by 2^x.
But why do we need to shift when we convert from int to fixed point?
A fixed-point format represents a number as an integer multiplied by a fixed scale. Commonly the scale is some base b raised to some power e, so the integer f would represent the number f·b^e.
In the code shown, the scale is 2^−16, or 1/65,536. (Calling the shift amount scale is a misnomer; 16, or rather −16, is the exponent.) So if the integer representing the number is 81,920, the value represented is 81,920·2^−16 = 1.25.
The routine DoubleToFixed converts a floating-point number to this fixed-point format by multiplying by the reciprocal of the scale; it multiplies by 65,536.
The routine FixedToDouble converts a number from this fixed-point format to floating-point by multiplying by the scale or, equivalently, by dividing by its reciprocal; it divides by 65,536.
IntToFixed does the same thing as DoubleToFixed except for an int input.
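For instance, with the question's functions and scale = 16:

int f = DoubleToFixed(1.25);       // 1.25 * 65536 = 81920
double d = FixedToDouble(81920);   // 81920 / 65536 = 1.25
int g = IntToFixed(3);             // 3 << 16 = 196608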
Fixed-point arithmetic works on the concept of representing numbers as an integer multiple of a very small "base". Your case uses a base of 1/(1<<scale), aka 1/65536, which is approximately 0.00001525878.
So the number 3.141592653589793, could be represented as 205887.416146 units of 1/65536, and so would be stored in memory as the integer value 205887 (which is really 3.14158630371, due to the rounding during conversion).
The way to calculate this conversion of fractional-value-to-fixed-point is simply to divide the value by the base: 3.141592653589793 / (1/65536) = 205887.416146. (Notably, this reduces to 3.141592653589793 * 65536 = 205887.416146.) And since the base is a power of two, there is a shortcut: multiplication by a power of two is the same as simply left shifting by that many bits. So multiplication by 2^16, aka 65536, can be calculated faster by simply shifting left 16 bits. This is really fast, which is why most fixed-point calculations use an inverse power of two as their base.
Due to the inability to shift float values, your methods convert the base to a float and do floating-point multiplication, but other methods, such as fixed-point multiplication and division themselves, are able to take advantage of this shortcut.
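As an illustration (a sketch, not from the question's code), a Q16.16 fixed-point multiply widens the intermediate and then rescales with a single shift:

#include <cstdint>

// The raw product of two Q16.16 values carries a scale of 2^-32;
// shifting right by 16 brings it back to 2^-16 units.
// (Arithmetic right shift assumed for negative products.)
int32_t FixedMul(int32_t a, int32_t b) {
    return (int32_t)(((int64_t)a * b) >> 16);
}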
Theoretically, one can use shifting bits with floats to do the conversion functions faster than simply floating point multiplication, but most likely, the compiler is actually already doing that under the covers.
It is also common for some code to use an inverse power of ten as the base, primarily for money, which usually uses a base of 0.01, but these cannot use a single shift as a shortcut, and have to do slower math. One shortcut for multiplying by 100 is (value<<6) + (value<<5) + (value<<2) (this is effectively value*64 + value*32 + value*4, which is value*(64+32+4), which is value*100), but three shifts and three adds is sometimes faster than one multiplication. Compilers already do this shortcut under the covers if 100 is a compile-time constant, so in general, nobody writes code like this anymore.
I have the following statement.
d = (pow(a,2*l+1)+1)/(val+1);
Here,
val, a and l are variables which are of no relation to the question.
the numerator can exceed long long int range.
denominator is a divisor of the numerator.
But the final answer d will surely be within the long long int range. How do I calculate d without loss of accuracy? I would prefer an answer without converting them to arrays and using grade-school multiplication and division.
I don't have time to write a proper answer now; I'll expand this later if I get a chance. The basic idea is to use the grade-school algorithm, working with "digits" that are a power of the denominator. Do a Google search for "Schrage multiplication" or look here for references.
I hope the operands are integers too.
1. I would use power by squaring instead of pow. See: Integer power by squaring.
2. While iterating #1, each time both the sub-result and the denominator are divisible by 2, divide both of them, to keep the pow result small without losing precision or the correctness of the result. So each time the LSB of both the sub-result and the denominator is zero, shift both right by 1 bit.
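A minimal sketch of the power-by-squaring part (unsigned, no overflow handling; the divide-by-2 trick from #2 would hook into the multiply steps):

#include <cstdint>

// Integer power by squaring: O(log exp) multiplications.
uint64_t ipow(uint64_t base, uint64_t exp) {
    uint64_t result = 1;
    while (exp) {
        if (exp & 1) result *= base;  // fold in the current bit of exp
        base *= base;                 // square for the next bit
        exp >>= 1;
    }
    return result;
}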
I have a class that internally represents some quantity in fixed point as 32-bit integer with somewhat arbitrary denominator (it is neither power of 2 nor power of 10).
For communicating with other applications the quantity is converted to plain old double on output and back on input. As code inside the class it looks like:
int32_t quantity;
double GetValue() { return double(quantity) / DENOMINATOR; }
void SetValue(double x) { quantity = x * DENOMINATOR; }
Now I need to ensure that if I output some value as double and read it back, I will always get the same value back. I.e. that
x.SetValue(x.GetValue());
will never change x.quantity (x is arbitrary instance of the class containing the above code).
The double representation has more digits of precision, so it should be possible. But it will almost certainly not be the case with the simplistic code above.
What rounding do I need to use and
How can I find the critical would-be corner cases to test that the rounding is indeed correct?
Any 32-bit integer will be represented exactly when you convert it to a double, but when you divide and then multiply by an arbitrary value you will get a similar value, not exactly the same one. You should lose at most one bit per operation, which means your double will be almost the same, prior to casting back to an int.
However, since int casts are truncations, you will get the wrong result when very minor errors turn 2.000 into 1.999, thus what you need to do is a simple rounding task prior to casting back.
You can use std::lround() for this if you have C++11; otherwise you can write your own rounding function.
You probably don't care much about rounding fairness here, so the common int(doubleVal+0.5) will work for positives. If, as seems likely, you have negatives, try this:
int round(double d) { return d<0?d-0.5:d+0.5; }
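Applied to the class in the question, the setter might become (a sketch, assuming C++11's std::lround):

#include <cmath>  // std::lround

void SetValue(double x) { quantity = (int32_t)std::lround(x * DENOMINATOR); }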
The problem you describe is the same problem which exists with converting between binary and decimal representation just with different bases. At least it exists if you want to have the double representation to be a good approximation of the original value (otherwise you could just multiply the 32 bit value you have with your fixed denominator and store the result in a double).
Assuming you want the double representation be a good approximation of your actual value the conversions are nontrivial! The conversion from your internal representation to double can be done using Dragon4 ("How to print floating point numbers accurately", Steele & White) or Grisu ("How to print floating point numbers quickly and accurately", Loitsch; I'm not sure if this algorithm is independent from the base, though). The reverse can be done using Bellerophon ("How to read floating point numbers accurately", Clinger). These algorithms aren't entirely trivial, though...
I want to truncate a floating-point number to 3 decimal digits. Example:
input : x = 0.363954;
output: 0.364
i used
double myCeil(float v, int p)
{
    return int(v * pow(float(10), p)) / pow(float(10), p);
}
but the output was 0.3630001 .
I tried to use trunc from <cmath> but it doesn't exist.
Floating-point math typically uses a binary representation; as a result, there are decimal values that cannot be exactly represented as floating-point values. Trying to fiddle with internal precisions runs into exactly this problem. But mostly when someone is trying to do this they're really trying to display a value using a particular precision, and that's simple:
double x = 0.363954;
std::cout.precision(3);
std::cout << x << '\n';
The function you're looking for is std::ceil, not std::trunc:
double myCeil(double v, int p)
{
    return std::ceil(v * std::pow(10, p)) / std::pow(10, p);
}
Substitute in std::floor or std::round for a myFloor or myRound as desired. (Note that std::round appeared in C++11, which you will have to enable if it isn't already.)
It is just impossible to get 0.364 exactly. There is no way you can store the number 0.364 (364/1000) exactly as a float, in the same way you would need an infinite number of decimals to write 1/3 as 0.3333333333...
You did it correctly, except for that you probably want to use std::round(), to round to the closest number, instead of int(), which truncates.
Comparing floating point numbers is tricky business. Typically the best you can do is check that the numbers are sufficiently close to each other.
Are you doing your rounding for comparison purposes? In such case, it seems you are happy with 3 decimals (this depends on each problem in question...), in such case why not just
bool are_equal_to_three_decimals(double a, double b)
{
    return std::abs(a - b) < 0.001;
}
Note that the results obtained via comparing the rounded numbers and the function I suggested are not equivalent!
This is an old post, but what you are asking for is decimal precision with binary mathematics. The conversion between the two is giving you an apparent distinction.
The main point, I think, which you are making is to do with identity, so that you can use equality/inequality comparisons between two numbers.
Because of the discrepancy between what we humans use (decimal) and what computers use (binary), we have four choices.
1. We use a decimal library. This is computationally costly, because we are using maths different from how computers work. There are several, and one day they may be adopted into std; see e.g. "ISO/IEC JTC1 SC22 WG21 N2849".
2. We learn to do our maths in binary. This is mentally costly, because it's not how we do our maths normally.
3. We change our algorithm to include an identity test.
4. We change our algorithm to use a difference test.
With option 3, it is where we make a decision as to just how close one number needs to be to another number to be considered 'the same number'.
One simple way of doing this (as given by #SirGuy above) is to use ceiling or floor as a test; this is good because it allows us to choose the number of significant digits we are interested in. It is domain specific, and the solution he gives might be a bit more optimal when using a power of 2 rather than of 10.
You definitely would only want to do the calculation when using equality/inequality tests.
So now, our equality test would be (for 10 binary places (nearly 3dp))
// Normal identity test for floats.
// Quick but fails eg 1.0000023 == 1.0000024
return (a == b);
Becomes (with 2^10 = 1024).
// Modified identity test for floats.
// Works with 1.0000023 == 1.0000024
return (std::floor(a * 1024) == std::floor(b * 1024));
But this isn't great, so I would go for option 4.
Say you consider any difference less than 0.0001 to be insignificant, such that 1.00012 = 1.00011.
This does an additional subtraction and a sign removal, which is far cheaper (and more reliable) than bit shifts.
// Modified equality test for floats.
// Returns true if the difference is less than 1/10000.
// Works with 1.0000023 == 1.0000024
return std::abs(a - b) < 0.0001;
This boils down to your comment about calculating circularity, I am suggesting that you calculate the delta (difference) between two circles, rather than testing for equivalence. But that isn't exactly what you asked in the question...
I need to represent numbers using the following structure. The purpose of this structure is not to lose the precision.
struct PreciseNumber
{
    long significand;
    int exponent;
};
Using this structure, the actual double value can be represented as value = significand * 10^exponent.
Now I need to write a utility function which can convert a double into a PreciseNumber.
Can you please let me know how to extract the exponent and significand from the double?
The prelude is somewhat flawed.
Firstly, barring any restrictions on storage space, conversion from a double to a base 10 significand-exponent form won't alter the precision in any form. To understand that, consider the following: any binary terminating fraction (like the one that forms the mantissa on a typical IEEE-754 float) can be written as a sum of negative powers of two. Each negative power of two is a terminating fraction itself, and hence it follows that their sum must be terminating as well.
However, the converse isn't necessarily true. For instance, 0.3 base 10 is equivalent to the non-terminating 0.01 0011 0011 0011 ... in base 2. Fitting this into a fixed size mantissa would blow some precision out of it (which is why 0.3 is actually stored as something that translates back to 0.29999999999999999.)
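You can see this directly (a quick check, not part of the original answer):

#include <cstdio>

int main() {
    std::printf("%.17g\n", 0.3);  // prints 0.29999999999999999
}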
By this, we may assume that any precision that is intended by storing the numbers in decimal significand-exponent form is either already lost, or simply isn't gained at all.
Of course, you might think of the apparent loss of accuracy generated by storing a decimal number as a float as loss in precision, in which case the Decimal32 and Decimal64 floating point formats may be of some interest -- check out http://en.wikipedia.org/wiki/Decimal64_floating-point_format.
This is a very difficult problem. You might want to see how much code it takes to implement a double-to-string conversion (for printf, e.g.). You might steal the code from gnu's implementation of gcc.
You cannot convert an "imprecise" double into a "precise" decimal number, because the required "precision" simply isn't there to begin with (otherwise why would you even want to convert?).
This is what happens if you try something like it in Java:
BigDecimal x = new BigDecimal(0.1);
System.out.println(x);
The output of the program is:
0.1000000000000000055511151231257827021181583404541015625
Well, you're at less precision than a typical double. Your significand is a long (assuming 32 bits), giving you a range from −2 billion to +2 billion, which is more than 9 but fewer than 10 digits of precision.
Here's an untested starting point on what you'd want to do for some simple math on PreciseNumbers
PreciseNumber Multiply(PreciseNumber lhs, PreciseNumber rhs)
{
    PreciseNumber ret;
    ret.significand = lhs.significand * rhs.significand;
    ret.exponent = lhs.exponent + rhs.exponent;
    return ret;
}

PreciseNumber Add(PreciseNumber lhs, PreciseNumber rhs)
{
    // assumes rhs.exponent >= lhs.exponent
    PreciseNumber ret;
    ret.significand = lhs.significand + (long)(rhs.significand * pow(10, rhs.exponent - lhs.exponent));
    ret.exponent = lhs.exponent;
    return ret;
}
I didn't take care of any renormalization, but in both cases there are places where you have to worry about over/underflow and loss of precision. Just because you're doing it yourself rather than letting the computer take care of it in a double doesn't mean the same pitfalls aren't there. The only way to not lose precision is to keep track of all of the digits.
Here's a very rough algorithm. I'll try to fill in some details later.
Take the log10 of the number to get the exponent x. Divide the double by 10^x if x is positive, or multiply it by 10^−x if x is negative, so that its magnitude lands in [1, 10).
Start with a significand of zero. Repeat the following 15 times, since a double contains 15 digits of significance:
1. Multiply the previous significand by 10.
2. Take the integer portion of the double, add it to the significand, and subtract it from the double.
3. Subtract 1 from the exponent.
4. Multiply the double by 10.
When finished, take the remaining double value and use it for rounding: if it's >= 5, add one to the significand.
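A sketch of that loop (assuming the PreciseNumber struct from the question, a 64-bit long, and positive, finite input; the FromDouble name is mine). Note the exponent ends at x − 14, since extracting 15 digits from a value normalized to [1, 10) scales it by 10^14:

#include <cmath>

PreciseNumber FromDouble(double num)
{
    PreciseNumber r = {0, 0};
    if (num == 0.0) return r;
    int x = (int)std::floor(std::log10(num));   // decimal exponent
    double d = num / std::pow(10.0, x);         // normalize to [1, 10)
    for (int i = 0; i < 15; ++i) {              // ~15 significant digits
        r.significand = r.significand * 10 + (long)d;
        d = (d - (long)d) * 10.0;
    }
    r.exponent = x - 14;                        // 15 digits were extracted
    if (d >= 5.0) ++r.significand;              // round with the leftover
    return r;
}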