How to divide 2^{128} by an odd number in c++? - c++

How can we divide 2^{128} by an odd number (an unsigned 64-bit integer), or more precisely compute the floor of that quotient, without using an arbitrary-precision arithmetic library?
The problem is even with gcc, 2^{128} cannot be expressed.
So I'm considering creating 192-bit integer type.
But, I have no idea how to do that (especially subtraction part).
I want the result of the floor to be an unsigned number.

For any odd d > 1, UINT128_MAX / d equals floor(2^{128}/d).
This is because 2^{128}/d must have a remainder, as the only factors of 2^{128} are powers of two (including 1), so the odd d (excluding 1) cannot be a divisor. Therefore, 2^{128}/d and (2^{128}−1)/d have the same integral quotient, and UINT128_MAX is 2^{128}−1.
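A minimal sketch of that identity, assuming GCC or Clang, where unsigned __int128 is available as a compiler extension (it is not standard C++). The alias u128 and the function name floor_two_pow_128_div are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// unsigned __int128 is a GCC/Clang extension, not standard C++.
using u128 = unsigned __int128;

// Returns floor(2^128 / d) for an odd d > 1 (hypothetical helper name).
u128 floor_two_pow_128_div(std::uint64_t d) {
    assert(d > 1 && d % 2 == 1);               // the identity needs an odd d > 1
    const u128 max128 = ~static_cast<u128>(0); // 2^128 - 1, all bits set
    return max128 / d;                         // same quotient as 2^128 / d
}
```

The result can be sanity-checked with wrapping arithmetic: for d = 3 the remainder of 2^{128} is 1, so q * 3 + 1 should wrap around to 0 in 128 bits.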

Related

Storing a positive floating point number in a known range in two bytes

I know that a number x lies between n and f (f > n > 0). So my idea is to bring that range to [0, 0.65535] by 0.65535 * (x - n) / (f - n).
Then I just could multiply by 10000, round and store integer in two bytes.
Is it going to be an effective use of storage in terms of precision?
I'm doing it for a WebGL1.0 shader, so I'd like to have simple encoding/decoding math, I don't have access to bitwise operations.
Why multiply by 0.65535 and then by 10000.0? That introduces a second rounding with an unnecessary loss of precision.
The data will be represented well if it has equal likelihood over the entire range [n, f]. But this is not always a reasonable assumption. What you're doing is similar to creating a fixed-point representation (fixed step size, just not starting at 0 or with steps that are negative powers of 2).
Floating-point numbers use bigger step sizes for bigger numbers. You could do the same by calculating log(x/f) / log(n/f) * 65535
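To compare the two encodings, here is a rough sketch of both, written in C++ rather than shader code so the arithmetic can be checked on the CPU. The function names are hypothetical, and the decoders are simply the algebraic inverses of the encoding formulas; the log variant uses the log(x/f) / log(n/f) * 65535 expression above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Linear (fixed-step) encoding of x in [n, f] into a 16-bit code.
std::uint16_t encode_linear(double x, double n, double f) {
    assert(0.0 < n && n < f && n <= x && x <= f);
    return static_cast<std::uint16_t>(std::lround(65535.0 * (x - n) / (f - n)));
}

double decode_linear(std::uint16_t code, double n, double f) {
    return n + (code / 65535.0) * (f - n);
}

// Logarithmic encoding: the step size grows with x, so the relative
// error is roughly constant. Note that x = f maps to 0 and x = n to 65535.
std::uint16_t encode_log(double x, double n, double f) {
    assert(0.0 < n && n < f && n <= x && x <= f);
    return static_cast<std::uint16_t>(
        std::lround(65535.0 * std::log(x / f) / std::log(n / f)));
}

double decode_log(std::uint16_t code, double n, double f) {
    return f * std::pow(n / f, code / 65535.0);
}
```

With n = 1 and f = 100, the linear code has an absolute step of about 0.0015 everywhere, while the log code has a relative step of about 0.007 %, which favors values near n.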

If a floating-point number is representable in my machine, will its inverse be representable in my machine?

Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such as 0 <= v < 1/x.
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v end up being bigger than the real value 1/x because it's so small that cannot be represented in my machine?
In other words, and more precisely, if I have a positive real number x and an object x1 of type double whose stored value represents x exactly, is it guaranteed that the value represented by DBL_EPSILON is less than the real number 1/x?
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
I will assume double is IEEE 754 binary64.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
Not necessarily, for two reasons:
The inverse might not be a floating-point number.
For example, although 3 is a floating-point number, 1/3 is not.
The inverse might overflow.
For example, the inverse of 2^−1074 is 2^1074, which is not only larger than all finite floating-point numbers but more than halfway from the largest finite floating-point number, 0x1.fffffffffffffp+1023 = 2^1024 − 2^971, to what would be the next one after that, 2^1024, if the range of exponents were larger.
So the inverse of 2^−1074 is rounded to infinity.
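Both failure modes can be observed directly. The sketch below assumes IEEE 754 binary64 with the default rounding mode and subnormals enabled; the helper names are hypothetical. It uses std::fma to evaluate fl(1/x) * x − 1 with a single rounding, so a nonzero result shows that the computed reciprocal is not the true inverse.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Residual fl(1/x) * x - 1, evaluated with a single rounding by std::fma.
// A nonzero value means 1/x was rounded.
double reciprocal_residual(double x) {
    assert(x > 0.0 && std::isfinite(x));
    return std::fma(1.0 / x, x, -1.0);
}

// True if the reciprocal of a positive x overflows to infinity.
bool reciprocal_overflows(double x) {
    assert(x > 0.0);
    return std::isinf(1.0 / x);
}
```

For x = 3 the residual is −2^−54, while for x = 4 it is exactly zero; the smallest subnormal, std::numeric_limits<double>::denorm_min(), has a reciprocal that overflows.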
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such as 0 <= v < 1/x.
The smallest such 𝑣 is always zero.
If you restrict it to be nonzero, it will always be the smallest subnormal floating-point number, 0x1p−1074, or roughly 4.9406564584124654 × 10^−324, irrespective of 𝑥 (unless 𝑥 is infinite).
But perhaps you want the largest such 𝑣 rather than the smallest such 𝑣.
The largest such 𝑣 is always either 1 ⊘ 𝑥 = fl(1/𝑥) (that is, the floating-point number nearest to 1/𝑥, which is what you get by writing 1/x in C), or the next floating-point number closer to zero (which you can get by writing nextafter(1/x, 0) in C): in the default rounding mode, the division operator always returns the nearest floating-point number to the true quotient, or one of the two nearest ones if there is a tie.
You can also get the largest such 𝑣 by setting the rounding mode with fesetround(FE_DOWNWARD) or fesetround(FE_TOWARDZERO) and then just computing 1/x, although toolchain support for non-default rounding modes is spotty and mostly they serve to shake out bugs in ill-conditioned code rather than to give reliable rounding semantics.
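One possible way to choose between 1/x and nextafter(1/x, 0) without touching the rounding mode is sketched below. It assumes IEEE 754 binary64, the default round-to-nearest mode, and a finite positive x; the function name is hypothetical, and the std::fma sign test is an illustrative technique rather than something the text above prescribes.

```cpp
#include <cassert>
#include <cmath>

// Largest double v with v < 1/x (real-number comparison), for finite x > 0.
double largest_below_reciprocal(double x) {
    assert(x > 0.0 && std::isfinite(x));
    double q = 1.0 / x; // nearest double to 1/x
    // The sign of q*x - 1, computed with one rounding, tells whether q
    // landed at or above the true quotient; if so, step toward zero.
    if (std::fma(q, x, -1.0) >= 0.0) {
        q = std::nextafter(q, 0.0);
    }
    return q;
}
```

When 1/x is exactly representable (for instance x = 2), the result is the next double below it; when 1/x was rounded down (for instance x = 3), the result is 1/x itself.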
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v end up being bigger than the real value 1/x because it's so small that cannot be represented in my machine?
1/x is never rounded to zero unless 𝑥 is infinite or you have nonstandard flush-to-zero semantics enabled (so results which would ordinarily be subnormal are instead rounded to zero, such as when 𝑥 is the largest finite floating-point number 0x1.fffffffffffffp+1023).
But flush-to-zero aside, there are many values of 𝑥 for which 1/𝑥 and fl(1/𝑥) = 1/x are smaller than DBL_EPSILON.
For example, if 𝑥 = 0x1p+1000 (that is, 2^1000 ≈ 1.0715086071862673 × 10^301), then 1/𝑥 = fl(1/𝑥) = 1/x = 0x1p−1000 (that is, 2^−1000 ≈ 9.332636185032189 × 10^−302) is far below DBL_EPSILON = 0x1p−52 (that is, 2^−52 ≈ 2.220446049250313 × 10^−16).
1/𝑥 in this case is a floating-point number, so the reciprocal is computed exactly in floating-point arithmetic; there is no rounding at all.
The largest floating-point number below 1/𝑥 in this case is 0x1.fffffffffffffp−1001, or 2^−1000 − 2^−1053.
DBL_EPSILON (2^−52) is not the smallest floating-point number (2^−1074), or even the smallest normal floating-point number (2^−1022).
Rather, DBL_EPSILON is the distance from 1 to the next larger floating-point number, 1 + 2^−52, sometimes written ulp(1) to indicate that it is the magnitude of the least significant digit, or unit in the last place, in the floating-point representation of 1.
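The three constants can be told apart with a small check. This is a sketch assuming IEEE 754 binary64 with subnormals enabled; gap_above is a hypothetical helper name.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <limits>

// Distance from x to the next larger double (hypothetical helper name).
double gap_above(double x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}
```

Under those assumptions, gap_above(1.0) equals DBL_EPSILON (2^−52), gap_above(0.0) equals std::numeric_limits<double>::denorm_min() (2^−1074), and DBL_MIN is 2^−1022.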
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
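That would be the largest double below 1/DBL_EPSILON = 2^52, namely nextafter(1/DBL_EPSILON, 0) = 2^52 − 2^−1; if 𝑥 is restricted to integers, it is 1/DBL_EPSILON - 1, or 2^52 − 1.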
But what do you want this number for?
Why are you trying to use DBL_EPSILON here?
The inverse of positive infinity is, of course, smaller than any positive rational number. Beyond that, even the largest finite floating point number has a multiplicative inverse well above the smallest representable floating point number of equivalent width, thanks to denormal numbers.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
No. There is no specification that 1.0/DBL_MIN <= DBL_MAX and 1.0/DBL_MAX >= DBL_MIN both must be true. One is usually true. With sub-normals, 1.0/sub-normal is often > DBL_MAX.
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such as 0 <= v < 1/x.
v could be zero, unless for some large x like DBL_MAX, 1.0/x is itself zero. That is a possibility. With sub-normals, that is rarely the case, as 1.0/DBL_MAX is representable as a value greater than 0.
DBL_EPSILON has little to do with the above. OP's issues depend more on DBL_MAX, DBL_MIN and whether double supports sub-normals. Many FP encodings are roughly balanced, so that 1/DBL_MIN is somewhere near DBL_MAX, yet C does not require that symmetry.
No. Floating point numbers are balanced around 1.0 to minimize the effect of calculating inverses, but this balance is not exact. The middle point for the exponent (the value 0x3fff... for the exponent) gives the same number of powers of two above and below 1.0. But the all-ones exponent value 0x7fff... is reserved for infinity and NaNs, while the value 0x0000... is reserved for denormals (also called subnormals). These values are not normalized (and some architectures don't even implement them), but on those that do implement them, they add as many powers of 2 as the width of the mantissa (though with lower precision) to the normalized ones, in the range of the negative exponents. This means that you have a set of numbers, quite close to zero, for which when you compute their inverses, you always get infinity.
For doubles you have 52 more powers of two, or around 15 more powers of ten. For floats, this is around 7 more powers of ten.
But this also means that if you calculate the inverse of a large number you'll always get a number different than zero.

Check if 1/n has infinite number of digits after decimal point

If a user enters a number "n" (integer) not equal to 0, my program should check if the fraction 1/n has an infinite or finite number of digits after the decimal point. For example: for n=2 we have 1/2=0.5, therefore we have 1 digit after the decimal point. My first solution to this problem was this:
int n=1;
cin>>n;
if((1.0/n)*n==1)
{
cout<<"fixed number of digits after decimal point";
}
else cout<<"infinite number of digits after decimal point";
Since the computer can't store infinite numbers like 1/3, I expected that (1/3)*3 wouldn't be equal to 1. The first time I ran the program, the result was what I expected, but when I ran the program today, for n=3 I got the output (1/3)*3=1. I was surprised by this result and tried
double fraction = 1.0/n;
cout<< fraction*n;
which also returned 1. Why is the behaviour different and can I make my algorithm work? If I can't make it to work, I will have to check if n's divisors are only 1, 2 and 5, which, I think, would be harder to program and compute.
My IDE is Visual Studio, therefore my C++ compiler is VC.
Your code tries to make use of the fact that 1.0/n is not done with perfect precision, which is true. Multiplying the result by n theoretically should get you something not equal to 1, true.
Sadly the multiplication with n in your code is ALSO not done with perfect precision.
The fact which trips your concept up is that the two imperfections can cancel each other out and you get a seemingly perfect 1 in the end.
So, yes. Go with the divisor check.
Binary vs. decimal
Your assignment asks you whether the fraction 1/n can be represented with a finite number of digits in decimal representation. Floating-point numbers in python are represented using binary, which has some similarities and some differences with decimal:
if a rational number can be represented in binary with a finite number of bits, then it can also be represented in decimal with a finite number of digits;
some numbers can be represented in decimal with a finite number of digits, but require an infinite number of bits in binary.
This is because 10 = 2 * 5; for any integer p, p / 2**k == (p * 5**k) / 10**k. So 1/2==5/10 and 1/4 == 25/100 and 1/8 == 125/1000 can be represented with finitely many digits or bits. But 1/5 can be represented with finitely many digits in decimal, yet requires infinitely many bits in binary.
Floating-point arithmetic and test for equality
See also: Is floating-point math broken? and What every programmer should know about floating-point arithmetic (pdf paper).
The computation (1.0 / n) * n results in an approximation; there is generally no simple way to know whether checking for equality with 1.0 will return true or false. In language C, which uses the same floating-point arithmetic as python, compilers can raise a warning if you try to test for equality of two floating-point numbers (this warning can be enabled or disabled with the option -Wfloat-equal).
A different logic for your algorithm
You can't rely on floating-point arithmetic to decide your problem. But it's not needed. A number can be represented with finitely many digits if and only if it can be written under the form p / 10**k with p and k integers. So your program should examine n to find out whether there exists j and k such that 1 / n == 1 / (2**j * 5**k), without using floating-point arithmetic.
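A minimal sketch of that integer-only check: strip every factor of 2 and 5 from n and see whether anything else is left. The function name is hypothetical, and n is assumed to be nonzero, as in the question.

```cpp
#include <cassert>

// True if 1/n has finitely many decimal digits, i.e. n = ±(2^j * 5^k).
bool has_finite_decimal_expansion(long long n) {
    assert(n != 0);                // 1/0 is undefined
    while (n % 2 == 0) n /= 2;     // remove all factors of 2
    while (n % 5 == 0) n /= 5;     // remove all factors of 5
    return n == 1 || n == -1;      // nothing but the sign may remain
}
```

For example, 40 = 2^3 * 5 passes (1/40 = 0.025), while 6 = 2 * 3 fails because the factor 3 remains.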

When Will static_casting the Result of ceil Compromise the Result?

static_casting from a floating point to an integer simply strips the fractional point of the number. For example static_cast<int>(13.9999999) yields 13.
Not all integers are representable as floating point numbers. For example internally the closest float to 13,000,000 may be: 12999999.999999.
In this hypothetical case, I'd expect to get an unexpected result from:
const auto foo = 12'999'999.5F;
const auto bar = static_cast<long long>(ceil(foo));
My assumption is that such a breakdown does occur at some point, if not necessarily at 13,000,000. I'd just like to know the range over which I can trust static_cast<long long>(ceil(foo))?
For example internally the closest float to 13,000,000 may be: 12999999.999999.
That is not possible in any normal floating-point format. The floating-point representation of numbers is equivalent to M•b^e, where b is a fixed base (e.g., 2 for binary floating-point) and M and e are integers with some restrictions on their values. In order for a value like 13,000,000-x to be represented, where x is some positive value less than 1, e must be negative (because M•b^e for a non-negative e is an integer). If so, then M•b^0 is an integer larger than M•b^e, so it is larger than 13,000,000, and so 13,000,000 can be represented as M'•b^0, where M' is a positive integer less than M and hence fits in the range of allowed values for M (in any normal floating-point format). (Perhaps some bizarre floating-point format might impose a strange range on M or e that prevents this, but no normal format does.)
Regarding your code:
auto test = 0LL;
const auto floater = 0.5F;
for(auto i = 0LL; i == test; i = std::ceil(i + floater)) ++test;
cout << test << endl;
When i was 8,388,608, the mathematical result of 8,388,608 + .5 is 8,388,608.5. This is not representable in the float format on your system, so it was rounded to 8,388,608. The ceil of this is 8,388,608. At this point, test was 8,388,609, so the loop stopped. So this code does not demonstrate that 8,388,608.5 is representable and 8,388,609 is not.
Behavior seems to return to normal if I do: ceil(8'388'609.5F) which will correctly return 8,388,610.
8,388,609.5 is not representable in the float format on your system, so it was rounded by the rule “round to nearest, ties to even.” The two nearest representable values are 8,388,609, and 8,388,610. Since they are equally far apart, the result was 8,388,610. That value was passed to ceil, which of course returned 8,388,610.
On Visual Studio 2015 I got 8,388,609 which is a horrifying small safe range.
In the IEEE-754 basic 32-bit binary format, all integers from -16,777,216 to +16,777,216 are representable, because the format has a 24-bit significand.
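The rounding behavior described above can be reproduced with a small wrapper. This sketch assumes float is IEEE-754 binary32 and that literals are rounded to nearest, ties to even; ceil_to_ll is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// ceil followed by the cast from the question; the float parameter forces
// the argument to be rounded to float precision before ceil sees it.
long long ceil_to_ll(float x) {
    assert(std::fabs(x) < 9.0e18f); // the result must fit in long long
    return static_cast<long long>(std::ceil(x));
}
```

Here ceil_to_ll(8388607.5f) gives 8388608 because 8,388,607.5 is representable, whereas ceil_to_ll(8388608.5f) gives 8388608 and ceil_to_ll(8388609.5f) gives 8388610 because the literals were already rounded to integers before ceil ran. In every case the cast itself is exact, since ceil returns an integer-valued float.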
Floating point numbers are represented by 3 integers, c, b and q, as c·b^q where:
c is the mantissa (so for the number: 12,999,999.999999 c would be 12,999,999,999,999)
q is the exponent (so for the number: 12,999,999.999999 q would be -6)
b is the base (IEEE-754 requires b to be either 10 or 2; in the representation above b is 10)
From this it's easy to see that a floating point with the capability of representing 12,999,999.999999 also has the capability of representing 13,000,000.000000 using a c of 1,300,000,000,000 and a q of -5.
This example is a bit contrived in that the chosen b is 10, where in almost all implementations the chosen base is 2. But it's worth pointing out that even with a b of 2 the q functions as a shift left or right of the mantissa.
Next let's talk about a range here. Obviously a 32-bit floating point cannot represent all the integers represented by a 32-bit integer, as the floating point must also represent so many much larger or smaller numbers. Since the exponent is simply shifting the mantissa, a floating point number can always exactly represent every integer that can be represented by its mantissa. Given the traditional IEEE-754 binary base floating point numbers:
A 32-bit (float) has a 24-bit mantissa so it can represent all integers in the range [-16,777,215, 16,777,215]
A 64-bit (double) has a 53-bit mantissa so it can represent all integers in the range [-9,007,199,254,740,991, 9,007,199,254,740,991]
A 128-bit (long double depending upon implementation) has a 113-bit mantissa so it can represent all integers in the range [-10,384,593,717,069,655,257,060,992,658,440,191, 10,384,593,717,069,655,257,060,992,658,440,191]
c++ provides digits as a method of finding this number for a given floating point type. (Though admittedly even a long long is too small to represent a 113-bit mantissa.) For example a float's maximum mantissa could be found by:
(1LL << numeric_limits<float>::digits) - 1LL
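Generalized to any type whose mantissa fits in a long long, the expression could be wrapped as follows. The function name is hypothetical, and the static_assert rejects types such as a 64-bit or 113-bit mantissa long double, for which the shift would overflow.

```cpp
#include <limits>

// Largest value an all-ones mantissa of T can hold, as a long long.
template <typename T>
constexpr long long max_mantissa_integer() {
    static_assert(std::numeric_limits<T>::digits < 63,
                  "mantissa too wide for long long");
    return (1LL << std::numeric_limits<T>::digits) - 1LL;
}
```

For IEEE-754 types this yields 16,777,215 for float and 9,007,199,254,740,991 for double.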
Having thoroughly explained the mantissa, let's revisit the exponent section to talk about how a floating point is actually stored. Take 13,000,000.0 that could be represented as:
c = 13, q = 6, b = 10
c = 130, q = 5, b = 10
c = 1,300, q = 4, b = 10
And so on. For the traditional binary format IEEE-754 requires:
The representation is made unique by choosing the smallest representable exponent that retains the most significant bit (MSB) within the selected word size and format. Further, the exponent is not represented directly, but a bias is added so that the smallest representable exponent is represented as 1, with 0 used for subnormal numbers
To explain this in the more familiar base-10 if our mantissa has 14 decimal places, the implementation would look like this:
c = 13,000,000,000,000 so the MSB will be used in the represented number
q = 6 This is a little confusing, it's because of the bias introduced here; logically q = -6 but the bias is set so that when q = 0 only the MSB of c is immediately to the left of the decimal point, meaning that c = 13,000,000,000,000, q = 0, b = 10 will represent 1.3
b = 10 again the above rules are really only required for base-2 but I've shown them as they would apply to base-10 for the purpose of explanation
Translated back to base-2 this means that a q of numeric_limits<T>::digits - 1 has only zeros after the decimal place. ceil only has an effect if there is a fractional part of the number.
A final point of explanation here is the range over which ceil will have an effect. Once the exponent of a floating point number reaches numeric_limits<T>::digits - 1LL, continuing to increase it only introduces trailing zeros to the resulting number, so calling ceil can only change the value when q is less than or equal to numeric_limits<T>::digits - 2LL. Since we know the MSB of c will be used in the number, this means that the magnitude of the number must be smaller than 1LL << (numeric_limits<T>::digits - 1LL). Thus for ceil to have an effect on the traditional binary IEEE-754 floating point:
A 32-bit (float) must be smaller than 8,388,608
A 64-bit (double) must be smaller than 4,503,599,627,370,496
A 128-bit (long double depending upon implementation) must be smaller than 5,192,296,858,534,827,628,530,496,329,220,096

Division overflow using pow()

Q = (a_i + b_i) / (2^s)
-10^10 ≤ s ≤ 10^10
1 ≤ a_i, b_i ≤ 10^9
It is guaranteed that -10^10 ≤ Q ≤  10^10.
Here s,a_i,b_i are integers and Q is a decimal no.
When we calculate Q, there is overflow due to the large value of 2^s. I am using pow(2,s) to calculate 2^s. How can I calculate Q, given the range of Q as in the statement?
I assume by your statement that Q is decimal, that this involves floating point operations rather than integer arithmetic.
If you can't use logarithms for some reason, the slower approach would be to calculate a floating point value with value equal to a_i + b_i. If s is positive, simply divide that value s times by 2 (in a loop). If s is negative, multiply instead of divide.
For arbitrary a_i and b_i, you will still have the risk of overflow (when s is negative) or underflow (s positive) and will need to manage that. However, you claim to have a guarantee that is not the case .....
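As an alternative to the loop, the standard function std::ldexp scales a floating point value by a power of two in one call, without ever forming 2^s on its own. This is a swapped-in technique rather than the loop described above; the function name scaled_quotient and the clamp bound of 4000 are assumptions made for this sketch, with double taken to be IEEE 754 binary64.

```cpp
#include <cassert>
#include <cmath>

// Computes (a + b) / 2^s as a double without computing 2^s separately.
double scaled_quotient(long long a, long long b, long long s) {
    assert(a >= 1 && b >= 1);
    // std::ldexp takes an int exponent. Past +-4000 the result has already
    // underflowed to 0 or overflowed to infinity for any sum in
    // [2, 2^63), so clamping s does not change the answer.
    if (s > 4000) s = 4000;
    if (s < -4000) s = -4000;
    return std::ldexp(static_cast<double>(a + b), static_cast<int>(-s));
}
```

For instance, scaled_quotient(3, 5, 2) is 2.0 and scaled_quotient(1, 1, -3) is 16.0; extreme values of s yield 0 or infinity, which the stated guarantee on Q is supposed to rule out.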