Storing a positive floating point number in a known range in two bytes - glsl

I know that a number x lies between n and f (f > n > 0). So my idea is to bring that range to [0, 0.65535] by 0.65535 * (x - n) / (f - n).
Then I could just multiply by 10000, round, and store the integer in two bytes.
Is it going to be an effective use of storage in terms of precision?
I'm doing it for a WebGL 1.0 shader, so I'd like to have simple encoding/decoding math; I don't have access to bitwise operations.

Why multiply by 0.65535 and then by 10000.0? That introduces a second rounding with an unnecessary loss of precision.
The data will be represented well if it has equal likelihood over the entire range (n, f). But this is not always a reasonable assumption. What you're doing is similar to creating a fixed-point representation (fixed step size, just not starting at 0 or with steps that are negative powers of 2).
Floating-point numbers use bigger step sizes for bigger numbers. You could do the same by calculating log(x/f) / log(n/f) * 65535, which maps x = f to 0 and x = n to 65535.
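For concreteness, here is a minimal sketch of both encodings in C++ (the same arithmetic ports directly to a GLSL shader); the function names are illustrative, and n, f are the known bounds with 0 < n <= x <= f:

#include <cmath>
#include <cstdint>

// Linear (fixed step) encoding with a single rounding step, using the full
// 16-bit range rather than multiplying by 0.65535 and then by 10000.
uint16_t encode_linear(float x, float n, float f) {
    return (uint16_t)std::lround((x - n) / (f - n) * 65535.0f);
}
float decode_linear(uint16_t e, float n, float f) {
    return n + (e / 65535.0f) * (f - n);
}

// Logarithmic encoding (relative step size), as suggested above.
uint16_t encode_log(float x, float n, float f) {
    return (uint16_t)std::lround(std::log(x / f) / std::log(n / f) * 65535.0f);
}
float decode_log(uint16_t e, float n, float f) {
    return f * std::pow(n / f, e / 65535.0f);   // inverts the log mapping
}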


How to write this floating point code in a portable way?

I am working on a cryptocurrency and there is a calculation that nodes must make:
average /= total;
double ratio = average/DESIRED_BLOCK_TIME_SEC;
int delta = -round(log2(ratio));
It is required that every node has the exact same result no matter what architecture or stdlib is being used by the system. My understanding is that log2 might have different implementations that yield very slightly different results, or that flags like -ffast-math could affect the output.
Is there a simple way to convert the above calculation to something that is verifiably portable across different architectures (fixed point?), or am I overthinking the precision that is needed (given that I round the answer at the end)?
EDIT: Average is a long and total is an int... so average ends up rounded to the closest second.
DESIRED_BLOCK_TIME_SEC = 30.0 (it's a float) that is #defined
For this kind of calculation to be exact, one must either calculate all the divisions and logarithms exactly -- or one can work backwards.
-round(log2(x)) == round(log2(1/x)), meaning that one of the divisions can be turned around to get (1/x) >= 1.
round(log2(x)) == floor(log2(x * sqrt(2))) == binary_log((int)(x*sqrt(2))).
One minor detail here is whether (double)sqrt(2) rounds down or up. If it rounds up, there might exist one or more values where x * sqrt(2) == 2^n + epsilon (after rounding), whereas if it rounds down we would get 2^n - epsilon. One would give the integer value n, the other n-1. Which is correct?
Naturally, the correct one is the one whose ratio to the theoretical midpoint x * sqrt(2) is smaller.
x * sqrt(2) / 2^(n-1) < 2^n / (x * sqrt(2)) -- multiply by x*sqrt(2)
x^2 * 2 / 2^(n-1) < 2^n -- multiply by 2^(n-1)
x^2 * 2 < 2^(2*n-1)
In order for this comparison to be exact, x^2 or pow(x,2) must be exact as well on the boundary -- and it matters what range the original values are in. A similar analysis can and should be done while expanding x = a/b, so that the inexactness of the division can be mitigated at the cost of possible overflow in the multiplication...
Then again, I wonder how all the other similar applications handle the corner cases, which may not even exist -- and those could be brute force searched assuming that average and total are small enough integers.
EDIT
Because average is an integer, it makes sense to tabulate those exact integer values which are on the boundaries of -round(log2(average/30.0)).
From Octave: d = -round(log2((1:1000000)/30.0)); [1 find(d(2:end) ~= d(1:end-1)) + 1]
1 2 3 6 11 22 43 85 170 340 679 1358 2716
5431 10862 21723 43445 86890 173779 347558 695115
All the averages in [1, 2) -> 5
All the averages in [2, 3) -> 4
All the averages in [3, 6) -> 3
..
All the averages in [43445, 86890) -> -11
int a = find_lower_bound(average, table); // linear or binary search
return 5 - a;
No floating point arithmetic needed
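A minimal C++ sketch of that lookup (the boundary values are copied from the Octave output above, the function name is illustrative, and the table would need to be extended to cover the full range of average your nodes can see):

#include <algorithm>
#include <cstdint>
#include <iterator>

// Smallest 'average' for each value of delta, starting at delta == 5;
// each later entry lowers delta by one.
static const int64_t kBoundaries[] = {
    1, 2, 3, 6, 11, 22, 43, 85, 170, 340, 679, 1358, 2716,
    5431, 10862, 21723, 43445, 86890, 173779, 347558, 695115
};

int delta_from_average(int64_t average) {   // assumes average >= 1
    // Index of the last boundary <= average (integer-only binary search).
    const int64_t* it = std::upper_bound(std::begin(kBoundaries),
                                         std::end(kBoundaries), average);
    int index = (int)(it - std::begin(kBoundaries)) - 1;
    return 5 - index;   // [1,2) -> 5, [2,3) -> 4, [3,6) -> 3, ...
}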

If a floating-point number is representable in my machine, will its inverse be representable in my machine?

Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such that 0 <= v < 1/x.
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v ends up being bigger than the real value 1/x because it's so small that it cannot be represented in my machine?
In other words, and more precisely, if I have a positive real number x and an object x1 of type double whose stored value represents x exactly, is it guaranteed that the value represented by DBL_EPSILON is less than the real number 1/x?
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
I will assume double is IEEE 754 binary64.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
Not necessarily, for two reasons:
The inverse might not be a floating-point number.
For example, although 3 is a floating-point number, 1/3 is not.
The inverse might overflow.
For example, the inverse of 2^-1074 is 2^1074, which is not only larger than all finite floating-point numbers but more than halfway from the largest finite floating-point number, 0x1.fffffffffffffp+1023 = 2^1024 - 2^971, to what would be the next one after that, 2^1024, if the range of exponents were larger.
So the inverse of 2^-1074 is rounded to infinity.
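A two-line check of both failure modes (a sketch assuming IEEE 754 binary64 and a compiler with hexadecimal floating-point literals):

#include <cstdio>

int main() {
    printf("%g\n", 1.0 / 3.0);          // 0.333333: the true inverse of 3 is not representable, so it is rounded
    printf("%g\n", 1.0 / 0x1p-1074);    // inf: the true inverse 2^1074 overflows
}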
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such that 0 <= v < 1/x.
The smallest such v is always zero.
If you restrict it to be nonzero, it will always be the smallest subnormal floating-point number, 0x1p-1074, or roughly 4.9406564584124654 × 10^-324, irrespective of x (unless x is infinite).
But perhaps you want the largest such v rather than the smallest such v.
The largest such v is always either 1 ⊘ x = fl(1/x) (that is, the floating-point number nearest to 1/x, which is what you get by writing 1/x in C), or the next floating-point number closer to zero (which you can get by writing nextafter(1/x, 0) in C): in the default rounding mode, the division operator always returns the nearest floating-point number to the true quotient, or one of the two nearest ones if there is a tie.
You can also get the largest such v by setting the rounding mode with fesetround(FE_DOWNWARD) or fesetround(FE_TOWARDZERO) and then just computing 1/x, although toolchain support for non-default rounding modes is spotty and mostly they serve to shake out bugs in ill-conditioned code rather than to give reliable rounding semantics.
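As a hedged sketch of that rounding-mode route (the function name here is made up, and the approach assumes the toolchain really honours runtime rounding-mode changes, e.g. with #pragma STDC FENV_ACCESS, which as noted is not always the case):

#include <cfenv>
#include <cmath>

// Largest double strictly below the real number 1/x, for finite x > 0.
double largest_double_below_inverse(double x) {
    std::feclearexcept(FE_INEXACT);
    const int old_mode = std::fegetround();
    std::fesetround(FE_TOWARDZERO);
    double v = 1.0 / x;                           // largest double <= real 1/x
    bool exact = std::fetestexcept(FE_INEXACT) == 0;
    std::fesetround(old_mode);
    // If the division was exact, v equals 1/x, so step down one ulp to get v < 1/x.
    return exact ? std::nextafter(v, 0.0) : v;
}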
It seems fairly obvious that I could write double v = DBL_EPSILON;, but, if x is big enough, could it happen that v ends up being bigger than the real value 1/x because it's so small that it cannot be represented in my machine?
1/x is never rounded to zero unless x is infinite or you have nonstandard flush-to-zero semantics enabled (so results which would ordinarily be subnormal are instead rounded to zero, such as when x is the largest finite floating-point number 0x1.fffffffffffffp+1023).
But flush-to-zero aside, there are many values of x for which 1/x and fl(1/x) (that is, the C expression 1/x) are smaller than DBL_EPSILON.
For example, if x = 0x1p+1000 (that is, 2^1000 ≈ 1.0715086071862673 × 10^301), then 1/x = fl(1/x) = 0x1p-1000 (that is, 2^-1000 ≈ 9.332636185032189 × 10^-302) is far below DBL_EPSILON = 0x1p-52 (that is, 2^-52 ≈ 2.220446049250313 × 10^-16).
1/x in this case is a floating-point number, so the reciprocal is computed exactly in floating-point arithmetic; there is no rounding at all.
The largest floating-point number below 1/x in this case is 0x1.fffffffffffffp-1001, or 2^-1000 - 2^-1053.
DBL_EPSILON (2^-52) is not the smallest floating-point number (2^-1074), or even the smallest normal floating-point number (2^-1022).
Rather, DBL_EPSILON is the distance from 1 to the next larger floating-point number, 1 + 2^-52, sometimes written ulp(1) to indicate that it is the magnitude of the least significant digit, or unit in the last place, in the floating-point representation of 1.
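A one-line check makes that definition concrete (a sketch; the result assumes IEEE 754 binary64):

#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    // The gap between 1.0 and the next larger double is exactly DBL_EPSILON, i.e. ulp(1).
    printf("%d\n", std::nextafter(1.0, 2.0) - 1.0 == DBL_EPSILON);   // prints 1
}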
In case it is not guaranteed, how can I calculate the biggest value of type double that ensures that DBL_EPSILON is less than the real number 1/x?
That would be 1/DBL_EPSILON - 1, or 2^52 - 1.
But what do you want this number for?
Why are you trying to use DBL_EPSILON here?
The inverse of positive infinity is, of course, smaller than any positive rational number. Beyond that, even the largest finite floating point number has a multiplicative inverse well above the smallest representable floating point number of equivalent width, thanks to denormal numbers.
If a floating-point number is representable in my machine, will its inverse be representable in my machine?
No. There is no specification that 1.0/DBL_MIN <= DBL_MAX and 1.0/DBL_MAX >= DBL_MIN both must be true. One is usually true. With sub-normals, 1.0/sub-normal is often > DBL_MAX.
Given some initialized object x of type double that stores a positive value, I want to find the smallest double v such that 0 <= v < 1/x.
Such a v exists, since v could be zero, unless for some large x like DBL_MAX, 1.0/x is itself zero. That is a possibility. With sub-normals, that is rarely the case, as 1.0/DBL_MAX is representable as a value greater than 0.
DBL_EPSILON has little to do with the above. OP's issues are more dependent on DBL_MAX, DBL_MIN and whether the double supports sub-normals. Many FP encodings are roughly balanced, where 1/DBL_MIN is somewhere near DBL_MAX, yet C does not require that symmetry.
No. Floating point numbers are balanced around 1.0 to minimize the effect of calculating inverses, but this balance is not exact: the middle value of the exponent field (0x3ff for doubles) gives the same number of powers of two above and below 1.0, but the all-ones exponent value (0x7ff for doubles) is reserved for infinities and NaNs, while the all-zeros value is reserved for denormals (also called subnormals). These values are not normalized (and some architectures don't even implement them), but where they are implemented, they add as many extra powers of 2 as the width of the mantissa (although with lower precision) below the normalized ones, in the range of the negative exponents. This means that you have a set of numbers, quite close to zero, for which, when you compute their inverses, you always get infinity.
For doubles you have 52 more powers of two, or around 15 more powers of ten. For floats, this is around 7 more powers of ten.
But this also means that if you calculate the inverse of a large number you'll always get a number different than zero.
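A quick demonstration of that asymmetry (a sketch; the printed values assume IEEE 754 binary64 and may be formatted slightly differently by your platform):

#include <cfloat>
#include <cstdio>

int main() {
    printf("1/DBL_MAX = %g\n", 1.0 / DBL_MAX);            // a tiny subnormal, not 0
    printf("1/DBL_MIN = %g\n", 1.0 / DBL_MIN);            // finite (2^1022), no overflow
    printf("1/smallest subnormal = %g\n", 1.0 / 0x1p-1074);  // overflows to inf
}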

How to divide 2^{128} by an odd number in c++?

How can we divide 2^{128} by an odd number (an unsigned 64-bit integer) -- taking the floor of the quotient, more precisely -- without using an arbitrary multi-precision arithmetic library?
The problem is that even with gcc, 2^{128} cannot be expressed.
So I'm considering creating 192-bit integer type.
But, I have no idea how to do that (especially subtraction part).
I want the result of the floor to be an unsigned number.
For any odd d > 1, UINT128_MAX / d equals floor(2^128 / d).
This is because 2^128 / d must have a remainder, as the only factors of 2^128 are powers of two (including 1), so the odd d (excluding 1) cannot be a divisor. Therefore, 2^128 / d and (2^128 - 1) / d have the same integral quotient, and UINT128_MAX is 2^128 - 1.
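With GCC or Clang this can be computed directly using the unsigned __int128 extension (a sketch; the function name is illustrative, and d must be odd and greater than 1):

#include <cstdint>

// floor(2^128 / d) for odd d > 1, per the argument above.
unsigned __int128 floor_div_2_128(uint64_t d) {
    // ~(unsigned __int128)0 is 2^128 - 1, i.e. UINT128_MAX.
    return ~(unsigned __int128)0 / d;
}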

converting a decimal into binary in the most optimal way possible

What is the most optimal way to convert a decimal number into its binary form, i.e. with the best time complexity?
Normally, to convert a decimal number into binary, we keep on dividing the number by 2 and storing its remainders. But this would take a really long time if the number in decimal form is very large. The time complexity in this case would turn out to be O(log n).
So I want to know if there is any approach other than this that can do my job with better time complexity?
The problem is essentially that of evaluating a polynomial using binary integer arithmetic, so the result is in binary. Suppose
p(x) = a₀xⁿ + a₁xⁿ⁻¹ + ⋯ + aₙ₋₁x + aₙ
Now if a₀, a₁, a₂, ⋯, aₙ are the decimal digits of the number (each implicitly represented by binary numbers in the range 0 through 9) and we evaluate p at x = 10 (implicitly in binary) then the result is the binary number that the decimal digit sequence represents.
The best way to evaluate a polynomial at a single point given also the coefficients as input is Horner's Rule. This amounts to rewriting p(x) in a way easy to evaluate as follows.
p(x) = ((⋯((a₀x + a₁)x + a₂)x + ⋯)x + aₙ₋₁)x + aₙ
This gives the following algorithm. Here the array a[] contains the digits of the decimal number, left to right, each represented as a small integer in the range 0 through 9. Pseudocode for an array indexed from 0:
toNumber(a[])
    const x = 10
    total = a[0]
    for i = 1 to a.length - 1 do
        total *= x    // multiply the total by x = 10
        total += a[i] // add on the next digit
    return total
Running this code on a machine where numbers are represented in binary gives a binary result. Since that's what we have on this planet, this gives you what you want.
If you want to get the actual bits, now you can use efficient binary operations to get them from the binary number you have constructed, for example, mask and shift.
The complexity of this is linear in the number of digits, because arithmetic operations on machine integers are constant time, and it does two operations per digit (apart from the first). This is a tiny amount of work, so this is supremely fast.
If you need very large numbers, bigger than 64 bits, just use some kind of large integer. Implemented properly, this will keep the cost of arithmetic down.
To avoid as much large integer arithmetic as possible if your large integer implementation needs it, break the array of digits into slices of 19 digits, with the leftmost slice potentially having fewer. 19 is the maximum number of digits that can be converted into an (unsigned) 64-bit integer.
Convert each block as above into binary without using large integers and make a new array of those 64-bit values in left to right order. These are now the coefficients of a polynomial to be evaluated at x = 10¹⁹. The same algorithm as above can then be used, only with large integer arithmetic operations and with 10 replaced by 10¹⁹, which should be computed with large integer arithmetic in advance of its use.
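As an illustration of the slicing step, here is a short C++ sketch (the helper names are made up for this example): each 19-digit block is converted with plain 64-bit Horner's rule, and the resulting values become the coefficients of the polynomial to be evaluated at x = 10¹⁹ with a big-integer type:

#include <cstdint>
#include <string>
#include <vector>

// Convert one block of at most 19 decimal digits with 64-bit Horner's rule.
uint64_t block_to_u64(const std::string& digits) {
    uint64_t total = 0;
    for (char c : digits)
        total = total * 10 + static_cast<uint64_t>(c - '0');
    return total;
}

// Split a decimal string into 19-digit slices (the leftmost slice may be
// shorter) and return the slice values left to right: the coefficients of a
// polynomial to be evaluated at x = 10^19 using large integer arithmetic.
std::vector<uint64_t> to_coefficients(const std::string& decimal) {
    std::vector<uint64_t> coeffs;
    std::size_t first = decimal.size() % 19;   // length of the leftmost slice
    if (first != 0)
        coeffs.push_back(block_to_u64(decimal.substr(0, first)));
    for (std::size_t i = first; i < decimal.size(); i += 19)
        coeffs.push_back(block_to_u64(decimal.substr(i, 19)));
    return coeffs;
}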

0 + 0 + 0... + 0 != 0

I have a program that is finding paths in a graph and outputting the cumulative weight. All of the edges in the graph have an individual weight of 0 to 100 in the form of a float with at most 2 decimal places.
On Windows/Visual Studio 2010, for a particular path consisting of edges with 0 weight, it outputs the correct total weight of 0. However on Linux/GCC the program is saying the path has a weight of 2.35503e-38. I have had plenty of experiences with crazy bugs caused by floats, but when would 0 + 0 ever equal anything other than 0?
The only thing I can think of that is causing this is the program does treat some of the weights as integers and uses implicit coercion to add them to the total. But 0 + 0.0f still equals 0.0f!
As a quick fix I reduce the total to 0 when it is less than 0.00001, and that is sufficient for my needs, for now. But what voodoo causes this?
NOTE: I am 100% confident that none of the weights in the graph exceed the range I mentioned and that all of the weights in this particular path are all 0.
EDIT: To elaborate, I have tried both reading the weights from a file and setting them in the code manually as equal to 0.0f. No other operation is being performed on them other than adding them to the total.
Because it's an IEEE floating point number, and it's not exactly equal to zero.
http://www.cygnus-software.com/papers/comparingfloats/comparingfloats.htm
[...] in the form of a float with at most 2 decimal places.
There is no such thing as a float with at most 2 decimal places. Floats are almost always represented as a binary floating point number (fractional binary mantissa and integer exponent). So many (most) numbers with 2 decimal places cannot be represented exactly.
For example, 0.20f may look like an innocent and round fraction, but
printf("%.40f\n", 0.20f);
will print: 0.2000000029802322387695312500000000000000.
See, it does not have 2 decimal places, it has 26!!!
Naturally, for most practical uses the difference is negligible. But if you do some calculations you may end up increasing the rounding error and making it visible, particularly around 0.
It may be that your floats containing values of "0.0f" aren't actually 0.0f (bit representation 0x00000000), but a very, very small number that evaluates to about 0.0. Because of the way the IEEE 754 spec defines float representations, if you have, for example, a very small mantissa and a 0 exponent, the value, while not equal to absolute 0, will round to 0 when displayed. However, if you add these numbers together a sufficient number of times, the very small amounts will accumulate into a value that eventually becomes non-zero.
Here is an example case which gives the illusion of 0 being non-zero:
float f = 0.1f / 1000000000;                       // 1e-10: tiny but nonzero
printf("%f, %08x\n", f, *(unsigned int *)&f);      // prints 0.000000 with a nonzero bit pattern
float f2 = f * 10000;                              // the "invisible" value becomes visible again
printf("%f, %08x\n", f2, *(unsigned int *)&f2);    // prints 0.000001
If you are assigning literals to your variables and adding them, though, it is possible that the compiler is not translating 0 into 0x0 in memory. If it is, and this is still happening, then it's also possible that your CPU hardware has a bug relating to turning 0s into non-zeros during ALU operations that may have squeaked by its validation efforts.
However, it is good to remember that IEEE floating point is only an approximation of real-number arithmetic, not an exact representation of every value, so any floating-point operation is bound to carry some amount of error.