Proper way to generate a random float given a binary random number generator? - c++

Let's say we have a binary random number generator, int r(); that will return a zero or a one, each with probability 0.5.
I looked at Boost.Random, and they generate, say, 32 bits and do something like this (pseudocode):
x = double(rand_int32());
return min + x / (2^32) * (max - min);
I have some serious doubts about this. A double has 53 bits of mantissa, and 32 bits can never properly generate a fully random mantissa, among other things such as rounding errors, etc.
What would be a fast way to create a uniformly distributed float or double in the half-open range [min, max), assuming IEEE754? The emphasis here lies on correctness of distribution, not speed.
To properly define correct, the correct distribution would be equal to the one that we would get if we would take an infinitely precise uniformly distributed random number generator and for each number we would round to the nearest IEEE754 representation, if that representation would still be within [min, max), otherwise the number would not count for the distribution.
P.S.: I would be interested in correct solutions for open ranges as well.

AFAIK, the correct (and probably also fastest) way is to first create a 64 bit unsigned integer where the 52 fraction bits are random bits, and the exponent is 1023, which if type punned into a (IEEE 754) double will be a uniformly distributed random value in the range [1.0, 2.0). So the last step is to subtract 1.0 from that, resulting in a uniformly distributed random double value in the range [0.0, 1.0).
In pseudo code:
rndDouble = bitCastUInt64ToDouble(1023 << 52 | rndUInt64 & 0xfffffffffffff) - 1.0
This method is mentioned here:
(See "Generating uniform doubles in the unit interval")
EDIT: The recommended method has since changed to:
(x >> 11) * (1. / (UINT64_C(1) << 53))
See above link for details.
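Both variants can be sketched in C++ roughly as follows. The function names are made up for illustration, and x is assumed to already hold 64 uniformly random bits (for example, 64 results of r() packed into an integer). The type pun goes through std::memcpy to stay within defined behaviour, and double is assumed to be IEEE 754 binary64.

```cpp
#include <cstdint>
#include <cstring>

// Variant 1: put 52 random bits in the fraction field, force the biased
// exponent to 1023 (a value in [1.0, 2.0)), then subtract 1.0.
double unit_double_via_exponent(std::uint64_t x) {
    const std::uint64_t bits =
        (UINT64_C(1023) << 52) | (x & UINT64_C(0xFFFFFFFFFFFFF));
    double d;
    std::memcpy(&d, &bits, sizeof d);  // well-defined type pun
    return d - 1.0;                    // a multiple of 2^-52 in [0.0, 1.0)
}

// Variant 2: keep the top 53 bits and scale them by 2^-53.
double unit_double_via_scaling(std::uint64_t x) {
    return (x >> 11) * (1.0 / (UINT64_C(1) << 53));  // a multiple of 2^-53 in [0.0, 1.0)
}
```

The second variant uses one more random bit than the first, so its grid of possible results is twice as fine.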

Here is a correct approach with no attempt at efficiency.
We start with a bignum class, and then a rational wrapper of said bignums.
We produce a range "sufficiently bigger than" our [min, max) range, so that rounding of our smaller_min and bigger_max produces floating point values outside that range, in our rational built on the bignum.
Now we subdivide the range into two parts perfectly down the middle (which we can do, as we have a rational bignum system). We pick one of the two parts at random.
If, after rounding, the top and bottom of the picked range would be (A) outside of [min, max) (on the same side, mind you!) you reject and restart from the beginning.
If (B) the top and bottom of your range rounds to the same double (or float if you are returning a float), you are done, and you return this value.
Otherwise (C) you recurse on this new, smaller range (subdivide, pick randomly, test).
There are no guarantees that this procedure halts, because you can either constantly drill down to the "edge" between two rounding doubles, or you could constantly pick values outside of the [min, max) range. The probability of never halting is, however, zero (assuming a good random number generator, and a [min, max) of non-zero size).
This also works for (min, max), or even picking a number in the rounded sufficiently fat Cantor set. So long as the measure of the valid range of reals that round to the correct floating point values is non zero, and the range has a compact support, this procedure can be run and has a probability of 100% of terminating, but no hard upper bound on the time it takes can be made.

The problem here is that in IEEE754 the doubles which may be represented are not equi-distributed. That is, if we have a generator generating real numbers, say in (0,1) and then map to IEEE754 representable numbers, the result will not be equi-distributed.
Thus, we have to define "equi-distribution". That said, assuming that each IEEE754 number is just a representative for the probability of lying in the interval defined by the IEEE754 rounding, the procedure of first generating equi-distributed "numbers" and then rounding to IEEE754 will generate (by definition) an "equi-distribution" of IEEE754 numbers.
Hence, I believe that the above formula will become arbitrarily close to such a distribution if we just choose the accuracy high enough. If we restrict the problem to finding a number in [0,1), this means restricting to the set of denormalized IEEE 754 numbers, which are in one-to-one correspondence with 53-bit integers. Thus it should be fast and correct to generate just the mantissa with a 53-bit binary random number generator.
IEEE 754 arithmetic is always "arithmetic at infinite precision followed by rounding", i.e. the IEEE754 number representing a*b is the one closest to a*b (put differently, you can think of a*b as calculated at infinite precision, then rounded to the closest IEEE754 number). Hence I believe that min + (max-min) * x, where x is a denormalized number, is a feasible approach.
(Note: As is clear from my comment, I was at first not aware that you were pointing to the case with min and max different from 0 and 1. The denormalized numbers have the property that they are evenly spaced. Hence you get the equi-distribution by mapping the 53 bits to the mantissa. Next you can use the floating point arithmetic, due to the fact that it is correct up to machine precision. If you use the reverse mapping you will recover the equi-distribution.)
See this question for another aspect of this problem: Scaling Int uniform random range into Double one
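A minimal sketch of the 53-bit idea, assuming the binary generator is passed in as a callable that returns 0 or 1 (standing in for the question's r()). The function names are hypothetical. The redraw loop in uniform_between is my own addition, not part of the answer above: the affine map rounds, so without it the result can land exactly on hi.

```cpp
#include <cmath>
#include <cstdint>

// Collect 53 bits from a binary source and scale the integer by 2^-53.
template <class BitSource>
double unit_double_from_bits(BitSource&& r) {
    std::uint64_t m = 0;
    for (int i = 0; i < 53; ++i)
        m = (m << 1) | static_cast<std::uint64_t>(r() & 1);
    return std::ldexp(static_cast<double>(m), -53);  // exact, since m < 2^53
}

// Affine map onto [lo, hi), assuming lo < hi. The multiplication and
// addition round, so a value equal to hi is possible; redraw if so.
template <class BitSource>
double uniform_between(double lo, double hi, BitSource&& r) {
    double v;
    do {
        v = lo + (hi - lo) * unit_double_from_bits(r);
    } while (v >= hi);
    return v;
}
```

Note that the scaled values are no longer evenly spaced once the range crosses a power of two, so this is only an approximation of the "round an infinitely precise uniform number" definition given in the question.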

There's a really good talk by S.T.L. from this year’s Going Native conference that explains why you should use the standard distributions whenever possible. In short, hand-rolled code tends to be of laughably poor quality (think std::rand() % 100), or have more subtle uniformity flaws, such as in (std::rand() * 1.0 / RAND_MAX) * 99, which is the example given in the talk and is a special case of the code posted in the question.
EDIT: I took a look at libstdc++’s implementation of std::uniform_real_distribution, and this is what I found:
The implementation produces a number in the range [dist_min, dist_max) by using a simple linear transformation from some number produced in the range [0, 1). It generates this source number using std::generate_canonical, the implementation of which may be found here (at the end of the file). std::generate_canonical determines the number of times (denoted as k) the range of the distribution, expressed as an integer and denoted here as r*, will fit in the mantissa of the target type. What it then does is essentially to generate one number in [0, r) for each r-sized segment of the mantissa and, using arithmetic, populate each segment accordingly. The formula for the resulting value may be expressed as
Σ(i=1, k, X_i/(r^i))
where each X_i is an independent stochastic variable in [0, r). Each division by the range is equivalent to a shift by the number of bits used to represent it (i.e., log2(r)), and so fills the corresponding mantissa segment. This way, the whole of the precision of the target type is used, and since the range of the result is [0, 1), the exponent remains 0** (modulo bias) and you don’t get the uniformity issues you have when you start messing with the exponent.
I would not trust implicitly that this method is cryptographically secure (and I have suspicions about possible off-by-one errors in the calculation of the size of r), but I imagine it is significantly more reliable in terms of uniformity than the Boost implementation you posted, and definitely better than fiddling about with std::rand.
It may be worth noting that the Boost code is in fact a degenerate case of this algorithm where k = 1, meaning that it is equivalent if the input range requires at least 23 bits to represent its size (IEEE 754 single-precision) or at least 52 bits (double-precision). This means a minimum range of ~8.4 million or ~4.5e15, respectively. In light of this information, I don’t think that if you’re using a binary generator, the Boost implementation is quite going to cut it.
After a brief look at libc++’s implementation, it looks like they are using what is the same algorithm, implemented slightly differently.
(*) r is actually the range of the input plus one. This allows using the max value of the urng as valid input.
(**) Strictly speaking, the encoded exponent is not 0, as IEEE 754 encodes an implicit leading 1 before the radix of the significand. Conceptually, however, this is irrelevant to this algorithm.
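To make the accumulate-then-divide idea concrete, here is a hand-rolled sketch for the special case of a 32-bit source (r = 2^32) feeding a double, which needs k = 2 draws. This is not the libstdc++ code, only an illustration of the same structure; canonical_from_32bit and its parameter g are names I made up.

```cpp
#include <cstdint>

// Combine two draws from a 32-bit source into one double in the unit
// interval. `g` is any callable returning a uniformly distributed
// std::uint32_t.
template <class Gen32>
double canonical_from_32bit(Gen32&& g) {
    const double r = 4294967296.0;  // 2^32, the size of the source's range
    double sum = 0.0;
    double scale = 1.0;
    for (int i = 0; i < 2; ++i) {
        sum += static_cast<double>(g()) * scale;  // X_i * r^i
        scale *= r;
    }
    return sum / scale;  // (X_0 + X_1 * r) / r^2 = X_1/r + X_0/r^2
}
```

One thing this sketch makes visible: two draws carry 64 bits, more than a double's 53, so the sum gets rounded. If both draws are at their maximum, the sum rounds up to 2^64 and the function returns exactly 1.0, which is outside [0, 1). A careful version would have to guard against that.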


Is it possible in floating point to return 0.0 subtracting two different values?

Due to the "approximate" nature of floating point, it's possible that two different sets of values return the same value.
#include <iostream>
int main() {
double a = 0.5;
double b = 0.5;
double c = 0.49999999999999994;
std::cout << a + b << std::endl; // output "exact" 1.0
std::cout << a + c << std::endl; // output "exact" 1.0
}
But is it also possible with subtraction? I mean: are there two sets of different values (sharing one value between them) that return 0.0?
i.e. a - b = 0.0 and a - c = 0.0, given some sets of a,b and a,c with b != c??
The IEEE-754 standard was deliberately designed so that subtracting two values produces zero if and only if the two values are equal, except that subtracting an infinity from itself produces NaN and/or an exception.
Unfortunately, C++ does not require conformance to IEEE-754, and many C++ implementations use some features of IEEE-754 but do not fully conform.
A not uncommon behavior is to “flush” subnormal results to zero. This is part of a hardware design to avoid the burden of handling subnormal results correctly. If this behavior is in effect, the subtraction of two very small but different numbers can yield zero. (The numbers would have to be near the bottom of the normal range, having some significand bits in the subnormal range.)
Sometimes systems with this behavior may offer a way of disabling it.
Another behavior to beware of is that C++ does not require floating-point operations to be carried out precisely as written. It allows “excess precision” to be used in intermediate operations and “contractions” of some expressions. For example, a*b - c*d may be computed by using one operation that multiplies a and b and then another that multiplies c and d and subtracts the result from the previously computed a*b. This latter operation acts as if c*d were computed with infinite precision rather than rounded to the nominal floating-point format. In this case, a*b - c*d may produce a non-zero result even though a*b == c*d evaluates to true.
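The contracted evaluation can be imitated by hand with std::fma, which makes the effect reproducible regardless of compiler settings. This is a sketch of what a compiler may do, not a claim about what any particular compiler does; the function name is made up.

```cpp
#include <cmath>

// Mimics the contracted evaluation of a*b - c*d: a*b is rounded to double
// first, then c*d is multiplied and subtracted in one fused operation, so
// c*d itself is never rounded.
double contracted_difference(double a, double b, double c, double d) {
    const double ab = a * b;     // rounded product
    return std::fma(-c, d, ab);  // ab - c*d with a single rounding
}
```

For example, with x = 1 + 2^-30 the exact square is 1 + 2^-29 + 2^-60, which rounds to 1 + 2^-29. So x*x == x*x is trivially true, yet contracted_difference(x, x, x, x) returns -2^-60 rather than zero.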
Some C++ implementations offer ways to disable or limit such behavior.
The gradual underflow feature of the IEEE floating point standard prevents this. Gradual underflow is achieved by subnormal (denormal) numbers, which are spaced evenly (as opposed to logarithmically, like normal floating point) and located between the smallest negative and positive normal numbers with zeroes in the middle. As they are evenly spaced, the addition of two subnormal numbers of differing signedness (i.e. subtraction towards zero) is exact and therefore won't reproduce what you ask. The smallest subnormal is (much) less than the smallest distance between normal numbers, and therefore any subtraction between unequal normal numbers is going to be closer to a subnormal than zero.
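As a small check of this in the default (IEEE-conforming) mode, the following sketch subtracts the two smallest positive normal doubles. It assumes flush-to-zero is not enabled; the function name is made up.

```cpp
#include <cmath>
#include <limits>

// Difference between the two smallest positive normal doubles.
// With gradual underflow this is the smallest subnormal, not zero.
double gap_above_smallest_normal() {
    const double d = std::numeric_limits<double>::min();  // smallest normal
    const double n = std::nextafter(d, 1.0);              // next larger double
    return n - d;
}
```

Under these assumptions the result compares equal to std::numeric_limits<double>::denorm_min().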
If you disable IEEE conformance using a special denormals-are-zero (DAZ) or flush-to-zero (FTZ) mode of the CPU, then indeed you could subtract two small, close numbers which would otherwise result in a subnormal number, which would be treated as zero due to the mode of the CPU. A working example (Linux):
// needs <xmmintrin.h>, <cmath>, <limits> and <iostream>
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON); // system specific
double d = std::numeric_limits<double>::min(); // smallest normal
double n = std::nextafter(d, 10.0); // second smallest normal
double z = d - n; // a negative subnormal (flushed to zero)
std::cout << (z == 0) << '\n' << (d == n);
This should print 1 and then 0, each on its own line.
First 1 indicates that result of subtraction is exactly zero, while the second 0 indicates that the operands are not equal.
Unfortunately the answer is dependent on your implementation and the way it is configured. C and C++ don't demand any specific floating point representation or behavior. Most implementations use the IEEE 754 representations, but they don't always precisely implement IEEE 754 arithmetic behaviour.
To understand the answer to this question we must first understand how floating point numbers work.
A naive floating point representation would have an exponent, a sign and a mantissa. Its value would be
(-1)^s * 2^(e - e0) * (m / 2^M)
s is the sign bit, with a value of 0 or 1.
e is the exponent field
e0 is the exponent bias. It essentially sets the overall range of the floating point number.
M is the number of mantissa bits.
m is the mantissa with a value between 0 and 2^M - 1
This is similar in concept to the scientific notation you were taught in school.
However this format has many different representations of the same number, nearly a whole bit's worth of encoding space is wasted. To fix this we can add an "implicit 1" to the mantissa.
(-1)^s * 2^(e - e0) * (1 + m / 2^M)
This format has exactly one representation of each number. However there is a problem with it, it can't represent zero or numbers close to zero.
To fix this IEEE floating point reserves a couple of exponent values for special cases. An exponent value of zero is reserved for representing small numbers known as subnormals. The highest possible exponent value is reserved for NaNs and infinities (which I will ignore in this post since they aren't relevant here). So the definition now becomes.
(-1)^s * 2^(1 - e0) * (m / 2^M) when e = 0
(-1)^s * 2^(e - e0) * (1 + m / 2^M) when e > 0 and e < 2^E - 1, where E is the number of exponent bits
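To see the two cases in action, here is a sketch that evaluates this definition for a 64-bit pattern, assuming the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits, M = 52, e0 = 1023). The function name is made up, and infinities and NaNs are ignored, as in the text.

```cpp
#include <cmath>
#include <cstdint>

// Evaluate the two-case definition for a binary64 bit pattern.
// Patterns with e = 2047 (infinities, NaNs) are not handled.
double decode_binary64(std::uint64_t bits) {
    const int M = 52;     // mantissa bits
    const int e0 = 1023;  // exponent bias
    const std::uint64_t s = bits >> 63;
    const int e = static_cast<int>((bits >> M) & 0x7FF);
    const std::uint64_t m = bits & ((UINT64_C(1) << M) - 1);
    const double frac = std::ldexp(static_cast<double>(m), -M);  // m / 2^M
    const double magnitude = (e == 0)
        ? std::ldexp(frac, 1 - e0)         // subnormal: no implicit 1
        : std::ldexp(1.0 + frac, e - e0);  // normal: implicit leading 1
    return s ? -magnitude : magnitude;
}
```

For instance, the pattern 0x4008000000000000 decodes to 3.0, and the pattern 1 (all zero except the last mantissa bit) decodes to the smallest subnormal, 2^-1074.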
With this representation smaller numbers always have a step size that is less than or equal to that for larger ones. So provided the result of the subtraction is smaller in magnitude than both operands it can be represented exactly. In particular results close to but not exactly zero can be represented exactly.
This does not apply if the result is larger in magnitude than one or both of the operands, for example subtracting a small value from a large value or subtracting two values of opposite signs. In those cases the result may be imprecise but it clearly can't be zero.
Unfortunately FPU designers cut corners. Rather than including the logic to handle subnormal numbers quickly and correctly they either did not support (non-zero) subnormals at all or provided slow support for subnormals and then gave the user the option to turn it on and off. If support for proper subnormal calculations is not present or is disabled and the number is too small to represent in normalized form then it will be "flushed to zero".
So in the real world under some systems and configurations subtracting two different very-small floating point numbers can result in a zero answer.
Excluding funny numbers like NAN, I don't think it's possible.
Let's say a and b are normal finite IEEE 754 floats, and |a - b| is less than or equal to both |a| and |b| (otherwise it's clearly not zero).
That means the exponent is <= both a's and b's, and so the absolute precision is at least as high, which makes the subtraction exactly representable. That means that if a - b == 0, then it is exactly zero, so a == b.

IEEE754 floating point subtraction precision loss

Here is the subtraction
First number
Decimal 3.0000002
Hexadecimal 0x40400001
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0001]
subtract second number:
Decimal 3.000000
Hexadecimal 0x40400000
Binary: Sign[0], Exponent[1000_0000], Mantissa[100_0000_0000_0000_0000_0000]
In this situation the exponents are already the same, so we just need to subtract the mantissas. We know that in IEEE754 there is a hidden 1 bit in front of the mantissa. Therefore, the result mantissa should be:
Mantissa_1[1100_0000_0000_0000_0000_0001] - Mantissa_2[1100_0000_0000_0000_0000_0000]
which is equal to
Mantissa_Rst = [0000_0000_0000_0000_0000_0001]
But this number is not normalized, because the first (hidden) bit is not 1. Thus we shift Mantissa_Rst left 23 times, and the exponent is decreased by 23 at the same time.
Then we have the result value
Hexadecimal 0x34800000
Binary: Sign[0], Exponent[0110_1001], Mantissa[000_0000_0000_0000_0000_0000].
32 bits total, no rounding needed.
Notice that in the mantissa region, there still is a hidden 1.
If my calculations are correct, then converting the result to decimal gives 0.00000023841858. Compared with the real result 0.0000002, I still think that is not very precise.
So the question is: are my calculations wrong, or is this a real situation that happens all the time in computers?
The inaccuracy already starts with your input. 3.0000002 is a fraction with a prime factor of five in the denominator, so its "decimal" expansion in base 2 is periodic. No amount of mantissa bits will suffice to represent it exactly. The float you give actually has the value 3.0000002384185791015625 (this is exact). Yes, this happens all the time.
Don't despair, though! Base ten has the same problem (for example 1/3). It isn't a problem. Well, it is for some people, but luckily there are other number types available for their needs. Floating point numbers have many advantages, and slight rounding error is irrelevant for many applications, for example when not even your inputs are perfectly accurate measurements of what you're interested in (a lot of scientific computing and simulation). Also remember that 64-bit floats exist. Additionally, the error is bounded: with the best possible rounding, your result will be within 0.5 units in the last place of the infinite-precision result. For a 32-bit float of the magnitude of your example, this is 2^-23, or approximately 1.2 * 10^-7. This gets worse and worse as you do additional operations that have to round, but with careful numeric analysis and the right algorithms, you can get a lot of mileage out of them.
Whenever x/2 ≤ y ≤ 2x, the calculation x - y is exact which means there is no rounding error whatsoever. That is also the case in your example.
You just made the wrong assumption that you could have a floating point number that is equal to 3.0000002. You can't. The type "float" can only ever represent integers less than 2^24, multiplied by a power of two. 3.0000002 is not such a number, therefore it is rounded to the nearest floating point number, which is closer to 3.00000023841858. Subtracting 3 calculates the difference exactly and gives a result close to 0.00000023841858.
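This is easy to confirm in code. The sketch below assumes float is IEEE 754 binary32 and that float expressions are evaluated in float precision; the function name is made up.

```cpp
#include <cmath>

// 3.0000002f is rounded to the nearest float, which is 3 + 2^-22.
// The subtraction that follows is exact.
float difference_from_three() {
    const float a = 3.0000002f;  // stored as 3.0000002384185791015625
    const float b = 3.0f;
    return a - b;                // exactly 2^-22, about 2.3841858e-07
}
```

The rounding happens when the literal is converted to float, not in the subtraction.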

controlling overflow and loss in precision while multiplying doubles

I have a large number of floating point numbers (~10,000 numbers) , each having 6 digits after decimal. Now, the multiplication of all these numbers would yield about 60,000 digits. But the double range is for 15 digits only. The output product has to have 6 digits of precision after decimal.
my approach:
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
I also thought of multiplying these numbers using arrays to store their digits and later converting them to decimal. But this also appears cumbersome and may not yield correct result.
Is there an alternate easier way to do this?
I thought of multiplying these numbers by 10^6 and then multiplying them and later dividing them by 10^12.
This would only achieve further loss of accuracy. In floating-point, large numbers are represented approximately just like small numbers are. Making your numbers bigger only means you are doing 19999 multiplications (and one division) instead of 9999 multiplications; it does not magically give you more significant digits.
This manipulation would only be useful if it prevented the partial product from reaching into subnormal territory (and in this case, multiplying by a power of two would be recommended to avoid loss of accuracy due to the multiplication). There is no indication in your question that this happens, no example data set, no code, so it is only possible to provide the generic explanation below:
Floating-point multiplication is very well behaved when it does not underflow or overflow. At the first order, you can assume that relative inaccuracies add up, so that multiplying 10000 values produces a result that's 9999 machine epsilons away from the mathematical result in relative terms(*).
The solution to your problem as stated (no code, no data set) is to use a wider floating-point type for the intermediate multiplications. This solves both the problems of underflow or overflow and leaves you with a relative accuracy on the end result such that once rounded to the original floating-point type, the product is wrong by at most one ULP.
Depending on your programming language, such a wider floating-point type may be available as long double. For 10000 multiplications, the 80-bit “extended double” format, widely available in x86 processors, would improve things dramatically and you would barely see any performance difference, as long as your compiler does map this 80-bit format to a floating-point type. Otherwise, you would have to use a software implementation such as MPFR's arbitrary-precision floating-point format or the double-double format.
(*) In reality, relative inaccuracies compound, so that the real bound on the relative error is more like (1 + ε)^9999 - 1 where ε is the machine epsilon. Also, in reality, relative errors often cancel each other, so that you can expect the actual relative error to grow like the square root of the theoretical maximum error.
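A minimal sketch of the wider-type approach in C++, assuming long double really is wider than double on your platform (for example the x87 80-bit format). Where the two types are identical, this changes nothing. The function name is made up.

```cpp
#include <vector>

// Multiply in long double and round to double once at the end.
double product_wide(const std::vector<double>& values) {
    long double acc = 1.0L;  // wider accumulator for the partial products
    for (double v : values)
        acc *= v;
    return static_cast<double>(acc);
}
```

With the 80-bit format the accumulator has a 15-bit exponent, so partial products that would overflow or underflow a double can still be carried until the final conversion.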

How to correctly normalize a floating point value in C++?

Maybe I don't understand the IEEE754 standard that much, but given a set of floating point values that are float or double, for example :
56.543f 3238.124124f 121.3f ...
you are able to convert them in values ranging from 0 to 1, so you normalize them, by taking an appropriate common factor while considering what is the maximum value and the minimum value in the set.
Now my point is that in this transformation I need a much higher precision for the set of destination that ranges from 0 to 1 if compared to the level of precision that I need in the first one, especially if the values in the first set are covering a wide range of numerical values ( really big and really small values ).
How can the float or double type (or the IEEE 754 standard, if you want) handle this situation while providing more precision for the second set of values, knowing that I will basically not need an integer part?
Or does it not handle this at all, so that I need fixed point math with a totally different type?
Floating point numbers are stored in a format similar to scientific notation. Internally, they align the leading 1 of the binary representation to the top of the significand. Each value is carried with the same number of binary digits of precision relative to its own magnitude.
When you compress your set of floating point values to the range 0..1, the only precision loss you will get will be due to the rounding that occurs in the various steps of the process.
If you're merely compressing by scaling, you will lose only a small amount of precision near the LSBs of the mantissa (around 1 or 2 ulp, where ulp means "units in the last place").
If you also need to shift your data, then things get trickier. If your data is all positive, then subtracting off the smallest number will not damage anything. But, if your data is a mixture of positive and negative data, then some of your values near zero may suffer a loss in precision.
If you do all the arithmetic at double precision, you'll carry 53 bits of precision through the calculation. If your precision needs fit within that (which likely they do), then you'll be fine. Otherwise, the exact numerical performance will depend on the distribution of your data.
Single and double IEEE floats have a format where the exponent and fraction parts have fixed bit-width. So this is not possible (i.e. you will always have unused bits if you only store values between 0 and 1). (See:
Are you sure the 52-bit wide fraction part of a double is not precise enough?
Edit: If you use the whole range of the floating format, you will lose precision when normalizing the values. The roundings can be off and enough small values will become 0. Unless you know that this is a problem, don't worry. Otherwise you have to look up some other solution as mentioned in other answers.
Having binary floating point values (with an implicit leading one) expressed as
(1+fraction) * 2^exponent where fraction < 1
A division a/b is:
a/b = (1+fraction(a)) / (1+fraction(b)) * 2^(exponent(a) - exponent(b))
Hence division/multiplication has essentially no loss of precision.
A subtraction a-b is:
a-b = (1+fraction(a)) * 2^exponent(a) - (1+fraction(b)) * 2^exponent(b)
Hence a subtraction/addition might have a loss of precision (big - tiny == big) !
Clamping a value x in a range [min, max] to [0, 1]
(x - min) / (max - min)
will have precision issues if any subtraction has a loss of precision.
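As a sketch, the clamping formula could be applied to a set of floats like this, doing the arithmetic in double to keep the rounding in each step small. The function name is made up, and the input is assumed to be non-empty with at least two distinct values (otherwise max - min is zero).

```cpp
#include <algorithm>
#include <vector>

// Map each value to (x - min) / (max - min), computed in double.
std::vector<double> normalize(const std::vector<float>& xs) {
    const auto mm = std::minmax_element(xs.begin(), xs.end());
    const double lo = *mm.first;   // smallest input value
    const double hi = *mm.second;  // largest input value
    std::vector<double> out;
    out.reserve(xs.size());
    for (float x : xs)
        out.push_back((x - lo) / (hi - lo));
    return out;
}
```

Called with the values from the question (56.543f, 3238.124124f, 121.3f), the smallest maps to 0.0, the largest to 1.0, and the third lands in between.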
Answering your question:
Nothing is, choose a suitable representation (floating point, fraction, multi precision ...) for your algorithms and expected data.
If you have a selection of doubles and you normalize them to between 0.0 and 1.0, there are a number of sources of precision loss. They are all, however, much smaller than you suspect.
First, you will lose some precision in the arithmetic operations required to normalize them as rounding occurs. This is relatively small -- a bit or so per operation -- and usually relatively random.
Second, the exponent component will no longer be using the positive exponent possibility.
Third, as all the values are positive, the sign bit will also be wasted.
Fourth, if the input space does not include +inf or -inf or +NaN or -NaN or the like, those code points will also be wasted.
But, for the most part, you'll waste about 3 bits of information in a 64 bit double in your normalization, one of which being the kind of thing that is nearly unavoidable when you deal with finite-bit-width values.
Any 64 bit fixed point representation of the values from 0 to 1 will have far less "range" than doubles. A double can represent something on the order of 10^-300, while a 64 bit fixed point representation that includes 1.0 can only go as low as 10^-19 or so. (The 64 bit fixed point representation can represent 1 - 10^-19 as being distinct from 1, while the double cannot, but the 64 bit fixed point value can not represent anything smaller than 2^-64, while doubles can).
Some of the numbers above are approximate, and may depend on rounding/exact format.
For higher precision you can try
Note also, that for the numerical critical operations +,- there are special algorithms that minimize the numerical error introduced by the algorithm:

machine precision and max and min value of a double-precision type

(1) I have met several cases where epsilon is added to a non-negative variable to guarantee a nonzero value. So I wonder why not add the minimum value that the data type can represent instead of epsilon? What are the different problems that these two can solve?
(2) Also I notice that the inverse of the maximum value of a double precision type is bigger than its min value, and inverse of its min value is inf, way bigger than its max value. Is it useful to compute the reciprocals of its max and min values?
(3) For a very small positive number of double type, to compute its reciprocal, how small it is when its reciprocal starts to not make sense? Is it better to put an upper bound on the reciprocal? How much is the bound?
Thanks and regards
Epsilon is the smallest value that can be added to 1.0 and produce a result that's distinguishable from 1.0. As Poita_ implied, this is useful for dealing with rounding errors. The situation is pretty simple: a normal floating point number has precision that remains fixed, regardless of the magnitude of the number. To put that slightly differently, it always computes to the same number of significant digits. For example, a typical implementation of double will have around 15 significant digits (which translates to Epsilon = ~1e-15). If you're working with a number in the range of 1e-200, the smallest change it can represent will be around 1e-215. If you're working with a number in the range of 1e+200, the smallest change it can represent will be around 1e+185.
Meaningful use of Epsilon normally requires scaling it to the range of the numbers you're working with, and using that to define a range you're willing to accept as probably due to rounding errors, so if two numbers fall within that range, you assume they're probably really equal. For example, with Epsilon of 1e-15, you might decide to treat numbers that fall within 1e-14 of each other as equal (i.e. one significant digit has been lost to rounding).
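A minimal sketch of such a scaled comparison follows. The function name and the default factor of 10 are arbitrary choices for illustration; the right tolerance depends on how much rounding your computation accumulates.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Treat a and b as equal when they differ by no more than a few epsilons,
// scaled to the larger of the two magnitudes.
bool nearly_equal(double a, double b, double tolerance_factor = 10.0) {
    const double eps = std::numeric_limits<double>::epsilon();
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= tolerance_factor * eps * scale;
}
```

With this, nearly_equal(0.1 + 0.2, 0.3) is true even though the two values differ in the last bit, while 1e-200 and 2e-200 are still told apart because the tolerance shrinks along with the operands.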
The smallest number that can be represented will normally be dramatically smaller than that. With that same typical double, it's usually going to be around 1e-308. This would be equivalent to Epsilon if you were using fixed point numbers instead of floating point numbers. For example, at one time quite a few people used fixed-point for various graphics. A typical version was a 16-bit integer broken into something like 10 bits before the decimal point and six bits after the decimal point. Such a number can represent numbers from roughly 0 to 1024, with about two (decimal) digits after the decimal point. Alternatively, you can treat it as signed, running from (roughly) -512 to +512, again with around two digits after the decimal point.
In this case, the scaling factor is fixed, so the smallest difference that can be represented between two numbers is also fixed -- i.e. the difference between 1024 and the next larger number is exactly the same as the difference between 0 and the next larger number.
I'm not sure exactly why you're concerned with computing reciprocals of extremely large or extremely small numbers. IEEE floating point uses denormals, which means numbers close to the limits of the range lose precision. Basically, a number is divided into an exponent and a significand. The exponent contains the magnitude of the number, and the significand contains the significant digits. Each is represented with a specified number of bits. In the usual case, numbers are normalized, which means they're vaguely similar to the scientific notation we all learned in school. In scientific notation, you always adjust the significand and exponent so there's exactly one place before the decimal point, so (for example) 140 becomes 1.4e2, 20030 becomes 2.003e4, and so on.
Think of this as the "normalized" form of a floating point number. Assume, however, that you're limited to an exponent having 2 digits, so it can only run from -99 to +99. Also assume that you can have a maximum of 15 significant digits. Within those limitations, you could produce a number like 0.00001002e-99. This lets you represent a number smaller than 1e-99, at the expense of losing some precision -- instead of 15 digits of precision, you've used 5 digits of your significand to represent magnitude, so you're left with only 10 digits that are really significant.
Except that it's in binary instead of decimal, IEEE floating point works roughly that way.
As you approach the end of the range, the numbers have less and less precision, until (at the very end of the range) you have only one bit of precision left.
If you take that number that has only one bit of precision, and take its reciprocal you get an extremely large number -- but since you only started with one bit of precision, the result can only have one bit of precision as well. Although slightly better than no result at all, it's still pretty close to meaningless. You've reached the limit of what the number of bits can represent; about the only way to cure the problem is to use more bits.
There's not really any one point at which a reciprocal (or other computation) "stops making sense". It's not really a hard line where one result makes sense, and another doesn't. Rather, it's a slope, where one result might have 15 digits of precision, another 10 and a third only 1. What "makes sense" or not is mostly how you interpret that result. To get meaningful results, you need a fair idea of how many digits in your final result are really meaningful.
You need to understand how floating point numbers are represented in the CPU. In the data type, 1 bit is reserved for the sign, i.e. whether it is a positive or negative number, (yes you can have positive and negative 0 in floating point numbers,) then a number of bits is reserved for the significand (or mantissa,) these are the significant digits in the floating point number and finally a number of bits is reserved for the exponent. The value of the floating point number now is:
(-1)^sign * significand * 2^exponent
This means the smallest number is a very small value, namely the smallest significand with the lowest exponent. The rounding error however is much larger and depends on the magnitude of the number, namely the smallest number with a given exponent. The epsilon is the difference between 1.0 and the next representable larger value. That's why epsilon is used in code that is robust for rounding errors, and really you should scale the epsilon with the magnitude of the numbers you work with if you do it right. The smallest representable value is not really of any significant use normally.
You're seeing the difference between the normalized and denormalized minimum. The problem is that due to the way the significand is used it is possible to make a larger negative exponent than a positive one, say the bit pattern of the significand is all zeros except the last bit, which is one, then the exponent is effectively lowered by the number of bits in the significand. For the maximum you cannot do this, even if you set the significand to all ones, the effective exponent will still only be the exponent that is given. i.e. think of the difference between 0.000001e-10 and 9.999999e+10, the first is much smaller than the second is big. The first is actually 1e-16 while the second is approx 1e+11.
It depends on the precision of the floating point number of course. In the case of double precision, the difference between the maximum and the next smaller value is already huge (along the lines of 10^292), so your rounding errors will be very big. If the value is too small you will simply get inf instead, as you already saw. Really, there is no strict answer; it depends entirely on the precision of numbers you need. Given that the rounding error is approximately epsilon*magnitude, the reciprocal of epsilon (1/epsilon) already has a rounding error of around 1.0. If you need numbers to be accurate to 1e-3, then even epsilon would be too small to divide by.
See these wikipedia pages on IEEE754 and Machine epsilon for some background info.
Epsilons are added to test equality between two values that should be equal, but aren't because of rounding errors. While you could use the smallest positive value for epsilon, it wouldn't be optimal, because it's simply too small. The rounding errors caused by floating point arithmetic almost always exceed that smallest value, so a larger epsilon is needed. How large depends on your desired accuracy.
I don't understand the question. Are the reciprocals useful for what? I can't think of any reason why they would be useful.
In general, dividing by very small values is a bad idea as it will cause very large rounding errors. I'm not sure what you mean by adding an upper bound. Just avoid dividing by small values wherever possible.