Can a floating-point value be used as a key in a hashtable? - c++

I know that floating-point values cannot be reliably compared with ==, so I have made a custom comparison function like this:
auto isEqual = [](const double& a, const double& b) {
    return fabs(a - b) <= numeric_limits<double>::epsilon();
};
I would like to know how to modify the unordered_map so that it works as I expect:
auto isEqual = [](const double& a, const double& b) {
    return fabs(a - b) <= numeric_limits<double>::epsilon();
};
unordered_map<double, int, hash<double>, decltype(isEqual)> m(0, hash<double>(), isEqual);
m[1/(double)3]++;
cout << m[1 - 2/(double)3] << endl; // expected 1, but got 0
// -----------------------------
auto comp = [&](const double& a, const double& b) {
    if (isEqual(a, b)) return false;
    return a < b;
};
map<double, int, decltype(comp)> m2(comp);
m2[1/(double)3]++;
cout << m2[1 - 2/(double)3] << endl; // expected 1, and got 1

It is essentially impossible to modify an unordered_map to use floating-point values that contain different rounding errors as keys.
Customizing the equality comparison alone is insufficient because it is merely used to distinguish values that are mapped to the same bucket by the hash function. Different floating-point values generally hash to different buckets, even if they differ only by the tiniest of rounding errors. Therefore, one must also customize the hash function.
However, then the requirement for the hash function would be that it map floating-point values that are different but that you would like to consider as equal to the same bucket. In general, this is impossible because, if you want to consider any two very close numbers as equal, say numbers that are so close that they are adjacent in the floating-point format, then transitivity requires the hash function to map all numbers to one bucket. That is, since zero and the smallest positive representable number must map to the same bucket, and the smallest and second smallest positive representable numbers must map to the same bucket, then zero and the second smallest positive representable number must map to the same bucket. Similarly, the third smallest number must map to the same bucket as the second smallest and therefore to the same bucket as zero. And so on for the fourth and fifth. This creates a chain that continues for all numbers: They must all map to the same bucket.
Therefore, no hash function can serve to implement a non-degenerate map for floating-point numbers that considers close numbers as equal.
In special situations, it is possible to implement a reasonable hash for certain sets of numbers. For example, if it is known that all the floating-point values represent a number of cents, and that the floating-point numbers never contain accumulated rounding errors that reach or exceed half a cent, then each value can be rounded to the nearest cent (or the representable value nearest that) before hashing. Note that the domain of values in this case is really a discrete set, such as a set of fixed-point numbers, not a continuous set such as the set of real numbers that floating-point arithmetic is intended to approximate. In this case, the only modification that is needed is to quantize the floating-point value (round it to the nearest member of the set) before inserting it into the map. No custom hash function or equality comparison is needed.
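As a minimal sketch of that last idea (the helper toCents is my own, and it assumes the accumulated error stays well below half a cent): quantize each amount to an integer number of cents and use that integer as the key, so the default hash and equality work unchanged.
#include <cmath>
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Quantize a dollar amount to a whole number of cents. This assumes the
// accumulated rounding error never reaches half a cent.
std::int64_t toCents(double dollars) {
    return static_cast<std::int64_t>(std::llround(dollars * 100.0));
}

int main() {
    std::unordered_map<std::int64_t, int> counts;  // keys are exact integers (cents)
    counts[toCents(0.10 + 0.20)]++;                // 0.30000000000000004 -> 30
    counts[toCents(0.30)]++;                       // 0.29999999999999999 -> 30
    std::cout << counts[toCents(0.30)] << '\n';    // prints 2
}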

Related

Do we need epsilon value for lesser or greater comparison for float value? [duplicate]

I have gone through different threads about comparing float values with less-than or greater-than (not just equality), but it is still not clear to me: do we need epsilon-based logic to compare float values with less-than or greater-than as well?
For example:
float a, b;
if (a < b) // is this the correct way to compare two float values, or do we need an epsilon for the less-than comparison?
{
}
if (a > b) // is this the correct way to compare two float values for the greater-than comparison?
{
}
I know that for comparing floats for equality, we need some epsilon value:
bool AreSame(double a, double b)
{
    return fabs(a - b) < EPSILON;
}
It really depends on what should happen when both values are close enough to be seen as equal, meaning fabs(a - b) < EPSILON. In some use cases (for example computing statistics), it is not very important whether the comparison between 2 close values yields equality or not.
If it matters, you should first determine the uncertainty of the values. It really depends on the use case (where the input values come from and how they are processed), and then 2 values differing by less than that uncertainty should be considered equal. But that equality is no longer a true mathematical equivalence relation: you can easily imagine how to build a chain of close values between 2 truly different values. In mathematical terms, the relation is not transitive (or is only almost transitive, in everyday words).
I am sorry, but as soon as you have to process approximations there cannot be any precise and consistent way: you have to think of the real-world use case to determine how you should handle the approximation.
When you are working with floats, it's inevitable that you will run into precision errors.
In order to mitigate this, when checking two floats for equality we often check if their difference is small enough.
For less-than and greater-than, however, there is no way to tell with full certainty which float is larger. The best approach (presumably for your intentions) is to first check whether the two floats are the same, using the AreSame function. If so, return false (since a = b implies that a < b and a > b are both false).
Otherwise, return the value of either a < b or a > b.
The answer is application dependent.
If you are sure that a and b are sufficiently different that numerical errors will not reverse the order, then a < b is good enough.
But if a and b are dangerously close, you might require a < b + EPSILON. In such a case, it should be clear to you that < and ≤ are not distinguishable.
Needless to say, EPSILON should be chosen with the greatest care (which is often pretty difficult).
It ultimately depends on your application, but I would say generally no.
The problem, very simplified, is that if you calculate: (1/3) * 3 and get the answer 0.999999, then you want that to compare equal to 1. This is why we use epsilon values for equal comparisons (and the epsilon should be chosen according to the application and expected precision).
On the other hand, if you want to sort a list of floats then by default the 0.999999 value will sort before 1. But then again what would the correct behavior be? If they both are sorted as 1, then it will be somewhat random which one is actually sorted first (depending on the initial order of the list and the sorting algorithm you use).
The problem with floating point numbers is not that they are "random" and that it is impossible to predict their exact values. The problem is that base-10 fractions don't translate cleanly into base-2 fractions, and that non-repeating decimals in one system can translate into repeating ones in the other, which then results in rounding errors when truncated to a finite number of digits. We use epsilon values for equality comparisons to handle the rounding errors that arise from these back-and-forth translations.
But do be aware that the nice relations that ==, < and <= have for integers, don't always translate over to floating points exactly because of the epsilons involved. Example:
a = x
b = a + epsilon/2
c = b + epsilon/2
d = c + epsilon/2
Now: a == b, b == c, c == d, BUT a != d, a < d. In fact, you can continue the sequence keeping num(n) == num(n+1) and at the same time get an arbitrarily large difference between a and the last number in the sequence.
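To make that chain concrete, here is a small demonstration of my own (with an arbitrary eps) showing that this epsilon-based "equality" is not transitive:
#include <cmath>
#include <iostream>

int main() {
    const double eps = 1e-9;  // arbitrary tolerance for the demonstration
    auto almostEqual = [&](double x, double y) { return std::fabs(x - y) <= eps; };

    double a = 1.0;
    double b = a + eps / 2;  // "equal" to a
    double c = b + eps / 2;  // "equal" to b
    double d = c + eps / 2;  // "equal" to c

    // Each neighbour compares "equal", yet a and d do not.
    std::cout << almostEqual(a, b) << almostEqual(b, c) << almostEqual(c, d)
              << almostEqual(a, d) << '\n';  // prints 1110
}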
As others have stated, there would always be precision errors when dealing with floats.
Thus, you should have an epsilon value even for comparing less than / greater than.
We know that in order for a to be less than b, firstly, a must be different from b. Checking this is a simple NOT equals, which uses the epsilon.
Then, once you already know a != b, the operator < is sufficient.
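A minimal sketch of that two-step check (the EPSILON value is only a placeholder that must be chosen for the application, and definitelyLess is my own name):
#include <cmath>

constexpr double EPSILON = 1e-9;  // placeholder; pick it for your application

bool areSame(double a, double b) {
    return std::fabs(a - b) < EPSILON;
}

// "Definitely less than": only report a < b when the two values are not
// considered equal within the tolerance.
bool definitelyLess(double a, double b) {
    return !areSame(a, b) && a < b;
}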

Given (a, b) compute the maximum value of k such that a^{1/k} and b^{1/k} are whole numbers

I'm writing a program that tries to find the minimum value of k > 1 such that the kth root of a and b (which are both given) equals a whole number.
Here's a snippet of my code, which I've commented for clarification.
int main()
{
    // Declare the variables a and b.
    double a;
    double b;
    // Read in variables a and b.
    while (cin >> a >> b) {
        int k = 2;
        // We require the kth root of a and b to both be whole numbers.
        // "while a^{1/k} and b^{1/k} are not both whole numbers..."
        while ((fmod(pow(a, 1.0/k), 1) != 1.0) || (fmod(pow(b, 1.0/k), 1) != 0)) {
            k++;
        }
Pretty much, I read in (a, b), and I start from k = 2 and increment k until the kth roots of a and b are both congruent to 0 mod 1 (meaning that they are divisible by 1 and thus whole numbers).
But, the loop runs infinitely. I've tried researching, and I think it might have to do with precision error; however, I'm not too sure.
Another approach I've tried is changing the loop condition to check whether the floor of a^{1/k} equals a^{1/k} itself. But again, this runs infinitely, likely due to precision error.
Does anyone know how I can fix this issue?
EDIT: for example, when (a, b) = (216, 125), I want to have k = 3 because 216^(1/3) and 125^(1/3) are both integers (namely, 6 and 5).
That is not a programming problem but a mathematical one:
if a is a real number, k a positive integer, and a^(1/k) an integer, then a is an integer (otherwise the aim is just to toy with approximation error).
So the fastest approach may be to first check that a and b are integers, then do a prime decomposition such that a = p0^e0 * p1^e1 * ..., where the pi are distinct primes.
Notice that, for a^(1/k) to be an integer, each ei must also be divisible by k. In other words, k must be a common divisor of the ei. The same must be true for the prime-factorization exponents of b if b^(1/k) is to be an integer.
Thus the largest k is the greatest common divisor of all the exponents ei of both a and b.
With your approach you will have problems with large numbers. All IEEE 754 binary64 floating-point values (the case of double on x86) have 53 significant bits. That means that all doubles larger than 2^53 are integers.
The function pow(x, 1./k) can produce the same value for two different x, so with your approach you will necessarily get false answers. For example, the numbers 5^5 * 2^90 and 3^5 * 2^120 are exactly representable as doubles, and the result of the algorithm is k = 5. You will find this value of k with those numbers, but you will also find k = 5 for 5^5 * 2^90 - 2^49 and 3^5 * 2^120, because pow(5^5 * 2^90 - 2^49, 1./5) == pow(5^5 * 2^90, 1./5).
On the other hand, as there are only 53 significant bits, prime number decomposition of a double is trivial.
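A sketch of that integer-only approach, assuming a and b fit in 64-bit unsigned integers (the names primeExponents and largestCommonRoot are mine, not from the answer):
#include <cstdint>
#include <numeric>   // std::gcd (C++17)
#include <vector>

// Exponents of the prime factorization of n (n >= 2).
std::vector<std::uint64_t> primeExponents(std::uint64_t n) {
    std::vector<std::uint64_t> exps;
    for (std::uint64_t p = 2; p <= n / p; ++p) {
        if (n % p == 0) {
            std::uint64_t e = 0;
            while (n % p == 0) { n /= p; ++e; }
            exps.push_back(e);
        }
    }
    if (n > 1) exps.push_back(1);  // leftover prime factor
    return exps;
}

// Largest k such that both a^(1/k) and b^(1/k) are integers:
// the gcd of all prime-factorization exponents of a and b.
std::uint64_t largestCommonRoot(std::uint64_t a, std::uint64_t b) {
    std::uint64_t g = 0;
    for (std::uint64_t e : primeExponents(a)) g = std::gcd(g, e);
    for (std::uint64_t e : primeExponents(b)) g = std::gcd(g, e);
    return g;  // e.g. a = 216 = 2^3 * 3^3, b = 125 = 5^3  ->  3
}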
Floating-point numbers are not mathematical real numbers. Computations on them are "approximate". See http://floating-point-gui.de/
You could replace the test fmod(pow(a, 1.0/k), 1) != 1.0 with something like fabs(fmod(pow(a, 1.0/k), 1) - 1.0) > 0.0000001 (and play with various such 𝛆 instead of 0.0000001; see also std::numeric_limits<double>::epsilon, but use it carefully, since pow might introduce some error in its computations, and 1.0/k also injects imprecision; the details are very complex, dive into the IEEE 754 specification).
Of course, you could (and probably should) define your bool almost_equal(double x, double y) function (and use it instead of ==, and use its negation instead of !=).
As a rule of thumb, never test floating numbers for equality (i.e. ==), but consider instead some small enough distance between them; that is, replace a test like x == y (respectively x != y) with something like fabs(x-y) < EPSILON (respectively fabs(x-y) > EPSILON) where EPSILON is a small positive number, hence testing for a small L1 distance (for equality, and a large enough distance for inequality).
And avoid floating point in integer problems.
Actually, predicting or estimating floating point accuracy is very difficult. You might want to consider tools like CADNA. My colleague Franck Védrine is an expert on static program analyzers to estimate numerical errors (see e.g. his TERATEC 2017 presentation on Fluctuat). It is a difficult research topic, see also D.Monniaux's paper the pitfalls of verifying floating-point computations etc.
And floating point errors did in some cases cost human lives (or loss of billions of dollars). Search the web for details. There are some cases where all the digits of a computed number are wrong (because the errors may accumulate, and the final result was obtained by combining thousands of operations)! There is some indirect relationship with chaos theory, because many programs might have some numerical instability.
As others have mentioned, comparing floating point values for equality is problematic. If you find a way to work directly with integers, you can avoid this problem. One way to do so is to raise integers to the k power instead of taking the kth root. The details are left as an exercise for the reader.

Is rounding a correct way of making float/double comparisons?

Taking this question as a starting point, it is well known that we should not apply an equality comparison to floating-point variables, due to numeric errors (this is not specific to any programming language):
bool CompareDoubles1 (double A, double B)
{
    return A == B;
}
The above code is not right.
My questions are:
Is it right to round both numbers and then compare them?
Is it more efficient?
For instance:
bool CompareDoubles1 (double A, double B)
{
    double a = round(A, 4);
    double b = round(B, 4);
    return a == b;
}
Is it correct?
EDIT
I'm assuming round is a method that takes a double (the number) and an int (the precision):
double round (double number, int precision);
EDIT
I think a better idea of what I mean by this question is expressed with this compare method:
bool CompareDoubles1 (double A, double B, int precision)
{
    // precision could be the error expected when rounding
    double a = round(A, precision);
    double b = round(B, precision);
    return a == b;
}
Usually, if you really have to compare floating values, you'd specify a tolerance:
bool CompareDoubles1 (double A, double B, double tolerance)
{
    return std::abs(A - B) < tolerance;
}
Choosing an appropriate tolerance will depend on the nature of the values and the calculations that produce them.
Rounding is not appropriate: two very close values, which you'd want to compare equal, might round in different directions and appear unequal. For example, when rounding to the nearest integer, 0.3 and 0.4 would compare equal, but 0.499999 and 0.500001 wouldn't.
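For illustration, a small sketch (roundTo is a hypothetical helper in the spirit of the question's round, rounding to a given number of decimal digits):
#include <cmath>
#include <iostream>

// Hypothetical helper: round x to `precision` decimal digits.
double roundTo(double x, int precision) {
    const double scale = std::pow(10.0, precision);
    return std::round(x * scale) / scale;
}

int main() {
    std::cout << (roundTo(0.499999, 0) == roundTo(0.500001, 0)) << '\n';  // 0: very close values round apart
    std::cout << (roundTo(0.3, 0)      == roundTo(0.4, 0))      << '\n';  // 1: farther values round together
}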
A common comparison for doubles is implemented as
bool CompareDoubles2 (double A, double B)
{
    return std::abs(A - B) < 1e-6; // small magic constant here
}
It is clearly not as efficient as the check A == B, because it involves more steps, namely subtraction, calling std::abs and finally comparison with a constant.
The same argument about efficiency holds for your proposed solution:
bool CompareDoubles1 (double A, double B)
{
    double a = round(A, 4); // the magic constant hides in the 4
    double b = round(B, 4); // and here again
    return a == b;
}
Again, this won't be as efficient as direct comparison, but -- again -- it doesn't even try to do the same.
Whether CompareDoubles2 or CompareDoubles1 is faster depends on your machine and the choice of magic constants. Just measure it. You need to make sure to supply matching magic constants, otherwise you are checking for equality with a different trust region which yields different results.
I think comparing the difference with a fixed tolerance is a bad idea.
Consider what happens if you set the tolerance to 1e-6, but the two numbers you compare are 1.11e-9 and 1.19e-9.
These would be considered equal, even though they differ after the second significant digit. This may not be what you want.
I think a better way to do the comparison is
equal = ( fabs(A - B) <= tol*max(fabs(A), fabs(B)) )
Note, the <= (and not <), because the above must also work for 0==0. If you set tol=1e-14, two numbers will be considered equal when they are equal up to 14 significant digits.
Sidenote: When you want to test if a number is zero, then the above test might not be ideal and then one indeed should use an absolute threshold.
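Written out as a function, that test might look like this (a sketch; nearlyEqual is my own name and the default tolerance is just an example):
#include <algorithm>  // std::max
#include <cmath>      // std::fabs

// Relative comparison: equal when the difference is small relative to the
// larger magnitude. The <= makes 0 == 0 come out equal.
bool nearlyEqual(double a, double b, double tol = 1e-14) {
    return std::fabs(a - b) <= tol * std::max(std::fabs(a), std::fabs(b));
}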
If the round function used in your example means rounding to the 4th decimal digit, this is not correct at all. For example, if A and B are 0.000003 and 0.000004, they would both be rounded to 0.0 and would therefore compare equal.
A general-purpose comparison function must not work with a constant tolerance but with a relative one. But this is all explained in the post you cite in your question.
There is no 'correct' way to compare floating point values (even f == 0.0 might be correct). Different comparisons may be suitable. Have a look at http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
Similar to other posts, but introducing scale-invariance: If you are doing something like adding two sets of numbers together and then you want to know if the two set sums are equal, you can take the absolute value of the log-ratio (difference of logarithms) and test to see if this is less than your prescribed tolerance. That way, e.g. if you multiply all your numbers by 10 or 100 in summation calculations, it won't affect the result about whether the answers are equal or not. You should have a separate test to determine if two numbers are equal because they are close enough to 0.
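A possible sketch of such a log-ratio test (my own formulation, not the poster's code; it assumes both values are strictly positive, so values at or near zero still need the separate test mentioned above):
#include <cmath>

// Scale-invariant "equal enough": |log(a) - log(b)| = |log(a/b)| below tol.
// Multiplying both a and b by the same factor leaves the result unchanged.
// Requires a > 0 and b > 0; values near zero need a separate absolute test.
bool sameWithinLogRatio(double a, double b, double tol = 1e-12) {
    return std::fabs(std::log(a) - std::log(b)) < tol;
}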

Does std::hash guarantee equal hashes for "equal" floating point numbers?

Is the floating point specialisation of std::hash (say, for doubles or floats) reliable regarding almost-equality? That is, if two values (such as (1./std::sqrt(5.)/std::sqrt(5.)) and .2) should compare equal but will not do so with the == operator, how will std::hash behave?
So, can I rely on a double as an std::unordered_map key to work as expected?
I have seen "Hashing floating point values" but that asks about boost; I'm asking about the C++11 guarantees.
std::hash has the same guarantees for all types over which it can be instantiated: if two objects are equal, their hash codes will be equal. Otherwise, there's a very large probability that they won't. So you can rely on a double as a key in an unordered_map to work as expected: if two doubles are not equal (as defined by ==), they will probably have a different hash (and even if they don't, they're different keys, because unordered_map also checks for equality).
Obviously, if your values are the results of inexact calculations, they aren't appropriate keys for unordered_map (nor perhaps for any map).
Multiple problems with this question:
The reason that your two expressions don't compare as equal is NOT that there are two binary expressions of 0.2, but that there is NO exact (finite) binary representation of 0.2, or sqrt(5) ! So in fact, while (1./std::sqrt(5.)/std::sqrt(5.)) and .2 should be the same algebraically, they may well not be the same in computer-precision arithmetic. (They aren't even in pen-on-paper arithmetic with finite precision. Say you are working with 10 digits after the decimal point. Write out sqrt(5) with 10 digits and calculate your first expression. It will not be .2.)
Of course you have a sensible concept of two numbers being close. In fact you have at least two: One absolute (|a-b| < eps) , one relative. But that doesn't translate into sensible hashes. If you want all numbers within eps of each other to have the same hash, then 1, 1+eps, 1+2*eps, ... would all have the same hash and therefore, ALL numbers would have the same hash. That is a valid, but useless hash function. But it is the only one that satisfies your requirement of mapping nearby values to the same hash!
Behind the default hashing of an unordered_map there is a std::hash struct which provides the operator() to compute the hash of a given value.
A set of default specializations of this templates is available, including std::hash<float> and std::hash<double>.
On my machine (LLVM+clang) these are defined as
template <>
struct hash<float> : public __scalar_hash<float>
{
    size_t operator()(float __v) const _NOEXCEPT
    {
        // -0.0 and 0.0 should return same hash
        if (__v == 0)
            return 0;
        return __scalar_hash<float>::operator()(__v);
    }
};
where __scalar_hash is defined as:
template <class _Tp>
struct __scalar_hash<_Tp, 0> : public unary_function<_Tp, size_t>
{
    size_t operator()(_Tp __v) const _NOEXCEPT
    {
        union
        {
            _Tp __t;
            size_t __a;
        } __u;
        __u.__a = 0;
        __u.__t = __v;
        return __u.__a;
    }
};
Basically the hash is built by storing the source value into one member of a union and then reading back a piece that is as large as a size_t.
So you either get some padding or your value gets truncated, but that doesn't really matter, because as you can see the raw bits of the number are used to compute the hash. This means it behaves exactly like the == operator: two floating-point numbers, to have the same hash (excluding collisions caused by truncation), must be the same value.
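A quick way to see this in practice (output may vary by implementation, since the standard only requires equal values to hash equally):
#include <cmath>
#include <functional>
#include <iostream>

int main() {
    double x = 1. / std::sqrt(5.) / std::sqrt(5.);
    double y = 0.2;
    std::cout << (x == y) << '\n';                                    // 0 on typical platforms
    std::cout << (std::hash<double>{}(x) == std::hash<double>{}(y))   // almost certainly 0 as well
              << '\n';
}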
There is no rigorous concept of "almost equality". So behavior can't be guaranteed in principle. If you want to define your own concept of "almost equal" and construct a hash function such that two "almost equal" floats have the same hash, you can. But then it will only be true for your particular notion of "almost equal" floats.

std::map trick for comparing unrepresentable numbers?

I would like to have a user-defined key in a C++ std::map. The key is a binary representation of an integer set with maximum value 2^V so I can't represent all 2^V possible values. I do so by means of an efficient binary set representation, i.e., an array of uint64_t.
Now the problem is that to put this user-defined bitset as a key in a std::map, I need to define a valid comparison between bitset values. But if I have a maximum size of, say, V = 1000, then I cannot get a number I can compare, let alone aggregate them all, i.e., 2^1000 is not representable.
Therefore my question is, suppose I have two different sets (by setting the right bits in my bitset representation) and I cannot represent the final number because it will overflow:
id_1 = 2^0 + 2^1 + ... + 2^V
id_2 = 2^0 + 2^1 + ... + 2^V
Is there a suitable transformation that would lead to a value I can compare? I need to be able to say id_1 < id_2 so I would like to transform a sum of exponentials to a value that is representable BUT maintaining the invariant of the "less than". I was thinking along the lines of e.g. applying a log transformation in a clever way to preserve "less than".
Here is an example:
set_1 = {2,3,4}; set_2 = {8}
id(set_1) = 2^2 + 2^3 + 2^4 = 28; id(set_2) = 2^8 = 256
id(set_1) < id(set_2)
Perfect! How about a general set that can have {1,...,V}, and thus 2^V possible subsets?
I do so by means of an efficient binary set representation, i.e., an array of uint64_t.
Supposing that this array is accessed via a data member ra of the key type Key, and both arrays are of length N, then you want a comparator something like this:
bool operator<(const Key &lhs, const Key &rhs) {
    return std::lexicographical_compare(lhs.ra, &lhs.ra[N], rhs.ra, &rhs.ra[N]);
}
This implicitly considers the array to be big-endian, i.e. the first uint64_t is the most significant. If you don't like that, that's fair enough, since you might already have in mind some relative significance for whatever order you've stored your V bits into your array. There's no great mystery to lexicographical_compare, so just look at an example implementation and modify as required.
This is called "lexicographical order". Other than the facts that I've used uint64_t instead of char and both arrays are the same length, it is how strings are compared[*] -- in fact the use of uint64_t isn't important, you could just use std::memcmp in your comparator instead of comparing 64-bit chunks. operator< for strings doesn't work by converting the whole string to an integer, and neither should your comparator.
[*] until you bring locale-specific collation rules into play.
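Putting it together, a minimal sketch of such a key type used with std::map (the names Key, ra and N follow the assumption above; the choice N = 16 is mine, enough for V = 1000 bits):
#include <algorithm>  // std::lexicographical_compare
#include <cstddef>
#include <cstdint>
#include <map>

constexpr std::size_t N = 16;  // 16 * 64 = 1024 bits, enough for V = 1000

struct Key {
    std::uint64_t ra[N] = {};  // the bitset; first word treated as most significant
};

bool operator<(const Key &lhs, const Key &rhs) {
    return std::lexicographical_compare(lhs.ra, &lhs.ra[N], rhs.ra, &rhs.ra[N]);
}

int main() {
    Key set1, set2;
    set1.ra[N - 1] = (1u << 2) | (1u << 3) | (1u << 4);  // {2,3,4} -> value 28 in the last word
    set2.ra[N - 1] = (1u << 8);                          // {8}     -> value 256 in the last word

    std::map<Key, int> m;
    m[set1] = 1;
    m[set2] = 2;
    // set1 sorts before set2, matching id(set_1) = 28 < id(set_2) = 256 from the question.
}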