Does std::hash guarantee equal hashes for "equal" floating point numbers? - c++

Is the floating point specialisation of std::hash (say, for doubles or floats) reliable regarding almost-equality? That is, if two values (such as (1./std::sqrt(5.)/std::sqrt(5.)) and .2) should compare equal but will not do so with the == operator, how will std::hash behave?
So, can I rely on a double as an std::unordered_map key to work as expected?
I have seen "Hashing floating point values" but that asks about boost; I'm asking about the C++11 guarantees.

std::hash has the same guarantees for all types over which it can
be instantiated: if two objects are equal, their hash codes will
be equal. Otherwise, there's a very large probability that they
won't. So you can rely on a double as a key in an
unordered_map to work as expected: if two doubles are not
equal (as defined by ==), they will probably have a different
hash (and even if they don't, they're different keys, because
unordered_map also checks for equality).
Obviously, if your values are the results of inexact
calculations, they aren't appropriate keys for unordered_map
(nor perhaps for any map).
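As a minimal sketch of that guarantee (the helper name finds_key is made up for illustration), the lookup below succeeds exactly when the probe compares equal to the stored key under ==, at least for ordinary non-NaN values:

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical helper: store one key, then probe with another value.
// By default std::unordered_map<double, int> uses std::hash<double>
// together with operator==, so no tolerance is involved anywhere.
bool finds_key(double stored, double probe) {
    std::unordered_map<double, int> m;
    m[stored] = 1;
    return m.count(probe) > 0;
}
```

Calling finds_key(1. / std::sqrt(5.) / std::sqrt(5.), .2) therefore returns whatever == returns for those two doubles on your platform.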

Multiple problems with this question:
The reason that your two expressions don't compare as equal is NOT that there are two binary expressions of 0.2, but that there is NO exact (finite) binary representation of 0.2, or sqrt(5) ! So in fact, while (1./std::sqrt(5.)/std::sqrt(5.)) and .2 should be the same algebraically, they may well not be the same in computer-precision arithmetic. (They aren't even in pen-on-paper arithmetic with finite precision. Say you are working with 10 digits after the decimal point. Write out sqrt(5) with 10 digits and calculate your first expression. It will not be .2.)
Of course you have a sensible concept of two numbers being close. In fact you have at least two: One absolute (|a-b| < eps) , one relative. But that doesn't translate into sensible hashes. If you want all numbers within eps of each other to have the same hash, then 1, 1+eps, 1+2*eps, ... would all have the same hash and therefore, ALL numbers would have the same hash. That is a valid, but useless hash function. But it is the only one that satisfies your requirement of mapping nearby values to the same hash!
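The obvious compromise is to hash the index of a grid cell of width eps. The sketch below (bucket_of is a hypothetical name, not a standard function) shows why that still does not meet the requirement: two values that are much closer than eps can sit on opposite sides of a cell boundary.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical grid-based "hash": every value in [k*eps, (k+1)*eps)
// is mapped to the integer k.
long long bucket_of(double x, double eps) {
    assert(eps > 0.0);
    return static_cast<long long>(std::floor(x / eps));
}
```

With eps = 1.0, the values 0.99 and 1.01 differ by only 0.02 but land in buckets 0 and 1, while 0.01 and 0.99 differ by 0.98 and share bucket 0.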

Behind the default hashing of an unordered_map there is a std::hash struct which provides the operator() to compute the hash of a given value.
A set of default specializations of this template is available, including std::hash<float> and std::hash<double>.
On my machine (LLVM+clang) these are defined as
template <>
struct hash<float> : public __scalar_hash<float>
{
size_t operator()(float __v) const _NOEXCEPT
{
// -0.0 and 0.0 should return same hash
if (__v == 0)
return 0;
return __scalar_hash<float>::operator()(__v);
}
};
where __scalar_hash is defined as:
template <class _Tp>
struct __scalar_hash<_Tp, 0> : public unary_function<_Tp, size_t>
{
size_t operator()(_Tp __v) const _NOEXCEPT
{
union
{
_Tp __t;
size_t __a;
} __u;
__u.__a = 0;
__u.__t = __v;
return __u.__a;
}
};
Basically, the hash is built by storing the source value in a union and then reading back a piece that is as large as a size_t.
So you get some padding or your value gets truncated, but that doesn't really matter: as you can see, the raw bits of the number are used to compute the hash, so it is consistent with the == operator (the explicit check for zero is what makes -0.0 and 0.0, which compare equal, hash alike). For two floating-point numbers to have the same hash (excluding collisions caused by truncation), they must be the same value.
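To see why the zero check is there, here is a small sketch that reads the raw bits of a float with memcpy instead of a union. It assumes a 32-bit IEEE 754 float, which is common but not guaranteed by the standard; bits_of is a name chosen for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Copy the object representation of a float into an integer,
// similar in spirit to the union trick quoted above.
std::uint32_t bits_of(float v) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &v, sizeof bits);
    return bits;
}
```

Under that assumption 0.0f and -0.0f compare equal with == but differ in the sign bit, so a hash built purely from the bits would have to special-case them.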

There is no rigorous concept of "almost equality". So behavior can't be guaranteed in principle. If you want to define your own concept of "almost equal" and construct a hash function such that two "almost equal" floats have the same hash, you can. But then it will only be true for your particular notion of "almost equal" floats.

Related

Do we need epsilon value for lesser or greater comparison for float value? [duplicate]

I have gone through different threads about comparing floats for equality, but it is still not clear to me whether we also need epsilon logic when comparing floats with less-than or greater-than.
e.g ->
float a, b;
if (a < b) // is this correct way to compare two float value or we need epsilon value for lesser comparator
{
}
if (a > b) // is this correct way to compare two float value for greater comparator
{
}
I know for comparing for equality of float, we need some epsilon value
bool AreSame(double a, double b)
{
return fabs(a - b) < EPSILON;
}
It really depends on what should happen when both value are close enough to be seen as equal, meaning fabs(a - b) < EPSILON. In some use cases (for example for computing statistics), it is not very important if the comparison between 2 close values gives or not equality.
If it matters, you should first determine the uncertainty of the values. It really depends on the use case (where the input values come from and how they are processed), and then two values differing by less than that uncertainty should be considered equal. But that equality is no longer a true mathematical equivalence relation: you can easily imagine how to build a chain of close values between two truly different values. In mathematical terms, the relation is not transitive (or is only "almost transitive", in everyday language).
I am sorry but as soon as you have to process approximations there cannot be any precise and consistent way: you have to think of the real world use case to determine how you should handle the approximation.
When you are working with floats, it's inevitable that you will run into precision errors.
In order to mitigate this, when checking for the equality of two floats we often check if their difference is small enough.
For lesser and greater, however, there is no way to tell with full certainty which float is larger. The best (presumably for your intentions) approach is to first check if the two floats are the same, using the AreSame function. If so, return false (as a = b implies that a < b and a > b are both false).
Otherwise, return the value of either a < b or a > b.
The answer is application dependent.
If you are sure that a and b are sufficiently different that numerical errors will not reverse the order, then a < b is good enough.
But if a and b are dangerously close, you might require a < b + EPSILON. In such a case, it should be clear to you that < and ≤ are not distinguishable.
Needless to say, EPSILON should be chosen with the greatest care (which is often pretty difficult).
It ultimately depends on your application, but I would say generally no.
The problem, very simplified, is that if you calculate: (1/3) * 3 and get the answer 0.999999, then you want that to compare equal to 1. This is why we use epsilon values for equal comparisons (and the epsilon should be chosen according to the application and expected precision).
On the other hand, if you want to sort a list of floats then by default the 0.999999 value will sort before 1. But then again what would the correct behavior be? If they both are sorted as 1, then it will be somewhat random which one is actually sorted first (depending on the initial order of the list and the sorting algorithm you use).
The problem with floating point numbers is not that they are "random" and that it is impossible to predict their exact values. The problem is that base-10 fractions don't translate cleanly into base-2 fractions, and that a terminating fraction in one system can translate into a repeating one in the other, which then results in rounding errors when truncated to a finite number of digits. We use epsilon values for equal comparisons to handle rounding errors that arise from these back and forth translations.
But do be aware that the nice relations that ==, < and <= have for integers, don't always translate over to floating points exactly because of the epsilons involved. Example:
a = x
b = a + epsilon/2
c = b + epsilon/2
d = c + epsilon/2
Now: a == b, b == c, c == d, BUT a != d, a < d. In fact, you can continue the sequence keeping num(n) == num(n+1) and at the same time get an arbitrarily large difference between a and the last number in the sequence.
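The chain can be checked directly. This is only a sketch: approx_equal is an illustrative name, and the values in the usage note are picked so that every step is exactly representable in binary.

```cpp
#include <cassert>
#include <cmath>

// Epsilon-based "equality" in the style used throughout this thread.
bool approx_equal(double a, double b, double epsilon) {
    assert(epsilon > 0.0);
    return std::fabs(a - b) < epsilon;
}
```

With epsilon = 1.0 and a = 0.0, b = 0.5, c = 1.0, d = 1.5, each neighbouring pair is "equal", yet approx_equal(a, d, 1.0) is false because the difference has grown to 1.5.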
As others have stated, there would always be precision errors when dealing with floats.
Thus, you should have an epsilon value even for comparing less than / greater than.
We know that in order for a to be less than b, firstly, a must be different from b. Checking this is a simple NOT equals, which uses the epsilon.
Then, once you already know a != b, the operator < is sufficient.
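A minimal sketch of that two-step check, assuming the caller supplies a suitable epsilon (the function name is hypothetical):

```cpp
#include <cassert>
#include <cmath>

// First rule out "equal within epsilon", then fall back to plain <.
bool less_with_tolerance(double a, double b, double epsilon) {
    assert(epsilon >= 0.0);
    if (std::fabs(a - b) < epsilon) {
        return false;  // treated as equal, so neither is less
    }
    return a < b;
}
```

This is fine for a one-off comparison, but because the underlying "equality" is not transitive it is not a strict weak ordering and should not be handed to std::sort or std::map as a comparator.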

Can a floating-point value be used as a key in a hash table?

I know that floating-point values cannot be compared by ==. I have made a custom comparison function like this.
auto isEqual = [](const double& a, const double& b) {
return fabs(a-b) <= numeric_limits<double>::epsilon();
};
I would like to know how to modify the unordered_map so that it works as I expect.
auto isEqual = [](const double& a, const double& b) {
return fabs(a-b) <= numeric_limits<double>::epsilon();
};
unordered_map<double, int, hash<double>, decltype(isEqual)> m(0, hash<double>(), isEqual);
m[1/(double)3]++;
cout << m[1-2/(double)3] << endl; // expected 1, but zero
// -----------------------------
auto comp = [&](const double& a, const double& b) {
if (isEqual(a, b)) return false;
return a < b;
};
map<double, int, decltype(comp)> m2(comp);
m2[1/(double)3]++;
cout << m2[1-2/(double)3] << endl; // expected and answered 1
It is essentially impossible to modify an unordered_map to use floating-point values that contain different rounding errors as keys.
Customizing the equality comparison alone is insufficient because it is merely used to distinguish values that are mapped to the same bucket by the hash function. Different floating-point values generally hash to different buckets, even if they differ only by the tiniest of rounding errors. Therefore, one must also customize the hash function.
However, then the requirement for the hash function would be that it map floating-point values that are different but that you would like to consider as equal to the same bucket. In general, this is impossible because, if you want to consider any two very close numbers as equal, say numbers that are so close that they are adjacent in the floating-point format, then transitivity requires the hash function to map all numbers to one bucket. That is, since zero and the smallest positive representable number must map to the same bucket, and the smallest and second smallest positive representable numbers must map to the same bucket, then zero and the second smallest positive representable number must map to the same bucket. Similarly, the third smallest number must map to the same bucket as the second smallest and therefore to the same bucket as zero. And so on for the fourth and fifth. This creates a chain that continues for all numbers: They must all map to the same bucket.
Therefore, no hash function can serve to implement a non-degenerate map for floating-point numbers that considers close numbers as equal.
In special situations, it is possible to implement a reasonable hash for certain sets of numbers. For example, if it is known that all the floating-point values represent a number of cents, and that the floating-point numbers never contain accumulated rounding errors that reach or exceed half a cent, then each value can be rounded to the nearest cent (or the representable value nearest that) before hashing. Note that the domain of values in this case is really a discrete set, such as a set of fixed-point numbers, not a continuous set such as the set of real numbers that floating-point arithmetic is intended to approximate. In this case, the only modification that is needed is to quantize the floating-point value (round it to the nearest member of the set) before inserting it into the map. No custom hash function or equality comparison is needed.
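Here is a sketch of the cents case. It goes one small step further than the paragraph above and stores the quantized value as an integer key, which is a choice made for this example; to_cents and cent_counter are hypothetical names, and the approach only holds if accumulated errors stay below half a cent.

```cpp
#include <cassert>
#include <cmath>
#include <unordered_map>

// Quantize an amount in currency units to a whole number of cents.
long long to_cents(double amount) {
    return std::llround(amount * 100.0);
}

// Hypothetical counter keyed by cents instead of by the raw double.
class cent_counter {
public:
    void add(double amount) { ++counts_[to_cents(amount)]; }

    int count(double amount) const {
        auto it = counts_.find(to_cents(amount));
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::unordered_map<long long, int> counts_;
};
```

After add(0.1 + 0.2), a lookup with count(0.3) finds the entry, because both values quantize to 30 cents even though they are different doubles.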

C++: Create integer vector of infinities

I'm working on an algorithm and I need to initialize the vector of ints:
std::vector<int> subs(10)
of fixed length with values:
{-inf, +inf, +inf …. }
I have read that it is possible to use INT_MAX, but that's not quite correct because the elements of my vector are supposed to be greater than any possible int value.
I liked the comparison-operator overloading method from this answer, but how do you initialize the vector with infinitytype class objects if the elements are supposed to be ints?
Or maybe you know any better solution?
Thank you.
The solution depends on the assumptions your algorithm (or the implementation of your algorithm) has:
You could increase the element size beyond int (e.g. if your sizeof(int) is 4, use int64_t), and initialize to (int64_t) 1 + std::numeric_limits<int>::max() (and similarly for the negative values). But perhaps your algorithm assumes that you can't "exceed infinity" by adding or multiplying by positive numbers?
You could use an std::variant like other answers suggest, selecting between an int and infinity; but perhaps your algorithm assumes your elements behave like numbers?
You could use a ratio-based "number" class, ensuring it will not get non-integral values except infinity.
You could have your algorithm special-case the maximum and minimum integers
You could use floats or doubles which support -/+ infinity, and restrict them to integrality.
etc.
So, again, it really just depends and there's no one-size-fits-all solution.
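As one concrete instance of the options above, here is a sketch of the floating-point variant. It assumes that int is narrow enough (for example 32 bits) for a double to represent every int exactly; make_subs is a name invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

static_assert(std::numeric_limits<double>::has_infinity,
              "sketch assumes double can represent infinity");

// Build {-inf, +inf, +inf, ...} of the requested length using doubles.
std::vector<double> make_subs(std::size_t n) {
    const double inf = std::numeric_limits<double>::infinity();
    std::vector<double> subs(n, inf);
    if (n > 0) {
        subs[0] = -inf;
    }
    return subs;
}
```

Every finite int then compares strictly between subs[0] and subs[1]; the algorithm is responsible for keeping the finite entries integral.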
As already said in the comments, you can't have an infinity value stored in int: all values of this type are well-defined and finite.
If you are ok with a vector of something working as an infinite for ints, then consider using a type like this:
struct infinite
{ };
bool operator < (int, infinite)
{
return true;
}
You can use a variant (for example, boost::variant) which supports double dispatching, which stores either an int or an infinitytype (which should store the sign of the infinity, for example in a bool), then implement the comparison operators through a visitor.
But I think it would be simpler if you simply used a double instead of int, and whenever you take out a value that is not infinity, convert it to int. If performance is not that great of an issue, then it will work fine (probably still faster than a variant). If you need great performance, then just use INT_MAX and be done with it.
You are already aware of the idea of an "infinite" type, but that implementation could only contain infinite values. There's another related idea:
struct extended_int {
    enum kind { NEGINF, FINITE, POSINF }; // declared in ascending order
    kind type;
    int finiteValue; // Only meaningful when type==FINITE
    // Implicitly converting ctor
    constexpr extended_int(int value) : type(FINITE), finiteValue(value) { }
    // Used to build the infinities below
    constexpr extended_int(kind k, int value) : type(k), finiteValue(value) { }
    // And the two infinities
    static constexpr extended_int posinf() { return extended_int(POSINF, 0); }
    static constexpr extended_int neginf() { return extended_int(NEGINF, 0); }
    bool operator<(extended_int rhs) const {
        // Different kinds: NEGINF < FINITE < POSINF follows the enum order
        if (type != rhs.type) return type < rhs.type;
        // Same kind: an infinity is never less than itself
        if (type != FINITE) return false;
        return finiteValue < rhs.finiteValue;
    }
};
You now have extended_int(5) < extended_int(6) but also extended_int(5) < extended_int::posinf()

Tolerant key lookup in std::map

Requirements:
container which sorts itself based on numerically comparing the keys (e.g. std::map)
check existence of key based on float tolerance (e.g. map.find() and use custom comparator )
and the tricky one: the float tolerance used by the comparator may be changed by the user at runtime!
The first 2 can be accomplished using a map with a custom comparator:
struct floatCompare : public std::binary_function<float,float,bool>
{
bool operator()( const float &left, const float &right ) const
{
return (fabs(left - right) > 1e-3) && (left < right);
}
};
typedef std::map< float, float, floatCompare > floatMap;
Using this implementation, floatMap.find( 15.0001 ) will find 15.0 in the map.
However, let's say the user doesn't want a float tolerance of 1e-3.
What is the easiest way to make this comparator function use a variable tolerance at runtime? I don't mind re-creating and re-sorting the map based on the new comparator each time epsilon is updated.
Other posts on modification after initialization here and using floats as keys here didn't provide a complete solution.
You can't change the ordering of the map after it's created (and you should just use plain old operator< even for the floating point type here), and you can't even use a "tolerant" comparison operator as that may violate the required strict-weak-ordering for map to maintain its state.
However you can do the tolerant search with lower_bound and upper_bound. The gist is that you would create a wrapper function much like equal_range that does a lower_bound for "value - tolerance" and then an upper_bound for "value + tolerance" and see if it creates a non-empty range of values that match the criteria.
You cannot change the definition of how elements are ordered in a map once it's been instantiated. If you were to find some technical hack to do so (such as implementing a custom comparator that takes a tolerance that can change at runtime), it would evoke Undefined Behavior.
Your main alternative to changing the ordering is to create another map with a different ordering scheme. This other map could be an indexing map, where the keys are ordered in a different way, and the values aren't the elements themselves, but an index into the main map.
Alternatively maybe what you're really trying to do isn't change the ordering, but maintain the ordering and change the search parameters.
That you can do, and there are a few ways to do it.
One is to simply use map::lower_bound: once with the lower bound of your tolerance, and once with the upper bound of your tolerance, just past the end of tolerance. For example, if you want to find 15.0 with a tolerance of 1e-5, you could lower_bound with 14.99999 and then again with 15.00001 to find the elements in that range.
Another is to use std::find_if with a custom functor, lambda, or std::function. You could declare the functor in such a way as to take the tolerance and the value at construction, and perform the check in operator().
Since this is a homework question, I'll leave the fiddly details of actually implementing all this up to you. :)
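For readers who want to see the std::find_if variant spelled out, here is one possible sketch. The names within_tolerance and find_near are invented for this example, and the map type is fixed to std::map<float, float> as in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <map>
#include <utility>

// Functor that captures the target value and the tolerance at construction.
struct within_tolerance {
    float target;
    float tolerance;

    bool operator()(const std::pair<const float, float>& entry) const {
        return std::fabs(entry.first - target) <= tolerance;
    }
};

// Returns the first entry whose key is within tolerance of target, or end().
std::map<float, float>::const_iterator
find_near(const std::map<float, float>& m, float target, float tolerance) {
    assert(tolerance >= 0.0f);
    return std::find_if(m.begin(), m.end(),
                        within_tolerance{target, tolerance});
}
```

Note that std::find_if walks the map linearly, whereas the lower_bound approach uses the tree and stays logarithmic, so this version is mainly attractive for small maps.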
Rather than using a comparator with tolerance, which is going to fail in subtle ways, just use a consistent key that is derived from the floating point value. Make your floating point values consistent using rounding.
inline double key(double d)
{
return floor(d * 1000.0 + 0.5);
}
You can't achieve that with a simple custom comparator, even if it was possible to change it after the definition, or when resorting using a new comparator. The fact is: a "tolerant comparator" is not really a comparator. For three values, it's possible that a < c (difference is large enough) but neither a < b nor b < c (both difference too small). Example: a = 5.0, b = 5.5, c = 6.0, tolerance = 0.6
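The example values can be checked with a small sketch. tolerant_less mirrors the comparator from the question with the tolerance passed in as a parameter, and equivalent spells out how an ordered container decides that two keys are "the same"; both names are made up here.

```cpp
#include <cassert>
#include <cmath>

// The "tolerant" comparator from the question, with a configurable tolerance.
bool tolerant_less(double left, double right, double tolerance) {
    return std::fabs(left - right) > tolerance && left < right;
}

// std::map treats two keys as equivalent when neither is less than the other.
bool equivalent(double a, double b, double tolerance) {
    return !tolerant_less(a, b, tolerance) && !tolerant_less(b, a, tolerance);
}
```

With a = 5.0, b = 5.5, c = 6.0 and tolerance = 0.6, a is equivalent to b and b is equivalent to c, yet a is less than c. A strict weak ordering requires equivalence to be transitive, so this comparator does not qualify.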
What you should do instead is to use default sorting using operator< for floats, i.e. simply don't provide any custom comparator. Then, for the lookup don't use find but rather lower_bound and upper_bound with modified values according to the tolerance. These two function calls will give you two iterators which define the sequence which will be accepted using this tolerance. If this sequence is empty, the key was not found, obviously.
You then might want to get the key which is closest to the value to be searched for. If this is true, you should then find the min_element of this subsequence, using a comparator which will consider the difference between the key and the value to be searched.
template<typename Map, typename K>
auto tolerant_find(const Map & map, const K & lookup, const K & tolerance) -> decltype(map.begin()) {
// First, find sub-sequence of keys "near" the lookup value
auto first = map.lower_bound(lookup - tolerance);
auto last = map.upper_bound(lookup + tolerance);
// If they are equal, the sequence is empty, and thus no entry was found.
// Return the end iterator to be consistent with std::find.
if (first == last) {
return map.end();
}
// Then, find the one with the minimum distance to the actual lookup value
typedef typename Map::mapped_type T;
return std::min_element(first, last, [lookup](std::pair<K,T> a, std::pair<K,T> b) {
return std::abs(a.first - lookup) < std::abs(b.first - lookup);
});
}
Demo: http://ideone.com/qT3JIa
It may be better to leave the std::map class alone (well, partly at least), and just write your own class which implements the three methods you mentioned.
template<typename T>
class myMap{
private:
float tolerance;
std::map<float,T> storage;
public:
void setTolerance(float t){tolerance=t;};
typename std::map<float,T>::iterator find(float val); // e.g. the same as you provided, but with tolerance in place of 1e-3
/* other methods go here */
};
That being said, I don't think you need to recreate the container and sort it depending on the tolerance.
check existence of key based on float tolerance
merely means you have to check if an element exists. The position of the elements inside the map shouldn't change. You could start the search from val-tolerance, and when you find an element (the function find returns an iterator), get the next elements until you reach the end of the map or until their values exceed val+tolerance.
That basically means that the behavior of the insert/add/[]/whatever functions isn't based on the tolerance, so there's no real problem of storing the values.
If you're afraid the elements will be too close to each other, you may want to start the search from val, and then gradually increase the tolerance until it reaches the one the user asked for.
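A possible shape for such a wrapper is sketched below. It is an illustration rather than a drop-in replacement: tolerant_map and its members are hypothetical names, the storage keeps the default operator< ordering, and only the lookup consults the tolerance, so changing it at runtime needs no re-sorting.

```cpp
#include <cassert>
#include <map>

template <typename T>
class tolerant_map {
public:
    explicit tolerant_map(float tolerance) : tolerance_(tolerance) {
        assert(tolerance >= 0.0f);
    }

    // The tolerance only affects lookups, never the stored order.
    void set_tolerance(float tolerance) {
        assert(tolerance >= 0.0f);
        tolerance_ = tolerance;
    }

    void insert(float key, const T& value) { storage_[key] = value; }

    // True if some stored key lies in [key - tolerance, key + tolerance].
    bool contains(float key) const {
        typename std::map<float, T>::const_iterator it =
            storage_.lower_bound(key - tolerance_);
        return it != storage_.end() && it->first <= key + tolerance_;
    }

private:
    float tolerance_;
    std::map<float, T> storage_;
};
```

For example, with 15.0f stored, contains(15.0001f) succeeds at a tolerance of 1e-3f and fails once the tolerance is tightened to 1e-6f.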

Floating point keys in std::map

The following code is supposed to find the key 3.0 in a std::map, where it exists. But due to floating point precision it won't be found.
map<double, double> mymap;
mymap[3.0] = 1.0;
double t = 0.0;
for(int i = 0; i < 31; i++)
{
t += 0.1;
bool contains = (mymap.count(t) > 0);
}
In the above example, contains will always be false.
My current workaround is to compute t as 0.1 * i instead of repeatedly adding 0.1, like this:
for(int i = 0; i < 31; i++)
{
t = 0.1 * i;
bool contains = (mymap.count(t) > 0);
}
Now the question:
Is there a way to introduce a fuzzyCompare to the std::map if I use double keys?
The common solution for floating point number comparison is usually something like fabs(a-b) < epsilon. But I don't see a straightforward way to do this with std::map.
Do I really have to encapsulate the double type in a class and overload operator<(...) to implement this functionality?
So there are a few issues with using doubles as keys in a std::map.
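First, NaN is a problem: it is not less than, greater than, or equal to any value, including itself, so the default operator< treats it as equivalent to every other key and breaks the ordering the container relies on. If there is any chance of NaN being inserted, use this: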
struct safe_double_less {
bool operator()(double left, double right) const {
bool leftNaN = std::isnan(left);
bool rightNaN = std::isnan(right);
if (leftNaN != rightNaN)
return leftNaN<rightNaN;
return left<right;
}
};
but that may be overly paranoid. Do not, I repeat do not, include an epsilon threshold in your comparison operator you pass to a std::set or the like: this will violate the ordering requirements of the container, and result in unpredictable undefined behavior.
(I placed NaN as greater than all doubles, including +inf, in my ordering, for no good reason. Less than all doubles would also work).
So either use the default operator<, or the above safe_double_less, or something similar.
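To illustrate the effect, the sketch below uses a comparator with the same logic as safe_double_less (renamed nan_last_less so the block is self-contained) and fills a std::set with ordinary values and two NaNs; make_sample_set is a helper invented for the example.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <set>

// Same idea as safe_double_less: order NaN after every other value.
struct nan_last_less {
    bool operator()(double left, double right) const {
        const bool left_nan = std::isnan(left);
        const bool right_nan = std::isnan(right);
        if (left_nan != right_nan) {
            return right_nan;  // a non-NaN value sorts before NaN
        }
        return left < right;   // false for two NaNs, so they are equivalent
    }
};

// Hypothetical helper that inserts 2.0, NaN, 1.0 and NaN again.
std::set<double, nan_last_less> make_sample_set() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    std::set<double, nan_last_less> s;
    s.insert(2.0);
    s.insert(nan);
    s.insert(1.0);
    s.insert(nan);
    return s;
}
```

The resulting set holds three elements: 1.0, 2.0 and a single NaN at the end, since all NaNs are equivalent under this ordering.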
Next, I would advise using a std::multimap or std::multiset, because you should be expecting multiple values for each lookup. You might as well make content management an everyday thing, instead of a corner case, to increase the test coverage of your code. (I would rarely recommend these containers) Plus this blocks operator[], which is not advised to be used when you are using floating point keys.
The point where you want to use an epsilon is when you query the container. Instead of using the direct interface, create a helper function like this:
// works on both `const` and non-`const` associative containers:
template<class Container>
auto my_equal_range( Container&& container, double target, double epsilon = 0.00001 )
-> decltype( container.equal_range(target) )
{
auto lower = container.lower_bound( target-epsilon );
auto upper = container.upper_bound( target+epsilon );
return std::make_pair(lower, upper);
}
which works on both std::map and std::set (and multi versions).
(In a more modern code base, I'd expect a range<?> object that is a better thing to return from an equal_range function. But for now, I'll make it compatible with equal_range).
This finds a range of things whose keys are "sufficiently close" to the one you are asking for, while the container maintains its ordering guarantees internally and doesn't execute undefined behavior.
To test for existence of a key, do this:
template<typename Container>
bool key_exists( Container const& container, double target, double epsilon = 0.00001 ) {
auto range = my_equal_range(container, target, epsilon);
return range.first != range.second;
}
and if you want to delete/replace entries, you should deal with the possibility that there might be more than one entry hit.
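For deletion, one way to handle several hits at once is to erase the whole range. This is a sketch under the same assumptions as the helpers above (an ordered associative container with double keys); erase_near is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Erase every entry whose key lies within epsilon of target.
// Returns how many entries were removed.
template <class Container>
std::size_t erase_near(Container& container, double target, double epsilon) {
    assert(epsilon >= 0.0);
    auto lower = container.lower_bound(target - epsilon);
    auto upper = container.upper_bound(target + epsilon);
    const std::size_t removed =
        static_cast<std::size_t>(std::distance(lower, upper));
    container.erase(lower, upper);
    return removed;
}
```

On a std::multimap<double, int> holding the keys 1.0, 1.000001 and 2.0, erase_near(m, 1.0, 1e-5) removes the first two entries and leaves the one at 2.0.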
The shorter answer is "don't use floating point values as keys for std::set and std::map", because it is a bit of a hassle.
If you do use floating point keys for std::set or std::map, almost certainly never do a .find or a [] on them, as that is highly highly likely to be a source of bugs. You can use it for an automatically sorted collection of stuff, so long as exact order doesn't matter (ie, that one particular 1.0 is ahead or behind or exactly on the same spot as another 1.0). Even then, I'd go with a multimap/multiset, as relying on collisions or lack thereof is not something I'd rely upon.
Reasoning about the exact value of IEEE floating point values is difficult, and fragility of code relying on it is common.
Here's a simplified example of how using soft-compare (aka epsilon or almost equal) can lead to problems.
Let epsilon = 2 for simplicity. Put 1 and 4 into your map. It now might look like this:
1
 \
  4
So 1 is the tree root.
Now put in the numbers 2, 3, 4 in that order. Each will replace the root, because it compares equal to it. So then you have
4
 \
  4
which is already broken. (Assume no attempt to rebalance the tree is made.) We can keep going with 5, 6, 7:
7
 \
  4
and this is even more broken, because now if we ask whether 4 is in there, it will say "no", and if we ask for an iterator for values less than 7, it won't include 4.
Though I must say that I've used maps based on this flawed fuzzy compare operator numerous times in the past, and whenever I dug up a bug, it was never due to this. This is because datasets in my application areas never actually amount to stress-testing this problem.
As Naszta says, you can implement your own comparison function. What he leaves out is the key to making it work - you must make sure that the function always returns false for any values that are within your tolerance for equivalence.
return (std::fabs(left - right) > epsilon) && (left < right);
Edit: as pointed out in many comments to this answer and others, there is a possibility for this to turn out badly if the values you feed it are arbitrarily distributed, because you can't guarantee that !(a<b) and !(b<c) results in !(a<c). This would not be a problem in the question as asked, because the numbers in question are clustered around 0.1 increments; as long as your epsilon is large enough to account for all possible rounding errors but is less than 0.05, it will be reliable. It is vitally important that the keys to the map are never closer than 2*epsilon apart.
You could implement your own compare function.
#include <cmath>
#include <map>
// std::binary_function was removed in C++17 and std::map does not need it
class own_double_less
{
public:
    own_double_less( double arg_ = 1e-7 ) : epsilon(arg_) {}
    bool operator()( const double &left, const double &right ) const
    {
        // you can choose other way to make decision
        // (The original version is: return left < right;)
        // std::fabs rather than abs, so the floating-point overload is used
        return (std::fabs(left - right) > epsilon) && (left < right);
    }
    double epsilon;
};
// your map:
std::map<double,double,own_double_less> mymap;
Updated: see Item 40 in Effective STL!
Updated based on suggestions.
Using doubles as keys is not useful. As soon as you do any arithmetic on the keys you are not sure what exact values they have and hence cannot use them for indexing the map. The only sensible usage would be that the keys are constant.