Tolerant key lookup in std::map - c++

Requirements:
container which sorts itself based on numerically comparing the keys (e.g. std::map)
check existence of key based on float tolerance (e.g. map.find() and use custom comparator )
and the tricky one: the float tolerance used by the comparator may be changed by the user at runtime!
The first 2 can be accomplished using a map with a custom comparator:
struct floatCompare : public std::binary_function<float,float,bool>
{
bool operator()( const float &left, const float &right ) const
{
return (fabs(left - right) > 1e-3) && (left < right);
}
};
typedef std::map< float, float, floatCompare > floatMap;
Using this implementation, floatMap.find( 15.0001 ) will find 15.0 in the map.
However, let's say the user doesn't want a float tolerance of 1e-3.
What is the easiest way to make this comparator function use a variable tolerance at runtime? I don't mind re-creating and re-sorting the map based on the new comparator each time epsilon is updated.
Other posts on modification after initialization here and using floats as keys here didn't provide a complete solution.

You can't change the ordering of the map after it's created (and you should just use plain old operator< even for the floating point type here), and you can't even use a "tolerant" comparison operator, as that may violate the strict weak ordering that map requires to maintain its state.
However you can do the tolerant search with lower_bound and upper_bound. The gist is that you would create a wrapper function much like equal_range that does a lower_bound for "value - tolerance" and then an upper_bound for "value + tolerance" and see if it creates a non-empty range of values that match the criteria.

You cannot change the definition of how elements are ordered in a map once it's been instantiated. If you were to find some technical hack to do so (such as implementing a custom comparator that takes a tolerance that can change at runtime), it would evoke Undefined Behavior.
Your main alternative to changing the ordering is to create another map with a different ordering scheme. This other map could be an indexing map, where the keys are ordered in a different way, and the values aren't the elements themselves, but an index into the main map.
Alternatively maybe what you're really trying to do isn't change the ordering, but maintain the ordering and change the search parameters.
That you can do, and there are a few ways to do it.
One is to simply use map::lower_bound -- once with the lower bound of your tolerance, and once with the upper bound of your tolerance, just past the end of tolerance. For example, if you want to find 15.0 with a tolerance of 1e-5, you could call lower_bound with 14.99999 and then again with 15.00001 to find the elements in that range.
Another is to use std::find_if with a custom functor, lambda, or std::function. You could declare the functor in such a way as to take the tolerance and the value at construction, and perform the check in operator().
Since this is a homework question, I'll leave the fiddly details of actually implementing all this up to you. :)
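For reference, the std::find_if variant could be sketched roughly as follows. The names WithinTolerance and find_within are made up for this illustration, and the map type is assumed to be std::map<double, double>. Note that this is a linear scan, so it trades the O(log n) lookup of the lower_bound approach for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <map>
#include <utility>

// Hypothetical functor: matches map entries whose key lies within
// `tolerance` of `target`. Both values are fixed at construction.
struct WithinTolerance {
    double target;
    double tolerance;

    WithinTolerance(double t, double tol) : target(t), tolerance(tol) {
        assert(tol >= 0.0);  // a negative tolerance would match nothing
    }

    bool operator()(const std::pair<const double, double>& entry) const {
        return std::fabs(entry.first - target) <= tolerance;
    }
};

// Linear scan over the map; returns end() when no key is close enough.
inline std::map<double, double>::const_iterator
find_within(const std::map<double, double>& m, double target, double tolerance) {
    return std::find_if(m.begin(), m.end(), WithinTolerance(target, tolerance));
}
```

Because the tolerance is an ordinary constructor argument, it can change on every call without touching the map's ordering.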

Rather than using a comparator with tolerance, which is going to fail in subtle ways, just use a consistent key that is derived from the floating point value. Make your floating point values consistent using rounding.
inline double key(double d)
{
return floor(d * 1000.0 + 0.5);
}
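A minimal sketch of how such a rounded key might be used, assuming three decimal places of precision and an integer key type so that map comparisons are exact. The names rounded_key, RoundedMap, put and contains are invented for this example. std::llround rounds halves away from zero, which differs slightly from floor(d * 1000.0 + 0.5) for negative half-way values.

```cpp
#include <cassert>
#include <cmath>
#include <map>

// Same rounding idea as key() above, but returning an integer so the
// map key is exact. The scale of 1000 (three decimals) is an assumption.
inline long long rounded_key(double d) {
    return std::llround(d * 1000.0);
}

typedef std::map<long long, double> RoundedMap;

// Stores `v` under the rounded form of `k`.
inline void put(RoundedMap& m, double k, double v) {
    m[rounded_key(k)] = v;
}

// True if some value was stored under the same rounded key.
inline bool contains(const RoundedMap& m, double k) {
    return m.find(rounded_key(k)) != m.end();
}
```

One caveat of this design: two values that are very close but fall on opposite sides of a rounding boundary (say 15.0004 and 15.0006) get different keys, so this is bucketing rather than a true tolerance. Changing the precision at runtime also means rebuilding the map with the new scale.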

You can't achieve that with a simple custom comparator, even if it was possible to change it after the definition, or when resorting using a new comparator. The fact is: a "tolerant comparator" is not really a comparator. For three values, it's possible that a < c (difference is large enough) but neither a < b nor b < c (both difference too small). Example: a = 5.0, b = 5.5, c = 6.0, tolerance = 0.6
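The example values above can be checked directly. The sketch below uses two helper names invented here, tolerant_less and equivalent; the latter expresses how an ordered container decides that two keys are "the same" (neither is less than the other).

```cpp
#include <cassert>
#include <cmath>

// The "tolerant" comparison from the question, with the tolerance
// passed as a parameter instead of being hard-coded.
inline bool tolerant_less(double left, double right, double tolerance) {
    return (std::fabs(left - right) > tolerance) && (left < right);
}

// Ordered containers treat two keys as equivalent when neither
// compares less than the other.
inline bool equivalent(double x, double y, double tolerance) {
    return !tolerant_less(x, y, tolerance) && !tolerant_less(y, x, tolerance);
}
```

With a = 5.0, b = 5.5, c = 6.0 and a tolerance of 0.6, equivalent(a, b) and equivalent(b, c) both hold, yet tolerant_less(a, c) is true. Equivalence is not transitive, so the strict weak ordering requirement is violated.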
What you should do instead is to use default sorting using operator< for floats, i.e. simply don't provide any custom comparator. Then, for the lookup don't use find but rather lower_bound and upper_bound with modified values according to the tolerance. These two function calls will give you two iterators which define the sequence which will be accepted using this tolerance. If this sequence is empty, the key was not found, obviously.
You then might want to get the key which is closest to the value to be searched for. If this is true, you should then find the min_element of this subsequence, using a comparator which will consider the difference between the key and the value to be searched.
template<typename Map, typename K>
auto tolerant_find(const Map & map, const K & lookup, const K & tolerance) -> decltype(map.begin()) {
// First, find sub-sequence of keys "near" the lookup value
auto first = map.lower_bound(lookup - tolerance);
auto last = map.upper_bound(lookup + tolerance);
// If they are equal, the sequence is empty, and thus no entry was found.
// Return the end iterator to be consistent with std::find.
if (first == last) {
return map.end();
}
// Then, find the one with the minimum distance to the actual lookup value
typedef typename Map::mapped_type T;
return std::min_element(first, last, [lookup](std::pair<K,T> a, std::pair<K,T> b) {
return std::abs(a.first - lookup) < std::abs(b.first - lookup);
});
}
Demo: http://ideone.com/qT3JIa

It may be better to leave the std::map class alone (well, partly at least), and just write your own class which implements the three methods you mentioned.
template<typename T>
class myMap{
private:
float tolerance;
std::map<float,T> storage;
public:
void setTolerance(float t){tolerance=t;};
typename std::map<float,T>::iterator find(float val); // ex. same as you provided, just change 1e-3 for tolerance
/* other methods go here */
};
That being said, I don't think you need to recreate the container and sort it depending on the tolerance.
check existence of key based on float tolerance
merely means you have to check if an element exists. The position of the elements inside the map shouldn't change. You could start the search from val-tolerance using lower_bound, which returns an iterator to the first element not less than that value, and then step through the next elements until you reach the end of the map or until their values exceed val+tolerance.
That basically means that the behavior of the insert/add/[]/whatever functions isn't based on the tolerance, so there's no real problem of storing the values.
If you're afraid the elements will be too close to each other, you may want to start searching from val, and then gradually increase the tolerance until it reaches the one the user wants.
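A rough sketch of such a wrapper, under the assumption that keys stay ordered by plain operator< and only the lookup consults the tolerance. The class name TolerantMap and its insert and end members are invented for this illustration; find returns the first match rather than the closest one.

```cpp
#include <cassert>
#include <map>

// Hypothetical wrapper: storage is an ordinary std::map<float, T>,
// and the runtime tolerance only affects find().
template <typename T>
class TolerantMap {
public:
    typedef typename std::map<float, T>::const_iterator const_iterator;

    explicit TolerantMap(float tolerance = 1e-3f) : tolerance_(tolerance) {}

    void setTolerance(float t) {
        assert(t >= 0.0f);
        tolerance_ = t;
    }

    void insert(float key, const T& value) { storage_[key] = value; }

    const_iterator end() const { return storage_.end(); }

    // Returns the first stored key in [val - tolerance, val + tolerance],
    // or end() if there is none.
    const_iterator find(float val) const {
        const_iterator it = storage_.lower_bound(val - tolerance_);
        if (it != storage_.end() && it->first <= val + tolerance_) {
            return it;
        }
        return storage_.end();
    }

private:
    float tolerance_;
    std::map<float, T> storage_;
};
```

Since the stored order never depends on the tolerance, setTolerance can be called at any time without rebuilding the container.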

Related

Sorting a vector where tie-breaker elements are lazily computed

I want to sort a vector of structs by a primary field and use a secondary field as a tie-breaker. The normal way would be this:
struct element {
int primary;
int secondary;
};
bool comparator(const element& e1, const element& e2) {
if (e1.primary != e2.primary) {
return e1.primary < e2.primary;
}
return e1.secondary < e2.secondary;
}
But the secondary data is expensive to compute. As it is only needed when the primary values are equal, I want to compute it lazily.
It seems the only place I can do this lazy evaluation is within the comparator itself. Something like:
bool comparator(const element& e1, const element& e2) {
if (e1.primary != e2.primary) {
return e1.primary < e2.primary;
}
return e1.computeSecondary() < e2.computeSecondary();
}
While this will avoid evaluating the secondary for the cases when the primary values are different, it will end up recomputing the secondary values for the same element each time it is compared with another element. The data I want to sort is long tailed with something like 30% of values equal to 1, 20% equal to 2, 5% equal to 3, and lower % for higher values. So, there will be fair number of cases where the secondary element will get computed, and not storing the computed values could result in them being recomputed too many times.
So, I would like the secondary values to be evaluated at most once per element. But the comparator takes const ref arguments, so it can't modify the secondary value of the element. How can this be achieved?
Possible options are, in a nutshell.
Declare secondary mutable.
Use const_cast in comparator.
Use const_cast in computeSecondary.
Create a simple Lazy template class that holds either a value or a thunk; when asked for its value, it forces the thunk if it hasn't been evaluated yet and returns the result (or returns it immediately if it is already known). This does not take long to write. Then declare secondary as being of type Lazy<int>.
Or rather do not reinvent the wheel and use std::future that is actually that very Lazy template (in one case).
Or anything else, one can create more approaches.
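The first option (a mutable cache inside the element) might look like the sketch below. The payload field, the body of computeSecondary and the compute_calls counter are stand-ins invented for this example; the real expensive computation would go in computeSecondary.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

static int compute_calls = 0;  // instrumentation for this example only

struct element {
    int primary;
    int payload;  // hypothetical input to the expensive computation

    element(int p, int data)
        : primary(p), payload(data), cached_(false), secondary_(0) {}

    // Computes the tie-breaker on first use, then returns the cached value.
    int secondary() const {
        if (!cached_) {
            secondary_ = computeSecondary();
            cached_ = true;
        }
        return secondary_;
    }

private:
    // Stand-in for the expensive computation.
    int computeSecondary() const {
        ++compute_calls;
        return payload;
    }

    mutable bool cached_;    // mutable: updated from a const member function
    mutable int secondary_;
};

inline bool comparator(const element& e1, const element& e2) {
    if (e1.primary != e2.primary) {
        return e1.primary < e2.primary;
    }
    return e1.secondary() < e2.secondary();
}
```

The cache lives inside the element, so it travels with the element when the sort moves it around. This sketch is not thread-safe; a parallel sort would need synchronization around the cache.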

Efficient way to hash a 2D point

OK, so the task is this, I would be given (x, y) co-ordinates of points with both (x, y) ranging from -10^6 to 10^6 inclusive. I have to check whether a particular point e.g. (x, y) tuple was given to me or not. In simple words how do i answer the query whether a particular point(2D) is set or not. So far the best i could think of is maintaining a std::map<std::pair<int,int>, bool> and whenever a point is given I mark it 1. Although this must be running in logarithmic time and is fairly optimized way to answer the query I am wondering if there's a better way to do this.
Also I would be glad if anyone could tell me what the actual complexity would be if I am using the above data structure as a hash. I mean, is the complexity of std::map going to be O(log N) in the number of elements present, irrespective of the structure of the key?
In order to use a hash map you need to be using std::unordered_map instead of std::map. The constraint of using this is that your value type needs to have a hash function defined for it as described in this answer. Either that or just use boost::hash for this:
std::unordered_map<std::pair<int, int>, bool, boost::hash<std::pair<int, int> > > map_of_pairs;
Another method which springs to mind is to store the 32 bit int values in a 64 bit integer like so:
uint64_t i64;
uint32_t a32, b32;
i64 = ((uint64_t)a32 << 32) | b32;
This is described in this answer. The x and y components can be stored in the high and low 32 bits of the integer and then you can use a std::unordered_map<uint64_t, bool>. Although I'd be interested to know if this is any more efficient than the previous method or if it even produces different code.
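Since the coordinates in the question can be negative, the packing needs a little care. The following is a sketch with made-up helper names (pack_point, PointSet, add_point, has_point), using a set instead of a map to bool because only membership matters.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Packs two 32-bit signed coordinates into one 64-bit key.
// Casting through uint32_t first keeps a negative y from
// sign-extending into the upper half that holds x.
inline std::uint64_t pack_point(std::int32_t x, std::int32_t y) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(x)) << 32) |
           static_cast<std::uint64_t>(static_cast<std::uint32_t>(y));
}

typedef std::unordered_set<std::uint64_t> PointSet;

inline void add_point(PointSet& s, std::int32_t x, std::int32_t y) {
    s.insert(pack_point(x, y));
}

inline bool has_point(const PointSet& s, std::int32_t x, std::int32_t y) {
    return s.count(pack_point(x, y)) != 0;
}
```

The standard library already provides std::hash for std::uint64_t, so no custom hash function is needed with this representation.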
Instead of mapping each point to a bool, why not store all the points given to you in a set? Then, you can simply search the set to see if it contains the point you are looking for. It is essentially the same as what you are doing without having to do an additional lookup of the associated bool. For example:
set<pair<int, int>> points;
Then, you can check whether the set contains a certain point or not like this :
pair<int, int> examplePoint = make_pair(0, 0);
set<pair<int, int>>::iterator it = points.find(examplePoint);
if (it == points.end()) {
// examplePoint not found
} else {
// examplePoint found
}
As mentioned, std::set is normally implemented as a balanced binary search tree, so each lookup would take O(log n) time.
If you wanted to use a hash table instead, you could do the same thing using std::unordered_set instead of std::set. Assuming you use a good hash function, this would speed your lookups up to O(1) time. However, in order to do this, you will have to define the hash function for pair<int, int>. Here is an example taken from this answer:
namespace std {
template <> struct hash<std::pair<int, int>> {
inline size_t operator()(const std::pair<int, int> &v) const {
std::hash<int> int_hasher;
return int_hasher(v.first) ^ int_hasher(v.second);
}
};
}
Edit: Nevermind, I see you already got it working!
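Two caveats about the hash above are worth noting. A plain XOR is symmetric, so (a, b) and (b, a) always collide and every (a, a) hashes to 0. Also, specializing std::hash for a type made only of standard and built-in types is not something the standard permits, as far as I know. A sketch of an alternative, using a separate functor (PairHash and PointHashSet are names invented here) and an order-dependent mix in the style of boost::hash_combine:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>
#include <utility>

// Hypothetical hash functor for std::pair<int, int>, passed as a template
// argument instead of specializing std::hash.
struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const {
        const std::size_t h1 = std::hash<int>()(p.first);
        const std::size_t h2 = std::hash<int>()(p.second);
        // Order-dependent mix, so swapping the coordinates changes the input
        // to the final XOR.
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};

typedef std::unordered_set<std::pair<int, int>, PairHash> PointHashSet;
```

Lookups work as with any other unordered_set, for example points.count(std::make_pair(x, y)).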

Floating point keys in std::map

The following code is supposed to find the key 3.0 in a std::map, where it exists. But due to floating point precision it won't be found.
map<double, double> mymap;
mymap[3.0] = 1.0;
double t = 0.0;
for(int i = 0; i < 31; i++)
{
t += 0.1;
bool contains = (mymap.count(t) > 0);
}
In the above example, contains will always be false.
My current workaround is just multiply t by 0.1 instead of adding 0.1, like this:
for(int i = 0; i < 31; i++)
{
t = 0.1 * i;
bool contains = (mymap.count(t) > 0);
}
Now the question:
Is there a way to introduce a fuzzyCompare to the std::map if I use double keys?
The common solution for floating point number comparison is usually something like a-b < epsilon. But I don't see a straightforward way to do this with std::map.
Do I really have to encapsulate the double type in a class and overwrite operator<(...) to implement this functionality?
So there are a few issues with using doubles as keys in a std::map.
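First, NaN is a problem: it is unordered with respect to every value, including itself, so every operator< comparison involving it is false, which breaks the strict weak ordering the container relies on. If there is any chance of NaN being inserted, use this: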
struct safe_double_less {
bool operator()(double left, double right) const {
bool leftNaN = std::isnan(left);
bool rightNaN = std::isnan(right);
if (leftNaN != rightNaN)
return leftNaN<rightNaN;
return left<right;
}
};
but that may be overly paranoid. Do not, I repeat do not, include an epsilon threshold in your comparison operator you pass to a std::set or the like: this will violate the ordering requirements of the container, and result in unpredictable undefined behavior.
(I placed NaN as greater than all doubles, including +inf, in my ordering, for no good reason. Less than all doubles would also work).
So either use the default operator<, or the above safe_double_less, or something similar.
Next, I would advise using a std::multimap or std::multiset, because you should be expecting multiple values for each lookup. You might as well make content management an everyday thing, instead of a corner case, to increase the test coverage of your code. (I would rarely recommend these containers) Plus this blocks operator[], which is not advised to be used when you are using floating point keys.
The point where you want to use an epsilon is when you query the container. Instead of using the direct interface, create a helper function like this:
// works on both `const` and non-`const` associative containers:
template<class Container>
auto my_equal_range( Container&& container, double target, double epsilon = 0.00001 )
-> decltype( container.equal_range(target) )
{
auto lower = container.lower_bound( target-epsilon );
auto upper = container.upper_bound( target+epsilon );
return std::make_pair(lower, upper);
}
which works on both std::map and std::set (and multi versions).
(In a more modern code base, I'd expect a range<?> object that is a better thing to return from an equal_range function. But for now, I'll make it compatible with equal_range).
This finds a range of things whose keys are "sufficiently close" to the one you are asking for, while the container maintains its ordering guarantees internally and doesn't execute undefined behavior.
To test for existence of a key, do this:
template<typename Container>
bool key_exists( Container const& container, double target, double epsilon = 0.00001 ) {
auto range = my_equal_range(container, target, epsilon);
return range.first != range.second;
}
and if you want to delete/replace entries, you should deal with the possibility that there might be more than one entry hit.
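As a rough illustration applied to the loop from the question, the sketch below counts how many accumulated values of t land near a stored key, and erases every entry near a target. The names DoubleMap, count_hits and erase_near are invented for this example, and the map type is assumed to be std::map<double, double>.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

typedef std::map<double, double> DoubleMap;

// Repeats the question's loop (t += 0.1, 31 times) and counts the
// iterations for which some stored key lies within epsilon of t.
inline int count_hits(const DoubleMap& m, double epsilon) {
    int hits = 0;
    double t = 0.0;
    for (int i = 0; i < 31; ++i) {
        t += 0.1;
        DoubleMap::const_iterator lower = m.lower_bound(t - epsilon);
        DoubleMap::const_iterator upper = m.upper_bound(t + epsilon);
        if (lower != upper) {
            ++hits;
        }
    }
    return hits;
}

// Erases every entry within epsilon of target; returns how many were removed.
inline std::size_t erase_near(DoubleMap& m, double target, double epsilon) {
    DoubleMap::iterator lower = m.lower_bound(target - epsilon);
    DoubleMap::iterator upper = m.upper_bound(target + epsilon);
    const std::size_t removed =
        static_cast<std::size_t>(std::distance(lower, upper));
    m.erase(lower, upper);
    return removed;
}
```

erase_near handles the "more than one entry hit" case by removing the whole range rather than a single iterator.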
The shorter answer is "don't use floating point values as keys for std::set and std::map", because it is a bit of a hassle.
If you do use floating point keys for std::set or std::map, almost certainly never do a .find or a [] on them, as that is highly highly likely to be a source of bugs. You can use it for an automatically sorted collection of stuff, so long as exact order doesn't matter (ie, that one particular 1.0 is ahead or behind or exactly on the same spot as another 1.0). Even then, I'd go with a multimap/multiset, as relying on collisions or lack thereof is not something I'd rely upon.
Reasoning about the exact value of IEEE floating point values is difficult, and fragility of code relying on it is common.
Here's a simplified example of how using soft-compare (aka epsilon or almost equal) can lead to problems.
Let epsilon = 2 for simplicity. Put 1 and 4 into your map. It now might look like this:
1
\
4
So 1 is the tree root.
Now put in the numbers 2, 3, 4 in that order. Each will replace the root, because it compares equal to it. So then you have
4
\
4
which is already broken. (Assume no attempt to rebalance the tree is made.) We can keep going with 5, 6, 7:
7
\
4
and this is even more broken, because now if we ask whether 4 is in there, it will say "no", and if we ask for an iterator for values less than 7, it won't include 4.
Though I must say that I've used maps based on this flawed fuzzy compare operator numerous times in the past, and whenever I dug up a bug, it was never due to this. This is because datasets in my application areas never actually amount to stress-testing this problem.
As Naszta says, you can implement your own comparison function. What he leaves out is the key to making it work - you must make sure that the function always returns false for any values that are within your tolerance for equivalence.
return (std::fabs(left - right) > epsilon) && (left < right);
Edit: as pointed out in many comments to this answer and others, there is a possibility for this to turn out badly if the values you feed it are arbitrarily distributed, because you can't guarantee that !(a<b) and !(b<c) results in !(a<c). This would not be a problem in the question as asked, because the numbers in question are clustered around 0.1 increments; as long as your epsilon is large enough to account for all possible rounding errors but is less than 0.05, it will be reliable. It is vitally important that the keys to the map are never closer than 2*epsilon apart.
You could implement your own compare function.
#include <cmath>
#include <map>
// std::binary_function was deprecated in C++11 and removed in C++17;
// std::map does not need it, so the base class is omitted here.
class own_double_less
{
public:
own_double_less( double arg_ = 1e-7 ) : epsilon(arg_) {}
bool operator()( const double &left, const double &right ) const
{
// you can choose other way to make decision
// (The original version is: return left < right;)
// std::fabs avoids accidentally picking the integer abs()
return (std::fabs(left - right) > epsilon) && (left < right);
}
double epsilon;
};
// your map:
std::map<double,double,own_double_less> mymap;
Updated: see Item 40 in Effective STL!
Updated based on suggestions.
Using doubles as keys is not useful. As soon as you do any arithmetic on the keys, you cannot be sure what exact values they have and hence cannot use them for indexing the map. The only sensible usage would be that the keys are constant.

caching multiple key hash

I want to do some caching in my project.
Suppose my API is int foo(int a, float b, float c, int d, char e)
Now in my project, there are a lot of calls to the above time-consuming API with repeating values of a, b, c, d and e. I want to store the return value of this function with these arguments as keys.
suppose my call sequence is
foo(23, 3.45, 4.5, 90, 'd') // returns 1000, so I need to store it in cache as (23,3.45, 4.5, 90, 'd')->1000
foo(30, 1.2, 3.5, 100, 'e') // returns 2000, so I need to store it in cache as (30, 1.2, 3.5, 100, 'e')->2000
foo(23, 3.45, 4.5, 90, 'd') // No need to call this API, I just check in my cache value associated with
//(23, 3.45, 4.5, 90, 'd'), which is already stored as 1000
What would be the best strategy to implement the above in C++? Which data structure would be best for the cache table?
One key note: caching is difficult.
Often times people think that caching will solve all their issues, but they forget to take into account the issues that it brings to the table. An unmanaged cache is nothing else than a giant memory leak. Two strategies of note:
Size limit: whenever the cache is full, adding a new entry cause another entry to be evicted (you therefore need a scheme to decide when to evict an entry)
Time limit: entries are flushed out after a certain time elapsed
Usually, when we hear about caches we think of an LRU (Least Recently Used) cache. Those caches are limited by size, and the least recently used entry is evicted when the cache is full. Note: this might cause contention under multi-threading because read-only accesses in fact imply modifying a value.
Such a cache is implemented in terms of two elements:
A (key -> value) mapping, either using a tree or a hash-map
A priority list, which is interleaved within the nodes for efficiency
If you go down this road, I would suggest using the Boost.MultiIndex library. There is an example of an MRU implementation which is very similar to your needs.
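Without Boost, the two elements described above (a key -> value mapping plus a recency list) can be sketched with std::list and std::unordered_map. The class name LruCache and its interface are invented for this illustration; it assumes the key type has a std::hash specialization, so a struct of foo's five arguments would need its own hash functor.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Minimal LRU cache sketch: a list ordered from most to least recently
// used, plus a hash map from each key to its node in that list.
template <typename Key, typename Value>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns a pointer to the cached value, or nullptr on a miss.
    // A hit moves the entry to the front, so even reads mutate the cache.
    const Value* get(const Key& key) {
        typename Index::iterator it = index_.find(key);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

    void put(const Key& key, const Value& value) {
        typename Index::iterator it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = value;
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);  // evict least recently used
            order_.pop_back();
        }
        order_.push_front(std::make_pair(key, value));
        index_[key] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    typedef std::list<std::pair<Key, Value> > Order;
    typedef std::unordered_map<Key, typename Order::iterator> Index;

    std::size_t capacity_;
    Order order_;
    Index index_;
};
```

std::list::splice relinks a node without invalidating iterators, which is what lets the index keep pointing at the right entries after each access.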
If you can use boost, look at boost::unordered_map, otherwise you can use a std::map. You will have to provide functor to generate the key.
It doesn't always work and is somewhat compiler dependent, but you can look into using function attributes. Of interest to you might be the const or pure attributes. hot might also be of interest.
Nice question. You have several options. First of all, put all the values into an struct:
struct values
{
int a;
float b;
...
};
If one of the values of the sequence is most representative, you can just use a std::map to map that representative value to a "bucket". Let's say that the most representative is the float b:
std::map< float, std::list < std::pair< values, int> > >
The bucket is represented by the std::list, and stores pairs of value structures and result value (int in this case).
Alternatively, declare a map from the values to the result, int. For that, you should allow the values struct to be compared against others in the map, so you have to write the operator<():
bool operator<(values const& left, values const& right)
{
if (left.a < right.a) ... // compare two values objects
}
and then declare the map as usual:
std::map<values, int>
There are other questions, such as copy constructors, etc. that you have to deal with, but this is the idea.
Final note, you can also substitute std::map for unordered_map.
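One way to write that comparison is with std::tie, which compares the fields lexicographically. The sketch below assumes the five-field struct from the question; the body of foo, the foo_calls counter and the cached_foo wrapper are placeholders invented for this example.

```cpp
#include <cassert>
#include <map>
#include <tuple>
#include <utility>

struct values {
    int a;
    float b;
    float c;
    int d;
    char e;
};

// Lexicographic comparison over all five fields.
inline bool operator<(const values& left, const values& right) {
    return std::tie(left.a, left.b, left.c, left.d, left.e) <
           std::tie(right.a, right.b, right.c, right.d, right.e);
}

static int foo_calls = 0;  // instrumentation for this example only

// Stand-in for the expensive API (hypothetical body).
inline int foo(int a, float b, float c, int d, char e) {
    ++foo_calls;
    return a + d + static_cast<int>(b + c) + e;
}

// Looks the arguments up in the cache and calls foo only on a miss.
inline int cached_foo(int a, float b, float c, int d, char e) {
    static std::map<values, int> cache;
    const values key = {a, b, c, d, e};
    std::map<values, int>::iterator it = cache.find(key);
    if (it != cache.end()) {
        return it->second;
    }
    const int result = foo(a, b, c, d, e);
    cache.insert(std::make_pair(key, result));
    return result;
}
```

The float fields are compared exactly, so only bit-identical arguments hit the cache, and a NaN argument would break the ordering. The cache here is unbounded, which is the unmanaged case warned about earlier.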
Put them all in a structure
struct mykey{ int a; float b; float c; int d; char e; };
Then write them in and hash the structure, and use it as a key
int foo(int a, float b, float c, int d, char e)
{
mykey tk = { a, b, c, d, e };
guid key = md5( &tk, sizeof( tk ) );
I'd use nested maps, so you use the first parameter to lookup a map from a map, until the final map where you lookup using the last parameter and the result is the previously cached value of foo.
When you arrive to the last map and find that foo hasn't been called for this setup of parameters, you only need to store the result of foo for the last parameter.
I suggest using a hash table. You will only need to calculate a hash function of the data. If the hash is strong enough, it is possible to store just the hash and the output value, without storing the arguments. Also, this method should work faster than using std::map.
In C++ this can be implemented with std::unordered_map (or the older, pre-standard hash_map extension that some compilers ship).
Using a very simple hash function will suffice, for example the string hash function.
By the way, the method of storing output values for arguments is called memoization.

Is the unordered_map really unordered?

I am very confused by the name 'unordered_map'. The name suggests that the keys are not ordered at all. But I always thought they are ordered by their hash value. Or is that wrong (because the name implies that they are not ordered)?
Or to put it different: Is this
typedef map<K, V, HashComp<K> > HashMap;
with
template<typename T>
struct HashComp {
bool operator()(const T& v1, const T& v2) const {
return hash<T>()(v1) < hash<T>()(v2);
}
};
the same as
typedef unordered_map<K, V> HashMap;
? (OK, not exactly, STL will complain here because there may be keys k1,k2 and neither k1 < k2 nor k2 < k1. You would need to use multimap and overwrite the equal-check.)
Or again differently: When I iterate through them, can I assume that the key-list is ordered by their hash value?
In answer to your edited question, no those two snippets are not equivalent at all. std::map stores nodes in a tree structure, unordered_map stores them in a hashtable*.
Keys are not stored in order of their "hash value" because they're not stored in any order at all. They are instead stored in "buckets" where each bucket corresponds to a range of hash values. Basically, the implementation goes like this:
function add_value(object key, object value) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
buckets[bucket_index] = new linked_list();
}
buckets[bucket_index].add(new key_value(key, value));
}
function get_value(object key) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
return null;
}
foreach(key_value kv in buckets[bucket_index]) {
if (kv.key == key) {
return kv.value;
}
}
}
Obviously that's a serious simplification and real implementation would be much more advanced (for example, supporting resizing the buckets array, maybe using a tree structure instead of linked list for the buckets, and so on), but that should give an idea of how you can't get back the values in any particular order. See wikipedia for more information.
* Technically, the internal implementation of std::map and unordered_map are implementation-defined, but the standard requires certain Big-O complexity for operations that implies those internal implementations
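The bucket structure is visible through std::unordered_map's own bucket interface (bucket, bucket_count, and the begin(n)/end(n) local iterators). The helper below, whose name found_in_own_bucket is made up for this example, scans only the bucket the container reports for a key; it assumes a non-empty map so that bucket_count() is nonzero.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Returns true if `key` is found by scanning only the bucket that the
// container reports for it. Requires m.bucket_count() > 0.
inline bool found_in_own_bucket(const std::unordered_map<std::string, int>& m,
                                const std::string& key) {
    const std::size_t b = m.bucket(key);
    for (auto it = m.begin(b); it != m.end(b); ++it) {
        if (it->first == key) {
            return true;
        }
    }
    return false;
}

// Sums the sizes of all buckets; this equals the number of elements.
inline std::size_t total_in_buckets(const std::unordered_map<std::string, int>& m) {
    std::size_t total = 0;
    for (std::size_t b = 0; b < m.bucket_count(); ++b) {
        total += m.bucket_size(b);
    }
    return total;
}
```

Which bucket a key lands in, and the order in which iteration visits buckets, depends on the hash function, the bucket count and the implementation, which is why no iteration order can be assumed.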
"Unordered" doesn't mean that there isn't a linear sequence somewhere in the implementation. It means "you can't assume anything about the order of these elements".
For example, people often assume that entries will come out of a hash map in the same order they were put in. But they don't, because the entries are unordered.
As for "ordered by their hash value": hash values are generally taken from the full range of integers, but hash maps don't have 2**32 slots in them. The hash value's range will be reduced to the number of slots by taking it modulo the number of slots. Further, as you add entries to a hash map, it might change size to accommodate the new values. This can cause all the previous entries to be re-placed, changing their order.
In an unordered data structure, you can't assume anything about the order of the entries.
As the name unordered_map suggests, no ordering is specified by the C++0x standard. An unordered_map's apparent ordering will be dependent on whatever is convenient for the actual implementation.
If you want an analogy, look at the RDBMS of your choice.
If you don't specify an ORDER BY clause when performing a query, the results are returned "unordered" - that is, in whatever order the database feels like. The order is not specified, and the system is free to "order" them however it likes in order to get the best performance.
You are right, unordered_map is actually hash ordered. Note that most current implementations (pre TR1) call it hash_map.
The IBM C/C++ compiler documentation remarks that if you have an optimal hash function, the number of operations performed during lookup, insertion, and removal of an arbitrary element does not depend on the number of elements in the sequence, so this means that the order is not so unordered...
Now, what does it mean that it is hash ordered? As a hash should be unpredictable, by definition you can't make any assumption about the order of the elements in the map. This is the reason why it has been renamed in TR1: the old name suggested an order. Now we know that an order is actually used, but you can disregard it as it is unpredictable.