Efficient way to hash a 2D point - c++

OK, so the task is this, I would be given (x, y) co-ordinates of points with both (x, y) ranging from -10^6 to 10^6 inclusive. I have to check whether a particular point e.g. (x, y) tuple was given to me or not. In simple words how do i answer the query whether a particular point(2D) is set or not. So far the best i could think of is maintaining a std::map<std::pair<int,int>, bool> and whenever a point is given I mark it 1. Although this must be running in logarithmic time and is fairly optimized way to answer the query I am wondering if there's a better way to do this.
Also I would be glad if anyone could tell what actually complexity would be if I am using the above data structure as a hash.I mean is it that the complexity of std::map is going to be O(log N) in the size of elements present irrespective of the structure of key?

In order to use a hash map you need to be using std::unordered_map instead of std::map. The constraint of using this is that your value type needs to have a hash function defined for it as described in this answer. Either that or just use boost::hash for this:
std::unordered_map<std::pair<int, int>, boost::hash<std::pair<int, int> > map_of_pairs;
Another method which springs to mind is to store the 32 bit int values in a 64 bit integer like so:
uint64_t i64;
uint32_t a32, b32;
i64 = ((uint64_t)a32 << 32) | b32;
As described in this answer. The x and y components can be stored in the high and low bytes of the integer and then you can use a std::unordered_map<uint64_t, bool>. Although I'd be interested to know if this is any more efficient than the previous method or if it even produces different code.

Instead of mapping each point to a bool, why not store all the points given to you in a set? Then, you can simply search the set to see if it contains the point you are looking for. It is essentially the same as what you are doing without having to do an additional lookup of the associated bool. For example:
set<pair<int, int>> points;
Then, you can check whether the set contains a certain point or not like this :
pair<int, int> examplePoint = make_pair(0, 0);
set<pair<int, int>>::iterator it = points.find(examplePoint);
if (it == points.end()) {
// examplePoint not found
} else {
// examplePoint found
}
As mentioned, std::set is normally implemented as a balanced binary search tree, so each lookup would take O(logn) time.
If you wanted to use a hash table instead, you could do the same thing using std::unordered_set instead of std::set. Assuming you use a good hash function, this would speed your lookups up to O(1) time. However, in order to do this, you will have to define the hash function for pair<int, int>. Here is an example taken from this answer:
namespace std {
template <> struct hash<std::pair<int, int>> {
inline size_t operator()(const std::pair<int, int> &v) const {
std::hash<int> int_hasher;
return int_hasher(v.first) ^ int_hasher(v.second);
}
};
}
Edit: Nevermind, I see you already got it working!

Related

Hash value for a std::unordered_map

According to the standard there's no support for containers (let alone unordered ones) in the std::hash class. So I wonder how to implement that. What I have is:
std::unordered_map<std::wstring, std::wstring> _properties;
std::wstring _class;
I thought about iterating the entries, computing the individual hashes for keys and values (via std::hash<std::wstring>) and concatenate the results somehow.
What would be a good way to do that and does it matter if the order in the map is not defined?
Note: I don't want to use boost.
A simple XOR was suggested, so it would be like this:
size_t MyClass::GetHashCode()
{
std::hash<std::wstring> stringHash;
size_t mapHash = 0;
for (auto property : _properties)
mapHash ^= stringHash(property.first) ^ stringHash(property.second);
return ((_class.empty() ? 0 : stringHash(_class)) * 397) ^ mapHash;
}
?
I'm really unsure if that simple XOR is enough.
Response
If by enough, you mean whether or not your function is injective, the answer is No. The reasoning is that the set of all hash values your function can output has cardinality 2^64, while the space of your inputs is much larger. However, this is not really important, because you can't have an injective hash function given the nature of your inputs. A good hash function has these qualities:
It's not easily invertible. Given the output k, it's not computationally feasible within the lifetime of the universe to find m such that h(m) = k.
The range is uniformly distributed over the output space.
It's hard to find two inputs m and m' such that h(m) = h(m')
Of course, the extents of these really depend on whether you want something that's cryptographically secure, or you want to take some arbitrary chunk of data and just send it some arbitrary 64-bit integer. If you want something cryptographically secure, writing it yourself is not a good idea. In that case, you'd also need the guarantee that the function is sensitive to small changes in the input. The std::hash function object is not required to be cryptographically secure. It exists for use cases isomorphic to hash tables. CPP Rerefence says:
For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().
I'll show below how your current solution doesn't really guarantee this.
Collisions
I'll give you a few of my observations on a variant of your solution (I don't know what your _class member is).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
std::hash<std::string> h;
std::size_t result = 0;
for (auto&& p : m) {
result ^= h(p.first) ^ h(p.second);
}
return result;
}
It's easy to generate collisions. Consider the following maps:
std::unordered_map<std::string, std::string> container0;
std::unordered_map<std::string, std::string> container1;
container0["123"] = "456";
container1["456"] = "123";
std::cout << hash_code(container0) << '\n';
std::cout << hash_code(container1) << '\n';
On my machine, compiling with g++ 4.9.1, this outputs:
1225586629984767119
1225586629984767119
The question as to whether this matters or not arises. What's relevant is how often you're going to have maps where keys and values are reversed. These collisions will occur between any two maps in which the sets of keys and values are the same.
Order of Iteration
Two unordered_map instances having exactly the same key-value pairs will not necessarily have the same order of iteration. CPP Rerefence says:
For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) == std::hash<Key>()(k2).
This is a trivial requirement for a hash function. Your solution avoids this because the order of iteration doesn't matter since XOR is commutative.
A Possible Solution
If you don't need something that's cryptographically secure, you can modify your solution slightly to kill the symmetry. This approach is okay in practice for hash tables and the like. This solution is also independent of the fact that order in an unordered_map is undefined. It uses the same property your solution used (Commutativity of XOR).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
const std::size_t prime = 19937;
std::hash<std::string> h;
std::size_t result = 0;
for (auto&& p : m) {
result ^= prime*h(p.first) + h(p.second);
}
return result;
}
All you need in a hash function in this case is a way to map a key-value pair to an arbitrary good hash value, and a way to combine the hashes of the key-value pairs using a commutative operation. That way, order does not matter. In the example hash_code I wrote, the key-value pair hash value is just a linear combination of the hash of the key and the hash of the value. You can construct something a bit more intricate, but there's no need for that.

In a C++ map, is there any way to search for the key given a value?

In a C++ std::map, is there any way to search for the key given the mapped value? Example:
I have this map:
map<int,string> myMap;
myMap[0] = "foo";
Is there any way that I can find the corresponding int, given the value "foo"?
cout << myMap.some_function("foo") <<endl;
Output: 0
std::map doesn't provide a (fast) way to find the key of a given value.
What you want is often called a "bijective map", or short "bimap". Boost has such a data structure. This is typically implemented by using two index trees "glued" together (where std::map has only one for the keys). Boost also provides the more general multi index with similar use cases.
If you don't want to use Boost, if storage is not a big problem, and if you can affort the extra code effort, you can simply use two maps and glue them together manually:
std::map<int, string> myMapForward;
std::map<string, int> myMapBackward; // maybe even std::set
// insertion becomes:
myMapForward.insert(std::make_pair(0, "foo"));
myMapBackward.insert(std::make_pair("foo", 0));
// forward lookup becomes:
myMapForwar[0];
// backward lookup becomes:
myMapBackward["foo"];
Of course you can wrap those two maps in a class and provide some useful interface, but this might be a bit overkill, and using two maps with the same content is not an optional solution anyways. As commented below, exception safety is also a problem of this solution. But in many applications it's already enough to simply add another reverse map.
Please note that since std::map stores unique keys, this approach will support backward lookup only for unique values, as collisions in the value space of the forward map correspond to collisions in the key space of the backward map.
No, not directly.
One option is to examine each value in the map until you find what you are looking for. This, obviously, will be O(n).
In order to do this you could just write a for() loop, or you could use std::find_if(). In order to use find_if(), you'll need to create a predicate. In C++11, this might be a lambda:
typedef std::map <unsigned, Student> MyMap;
MyMap myMap;
// ...
const string targetName = "Jones";
find_if (myMap.begin(), myMap.end(), [&targetName] (const MyMap::value_type& test)
{
if (test.second.mName == targetName)
return true;
});
If you're using C++03, then this could be a functor:
struct MatchName
: public std::unary_function <bool, MyMap::value_type>
{
MatchName (const std::string& target) : mTarget (target) {}
bool operator() (const MyMap::value_type& test) const
{
if (test.second.mName == mTarget)
return true;
return false;
}
private:
const std::string mTarget;
};
// ...
find_if (myMap.begin(), myMap.end(), MatchName (target));
Another option is to build an index. The index would likely be another map, where the key is whatever values you want to find and the value is some kind of index back to the main map.
Suppose your main map contains Student objects which consist of a name and some other stuff, and the key in this map is the Student ID, an integer. If you want to find the student with a particular last name, you could build an indexing map where the key is a last name (probably want to use multimap here), and the value is the student ID. You can then index back in to the main map to get the remainder of the Student's attributes.
There are challenges with the second approach. You must keep the main map and the index (or indicies) synchronized when you add or remove elements. You must make sure the index you choose as the value in the index is not something that may change, like a pointer. If you are multithreading, then you have to give a think to how both the map and index will be protected without introducing deadlocks or race conditions.
The only way to accomplish this that I can think of is to iterate through it. This is most likely not what you want, but it's the best shot I can think of. Good luck!
No, You can not do this. You simply have to iterate over map and match each value with the item to be matched and return the corresponding key and it will cost you high time complexity equal to O(n).
You can achieve this by iterating which will take O(n) time. Or you can store the reverse map which will take O(n) space.
By iterating:
std::map<int, string> fmap;
for (std::map<int,string>::iterator it=fmap.begin(); it!=fmap.end(); ++it)
if (strcmp(it->second,"foo"))
break;
By storing reverse map:
std::map<int, string> fmap;
std::map<string, int> bmap;
fmap.insert(std::make_pair(0, "foo"));
bmap.insert(std::make_pair("foo", 0));
fmap[0]; // original map lookup
bmap["foo"]; //reverse map lookup

Speed up access to many std::maps with same key

Suppose you have a std::vector<std::map<std::string, T> >. You know that all the maps have the same keys. They might have been initialized with
typedef std::map<std::string, int> MapType;
std::vector<MapType> v;
const int n = 1000000;
v.reserve(n);
for (int i=0;i<n;i++)
{
std::map<std::string, int> m;
m["abc"] = rand();
m["efg"] = rand();
m["hij"] = rand();
v.push_back(m);
}
Given a key (e.g. "efg"), I would like to extract all values of the maps for the given key (which definitely exists in every map).
Is it possible to speed up the following code?
std::vector<int> efgValues;
efgValues.reserve(v.size());
BOOST_FOREACH(MapType const& m, v)
{
efgValues.push_back(m.find("efg")->second);
}
Note that the values are not necessarily int. As profiling confirms that most time is spent in the find function, I was thinking about whether there is a (GCC and MSVC compliant C++03) way to avoid locating the element in the map based on the key for every single map again, because the structure of all the maps is equal.
If no, would it be possible with boost::unordered_map (which is 15% slower on my machine with the code above)? Would it be possible to cache the hash value of the string?
P.S.: I know that having a std::map<std::string, std::vector<T> > would solve my problem. However, I cannot change the data structure (which is actually more complex than what I showed here).
You can cache and playback the sequence of comparison results using a stateful comparator. But that's just nasty; the solution is to adjust the data structure. There's no "cannot." Actually, adding a stateful comparator is changing the data structure. That requirement rules out almost anything.
Another possibility is to create a linked list across the objects of type T so you can get from each map to the next without another lookup. If you might be starting at any of the maps (please, just refactor the structure) then a circular or doubly-linked list will do the trick.
As profiling confirms that most time is spent in the find function
Keeping the tree data structures and optimizing the comparison can only speed up the comparison. Unless the time is spent in operator< (std::string const&, std::string const&), you need to change the way it's linked together.

boost::multi_index_container with random_access and ordered_unique

I have a problem getting boost::multi_index_container work with random-access and with orderd_unique at the same time. (I'm sorry for the lengthly question, but I think I should use an example..)
Here an example: Suppose I want to produce N objects in a factory and for each object I have a demand to fulfill (this demand is known at creation of the multi-index).
Well, within my algorithm I get intermediate results, which I store in the following class:
class intermediate_result
{
private:
std::vector<int> parts; // which parts are produced
int used_time; // how long did it take to produce
ValueType max_value; // how much is it worth
};
The vector parts descibes, which objects are produced (its length is N and it is lexicographically smaller then my coresp demand-vector!) - for each such vector I know the used_time as well. Additionally I get a value for this vector of produced objects.
I got another constraint so that I can't produce every object - my algorithm needs to store several intermediate_result-objects in a data-structure. And here boost::multi_index_container is used, because the pair of parts and used_time describes a unique intermediate_result (and it should be unique in my data-structure) but the max_value is another index I'll have to consider, because my algorithm always needs the intermediate_result with the highest max_value.
So I tried to use boost::multi_index_container with ordered_unique<> for my "parts&used_time-pair" and ordered_non_unique<> for my max_value (different intermediate_result-objects may have the same value).
The problem is: the predicate, which is needed to decide which "parts&used_time-pair" is smaller, uses std::lexicographical_compare on my parts-vector and hence is very slow for many intermediate_result-objects.
But there would be a solution: my demand for each object isn't that high, therefore I could store on each possible parts-vector the intermediate results uniquely by its used_time.
For example: if I have a demand-vector ( 2 , 3 , 1) then I need a data-structure which stores (2+1)*(3+1)*(1+1)=24 possible parts-vectors and on each such entry the different used_times, which have to be unique! (storing the smallest time is insufficient - for example: if my additional constraint is: to meet a given time exactly for production)
But how do I combine a random_access<>-index with an ordered_unique<>-index?
(Example11 didn't help me on this one..)
To use two indices you could write the following:
indexed_by<
random_access< >,
ordered_unique<
composite_key<
intermediate_result,
member<intermediate_result, int, &intermediate_result::used_time>,
member<intermediate_result, std::vector<int>, &intermediate_result::parts>
>
>
>
You could use composite_key for comparing used_time at first and vector only if necessary. Besides that, keep in mind that you could use member function as index.
(I had to use an own answer to write code-blocks - sorry!)
The composite_key with used_time and parts (as Kirill V. Lyadvinsky suggested) is basically what I've already implemented. I want to get rid of the lexicographical compare of the parts-vector.
Suppose I've stored the needed_demand somehow then I could write a simple function, which returns the correct index within a random-access data-structure like that:
int get_index(intermediate_result &input_result) const
{
int ret_value = 0;
int index_part = 1;
for(int i=0;i<needed_demand.size();++i)
{
ret_value += input_result.get_part(i) * index_part;
index_part *= (needed_demand.get_part(i) + 1);
}
}
Obviously this can be implemented more efficiently and this is not the only possible index ordering for the needed demand. But let's suppose this function exists as a member-function of intermediate_result! Is it possible to write something like this to prevent lexicographical_compare ?
indexed_by<
random_access< >,
ordered_unique<
composite_key<
intermediate_result,
member<intermediate_result, int, &intermediate_result::used_time>,
const_mem_fun<intermediate_result,int,&intermediate_result::get_index>
>
>
>
If this is possible and I initialized the multi-index with all possible parts-vectors (i.e. in my comment above I would've pushed 24 empty maps in my data-structure), does this find the right entry for a given intermediate_result in constant time (after computing the correct index with get_index) ?
I have to ask this, because I don't quite see, how the random_access<> index is linked with the ordered_unique<> index..
But thank you for your answers so far!!

std::map insert or std::map find?

Assuming a map where you want to preserve existing entries. 20% of the time, the entry you are inserting is new data. Is there an advantage to doing std::map::find then std::map::insert using that returned iterator? Or is it quicker to attempt the insert and then act based on whether or not the iterator indicates the record was or was not inserted?
The answer is you do neither. Instead you want to do something suggested by Item 24 of Effective STL by Scott Meyers:
typedef map<int, int> MapType; // Your map type may vary, just change the typedef
MapType mymap;
// Add elements to map here
int k = 4; // assume we're searching for keys equal to 4
int v = 0; // assume we want the value 0 associated with the key of 4
MapType::iterator lb = mymap.lower_bound(k);
if(lb != mymap.end() && !(mymap.key_comp()(k, lb->first)))
{
// key already exists
// update lb->second if you care to
}
else
{
// the key does not exist in the map
// add it to the map
mymap.insert(lb, MapType::value_type(k, v)); // Use lb as a hint to insert,
// so it can avoid another lookup
}
The answer to this question also depends on how expensive it is to create the value type you're storing in the map:
typedef std::map <int, int> MapOfInts;
typedef std::pair <MapOfInts::iterator, bool> IResult;
void foo (MapOfInts & m, int k, int v) {
IResult ir = m.insert (std::make_pair (k, v));
if (ir.second) {
// insertion took place (ie. new entry)
}
else if ( replaceEntry ( ir.first->first ) ) {
ir.first->second = v;
}
}
For a value type such as an int, the above will more efficient than a find followed by an insert (in the absence of compiler optimizations). As stated above, this is because the search through the map only takes place once.
However, the call to insert requires that you already have the new "value" constructed:
class LargeDataType { /* ... */ };
typedef std::map <int, LargeDataType> MapOfLargeDataType;
typedef std::pair <MapOfLargeDataType::iterator, bool> IResult;
void foo (MapOfLargeDataType & m, int k) {
// This call is more expensive than a find through the map:
LargeDataType const & v = VeryExpensiveCall ( /* ... */ );
IResult ir = m.insert (std::make_pair (k, v));
if (ir.second) {
// insertion took place (ie. new entry)
}
else if ( replaceEntry ( ir.first->first ) ) {
ir.first->second = v;
}
}
In order to call 'insert' we are paying for the expensive call to construct our value type - and from what you said in the question you won't use this new value 20% of the time. In the above case, if changing the map value type is not an option then it is more efficient to first perform the 'find' to check if we need to construct the element.
Alternatively, the value type of the map can be changed to store handles to the data using your favourite smart pointer type. The call to insert uses a null pointer (very cheap to construct) and only if necessary is the new data type constructed.
There will be barely any difference in speed between the 2, find will return an iterator, insert does the same and will search the map anyway to determine if the entry already exists.
So.. its down to personal preference. I always try insert and then update if necessary, but some people don't like handling the pair that is returned.
I would think if you do a find then insert, the extra cost would be when you don't find the key and performing the insert after. It's sort of like looking through books in alphabetical order and not finding the book, then looking through the books again to see where to insert it. It boils down to how you will be handling the keys and if they are constantly changing. Now there is some flexibility in that if you don't find it, you can log, exception, do whatever you want...
If you are concerned about efficiency, you may want to check out hash_map<>.
Typically map<> is implemented as a binary tree. Depending on your needs, a hash_map may be more efficient.
I don't seem to have enough points to leave a comment, but the ticked answer seems to be long winded to me - when you consider that insert returns the iterator anyway, why go searching lower_bound, when you can just use the iterator returned. Strange.
Any answers about efficiency will depend on the exact implementation of your STL. The only way to know for sure is to benchmark it both ways. I'd guess that the difference is unlikely to be significant, so decide based on the style you prefer.
map[ key ] - let stl sort it out. That's communicating your intention most effectively.
Yeah, fair enough.
If you do a find and then an insert you're performing 2 x O(log N) when you get a miss as the find only lets you know if you need to insert not where the insert should go (lower_bound might help you there). Just a straight insert and then examining the result is the way that I'd go.