I am very confused by the name 'unordered_map'. The name suggests that the keys are not ordered at all. But I always thought they are ordered by their hash value. Or is that wrong (because the name implies that they are not ordered)?
Or to put it differently: is this
typedef map<K, V, HashComp<K> > HashMap;
with
template<typename T>
struct HashComp {
    bool operator()(const T& v1, const T& v2) const {
        return hash<T>()(v1) < hash<T>()(v2);
    }
};
the same as
typedef unordered_map<K, V> HashMap;
? (OK, not exactly: the STL will complain here because there may be keys k1, k2 where neither k1 < k2 nor k2 < k1 holds. You would need to use a multimap and override the equality check.)
Or again differently: When I iterate through them, can I assume that the key-list is ordered by their hash value?
In answer to your edited question: no, those two snippets are not equivalent at all. std::map stores nodes in a tree structure; unordered_map stores them in a hash table*.
Keys are not stored in order of their "hash value" because they're not stored in any order at all. They are instead stored in "buckets" where each bucket corresponds to a range of hash values. Basically, the implementation goes like this:
function add_value(object key, object value) {
    int hash = key.getHash();
    int bucket_index = hash % NUM_BUCKETS;
    if (buckets[bucket_index] == null) {
        buckets[bucket_index] = new linked_list();
    }
    buckets[bucket_index].add(new key_value(key, value));
}

function get_value(object key) {
    int hash = key.getHash();
    int bucket_index = hash % NUM_BUCKETS;
    if (buckets[bucket_index] == null) {
        return null;
    }
    foreach(key_value kv in buckets[bucket_index]) {
        if (kv.key == key) {
            return kv.value;
        }
    }
    return null;  // key not present in its bucket
}
Obviously that's a serious simplification, and a real implementation would be much more advanced (for example, resizing the bucket array, perhaps using a tree structure instead of a linked list for each bucket, and so on), but it should give you an idea of why you can't get the values back in any particular order. See Wikipedia for more information.
* Technically, the internal implementations of std::map and unordered_map are implementation-defined, but the standard requires certain Big-O complexities for their operations, which effectively implies those internal structures.
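If you want to see this from code, std::unordered_map exposes its buckets through bucket_count(), bucket(), and the per-bucket iterators begin(n)/end(n). A minimal sketch; which keys land in which bucket, and how many buckets there are, is implementation-dependent:

#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<std::string, int> m{{"one", 1}, {"two", 2}, {"three", 3}};

    std::cout << "bucket_count: " << m.bucket_count() << '\n';
    std::cout << "\"one\" lives in bucket " << m.bucket("one") << '\n';

    // Walk the buckets; begin(b)/end(b) iterate only the entries in bucket b.
    for (std::size_t b = 0; b < m.bucket_count(); ++b)
        for (auto it = m.begin(b); it != m.end(b); ++it)
            std::cout << "bucket " << b << ": " << it->first << " -> " << it->second << '\n';
}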
"Unordered" doesn't mean that there isn't a linear sequence somewhere in the implementation. It means "you can't assume anything about the order of these elements".
For example, people often assume that entries will come out of a hash map in the same order they were put in. But they don't, because the entries are unordered.
As for "ordered by their hash value": hash values are generally taken from the full range of integers, but hash maps don't have 2**32 slots in them. The hash value's range will be reduced to the number of slots by taking it modulo the number of slots. Further, as you add entries to a hash map, it might change size to accommodate the new values. This can cause all the previous entries to be re-placed, changing their order.
In an unordered data structure, you can't assume anything about the order of the entries.
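A minimal sketch of that effect (exact output is implementation-dependent): adding enough elements to exceed max_load_factor() forces a rehash, and the iteration order of the earlier entries can change.

#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m;
    for (int i = 0; i < 4; ++i)
        m[i] = i;

    std::cout << "before: ";
    for (const auto& kv : m)
        std::cout << kv.first << ' ';
    std::cout << '\n';

    // Adding more entries may exceed max_load_factor() and trigger a rehash,
    // which redistributes the existing entries over a larger bucket array.
    for (int i = 4; i < 100; ++i)
        m[i] = i;

    std::cout << "after:  ";
    for (const auto& kv : m)
        std::cout << kv.first << ' ';
    std::cout << '\n';
}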
As the name unordered_map suggests, no ordering is specified by the C++0x standard. An unordered_map's apparent ordering will be dependent on whatever is convenient for the actual implementation.
If you want an analogy, look at the RDBMS of your choice.
If you don't specify an ORDER BY clause when performing a query, the results are returned "unordered" - that is, in whatever order the database feels like. The order is not specified, and the system is free to "order" them however it likes in order to get the best performance.
You are right that unordered_map is, internally, organized by hash. Note that most pre-TR1 implementations call it hash_map.
The IBM C/C++ compiler documentation remarks that if you have an optimal hash function, the number of operations performed during lookup, insertion, and removal of an arbitrary element does not depend on the number of elements in the sequence, so this means the order is not so "unordered" after all...
Now, what does it mean that it is hash ordered? Since a hash should be unpredictable, by definition you can't make any assumption about the order of the elements in the map. This is the reason why it was renamed in TR1: the old name suggested an order. Now we know that an order is actually used internally, but you can disregard it, as it is unpredictable.
Related
If I know I'm going to insert a large amount of data (about one million entries) into std::unordered_map, is there anything I can do in advance to boost the performance? (Just like std::vector::reserve can reserve enough memory to avoid reallocation when I roughly know the size of the data before a bulk insert.)
More specifically, the key in the hash map is a coordinate in the 2D plane, with a customized hash function, as shown below:
using CellIndex = std::pair<int32_t, int32_t>;

struct IdxHash {
    std::size_t operator()(const std::pair<int32_t, int32_t> &idx) const { return ((size_t)idx.second << 31) ^ idx.first; }
};

std::unordered_map<CellIndex, double, IdxHash> my_map;

// bulk insert into my_map
...
std::unordered_map is typically implemented as a chained hash table with linked lists. As such, inserting into an std::unordered_map takes constant time on average, and linear time in the size of the container in the worst case. This worst-case scenario for insertion corresponds to the case when the hash table elements must be rehashed because the current number of buckets in the table is insufficient to satisfy the load factor, and, therefore, a reallocation of the array of buckets is needed.
Keeping this in mind, if you know in advance the number of elements to insert into the std::unordered_map, you should consider std::unordered_map::reserve() to prevent rehashing from happening at insertion. This way, you will avoid both the bucket array reallocation and the rehashing from occurring.
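For example, assuming the element count is known (or can be estimated) before the bulk insert, a rough sketch would be:

#include <cstdint>
#include <unordered_map>
#include <utility>

using CellIndex = std::pair<int32_t, int32_t>;

struct IdxHash {
    std::size_t operator()(const CellIndex& idx) const {
        return ((std::size_t)idx.second << 31) ^ idx.first;
    }
};

int main()
{
    const std::size_t n = 1000000;  // expected number of entries, assumed known in advance

    std::unordered_map<CellIndex, double, IdxHash> my_map;
    my_map.reserve(n);  // sets up enough buckets for n elements, given max_load_factor()

    // bulk insert: no rehashing should occur during this loop
    for (std::size_t i = 0; i < n; ++i)
        my_map[{static_cast<int32_t>(i % 1000), static_cast<int32_t>(i / 1000)}] = 0.0;
}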
std::unordered_map::insert() with hint
As with std::map, there are some overloads of the insert() member function that take a so called hint:
iterator insert(const_iterator hint, const value_type& value);
This hint iterator may be used to provide some additional information that can be used to speed the insertion up. However, the existence of these member functions in std::unordered_map taking a hint is only for interface compatibility reasons, to make its interface more suitable for generic programming. So, they don't improve the insertion time.
About the hash function
How perfect your hash function is shouldn't really matter when it comes to insertion time – only how fast it calculates the hash of a key. However, it becomes relevant when looking up elements in the hash table by their keys.
reserve(x) prepares the unordered container for x elements. By comparison, rehash(x) sets the bucket count to at least x, which prepares the container for roughly x * max_load_factor() elements; reserve(x) is equivalent to rehash(ceil(x / max_load_factor())).
Also regarding your hash function: if you're aiming for it to return a unique value for each pair of coordinates, it should return ((size_t)idx.second << 32) ^ idx.first. With ((size_t)idx.second << 31) ^ idx.first, the shifted second coordinate overlaps the sign-extended first coordinate, so distinct pairs can collide; for example, (first, second) = (0, 0) and (-2^31, -1) both hash to 0.
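For illustration, here is one way the fixed functor could look, assuming a 64-bit size_t; the explicit uint32_t casts are my addition, to keep the sign bit of either coordinate from bleeding into the other half:

#include <cstddef>
#include <cstdint>
#include <utility>

struct IdxHash {
    std::size_t operator()(const std::pair<int32_t, int32_t>& idx) const {
        // Put `second` in the upper 32 bits and `first` in the lower 32 bits,
        // so each distinct coordinate pair maps to a distinct 64-bit value.
        return (static_cast<std::size_t>(static_cast<uint32_t>(idx.second)) << 32) ^
               static_cast<std::size_t>(static_cast<uint32_t>(idx.first));
    }
};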
This is a programming problem I come across very often, and I was wondering whether there is a data structure, either in the C++ STL or one I can implement myself, which provides both random and sequential access.
An example of why I might need this:
Say there are n types of items, (n = 1000000, for example), and there's a fixed number of each type of item (for example, 0 or 10)
I store these items into an array, where the array index represents the type of the item, and the value represents how many items of that given type are there
Now, I have an algorithm which iterates over all EXISTING items. To obtain these items, it is very wasteful to iterate over the entire array when all the entries are 0 except for, e.g., Array[99999] and Array[999999].
Normally, I solve this by using a linked list which saves the indices of all the nonzero array entries. I implement the standard operations in this way:
Insert(int t):
1) If Array[t] == 0, LinkedList.push_back(t);
2) Array[t]++;
Delete(int t):
1) If Array[t] == 1, find and remove t from LinkedList;
2) Array[t]--;
If I want O(1) complexity for the deletion operation, I make the array store containers instead of integers. Each container contains an integer and a pointer to the respective element of the LinkedList, so I don't have to search through the list.
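For concreteness, a rough, hypothetical sketch of that approach (the class and member names are mine), using stored std::list iterators as the "pointers" so deletion needs no search:

#include <cstddef>
#include <list>
#include <vector>

// counts_[t] holds the count for type t plus an iterator into `nonzero_`,
// the list of indices whose count is currently > 0. std::list iterators
// stay valid across insertions/erasures elsewhere, which makes erase O(1).
class CountedSet {
public:
    explicit CountedSet(std::size_t n) : counts_(n) {}

    void insert(std::size_t t) {
        if (counts_[t].count == 0)
            counts_[t].where = nonzero_.insert(nonzero_.end(), t);
        ++counts_[t].count;
    }

    void erase(std::size_t t) {
        if (counts_[t].count == 1)
            nonzero_.erase(counts_[t].where);  // O(1), no search needed
        --counts_[t].count;
    }

    // Sequential access over the existing (nonzero) types only.
    const std::list<std::size_t>& nonzero() const { return nonzero_; }

private:
    struct Entry {
        int count = 0;
        std::list<std::size_t>::iterator where{};
    };
    std::vector<Entry> counts_;
    std::list<std::size_t> nonzero_;
};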
I would love to know whether there is a data structure which formalizes/improves this approach, or whether there's a better way to do this altogether.
Given the following requirements:
Random access
Fast lookups
Fast insertions
Fast removals
Avoid wasted space
then you probably want something called a sparse array. Sparse arrays are not part of the standard library, so you'll have to emulate your own, using a std::map or std::unordered_map. In a sparse array, only non-zero elements occupy space in the collection.
An unordered_map will have O(1) average-case lookups, insertions, and removals, but does not provide ordered iteration. A map will generally have slower operations, but will provide ordered iteration. I'm oversimplifying things when I say std::map is slower, as it depends on the number of elements and usage patterns (a topic probably already discussed in another question).
If you absolutely must have both O(1) lookups and ordered iteration, then you can combine both a map and an unordered_map and keep them in sync. At that point, you'll want to consider using Boost.MultiIndex.
Here's a rough sketch showing how you can implement your own sparse vector class:
#include <cstddef>
#include <unordered_map>

class SparseVector
{
public:
    int get(size_t index) const
    {
        auto kv = map_.find(index);
        return (kv == map_.end()) ? 0 : kv->second;
    }

    void put(size_t index, int value)
    {
        if (value == 0)
            map_.erase(index);
        else
            map_[index] = value;  // overwrite any existing entry (emplace would not)
    }

    // etc...

private:
    std::unordered_map<size_t, int> map_;
};
In such a sparse vector class, you can overload operator[] if you wish to allow something like sparseVec[42] = 123.
Linear algebra libraries, such as Eigen or Boost.uBlas, already provide templates for sparse vectors and sparse matrices.
According to the standard, there is no std::hash specialization for containers (let alone unordered ones), so I wonder how to implement one myself. What I have is:
std::unordered_map<std::wstring, std::wstring> _properties;
std::wstring _class;
I thought about iterating the entries, computing the individual hashes for keys and values (via std::hash<std::wstring>), and concatenating the results somehow.
What would be a good way to do that and does it matter if the order in the map is not defined?
Note: I don't want to use boost.
A simple XOR was suggested, so it would be like this:
size_t MyClass::GetHashCode()
{
    std::hash<std::wstring> stringHash;
    size_t mapHash = 0;
    for (auto property : _properties)
        mapHash ^= stringHash(property.first) ^ stringHash(property.second);
    return ((_class.empty() ? 0 : stringHash(_class)) * 397) ^ mapHash;
}
?
I'm really unsure if that simple XOR is enough.
Response
If by enough, you mean whether or not your function is injective, the answer is No. The reasoning is that the set of all hash values your function can output has cardinality 2^64, while the space of your inputs is much larger. However, this is not really important, because you can't have an injective hash function given the nature of your inputs. A good hash function has these qualities:
It's not easily invertible. Given the output k, it's not computationally feasible within the lifetime of the universe to find m such that h(m) = k.
The range is uniformly distributed over the output space.
It's hard to find two inputs m and m' such that h(m) = h(m')
Of course, the extent of these really depends on whether you want something that's cryptographically secure, or you just want to map an arbitrary chunk of data to some arbitrary 64-bit integer. If you want something cryptographically secure, writing it yourself is not a good idea. In that case, you'd also need the guarantee that the function is sensitive to small changes in the input. The std::hash function object is not required to be cryptographically secure. It exists for use cases isomorphic to hash tables. cppreference says:
For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<size_t>::max().
I'll show below how your current solution doesn't really guarantee this.
Collisions
I'll give you a few of my observations on a variant of your solution (I don't know what your _class member is).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
    std::hash<std::string> h;
    std::size_t result = 0;
    for (auto&& p : m) {
        result ^= h(p.first) ^ h(p.second);
    }
    return result;
}
It's easy to generate collisions. Consider the following maps:
std::unordered_map<std::string, std::string> container0;
std::unordered_map<std::string, std::string> container1;
container0["123"] = "456";
container1["456"] = "123";
std::cout << hash_code(container0) << '\n';
std::cout << hash_code(container1) << '\n';
On my machine, compiling with g++ 4.9.1, this outputs:
1225586629984767119
1225586629984767119
Whether this matters depends on your data: what's relevant is how often you're going to have maps where keys and values are swapped like this. These collisions will occur between any two maps in which the combined set of keys and values is the same.
Order of Iteration
Two unordered_map instances having exactly the same key-value pairs will not necessarily have the same order of iteration. cppreference says:
For two parameters k1 and k2 that are equal, std::hash<Key>()(k1) == std::hash<Key>()(k2).
This is a trivial requirement for a hash function. Your solution satisfies it regardless of iteration order, because XOR is commutative, so the order in which the pairs are combined doesn't matter.
A Possible Solution
If you don't need something that's cryptographically secure, you can modify your solution slightly to kill the symmetry. This approach is okay in practice for hash tables and the like. This solution is also independent of the fact that order in an unordered_map is undefined. It uses the same property your solution used (Commutativity of XOR).
std::size_t hash_code(const std::unordered_map<std::string, std::string>& m) {
    const std::size_t prime = 19937;
    std::hash<std::string> h;
    std::size_t result = 0;
    for (auto&& p : m) {
        result ^= prime * h(p.first) + h(p.second);
    }
    return result;
}
All you need in a hash function in this case is a way to map a key-value pair to an arbitrary good hash value, and a way to combine the hashes of the key-value pairs using a commutative operation. That way, order does not matter. In the example hash_code I wrote, the key-value pair hash value is just a linear combination of the hash of the key and the hash of the value. You can construct something a bit more intricate, but there's no need for that.
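As a quick illustration, the reversed-key maps from the collision example above should no longer produce the same value with this version, because prime*h(key) + h(value) is not symmetric in the key and the value (equal outputs are of course still possible, just no longer structural):

#include <iostream>
#include <string>
#include <unordered_map>

// assumes the improved hash_code() above is in scope
int main()
{
    std::unordered_map<std::string, std::string> container0;
    std::unordered_map<std::string, std::string> container1;

    container0["123"] = "456";
    container1["456"] = "123";

    // Almost certainly prints two different values now.
    std::cout << hash_code(container0) << '\n';
    std::cout << hash_code(container1) << '\n';
}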
boost::hash has hashing functions for most built-in types, including containers.
But as stated in the boost::hash_range function description, the hash algorithm for ranges
is sensitive to the order of the elements so it wouldn't be appropriate to use this with an unordered container
And thus there is no boost::hash specialization for std::unordered_map nor boost::unordered_map.
The question is:
Is there an "easy and efficient" way to hash an unordered_map without reimplementing a hash algorithm from scratch ?
The problem here is that there is no guarantee that the items even have an ordering among them.
So, sorting the items may very well not work for arbitrary unordered containers. You have 2 options:
Just XOR the hashes of all the individual elements. This is the fastest.
First sort the hashes of the container's elements, and then hash those. This may result in a better hash.
You can of course convert the unordered_map to some other data structure that has a guaranteed order and use that to generate the hash.
A better idea might be to hash each individual element of the map, put those hashes into a vector, then sort and combine the hashes. See for example How do I combine hash values in C++0x? to combine the hashes.
template<typename Hash, typename Iterator>
size_t order_independent_hash(Iterator begin, Iterator end, Hash hasher)
{
    std::vector<size_t> hashes;
    for (Iterator it = begin; it != end; ++it)
        hashes.push_back(hasher(*it));

    std::sort(hashes.begin(), hashes.end());

    size_t result = 0;
    for (auto it2 = hashes.begin(); it2 != hashes.end(); ++it2)
        result ^= *it2 + 0x9e3779b9 + (result << 6) + (result >> 2);

    return result;
}
Testing this on shuffled vectors shows that it always returns the same hash.
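For instance, a quick check along those lines (assuming the order_independent_hash template above is in scope) might look like this; every line of output should be identical:

#include <algorithm>
#include <functional>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
    std::mt19937 rng{42};

    // Hash the same elements in several different orders; all results match.
    for (int i = 0; i < 3; ++i)
    {
        std::shuffle(v.begin(), v.end(), rng);
        std::cout << order_independent_hash(v.begin(), v.end(), std::hash<int>()) << '\n';
    }
}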
Now to adapt that basic concept to work specifically with unordered_map. Since the iterator of unordered_map returns a pair, we need a hash function for that too.
namespace std
{
    template<typename T1, typename T2>
    struct hash<std::pair<T1, T2> >
    {
        typedef std::pair<T1, T2> argument_type;
        typedef std::size_t result_type;

        result_type operator()(argument_type const& s) const
        {
            result_type const h1 ( std::hash<T1>()(s.first) );
            result_type const h2 ( std::hash<T2>()(s.second) );
            return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
        }
    };

    template<typename Key, typename T>
    struct hash<std::unordered_map<Key, T> >
    {
        typedef std::unordered_map<Key, T> argument_type;
        typedef std::size_t result_type;

        result_type operator()(argument_type const& s) const
        {
            return order_independent_hash(s.begin(), s.end(), std::hash<std::pair<Key, T> >());
        }
    };
}
See it in action: http://ideone.com/WOLFbc
I think you may be confusing what a hash is used for. It is applied to the keys that identify elements, to determine where to store them. Two equivalent elements should hash to the same value.
Are you trying to see if two unordered maps are equivalent and storing them in some kind of container?
The keys to an unordered map - well those are hashed. In fact the container would have been called hash_map except that such a container already existed.
But ok, suppose you really do want to store unordered-maps and compare to see if two are equivalent. Well you'd have to come up with a hashing algorithm that would return the same value regardless of the position of the elements it contains. A checksum of all its elements (keys and values) would be one possible way.
Note also that just because two elements have the same hash value, it doesn't mean they are equivalent. It just means that if the hash value differs they definitely are not equivalent. In fact checksums are often used to verify data for exactly this reason. A wrong checksum is proof the data is not valid, and given a good formula, a correct one makes it highly likely although not certain that it is.
I'm curious: given that you're hashing the unordered_map to use it as a key, and given that once you've hashed it you won't be changing it (unless you use it to create a new key), would the performance hit of converting the unordered_map to an ordered map be acceptable (and then, of course, hashing the ordered map and using that as the key)? Or is the problem with that approach that you need the faster lookup time an unordered_map provides?
For what it's worth, there may be a space advantage to using an ordered map (based on the accepted answer in the following post, an unordered_map generally uses more memory):
Is there any advantage of using map over unordered_map in case of trivial keys?
You haven't specified any performance requirements, but if you just want a "quick and dirty" solution that won't require much coding on your behalf and would take advantage of boost::hash, you can copy the range of items from unordered_map to a vector, std::sort the vector, and then pass it to boost::hash_range.
Hardly the most effective solution, though, and not one you'd want to use often or with many elements.
My preferred approach would be a specialization of unordered_map that keeps a running, up-to-date hash of the contents — you shouldn't have to pass over all elements and perform a calculation to get the current value. Instead, a member of the data structure should reflect the hash, and be modified in real-time as elements are inserted or removed, and read when needed.
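A rough sketch of that idea, as an illustration only (the wrapper and its members are hypothetical, not an existing specialization): a thin wrapper that XORs a per-entry hash into a running value on every insert and erase, so reading the hash is O(1):

#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

class HashedStringMap {
public:
    void insert(const std::string& key, const std::string& value) {
        erase(key);  // keep the running hash consistent if the key already exists
        map_.emplace(key, value);
        hash_ ^= entry_hash(key, value);  // XOR in: order-independent, reversible
    }

    void erase(const std::string& key) {
        auto it = map_.find(key);
        if (it != map_.end()) {
            hash_ ^= entry_hash(it->first, it->second);  // XOR out the old entry
            map_.erase(it);
        }
    }

    std::size_t hash() const { return hash_; }  // O(1), always up to date

private:
    static std::size_t entry_hash(const std::string& k, const std::string& v) {
        std::hash<std::string> h;
        return 19937 * h(k) + h(v);  // asymmetric combine, as discussed above
    }

    std::unordered_map<std::string, std::string> map_;
    std::size_t hash_ = 0;
};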
So far, I have been storing the array in a vector and then looping through the vector to find the matching element and then returning the index.
Is there a faster way to do this in C++? The STL structure I use to store the array doesn't really matter to me (it doesn't have to be a vector). My array is also unique (no repeating elements) and ordered (e.g. a list of dates going forward in time).
Since the elements are sorted, you can use a binary search to find the matching element. The C++ Standard Library has a std::lower_bound algorithm that can be used for this purpose. I would recommend wrapping it in your own binary search algorithm, for clarity and simplicity:
/// Performs a binary search for an element
///
/// The range `[first, last)` must be ordered via `comparer`. If `value` is
/// found in the range, an iterator to the first element comparing equal to
/// `value` will be returned; if `value` is not found in the range, `last` is
/// returned.
template <typename RandomAccessIterator, typename Value, typename Comparer>
auto binary_search(RandomAccessIterator const first,
                   RandomAccessIterator const last,
                   Value const& value,
                   Comparer comparer) -> RandomAccessIterator
{
    RandomAccessIterator it(std::lower_bound(first, last, value, comparer));
    if (it == last || comparer(*it, value) || comparer(value, *it))
        return last;

    return it;
}
(The C++ Standard Library has a std::binary_search, but it returns a bool: true if the range contains the element, false otherwise. It's not useful if you want an iterator to the element.)
Once you have an iterator to the element, you can use the std::distance algorithm to compute the index of the element in the range.
Both of these algorithms work equally well with any random access sequence, including both std::vector and ordinary arrays.
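For example, assuming the binary_search wrapper above lives at global scope and the dates are stored as sorted ints (hypothetical data):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> dates{20200101, 20200215, 20200301, 20200420};  // sorted, unique

    // ::binary_search refers to the wrapper defined above, not std::binary_search
    auto it = ::binary_search(dates.begin(), dates.end(), 20200301, std::less<int>());
    if (it != dates.end())
    {
        std::size_t index = std::distance(dates.begin(), it);
        std::cout << "found at index " << index << '\n';  // prints 2
    }
}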
If you want to associate a value with an index and find the index quickly you can use std::map or std::unordered_map. You can also combine these with other data structures (e.g. a std::list or std::vector) depending on the other operations you want to perform on the data.
For example, when creating the vector we also create a lookup table:
vector<int> test(test_size);
unordered_map<int, size_t> lookup;

int value = 0;
for (size_t index = 0; index < test_size; ++index)
{
    test[index] = value;
    lookup[value] = index;
    value += rand() % 100 + 1;
}
Now to look up the index you simply:
size_t index = lookup[find_value];
Using a hash table based data structure (e.g. the unordered_map) is a fairly classical space/time tradeoff and can outperform doing a binary search for this sort of "reverse" lookup operation when you need to do a lot of lookups. The other advantage is that it also works when the vector is unsorted.
For fun :-) I've done a quick benchmark in VS2012RC comparing James' binary search code with a linear search and with using unordered_map for lookup, all on a vector:
Up to ~50000 elements the unordered_map significantly (x3-4) outperforms the binary search, which exhibits the expected O(log N) behavior. The somewhat surprising result is that the unordered_map loses its O(1) behavior past 10000 elements, presumably due to hash collisions, or perhaps an implementation issue.
EDIT: max_load_factor() for the unordered map is 1 so there should be no collisions. The difference in performance between the binary search and the hash table for very large vectors appears to be caching related and varies depending on the lookup pattern in the benchmark.
Choosing between std::map and std::unordered_map talks about the difference between ordered and unordered maps.