Usefulness of KeyEqual in std::unordered_set/std::unordered_map - c++

I understand that this may be a vague question, but I wonder what the real-world cases are in which a custom comparator is useful for the hash containers in std.
I understand its usefulness in ordered containers, but for hash containers it seems a bit weird.
The reason is that the hash values of elements that are equal according to the comparator need to be the same, and I believe that in most cases this actually means converting the lookup/insert element to some common representation (it is faster and easier to implement).
For example:
set of case insensitive strings: if you want to hash properly you need to uppercase/lowercase the entire string anyway.
set of fractions(where 2/3 == 42/63): you need to convert 42/63 to 2/3 and then hash that...
So I wonder if someone can provide some real world examples of usefulness of customizing std::unordered_ template parameters so I can recognize those patterns in future code I write.
Note 1: the "symmetry argument" (std::map enables customization of a comparator, so std::unordered_* should be customizable too) is something I considered, and I do not think it is convincing.
Note 2: I mixed two kinds of comparators (< and ==) in the post for brevity; I know that std::map uses < and std::unordered_map uses ==.

As per https://en.cppreference.com/w/cpp/container/unordered_set
Internally, the elements are not sorted in any particular order, but
organized into buckets. Which bucket an element is placed into depends
entirely on the hash of its value. This allows fast access to
individual elements, since once a hash is computed, it refers to the
exact bucket the element is placed into.
So the hash function defines the bucket your element will end up in, but once the bucket is decided, in order to find the element, the operator == will be used.
Basically, operator == is used to resolve hash collisions, and hence your hash function and your operator == need to be consistent. Furthermore, if your operator == says that two elements are equal, the set will not allow the duplicate.
As for customization, I think the case-insensitive set of strings is a good example: you need to provide a case-insensitive hash function so the set can determine which bucket a string belongs in, and then a custom KeyEqual so the set can actually retrieve the element.
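A minimal sketch of that case-insensitive set, assuming ASCII-style case folding via std::tolower. The names CaseInsensitiveHash, CaseInsensitiveEqual and CaseInsensitiveSet are made up for illustration. Note that the stored string keeps its original casing, and neither functor builds a lowercased copy:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hashes the lowercased characters one by one (simple multiplicative hash,
// good enough for a demo).
struct CaseInsensitiveHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = 0;
        for (unsigned char c : s) {
            h = h * 31 + static_cast<std::size_t>(std::tolower(c));
        }
        return h;
    }
};

// Two strings are equal if they match character by character after lowercasing.
struct CaseInsensitiveEqual {
    bool operator()(const std::string& a, const std::string& b) const {
        if (a.size() != b.size()) return false;
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (std::tolower(static_cast<unsigned char>(a[i])) !=
                std::tolower(static_cast<unsigned char>(b[i]))) {
                return false;
            }
        }
        return true;
    }
};

using CaseInsensitiveSet =
    std::unordered_set<std::string, CaseInsensitiveHash, CaseInsensitiveEqual>;
```

Inserting "Hello" and then "HELLO" into a CaseInsensitiveSet leaves a single element, and looking up "hello" returns the originally inserted "Hello".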
A case I had to deal with, in the past, was a way to allow users to insert strings, keeping track of their order of insertion but avoiding duplicates. So, given a struct like:
struct keyword{
std::string value;
int sequenceCounter;
};
You want to detect duplicates according only to value. One of the solutions I came up with was an unordered_set with a custom comparator/hash function, that used only value. This allowed me to check for the existence of a key before allowing insertion.
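Roughly, that could look like the sketch below. KeywordValueHash, KeywordValueEqual, KeywordSet and add_keyword are hypothetical names, and using the current set size as the sequence number is an assumption that only holds if nothing is ever erased:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct keyword {
    std::string value;
    int sequenceCounter;
};

// Hash and equality look at `value` only, so sequenceCounter is ignored
// when the set checks for duplicates.
struct KeywordValueHash {
    std::size_t operator()(const keyword& k) const {
        return std::hash<std::string>()(k.value);
    }
};
struct KeywordValueEqual {
    bool operator()(const keyword& a, const keyword& b) const {
        return a.value == b.value;
    }
};

using KeywordSet = std::unordered_set<keyword, KeywordValueHash, KeywordValueEqual>;

// Inserts `value` with the next sequence number unless it is already present.
// Returns true if the keyword was inserted.
bool add_keyword(KeywordSet& set, const std::string& value) {
    const int next = static_cast<int>(set.size());
    return set.insert(keyword{value, next}).second;
}
```

A lookup only needs a probe object with the right value; the counter in the probe can be anything, and the element found carries the counter assigned at insertion.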

One interesting usage is to define memory efficient indexes (database sense of the term) for a given set of objects.
Example
Let's say we have a program that has a collection of N objects of this class:
struct Person {
// each object has a unique firstName/lastName pair
std::string firstName;
std::string lastName;
// each object has a unique ssn value
std::string socialSecurityNumber;
// each object has a unique email value
std::string email;
};
And we need to retrieve efficiently objects by the value of any unique property.
Implementations comparison
Time complexities are given assuming string comparisons are constant time (strings have limited length).
1) Single unordered_map
With a single map indexed by a single key (ex: email):
std::unordered_map<std::string,Person> indexedByEmail;
Time complexity: lookup by any unique property other than email requires a traversal of the map: average O(N).
Memory usage: the email value is duplicated. This could be avoided by using a single set with custom hash & compare (see 3).
2) Multiple unordered_map, no custom hash & compare
With a map for each unique property, with default hash & comparisons:
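// note: std::hash is not specialized for std::pair, so this particular map needs a user-supplied hash for its key
std::unordered_map<std::pair<std::string,std::string>, Person> byName;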
std::unordered_map<std::string, const Person*> byEmail;
std::unordered_map<std::string, const Person*> bySSN;
Time complexity: by using the appropriate map, a lookup by any unique property is average O(1).
Memory usage: inefficient, because of all the string duplications.
3) Multiple unordered_set, custom hash & comparison:
With custom hash & comparison, we define different unordered_sets which hash & compare only specific fields of the objects. These sets can be used to perform lookups as if the items were stored in a map as in 2, but without duplicating any field.
using StrHash = std::hash<std::string>;
// --------------------
struct PersonNameHash {
std::size_t operator()(const Person& p) const {
// not the best hashing function in the world, but good enough for demo purposes.
return StrHash()(p.firstName) + StrHash()(p.lastName);
}
};
struct PersonNameEqual {
bool operator()(const Person& p1, const Person& p2) const {
return (p1.firstName == p2.firstName) && (p1.lastName == p2.lastName);
}
};
std::unordered_set<Person, PersonNameHash, PersonNameEqual> byName;
// --------------------
struct PersonSsnHash {
std::size_t operator()(const Person* p) const {
return StrHash()(p->socialSecurityNumber);
}
};
struct PersonSsnEqual {
bool operator()(const Person* p1, const Person* p2) const {
return p1->socialSecurityNumber == p2->socialSecurityNumber;
}
};
std::unordered_set<const Person*, PersonSsnHash, PersonSsnEqual> bySSN;
// --------------------
struct PersonEmailHash {
std::size_t operator()(const Person* p) const {
return StrHash()(p->email);
}
};
struct PersonEmailEqual {
bool operator()(const Person* p1, const Person* p2) const {
return p1->email == p2->email;
}
};
std::unordered_set<const Person*,PersonEmailHash,PersonEmailEqual> byEmail;
Time complexity: a lookup by any unique property is still O(1) average.
Memory usage: much better than 2): no string duplication.
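One thing the pointer-based sets leave implicit is how to look something up, since find() wants a const Person* rather than a string. A sketch of one way to do it, using a temporary "probe" object; find_by_email and EmailIndex are hypothetical names, and Person is repeated here only to keep the snippet self-contained:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

struct Person {
    std::string firstName, lastName, socialSecurityNumber, email;
};

struct PersonEmailHash {
    std::size_t operator()(const Person* p) const {
        return std::hash<std::string>()(p->email);
    }
};
struct PersonEmailEqual {
    bool operator()(const Person* p1, const Person* p2) const {
        return p1->email == p2->email;
    }
};
using EmailIndex = std::unordered_set<const Person*, PersonEmailHash, PersonEmailEqual>;

// Returns the indexed person with this email, or nullptr if there is none.
const Person* find_by_email(const EmailIndex& index, const std::string& email) {
    Person probe;           // only the email field is read by hash/equal
    probe.email = email;
    auto it = index.find(&probe);
    return it == index.end() ? nullptr : *it;
}
```

The indexed Person objects must outlive the index and must not have their email changed while they are in it, otherwise the stored hash position no longer matches.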

The hash function extracts the features of an element that matter for lookup, and the comparator's job is to decide whether two elements have the same features.
If you wrap the data in a "shell" type that exposes only those features, you may not need to customize the comparator at all.
In short: put a feature shell around the data and let the features be what is compared.
To be honest, I do not fully understand your problem description, so my answer may be somewhat muddled. Please bear with me.
:)

Related

Map, pair-vector or two vectors...?

I read through some posts and "wikis" but still cannot decide what approach is suitable for my problem.
I create a class called Sample which contains a certain number of compounds (let's say each is an instance of another class, Nuclide), each at a certain relative quantity (double).
Thus, something like (pseudo):
class Sample {
map<Nuclide, double>;
}
If I had the nuclides Ba-133, Co-60 and Cs-137 in the sample, I would have to use exactly those names in code to access those nuclides in the map. However, the only thing I need to do is iterate through the map to perform calculations (which nuclides they are is of no interest), so I will use a for loop. I want to iterate without paying any attention to the key names, so I would need to use an iterator for the map, am I right?
An alternative would be a vector<pair<Nuclide, double> >
class Sample {
vector<pair<Nuclide, double> >;
}
or simply two independent vectors
class Sample {
vector<Nuclide>;
vector<double>;
}
while in the last option the link between a nuclide and its quantity would be "meta-information", given by the position in the respective vector only.
Due to my lack of profound experience, I'd ask kindly for suggestions of what approach to choose. I want to have the iteration through all available compounds to be fast and easy and at the same time keep the logical structure of the corresponding keys and values.
PS.: It's possible that the number of compounds in a sample is very low (1 to 5)!
PPS.: Could the last option be modified by some const statements to prevent changes and thus keep the correct order?
If iteration needs to be fast, you don't want std::map<...>: its iteration is a tree-walk which quickly gets bad. std::map<...> is really only reasonable if you have many mutations to the sequence and you need the sequence ordered by the key. If you have mutations but you don't care about the order std::unordered_map<...> is generally a better alternative. Both kinds of maps assume you are looking things up by key, though. From your description I don't really see that to be the case.
std::vector<...> is fast to iterate. It isn't ideal for look-ups, though. If you keep it ordered you can use std::lower_bound() to do a std::map<...>-like look-up (i.e., the complexity is also O(log n)), but the effort of keeping it sorted may make that option too expensive. However, it is an ideal container for keeping a bunch of objects together which are iterated.
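The sorted-vector look-up mentioned above can be sketched like this. Entry and find_quantity are hypothetical names, and a std::string stands in for the Nuclide class, whose definition isn't shown in the question:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, double>;  // (nuclide name, relative quantity)

// Returns a pointer to the quantity stored under `name`, or nullptr.
// Precondition: `entries` is sorted by name.
const double* find_quantity(const std::vector<Entry>& entries, const std::string& name) {
    auto it = std::lower_bound(
        entries.begin(), entries.end(), name,
        [](const Entry& e, const std::string& key) { return e.first < key; });
    if (it != entries.end() && it->first == name) return &it->second;
    return nullptr;
}
```

std::lower_bound only returns the first position that is not less than the key, so the explicit equality check afterwards is what turns it into a "found / not found" look-up.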
Whether you want one std::vector<std::pair<...>> or rather two std::vector<...>s depends on how the elements are accessed: if both parts of an element are bound to be accessed together, you want a std::vector<std::pair<...>>, as that keeps data which is accessed together close together. On the other hand, if you normally only access one of the two components, using two separate std::vector<...>s will make the iteration faster, as more elements fit into a cache line, especially if they are reasonably small like doubles.
In any case, I'd recommend not exposing the internal structure to the outside world and instead providing an interface which lets you change the underlying representation later. That is, to achieve maximum flexibility you don't want to bake the representation into all your code. For example, if you use accessor function objects (property maps in terms of BGL, or projections in terms of Eric Niebler's Range Proposal) to access the elements based on an iterator, rather than accessing the elements directly, you can change the internal layout without having to touch any of the algorithms (you'll need to recompile the code, though):
// version using std::vector<std::pair<Nuclide, double> >
// - it would just use std::vector<std::pair<Nuclide, double> >::iterator as iterator
auto nuclide_projection = [](Sample::key& key) -> Nuclide& {
return key.first;
};
auto value_projection = [](Sample::key& key) -> double {
return key.second;
};
// version using two std::vectors:
// - it would use an iterator interface to an integer, yielding a std::size_t for *it
// - the projectors hold references, so they are built from the vectors they view
//   (called nuclides and values here) instead of being default-constructed
struct nuclide_projector {
std::vector<Nuclide>& nuclides;
auto operator()(std::size_t index) const -> Nuclide& { return nuclides[index]; }
};
nuclide_projector nuclide_projection{nuclides};
struct value_projector {
std::vector<double>& values;
auto operator()(std::size_t index) const -> double& { return values[index]; }
};
value_projector value_projection{values};
With one pair of these in place, an algorithm that simply runs over the elements and prints them could look like this:
template <typename Iterator>
void print(std::ostream& out, Iterator begin, Iterator end) {
for (; begin != end; ++begin) {
out << "nuclide=" << nuclide_projection(*begin) << ' '
<< "value=" << value_projection(*begin) << '\n';
}
}
Both representations are entirely different but the algorithm accessing them is entirely independent. This way it is also easy to try different representations: only the representation and the glue to the algorithms accessing it need to be changed.

How to find an object in std::vector?

Suppose I have a class called Bank, with attributes
class Bank {
string _name;
};
Now I declare a vector of Bank.
vector<Bank> list;
Given a string, how do I search the vector list for that particular Bank object that has the same string name?
I'm trying to avoid doing loops and see if there is an stl function that can do this.
You can use good old linear search:
auto it = std::find_if(list.begin(), list.end(), [&](const Bank& bank)
{
return bank._name == the_name_you_are_looking_for;
});
If there is no such bank in the list, the end iterator will be returned:
if (it == list.end())
{
// no bank in the list with the name you were looking for :-(
}
else
{
// *it is the first bank in the list with the name you were looking for :-)
}
If your compiler is from the stone ages, it won't understand lambdas and auto. Untested C++98 code:
struct NameFinder
{
const std::string& captured_name;
bool operator()(const Bank& bank) const
{
return bank._name == captured_name;
}
};
NameFinder finder = {the_name_you_are_looking_for};
std::vector<Bank>::iterator it = std::find_if(list.begin(), list.end(), finder);
As per popular request, just a side note to warn potential beginners attracted by this question in the future:
std::find is using a linear method, because the underlying object (a vector in that case) is not designed with search efficiency in mind.
Using a vector for data where search time is critical will possibly work, given the computing power available in your average PC, but could become slow quickly if the volume of data to handle grows.
If you need to search quickly, you have other containers (std::set, std::map and a few variants) that allow retrieval in logarithmic time.
You can even use hash tables for (near) instant access in containers like unordered_set and unordered_map, but the cost of other operations grows accordingly. It's all a matter of balance.
You can also sort the vector first and then perform a dichotomic search with std:: algorithms, like binary_search if you have a strict order or lower_bound, upper_bound and equal_range if you can only define a partial order on your elements.
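For the sort-then-search variant, a sketch with std::equal_range might look like this. ByName and count_named are hypothetical names, and Bank is written as a struct so that _name is accessible:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Bank {
    std::string _name;
};

// Orders banks by name and also compares a bank against a bare name;
// std::equal_range calls the comparator in both argument orders.
struct ByName {
    bool operator()(const Bank& a, const Bank& b) const { return a._name < b._name; }
    bool operator()(const Bank& a, const std::string& n) const { return a._name < n; }
    bool operator()(const std::string& n, const Bank& b) const { return n < b._name; }
};

// Number of banks called `name` in a vector that was sorted with ByName.
std::size_t count_named(const std::vector<Bank>& sorted, const std::string& name) {
    auto range = std::equal_range(sorted.begin(), sorted.end(), name, ByName());
    return static_cast<std::size_t>(range.second - range.first);
}
```

The vector has to be sorted with the same comparator first, e.g. std::sort(list.begin(), list.end(), ByName()); that costs O(n log n) once, after which each search is O(log n).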
std::find will allow you to search through the vector in a variety of ways.

Is the unordered_map really unordered?

I am very confused by the name 'unordered_map'. The name suggests that the keys are not ordered at all. But I always thought they are ordered by their hash value. Or is that wrong (because the name implies that they are not ordered)?
Or to put it different: Is this
typedef map<K, V, HashComp<K> > HashMap;
with
template<typename T>
struct HashComp {
bool operator()(const T& v1, const T& v2) const {
return hash<T>()(v1) < hash<T>()(v2);
}
};
the same as
typedef unordered_map<K, V> HashMap;
? (OK, not exactly, STL will complain here because there may be keys k1,k2 and neither k1 < k2 nor k2 < k1. You would need to use multimap and overwrite the equal-check.)
Or again differently: When I iterate through them, can I assume that the key-list is ordered by their hash value?
In answer to your edited question: no, those two snippets are not equivalent at all. std::map stores nodes in a tree structure, while unordered_map stores them in a hashtable*.
Keys are not stored in order of their "hash value" because they're not stored in any order at all. They are instead stored in "buckets" where each bucket corresponds to a range of hash values. Basically, the implementation goes like this:
function add_value(object key, object value) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
buckets[bucket_index] = new linked_list();
}
buckets[bucket_index].add(new key_value(key, value));
}
function get_value(object key) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
return null;
}
foreach(key_value kv in buckets[bucket_index]) {
if (kv.key == key) {
return kv.value;
}
}
return null;
}
Obviously that's a serious simplification, and a real implementation would be much more advanced (for example, supporting resizing the buckets array, maybe using a tree structure instead of a linked list for the buckets, and so on), but that should give an idea of why you can't get the values back in any particular order. See Wikipedia for more information.
* Technically, the internal implementation of std::map and unordered_map are implementation-defined, but the standard requires certain Big-O complexity for operations that implies those internal implementations
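For anyone who prefers real code to the pseudocode above, here is a rough C++ rendering of the same bucket idea. ToyHashMap is a made-up name; it is fixed-size, handles only string keys and int values, and (unlike the pseudocode) overwrites the value when a key is added twice. It is not how any standard library implements unordered_map:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

class ToyHashMap {
public:
    explicit ToyHashMap(std::size_t bucket_count) : buckets_(bucket_count) {
        assert(bucket_count > 0);  // the modulo below needs at least one bucket
    }

    void add_value(const std::string& key, int value) {
        auto& bucket = buckets_[bucket_index(key)];
        for (auto& kv : bucket) {
            if (kv.first == key) { kv.second = value; return; }  // key already present
        }
        bucket.emplace_back(key, value);
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* get_value(const std::string& key) const {
        for (const auto& kv : buckets_[bucket_index(key)]) {
            if (kv.first == key) return &kv.second;
        }
        return nullptr;
    }

private:
    // The hash picks the bucket; operator== then picks the entry inside it.
    std::size_t bucket_index(const std::string& key) const {
        return std::hash<std::string>()(key) % buckets_.size();
    }

    std::vector<std::list<std::pair<std::string, int>>> buckets_;
};
```

Walking buckets_ from front to back would visit the keys in an order that depends on the hash values modulo the bucket count, which is exactly why no useful ordering falls out of it.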
"Unordered" doesn't mean that there isn't a linear sequence somewhere in the implementation. It means "you can't assume anything about the order of these elements".
For example, people often assume that entries will come out of a hash map in the same order they were put in. But they don't, because the entries are unordered.
As for "ordered by their hash value": hash values are generally taken from the full range of integers, but hash maps don't have 2**32 slots in them. The hash value's range will be reduced to the number of slots by taking it modulo the number of slots. Further, as you add entries to a hash map, it might change size to accommodate the new values. This can cause all the previous entries to be re-placed, changing their order.
In an unordered data structure, you can't assume anything about the order of the entries.
As the name unordered_map suggests, no ordering is specified by the C++0x standard. An unordered_map's apparent ordering will be dependent on whatever is convenient for the actual implementation.
If you want an analogy, look at the RDBMS of your choice.
If you don't specify an ORDER BY clause when performing a query, the results are returned "unordered" - that is, in whatever order the database feels like. The order is not specified, and the system is free to "order" them however it likes in order to get the best performance.
You are right, unordered_map is actually hash ordered. Note that most current implementations (pre TR1) call it hash_map.
The IBM C/C++ compiler documentation remarks that if you have an optimal hash function, the number of operations performed during lookup, insertion, and removal of an arbitrary element does not depend on the number of elements in the sequence, so this means that the order is not so unordered...
Now, what does it mean that it is hash ordered? As a hash should be unpredictable, by definition you can't make any assumption about the order of the elements in the map. This is the reason why it has been renamed in TR1: the old name suggested an order. Now we know that an order is actually used, but you can disregard it as it is unpredictable.

Data structure that maps unique id to an object

I'm looking for a C++ data structure that will let me associate objects with a unique numeric value (a key), and that will re-use these keys after the corresponding objects have been removed from the container. So it's basically somewhat of a hybrid map/queue structure. My current implementation is a
std::map<size_t, TObject>
and I insert objects like this:
size_t key = (m_Container.end()--)->first + 1;
m_Container[key] = some_object;
which works fine for my purposes (I will never allocate more than size_t objects); yet I still keep wondering if there is a more specialized container available, preferably already in the STL or Boost, or whether there is a way to use another container to achieve this goal.
(Of course I could, rather than taking the highest key in my map and adding one, go through the map every time and search for the first available key, but that would worsen the complexity from O(1) to O(n). Also, it would be nice if the API was a simple
size_t new_key = m_Container.insert(object);
).
Any ideas?
If you're never going to allocate more than size_t keys then I recommend you simply use a static counter:
size_t assign_id()
{
static size_t next_id;
return next_id++;
}
And if you want a nice API:
template<class Container>
size_t insert(Container & container, TObject const & obj)
{
container.insert(obj);
return assign_id();
}
std::set<TObject> m_Container;
size_t new_key = insert(m_Container, object);
I'm not certain what you exactly want from your ID. As it happens, each object already has a unique ID: its address! There are no two distinct objects with the same address, and the address of an object doesn't change over its lifetime.
std::set<T> typically stores its T values as members of larger nodes, not independent objects. Still, the T subobjects are never moved, and thus their addresses too are stable, unique identifiers.
Create std::set<key_type> removed_keys; of the removed keys. If removed_keys is not empty then use key from removed_keys else create a new key.
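A sketch of that idea, wrapped into a small class. ReusableKeyMap and its member names are made up, and handing out the smallest freed key first is just one possible policy:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>

template <typename T>
class ReusableKeyMap {
public:
    // Stores `value` and returns its key, reusing a freed key if one exists.
    std::size_t insert(const T& value) {
        std::size_t key;
        if (!removed_keys_.empty()) {
            key = *removed_keys_.begin();            // smallest freed key
            removed_keys_.erase(removed_keys_.begin());
        } else {
            key = next_key_++;                       // never used before
        }
        items_.emplace(key, value);
        return key;
    }

    // Removes the object and remembers its key for reuse.
    void erase(std::size_t key) {
        if (items_.erase(key) == 1) removed_keys_.insert(key);
    }

    // Returns a pointer to the object stored under `key`, or nullptr.
    const T* find(std::size_t key) const {
        auto it = items_.find(key);
        return it == items_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::size_t, T> items_;
    std::set<std::size_t> removed_keys_;
    std::size_t next_key_ = 0;
};
```

This gives the `size_t new_key = m_Container.insert(object);` style of API asked for in the question, at O(log n) per operation.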
Why not just use a vector?
std::vector<TObject> m_Container;
size_t key = m_Container.size();
m_Container.push_back(some_object);
Of course this could be completely useless if you have other usage characteristics. But since you only describe insert and the need for a key (so extracting) it is hard to give any other clear answer. But from these two requirements a std::vector<> should work just fine.
If you have some other usage characteristics like: (Elements can be removed), (we insert elements in large blocks), (we insert elements infrequently) etc these would be interesting factoids that may change the recommendations people give.
You mention that you want to search for unused element IDs. This suggests that you may be deleting elements, but I don't see any explicit requirements or usage where elements are being deleted.
Looking at your code above:
size_t key = (m_Container.end()--)->first + 1;
This is not doing what you think it is doing.
It is equivalent to:
size_t key = m_Container.end()->first + 1;
m_Container.end()--;
The post decrement operator modifies an lvalue. But the result of the expression is the original value. Thus you are applying the operator -> to the value returned by end(). This is (probably) undefined behavior.
See the standard:
Section:5.2.6 Increment and decrement
The value of a postfix ++ expression is the value of its operand.
m_Container.end()-- // This sub-expression returns m_Container.end()
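If the intent was "largest key plus one", something along these lines avoids dereferencing end(). next_key is a hypothetical helper, and returning 0 for an empty map is an assumption about what the first key should be:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// One past the largest key currently in the map, or 0 for an empty map.
template <typename V>
std::size_t next_key(const std::map<std::size_t, V>& m) {
    if (m.empty()) return 0;                // end() has no predecessor here
    return std::prev(m.end())->first + 1;   // the last element holds the largest key
}
```

m.rbegin()->first would work equally well for the non-empty case, since a std::map keeps its keys sorted.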
Alternative:
#include <vector>
template<typename T>
class X
{
public:
T const& operator[](std::size_t index) const {return const_cast<X&>(*this)[index];}
T& operator[](std::size_t index) {return data[index];}
void remove(std::size_t index) {unused.push_back(index);}
std::size_t insert(T const& value);
private:
std::vector<T> data;
std::vector<std::size_t> unused;
};
template<typename T>
std::size_t X<T>::insert(T const& value)
{
if (unused.empty())
{
data.push_back(value);
return data.size() - 1;
}
std::size_t result = unused.back();
unused.pop_back();
data[result] = value;
return result;
}
Is there a reason that you need std::map to not remove a key, value pair?
This sounds like an attempt at premature optimization.
A method is to replace the value part with a dummy or place holder value. The problem in the long run is that the dummy value can be extracted from the std::map as long as the key exists. You will need to add a check for dummy value every time the std::map is accessed.
Because you want to maintain a key without a value, you most likely will have to write your own container. You will especially have to design code to handle the case when the client accesses the key when it has no value.
It looks like there is no performance gain from using a standard container with keys that have no value. However, there may be a gain as far as memory is concerned: your approach would reduce fragmentation of dynamic memory, since memory would not have to be re-allocated for the same key. You'll have to decide on the trade-off.

std::map iteration - order differences between Debug and Release builds

Here's a common code pattern I have to work with:
class foo {
public:
void InitMap();
void InvokeMethodsInMap();
static void abcMethod();
static void defMethod();
private:
typedef std::map<const char*, pMethod> TMyMap;
TMyMap m_MyMap;
};
void
foo::InitMap()
{
m_MyMap["abc"] = &foo::abcMethod;
m_MyMap["def"] = &foo::defMethod;
}
void
foo::InvokeMethodsInMap()
{
for (TMyMap::const_iterator it = m_MyMap.begin();
it != m_MyMap.end(); it++)
{
(*it->second)(it->first);
}
}
However, I have found that the order that the map is processed in (within the for loop) can differ based upon whether the build configuration is Release or Debug. It seems that the compiler optimisation that occurs in Release builds affects this order.
I thought that by using begin() in the loop above, and incrementing the iterator after each method call, it would process the map in order of initialisation. However, I also remember reading that a map is implemented as a hash table, and order cannot be guaranteed.
This is particularly annoying, as most of the unit tests are run on a Debug build, and often strange order dependency bugs aren't found until the external QA team start testing (because they use a Release build).
Can anyone explain this strange behaviour?
Don't use const char* as the key for maps. That means the map is ordered by the addresses of the strings, not the contents of the strings. Use a std::string as the key type, instead.
std::map is not a hash table, it's usually implemented as a red-black tree, and elements are guaranteed to be ordered by some criteria (by default, < comparison between keys).
The definition of map is:
map<Key, Data, Compare, Alloc>
Where the last two template parameters default to:
Compare: less<Key>
Alloc: allocator<value_type>
When a new value is inserted into a map, it is compared against the existing keys using the Compare object to find its position (N.B. this is not a sequential search; the standard guarantees an insert complexity of O(log(N))). Because you are using 'const char*' as the key, the Compare object is less<const char*>, which just does a < on the two values. So in effect you are comparing the pointer values (not the strings), and therefore the order is effectively random (as you don't know where the compiler will put the strings).
There are two possible solutions:
Change the type of the key so that it compares the string values.
Define another Compare Type that does what you need.
Personally I (like Chris) would just use a std::string, because the < operator used on strings returns a comparison based on the string content. But for argument's sake we can just define a Compare type.
struct StringLess
{
bool operator()(const char* const& left,const char* const& right) const
{
return strcmp(left,right) < 0;
}
};
///
typedef std::map<const char*, int,StringLess> TMyMap;
If you want to use const char * as the key for your map, also set a key comparison function that uses strcmp (or similar) to compare the keys. That way your map will be ordered by the string's contents, rather than the string's pointer value (i.e. location in memory).
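A small sketch showing the effect of such a comparator. keys_in_order is a hypothetical helper, and StringLess is repeated (taking the pointers by value) only so the snippet compiles on its own:

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>
#include <vector>

// Compares C strings by content instead of by address.
struct StringLess {
    bool operator()(const char* left, const char* right) const {
        return std::strcmp(left, right) < 0;
    }
};

// Returns the keys in the map's iteration order.
std::vector<std::string> keys_in_order(const std::map<const char*, int, StringLess>& m) {
    std::vector<std::string> keys;
    for (const auto& kv : m) keys.push_back(kv.first);
    return keys;
}
```

With this comparator, the iteration order is alphabetical no matter where the compiler places the literals, and two different pointers to "abc" are treated as the same key. The map still stores only the pointers, so the strings they point to have to outlive the map.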