std::map iteration - order differences between Debug and Release builds - c++

Here's a common code pattern I have to work with:
#include <map>

class foo {
public:
    void InitMap();
    void InvokeMethodsInMap();
    // Each method receives the key it was registered under.
    static void abcMethod(const char* key);
    static void defMethod(const char* key);
private:
    typedef void (*pMethod)(const char*);
    typedef std::map<const char*, pMethod> TMyMap;
    TMyMap m_MyMap;
};

void
foo::InitMap()
{
    m_MyMap["abc"] = &foo::abcMethod;
    m_MyMap["def"] = &foo::defMethod;
}

void
foo::InvokeMethodsInMap()
{
    for (TMyMap::const_iterator it = m_MyMap.begin();
         it != m_MyMap.end(); ++it)
    {
        (*it->second)(it->first);
    }
}
However, I have found that the order that the map is processed in (within the for loop) can differ based upon whether the build configuration is Release or Debug. It seems that the compiler optimisation that occurs in Release builds affects this order.
I thought that by using begin() in the loop above, and incrementing the iterator after each method call, it would process the map in order of initialisation. However, I also remember reading that a map is implemented as a hash table, and order cannot be guaranteed.
This is particularly annoying, as most of the unit tests are run on a Debug build, and often strange order dependency bugs aren't found until the external QA team start testing (because they use a Release build).
Can anyone explain this strange behaviour?

Don't use const char* as the key for maps. That means the map is ordered by the addresses of the strings, not the contents of the strings. Use a std::string as the key type, instead.
std::map is not a hash table, it's usually implemented as a red-black tree, and elements are guaranteed to be ordered by some criteria (by default, < comparison between keys).

The definition of map is:
map<Key, Data, Compare, Alloc>
Where the last two template parameters default to:
Compare: less<Key>
Alloc: allocator<value_type>
When a new value is inserted into a map, its key is compared against the existing keys using Compare to find its position (N.B. this is not a sequential search; the standard guarantees a maximum insert complexity of O(log(N))). Because you are using 'const char*' as the key, the Compare object is less<const char*>, and this class just does a < on the two values. So in effect you are comparing the pointer values (not the strings they point to), and therefore the order is effectively random (as you don't know where the compiler will put the strings).
There are two possible solutions:
Change the type of the key so that it compares the string values.
Define another Compare Type that does what you need.
Personally I (like Chris) would just use a std::string, because the < operator used on strings returns a comparison based on the string content. But for argument's sake we can just define a Compare type.
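#include <cstring>
#include <map>

struct StringLess
{
    bool operator()(const char* const& left, const char* const& right) const
    {
        return std::strcmp(left, right) < 0;
    }
};

// Usage: pass the comparator as the third template argument.
typedef std::map<const char*, int, StringLess> TMyMap;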

If you want to use const char * as the key for your map, also set a key comparison function that uses strcmp (or similar) to compare the keys. That way your map will be ordered by the string's contents, rather than the string's pointer value (i.e. location in memory).
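To make the two suggested fixes concrete, here is a minimal sketch. The names CStrLess, joined_keys and run_ordering_example are made up for illustration (CStrLess is the same idea as StringLess above). With either fix the iteration order depends only on the key contents, so it should be the same in Debug and Release builds.

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>

// Hypothetical comparator: orders C strings by content, not by address.
struct CStrLess {
    bool operator()(const char* a, const char* b) const {
        return std::strcmp(a, b) < 0;
    }
};

// Joins the keys of a map in iteration order, e.g. "abc,def,".
template <typename Map>
std::string joined_keys(const Map& m) {
    std::string out;
    for (typename Map::const_iterator it = m.begin(); it != m.end(); ++it) {
        out += it->first;
        out += ',';
    }
    return out;
}

void run_ordering_example() {
    // Fix 1: std::string keys are ordered by content, whatever the insertion order.
    std::map<std::string, int> by_string;
    by_string["def"] = 2;
    by_string["abc"] = 1;
    assert(joined_keys(by_string) == "abc,def,");

    // Fix 2: const char* keys with a strcmp-based comparator give the same order.
    std::map<const char*, int, CStrLess> by_cstr;
    by_cstr["def"] = 2;
    by_cstr["abc"] = 1;
    assert(joined_keys(by_cstr) == "abc,def,");
}
```

Note that neither version iterates in insertion order; a std::map always iterates in comparator order.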

Related

Usefulness of KeyEqual in std::unordered_set/std::unordered_map

I understand that this may be a vague question, but I wonder what the real-world cases are in which a custom comparator is useful for the hash containers in std.
I understand its usefulness in ordered containers, but for hash containers it seems a bit weird.
The reason is that the hash value for elements that are equal according to the comparator needs to be the same, and I believe that in most cases that actually means converting the lookup/insert element to some common representation (it is faster and easier to implement).
For example:
set of case insensitive strings: if you want to hash properly you need to uppercase/lowercase the entire string anyway.
set of fractions(where 2/3 == 42/63): you need to convert 42/63 to 2/3 and then hash that...
So I wonder if someone can provide some real world examples of usefulness of customizing std::unordered_ template parameters so I can recognize those patterns in future code I write.
Note 1: the "symmetry argument" (std::map enables customization of a comparator, so std::unordered_ should be customizable also) is something I considered, and I do not think it is convincing.
Note 2: I mixed 2 kind of comparators (< and ==) in the post for brevity, I know that std::map uses < and std::unordered_map uses ==.
As per https://en.cppreference.com/w/cpp/container/unordered_set
Internally, the elements are not sorted in any particular order, but
organized into buckets. Which bucket an element is placed into depends
entirely on the hash of its value. This allows fast access to
individual elements, since once a hash is computed, it refers to the
exact bucket the element is placed into.
So the hash function defines the bucket your element will end up in, but once the bucket is decided, in order to find the element, the operator == will be used.
Basically operator == is used to resolve hash collisions, and hence you need your hash function and your operator == to be consistent. Furthermore, if your operator == says that two elements are equal, the set will not allow a duplicate.
As for customization, I think that the idea of a case-insensitive set of strings is a good one: given two strings, you will need to provide a case-insensitive hash function to allow the set to determine the bucket it has to store the string in. Then you will need to provide a custom KeyEqual to allow the set to actually retrieve the element.
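A minimal sketch of such a case-insensitive set follows. CaseInsensitiveHash, CaseInsensitiveEqual and to_lower_char are hypothetical names, and the hash is a deliberately simple multiply-and-add loop rather than a recommended hash function. The point it illustrates is that the hash and the equality only have to agree with each other; the stored string keeps its original spelling.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_set>

// std::tolower must be given an unsigned char value, hence the casts.
inline char to_lower_char(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Hypothetical hash: combines the lower-cased characters, so "Hello" and
// "HELLO" always produce the same hash value.
struct CaseInsensitiveHash {
    std::size_t operator()(const std::string& s) const {
        std::size_t h = 0;
        for (std::size_t i = 0; i < s.size(); ++i)
            h = h * 31 + static_cast<unsigned char>(to_lower_char(s[i]));
        return h;
    }
};

// Hypothetical KeyEqual: compares character by character, ignoring case.
struct CaseInsensitiveEqual {
    bool operator()(const std::string& a, const std::string& b) const {
        if (a.size() != b.size()) return false;
        for (std::size_t i = 0; i < a.size(); ++i)
            if (to_lower_char(a[i]) != to_lower_char(b[i])) return false;
        return true;
    }
};

typedef std::unordered_set<std::string, CaseInsensitiveHash, CaseInsensitiveEqual>
    CaseInsensitiveSet;

void run_case_insensitive_example() {
    CaseInsensitiveSet s;
    assert(s.insert("Hello").second);        // first spelling is stored
    assert(!s.insert("HELLO").second);       // rejected as a duplicate
    assert(s.count("hello") == 1);           // found regardless of case
    assert(*s.find("hello") == "Hello");     // original spelling preserved
}
```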
A case I had to deal with, in the past, was a way to allow users to insert strings, keeping track of their order of insertion but avoiding duplicates. So, given a struct like:
struct keyword{
std::string value;
int sequenceCounter;
};
You want to detect duplicates according only to value. One of the solutions I came up with was an unordered_set with a custom comparator/hash function, that used only value. This allowed me to check for the existence of a key before allowing insertion.
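Roughly, that could look like the sketch below. KeywordHash, KeywordEqual and add_keyword are names invented for this example, and using the current set size as the sequence counter is an assumption that only holds if elements are never erased.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

struct keyword {
    std::string value;
    int sequenceCounter;
};

// Hash and equality both look at value only, so sequenceCounter is ignored
// when checking for duplicates.
struct KeywordHash {
    std::size_t operator()(const keyword& k) const {
        return std::hash<std::string>()(k.value);
    }
};
struct KeywordEqual {
    bool operator()(const keyword& a, const keyword& b) const {
        return a.value == b.value;
    }
};

typedef std::unordered_set<keyword, KeywordHash, KeywordEqual> KeywordSet;

// Inserts value with the next sequence number; returns false for a duplicate.
// Assumes nothing is ever erased, so size() can serve as the counter.
bool add_keyword(KeywordSet& set, const std::string& value) {
    keyword k;
    k.value = value;
    k.sequenceCounter = static_cast<int>(set.size());
    return set.insert(k).second;
}

void run_keyword_example() {
    KeywordSet set;
    assert(add_keyword(set, "alpha"));
    assert(add_keyword(set, "beta"));
    assert(!add_keyword(set, "alpha"));   // duplicate by value, not inserted
    assert(set.size() == 2);
}
```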
One interesting usage is to define memory efficient indexes (database sense of the term) for a given set of objects.
Example
Let's say we have a program that has a collection of N objects of this class:
struct Person {
// each object has a unique firstName/lastName pair
std::string firstName;
std::string lastName;
// each object has a unique ssn value
std::string socialSecurityNumber;
// each object has a unique email value
std::string email;
};
And we need to retrieve efficiently objects by the value of any unique property.
Implementations comparison
Time complexities are given assuming string comparisons are constant time (strings have limited length).
1) Single unordered_map
With a single map indexed by a single key (ex: email):
std::unordered_map<std::string,Person> indexedByEmail;
Time complexity: lookup by any unique property other than email requires a traversal of the map: average O(N).
Memory usage: the email value is duplicated. This could be avoided by using a single set with custom hash & compare (see 3).
2) Multiple unordered_map, no custom hash & compare
With a map for each unique property, with default hash & comparisons:
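// Note: std::hash has no specialization for std::pair, so this line needs a user-supplied pair hash to compile.
std::unordered_map<std::pair<std::string,std::string>, Person> byName;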
std::unordered_map<std::string, const Person*> byEmail;
std::unordered_map<std::string, const Person*> bySSN;
Time complexity: by using the appropriate map, a lookup by any unique property is average O(1).
Memory usage: inefficient, because of all the string duplications.
3) Multiple unordered_set, custom hash & comparison:
With custom hash & comparison, we define different unordered_sets which hash & compare only specific fields of the objects. These sets can be used to perform lookups as if the items were stored in a map as in 2, but without duplicating any field.
using StrHash = std::hash<std::string>;
// --------------------
struct PersonNameHash {
std::size_t operator()(const Person& p) const {
// not the best hashing function in the world, but good enough for demo purposes.
return StrHash()(p.firstName) + StrHash()(p.lastName);
}
};
struct PersonNameEqual {
bool operator()(const Person& p1, const Person& p2) const {
return (p1.firstName == p2.firstName) && (p1.lastName == p2.lastName);
}
};
std::unordered_set<Person, PersonNameHash, PersonNameEqual> byName;
// --------------------
struct PersonSsnHash {
std::size_t operator()(const Person* p) const {
return StrHash()(p->socialSecurityNumber);
}
};
struct PersonSsnEqual {
bool operator()(const Person* p1, const Person* p2) const {
return p1->socialSecurityNumber == p2->socialSecurityNumber;
}
};
std::unordered_set<const Person*, PersonSsnHash, PersonSsnEqual> bySSN;
// --------------------
struct PersonEmailHash {
std::size_t operator()(const Person* p) const {
return StrHash()(p->email);
}
};
struct PersonEmailEqual {
bool operator()(const Person* p1, const Person* p2) const {
return p1->email == p2->email;
}
};
std::unordered_set<const Person*,PersonEmailHash,PersonEmailEqual> byEmail;
Time complexity: a lookup by any unique property is still O(1) average.
Memory usage: much better than 2): no string duplication.
The hash function extracts the features of an element that matter, and the comparator's job is to decide whether two elements have the same features.
If you wrap the data in a "shell" type that exposes only those features, you may not need to customize the comparator at all.
Briefly: put a feature shell around the data, and let the features be what gets compared.
To be honest, I don't fully understand your problem description, so my answer may be a bit muddled. Please bear with me.
:)

Speed up access to many std::maps with same key

Suppose you have a std::vector<std::map<std::string, T> >. You know that all the maps have the same keys. They might have been initialized with
typedef std::map<std::string, int> MapType;
std::vector<MapType> v;
const int n = 1000000;
v.reserve(n);
for (int i=0;i<n;i++)
{
std::map<std::string, int> m;
m["abc"] = rand();
m["efg"] = rand();
m["hij"] = rand();
v.push_back(m);
}
Given a key (e.g. "efg"), I would like to extract all values of the maps for the given key (which definitely exists in every map).
Is it possible to speed up the following code?
std::vector<int> efgValues;
efgValues.reserve(v.size());
BOOST_FOREACH(MapType const& m, v)
{
efgValues.push_back(m.find("efg")->second);
}
Note that the values are not necessarily int. As profiling confirms that most time is spent in the find function, I was thinking about whether there is a (GCC and MSVC compliant C++03) way to avoid locating the element in the map based on the key for every single map again, because the structure of all the maps is equal.
If not, would it be possible with boost::unordered_map (which is 15% slower on my machine with the code above)? Would it be possible to cache the hash value of the string?
P.S.: I know that having a std::map<std::string, std::vector<T> > would solve my problem. However, I cannot change the data structure (which is actually more complex than what I showed here).
You can cache and playback the sequence of comparison results using a stateful comparator. But that's just nasty; the solution is to adjust the data structure. There's no "cannot." Actually, adding a stateful comparator is changing the data structure. That requirement rules out almost anything.
Another possibility is to create a linked list across the objects of type T so you can get from each map to the next without another lookup. If you might be starting at any of the maps (please, just refactor the structure) then a circular or doubly-linked list will do the trick.
As profiling confirms that most time is spent in the find function
Keeping the tree data structures and optimizing the comparison can only speed up the comparison. Unless the time is spent in operator< (std::string const&, std::string const&), you need to change the way it's linked together.
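One option that is not mentioned above, but that keeps the maps untouched and stays within C++03: because every map holds the same keys and std::map iterates in sorted key order, the key sits at the same position in each map. The sketch below (key_rank and extract_by_rank are made-up names) looks the position up once and then steps to it in every other map without comparing strings. It assumes the key really is present in every map, and stepping costs time proportional to the position, so it is only likely to pay off for small maps; profile before relying on it.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, int> MapType;

// Position of key in the sorted key sequence of m. Assumes key is present.
std::ptrdiff_t key_rank(const MapType& m, const std::string& key) {
    return std::distance(m.begin(), m.find(key));
}

// Extracts the value for key from every map, doing only one find() in total.
// Assumes all maps contain exactly the same set of keys.
std::vector<int> extract_by_rank(const std::vector<MapType>& v, const std::string& key) {
    std::vector<int> out;
    if (v.empty()) return out;
    out.reserve(v.size());
    const std::ptrdiff_t rank = key_rank(v.front(), key);
    for (std::size_t i = 0; i < v.size(); ++i) {
        MapType::const_iterator it = v[i].begin();
        std::advance(it, rank);   // walks rank nodes, no string comparisons
        out.push_back(it->second);
    }
    return out;
}

void run_rank_example() {
    std::vector<MapType> v;
    for (int i = 0; i < 3; ++i) {
        MapType m;
        m["abc"] = i;
        m["efg"] = 10 * i;
        m["hij"] = 100 * i;
        v.push_back(m);
    }
    std::vector<int> efg = extract_by_rank(v, "efg");
    assert(efg.size() == 3);
    assert(efg[0] == 0 && efg[1] == 10 && efg[2] == 20);
}
```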

C string map key

Is there any issue with using a C string as a map key?
std::map<const char*, int> imap;
The order of the elements in the map doesn't matter, so it's okay if they are ordered using std::less<const char*>.
I'm using Visual Studio and according to MSDN (Microsoft specific):
In some cases, identical string literals can be "pooled" to save space in the executable file. In string-literal pooling, the compiler causes all references to a particular string literal to point to the same location in memory, instead of having each reference point to a separate instance of the string literal.
It says that they are only pooled in some cases, so it seems like accessing the map elements using a string literal would be a bad idea:
//Could these be referring to different map elements?
int i = imap["hello"];
int j = imap["hello"];
Is it possible to overload operator== for const char* so that the actual C string and not the pointer values would be used to determine if the map elements are equal:
bool operator==(const char* a, const char* b)
{
return strcmp(a, b) == 0 ? true : false;
}
Is it ever a good idea to use a C string as a map key?
Is it possible to overload operator== for const char* so that the actual C string and not the pointer values would be used to determine if the map elements are equal
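No, it's not possible: an operator can only be overloaded when at least one operand has a class or enumeration type, and std::map uses its comparison object (std::less by default) rather than operator== anyway. And no, it's not a good idea, for exactly the reason pointed out in the question and because you don't need char*; you can use a std::string instead. (You can provide a custom compare function, as pointed out by simonc, but I'd advise against it.)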
//Could these be referring to different map elements?
int i = imap["hello"];
int j = imap["hello"];
Yes, and they can even refer to elements that don't exist yet, but they'll be created by operator[] and be value initialized. The same issue exists with assignment:
imap["hello"] = 0;
imap["hello"] = 1;
The map could now have 1 or 2 elements.
You can provide a map with a custom comparator which compares the C strings:
struct CstrCmp
{
    bool operator()(const char* a, const char* b) const
    {
        return strcmp(a, b) < 0;
    }
};
std::map<const char*, YourType, CstrCmp> imap;
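To see both behaviours side by side, here is a small sketch. CStrLess, ByAddress, ByContent and insert_same_text_twice are names made up for this example. It uses two separate char arrays instead of literals, so the two addresses are guaranteed to differ and the result does not depend on string pooling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>

// Hypothetical comparator: orders C strings by content.
struct CStrLess {
    bool operator()(const char* a, const char* b) const {
        return std::strcmp(a, b) < 0;
    }
};

typedef std::map<const char*, int> ByAddress;            // default std::less<const char*>
typedef std::map<const char*, int, CStrLess> ByContent;  // strcmp-based ordering

// Inserts the same text through two different buffers and reports the map size.
template <typename Map>
std::size_t insert_same_text_twice() {
    char first[] = "hello";    // two separate arrays, so two different addresses
    char second[] = "hello";
    Map m;
    m[first] = 0;
    m[second] = 1;
    return m.size();
}

void run_cstring_key_example() {
    assert(insert_same_text_twice<ByAddress>() == 2);  // keys differ by address
    assert(insert_same_text_twice<ByContent>() == 1);  // keys are equivalent by content
}
```

Even with the comparator, the map only stores the pointers, so the buffers have to outlive the map entries.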
First, in order to introduce ordering over map keys you need to define a "less-than" comparison. A map says that two elements are "equivalent" if neither is less than the other. It's a bad idea to use char* for map keys because you will need to do memory management somewhere outside the map.
In most realistic scenarios when you query a map your keys will not be literals.
On the other hand, if you maintain a pool of string literals yourself and assign an ID to every literal you could use those IDs for map keys.
To summarize, I wouldn't rely on Microsoft saying "In some cases literals may be pooled". If you fill the map with literals and if you query the map with literals as keys you might as well use enum for keys.

Is the unordered_map really unordered?

I am very confused by the name 'unordered_map'. The name suggests that the keys are not ordered at all. But I always thought they are ordered by their hash value. Or is that wrong (because the name implies that they are not ordered)?
Or to put it different: Is this
typedef map<K, V, HashComp<K> > HashMap;
with
template<typename T>
struct HashComp {
bool operator()(const T& v1, const T& v2) const {
return hash<T>()(v1) < hash<T>()(v2);
}
};
the same as
typedef unordered_map<K, V> HashMap;
? (OK, not exactly, STL will complain here because there may be keys k1,k2 and neither k1 < k2 nor k2 < k1. You would need to use multimap and overwrite the equal-check.)
Or again differently: When I iterate through them, can I assume that the key-list is ordered by their hash value?
In answer to your edited question, no those two snippets are not equivalent at all. std::map stores nodes in a tree structure, unordered_map stores them in a hashtable*.
Keys are not stored in order of their "hash value" because they're not stored in any order at all. They are instead stored in "buckets", where each bucket holds the keys whose hash values map to it (below, the hash value modulo the number of buckets). Basically, the implementation goes like this:
function add_value(object key, object value) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
buckets[bucket_index] = new linked_list();
}
buckets[bucket_index].add(new key_value(key, value));
}
function get_value(object key) {
int hash = key.getHash();
int bucket_index = hash % NUM_BUCKETS;
if (buckets[bucket_index] == null) {
return null;
}
foreach(key_value kv in buckets[bucket_index]) {
if (kv.key == key) {
return kv.value;
}
}
return null; // key not present in this bucket
}
Obviously that's a serious simplification and real implementation would be much more advanced (for example, supporting resizing the buckets array, maybe using a tree structure instead of linked list for the buckets, and so on), but that should give an idea of how you can't get back the values in any particular order. See wikipedia for more information.
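For anyone who prefers C++ to pseudocode, here is roughly the same idea as a compilable sketch. TinyHashMap is a made-up class and not how any standard library implements unordered_map. It assumes a fixed bucket count greater than zero, and unlike the pseudocode it overwrites the value when a key is added twice.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical fixed-size chained hash table, mirroring the pseudocode above.
class TinyHashMap {
public:
    // num_buckets must be greater than zero.
    explicit TinyHashMap(std::size_t num_buckets) : buckets_(num_buckets) {}

    void add(const std::string& key, int value) {
        Bucket& b = buckets_[index_of(key)];
        for (Bucket::iterator it = b.begin(); it != b.end(); ++it) {
            if (it->first == key) {   // key already present: overwrite
                it->second = value;
                return;
            }
        }
        b.push_back(std::make_pair(key, value));
    }

    // Returns a pointer to the stored value, or nullptr if the key is absent.
    const int* get(const std::string& key) const {
        const Bucket& b = buckets_[index_of(key)];
        for (Bucket::const_iterator it = b.begin(); it != b.end(); ++it) {
            if (it->first == key) return &it->second;
        }
        return nullptr;
    }

private:
    typedef std::list<std::pair<std::string, int> > Bucket;

    // The bucket is chosen from the hash alone; equality is only used inside it.
    std::size_t index_of(const std::string& key) const {
        return std::hash<std::string>()(key) % buckets_.size();
    }

    std::vector<Bucket> buckets_;
};

void run_tiny_hash_map_example() {
    TinyHashMap m(8);
    m.add("abc", 1);
    m.add("def", 2);
    assert(*m.get("abc") == 1);
    assert(*m.get("def") == 2);
    assert(m.get("missing") == nullptr);
}
```

Walking buckets_ from front to back would visit the entries in bucket order, which has no useful relationship to key order or insertion order.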
* Technically, the internal implementations of std::map and unordered_map are implementation-defined, but the standard requires certain Big-O complexities for operations, which implies those internal implementations.
"Unordered" doesn't mean that there isn't a linear sequence somewhere in the implementation. It means "you can't assume anything about the order of these elements".
For example, people often assume that entries will come out of a hash map in the same order they were put in. But they don't, because the entries are unordered.
As for "ordered by their hash value": hash values are generally taken from the full range of integers, but hash maps don't have 2**32 slots in them. The hash value's range will be reduced to the number of slots by taking it modulo the number of slots. Further, as you add entries to a hash map, it might change size to accommodate the new values. This can cause all the previous entries to be re-placed, changing their order.
In an unordered data structure, you can't assume anything about the order of the entries.
As the name unordered_map suggests, no ordering is specified by the C++0x standard. An unordered_map's apparent ordering will be dependent on whatever is convenient for the actual implementation.
If you want an analogy, look at the RDBMS of your choice.
If you don't specify an ORDER BY clause when performing a query, the results are returned "unordered" - that is, in whatever order the database feels like. The order is not specified, and the system is free to "order" them however it likes in order to get the best performance.
You are right, unordered_map is actually hash ordered. Note that most current implementations (pre TR1) call it hash_map.
The IBM C/C++ compiler documentation remarks that if you have an optimal hash function, the number of operations performed during lookup, insertion, and removal of an arbitrary element does not depend on the number of elements in the sequence, which means that the order is not so unordered...
Now, what does it mean that it is hash ordered? As a hash should be unpredictable, by definition you can't make any assumption about the order of the elements in the map. This is the reason why it was renamed in TR1: the old name suggested an order. Now we know that an order is actually used, but you can disregard it as it is unpredictable.

Data structure that maps unique id to an object

I'm looking for a C++ data structure that will let me associate objects with a unique numeric value (a key), and that will re-use these keys after the corresponding objects have been removed from the container. So it's basically somewhat of a hybrid map/queue structure. My current implementation is a
std::map<size_t, TObject>
and I insert objects like this:
size_t key = (m_Container.end()--)->first + 1;
m_Container[key] = some_object;
which works fine for my purposes (I will never allocate more objects than a size_t can count); yet still I keep wondering if there is a more specialized container available, preferably already in the STL or Boost, or a way to use another container to achieve this goal.
(Of course I could, rather than taking the highest key in my map and adding one, go through the map every time and search for the first available key, but that would worsen the complexity from O(1) to O(n). Also, it would be nice if the API was a simple
size_t new_key = m_Container.insert(object);
).
Any ideas?
If you're never going to allocate more than size_t keys then I recommend you simply use a static counter:
size_t assign_id()
{
static size_t next_id;
return next_id++;
}
And if you want a nice API:
template<class Container>
size_t insert(Container & container, TObject const & obj)
{
container.insert(obj);
return assign_id();
}
std::set<TObject> m_Container;
size_t new_key = insert(m_Container, object);
I'm not certain what you exactly want from your ID. As it happens, each object already has a unique ID: its address! There are no two distinct objects with the same address, and the address of an object doesn't change over its lifetime.
std::set<T> typically stores its T values as members of larger nodes, not independent objects. Still, the T subobjects are never moved, and thus their addresses too are stable, unique identifiers.
Keep a std::set<key_type> removed_keys; holding the keys that have been removed. If removed_keys is not empty, reuse a key from it; otherwise create a new key.
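A minimal sketch of that idea, assuming the objects live in a std::map as in the question. ReusingMap and its members are hypothetical names, and it hands back the smallest released key first, which is a choice of this sketch and not something the question requires.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical wrapper: reuses released keys before creating new ones.
template <typename T>
class ReusingMap {
public:
    ReusingMap() : next_key_(0) {}

    // Stores obj and returns its key.
    std::size_t insert(const T& obj) {
        std::size_t key;
        if (!removed_keys_.empty()) {
            key = *removed_keys_.begin();               // smallest released key
            removed_keys_.erase(removed_keys_.begin());
        } else {
            key = next_key_++;                          // never used before
        }
        items_.insert(std::make_pair(key, obj));
        return key;
    }

    // Erases the object and makes its key available again.
    void remove(std::size_t key) {
        if (items_.erase(key) == 1) removed_keys_.insert(key);
    }

    // Returns a pointer to the object, or a null pointer if the key is unused.
    const T* find(std::size_t key) const {
        typename std::map<std::size_t, T>::const_iterator it = items_.find(key);
        if (it == items_.end()) return 0;
        return &it->second;
    }

private:
    std::map<std::size_t, T> items_;
    std::set<std::size_t> removed_keys_;
    std::size_t next_key_;
};

void run_reusing_map_example() {
    ReusingMap<std::string> m;
    std::size_t a = m.insert("a");
    std::size_t b = m.insert("b");
    assert(a == 0 && b == 1);
    m.remove(a);
    assert(m.find(a) == 0);
    assert(m.insert("c") == a);   // the released key is handed out again
}
```

Insertion and removal are O(log n) here, coming from both the map and the set.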
Why not just use a vector?
std::vector<TObject> m_Container;
size_t key = m_Container.size();
m_Container.push_back(some_object);
Of course this could be completely useless if you have other usage characteristics. But since you only describe insert and the need for a key (so extracting) it is hard to give any other clear answer. But from these two requirements a std::vector<> should work just fine.
If you have some other usage characteristics like: (Elements can be removed), (we insert elements in large blocks), (we insert elements infrequently) etc these would be interesting factoids that may change the recommendations people give.
You mention that you want to search for unused element IDs. This suggests that you may be deleting elements, but I don't see any explicit requirement or usage where elements are being deleted.
Looking at your code above:
size_t key = (m_Container.end()--)->first + 1;
This is not doing what you think it is doing.
It is equivalent to:
size_t key = m_Container.end()->first + 1;
m_Container.end()--;
The post decrement operator modifies an lvalue. But the result of the expression is the original value. Thus you are applying the operator -> to the value returned by end(). This is (probably) undefined behavior.
See the standard:
Section:5.2.6 Increment and decrement
The value of a postfix ++ expression is the value of its operand.
m_Container.end()-- // This sub-expression returns m_Container.end()
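If the intent was "largest key plus one", one well-defined way to write it is with rbegin(), which refers to the last element of a non-empty map. The helper name next_key below is made up for this sketch, and returning 0 for an empty map is an assumption about what the first key should be.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Returns one more than the largest key, or 0 if the map is empty.
template <typename V>
std::size_t next_key(const std::map<std::size_t, V>& m) {
    if (m.empty()) return 0;       // rbegin() must not be dereferenced on an empty map
    return m.rbegin()->first + 1;  // rbegin() refers to the element with the largest key
}

void run_next_key_example() {
    std::map<std::size_t, int> m;
    assert(next_key(m) == 0);
    m[3] = 30;
    m[7] = 70;
    assert(next_key(m) == 8);
}
```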
Alternative:
#include <vector>
template<typename T>
class X
{
public:
T const& operator[](std::size_t index) const {return const_cast<X&>(*this)[index];}
T& operator[](std::size_t index) {return data[index];}
void remove(std::size_t index) {unused.push_back(index);}
std::size_t insert(T const& value);
private:
std::vector<T> data;
std::vector<std::size_t> unused;
};
template<typename T>
std::size_t X<T>::insert(T const& value)
{
if (unused.empty())
{
data.push_back(value);
return data.size() - 1;
}
std::size_t result = unused.back();
unused.pop_back();
data[result] = value;
return result;
}
Is there a reason that you need std::map to not remove a key, value pair?
This sounds like an attempt at premature optimization.
One approach is to replace the value part with a dummy or placeholder value. The problem in the long run is that the dummy value can be extracted from the std::map as long as the key exists, so you will need to add a check for the dummy value every time the std::map is accessed.
Because you want to maintain a key without a value, you will most likely have to write your own container. In particular, you will have to design code to handle the case where the client accesses a key that has no value.
It looks like there is no performance gain from keeping a key without a value in a standard container. However, there may be a gain as far as memory is concerned: your approach would reduce fragmentation of dynamic memory, since memory would not have to be re-allocated for the same key. You'll have to decide on the trade-off.