Hash map, string compares, and std::map? - c++

First off, I would like to make a few points I believe to be true. Please can these be verified?
A hash map stores strings by converting them into an integer somehow.
std::map is not a hash map, and if I'm using strings, I should consider using a hash map for memory issues?
String compares are not good to rely on.
If std::map is not a hash map and I should not be relying on string compares (basically, I have a map with strings as keys...I was told to look up using hash maps instead?), is there a hash map in the C++ STL? If not, how about Boost?
Secondly, is a hash map worth it for [originally] a std::map< std::string, non-POD GameState >?
I think my point is getting across...I plan to have a store of different game states I can look up and register into a factory.
If any more info is needed, please ask.
Thanks for your time.

I don't believe most of your points are correct.
There is no hash map in the current standard. C++0x introduces unordered_map, whose implementation will be a hash table, and your compiler probably already supports it.
std::map is implemented as a balanced tree, not a hash table. There are no "memory issues" when using either map type with strings, either as keys or data.
Strings are not stored as numbers in either case - an unordered_map will use a hashing function to derive a numeric key from the string, but this is not stored.
My experience is that unordered_map is about twice the speed of map. They have basically the same interface, so you can try both with your own data. Whenever you are interested in performance, you should always run tests yourself with your own real data rather than depending on the experience of others. Both map types will be somewhat sensitive to the length of the string key.
Assuming you have some class A, that you want to access via a string key, the maps would be declared as:
map <string, A> amap;
unordered_map <string, A> umap;
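To illustrate the "same interface" point, here is a minimal sketch, assuming a compiler with C++11 support. The struct A, lookup_or and compare_interfaces are hypothetical names made up for this example; the same template works for either container because both provide operator[], find() and end().

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the "class A" mentioned above.
struct A {
    int value = 0;
};

// Works with either container: both offer find() and end().
template <typename Map>
int lookup_or(const Map& m, const std::string& key, int fallback) {
    auto it = m.find(key);
    return it == m.end() ? fallback : it->second.value;
}

// Fills both containers identically and checks that lookups agree.
inline void compare_interfaces() {
    std::map<std::string, A> amap;
    std::unordered_map<std::string, A> umap;
    amap["menu"].value = 1;
    umap["menu"].value = 1;
    assert(lookup_or(amap, "menu", -1) == lookup_or(umap, "menu", -1));
    assert(lookup_or(amap, "missing", -1) == -1);
}
```

Because only the declaration changes, swapping one container for the other in a benchmark is usually a one-line edit.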

I made a benchmark that compares std::map with boost::unordered_map.
My conclusion was basically this: If you do not need map-specific things like equal_range, then always use boost::unordered_map.
The full benchmark can be found here

A hash map will typically have some integral representation of a string, yes.
std::map has a requirement to be sorted, so implementing it as a hash table is unlikely, and I've never seen it in practice.
Whether string comparisons are good or bad depends entirely on what you're doing, what data you're comparing, and how often. If the first letter differs then that's barely any different from an integer comparison, for example.
You want unordered_map (that's the Boost version - there is also a version in the TR1 standard library if your compiler has that).
Is it worth it for game states? Yes, but only because using an unordered_map is simple. You're prematurely worrying about optimisations at this stage. Save the worries over access patterns for things you're going to look up thousands of times a second (ie. when your profiler tells you that it's a problem).

Related

which container from std::map or std::unordered_map is suitable for my case

I don't know how a red-black tree works with string keys. I've seen it explained with numbers on YouTube and it baffled me a lot. However, I know very well how unordered_map works (the internals of hash maps). std::map stays esoteric for me, but I have read and tested that if we don't make many changes to the std::map, it can beat hash maps.
My case is simple, I have a std::map of <std::string,bool>. Keys contains paths to XML elements (example of a key: "Instrument_Roots/Instrument_Root/Rating_Type"), and I use the boolean value in my SAX parser to know if we reached a particular element.
I build this map only once. After that, all I do is use std::find to check whether a particular key (path) exists in order to set its boolean value to true, or search for the first element whose associated value is true and use its corresponding key. Finally, I set all the boolean values to false to guarantee that only a single key has a true value.
You shouldn't need to understand how red-black trees work in order to understand how to use a std::map. It's simply an associative array where the keys are in order (lexicographical order, in the case of string keys, at least with the default comparison function). That means that you can not only look keys up in a std::map, you can also make queries which depend on order. For example, you can find the largest key in the map which is not greater than the key you have. You can find the next larger key. Or (again in the case of strings) you can find all keys which start with the same prefix.
If you iterate over all the key-value pairs in a std::map, you will see them in order by key. That can be very useful, sometimes.
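The order-dependent queries described above can be sketched with lower_bound and upper_bound. This is only an illustration: keys_with_prefix and floor_key are hypothetical helper names, and the map type is assumed to match the std::map<std::string, bool> from the question.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Returns every key that starts with `prefix`, in sorted order.
// lower_bound jumps to the first key >= prefix; matching keys are
// contiguous from that point on.
std::vector<std::string> keys_with_prefix(const std::map<std::string, bool>& m,
                                          const std::string& prefix) {
    std::vector<std::string> out;
    for (auto it = m.lower_bound(prefix);
         it != m.end() && it->first.compare(0, prefix.size(), prefix) == 0;
         ++it) {
        out.push_back(it->first);
    }
    return out;
}

// Returns the largest key that is not greater than `key`,
// or an empty string if every key is greater.
std::string floor_key(const std::map<std::string, bool>& m,
                      const std::string& key) {
    auto it = m.upper_bound(key);  // first key strictly greater than `key`
    if (it == m.begin()) return "";
    return std::prev(it)->first;
}
```

Neither query has a direct equivalent on std::unordered_map, which is the functionality you give up in exchange for hashing.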
The extra functionality comes at a price. std::map is usually slower than std::unordered_map (though not always; for large string keys, the overhead of computing the hash function might be noticeable), and the underlying data structure has a certain amount of overhead, so they may occupy more space. The usual advice is to use a std::map if you find the fact that the keys are ordered to be essential or even useful.
But if you've benchmarked and concluded that for your application, a std::map is also faster, then go ahead and use it :)
It is occasionally useful to have a map whose mapped type is bool, but only if you need to distinguish between keys whose corresponding value is false and keys which are not present in the map at all. In effect, a std::map<T, bool> (or std::unordered_map<T, bool>) provides a ternary choice for each possible key.
If you don't need to distinguish between the two false cases, and you don't frequently change a key's value, then you may well be better off with a std::set (or std::unordered_set), which is exactly the same datastructure but without the overhead of the bool in each element. (Although only one bit of the bool is useful, alignment considerations may end up using 8 additional bytes for each entry.) Other than storage space, there won't be much (if any) performance difference, though.
If you do really need a ternary case, then you would be well-advised to make the value an enum rather than a bool. What do true and false mean in the context of your usage? My guess is that they don't mean "true" and "false". Instead, they mean something like "is an attribute path" and "is an element path". That distinction could be made much clearer (and therefore less accident-prone) by using enum PathType {ATTRIBUTE_PATH, ELEMENT_PATH};. That will not involve any additional resources, since the bool is occupying eight bytes of storage in any case (because of alignment).
By the way, there is no guarantee that the underlying data structure is precisely a red-black tree, although the performance guarantees would be difficult to achieve without some kind of self-balancing tree. I don't know of such an implementation, but it would be possible to use k-ary trees (for some small k) to take advantage of SIMD vector comparison operations, for example. Of course, that would need to be customized for appropriate key types.
If you do want to understand red-black trees, you could do worse than Robert Sedgewick's standard textbook on Algorithms. On the book's website, you'll find a brief illustrated explanation in the chapter on balanced trees.
I would recommend you to use std::unordered_set because you really don't need to store this boolean flag and you also don't need to keep these xml tags in sorted order so std::unordered_set seems to me as logical and the most efficient choice.
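A minimal sketch of that suggestion, assuming only one path can be "true" at a time as the question describes. PathTracker and its member names are hypothetical; the set holds the known paths and a single stored string replaces the per-key bool.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical tracker for the SAX use case in the question.
class PathTracker {
public:
    void add(const std::string& path) { known_.insert(path); }

    // Marks `path` as the active one if it is a known path.
    // Returns false (and changes nothing) when the path is unknown.
    bool activate(const std::string& path) {
        if (known_.count(path) == 0) return false;
        active_ = path;
        has_active_ = true;
        return true;
    }

    // Replaces the "set every bool back to false" pass.
    void reset() {
        has_active_ = false;
        active_.clear();
    }

    bool has_active() const { return has_active_; }
    const std::string& active() const { return active_; }

private:
    std::unordered_set<std::string> known_;
    std::string active_;
    bool has_active_ = false;
};
```

With this layout, finding the active path and clearing it are both constant-time operations instead of a scan over the whole container.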

What's a decent performing hasher for std::unordered_map, which treats my Key as a generic block of memory?

I have a key struct which contains no pointers/references/stl. It's a struct that would be valid to "memcpy" if you wanted. I want to quickly define a map on it, using a generic hash algorithm that just hashes the underlying memory.
std::unordered_map< MyKey, MyValue, HashMemoryState< MyKey> > resultsMap;
Does the standard library already provide a generic Hasher class like what I'm envisioning HashMemoryState as doing? Or do I have to define it myself?
How can I define 'HashMemoryState' to be a generic hasher class that std::unordered_map will accept when I compile? What light-weight hash function would yield decent general overall performance? I don't expect too many collisions and I don't think specializing the algorithm to my specific struct would make much difference for me. For example, one of the key structs I'll be using is a tightly packed Rubik's cube state. It's already kind of random looking, it just needs a quick hash of some sort.
There is a standard hash for strings, of course, and strings can contain embedded null characters so you could simply treat your data as a string and hash that.
Of course to actually invoke the string hash you'd have to copy your data into a string.
Hashes are constant time, but that means constant with regards to the number of items in your collection. A hash on a string or bunch of bytes might depend on the length of this buffer.
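As far as I know the standard library has no ready-made hasher for arbitrary blocks of memory, so here is one way HashMemoryState could be sketched without the string copy. It uses 64-bit FNV-1a, which is my choice for illustration and not something the question specified. EqualMemoryState and the MyKey layout are hypothetical, and the approach assumes the key has no padding bytes (or that padding is always zeroed), otherwise equal keys could hash differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <unordered_map>

// Hashes the raw bytes of a trivially copyable key with 64-bit FNV-1a.
// Assumption: Key contains no uninitialized padding bytes.
template <typename Key>
struct HashMemoryState {
    static_assert(std::is_trivially_copyable<Key>::value,
                  "Key must be safe to treat as a block of bytes");

    std::size_t operator()(const Key& key) const {
        unsigned char bytes[sizeof(Key)];
        std::memcpy(bytes, &key, sizeof(Key));
        std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a offset basis
        for (unsigned char b : bytes) {
            h ^= b;
            h *= 0x100000001b3ULL;                // FNV-1a prime
        }
        return static_cast<std::size_t>(h);
    }
};

// Matching equality that also compares raw bytes, so the key type
// does not need its own operator==.
template <typename Key>
struct EqualMemoryState {
    bool operator()(const Key& a, const Key& b) const {
        return std::memcmp(&a, &b, sizeof(Key)) == 0;
    }
};

// Hypothetical packed key, used only for illustration.
struct MyKey {
    std::uint32_t corners;
    std::uint32_t edges;
};
```

The map would then be declared as std::unordered_map<MyKey, MyValue, HashMemoryState<MyKey>, EqualMemoryState<MyKey>>. The cost of the hash grows with sizeof(MyKey), in line with the remark above about buffer length.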

Data container for fast lookup and retrieval

Actually I need a data structure that helps me in reducing time for look-ups and retrieval of values of the respective keys.
Right now I am using a map container with key as structure and want to retrieve its values as fast as possible.
I am using gcc on fedora 12. I tried unordered map also, but it is not working with my compiler.
Also, Hash map is not available in namespace std.
If you're using C++11, use std::unordered_map, defined in <unordered_map>.
Otherwise, use std::tr1::unordered_map, defined in <tr1/unordered_map>, or boost::unordered_map, defined in <boost/unordered_map.hpp>.
If your key is a user-defined type, then you'll need to either define hash and operator== for the type, or provide suitable function types as template arguments to unordered_map.
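For a user-defined key, a minimal sketch of the "provide function types as template arguments" route could look like this. Key, KeyHash and KeyTable are hypothetical names for illustration, and the mixing step is just one common way to combine two hash values, not the only valid one.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key type, used only for illustration.
struct Key {
    int id;
    std::string name;
};

// Equality is required so the table can tell colliding keys apart.
inline bool operator==(const Key& a, const Key& b) {
    return a.id == b.id && a.name == b.name;
}

// Hash functor passed as the third template argument.
struct KeyHash {
    std::size_t operator()(const Key& k) const {
        std::size_t h1 = std::hash<int>()(k.id);
        std::size_t h2 = std::hash<std::string>()(k.name);
        // Simple mixing step so that both members influence the result.
        return h1 ^ (h2 + 0x9e3779b9 + (h1 << 6) + (h1 >> 2));
    }
};

typedef std::unordered_map<Key, double, KeyHash> KeyTable;
```

With std::tr1 or Boost the same functor should work after swapping the namespace and header, though I have only written this against the C++11 names.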
Of course, you should also measure the performance compared to std::map; that may be faster in some circumstances.
The hash map is called unordered_map. You can get it from Boost, and that will probably work even if you can't get a std/tr1 one to work. In general the lookup time is "constant", which means it does not increase with the number of elements. However, you have to look at this in more detail:
"constant" assumes you never have more than a fixed number of "collisions". It's unlikely you won't have any, so you have to account for the fact that there will be some collisions.
"constant" includes the time taken to hash the key. This is a constant time as it makes no difference how many other elements there are in the collection, however it is still a task that needs to be done, by which time your std::map may already have found your element.
If the keys are extremely fast to hash and well distributed so very few collisions occur, then hashing will indeed be faster.
One thing I always found when working with hash maps was that for the optimal performance you almost always won by writing your own implementation rather than using a standard one. That is because you could custom-tune your own for the data you knew you were going to handle. Perhaps this is why they didn't put hash maps into the original standard.
One thing I did when writing my own was store the actual hash value (the originally generated one) with the key. This was the first comparison point (usually faster than comparing the key as it's just an int) and also meant it didn't need to be regenerated if you resized your hash-table.
Note that hash-tables are easier to implement if you never delete anything from them, i.e. it is load and read only.
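The stored-hash idea can be approximated on top of std::unordered_map without writing a whole table. This is only a sketch of the technique, not the custom implementation described above; HashedString, StoredHash and HashedTable are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical key wrapper: the hash is computed once and kept with the key.
struct HashedString {
    std::string text;
    std::size_t hash;

    explicit HashedString(const std::string& s)
        : text(s), hash(std::hash<std::string>()(s)) {}
};

// Compare the cheap integer first; look at the characters only on a match.
inline bool operator==(const HashedString& a, const HashedString& b) {
    return a.hash == b.hash && a.text == b.text;
}

// Reuses the stored value instead of rehashing the characters.
struct StoredHash {
    std::size_t operator()(const HashedString& k) const { return k.hash; }
};

typedef std::unordered_map<HashedString, int, StoredHash> HashedTable;
```

The benefit only shows up if the same HashedString object is reused for several lookups; building a fresh one per lookup hashes the string each time anyway.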

Efficient C++ associative container with vector key

I've constructed a map which has a vector as its key: map<vector<KeyT>, T> which I'm trying to optimize now.
An experiment with manually nested maps map<vector<KeyT>, map<KeyT,T> > where the first key is the original vector minus the last element and the second key is the last element shows a reasonable speed-up.
Now I'm wondering whether there exists a semi-standard implementation (like boost or similar) of an associative container where vector keys are implemented as such a hierarchical structure of containers.
Ideally, this would create as many layers as there are elements in the key vector, while keeping a uniform syntax for vectors of different length.
Are you sure you need to optimise it? std::string is basically like a std::vector and we happily use std::string as an array key!
Have you profiled your code? std::map doesn't copy its key/value pairs unnecessarily -- what exactly are you afraid of?
Are your vector keys of a fixed-size? std::tuple might help in that case.
If not, it might help to partition your containers according to the length of the key, although the effectiveness of schemes such as this is highly domain-dependent.
My first hunch is that you want to improve map lookup time by reducing the volume of the key. This is what hash functions are for. C++ TR1 and Boost both provide a hash map under the name unordered_map.
I'll try to devise a small sample in some time here
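For what it's worth, one way the hashing suggestion could look for vector keys is sketched below. This is not the sample promised above, just an illustration: VectorHash is a hypothetical name, the mixing step is one commonly used way to combine element hashes, and KeyT is assumed to have a std::hash specialization.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Folds the hash of every element into one value, so a lookup costs one
// pass over the key plus (usually) a single equality comparison.
template <typename KeyT>
struct VectorHash {
    std::size_t operator()(const std::vector<KeyT>& v) const {
        std::size_t seed = v.size();
        std::hash<KeyT> hasher;
        for (const KeyT& x : v) {
            seed ^= hasher(x) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        }
        return seed;
    }
};
```

The container would then be declared as std::unordered_map<std::vector<KeyT>, T, VectorHash<KeyT>>, which keeps a uniform syntax for keys of any length. Whether it beats the nested-map experiment is something only profiling on the real keys can tell.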

What container to choose

I thought about storing some objects ... and now I don't know what to choose.
So, now I have such code:
std::map<std::string, Object*> mObjects;
But, as I was told here before, it's slow due to the allocation of a std::string on each search, so the key should be an integer.
Why did I choose std::string as the key? Because it's very easy to access objects by their name, for example:
mObjects["SomeObj"];
So my first idea is:
std::map<int, Object*> mObjects;
and the key is a CRC of the object name:
mObjects[CRC32("SomeObject")];
But it's a bit unstable. And I know there are special hash maps for this.
And lastly, I have to sort my objects in the map using some Compare function.
Any ideas about container I can use?
So again, the main points:
Accessing objects by string, but the key should be an integer, not a string
Sorting objects in map by some function
p.s. boost usage is permissible.
I can't say for sure, but are you always accessing items in the map by a literal string? If so, then you should just use consecutive enumerated values with symbolic names, and an appropriately sized vector.
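A minimal sketch of the enumerated-index idea, assuming every name really is known at compile time. ObjectId, its enumerators and make_table are hypothetical, and Object here is just a stand-in for the real class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the real object type.
struct Object {
    int value;
};

// One symbolic name per object; OBJECT_COUNT gives the table size.
enum ObjectId { SOME_OBJ, OTHER_OBJ, OBJECT_COUNT };

// Builds a table with one slot per enumerator, all initially empty.
inline std::vector<Object*> make_table() {
    return std::vector<Object*>(OBJECT_COUNT, nullptr);
}
```

Access then becomes mObjects[SOME_OBJ], which is a plain array index with no hashing or string comparison involved.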
Assuming that you won't know the names until runtime, 1000 items in the map seems really small for searching to possibly be a bottleneck. Are you sure that the lookup is the performance problem? Have you profiled to make sure that is the case? In general, using the most intuitive container is going to result in better code (because you can grasp the algorithm more easily).
Does your comment about constructing strings imply that you are passing C-strings into the find function over and over? Try to avoid that by using std::string consistently in your application.
If you insist on using the two-part approach: I suggest storing all your items in a vector. Then you have one unordered_map from string to index and another vector that has all the indexes into the main container. Then you sort this second container of indexes to get the ordering you need. Finally, when you delete items from the master container you'll need to clean up both of the other two referencing containers.
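The two-part layout could be sketched roughly as follows. ObjectStore and its members are hypothetical names, Object is a stand-in with made-up fields, names are assumed to be unique, and deletion (which needs the cleanup described above) is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in object type with made-up fields.
struct Object {
    std::string name;
    int priority;
};

class ObjectStore {
public:
    // Appends an object and records its position under its name.
    void add(const Object& obj) {
        index_by_name_[obj.name] = objects_.size();
        order_.push_back(objects_.size());
        objects_.push_back(obj);
    }

    // Name lookup through the hash table; returns nullptr when absent.
    const Object* find(const std::string& name) const {
        auto it = index_by_name_.find(name);
        return it == index_by_name_.end() ? nullptr : &objects_[it->second];
    }

    // Sorts only the index vector, so the positions stored in the
    // hash table stay valid.
    template <typename Compare>
    void sort_by(Compare less) {
        std::sort(order_.begin(), order_.end(),
                  [&](std::size_t a, std::size_t b) {
                      return less(objects_[a], objects_[b]);
                  });
    }

    // Returns the i-th object in the current sorted order.
    const Object& at_rank(std::size_t i) const { return objects_[order_[i]]; }

private:
    std::vector<Object> objects_;                                 // master container
    std::unordered_map<std::string, std::size_t> index_by_name_;  // name -> index
    std::vector<std::size_t> order_;                              // sorted view
};
```

Sorting never moves the objects themselves, which is why name lookups keep working after sort_by is called.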