Data container for fast lookup and retrieval - c++

Actually I need a data structure that helps me in reducing time for look-ups and retrieval of values of the respective keys.
Right now I am using a map container with key as structure and want to retrieve its values as fast as possible.
I am using gcc on fedora 12. I tried unordered map also, but it is not working with my compiler.
Also, Hash map is not available in namespace std.

If you're using C++11, use std::unordered_map, defined in <unordered_map>.
Otherwise, use std::tr1::unordered_map, defined in <tr1/unordered_map>, or boost::unordered_map, defined in <boost/unordered_map.hpp>.
If your key is a user-defined type, then you'll need to either define hash and operator== for the type, or provide suitable function types as template arguments to unordered_map.
Of course, you should also measure the performance compared to std::map; that may be faster in some circumstances.

hash map is called unordered_map. You can get it from boost and that will probably work even if you can't get a std/tr1 one to work. In general the lookup time is "constant" which means it does not increase with the complexity of the nubmer of elements. However you have to look at this in more detail:
"constant" assumes you never have more than a fixed number of "collisions". It's unlikely you won't have any, and then you have to measure the fact that there will be some collisions.
"constant" includes the time taken to hash the key. This is a constant time as it makes no difference how many other elements there are in the collection, however it is still a task that needs to be done, by which time your std::map may already have found your element.
If the keys are extremely fast to hash and well distributed so very few collisions occur, then hashing will indeed be faster.
One thing I always found when working with hash maps was that for the optimal performance you almost always won by writing your own implementation rather than using a standard one. That is because you could custom-tune your own for the data you knew you were going to handle. Perhaps this is why they didn't put hash maps into the original standard.
One thing I did when writing my own was store the actual hash value (the originally generated one) with the key. This was the first comparison point (usually faster than comparing the key as it's just an int) and also meant it didn't need to be regenerated if you resized your hash-table.
Note that hash-tables are easier to implement if you never delete anything from them, i.e. it is load and read only.

Related

which container from std::map or std::unordered_map is suitable for my case

I don't know how a red black tree works with string keys. I've already seen it with numbers on youtube and it baffled me a lot. However I know very well how unoredred_map work (the internal of hash maps). std::map stays esoterical for me, but I read and tested that if we don't have many changes in the std::map, it could beat hash maps.
My case is simple, I have a std::map of <std::string,bool>. Keys contains paths to XML elements (example of a key: "Instrument_Roots/Instrument_Root/Rating_Type"), and I use the boolean value in my SAX parser to know if we reached a particular element.
I build this map "only once"; and then all I do is using std::find to search if a particular "key" ("path") exists in order to set its Boolean value to true, or search the first element who has "true" as associated value and use its corresponded "key", and finally I set all the boolean values to false to guarantee that only a single "key" has a "true" boolean value.
You shouldn't need to understand how red-black trees work in order to understand how to use a std::map. It's simply an associative array where the keys are in order (lexicographical order, in the case of string keys, at least with the default comparison function). That means that you can not only look keys up in a std::map, you can also make queries which depend on order. For example, you can find the largest key in the map which is not greater than the key you have. You can find the next larger key. Or (again in the case of strings) you can find all keys which start with the same prefix.
If you iterate over all the key-value pairs in a std::map, you will see them in order by key. That can be very useful, sometimes.
The extra functionality comes at a price. std::map is usually slower than std::unordered_map (though not always; for large string keys, the overhead of computing the hash function might be noticeable), and the underlying data structure has a certain amount of overhead, so they may occupy more space. The usual advice is to use a std::map if you find the fact that the keys are ordered to be essential or even useful.
But if you've benchmarked and concluded that for your application, a std::map is also faster, then go ahead and use it :)
It is occasionally useful to have a map whose mapped type is bool, but only if you need to distinguish between keys whose corresponding value is false and keys which are not present in the map at all. In effect, a std::map<T, bool> (or std::unordered_map<T, bool>) provides a ternary choice for each possible key.
If you don't need to distinguish between the two false cases, and you don't frequently change a key's value, then you may well be better off with a std::set (or std::unordered_set), which is exactly the same datastructure but without the overhead of the bool in each element. (Although only one bit of the bool is useful, alignment considerations may end up using 8 additional bytes for each entry.) Other than storage space, there won't be much (if any) performance difference, though.
If you do really need a ternary case, then you would be well-advised to make the value an enum rather than a bool. What do true and false mean in the context of your usage? My guess is that they don't mean "true" and "false". Instead, they mean something like "is an attribute path" and "is an element path". That distinction could be made much clearer (and therefore less accident-prone) by using enum PathType {ATTRIBUTE_PATH, ELEMENT_PATH};. That will not involve any additional resources, since the bool is occupying eight bytes of storage in any case (because of alignment).
By the way, there is no guarantee that the underlying data structure is precisely a red-black tree, although the performance guarantees would be difficult to achieve without some kind of self-balancing tree. I don't know of such an implementation, but it would be possible to use k-ary trees (for some small k) to take advantage of SIMD vector comparison operations, for example. Of course, that would need to be customized for appropriate key types.
If you do want to understand red-black trees, you could do worse than Robert Sedgewick's standard textbook on Algorithms. On the book's website, you'll find a brief illustrated explanation in the chapter on balanced trees.
I would recommend you to use std::unordered_set because you really don't need to store this boolean flag and you also don't need to keep these xml tags in sorted order so std::unordered_set seems to me as logical and the most efficient choice.

Would a unordered_map be a good choice?

I'm wondering if an unordered_map would be a good choice as container for my specific problem. What I've read about maps does not really cover my are, which is:
The container will store between 100 and 500 objects (not
int/double...)
The size will never change.
The order is not important as the objects themselves contain some kind of "index".
Very often (!) I need to filter all elements in the container that have some
property (e.g. have color==blue)
Currently I use vectors, which works. However if e.g. an unordered_map would improve performance (in regard to "filtering") I could image to change that.
std::unordered_map wouldn't really help you if you have multiple search criteria (sometimes color == blue, sometimes flavour == up), because maps only offer fast query on a single, pre-determined key.
I'd say std::vector is just fine for you, ideally wrapped in your own structure which will provide the lookup interface. If profiling later tells you this is not fast enough, you could build your own indexes above such data. You wouldn't even have to do that manually, boost::multi_index is a generic container designed for multiple-criterion lookup.
I would use vector or simply array for storing actual data. And have a few maps that maps key with pointer to actual data.
This would give higher memory usage, but in case searching by different indexes is often needed you may sacrifice a bit of memory.
A hash table (which std::unordered_map is) provides constant-time lookup for one key (key-value pair). However, its constant factors are always higher (i. e. the lookup is slower) than a simple array (which provides constant-time lookup for integer indices).
If you need to filter a collection of elements based on some criteria, then you need to inspect each individual element. In this case, a hash table would be strictly worse than an array/vector performance-wise, since its computational complexity is the same as that of array indexing, but with worse constant factors.
So no, there's no reason why you would want to use an unordered_map in this case.

Why std::map is red black tree and not hash table ?

This is very strange for me, i expected it to be a hash table.
I saw 3 reasons in the following answer (which maybe correct but i don't think that they are the real reason).
Hash tables v self-balancing search trees
Although hash might be not a trivial operation. I think that for most of the types it is pretty simple.
when you use map you expect something that will give you amortized O(1) insert, delete, find , not log(n).
i agree that trees have better worst case performance.
I think that there is a bigger reason for that, but i can't figure it out.
In c# for example Dictionary is a hash table.
It's largely a historical accident. The standard containers (along with iterators and algorithms) were one of the very last additions before the feature set of the standard was frozen. As it happened, they didn't have what they considered an adequate definition of a hash-based map at the time, and there wasn't time to add it before features were frozen, so the original specification included only a tree-based map.
C++ 11 added std::unordered_map (as well as std::unordered_set and multi versions of both), which is based on hashing though.
The reason is that map is explicitly called out as an ordered container. It keeps the elements sorted and allows you to iterate in sorted order in linear time. A hashtable couldn't fulfill those requirements.
In C++11 they added std::unordered_map which is a hashtable implementation.
A hash table requires an additional hash function. The current implementation of map which uses a tree can work without an extra hash function by using operator<. Additionally the map allows sorted access to elements, which may be beneficial for some applications. With C++ we now have the hash versions available in form of unordered_set.
Simple answer: because a hash table cannot satisfy the complexity requirements of iteration over a std::map.
Why does std::map hold these requirements? Unanswerable question. Historical factors contribute but, overall, that's just the way it is.
Hashes are available as std::unordered_map.
It doesn't really matter what the two are called, or what they're called in some other language.

Which is the fastest STL container for find?

Alright as a preface I have a need to cache a relatively small subset of rarely modified data to avoid querying the database as frequently for performance reasons. This data is heavily used in a read-only sense as it is referenced often by a much larger set of data in other tables.
I've written a class which will have the ability to store basically the entirety of the two tables in question in memory while listening for commit changes in conjunction with a thread safe callback mechanism for updating the cached objects.
My current implementation has two std::vectors one for the elements of each table. The class provides both access to the entirety of each vector as well as convenience methods for searching for a specific element of table data via std::find, std::find_if, etc.
Does anyone know if using std::list, std::set, or std::map over std::vector for searching would be preferable? Most of the time that is what will be requested of these containers after populating once from the database when a new connection is made.
I'm also open to using C++0x features supported by VS2010 or Boost.
For searching a particular value, with std::set and std::map it takes O(log N) time, while with the other two it takes O(N) time; So, std::set or std::map are probably better. Since you have access to C++0x, you could also use std::unordered_set or std::unordered_map which take constant time on average.
For find_if, there's little difference between them, because it takes an arbitrary predicate and containers cannot optimize arbitrarily, of course.
However if you will be calling find_if frequently with a certain predicate, you can optimize yourself: use a std::map or std::set with a custom comparator or special keys and use find instead.
A sorted vector using std::lower_bound can be just as fast as std::set if you're not updating very often; they're both O(log n). It's worth trying both to see which is better for your own situation.
Since from your (extended) requirements you need to search on multiple fields, I would point you to Boost.MultiIndex.
This Boost library lets you build one container (with only one exemplary of each element it contains) and index it over an arbitrary number of indices. It also lets you precise which indices to use.
To determine the kind of index to use, you'll need extensive benchmarks. 500 is a relatively low number of entries, so constant factors won't play nicely. Furthermore, there can be a noticeable difference between single-thread and multi-thread usage (most hash-table implementations can collapse on MT usage because they do not use linear-rehashing, and thus a single thread ends up rehashing the table, blocking all others).
I would recommend a sorted index (skip-list like, if possible) to accomodate range requests (all names beginning by Abc ?) if the performance difference is either unnoticeable or simply does not matter.
If you only want to search for distinct values, one specific column in the table, then std::hash is fastest.
If you want to be able to search using several different predicates, you will need some kind of index structure. It can be implemented by extending your current vector based approach with several hash tables or maps, one for each field to search for, where the value is either an index into the vector, or a direct pointer to the element in the vector.
Going further, if you want to be able to search for ranges, such as all occasions having a date in July you need an ordered data structure, where you can extract a range.
Not an answer per se, but be sure to use a typedef to refer to the container type you do use, something like typedef std::vector< itemtype > data_table_cache; Then use your typedef type everywhere.

C++ STL Map vs Vector speed

In the interpreter for my experimental programming language I have a symbol table. Each symbol consists of a name and a value (the value can be e.g.: of type string, int, function, etc.).
At first I represented the table with a vector and iterated through the symbols checking if the given symbol name fitted.
Then I though using a map, in my case map<string,symbol>, would be better than iterating through the vector all the time but:
It's a bit hard to explain this part but I'll try.
If a variable is retrieved the first time in a program in my language, of course its position in the symbol table has to be found (using vector now). If I would iterate through the vector every time the line gets executed (think of a loop), it would be terribly slow (as it currently is, nearly as slow as microsoft's batch).
So I could use a map to retrieve the variable: SymbolTable[ myVar.Name ]
But think of the following: If the variable, still using vector, is found the first time, I can store its exact integer position in the vector with it. That means: The next time it is needed, my interpreter knows that it has been "cached" and doesn't search the symbol table for it but does something like SymbolTable.at( myVar.CachedPosition ).
Now my (rather hard?) question:
Should I use a vector for the symbol table together with caching the position of the variable in the vector?
Should I rather use a map? Why? How fast is the [] operator?
Should I use something completely different?
A map is a good thing to use for a symbol table. but operator[] for maps is not. In general, unless you are writing some trivial code, you should use the map's member functions insert() and find() instead of operator[]. The semantics of operator[] are somewhat complicated, and almost certainly don't do what you want if the symbol you are looking for is not in the map.
As for the choice between map and unordered_map, the difference in performance is highly unlikely to be significant when implementing a simple interpretive language. If you use map, you are guaranteed it will be supported by all current Standard C++ implementations.
You effectively have a number of alternatives.
Libraries exist:
Loki::AssocVector: the interface of a map implemented over a vector of pairs, faster than a map for small or frozen sets because of cache locality.
Boost.MultiIndex: provides both List with fast lookup and an example of implementing a MRU List (Most Recently Used) which caches the last accessed elements.
Critics
Map look up and retrieval take O(log N), but the items may be scattered throughout the memory, thus not playing well with caching strategies.
Vector are more cache friendly, however unless you sort it you'll have O(N) performance on find, is it acceptable ?
Why not using a unordered_map ? They provide O(1) lookup and retrieval (though the constant may be high) and are certainly suited to this task. If you have a look at Wikipedia's article on Hash Tables you'll realize that there are many strategies available and you can certainly pick one that will suit your particular usage pattern.
Normally you'd use a symbol table to look up the variable given its name as it appears in the source. In this case, you only have the name to work with, so there's nowhere to store the cached position of the variable in the symbol table. So I'd say a map is a good choice. The [] operator takes time proportional to the log of the number of elements in the map - if it turns out to be slow, you could use a hash map like std::tr1::unordered_map.
std::map's operator[] takes O(log(n)) time. This means that it is quite efficient, but you still should avoid doing the lookups over and over again. Instead of storing an index, perhaps you can store a reference to the value, or an iterator to the container? This avoids having to do lookup entirely.
When most interpreters interpret code, they compile it into an intermediate language first. These intermediate languages often refer to variables by index or by pointer, instead of by name.
For example, Python (the C implementation) changes local variables into references by index, but global variables and class variables get referenced by name using a hash table.
I suggest looking at an introductory text on compilers.
a std::map (O(log(n))) or a hashtable ("amortized" O(1)) would be the first choice - use custom mechanisms if you determin it's a bottleneck. Generally, using a hash or tokenizing the input is the first optimization.
Before you have profiled it, it's most important that you isolate lookup, so you can easily replace and profile it.
std::map is likely a tad slower for a small number of elements (but then, it doesn't really matter).
Map is O(log N), so not as fast as positional lookup in an array. But the exact results will depend on a lot of factors, and so the best approach is to interface with the container in a way that allows you to swap between implementation later on. That is, write a "lookup" function that can be efficiently implemented by any suitable container, to allow yourself to switch and compare speeds of different implementation.
Map's operator [] is O(log(n)), see wikipedia : http://en.wikipedia.org/wiki/Map_(C%2B%2B)
I think as you're looking often for symbols, using a map is certainly right. Maybe a hash map (std::unordered_map) could make your performance better.
If you're going to use a vector and go to the trouble of caching the most recent symbol look up result, you could do the same (cache the most recent look-up result) if your symbol table were implemented as a map (but there probably wouldn't be a whole lot of benefit to the cache in the case of using a map). With a map you'd have the additional advantage that any non-cached symbol look ups would be much more performant than searching in a vector (assuming that the vector isn't sorted - and keeping a vector sorted can be expensive if you have to do the sort more than once).
Take Neil's advice; map is generally a good data structure for a symbol table, but you need to make sure you're using it correctly (and not adding symbols accidentally).
You say: "If the variable, still using vector, is found the first time, I can store its exact integer position in the vector with it.".
You can do the same with the map: search the variable using find and store the iterator pointing to it instead of the position.
For looking up values, by a string key, map data type is the appropriate one, as mentioned by other users.
STL map implementations usually are implemented with self-balancing trees, like the red black tree data structure, and their operations take O(logn) time.
My advice is to wrap the table manipulation code in functions,
like table_has(name), table_put(name) and table_get(name).
That way you can change the inner symbol table representation easily if you experience
slow run time performance, plus you can embed in those routines cache functionality later.
A map will scale much better, which will be an important feature. However, don't forget that when using a map, you can (unlike a vector) take pointers and references. In this case, you could easily "cache" variables with a map just as validly as a vector. A map is almost certainly the right choice here.