Would a unordered_map be a good choice? - c++

I'm wondering if an unordered_map would be a good choice as container for my specific problem. What I've read about maps does not really cover my are, which is:
The container will store between 100 and 500 objects (not
int/double...)
The size will never change.
The order is not important as the objects themselves contain some kind of "index".
Very often (!) I need to filter all elements in the container that have some
property (e.g. have color==blue)
Currently I use vectors, which works. However if e.g. an unordered_map would improve performance (in regard to "filtering") I could image to change that.

std::unordered_map wouldn't really help you if you have multiple search criteria (sometimes color == blue, sometimes flavour == up), because maps only offer fast query on a single, pre-determined key.
I'd say std::vector is just fine for you, ideally wrapped in your own structure which will provide the lookup interface. If profiling later tells you this is not fast enough, you could build your own indexes above such data. You wouldn't even have to do that manually, boost::multi_index is a generic container designed for multiple-criterion lookup.

I would use vector or simply array for storing actual data. And have a few maps that maps key with pointer to actual data.
This would give higher memory usage, but in case searching by different indexes is often needed you may sacrifice a bit of memory.

A hash table (which std::unordered_map is) provides constant-time lookup for one key (key-value pair). However, its constant factors are always higher (i. e. the lookup is slower) than a simple array (which provides constant-time lookup for integer indices).
If you need to filter a collection of elements based on some criteria, then you need to inspect each individual element. In this case, a hash table would be strictly worse than an array/vector performance-wise, since its computational complexity is the same as that of array indexing, but with worse constant factors.
So no, there's no reason why you would want to use an unordered_map in this case.

Related

Build large(ish) unordered set with all data available at the beginning

I have a situation where I need to optimize the creation of an unordered set. Expected number of elements is around 5-25M. My first thought is that I should prepare all data beforehand and do something like
unordered_set s(data);
instead of
for (auto& elem : data)
s.insert(elem);
Can the STL unordered set use bulk loading methods and speed up its creation? How can I tweak the parameters of the hash table (bucket size etc) if I know prior to the table construction the expected number of elements?
This question is quite broad and interesting.
First of all, there is a special method called reserve - it allows you to pre-allocate the storage for a number of elements before actually inserting them. Pre-allocating sufficient memory (and avoiding re-locations during the instertion) is a very powerful approach, which is commonly used for large data sets. Note, that it is also available for various standard containers, including vector, unordered_map etc.
Secondly, if you're using C++11, you might benefit from using move-semantics while inserting the elements into your container (of course, given you don't need them in your feed once they are placed in the set, which should be true for 5 to 25 millions of objects).
These two techniques are a good start. You may need to tune it further by setting different hashing function, or even choosing different implementation of an unordered_set. But at this point, you should provide more information: what are your value objects and what is their life-cycle; what insertion time do you find acceptable in your application.
EDIT: of course it's all about C++11, as unordered_set was not available prior to it. Shame on me :)
My focus now is on whether I can use functions like rehash to notify the table for the upcoming size
Suppose you call
unordered_set s(begin(data), end(data));
while the standard doesn't dictate an implementation, a good implementation will be able to discern the number of elements, and preallocate the size accordingly. If you look at the source code used by gcc (by me /usr/include/c++/5/tr1/hashtable.h), for example, it uses
_M_bucket_count = std::max(_M_rehash_policy._M_next_bkt(__bucket_hint),
_M_rehash_policy.
_M_bkt_for_elements(__detail::
__distance_fw(__f,
__l)));
_M_buckets = _M_allocate_buckets(_M_bucket_count);
so it already preallocates size based on the number of elements.
The problem might be different, though. If you look at the documentation, it states:
constructs the container with the contents of the range [first, last). Sets max_load_factor() to 1.0.
This saves space, but might cause collisions. To reduce the collisions, you could use
unordered_set s(begin(data), end(data), k * data.size());
where k > 1 is some constant. This corresponds to a load factor that is 1 / k. YMMV.

map vs unordered_map for few elements

I am trying to choose between map and unordered_map for the following use case:
The key of the map is a pointer.
The most common use case is that there will be a single element in the map.
In general, the max number of elements in the map less than 10.
The map is accessed very often and speed is the most important factor. Changes to the map are infrequent.
While measuring the speed is obviously the correct approach here, this code will be used on several platforms so I'm trying to create a general rule of thumb for choosing between a map and unordered_map based on number of elements. I've seen some posts here that hint that std::map may be faster for a small number elements, but no definition of "small" was given.
Is there a rule of thumb for when to choose between a map and unordered_map based on number of elements? Is another data structure (such as linear search through a vector) even better?
Under the premise that you always need to measure in order to figure out what's more appropriate in terms of performance, if all these things are true:
Changes to the map are not frequent;
The map contains a maximum of 10 elements;
Lookups will be frequent;
You care a lot about performance;
Then I would say you would be better off putting your elements in an std::vector and performing a plain iteration over all your elements to find the one you're looking for.
An std::vector will allocate its elements in a contiguous region of memory, so cache locality is likely to grant you a greater performance - the time required to fetch a cache line from main memory after a cache miss is at least one order of magnitude higher than the time required to access the CPU cache.
Quite interestingly, it seems like Boost's flat_map is ideal for your use case (courtesy of Praetorian):
flat_map is similar to std::map but it's implemented like an ordered vector. (from the online documentation)
So if using Boost is an option for you, you may want to try this one.
I believe for your case of 10 elements or less and usually only one a linear search of an unsorted vector will work best. However, depending on the hash algorithm used the unordered_map may be faster instead.
It should be easy enough for you to benchmark.

Which is the fastest STL container for find?

Alright as a preface I have a need to cache a relatively small subset of rarely modified data to avoid querying the database as frequently for performance reasons. This data is heavily used in a read-only sense as it is referenced often by a much larger set of data in other tables.
I've written a class which will have the ability to store basically the entirety of the two tables in question in memory while listening for commit changes in conjunction with a thread safe callback mechanism for updating the cached objects.
My current implementation has two std::vectors one for the elements of each table. The class provides both access to the entirety of each vector as well as convenience methods for searching for a specific element of table data via std::find, std::find_if, etc.
Does anyone know if using std::list, std::set, or std::map over std::vector for searching would be preferable? Most of the time that is what will be requested of these containers after populating once from the database when a new connection is made.
I'm also open to using C++0x features supported by VS2010 or Boost.
For searching a particular value, with std::set and std::map it takes O(log N) time, while with the other two it takes O(N) time; So, std::set or std::map are probably better. Since you have access to C++0x, you could also use std::unordered_set or std::unordered_map which take constant time on average.
For find_if, there's little difference between them, because it takes an arbitrary predicate and containers cannot optimize arbitrarily, of course.
However if you will be calling find_if frequently with a certain predicate, you can optimize yourself: use a std::map or std::set with a custom comparator or special keys and use find instead.
A sorted vector using std::lower_bound can be just as fast as std::set if you're not updating very often; they're both O(log n). It's worth trying both to see which is better for your own situation.
Since from your (extended) requirements you need to search on multiple fields, I would point you to Boost.MultiIndex.
This Boost library lets you build one container (with only one exemplary of each element it contains) and index it over an arbitrary number of indices. It also lets you precise which indices to use.
To determine the kind of index to use, you'll need extensive benchmarks. 500 is a relatively low number of entries, so constant factors won't play nicely. Furthermore, there can be a noticeable difference between single-thread and multi-thread usage (most hash-table implementations can collapse on MT usage because they do not use linear-rehashing, and thus a single thread ends up rehashing the table, blocking all others).
I would recommend a sorted index (skip-list like, if possible) to accomodate range requests (all names beginning by Abc ?) if the performance difference is either unnoticeable or simply does not matter.
If you only want to search for distinct values, one specific column in the table, then std::hash is fastest.
If you want to be able to search using several different predicates, you will need some kind of index structure. It can be implemented by extending your current vector based approach with several hash tables or maps, one for each field to search for, where the value is either an index into the vector, or a direct pointer to the element in the vector.
Going further, if you want to be able to search for ranges, such as all occasions having a date in July you need an ordered data structure, where you can extract a range.
Not an answer per se, but be sure to use a typedef to refer to the container type you do use, something like typedef std::vector< itemtype > data_table_cache; Then use your typedef type everywhere.

Associating and iterating in C++

I've got a situation where I want to use an associative container, and I chose to use a std::unordered_map, because it's perfectly feasible that this container could be used to hold millions or more of elements. But now I also need to iterate in order. I considered having the value types link to each other in a list, but now I'm going to have issues with memory management.
Should I change container, say to a std::map? Or just iterate once through my unordered_map, insert into a vector, and sort, then iterate? It's pretty unlikely that I will need to iterate in an ordered fashion repeatedly.
Well, you know the O() of the various operations of the two alternatives you've picked. You should pick based on that and do a cost/benefit analysis based on where you need the performance to happen and which container does best for THAT.
Of course, I couldn't possibly know enough to do that analysis for you.
You could use Boost.MultiIndex, specifying the unordered (hashed) index as well as an ordered one, on the same underlying object collection.
Possible issues with this - there is no natural mapping from an existing associative container model, and it might be overkill if you don't need the second index all the time.

When to choose std::vector over std::map for key-value data?

Considering the positive effect of caching and data locality when searching in primary memory, I tend to use std::vector<> with std::pair<>-like key-value items and perform linear searches for both, if I know that the total amount of key-value items will never be "too large" to severely impact performance.
Lately I've been in lots of situations where I know beforehand that I will have huge amounts of key-value items and have therefore opted for std::map<> from the beginning.
I'd like to know how you make your decisions for the proper container in situations like the ones described above.
Do you
always use std::vector<> (or similar)?
always use std::map<> (or similar)?
have a gut feeling for where in the item-count range one is preferable over the other?
something entirely different?
Thanks!
I only rarely use std::vector with a linear search (except in conjunction with binary searching as described below). I suppose for a small enough amount of data it would be better, but with that little data it's unlikely that anything is going to provide a huge advantage.
Depending on usage pattern, a binary search on an std::vector can make sense though. A std::map works well when you need to update the data regularly during use. In quite a few cases, however, you load up some data and then you use the data -- but after you've loaded the data, it mostly remains static (i.e., it changes very little, if at all).
In this case, it can make a lot of sense to load the data into a vector, sort it if necessary, and then do binary searches on the data (e.g. std::lower_bound, std::equal_range). This gives pretty much the best of both worlds -- low-complexity binary searches and good cache usage from high locality of reference (i.e., the vector is contiguous, as opposed to the linked structure of a std::map). The shortcoming, of course, is that insertions and deletions are slow -- but this is one time I have used your original idea -- store newly inserted data separately until it reaches some limit, and only then sort it in with the rest of the data, so a single search consists of a binary search of the main body of the data, followed by a linear search of the (small amount) of newly inserted data.
I would never make the choice solely on (possibly bogus) "efficiency" grounds, but always on what I am actually going to do with the container. Do I want to store duplicates? Is insertion order important? Will I sometimes want to search for the value not the key? Those kind of things.
Have you considered using sorted data structures? They tend to offer logarithmic searches and inserts - a reasonable trade-off. Personally I don't have any hard and fast rules other than liking maps for the ability to key on a human-readable/understandable value.
Of course there's plenty of discussion as well on the efficiency of maps vs. lists/vectors (sorted and unsorted) - if your key is a string that's 10,000 characters, it can take longer to do a string compare than to search through a list of just a few items, so you want to make sure that you can efficiently compare keys as well.
I almost always prefer to use map (or unordered_map, when a hash container makes more sense) vs. a vector.
That being said, I think your reasoning is backwards. I would tend to use a vector only when there are huge amounts of data, since a vector will be a smaller memory footprint.
With the right kinds of datasets, you can load a vector and then sort it and binary_search it with a smaller footprint and similar performance characteristics to a map, especially if the dataset is stable after load.
Why are you not taking unordered_map into account?