Associating and iterating in C++ - c++

I've got a situation where I want to use an associative container, and I chose to use a std::unordered_map, because it's perfectly feasible that this container could be used to hold millions or more of elements. But now I also need to iterate in order. I considered having the value types link to each other in a list, but now I'm going to have issues with memory management.
Should I change container, say to a std::map? Or just iterate once through my unordered_map, insert into a vector, and sort, then iterate? It's pretty unlikely that I will need to iterate in an ordered fashion repeatedly.

Well, you know the O() of the various operations of the two alternatives you've picked. You should pick based on that and do a cost/benefit analysis based on where you need the performance to happen and which container does best for THAT.
Of course, I couldn't possibly know enough to do that analysis for you.

You could use Boost.MultiIndex, specifying the unordered (hashed) index as well as an ordered one, on the same underlying object collection.
Possible issues with this - there is no natural mapping from an existing associative container model, and it might be overkill if you don't need the second index all the time.

Related

Having a std::map not order

I am using a header-only json library and it uses a std::map. I would prefer although to have it not be ordered.
https://github.com/nlohmann/json/blob/develop/src/json.hpp#L371
There is the snippet that I'm wondering if I can fix. Assuming "ObjectType" is a std::map. Is there any way to remove the order from it or somehow make the std::less<StringType> irrelevant.
It seems that changing the source to support std::unordered_map would be too large of a task to be worth it.
First of all, std::unordered_map is not a viable solution here, since it will not preserve the insertion order either. "Unordered" here means pretty much ignorant to any ordering whatsoever.
Instead for your particular task you want to somehow save the original insertion order, so here are some options:
change std::map key to the index number or replace std::map with std::vector. The latter actually makes sense even if you want to retain the ability to search by the object name, as JSON objects don't tend to get too big so linear search probably won't introduce any noticeable drawback.
find a way to store the desired ordering separately. std::vector of keys can handle the storage, and you can add some iterator trickery to make your container cycle over the preferred order, e.g. by overloading begin() and end() methods.
use a multiple-keyed map as a ready solution - boost::multiindex is a default choice.

Prefer unordered_set over vector

Is it safe to say that if I don't want duplicates in my container, and I don't care about element position as I only want to iterate through the container, then I should use an unordered_set instead of vector?
Is it safe to say that if I don't want duplicates in my container, and I don't care about element position as I only want to iterate through the container, then I should use an unordered_set instead of vector?
No, it is not. It depends on many factors. For example if you seldom add new elements but iterate over container quite often it would be preferable to use std::vector and maintain uniqueness manually. There also could be other factors affecting your decision. But normally yes you may prefer std::unordered_set as it simplifies your program.
Not entirely. unordered_sets are not required to be contiguous containers; in the case where you'd frequently want to read all numerous values contained in the set, you may prefer std::vector on time-critic application.
std::unordered_set:
Internally, the elements are not sorted in any particular order, but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its value. This allows fast access to individual elements, since once a hash is computed, it refers to the exact bucket the element is placed into.
But in the general case, I'd say Yes.
I generally prefer vector or map. (or in your case, std::set).
Hash tables can be faster than maps/sets (red-black trees), but red-black trees have guaranteed performance 100% of the time. And logarithmic performance is REALLY fast! A hash table kan kill performance when it starts rehashing.
std::vector is the workhorse of the STL and should be your default choice. Vector is very straightforward, and is very cache-friendly
This article by Matt Austern is related to this topic and it is worth reading:
Why you shouldn't use set (and what you should use instead) by Matt Austern
This thread is trying to identify conditions under which unordered_set is preferable over vectors. Similarly, in the above article, the author clearly identifies four conditions, which all need to be satisfied in order to prefer set over a custom but simpler data structure called sorted_vector (last section: What is set good for?). It will be interesting to clearly state a set of conditions for preferring unordered_set over vector.
also, the last paragraph of the article summarizes a useful rule to keep in mind:
Every component in the standard C++ library is there because it's useful for some purpose, but sometimes that purpose is narrowly defined and rare. As a general rule you should always use the simplest data structure that meets your needs. The more complicated a data structure, the more likely that it's not as widely useful as it might seem.
Of course yes. If you do not want duplicates, you have to use a key-aware container, and since unordered_* totally win over their tree-based counterparts, this is pretty much your only choice.

Would a unordered_map be a good choice?

I'm wondering if an unordered_map would be a good choice as container for my specific problem. What I've read about maps does not really cover my are, which is:
The container will store between 100 and 500 objects (not
int/double...)
The size will never change.
The order is not important as the objects themselves contain some kind of "index".
Very often (!) I need to filter all elements in the container that have some
property (e.g. have color==blue)
Currently I use vectors, which works. However if e.g. an unordered_map would improve performance (in regard to "filtering") I could image to change that.
std::unordered_map wouldn't really help you if you have multiple search criteria (sometimes color == blue, sometimes flavour == up), because maps only offer fast query on a single, pre-determined key.
I'd say std::vector is just fine for you, ideally wrapped in your own structure which will provide the lookup interface. If profiling later tells you this is not fast enough, you could build your own indexes above such data. You wouldn't even have to do that manually, boost::multi_index is a generic container designed for multiple-criterion lookup.
I would use vector or simply array for storing actual data. And have a few maps that maps key with pointer to actual data.
This would give higher memory usage, but in case searching by different indexes is often needed you may sacrifice a bit of memory.
A hash table (which std::unordered_map is) provides constant-time lookup for one key (key-value pair). However, its constant factors are always higher (i. e. the lookup is slower) than a simple array (which provides constant-time lookup for integer indices).
If you need to filter a collection of elements based on some criteria, then you need to inspect each individual element. In this case, a hash table would be strictly worse than an array/vector performance-wise, since its computational complexity is the same as that of array indexing, but with worse constant factors.
So no, there's no reason why you would want to use an unordered_map in this case.

Which is the fastest STL container for find?

Alright as a preface I have a need to cache a relatively small subset of rarely modified data to avoid querying the database as frequently for performance reasons. This data is heavily used in a read-only sense as it is referenced often by a much larger set of data in other tables.
I've written a class which will have the ability to store basically the entirety of the two tables in question in memory while listening for commit changes in conjunction with a thread safe callback mechanism for updating the cached objects.
My current implementation has two std::vectors one for the elements of each table. The class provides both access to the entirety of each vector as well as convenience methods for searching for a specific element of table data via std::find, std::find_if, etc.
Does anyone know if using std::list, std::set, or std::map over std::vector for searching would be preferable? Most of the time that is what will be requested of these containers after populating once from the database when a new connection is made.
I'm also open to using C++0x features supported by VS2010 or Boost.
For searching a particular value, with std::set and std::map it takes O(log N) time, while with the other two it takes O(N) time; So, std::set or std::map are probably better. Since you have access to C++0x, you could also use std::unordered_set or std::unordered_map which take constant time on average.
For find_if, there's little difference between them, because it takes an arbitrary predicate and containers cannot optimize arbitrarily, of course.
However if you will be calling find_if frequently with a certain predicate, you can optimize yourself: use a std::map or std::set with a custom comparator or special keys and use find instead.
A sorted vector using std::lower_bound can be just as fast as std::set if you're not updating very often; they're both O(log n). It's worth trying both to see which is better for your own situation.
Since from your (extended) requirements you need to search on multiple fields, I would point you to Boost.MultiIndex.
This Boost library lets you build one container (with only one exemplary of each element it contains) and index it over an arbitrary number of indices. It also lets you precise which indices to use.
To determine the kind of index to use, you'll need extensive benchmarks. 500 is a relatively low number of entries, so constant factors won't play nicely. Furthermore, there can be a noticeable difference between single-thread and multi-thread usage (most hash-table implementations can collapse on MT usage because they do not use linear-rehashing, and thus a single thread ends up rehashing the table, blocking all others).
I would recommend a sorted index (skip-list like, if possible) to accomodate range requests (all names beginning by Abc ?) if the performance difference is either unnoticeable or simply does not matter.
If you only want to search for distinct values, one specific column in the table, then std::hash is fastest.
If you want to be able to search using several different predicates, you will need some kind of index structure. It can be implemented by extending your current vector based approach with several hash tables or maps, one for each field to search for, where the value is either an index into the vector, or a direct pointer to the element in the vector.
Going further, if you want to be able to search for ranges, such as all occasions having a date in July you need an ordered data structure, where you can extract a range.
Not an answer per se, but be sure to use a typedef to refer to the container type you do use, something like typedef std::vector< itemtype > data_table_cache; Then use your typedef type everywhere.

Dynamically sorted STL containers

I'm fairly new to the STL, so I was wondering whether there are any dynamically sortable containers? At the moment my current thinking is to use a vector in conjunction with the various sort algorithms, but I'm not sure whether there's a more appropriate selection given the (presumably) linear complexity of inserting entries into a sorted vector.
To clarify "dynamically", I am looking for a container that I can modify the sorting order at runtime - e.g. sort it in an ascending order, then later re-sort in a descending order.
You'll want to look at std::map
std::map<keyType, valueType>
The map is sorted based on the < operator provided for keyType.
Or
std::set<valueType>
Also sorted on the < operator of the template argument, but does not allow duplicate elements.
There's
std::multiset<valueType>
which does the same thing as std::set but allows identical elements.
I highly reccomend "The C++ Standard Library" by Josuttis for more information. It is the most comprehensive overview of the std library, very readable, and chock full of obscure and not-so-obscure information.
Also, as mentioned by 17 of 26, Effective Stl by Meyers is worth a read.
If you know you're going to be sorting on a single value ascending and descending, then set is your friend. Use a reverse iterator when you want to "sort" in the opposite direction.
If your objects are complex and you're going to be sorting in many different ways based on the member fields within the objects, then you're probably better off with using a vector and sort. Try to do your inserts all at once, and then call sort once. If that isn't feasible, then deque may be a better option than the vector for large collections of objects.
I think that if you're interested in that level of optimization, you had better be profiling your code using actual data. (Which is probably the best advice anyone here can give: it may not matter that you call sort after each insert if you're only doing it once in a blue moon.)
It sounds like you want a multi-index container. This allows you to create a container and tell that container the various ways you may want to traverse the items in it. The container then keeps multiple lists of the items, and those lists are updated on each insert/delete.
If you really want to re-sort the container, you can call the std::sort function on any std::deque, std::vector, or even a simple C-style array. That function takes an optional third argument to determine how to sort the contents.
The stl provides no such container. You can define your own, backed by either a set/multiset or a vector, but you are going to have to re-sort every time the sorting function changes by either calling sort (for a vector) or by creating a new collection (for set/multiset).
If you just want to change from increasing sort order to decreasing sort order, you can use the reverse iterator on your container by calling rbegin() and rend() instead of begin() and end(). Both vector and set/multiset are reversible containers, so this would work for either.
std::set is basically a sorted container.
You should definitely use a set/map. Like hazzen says, you get O(log n) insert/find. You won't get this with a sorted vector; you can get O(log n) find using binary search, but insertion is O(n) because inserting (or deleting) an item may cause all existing items in the vector to be shifted.
It's not that simple. In my experience insert/delete is used less often than find. Advantage of sorted vector is that it takes less memory and is more cache-friendly. If happen to have version that is compatible with STL maps (like the one I linked before) it's easy to switch back and forth and use optimal container for every situation.
in theory an associative container (set, multiset, map, multimap) should be your best solution.
In practice it depends by the average number of the elements you are putting in.
for less than 100 elements a vector is probably the best solution due to:
- avoiding continuous allocation-deallocation
- cache friendly due to data locality
these advantages probably will outperform nevertheless continuous sorting.
Obviously it also depends on how many insertion-deletation you have to do. Are you going to do per-frame insertion/deletation?
More generally: are you talking about a performance-critical application?
remember to not prematurely optimize...
The answer is as always it depends.
set and multiset are appropriate for keeping items sorted but are generally optimised for a balanced set of add, remove and fetch. If you have manly lookup operations then a sorted vector may be more appropriate and then use lower_bound to lookup the element.
Also your second requirement of resorting in a different order at runtime will actually mean that set and multiset are not appropriate because the predicate cannot be modified a run time.
I would therefore recommend a sorted vector. But remember to pass the same predicate to lower_bound that you passed to the previous sort as the results will be undefined and most likely wrong if you pass the wrong predicate.
Set and multiset use an underlying binary tree; you can define the <= operator for your own use. These containers keep themselves sorted, so may not be the best choice if you are switching sort parameters. Vectors and lists are probably best if you are going to be resorting quite a bit; in general list has it's own sort (usually a mergesort) and you can use the stl binary search algorithm on vectors. If inserts will dominate, list outperforms vector.
STL maps and sets are both sorted containers.
I second Doug T's book recommendation - the Josuttis STL book is the best I've ever seen as both a learning and reference book.
Effective STL is also an excellent book for learning the inner details of STL and what you should and shouldn't do.
For "STL compatible" sorted vector see A. Alexandrescu's AssocVector from Loki.