Efficiently finding multiple items in a container - C++

I need to find a number of objects from a large container.
The only way I can think of to do that is to search the container for one item at a time in a loop. However, even with an efficient search with an average case of, say, O(log n) (where n is the size of the container), this gives me O(m log n) (where m is the number of items I'm looking for) for the entire operation.
That seems highly suboptimal to me, and as it's something I'm likely to need to do on a frequent basis, it's something I'd definitely like to improve if possible.
Neither part has been implemented yet, so I'm open to suggestions on the format of the main container, the "list" of items I'm looking for, etc., as well as the actual search algorithm.
The items are complex objects; however, the search key is just a simple integer.

Hash tables have basically O(1) lookup. This gives you O(m) to look up m items; obviously you can't look up m items faster than O(m), because you need to get the results out.

If you're purely doing look-up (you don't require ordered elements) and can give up some memory, try unordered_map (it's TR1, also implemented in Boost), which has constant-time amortized look-up.
In a game engine, we tested std::map and unordered_map, and while map was faster for insertions (if I recall), unordered_map blew it out of the water for retrieval. For scale, we had more than 1000 elements in the map, which is fairly small compared to some other workloads you may have.
If you require elements to be ordered, your next bet is std::map, which has the look-up times you've posted, and keeps the elements ordered. In general, it also uses less memory than an unordered_map.
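For the use case in the original question (many lookups of integer keys against one big container), a minimal sketch of the hash-map approach could look like the following; the Item struct and the find_items helper are made up purely for illustration:

#include <unordered_map>
#include <vector>

struct Item { int id; /* ...other fields of the complex object... */ };

// Hypothetical helper: collect the items matching the requested keys.
// Each find() is amortized O(1), so the whole pass is O(m) on average.
std::vector<const Item*> find_items(const std::unordered_map<int, Item>& table,
                                    const std::vector<int>& keys)
{
    std::vector<const Item*> found;
    found.reserve(keys.size());
    for (int key : keys)
    {
        auto it = table.find(key);
        if (it != table.end())
            found.push_back(&it->second);
    }
    return found;
}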

If your container is a vector and the elements are sorted, you can use std::lower_bound to search in O(log n) time. If your search items are also sorted, you can do a small optimization by always using the last found iterator as the start of the search for the next one, e.g.
std::vector<stuff> container;            // sorted by the search key
std::vector<stuff> search_items;         // the keys to find, also sorted
std::vector<stuff>::iterator it = container.begin();
for (std::size_t i = 0; i < search_items.size() && it != container.end(); ++i)
{
    it = std::lower_bound(it, container.end(), search_items[i]);
    // make sure the found item is a match
    if (it != container.end() && search_items[i] < *it)
        it = container.end(); // break out early
}
if (it != container.end()) // found it!

Boost/TR1 unordered_map and unordered_set are containers backed by a hash table, which gives you search in amortized constant time [O(1)].
Boost Unordered documentation.

I suppose if you have a sorted container and a uniform distribution of items, then the most efficient method would be a recursive bisection search with an execution path somewhat like a tree: it calls itself on both halves whenever the objects being searched for fall into both halves of the bisection.
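A rough sketch of that bisection idea, assuming both the container and the list of search keys are sorted std::vector<int>s with distinct keys; the function name and the present output vector are illustrative only:

#include <algorithm>
#include <cstddef>
#include <vector>

// Search every key in sorted_keys[klo, khi) within sorted_data[lo, hi);
// 'present' holds one flag per key. Assumes the search keys are distinct.
void find_all(const std::vector<int>& sorted_data, std::ptrdiff_t lo, std::ptrdiff_t hi,
              const std::vector<int>& sorted_keys, std::ptrdiff_t klo, std::ptrdiff_t khi,
              std::vector<bool>& present)
{
    if (klo >= khi || lo >= hi)
        return;                                   // nothing left to match on this side
    std::ptrdiff_t kmid = klo + (khi - klo) / 2;  // middle search key
    auto begin = sorted_data.begin();
    auto pos = std::lower_bound(begin + lo, begin + hi, sorted_keys[kmid]);
    std::ptrdiff_t split = pos - begin;
    if (pos != begin + hi && *pos == sorted_keys[kmid])
        present[static_cast<std::size_t>(kmid)] = true;
    // keys smaller than the middle key can only be in the left part, larger ones in the right part
    find_all(sorted_data, lo, split, sorted_keys, klo, kmid, present);
    find_all(sorted_data, split, hi, sorted_keys, kmid + 1, khi, present);
}

Called as find_all(data, 0, data.size(), keys, 0, keys.size(), flags) with flags sized to keys.size(), neighbouring keys share the narrowing of the data range, which is where the saving over m fully independent searches comes from.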
However, if you choose a container based on a hash-table (boost unordered set, I think?), or something similar, then lookup can be O(1), so searching in a loop really doesn't matter.
EDIT:
note that std::map and std::set are normally (always?) implemented using red-black trees, so lookup is only O(log n).

Are you sure that m log2(n) is actually going to be a problem? If you are using a std::map that is even relatively large, the number of actual comparisons is still pretty small - if you are looking up 10,000 elements in a map of 1,000,000, the number of comparisons should be about 200,000, or about 20 comparisons per target element. This really isn't bad if your key is just a simple integer.
If you were hashing something that didn't already have a nice key, then I would say go with boost::unordered_map. I would implement it with std::map first, profile it, and then decide if you want to make the next jump to Boost.

If you're frequently performing the same projections on your collection, such as extracting elements with a key of "42", you could consider maintaining these subsets in buckets. You'd internally maintain a hashmap from keys to vectors of elements with that key, and add elements to the appropriate bucket as well as your primary collection representing "everything". Extracting a subgroup of elements is constant time (because the respective collections have already been built), and the memory overhead of maintaining the buckets scales primarily with the number of unique keys in your dataset.
This technique is decidedly less effective if you have a large number of unique key values, and it makes insertions and removals more expensive, but it's good for some situations; I thought it was at least worth mentioning.
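A minimal sketch of that bucketing scheme; the class and member names (BucketedCollection, with_key) are made up just to show the shape:

#include <unordered_map>
#include <vector>

struct Element { int key; /* ...payload... */ };

class BucketedCollection
{
public:
    void add(const Element& e)
    {
        everything_.push_back(e);
        buckets_[e.key].push_back(e);   // keep a per-key copy (or index) for O(1) subgroup access
    }

    // All elements with the given key, already grouped; constant time apart from the hash lookup.
    const std::vector<Element>& with_key(int key) const
    {
        static const std::vector<Element> empty;
        auto it = buckets_.find(key);
        return it != buckets_.end() ? it->second : empty;
    }

private:
    std::vector<Element> everything_;                        // the primary collection
    std::unordered_map<int, std::vector<Element>> buckets_;  // key -> elements with that key
};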

Related

Performance difference for iteration over all elements std::unordered_map vs std::map?

I wanted to map data with a pointer as the key. Which container should I have chosen, map or unordered_map? There are multiple questions on Stack Overflow on this topic, but none of them covers the performance aspect when we need to iterate over all the key-value pairs.
std::map<classKey*, classData*> myMap;
std::unordered_map<classKey*, classData*> myUnorderedMap;
for (auto & iter : myMap) { // loop1
    display(iter.second);
}
for (auto & iter : myUnorderedMap) { // loop2
    display(iter.second);
}
Which one gives better performance, loop1 or loop2?
Benchmark provided by @RetiredNinja:
For size = 10,000,000 we get the following benchmark results:
As you might expect, this depends heavily on the actual implementation of the standard library data structures. Therefore, this answer will be a bit more theoretical and less tied to any one implementation.
A std::map uses a balanced binary tree under the covers. This is why it has O(log(n)) insertion, deletion, and lookup. Iterating over it should be linear because you just have to do a depth-first traversal (which will require O(log(n)) memory in the form of stack space). The benefit of using a std::map for iteration is that you will iterate over the keys in sorted order and you will get that benefit "for free".
A std::unordered_map uses a hash table under the covers. This allows you to have an amortized constant time insertion, deletion, and lookup. If the implementation is not optimized for iterating, a naive approach would be to iterate over every bucket in the hash table. Since a good hash table (in theory) has exactly one element in 50% of the buckets and zero in the rest, this operation will also be linear. However, it will take more "wall clock time" than the same linear operation for a std::map. To get around this, some hash table implementations keep a side list of all of the elements for fast iterations. If this is the case, iterating on a std::unordered_map will be faster because you can't get much better than iterating over contiguous memory (still linear time though, obviously).
In the extremely unlikely case that you actually need to optimize to this level (instead of just being curious about the performance in theory), you likely have much bigger performance bottlenecks elsewhere in your code.
All of this ignores the oddity of keying off of a pointer value, but that's neither here nor there.
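If you would rather measure than reason about it, a rough timing harness along these lines will show the gap on your own platform; it uses int keys rather than pointers, and the size and names are arbitrary:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <map>
#include <unordered_map>

// Time a full pass over any map-like container, summing values so the loop isn't optimized away.
template <typename Map>
std::uint64_t time_iteration_ns(const Map& m, std::uint64_t& sink)
{
    auto start = std::chrono::steady_clock::now();
    for (const auto& kv : m)
        sink += kv.second;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

int main()
{
    const int n = 1000000;   // smaller than the 10,000,000 above; adjust as needed
    std::map<int, int> ordered;
    std::unordered_map<int, int> unordered;
    for (int i = 0; i < n; ++i)
    {
        ordered[i] = i;
        unordered[i] = i;
    }
    std::uint64_t sink = 0;
    std::cout << "std::map iteration:           " << time_iteration_ns(ordered, sink) << " ns\n";
    std::cout << "std::unordered_map iteration: " << time_iteration_ns(unordered, sink) << " ns\n";
    return static_cast<int>(sink & 1);   // keep 'sink' observable
}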
Sources for further reading:
GCC std::map implementation
GCC std::unordered_map implementation
How GCC std::unordered_map achieves fast iteration

slow std::map for large entries

We have 4,816,703 entries in this format.
1 abc
2 def
...
...
4816702 blah
4816703 blah_blah
Since the number of entries is quite big, I am worried that std::map would take a lot of time during insertion, since it needs to do balancing for each insertion as well.
Just inserting these entries into the map takes a lot of time. I am doing
map[first] = second;
Two questions:
1. Am I correct in using std::map for this kind of case?
2. Am I correct in inserting the above way, or should I use map.insert()?
I am sorry for not doing the experiments and writing the absolute numbers, but we want a general consensus on whether we are doing the right thing or not.
Also, the keys are not always consecutive.
P.S. Of course, later we will also need to access the map to get the values corresponding to the keys.
If you don’t need to insert into the map afterwards, you can construct an unsorted vector of your data, sort it according to the key, and then search using functions like std::equal_range.
It’s the same complexity as std::map, but far fewer allocations.
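A hedged sketch of that sort-then-search approach, using an illustrative pair<int, std::string> layout for the entries:

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

int main()
{
    using Entry = std::pair<int, std::string>;
    auto by_key = [](const Entry& a, const Entry& b) { return a.first < b.first; };

    std::vector<Entry> data;
    data.reserve(4816703);               // one allocation up front instead of many

    // ... push_back every (key, value) pair, in any order ...
    data.push_back({2, "def"});
    data.push_back({1, "abc"});

    std::sort(data.begin(), data.end(), by_key);   // sort once by key

    // O(log n) lookup afterwards
    Entry probe{2, std::string()};
    auto range = std::equal_range(data.begin(), data.end(), probe, by_key);
    if (range.first != range.second)
    {
        // range.first->second holds the value for key 2
    }
    return 0;
}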
Use an std::unordered_map, which has much better insertion time complexity than std::map, as the reference mentions:
Complexity
Single element insertions:
Average case: constant.
Worst case: linear in container size.
Multiple elements insertion:
Average case: linear in the number of elements inserted.
Worst case: N*(size+1): number of elements inserted times the container size plus one.
May trigger a rehash (not included in the complexity above).
That's better than the logarithmic time complexity of std::map's insertion.
Note: std::map's insertion can enjoy "amortized constant if a hint is given and the position given is the optimal". If that's the case for you, then use a map (if a vector is not applicable).
@n.m. provides a representative live demo.
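A small sketch of the unordered_map route; calling reserve() up front (with the entry count from the question) is an extra, optional step so the bulk load does not pay for repeated rehashing:

#include <string>
#include <unordered_map>

int main()
{
    std::unordered_map<int, std::string> table;
    table.reserve(4816703);           // pre-size the bucket array so insertions don't trigger rehashes

    // ... for each (first, second) pair read from the input ...
    int first = 1;
    std::string second = "abc";
    table.emplace(first, second);     // average O(1) per insertion; operator[] works too

    return 0;
}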

Insert a sorted range into std::set with hint

Assume I have a std::set (which is by definition sorted), and I have another range of sorted elements (for the sake of simplicity, in a different std::set object). Also, I have a guarantee that all values in the second set are larger than all the values in the first set.
I know I can efficiently insert one element into std::set - if I pass a correct hint, this will be O(1). I know I can insert any range into std::set, but as no hint is passed, this will be O(k log N) (where k is the number of new elements and N the number of old elements).
Can I insert a range in a std::set and provide a hint? The only way I can think of is to do k single inserts with a hint, which does push the complexity of the insert operations in my case down to O(k):
std::set<int> bigSet{1, 2, 5, 7, 10, 15, 18};
std::set<int> biggerSet{50, 60, 70};
for (auto bigElem : biggerSet)
    bigSet.insert(bigSet.end(), bigElem);
First of all, to do the merge you're talking about, you probably want to use set's (or map's) merge member function, which will let you merge some existing set into this one. The advantage of doing this (and the reason you might not want to, depending on your usage pattern) is that the items being merged in are actually moved from one set to the other, so you don't have to allocate new nodes (which can save a fair amount of time). The disadvantage is that the nodes then disappear from the source set, so if you need each local histogram to remain intact after being merged into the global histogram, you don't want to do this.
You can typically do better than O(log N) when searching a sorted vector. Assuming reasonably predictable distribution you can use an interpolating search to do a search in (typically) around O(log log N), often called "pseudo-constant" complexity.
Given that you only do insertions relatively infrequently, you might also consider a hybrid structure. This starts with a small chunk of data that you don't keep sorted. When you reach an upper bound on its size, you sort it and insert it into a sorted vector. Then you go back to adding items to your unsorted area. When it reaches the limit, again sort it and merge it with the existing sorted data.
Assuming you limit the unsorted chunk to no larger than log(N), search complexity is still O(log N): one O(log N) binary search (or O(log log N) interpolating search) on the sorted chunk, plus one linear search over the unsorted chunk, which is also O(log N) because that chunk never holds more than log(N) items. Once you've verified that an item doesn't exist yet, adding it has constant complexity (just tack it onto the end of the unsorted chunk). The big advantage is that this can still easily use a contiguous structure such as a vector, so it's much more cache friendly than a typical tree structure.
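A rough sketch of that hybrid layout; the class name, the int payload and the choice of limit are all illustrative:

#include <algorithm>
#include <iterator>
#include <vector>

class HybridSet
{
public:
    explicit HybridSet(std::size_t unsorted_limit) : limit_(unsorted_limit) {}

    bool contains(int value) const
    {
        // O(log N) binary search on the sorted part...
        if (std::binary_search(sorted_.begin(), sorted_.end(), value))
            return true;
        // ...plus a short linear scan of the unsorted tail
        return std::find(unsorted_.begin(), unsorted_.end(), value) != unsorted_.end();
    }

    void insert(int value)
    {
        if (contains(value))
            return;                          // already present
        unsorted_.push_back(value);          // constant-time append
        if (unsorted_.size() > limit_)
        {
            std::sort(unsorted_.begin(), unsorted_.end());
            std::vector<int> merged;
            merged.reserve(sorted_.size() + unsorted_.size());
            std::merge(sorted_.begin(), sorted_.end(),
                       unsorted_.begin(), unsorted_.end(),
                       std::back_inserter(merged));
            sorted_.swap(merged);
            unsorted_.clear();
        }
    }

private:
    std::size_t limit_;            // e.g. roughly log2(N)
    std::vector<int> sorted_;      // bulk of the data, kept sorted
    std::vector<int> unsorted_;    // recent insertions, not yet merged
};

The merge pass is the expensive step, but under the stated assumption that insertions are relatively infrequent it is paid rarely, while lookups stay on contiguous storage.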
Since your global histogram is (apparently) only ever populated with data coming from the local histograms, it might be worth considering just keeping it in a vector, and when you need to merge in the data from one of the local chunks, just use std::merge to take the existing global histogram and the local histogram, and merge them together into a new global histogram. This has O(N + M) complexity (N = size of global histogram, M = size of local histogram). Depending on the typical size of a local histogram, this could pretty easily work out as a win.
Merging two sorted containers is much quicker than sorting. Its complexity is O(N), so in theory what you say makes sense. It's the reason why merge sort is one of the quickest sorting algorithms. If you follow the link, you will also find pseudo-code; what you are doing is just one pass of the main loop.
You will also find the algorithm implemented in the STL as std::merge. This works with ANY container as input (it takes iterator ranges); I would suggest using std::vector as the default container for the new elements. Sorting a vector is a very fast operation. You may even find it better to use a sorted vector instead of a set for the output. You can always use std::lower_bound to get O(log N) look-ups from a sorted vector.
Vectors have many advantages compared with set/map. Not least of which is they are very easy to visualise in a debugger :-)
(The code at the bottom of the std::merge reference page shows an example of using vectors.)
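For reference, a small vector-based sketch of that merge step, reusing the values from the question above:

#include <algorithm>
#include <iterator>
#include <vector>

int main()
{
    std::vector<int> global{1, 2, 5, 7, 10, 15, 18};   // already sorted
    std::vector<int> local{60, 50, 70};                // new batch, any order

    std::sort(local.begin(), local.end());             // sorting the small batch is cheap

    std::vector<int> merged;
    merged.reserve(global.size() + local.size());
    std::merge(global.begin(), global.end(),
               local.begin(), local.end(),
               std::back_inserter(merged));            // single O(N + M) pass
    global.swap(merged);

    // look up a value in the sorted result in O(log N)
    bool has15 = std::binary_search(global.begin(), global.end(), 15);
    return has15 ? 0 : 1;
}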
You can merge the sets more efficiently using special functions for that.
In case you insist, insert returns information about the inserted location.
iterator insert( const_iterator hint, const value_type& value );
Code:
std::set<int> bigSet{1, 2, 5, 7, 10, 15, 18};
std::set<int> biggerSet{50, 60, 70};
auto hint = bigSet.cend();
for (auto& bigElem : biggerSet)
    hint = bigSet.insert(hint, bigElem);
This assumes, of course, that you are inserting new elements that will end up together or close in the final set. Otherwise there is not much to gain, only the fact that since the source is a set (it is ordered), about half of the tree will not be looked up.
There is also a member function
template< class InputIt > void insert( InputIt first, InputIt last );.
That might or might not do something like this internally.

C++ std::unordered_map complexity

I've read a lot about unordered_map (C++11) time complexity here on Stack Overflow, but I haven't found the answer to my question.
Let's assume indexing by integer (just for example):
Insert/at functions work in constant time (on average), so this example would take O(1):
std::unordered_map<int, int> mymap = {
    { 1, 1 },
    { 100, 2 },
    { 100000, 3 }
};
What I am curious about is how long it takes to iterate through all the (unsorted) values stored in the map, e.g.
for ( auto it = mymap.begin(); it != mymap.end(); ++it ) { ... }
Can I assume that each stored value is accessed only once (or twice, or a constant number of times)? That would imply that iterating through all values in an N-element map is O(N). The other possibility is that my example with keys {1, 100, 100000} could take up to 100,000 iterations (if it were represented by an array).
Is there any other container that can be iterated linearly and whose values can be accessed by a given key in constant time?
What I would really need is (pseudocode)
myStructure.add(key, value) // O(1)
value = myStructure.at(key) // O(1)
for (auto key : myStructure) {...} // O(1) for each key/value pair = O(N) for N values
Is std::unordered_map the structure I need?
Integer indexing is sufficient, average complexity as well.
Regardless of how they're implemented, standard containers provide iterators that meet the iterator requirements. Incrementing an iterator is required to be constant time, so iterating through all the elements of any standard container is O(N).
The complexity guarantees of all standard containers are specified in the C++ Standard.
std::unordered_map element access and element insertion is required to be of complexity O(1) on average and O(N) worst case (cf. Sections 23.5.4.3 and 23.5.4.4; pages 797-798).
A specific implementation (that is, a specific vendor's implementation of the Standard Library) can choose whatever data structure it wants. However, to be compliant with the Standard, its complexity must be at least as good as specified.
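Mapping the question's pseudocode directly onto std::unordered_map, a minimal sketch:

#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> structure;

    structure[1] = 1;                 // add(key, value): average O(1)
    structure[100] = 2;
    structure[100000] = 3;

    int value = structure.at(100);    // at(key): average O(1), throws if the key is absent

    // one pass over exactly the stored elements: O(N) for N entries,
    // regardless of how spread out the key values are
    for (const auto& kv : structure)
        std::cout << kv.first << " -> " << kv.second << '\n';

    return value == 2 ? 0 : 1;
}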
There are a few different ways that a hash table can be implemented, and I suggest you read more on those if you're interested, but the main two are chaining and open addressing.
In the first case you have an array of linked lists. Each entry in the array could be empty; each item in the hash table will be in some bucket. So iteration is walking down the array, and walking down each non-empty list in it. Clearly O(N), but potentially very memory-inefficient depending on how the linked lists themselves are allocated.
In the second case, you just have one very large array which will have lots of empty slots. Here, iteration is again clearly linear, but could be inefficient if the table is mostly empty (which it should be for lookup purposes), because the elements that are actually present will be in different cache lines.
Either way, you're going to have linear iteration and you're going to touch every element exactly once. Note that this is true of std::map also; iteration will be linear there as well. But in the case of the maps, iteration will definitely be far less efficient than iterating a vector, so keep that in mind. If your use case requires BOTH fast lookup and fast iteration, and you insert all your elements up front and never erase, it could be much better to actually have both the map and the vector. Take the extra space for the added performance.
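A hedged sketch of that last suggestion (keep both the map and the vector), valid only if everything is inserted up front and never erased; the names are made up:

#include <string>
#include <unordered_map>
#include <vector>

struct Record { int key; std::string payload; };

// Hypothetical combo: the vector owns the records for fast, contiguous iteration,
// and the index maps keys to positions for fast lookup.
struct IndexedRecords
{
    std::vector<Record> items;                   // cache-friendly iteration
    std::unordered_map<int, std::size_t> index;  // key -> position in 'items'

    void add(Record r)
    {
        index[r.key] = items.size();
        items.push_back(std::move(r));
    }

    const Record* find(int key) const
    {
        auto it = index.find(key);               // average O(1)
        return it == index.end() ? nullptr : &items[it->second];
    }
};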

c++ container for checking whether ordered data is in a collection

I have data that is a set of ordered ints
[0] = 12345
[1] = 12346
[2] = 12454
etc.
I need to check whether a value is in the collection in C++; what container will have the lowest complexity for retrieval? In this case, the data does not grow after initialization. In C# I would use a Dictionary; in C++ I could use either a hash_map or a set. If the data were unordered, I would use Boost's unordered collections. However, do I have better options since the data is ordered? Thanks.
EDIT: The size of the collection is a couple of hundred items
Just to expand a bit on what has already been said.
Sorted Containers
The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees in the few STL versions I've looked at) because of the requirements on the insertion, retrieval and deletion operations (and notably because of the iterator-invalidation requirements).
However, because of the immutability, as you suspected there are other candidates, not the least of them being array-like containers. They have a few advantages here:
minimal overhead (in terms of memory)
contiguity of memory, and thus cache locality
Several "Random Access Containers" are available here:
Boost.Array
std::vector
std::deque
So the only thing you actually need to do can be broken down into 2 steps:
push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity
Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.
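Those two steps in code, a minimal sketch using the sample values from the question:

#include <algorithm>
#include <vector>

int main()
{
    // step 1: push all the values, then sort once (the data never changes afterwards)
    std::vector<int> values{12454, 12345, 12346};
    std::sort(values.begin(), values.end());

    // step 2: O(log n) membership test on the sorted, contiguous storage
    bool found = std::binary_search(values.begin(), values.end(), 12346);
    return found ? 0 : 1;
}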
If you don't want to reinvent the wheel, you can also check Alexandrescu's AssocVector. Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:
because it's faster for small datasets
because it can be faster for frozen datasets
Unsorted Containers
Actually, if you really don't care about order and your collection is kind of big, then an unordered_set will be faster, especially because integers are trivial to hash: size_t hash_method(int i) { return i; }.
This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unordered containers will search the "collisions" list of a given hash in linear time.
Conclusion
Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.
Unfortunately we can't really help more than that, because it heavily depends on the size of the dataset and the distribution of its elements.
If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).
Use a sorted std::vector, and use a std::binary_search to search it.
Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.
If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.
I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.
Both std::unordered_set and std::set add significant memory overhead - and even though the unordered_set provides O(1) lookup, the O(log n) binary search will probably outperform it, given that the data is stored contiguously (no pointer following, less chance of a page fault, etc.) and you don't need to calculate a hash function.
If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.
Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)
There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.
So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.
So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.