Which container should I use to quickly find an element, knowing that by construction it's already ordered? - c++

I have to use a container that by construction is already ordered and I need a quick find-function.
My idea is to use a set to exploit its set.find, which should be quick enough, but constructing my set with set.insert could be slow because I already know that the elements are ordered!
For this part, I could just use a vector with vector.push_back, but then I should write the find function myself and it would also be slower than the set::find... What do you suggest me?
EDIT:
The container needs to be modified, but will always be ordered, and its size will always be the same (but i don't know it a priori)
No repeated values.
A lot more finds than inserts.
I just need to know if it exists, not its position.
I found the std::binary_search. Is it good?

Constructing a set from a pair of iterators is guaranteed to be O(N) if the input range is already sorted, so you could just use that. (See e.g. http://en.cppreference.com/w/cpp/container/set/set.)
Subsequent inserts are O(log N).
Although if the total size is not that large a sorted vector tends to beat everything else because it is so cache friendly. std::binary_search is O(log N), the same as set::find.
std::unordered_set doesn't care about your ordering, but does give you O(1) lookup (on average, no guarantees).

std::vector with std::make_heap(), push_heap(), pop_heap() and sort_heap() could help you to deal with it as fast as std::set. However, you'll need to use std::binary_search manually.

Related

questions about the searching and sorting algorithms

I'm doing a little research about searching and sorting algorithms in the Standard library. I couldn't find something about those questions. I hope someone can help me out. You can also send me links if you know some.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
How can I adapt the behavior of the searching and sorting algorithms to make it more efficient?
Does the searching behavior change if the data is not sorted compared
to one which is sorted?
Depends. If you access your data in a vector/array by position, there's no performance improvement, and there's no need for sorting neither.
Searching can be done linearly, binary, keys, and by hash function.
For small (I guess something below a few dozens of items) and contiguous containers (e.g. a vector) linear search can be the fastest, just because of cache-friendly memory layout.
Binary search has O(log N) complexity which is likely the best you can get... I'm thinking in Information theory. It requires that you sort previously the container. It's useful for frequetly searches in the same container.
A std::set (and its cousin std::map) uses internally a tree, which makes searching O(log N) complexity too. Useful if you search by keys, instead of some criteria of your items. The drawback is that it's a bit slower on building (always keep sorted) than fill a vector an later sort it.
A hashmap or hashtable uses a function for getting the bucket where the item lies. The complexity is something near to O(1), depending on number of items and the function used (collisions issue).
As you see, selecting a type of container depends on how are you going to handle your data. Choose the one that fits your requirements.
How can I know if it is better to use std::sort() on a vector instead
of maybe to copy the vector to an already sorted set?
std::sort changes the container so the result is, obviously, sorted. If you need the original, unordered, container, then make a copy and sort the copy. Sorting the whole of the container is better that "insert-item-so-container-is-always-sorted" for all items, specially with a vector (many memory reallocations); a set/map filling process may be not that slow.
How can I adapt the behavior of the searching and sorting algorithms
to make it more efficient?
It's not clear to me what you mean. But, "The end justifies the means". Again, choose the container that servers best for your data handling.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
No. It depends on the algorithm you choose. General search std::find is O(n), binary search std::lower_bound is O(log n), but it works only on sorted ranges.
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
You can write a benchmark and measure. You can sort an std::vector (without duplicated elements) by copying it into an std::set, which maintains sorted order internally. std::set is typically implemented as a red-black tree and has in general high memory fragmentation in contrast to contiguous std::vector. So it is easy to predict the result. Alexander Stepanov discusses (if I remember correctly) this particular example in his lectures available on YouTube.

Storing in std::map/std::set vs sorting a vector after storing all data

Language: C++
One thing I can do is allocate a vector of size n and store all data
and then sort it using sort(begin(),end()). Else, I can keep putting
the data in a map or set which are ordered itself so I don't have to
sort afterwards. But in this case inserting an element may be more
costly due to rearrangements(I guess).
So which is the optimal choice for minimal time for a wide range of n(no. of objects)
It depends on the situation.
map and set are usually red-black trees, they should do a lot of work to be balanced, or the operation on it will be very slow. And it doesn't support random access. so if you only want to sort one time, you shouldn't use them.
However, if you want to continue insert elements into the container and keep order, map and set will take O(logN) time, while the sorted vector is O(N). The latter is much slower, so if you want frequently insert and delete, you should use map or set.
The difference between the 2 is noticable!
Using a set, you get O(log(N)) complexity for each element you insert. So by result you get O(N log(N)), which is the complexity of an insertion sort.
Adding everything in a vector is of complexity O(1), and sorting it will be O(N log(N)) since C++11 (before it, std::sort have O(N log(N)) on average.).
Once sorted, you could use binary_search to have the same complexity as in a set.
The API of using a vector as set ain't the friendly, although it does give nice performance benefits. This off course is only useful when you can do a bulk insert of data or when the amount of lookups is much larger than the manipulations of the content. Algorithmsable to sort on partially sorted vector, when you have to extend later on.
Finally, one has to remark that you don't have the same guarantees of iterator invalidation.
So, why are vectors better? Cache locality!
A vector has all data in a single memory block, hence the processor can do prefetching while for a set, the memory is scattered around the place requireing the data to find the next address. This makes vector a better set implementation than std::set for large data when you can live with the limitations.
To give you an idea, on the codebase I'm working on, we have several set and map implementations based on vectors which have their own narratives to function in. (For example: no erase or no operator[])

What is the most efficient std container for non-duplicated items?

What is the most efficient way of adding non-repeated elements into STL container and what kind of container is the fastest? I have a large amount of data and I'm afraid each time I try to check if it is a new element or not, it takes a lot of time. I hope map be very fast.
// 1- Map
map<int, int> Map;
...
if(Map.find(Element)!=Map.end()) Map[Element]=ID;
// 2-Vector
vector<int> Vec;
...
if(find(Vec.begin(), Vec.end(), Element)!=Vec.end()) Vec.push_back(Element);
// 3-Set
// Edit: I made a mistake: set::find is O(LogN) not O(N)
Both set and map has O(log(N)) performance for looking up keys. vector has O(N).
The difference between set and map, as far as you should be concerned, is whether you need to associate a key with a value, or just store a value directly. If you need the former, use a map, if you need the latter, use a set.
In both cases, you should just use insert() instead of doing a find().
The reason is insert() will insert the value into the container if and only if the container does not already contain that value (in the case of map, if the container does not contain that key). This might look like
Map.insert(std::make_pair(Element, ID));
for a map or
Set.insert(Element);
for a set.
You can consult the return value to determine whether or not an insertion was actually performed.
If you're using C++11, you have two more choices, which are std::unordered_map and std::unordered_set. These both have amortized O(1) performance for insertions and lookups. However, they also require that the key (or value, in the case of set) be hashable, which means you'll need to specialize std::hash<> for your key. Conversely, std::map and std::set require that your key (or value, in the case of set) respond to operator<().
If you're using C++11, you can use std::unordered_set. That would allow you O(1) existence-checking (technically amortized O(1) -- O(n) in the worst case).
std::set would probably be your second choice with O(lg n).
Basically, std::unordered_set is a hash table and std::set is a tree structure (a red black tree in every implementation I've ever seen)1.
Depending on how well your hashes distribute and how many items you have, a std::set might actually be faster. If it's truly performance critical, then as always, you'll want to do benchmarking.
1) Technically speaking, I don't believe either are required to be implemented as a hash table or as a balanced BST. If I remember correctly, the standard just mandates the run time bounds, not the implementation -- it just turns out that those are the only viable implementations that fit the bounds.
You should use a std::set; it is a container designed to hold a single (equivalent) copy of an object and is implemented as a binary search tree. Therefore, it is O(log N), not O(N), in the size of the container.
std::set and std::map often share a large part of their underlying implementation; you should check out your local STL implementation.
Having said all this, complexity is only one measure of performance. You may have better performance using a sorted vector, as it keeps the data local to one another and, therefore, more likely to hit the caches. Cache coherence is a large part of data structure design these days.
Sounds like you want to use a std::set. It's elements are unique, so you don't need to care about uniqueness when adding elements, and a.find(k) (where a is an std::set and k is a value) is defined as being logarithmic in complexity.
if your elements can be hashed for O(1), then better to use an index in a unordered_map or unordered_set (not in a map/set because they use RB tree in implementation which is O(logN) find complexity)
Your examples show a definite pattern:
check if the value is already in container
if not, add the value to the container.
Both of these operation would potentially take some time. First, looking up an element can be done in O(N) time (linear search) if the elements are not arranged in any particular manner (e.g., just a plain std::vector), it could be done in O(logN) time (binary search) if the elements are sorted (e.g., either std::map or std::set), and it could be done in O(1) time if the elements are hashed (e.g., either std::unordered_map or std::unordered_set).
The insertion will be O(1) (amortized) for a plain vector or an unordered container (hash container), although the hash container will be a bit slower. For a sorted container like set or map, you'll have log-time insertions because it needs to look for the place to insert it before inserting it.
So, the conclusion, use std::unordered_set or std::unordered_map (if you need the key-value feature). And you don't need to check before doing the insertion, these are unique-key containers, they don't allow duplicates.
If std::unordered_set / std::unordered_map (from C++11) or std::tr1::unordered_set / std::tr1::unordered_map (since 2007) are not available to you (or any equivalent), then the next best alternative is std::set / std::map.

std::list or std::multimap

Hey, I right now have a list of a struct that I made, I sort this list everytime I add a new object, using the std::list sort method.
I want to know what would be faster, using a std::multimap for this or std::list,
since I'm iterating the whole list every frame (I am making a game).
I would like to hear your opinion, for what should I use for this incident.
std::multimap will probably be faster, as it is O(log n) per insertion, whereas an insert and sort of the list is O(n log n).
Depending on your usage pattern, you might be better off with sorted vectors. If you insert a whole bunch of items at once and then do a bunch of reads -- i.e. reads and writes aren't interleaved -- then you'll have better performance with vector, std::sort, and std::binary_search.
You might consider using the lower_bound algorithm to find where to insert into your list. http://stdcxx.apache.org/doc/stdlibref/lower-bound.html
Edit: In light of Neil's comment, note that this will work with any sequence container (vector, deque, etc.)
If you do not need Key/Value pairs std::set or std::multiset is probably better than using std::multimap.
Reference for std::set:
http://www.cplusplus.com/reference/stl/set/
Reference for std::multiset:
http://www.cplusplus.com/reference/stl/multiset/
Edit: (seems like it was unclear before)
It is in general better to use a container like std::(multi)set or std:(multi)map than using std::list and sorting it afterwards everytime an element is inserted because std::list does not perform very good in inserting elements in the middle of the container.
Generally speaking, iterating over a container is likely to take about as much time as iterating over another, so if you keep adding to a container and then iterating over it, it's mainly a question of picking a container that avoids constantly having to reallocate memory and inserts the way you want quickly.
Both list and multimap will avoid having to reallocate themselves simply from adding an element (like you could get with a vector), so it's primarily a question of how long it takes to insert. Adding to the end of a list will be O(1) while adding to a multimap will be O(log n). However, the multimap will insert the elements in sorted order, while if you want to have the list be sorted, you're going to have to either sort the list in O(n log n) or insert the element in a sorted manner with something like lower_bound which would be O(n). In either case, it will be far worse (in the worst case at least) to use the list.
Generally, if you're maintaining a container in sorted order and continually adding to it rather than creating it and sorting it once, sets and maps are more efficient since they're designed to be sorted. Of course, as always, if you really care about performance, profiling your specific application and seeing which works better is what you need to do. However, in this case, I'd say that it's almost a guarantee that multimap will be faster (especially if you have very many elements at all).

c++ container for checking whether ordered data is in a collection

I have data that is a set of ordered ints
[0] = 12345
[1] = 12346
[2] = 12454
etc.
I need to check whether a value is in the collection in C++, what container will have the lowest complexity upon retrieval? In this case, the data does not grow after initiailization. In C# I would use a dictionary, in c++, I could either use a hash_map or set. If the data were unordered, I would use boost's unordered collections. However, do I have better options since the data is ordered? Thanks
EDIT: The size of the collection is a couple of hundred items
Just to detail a bit over what have already been said.
Sorted Containers
The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees for my few versions of the STL) because of the requirements on insertion, retrieval and deletion operation (and notably because of the invalidation of iterators requirements).
However, because of immutability, as you suspected there are other candidates, not the least of them being array-like containers. They have here a few advantages:
minimal overhead (in term of memory)
contiguity of memory, and thus cache locality
Several "Random Access Containers" are available here:
Boost.Array
std::vector
std::deque
So the only thing you actually need to do can be broken done in 2 steps:
push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity
Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.
If you don't want to reinvent the wheel, you can also check Alexandrescu's [AssocVector][1]. Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:
because it's faster for small datasets
because it can be faster for frozen datasets
Unsorted Containers
Actually, if you really don't care about order and your collection is kind of big, then a unordered_set will be faster, especially because integers are so trivial to hash size_t hash_method(int i) { return i; }.
This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unsorted containers will search over the "collisions" list of a given hash in linear time.
Conclusion
Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.
Unfortunately we can't really help more there, because it heavily depends on the size of the dataset and the repartition of its elements
If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).
Use a sorted std::vector, and use a std::binary_search to search it.
Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.
If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.
I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.
Both std::unordered_set and std::set add significant memory overhead - and even though the unordered_set provides O(1) lookup, the O(logn) binary search will probably outperform it given that the data is stored contiguously (no pointer following, less chance of a page fault etc.) and you don't need to calculate a hash function.
If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.
Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)
There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.
So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.
So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.