Should I randomly shuffle before inserting into STL set? - c++

I need to insert 10-million strings into a C++ STL set. The strings are sorted. Will I have a pathological problem if I insert the strings in sorted order? Should I randomize first? Or will the G++ STL implementation automatically rebalance for me?

The set implementation typically uses a red-black tree, which will rebalance for you. However, insertion may be faster (or it may not) if you randomise the data before inserting - the only way to be sure is to do a test with your set implementation and specific data. Retrieval times will be the same, either way.

The implementation will re-balance automatically. Given that you know the input is sorted, however, you can give it a bit of assistance: You can supply a "hint" when you do an insertion, and in this case supplying the iterator to the previously inserted item will be exactly the right hint to supply for the next insertion. In this case, each insertion will have amortized constant complexity instead of the logarithmic complexity you'd otherwise expect.
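A minimal sketch of the hinted insertion described above, assuming the input is already sorted ascending. The function name `build_from_sorted` is made up for illustration. Note that since C++11 the hint names the position just before which the element should go, so `end()` is the matching hint for ascending input; before C++11 the hint was the position after which to insert, which is what "the previously inserted item" refers to.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: build a set from input that is already sorted ascending.
// Passing end() as the hint tells the tree that the new element belongs at the
// back, which is the case that gets amortized constant time per insertion.
std::set<std::string> build_from_sorted(const std::vector<std::string>& sorted_input) {
    std::set<std::string> result;
    for (const std::string& s : sorted_input) {
        // If the hint turns out to be wrong (or s is a duplicate), the insert
        // still behaves correctly; it just falls back to a normal search.
        result.insert(result.end(), s);
    }
    return result;
}
```

If the input is not actually sorted, the result is still a valid set; only the speed-up is lost.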

The only question I have: do you really need a set ?
If the data is already sorted and you don't need to insert / delete elements after the creation, a deque would be better:
you'll have the same big-O complexity using a binary search for retrieval
you'll get less memory overhead... and better cache locality
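On binary_search: the documentation only asks for a ForwardIterator, but you need a random-access iterator (which deque provides) for the search to take logarithmic time overall; with weaker iterators only the number of comparisons is logarithmic, and stepping through the range is linear.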
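As a rough sketch of the deque alternative, assuming the strings are loaded once in sorted order and never modified afterwards (the helper name `contains_sorted` is invented here):

```cpp
#include <algorithm>
#include <deque>
#include <string>

// Hypothetical helper: membership test on a deque that is sorted ascending.
// std::binary_search gives O(log n) lookups here because deque iterators
// are random-access.
bool contains_sorted(const std::deque<std::string>& sorted, const std::string& key) {
    return std::binary_search(sorted.begin(), sorted.end(), key);
}
```

The precondition matters: on an unsorted deque the result of `std::binary_search` is meaningless.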

http://en.wikipedia.org/wiki/Standard_Template_Library
set: "Implemented using a self-balancing binary search tree."

g++'s libstdc++ uses red black trees for sets and maps.
http://en.wikipedia.org/wiki/Red-black_tree
This is a self balancing tree, and insertions are always O(log n). The C++ standard also requires that all implementations have this characteristic, so in practice, they are almost always red black trees, or something very similar.
So don't worry about the order you put the elements in.

A very cheap and simple solution is to insert from both ends of your collections of strings. That is to say, first add "A", then "ZZZZZ", then "AA", then "ZZZZY", etcetera until you meet in the middle. It doesn't require the hefty cost of shuffling, yet it is likely to sidestep pathological cases.
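For what it's worth, the interleaving described above could look something like the sketch below. The function name is made up, and with a red-black tree it should not be needed for correctness or for avoiding degenerate shapes; it only changes the order in which elements arrive.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: insert alternately from the front and the back of a
// sorted input until the two positions meet in the middle.
std::set<std::string> build_from_both_ends(const std::vector<std::string>& sorted_input) {
    std::set<std::string> result;
    std::size_t lo = 0;
    std::size_t hi = sorted_input.size();
    while (lo < hi) {
        result.insert(sorted_input[lo++]);      // next smallest remaining
        if (lo < hi) {
            result.insert(sorted_input[--hi]);  // next largest remaining
        }
    }
    return result;
}
```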

Maybe 'unordered_set' can be an alternative.
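If ordering is not needed, the unordered_set route might look like this sketch (C++11 or later; `build_unordered` is an invented name). Reserving up front is an assumption about the workload: it avoids repeated rehashing when the element count is known.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: load all strings into a hash set. Input order is
// irrelevant here, so there is nothing to shuffle.
std::unordered_set<std::string> build_unordered(const std::vector<std::string>& input) {
    std::unordered_set<std::string> result;
    result.reserve(input.size());               // room for that many elements without rehashing
    result.insert(input.begin(), input.end());  // duplicates are dropped
    return result;
}
```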

Related

questions about the searching and sorting algorithms

I'm doing a little research about searching and sorting algorithms in the Standard library. I couldn't find something about those questions. I hope someone can help me out. You can also send me links if you know some.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
How can I adapt the behavior of the searching and sorting algorithms to make it more efficient?
Does the searching behavior change if the data is not sorted compared
to one which is sorted?
Depends. If you access your data in a vector/array by position, there's no performance improvement, and there's no need for sorting either.
Searching can be linear, binary, by key, or by hash function.
For small (I'd guess fewer than a few dozen items) and contiguous containers (e.g. a vector), linear search can be the fastest, just because of the cache-friendly memory layout.
Binary search has O(log N) complexity, which is likely the best you can get for a comparison-based search (I'm thinking in terms of information theory). It requires that you sort the container beforehand. It's useful for frequent searches in the same container.
A std::set (and its cousin std::map) internally uses a tree, which makes searching O(log N) too. It is useful if you search by key instead of by some other criterion of your items. The drawback is that it's a bit slower to build (it is always kept sorted) than filling a vector and sorting it later.
A hashmap or hashtable uses a function to find the bucket where the item lies. The complexity is close to O(1), depending on the number of items and the function used (collisions issue).
As you see, selecting a type of container depends on how you are going to handle your data. Choose the one that fits your requirements.
How can I know if it is better to use std::sort() on a vector instead
of maybe to copy the vector to an already sorted set?
std::sort changes the container so the result is, obviously, sorted. If you need the original, unordered container, then make a copy and sort the copy. Sorting the whole container once is better than "insert-item-so-container-is-always-sorted" for all items, especially with a vector (each insertion shifts the elements after it, and growth triggers memory reallocations); filling a set/map may not be that slow.
How can I adapt the behavior of the searching and sorting algorithms
to make it more efficient?
It's not clear to me what you mean. But, "The end justifies the means". Again, choose the container that serves best for your data handling.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
No. It depends on the algorithm you choose. General search std::find is O(n), binary search std::lower_bound is O(log n), but it works only on sorted ranges.
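To make the contrast concrete, here is a small sketch of the two searches on a `std::vector<int>`. The helper names are invented for illustration.

```cpp
#include <algorithm>
#include <vector>

// Linear search: works on any range, sorted or not. O(n).
bool contains_linear(const std::vector<int>& v, int key) {
    return std::find(v.begin(), v.end(), key) != v.end();
}

// Binary search via std::lower_bound: O(log n), but the answer is only
// meaningful if the range is sorted ascending.
bool contains_binary(const std::vector<int>& sorted, int key) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), key);
    return it != sorted.end() && *it == key;
}
```

`std::lower_bound` returns the first position whose element is not less than the key, so the extra equality check is what turns it into a membership test.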
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
You can write a benchmark and measure. You can sort a std::vector (without duplicated elements) by copying it into a std::set, which maintains sorted order internally. std::set is typically implemented as a red-black tree and generally has high memory fragmentation, in contrast to the contiguous std::vector. So it is easy to predict the result: sorting the vector in place should win. Alexander Stepanov discusses (if I remember correctly) this particular example in his lectures available on YouTube.
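The two approaches being compared could be sketched as follows; both function names are made up, and which one is faster on your data is something only a benchmark will tell you.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Approach 1: sort the vector in place, then drop adjacent duplicates.
std::vector<int> sorted_unique_via_sort(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// Approach 2: copy into a std::set (one node allocation per element),
// then copy back out in sorted order.
std::vector<int> sorted_unique_via_set(const std::vector<int>& v) {
    std::set<int> s(v.begin(), v.end());
    return std::vector<int>(s.begin(), s.end());
}
```

Both return the same sequence, so they can be timed against each other directly.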

iterate ordered versus unordered containers

I want to know which data-structures are more efficient for iterating through their elements between std::set, std::map and std::unordered_set, std::unordered_map.
I searched through SO and I found this question. The answers either propose to copy the elements in a std::vector or to use Boost.Container, which IMHO don't answer my question.
My purpose is to keep in a container a big number of unique elements, that most of the time I want to iterate through them. Insertions and extractions are more rare. I want to avoid std::vector in combination with std::unique.
Let's consider set vs unordered_set.
The main difference here is the 'nature' of the iteration, that is the traversal of the set will give you the elements in order while traversing a range in an unordered set will give you a bunch of values in no particular order.
Suppose you want to traverse a range [it1, it2]. If we exclude the lookup time that's needed to find elements it1 and it2, there can be no direct mapping from one case to another, since the elements in between are not guaranteed to be the same even if you've used the same elements to construct the container.
There are cases however where something like this has meaning when e.g. you want to traverse a fixed number of elements (regardless of what they are) or when you need to traverse the whole container. In such cases you need to consider implementation mechanics :
Sets are usually implemented as red-black trees (a form of binary search tree). Like all binary search trees, they allow efficient in-order traversal (left, root, right) of their elements. That is, to traverse you pay the cost of pointer chasing (just like traversing a list).
Unordered sets, on the other hand, are hash tables, and to my knowledge the STL implementation uses hashing with chaining. That means (at a very high level) that what's used for the structure is a (contiguous) buffer where each element is the head of a chain (list) that contains the elements. The way the elements are laid out across those chains (buckets) and across the buffer will affect the traversal time; however, you'll be chasing pointers once again, jumping through different lists as well this time. I don't think it'll vary significantly from the tree case, but it won't be any better for sure.
In any case micro tuning and benchmarking will give you the answer for your particular application.
The difference does not lie between the ordering or lack of one but in the backing container. If it's a contiguous memory it should be fast to iterate over, due to simple implementation of iterator and cache friendliness.
Unordered containers are usually stored as a vector of vectors (or a similar thing), while ordered containers are implemented using trees. This would suggest that iterating over the unordered version should be faster. However, this is left to the implementation after all, and I have seen implementations (which bent the rules a little, to be fair) with different behaviour.
Generally speaking, container performance is quite a complex topic and usually has to be tested in the actual application to get a reliable answer. There is plenty of implementation-defined stuff that might affect the performance. I'd go with hash_set if I had to go in blind. Copying into a vector might also turn out to be a good option.
EDIT: As @TonyD said in their comment, there is a rule that disallows invalidating iterators when adding an element as long as max_load_factor() is not exceeded; this practically rules out backing containers which are contiguous in memory.
Thus, copying everything into a vector seems like an even more reasonable option. If you need to remove duplicates, a feasible option might be to use std::sort (http://en.cppreference.com/w/cpp/algorithm/sort), so that dupes end up adjacent and are easily skipped. I have heard that keeping a sorted vector (vector plus sort) is quite a common choice when you need a container that is sorted and is iterated over more often than it is modified.
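One way to keep such a vector both sorted and duplicate-free as rare insertions come in is sketched below. This is only an illustration of the "sorted vector" idea, not something any of the answers spelled out, and `insert_unique` is an invented name.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: insert value into a sorted, duplicate-free vector.
// Returns false (and leaves the vector untouched) if the value is already there.
// The search is O(log n); the insert itself is O(n) because elements shift.
bool insert_unique(std::vector<int>& sorted, int value) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), value);
    if (it != sorted.end() && *it == value) {
        return false;  // already present
    }
    sorted.insert(it, value);
    return true;
}
```

Iteration then runs over contiguous memory, which is the property the answers above are after; the O(n) insert is the price, and it is only acceptable because insertions are rare in this use case.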
Iteration speed, from fastest to slowest, should be: set > map > unordered_set > unordered_map.
set is a little lighter than map, and both are ordered by the binary tree rule, so they should be faster than the unordered_ containers.

Why is inserting multiple elements into a std::set simultaneously faster?

I'm reading through:
"The C++ Standard Library: A Tutorial and Reference by Nicolai M.
Josuttis"
and I'm in the section about Sets & Multisets. I came across a line regarding Inserting and Removing elements:
"Inserting and removing happens faster if, when working with multiple
elements, you use a single call for all elements rather than multiple
calls."
I'm far from a data structures master, but I'm aware that they're implemented with red-black trees. What I don't understand from that is how would the STL implementers write an algorithm to insert multiple elements at once in a faster manner?
Can anyone shed some light on why this quote is true for me?
My first thought was that it might rebalance the tree only after inserting/erasing the whole range. Since the whole operation is inlined in practice, that seems more likely than the number of function calls.
Checking the GCC headers on my local machine, this doesn't seem to be the case - and anyway, I don't know how the tradeoff between reduced rebalancing activity, and potentially increased search times for intermediate inserts to an unbalanced tree, would work out.
Maybe it's considered a QoI issue, but in any case, using the most expressive method is probably best, not just because it saves you writing a for loop and shows your intention most clearly, but because it leaves library writers latitude to optimise more aggressively in the future, without you needing to know and change your code.
There are two reasons:
1) Making a single call for multiple elements instead of N times more calls.
2) The insertion operation checks, for each element inserted, whether another element with the same value already exists in the container. This can be optimized when inserting multiple elements together.
What you quoted is wrong. Inserting into a std::set is O(log n) per element, unless you use the insert() overload with the position iterator, in which case it is amortized O(1) when the position is valid. But if you use the range overload with sorted elements, then you get O(n) insertion for the whole range.
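A sketch of the single-call forms, with invented helper names. As far as I know, the standard only promises linear time for the range constructor when the input is already sorted; for the range overload of `insert()` the stated bound is N log(size + N), so any linear behaviour there depends on the implementation.

```cpp
#include <set>
#include <vector>

// Hypothetical helper: build a set in one call from a range.
// Linear in the number of elements if the range is already sorted.
std::set<int> build_from_range(const std::vector<int>& sorted_input) {
    return std::set<int>(sorted_input.begin(), sorted_input.end());
}

// Hypothetical helper: add a whole batch with one call instead of a loop
// of single-element insert() calls. Duplicates are ignored.
void insert_batch(std::set<int>& target, const std::vector<int>& batch) {
    target.insert(batch.begin(), batch.end());
}
```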
Memory management could be a good reason. In this case it could allocate the memory just once. If all elements are inserted separately, each call tries to allocate memory separately. As far as I know, most set and map implementations try to keep the memory in the same page, or in pages close together, to minimize page faults.
I'm not sure about this, but I think that if the number of elements inserted is smaller than the number of elements in the set, then it can be more efficient to sort the inserted range before performing the insertions. This way, all values can be inserted in a single pass over the tree, and duplicates in the inserted range can be easily eliminated (or inserted very fast in the case of a multiset).
Of course, this optimization is only possible if the input iterators allow sorting the input range (i.e. if they are random-access iterators).

What is the difference between set and hashset in C++ STL?

When should I choose one over the other?
Are there any pointers that you would recommend for using the right STL containers?
hash_set is an extension that is not part of the C++ standard. Lookups should be O(1) rather than O(log n) for set, so it will be faster in most circumstances.
Another difference will be seen when you iterate through the containers. set will deliver the contents in sorted order, while hash_set will be essentially random (Thanks Lou Franco).
Edit: The C++11 update to the C++ standard introduced unordered_set which should be preferred instead of hash_set. The performance will be similar and is guaranteed by the standard. The "unordered" in the name stresses that iterating it will produce results in no particular order.
std::set is implemented as a binary search tree.
hash_set is implemented as a hash table.
The main issue here is that many people use std::set thinking it is a hash table with O(1) look-up, which it isn't. It really has O(log(n)) look-ups. Other than that, read about binary trees vs hash tables to get a better idea of the data structures.
Another thing to keep in mind is that with hash_set you have to provide the hash function, whereas a set only requires a comparison function ('<') which is easier to define (and predefined for native types).
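To illustrate the difference in what you have to supply, here is a sketch with a made-up `Point` type. The hash combiner is just one common recipe, chosen as an assumption for the example, not a recommendation.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <unordered_set>

// Hypothetical element type used only for this illustration.
struct Point {
    int x;
    int y;
};

// std::set<Point> only needs an ordering.
bool operator<(const Point& a, const Point& b) {
    return a.x != b.x ? a.x < b.x : a.y < b.y;
}

// std::unordered_set<Point, PointHash> needs equality plus a hash function.
bool operator==(const Point& a, const Point& b) {
    return a.x == b.x && a.y == b.y;
}

struct PointHash {
    std::size_t operator()(const Point& p) const {
        std::size_t h1 = std::hash<int>()(p.x);
        std::size_t h2 = std::hash<int>()(p.y);
        // Mix the two hashes; equal points always produce equal hashes.
        return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    }
};
```

With these in place, `std::set<Point>` and `std::unordered_set<Point, PointHash>` can both be declared; for built-in types and `std::string`, `std::hash` already provides the hash.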
I don't think anyone has answered the other part of the question yet.
The reason to use hash_set or unordered_set is the usually O(1) lookup time. I say usually because every so often, depending on implementation, a hash may have to be copied to a larger hash array, or a hash bucket may end up containing thousands of entries.
The reason to use a set is if you often need the largest or smallest member of a set. A hash has no order so there is no quick way to find the smallest item. A tree has order, so largest or smallest is very quick. O(log n) for a simple tree, O(1) if it holds pointers to the ends.
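The point about smallest and largest elements can be sketched like this (helper names invented; all three assume a non-empty container):

```cpp
#include <algorithm>
#include <set>
#include <unordered_set>

// Ordered set: the extremes sit at the two ends of the iteration order.
int smallest_in_set(const std::set<int>& s) { return *s.begin(); }
int largest_in_set(const std::set<int>& s) { return *s.rbegin(); }

// Unordered set: no order to exploit, so every element has to be visited. O(n).
int smallest_in_unordered(const std::unordered_set<int>& s) {
    return *std::min_element(s.begin(), s.end());
}
```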
A hash_set would be implemented by a hash table, which has mostly O(1) operations, whereas a set is implemented by a tree of some sort (AVL, red black, etc.) which have O(log n) operations, but are in sorted order.
Edit: I had written that trees are O(n). That's completely wrong.

Should I use an insertion sort or construct a heap to improve performance?

We have large (100,000+ elements) ordered vectors of structs (operator < overloaded to provide ordering):
std::vector<MyType> vectorMyTypes;
std::sort(vectorMyTypes.begin(), vectorMyTypes.end());
My problem is that we're seeing performance problems when adding new elements to these vectors while preserving sort order. At the moment we're doing something like:
for ( a very large set )
{
    vectorMyTypes.push_back(newType);
    std::sort(vectorMyTypes.begin(), vectorMyTypes.end());
    ...
    ValidateStuff(vectorMyTypes); // this method expects the vector to be ordered
}
This isn't exactly what our code looks like since I know this example could be optimised in different ways, however it gives you an idea of how performance could be a problem because I'm sorting after every push_back.
I think I essentially have two options to improve performance:
Use a (hand crafted?) insertion sort instead of std::sort to improve the sort performance (insertion sorts on a partially sorted vector are blindingly quick)
Create a heap by using std::make_heap and std::push_heap to maintain the sort order
My questions are:
Should I implement an insertion sort? Is there something in Boost that could help me here?
Should I consider using a heap? How would I do this?
Edit:
Thanks for all your responses. I understand that the example I gave was far from optimal and it doesn't fully represent what I have in my code right now. It was simply there to illustrate the performance bottleneck I was experiencing - perhaps that's why this question isn't seeing many up-votes :)
Many thanks to you Steve, it's often the simplest answers that are the best, and perhaps it was my over analysis of the problem that blinded me to perhaps the most obvious solution. I do like the neat method you outlined to insert directly into a pre-ordered vector.
As I've commented, I'm constrained to using vectors right now, so std::set, std::map, etc aren't an option.
Ordered insertion doesn't need boost:
vectorMyTypes.insert(
std::upper_bound(vectorMyTypes.begin(), vectorMyTypes.end(), newType),
newType);
upper_bound provides a valid insertion point provided that the vector is sorted to start with, so as long as you only ever insert elements in their correct place, you're done. I originally said lower_bound, but if the vector contains multiple equal elements, then upper_bound selects the insertion point which requires less work.
This does have to copy O(n) elements, but you say insertion sort is "blindingly fast", and this is faster. If it's not fast enough, you have to find a way to add items in batches and validate at the end, or else give up on contiguous storage and switch to a container which maintains order, such as set or multiset.
A heap does not maintain order in the underlying container, but is good for a priority queue or similar, because it makes removal of the maximum element fast. You say you want to maintain the vector in order, but if you never actually iterate over the whole collection in order then you might not need it to be fully ordered, and that's when a heap is useful.
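For the heap route, a minimal sketch using the standard heap algorithms on a plain vector of `int` (the two helper names are made up). It shows why this suits a priority queue but not code that expects the vector itself to be sorted.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: add a value to a vector that is maintained as a max-heap.
void heap_add(std::vector<int>& heap, int value) {
    heap.push_back(value);
    std::push_heap(heap.begin(), heap.end());  // O(log n) sift-up
}

// Hypothetical helper: remove and return the largest value.
// Precondition: heap is non-empty and satisfies the heap property.
int heap_pop_max(std::vector<int>& heap) {
    std::pop_heap(heap.begin(), heap.end());   // moves the max to the back
    int top = heap.back();
    heap.pop_back();
    return top;
}
```

The vector only guarantees that `front()` is the maximum; the remaining elements are in heap order, not sorted order, so a function like ValidateStuff that expects a sorted vector would not be satisfied.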
According to item 23 of Meyers' Effective STL, you should use a sorted vector if your application uses its data structures in 3 phases. From the book, they are:
Setup. Create a new data structure by inserting lots of elements into it. During this phase, almost all operations are insertions and erasures. Lookups are rare or nonexistent.
Lookup. Consult the data structure to find specific pieces of information. During this phase, almost all operations are lookups. Insertions and erasures are rare or nonexistent. There are so many lookups that the performance of this phase makes the performance of the other phases incidental.
Reorganize. Modify the contents of the data structure, perhaps by erasing all the current data and inserting new data in its place. Behaviorally, this phase is equivalent to phase 1. Once this phase is completed, the application returns to phase 2.
If your use of your data structure resembles this, you should use a sorted vector, and then use binary_search as mentioned. If not, a typical associative container should do, meaning a set, multiset, map or multimap, as those structures are ordered by default.
Why not just use a binary search to find where to insert the new element? Then you will insert exactly into the required position.
If you need to insert a lot of elements into a sorted sequence, use std::merge, potentially sorting the new elements first:
// needs <algorithm>, <iterator> and <vector>
void add( std::vector<Foo> & oldFoos, const std::vector<Foo> & newFoos ) {
    std::vector<Foo> merged;
    // precondition: oldFoos _and newFoos_ are sorted
    merged.reserve( oldFoos.size() + newFoos.size() ); // only for std::vector
    std::merge( oldFoos.begin(), oldFoos.end(),
                newFoos.begin(), newFoos.end(),
                std::back_inserter( merged ) );
    // apply std::unique, if wanted, here
    merged.erase( std::unique( merged.begin(), merged.end() ), merged.end() );
    oldFoos.swap( merged ); // commit changes
}
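A variation on the same batching idea, sketched here with `std::inplace_merge` instead of `std::merge`. This is a swapped-in technique rather than what the answer above uses, shown on `int` for brevity; `add_batch` is an invented name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: add an unsorted batch to an already sorted vector.
// The batch is sorted on its own, appended, and the two sorted halves are
// then merged in place.
void add_batch(std::vector<int>& sorted, std::vector<int> batch) {
    std::sort(batch.begin(), batch.end());
    const std::size_t old_size = sorted.size();
    sorted.insert(sorted.end(), batch.begin(), batch.end());
    // Recompute the middle iterator after the insert, since appending may
    // have reallocated the vector.
    auto middle = sorted.begin() + static_cast<std::ptrdiff_t>(old_size);
    std::inplace_merge(sorted.begin(), middle, sorted.end());
}
```

Unlike the version above, this keeps duplicates; whether it beats a separate merged vector is something to measure.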
Using a binary search to find the insertion location isn't going to speed up the algorithm much because it will still be O(N) to do the insertion (consider inserting at the beginning of a vector - you have to move every element down one to create the space).
A heap (a kind of tree) will be O(log(N)) to insert, which is much better performance.
See http://www.sgi.com/tech/stl/priority_queue.html
Note that a tree will still have worst case O(N) performance for insert unless it is balanced, e.g. an AVL tree.
Why not use boost::multi_index?
NOTE: boost::multi_index does not provide memory contiguity, a property of std::vectors by which elements are stored adjacent to one another in a single block of memory.
There are a few things you need to do.
You may want to consider making use of reserve() to avoid excessive reallocation of the entire vector. If you know the size it will grow to, you may gain some performance by calling reserve() yourself (rather than having the implementation reallocate automatically using its built-in heuristic).
Do a binary search to find the insertion location. Then resize and shift everything following the insertion point up by one to make room.
Consider: do you really want to use a vector? Perhaps a set or map are better.
The advantage of binary search over lower_bound is that if the insertion point is close to the end of the vector you don't have to pay the theta(n) complexity.
If you want to insert an element into the "right" position, why do you plan on using sort? Find the position using lower_bound and insert using, well, the insert method of the vector. It will still be O(N) to insert a new item.
A heap is not going to help you, because a heap is not sorted. It allows you to get at the smallest element quickly, and then quickly remove it and get the next smallest element. However, the data in a heap is not stored in sorted order, so if you have algorithms that must iterate over the data in order, it will not help.
I am afraid your description skipped too much detail, but it seems like list is just not the right container for the task. std::deque is much better suited for insertion in the middle, and you might also consider std::set. I suggest you explain why you need to keep the data sorted to get more helpful advice.
You might want to consider using a BTree or a Judy Trie.
You don't want to use contiguous memory for large collections, insertions should not take O(n) time;
You want to use at least binary insertion for single elements, multiple elements should be presorted so you can make the search boundaries smaller;
You do not want your data structure wasting memory, so nothing with left and right pointers for each data element.
As others have said, I'd probably have created a BTree out of a linked list instead of using a vector. Even if you got past the sorting issue, vectors have the problem of fully reallocating when they need to grow, assuming you don't know your maximum size beforehand.
If you are worried about a list allocating on different memory pages and causing cache related performance issues, preallocate your nodes in an array, (pool the objects) and insert these into the list.
You can add a value in your data type that denotes whether it is allocated off the heap or from a pool. This way, if you detect that your pool has run out of room, you can start allocating off the heap and throw an assert or something to yourself so you know to bump up the pool size (or make this a command line option to set).
Hope this helps, as I see you already have lots of great answers.