Faster data structure than unordered_map? - c++

I am looking for a data structure that is faster than C++'s unordered_map in my scenario.
I am storing unsorted, unique C-strings (char *) as keys (map.first) and integers as values (map.second). I can use around 10 MB of memory for this data structure. Before I add a new item I need to check whether it already exists, so I am doing a ton of searches and a lot of inserts. The data structure will usually contain few items (< 500) and is then destroyed as a whole, so I don't need to delete individual items.
I implemented my own AVL self-balancing tree (which seemed like a good fit for my case), but it was actually slower than std::unordered_map.
Do you know any data structure better than unordered_map in my case?

A good answer to this would be a combination of linear lookup and binary search.
Basically, keep a sorted vector of items that you can binary search. This has fantastic cache locality and will probably be quicker for the sizes you're looking at. When you need to insert, just push the item onto a separate, unsorted vector. On the next search, do a linear search of the unsorted vector and a binary search of the sorted vector. When the unsorted vector gets big enough (say 10 elements, but profiling will help here), append its contents to the back of the sorted vector, re-sort it, and clear the unsorted vector.
This doesn't have the best complexity guarantees, but it will likely be faster on modern hardware for the sizes you're looking at (linear memory accesses are FAST and tend to beat trees/lists until the collection gets quite large).
Sorting the unsorted vector and then merging it into the sorted one would give a bit of a speed increase, at the cost of more complex code.
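A minimal sketch of this scheme, using std::string keys for simplicity (the class name and the flush threshold are illustrative, not from the answer):

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hybrid lookup structure: a sorted vector searched with binary search,
// plus a small unsorted buffer searched linearly.
class HybridSet {
    std::vector<std::string> sorted_;   // kept sorted at all times
    std::vector<std::string> pending_;  // recent inserts, unsorted
    static constexpr std::size_t kFlushAt = 10;  // tune by profiling

public:
    bool contains(const std::string& key) const {
        // linear scan of the small unsorted buffer...
        if (std::find(pending_.begin(), pending_.end(), key) != pending_.end())
            return true;
        // ...plus binary search of the sorted bulk
        return std::binary_search(sorted_.begin(), sorted_.end(), key);
    }

    void insert(const std::string& key) {
        if (contains(key)) return;  // enforce uniqueness
        pending_.push_back(key);
        if (pending_.size() >= kFlushAt) {
            // append the buffer and re-sort the whole vector
            sorted_.insert(sorted_.end(), pending_.begin(), pending_.end());
            std::sort(sorted_.begin(), sorted_.end());
            pending_.clear();
        }
    }
};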

If memory really doesn't matter, you can create a huge vector<bool> and store whether a given value has already been inserted.
For the underlying idea, have a look at counting sort.
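A sketch of the presence-bitmap idea, assuming the values are integers in a known range [0, kMaxValue) (the bound here is an illustrative assumption):

#include <cstddef>
#include <vector>

// Presence bitmap: one flag per possible value.
constexpr std::size_t kMaxValue = 10'000'000;  // ~1.25 MB as vector<bool>
std::vector<bool> seen(kMaxValue, false);

bool insert_if_absent(std::size_t value) {
    if (seen[value]) return false;  // already present
    seen[value] = true;             // mark as inserted
    return true;
}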

It sounds like your use case calls for a set rather than a map. Do you really need a map for some reason that isn't clear in the question? If not, an unordered_set would be a better choice, and, if you are dealing with a small enough range of values, a vector<bool> as suggested by Thomas Sparber.
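One caveat if you go this route with the char * keys from the question: std::unordered_set<const char*> would hash the pointer value, not the characters, so you need custom hash and equality functors. A sketch:

#include <cstring>
#include <functional>
#include <string_view>
#include <unordered_set>

// Hash the characters, not the pointer.
struct CStrHash {
    std::size_t operator()(const char* s) const {
        return std::hash<std::string_view>{}(s);
    }
};
// Compare the characters, not the pointer.
struct CStrEq {
    bool operator()(const char* a, const char* b) const {
        return std::strcmp(a, b) == 0;
    }
};

// Note: the set stores only pointers; the caller must keep the
// pointed-to strings alive for the lifetime of the set.
std::unordered_set<const char*, CStrHash, CStrEq> keys;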

Related

std::list and std::vector - Best of both worlds?

By vector vs. list in STL:
std::vector: Insertions at the end are constant, amortized time, but insertions elsewhere are a costly O(n).
std::list: You cannot randomly access elements, so getting at a particular element in the list can be expensive.
I need a container in which you can access the element at any index in O(1) time, and also insert/remove an element at any index in O(1) time. It must also be able to manage thousands of entries. Is there such a container?
Edit: If not O(1), some X << O(n)?
There's a theoretical result that says that any data structure representing an ordered list cannot have all of insert, lookup by index, remove, and update take time better than O(log n / log log n), so no such data structure exists.
There are data structures that get pretty close to this, though. For example, an order statistics tree lets you do insertions, deletions, lookups, and updates anywhere in the list in time O(log n) apiece. These are reasonably good in practice, and you may be able to find an implementation online.
Depending on your specific application, there may be alternative data structures that are more tailored toward your needs. For example, if you only care about finding the smallest/biggest element at each point in time, then a data structure like a Fibonacci heap might fit the bill. (Fibonacci heaps are usually slower in practice than a regular binary heap, but the related pairing heap tends to run extremely quickly.) If you're frequently updating ranges of elements by adding or subtracting from them, then a Fenwick tree might be a better call.
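For the range-update case mentioned above, a Fenwick tree is compact enough to sketch here (range add, point query; 1-indexed; illustrative only):

#include <cstddef>
#include <vector>

// Fenwick (binary indexed) tree over a difference array: add v to every
// element in [l, r], then read back a single element. Indices are 1-based.
class Fenwick {
    std::vector<long long> t_;
public:
    explicit Fenwick(std::size_t n) : t_(n + 1, 0) {}

    void addSuffix(std::size_t i, long long v) {  // add v to positions [i, n]
        for (; i < t_.size(); i += i & (~i + 1)) t_[i] += v;
    }
    void addRange(std::size_t l, std::size_t r, long long v) {
        addSuffix(l, v);
        addSuffix(r + 1, -v);  // cancels past r; no-op when r is the last index
    }
    long long pointQuery(std::size_t i) const {   // current value at index i
        long long s = 0;
        for (; i > 0; i -= i & (~i + 1)) s += t_[i];
        return s;
    }
};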
Hope this helps!
Look at a couple of data structures.
The Rope
Tree of arrays. The tree is sorted by array index for fast index search.
B+Tree
Sorted tree of sorted arrays. This thing is used by almost every database ever.
Neither one is O(1) because that's impossible. But they are pretty good.

questions about the searching and sorting algorithms

I'm doing a little research about searching and sorting algorithms in the Standard library. I couldn't find something about those questions. I hope someone can help me out. You can also send me links if you know some.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
How can I adapt the behavior of the searching and sorting algorithms to make it more efficient?
Does the searching behavior change if the data is not sorted compared to one which is sorted?
It depends. If you access your data in a vector/array by position, there is no performance difference, and no need for sorting either.
Searching can be done linearly, by binary search, by key lookup, or via a hash function.
For small containers (I guess something below a few dozen items) that are contiguous (e.g. a vector), linear search can be the fastest, simply because of the cache-friendly memory layout.
Binary search has O(log N) complexity, which is likely the best you can get for comparison-based searching (an information-theoretic bound). It requires that you sort the container first, so it pays off for frequent searches in the same container.
A std::set (and its cousin std::map) internally uses a tree, which makes searching O(log N) as well. It is useful if you search by key instead of by some criterion of your items. The drawback is that it is a bit slower to build (it must stay sorted at all times) than filling a vector and sorting it afterwards.
A hashmap or hashtable uses a function to find the bucket where an item lies. The complexity is close to O(1), depending on the number of items and the hash function used (collisions are the issue).
As you see, selecting a type of container depends on how you are going to handle your data. Choose the one that fits your requirements.
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set?
std::sort changes the container, so the result is, obviously, sorted. If you need the original, unordered container, make a copy and sort the copy. Sorting the whole container at once is better than insert-item-so-the-container-stays-sorted for every item, especially with a vector (many memory reallocations), although filling a set/map may not be that slow.
How can I adapt the behavior of the searching and sorting algorithms to make it more efficient?
It's not clear to me what you mean, but "the end justifies the means": again, choose the container that serves your data handling best.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
That depends on the algorithm you choose. A general search with std::find is O(n); a binary search with std::lower_bound is O(log n), but it works only on sorted ranges.
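For illustration (a sketch, not from the answer):

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v = {2, 3, 5, 7, 11, 13};  // already sorted

    // O(n): std::find works on any range, sorted or not
    bool linear = std::find(v.begin(), v.end(), 7) != v.end();

    // O(log n): valid only because v is sorted
    bool binary = std::binary_search(v.begin(), v.end(), 7);

    // std::lower_bound also yields the position (first element >= 7)
    auto it = std::lower_bound(v.begin(), v.end(), 7);
    return (linear && binary && it != v.end()) ? 0 : 1;
}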
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
You can write a benchmark and measure. You can sort a std::vector (without duplicated elements) by copying it into a std::set, which maintains sorted order internally. std::set is typically implemented as a red-black tree and generally suffers from memory fragmentation, in contrast to the contiguous std::vector, so it is easy to predict the result. Alexander Stepanov discusses (if I remember correctly) this particular example in his lectures available on YouTube.
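A minimal measurement sketch along those lines (a real benchmark needs warm-up, repetitions, and care that the optimizer doesn't elide the work):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <set>
#include <vector>

int main() {
    std::mt19937 rng(42);
    std::vector<int> data(1'000'000);
    for (auto& x : data) x = static_cast<int>(rng());

    auto t0 = std::chrono::steady_clock::now();
    std::vector<int> v = data;           // copy, then sort in place
    std::sort(v.begin(), v.end());
    auto t1 = std::chrono::steady_clock::now();
    std::set<int> s(data.begin(), data.end());  // note: drops duplicates
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("sort vector: %.1f ms, fill set: %.1f ms\n",
                ms(t1 - t0).count(), ms(t2 - t1).count());
}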

Stream of Integers arriving at specified interval need to look sorted

Interview question: there is a stream of integers that arrives at specified intervals (say every 20 sec). Which STL container would you use to store them so that the integers appear sorted? My reply was map/set when there are no duplicates, or multimap/multiset when there are. Is there a better answer?
Use a multiset if you want to preserve duplicates. If you don't want to preserve duplicates, use a set.
If it's only being updated every 20 seconds, it probably doesn't matter a whole lot (unless it goes for so long that the set of integers becomes tremendously huge).
If you had data coming in a lot faster, there are alternatives that might be worth considering. One would be to use a couple of vectors. As data arrives, just push it onto one of the vectors. When you need to do an in-order traversal, sort that newly arrived data, and merge with the other vector of existing (already-sorted data). That'll give you results in order, which you can then write out to another vector, and start the same cycle again.
The big advantage here is that you're dealing with contiguous data instead of individually allocated nodes. Even with a possibility of three vectors in use at a time, your total memory usage is likely to be about equal (or possibly even less than) that of using a set or multiset.
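A sketch of that two-vector scheme using std::inplace_merge (names are illustrative):

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<int> sorted_data;  // already sorted
std::vector<int> incoming;     // raw arrivals, unsorted

// Called whenever an in-order traversal is needed.
void consolidate() {
    std::sort(incoming.begin(), incoming.end());
    auto mid = static_cast<std::ptrdiff_t>(sorted_data.size());
    sorted_data.insert(sorted_data.end(), incoming.begin(), incoming.end());
    // merge the two sorted runs [begin, mid) and [mid, end) in place
    std::inplace_merge(sorted_data.begin(), sorted_data.begin() + mid,
                       sorted_data.end());
    incoming.clear();
}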
Another possibility to consider (that's a bit of a hybrid between the two) would be something like a B+ tree. This is still a tree, so you can do in-order insertions with logarithmic complexity, but you have all the data in the leaf nodes (which are fairly large) so you get at least a reasonable amount of contiguous access as well.
To maintain a sorted list of integers streaming I would use std::priority_queue with any underlying container (vector or deque depending on the particular use).
You can keep push() ing to the priority_queue and use top() and pop() to retrieve in the sorted order.
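Note that std::priority_queue is a max-heap by default; to retrieve in ascending order, supply std::greater. Also, retrieval via top()/pop() is destructive. A sketch:

#include <functional>
#include <queue>
#include <vector>

void demo() {
    // min-heap: std::greater puts the smallest element on top
    std::priority_queue<int, std::vector<int>, std::greater<int>> pq;
    for (int x : {42, 7, 19}) pq.push(x);
    while (!pq.empty()) {
        int next = pq.top();  // smallest remaining: 7, then 19, then 42
        pq.pop();
        (void)next;  // process next here
    }
}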
The answer should be std::set. std::map<key, value> is for when the data comes as <key, value> pairs and needs to be sorted according to the key.
Likewise, if you have to allow duplicates, use std::multiset or std::multimap according to the type of data.

Best data structure for ordered list (performance)

I have a critical section in my application that consists of taking a data source (unordered) and then executing an algorithm on each element in order. Currently I do the following:
Read the source and put it into a std::map, using the sort field as the key and the payload as the content.
Read the map using an iterator and execute the algorithm.
I see that map may not be the best data structure, as I only need to add the data to a sorted list and then "burn" the list altogether (also, memory allocation is costly on mobile devices, so I'd prefer to do it myself).
I've done some research and I'm reading about things like B-trees and red-black trees. They may be what I am searching for, but I'll ask here in case anybody knows of a data structure that is convenient for this task.
In short, I want a structure with:
fast insertion.
fast iteration (from begin to end).
everything else is not important (neither deletion nor search).
Also fast insertion is more important than fast iteration (my profiler said so :D).
Thank you everyone.
In theory, the better way to do this is heapsort.
In practice, however, the fastest way is to append your elements to a vector and sort them with quicksort.
Both take O(N log N) on average, but quicksort has lower constant factors.
There are at least two efficient solutions:
Append elements to a vector; sort the vector; scan the vector.
Insert elements into a priority_queue; drain it.
The vector has the advantage of O(N) load time (vs. O(N log N) for the priority_queue). (Note that it still takes O(N log N) overall, due to the sort).
The priority_queue has the advantage of freeing memory as you drain it. This doesn't reduce the maximum memory footprint, and is probably of negligible benefit, but it's worth trying anyway.
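Sketches of both options (function names are illustrative):

#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

// Option 1: load, sort once, scan in order.
// Takes the vector by value so we can sort it in place.
void viaVector(std::vector<int> items) {
    std::sort(items.begin(), items.end());
    for (int x : items) { /* run the algorithm on x */ (void)x; }
}

// Option 2: load into a min-heap, then drain it in ascending order,
// freeing elements as you go.
void viaPriorityQueue(const std::vector<int>& items) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> pq(
        items.begin(), items.end());
    while (!pq.empty()) {
        int x = pq.top();  // smallest remaining
        pq.pop();
        (void)x;  // run the algorithm on x
    }
}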
If your keys are in a limited range of values, you might want to consider the use of Bucketsort.
I would suggest writing a skip list. It is exactly what you ask for - a sorted list with O(log(n)) insertion. It is also relatively easy to implement.

Should I use an insertion sort or construct a heap to improve performance?

We have large (100,000+ elements) ordered vectors of structs (operator < overloaded to provide ordering):
std::vector<MyType> vectorMyTypes;
std::sort(vectorMyTypes.begin(), vectorMyTypes.end());
My problem is that we're seeing performance problems when adding new elements to these vectors while preserving sort order. At the moment we're doing something like:
for ( /* a very large set */ )
{
    vectorMyTypes.push_back(newType);
    std::sort(vectorMyTypes.begin(), vectorMyTypes.end());
    ...
    ValidateStuff(vectorMyTypes); // this method expects the vector to be ordered
}
This isn't exactly what our code looks like, since I know this example could be optimised in different ways; however, it gives you an idea of how performance becomes a problem, because I'm sorting after every push_back.
I think I essentially have two options to improve performance:
Use a (hand crafted?) insertion sort instead of std::sort to improve the sort performance (insertion sorts on a partially sorted vector are blindingly quick)
Create a heap by using std::make_heap and std::push_heap to maintain the sort order
My questions are:
Should I implement an insertion sort? Is there something in Boost that could help me here?
Should I consider using a heap? How would I do this?
Edit:
Thanks for all your responses. I understand that the example I gave was far from optimal and it doesn't fully represent what I have in my code right now. It was simply there to illustrate the performance bottleneck I was experiencing - perhaps that's why this question isn't seeing many up-votes :)
Many thanks to you Steve, it's often the simplest answers that are the best, and perhaps it was my over analysis of the problem that blinded me to perhaps the most obvious solution. I do like the neat method you outlined to insert directly into a pre-ordered vector.
As I've commented, I'm constrained to using vectors right now, so std::set, std::map, etc aren't an option.
Ordered insertion doesn't need boost:
vectorMyTypes.insert(
    std::upper_bound(vectorMyTypes.begin(), vectorMyTypes.end(), newType),
    newType);
upper_bound provides a valid insertion point provided that the vector is sorted to start with, so as long as you only ever insert elements in their correct place, you're done. I originally said lower_bound, but if the vector contains multiple equal elements, then upper_bound selects the insertion point which requires less work.
This does have to copy O(n) elements, but you say insertion sort is "blindingly fast", and this is faster. If it's not fast enough, you have to find a way to add items in batches and validate at the end, or else give up on contiguous storage and switch to a container which maintains order, such as set or multiset.
A heap does not maintain order in the underlying container, but is good for a priority queue or similar, because it makes removal of the maximum element fast. You say you want to maintain the vector in order, but if you never actually iterate over the whole collection in order then you might not need it to be fully ordered, and that's when a heap is useful.
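For completeness, this is how option 2 from the question would look with the standard heap algorithms; a sketch only, since as noted the vector is then heap-ordered, not sorted:

#include <algorithm>
#include <vector>

std::vector<int> heap;  // vector managed as a max-heap

void addItem(int x) {
    heap.push_back(x);
    std::push_heap(heap.begin(), heap.end());  // O(log n) sift-up
}

int removeMax() {  // precondition: heap is not empty
    std::pop_heap(heap.begin(), heap.end());   // moves the max to the back
    int top = heap.back();
    heap.pop_back();
    return top;
}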
According to Item 23 of Meyers' Effective STL, you should use a sorted vector if your application uses its data structures in three phases. From the book, they are:
Setup. Create a new data structure by inserting lots of elements into it. During this phase, almost all operations are insertions and erasures; lookups are rare or nonexistent.
Lookup. Consult the data structure to find specific pieces of information. During this phase, almost all operations are lookups; insertions and erasures are rare or nonexistent. There are so many lookups that the performance of this phase makes the performance of the other phases incidental.
Reorganize. Modify the contents of the data structure, perhaps by erasing all the current data and inserting new data in its place. Behaviorally, this phase is equivalent to phase 1. Once this phase is completed, the application returns to phase 2.
If your use of your data structure resembles this, use a sorted vector and then a binary_search as mentioned. If not, a typical associative container should do it: a set, multiset, map, or multimap, as those structures are sorted by default.
Why not just use a binary search to find where to insert the new element? Then you will insert exactly into the required position.
If you need to insert a lot of elements into a sorted sequence, use std::merge, potentially sorting the new elements first:
#include <algorithm>  // std::merge, std::unique
#include <vector>

void add( std::vector<Foo> & oldFoos, const std::vector<Foo> & newFoos ) {
    std::vector<Foo> merged;
    // precondition: oldFoos _and newFoos_ are sorted
    merged.reserve( oldFoos.size() + newFoos.size() ); // only for std::vector
    std::merge( oldFoos.begin(), oldFoos.end(),
                newFoos.begin(), newFoos.end(),
                std::back_inserter( merged ) );
    // apply std::unique, if wanted, here
    merged.erase( std::unique( merged.begin(), merged.end() ), merged.end() );
    oldFoos.swap( merged ); // commit changes
}
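Typical usage, assuming the incoming batch starts out unsorted (names mirror the snippet above):

std::sort( newFoos.begin(), newFoos.end() );  // establish the precondition
add( oldFoos, newFoos );  // oldFoos is now merged and still sorted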
Using a binary search to find the insertion location isn't going to speed up the algorithm much because it will still be O(N) to do the insertion (consider inserting at the beginning of a vector - you have to move every element down one to create the space).
A tree (or a heap, as used by std::priority_queue) will be O(log(N)) to insert: much better performance.
See http://www.sgi.com/tech/stl/priority_queue.html
Note that a tree will still have worst case O(N) performance for insert unless it is balanced, e.g. an AVL tree.
Why not use boost::multi_index?
NOTE: boost::multi_index does not provide memory contiguity, a property of std::vectors by which elements are stored adjacent to one another in a single block of memory.
There are a few things you need to do.
You may want to consider using reserve() to avoid excessive reallocation of the entire vector. If you know the size it will grow to, you may gain some performance by calling reserve() yourself (rather than having the implementation do it automatically using its built-in growth heuristic).
Do a binary search to find the insertion location. Then resize and shift everything following the insertion point up by one to make room.
Consider: do you really want to use a vector? Perhaps a set or map are better.
The advantage of a hand-rolled binary search over a plain std::lower_bound is that you can probe near the end of the vector first; if the insertion point is close to the end, you avoid most of the search work.
If you want to insert an element into the "right" position, why do you plan on using sort? Find the position using lower_bound and insert with the vector's insert method. That will still be O(N) per inserted item.
A heap is not going to help you, because a heap is not sorted. It lets you get at the smallest element quickly, then quickly remove it and get the next smallest. However, the data in a heap is not stored in sorted order, so if you have algorithms that must iterate over the data in order, it will not help.
I am afraid your description skipped too much detail, but it seems that a flat list is just not the right structure for the task. std::deque is much better suited for insertion in the middle, and you might also consider std::set. I suggest you explain why you need to keep the data sorted, to get more helpful advice.
You might want to consider using a BTree or a Judy Trie.
You don't want to use contiguous memory for large collections, insertions should not take O(n) time;
You want to use at least binary insertion for single elements, multiple elements should be presorted so you can make the search boundaries smaller;
You do not want your data structure wasting memory, so nothing with left and right pointers for each data element.
As others have said, I'd probably have created a BTree out of a linked list instead of using a vector. Even if you get past the sorting issue, vectors have the problem of fully reallocating when they need to grow, assuming you don't know your maximum size beforehand.
If you are worried about a list allocating on different memory pages and causing cache related performance issues, preallocate your nodes in an array, (pool the objects) and insert these into the list.
You can add a flag to your data type that denotes whether it was allocated from the heap or from the pool. That way, if you detect that your pool has run out of room, you can start allocating from the heap and throw an assert or something so you know to bump up the pool size (or make this a command-line option).
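A minimal sketch of such a pool (names illustrative; the heap-fallback flag described above is omitted for brevity):

#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-size node pool: nodes live contiguously in one array, so list
// traversal stays reasonably cache-friendly.
struct Node {
    int value;
    Node* next;
};

class NodePool {
    std::vector<Node> storage_;  // preallocated once, never reallocates
    std::size_t used_ = 0;
public:
    explicit NodePool(std::size_t capacity) { storage_.resize(capacity); }

    Node* allocate(int value) {
        assert(used_ < storage_.size() && "bump up the pool size");
        Node* n = &storage_[used_++];
        n->value = value;
        n->next = nullptr;
        return n;
    }
};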
Hope this helps, as I see you already have lots of great answers.