BST<string> vs vector<string> + binary search - C++

Which is better (space/time) to find certain strings:
To use a vector of strings (alphabetically ordered) and a binary search.
To use a BST of strings (also ordered alphabetically).
Or are both similar?

Both have advantages, and it is going to depend on what your usage scenario is.
A sorted vector will be more efficient if your usage scenario can be broken into phases: load everything, then sort it once, then look things up without changing anything.
A tree structure will work better for you if your scenario involves inserting, searching, and removing things at different times, and you can't break it into phases. (In this case, a vector can add overhead, since inserting in the middle is expensive.)
There's a really good discussion of this in Effective STL, and there's a sorted vector implementation in Loki.
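A minimal sketch of the phased sorted-vector approach described above (the values and names here are only illustrative):

#include <algorithm>
#include <string>
#include <vector>

int main() {
    // Phase 1: load everything.
    std::vector<std::string> words = {"pear", "apple", "orange", "banana"};

    // Phase 2: sort once.
    std::sort(words.begin(), words.end());

    // Phase 3: look things up without modifying the container.
    bool found = std::binary_search(words.begin(), words.end(),
                                    std::string("orange"));
    return found ? 0 : 1;
}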

Assuming the binary search tree is balanced (which it will be if you are using std::set), then both of these are O(n) space and O(log n) time. So theoretically they are comparable.
In practice, the vector will take up somewhat less space and thus might be slightly faster thanks to locality effects. But probably not enough to matter, and since std::set supports O(log n) insertion, O(log n) deletion, and has a straightforward interface, I would recommend std::set.
That said... If all you care about is queries and you do not need to enumerate the strings in order, std::tr1::unordered_set (or boost::unordered_set or C++0x std::unordered_set) will be much faster than either, especially if the set is large. Hash tables rock.
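For illustration, a small membership-only sketch (assuming C++11 and its std::unordered_set; the tr1/boost variants are used similarly, minus the initializer-list construction):

#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> words = {"apple", "banana", "orange"};

    // Average-case O(1) membership test; note there is no way to
    // enumerate the elements in sorted order without copying them out.
    bool found = words.count("banana") > 0;
    return found ? 0 : 1;
}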

Related

Why doesn't the C++ STL provide a more efficient std::set implementation?

Background
I'm building a performance-minded application and came across a place where I have to use std::set. It works like a charm. But then I started reading the documentation (which you can find here), and the first thing I noticed was that
Search, removal, and insertion operations have logarithmic complexity. Sets are usually implemented as red-black trees.
The search, removal and insertion complexities make perfect sense to me, as some kind of tree structure is being used (the documentation does not guarantee that it is a red-black tree). But the problem is: why should they be?
I made my own alternative to std::set, which uses a std::vector to store all the entries. Then I performed some basic benchmarks, and here are the results:
Iterations: 100000
// Insertion
VectorSet : 211464us
std::set : 1272864us
// Find/ Lookup
VectorSet : 404264us
std::set : 551464us
// Removal
VectorSet : 254321964us
std::set : 834664us
// Traversal (iterating through all 100000 elements; 100000 iterations)
VectorSet : 2464us
std::set : 4374174264us
According to these results, my implementation (VectorSet) outperformed std::set in insertions and lookups, and traversal was roughly 1,800,000 times faster. But std::set outperformed VectorSet in removal by a significant margin (which is understandable, as we are dealing with a vector).
I can justify why removal is slower in VectorSet but faster in std::set, and why std::set takes so long to iterate through the entries. Some things which affect the performance would be (correct me if I'm wrong):
Cache misses.
Pointer dereferences.
Better locality.
For the vector being slower in removal,
Finding the element.
Removal of the element.
Possible resize.
Question
As far as I can see, using a std::vector to store entries rather than a tree structure performs better in 3 out of 4 cases. And even where std::set performed better (removal), the margin is small compared to the traversal gap. In my opinion, developers use the other operations (lookups, insertions and iteration) more than removals. Even though these numbers are in the range of nanoseconds, the slightest improvement is better.
So my question is, why does std::set use a tree structure when they can use something like a vector to improve their efficiency?
Note: The container will be filled with an average of 1000 elements, will be iterated repeatedly throughout the application's lifetime, and directly affects the application's runtime.
The standard set has some guarantees that you can't provide with your implementation:
inserting/erasing doesn't invalidate other iterators/references/pointers.
inserting/erasing elements has (at most) logarithmic complexity, as opposed to linear in your implementation.
If these don't matter to you, you're welcome to use a sorted vector and binary search. The standard provides std::sort, std::vector and std::binary_search, so you are good to go. The thing to notice is that each container has a specific use case and there is no "one size fits all" container.
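For illustration, a hedged sketch of what keeping a sorted vector (the VectorSet idea) costs per insertion; sorted_insert is a hypothetical helper, not part of the question's code:

#include <algorithm>
#include <vector>

// Keep a vector sorted and unique on every insert. Finding the position is
// O(log n), but vector::insert has to shift the tail, so the whole operation
// is O(n) -- this is the linear complexity mentioned above.
void sorted_insert(std::vector<int>& v, int value) {
    auto pos = std::lower_bound(v.begin(), v.end(), value);
    if (pos == v.end() || *pos != value)
        v.insert(pos, value);
}

int main() {
    std::vector<int> v;
    for (int x : {5, 1, 3, 3, 2})
        sorted_insert(v, x);            // v ends up as {1, 2, 3, 5}
    return std::binary_search(v.begin(), v.end(), 3) ? 0 : 1;
}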
The standard also provides unordered_set, which is a hash table. It is often criticized for being slow and causing cache misses. Well, if that degrades your performance in a way you have identified as a bottleneck, go ahead and use some other hash-set implementation from another library. If you believe that you can do better, go ahead. Many projects build their own containers that are more specialized to that project. They could be faster, use less memory, or give different guarantees about iterator invalidation and/or the complexity of operations. They all solve different problems.
Another point is that profiling and benchmarking are hard. Make sure you get them right. Performance comparison is usually done at scale (with a varying number of inputs). Picking a constant and relatively small size won't tell the whole story.
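As a sketch of what measuring at scale could look like, one could time the same lookup workload at several sizes (the sizes, containers and workload here are only illustrative):

#include <algorithm>
#include <chrono>
#include <iostream>
#include <set>
#include <vector>

int main() {
    // Compare the same operation (lookup) at several sizes, not just one.
    for (std::size_t n : {1000, 10000, 100000, 1000000}) {
        std::vector<int> sorted_vec(n);
        for (std::size_t i = 0; i < n; ++i) sorted_vec[i] = static_cast<int>(i);
        std::set<int> tree_set(sorted_vec.begin(), sorted_vec.end());

        using clock = std::chrono::steady_clock;
        std::size_t hits = 0;

        auto t0 = clock::now();
        for (std::size_t i = 0; i < n; ++i)
            hits += std::binary_search(sorted_vec.begin(), sorted_vec.end(),
                                       static_cast<int>(i));
        auto t1 = clock::now();
        for (std::size_t i = 0; i < n; ++i)
            hits += tree_set.count(static_cast<int>(i));
        auto t2 = clock::now();

        auto us = [](clock::duration d) {
            return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
        };
        std::cout << "n=" << n << "  vector: " << us(t1 - t0) << "us"
                  << "  set: " << us(t2 - t1) << "us"
                  << "  (hits=" << hits << ")\n";
    }
}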

Questions about the searching and sorting algorithms

I'm doing a little research about searching and sorting algorithms in the Standard library. I couldn't find something about those questions. I hope someone can help me out. You can also send me links if you know some.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
How can I adapt the behavior of the searching and sorting algorithms to make it more efficient?
Does the searching behavior change if the data is not sorted compared to one which is sorted?
It depends. If you access your data in a vector/array by position, there's no performance improvement, and there's no need for sorting either.
Searching can be done linearly, with binary search, by key, or with a hash function.
For small (I guess something below a few dozen items) and contiguous containers (e.g. a vector), linear search can be the fastest, just because of the cache-friendly memory layout.
Binary search has O(log N) complexity, which is likely the best you can get (I'm thinking of information theory). It requires that you sort the container beforehand. It's useful for frequent searches in the same container.
A std::set (and its cousin std::map) internally uses a tree, which makes searching O(log N) as well. It's useful if you search by key rather than by some criterion of your items. The drawback is that building it (it is kept sorted at all times) is a bit slower than filling a vector and sorting it later.
A hash map or hash table uses a function to get the bucket where the item lies. The complexity is close to O(1), depending on the number of items and the hash function used (collisions).
As you see, selecting a type of container depends on how you are going to handle your data. Choose the one that fits your requirements.
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set?
std::sort changes the container, so the result is, obviously, sorted. If you need the original, unordered container, make a copy and sort the copy. Sorting the whole container at once is better than "insert-each-item-so-the-container-stays-sorted", especially with a vector (many element moves and reallocations); filling a set/map incrementally may not be that slow.
How can I adapt the behavior of the searching and sorting algorithms to make it more efficient?
It's not clear to me what you mean. But "the end justifies the means": again, choose the container that serves your data handling best.
Does the searching behavior change if the data is not sorted compared to one which is sorted?
No, it depends on the algorithm you choose. A general search with std::find is O(n); binary search with std::lower_bound is O(log n), but it only works on sorted ranges.
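A small sketch contrasting the two algorithms (illustrative values):

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v = {1, 3, 5, 7, 9};   // already sorted

    // O(n): works on any range, sorted or not.
    bool found_linear = std::find(v.begin(), v.end(), 7) != v.end();

    // O(log n): valid only because the range is sorted.
    auto it = std::lower_bound(v.begin(), v.end(), 7);
    bool found_binary = (it != v.end() && *it == 7);

    return (found_linear && found_binary) ? 0 : 1;
}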
How can I know if it is better to use std::sort() on a vector instead of maybe to copy the vector to an already sorted set? That is just an example. I hoped to find some explanations on the web which ways are the best for searching or sorting, but I didn't.
You can write a benchmark and measure. You can sort a std::vector (without duplicated elements) by copying it into a std::set, which maintains sorted order internally. std::set is typically implemented as a red-black tree and in general suffers from high memory fragmentation, in contrast to the contiguous std::vector. So it is easy to predict the result. Alexander Stepanov discusses (if I remember correctly) this particular example in his lectures available on YouTube.
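A hedged sketch of the two approaches being compared, assuming unique elements (std::set drops duplicates):

#include <algorithm>
#include <set>
#include <vector>

int main() {
    std::vector<int> original = {4, 1, 3, 2};   // no duplicates

    // Option 1: copy and sort in place (contiguous memory, good locality).
    std::vector<int> sorted_copy = original;
    std::sort(sorted_copy.begin(), sorted_copy.end());

    // Option 2: copy into a std::set, which keeps its elements ordered
    // internally (node-based tree, one allocation per element).
    std::set<int> tree(original.begin(), original.end());
    std::vector<int> via_set(tree.begin(), tree.end());

    return (sorted_copy == via_set) ? 0 : 1;    // both yield {1, 2, 3, 4}
}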

Best data structure for ordered list (performance)

I have a critical section in my application that consists of taking an (unordered) data source and then executing an algorithm on each element in order. Currently I follow this algorithm:
Read the source and put it into a std::map, using the sorting criterion as key and the info as content.
Read the map using an iterator and execute the algorithm.
I see that map may not be the best data structure, as I only need to add the data to a sorted list and then "burn" the list altogether (also, memory allocation is costly on mobile devices, so I'd prefer to do it myself).
I've done some research and I'm reading about things like B-trees and red-black trees. They may be what I am searching for, but I'll ask here if anybody knows of a data structure that is convenient for that task.
In short, I want a structure with:
fast insertion.
fast iteration (from begin to end).
everything else is not important (neither deletion nor search).
Also fast insertion is more important than fast iteration (my profiler said so :D).
Thank you everyone.
The theoretically better way to do this is to use heapsort.
However, in practice, the fastest way is to append your elements to a vector, and sort them using a quicksort.
In both cases it will take O(N log N) on average, but quicksort has lower constant factors.
There are at least two efficient solutions:
Append elements to a vector; sort the vector; scan the vector.
Insert elements into a priority_queue; drain it.
The vector has the advantage of O(N) load time (vs. O(N log N) for the priority_queue). (Note that it still takes O(N log N) overall, due to the sort).
The priority_queue has the advantage of freeing memory as you drain it. This doesn't reduce the maximum memory footprint, and is probably of negligible benefit, but it's worth trying anyway.
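A hedged sketch of both solutions, with a placeholder process() standing in for the actual per-element algorithm:

#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

void process(int /*element*/) { /* the per-element algorithm goes here */ }

int main() {
    std::vector<int> input = {42, 7, 19, 3, 23};

    // Solution 1: append everything, sort once, scan in order.
    std::vector<int> v(input.begin(), input.end());
    std::sort(v.begin(), v.end());
    for (int x : v)
        process(x);

    // Solution 2: load a priority_queue, then drain it.
    // std::greater makes it a min-heap, so top() yields elements in
    // ascending order as the queue is drained.
    std::priority_queue<int, std::vector<int>, std::greater<int>>
        pq(input.begin(), input.end());
    while (!pq.empty()) {
        process(pq.top());
        pq.pop();
    }
}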
If your keys are in a limited range of values, you might want to consider the use of Bucketsort.
I would suggest writing a skip list. It is exactly what you ask for - a sorted list with O(log(n)) insertion. It is also relatively easy to implement.

C++: container replacement for vector/deque for huge sizes

My application has containers with 100 million and more elements.
I'm on the hunt for a container which behaves - time-wise - better than std::deque (let alone std::vector) with respect to frequent insertions and deletions all over the container ... including near the middle. Access time to the n-th element does not need to be as fast as with a vector, but should definitely be better than the full traversal of a std::list (which has a huge per-element memory overhead anyway).
Elements should be treated as ordered by index (like vector, deque, list), so std::set or std::unordered_set do not work well either.
Before I sit down and code such a container myself: has anyone seen such a beast already? I'm pretty sure the STL has nothing like this, and looking at Boost I did not find something I could use, but I may be wrong.
Any hints?
There's a whole STL replacement for big data, in case your app is centered on such data:
STXXL - http://stxxl.sourceforge.net/
Edit: I was actually a bit quick to answer. 100 million is not really a large number. E.g., if each element is one byte, you could store it all in a ~96 MiB array. So for STXXL to be of any use, the element size should be significantly bigger.
I think you can get the performance characteristics that you want with a skip list:
https://en.wikipedia.org/wiki/Skip_list#Indexable_skiplist
It's the "indexable" part that you're interested in, of course -- you don't actually want the items to be sorted. So some modification is needed that I leave as an exercise.
You might find that 100 million list nodes begin to strain a 32-bit address space, but that is probably not an issue on 64 bits.
1) If the data is highly sparse, i.e. has lots of zeroes or can be expressed as such, I would highly recommend a data structure that takes advantage of that:
sparselib++ for matrices
sparsehash for hash maps
2) Hash maps should do O(1) for all the operations you describe and the sparsehash implementation I mentioned earlier is particularly space-efficient; it also includes a sparsetable type which is a bit more low-level and can be used in place of an array.
3) If the strict ordering is not that important (it probably is, because you mentioned elements should be treated ordered by index), you can swap the elements you want to erase to the end of the vector and then resize to do removal in O(1). Insertion would just be push_back.
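A small sketch of the swap-with-last trick from point 3 (unordered_erase is a hypothetical helper name):

#include <cstddef>
#include <utility>
#include <vector>

// Erase the element at index i in O(1) by giving up on ordering:
// move the last element into the gap, then drop the tail.
void unordered_erase(std::vector<int>& v, std::size_t i) {
    std::swap(v[i], v.back());
    v.pop_back();               // no shifting of the remaining elements
}

int main() {
    std::vector<int> v = {10, 20, 30, 40};
    unordered_erase(v, 1);      // v is now {10, 40, 30}
    return v.size() == 3 ? 0 : 1;
}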
Try a hash map. The STL has several, all with the unordered naming prefix, such as unordered_map, etc. They have constant-time insertion and lookup given a good hashing algorithm. With your 'huge' data set, a hash map would most likely cover your needs. Making a slight change to the application to cover the differences in the interfaces is trivial.

C++ container for checking whether ordered data is in a collection

I have data that is a set of ordered ints
[0] = 12345
[1] = 12346
[2] = 12454
etc.
I need to check whether a value is in the collection in C++. What container will have the lowest complexity upon retrieval? In this case, the data does not grow after initialization. In C# I would use a Dictionary; in C++ I could use either a hash_map or a set. If the data were unordered, I would use Boost's unordered collections. However, do I have better options since the data is ordered? Thanks
EDIT: The size of the collection is a couple of hundred items
Just to expand a bit on what has already been said.
Sorted Containers
The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees in the STL versions I have looked at) because of the requirements on insertion, retrieval and deletion operations (and notably because of the iterator invalidation requirements).
However, because of the immutability, as you suspected, there are other candidates, not the least of them being array-like containers. They have a few advantages here:
minimal overhead (in terms of memory)
contiguity of memory, and thus cache locality
Several "Random Access Containers" are available here:
Boost.Array
std::vector
std::deque
So the only thing you actually need to do can be broken down into 2 steps:
push all your values into the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity
Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.
If you don't want to reinvent the wheel, you can also check Alexandrescu's AssocVector (from the Loki library). Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:
because it's faster for small datasets
because it can be faster for frozen datasets
Unsorted Containers
Actually, if you really don't care about order and your collection is kind of big, then an unordered_set will be faster, especially because integers are trivial to hash: size_t hash_method(int i) { return i; }.
This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unsorted containers will search over the "collisions" list of a given hash in linear time.
Conclusion
Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.
Unfortunately we can't really help more than that, because it heavily depends on the size of the dataset and the distribution of its elements.
If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).
Use a sorted std::vector, and use a std::binary_search to search it.
Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.
If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.
I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.
Both std::unordered_set and std::set add significant memory overhead - and even though unordered_set provides O(1) lookup, the O(log n) binary search will probably outperform it given that the data is stored contiguously (no pointer chasing, less chance of a page fault, etc.) and you don't need to calculate a hash function.
If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.
Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)
There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.
So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.
So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.