Binary data structure for fast searching - C++

I'm looking for a binary data structure (tree, list) that enables very fast searching. I'll only add/remove items at the beginning/end of the program, all at once, so it's going to be fixed-size; I don't really care about insertion/deletion speed. Basically what I'm looking for is a structure that provides fast searching and doesn't use much memory.
Thanks

Look at the unordered set in the Boost C++ library (Boost.Unordered). Unlike red-black trees, which are O(log n) for searching, an unordered set is based on a hash table and on average gives you O(1) search performance.
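For illustration, here is a minimal sketch of the idea (shown with boost::unordered_set; std::unordered_set in C++11 behaves the same way, and the element values are made up):

    #include <boost/unordered_set.hpp>
    #include <iostream>
    #include <string>

    int main() {
        // Build the set once, up front.
        boost::unordered_set<std::string> words;
        words.insert("alpha");
        words.insert("beta");
        words.insert("gamma");

        // Membership test is O(1) on average: hash the key, probe its bucket.
        if (words.find("beta") != words.end())
            std::cout << "found\n";
        return 0;
    }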

One container not to be overlooked is a sorted std::vector.
It definitely wins on the memory consumption, especially if you can reserve() the correct amount up front.
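A rough sketch of that approach, assuming all elements are known up front (the values here are placeholders): reserve, fill, sort once, then binary-search for lookups.

    #include <algorithm>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> data;
        data.reserve(4);            // reserve the exact amount up front
        data.push_back(42);
        data.push_back(7);
        data.push_back(19);
        data.push_back(3);

        std::sort(data.begin(), data.end());   // sort once, after all insertions

        // O(log n) lookup, contiguous memory, no per-node allocation overhead.
        bool found = std::binary_search(data.begin(), data.end(), 19);
        std::cout << (found ? "found" : "not found") << '\n';
        return 0;
    }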

So the key can be a simple type and the value a smallish structure of five pointers.
With only 50 elements it starts getting small enough that the big-O theoretical performance may be overshadowed, or at least measurably affected, by the fixed overhead of the algorithm or structure.
For example, an array or a vector with linear search is often the fastest with fewer than ten elements because of its simple structure and tight memory layout.
I would wrap the container and run real data through it with timing. Start with std::vector, go to the standard std::map, upgrade to unordered_map, and maybe even try Google's dense_hash_map or sparse_hash_map:
http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html
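A minimal timing harness along those lines might look like this (C++11; the key type and test data are placeholders for your real data):

    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <unordered_map>
    #include <vector>

    // Time how long it takes to look up every key in 'keys' within container 'c'.
    template <typename Container>
    long long time_lookups(const Container& c, const std::vector<int>& keys) {
        auto start = std::chrono::steady_clock::now();
        std::size_t hits = 0;
        for (int k : keys)
            hits += c.count(k);          // same interface for map and unordered_map
        auto stop = std::chrono::steady_clock::now();
        std::cout << "hits: " << hits << '\n';   // keep the loop from being optimized away
        return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    }

    int main() {
        std::vector<int> keys = {1, 2, 3, 42, 99};          // placeholder test data
        std::map<int, int> m = {{1, 0}, {42, 0}};
        std::unordered_map<int, int> um(m.begin(), m.end());

        std::cout << "map:           " << time_lookups(m, keys) << " us\n";
        std::cout << "unordered_map: " << time_lookups(um, keys) << " us\n";
        return 0;
    }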

One efficient (albeit a teeny bit confusing) data structure is the red-black tree.
Internally, the C++ standard library typically uses red-black trees to implement std::map - see this question

The std::map and hash map are good choices. They also have constructors to ease one-time construction.
The hash map runs the key through a hash function that returns an array index. This may be slower than a std::map, but only profiling will tell.
My preference would be std::map, as it is usually implemented as a type of binary tree.
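For example, the one-time construction mentioned above can use the iterator-range constructor; a small sketch with made-up keys and values:

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        // Existing data, gathered once at program start.
        std::vector<std::pair<std::string, int>> input = {
            {"red", 1}, {"green", 2}, {"blue", 3}
        };

        // One-shot construction of the map from the whole range.
        std::map<std::string, int> lookup(input.begin(), input.end());

        std::cout << lookup["green"] << '\n';   // O(log n) search in a balanced tree
        return 0;
    }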

The fastest tends to be a trie. I implemented one that was 3 to 15 times faster than std::unordered_map, though tries tend to use more RAM unless you store a large number of elements.

Related

What are the advantages and disadvantages of an ordered list?

I'm currently trying to weigh the advantages of an ordered list against its disadvantages, but I'm struggling to see why it matters. I know that ordered lists allow binary search, which is much more efficient than sequential search. Otherwise I'm at a loss.
Ordering your data has many costs and benefits.
You are right in that it does enable a binary search. But there are many ways to structure data so that searching for an element is fast; in the C++ standard library, std::set maintains its elements in order, while std::unordered_set doesn't, and the second is usually faster than the first. (It uses hashes to find elements, as opposed to a binary search over a semi-balanced binary tree).
Often I'll order data to calculate differences between them, or have a canonical representation. If I have two collections of data and they are unordered, checking equality is annoying (there are O(n^2) pairs to check). If the data is in a canonical order, I can do O(n) work; if I trust a hashing function to be sufficiently robust, I can even pre-hash each element, and each sub-collection of elements, and check equality in O(1) time.
I can even find differences in O(size of difference) time with carefully ordered data and a trustworthy hash.
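As a small illustration of the sorted-order part of that (the hashing trick is left out and the data is made up), equality and differences fall out of single linear passes once both collections are in a canonical order:

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main() {
        std::vector<int> a = {5, 1, 3, 2};
        std::vector<int> b = {2, 3, 1, 7};

        // Put both collections into canonical (sorted) order first.
        std::sort(a.begin(), a.end());
        std::sort(b.begin(), b.end());

        // Equality is now a single O(n) pass.
        bool equal = (a == b);

        // Elements in 'a' but not in 'b', found in one linear merge-like pass.
        std::vector<int> diff;
        std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                            std::back_inserter(diff));

        std::cout << "equal: " << equal << ", diff size: " << diff.size() << '\n';
        return 0;
    }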
Ordered data is often what humans want; an inbox ordered randomly is not very useful, while one that is ordered by when the mail arrived, or even some kind of importance, is much better UI.
Ordered data can make branch prediction easier, and other optimizations at the hardware, compiler or algorithm level easier. Here is a classic SO question about that issue; processing a sorted array is faster than an unsorted array, and the asker is surprised.
The hard part about programming is understanding what the hell is going on. Data having structure makes reasoning about it easier; there are fewer things to think about. A list being always ordered means that code that uses it can assume it is always ordered, which can make that code easier to think about and simpler, and then code that uses that code is in turn simpler and easier to think about.
The price, of course, is that almost all editing operations on an ordered list end up being more expensive, both in terms of computer time and programmer time to think about the consequences. With a std::vector, adding K elements takes O(K) time; meanwhile, a naively sorted-after-each-operation std::vector would require O(K * N lg N) time, which can be horribly slow.
Not doing something is extremely valuable, and code that doesn't do anything is easy to understand. By eliminating the "ordered" requirement, you make writing data to that storage easier to think about and easier for the computer to do.

How can I implement Python sets in another language (maybe C++)?

I want to translate some Python code that I have already written to C++ or another fast language because Python isn't quite fast enough for what I want to do. However, the code in question abuses some of the impressive features of Python sets, specifically the average O(1) membership testing, which I spam within performance-critical loops, and I am unsure of how to implement Python sets in another language.
Python's time complexity wiki page states that sets have O(1) membership testing on average and O(n) in the worst case. I tested this personally using timeit and was astonished by how blazingly fast Python sets do membership testing, even with large N. I looked at this Stack Overflow answer to see how C++ sets compare when using find operations to check whether an element is a member of a given set, and it said find is O(log n).
I hypothesize the time complexity for find is logarithmic in that C++ std library sets are implemented with some sort of binary tree. I think that because Python sets have average O(1) membership testing and worst case O(n), they are probably implemented with some sort of associative array with buckets which can just look up an element with ease and test it for some dummy value which indicates that the element is not part of the set.
The thing is, I don't want to slow down any part of my code by switching to another language (since that is the problem I'm trying to fix in the first place), so how could I implement my own version of Python sets (specifically just the fast membership testing) in another language? Does anybody know anything about how Python sets are implemented, and if not, could anyone give me any general hints to point me in the right direction?
I'm not looking for source code, just general ideas and links that will help me get started.
I have done a bit of research on Associative Arrays and I think I understand the basic idea behind their implementation but I'm unsure of their memory usage. If Python sets are indeed just really associative arrays, how can I implement them with a minimal use of memory?
Additional note: The sets in question that I want to use will have up to 50,000 elements and each element of the set will be in a large range (say [-999999999, 999999999]).
The theoretical difference between O(1) and O(log n) means very little in practice, especially when comparing two different languages. log n is small for most practical values of n, and the constant factors of each implementation are easily more significant.
C++11 has unordered_set and unordered_map now. Even if you cannot use C++11, there are always the Boost and TR1 versions, as well as older compiler extensions named hash_* instead of unordered_*.
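For the sizes mentioned in the question (up to 50,000 keys spread over a wide integer range), a minimal C++11 sketch with std::unordered_set might look like this (the keys shown are placeholders):

    #include <iostream>
    #include <unordered_set>

    int main() {
        std::unordered_set<long long> members;
        members.reserve(50000);          // avoid rehashing while filling the set

        // Placeholder data; keys can lie anywhere in [-999999999, 999999999].
        members.insert(-999999999LL);
        members.insert(123456789LL);
        members.insert(42LL);

        // Average O(1) membership test, the equivalent of Python's "x in s".
        if (members.count(123456789LL))
            std::cout << "present\n";
        return 0;
    }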
Several points. You have, as has been pointed out, std::set and std::unordered_set (the latter only in C++11, but most compilers have offered something similar as an extension for many years now). The first is implemented by some sort of balanced tree (usually a red-black tree), the second as a hash table. Which one is faster depends on the data type: the first requires some sort of ordering relationship (e.g. < if it is defined on the type, but you can define your own); the second an equivalence relationship (==, for example) and a hash function compatible with this equivalence relationship. The first is O(lg n), the second O(1) if you have a good hash function. Thus:
If comparison for order is significantly faster than hashing, std::set may actually be faster, at least for "smaller" data sets, where "smaller" depends on how large the difference is. For strings, for example, the comparison will often resolve after the first couple of characters, whereas the hash code will look at every character. In one experiment I did (many years back), with strings of 30-50 characters, I found the break-even point to be about 100,000 elements.
For some data types, simply finding a good hash function which is compatible with the type may be difficult. Python uses a hash table for its set, and if you define a type with a __hash__ function that always returns 1, it will be very, very slow. Writing a good hash function isn't always obvious.
Finally, both are node-based containers, which means they use a lot more memory than e.g. std::vector, with very poor locality. If lookup is the predominant operation, you might want to consider std::vector, keeping it sorted and using std::lower_bound for the lookup. Depending on the type, this can result in a significant speed-up, and much less memory use.

Boost flat_map container

Working on some legacy code, I am running into memory issues, due mainly (I believe) to the extensive use of STL maps (particularly “maps-of-maps”).
I am looking at Boost flat_map as a possible solution. Does anyone have any firsthand experience with flat_map, in particular with regard to improvements in speed and/or memory usage? I realize of course this can be very dependent on the types of data stored and the manner in which they are stored, but I'm still curious about folks' actual experience.
Can anyone point me to some solid examples?
As an example: there are several cases in this code of a map-of-a-map; that is, a map where the value is another map.
By replacing the “inner” map with a pair of vectors, I reduced the memory footprint 10:1 (3G to 300M). Of course this can slow down searches but for this particular case it doesn’t seem to matter much. And it involved about a day of refactoring and careful testing.
Boost’s flat_map sounds like it might be just what I need but I can’t seem to find out much about it other than the class description on the Boost web site. Looking for some firsthand feedback.
Boost's flat_map is a binary-tree-style map implementation, except that the tree is stored as a (sorted) vector of key-value pairs.
You can basically figure out the answers regarding performance (relative to a std::map) yourself, based on that fact:
Iterating the map or a large part of it should be super-fast, relatively
Lookup should typically be relatively fast
Adding or removing values is theoretically much slower, but in practice - assuming your key and value types are small and the number of map elements not very high - probably comparable in speed (or even better on small maps - often no allocation is necessary on insert)
etc.
In your case - maps-of-maps - you're going to lose some of the benefit of "flattening things out", since you'll have an outer map with a pointer to an inner map (if not more levels of indirection); but the flat map would at least help you reduce that. Also, supposing you have two levels of maps, you could arrange it so that you store all of the inner maps contiguously (either by constructing the inner maps appropriately or by instantiating them with your own allocator, a trickier affair); in that case, you could replace pointers to maps with map indices, reducing the amount of space they take up and making life easier for the compiler.
You might also want to read Boost's documentation of flat_map; and you could also just use the force and read the source (and the source of the underlying flat_tree), like I have; I don't actually have flat_map experience myself.
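For reference, a minimal usage sketch (the keys and values are arbitrary); the interface mirrors std::map, plus a reserve() since the storage is contiguous:

    #include <boost/container/flat_map.hpp>
    #include <iostream>

    int main() {
        boost::container::flat_map<int, double> prices;
        prices.reserve(3);               // contiguous storage, so capacity can be reserved

        prices[10] = 1.5;                // inserts keep the underlying vector sorted
        prices[3]  = 0.25;
        prices[7]  = 2.0;

        // Lookup is a binary search over the sorted vector.
        auto it = prices.find(7);
        if (it != prices.end())
            std::cout << it->second << '\n';

        // Iteration walks contiguous memory, which is very cache friendly.
        for (const auto& kv : prices)
            std::cout << kv.first << " -> " << kv.second << '\n';
        return 0;
    }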
I know this is an old question, but this might be of use to someone finding this question.
I found that flat_map was a big improvement in searching, lookups and iteration over large maps. The fact that the map uses contiguous data in memory also makes inserting faster than you might expect, due to great data locality. If you're doing more inserts than lookups in your map then it might not be for you.
Having said that, repeatedly inserting a random value into a sorted vector is faster than the same on a linked list because of the data locality - despite what Big O might tell you. (tested in VS2017 and G++ 4.8).

When to use merge sort and when to use quick sort?

The wikipedia article for merge sort.
The wikipedia article for quick sort.
Both articles have excellent visualizations.
Both have n*log(n) complexity.
So obviously the distribution of the data will affect the speed of the sort. My guess would be that since a comparison can just as quickly compare any two values, no matter their spread, the range of data values does not matter.
More importantly, one should consider the lateral distribution (x direction) with respect to ordering (magnitude removed).
A good test case to consider would be if the test data had some level of sorting...
It typically depends on the data structures involved. Quick sort is typically the fastest, but it doesn't guarantee O(n*log(n)); there are degenerate cases where it becomes O(n^2). Heap sort is the usual alternative; it guarantees O(n*log(n)), regardless of the initial order, but it has a much higher constant factor. It's usually used when you need a hard upper limit to the time taken. Some more recent algorithms use quick sort, but attempt to recognize when it starts to degenerate, and switch to heap sort then. Merge sort is used when the data structure doesn't support random access, since it works with pure sequential access (forward iterators, rather than random access iterators). It's used in std::list<>::sort, for example. It's also widely used for external sorting, where random access can be very, very expensive compared to sequential access. (When sorting a file which doesn't fit into memory, you might break it into chunks which fit into memory, sort these using quicksort, writing each out to a file, then merge sort the generated files.)
Mergesort is quicker when dealing with linked lists. This is because pointers can easily be changed when merging lists. It only requires one pass (O(n)) through the list.
Quicksort's in-place algorithm requires the movement (swapping) of data. While this can be very efficient for an in-memory dataset, it can be much more expensive if your dataset doesn't fit in memory. The result would be lots of I/O.
These days, there is a lot of parallelization that occurs. Parallelizing mergesort is simpler than quicksort (in-place). If not using the in-place algorithm, then the space complexity for quicksort is O(n), which is the same as mergesort.
So, to generalize, quicksort is probably more effective for datasets that fit in memory. For stuff that's larger, it's better to use mergesort.
The other general time to use mergesort over quicksort is when the data is very similar (that is, not close to being uniform). Quicksort relies on choosing a pivot. If the values are all very similar, a poor pivot is more likely to be chosen, leading to very unbalanced partitions and an O(n^2) runtime. The most straightforward example is if all the values in the list are the same.
There is a real-world sorting algorithm -- called Timsort -- that does exploit the idea that data encountered in the wild is often partially sorted.
The algorithm is derived from merge sort and insertion sort, and is used in CPython, Java 7 and Android.
See the Wikipedia article for more details.
While Java 6 and earlier versions use merge sort as the sorting algorithm, C# uses quicksort as the sorting algorithm.
Quicksort performs better than merge sort even though they are both O(n log n), because quicksort has a smaller constant factor.
Of the two, use merge sort when you need a stable sort. You can use a modified quicksort (such as introsort) when you don't, since it tends to be faster and it uses less memory.
Plain old Quicksort as described by Hoare is quite sensitive to performance-killing special cases that make it Theta(n^2), so you normally do need a modified version. That's where the data-distribution comes in, since merge sort doesn't have bad cases. Once you start modifying quicksort you can go on with all sorts of different tweaks, and introsort is one of the more effective ones. It detects on the fly whether it's in a killer case, and if so switches to heapsort.
In fact, Hoare's most basic Quicksort fails worst for already-sorted data, and so your "good test cases" with some level of sorting will kill it to some level. That fact is for curiosity only, though, since it only takes a very small tweak to avoid that, nothing like as complicated as going all the way to introsort. So it's simplistic to even bother analyzing the version that's killed by sorted data.
In practice, in C++ you'd generally use std::stable_sort and std::sort rather than worrying too much about the exact algorithm.
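In code, the choice is just which algorithm you call; here is a small sketch (the record type is made up purely to show what stability means):

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Entry {
        std::string name;
        int group;
    };

    int main() {
        std::vector<Entry> v = {{"carol", 2}, {"alice", 1}, {"bob", 2}, {"dave", 1}};
        auto by_group = [](const Entry& a, const Entry& b) { return a.group < b.group; };

        // std::sort: introsort-style, typically fastest, no stability guarantee.
        // std::sort(v.begin(), v.end(), by_group);

        // std::stable_sort: mergesort-based, keeps equal-group entries in their
        // original relative order (alice before dave, carol before bob).
        std::stable_sort(v.begin(), v.end(), by_group);

        for (const auto& e : v)
            std::cout << e.group << ' ' << e.name << '\n';
        return 0;
    }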
Remember in practice, unless you have a very large data set and/or are executing the sort many many times, it probably won't matter at all. That being said, quicksort is generally considered the 'fastest' n*log(n) sorter. See this question already asked: Quick Sort Vs Merge Sort

Is there already some std::vector based set/map implementation?

For small sets or maps, it's usually much faster to just use a sorted vector instead of the tree-based set/map, especially for something like 5-10 elements. LLVM has some classes in that spirit, but no real adapter that would provide a std::map-like interface backed by a std::vector.
Any (free) implementation of this out there?
Edit: Thanks for all the alternative ideas, but I'm really interested in a vector-based set/map. I do have specific cases where I tend to create huge amounts of sets/maps which usually contain fewer than 10 elements, and I really want to have less memory pressure. Think, for example, about the neighbor edges of a vertex in a triangle mesh; you easily wind up with 100k sets of 3-4 elements each.
I just stumbled upon your question; hope it's not too late.
I recommend a great (open source) library named Loki.
It has a vector-based implementation of an associative container that is a drop-in replacement for std::map, called AssocVector.
It offers better performance for accessing elements (and worse performance for insertions/deletions).
The library was written by Andrei Alexandrescu, author of Modern C++ Design.
It also contains some other really nifty stuff.
If you can't find anything suitable, I would just wrap a std::vector to sort() on insert and implement find() using lower_bound(). It should be straightforward, and just as efficient as a custom solution.
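A bare-bones sketch of such a wrapper (the class name and interface are purely illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Illustrative "flat set": a sorted std::vector with set-like operations.
    template <typename T>
    class VectorSet {
        std::vector<T> data_;   // kept sorted at all times
    public:
        void insert(const T& value) {
            auto pos = std::lower_bound(data_.begin(), data_.end(), value);
            if (pos == data_.end() || value < *pos)    // keep elements unique
                data_.insert(pos, value);              // O(n) shift, fine for small sets
        }
        bool contains(const T& value) const {
            auto pos = std::lower_bound(data_.begin(), data_.end(), value);
            return pos != data_.end() && !(value < *pos);   // O(log n) lookup
        }
        std::size_t size() const { return data_.size(); }
    };

    int main() {
        VectorSet<int> edges;     // e.g. the 3-4 neighbor edges of a mesh vertex
        edges.insert(7);
        edges.insert(3);
        edges.insert(7);          // duplicate, ignored
        return edges.contains(3) ? 0 : 1;
    }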
Old post, I know, but for more recent visitors, Boost's flat_set and flat_map look like what you need. See https://theboostcpplibraries.com/boost.container for more information.
I don't know any such implementation, but there are some functions that help working with sorted vectors already in STL, such as lower_bound and upper_bound.
If the set or map truly is small, the performance gained by micro-optimizing the data structure will have little to no noticeable effects. You'll save maybe one or two memory (read: cache) lookups when searching a tiny tree vs tiny vector, which in the big picture is insignificant.
Having said that, you could give hash_map a try. Lookups by key run in constant time on average.
Maybe you're looking for unordered maps and unordered sets. Try taking a look at the TR1 unordered containers that rely on hashing, or the Boost.Unordered container library. Underneath the interface, I'm not sure whether they really use std::vector, but I'd wager it's worth taking a look.