I want to translate some Python code that I have already written to C++ or another fast language, because Python isn't quite fast enough for what I want to do. However, the code in question makes heavy use of one of the impressive features of Python sets, specifically the average O(1) membership testing, which I hammer inside performance-critical loops, and I am unsure of how to implement Python sets in another language.
Python's Time Complexity wiki page states that sets have O(1) membership testing on average and O(n) in the worst case. I tested this personally using timeit and was astonished by how blazingly fast Python sets do membership testing, even with large N. I looked at this Stack Overflow answer to see how C++ sets compare when using find to test whether an element is a member of a given set, and it says that find is O(log n).
My hypothesis is that find is logarithmic because C++ standard library sets are implemented with some sort of binary tree. Because Python sets have average O(1) membership testing and worst-case O(n), I think they are probably implemented as some sort of associative array with buckets, which can look an element up directly and compare it against a dummy value that indicates the element is not part of the set.
The thing is, I don't want to slow down any part of my code by switching to another language (since that slowness is the problem I'm trying to fix in the first place), so how could I implement my own version of Python sets (specifically just the fast membership testing) in another language? Does anybody know anything about how Python sets are implemented, and if not, could anyone give me any general hints to point me in the right direction?
I'm not looking for source code, just general ideas and links that will help me get started.
I have done a bit of research on associative arrays and I think I understand the basic idea behind their implementation, but I'm unsure of their memory usage. If Python sets really are just associative arrays, how can I implement them with minimal use of memory?
Additional note: The sets in question that I want to use will have up to 50,000 elements and each element of the set will be in a large range (say [-999999999, 999999999]).
The theoretical difference between O(1) and O(log n) means very little in practice, especially when comparing two different languages. log n is small for most practical values of n, and the constant factors of each implementation are easily more significant.
C++11 has unordered_set and unordered_map now. Even if you cannot use C++11, there are always the Boost versions and the TR1 versions, as well as older compiler extensions named hash_* instead of unordered_*.
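For illustration, here is a minimal sketch (my own, not from the answer above) of what the Python-style membership test might look like with std::unordered_set; the values and the reserve size are just placeholders taken from the question's stated range and element count:

```cpp
#include <iostream>
#include <unordered_set>

int main() {
    // Hash-based set: average O(1) membership tests, like Python's set.
    std::unordered_set<int> s = {42, -999999999, 999999999};
    s.reserve(50000);  // optional: avoid rehashing if the final size is roughly known

    int x = 42;
    if (s.count(x) != 0) {  // equivalent to `x in s` in Python
        std::cout << x << " is in the set\n";
    }
    return 0;
}
```

With a decent hash function and a reserve() sized to the expected element count, lookups should stay close to O(1) on average.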
Several points: you have, as has been pointed out, std::set and std::unordered_set (the latter only in C++11, but most compilers have offered something similar as an extension for many years now). The first is implemented with some sort of balanced tree (usually a red-black tree), the second as a hash table. Which one is faster depends on the data type: the first requires some sort of ordering relationship (e.g. <, if it is defined on the type, but you can define your own); the second requires an equivalence relationship (==, for example) and a hash function compatible with that equivalence relationship. The first is O(lg n), the second O(1) if you have a good hash function. Thus:
If comparison for order is significantly faster than hashing, std::set may actually be faster, at least for "smaller" data sets, where "smaller" depends on how large the difference is. For strings, for example, the comparison will often resolve after the first couple of characters, whereas the hash code will look at every character. In one experiment I did (many years back), with strings of 30-50 characters, I found the break-even point to be about 100,000 elements.
For some data types, simply finding a good hash function which is compatible with the type may be difficult. Python uses a hash table for its set, and if you define a type with a __hash__ function that always returns 1, it will be very, very slow. Writing a good hash function isn't always obvious.
Finally, both are node-based containers, which means they use a lot more memory than e.g. std::vector, with very poor locality. If lookup is the predominant operation, you might want to consider std::vector, keeping it sorted and using std::lower_bound for the lookup. Depending on the type, this can result in a significant speed-up and much less memory use.
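As a rough illustration of that last suggestion (a sketch, not the answerer's code), a sorted std::vector with std::lower_bound gives O(log n) membership tests with contiguous storage:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// Membership test on a sorted std::vector: O(log n) comparisons,
// but contiguous storage and no per-element allocation.
bool contains(const std::vector<int>& sorted, int value) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), value);
    return it != sorted.end() && *it == value;
}

int main() {
    std::vector<int> v = {5, 1, 9, 3};
    std::sort(v.begin(), v.end());  // sort once, then only look up
    std::cout << contains(v, 9) << ' ' << contains(v, 4) << '\n';  // prints: 1 0
    return 0;
}
```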
I'm currently trying to figure out the advantages and disadvantages of keeping a list ordered, but I'm struggling to see why it matters. I know that ordered lists allow binary search, which is much more efficient than sequential search. Otherwise I'm at a loss, though.
Ordering your data has many costs and benefits.
You are right in that it does enable a binary search. But there are many ways to structure data so that searching for an element is fast; in the C++ standard library, std::set maintains its elements in order, while std::unordered_set doesn't, and the second is usually faster than the first. (It uses hashes to find elements, as opposed to a binary search over a self-balancing binary tree.)
Often I'll order data to calculate differences between them, or have a canonical representation. If I have two collections of data and they are unordered, checking equality is annoying (there are O(n^2) pairs to check). If the data is in a canonical order, I can do O(n) work; if I trust a hashing function to be sufficiently robust, I can even pre-hash each element, and each sub-collection of elements, and check equality in O(1) time.
I can even find differences in O(size of difference) time with carefully ordered data and a trustworthy hash.
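A small sketch of that idea (illustrative only; the data and names are made up): once two collections are in a canonical sorted order, equality and differences become single linear passes:

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::vector<int> a = {3, 1, 2, 7};
    std::vector<int> b = {7, 2, 1, 3};

    // Put both collections into a canonical (sorted) order once.
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    // Equality is now a single O(n) element-wise pass.
    bool equal = (a == b);

    // Differences are a linear merge-style pass over the sorted inputs.
    std::vector<int> only_in_a;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(only_in_a));

    std::cout << equal << ' ' << only_in_a.size() << '\n';  // prints: 1 0
    return 0;
}
```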
Ordered data is often what humans want; an inbox ordered randomly is not very useful, while one that is ordered by when the mail arrived, or even some kind of importance, is much better UI.
Ordered data can make branch prediction easier, and other optimizations at the hardware, compiler or algorithm level easier. Here is a classic SO question about that issue; processing a sorted array is faster than an unsorted array, and the asker is surprised.
The hard part about programming is understanding what the hell is going on. Data having structure makes reasoning about it easier; there are fewer things to think about. A list being always ordered means that code that uses it can assume it is always ordered, which can make that code easier to think about and simpler, and then code that uses that code is in turn simpler and easier to think about.
The price, of course, is that almost all editing operations on an ordered list end up being more expensive, both in terms of computer time and programmer time to think about the consequences. With a std::vector, adding K elements takes O(K) time; meanwhile, a naively sorted-after-each-operation std::vector would require O(K * N lg N) time, which can be horribly slow.
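One common way around that cost (a sketch under the assumption that edits arrive in batches) is to append the new elements, sort just the new ones, and merge once, rather than re-sorting after every insertion:

```cpp
#include <algorithm>
#include <vector>

// Add K new elements to a vector of N already-sorted elements.
// Sorting only the new batch and merging once costs roughly
// O(K log K + N + K), instead of O(K * N log N) for naive re-sorting.
void add_batch(std::vector<int>& sorted, std::vector<int> new_items) {
    std::sort(new_items.begin(), new_items.end());                     // O(K log K)
    auto middle = sorted.insert(sorted.end(),
                                new_items.begin(), new_items.end());   // append, amortized O(K)
    std::inplace_merge(sorted.begin(), middle, sorted.end());          // O(N + K)
}
```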
Not doing something is extremely valuable, and it is easy to understand what code does when there is less for it to do. By eliminating the "ordered" requirement, you make writing data to that storage easier to think about and easier for the computer to do.
Can anyone point me to a resource that lists the Big-O complexity of basic clojure library functions such as conj, cons, etc.? I know that Big-O would vary depending on the type of the input, but still, is such a resource available? I feel uncomfortable coding something without having a rough idea of how quickly it'll run.
Here is a table composed by John Jacobsen and taken from this discussion:
Late to the party here, but I found the link in the comments of the first answer to be more definitive, so I'm reposting it here with a few modifications (that is, english->big-o):
[Table of Clojure collection operation complexities omitted here; the original answer links to its Markdown source.]
On unsorted collections, O(log_32 n) is nearly constant time: because 2^32 nodes can fit in the bit-partitioned trie nodes, this means a worst-case complexity of log_32(2^32) = 6.4. Vectors are also tries where the indices are keys.
Sorted collections utilize binary search where possible. (Yes, these are both technically O(log n); including the constant factor is for reference.)
Lists guarantee constant time for operations on the first element and O(n) for everything else.
The Wikipedia article for merge sort.
The Wikipedia article for quick sort.
Both articles have excellent visualizations.
Both have O(n log n) complexity.
So obviously the distribution of the data will affect the speed of the sort. My guess would be that since a comparison can compare any two values equally quickly, no matter their spread, the range of data values does not matter.
More importantly, one should consider the lateral distribution (x direction) with respect to ordering (magnitude removed).
A good test case to consider would be if the test data had some level of sorting...
It typically depends on the data structures involved. Quick sort is typically the fastest, but it doesn't guarantee O(n*log(n)); there are degenerate cases where it becomes O(n^2). Heap sort is the usual alternative; it guarantees O(n*log(n)) regardless of the initial order, but it has a much higher constant factor. It's usually used when you need a hard upper limit on the time taken. Some more recent algorithms use quick sort, but attempt to recognize when it starts to degenerate, and switch to heap sort then.

Merge sort is used when the data structure doesn't support random access, since it works with pure sequential access (forward iterators, rather than random access iterators). It's used in std::list<>::sort, for example. It's also widely used for external sorting, where random access can be very, very expensive compared to sequential access. (When sorting a file which doesn't fit into memory, you might break it into chunks which fit into memory, sort each of them using quicksort, write each out to a file, then merge sort the generated files.)
Mergesort is quicker when dealing with linked lists. This is because pointers can easily be relinked when merging lists; the merge step only requires one pass (O(n)) through the lists.
Quicksort's in-place algorithm requires the movement (swapping) of data. While this can be very efficient for an in-memory dataset, it can be much more expensive if your dataset doesn't fit in memory. The result would be lots of I/O.
These days, there is a lot of parallelization that occurs. Parallelizing Mergesort is simpler than Quicksort (in-place). If not using the in-place algorithm, then the space complexity for quicksort is O(n), which is the same as mergesort.
So, to generalize, quicksort is probably more effective for datasets that fit in memory. For stuff that's larger, it's better to use mergesort.
The other general time to use mergesort over quicksort is if the data is very similar (that is, not close to uniform). Quicksort relies on choosing a pivot, and if the values of the data are very similar, it's more likely that a poor pivot will be chosen, leading to very unbalanced partitions and an O(n^2) runtime. The most straightforward example is when all the values in the list are the same.
There is a real-world sorting algorithm -- called Timsort -- that does exploit the idea that data encountered in the wild is often partially sorted.
The algorithm is derived from merge sort and insertion sort, and is used in CPython, Java 7 and Android.
See the Wikipedia article for more details.
While Java 6 and earlier versions use merge sort as their sorting algorithm, C# uses QuickSort.
QuickSort performs better than merge sort even though they are both O(n log n), because QuickSort has a smaller constant factor.
Of the two, use merge sort when you need a stable sort. You can use a modified quicksort (such as introsort) when you don't, since it tends to be faster and it uses less memory.
Plain old Quicksort as described by Hoare is quite sensitive to performance-killing special cases that make it Theta(n^2), so you normally do need a modified version. That's where the data-distribution comes in, since merge sort doesn't have bad cases. Once you start modifying quicksort you can go on with all sorts of different tweaks, and introsort is one of the more effective ones. It detects on the fly whether it's in a killer case, and if so switches to heapsort.
In fact, Hoare's most basic Quicksort fails worst for already-sorted data, and so your "good test cases" with some level of sorting will kill it to some level. That fact is for curiosity only, though, since it only takes a very small tweak to avoid that, nothing like as complicated as going all the way to introsort. So it's simplistic to even bother analyzing the version that's killed by sorted data.
In practice, in C++ you'd generally use std::stable_sort and std::sort rather than worrying too much about the exact algorithm.
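As a small illustration of that advice (my own sketch, not from any of the answers above): std::sort is typically an introsort and is not stable, while std::stable_sort is typically a merge sort and preserves the relative order of equal elements:

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Person {
    std::string name;
    int age;
};

int main() {
    std::vector<Person> people = {{"Ada", 36}, {"Bob", 25}, {"Cy", 36}};

    auto by_age = [](const Person& a, const Person& b) { return a.age < b.age; };

    // std::sort: usually introsort (quicksort with a heapsort fallback).
    // Fast, O(n log n) worst case since C++11, but not stable.
    std::sort(people.begin(), people.end(), by_age);

    // std::stable_sort: usually a merge sort. People with equal ages keep
    // their original relative order, at the cost of extra memory or time.
    std::stable_sort(people.begin(), people.end(), by_age);
    return 0;
}
```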
Remember, in practice, unless you have a very large data set and/or are executing the sort many, many times, it probably won't matter at all. That being said, quicksort is generally considered the 'fastest' n*log(n) sorter. See this question already asked: Quick Sort Vs Merge Sort
I'm looking for a binary data structure (tree, list) that enables very fast searching. I'll only add/remove items at the beginning/end of the program, all at once. So it's going to be fixed-size, and thus I don't really care about the insertion/deletion speed. Basically what I'm looking for is a structure that provides fast searching and doesn't use much memory.
Thanks
Look up the Unordered set in the Boost C++ library here. Unlike red-black trees, which are O(log n) for searching, the unordered set is based on a hash, and on average gives you O(1) search performance.
One container not to be overlooked is a sorted std::vector.
It definitely wins on the memory consumption, especially if you can reserve() the correct amount up front.
So the key can be a simple type and the value is a smallish structure of five pointers.
With only 50 elements it starts getting small enough that the Big-O theoretical performance may be overshadowed, or at least measurably affected, by the fixed time overhead of the algorithm or structure.
For example, an array or a vector with linear search is often the fastest with fewer than ten elements, because of its simple structure and tight memory layout.
I would wrap the container and run real data on it with timing. Start with STL's vector, go to the standard STL map, upgrade to unordered_map and maybe even try Google's dense or sparse_hash_map:
http://google-sparsehash.googlecode.com/svn/trunk/doc/performance.html
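A rough sketch of the "wrap it and time it" idea (the workload and container choices here are placeholders, not a benchmark from the answer): time a lookup-heavy loop with <chrono> and swap containers in and out:

```cpp
#include <chrono>
#include <iostream>
#include <set>
#include <unordered_set>
#include <vector>

// Time a lookup-heavy workload against any container that supports count().
// Swap in std::set, std::unordered_set, etc. and compare the numbers.
template <typename Container>
long long time_lookups(const Container& c, const std::vector<int>& queries) {
    auto start = std::chrono::steady_clock::now();
    long long hits = 0;
    for (int q : queries) {
        hits += c.count(q);  // membership test
    }
    auto stop = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
    std::cout << us.count() << " us, " << hits << " hits\n";
    return hits;  // returned so the compiler can't discard the work
}
```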
One efficient (albeit a teeny bit confusing) data structure is the red-black tree.
Internally, the C++ standard library uses red-black trees to implement std::map; see this question.
The std::map and hash map are good choices. They also have constructors to ease one time construction.
The hash map puts key data into a function that returns an array index. This may be slower than an std::map, but only profiling will tell.
My preference would be std::map, as it is usually implemented as a type of binary tree.
The fastest tends to be a trie. I implemented one that was 3 to 15 times faster than std::unordered_map, though tries tend to use more RAM unless you store a large number of elements.
For my C++ application, there is a requirement to check whether a word is a valid English dictionary word or not. What is the best way to do it? Is there a freely available dictionary that I can make use of? I just need a collection of all possible words. How do I make this lookup as cheap as possible? Do I need to hash it?
Use either a std::set<std::string> or a std::unordered_set<std::string>. The latter is new in C++0x and may or may not be supported by your C++ Standard Library implementation; if it does not support it, it may include a hash_set of some kind: consult your documentation to find out.
Which of these (set, which uses a binary search tree, and unordered_set, which uses a hashtable) is more efficient depends on the number of elements you are storing in the container and how your Standard Library implementation implements them. Your best bet is to try both and see which performs better for your specific scenario.
Alternatively, if the list of words is fixed, you might consider using a sorted std::vector and using std::binary_search to find words in it.
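If the word list really is fixed, a minimal sketch of that sorted-vector approach (the words here are placeholders; a real list would come from a dictionary file) could look like this:

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // In practice this would be loaded from a word-list file.
    std::vector<std::string> words = {"zebra", "apple", "mango", "kiwi"};

    std::sort(words.begin(), words.end());  // sort once, up front

    std::string query = "mango";
    bool valid = std::binary_search(words.begin(), words.end(), query);
    std::cout << query << (valid ? " is" : " is not") << " a dictionary word\n";
    return 0;
}
```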
With regards to the presence of a word list, it depends on the platform. Under Linux, /usr/share/dict/words contains a list of English words that might meet your needs. Otherwise, there are doubtlessly such lists available on the network.

Given the size of such lists, the most rapid access will be to load the list into a hash table: std::unordered_set, if you have it; otherwise, many C++ compilers come with a hash_set, although different compilers have a slightly different interface for it and put it in different namespaces. If that still has performance problems, it's possible to do better if you know the number of entries in advance (so the table never has to grow) and implement the hash table in an std::vector (or even a C-style array); handling collisions will be a bit more complicated, however.
Another possibility would be a trie. This will almost certainly result in the least number of basic operations in the lookup, and is fairly simple to implement. Typical implementations will have very poor locality, however, which could make it slower than some of the other solutions in actual practice (or not; the only way to know is to implement both and measure).
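For anyone who wants to try the trie route, here is a minimal sketch (my own illustration, not the answerer's implementation), restricted to lowercase ASCII words:

```cpp
#include <array>
#include <memory>
#include <string>

// A minimal trie over lowercase ASCII words (assumes inputs are 'a'..'z' only).
class Trie {
    struct Node {
        std::array<std::unique_ptr<Node>, 26> children;
        bool is_word = false;
    };
    Node root;

public:
    void insert(const std::string& word) {
        Node* n = &root;
        for (char c : word) {
            auto& next = n->children[c - 'a'];
            if (!next) next.reset(new Node());
            n = next.get();
        }
        n->is_word = true;
    }

    // Lookup is one array index per character: no hashing and no full
    // string comparisons, but the nodes themselves are scattered on the heap.
    bool contains(const std::string& word) const {
        const Node* n = &root;
        for (char c : word) {
            const auto& next = n->children[c - 'a'];
            if (!next) return false;
            n = next.get();
        }
        return n->is_word;
    }
};
```

Each lookup costs one array index per character; whether that beats a well-tuned hash table is exactly the kind of thing you have to measure.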
I actually did this a few months ago, or something close to this. You can probably find one online for free.
Like on this website: http://wordlist.sourceforge.net/
Just put it in a text file, and compare words with what is on the list. It should be order n, with n being the number of words in the list. Do you need the lookup to be faster than that?
Hope this helps.