Can anyone point me to a resource that lists the Big-O complexity of basic clojure library functions such as conj, cons, etc.? I know that Big-O would vary depending on the type of the input, but still, is such a resource available? I feel uncomfortable coding something without having a rough idea of how quickly it'll run.
Here is a table composed by John Jacobsen and taken from this discussion:
Late to the party here, but I found the link in the comments of the first answer to be more definitive, so I'm reposting it here with a few modifications (that is, English descriptions converted to Big-O notation):
(Complexity table not reproduced here; see the Markdown source linked in the original answer.)
On unsorted collections, O(log₃₂ n) is nearly constant time: since 2³² elements fit in the bit-partitioned trie nodes, the worst case is log₃₂ 2³² = 6.4. Vectors are also tries where the indices are keys.
Sorted collections utilize binary search where possible. (Yes, these are both technically O(log n); including the constant factor is for reference.)
Lists guarantee constant time for operations on the first element and O(n) for everything else.
Related
I'm currently trying to figure out what the advantages and disadvantages of having an ordered list are, but I'm struggling to see why it matters. I know that ordered lists allow for binary search, which is much more efficient than sequential search. Otherwise I'm at a loss, though.
Ordering your data has many costs and benefits.
You are right that it does enable binary search. But there are many ways to structure data so that searching for an element is fast; in the C++ standard library, std::set maintains its elements in order, while std::unordered_set doesn't, and the second is usually faster than the first (it uses hashes to find elements, as opposed to binary search over a balanced binary tree).
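For instance, here is a minimal sketch of the two lookup styles (the container contents are arbitrary; the point is just that both offer the same membership test with different cost models):

```cpp
#include <set>
#include <unordered_set>
#include <iostream>

int main() {
    // Ordered: backed by a balanced binary tree, lookups are O(log n).
    std::set<int> ordered{5, 1, 4, 2, 3};

    // Unordered: backed by a hash table, lookups are O(1) on average.
    std::unordered_set<int> hashed{5, 1, 4, 2, 3};

    // Both support the same membership test; only the cost model differs.
    std::cout << std::boolalpha
              << (ordered.count(4) > 0) << '\n'   // true
              << (hashed.count(7) > 0) << '\n';   // false
}
```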
Often I'll order data to calculate differences between collections, or to have a canonical representation. If I have two collections of data and they are unordered, checking equality is annoying (there are O(n^2) pairs to check). If the data is in a canonical order, I can do O(n) work; and if I trust a hashing function to be sufficiently robust, I can even pre-hash each element, and each sub-collection of elements, and check equality in O(1) time.
I can even find differences in O(size of difference) time with carefully ordered data and a trustworthy hash.
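As a rough sketch of that idea (the element values and the use of std::set_difference are my own illustration, not something prescribed above), once both collections are in canonical sorted order, equality and differences fall out of single linear passes:

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::vector<int> a{3, 1, 2, 5};
    std::vector<int> b{2, 3, 5, 8};

    // Putting both collections in canonical (sorted) order costs O(n log n) once...
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    // ...after which equality is a single O(n) element-wise pass,
    bool equal = (a == b);

    // and the difference falls out of one linear merge-style scan.
    std::vector<int> only_in_a;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(only_in_a));

    std::cout << std::boolalpha << "equal: " << equal << ", only in a:";
    for (int x : only_in_a) std::cout << ' ' << x;
    std::cout << '\n';
}
```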
Ordered data is often what humans want; an inbox ordered randomly is not very useful, while one ordered by when the mail arrived, or even by some notion of importance, makes for a much better UI.
Ordered data can make branch prediction and other optimizations at the hardware, compiler, or algorithm level easier. Here is a classic SO question about that issue: processing a sorted array is faster than an unsorted one, and the asker is surprised.
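The classic demonstration from that question looks roughly like this (the array size and the 128 threshold follow the well-known example; the exact numbers are incidental). The only code difference between the fast and slow versions is whether the data is sorted before the loop, which makes the branch predictable:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

long long sum_large(const std::vector<int>& data) {
    long long sum = 0;
    for (int x : data)
        if (x >= 128)        // this branch is the whole story:
            sum += x;        // predictable on sorted data, random otherwise
    return sum;
}

int main() {
    std::vector<int> data(32768);
    for (int& x : data) x = std::rand() % 256;

    // Sorting does extra work up front, yet typically makes the loop in
    // sum_large several times faster because the branch becomes predictable.
    std::sort(data.begin(), data.end());

    return sum_large(data) > 0 ? 0 : 1;
}
```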
The hard part about programming is understanding what the hell is going on. Data having structure makes reasoning about it easier; there are fewer things to think about. If a list is always ordered, code that uses it can assume it is always ordered, which makes that code simpler and easier to think about, and the code built on top of it becomes simpler and easier to think about in turn.
The price, of course, is that almost all editing operations on an ordered list end up being more expensive, both in computer time and in programmer time spent thinking about the consequences. With an unsorted std::vector, adding K elements takes O(K) time; meanwhile, a std::vector naively re-sorted after each operation would require O(K * N lg N) time, which can be horribly slow.
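As a hedged sketch of that cost gap (the function names are mine, purely illustrative):

```cpp
#include <algorithm>
#include <vector>

// Append all K elements, then sort once: one O(N log N) sort in total.
void bulk_then_sort(std::vector<int>& v, const std::vector<int>& items) {
    v.insert(v.end(), items.begin(), items.end());
    std::sort(v.begin(), v.end());
}

// Keep the vector sorted after every single append: K full sorts,
// O(K * N log N) in total, which is the "horribly slow" case above.
void sort_after_each(std::vector<int>& v, const std::vector<int>& items) {
    for (int x : items) {
        v.push_back(x);
        std::sort(v.begin(), v.end());
    }
}

int main() {
    std::vector<int> v, items{5, 2, 9, 1, 7};
    bulk_then_sort(v, items);      // cheap: one sort at the end
    sort_after_each(v, items);     // expensive: a full sort per append
}
```

A middle ground is inserting each element at the position found by std::lower_bound, which avoids the full re-sort but still pays O(N) per insertion to shift elements.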
Not doing something is extremely valuable, and code that doesn't do anything is easy to understand. By eliminating the "ordered" requirement, you make writing data to that storage easier to think about and easier for the computer to do.
I want to translate some Python code that I have already written to C++ or another fast language, because Python isn't quite fast enough to do what I want to do. However, the code in question leans heavily on some of the impressive features of Python sets, specifically the average O(1) membership testing, which I use constantly within performance-critical loops, and I am unsure how to implement Python sets in another language.
Python's time complexity wiki page states that sets have O(1) membership testing on average and O(n) in the worst case. I tested this personally using timeit and was astonished by how blazingly fast Python sets do membership testing, even with large N. I looked at this Stack Overflow answer to see how C++ sets compare when using find to check whether an element is a member of a given set, and it said that find is O(log n).
I hypothesize that find is logarithmic because C++ standard library sets are implemented as some sort of binary tree. Since Python sets have average O(1) membership testing and a worst case of O(n), they are probably implemented as some sort of associative array with buckets, which can look up an element directly and check it against a dummy value that indicates the element is not part of the set.
The thing is, I don't want to slow down any part of my code by switching to another language (since speed is the problem I'm trying to fix in the first place), so how could I implement my own version of Python sets (specifically just the fast membership testing) in another language? Does anybody know anything about how Python sets are implemented, and if not, could anyone give me any general hints to point me in the right direction?
I'm not looking for source code, just general ideas and links that will help me get started.
I have done a bit of research on associative arrays and I think I understand the basic idea behind their implementation, but I'm unsure of their memory usage. If Python sets are indeed just associative arrays, how can I implement them with minimal memory use?
Additional note: The sets in question that I want to use will have up to 50,000 elements and each element of the set will be in a large range (say [-999999999, 999999999]).
The theoretical difference between O(1) and O(log n) means very little in practice, especially when comparing two different languages: log n is small for most practical values of n, and the constant factors of each implementation are easily more significant.
C++11 has unordered_set and unordered_map now. Even if you cannot use C++11, there are always the Boost and TR1 versions (and older pre-standard extensions named hash_* instead of unordered_*).
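A minimal sketch of the membership-testing pattern with std::unordered_set (the element count and value range follow the numbers in the question; the reserve() call is just a precaution against rehashing):

```cpp
#include <unordered_set>
#include <iostream>

int main() {
    std::unordered_set<int> seen;
    seen.reserve(50000);                 // up to 50,000 elements, as in the question

    // Elements may come from a large range such as [-999999999, 999999999];
    // that range doesn't matter to a hash table, only the element count does.
    seen.insert(123456789);
    seen.insert(-999999999);

    // Average O(1) membership test, the equivalent of Python's `x in s`.
    std::cout << std::boolalpha
              << (seen.count(123456789) > 0) << '\n'   // true
              << (seen.count(42) > 0) << '\n';         // false
}
```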
Several points: you have, as has been pointed out, std::set and std::unordered_set (the latter only in C++11, but most compilers have offered something similar as an extension for many years now). The first is implemented by some sort of balanced tree (usually a red-black tree), the second as a hash table. Which one is faster depends on the data type: the first requires some sort of ordering relationship (e.g. <, if it is defined on the type, but you can define your own); the second an equivalence relationship (==, for example) and a hash function compatible with this equivalence relationship. The first is O(lg n), the second O(1), if you have a good hash function. Thus:

If comparison for order is significantly faster than hashing, std::set may actually be faster, at least for "smaller" data sets, where "smaller" depends on how large the difference is. For strings, for example, the comparison will often resolve after the first couple of characters, whereas the hash code will look at every character. In one experiment I did (many years back), with strings of 30-50 characters, I found the break-even point to be about 100,000 elements.

For some data types, simply finding a good hash function which is compatible with the type may be difficult. Python uses a hash table for its set, and if you define a type with a __hash__ function that always returns 1, it will be very, very slow. Writing a good hash function isn't always obvious.
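The same degenerate case is easy to reproduce in C++; here is an illustrative (deliberately bad) hash functor, analogous to a Python __hash__ that always returns 1:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Deliberately terrible hash: every key lands in the same bucket, so the
// "O(1)" set degrades to a linear scan of that one bucket, exactly like a
// Python type whose __hash__ always returns 1.
struct ConstantHash {
    std::size_t operator()(const std::string&) const { return 1; }
};

// Drop-in usage; correctness is unaffected, only performance collapses.
using SlowSet = std::unordered_set<std::string, ConstantHash>;

int main() {
    SlowSet s;
    for (int i = 0; i < 1000; ++i) s.insert(std::to_string(i));
    return s.count("500") ? 0 : 1;   // still correct, just slow at scale
}
```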
Finally, both are node based containers, which means they use a lot more memory than e.g. std::vector, with very poor locality. If lookup is the predominant operation, you might want to consider std::vector, keeping it sorted and using std::lower_bound for the lookup. Depending on the type, this can result in a significant speed-up, and much less memory use.
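A minimal sketch of that last suggestion, a sorted std::vector probed with std::lower_bound (the element type and values are arbitrary):

```cpp
#include <algorithm>
#include <vector>

// Sorted-vector "set": contiguous storage, good locality, O(log n) lookup.
bool contains(const std::vector<int>& sorted, int value) {
    auto it = std::lower_bound(sorted.begin(), sorted.end(), value);
    return it != sorted.end() && *it == value;
}

int main() {
    std::vector<int> v{7, 3, 9, 1};
    std::sort(v.begin(), v.end());   // keep it sorted once, up front

    return contains(v, 9) && !contains(v, 4) ? 0 : 1;
}
```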
We see a lot of sorting techniques like merge sort, quicksort, and heapsort. Could you help me decide which of these sorting techniques should be used in which kind of situation? When should we use each of these sorting algorithms, and when should we avoid them (what are their disadvantages in time and space)?
I am expecting answer something in this form:
a) We would use Merge sort when... we should definitely not use Merge Sort when...
b) We would use Quick sort when... we should definitely not use quick Sort when...
There are a few basic parameters that characterize the behavior of each sorting algorithm:
average case computational complexity
worst case computational complexity
memory requirements
stability (i.e. is it a stable sort or not?)
All of these are widely documented for all commonly used sorts, and this is all the information one needs to provide an answer in the format that you want. However, since even four parameters for each sort make for a lot of things -- not all of which will be relevant -- to consider, it isn't a very good idea to try and give such a "scripted" answer. Furthermore, there are even more advanced concepts that could come into consideration (such as behavior when run on almost-sorted or reverse-sorted data, cache performance, resistance to maliciously constructed input), making such an answer even more lengthy and error-prone.
I suggest that you spend some time familiarizing yourself with the four basic concepts mentioned above, perhaps by visualizing how each type of sort works on simple input and reading an introductory text on sorting algorithms. Do this and soon enough you will be able to answer such questions yourself.
For starters, take a look at this comparison table on Wikipedia; the comparison criteria will give you clues about what to look for in an algorithm and its possible tradeoffs.
The Wikipedia article for merge sort.
The Wikipedia article for quick sort.
Both articles have excellent visualizations.
Both have O(n log n) complexity.
So obviously the distribution of the data will affect the speed of the sort. My guess would be that since a comparison can just as quickly compare any two values, no matter their spread, the range of data values does not matter.
More importantly, one should consider the lateral distribution (x direction) with respect to ordering (magnitude removed).
A good test case to consider would be if the test data had some level of sorting...
It typically depends on the data structures involved. Quick sort is typically the fastest, but it doesn't guarantee O(n*log(n)); there are degenerate cases where it becomes O(n^2). Heap sort is the usual alternative; it guarantees O(n*log(n)), regardless of the initial order, but it has a much higher constant factor. It's usually used when you need a hard upper limit to the time taken. Some more recent algorithms use quick sort, but attempt to recognize when it starts to degenerate, and switch to heap sort then. Merge sort is used when the data structure doesn't support random access, since it works with pure sequential access (forward iterators, rather than random access iterators). It's used in std::list<>::sort, for example. It's also widely used for external sorting, where random access can be very, very expensive compared to sequential access. (When sorting a file which doesn't fit into memory, you might break it into chunks which fit into memory, sort these using quicksort, write each out to a file, then merge sort the generated files.)
Mergesort is quicker when dealing with linked lists, because pointers can easily be relinked when merging lists; the merge step only requires one O(n) pass through the lists.
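In C++ terms, std::list shows this directly: merging two sorted lists only relinks nodes, and std::list::sort is itself a merge sort for the same reason (a small sketch with arbitrary values):

```cpp
#include <list>
#include <iostream>

int main() {
    std::list<int> a{1, 4, 7};
    std::list<int> b{2, 3, 9};

    // One O(n) pass that only rewires node pointers; b ends up empty.
    a.merge(b);

    for (int x : a) std::cout << x << ' ';   // 1 2 3 4 7 9
    std::cout << '\n';
}
```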
Quicksort's in-place algorithm requires the movement (swapping) of data. While this can be very efficient for an in-memory dataset, it can be much more expensive if your dataset doesn't fit in memory; the result would be lots of I/O.
These days, there is a lot of parallelization going on, and parallelizing mergesort is simpler than parallelizing in-place quicksort. If you don't use the in-place algorithm, then the space complexity of quicksort is O(n), which is the same as mergesort's.
So, to generalize, quicksort is probably more effective for datasets that fit in memory. For stuff that's larger, it's better to use mergesort.
The other general reason to use mergesort over quicksort is if the data contains many similar values (that is, far from uniform). Quicksort relies on choosing a pivot; when the values are very similar, it's more likely that a poor pivot is chosen, leading to very unbalanced partitions and an O(n^2) runtime. The most straightforward example is a list in which all the values are the same.
There is a real-world sorting algorithm -- called Timsort -- that does exploit the idea that data encountered in the wild is often partially sorted.
The algorithm is derived from merge sort and insertion sort, and is used in CPython, Java 7 and Android.
See the Wikipedia article for more details.
While Java 6 and earlier versions use merge sort as the sorting algorithm, C# uses quicksort.
Quicksort performs better than merge sort even though they are both O(n log n); quicksort has a smaller constant factor than merge sort.
Of the two, use merge sort when you need a stable sort. You can use a modified quicksort (such as introsort) when you don't, since it tends to be faster and it uses less memory.
Plain old Quicksort as described by Hoare is quite sensitive to performance-killing special cases that make it Theta(n^2), so you normally do need a modified version. That's where the data-distribution comes in, since merge sort doesn't have bad cases. Once you start modifying quicksort you can go on with all sorts of different tweaks, and introsort is one of the more effective ones. It detects on the fly whether it's in a killer case, and if so switches to heapsort.
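A hedged sketch of the introsort idea (heavily simplified; real implementations also switch to insertion sort for tiny ranges and pick pivots more carefully): quicksort with a depth budget that falls back to heapsort when the budget runs out.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using It = std::vector<int>::iterator;

// Simplified introsort: quicksort with a depth budget; once the budget is
// exhausted we assume we've hit a degenerate case and finish with heapsort,
// preserving the O(n log n) worst case.
void introsort(It first, It last, int depth_budget) {
    while (last - first > 1) {
        if (depth_budget-- == 0) {
            std::make_heap(first, last);
            std::sort_heap(first, last);
            return;
        }
        // Three-way partition around a middle pivot; elements equal to the
        // pivot end up between middle1 and middle2 and need no further work.
        int pivot = *(first + (last - first) / 2);
        It middle1 = std::partition(first, last,
                                    [pivot](int x) { return x < pivot; });
        It middle2 = std::partition(middle1, last,
                                    [pivot](int x) { return !(pivot < x); });
        introsort(first, middle1, depth_budget);
        first = middle2;   // iterate (rather than recurse) on the right part
    }
}

int main() {
    std::vector<int> v{5, 3, 8, 1, 9, 2, 7, 4, 6, 0};
    int budget = 2 * static_cast<int>(std::log2(v.size()) + 1);
    introsort(v.begin(), v.end(), budget);
    return std::is_sorted(v.begin(), v.end()) ? 0 : 1;
}
```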
In fact, Hoare's most basic Quicksort fails worst for already-sorted data, and so your "good test cases" with some level of sorting will kill it to some level. That fact is for curiosity only, though, since it only takes a very small tweak to avoid that, nothing like as complicated as going all the way to introsort. So it's simplistic to even bother analyzing the version that's killed by sorted data.
In practice, in C++ you'd generally use std::stable_sort and std::sort rather than worrying too much about the exact algorithm.
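As a small illustration of what "stable" buys you (the record type and values are my own example): equal keys keep their original relative order under std::stable_sort, while std::sort makes no such promise.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Entry {
    int key;
    std::string label;   // original arrival order is encoded in the label
};

int main() {
    std::vector<Entry> v{{2, "a"}, {1, "b"}, {2, "c"}, {1, "d"}};

    // Stable: equal keys preserve their input order, so we are guaranteed
    // to get (1,b) (1,d) (2,a) (2,c).
    std::stable_sort(v.begin(), v.end(),
                     [](const Entry& x, const Entry& y) { return x.key < y.key; });

    for (const auto& e : v) std::cout << e.key << e.label << ' ';
    std::cout << '\n';

    // std::sort with the same comparator may reorder (1,b)/(1,d) and
    // (2,a)/(2,c) arbitrarily; it is typically faster and uses less memory.
}
```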
Remember in practice, unless you have a very large data set and/or are executing the sort many many times, it probably won't matter at all. That being said, quicksort is generally considered the 'fastest' n*log(n) sorter. See this question already asked: Quick Sort Vs Merge Sort
I believe that the C++ standard for std::sort does not guarantee O(n) performance on a list that's already sorted. But still, I'm wondering whether to your knowledge any implementations of the STL (GCC, MSVC, etc) make the std::is_sorted check before executing the sort algorithm?
Asked another way, what performance can one expect (without guarantees, of course) from running std::sort on a sorted container?
Side note: I posted some benchmarks for GCC 4.5 with C++0x enabled on my blog.
Implementations are free to use any efficient sorting algorithm they want, so this is highly implementation-dependent.
However, I have seen a performance comparison of libstdc++ (as used on Linux) against libc++, the new C++ standard library developed by Apple/LLVM. Both libraries are very efficient on sorted or reverse-sorted data (much faster than on a random list), with the new library being considerably faster than the old and recognizing many more patterns.
To be certain, you should consider doing your own benchmarks.
No. Also, it's not logical for any STL implementation to call is_sorted() first, since is_sorted() is already available as a stand-alone algorithm, and many users would not want to waste execution cycles calling it when they already know that their container is not sorted.
The STL should also follow the C++ philosophy: pay only for what you use.
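If you happen to know that already-sorted input is common in your workload, the check is trivial to add yourself (a sketch; the function name is mine, and whether it pays off depends entirely on how often the input really is sorted):

```cpp
#include <algorithm>
#include <vector>

void sort_if_needed(std::vector<int>& v) {
    // Pay the O(n) scan only because *we* expect sorted input to be common;
    // the library rightly doesn't make this bet for everyone.
    if (!std::is_sorted(v.begin(), v.end()))
        std::sort(v.begin(), v.end());
}

int main() {
    std::vector<int> v{1, 2, 3, 4, 5};   // already sorted: the sort is skipped
    sort_if_needed(v);
}
```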
Wow! Did you have optimizations all the way cranked up?
(Chart not reproduced here: the results of your code on my platform; note the values on the vertical axis.)
I suggest you read this comparison of sorting algorithms; it is very well done and informative, and it compares a number of sorting algorithms with each other and with GCC's implementation of std::sort. You will notice, in the charts at the given link, that the performance of std::sort for "almost sorted" and "almost reverse" input is linear in the number of elements to sort, that is, O(n). So, there is no guarantee, but you can reasonably expect that an almost-sorted list will be sorted in almost linear time. However, std::sort does not do an is_sorted check, and even if it sorts an already-sorted array in linear time, it won't be as fast as doing an is_sorted check and skipping the sorting altogether. It is your decision whether it is better to check before sorting or not.
The standard sanctions only std::sort implementations with complexity O(n log n):
Complexity: Approximately N log N (where N == last - first) comparisons on the average.
See section 25.3.1.1 Sorting [lib.sort] (ISO/IEC 14882:2003(E)).
Thus, the set of allowed sorting functions is limited, and you are right that it does not guarantee linear complexity.
Ideal behavior for a sort is O(n), but this is not possible in the average case.
Of course the average case is not necessarily the exact case you have right now, so for corner cases, there's not much of a guarantee.
And why would any implementation do that check? What would it gain? Nothing on average. A good design rule is not to clutter the implementation with optimizations for corner cases that make no difference on average. This is similar to the check for self-assignment. A simple answer: don't do it.
There's no guarantee that it'll check this. Some implementations will do it, others probably won't.
However, if you suspect that your input might already be sorted (or nearly sorted), std::stable_sort might be a better option.