This is out of curiosity about the nsmallest and nlargest functions of Python's heapq module.
I was reading about them here in the docs.
The documentation doesn't say how nsmallest/nlargest work on an arbitrary iterable.
This might be a stupid question, but can I assume that these functions internally build a heap from the iterable (maybe using heapify) and then return the n smallest/largest elements?
Just want to confirm my conclusion. Thanks!
The algorithm for finding the n smallest or largest items from an iterable with N items is a bit tricky. You see, you don't create a size-N min-heap to find the smallest items.
Instead, you make a smaller, size-n max-heap with the first n items, then do repeated pushpop operations on it with the remaining items from the sequence. Once you're done, you pop the items from the heap and return them in reversed order.
This process takes O(N log n) time (note the lowercase n) and of course only O(n) space. If n is much smaller than N, it's much more efficient than sorting and slicing.
The heapq module contains a pure-Python implementation of this algorithm, though when you import it, you may get a faster version of the code written in C instead (you can read the source for that too, but it's not quite as friendly unless you know the Python C API).
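A rough Python sketch of that strategy (not the actual heapq.py source; it emulates the size-n max-heap by negating values so the standard min-heap functions can be used):

import heapq
from itertools import islice

def nsmallest_sketch(n, iterable):
    it = iter(iterable)
    # Size-n max-heap over the first n items, emulated with negated values
    # so the standard min-heap functions can be used.
    heap = [-x for x in islice(it, n)]
    heapq.heapify(heap)                    # O(n)
    for x in it:                           # one pushpop per remaining item, O(log n) each
        heapq.heappushpop(heap, -x)        # keeps (the negations of) the n smallest seen so far
    result = []
    while heap:
        result.append(-heapq.heappop(heap))   # pops the current maximum first
    result.reverse()                          # so reverse to get ascending order
    return result

print(nsmallest_sketch(3, [9, 4, 7, 1, 8, 2, 6]))   # [1, 2, 4]

nlargest works symmetrically: keep a size-n min-heap and replace its smallest element whenever a larger item shows up.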
I'm implementing a matrix reduction algorithm; I'm a math student.
Obviously I've searched and read around the internet, but didn't find exactly what I was looking for (I list at the end what I've found and the papers I've read).
Quick overview of the problem:
The bitvector b has FIXED LENGTH N.
b changes at every step: usually only at a couple of indexes, but in roughly 10% of the cases at considerably more of them (from 1/10 to 1/3).
I already have a sparse implementation; now I'd like to code it using some smart implementation of the bitvector.
// initialize to 0
b = bitvector(0, n=N)
for i in 1 to N
    {some operations on the bitvector b}
    get I = { j | b[j] == 1 }
    {save I}
What I need is:
quickly set b[i] = 1 or = 0 (possibly O(1))
quickly get the set of indexes I at each step (definitely not more than O(log N), ideally O(1))
a C++ library that allows it
papers/documentation
What would be nice to have:
a fast way to get the "lowest one" (the last index set to 1, namely select(rank(b)), if both operations are fast (O(1)))
What I do not need is:
save space
compress the data
I have been using the SDSL 2.0 library by Simon Gog et al. (https://github.com/simongog/sdsl-lite), but the select structure
bit_vector::select_1_type
costs O(n) to initialize, O(1) for every query, and does not "follow" the changes in b (right? I haven't found anything very specific about it), meaning it needs to be rebuilt at every step after the modifications.
Papers that I've read are:
"Fast, Small, Simple Rank/Select on Bitmaps" (G. Navarro and E. Providel) and "Practical Entropy-Compressed Rank/Select Dictionary" (D. Okanohara and K. Sadakane). I would appreciate any link to solid C++ implementations (if the structure fulfills my requirements).
Things that I've found here on stackexchange about similar topics that didn't help:
Dynamic bit vector in C/C++
Bit vector and bitset
Sorry for the lengthy question; I hope I explained what I need and my determination to find it. I'm still very confused about various things related to bitvectors, as it's definitely not my field of expertise, so any clarification is appreciated.
Thanks in advance.
The structure described here is the closest thing I am aware of to the properties you want.
Specifically:
initialisation is constant time
setting/clearing entries is constant time
testing for membership is constant time
retrieving the set of entries is O(N), where N here is the number of entries currently set (assuming you don't need them sorted; you actually end up walking them in order of insertion. You're not going to do better than O(N) overall anyway if you need to walk all of them for whatever happens next, of course)
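For reference, a minimal Python sketch of one structure with these properties (the classic "sparse set" built from a dense member list plus a position array; the names are illustrative, and in C or C++ the position array can be left uninitialized, which is where the constant-time initialisation comes from):

class SparseSet:
    # Universe is {0, ..., N-1}. In Python we still have to allocate `pos`,
    # but the per-operation costs below match the properties listed above.
    def __init__(self, N):
        self.pos = [0] * N      # pos[v] = index of v in `dense`, if v is a member
        self.dense = []         # current members, roughly in insertion order

    def contains(self, v):      # O(1)
        i = self.pos[v]
        return i < len(self.dense) and self.dense[i] == v

    def set(self, v):           # O(1)
        if not self.contains(v):
            self.pos[v] = len(self.dense)
            self.dense.append(v)

    def clear(self, v):         # O(1): swap v with the last member and shrink
        if self.contains(v):
            last = self.dense[-1]
            i = self.pos[v]
            self.dense[i], self.pos[last] = last, i
            self.dense.pop()

    def members(self):          # linear in the number of set entries
        return list(self.dense)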
In the Wikipedia article on sorting algorithms,
http://en.wikipedia.org/wiki/Sorting_algorithm#Summaries_of_popular_sorting_algorithms
under Bubble sort it says: "Bubble sort can also be used efficiently on a list of any length that is nearly sorted (that is, the elements are not significantly out of place)."
So my question is: without sorting the list with a sorting algorithm first, how can one know whether it is nearly sorted or not?
Are you familiar with the general sorting lower bound? You can prove that any comparison-based sorting algorithm must make Ω(n log n) comparisons in the average case. The way you prove this is through an information-theoretic argument. The basic idea is that there are n! possible permutations of the input array, and since the only way you can learn which permutation you got is to make comparisons, you have to make at least lg(n!) comparisons in order to be certain that you know the structure of your input permutation.
I haven't worked out the math on this, but I suspect that you could make similar arguments to show that it's difficult to learn how sorted a particular array is. Essentially, if you don't do a large number of comparisons, then you wouldn't be able to tell apart an array that's mostly sorted from an array that is actually quite far from sorted. As a result, all the algorithms I'm aware of that measure "sortedness" take a decent amount of time to do so.
For example, one measure of the level of "sortedness" in an array is the number of inversions in that array. You can count the number of inversions in an array in time O(n log n) using a divide-and-conquer algorithm based on mergesort, but with that runtime you could just sort the array instead.
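For illustration, a sketch of that divide-and-conquer count (a merge sort that, whenever it takes an element from the right half, adds the number of left-half elements it jumped over):

def count_inversions(a):
    # Returns (sorted copy of a, number of inversions), in O(n log n) time.
    if len(a) <= 1:
        return list(a), 0
    mid = len(a) // 2
    left, inv_left = count_inversions(a[:mid])
    right, inv_right = count_inversions(a[mid:])
    merged, inv = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            inv += len(left) - i     # right[j] precedes all remaining left elements
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inv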
Typically, the way that you'd know that your array was mostly sorted was to know something a priori about how it was generated. For example, if you're looking at temperature data gathered from 8AM - 12PM, it's very likely that the data is already mostly sorted (modulo some variance in the quality of the sensor readings). If your data looks at a stock price over time, it's also likely to be mostly sorted unless the company has a really wonky trajectory. Some other algorithms also partially sort arrays; for example, it's not uncommon for quicksort implementations to stop sorting when the size of the array left to sort is small and to follow everything up with a final insertion sort pass, since every element won't be very far from its final position then.
I don't believe there exists any standardized measure of how sorted or random an array is.
You can come up with your own measure - like count the number of adjacent pairs which are out of order (suggested in comment), or count the number of larger numbers which occur before smaller numbers in the array (this is trickier than a simple single pass).
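The first of those measures is a one-liner, for example:

def adjacent_disorder(a):
    # Number of adjacent pairs that are out of order; 0 means the array is sorted.
    return sum(1 for x, y in zip(a, a[1:]) if x > y)

print(adjacent_disorder([1, 2, 5, 3, 4]))   # 1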
The name says it all really. I suspect that insertion sort is best, since it's the best sort for mostly-sorted data in general. However, since I know more about the data, there is a chance other sorts are worth looking at. So the other relevant pieces of information are:
1) This is time data, which means I presumably could create an effective hash for ordering the data.
2) The data won't all exist at one time. Instead I'll be reading in records which may contain a single vector, or dozens or hundreds of vectors. I want to output all times within a 5-second window. So it's possible that a sort that does the sorting as I insert the data would be a better option.
3) Memory is not a big issue, but CPU speed is, as this may be a bottleneck of the system.
Given these conditions, can anyone suggest an algorithm that may be worth considering in addition to insertion sort? Also, how does one define 'mostly sorted' when deciding what is a good sort option? What I mean is: how do I look at my data and decide 'this isn't as sorted as I thought it was, maybe insertion sort is no longer the best option'? Any link to an article that considers complexity relative to the degree to which the data is sorted would be appreciated.
Thanks
Edit:
Thank you everyone for your information. I will be going with an easy insertion or merge sort (whichever I have already pre-written) for now. However, I'll be trying some of the other methods once we're closer to the optimization phase (since they take more effort to implement). I appreciate the help.
You could adopt option (2) you suggested - sort the data while you insert elements.
Use a skip list, sorted by time in ascending order, to maintain your data.
When a new entry arrives, check whether it is larger than the last element (easy and quick). If it is, simply append it (easy to do in a skip list); the skip list needs to add 2 nodes on average for these cases, which is O(1) on average.
If the element is not larger than the last element, add it to the skip list as a standard insert operation, which is O(log n).
This approach yields an O(n + k log n) algorithm, where k is the number of elements inserted out of order.
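Python's standard library has no skip list, but the shape of the solution is easy to sketch with the bisect module on a plain list; the append fast path is the O(1) part, and bisect.insort stands in for the skip list's O(log n) insert (on a plain list it actually costs O(n) element moves, so a real skip list or balanced-tree container would replace it):

import bisect

timeline = []                        # kept sorted by timestamp, ascending

def insert_time(t):
    if not timeline or t >= timeline[-1]:
        timeline.append(t)           # in-order arrival: O(1) append
    else:
        bisect.insort(timeline, t)   # out-of-order arrival: sorted insert (skip-list stand-in)

for t in [1.0, 2.5, 2.4, 3.0, 7.2, 6.9]:
    insert_time(t)
print(timeline)                      # [1.0, 2.4, 2.5, 3.0, 6.9, 7.2]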
I would throw in merge sort: if you implement the natural version, you get a best case of O(N), and a typical and worst case of O(N log N) if the data turns out not to be so nearly sorted. With insertion sort you get a worst case of O(N^2) and a best case of O(N).
You can sort a list of size n with k elements out of place in O(n + k lg k) time.
See: http://www.quora.com/How-can-I-quickly-sort-an-array-of-elements-that-is-already-sorted-except-for-a-small-number-of-elements-say-up-to-1-4-of-the-total-whose-positions-are-known/answer/Mark-Gordon-6?share=1
The basic idea is this:
Iterate over the elements of the array, building an increasing subsequence (if the current element is greater than or equal to the last element of the subsequence, append it to the end of the subsequence. Otherwise, discard both the current element and the last element of the subsequence). This takes O(n) time.
You will have discarded no more than 2k elements since k elements are out of place.
Sort the 2k elements that were discarded using an O(k lg k) sorting algorithm like merge sort or heapsort.
You now have two sorted lists. Merge the lists in O(n) time like you would in the merge step of merge sort.
Overall time complexity = O(n + k lg k)
Overall space complexity = O(n)
(this can be modified to run in O(1) space if you can merge in O(1) space, but it's by no means trivial)
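In case a sketch helps, here are those three steps in Python (the function name is just illustrative):

def sort_few_out_of_place(a):
    # Step 1: peel off an increasing subsequence; everything discarded goes to `rest`.
    keep, rest = [], []
    for x in a:
        if not keep or x >= keep[-1]:
            keep.append(x)
        else:
            rest.append(keep.pop())   # discard the last kept element...
            rest.append(x)            # ...and the current element
    # Step 2: sort the at-most-2k discarded elements, O(k lg k).
    rest.sort()
    # Step 3: merge the two sorted lists, O(n), as in merge sort.
    out, i, j = [], 0, 0
    while i < len(keep) and j < len(rest):
        if keep[i] <= rest[j]:
            out.append(keep[i]); i += 1
        else:
            out.append(rest[j]); j += 1
    out.extend(keep[i:])
    out.extend(rest[j:])
    return out

print(sort_few_out_of_place([1, 5, 2, 3, 4]))   # [1, 2, 3, 4, 5]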
Without fully understanding the problem: Timsort may fit the bill, as you're saying that your data is mostly sorted already.
There are many adaptive sorting algorithms out there that are specifically designed to sort mostly-sorted data. Ignoring the fact that you're storing dates, you might want to look at smoothsort or Cartesian tree sort as algorithms that can sort reasonably sorted data in worst-case O(n log n) time and best-case O(n) time. Smoothsort also has the advantage of requiring only O(1) space, like insertion sort.
Using the fact that everything is a date and therefore can be converted into an integer, you might want to look at binary quicksort (MSD radix sort) using a median-of-three pivot selection. This algorithm has best-case O(n log n) performance, but has a very low constant factor that makes it pretty competitive. Its worst case is O(n log U), where U is the number of bits in each date (probably 64), which isn't too bad.
Hope this helps!
If your OS or C library provides a mergesort function, it is very likely that it already handles the case where the data given is partially ordered (in any direction) running in O(N) time.
Otherwise, you can just copy the mergesort available from your favorite BSD operating system.
I have a list of items; I want to sort them, but I want a small element of randomness so they are not strictly in order, only on average ordered.
How can I do this most efficiently?
I don't mind if the quality of the randomness is not especially good, e.g. if it's simply based on the chance ordering of the input, as in an early-terminated incomplete sort.
The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop, so the speed of sorting and of calling random() both have to be considered.
My current code is to do a std::sort (this being C++) and then do a very short shuffle just in the early part of the array:
for (int i = 0; i < 3; i++) // I know I have more than 6 elements
    std::swap(order[i], order[i + rand() % 3]);
Use the first two passes of JSort: build the heap twice, but do not perform the insertion sort. If the element of randomness is not small enough, repeat.
There is an approach that (unlike incomplete JSort) allows finer control over the resulting randomness, with a time complexity that depends on the randomness (the more random the result needs to be, the lower the time complexity). Use heapsort with a soft heap. For a detailed description of the soft heap, see pdf 1 or pdf 2.
You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which is less than the other, or whether they are equal (returning -1, 0 or 1). In the predicate, introduce a rare (configurable) case where the answer is random, by using a random number:
pseudocode:
if random(1000) == 0 then
    return random(2) - 1    <-- -1, 0 or 1 chosen randomly
Here we have a 1/1000 chance to "scramble" two elements, but that number really depends on the size of the container you are sorting.
Another refinement for the 1-in-1000 case would be to exclude the "right" answer from the random choice, because returning it would not scramble the result!
Edit:
if random(100 * container_size) == 0 then    <-- here I consider the container size
{
    if element_1 < element_2
        return random(1);            <-- do not return the "correct" value of -1
    else if element_1 > element_2
        return random(1) - 1;        <-- do not return the "correct" value of 1
    else
        return random(1) == 0 ? -1 : 1;    <-- do not return 0
}
In my pseudocode:
random(x) = y, where 0 <= y <= x
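A rough Python rendering of the same idea, for concreteness (Python's sort tolerates an inconsistent comparator and simply produces a slightly scrambled order; note that C++ std::sort formally requires a consistent strict weak ordering, which is the concern the next answer avoids):

import random
from functools import cmp_to_key

def noisy_cmp(a, b, scramble_chance=1000):
    # Roughly 1 in scramble_chance comparisons answers at random instead of correctly.
    if random.randrange(scramble_chance) == 0:
        return random.choice((-1, 1))
    return (a > b) - (a < b)             # the usual -1 / 0 / 1 comparison

data = list(range(50))
random.shuffle(data)
# A low scramble_chance here just makes the effect visible on a small example.
print(sorted(data, key=cmp_to_key(lambda a, b: noisy_cmp(a, b, 10))))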
One possibility that requires a bit more space but would guarantee that existing sort algorithms could be used without modification would be to create a copy of the sort value(s) and then modify those in some fashion prior to sorting (and then use the modified value(s) for the sort).
For example, if the data to be sorted is a simple character field Name[N] then add a field (assuming data is in a structure or class) called NameMod[N]. Fill in the NameMod with a copy of Name but add some randomization. Then 3% of the time (or some appropriate amount) change the first character of the name (e.g., change it by +/- one or two characters). And then 10% of the time change the second character +/- a few characters.
Then run it through whatever sort algorithm you prefer. The benefit is that you could easily change those percentages and randomness. And the sort algorithm will still work (e.g., it would not have problems with the compare function returning inconsistent results).
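A small Python sketch of that approach; the percentages and the fuzzed_key name are just illustrative, and the key (a noisy copy of the value) is computed once per element before the comparisons start, which is why the comparisons themselves stay consistent:

import random

def fuzzed_key(name):
    # Sort on a noisy copy of the name; the original data is never modified.
    chars = list(name)
    if chars and random.random() < 0.03:            # ~3% of the time, nudge the 1st character
        chars[0] = chr(ord(chars[0]) + random.choice((-1, 1)))
    if len(chars) > 1 and random.random() < 0.10:   # ~10% of the time, nudge the 2nd character
        chars[1] = chr(ord(chars[1]) + random.randint(-2, 2))
    return ''.join(chars)

names = ["alice", "bob", "carol", "dave", "erin", "frank"]
print(sorted(names, key=fuzzed_key))   # usually ordered, occasionally slightly perturbed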
If you are sure that every element is at most k positions away from where it should be, you can reduce quicksort's O(N log N) sorting time down to O(N log k)....
Edit:
More specifically, you would create k buckets, each containing N/k elements.
You can quicksort each bucket, which takes k * log(k) time, and then sort the N/k buckets, which takes N/k log(N/k) time. Combining these, you can do the sorting in N log(max(N/k, k)).
This can be useful because you can run sorting for each bucket in parallel, reducing total running time.
This works if you are sure that every element in the list is at most k indices away from its correct position after sorting, but I do not think you meant any such restriction.
Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge iterations as usual, comparing the merged elements. For other merge iterations, do not compare the elements, but instead select the element from the same part as in the previous step. It is not necessary to use an RNG to decide how to treat each element; just ignore the sorting order for every N-th element.
Another variant of this approach nearly sorts an array, nearly in place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use a standard C++ algorithm with an appropriately modified iterator, like boost::permutation_iterator.) Reserve some limited space at the end of the array. Merge the parts, starting from the end. If the merged part is going to overwrite one of the non-merged elements, just select this element; otherwise select the element in sorted order. The level of randomness is determined by the amount of reserved space.
Assuming you want the array sorted in ascending order, I would do the following:
for M iterations
    pick a random index i
    pick a random index k
    if (i < k) != (array[i] < array[k]) then swap(array[i], array[k])
M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.
This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
Bubblesort to the rescue!
For an unsorted array, you could pick a few random elements and bubble them up or down (maybe by rotation, which is a bit more efficient). It will be hard to control the amount of (dis)order: even if you pick all N elements, you are not sure that the whole array will be sorted, because elements are moved and you cannot ensure that you touched every element only once.
BTW: this kind of problem tends to occur in game playing engines, where the list with candidate moves is kept more-or-less sorted (because of weighted sampling), and sorting after each iteration is too expensive, and only one or a few elements are expected to move.
Consider a sequence of n positive real numbers, (a_i), and its partial sum sequence, (s_i). Given a number x ∊ (0, s_n], we have to find i such that s_{i-1} < x ≤ s_i. Also we want to be able to change one of the a_i's without having to update all partial sums.
Both can be done in O(log n) time by using a binary tree with the a_i's as leaf node values, and the values of the non-leaf nodes being the sum of the values of the respective children. If n is known and fixed, the tree doesn't have to be self-balancing and can be stored efficiently in a linear array. Furthermore, if n is a power of two, only 2n − 1 array elements are required. See Blue et al., Phys. Rev. E 51 (1995), pp. R867–R868 for an application.
Given the genericity of the problem and the simplicity of the solution, I wonder whether this data structure has a specific name and whether there are existing implementations (preferably in C++). I've already implemented it myself, but writing data structures from scratch always seems like reinventing the wheel to me; I'd be surprised if nobody had done it before.
This is known as a finger tree in functional programming, but apparently there are implementations in imperative languages. In the articles there is a link to a blog post explaining an implementation of this data structure in C#, which could be useful to you.
A Fenwick tree (aka binary indexed tree) is a data structure that maintains a sequence of elements and can compute the cumulative sum of any range of consecutive elements in O(log n) time. Changing the value of any single element takes O(log n) time as well.
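For example, here is a minimal Python sketch of a Fenwick tree covering the three operations the question needs: point update, prefix sum, and finding the index i with s_{i-1} < x ≤ s_i (1-based indices; the search assumes all a_i are positive):

class Fenwick:
    def __init__(self, n):
        self.n = n
        self.tree = [0.0] * (n + 1)          # 1-based; tree[i] covers a block ending at i

    def add(self, i, delta):                 # a_i += delta, O(log n)
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):                 # s_i = a_1 + ... + a_i, O(log n)
        s = 0.0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def find(self, x):                       # smallest i with s_i >= x, O(log n)
        i, step = 0, 1
        while step * 2 <= self.n:
            step *= 2
        while step:
            if i + step <= self.n and self.tree[i + step] < x:
                i += step                    # extend the prefix: its sum is still < x
                x -= self.tree[i]
            step //= 2
        return i + 1

f = Fenwick(8)
for idx, a in enumerate([0.5, 1.0, 0.25, 2.0, 1.0, 0.5, 0.75, 1.5], start=1):
    f.add(idx, a)
print(f.prefix_sum(4))   # 3.75
print(f.find(3.8))       # 5, since s_4 = 3.75 < 3.8 <= s_5 = 4.75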