Create a Random Heap from a List

I'm doing some college coursework in Haskell at the moment, and I'm stuck on this. The data type is the following:
data Heap a = Empty | Node a (Heap a) (Heap a)
(In this exercise, Heap a is always a min-max heap: each node at an even level in the tree is less than all of its descendants, while each node at an odd level in the tree is greater than all of its descendants.)
The Question:
"randomHeap :: [a] -> IO (Heap a) which generates a random heap, containing all the elements from the argument list.
"Use the function randomRIO :: Random t => (t,t) -> IO t (which is present in System.Random) to calculate a random value within a certain range.
"Note that for a non empty list, the choice of which element to use for the root is unique, what should vary are which elements are used to build the sub-trees."

I'll assume the heap shall be selected uniformly at random from all possibilities. Let's investigate the number f(n) of possible min-max heaps with n nodes. A uniquely determined node goes at the root; there is no choice in this. We then choose a subset of the remaining nodes to go on the left side and repeat this for both children, giving f(n+1) = sum_{k=0}^{n} (n choose k) * f(k) * f(n-k), with f(0) = 1 (the empty heap). The first few terms are 1, 1, 2, 6, 24 and 120, which suggests the factorial sequence (and the proof of sequence equality is child's play); the factorials also count the possible orders of our input list. Therefore we can hope that each list order corresponds to a min-max heap.
Having mulled it over for a few minutes, it now seems obvious: after shuffling the list, we split it at its minimum element, taking the part of the list to the minimum's left as the subset mentioned above; then we split each half at its maximum element, each quarter at its minimum element, and so on. Different list orders yield different heaps, and since both finite sets have size n!, this correspondence is one-to-one.
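The exercise asks for Haskell, but here is a minimal sketch of the same splitting recursion in C++ for concreteness, assuming distinct elements; Heap, build and randomHeap are hypothetical names mirroring the exercise's types.

#include <algorithm>
#include <random>
#include <vector>

// Hypothetical mirror of `data Heap a = Empty | Node a (Heap a) (Heap a)`;
// a null pointer plays the role of Empty.
struct Heap {
    int value;
    Heap* left;
    Heap* right;
};

// Build a min-max heap from [first, last): split at the minimum on even
// levels and at the maximum on odd levels; the part before the pivot
// becomes the left subtree, the part after it the right subtree.
// Assumes distinct elements, as a shuffle of distinct values.
Heap* build(std::vector<int>::iterator first, std::vector<int>::iterator last,
            bool minLevel) {
    if (first == last) return nullptr;
    auto pivot = minLevel ? std::min_element(first, last)
                          : std::max_element(first, last);
    return new Heap{*pivot,
                    build(first, pivot, !minLevel),
                    build(pivot + 1, last, !minLevel)};
}

// Shuffle uniformly, then build: each of the n! input orders yields a
// distinct heap, so the resulting heap is uniform as well.
Heap* randomHeap(std::vector<int> xs) {
    std::mt19937 gen{std::random_device{}()};
    std::shuffle(xs.begin(), xs.end(), gen);
    return build(xs.begin(), xs.end(), true);
}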

You're probably confused about the definition of a heap.
You must choose the root, which is the max/min element of the argument, depending on the type of heap. Then use randomRIO to choose which elements go to each node.

Related

Time complexity of operations on Order Statistics Tree and 1D points problem

I ran into the following interview question:
We need a data structure to keep n points on the X-axis such that we get efficient implementations of Insert(x), Delete(x), and Find(a, b) (giving the number of points in the interval [a, b]). Assume that the maximum number returned by Find(a, b) is k.
1. We can create a data structure that performs all three operations in O(log n).
2. We can create a data structure that performs Insert and Delete in O(log n) and Find in O(k + log n).
I know from general information that Find is like a range query on 1D points (but counting elements in this question, i.e., we need the number of elements). If we use, for example, an AVL tree, then we get the time complexities of option (2).
But I was surprised when told that (1) is the correct answer. Why is (1) the right answer?
The answer is indeed (1).
The idea of an AVL tree is fine, and your conclusions are right. But you can extend the AVL tree so that each node has one extra property: the number of nodes in its left subtree, i.e., how many values within its subtree precede its own value. You would have to take care in the AVL operations (including rotations) that this extra property is kept up to date. But this can be done with a constant overhead, so it does not impact the time complexities of Insert or Delete.
Then Find can search for the node with value a (or the one with the greatest value less than a), and do the same for value b. During each descent you accumulate the stored left-subtree counts, which gives you the rank of each node found. The subtraction of these two ranks gives the required result. There are some boundary cases to take into consideration, like when a is present in the tree, in which case that node itself should be counted too, otherwise not. It may also be that no node is found with a value less than or equal to a; then the missing rank should be taken as 0 in the subtraction.
Clearly this makes Find independent of the size of its result (the k from option 2). The two binary searches give it a time complexity of O(log n).
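If you're on GCC, its policy-tree extension already implements exactly this augmentation (an order-statistics red-black tree). A minimal sketch; find_count is a hypothetical helper for this problem, not part of the extension:

#include <cstddef>
#include <functional>
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// Order-statistics tree: insert/erase in O(log n), plus order_of_key(x),
// which returns how many stored values are strictly less than x.
using ost = __gnu_pbds::tree<int, __gnu_pbds::null_type, std::less<int>,
                             __gnu_pbds::rb_tree_tag,
                             __gnu_pbds::tree_order_statistics_node_update>;

// Number of stored points in [a, b], as a difference of two rank queries;
// both queries are O(log n), independent of how many points lie in [a, b].
// (Assumes b < INT_MAX so that b + 1 does not overflow.)
std::size_t find_count(const ost& t, int a, int b) {
    return t.order_of_key(b + 1) - t.order_of_key(a);
}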

how to select least N elements with limited space?

The problem:
A function f returns elements one at a time in an unknown order. I want to select the least N elements. Function f is called many times (I'm searching through a very complex search space), and I don't have enough memory to store every output element for later sorting.
The obvious solution:
Keep a vector of N elements in memory, and on each call to f() search for the minimum and maximum and possibly replace something. This would probably work well for very small N. I'm looking for a more general solution, though.
My solution so far:
I thought about using a priority_queue to store, let's say, 2N values, and removing the greatest half after every 2N steps.
Pseudocode:
while (search goes on)
    for (i = 0 .. 2N)
        el = f()
        push el onto the priority queue
    remove the N greatest elements from the priority queue
select the N least elements from the priority queue
I think this should work; however, I don't find it elegant at all. Maybe there is already some kind of data structure that handles this problem. It would be really nice to just modify the priority_queue so that it throws away the elements that don't fit into the saved range.
Could you recommend an existing std data structure for C++, or encourage me to implement the solution I suggested above? Or maybe there is some great and elegant trick that I can't think of.
You want to find the least n elements among the k elements obtained by calling a function. Each time you call f() you get one element, and you want to keep the least n seen so far without storing all k of them, since k is too big.
You can use a heap or priority_queue to store the least n found so far. Just add each item returned from f() to the queue, and pop the greatest element whenever the size reaches n+1.
The total complexity is O(k log n) and the space needed is O(n) (ignoring some extra space required by the priority queue).
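A minimal sketch of this in C++; leastN, the generator parameter f, and the element count k are hypothetical names, not from the question:

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Keep the n least elements seen so far in a max-heap, so the greatest of
// the kept elements is always on top and cheap to evict.
std::vector<int> leastN(std::size_t n, std::size_t k, std::function<int()> f) {
    std::priority_queue<int> pq;            // max-heap by default
    for (std::size_t i = 0; i < k; ++i) {
        pq.push(f());
        if (pq.size() > n)
            pq.pop();                       // evict the current greatest
    }
    std::vector<int> out;                   // the n least, greatest first
    for (; !pq.empty(); pq.pop())
        out.push_back(pq.top());
    return out;
}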
An alternative would be to use an array. Depending on the maximum allowed number of elements compared to N, there are two options I can think of:
1. Make the array as big as possible, keep it unsorted, and periodically retrieve the smallest elements.
2. Keep an array of size N, sorted with the maximum element at the end.
Option 1 would have you sort the array in O(n log n) time every time it fills up. That happens once per n - N new elements (except the first time), yielding (k - n) / (n - N) sorts and O(((k - n) / (n - N)) * n log n) total time, for k total elements, n elements in the array, and N elements to be selected. For n = 2N, that is O(2 * (k - 2N) * log 2N), if I'm not mistaken.
Option 2 would have you keep the array (sized N) sorted with the maximum elements at the end. Each time you get an element, you can quickly (O(1)) check whether it is smaller than the last one. Using binary search, you can find the right spot for the element in O(log N) time. However, you then need to move all the elements after the new element one place to the right, which takes O(N) time. So you end up with a theoretical O(k*N) time complexity. Given that computers like homogeneous data access patterns (caches and such), this might still be faster than a heap, even an array-backed one.
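A minimal sketch of option 2, assuming a fixed capacity N; offer and buf are hypothetical names:

#include <algorithm>
#include <cstddef>
#include <vector>

// Fixed-capacity sorted buffer, ascending, so the maximum sits at the end.
void offer(std::vector<int>& buf, std::size_t N, int x) {
    if (buf.size() == N && x >= buf.back())
        return;                                             // O(1) reject
    auto pos = std::upper_bound(buf.begin(), buf.end(), x); // O(log N) search
    buf.insert(pos, x);                                     // O(N) shift
    if (buf.size() > N)
        buf.pop_back();                                     // drop the maximum
}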
If your elements are big, you might be better off storing a structure of { comparison_value; actual_element_pointer }, even if you are using a heap (unless it is list-backed).

Rank-Preserving Data Structure other than std::vector?

I am faced with an application where I have to design a container that has random access (or at least better than O(n)), has inexpensive (O(1)) insert and removal, and stores the data according to the order (rank) specified at insertion.
For example if I have the following array:
[2, 9, 10, 3, 4, 6]
I can call remove on index 2 to remove the 10, and I can also call insert on index 1 to insert 13.
After those two operations I would have:
[2, 13, 9, 3, 4, 6]
The numbers are stored in a sequence and insert/remove operations require an index parameter to specify where the number should be inserted or which number should be removed.
My question is: what kind of data structures, besides a linked list and a vector, could maintain something like this? I am leaning towards a heap that prioritizes on the next available index, but I have also seen the Fusion Tree mentioned as useful (though more in a theoretical sense).
What kind of data structure would give me the most optimal running time while still keeping memory consumption down? I have been playing around with an insertion-order-preserving hash table, but it has been unsuccessful so far.
The reason I am tossing out using a std::vector straight up is because I must construct something that outperforms a vector at these basic operations. The size of the container has the potential to grow to hundreds of thousands of elements, so committing to shifts in a std::vector is out of the question. The same problem applies to a linked list (even a doubly linked one): traversing it to a given index would take O(n/2) in the worst case, which is still O(n).
I was thinking of a doubly linked list that kept a Head, Tail, and Middle pointer, but I felt that it wouldn't be much better.
For basic insertion and deletion at an arbitrary position, you can use linked lists. They allow O(1) insert/remove, but only provided that you have already located the position in the list where the change should happen. You can insert "after a given element" (that is, given a pointer to an element), but you cannot as efficiently insert "at a given index".
To be able to insert and remove an element given its index, you will need a more advanced data structure. There exist at least two such structures that I am aware of.
One is the rope structure, which is available in some C++ extensions (SGI STL, or in GCC via #include <ext/rope>). It allows O(log N) insert/remove at an arbitrary position.
Another structure allowing O(log N) insert/remove is an implicit treap (aka implicit cartesian tree); you can find some information at http://codeforces.com/blog/entry/3767, Treap with implicit keys, or https://codereview.stackexchange.com/questions/70456/treap-with-implicit-keys.
An implicit treap can also be modified to support finding the minimal value in it (and much more besides). I'm not sure whether a rope can handle this.
UPD: In fact, I guess you can adapt any O(log N) binary search tree (such as an AVL or red-black tree) for this purpose by converting it to the "implicit key" scheme. A general outline is as follows.
Imagine a binary search tree which, at each given moment, stores the consecutive numbers 1, 2, ..., N as its keys (N being the number of nodes in the tree). Every time we change the tree (insert or remove a node), we would recalculate all the stored keys so that they again run from 1 to the new value of N. This would allow insert/remove at an arbitrary position, as the key is now the position, but it would require too much time to update all the keys.
To avoid this, we do not store keys in the tree explicitly. Instead, for each node, we store the number of nodes in its subtree. As a result, any time we go down from the root, we can keep track of the index (position) of the current node: we just sum the sizes of the subtrees that we leave to our left. This allows us, given k, to locate the node that has index k (that is, the k-th node in the standard in-order traversal) in O(log N) time. After this, we can perform an insert or delete at this position using the standard binary tree procedure; we just need to update the subtree sizes of all the nodes changed during the operation, but this is easily done in O(1) per changed node, so the total insert or remove time is O(log N), as in the original binary search tree.
So this approach allows insert/remove/access at a given position in O(log N) time, using any O(log N) binary search tree as a basis. You can of course store additional information ("values") in the nodes, and you can even compute the minimum of these values over any subtree by also keeping, in each node, the minimum within its subtree.
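A minimal sketch of just the descent step, with balancing omitted for brevity; Node, sz and kth are illustrative names:

#include <cstddef>

// Plain BST node augmented with its subtree size; in a real structure the
// rebalancing (AVL/RB/treap) must also keep `size` up to date.
struct Node {
    int value;
    std::size_t size = 1;                    // nodes in this subtree
    Node *left = nullptr, *right = nullptr;
};

std::size_t sz(const Node* n) { return n ? n->size : 0; }

// Locate the k-th node (0-based) in O(height): the size of the left subtree
// tells us whether to go left, stop here, or go right with k reduced.
Node* kth(Node* n, std::size_t k) {
    while (n) {
        std::size_t l = sz(n->left);
        if (k < l)        n = n->left;
        else if (k == l)  return n;
        else { k -= l + 1; n = n->right; }
    }
    return nullptr;                          // k out of range
}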
However, the aforementioned treap and rope are more advanced, as they also allow split and merge operations (taking a substring/subarray and concatenating two strings/arrays).
Consider a skip list, which can implement logarithmic (expected) time rank operations in its "indexable" variation.
For algorithms (pseudocode), see A Skip List Cookbook, by Pugh.
It may be that the "implicit key" binary search tree method outlined by @Petr above is easier to get to, and may even perform better.

Achieve a task using a single loop instead of 2 (C++)

This is task-specific code for which I want to know if there's a better way of doing it. People who love logic and coding, please help me out.
This is the question:
Let A be an array of n positive integers. All the elements are distinct.
If A[i] > A[j] and i < j, then the pair (i, j) is called a special pair of A. Given A, find the number of special pairs of A.
It's pretty simple and straightforward. Here's the solution I implemented (the logic part):
for (int j = 0; j < nos.size(); j++)
{
    for (int k = j + 1; k < nos.size(); k++) // always maintain the condition i < j and simply compare the numbers
    {
        if (nos[j] > nos[k])
        {
            spc++; // count a special pair
        }
    }
}
nos is the array for which the special pairs are to be computed. Is there a way to do this using a single loop? Or any other logic that saves time and runs faster? Thanks in advance; I seek to learn more from this.
And can you please tell me how I can determine which code is faster without having to execute it? Your inputs are really welcome.
The main goal is to compute the number of special pairs, i.e., just the final value of spc.
I think @syam is correct that this can be done in O(N log N) time (and O(N) extra space).
You'd do this with a balanced binary tree where each node has not only a value, but also a count of the number of descendants in its left sub-tree.
To count the special pairs, you walk the array from the end to the beginning and insert each item into the tree. As you insert an item, you find the number of items in its left sub-tree: those are the items that are less than it but were to its right in the array (i.e., each one represents a special pair). Since we only descend through ~log(N) nodes to insert an item, finding the number of items to its left is also O(log N). We also have to update the left counts of approximately log(N)/2 nodes on the way down (again, logarithmic complexity).
That gives O(N log N) time.
Edit for more details: The balancing of the tree is fairly conventional (e.g., AVL- or RB-tree) with the addition of adjusting the count of items to the left as it does rotations to restore balance.
As you insert each item, you descend through the tree to the point where it's going to be inserted. At the root node, you simply record the count of items in its left sub-tree. Then let's say your new item is greater than that node, so you descend to the right. As you do this, you're doing two things: recording your current position, so you know the location of this node relative to the nodes already inserted, and updating the counts in the tree so you'll have accurate counts for later insertions.
So, let's work through a small sample. For the sake of argument, let's assume our input is [6, 12, 5, 9, 7]. So, our first insertion is 7, which becomes the root of our tree, with no descendants, and (obviously) 0 to its left.
Then we insert 9 to its right. Since it's to the right, we don't need to adjust any counts during the descent -- we just increment our count of items to the left. That's it, so we know for 9 we have one special pair ([9,7], though we haven't kept track of that).
Then we insert 5. This is to the left of 7, so as we descend from 7, we increment its count of items to the left to 1. We insert 5, with no items to the left, so it gets a count of 0, and no special pairs.
Then we insert 12. When we hit the root node (7), it has a count of 1 item to the left. We're descending to the right, so we increment again for the root node itself. Then we descend to the right again, from 9, so we add one more (+0 from its left sub-tree); so 12 has three special pairs.
Then we insert 6. We descend left from 7, so we don't add anything from it. We descend right from 5, so we add 1 (again, +0 from its left sub-tree). So it has one special pair.
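A sketch of this counting pass, with the balancing omitted for brevity (with AVL/RB rotations that also maintain the counts, the worst case stays O(N log N)); Node, insertAndRank and specialPairs are illustrative names, and elements are assumed distinct, as the problem states:

#include <cstddef>
#include <vector>

// BST node that remembers how many nodes sit in its left subtree.
struct Node {
    int value;
    std::size_t leftCount = 0;
    Node *left = nullptr, *right = nullptr;
};

// Insert v and return how many values already in the tree are less than v.
std::size_t insertAndRank(Node*& root, int v) {
    std::size_t rank = 0;
    Node** cur = &root;
    while (*cur) {
        if (v < (*cur)->value) {
            ++(*cur)->leftCount;             // v will land in the left subtree
            cur = &(*cur)->left;
        } else {
            rank += (*cur)->leftCount + 1;   // that node plus its left subtree
            cur = &(*cur)->right;
        }
    }
    *cur = new Node{v};
    return rank;
}

// Walk right to left; each rank counts the special pairs whose first
// element is a[i].
std::size_t specialPairs(const std::vector<int>& a) {
    Node* root = nullptr;
    std::size_t spc = 0;
    for (std::size_t i = a.size(); i-- > 0; )
        spc += insertAndRank(root, a[i]);
    return spc;
}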
Even when you need to generate all the special pairs (not just count them) you can expect the tree to improve speed in the average case (i.e., pretty much anything except sorted in descending order). To generate all the special pairs, we insert each item in the tree as before, then traverse through the tree to the left of that item. Where the naive algorithm traversed (and compare to) all the elements to the right in the array to find those that would be special pairs, this only has to traverse the tree to find those that actually are special pairs.
This does have one side effect though: it generates the pairs in a different order. Instead of each pair being generated in the order it occurred in the array, the pairs will be generated in descending order by the second element. For example, given an input like [4, 1, 2, 3], the naive algorithm would produce [[4,1], [4,2], [4,3]], but this will produce [[4,3], [4,2], [4,1]].
You mean, can you do better than quadratic runtime? No. To see this, consider the decreasing sequence A = (N, N-1, ..., 2, 1). For this sequence, every pair (i, j) with i < j is special, and there are Θ(N^2) such pairs. Since you must output every special pair, you need quadratic time to do so.
I don't think you can improve your algorithm; in my opinion it has O(n²) complexity. You can prove that by counting the number of inner-loop iterations your program performs for an array of length n. The first outer iteration does n-1 comparisons, the second n-2, the third n-3, and so on. Summing this up using the formula for the sum of the first n-1 integers:
(n-1) + (n-2) + (n-3) + ... + 2 + 1 = n*(n-1)/2. You don't have to run the outer loop for the last remaining element, since there is no other element left to compare it with. Actually, that is the only improvement I see for your algorithm ;-):
for (int j = 0; j < nos.size(); j++)
to
for (int j = 0; j < nos.size() - 1; j++)
For large n, the expression n*(n-1)/2 behaves like n², and that's where I believe the O(n²) comes from. Please correct me if I'm wrong.

Fast Algorithm for finding largest values in 2d array

I have a 2D array (an image, actually) of size N x N. I need to find the indices of the M largest values in the array (M << N x N). A linearized index or the 2D coords are both fine. The array must remain intact (since it's an image); I can make a copy for scratch, but sorting the array would bugger up the indices.
I'm fine with doing a full pass over the array (i.e., O(N^2) is fine). Anyone have a good algorithm for doing this as efficiently as possible?
Selection is sorting's austere sister (repeat this ten times in a row). Selection algorithms are less known than sort algorithms, but nonetheless useful.
You can't do better than O(N^2) (in N) here, since nothing lets you avoid visiting each element of the array at least once.
A good approach is to keep a priority queue made of the M largest elements seen so far. This makes the whole thing O(N x N x log M).
You traverse the array, enqueuing (element, index) pairs as you go. The queue keeps its elements sorted by the first component.
Once the queue has M elements, instead of enqueuing you now:
Query the min element of the queue
If the current element of the array is greater, insert it into the queue and discard the min element of the queue
Else do nothing.
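A minimal sketch of this queue-based scan, assuming a row-major N*N buffer; largestM and img are hypothetical names:

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Min-heap of (value, linear index) pairs capped at M: the smallest of the
// kept candidates is always on top, so replacing it costs O(log M).
std::vector<std::size_t> largestM(const std::vector<float>& img, std::size_t M) {
    using Entry = std::pair<float, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    for (std::size_t i = 0; i < img.size(); ++i) {   // one contiguous pass
        if (pq.size() < M)
            pq.push({img[i], i});
        else if (img[i] > pq.top().first) {          // beats the current min
            pq.pop();
            pq.push({img[i], i});
        }
    }
    std::vector<std::size_t> idx;                    // indices, ascending by value
    for (; !pq.empty(); pq.pop())
        idx.push_back(pq.top().second);
    return idx;
}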
If M is bigger, sorting the array is preferable.
NOTE: @Andy Finkenstadt makes a good point (in the comments to your question): you definitely should traverse your array in the "direction of data locality", making sure that you read memory contiguously.
Also, this is trivially parallelizable; the only non-parallelizable part is merging the queues when joining the sub-processes.
You could copy the array into a single-dimensioned array of tuples (value, original X, original Y) and build a basic heap out of it in O(n) time, provided you implement the heap as an array.
You could then retrieve the M largest tuples in O(M lg n) time and reference their original x and y from the tuple.
If you are going to make a copy of the input array in order to do a sort, that's way worse than just walking linearly through the whole thing to pick out numbers.
So the question is: how big is your M? If it is small, you can store the results (i.e., structs with 2D indexes and values) in a simple array or a vector. That minimizes heap operations, but when you find a value larger than what's in your vector, you'll have to shift things around.
If you expect M to get really large, then you may need a better data structure, like a binary tree (std::set) or a sorted std::deque. std::set will reduce the number of times elements must be shifted in memory, while std::deque will do some shifting but will significantly reduce the number of trips to the heap, which may give you better performance.
Your problem doesn't use the 2 dimensions in any interesting way; it is easier to consider the equivalent problem on a 1D array.
There are 2 main ways to solve this problem:
Maintain a set of the M largest elements, and iterate through the array (using a heap allows you to do this efficiently).
This is simple and is probably better in your case (M << N).
Use selection (the following algorithm is an adaptation of quicksort; a sketch using the standard library follows below):
Create an auxiliary array containing the indexes [1..N].
Choose an arbitrary index (and corresponding value), and partition the index array so that indexes corresponding to smaller elements go to the left, and those of bigger elements go to the right.
Repeat the process, binary-search style, until you narrow down the M largest elements.
This is good for cases with large M. If you want to avoid the worst-case issues (the same ones quicksort has), then look at more advanced algorithms, like median-of-medians selection.
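A minimal sketch of the selection option using std::nth_element (introselect under the hood), which partitions an index array without touching the image itself; largestMSelect and img are hypothetical names, and M is assumed to be at most the element count:

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Partition an index array so the indexes of the M largest values land in
// the first M slots, in average time linear in the total element count.
std::vector<std::size_t> largestMSelect(const std::vector<float>& img,
                                        std::size_t M) {
    std::vector<std::size_t> idx(img.size());
    std::iota(idx.begin(), idx.end(), 0);        // indexes 0 .. N-1
    std::nth_element(idx.begin(), idx.begin() + M, idx.end(),
                     [&](std::size_t a, std::size_t b) {
                         return img[a] > img[b]; // descending by pixel value
                     });
    idx.resize(M);                               // keep only the M largest
    return idx;
}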
How many times do you search for the largest values in the array?
If you only search once, just scan through it, keeping the M largest ones.
If you do it many times, insert the values into a sorted list (probably best implemented as a balanced tree).