How to select the least N elements with limited space? - C++

The problem:
A function f returns elements one at a time in an unknown order. I want to select the least N elements. Function f is called many times (I'm searching through a very complex search space) and I don't have enough memory to store every output element for later sorting.
The obvious solution:
Keep a vector of N elements in memory and, on each call to f(), search for the minimum and maximum and possibly replace something. This would probably work well for very small N. I'm looking for a more general solution, though.
My solution so far:
I thought about using a priority_queue to store, let's say, 2N values and removing the upper half after every 2N steps.
Pseudocode:
while (search goes on)
    for (i = 0 .. 2N)
        el = f()
        push el to the priority queue
    remove the N greatest elements from the priority queue
select the N least elements from the priority queue
I think this should work; however, I don't find it elegant at all. Maybe there is already some kind of data structure that handles this problem. It would be really nice just to modify the priority_queue so that it throws away the elements that don't fit into the saved range.
Could you recommend an existing std data structure for C++, or encourage me to implement the solution I suggested above? Or maybe there is some great and elegant trick that I can't think of.

You want to find the least n elements among a total of k elements obtained by calling a function. Each call to f() returns one element, and you want to keep the least n elements seen so far without storing all k of them, since k is too big.
You can use a heap or priority_queue to store the least n found so far. Just add each item returned from f() to the priority queue and pop the greatest element whenever its size reaches n+1.
Total complexity would be O(k log n) and the space needed would be O(n) (ignoring the extra space required by the priority queue).
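For illustration, a minimal sketch of this in C++, where get_next() stands in for the asker's f() and double is an assumed element type:

#include <cstddef>
#include <queue>
#include <vector>

double get_next();   // stands in for the asker's f(), declared elsewhere (assumption)

// Keep the n smallest values seen so far in a max-heap of size at most n.
std::vector<double> least_n(std::size_t n, std::size_t total_calls) {
    std::priority_queue<double> worst_on_top;   // max-heap: greatest element on top
    for (std::size_t i = 0; i < total_calls; ++i) {
        worst_on_top.push(get_next());          // add the new element
        if (worst_on_top.size() > n)
            worst_on_top.pop();                 // size reached n+1: drop the greatest
    }
    std::vector<double> result;
    while (!worst_on_top.empty()) {
        result.push_back(worst_on_top.top());   // the n smallest, largest first
        worst_on_top.pop();
    }
    return result;
}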

An alternative would be to use an array. Depending on the maximum allowed number of elements compared to N, there are two options I can think of:
Make the array as big as possible and unsorted, periodically retrieve the smallest elements.
Have an array of size N, sorted with max elements on the end.
Option 1 would have you sort the array in O(n log n) time every time it fills up. That would happen for every n - N elements (except the first time), yielding (k - n) / (n - N) sorts and a time complexity of O((k - n) / (n - N) * n log n), where k is the total number of elements, n the array size and N the number of elements to be selected. So for n = 2N, you get O(2 (k - 2N) log 2N) time complexity if I'm not mistaken.
Option 2 would have you keep the array (sized N) sorted with the maximum elements at the end. Each time you get an element, you can quickly (O(1)) see if it is smaller than the last one. Using binary search, you can find the right spot for the element in O(log N) time. However, you now need to move all the elements after the new element one place to the right, which takes O(N) time. So you end up with a theoretical O(k * N) time complexity. Given that computers like working with homogeneous data accesses (caches and such), this might still be faster than a heap, even if the heap is array-backed.
If your elements are big, you might be better off storing a structure of { comparison_value; actual_element_pointer }, even if you are using a heap (unless it is list-backed).
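A minimal sketch of option 2 above, assuming double elements; offer() is a hypothetical helper name, not from the answer:

#include <algorithm>
#include <cstddef>
#include <vector>

// Fixed-capacity sorted buffer of the N smallest values, largest at the end.
void offer(std::vector<double>& best, std::size_t N, double el) {
    if (best.size() == N && el >= best.back())
        return;                                   // O(1) rejection of most elements
    auto pos = std::lower_bound(best.begin(), best.end(), el);   // O(log N)
    best.insert(pos, el);                         // shifts the tail right, O(N)
    if (best.size() > N)
        best.pop_back();                          // drop the current maximum
}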


To find sum of all consecutive sub-array of length k in a given array

I want to find the sums of all contiguous sub-arrays of length k
for a given array of length n, given that k < n. For example, let the given array be arr[6] = {1,2,3,4,5,6} and k = 3; then the answer is (6, 9, 12, 15).
It can be obtained as :
(1+2+3)=6,
(2+3+4)=9,
(3+4+5)=12,
(4+5+6)=15.
I have tried this using a sliding window of length k, but its time complexity is O(n). Is there any solution which takes even less time, such as O(log n)?
Unless you know certain specific properties of the array (e.g. the ordering of the elements, the range of the elements included in the array, etc.) then you would need to check each individual value, resulting in an O(n) complexity.
If, for instance, you knew that the sum of the values in the array was T (perhaps because you knew T itself or were given the range), then you could use the fact that every element except the first and last (K-1) elements is included in exactly K different sums. This would give a total of T*K minus some amount, and you could subtract the first and last K values the appropriate number of times, resulting in an algorithm of complexity O(K).
But note that, in order to achieve a strategy similar to this, you would have to know some other specific information regarding the values in the array, be that their range or their sum.
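For reference, a minimal sketch of the O(n) sliding window the question already describes, assuming int values:

#include <cstddef>
#include <vector>

// Sums of every contiguous window of length k, computed in O(n):
// slide the window by adding the new element and removing the old one.
std::vector<int> window_sums(const std::vector<int>& arr, std::size_t k) {
    std::vector<int> sums;
    if (k == 0 || arr.size() < k) return sums;
    int sum = 0;
    for (std::size_t i = 0; i < k; ++i) sum += arr[i];   // first window
    sums.push_back(sum);
    for (std::size_t i = k; i < arr.size(); ++i) {
        sum += arr[i] - arr[i - k];                      // slide by one position
        sums.push_back(sum);
    }
    return sums;   // {6, 9, 12, 15} for arr = {1,2,3,4,5,6}, k = 3
}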
You can use the segment tree data structure. Building it takes O(n), and then you can find the sum of any interval in O(log n) and modify any element of the array in O(log n).
https://en.wikipedia.org/wiki/Segment_tree
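A compact iterative sketch of such a tree for range sums, assuming int values and a non-empty input array (names are illustrative):

#include <cstddef>
#include <vector>

// Iterative segment tree over n values: tree[1] is the root, leaves live at
// indices [n, 2n), and every internal node is the sum of its two children.
struct SegTree {
    std::size_t n;
    std::vector<long long> tree;

    explicit SegTree(const std::vector<int>& a)      // assumes a is non-empty
        : n(a.size()), tree(2 * a.size(), 0) {
        for (std::size_t i = 0; i < n; ++i) tree[n + i] = a[i];
        for (std::size_t i = n - 1; i > 0; --i) tree[i] = tree[2 * i] + tree[2 * i + 1];
    }
    void update(std::size_t pos, int value) {        // point assignment, O(log n)
        for (tree[pos += n] = value; pos > 1; pos /= 2)
            tree[pos / 2] = tree[pos] + tree[pos ^ 1];
    }
    long long query(std::size_t l, std::size_t r) const {   // sum over [l, r), O(log n)
        long long s = 0;
        for (l += n, r += n; l < r; l /= 2, r /= 2) {
            if (l & 1) s += tree[l++];
            if (r & 1) s += tree[--r];
        }
        return s;
    }
};

Note that answering all n - k + 1 window-sum queries this way still costs O(n log n) in total, so the tree only pays off when elements are modified between queries.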

Performance of vector sort/unique/erase vs. copy to unordered_set

I have a function that gets all neighbours of a list of points in a grid out to a certain distance, which involves a lot of duplicates (my neighbour's neighbour == me again).
I've been experimenting with a couple of different solutions, but I have no idea which is the more efficient. Below is some code demonstrating two solutions running side by side, one using std::vector sort-unique-erase, the other using std::copy into a std::unordered_set.
I also tried another solution, which is to pass the vector containing the neighbours so far to the neighbour function, which will use std::find to ensure a neighbour doesn't already exist before adding it.
So three solutions, but I can't quite wrap my head around which is gonna be faster. Any ideas anyone?
Code snippet follows:
// Vector of all neighbours of all modified phi points, which may initially include duplicates.
std::vector<VecDi> aneighs;
// Hash function, mapping points to their norm distance.
auto hasher = [&] (const VecDi& a) {
    return std::hash<UINT>()(a.squaredNorm() >> 2);
};
// Unordered set for storing neighbours without duplication.
std::unordered_set<VecDi, decltype(hasher)> sneighs(phi.dims().squaredNorm() >> 2, hasher);
... compute big long list of points including many duplicates ...
// Insert neighbours into unordered_set to remove duplicates.
std::copy(aneighs.begin(), aneighs.end(), std::inserter(sneighs, sneighs.end()));
// De-dupe neighbours list.
// TODO: is this method faster or slower than unordered_set?
std::sort(aneighs.begin(), aneighs.end(), [&] (const VecDi& a, const VecDi& b) {
    const UINT aidx = Grid<VecDi, D>::index(a, phi.dims(), phi.offset());
    const UINT bidx = Grid<VecDi, D>::index(b, phi.dims(), phi.offset());
    return aidx < bidx;
});
aneighs.erase(std::unique(aneighs.begin(), aneighs.end()), aneighs.end());
A great deal here is likely to depend on the size of the output set (which, in turn, will depend on how distant the neighbors you sample are).
If it's small (no more than a few dozen items or so), your hand-rolled set implementation using std::vector and std::find will probably remain fairly competitive. Its problem is that it's an O(N^2) algorithm -- each time you insert an item, you have to search all the existing items, so each insertion is linear in the number of items already in the set. Therefore, as the set grows larger, its time to insert items grows roughly quadratically.
Using std::set, each insertion has to do only approximately log2(N) comparisons instead of N comparisons. That reduces the overall complexity from O(N^2) to O(N log N). The major shortcoming is that it's (at least normally) implemented as a tree built up of individually allocated nodes. That typically reduces its locality of reference -- i.e., each item you insert will consist of the data itself plus some pointers, and traversing the tree means following pointers around. Since they're allocated individually, chances are pretty good that nodes that are (currently) adjacent in the tree won't be adjacent in memory, so you'll see a fair number of cache misses. Bottom line: while its speed grows fairly slowly as the number of items increases, the constants involved are fairly large -- for a small number of items, it'll start out fairly slow (typically quite a bit slower than your hand-rolled version).
Using a vector/sort/unique combines some of the advantages of each of the preceding. Storing the items in a vector (without extra pointers for each) typically leads to better cache usage -- items at adjacent indexes are also at adjacent memory locations, so when you insert a new item, chances are that the location for the new item will already be in the cache. The major disadvantage is that if you're dealing with a really large set, this could use quite a bit more memory. Where a set eliminates duplicates as you insert each item (i.e., an item will only be inserted if it's different from anything already in the set) this will insert all the items, then at the end delete all the duplicates. Given current memory availability and the number of neighbors I'd guess you're probably visiting, I doubt this is a major disadvantage in practice, but under the wrong circumstances, it could lead to a serious problem -- nearly any use of virtual memory would almost certainly make it a net loss.
Looking at the last from a complexity viewpoint, it's going to be O(N log N), sort of like the set. The difference is that with the set it's really more like O(N log M), where N is the total number of neighbors, and M is the number of unique neighbors. With the vector, it's really O(N log N), where N is (again) the total number of neighbors. As such, if the number of duplicates is extremely large, a set could have a significant algorithmic advantage.
It's also possible to implement a set-like structure in purely linear sequences. This retains the set's advantage of only storing unique items, but also the vector's locality of reference advantage. The idea is to keep most of the current set sorted, so you can search it in log(N) complexity. When you insert a new item, however, you just put it in the separate vector (or an unsorted portion of the existing vector). When you do a new insertion you also do a linear search on those unsorted items.
When that unsorted part gets too large (for some definition of "too large") you sort those items and merge them into the main group, then start the same sequence again. If you define "too large" in terms of "log N" (where N is the number of items in the sorted group) you can retain O(N log N) complexity for the data structure as a whole. When I've played with it, I've found that the unsorted portion can be larger than I'd have expected before it starts to cause a problem though.
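As a rough illustration of that hybrid (not code from the answer), here is a sketch assuming an ordered, equality-comparable element type; the class name and merge threshold are arbitrary choices:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Mostly-sorted vector plus a small unsorted tail, merged periodically.
template <typename T>
class FlatSet {
    std::vector<T> sorted_;    // kept sorted, duplicate-free
    std::vector<T> pending_;   // small unsorted tail

public:
    void insert(const T& x) {
        // Reject duplicates: binary search the sorted part, linear scan the tail.
        if (std::binary_search(sorted_.begin(), sorted_.end(), x)) return;
        if (std::find(pending_.begin(), pending_.end(), x) != pending_.end()) return;
        pending_.push_back(x);
        // When the tail grows past roughly log2(sorted size), sort and merge it in.
        std::size_t limit = std::max<std::size_t>(
            8, static_cast<std::size_t>(std::log2(sorted_.size() + 2)));
        if (pending_.size() > limit) {
            std::sort(pending_.begin(), pending_.end());
            std::vector<T> merged;
            merged.reserve(sorted_.size() + pending_.size());
            std::merge(sorted_.begin(), sorted_.end(),
                       pending_.begin(), pending_.end(),
                       std::back_inserter(merged));
            sorted_.swap(merged);
            pending_.clear();
        }
    }
    std::size_t size() const { return sorted_.size() + pending_.size(); }
};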
An unordered set has constant, O(1), insertion time on average, so the whole operation is O(n), where n is the number of elements before duplicate removal.
Sorting a list of n elements is O(n log n), and going over the list to remove duplicates is O(n); O(n log n) + O(n) = O(n log n).
The unordered set (which behaves like a hash table in performance) is better.
Data about unordered_set complexity:
http://en.cppreference.com/w/cpp/container/unordered_set

Search Algorithm to find the k lowest values in a list

I have a list that contains n double values and I need to find the k lowest double values in that list
k is much smaller than n
the initial list with the n double values is randomly ordered
the found k lowest double values are not required to be sorted
What algorithm would you recommend?
At the moment I use Quicksort to sort the whole list, and then I take the first k elements out of the sorted list. I expect there should be a much faster algorithm.
Thank you for your help!!!
You could model your solution to match the nsmallest() code in Python's standard library.
Heapify the first k values on a maxheap.
Iterate over the remaining n - k values.
Compare each to the element at the top of the heap.
If the new value is lower, do a heapreplace operation (which replaces the topmost heap element with the new value and then sifts it downward).
The algorithm can be surprisingly efficient. For example, when n=100,000 and k=100, the number of comparisons is typically around 106,000 for randomly arranged inputs. This is only slightly more than 100,000 comparisons to find a single minimum value. And, it does about twenty times fewer comparisons than a full quicksort on the whole dataset.
The relative strength of various algorithms is studied and summarized at: http://code.activestate.com/recipes/577573-compare-algorithms-for-heapqsmallest
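A rough C++ translation of that scheme, using std::make_heap / std::pop_heap / std::push_heap on a plain vector (assuming the data sits in a std::vector<double>):

#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the k lowest values (unsorted), using a max-heap of size k.
std::vector<double> k_lowest(const std::vector<double>& data, std::size_t k) {
    if (k >= data.size()) return data;
    std::vector<double> heap(data.begin(), data.begin() + k);
    std::make_heap(heap.begin(), heap.end());            // max-heap of the first k
    for (std::size_t i = k; i < data.size(); ++i) {
        if (data[i] < heap.front()) {                    // better than current worst
            std::pop_heap(heap.begin(), heap.end());     // move the max to the back
            heap.back() = data[i];                       // replace it
            std::push_heap(heap.begin(), heap.end());    // restore the heap property
        }
    }
    return heap;
}

std::make_heap builds a max-heap by default, so heap.front() is always the largest of the k candidates kept so far.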
You can use a selection algorithm to find the kth lowest element and then iterate and return it together with all elements that are lower than it. More work has to be done if the list can contain duplicates (making sure you don't end up with more elements than you need).
This solution is O(n).
The selection algorithm is implemented in C++ as std::nth_element().
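For illustration, a sketch that works on a copy of the data, since std::nth_element rearranges its input:

#include <algorithm>
#include <cstddef>
#include <vector>

// Partitions a copy so the k lowest values end up in the first k positions.
std::vector<double> k_lowest_nth(std::vector<double> data, std::size_t k) {
    if (k >= data.size()) return data;
    std::nth_element(data.begin(), data.begin() + k, data.end());
    data.resize(k);        // keep only the k lowest (not sorted among themselves)
    return data;
}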
Another alternative is to use a max heap of size k, and iterate the elements while maintaining the heap to hold all k smallest elements.
for each element x:
    if (heap.size() < k):
        heap.add(x)
    else if x < heap.max():
        heap.pop()
        heap.add(x)
When you are done, the heap contains the k smallest elements.
This solution is O(n log k).
Take a look at partial_sort algorithm from C++ standard library.
You can use std::nth_element. This is O(N) complexity because it doesn't sort the elements; it just arranges them such that every element before the nth position is no greater than the element at that position.
You can use selection sort: it takes O(n) to select the first lowest value. Once we have put this lowest value in position 1, we can rescan the data set to find the second lowest value, and so on until we have the kth lowest value. This way, if k is much smaller than n, the complexity is O(kn), which behaves like O(n) when k is a small constant.

How to efficiently *nearly* sort a list?

I have a list of items; I want to sort them, but I want a small element of randomness so they are not strictly in order, only on average ordered.
How can I do this most efficiently?
I don't mind if the quality of the randomness is not especially good, e.g. if it is simply based on the chance ordering of the input, such as an early-terminated incomplete sort.
The context is implementing a nearly-greedy search by introducing a very slight element of inexactness; this is in a tight loop, so the cost of sorting and of calling random() both matter.
My current code is to do a std::sort (this being C++) and then do a very short shuffle just in the early part of the array:
for(int i=0; i<3; i++) // I know I have more than 6 elements
    std::swap(order[i], order[i+rand()%3]);
Use first two passes of JSort. Build heap twice, but do not perform insertion sort. If element of randomness is not small enough, repeat.
There is an approach that (unlike incomplete JSort) allows finer control over the resulting randomness and has time complexity dependent on randomness (the more random result is needed, the less time complexity). Use heapsort with Soft heap. For detailed description of the soft heap, see pdf 1 or pdf 2.
You could use a standard sort algorithm (is a standard library available?) and pass a predicate that "knows", given two elements, which is less than the other, or if they are equal (returning -1, 0 or 1). In the predicate then introduce a rare (configurable) case where the answer is random, by using a random number:
pseudocode:
if random(1000) == 0 then
    return random(2)-1   <-- -1, 0 or 1, randomly chosen
Here we have a 1/1000 chance to "scramble" two elements, but that number strictly depends on the size of the container you sort.
Another thing to add in the 1/1000 case could be to exclude the "right" answer, because that would not scramble the result!
Edit:
if random(100 * container_size) == 0 then   <-- here I consider the container size
{
    if element_1 < element_2
        return random(1);         <-- do not return the "correct" value of -1
    else if element_1 > element_2
        return random(1)-1;       <-- do not return the "correct" value of 1
    else
        return random(1)==0 ? -1 : 1;   <-- do not return 0
}
In my pseudocode:
random(x) = y, where 0 <= y <= x
One possibility that requires a bit more space but would guarantee that existing sort algorithms could be used without modification would be to create a copy of the sort value(s) and then modify those in some fashion prior to sorting (and then use the modified value(s) for the sort).
For example, if the data to be sorted is a simple character field Name[N] then add a field (assuming data is in a structure or class) called NameMod[N]. Fill in the NameMod with a copy of Name but add some randomization. Then 3% of the time (or some appropriate amount) change the first character of the name (e.g., change it by +/- one or two characters). And then 10% of the time change the second character +/- a few characters.
Then run it through whatever sort algorithm you prefer. The benefit is that you could easily change those percentages and randomness. And the sort algorithm will still work (e.g., it would not have problems with the compare function returning inconsistent results).
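As a rough sketch of that idea for numeric keys (the 3% probability and the noise range are arbitrary knobs, not values prescribed above):

#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

struct Item { double value; int payload; };

// Sort by a copied, slightly perturbed key so the comparator stays consistent
// and any standard sort can be used unmodified.
void nearly_sort(std::vector<Item>& items) {
    std::mt19937 rng{std::random_device{}()};
    std::bernoulli_distribution perturb(0.03);          // ~3% of keys get nudged
    std::uniform_real_distribution<double> nudge(-1.0, 1.0);

    std::vector<std::pair<double, std::size_t>> keyed;  // (modified key, index)
    keyed.reserve(items.size());
    for (std::size_t i = 0; i < items.size(); ++i) {
        double key = items[i].value;
        if (perturb(rng)) key += nudge(rng);            // copy plus small random offset
        keyed.emplace_back(key, i);
    }
    std::sort(keyed.begin(), keyed.end());              // ordinary, consistent sort

    std::vector<Item> out;
    out.reserve(items.size());
    for (const auto& k : keyed) out.push_back(items[k.second]);
    items.swap(out);
}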
If you are sure that each element is at most k positions away from where it should be, you can reduce quicksort's N log(N) sorting time complexity down to N log(k).
edit
More specifically, you would split the list into N/k buckets of k consecutive elements each.
You can quicksort each bucket, which takes O(k log k) per bucket and O(N log k) for all of them, and then order the N/k buckets themselves, which takes O((N/k) log(N/k)) time. Combining these, you can do the sorting in O(N log(max(N/k, k))).
This can be useful because you can run sorting for each bucket in parallel, reducing total running time.
This works if you are sure that every element in the list is at most k indices away from its correct position after sorting, but I do not think you meant to impose any such restriction.
Split the list into two equally-sized parts. Sort each part separately, using any usual algorithm. Then merge these parts. Perform some merge steps as usual, comparing the merged elements. For other merge steps, do not compare the elements, but instead take the element from the same part as in the previous step. It is not necessary to use an RNG to decide how to treat each element; just ignore the sorting order for every N-th element.
Another variant of this approach nearly sorts an array almost in place. Split the array into two parts with odd/even indexes. Sort them. (It is even possible to use a standard C++ algorithm with an appropriately modified iterator, like boost::permutation_iterator.) Reserve some limited space at the end of the array. Merge the parts, starting from the end. If the merged output is about to overwrite one of the non-merged elements, just select that element; otherwise select the element in sorted order. The level of randomness is determined by the amount of reserved space.
Assuming you want the array sorted in ascending order, I would do the following:
for M iterations
    pick a random index i
    pick a random index k
    if (i < k) != (array[i] < array[k]) then swap(array[i], array[k])
M controls the "sortedness" of the array - as M increases the array becomes more and more sorted. I would say a reasonable value for M is n^2 where n is the length of the array. If it is too slow to pick random elements then you can precompute their indices beforehand. If the method is still too slow then you can always decrease M at the cost of getting a poorer sort.
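A minimal C++ sketch of that loop (the int element type and caller-supplied M are assumptions for illustration):

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Each pass compares two random positions and swaps them only if they are out
// of order, so the number of inversions never increases as M grows.
void roughly_sort(std::vector<int>& a, std::size_t M) {
    if (a.size() < 2) return;
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, a.size() - 1);
    for (std::size_t m = 0; m < M; ++m) {
        std::size_t i = pick(rng), k = pick(rng);
        if ((i < k) != (a[i] < a[k]))
            std::swap(a[i], a[k]);
    }
}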
Take a small random subset of the data and sort it. You can use this as a map to provide an estimate of where every element should appear in the final nearly-sorted list. You can scan through the full list now and move/swap elements that are not in a good position.
This is basically O(n), assuming the small initial sorting of the subset doesn't take a long time. Hopefully you can build the map such that the estimate can be extracted quickly.
Bubblesort to the rescue!
For an unsorted array, you could pick a few random elements and bubble them up or down (maybe by rotation, which is a bit more efficient). It will be hard to control the amount of (dis)order: even if you pick all N elements, you are not sure that the whole array will be sorted, because elements get moved and you cannot ensure that you touched every element exactly once.
BTW: this kind of problem tends to occur in game playing engines, where the list with candidate moves is kept more-or-less sorted (because of weighted sampling), and sorting after each iteration is too expensive, and only one or a few elements are expected to move.

Fast Algorithm for finding largest values in 2d array

I have a 2D array (an image actually) that is size N x N. I need to find the indices of the M largest values in the array ( M << N x N) . Linearized index or the 2D coords are both fine. The array must remain intact (since it's an image). I can make a copy for scratch, but sorting the array will bugger up the indices.
I'm fine with doing a full pass over the array (ie. O(N^2) is fine). Anyone have a good algorithm for doing this as efficiently as possible?
Selection is sorting's austere sister (repeat this ten times in a row). Selection algorithms are less known than sort algorithms, but nonetheless useful.
You can't do better than O(N^2) (in N) here, since nothing allows you to avoid visiting each element of the array.
A good approach is to keep a priority queue made of the M largest elements. This makes something O(N x N x log M).
You traverse the array, enqueuing pairs (elements, index) as you go. The queue keeps its elements sorted by first component.
Once the queue has M elements, instead of enqueuing you now:
Query the min element of the queue
If the current element of the array is greater, insert it into the queue and discard the min element of the queue
Else do nothing.
If M is bigger, sorting the array is preferable.
NOTE: @Andy Finkenstadt makes a good point (in the comments to your question): you definitely should traverse your array in the "direction of data locality": make sure that you read memory contiguously.
Also, this is trivially parallelizable; the only non-parallelizable part is merging the queues when joining the sub-processes.
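A sketch of that traversal for a row-major N x N image stored as a flat std::vector<float> (the value type and layout are assumptions):

#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns the linearized indices of the M largest values, scanning row-major
// (contiguous memory) and keeping a min-heap of (value, index) pairs of size M.
std::vector<std::size_t> largest_m(const std::vector<float>& image, std::size_t M) {
    using Entry = std::pair<float, std::size_t>;               // (value, linear index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;  // min on top
    for (std::size_t i = 0; i < image.size(); ++i) {
        if (heap.size() < M) {
            heap.emplace(image[i], i);
        } else if (image[i] > heap.top().first) {
            heap.pop();                                        // discard the current minimum
            heap.emplace(image[i], i);
        }
    }
    std::vector<std::size_t> indices;
    while (!heap.empty()) { indices.push_back(heap.top().second); heap.pop(); }
    return indices;   // for an N-wide image: row = idx / N, col = idx % N
}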
You could copy the array into a single-dimensional array of tuples (value, original X, original Y) and build a basic heap out of it in O(n) time, provided you implement the heap as an array.
You could then retrieve the M largest tuples in O(M lg n) time and reference their original x and y from the tuple.
If you are going to make a copy of the input array in order to do a sort, that's way worse than just walking linearly through the whole thing to pick out numbers.
So the question is how big is your M? If it is small, you can store results (i.e. structs with 2D indexes and values) in a simple array or a vector. That'll minimize heap operations but when you find a larger value than what's in your vector, you'll have to shift things around.
If you expect M to get really large, then you may need a better data structure, like a binary tree (std::set) or a sorted std::deque. std::set will reduce the number of times elements must be shifted in memory, while std::deque will do some shifting but will significantly reduce the number of times you have to go to the heap, which may give you better performance.
Your problem doesn't use the two dimensions in any interesting way; it is easier to consider the equivalent problem in a 1D array.
There are 2 main ways to solve this problem:
Maintain a set of the M largest elements, and iterate through the array. (Using a heap allows you to do this efficiently.)
This is simple and is probably better in your case (M << N)
Use selection, (the following algorithm is an adaptation of quicksort):
Create an auxiliary array, containing the indexes [1..N].
Choose an arbitrary index (and its corresponding value), and partition the index array so that indexes corresponding to smaller elements go to the left and those of bigger elements go to the right.
Repeat the process, binary-search style, until you have narrowed down the M largest elements.
This is good for cases with large M. If you want to avoid the worst-case issues (the same ones quicksort has), then look at more advanced algorithms, like median-of-medians selection.
How many times do you search for the largest value from the array?
If you only search 1 time, then just scan through it keeping the M largest ones.
If you do it many times, just insert the values into a sorted list (probably best implemented as a balanced tree).