Finding k smallest/largest elements in an array with focus on memory - C++

I have an unsigned int array with n elements (n being at most around 20-25). Duplicates are possible.
I know that the smallest k values are of type A and the other (larger) n-k values are of type B. In order to differentiate between A and B I need to find the indices of the k smallest values (or the n-k largest values, depending on what is easier/faster). The original array must not be altered as the element's index contains information.
There are multiple solutions for this problem on the web (e.g. here). However, most of them try to optimize processing time and neglect memory usage.
As I am implementing the code in C++ on an (Arduino-based) microcontroller, I have to focus on low memory usage and, if necessary, accept a slightly longer processing time. I therefore feel unsafe using pointers and recursion (maybe I wouldn't if I knew more about them, but in fact I don't).
Can you recommend which algorithm would be best for that task (an implementation is welcome but not essential)?
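For illustration only (this is not from the question), one approach that fits the constraints is to keep a small buffer of the k best indices, sorted by value, and insert each element with a short shift: O(n*k) time, k bytes of extra index storage, no recursion and no dynamic allocation. The helper name k_smallest_indices is made up; a minimal sketch:

#include <stdint.h>

// out_idx must have room for k entries; the array a itself is never modified.
void k_smallest_indices(const unsigned int a[], uint8_t n, uint8_t k, uint8_t out_idx[])
{
    uint8_t count = 0;                              // filled slots of out_idx
    for (uint8_t i = 0; i < n; ++i) {
        // find where a[i] belongs among the current best k (stable for duplicates)
        uint8_t pos = count;
        while (pos > 0 && a[out_idx[pos - 1]] > a[i]) --pos;
        if (pos >= k) continue;                     // not among the k smallest so far
        // shift larger entries up, dropping the last one if the buffer is full
        uint8_t end = (count < k) ? count : (uint8_t)(k - 1);
        for (uint8_t j = end; j > pos; --j) out_idx[j] = out_idx[j - 1];
        out_idx[pos] = i;
        if (count < k) ++count;
    }
}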

Related

Efficiently storing a matrix with many zeros, dynamically

Background:
I'm working in C++.
I recall there being a method to efficiently (memory-wise) store "arrays" (where an array might be made of std::vector's, std::set's, etc... I don't care how, so long as it is memory efficient and I'm able to check the value of each element) of 0's and 1's (or, equivalently, true/false, etc.), wherein there is a disproportionate number of one or the other (e.g. mostly zeroes).
I've written an algorithm, which populates an "array" (currently, a vector<vector<size_t>>) with 0's and 1's according to some function. For these purposes, we can more-or-less consider it as being done randomly. The array is to be quite large (of variable size... on the order of 1000 columns, and 1E+8 or more rows), and always rectangular.
There need to be this many data points. Even in the best of times, my machine quickly becomes resource-constrained and slows to a crawl. At worst, I get std::bad_alloc.
Putting aside what I intend to do with this array, what is the most efficient (memory-wise) way to store a rectangular array of 1's and 0's (or T/F, etc.), where there are mostly 1's or 0's (and I know which is most populous)?
Note that the array needs to be created "dynamically" (i.e. one element at a time), elements must maintain their location, and I only need to check the value of individual elements after creation. I'm concerned about memory footprint, nothing else.
This is known as a sparse array or matrix.
std::set<std::pair<int,int>> bob;
If you want element (7,100) to be 1, just bob.insert({7,100}); Missing elements are 0. You can use bob.count({3,7}) to get a 0/1 value if you like.
Now, looping over both columns and rows is tricky; the easiest fix is to keep two sets, one with the pair order reversed.
If you have no need to loop in order, use an unordered set instead.
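For concreteness, a minimal sketch of that set-of-coordinates approach (the row loop at the end assumes you want the columns of one row, in order, which is where the ordered set pays off):

#include <limits>
#include <set>
#include <utility>

int main() {
    std::set<std::pair<int, int>> bob;              // stores only the coordinates that hold a 1
    bob.insert({7, 100});                           // element (7,100) becomes 1
    bool is_one = bob.count({3, 7}) != 0;           // 0/1 lookup; false here

    // Walk the 1s of row 7 in column order, using the set's lexicographic ordering:
    for (auto it = bob.lower_bound({7, std::numeric_limits<int>::min()});
         it != bob.end() && it->first == 7; ++it) {
        int column = it->second;                    // a column of row 7 holding a 1
        (void)column;
    }
    (void)is_one;
}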

Get all subsets of given size k from unordered_set?

I have an unordered_set<int> vertices and have to generate all subsets of size exactly k for my clique program. All of the solutions (including all on SO) that I've seen work on arrays, not sets. Is there any algorithm like Python's itertools.combinations implemented in C++? If not, how should I go about it? Convert to an array and use a standard algorithm? I still have to use sets further in the program, so this would make my memory requirements twice as large.
Generating all subsets will take a huge amount of time when the number of elements becomes large (except for some special cases like choosing 0, 1, n-1, or n elements out of n), so I don't think doubling the memory requirements will become a serious problem unless you have a strict memory limit.
For that reason, I think you should just convert the set to an array and apply known algorithms.
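A minimal sketch of that convert-and-use-a-known-algorithm route, walking every k-subset with a selection mask and std::prev_permutation (the function name for_each_k_subset is just for illustration):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <unordered_set>
#include <vector>

void for_each_k_subset(const std::unordered_set<int>& vertices, std::size_t k)
{
    std::vector<int> v(vertices.begin(), vertices.end());   // the one-off copy
    if (k > v.size()) return;

    std::vector<char> mask(v.size(), 0);
    std::fill(mask.begin(), mask.begin() + k, 1);            // start with the lexicographically largest mask

    do {
        for (std::size_t i = 0; i < v.size(); ++i)
            if (mask[i]) std::cout << v[i] << ' ';           // one subset of size k
        std::cout << '\n';
    } while (std::prev_permutation(mask.begin(), mask.end()));
}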

How to efficiently implement bitwise rotate of an arbitrary sequence?

"The permutation p of n elements defined by an index permutation p(i) = (i + k) mod n is called the k-rotation." -- Stepanov & McJones
std::rotate has become a well known algorithm thanks to Sean Parent, but how to efficiently implement it for an arbitrary sequence of bits?
By efficient, I mean minimizing at least two things: i) the number of writes and ii) the worst-case space complexity.
That is, the input should be similar to std::rotate but bit-wise specific, I guess like this:
A pointer to the memory where the bit sequence starts.
Three bit indices: first, middle and last.
The pointed-to type could be any unsigned integer type, and presumably the larger the better. (Boost.Dynamic Bitset calls it the "block".)
It's important to note that the indices may all be offset from the start of a block by different amounts.
According to Stepanov and McJones, rotate on random access data can be implemented in n + gcd(n, k) assignments. The algorithm that reverses each subrange followed by reversing the entire range takes 3n assignments. (However, I agree with the comments below that it is effectively 2n assignments.) Since the bits in an array can be accessed randomly, I assume the same optimal bound applies. Each assignment will usually require two reads because of different subrange block offsets but I'm less concerned about reads than writes.
Does an efficient or optimal implementation of this algorithm already exist out in the open source wild?
If not, how would one do it?
I've looked through Hacker's Delight and Volume 4A of Knuth but can't find an algorithm for it.
Using a vector<uint32_t>, for example, it's easy and reasonably efficient to do the fractional-element part of the rotation in one pass yourself (shift_amount%32), and then call std::rotate to do the rest. The fractional part is easy and only operates on adjacent elements, except at the ends, so you only need to remember one partial element while you're working.
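A minimal sketch of that idea for the whole-container case, assuming the boost::dynamic_bitset-style convention that bit i lives in block i/32 at position i%32 (the function name and the "rotate toward lower bit indices" direction are my choices, not the question's):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

void rotate_bits_down(std::vector<uint32_t>& v, std::size_t k)
{
    if (v.empty()) return;
    const std::size_t nbits = 32 * v.size();
    k %= nbits;
    const std::size_t q = k / 32;                   // whole-word part
    const unsigned    r = k % 32;                   // fractional (sub-word) part

    if (r != 0) {                                   // one pass for the fractional part
        const uint32_t saved = v.front();           // needed for the wrap-around
        for (std::size_t i = 0; i + 1 < v.size(); ++i)
            v[i] = (v[i] >> r) | (v[i + 1] << (32 - r));
        v.back() = (v.back() >> r) | (saved << (32 - r));
    }
    std::rotate(v.begin(), v.begin() + static_cast<std::ptrdiff_t>(q), v.end());  // whole words
}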
If you want to do the whole thing yourself, then you can do the rotation by reversing the order of the entire vector, and then reversing the order of the front and back sections. The trick to doing this efficiently is that when you reverse the whole vector, you don't actually bit-reverse each element -- you just think of them as being in the opposite order. The reversal of the front and back sections is trickier and requires you to remember 4 partial elements while you work.
In terms of writes to memory or cache, both of the above methods make 2N writes. The optimal rotation you refer to in the question takes N, but if you extend it to work with fractional-word rotations, then each write spans two words and it then takes 2N writes. It provides no advantage and I think it would turn out to be complicated.
That said... I'm sure you could get closer to N writes with a fixed amount of register storage by doing m words at a time, but that's a lot of code for a simple rotation, and your time (or at least my time :) would be better spent elsewhere.

Data structure for O(log N) find and update, considering small L1 cache

I'm currently working on an embedded device project where I'm running into performance problems. Profiling has located an O(N) operation that I'd like to eliminate.
I basically have two arrays int A[N] and short B[N]. Entries in A are unique and ordered by external constraints. The most common operation is to check if a particular value a appears in A[]. Less frequently, but still common is a change to an element of A[]. The new value is unrelated to the previous value.
Since the most common operation is the find, that's where B[] comes in. It's a sorted array of indices in A[], such that A[B[i]] < A[B[j]] if and only if i<j. That means that I can find values in A using a binary search.
Of course, when I update A[k], I have to find k in B and move it to a new position, to maintain the search order. Since I know the old and new values of A[k], that's just a memmove() of a subset of B[] between the old and new position of k. This is the O(N) operation that I need to fix; since the old and new values of A[k] are essentially random, I'm moving on average about N/3 elements.
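For reference, the find described above, sketched with std::lower_bound and the declarations from the question (requires C++11 for the lambda):

#include <algorithm>

bool contains(const int A[], const short B[], int N, int a)
{
    const short* it = std::lower_bound(B, B + N, a,
        [&](short idx, int value) { return A[idx] < value; });
    return it != B + N && A[*it] == a;
}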
I looked into std::make_heap using [](int i, int j) { return A[i] < A[j]; } as the predicate. In that case I can easily make B[0] point to the smallest element of A, and updating B is now a cheap O(log N) rebalancing operation. However, I generally don't need the smallest value of A, I need to find if any given value is present. And that's now an O(N log N) search in B (half of my N elements are at heap depth log N, a quarter at (log N)-1, etc.), which is no improvement over a dumb O(N) search directly in A.
Considering that std::set has O(log N) insert and find, I'd say that it should be possible to get the same performance here for update and find. But how do I do that? Do I need another order for B? A different type?
B is currently a short[N] because A and B together are about the size of my CPU cache, and my main memory is a lot slower. Going from 6*N to 8*N bytes would not be nice, but would still be acceptable if both my find and update went to O(log N).
If the only operations are (1) check if value 'a' belongs to A and (2) update values in A, why don't you use a hash table in place of the sorted array B? Especially if A does not grow or shrink in size and the values only change this would be a much better solution. A hash table does not require significantly more memory than an array. (Alternatively, B should be changed not to a heap but to a binary search tree, that could be self-balancing, e.g. a splay tree or a red-black tree. However, trees require extra memory because of the left- and right-pointers.)
A practical solution that grows memory use from 6N to 8N bytes is to aim for an exactly 50% filled hash table, i.e. use a hash table that consists of an array of 2N shorts. I would recommend implementing the Cuckoo Hashing mechanism (see http://en.wikipedia.org/wiki/Cuckoo_hashing). Read the article further and you'll find that you can get load factors above 50% (i.e. push memory consumption down from 8N towards, say, 7N) by using more hash functions: "Using just three hash functions increases the load to 91%."
From Wikipedia:
A study by Zukowski et al. has shown that cuckoo hashing is much faster than chained hashing for small, cache-resident hash tables on modern processors. Kenneth Ross has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high. The performance of the bucketized cuckoo hash table was investigated further by Askitis, with its performance compared against alternative hashing schemes.
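A minimal sketch of the suggested layout: a table of 2N shorts holding indices into A, addressed by two hash functions of the value. The struct name, hash constants and the fixed kick limit are arbitrary illustration choices, and removing the stale entry when A[k] changes is omitted:

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

const short EMPTY = -1;

struct CuckooIndex {
    const int*         A;       // the value array from the question
    std::vector<short> table;   // 2N slots, each an index into A or EMPTY

    CuckooIndex(const int* a, std::size_t n) : A(a), table(2 * n, EMPTY) {}

    std::size_t h1(int v) const { return (static_cast<std::uint32_t>(v) * 2654435761u) % table.size(); }
    std::size_t h2(int v) const { return (static_cast<std::uint32_t>(v) * 40503u + 12345u) % table.size(); }

    bool contains(int v) const {                    // at most two probes
        short i = table[h1(v)], j = table[h2(v)];
        return (i != EMPTY && A[i] == v) || (j != EMPTY && A[j] == v);
    }

    bool insert(short idx) {                        // idx is an index into A
        std::size_t pos = h1(A[idx]);
        for (int kicks = 0; kicks < 64; ++kicks) {
            if (table[pos] == EMPTY) { table[pos] = idx; return true; }
            std::swap(idx, table[pos]);                          // evict the occupant
            pos = (pos == h1(A[idx])) ? h2(A[idx]) : h1(A[idx]); // send it to its other slot
        }
        return false;                               // displacement chain too long; a real table would rehash
    }
};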
std::set usually provides the O(log(n)) insert and delete by using a binary search tree. Unfortunately this uses 3*N space for most pointer-based implementations: assuming word-sized data, that's 1 word for the data and 2 for the pointers to the left and right child of each node.
If you have some constant N and can guarantee that ceil(log2(N)) is less than half the word size, you can use a fixed-length array of 2*N words holding N tree nodes: 1 word for the data and 1 word for the indices of the two child nodes, stored in the upper and lower halves of the word. Whether this would let you use a self-balancing binary search tree of some manner depends on your N and word size. For a 16-bit word you only get N = 256, but for 32 bits it's 65k.
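For what it's worth, the packed node could look like this for a 32-bit word and N <= 65536 (names are illustrative): 2 words per node instead of 3.

#include <cstdint>

struct PackedNode {
    std::int32_t  value;     // 1 word of data
    std::uint32_t children;  // upper 16 bits: left-child index, lower 16 bits: right-child index
};

inline std::uint16_t left_child(const PackedNode& n)  { return static_cast<std::uint16_t>(n.children >> 16); }
inline std::uint16_t right_child(const PackedNode& n) { return static_cast<std::uint16_t>(n.children & 0xFFFFu); }

// PackedNode tree[N];   // fixed-length array of nodes, 2N words in total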
Since you have limited N, can't you use std::set<short, cmp, pool_allocator> B with Boost's pool_allocator?

ocaml extremely large data structure suggestions

I am looking for suggestions on what kind of data-structure to use for extremely large structures in OCaml that scale well.
By "scales well", I mean I don't want stack overflows or exponential heap growth, assuming there is enough memory. So this pretty much eliminates the standard library's List.map function. Speed isn't so much an issue.
But for starters, let's assume I'm operating in the realm of 2^10 - 2^100 items.
There are only three "manipulations" I perform on the structure:
(1) a map function on subsets of the structure, which either increases or decreases the structure
(2) scanning the structure
(3) removal of specific pairs of items in the structure that satisfy a particular criterion
Originally I was using regular lists, which is still highly desirable, because the structure is constantly changing. Usually after all manipulations are performed, the structure has at most either doubled in size (or something thereabouts), or reduced to the empty list []. Perhaps the doubling dooms me from the beginning but it is unavoidable.
In any event, around 2^15 - 2^40 items start causing severe problems (probably due to the naive list functions I was using as well). The program uses 100% of the CPU, but almost no memory, and generally after a day or two it stack-overflows.
I would prefer to start using more memory, if possible, in order to continue operating in larger spaces.
Anyway, if anyone has any suggestions it would be much appreciated.
If you have enough space, in theory, to contain all items of your data structure, you should look at data structures that have an efficient memory representation, with as little bookkeeping as possible. Dynamic arrays (which you resize exponentially when you need more space) are stored more efficiently than lists (which pay a full word to store the tail of each cell), so you'd get roughly twice as many elements for the same memory use.
If you cannot hold all elements in memory (this is what your numbers look like), you should go for a more abstract representation. It's difficult to say more without more information on what your elements are. But maybe an example of an abstract representation would help you devise what you need.
Imagine that I want to record sets of integers. I want to make unions and intersections of those sets, and also some more funky operations such as "get all elements that are multiples". I want to be able to do that for really large sets (zillions of distinct integers), and then I want to be able to pick one element, any one, in this set I have built. Instead of trying to store lists of integers, or sets of integers, or arrays of booleans, what I can do is store the logical formulas corresponding to the definition of those sets: a set of integers P is characterized by a formula F such that F(n) ⇔ n ∈ P. I can therefore define a type of predicates (conditions):
type predicate =
| Segment of int * int (* n ∈ [a;b] *)
| Inter of predicate * predicate
| Union of predicate * predicate
| Multiple of int (* n mod a = 0 *)
Storing these formulas requires little memory (proportional to the number of operations I want to apply in total). Building the intersection or the union takes constant time. Then I'll have some work to do to find an element satisfying the formula; basically I'll have to reason about what those formulas mean, get a normal form out of them (they are all of the form "the elements of a finite union of intervals satisfying some modulo criteria"), and from there extract some element.
In the general case, when you get a "command" on your data set, such as "add the result of mapping over this subset", you can always, instead of actually evaluating this command, store it as data – the definition of your structure. The more precisely you can describe those commands (e.g. you say "map", but storing an (elem -> elem) function will not allow you to reason easily about the result; maybe you can formulate that mapping operation as a concrete combination of operations), the more precisely you will be able to work on them at this abstract level, without actually computing the elements.