What container really mimics std::vector in Haskell?

The problem
I'm looking for a container that is used to save partial results of n - 1 problems in order to calculate the nth one. This means that the size of the container, at the end, will always be n.
Each element, i, of the container depends on at least 2 and up to 4 previous results.
The container has to provide:
constant time insertions at either beginning or end (one of the two, not necessarily both)
constant time indexing in the middle
or alternatively (given an O(n) initialization):
constant time single element edits
constant time indexing in the middle
What is std::vector and why is it relevant
For those of you who don't know C++, std::vector is a dynamically sized array. It is a perfect fit for this problem because it is able to:
reserve space at construction
offer constant time indexing in the middle
offer constant time insertion at the end (with a reserved space)
Therefore this problem is solvable in O(n) complexity, in C++.
Why Data.Vector is not std::vector
Data.Vector, together with Data.Array, provides similar functionality to std::vector, but not quite the same. Both, of course, offer constant time indexing in the middle, but they offer neither constant time modification ((//) for example is at least O(n)) nor constant time insertion at either the beginning or the end.
Conclusion
What container really mimics std::vector in Haskell? Alternatively, what is my best shot?

From reddit comes the suggestion to use Data.Vector.constructN:
O(n) Construct a vector with n elements by repeatedly applying the generator function to the already constructed part of the vector.
constructN 3 f = let a = f <> ; b = f <a> ; c = f <a,b> in f <a,b,c>
For example:
λ import qualified Data.Vector as V
λ V.constructN 10 V.length
fromList [0,1,2,3,4,5,6,7,8,9]
λ V.constructN 10 $ (1+) . V.sum
fromList [1,2,4,8,16,32,64,128,256,512]
λ V.constructN 10 $ \v -> let n = V.length v in if n <= 1 then 1 else (v V.! (n - 1)) + (v V.! (n - 2))
fromList [1,1,2,3,5,8,13,21,34,55]
This certainly seems to qualify to solve the problem as you've described it above.

The first data structures that come to my mind are either Maps from Data.Map or Sequences from Data.Sequence.
Update
Data.Sequence
Sequences are persistent data structures that support most operations efficiently, although they can only represent finite sequences. Their implementation is based on finger trees, if you are interested. But what qualities do they have?
O(1) calculation of the length
O(1) insert at front/back with the operators <| and |> respectively.
O(n) creation from a list with fromList
O(log(min(n1,n2))) concatenation for sequences of length n1 and n2.
O(log(min(i,n-i))) indexing for an element at position i in a sequence of length n.
Furthermore this structure supports a lot of the known and handy functions you'd expect from a list-like structure: replicate, zip, null, scans, sort, take, drop, splitAt and many more. Because of these name clashes you have to either import it qualified or hide the Prelude functions of the same name.
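As a rough sketch of the pattern from the question - each new entry computed from a couple of earlier ones - here is how that might look with Data.Sequence. The function name fibSeq and the Fibonacci recurrence are just placeholders for the real dependence on 2 to 4 previous results:

import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>), index)

-- Grow the sequence left to right; each new element reads two
-- already-computed entries via the logarithmic-time index.
fibSeq :: Int -> Seq Integer
fibSeq n = go (Seq.fromList [1, 1])
  where
    go s
      | Seq.length s >= n = Seq.take n s
      | otherwise         = go (s |> (s `index` (Seq.length s - 1)
                                    + s `index` (Seq.length s - 2)))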
Data.Map
Maps are the standard workhorse for realizing a correspondence between "things"; what you might call a hashmap or associative array in other programming languages is called a Map in Haskell. Unlike in, say, Python, Maps are pure - an update gives you back a new Map and does not modify the original instance.
Maps come in two flavors - strict and lazy.
Quoting from the Documentation
Strict
API of this module is strict in both the keys and the values.
Lazy
API of this module is strict in the keys, but lazy in the values.
So you need to choose what fits best for your application. You can try both versions and benchmark with criterion.
Instead of listing the features of Data.Map I want to pass on to
Data.IntMap.Strict
which can leverage the fact that the keys are integers to squeeze out better performance.
Quoting from the documentation we first note:
Many operations have a worst-case complexity of O(min(n,W)). This means that the operation can become linear in the number of elements with a maximum of W -- the number of bits in an Int (32 or 64).
So what are the characteristics of IntMaps?
O(min(n,W)) for (unsafe) indexing (!), unsafe in the sense that you will get an error if the key/index does not exist. This is the same behavior as Data.Sequence.
O(n) calculation of size
O(min(n,W)) for safe indexing lookup, which returns a Nothing if the key is not found and Just a otherwise.
O(min(n,W)) for insert, delete, adjust and update
So you see that this structure is less efficient than Sequences, but it provides a bit more safety and a big benefit if you actually don't need all entries, such as when representing a sparse graph whose nodes are integers.
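For completeness, a minimal sketch (my own example, not from the documentation) of the same fill-as-you-go idea with Data.IntMap.Strict, again with the Fibonacci recurrence standing in for the real dependence on previous results:

import qualified Data.IntMap.Strict as IM

-- Insert the results one key at a time, reading earlier entries with (!).
fibMap :: Int -> IM.IntMap Integer
fibMap n = foldl step (IM.fromList [(1, 1), (2, 1)]) [3 .. n]
  where
    step m i = IM.insert i (m IM.! (i - 1) + m IM.! (i - 2)) m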
For completeness I'd like to mention a package called persistent-vector, which implements Clojure-style vectors, but it seems to be abandoned, as the last upload is from 2012.
Conclusion
So for your use case I'd strongly recommend Data.Sequence or Data.Vector; unfortunately I don't have any experience with the latter, so you need to try it for yourself. From what I know it provides a powerful optimization called stream fusion, which executes multiple functions in one tight "loop" instead of running a loop for each function. A tutorial for Vector can be found here.
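To give a flavour of what fusion buys you (a made-up toy example, not from the tutorial): the map and the sum below compile down to a single loop, with no intermediate vector being allocated.

import qualified Data.Vector.Unboxed as V

-- Fused pipeline: no intermediate vector is built for the mapped values.
sumOfSquares :: Int -> Int
sumOfSquares n = V.sum (V.map (^ 2) (V.enumFromTo 1 n))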

When looking for functional containers with particular asymptotic run times, I always pull out Edison.
Note that there's a result that in a strict language with immutable data structures, there's always a logarithmic slowdown when implementing mutable data structures on top of them. It's an open problem whether the limited mutation hidden behind laziness can avoid that slowdown. There's also the issue of persistent vs. transient...
Okasaki is still a good read for background, but finger trees or something more complex like an RRB-tree should be available "off-the-shelf" and solve your problem.

I'm looking for a container that is used to save partial results of n - 1 problems in order to calculate the nth one.
Each element, i, of the container depends on at least 2 and up to 4 previous results.
Let's consider a very small program that calculates Fibonacci numbers.
fib 1 = 1
fib 2 = 1
fib n = fib (n-1) + fib (n-2)
This is great for small n, but horrible for n > 10. At this point, you stumble across this gem:
fib n = fibs !! n where fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
You may be tempted to exclaim that this is dark magic (infinite, self referential list building and zipping? wth!) but it is really a great example of tying the knot, and using laziness to ensure that values are calculated as-needed.
Similarly, we can use an array to tie the knot too.
import Data.Array

fib n = arr ! n
  where
    arr :: Array Int Int
    arr = listArray (1, n) (map fib' [1 .. n])
    fib' 1 = 1
    fib' 2 = 1
    fib' i = arr ! (i - 1) + arr ! (i - 2)
Each element of the array is a thunk that uses other elements of the array to calculate its value. In this way, we can build a single array, never having to perform concatenation, and call out values from the array at will, only paying for the calculation up to that point.
The beauty of this method is that you don't only have to look behind you, you can look in front of you as well.
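For instance, here is a small sketch (my own example, not part of the answer above) where each entry of a knot-tied array looks forward instead of backward, computing suffix sums:

import Data.Array

-- suffixSums [1,2,3] builds the array [6,5,3]: entry i depends on entry i+1.
suffixSums :: [Int] -> Array Int Int
suffixSums xs = arr
  where
    n   = length xs
    arr = listArray (1, n) [ x + (if i < n then arr ! (i + 1) else 0)
                           | (i, x) <- zip [1 ..] xs ]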

Related

Why does the shuffle' function require an Int parameter?

In System.Random.Shuffle,
shuffle' :: RandomGen gen => [a] -> Int -> gen -> [a]
The hackage page mentions this Int argument as
..., its length,...
However, it seems that a simple wrapper function like
shuffle'' x = shuffle' x (length x)
should've sufficed.
shuffle operates by building a tree form of its input list, including the tree size. The buildTree function performs this task using Data.Function.fix in a manner I haven't quite wrapped my head around. Somehow (I think due to the recursion of inner, not the fix magic), it produces a balanced tree, which then has logarithmic lookup. Then it consumes this tree, rebuilding it for every extracted item. The advantage of the data structure would be that it only holds remaining items in an immutable form; lazy updates work for it.
But the size of the tree is required data during the indexing, so there's no need to pass it separately to generate the indices used to build the permutation. System.Random.Shuffle.shuffle indeed has no random element - it is only a permutation function. shuffle' exists to feed it a random sequence, using its internal helper rseq. So the reason shuffle' takes a length argument appears to be because they didn't want it to touch the list argument at all; it's only passed into shuffle.
The task doesn't seem terribly suitable for singly linked lists in the first place. I'd probably consider using VectorShuffling instead. And I'm baffled as to why rseq isn't among the exported functions, being the one that uses a random number generator to build a permutation... which in turn might have been better handled using Data.Permute. Probably the reasons have to do with history, such as Data.Permute being written later and System.Random.Shuffle being based on a paper on immutable random access queues.
Data.Random.Extras seems to have a more straightforward Seq-based shuffle function.
It might be the case that the length of the given list is already known and doesn't need to be calculated again. Thus, it can be considered an optimisation.
Besides, in general, the resulting list doesn't need to have the same size as the original one. Thus, this argument could be used to set that length.
This is true for the original idea of Oleg (source - http://okmij.org/ftp/Haskell/perfect-shuffle.txt):
-- examples
t1 = shuffle1 ['a','b','c','d','e'] [0,0,0,0]
-- "abcde"
-- Note, that rseq of all zeros leaves the sequence unperturbed.
t2 = shuffle1 ['a','b','c','d','e'] [4,3,2,1]
-- "edcba"
-- The rseq of (n-i | i<-[1..n-1]) reverses the original sequence of elements
However, it's not the same for the 'random-shuffle' package implementation:
> shuffle [0..10] [0,0,0,0]
[0,1,2,3random-shuffle.hs: [shuffle] called with lists of different lengths
I think it's worth following up with the package maintainers in order to understand the contract of this function.

Calculate ratio of an element in a list efficiently

The following code works with small lists; however, it takes forever with long lists. I suppose it's my double use of length that is the problem.
ratioOfPrimes :: [Int] -> Double
ratioOfPrimes xs = fromIntegral (length (filter isPrime xs)) / fromIntegral (length xs)
How do I calculate the ratio of an element in longer lists?
The double use of length isn't the main problem here. The multiple traversals in your implementation produce a constant factor, and with the two lengths plus filter you get an average complexity of O(3n). Due to stream fusion it's even O(2n), as already mentioned by Impredicative. But in fact, since constant factors don't have a dramatic effect on performance, it's conventional to simply ignore them; so, conventionally speaking, your implementation still has a complexity of O(n), where n is the length of the input list.
The real problem here is that all of the above would only be true if isPrime had a complexity of O(1), but it doesn't. This function performs a traversal through a list of all primes, so it itself has a complexity of O(m). So the dramatic performance decrease is caused by your algorithm having a final complexity of O(n*m), because on each iteration over the input list it has to traverse the list of all primes to an unknown depth.
To optimize, I suggest first sorting the input list (which takes O(n*log n)) and integrating a custom lookup on the list of all primes, which drops the already visited numbers on each iteration. This way you'll be able to achieve a single traversal of the list of all primes, which theoretically could give you a complexity of O(n*log n + n + m); again, conventionally, this can be thought of as simply O(n*log n), by highlighting the cost center.
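A sketch of that single-traversal idea (the names here are mine, and the ascending list of all primes is taken as a parameter rather than defined):

import Data.List (sort)

-- Walk the sorted input and the ascending prime list together, counting
-- inputs that match a prime; duplicates are handled by not advancing past
-- a prime while inputs still equal it.
countPrimesIn :: [Int] -> [Int] -> Int
countPrimesIn _ [] = 0
countPrimesIn [] _ = 0
countPrimesIn ps@(p:pt) xs@(x:xt)
  | p < x     = countPrimesIn pt xs       -- this prime is behind us, move on
  | p == x    = 1 + countPrimesIn ps xt   -- x is prime
  | otherwise = countPrimesIn ps xt       -- p > x, so x is not prime

ratioOfPrimes' :: [Int] -> [Int] -> Double
ratioOfPrimes' primes xs =
  fromIntegral (countPrimesIn primes (sort xs)) / fromIntegral (length xs)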
So, there's a few things going on there. Let's look at some of the operations involved:
length
filter
isPrime
length
As you say, using length twice isn't going to help, since that's O(n) for lists. You do that twice. Then there's filter, which is also going to do a whole pass of the list in O(n). What we'd like to do is do all this in a single pass of the list.
Functions in the Data.List.Stream module implement a technique called Stream Fusion, which would for example rewrite your (length (filter isPrime xs)) call into a single loop. However, you'd still have the second call to length. You could rewrite this whole thing into a single fold (or use of the State or ST monads) with a pair of accumulators and do this in a single pass:
import Data.List (foldl')

ratioOfPrimes xs =
  let (p, t) = foldl' (\(np, nt) i -> if isPrime i then (np + 1, nt + 1) else (np, nt + 1)) (0, 0) xs
  in fromIntegral p / fromIntegral t
However, in this case you could also move away from using a list and use the vector library. The vector library implements the same stream fusion techniques for removing intermediate lists, but also has some other nifty features:
length is O(1)
The Data.Vector.Unboxed module lets you store unboxable types (which primitive types such as Int certainly are) without the overhead of the boxed representation. So this list of ints would be stored as a low-level Int array.
Using the vector package should let you write the idiomatic representation you have above and get better than the performance of a single-pass translation.
import qualified Data.Vector.Unboxed as U
ratioOfPrimes :: U.Vector Int -> Double
ratioOfPrimes xs = (fromIntegral $ U.length . U.filter isPrime $ xs) / (fromIntegral $ U.length xs)
Of course, the thing that hasn't been mentioned is the isPrime function, and whether the real problem is that it's slow for large n. An unperformant prime checker could easily blow concerns over list indexing out of the water.

ocaml extremely large data structure suggestions

I am looking for suggestions on what kind of data-structure to use for extremely large structures in OCaml that scale well.
By scales well, I don't want stack overflows, or exponential heap growth, assuming there is enough memory. So this pretty much eliminates the standard lib's List.map function. Speed isn't so much an issue.
But for starters, let's assume I'm operating in the realm of 2^10 - 2^100 items.
There are only three "manipulations" I perform on the structure:
(1) a map function on subsets of the structure, which either increases or decreases the structure
(2) scanning the structure
(3) removal of specific pairs of items in the structure that satisfy a particular criterion
Originally I was using regular lists, which is still highly desirable, because the structure is constantly changing. Usually after all manipulations are performed, the structure has at most either doubled in size (or something thereabouts), or reduced to the empty list []. Perhaps the doubling dooms me from the beginning but it is unavoidable.
In any event, around 2^15 --- 2^40 items start causing severe problems (probably due to the naive list functions I was using as well). The program uses 100% of the cpu, but almost no memory, and generally after a day or two it stack-overflows.
I would prefer to start using more memory, if possible, in order to continue operating in larger spaces.
Anyway, if anyone has any suggestions it would be much appreciated.
If you have enough space, in theory, to contain all items of your data structure, you should look at data structures that have an efficient memory representation, with as little bookkeeping as possible. Dynamic arrays (which you resize exponentially when you need more space) are stored more efficiently than lists (which pay a full word to store the tail of each cell), so you'd get roughly twice as many elements for the same memory use.
If you cannot hold all elements in memory (and this is what your numbers look like), you should go for a more abstract representation. It's difficult to say more without more information on what your elements are. But maybe an example of an abstract representation will help you devise what you need.
Imagine that I want to record sets of integers. I want to take unions and intersections of those sets, and also do some more funky operations such as "get all elements that are multiples of a given number". I want to be able to do that for really large sets (zillions of distinct integers), and then I want to be able to pick one element, any one, from the set I have built. Instead of trying to store lists of integers, or sets of integers, or arrays of booleans, what I can do is store the logical formulas corresponding to the definition of those sets: a set of integers P is characterized by a formula F such that F(n) ⇔ n∈P. I can therefore define a type of predicates (conditions):
type predicate =
  | Segment of int * int           (* n ∈ [a;b] *)
  | Inter of predicate * predicate
  | Union of predicate * predicate
  | Multiple of int                (* n mod a = 0 *)
Storing these formulas requires little memory (proportional to the number of operations I want to apply in total). Building the intersection or the union takes constant time. Then I'll have some work to do to find an element satisfying the formula; basically I'll have to reason about what those formulas mean, get a normal form out of them (they are all of the form "the elements of a finite union of intervals satisfying some modulo criteria"), and from there extract some element.
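The same idea transcribed into Haskell, just to make the "store the formula, evaluate it on demand" point concrete (the names here are illustrative, not from the answer above):

data Predicate
  = Segment Int Int            -- n is in [a, b]
  | Inter Predicate Predicate
  | Union Predicate Predicate
  | Multiple Int               -- n `mod` a == 0

-- Membership is decided by evaluating the stored formula.
member :: Int -> Predicate -> Bool
member n (Segment a b) = a <= n && n <= b
member n (Inter p q)   = member n p && member n q
member n (Union p q)   = member n p || member n q
member n (Multiple a)  = n `mod` a == 0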
In the general case, when you get a "command" on your data set, such as "add the result of mapping over this subset", you can always, instead of actually evaluating this command, store it as data - the definition of your structure. The more precisely you can describe those commands (e.g. you say "map", but storing an (elem -> elem) function will not allow you to reason easily about the result; maybe you can formulate that mapping operation as a concrete combination of operations), the more precisely you will be able to work on them at this abstract level, without actually computing the elements.

Find that unique element from the 10^5 array size [duplicate]

This question already has answers here:
How to find the only number in an array that doesn't occur twice [duplicate]
What would be the best algorithm for finding a number that occurs only once in a list which has all other numbers occurring exactly twice.
So, in the list of integers (lets take it as an array) each integer repeats exactly twice, except one. To find that one, what is the best algorithm.
The fastest (O(n)) and most memory efficient (O(1)) way is with the XOR operation.
In C:
int arr[] = {3, 2, 5, 2, 1, 5, 3};
int num = 0, i;
for (i = 0; i < 7; i++)
    num ^= arr[i];
printf("%i\n", num);
This prints "1", which is the only one that occurs once.
This works because the first time you hit a number it marks the num variable with itself, and the second time it unmarks num with itself (more or less). The only one that remains unmarked is your non-duplicate.
By the way, you can expand on this idea to very quickly find two unique numbers among a list of duplicates.
Let's call the unique numbers a and b. First take the XOR of everything, as Kyle suggested. What we get is a^b. We know a^b != 0, since a != b. Choose any 1 bit of a^b, and use that as a mask -- in more detail: choose x as a power of 2 so that x & (a^b) is nonzero.
Now split the list into two sublists -- one sublist contains all numbers y with y&x == 0, and the rest go in the other sublist. By the way we chose x, we know that a and b are in different buckets. We also know that each pair of duplicates is still in the same bucket. So we can now apply ye olde "XOR-em-all" trick to each bucket independently, and discover what a and b are completely.
Bam.
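A hedged Haskell rendering of that two-unique-numbers trick (the function name is mine; the logic follows the description above):

import Data.Bits (xor, (.&.))
import Data.List (foldl', partition)

-- Split the list by one bit where a and b differ, then XOR each bucket.
findTwoUnique :: [Int] -> (Int, Int)
findTwoUnique xs = (foldl' xor 0 as, foldl' xor 0 bs)
  where
    ab       = foldl' xor 0 xs      -- equals a `xor` b
    mask     = ab .&. negate ab     -- lowest set bit of a `xor` b
    (as, bs) = partition (\y -> y .&. mask == 0) xs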
O(N) time, O(N) memory
HT = hash table
HT.clear()
go over the list in order
for each item you see:
    if HT.contains(item) then HT.remove(item)
    else HT.add(item)
at the end, the item left in the HT is the item you are looking for.
Note (credit #Jared Updike): This system will find all Odd instances of items.
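A small sketch of the same toggling idea in Haskell, using Data.Set in place of the hash table (the function name is mine):

import qualified Data.Set as Set

-- Insert an unseen item, delete a seen one; whatever survives occurred
-- an odd number of times.
oddOnes :: [Int] -> Set.Set Int
oddOnes = foldr toggle Set.empty
  where
    toggle x s
      | x `Set.member` s = Set.delete x s
      | otherwise        = Set.insert x s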
comment: I don't see how people can vote up solutions that give you N log N performance. In which universe is that "better"?
I am even more shocked that you marked the accepted answer as the N log N solution...
I do agree, however, that if memory is required to be constant, then N log N would be (so far) the best solution.
Kyle's solution would obviously not catch situations where the data set does not follow the rules. If all numbers were in pairs the algorithm would give a result of zero, the exact same value as if zero were the only value with a single occurrence.
If there were multiple single-occurrence values or triples, the result would be erroneous as well.
Testing the data set might well end up with a more costly algorithm, either in memory or time.
Csmba's solution does reveal some erroneous data (none, or more than one, single-occurrence value), but not other cases (quadruples). Regarding his solution, depending on the implementation of HT, either memory and/or time is more than O(n).
If we cannot be sure about the correctness of the input set, sorting and counting, or using a hashtable that counts occurrences with the integer itself as the hash key, would both be feasible.
I would say that using a sorting algorithm and then going through the sorted list to find the number is a good way to do it.
And now the problem is finding "the best" sorting algorithm. There are a lot of sorting algorithms, each of them with its strong and weak points, so this is quite a complicated question. The Wikipedia entry seems like a nice source of info on that.
Implementation in Ruby:
a = [1,2,3,4,123,1,2,.........]
t = a.length-1
for i in 0..t
  s = a.index(a[i])+1
  b = a[s..t]
  w = b.include?a[i]
  if w == false
    puts a[i]
  end
end
You need to specify what you mean by "best" - to some, speed is all that matters and would qualify an answer as "best" - for others, they might forgive a few hundred milliseconds if the solution was more readable.
"Best" is subjective unless you are more specific.
That said:
Iterate through the numbers; for each number, search the list for that number, and when you reach the number that returns only 1 for the number of search results, you are done.
Seems like the best you could do is to iterate through the list, for every item add it to a list of "seen" items or else remove it from the "seen" if it's already there, and at the end your list of "seen" items will include the singular element. This is O(n) in regards to time and n in regards to space (in the worst case, it will be much better if the list is sorted).
The fact that they're integers doesn't really factor in, since there's nothing special you can do with adding them up... is there?
Question
I don't understand why the selected answer is "best" by any standard. O(N*lgN) > O(N), and it changes the list (or else creates a copy of it, which is still more expensive in space and time). Am I missing something?
Depends on how large/small/diverse the numbers are though. A radix sort might be applicable which would reduce the sorting time of the O(N log N) solution by a large degree.
The sorting method and the XOR method have the same time complexity. The XOR method is only O(n) if you assume that bitwise XOR of two strings is a constant time operation. This is equivalent to saying that the size of the integers in the array is bounded by a constant. In that case you can use Radix sort to sort the array in O(n).
If the numbers are not bounded, then bitwise XOR takes time O(k) where k is the length of the bit string, and the XOR method takes O(nk). Now again Radix sort will sort the array in time O(nk).
You could simply put the elements in the set into a hash until you find a collision. In ruby, this is a one-liner.
def find_dupe(array)
  h = {}
  array.detect { |e| h[e] || (h[e] = true; false) }
end
So, find_dupe([1,2,3,4,5,1]) would return 1.
This is actually a common "trick" interview question though. It is normally about a list of consecutive integers with one duplicate. In this case the interviewer is often looking for you to use the Gaussian sum of n-integers trick e.g. n*(n+1)/2 subtracted from the actual sum. The textbook answer is something like this.
def find_dupe_for_consecutive_integers(array)
  n = array.size - 1          # subtract one from array.size because of the dupe
  array.sum - n * (n + 1) / 2
end

What elegant solution exists for this pattern? Multi-Level Searching

Assume that we have multiple arrays of integers. You can consider each array as a level. We try to find a sequence of elements, exactly one element from each array, where each step to the next array satisfies the same predicate. For example, we have v1, v2, v3 as the arrays:
v1 |  v2 |  v3
---------------
 1 |   4 |  16
 2 |   5 |  81
 3 |  16 | 100
 4 |  64 | 121
I could say that the predicate is: next_element == previous_element^2
A valid sequence from the above example is: 2 -> 4 -> 16
Actually, in this example there isn't another valid sequence.
I could write three loops to brute-force the mentioned example, but what if the number of arrays is variable (with known order, of course)? How would you solve this problem?
Hints, or references to design patterns, are very appreciated. I shall do it in C++, but I just need the idea.
Thanks,
If you order your arrays beforehand, the search can be done much faster. You could start on your smallest array, then binary-search for the expected numbers in each of the others. This would be O(n*k*log M), n being the size of the smallest array, k being the number of arrays, and M being the size of the largest array.
This could be done even faster if you use Hashmaps instead of arrays. This would let you search in O(n*k).
If using reverse functions (to search in earlier arrays) is not an option, then you should start on first array, and n = size of first array.
For simplicity, I'll start from first array
// note the 1-based arrays
for (i : 1 until allArrays[1].size()) {
    baseNumber = allArrays[1][i];
    for (j : 2 until allArrays.size()) {
        expectedNumber = function(baseNumber);
        if (!find(expectedNumber, allArrays[j]))
            break;
        baseNumber = expectedNumber;
    }
}
You can probably do some null checks and add some booleans in there to know whether the sequence exists or not.
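The hash-set variant mentioned above might look roughly like this in Haskell (a sketch with made-up names; Data.Set stands in for the hash sets since the elements are plain Ints):

import qualified Data.Set as Set

-- Starting from each element of the first level, chase the predicate
-- (here a function computing the expected next element) through the
-- remaining levels, keeping only chains that reach the last level.
chains :: (Int -> Int) -> [Int] -> [[Int]] -> [[Int]]
chains next firstLevel rest = [ x : path | x <- firstLevel, path <- go x sets ]
  where
    sets = map Set.fromList rest
    go _ []       = [[]]
    go x (s : ss) = let y = next x
                    in if y `Set.member` s then map (y :) (go y ss) else []

-- With the example levels, chains (^ 2) [1,2,3,4] [[4,5,16,64],[16,81,100,121]]
-- yields [[2,4,16]].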
(Design patterns apply to class and API design to improve code quality, but they aren't for solving computational problems.)
Depending on the cases:
If the arrays come in random order, and you have finite space requirements, then brute force is the only solution. O(N^k) time (k = 3 here), O(1) space.
If the predicate is not invertible (e.g. SHA1(next_elem) xor SHA1(prev_elem) == 0x1234), then brute force is also the only solution.
If you can spare the space, then create hash sets for v2 and v3, so you can quickly find the next element that satisfies the predicate. O(N + b^k) time, O(kN) space. (b = max number of next_elem values that satisfy the predicate given a prev_elem)
If the arrays are sorted and bounded, you can also use binary search instead of the hash table to avoid using space. O(N*(log N)^(k-1) + b^k) time, O(1) space.
(All of the space count doesn't take account to stack usage due to recursion.)
A general way that consumes up to O(N*b^k) space is to build the solution by successively filtering, e.g.
solutions = [[1], [2], ... [N]]

filterSolution(Predicate, curSols, nextElems) {
    nextSols = []
    for each curSol in curSols:
        find each elem in nextElems that satisfies the Predicate
        append elem to a copy of curSol, then push it onto nextSols
    return nextSols
}

for each level:
    solutions = filterSolution(Predicate, solutions, all elems in this level)
return solutions
You could generate a separate index that maps an index from one array to the index of another. From the index you can quickly see whether a solution exists or not.
Generating the index would require a brute-force approach, but then you'd do it only once. If you want to improve the array search, consider using a more appropriate data structure to allow for fast search (red-black trees, for example, instead of arrays).
I would keep all vectors as heaps so I can have O(log n) complexity when searching for an element. So for a total of k vectors you will get a complexity like O(k * log n)
If the predicates preserve the ordering in the arrays (e.g. with your example, if the values are all guaranteed non-negative), you could adapt a merge algorithm. Consider the key for each array to be the end-value (what you get after applying the predicate as many times as needed for all arrays).
If the predicate doesn't preserve ordering (or the arrays aren't ordered to start) you can sort by the end-value first, but the need to do that suggests that another approach may be better (such as the hash tables suggested elsewhere).
Basically, check whether the next end-value is equal for all arrays. If not, step over the lowest (in one array only) and repeat. If you get all three equal, that is a (possible) solution - step over all three before searching for the next.
"Possible" solution because you may need to do a check - if the predicate function can map more than one input value to the same output value, you might have a case where the value found in some arrays (but not the first or last) is wrong.
EDIT - there may be bigger problems when the predicate doesn't map each input to a unique output - can't think at the moment. Basically, the merge approach can work well, but only for certain kinds of predicate function.