Calculate ratio of an element in a list efficiently - list

The following code works with small lists, but it takes forever with long lists. I suppose my double use of length is the problem.
ratioOfPrimes :: [Int] -> Double
ratioOfPrimes xs = fromIntegral (length (filter isPrime xs))/ fromIntegral(length xs)
How do I calculate the ratio of an element in longer lists?

The double use of length isn't the main problem here. The multiple traversals in your implementation produce a constant factor and with double length and filter you get the avg complexity of O(3n). Due to Stream Fusion it's even O(2n), as already mentioned by Impredicative. But in fact since the constant factors don't have a dramatic effect on performance, it's even conventional to simply ignore them, so, conventionally speaking, your implementation still has the complexity of O(n), where n is the length of the input list.
The real problem here is that the above would all be true only if isPrime had the complexity of O(1), but it doesn't. This function performs a traversal through a list of all primes, so it itself has the complexity of O(m). So the dramatic performance decrease here is caused by your algorithm having the final complexity of O(n*m), because on each iteration of the input list it has to traverse the list of all primes to an unknown depth.
To optimize, I suggest first sorting the input list (which takes O(n*log n)) and integrating a custom lookup on the list of all primes that drops the already visited numbers on each iteration. This way you'll be able to achieve a single traversal of the list of all primes, which theoretically could give you the complexity of O(n*log n + n + m), which again, conventionally, can be thought of as simply O(n*log n), by highlighting the cost center.
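The sort-then-walk idea above can be sketched roughly as follows. This is only an illustration under assumptions: the names primes and countPrimes are hypothetical, and the lazily generated trial-division prime list stands in for whatever list of primes the asker's isPrime traverses.

```haskell
import Data.List (sort)

-- Assumption: a lazily generated list of all primes (trial division by
-- smaller primes); the original code's prime list is not shown.
primes :: [Int]
primes = 2 : filter isPrime' [3, 5 ..]
  where
    isPrime' n = all (\p -> n `mod` p /= 0) (takeWhile (\p -> p * p <= n) primes)

-- Count the primes in the input with a single walk along the prime list.
-- The input is sorted first, so primes smaller than the current element
-- are dropped once and never visited again.
countPrimes :: [Int] -> Int
countPrimes = go 0 primes . sort
  where
    go acc _ [] = acc
    go acc [] _ = acc -- unreachable here: the prime list is infinite
    go acc ps@(p : rest) xs@(x : xs')
      | x < p     = go acc ps xs'                -- x is not prime
      | x == p    = let acc' = acc + 1
                    in acc' `seq` go acc' ps xs' -- keep p for duplicates of x
      | otherwise = go acc rest xs               -- advance along the primes

ratioOfPrimes :: [Int] -> Double
ratioOfPrimes xs = fromIntegral (countPrimes xs) / fromIntegral (length xs)
```

The second length call is kept on purpose: as argued above, it only adds a constant factor.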

So, there's a few things going on there. Let's look at some of the operations involved:
length
filter
isPrime
length
As you say, using length twice isn't going to help, since that's O(n) for lists. You do that twice. Then there's filter, which is also going to do a whole pass of the list in O(n). What we'd like to do is do all this in a single pass of the list.
Functions in the Data.List.Stream module implement a technique called Stream Fusion, which would for example rewrite your (length (filter isPrime xs)) call into a single loop. However, you'd still have the second call to length. You could rewrite this whole thing into a single fold (or use of the State or ST monads) with a pair of accumulators and do this in a single pass:
-- requires: import Data.List (foldl')
ratioOfPrimes xs = let
    (a, b) = foldl' (\(primes, total) i -> if isPrime i then (primes + 1, total + 1) else (primes, total + 1)) (0, 0) xs
  in a / b
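One caveat with a tuple accumulator is that foldl' only forces the pair itself, not the counters inside it, so thunks can still pile up. A possible variant with a strict accumulator is sketched below. The Counts type is a hypothetical helper, and the trial-division isPrime is an assumption standing in for the asker's function, which is not shown.

```haskell
import Data.List (foldl')

-- Assumption: a simple trial-division test in place of the asker's isPrime.
isPrime :: Int -> Bool
isPrime n
  | n < 2     = False
  | otherwise = all (\d -> n `mod` d /= 0) (takeWhile (\d -> d * d <= n) [2 ..])

-- Hypothetical accumulator with strict fields, so both counters are
-- evaluated at every step of the fold.
data Counts = Counts !Int !Int

ratioOfPrimes :: [Int] -> Double
ratioOfPrimes xs = fromIntegral p / fromIntegral t
  where
    Counts p t = foldl' step (Counts 0 0) xs
    step (Counts primeCount total) x
      | isPrime x = Counts (primeCount + 1) (total + 1)
      | otherwise = Counts primeCount (total + 1)
```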
However, in this case you could also move away from using a list and use the vector library. The vector library implements the same stream fusion techniques for removing intermediate lists, but also has some other nifty features:
length is O(1)
The Data.Vector.Unboxed module lets you store unboxable types (which primitive types such as Int certainly are) without the overhead of the boxed representation. So this list of ints would be stored as a low-level Int array.
Using the vector package should let you write the idiomatic representation you have above and get better than the performance of a single-pass translation.
import qualified Data.Vector.Unboxed as U
ratioOfPrimes :: U.Vector Int -> Double
ratioOfPrimes xs = (fromIntegral $ U.length . U.filter isPrime $ xs) / (fromIntegral $ U.length xs)
Of course, the thing that hasn't been mentioned is the isPrime function, and whether the real problem is that it's slow for large n. An unperformant prime checker could easily blow concerns over list indexing out of the water.


List.fold_left implementation vs List.fold_right implementation

I am new to OCaml, and I have seen from other posts that fold_left in List is tail recursive and works better on larger lists, whereas fold_right is not tail recursive.
My question is why fold_left only works better on larger lists. How is it implemented such that it does not work better on smaller lists?
Being tail-recursive avoids allocating a stack frame per element. The saving is directly proportional to the length of the list.
On a small list, there will be a gain, but it's not likely to be noticeable until you start using big lists.
As a rule of thumb, you should use fold_left unless you are working on a small list and the fold_right version corresponds more to what you're trying to write.
The fold_left function is indeed tail-recursive, however, it works fine on both small and large lists. There is no gain in using fold_right instead of fold_left on small lists. The fold_left function is always faster than fold_right, and the rumors that you heard are not about fold_left vs fold_right, but rather about a tail-recursive version of fold_right vs a non-tail-recursive version of fold_right. But let me first of all highlight the difference between right and left folds.
The left fold takes a list of elements
a b c d ... z
and a function f, and produces a value
(f (f (f (f a b) c) d) ... z)
It is easier to understand, if we imagine that f is some operator, e.g., an addition, and use the infix notation a + b, instead of the prefix notation (add a b), so the left fold will reduce a sequence to a sum as follows
((((a + b) + c) + d) + ... + z)
So, we can see that the left fold associates parentheses to the left. This is its only difference from the right fold, which associates parentheses to the right, so if we take the same sequence and apply the same function to it using the right fold, we will have the following computation
(a + (b + ... (x + (y + z))))
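To make the two groupings visible, here is a small sketch with subtraction, an operator for which the grouping changes the result. Note that it is written in Haskell rather than OCaml: foldl and foldr are used as stand-ins because they associate in the same way as List.fold_left and List.fold_right, although the argument order differs between the two languages.

```haskell
-- Left fold: ((0 - 1) - 2) - 3
leftNested :: Int
leftNested = foldl (-) 0 [1, 2, 3]

-- Right fold: 1 - (2 - (3 - 0))
rightNested :: Int
rightNested = foldr (-) 0 [1, 2, 3]
```

Here leftNested evaluates to -6 and rightNested to 2.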
In the case of the addition operation, the result will be the same for both left and right folds. However, the right fold implementation will be less efficient. The reason is that for the left fold, we can compute the result as soon as we get two elements, e.g., a+b, whereas for the right fold, we need to compute the result of the addition of n-1 elements, and then add the first element, e.g., a + (b + ... + (y + z)). Therefore, the right fold has to store the intermediate results somewhere. The easy way is to use the stack, e.g., a::rest -> a + (fold_right (+) rest 0), where the a value is put onto the stack, then the (fold_right (+) rest 0) computation is run, and when it is ready, we can finally add a and the sum of all other elements. Eventually, it will push all values a, b, ... x, until we finally get to y and z, which we can sum, and then unwind the stack of calls.
The problem with the stack is that it is usually bounded, unlike heap memory, which may grow without any bounds. This is not specific to mathematics or programming language design. It is how modern operating systems run programs: they give them a fixed-size stack and an unbounded heap. And once a program runs out of stack space, the operating system terminates it, without any possibility to recover. This is very bad, and should be avoided if possible.
Therefore, people proposed a safer implementation of fold_right, as a left fold of a reversed list. Obviously, this tradeoff results in a slower implementation, as we have to essentially create a reversed copy of the input list, and only after that traverse it with the fold_left function. As a result, we will traverse the list twice and produce garbage, which will further reduce the performance of our code. Therefore, we have a tradeoff between fast but unsafe implementation as provided by the standard library, versus a sure and safe, but slow implementation provided by some other libraries.
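The "left fold of a reversed list" construction can be sketched as follows. Again this is Haskell used for illustration, not the code of any particular OCaml library, and foldrViaReverse is a hypothetical name.

```haskell
import Data.List (foldl')

-- A right fold expressed as a strict left fold over the reversed list.
-- It runs in constant stack space, but pays for an extra traversal and
-- for the reversed copy of the input.
foldrViaReverse :: (a -> b -> b) -> b -> [a] -> b
foldrViaReverse f z xs = foldl' (flip f) z (reverse xs)
```

For example, foldrViaReverse (-) 0 [1, 2, 3] gives 2, the same as the ordinary right fold 1 - (2 - (3 - 0)).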
To summarize, fold_left is always faster than fold_right, and is always tail-recursive. The standard OCaml implementation of fold_right is not tail-recursive, which is faster than a tail recursive implementation of fold_right functions provided by some other libraries. However, this comes with a price, you shall not apply fold_right to large lists. In general, it means that in OCaml you have to prefer fold_left as your primary tool for processing lists.

What container really mimics std::vector in Haskell?

The problem
I'm looking for a container that is used to save partial results of n - 1 problems in order to calculate the nth one. This means that the size of the container, at the end, will always be n.
Each element, i, of the container depends on at least 2 and up to 4 previous results.
The container has to provide:
constant time insertions at either beginning or end (one of the two, not necessarily both)
constant time indexing in the middle
or alternatively (given a O(n) initialization):
constant time single element edits
constant time indexing in the middle
What is std::vector and why is it relevant
For those of you who don't know C++, std::vector is a dynamically sized array. It is a perfect fit for this problem because it is able to:
reserve space at construction
offer constant time indexing in the middle
offer constant time insertion at the end (with a reserved space)
Therefore this problem is solvable in O(n) complexity, in C++.
Why Data.Vector is not std::vector
Data.Vector, together with Data.Array, provides similar functionality to std::vector, but not quite the same. Both, of course, offer constant time indexing in the middle, but they offer neither constant time modification ((//) for example is at least O(n)) nor constant time insertion at either beginning or end.
Conclusion
What container really mimics std::vector in Haskell? Alternatively, what is my best shot?
From reddit comes the suggestion to use Data.Vector.constructN:
O(n) Construct a vector with n elements by repeatedly applying the generator function to the already constructed part of the vector.
constructN 3 f = let a = f <> ; b = f <a> ; c = f <a,b> in f <a,b,c>
For example:
λ import qualified Data.Vector as V
λ V.constructN 10 V.length
fromList [0,1,2,3,4,5,6,7,8,9]
λ V.constructN 10 $ (1+) . V.sum
fromList [1,2,4,8,16,32,64,128,256,512]
λ V.constructN 10 $ \v -> let n = V.length v in if n <= 1 then 1 else (v V.! (n - 1)) + (v V.! (n - 2))
fromList [1,1,2,3,5,8,13,21,34,55]
This certainly seems to qualify to solve the problem as you've described it above.
The first data structures that come to my mind are either Maps from Data.Map or Sequences from Data.Sequence.
Update
Data.Sequence
Sequences are persistent data structures that support most operations efficiently, while allowing only finite sequences. Their implementation is based on finger trees, if you are interested. But which qualities does it have?
O(1) calculation of the length
O(1) insert at front/back with the operators <| and |> respectively.
O(n) creation from a list with fromList
O(log(min(n1,n2))) concatenation for sequences of length n1 and n2.
O(log(min(i,n-i))) indexing for an element at position i in a sequence of length n.
Furthermore this structure supports a lot of the known and handy functions you'd expect from a list-like structure: replicate, zip, null, scans, sort, take, drop, splitAt and many more. Due to these similarities you have to do either qualified import or hide the functions in Prelude, that have the same name.
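As a rough illustration of how these operations fit the question, the following sketch builds a table of partial results by appending with |> and reading earlier entries with Seq.index. The recurrence (each entry is the sum of the two before it) and the name buildTable are assumptions chosen for the example.

```haskell
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Build a table of n partial results; each new entry depends on the two
-- previous ones (hypothetical recurrence, chosen for illustration).
buildTable :: Int -> Seq Integer
buildTable n = go (Seq.fromList (take n [1, 1]))
  where
    go table
      | Seq.length table >= n = table
      | otherwise =
          let k    = Seq.length table
              next = Seq.index table (k - 1) + Seq.index table (k - 2)
          in go (table |> next) -- O(1) append at the back
```

Since the entries read are always near the end, the O(log(min(i,n-i))) indexing cost stays small here.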
Data.Map
Maps are the standard workhorse for realizing a correspondence between "things". What you might call a hashmap or associative array in other programming languages is called a Map in Haskell. Unlike in, say, Python, Maps are pure - so an update gives you back a new Map and does not modify the original instance.
Maps come in two flavors - strict and lazy.
Quoting from the Documentation
Strict
API of this module is strict in both the keys and the values.
Lazy
API of this module is strict in the keys, but lazy in the values.
So you need to choose what fits best for your application. You can try both versions and benchmark with criterion.
Instead of listing the features of Data.Map I want to pass on to
Data.IntMap.Strict
Which can leverage the fact that the keys are integers to squeeze out a better performance
Quoting from the documentation we first note:
Many operations have a worst-case complexity of O(min(n,W)). This means that the operation can become linear in the number of elements with a maximum of W -- the number of bits in an Int (32 or 64).
So what are the characteristics for IntMaps
O(min(n,W)) for (unsafe) indexing (!), unsafe in the sense that you will get an error if the key/index does not exist. This is the same behavior as Data.Sequence.
O(n) calculation of size
O(min(n,W)) for safe indexing lookup, which returns a Nothing if the key is not found and Just a otherwise.
O(min(n,W)) for insert, delete, adjust and update
So you see that this structure is less efficient than Sequences, but it provides a bit more safety and a big benefit if you actually don't need all entries, such as in the representation of a sparse graph, where the nodes are integers.
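A small sketch of the same kind of table kept in an IntMap, using insert, the unsafe (!) and the safe lookup listed above. The recurrence and the name fibTable are assumptions made for the example.

```haskell
import Data.List (foldl')
import qualified Data.IntMap.Strict as IM

-- Fill the keys 3..n one at a time; every new entry reads two earlier
-- ones with the unsafe (!), which is fine here because both keys are
-- guaranteed to be present already.
fibTable :: Int -> IM.IntMap Integer
fibTable n = foldl' step base [3 .. n]
  where
    base = IM.fromList [(1, 1), (2, 1)]
    step m i = IM.insert i (m IM.! (i - 1) + m IM.! (i - 2)) m
```

With this table, IM.lookup 10 (fibTable 10) is Just 55, while a key that was never inserted gives Nothing instead of an error.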
For completeness I'd like to mention a package called persistent-vector, which implements clojure-style vectors, but seems to be abandoned as the last upload is from (2012).
Conclusion
So for your use case I'd strongly recommend Data.Sequence or Data.Vector, unfortunately I don't have any experience with the latter, so you need to try it for yourself. From the stuff I know it provides a powerful thing called stream fusion, that optimizes to execute multiple functions in one tight "loop" instead of running a loop for each function. A tutorial for Vector can be found here.
When looking for functional containers with particular asymptotic run times, I always pull out Edison.
Note that there's a result that in a strict language with immutable data structures, there's always a logarithmic slowdown to implementing mutable data structures on top of them. It's an open problem whether the limited mutation hidden behind laziness can avoid that slowdown. There's also the issue of persistent vs. transient...
Okasaki is still a good read for background, but finger trees or something more complex like an RRB-tree should be available "off-the-shelf" and solve your problem.
I'm looking for a container that is used to save partial results of n - 1 problems in order to calculate the nth one.
Each element, i, of the container depends on at least 2 and up to 4 previous results.
Let's consider a very small program that calculates Fibonacci numbers.
fib 1 = 1
fib 2 = 1
fib n = fib (n-1) + fib (n-2)
This is great for small n, but horrible for larger n, because the number of recursive calls grows exponentially. At this point, you stumble across this gem:
fib n = fibs !! (n - 1) where fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
You may be tempted to exclaim that this is dark magic (infinite, self-referential list building and zipping? wth!) but it is really a great example of tying the knot, and using laziness to ensure that values are calculated as needed.
Similarly, we can use an array to tie the knot too.
import Data.Array

fib :: Int -> Int
fib n = arr ! n
  where
    arr :: Array Int Int
    arr = listArray (1, n) (map fib' [1 .. n])
    -- each cell refers lazily to earlier cells of the same array
    fib' 1 = 1
    fib' 2 = 1
    fib' k = arr ! (k - 1) + arr ! (k - 2)
Each element of the array is a thunk that uses other elements of the array to calculate its value. In this way, we can build a single array, never having to perform concatenation, and call out values from the array at will, only paying for the calculation up to that point.
The beauty of this method is that you don't only have to look behind you, you can look in front of you as well.

Complexity of lists in Haskell in Data.Map

Sorry if this seems like an obvious question.
I was creating a Data.Map of lists {actually a tuple of an integer and a list (Integer, [(Integer, Integer)])} for implementing a priority queue + adjacency list for some graph algorithms like Dijkstra's and Prim's.
Data.Map is implemented using binary trees (I read that), so I just want to confirm that when doing the map operations (I believe they will be rotations) the interpreter does not do deep copies of the list, just shallow copies of the references to the lists, right?
I am doing this to implement Prim's algorithm in Haskell which will run in O(nlogn + mlogn) time, where n = no. of vertices and m = no. of edges, in a purely functional way.
If the lists are stored in the priority queue the algorithm will work in that time. Most Haskell implementations I found online don't achieve this complexity.
Thanks in advance.
You are correct that the lists will not be copied every time you create a new Map, at least if you're using GHC (other implementations probably do this correctly as well). This is one of the advantages of a purely functional language: because data can't be rewritten, data structures don't need to be copied to avoid problems you might have in an imperative language. Consider this snippet of Lisp:
(setf a (list 1 2 3 4 5))
(setf b a)
; a and b are now both '(1 2 3 4 5).
(setf (cadr a) 0)
; a is now '(1 0 3 4 5).
; Careful though! a and b point to the same piece of memory,
; so b is now also '(1 0 3 4 5). This may or may not be what you expected.
In Haskell, the only way to have mutable state like this is to use an explicitly mutable data structure, such as something in the State monad (and even this is sort of faking it (which is a good thing)). This potentially unexpected aliasing issue is unthinkable in Haskell because once you declare that a is a particular list, it is that list now and forever. Because it is guaranteed to never change, there is no danger in reusing memory for things that are supposed to be equal, and in fact, GHC will do exactly this. Therefore, when you make a new Map with the same values, only pointers to the values will be copied, not the values themselves.
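The observable side of this can be sketched with a small adjacency map. The names Adjacency, original and updated, and the sample data, are made up for the example. The sketch shows that inserting into the map leaves the old version intact and that untouched entries compare equal in both versions. That those entries are also physically shared is the point made above and cannot be observed from pure code.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical adjacency map: vertex -> [(neighbour, weight)]
type Adjacency = M.Map Integer [(Integer, Integer)]

original :: Adjacency
original = M.fromList [(1, [(2, 7), (3, 9)]), (2, [(3, 10)])]

-- A new map with one more key. The original is not modified, and the
-- lists stored under keys 1 and 2 are not rebuilt.
updated :: Adjacency
updated = M.insert 3 [(1, 9)] original
```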
For more information, read about the difference between Boxed and Unboxed types.
1) Integer is slower than Int.
2) If you have [(Integer, [(Integer, Integer)])]
you could create with Data.Map not only Map Integer [(Integer, Integer)], but also Map Integer (Map Integer Integer).
3) If you use Int instead of Integer, you could use a slightly faster structure, IntMap from Data.IntMap: IntMap (IntMap Int).
4) The complexity of each function is given in the documentation of
Data.IntMap.Strict and Data.IntMap.Lazy:
map :: (a -> b) -> IntMap a -> IntMap b
O(n). Map a function over all values in the map.
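The nested IntMap (IntMap Int) representation from point 3 could be built from an edge list roughly like this. The Graph alias, the fromEdges helper and the (from, to, weight) edge format are assumptions made for the sketch.

```haskell
import qualified Data.IntMap.Strict as IM

-- Hypothetical graph type: vertex -> (neighbour -> edge weight)
type Graph = IM.IntMap (IM.IntMap Int)

-- Build the adjacency structure from (from, to, weight) triples.
-- Edges leaving the same vertex are merged with IM.union.
fromEdges :: [(Int, Int, Int)] -> Graph
fromEdges edges =
  IM.fromListWith IM.union [(u, IM.singleton v w) | (u, v, w) <- edges]
```

Looking up the weight of an edge is then two IntMap lookups, for example IM.lookup 1 g >>= IM.lookup 2 for the edge from 1 to 2 in a graph g.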

What is the fastest way to return x,y coordinates that are present in both list A and list B?

I have two lists (list A and list B) of x,y coordinates where 0 < x < 4000, 0 < y < 4000, and they will always be integers. I need to know what coordinates are in both lists. What would be your suggestion for how to approach this?
I have been thinking about representing the lists as two grids of bits and doing bitwise & possibly?
List A has about 1000 entries and changes maybe once every 10,000 requests. List B will vary wildly in length and will be different on every run through.
EDIT: I should mention that no coordinate will be in lists twice; 1,1 cannot be in list A more than once for example.
Represent (x,y) as a single 24 bit number as described in the comments.
Maintain A in numerical order (you said it doesn't vary much, so this should be hardly any cost).
For each B do a binary search on the list. Since A is about 1000 items big, you'll need at most 10 integer comparisons (in the worst case) to check for membership.
If you have a bit more memory (about 2MB) to play with, you could create a bit-vector to support all possible 24 bit numbers and then perform a single bit operation per item to test for membership. So A would be represented by a single 2^24 bit number with a bit set if the value is there (otherwise 0). To test for membership you would just use an appropriate bitwise AND operation.
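The packing and the membership test might look like this in Haskell. The function names are made up, and Data.IntSet from the containers package is swapped in for the sorted array or bit-vector described above, so this shows the encoding idea rather than those exact data structures.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import qualified Data.IntSet as IS

-- Pack (x, y) into one number: 12 bits each, since 0 < x, y < 4000 < 4096.
encode :: (Int, Int) -> Int
encode (x, y) = (x `shiftL` 12) .|. y

decode :: Int -> (Int, Int)
decode k = (k `shiftR` 12, k .&. 0xFFF)

-- Build the set for list A once, then test every element of list B.
common :: [(Int, Int)] -> [(Int, Int)] -> [(Int, Int)]
common listA listB = filter (\b -> IS.member (encode b) setA) listB
  where
    setA = IS.fromList (map encode listA)
```

Because list A changes rarely, setA could be kept around between requests instead of being rebuilt on every call.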
Put the coordinates of list A into some kind of a set (probably a hash, bst, or heap), then you can quickly see if the coordinate from list B is present.
Whether you expect the coordinate to be present or not present in the list would determine what underlying data structure you use.
Hashes are good at telling you if something is in it, though depending on how it's implemented, could behave poorly when trying to find something that isn't in it.
bst and heaps are equally good at telling you if something is in it or not, but don't perform theoretically as well as hashes when something is in it.
Since A is rather static you may consider building a query structure and checking for all elements in B whether they occur in A. One example would be an std::set<std::pair<int, int> > A and you can query like A.find(element_from_b) != A.end() ...
So the running time in total is worst case O(b log a) (where b is the number of elements in B, and a respectively). Note also that since a is always about 1000, log a basically is constant.
Define an ordering based on their lexicographic order (sort first on x then on y). Sort both lists based on that ordering in O(n log n) time where n is the larger of the number of elements of each list. Set a pointer to the first element of each list and advance the one that points to the lesser element; when the pointers refer to elements with the same value, put them into a set (to avoid multiplicities within each list). This last part can be done in O(n) time (or O(m log m) where m is the number of elements common to both lists).
Update (based on comment below and edit above): Since no point appears more than once in each list, you can use a list or vector or deque to hold the points common to both, or some other container with (amortized) constant time insertion, realizing the O(n) time performance regardless of the number of common elements.
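A compact sketch of this sort-and-merge approach, written in Haskell, where pairs are already ordered lexicographically. The names intersectSorted and commonPoints are hypothetical.

```haskell
import Data.List (sort)

-- Walk two sorted lists in step, advancing past the lesser head and
-- emitting an element whenever both heads are equal.
intersectSorted :: Ord a => [a] -> [a] -> [a]
intersectSorted xs@(x : xs') ys@(y : ys') =
  case compare x y of
    LT -> intersectSorted xs' ys
    GT -> intersectSorted xs ys'
    EQ -> x : intersectSorted xs' ys'
intersectSorted _ _ = []

-- Tuples compare on the first component, then on the second.
commonPoints :: [(Int, Int)] -> [(Int, Int)] -> [(Int, Int)]
commonPoints a b = intersectSorted (sort a) (sort b)
```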
This is easy if you implement an STL predicate which orders two pairs (i.e. return (R.x < L.x || (R.x==L.x && R.y < L.y));). You can then call std::list::sort to order them, and std::set_intersection to find the common elements. No need to write the algorithms yourself.
This is the kind of problem that just screams "Bloom Filter" at me.
If I understand correctly, you want the common coordinates in X and Y -- the intersection of (sets) Listing A and B? If you are using STL:
#include <algorithm> // set_intersection
#include <iterator>  // insert_iterator
#include <set>
using namespace std;
// ...
set<int> a; // x coord (assumed populated elsewhere)
set<int> b; // y coord (assumed populated elsewhere)
set<int> in; // intersection
// ...
set_intersection(a.begin(), a.end(), b.begin(), b.end(), insert_iterator<set<int> >(in,in.begin()));
I think hashing is your best bet.
//Pseudocode:
INPUT: two lists, each with (x,y) coordinates
1. find the list that's longer, call it A
2. hash each element in A
3. go to the other list, call it B
4. hash each element in B and look it up in the table.
5. if there's a match, return/store (x,y) somewhere
6. repeat #4 till the end
Assuming length of A is m and B's length is n, run time is O(m + n) --> O(n)

What elegant solution exists for this pattern? Multi-Level Searching

Assume that we have multiple arrays of integers. You can consider each array as a level. We try to find a sequence of elements, exactly one element from each array, and proceed to the next array with the same predicate. For example, we have v1, v2, v3 as the arrays:
v1 | v2 | v3
-----------------
1 | 4 | 16
2 | 5 | 81
3 | 16 | 100
4 | 64 | 121
I could say that the predicate is: next_element == previous_element^2
A valid sequence from the above example is: 2 -> 4 -> 16
Actually, in this example there isn't another valid sequence.
I could write three loops to brute-force the mentioned example, but what if the number of arrays is variable, though with a known order of course? How would you solve this problem?
Hints or references to design patterns are much appreciated. I shall do it in C++, but I just need the idea.
Thanks,
If you order your arrays beforehand, the search can be done much faster. You could start on your smallest array, then binary-search for the expected numbers in each of the others. This would be O(n*k*log M), n being the size of the smallest array, k being the number of arrays, M being the size of the largest array.
This could be done even faster if you use Hashmaps instead of arrays. This would let you search in O(n*k).
If using reverse functions (to search in earlier arrays) is not an option, then you should start on first array, and n = size of first array.
For simplicity, I'll start from first array
//note the 1-based arrays
for (i : 1 until allArrays[1].size()) {
    baseNumber = allArrays[1][i];
    for (j : 2 until allArrays.size()) {
        expectedNumber = function(baseNumber);
        if (!find(expectedNumber, allArrays[j]))
            break;
        baseNumber = expectedNumber;
    }
}
You can probably do some null checks and add some booleans in there to know if the sequence exist or not
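The same loop, with the hash-map idea from above, might be sketched in Haskell as follows. Data.IntSet stands in for the hash sets, and the name chains and the assumption that the predicate is given as a function from one element to the expected next element are made up for the example.

```haskell
import qualified Data.IntSet as IS

-- For every start value in the first array, follow the expected value
-- through the remaining levels; keep the chain only if every level
-- contains the value expected there.
chains :: (Int -> Int) -> [[Int]] -> [[Int]]
chains _ [] = []
chains next (first : rest) = [c | x <- first, Just c <- [follow x levelSets]]
  where
    levelSets = map IS.fromList rest -- one set per later level
    follow x [] = Just [x]
    follow x (s : ss)
      | IS.member y s = (x :) <$> follow y ss
      | otherwise     = Nothing
      where
        y = next x
```

For the example in the question, chains (^ 2) [[1,2,3,4],[4,5,16,64],[16,81,100,121]] yields [[2,4,16]].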
(Design patterns apply to class and API design to improve code quality, but they aren't for solving computational problems.)
Depending on the cases:
If the arrays come in random order and you have limited space, then brute force is the only solution. O(N^k) time (k = 3), O(1) space.
If the predicate is not invertible (e.g. SHA1(next_elem) xor SHA1(prev_elem) == 0x1234), then brute force is also the only solution.
If you can afford the space, then create hash sets for v2 and v3, so you can quickly find the next element that satisfies the predicate. O(N + b^k) time, O(kN) space. (b = max number of next_elem that satisfy the predicate given a prev_elem)
If the arrays are sorted and bounded, you can also use binary search instead of the hash table to avoid using space. O(N (log N)^(k-1) + b^k) time, O(1) space.
(None of the space counts take into account the stack usage due to recursion.)
A general way that consumes up to O(N b^k) space is to build the solution by successive filtering, e.g.
solutions = [[1], [2], ... [N]]
filterSolution (Predicate, curSols, nextElems) {
    nextSols = []
    for each curSol in curSols:
        find elem in nextElems that satisfy the Predicate
        append elem into a copy of curSol, then push into nextSols
    return nextSols
}
for each levels:
    solutions = filterSolution(Predicate, solutions, all elems in this level)
return solutions
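One possible rendering of this successive filtering in Haskell is shown below, with the predicate taken as a two-argument test on (previous, next). The names extend and solve are hypothetical, and the sketch uses the brute-force search within each level rather than the hash sets or binary search mentioned above.

```haskell
-- Extend every partial solution by each element of the next level that
-- satisfies the predicate with respect to the solution's last element.
extend :: (a -> a -> Bool) -> [[a]] -> [a] -> [[a]]
extend p partials level =
  [partial ++ [e] | partial <- partials, e <- level, p (last partial) e]

-- Start with one single-element solution per entry of the first level,
-- then filter through the remaining levels in order.
solve :: (a -> a -> Bool) -> [[a]] -> [[a]]
solve _ [] = []
solve p (first : rest) = foldl (extend p) (map (: []) first) rest
```

With the predicate \prev next -> next == prev * prev and the three arrays from the question, solve returns [[2,4,16]].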
You could generate a separate index that maps an index from one array to the index of another. From the index you can quickly see if a solution exists or not.
Generating the index would require a brute force approach, but then you'd do it only once. If you want to improve the array search, consider using a more appropriate data structure to allow for fast search (red-black trees for example instead of arrays).
I would keep all vectors as heaps so I can have O(log n) complexity when searching for an element. So for a total of k vectors you will get a complexity like O(k * log n)
If the predicates preserve the ordering in the arrays (e.g. with your example, if the values are all guaranteed non-negative), you could adapt a merge algorithm. Consider the key for each array to be the end-value (what you get after applying the predicate as many times as needed for all arrays).
If the predicate doesn't preserve ordering (or the arrays aren't ordered to start) you can sort by the end-value first, but the need to do that suggests that another approach may be better (such as the hash tables suggested elsewhere).
Basically, check whether the next end-value is equal for all arrays. If not, step over the lowest (in one array only) and repeat. If you get all three equal, that is a (possible) solution - step over all three before searching for the next.
"Possible" solution because you may need to do a check - if the predicate function can map more than one input value to the same output value, you might have a case where the value found in some arrays (but not the first or last) is wrong.
EDIT - there may be bigger problems when the predicate doesn't map each input to a unique output - can't think at the moment. Basically, the merge approach can work well, but only for certain kinds of predicate function.