Complexity of lists in Haskell's Data.Map

Sorry if this seems like an obvious question.
I was creating a Data.Map of lists (actually of tuples of an integer and a list, i.e. (Integer, [(Integer, Integer)])) to implement a priority queue plus adjacency list for graph algorithms like Dijkstra's and Prim's.
Data.Map is implemented with balanced binary trees (I read that), so I just want to confirm that when the map operations rebalance the tree (I believe via rotations), the runtime does not deep-copy the lists, only the references to them. Right?
I am doing this to implement Prim's algorithm in Haskell so that it runs in O(n log n + m log n) time, where n is the number of vertices and m the number of edges, in a purely functional way.
If the lists are stored in the priority queue by reference, the algorithm will run in that time. Most Haskell implementations I found online don't achieve this complexity.
Thanks in advance.

You are correct that the lists will not be copied every time you create a new Map, at least if you're using GHC (other implementations probably do this correctly as well). This is one of the advantages of a purely functional language: because data can't be rewritten, data structures don't need to be copied to avoid problems you might have in an imperative language. Consider this snippet of Lisp:
(setf a (list 1 2 3 4 5))
(setf b a)
; a and b are now both '(1 2 3 4 5).
(setf (cadr a) 0)
; a is now '(1 0 3 4 5).
; Careful though! a and b point to the same piece of memory,
; so b is now also '(1 0 3 4 5). This may or may not be what you expected.
In Haskell, the only way to get mutable state like this is to use an explicitly mutable structure, such as something in the ST or State monad (and even that is sort of faking it, which is a good thing). This kind of surprise is impossible in Haskell, because once you declare that a is a particular list, it is that list now and forever. Since values are guaranteed never to change, there is no danger in sharing memory between things that are supposed to be equal, and in fact GHC does exactly this. Therefore, when you make a new Map containing the same values, only pointers to the values are copied, not the values themselves.
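As a concrete (hedged) sketch of the situation in the question, where the names Graph, g1 and g2 are purely illustrative: inserting into a Map rebuilds only the O(log n) spine nodes on the path to the change and reuses the existing edge lists.
import qualified Data.Map as Map

-- Adjacency structure as in the question: vertex -> list of (neighbour, weight).
type Graph = Map.Map Integer [(Integer, Integer)]

g1 :: Graph
g1 = Map.fromList [(1, [(2, 5), (3, 1)]), (2, [(3, 2)])]

-- g2 is a new Map, but the edge lists already present in g1 are shared
-- between g1 and g2 by reference; they are not deep-copied.
g2 :: Graph
g2 = Map.insert 3 [(1, 4)] g1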
For more information, read about the difference between Boxed and Unboxed types.

1) Integer is slower than Int.
2) If you have [(Integer, [(Integer, Integer)])], you can build with Data.Map not only Map Integer [(Integer, Integer)] but also Map Integer (Map Integer Integer).
3) If you use Int instead of Integer, you can use a slightly faster structure, IntMap from Data.IntMap: IntMap (IntMap Int).
4) The complexity of each operation is stated in the documentation, for example in Data.IntMap.Strict and Data.IntMap.Lazy:
map :: (a -> b) -> IntMap a -> IntMap b
O(n). Map a function over all values in the map.
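As a hedged sketch of points 2 and 3, where the names Graph, graph and edgeWeight are only for illustration:
import qualified Data.IntMap.Strict as IM

-- Adjacency structure as nested IntMaps: vertex -> (neighbour -> weight).
type Graph = IM.IntMap (IM.IntMap Int)

graph :: Graph
graph = IM.fromList
  [ (1, IM.fromList [(2, 5), (3, 1)])
  , (2, IM.fromList [(3, 2)])
  ]

-- Look up an edge weight; each lookup is O(min(n, W)).
edgeWeight :: Int -> Int -> Graph -> Maybe Int
edgeWeight u v g = IM.lookup u g >>= IM.lookup v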

Related

IndexedSeq VS. PersistentVector

Can somebody explain the difference between 'IndexedSeq' and 'PersistentVector' to me?
I bumped into this when updating a vector in my data structure via 'rest'. Here's a REPL excerpt that shows the transformation.
=> (def xs [1 2 3])
...
(type xs)
cljs.core/PersistentVector
=> (def xs2 (rest xs))
...
(type xs2)
cljs.core/IndexedSeq
I'm holding a list in an app-state atom, which needs to be shifted once in a while, so the first item must disappear. It would be really cool if anybody could give me a hint about which data structure might be preferable here in terms of performance.
Sometimes elements get pushed to the end of the list as well, so I guess it's a LIFO mechanism that I'm creating here.
From your last paragraph, it sounds like you're using this as a stack. Taken together, pop, peek, and conj form a stack interface that can be used with either lists or vectors (working on the front of a list or the end of a vector). I would use those.
If you're just using those functions, I don't think there should be any significant performance differences (all three functions should be constant time).
Looking at the superinterfaces here: http://static.javadoc.io/org.clojure/clojure/1.7.0/clojure/lang/IndexedSeq.html
I would guess it is not the most efficient thing here, since it is just a seq, with no guaranteed constant-time access to the nth member. To keep vector semantics you should probably use subvec to remove the first element.
In general, if you don't need random access to elements, performance-wise it should be enough to use concat to add an element at the end (it produces a lazy sequence, doesn't consume the whole collection, and runs in constant time) and rest to remove the first element (also constant time) to make the FIFO queue that you describe. (It's still not the best option, since it may lead to a stack overflow if you do a lot of pushes without realizing the sequence.)
But really it's better to use vectors, so the combination of conj, first, and subvec should be your choice.

What container really mimics std::vector in Haskell?

The problem
I'm looking for a container that is used to save partial results of n - 1 problems in order to calculate the nth one. This means that the size of the container, at the end, will always be n.
Each element, i, of the container depends on at least 2 and up to 4 previous results.
The container has to provide:
constant time insertions at either beginning or end (one of the two, not necessarily both)
constant time indexing in the middle
or alternatively (given an O(n) initialization):
constant time single element edits
constant time indexing in the middle
What is std::vector and why is it relevant
For those of you who don't know C++, std::vector is a dynamically sized array. It is a perfect fit for this problem because it is able to:
reserve space at construction
offer constant time indexing in the middle
offer constant time insertion at the end (with a reserved space)
Therefore this problem is solvable in O(n) complexity, in C++.
Why Data.Vector is not std::vector
Data.Vector, together with Data.Array, provides similar functionality to std::vector, but not quite the same. Both, of course, offer constant time indexing in the middle, but they offer neither constant time modification ((//) for example is at least O(n)) nor constant time insertion at either beginning or end.
Conclusion
What container really mimics std::vector in Haskell? Alternatively, what is my best shot?
From reddit comes the suggestion to use Data.Vector.constructN:
O(n) Construct a vector with n elements by repeatedly applying the generator function to the already constructed part of the vector.
constructN 3 f = let a = f <> ; b = f <a> ; c = f <a,b> in f <a,b,c>
For example:
λ import qualified Data.Vector as V
λ V.constructN 10 V.length
fromList [0,1,2,3,4,5,6,7,8,9]
λ V.constructN 10 $ (1+) . V.sum
fromList [1,2,4,8,16,32,64,128,256,512]
λ V.constructN 10 $ \v -> let n = V.length v in if n <= 1 then 1 else (v V.! (n - 1)) + (v V.! (n - 2))
fromList [1,1,2,3,5,8,13,21,34,55]
This certainly seems to qualify to solve the problem as you've described it above.
The first data structures that come to my mind are either Maps from Data.Map or Sequences from Data.Sequence.
Update
Data.Sequence
Sequences are persistent data structures that support most operations efficiently, but they can only represent finite sequences. Their implementation is based on finger trees, if you are interested. So which qualities do they have?
O(1) calculation of the length
O(1) insert at front/back with the operators <| and |> respectively.
O(n) creation from a list with fromList
O(log(min(n1,n2))) concatenation for sequences of length n1 and n2.
O(log(min(i,n-i))) indexing for an element at position i in a sequence of length n.
Furthermore this structure supports a lot of the known and handy functions you'd expect from a list-like structure: replicate, zip, null, scans, sort, take, drop, splitAt and many more. Due to these similarities you have to either use a qualified import or hide the Prelude functions that have the same names.
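A minimal sketch of a few of these operations (example and middle are just illustrative names):
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (<|), (|>))

-- fromList is O(n); <| and |> add at the front/back in O(1).
example :: Seq Int
example = 0 <| (Seq.fromList [1, 2, 3] |> 4)   -- fromList [0,1,2,3,4]

-- Indexing in the middle is O(log(min(i, n - i))).
middle :: Int
middle = Seq.index example 2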
Data.Map
Maps are the standard workhorse for realizing a correspondence between "things"; what you might call a hashmap or associative array in other programming languages is called a Map in Haskell. Unlike in, say, Python, Maps are pure - an update gives you back a new Map and does not modify the original instance.
Maps come in two flavors - strict and lazy.
Quoting from the Documentation
Strict
API of this module is strict in both the keys and the values.
Lazy
API of this module is strict in the keys, but lazy in the values.
So you need to choose what fits best for your application. You can try both versions and benchmark with criterion.
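A small hedged illustration of the difference (lazyOk and strictBoom are illustrative names; forcing the strict variant throws because the stored value is evaluated on insert):
import qualified Data.Map.Lazy as ML
import qualified Data.Map.Strict as MS

-- The lazy API never forces the stored value, so this is fine and returns 1.
lazyOk :: Int
lazyOk = ML.size (ML.insert 1 (undefined :: Int) ML.empty)

-- The strict API evaluates the value before storing it, so forcing this throws.
strictBoom :: Int
strictBoom = MS.size (MS.insert 1 (undefined :: Int) MS.empty)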
Instead of listing the features of Data.Map I want to move on to
Data.IntMap.Strict
which can leverage the fact that the keys are integers to squeeze out better performance.
Quoting from the documentation we first note:
Many operations have a worst-case complexity of O(min(n,W)). This means that the operation can become linear in the number of elements with a maximum of W -- the number of bits in an Int (32 or 64).
So what are the characteristics of IntMaps?
O(min(n,W)) for (unsafe) indexing (!), unsafe in the sense that you will get an error if the key/index does not exist. This is the same behavior as Data.Sequence.
O(n) calculation of size
O(min(n,W)) for safe indexing lookup, which returns a Nothing if the key is not found and Just a otherwise.
O(min(n,W)) for insert, delete, adjust and update
So you see that this structure is less efficient than Sequences, but it provides a bit more safety and a big benefit if you don't actually need all entries, such as when representing a sparse graph whose nodes are integers.
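A tiny hedged example of the safe versus unsafe indexing listed above (m is just an illustrative map):
import qualified Data.IntMap.Strict as IM

m :: IM.IntMap Char
m = IM.fromList [(1, 'a'), (2, 'b')]

-- lookup returns Nothing for a missing key.
safe :: Maybe Char
safe = IM.lookup 3 m

-- (!) throws an error for a missing key, like indexing in Data.Sequence.
unsafe :: Char
unsafe = m IM.! 3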
For completeness I'd like to mention a package called persistent-vector, which implements Clojure-style vectors, but it seems to be abandoned, as the last upload is from 2012.
Conclusion
So for your use case I'd strongly recommend Data.Sequence or Data.Vector. Unfortunately I don't have any experience with the latter, so you'll need to try it for yourself. From what I know, it provides a powerful optimization called stream fusion, which executes multiple functions in one tight "loop" instead of running a loop for each function. A tutorial for Vector can be found here.
When looking for functional containers with particular asymptotic run times, I always pull out Edison.
Note that there's a known result that in a strict language with immutable data structures, there is always a logarithmic slowdown when implementing a mutable data structure on top of them. It's an open problem whether the limited mutation hidden behind laziness can avoid that slowdown. There's also the issue of persistent vs. transient...
Okasaki is still a good read for background, but finger trees or something more complex like an RRB-tree should be available "off-the-shelf" and solve your problem.
I'm looking for a container that is used to save partial results of n - 1 problems in order to calculate the nth one.
Each element, i, of the container depends on at least 2 and up to 4 previous results.
Let's consider a very small program that calculates Fibonacci numbers.
fib 1 = 1
fib 2 = 1
fib n = fib (n-1) + fib (n-2)
This is great for small n, but horrible for large n. At this point, you stumble across this gem:
fib n = fibs !! (n - 1) where fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
You may be tempted to exclaim that this is dark magic (infinite, self-referential list building and zipping? wth!) but it is really a great example of tying the knot, and of using laziness to ensure that values are calculated as needed.
Similarly, we can use an array to tie the knot too.
import Data.Array

fib :: Int -> Int
fib n = arr ! n
  where
    arr :: Array Int Int
    arr = listArray (1, n) (map fib' [1 .. n])
    fib' 1 = 1
    fib' 2 = 1
    fib' m = arr ! (m - 1) + arr ! (m - 2)
Each element of the array is a thunk that uses other elements of the array to calculate its value. In this way we can build a single array, never having to perform concatenation, and pull values out of the array at will, only paying for the calculation up to that point.
The beauty of this method is that you don't only have to look behind you, you can look in front of you as well.

Calculate ratio of an element in a list efficiently

The following code works with small lists; however, it takes forever with long lists. I suppose it's my double use of length that is the problem.
ratioOfPrimes :: [Int] -> Double
ratioOfPrimes xs = fromIntegral (length (filter isPrime xs)) / fromIntegral (length xs)
How do I calculate the ratio of an element in longer lists?
The double use of length isn't the main problem here. The multiple traversals in your implementation only add a constant factor: with two uses of length plus filter you get roughly 3n operations, or even 2n thanks to stream fusion, as already mentioned by Impredicative. But since constant factors don't have a dramatic effect on performance, it's conventional to simply ignore them, so, conventionally speaking, your implementation still has a complexity of O(n), where n is the length of the input list.
The real problem is that all of the above would only be true if isPrime had a complexity of O(1), but it doesn't. That function traverses a list of all primes, so it has a complexity of O(m) itself. The dramatic performance decrease is therefore caused by your algorithm having a final complexity of O(n*m), because on each step over the input list it has to traverse the list of all primes to an unknown depth.
To optimize, I suggest first sorting the input list (which takes O(n*log n)) and then integrating a custom lookup on the list of all primes that drops the already-visited numbers on each iteration. This way you can get away with a single traversal of the list of all primes, which theoretically gives a complexity of O(n*log n + n + m). Again, this can conventionally be thought of as simply O(n*log n), highlighting the cost centre.
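A minimal sketch of that idea, under the assumption that some ascending, infinite list of primes is available (primeCount and the naive sieve here are purely illustrative):
import Data.List (sort)

-- One pass over the sorted input, one forward-only pass over the primes.
primeCount :: [Int] -> Int
primeCount xs = go (sort xs) primes
  where
    go [] _  = 0
    go _  [] = 0
    go ys@(y:ys') ps@(p:ps')
      | p < y     = go ys ps'       -- this prime is already behind us, drop it
      | p == y    = 1 + go ys' ps   -- y is prime (duplicates still match this prime)
      | otherwise = go ys' ps       -- y is composite, move to the next input

-- A deliberately naive sieve, only so the sketch is self-contained.
primes :: [Int]
primes = sieve [2 ..]
  where
    sieve (q:qs) = q : sieve [x | x <- qs, x `mod` q /= 0]
    sieve []     = []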
So, there are a few things going on there. Let's look at some of the operations involved:
length
filter
isPrime
length
As you say, length is O(n) for lists, and you call it twice. Then there's filter, which also does a whole pass over the list in O(n). What we'd like is to do all of this in a single pass over the list.
Functions in the Data.List.Stream module implement a technique called Stream Fusion, which would for example rewrite your (length (filter isPrime xs)) call into a single loop. However, you'd still have the second call to length. You could rewrite this whole thing into a single fold (or use of the State or ST monads) with a pair of accumulators and do this in a single pass:
import Data.List (foldl')

ratioOfPrimes :: [Int] -> Double
ratioOfPrimes xs =
  let (primes, total) = foldl' (\(p, t) i -> if isPrime i then (p + 1, t + 1) else (p, t + 1)) (0 :: Int, 0 :: Int) xs
  in fromIntegral primes / fromIntegral total
However, in this case you could also move away from using a list and use the vector library. The vector library implements the same stream fusion techniques for removing intermediate lists, but also has some other nifty features:
length is O(1)
The Data.Vector.Unboxed module lets you store unboxable types (which primitive types such as Int certainly are) without the overhead of the boxed representation. So this list of ints would be stored as a low-level Int array.
Using the vector package should let you write the idiomatic formulation you have above and get performance as good as, or better than, a hand-written single-pass version.
import qualified Data.Vector.Unboxed as U
ratioOfPrimes :: U.Vector Int -> Double
ratioOfPrimes xs = (fromIntegral $ U.length . U.filter isPrime $ xs) / (fromIntegral $ U.length xs)
Of course, the thing that hasn't been mentioned is the isPrime function, and whether the real problem is that it's slow for large n. An unperformant prime checker could easily blow concerns over list indexing out of the water.
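For reference, a hedged sketch of what a reasonable trial-division checker could look like (an assumption, not the asker's actual isPrime):
-- Trial division up to the square root; roughly O(sqrt n) per check.
isPrime :: Int -> Bool
isPrime n
  | n < 2     = False
  | n < 4     = True            -- 2 and 3
  | even n    = False
  | otherwise = all (\d -> n `mod` d /= 0) [3, 5 .. isqrt n]
  where
    isqrt = floor . sqrt . (fromIntegral :: Int -> Double)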

For what purpose does Clojure implement implicit conversion from, e.g., a vector to a list?

If I do
user => (next [1 2 3])
I get
(2 3)
It seems that an implicit conversion from vector to list is being performed.
Conceptually, applying next to a vector does not make a lot of sense, because a vector is not a sequence. Indeed, Clojure does not implement next for a vector. When I apply next to a vector, Clojure kindly suggests "You wanted to say (next seq), right?".
Isn't it more straightforward to say that a vector does not have a next method? What are the reasons why this implicit conversion is advantageous and/or necessary?
If you look at the docs, next says:
Returns a seq of the items after the first. Calls seq on its argument.
If there are no more items, returns nil.
meaning that this method calls seq on the collection you give it (in your case, it's a vector) and returns a seq containing the rest.
In Clojure, lots of things are "colls", such as sequences, vectors, sets and even maps, so for example this would also work:
(next {:a 1 :b 2}) ; returns ([:b 2])
so the behavior is consistent - transform any collection of items into a seq. This is very common in Clojure; map and partition, for example, do the same thing:
(map inc [1 2 3]) ; returns (2 3 4)
(partition 2 [1 2 3 4]) ; returns ((1 2)(3 4))
this is useful for two main reasons (more are welcome!):
it allows these core functions to operate on any data type you throw at them, as long as it is a "collection"
it allows for lazy computation, e.g. even if you map over a large vector but only ask for the first few items, map won't have to actually pre-compute all of them.
Clojure has the concept of a sequence (which just happens to display the same way as a list).
next is a function that makes sense on any collection that is a sequence (or can reasonably be coerced into one).
(type '(1 2 3))
=> clojure.lang.PersistentList
(type (rest [1 2 3]))
=>clojure.lang.PersistentVector$ChunkedSeq
There are tradeoffs in the design of any language or library. Allowing the same operation to work on different collection types makes it easier to write many programs. You often don't have to worry about differences between lists and vectors if you don't want to worry about them. If you decide you want to use one sequence type rather than another, you might be able to leave all of the rest of the code as it was. This is all implicit in Shlomi's answer, which also points out an advantage involving laziness.
There are disadvantages to Clojure's strategy, too. Clojure's flexible operations on collections mean that Clojure might not tell you that you have mistakenly used a collection type that you didn't intend. Other languages lack Clojure's flexibility, but might help you catch certain kinds of bugs more quickly. Some statically typed languages, such as Standard ML, for example, take this to an extreme--which is a good thing for certain purposes, but bad for others.
Clojure lets you trade off performance and abstraction through the choice between list and vector.
List
is fast on operations at the beginning of the sequence like cons / conj
is fast on iteration with first / rest
Vector
is fast on operations at the end of the sequence like pop / peek
participates in associative abstraction with indexes as keys
is fast on subvec
Both participate in the sequence abstraction. Clojure's functions, and the conversions they perform, are designed to make writing idiomatic code easy.

Fast insert into the beginning and end of a clojure seq?

In Clojure, lists grow from the left and vectors grow from the right, so:
user> (conj '(1 2 3) 4)
(4 1 2 3)
user> (conj [1 2 3] 4)
[1 2 3 4]
What's the most efficient method of inserting values both into the front and the back of a sequence?
You need a different data structure to support fast inserting at both start and end. See https://github.com/clojure/data.finger-tree
As I understand it, a sequence is just a generic abstraction, so it depends on the specific implementation you are working with.
For a data structure that supports random access (e.g. a vector), adding at the end it grows from should take effectively constant time, O(1), but inserting at the other end is not as cheap.
For a list, I would expect inserting at the front of the list with a cons operation to take constant time, but inserting to the back of the list will take O(n) since you have to traverse the entire structure to get to the end.
There are, of course, many other data structures that can theoretically back a sequence (e.g. trees), each with its own complexity characteristics.