What is Haskell's Stream Fusion and how do I use it?
The paper that Logan points to is great, but it's a little difficult. (Just ask my students.) It's also a great deal about 'how stream fusion works' and only a fraction 'what stream fusion is and how you can use it'.
The problem that stream fusion solves is that functional code as written often allocates intermediate lists. For example, to create an infinite list of node names, you might write
nodenames = map ("n"++) $ map show [1..]
Naive code would allocate an infinite list of integers [1, 2, 3, ...], an infinite list of strings ["1", "2", "3", ...], and eventually an infinite list of names ["n1", "n2", "n3", ...]. That's too much allocation.
What stream fusion does is translate a definition like nodenames into something which uses a recursive function that allocates only what is needed for the result. In general, eliminating allocation of intermediate lists is called deforestation.
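To make the idea concrete, here is a hand-written sketch of the kind of single recursive function the paragraph above describes. It only illustrates the result of fusion; it is not what the library or the compiler literally produces, and the names nodenamesNaive, nodenamesFused and go are made up for this example.

```haskell
-- The definition from the question: conceptually three lists
-- (integers, strings, names) are built one after another.
nodenamesNaive :: [String]
nodenamesNaive = map ("n" ++) $ map show [1 :: Integer ..]

-- A hand-fused equivalent (illustrative only): one recursive
-- producer that builds each name directly from a counter, so no
-- intermediate list of integers or of digit strings is allocated.
nodenamesFused :: [String]
nodenamesFused = go 1
  where
    go :: Integer -> [String]
    go i = ("n" ++ show i) : go (i + 1)
```

Both definitions produce the same elements, so take 3 of either one yields ["n1","n2","n3"].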
To use stream fusion, you need to write non-recursive list functions that use the functions from the stream-fusion library described in GHC ticket 915 (map, foldr, and so on) instead of explicit recursion. This library contains new versions of all the Prelude functions, rewritten to exploit stream fusion. Apparently this stuff is slated to make it into the next GHC release (6.12) but is not in the current stable version (6.10). If you want to use the library, Porges has a nice simple explanation in his answer.
If you actually want an explanation of how stream fusion works, post another question---but that's much harder.
As far as I am aware, and contrary to what Norman said, stream fusion is not currently implemented in GHC's base (ie. you cannot just use Prelude functions). For more information see GHC ticket 915.
To use stream fusion you need to install the stream-fusion library, import Data.List.Stream (you can also import Control.Monad.Stream) and only use functions from that module rather than the Prelude functions. This means importing the Prelude hiding all the default list functions, and not using [x..y] constructs or list comprehension.
Isn’t it correct that when GHC 6.12 uses those new functions by default, they will also implement [x..y] and list comprehensions in that non-recursive manner? The only reason they aren’t right now is that they are internal and not really written in Haskell, but are more like keywords, for speed’s sake and/or because you wouldn’t be able to redefine that syntax.
I have been coding for two years now. I can't say I'm an expert.
I have taken a course in functional programming in which we used Common Lisp. I heard a lot of great things about Scala, as a "new" language and wanted to learn it. I read a book for the basics and wanted to rewrite all the code we did in Lisp into Scala. Almost all the code was going through lists and this is where I found a problem. Most of the problems I could solve with recursively going through the list where I set it as List[Any] - for example:
def reverse(thelist: List[Any]):List[Any].....
but as I've found out there isn't a specific way for checking whether the head of the list is a list itself except for .isInstanceOf[List[Any]]
This was OK at first, but now I have a problem. Any isn't very specific, especially with comparing elements. If I wanted to have an equivalent list with, let's say, only Int, I can create a List[Int] which can only take an Int value as an element, none of which can be List[Int] itself. The other way, writing List[List[Int]] has the same problem, but in reverse, because every element has to be a List.
As a solution I've tried setting the original list as List[Either[Int,List[Int]]], but that only created more problems, as now I have to constantly write .isInstanceOf and .asInstanceOf in all of my ifs and recursive calls, which is time-consuming and makes the code harder to understand. But even List[Either[Int,List[Int]]] is a temporary solution, because it only goes one level deep. A list can contain a list that can contain a list... and so on.
Does Scala offer a more elegant solution I am not yet aware of, such as using classes or objects in some way, or a simple elegant solution, or am I stuck with writing this kind of code? To make my question more specific, is there a way in Scala to define a list that can, but doesn't have to contain a list of the same kind as an element?
Scala isn't just Common Lisp with different syntax. Using lists for everything is something specific to Lisp, not something you do in other languages.
In Scala it's not normal to ever use a heterogeneous list — List[Any] — for anything. You certainly can if you want, but it isn't the way Scala code is normally written. It certainly isn't the kind of code you should be writing when you are only just beginning to learn the language.
A list that contains a mixture of numbers and lists isn't really a list — it's a tree. In Scala, we don't represent trees using List at all — we define a proper tree data type. Any introductory Scala text contains examples of this. (See, for example, the expression trees in chapter 15 of Programming in Scala.)
As for your reverse example, in Scala we would normally never write:
def reverse(thelist: List[Any]): List[Any]
rather, we write:
def reverse[T](theList: List[T]): List[T]
which works on List[Any] but also works on more specific types such as List[Int].
If you insist on doing it the other way, you aren't really learning Scala; you're fighting with it. Anytime you think you need Any or List[Any], there is a better, more idiomatic, more Scala-like solution.
It's also never normal to use asInstanceOf or isInstanceOf in Scala code. They have long obscure names on purpose — they're not intended to be used except in rare situations.
Instead, use pattern matching. It does the equivalent of isInstanceOf and asInstanceOf for you, but in a much more concise and less error-prone way. Again, any introductory Scala text should have good coverage of what pattern matching is and how to use it (e.g. chapter 15 again).
Warning: I'm almost certain I'm using at least some of the relevant terms wrong
I want to modify flatland.ordered.set.OrderedSet so that nth works. I think this involves something like:
(extend-type flatland.ordered.set.OrderedSet
?????
(nth [this n] (nth (vec this) n)))
I've been trying to discern what protocol defines nth for a few hours now, with no luck. Is there a list of "native" protocols? Am I just totally mixed up?
It is not currently possible to do what you want to do using extend-type. Clojure's persistent collection interfaces are implemented using Java interfaces, not Clojure protocols. Therefore, it is not possible to extend them using extend-type.
However, since the code is open source, you could always change the library itself. All you should need to do is implement nth in OrderedSet's deftype. nth is defined by the clojure.lang.Indexed interface.
As Nathan Davis says, you can't do this "from the outside", because this stuff is based on interfaces rather than protocols. It would be quite reasonable for OrderedSet to implement Indexed; I must have just overlooked that interface entirely.
On the other hand, your implementation of nth is very inefficient: you don't want to create an entire length-N vector just to look up a single element in it. Instead, you want to call into get, which does the same thing as nth.
Edit: having looked back over the code again, I see that nth is not nearly as easy to implement correctly, because the existence of disj makes it difficult to quickly tell how many elements have been dropped from the set where. I don't think an efficient implementation for nth can really exist for this data structure unless you remove the ability to use disj. So I probably won't accept a pull request implementing nth unless you figure out something really clever, but feel free to fork ordered and add it to your own fork if you don't need disj support.
So scala 2.9 recently turned up in Debian testing, bringing the newfangled parallel collections with it.
Suppose I have some code equivalent to
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):List[Int] = s.map(expensiveFunction)
Now, from the teeny bit I'd gleaned about parallel collections before the docs actually turned up on my machine, I was expecting to parallelize this just by switching the List to a ParList... but to my surprise, there isn't one! (Just ParVector, ParMap, ParSet...)
As a workaround, this (or a one-line equivalent) seems to work well enough:
def process(s:List[Int]):List[Int] = {
val ps=scala.collection.parallel.immutable.ParVector()++s
val pr=ps.map(expensiveFunction)
List()++pr
}
yielding an approximately x3 performance improvement in my test code and achieving massively higher CPU usage (quad core plus hyperthreading i7). But it seems kind of clunky.
My question is really several questions in one:
Why isn't there a ParList?
Given there isn't a ParList, is there a better pattern/idiom I should adopt so that I don't feel like it's missing?
Am I just "behind the times" using Lists a lot in my Scala programs (like all the Scala books I bought back in the 2.7 days taught me), and should I actually be making more use of Vectors? (I mean, in C++ land I'd generally need a pretty good reason to use std::list over std::vector.)
Lists are great when you want pattern matching (i.e. case x :: xs) and for efficient prepending/iteration. However, they are not so great when you want fast access-by-index, or splitting into chunks, or joining (i.e. xs ::: ys).
Hence it does not make much sense (to have a parallel List) when you think that this kind of thing (splitting and joining) is exactly what is needed for efficient parallelism. Use:
xs.toIndexedSeq.par
First, let me show you how to make a parallel version of that code:
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):Seq[Int] = s.par.map(expensiveFunction).seq
That will have Scala figure things out for you -- and, by the way, it uses ParVector. If you really want List, call .toList instead of .seq.
As for the questions:
There isn't a ParList because a List is an intrinsically non-parallel data structure, because any operation on it requires traversal.
You should code to traits instead of classes -- Seq, ParSeq and GenSeq, for example. Even performance characteristics of a List are guaranteed by LinearSeq.
All the books before Scala 2.8 did not have the new collections library in mind. In particular, the collections really didn't share a consistent and complete API. Now they do, and you'll gain much by taking advantage of it.
Furthermore, there wasn't a collection like Vector in Scala 2.7 -- an immutable collection with (near) constant indexed access.
A List cannot be easily split into various sub-lists, which makes it hard to parallelise. For one, it has O(n) access; also, a List cannot strip its tail, so one needs to include a length parameter.
I guess, taking a Vector will be the better solution.
Note that Scala’s Vector is different from std::vector. The latter is basically a wrapper around a standard array, a contiguous block in memory which needs to be copied every now and then when adding or removing data. Scala’s Vector is a specialised data structure which allows for efficient copying and splitting while keeping the data itself immutable.
I've been using Haskell for quite a while now, and I've read most of Real World Haskell and Learn You a Haskell. What I want to know is whether there is a point to a language using lazy evaluation, in particular the "advantage" of having infinite lists. Is there a task which infinite lists make very easy, or even a task that is only possible with infinite lists?
Here's an utterly trivial but actually day-to-day useful example of where infinite lists specifically come in handy: When you have a list of items that you want to use to initialize some key-value-style data structure, starting with consecutive keys. So, say you have a list of strings and you want to put them into an IntMap counting from 0. Without lazy infinite lists, you'd do something like walk down the input list, keeping a running "next index" counter and building up the IntMap as you go.
With infinite lazy lists, the list itself takes the role of the running counter; just use zip [0..] with your list of items to assign the indices, then IntMap.fromList to construct the final result.
Sure, it's essentially the same thing in both cases. But having lazy infinite lists lets you express the concept much more directly without having to worry about details like the length of the input list or keeping track of an extra counter.
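The zip [0..] idea can be sketched in a few lines. The function name indexFromZero is made up for this example, and Data.IntMap is assumed to come from the containers package that ships with GHC.

```haskell
import qualified Data.IntMap as IntMap

-- Pair each item with consecutive keys 0, 1, 2, ... and build the map.
-- The infinite list [0 ..] plays the role of the running counter;
-- zip stops as soon as the (finite) input list runs out.
indexFromZero :: [a] -> IntMap.IntMap a
indexFromZero items = IntMap.fromList (zip [0 ..] items)
```

For instance, indexFromZero ["alpha", "beta", "gamma"] maps 0 to "alpha", 1 to "beta" and 2 to "gamma", with no explicit counter or length anywhere in the code.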
An obvious example is chaining your data processing from input to whatever you want to do with it. E.g., reading a stream of characters into a lazy list, which is processed by a lexer, also producing a lazy list of tokens which are parsed into a lazy AST structure, then compiled and executed. It's like using Unix pipes.
I found it's often easier and cleaner to just define all of a sequence in one place, even if it's infinite, and have the code that uses it just grab what it wants.
take 10 mySequence
takeWhile (<100) mySequence
instead of having numerous similar but not quite the same functions that generate a subset
first10ofMySequence
elementsUnder100ofMySequence
The benefits are greater when different subsections of the same sequence are used in different areas.
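As a small sketch of this style, assume mySequence is the sequence of perfect squares (the answer does not say what the sequence is, so this choice is purely illustrative, as are the names firstTen and under100). The sequence is defined once, and each consumer takes the part it needs.

```haskell
-- Defined once, in one place, with no upper bound.
-- (Assumption: the sequence of squares stands in for "mySequence".)
mySequence :: [Integer]
mySequence = map (^ (2 :: Int)) [1 ..]

-- Different consumers carve out different finite pieces.
firstTen :: [Integer]
firstTen = take 10 mySequence

under100 :: [Integer]
under100 = takeWhile (< 100) mySequence
```

Note that takeWhile only terminates here because the sequence is increasing; a filter over an infinite list would keep searching forever.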
Infinite data structures (including lists) give a huge boost to modularity and hence reusability, as explained & illustrated in John Hughes's classic paper Why Functional Programming Matters.
For instance, you can decompose complex code chunks into producer/filter/consumer pieces, each of which is potentially useful elsewhere.
So wherever you see real-world value in code reuse, you'll have an answer to your question.
Basically, lazy lists allow you to delay computation until you need it. This can prove useful when you don't know in advance when to stop, and what to precompute.
A standard example is a sequence u_n of numerical computations converging to some limit. You can ask for the first term such that |u_n - u_{n-1}| < epsilon, and exactly the right number of terms is computed for you.
Now, you have two such sequences u_n and v_n, and you want to know the sum of the limits to epsilon accuracy. The algorithm is:
compute u_n until epsilon/2 accuracy
compute v_n until epsilon/2 accuracy
return u_n + v_n
All is done lazily, only the necessary u_n and v_n are computed. You may want less simple examples, eg. computing f(u_n) where you know (ie. know how to compute) f's modulus of continuity.
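The three-step algorithm above might look like this in Haskell. This is a sketch under assumptions: the names within, sumOfLimits and sqrtApprox are made up here, Newton's iteration for square roots is just a convenient convergent sequence, and the stopping test |u_n - u_{n-1}| < epsilon is the heuristic from the answer rather than a guaranteed error bound.

```haskell
-- Walk down a (conceptually infinite) sequence and return the first
-- term that differs from its predecessor by less than eps.
within :: Double -> [Double] -> Double
within eps (x : y : rest)
  | abs (x - y) < eps = y
  | otherwise = within eps (y : rest)
within _ _ = error "within: sequence ended before converging"

-- Sum of two limits: each sequence is consumed only as far as needed.
sumOfLimits :: Double -> [Double] -> [Double] -> Double
sumOfLimits eps us vs = within (eps / 2) us + within (eps / 2) vs

-- Example sequence (assumption): Newton's iteration for sqrt a, a > 0.
sqrtApprox :: Double -> [Double]
sqrtApprox a = iterate (\x -> (x + a / x) / 2) a
```

For example, sumOfLimits 1e-9 (sqrtApprox 2) (sqrtApprox 3) approximates sqrt 2 + sqrt 3 while forcing only a handful of terms of each infinite list.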
Sound synthesis - see this paper by Jerzy Karczmarczuk:
http://users.info.unicaen.fr/~karczma/arpap/cleasyn.pdf
Jerzy Karczmarczuk has a number of other papers using infinite lists to model mathematical objects like power series and derivatives.
I've translated the basic sound synthesis code to Haskell - enough for a sine wave unit generator and WAV file IO. The performance was just about adequate to run with GHCi on a 1.5GHz Athlon - as I just wanted to test the concept I never got round to optimizing it.
Infinite/lazy structures permit the idiom of "tying the knot": http://www.haskell.org/haskellwiki/Tying_the_Knot
The canonically simple example of this is the Fibonacci sequence, defined directly as a recurrence relation. (Yes, yes, hold the efficiency complaints/algorithms discussion -- the point is the idiom.): fibs = 1:1:zipWith (+) fibs (tail fibs)
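Written out so it can be loaded into GHCi, the knot-tying definition looks like this (the standard library function is spelled zipWith). The type annotation is an addition for clarity.

```haskell
-- The list is defined in terms of itself: each new element is the sum
-- of the element at the same position in fibs and in (tail fibs).
-- Laziness makes this work, because only the demanded prefix is built.
fibs :: [Integer]
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
```

Asking for take 10 fibs gives [1,1,2,3,5,8,13,21,34,55].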
Here's another story. I had some code that only worked with finite streams -- it did some things to create them out to a point, then did a whole bunch of nonsense that involved acting on various bits of the stream dependent on the entire stream prior to that point, merging it with information from another stream, etc. It was pretty nice, but I realized it had a whole bunch of cruft necessary for dealing with boundary conditions, and basically what to do when one stream ran out of stuff. I then realized that conceptually, there was no reason it couldn't work on infinite streams. So I switched to a data type without a nil -- i.e. a genuine stream as opposed to a list, and all the cruft went away. Even though I know I'll never need the data past a certain point, being able to rely on it being there allowed me to safely remove lots of silly logic, and let the mathematical/algorithmic part of my code stand out more clearly.
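A minimal sketch of such a "data type without a nil" follows. The actual type and functions from the story are not shown in the answer, so Stream, iterateS, zipWithS and takeS are hypothetical names chosen for illustration.

```haskell
-- A genuine stream: there is no empty case, so every stream is infinite.
data Stream a = Cons a (Stream a)

-- Build a stream by repeatedly applying a function.
iterateS :: (a -> a) -> a -> Stream a
iterateS f x = Cons x (iterateS f (f x))

-- Merging two streams needs no "what if one runs out?" branch,
-- because the type guarantees that neither ever does.
zipWithS :: (a -> b -> c) -> Stream a -> Stream b -> Stream c
zipWithS f (Cons x xs) (Cons y ys) = Cons (f x y) (zipWithS f xs ys)

-- Observe a finite prefix as an ordinary list.
takeS :: Int -> Stream a -> [a]
takeS n (Cons x xs)
  | n <= 0 = []
  | otherwise = x : takeS (n - 1) xs
```

Compare zipWithS with the list version of zipWith, which needs extra clauses for the empty-list cases; that is the kind of boundary-condition cruft that disappears.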
One of my pragmatic favorites is cycle. cycle [False, True] generates the infinite list [False, True, False, True, False ...]. In particular, if xs = cycle [False, True], then xs !! 0 = False and xs !! 1 = True, so this just says whether or not the index of the element is odd. Where does this show up? Lots of places, but here's one that any web developer ought to be familiar with: making tables that alternate shading from row to row.
The general pattern seen here is that if we want to do some operation on a finite list, rather than having to construct a specific finite list that will “do the thing we want,” we can use an infinite list that will work for all sizes of lists. camcann’s answer is in this vein.
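The row-shading use of cycle can be sketched as follows. The function name shadeRows is made up for this example; how the Bool is turned into markup or a CSS class is left out.

```haskell
-- Tag each row with a flag that alternates False, True, False, ...
-- The infinite list from cycle works for any number of rows;
-- zip simply stops when the rows run out.
shadeRows :: [row] -> [(Bool, row)]
shadeRows = zip (cycle [False, True])
```

For example, shadeRows ["a", "b", "c"] yields [(False,"a"),(True,"b"),(False,"c")], and the same function handles a table of any length.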