Suppose you have this code:
(reduce + 1 (range 1 13000))
Should this not cause a stackoverflow, because it is not tail call optimized? Or is reduce similar to loop?
A reduce will effectively be a loop over the values of the collection.
I say "effectively" because there are actually many different reduce implementation strategies employed by Clojure, depending on the type of the collection. This includes seq traversal via first/next, reduction via iterator, and collections that know how to self-reduce more efficiently.
In this case, the call to range returns a LongRange object, which implements the IReduceInit interface for self-reduction and performs a tight do/while loop (implemented in Java).
The code for this can be found here in the latest release.
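To make that concrete, here is a quick REPL sketch (assuming Clojure 1.7 or later, where range with long arguments returns a self-reducing LongRange; the results shown are what I would expect from such a session):

(class (range 1 13000))
;;=> clojure.lang.LongRange

(instance? clojure.lang.IReduceInit (range 1 13000))
;;=> true

;; The reduction runs inside LongRange's own loop rather than through
;; recursive calls on the JVM stack, so no StackOverflowError occurs:
(reduce + 1 (range 1 13000))
;;=> 84493501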
I've been reading a nice answer to Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)? provided by samthebest and I am not sure if I understand all the details:
According to the answer (reduce vs foldLeft):
A big big difference (...) is that reduce should be given a commutative monoid, (...)
This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce even exists.
and
Reduce is defined formally as part of the MapReduce paradigm,
I am not sure how these two statements fit together. Can anyone shed some light on that?
I tested different collections and I haven't seen a performance difference between reduce and foldLeft. It looks like ParSeq is a special case, is that right?
Do we really need order to define fold?
we cannot define fold because chunks do not have an ordering and fold only requires associativity, not commutativity.
Why couldn't it be generalized to unordered collections?
As mentioned in the comments, the term reduce means different things when used in the context of MapReduce and when used in the context of functional programming.
In MapReduce, the system groups the results of the map function by a given key and then calls the reduce operation to aggregate values for each group (so reduce is called once for each group). You can see it as a function (K, [V]) -> R taking the group key K together with all the values belonging to the group [V] and producing some result.
In functional programming, reduce is a function that aggregates elements of some collection when you give it an operation that can combine two elements. In other words, you define a function (V, V) -> V and the reduce function uses it to aggregate a collection [V] into a single value V.
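To make the contrast concrete, here is a hedged Scala sketch; the function names are just illustrative, not part of any real API:

// MapReduce-style reduce: called once per group, with the key and all
// of that key's values, i.e. (K, [V]) -> R.
def reduceWordCount(word: String, counts: Seq[Int]): Int = counts.sum

// FP-style reduce: a binary combiner (V, V) -> V applied across a
// collection [V] to produce a single V.
def sumAll(values: Seq[Int]): Int = values.reduce(_ + _)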
When you want to add numbers [1,2,3,4] using + as the function, the reduce function can do it in a number of ways:
It can run from the start and calculate (((1+2)+3)+4)
It can also calculate a = 1+2 and b = 3+4 in parallel and then add a+b!
The foldLeft operation, by definition, always proceeds from the left, and so it always uses the first evaluation strategy. In fact, it also takes an initial value, so it evaluates something more like ((((0+1)+2)+3)+4). This makes foldLeft useful for operations where the order matters, but it also means that it cannot be implemented for unordered collections (because you do not know what "left" is).
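Here is a small Scala sketch of the difference in evaluation strategy (assuming a Scala version where parallel collections are available via .par; in 2.13 they live in a separate module):

val xs = List(1, 2, 3, 4)

// foldLeft takes an initial value and, by definition, associates to the
// left: ((((0 + 1) + 2) + 3) + 4)
val a = xs.foldLeft(0)(_ + _)    // 10

// reduce only needs an associative operation, so a parallel collection
// is free to combine chunks in any grouping, e.g. (1 + 2) + (3 + 4)
val b = xs.par.reduce(_ + _)     // 10

// foldLeft also exists on parallel collections, but it is evaluated
// sequentially; only fold/reduce/aggregate can exploit the parallelism.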
I am looking for a simple example of transducers with a reducing function.
I was hoping that the following would return a transducing function, since (filter odd?) works that way :
(def sum (reduce +))
clojure.lang.ArityException: Wrong number of args (1) passed to: core$reduce
My current understanding of transducers is that by omitting the collection argument, we get back a transducing function that we can compose with other transducing functions. Why is it different for filter and reduce ?
The reduce function doesn't return a transducer.
That's because reduce is a function that returns a single value, which may or may not be a sequence. Other functions like filter or map always return sequences (even empty ones), and that is what allows you to combine these functions.
For combining a reduction with other steps, you can use the reducers library, which gives functionality similar to what you want to do (if I have understood you correctly).
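For what it's worth, a minimal sketch with clojure.core.reducers might look like this (the input and result are just a toy example): r/map and r/filter return reducible "recipes" rather than sequences, and r/fold performs the final reduction.

(require '[clojure.core.reducers :as r])

;; inc every number in 0..9, keep the odd results, then sum them:
(r/fold + (r/filter odd? (r/map inc (vec (range 10)))))
;;=> 25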
Update:
Ok, my answer is a bit confusing.
First of all, let's take a look at how filter, map and many other functions work: unsurprisingly, they are all based on reduce; they are reducing functions in the sense that they never create a collection bigger than the input one. So if you reduce a collection in any way, you can combine your reductions and pass each value from the reducible collection through all of the reducing functions to get the final value. This is a great way to improve performance, because some values will be dropped along the way (as part of the transformation), and there is only one logical cycle (I mean loop: you iterate over the sequence only once and pass every value through all the transformations).
So why is reduce itself so different, if everything is built on it?
All transducers are based on a simple and effective idea which, in my opinion, looks like a sieve. But the reduction can only be the last step, because its result is a single value. So the only way you can use reduce here is to provide both a collection and a reducing function.
In the sieve analogy, reduce is like the funnel under the sieve:
You take your collection and throw it at functions like map, filter and take, and as you can see, the size of the new collection (the result of the transformation) will never be bigger than the input collection. So the last step may be a reduce, which takes the sieved collection and produces one value based on everything that was done before.
There is also a function that allows you to combine transducers with a reducing step: transduce. But it also requires a reducing function, because that is the entry point and the last step of our sieve.
The reducers library is similar to transducers and also allows a reduce as the last step. It's just another approach to doing the same thing as transducers.
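Here is a minimal sketch of that "last step" in practice (Clojure 1.7+ assumed): the transducer is built with comp, and transduce supplies the reducing function and the initial value as the final, order-enforced step.

(def xform (comp (filter odd?) (map inc)))

;; keep the odd numbers in 0..9, increment them, then sum the result:
(transduce xform + 0 (range 10))
;;=> 30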
To achieve what you want, you can use partial instead. It would look like this:
(def sum
  (partial reduce +'))
(sum [1 2 3])
will return the obvious answer of 6
I recently noticed that there was a very clear implementation of insertion sort here:
Insertion sort in clojure throws StackOverFlow error
which suffers from a memory overflow, due to the fact that concat lazily joins lists. I was wondering:
What strategies can we apply for "de-lazying" a list when we want better performance on large collections?
doall is certainly fine for forcing the evaluation of a lazy sequence.
Another useful thing to remember is that reduce is non-lazy. This can therefore be very useful in large computations for ensuring that intermediate results get evaluated and reduced to a single output value before the computation proceeds.
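A small illustrative sketch of both points (the names here are just examples):

;; doall walks the lazy seq so the whole thing is realized up front.
(def realized (doall (map inc (range 1000))))

;; reduce is eager: it consumes the intermediate lazy seq immediately and
;; collapses it to a single value, so no chain of unrealized thunks builds up.
(def total (reduce + (map inc (range 1000))))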
As I understand it, recursing in Clojure without using the loop .. recur syntax might not be a problem for short sequences. However, using the loop .. recur syntax is the preferred method for writing recursive functions. So, I would like to start with the preferred method.
However, I have been struggling to convert this function, which returns the skeleton of a sequence (the sequence structure without its values)
(defn skl
  [tree]
  (map skl (filter seq? tree)))
tested with this data
(def test_data1 '(1 (2 3) ( ) (( )) :a))
(def test_data2 '(1 2 (3 4) (5 ( 6 7 8))))
to loop .. recur syntax. Any ideas or pointers to examples would be appreciated.
loop and recur express a simple iteration. Descending into a tree, however, is inherently recursive: you would have to maintain a stack manually in order to transform it into a single iteration. There is thus no "simple" conversion for your code.
You may want to look into the zipper library, which allows for nicely structured tree editing, though it will likely be less elegant than your original. I almost never need to use loop ... recur; there is almost always a higher-order function that solves the problem more elegantly with the same or better efficiency.
Replacing map with loop ... recur makes code more verbose and less clear. You also lose the benefits of chunked sequences.
Take a look at the clojure.walk source. It's a library for doing (bulk) operations on nested Clojure data structures (excluding ordered maps). There's some very powerful but deceptively simple-looking code in there, using recursion through locally defined anonymous functions without using loop/recur.
Most of the functions in there are based on the postwalk and prewalk functions, which in turn are both based on the walk function. With the source and (prewalk-demo form) and (postwalk-demo form) you can get a good insight into the recursive steps taken.
I don't know if this might help you in solving your problem though. I'm currently trying to do something in the same problem domain: create a function to "flatten" nested maps and vectors into a sequence of all paths from root to leaf, each path a sequence of keys and/or indexes ending in the 'leaf' value.
This library seems to make editing values recursively throughout the whole structure pretty simple. However, I still have no idea how to use it to functionally keep track of the accumulated data between iterations that is needed for my 'paths' problem, and probably for your 'skeleton' problem as well.
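For the skeleton itself (which needs no accumulated state), a hedged sketch using postwalk might look like the following; skl-walk is just an illustrative name:

(require '[clojure.walk :as w])

;; postwalk rewrites every nested form bottom-up, so dropping the
;; non-seq elements at each level reproduces the skeleton.
(defn skl-walk [tree]
  (w/postwalk (fn [x] (if (seq? x) (filter seq? x) x)) tree))

(skl-walk '(1 (2 3) ( ) (( )) :a))
;;=> (() () (()))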
So scala 2.9 recently turned up in Debian testing, bringing the newfangled parallel collections with it.
Suppose I have some code equivalent to
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):List[Int] = s.map(expensiveFunction)
now from the teeny bit I'd gleaned about parallel collections before the docs actually turned up on my machine, I was expecting to parallelize this just by switching the List to a ParList... but to my surprise, there isn't one! (Just ParVector, ParMap, ParSet...).
As a workaround, this (or a one-line equivalent) seems to work well enough:
def process(s:List[Int]):List[Int] = {
  val ps = scala.collection.parallel.immutable.ParVector() ++ s
  val pr = ps.map(expensiveFunction)
  List() ++ pr
}
yielding an approximately 3x performance improvement in my test code and achieving much higher CPU usage (quad-core plus hyperthreading i7). But it seems kind of clunky.
My question is sort of an aggregate of the following:
Why isn't there a ParList?
Given there isn't a ParList, is there a better pattern/idiom I should adopt so that I don't feel like it's missing?
Am I just "behind the times" in using Lists a lot in my Scala programs (like all the Scala books I bought back in the 2.7 days taught me), and should I actually be making more use of Vectors? (I mean, in C++ land I'd generally need a pretty good reason to use std::list over std::vector.)
Lists are great when you want pattern matching (i.e. case x :: xs) and for efficient prepending/iteration. However, they are not so great when you want fast access-by-index, or splitting into chunks, or joining (i.e. xs ::: ys).
Hence it does not make much sense to have a parallel List when you consider that this kind of thing (splitting and joining) is exactly what is needed for efficient parallelism. Use:
xs.toIndexedSeq.par
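For example, a hedged end-to-end sketch of the process function from the question, using that conversion and converting back to a List at the end (expensiveFunction is just a placeholder here):

def expensiveFunction(x: Int): Int = x * x   // placeholder for the real work

def process(s: List[Int]): List[Int] =
  s.toIndexedSeq.par.map(expensiveFunction).toList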
First, let me show you how to make a parallel version of that code:
def expensiveFunction(x:Int):Int = {...}
def process(s:List[Int]):Seq[Int] = s.par.map(expensiveFunction).seq
That will have Scala figure things out for you -- and, by the way, it uses ParVector. If you really want List, call .toList instead of .seq.
As for the questions:
There isn't a ParList because List is an intrinsically non-parallel data structure: any operation on it requires traversal.
You should code to traits instead of classes: Seq, ParSeq and GenSeq, for example. Even the performance characteristics of List are guaranteed by LinearSeq.
All the books before Scala 2.8 did not have the new collections library in mind. In particular, the collections really didn't share a consistent and complete API. Now they do, and you'll gain much by taking advantage of it.
Furthermore, there wasn't a collection like Vector in Scala 2.7 -- an immutable collection with (near) constant indexed access.
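A hedged sketch of "coding to traits" (Scala 2.9 to 2.12 collections assumed, where GenSeq is the common parent of Seq and ParSeq; the bodies are placeholders):

import scala.collection.GenSeq

def expensiveFunction(x: Int): Int = x * x   // placeholder for the real work

// The same function accepts a plain List or a parallel collection:
def process(s: GenSeq[Int]): GenSeq[Int] = s.map(expensiveFunction)

process(List(1, 2, 3))          // runs sequentially
process(Vector(1, 2, 3).par)    // runs on the parallel collection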
A List cannot easily be split into sub-lists, which makes it hard to parallelise. For one, it has O(n) access; also, a List cannot strip its tail, so one needs to include a length parameter.
I guess taking a Vector would be the better solution.
Note that Scala’s Vector is different from std::vector. The latter is basically a wrapper around a standard array: a contiguous block in memory which needs to be copied every now and then when adding or removing data. Scala’s Vector is a specialised data structure which allows for efficient copying and splitting while keeping the data itself immutable.