If I pmap a function onto a sequence how far ahead will the sequence be realised in parallel
If I have a single thread reading from a sequence being produced in parallel?
will this be different if I wrap it in a seque:
(seque 30 (pmap do-stuff (range 30000)))
vs.
(pmap do-stuff (range 30000))
pmap provides no guarantees of how far ahead it will read on it's input sequence - presumably, not too much farther than what it needs to do its calculation.
(seque 30 ...) will realize and cache up to 30 elements from pmap's output sequence. That must logically be at least the first 30 from the input sequence. How far beyond that I can't say without looking at the implementation of pmap, which you probably ought not to depend upon.
I'm curious why you need to know this. The details of when a function executes, particularly in a pmap, are something you usually want abstracted away. If it's curiosity, great. But if you're depending on some side effect of the do-stuff function, you're Doing It Wrong(tm).
Related
I've just started learning Clojure and I'm puzzled by how lazy sequences work. In particular, I don't understand why these 2 expressions produce different results in the repl:
;; infinite range works OK
(user=> (take 3 (map #(/(- % 5)) (range)))
(-1/5 -1/4 -1/3)
;; finite range causes error
user=> (take 3 (map #(/(- % 5)) (range 1000)))
Error printing return value (ArithmeticException) at clojure.lang.Numbers/divide (Numbers.java:188).
Divide by zero
I take the sequence of integers (0 1 2 3 ...) and apply a function that subtracts 5 and then takes the reciprocal. Obviously this causes a division-by-zero error if it's applied to 5. But since I'm only taking the first 3 values from a lazy sequence I wasn't expecting to see an exception.
The results are what I expected when I use all the integers, but I get an error if I use the first 1000 integers.
Why are the results different?
Clojure 1.1 introduced "chunked" sequences,
This can provide greater efficiency ... Consumption of chunked-seqs as
normal seqs should be completely transparent. However, note that some
sequence processing will occur up to 32 elements at a time. This could
matter to you if you are relying on full laziness to preclude the
generation of any non-consumed results. [Section 2.3 of "Changes to Clojure in Version 1.1"]
In your example (range) seems to be producing a seq that realizes one element at a time and (range 999) is producing a chunked seq. map will consume a chunked seq a chunk at a time, producing a chunked seq. So when take asks for the first element of the chunked seq, function passed to map is called 32 times on the values 0 through 31.
I believe it is wisest to code in such a way the code will still work for any seq producing function/arity if that function produces a chunked seq with an arbitrarily large chunk.
I do not know if one writes a seq producing function that is not chunked if one can rely in current and future versions of library functions like map and filter to not convert the seq into a chunked seq.
But, why the difference? What are the implementation details such that (range) and (range 999) are different in the sort of seq produced?
Range is implemented in clojure.core.
(range) is defined as (iterate inc' 0).
Ultimately iterate's functionality is provided by the Iterate class in Iterate.java.
(range end) is defined, when end is a long, as (clojure.lang.LongRange/create end)
The LongRange class lives in LongRange.java.
Looking at the two java files it can be seen that the LongRange class implements IChunkedSeq and the Iterator class does not. (Exercise left for the reader.)
Speculation
The implementation of clojure.lang.Iterator does not chunk because iterator can be given a function of arbitrary complexity and the efficiency from chunking can easily be overwhelmed by computing more values than needed.
The implementation of (range) relies on iterator instead of a custom optimized Java class that does chunking because the (range) case is not believed to be common enough to warrant optimization.
Clojure has a number of libraries for generative testing such as test.check, test.generative or data.generators.
It is possible to use higher order functions to create random data generators that are composable such as:
(defn gen [create-fn content-fn lazy]
(fn [] (reduce #(create-fn %1 %2) (for [a lazy] (content-fn)))))
(def a (gen str #(rand-nth [\a \b \c]) (range 10)))
(a)
(def b (gen vector #(rand-int 10) (range 2)))
(b)
(def c (gen hash-set b (range (rand-int 10))))
(c)
This is just an example and could be modified with different parameters, filters, partials, etc to create data generating functions which are quite flexible.
Is there something that any of the generative libraries can do that isn't also just as (or more) succinctly achievable by composing some higher order functions?
As a side note to the stackoverflow gods: I don't believe this question is subjective. I'm not asking for an opinion on which library is better. I want to know what specific feature(s) or technique(s) of any/all data generative libraries differentiate them from composing vanilla higher order functions. An example answer should illustrate generating random data using any of the libraries with an explanation as to why this would be more complex to do by composing HOFs in the way I have illustrated above.
test.check does this way better. Most notably, suppose you generate a random list of 100 elements, and your test fails: something about the way you handled that list is wrong. What now? How do you find the basic bug? It surely doesn't depend on exactly those 100 inputs; you could probably reproduce it with a list of just a few elements, or even an empty list if something is wrong with your base case.
The feature that makes all this actually useful isn't the random generators, it is the "shrinking" of those generators. Once test.check finds an input that breaks your tests, it tries to simplify the input as much as possible while still making your tests break. For a list of integers, the shrinks are simple enough you could maybe do them yourself: remove any element, or decrease any element. Even that may not be true: choosing the order to do shrinks in is probably a harder problem than I realize. And for larger inputs, like a list of maps from vectors to a 3-tuple of [string, int, keyword], you'll find it totally unmanageable, whereas test.check has done all the hard work already.
I have this function that reproduces my problem:
(defn my-problem
[preprocess count print-freq]
(doseq [x (preprocess (range 0 count))]
(when (= 0 (mod x print-freq))
(println x))))
Everything works fine when I call it with identity function like this :
(my-problem identity 10000000 200000)
;it prints 200000,400000 ... 9800000 just as it should
When I call it with seque function I get OutOfMemoryError :
(my-problem #(seque 5 %) 10000000 200000)
;it prints numbers up to 2000000 and then it throws OutOfMemoryException
My understanding is that seque function should just split the processing into two threads using ConcurrentBlockingQueue with max size 5 (in this case). I don't understand where the memory leak is.
The way seque is implemented, if you consume elements much more quickly than you can produce them, a large number of agent tasks will pile up in the queue used internally by seque (up to one task per element in the sequence). In theory what you're doing should be fine, but in practice it doesn't really work out. You should be able to see the same effect just by running (dorun (seque (range))).
You can also use the function sequeue in flatland/useful, which makes tradeoffs that are different from the ones in clojure.core. Read the docstring carefully, but I think it would work well for your situation.
This is a follow up to my question "Recursively reverse a sequence in Clojure".
Is it possible to reverse a sequence using the Clojure "for" macro? I'm trying to better understand the limitations and use-cases of this macro.
Here is the code I'm starting from:
((defn reverse-with-for [s]
(for [c s] c))
Possible?
If so, I assume the solution may require wrapping the for macro in some expression that defines a mutable var, or that the body-expr of the for macro will somehow pass a sequence to the next iteration (similar to map).
Clojure for macro is being used with arbitrary Clojure sequences.
These sequences may or may not expose random access like vectors do. So, in general case, you do not have access to the last element of a Clojure sequence without traversing all the way to it, which would make making a pass through it in reverse order not possible.
I'm assumming you had something like this in mind (Java-like pseudocode):
for(int i = n-1; i--; i<=0){
doSomething(array[i]);
}
In this example we know array size n in advance and we can access elements by its index. With Clojure sequences we don't know that. In Java it makes sense to do that with arrays and ArrayLists. Clojure sequences are however much more like linked lists - you have an element, and a reference to next one.
Btw, even if there were a (probably non-idiomatic)* way to do that, its time complexity would be something like O(n^2) which is just not worth the effort compared to much easier solution in the linked post which is O(n^2) for lists and a much better O(n) for vectors (and it is quite elegant and idiomatic. In fact, the official reverse has that implementation).
EDIT:
A general advice: Don't try to do imperative programming in Clojure, it wasn't designed for it. Although many things may seem strange or counter-intuitive (as opposed to well known idioms from imperative programming) once you get used to the functional way of doing things it is a lot, and I mean a lot easier.
Specifically for this question, despite the same name Java (and other C-like) for and Clojure for are not the same thing! First is an actual loop - it defines a flow control. The second one is a comprehension - look at it conceptually as a higher function of a sequence and a function f to be done for each of its element, which returns another sequence of f(element) s. Java for is a statement, it doesn't evaluate to anything, Clojure for (as well as anything else in Clojure) is an expression - it evaluates to the sequence of f(element) s.
Probably the easiest way to get the idea is to play with sequence functions library: http://clojure.org/sequences. Also, you can solve some problems on http://www.4clojure.com/. The first problems are very easy but they gradually get harder as you progress through them.
*As shown in Alexandre's answer the solution to the problem in fact is idiomatic and quite clever. Kudos for that! :)
Here's how you could reverse a string with for:
(defn reverse-with-for [s]
(apply str
(for [i (range (dec (count s)) -1 -1)]
(get s i))))
Note that this code is mutation free. It's the same as:
(defn reverse-with-map [s]
(apply str
(map (partial get s) (range (dec (count s)) -1 -1))))
A simpler solution would be:
(apply str (reverse s))
First of all, as Goran said, for is not a statement - it is an expression, namely sequence comprehension. It construct sequences by iteration through other sequences. So in the form it is meant to be used it is pure function (without side-effects). for can be seen as enhanced map infused with filter. Because of this it cannot be used to hold iteration state as e.g. reduce do.
Secondly, you can express sequence reversal using for and mutable state, e.g. using an atom, which is rough equivalent (not taking into account its concurrency properties) of java variable. But doing so you are facing several problems:
You are breaking main language paradigm so you will definitely get worse looking and behaving code.
Since all clojure mutable state cells are designed to be thread-safe, they all use some kind of illegal concurrent modification protection, and there is no ability to remove it. Consequently, you will get poorer performance characteristics.
In this particular case, like Goran said, sequences are one of the wide-used Clojure abstractions. For example, there are lazy sequences, which could be potentially infinite, so you just cannot walk them to the end. You certainly will have difficulties trying to work with such sequences with imperative techniques.
So don't do it, at least in Clojure :)
EDIT: I forgot to mention it. for returns lazy sequence, so you have to evaluate it in some way in order to apply all state mutations you do in it. Another reason not to do so :)
I'm calling clojure.core/time which is documented as "Evaluates expr and prints the time it took. Returns the value of expr"
Eg:
(time (expensive))
Macroexpanding it shows that it does store the value as a let so after outputting the time it should return immediately with the value in the let expression.
When I make the call however with an expensive computation, I see a delay and then get the time back, but then have to wait for a significant time (sometimes +10 seconds or more) for the result to appear.
Can anyone explain what's going on here?
PS: This is using clojure 1.3.0 if that makes any difference.
Maybe you are returning something lazy, of which the elements are only produced when fed to the REPL? In that case, you might want to wrap it in a dorun which forces all elements to be produced.
If you could provide the details of your expensive computation, we could see if this is true.
Useful addition from Savanni D'Gerinel's comment:
The proper syntax is probably (time (doall (computation))) if you want to return the result and (time (dorun (computation))) if you do not.