How do nested dosync calls behave?

What happens when you create nested dosync calls? Will sub-transactions be completed in the parent scope? Are these sub-transactions reversible if the parent transaction fails?

If you mean syntactic nesting, then the answer is that it depends on whether the inner dosync runs on the same thread as the outer one.
In Clojure, entering a dosync block starts a new transaction only if one isn't already running on the current thread. As long as execution stays on a single thread, inner transactions are subsumed by outer transactions; however, if a dosync is syntactically nested within another dosync but happens to be launched on a new thread, it gets a new transaction to itself.
An example which (hopefully) illustrates what happens:
user> (def r (ref 0))
#'user/r
user> (dosync (future (dosync (Thread/sleep 50) (println :foo) (alter r inc)))
              (println :bar)
              (alter r inc))
:bar
:foo
:foo
1
user> @r
2
The "inner" transaction retries after printing :foo; the "outer" transaction never needs to restart. (Note that after this happens, r's history chain is grown, so if the "large" dosync form were evaluated for a second time, the inner dosync would not retry. It still wouldn't be merged into the outer one, of course.)
Incidentally, Mark Volkmann has written a fantastic article on Clojure's Software Transactional Memory; it's highly recommended reading for anyone interested in gaining solid insight into details of this sort.

can you explain pmap laziness and memory footprint?

The docs says about pmap:
Like map, except f is applied in parallel. Semi-lazy in that the
parallel computation stays ahead of the consumption, but doesn't
realize the entire result unless required.
Can you clarify these two statements in some simple context?
Also, is there a doseq equivalent for pmap, i.e. one whose memory footprint stays constant regardless of the size of the iterated collection?
Semi-lazy in that the parallel computation stays ahead of the consumption
This means that pmap will do slightly more work than is strictly required by the sequence's consumer. This "working ahead" minimizes the wait for more items to be computed when the sequence is consumed. For example, if you're computing some infinite sequence in parallel and you only consume the first 50 results, pmap may have gone ahead and computed 50+N.
but doesn't realize the entire result unless required.
This means it's only going to work ahead up to a certain threshold. The entire sequence won't be produced unless it's completely consumed (or almost completely consumed).
Also, is there a doseq equivalent for pmap
You can use doall or dorun with pmap to produce side effects in parallel.
Here's an example of all three together, using an infinite sequence as input to pmap:
(def calls (atom 0))
(dorun (take 50 (pmap (fn [_] (swap! calls inc)) (range))))
;; @calls => 60
When this completes, the value of calls will be over 50 even though we only consumed 50 items from the sequence: pmap works ahead by roughly 2 plus the number of available processors (hence the 10 extra calls here).
Also read up on reducers and core.async for another way to do the same thing.
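For instance, here is a hedged sketch of the core.async route using pipeline-blocking (expensive-fn is a stand-in for your function, and this assumes the org.clojure/core.async library):
(require '[clojure.core.async :as a])

(defn expensive-fn [x]
  (Thread/sleep 100)
  (* x x))

(let [in  (a/chan)
      out (a/chan)]
  ;; Apply expensive-fn on 4 threads, preserving input order.
  (a/pipeline-blocking 4 out (map expensive-fn) in)
  ;; Feed the inputs, closing the channel when done.
  (a/go
    (doseq [x (range 50)]
      (a/>! in x))
    (a/close! in))
  ;; Block until all 50 results have been collected.
  (a/<!! (a/into [] out)))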
While Taylor's answer is correct, I also gave a presentation on what happens inside of pmap, and how it's lazy, at Clojure West a few years ago. I know not everyone likes videos for learning, but if you do, it might be helpful: https://youtu.be/BzKjIk0vgzE?t=11m48s
(If you want non-lazy pmap, I second the endorsement for Claypoole.)
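A hedged sketch of the Claypoole version (assuming its com.climate.claypoole API): its pmap is eager and takes an explicit threadpool, so you control the parallelism directly.
(require '[com.climate.claypoole :as cp])

;; An eager pmap on a bounded pool of 4 threads; doall realizes
;; the results before the pool is shut down.
(cp/with-shutdown! [pool (cp/threadpool 4)]
  (doall (cp/pmap pool (fn [x] (* x x)) (range 50))))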

Clojure reset(ing) multiple atoms at once

Suppose foo and bar are atoms.
; consistent here.
(reset! foo x)
; other threads can observe an inconsistent x and y combination here.
(reset! bar y)
; consistent again.
Is it possible to reset them at once so that no other thread can see this inconsistency? Or is it necessary to bundle x and y into an atom instead of having x and y themselves being atoms?
For coordinated updates to multiple references, you want to use refs, not atoms. Refs can only be modified inside a transaction, and Clojure's STM (software transactional memory) ensures that all operations on refs succeed before committing the transaction, so all threads outside the transaction see a consistent view of the refs.
(def foo (ref :x))
(def bar (ref :y))

(dosync
  (ref-set foo :x1)
  (ref-set bar :y1))
In this example, the transaction (delineated by dosync) will be retried if either ref was modified by a transaction in another thread, ensuring that other threads see a consistent view of foo and bar.
There is overhead in using STM, so the choice of using coordinated refs vs. a single atom that encapsulates all your mutable state will depend on your exact usage.
From the Clojure website's page on atoms (emphasis added):
Atoms provide a way to manage shared, synchronous, independent state.
This means that each atom is independent of each other, and therefore a set of atoms cannot be updated atomically.
You can combine both items into a single atom (a sketch follows below). But you may also want to consider refs, which provide for transactional updates of multiple items:
transactional references (Refs) ensure safe shared use of mutable storage locations via a software transactional memory (STM)
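A minimal sketch of the single-atom approach (the names are illustrative):
;; Bundle both values into one atom so they always change together.
(def state (atom {:foo :x, :bar :y}))

;; swap! applies the update atomically; no other thread can observe
;; a half-updated combination.
(swap! state assoc :foo :x1 :bar :y1)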

Clojure Synchronize Futures with Await

I have 3 long running tasks that I need to synchronize on. They are independent, but the calling thread must wait until all three are finished before continuing.
I can create an agent for each task and await on them, but agents aren't really the right semantic construct, since each agent will only be called once.
What I really want is to await on 3 futures, or some approach that more closely resembles what I'm trying to achieve.
Can I await on futures instead of agents?
Edit:
I guess the answer is simply to deref each future in the calling thread in a loop, which will block until they've all returned. If I wanted to do "prep" work during this time, I could put the dereffing code itself in yet another future.
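That approach looks roughly like this (task1, task2, and task3 stand in for the three long-running tasks):
(let [f1 (future (task1))
      f2 (future (task2))
      f3 (future (task3))]
  ;; Each deref blocks until its future completes, so this returns
  ;; only after all three tasks have finished.
  [@f1 @f2 @f3])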
It looks like you mostly answered your own question. I'll add my 2 cents about how to do this though.
(defn many-futures
  [tasks]
  ;; doall forces the lazy for, so every future is launched up front
  (let [futures (doall (for [task tasks]
                         (future (task))))]
    (do-prep tasks)
    (doseq [completion futures]
      @completion)))
This will do your prep in parallel with all the futures and then return after all the futures have completed. You could replace the doseq with (doall (for ...)) if you actually want to use the results somewhere, or skip that doall and only block once the results are actually accessed. Even further, you could return the seq of futures itself, and then access any one of them via deref independently of the completion status of the others.

What will the behaviour of line-seq be?

I'd like to understand the behaviour of a lazy sequence if I iterate over with doseq but hold onto part of the first element.
(with-open [log-file-reader (clojure.java.io/reader
                              (clojure.java.io/file input-file-path))]
  ;; parse-line returns some kind of representation of the line
  (let [parsed-lines (map parse-line (line-seq log-file-reader))
        first-item (first parsed-lines)]
    ;; Iterate over the parsed lines
    (doseq [line parsed-lines]
      ;; Do something with a side effect
      )))
I don't want to retain any of the list, I just want to perform a side-effect with each element. I believe that without the first-item there would be no problem.
I'm having memory issues in my program and I think that perhaps retaining a reference to something at the start of the parsed-lines sequence means that the whole sequence is stored.
What's the defined behaviour here? If the sequence is being stored, is there a generic way to take a copy of an object and enable the realised portion of the sequence to be garbage collected?
The sequence-holding occurs here:
...
(let [parsed-lines (map parse-line (line-seq log-file-reader))
...
The sequence of lines in the file is lazily produced and parsed, but the entire sequence is held onto within the scope of the let. This sequence is realized in the doseq, but doseq is not the problem; it does not do sequence-holding.
...
(doseq [line parsed-lines]
; Do something
...
You wouldn't necessarily care about sequence-holding in a let because the scope of let is limited, but here presumably your file is large and/or you stay within the dynamic scope of let for a while, or perhaps return a closure containing it in the "do something" section.
Note that holding onto any given element of the sequence, including the first, does not hold the sequence. The term head-holding is a bit of a misnomer if you consider head to be the first element as in "head of the list" in Prolog. The problem is holding onto a reference to the sequence.
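A hedged sketch of the distinction (do-something is a stand-in for your side effect):
;; Holds the sequence: the atom keeps a live reference to the head,
;; so every realized line stays reachable during the traversal.
(def saved (atom nil))

(defn process-holding [rdr]
  (let [lines (line-seq rdr)]
    (reset! saved lines)
    (doseq [l lines]
      (do-something l))))

;; Holds only an element: first-line keeps a single value alive, and
;; lines is not used after the doseq, so Clojure's locals clearing
;; lets the realized portion be garbage collected as it is consumed.
(defn process-streaming [rdr]
  (let [lines (line-seq rdr)
        first-line (first lines)]
    (doseq [l lines]
      (do-something l))
    first-line))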
The JVM will never return memory to the OS once it becomes part of the Java heap, and unless you configure it differently, the default max heap size is pretty large (usually 1/4 of available RAM). So if you're only experiencing vague issues like "Gosh, this takes up a lot of memory" rather than "Well, the JVM threw an OutOfMemoryError", you probably just haven't tuned the JVM the way you'd like it to act. partition-by is a little eager, in that it holds one or two partitions in memory at once, but unless your partitions are huge, you shouldn't be running out of heap space with this code. Try setting -Xmx100m, or whatever you think is a reasonable heap size for your program, and see if you have problems.

Does Clojure use multiple threads in a map call?

I'm attempting to explore the behavior of a CPU-bound algorithm as it scales to multiple CPUs using Clojure. The algorithm takes a large sequence of consecutive integers as input, partitions the sequence into a given number of sub-sequences, then uses map to apply a function to each sub-sequence. Once the map function has completed, reduce is used to collect the results.
The full code is available on Github, but here is a sample:
(map computation-function (partitioning-function number-of-partitions input))
When I execute this code on a machine with twelve cores, I see most of the cores in use, when I expect to see only one core in use.
Ideally, I would like to use pmap to use a given number of threads, but I am unable to cause the code to execute using only one thread.
So is Clojure spreading the computation across multiple CPUs? If so, is there anything that I can do to control this behavior?
My understanding is that pmap uses multiple cores and map uses the current thread only. (There would be no point in having both functions in the library if both used all available cores.)
The following simple experiment shows that pmap uses separate threads and map does not:
(defn something-slow [x]
  (Thread/sleep 1000))

(time (doall (map something-slow (range 5))))
;; ~5 seconds: map applies the function sequentially on the calling thread

(time (doall (pmap something-slow (range 5))))
;; ~1 second: pmap applies the function on multiple threads
I do note that your GitHub code uses pmap in the example which runs in -main; if you change it back to map, does the parallelism persist?