I'd like to understand the behaviour of a lazy sequence if I iterate over it with doseq but hold onto part of the first element.
(with-open [log-file-reader (clojure.java.io/reader (clojure.java.io/file input-file-path))]
  ; parse-line returns some kind of representation of the line
  (let [parsed-lines (map parse-line (line-seq log-file-reader))
        first-item (first parsed-lines)]
    ; Iterate over the parsed lines
    (doseq [line parsed-lines]
      ; Do something with a side-effect
      )))
I don't want to retain any of the list; I just want to perform a side effect with each element. I believe that without the first-item binding there would be no problem.
I'm having memory issues in my program, and I think that perhaps retaining a reference to something at the start of the parsed-lines sequence means that the whole sequence is stored.
What's the defined behaviour here? If the sequence is being stored, is there a generic way to take a copy of an object and enable the realised portion of the sequence to be garbage collected?
The sequence-holding occurs here
...
(let [parsed-lines (map parse-line (line-seq log-file-reader))
...
The sequence of lines in the file is being lazily produced and parsed, but the entire sequence is held onto within the scope of the let. The sequence is realized in the doseq, but doseq is not the problem; it does not do sequence-holding.
...
(doseq [line parsed-lines]
; Do something
...
You wouldn't necessarily care about sequence-holding in a let, because the scope of a let is limited. But here, presumably, your file is large and/or you stay within the dynamic scope of the let for a while, or perhaps you return a closure over the sequence from the "do something" section.
Note that holding onto any given element of the sequence, including the first, does not hold the sequence. The term head-holding is a bit of a misnomer if you consider head to be the first element as in "head of the list" in Prolog. The problem is holding onto a reference to the sequence.
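For illustration, a minimal restructuring of the question's code (same parse-line and input-file-path) in which nothing but the doseq refers to the lazy sequence:

(with-open [log-file-reader (clojure.java.io/reader (clojure.java.io/file input-file-path))]
  (doseq [line (map parse-line (line-seq log-file-reader))]
    ; do something with a side-effect
    ))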
The JVM will never return memory to the OS once it becomes part of the java heap, and unless you configure it differently the default max heap size is pretty large (1/4 of available RAM, usually). So if you're only experiencing vague issues like "Gosh, this takes up a lot of memory" rather than "Well, the JVM threw an OutOfMemoryError", you probably just haven't tuned the JVM the way you'd like it to act. partition-by is a little eager, in that it holds one or two partitions in memory at once, but unless your partitions are huge, you shouldn't be running out of heap space with this code. Try setting -Xmx100m, or whatever you think is a reasonable heap size for your program, and see if you have problems.
Related
The docs say this about pmap:
Like map, except f is applied in parallel. Semi-lazy in that the
parallel computation stays ahead of the consumption, but doesn't
realize the entire result unless required.
Can you clarify these two statements in some simple context?
Also, is there a doseq equivalent for the pmap function, with a memory footprint that stays constant regardless of the size of the iterated collection?
Semi-lazy in that the parallel computation stays ahead of the consumption
This means that pmap will do slightly more work than is strictly required by the sequence's consumer. This "working ahead" minimizes the wait for more items to be computed when the sequence is consumed. For example, if you're computing some infinite sequence in parallel and you only consume the first 50 results, pmap may have gone ahead and computed 50+N.
but doesn't realize the entire result unless required.
This means it's only going to work ahead up to a certain threshold. The entire sequence won't be produced unless it's completely consumed (or almost completely consumed).
Also, is there a doseq equivalent for the pmap function?
You can use doall or dorun with pmap to produce side effects in parallel.
Here's an example of dorun with pmap, using an infinite sequence as input:
(def calls (atom 0))
(dorun (take 50 (pmap (fn [_] (swap! calls inc)) (range))))
@calls ;;=> 60
When this completes, the value of calls will be over 50, even though we only consumed 50 items from the sequence.
Also read up on reducers and core.async for another way to do the same thing.
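If you do explore core.async, a rough sketch (mine, not from the answer above) of a bounded-parallelism, constant-memory, side-effecting pipeline could look like this; pipeline-blocking and to-chan! are from recent core.async releases (older ones use to-chan), and the numbers are arbitrary:

(require '[clojure.core.async :as async])

(let [in  (async/to-chan! (range 1000)) ; input collection as a channel
      out (async/chan)]
  ;; at most 4 items are being processed at any one time
  (async/pipeline-blocking 4
                           out
                           (map (fn [x]
                                  ;; do the side effect here
                                  (println "processed" x)
                                  x))
                           in)
  ;; drain the output channel; it is closed when the input is exhausted
  (loop []
    (when (async/<!! out)
      (recur))))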
While Taylor's answer is correct, I also gave a presentation on what happens inside of pmap, and how it's lazy, at Clojure West a few years ago. I know not everyone likes videos for learning, but if you do, it might be helpful: https://youtu.be/BzKjIk0vgzE?t=11m48s
(If you want non-lazy pmap, I second the endorsement for Claypoole.)
Let's say I have the following code :
(defn multiple-writes []
  (doseq [[x y] (map list [1 2] [3 4])] ;; let's imagine those are paths to files
    (when-not (exists? x y) ;; could be left off, I feel it is faster to check before overwriting
      (write-to-disk! (do-something x y)))))
That I call like this (parameters omitted):
(go (multiple-writes))
I use go to execute some code "in the background", but I do not know if I am using the right tool here. Some more information about those functions:
this is not high-priority code at all. It could even fail - multiple-writes could be seen as a cache-filling function.
I consequently do not care about the return value.
do-something takes between 100 and 500 milliseconds depending on the input
do-something consumes some memory (uses image buffers, some images can be 2000px * 2000px)
there are 10 to 40 elements/images to be processed every time multiple-writes is called.
every call to write-to-disk will create a new file (or overwrite an existing one, though that should not happen)
write-to-disk always writes to the same directory
So I would like to speed up things by executing (write-to-disk! (do-something x y)) in parallel to go as fast as possible. But I don't want to overload the system at all, since this is not a high-priority task.
How should I go about this ?
Note : despite the title, this is not a duplicate of this question since I don't want to restrict to 3 threads (not saying that the answer can't be the same, but I feel this question differs).
Take a look at the claypoole library, which gives some good and simple abstractions filling the void between pmap and fork/join reducers, which otherwise would need to be coded by hand with futures and promises.
With pmap, all results of a parallel batch need to have returned before the next batch is executed, because return order is preserved. This can be a problem with widely varying processing times (be they calculations, http requests, or work items of different "size"). This is what usually slows pmap down to the performance of single-threaded map plus unneeded overhead.
With claypoole's unordered pmap and unordered for (upmap and upfor), slower function calls in one thread (core) can be overtaken by faster ones on another thread because ordering doesn't need to be preserved, as long as not all cores are clogged by slow calls.
This might not help much if IO to one disk is the only bottleneck, but since claypoole has configurable thread pool sizes and functions to detect the number of available cores, it will help with restricting the number of cores used.
And where fork/join reducers would optimize CPU usage by work stealing, they might greatly increase memory use, since there is no option to restrict the number of parallel processes without altering the reducer library.
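A rough sketch of what the question's multiple-writes could look like with claypoole's unordered pmap and an explicit pool; the namespace and function names are taken from the claypoole README, exists?, do-something and write-to-disk! come from the question, and the pool size of 4 is an arbitrary choice:

(require '[com.climate.claypoole :as cp])

(defn multiple-writes []
  (cp/with-shutdown! [pool (cp/threadpool 4)]
    (doall ; force all results before the pool is shut down
     (cp/upmap pool
               (fn [[x y]]
                 (when-not (exists? x y)
                   (write-to-disk! (do-something x y))))
               (map list [1 2] [3 4])))))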
Consider basing your design on streams or fork/join.
I would use a single component that does IO. Every processing node can then send its results there to be saved. This is easy to model with streams. With fork/join, it can be achieved by not returning the result up the hierarchy but sending it to e.g. an agent.
If memory consumption is an issue, perhaps you can divide the work even further, e.g. into 100x100 patches.
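A sketch of that idea using the question's functions: compute in parallel with pmap, but funnel every write through one agent so only a single thread ever touches the disk.

(def disk-writer (agent nil))

(defn multiple-writes []
  (doseq [result (pmap (fn [[x y]]
                         (when-not (exists? x y)
                           (do-something x y)))
                       (map list [1 2] [3 4]))]
    (when result
      ;; the agent processes one write at a time, serializing disk access
      (send-off disk-writer (fn [_] (write-to-disk! result))))))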
I'm attempting to explore the behavior of a CPU-bound algorithm as it scales to multiple CPUs using Clojure. The algorithm takes a large sequence of consecutive integers as input, partitions the sequence into a given number of sub-sequences, then uses map to apply a function to each sub-sequence. Once the map function has completed, reduce is used to collect the results.
The full code is available on Github, but here is a sample:
(map computation-function (partitioning-function number-of-partitions input))
When I execute this code on a machine with twelve cores, I see most of the cores in use, when I expect to see only one core in use.
Ideally, I would like to use pmap to use a given number of threads, but I am unable to cause the code to execute using only one thread.
So is Clojure spreading the computation across multiple CPUs? If so, is there anything that I can do to control this behavior?
My understanding is that pmap uses multiple cores and map uses the current thread only. (There would be no point in having both functions in the library if both used all available cores.)
The following simple experiment shows that pmap uses separate threads and map does not:
(defn something-slow [x]
  (Thread/sleep 1000))

(time (dorun (map something-slow (range 5))))
;; Takes ~5 seconds
(time (dorun (pmap something-slow (range 5))))
;; Takes ~1 second
I do note that your GitHub code uses pmap in the example which runs in -main; if you change back to map, does the parallelism persist?
I want to start determining the memory requirements of my refs and how they grow with application usage. How can I do this?
Someone asked about this on the mailing list a while ago (and probably someone else before that, and...). A few people provided utilities that kinda-sorta do what you might want, but I still prefer my answer: you can't do this in a language with such pervasive and automatic structural sharing. How do you calculate the size of a large object that you have two pointers to, etc etc.
In general, this isn't really a very useful thing to do: because the data used by any single ref is likely to be shared by many other refs, a per-ref figure doesn't tell you much.
Also, it will be highly JVM-specific - different JVM implementations may use different amounts of memory for the same Clojure structures depending on how they choose to pack data structures and pointers. For example, I believe that HotSpot pads object sizes up to the nearest 8 bytes, but other JVMs could do something completely different. Also 32/64-bit JVMs will typically use different sizes for pointers (but not necessarily in the obvious way, as some 64-bit JVMs use compressed pointers....)
If you are still determined to do this, the best approach would probably be to recursively descend the data structure in the ref and add up the estimated size of each sub-element.
You'd need to make assumptions or experimentally verify the size/overhead of each possible component type. Not easy... see this question for some of the gory details of estimating object sizes on the JVM. If you're lucky, you might be able to find a library that does this for you.
You would also need to keep track of all objects visited - which is also a bit tricky, since you'd need to compare on object identity rather than equality, and hence you wouldn't be able to use any of the standard hashmap/set types. A hashmap of (object hashcode -> collection of objects with the same hashcode) would work.
There will also be some fun Clojure-specific corner cases to consider... e.g. are you counting meta data on a data structure or not?
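A very rough sketch of that recursive approach; the per-object byte counts are made-up placeholders rather than real JVM sizes, metadata is ignored, and shared structure is only counted once by tracking visited objects by identity:

(import 'java.util.IdentityHashMap)

(defn estimate-size
  "Illustrative size estimate in bytes; the constants are placeholders."
  ([x] (estimate-size x (IdentityHashMap.)))
  ([x ^IdentityHashMap seen]
   (cond
     (nil? x) 0
     (.containsKey seen x) 0 ; already counted via another reference
     :else (do (.put seen x true)
               (cond
                 (map? x)    (reduce + 16 (map #(estimate-size % seen) (mapcat identity x)))
                 (coll? x)   (reduce + 16 (map #(estimate-size % seen) x))
                 (string? x) (+ 40 (* 2 (count x)))
                 :else 16)))))

(estimate-size {:a "hello" :b [1 2 3]}) ;; some made-up number of bytes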
Overall, though, I'd still recommend paying attention to the memory consumed by your application as a whole, rather than by specific refs.
Because I can't do code in comments:
(let [a1 large-hash-map
      a2 (assoc large-hash-map :foo :bar)]
  ;; now, the two 'pointers' are a1 and a2,
  ;; and the data structures they point to can share
  ;; most (but not all) of their data,
  ;; making it more or less meaningless to ask
  ;; how much memory any of the bindings holds
  )
Whether we're talking about refs or plain bindings doesn't matter as far as your question is concerned.
What is the difference in the 3 ways to set the value of a ref in Clojure? I've read the docs several times about ref-set, commute, and alter. I'm rather confused which ones to use at what times. Can someone provide me a short description of what the differences are and why each is needed?
As a super simple explanation of how the Software Transactional Memory system works in Clojure: it retries transactions until every one of them gets through without having its values changed out from under it. You can help it make this decision by using ref-changing functions that give it hints about what interactions are safe between transactions.
ref-set is for when you don't care about the current value. Just set it to this! ref-set saves you the angst of writing something like (alter my-ref (fn [_] 4)) just to set the value of my-ref to 4. (ref-set my-ref 4) sure does look a lot better :).
Use ref-set to simply set the value.
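A minimal example:

(def my-ref (ref 0))
(dosync (ref-set my-ref 4))
@my-ref ;;=> 4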
alter is the normal, standard one. Use this function to alter the value. This is the meat of the STM: it uses the function you pass to change the value, and it retries the transaction if it cannot guarantee that the value was unchanged since the start of the transaction. This is very safe, even in some cases where you don't need it to be that safe, like incrementing a counter.
You probably want to use alter most of the time.
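A minimal counter example:

(def counter (ref 0))
(dosync (alter counter inc)) ;; retried if counter changed mid-transaction
@counter ;;=> 1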
commute is an optimized version of alter for those times when the order of operations really does not matter: it makes no difference who added which +1 to the counter; the result is the same. If the STM is deciding whether your transaction is safe to commit and it only has conflicts on commute operations, and none on alter operations, then it can go ahead and commit the new values without having to restart anyone. This can save the occasional transaction retry, though you're not going to see huge gains from this in normal code.
Use commute when you can.
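The same counter with commute; because increments can be applied in any order, the STM can commit without retrying:

(def counter (ref 0))
(dosync (commute counter inc))
@counter ;;=> 1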