Clojure: structure multiple calculations/writes to work in parallel

Let's say I have the following code:
(defn multiple-writes []
  (doseq [[x y] (map list [1 2] [3 4])] ;; let's imagine those are paths to files
    (when-not (exists? x y) ;; could be left off; I feel it is faster to check before overwriting
      (write-to-disk! (do-something x y)))))
That I call like this (parameters omitted):
(go (multiple-writes))
I use go to execute some code "in the background", but I do not know if I am using the right tool here. Some more information about those functions:
this is not high-priority code at all. It could even fail - multiple-writes could be seen as a cache-filling function.
I consequently do not care about the return value.
do-something takes between 100 and 500 milliseconds depending on the input
do-something consumes some memory (uses image buffers, some images can be 2000px * 2000px)
there are 10 to 40 elements/images to be processed every time multiple-writes is called.
every call to write-to-disk! will create a new file (or overwrite an existing one, though that should not happen)
write-to-disk! always writes to the same directory
So I would like to speed up things by executing (write-to-disk! (do-something x y)) in parallel to go as fast as possible. But I don't want to overload the system at all, since this is not a high-priority task.
How should I go about this ?
Note: despite the title, this is not a duplicate of this question, since I don't want to restrict to 3 threads (not saying that the answer can't be the same, but I feel this question differs).

Take a look at the claypoole library, which gives some good and simple abstractions that fill the void between pmap and fork/join reducers and would otherwise need to be coded by hand with futures and promises.
With pmap, all results of a parallel batch need to have returned before the next batch is executed, because return order is preserved. This can be a problem with widely varying processing times (be they calculations, HTTP requests, or work items of different "size"). This is what usually slows pmap down to the performance of a single-threaded map plus unneeded overhead.
With claypoole's unordered pmap and unordered for (upmap and upfor), slower function calls in one thread (core) can be overtaken by faster ones on another thread because ordering doesn't need to be preserved, as long as not all cores are clogged by slow calls.
This might not help much if IO to one disk is the only bottleneck, but since claypoole has configurable thread pool sizes and functions to detect the number of available cores, it will help with restricting the number of cores used.
And where fork/join reducers would optimize CPU usage by work stealing, they might greatly increase memory use, since there is no option to restrict the number of parallel tasks without altering the reducer library.
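Coming back to claypoole, here is a minimal sketch of the question's function on top of upfor (a sketch only: it reuses the exists?, do-something and write-to-disk! placeholders from the question, and the pool size and thread priority are arbitrary choices for a low-priority task):
(require '[com.climate.claypoole :as cp])

(defn multiple-writes []
  ;; a small, low-priority pool keeps the rest of the system responsive
  (cp/with-shutdown! [pool (cp/threadpool 2 :thread-priority 3)]
    (doall ;; force all results before the pool is shut down
      (cp/upfor pool [[x y] (map list [1 2] [3 4])]
        (when-not (exists? x y)
          (write-to-disk! (do-something x y)))))))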

Consider basing your design on streams or fork/join.
I would use a single component that does IO. Every processing node can then send its results there to be saved. This is easy to model with streams. With fork/join, it can be achieved by not returning the result up the hierarchy but sending it to e.g. an agent; see the sketch below.
If memory consumption is an issue, perhaps you can divide the work even further, e.g. into 100x100 pixel patches.
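A minimal sketch of the agent variant, reusing the do-something and write-to-disk! placeholders from the question (note that future draws on an unbounded thread pool, so a sized pool such as claypoole's would be preferable if overload is a concern):
(def writer (agent nil))

(defn multiple-writes []
  (doseq [[x y] (map list [1 2] [3 4])]
    (future
      (let [result (do-something x y)]
        ;; send-off is meant for potentially blocking actions such as IO;
        ;; the single agent serializes all writes to the directory
        (send-off writer (fn [_] (write-to-disk! result)))))))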

Related

When does .race or .hyper outperform non-data-parallelized versions?

I have this code:
# Grab Nutrients.csv from https://data.nal.usda.gov/dataset/usda-branded-food-products-database/resource/c929dc84-1516-4ac7-bbb8-c0c191ca8cec
my @nutrients = "/path/to/Nutrients.csv".IO.lines;
for @nutrients.race {
    my @data = $_.split('","');
    .say if @data[2] eq "Protein" and @data[4] > 70 and @data[5] ~~ /^g/;
};
Nutrients.csv is a 174 MB file, with lots of rows. Non-trivial stuff is done on every row, but there's no data dependency. However, this takes circa 54s while the non-race version uses 43 seconds, 20% less. Any idea of why that happens? Is the kind of operation done here still too little for data parallelism to take hold? I have seen it only working with very heavy operations, like checking if something is prime. In that case, any ballpark of how much should be done for every piece of data to make data parallelism worth the while?
Assuming that "outperform" is defined as "using less wallclock":
Short answer: when it does.
Longer answer: when the overhead of batching values, distributing them over multiple threads, and collecting the results, plus the actual CPU needed for the work divided by the number of threads, results in a shorter runtime.
Still longer answer: the dispatcher thread needs some CPU to batch up values and hand the work over to a worker thread and then process its result. As long as that amount of CPU is more than the amount of CPU needed to do the work, you will only use one thread (because by the time the dispatcher thread is ready to dispatch, the only worker thread is ready to receive more work). Which means you've made things worse, because the actual work is now still being done by one thread, but you've added a lot of overhead and latency.
So make sure that the amount of work a worker thread needs to do is big enough that the dispatcher thread will need to start up another thread for the next piece of work. This can be done by increasing the batch size. But a bigger batch also means that the dispatcher thread will need more CPU to create the batch, which in turn can mean the worker thread is already ready to receive the next batch by the time it is created, in which case you're back to just having added overhead.
There are still plans to make the batch size adapt itself automatically to the amount of work that a worker thread needs to do. But unfortunately, that will also require quite an extensive reworking of the current implementation of hyper and race. So don't expect that any time soon, and definitely not before the Great Dispatcher Overhaul has landed.
Please have a look at:
Raku .hyper() and .race() example not working
The syntax in your example should be:
my @nutrients = "/path/to/Nutrients.csv".IO.lines;
race for @nutrients.race(batch => 1, degree => 2)
{
    my @data = $_.split('","');
    .say if @data[2] eq "Protein" and @data[4] > 70 and @data[5] ~~ /^g/;
}
The "race" in front of the "for" makes the difference.

can you explain pmap laziness and memory footprint?

The docs says about pmap:
Like map, except f is applied in parallel. Semi-lazy in that the
parallel computation stays ahead of the consumption, but doesn't
realize the entire result unless required.
Can you kindly dis-obfuscate these two statements in some simple context?
Also, is there a doseq equivalent for the pmap function, i.e. one whose memory footprint stays constant regardless of the size of the iterated collection?
Semi-lazy in that the parallel computation stays ahead of the consumption
This means that pmap will do slightly more work than is strictly required by the sequence's consumer. This "working ahead" minimizes the wait for more items to be computed when the sequence is consumed. For example, if you're computing some infinite sequence in parallel and you only consume the first 50 results, pmap may have gone ahead and computed 50+N.
but doesn't realize the entire result unless required.
This means it's only going to work ahead up to a certain threshold. The entire sequence won't be produced unless it's completely consumed (or almost completely consumed).
Also is there for the pmap function, a doseq equivalent
You can use doall or dorun with pmap to produce side effects in parallel.
Here's an example of all three together, using an infinite sequence as input to pmap:
(def calls (atom 0))
(dorun (take 50 (pmap (fn [_] (swap! calls inc)) (range))))
;; @calls => 60
When this completes, the value of calls will be over 50, even though we only consumed 50 items from the sequence.
Also read up on reducers and core.async for another way to do the same thing.
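For example, here is a hedged sketch of the core.async route (to-chan! requires core.async 1.2+; older versions call it to-chan):
(require '[clojure.core.async :as a])

(let [in  (a/to-chan! (range 50))
      out (a/chan)]
  ;; run the (here trivial) function on exactly 4 worker threads
  (a/pipeline-blocking 4 out (map inc) in)
  (a/<!! (a/into [] out)))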
While Taylor's answer is correct, I also gave a presentation on what happens inside of pmap, and how it's lazy, at Clojure West a few years ago. I know not everyone likes videos for learning, but if you do, it might be helpful: https://youtu.be/BzKjIk0vgzE?t=11m48s
(If you want non-lazy pmap, I second the endorsement for Claypoole.)
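For example (a hedged sketch; claypoole's pmap accepts a plain number of threads and runs eagerly while preserving order):
(require '[com.climate.claypoole :as cp])

(cp/pmap 4 inc (range 10)) ;; eager, ordered, at most 4 threads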

memoization vs. state-free code

In the development of a stateless Clojure library I encounter a problem: many functions have to be called repeatedly with the same arguments. Since everything so far is side-effect-free, this will always lead to the same results. I'm considering ways to make this more performant.
My library works like this: every time a function is called it needs to be passed a state hash-map, and the function returns a replacement hash-map with the manipulated state. So this keeps everything immutable, and every sort of state is kept outside of the library.
(require '[mylib.core :as l])
(def state1 (l/init-state))
(def state2 (l/proceed state1))
(def state3 (l/proceed state2))
If proceed should not perform the same operations repeatedly, I have several options to solve this:
Option 1: "doing it by hand"
Store the necessary state in the state hash-map, and update only where it is necessary. That means having a sophisticated mechanism that knows which parts have to be recalculated and which not. This is always possible; in my case it would not be that trivial. If I implemented it, I'd produce much more code, which in the end is more error-prone. So is it necessary?
Option 2: memoize
So there is the temptation to use the memoize function at the critical points in the lib: at the points where I'd expect the possibility of repeated function calls with the same args. This is sort of another philosophy of programming: modelling each step as if it were running for the first time, and separating the fact that it is called several times into another concern. (This reminds me of the idea behind react/om/reagent's render function.)
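To illustrate option 2, a self-contained sketch of what plain memoize does (slow-inc is a made-up stand-in for a pure but slow library function; note that the cache grows without bound):
(def slow-inc
  (memoize (fn [x] (Thread/sleep 1000) (inc x))))

(slow-inc 1) ;; takes ~1 second the first time
(slow-inc 1) ;; returns instantly from the cache thereafter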
Option 3: core.memoize
But memoization is stateful - of course. And this - for example - becomes a problem when the lib runs in a web server: the server would just keep on filling memory with captured results. In my case, however, it would make sense to only capture calculated results for each user session. So it would be perfect to attach the cache to the previously described state hash-map, which will be passed back by the lib.
And it looks like core.memoize provides some tools for this job. Unfortunately it's not that well documented - I can't really find useful information related to the described situation.
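For reference, a hedged sketch of the kind of bounded cache core.memoize offers (assuming org.clojure/core.memoize is on the classpath; creating one memoized fn per user session keeps each cache session-local):
(require '[clojure.core.memoize :as memo])

(defn new-session-proceed
  "Returns a proceed whose LRU cache lives only as long as whoever holds it."
  []
  (memo/lru l/proceed :lru/threshold 128)) ;; keep at most 128 results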
My question is: do I estimate the possible options more or less correctly? Or are there other options that I have not considered? If not, it looks like core.memoize is the way to go. In that case, I'd appreciate it if someone could sketch the pattern one should use here.
If state1, state2 & state3 are different in your example, memoization will gain you nothing. proceed would be called with different arguments each time.
As a general design principle do not impose caching strategies to the consumer. Design so that the consumers of your library have the possibility to use whatever memoization technique, or no memoization at all.
Also, you don't mention whether init-state is side-effect-free and whether it always returns the same state1. If that is so, why not just keep all (or some) states as static literals? If they don't take much space, you can write a macro that calculates them at compile time - say, the first 20 states hard-coded, then call proceed; see the sketch below.
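A hedged sketch of that idea, assuming init-state and proceed are pure and the states are plain readable data (the macro embeds them as literals):
(defmacro precompute-states
  "Embeds the first n states as compile-time literals."
  [n]
  (vec (take n (iterate l/proceed (l/init-state)))))

(def first-states (precompute-states 20)) ;; no proceed calls at runtime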

Is there any workaround to "reserve" a cache fraction?

Assume I have to write a C or C++ computational intensive function that has 2 arrays as input and one array as output. If the computation uses the 2 input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom gets cached because it's evicted in order to fetch the 2 input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]); // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, and I need to be very careful about when to introduce it;
Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?
Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise the second (unnecessary) fetch of output[], have you given thought to using SSE2/3/4 registers to store your intermediate output values, update them when necessary, and write them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms) where part of the output is in registers and they are moved out (to memory) only when it is known they will not be accessed anymore. Until then, all updates happen to the registers. You'll need to introduce inline assembly to effectively use SSE* registers. Of course, such optimisations are highly dependent on the nature of the algorithm and data placement.
I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then, all elements in output[] are strictly write. If so, all you would ever need to 'reserve' is one cache-line. Isn't that correct?
In this scenario, all writes to 'output' generate cache-fills and could compete with the cachelines needed for 'input' arrays.
Wouldn't you want a cap on the cache lines 'output' can consume, as opposed to reserving a certain number of lines?
I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and not read, you can use streaming-stores, i.e., a write instruction with a no-read hint, so it will not be fetched into cache.
You can use prefetching with a non-temporal (NTA) hint for input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses a specific way of the cache for NTA data, i.e., with an 8-way cache, 1/8th per thread.
I guess the solution to this is hidden in the algorithm employed, the L1 cache size, and the cache line size.
Though I am not sure how much performance improvement we will see with this.
We can probably introduce artificial reads that cleverly dodge the compiler's optimizer and, during execution, do not hurt the computation either. A single artificial read should fill as many cache lines as are needed to accommodate one page. The algorithm should therefore be modified to compute blocks of the output array - something like the blocked approach used in multiplications of huge matrices on GPUs, which compute and write the result block by block.
As pointed out earlier, the writes to the output array should happen as a stream.
To bring in the artificial reads, we should initialize the output array at the right places at compile time, once in each block, probably with 0 or 1.

Does Clojure use multiple threads in a map call?

I'm attempting to explore the behavior of a CPU-bound algorithm as it scales to multiple CPUs using Clojure. The algorithm takes a large sequence of consecutive integers as input, partitions the sequence into a given number of sub-sequences, then uses map to apply a function to each sub-sequence. Once the map function has completed, reduce is used to collect the results.
The full code is available on Github, but here is a sample:
(map computation-function (partitioning-function number-of-partitions input))
When I execute this code on a machine with twelve cores, I see most of the cores in use, when I expect to see only one core in use.
Ideally, I would like to use pmap to use a given number of threads, but I am unable to cause the code to execute using only one thread.
So is Clojure spreading the computation across multiple CPUs? If so, is there anything that I can do to control this behavior?
My understanding is that pmap uses multiple cores and map uses the current thread only. (There would be no point in having both functions in the library if both used all available cores.)
The following simple experiment shows that pmap uses separate threads and map does not:
(defn something-slow [x]
  (Thread/sleep 1000))

;; dorun forces the lazy sequences so the timings are real
(time (dorun (map something-slow (range 5))))
;; Takes 5 seconds
(time (dorun (pmap something-slow (range 5))))
;; Takes 1 second
I do note that your GitHub code uses pmap in the example which runs in -main; if you change back to map, does the parallelism persist?