How to execute parallel transactions in Clojure

I have a sequence of customers that needs to be processed in parallel. I tried to use pmap for that. The result is painfully slow, much slower than a sequential implementation. The inner function process-customer has a transaction. Obviously, pmap launches all the transactions at once and they end up retrying, killing performance. What is the best way to parallelize this?
(defn process-customers [customers]
  (doall
   (pmap
    (fn [sub-customers]
      (doseq [customer sub-customers]
        (process-customer customer)))
    (partition-all 10 customers))))
EDIT:
The process-customer function involves the steps below. I list only the steps for brevity. All the steps run inside a single transaction, to ensure that another parallel transaction cannot cause inconsistencies such as negative stock.
(defn- process-customer
  "Process `customer`. Consists of three steps:
   1. Finding all stores in which the requested products are still available.
   2. Sorting the found stores to find the cheapest (for the sum of all products).
   3. Buying the products by updating the `stock`."
  [customer])
EDIT 2: The version of process-customers below performs the same as the parallel process-customers above, even though it is obviously sequential.
(defn process-customers
  "Process `customers` one by one. In this code, this happens sequentially."
  [customers]
  (doseq [customer customers]
    (process-customer customer)))

I assume your transaction is locking the inventory for the full life cycle of process-customer. This will be slow, as all customers are racing for the same universe of stores. If you can split the process into two phases, 1) quoting and 2) fulfilling, and apply the transaction only to (2), performance should be much better. Alternatively, if you buy into agent programming, the transaction boundary is automatically defined for you at the message level. Here is one sample you can consider:
(defn get-best-deal
  "Returns the best deal for a given order with given stores (agents)"
  [stores order]
  ;;
  ;; request quotations from 1000 stores (in parallel)
  ;;
  (doseq [store stores]
    (send store get-quote order))
  ;;
  ;; wait for replies, up to 0.5s
  ;;
  (apply await-for 500 stores)
  ;;
  ;; sort and find the best store
  ;;
  (when-let [best-store (->> stores
                             (filter (fn [store] (get-in @store [:quotes order])))
                             (sort-by (fn [store] (->> (get-in @store [:quotes order])
                                                       vals
                                                       (reduce +))))
                             first)]
    {:best-store best-store
     :invoice-id (do
                   ;; execute the order
                   (send best-store fulfill order)
                   ;; wait for the transaction to complete
                   (await best-store)
                   ;; get an invoice id
                   (get-in @best-store [:invoices order]))}))
And to find the best deals from 1,000 stores for 100 orders (289 line items in total) over 100 products:
(->> orders
     (pmap (partial get-best-deal stores))
     (filter :invoice-id)
     count
     time)
;; => 57
;; "Elapsed time: 312.002328 msecs"
Sample business logic:
(defn get-quote
  "Issue a quote by checking inventory"
  [store {:keys [order-items] :as order}]
  (if-let [quote (->> order-items
                      (reduce reduce-inventory
                              {:store store
                               :quote nil})
                      :quote)]
    ;; has inventory to generate a quote
    (assoc-in store [:quotes order] quote)
    ;; no inventory
    (update store :quotes dissoc order)))
(defn fulfill
  "Fulfill an order if previously quoted"
  [store order]
  (if-let [quote (get-in store [:quotes order])]
    ;; check inventory again and generate an invoice
    (let [[invoice inventory'] (check-inventory-and-generate-invoice store order)]
      (cond-> store
        invoice (->
                 ;; register the invoice
                 (assoc-in [:invoices order] invoice)
                 ;; invalidate the quote
                 (update :quotes dissoc order)
                 ;; update inventory
                 (assoc :inventory inventory'))))
    ;; not quoted before
    store))

Related

Measuring time of a process executing inside a future in Clojure using "time"

Below is a simplified version of an application I am working on. Specifically, I am interested in benchmarking the execution time of process-list. In the function process-list, I partition the input list into partitions equal to the number of threads I would like to execute in parallel. I then pass each partition to a thread through a call to future. Finally, in main I call process-list with time wrapped around it. Time should return the elapsed time of the processing done by process-list, but apparently it only returns the amount of time it takes to create the future threads and does not wait for the futures to execute to completion. How can I dereference the futures inside process-list to ensure the elapsed time accounts for the execution of the future threads to completion?
(ns listProcessing
  (:require [clojure.string]
            [clojure.pprint]
            [input-random :as input]))

(def N-THREADS 4)

(def element_processing_retries (atom 0))

(def list-collection
  "Each element is made into a ref"
  (map ref input/myList))

(defn partition-list [threads list]
  "partition list into the required number of partitions, which is equal
  to the number of threads"
  (let [partitions (partition-all
                    (Math/ceil (/ (count list) threads)) list)]
    partitions))

(defn increase-element [element]
  (ref-set element inc))

(defn process-list [list]
  "Process `members of list` one by one."
  (let [sub-lists (partition-list N-THREADS list)]
    (doseq [sub-list sub-lists]
      (let [futures '()
            myFuture (future (dosync (swap! element_processing_retries inc)
                                     (map increase-element sub-list)))]
        (cons myFuture futures)
        (map deref futures)))))

(defn main []
  (let [f1 (future (time (process-list input/myList)))]
    @f1))

(main)
(shutdown-agents)
Below is an example of a simplified list input. Note that both the input and the list processing have been simplified to keep the question short.
(ns input-random)
(def myList (list 1 2 4 7 89 12 34 45 56))
Note up front: the busy-wait approach shown below will have some overhead. If you're trying to time millisecond differences, it will skew things a bit (although minute timings shouldn't be using time anyway).
I think your example was a little convoluted, so I reduced it down to what I think represents the problem a little better:
(time (doseq [n (range 5)]
        (future
          (Thread/sleep 2000))))

"Elapsed time: 1.687702 msecs"
The problem here is the same as the problem with your code: all this really does is time how long it takes for doseq to dispatch all the jobs.
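For contrast, one minimal way to make time actually block (a sketch of my own, not from the original answer) is to hold onto the futures eagerly and then deref each one:

(time
 (->> (range 5)
      (mapv (fn [_] (future (Thread/sleep 2000))))
      (run! deref)))
;; "Elapsed time: ~2000 msecs" (the five sleeps overlap)

mapv is eager, so all five futures are dispatched up front, and run! with deref blocks until each has finished.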
The idea with my hack is to put each finished job into an atom, then check for an end condition in a busy wait:
(defn do-stuff [n-things]
  (let [ret-atom (atom 0)]
    (doseq [n (range n-things)]
      (future
        (Thread/sleep 2000)
        (swap! ret-atom inc)))
    ret-atom))

; Time how long it takes the entire `let` to run
(time
 (let [n 5
       ret-atom (do-stuff n)]
   ; Will block until the condition is met
   (while (< @ret-atom n))))

"Elapsed time: 2002.813288 msecs"
The reason this is so hard to time is that all you're doing is spinning up some side effects in a doseq. There's nothing defining what "done" is, so there's nothing to block on. I'm not great with core.async, but I suspect there may be something in there that could help. It may be possible to have a call to <!! that blocks until a channel has a certain number of elements. In that case, you would just need to put results onto the channel as they're produced.
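Along those lines, here is a rough sketch of my own (an assumption of how that could look, not tested against the original code): each job puts a marker onto a channel, and the timing thread blocks by taking exactly n values off it.

(require '[clojure.core.async :as async])

(time
 (let [n 5
       results (async/chan n)]
   (doseq [_ (range n)]
     (future
       (Thread/sleep 2000)
       ;; report completion
       (async/>!! results :done)))
   ;; block until all n jobs have reported in
   (dotimes [_ n]
     (async/<!! results))))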

Proper way to ensure clj-http's connection manager is closed after all requests are done

I have some code that combines clj-http, core.async facilities, and an atom. It creates several threads to fetch and parse a bunch of pages:
(defn fetch-page
  ([url] (fetch-page url nil))
  ([url conn-manager]
   (-> (http.client/get url {:connection-manager conn-manager})
       :body hickory/parse hickory/as-hickory)))

(defn- create-worker
  [url-chan result conn-manager]
  (async/thread
    (loop [url (async/<!! url-chan)]
      (when url
        (swap! result assoc url (fetch-page url conn-manager))
        (recur (async/<!! url-chan))))))

(defn fetch-pages
  [urls]
  (let [url-chan (async/to-chan urls)
        pages (atom (reduce (fn [m u] (assoc m u nil)) {} urls))
        conn-manager (http.conn-mgr/make-reusable-conn-manager {})
        workers (mapv (fn [_] (create-worker url-chan pages conn-manager))
                      (range n-cpus))]
    ; wait for workers to finish and shut conn-manager down
    (dotimes [_ n-cpus] (async/alts!! workers))
    (http.conn-mgr/shutdown-manager conn-manager)
    (mapv #(get @pages %) urls)))
The idea is to use multiple threads to reduce the time to fetch and parse the pages, but I'd like not to overload the server by sending a lot of requests at once - that is why a connection manager is used. I don't know if my approach is correct; suggestions are welcome. Currently the problem is that the last requests fail because the connection manager is shut down before they terminate: Exception in thread "async-thread-macro-15" java.lang.IllegalStateException: Connection pool shut down.
The main question: how do I close the connection manager at the right moment (and why does my current code fail to do so)? The side quest: is my approach right? If not, what could I do to fetch and parse multiple pages at once, while not overloading the server?
Thanks!
The problem is that async/alts!! returns on the first result (and will keep doing so, since workers never changes). I think using async/merge to build a channel and then repeatedly reading off of it should work.
(defn fetch-pages
  [urls]
  (let [url-chan (async/to-chan urls)
        pages (atom (reduce (fn [m u] (assoc m u nil)) {} urls))
        conn-manager (http.conn-mgr/make-reusable-conn-manager {})
        workers (mapv (fn [_] (create-worker url-chan pages conn-manager))
                      (range n-cpus))
        all-workers (async/merge workers)]
    ; wait for workers to finish and shut conn-manager down
    (dotimes [_ n-cpus] (async/<!! all-workers))
    (http.conn-mgr/shutdown-manager conn-manager)
    (mapv #(get @pages %) urls)))
Alternatively, you could recur and keep shrinking workers instead so that you're only waiting on previously unfinished workers.
(defn fetch-pages
  [urls]
  (let [url-chan (async/to-chan urls)
        pages (atom (reduce (fn [m u] (assoc m u nil)) {} urls))
        conn-manager (http.conn-mgr/make-reusable-conn-manager {})
        workers (mapv (fn [_] (create-worker url-chan pages conn-manager))
                      (range n-cpus))]
    ; wait for workers to finish and shut conn-manager down
    (loop [workers workers]
      (when (seq workers)
        (let [[_ finished-worker] (async/alts!! workers)]
          (recur (filterv #(not= finished-worker %) workers)))))
    (http.conn-mgr/shutdown-manager conn-manager)
    (mapv #(get @pages %) urls)))
I believe Alejandro is correct about the reason for your error: the exception indicates that you shut down the connection manager before all requests had completed, so it's likely that not all workers had finished when you shut it down.
Another solution I'll propose stems from the fact that you aren't actually doing anything in your create-worker thread that requires it to be a channel, which is implicitly created by async/thread. So, you can replace it with a future, like so:
(defn- create-worker
  [url-chan result conn-manager]
  (future
    (loop [url (a/<!! url-chan)]
      (when url
        (swap! result assoc url (fetch-page url conn-manager))
        (recur (a/<!! url-chan))))))
And in your fetch-pages function, "join" by derefing:
(doseq [worker workers]
  @worker) ; alternatively, use deref to specify a timeout
This eliminates a good deal of core.async interference in what is not a core.async issue to begin with. This of course depends on you keeping your method of collecting the data as-is, that is, using swap! on an atom to keep track of page data. If you were to send the result of fetch-page out onto a return channel, or something similar, then you'd want to keep your current thread approach.
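For illustration, a sketch of that channel-returning variant (my assumption, not code from the question or answer): each worker stays an async/thread whose loop puts [url page] pairs onto a shared out channel, and the caller would take (count urls) pairs off it before shutting the connection manager down.

(defn- create-worker
  [url-chan out-chan conn-manager]
  (async/thread
    (loop [url (async/<!! url-chan)]
      (when url
        ;; ship each result out instead of mutating an atom
        (async/>!! out-chan [url (fetch-page url conn-manager)])
        (recur (async/<!! url-chan))))))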
Regarding your concern about overloading the server: you have not yet defined what it means to "overload" the server. There are two dimensions to this: one is the rate of requests (the number of requests per second, for example), and the other is the number of concurrent requests. Your current app has n worker threads, and that is the effective concurrency (along with the settings in the connection manager). But this does nothing to address the rate of requests per second.
This is a little more complicated than it might seem, though it is possible. You have to consider the total of all requests done by all threads per unit of time, and managing that is not something to tackle in one answer here. I suggest you do some research on throttling and rate limiting, give it a go, and then come back with further questions.
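As a crude starting point, here is a sketch of my own (target-rps is a made-up budget; n-cpus and fetch-page come from the question): each worker sleeps between requests so that the pool's combined rate stays under the budget.

(def target-rps 10) ; hypothetical requests-per-second budget for the whole pool

(defn- create-throttled-worker
  [url-chan result conn-manager]
  (future
    (loop [url (async/<!! url-chan)]
      (when url
        (swap! result assoc url (fetch-page url conn-manager))
        ;; n-cpus workers each sleeping this long keeps the combined
        ;; rate at roughly target-rps requests per second
        (Thread/sleep (long (/ (* 1000 n-cpus) target-rps)))
        (recur (async/<!! url-chan))))))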

What is the correct way to perform side effects in a clojure atom swap

I'm keeping a registry of processes in an atom.
I want to start one and only one process (specifically a core.async go-loop) per id.
However, you're not supposed to perform side-effects in a swap!, so this code is no good:
(swap! processes-atom
       (fn [processes]
         (if (get processes id)
           processes ;; already exists, do nothing
           (assoc processes id (create-process! id)))))
How would I go about doing this correctly?
I have looked at locking, which takes an object as a monitor for the lock. I would prefer that each id (the ids are dynamic) have its own lock.
It seems that you need to protect processes-atom from concurrent modification, so that only a single thread can access it at a time. locking will work in this case. Since, by using locking, we manage thread safety ourselves, we can use a volatile instead of an atom (a volatile is faster, but doesn't provide any thread-safety or atomicity guarantees).
Summing up the above, something like below should work fine:
(def processes-volatile (volatile! {}))

(defn create-and-save-process! [id]
  (locking processes-volatile
    (vswap! processes-volatile
            (fn [processes]
              (if (get processes id)
                processes
                (assoc processes id (create-process! id)))))))
You can do this by hand with locking, as OlegTheCat shows, and often that is a fine approach. However, in the comments you remark that it would be nice to avoid having the whole atom locked for as long as it takes to spawn a process, and that too is possible in a surprisingly simple way: instead of having a map from pid to process, have a map from pid to a delay of a process. That way, you can add a new delay very cheaply, and only actually create the process by dereferencing the delay, outside of the call to swap!. Dereferencing the delay will block waiting for that particular delay, so multiple threads that need the same process will not step on each other's toes, but the atom itself stays unlocked, allowing threads that want a different process to get theirs.
Here is a sample implementation of that approach, along with example definitions of the other vars your question implies, to make the code runnable as-is:
(def process-results (atom []))

(defn create-process! [id]
  ;; pretend creating the process takes a long time
  (Thread/sleep (* 1000 (rand-int 3)))
  (future
    ;; running it takes longer, but happens on a new thread
    (Thread/sleep (* 1000 (rand-int 10)))
    (swap! process-results conj id)))

(def processes-atom (atom {}))

(defn cached-process [id]
  (-> processes-atom
      (swap! (fn [processes]
               (update processes id #(or % (delay (create-process! id))))))
      (get id)
      (deref)))
Of course only cached-process is needed if you already have the other things defined. And a sample run, to show that processes are successfully reused:
(defn stress-test [num-processes]
  (reset! process-results [])
  (reset! processes-atom {})
  (let [running-processes (doall (for [i (range num-processes)]
                                   (cached-process (rand-int 10))))]
    (run! deref running-processes)
    (deref process-results)))

user> (time (stress-test 40))
"Elapsed time: 18004.617869 msecs"
[1 5 2 0 9 7 8 4 3 6]
I prefer using a channel
(require '[clojure.core.async :refer [chan go <! >! >!! <!!]])

(defn create-process! [id] {:id id})

(def ^:private processes-channel (chan))

(go (loop [processes {}]
      (let [id (<! processes-channel)
            process (if (contains? processes id)
                      (get processes id)
                      (create-process! id))]
        (>! processes-channel process)
        (recur (assoc processes id process)))))

(defn get-process-by-id
  "Public API"
  [id]
  (>!! processes-channel id)
  (<!! processes-channel))
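Usage, as I would illustrate it (assuming calls arrive one at a time): because the go-loop owns the map, each id maps to exactly one process.

(get-process-by-id 42) ;=> {:id 42}
(get-process-by-id 42) ;=> the same {:id 42}, reused from the loop's map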
Another answer is to use an agent to start each process. This decouples the processes from one another, and avoids the problem of possible multiple calls to the "create-process" function:
(defn start-proc-agent
  [state]
  (let [delay (int (* 2000 (rand)))]
    (println (format "starting %d" (:id state)))
    (Thread/sleep delay)
    (println (format "finished %d" (:id state)))
    (merge state {:delay delay :state :running})))

(def procs-agent (atom {}))

(dotimes [i 3]
  (let [curr-agent (agent {:id i :state :unstarted})]
    (swap! procs-agent assoc i curr-agent)
    (send curr-agent start-proc-agent)))

(println "all dispatched...")
(pprint @procs-agent)
(Thread/sleep 3000)
(pprint @procs-agent)
When run we see:
starting 2
starting 1
starting 0
all dispatched...
{0 #<Agent#39d8240b: {:id 0, :state :unstarted}>,
1 #<Agent#3a6732bc: {:id 1, :state :unstarted}>,
2 #<Agent#7414167a: {:id 2, :state :unstarted}>}
finished 0
finished 1
finished 2
{0 #<Agent#39d8240b: {:id 0, :state :running, :delay 317}>,
1 #<Agent#3a6732bc: {:id 1, :state :running, :delay 1635}>,
2 #<Agent#7414167a: {:id 2, :state :running, :delay 1687}>}
So the global map procs-agent associates each process ID with the agent for that process. A side benefit of this approach is that you can send subsequent commands (in the form of functions) to the agent for a process and be assured they are independent of (and parallel and asynchronous to) every other agent.
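A sketch of such a follow-up command (the :paused state is made up for illustration):

(send (get @procs-agent 0)
      (fn [state] (assoc state :state :paused)))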
Alternate solution
Similar to your original question, we could use a single agent (instead of an agent per process) to simply serialize the creation of each process. Since agents are asynchronous, they don't have the possibility of retrying the input function like swap! does. Thus, side-effecting functions aren't a problem. You could write it like so:
(defn start-proc-once-only
  [state i]
  (let [curr-proc (get state i)]
    (if (= :running (:state curr-proc))
      (do
        (println "skipping restart of" i)
        state)
      (let [delay (int (* 2000 (rand)))]
        (println (format "starting %d" i))
        (Thread/sleep delay)
        (println (format "finished %d" i))
        (assoc state i {:delay delay :state :running})))))

(def procs (agent {}))

(dotimes [i 3]
  (println :starting i)
  (send procs start-proc-once-only i))

(dotimes [i 3]
  (println :starting i)
  (send procs start-proc-once-only i))

(println "all dispatched...")
(println :procs) (pprint @procs)
(Thread/sleep 5000)
(println :procs) (pprint @procs)
with the result:
:starting 0
:starting 1
:starting 2
starting 0
:starting 0
:starting 1
:starting 2
all dispatched...
:procs
{}
finished 0
starting 1
finished 1
starting 2
finished 2
skipping restart of 0
skipping restart of 1
skipping restart of 2
:procs
{0 {:delay 1970, :state :running},
1 {:delay 189, :state :running},
2 {:delay 1337, :state :running}}
I think you should use add-watch. It gets called once per change to the atom. In the watch-fn, check whether a new id has been added to the atom; if so, create the process and add it to the atom. That will trigger another call to the watch-fn, but that second call won't identify any new id needing a process.
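A minimal sketch of that approach (my code; it assumes create-process! from the question and that ids are first registered with a nil placeholder; concurrent registrations of the same id would need extra care):

(def processes-atom (atom {}))

(add-watch processes-atom :spawner
  (fn [_key atm _old new-state]
    (doseq [[id process] new-state
            :when (nil? process)]
      ;; (create-process! id) is evaluated once, outside the swap!
      ;; function, so the side effect is not retried; the second watch
      ;; invocation finds no nil entries and does nothing
      (swap! atm assoc id (create-process! id)))))

;; register an id; the watch starts the process for it
(swap! processes-atom update 42 identity)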

Strange behavior of clojure ref

I have 100 workers (agents) that share one ref containing a collection of tasks. While this collection has tasks, each worker gets one task from the collection (in a dosync block), prints it, and sometimes puts it back into the collection (in a dosync block):
(defn have-tasks?
  [tasks]
  (not (empty? @tasks)))

(defn get-task
  [tasks]
  (dosync
   (let [task (first @tasks)]
     (alter tasks rest)
     task)))

(defn put-task
  [tasks task]
  (dosync (alter tasks conj task))
  nil)

(defn worker
  [& {:keys [tasks]}]
  (agent {:tasks tasks}))

(defn worker-loop
  [{:keys [tasks] :as state}]
  (while (have-tasks? tasks)
    (let [task (get-task tasks)]
      (println "Task: " task)
      (when (< (rand) 0.1)
        (put-task tasks task))))
  state)

(defn create-workers
  [count & options]
  (->> (range 0 count)
       (map (fn [_] (apply worker options)))
       (into [])))

(defn start-workers
  [workers]
  (doseq [worker workers] (send-off worker worker-loop)))

(def tasks (ref (range 1 10000000)))
(def workers (create-workers 100 :tasks tasks))
(start-workers workers)
(apply await workers)
When I run this code, the last value printed by the agents is (across several tries):
435445,
4556294,
1322061,
3950017.
But never 9999999, which is what I expect.
And every time, the collection really is empty at the end.
What am I doing wrong?
Edit:
I rewrote worker-loop to be as simple as possible:
(defn worker-loop
  [{:keys [tasks] :as state}]
  (loop []
    (when-let [task (get-task tasks)]
      (println "Task: " task)
      (recur)))
  state)
But the problem is still there.
The code behaves as expected when I create one and only one worker.
The problem here has nothing to do with agents and barely anything to do with laziness. Here's a somewhat reduced version of the original code that still exhibits the problem:
(defn f [init]
  (let [state (ref init)
        task (fn []
               (loop [last-n nil]
                 (if-let [n (dosync
                             (let [n (first @state)]
                               (alter state rest)
                               n))]
                   (recur n)
                   (locking :out
                     (println "Last seen:" last-n)))))
        workers (->> (range 0 5)
                     (mapv (fn [_] (Thread. task))))]
    (doseq [w workers] (.start w))
    (doseq [w workers] (.join w))))

(defn r []
  (f (range 1 100000)))

(defn i []
  (f (->> (iterate inc 1)
          (take 100000))))

(defn t []
  (f (->> (range 1 100000)
          (take Integer/MAX_VALUE))))
Running this code shows that i and t, both lazy, reliably work, whereas r reliably doesn't. The problem is in fact a concurrency bug in the class returned by the range call. That bug is documented in this Clojure ticket and is fixed as of Clojure version 1.9.0-alpha11.
A quick summary of the bug in case the ticket is not accessible for some reason: in the internals of the rest call on the result of range, there was a small opportunity for a race condition: the "flag" that says "the next value has already been computed" was set before the actual value itself, which meant that a second thread could see that flag as true even though the "next value" is still nil. The call to alter would then fix that nil value on the ref. It's been fixed by swapping the two assignment lines.
In cases where the result of range was either forcibly realized in a single thread or wrapped in another lazy seq, that bug would not appear.
I asked this question on the Clojure Google Group and it helped me to find the answer.
The problem is that I used a lazy sequence within the STM transaction.
When I replaced this code:
(def tasks (ref (range 1 10000000)))
by this:
(def tasks (ref (into [] (range 1 10000000))))
it worked as expected!
In my production code, where the problem occurred, I used the Korma framework, which also returns lazy collections of tuples, just as in my example.
Conclusion: Avoid the use of lazy data structures within the STM transaction.
When the last number in the range is reached, there are still older numbers being held by the workers. Some of these will be returned to the queue, to be processed again.
In order to better see what is happening, you can change worker-loop to print the last task handled by each worker:
(defn worker-loop
  [{:keys [tasks] :as state}]
  (loop [last-task nil]
    (if (have-tasks? tasks)
      (let [task (get-task tasks)]
        ;; (when (< (rand) 0.1)
        ;;   (put-task tasks task))
        (recur task))
      (when last-task
        (println "Last task:" last-task))))
  state)
This also shows the race condition in the code: a task seen by have-tasks? is often taken by another worker by the time get-task is called, near the end of the processing of the tasks.
The race condition can be solved by removing have-tasks? and instead using a nil return value from get-task as the signal that no more tasks are available (at the moment).
Updated:
As observed, this race condition does not explain the problem.
Neither is the problem solved by removing a possible race condition in get-task, like this:
(defn get-task [tasks]
  (dosync
   (first (alter tasks rest))))
However, changing get-task to use an explicit lock seems to solve the problem:
(defn get-task [tasks]
  (locking :lock
    (dosync
     (let [task (first @tasks)]
       (alter tasks rest)
       task))))

Process a collection at a timed interval or when it reaches a certain size

I am reading Strings in from the standard input with
(line-seq (java.io.BufferedReader. *in*))
How can I:
1. Store the lines in a collection,
2. At some interval (say 5 minutes) process the collection, and also
3. Process the collection as soon as its size grows to n (say 10), regardless of timing?
Here are my suggestions for each point:
1. As you can check at http://clojuredocs.org/clojure_core/clojure.core/line-seq, the result of (line-seq (BufferedReader. xxx)) is a sequence, so this already stores the lines (it returns a new collection).
2. You can do this with the clojure/core.async timeout function (http://clojure.github.io/core.async/#clojure.core.async/timeout); take a look at https://github.com/clojure/core.async/blob/master/examples/walkthrough.clj to get acquainted with the library.
3. Just use a conditional (if, when, ...) to check the count of the collection.
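Combining points 2 and 3, a rough core.async sketch of my own (an assumption of how the pieces could fit together; your-fn stands in for the actual processing function): accumulate items from a channel and flush whenever the batch reaches n or the timeout fires, whichever comes first.

(require '[clojure.core.async :as async :refer [chan go-loop alts! timeout >!!]])

(defn start-batcher
  "Call `f` with batches read from `in`, flushing every `interval` ms
  or whenever `n` items have accumulated, whichever comes first."
  [in n interval f]
  (go-loop [batch []
            t (timeout interval)]
    (let [[v port] (alts! [in t])]
      (cond
        ;; the timer fired: flush whatever has accumulated
        (= port t)
        (do (when (seq batch) (f batch))
            (recur [] (timeout interval)))
        ;; `in` was closed: flush the remainder and stop
        (nil? v)
        (when (seq batch) (f batch))
        ;; batch is full: flush and restart the timer
        (= n (count (conj batch v)))
        (do (f (conj batch v))
            (recur [] (timeout interval)))
        :else
        (recur (conj batch v) t)))))

;; usage: feed lines into the channel
(let [in (chan)]
  (start-batcher in 10 (* 5 60 1000) your-fn)
  (doseq [line (line-seq (java.io.BufferedReader. *in*))]
    (>!! in line)))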
As @tangrammer says, core.async would be a good way to go, or Lamina (sample-every)
I pieced something together using a single atom. You will probably have to adjust things to your needs (e.g. parallel execution, not using a future to create the periodic processing thread, return values, ...). The following code creates processor-with-interval-and-threshold, a function that builds another function which can be given a seq of elements that are then processed in the way you described.
(defn- periodically!
  [interval f]
  (future
    (while true
      (Thread/sleep interval)
      (f))))

(defn- build-head-and-tail
  [{:keys [head tail]} n elements]
  (let [[a b] (->> (concat tail elements)
                   (split-at n))]
    {:head (concat head a) :tail b}))

(defn- build-ready-elements
  [{:keys [head tail]}]
  {:ready (concat head tail)})

(defn processor-with-interval-and-threshold
  [interval threshold f]
  (let [q (atom {})]
    (letfn [(process-elements! []
              (let [{:keys [ready]} (swap! q build-ready-elements)]
                (when-not (empty? ready)
                  (f ready))))]
      (periodically! interval process-elements!)
      (fn [sq]
        (let [{:keys [head]} (swap! q build-head-and-tail threshold sq)]
          (when (>= (count head) threshold)
            (process-elements!)))))))
The atom q manages a map of three elements:
:head: a seq that gets filled first and checked against the threshold,
:tail: a seq with the elements that exceed the threshold (possibly lazy),
:ready: elements to be processed.
Now, you could for example do the following:
(let [add! (processor-with-interval-and-threshold 300000 10 your-fn)]
  (doseq [x (line-seq (java.io.BufferedReader. *in*))]
    (add! [x])))
That should be enough to get you started, I guess.