Extraneous groupBy in Spark DAG - Clojure

According to the Spark DAG visualization, there is a groupBy being performed in Stage 1 after the groupBy performed in Stage 0. I only have one groupBy in my code, and I wouldn't expect any of the other transformations I'm doing to result in a groupBy.
Here's the code (Clojure / flambo):
;; stage 0
(-> (.textFile sc path 8192)
    (f/map (f/fn [msg] (json/parse-string msg true)))
    (f/group-by (f/fn [msg] (:mmsi msg)) 8192)
    ;; stage 1
    (f/map-values (f/fn [values] (sort-by :timestamp (vec values))))
    (f/flat-map (ft/key-val-fn (f/fn [mmsi messages]
                  (let [state-map (atom {}) draught-map (atom {})]
                    (map #(mk-line % state-map draught-map) (vec messages))))))
    (f/map (f/fn [line] (json/generate-string line)))
    (f/save-as-text-file path))
It's clear to me how Stage 0 is the sequence textFile, map, groupBy and Stage 1 is map-values, map-values, flat-map, map, saveAsTextFile, but where does the groupBy in Stage 1 come from?
Since groupBy causes a shuffle, which is computationally expensive and time-consuming, I don't want an extraneous one if it can be helped.

There is no extraneous groupBy here. groupBy is a two-step process. The first step is a local map which transforms each x to (f(x), x). This is the part represented as a groupBy block in Stage 0.
The second step is the non-local groupByKey, which is marked as a groupBy block in Stage 1. Only this part requires shuffling.
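To make the two steps concrete, here is a rough sketch of what the single f/group-by call decomposes into. This assumes flambo exposes map-to-pair and group-by-key wrappers (names may vary between flambo versions); it illustrates the decomposition, not the library's exact implementation:
;; sketch only: assumes flambo's map-to-pair / group-by-key wrappers
(-> rdd
    ;; step 1: local map msg -> (key, msg); shown as the groupBy block in Stage 0
    (f/map-to-pair (f/fn [msg] (ft/tuple (:mmsi msg) msg)))
    ;; step 2: shuffle values to their keys; shown as the groupBy block in Stage 1
    (f/group-by-key))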

Related

How to execute parallel transactions in Clojure

I have a sequence of customers that needs to be processed in parallel. I tried to use pmap for that. The result is painfully slow, much slower than a sequential implementation. The inner function process-customer has a transaction. Obviously, pmap launches all the transactions at once and they end up retrying, killing performance. What is the best way to parallelize this?
(defn process-customers [customers]
  (doall
    (pmap
      (fn [sub-customers]
        (doseq [customer sub-customers]
          (process-customer customer)))
      (partition-all 10 customers))))
EDIT:
The process-customer function involves the steps below (I list only the steps for brevity). All the steps are inside a transaction to ensure that another parallel transaction does not cause inconsistencies like negative stock.
(defn- process-customer
  "Process `customer`. Consists of three steps:
  1. Finding all stores in which the requested products are still available.
  2. Sorting the found stores to find the cheapest (for the sum of all products).
  3. Buying the products by updating the `stock`."
  [customer]
  ;; implementation elided
  )
EDIT 2: The version of process-customers below has the same performance as the parallel process-customers above. It is obviously sequential.
(defn process-customers
  "Process `customers` one by one. In this code, this happens sequentially."
  [customers]
  (doseq [customer customers]
    (process-customer customer)))
I assume your transaction is locking the inventory for the full life cycle of process-customer. This will be slow, as all customers are racing for the same universe of stores. If you can split the process into two phases, 1) quoting and 2) fulfilling, and apply a transaction only to (2), then the performance should be much better. Or, if you buy into agent programming, a transaction boundary is automatically defined for you at the message level. Here is one sample you can consider:
(defn get-best-deal
  "Returns the best deal for a given order with the given stores (agents)"
  [stores order]
  ;;
  ;; request quotations from 1000 stores (in parallel)
  ;;
  (doseq [store stores]
    (send store get-quote order))
  ;;
  ;; wait for replies, up to 0.5s
  ;;
  (apply await-for 500 stores)
  ;;
  ;; sort and find the best store
  ;;
  (when-let [best-store (->> stores
                             (filter (fn [store] (get-in @store [:quotes order])))
                             (sort-by (fn [store] (->> (get-in @store [:quotes order])
                                                       vals
                                                       (reduce +))))
                             first)]
    {:best-store best-store
     :invoice-id (do
                   ;; execute the order
                   (send best-store fulfill order)
                   ;; wait for the transaction to complete
                   (await best-store)
                   ;; get an invoice id
                   (get-in @best-store [:invoices order]))}))
And to find the best deals from 1,000 stores for 100 orders (289 line items in total) over 100 products:
(->> orders
     (pmap (partial get-best-deal stores))
     (filter :invoice-id)
     count
     time)
;; => 57
;; "Elapsed time: 312.002328 msecs"
Sample business logic:
(defn get-quote
  "Issue a quote by checking inventory"
  [store {:keys [order-items] :as order}]
  ;; reduce-inventory (definition not shown here) walks the order's line
  ;; items against the store's inventory, accumulating a quote
  (if-let [quote (->> order-items
                      (reduce reduce-inventory
                              {:store store :quote nil})
                      :quote)]
    ;; has inventory to generate a quote
    (assoc-in store [:quotes order] quote)
    ;; no inventory
    (update store :quotes dissoc order)))
(defn fulfill
  "Fulfill an order if previously quoted"
  [store order]
  (if-let [quote (get-in store [:quotes order])]
    ;; check inventory again and generate an invoice
    ;; (check-inventory-and-generate-invoice is not shown here)
    (let [[invoice inventory'] (check-inventory-and-generate-invoice store order)]
      (cond-> store
        invoice (->
                  ;; register the invoice
                  (assoc-in [:invoices order] invoice)
                  ;; invalidate the quote
                  (update :quotes dissoc order)
                  ;; update the inventory
                  (assoc :inventory inventory'))))
    ;; not quoted before
    store))
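For completeness, here is a hypothetical setup of the store agents assumed above; the inventory shape (product -> price/qty map) is made up for illustration:
;; hypothetical setup; the data shapes are made up for illustration
(def products (range 100))

(def stores
  (->> (range 1000)
       (map (fn [id]
              (agent {:id        id
                      ;; product -> {:price p :qty q}
                      :inventory (zipmap products
                                         (repeatedly #(hash-map :price (inc (rand-int 100))
                                                                :qty   (rand-int 10))))
                      :quotes    {}
                      :invoices  {}})))
       (into [])))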

How to add multiple counts of large lists

If I do a simple version
(+ (count [1 2 3 4]) (count [1 2 3 4]))
I get the correct answer, 8.
But when I use the large-scale version from my program, where each count could potentially equal 100,000, it no longer works. combatLog is a 100,000-line log file.
(let [rdr (BufferedReader. (FileReader. combatLog))]
  (+ (count (filter (comp not nil?)
                    (take 100000 (repeatedly #(re-seq #"Kebtiz hits" (str (.readLine rdr)))))))
     (count (filter (comp not nil?)
                    (take 100000 (repeatedly #(re-seq #"Kebtiz misses" (str (.readLine rdr)))))))))
In this case, it returns only the value of the first count. I am trying to figure out either why + and count aren't working in this case, or another way to sum the total number of elements in both lists.
In your code you are reading from the same reader in two places. The first expression consumes the whole reader, so the second one gets no lines to filter. Notice that each call to .readLine advances the position in the input file.
I guess you wanted to do something like:
(with-open [reader (clojure.java.io/reader combatLog)]
  (->> reader
       (line-seq)
       (filter #(re-seq #"Kebtiz hits|Kebtiz misses" %))
       (count)))
Using with-open will make sure your file handle will be closed and resources won't leak. I also combined your two separate regular expressions into one.
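If you still want the two counts separately (hits vs. misses) in a single pass over the file, here is a minimal sketch using reduce over the same line-seq (an untested outline):
;; sketch: tallies both counts in one pass over the lines
(with-open [reader (clojure.java.io/reader combatLog)]
  (reduce (fn [[hits misses] line]
            [(cond-> hits (re-find #"Kebtiz hits" line) inc)
             (cond-> misses (re-find #"Kebtiz misses" line) inc)])
          [0 0]
          (line-seq reader)))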

How do I undo or reverse a transaction in datomic?

I committed a transaction to Datomic accidentally and I want to "undo" the whole transaction. I know exactly which transaction it is and I can see its datoms, but I don't know how to get from there to a rolled-back transaction.
The basic procedure:
1. Retrieve the datoms created in the transaction you want to undo. Use the transaction log to find them.
2. Remove the datoms related to the transaction entity itself: we don't want to retract the transaction metadata.
3. Invert the "added" state of all remaining datoms, i.e., if a datom was added, retract it, and if it was retracted, add it.
4. Reverse the order of the inverted datoms so the bad new value is retracted before the good old value is re-asserted.
5. Commit a new transaction.
In Clojure, your code would look like this:
(defn rollback
  "Reassert retracted datoms and retract asserted datoms in a transaction,
  effectively \"undoing\" the transaction.
  WARNING: *very* naive function!"
  [conn tx]
  (let [tx-log  (-> conn d/log (d/tx-range tx nil) first) ; find the transaction
        txid    (-> tx-log :t d/t->tx)                    ; get the transaction entity id
        newdata (->> (:data tx-log)                       ; get the datoms from the transaction
                     (remove #(= (:e %) txid))            ; remove transaction-metadata datoms
                     ;; invert the add/retract state of each datom
                     (map #(do [(if (:added %) :db/retract :db/add) (:e %) (:a %) (:v %)]))
                     reverse)]                            ; reverse the order of the inverted datoms
    @(d/transact conn newdata)))                          ; commit the new transaction
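Hypothetical usage, assuming [datomic.api :as d] is required and you already know the offending transaction's t value (the value below is made up for illustration):
;; e.g. the t you saw in the tx-report or the log; this value is made up
(rollback conn 13194139534312)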
This is not meant as an answer to the original question, but for those coming here from Google looking for inspiration on how to roll back a DataScript transaction. I didn't find documentation about it, so I wrote my own:
(defn rollback
  "Takes a transaction result and reasserts retracted
  datoms and retracts asserted datoms, effectively
  \"undoing\" the transaction."
  [{:keys [tx-data]}]
  ;; The passed transaction result looks something like this:
  ;;
  ;; {:db-before
  ;;  {1 :post/body,
  ;;   1 :post/created-at,
  ;;   1 :post/foo,
  ;;   1 :post/id,
  ;;   1 :post/title},
  ;;  :db-after {},
  ;;  :tx-data
  ;;  [#datascript/Datom [1 :post/body "sdffdsdsf" 536870914 false]
  ;;   #datascript/Datom [1 :post/created-at 1576538572631 536870914 false]
  ;;   #datascript/Datom [1 :post/foo "foo" 536870914 false]
  ;;   #datascript/Datom [1 :post/id #uuid "a21ad816-c509-42fe-a1b7-32ad9d3931ef" 536870914 false]
  ;;   #datascript/Datom [1 :post/title "123" 536870914 false]],
  ;;  :tempids {:db/current-tx 536870914},
  ;;  :tx-meta nil}
  ;;
  ;; We want to transform each datom into a new piece of
  ;; a transaction. The last field in each datom indicates
  ;; whether it was added (true) or retracted (false). To
  ;; roll back the datom, this boolean needs to be inverted.
  (let [t (map (fn [[entity-id attribute value _ added?]]
                 (if added?
                   [:db/retract entity-id attribute value]
                   [:db/add entity-id attribute value]))
               tx-data)]
    (transact t)))
You use it by first capturing a transaction's return value, then passing that return value to the rollback fn:
(let [tx (transact [...])]
  (rollback tx))
Be careful though, I'm new to the datascript/Datomic world, so there might be something I am missing.

Process a collection at a timed interval or when it reaches a certain size

I am reading Strings in from the standard input with
(line-seq (java.io.BufferedReader. *in*))
How can I:
1. Store the lines in a collection,
2. At some interval (say 5 minutes) process the collection, and also
3. Process the collection as soon as its size grows to n (say 10), regardless of timing?
Here are my suggestions:
1. As you can check at http://clojuredocs.org/clojure_core/clojure.core/line-seq, the result of (line-seq (BufferedReader. xxx)) is a sequence, so this function already stores the lines (it returns a new collection).
2. You can do the timed processing with the clojure/core.async timeout function (http://clojure.github.io/core.async/#clojure.core.async/timeout); take a look at https://github.com/clojure/core.async/blob/master/examples/walkthrough.clj to get acquainted with the library. See the sketch after this list.
3. Just use a conditional (if, when, ...) to check the count of the collection.
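To make points 2 and 3 concrete, here is a minimal, untested core.async sketch; the names in, start-batcher! and your-fn are made up for illustration:
(require '[clojure.core.async :refer [chan go-loop alts! timeout >!!]])

(defn start-batcher!
  "Calls `process!` with the accumulated batch every `interval` ms,
  or as soon as the batch reaches `n` items, whichever comes first.
  Sketch only; error handling is omitted."
  [in interval n process!]
  (go-loop [batch [] t (timeout interval)]
    (let [[v port] (alts! [in t])]
      (cond
        ;; the timer fired: flush whatever we have and restart the clock
        (= port t) (do (when (seq batch) (process! batch))
                       (recur [] (timeout interval)))
        ;; the input channel closed: flush and stop
        (nil? v) (when (seq batch) (process! batch))
        ;; the batch is full: flush immediately and restart the clock
        (= (inc (count batch)) n) (do (process! (conj batch v))
                                      (recur [] (timeout interval)))
        ;; otherwise keep accumulating against the same timer
        :else (recur (conj batch v) t)))))

;; feed stdin lines into the batcher
(let [in (chan)]
  (start-batcher! in (* 5 60 1000) 10 your-fn)
  (doseq [line (line-seq (java.io.BufferedReader. *in*))]
    (>!! in line)))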
As @tangrammer says, core.async would be a good way to go, or Lamina (sample-every).
I pieced something together using a single atom. You probably have to adjust things to your need (e.g. parallel execution, not using a future to create the periodic processing thread, return values, ...). The following code creates processor-with-interval-and-threshold, a function creating another function that can be given a seq of elements which is processed in the way you described.
(defn- periodically!
  [interval f]
  (future
    (while true
      (Thread/sleep interval)
      (f))))

(defn- build-head-and-tail
  [{:keys [head tail]} n elements]
  (let [[a b] (->> (concat tail elements)
                   (split-at n))]
    {:head (concat head a) :tail b}))

(defn- build-ready-elements
  [{:keys [head tail]}]
  {:ready (concat head tail)})

(defn processor-with-interval-and-threshold
  [interval threshold f]
  (let [q (atom {})]
    (letfn [(process-elements! []
              (let [{:keys [ready]} (swap! q build-ready-elements)]
                (when-not (empty? ready)
                  (f ready))))]
      (periodically! interval process-elements!)
      (fn [sq]
        (let [{:keys [head]} (swap! q build-head-and-tail threshold sq)]
          (when (>= (count head) threshold)
            (process-elements!)))))))
The atom q manages a map of three elements:
:head: a seq that gets filled first and checked against the threshold,
:tail: a seq with the elements that exceed the threshold (possibly lazy),
:ready: elements to be processed.
Now, you could for example do the following:
(let [add! (processor-with-interval-and-threshold 300000 10 your-fn)]
  (doseq [x (line-seq (java.io.BufferedReader. *in*))]
    (add! [x])))
That should be enough to get you started, I guess.

Clojure - counting unique values from vectors in a seq

Being somewhat new to Clojure I can't seem to figure out how to do something that seems like it should be simple. I just can't see it. I have a seq of vectors. Let's say each vector has two values representing customer number and invoice number and each of the vectors represents a sale of an item. So it would look something like this:
([ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ])
I want to count the number of unique customers and unique invoices. So the example should produce the vector
[ 2 3 ]
In Java or another imperative language I would loop over each one of the vectors in the seq, add the customer number and invoice number to a set then count the number of values in each set and return it. I can't see the functional way to do this.
Thanks for the help.
EDIT: I should have specified in my original question that the seq of vectors is in the tens of millions and each vector actually has more than just two values. So I want to go through the seq only once and calculate these unique counts (and some sums as well) on that single pass.
In Clojure you can do it almost the same way - first call distinct to get the unique values and then use count to count the results:
(def vectors (list [100 2000] [100 2000] [101 2001] [100 2002]))

(defn count-unique [coll]
  (count (distinct coll)))

(def result [(count-unique (map first vectors))
             (count-unique (map second vectors))])
Note that here you first get sequences of the first and second elements of the vectors (map first/second vectors) and then operate on each separately, thus iterating over the collection twice. If performance matters, you may do the same thing with explicit iteration (see the loop form or tail recursion) and sets, just like you would in Java. To further improve performance you can also use transients. Though for a beginner I would recommend the first way, with distinct.
UPD. Here's a version with loop:
(defn count-unique-vec [coll]
  (loop [coll coll, e1 (transient #{}), e2 (transient #{})]
    (cond (empty? coll) [(count (persistent! e1)) (count (persistent! e2))]
          :else (recur (rest coll)
                       (conj! e1 (first (first coll)))
                       (conj! e2 (second (first coll)))))))

(count-unique-vec vectors) ;; ==> [2 3]
As you can see, there's no need for atoms or anything like that. First, you pass the state on to each next iteration (the recur call). Second, you use transients for temporary mutable collections (read more on transients for details) and thus avoid creating a new object on each step.
UPD2. Here's a version with reduce for the extended question (with price):
(defn count-with-price
  "Takes input of form ([customer invoice price] [customer invoice price] ...)
  and produces a vector of 3 elements, where the 1st and 2nd are counts of unique
  customers and invoices and the 3rd is the total sum of all prices"
  [coll]
  (let [[custs invs total]
        (reduce (fn [[custs invs total] [cust inv price]]
                  [(conj! custs cust) (conj! invs inv) (+ total price)])
                [(transient #{}) (transient #{}) 0]
                coll)]
    [(count (persistent! custs)) (count (persistent! invs)) total]))
Here we hold the intermediate results in a vector [custs invs total], unpacking, processing and packing them back into a vector on each step. As you can see, implementing such non-trivial logic with reduce is harder (both to write and to read) and requires more code (in the looped version it is enough to add one more loop arg for the price). So I agree with @amalloy that reduce is better for simpler cases, but more complex things require lower-level constructs such as the loop/recur pair.
As is often the case when consuming a sequence, reduce is nicer than loop here. You can just do:
(map count (reduce (partial map conj)
                   [#{} #{}]
                   txn))
Or, if you're really into transients:
(map (comp count persistent!)
     (reduce (partial map conj!)
             (repeatedly 2 #(transient #{}))
             txn))
Both of these solutions traverse the input only once, and they take much less code than the loop/recur solution.
Or you could use sets to handle the de-duping for you, since a set can contain at most one of any given value.
(def vectors '([100 2000] [100 2000] [101 2001] [100 2002]))

[(count (into #{} (map first vectors)))
 (count (into #{} (map second vectors)))]
Here's a nice way to do this with map and higher-order functions:
(apply map
       (comp count set list)
       [[100 2000] [100 2000] [101 2001] [100 2002]])
=> (2 3)
Some other solutions in addition to the nice ones mentioned above:
(map (comp count distinct vector) [100 2000] [100 2000] [101 2001] [100 2002])
Another one, written with the thread-last macro:
(->> '([100 2000] [100 2000] [101 2001] [100 2002])
     (apply map vector)
     (map distinct)
     (map count))
Both return (2 3).