How do I reduce a Clojure sequence in parallell - clojure

I have an unsorted sequence of maps (TV programs) that I need to merge, meaning that the resulting sequence is unique based on a special key (:title), and the other keys are merged with the duplicates. Think of it as merging all showings of a particular TV show into a single entry that hold all information about them.
A program looks like this (simplified):
[{:prog {:title "", ...} :starts #{} :directors #{} :actors #{} :categories {}}, ...]
Here's my current function that does the merging:
(defn- merge-programs [all-programs]
"Merge all instances of the same program"
(loop [acc []
programs all-programs]
(if (empty? programs)
acc
(let [first-prog (first programs)
dups (filter #(= (:title first-prog) (:title (:prog %))) programs)
merged-prog {:prog first-prog
:starts (apply set/union (map :starts dups))
:directors (apply set/union (map :directors dups))
:actors (apply set/union (map :actors dups))
:categories (apply set/union (map :categories dups))}]
(recur (conj acc merged-prog)
(remove #(= (:title first-prog) (:title (:prog %)))) programs))))))
I'm trying to figure out how to do this merging in parallel. But since, after each iteration of the loop, "random" elements of the start sequence are being removed, it would have to be some divide-and-conquer approach.
Any ideas on how to do this?

The Reducers functionality in Clojure 1.5 is what you want.

Related

Clojure/FP: apply functions to each argument to an operator

Let's say I have several vectors
(def coll-a [{:name "foo"} ...])
(def coll-b [{:name "foo"} ...])
(def coll-c [{:name "foo"} ...])
and that I would like to see if the names of the first elements are equal.
I could
(= (:name (first coll-a)) (:name (first coll-b)) (:name (first coll-c)))
but this quickly gets tiring and overly verbose as more functions are composed. (Maybe I want to compare the last letter of the first element's name?)
To directly express the essence of the computation it seems intuitive to
(apply = (map (comp :name first) [coll-a coll-b coll-c]))
but it leaves me wondering if there's a higher level abstraction for this sort of thing.
I often find myself comparing / otherwise operating on things which are to be computed via a single composition applied to multiple elements, but the map syntax looks a little off to me.
If I were to home brew some sort of operator, I would want syntax like
(-op- (= :name first) coll-a coll-b coll-c)
because the majority of the computation is expressed in (= :name first).
I'd like an abstraction to apply to both the operator & the functions applied to each argument. That is, it should be just as easy to sum as compare.
(def coll-a [{:name "foo" :age 43}])
(def coll-b [{:name "foo" :age 35}])
(def coll-c [{:name "foo" :age 28}])
(-op- (+ :age first) coll-a coll-b coll-c)
; => 106
(-op- (= :name first) coll-a coll-b coll-c)
; => true
Something like
(defmacro -op-
[[op & to-comp] & args]
(let [args' (map (fn [a] `((comp ~#to-comp) ~a)) args)]
`(~op ~#args')))
Is there an idiomatic way to do this in clojure, some standard library function I could be using?
Is there a name for this type of expression?
For your addition example, I often use transduce:
(transduce
(map (comp :age first))
+
[coll-a coll-b coll-c])
Your equality use case is trickier, but you could create a custom reducing function to maintain a similar pattern. Here's one such function:
(defn all? [f]
(let [prev (volatile! ::no-value)]
(fn
([] true)
([result] result)
([result item]
(if (or (= ::no-value #prev)
(f #prev item))
(do
(vreset! prev item)
true)
(reduced false))))))
Then use it as
(transduce
(map (comp :name first))
(all? =)
[coll-a coll-b coll-c])
The semantics are fairly similar to your -op- macro, while being both more idiomatic Clojure and more extensible. Other Clojure developers will immediately understand your usage of transduce. They may have to investigate the custom reducing function, but such functions are common enough in Clojure that readers can see how it fits an existing pattern. Also, it should be fairly transparent how to create new reducing functions for use cases where a simple map-and-apply wouldn't work. The transducing function can also be composed with other transformations such as filter and mapcat, for cases when you have a more complex initial data structure.
You may be looking for the every? function, but I would enhance clarity by breaking it down and naming the sub-elements:
(let [colls [coll-a coll-b coll-c]
first-name (fn [coll] (:name (first coll)))
names (map first-name colls)
tgt-name (first-name coll-a)
all-names-equal (every? #(= tgt-name %) names)]
all-names-equal => true
I would avoid the DSL, as there is no need and it makes it much harder for others to read (since they don't know the DSL). Keep it simple:
(let [colls [coll-a coll-b coll-c]
vals (map #(:age (first %)) colls)
result (apply + vals)]
result => 106
I don't think you need a macro, you just need to parameterize your op function and compare functions. To me, you are pretty close with your (apply = (map (comp :name first) [coll-a coll-b coll-c])) version.
Here is one way you could make it more generic:
(defn compare-in [op to-compare & args]
(apply op (map #(get-in % to-compare) args)))
(compare-in + [0 :age] coll-a coll-b coll-c)
(compare-in = [0 :name] coll-a coll-b coll-c)
;; compares last element of "foo"
(compare-in = [0 :name 2] coll-a coll-b coll-c)
I actually did not know you can use get on strings, but in the third case you can see we compare the last element of each foo.
This approach doesn't allow the to-compare arguments to be arbitrary functions, but it seems like your use case mainly deals with digging out what elements you want to compare, and then applying an arbitrary function to those values.
I'm not sure this approach is better than the transducer version supplied above (certainly not as efficient), but I think it provides a simpler alternative when that efficiency is not needed.
I would split this process into three stages:
transform items in collections into the data in collections you want to operate
on - (map :name coll);
Operate on transformed items in collections, returning collection of results - (map = transf-coll-a transf-coll-b transf-coll-c)
Finally, selecting which result in resulting collection to return - (first calculated-coll)
When playing with collections, I try to put more than one item into collection:
(def coll-a [{:name "foo" :age 43} {:name "bar" :age 45}])
(def coll-b [{:name "foo" :age 35} {:name "bar" :age 37}])
(def coll-c [{:name "foo" :age 28} {:name "bra" :age 30}])
For example, matching items by second char in :name and returning result for items in second place:
(let
[colls [coll-a coll-b coll-c]
transf-fn (comp #(nth % 1) :name)
op =
fetch second]
(fetch (apply map op (map #(map transf-fn %) colls))))
;; => false
In transducers world you can use sequence function which also works on multiple collections:
(let
[colls [coll-a coll-b coll-c]
transf-fn (comp (map :name) (map #(nth % 1)))
op =
fetch second]
(fetch (apply sequence (map op) (map #(sequence transf-fn %) colls))))
Calculate sum of ages (for all items at the same level):
(let
[colls [coll-a coll-b coll-c]
transf-fn (comp (map :age))
op +
fetch identity]
(fetch (apply sequence (map op) (map #(sequence transf-fn %) colls))))
;; => (106 112)

Improving complex data structure replacement

I'm attempting to modify a specific field in a data structure, described below (a filled example can be found here:
[{:fields "There are a few other fields here"
:incidents [{:fields "There are a few other fields here"
:updates [{:fields "There are a few other fields here"
:content "THIS is the field I want to replace"
:translations [{:based_on "Based on the VALUE of this"
:content "Replace with this value"}]}]}]}]
I already have this implemented it in a number of functions, as below:
(defn- translation-content
[arr]
(:content (nth arr (.indexOf (map :locale arr) (env/get-locale)))))
(defn- translate
[k coll fn & [k2]]
(let [k2 (if (nil? k2) k k2)
c ((keyword k2) coll)]
(assoc-in coll [(keyword k)] (fn c))))
(defn- format-update-translation
[update]
(dissoc update :translations))
(defn translate-update
[update]
(format-update-translation (translate :content update translation-content :translations)))
(defn translate-updates
[updates]
(vec (map translate-update updates)))
(defn translate-incident
[incident]
(translate :updates incident translate-updates))
(defn translate-incidents
[incidents]
(vec (map translate-incident incidents)))
(defn translate-service
[service]
(assoc-in service [:incidents] (translate-incidents (:incidents service))))
(defn translate-services
[services]
(vec (map translate-service services)))
Each array could have any number of entries (though the number is likely less than 10).
The basic premise is to replace the :content in each :update with the relevant :translation based on a provided value.
My Clojure knowledge is limited, so I'm curious if there is a more optimal way to achieve this?
EDIT
Solution so far:
(defn- translation-content
[arr]
(:content (nth arr (.indexOf (map :locale arr) (env/get-locale)))))
(defn- translate
[k coll fn & [k2]]
(let [k2 (if (nil? k2) k k2)
c ((keyword k2) coll)]
(assoc-in coll [(keyword k)] (fn c))))
(defn- format-update-translation
[update]
(dissoc update :translations))
(defn translate-update
[update]
(format-update-translation (translate :content update translation-content :translations)))
(defn translate-updates
[updates]
(mapv translate-update updates))
(defn translate-incident
[incident]
(translate :updates incident translate-updates))
(defn translate-incidents
[incidents]
(mapv translate-incident incidents))
(defn translate-service
[service]
(assoc-in service [:incidents] (translate-incidents (:incidents service))))
(defn translate-services
[services]
(mapv translate-service services))
I would start more or less as you do, bottom-up, by defining some functions that look like they will be useful: how to choose a translation from among a list of translations, and how to apply that choice to an update. But I wouldn't make the functions so tiny as yours: the logic is all spread out into a lot of places, and it's not easy to get an overall idea of what is going on. Here are the two functions I'd start with:
(letfn [(choose-translation [translations]
(let [applicable (filter #(= (:locale %) (get-locale))
translations)]
(when (= 1 (count applicable))
(:content (first applicable)))))
(translate-update [update]
(-> update
(assoc :content (or (choose-translation (:translations update))
(:content update)))
(dissoc :translations)))]
...)
Of course you can defn them instead if you'd like, and I suspect many people would, but they're only going to be used in one place, and they're intimately involved with the context in which they're used, so I like a letfn. These two functions are really all the interesting logic; the rest is just some boring tree-traversal code to apply this logic in the right places.
Now to build out the body of the letfn is straightforward, and easy to read if you make your code be the same shape as the data it manipulates. We want to walk through a series of nested lists, updating objects on the way, and so we just write a series of nested for comprehensions, calling update to descend into the right keyspace:
(for [user users]
(update user :incidents
(fn [incidents]
(for [incident incidents]
(update incident :updates
(fn [updates]
(for [update updates]
(translate-update update))))))))
I think using for here is miles better than using map, although of course they are equivalent as always. The important difference is that as you read through the code you see the new context first ("okay, now we're doing something to each user"), and then what is happening inside that context; with map you see them in the other order and it is difficult to keep tack of what is happening where.
Combining these, and putting them into a defn, we get a function that you can call with your example input and which produces your desired output (assuming a suitable definition of get-locale):
(defn translate [users]
(letfn [(choose-translation [translations]
(let [applicable (filter #(= (:locale %) (get-locale))
translations)]
(when (= 1 (count applicable))
(:content (first applicable)))))
(translate-update [update]
(-> update
(assoc :content (or (choose-translation (:translations update))
(:content update)))
(dissoc :translations)))]
(for [user users]
(update user :incidents
(fn [incidents]
(for [incident incidents]
(update incident :updates
(fn [updates]
(for [update updates]
(translate-update update))))))))))
we can try to find some patterns in this task (based on the contents of the snippet from github gist, you've posted):
simply speaking, you need to
1) update every item (A) in vector of data
2) updating every item (B) in vector of A's :incidents
3) updating every item (C) in vector of B's :updates
4) translating C
The translate function could look like this:
(defn translate [{translations :translations :as item} locale]
(assoc item :content
(or (some #(when (= (:locale %) locale) (:content %)) translations)
:no-translation-found)))
it's usage (some fields are omitted for brevity):
user> (translate {:id 1
:content "abc"
:severity "101"
:translations [{:locale "fr_FR"
:content "abc"}
{:locale "ru_RU"
:content "абв"}]}
"ru_RU")
;;=> {:id 1,
;; :content "абв",
;; :severity "101",
;; :translations [{:locale "fr_FR", :content "abc"} {:locale "ru_RU", :content "абв"}]}
then we can see that 1 and 2 are totally similar, so we can generalize that:
(defn update-vec-of-maps [data k f]
(mapv (fn [item] (update item k f)) data))
using it as a building block you can make up the whole data transformation:
(defn transform [data locale]
(update-vec-of-maps
data :incidents
(fn [incidents]
(update-vec-of-maps
incidents :updates
(fn [updates] (mapv #(translate % locale) updates))))))
(transform data "it_IT")
returns what you need.
then you can generalize it further, making the utility function for arbitrary depth transformations:
(defn deep-update-vec-of-maps [data ks terminal-fn]
(if (seq ks)
((reduce (fn [f k] #(update-vec-of-maps % k f))
terminal-fn (reverse ks))
data)
data))
and use it like this:
(deep-update-vec-of-maps data [:incidents :updates]
(fn [updates]
(mapv #(translate % "it_IT") updates)))
I recommend you look at https://github.com/nathanmarz/specter
It makes it really easy to read and update clojure data structures. Same performance as hand-written code, but much shorter.

clojure: filtering a vector of maps by keys existence and values

I have a vector of maps like this one
(def map1
[{:name "name1"
:field "xxx"}
{:name "name2"
:requires {"element1" 1}}
{:name "name3"
:consumes {"element2" 1 "element3" 4}}])
I'm trying to define a functions that takes in a map like {"element1" 1 "element3" 6} (ie: with n fields, or {}) and fiters the maps in map1, returning only the ones that either have no requires and consumes, or have a lower number associated to them than the one associated with that key in the provided map (if the provided map doesn't have any key like that, it's not returned)
but I'm failing to grasp how to approach the maps recursive loop and filtering
(defn getV [node nodes]
(defn filterType [type nodes]
(filter (fn [x] (if (contains? x type)
false ; filter for key values here
true)) nodes))
(filterType :requires (filterType :consumes nodes)))
There's two ways to look at problems like this: from the outside in or from the inside out. Naming things carefully can really help when working with nested structures. For example, calling a vector of maps map1 may be adding to the confusion.
Starting from the outside, you need a predicate function for filtering the list. This function will take a map as a parameter and will be used by a filter function.
(defn comparisons [m]
...)
(filter comparisons map1)
I'm not sure I understand the comparisons precisely, but there seems to be at least two flavors. The first is looking for maps that do not have :requires or :consumes keys.
(defn no-requires-or-consumes [m]
...)
(defn all-keys-higher-than-values [m]
...)
(defn comparisons [m]
(some #(% m) [no-requires-or-consumes all-keys-higher-than-values]))
Then it's a matter of defining the individual comparison functions
(defn no-requires-or-consumes [m]
(and (not (:requires m)) (not (:consumes m))))
The second is more complicated. It operates on one or two inner maps but the behaviour is the same in both cases so the real implementation can be pushed down another level.
(defn all-keys-higher-than-values [m]
(every? keys-higher-than-values [(:requires m) (:consumes m)]))
The crux of the comparison is looking at the number in the key part of the map vs the value. Pushing the details down a level gives:
(defn keys-higher-than-values [m]
(every? #(>= (number-from-key %) (get m %)) (keys m)))
Note: I chose >= here so that the second entry in the sample data will pass.
That leaves only pulling the number of of key string. how to do that can be found at In Clojure how can I convert a String to a number?
(defn number-from-key [s]
(read-string (re-find #"\d+" s)))
Stringing all these together and running against the example data returns the first and second entries.
Putting everything together:
(defn no-requires-or-consumes [m]
(and (not (:requires m)) (not (:consumes m))))
(defn number-from-key [s]
(read-string (re-find #"\d+" s)))
(defn all-keys-higher-than-values [m]
(every? keys-higher-than-values [(:requires m) (:consumes m)]))
(defn keys-higher-than-values [m]
(every? #(>= (number-from-key %) (get m %)) (keys m)))
(defn comparisons [m]
(some #(% m) [no-requires-or-consumes all-keys-higher-than-values]))
(filter comparisons map1)

Read each entry lazily from a zip file

I want to read file entries in a zip file into a sequence of strings if possible. Currently I'm doing something like this to print out directory names for example:
(defn entries [zipfile]
(lazy-seq
(if-let [entry (.getNextEntry zipfile)]
(cons entry (entries zipfile)))))
(defn with-each-entry [fileName f]
(with-open [z (ZipInputStream. (FileInputStream. fileName))]
(doseq [e (entries z)]
; (println (.getName e))
(f e)
(.closeEntry z))))
(with-each-entry "tmp/my.zip"
(fn [e] (if (.isDirectory e)
(println (.getName e)))))
However this will iterate through the entire zip file. How could I change this so I could take the first few entries say something like:
(take 10 (zip-entries "tmp/my.zip"
(fn [e] (if (.isDirectory e)
(println (.getName e)))))
This seems like a pretty natural fit for the new transducers in CLJ 1.7.
You just build up the transformations you want as a transducer using comp and the usual seq-transforming fns with no seq/collection argument. In your example cases,
(comp (map #(.getName %)) (take 10)) and
(comp (filter #(.isDirectory %)) (map #(-> % .getName println))).
This returns a function of multiple arities which you can use in a lot of ways. In this case you want to eagerly reduce it over the entries sequence (to ensure realization of the entries happens inside with-open), so you use transduce (example zip data made by zipping one of my clojure project folders):
(with-open [z (-> "training-day.zip" FileInputStream. ZipInputStream.)]
(let[transform (comp (map #(.getName %)) (take 10))]
(transduce transform conj (entries z))))
;;return value: [".gitignore" ".lein-failures" ".midje-grading-config.clj" ".nrepl-port" ".travis.yml" "project.clj" "README.md" "target/" "target/classes/" "target/repl-port"]
Here I'm transducing with base function conj which makes a vector of the names. If you instead want your transducer to perform side-effects and not return a value, you can do that with a base function like (constantly nil):
(with-open [z (-> "training-day.zip" FileInputStream. ZipInputStream.)]
(let[transform (comp (filter #(.isDirectory %)) (map #(-> % .getName println)))]
(transduce transform (constantly nil) (entries z))))
which gives output:
target/
target/classes/
target/stale/
test/
A potential downside with this is that you'll probably have to manually incorporate .closeEntry calls into each transducer you use here to prevent holding those resources, because you can't in the general case know when each transducer is done reading the entry.

How do I write a predicate that checks if a value exists in an infinite seq?

I had an idea for a higher-order function today that I'm not sure how to write. I have several sparse, lazy infinite sequences, and I want to create an abstraction that lets me check to see if a given number is in any of these lazy sequences. To improve performance, I wanted to push the values of the sparse sequence into a hashmap (or set), dynamically increasing the number of values in the hashmap whenever it is necessary. Automatic memoization is not the answer here due to sparsity of the lazy seqs.
Probably code is easiest to understand, so here's what I have so far. How do I change the following code so that the predicate uses a closed-over hashmap, but if needed increases the size of the hashmap and redefines itself to use the new hashmap?
(defn make-lazy-predicate [coll]
"Returns a predicate that returns true or false if a number is in
coll. Coll must be an ordered, increasing lazy seq of numbers."
(let [in-lazy-list? (fn [n coll top cache]
(if (> top n)
(not (nil? (cache n)))
(recur n (next coll) (first coll)
(conj cache (first coll)))]
(fn [n] (in-lazy-list? n coll (first coll) (sorted-set)))))
(def my-lazy-list (iterate #(+ % 100) 1))
(let [in-my-list? (make-lazy-predicate my-lazy-list)]
(doall (filter in-my-list? (range 10000))))
How do I solve this problem without reverting to an imperative style?
This is a thread-safe variant of Adam's solution.
(defn make-lazy-predicate
[coll]
(let [state (atom {:mem #{} :unknown coll})
update-state (fn [{:keys [mem unknown] :as state} item]
(let [[just-checked remainder]
(split-with #(<= % item) unknown)]
(if (seq just-checked)
(-> state
(assoc :mem (apply conj mem just-checked))
(assoc :unknown remainder))
state)))]
(fn [item]
(get-in (if (< item (first (:unknown #state)))
#state
(swap! state update-state item))
[:mem item]))))
One could also consider using refs, but than your predicate search might get rolled back by an enclosing transaction. This might or might not be what you want.
This function is based on the idea how the core memoize function works. Only numbers already consumed from the lazy list are cached in a set. It uses the built-in take-while instead of doing the search manually.
(defn make-lazy-predicate [coll]
(let [mem (atom #{})
unknown (atom coll)]
(fn [item]
(if (< item (first #unknown))
(#mem item)
(let [just-checked (take-while #(>= item %) #unknown)]
(swap! mem #(apply conj % just-checked))
(swap! unknown #(drop (count just-checked) %))
(= item (last just-checked)))))))