Given:
(def seq1 [{:id 1 :val 10} {:id 2 :val 20}])
(def seq2 [{:id 1 :val 12} {:id 3 :val 30}])
Within each sequence, the :id value is guaranteed to be unique in that sequence, though not necessarily ordered.
How can these two sequences of maps be joined by the :id key, so that the result is as shown?
{1 [{:id 1 :val 10} {:id 1 :val 12}],
2 [{:id 2 :val 20} nil ],
3 [nil {:id 3 :val 30}]}
The result is a map of pairs keyed by :id. This is like a full outer join: the output includes not only the intersection of the two key sets but also their differences.
Here's the answer I came up with; however, I'm sure it can be made more elegant, or perhaps made to perform better.
(defn seq-to-map [seq key]
(into {} (map (fn [{id key :as m}] [id m]) seq)))
(defn outer-join-maps [seq1 seq2 key]
(let [map1 (seq-to-map seq1 key)
map2 (seq-to-map seq2 key)
allkeys (set (clojure.set/union (keys map1) (keys map2)))]
(into {} (map (fn [k] [k [(get map1 k) (get map2 k)]]) allkeys))))
The following tests pass:
(fact {:midje/description "Sequence to map"}
(seq-to-map [{:a 1, :b 1} {:a 2, :b 2}] :a)
=> {1 {:a 1, :b 1}, 2 {:a 2, :b 2}}
(seq-to-map [{:a 1, :b 1} {:a 1, :b 2}] :a)
=> {1 {:a 1, :b 2}} ; takes last value when a key is repeated
(seq-to-map [] :a)
=> {})
(fact {:midje/description "Sequence merging"}
(let [seq1 [{:id 1 :val 10} {:id 2 :val 20}]
seq2 [{:id 1 :val 12} {:id 3 :val 30}]]
(outer-join-maps seq1 seq2 :id)
=> {1 [{:id 1 :val 10} {:id 1 :val 12}],
2 [{:id 2 :val 20} nil],
3 [nil {:id 3 :val 30}]}))
Your answer is as good as anything else, really, but I would write it as
(defn outer-join [field a b]
(let [lookup #(get % field)
indexed (for [coll [a b]]
(into {} (map (juxt lookup identity) coll)))]
(into {} (for [key (distinct (mapcat keys indexed))]
[key (map #(get % key) indexed)]))))
Here's another version, which in my benchmarks using input sizes of 2+2, 90+90, 900+900, 90000+99000 and 300000+300000 is the fastest so far.
(defn outer-join [k xs ys]
(let [gxs (group-by #(get % k) xs)
gys (group-by #(get % k) ys)
kvs (concat (keys gxs) (keys gys))]
(persistent!
(reduce (fn [out k]
(let [l (first (get gxs k))
r (first (get gys k))]
(assoc! out k [l r])))
(transient {})
kvs))))
(I experimented with wrapping the key seq in distinct, but it turned out to result in a slowdown in benchmarks involving small-to-moderately-large inputs. This makes sense: we need to walk both key seqs anyway and the amount of work we do per key is so tiny that it may well be more work to avoid it.)
Here is a sanity check and a handful of Criterium benchmarks (with amalloy's version renamed to outer-join*):
(let [xs [{:id 1 :val 10} {:id 2 :val 20}]
ys [{:id 1 :val 12} {:id 3 :val 30}]]
(assert (= (outer-join :id xs ys)
(outer-join* :id xs ys)
(outer-join-maps xs ys :id)))
(c/bench (outer-join :id xs ys))
(c/bench (outer-join* :id xs ys))
(c/bench (outer-join-maps xs ys :id)))
WARNING: Final GC required 3.296446000194027 % of runtime
Evaluation count : 17099160 in 60 samples of 284986 calls.
Execution time mean : 3.589256 µs
Execution time std-deviation : 34.976485 ns
Execution time lower quantile : 3.544196 µs ( 2.5%)
Execution time upper quantile : 3.666515 µs (97.5%)
Overhead used : 2.295807 ns
Evaluation count : 6596160 in 60 samples of 109936 calls.
Execution time mean : 9.107578 µs
Execution time std-deviation : 82.176826 ns
Execution time lower quantile : 8.993900 µs ( 2.5%)
Execution time upper quantile : 9.295188 µs (97.5%)
Overhead used : 2.295807 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Evaluation count : 9298740 in 60 samples of 154979 calls.
Execution time mean : 6.592289 µs
Execution time std-deviation : 63.929382 ns
Execution time lower quantile : 6.506403 µs ( 2.5%)
Execution time upper quantile : 6.749262 µs (97.5%)
Overhead used : 2.295807 ns
Found 4 outliers in 60 samples (6.6667 %)
low-severe 4 (6.6667 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(let [xs (map (fn [id] {:id id :val (* 10 id)}) (range 90))
ys (map (fn [id] {:id id :val (* 20 id)}) (range 10 100))]
(assert (= (outer-join :id xs ys)
(outer-join* :id xs ys)
(outer-join-maps xs ys :id)))
(c/bench (outer-join :id xs ys))
(c/bench (outer-join* :id xs ys))
(c/bench (outer-join-maps xs ys :id)))
Evaluation count : 413760 in 60 samples of 6896 calls.
Execution time mean : 147.182107 µs
Execution time std-deviation : 1.282179 µs
Execution time lower quantile : 145.103445 µs ( 2.5%)
Execution time upper quantile : 149.658348 µs (97.5%)
Overhead used : 2.295807 ns
Evaluation count : 256920 in 60 samples of 4282 calls.
Execution time mean : 238.166905 µs
Execution time std-deviation : 1.987980 µs
Execution time lower quantile : 235.211277 µs ( 2.5%)
Execution time upper quantile : 242.255072 µs (97.5%)
Overhead used : 2.295807 ns
Evaluation count : 362760 in 60 samples of 6046 calls.
Execution time mean : 167.301109 µs
Execution time std-deviation : 1.616075 µs
Execution time lower quantile : 164.534670 µs ( 2.5%)
Execution time upper quantile : 170.757257 µs (97.5%)
Overhead used : 2.295807 ns
(let [xs (map (fn [id] {:id id :val (* 10 id)}) (range 900))
ys (map (fn [id] {:id id :val (* 20 id)}) (range 100 1000))]
(assert (= (outer-join :id xs ys)
(outer-join* :id xs ys)
(outer-join-maps xs ys :id)))
(c/bench (outer-join :id xs ys))
(c/bench (outer-join* :id xs ys))
(c/bench (outer-join-maps xs ys :id)))
Evaluation count : 33840 in 60 samples of 564 calls.
Execution time mean : 1.754723 ms
Execution time std-deviation : 29.229644 µs
Execution time lower quantile : 1.709219 ms ( 2.5%)
Execution time upper quantile : 1.805009 ms (97.5%)
Overhead used : 2.295807 ns
Evaluation count : 22740 in 60 samples of 379 calls.
Execution time mean : 2.559172 ms
Execution time std-deviation : 44.520222 µs
Execution time lower quantile : 2.490201 ms ( 2.5%)
Execution time upper quantile : 2.657706 ms (97.5%)
Overhead used : 2.295807 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 6.2842 % Variance is slightly inflated by outliers
Evaluation count : 30000 in 60 samples of 500 calls.
Execution time mean : 1.999194 ms
Execution time std-deviation : 25.723647 µs
Execution time lower quantile : 1.962350 ms ( 2.5%)
Execution time upper quantile : 2.045836 ms (97.5%)
Overhead used : 2.295807 ns
Huge inputs (excluding outer-join-maps):
(let [xs (map (fn [id] {:id id :val (* 10 id)}) (range 300000))
ys (map (fn [id] {:id id :val (* 20 id)}) (range 100000 400000))]
(assert (= (outer-join :id xs ys)
(outer-join* :id xs ys)
(outer-join-maps xs ys :id)))
(c/bench (outer-join :id xs ys))
(c/bench (outer-join* :id xs ys)))
WARNING: Final GC required 13.371566110062922 % of runtime
Evaluation count : 120 in 60 samples of 2 calls.
Execution time mean : 772.532296 ms
Execution time std-deviation : 12.710681 ms
Execution time lower quantile : 744.832577 ms ( 2.5%)
Execution time upper quantile : 801.098417 ms (97.5%)
Overhead used : 2.295807 ns
Found 6 outliers in 60 samples (10.0000 %)
low-severe 2 (3.3333 %)
low-mild 3 (5.0000 %)
high-mild 1 (1.6667 %)
Variance from outliers : 5.3156 % Variance is slightly inflated by outliers
WARNING: Final GC required 15.51698960336361 % of runtime
Evaluation count : 120 in 60 samples of 2 calls.
Execution time mean : 949.508151 ms
Execution time std-deviation : 32.952708 ms
Execution time lower quantile : 911.054447 ms ( 2.5%)
Execution time upper quantile : 1.031623 sec (97.5%)
Overhead used : 2.295807 ns
Found 4 outliers in 60 samples (6.6667 %)
low-severe 4 (6.6667 %)
Variance from outliers : 20.6517 % Variance is moderately inflated by outliers
If you don't require the nils for maps that lack a given key, then merge-with can handle this problem fairly easily.
user> (def seq1 [{:id 1 :val 10} {:id 2 :val 20}])
#'user/seq1
user> (def seq2 [{:id 1 :val 12} {:id 3 :val 30}])
#'user/seq2
user> (def data (concat seq1 seq2))
#'user/data
user> (reduce (partial merge-with (comp vec concat))
(map #(hash-map (:id %) [%]) data))
{1 [{:val 10, :id 1} {:val 12, :id 1}],
2 [{:val 20, :id 2}],
3 [{:val 30, :id 3}]}
Related
One intuitive way to calculate π is via the partial sums of the series below:
π = ( 1/1 - 1/3 + 1/5 - 1/7 + 1/9 ... ) × 4
The functions ρ and ρ' below each compute the partial sum, and the time τ taken to calculate π is measured for each:
(defn term [k]
(let [x (/ 1. (inc (* 2 k)))]
(if (even? k)
x
(- x))))
(defn ρ [n]
(reduce
(fn [res k] (+ res (term k)))
0
(lazy-seq (range 0 n))))
(defn ρ' [n]
(loop [res 0 k 0]
(if (< k n)
(recur (+ res (term k)) (inc k))
res)))
(defn calc [P]
(let [start (System/nanoTime)
π (* (P 1000000) 4)
end (System/nanoTime)
τ (- end start)]
(printf "π=%.16f τ=%d\n" π τ)))
(calc ρ)
(calc ρ')
The results show that ρ takes about 50% more time than ρ', which suggests the underlying reduce performs much worse than recur in this case. But why?
Rewriting your code and using a more accurate timer shows there is no significant difference. This is to be expected since both loop/recur and reduce are very basic forms and we would expect them to both be fairly optimized.
(ns tst.demo.core
(:use demo.core tupelo.core tupelo.test)
(:require
[criterium.core :as crit] ))
(def result (atom nil))
(defn term [k]
(let [x (/ 1. (inc (* 2 k)))]
(if (even? k)
x
(- x))))
(defn ρ [n]
(reduce
(fn [res k] (+ res (term k)))
0
(range 0 n)) )
(defn ρ' [n]
(loop [res 0 k 0]
(if (< k n)
(recur (+ res (term k)) (inc k))
res)) )
(defn calc [calc-fn N]
(let [pi (* (calc-fn N) 4)]
(reset! result pi)
pi))
We measure the execution time for both algorithms using Criterium:
(defn timings
[power]
(let [N (Math/pow 10 power)]
(newline)
(println :-----------------------------------------------------------------------------)
(spyx N)
(newline)
(crit/quick-bench (calc ρ N))
(println :rho @result)
(newline)
(crit/quick-bench (calc ρ' N))
(println :rho-prime N @result)
(newline)))
and we try it for 10^2, 10^4, and 10^6 values of N:
(dotest
(timings 2)
(timings 4)
(timings 6))
with results for 10^2:
-------------------------------
Clojure 1.10.1 Java 14
-------------------------------
Testing tst.demo.core
:-----------------------------------------------------------------------------
N => 100.0
Evaluation count : 135648 in 6 samples of 22608 calls.
Execution time mean : 4.877255 µs
Execution time std-deviation : 647.723342 ns
Execution time lower quantile : 4.438762 µs ( 2.5%)
Execution time upper quantile : 5.962740 µs (97.5%)
Overhead used : 2.165947 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 31.6928 % Variance is moderately inflated by outliers
:rho 3.1315929035585537
Evaluation count : 148434 in 6 samples of 24739 calls.
Execution time mean : 4.070798 µs
Execution time std-deviation : 68.430348 ns
Execution time lower quantile : 4.009978 µs ( 2.5%)
Execution time upper quantile : 4.170038 µs (97.5%)
Overhead used : 2.165947 ns
:rho-prime 100.0 3.1315929035585537
with results for 10^4:
:-----------------------------------------------------------------------------
N => 10000.0
Evaluation count : 1248 in 6 samples of 208 calls.
Execution time mean : 519.096208 µs
Execution time std-deviation : 143.478354 µs
Execution time lower quantile : 454.389510 µs ( 2.5%)
Execution time upper quantile : 767.610509 µs (97.5%)
Overhead used : 2.165947 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 65.1517 % Variance is severely inflated by outliers
:rho 3.1414926535900345
Evaluation count : 1392 in 6 samples of 232 calls.
Execution time mean : 431.020370 µs
Execution time std-deviation : 14.853924 µs
Execution time lower quantile : 420.838884 µs ( 2.5%)
Execution time upper quantile : 455.282989 µs (97.5%)
Overhead used : 2.165947 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
:rho-prime 10000.0 3.1414926535900345
with results for 10^6:
:-----------------------------------------------------------------------------
N => 1000000.0
Evaluation count : 18 in 6 samples of 3 calls.
Execution time mean : 46.080480 ms
Execution time std-deviation : 1.039714 ms
Execution time lower quantile : 45.132049 ms ( 2.5%)
Execution time upper quantile : 47.430310 ms (97.5%)
Overhead used : 2.165947 ns
:rho 3.1415916535897743
Evaluation count : 18 in 6 samples of 3 calls.
Execution time mean : 52.527777 ms
Execution time std-deviation : 17.483930 ms
Execution time lower quantile : 41.789520 ms ( 2.5%)
Execution time upper quantile : 82.539445 ms (97.5%)
Overhead used : 2.165947 ns
Found 1 outliers in 6 samples (16.6667 %)
low-severe 1 (16.6667 %)
Variance from outliers : 81.7010 % Variance is severely inflated by outliers
:rho-prime 1000000.0 3.1415916535897743
Note that the times for rho and rho-prime flip-flop for the 10^4 and 10^6 cases. In any case, don't believe or worry much about timings that vary by less than 2x.
Update
I deleted the lazy-seq in the original code since clojure.core/range is already lazy. Also, I've never seen lazy-seq used without a cons and a recursive call to the generating function.
Re clojure.core/range, we have the docs:
range
Returns a lazy seq of nums from start (inclusive) to end (exclusive),
by step, where start defaults to 0, step to 1, and end to infinity.
When step is equal to 0, returns an infinite sequence of start. When
start is equal to end, returns empty list.
In the source code, range dispatches to the Java implementation classes:
([start end]
(if (and (instance? Long start) (instance? Long end))
(clojure.lang.LongRange/create start end)
(clojure.lang.Range/create start end)))
and the Java code indicates it is chunked:
public class Range extends ASeq implements IChunkedSeq, IReduce {
private static final int CHUNK_SIZE = 32;
<snip>
In addition to the other answers:
Performance can be increased significantly if you eliminate math boxing (the original versions were both about 25 ms), and the loop/recur variant then becomes 2× faster.
(set! *unchecked-math* :warn-on-boxed)
(defn term ^double [^long k]
(let [x (/ 1. (inc (* 2 k)))]
(if (even? k)
x
(- x))))
(defn ρ [n]
(reduce
(fn [^double res ^long k] (+ res (term k)))
0
(range 0 n)))
(defn ρ' [^long n]
(loop [res (double 0) k 0]
(if (< k n)
(recur (+ res (term k)) (inc k))
res)))
(criterium.core/quick-bench
(ρ 1000000))
Evaluation count : 42 in 6 samples of 7 calls.
Execution time mean : 15,639294 ms
Execution time std-deviation : 371,972168 µs
Execution time lower quantile : 15,327698 ms ( 2,5%)
Execution time upper quantile : 16,227505 ms (97,5%)
Overhead used : 1,855553 ns
Found 1 outliers in 6 samples (16,6667 %)
low-severe 1 (16,6667 %)
Variance from outliers : 13,8889 % Variance is moderately inflated by outliers
=> nil
(criterium.core/quick-bench
(ρ' 1000000))
Evaluation count : 72 in 6 samples of 12 calls.
Execution time mean : 8,570961 ms
Execution time std-deviation : 302,554974 µs
Execution time lower quantile : 8,285648 ms ( 2,5%)
Execution time upper quantile : 8,919635 ms (97,5%)
Overhead used : 1,855553 ns
=> nil
Below is an improved version of the benchmark, intended to be more representative. The performance varies from case to case, but not by much.
(defn term [k]
(let [x (/ 1. (inc (* 2 k)))]
(if (even? k)
x
(- x))))
(defn ρ1 [n]
(loop [res 0 k 0]
(if (< k n)
(recur (+ res (term k)) (inc k))
res)))
(defn ρ2 [n]
(reduce
(fn [res k] (+ res (term k)))
0
(range 0 n)))
(defn ρ3 [n]
(reduce + 0 (map term (range 0 n))))
(defn ρ4 [n]
(transduce (map term) + 0 (range 0 n)))
(defn calc [ρname ρ n]
(let [start (System/nanoTime)
π (* (ρ n) 4)
end (System/nanoTime)
τ (- end start)]
(printf "ρ=%8s n=%10d π=%.16f τ=%10d\n" ρname n π τ)))
(def args
{:N (map #(long (Math/pow 10 %)) [4 6])
:T 10
:P {:recur ρ1 :reduce ρ2 :mreduce ρ3 :xreduce ρ4}})
(doseq [n (:N args)]
(dotimes [_ (:T args)]
(println "---")
(doseq [kv (:P args)] (calc (key kv) (val kv) n))))
I'm looking to use the Clojure Specter library to simplify a deeply nested data structure. I want to remove:
any entries with nil values
any entries with empty string values
any entries with empty map values
any entries with empty sequential values
any entries with maps/sequential values that are empty after removing the above cases.
Something like this:
(do-something
{:a {:aa 1}
:b {:ba -1
:bb 2
:bc nil
:bd ""
:be []
:bf {}
:bg {:ga nil}
:bh [nil]
:bi [{}]
:bj [{:ja nil}]}
:c nil
:d ""
:e []
:f {}
:g {:ga nil}
:h [nil]
:i [{}]
:j [{:ja nil}]})
=>
{:a {:aa 1}
:b {:ba -1
:bb 2}}
I have something in vanilla Clojure:
(defn prunable?
[v]
(if (sequential? v)
(keep identity v)
(or (nil? v) (#{"" [] {}} v))))
(defn- remove-nil-values
[ticket]
(clojure.walk/postwalk
(fn [el]
(if (map? el)
(let [m (into {} (remove (comp prunable? second) el))]
(when (seq m)
m))
el))
ticket))
I think I need some sort of recursive-path but I'm not getting anywhere fast. Help much appreciated.
Comparing the performance of different versions against the Specter implementation:
@bm1729's plain vanilla version:
Evaluation count : 1060560 in 60 samples of 17676 calls.
Execution time mean : 57.083226 µs
Execution time std-deviation : 543.184398 ns
Execution time lower quantile : 56.559237 µs ( 2.5%)
Execution time upper quantile : 58.519433 µs (97.5%)
Overhead used : 7.023993 ns
Found 5 outliers in 60 samples (8.3333 %)
low-severe 3 (5.0000 %)
low-mild 2 (3.3333 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
The version below:
Evaluation count : 3621960 in 60 samples of 60366 calls.
Execution time mean : 16.606135 µs
Execution time std-deviation : 141.114975 ns
Execution time lower quantile : 16.481250 µs ( 2.5%)
Execution time upper quantile : 16.922734 µs (97.5%)
Overhead used : 7.023993 ns
Found 9 outliers in 60 samples (15.0000 %)
low-severe 6 (10.0000 %)
low-mild 3 (5.0000 %)
Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(defn prune [x]
(cond
(map? x) (not-empty
(reduce-kv
(fn [s k v]
(let [v' (prune v)]
(cond-> s
v' (assoc k v'))))
(empty x)
x))
(seqable? x) (not-empty
(into
(empty x)
(->> x (map prune) (filter identity))))
:else x))
Test case:
(prune {:a {:aa 1}
:b {:ba -1
:bb 2
:bc nil
:bd ""
:be []
:bf {}
:bg {:ga nil}
:bh [nil]
:bi [{}]
:bj [{:ja nil}]}
:c nil
:d ""
:e []
:f {}
:g {:ga nil}
:h [nil]
:i [{}]
:j [{:ja nil}]})
;; => {:b {:bb 2, :ba -1}, :a {:aa 1}}
UPDATE - @bm1729's Specter version
Evaluation count : 3314820 in 60 samples of 55247 calls.
Execution time mean : 18.421613 µs
Execution time std-deviation : 591.106243 ns
Execution time lower quantile : 18.148204 µs ( 2.5%)
Execution time upper quantile : 20.674292 µs (97.5%)
Overhead used : 7.065044 ns
Found 8 outliers in 60 samples (13.3333 %)
low-severe 2 (3.3333 %)
low-mild 6 (10.0000 %)
Variance from outliers : 18.9883 % Variance is moderately inflated by outliers
Thanks to nathanmarz on the Clojurians Slack channel:
(def COMPACTED-VALS-PATH
(recursive-path [] p
(continue-then-stay
(cond-path
map? [(compact MAP-VALS) p]
vector? [(compact ALL) p]))))
(defn- compact-data
[m]
(setval [MAP-VALS COMPACTED-VALS-PATH #(or (nil? %) (= "" %))] NONE m))
I tried the code below to compare the performance of core/map vs transducers vs core.reducers/map vs core.reducers/fold:
(time (->> (range 10000)
(r/map inc)
(r/map inc)
(r/map inc)
(into [])))
;; core.reducers/map
;; "Elapsed time: 3.962802 msecs"
(time (->> (range 10000)
vec
(r/map inc)
(r/map inc)
(r/map inc)
(r/fold conj)))
;; core.reducers/fold
;; "Elapsed time: 3.318809 msecs"
(time (->> (range 10000)
(map inc)
(map inc)
(map inc)))
;; core/map
;; "Elapsed time: 0.148433 msecs"
(time (->> (range 10000)
(sequence (comp (map inc)
(map inc)
(map inc)))))
;; transducers
;; "Elapsed time: 0.215037 msecs"
1) My expectation was that core/map would take the most time, but it takes the least. Why is it faster than the transducer version, when transducers avoid creating intermediate seqs and should therefore be faster?
2) Why is the core.reducers/fold version not significantly faster than the core.reducers/map version? Shouldn't it have parallelized the operation?
3) Why are the core.reducers versions so slow compared to their lazy counterparts? The whole sequence is realized at the end, so shouldn't eager evaluation be more performant than lazy evaluation?
map is lazy, so your test case with core/map does no work at all. Try doalling the collection (or into []), and I expect it will be the slowest after all. You can convince yourself of this by changing 10000 to 1e12, and observe that if your computer can process a trillion elements just as quickly as it can process ten thousand, it must not be doing much work for each element!
What is there to parallelize? The most expensive part of this operation is not the calls to inc (which are parallelized), but combining the results into a vector at the end (which can't be). Try it with a much more expensive function, like #(do (Thread/sleep 500) (inc %)) and you may see different results.
Isn't this the same question as (1)?
;; core/map without transducers
(quick-bench (doall (->> [1 2 3 4]
(map inc)
(map inc)
(map inc))))
;; Evaluation count : 168090 in 6 samples of 28015 calls.
;; Execution time mean : 3.651319 µs
;; Execution time std-deviation : 88.055389 ns
;; Execution time lower quantile : 3.584198 µs ( 2.5%)
;; Execution time upper quantile : 3.799202 µs (97.5%)
;; Overhead used : 7.546189 ns
;; Found 1 outliers in 6 samples (16.6667 %)
;; low-severe 1 (16.6667 %)
;; Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
;; transducers with a non lazy seq
(quick-bench (doall (->> [1 2 3 4]
(sequence (comp (map inc)
(map inc)
(map inc))))))
;; Evaluation count : 214902 in 6 samples of 35817 calls.
;; Execution time mean : 2.776696 µs
;; Execution time std-deviation : 24.377634 ns
;; Execution time lower quantile : 2.750123 µs ( 2.5%)
;; Execution time upper quantile : 2.809933 µs (97.5%)
;; Overhead used : 7.546189 ns
;;;;
;; transducers with a lazy seq
;;;;
(quick-bench (doall (->> (range 1 5)
(sequence (comp (map inc)
(map inc)
(map inc))))))
;; Evaluation count : 214230 in 6 samples of 35705 calls.
;; Execution time mean : 3.361220 µs
;; Execution time std-deviation : 622.084860 ns
;; Execution time lower quantile : 2.874093 µs ( 2.5%)
;; Execution time upper quantile : 4.328653 µs (97.5%)
;; Overhead used : 7.546189 ns
;;;;
;; core.reducers
;;;;
(quick-bench (->> [1 2 3 4]
(r/map inc)
(r/map inc)
(r/map inc)))
;; Evaluation count : 6258966 in 6 samples of 1043161 calls.
;; Execution time mean : 89.610689 ns
;; Execution time std-deviation : 0.936108 ns
;; Execution time lower quantile : 88.786938 ns ( 2.5%)
;; Execution time upper quantile : 91.128549 ns (97.5%)
;; Overhead used : 7.546189 ns
;; Found 1 outliers in 6 samples (16.6667 %)
;; low-severe 1 (16.6667 %)
;; Variance from outliers : 13.8889 % Variance is moderately inflated by outliers
;;;; Evaluating a larger range so that the chunking comes into play ;;;;
;; core/map without transducers
(quick-bench (doall (->> (range 500)
(map inc)
(map inc)
(map inc))))
;; transducers with a non lazy seq
(quick-bench (doall (->> (doall (range 500))
(sequence (comp (map inc)
(map inc)
(map inc))))))
;; Evaluation count : 2598 in 6 samples of 433 calls.
;; Execution time mean : 237.164523 µs
;; Execution time std-deviation : 5.336417 µs
;; Execution time lower quantile : 231.751575 µs ( 2.5%)
;; Execution time upper quantile : 244.836021 µs (97.5%)
;; Overhead used : 7.546189 ns
;; transducers with a lazy seq
(quick-bench (doall (->> (range 500)
(sequence (comp (map inc)
(map inc)
(map inc))))))
;; Evaluation count : 3210 in 6 samples of 535 calls.
;; Execution time mean : 183.866148 µs
;; Execution time std-deviation : 1.799841 µs
;; Execution time lower quantile : 182.137656 µs ( 2.5%)
;; Execution time upper quantile : 186.347677 µs (97.5%)
;; Overhead used : 7.546189 ns
;; core.reducers
(quick-bench (->> (range 500)
(r/map inc)
(r/map inc)
(r/map inc)))
;; Evaluation count : 4695642 in 6 samples of 782607 calls.
;; Execution time mean : 126.973627 ns
;; Execution time std-deviation : 5.972927 ns
;; Execution time lower quantile : 122.471060 ns ( 2.5%)
;; Execution time upper quantile : 134.181056 ns (97.5%)
;; Overhead used : 7.546189 ns
Based on the above answers/comments I tried the benchmarking again:
1) The reducers version is faster by about three orders of magnitude (10^3).
2) This holds for both small collections (4 elements) and larger ones (500 elements), where chunking can happen for lazy seqs.
3) Thus even with chunking, lazy evaluation is much slower than eager evaluation.
Correction, based on the remark that reducers only do their work when a reduce operation runs, which was not happening in the code above:
(quick-bench (->> [1 2 3 4]
(r/map inc)
(r/map inc)
(r/map inc)
(into [])))
;; Evaluation count : 331302 in 6 samples of 55217 calls.
;; Execution time mean : 2.035153 µs
;; Execution time std-deviation : 314.070348 ns
;; Execution time lower quantile : 1.720615 µs ( 2.5%)
;; Execution time upper quantile : 2.381706 µs (97.5%)
;; Overhead used : 7.546189 ns
(quick-bench (->> (range 500)
(r/map inc)
(r/map inc)
(r/map inc)
(into [])))
;; Evaluation count : 3870 in 6 samples of 645 calls.
;; Execution time mean : 150.349870 µs
;; Execution time std-deviation : 2.825632 µs
;; Execution time lower quantile : 146.468231 µs ( 2.5%)
;; Execution time upper quantile : 153.271325 µs (97.5%)
;; Overhead used : 7.546189 ns
So the reducer versions are 30-70% faster than their transducer counterparts, and the performance differential grows as the data set size increases.
If I have a vector (def v [1 2 3]), I can replace the first element with (assoc v 0 666), obtaining [666 2 3]
But if I try to do the same after mapping over the vector:
(def v (map inc [1 2 3]))
(assoc v 0 666)
the following exception is thrown:
ClassCastException clojure.lang.LazySeq cannot be cast to clojure.lang.Associative
What's the most idiomatic way of editing or updating a single element of a lazy sequence?
Should I use map-indexed and alter only index 0, or realize the lazy sequence into a vector and then edit it via assoc/update?
The first has the advantage of maintaining the laziness, while the second is less efficient but maybe more obvious.
I guess for the first element I can also use drop and cons.
Are there any other ways? I was not able to find any examples anywhere.
What's the most idiomatic way of editing or updating a single element of a lazy sequence?
There's no built-in function for modifying a single element of a sequence/list, but map-indexed is probably the closest thing. It's not an efficient operation for lists. Assuming you don't need laziness, I'd pour the sequence into a vector, which is what mapv does, i.e. (into [] (map f coll)). Depending on how you use your modified sequence, it may be just as performant to vectorize it and modify it.
You could write a function using map-indexed to do something similar and lazy:
(defn assoc-seq [s i v]
(map-indexed (fn [j x] (if (= i j) v x)) s))
Or if you want to do this work in one pass lazily without vector-izing, you can also use a transducer:
(sequence
(comp
(map inc)
(map-indexed (fn [j x] (if (= 0 j) 666 x))))
[1 2 3])
Realizing your use case is to only modify the first item in a lazy sequence, then you can do something simpler while preserving laziness:
(concat [666] (rest s))
Update re: comment on optimization: leetwinski's assoc-at function is ~8ms faster when updating the 500,000th element in a 1,000,000 element lazy sequence, so you should use his answer if you're looking to squeeze every bit of performance out of an inherently inefficient operation:
(def big-lazy (range 1e6))
(crit/bench
(last (assoc-at big-lazy 500000 666)))
Evaluation count : 1080 in 60 samples of 18 calls.
Execution time mean : 51.567317 ms
Execution time std-deviation : 4.947684 ms
Execution time lower quantile : 47.038877 ms ( 2.5%)
Execution time upper quantile : 65.604790 ms (97.5%)
Overhead used : 1.662189 ns
Found 6 outliers in 60 samples (10.0000 %)
low-severe 4 (6.6667 %)
low-mild 2 (3.3333 %)
Variance from outliers : 68.6139 % Variance is severely inflated by outliers
=> nil
(crit/bench
(last (assoc-seq big-lazy 500000 666)))
Evaluation count : 1140 in 60 samples of 19 calls.
Execution time mean : 59.553335 ms
Execution time std-deviation : 4.507430 ms
Execution time lower quantile : 54.450115 ms ( 2.5%)
Execution time upper quantile : 69.288104 ms (97.5%)
Overhead used : 1.662189 ns
Found 4 outliers in 60 samples (6.6667 %)
low-severe 4 (6.6667 %)
Variance from outliers : 56.7865 % Variance is severely inflated by outliers
=> nil
The assoc-at version is 2-3x faster when updating the first item in a large lazy sequence, but it's no faster than (last (concat [666] (rest big-lazy))).
I would probably go with something generic like this, if this functionality is really needed (which I strongly doubt):
(defn assoc-at [data i item]
(if (associative? data)
(assoc data i item)
(if-not (neg? i)
(letfn [(assoc-lazy [i data]
(cond (zero? i) (cons item (rest data))
(empty? data) data
:else (lazy-seq (cons (first data)
(assoc-lazy (dec i) (rest data))))))]
(assoc-lazy i data))
data)))
user> (assoc-at {:a 10} :b 20)
;; {:a 10, :b 20}
user> (assoc-at [1 2 3 4] 3 101)
;; [1 2 3 101]
user> (assoc-at (map inc [1 2 3 4]) 2 123)
;; (2 3 123 5)
Another way is to use split-at:
(defn assoc-at [data i item]
(if (neg? i)
data
(let [[l r] (split-at i data)]
(if (seq r)
(concat l [item] (rest r))
data))))
Notice that both of these functions short-circuit the collection traversal, which the mapping approach doesn't. Here's a quick and dirty benchmark:
(defn massoc-at [data i item]
(if (neg? i)
data
(map-indexed (fn [j x] (if (== i j) item x)) data)))
(time (last (assoc-at (range 10000000) 0 1000)))
;;=> "Elapsed time: 747.921032 msecs"
9999999
(time (last (massoc-at (range 10000000) 0 1000)))
;;=> "Elapsed time: 1525.446511 msecs"
9999999
I'm trying to get all "moving" partitions of size k of a string. Basically, I want to slide a window of size k along the string and collect each k-character word.
Here's an example,
k: 3
Input: ABDEFGH
Output: ABD, EFG, BDE, FGH, DEF
My idea was to walk along the input, dropping the head and partitioning, then dropping the head again from the (now headless) sequence, but I'm not sure exactly how to do this... Also, maybe there's a better way of doing this? Below is the idea I had in mind.
(#(partition k input) (collection of s where head was consecutively dropped))
Strings in Clojure can be treated as seqs of characters, so you can partition them directly. To get a sequence of overlapping partitions, use the version that accepts a size and a step:
user> (partition 3 1 "abcdef")
((\a \b \c) (\b \c \d) (\c \d \e) (\d \e \f))
To put a character sequence back into a string, just apply str to it:
user> (apply str '(\a \b \c))
"abc"
To put it all together:
user> (map (partial apply str) (partition 3 1 "abcdef"))
("abc" "bcd" "cde" "def")
Here are implementations of partition and partition-all for strings, which return a lazy seq of strings and do the splitting with subs. If you need high performance for string transformations, these will be significantly faster (on average about 8 times as fast; see the Criterium benchmarks below) than creating char seqs.
(defn partition-string
"Like partition, but splits string using subs and returns a
lazy-seq of strings."
([n s]
(partition-string n n s))
([n p s]
(lazy-seq
(if-not (< (count s) n)
(cons
(subs s 0 n)
(->> (subs s p)
(partition-string n p)))))))
(defn partition-string-all
"Like partition-all, but splits string using subs and returns a
lazy-seq of strings."
([n s]
(partition-string-all n n s))
([n p s]
(let [less (if (< (count s) n)
(count s))]
(lazy-seq
(cons
(subs s 0 (or less n))
(if-not less
(->> (subs s p)
(partition-string-all n p))))))))
;; Alex answer:
;; (let [test-str "abcdefghijklmnopqrstuwxyz"]
;; (criterium.core/bench
;; (doall
;; (map (partial apply str) (partition 3 1 test-str)))))
;; WARNING: Final GC required 1.010207840526515 % of runtime
;; Evaluation count : 773220 in 60 samples of 12887 calls.
;; Execution time mean : 79.900801 µs
;; Execution time std-deviation : 2.008823 µs
;; Execution time lower quantile : 77.725304 µs ( 2.5%)
;; Execution time upper quantile : 83.888349 µs (97.5%)
;; Overhead used : 17.786101 ns
;; Found 3 outliers in 60 samples (5.0000 %)
;; low-severe 3 (5.0000 %)
;; Variance from outliers : 12.5585 % Variance is moderately inflated by outliers
;; KobbyPemson answer:
;; (let [test-str "abcdefghijklmnopqrstuwxyz"]
;; (criterium.core/bench
;; (doall
;; (moving-partition test-str 3))))
;; WARNING: Final GC required 1.674347646128195 % of runtime
;; Evaluation count : 386820 in 60 samples of 6447 calls.
;; Execution time mean : 161.928479 µs
;; Execution time std-deviation : 8.362590 µs
;; Execution time lower quantile : 154.707888 µs ( 2.5%)
;; Execution time upper quantile : 184.095816 µs (97.5%)
;; Overhead used : 17.786101 ns
;; Found 3 outliers in 60 samples (5.0000 %)
;; low-severe 2 (3.3333 %)
;; low-mild 1 (1.6667 %)
;; Variance from outliers : 36.8985 % Variance is moderately inflated by outliers
;; This answer
;; (let [test-str "abcdefghijklmnopqrstuwxyz"]
;; (criterium.core/bench
;; (doall
;; (partition-string 3 1 test-str))))
;; WARNING: Final GC required 1.317098148979236 % of runtime
;; Evaluation count : 5706000 in 60 samples of 95100 calls.
;; Execution time mean : 10.480174 µs
;; Execution time std-deviation : 240.957206 ns
;; Execution time lower quantile : 10.234580 µs ( 2.5%)
;; Execution time upper quantile : 11.075740 µs (97.5%)
;; Overhead used : 17.786101 ns
;; Found 3 outliers in 60 samples (5.0000 %)
;; low-severe 3 (5.0000 %)
;; Variance from outliers : 10.9961 % Variance is moderately inflated by outliers
(defn moving-partition
[input k]
(map #(.substring input % (+ k %))
(range (- (count input) (dec k)))))