Complex data manipulation in Clojure
I'm working on a personal market analysis project. I've got a data structure representing all the recent turning points in the market, that looks like this:
[{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}
{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
{:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}
{:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}
{:high 1.12215, :time "2016-08-02T23:00:00.000000Z"}
{:high 1.12273, :time "2016-08-02T21:15:00.000000Z"}
{:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
{:low 1.119215, :time "2016-08-02T12:30:00.000000Z"}
{:low 1.118755, :time "2016-08-02T12:00:00.000000Z"}
{:low 1.117575, :time "2016-08-02T06:00:00.000000Z"}
{:low 1.117135, :time "2016-08-02T04:30:00.000000Z"}
{:low 1.11624, :time "2016-08-02T02:00:00.000000Z"}
{:low 1.115895, :time "2016-08-01T21:30:00.000000Z"}
{:low 1.11552, :time "2016-08-01T11:45:00.000000Z"}
{:low 1.11049, :time "2016-07-29T12:15:00.000000Z"}
{:low 1.108825, :time "2016-07-29T08:30:00.000000Z"}
{:low 1.10839, :time "2016-07-29T08:00:00.000000Z"}
{:low 1.10744, :time "2016-07-29T05:45:00.000000Z"}
{:low 1.10716, :time "2016-07-28T19:30:00.000000Z"}
{:low 1.10705, :time "2016-07-28T18:45:00.000000Z"}
{:low 1.106875, :time "2016-07-28T18:00:00.000000Z"}
{:low 1.10641, :time "2016-07-28T05:45:00.000000Z"}
{:low 1.10591, :time "2016-07-28T01:45:00.000000Z"}
{:low 1.10579, :time "2016-07-27T23:15:00.000000Z"}
{:low 1.105275, :time "2016-07-27T22:00:00.000000Z"}
{:low 1.096135, :time "2016-07-27T18:00:00.000000Z"}]
Conceptually, I want to match up :high/:low pairs, work out the price range (high-low) and midpoint (average of high & low), but I don't want every possible pair to be generated.
What I want to do is start from the 1st item in the collection {:high 1.121455, :time "2016-08-03T05:15:00.000000Z"} and walk "down" through the remainder of the collection, creating a pair with every :low item UNTIL I hit the next :high item. Once I hit that next :high item, I'm not interested in any further pairs. In this case, there's only a single pair created, which is the :high and the 1st :low - I stop there because the next (3rd) item is a :high. The 1 generated record should look like {:price-range 0.000365, :midpoint 1.121272, :extremes [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}]}
Next, I'd move onto the 2nd item in the collection {:low 1.12109, :time "2016-08-03T05:15:00.000000Z"} and walk "down" through the remainder of the collection, creating a pair with every :high item UNTIL I hit the next :low item. In this case, I get 5 new records generated, being the :low and the next 5 :high items which are all consecutive; the first of these 5 records would look like
{:price-range 0.00064, :midpoint 1.12141, :extremes [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}{:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}]}
the second of these 5 records would look like
{:price-range 0.000835, :midpoint 1.1215075, :extremes [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}{:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}]}
and so on.
After that, I get a :low so I stop there.
Then I'd move onto the 3rd item {:high 1.12173, :time "2016-08-03T04:30:00.000000Z"} and walk "down" creating pairs with every :low UNTIL I hit the next :high. In this case, I get 0 pairs generated, because the :high is followed immediately by another :high. Same for the next 3 :high items, which are all followed immediately by another :high
Next I get to the 7th item {:high 1.12338, :time "2016-08-02T18:15:00.000000Z"} and that should generate a pair with each of the following 19 :low items.
My generated result would be a list of all the pairs created:
[{:price-range 0.000365, :midpoint 1.121272, :extremes [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}]}
{:price-range 0.00064, :midpoint 1.12141, :extremes [{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}{:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}]}
...
If I was implementing this using something like Python, I'd probably use a couple of nested loops, use a break to exit the inner loop when I stopped seeing :highs to pair with my :low and vice-versa, and accumulate all the generated records into an array as I traversed the 2 loops. I just can't work out a good way to attack it using Clojure...
Any ideas?
First of all, you can rephrase the problem the following way:
you have to find all the boundary points, where a :high is followed by a :low, or vice versa;
for each boundary, you take the item just before it and combine it with every item after it, up to the next boundary.
For simplicity, let's use the following data model:
(def data0 [{:a 1} {:b 2} {:b 3} {:b 4} {:a 5} {:a 6} {:a 7}])
The first part can be achieved with the partition-by function, which splits the input collection every time the function changes its return value for the processed item:
user> (def step1 (partition-by (comp boolean :a) data0))
#'user/step1
user> step1
(({:a 1}) ({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7}))
Now you need to take every two adjacent groups and process them together. The pairs of groups should look like this:
[({:a 1}) ({:b 2} {:b 3} {:b 4})]
[({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7})]
This is achieved with the partition function, using a group size of 2 and a step of 1:
user> (def step2 (partition 2 1 step1))
#'user/step2
user> step2
((({:a 1}) ({:b 2} {:b 3} {:b 4}))
(({:b 2} {:b 3} {:b 4}) ({:a 5} {:a 6} {:a 7})))
you have to do something for every pair of groups. You could do it with map:
user> (def step3 (map (fn [[lbounds rbounds]]
                        (map #(vector (last lbounds) %)
                             rbounds))
                      step2))
#'user/step3
user> step3
(([{:a 1} {:b 2}] [{:a 1} {:b 3}] [{:a 1} {:b 4}])
([{:b 4} {:a 5}] [{:b 4} {:a 6}] [{:b 4} {:a 7}]))
But since you need a concatenated list rather than a grouped one, you would want to use mapcat instead of map:
user> (def step3 (mapcat (fn [[lbounds rbounds]]
                           (map #(vector (last lbounds) %)
                                rbounds))
                         step2))
#'user/step3
user> step3
([{:a 1} {:b 2}]
[{:a 1} {:b 3}]
[{:a 1} {:b 4}]
[{:b 4} {:a 5}]
[{:b 4} {:a 6}]
[{:b 4} {:a 7}])
That's almost the result we want (we just generate vectors instead of maps).
Now you could prettify it with the threading macro:
(->> data0
     (partition-by (comp boolean :a))
     (partition 2 1)
     (mapcat (fn [[lbounds rbounds]]
               (map #(vector (last lbounds) %)
                    rbounds))))
which gives you exactly the same result.
Applied to your data, it would look almost the same (with a different result-generating fn):
user> (defn hi-or-lo [item]
        ;; a map called as a function looks up the key, with an optional default:
        ;; returns the :high value if present, otherwise the :low value
        (item :high (item :low)))
#'user/hi-or-lo
user> (->> data ; `data` is the vector of turning points from the question
           (partition-by (comp boolean :high))
           (partition 2 1)
           (mapcat (fn [[lbounds rbounds]]
                     (let [left-bound (last lbounds)
                           left-val (hi-or-lo left-bound)]
                       (map #(let [right-val (hi-or-lo %)
                                   diff (Math/abs (- right-val left-val))]
                               {:extremes [left-bound %]
                                :price-range diff
                                :midpoint (+ (min right-val left-val)
                                             (/ diff 2))})
                            rbounds))))
           (clojure.pprint/pprint))
It prints the following:
({:extremes
[{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}
{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}],
:price-range 3.6500000000017074E-4,
:midpoint 1.1212725}
{:extremes
[{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
{:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}],
:price-range 6.399999999999739E-4,
:midpoint 1.12141}
{:extremes
[{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
{:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}],
:price-range 8.350000000001412E-4,
:midpoint 1.1215074999999999}
{:extremes
[{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
{:high 1.12215, :time "2016-08-02T23:00:00.000000Z"}],
:price-range 0.001060000000000061,
:midpoint 1.12162}
{:extremes
[{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
{:high 1.12273, :time "2016-08-02T21:15:00.000000Z"}],
:price-range 0.0016400000000000858,
:midpoint 1.12191}
{:extremes
[{:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
{:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}],
:price-range 0.0022900000000001253,
:midpoint 1.1222349999999999}
{:extremes
[{:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
{:low 1.119215, :time "2016-08-02T12:30:00.000000Z"}],
:price-range 0.004164999999999974,
:midpoint 1.1212975}
{:extremes
[{:high 1.12338, :time "2016-08-02T18:15:00.000000Z"}
{:low 1.118755, :time "2016-08-02T12:00:00.000000Z"}],
:price-range 0.004625000000000101,
:midpoint 1.1210675}
...
As an answer to the question about "complex data manipulation": I would advise you to look through all the collection-manipulation functions in clojure.core, and then try to decompose any task into applications of those. There are not many cases where you need something beyond them.
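A self-contained way to try the pipeline above is to wrap it in a function and run it on the first four turning points from the question. This is only a sketch: the names pair-turning-points and sample are made up for illustration and do not appear in the original answer.

```clojure
;; Sketch only: `pair-turning-points` and `sample` are hypothetical names.
(defn hi-or-lo
  "Returns the :high value of a turning point if present, else its :low value."
  [item]
  (item :high (item :low)))

(defn pair-turning-points
  "Pairs the last item of each run of highs (or lows) with every item
  of the run that immediately follows it."
  [points]
  (->> points
       (partition-by (comp boolean :high))   ; runs of highs / runs of lows
       (partition 2 1)                       ; adjacent runs, overlapping by one
       (mapcat (fn [[lbounds rbounds]]
                 (let [left-bound (last lbounds)
                       left-val   (hi-or-lo left-bound)]
                   (map (fn [right-bound]
                          (let [right-val (hi-or-lo right-bound)
                                diff      (Math/abs (- right-val left-val))]
                            {:extremes    [left-bound right-bound]
                             :price-range diff
                             :midpoint    (+ (min right-val left-val)
                                             (/ diff 2))}))
                        rbounds))))))

;; First four turning points from the question: high, low, high, high.
(def sample
  [{:high 1.121455, :time "2016-08-03T05:15:00.000000Z"}
   {:low 1.12109, :time "2016-08-03T05:15:00.000000Z"}
   {:high 1.12173, :time "2016-08-03T04:30:00.000000Z"}
   {:high 1.121925, :time "2016-08-03T00:00:00.000000Z"}])

(def pairs (pair-turning-points sample))

;; One high->low pair, then the low paired with each of the two highs.
(count pairs) ; 3
```

Because of floating-point arithmetic, :price-range and :midpoint are best compared with a small tolerance rather than with =.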
Related
What is the Clojure way to transform following data?
So I've just played with Clojure today. Using this data,
(def test-data
  [{:id 35462, :status "COMPLETED", :p 2640000, :i 261600}
   {:id 35462, :status "CREATED", :p 240000, :i 3200}
   {:id 57217, :status "COMPLETED", :p 470001, :i 48043}
   {:id 57217, :status "CREATED", :p 1409999, :i 120105}])
Then transform the above data with,
(as-> test-data input
  (group-by :id input)
  (map (fn [x]
         {:id (key x)
          :p {:a (as-> (filter #(= (:status %) "COMPLETED") (val x)) tmp
                   (into {} tmp)
                   (get tmp :p))
              :b (as-> (filter #(= (:status %) "CREATED") (val x)) tmp
                   (into {} tmp)
                   (get tmp :p))}
          :i {:a (as-> (filter #(= (:status %) "COMPLETED") (val x)) tmp
                   (into {} tmp)
                   (get tmp :i))
              :b (as-> (filter #(= (:status %) "CREATED") (val x)) tmp
                   (into {} tmp)
                   (get tmp :i))}})
       input)
  (into [] input))
To produce,
[{:id 35462, :p {:a 2640000, :b 240000}, :i {:a 261600, :b 3200}}
 {:id 57217, :p {:a 470001, :b 1409999}, :i {:a 48043, :b 120105}}]
But I have a feeling that my code is not the "Clojure way". So my question is, what is the "Clojure way" to achieve what I've produced?
The only things that stand out to me are using as-> when ->> would work just as well, some work being done redundantly, and some destructuring opportunities:
(defn aggregate [[id values]]
  (let [completed (->> (filter #(= (:status %) "COMPLETED") values)
                       (into {}))
        created (->> (filter #(= (:status %) "CREATED") values)
                     (into {}))]
    {:id id
     :p {:a (:p completed)
         :b (:p created)}
     :i {:a (:i completed)
         :b (:i created)}}))
(->> test-data
     (group-by :id)
     (map aggregate))
=> ({:id 35462, :p {:a 2640000, :b 240000}, :i {:a 261600, :b 3200}}
    {:id 57217, :p {:a 470001, :b 1409999}, :i {:a 48043, :b 120105}})
However, pouring those filtered values (which are maps themselves) into a map seems suspect to me. This creates a last-one-wins scenario where the order of your test data affects the output. Try this to see how different orders of test-data affect the output:
(into {} (filter #(= (:status %) "COMPLETED") (shuffle test-data)))
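One way to sidestep the key-by-key merging described above is to index each group by :status and look up whole entries instead. A rough sketch, assuming each :id has at most one entry per status; the name aggregate-by-status is made up here and is not part of the answer:

```clojure
(def test-data
  [{:id 35462, :status "COMPLETED", :p 2640000, :i 261600}
   {:id 35462, :status "CREATED", :p 240000, :i 3200}
   {:id 57217, :status "COMPLETED", :p 470001, :i 48043}
   {:id 57217, :status "CREATED", :p 1409999, :i 120105}])

;; Hypothetical variant of `aggregate`: each entry stays intact and is
;; looked up by its status, so keys of different entries are never merged.
(defn aggregate-by-status [[id entries]]
  (let [by-status (into {} (map (juxt :status identity)) entries)
        completed (by-status "COMPLETED")
        created   (by-status "CREATED")]
    {:id id
     :p  {:a (:p completed) :b (:p created)}
     :i  {:a (:i completed) :b (:i created)}}))

(def result
  (->> test-data
       (group-by :id)
       (mapv aggregate-by-status)))
```

If an :id could carry several entries with the same status, the last one would still win here, so that case would need its own rule (for example summing, or keeping all values in a vector).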
It's a pretty odd transformation, keys seem a little arbitrary and it's hard to generalise from n=2 (or indeed to know whether n ever > 2). I'd use functional decomposition to factor out some of the commonality and get some traction. First of all let us transform the statuses into our keys...
(def status->ab {"COMPLETED" :a "CREATED" :b})
Then, with that in hand, I'd like an easy way of getting the "meat" out of the substructure. Here, for a given key into the data, I'm providing the content of the enclosing map for that key and a given group result.
(defn subgroup->subresult [k subgroup]
  (apply array-map
         (mapcat #(vector (status->ab (:status %)) (k %)) subgroup)))
With this, the main transformer becomes much more tractable:
(defn group->result [group]
  {:id (key group)
   :p (subgroup->subresult :p (val group))
   :i (subgroup->subresult :i (val group))})
I wouldn't consider generalising across :p and :i for this - if you had more than two keys, then maybe I would generate a map of k -> the subgroup result and do some sort of reducing merge. Anyway, we have an answer:
(map group->result (group-by :id test-data))
;; => ({:id 35462, :p {:b 240000, :a 2640000}, :i {:b 3200, :a 261600}}
;;     {:id 57217, :p {:b 1409999, :a 470001}, :i {:b 120105, :a 48043}})
There is no one "Clojure way" (I guess you mean functional way), as it depends on how you decompose a problem. Here is the way I would do it:
(->> test-data
     (map (juxt :id :status identity))
     (map ->nested)
     (apply deep-merge)
     (map (fn [[id m]]
            {:id id
             :p (->ab-map m :p)
             :i (->ab-map m :i)})))
;; ({:id 35462, :p {:a 2640000, :b 240000}, :i {:a 261600, :b 3200}}
;;  {:id 57217, :p {:a 470001, :b 1409999}, :i {:a 48043, :b 120105}})
As you can see, I used a few functions, and here is the step-by-step explanation:
Extract index keys (id + status) and the map itself into a vector
(map (juxt :id :status identity) test-data)
;; ([35462 "COMPLETED" {:id 35462, :status "COMPLETED", :p 2640000, :i 261600}]
;;  [35462 "CREATED" {:id 35462, :status "CREATED", :p 240000, :i 3200}]
;;  [57217 "COMPLETED" {:id 57217, :status "COMPLETED", :p 470001, :i 48043}]
;;  [57217 "CREATED" {:id 57217, :status "CREATED", :p 1409999, :i 120105}])
Transform into a nested map (id, then status)
(map ->nested *1)
;; ({35462 {"COMPLETED" {:id 35462, :status "COMPLETED", :p 2640000, :i 261600}}}
;;  {35462 {"CREATED" {:id 35462, :status "CREATED", :p 240000, :i 3200}}}
;;  {57217 {"COMPLETED" {:id 57217, :status "COMPLETED", :p 470001, :i 48043}}}
;;  {57217 {"CREATED" {:id 57217, :status "CREATED", :p 1409999, :i 120105}}})
Merge the nested maps by id
(apply deep-merge *1)
;; {35462
;;  {"COMPLETED" {:id 35462, :status "COMPLETED", :p 2640000, :i 261600},
;;   "CREATED" {:id 35462, :status "CREATED", :p 240000, :i 3200}},
;;  57217
;;  {"COMPLETED" {:id 57217, :status "COMPLETED", :p 470001, :i 48043},
;;   "CREATED" {:id 57217, :status "CREATED", :p 1409999, :i 120105}}}
For attributes :p and :i, map to :a and :b according to status
(->ab-map {"COMPLETED" {:id 35462, :status "COMPLETED", :p 2640000, :i 261600},
           "CREATED" {:id 35462, :status "CREATED", :p 240000, :i 3200}}
          :p)
;; => {:a 2640000, :b 240000}
And below are the few helper functions I used:
(defn ->ab-map [m k]
  (zipmap [:a :b]
          (map #(get-in m [% k]) ["COMPLETED" "CREATED"])))
(defn ->nested [[k & [v & r :as t]]]
  {k (if (seq r) (->nested t) v)})
(defn deep-merge [& xs]
  (if (every? map? xs)
    (apply merge-with deep-merge xs)
    (apply merge xs)))
I would approach it more like the following, so it can handle any number of entries for each :id value. Of course, many variations are possible. (ns tst.demo.core (:use demo.core tupelo.core tupelo.test) (:require [tupelo.core :as t] )) (dotest (let [test-data [{:id 35462, :status "COMPLETED", :p 2640000, :i 261600} {:id 35462, :status "CREATED", :p 240000, :i 3200} {:id 57217, :status "COMPLETED", :p 470001, :i 48043} {:id 57217, :status "CREATED", :p 1409999, :i 120105}] d1 (group-by :id test-data) d2 (t/forv [[id entries] d1] {:id id :status-all (mapv :status entries) :p-all (mapv :p entries) :i-all (mapv :i entries)})] (is= d1 {35462 [{:id 35462, :status "COMPLETED", :p 2640000, :i 261600} {:id 35462, :status "CREATED", :p 240000, :i 3200}], 57217 [{:id 57217, :status "COMPLETED", :p 470001, :i 48043} {:id 57217, :status "CREATED", :p 1409999, :i 120105}]}) (is= d2 [{:id 35462, :status-all ["COMPLETED" "CREATED"], :p-all [2640000 240000], :i-all [261600 3200]} {:id 57217, :status-all ["COMPLETED" "CREATED"], :p-all [470001 1409999], :i-all [48043 120105]}]) ))
Clojure parse nested vectors
I am looking to transform a Clojure tree structure into a map of its dependencies.
For example, an input like:
[{:value "A"}
 [{:value "B"} [{:value "C"} {:value "D"}]
  [{:value "E"} [{:value "F"}]]]]
equivalent to:
:A
  :B
    :C
    :D
  :E
    :F
output:
{:A [:B :E] :B [:C :D] :C [] :D [] :E [:F] :F []}
I have taken a look at tree-seq and zippers but can't figure it out!
Here's a way to build up the desired map while using a zipper to traverse the tree. First let's simplify the input tree to match your output format (maps of :value strings → keywords):
(require '[clojure.zip :as z] 'clojure.walk)
(def tree
  [{:value "A"}
   [{:value "B"} [{:value "C"} {:value "D"}]
    {:value "E"} [{:value "F"}]]])
(def simpler-tree
  (clojure.walk/postwalk
    #(if (map? %) (keyword (:value %)) %)
    tree))
;; [:A [:B [:C :D] :E [:F]]]
Then you can traverse the tree with loop/recur and clojure.zip/next, using two loop bindings: the current position in the tree, and the map being built.
(loop [loc (z/vector-zip simpler-tree)
       deps {}]
  (if (z/end? loc)
    deps ;; return map when end is reached
    (recur
      (z/next loc) ;; advance through tree
      (if (z/branch? loc)
        ;; for (non-root) branches, add top-level key with direct descendants
        (if-let [parent (some-> (z/prev loc) z/node)]
          (assoc deps parent (filterv keyword? (z/children loc)))
          deps)
        ;; otherwise add top-level key with no direct descendants
        (assoc deps (z/node loc) [])))))
=> {:A [:B :E], :B [:C :D], :C [], :D [], :E [:F], :F []}
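For comparison, the same map can also be built with plain recursion over the simplified tree, without a zipper. This is only a sketch under the same convention as simpler-tree above (a vector that directly follows a keyword holds that keyword's children); the name tree->deps is hypothetical.

```clojure
;; Sketch: `tree->deps` is a made-up name, and the input is assumed to be in
;; the simplified keyword form, e.g. [:A [:B [:C :D] :E [:F]]].
(defn tree->deps
  "Walks a vector of siblings. Each keyword maps to the keywords found in the
  vector that directly follows it (or [] if none); nested vectors are walked
  recursively and their results merged in."
  [siblings]
  (reduce (fn [acc [node following]]
            (cond
              (keyword? node)
              (assoc acc node (if (vector? following)
                                (filterv keyword? following)
                                []))

              (vector? node)
              (merge acc (tree->deps node))

              :else acc))
          {}
          ;; pair every element with its right neighbour, padding the last with nil
          (partition 2 1 [nil] siblings)))

(def deps (tree->deps [:A [:B [:C :D] :E [:F]]]))
;; => {:A [:B :E], :B [:C :D], :C [], :D [], :E [:F], :F []}
```

The recursion depth grows with the nesting depth of the tree, so for very deep trees the zipper-based loop/recur version is the safer choice.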
This is easy to do using the tupelo.forest library. I reformatted your source data to make it fit into the Hiccup syntax: (dotest (let [relationhip-data-hiccup [:A [:B [:C] [:D]] [:E [:F]]] expected-result {:A [:B :E] :B [:C :D] :C [] :D [] :E [:F] :F []} ] (with-debug-hid (with-forest (new-forest) (let [root-hid (tf/add-tree-hiccup relationhip-data-hiccup) result (apply glue (sorted-map) (forv [hid (all-hids)] (let [parent-tag (grab :tag (hid->node hid)) kid-tags (forv [kid-hid (hid->kids hid)] (let [kid-tag (grab :tag (hid->node kid-hid))] kid-tag))] {parent-tag kid-tags})))] (is= (format-paths (find-paths root-hid [:A])) [[{:tag :A} [{:tag :B} [{:tag :C}] [{:tag :D}]] [{:tag :E} [{:tag :F}]]]]) (is= result expected-result )))))) API docs are here. The project README (in progress) is here. A video from the 2017 Clojure Conj is here. You can see the above live code in the project repo.
How best to update this tree?
I've got the following tree:
{:start_date "2014-12-07"
 :data {:people [{:id 1 :projects [{:id 1} {:id 2}]}
                 {:id 2 :projects [{:id 1} {:id 3}]}]}}
I want to update the people and projects subtrees by adding a :name key-value pair. Assuming I have these maps to perform the lookup:
(def people {1 "Susan" 2 "John"})
(def projects {1 "Foo" 2 "Bar" 3 "Qux"})
How could I update the original tree so that I end up with the following?
{:start_date "2014-12-07"
 :data {:people [{:id 1 :name "Susan" :projects [{:id 1 :name "Foo"} {:id 2 :name "Bar"}]}
                 {:id 2 :name "John" :projects [{:id 1 :name "Foo"} {:id 3 :name "Qux"}]}]}}
I've tried multiple combinations of assoc-in, update-in, get-in and map calls, but haven't been able to figure this out.
I have used letfn to break down the update into easier-to-understand units.
user> (def tree {:start_date "2014-12-07"
                 :data {:people [{:id 1 :projects [{:id 1} {:id 2}]}
                                 {:id 2 :projects [{:id 1} {:id 3}]}]}})
#'user/tree
user> (def people {1 "Susan" 2 "John"})
#'user/people
user> (def projects {1 "Foo" 2 "Bar" 3 "Qux"})
#'user/projects
user> (defn integrate-tree
        [tree people projects]
        ;; letfn is like let, but it creates fn, and allows forward references
        (letfn [(update-person [person]
                  ;; -> is the "thread first" macro, the result of each expression
                  ;; becomes the first arg to the next
                  (-> person
                      (assoc :name (people (:id person)))
                      (update-in [:projects] update-projects)))
                (update-projects [all-projects]
                  (mapv #(assoc % :name (projects (:id %)))
                        all-projects))]
          (update-in tree [:data :people] #(mapv update-person %))))
#'user/integrate-tree
user> (pprint (integrate-tree tree people projects))
{:start_date "2014-12-07",
 :data
 {:people
  [{:projects [{:name "Foo", :id 1} {:name "Bar", :id 2}],
    :name "Susan",
    :id 1}
   {:projects [{:name "Foo", :id 1} {:name "Qux", :id 3}],
    :name "John",
    :id 2}]}}
nil
Not sure if this is entirely the best approach:
(defn update-names [tree people projects]
  (reduce
    (fn [t [id name]]
      (let [person-idx (ffirst (filter #(= (:id (second %)) id)
                                       (map-indexed vector (:people (:data t)))))
            temp (assoc-in t [:data :people person-idx :name] name)]
        (reduce
          (fn [t [id name]]
            (let [project-idx (ffirst (filter #(= (:id (second %)) id)
                                              (map-indexed vector (get-in t [:data :people person-idx :projects]))))]
              (if project-idx
                (assoc-in t [:data :people person-idx :projects project-idx :name] name)
                t)))
          temp
          projects)))
    tree
    people))
Just call it with your parameters:
(clojure.pprint/pprint (update-names tree people projects))
{:start_date "2014-12-07",
 :data
 {:people
  [{:projects [{:name "Foo", :id 1} {:name "Bar", :id 2}],
    :name "Susan",
    :id 1}
   {:projects [{:name "Foo", :id 1} {:name "Qux", :id 3}],
    :name "John",
    :id 2}]}}
This works with nested reduces:
Reduce over the people to update the corresponding names.
For each person, reduce over the projects to update the corresponding names.
The noisesmith solution looks better, since it doesn't need to find the person index or project index at each step.
Naturally you tried assoc-in or update-in, but the problem lies in your tree structure: the key path to update John's name is [:data :people 1 :name], so your assoc-in code would look like:
(assoc-in tree [:data :people 1 :name] "John")
But you need to find John's index in the people vector before you can update it, and the same thing happens with the projects inside.
clojure prewalk with select-keys
(clojure.walk/prewalk #(if (map? %) (select-keys % [:c]) %)
                      {:a 1 :b [{:c 3} {:d 4}] :c 5})
=> {:c 5}
Why does this only find {:c 5} and not also {:c 3}? I'm trying to write something that will pull out all key/value pairs that exist for any form and at any level for the key I specify.
When your function is called with {:c 5, :b [{:c 3} {:d 4}], :a 1} ...it returns: {:c 5} ...thus discarding all other keys, including the :b branch, which is thus not traversed.
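To collect the pairs for a key at every level, one option is to enumerate all nested values first and filter afterwards, so that nothing is pruned before it has been visited. A minimal sketch using tree-seq from clojure.core; the name find-key-pairs is made up for illustration:

```clojure
;; Sketch: `find-key-pairs` is a hypothetical helper, not from the answer.
(defn find-key-pairs
  "Returns a vector with one single-entry map per nested map in `form`
  that contains key `k`, in depth-first order (outermost map first)."
  [k form]
  (->> (tree-seq coll? seq form)      ; every nested value, including form itself
       (filter map?)                  ; keep only the maps
       (filter #(contains? % k))      ; ...that actually have the key
       (mapv #(select-keys % [k]))))

(def found (find-key-pairs :c {:a 1 :b [{:c 3} {:d 4}] :c 5}))
;; => [{:c 5} {:c 3}]
```

Unlike the prewalk version, this does not rebuild the structure; it only reads from it, so discarding keys cannot hide a branch.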
how to represent a seq in midje test
(fact "Checking :time has been removed"
  (remove-date [{:time 1 :a 2} {:c 3 :time 4}]) => (seq '({:a 2} {:c 4})))
In the above test the remove-date function returns a seq ({:a 2} {:c 4}).
How do I represent the seq on the right-hand side? (The above doesn't work.)
The above works for me; you just got the {:c 4} in your assertion wrong. It should be {:c 3}.
(fact "Checking :time has been removed"
  (remove-date [{:time 1 :a 2} {:c 3 :time 4}]) => (seq '({:a 2} {:c 3})))
In fact you don't need the seq call:
(fact "Checking :time has been removed"
  (remove-date [{:time 1 :a 2} {:c 3 :time 4}]) => '({:a 2} {:c 3}))
I tested it with midje 1.5.0 and clojure 1.4.0.
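The reason the seq call is unnecessary is that Clojure compares sequential collections by their elements, whatever their concrete type. The question does not show remove-date, so the definition below is only an assumed stand-in, and plain = is used instead of midje so that the snippet runs without extra dependencies:

```clojure
;; Assumed implementation: the question does not show the real `remove-date`.
(defn remove-date
  "Returns a lazy seq of the entries with the :time key removed."
  [entries]
  (map #(dissoc % :time) entries))

(def cleaned (remove-date [{:time 1 :a 2} {:c 3 :time 4}]))

;; A lazy seq, a quoted list and a vector with equal elements are all =.
(def same-as-list?   (= cleaned '({:a 2} {:c 3})))
(def same-as-vector? (= cleaned [{:a 2} {:c 3}]))
```

Both same-as-list? and same-as-vector? are true, so a quoted list or a vector literal would serve equally well as the expected value.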