Is there a generic way to consolidate a list of maps based on specific matching keys, while grouping the data of others? - clojure

I'm looking to nicely group some data from a list of maps, effectively attempting to "join" on given keys, and consolidating values associated with others. I have a (non-generic) example.
I work at a digital marketing company, and my particular example is regarding trying to consolidate a list of maps representing click-counts on our sites from different device types.
Example data is as follows
(def example-data
[{:site-id "439", :pid "184", :device-type "PC", :clicks 1}
{:site-id "439", :pid "184", :device-type "Tablet", :clicks 2}
{:site-id "439", :pid "184", :device-type "Mobile", :clicks 4}
{:site-id "439", :pid "3", :device-type "Mobile", :clicks 6}
{:site-id "100", :pid "200", :device-type "PC", :clicks 3}
{:site-id "100", :pid "200", :device-type "Mobile", :clicks 7}])
I want to "join" on the :site-id and :pid keys, while consolidating the :device-types and their corresponding :clicks into a map themselves: a working solution would result in the following list
[{:site-id "439", :pid "184", :device-types {"PC" 1, "Tablet" 2, "Mobile" 4}}
{:site-id "439", :pid "3", :device-types {"Mobile" 6}}
{:site-id "100", :pid "200", :device-types {"PC" 3, "Mobile" 7}}]
So, I do have a working solution for this specific transformation, which is as follows:
(defn consolidate-click-counts [click-counts]
(let [[ks vs] (->> click-counts
(group-by (juxt :site-id :pid))
((juxt keys vals)))
consolidate #(reduce (fn [acc x]
(assoc acc (:device-type x) (:clicks x)))
{}
%)]
(map (fn [[site-id pid] devs]
{:site-id site-id :pid pid :device-types (consolidate devs)})
ks
vs)))
While this works for my immediate use, this solution feels a little clumsy to me and is strongly tied to this exact transformation. I've been trying to think of what a more generic version of this function would look like, perhaps one where the keys to join and consolidate on are parameterized. I think it would also be ideal to be able to supply some kind of resolving fn for duplicate (not-joined-on) keys, sort of like merge-with (e.g. if I had two maps with a matching :site-id, :pid, and :device-type, I would probably want to add the click-counts together), but maybe that's too much.
I'm not sold on my grouping method either, perhaps it would be better to have another list of maps, a la
[{:site-id "439",
  :pid "184",
  :grouped-data [{:device-type "PC", :clicks 1}
                 {:device-type "Tablet", :clicks 2}
                 {:device-type "Mobile", :clicks 4}]}
 {:site-id "439",
  :pid "3",
  :grouped-data [{:device-type "Mobile", :clicks 6}]}
 {:site-id "100",
  :pid "200",
  :grouped-data [{:device-type "PC", :clicks 3}
                 {:device-type "Mobile", :clicks 7}]}]

Your general approach is reasonable, but the combination of group-by and your custom assoc lambda passed to reduce is more easily replicated with merge-with merge, a common idiom for combining data from multiple maps with shared keys:
(defn consolidate-click-counts [click-counts]
  (for [[k v] (apply merge-with merge
                     (for [m click-counts]
                       ;; key: the fields to join on; value: a one-entry map of device type to clicks
                       {(select-keys m [:site-id :pid])
                        {(:device-type m) (:clicks m)}}))]
    (assoc k :device-types v)))
Notice I also use a map {:site-id s :pid p} as the intermediate map key, rather than the vector [s p]. Both are fine, but this is easier to get to. It also avoids having to repeat the key names multiple times in the implementation.
I've written this basic function many times; see How to merge duplicated keys in list in vectors in Clojure? for another recent example.
You ask about combining multiple maps with the same keys, where you'd want to add together the click counts. That's not hard either: just tweak which part of the submap goes into the "key" section of the intermediate map, and change the merge function:
(defn consolidate-click-counts [click-counts]
(for [[k v] (apply merge-with +
(for [m click-counts]
{(select-keys m [:site-id :pid :device-type])
(:clicks m)}))]
(assoc k :clicks v)))
And we can see this works fine to group duplicate keys:
=> (consolidate-click-counts (concat example-data example-data))
({:site-id "439", :pid "184", :device-type "PC", :clicks 2}
{:site-id "439", :pid "184", :device-type "Tablet", :clicks 4}
{:site-id "439", :pid "184", :device-type "Mobile", :clicks 8}
{:site-id "439", :pid "3", :device-type "Mobile", :clicks 12}
{:site-id "100", :pid "200", :device-type "PC", :clicks 6}
{:site-id "100", :pid "200", :device-type "Mobile", :clicks 14})
You asked about extracting a function that parameterizes over the variables in this algorithm. I don't really think it's worth doing, since the function is so small already and easy enough to read. I'd rather just read another apply merge-with/for loop than remember some new function somebody wrote that abstracts it for me. But if you disagree, it is as easy as defining a new function with parameters for the stuff one could reasonably fiddle with:
(defn map-combiner [group-keys output-key inspect combine]
(fn [coll]
(for [[k v] (apply merge-with combine
(for [m coll]
{(select-keys m group-keys)
(inspect m)}))]
(assoc k output-key v))))
(def consolidate-click-counts
  (map-combiner [:site-id :pid :device-type] :clicks :clicks +))
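As a further illustration, the same map-combiner shape can also produce the nested :device-types map from the question. This is only a minimal sketch: nest-device-types is a hypothetical name, and map-combiner and example-data are repeated from above so the block runs on its own.

```clojure
;; Repeated from the answer so this block is self-contained.
(defn map-combiner [group-keys output-key inspect combine]
  (fn [coll]
    (for [[k v] (apply merge-with combine
                       (for [m coll]
                         {(select-keys m group-keys)
                          (inspect m)}))]
      (assoc k output-key v))))

(def example-data
  [{:site-id "439", :pid "184", :device-type "PC", :clicks 1}
   {:site-id "439", :pid "184", :device-type "Tablet", :clicks 2}
   {:site-id "439", :pid "184", :device-type "Mobile", :clicks 4}
   {:site-id "439", :pid "3", :device-type "Mobile", :clicks 6}
   {:site-id "100", :pid "200", :device-type "PC", :clicks 3}
   {:site-id "100", :pid "200", :device-type "Mobile", :clicks 7}])

;; Hypothetical name: join on :site-id and :pid, and turn each row into a
;; one-entry map {device-type clicks} that is merged per group.
(def nest-device-types
  (map-combiner [:site-id :pid]
                :device-types
                (fn [m] {(:device-type m) (:clicks m)})
                merge))

(nest-device-types example-data)
```

If the same device type can appear more than once for one :site-id/:pid pair, passing (partial merge-with +) instead of merge as the combine function would sum the clicks rather than keep the last value.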

There are many options outside of Clojure core for this as well. The smallest addition would be xforms, which is just a set of additional transducers. Its by-key transducer lets you do this transformation in a single pass:
(into [] (x/by-key (juxt :site-id :pid) identity
(fn [[site-id pid] v] {:site-id site-id :pid pid :device-types v})
(x/reduce (completing (fn [ac {:keys [device-type clicks]}]
(assoc ac device-type clicks))) {}))
example-data)
meander was built specifically to declaratively handle these kinds of transformations.
datascript is an in-memory datalog database that would allow you to express your query as a d/q with a pull or other similar means.
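For comparison with the library options above, a single pass is also possible with core functions only. The following is a rough sketch rather than a recommendation: consolidate-in-one-pass is a hypothetical name, and it assumes that clicks for a repeated device type should be summed.

```clojure
(def example-data
  [{:site-id "439", :pid "184", :device-type "PC", :clicks 1}
   {:site-id "439", :pid "184", :device-type "Tablet", :clicks 2}
   {:site-id "439", :pid "184", :device-type "Mobile", :clicks 4}
   {:site-id "439", :pid "3", :device-type "Mobile", :clicks 6}
   {:site-id "100", :pid "200", :device-type "PC", :clicks 3}
   {:site-id "100", :pid "200", :device-type "Mobile", :clicks 7}])

;; Hypothetical helper: accumulate {[site-id pid] {device-type clicks}} in one
;; reduce, then reshape each entry into the output map.
(defn consolidate-in-one-pass [click-counts]
  (->> click-counts
       (reduce (fn [acc {:keys [site-id pid device-type clicks]}]
                 ;; (fnil + 0) treats a missing count as 0, so repeats are summed
                 (update-in acc [[site-id pid] device-type] (fnil + 0) clicks))
               {})
       (map (fn [[[site-id pid] devices]]
              {:site-id site-id :pid pid :device-types devices}))))

(consolidate-in-one-pass example-data)
```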

Related

How do I modify maps nested in vectors based on a series of values in Clojure?

Supposing I have a data structure like this:
[[{:name "bob" :favorite-color "green"}{:name "tim" :favorite-color "blue"}]
[{:name "eric" :favorite-color "orange"}{:name "jim" :favorite-color "purple"}]
[{:name "andy" :favorite-color "green"}{:name "tom" :favorite-color "blue"}]]
and an array like this:
["green" "purple"]
How would I pass over my data structure and augment all maps for folks who liked the colors in my array with a new key value pair of :likes-my-colors "yes" ?
The result would be:
[[{:name "bob" :favorite-color "green" :likes-my-colors "yes"}{:name "tim" :favorite-color "blue"}]
[{:name "eric" :favorite-color "orange"}{:name "jim" :favorite-color "purple" :likes-my-colors "yes"}]
[{:name "andy" :favorite-color "green" :likes-my-colors "yes"}{:name "tom" :favorite-color "blue"}]]
(I intentionally made the value a string of yes as opposed to true because that's closer to what I am trying to figure out).
I tried loop and recur with postwalk but couldn't figure out how to mutate the map with subsequent recursions. I won't paste my horrid attempt here because I am guessing there's a better way to do it than with recur. However, postwalk would have the advantage of being able to handle an increasingly nested data structure, which will likely be the case. So maybe recur with postwalk is the way to go.
I'm using ClojureScript and Reagent to store app state in an atom... as things occur I need to keep updating the app state in that atom. The app state gets reset repeatedly in a single user session... it gets built up and modified after each reset. As in this example, the app state gets modified based on arrays. My code needs to work through the elements of the array and modify all the maps that meet a condition. Eventually, this structure is used to add classes to a Hiccup data structure. The UI changes accordingly; people in a list would have borders appear around them if they liked my colors, for example, by having a class added.
I had awesome help in learning how to look through a data structure like this and update all maps given a specific key/value pair... but I've run into trouble doing it with a series of values. In other words, 'build up' a map in a sense... but it's more 'modify with multiple passes'. That phrasing will hopefully improve as my understanding does.
I am wondering, as a side note, how Clojure users go about accessing and mutating elements buried deeply in nested data structures. I'd rather have more complicated data structures but I avoid them because it seems hard to modify deep elements. I'm suspecting they might use libraries. It seems like there may be an easier way of getting at and modifying complex structures than writing brain teaser (for me) code. But then again, I may be wrong. There are a lot of examples online but they are often about modifying simple structures.
I would start with an item updater, for example:
(defn handle-fav-colors [color-set data]
(if (color-set (:favorite-color data))
(assoc data :likes-my-color "yes")
data))
then you would be free to update your data any way you like. Like mapping:
user> (mapv (partial mapv (partial handle-fav-colors #{"green" "purple"})) data)
;;=> [[{:name "bob", :favorite-color "green", :likes-my-color "yes"}
;; {:name "tim", :favorite-color "blue"}]
;; [{:name "eric", :favorite-color "orange"}
;; {:name "jim", :favorite-color "purple", :likes-my-color "yes"}]
;; [{:name "andy", :favorite-color "green", :likes-my-color "yes"}
;; {:name "tom", :favorite-color "blue"}]]
I personally wouldn't recommend using walkers for this, since the data has a regular structure, while walkers go indefinitely deep, which leaves a non-zero chance of messing up some deeply nested maps. The rule of thumb (works for me): when you know the exact level of the data structure you need to operate at, you should not use tools that may operate above or below this level.
Also, as clojure (and FP in general) is all about small composable and reusable utils, you could approach with first making up the proper general functions like nested collections mapping:
(defn mapv-deep [level f data]
(if (pos? level)
(mapv (partial mapv-deep (dec level) f) data)
(mapv f data)))
user> (mapv-deep 0 inc [1 2 3])
;; [2 3 4]
user> (mapv-deep 1 inc [[1 2] [3 4]])
;; [[2 3] [4 5]]
user> (mapv-deep 2 inc [[[1 2] [3 4]] [[5 6] [7 8]]])
;; [[[2 3] [4 5]] [[6 7] [8 9]]]
and conditional analog of assoc
(defn assoc-when [data pred k v]
(if (pred data)
(assoc data k v)
data))
user> (assoc-when {:a 10 :b 20} #(-> % :a even?) :even-a? true)
;;=> {:a 10, :b 20, :even-a? true}
user> (assoc-when {:a 11 :b 20} #(-> % :a even?) :even-a? true)
;;=> {:a 11, :b 20}
so now the task can be solved this way:
(defn handle-fav-colors [color-set data]
(assoc-when data (comp color-set :favorite-color) :likes-my-color "yes"))
user> (mapv-deep 1 (partial handle-fav-colors #{"green" "purple"}) data)
;;=> [[{:name "bob", :favorite-color "green", :likes-my-color "yes"}
;; {:name "tim", :favorite-color "blue"}]
;; [{:name "eric", :favorite-color "orange"}
;; {:name "jim", :favorite-color "purple", :likes-my-color "yes"}]
;; [{:name "andy", :favorite-color "green", :likes-my-color "yes"}
;; {:name "tom", :favorite-color "blue"}]]
Let's first write a utility function for handling a single map. It checks whether the value under :favorite-color is present in favorite-colors. Since favorite-colors is a vector, we need to convert it to a set so we can use contains? on it.
(defn handle-map [m]
(if (contains? (set favorite-colors) (:favorite-color m))
(assoc m :likes-my-colors "yes")
m))
Now we can use postwalk to call it on all map nodes:
(clojure.walk/postwalk
(fn [m]
(if (map? m)
(handle-map m)
m))
data)
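To make the postwalk approach above runnable on its own, here is a small sketch with favorite-colors and data defined as in the question. The result name is only illustrative.

```clojure
(require '[clojure.walk :as walk])

(def favorite-colors ["green" "purple"])

(def data
  [[{:name "bob" :favorite-color "green"} {:name "tim" :favorite-color "blue"}]
   [{:name "eric" :favorite-color "orange"} {:name "jim" :favorite-color "purple"}]
   [{:name "andy" :favorite-color "green"} {:name "tom" :favorite-color "blue"}]])

(defn handle-map [m]
  (if (contains? (set favorite-colors) (:favorite-color m))
    (assoc m :likes-my-colors "yes")
    m))

;; postwalk visits every node; only maps are passed to handle-map,
;; everything else is returned unchanged.
(def result
  (walk/postwalk (fn [x] (if (map? x) (handle-map x) x))
                 data))
```

Because postwalk visits maps at every depth, the same call keeps working if the vectors become more deeply nested. Maps without a :favorite-color are left as they are.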

How to filter a map comparing it with another collection

I have a collection of maps like {:id 2489 ,values :.......} {:id 5647 ,values : .....} and so on, up to 10000 of them, and I want to filter it based on another collection that holds ids from the first one, like (2489 ,......)
I am new to clojure and I have tried :
(into {} (filter #(some (fn [u] (= u (:id %))) [2489 3456 4567 5689]) record-sets))
But it gives me only the last one, the map with id 5689, as output {:id 5689 ,:values ....}, while I want all of them. Can you suggest what I can do?
One problem is that you start out with a sequence of N maps, then you try to stuff them into a single map. This will cause the last one to overwrite the first one.
Instead, you need to have the output be a sequence of M maps (M <= N).
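The overwriting can be seen in a small sketch. The sample records and the names record-sets, wanted? and collapsed are made up for illustration: pouring maps into {} merges them entry by entry, so only the last matching :id survives.

```clojure
;; Hypothetical sample data standing in for the question's record-sets.
(def record-sets
  [{:id 1 :stuff :fred}
   {:id 2 :stuff :barney}
   {:id 3 :stuff :wilma}])

(def wanted? #{1 3})

;; Two maps pass the filter...
(count (filter (comp wanted? :id) record-sets))
;; => 2

;; ...but conj-ing a map onto a map merges its entries, so the second
;; match replaces the :id and :stuff of the first.
(def collapsed (into {} (filter (comp wanted? :id) record-sets)))
;; collapsed is {:id 3, :stuff :wilma}
```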
Something like this is what you want:
(def data
[{:id 1 :stuff :fred}
{:id 2 :stuff :barney}
{:id 3 :stuff :wilma}
{:id 4 :stuff :betty}])
(let [ids-wanted #{1 3}
data-wanted (filterv
(fn [item]
(contains? ids-wanted (:id item)))
data)]
(println data-wanted))
with result:
[{:id 1, :stuff :fred}
{:id 3, :stuff :wilma}]
Be sure to use the Clojure CheatSheet: http://jafingerhut.github.io/cheatsheet/clojuredocs/cheatsheet-tiptip-cdocs-summary.html
I like filterv over plain filter since it always gives a non-lazy result (i.e. a Clojure vector).
You are squashing all your maps into one. First thing, for sake of performance, is to change your list of IDs into a set, then simply filter.
(let [ids (into #{} [2489 3456 4567 5689])]
(filter (comp ids :id) record-sets))
This will give you the sequence of correct maps. If you want to convert this sequence of maps into a map keyed by ID, you can do this:
(let [ids (into #{} [2489 3456 4567 5689])]
(->> record-sets
(filter (comp ids :id))
(into {} (map (juxt :id identity)))))
Another way to do this is with Clojure's select-keys function.
select-keys returns a map containing only the keys given to the function.
So, given that your data is a list of maps, we can convert it into a hash-map keyed by id using group-by and then call select-keys on it:
(def data
[{:id 1 :stuff :fred}
{:id 2 :stuff :barney}
{:id 3 :stuff :wilma}
{:id 4 :stuff :betty}])
(select-keys (group-by :id data) [1 4])
; => {1 [{:id 1, :stuff :fred}], 4 [{:id 4, :stuff :betty}]}
However, the result is now a map from ids to vectors of matching maps. To get the original structure back, we need to take all the values in the map and then flatten the vectors:
; get all the values in the map
(vals (select-keys (group-by :id data) [1 4]))
; => ([{:id 1, :stuff :fred}] [{:id 4, :stuff :betty}])
; flatten the nested vectors
(flatten (vals (select-keys (group-by :id data) [1 4])))
; => ({:id 1, :stuff :fred} {:id 4, :stuff :betty})
Extracting the values and flattening might seem a bit inefficient, but I think it's less complex than the nested loop that needs to be done in the filter-based methods.
You can use the threading macro to compose all the steps together:
(-> (group-by :id data)
(select-keys [1 4])
vals
flatten)
Another option is to store the data as a map keyed by id from the beginning. That way select-keys won't require group-by, and the result won't require flattening.
To build that map from the grouped data, update every value in the map, for example with update-vals (in clojure.core since Clojure 1.11):
(update-vals (group-by :id data) first)
; => {1 {:id 1, :stuff :fred}, 2 {:id 2, :stuff :barney}, 3 {:id 3, :stuff :wilma}, 4 {:id 4, :stuff :betty}}
This would probably be the most efficient for this problem but this structure might not work for every case.
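A short sketch of that idea, assuming every :id is unique (by-id is a hypothetical name): once the data is keyed by id, selecting records no longer needs group-by or flatten.

```clojure
(def data
  [{:id 1 :stuff :fred}
   {:id 2 :stuff :barney}
   {:id 3 :stuff :wilma}
   {:id 4 :stuff :betty}])

;; Build the id -> record map once. With duplicate ids the last record wins,
;; which is why uniqueness is assumed here.
(def by-id (into {} (map (juxt :id identity)) data))

(select-keys by-id [1 4])
;; => {1 {:id 1, :stuff :fred}, 4 {:id 4, :stuff :betty}}

;; vals gives back the plain records if the keyed shape is not needed.
(vals (select-keys by-id [1 4]))
```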

Get two different keywords from map

I just started learning Clojure and I'd like to get two keywords from a vector of maps.
Let's say there's a vector
(def a [{:id 1, :description "bla", :amount 12, :type "A", :other "x"} {:id 2, :description "blabla", :amount 10, :type "B", :other "y"}])
And I'd like to get a new vector
[{"bla" 12} {"blabla" 10}]
How can I do that??
Thanks!
Assuming you want the :description and :amount separately, not maps that map one to the other, you can use juxt to retrieve both at the same time:
(mapv (juxt :description :amount) a)
;; => [["bla" 12] ["blabla" 10]]
If you actually did mean to make maps, you can use for instance apply and hash-map to do that:
(mapv #(apply hash-map ((juxt :description :amount) %)) a)
;; => [{"bla" 12} {"blabla" 10}]
You can use mapv to map over the source vector. Within the transform function you can destructure each map to extract the keys you want and construct the result:
(mapv (fn [{:keys [description amount]}] {description amount}) a)
(mapv #(hash-map (:description %) (:amount %)) a)

Recursive map query using specter

Is there a simple way in specter to collect all the structure satisfying a predicate ?
(./pull '[com.rpl/specter "1.0.0"])
(use 'com.rpl.specter)
(def data {:items [{:name "Washing machine"
:subparts [{:name "Ballast" :weight 1}
{:name "Hull" :weight 2}]}]})
(reduce + (select [(walker :weight) :weight] data))
;=> 3
(select [(walker :name) :name] data)
;=> ["Washing machine"]
How can we get all the value for :name, including ["Ballast" "Hull"] ?
Here's one way, using recursive-path and stay-then-continue to do the real work. (If you omit the final :name from the path argument to select, you'll get the full “item / part maps” rather than just the :name strings.)
(def data
{:items [{:name "Washing machine"
:subparts [{:name "Ballast" :weight 1}
{:name "Hull" :weight 2}]}]})
(specter/select
[(specter/recursive-path [] p
[(specter/walker :name) (specter/stay-then-continue [:subparts p])])
:name]
data)
;= ["Washing machine" "Ballast" "Hull"]
Update: In answer to the comment below, here's a version of the above that descends into arbitrary branches of the tree, as opposed to only descending into the :subparts branch of any given node, excluding :name (which is the key whose values in the tree we want to extract and should not itself be viewed as a branching off point):
(specter/select
[(specter/recursive-path [] p
[(specter/walker :name)
(specter/stay-then-continue
[(specter/filterer #(not= :name (key %)))
(specter/walker :name)
p])])
:name]
;; adding the key `:subparts2` with the value [{:name "Foo"}]
;; to the "Washing machine" map to exercise the new descent strategy
(assoc-in data [:items 0 :subparts2] [{:name "Foo"}]))
;= ["Washing machine" "Ballast" "Hull" "Foo"]
The selected? selector can be used to collect structures for which another selector matches something within the structure
From the examples at https://github.com/nathanmarz/specter/wiki/List-of-Navigators#selected
=> (select [ALL (selected? [(must :a) even?])] [{:a 0} {:a 1} {:a 2} {:a 3}])
[{:a 0} {:a 2}]
I think you could iterate on map recursively using clojure.walk package. On each step, you may check the current value for a predicate and push it into an atom to collect the result.
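A minimal sketch of that clojure.walk idea, without specter. collect-names is a hypothetical helper, and the data is copied from the question.

```clojure
(require '[clojure.walk :as walk])

(def data
  {:items [{:name "Washing machine"
            :subparts [{:name "Ballast" :weight 1}
                       {:name "Hull" :weight 2}]}]})

;; Hypothetical helper: walk the whole tree and record the :name of every
;; map that has one. The walk function returns each node unchanged, so the
;; traversal is used purely for its side effect on the atom.
(defn collect-names [tree]
  (let [found (atom [])]
    (walk/prewalk (fn [x]
                    (when (and (map? x) (contains? x :name))
                      (swap! found conj (:name x)))
                    x)
                  tree)
    @found))

(collect-names data)
;; contains "Washing machine", "Ballast" and "Hull"
```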

Perform "get" on all HashMap elements of a LazySeq

I'm parsing some XML data from Stack Exchange using clojure.data.xml, for example if I parse Votes data it returns a LazySeq containing a HashMap for each row of data.
What I am trying to do is to get the values associated with only certain keys for each row, e.g., (get votes [:Id :CreationDate]). I've tried numerous things, most of them leading to casting errors.
The closest I could get to what I need is using (doall (map get votes [:Id :CreationDate])). However, the problem I am experiencing now is that I cannot seem to return more than just the first row (i.e. (1 2011-01-19T00:00:00.000))
Here is a MCVE that can be run on any Clojure REPL, or on Codepad online IDE.
Ideally I would like to return some kind of collection or map which contains the values I need for each row, the end goal is to write to something like a CSV file or such. For example a map like
(1 2011-01-19T00:00:00.000
2 2011-01-19T00:00:00.000
3 2011-01-19T00:00:00.000
4 2011-01-19T00:00:00.000)
(def votes '({:Id "1",
:PostId "2",
:VoteTypeId "2",
:CreationDate "2011-01-19T00:00:00.000"}
{:Id "2",
:PostId "3",
:VoteTypeId "2",
:CreationDate "2011-01-19T00:00:00.000"}
{:Id "3",
:PostId "1",
:VoteTypeId "2",
:CreationDate "2011-01-19T00:00:00.000"}
{:Id "4",
:PostId "1",
:VoteTypeId "2",
:CreationDate "2011-01-19T00:00:00.000"}))
(println (doall (map get votes [:Id :CreationDate])))
Additional detail: If this is of any help/interest, the code I am using to get the above lazy seq is as follows:
(ns se-datadump.read-xml
  (:require
    [clojure.data.xml :as xml]))
(def xml-votes
"<votes><row Id=\"1\" PostId=\"2\" VoteTypeId=\"2\" CreationDate=\"2011-01-19T00:00:00.000\" /> <row Id=\"2\" PostId=\"3\" VoteTypeId=\"2\" CreationDate=\"2011-01-19T00:00:00.000\" /> <row Id=\"3\" PostId=\"1\" VoteTypeId=\"2\" CreationDate=\"2011-01-19T00:00:00.000\" /> <row Id=\"4\" PostId=\"1\" VoteTypeId=\"2\" CreationDate=\"2011-01-19T00:00:00.000\" /></votes>")
(defn se-xml->rows-seq
"Returns LazySequence from a properly formatted XML string,
which contains a HashMap for every <row> element with each of its attributes.
This assumes the standard Stack Exchange XML format, where a parent element contains
only a series of <row> child elements with no further hierarchy."
[xml-str]
(let [xml-records (xml/parse-str xml-str)]
(map :attrs (-> xml-records :content))))
; this returns a seq identical to the one in the MCVE:
(def votes (se-xml->rows-seq xml-votes))
You apparently need juxt:
(map (juxt :Id :CreationDate) votes)
;; => (["1" "2011-01-19T00:00:00.000"] ["2" "2011-01-19T00:00:00.000"] ["3" "2011-01-19T00:00:00.000"] ["4" "2011-01-19T00:00:00.000"])
If you need a map out of it:
(into {} (map (juxt :Id :CreationDate) votes))
;; => {"1" "2011-01-19T00:00:00.000", "2" "2011-01-19T00:00:00.000", "3" "2011-01-19T00:00:00.000", "4" "2011-01-19T00:00:00.000"}
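If each row should stay a map with only the wanted keys, which is closer to the (get votes [:Id :CreationDate]) idea in the question, select-keys is another option. A small sketch with two of the rows from the question (the name trimmed is only illustrative):

```clojure
(def votes
  '({:Id "1", :PostId "2", :VoteTypeId "2", :CreationDate "2011-01-19T00:00:00.000"}
    {:Id "2", :PostId "3", :VoteTypeId "2", :CreationDate "2011-01-19T00:00:00.000"}))

;; Keep only :Id and :CreationDate in every row.
(def trimmed
  (map #(select-keys % [:Id :CreationDate]) votes))
;; trimmed is
;; ({:Id "1", :CreationDate "2011-01-19T00:00:00.000"}
;;  {:Id "2", :CreationDate "2011-01-19T00:00:00.000"})
```

The juxt version above is probably the more convenient shape for writing CSV rows, since it yields the values in a fixed order.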
First of all, let me explain what the piece of code you posted on CodePad is actually doing. I doubt that it's what you intend:
(println (doall (map get votes [:Id :CreationDate])))
The crucial part is: (map get votes [:Id :CreationDate])
This maps over two collections: the lazy sequence 'votes' and a vector. Whenever mapping over more than one collection, the returned lazy sequence will be as long as the shortest collection provided.
For instance one can map over a finite collection and an infinite sequence:
(map + (range) [1 2 3])
;; (1 3 5)
This explains why your result is only two items long:
(map get votes [:Id :CreationDate])
reduces to:
((get (votes 0) ([:Id :CreationDate] 0))
 (get (votes 1) ([:Id :CreationDate] 1)))
reduces to:
((get {:Id "1",
:PostId "2",
:VoteTypeId "2",
:CreationDate "2011-01-19T00:00:00.000"} :Id)
(get {:Id "2",
:PostId "3",
:VoteTypeId "2",
:CreationDate "2011-01-19T00:00:00.000"} :CreationDate))
reduces finally to:
(1 2011-01-19T00:00:00.000)
This is only meant to aid understanding. Whether the compiler performs exactly these steps is another question.
doall is not necessary here, since println already does this implicitly.
As already noted, in your case you'd better use juxt and only map over votes. If you really want the sample output, you additionally need to flatten the result:
(flatten (map (juxt :Id :CreationDate) votes))