Outer join in Clojure

Similar to this question: Inner-join in clojure
Is there a function for outer joins (left, right and full) performed on collections of maps in any of the Clojure libraries?
I guess it could be done by modifying the code of clojure.set/join, but this seems like a common enough requirement that it's worth checking whether it already exists.
Something like this:
(def s1 #{{:a 1, :b 2, :c 3}
{:a 2, :b 2}})
(def s2 #{{:a 2, :b 3, :c 5}
{:a 3, :b 8}})
;=> (full-join s1 s2 {:a :a})
;
; #{{:a 1, :b 2, :c 3}
; {:a 2, :b 3, :c 5}
; {:a 3, :b 8}}
And the appropriate functions for left and right outer join, i.e. including the entries where there is no value (or nil value) for the join key on the left, right or both sides.

Sean Devlin's (of Full Disclojure fame) table-utils has the following join types:
inner-join
left-outer-join
right-outer-join
full-outer-join
natural-join
cross-join
It hasn't been updated in a while, but works in 1.3, 1.4 and 1.5. To make it work without any external dependencies:
replace fn-tuple with juxt
replace the whole (:use ) clause in the ns declaration with (:require [clojure.set :refer [intersection union]])
add the function map-vals from below:
either
(defn map-vals
[f coll]
(into {} (map (fn [[k v]] {k (f v)}) coll)))
or for Clojure 1.5 and up
(defn map-vals
[f coll]
(reduce-kv (fn [acc k v] (assoc acc k (f v))) {} coll))
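As a quick sanity check, here is a minimal sketch showing that the two definitions behave the same. The names map-vals-into and map-vals-kv are hypothetical, chosen only so that both variants can be loaded side by side; in the library itself either one would simply be called map-vals. The last line illustrates juxt on a map, on the assumption that fn-tuple was used in the same way.

```clojure
;; Hypothetical names so both variants can be compared in one session.
(defn map-vals-into [f coll]
  (into {} (map (fn [[k v]] {k (f v)}) coll)))

(defn map-vals-kv [f coll]
  (reduce-kv (fn [acc k v] (assoc acc k (f v))) {} coll))

(map-vals-into inc {:a 1, :b 2}) ;=> {:a 2, :b 3}
(map-vals-kv inc {:a 1, :b 2})   ;=> {:a 2, :b 3}

;; juxt returns a function that collects the results of several fns in a vector,
;; which is handy for composite join keys.
((juxt :a :b) {:a 1, :b 2, :c 3}) ;=> [1 2]
```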
To use the library, call the join function with two collections (two sets of maps like the example above, or two SQL result sets) and at least one join fn. Since keywords are functions on maps, the join keys alone will usually suffice:
=> (full-outer-join s1 s2 :a :a)
({:a 1, :c 3, :b 2}
{:a 2, :c 5, :b 3}
{:b 8, :a 3})
If I remember correctly, Sean tried to get table-utils into contrib some time ago, but that never worked out. Too bad it never got its own project (on GitHub/Clojars). Every now and then a question about a library like this pops up on Stack Overflow or the Clojure Google group.
Another option might be using the datalog library from datomic to query clojure data structures. Stuart Halloway has some examples in his gists.
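If pulling in a library is not an option, a full outer join on a single key can be sketched with group-by and merge. This is not table-utils code; full-outer-join-sketch is a hypothetical name, and the sketch assumes that the right-hand row should win when both sides carry the same key, as in the expected output from the question. Rows that lack the join key are grouped under nil and would therefore be joined with each other.

```clojure
(require '[clojure.set :as set])

(defn full-outer-join-sketch
  "Hypothetical helper: joins two collections of maps on key k.
  Matching rows are merged (right side wins on conflicting keys);
  rows without a partner are kept unchanged."
  [xs ys k]
  (let [left  (group-by k xs)
        right (group-by k ys)]
    (set (for [v (set/union (set (keys left)) (set (keys right)))
               l (get left v [nil])    ; [nil] stands in for a missing left row
               r (get right v [nil])]  ; [nil] stands in for a missing right row
           (merge l r)))))

(def s1 #{{:a 1, :b 2, :c 3} {:a 2, :b 2}})
(def s2 #{{:a 2, :b 3, :c 5} {:a 3, :b 8}})

(full-outer-join-sketch s1 s2 :a)
;=> #{{:a 1, :b 2, :c 3} {:a 2, :b 3, :c 5} {:a 3, :b 8}}
```

A left or right outer join follows the same outline if the iteration runs over (keys left) or (keys right) only instead of their union.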

Related

How to filter map content by path

I want to select paths of a deeply nested map to keep.
For example:
{:a 1
:b {:c [{:d 1 :e 1}
{:d 2 :e 2}]
:f 1}
:g {:h {:i 4 :j [1 2 3]}}}
I want to select by paths, like so:
(select-paths m [[:a]
[:b :c :e]
[:b :f]
[:g :h :i]])
This would return
{:a 1
:b {:c [{:e 1}
{:e 2}]
:f 1}
:g {:h {:i 4}}}
Essentially the same as Elasticsearch's fields parameter. The format of the paths argument can be something else, this is just the first idea.
I tried two different solutions
Going through the entire map and checking whether the full path of the current element is in the given paths. I can't figure out how to handle lists of maps so that they are kept as lists of maps.
Creating select-keys statements from the given paths, but again I run into problems with lists of maps, especially when trying to resolve paths of varying depths that share some common depth.
I looked at Specter but I didn't see anything that would do this. Any map- or postwalk-based solution I come up with turns into something incredibly convoluted at some point. I must be thinking about this the wrong way.
If there's a way to do this with raw json, that would be fine as well. Or even a Java solution.
There is no simple way to accomplish your goal. The automatic processing implied for the sequence under [:b :c] is also problematic.
You can get partway there using the Tupelo Forest library. See the Lightning Talk video from Clojure/Conj 2017.
I did some additional work in data destructuring that you may find useful building the tupelo.core/destruct macro (see examples here). You could follow a similar outline to build a recursive solution to your specific problem.
A related project is Meander. I have worked on my own version which is like a generalized version of tupelo.core/destruct. Given data like this
(def skynet-widgets [{:basic-info {:producer-code "Cyberdyne"}
:widgets [{:widget-code "Model-101"
:widget-type-code "t800"}
{:widget-code "Model-102"
:widget-type-code "t800"}
{:widget-code "Model-201"
:widget-type-code "t1000"}]
:widget-types [{:widget-type-code "t800"
:description "Resistance Infiltrator"}
{:widget-type-code "t1000"
:description "Mimetic polyalloy"}]}
{:basic-info {:producer-code "ACME"}
:widgets [{:widget-code "Dynamite"
:widget-type-code "c40"}]
:widget-types [{:widget-type-code "c40"
:description "Boom!"}]}])
You can search and extract data using a template like this:
(let [root-eid (td/add-entity-edn skynet-widgets)
results (td/match
[{:basic-info {:producer-code ?}
:widgets [{:widget-code ?
:widget-type-code wtc}]
:widget-types [{:widget-type-code wtc
:description ?}]}])]
(is= results
[{:description "Resistance Infiltrator" :widget-code "Model-101" :producer-code "Cyberdyne" :wtc "t800"}
{:description "Resistance Infiltrator" :widget-code "Model-102" :producer-code "Cyberdyne" :wtc "t800"}
{:description "Mimetic polyalloy" :widget-code "Model-201" :producer-code "Cyberdyne" :wtc "t1000"}
{:description "Boom!" :widget-code "Dynamite" :producer-code "ACME" :wtc "c40"}]))
This code is working (see here) but it needs more polish. You could use it as a guide to building a generalized select-paths function.
Can you add any details on how this problem arose or the specific context? That may point to ideas for an alternate solution.
One way of solving this problem would be to generate a set of all subpaths that you accept and then write a recursive function that traverses the data structure and keeps track of the path to the current node. The code that accomplishes that does not need to be very long:
(defn select-paths-from-set [current-path path-set data]
(cond
(map? data) (into {}
(remove nil?)
(for [[k v] data]
(let [p (conj current-path k)]
(if (contains? path-set p)
[k (select-paths-from-set p path-set v)]))))
(sequential? data) (mapv (partial select-paths-from-set current-path path-set) data)
:default data))
(defn select-paths [data paths]
(select-paths-from-set []
(into #{}
(mapcat #(take-while seq (iterate butlast %)))
paths)
data))
(select-paths {:a 1
:b {:c [{:d 1 :e 1}
{:d 2 :e 2}]
:f 1}
:g {:h {:i 4 :j [1 2 3]}}}
[[:a]
[:b :c :e]
[:b :f]
[:g :h :i]])
;; => {:a 1, :b {:c [{:e 1} {:e 2}], :f 1}, :g {:h {:i 4}}}
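The set of accepted subpaths is the part that is easiest to misread, so here is that step in isolation. path-prefixes is a hypothetical name for the anonymous function passed to mapcat above. Note that butlast returns seqs rather than vectors; the lookup still works because Clojure compares and hashes sequential collections by their contents.

```clojure
(defn path-prefixes
  "Hypothetical helper: all non-empty prefixes of a path, longest first."
  [path]
  (take-while seq (iterate butlast path)))

(path-prefixes [:b :c :e])
;=> ([:b :c :e] (:b :c) (:b))

;; A vector path is found even though the stored prefix is a seq.
(contains? (set (path-prefixes [:b :c :e])) [:b :c])
;=> true
```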

Clojure custom function for threading macro

I have a map and I want to write a custom function for updating it.
(-> {:a 1 :b 2}
(fn [x] (update x :a inc)))
This of course is a simple example and could be easily done without the function wrapped around the update, but it shows what I want to do. But this gives me the following error.
Syntax error macroexpanding clojure.core/fn at (core.clj:108:1).
{:a 1, :b 2} - failed: vector? at: [:fn-tail :arity-1 :params] spec: :clojure.core.specs.alpha/param-list
{:a 1, :b 2} - failed: (or (nil? %) (sequential? %)) at: [:fn-tail :arity-n] spec: :clojure.core.specs.alpha/params+body
I don't get why this is not working, since the threading macro should put my map as the first parameter of the function, right?
You can always use macroexpand to see what happened. In your case, macroexpand returns:
(fn {:a 1, :b 2} [x] (update x :a inc))
Obviously, this is not a valid function. But if you tweak it this way:
(-> {:a 1 :b 2}
(#(update % :a inc)))
the expanded form will then become valid:
(#(update % :a inc) {:a 1, :b 2})
You don't pass the function itself; you write the function call without its first argument. For your example it would be:
> (-> {:a 1 :b 2}
(update :a inc))
{:a 2, :b 2}
This is easier to see by expanding the macro in each case
> (macroexpand-1 '(-> {:a 1 :b 2} (update :a inc)))
(update {:a 1, :b 2} :a inc)
> (macroexpand-1 '(-> {:a 1 :b 2} (fn [x] (update x :a inc))))
(fn {:a 1, :b 2} [x] (update x :a inc))
As @jas and @rmcv pointed out, I was giving the threading macro the function itself, not a call of the function without its argument. So, in short, the solution would be
(-> {:a 1 :b 2}
((fn [x] (update x :a inc))))
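Expanding this form shows why the extra pair of parentheses matters: the threaded value is now inserted as the argument of a call whose function position holds the fn form. A small sketch of what the REPL reports:

```clojure
;; The outer parentheses make a call form; -> inserts the map as its first argument.
(macroexpand-1 '(-> {:a 1 :b 2} ((fn [x] (update x :a inc)))))
;=> ((fn [x] (update x :a inc)) {:a 1, :b 2})

(-> {:a 1 :b 2}
    ((fn [x] (update x :a inc))))
;=> {:a 2, :b 2}
```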
I don't think any of these solutions are the simplest. I would propose choosing one of the following:
A. Use the normal threading form:
(-> {:a 1, :b 2}
(update :a inc)) => {:a 2, :b 2}
Everyone is used to seeing this and can understand it easily. Since you have already rejected this approach, I assume you think the code is clearer by using a named parameter.
B. Use a named function
(defn updater [x] (update x :a inc))
(-> {:a 1, :b 2}
updater) => {:a 2, :b 2}
(-> {:a 1, :b 2}
(updater)) => {:a 2, :b 2}
This is more how the -> form was envisioned to work. I think the 2nd version is the clearest, as it is the most consistent where all function expressions have parentheses (single arg or multi-arg).
C. Consider using the it-> macro from the Tupelo Library:
(it-> {:a 1, :b 2}
(update it :a inc)) => {:a 2, :b 2}
Much like the named function, the expression is normal Clojure form without the "invisible" parameter silently inserted into the update expression. The pronoun it serves as the temporary placeholder for the threaded value (an idea copied from Groovy). Simple, explicit, and flexible, since the it can be in the first, last, or any other parameter location:
(it-> 1
(inc it) ; thread-first or thread-last
(+ it 3) ; thread-first
(/ 10 it) ; thread-last
(str "We need to order " it " items.")) ; middle of 3 arguments
;=> "We need to order 2 items."
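For comparison, clojure.core (1.5 and later) ships as->, which also names the threaded value explicitly and needs no extra dependency. The sketch below reuses the name it purely for illustration; any symbol works as the binding name.

```clojure
(as-> 1 it
  (inc it)                                  ; 2
  (+ it 3)                                  ; 5
  (/ 10 it)                                 ; 2
  (str "We need to order " it " items."))
;=> "We need to order 2 items."

;; Applied to the map from the question:
(as-> {:a 1, :b 2} m
  (update m :a inc))
;=> {:a 2, :b 2}
```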

How to merge maps and get a map of lists?

Let's say we have a list of maps. The maps all have the same keywords, but we don't know the keywords beforehand.
[{:a 1 :b 2} {:a 3 :b 4}]
And what would be the idiomatic way of merging this list into such a map:
{:a [1 3]
:b [2 4]}
Doesn't seem hard, however as I start to implement the function, it gets super ugly and repetitive. I have a feeling that there are much cleaner ways of achieving this.
Thank you
You can actually get a pretty elegant solution by using several functions from the standard library:
(defn consolidate [& ms]
(apply merge-with conj (zipmap (mapcat keys ms) (repeat [])) ms))
Example:
(consolidate {:a 1 :b 2} {:a 3 :b 4})
;=> {:a [1 3], :b [2 4]}
One cool thing about this solution is that it works even if the maps have different key sets.
I would rather use a double reduction to "merge" them with update:
(defn merge-maps-with-vec [maps]
(reduce (partial reduce-kv #(update %1 %2 (fnil conj []) %3))
{} maps))
user> (merge-maps-with-vec [{:a 1 :b 2} {:a 3 :b 4 :c 10}])
{:a [1 3], :b [2 4], :c [10]}
It is not as expressive as @Sam Estep's answer, but on the other hand it doesn't generate any intermediate sequences (such as the every-key-to-empty-vector map, which also needs one extra pass through every entry of every map). Of course, premature optimization is bad in general, but it won't hurt here, I guess. The reduce-based solution does look a bit more obscure, but once it is put into a library with proper docs it would not look as obscure to the end user (or to yourself a year later).
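To make the double reduction less obscure, the same logic can be spelled out with named parameters. This is only an unrolled sketch of the function above; merge-maps-with-vec-unrolled is a hypothetical name.

```clojure
(defn merge-maps-with-vec-unrolled [maps]
  (reduce (fn [acc m]                                ; outer pass: one map at a time
            (reduce-kv (fn [acc k v]                 ; inner pass: one entry at a time
                         ;; fnil supplies [] the first time a key is seen
                         (update acc k (fnil conj []) v))
                       acc
                       m))
          {}
          maps))

(merge-maps-with-vec-unrolled [{:a 1 :b 2} {:a 3 :b 4 :c 10}])
;=> {:a [1 3], :b [2 4], :c [10]}
```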
While many solutions are possible, here is one that uses some of the convenience functions in the Tupelo library:
(ns clj.core
(:use tupelo.core)
(:require [tupelo.schema :as ts]
[schema.core :as s] ))
(s/defn gather-keys
[list-of-maps :- [ts/KeyMap]]
(newline)
(let [keys-vec (keys (first list-of-maps))]
(s/validate [s/Keyword] keys-vec) ; verify it is a vector of keywords
(apply glue
(for [curr-key keys-vec]
{curr-key (forv [curr-map list-of-maps]
(get curr-map curr-key))} ))))
(deftest t-maps
(spyx
(gather-keys [{:a 1 :b 2}
{:a 3 :b 4} ] )))
(gather-keys [{:a 1, :b 2} {:a 3, :b 4}]) ;=> {:a [1 3], :b [2 4]}
Note that this solution assumes that each input map has an identical set of keys. Normally I'd want to enforce that assumption with a sanity check in the code as well.
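Such a sanity check could look roughly like the sketch below, written in plain Clojure without Tupelo or Schema. same-keys? is a hypothetical helper, not part of any library; it could be called in a :pre condition or an assert at the top of gather-keys.

```clojure
(defn same-keys?
  "Hypothetical helper: true when every map has exactly the same key set."
  [maps]
  (or (empty? maps)
      (apply = (map (comp set keys) maps))))

(same-keys? [{:a 1 :b 2} {:a 3 :b 4}]) ;=> true
(same-keys? [{:a 1 :b 2} {:a 3 :z 9}]) ;=> false
```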
Looking at the answer from Sam, I would rewrite it with some temporary variables to help document the sub-steps:
(defn consolidate-keys [list-of-maps]
(let [keys-set (set (mapcat keys list-of-maps))
base-result (zipmap keys-set (repeat [] )) ]
(apply merge-with conj base-result list-of-maps)))
(consolidate-keys [ {:a 1 :b 2}
{:a 3 :z 9} ] )
;=> {:z [9], :b [2], :a [1 3]}

clojure: given a list of maps, get the total sum value of a specific key value

Input: [{:a "ID1" :b 2} {:a "ID2" :b 4}]
I want to only add up all the keys :b and produce the following:
Result: 6
I thought about doing a filter? to pull all the numbers into vector and add it all up but this seems like doing work twice. I can't use merge-with + here since the :a has a string in it. Do I use a reduce here with a function that will pull the appropriate key?
(reduce (fn [x] (+ (x :b))) 0 list-of-maps)
It would be even nicer if I could retain the map structure with updated value ({:a "ID1" :b 6}) but since I don't really need the other keys, just the total sum is fine.
I want to only add up all the keys :b and produce the following:
Result: 6
I believe workable code is:
(def m1 {:a 1, :b 2})
(def m2 {:a 11, :b 12})
(def m3 {:a 21, :b 22})
(def ms [m1 m2 m3])
(->> ms
(map :b)
(reduce +))
I feel use of ->> here can help readability in your situation.
This says to take action on ms, which is defined to be a vector of maps, threading incremental results through the remaining forms.
The first thing is to transform each entry of ms using the keyword :b as a function on each, extracting the value corresponding to that key, resulting in the sequence:
(2 12 22)
You can then apply reduce exactly as you intuit across that seq to get the result:
user=> (def m1 {:a 1, :b 2})
#'user/m1
user=> (def m2 {:a 11, :b 12})
#'user/m2
user=> (def m3 {:a 21, :b 22})
#'user/m3
user=> (def ms [m1 m2 m3])
#'user/ms
user=> (->> ms
#_=> (map :b)
#_=> (reduce +))
36
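The reduce attempt in the question fails because a reducing function receives two arguments, the accumulated value and the current element, whereas (fn [x] ...) accepts only one. A minimal sketch of a corrected version on the question's input, followed by a transduce variant that avoids the intermediate sequence (list-of-maps is the name used in the question):

```clojure
(def list-of-maps [{:a "ID1" :b 2} {:a "ID2" :b 4}])

;; acc is the running total, m is the current map.
(reduce (fn [acc m] (+ acc (:b m))) 0 list-of-maps)
;=> 6

;; Same sum; the (map :b) transducer extracts the values on the fly.
(transduce (map :b) + 0 list-of-maps)
;=> 6
```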
I'm a tad confused by what you intend by this part of the question:
It would be even nicer if I could retain the map structure with updated value ({:a "ID1" :b 6})
Do you want to have each value for :b across all maps contain the sum of them all in a result, or something else?
(reduce + (map :b list-of-maps))
This is simple, but it works!
user=> (+ (your-map :b) (your-map :b))
or
user=> (def x [{:a "ID1" :b 2} {:a "ID2" :b 4}])
#'user/x
user=> (+ ((first x) :b) ((second x) :b))
6
user=>
or
user=> (+ ((nth x 0) :b) ((nth x 1) :b))
6

Saving+reading sorted maps to a file in Clojure

I'm saving a nested map of data to disk via spit. I want some of the maps inside my map to be sorted, and to stay sorted when I slurp the map back into my program. Sorted maps don't have a unique literal representation, so when I spit the map-of-maps onto disk, the sorted maps and the unsorted maps are represented the same, and #(read-string (slurp %))ing the data makes every map the usual unsorted type. Here's a toy example illustrating the problem:
(def sorted-thing (sorted-map :c 3 :e 5 :a 1))
;= #'user/sorted-thing
(spit "disk" sorted-thing)
;= nil
(def read-thing (read-string (slurp "disk")))
;= #'user/read-thing
(assoc sorted-thing :b 2)
;= {:a 1, :b 2, :c 3, :e 5}
(assoc read-thing :b 2)
;= {:b 2, :a 1, :c 3, :e 5}
Is there some way to read the maps in as sorted in the first place, rather than converting them to sorted maps after reading? Or is this a sign that I should be using some kind of real database?
The *print-dup* dynamically rebindable Var is meant to support this use case:
(binding [*print-dup* true]
(prn (sorted-map :foo 1)))
; #=(clojure.lang.PersistentTreeMap/create {:foo 1})
The commented out line is what gets printed.
It so happens that it also affects str when applied to Clojure data structures, and therefore also spit, so if you do
(binding [*print-dup* true]
(spit "foo.txt" (sorted-map :foo 1)))
the map representation written to foo.txt will be the one displayed above.
Admittedly, I'm not 100% sure whether this is documented somewhere; if you feel uneasy about this, you could always spit the result of using pr-str with *print-dup* bound to true:
(binding [*print-dup* true]
(pr-str (sorted-map :foo 1)))
;= "#=(clojure.lang.PersistentTreeMap/create {:foo 1})"
(This time the last line is the value returned rather than printed output.)
Clearly you'll have to have *read-eval* bound to true to be able to read back these literals. That's fine though, it's exactly the purpose it's meant to serve (reading code from trusted sources).
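Putting the pieces together, a round trip through a string might look like the sketch below. It uses pr-str and read-string rather than spit and slurp so that it runs without touching the disk; the names text and read-back are only illustrative. Only read data from trusted sources this way, since #= forms are evaluated by the reader.

```clojure
;; Serialize with *print-dup* so the concrete map type is recorded.
(def text
  (binding [*print-dup* true]
    (pr-str (sorted-map :c 3 :e 5 :a 1))))

;; Read it back; *read-eval* must be true for the #= form to be accepted.
(def read-back
  (binding [*read-eval* true]
    (read-string text)))

(sorted? read-back)
;=> true

(keys (assoc read-back :b 2))
;=> (:a :b :c :e)
```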
I don't think it's necessarily a sign that you should be using a database, but I do think it's a sign that you shouldn't be using spit directly. When you write your sorted maps to disk, don't use the map literal syntax. If you write it out in the following format, read-string followed by eval will work:
(def sorted-thing (eval (read-string "(sorted-map :c 3 :e 5 :a 1)")))
(assoc sorted-thing :b 2)
;= {:a 1, :b 2, :c 3, :e 5}
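One way to produce that format from an existing map is sketched below. sorted-map-form is a hypothetical helper, and the sketch assumes a flat map whose keys and values evaluate to themselves (keywords, numbers, strings); nested sorted maps or symbols would need extra handling, and eval should only ever see trusted input.

```clojure
(defn sorted-map-form
  "Hypothetical helper: renders a flat sorted map as a (sorted-map k v ...) form."
  [m]
  (pr-str (list* 'sorted-map (apply concat m))))

(def text (sorted-map-form (sorted-map :c 3 :e 5 :a 1)))
text
;=> "(sorted-map :a 1 :c 3 :e 5)"

(def restored (eval (read-string text)))
(sorted? restored)
;=> true
```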