How can I find the difference in 2 data sets? - clojure

Suppose I have 2 pipe-delimited files containing bookmark data. How can I read in the data and then determine the difference between the two sets of data?
Input Set #1: bookmarks.csv
2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|
4|news.ycombinator.com|News|Tech News
5|bing.com|Search|The contender
Input Set #2: bookmarks2.csv
1|www.google.com|Search|The King of Search
2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|New Comment
4|news.ycombinator.com|News|Tech News
Output
Id #1 is missing in set #1
Id #5 is missing in set #2
Id #3 is different:
->www.msnbc.com|Search|
->www.msnbc.com|Search|New Comment

(use '[clojure.contrib str-utils duck-streams pprint]
'[clojure set])
(defn read-bookmarks [filename]
(apply hash-map
(mapcat #(re-split #"\|" % 2)
(read-lines filename))))
(defn diff-bookmarks [filename1 filename2]
(let [f1 (read-bookmarks filename1)
f2 (read-bookmarks filename2)
k1 (set (keys f1))
k2 (set (keys f2))
missing-in-1 (difference k2 k1)
missing-in-2 (difference k1 k2)
present-but-different (filter #(not= (f1 %) (f2 %))
(intersection k1 k2))]
(cl-format nil "~{Id #~a is missing in set #1~%~}~{Id #~a is missing in set #2~%~}~{~{Id #~a is different~% -> ~a~% -> ~a~%~}~}"
missing-in-1
missing-in-2
(map #(list % (f1 %) (f2 %))
present-but-different))))
(print (diff-bookmarks "bookmarks.csv" "bookmarks2.csv"))
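For readers without clojure.contrib, here is a minimal sketch of the same idea using only clojure.string and clojure.set. The names parse-bookmarks and diff-maps are hypothetical, and the sketch assumes every non-blank line contains at least one pipe. It returns data rather than a formatted string.

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

(defn parse-bookmarks
  "Turns pipe-delimited text into a map of id -> rest of the line.
  Assumes every non-blank line contains at least one pipe."
  [text]
  (into {}
        (for [line (str/split-lines text)
              :when (not (str/blank? line))]
          ;; limit 2: split only on the first pipe
          (str/split line #"\|" 2))))

(defn diff-maps
  "Returns the ids missing from each side and the shared ids whose values differ."
  [a b]
  (let [ka (set (keys a))
        kb (set (keys b))]
    {:missing-in-1 (set/difference kb ka)
     :missing-in-2 (set/difference ka kb)
     :different    (set (filter #(not= (a %) (b %))
                               (set/intersection ka kb)))}))

(def set1
  (parse-bookmarks
    "2|www.cnn.com|News|This is CNN\n3|www.msnbc.com|Search|\n4|news.ycombinator.com|News|Tech News\n5|bing.com|Search|The contender"))

(def set2
  (parse-bookmarks
    "1|www.google.com|Search|The King of Search\n2|www.cnn.com|News|This is CNN\n3|www.msnbc.com|Search|New Comment\n4|news.ycombinator.com|News|Tech News"))

(diff-maps set1 set2)
;; => {:missing-in-1 #{"1"}, :missing-in-2 #{"5"}, :different #{"3"}}
```

To work from files, something like (parse-bookmarks (slurp "bookmarks.csv")) should do.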

Here is my stab at a functional-ish approach to the problem:
Create 2 maps, one for each file
Find the missing items between the two maps, using dissoc
Find the different, but shared items between the two maps, using intersection and filter
Code
(ns diffset
  (:use [clojure.contrib.duck-streams]
        [clojure.set]))
(def file1 "bookmarks.csv")
(def file2 "bookmarks2.csv")
(defn split-record
  "split line into (id, bookmark)"
  [line]
  (map #(apply str %)
       (split-with #(not (= % \|)) line)))
(defn map-from-file
  "create initial map from file f"
  [f]
  (with-open [r (reader f)]
    (doall (apply hash-map (apply concat (map split-record
                                              (line-seq r)))))))
(defn missing
  "return seq of all ids in x that are not in y"
  [x y]
  (keys (apply dissoc x (keys y))))
(defn different
  "return seq of all ids that match but have different bookmark string"
  [x y]
  (let [match-keys (intersection (set (keys x)) (set (keys y)))]
    (filter #(not (= (get x %)
                     (get y %)))
            match-keys)))
(defn diff
  "print out differences between two bookmark files"
  [file1 file2]
  (let [[s1 s2] (map map-from-file [file1 file2])]
    (dorun (map #(println (format "Id #%s is missing in set #1" %))
                (missing s2 s1)))
    (dorun (map #(println (format "Id #%s is missing in set #2" %))
                (missing s1 s2)))
    (dorun (map #(println (format "Id #%s is different:" %) "\n"
                          " ->" (get s1 %) "\n"
                          " ->" (get s2 %)) (different s1 s2)))))
Result
user> (use 'diffset)
nil
user> (diff file1 file2)
Id #1 is missing in set #1
Id #5 is missing in set #2
Id #3 is different:
-> |www.msnbc.com|Search|
-> |www.msnbc.com|Search|New Comment
nil

Split each line with a regexp and make a set out of the results with (set (re-seq ...)), then call (difference set1 set2) to find the things that are in set 1 and not in set 2. Reverse the arguments to find the items in set 2 that are not in set 1.
Look at http://clojure.org/data_structures for more info on Clojure sets.

Put the first data set in a dictionary (hashtable) with the id as key.
Then read the next data set line by line and look up each id in the hash.
If the id is not in the hash, output: id missing in set 1.
If the value in the hash differs, output: id is different.
Store the ids in a second hashtable.
Then run through the keys of the first hashtable
and check if they are also in the second hashtable. If not, output: id is missing in set 2.
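The steps above could be sketched in Clojure roughly as follows. The function name report-differences is hypothetical, a set stands in for the "second hashtable", and each line is assumed to look like "id|rest".

```clojure
(require '[clojure.string :as str])

(defn report-differences
  "lines1 and lines2 are seqs of \"id|rest\" strings.
  Returns a vector of messages describing how the two data sets differ."
  [lines1 lines2]
  (let [parse  #(str/split % #"\|" 2)
        ;; first data set goes into a map keyed by id
        table1 (into {} (map parse lines1))
        ;; walk the second data set, collecting messages and the ids seen
        [msgs seen]
        (reduce (fn [[msgs seen] line]
                  (let [[id value] (parse line)]
                    [(cond
                       (not (contains? table1 id))
                       (conj msgs (str "Id #" id " is missing in set #1"))

                       (not= value (table1 id))
                       (conj msgs (str "Id #" id " is different"))

                       :else msgs)
                     (conj seen id)]))
                [[] #{}]
                lines2)]
    ;; any id of the first table that was never seen is missing in set 2
    (into msgs
          (for [id (keys table1)
                :when (not (seen id))]
            (str "Id #" id " is missing in set #2")))))

(report-differences ["2|a" "3|b" "5|c"]
                    ["1|x" "2|a" "3|B"])
;; => ["Id #1 is missing in set #1" "Id #3 is different" "Id #5 is missing in set #2"]
```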

Related

Get key by first element in value list in Clojure

This is similar to Clojure get map key by value
However, there is one difference. How would you do the same thing if hm is like
{1 ["bar" "choco"]}
The idea is to get 1 (the key) where the first element of the value list is "bar". Please feel free to close/merge this question if some other question answers it.
I tried something like this, but it doesn't work.
(def hm {:foo ["bar", "choco"]})
(keep #(when (= ((nth val 0) %) "bar")
(key %))
hm)
You can filter the map and return the first element of the first item in the resulting sequence:
(ffirst (filter (fn [[k [v & _]]] (= "bar" v)) hm))
You can destructure the vector value to access the second and/or third elements, e.g.
(ffirst (filter (fn [[k [f s t & _]]] (= "choco" s))
{:foo ["bar", "choco"]}))
Past the first few elements, you will probably find nth more readable.
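A possible nth-based variant, as a sketch only. The helper name key-where-nth is hypothetical. Passing nil as the not-found argument to nth avoids an exception when a value vector is shorter than the index.

```clojure
(defn key-where-nth
  "Returns the key of the first entry whose value has `wanted` at position idx."
  [m idx wanted]
  (ffirst (filter (fn [[_ v]] (= wanted (nth v idx nil))) m)))

(key-where-nth {:foo ["bar" "choco"]} 1 "choco")
;; => :foo
```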
Another way to do it using some:
(some (fn [[k [v & _]]] (when (= "bar" v) k)) hm)
Your example was pretty close to working, with some minor changes:
(keep #(when (= (nth (val %) 0) "bar")
(key %))
hm)
keep and some are similar, but keep returns a sequence of every non-nil result, while some returns only the first truthy one.
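A small illustration of that difference, using a hypothetical map (hm2) in which two entries match, and a hypothetical helper bar-key:

```clojure
;; hm2 and bar-key are made up for this example
(def hm2 {1 ["bar" "choco"], 2 ["bar" "vanilla"], 3 ["baz" "mint"]})

(defn bar-key
  "Returns the key of a map entry whose value starts with \"bar\", else nil."
  [[k [v]]]
  (when (= "bar" v) k))

;; keep collects every non-nil result (sorted here because map order is not guaranteed)
(sort (keep bar-key hm2))
;; => (1 2)

;; some stops at the first truthy result, so it yields just one of the matching keys
(some bar-key hm2)
```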
In addition to all the above (correct) answers, you may also want to reindex your map into the desired form, especially if the search operation is called frequently and the initial map is rather big. This would reduce the search complexity from linear to constant:
(defn map-invert+ [kfn vfn data]
(reduce (fn [acc entry] (assoc acc (kfn entry) (vfn entry)))
{} data))
user> (def data
{1 ["bar" "choco"]
2 ["some" "thing"]})
#'user/data
user> (def inverted (map-invert+ (comp first val) key data))
#'user/inverted
user> inverted
;;=> {"bar" 1, "some" 2}
user> (inverted "bar")
;;=> 1

Simple "R-like" melt : better way to do?

Today I tried to implement an "R-like" melt function. I use it for big data coming from BigQuery.
I do not have big constraints about time to compute and this function takes less than 5-10 seconds to work on millions of rows.
I start with this kind of data :
(def sample
'({:list "123,250" :group "a"} {:list "234,260" :group "b"}))
Then I defined a function to put the list into a vector :
(defn split-data-rank [datatab value]
(let [splitted (map (fn[x] (assoc x value (str/split (x value) #","))) datatab)]
(map (fn[y] (let [index (map inc (range (count (y value))))]
(assoc y value (zipmap index (y value)))))
splitted)))
Launch :
(split-data-rank sample :list)
As you can see, it returns the same sequence but it replaces :list by a map giving the position in the list of each item in quoted list.
Then, I want to melt the "dataframe" by creating for each item in a group its own row with its rank in the group.
So that I created this function :
(defn split-melt [datatab value]
(let [splitted (split-data-rank datatab value)]
(map (fn [y] (dissoc y value))
(apply concat
(map
(fn[x]
(map
(fn[[k v]]
(assoc x :item v :Rank k))
(x value)))
splitted)))))
Launch :
(split-melt sample :list)
The problem is that it is heavily indented and uses map a lot. I apply dissoc to drop :list (which is useless now), and I also have to use concat because without it I get a sequence of sequences.
Do you think there is a more efficient/shorter way to design this function?
I am quite confused by reduce and do not know whether it can be applied here, since there are two arguments in a way.
Thanks a lot !
If you don't need the split-data-rank function, I would go for:
(defn melt [datatab value]
(mapcat (fn [x]
(let [items (str/split (get x value) #",")]
(map-indexed (fn [idx item]
(-> x
(assoc :Rank (inc idx) :item item)
(dissoc value)))
items)))
datatab))
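The same flattening can also be written as a for comprehension, which some find easier to read than nested map calls. This is only a sketch, and the name melt-for is hypothetical. It assumes clojure.string is required as str, as in the question.

```clojure
(require '[clojure.string :as str])

(defn melt-for
  "Emits one row per comma-separated item under key `value`,
  adding :Rank (1-based position) and :item, and dropping `value`."
  [datatab value]
  (for [row datatab
        [idx item] (map-indexed vector (str/split (get row value) #","))]
    (-> row
        (dissoc value)
        (assoc :Rank (inc idx) :item item))))

(def sample
  '({:list "123,250" :group "a"} {:list "234,260" :group "b"}))

(melt-for sample :list)
;; => ({:group "a", :Rank 1, :item "123"} {:group "a", :Rank 2, :item "250"}
;;     {:group "b", :Rank 1, :item "234"} {:group "b", :Rank 2, :item "260"})
```

The second binding in the for plays the role of mapcat: each row expands into as many output rows as it has items.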

How could I write a function call once with an nested let in an if-let?

I have these functions:
(def i (atom {})) ;incremented/calculated file stats
(defn updatei [n fic fos]
(swap! i conj {(keyword n)[fic fos]}))
(defn calcfo [fo fi fis]
(if-let [n (get @i fo)] ;find a previous record?
(let [fic (inc (first n)), fos (+ fis (second n))] ;increment the stats
(updatei fo fic fos))
(let [fic 1, fos fis] ;if not then: first calc recorded
(updatei fo fic fos))))
How could I write (updatei fo fic fos) once, instead of having it listed twice in the function? Is there a secret or-let I am unaware of?
-Hypothetical code-
(defn calcfo [fo fi fis]
(if-let [n (get @i fo)] ;find a previous record?
(let [fic (inc (first n)), fos (+ fis (second n))] ;increment the stats
(or-let [fic 1, fos fis] ;if not then: first calc recorded
(updatei fo fic fos)))))
Or am I thinking of this too imperatively versus functionally?
EDIT:
I decided this made the most sense to me:
(defn calcfo [fo fis]
(apply updatei fo
(if-let [[rfc rfos] (get @i fo)] ;find a previous record?
[(inc rfc) (+ rfos fis)] ;increment the stats
[1 fis]) ;if not then: first calc recorded
))
Thanks for the great answers!
A rearrangement might help
(defn calcfo [fo fi fis]
(apply updatei fo
(if-let [n (get @i fo)]
[(inc (first n)), (+ fis (second n))]
[1, fis] )))
What about using an if and then destructuring? Here's an approach:
(defn calcfo [fo fi fis]
(let [n (get @i fo) ;find a previous record?
[fic fos] (if n
[(-> n first inc) (-> n second (+ fis))] ;increment the stats
[1 fis])] ;if not then: first calc recorded
(updatei fo fic fos)))
The argument fi doesn't seem to be used, so maybe you could remove it from the argument list.
(defn calcfo [fo fis] ,,,)
The usage of first and second could also be avoided with the use of destructuring when binding n in the let form:
(defn calcfo [fo fis]
(let [[x y & _] (get @i fo)
[fic fos] (if x [(inc x) (+ fis y)] [1 fis])]
(updatei fo fic fos)))
I think you would sidestep the whole problem and make your code better if you rewrote updatei, something like:
(defn- safe+ [a b]
(if a (if b (+ a b) a) b))
(defn updatei [n fic fos]
(swap! i update-in [(keyword n)] #(vector (safe+ fic (first %)) (safe+ fos (second %)))))
There may be a better way to write that code, but the basic idea is to use update-in to either store the new values (if nothing was stored for that key before), or combine them with what is already there.
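One possible variation on that idea, sketched with update and fnil so the "no previous record" case is covered by a default value. The names stats and record-file! are hypothetical and not from the question. update requires Clojure 1.7 or later.

```clojure
;; stats and record-file! are illustrative names only
(def stats (atom {}))

(defn record-file!
  "Adds one file of size fsize to the [count total-size] pair stored under k."
  [k fsize]
  (swap! stats update k
         ;; fnil substitutes [0 0] when nothing is stored under k yet
         (fnil (fn [[cnt total]] [(inc cnt) (+ total fsize)])
               [0 0])))

(record-file! :txt 100)
(record-file! :txt 50)
@stats
;; => {:txt [2 150]}
```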

How do I convert list of strings to list of doubles in Clojure?

How can I convert the values of 'mymap' to a list of Doubles instead of a list of Strings, at the same time as mymap is created?
(use '[clojure.string :only (join split)])
;(def raw-data (slurp "http://ichart.finance.yahoo.com/table.csv?s=INTC"))
;Downloaded and removed the first line
(def raw-data (slurp "table-INTC.csv"))
(def raw-vector-list
(map
#(split % #",") ; anonymous map function to split by comma
(split raw-data #"\n"))) ; split raw data by new line
(pr (take 1 raw-vector-list))
(def mymap
(zipmap
;construct composite key out of symbol and date which is head of the list
(map #(str "INTC-" %) (map first raw-vector-list))
;How do i convert these values to Double instead of Strings?
(map rest raw-vector-list)))
(pr (take 1 mymap))
(def mymap
(zipmap
(map #(str "NAT-" %) (map first raw-vector-list))
(map #(map (fn [v] (Double/parseDouble v)) %)
(map rest raw-vector-list))))
(pprint (take 1 mymap))
-> (["NAT-1991-09-30" (41.75 42.25 41.25 42.25 3.62112E7 1.03)])
Another version
(def mymap
(map (fn [[date & values]]
[(str "NAT-" date)
(map #(Double/parseDouble %) values)])
;; Drop first non-parsable element in raw-vector-list
;; ["Date" "Open" "High" "Low" "Close" "Volume" "Adj Close"]
(drop 1 raw-vector-list)))
So, for the tail/rest portion of this data, you are mapping an anonymous function (itself a map) over the list of string lists, and that inner map applies the type conversion to the elements of each sublist.
(def mymap
(zipmap
(map #(str "NAT-" %) (map first raw-vector-list))
(map #(map (fn [v] (Double/parseDouble v)) %)
(map rest raw-vector-list))))
How can I pull the type conversion out into a function like the one below, and then use that function?
(defn str-to-dbl [n] (Double/parseDouble n))
This code complains about nested #'s.
(def mymap
(zipmap
(map #(str "NAT-" %) (map first raw-vector-list))
(map #(map #(str-to-double %)
(map rest raw-vector-list))))
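The reader rejects a #() literal nested inside another #() literal. One way around it, as a sketch, is to pass the named function directly to the inner map so only one #() remains. The rows below are hypothetical sample data standing in for the parsed CSV.

```clojure
(defn str-to-dbl [n] (Double/parseDouble n))

;; Hypothetical sample rows; the real data comes from the CSV file
(def raw-vector-list
  [["1991-09-30" "41.75" "42.25" "41.25" "42.25" "36211200" "1.03"]
   ["1991-10-01" "42.25" "42.75" "41.50" "42.00" "30000000" "1.02"]])

(def mymap
  (zipmap
    (map #(str "NAT-" %) (map first raw-vector-list))
    ;; only one #() here: the inner function is passed by name
    (map #(map str-to-dbl %) (map rest raw-vector-list))))

(mymap "NAT-1991-09-30")
;; => (41.75 42.25 41.25 42.25 3.62112E7 1.03)
```

(partial map str-to-dbl) or an explicit (fn [row] ...) in place of the outer #() would work as well.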

clojure - ordered pairwise combination of 2 lists

Being quite new to Clojure, I am still struggling with its functions. If I have 2 lists, say "1234" and "abcd", I need to make all possible ordered lists of length 4. The output I want for length 4 is:
("1234" "123d" "12c4" "12cd" "1b34" "1b3d" "1bc4" "1bcd"
"a234" "a23d" "a2c4" "a2cd" "ab34" "ab3d" "abc4" "abcd")
which is 2^n in number, where n is the length of the inputs.
I have written the following function to generate a single string/list by random walk.
The argument [par] would be something like ["1234" "abcd"]
(defn make-string [par] (let [c1 (first par) c2 (second par)] ;version 3 0.63 msec
(apply str (for [loc (partition 2 (interleave c1 c2))
:let [ch (if (< (rand) 0.5) (first loc) (second loc))]]
ch))))
The output will be 1 of the 16 ordered lists above. Each of the two input lists will always have equal length, say 2, 3, 4, 5, up to say 2^38 or within available RAM. I have tried to modify the above function to generate all ordered lists but failed. Hopefully someone can help me. Thanks.
Mikera is right that you need to use recursion, but you can do this while being both more concise and more general - why work with two strings, when you can work with N sequences?
(defn choices [colls]
(if (every? seq colls)
(for [item (map first colls)
sub-choice (choices (map rest colls))]
(cons item sub-choice))
'(())))
(defn choose-strings [& strings]
(for [chars (choices strings)]
(apply str chars)))
user> (choose-strings "123" "abc")
("123" "12c" "1b3" "1bc" "a23" "a2c" "ab3" "abc")
This recursive nested-for is a very useful pattern for creating a sequence of paths through a "tree" of choices. Whether there's an actual tree, or the same choice repeated over and over, or (as here) a set of N choices that don't depend on the previous choices, this is a handy tool to have available.
You can also take advantage of the cartesian-product from the clojure.math.combinatorics package, although this requires some pre- and post-transformation of your data:
(ns your-namespace (:require clojure.math.combinatorics))
(defn str-combinations [s1 s2]
(->>
(map vector s1 s2) ; regroup into pairs of characters, indexwise
(apply clojure.math.combinatorics/cartesian-product) ; generate combinations
(map (partial apply str)))) ; glue seqs-of-chars back into strings
> (str-combinations "abc" "123")
("abc" "ab3" "a2c" "a23" "1bc" "1b3" "12c" "123")
The trick is to make the function recursive, calling itself on the remainder of the list at each step.
You can do something like:
(defn make-all-strings [string1 string2]
(if (empty? string1)
[""]
(let [char1 (first string1)
char2 (first string2)
following-strings (make-all-strings (next string1) (next string2))]
(concat
(map #(str char1 %) following-strings)
(map #(str char2 %) following-strings)))))
(make-all-strings "abc" "123")
=> ("abc" "ab3" "a2c" "a23" "1bc" "1b3" "12c" "123")
(defn combine-strings [a b]
(if (seq a)
(for [xs (combine-strings (rest a) (rest b))
x [(first a) (first b)]]
(str x xs))
[""]))
Now that I have written it, I realize it's a less generic version of amalloy's.
You could also use the binary digits of the numbers from 0 to 15 to form your combinations:
if a bit is zero select from the first string otherwise the second.
E.g. 6 = 2r0110 => "1bc4", 13 = 2r1101 => "ab3d", etc.
(map (fn [n] (apply str (map #(%1 %2)
(map vector "1234" "abcd")
(map #(if (bit-test n %) 1 0) [3 2 1 0])))); binary digits
(range 0 16))
=> ("1234" "123d" "12c4" "12cd" "1b34" "1b3d" "1bc4" "1bcd" "a234" "a23d" "a2c4" "a2cd" "ab34" "ab3d" "abc4" "abcd")
The same approach can apply to generating combinations from more than 2 strings.
Say you have 3 strings ("1234" "abcd" "ABCD"), there will be 81 combinations (3^4). Using base-3 ternary digits:
(defn ternary-digits [n] (reverse (map #(mod % 3) (take 4 (iterate #(quot % 3) n)))))
(map (fn [n] (apply str (map #(%1 %2)
(map vector "1234" "abcd" "ABCD")
(ternary-digits n))))
(range 0 81))
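The digit-based idea could be generalized to any number of equal-length strings along these lines. This is a sketch only, and base-digits and digit-combinations are hypothetical names. The base is the number of strings and the digit count is their shared length.

```clojure
(defn base-digits
  "Returns `width` digits of n in the given base, most significant first."
  [n base width]
  (reverse (map #(mod % base)
                (take width (iterate #(quot % base) n)))))

(defn digit-combinations
  "All position-wise combinations of the given equal-length strings."
  [& strings]
  (let [base    (count strings)
        width   (count (first strings))
        ;; one vector of candidate characters per position
        columns (apply map vector strings)
        ;; base^width combinations in total
        total   (reduce * (repeat width base))]
    (for [n (range total)]
      ;; digit i selects which string supplies the character at position i
      (apply str (map nth columns (base-digits n base width))))))

(digit-combinations "12" "ab")
;; => ("12" "1b" "a2" "ab")
```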
(def c1 "1234")
(def c2 "abcd")
(defn make-string [c1 c2]
(map #(apply str %)
(apply map vector
(map (fn [col rep]
(take (math/expt 2 (count c1))
(cycle (apply concat
(map #(repeat rep %) col)))))
(map vector c1 c2)
(iterate #(* 2 %) 1)))))
(make-string c1 c2)
=> ("1234" "a234" "1b34" "ab34" "12c4" "a2c4" "1bc4" "abc4" "123d" "a23d" "1b3d" "ab3d" "12cd" "a2cd" "1bcd" "abcd")