I have 2 CSV files: F1 with around 22K records and F2 with around 50K records, both containing company name and address information. I need to do a fuzzy match on name, address, and phone. Each record in F1 needs to be fuzzy-matched against each record in F2. I have made a third CSV file, R3, containing the rules for fuzzy-matching: which column from F1 maps to which column in F2, with a fuzzy-tolerance level. I am trying to do it with a for loop, this way:
(for [j f1-rows
      h f2-rows
      r r3-rows
      :while (match-row j h r)]
  (merge j h))
(defn match-row [j h rules]
  (every?
    identity
    (map (fn [rule]
           (<= (fuzzy/jaccard
                 ((keyword (first rule)) j)
                 ((keyword (second rule)) h))
               (nth rule 2)))
         rules)))
f1-rows and f2-rows are collections of maps. rules is a collection of sequences, each containing a column name from F1, a column name from F2, and the tolerance level. The code is running and functioning as expected, but my problem is that it takes around 2 hours to execute. I read about how transducers help improve performance by eliminating intermediate sequences, but I am not able to visualize how I would apply that in my case. Any thoughts on how I can make this better/faster?
:while vs :when
Your use of :while in this case doesn't seem to agree with your stated problem. Your for expression will keep going while match-row returns true, and stop altogether at the first false result. :when will iterate through all the combinations and include only the ones where match-row is true in the resulting lazy seq. The difference is explained here.
For example:
(for [i (range 10)
      j (range 10)
      :while (= i j)]
  [i j]) ;=> ([0 0])
(for [i (range 10)
      j (range 10)
      :when (= i j)]
  [i j]) ;=> ([0 0] [1 1] [2 2] [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9])
It's really strange that your code kept running for 2 hours, because with :while that would mean every invocation of (match-row j h r) returned true during those two hours and only the very last one returned false. I would check the result again to see if it really makes sense.
How long should it take?
Let's first do some back-of-the-napkin math. If you want to compare every one of 22k records with every one of 50k records, you're going to be doing 22k * 50k comparisons; there's no way around that.
22k * 50k = 1,100,000,000
That's a big number!
What's the cost of a comparison?
From a half-a-minute glance at Wikipedia, the Jaccard index is a measure of similarity between sets.
The following will do for a ballpark estimate of the cost, though it's probably very much on the low end.
(time (clojure.set/difference (set "foo") (set "bar")))
That takes about a tenth of a millisecond on my computer.
(/ (* 22e3 50e3) ;; number of comparisons
   10            ;; -> milliseconds (a tenth of a ms each)
   1000          ;; -> seconds
   60            ;; -> minutes
   60)           ;; -> hours
;=> ~30.6
That's over 30 hours. And that's with a low-end estimate of the individual cost, and not counting the fact that you want to compare name, address and phone on each one (?). So that's ~30 hours if every comparison fails at the first rule, and ~90 hours if they all get to the last one.
Before any micro-optimizations, you need to work on the algorithm, by finding some clever way of avoiding more than a billion comparisons. If you want help with that, you need to at least supply some sample data.
Nitpicks
The indentation of the anonymous fn inside match-row is confusing. I'd use an editor that indents automatically and stick to that 99% of the time, because Lisp programmers read nesting by the indentation, like Python. There are some slight differences between editors/auto-indenters, but they're all consistent with the nesting.
(defn match-row [j h rules]
  (every?
    identity
    (map (fn [rule]
           (<= (fuzzy/jaccard
                 ((keyword (first rule)) j)
                 ((keyword (second rule)) h))
               (nth rule 2)))
         rules)))
Also, match-row needs to be defined before it is used (which it probably is in your actual code, seeing as it compiles).
22k x 50k is over 1 billion combinations. Multiply by 3 fuzzy rules and you're at over 3 billion calculations, most of which are wasted.
The only way to speed it up is to do some pre-sorting or other pre-trimming of the combinations. For example, only do the fuzzy calculation if the zipcodes are identical. If you waste time trying to match people from N.Y. and Florida, you are throwing away 99% or more of your work.
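As a sketch of that idea (hypothetical: it assumes each row map has a :zip key, and takes the match predicate as an argument), you could bucket F2 by zipcode once, so each F1 row is only compared against its own bucket:

```clojure
;; Hypothetical sketch: bucket F2 rows by zipcode so each F1 row is
;; fuzzy-matched only against candidates that share its zip.
(defn candidate-matches [f1-rows f2-rows match?]
  (let [by-zip (group-by :zip f2-rows)]   ; one pass over F2
    (for [j f1-rows
          h (get by-zip (:zip j) [])      ; usually a small bucket, not 50k rows
          :when (match? j h)]
      (merge j h))))
```

With reasonably selective zipcodes, this turns the ~1.1 billion pairwise comparisons into a few million.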
Related
Is there a way to prefer certain paths over others during solving? This is really a performance question. When I put all my logic together, it generates 1000s of solutions, and this takes exponentially increasing time. These really are all valid solutions, so I could do something like (run 1 …) instead of (run* …), but this gives me an arbitrary solution. What I want is to be able to provide some hints about which paths are better.
I know, I could get the one best answer by sorting them with a custom comparator, but this doesn’t help the performance problem.
Here’s a simplified, contrived example:
(require
  '[clojure.core.logic :refer :all]
  '[clojure.core.logic.fd :as fd])
(defn multipleo
  [multiple value domain]
  (fresh [n]
    (fd/in multiple domain)
    (fd/in n (fd/interval 1 10))
    (fd/* n multiple value)))
(run* [q]
  (multipleo q 60 (fd/domain 30 24 15 12)))
=> (12 15 30)
12, 15 and 30 are all valid solutions, but the one I want is the largest (-> *1 sort last). Again, I want the solver itself to do it, so that (run 1 [q] (multipleo q 60 (fd/domain 30 24 15 12))) would ideally produce (30).
Expanding on #amlloy's suggestion, I can try this:
(defn multipleo
  [multiple value]
  (fresh [n]
    (conde
      [(== multiple 6)]
      [(== multiple 3)])
    (fd/in n (fd/interval 1 10))
    (fd/* n multiple value)))
(run* [q] (multipleo q 12))
=> (6 3)
This seems to work. As far as I can tell, ordering the domain in fd/in has no impact. But if I move the domain entries into a conde in the order that I prefer, that works. Without the conde, the above code would produce (3 6) instead. However, this is much slower than the fd/in approach. I guess fd/in does some nice performance tricks that plain conde doesn't.
I also tried condu, but this doesn't work as I would expect.
(defn multipleo
  [multiple value]
  (fresh [n]
    (condu
      [(== multiple 6)]
      [(== multiple 3)])
    (fd/in n (fd/interval 1 10))
    (fd/* n multiple value)))
(run* [q] (multipleo q 3))
=> ()
I would have expected the first condu group to fail, since the overall logic cannot succeed with multiple=6 in this example. Can anyone help me understand why that doesn't work as I expect?
I am a beginner to functional programming and the Clojure programming language, and I'm resorting to recur for pretty much everything. I have a dataset in CSV, imported as a map. I have extracted the info I need to use as vectors. Each column is a vector ([1 5 10 8 3 2 1 ...]) and I want to calculate the mean of each column. I wrote the following function:
(defn mean
  "Calculate the mean for each column"
  ([columns]
   (mean columns []))
  ([columns means]
   (if (empty? columns)
     means
     (recur (rest columns)
            (conj means (float (/ (reduce + (first columns))
                                  (count (first columns)))))))))
;; Calculate the mean for the following vectors
(mean [[1 2 3] [1 2 3] [1 2 3]])
; => [2.0 2.0 2.0]
Is this a functional way of solving this problem?
I'd break it down a little further and use map instead of for. I personally like having many smaller functions:
(defn mean [row]
  (/ (apply + row) (count row)))

(defn mean-rows [rows]
  (map mean rows))
But this is the same general idea as #Alan's answer.
The way you're doing it is already considered "functional". It should be said, though, that while using recur is fine, you can usually achieve what you need more easily with reduce, or at least map. These options eliminate the need for explicit recursion and generally lead to simpler, easier-to-understand code.
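For instance, here is a sketch of the same column-means computation with mapv instead of explicit recursion (mean-columns is a hypothetical name, to avoid clashing with the original mean):

```clojure
;; Sketch: the same column means as the recur version, via mapv.
(defn mean-columns [columns]
  (mapv (fn [col] (float (/ (reduce + col) (count col))))
        columns))

(mean-columns [[1 2 3] [1 2 3] [1 2 3]])
;=> [2.0 2.0 2.0]
```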
Here is a simple answer:
(defn mean
  "Calculate the mean of each column"
  [cols]
  (vec                   ; this is optional if you don't want it
    (for [col cols]
      (/                 ; ratio of num/denom
        (reduce + col)   ; calculates the numerator
        (count col)))))  ; the number of items
(mean [[1 2 3] [1 2 3] [1 2 3]]) => [2 2 2]
If you haven't seen it yet, you can get started here: https://www.braveclojure.com/
I recommend buying the printed version of the book, as it has more than the online version.
Updating a vector works fine:
(update [{:idx :a} {:idx :b}] 1 (fn [_] {:idx "Hi"}))
;; => [{:idx :a} {:idx "Hi"}]
However trying to do the same thing with a list does not work:
(update '({:idx :a} {:idx :b}) 1 (fn [_] {:idx "Hi"}))
;; => ClassCastException clojure.lang.PersistentList cannot be cast to clojure.lang.Associative clojure.lang.RT.assoc (RT.java:807)
Exactly the same problem exists for assoc.
I would like to do update and overwrite operations on lazy types rather than vectors. What is the underlying issue here, and is there a way I can get around it?
The underlying issue is that the update function works on associative structures, i.e. vectors and maps. Lists don't support looking up a value by key or index.
user=> (associative? [])
true
user=> (associative? {})
true
user=> (associative? `())
false
update uses get behind the scenes to do its random access work.
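You can see that distinction directly in the REPL: get does indexed lookup on a vector, but returns nil (the not-found value) for a list:

```clojure
(get [:a :b :c] 1)  ;=> :b   ; vectors are associative on their indices
(get '(:a :b :c) 1) ;=> nil  ; lists are not, so get finds nothing
```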
I would like to do update and overwrite operations on lazy types rather than vectors
It's not clear what you want to achieve here. You're correct that vectors aren't lazy, but if you wish to do random-access operations on a collection, then vectors are ideal for this scenario and lists aren't.
and is there a way I can get around it?
Yes, but you still wouldn't be able to use the update function, and it doesn't look like there would be any benefit in doing so, in your case.
With a list you'd have to walk the list in order to access an index somewhere in the list - so in many cases you'd have to realise a great deal of the sequence even if it was lazy.
You can define your own function, using take and drop:
(defn lupdate [list n function]
  (let [[head & tail] (drop n list)]
    (concat (take n list)
            (cons (function head) tail))))
user=> (lupdate '(a b c d e f g h) 4 str)
(a b c d "e" f g h)
With lazy sequences, that means you will compute the first n values (but not the remaining ones, which after all is an important part of why we use lazy sequences). You also have to take into account space and time complexity (concat, etc.). But if you truly need to operate on lazy sequences, that's the way to go.
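For illustration, here is lupdate applied to an infinite sequence (repeating the definition so the snippet stands alone); only roughly the first n elements are realized eagerly, and the rest stays lazy:

```clojure
(defn lupdate [list n function]          ; as defined above
  (let [[head & tail] (drop n list)]
    (concat (take n list)
            (cons (function head) tail))))

;; Works on an infinite sequence without hanging:
(take 6 (lupdate (range) 3 -))
;=> (0 1 2 -3 4 5)
```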
Looking behind your question to the problem you are trying to solve:
You can use Clojure's sequence functions to construct a simple solution:
(defn elf [n]
  (loop [population (range 1 (inc n))]
    (if (<= (count population) 1)
      (first population)
      (let [survivors (->> population
                           (take-nth 2)
                           ((if (-> population count odd?) rest identity)))]
        (recur survivors)))))
For example,
(map (juxt identity elf) (range 1 8))
;([1 1] [2 1] [3 3] [4 1] [5 3] [6 5] [7 7])
This has complexity O(n). You can speed up count by passing the population count as a redundant argument in the loop, or by dumping the population and survivors into vectors. The sequence functions - take-nth and rest - are quite capable of doing the weeding.
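For example, a sketch of that count-passing variant (elf-counted is a hypothetical name); whether the count is odd or even, the survivors number (quot cnt 2):

```clojure
;; Sketch: carry the population count through the loop instead of
;; recounting the sequence on every pass.
(defn elf-counted [n]
  (loop [population (range 1 (inc n))
         cnt n]
    (if (<= cnt 1)
      (first population)
      (recur (->> population
                  (take-nth 2)
                  ((if (odd? cnt) rest identity)))
             (quot cnt 2)))))    ; survivor count, odd or even
```

(map (juxt identity elf-counted) (range 1 8)) produces the same table as above.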
I hope I got it right!
I am hoping to generate all the multiples of two less than 10 using the following code
(filter #(< % 10) (iterate (partial + 2) 2))
Expected output:
(2 4 6 8)
However, for some reason the REPL just doesn't give any output.
But, the below code works just fine...
(filter #(< % 10) '(2 4 6 8 10 12 14 16))
I understand one is a lazy sequence and one is a regular sequence. That's the reason. But how can I overcome this issue if I wish to filter all the numbers less than 10 from a lazy sequence?
(iterate (partial + 2) 2)
is an infinite sequence. filter has no way to know that the number of items for which the predicate is true is finite, so it will keep going forever when you realize the sequence (see Mark's answer).
What you want is:
(take-while #(< % 10) (iterate (partial + 2) 2))
I think I should note that Diego Basch's answer is not fully correct in its argumentation:
filter has no way to know that the number of items for which the predicate is true is finite, so it will keep going forever
Why should filter know anything about that? Actually, filter works fine in this case. One can apply filter to a lazy sequence and get another lazy sequence representing a potentially infinite sequence of filtered numbers:
user> (def my-seq (iterate (partial + 2) 2)) ; REPL won't be able to print this
;; => #'user/my-seq
user> (def filtered (filter #(< % 10) my-seq)) ; filter it without problems
;; => #'user/filtered
The crucial detail here is that one should never try to realize (by printing, in the OP's case) a lazy sequence unless the actual sequence is known to be finite.
Of course, this example is only for demonstration purposes, you should use take-while here, not filter.
Does Clojure have a powerful 'loop' macro like Common Lisp's?
for example:
get two elements from a sequence each time
Common Lisp:
(loop for (a b) on '(1 2 3 4) by #'cddr collect (cons a b))
how to do this in Clojure?
By leveraging for and some destructuring you can achieve your specific example:
(for [[a b] (partition 2 [1 2 3 4])]
  (use-a-and-b a b))
There is cl-loop, which is a LOOP workalike, and there are also clj-iter and clj-iterate, which are both based on the iterate looping construct for Common Lisp.
Clojure's multi-purpose looping construct is for. It doesn't have as many features as CL's loop built into it (especially not the side-effecting ones, since Clojure encourages functional purity), so many operations that you might otherwise do simply with loop itself are accomplished "around" for. For example, to sum the elements generated by for, you would put an apply + in front of it; to walk elements pairwise, you would (as sw1nn shows) use partition 2 on the input sequence fed into for.
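A small sketch of both patterns just mentioned:

```clojure
;; Summing the elements generated by for:
(apply + (for [x (range 5)] (* x x)))
;=> 30

;; Walking elements pairwise by feeding partition into for:
(for [[a b] (partition 2 [1 2 3 4])]
  [a b])
;=> ([1 2] [3 4])
```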
I would do this with loop, recur and destructuring.
For example, if I wanted to group every two values together:
(loop [[a b & rest] [1 2 3 4 5 6]
       result []]
  (if (empty? rest)
    (conj result [a b])
    (recur rest (conj result [a b]))))
Ends up with a result of:
=> [[1 2] [3 4] [5 6]]
a and b are the first and second elements of the sequence respectively, and rest is what is left over. We can then recur-sively go around until there is nothing left in rest and we are done.