How to Parse and Compare Files?


I'd appreciate suggestions/insights on how I can leverage Clojure to efficiently parse and compare two files. There are two (log) files that contain employee attendance; from these files I need to determine all the days that two employees worked the same times, in the same department. Below are examples of the log files.
Note: the files have different numbers of entries.
First File:
Employee Id  Name  Time In           Time Out          Dept.
mce0518      Jon   2011-01-01 06:00  2011-01-01 14:00  ER
mce0518      Jon   2011-01-02 06:00  2011-01-01 14:00  ER
mce0518      Jon   2011-01-04 06:00  2011-01-01 13:00  ICU
mce0518      Jon   2011-01-05 06:00  2011-01-01 13:00  ICU
mce0518      Jon   2011-01-05 17:00  2011-01-01 23:00  ER
Second File:
Employee Id  Name  Time In           Time Out          Dept.
pdm1705      Jane  2011-01-01 06:00  2011-01-01 14:00  ER
pdm1705      Jane  2011-01-02 06:00  2011-01-01 14:00  ER
pdm1705      Jane  2011-01-05 06:00  2011-01-01 13:00  ER
pdm1705      Jane  2011-01-05 17:00  2011-01-01 23:00  ER

If you are not going to do it periodically,
;; Read a file, skip the header line, and split each row on whitespace.
(defn data-seq [f]
  (with-open [rdr (java.io.BufferedReader.
                   (java.io.FileReader. f))]
    (let [s (rest (line-seq rdr))]
      (doall (map seq (map #(.split % "\\s+") s))))))
;; Two rows match when everything after the id and name columns is equal.
(defn same-time? [a b]
  (let [a (drop 2 a)
        b (drop 2 b)]
    (= a b)))
(let [f1 (data-seq "f1.txt")
      f2 (data-seq "f2.txt")]
  (reduce (fn [h v]
            (let [matches (filter #(same-time? v %) f2)]
              (if (empty? matches)
                h
                (conj h [(first v) (map first matches)]))))
          [] f1))
will get you:
[["mce0518" ("pdm1705")] ["mce0518" ("pdm1705")] ["mce0518" ("pdm1705")]]

I came up with a somewhat shorter and (IMHO) more readable version:
(use ; moar toolz - moar fun
  '[clojure.contrib.duck-streams :only (reader)]
  '[clojure.string :only (split)]
  '[clojure.contrib.str-utils :only (str-join)]
  '[clojure.set :only (intersection)])
(defn read-presence [filename]
  (with-open [rdr (reader filename)] ; the file is always closed after use
    (apply hash-set ; build the employee's hash-set
      (map #(str-join "--" (drop 2 (split % #" [ ]+"))) ; right to left: split the row on runs of two or more spaces, drop the first two columns, then join with "--"
        (drop 1 ; omit the header line
          (line-seq rdr)))))) ; read the file line by line
(intersection (read-presence "a.in") (read-presence "b.in")) ; now it's simple!
;result: #{"2011-01-01 06:00--2011-01-01 14:00--ER" "2011-01-02 06:00--2011-01-01 14:00--ER" "2011-01-05 17:00--2011-01-01 23:00--ER"}
This assumes a.in and b.in are your files. I also assumed you'll have one hash-set for each employee; a (naive) generalization to N employees would need the next six lines:
(def employees ["greg.txt" "allison.txt" "robert.txt" "eric.txt" "james.txt" "lisa.txt"])
(for [a employees b employees :when (and
                                      (= a (first (sort [a b]))) ; thou shall compare greg with james ONCE
                                      (not (= a b)))] ; thou shall not compare greg with greg
  (str-join " -- " ; well, it's not pretty... nor pink at least
    [a b (intersection (read-presence a) (read-presence b))]))
;result: ("a.in -- b.in -- #{\"2011-01-01 06:00--2011-01-01 14:00--ER\" \"2011-01-02 06:00--2011-01-01 14:00--ER\" \"2011-01-05 17:00--2011-01-01 23:00--ER\"}")
Actually this loop is sooo ugly, and it doesn't memoize intermediate results: every file is parsed again for each pair it appears in. To be improved.
--edit--
I knew there must be something elegant in core or contrib!
(use '[clojure.contrib.combinatorics :only (combinations)])
(def employees ["greg.txt" "allison.txt" "robert.txt" "eric.txt" "james.txt" "lisa.txt"])
(def employee-map (apply conj (for [e employees] {e (read-presence e)})))
(map (fn [[a b]] [a b (intersection (employee-map a) (employee-map b))])
     (combinations employees 2))
;result: (["a.in" "b.in" #{"2011-01-01 06:00--2011-01-01 14:00--ER" "2011-01-02 06:00--2011-01-01 14:00--ER" "2011-01-05 17:00--2011-01-01 23:00--ER"}])
Now it's memoized (the parsed data lives in employee-map), general and... lazy :D
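For readers on a current Clojure: the monolithic clojure.contrib was split up after Clojure 1.2, so the namespaces used above may not be available. The following is only a sketch of the same idea using clojure.java.io, clojure.string and clojure.set, which ship with Clojure. It assumes, like the regex above, that columns are separated by two or more spaces. The name shared-shifts is made up for this illustration, and a plain for over indexed files stands in for combinations to avoid an extra dependency.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.set :as set]
         '[clojure.string :as str])

(defn read-presence
  "Returns the set of \"time-in--time-out--dept\" keys found in file.
  Assumes columns are separated by two or more whitespace characters."
  [file]
  (with-open [rdr (io/reader file)]
    (->> (line-seq rdr)
         (drop 1)                 ; skip the header line
         (remove str/blank?)      ; tolerate empty lines
         (map #(str/join "--" (drop 2 (str/split % #"\s{2,}"))))
         set)))                   ; set is eager, so all lines are read before the reader closes

(defn shared-shifts
  "For every unordered pair of files, returns [file-a file-b common-shifts].
  Each file is parsed exactly once."
  [files]
  (let [presence (zipmap files (map read-presence files))
        indexed  (map-indexed vector files)]
    (for [[i a] indexed
          [j b] indexed
          :when (< i j)]          ; each pair once, never a file with itself
      [a b (set/intersection (presence a) (presence b))])))
```

Calling (shared-shifts ["a.in" "b.in"]) would then yield one vector per pair of files, in the same shape as the result shown above.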

Related

What would a "time ago" function look like in Clojure?

I mean a function that, given a time, returns how long ago it was, expressed in the largest time unit that fits.
E.g.
"5 minutes ago"
"30 seconds ago"
"just now"

One possible implementation might look like this:
Note: I've used the clj-time library (clj-time/clj-time on GitHub).
(require '[clj-time.core :as t])
(defn time-ago [time]
  (let [units [{:name "second" :limit 60 :in-second 1}
               {:name "minute" :limit 3600 :in-second 60}
               {:name "hour" :limit 86400 :in-second 3600}
               {:name "day" :limit 604800 :in-second 86400}
               {:name "week" :limit 2629743 :in-second 604800}
               {:name "month" :limit 31556926 :in-second 2629743}
               {:name "year" :limit Long/MAX_VALUE :in-second 31556926}]
        diff (t/in-seconds (t/interval time (t/now)))]
    (if (< diff 5)
      "just now"
      (let [unit (first (drop-while #(or (>= diff (:limit %))
                                         (not (:limit %)))
                                    units))]
        (-> (/ diff (:in-second unit))
            Math/floor
            int
            (#(str % " " (:name unit) (when (> % 1) "s") " ago")))))))
Example usage:
(time-ago (t/minus (t/now) (t/days 400)))
=> "1 year ago"
(time-ago (t/minus (t/now) (t/days 15)))
=> "2 weeks ago"
(time-ago (t/minus (t/now) (t/seconds 45)))
=> "45 seconds ago"
(time-ago (t/minus (t/now) (t/seconds 1)))
=> "just now"
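The unit-selection step above does not itself depend on clj-time, so it can be tried out on a plain number of elapsed seconds. The sketch below reuses the same limits. The name seconds->ago is hypothetical, and using quot in place of the Math/floor and int steps is a simplification of mine rather than part of the answer.

```clojure
(def units
  ;; [unit-name upper-limit-in-seconds seconds-per-unit], same numbers as above
  [["second" 60               1]
   ["minute" 3600             60]
   ["hour"   86400            3600]
   ["day"    604800           86400]
   ["week"   2629743          604800]
   ["month"  31556926         2629743]
   ["year"   Long/MAX_VALUE   31556926]])

(defn seconds->ago
  "Formats a non-negative number of elapsed seconds as a \"time ago\" string."
  [diff]
  (if (< diff 5)
    "just now"
    (let [;; first unit whose upper limit has not been reached yet
          [unit-name _ size] (first (drop-while (fn [[_ limit]] (>= diff limit))
                                                units))
          n (quot diff size)]       ; integer division, rounds down
      (str n " " unit-name (when (> n 1) "s") " ago"))))
```

With clj-time on the classpath, the original behaviour would be recovered by passing (t/in-seconds (t/interval time (t/now))) as diff.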

If you are using Clojure on the JVM, consider using the PrettyTime library. Using that library for implementing "time ago" in Java was suggested here.
To use PrettyTime library from Clojure, first add the following to the :dependencies vector in project.clj:
[org.ocpsoft.prettytime/prettytime "3.2.7.Final"]
Then you can use Java interop directly. One quirk I found is that the cut-off between "moments ago" and other outputs is at 1 minute by default. I added a line to change that to one second. This library appears to support several languages, which is a plus. By default it prints "moments ago" instead of "just now". It would require some effort to deal with in case that is really important.
(import 'org.ocpsoft.prettytime.PrettyTime
        'org.ocpsoft.prettytime.units.JustNow
        'java.util.Date)
(defn time-ago [date]
  (let [pretty-time (PrettyTime.)]
    (.. pretty-time (getUnit JustNow) (setMaxQuantity 1000))
    (.format pretty-time date)))
(let [now (System/currentTimeMillis)]
  (doseq [offset [200, (* 30 1000), (* 5 60 1000)]]
    (println (time-ago (Date. (- now offset))))))
;; moments ago
;; 30 seconds ago
;; 5 minutes ago

It only supports minutes, hours & days but if that's sufficient you may also want to look at goog.date.relative:
https://github.com/google/closure-library/blob/master/closure/goog/date/relative.js#L87

How to add days to current date in Clojure

In Clojure, I want to add days to the current date. Can anyone please guide me on that? I am getting the current date as below; now, let's say I want to add 7 days to it. How can I get the new date?
(.format (java.text.SimpleDateFormat. "MM/dd/yyyy") (java.util.Date.))

This would work:
(java.util.Date. (+ (* 7 86400 1000) (.getTime (java.util.Date.))))
I prefer to use System/currentTimeMillis for the current time:
(java.util.Date. (+ (* 7 86400 1000) (System/currentTimeMillis)))
Or you can use clj-time which is a nicer api to deal with time (it's a wrapper around Joda Time). From the readme file:
(t/plus (t/date-time 1986 10 14) (t/months 1) (t/weeks 3))
=> #<DateTime 1986-12-05T00:00:00.000Z>
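On Java 8 or later, another option is the java.time API, which works on calendar days rather than milliseconds (adding 7 * 86400 * 1000 ms can land on a different wall-clock time across a daylight-saving change). This is a sketch through plain interop; the names us-date and plus-days-str are made up for the example, and the pattern is the MM/dd/yyyy one from the question.

```clojure
(import '[java.time LocalDate]
        '[java.time.format DateTimeFormatter])

;; Formatter for the MM/dd/yyyy pattern used in the question.
(def ^DateTimeFormatter us-date (DateTimeFormatter/ofPattern "MM/dd/yyyy"))

(defn plus-days-str
  "Returns date plus n calendar days, formatted as MM/dd/yyyy."
  [^LocalDate date n]
  (.format (.plusDays date (long n)) us-date))

;; Seven days from today:
(plus-days-str (LocalDate/now) 7)
```

Passing the date in as an argument, instead of calling LocalDate/now inside the function, keeps the function easy to test with fixed dates.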

user> (import '[java.util Calendar])
;=> java.util.Calendar
user> (defn days-later [n]
        (let [today (Calendar/getInstance)]
          (doto today
            (.add Calendar/DATE n)
            .toString)))
#'user/days-later
user> (println "Tomorrow: " (days-later 1))
;=> Tomorrow: #inst "2014-11-26T15:36:31.901+09:00"
;=> nil
user> (println "7 Days from now: " (days-later 7))
;=> 7 Days from now: #inst "2014-12-02T15:36:44.785+09:00"
;=> nil

Set difference using a projected function

I've got two databases that I'm attempting to keep in sync using a bit of Clojure glue code.
I'd like to make something like a clojure.set/difference that operates on values projected by a function.
Here's some sample data:
(diff #{{:name "bob smith" :favourite-colour "blue"}
        {:name "geraldine smith" :age 29}}
      #{{:first-name "bob" :last-name "smith" :favourite-colour "blue"}}
      :name
      (fn [x] (str (:first-name x) " " (:last-name x))))
;; => {:name "geraldine smith" :age 29}
The best I've got is:
(defn diff
  "Return members of l who do not exist in r, based on applying function
  fl to left and fr to right"
  [l r fl fr]
  (let [l-project (into #{} (map fl l))
        r-project (into #{} (map fr r))
        d (set/difference l-project r-project)
        i (group-by fl l)]
    (map (comp first i) d)))
But I feel that this is a bit unwieldy, and I can't imagine it performs very well. I'm throwing away information that I'd like to keep, and then looking it up again.
I did have a go at using metadata to keep the original values around during the set difference, but I can't seem to put metadata on primitive types, so that didn't work...
I'm not sure why, but I have this tiny voice inside my head telling me that this kind of operation on the side is what monads are for, and that I should really get around to finding out what a monad is and how to use it. Any guidance as to whether the tiny voice is right is very welcome!

(defn diff
  [l r fl fr]
  (let [r-project (into #{} (map fr r))]
    (set (remove #(contains? r-project (fl %)) l))))
This no longer exposes the difference operation directly (it is now implicit with the remove / contains combination), but it is succinct and should give the result you are looking for.
example usage and output:
user> (diff #{{:name "bob smith" :favourite-colour "blue"}
              {:name "geraldine smith" :age 29}}
            #{{:first-name "bob" :last-name "smith" :favourite-colour "blue"}}
            :name
            (fn [x] (str (:first-name x) " " (:last-name x))))
#{{:age 29, :name "geraldine smith"}}

How to improve text processing performance in Clojure?

I'm writing a simple desktop search engine in Clojure as a way to learn more about the language. So far, the performance of the text processing phase of my program is really bad.
During the text processing I have to:
Clean up unwanted characters;
Convert the string to lowercase;
Split the document to get a list of words;
Build a map which associates each word to its occurrences in the document.
Here is the code:
(ns txt-processing.core
  (:require [clojure.java.io :as cjio]
            [clojure.string :as cjstr])
  (:gen-class))
(defn all-files [path]
  (let [entries (file-seq (cjio/file path))]
    (filter (memfn isFile) entries)))
(def char-val
  (let [value #(Character/getNumericValue %)]
    {:a (value \a) :z (value \z)
     :A (value \A) :Z (value \Z)
     :0 (value \0) :9 (value \9)}))
(defn is-ascii-alpha-num [c]
  (let [n (Character/getNumericValue c)]
    (or (and (>= n (char-val :a)) (<= n (char-val :z)))
        (and (>= n (char-val :A)) (<= n (char-val :Z)))
        (and (>= n (char-val :0)) (<= n (char-val :9))))))
(defn is-valid [c]
  (or (is-ascii-alpha-num c)
      (Character/isSpaceChar c)
      (.equals (str \newline) (str c))))
(defn lower-and-replace [c]
  (if (.equals (str \newline) (str c)) \space (Character/toLowerCase c)))
(defn tokenize [content]
  (let [filtered (filter is-valid content)
        lowered (map lower-and-replace filtered)]
    (cjstr/split (apply str lowered) #"\s+")))
(defn process-content [content]
  (let [words (tokenize content)]
    (loop [ws words i 0 hmap (hash-map)]
      (if (empty? ws)
        hmap
        (recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i)))))))
(defn -main [& args]
  (doseq [file (all-files (first args))]
    (let [content (slurp file)
          oc-list (process-content content)]
      (println "File:" (.getPath file)
               "| Words to be indexed:" (count oc-list)))))
As I have another implementation of this problem in Haskell, I compared both as you can see in the following outputs.
Clojure version:
$ lein uberjar
Compiling txt-processing.core
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar
Including txt-processing-0.1.0-SNAPSHOT.jar
Including clojure-1.5.1.jar
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar
$ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642
real 2m2.164s
user 2m3.868s
sys 0m0.978s
Haskell version:
$ ghc -rtsopts --make txt-processing.hs
[1 of 1] Compiling Main ( txt-processing.hs, txt-processing.o )
Linking txt-processing ...
$ time ./txt-processing ../data/ +RTS -K12m
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642
real 0m9.086s
user 0m8.591s
sys 0m0.463s
I think the (string -> lazy sequence) conversion in the Clojure implementation is killing the performance. How can I improve it?
P.S: All the code and data used in these tests can be downloaded here.

Some things you could do that would probably speed this code up:
1) Instead of mapping your chars to char-val, compare the characters directly. This is faster for the same reason it would be faster in Java.
2) You repeatedly use str to convert single characters to full-fledged strings. Again, consider comparing the character values directly; object creation is slow, just as in Java.
3) If you only need counts rather than positions, you should replace process-content with clojure.core/frequencies. Perhaps inspect the source of frequencies to see how it is faster.
4) If you must update a (hash-map) in a loop, use transient. See: http://clojuredocs.org/clojure_core/clojure.core/transient
Also note that (hash-map) returns an immutable PersistentArrayMap, so every call to update-in creates a new map instance. That is slow, and it is why you should use transients.
5) This is your friend: (set! *warn-on-reflection* true). You have quite a bit of reflection that could benefit from type hints:
Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved.
Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved.
Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved.
Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved.
Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved.
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
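As a rough illustration of points 1, 2, 4 and 5 taken together, here is a sketch that walks the string by index with a type hint, compares character codes directly, and builds the index in a transient map. I have not benchmarked it against the original data. The name word-positions is made up, the whitespace handling (Character/isWhitespace plus a trim) is a simplification of mine, and unlike the original process-content it stores positions in ascending order in vectors.

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Lower-cases ASCII letters, keeps digits, turns each whitespace character
  into a space and drops every other character, then splits into words."
  [^String content]
  (let [sb (StringBuilder. (.length content))]
    (dotimes [i (.length content)]
      (let [c (.charAt content i)
            n (int c)]
        (cond
          (Character/isWhitespace c) (.append sb " ")
          (or (<= 97 n 122)          ; a-z
              (<= 65 n 90)           ; A-Z
              (<= 48 n 57))          ; 0-9
          (.append sb (Character/toLowerCase c)))))
    (str/split (str/trim (str sb)) #"\s+")))

(defn word-positions
  "Maps each word to a vector of the indices at which it occurs."
  [words]
  (persistent!
    (reduce (fn [m [i w]]
              (assoc! m w (conj (get m w []) i)))
            (transient {})
            (map-indexed vector words))))
```

For example, (word-positions (tokenize (slurp file))) would take the place of (process-content content) in -main.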

Just for comparison's sake, here's a regexp based Clojure version
(defn re-index
  "Returns lazy sequence of vectors of regexp matches and their start index"
  [^java.util.regex.Pattern re s]
  (let [m (re-matcher re s)]
    ((fn step []
       (when (. m (find))
         (cons (vector (re-groups m) (.start m)) (lazy-seq (step))))))))
(defn group-by-keep
  "Returns a map of the elements of coll keyed by the result of
  f on each element. The value at each key will be a vector of the
  results of r on the corresponding elements."
  [f r coll]
  (persistent!
    (reduce
      (fn [ret x]
        (let [k (f x)]
          (assoc! ret k (conj (get ret k []) (r x)))))
      (transient {}) coll)))
(defn word-indexed
  [s]
  (group-by-keep
    (comp clojure.string/lower-case first)
    second
    (re-index #"\w+" s)))

Why is -> not taking a (fn ...)?

I have the following code that works:
(def *primes*
  (let [l "2 3 5 7 11 13 17 19 23 29 31"
        f (fn [lst] (filter #(< 0 (count (str/trim %))) lst))
        m (fn [lst] (map #(Integer/parseInt %) lst))]
    (-> l
        (str/partition #"[0-9]+")
        f
        m)))
If I change it to inline the filter (f) and map (m) functions to this:
(def *primes*
  (let [l "2 3 5 7 11 13 17 19 23 29 31"]
    (-> l
        (str/partition #"[0-9]+")
        (fn [lst] (filter #(< 0 (count (str/trim %))) lst))
        (fn [lst] (map #(Integer/parseInt %) lst)))))
it doesn't compile anymore. The error is:
#<CompilerException java.lang.RuntimeException: java.lang.IllegalArgumentException: Don't know how to create ISeq from: clojure.lang.Symbol (NO_SOURCE_FILE:227)>
Can anyone explain this to me?
The problem I'm trying to solve is that map and filter take the collection as the last parameter, yet str/partition takes it as the first. So I'm trying to mix the two using ->, wrapping map and filter in functions that take only one parameter (the first) for the collection to go into.

You can mix -> and ->> to a certain degree.
(-> l
    (str/partition #"[0-9]+")
    (->> (filter #(< 0 (count (str/trim %)))))
    (->> (map #(Integer/parseInt %))))
But usually, having problems like this is a sign that you are trying to do too much in one form. This simple example could be easily fixed.
(->> (str/partition l #"[0-9]+")
     (filter #(< 0 (count (str/trim %))))
     (map #(Integer/parseInt %)))

You're using function definitions as if they were function calls. The immediate (ugly) way to fix it is to replace (fn [..] ..) with ((fn [..] ..)).

-> is a macro. It manipulates the code you give it, and then executes that code. What happens when you try to use anonymous functions inline like that, is the previous expressions get threaded in as the first argument to fn. That is not what you want. You want them threaded in as the first argument to the actual function.
To use ->, you'd have to declare the functions beforehand, as you did in your first example.
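The rewriting that -> performs can be inspected with macroexpand-1. The sketch below shows the threaded value ending up inside the fn form itself, and how an extra pair of parentheses turns that form into a call. parse-ints is a made-up example that uses clojure.string/split (string first, so it threads with ->) instead of the question's str/partition.

```clojure
(require '[clojure.string :as str])

;; -> splices the threaded value in as the second element of each list form,
;; before anything is evaluated.
(macroexpand-1 '(-> x (fn [lst] (count lst))))
;; expands to (fn x [lst] (count lst)): x has landed inside the fn form

(macroexpand-1 '(-> x ((fn [lst] (count lst)))))
;; expands to ((fn [lst] (count lst)) x): the fn is now called with x

(defn parse-ints
  "Splits s on whitespace and parses each non-blank piece as an integer."
  [s]
  (-> s
      (str/split #"\s+")
      ((fn [xs] (remove str/blank? xs)))
      ((fn [xs] (map #(Integer/parseInt %) xs)))))
```

Here (parse-ints "2 3 5 7") threads the string through all three steps without naming the helper functions beforehand.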
