How to improve text processing performance in Clojure? - clojure

I'm writing a simple desktop search engine in Clojure as a way to learn more about the language. Until now, the performance during the text processing phase of my program is really bad.
During the text processing I've to:
Clean up unwanted characters;
Convert the string to lowercase;
Split the document to get a list of words;
Build a map which associates each word to its occurrences in the document.
Here is the code:
(ns txt-processing.core
(:require [clojure.java.io :as cjio])
(:require [clojure.string :as cjstr])
(:gen-class))
(defn all-files [path]
(let [entries (file-seq (cjio/file path))]
(filter (memfn isFile) entries)))
(def char-val
(let [value #(Character/getNumericValue %)]
{:a (value \a) :z (value \z)
:A (value \A) :Z (value \Z)
:0 (value \0) :9 (value \9)}))
(defn is-ascii-alpha-num [c]
(let [n (Character/getNumericValue c)]
(or (and (>= n (char-val :a)) (<= n (char-val :z)))
(and (>= n (char-val :A)) (<= n (char-val :Z)))
(and (>= n (char-val :0)) (<= n (char-val :9))))))
(defn is-valid [c]
(or (is-ascii-alpha-num c)
(Character/isSpaceChar c)
(.equals (str \newline) (str c))))
(defn lower-and-replace [c]
(if (.equals (str \newline) (str c)) \space (Character/toLowerCase c)))
(defn tokenize [content]
(let [filtered (filter is-valid content)
lowered (map lower-and-replace filtered)]
(cjstr/split (apply str lowered) #"\s+")))
(defn process-content [content]
(let [words (tokenize content)]
(loop [ws words i 0 hmap (hash-map)]
(if (empty? ws)
hmap
(recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i)))))))
(defn -main [& args]
(doseq [file (all-files (first args))]
(let [content (slurp file)
oc-list (process-content content)]
(println "File:" (.getPath file)
"| Words to be indexed:" (count oc-list )))))
As I have another implementation of this problem in Haskell, I compared both as you can see in the following outputs.
Clojure version:
$ lein uberjar
Compiling txt-processing.core
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar
Including txt-processing-0.1.0-SNAPSHOT.jar
Including clojure-1.5.1.jar
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar
$ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642
real 2m2.164s
user 2m3.868s
sys 0m0.978s
Haskell version:
$ ghc -rtsopts --make txt-processing.hs
[1 of 1] Compiling Main ( txt-processing.hs, txt-processing.o )
Linking txt-processing ...
$ time ./txt-processing ../data/ +RTS -K12m
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642
real 0m9.086s
user 0m8.591s
sys 0m0.463s
I think the (string -> lazy sequence) conversion in the Clojure implementation is killing the performance. How can I improve it?
P.S: All the code and data used in these tests can be downloaded here.

Some things you could do that would probably speed this code up:
1) Instead of mapping your chars to char-val, just do direct value comparisons between the characters. This is faster for the same reason it would faster in Java.
2) You repeatedly use str to convert single-character values to full-fledged strings. Again, consider using the character values directly. Again, object creation is slow, same as in Java.
3) You should replace process-content with clojure.core/frequencies. Perhaps inspect frequencies source to see how it is faster.
4) If you must update a (hash-map) in a loop, use transient. See: http://clojuredocs.org/clojure_core/clojure.core/transient
Also note that (hash-map) returns a PersistentArrayMap, so you are creating new instances with each call to update-in - hence slow and why you should use transients.
5) This is your friend: (set! *warn-on-reflection* true) - You have quite a bit of reflection that could benefit from type hints
Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved.
Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved.
Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved.
Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved.
Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved.
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.

Just for comparison's sake, here's a regexp based Clojure version
(defn re-index
"Returns lazy sequence of vectors of regexp matches and their start index"
[^java.util.regex.Pattern re s]
(let [m (re-matcher re s)]
((fn step []
(when (. m (find))
(cons (vector (re-groups m)(.start m)) (lazy-seq (step))))))))
(defn group-by-keep
"Returns a map of the elements of coll keyed by the result of
f on each element. The value at each key will be a vector of the
results of r on the corresponding elements."
[f r coll]
(persistent!
(reduce
(fn [ret x]
(let [k (f x)]
(assoc! ret k (conj (get ret k []) (r x)))))
(transient {}) coll)))
(defn word-indexed
[s]
(group-by-keep
(comp clojure.string/lower-case first)
second
(re-index #"\w+" s)))

Related

Can an echo program in Clojure be built with lazy infinite sequences?

Take the following program as an example:
(defn echo-ints []
(doseq [i (->> (BufferedReader. *in*)
(line-seq)
(map read-string)
(take-while integer?))]
(println i)))
The idea is to prompt the user for input and then echo it back if it's an integer. However, in this particular program almost every second input won't be echoed immediately. Instead the program will wait for additional input before processing two inputs at once.
Presumably this a consequence of some performance tweaks happening behind the scenes. However in this instance I'd really like to have an immediate feedback loop. Is there an easy way to accomplish this, or does the logic of the program have to be significantly altered?
(The main motivation here is to pass the infinite sequence of user inputs to another function f that transforms lazy sequences to other lazy sequences. If I wrote some kind of while-loop, I wouldn't be able to use f.)
It is generally not good to mix lazyness with side-effect (printing in this case), since most sequence functions have built-in optimizations that cause unintended effects while still being functionally correct.
Here's a good write up: https://stuartsierra.com/2015/08/25/clojure-donts-lazy-effects
What you are trying to do seems like a good fit for core.async channels. I would think as the problem as 'a stream of user input' instead of 'infinite sequence of user inputs', and 'f transforms lazy sequences to lazy sequences' becomes 'f transform a stream into another stream'. This will allow you to write f as transducers which you can arbitrarily compose.
I would do it like the following. Note we use spyx and spyxx from the Tupelo library to display some results.
First, write a simple version with canned test data:
(ns tst.demo.core
(:use tupelo.test)
(:require
[tupelo.core :as t] )
(:import [java.io BufferedReader StringReader]))
(t/refer-tupelo)
(def user-input
"hello
there
and
a
1
and-a
2
and
a
3.14159
and-a
4
bye" )
(defn echo-ints
[str]
(let [lines (line-seq (BufferedReader. (StringReader. str)))
data (map read-string lines)
nums (filter integer? data) ]
(doseq [it data]
(spyxx it))
(spyx nums)))
(newline)
(echo-ints user-input)
This gives us the results:
it => <#clojure.lang.Symbol hello>
it => <#clojure.lang.Symbol there>
it => <#clojure.lang.Symbol and>
it => <#clojure.lang.Symbol a>
it => <#java.lang.Long 1>
it => <#clojure.lang.Symbol and-a>
it => <#java.lang.Long 2>
it => <#clojure.lang.Symbol and>
it => <#clojure.lang.Symbol a>
it => <#java.lang.Double 3.14159>
it => <#clojure.lang.Symbol and-a>
it => <#java.lang.Long 4>
it => <#clojure.lang.Symbol bye>
nums => (1 2 4)
So, we see that it works and gives us the numbers we want.
Next, write a looping version. We make it terminate gracefully when our test data runs out.
(defn echo-ints-loop
[str]
(loop [lines (line-seq (BufferedReader. (StringReader. str)))]
(let [line (first lines)
remaining (rest lines)
data (read-string line)]
(when (integer? data)
(println "found:" data))
(when (not-empty? remaining)
(recur remaining)))))
(newline)
(echo-ints-loop user-input)
found: 1
found: 2
found: 4
Next, we write an infinite loop to read the keyboard. You need to terminate this one with CRTL-C at the keyboard:
(ns demo.core
(:require [tupelo.core :as t])
(:import [java.io BufferedReader StringReader]))
(t/refer-tupelo)
(defn echo-ints-inf
[]
(loop [lines (line-seq (BufferedReader. *in*))]
(let [line (first lines)
remaining (rest lines)
data (read-string line)]
(when (integer? data)
(println "found:" data))
(when (not-empty? remaining)
(recur remaining)))))
(defn -main []
(println "main - enter")
(newline)
(echo-ints-inf))
And we run it manually:
~/clj > lein run
main - enter
hello
there
1
found: 1
and
a
2
found: 2
and-a
3
found: 3
further more
4
found: 4
^C
~/clj >
~/clj >

Reading another Clojure program as a list of S-Expressions

Suppose I have a very simple .clj file on disk with the following content:
(def a 2)
(def b 3)
(defn add-two [x y] (+ x y))
(println (add-two a b))
From the context of separate program, I would like to read the above program as a list of S-Expressions, '((def a 2) (def b 3) ... (add-two a b))).
I imagine that one way of doing this involves 1. Using slurp on (io/file file-name.clj) to produce a string containing the file's contents, 2. passing that string to a parser for Clojure code, and 3. injecting the sequence produced by the parser to a list (i.e., (into '() parsed-code)).
However, this approach seems sort of clumsy and error prone. Does anyone know of a more elegant and/or idiomatic way to read a Clojure file as a list of S-Expressions?
Update: Following up on feedback from the comments section, I've decided to try the approach I mentioned on an actual source file using aphyr's clj-antlr as follows:
=> (def file-as-string (slurp (clojure.java.io/file "src/tcl/core.clj")))
=> tcl.core=> (pprint (antlr/parser "src/grammars/Clojure.g4" file-as-string))
{:parser
{:local
#object[java.lang.ThreadLocal 0x5bfcab6 "java.lang.ThreadLocal#5bfcab6"],
:grammar
#object[org.antlr.v4.tool.Grammar 0x5b8cfcb9 "org.antlr.v4.tool.Grammar#5b8cfcb9"]},
:opts
"(ns tcl.core\n (:gen-class)\n (:require [clj-antlr.core :as antlr]))\n\n(def foo 42)\n\n(defn parse-program\n \"uses antlr grammar to \"\n [program]\n ((antlr/parser \"src/grammars/Clojure.g4\") program))\n\n\n(defn -main\n \"I don't do a whole lot ... yet.\"\n [& args]\n (println \"tlc is tcl\"))\n"}
nil
Does anyone know how to transform this output to a list of S-Expressions as originally intended? That is, how might one go about squeezing valid Clojure code/data from the result of parsing with clj-antlr?
(import '[java.io PushbackReader])
(require '[clojure.java.io :as io])
(require '[clojure.edn :as edn])
;; adapted from: http://stackoverflow.com/a/24922859/6264
(defn read-forms [file]
(let [rdr (-> file io/file io/reader PushbackReader.)
sentinel (Object.)]
(loop [forms []]
(let [form (edn/read {:eof sentinel} rdr)]
(if (= sentinel form)
forms
(recur (conj forms form)))))))
(comment
(spit "/tmp/example.clj"
"(def a 2)
(def b 3)
(defn add-two [x y] (+ x y))
(println (add-two a b))")
(read-forms "/tmp/example.clj")
;;=> [(def a 2) (def b 3) (defn add-two [x y] (+ x y)) (println (add-two a b))]
)
Do you need something like this?
(let [exprs (slurp "to_read.clj")]
;; adding braces to form a proper list
(-> (str "(" (str exprs")"))
;; read-string is potentially harmful, since it evals the string
;; there exist non-evaluating readers for clojure but I don't know
;; which one are good
(read-string)
(prn)))

Filter (dedupe) and concat csv files

A tip request please:
How can I concat a set of large csv files into one. I need rows identified as duplicates removed (i.e. filter (some #{s} (get row 1) ) Each file has no duplicates, actually, only between the files can duplicate rows appear. The order of the final outputs isn't crucial, but matching a sequential scan of the files would be preferred.
The total number of ids to maintain is about 150,000,000, so maintaining a set that large in memory is doable, I think.
So, I've got a a fn that takes a filename and a set of ids to avoid and returns a filtered sequence of rows. I've also got a vector of filenames to process. I can't wrap my head around how to output the filtered rows to a single file while conj the ids from each filtered set of rows into an existing set.
(defn open-seq "read file f and filter rows based on set s" [f s]
(letfn [(iset? [x]
(let [ls (s/split x #", ")
id (read-string (get ls 1))]
(not (some #{id} s))))]
(with-open [in (io/reader f)]
(->> (line-seq in)
(filter iset?)
; shortcut (take 20)
doall)
))
)
EDIT:
This is a two-pass solution.
(defn proc [infiles outfile]
(with-open [outf (io/writer outfile)]
(let [s (atom #{})]
(doseq [infile infiles]
(with-open [in (io/reader infile)]
(doseq [line (open-seq in #s)]
(.write outf line)
(.newLine outf)))
(with-open [in (io/reader infile)]
(let [ids (->> (open-seq in #s)
(map (fn [x] (get x 1))))]
(swap! s conj ids)
))
))))
I suppose I could conj each id onto the set atom with each line. I guess that had a preconceived notion that conjing the whole seq of ids would be more idiomatic.

Clojure - Is it possible to increment a variable within a doseq statement?

I am trying to iterate over a list of files in a given directory, and add an incrementing variable i = {1,2,3.....} to their names.
Here is the code I have for iterating through the files and changing each file's name:
(defn addCounterToExtIn [d]
(def i 0)
(doseq [f (.listFiles (file d)) ] ; make a sequence of all files in d
(if (and (not (.isDirectory f)) ; if file is not a directry and
(= '(\. \i \n) (take-last 3 (.getName f))) ) ; if it ends with .in
(fs/rename f (str d '/ i (.getName f)))))) ; add i to start of its name
I don't know how can I increment i as doseq iterates through each file. Alternatively, is there a better loop to use to achieve the desired result?
use file-seq and map-indexed:
(require '[clojure.java.io :as io])
(dorun
(->>
(file-seq (io/file "/home/eduard/Downloads"))
(filter #(re-find #".+\.pdf$" (.getName %)))
(map-indexed (fn [i v] [i v]))))
Change function in map-indexed to rename and you're done.
The sample output for pdf files:
([0 #<File /home/eduard/Downloads/some.pdf>] ...)
This is the first approach off the top of my head. It's not ideal, but certainly more idiomatic than what the question proposes.
(def rename-one-file! [file counter]
(if (and (not (.isDirectory file))
(= ".in" (str (take-last 3 (.getName file)))))
(fs/rename file (file (parent dir)
(str counter (.getName file)))))
(defn iterate-files-with-counter [fn dir]
(loop [counter 0
remaining-files (.listFiles (file dir))]
(let [current-file (first remaining-files)]
(fn file counter)
(recur (+ counter 1) (rest remaining-files))))
(def add-counter-to-ext-in-dir
(partial iterate-files-with-counter rename-one-file!))
Note that the work of actually performing the rename was split off from the work of iterating over the files. Having a large number of small functions is better than than a small number of large functions in general, and making those functions reusable / independent unless you choose to use them together is even better than that.

More idiomatic line-by-line handling of a file in Clojure

I'm trying to read a file that (may or may not) have YAML frontmatter line-by-line using Clojure, and return a hashmap with two vectors, one containing the frontmatter lines and one containing everything else (i.e., the body).
And example input file would look like this:
---
key1: value1
key2: value2
---
Body text paragraph 1
Body text paragraph 2
Body text paragraph 3
I have functioning code that does this, but to my (admittedly inexperienced with Clojure) nose, it reeks of code smell.
(defn process-file [f]
(with-open [rdr (java.io.BufferedReader. (java.io.FileReader. f))]
(loop [lines (line-seq rdr) in-fm 0 frontmatter [] body []]
(if-not (empty? lines)
(let [line (string/trim (first lines))]
(cond
(zero? (count line))
(recur (rest lines) in-fm frontmatter body)
(and (< in-fm 2) (= line "---"))
(recur (rest lines) (inc in-fm) frontmatter body)
(= in-fm 1)
(recur (rest lines) in-fm (conj frontmatter line) body)
:else
(recur (rest lines) in-fm frontmatter (conj body line))))
(hash-map :frontmatter frontmatter :body body)))))
Can someone point me to a more elegant way to do this? I'm going to be doing a decent amount of line-by-line parsing in this project, and I'd like a more idiomatic way of going about it if possible.
Firstly, I'd put line-processing logic in its own function to be called from a function actually reading in the files. Better yet, you can make the function dealing with IO take a function to map over the lines as an argument, perhaps along these lines:
(require '[clojure.java.io :as io])
(defn process-file-with [f filename]
(with-open [rdr (io/reader (io/file filename))]
(f (line-seq rdr))))
Note that this arrangement makes it the duty of f to realize as much of the line seq as it needs before it returns (because afterwards with-open will close the underlying reader of the line seq).
Given this division of responsibilities, the line processing function might look like this, assuming the first --- must be the first non-blank line and all blank lines are to be skipped (as they would be when using the code from the question text):
(require '[clojure.string :as string])
(defn process-lines [lines]
(let [ls (->> lines
(map string/trim)
(remove string/blank?))]
(if (= (first ls) "---")
(let [[front sep-and-body] (split-with #(not= "---" %) (next ls))]
{:front (vec front) :body (vec (next sep-and-body))})
{:body (vec ls)})))
Note the calls to vec which cause all the lines to be read in and returned in a vector or pair of vectors (so that we can use process-lines with process-file-with without the reader being closed too soon).
Because reading lines from an actual file on disk is now decoupled from processing a seq of lines, we can easily test the latter part of the process at the REPL (and of course this can be made into a unit test):
;; could input this as a single string and split, of course
(def test-lines
["---"
"key1: value1"
"key2: value2"
"---"
""
"Body text paragraph 1"
""
"Body text paragraph 2"
""
"Body text paragraph 3"])
Calling our function now:
user> (process-lines test-lines)
{:front ("key1: value1" "key2: value2"),
:body ("Body text paragraph 1"
"Body text paragraph 2"
"Body text paragraph 3")}
actually, the idiomatic way to do it using clojure would be to avoid returning 'a hashmap with two vectors' and treat the file as a (lazy) sequence of lines
then, the function that will process the sequence of lines decides whether the file has a YAML frontmatter or not
something like this:
(use '[clojure.java.io :only (reader)])
(let [s (line-seq (reader "YOURFILENAMEHERE"))]
(if (= "---\n" (take 1 (line-seq (reader "YOURFILENAMEHERE"))))
(process-seq-with-frontmatter s)
(process-seq-without-frontmatter s))
by the way, this is a quit and dirty solution; two things to improve:
notice I'm creating two seqs for the same file, it would be better to create just one and make the inspection of the first line so that it wouldn't traverse over the first element of the seq (like a peek instead of a pop)
I think it would be cleaner to have a multimethod 'process-seq' (with a better name of course) that would dispatch based on the content of the first line of the seq