Efficiently create and diff sets created from large text file


I am attempting to copy about 12 million documents in an AWS S3 bucket to give them new names. The names previously had a prefix and will now all be document name only. So a/b/123 once renamed will be 123. The last segment is a uuid so there will not be any naming collisions.
This process has been partially completed so some have been copied and some still need to be. I have a text file that contains all of the document names. I would like an efficient way to determine which documents have not yet been moved.
I have some naive code that shows what I would like to accomplish.
(def doc-names ["o/123" "o/234" "t/543" "t/678" "123" "234" "678"])
(defn still-need-copied [doc-names]
  (let [last-segment (fn [doc-name]
                       (last (clojure.string/split doc-name #"/")))
        by-position  (group-by #(.contains % "/") doc-names)
        top          (set (get by-position false))
        nested       (set (map #(last-segment %) (get by-position true)))
        needs-copied (clojure.set/difference nested top)]
    (filter #(contains? needs-copied (last-segment %)) doc-names)))

I would propose this solution:
(defn still-need-copied [doc-names]
  (->> doc-names
       (group-by #(last (clojure.string/split % #"/")))
       (keep #(when (== 1 (count (val %))) (first (val %))))))
First, group all the items by the last segment of the split name, which gives this for your input:
{"123" ["o/123" "123"],
"234" ["o/234" "234"],
"543" ["t/543"],
"678" ["t/678" "678"]}
Then select the map values that have a length of 1 and take their first element.
I would say it is more readable than your variant, and it also seems to be faster.
Here is why:
as far as I can understand, your code here probably has a complexity of
N (grouping to a map with just 2 keys) +
Nlog(N) (creation and filling of top set) +
Nlog(N) (creation and filling of nested set) +
Nlog(N) (sets difference) +
Nlog(N) (filtering + searching each element in a needs-copied set) =
4Nlog(N) + N
whereas my variant would probably have the complexity of
Nlog(N) (grouping values into a map with a large amount of keys) +
N (keeping needed values) =
N + Nlog(N)
And though asymptotically they are both O(N log(N)), in practice mine will probably complete faster.
PS: I am not an expert in complexity theory; this is just a very rough estimate.
Here is a little test:
(defn generate-data [len]
(doall (mapcat
#(let [n (rand-int 2)]
(if (zero? n)
[(str "aaa/" %) (str %)]
[(str %)]))
(range len))))
(defn still-need-copied [doc-names]
(let [last-segment (fn [doc-name]
(last (clojure.string/split doc-name #"/")))
by-position (group-by #(.contains % "/") doc-names)
top (set (get by-position false))
nested (set (map #(last-segment %) (get by-position true)))
needs-copied (clojure.set/difference nested top)]
(filter #(contains? needs-copied (last-segment %)) doc-names)))
(defn still-need-copied-2 [doc-names]
(->> doc-names
(group-by #(last (clojure.string/split % #"/")))
(keep #(when (== 1 (count (val %))) (first (val %))))))
(def data-100k (generate-data 100000))
(def data-1m (generate-data 1000000))
user> (let [_ (time (dorun (still-need-copied data-100k)))
_ (time (dorun (still-need-copied-2 data-100k)))
_ (time (dorun (still-need-copied data-1m)))
_ (time (dorun (still-need-copied-2 data-1m)))])
"Elapsed time: 714.929641 msecs"
"Elapsed time: 243.918466 msecs"
"Elapsed time: 7094.333425 msecs"
"Elapsed time: 2329.75247 msecs"
So it is ~3 times faster, just as I predicted.
Update:
I found one solution which is not so elegant, but seems to work.
You said you're using iota, so I generated a huge file of ~15 million lines (with the aforementioned generate-data fn).
Then I decided to sort it by the last part after the slash (so that "123" and "aaa/123" end up next to each other):
(defn last-part [s] (last (clojure.string/split s #"/")))
(def sorted (sort-by last-part (iota/seq "my/file/path")))
It completed surprisingly fast. So the last thing I had to do was write a simple loop that checks, for every item, whether the next item has the same last part:
(def res (loop [res []
                [item1 & [item2 & rest :as tail] :as coll] sorted]
           (cond (empty? coll) res
                 (empty? tail) (conj res item1)
                 (= (last-part item1) (last-part item2)) (recur res rest)
                 :else (recur (conj res item1) tail))))
It also completed without any visible difficulties, so I got the needed result without any map/reduce framework.
I also think that if you don't keep the sorted coll in a var, you would probably save memory by avoiding retention of the head of the huge coll:
(def res (loop [res []
                [item1 & [item2 & rest :as tail] :as coll] (sort-by last-part (iota/seq "my/file/path"))]
           (cond (empty? coll) res
                 (empty? tail) (conj res item1)
                 (= (last-part item1) (last-part item2)) (recur res rest)
                 :else (recur (conj res item1) tail))))
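As a quick sanity check, the sort-and-scan loop above can be wrapped in a function and run on the small doc-names sample from the question. This is only a sketch: the function name unpaired-after-sort is made up here, it works on an in-memory vector instead of iota/seq, and the local is called more rather than rest to avoid shadowing the core function.

```clojure
(require '[clojure.string :as string])

(defn last-part
  "Returns the segment after the last slash, e.g. \"a/b/123\" -> \"123\"."
  [s]
  (last (string/split s #"/")))

(defn unpaired-after-sort
  "Sorts names by last segment, then keeps every name whose neighbour
  does not share its last segment (hypothetical helper, not from iota)."
  [names]
  (loop [res []
         [item1 & [item2 & more :as tail] :as coll] (sort-by last-part names)]
    (cond
      (empty? coll) res                                         ; nothing left
      (empty? tail) (conj res item1)                            ; last item has no neighbour
      (= (last-part item1) (last-part item2)) (recur res more)  ; pair found, skip both
      :else (recur (conj res item1) tail))))                    ; unpaired, keep it

(unpaired-after-sort ["o/123" "o/234" "t/543" "t/678" "123" "234" "678"])
;; => ["t/543"]
```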

Related

Return an else value when using recur

I am new to Clojure, and doing my best to forget all my previous experience with more procedural languages (java, ruby, swift) and embrace Clojure for what it is. I am actually really enjoying the way it makes me think differently -- however, I have come up against a pattern that I just can't seem to figure out. The easiest way to illustrate, is with some code:
(defn char-to-int [c] (Integer/valueOf (str c)))
(defn digits-dont-decrease? [str]
(let [digits (map char-to-int (seq str)) i 0]
(when (< i 5)
(if (> (nth digits i) (nth digits (+ i 1)))
false
(recur (inc i))))))
(def result (digits-dont-decrease? "112233"))
(if (= true result)
(println "fit rules")
(println "doesn't fit rules"))
The input is a 6 digit number as a string, and I am simply attempting to make sure that each digit from left to right is >= the previous digit. I want to return false if it doesn't, and true if it does. The false situation works great -- however, given that recur needs to be the last thing in the function (as far as I can tell), how do I return true. As it is, when the condition is satisfied, I get an illegal argument exception:
Execution error (IllegalArgumentException) at clojure.exercise.two/digits-dont-decrease? (four:20).
Don't know how to create ISeq from: java.lang.Long
How should I be thinking about this? I assume my past training is getting in my mental way.

This does not directly answer your question, but it shows an alternative. While the (apply <= ...) approach over the whole string is very elegant for small strings (it is eager), you can use every? for a short-circuiting approach. E.g.:
user=> (defn nr-seq [s] (map #(Integer/parseInt (str %)) s))
#'user/nr-seq
user=> (every? (partial apply <=) (partition 2 1 (nr-seq "123")))
true
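To make the pairwise idea a bit more concrete, here is a minimal sketch wrapped in a function. partition 2 1 yields overlapping neighbour pairs, and every? stops at the first pair that fails. The name non-decreasing-digits? is just an illustration, and the input is assumed to contain only digit characters.

```clojure
(defn nr-seq
  "Turns a string of digits into a lazy seq of ints."
  [s]
  (map #(Integer/parseInt (str %)) s))

(defn non-decreasing-digits?
  "True when every digit is >= the digit before it."
  [s]
  (every? (fn [[a b]] (<= a b))
          (partition 2 1 (nr-seq s))))

(partition 2 1 [1 2 3])            ;; => ((1 2) (2 3))
(non-decreasing-digits? "112233")  ;; => true
(non-decreasing-digits? "112324")  ;; => false
```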

You need nothing but
(apply <= "112233")
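Reason: a string is a sequence of characters, and the comparison operators work on characters. Note that this holds in ClojureScript, where characters are one-character strings (the session below ran on an iPhone, presumably in a ClojureScript REPL). On JVM Clojure, <= accepts only numbers and throws a ClassCastException for characters, so there you would need (apply <= (map int "112233")).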
(->> "0123456789" (mapcat #(repeat 1000 %)) (apply str) (def loooong))
(count loooong)
10000
(time (apply <= loooong))
"Elapsed time: 21.006625 msecs"
true
(->> "9123456789" (mapcat #(repeat 1000 %)) (apply str) (def bad-loooong))
(count bad-loooong)
10000
(time (apply <= bad-loooong))
"Elapsed time: 2.581750 msecs"
false
(above runs on my iPhone)
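For JVM Clojure, where <= only accepts numbers, a sketch of the same idea converts each character to its code point first. Digit characters have increasing code points, so the ordering is preserved. The function name is hypothetical and the string is assumed to be non-empty.

```clojure
(defn non-decreasing-chars?
  "Compares characters by code point; assumes s has at least one character,
  because (apply <= ()) would be an arity error."
  [s]
  (apply <= (map int s)))

(non-decreasing-chars? "112233") ;; => true
(non-decreasing-chars? "9123")   ;; => false
```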

In this case, you don't really need loop/recur. Just use the built-in nature of <= like so:
(ns tst.demo.core
(:use demo.core tupelo.core tupelo.test))
(def true-samples
["123"
"112233"
"13"])
(def false-samples
["10"
"12324"])
(defn char->int
[char-or-str]
(let [str-val (str char-or-str)] ; coerce any chars to len-1 strings
(assert (= 1 (count str-val)))
(Integer/parseInt str-val)))
(dotest
(is= 5 (char->int "5"))
(is= 5 (char->int \5))
(is= [1 2 3] (mapv char->int "123"))
; this shows what we are going for
(is (<= 1 1 2 2 3 3))
(isnt (<= 1 1 2 1 3 3))
and now test the char sequences:
;-----------------------------------------------------------------------------
; using built-in `<=` function
(doseq [true-samp true-samples]
(let [digit-vals (mapv char->int true-samp)]
(is (apply <= digit-vals))))
(doseq [false-samp false-samples]
(let [digit-vals (mapv char->int false-samp)]
(isnt (apply <= digit-vals))))
if you want to write your own, you can like so:
(defn increasing-equal-seq?
"Returns true iff sequence is non-decreasing"
[coll]
(when (< (count coll) 2)
(throw (ex-info "coll must have at least 2 vals" {:coll coll})))
(loop [prev (first coll)
remaining (rest coll)]
(if (empty? remaining)
true
(let [curr (first remaining)
prev-next curr
remaining-next (rest remaining)]
(if (<= prev curr)
(recur prev-next remaining-next)
false)))))
;-----------------------------------------------------------------------------
; using home-grown loop/recur
(doseq [true-samp true-samples]
(let [digit-vals (mapv char->int true-samp)]
(is (increasing-equal-seq? digit-vals))))
(doseq [false-samp false-samples]
(let [digit-vals (mapv char->int false-samp)]
(isnt (increasing-equal-seq? digit-vals))))
)
with result
-------------------------------
Clojure 1.10.1 Java 13
-------------------------------
Testing tst.demo.core
Ran 2 tests containing 15 assertions.
0 failures, 0 errors.
Passed all tests
Finished at 23:36:17.096 (run time: 0.028s)

You can use loop with recur.
Assuming you require the following input vs. output:
"543221" => false
"54321" => false
"12345" => true
"123345" => true
The following function can help:
;; Assuming char-to-int is defined by you before as per the question
(defn digits-dont-decrease?
  [strng]
  (let [digits (map char-to-int (seq strng))]
    (loop [;; the bindings in loop act as initial state
           decreases true
           i (- (count digits) 2)]
      (let [decreases (and decreases (>= (nth digits (+ i 1)) (nth digits i)))]
        (if (or (< i 1) (not decreases))
          decreases
          (recur decreases (dec i)))))))
This should work for a numeric string of any length of at least two digits (shorter input makes nth fail with an index error).
Hope this helps. Please let me know if you were looking for something else :).
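For comparison, here is a sketch that stays closer to the shape of the code in the question. The point it tries to illustrate is that recur only has to be in tail position of one branch; the other branches of the cond can simply return true or false. char-to-int uses Integer/parseInt here, and the handling of empty or one-digit strings (returning true) is an assumption.

```clojure
(defn char-to-int [c] (Integer/parseInt (str c)))

(defn digits-dont-decrease? [s]
  (let [digits (mapv char-to-int s)     ; vector gives cheap indexed access
        last-i (dec (count digits))]
    (loop [i 0]
      (cond
        (>= i last-i) true                               ; ran out of pairs: the "else" value
        (> (nth digits i) (nth digits (inc i))) false    ; found a decrease
        :else (recur (inc i))))))                        ; keep scanning

(digits-dont-decrease? "112233") ;; => true
(digits-dont-decrease? "543221") ;; => false
```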

(defn non-decreasing? [str]
(every?
identity
(map
(fn [a b]
(<= (int a) (int b)))
(seq str)
(rest str))))
(defn non-decreasing-loop? [str]
(loop [a (seq str) b (rest str)]
(if-not (seq b)
true
(if (<= (int (first a)) (int (first b)))
(recur (rest a) (rest b))
false))))
(non-decreasing? "112334589")
(non-decreasing? "112324589")
(non-decreasing-loop? "112334589")
(non-decreasing-loop? "112324589")

Speeding up Clojure to avoid timeouts

I've been doing a few of the HackerRank challenges and noticing that I don't seem to be able to write efficient code: quite often I get timeouts, even though the answers that do pass the tests are correct. For example, for this challenge this is my code:
(let [divisors (fn [n] (into #{n} (into #{1} (filter (comp zero? (partial rem n)) (range 1 n)))))
str->ints (fn [string]
(map #(Integer/parseInt %)
(clojure.string/split string #" ")))
;lines (line-seq (java.io.BufferedReader. *in*))
lines ["3"
"10 4"
"1 100"
"288 240"
]
pairs (map str->ints (rest lines))
first-divs (map divisors (map first pairs))
second-divs (map divisors (map second pairs))
intersections (map clojure.set/intersection first-divs second-divs)
counts (map count intersections)
]
(doseq [v counts]
(println (str v))))
Note that clojure.set doesn't exist at HackerRank. I just put it in here for the sake of brevity.

In this exact case there is an obvious misuse of the map function:
Although Clojure's sequences are lazy, operations on them still don't come for free. So when you chain lots of maps, you still create all the intermediate sequences (there are 7 here). To avoid this, one would usually use transducers, but in your case you are just mapping every input line to one output line, so it is really enough to do it in one pass over the input collection:
(let [divisors (fn [n] (into #{n} (into #{1} (filter (comp zero? (partial rem n)) (range 1 n)))))
str->ints (fn [string]
(map #(Integer/parseInt %)
(clojure.string/split string #" ")))
;lines (line-seq (java.io.BufferedReader. *in*))
get-counts (fn [pair] (let [d1 (divisors (first pair))
d2 (divisors (second pair))]
(count (clojure.set/intersection d1 d2))))
lines ["3"
"10 4"
"1 100"
"288 240"
]
counts (map (comp get-counts str->ints) (rest lines))]
(doseq [v counts]
(println (str v))))
I am not talking about the correctness of the whole algorithm here; maybe it could also be optimized. But as far as Clojure's mechanics go, this change should speed up your code quite noticeably.
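For completeness, this is roughly what the transducer route mentioned above could look like. It is only a sketch: parse-pair, common-count and xf are names made up for this example, the divisor search is the naive one from the question, and a set is used as a predicate in place of clojure.set/intersection.

```clojure
(require '[clojure.string :as string])

(defn parse-pair
  "\"10 4\" -> [10 4]"
  [line]
  (mapv #(Long/parseLong %) (string/split line #" ")))

(defn divisors
  "Naive divisor set of n, built with a filtering transducer."
  [n]
  (into #{} (filter #(zero? (rem n %))) (range 1 (inc n))))

(defn common-count
  "Number of divisors shared by a and b."
  [[a b]]
  (count (filter (divisors a) (divisors b))))

;; Both steps are composed into one transformation, so no intermediate
;; sequences are built between them.
(def xf (comp (map parse-pair) (map common-count)))

(into [] xf ["10 4" "1 100" "288 240"])
;; => [2 1 10]
```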
Update:
As for the algorithm, you would probably want to start by limiting the range from 1..n to 1..(sqrt n), adding both x and n/x to the resulting set when x is a divisor of n. That should give you quite a big gain for large numbers:
(defn divisors [n]
  (into #{} (mapcat #(when (zero? (rem n %)) [% (/ n %)])
                    (range 1 (inc (Math/floor (Math/sqrt n)))))))
I would also consider finding all the divisors of the smaller of the two numbers, and then keeping the ones the other number is divisible by. This eliminates the search for the greater number's divisors.
(defn common-divisors [pair]
  (let [[a b] (sort pair)
        divs (divisors a)]
    (filter #(zero? (rem b %)) divs)))
If that still doesn't pass the test, you should probably look for a better algorithm for common divisors.
Update 2:
I submitted the updated algorithm to HackerRank and it passes now.
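If a faster algorithm is ever needed, one well-known option is to go through the greatest common divisor: the common divisors of a and b are exactly the divisors of gcd(a, b), so it is enough to count the divisors of one (usually much smaller) number. The sketch below is not what was submitted above; the function names are made up, and positive integer inputs are assumed.

```clojure
(defn gcd
  "Euclid's algorithm."
  [a b]
  (if (zero? b) a (recur b (rem a b))))

(defn count-divisors
  "Counts divisors of n by trial division up to sqrt(n).
  Each divisor i below the square root pairs with n/i."
  [n]
  (loop [i 1 cnt 0]
    (cond
      (> (* i i) n)     cnt
      (zero? (rem n i)) (recur (inc i) (if (= (* i i) n) (inc cnt) (+ cnt 2)))
      :else             (recur (inc i) cnt))))

(defn count-common-divisors [a b]
  (count-divisors (gcd a b)))

(map #(apply count-common-divisors %) [[10 4] [1 100] [288 240]])
;; => (2 1 10)
```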

Checking odd parity in clojure

I have the following functions that check for odd parity in sequence
(defn countOf[a-seq elem]
(loop [number 0 currentSeq a-seq]
(cond (empty? currentSeq) number
(= (first currentSeq) elem) (recur (inc number) (rest currentSeq))
:else (recur number (rest currentSeq))
)
)
)
(defn filteredSeq[a-seq elemToRemove]
(remove (set (vector (first a-seq))) a-seq)
)
(defn parity [a-seq]
(loop [resultset [] currentSeq a-seq]
(cond (empty? currentSeq) (set resultset)
(odd? (countOf currentSeq (first currentSeq))) (recur (concat resultset (vector(first currentSeq))) (filteredSeq currentSeq (first currentSeq)))
:else (recur resultset (filteredSeq currentSeq (first currentSeq)))
)
)
)
For example, (parity [1 1 1 2 2 3]) -> #{1 3}; that is, it picks the elements that occur an odd number of times in a sequence.
Is there a better way to achieve this?
How can this be done with Clojure's reduce function?

First, I decided to make more idiomatic versions of your code, so I could really see what it was doing:
;; idiomatic naming
;; no need to rewrite count and filter for this code
;; putting item and collection in idiomatic argument order
(defn count-of [elem a-seq]
(count (filter #(= elem %) a-seq)))
;; idiomatic naming
;; putting item and collection in idiomatic argument order
;; actually used the elem-to-remove argument
(defn filtered-seq [elem-to-remove a-seq]
(remove #(= elem-to-remove %) a-seq))
;; idiomatic naming
;; if you want a set, use a set from the beginning
;; destructuring rather than repeated usage of first
;; use rest to recur when the first item is guaranteed to be dropped
(defn idiomatic-parity [a-seq]
(loop [result-set #{}
[elem & others :as current-seq] a-seq]
(cond (empty? current-seq)
result-set
(odd? (count-of elem current-seq))
(recur (conj result-set elem) (filtered-seq elem others))
:else
(recur result-set (filtered-seq elem others)))))
Next, as requested, a version that uses reduce to accumulate the result:
;; mapcat allows us to return 0 or more results for each input
(defn reducing-parity [a-seq]
(set
(mapcat
(fn [[k v]]
(when (odd? v) [k]))
(reduce (fn [result item]
(update-in result [item] (fnil inc 0)))
{}
a-seq))))
But, reading over this, I notice that the reduce is just frequencies, a built in clojure function. And my mapcat was really just a hand-rolled keep, another built in.
(defn most-idiomatic-parity [a-seq]
(set
(keep
(fn [[k v]]
(when (odd? v) k))
(frequencies a-seq))))
In Clojure we can refine our code, and as we recognize places where our logic replicates the built in functionality, we can simplify the code and make it more clear. Also, there is a good chance the built in is better optimized than our own work-alikes.

Is there a better way to achieve this?
(defn parity [coll]
(->> coll
frequencies
(filter (fn [[_ v]] (odd? v)))
(map first)
set))
For example,
(parity [1 1 1 2 1 2 1 3])
;#{1 3}
How can this be done with reduce function of clojure.
We can use reduce to rewrite frequencies:
(defn frequencies [coll]
(reduce
(fn [acc x] (assoc acc x (inc (get acc x 0))))
{}
coll))
... and again to implement parity in terms of it:
(defn parity [coll]
(let [freqs (frequencies coll)]
(reduce (fn [s [k v]] (if (odd? v) (conj s k) s)) #{} freqs)))
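Another way to use reduce, without counting at all, is to toggle set membership: an element seen an odd number of times ends up in the set, an element seen an even number of times ends up out of it. This is just a sketch of that idea, and toggle-parity is a made-up name.

```clojure
(defn toggle-parity
  "Returns the set of elements that occur an odd number of times in coll."
  [coll]
  (reduce (fn [odd-set x]
            (if (contains? odd-set x)
              (disj odd-set x)    ; second, fourth, ... occurrence: remove
              (conj odd-set x)))  ; first, third, ... occurrence: add
          #{}
          coll))

(toggle-parity [1 1 1 2 2 3])
;; => #{1 3}
```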

Grouping words and more

I'm working on a project to learn Clojure in practice. I'm doing well, but sometimes I get stuck. This time I need to transform sequence of the form:
[":keyword0" "word0" "word1" ":keyword1" "word2" "word3"]
into:
[[:keyword0 "word0" "word1"] [:keyword1 "word2" "word3"]]
I'm trying for at least two hours, but I know not so many Clojure functions to compose something useful to solve the problem in functional manner.
I think that this transformation should include some partition, here is my attempt:
(partition-by (fn [x] (.startsWith x ":")) *1)
But the result looks like this:
((":keyword0") ("word0" "word1") (":keyword1") ("word2" "word3"))
Now I should group it again... I doubt that I'm doing right things here... Also, I need to convert strings (only those that begin with :) into keywords. I think this combination should work:
(keyword (subs ":keyword0" 1))
How to write a function which performs the transformation in most idiomatic way?

Here is a high performance version, using reduce
(reduce (fn [acc next]
          (if (.startsWith next ":")
            (conj acc [(-> next (subs 1) keyword)])
            (conj (pop acc) (conj (peek acc)
                                  next))))
        [] data)
Alternatively, you could extend your code like this
(->> data
(partition-by #(.startsWith % ":"))
(partition 2)
(map (fn [[[kw-str] strs]]
(cons (-> kw-str
(subs 1)
keyword)
strs))))
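For reference, the reduce version can be wrapped in a function and checked against the sample input from the question. This is a sketch: the name group-keywords is hypothetical, and clojure.string/starts-with? (available since Clojure 1.8) is swapped in for the .startsWith interop call to avoid reflection.

```clojure
(require '[clojure.string :as string])

(defn group-keywords
  "Starts a new group at every \":...\" string and appends other words to
  the current group. Assumes the first element is a \":...\" string;
  otherwise pop on the empty accumulator throws."
  [words]
  (reduce (fn [acc word]
            (if (string/starts-with? word ":")
              (conj acc [(keyword (subs word 1))])          ; open a new group
              (conj (pop acc) (conj (peek acc) word))))     ; extend the last group
          []
          words))

(group-keywords [":keyword0" "word0" "word1" ":keyword1" "word2" "word3"])
;; => [[:keyword0 "word0" "word1"] [:keyword1 "word2" "word3"]]
```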

What about this:
(defn group-that [ arg ]
(if (not-empty arg)
(loop [list arg, acc [], result []]
(if (not-empty list)
(if (.startsWith (first list) ":")
(if (not-empty acc)
(recur (rest list) (vector (first list)) (conj result acc))
(recur (rest list) (vector (first list)) result))
(recur (rest list) (conj acc (first list)) result))
(conj result acc)
))))
Just one iteration over the seq, without any need for macros.

Since the question is already here... This is my best effort:
(def data [":keyword0" "word0" "word1" ":keyword1" "word2" "word3"])
(->> data
(partition-by (fn [x] (.startsWith x ":")))
(partition 2)
(map (fn [[[k] w]] (apply conj [(keyword (subs k 1))] w))))
I'm still looking for a better solution or criticism of this one.

First, let's construct a function that breaks vector v into sub-vectors, the breaks occurring everywhere property pred holds.
(defn breakv-by [pred v]
(let [break-points (filter identity (map-indexed (fn [n x] (when (pred x) n)) v))
starts (cons 0 break-points)
finishes (concat break-points [(count v)])]
(mapv (partial subvec v) starts finishes)))
For our case, given
(def data [":keyword0" "word0" "word1" ":keyword1" "word2" "word3"])
then
(breakv-by #(= (first %) \:) data)
produces
[[] [":keyword0" "word0" "word1"] [":keyword1" "word2" "word3"]]
Notice that the initial sub-vector is different:
It has no element for which the predicate holds.
It can be of length zero.
All the others
start with their only element for which the predicate holds and
are at least of length 1.
So breakv-by behaves properly with data that
doesn't start with a breaking element or
has a succession of breaking elements.
For the purposes of the question, we need to muck about with what breakv-by produces somewhat:
(let [pieces (breakv-by #(= (first %) \:) data)]
(mapv
#(update-in % [0] (fn [s] (keyword (subs s 1))))
(rest pieces)))
;[[:keyword0 "word0" "word1"] [:keyword1 "word2" "word3"]]

Proper way to obtain side effects while using the java.io/reader in Clojure?

I'm reading lines from a very large text file. The file contains a set of data that I'd like to select specific line numbers from. What I'd like to do is read in a line from the file, if the line is one that I want, conj it to my result, and if it's not, then check the next line. I don't want to store all the lines I've seen in memory so I'd like a way to drop them from the reader line-seq as I read them.
I have a function like this:
;; evaluates but doesn't modify the line sequence so continuously adds
;; the same first line to the result. I would like this exact function
;; but somehow have it drop the first line of lines at each iteration.
(defn get-training-data [batch-size batch-num]
(let [line-numbers (fn that returns vector of random numbers)]
(with-open [rdr (clojure.java.io/reader "resources/sample.txt")]
(let [lines (line-seq rdr) res []]
(for [i (range (apply max line-numbers))
:let [res (conj res (json/read-str (first lines)))]
:when (some #{i} line-numbers)]
res)))))
I also have a function like this:
;;this works as I want it to, but only with a small file and produces a
;;stack overflow with a large file
(defn get-training-data1 [batch-size batch-num]
(let [line-numbers (fn that returns a vector of random numbers)]
(with-open [rdr (clojure.java.io/reader "resources/sample.txt")]
(let [lines (line-seq rdr)]
(loop [i 0 f (apply max line-numbers) res [] lines lines]
(if (> i f)
res
(if (some #{i} line-numbers)
(recur
(inc i)
f
(conj res (json/read-str (first lines)))
(drop 1 lines))
(recur
(inc i)
f
res
(drop 1 lines)))))))))
As I tried to test this, I developed the following simpler cases:
;;works
(let [res []]
(for [i (range 10)
:let [res (conj res i)]
:when (odd? i)]
res)) ;;([1] [3] [5] [7] [9])
;;now an attempt to get the same result but have a side effect each time,
;;produces null pointer exception.
(let [res []]
(for [i (range 10)
:let [res (conj res i)]
:when (odd? i)]
(doall
(println i)
res)))
I believe that if I could figure out how to produce a side effect within a for, the first problem above would be resolved, because I could use the side effect to drop the first line of the reader's line sequence.
Do you guys have any thoughts?

map and filter will do this nicely and keep it lazy so you don't store any more in memory than you have to.
user> (->> (line-seq (clojure.java.io/reader "project.clj")) ;; lazy sequence of lines
(map vector (range)) ;; add an index
(filter #(#{1 3 7 9} (first %))) ;; filter by index
(map second)) ;; drop the index
(" :description \"API server for Yummly mobile app(s)\""
"[com.project/example \"1.4.8-SNAPSHOT\"]"
" [org.clojure/tools.cli \"0.2.4\"]"
" [clojurewerkz/mailer \"1.0.0-alpha3\"]")

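One caveat when combining this with with-open, as in the question: the lazy sequence has to be fully realized before the reader is closed. A minimal sketch, assuming the selected lines fit in memory: the pipeline is written as a transducer and poured into a vector inside with-open. select-lines is a made-up name, and wanted is assumed to be a non-empty set of zero-based line numbers.

```clojure
(require '[clojure.java.io :as io])

(defn select-lines
  "Returns a vector of the lines of `path` whose zero-based index is in the
  set `wanted`. Assumes `wanted` is non-empty (apply max needs an argument)."
  [path wanted]
  (with-open [rdr (io/reader path)]
    (into []                                          ; eager, so reading finishes before close
          (comp (take (inc (apply max wanted)))       ; stop after the last wanted line
                (map-indexed vector)                  ; add an index
                (filter (fn [[i _]] (contains? wanted i)))
                (map second))                         ; drop the index
          (line-seq rdr))))

;; usage, with a hypothetical file:
;; (select-lines "resources/sample.txt" #{1 3 7 9})
```

Lines that are not selected are never held on to, so memory use is bounded by the size of the result.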