Clojure - Sliding Window Minimum in Log Time

Given a vector of size n and a window size k, how can I efficiently calculate the sliding window minimum in O(n log k) time? For example, for the vector [1 4 3 2 5 4 2] and window size 2, the output would be [1 3 2 2 4 2].
Obviously I can do it using partition and map, but that's O(n * k) time.
I think I need to keep track of the minimum in a sorted map and update the map when an entry falls outside the window. But although I can get the min of a sorted map in log time, searching through the map to find any indexes that have expired is not log time.
Thanks.
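For reference, the partition-and-map baseline mentioned in the question could look like the sketch below. The name naive-windowed-min is made up for illustration; it does O(k) work per window, so O(n * k) overall.

```clojure
;; Baseline sketch: take every window of size k (step 1) and find its minimum.
;; `naive-windowed-min` is a hypothetical name, not from the question.
(defn naive-windowed-min
  [k coll]
  (map #(apply min %) (partition k 1 coll)))

(naive-windowed-min 2 [1 4 3 2 5 4 2])
```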

You can solve this with a priority queue based on Clojure's priority map data structure (clojure.data.priority-map). We key each value in the window by its position in the vector, and use the value itself as the priority.
The value of the map's first entry is the window minimum.
To slide the window, we add the new entry and remove the oldest one by its key (its vector position).
A possible implementation is
(use '[clojure.data.priority-map :only [priority-map]])
(defn windowed-min [k coll]
(let [numbered (map-indexed list coll)
[head tail] (split-at k numbered)
init-win (into (priority-map) head)
win-seq (reductions
(fn [w [i n]]
(-> w (dissoc (- i k)) (assoc i n)))
init-win
tail)]
(map (comp val first) win-seq)))
For example,
(windowed-min 2 [1 4 3 2 5 4 2])
=> (1 3 2 2 4 2)
The solution is computed lazily, so it can be applied to an endless sequence.
After the initialisation, which is O(k), the function computes each element in the sequence in O(log k) time, as noted here.
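If you would rather avoid the extra dependency, the same O(log k) per step idea can be sketched with a built-in sorted set of [value index] pairs. This is a swapped-in variant, not the answer's code, and the name windowed-min-sorted-set is made up. It assumes v is a vector, so that looking up the expired element is fast. Because the exact expired pair is known, it can be removed with disj in log time, which addresses the concern raised in the question.

```clojure
;; Sketch: sorted set of [value index] pairs as the window.
;; Assumes `v` is a vector; `windowed-min-sorted-set` is a hypothetical name.
(defn windowed-min-sorted-set
  [k v]
  (let [pairs (map-indexed (fn [i x] [x i]) v)
        init  (into (sorted-set) (take k pairs))]
    (->> (drop k pairs)
         (reductions (fn [win [x i]]
                       (-> win
                           ;; the pair leaving the window is known exactly
                           (disj [(nth v (- i k)) (- i k)])
                           (conj [x i])))
                     init)
         ;; the smallest pair sorts first; its first item is the minimum
         (map ffirst))))

(windowed-min-sorted-set 2 [1 4 3 2 5 4 2])
```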

You can solve this in linear time, O(n) rather than O(n log k), as described in 1. http://articles.leetcode.com/sliding-window-maximum/ (easily changed from finding the max to finding the min) and 2. https://people.cs.uct.ac.za/~ksmith/articles/sliding_window_minimum.html
The approach needs a double-ended queue to manage previous values. Most of its queue operations (push/pop/peek, etc.) take O(1) time, rather than O(log k) when using a priority queue (i.e. the priority map). I used the double-ended queue from https://github.com/pjstadig/deque-clojure
Main code implementing the algorithm from the first reference above (for min rather than max):
(defn windowed-min-queue [w a]
(let [
deque-init (fn deque-init [] (reduce (fn [dq i]
(dq-push-back i (prune-back a i dq)))
empty-deque (range w)))
process-min (fn process-min [dq] (reductions (fn [q i]
(->> q
(prune-back a i)
(prune-front i w)
(dq-push-back i)))
dq (range w (count a))))
init (deque-init)
result (process-min init)]
(map #(nth a (dq-front %)) result)))
Comparing the speed of this method to the other solution, which uses a priority map, we have the following (note: I like the other solution as well, since it's simpler).
; Test using Random arrays of data
(def N 1000000)
(def a (into [] (take N (repeatedly #(rand-int 50)))))
(def b (into [] (take N (repeatedly #(rand-int 50)))))
(def w 1024)
; Solution based upon double-ended queue
(time (doall (windowed-min-queue w a)))
;=> "Elapsed time: 1820.526521 msecs"
; Solution based upon Priority Map (see other solution, which is also great since it's simpler)
(time (doall (windowed-min w b)))
;=> "Elapsed time: 8290.671121 msecs"
The deque version is over 4x faster, which is great considering the PriorityMap is written in Java while the double-ended queue code is pure Clojure (see https://github.com/pjstadig/deque-clojure).
The other wrappers/utilities used on the double-ended queue are included below for reference.
(defn dq-push-front [e dq]
(conj dq e))
(defn dq-push-back [e dq]
(proto/inject dq e))
(defn dq-front [dq]
(first dq))
(defn dq-pop-front [dq]
(pop dq))
(defn dq-pop-back [dq]
(proto/eject dq))
(defn deque-empty? [dq]
(identical? empty-deque dq))
(defn dq-back [dq]
(proto/last dq))
(defn prune-back [a i dq]
(cond
(deque-empty? dq) dq
(< (nth a i) (nth a (dq-back dq))) (recur a i (dq-pop-back dq))
:else dq))
(defn prune-front [i w dq]
(cond
(deque-empty? dq) dq
(<= (dq-front dq) (- i w)) (recur i w (dq-pop-front dq))
:else dq))
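For readers who do not have the deque library at hand, the same monotonic-deque idea can be sketched with java.util.ArrayDeque from the JDK. This is a substitute for illustration, not the code benchmarked above. The name sliding-min-arraydeque is made up, the function is eager rather than lazy, and it assumes v is a vector.

```clojure
(import 'java.util.ArrayDeque)

;; Sketch: O(n) sliding-window minimum using a mutable deque of indices.
;; Values at the stored indices are non-decreasing from front to back,
;; so the front index always points at the current window minimum.
(defn sliding-min-arraydeque
  [w v]
  (let [dq (ArrayDeque.)]
    (loop [i 0, acc (transient [])]
      (if (< i (count v))
        (let [x (nth v i)]
          ;; drop indices at the back whose values are larger than x
          (while (and (not (.isEmpty dq)) (< x (nth v (.peekLast dq))))
            (.pollLast dq))
          ;; drop the front index once it has left the window
          (while (and (not (.isEmpty dq)) (<= (.peekFirst dq) (- i w)))
            (.pollFirst dq))
          (.addLast dq i)
          (recur (inc i)
                 ;; start emitting once the first full window is seen
                 (if (>= i (dec w))
                   (conj! acc (nth v (.peekFirst dq)))
                   acc)))
        (persistent! acc)))))

(sliding-min-arraydeque 2 [1 4 3 2 5 4 2])
```

Each index is added once and removed at most once, which is where the linear bound comes from.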

My solution uses two auxiliary maps to achieve fast performance. One sorted map takes each index in the window to its value, and a second sorted map takes each value to its number of occurrences in the window. On each move of the window, I update both maps and read the minimum from the second sorted map, all in log time.
The downside is that the code is a lot uglier, not lazy, and not idiomatic. The upside is that it outperforms the priority-map solution by about 2x. I think a lot of that, though, can be blamed on the laziness of the solution above.
(defn- init-aux-maps [w v]
(let [sv (subvec v 0 w)
km (->> sv (map-indexed vector) (into (sorted-map)))
vm (->> sv frequencies (into (sorted-map)))]
[km vm]))
(defn- update-aux-maps [[km vm] j x]
(let [[ai av] (first km)
km (-> km (dissoc ai) (assoc j x))
vm (if (= (vm av) 1) (dissoc vm av) (update vm av dec))
vm (if (nil? (get vm x)) (assoc vm x 1) (update vm x inc))]
[km vm]))
(defn- get-minimum [[_ vm]] (ffirst vm))
(defn sliding-minimum [w v]
(loop [i 0, j w, am (init-aux-maps w v), acc []]
(let [acc (conj acc (get-minimum am))]
(if (< j (count v))
(recur (inc i) (inc j) (update-aux-maps am j (v j)) acc)
acc))))

Related

How to replace the last element in a vector in Clojure

As a newbie to Clojure I often have difficulty expressing the simplest things. For example, replacing the last element in a vector, which would be
v[-1]=new_value
in python, I end up with the following variants in Clojure:
(assoc v (dec (count v)) new_value)
which is pretty long and inexpressive to say the least, or
(conj (vec (butlast v)) new_value)
which is even worse, as it has O(n) running time.
That leaves me feeling silly, like a caveman trying to repair a Swiss watch with a club.
What is the right Clojure way to replace the last element in a vector?
To support my O(n)-claim for butlast-version (Clojure 1.8):
(def v (vec (range 1e6)))
#'user/v
user=> (time (first (conj (vec (butlast v)) 55)))
"Elapsed time: 232.686159 msecs"
0
(def v (vec (range 1e7)))
#'user/v
user=> (time (first (conj (vec (butlast v)) 55)))
"Elapsed time: 2423.828127 msecs"
0
So basically, for 10 times the number of elements it is 10 times slower.
I'd use
(defn set-top [coll x]
(conj (pop coll) x))
For example,
(set-top [1 2 3] :a)
=> [1 2 :a]
But it also works on the front of lists:
(set-top '(1 2 3) :a)
=> (:a 2 3)
The Clojure stack functions - peek, pop, and conj - work on the natural open end of a sequential collection.
But there is no one right way.
How do the various solutions react to an empty vector?
Your Python v[-1]=new_value throws an exception, as does your (assoc v (dec (count v)) new_value) and my (defn set-top [coll x] (conj (pop coll) x)).
Your (conj (vec (butlast v)) new_value) returns [new_value]. The butlast has no effect.
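The empty-vector behaviour described above can be checked with a small sketch. The helper throws? is made up for illustration; set-top is repeated from above so the snippet stands alone.

```clojure
(defn set-top [coll x]
  (conj (pop coll) x))

;; Hypothetical helper: true if calling f throws any Exception.
(defn throws? [f]
  (try (f) false
       (catch Exception _ true)))

(set-top [1 2 3] :a)                        ; replaces the last element
(throws? #(set-top [] :a))                  ; pop on an empty vector throws
(throws? #(assoc [] (dec (count [])) :a))   ; index -1 is out of bounds
(conj (vec (butlast [])) :a)                ; silently yields a one-element vector
```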
If you insist on being "pure", your 2nd or 3rd solutions will work. I prefer to be simpler & more explicit using the helper functions from the Tupelo library:
(s/defn replace-at :- ts/List
"Replaces an element in a collection at the specified index."
[coll :- ts/List
index :- s/Int
elem :- s/Any]
...)
(is (= [9 1 2] (replace-at (range 3) 0 9)))
(is (= [0 9 2] (replace-at (range 3) 1 9)))
(is (= [0 1 9] (replace-at (range 3) 2 9)))
As with drop-at, replace-at will throw an exception for invalid values of index.
Similar helper functions exist for
insert-at
drop-at
prepend
append
Note that all of the above work equally well for either a Clojure list (eager or lazy) or a Clojure vector. The conj solution will fail unless you are careful to always coerce the input to a vector first as in your example.

Need your help on running clojure library via leiningen

I found a solution for the minimum hitting set problem on GitHub: https://github.com/bdesham/hitting-set and tried to use it. The solution is a Clojure library, so I downloaded Leiningen to try to run it.
I read the README from the GitHub link, but I still don't know how to run the clj code to get the minimal hitting set. I saw that there is a function called minimal-hitting-sets in the hitting_set.clj file, but I don't know how to call it with an argument.
Eg: Get minimal hitting set of:
{"Australia" #{:white :red :blue},
"Tanzania" #{:black :blue :green :yellow},
"Norway" #{:white :red :blue},
"Uruguay" #{:white :blue :yellow},
"Saint Vincent and the Grenadines" #{:blue :green :yellow},
"Ivory Coast" #{:white :orange :green},
"Sierra Leone" #{:white :blue :green},
"United States" #{:white :red :blue}}
Project.clj code:
(defproject hitting-set "0.9.0"
:description "Find minimal hitting sets"
:url "https://github.com/bdesham/hitting-set"
:license {:name "Eclipse Public License"
:url "http://www.eclipse.org/legal/epl-v10.html"
:distribution :repo
:comments "Same as Clojure"}
:main hitting-set
:min-lein-version "2.0.0"
:dependencies [ [org.clojure/clojure "1.4.0"]
[hitting-set "0.9.0"]])
hitting_set.clj code:
(ns hitting-set
(:use hitting-set :only [minimal-hitting-sets]))
; Utility functions
(defn- dissoc-elements-containing
"Given a map in which the keys are sets, removes all keys whose sets contain
the element el. Adapted from http://stackoverflow.com/a/2753997/371228"
[el m]
(apply dissoc m (keep #(-> % val
(not-any? #{el})
(if nil (key %)))
m)))
(defn- map-old-new
"Returns a sequence of vectors. Each first item is an element of coll and the
second item is the result of calling f with that item."
[f coll]
(map #(vector % (f %)) coll))
(defn- count-vertices
"Returns the number of vertices in the hypergraph h."
[h]
(count (apply union (vals h))))
(defn- sorted-hypergraph
"Returns a version of the hypergraph h that is sorted so that the edges with
the fewest vertices come first."
[h]
(into (sorted-map-by (fn [key1 key2]
(compare [(count (get h key1)) key1]
[(count (get h key2)) key2])))
h))
(defn- remove-dupes
"Given a map m, remove all but one of the keys that map to any given value."
[m]
(loop [sm (sorted-map),
m m,
seen #{}]
(if-let [head (first m)]
(if (contains? seen (second head))
(recur sm
(rest m)
seen)
(recur (assoc sm (first head) (second head))
(rest m)
(conj seen (second head))))
sm)))
(defn- efficient-hypergraph
"Given a hypergraph h, returns an equivalent hypergraph that will go through
the hitting set algorithm more quickly. Specifically, redundant edges are
discarded and then the map is sorted so that the smallest edges come first."
[h]
(-> h remove-dupes sorted-hypergraph))
(defn- largest-edge
"Returns the name of the edge of h that has the greatest number of vertices."
[h]
(first (last (sorted-hypergraph h))))
(defn- remove-vertices
"Given a hypergraph h and a set vv of vertices, remove the vertices from h
(i.e. remove all of the vertices of vv from each edge in h). If this would
result in an edge becoming empty, remove that edge entirely."
[h vv]
(loop [h h,
res {}]
(if (first h)
(let [edge (difference (second (first h))
vv)]
(if (< 0 (count edge))
(recur (rest h)
(assoc res (first (first h)) edge))
(recur (rest h)
res)))
res)))
; Auxiliary functions
;
; These functions might be useful if you're working with hitting sets, although
; they're not actually invoked anywhere else in this project.
(defn reverse-map
"Takes a map from keys to sets of values. Produces a map in which the values
are mapped to the set of keys in whose sets they originally appeared."
[m]
(apply merge-with into
(for [[k vs] m]
(apply hash-map (flatten (for [v vs]
[v #{k}]))))))
(defn drop-elements
"Given a set of N elements, return a set of N sets, each of which is the
result of removing a different item from the original set."
[s]
(set (for [e s] (difference s #{e}))))
; The main functions
;
; These are the functions that users are probably going to be interested in.
; Hitting set
(defn hitting-set?
"Returns true if t is a hitting set of h. Does not check whether s is
minimal."
[h t]
(not-any? empty? (map #(intersection % t)
(vals h))))
(defn hitting-set-exists?
"Returns true if a hitting set of size k exists for the hypergraph h. See the
caveat in README.md for odd behavior of this function."
[h k]
(cond
(< (count-vertices h) k) false
(empty? h) true
(zero? k) false
:else (let [hvs (map #(dissoc-elements-containing % h)
(first (vals h)))]
(boolean (some #(hitting-set-exists? % (dec k))
hvs)))))
(defn- enumerate-algorithm
[h k x]
(cond
(empty? h) #{x}
(zero? k) #{}
:else (let [hvs (map-old-new #(dissoc-elements-containing % h)
(first (vals h)))]
(apply union (map #(enumerate-algorithm (second %)
(dec k)
(union x #{(first %)}))
hvs)))))
(defn enumerate-hitting-sets
"Return a set containing the hitting sets of h. See the caveat in README.md
for odd behavior of this function. If the parameter k is passed then the
function will return all hitting sets of size less than or equal to k."
([h]
(enumerate-algorithm (efficient-hypergraph h) (count-vertices h) #{}))
([h k]
(enumerate-algorithm (efficient-hypergraph h) k #{})))
(defn minimal-hitting-sets
"Returns a set containing the minimal hitting sets of the hypergraph h. If
you just want one hitting set and don't care whether there are multiple
minimal hitting sets, use (first (minimal-hitting-sets h))."
[h]
(first (filter #(> (count %) 0)
(map #(enumerate-hitting-sets h %)
(range 1 (inc (count-vertices h)))))))
; Set cover
(defn cover?
"Returns true if the elements of s form a set cover for the hypergraph h."
[h s]
(= (apply union (vals h))
(apply union (map #(get h %) s))))
(defn greedy-cover
"Returns a set cover of h using the 'greedy' algorithm."
[h]
(loop [hh h,
edges #{}]
(if (cover? h edges)
edges
(let [e (largest-edge hh)]
(recur (remove-vertices hh (get hh e))
(conj edges e))))))
(defn approx-hitting-set
"Returns a hitting set of h. The set is guaranteed to be a hitting set, but
may not be minimal."
[h]
(greedy-cover (reverse-map h)))
Since I am a newbie to Leiningen and Clojure, I really need your help with this.
Thanks,
Hung
In general, to use a Clojure library from Clojure:
1. make a new project with lein new app project-name
2. include the library in the :dependencies section of project.clj
3. require and refer to that library in at least one .clj file (core.clj is an example)
4. load that file in your editor of choice and switch the REPL namespace to the namespace in the ns form at the top of the file.
5. ...
6. profit!!
There are a lot more details, though I hope this is enough to give you an overview of one way to go about this, and if you solve step 5 please share your solution ;-)
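As a rough sketch of the dependency step for the library in the question, a separate project's project.clj might look like the fragment below. The project name hitting-demo is hypothetical, and the dependency coordinates are copied from the question's own project.clj, so check them against the library's README before relying on them.

```clojure
;; project.clj (config sketch; `hitting-demo` is a hypothetical project name)
(defproject hitting-demo "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [hitting-set "0.9.0"]]   ; coordinates as given in the question
  :main hitting-demo.core)
```

Assuming the library's namespace is hitting-set, as in the source file quoted in the question, src/hitting_demo/core.clj would then require it in its ns form, for example (:require [hitting-set :refer [minimal-hitting-sets]]), and call (minimal-hitting-sets m) where m is the map of country names to colour sets.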

Efficiently create and diff sets created from large text file

I am attempting to copy about 12 million documents in an AWS S3 bucket to give them new names. The names previously had a prefix and will now all be document name only. So a/b/123 once renamed will be 123. The last segment is a uuid so there will not be any naming collisions.
This process has been partially completed so some have been copied and some still need to be. I have a text file that contains all of the document names. I would like an efficient way to determine which documents have not yet been moved.
I have some naive code that shows what I would like to accomplish.
(def doc-names ["o/123" "o/234" "t/543" "t/678" "123" "234" "678"])
(defn still-need-copied [doc-names]
(let [last-segment (fn [doc-name]
(last (clojure.string/split doc-name #"/")))
by-position (group-by #(.contains % "/") doc-names)
top (set (get by-position false))
nested (set (map #(last-segment %) (get by-position true)))
needs-copied (clojure.set/difference nested top)]
(filter #(contains? needs-copied (last-segment %)) doc-names)))
I would propose this solution:
(defn still-need-copied [doc-names]
(->> doc-names
(group-by #(last (clojure.string/split % #"/")))
(keep #(when (== 1 (count (val %))) (first (val %))))))
First, you group all the items by the last segment of the split string, which gives this for your input:
{"123" ["o/123" "123"],
"234" ["o/234" "234"],
"543" ["t/543"],
"678" ["t/678" "678"]}
Then you just need to select all the map values that have a length of 1 and take their first elements.
I would say it is far more readable than your variant, and it also seems to be more efficient.
Here is why:
as far as I can understand, your code here probably has a complexity of
N (grouping to a map with just 2 keys) +
Nlog(N) (creation and filling of top set) +
Nlog(N) (creation and filling of nested set) +
Nlog(N) (sets difference) +
Nlog(N) (filtering + searching each element in a needs-copied set) =
4Nlog(N) + N
whereas my variant would probably have the complexity of
Nlog(N) (grouping values into a map with a large amount of keys) +
N (keeping needed values) =
N + Nlog(N)
And though asymptotically they are both O(Nlog(N)), practically mine will probably complete faster.
P.S. I'm not an expert in complexity theory; this is only a very rough estimate.
Here is a little test:
(defn generate-data [len]
(doall (mapcat
#(let [n (rand-int 2)]
(if (zero? n)
[(str "aaa/" %) (str %)]
[(str %)]))
(range len))))
(defn still-need-copied [doc-names]
(let [last-segment (fn [doc-name]
(last (clojure.string/split doc-name #"/")))
by-position (group-by #(.contains % "/") doc-names)
top (set (get by-position false))
nested (set (map #(last-segment %) (get by-position true)))
needs-copied (clojure.set/difference nested top)]
(filter #(contains? needs-copied (last-segment %)) doc-names)))
(defn still-need-copied-2 [doc-names]
(->> doc-names
(group-by #(last (clojure.string/split % #"/")))
(keep #(when (== 1 (count (val %))) (first (val %))))))
(def data-100k (generate-data 100000))
(def data-1m (generate-data 1000000))
user> (let [_ (time (dorun (still-need-copied data-100k)))
_ (time (dorun (still-need-copied-2 data-100k)))
_ (time (dorun (still-need-copied data-1m)))
_ (time (dorun (still-need-copied-2 data-1m)))])
"Elapsed time: 714.929641 msecs"
"Elapsed time: 243.918466 msecs"
"Elapsed time: 7094.333425 msecs"
"Elapsed time: 2329.75247 msecs"
So it is ~3 times faster, just as I predicted.
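One caveat worth checking before relying on the benchmark: the two functions do not return the same thing when a name exists only at the top level. The group-by version also returns such names (and generate-data produces them), while the original only returns prefixed names with no top-level copy. A possible variant that keeps only prefixed names is sketched below. The name still-need-copied-strict is made up, and str/includes? assumes Clojure 1.8 or later.

```clojure
(require '[clojure.string :as str])

(defn last-segment [doc-name]
  (last (str/split doc-name #"/")))

;; Sketch: like the group-by version, but skips names that only exist
;; at the top level (already in their final place).
(defn still-need-copied-strict
  [doc-names]
  (->> doc-names
       (group-by last-segment)
       vals
       (keep (fn [names]
               (when (and (= 1 (count names))
                          (str/includes? (first names) "/"))
                 (first names))))))

;; "999" exists only at the top level, so it is not reported.
(still-need-copied-strict ["o/123" "o/234" "t/543" "t/678" "123" "234" "678" "999"])
```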
Update:
I found one solution, which is not so elegant but seems to work.
You said you're using iota, so I generated a huge file of ~15 million lines (with the aforementioned generate-data fn).
Then I decided to sort it by the last part after the slash (so that "123" and "aaa/123" end up next to each other).
(defn last-part [s] (last (clojure.string/split s #"/")))
(def sorted (sort-by last-part (iota/seq "my/file/path")))
It completed surprisingly fast. So the last thing I had to do was write a simple loop that checks, for every item, whether there is an item with the same last part next to it:
(def res (loop [res [] [item1 & [item2 & rest :as tail] :as coll] sorted]
(cond (empty? coll) res
(empty? tail) (conj res item1)
(= (last-part item1) (last-part item2)) (recur res rest)
:else (recur (conj res item1) tail))))
It also completed without any visible difficulties, so I got the needed result without any map/reduce framework.
I also think that if you don't keep the sorted coll in a var, you would probably save memory by avoiding retention of the huge coll's head:
(def res (loop [res []
[item1 & [item2 & rest :as tail] :as coll] (sort-by last-part (iota/seq "my/file/path"))]
(cond (empty? coll) res
(empty? tail) (conj res item1)
(= (last-part item1) (last-part item2)) (recur res rest)
:else (recur (conj res item1) tail))))

clojure performance on badly performing code

I have completed this problem on hackerrank and my solution passes most test cases but it is not fast enough for 4 out of the 11 test cases.
My solution looks like this:
(ns scratch.core
(:require [clojure.string :as str]))
(defn ascii [char]
(int (.charAt (str char) 0)))
(defn process [text]
(let [parts (split-at (int (Math/floor (/ (count text) 2))) text)
left (first parts)
right (if (> (count (last parts)) (count (first parts)))
(rest (last parts))
(last parts))]
(reduce (fn [acc i]
(let [a (ascii (nth left i))
b (ascii (nth (reverse right) i))]
(if (> a b)
(+ acc (- a b))
(+ acc (- b a))))
) 0 (range (count left)))))
(defn print-result [[x & xs]]
(prn x)
(if (seq xs)
(recur xs)))
(let [input (slurp "/Users/paulcowan/Downloads/input10.txt")
inputs (str/split-lines input)
length (read-string (first inputs))
texts (rest inputs)]
(time (print-result (map process texts))))
Can anyone give me any advice about what I should look at to make this faster?
Would using recursion instead of reduce be faster or maybe this line is expensive:
right (if (> (count (last parts)) (count (first parts)))
(rest (last parts))
(last parts))
Because I am getting a count twice.
You are redundantly calling reverse on every iteration of the reduce:
user=> (let [c [1 2 3]
noisey-reverse #(doto (reverse %) println)]
(reduce (fn [acc e] (conj acc (noisey-reverse c) e))
[]
[:a :b :c]))
(3 2 1)
(3 2 1)
(3 2 1)
[(3 2 1) :a (3 2 1) :b (3 2 1) :c]
The reversed value could be calculated inside the containing let, and would then only need to be calculated once.
Also, due to the way your parts is defined, you are doing linear time lookups with each call to nth. It would be better to put parts in a vector and do indexed lookup. In fact you wouldn't need a reversed parts, and could do arithmetic based on the count of the vector to find the item to look up.
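Putting both suggestions together, a possible rewrite of process is sketched below. The name process-fast is made up. It puts the characters in a vector once and pairs position i with position n - 1 - i, so no reverse and no linear-time nth on a seq is needed. For odd lengths the middle character is never visited, which matches dropping it in the original.

```clojure
;; Sketch: sum of absolute code-point differences between mirrored characters.
;; `process-fast` is a hypothetical name; `text` is assumed to be a string.
(defn process-fast [text]
  (let [cs (vec text)          ; vector of chars: constant-time indexed lookup
        n  (count cs)]
    (reduce (fn [acc i]
              (let [a (int (nth cs i))
                    b (int (nth cs (- n 1 i)))]
                (+ acc (Math/abs (- a b)))))
            0
            (range (quot n 2)))))

(process-fast "abcd")
```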

insert-sort with reduce clojure

I have a function
(defn goneSeq [inseq uptil]
(loop [counter 0 newSeq [] orginSeq inseq]
(if (== counter uptil)
newSeq
(recur (inc counter) (conj newSeq (first orginSeq)) (rest orginSeq)))))
(defn insert [sorted-seq n]
(loop [currentSeq sorted-seq counter 0]
(cond (empty? currentSeq) (concat sorted-seq (vector n))
(<= n (first currentSeq)) (concat (goneSeq sorted-seq counter) (vector n) currentSeq)
:else (recur (rest currentSeq) (inc counter)))))
that takes a sorted sequence and inserts the number n at its appropriate position. For example, (insert [1 3 4] 2) returns [1 2 3 4].
Now I want to use this function with reduce to sort a given sequence, so something like:
(reduce (insert seq n) givenSeq)
What is the correct way to achieve this?
If the function works for inserting a single value, then this would work:
(reduce insert [] givenSeq)
for example:
user> (reduce insert [] [0 1 2 30.5 0.88 2.2])
(0 0.88 1 2 2.2 30.5)
Also, it should be noted that sort and sort-by are built in and are better than most hand-rolled solutions.
May I suggest some simpler ways to do insert:
A slowish lazy way is
(defn insert [s x]
(let [[fore aft] (split-with #(> x %) s)]
(concat fore (cons x aft))))
A faster eager way is
(defn insert [coll x]
(loop [fore [], coll coll]
(if (and (seq coll) (> x (first coll)))
;; move the smaller head element into fore and keep scanning
(recur (conj fore (first coll)) (rest coll))
(concat fore (cons x coll)))))
By the way, you had better put your defns in bottom-up order, if possible. Use declare if there is mutual recursion. You had me thinking your solution did not compile.
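To tie the pieces together, here is a small self-contained sketch of insertion sort built from reduce and a two-argument insert. The names insert-sorted and insertion-sort are made up. It builds a vector at each step instead of nesting lazy concat calls, which could otherwise pile up on long inputs.

```clojure
;; Sketch: insert x into an already sorted collection, returning a vector.
(defn insert-sorted [sorted-coll x]
  (let [[fore aft] (split-with #(< % x) sorted-coll)]
    (into (conj (vec fore) x) aft)))

;; reduce threads the growing sorted vector through insert-sorted,
;; starting from an empty vector.
(defn insertion-sort [coll]
  (reduce insert-sorted [] coll))

(insertion-sort [0 1 2 30.5 0.88 2.2])
```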