Complexity of Clojure's distinct + randomly generated stream - clojure

What is the time complexity of an expression
(doall (take n (distinct stream)))
where stream is a lazily generated (possibly infinite) collection with duplicates?
I guess this partially depends on the number or likelihood of duplicates in stream? What if stream is (repeatedly #(rand-int m)) where m >= n?
My estimation:
For every element in the resulting list there has to be at least one element realized from the stream. Multiple if the stream has duplicates. For every iteration there is a set lookup and/or insert, but since those are near constant time we get at least: O(n*~1) = O(n) and then some complexity for the duplicates. My intuition is that the complexity for the duplicates can be neglected too, but I'm not sure how to formalize this. For example, we cannot just say it is O(n*k*~1) = O(n) for some constant k since there is not an obvious maximum number k of duplicates we could encounter in the stream.
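To make the counting argument concrete, here is a minimal sketch of how many input elements have to be consumed before n distinct values have been seen. The helper name realized-count is hypothetical and not part of the question's code; it mirrors the idea of tracking a "seen" set, with one set insert per consumed element.

```clojure
(defn realized-count
  "Number of elements consumed from coll until n distinct values
  have been seen (or coll is exhausted)."
  [n coll]
  (loop [seen #{}, k 0, xs (seq coll)]
    (if (or (= (count seen) n) (nil? xs))
      k
      ;; one set insert per consumed element, duplicate or not
      (recur (conj seen (first xs)) (inc k) (next xs)))))

(realized-count 3 [1 1 2 1 2 3 4]) ;=> 6
```

The total work is therefore proportional to the number of consumed elements, which is n plus however many duplicates were skipped along the way.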
Let me demonstrate the problem with some data:
(defn stream [upper distinct-n]
  (let [counter (volatile! 0)]
    (doall (take distinct-n
                 (distinct
                  (repeatedly (fn []
                                (vswap! counter inc)
                                (rand-int upper))))))
    @counter))
(defn sample [times-n upper distinct-n]
(->> (repeatedly times-n
#(stream upper distinct-n))
frequencies
(sort-by val)
reverse))
(sample 10000 5 1) ;; ([1 10000])
(sample 10000 5 2) ;; ([2 8024] [3 1562] [4 334] [5 66] [6 12] [8 1] [7 1])
(sample 10000 5 3) ;; ([3 4799] [4 2898] [5 1324] [6 578] [7 236] [8 87] [9 48] [10 14] [11 10] [14 3] [12 2] [13 1])
(sample 10000 5 3) ;; ([3 4881] [4 2787] [5 1359] [6 582] [7 221] [8 107] [9 39] [10 12] [11 9] [12 1] [17 1] [13 1])
(sample 10000 5 4) ;; ([5 2258] [6 1912] [4 1909] [7 1420] [8 985] [9 565] [10 374] [11 226] [12 138] [13 89] [14 50] [15 33] [16 16] [17 9] [18 8] [20 5] [19 1] [23 1] [21 1])
(sample 10000 5 5) ;; ([8 1082] [9 1055] [7 1012] [10 952] [11 805] [6 778] [12 689] [13 558] [14 505] [5 415] [15 387] [16 338] [17 295] [18 203] [19 198] [20 148] [21 100] [22 96] [23 72] [24 53] [25 44] [26 40] [28 35] [27 31] [29 19] [30 16] [31 15] [32 13] [35 10] [34 6] [33 6] [42 3] [38 3] [45 3] [36 3] [37 2] [39 2] [52 1] [66 1] [51 1] [44 1] [41 1] [50 1] [60 1] [58 1])
Note that for the last sample the number of iterations distinct can go up to 66, although the chance is small.
Also notice that for increasing n in (sample 10000 n n) the most likely number of realized elements from the stream seems to go up more than linearly.
This chart illustrates the number of realized elements from the input (most common occurrence from 10000 samples) in (doall (take n (distinct (repeatedly #(rand-int m))))) for various numbers of n and m.
For completeness, here is the code I used to generate the chart:
(require '[com.hypirion.clj-xchart :as c])
(defn most-common [times-n upper distinct-n]
(->> (repeatedly times-n
#(stream upper distinct-n))
frequencies
(sort-by #(- (val %)))
ffirst))
(defn series [m]
{(str "m = " m)
(let [x (range 1 (inc m))]
{:x x
:y (map #(most-common 10000 m %)
x)})})
(c/view
(c/xy-chart
(merge (series 10)
(series 25)
(series 50)
(series 100))
{:x-axis {:title "n"}
:y-axis {:title "realized"}}))

Your problem is known as the coupon collector's problem, and the expected number of elements is given by summing m/m + m/(m-1) + ... until you have your n items:
(defn general-coupon-collector-expect
  "n: Cardinality of resulting set (# of unique coupons to collect)
  m: Cardinality of set to draw from (# of coupons that exist)"
  [n m]
  {:pre [(<= n m)]}
  (double (apply + (mapv / (repeat m) (range m (- m n) -1)))))
(general-coupon-collector-expect 25 25)
;; => approximately 95.4
;; This generates the data for your plot:
(for [x (range 10 101 5)]
(general-coupon-collector-expect x 100))
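The worst case is unbounded: the stream may never produce N distinct values. The best case is exactly N draws. The average case (for N equal to the number of possible values) is O(N log N) draws. This ignores the cost of checking whether an element has already been drawn. In practice each check is O(log_32 N) for Clojure sets (which distinct uses internally).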
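The sum computed by general-coupon-collector-expect can also be written in closed form, which shows where the N log N average comes from. As a worked sketch, with H_k denoting the k-th harmonic number (notation introduced here, not in the original answer):

```latex
\mathbb{E}[\text{draws}]
  = \sum_{i=0}^{n-1} \frac{m}{m-i}
  = m\,\bigl(H_m - H_{m-n}\bigr),
\qquad H_k = \sum_{j=1}^{k} \frac{1}{j},\quad H_0 = 0 .

% Full collection (n = m):
\mathbb{E}[\text{draws}] = m\,H_m \approx m \ln m + 0.577\,m .

% Example, m = n = 5:
5\left(1 + \tfrac12 + \tfrac13 + \tfrac14 + \tfrac15\right) = \tfrac{137}{12} \approx 11.42 .

% If n \le m/2, every term satisfies m/(m-i) < 2, so
\mathbb{E}[\text{draws}] < 2n .
```

So the expected number of draws is linear in n as long as n is at most half of m, and only approaches m log m as n gets close to m.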

While I agree with ClojureMostly's answer that a lookup in a lazy sequence is O(1) if you iterate over the list in order, I disagree with the best and worst case complexity.
In general (doall (take n (distinct stream))) is not guaranteed to finish at all, so the worst case time complexity is obviously O(infinite). Even if the stream is generated randomly, it might be identical to, let's say, (repeat 0).
Best case complexity would be O(1) for n<=1 (you do not need any check for distinctness on a list of length 0 or 1).
If you say n needs to be greater than 1, it will be O(n(n-1)/2): for a list that is already distinct, you need to compare each element against all the elements that come after it, which is (n-1) + (n-2) + ... + (n-n) = n(n-1)/2 comparisons. This is a slight variation on what Carl Friedrich Gauß, a German mathematician, discovered while in primary school.
Note that the best and worst case do not depend on how the stream is generated. However, this becomes important if you are interested in the average complexity.
Average complexity:
Let's say you generate the stream with (repeatedly #(rand-int m)), which means it is evenly distributed.
Average complexity will then be the best case complexity plus the expected number of duplicates in the first n elements of the stream (that is, n/m) times the expected number of additional lookups in the stream needed to find another value that is not yet in the resulting list. This will be (i/m), where i is the index of the current element in the resulting list.
Because stream is an evenly distributed random sequence, i is expected to be evenly distributed as well, so it will on average equal n/2. There we go:
O(n(n-1)/2 + (n/m) * (n/(2m)))

Related

ClojureScript zipmap tricks me or what?

I use ClojureScript to develop web browser games. (Actually, a friend of mine is teaching me; we started only a few weeks ago.)
I wanted to generate a map in which keys are vectors and values are numbers. For example: {[0 0] 0, [0 1] 1, [0 2] 2, ...}.
I used this formula:
(defn oxo [x y]
(zipmap (map vec (combi/cartesian-product (range 0 x) (range 0 y))) (range (* x y))))
(where combi/ refers to clojure.math.combinatorics).
When it generates the map, key-value pairs are ok, but they are in a random order, like:
{[0 1] 1, [6 8] 68, [6 9] 69, [5 7] 57, ...}
What went wrong after using zipmap and how can I fix it?
Clojure maps aren't guaranteed to have ordered/sorted keys. If you want to ensure the keys are sorted, use a sorted-map:
(into (sorted-map) (oxo 10 10))
=>
{[0 0] 0,
[0 1] 1,
[0 2] 2,
[0 3] 3,
[0 4] 4,
[0 5] 5,
...
If your map has fewer than 9 keys then insertion order is preserved because the underlying data structure is different depending on the number of keys:
clojure.lang.PersistentArrayMap for <9 keys
clojure.lang.PersistentHashMap otherwise.
array-map produces a clojure.lang.PersistentArrayMap and sorted-map produces a clojure.lang.PersistentTreeMap. Note that associng onto an array map may produce a hash map, but associng on to a sorted map still produces a sorted map.
zipmap produces a hash-map where order of the keys is not guaranteed.
If you want ordered keys you can use either sorted-map or array-map.
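As a minimal, self-contained sketch of the sorted-map route: the function below (oxo-sorted is a hypothetical name) builds the same kind of map but replaces combi/cartesian-product with a plain for so that it needs no extra dependency. It assumes Clojure on the JVM, where vectors of equal length compare element by element; ClojureScript is expected to behave the same way, but that is not verified here.

```clojure
(defn oxo-sorted
  "Map from [a b] coordinates to a running index, with keys kept sorted."
  [x y]
  (into (sorted-map)
        (map-indexed (fn [i k] [k i])
                     ;; same pairs as the cartesian product of the two ranges
                     (for [a (range x), b (range y)] [a b]))))

(keys (oxo-sorted 2 3))
;=> ([0 0] [0 1] [0 2] [1 0] [1 1] [1 2])
```

Lookups in a sorted map are logarithmic rather than near constant time, so this trades a little speed for a stable key order.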
As far as my knowledge goes, you should not rely on Map/Hash/Dictionary for ordering in any languages.
If the order is important but you don't need O(1) lookup performance of the map, a vector of vector pairs is a good option for you.
(defn oxo [x y]
(mapv vector (map vec (combi/cartesian-product (range 0 x) (range 0 y))) (range (* x y))))
You will get something like this.
=> (oxo 10 10)
[[[0 0] 0] [[0 1] 1] [[0 2] 2] [[0 3] 3] [[0 4] 4] [[0 5] 5] ...]

How to split an input sequence according to the input number given

I'm writing a clojure function like:
(defn area [n locs]
(let [a1 (first locs)]
(vrp (rest locs))))
I basically want to input like: (area 3 '([1 2] [3 5] [3 1] [4 2])) But when I do that it gives me an error saying Wrong number of args (1) passed. But I'm passing two arguments.
What I actually want to do with this function is that whatever value of n is inputted (say 3 is inputted), then a1 should store [1 2], a2 should store [3 5], a3 should store ([3 1] [4 2]). What should I add in the function to get that?
Clojure's built-in split-at function is very close to solving this. It splits a sequence "at" a given point. So if we first split the data apart, then wrap the second half in a list and concatenate everything back together, it should solve this problem:
user> (let [[start & end] (split-at 2 sample-data)]
(concat start end))
([1 2] [3 5] ([3 1] [4 2]))
The & before end in the let destructuring causes the remaining item to be rolled up in a list. It's equivalent to:
user> (let [[start end] (split-at 2 sample-data)]
(concat start (list end)))
([1 2] [3 5] ([3 1] [4 2]))
If I recklessly assume you have some function called vrp that needs data in this form then you could finish your function with something like this:
(defn area [n locs]
  (let [[start end] (split-at (dec n) locs)
        a (concat start (list end))]
    (apply vrp a)))
Though please forgive me for making wild guesses as to the nature of vrp; I could be totally off base here.
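A self-contained sketch of just the splitting step, leaving vrp out entirely since its behaviour is unknown. The helper name split-into is hypothetical; it returns the first n-1 items individually followed by the remainder as one list.

```clojure
(defn split-into
  "First (dec n) items of locs, followed by the rest wrapped in one list."
  [n locs]
  (let [[start end] (split-at (dec n) locs)]
    (concat start (list end))))

(split-into 3 '([1 2] [3 5] [3 1] [4 2]))
;=> ([1 2] [3 5] ([3 1] [4 2]))

;; Destructuring then gives the a1, a2, a3 from the question:
(let [[a1 a2 a3] (split-into 3 '([1 2] [3 5] [3 1] [4 2]))]
  [a1 a2 a3])
;=> [[1 2] [3 5] ([3 1] [4 2])]
```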

Adding sets of numbers up to 16

I have some sets of numbers:
(#{7 1} #{3 5} #{6 3 2 5}
#{0 7 1 8} #{0 4 8} #{7 1 3 5}
#{6 2} #{0 3 5 8} #{4 3 5}
#{4 6 2} #{0 6 2 8} #{4} #{0 8}
#{7 1 6 2} #{7 1 4})
I wish to turn each set into a four-number vector, such that each vector sums to 16 and its elements come only from the corresponding set:
#{7 1} => [1 1 7 7]
#{4 3 5} => [3 4 4 5]
#{4} => [4 4 4 4]
#{0 8} => [0 0 8 8]
Lastly, the vector has to contain all the numbers in the set. It would be great to solve this for arbitrary vector lengths :)
How would the Clojure code be written?
With small sets and the originally stated output length of 4
This is easily handled with naive search
(defn bag-sum [s n]
(for [a s, b s, c s, d s
:let [v [a b c d]]
:when (= n (apply + v))
:when (= (set v) s)]
v))
(take 1 (bag-sum #{7 1} 16)) ;=> ([7 7 1 1])
(take 1 (bag-sum #{3 5} 16)) ;=> ([3 3 5 5])
(take 1 (bag-sum #{4 3 5} 16)) ;=> ([4 4 3 5])
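The same naive search can be extended to an arbitrary output length. The following is only a sketch under two assumptions: all numbers are non-negative (so a branch can be abandoned once a value exceeds the remaining total), and only ascending vectors are wanted, as in the question's examples. The name bag-sum-m is hypothetical.

```clojure
(defn bag-sum-m
  "Ascending vectors of length m, drawn from set s, that sum to n
  and contain every element of s. Assumes non-negative numbers."
  [s m n]
  (let [xs (vec (sort s))]
    (letfn [(go [i len total]
              ;; i: smallest index into xs that may still be used
              (if (zero? len)
                (if (zero? total) [[]] [])
                (for [j (range i (count xs))
                      :let [x (xs j)]
                      :when (<= x total)
                      tail (go j (dec len) (- total x))]
                  (into [x] tail))))]
      (filter #(= (set %) s) (go 0 m n)))))

(bag-sum-m #{7 1} 4 16)   ;=> ([1 1 7 7])
(bag-sum-m #{4 3 5} 4 16) ;=> ([3 4 4 5])
(bag-sum-m #{0 8} 4 16)   ;=> ([0 0 8 8])
```

Generating only ascending vectors avoids enumerating every permutation of the same multiset, which keeps the search small for modest m.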
Assuming 16 is fixed and all numbers are non-negative
The search space even without the set constraint is tiny.
(require '[clojure.math.combinatorics :refer [partitions]])
(count (partitions (repeat 16 1))) ;=> 231
So, again a naive solution is very practical. We'll produce solutions of all lengths, which can be further filtered as desired. If there is a zero in the input set, it can pad any solution.
(defn bag-sum16 [s]
(for [p (partitions (repeat 16 1))
:let [v (mapv (partial apply +) p)]
:when (= (set v) s)]
v))
First example has 2 solutions - length 4 and length 10.
(bag-sum16 #{7 1}) ;=> ([7 7 1 1] [7 1 1 1 1 1 1 1 1 1])
(bag-sum16 #{3 5}) ;=> ([5 5 3 3])
(bag-sum16 #{3 4 5}) ;=> ([5 4 4 3])
Using core.logic finite domains to prune the search space with arbitrary but specified domain set s, output length m, and sum n
This is still fairly naive but prunes the search tree when the target sum is exceeded. I am a novice at core.logic, so this is more an opportunity to practice than an attempt at best representation of the problem. This performs worse than the naive solutions above on small spaces, but enables calculation in some medium size cases.
(defn bag-sum-logic [s m n]
(let [m* (- m (count s))
n* (- n (apply + s))
nums (vec (repeatedly m* lvar))
sums (concat [0] (repeatedly (dec m*) lvar) [n*])
dom (apply fd/domain (sort s))
rng (fd/interval n*)
sol (run 1 [q]
(== q nums)
(everyg #(fd/in % dom) nums)
(everyg #(fd/in % rng) sums)
(everyg #(apply fd/+ %)
(map cons nums (partition 2 1 sums))))]
(when (seq sol) (sort (concat s (first sol))))))
(bag-sum-logic #{7 1} 4 16) ;=> (1 1 7 7)
(bag-sum-logic #{7 1} 10 16) ;=> (1 1 1 1 1 1 1 1 1 7)
(bag-sum-logic #{3 5} 4 16) ;=> (3 3 5 5)
(bag-sum-logic #{3 4 5} 4 16) ;=> (3 4 4 5)
(time (bag-sum-logic #{3 4 5} 30 100))
;=> "Elapsed time: 18.739627 msecs"
;=> (3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 5 5 5 5)
Better algorithms for the general case?
This problem is a linear Diophantine equation, which can be solved with the Extended Euclidean Algorithm via matrix unimodular row reduction, i.e. carry out the Euclidean algorithm in one column while bringing the entire basis row along for the ride.
For example, in the case of #{3 5} and sum 16, you want to solve the equation
3x + 5y = 16
subject to the additional constraints that x > 0, y > 0 and x + y = 4 (your example).
The matrix and reduction steps
[[3 1 0]  ->  [[3  1 0]  ->  [[1  2 -1]  ->  [[1  2 -1]
 [5 0 1]]      [2 -1 1]]      [2 -1  1]]      [0 -5  3]]
So the GCD of 3 and 5 is 1, which divides into 16. Therefore there are (infinitely many) solutions before the constraints
x = 16 * 2 - 5k
y = 16 * -1 + 3k
Since we need x + y = 4, 4 = 16 - 2k and therefore k = 6, so
x = 2
y = 2
And we need 2 copies of 3 and 2 copies of 5.
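The two-value case above can be sketched in code. This is an illustration under stated assumptions, not a general solver: it handles exactly two distinct positive values a and b, and the function names ext-gcd and two-value-counts are hypothetical. ext-gcd tracks the same coefficients as the row reduction, and two-value-counts picks the k that makes the counts add up to the required length m.

```clojure
(defn ext-gcd
  "Returns [g x y] such that a*x + b*y = g = gcd(a, b)."
  [a b]
  (loop [r0 a, x0 1, y0 0
         r1 b, x1 0, y1 1]
    (if (zero? r1)
      [r0 x0 y0]
      (let [q (quot r0 r1)]
        (recur r1 x1 y1
               (- r0 (* q r1)) (- x0 (* q x1)) (- y0 (* q y1)))))))

(defn two-value-counts
  "Counts [x y] with a*x + b*y = n, x + y = m, x > 0, y > 0, or nil."
  [a b m n]
  (let [[g x0 y0] (ext-gcd a b)]
    (when (zero? (rem n g))
      (let [scale (quot n g)
            px (* x0 scale), py (* y0 scale) ; particular solution
            dx (quot b g),   dy (quot a g)   ; x = px - dx*k, y = py + dy*k
            num (- (+ px py) m)
            den (- dx dy)]
        (when (and (not (zero? den)) (zero? (rem num den)))
          (let [k (quot num den)
                x (- px (* dx k))
                y (+ py (* dy k))]
            (when (and (pos? x) (pos? y)) [x y])))))))

(ext-gcd 3 5)               ;=> [1 2 -1]
(two-value-counts 3 5 4 16) ;=> [2 2]
```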
This generalizes to more than 2 variables in the same manner. But whereas for 2 variables the length of the solution fully constrains the single free variable as shown above, 3 or more variables can be underspecified.
Solving linear Diophantine equations can be done in polynomial time. However, once you add the bounds (0, m), finding a solution becomes NP-complete, though a quick perusal of research results suggests there are fairly tractable approaches.
Working on the assumptions that you only want one solution per set and you want the solution ordered ascending as per your example this is what I came up with. There aren't many combinations of sets of 1-4 numbers so the way I initially decomposed the problem was to look at what the pattern of possible solutions might look like.
(def x #{3 5})
(def g 16)
(def y {1 [[0 0 0 0]]
2 [[0 0 0 1][0 0 1 1][0 1 1 1]]
3 [[0 0 1 2][0 1 1 2][0 1 2 2]]
4 [[0 1 2 3]]})
The key of this map indicates the size of the set x that is being evaluated. The values are the possible permutations of indices into the set once it is sorted into a vector. Now we can choose the permutations based on the size of the set and calculate the values of each permutation, keeping those that reach the goal:
(filter #(= g (apply + %))
(for [p (y (count x))]
(mapv #((into [] (sort x)) %) p)))
The permutations stored under each key of the map above follow a pattern: the first index is always 0, the last is always the set size - 1, and every value is either the same as or one above the value to its left. Therefore, the above map can be generalised to:
(defn y2 [m s]
(map (fn [c] (reduce #(conj %1 (+ %2 (peek %1))) [0] c))
(clojure.math.combinatorics/permutations
(mapv #(if (>= % (dec s)) 0 1) (range (dec m))))))
(def y (partial y2 4))
The filter will now work for any number of set items up to s. As the input set is sorted, the search could be optimised to find the right (or no) solution by doing a binary search over the permutations of possible solutions, for O(log2 n) search time.

Circularly shifting nested vectors

Given a nested vector A
[[1 2 3] [4 5 6] [7 8 9]]
my goal is to circularly shift rows and columns.
If I first consider a single row shift I'd expect
[[7 8 9] [1 2 3] [4 5 6]]
where the 3rd row maps to the first in this case.
This is implemented by the code
(defn circles [x i j]
(swap-rows x i j))
with inputs
(circles [[1 2 3] [4 5 6] [7 8 9]] 0 1)
However, I am unsure how to go further and shift columns. Ideally, I would like to add to the function circles and be able to either shift rows or columns. Although I'm not sure if it's easiest to just have two distinct functions for each shift choice.
(defn circles [xs i j]
(letfn [(shift [v n]
(let [n (- (count v) n)]
(vec (concat (subvec v n) (subvec v 0 n)))))]
(let [ys (map #(shift % i) xs)
j (- (count xs) j)]
(vec (concat (drop j ys) (take j ys))))))
Example:
(circles [[1 2 3] [4 5 6] [7 8 9]] 1 1)
;= [[9 7 8] [3 1 2] [6 4 5]]
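If two separate functions are preferred, as the question suggests, one possible sketch is below. The names rotate-right, shift-rows and shift-cols are hypothetical; the input is assumed to be a vector of vectors. Using mod lets the shift exceed the length or be negative (a negative n shifts the other way).

```clojure
(defn rotate-right
  "Circularly shifts vector v by n positions towards the end."
  [v n]
  (let [c (count v)]
    (if (zero? c)
      v
      (let [k (- c (mod n c))]
        (vec (concat (subvec v k) (subvec v 0 k)))))))

(defn shift-rows
  "Moves every row down by n, wrapping around."
  [m n]
  (rotate-right m n))

(defn shift-cols
  "Moves every column right by n, wrapping around."
  [m n]
  (mapv #(rotate-right % n) m))

(shift-rows [[1 2 3] [4 5 6] [7 8 9]] 1)
;=> [[7 8 9] [1 2 3] [4 5 6]]
(shift-cols [[1 2 3] [4 5 6] [7 8 9]] 1)
;=> [[3 1 2] [6 4 5] [9 7 8]]
```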
Depending on how often you expect to perform this operation, the sizes of the input vectors and the shifts to be applied, using core.rrb-vector could make sense. clojure.core.rrb-vector/catvec is the relevant function (you could also use clojure.core.rrb-vector/subvec for slicing, but actually here it's fine to use the regular subvec from clojure.core, as catvec will perform its own conversion).
You can also use cycle:
(defn circle-drop [i coll]
(->> coll
cycle
(drop i)
(take (count coll))
vec))
(defn circles [coll i j]
(let [n (count coll)
i (- n i)
j (- n j)]
(->> coll
(map #(circle-drop i %))
(circle-drop j))))
(circles [[1 2 3] [4 5 6] [7 8 9]] 2 1)
;; => [[8 9 7] [2 3 1] [5 6 4]]
There's a function for that called rotate in core.matrix (as is often the case for general purpose array/matrix operations)
The second parameter to rotate lets you choose the dimension to rotate around (0 for rows, 1 for columns)
(use 'clojure.core.matrix)
(def A [[1 2 3] [4 5 6] [7 8 9]])
(rotate A 0 1)
=> [[4 5 6] [7 8 9] [1 2 3]]
(rotate A 1 1)
=> [[2 3 1] [5 6 4] [8 9 7]]

Clojure: Semi-Flattening a nested Sequence

I have a list with embedded lists of vectors, which looks like:
(([1 2]) ([3 4] [5 6]) ([7 8]))
Which I know is not ideal to work with. I'd like to flatten this to ([1 2] [3 4] [5 6] [7 8]).
flatten doesn't work: it gives me (1 2 3 4 5 6 7 8).
How do I do this? I figure I need to create a new list based on the contents of each list item, not the items, and it's this part I can't find out how to do from the docs.
If you only want to flatten it one level you can use concat
(apply concat '(([1 2]) ([3 4] [5 6]) ([7 8])))
=> ([1 2] [3 4] [5 6] [7 8])
To turn a list-of-lists into a single list containing the elements of every sub-list, you want apply concat as nickik suggests.
However, there's usually a better solution: don't produce the list-of-lists to begin with! For example, let's imagine you have a function called get-names-for which takes a symbol and returns a list of all the cool things you could call that symbol:
(get-names-for '+) => (plus add cross junction)
If you want to get all the names for some list of symbols, you might try
(map get-names-for '[+ /])
=> ((plus add cross junction) (slash divide stroke))
But this leads to the problem you were having. You could glue them together with an apply concat, but better would be to use mapcat instead of map to begin with:
(mapcat get-names-for '[+ /])
=> (plus add cross junction slash divide stroke)
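For a runnable version of this comparison, here is a small sketch in which get-names-for is backed by a made-up lookup table (names-table is purely illustrative). It shows that mapcat gives the same result as map followed by apply concat.

```clojure
;; Hypothetical data standing in for whatever get-names-for really does.
(def names-table
  {'+ '(plus add cross junction)
   '/ '(slash divide stroke)})

(defn get-names-for [sym]
  (get names-table sym ()))

(mapcat get-names-for '[+ /])
;=> (plus add cross junction slash divide stroke)

(= (mapcat get-names-for '[+ /])
   (apply concat (map get-names-for '[+ /])))
;=> true
```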
The code for flatten is fairly short:
(defn flatten
[x]
(filter (complement sequential?)
(rest (tree-seq sequential? seq x))))
It uses tree-seq to walk through the data structure and return a sequence of the atoms. Since we want all the bottom-level sequences, we could modify it like this:
(defn almost-flatten
[x]
(filter #(and (sequential? %) (not-any? sequential? %))
(rest (tree-seq #(and (sequential? %) (some sequential? %)) seq x))))
so we return all the sequences that don't contain sequences.
You may also find this general one-level flatten function, which I found on clojuremvc, useful:
(defn flatten-1
"Flattens only the first level of a given sequence, e.g. [[1 2][3]] becomes
[1 2 3], but [[1 [2]] [3]] becomes [1 [2] 3]."
[seq]
(if (or (not (seqable? seq)) (nil? seq))
seq ; if seq is nil or not a sequence, don't do anything
(loop [acc [] [elt & others] seq]
(if (nil? elt) acc
(recur
(if (seqable? elt)
(apply conj acc elt) ; if elt is a sequence, add each element of elt
(conj acc elt)) ; if elt is not a sequence, add elt itself
others)))))
Example:
(flatten-1 '(([1 2]) ([3 4] [5 6]) ([7 8])))
=>[[1 2] [3 4] [5 6] [7 8]]
The concat example will surely do the job for you, but this flatten-1 also allows non-seq elements inside a collection:
(flatten-1 '(1 2 ([3 4] [5 6]) ([7 8])))
=>[1 2 [3 4] [5 6] [7 8]]
;whereas
(apply concat '(1 2 ([3 4] [5 6]) ([7 8])))
=> java.lang.IllegalArgumentException:
Don't know how to create ISeq from: java.lang.Integer
Here's a function that will flatten down to the sequence level, regardless of uneven nesting:
(fn flt [s] (mapcat #(if (every? coll? %) (flt %) (list %)) s))
So if your original sequence was:
'(([1 2]) (([3 4]) ((([5 6])))) ([7 8]))
You'd still get the same result:
([1 2] [3 4] [5 6] [7 8])
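As a quick check, the same function can be given a name with defn (flt is kept from the anonymous version) and run on both inputs. One caveat worth knowing: (every? coll? []) is true, so an empty vector is treated as a container and disappears from the output.

```clojure
(defn flt
  "Flattens nested sequences down to the innermost collections."
  [s]
  (mapcat #(if (every? coll? %) (flt %) (list %)) s))

(flt '(([1 2]) ([3 4] [5 6]) ([7 8])))
;=> ([1 2] [3 4] [5 6] [7 8])

(flt '(([1 2]) (([3 4]) ((([5 6])))) ([7 8])))
;=> ([1 2] [3 4] [5 6] [7 8])

;; Caveat: empty collections are dropped.
(flt '(([1 2]) ([])))
;=> ([1 2])
```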