Clojure - counting unique values from vectors in a seq

Being somewhat new to Clojure I can't seem to figure out how to do something that seems like it should be simple. I just can't see it. I have a seq of vectors. Let's say each vector has two values representing customer number and invoice number and each of the vectors represents a sale of an item. So it would look something like this:
([ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ])
I want to count the number of unique customers and unique invoices. So the example should produce the vector
[ 2 3 ]
In Java or another imperative language I would loop over each of the vectors in the seq, add the customer number and invoice number to their respective sets, then count the values in each set and return them. I can't see the functional way to do this.
Thanks for the help.
EDIT: I should have specified in my original question that the seq of vectors numbers in the tens of millions and that each vector actually has more than just two values. So I want to go through the seq only once and calculate these unique counts (and some sums as well) on that single pass.

In Clojure you can do it almost the same way - first call distinct to get the unique values and then use count to count the results:
(def vectors (list [ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ]))
(defn count-unique [coll]
  (count (distinct coll)))
(def result [(count-unique (map first vectors)) (count-unique (map second vectors))])
Note that here you first get seqs of the first and second elements of the vectors (map first/second vectors) and then operate on each separately, thus iterating over the collection twice. If performance matters, you may do the same thing with explicit iteration (see the loop form or tail recursion) and sets, just like you would in Java. To further improve performance you can also use transients. Though for a beginner I would recommend the first way with distinct.
UPD. Here's a version with loop:
(defn count-unique-vec [coll]
  (loop [coll coll, e1 (transient #{}), e2 (transient #{})]
    (cond (empty? coll) [(count (persistent! e1)) (count (persistent! e2))]
          :else (recur (rest coll)
                       (conj! e1 (first (first coll)))
                       (conj! e2 (second (first coll)))))))
(count-unique-vec vectors) ==> [2 3]
As you can see, there's no need for atoms or anything like that. First, you pass the state along to each next iteration (the recur call). Second, you use transients as temporary mutable collections (read more on transients for details) and thus avoid creating a new object on each step.
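If transients are new to you, the round-trip on its own looks like this at the REPL (reduce threads the transient through each conj! call, and persistent! freezes it back into an ordinary set; element order in the printout may vary):
(persistent! (reduce conj! (transient #{}) [100 100 101 100]))
=> #{100 101}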
UPD2. Here's a version with reduce for the extended question (with price):
(defn count-with-price
  "Takes input of form ([customer invoice price] [customer invoice price] ...)
   and produces vector of 3 elements, where 1st and 2nd are counts of unique
   customers and invoices and 3rd is total sum of all prices"
  [coll]
  (let [[custs invs total]
        (reduce (fn [[custs invs total] [cust inv price]]
                  [(conj! custs cust) (conj! invs inv) (+ total price)])
                [(transient #{}) (transient #{}) 0]
                coll)]
    [(count (persistent! custs)) (count (persistent! invs)) total]))
Here we hold the intermediate results in a vector [custs invs total], unpacking, processing and packing them back into a vector on each step. As you can see, implementing such nontrivial logic with reduce is harder (both to write and to read) and requires even more code (in the looped version it is enough to add one more loop arg for the price). So I agree with @amalloy that reduce is better for simpler cases, but more complex things require lower-level constructs, such as the loop/recur pair.

As is often the case when consuming a sequence, reduce is nicer than loop here. You can just do:
(map count (reduce (partial map conj)
                   [#{} #{}]
                   txn))
Or, if you're really into transients:
(map (comp count persistent!)
     (reduce (partial map conj!)
             (repeatedly 2 #(transient #{}))
             txn))
Both of these solutions traverse the input only once, and they take much less code than the loop/recur solution.
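For example, with the sample data from the question bound to txn (the name the snippets above assume):
(def txn '([100 2000] [100 2000] [101 2001] [100 2002]))
(map count (reduce (partial map conj) [#{} #{}] txn))
=> (2 3)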

Or you could use sets to handle the de-duplication for you, since a set can contain at most one of any given value.
(def vectors '([100 2000] [100 2000] [101 2001] [100 2002]))
[(count (into #{} (map first vectors))) (count (into #{} (map second vectors)))]

Here's a nice way to do this with a map and higher order functions:
(apply map
       (comp count set list)
       [[100 2000] [100 2000] [101 2001] [100 2002]])
=> (2 3)
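The trick is that apply passes each inner vector to map as a separate argument, so map walks them in parallel. Substituting a plain list for (comp count set list) makes the intermediate transposition visible:
(apply map list [[100 2000] [100 2000] [101 2001] [100 2002]])
=> ((100 100 101 100) (2000 2000 2001 2002))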

Some other solutions, in addition to the nice ones mentioned above:
(map (comp count distinct vector) [ 100 2000 ] [ 100 2000 ] [ 101 2001 ] [ 100 2002 ])
Another one, written with the thread-last macro:
(->> '([100 2000] [100 2000] [101 2001] [100 2002]) (apply map vector) (map distinct) (map count))
Both return (2 3).

Related

update or assoc a list rather than a vector

Updating a vector works fine:
(update [{:idx :a} {:idx :b}] 1 (fn [_] {:idx "Hi"}))
;; => [{:idx :a} {:idx "Hi"}]
However trying to do the same thing with a list does not work:
(update '({:idx :a} {:idx :b}) 1 (fn [_] {:idx "Hi"}))
;; => ClassCastException clojure.lang.PersistentList cannot be cast to clojure.lang.Associative clojure.lang.RT.assoc (RT.java:807)
Exactly the same problem exists for assoc.
I would like to do update and overwrite operations on lazy types rather than vectors. What is the underlying issue here, and is there a way I can get around it?
The underlying issue is that the update function works on associative structures, i.e. vectors and maps. A list can't take a key to look up a value.
user=> (associative? [])
true
user=> (associative? {})
true
user=> (associative? '())
false
update uses get behind the scenes to do its random access work.
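You can see the difference directly at the REPL:
user=> (get [:a :b :c] 1)
:b
user=> (get '(:a :b :c) 1)
nil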
I would like to do update and overwrite operations on lazy types
rather than vectors
It's not clear what you want to achieve here. You're correct that vectors aren't lazy, but if you wish to do random access operations on a collection, vectors are ideal for that scenario and lists aren't.
and is there a way I can get around it?
Yes, but you still wouldn't be able to use the update function, and it doesn't look like there would be any benefit in doing so, in your case.
With a list you'd have to walk the list in order to access an index somewhere in the list - so in many cases you'd have to realise a great deal of the sequence even if it was lazy.
You can define your own function, using take and drop:
(defn lupdate [list n function]
  (let [[head & tail] (drop n list)]
    (concat (take n list)
            (cons (function head) tail))))
user=> (lupdate '(a b c d e f g h) 4 str)
(a b c d "e" f g h)
With lazy sequences, that means you will compute the first n values (but not the remaining ones, which after all is an important part of why we use lazy sequences). You also have to take into account space and time complexity (concat, etc.). But if you truly need to operate on lazy sequences, that's the way to go.
Looking behind your question to the problem you are trying to solve:
You can use Clojure's sequence functions to construct a simple solution:
(defn elf [n]
  (loop [population (range 1 (inc n))]
    (if (<= (count population) 1)
      (first population)
      (let [survivors (->> population
                           (take-nth 2)
                           ((if (-> population count odd?) rest identity)))]
        (recur survivors)))))
For example,
(map (juxt identity elf) (range 1 8))
;([1 1] [2 1] [3 3] [4 1] [5 3] [6 5] [7 7])
This has complexity O(n). You can speed up count by passing the population count as a redundant argument in the loop, or by dumping the population and survivors into vectors. The sequence functions - take-nth and rest - are quite capable of doing the weeding.
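For instance, here is a sketch of the first suggestion, carrying the population size through the loop so count runs only once per pass is avoided entirely (my own variant, not from the original answer):
(defn elf [n]
  (loop [population (range 1 (inc n))
         size n]                          ; carried instead of recounted
    (if (<= size 1)
      (first population)
      (let [survivors (->> population
                           (take-nth 2)
                           ((if (odd? size) rest identity)))]
        ;; both the odd and even cases leave (quot size 2) survivors
        (recur survivors (quot size 2))))))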
I hope I got it right!

clojure refactor code from recursion

I have the following bit of code that produces the correct results:
(ns scratch.core
  (:require [clojure.string :as str :refer [split-lines join split]]))
(defn numberify [str]
  (vec (map read-string (str/split str #" "))))
(defn process [acc sticks]
  (let [smallest (apply min sticks)
        cuts (filter #(> % 0) (map #(- % smallest) sticks))]
    (if (empty? cuts)
      acc
      (process (conj acc (count cuts)) cuts))))
(defn print-result [[x & xs]]
  (prn x)
  (if (seq xs)
    (recur xs)))
(let [input "8\n1 2 3 4 3 3 2 1"
      lines (str/split-lines input)
      length (read-string (first lines))
      inputs (first (rest lines))]
  (print-result (process [length] (numberify inputs))))
The process function above recursively calls itself until the sequence sticks satisfies empty?.
I am curious to know if I could have used something like take-while or some other technique to make the code more succinct?
If ever I need to do some work on a sequence until it is empty then I use recursion but I can't help thinking there is a better way.
Your core problem can be described as:
1. stop if the count of sticks is zero
2. accumulate the count of sticks
3. subtract the smallest stick from each of the sticks
4. filter the positive sticks
5. go back to 1.
Identify the smallest sub-problem as steps 3 and 4 and put a box around it
(defn cuts [sticks]
  (let [smallest (apply min sticks)]
    (filter pos? (map #(- % smallest) sticks))))
Notice that sticks are the only thing that changes between steps 5 and 3: cuts is a fn from sticks to sticks. So use iterate to put a box around that:
(defn process [sticks]
  (->> (iterate cuts sticks)
       ;; ----- 8< -------------------
This gives an infinite seq of sticks, (cuts sticks), (cuts (cuts sticks)) and so on
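For example:
(take 5 (iterate cuts [1 2 3 4 3 3 2 1]))
;-> ([1 2 3 4 3 3 2 1] (1 2 3 2 2 1) (1 2 1 1) (1) ())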
Incorporate steps 1 and 2:
(defn process [sticks]
  (->> (iterate cuts sticks)
       (map count)          ;; count each sticks
       (take-while pos?)))  ;; accumulate while counts are positive
(process [1 2 3 4 3 3 2 1])
;-> (8 6 4 1)
Behind the scenes this algorithm hardly differs from the one you posted, since lazy seqs are a delayed implementation of recursion. It is more idiomatic, though: more modular, and using take-while for termination adds to its expressiveness. It also doesn't require passing in the initial count, and it does the right thing if sticks is empty. I hope it is what you were looking for.
I think the way your code is written is a very lispy way of doing it. Certainly there are many, many examples in The Little Schemer that follow this format of reduction/recursion.
To replace recursion, I usually look for a solution that involves using higher order functions, in this case reduce. It replaces the min calls each iteration with a single sort at the start.
(defn process [sticks]
  (drop-last (reduce (fn [a i]
                       (let [n (- (last a) (count i))]
                         (conj a n)))
                     [(count sticks)]
                     (partition-by identity (sort sticks)))))
(process [1 2 3 4 3 3 2 1])
=> (8 6 4 1)
I've changed the algorithm to fit reduce by grouping the equal numbers after sorting, then counting each group and subtracting each group's size from the running count.
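The grouping step on the sample input looks like this:
(partition-by identity (sort [1 2 3 4 3 3 2 1]))
=> ((1 1) (2 2) (3 3 3) (4))
Each group's size is the number of sticks that drop out at that round, which is why subtracting successive group counts reproduces the sequence.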

first common element of potentially infinite sequences

I have written code to find the common elements of a number of sequences:
(defn common [l & others]
  (if (= 1 (count others))
    (filter #(some #{%1} (first others)) l)
    (filter #(and (some #{%1} (first others))
                  (not (empty? (apply common (list %1) (rest others)))))
            l)))
which can find the first common element of finite sequences like this:
(first (common [1 2] [0 1 2 3 4 5] [3 1]))
-> 1
but it is very easily sent on an infinite search if any of the sequences are infinite:
(first (common [1 2] [0 1 2 3 4 5] (range)))
I understand why this is happening, and I know I need to make the computation lazy in some way, but I can't yet see how best to do that.
So that is my question: how to rework this code (or maybe use entirely different code) to find the first common element of a number of sequences, one or more of which could be infinite.
This isn't possible without some other constraints on the contents of the sequence. For example, if they were required to be in sorted order, you could do it. But given two infinite, arbitrarily-ordered sequences A and B, you can't ever decide for sure that A[0] isn't in B, because you'll keep searching forever, so you'll never be able to start searching for A[1].
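For contrast, here is a minimal sketch of what the sorted-order constraint buys you (my own illustration, assuming strictly ascending seqs; not part of the original answer):
(defn first-common-sorted [xs ys]
  (when (and (seq xs) (seq ys))
    (let [x (first xs), y (first ys)]
      (cond (= x y) x
            (< x y) (recur (rest xs) ys)
            :else (recur xs (rest ys))))))
(first-common-sorted (range) (iterate #(+ % 3) 1))
;=> 1
Each step discards the smaller head, so it terminates whenever a common element exists, even on infinite seqs.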
I would probably do something like
(fn [& lists]
  ;; search-in-finite-lists is left as a placeholder predicate;
  ;; (apply map ...) transposes the lists into levels
  (filter search-in-finite-lists
          (apply map (fn [& elements] elements) lists)))
The trick is to search level by level, in all lists at once. At each level, you only need to check whether the newest element of each list appears in any of the other lists.
I guess it is expected to search infinitely if the lists are infinite and there is no match. However, you could add a (take max ...) around the levels before the filter to impose a maximum, like so:
(fn [max & lists]
  (filter search-in-finite-lists
          (take max (apply map (fn [& elements] elements) lists))))
Well, that is still assuming a finite number of lists... which should be reasonable.

Clojure 2d list to hash-map

I have an infinite list like that:
((1 1)(3 9)(5 17)...)
I would like to make a hash map out of it:
{:1 1 :3 9 :5 17 ...}
Basically the 1st element of each 'inner' list would become a keyword, and the second element its value. I am not sure it would not be easier to do it at creation time; to create the list I use:
(iterate (fn [[a b]] [(computation for a) (computation for b)]) [1 1])
Computation of (b) requires (a) so I believe at this point (a) could not be a keyword... The whole point of that is so one can easily access a value (b) given (a).
Any ideas would be greatly appreciated...
--EDIT--
Ok so I figured it out:
(def my-map (into {} (map #(hash-map (keyword (str (first %))) (first (rest %))) my-list)))
The problem is: it does not seem to be lazy... it just goes forever even though I haven't consumed it. Is there a way to force it to be lazy?
The problem is that hash-maps can be neither infinite nor lazy. They are designed for fast key-value access. So, if you have a hash-map, you'll be able to perform fast key lookup. Key-value access is the core idea of hash-maps, but it makes creating a lazy, infinite hash-map impossible.
Suppose, we have an infinite 2d list, then you can just use into to create hash-map:
(into {} (map vec my-list))
But there is no way to make this hash-map infinite. So, the only solution for you is to create your own hash-map, like Chouser suggested. In this case you'll have an infinite 2d sequence and a function to perform lazy key lookup in it.
Actually, his solution can be slightly improved:
(def my-map (atom {}))
(def my-seq (atom (partition 2 (range))))
(defn build-map [stop]
  (when-let [[k v] (first @my-seq)]
    (swap! my-seq rest)
    (swap! my-map #(assoc % k v))
    (if (= k stop)
      v
      (recur stop))))
(defn get-val [k]
  (if-let [v (@my-map k)]
    v
    (build-map k)))
my-map in my example stores the current hash-map and my-seq stores the sequence of not-yet-processed elements. The get-val function performs a lazy look-up, using the already-processed elements in my-map to improve its performance:
(get-val 4)
=> 5
@my-map
=> {4 5, 2 3, 0 1}
And a speed-up:
(time (get-val 1000))
=> Elapsed time: 7.592444 msecs
(time (get-val 1000))
=> Elapsed time: 0.048192 msecs
In order to be lazy, the computer will have to do a linear scan of the input sequence each time a key is requested, at the very least if the key is beyond what has been scanned so far. A naive solution is just to scan the sequence every time, like this:
(defn get-val [coll k]
  (some (fn [[a b]] (when (= k a) b)) coll))
(get-val '((1 1) (3 9) (5 17)) 3)
;=> 9
A slightly less naive solution would be to use memoize to cache the results of get-val, though this would still scan the input sequence more than strictly necessary. A more aggressively caching solution would be to use an atom (as memoize does internally) to cache each pair as it is seen, thereby only consuming more of the input sequence when a lookup requires something not yet seen.
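A sketch of that last idea, with an atom caching pairs as they are consumed (the names are mine, and the two swap! calls are not coordinated, so it is not thread-safe):
(defn lazy-lookup [coll]
  (let [seen (atom {})
        remaining (atom coll)]
    (fn [k]
      (if-let [e (find @seen k)]
        (val e)
        (loop []
          (when-let [[a b] (first @remaining)]
            (swap! remaining rest)
            (swap! seen assoc a b)
            (if (= a k) b (recur))))))))
(def lookup (lazy-lookup '((1 1) (3 9) (5 17))))
(lookup 3) ;=> 9, consumes the seq up to (3 9)
(lookup 1) ;=> 1, answered from the cache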
Regardless, I would not recommend wrapping this up in a hash-map API, as that would imply efficient immutable "updates" that would likely not be needed and yet would be difficult to implement. I would also generally not recommend keywordizing the keys.
If you flatten it down to a list of (k v k v k v k v) with flatten, then you can use apply to call hash-map with that list as its arguments, which will get you the map you seek.
user> (apply hash-map (flatten '((1 1)(3 9)(5 17))))
{1 1, 3 9, 5 17}
though it does not keywordize the first element of each pair.
At least in Clojure, the last value associated with a key is taken to be the value for that key. If this were not the case, you couldn't produce a new map with a different value for a key that is already in the map, because the first (now shadowed) key would be returned by the lookup function. If the lookup function searches to the end, then it is not lazy. You can solve this by writing your own map implementation that uses association lists, though it would lack the performance guarantees of Clojure's trie-based maps, because it would devolve to linear time in the worst case.
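For example, when a key repeats, the later pair wins:
(apply hash-map (flatten '((1 1) (1 2))))
=> {1 2}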
I'm not sure keeping the input sequence lazy will have the desired result.
To make a hashmap from your sequence you could try:
(defn to-map [s] (zipmap (map (comp keyword str first) s) (map second s)))
(to-map '((1 1) (3 9) (5 17)))
=> {:5 17, :3 9, :1 1}
You can convert that structure to a hash-map later this way
(def it #(iterate (fn [[a b]] [(+ a 1) (+ b 1)]) [1 1]))
(apply hash-map (apply concat (take 3 (it))))
=> {1 1, 2 2, 3 3}

How can I test for a given sum in all combinations of multiple sets?

I'm working on Problem 131 from 4Clojure site.
What kind of "for" statement might I add to combinatorially check each of these sets for a subset of items which sums to 0?
In particular I had a few questions here:
Is there any clojure function which takes an arbitrary number of sets?
If so, how can I generate all subsets AND sum those subsets without adding an extra closure to this code, or am I mistaken?
I need to fill in the __ part.
(= true (__ #{-1 1 99}
#{-2 2 888}
#{-3 3 7777}))
You mean sets (instead of maps)? But actually, that doesn't matter.
For example, count takes one argument, but you can make an anonymous function that takes an arbitrary number of arguments.
((fn [& args] (map count args)) #{-1 1 99} #{-2 2 888} #{-3 3 7777})
or
(#(map count %&) #{-1 1 99} #{-2 2 888} #{-3 3 7777})
You can use subsets from combinatorics contrib to generate all subsets and then reduce them with +
#(map (partial reduce +) (subsets %))
So, this problem can be solved with these two functions:
;; assumes subsets from the combinatorics library (clojure.math.combinatorics
;; in current releases) and intersection from clojure.set
(defn sums [s]
  (set (map #(reduce + %) (rest (subsets s)))))
(defn cmp [& sets]
  (not (empty? (apply intersection (map sums sets)))))
I wasn't able to make 4clojure import libraries from contrib, so I leave it as is.
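For reference, with those libraries available the solution passes the test case from the question:
(cmp #{-1 1 99} #{-2 2 888} #{-3 3 7777})
;=> true (0 is a subset sum in all three: -1 + 1, -2 + 2, -3 + 3)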