Clojure 2d list to hash-map - clojure

I have an infinite list like that:
((1 1)(3 9)(5 17)...)
I would like to make a hash map out of it:
{:1 1 :3 9 :5 17 ...}
Basically, the first element of each 'inner' list would be a keyword and the second element its value. I am not sure whether it would be easier to do this at creation time. To create the list I use:
(iterate (fn [[a b]] [(computation for a) (computation for b)]) [1 1])
Computation of (b) requires (a) so I believe at this point (a) could not be a keyword... The whole point of that is so one can easily access a value (b) given (a).
Any ideas would be greatly appreciated...
--EDIT--
Ok so I figured it out:
(def my-map (into {} (map #(hash-map (keyword (str (first %))) (first (rest %))) my-list)))
The problem is: it does not seem to be lazy... it just goes forever even though I haven't consumed it. Is there a way to force it to be lazy?

The problem is that hash-maps can be neither infinite nor lazy. They are designed for fast key-value access, so if you have a hash-map you'll be able to perform fast key look-up. Key-value access is the core idea of hash-maps, but it makes the creation of a lazy, infinite hash-map impossible.
If the 2d list is finite, you can just use into to create a hash-map (on an infinite list this call never returns):
(into {} (map vec my-list))
But there is no way to make this hash-map infinite. So, the only solution for you is to create your own hash-map, like Chouser suggested. In this case you'll have an infinite 2d sequence and a function to perform lazy key lookup in it.
Actually, his solution can be slightly improved:
(def my-map (atom {}))
(def my-seq (atom (partition 2 (range))))
(defn build-map [stop]
  (when-let [[k v] (first @my-seq)]
    (swap! my-seq rest)
    (swap! my-map #(assoc % k v))
    (if (= k stop)
      v
      (recur stop))))
(defn get-val [k]
  (if-let [v (@my-map k)]
    v
    (build-map k)))
my-map in my example stores the current hash-map and my-seq stores the sequence of not yet processed elements. get-val function performs a lazy look-up, using already processed elements in my-map to improve its performance:
(get-val 4)
=> 5
@my-map
=> {4 5, 2 3, 0 1}
And a speed-up:
(time (get-val 1000))
=> Elapsed time: 7.592444 msecs
(time (get-val 1000))
=> Elapsed time: 0.048192 msecs

In order to be lazy, the computer will have to do a linear scan of the input sequence each time a key is requested, at the very least if the key is beyond what has been scanned so far. A naive solution is just to scan the sequence every time, like this:
(defn get-val [coll k]
  (some (fn [[a b]] (when (= k a) b)) coll))
(get-val '((1 1) (3 9) (5 17)) 3)
;=> 9
A slightly less naive solution would be to use memoize to cache the results of get-val, though this would still scan the input sequence more than strictly necessary. A more aggressively caching solution would be to use an atom (as memoize does internally) to cache each pair as it is seen, thereby only consuming more of the input sequence when a lookup requires something not yet seen.
Regardless, I would not recommend wrapping this up in a hash-map API, as that would imply efficient immutable "updates" that would likely not be needed and yet would be difficult to implement. I would also generally not recommend keywordizing the keys.
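The atom-based caching idea described above could be sketched roughly as follows. This is only an illustration under assumptions: make-lookup is a hypothetical name, the increments in the sample sequence are guessed from the question's example values, and the plain reset! is not safe for concurrent callers.

```clojure
;; Hypothetical helper: returns a lookup function that remembers every pair
;; it has already scanned and only consumes more input when it has to.
(defn make-lookup [pairs]
  (let [state (atom {:seen {} :remaining pairs})]
    (fn [k]
      (loop []
        (let [{:keys [seen remaining]} @state]
          (cond
            (contains? seen k) (get seen k)   ; already cached
            (empty? remaining) nil            ; finite input exhausted
            :else (let [[a b] (first remaining)]
                    ;; cache the pair and drop it from the unscanned input
                    (reset! state {:seen (assoc seen a b)
                                   :remaining (rest remaining)})
                    (recur))))))))

;; Assumed stand-in for the question's sequence ((1 1) (3 9) (5 17) ...)
(def lookup (make-lookup (iterate (fn [[a b]] [(+ a 2) (+ b 8)]) [1 1])))
(lookup 5)
;=> 17
```

As with any scan of an infinite sequence, looking up a key that never appears will not terminate.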

If you flatten it down to a list of (k v k v k v k v) with flatten, then you can use apply to call hash-map with that list as its arguments, which will get you the map you seek.
user> (apply hash-map (flatten '((1 1)(3 9)(5 17))))
{1 1, 3 9, 5 17}
though it does not keywordize the first argument.
At least in Clojure, the last value associated with a key is said to be the value for that key. If this were not the case, you couldn't produce a new map with a different value for a key that is already in the map, because the first (and now shadowed) key would be returned by the lookup function. If the lookup function searches to the end then it is not lazy. You can solve this by writing your own map implementation that uses association lists, though it would lack the performance guarantees of Clojure's trie-based maps because it would devolve to linear time in the worst case.
I'm not sure keeping the input sequence lazy will have the desired results.
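A minimal sketch of the association-list idea, assuming a "first match wins" rule so that lookups can stop early and new entries shadow old ones by being added at the front. The names alist-get and alist-assoc are hypothetical, and the sample sequence is an assumed stand-in for the one in the question.

```clojure
;; Hypothetical helpers, not part of clojure.core.
(defn alist-get
  "Returns the value paired with the first key equal to k, or nil."
  [alist k]
  (some (fn [[a b]] (when (= a k) b)) alist))

(defn alist-assoc
  "Shadows any existing entry for k by consing a new pair onto the front."
  [alist k v]
  (cons [k v] alist))

;; Assumed stand-in for the question's infinite sequence
(def pairs (iterate (fn [[a b]] [(+ a 2) (+ b 8)]) [1 1]))

(alist-get pairs 5)                       ; scans three pairs, then stops
(alist-get (alist-assoc pairs 5 :new) 5)  ; the new front entry wins
```

Because some treats nil and false as "not found", this sketch cannot store those as values, and lookup time is linear in the position of the key.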

To make a hashmap from your sequence you could try:
(defn to-map [s] (zipmap (map (comp keyword str first) s) (map second s)))
=> (to-map '((1 1)(3 9)(5 17)))
=> {:5 17, :3 9, :1 1}
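Since zipmap is eager, the infinite sequence from the question would have to be bounded before it is passed to to-map. A small sketch, assuming a stand-in generator whose increments are guessed from the question's example values:

```clojure
(defn to-map [s]
  (zipmap (map (comp keyword str first) s) (map second s)))

;; Assumed stand-in for the question's sequence ((1 1) (3 9) (5 17) ...)
(def my-list (iterate (fn [[a b]] [(+ a 2) (+ b 8)]) [1 1]))

;; Bound the sequence first, either by count or by a predicate on the key.
(def first-three (to-map (take 3 my-list)))
(def up-to-five  (to-map (take-while (fn [[a _]] (<= a 5)) my-list)))
;; both are equal to {:1 1, :3 9, :5 17}
```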

You can convert that structure to a hash-map later this way
(def it #(iterate (fn [[a b]] [(+ a 1) (+ b 1)]) [1 1]))
(apply hash-map (apply concat (take 3 (it))))
=> {1 1, 2 2, 3 3}

Related

Combine transduction output with input into a hashmap

I want to do the following in Clojure as idiomatically as possible:
transduce a collection
associate each element of the input collection with the corresponding element in the output collection
return the result in a hashmap
Is there a succinct way to do this using core library functions?
If not, what improvements can you suggest to the following implementation?
(defn to-hash [coll xform]
  (reduce
    merge
    (map
      #(apply hash-map %)
      (mapcat hash-map coll (into [] xform coll)))))
something like this should do the trick without intermediate collections:
(defn process [data xform]
  (zipmap data (eduction xform data)))
user> (process [1 2 3] (comp (map inc) (map #(* % %))))
;;=> {1 4, 2 9, 3 16}
the docs on eduction say the following:
Returns a reducible/iterable application of the transducers
to the items in coll. Transducers are applied in order as if
combined with comp. Note that these applications will be
performed every time reduce/iterator is called.
so no additional collection is created.
This only works, of course, as long as there is a one-to-one relationship between input and output elements. What is the desired output for (process [1 -2 3] (filter pos?)) or (process [1 1 1 2 2 2] (dedupe))?
(by the way, your to-hash implementation has the same flaw)
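To make the one-to-one caveat concrete, here is what the zipmap/eduction approach does with a filtering transducer. The outputs are pairs by position only, so an input can end up associated with an output that came from a different element:

```clojure
(defn process [data xform]
  (zipmap data (eduction xform data)))

;; (filter pos?) yields 1 and 3; zipmap pairs them with the first two keys
(process [1 -2 3] (filter pos?))
;;=> {1 1, -2 3}
```

Here -2 is mapped to 3 and the key 3 is missing entirely, which is probably not what anyone wants.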
A transducer is a function that takes a reducing function and returns a new reducing function. To make it work with transducers where there is not a one-to-one mapping from elements in the input collection to the output, you will have to use your transducer to create a new reducing function (step2 in the code below) that will associate elements into your hash map. Something like this.
(def ^:dynamic assoc-k nil)
(defn assoc-step [dst x]
  (assoc dst assoc-k x))
(defn to-hash [coll xform]
  (let [step (xform (completing assoc-step))
        step2 (fn [dst x] (binding [assoc-k x] (step dst x)))]
    (reduce step2 {} coll)))
This implementation is quite basic and I am not sure to which extent it will work with stateful transducers. But it will work with the stateless ones, such as map and filter.
And we can test it with a transducer that keeps odd elements in the input collection and squares them:
(defn square [x] (* x x))
(to-hash (range 10) (comp (filter odd?) (map square)))
;; => {1 1, 3 9, 5 25, 7 49, 9 81}
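The caveat about stateful transducers can be illustrated with partition-all. The sketch below repeats the definitions from this answer so that it runs on its own. Two things happen: each group is stored under whichever input happened to complete it, and the trailing partial group is dropped because plain reduce never calls the completion arity.

```clojure
(def ^:dynamic assoc-k nil)

(defn assoc-step [dst x]
  (assoc dst assoc-k x))

(defn to-hash [coll xform]
  (let [step  (xform (completing assoc-step))
        step2 (fn [dst x] (binding [assoc-k x] (step dst x)))]
    (reduce step2 {} coll)))

;; Stateless transducer: every input is keyed to its own output
(to-hash [1 2 3] (map inc))
;; => {1 2, 2 3, 3 4}

;; Stateful transducer: [1 2] is emitted while 2 is the current input,
;; and the leftover [3] is never flushed
(to-hash [1 2 3] (partition-all 2))
;; => {2 [1 2]}
```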

Clojure - Difference Between Map and Reduce // Converting One to the Other

(defn DoubleFrequency []
  (def s (slurp "Example.txt"))
  (def m (reduce #(assoc %1 %2 (inc (%1 %2 0)))
                 {}
                 (re-seq #".." s)))
  (def c (count m))
  (doseq [[k x] m]
    (println k ":" (/ x c))))
I'm trying to apply concurrency to my program, and I want to use pmap, but I'm not sure how to work it into my current code here. The functionality is correct for single core, but Ideally I want to replace reduce with pmap in some way and achieve the same results.
First of all, the function you're trying to write already exists in clojure.core. It is called frequencies:
user> (frequencies [1 2 1 3 1 4 4])
;;=> {1 3, 2 1, 3 1, 4 2}
It is, indeed, single threaded, so let's try to make it parallel.
Your initial approach with reduce is a step in the right direction. It is not parallel either, but it can be turned into a parallel version with one of Clojure's standard library concurrency facilities, namely reducers.
first of all, let's rewrite your reducer function a bit, to do the same thing, but in a more idiomatic way (it is optional, but good for readability):
#(assoc %1 %2 (inc (%1 %2 0))) => #(update %1 %2 (fnil inc 0))
Then we can move on to the parallel reduce with fold:
(require '[clojure.core.reducers :as r])
(defn pfreq [data]
  (r/fold
    (partial merge-with +)
    (fn [acc k] (update acc k (fnil inc 0)))
    data))
the idea is that it splits your collection by chunks (if it is long enough), and then combines chunks' results with merge-with:
user> (pfreq [1 2 1 3 1 4 1 5 2])
;;=> {1 4, 2 2, 3 1, 4 1, 5 1}
Note also that the collection should be 'foldable'. By default, persistent vectors and maps are foldable, but the result of re-seq is not, so you should first convert it into a vector: (vec (re-seq #".." s)). Otherwise you won't get any parallelization, and fold falls back to plain reduce.
You can also approach this with pmap, using the same strategy: split -> map -> combine:
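Putting the pieces together for the pair-counting case from the question might look like the sketch below. The input string is made up for illustration; the real code would slurp a file instead. With an input this small fold will not actually split the work, so this only shows the shape of the call.

```clojure
(require '[clojure.core.reducers :as r])

(defn pfreq [data]
  (r/fold
    (partial merge-with +)                      ; combines per-chunk maps
    (fn [acc k] (update acc k (fnil inc 0)))    ; counts within one chunk
    data))

;; Made-up sample input; re-seq yields "aa" "bb" "aa" (the last "b" is unmatched)
(def s "aabbaab")

;; vec makes the lazy re-seq result foldable
(pfreq (vec (re-seq #".." s)))
;;=> {"aa" 2, "bb" 1}
```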
(defn pfreq2 [chunk-size data]
  (->> data
       (partition-all chunk-size)
       (pmap frequencies)
       (apply merge-with +)))
but this is not as flexible and powerful, as the reducers pipelines.

update or assoc a list rather than a vector

Updating a vector works fine:
(update [{:idx :a} {:idx :b}] 1 (fn [_] {:idx "Hi"}))
;; => [{:idx :a} {:idx "Hi"}]
However trying to do the same thing with a list does not work:
(update '({:idx :a} {:idx :b}) 1 (fn [_] {:idx "Hi"}))
;; => ClassCastException clojure.lang.PersistentList cannot be cast to clojure.lang.Associative clojure.lang.RT.assoc (RT.java:807)
Exactly the same problem exists for assoc.
I would like to do update and overwrite operations on lazy types rather than vectors. What is the underlying issue here, and is there a way I can get around it?
The underlying issue is that the update function works on associative structures, i.e. vectors and maps. Lists can't take a key as a function to look up a value.
user=> (associative? [])
true
user=> (associative? {})
true
user=> (associative? `())
false
update uses get behind the scenes to do its random access work.
"I would like to do update and overwrite operations on lazy types rather than vectors"
It's not clear what you want to achieve here. You're correct that vectors aren't lazy, but if you wish to do random access operations on a collection then vectors are ideal for this scenario and lists aren't.
and is there a way I can get around it?
Yes, but you still wouldn't be able to use the update function, and it doesn't look like there would be any benefit in doing so, in your case.
With a list you'd have to walk the list in order to access an index somewhere in the list - so in many cases you'd have to realise a great deal of the sequence even if it was lazy.
You can define your own function, using take and drop:
(defn lupdate [list n function]
  (let [[head & tail] (drop n list)]
    (concat (take n list)
            (cons (function head) tail))))
user=> (lupdate '(a b c d e f g h) 4 str)
(a b c d "e" f g h)
With lazy sequences, that means that you will compute the first n values (but not the remaining ones, which after all is an important part of why we use lazy sequences). You also have to take into account space and time complexity (concat, etc.). But if you truly need to operate on lazy sequences, that's the way to go.
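To see that lupdate really stays lazy, it can be applied to an infinite sequence; only the elements that are actually consumed get realized. The definition is repeated here so the snippet runs on its own:

```clojure
(defn lupdate [list n function]
  (let [[head & tail] (drop n list)]
    (concat (take n list)
            (cons (function head) tail))))

;; (range) is infinite; only the first six elements are ever requested
(take 6 (lupdate (range) 4 str))
;;=> (0 1 2 3 "4" 5)
```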
Looking behind your question to the problem you are trying to solve:
You can use Clojure's sequence functions to construct a simple solution:
(defn elf [n]
  (loop [population (range 1 (inc n))]
    (if (<= (count population) 1)
      (first population)
      (let [survivors (->> population
                           (take-nth 2)
                           ((if (-> population count odd?) rest identity)))]
        (recur survivors)))))
For example,
(map (juxt identity elf) (range 1 8))
;([1 1] [2 1] [3 3] [4 1] [5 3] [6 5] [7 7])
This has complexity O(n). You can speed up count by passing the population count as a redundant argument in the loop, or by dumping the population and survivors into vectors. The sequence functions - take-nth and rest - are quite capable of doing the weeding.
I hope I got it right!
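The suggestion to pass the population count along as a redundant loop argument could look something like this. It is only a sketch: elf-counted is a hypothetical name, and it relies on the observation that each round leaves exactly (quot cnt 2) survivors.

```clojure
(defn elf-counted [n]
  (loop [population (range 1 (inc n))
         cnt        n]
    (if (<= cnt 1)
      (first population)
      (let [survivors (cond->> (take-nth 2 population)
                        ;; with an odd count, the first survivor is removed too
                        (odd? cnt) rest)]
        ;; take-nth keeps ceil(cnt/2); dropping one for odd cnt gives floor(cnt/2)
        (recur survivors (quot cnt 2))))))

(map (juxt identity elf-counted) (range 1 8))
;; ([1 1] [2 1] [3 3] [4 1] [5 3] [6 5] [7 7])
```

This avoids walking the lazy sequence with count on every iteration.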

Grouping a seq by different sizes - Clojure

I have the following data:
(def letters [:a :b :c :d :e :f :g ])
(def group-sizes [2 3 2])
What would be an idiomatic way to group letters by size, such that I get:
[[:a :b] [:c :d :e] [:f :g]]
Thanks.
(->> group-sizes
     (reductions + 0)
     (partition 2 1)
     (map (partial apply subvec letters)))
This algorithm requires the input coll letters to be a vector and to have at least the required amount of (apply + group-sizes) elements. It returns a lazy seq (or a vector if you use mapv) of vectors that share structure with the input vector.
Thanks to subvec they are created in O(1), constant time, so the overall time complexity should be O(N) where N is (count group-sizes), compared to Diego's algorithm where N would be the drastically higher (count letters).
After I started writing my answer, I noticed, that Leon Grapenthin's solution is almost identical to mine.
Here is my version of it:
(let [end (reductions + group-sizes)
      start (cons 0 end)]
  (map (partial subvec letters) start end))
The only difference from Leon Grapenthin's solution is that I'm using let and cons instead of partition and apply.
Note, that both solutions consume group-sizes lazily, thus producing a lazy sequence as an output.
Not necessarily the best way (e.g. you may want to check that the sum of the group sizes is the same as the size of letters to avoid an NPE) but it was my first thought:
(defn sp [[f & r] l]
  (when (seq l)
    (cons (take f l)
          (sp r (drop f l)))))
You could also do it with an accumulator and recur if you have a long list and don't want to blow up the stack.
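The accumulator-and-recur variant mentioned above might be sketched like this. The name sp-iter is hypothetical, and unlike sp this version simply stops when either the sizes or the items run out instead of throwing.

```clojure
(defn sp-iter [sizes coll]
  (loop [sizes sizes
         coll  coll
         acc   []]
    (if (or (empty? sizes) (empty? coll))
      acc
      (let [n (first sizes)]
        ;; take the next group eagerly, then continue with the remainder
        (recur (rest sizes)
               (drop n coll)
               (conj acc (vec (take n coll))))))))

(sp-iter [2 3 2] [:a :b :c :d :e :f :g])
;;=> [[:a :b] [:c :d :e] [:f :g]]
```

Because the recursion is in tail position, the stack depth stays constant regardless of how many groups there are.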

Good Clojure representation for unordered pairs?

I'm creating unordered pairs of data elements. A comment by @Chouser on this question says that hash-sets are implemented with 32 children per node, while sorted-sets are implemented with 2 children per node. Does this mean that my pairs will take up less space if I implement them with sorted-sets rather than hash-sets (assuming that the data elements are Comparable, i.e. can be sorted)? (I doubt it matters for me in practice. I'll only have hundreds of these pairs, and lookup in a two-element data structure, even sequential lookup in a vector or list, should be fast. But I'm curious.)
When I compare explicitly looking at the first two elements of a list with using Clojure's built-in sets, I don't see a significant difference when running each ten million times:
user> (defn my-lookup [key pair]
        (condp = key
          (first pair) true
          (second pair) true
          false))
#'user/my-lookup
user> (time (let [data `(1 2)]
              (dotimes [x 10000000]
                (my-lookup (rand-nth [1 2]) data))))
"Elapsed time: 906.408176 msecs"
nil
user> (time (let [data #{1 2}]
              (dotimes [x 10000000]
                (contains? data (rand-nth [1 2])))))
"Elapsed time: 1125.992105 msecs"
nil
Of course micro-benchmarks such as this are inherently flawed and difficult to really do well so don't try to use this to show that one is better than the other. I only intend to demonstrate that they are very similar.
If I'm doing something with unordered pairs, I usually like to use a map since that makes it easy to look up the other element. E.g., if my pair is [2 7], then I'll use {2 7, 7 2}, and I can do ({2 7, 7 2} 2), which gives me 7.
As for space, the PersistentArrayMap implementation is actually very space conscious. If you look at the source code (see previous link), you'll see that it allocates an Object[] of the exact size needed to hold all the key/value pairs. I think this is used as the default map type for all maps with no more than 8 key/value pairs.
The only catch here is that you need to be careful about duplicate keys. {2 2, 2 2} will cause an exception. You could get around this problem by doing something like this: (merge {2 2} {2 2}), i.e. (merge {a b} {b a}) where it's possible that a and b have the same value.
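The map-based representation with the merge workaround could be wrapped in a small constructor. This is just a sketch, and pair-map is a hypothetical name:

```clojure
;; Hypothetical constructor for an unordered pair stored as a two-way map.
;; merge tolerates a and b being equal, unlike building {a b, b a} directly.
(defn pair-map [a b]
  (merge {a b} {b a}))

((pair-map 2 7) 2)   ;=> 7
((pair-map 2 7) 7)   ;=> 2
(pair-map 2 2)       ;=> {2 2}

;; Order of construction does not matter for equality
(= (pair-map 2 7) (pair-map 7 2))   ;=> true
```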
Here's a little snippet from my repl:
user=> (def a (array-map 1 2 3 4))
#'user/a
user=> (type a)
clojure.lang.PersistentArrayMap
user=> (.count a) ; count simply returns array.length/2 of the internal Object[]
2
Note that I called array-map explicitly above. This is related to a question I asked a while ago related to map literals and def in the repl: Why does binding affect the type of my map?
This should be a comment, but i'm too short in reputation and too eager to share information.
If you are concerned about performance clj-tuple by Zachary Tellman may be 2-3 times faster than ordinary list/vectors, as claimed here ztellman / clj-tuple.
I wasn't planning to benchmark different pair representations now, but @ArthurUlfeldt's answer and @DaoWen's led me to do so. Here are my results using criterium's bench macro. Source code is below. To summarize, as expected, there are no large differences between the seven representations I tested. However, there is a gap between times for the fastest, array-map and hash-map, and the others. This is consistent with DaoWen's and Arthur Ulfeldt's remarks.
Average execution time in seconds, in order from fastest to slowest (MacBook Pro, 2.3GHz Intel Core i7):
array-map: 5.602099
hash-map: 5.787275
vector: 6.605547
sorted-set: 6.657676
hash-set: 6.746504
list: 6.948222
Edit: I added a run of test-control below, which does only what is common to all of the different other tests. test-control took, on average, 5.571284 seconds. It appears that there is a bigger difference between the -map representations and the others than I had thought: Access to a hash-map or an array-map of two entries is essentially instantaneous (on my computer, OS, Java, etc.), whereas the other representations take about a second for 10 million iterations. Which, given that it's 10M iterations, means that those operations are still almost instantaneous. (My guess is that the fact that test-arraymap was faster than test-control is due to noise from other things happening in the background on the computer. Or it could have to do with idiosyncrasies of compilation.)
(A caveat: I forgot to mention that I'm getting a warning from criterium: "JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active." I believe this means that Leiningen is starting Java with a command line option that is geared toward the -server JIT compiler, but is being run instead with the default -client JIT compiler. So the warning is saying "you think you're running -server, but you're not, so don't expect -server behavior." Running with -server might change the times given above.)
(use 'criterium.core)
;; based on Arthur Ulfeldt's answer:
(defn pairlist-contains? [key pair]
  (condp = key
    (first pair) true
    (second pair) true
    false))
(defn pairvec-contains? [key pair]
  (condp = key
    (pair 0) true
    (pair 1) true
    false))
(def ntimes 10000000)
;; Test how long it takes to do what's common to all of the other tests
(defn test-control []
  (print "=============================\ntest-control:\n")
  (bench
    (dotimes [_ ntimes]
      (def _ (rand-nth [:a :b])))))
(defn test-list []
  (let [data '(:a :b)]
    (print "=============================\ntest-list:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (pairlist-contains? (rand-nth [:a :b]) data))))))
(defn test-vec []
  (let [data [:a :b]]
    (print "=============================\ntest-vec:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (pairvec-contains? (rand-nth [:a :b]) data))))))
(defn test-hashset []
  (let [data (hash-set :a :b)]
    (print "=============================\ntest-hashset:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (contains? data (rand-nth [:a :b])))))))
(defn test-sortedset []
  (let [data (sorted-set :a :b)]
    (print "=============================\ntest-sortedset:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (contains? data (rand-nth [:a :b])))))))
(defn test-hashmap []
  (let [data (hash-map :a :a :b :b)]
    (print "=============================\ntest-hashmap:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (contains? data (rand-nth [:a :b])))))))
(defn test-arraymap []
  (let [data (array-map :a :a :b :b)]
    (print "=============================\ntest-arraymap:\n")
    (bench
      (dotimes [_ ntimes]
        (def _ (contains? data (rand-nth [:a :b])))))))
(defn test-all []
  (test-control)
  (test-list)
  (test-vec)
  (test-hashset)
  (test-sortedset)
  (test-hashmap)
  (test-arraymap))