Say I have a list (a b c d e), I'm trying to figure out a "lazy" and Clojure-idiomatic way of producing a list or seq of each item with each other item, such as ((a b) (a c) (a d) (a e) (b c) (b d) (b e) (c d) (c e) (d e)).
Clojure's for doesn't seem to allow this: it just produces one item at a time as it goes through a list and doesn't give access to the remaining sub-list. The closest I've come so far is to turn the original list into a vector and have a for expression that iterates over the indices of the vector and grabs the indexed items,
(for [i (range vector-count) j (range (inc i) vector-count)] ; start j after i so an item is never paired with itself
...
but I hope that there's a better way.
You want combinations. There's a function to give you a lazy sequence of combinations right here in clojure-contrib.
user> (combinations [:a :b :c :d :e] 2)
((:a :b) (:a :c) (:a :d) (:a :e) (:b :c) (:b :d) (:b :e) (:c :d) (:c :e) (:d :e))
(Unfortunately, the monolithic clojure-contrib repo containing that file is deprecated in favor of splitting contrib up into smaller separate repos, and clojure.contrib.combinatorics doesn't seem to have made the transition yet, so there's no easy way currently to install that library, but you can snag the code from github if nothing else.)
FWIW, I tried writing this without looking at the code in contrib. I think my code is much easier to understand, and in my simple-minded benchmark it's more than twice as fast. It's available at https://gist.github.com/1042047, and reproduced below for convenience:
(defn combinations [n coll]
(if (= 1 n)
(map list coll)
(lazy-seq
(when-let [[head & tail] (seq coll)]
(concat (for [x (combinations (dec n) tail)]
(cons head x))
(combinations n tail))))))
user> (require '[clojure.contrib.combinatorics :as combine])
nil
user> (time (last (user/combinations 4 (range 100))))
"Elapsed time: 4379.959957 msecs"
(96 97 98 99)
user> (time (last (combine/combinations (range 100) 4)))
"Elapsed time: 10913.170605 msecs"
(96 97 98 99)
I strongly prefer the [n coll] argument order, rather than [coll n] - clojure likes the "important" argument to come last, especially for functions dealing with seqs: mostly this is for ease of combination with (->>) in scenarios like (->> (my-list) (filter even?) (take 10) (combinations 8)).
why use range and index grabbing in the for loop?
(let [myseq (list :a :b :c :d)]
(for [a myseq b myseq] (list a b)))
works.
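Note that the nested for above yields every ordered pair, including (:a :a) and both (:a :b) and (:b :a). If each unordered pair should appear only once, as in the question, one index-free possibility is to iterate over the successive tails of the sequence. This is only a minimal sketch using clojure.core; the name pairs is made up for the example:

```clojure
(defn pairs
  "Lazily yields each item of coll paired with every later item."
  [coll]
  (for [[x & more] (take-while seq (iterate rest coll)) ; successive non-empty tails
        y more]                                          ; only the items after x
    (list x y)))

(pairs '(a b c d e))
;; => ((a b) (a c) (a d) (a e) (b c) (b d) (b e) (c d) (c e) (d e))
```

Because both for and iterate are lazy, this should also work on long or infinite inputs, e.g. (take 2 (pairs (range))).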
I often have to run my data through a function if the data fulfill certain criteria. Typically, both the function f and the criteria checker pred are parameterized to the data. For this reason, I find myself wishing for a higher-order if-then-else which knows neither f nor pred.
For example, assume I want to add 10 to all even integers in (range 5). Instead of
(map #(if (even? %) (+ % 10) %) (range 5))
I would prefer to have a helper –let's call it fork– and do this:
(map (fork even? #(+ % 10)) (range 5))
I could go ahead and implement fork as function. It would look like this:
(defn fork
([pred thenf elsef]
#(if (pred %) (thenf %) (elsef %)))
([pred thenf]
(fork pred thenf identity)))
Can this be done by elegantly combining core functions? Some nice chain of juxt / apply / some maybe?
Alternatively, do you know any Clojure library which implements the above (or similar)?
As Alan Thompson mentions, cond-> is a fairly standard way of implicitly getting the "else" part to be "return the value unchanged" these days. It doesn't really address your hope of being higher-order, though. I have another reason to dislike cond->: I think (and argued when cond-> was being invented) that it's a mistake for it to thread through each matching test, instead of just the first. It makes it impossible to use cond-> as an analogue to cond.
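To make the difference concrete, here is a small illustration of cond stopping at the first passing test while cond-> keeps going. The function names classify and adjust are invented for the example:

```clojure
;; cond stops at the first test that passes.
(defn classify [n]
  (cond
    (even? n) :even
    (pos? n)  :positive
    :else     :other))

(classify 4)
;; => :even

;; cond-> evaluates every test and threads the value through each
;; branch whose test passed.
(defn adjust [n]
  (cond-> n
    (even? n) (+ 10)
    (pos? n)  (* 2)))

(adjust 4)
;; => 28, i.e. (* (+ 4 10) 2): both branches ran
(adjust 3)
;; => 6, only the pos? branch ran
```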
If you agree with me, you might try flatland.useful.fn/fix, or one of the other tools in that family, which we wrote years before cond-> [1].
to-fix is exactly your fork, except that it can handle multiple clauses and accepts constants as well as functions (for example, maybe you want to add 10 to other even numbers but replace 0 with 20):
(map (to-fix zero? 20, even? #(+ % 10)) xs)
It's easy to replicate the behavior of cond-> using fix, but not the other way around, which is why I argue that fix was the better design choice.
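If pulling in the library is not an option, the first-match behaviour described above can be approximated with core functions. The following is a rough sketch only: first-match is a hypothetical name, and this is not flatland.useful's actual implementation, which may differ in details such as how it tells constants from functions.

```clojure
(defn first-match
  "Takes pred/result pairs and returns a function of one argument.
  The first pred that passes decides the outcome: a function result is
  called on the argument, anything else is returned as a constant.
  With no matching pred the argument is returned unchanged."
  [& clauses]
  (let [pairs (partition 2 clauses)]
    (fn [x]
      (if-let [[_ result] (first (filter (fn [[pred _]] (pred x)) pairs))]
        (if (fn? result) (result x) result)
        x))))

(map (first-match zero? 20, even? #(+ % 10)) (range 5))
;; => (20 1 12 3 14)
```

Note that 0 is also even, but only the first matching clause is applied, which is the cond-like behaviour argued for above.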
[1] Apparently we're just a couple weeks away from the 10-year anniversary of the final version of fix. How time flies.
I agree that it could be very useful to have some kind of higher-order functional construct for this but I am not aware of any such construct. It is true that you could implement a higher order fork function, but its usefulness would be quite limited and can easily be achieved using if or the cond-> macro, as suggested in the other answers.
What comes to mind, however, are transducers. You could fairly easily implement a forking transducer that can be composed with other transducers to build powerful and concise sequence processing algorithms.
The implementation could look like this:
(defn forking [pred true-transducer false-transducer]
(fn [step]
(let [true-step (true-transducer step)
false-step (false-transducer step)]
(fn
([] (step))
([dst x] ((if (pred x) true-step false-step) dst x))
([dst] dst))))) ;; flushing not performed.
And this is how you would use it in your example:
(eduction (forking even?
(map #(+ 10 %))
identity)
(range 20))
;; => (10 1 12 3 14 5 16 7 18 9 20 11 22 13 24 15 26 17 28 19)
But it can also be composed with other transducers to build more complex sequence processing algorithms:
(into []
(comp (forking even?
(comp (drop 4)
(map #(+ 10 %)))
(comp (filter #(< 10 %))
(map #(vector % % %))
cat))
(partition-all 3))
(range 20))
;; => [[18 20 11] [11 11 22] [13 13 13] [24 15 15] [15 26 17] [17 17 28] [19 19 19]]
Another way to define fork (with three inputs) could be:
(defn fork [pred then else]
(comp
(partial apply apply)
(juxt (comp {true then, false else} pred) list)))
Notice that in this version the inputs and output can receive zero or more arguments. But let's take a more structured approach, defining some other useful combinators. Let's start by defining pick which corresponds to the categorical coproduct (sum) of morphisms:
(defn pick [actions]
(fn [[tag val]]
((actions tag) val)))
;alternatively
(defn pick [actions]
(comp
(partial apply apply)
(juxt (comp actions first) rest)))
E.g. (mapv (pick [inc dec]) [[0 1] [1 1]]) gives [2 0]. Using pick we can define switch which works like case:
(defn switch [test actions]
(comp
(pick actions)
(juxt test identity)))
E.g. (mapv (switch #(mod % 3) [inc dec -]) [3 4 5]) gives [4 3 -5]. Using switch we can easily define fork:
(defn fork [pred then else]
(switch pred {true then, false else}))
E.g. (mapv (fork even? inc dec) [0 1]) gives [1 0]. Finally, using fork let's also define fork* which receives zero or more predicate and action pairs and works like cond:
(defn fork* [& args]
(->> args
(partition 2)
reverse
(reduce
(fn [else [pred then]]
(fork pred then else))
identity)))
;equivalently
(defn fork* [& args]
(->> args
(partition 2)
(map (partial apply (partial partial fork)))
(apply comp)
(#(% identity))))
E.g. (mapv (fork* neg? -, even? inc) [-1 0 1]) gives [1 1 1].
Depending on the details, it is often easiest to accomplish this goal using the cond-> macro and friends:
(let [myfn (fn [val]
             (cond-> val
               ;; cond-> threads val in as the first argument,
               ;; so the form is (+ 10), not (+ val 10)
               (even? val) (+ 10)))]
  (mapv myfn (range 5)))
with result
[10 1 12 3 14]
There is a variant in the Tupelo library that is sometimes helpful:
(mapv #(cond-it-> %
(even? it) (+ it 10))
(range 5))
that allows you to use the special symbol it as you thread the value through multiple stages.
As the examples show, you have the option to define and name the transformer function (my favorite), or use the function literal syntax #(...)
I recently discovered the Specter library that provides data-structure navigation and transformation functions and is written in Clojure.
Implementing some of its API as a learning exercise seemed like a good idea. Specter implements an API taking a function and a nested structure as arguments and returns a vector of elements from the nested structure that satisfies the function like below:
(select (walker number?) [1 :a {:b 2}]) => [1 2]
Below is my attempt at implementing a function with similar API:
(defn select-walker [afn ds]
(vec (if (and (coll? ds) (not-empty ds))
(concat (select-walker afn (first ds))
(select-walker afn (rest ds)))
(if (afn ds) [ds]))))
(select-walker number? [1 :a {:b 2}]) => [1 2]
I have tried implementing select-walker by using list comprehension, looping, and using cons and conj. In all these cases
the return value was a nested list instead of a flat vector of elements.
Yet my implementation does not seem like idiomatic Clojure and has poor time and space complexity.
(time (dotimes [_ 1000] (select (walker number?) (range 100))))
"Elapsed time: 19.445396 msecs"
(time (dotimes [_ 1000] (select-walker number? (range 100))))
"Elapsed time: 237.000334 msecs"
Notice that my implementation is about 12 times slower than Specter's implementation.
I have three questions on the implementation of select-walker.
Is a tail-recursive implementation of select-walker possible?
Can select-walker be written in more idiomatic Clojure?
Any hints to make select-walker execute faster?
there are at least two possibilities to make it tail recursive. First one is to process data in loop like this:
(defn select-walker-rec [afn ds]
(loop [res [] ds ds]
(cond (empty? ds) res
(coll? (first ds)) (recur res
(doall (concat (first ds)
(rest ds))))
(afn (first ds)) (recur (conj res (first ds)) (rest ds))
:else (recur res (rest ds)))))
in repl:
user> (select-walker-rec number? [1 :a {:b 2}])
[1 2]
user> (time (dotimes [_ 1000] (select-walker-rec number? (range 100))))
"Elapsed time: 19.428887 msecs"
(simple select-walker works about 200ms for me)
the second one (slower though, and more suitable for more difficult tasks) is to use zippers:
(require '[clojure.zip :as z])
(defn select-walker-z [afn ds]
(loop [res [] curr (z/zipper coll? seq nil ds)]
(cond (z/end? curr) res
(z/branch? curr) (recur res (z/next curr))
(afn (z/node curr)) (recur (conj res (z/node curr))
(z/next curr))
:else (recur res (z/next curr)))))
user> (time (dotimes [_ 1000] (select-walker-z number? (range 100))))
"Elapsed time: 219.015153 msecs"
this one is really slow, since zipper operates on more complex structures. Its great power brings unneeded overhead to this simple task.
the most idiomatic approach i guess, is to use tree-seq:
(defn select-walker-t [afn ds]
(filter #(and (not (coll? %)) (afn %))
(tree-seq coll? seq ds)))
user> (time (dotimes [_ 1000] (select-walker-t number? (range 100))))
"Elapsed time: 1.320209 msecs"
it is incredibly fast, as it produces a lazy sequence of results. In fact you should realize its data for the fair test:
user> (time (dotimes [_ 1000] (doall (select-walker-t number? (range 100)))))
"Elapsed time: 53.641014 msecs"
one more thing to notice about this variant, is that it's not tail recursive, so it would fail in case of really deeply nested structures (maybe i'm mistaken, but i guess it's about couple of thousands levels of nesting), still it's suitable for the most cases.
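Since the question asked for a vector and the lazy version has to be forced with doall anyway, a variation is to pour the tree-seq through a filter transducer with into. This is a sketch under the same assumptions as select-walker-t (leaves are the non-collection nodes); select-walker-v is a name made up here, and no timing is claimed for it:

```clojure
(defn select-walker-v
  "Eager variant of the tree-seq approach: returns a vector of all
  non-collection nodes of ds that satisfy afn."
  [afn ds]
  (into []
        (filter #(and (not (coll? %)) (afn %)))
        (tree-seq coll? seq ds)))

(select-walker-v number? [1 :a {:b 2}])
;; => [1 2]
```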
I need to take 20 results from a lazy sequence of millions of hash-maps but for the 20 to be based on sorting on various values within the hashmaps.
For example:
(def population [{:id 85187153851 :name "anna" :created #inst "2012-10-23T20:36:25.626-00:00" :rank 77336}
{:id 12595145186 :name "bob" :created #inst "2011-02-03T20:36:25.626-00:00" :rank 983666}
{:id 98751563911 :name "cartmen" :created #inst "2007-01-13T20:36:25.626-00:00" :rank 112311}
...
{:id 91514417715 :name "zaphod" :created #inst "2015-02-03T20:36:25.626-00:00" :rank 9866}])
In normal circumstances a simple sort-by would get the job done:
(sort-by :id population)
(sort-by :name population)
(sort-by :created population)
(sort-by :rank population)
But I need to do this across millions of records as fast as possible and want to do it lazily rather than having to realize the entire data set.
I looked around a lot and found a number of implementations of algorithms that work really well for sorting a sequence of values (mostly numeric) but none for a lazy sequence of hash-maps in the way I need.
Speed & efficiency being of prime importance, the best I have found has been the quicksort example from the Joy Of Clojure book (Chapter 6.4) which does just enough work to return the required result.
(ns joy.q)
(defn sort-parts
"Lazy, tail-recursive, incremental quicksort. Works against
and creates partitions based on the pivot, defined as 'work'."
[work]
(lazy-seq
(loop [[part & parts] work]
(if-let [[pivot & xs] (seq part)]
(let [smaller? #(< % pivot)]
(recur (list*
(filter smaller? xs)
pivot
(remove smaller? xs)
parts)))
(when-let [[x & parts] parts]
(cons x (sort-parts parts)))))))
(defn qsort [xs]
(sort-parts (list xs)))
Works really well...
(time (take 10 (qsort (shuffle (range 10000000)))))
"Elapsed time: 551.714003 msecs"
(0 1 2 3 4 5 6 7 8 9)
Great! But...
However much I try I can't seem to work out how to apply this to a sequence of hashmaps.
I need something like:
(take 20 (qsort-by :created population))
If you only need the top N elements, a full sort is too expensive (even a lazy sort like the one in the JoC: it needs to keep nearly the whole data set in memory).
You only need to scan (reduce) the dataset and keep the best N items so far.
=> (defn top-by [n k coll]
(reduce
(fn [top x]
(let [top (conj top x)]
(if (> (count top) n)
(disj top (first top))
top)))
(sorted-set-by #(< (k %1) (k %2))) coll))
#'user/top-by
=> (top-by 3 first [[1 2] [10 2] [9 3] [4 2] [5 6]])
#{[5 6] [9 3] [10 2]}
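For completeness, the literal qsort-by asked for in the question can be sketched by threading a key function through the Joy of Clojure quicksort quoted there. This is an adaptation under assumptions, not code from the book: sort-parts-by and qsort-by are names chosen for the example, and compare is used so that keys such as numbers, strings and #inst dates all work. The memory caveat above still applies.

```clojure
(defn sort-parts-by
  "Lazy, incremental quicksort over a work list of partitions,
  ordering items by (keyfn item)."
  [keyfn work]
  (lazy-seq
    (loop [[part & parts] work]
      (if-let [[pivot & xs] (seq part)]
        (let [pivot-key (keyfn pivot)
              smaller?  #(neg? (compare (keyfn %) pivot-key))]
          ;; split the current partition around the pivot and keep going
          (recur (list* (filter smaller? xs)
                        pivot
                        (remove smaller? xs)
                        parts)))
        ;; current partition is empty: the next entry is a placed pivot
        (when-let [[x & parts] parts]
          (cons x (sort-parts-by keyfn parts)))))))

(defn qsort-by [keyfn xs]
  (sort-parts-by keyfn (list xs)))

(def people [{:name "bob" :rank 3}
             {:name "anna" :rank 1}
             {:name "zaphod" :rank 2}])

(map :name (take 2 (qsort-by :rank people)))
;; => ("anna" "zaphod")
```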
I am learning Clojure and trying to solve Project Euler (http://projecteuler.net/) problems using this language.
The second problem asks for the sum of the even-valued terms in the Fibonacci sequence whose values do not exceed four million.
I've tried several approaches, and the following one seems the cleanest to me, if only I could find where it's broken. Right now it returns 0. I am pretty sure there is a problem with the take-while condition but can't figure it out.
(reduce +
(take-while (and even? (partial < 4000000))
(map first (iterate (fn [[a b]] [b (+ a b)]) [0 1]))))
To compose multiple predicates in this way, you can use every-pred:
(every-pred even? (partial > 4000000))
The return value of this expression is a function that takes an argument and returns true if it is both even and less than 4000000, false otherwise.
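The reason the original condition does not behave as a combined predicate is that and works on the function objects themselves rather than on the numbers. A short illustration; the names broken and small-and-even? are invented for the example:

```clojure
;; `and` runs once, on the function objects. Both are truthy, so it
;; simply returns its last argument and even? is never consulted.
(def broken (and even? (partial < 4000000)))

(broken 0)        ;; => false, i.e. (< 4000000 0), so take-while stops at once
(broken 4000001)  ;; => true, although 4000001 is odd

;; every-pred builds a function that applies each predicate to its argument.
(def small-and-even? (every-pred even? (partial > 4000000)))

(small-and-even? 2)        ;; => true
(small-and-even? 3)        ;; => false
(small-and-even? 4000002)  ;; => false
```

With broken as the take-while condition, the very first Fibonacci number (0) is rejected, the sequence is empty, and reduce + returns 0, which matches what the question observed.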
user> ((partial < 4000000) 1)
false
Partial puts the static arguments first and the free ones at the end, so it's building the opposite of what you want. It is essentially producing #(< 4000000 %) instead of #(< % 4000000) as you intended. So just change the < to >:
user> (reduce +
(take-while (and even? (partial > 4000000))
(map first (iterate (fn [[a b]] [b (+ a b)]) [0 1]))))
9227464
or perhaps it would be more clear to use the anonymous function form directly:
user> (reduce +
(take-while (and even? #(< % 4000000))
(map first (iterate (fn [[a b]] [b (+ a b)]) [0 1]))))
9227464
Now that we have covered a bit about partial, let's break down a working solution. I'll use the thread-last macro ->> to show each step separately.
user> (->> (iterate (fn [[a b]] [b (+ a b)]) [0 1]) ;; start with the fibs
(map first) ;; keep only the answer
(take-while #(< % 4000000)) ;; stop when they get too big
(filter even?) ;; take only the even? ones
(reduce +)) ;; sum it all together.
4613732
From this we can see that we don't actually want to compose the predicates even? and less-than-4000000 in a take-while, because it would stop as soon as either condition was false, leaving only the number zero. Rather, we want to use one of the predicates as a limit and the other as a filter.
I'm creating unordered pairs of data elements. A comment by @Chouser on this question says that hash-sets are implemented with 32 children per node, while sorted-sets are implemented with 2 children per node. Does this mean that my pairs will take up less space if I implement them with sorted-sets rather than hash-sets (assuming that the data elements are Comparable, i.e. can be sorted)? (I doubt it matters for me in practice. I'll only have hundreds of these pairs, and lookup in a two-element data structure, even sequential lookup in a vector or list, should be fast. But I'm curious.)
Comparing an explicit check of the first two elements of a list against a lookup in one of Clojure's built-in sets, I don't see a significant difference when running each ten million times:
user> (defn my-lookup [key pair]
(condp = key
(first pair) true
(second pair) true false))
#'user/my-lookup
user> (time (let [data `(1 2)]
(dotimes [x 10000000] (my-lookup (rand-nth [1 2]) data ))))
"Elapsed time: 906.408176 msecs"
nil
user> (time (let [data #{1 2}]
(dotimes [x 10000000] (contains? data (rand-nth [1 2])))))
"Elapsed time: 1125.992105 msecs"
nil
Of course micro-benchmarks such as this are inherently flawed and difficult to really do well so don't try to use this to show that one is better than the other. I only intend to demonstrate that they are very similar.
If I'm doing something with unordered pairs, I usually like to use a map since that makes it easy to look up the other element. E.g., if my pair is [2 7], then I'll use {2 7, 7 2}, and I can do ({2 7, 7 2} 2), which gives me 7.
As for space, the PersistentArrayMap implementation is actually very space conscious. If you look at the source code (see previous link), you'll see that it allocates an Object[] of the exact size needed to hold all the key/value pairs. I think this is used as the default map type for all maps with no more than 8 key/value pairs.
The only catch here is that you need to be careful about duplicate keys. {2 2, 2 2} will cause an exception. You could get around this problem by doing something like this: (merge {2 2} {2 2}), i.e. (merge {a b} {b a}) where it's possible that a and b have the same value.
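The map representation and the merge workaround can be wrapped in two small helpers. This is just a sketch of the idea described above; unordered-pair and partner are hypothetical names:

```clojure
(defn unordered-pair
  "Represents the unordered pair of a and b as a map from each element
  to the other. merge avoids the duplicate-key error when a equals b."
  [a b]
  (merge {a b} {b a}))

(defn partner
  "Returns the other element of pair, or nil if x is not in it."
  [pair x]
  (get pair x))

(unordered-pair 2 7)              ;; => {2 7, 7 2}
(partner (unordered-pair 2 7) 2)  ;; => 7
(unordered-pair 2 2)              ;; => {2 2}
```

A side effect of this representation is that (unordered-pair 2 7) and (unordered-pair 7 2) are equal maps, so the pairs themselves can be compared or put into a set without worrying about order.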
Here's a little snippet from my repl:
user=> (def a (array-map 1 2 3 4))
#'user/a
user=> (type a)
clojure.lang.PersistentArrayMap
user=> (.count a) ; count simply returns array.length/2 of the internal Object[]
2
Note that I called array-map explicitly above. This is related to a question I asked a while ago related to map literals and def in the repl: Why does binding affect the type of my map?
This should be a comment, but i'm too short in reputation and too eager to share information.
If you are concerned about performance, clj-tuple by Zachary Tellman may be 2-3 times faster than ordinary lists/vectors, as claimed at ztellman/clj-tuple.
I wasn't planning to benchmark different pair representations now, but @ArthurUlfeldt's answer and @DaoWen's led me to do so. Here are my results using criterium's bench macro. Source code is below. To summarize, as expected, there are no large differences between the six representations I tested. However, there is a gap between times for the fastest, array-map and hash-map, and the others. This is consistent with DaoWen's and Arthur Ulfeldt's remarks.
Average execution time in seconds, in order from fastest to slowest (MacBook Pro, 2.3GHz Intel Core i7):
array-map: 5.602099
hash-map: 5.787275
vector: 6.605547
sorted-set: 6.657676
hash-set: 6.746504
list: 6.948222
Edit: I added a run of test-control below, which does only what is common to all of the different other tests. test-control took, on average, 5.571284 seconds. It appears that there is a bigger difference between the -map representations and the others than I had thought: Access to a hash-map or an array-map of two entries is essentially instantaneous (on my computer, OS, Java, etc.), whereas the other representations take about a second for 10 million iterations. Which, given that it's 10M iterations, means that those operations are still almost instantaneous. (My guess is that the fact that test-arraymap was faster than test-control is due to noise from other things happening in the background on the computer. Or it could have to do with idiosyncrasies of compilation.)
(A caveat: I forgot to mention that I'm getting a warning from criterium: "JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active." I believe this means that Leiningen is starting Java with a command line option that is geared toward the -server JIT compiler, but is being run instead with the default -client JIT compiler. So the warning is saying "you think you're running -server, but you're not, so don't expect -server behavior." Running with -server might change the times given above.)
(use 'criterium.core)
;; based on Arthur Ulfeldt's answer:
(defn pairlist-contains? [key pair]
(condp = key
(first pair) true
(second pair) true
false))
(defn pairvec-contains? [key pair]
(condp = key
(pair 0) true
(pair 1) true
false))
(def ntimes 10000000)
;; Test how long it takes to do what's common to all of the other tests
(defn test-control []
(print "=============================\ntest-control:\n")
(bench
(dotimes [_ ntimes]
(def _ (rand-nth [:a :b])))))
(defn test-list []
(let [data '(:a :b)]
(print "=============================\ntest-list:\n")
(bench
(dotimes [_ ntimes]
(def _ (pairlist-contains? (rand-nth [:a :b]) data))))))
(defn test-vec []
(let [data [:a :b]]
(print "=============================\ntest-vec:\n")
(bench
(dotimes [_ ntimes]
(def _ (pairvec-contains? (rand-nth [:a :b]) data))))))
(defn test-hashset []
(let [data (hash-set :a :b)]
(print "=============================\ntest-hashset:\n")
(bench
(dotimes [_ ntimes]
(def _ (contains? data (rand-nth [:a :b])))))))
(defn test-sortedset []
(let [data (sorted-set :a :b)]
(print "=============================\ntest-sortedset:\n")
(bench
(dotimes [_ ntimes]
(def _ (contains? data (rand-nth [:a :b])))))))
(defn test-hashmap []
(let [data (hash-map :a :a :b :b)]
(print "=============================\ntest-hashmap:\n")
(bench
(dotimes [_ ntimes]
(def _ (contains? data (rand-nth [:a :b])))))))
(defn test-arraymap []
(let [data (array-map :a :a :b :b)]
(print "=============================\ntest-arraymap:\n")
(bench
(dotimes [_ ntimes]
(def _ (contains? data (rand-nth [:a :b])))))))
(defn test-all []
(test-control)
(test-list)
(test-vec)
(test-hashset)
(test-sortedset)
(test-hashmap)
(test-arraymap))