I am trying to do a SUMIFS (from Excel) in Clojure. I have a CSV file with a Size column, whose values are Big, Medium, or Small, and a Revenue column. What I am trying to do is sum the revenue for each company by size.
This is what I've tried so far:
(math
sum
($for [Size] [row (vals input-data) :let [Size (:Size row)]]
(+ (:Revenue row) 0 )))
This is a fork of Clojure.
Here is a conventional way to do a sumif using standard Clojure functions:
(defn sumif [pred coll]
  (reduce + (filter pred coll)))
Here is an example of using sumif to sum all odd numbers from 0 to 9:
(sumif odd? (range 10)) ; => 25
Update:
But if you want to aggregate your data by Size, you can apply the fmap function from the algo.generic library (clojure.algo.generic.functor/fmap) to the result of a group-by:
(require '[clojure.algo.generic.functor :refer [fmap]])

(defn aggregate-by [group-key sum-key data]
  (fmap #(reduce + (map sum-key %))
        (group-by group-key data)))
Here is an example, using hypothetical rows shaped like the question's data:
(aggregate-by :Size :Revenue [{:Size "Big" :Revenue 100}
                              {:Size "Small" :Revenue 30}
                              {:Size "Big" :Revenue 50}])
;; => {"Big" 150, "Small" 30}
I would suggest the following, assuming that you have a seq of maps representing your data available:
(defn revenue-sum [props]
  (reduce (fn [acc {:keys [size revenue]}]
            (update-in acc
                       [size]
                       #(if % (+ revenue %) revenue)))
          {}
          props))
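For a runnable illustration, here is revenue-sum applied to some made-up rows (the sizes and revenue figures are hypothetical; the defn is repeated so the snippet is self-contained):

```clojure
(defn revenue-sum [props]
  (reduce (fn [acc {:keys [size revenue]}]
            (update-in acc
                       [size]
                       #(if % (+ revenue %) revenue)))
          {}
          props))

;; Hypothetical rows shaped like the question's CSV data.
(def rows [{:size "Big" :revenue 100}
           {:size "Small" :revenue 30}
           {:size "Big" :revenue 50}])

(revenue-sum rows)
;; => {"Big" 150, "Small" 30}
```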
I would like to compute the weighted mean of vectors in an idiomatic way.
To illustrate what I want, imagine I have this data:
data 1 = [2 1], weight 1 = 1
data 2 = [3 4], weight 2 = 2
Then mean = [(2*1 + 3*2)/(1+2) (1*1 + 2*4)/(1+2)] = [2.67 3.0]
Here is my code:
(defn meanv
  "Returns the vector that is the mean of input ones.
  You can also pass weights just like apache-maths.stats/mean"
  ([data]
   (let [n (count (first data))]
     (->> (for [i (range 0 n)]
            (vec (map (i-partial nth i) data)))
          (mapv stats/mean))))
  ([data weights]
   (let [n (count (first data))]
     (->> (for [i (range 0 n)]
            (vec (map (i-partial nth i) data)))
          (mapv (i-partial stats/mean weights))))))
Then
(meanv [[2 1] [3 4]] [1 2]) = [2.67 3.0]
A few notes:
stats/mean takes 1 or 2 inputs.
The one-input version has weights = 1 by default.
The two-input version is the weighted one.
i-partial is like partial but the fn has reversed args.
Ex: ((partial / 2) 1) = 2
((i-partial / 2) 1) = 1/2
So my function works, no problem.
But I would like to implement it in more idiomatic Clojure.
I tried many combinations with things like (map (fn [& xs] ...)) but it does not work.
Is it possible to take the nth elements of an arbitrary number of vectors and apply stats/mean directly? I mean a one-liner.
Thanks
EDIT (using birdspider's answer):
(defn meanv
  ([data]
   (->> (apply mapv vector data)
        (mapv stats/mean)))
  ([data weights]
   (->> (apply mapv vector data)
        (mapv (i-partial stats/mean weights)))))
And with a transpose helper:
(defn transpose [m]
  (apply mapv vector m))

(defn meanv
  ([data]
   (->> (transpose data)
        (mapv stats/mean)))
  ([data weights]
   (->> (transpose data)
        (mapv (i-partial stats/mean weights)))))
(def mult-v (partial mapv *))
(def sum-v (partial reduce +))
(def transpose (partial apply mapv vector))

(defn meanv [data weights]
  (->> data
       transpose
       (map (partial mult-v weights))
       (map sum-v)
       (map #(/ % (sum-v weights)))))
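With the question's sample data, this pure-Clojure version returns exact ratios (the helpers are repeated so the snippet runs standalone):

```clojure
(def mult-v (partial mapv *))
(def sum-v (partial reduce +))
(def transpose (partial apply mapv vector))

(defn meanv [data weights]
  (->> data
       transpose                       ; [[2 1] [3 4]] -> [[2 3] [1 4]]
       (map (partial mult-v weights))  ; weight each column element-wise
       (map sum-v)                     ; sum each weighted column
       (map #(/ % (sum-v weights))))) ; divide by the total weight

(meanv [[2 1] [3 4]] [1 2])
;; => (8/3 3)
```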
First thing you want to do is to transpose the matrix (get the firsts, seconds, thirds, etc.)
See this SO page.
; https://stackoverflow.com/a/10347404/2645347
(defn transpose [m]
(apply mapv vector m))
Then I would do it as follows; input checks are utterly absent.
(defn meanv
  ([data]
   ;; no weights - default to (1 1 1 ...)
   (meanv data (repeat (count data) 1)))
  ([data weights]
   (let [wf (mapv #(partial * %) weights) ; vector of weight mult fns
         wsum (reduce + weights)]
     (map-indexed
      (fn [i datum]
        (/
         ;; map over datum, apply the corresponding weight-fn, then sum
         (apply + (map-indexed #((wf %1) %2) datum))
         wsum))
      (transpose data)))))
(meanv [[2 1] [3 4]] [1 2]) => (8/3 3) ; (2.6666 3.0)
Profit!
I've been doing a few of the HackerRank challenges and have noticed that I don't seem to be able to write efficient code: quite often I get timeouts, even though the answers that do pass the tests are correct. For example, for this challenge this is my code:
(let [divisors (fn [n] (into #{n} (into #{1} (filter (comp zero? (partial rem n)) (range 1 n)))))
      str->ints (fn [string]
                  (map #(Integer/parseInt %)
                       (clojure.string/split string #" ")))
      ;lines (line-seq (java.io.BufferedReader. *in*))
      lines ["3"
             "10 4"
             "1 100"
             "288 240"]
      pairs (map str->ints (rest lines))
      first-divs (map divisors (map first pairs))
      second-divs (map divisors (map second pairs))
      intersections (map clojure.set/intersection first-divs second-divs)
      counts (map count intersections)]
  (doseq [v counts]
    (println (str v))))
Note that clojure.set isn't available on HackerRank; I just put it in here for the sake of brevity.
In this exact case there is an obvious misuse of the map function:
although Clojure collections are lazy, operations on them still don't come for free. When you chain lots of maps, you still materialize all the intermediate collections (there are 7 here). To avoid this one would usually use transducers, but in your case you are just mapping every input line to one output line, so it is really enough to do it in one pass over the input collection:
(let [divisors (fn [n] (into #{n} (into #{1} (filter (comp zero? (partial rem n)) (range 1 n)))))
      str->ints (fn [string]
                  (map #(Integer/parseInt %)
                       (clojure.string/split string #" ")))
      ;lines (line-seq (java.io.BufferedReader. *in*))
      get-counts (fn [pair] (let [d1 (divisors (first pair))
                                  d2 (divisors (second pair))]
                              (count (clojure.set/intersection d1 d2))))
      lines ["3"
             "10 4"
             "1 100"
             "288 240"]
      counts (map (comp get-counts str->ints) (rest lines))]
  (doseq [v counts]
    (println (str v))))
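For completeness, here is a sketch of the transducer approach mentioned above; composing the two map transducers with comp processes each line end-to-end without building intermediate sequences (the divisor logic is unchanged from the answer):

```clojure
(require '[clojure.set]
         '[clojure.string])

(def divisors
  (fn [n] (into #{n} (into #{1} (filter (comp zero? (partial rem n)) (range 1 n))))))

(def str->ints
  (fn [s] (map #(Integer/parseInt %) (clojure.string/split s #" "))))

(def get-counts
  (fn [[a b]] (count (clojure.set/intersection (divisors a) (divisors b)))))

(def lines ["3" "10 4" "1 100" "288 240"])

;; comp of transducers: each line is parsed and counted in a single pass
(def counts (into [] (comp (map str->ints) (map get-counts)) (rest lines)))

(doseq [v counts]
  (println v))
```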
I'm not talking about the correctness of the whole algorithm here; maybe it could also be optimized. But as far as Clojure's mechanics are concerned, this change should speed up your code quite noticeably.
Update:
As for the algorithm, you would probably want to start by limiting the range from 1..n to 1..(sqrt n), adding both x and n/x to the resulting set whenever x is a divisor of n; that should give you quite a big win for large numbers:
(defn divisors [n]
  (into #{} (mapcat #(when (zero? (rem n %)) [% (/ n %)])
                    (range 1 (inc (Math/floor (Math/sqrt n)))))))
Also, I would consider finding all the divisors of the smaller of the two numbers, and then keeping the ones that the other number is divisible by. This eliminates the search for the greater number's divisors.
(defn common-divisors [pair]
  (let [[a b] (sort pair)
        divs (divisors a)]
    (filter #(zero? (rem b %)) divs)))
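Putting the two functions together on the sample pairs from the question (the definitions are repeated so the snippet is self-contained):

```clojure
(defn divisors [n]
  (into #{} (mapcat #(when (zero? (rem n %)) [% (/ n %)])
                    (range 1 (inc (Math/floor (Math/sqrt n)))))))

(defn common-divisors [pair]
  (let [[a b] (sort pair)
        divs (divisors a)]
    (filter #(zero? (rem b %)) divs)))

;; counts of common divisors for the sample pairs
(map (comp count common-divisors) [[10 4] [1 100] [288 240]])
;; => (2 1 10)
```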
If that still doesn't pass the tests, you should probably look for a better algorithm for common divisors.
Update 2:
I submitted the updated algorithm to HackerRank and it passes now.
Today I tried to implement an "R-like" melt function. I use it for big data coming from BigQuery.
I don't have tight constraints on computation time, and this function takes 5-10 seconds to run on millions of rows.
I start with this kind of data:
(def sample
'({:list "123,250" :group "a"} {:list "234,260" :group "b"}))
Then I defined a function to put the list into a vector:
(defn split-data-rank [datatab value]
  (let [splitted (map (fn [x] (assoc x value (str/split (x value) #","))) datatab)]
    (map (fn [y] (let [index (map inc (range (count (y value))))]
                   (assoc y value (zipmap index (y value)))))
         splitted)))
Launch:
(split-data-rank sample :list)
As you can see, it returns the same sequence, but it replaces :list with a map from each item's position in the list to the item itself.
Then, I want to melt the "dataframe" by creating, for each item in a group, its own row with its rank in the group.
So I created this function:
(defn split-melt [datatab value]
  (let [splitted (split-data-rank datatab value)]
    (map (fn [y] (dissoc y value))
         (apply concat
                (map (fn [x]
                       (map (fn [[k v]]
                              (assoc x :item v :Rank k))
                            (x value)))
                     splitted)))))
Launch:
(split-melt sample :list)
The problem is that it is deeply nested and uses a lot of maps. I apply dissoc to drop :list (which is useless now), and I also have to use concat because without it I get a sequence of sequences.
Do you think there is a more efficient/shorter way to design this function ?
I am heavily confused by reduce and do not know whether it can be applied here, since there are two arguments in a way.
Thanks a lot!
If you don't need the split-data-rank function, I would go for:
(defn melt [datatab value]
  (mapcat (fn [x]
            (let [items (str/split (get x value) #",")]
              (map-indexed (fn [idx item]
                             (-> x
                                 (assoc :Rank (inc idx) :item item)
                                 (dissoc value)))
                           items)))
          datatab))
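Applied to the sample from the question (with the str alias required), melt produces one row per item:

```clojure
(require '[clojure.string :as str])

(def sample
  '({:list "123,250" :group "a"} {:list "234,260" :group "b"}))

(defn melt [datatab value]
  (mapcat (fn [x]
            (let [items (str/split (get x value) #",")]
              (map-indexed (fn [idx item]
                             (-> x
                                 (assoc :Rank (inc idx) :item item)
                                 (dissoc value)))
                           items)))
          datatab))

(melt sample :list)
;; => ({:group "a", :Rank 1, :item "123"}
;;     {:group "a", :Rank 2, :item "250"}
;;     {:group "b", :Rank 1, :item "234"}
;;     {:group "b", :Rank 2, :item "260"})
```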
I have a map, similar to {:datetime [unix-timestamps] :count [longs]}.
There are an equal number of items in :datetime and :count.
:datetime has no fixed interval; it is usually tick data. I would like to resample the data so that it has a defined interval, e.g. 5 minutes, and sum up the :count values in each range.
example:
{
:datetime [timestamp every minute]
:count [1 1 1 1 1 ...]
}
resample it to
{
:datetime [timestamp every 5 minutes]
:count [5 5 5 5 5 ...]
}
You want to take one element in five from the timestamp vector, and add groups of five counts from the counts vector. Something like this will do it:
(defn resample [m]
  (let [{dt :datetime ct :count} m
        newdt (map first (partition 5 dt))
        newct (map (partial apply +) (partition 5 ct))]
    {:datetime newdt
     :count newct}))
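For example, with ten hypothetical one-minute ticks (millisecond timestamps made up for illustration); note that partition 5 silently drops a trailing group of fewer than five ticks:

```clojure
(defn resample [m]
  (let [{dt :datetime ct :count} m
        newdt (map first (partition 5 dt))
        newct (map (partial apply +) (partition 5 ct))]
    {:datetime newdt
     :count newct}))

;; ten ticks, one per minute, starting at epoch 0 (hypothetical data)
(def ticks {:datetime (mapv #(* % 60000) (range 10))
            :count    (vec (repeat 10 1))})

(resample ticks)
;; => {:datetime (0 300000), :count (5 5)}
```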
Here's something fancy, but possibly inefficient:
(defn resample-5 [{:keys [datetime count]}]
  (letfn [(floor-5 [dt] (- dt (mod dt (* 5 60 1000))))
          (sum-counts [[time pairs]]
            [time (reduce + (map second pairs))])]
    (let [pairs (partition 2 (interleave datetime count))
          pair-groups (group-by #(floor-5 (first %)) pairs)
          sums (map sum-counts pair-groups)]
      {:datetime (map first sums)
       :count (map second sums)})))
Note how many passes it makes over the collection: interleave, partition, group-by, map+reduce, and then map twice more.
And here's something much more efficient that only scans the collection once:
(defn resample-5 [{:keys [datetime count]}]
  (letfn [(add-tick [result dt c]
            (if dt
              (-> result
                  (update-in [:datetime] conj dt)
                  (update-in [:count] conj c))
              result))]
    (loop [datetimes datetime
           counts count
           rounded-last nil
           count-last 0
           result {:datetime [] :count []}]
      (if (empty? datetimes)
        (add-tick result rounded-last count-last)
        (let [dt (first datetimes)
              c (first counts)
              rounded (- dt (mod dt (* 5 60 1000)))]
          (if (= rounded-last rounded)
            (recur (rest datetimes) (rest counts) rounded (+ count-last c) result)
            (recur (rest datetimes) (rest counts) rounded c
                   (add-tick result rounded-last count-last))))))))
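Running the single-pass version on the same kind of hypothetical input (ten made-up one-minute ticks) shows the 5-minute buckets keyed by their floored start times; the defn is repeated verbatim so the snippet runs on its own:

```clojure
(defn resample-5 [{:keys [datetime count]}]
  (letfn [(add-tick [result dt c]
            (if dt
              (-> result
                  (update-in [:datetime] conj dt)
                  (update-in [:count] conj c))
              result))]
    (loop [datetimes datetime
           counts count
           rounded-last nil
           count-last 0
           result {:datetime [] :count []}]
      (if (empty? datetimes)
        (add-tick result rounded-last count-last)
        (let [dt (first datetimes)
              c (first counts)
              rounded (- dt (mod dt (* 5 60 1000)))]
          (if (= rounded-last rounded)
            (recur (rest datetimes) (rest counts) rounded (+ count-last c) result)
            (recur (rest datetimes) (rest counts) rounded c
                   (add-tick result rounded-last count-last))))))))

;; ten hypothetical one-minute ticks starting at epoch 0
(resample-5 {:datetime (mapv #(* % 60000) (range 10))
             :count    (vec (repeat 10 1))})
;; => {:datetime [0 300000], :count [5 5]}
```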
The following code
(let [coll [1 2 3 4 5]
      filters [#(> % 1) #(< % 5)]]
  (->> coll
       (filter (first filters))
       (filter (second filters))))
Gives me
(2 3 4)
Which is great, but how do I apply all the predicates in filters to coll without having to name each one explicitly?
There may be totally better ways of doing this, but ideally I'd like to know an expression that can replace (filter (first filters)) (filter (second filters)) above.
Thanks!
Clojure 1.3 has a new every-pred function, which you could use thusly:
(filter (apply every-pred filters) coll)
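With the data from the question, that looks like:

```clojure
(let [coll [1 2 3 4 5]
      filters [#(> % 1) #(< % 5)]]
  ;; every-pred combines the predicates into one; filter applies it once
  (filter (apply every-pred filters) coll))
;; => (2 3 4)
```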
This should work:
(let [coll [1 2 3 4 5]
      filters [#(> % 1) #(< % 5)]]
  (filter (fn [x] (every? #(% x) filters)) coll))
I can't say I'm very proud of the following, but at least it works and allows for infinite filters:
(seq
 (reduce #(clojure.set/intersection
           (set %1)
           (set %2))
         (map #(filter % coll) filters)))
If you can use sets in place of seqs it would simplify the above code as follows:
(reduce clojure.set/intersection (map #(filter % coll) filters))
(let [coll [1 2 3 4 5]
      filters [#(> % 1) #(< % 5)]]
  (reduce (fn [c f] (filter f c)) coll filters))