I have a vector of vectors that contains some strings and ints:
(def data [
["a" "title" "b" 1]
["c" "title" "d" 1]
["e" "title" "f" 2]
["g" "title" "h" 1]
])
I'm trying to iterate through the vector and return(?) any rows that contain a certain string e.g. "a". I tried implementing things like this:
(defn get-row [data]
(for [d [data]
:when (= (get-in d[0]) "a")] d
))
I'm quite new to Clojure, but I believe this is saying: For every element (vector) in 'data', if that vector contains "a", return it?
I know get-in needs 2 parameters, that part is where I'm unsure of what to do.
I have looked at answers like this and this but I don't really understand how they work. From what I can gather they're converting the vector to a map and doing the operations on that instead?
(filter #(some #{"a"} %) data)
It's a bit strange seeing the set #{"a"} but it works as a predicate function for some. Adding more entries to the set would be like a logical OR for it, i.e.
(filter #(some #{"a" "c"} %) data)
=> (["a" "title" "b" 1] ["c" "title" "d" 1])
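To see why the set works as a predicate here: a Clojure set can be called as a function of one argument, returning the element if it is a member and nil otherwise. A minimal sketch, with data copied from the question:

```clojure
(def data [["a" "title" "b" 1]
           ["c" "title" "d" 1]
           ["e" "title" "f" 2]
           ["g" "title" "h" 1]])

;; A set called as a function returns the element if present, else nil.
(#{"a"} "a")   ;=> "a"
(#{"a"} "b")   ;=> nil

;; some returns the first truthy result of applying the predicate to each item.
(some #{"a"} ["a" "title" "b" 1])   ;=> "a"
(some #{"a"} ["c" "title" "d" 1])   ;=> nil

;; So filter keeps exactly the rows in which some finds a member of the set.
(filter #(some #{"a"} %) data)
;=> (["a" "title" "b" 1])
```

One caveat: because some looks for a truthy result, a set containing nil or false will not match those values this way.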
OK, you have an error in your code:
(defn get-row [data]
(for [d [data]
:when (= (get-in d[0]) "a")] d
))
The error is here:
(for [d [data] ...
To traverse all the elements, you shouldn't enclose data in brackets, because that syntax creates a vector. Here you are traversing a vector with a single element (the whole of data). This is how it looks to Clojure:
(for [d [[["a" "title" "b" 1]
["c" "title" "d" 1]
["e" "title" "f" 2]
["g" "title" "h" 1]]] ...
So the correct variant is:
(defn get-row [data]
(for [d data
:when (= "a" (get-in d [0]))]
d))
Then you could use Clojure's destructuring for that:
(defn get-row [data]
(for [[f & _ :as d] data
:when (= f "a")]
d))
But the more idiomatic Clojure way is to use higher-order functions:
(defn get-row [data]
(filter #(= (first %) "a") data))
That covers your code. But the correct variant is in the other answers, because here you are checking just the first item of each row.
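The difference between binding over [data] and over data can be checked at the REPL. A small sketch, using a shortened copy of the question's data:

```clojure
(def data [["a" "title" "b" 1]
           ["c" "title" "d" 1]])

;; Wrapping data in brackets builds a new one-element vector,
;; so the body runs once with d bound to the whole of data.
(for [d [data]] d)
;=> ([["a" "title" "b" 1] ["c" "title" "d" 1]])

;; Without the brackets, d is bound to each row in turn.
(for [d data] d)
;=> (["a" "title" "b" 1] ["c" "title" "d" 1])

;; With d bound to the whole of data, (get-in d [0]) is the first row,
;; not a string, so comparing it to "a" is always false.
(get-in data [0])
;=> ["a" "title" "b" 1]
```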
(defn get-row [data]
(for [d data ; <-- fix: [data] would result
; in one iteration with d bound to data
:when (= (get-in d[0]) "a")]
d))
Observe that your algorithm returns rows where the first column is "a". To scan the entire row instead, you can e.g. use some with a set as the predicate function.
(defn get-row [data]
(for [row data
:when (some #{"a"} row)]
row))
Even better than the currently selected answer, this would work:
(filter #(= "a" (% 0)) data)
The reason is that the top answer searches every index of each sub-vector for your query, whereas you might only want to look at the first index of each sub-vector (in this case, check position 0 for "a" and return the whole sub-vector if it matches).
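The two checks only give different results when the string appears somewhere other than the first column. A small sketch with made-up rows (the rows vector is for illustration only, not from the question):

```clojure
(def rows [["a" "title" "b" 1]
           ["c" "title" "a" 1]
           ["e" "title" "f" 2]])

;; Only the first column is checked.
(filter #(= "a" (first %)) rows)
;=> (["a" "title" "b" 1])

;; Every column is checked.
(filter #(some #{"a"} %) rows)
;=> (["a" "title" "b" 1] ["c" "title" "a" 1])
```

This sketch uses first rather than (% 0) because first returns nil for an empty row, whereas calling an empty vector with index 0 throws an IndexOutOfBoundsException.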
Hi, I'm trying to write a function that returns the 3 most common strings:
(defn freq [s]
  (take 3 (sort-by val > (frequencies s))))
(freq ["hi" "hi" "hi" "ola" "hello" "hello" "string" "str" "ola" "hello" "hello" "str"])
I've got this so far, but I noticed that if more than one string has the same frequency, the extra ones are not returned. Is there a way to filter the values of the frequencies function by their highest counts (eventually the top 3 highest)?
Thanks in advance.
I would propose a slightly different solution, which inverts the frequencies map with group-by on the value (the item's count):
(->> data
frequencies
(group-by val))
;;{3 [["hi" 3]],
;; 2 [["ola" 2] ["str" 2]],
;; 4 [["hello" 4]],
;; 1 [["string" 1]]}
So the only thing you need to do is sort and process it:
(->> data
frequencies
(group-by val)
(sort-by key >)
(take 3)
(mapv (fn [[k vs]] {:count k :items (mapv first vs)})))
;;[{:count 4, :items ["hello"]}
;; {:count 3, :items ["hi"]}
;; {:count 2, :items ["ola" "str"]}]
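To make the difference from the original approach concrete, here is a small comparison on the question's input. Only the counts and the set of names are shown, since which of the tied strings survives take 3 depends on map iteration order:

```clojure
(def s ["hi" "hi" "hi" "ola" "hello" "hello" "string" "str" "ola" "hello" "hello" "str"])

;; Original approach: exactly three entries come back, so one of the
;; two strings with count 2 is cut off.
(map val (take 3 (sort-by val > (frequencies s))))
;=> (4 3 2)

;; Grouped approach: the three highest counts, with ties kept together.
(->> (frequencies s)
     (group-by val)
     (sort-by key >)
     (take 3)
     (mapcat (fn [[_ entries]] (map first entries)))
     set)
;=> #{"hello" "hi" "ola" "str"}
```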
frequencies gives you a map where the keys are the original values and the values are how often each one occurs. For your result you want the original values with the most occurrences, including any that tie on the same count.
One way is to "invert" the frequencies result into a map from occurrence count to all original values with that count. Then you can take the highest N keys of this map and select them. By using a sorted map as the target of the inversion, we get the ordering by key without further steps.
(defn invert-map
([source]
(invert-map source {}))
([source target]
(reduce (fn [m [k v]]
(update m v (fnil conj []) k))
target
source)))
(assert (=
{1 ["do" "re"]}
(invert-map {"do" 1 "re" 1})))
(defn freq
[n s]
(let [fs (invert-map (frequencies s) (sorted-map-by >))
top-keys (take n (keys fs))]
(select-keys fs top-keys)))
(assert (=
{4 ["hello"], 3 ["hi"], 2 ["ola" "str"]}
(freq 3 ["hi" "hi" "hi" "ola" "hello" "hello" "string" "str" "ola" "hello" "hello" "str"])))
I've got this list of fields (that's Facebook's graph API fields list).
["a" "b" ["c" ["t"] "d"] "e" ["f"] "g"]
I want to generate a map out of it. The convention is as follows: if a vector follows a key, it is the inner object for that key. The example vector could be represented as a map like this:
{"a" "value"
"b" {"c" {"t" "value"} "d" "value"}
"e" {"f" "value"}
"g" "value"}
So I have this solution so far
(defn traverse
[data]
(mapcat (fn [[left right]]
(if (vector? right)
(let [traversed (traverse right)]
(mapv (partial into [left]) traversed))
[[right]]))
(partition 2 1 (into [nil] data))))
(defn facebook-fields->map
[fields default-value]
(->> fields
(traverse)
(reduce #(assoc-in %1 %2 nil) {})
(clojure.walk/postwalk #(or % default-value))))
(let [data ["a" "b" ["c" ["t"] "d"] "e" ["f"] "g"]]
(facebook-fields->map data "value"))
#=> {"a" "value", "b" {"c" {"t" "value"}, "d" "value"}, "e" {"f" "value"}, "g" "value"}
But it is fat and difficult to follow. I am wondering if there is a more elegant solution.
Here's another way to do it using postwalk for the whole traversal, rather than using it only for default-value replacement:
(defn facebook-fields->map
[fields default-value]
(clojure.walk/postwalk
(fn [v] (if (coll? v)
(->> (partition-all 2 1 v)
(remove (comp coll? first))
(map (fn [[l r]] [l (if (coll? r) r default-value)]))
(into {}))
v))
fields))
(facebook-fields->map ["a" "b" ["c" ["t"] "d"] "e" ["f"] "g"] "value")
=> {"a" "value",
"b" {"c" {"t" "value"}, "d" "value"},
"e" {"f" "value"},
"g" "value"}
Trying to read heavily nested code makes my head hurt. It is worse when the answer is something of a "force-fit" with postwalk, which does things in a sort of "inside out" manner. Also, using partition-all is a bit of a waste, since we need to discard any pairs with two non-vectors.
To me, the most natural solution is a simple top-down recursion. The only problem is that we don't know in advance if we need to remove one or two items from the head of the input sequence. Thus, we can't use a simple for loop or map.
So, just write it as a straightforward recursion, and use an if to determine whether we consume 1 or 2 items from the head of the list.
If the 2nd item is a value, we consume one item and add in
:dummy-value to make a map entry.
If the 2nd item is a vector, we recurse and use that
as the value in the map entry.
The code:
(ns tst.demo.core
(:require [clojure.walk :as walk] ))
(def data ["a" "b" ["c" ["t"] "d"] "e" ["f"] "g"])
(defn parse [data]
(loop [result {}
data data]
(if (empty? data)
(walk/keywordize-keys result)
(let [a (first data)
b (second data)]
(if (sequential? b)
(recur
(into result {a (parse b)})
(drop 2 data))
(recur
(into result {a :dummy-value})
(drop 1 data)))))))
with result:
(parse data) =>
{:a :dummy-value,
:b {:c {:t :dummy-value}, :d :dummy-value},
:e {:f :dummy-value},
:g :dummy-value}
I added keywordize-keys at the end just to make the result a little more "Clojurey".
Since you're asking for a cleaner solution as opposed to a solution, and because I thought it was a neat little problem, here's another one.
(defn facebook-fields->map [coll]
(into {}
(keep (fn [[x y]]
(when-not (vector? x)
(if (vector? y)
[x (facebook-fields->map y)]
[x "value"]))))
(partition-all 2 1 coll)))
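The trick in this version is the sliding window produced by partition-all 2 1: every element is paired with its successor, and the last element ends up in a window of its own. A short sketch showing the windows and the final result (the function is repeated from above so the block runs on its own):

```clojure
;; Each element is paired with the one after it; the last window has one item.
(partition-all 2 1 ["a" "b" ["c"] "d"])
;=> (("a" "b") ("b" ["c"]) (["c"] "d") ("d"))

(defn facebook-fields->map [coll]
  (into {}
        (keep (fn [[x y]]
                ;; Windows that start with a vector are skipped: that vector
                ;; was already consumed as the value of the previous key.
                (when-not (vector? x)
                  (if (vector? y)
                    [x (facebook-fields->map y)]
                    [x "value"]))))
        (partition-all 2 1 coll)))

(facebook-fields->map ["a" "b" ["c" ["t"] "d"] "e" ["f"] "g"])
;=> {"a" "value", "b" {"c" {"t" "value"}, "d" "value"}, "e" {"f" "value"}, "g" "value"}
```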
This is a scenario I encountered many times, yet didn't find an idiomatic approach for it...
Suppose one would like to use a self-defined self-pred function to filter a seq. This self-pred function returns nil for unwanted elements, and useful information for wanted elements. It is desirable to keep the evaluated self-pred values for these wanted elements.
My general solution is:
;; self-pred is a pred function which returns valuable info
;; in general, they are unique and can be used as key
(let [new-seq (filter self-pred aseq)]
(zipmap (map self-pred new-seq) new-seq))
Basically, this calls self-pred twice on every wanted element, which feels ugly.
I wonder if there is a better way. Any input is much appreciated!
In these scenarios you can use keep, but you have to change your "predicate" function to return the full information you need, or nil, for each item.
For example:
(keep (fn [item]
(when-let [tested (some-test item)]
(assoc item :test-output tested))) aseq)
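If the goal is the same map the question builds with zipmap, the function passed to keep can return key/value pairs that are poured into a map, so self-pred runs once per element. A minimal sketch; the self-pred and aseq definitions here are hypothetical stand-ins, not from the question:

```clojure
;; Hypothetical stand-in for self-pred: the length of a non-empty string, else nil.
(defn self-pred [s]
  (when (seq s) (count s)))

(def aseq ["ab" "" "cde"])

;; keep drops the nil results, so only wanted elements produce a pair.
(into {}
      (keep (fn [item]
              (when-let [k (self-pred item)]
                [k item])))
      aseq)
;=> {2 "ab", 3 "cde"}
```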
I use this kind of snippet:
(keep #(some->> % self-pred (vector %)) data)
Like this:
user> (keep #(some->> % rseq (vector %)) [[1 2] [] [3 4]])
;;=> ([[1 2] (2 1)] [[3 4] (4 3)])
Or, if you'd like a more verbose result:
user> (keep #(some->> % rseq (hash-map :data % :result)) [[1 2] [] [3 4]])
;;=> ({:result (2 1), :data [1 2]} {:result (4 3), :data [3 4]})
I wouldn't bother with keep, but would just use plain map & filter like so:
(def data (range 6))
(def my-pred odd?)
(defn zip [& colls] (apply map vector colls)) ; like Python zip
(defn filter-with-pred
[vals pred]
(filter #(first %)
(zip (map pred vals) vals)))
(println (filter-with-pred data my-pred))
with result:
([true 1] [true 3] [true 5])
If self-pred guarantees that differing values never produce the same key, then I'd reach for reduce (since assoc with an existing key overwrites the previous key-value pair):
(reduce #(if-let [k (self-pred %2)]
(assoc %1 k %2)
%1)
{}
aseq)
Otherwise we can use group-by to derive a similar result:
(dissoc (group-by self-pred aseq) nil)
This is not quite the same, since the values will be in vectors: {k1 [v1 ..], k2 [..], ..}, but it guarantees that all values are kept.
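The difference shows up as soon as two elements produce the same key. A small sketch, where self-pred and aseq are hypothetical stand-ins chosen so that keys collide:

```clojure
;; Hypothetical self-pred: nil for even numbers, otherwise the number mod 4.
;; 1 and 5 both map to key 1; 3 and 7 both map to key 3.
(defn self-pred [n]
  (when (odd? n) (mod n 4)))

(def aseq [1 2 3 5 7])

;; reduce + assoc: a later element overwrites an earlier one with the same key.
(reduce #(if-let [k (self-pred %2)]
           (assoc %1 k %2)
           %1)
        {}
        aseq)
;=> {1 5, 3 7}

;; group-by: every element is kept, collected in a vector per key.
(dissoc (group-by self-pred aseq) nil)
;=> {1 [1 5], 3 [3 7]}
```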
I have a function that deduplicates rows with a preference. I thought of implementing the solution in Clojure using flambo, as follows:
From the data set, use group-by to group duplicates (i.e. based on a specified :key)
Given a :val as input, use a filter to check whether some of the values in each row are equal to this :val
Use a map to untuple the duplicates and return single vectors (I'm not sure this is the right way, though; I tried using flat-map without any luck)
For a sample data-set
(def rdd
(f/parallelize sc [ ["Coke" "16" ""] ["Pepsi" "" "5"] ["Coke" "2" "3"] ["Coke" "" "36"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]))
I tried this:
(defn dedup-rows
[rows input]
(let [{:keys [key-col col val]} input
result (-> rows
(f/group-by (f/fn [row]
(get row key-col)))
(f/values)
(f/map (f/fn [rows]
(if (= (count rows) 1)
rows
(filter (fn [row]
(let [col-val (get row col)
equal? (= col-val val)]
(if (not equal?)
true
false))) rows)))))]
result))
If I run this function like so:
(dedup-rows rdd {:key-col 0 :col 1 :val ""})
it produces:
;=> [(["Pepsi" 25 34]), (["Coke" 16 ] ["Coke" 2 3])]]
I don't know how to process this result further to produce:
;=> [["Pepsi" 25 34],["Coke" 16 ],["Coke" 2 3]]
I tried f/map f/untuple as the last form in the -> macro with no luck.
Any suggestions? I would really appreciate it if there's another way to go about this.
Thanks.
PS: when grouped
;=> [[["Pepsi" "" 5], ["Pepsi" "" 34], ["Pepsi" 25 34]], [["Coke" 16 ""], ["Coke" 2 3], ["Coke" "" 36]]]
For each group, rows that have "" are considered duplicates and hence removed from the group.
Looking at the flambo readme, there is a flat-map function. This is slightly unfortunate naming, because the Clojure equivalent is called mapcat. These functions take each map result, which must be a sequence, and concatenate them together. Another way to think about it is that it flattens the final sequence by one level.
I can't test this but I think you should replace your f/map with f/flat-map.
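For reference, this is what the one-level flattening looks like with plain Clojure's map and mapcat on in-memory data shaped like the grouped result. It only illustrates the idea; I'm assuming f/flat-map behaves analogously on an RDD, which I haven't verified:

```clojure
;; In-memory stand-in for the grouped and filtered result (not an RDD).
(def groups [[["Pepsi" "25" "34"]]
             [["Coke" "16" ""] ["Coke" "2" "3"]]])

;; map keeps one result per group, so the rows stay nested.
(map identity groups)
;=> ([["Pepsi" "25" "34"]] [["Coke" "16" ""] ["Coke" "2" "3"]])

;; mapcat concatenates the per-group results, removing one level of nesting.
(mapcat identity groups)
;=> (["Pepsi" "25" "34"] ["Coke" "16" ""] ["Coke" "2" "3"])
```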
Going by @TheQuickBrownFox's suggestion, I tried the following:
(defn dedup-rows
[rows input]
(let [{:keys [key-col col val]} input
result (-> rows
(f/group-by (f/fn [row]
(get row key-col)))
(f/values)
(f/map (f/fn [rows]
(if (= (count rows) 1)
rows
(filter (fn [row]
(let [col-val (get row col)
equal? (= col-val val)]
(if (not equal?)
true
false))) rows)))
(f/flat-map (f/fn [row]
(mapcat vector row)))))]
result))
and it seems to work.
I developed a function in Clojure to fill in an empty column from the last non-empty value. I'm assuming this works, given:
(:require [flambo.api :as f])
(defn replicate-val
[ rdd input ]
(let [{:keys [ col ]} input
result (reductions (fn [a b]
(if (empty? (nth b col))
(assoc b col (nth a col))
b)) rdd )]
(println "Result type is: "(type result))))
Got this:
;=> "Result type is: clojure.lang.LazySeq"
The question is: how do I convert this back to type JavaRDD, using flambo (a Spark wrapper)?
I tried (f/map result #(.toJavaRDD %)) in the let form to attempt the conversion to JavaRDD.
I got this error:
"No matching method found: map for class clojure.lang.LazySeq"
which is expected, because result is of type clojure.lang.LazySeq.
So how do I make this conversion, or how can I refactor the code to accommodate this?
Here is a sample input rdd:
(type rdd) ;=> "org.apache.spark.api.java.JavaRDD"
But its contents look like:
[["04" "2" "3"] ["04" "" "5"] ["5" "16" ""] ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]]
Required output is:
[["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""] ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"]]
Thanks.
First of all, RDDs are not iterable (they don't implement ISeq), so you cannot use reductions. Even ignoring that, the whole idea of accessing the previous record is rather tricky. You cannot directly access values from another partition. Moreover, only transformations which don't require shuffling preserve order.
The simplest approach here would be to use DataFrames and window functions with explicit ordering, but as far as I know Flambo doesn't implement the required methods. It is always possible to use raw SQL or access the Java/Scala API, but if you want to avoid this you can try the following pipeline.
First let's create a broadcast variable with the last values per partition:
(require '[flambo.broadcast :as bd])
(import org.apache.spark.TaskContext)
(def last-per-part (f/fn [it]
(let [context (TaskContext/get) xs (iterator-seq it)]
[[(.partitionId context) (last xs)]])))
(def last-vals-bd
(bd/broadcast sc
(into {} (-> rdd (f/map-partitions last-per-part) (f/collect)))))
Next, some helpers for the actual job:
(defn fill-pair [col]
(fn [x] (let [[a b] x] (if (empty? (nth b col)) (assoc b col (nth a col)) b))))
(def fill-pairs
(f/fn [it] (let [part-id (.partitionId (TaskContext/get)) ;; Get partition ID
xs (iterator-seq it) ;; Convert input to seq
prev (if (zero? part-id) ;; Find previous element
(first xs)
;; last row of the preceding partition
((bd/value last-vals-bd) (dec part-id)))
;; Create seq of pairs (prev, current)
pairs (partition 2 1 (cons prev xs))
;; Same as before
{:keys [ col ]} input
;; Prepare mapping function
mapper (fill-pair col)]
(map mapper pairs))))
Finally you can use fill-pairs to map-partitions:
(-> rdd (f/map-partitions fill-pairs) (f/collect))
A hidden assumption here is that the order of the partitions follows the order of the values. That may or may not hold in the general case, but without explicit ordering it is probably the best you can get.
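As a local illustration of the pairing step only (plain Clojure vectors standing in for a single partition, no Spark involved), here is a sketch on the sample rows from the question. It assumes col is 1, and fill-step is a name introduced here just for the comparison. The pairwise version looks one row back at the original rows, so in a run of consecutive empty values only the first is filled; a reductions-based fold over an in-memory seq carries the filled value forward.

```clojure
(def rows [["04" "2" "3"] ["04" "" "5"] ["5" "16" ""]
           ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]])

;; Same logic as fill-pair above: takes a (previous, current) pair.
(defn fill-pair [col]
  (fn [[a b]]
    (if (empty? (nth b col)) (assoc b col (nth a col)) b)))

;; Pair every row with the row before it; the first row is paired with itself.
(map (fill-pair 1) (partition 2 1 (cons (first rows) rows)))
;=> (["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""]
;    ["07" "16" "36"] ["07" "" "34"] ["07" "25" "34"])

;; For comparison: a fold where each step sees the already-filled previous row.
(defn fill-step [col]
  (fn [prev row]
    (if (empty? (nth row col)) (assoc row col (nth prev col)) row)))

(reductions (fill-step 1) rows)
;=> (["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""]
;    ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"])
```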
An alternative approach is to use zipWithIndex, swap the order of the values, and perform a join with an offset.
(require '[flambo.tuple :as tp])
(def rdd-idx (f/map-to-pair (.zipWithIndex rdd) #(.swap %)))
(def rdd-idx-offset
(f/map-to-pair rdd-idx
(fn [t] (let [p (f/untuple t)] (tp/tuple (dec' (first p)) (second p))))))
(f/map (f/values (.rightOuterJoin rdd-idx-offset rdd-idx)) f/untuple)
Next you can map using a similar approach as before.
Edit
A quick note on using atoms. The problem there is the lack of referential transparency, and that you're leveraging incidental properties of a given implementation, not a contract. There is nothing in the map semantics that requires elements to be processed in a given order. If the internal implementation changes, it may no longer be valid. Using Clojure:
(def a (atom 0))
;; Returns the value a held before the call, then stores x in a.
(defn foo [x] (let [aa @a] (swap! a (fn [& args] x)) aa))
(map foo (range 1 20))
compared to:
(def a (atom 0))
(pmap foo (range 1 20))