Filter RDDs using clojure and flambo


I have an indexed RDD, (:rdd xctx), of the form:
[[["1" "32" "44" "55" "14"] 0] [["21" "23" "24" "25" "24"] 1] [["41" "53" "54" "5" "24"] 2] [["11" "35" "34" "15" "64"] 3]]
and I want to filter out the rows whose indexes appear in a vector, for example:
:rows-list [1 3]
I tried this, but I'm getting an error:
(defn remove-index-rows
  "Function to catch the row(s) with the specific Row Number(s) in rows-list
  input = {:rows-list [val(s)]}"
  [row input]
  (let [{:keys [rows-list]} input
        row-and-index (f/collect (f/filter #(= row (get % 0)) (:rdd xctx)))]
    (when-not (some #(= (get row-and-index 1) %) rows-list) row)))
Desired output is:
[ [["1" "32" "44" "55" "14"] 0] [["41" "53" "54" "5" "24"] 2] ]
Thanks for helping out

For starters, I would replace rows-list with a set. Let's define it as follows:
(def row-set (set rows-list))
After that you can simply filter like this:
(f/filter
  (:rdd xctx)
  (f/fn [row]
    (let [[v i] row]
      (not (contains? row-set i)))))
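The same set-membership idea can be tried at a plain REPL without Spark. This is only a sketch: indexed-rows is a hypothetical in-memory stand-in for the RDD, remove-indexed-rows is a name invented for illustration, and Clojure's filterv takes the place of f/filter.

```clojure
;; Plain-Clojure stand-in for the RDD: a vector of [row index] pairs.
(def indexed-rows
  [[["1" "32" "44" "55" "14"] 0]
   [["21" "23" "24" "25" "24"] 1]
   [["41" "53" "54" "5" "24"] 2]
   [["11" "35" "34" "15" "64"] 3]])

(defn remove-indexed-rows
  "Hypothetical helper: keeps only the [row index] pairs whose index
  is NOT in rows-list."
  [rows rows-list]
  (let [row-set (set rows-list)]            ; set gives constant-time lookup
    (filterv (fn [[_ i]] (not (contains? row-set i))) rows)))

(remove-indexed-rows indexed-rows [1 3])
;; => [[["1" "32" "44" "55" "14"] 0] [["41" "53" "54" "5" "24"] 2]]
```

With flambo, the predicate body would stay the same; only the collection and the filter function differ.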

Related

looking for a split-on function

I'm looking for a function with the following behavior
(split-on "" '("" "test" "one" "" "two"))
(() ("test" "one") ("two"))
I can't find it in 'core', and I'm not sure how to look it up. Suggestions?
Edit:
split-when looks promising, but I think I am using it wrong.
(t/split-when #(= "" %) '("" "test" "one" "" "two"))
[["" "test" "one" "" "two"] ()]
whereas I am looking for the return value of
[[] ["test" "one"] ["two"]]

partition-by is close. You can partition the sequence by whether each member is equal to the split token:
(partition-by #(= "" %) '("" "test" "one" "" "two"))
(("") ("test" "one") ("") ("two"))
This leaves extra separators in there, though that's easy enough to remove:
(remove #(= '("") %)
        (partition-by empty? ["" "test" "one" "" "two"]))
(("test" "one") ("two"))
If you want to get fancy about it and make a transducer out of that, you can define one like so:
(def split-on
  (comp
    (partition-by #(= "" %))
    (remove #(= '("") %))))
(into [] split-on ["" "test" "one" "" "two"])
[["test" "one"] ["two"]]
This does it in "one pass" without building intermediate sequences.
To make that into a normal function (if you don't want a transducer):
(defn split-on [coll]
  (into [] (comp
             (partition-by #(= "" %))
             (remove #(= '("") %)))
        coll))
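Note that the partition-by approach drops the empty leading group that the question asked for. A minimal sketch that keeps empty groups, assuming string-split-like semantics where every separator opens a new group; split-on-keep-empty is a hypothetical name, not a core or library function:

```clojure
(defn split-on-keep-empty
  "Hypothetical helper: splits coll at every element equal to sep.
  Each separator starts a new group, so leading or adjacent
  separators produce empty groups."
  [sep coll]
  (reduce (fn [groups x]
            (if (= sep x)
              (conj groups [])                               ; open a new group
              (conj (pop groups) (conj (peek groups) x))))   ; extend the last group
          [[]]                                               ; start with one empty group
          coll))

(split-on-keep-empty "" ["" "test" "one" "" "two"])
;; => [[] ["test" "one"] ["two"]]
```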

I was looking for exactly this function recently and had to create it myself. It is available in the Tupelo library. You can see the API docs here: http://cloojure.github.io/doc/tupelo/tupelo.core.html#var-split-when
(split-when pred coll)
Splits a collection based on a predicate with a collection
argument. Finds the first index N such that (pred (drop N coll))
is true. Returns a length-2 vector of [ (take N coll) (drop N coll) ].
If pred is never satisfied, [ coll [] ] is returned.
The unit tests show the function in action (admittedly boring test data):
(deftest t-split-when
  (is= [ [] [0 1 2 3 4] ] (split-when #(= 0 (first %)) (range 5)))
  (is= [ [0] [1 2 3 4] ] (split-when #(= 1 (first %)) (range 5)))
  (is= [ [0 1] [2 3 4] ] (split-when #(= 2 (first %)) (range 5)))
  (is= [ [0 1 2] [3 4] ] (split-when #(= 3 (first %)) (range 5)))
  (is= [ [0 1 2 3] [4] ] (split-when #(= 4 (first %)) (range 5)))
  (is= [ [0 1 2 3 4] [] ] (split-when #(= 5 (first %)) (range 5)))
  (is= [ [0 1 2 3 4] [] ] (split-when #(= 9 (first %)) (range 5))))
You can also read the source if you are interested.

Untuple a Clojure sequence

I have a function that deduplicates rows with a preference, and I thought of implementing it in Clojure with flambo functions as follows:
From the data set, use group-by to group duplicates (i.e. based on a specified :key)
Given a :val as input, use a filter to check whether some of the values in each row are equal to this :val
Use a map to untuple the duplicates and return single vectors (I'm not sure that is the right way, though; I tried flat-map without any luck)
For a sample data-set
(def rdd
  (f/parallelize sc [["Coke" "16" ""] ["Pepsi" "" "5"] ["Coke" "2" "3"] ["Coke" "" "36"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]))
I tried this:
(defn dedup-rows
  [rows input]
  (let [{:keys [key-col col val]} input
        result (-> rows
                   (f/group-by (f/fn [row]
                                 (get row key-col)))
                   (f/values)
                   (f/map (f/fn [rows]
                            (if (= (count rows) 1)
                              rows
                              (filter (fn [row]
                                        (let [col-val (get row col)
                                              equal? (= col-val val)]
                                          (if (not equal?)
                                            true
                                            false)))
                                      rows)))))]
    result))
if I run this function thus:
(dedup-rows rdd {:key-col 0 :col 1 :val ""})
it produces
;=> [(["Pepsi" 25 34]), (["Coke" 16 ] ["Coke" 2 3])]
I don't know what else to do to handle the result to produce a result of
;=> [["Pepsi" 25 34],["Coke" 16 ],["Coke" 2 3]]
I tried (f/map f/untuple) as the last form in the -> macro, with no luck.
Any suggestions? I would really appreciate it if there's another way to go about this.
Thanks.
PS: when grouped
;=> [[["Pepsi" "" 5], ["Pepsi" "" 34], ["Pepsi" 25 34]], [["Coke" 16 ""], ["Coke" 2 3], ["Coke" "" 36]]]
For each group, rows that have "" in the specified column are considered duplicates and hence removed from the group.

Looking at the flambo readme, there is a flat-map function. The naming is slightly unfortunate because the Clojure equivalent is called mapcat. These functions take each map result, which must be a sequence, and concatenate the results together. Another way to think about it is that they flatten the final sequence by one level.
I can't test this, but I think you should replace your f/map with f/flat-map.
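The difference between map and mapcat can be seen on plain Clojure data. This is a sketch only: groups is a hand-written stand-in for the grouped RDD values, and keep-rows is a hypothetical helper mirroring the filtering step from the question.

```clojure
;; Stand-in for the grouped values: one sequence of rows per key.
(def groups
  [[["Pepsi" "" "5"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]
   [["Coke" "16" ""] ["Coke" "2" "3"] ["Coke" "" "36"]]])

(defn keep-rows
  "Hypothetical helper: drops rows whose value at index col equals val.
  Groups with a single row are returned unchanged."
  [col val rows]
  (if (= (count rows) 1)
    rows
    (remove #(= (get % col) val) rows)))

;; map keeps one nested sequence per group:
(map (partial keep-rows 1 "") groups)
;; => ((["Pepsi" "25" "34"]) (["Coke" "16" ""] ["Coke" "2" "3"]))

;; mapcat concatenates the per-group results, removing one level of nesting:
(mapcat (partial keep-rows 1 "") groups)
;; => (["Pepsi" "25" "34"] ["Coke" "16" ""] ["Coke" "2" "3"])
```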

Going by @TheQuickBrownFox's suggestion, I tried the following:
(defn dedup-rows
  [rows input]
  (let [{:keys [key-col col val]} input
        result (-> rows
                   (f/group-by (f/fn [row]
                                 (get row key-col)))
                   (f/values)
                   (f/map (f/fn [rows]
                            (if (= (count rows) 1)
                              rows
                              (filter (fn [row]
                                        (let [col-val (get row col)
                                              equal? (= col-val val)]
                                          (if (not equal?)
                                            true
                                            false)))
                                      rows))))
                   (f/flat-map (f/fn [row]
                                 (mapcat vector row))))]
    result))
and it seems to work.

Clojure - apply to all but nth element

I have a vector that looks like:
[ "1" "2" "3" "4" ]
I wish to write a function that turns the vector into:
[ 1 "2" 3 4 ]
; Note that the second element is still a string
Note that nothing is changed in place; an entirely new vector is returned. What is the simplest way to do this in Clojure?

map-indexed is a decent choice. It calls the function you pass with the index of each item (index first) and the item's value from your input. That function can choose to produce a new value or return the existing one.
user> (map-indexed (fn [i v]
                     (if-not (= 1 i)
                       (Integer/parseInt v)
                       v))
                   [ "1" "2" "3" "4"])
(1 "2" 3 4)
When the if-not returns v, it is the exact same value in the resulting sequence, so you keep the benefits of structural sharing in the parts you choose to keep. If you want the output to be kept as a vector, you can use mapv and pass the index sequence yourself.
user> (mapv (fn [i v]
              (if-not (= 1 i)
                (Integer/parseInt v)
                v))
            (range)
            [ "1" "2" "3" "4"])
[1 "2" 3 4]
There are many ways to write this.
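One such variation, sketched here under the assumption that you may want to skip more than one position: mapv-except is a hypothetical helper that takes a set of indexes to leave untouched and always returns a vector.

```clojure
(defn mapv-except
  "Hypothetical helper: applies f to every element of coll except those
  whose zero-based index is in the set skip. Returns a vector."
  [f skip coll]
  (vec (map-indexed (fn [i x]
                      (if (contains? skip i)
                        x          ; leave skipped positions as they are
                        (f x)))
                    coll)))

(mapv-except #(Integer/parseInt %) #{1} ["1" "2" "3" "4"])
;; => [1 "2" 3 4]
```

Because f is never called on the skipped positions, those elements do not need to be valid input for f.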

Here is how I would do it. Note that the index is zero-based:
(defn map-not-nth
  "Transform all elements of coll except the one corresponding to idx (zero-based)."
  [func coll idx]
  {:pre  [(< -1 idx (count coll))]                 ; idx must be a valid index
   :post [(= (count %) (count coll))
          (= (nth coll idx) (nth % idx))]}
  (let [coll-tx (map func coll)                    ; transform all data
        result  (concat (take idx coll-tx)         ; [0..idx-1]
                        [(nth coll idx)]           ; idx
                        (drop (inc idx) coll-tx))] ; [idx+1..N-1]
    result))
(def xx [ 0 1 2 3 4 ] )
(prn (map-not-nth str xx 0))
(prn (map-not-nth str xx 1))
(prn (map-not-nth str xx 2))
(prn (map-not-nth str xx 3))
(prn (map-not-nth str xx 4))
Result is:
user=> (prn (map-not-nth str xx 0))
(0 "1" "2" "3" "4")
user=> (prn (map-not-nth str xx 1))
("0" 1 "2" "3" "4")
user=> (prn (map-not-nth str xx 2))
("0" "1" 2 "3" "4")
user=> (prn (map-not-nth str xx 3))
("0" "1" "2" 3 "4")
user=> (prn (map-not-nth str xx 4))
("0" "1" "2" "3" 4)

Clojure: not the whole collection after conversion to hash map

I am exploring the exciting world of Clojure, but I am stopped on this...
I have two vectors, different in length, stored in vars.
(def lst1 ["name" "surname" "age"])
(def lst2 ["Jimi" "Hendrix" "28" "Sam" "Cooke" "33" "Buddy" "Holly" "23"])
I want to interleave them and obtain a map, with keys from first list and values from the second, like the following one:
{"name" "Jimi" , "surname" "Hendrix" , "age" "28" ,
"name" "Sam" , "surname" "Cooke" , "age" "33" ... }
Even the following solution, with proper keywords as keys, would be OK:
{:name "Jimi" , :surname "Hendrix" , :age "28" , ... }
I can use the interleave-all function from the Medley library and then apply the hash-map fn:
(apply hash-map
       (vec (interleave-all (flatten (repeat 3 lst1)) lst2)))
=> {"age" "23", "name" "Buddy", "surname" "Holly"}
but this returns just the last musician. This persistent hash map is not ordered, but that is not the point.
I later tried to pair keys and values, maybe for a possible future use of assoc, who knows...
(map vector
     (for [numMusicians (range 0 3), keys (range 0 3)] (-> lst1 (nth keys) (keyword)))
     (for [values (range 0 9)] (-> lst2 (nth values) (str))))
This returns a lazy sequence of paired vectors with proper keywords.
=> ([:name "Jimi"] [:surname "Hendrix"] [:age "28"] [:name "Sam"] ...)
Now I want to try into, which according to its docstring should return
a new coll consisting of to-coll with ALL of the items of from-coll
conjoined.
(into {}
      (map vector
           (for [numMusicians (range 0 3), keys (range 0 3)] (-> lst1 (nth keys) (keyword)))
           (for [values (range 0 9)] (-> lst2 (nth values) (str)))))
But again:
=> {:name "Buddy", :surname "Holly", :age "23"}
just the last musician, this time in a persistent array map.
I want a map with all my dead musicians. Does anyone know where I am going wrong?
Edit:
Thank you guys! Have managed the fn this way:
(use 'clojure.set)
(->> (partition 3 lst2) (map #(zipmap % lst1)) (map map-invert))
=> ({"name" "Jimi", "surname" "Hendrix", "age" "28"} {"name" "Sam", "surname" "Cooke", "age" "33"} {"name" "Buddy", "surname" "Holly", "age" "23"})

Each key can only exist once in a map, so your later values overwrite the earlier ones.
To get a list of maps, one per artist, you could do something like:
(def lst1 ["name" "surname" "age"])
(def lst2 ["Jimi" "Hendrix" "28" "Sam" "Cooke" "33" "Buddy" "Holly" "23"])
(->> (partition 3 lst2)                                 ; Split out the separate people
     (map (fn [artist-seq] (zipmap lst1 artist-seq))))  ; Use zipmap to connect the keys and values.
This should work for any number of people, as long as all the values are there and in the right order.
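The question also mentioned that keyword keys would be acceptable. A small sketch of that variant, assuming the same lst1 and lst2; rows->maps is a hypothetical name, and the partition size is taken from the number of keys rather than hard-coded:

```clojure
(def lst1 ["name" "surname" "age"])
(def lst2 ["Jimi" "Hendrix" "28" "Sam" "Cooke" "33" "Buddy" "Holly" "23"])

(defn rows->maps
  "Hypothetical helper: turns a flat vector of values into one map per
  record, using the strings in ks (converted to keywords) as keys."
  [ks vs]
  (let [kws (map keyword ks)]
    (mapv #(zipmap kws %) (partition (count ks) vs))))

(count (rows->maps lst1 lst2))
;; => 3
(= (first (rows->maps lst1 lst2))
   {:name "Jimi" :surname "Hendrix" :age "28"})
;; => true
```

Note that partition silently drops a trailing incomplete record, so this relies on the value vector being complete.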

You say you want the following form:
{"name" "Jimi" , "surname" "Hendrix" , "age" "28" ,
"name" "Sam" , "surname" "Cooke" , "age" "33" ... }
This is not allowed because the keys collide. You can't add "name" as a key several times; keys must be unique in a map.
But you can construct a list of maps with the following code:
user=> (->> (map (fn [ks vs] (interleave ks vs)) (repeat 3 lst1) (partition 3 lst2))
            (map #(apply hash-map %)))
({"age" "28", "name" "Jimi", "surname" "Hendrix"} {"age" "33", "name" "Sam", "surname" "Cooke"} {"age" "23", "name" "Buddy", "surname" "Holly"})
UPDATE
@status203's solution, which uses zipmap, looks much better.
user=> (->> (partition 3 lst2)
            (map #(zipmap lst1 %)))
({"age" "28", "surname" "Hendrix", "name" "Jimi"} {"age" "33", "surname" "Cooke", "name" "Sam"} {"age" "23", "surname" "Holly", "name" "Buddy"})

Compact Clojure code for regular expression matches and their position in string

Stuart Halloway gives the example
(re-seq #"\w+" "The quick brown fox")
as the natural method for finding regex matches in Clojure. In his book, this construction is contrasted with iteration over a matcher. If all one cared about were a list of matches, this would be great. However, what if I wanted the matches and their positions within the string? Is there a better way of doing this that allows me to leverage the existing functionality in java.util.regex without resorting to something like a sequence comprehension over each index in the original string? In other words, one would like to type something like
(re-seq-map #"[0-9]+" "3a1b2c1d")
which would return a map with keys as the position and values as the matches, e.g.
{0 "3", 2 "1", 4 "2", 6 "1"}
Is there some implementation of this in an existing library already, or shall I write it (it shouldn't be too many lines of code)?

You can fetch the data you want out of a java.util.regex.Matcher object.
user> (defn re-pos [re s]
        (loop [m (re-matcher re s)
               res {}]
          (if (.find m)
            (recur m (assoc res (.start m) (.group m)))
            res)))
#'user/re-pos
user> (re-pos #"\w+" "The quick brown fox")
{16 "fox", 10 "brown", 4 "quick", 0 "The"}
user> (re-pos #"[0-9]+" "3a1b2c1d")
{6 "1", 4 "2", 2 "1", 0 "3"}
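The maps printed above are not in position order, since an ordinary Clojure map makes no ordering promise. If position order matters, one option is to accumulate into a sorted map instead. A minimal sketch, where re-pos-sorted is a hypothetical variant of re-pos:

```clojure
(defn re-pos-sorted
  "Hypothetical variant of re-pos: returns a sorted map from the start
  position of each match to the matched text."
  [re s]
  (let [m (re-matcher re s)]          ; stateful java.util.regex.Matcher
    (loop [res (sorted-map)]
      (if (.find m)                   ; advance to the next match, if any
        (recur (assoc res (.start m) (.group m)))
        res))))

(re-pos-sorted #"\w+" "The quick brown fox")
;; => {0 "The", 4 "quick", 10 "brown", 16 "fox"}
```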

You can apply any function to the java.util.regex.Matcher object and return its results (similar to Brian's solution, but without an explicit loop):
user=> (defn re-fun
         [re s fun]
         (let [matcher (re-matcher re s)]
           (take-while some? (repeatedly #(if (.find matcher) (fun matcher) nil)))))
#'user/re-fun
user=> (defn fun1 [m] (vector (.start m) (.end m)))
#'user/fun1
user=> (re-fun #"[0-9]+" "3a1b2c1d" fun1)
([0 1] [2 3] [4 5] [6 7])
user=> (defn re-seq-map
         [re s]
         (into {} (re-fun re s #(vector (.start %) (.group %)))))
user=> (re-seq-map #"[0-9]+" "3a1b2c1d")
{0 "3", 2 "1", 4 "2", 6 "1"}
