I have a function that deduplicates with preference. I thought of implementing the solution in Clojure using flambo functions, thus:
From the data set, use group-by to group duplicates (i.e. based on a specified :key).
Given a :val as input, use a filter to check whether some of the values for each row are equal to this :val.
Use a map to untuple the duplicates and return single vectors (not sure that is the right way though; I tried using a flat-map without any luck).
For a sample data-set
(def rdd
  (f/parallelize sc [["Coke" "16" ""] ["Pepsi" "" "5"] ["Coke" "2" "3"]
                     ["Coke" "" "36"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]))
I tried this:
(defn dedup-rows
  [rows input]
  (let [{:keys [key-col col val]} input
        result (-> rows
                   (f/group-by (f/fn [row]
                                 (get row key-col)))
                   (f/values)
                   (f/map (f/fn [rows]
                            (if (= (count rows) 1)
                              rows
                              ;; keep only rows whose value at col differs from val
                              (filter (fn [row]
                                        (not= (get row col) val))
                                      rows)))))]
    result))
if I run this function thus:
(dedup-rows rdd {:key-col 0 :col 1 :val ""})
it produces
;=> [(["Pepsi" 25 34]), (["Coke" 16 ] ["Coke" 2 3])]
I don't know what else to do to handle the result to produce a result of
;=> [["Pepsi" 25 34],["Coke" 16 ],["Coke" 2 3]]
I tried f/map with f/untuple as the last form in the -> macro, with no luck.
Any suggestions? I would really appreciate it if there's another way to go about this.
Thanks.
PS: when grouped
;=> [[["Pepsi" "" 5], ["Pepsi" "" 34], ["Pepsi" 25 34]], [["Coke" 16 ""], ["Coke" 2 3], ["Coke" "" 36]]]
For each group, rows that have "" are considered duplicates and hence removed from the group.
Looking at the flambo readme, there is a flat-map function. This is slightly unfortunate naming, because the Clojure equivalent is called mapcat. These functions take each map result - which must be a sequence - and concatenate them together. Another way to think about it is that it flattens the final sequence by one level.
I can't test this but I think you should replace your f/map with f/flat-map.
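To see the one-level flattening in plain Clojure (no Spark needed), applied to data shaped like your grouped result:
(mapcat identity [[["Pepsi" 25 34]] [["Coke" 16] ["Coke" 2 3]]])
;=> (["Pepsi" 25 34] ["Coke" 16] ["Coke" 2 3])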
Going by @TheQuickBrownFox's suggestion, I tried the following:
(defn dedup-rows
  [rows input]
  (let [{:keys [key-col col val]} input
        result (-> rows
                   (f/group-by (f/fn [row]
                                 (get row key-col)))
                   (f/values)
                   (f/map (f/fn [rows]
                            (if (= (count rows) 1)
                              rows
                              (filter (fn [row]
                                        (not= (get row col) val))
                                      rows))))
                   (f/flat-map (f/fn [rows]
                                 (mapcat vector rows))))]
    result))
and it seems to work.
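As a side note (untested, assuming flambo's f/flat-map behaves like Spark's flatMap), the mapcat in the last step is probably unnecessary: each group is already a sequence of rows, so returning it unchanged lets flat-map do the one-level flattening:
(f/flat-map (f/fn [rows] rows))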
Related
Hi, I'm trying to write a function that returns the 3 most common strings:
(defn freq [s]
  (take 3 (sort-by val > (frequencies s))))
(freq ["hi" "hi" "hi" "ola" "hello" "hello" "string" "str" "ola" "hello" "hello" "str"])
I've got this so far, but I noticed that if there is more than one string with the same frequency, it won't return them all. Is there a way to filter the values of the frequencies function by their highest (eventually the top 3 highest)?
Thanks in advance.
I would propose a slightly different solution, which involves inverting the frequencies map by grouping on the value (which is the count of the items):
(->> data
     frequencies
     (group-by val))
;;{3 [["hi" 3]],
;; 2 [["ola" 2] ["str" 2]],
;; 4 [["hello" 4]],
;; 1 [["string" 1]]}
so the only thing you need to do is sort and process it:
(->> data
     frequencies
     (group-by val)
     (sort-by key >)
     (take 3)
     (mapv (fn [[k vs]] {:count k :items (mapv first vs)})))
;;[{:count 4, :items ["hello"]}
;; {:count 3, :items ["hi"]}
;; {:count 2, :items ["ola" "str"]}]
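If you only need the top-3 strings themselves (ties included) rather than count/item maps, one possible final step is to flatten the groups instead:
(->> data
     frequencies
     (group-by val)
     (sort-by key >)
     (take 3)
     (mapcat (fn [[_ vs]] (map first vs))))
;; ("hello" "hi" "ola" "str")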
frequencies gives you a map, where the keys are the original values to investigate and the values in that map are the frequency of those values. For your result you are interested in all original values that have the most occurrences, including those original values with the same occurrences.
One way would be to "invert" the frequencies result, to get a map from occurrences to all original values with that occurrence. Then you can get the highest N keys from this map and select them (by using a "sorted map" for inverting the map, we get the sorting by keys without further steps).
(defn invert-map
  ([source]
   (invert-map source {}))
  ([source target]
   (reduce (fn [m [k v]]
             (update m v (fnil conj []) k))
           target
           source)))
(assert (= {1 ["do" "re"]}
           (invert-map {"do" 1 "re" 1})))
(defn freq
  [n s]
  (let [fs (invert-map (frequencies s) (sorted-map-by >))
        top-keys (take n (keys fs))]
    (select-keys fs top-keys)))
(assert (= {4 ["hello"], 3 ["hi"], 2 ["ola" "str"]}
           (freq 3 ["hi" "hi" "hi" "ola" "hello" "hello" "string" "str" "ola" "hello" "hello" "str"])))
I have a vector of vectors that contains some strings and ints:
(def data
  [["a" "title" "b" 1]
   ["c" "title" "d" 1]
   ["e" "title" "f" 2]
   ["g" "title" "h" 1]])
I'm trying to iterate through the vector and return(?) any rows that contain a certain string, e.g. "a". I tried implementing things like this:
(defn get-row [data]
  (for [d [data]
        :when (= (get-in d [0]) "a")]
    d))
I'm quite new to Clojure, but I believe this is saying: For every element (vector) in 'data', if that vector contains "a", return it?
I know get-in needs 2 parameters, that part is where I'm unsure of what to do.
I have looked at answers like this and this but I don't really understand how they work. From what I can gather they're converting the vector to a map and doing the operations on that instead?
(filter #(some #{"a"} %) data)
It's a bit strange seeing the set #{"a"}, but it works as a predicate function for some. Adding more entries to the set acts like a logical OR for it, i.e.
(filter #(some #{"a" "c"} %) data)
=> (["a" "title" "b" 1] ["c" "title" "d" 1])
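This works because a Clojure set is itself a function of its elements: called with an argument, it returns that element if present, otherwise nil. For example:
(#{"a" "c"} "a") ;=> "a" (truthy)
(#{"a" "c"} "x") ;=> nil
so some returns the first truthy result of applying the set to each item in the row.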
OK, you have an error in your code:
(defn get-row [data]
  (for [d [data]
        :when (= (get-in d [0]) "a")]
    d))
the error is here:
(for [d [data] ...
To traverse all the elements, you shouldn't enclose data in brackets: that syntax creates a vector, so here you are traversing a vector of one element. This is how it looks to Clojure:
(for [d [[["a" "title" "b" 1]
          ["c" "title" "d" 1]
          ["e" "title" "f" 2]
          ["g" "title" "h" 1]]] ...
so, the correct variant is:
(defn get-row [data]
  (for [d data
        :when (= "a" (get-in d [0]))]
    d))
Then, you could use Clojure's destructuring for that:
(defn get-row [data]
  (for [[f & _ :as d] data
        :when (= f "a")]
    d))
but the more idiomatic Clojure way is to use higher-order functions:
(defn get-row [data]
  (filter #(= (first %) "a") data))
That covers your code; but the correct variant is in the other answers, because here you are only checking the first item.
(defn get-row [data]
  (for [d data ; <-- fix: [data] would result
               ; in one iteration with d bound to data
        :when (= (get-in d [0]) "a")]
    d))
Observe that your algorithm returns rows where the first column is "a". This can, e.g., be solved using some with a set as the predicate function to scan the entire row:
(defn get-row [data]
  (for [row data
        :when (some #{"a"} row)]
    row))
Even better than the currently selected answer, this would work:
(filter #(= "a" (% 0)) data)
The reason is that with the top answer you are searching all the indexes of the sub-vectors for your query, whereas you might only want to look in the first index of each sub-vector (in this case, search each position 0 for "a" before returning the whole sub-vector if it matches).
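Incidentally, (% 0) works because a Clojure vector is itself a function of its indices, equivalent here to (nth % 0):
(["a" "title" "b" 1] 0) ;=> "a"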
I developed a function in Clojure to fill in an empty column from the last non-empty value. I'm assuming this works, given:
(:require [flambo.api :as f])
(defn replicate-val
  [rdd input]
  (let [{:keys [col]} input
        result (reductions (fn [a b]
                             (if (empty? (nth b col))
                               (assoc b col (nth a col))
                               b))
                           rdd)]
    (println "Result type is: " (type result))))
Got this:
;=> "Result type is: clojure.lang.LazySeq"
The question is: how do I convert this back to a JavaRDD, using flambo (a Spark wrapper)?
I tried (f/map result #(.toJavaRDD %)) in the let form to attempt to convert to JavaRDD type
I got this error
"No matching method found: map for class clojure.lang.LazySeq"
which is expected because result is of type clojure.lang.LazySeq
Question is: how do I make this conversion, or how can I refactor the code to accommodate this?
Here is a sample input rdd:
(type rdd) ;=> "org.apache.spark.api.java.JavaRDD"
But it looks like:
[["04" "2" "3"] ["04" "" "5"] ["5" "16" ""] ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]]
Required output is:
[["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""] ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"]]
Thanks.
First of all, RDDs are not iterable (they don't implement ISeq), so you cannot use reductions. Ignoring that, the whole idea of accessing the previous record is rather tricky: you cannot directly access values from another partition, and only transformations which don't require shuffling preserve order.
The simplest approach here would be to use DataFrames and window functions with explicit ordering, but as far as I know Flambo doesn't implement the required methods. It is always possible to use raw SQL or to access the Java/Scala API, but if you want to avoid that you can try the following pipeline.
First let's create a broadcast variable with the last value per partition:
(require '[flambo.broadcast :as bd])
(import org.apache.spark.TaskContext)
(def last-per-part (f/fn [it]
                     (let [context (TaskContext/get) xs (iterator-seq it)]
                       [[(.partitionId context) (last xs)]])))
(def last-vals-bd
  (bd/broadcast sc
                (into {} (-> rdd (f/map-partitions last-per-part) (f/collect)))))
Next, some helpers for the actual job:
(defn fill-pair [col]
(fn [x] (let [[a b] x] (if (empty? (nth b col)) (assoc b col (nth a col)) b))))
(def fill-pairs
  (f/fn [it] (let [part-id (.partitionId (TaskContext/get)) ;; Get partition ID
                   xs (iterator-seq it) ;; Convert input to seq
                   prev (if (zero? part-id) ;; Find previous element
                          (first xs) ((bd/value last-vals-bd) part-id))
                   ;; Create seq of pairs (prev, current)
                   pairs (partition 2 1 (cons prev xs))
                   ;; Same as before (input is assumed from the enclosing scope)
                   {:keys [col]} input
                   ;; Prepare mapping function
                   mapper (fill-pair col)]
               (map mapper pairs))))
Finally you can use fill-pairs to map-partitions:
(-> rdd (f/map-partitions fill-pairs) (f/collect))
A hidden assumption here is that the order of the partitions follows the order of the values. It may or may not hold in the general case, but without explicit ordering it is probably the best you can get.
An alternative approach is to zipWithIndex, swap the order of values, and perform a join with an offset:
(require '[flambo.tuple :as tp])
(def rdd-idx (f/map-to-pair (.zipWithIndex rdd) #(.swap %)))
(def rdd-idx-offset
(f/map-to-pair rdd-idx
(fn [t] (let [p (f/untuple t)] (tp/tuple (dec' (first p)) (second p))))))
(f/map (f/values (.rightOuterJoin rdd-idx-offset rdd-idx)) f/untuple)
Next you can map using a similar approach as before.
Edit
Quick note on using atoms. The problem there is the lack of referential transparency, and that you're leveraging incidental properties of a given implementation, not a contract. There is nothing in the map semantics that requires elements to be processed in a given order. If the internal implementation changes, it may no longer be valid. Compare, in Clojure:
(def a (atom 0))
(defn foo [x] (let [aa @a] (swap! a (fn [& _] x)) aa))
(map foo (range 1 20))
compared to:
(def a (atom 0))
(pmap foo (range 1 20))
With map the reads and swaps happen in a deterministic order, so the result is reproducible; with pmap they interleave across threads, and the result can differ between runs.
I have a Clojure function that uses the flambo v0.60 API to do some analysis on a sample data set. I noticed that when I use (get rdd 2), instead of getting the second element in the rdd collection, it's getting the second character of the first element of the rdd collection. My assumption is that Clojure is treating each row of the rdd collection as a whole string, not as a vector from which I can get the second element. I'm thinking of using the map-values function to convert the mapped values into vectors from which I can get the second element. I tried this:
(defn split-on-tab-transformation [xctx input]
  (assoc xctx :rdd (-> (:rdd xctx)
                       (spark/map (spark/fn [row] (s/split row #"\t")))
                       (spark/map-values vec))))
Unfortunately I got an error:
java.lang.IllegalArgumentException: No matching method found: mapValues for class org.apache.spark.api.java.JavaRDD...
This code returns the first collection in the rdd
(assuming I removed the (spark/map-values vec) in the above function):
(defn get-distinct-column-val
  "input = {:col val}"
  [xctx input]
  (let [rdds (-> (:rdd xctx)
                 (f/map (f/fn [row] row))
                 f/first)]
    (clojure.pprint/pprint rdds)))
Output:
[2.00000 770127 200939.000000 \t6094\tBENTONVILLE, AR DPS\t22.500000\t5.000000\t2.500000\t5.000000\t0.000000\t0.000000\t0.000000\t0.000000\t0.000000\t1\tStore Tab\t0.000000\t4.50\t3.83\t5.00\t0.000000\t0.000000\t0.000000\t0.000000\t19.150000]
If I try to get the second element, 770127:
(defn get-distinct-column-val
  "input = {:col val}"
  [xctx input]
  (let [rdds (-> (:rdd xctx)
                 (f/map (f/fn [row] row))
                 f/first)]
    (clojure.pprint/pprint (get rdds 1))))
I get :
[\.]
Flambo documentation for map-values
I'm new to Clojure and I'd appreciate any help. Thanks!
First of all, map-values (or mapValues in the Spark API) is a valid transformation only on a PairRDD (for example something like this: [:foo [1 2 3]]). RDDs with values like this can be interpreted as a sort of map, where the first element is a key and the second is a value.
If you have an RDD like this, mapValues transforms the values without changing the key. In your case you should use a second map, although it seems redundant since clojure.string/split already returns a vector.
A simple example of using map-values (this assumes (require '[flambo.tuple :as ft])):
(let [pairs [(ft/tuple :foo 1) (ft/tuple :bar 2)]
      rdd (f/parallelize-pairs sc pairs) ;; Note parallelize-pairs -> PairRDD
      result (-> rdd
                 (f/map-values inc) ;; Map values
                 (f/collect))]
  (assert (= result [(ft/tuple :foo 2) (ft/tuple :bar 3)])))
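If you start from a plain RDD of two-element rows instead, one possible way to build a PairRDD first is f/map-to-pair (an untested sketch; it assumes flambo's f/fn supports vector destructuring):
(-> (f/parallelize sc [[:foo 1] [:bar 2]])
    (f/map-to-pair (f/fn [[k v]] (ft/tuple k v)))
    (f/map-values inc)
    (f/collect))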
From your description it looks like you're using the original xctx, not the one returned from split-on-tab-transformation. Since Clojure maps are immutable, assoc doesn't change the passed argument, so get-distinct-column-val receives an RDD[String], not an RDD[Array[String]].
Based on the naming convention, I assume you want to get distinct values for a single position in an array. I removed unused parts of your code for clarity. First let's create dummy data:
(spit "data.txt"
(str "Mazda RX4\t21\t6\t160\n"
"Mazda RX4 Wag\t21\t6\t160\n"
"Datsun 710\t22.8\t4\t108\n"))
then add rewritten versions of your functions:
(defn split-on-tab-transformation [xctx]
  (assoc xctx :rdd (-> (:rdd xctx)
                       (f/map #(clojure.string/split % #"\t")))))

(defn get-distinct-column-val
  [xctx col]
  (-> (:rdd xctx)
      (f/map #(get % col))
      (f/distinct)))
and the result:
(assert
 (= #{"Mazda RX4 Wag" "Datsun 710" "Mazda RX4"}
    (-> {:sc sc :rdd (f/text-file sc "data.txt")}
        (split-on-tab-transformation)
        (get-distinct-column-val 0)
        (f/collect)
        (set))))
My question is: how can I rewrite the following reduce-based solution using map and probably doseq? I've been having a lot of trouble with it.
That solution is for the following problem. I have two CSV files parsed by clojure-csv; the resulting vectors of vectors can be called bene-data and gic-data. I want to take the value in a column of each bene-data row and see if that value appears in another column of some gic-data row, accumulating the bene-data values not found in gic-data into a vector. I originally tried to accumulate into a map, and that set off the StackOverflowError when trying to debug-print. Eventually, I want to take this data, combine it with some static text, and spit it into a report file.
The following functions:
(defn is-a-in-b
  "This is a helper function that takes a value, a column index, and a
  returned clojure-csv row (vector), and checks to see if that value
  is present. Returns value or nil if not present."
  [cmp-val col-idx csv-row]
  (let [csv-row-val (nth csv-row col-idx nil)]
    (if (= cmp-val csv-row-val)
      cmp-val
      nil)))
(defn key-pres?
  "Accepts a value, like an index, and output from clojure-csv, and looks
  to see if the value is in the sequence at the index. Given clojure-csv
  returns a vector of vectors, will loop around until and if the value
  is found."
  [cmp-val cmp-idx csv-data]
  (reduce
   (fn [ret-rc csv-row]
     (let [temp-rc (is-a-in-b cmp-val cmp-idx csv-row)]
       ;; note: when temp-rc is truthy this if-not has no else branch,
       ;; so the accumulator becomes nil on a match
       (if-not temp-rc
         (conj ret-rc cmp-val))))
   []
   csv-data))
(defn test-key-inclusion
  "Accepts csv-data param and an index, a second csv-data param and an index,
  and searches the second csv-data instances' rows (at index) to see if
  the first file's data is located in the second csv-data instance."
  [csv-data1 pkey-idx1 csv-data2 pkey-idx2 lnam-idx fnam-idx]
  (reduce
   (fn [out-log csv-row1]
     (let [cmp-val (nth csv-row1 pkey-idx1 nil)
           lnam (nth csv-row1 lnam-idx nil)
           fnam (nth csv-row1 fnam-idx)
           temp-rc (first (key-pres? cmp-val pkey-idx2 csv-data2))]
       (println (vector temp-rc cmp-val lnam fnam))
       (into out-log (vector temp-rc cmp-val lnam fnam))))
   []
   csv-data1))
represent my attempt to solve this problem. I usually run into a wall trying to use doseq and map, because I have nowhere to accumulate the resulting data, unless I use loop/recur.
This solution reads all of column 2 into a set once (so it's non-lazy) for ease of writing. It should also perform better than re-scanning column 2 for each value of column 1. Adjust as needed if column 2 is too large to fit in memory.
(defn column
  "extract the values of a column out of a seq-of-seqs"
  [s-o-s n]
  (map #(nth % n) s-o-s))
(defn test-key-inclusion
  "return all values in column1 that aren't in column2"
  [column1 column2]
  (filter (complement (into #{} column2)) column1))
user> (def rows1 [[1 2 3] [4 5 6] [7 8 9]])
#'user/rows1
user> (def rows2 '[[a b c] [d 2 f] [g h i]])
#'user/rows2
user> (test-key-inclusion (column rows1 1) (column rows2 1))
(5 8)
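As a side note, remove expresses the same check without the explicit complement, since a set already acts as a membership predicate:
(defn test-key-inclusion
  "return all values in column1 that aren't in column2"
  [column1 column2]
  (remove (into #{} column2) column1))
user> (test-key-inclusion (column rows1 1) (column rows2 1))
(5 8)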