I have a Clojure function that uses the flambo v0.6.0 functions API to do some analysis on a sample data set. I noticed that when I use (get rdd 2), instead of getting the second element in the RDD collection, it gets the second character of the first element of the RDD collection. My assumption is that Clojure is treating each row of the RDD collection as a whole string rather than a vector, so I can't get the second element of the row. I'm thinking of using the map-values function to convert the mapped values into vectors, from which I can then get the second element. I tried this:
(defn split-on-tab-transformation [xctx input]
(assoc xctx :rdd (-> (:rdd xctx)
(spark/map (spark/fn [row] (s/split row #"\t")))
(spark/map-values vec))))
Unfortunately I got an error:
java.lang.IllegalArgumentException: No matching method found: mapValues for class org.apache.spark.api.java.JavaRDD...
This code returns the first element of the RDD (assuming I removed the (spark/map-values vec) step from the function above):
(defn get-distinct-column-val
"input = {:col val}"
[ xctx input ]
(let [rdds (-> (:rdd xctx)
(f/map (f/fn [row] row))
f/first)]
(clojure.pprint/pprint rdds)))
Output:
[2.00000 770127 200939.000000 \t6094\tBENTONVILLE, AR DPS\t22.500000\t5.000000\t2.500000\t5.000000\t0.000000\t0.000000\t0.000000\t0.000000\t0.000000\t1\tStore Tab\t0.000000\t4.50\t3.83\t5.00\t0.000000\t0.000000\t0.000000\t0.000000\t19.150000]
If I try to get the second element, 770127:
(defn get-distinct-column-val
"input = {:col val}"
[ xctx input ]
(let [rdds (-> (:rdd xctx)
(f/map (f/fn [row] row))
f/first)]
(clojure.pprint/pprint (get rdds 1))))
I get:
[\.]
Flambo documentation for map-values
I'm new to Clojure and I'd appreciate any help. Thanks.
First of all, map-values (or mapValues in the Spark API) is a valid transformation only on a PairRDD (for example something like [:foo [1 2 3]]). RDDs with values like this can be interpreted as some sort of map where the first element is a key and the second is a value.
If you have an RDD like that, mapValues transforms the values without changing the key. In your case you should use a second map, although it seems redundant, since clojure.string/split already returns a vector.
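You can verify at the REPL that split already returns a vector:
(clojure.string/split "foo\tbar\tbaz" #"\t")          ; => ["foo" "bar" "baz"]
(vector? (clojure.string/split "foo\tbar" #"\t"))     ; => true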
A simple example of using map-values (assuming flambo.tuple is required as ft):
(let [pairs [(ft/tuple :foo 1) (ft/tuple :bar 2)]
rdd (f/parallelize-pairs sc pairs) ;; Note parallelize-pairs -> PairRDD
result (-> rdd
(f/map-values inc) ;; Map values
(f/collect))]
(assert (= result [(ft/tuple :foo 2) (ft/tuple :bar 3)])))
From your description it looks like you're using the input RDD instead of the one returned from split-on-tab-transformation. If I had to guess, you're passing the original xctx. Since Clojure maps are immutable, assoc doesn't change its argument, so get-distinct-column-val receives RDD[String], not RDD[Array[String]].
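A quick REPL illustration of assoc's immutability (with a hypothetical xctx):
(def xctx {:rdd "original-rdd"})
(assoc xctx :rdd "split-rdd") ; => {:rdd "split-rdd"} (a new map)
xctx                          ; => {:rdd "original-rdd"} (unchanged)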
Based on the naming convention, I assume you want to get distinct values for a single position in an array. I removed unused parts of your code for clarity. First, let's create some dummy data:
(spit "data.txt"
(str "Mazda RX4\t21\t6\t160\n"
"Mazda RX4 Wag\t21\t6\t160\n"
"Datsun 710\t22.8\t4\t108\n"))
then add rewritten versions of your functions:
(defn split-on-tab-transformation [xctx]
(assoc xctx :rdd (-> (:rdd xctx)
(f/map #(clojure.string/split % #"\t")))))
(defn get-distinct-column-val
[xctx col]
(-> (:rdd xctx)
(f/map #(get % col))
(f/distinct)))
and check the result:
(assert
(= #{"Mazda RX4 Wag" "Datsun 710" "Mazda RX4"}
(-> {:sc sc :rdd (f/text-file sc "data.txt")}
(split-on-tab-transformation)
(get-distinct-column-val 0)
(f/collect)
(set))))
Related
I am trying to understand the program "A Vampire Data Analysis Program for the FWPD" at the end of the 4th chapter in the book "Clojure for the Brave and True". Here is the code:
(ns fwpd.core)
(def filename "suspects.csv")
(def vamp-keys [:name :glitter-index])
(defn str->int
[str]
(Integer. str))
(def conversions {:name identity
:glitter-index str->int})
(defn convert
[vamp-key value]
((get conversions vamp-key) value))
(defn parse
"Convert a CSV into rows of columns"
[string]
(map #(clojure.string/split % #",")
(clojure.string/split string #"\n")))
(defn mapify
"Return a seq of maps like {:name \"Edward Cullen\" :glitter-index 10}"
[rows]
(map (fn [unmapped-row]
(reduce (fn [row-map [vamp-key value]]
(assoc row-map vamp-key (convert vamp-key value)))
{}
(map vector vamp-keys unmapped-row)))
rows))
(defn glitter-filter
[minimum-glitter records]
(filter #(>= (:glitter-index %) minimum-glitter) records))
Can somebody explain the conversions map and the convert function?
conversions is a map, and as such contains key-value pairs, called map entries. get is a function that returns the respective value when all you have is a key, and of course the map. So for convert to do its job, vamp-key must be either :name or :glitter-index (as they are the only keys in the map). Let's assume it is :glitter-index, so that str->int is returned. Thus:
((get conversions vamp-key) value)
becomes:
(str->int value)
So vamp-key is needed to obtain the correct function to convert the value. If :glitter-index and "10" are the arguments passed into the function then 10 will be returned.
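To make this concrete, a few REPL calls (assuming the definitions above are loaded):
(get conversions :name)          ; => the identity function
(convert :glitter-index "10")    ; => 10
(convert :name "Edward Cullen")  ; => "Edward Cullen"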
I have a function that deduplicates with preference. I thought of implementing the solution in Clojure using flambo functions thus:
From the data set, use group-by to group duplicates (i.e. based on a specified :key)
Given a :val as input, use a filter to check whether some of the values for each row are equal to this :val
Use a map to untuple the duplicates and return single vectors (not very sure that is the right way though; I tried using a flat-map without any luck)
For a sample data set:
(def rdd
(f/parallelize sc [ ["Coke" "16" ""] ["Pepsi" "" "5"] ["Coke" "2" "3"] ["Coke" "" "36"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]))
I tried this:
(defn dedup-rows
[rows input]
(let [{:keys [key-col col val]} input
result (-> rows
(f/group-by (f/fn [row]
(get row key-col)))
(f/values)
(f/map (f/fn [rows]
(if (= (count rows) 1)
rows
(filter (fn [row]
(let [col-val (get row col)
equal? (= col-val val)]
(if (not equal?)
true
false))) rows)))))]
result))
If I run this function thus:
(dedup-rows rdd {:key-col 0 :col 1 :val ""})
it produces:
;=> [(["Pepsi" 25 34]), (["Coke" 16 ] ["Coke" 2 3])]]
I don't know what else to do to handle the result so that it produces:
;=> [["Pepsi" 25 34],["Coke" 16 ],["Coke" 2 3]]
I tried f/map f/untuple as the last form in the -> macro with no luck.
Any suggestions? I will really appreciate if there's another way to go about this.
Thanks.
PS: when grouped, the data looks like this:
;=> [[["Pepsi" "" 5], ["Pepsi" "" 34], ["Pepsi" 25 34]], [["Coke" 16 ""], ["Coke" 2 3], ["Coke" "" 36]]]
For each group, rows that have "" are considered duplicates and hence removed from the group.
Looking at the flambo readme, there is a flat-map function. This is slightly unfortunate naming, because the Clojure equivalent is called mapcat. These functions take each map result, which must be a sequence, and concatenate them together. Another way to think about it is that they flatten the final sequence by one level.
I can't test this but I think you should replace your f/map with f/flat-map.
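For a quick local illustration of the difference (plain Clojure rather than flambo):
(map    #(clojure.string/split % #",") ["a,b" "c,d"]) ; => (["a" "b"] ["c" "d"])
(mapcat #(clojure.string/split % #",") ["a,b" "c,d"]) ; => ("a" "b" "c" "d")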
Going by @TheQuickBrownFox's suggestion, I tried the following:
(defn dedup-rows
[rows input]
(let [{:keys [key-col col val]} input
result (-> rows
(f/group-by (f/fn [row]
(get row key-col)))
(f/values)
(f/map (f/fn [rows]
(if (= (count rows) 1)
rows
(filter (fn [row]
(let [col-val (get row col)
equal? (= col-val val)]
(if (not equal?)
true
false))) rows))))
(f/flat-map (f/fn [row]
(mapcat vector row))))]
result))
and it seems to work.
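As a sanity check, here is a plain-Clojure sketch of the same pipeline (no flambo, same sample data); group order in the output may differ:
(let [rows [["Coke" "16" ""] ["Pepsi" "" "5"] ["Coke" "2" "3"]
            ["Coke" "" "36"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]
      groups (vals (group-by #(get % 0) rows))]
  (mapcat (fn [grp]
            (if (= (count grp) 1)
              grp
              (remove #(= (get % 1) "") grp))) ; drop rows with "" in col 1
          groups))
;; => (["Coke" "16" ""] ["Coke" "2" "3"] ["Pepsi" "25" "34"])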
I developed a function in Clojure to fill in an empty column from the last non-empty value. I'm assuming this works, given:
(:require [flambo.api :as f])
(defn replicate-val
  [rdd input]
  (let [{:keys [col]} input
        result (reductions (fn [a b]
                             (if (empty? (nth b col))
                               (assoc b col (nth a col))
                               b)) rdd)]
    (println "Result type is: " (type result))))
Got this:
;=> "Result type is: clojure.lang.LazySeq"
The question is how do I convert this back to type JavaRDD, using flambo (a Spark wrapper)?
I tried (f/map result #(.toJavaRDD %)) in the let form to attempt the conversion, and got this error:
"No matching method found: map for class clojure.lang.LazySeq"
which is expected, because result is of type clojure.lang.LazySeq. So how do I make this conversion, or how can I refactor the code to accommodate it?
Here is a sample input rdd:
(type rdd) ;=> "org.apache.spark.api.java.JavaRDD"
but it looks like this:
[["04" "2" "3"] ["04" "" "5"] ["5" "16" ""] ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]]
Required output is:
[["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""] ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"]]
Thanks.
First of all, RDDs are not iterable (they don't implement ISeq), so you cannot use reductions. Leaving that aside, the whole idea of accessing the previous record is rather tricky: you cannot directly access values from another partition, and only transformations which don't require shuffling preserve order.
The simplest approach here would be to use DataFrames and window functions with an explicit order, but as far as I know Flambo doesn't implement the required methods. It is always possible to use raw SQL or the Java/Scala API, but if you want to avoid that you can try the following pipeline.
First, let's create a broadcast variable with the last values per partition:
(require '[flambo.broadcast :as bd])
(import org.apache.spark.TaskContext)
(def last-per-part (f/fn [it]
(let [context (TaskContext/get) xs (iterator-seq it)]
[[(.partitionId context) (last xs)]])))
(def last-vals-bd
(bd/broadcast sc
(into {} (-> rdd (f/map-partitions last-per-part) (f/collect)))))
Next, some helpers for the actual job:
(defn fill-pair [col]
(fn [x] (let [[a b] x] (if (empty? (nth b col)) (assoc b col (nth a col)) b))))
(def fill-pairs
(f/fn [it] (let [part-id (.partitionId (TaskContext/get)) ;; Get partition ID
xs (iterator-seq it) ;; Convert input to seq
prev (if (zero? part-id) ;; Find previous element
(first xs) ((bd/value last-vals-bd) part-id))
;; Create seq of pairs (prev, current)
pairs (partition 2 1 (cons prev xs))
;; Same as before; note that input (e.g. {:col 1}) is assumed to be bound in the enclosing scope
{:keys [col]} input
;; Prepare mapping function
mapper (fill-pair col)]
(map mapper pairs))))
Finally you can use fill-pairs to map-partitions:
(-> rdd (f/map-partitions fill-pairs) (f/collect))
A hidden assumption here is that the order of the partitions follows the order of the values. It may or may not hold in the general case, but without explicit ordering it is probably the best you can get.
An alternative approach is to zipWithIndex, swap the order of values, and perform a join with an offset:
(require '[flambo.tuple :as tp])
(def rdd-idx (f/map-to-pair (.zipWithIndex rdd) #(.swap %)))
(def rdd-idx-offset
(f/map-to-pair rdd-idx
(fn [t] (let [p (f/untuple t)] (tp/tuple (dec' (first p)) (second p))))))
(f/map (f/values (.rightOuterJoin rdd-idx-offset rdd-idx)) f/untuple)
Next you can map using a similar approach as before.
Edit
A quick note on using atoms. The problem there is the lack of referential transparency: you're leveraging incidental properties of a given implementation, not a contract. There is nothing in the semantics of map that requires elements to be processed in a given order, so if the internal implementation changes, your code may no longer be valid. Compare, using Clojure:
(def a (atom 0))
(defn foo [x] (let [aa @a] (swap! a (fn [& args] x)) aa))
(map foo (range 1 20))
compared to:
(def a (atom 0))
(pmap foo (range 1 20))
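With map, the current implementation happens to process elements sequentially, so each call to foo observes the value written by the previous call; with pmap the swaps run concurrently and the observed values become nondeterministic. Neither ordering is guaranteed by the contract of map, which is exactly the point.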
My question is how can I rewrite the following reduce-based solution using map and possibly doseq? I've been having a lot of trouble with it.
The solution addresses the following problem. I have two CSV files parsed by clojure-csv; call the resulting vectors of vectors bene-data and gic-data. I want to take the value in a column of each row of bene-data and see if that value appears in another column of one row in gic-data. I want to accumulate the bene-data values not found in gic-data into a vector. I originally tried to accumulate into a map, and that set off the stack overflow when I tried to debug print. Eventually, I want to take this data, combine it with some static text, and spit it into a report file.
The following functions:
(defn is-a-in-b
"This is a helper function that takes a value, a column index, and a
returned clojure-csv row (vector), and checks to see if that value
is present. Returns value or nil if not present."
[cmp-val col-idx csv-row]
(let [csv-row-val (nth csv-row col-idx nil)]
(if (= cmp-val csv-row-val)
cmp-val
nil)))
(defn key-pres?
"Accepts a value, like an index, and output from clojure-csv, and looks
to see if the value is in the sequence at the index. Given clojure-csv
returns a vector of vectors, will loop around until and if the value
is found."
[cmp-val cmp-idx csv-data]
(reduce
(fn [ret-rc csv-row]
(let [temp-rc (is-a-in-b cmp-val cmp-idx csv-row)]
(if-not temp-rc
(conj ret-rc cmp-val))))
[]
csv-data))
(defn test-key-inclusion
"Accepts csv-data param and an index, a second csv-data param and an index,
and searches the second csv-data instances' rows (at index) to see if
the first file's data is located in the second csv-data instance."
[csv-data1 pkey-idx1 csv-data2 pkey-idx2 lnam-idx fnam-idx]
(reduce
(fn [out-log csv-row1]
(let [cmp-val (nth csv-row1 pkey-idx1 nil)
lnam (nth csv-row1 lnam-idx nil)
fnam (nth csv-row1 fnam-idx)
temp-rc (first (key-pres? cmp-val pkey-idx2 csv-data2))]
(println (vector temp-rc cmp-val lnam fnam))
(into out-log (vector temp-rc cmp-val lnam fnam))))
[]
csv-data1))
represent my attempt to solve this problem. I usually run into a wall trying to use doseq and map, because I have nowhere to accumulate the resulting data unless I use loop/recur.
This solution reads all of column 2 into a set once (so it's non-lazy) for ease of writing. It should also perform better than re-scanning column 2 for each value of column 1. Adjust as needed if column 2 is too large to fit in memory.
(defn column
"extract the values of a column out of a seq-of-seqs"
[s-o-s n]
(map #(nth % n) s-o-s))
(defn test-key-inclusion
"return all values in column1 that arent' in column2"
[column1 column2]
(filter (complement (into #{} column2)) column1))
user> (def rows1 [[1 2 3] [4 5 6] [7 8 9]])
#'user/rows1
user> (def rows2 '[[a b c] [d 2 f] [g h i]])
#'user/rows2
user> (test-key-inclusion (column rows1 1) (column rows2 1))
(5 8)
I am trying to pass the (lazy) sequence returned from a map operation to another map operation, so that I can look up elements in the first sequence. The code is parsing some football fixtures from a text file (in row/column format), cleaning it up, and then returning a map.
Here is the code:
(ns fixtures.test.lazytest
(:require [clojure.string :as str])
(:use [clojure.test]))
(defn- column-map
"Produce map with column labels given raw data, return nil if not enough columns"
[cols]
(let [trimmed-cols (map str/trim cols)
column-names {0 :fixture-type, 1 :division, 2 :home-team, 4 :away-team}]
(if (> (count cols) (apply max (keys column-names)))
(zipmap (vals column-names) (map trimmed-cols (keys column-names)))
nil)))
(deftest test-mapping
(let [cols '["L" " Premier " " Chelsea " "v" "\tArsenal "]
fixture (column-map cols)]
(is (= "Arsenal" (fixture :away-team)))
(is (= "Chelsea" (fixture :home-team)))
(is (= "Premier" (fixture :division)))
(is (= "L" (fixture :fixture-type)))
)
)
(run-tests 'fixtures.test.lazytest)
The approach I'm taking is:
Clean up the vector of column data (remove leading/trailing spaces)
Using zipmap, combine column name keywords with their corresponding element in the column data (note that not all columns are used)
The problem is that using trimmed-cols in zipmap causes:
java.lang.ClassCastException: clojure.lang.LazySeq cannot be cast to clojure.lang.IFn
I think I know why this is happening: as trimmed-cols is a LazySeq, the inner map call is objecting to receiving a non-function as its first argument.
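A quick REPL check seems to confirm this; vectors implement clojure.lang.IFn, but lazy seqs don't:
((vec (map str/trim [" a " " b "])) 0) ; => "a"
((map str/trim [" a " " b "]) 0)       ; throws ClassCastException: LazySeq cannot be cast to IFn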
To fix this I can change the let to:
trimmed-cols (vec (map str/trim cols))
But this doesn't feel like the "best" option.
So:
Is there a good general solution for using the result of a map operation as the "function" argument to another map?
Is there a better approach for deriving a map of {:keyword value} pairs from a vector of raw value data, where not all of the vector elements are used?
(I hesitate to ask for an idiomatic solution, but imagine somewhere there must be a generally accepted way of doing this.)
I am not entirely sure what you're after, but I can see why your code fails - like you said, trimmed-cols is not a function. Wouldn't this simply work?
I'm using :_dummy keys for columns that should be skipped. You can use as many as you like in the column-names vector: zipmap will merge them into one, which is removed by dissoc.
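A tiny illustration of how zipmap collapses the duplicate :_dummy keys, and dissoc then drops them:
(zipmap [:a :_dummy :b :_dummy] [1 2 3 4])                  ; => {:a 1, :_dummy 4, :b 3}
(dissoc (zipmap [:a :_dummy :b :_dummy] [1 2 3 4]) :_dummy) ; => {:a 1, :b 3}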
(defn- column-map
"Produce map with column labels given raw data, return nil if not enough columns"
[cols]
(let [trimmed-cols (map str/trim cols)
column-names [:fixture-type :division :home-team :_dummy :away-team]]
(if (>= (count cols) (count column-names))
(dissoc (zipmap column-names trimmed-cols) :_dummy)
nil)))
(column-map ["L" " Premier " " Chelsea " "v" "\tArsenal "])
; => {:away-team "Arsenal", :home-team "Chelsea", :division "Premier", :fixture-type "L"}