I'm new to the Clojure universe and I have a problem.
I've got a LazySeq which looks like this (longer in fact):
values = (("Brand1" "0") ("Brand2" "15") ("Brand3" "12"))
I also defined fields as:
fields = [:Brand :Sale]
I would finally like to end up with at least:
({:Brand "Brand1" :Sale "0"} {:Brand "Brand2" :Sale "15"} {:Brand "Brand3" :Sale "12"})
I tried several things (apply, interleave, reduce, into, and combinations of those) but every time I get an unexpected result.
Is that possible?
Thanks
You should use zipmap:
(map (partial zipmap fields) values)
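For the data in the question this gives roughly:
(map (partial zipmap [:Brand :Sale])
     '(("Brand1" "0") ("Brand2" "15") ("Brand3" "12")))
;=> ({:Brand "Brand1", :Sale "0"} {:Brand "Brand2", :Sale "15"} {:Brand "Brand3", :Sale "12"})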
Well, you're on the right track: you can interleave and then put everything into a map, but you have to do that for every collection in values. That means you need map:
(let [values '(("Brand1" "0") ("Brand2" "15") ("Brand3" "12"))
fields [:Brand :Sale]]
(map #(apply hash-map (interleave fields %)) values))
output:
({:Sale "0", :Brand "Brand1"}
{:Sale "15", :Brand "Brand2"}
{:Sale "12", :Brand "Brand3"})
another variant is to do it like this:
(let [values '(("Brand1" "0") ("Brand2" "15") ("Brand3" "12"))
fields [:Brand :Sale]]
(map #(into {} (map vector fields %)) values))
I have a vector of vectors that contains some strings and ints:
(def data [
["a" "title" "b" 1]
["c" "title" "d" 1]
["e" "title" "f" 2]
["g" "title" "h" 1]
])
I'm trying to iterate through the vector and return any rows that contain a certain string, e.g. "a". I tried implementing things like this:
(defn get-row [data]
(for [d [data]
:when (= (get-in d[0]) "a")] d
))
I'm quite new to Clojure, but I believe this is saying: For every element (vector) in 'data', if that vector contains "a", return it?
I know get-in needs 2 parameters, that part is where I'm unsure of what to do.
I have looked at answers like this and this but I don't really understand how they work. From what I can gather they're converting the vector to a map and doing the operations on that instead?
(filter #(some #{"a"} %) data)
It's a bit strange seeing the set #{"a"} but it works as a predicate function for some. Adding more entries to the set would be like a logical OR for it, i.e.
(filter #(some #{"a" "c"} %) data)
=> (["a" "title" "b" 1] ["c" "title" "d" 1])
OK, you have an error in your code:
(defn get-row [data]
(for [d [data]
:when (= (get-in d[0]) "a")] d
))
the error is here:
(for [d [data] ...
To traverse all the elements you shouldn't enclose data in brackets, because that syntax creates a vector. Here you are traversing a vector of one element. This is how it looks to Clojure:
(for [d [[["a" "title" "b" 1]
["c" "title" "d" 1]
["e" "title" "f" 2]
["g" "title" "h" 1]]] ...
So the correct variant is:
(defn get-row [data]
(for [d data
:when (= "a" (get-in d [0]))]
d))
You could also use Clojure's destructuring for that:
(defn get-row [data]
(for [[f & _ :as d] data
:when (= f "a")]
d))
But the more idiomatic way is to use higher-order functions:
(defn get-row [data]
(filter #(= (first %) "a") data))
That covers your code, but the correct variant is in the other answers, because this version checks only the first item.
(defn get-row [data]
(for [d data ; <-- fix: [data] would result
; in one iteration with d bound to data
:when (= (get-in d[0]) "a")]
d))
Observe that your algorithm returns rows where the first column is "a". This can be solved e.g. using some with a set as the predicate function to scan the entire row.
(defn get-row [data]
(for [row data
:when (some #{"a"} row)]
row))
Even better than the currently selected answer, this would work:
(filter #(= "a" (% 0)) data)
The reason is that the top answer searches all the indexes of the sub-vectors for your query, whereas you might only want to look at the first index of each sub-vector (in this case, checking position 0 for "a" and returning the whole sub-vector on a match).
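For example, with the sample data above, searching for "b" (which only appears in a non-first position) shows the difference:
(filter #(some #{"b"} %) data) ;=> (["a" "title" "b" 1])
(filter #(= "b" (% 0)) data)   ;=> ()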
I have a function that deduplicates with preference. I thought of implementing the solution in Clojure using flambo functions thus:
From the data set, use group-by to group duplicates (i.e. based on a specified :key)
Given a :val as input, use a filter to check whether some of the values in each row are equal to this :val
Use a map to untuple the duplicates and return single vectors (not very sure if that is the right way though; I tried using a flat-map without any luck)
For a sample data-set
(def rdd
(f/parallelize sc [ ["Coke" "16" ""] ["Pepsi" "" "5"] ["Coke" "2" "3"] ["Coke" "" "36"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]))
I tried this:
(defn dedup-rows
[rows input]
(let [{:keys [key-col col val]} input
result (-> rows
(f/group-by (f/fn [row]
(get row key-col)))
(f/values)
(f/map (f/fn [rows]
(if (= (count rows) 1)
rows
(filter (fn [row]
(let [col-val (get row col)
equal? (= col-val val)]
(if (not equal?)
true
false))) rows)))))]
result))
if I run this function thus:
(dedup-rows rdd {:key-col 0 :col 1 :val ""})
it produces
;=> [(["Pepsi" 25 34]), (["Coke" 16 ] ["Coke" 2 3])]]
I don't know what else to do to handle the result to produce:
;=> [["Pepsi" 25 34],["Coke" 16 ],["Coke" 2 3]]
I tried f/map f/untuple as the last form in the -> macro with no luck.
Any suggestions? I will really appreciate if there's another way to go about this.
Thanks.
PS: when grouped
;=> [[["Pepsi" "" 5], ["Pepsi" "" 34], ["Pepsi" 25 34]], [["Coke" 16 ""], ["Coke" 2 3], ["Coke" "" 36]]]
For each group, rows that have "" are considered duplicates and hence removed from the group.
Looking at the flambo readme, there is a flat-map function. This is slightly unfortunate naming, because the Clojure equivalent is called mapcat. These functions take each map result - which must be a sequence - and concatenate them together. Another way to think about it is that they flatten the final sequence by one level.
I can't test this but I think you should replace your f/map with f/flat-map.
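A minimal core-Clojure sketch of the difference, independent of flambo:
(map    #(repeat 2 %) [1 2]) ;=> ((1 1) (2 2))
(mapcat #(repeat 2 %) [1 2]) ;=> (1 1 2 2)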
Going by @TheQuickBrownFox's suggestion, I tried the following:
(defn dedup-rows
[rows input]
(let [{:keys [key-col col val]} input
result (-> rows
(f/group-by (f/fn [row]
(get row key-col)))
(f/values)
(f/map (f/fn [rows]
(if (= (count rows) 1)
rows
(filter (fn [row]
(let [col-val (get row col)
equal? (= col-val val)]
(if (not equal?)
true
false))) rows)))
(f/flat-map (f/fn [row]
(mapcat vector row)))))]
result))
and it seems to work.
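Collecting it should then give something close to the required flat output (ordering across groups may differ):
(f/collect (dedup-rows rdd {:key-col 0 :col 1 :val ""}))
;=> [["Pepsi" "25" "34"] ["Coke" "16" ""] ["Coke" "2" "3"]]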
I developed a function in Clojure to fill in an empty column from the last non-empty value. I'm assuming this works, given:
(:require [flambo.api :as f])
(defn replicate-val
[ rdd input ]
(let [{:keys [ col ]} input
result (reductions (fn [a b]
(if (empty? (nth b col))
(assoc b col (nth a col))
b)) rdd)]
(println "Result type is: " (type result))))
Got this:
;=> "Result type is: clojure.lang.LazySeq"
The question is: how do I convert this back to a JavaRDD, using flambo (a Spark wrapper)?
I tried (f/map result #(.toJavaRDD %)) in the let form to attempt to convert to JavaRDD type
I got this error
"No matching method found: map for class clojure.lang.LazySeq"
which is expected because result is of type clojure.lang.LazySeq
The question is how I make this conversion, or how I can refactor the code to accommodate it.
Here is a sample input rdd:
(type rdd) ;=> "org.apache.spark.api.java.JavaRDD"
But looks like:
[["04" "2" "3"] ["04" "" "5"] ["5" "16" ""] ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]]
Required output is:
[["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""] ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"]]
Thanks.
First of all, RDDs are not iterable (they don't implement ISeq), so you cannot use reductions. Ignoring that, the whole idea of accessing the previous record is rather tricky: you cannot directly access values from another partition, and only transformations which don't require shuffling preserve order.
The simplest approach here would be to use Data Frames and Window functions with an explicit order, but as far as I know Flambo doesn't implement the required methods. It is always possible to use raw SQL or access the Java/Scala API, but if you want to avoid that you can try the following pipeline.
First let's create a broadcast variable with the last values per partition:
(require '[flambo.broadcast :as bd])
(import org.apache.spark.TaskContext)
(def last-per-part (f/fn [it]
(let [context (TaskContext/get) xs (iterator-seq it)]
[[(.partitionId context) (last xs)]])))
(def last-vals-bd
(bd/broadcast sc
(into {} (-> rdd (f/map-partitions last-per-part) (f/collect)))))
Next, some helpers for the actual job:
(defn fill-pair [col]
(fn [x] (let [[a b] x] (if (empty? (nth b col)) (assoc b col (nth a col)) b))))
(def fill-pairs
(f/fn [it] (let [part-id (.partitionId (TaskContext/get)) ;; Get partion ID
xs (iterator-seq it) ;; Convert input to seq
prev (if (zero? part-id) ;; Find previous element
(first xs) ((bd/value last-vals-bd) part-id))
;; Create seq of pairs (prev, current)
pairs (partition 2 1 (cons prev xs))
;; Same as before; input is assumed bound as in the question, e.g. {:col 1}
{:keys [ col ]} input
;; Prepare mapping function
mapper (fill-pair col)]
(map mapper pairs))))
Finally you can use fill-pairs with map-partitions:
(-> rdd (f/map-partitions fill-pairs) (f/collect))
A hidden assumption here is that the order of the partitions follows the order of the values. That may or may not hold in the general case, but without explicit ordering it is probably the best you can get.
An alternative approach is to zipWithIndex, swap the order of the values, and perform a join with an offset.
(require '[flambo.tuple :as tp])
(def rdd-idx (f/map-to-pair (.zipWithIndex rdd) #(.swap %)))
(def rdd-idx-offset
(f/map-to-pair rdd-idx
(fn [t] (let [p (f/untuple t)] (tp/tuple (dec' (first p)) (second p))))))
(f/map (f/values (.rightOuterJoin rdd-idx-offset rdd-idx)) f/untuple)
Next you can map using a similar approach as before.
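I can't test this either, but the pairing-with-offset idea is easy to sketch in plain local Clojure (this is not flambo code; it uses the first sample rows from the question and assumes col = 1):
(let [xs  [["04" "2" "3"] ["04" "" "5"] ["5" "16" ""]]
      col 1
      idx (map-indexed vector xs) ;; ([0 row0] [1 row1] ...)
      ;; re-key every row by its successor's index, i.e. "previous row of i"
      prev-by-idx (into {} (map (fn [[i v]] [(inc i) v]) idx))]
  (map (fn [[i cur]]
         (let [prev (prev-by-idx i)]
           (if (and prev (empty? (nth cur col)))
             (assoc cur col (nth prev col))
             cur)))
       idx))
;=> (["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""])
Like fill-pairs above, this only consults the immediate predecessor's raw value, so runs of consecutive empty cells are not propagated the way the reductions version would propagate them.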
Edit
A quick note on using atoms: the problem there is the lack of referential transparency, and that you're leveraging incidental properties of a given implementation, not a contract. There is nothing in the map semantics that requires elements to be processed in a given order; if the internal implementation changes, it may no longer be valid. Using Clojure:
(def a (atom 0))
(defn foo [x] (let [aa @a] (swap! a (fn [& args] x)) aa))
(map foo (range 1 20))
compared to:
(def a (atom 0))
(pmap foo (range 1 20))
With map you will typically get back the previous values in order, but with pmap the swaps race across threads, so the result is nondeterministic.
I have a Clojure function that uses the flambo v0.60 functions API to do some analysis on a sample data set. I noticed that when I use (get rdd 2), instead of getting the second element in the rdd collection, it gets the second character of the first element of the rdd collection. My assumption is that Clojure is treating each row of the rdd collection as a whole string, not as a vector, so I can't get the second element in the collection. I'm thinking of using the map-values function to convert the mapped values into a vector from which I can get the second element. I tried this:
(defn split-on-tab-transformation [xctx input]
(assoc xctx :rdd (-> (:rdd xctx)
(spark/map (spark/fn [row] (s/split row #"\t")))
(spark/map-values vec))))
Unfortunately I got an error:
java.lang.IllegalArgumentException: No matching method found: mapValues for class org.apache.spark.api.java.JavaRDD...
This code returns the first collection in the rdd (assuming I removed the (spark/map-values vec) in the above function):
(defn get-distinct-column-val
"input = {:col val}"
[ xctx input ]
(let [rdds (-> (:rdd xctx)
(f/map (f/fn [row] row))
f/first)]
(clojure.pprint/pprint rdds)))
Output:
[2.00000 770127 200939.000000 \t6094\tBENTONVILLE, AR DPS\t22.500000\t5.000000\t2.500000\t5.000000\t0.000000\t0.000000\t0.000000\t0.000000\t0.000000\t1\tStore Tab\t0.000000\t4.50\t3.83\t5.00\t0.000000\t0.000000\t0.000000\t0.000000\t19.150000]
If I try to get the second element, 770127:
(defn get-distinct-column-val
"input = {:col val}"
[ xctx input ]
(let [rdds (-> (:rdd xctx)
(f/map (f/fn [row] row))
f/first)]
(clojure.pprint/pprint (get rdds 1))))
I get:
[\.]
Flambo documentation for map-values
I'm new to Clojure and I'd appreciate any help. Thanks.
First of all, map-values (or mapValues in the Spark API) is a valid transformation only on a PairRDD (for example something like this: [:foo [1 2 3]]). RDDs with values like this can be interpreted as some sort of map where the first element is a key and the second is a value.
If you have an RDD like this, mapValues transforms the values without changing the keys. In this case you should use a second map, although it seems redundant since clojure.string/split already returns a vector.
A simple example of using map-values:
(let [pairs [(ft/tuple :foo 1) (ft/tuple :bar 2)]
rdd (f/parallelize-pairs sc pairs) ;; Note parallelize-pairs -> PairRDD
result (-> rdd
(f/map-values inc) ;; Map values
(f/collect))]
(assert (= result [(ft/tuple :foo 2) (ft/tuple :bar 3)])))
From your description it looks like you're using the input RDD instead of the one returned from split-on-tab-transformation. If I had to guess, you're trying to use the original xctx, not the one returned from split-on-tab-transformation. Since Clojure maps are immutable, assoc doesn't change the passed argument, so get-distinct-column-val receives an RDD[String], not an RDD[Array[String]].
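A quick illustration of the immutability point (names here are just for the example):
(def xctx {:rdd "raw-lines"})
(assoc xctx :rdd "split-rows") ;=> {:rdd "split-rows"}, a new map
xctx                           ;=> {:rdd "raw-lines"}, the original is unchanged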
Based on the naming convention I assume you want to get the distinct values for a single position in an array. I removed unused parts of your code for clarity. First let's create some dummy data:
(spit "data.txt"
(str "Mazda RX4\t21\t6\t160\n"
"Mazda RX4 Wag\t21\t6\t160\n"
"Datsun 710\t22.8\t4\t108\n"))
then add rewritten versions of your functions:
(defn split-on-tab-transformation [xctx]
(assoc xctx :rdd (-> (:rdd xctx)
(f/map #(clojure.string/split % #"\t")))))
(defn get-distinct-column-val
[xctx col]
(-> (:rdd xctx)
(f/map #(get % col))
(f/distinct)))
and the result:
(assert
(= #{"Mazda RX4 Wag" "Datsun 710" "Mazda RX4"}
(-> {:sc sc :rdd (f/text-file sc "data.txt")}
(split-on-tab-transformation)
(get-distinct-column-val 0)
(f/collect)
(set))))
When using update-in we need to provide the full path to an element. But what if I want to update ALL elements whose second-level key is :MaxInclude?
E.g. the input is:
(def a {:store {:type "varchar"},
:amount {:nullable true, :length nil, :type "float", :MaxInclude "100.02"},
:unit {:type "int"},
:unit-uniform {:type "int" :MaxInclude "100"}
})
The required output is (convert :MaxInclude from string to float/int based on their type):
{:store {:type "varchar"},
:amount {:nullable true, :length nil, :type "float", :MaxInclude 100.02},
:unit {:type "int"},
:unit-uniform {:type "int" :MaxInclude 100}
}
I was thinking it would be nice to have a function like update-in that matches on key predicate functions instead of exact key values. This is what I came up with:
(defn update-all
"Like update-in except the second parameter is a vector of predicate
functions taking keys as arguments. Updates all values contained at a
matching path. Looks for keys in maps only."
[m [key-pred & key-preds] update-fn]
(if (map? m)
(let [matching-keys (filter key-pred (keys m))
f (fn [acc k]
(update-in acc [k] (if key-preds
#(update-all %
key-preds
update-fn)
update-fn)))]
(reduce f m matching-keys))
m))
With this in place, all you need to do is:
(update-all a [= #{:MaxInclude}] read-string)
The = is used as the first key-matching function because it always returns true when passed one argument. The second uses the fact that a set is a function. This function uses non-optimised recursion, but the call stack will only be as deep as the number of matching map levels.
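Both parts of that are easy to verify at the REPL:
(= :MaxInclude)              ;=> true, = with a single argument is always true
(#{:MaxInclude} :MaxInclude) ;=> :MaxInclude, truthy
(#{:MaxInclude} :type)       ;=> nil, falsy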
(into {}
(map (fn [[k v]]
{k (if (contains? v :MaxInclude)
(update-in v [:MaxInclude] read-string)
v)})
a))
Here I am mapping over the key-value pairs and destructuring each into k and v. Then I use update-in on the value if it contains :MaxInclude. Finally, I pour the pairs from a list into a hash map.
Notes:
This will error on contains? if any of the main map's values are not indexed collections.
I use read-string as a convenient way to convert the string to a number in the same way the Clojure reader would do when compiling the string that is your number literal. There may be disadvantages to this approach.
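If read-string feels too loose, here is a sketch of a stricter alternative that dispatches on the declared :type instead (parse-max is a hypothetical helper, not something from the answers above):
(defn parse-max
  "Convert :MaxInclude based on the row's declared :type."
  [{:keys [type] :as m}]
  (if-let [s (:MaxInclude m)]
    (assoc m :MaxInclude (case type
                           "int"   (Long/parseLong s)     ;; "100"    -> 100
                           "float" (Double/parseDouble s) ;; "100.02" -> 100.02
                           s))                            ;; leave other types as strings
    m))

(into {} (map (fn [[k v]] [k (parse-max v)]) a))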