When using update-in we need to provide the full path to an element. But what if I want to update ALL elements whose second-level key is :MaxInclude?
e.g. the input is
(def a {:store {:type "varchar"},
        :amount {:nullable true, :length nil, :type "float", :MaxInclude "100.02"},
        :unit {:type "int"},
        :unit-uniform {:type "int" :MaxInclude "100"}})
the required output is (convert :MaxInclude from string to float/int based on their type):
{:store {:type "varchar"},
 :amount {:nullable true, :length nil, :type "float", :MaxInclude 100.02},
 :unit {:type "int"},
 :unit-uniform {:type "int" :MaxInclude 100}}
I was thinking it would be nice to have a function like update-in that matches on key predicate functions instead of exact key values. This is what I came up with:
(defn update-all
  "Like update-in except the second parameter is a vector of predicate
  functions taking keys as arguments. Updates all values contained at a
  matching path. Looks for keys in maps only."
  [m [key-pred & key-preds] update-fn]
  (if (map? m)
    (let [matching-keys (filter key-pred (keys m))
          f (fn [acc k]
              (update-in acc [k] (if key-preds
                                   #(update-all % key-preds update-fn)
                                   update-fn)))]
      (reduce f m matching-keys))
    m))
With this in place, all you need to do is:
(update-all a [= #{:MaxInclude}] read-string)
The = is used as the first key-matching function because it always returns true when passed a single argument. The second relies on the fact that a set is a function of its elements. This function uses non-optimised recursion, but the call stack will only be as deep as the number of matching map levels.
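With the example map above, the call produces the required output (a REPL sketch):

(update-all a [= #{:MaxInclude}] read-string)
;=> {:store {:type "varchar"},
;    :amount {:nullable true, :length nil, :type "float", :MaxInclude 100.02},
;    :unit {:type "int"},
;    :unit-uniform {:type "int", :MaxInclude 100}}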
(into {}
      (map (fn [[k v]]
             {k (if (contains? v :MaxInclude)
                  (update-in v [:MaxInclude] read-string)
                  v)})
           a))
Here I am mapping over the key-value pairs and destructuring each into k and v. Then I use update-in on the value if it contains :MaxInclude. Finally, I pour the pairs from a list into a hash map.
Notes:
This will throw on contains? if any of the main map's values are not associative collections.
I use read-string as a convenient way to convert the string to a number in the same way the Clojure reader would do when compiling the string that is your number literal. There may be disadvantages to this approach.
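One such disadvantage: if the strings can come from an untrusted source, clojure.edn/read-string is a safer drop-in, since clojure.core/read-string can evaluate code via the #= reader macro. A minimal sketch:

(require '[clojure.edn :as edn])
(edn/read-string "100.02") ;=> 100.02
(edn/read-string "100")    ;=> 100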
Related
I am trying to understand the program "A Vampire Data Analysis Program for the FWPD" at the end of the 4th chapter in the book "Clojure for the Brave and True". Here is the code:
(ns fwpd.core)
(def filename "suspects.csv")
(def vamp-keys [:name :glitter-index])
(defn str->int
  [str]
  (Integer. str))

(def conversions {:name identity
                  :glitter-index str->int})

(defn convert
  [vamp-key value]
  ((get conversions vamp-key) value))

(defn parse
  "Convert a CSV into rows of columns"
  [string]
  (map #(clojure.string/split % #",")
       (clojure.string/split string #"\n")))

(defn mapify
  "Return a seq of maps like {:name \"Edward Cullen\" :glitter-index 10}"
  [rows]
  (map (fn [unmapped-row]
         (reduce (fn [row-map [vamp-key value]]
                   (assoc row-map vamp-key (convert vamp-key value)))
                 {}
                 (map vector vamp-keys unmapped-row)))
       rows))

(defn glitter-filter
  [minimum-glitter records]
  (filter #(>= (:glitter-index %) minimum-glitter) records))
Can somebody help me understand the conversions map and the convert function?
conversions is a map, and as such contains key-value pairs, called map entries. get is a function that returns the value associated with a given key in a map. So for convert to do its job, vamp-key must be either :name or :glitter-index (as they are the only keys in the map). Let's assume it is :glitter-index, so that str->int is returned. Thus:
((get conversions vamp-key) value)
becomes:
(str->int value)
So vamp-key is needed to obtain the correct function to convert the value. If :glitter-index and "10" are the arguments passed into the function then 10 will be returned.
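A quick REPL check (a sketch, assuming the namespace above is loaded):

(convert :glitter-index "10")   ;=> 10
(convert :name "Edward Cullen") ;=> "Edward Cullen"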
I developed a function in Clojure to fill in an empty column from the last non-empty value. I'm assuming this works, given
(:require [flambo.api :as f])
(defn replicate-val
  [rdd input]
  (let [{:keys [col]} input
        result (reductions (fn [a b]
                             (if (empty? (nth b col))
                               (assoc b col (nth a col))
                               b))
                           rdd)]
    (println "Result type is: " (type result))))
Got this:
;=> "Result type is: clojure.lang.LazySeq"
The question is: how do I convert this back to type JavaRDD, using flambo (a Spark wrapper)?
I tried (f/map result #(.toJavaRDD %)) in the let form to attempt to convert to JavaRDD type
I got this error
"No matching method found: map for class clojure.lang.LazySeq"
which is expected because result is of type clojure.lang.LazySeq
So how do I make this conversion, or how can I refactor the code to accommodate this?
Here is a sample input rdd:
(type rdd) ;=> "org.apache.spark.api.java.JavaRDD"
But it looks like:
[["04" "2" "3"] ["04" "" "5"] ["5" "16" ""] ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]]
Required output is:
[["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""] ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"]]
Thanks.
First of all, RDDs are not iterable (they don't implement ISeq), so you cannot use reductions. Setting that aside, the whole idea of accessing the previous record is tricky: you cannot directly access values from another partition, and only transformations which don't require shuffling preserve order.
The simplest approach here would be to use DataFrames and window functions with explicit ordering, but as far as I know Flambo doesn't implement the required methods. It is always possible to use raw SQL or the Java/Scala API, but if you want to avoid that you can try the following pipeline.
First, let's create a broadcast variable with the last value per partition:
(require '[flambo.broadcast :as bd])
(import org.apache.spark.TaskContext)

(def last-per-part
  (f/fn [it]
    (let [context (TaskContext/get)
          xs (iterator-seq it)]
      [[(.partitionId context) (last xs)]])))

(def last-vals-bd
  (bd/broadcast sc
    (into {} (-> rdd (f/map-partitions last-per-part) (f/collect)))))
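For intuition, last-vals-bd holds a map from partition id to that partition's last row. Assuming the six sample rows land in two partitions of three rows each (a hypothetical split), it would contain:

(bd/value last-vals-bd)
;=> {0 ["5" "16" ""], 1 ["07" "25" "34"]}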
Next, some helpers for the actual job:
(defn fill-pair [col]
  (fn [x]
    (let [[a b] x]
      (if (empty? (nth b col))
        (assoc b col (nth a col))
        b))))

(def fill-pairs
  (f/fn [it]
    (let [part-id (.partitionId (TaskContext/get)) ;; Get partition ID
          xs (iterator-seq it) ;; Convert input to seq
          prev (if (zero? part-id) ;; Find previous element:
                 (first xs) ;; the first partition pairs with itself
                 ((bd/value last-vals-bd) (dec part-id))) ;; last row of the previous partition
          ;; Create seq of pairs (prev, current)
          pairs (partition 2 1 (cons prev xs))
          ;; Same as before; input is assumed bound as in the question, e.g. {:col 1}
          {:keys [col]} input
          ;; Prepare mapping function
          mapper (fill-pair col)]
      (map mapper pairs))))
Finally, you can use fill-pairs with map-partitions:
(-> rdd (f/map-partitions fill-pairs) (f/collect))
A hidden assumption here is that the order of the partitions follows the order of the values. That may or may not hold in the general case, but without explicit ordering it is probably the best you can get.
An alternative approach is to zipWithIndex, swap the order of values, and perform a join with an offset:
(require '[flambo.tuple :as tp])

(def rdd-idx (f/map-to-pair (.zipWithIndex rdd) #(.swap %)))

(def rdd-idx-offset
  (f/map-to-pair rdd-idx
                 (fn [t] (let [p (f/untuple t)]
                           (tp/tuple (dec' (first p)) (second p))))))

(f/map (f/values (.rightOuterJoin rdd-idx-offset rdd-idx)) f/untuple)
Next you can map using a similar approach as before.
Edit
Quick note on using atoms. The problem there is a lack of referential transparency: you're leveraging incidental properties of a given implementation, not a contract. There is nothing in the map semantics that requires elements to be processed in a given order, and if the internal implementation changes the code may no longer be valid. Consider, in Clojure:
(def a (atom 0))
(defn foo [x] (let [aa @a] (swap! a (fn [& args] x)) aa))
(map foo (range 1 20))
compared to:
(def a (atom 0))
(pmap foo (range 1 20))
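The practical difference (a sketch; exact pmap output varies per run):

;; map, when realized in order, processes elements sequentially, so foo
;; returns the previous value each time: (0 1 2 ... 18).
;; pmap runs foo on several threads, so the atom's reads and writes
;; interleave unpredictably and the result is nondeterministic.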
I have a Clojure function that uses the flambo v0.60 functions API to do some analysis on a sample data set. I noticed that when I use (get rdd 2), instead of getting the second element in the rdd collection, it's getting the second character of the first element of the rdd collection. My assumption is that Clojure is treating each row of the rdd collection as a whole string, rather than a vector from which I can get the second element. I'm thinking of using the map-values function to convert the mapped values into a vector from which I can get the second element. I tried this:
(defn split-on-tab-transformation [xctx input]
  (assoc xctx :rdd (-> (:rdd xctx)
                       (spark/map (spark/fn [row] (s/split row #"\t")))
                       (spark/map-values vec))))
Unfortunately I got an error:
java.lang.IllegalArgumentException: No matching method found: mapValues for class org.apache.spark.api.java.JavaRDD...
This code returns the first collection in the rdd (assuming I removed the (spark/map-values vec) from the above function):
(defn get-distinct-column-val
  "input = {:col val}"
  [xctx input]
  (let [rdds (-> (:rdd xctx)
                 (f/map (f/fn [row] row))
                 f/first)]
    (clojure.pprint/pprint rdds)))
Output:
[2.00000 770127 200939.000000 \t6094\tBENTONVILLE, AR DPS\t22.500000\t5.000000\t2.500000\t5.000000\t0.000000\t0.000000\t0.000000\t0.000000\t0.000000\t1\tStore Tab\t0.000000\t4.50\t3.83\t5.00\t0.000000\t0.000000\t0.000000\t0.000000\t19.150000]
If I try to get the second element, 770127:
(defn get-distinct-column-val
  "input = {:col val}"
  [xctx input]
  (let [rdds (-> (:rdd xctx)
                 (f/map (f/fn [row] row))
                 f/first)]
    (clojure.pprint/pprint (get rdds 1))))
I get:
[\.]
Flambo documentation for map-values
I'm new to Clojure and I'd appreciate any help. Thanks.
First of all, map-values (or mapValues in the Spark API) is a valid transformation only on a PairRDD (for example something like this: [:foo [1 2 3]]). RDDs with values like this can be interpreted as some sort of map where the first element is a key and the second is a value.
If you have an RDD like this, mapValues transforms the values without changing the key. In this case you should use a second map, although it seems redundant since clojure.string/split already returns a vector.
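A quick REPL check confirms split already yields a vector, so no extra vec step is needed:

(clojure.string/split "04\t2\t3" #"\t")           ;=> ["04" "2" "3"]
(vector? (clojure.string/split "04\t2\t3" #"\t")) ;=> true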
A simple example of using map-values:
(let [pairs [(ft/tuple :foo 1) (ft/tuple :bar 2)]
      rdd (f/parallelize-pairs sc pairs) ;; Note parallelize-pairs -> PairRDD
      result (-> rdd
                 (f/map-values inc) ;; Map values
                 (f/collect))]
  (assert (= result [(ft/tuple :foo 2) (ft/tuple :bar 3)])))
From your description it looks like you're using the original xctx, not the one returned from split-on-tab-transformation. Since Clojure maps are immutable, assoc doesn't change the passed argument, so get-distinct-column-val receives RDD[String], not RDD[Array[String]].
Based on the naming convention, I assume you want to get distinct values for a single position in an array. I removed unused parts of your code for clarity. First, let's create dummy data:
(spit "data.txt"
(str "Mazda RX4\t21\t6\t160\n"
"Mazda RX4 Wag\t21\t6\t160\n"
"Datsun 710\t22.8\t4\t108\n"))
then add rewritten versions of your functions:
(defn split-on-tab-transformation [xctx]
  (assoc xctx :rdd (-> (:rdd xctx)
                       (f/map #(clojure.string/split % #"\t")))))

(defn get-distinct-column-val
  [xctx col]
  (-> (:rdd xctx)
      (f/map #(get % col))
      (f/distinct)))
and check the result:
(assert
 (= #{"Mazda RX4 Wag" "Datsun 710" "Mazda RX4"}
    (-> {:sc sc :rdd (f/text-file sc "data.txt")}
        (split-on-tab-transformation)
        (get-distinct-column-val 0)
        (f/collect)
        (set))))
I've been learning Clojure for a few weeks now. I know the basics of the data structures and some functions. (I'm reading the Clojure Programming book).
I'm stuck with the following. I'm writing a function which will lower case the keys of the supplied map.
(defn lower-case-map [m]
  (def lm {})
  (doseq [k (keys m)]
    (assoc lm (str/lower-case k) (m k))))
This does what I want, but how do I return the map? Is the def correct?
I know this works
(defn lower-case-map [m]
  (assoc {} :a 1))
But the doseq above seems to be creating a problem.
Within a function body you should define your local variables with let; this code looks a lot like you're trying to bend Clojure into an imperative mindset (def tempvar = new Map; foreach k,v in m do tempvar[k.toLower] = v; return tempvar). Also note that the docs of doseq explicitly state that it returns nil.
The functional approach would be a map or reduce over the input, returning the result directly. E.g. a simple approach with map (iterate the sequence of entries, destructure each key/value tuple, emit a modified tuple, pour them back into a map):
user=> (into {} (map (fn [[k v]] [(.toLowerCase k) v]) {"A" 1 "B" 2}))
{"a" 1, "b" 2}
For your use-case (modifying all keys in a map) there is already a nice core function: reduce-kv:
user=> (doc reduce-kv)
-------------------------
clojure.core/reduce-kv
([f init coll])
Reduces an associative collection. f should be a function of 3
arguments. Returns the result of applying f to init, the first key
and the first value in coll, then applying f to that result and the
2nd key and value, etc. If coll contains no entries, returns init
and f is not called. Note that reduce-kv is supported on vectors,
where the keys will be the ordinals.
user=> (reduce-kv (fn [m k v] (assoc m (.toLowerCase k) v)) {} {"A" 1 "B" 2})
{"a" 1, "b" 2}
I'm new to Clojure, and I am using ring.velocity to develop a webapp.
Here is my ring.velocity.core/render method:
(defn render
  [tname & kvs]
  "Render a template to string with vars:
  (render :name \"dennis\" :age 29)
  :name and :age are the variables in template."
  (let [kvs (apply hash-map kvs)]
    (render-template *velocity-render tname kvs)))
For this simple example, it works fine:
(velocity/render "test.vm" :name "nile")
But sometimes we can't hard-code the key-value pairs. A common way:
(defn get-data [] {:key "value"}) ;; define a fn get-data dynamically
(velocity/render "test.vm" (get-data)) ;; this goes wrong, because the render fn calls (apply hash-map kvs)
Has the error:
No value supplied for key: ....
It looks like it is treated as if it were a single value. I've changed the type to [], {}, and (), but each of these fails.
My question is: what does & kvs in Clojure mean? How can I dynamically create it and pass it to the method?
Added a simple test:
(defn params-test [a & kvls]
  (println (apply hash-map kvls)))

(defn get-data []
  [:a "A"])

(defn test []
  (params-test (get-data)))
Result
No value supplied for key:((:a "A"))
The problem here is that you're trying to create a hash-map from a single collection argument instead of a sequence of arguments.
Use
(apply hash-map kvls)
instead of
(hash-map kvls)
In your original question, you can try to use apply with partial:
(apply (partial velocity/render "test.vm") (get-data))
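& kvs collects any remaining arguments into a seq, and apply unrolls a collection back into separate arguments. So with (get-data) returning a flat vector like [:a "A"], the call above is equivalent to (a sketch):

(velocity/render "test.vm" :a "A")
;; inside render, kvs is (:a "A") and (apply hash-map kvs) => {:a "A"}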