I developed a function in clojure to fill in an empty column from the last non-empty value, I'm assuming this works, given
(:require [flambo.api :as f])
(defn replicate-val
[ rdd input ]
(let [{:keys [ col ]} input
result (reductions (fn [a b]
(if (empty? (nth b col))
(assoc b col (nth a col))
b)) rdd )]
(println "Result type is: "(type result))))
Got this:
;=> "Result type is: clojure.lang.LazySeq"
The question is how do I convert this back to type JavaRDD, using flambo (spark wrapper)
I tried (f/map result #(.toJavaRDD %)) in the let form to attempt to convert to JavaRDD type
I got this error
"No matching method found: map for class clojure.lang.LazySeq"
which is expected because result is of type clojure.lang.LazySeq
Question is how to I make this conversion, or how can I refactor the code to accomodate this.
Here is a sample input rdd:
(type rdd) ;=> "org.apache.spark.api.java.JavaRDD"
But looks like:
[["04" "2" "3"] ["04" "" "5"] ["5" "16" ""] ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]]
Required output is:
[["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""] ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"]]
Thanks.
First of all RDDs are not iterable (don't implement ISeq) so you cannot use reductions. Ignoring that a whole idea of accessing previous record is rather tricky. First of all you cannot directly access values from an another partition. Moreover only transformations which don't require shuffling preserve order.
The simplest approach here would be to use Data Frames and Window functions with explicit order but as far as I know Flambo doesn't implement required methods. It is always possible to use raw SQL or access Java/Scala API but if you want to avoid this you can try following pipeline.
First lets create a broadcast variable with last values per partition:
(require '[flambo.broadcast :as bd])
(import org.apache.spark.TaskContext)
(def last-per-part (f/fn [it]
(let [context (TaskContext/get) xs (iterator-seq it)]
[[(.partitionId context) (last xs)]])))
(def last-vals-bd
(bd/broadcast sc
(into {} (-> rdd (f/map-partitions last-per-part) (f/collect)))))
Next some helper for the actual job:
(defn fill-pair [col]
(fn [x] (let [[a b] x] (if (empty? (nth b col)) (assoc b col (nth a col)) b))))
(def fill-pairs
(f/fn [it] (let [part-id (.partitionId (TaskContext/get)) ;; Get partion ID
xs (iterator-seq it) ;; Convert input to seq
prev (if (zero? part-id) ;; Find previous element
(first xs) ((bd/value last-vals-bd) part-id))
;; Create seq of pairs (prev, current)
pairs (partition 2 1 (cons prev xs))
;; Same as before
{:keys [ col ]} input
;; Prepare mapping function
mapper (fill-pair col)]
(map mapper pairs))))
Finally you can use fill-pairs to map-partitions:
(-> rdd (f/map-partitions fill-pairs) (f/collect))
A hidden assumption here is that order of the partitions follows order of the values. It may or may not be in general case but without explicit ordering it is probably the best you can get.
Alternative approach is to zipWithIndex, swap order of values and perform join with offset.
(require '[flambo.tuple :as tp])
(def rdd-idx (f/map-to-pair (.zipWithIndex rdd) #(.swap %)))
(def rdd-idx-offset
(f/map-to-pair rdd-idx
(fn [t] (let [p (f/untuple t)] (tp/tuple (dec' (first p)) (second p))))))
(f/map (f/values (.rightOuterJoin rdd-idx-offset rdd-idx)) f/untuple)
Next you can map using similar approach as before.
Edit
Quick note on using atoms. What is the problem there is lack of referential transparency and that you're leveraging incidental properties of a given implementation not a contract. There is nothing in the map semantics that requires elements to be processed in a given order. If internal implementation changes it may be no longer valid. Using Clojure
(defn foo [x] (let [aa #a] (swap! a (fn [&args] x)) aa))
(def a (atom 0))
(map foo (range 1 20))
compared to:
(def a (atom 0))
(pmap foo (range 1 20))
Related
I am currently working on an assignment to come up with a solution for a simple "compression/ encoding" algorithm.
Objective is to compress subsequent identical letters in a string: "AABBBCCCC" --> "2A3B4C"
Although there are several (and possibly more elegant) approaches to this solution, I got stuck trying to derive a single reduce function that counts subsequent appearances of the same letter, building up the output array:
(reduce (fn [[a counter seq] b]
(if (= a b) [b (inc counter) seq] ([b 0 (conj(conj seq b )counter)] )))
[ "" 0 [] ]
"AABBBCCCC")
However, trouble starts already with my destructuring attempt of the reducer function which should be of type:
fn [[char int []] char] -> [char int []]
I could already figure out there is another solution to the problem using the identity macro. However, I still would like to get the reducer working.
Any suggestion and help is much appreciated!
the reduce variant could be something like this:
(defn process [s]
(->> (reduce (fn [[res counter a] b]
(if (= a b)
[res (inc counter) b]
[(conj res counter a) 1 b]))
[[] 0 nil]
s)
(apply conj)
(drop 2)
(apply str)))
user> (process "AABBCCCCC")
;;=> "2A2B5C"
though i would probably go with clojure's sequence functions:
(->> "AABBCCCC"
(eduction (partition-by identity)
(mapcat (juxt count first)))
(apply str))
;;=> "2A2B4C"
I am coming from a Java background trying to learn Clojure. As the best way of learning is by actually writing some code, I took a very simple example of finding even numbers in a vector. Below is the piece of code I wrote:
`
(defn even-vector-2 [input]
(def output [])
(loop [x input]
(if (not= (count x) 0)
(do
(if (= (mod (first x) 2) 0)
(do
(def output (conj output (first x)))))
(recur (rest x)))))
output)
`
This code works, but it is lame that I had to use a global symbol to make it work. The reason I had to use the global symbol is because I wanted to change the state of the symbol every time I find an even number in the vector. let doesn't allow me to change the value of the symbol. Is there a way this can be achieved without using global symbols / atoms.
The idiomatic solution is straightfoward:
(filter even? [1 2 3])
; -> (2)
For your educational purposes an implementation with loop/recur
(defn filter-even [v]
(loop [r []
[x & xs :as v] v]
(if (seq v) ;; if current v is not empty
(if (even? x)
(recur (conj r x) xs) ;; bind r to r with x, bind v to rest
(recur r xs)) ;; leave r as is
r))) ;; terminate by not calling recur, return r
The main problem with your code is you're polluting the namespace by using def. You should never really use def inside a function. If you absolutely need mutability, use an atom or similar object.
Now, for your question. If you want to do this the "hard way", just make output a part of the loop:
(defn even-vector-3 [input]
(loop [[n & rest-input] input ; Deconstruct the head from the tail
output []] ; Output is just looped with the input
(if n ; n will be nil if the list is empty
(recur rest-input
(if (= (mod n 2) 0)
(conj output n)
output)) ; Adding nothing since the number is odd
output)))
Rarely is explicit looping necessary though. This is a typical case for a fold: you want to accumulate a list that's a variable-length version of another list. This is a quick version:
(defn even-vector-4 [input]
(reduce ; Reducing the input into another list
(fn [acc n]
(if (= (rem n 2) 0)
(conj acc n)
acc))
[] ; This is the initial accumulator.
input))
Really though, you're just filtering a list. Just use the core's filter:
(filter #(= (rem % 2) 0) [1 2 3 4])
Note, filter is lazy.
Try
#(filterv even? %)
if you want to return a vector or
#(filter even? %)
if you want a lazy sequence.
If you want to combine this with more transformations, you might want to go for a transducer:
(filter even?)
If you wanted to write it using loop/recur, I'd do it like this:
(defn keep-even
"Accepts a vector of numbers, returning a vector of the even ones."
[input]
(loop [result []
unused input]
(if (empty? unused)
result
(let [curr-value (first unused)
next-result (if (is-even? curr-value)
(conj result curr-value)
result)
next-unused (rest unused) ]
(recur next-result next-unused)))))
This gets the same result as the built-in filter function.
Take a look at filter, even? and vec
check out http://cljs.info/cheatsheet/
(defn even-vector-2 [input](vec(filter even? input)))
If you want a lazy solution, filter is your friend.
Here is a non-lazy simple solution (loop/recur can be avoided if you apply always the same function without precise work) :
(defn keep-even-numbers
[coll]
(reduce
(fn [agg nb]
(if (zero? (rem nb 2)) (conj agg nb) agg))
[] coll))
If you like mutability for "fun", here is a solution with temporary mutable collection :
(defn mkeep-even-numbers
[coll]
(persistent!
(reduce
(fn [agg nb]
(if (zero? (rem nb 2)) (conj! agg nb) agg))
(transient []) coll)))
...which is slightly faster !
mod would be better than rem if you extend the odd/even definition to negative integers
You can also replace [] by the collection you want, here a vector !
In Clojure, you generally don't need to write a low-level loop with loop/recur. Here is a quick demo.
(ns tst.clj.core
(:require
[tupelo.core :as t] ))
(t/refer-tupelo)
(defn is-even?
"Returns true if x is even, otherwise false."
[x]
(zero? (mod x 2)))
; quick sanity checks
(spyx (is-even? 2))
(spyx (is-even? 3))
(defn keep-even
"Accepts a vector of numbers, returning a vector of the even ones."
[input]
(into [] ; forces result into vector, eagerly
(filter is-even? input)))
; demonstrate on [0 1 2...9]
(spyx (keep-even (range 10)))
with result:
(is-even? 2) => true
(is-even? 3) => false
(keep-even (range 10)) => [0 2 4 6 8]
Your project.clj needs the following for spyx to work:
:dependencies [
[tupelo "0.9.11"]
Today I tried to implement a "R-like" melt function. I use it for Big Data coming from Big Query.
I do not have big constraints about time to compute and this function takes less than 5-10 seconds to work on millions of rows.
I start with this kind of data :
(def sample
'({:list "123,250" :group "a"} {:list "234,260" :group "b"}))
Then I defined a function to put the list into a vector :
(defn split-data-rank [datatab value]
(let [splitted (map (fn[x] (assoc x value (str/split (x value) #","))) datatab)]
(map (fn[y] (let [index (map inc (range (count (y value))))]
(assoc y value (zipmap index (y value)))))
splitted)))
Launch :
(split-data-rank sample :list)
As you can see, it returns the same sequence but it replaces :list by a map giving the position in the list of each item in quoted list.
Then, I want to melt the "dataframe" by creating for each item in a group its own row with its rank in the group.
So that I created this function :
(defn split-melt [datatab value]
(let [splitted (split-data-rank datatab value)]
(map (fn [y] (dissoc y value))
(apply concat
(map
(fn[x]
(map
(fn[[k v]]
(assoc x :item v :Rank k))
(x value)))
splitted)))))
Launch :
(split-melt sample :list)
The problem is that it is heavily indented and use a lot of map. I apply dissoc to drop :list (which is useless now) and I have also to use concat because without that I have a sequence of sequences.
Do you think there is a more efficient/shorter way to design this function ?
I am heavily confused with reduce, does not know whether it can be applied here since there are two arguments in a way.
Thanks a lot !
If you don't need the split-data-rank function, I will go for:
(defn melt [datatab value]
(mapcat (fn [x]
(let [items (str/split (get x value) #",")]
(map-indexed (fn [idx item]
(-> x
(assoc :Rank (inc idx) :item item)
(dissoc value)))
items)))
datatab))
I have a vector of vectors that contains some strings and ints:
(def data [
["a" "title" "b" 1]
["c" "title" "d" 1]
["e" "title" "f" 2]
["g" "title" "h" 1]
])
I'm trying to iterate through the vector and return(?) any rows that contain a certain string e.g. "a". I tried implementing things like this:
(defn get-row [data]
(for [d [data]
:when (= (get-in d[0]) "a")] d
))
I'm quite new to Clojure, but I believe this is saying: For every element (vector) in 'data', if that vector contains "a", return it?
I know get-in needs 2 parameters, that part is where I'm unsure of what to do.
I have looked at answers like this and this but I don't really understand how they work. From what I can gather they're converting the vector to a map and doing the operations on that instead?
(filter #(some #{"a"} %) data)
It's a bit strange seeing the set #{"a"} but it works as a predicate function for some. Adding more entries to the set would be like a logical OR for it, i.e.
(filter #(some #{"a" "c"} %) data)
=> (["a" "title" "b" 1] ["c" "title" "d" 1])
ok you have error in your code
(defn get-row [data]
(for [d [data]
:when (= (get-in d[0]) "a")] d
))
the error is here:
(for [d [data] ...
to traverse all the elements you shouldn't enclose data in brackets, because this syntax is for creating vectors. Here you are trying to traverse a vector of one element. that is how it look like for clojure:
(for [d [[["a" "title" "b" 1]
["c" "title" "d" 1]
["e" "title" "f" 2]
["g" "title" "h" 1]]] ...
so, correct variant is:
(defn get-row [data]
(for [d data
:when (= "a" (get-in d [0]))]
d))
then, you could use clojure' destructuring for that:
(defn get-row [data]
(for [[f & _ :as d] data
:when (= f "a")]
d))
but more clojuric way is to use higher order functions:
(defn get-row [data]
(filter #(= (first %) "a") data))
that is about your code. But corretc variant is in other guys' answers, because here you are checking just first item.
(defn get-row [data]
(for [d data ; <-- fix: [data] would result
; in one iteration with d bound to data
:when (= (get-in d[0]) "a")]
d))
Observe that your algorithm returns rows where the first column is "a". This can e. g. be solved using some with a set as predicate function to scan the entire row.
(defn get-row [data]
(for [row data
:when (some #{"a"} row)]
row))
Even better than the currently selected answer, this would work:
(filter #(= "a" (% 0)) data)
The reason for this is because for the top answer you are searching all the indexes of the sub-vectors for your query, whereas you might only wantto look in the first index of each sub-vector (in this case, search through each position 0 for "a", before returning the whole sub-vector if true)
I have the following code:
(ns macroo)
(def primitives #{::byte ::short ::int})
(defn primitive? [type]
(contains? primitives type))
(def pp clojure.pprint/pprint)
(defn foo [buffer data schema]
(println schema))
(defmacro write-fn [buffer schema schemas]
(let [data (gensym)]
`(fn [~data]
~(cond
(primitive? schema) `(foo ~buffer ~data ~schema)
(vector? schema) (if (= ::some (first schema))
`(do (foo ~buffer (count ~data) ::short)
(map #((write-fn ~buffer ~(second schema) ~schemas) %)
~data))
`(do ~#(for [[i s] (map-indexed vector schema)]
((write-fn buffer s schemas) `(get ~data ~i)))))
:else [schema `(primitive? ~schema) (primitive? schema)])))) ; for debugging
(pp (clojure.walk/macroexpand-all '(write-fn 0 [::int ::int] 0)))
The problem is, upon evaluating the last expression, I get
=>
(fn*
([G__6506]
(do
[:macroo/int :macroo/int true false]
[:macroo/int :macroo/int true false])))
I'll explain the code if necessary, but for now i'll just state the problem (it might be just a newbie error I'm making):
`(primitive? ~schema)
and
(primitive? schema)
in the :else branch return true and false respectively, and since i'm using the second version in the cond expression, it fails where it shouldn't (I'd prefer the second version as it would be evaluated at compile time if i'm not mistaken).
I suspect it might have something to do with symbols being namespace qualified?
After some investigations (see edits), here is a working Clojure alternative. Basically, you rarely need recursive macros. If you
need to build forms recursively, delegate to auxiliary functions and call them from the macro (also, write-fn is not a good name).
(defmacro write-fn [buffer schemas fun]
;; we will evaluate "buffer" and "fun" only once
;; and we need gensym for intermediate variables.
(let [fsym (gensym)
bsym (gensym)]
;; define two mutually recursive function
;; to parse and build a map consisting of two keys
;;
;; - args is the argument list of the generated function
;; - body is a list of generated forms
;;
(letfn [(transformer [schema]
(cond
(primitive? schema)
(let [g (gensym)]
{:args g
:body `(~fsym ~schema ~bsym ~g)})
(sequential? schema)
(if (and(= (count schema) 2)
(= (first schema) ::some)
(primitive? (second schema)))
(let [g (gensym)]
{:args ['& g]
:body
`(doseq [i# ~g]
(~fsym ~(second schema) ~bsym i#))})
(reduce reducer {:args [] :body []} schema))
:else (throw (Exception. "Bad input"))))
(reducer [{:keys [args body]} schema]
(let [{arg :args code :body} (transformer schema)]
{:args (conj args arg)
:body (conj body code)}))]
(let [{:keys [args body]} (transformer schemas)]
`(let [~fsym ~fun
~bsym ~buffer]
(fn [~args] ~#body))))))
The macro takes a buffer (whatever it is), a schema as defined by your language and a function to be called for each value being visited by the generated function.
Example
(pp (macroexpand
'(write-fn 0
[::int [::some ::short] [::int ::short ::int]]
(fn [& more] (apply println more)))))
... produces the following:
(let*
[G__1178 (fn [& more] (apply println more)) G__1179 0]
(clojure.core/fn
[[G__1180 [& G__1181] [G__1182 G__1183 G__1184]]]
(G__1178 :macroo/int G__1179 G__1180)
(clojure.core/doseq
[i__1110__auto__ G__1181]
(G__1178 :macroo/short G__1179 i__1110__auto__))
[(G__1178 :macroo/int G__1179 G__1182)
(G__1178 :macroo/short G__1179 G__1183)
(G__1178 :macroo/int G__1179 G__1184)]))
First, evaluate buffer and fun and bind them to local variables
Return a closure which accept one argument and destructures it according to the given schema, thanks to Clojure's destructuring capabilities.
For each value, call fun with the appropriate arguments.
When the schema is [::some x], accept zero or more values as a vector and call the function fun for each of those values. This needs to be done with a loop, since the size is only know when calling the function.
If we pass the vector [32 [1 3 4 5 6 7] [2 55 1]] to the function generated by the above macroexpansion, the following is printed:
:macroo/int 0 32
:macroo/short 0 1
:macroo/short 0 3
:macroo/short 0 4
:macroo/short 0 5
:macroo/short 0 6
:macroo/short 0 7
:macroo/int 0 2
:macroo/short 0 55
:macroo/int 0 1
In this line:
`(do ~#(for [[i s] (map-indexed vector schema)]
((write-fn buffer s schemas) `(get ~data ~i)))))
you are calling write-fn, the macro, in your current scope, where s is just a symbol, not one of the entries in schema. Instead, you want to emit code that will run in the caller's scope:
`(do ~#(for [[i s] (map-indexed vector schema)]
`((write-fn ~buffer ~s ~schemas) (get ~data ~i)))))
And make a similar change to the other branch of the if, as well.
As an aside, it looks to me at first glance like this doesn't really need to be a macro, but could be a higher-order function instead: take in a schema or whatever, and return a function of data. My guess is you're doing it as a macro for performance, in which case I would counsel you to try it out the slow, easy way first; once you have that working you can make it a macro if necessary. Or, maybe I'm wrong and there's something in here that fundamentally has to be a macro.