zipWithUniqueId() in flambo using Clojure

I want to create an RDD in which each row has an index. I tried the following.
Given an rdd:
["a" "b" "c"]
(defn make-row-index [input]
(let [{:keys [col]} input]
(swap! #rdd assoc :rdd (-> (:rdd xctx)
(f/map #(vector %1 %2 ) (range))))))
Desired output:
(["a" 0] ["b" 1] ["c" 2])
I got an arity error, since f/map is used as (f/map rdd fn).
I wanted to use zipWithUniqueId() from Apache Spark, but I'm lost on how to implement this and I can't find an equivalent function in flambo. Any suggestions and help are appreciated.
Apache-spark zip with Index
Map implementation in flambo
Thanks

You can simply call zipWithIndex followed by map using untuple:
(def rdd (f/parallelize sc ["a" "b" "c"]))
(f/map (.zipWithIndex rdd) f/untuple)
You can use .zipWithUniqueId in exactly the same way, but the result will differ from what you expect: zipWithUniqueId generates pairs whose IDs are unique but not guaranteed to be consecutive or in order.
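To see why the IDs come out unordered: according to the Spark documentation, items in the k-th partition receive the IDs k, n+k, 2n+k, ..., where n is the number of partitions. The following is a plain-Clojure simulation of that numbering, not flambo or Spark code. The name simulate-unique-ids and the hard-coded partition layout are made up for illustration.

```clojure
;; Plain-Clojure simulation of the ID scheme, NOT a Spark or flambo call.
;; `partitions` is a hypothetical vector of vectors standing in for RDD partitions.
(defn simulate-unique-ids [partitions]
  (let [n (count partitions)]
    (mapcat (fn [k part]
              ;; the i-th item of partition k gets the ID k + i*n
              (map-indexed (fn [i x] [x (+ k (* i n))]) part))
            (range)
            partitions)))

(simulate-unique-ids [["a" "b"] ["c"]])
;; => (["a" 0] ["b" 2] ["c" 1])
```

With a single partition the IDs happen to coincide with zipWithIndex, which is why the difference is easy to miss in small local tests.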
It should also be possible to use zip, but as far as I can tell it doesn't work with an infinite range.
(def idx (f/parallelize sc (range (f/count rdd))))
(f/map (.zip rdd idx) f/untuple)
Whenever you use zip you should be careful, though. Generally speaking, an RDD should be considered an unordered collection if there is any shuffling involved.

Related

Reducing a list of maps to a list by in clojure

I started learning functional programming a few weeks ago, and I'm trying to map a list of maps to a list of the values under a specific key in Clojure.
My list of maps looks like: '({:a "a1" :b "b1" :c "c1"} {:a "a2" :b "b2" :c "c2"} {:a "a3" :b "b3" :c "c3"})
And the output I'm trying to get is: '("b1" "b2" "b3").
I've tried the following:
(doseq [m maps]
(println (list (get m :b))))
And my output is a list of lists (which is expected, as I'm creating a list on each iteration). So my question is, how can I reduce this to a single list?
Update
Just tried the following:
(let [x '()]
(doseq [m map]
(conj x (get m :b))))
However, it is still not working. I don't get it, as I was expecting the elements to be appended to an empty list.
This is a very common pattern in production Clojure code, so it's a good place to learn. In general, check out the docs on sequences at https://clojure.org/reference/sequences and, when faced with a similar task, look to see which pattern fits best and explore the functions in that group. In this case it's "Process each item of a seq to create a new seq", and the first item listed is map.
Your example might look like:
(map :b my-data)
You have the right idea, but are using the wrong function. doseq is intended only for side effects and always returns nil. The function you are looking for is for, which takes a sequence as input and returns another sequence as output. I generally prefer for over the similar map as for allows you to name the loop variable:
(def data-list
[{:a "a1" :b "b1" :c "c1"}
{:a "a2" :b "b2" :c "c2"}
{:a "a3" :b "b3" :c "c3"}])
(let [result (vec (for [item data-list]
(:b item)))]
(println result) ; print result
result) ; return result from `let` expression
result => ["b1" "b2" "b3"]
If instead you do this:
(println
(doseq [item data-list]
(println (:b item))))
you can see the difference with doseq vs for:
b1 ; loop item #1
b2 ; loop item #2
b3 ; loop item #3
nil ; return value of doseq
Please see https://www.braveclojure.com/ for online details, and buy a good book (or 5) like Getting Clojure, etc.
(doseq [m maps]
(println (list (get m :b))))
In two short lines, you break several general rules of functional programming:
Pass data into a function as arguments, not as references to global variables.
Don't print the results of computation. Return them as the value of the function.
Avoid mechanisms such as doseq that work by side-effects.
Despite this, you were not too far from a solution. doseq is essentially a version of for that throws away its result. If we replace doseq with for, and get rid of the println and the list, we get
=> (for [m maps] (get m :b))
("b1" "b2" "b3")
But Arthur Ulfeldt's simple use of map is better.
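The conj attempt from the update fails for a related reason: conj returns a new list and leaves x unchanged, and doseq discards that return value. A small sketch of the difference, reusing the sample data from the question (it assumes the name maps is bound to that data; unchanged and collected are illustrative names):

```clojure
(def maps
  '({:a "a1" :b "b1" :c "c1"}
    {:a "a2" :b "b2" :c "c2"}
    {:a "a3" :b "b3" :c "c3"}))

;; conj does not mutate x; the new list it returns is simply dropped here.
(def unchanged
  (let [x '()]
    (conj x "b1")
    x))
;; unchanged => ()

;; To accumulate explicitly, thread the accumulator through reduce.
(def collected
  (reduce (fn [acc m] (conj acc (:b m))) [] maps))
;; collected => ["b1" "b2" "b3"]
```

The reduce version is only there to show where the accumulated value lives; (map :b maps) gives the same elements with less ceremony.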

Is there any function to index the result of a computation on a collection?

Sincere apologies if this has been answered before. The search terms I came up with for this one weren't really specific...
I have defined the following function in a utils namespace, but it seems rather silly and is probably defined somewhere else:
(defn index-with
"returns a map of x -> (f x) for every x in xs"
[f xs]
(apply hash-map (mapcat #(vector % (f %)) xs)))
Here's an example of its usage:
(index-with count ["a" "bb" "ccc" "ddd"])
=> {"a" 1, "bb" 2, "ccc" 3, "ddd" 3}
If you were to invert the roles of keys and values (potentially ambiguous), you'd get something like group-by's output:
(group-by count ["a" "bb" "ccc" "ddd"])
=> {1 ["a"], 2 ["bb"], 3 ["ccc" "ddd"]}
I also looked at clojure.set/index but it seems to only cover a specific scenario (which doesn't apply here).
Does something like this exist already in the Clojure core-lib?
It seems like clojure.core/zipmap is reasonably close to what I am looking for:
(def xs ["a" "bb" "ccc" "ddd"])
(zipmap xs (map count xs))
=> {"a" 1, "bb" 2, "ccc" 3, "ddd" 3}
You have proposed two one line implementations. A third might be:
(into {} (map (juxt identity count)) xs)
All three solutions appear to assume uniqueness in the xs input. If you have a separate function, then I'd expect it to deal with collisions in a reasonable way.
If you are looking to accumulate a look-aside table of a pure function, then the function you want is memoize.
The memoized version of the function keeps a cache of the mapping from
arguments to results ...
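A minimal sketch of what that caching looks like in practice. The call counter and the names counted-length and cached-length are made up for illustration; the counter exists only to make the cache hits visible.

```clojure
(def call-count (atom 0))

(defn counted-length
  "Hypothetical stand-in for an expensive pure function."
  [s]
  (swap! call-count inc) ; side effect only to observe how often we are called
  (count s))

(def cached-length (memoize counted-length))

;; mapv is eager, so all five calls have happened once results is defined.
(def results (mapv cached-length ["a" "bb" "ccc" "ccc" "a"]))
;; results     => [1 2 3 3 1]
;; @call-count => 3, the repeated "ccc" and "a" were served from the cache
```

Unlike index-with, the table is built up on demand and is not directly accessible as a map.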

update or assoc a list rather than a vector

Updating a vector works fine:
(update [{:idx :a} {:idx :b}] 1 (fn [_] {:idx "Hi"}))
;; => [{:idx :a} {:idx "Hi"}]
However trying to do the same thing with a list does not work:
(update '({:idx :a} {:idx :b}) 1 (fn [_] {:idx "Hi"}))
;; => ClassCastException clojure.lang.PersistentList cannot be cast to clojure.lang.Associative clojure.lang.RT.assoc (RT.java:807)
Exactly the same problem exists for assoc.
I would like to do update and overwrite operations on lazy types rather than vectors. What is the underlying issue here, and is there a way I can get around it?
The underlying issue is that the update function works on associative structures, i.e. vectors and maps. Lists can't take a key as a function to look up a value.
user=> (associative? [])
true
user=> (associative? {})
true
user=> (associative? `())
false
update uses get behind the scenes to do its random access work.
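A quick check of that difference, using made-up sample collections v and l. Note that the exception in the question is raised by assoc (see RT.assoc in the stack trace), which update calls on the result; get on a list just returns nil.

```clojure
(def v [:a :b :c])
(def l '(:a :b :c))

(get v 1) ;; => :b
(get l 1) ;; => nil, get does not look up list elements by index
(nth l 1) ;; => :b, nth walks the list instead

;; One workaround when random access is needed: copy the list into a vector first.
(update (vec l) 1 (constantly :x))
;; => [:a :x :c]
```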
I would like to do update and overwrite operations on lazy types
rather than vectors
It's not clear what you want to achieve here. You're correct that vectors aren't lazy, but if you wish to do random access operations on a collection then vectors are ideal for this scenario and lists aren't.
and is there a way I can get around it?
Yes, but you still wouldn't be able to use the update function, and it doesn't look like there would be any benefit in doing so, in your case.
With a list you'd have to walk the list in order to access an index somewhere in the list - so in many cases you'd have to realise a great deal of the sequence even if it was lazy.
You can define your own function, using take and drop:
(defn lupdate [list n function]
(let [[head & tail] (drop n list)]
(concat (take n list)
(cons (function head) tail))))
user=> (lupdate '(a b c d e f g h) 4 str)
(a b c d "e" f g h)
With lazy sequences, that means that you will compute the first n values (but not the remaining ones, which after all is an important part of why we use lazy sequences). You also have to take into account space and time complexity (concat, etc.). But if you truly need to operate on lazy sequences, that's the way to go.
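As a sketch of that laziness, lupdate (repeated here so the snippet is self-contained) can be applied to an infinite sequence, and only the part that is actually consumed gets realized. The name first-six is illustrative.

```clojure
(defn lupdate [list n function]
  (let [[head & tail] (drop n list)]
    (concat (take n list)
            (cons (function head) tail))))

;; (range) is infinite; take limits how much of the result is realized.
(def first-six (take 6 (lupdate (range) 4 str)))
;; first-six => (0 1 2 3 "4" 5)
```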
Looking behind your question to the problem you are trying to solve:
You can use Clojure's sequence functions to construct a simple solution:
(defn elf [n]
(loop [population (range 1 (inc n))]
(if (<= (count population) 1)
(first population)
(let [survivors (->> population
(take-nth 2)
((if (-> population count odd?) rest identity)))]
(recur survivors)))))
For example,
(map (juxt identity elf) (range 1 8))
;([1 1] [2 1] [3 3] [4 1] [5 3] [6 5] [7 7])
This has complexity O(n). You can speed up count by passing the population count as a redundant argument in the loop, or by dumping the population and survivors into vectors. The sequence functions - take-nth and rest - are quite capable of doing the weeding.
I hope I got it right!
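One possible reading of the first suggestion above, carrying the population size through the loop instead of calling count on every pass. elf-counted is a hypothetical name, and this is a sketch rather than the answerer's own code.

```clojure
(defn elf-counted [n]
  (loop [population (range 1 (inc n))
         size n]                       ; size of population, tracked by hand
    (if (<= size 1)
      (first population)
      (let [thinned (take-nth 2 population)]
        ;; An odd-sized round also eliminates the first survivor.
        ;; Either way, (quot size 2) elves remain.
        (if (odd? size)
          (recur (rest thinned) (quot size 2))
          (recur thinned (quot size 2)))))))

(map (juxt identity elf-counted) (range 1 8))
;; => ([1 1] [2 1] [3 3] [4 1] [5 3] [6 5] [7 7])
```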

use 'for' inside 'let' return a list of hash-map

Sorry for the bad title; I don't know how to describe this in 10 words. Here's the detail:
I'd like to loop over a file in a format like:
a:1 b:2...
I want to loop each line, collect all 'k:v' into a hash-map.
{ a 1, b 2...}
I initialize a hash-map in a 'let' form, then loop all lines with 'for' inside let form.
In each loop step, I use 'assoc' to update the original hash-map.
(let [myhash {}]
(for [line #{"A:1 B:2" "C:3 D:4"}
:let [pairs (clojure.string/split line #"\s")]]
(for [[k v] (map #(clojure.string/split %1 #":") pairs)]
(assoc myhash k (Float. v)))))
But in the end I got a lazy-seq of hash-map, like this:
{ {a 1, b 2...} {x 98 y 99 z 100 ...} }
I know how to 'merge' the result now, but I still don't understand why 'for' inside 'let' returns a list of results.
What confuses me is: does the 'myhash' in the inner 'for' refer to the 'myhash' declared in the 'let' form every time? If I do want a list of hash-maps like the output, is this the idiomatic way in Clojure?
Clojure's "for" is a list comprehension, so it produces a (lazy) sequence of results. It is NOT a for loop.
Also, you seem to be trying to modify myhash, but Clojure's data structures are immutable.
The way I would approach the problem is to create a list of pairs like (["a" 1] ["b" 2] ..) and then use (into {} the-list-of-pairs).
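A minimal sketch of that approach, working on in-memory lines rather than a file. parse-lines is a hypothetical helper name, and the input is the sample data from the question.

```clojure
(require '[clojure.string :as str])

(defn parse-lines
  "Builds one map from lines of space-separated k:v pairs."
  [lines]
  (into {}
        ;; A single `for` with two bindings yields one flat sequence of pairs.
        (for [line lines
              pair (str/split line #"\s+")
              :let [[k v] (str/split pair #":")]]
          [k (Float/parseFloat v)])))

(parse-lines ["A:1 B:2" "C:3 D:4"])
;; => {"A" 1.0, "B" 2.0, "C" 3.0, "D" 4.0}
```

Nothing is modified along the way: for produces the pairs and into builds a new map from them.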
If the file format is really as simple as you're describing, then something much more simple should suffice:
(apply hash-map (re-seq #"\w+" (slurp "your-file.txt")))
I think it's more readable if you use the ->> threading macro:
(->> "your-file.txt" slurp (re-seq #"\w+") (apply hash-map))
The slurp function reads an entire file into a string. The re-seq function will just return a sequence of all the words in your file (basically the same as splitting on spaces and colons in this case). Now you have a sequence of alternating key-value pairs, which is exactly what hash-map expects...
I know this doesn't really answer your question, but you did ask about more idiomatic solutions.
I think @dAni is right, and you're confused about some fundamental concepts of Clojure (e.g. the immutable collections). I'd recommend working through some of the exercises on 4Clojure as a fun way to get more familiar with the language. Each time you solve a problem, you can compare your own solution to others' solutions and see other (possibly more idiomatic) ways to solve the problem.
Sorry, I didn't read your code very thoroughly last night when I was posting my answer. I just realized you actually convert the values to Floats. Here are a few options.
1) partition the sequence of inputs into key/val pairs so that you can map over it. Since you now have a sequence of pairs, you can use into to add them all to a map.
(->> "kvs.txt" slurp (re-seq #"\w+") (partition 2)
(map (fn [[k v]] [k (Float. v)])) (into {}))
2) Declare an auxiliary map-values function for maps and use that on the result:
(defn map-values [m f]
(into {} (for [[k v] m] [k (f v)])))
(->> "your-file.txt" slurp (re-seq #"\w+")
(apply hash-map) (map-values #(Float. %)))
3) If you don't mind having symbol keys instead of strings, you can safely use the Clojure reader to convert all your keys and values.
(->> "your-file.txt" slurp (re-seq #"\w+")
(map read-string) (apply hash-map))
Note that this is a safe use of read-string because our call to re-seq would filter out any hazardous input. However, this will give you longs instead of floats, since numbers like 1 are long integers in Clojure.
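For instance, on an inline string instead of a file (the sample text and the name parsed are made up for illustration):

```clojure
(def parsed
  (->> "a:1 b:2"
       (re-seq #"\w+")     ; ("a" "1" "b" "2")
       (map read-string)   ; symbols and longs
       (apply hash-map)))
;; parsed => {a 1, b 2}

(symbol? (first (keys parsed))) ;; => true
(class (get parsed 'a))         ;; => java.lang.Long
```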
Does the myhash in the inner for refer to the myhash declared in the let form every time?
Yes.
The let binds myhash to {}, and it is never rebound. myhash is always {}.
assoc returns a modified map, but does not alter myhash.
So the code can be reduced to
(for [line ["A:1 B:2" "C:3 D:4"]
:let [pairs (clojure.string/split line #"\s")]]
(for [[k v] (map #(clojure.string/split %1 #":") pairs)]
(assoc {} k (Float. v))))
... which produces the same result:
(({"A" 1.0} {"B" 2.0}) ({"C" 3.0} {"D" 4.0}))
If I do want a list of hash-map like the output, is this the idiomatic way in Clojure?
No.
See @DaoWen's answer.

How do I modify a portion of a vector in Clojure?

I am wondering if I'm missing something basic involving vector manipulation. Let's say I have the following:
(def xs [9 10 11 12 13])
(def idx [0 2])
(def values [1 3])
If I want to return the vector [1 10 3 12 13] in Matlab, I would write xs(idx) = values.
In Clojure, is there a primitive way of achieving this? Right now I'm using the following function:
(defn xinto [seq idx val]
(apply assoc seq (interleave idx val)))
Thanks.
It's a bit awkward because you've split up idx and values into two seqs, when they're conceptually a map of indexes to values. So if you'll permit me a little creative modification of your data format:
(def x [9 10 11 12 13])
(def changes {0 1, 2 3})
(defn xinto [v changes]
(reduce (fn [acc [k v]]
(assoc acc k v))
v
changes))
(xinto x changes) ;; gets the result you want
If you generate idx and values in some weird way that it's not convenient to group them together, you can group them later with (map list idx values) and then use my xinto implementation with that.
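A short sketch of that grouping step with the original data from the question. The reduce-based xinto is repeated, with its parameters renamed to avoid shadowing, so the snippet stands alone; zipmap is shown as an alternative way to build the changes map.

```clojure
(def xs [9 10 11 12 13])
(def idx [0 2])
(def values [1 3])

(defn xinto [v changes]
  (reduce (fn [acc [k new-val]]
            (assoc acc k new-val))
          v
          changes))

;; Pair each index with its value, then apply the pairs one by one.
(xinto xs (map list idx values))
;; => [1 10 3 12 13]

;; zipmap builds the index-to-value map directly.
(xinto xs (zipmap idx values))
;; => [1 10 3 12 13]
```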
I'd probably use reduce for this:
(reduce
(fn [old [i v]] (assoc old i v))
x
(map vector idx values))
However, if you really want to do this a lot (Matlab-style) then I'd suggest creating some helper macros / functions to create some kind of DSL for vector manipulation.
I could not find anything better.
In the core sequence functions there is replace, but it works on values, not on keys.
So,
(replace {9 2} x)
Would return
[2 10 11 12 13]
If you plan to do math related things in Clojure, I also propose you have a look at Incanter. It has a lot of APIs to manipulate mathematical data.