I have the following data structure: there are scenes, which can be part of sequences.
For example, let's say we have scene sc026:
{:sceneId "sc026"}
It is part of seq07:
(def seq07
  {:SeqId "seq07"
   :Desc "Sequence name"
   :Scenes [sc026]
   :Comments []})
Given a list of scenes, I want to create a list that, for every scene, contains a list of IDs of the sequences that particular scene is part of.
Example
Let's assume there is a list of two scenes, sc026 and sc027. sc026 is part of seq07; sc027 is not part of any sequence.
The result I want to achieve is this: [["seq07"], []].
What I tried to implement
I have a function generate-scene-overview which, among other things, needs to create that list. It has the following signature:
(defn- generate-scene-overview
[scene-list time-info seqs]
scene-list is the collection of scenes (the result of (filter some? my-scene-list), where my-scene-list is a list of scenes that contains nil elements).
seqs is a list of sequences.
Sequences in seqs can be structured and unstructured. The unstructured ones have a non-empty list in the :Scenes field.
Therefore, in generate-scene-overview I first extract the unstructured sequences from seqs:
unstructured-seqs (filter
                    (fn [cur-seq]
                      (let [scenes (get cur-seq :Scenes)]
                        (not (empty? scenes))))
                    seqs)
Next I need to convert the unstructured sequences into a collection of scene-sequence tuples:
unstructured-seq-tuples (compose-unstructured-tuple-list unstructured-seqs)
compose-unstructured-tuple-list is defined as follows:
(defn- compose-unstructured-tuple-list
  [unstructured-seqs]
  (into []
        (map
          (fn [cur-seq]
            (let [scenes (get cur-seq :Scenes)
                  seqId (get cur-seq :SeqId)
                  scene-seq-tuples (into []
                                         (map
                                           (fn [cur-scene]
                                             (let [scene-id (get cur-scene :sceneId)]
                                               {:SceneId scene-id
                                                :SeqId seqId}))
                                           scenes))]
              scene-seq-tuples))
          unstructured-seqs)))
Next, I need to combine the tuples for structured sequences with those from unstructured ones:
seq-tuples (set/union unstructured-seq-tuples structured-seq-tuples)
Finally, seq-tuples are converted into a list of sequence IDs for each scene:
scene-seqs (compose-scene-seqs scene-list seq-tuples)
compose-scene-seqs is defined as follows:
(defn compose-scene-seqs
  [scene-list seq-tuples]
  (into [] (map (fn [cur-scene]
                  (let [scene-id (get cur-scene :sceneId)]
                    (findSeqIdsBySceneId scene-id seq-tuples)))
                scene-list)))
findSeqIdsBySceneId looks like this:
(defn findSeqIdsBySceneId
  [scene-id seq-tuples]
  (let [scene-tuples (filter (fn [cur-tuple]
                               (let [cur-tuple-scene-id (get cur-tuple :SceneId)]
                                 (= scene-id cur-tuple-scene-id)))
                             seq-tuples)
        seqs (map (fn [cur-tuple]
                    (get cur-tuple :SeqId))
                  scene-tuples)]
    seqs))
My problem
When I run the above code in the debugger, scene-seqs contains only empty collections.
It should contain exactly one non-empty collection for scene sc026 (with string seq07 inside it).
How I tried to diagnose the problem
I tried to reproduce the problem in automated tests.
First attempt -- findSeqIdsBySceneId:
(deftest findSeqIdsBySceneId-test
  (is (= ["seq07"]
         (findSeqIdsBySceneId "sc026" [{:SceneId "sc026"
                                        :SeqId "seq07"}])))
  (is (= ["seq07" "seq06"]
         (findSeqIdsBySceneId "sc026" [{:SceneId "sc026"
                                        :SeqId "seq07"}
                                       {:SceneId "sc026"
                                        :SeqId "seq06"}]))))
Those tests pass, so I wrote a couple of tests for compose-scene-seqs:
(deftest compose-scene-seqs-test
  (is (= [["seq07"]]
         (let [scene-list [{:sceneId "sc026"}]
               seq-tuples [{:SceneId "sc026"
                            :SeqId "seq07"}]]
           (compose-scene-seqs scene-list seq-tuples)))))
(deftest compose-scene-seqs-test2
  (is (= [["seq07"] []]
         (let [scene-list [{:sceneId "sc026"}
                           {:sceneId "sc027"}]
               seq-tuples [{:SceneId "sc026"
                            :SeqId "seq07"}]]
           (compose-scene-seqs scene-list seq-tuples)))))
(deftest compose-scene-seqs-test3
  (is (= [[] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] ["seq07"] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] []]
         (let [scene-list my-scene-list
               seq-tuples [{:SceneId "sc026"
                            :SeqId "seq07"}]]
           (compose-scene-seqs scene-list seq-tuples)))))
All of them pass.
If I replace
scene-list my-scene-list
with
scene-list (filter some? перечень-сцен2)
I get the following assertion error, but even then there is one non-empty collection:
Question
What else can I do to diagnose and fix the error?
Update 1:
Full code is available in this GitHub gist.
I managed to reproduce the error in the following test:
(deftest compose-scene-seqs-test4
  (is (= [[] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] ["seq07"] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] []]
         (let [scene-list (filter some? перечень-сцен2)
               unstructured-seqs [seq07]
               unstructured-seq-tuples (compose-unstructured-tuple-list unstructured-seqs)
               seq-tuples (set/union unstructured-seq-tuples [])]
           (compose-scene-seqs scene-list seq-tuples)))))
Here is a complete working solution for the stated task:
(defn- contains-scene? [seq scene-id]
(some #(= scene-id (:sceneId %)) (:Scenes seq)))
(defn- seq-ids-containing-scene [seqs scene-id]
(keep #(and (contains-scene? % scene-id) (:SeqId %)) seqs))
(defn seq-ids-containing-scenes [seqs scenes]
(map #(seq-ids-containing-scene seqs (:sceneId %)) scenes))
Test case:
(def sc026 {:sceneId "sc026"})
(def sc027 {:sceneId "sc027"})
(def seq07 {:SeqId "seq07"
:Desc "Sequence name"
:Scenes [sc026]
:Comments []})
(seq-ids-containing-scenes [seq07] [sc026 sc027]) ;; => (("seq07") ())
I couldn't follow the logic of the attempted solution. It introduces a concept, "unstructured" (not the same as Clojure's destructuring), which doesn't seem to add value. I tried re-creating the issue, but found the presented code was incomplete, so I can't offer any help on why it fails.
Here is a second, alternative solution which builds a map scene-map in a single pass over the collection of sequences. scene-map has the scene id as key and a collection of sequence ids as the corresponding value:
(defn seq-ids-containing-scenes* [seqs scenes]
(let [maps (for [seq seqs
scene (:Scenes seq)]
{(:sceneId scene) [(:SeqId seq)]})
scene-map (apply merge-with into maps)]
(map #(get scene-map (:sceneId %) []) scenes)))
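For the scene and sequence definitions above, this produces equivalent results (note the [] default makes scenes without sequences come back as empty vectors):
(seq-ids-containing-scenes* [seq07] [sc026 sc027]) ;; => (["seq07"] [])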
Update:
I found the bug in the original code presented in the question. In the function compose-unstructured-tuple-list, replace the outer map with mapcat.
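With map, the outer pass yields one vector of tuples per sequence, so seq-tuples is a collection of vectors rather than of tuples; findSeqIdsBySceneId then calls (get cur-tuple :SceneId) on a vector, gets nil, and never finds a match, which is why every result is empty. A minimal sketch of the corrected function, keeping the structure from the question:
(defn- compose-unstructured-tuple-list
  [unstructured-seqs]
  (into []
        (mapcat (fn [cur-seq]                ; mapcat flattens one level
                  (let [seqId (get cur-seq :SeqId)]
                    (map (fn [cur-scene]
                           {:SceneId (get cur-scene :sceneId)
                            :SeqId seqId})
                         (get cur-seq :Scenes))))
                unstructured-seqs)))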
I'd probably first refactor this to be a bit smaller.
I think you could do this in one pass with reduce, by creating a new data structure where the scene name is a key and the value is a list of seq-id information; build those lists up with conj as you iterate through the old data with reduce. A sketch of that idea follows.
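A minimal sketch of that idea, assuming the tuple shape from the question ({:SceneId ... :SeqId ...}); the names are illustrative:
;; Build {scene-id [seq-id ...]} in one pass over the tuples,
;; then look each scene up with a [] default.
(defn scene->seq-ids [seq-tuples]
  (reduce (fn [acc {:keys [SceneId SeqId]}]
            (update acc SceneId (fnil conj []) SeqId))
          {}
          seq-tuples))

(let [m (scene->seq-ids [{:SceneId "sc026" :SeqId "seq07"}])]
  (map #(get m (:sceneId %) []) [{:sceneId "sc026"} {:sceneId "sc027"}]))
;; => (["seq07"] [])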
In terms of debugging, IntelliJ has a step debugger which you can use to observe the list building, and an expression window which you can use to run commands as in the REPL, but in the context of your breakpoint. That should hopefully give you enough observability to understand your problem.
This is a very interesting problem, as it uses data (the seq id) combined with deeper data (the scenes). Conceptually that seems to suggest using a zipper, but those are hard to deal with and only "worth it" if you are manipulating data. If you have a lot of data like this to handle, then Specter might be worth a look.
Stephan's solution is shorter than this attempt, but I agree with @arcanine that a single-pass solution is desirable. The scale of scenes and sequences suggests that multiple passes might not be a problem, but here is a single-pass effort anyway:
(defn collect-seqs [seqs scenes]
(let [scene->seqs (->> seqs
(mapcat (fn seq-sc-pairs [sq]
(->> sq
:Scenes
(keep (comp (set scenes) :sceneId))
(map (partial vector (:SeqId sq))))))
;; group collection of ([seqid sceneid], ...)
(group-by second))]
(for [sc scenes]
(map first (get scene->seqs sc)))))
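Note that, unlike the earlier solutions, this variant expects scenes to be a collection of scene-id strings rather than scene maps (both the (set scenes) lookup and (get scene->seqs sc) key on the id). For example:
(collect-seqs [seq07] ["sc026" "sc027"]) ;; => (("seq07") ())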
Note that this could be a little shorter if a map were an acceptable result; that might also mean that a reduction function would work.
Related
I need a predicate which returns logically true if the given value is a non-empty collection and logically false if it's anything else (a number, a string, etc.).
More specifically, the predicate must not throw an IllegalArgumentException if applied to a single number or string.
I came up with the following function, but I'm wondering if there is a more idiomatic approach?
(defn not-empty-coll? [x]
(and (coll? x) (seq x)))
This will satisfy the following tests:
(is (not (not-empty-coll? nil))) ;; -> false
(is (not (not-empty-coll? 1))) ;; -> false
(is (not (not-empty-coll? "foo"))) ;; -> false
(is (not (not-empty-coll? []))) ;; -> nil (false)
(is (not (not-empty-coll? '()))) ;; -> nil (false)
(is (not (not-empty-coll? {}))) ;; -> nil (false)
(is (not-empty-coll? [1])) ;; -> (1) (true)
(is (not-empty-coll? '(1))) ;; -> (1) (true)
(is (not-empty-coll? {:a 1})) ;; -> ([:a 1]) (true)
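As the comments show, the predicate returns truthy/falsey values such as (1) and nil rather than strict booleans, which is fine for use in conditionals. If a strict true/false is needed, a minimal variant (my sketch, not from the original post) wraps the result in boolean:
(defn not-empty-coll?* [x]
  (boolean (and (coll? x) (seq x))))

(not-empty-coll?* [1])   ;; => true
(not-empty-coll?* "foo") ;; => false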
EDIT: A potential use case:
Let's say we need to process some raw external data which is not (yet) under our control. The input could be, for example, a collection which contains either primitive values or nested collections. Another example could be a collection holding some inconsistent (maybe broken?) tree structure. So we can consider the mentioned predicate a first line of data cleaning.
Otherwise, I agree with the comments that it is better to explicitly separate and process collection and non-collection data.
How about using Clojure protocols and type extensions to solve this?
(defprotocol EmptyCollPred
(not-empty-coll? [this]))
(extend-protocol EmptyCollPred
Object
(not-empty-coll? [this] false)
nil
(not-empty-coll? [this] false)
clojure.lang.Seqable
(not-empty-coll? [this] (not (empty? (seq this)))))
(is (not (not-empty-coll? nil))) ;; -> false
(is (not (not-empty-coll? 1))) ;; -> false
(is (not (not-empty-coll? "foo"))) ;; -> false
(is (not (not-empty-coll? []))) ;; -> nil (false)
(is (not (not-empty-coll? '()))) ;; -> nil (false)
(is (not (not-empty-coll? {}))) ;; -> nil (false)
(is (not-empty-coll? [1])) ;; -> (1) (true)
(is (not-empty-coll? '(1))) ;; -> (1) (true)
(is (not-empty-coll? {:a 1})) ;; -> ([:a 1]) (true)
Maybe it would be cleaner to extend just String and Number instead of Object; it depends on what you know about the incoming data. Also, it would probably be better to filter out nils beforehand instead of creating a case for them as you see above.
Another, conceptually similar, solution could use multimethods:
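A minimal sketch of the multimethod version (my adaptation of the protocol solution above; assumes a fresh namespace so the names don't clash):
;; Dispatch on the class of the argument; isa? follows Java class hierarchies.
(defmulti not-empty-coll? class)

(defmethod not-empty-coll? nil [_] false)               ; (class nil) is nil

(defmethod not-empty-coll? clojure.lang.Seqable [this]
  (not (empty? (seq this))))

(defmethod not-empty-coll? :default [_] false)          ; numbers, strings, ...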
As suggested in the comments, I would consider calling not-empty? with a non-collection argument to be an invalid usage, which should generate an IllegalArgumentException.
There is already a function not-empty? available for use in the Tupelo library. Here are the unit tests:
(deftest t-not-empty
(is (every? not-empty? ["one" [1] '(1) {:1 1} #{1} ] ))
(is (has-none? not-empty? [ "" [ ] '( ) {} #{ } nil] ))
(is= (map not-empty? ["1" [1] '(1) {:1 1} #{1} ] )
[true true true true true] )
(is= (map not-empty? ["" [] '() {} #{} nil] )
[false false false false false false ] )
(is= (keep-if not-empty? ["1" [1] '(1) {:1 1} #{1} ] )
["1" [1] '(1) {:1 1} #{1} ] )
(is= (drop-if not-empty? ["" [] '() {} #{} nil] )
["" [] '() {} #{} nil] )
(throws? IllegalArgumentException (not-empty? 5))
(throws? IllegalArgumentException (not-empty? 3.14)))
Update
The preferred approach would be for a function to only receive collection parameters in a given argument, not a mixture of scalar & collection arguments. Then one only needs not-empty, given the pre-knowledge that the value in question is not a scalar. I often use Plumatic Schema to enforce this assumption and catch any errors in the calling code:
(ns xyz
  (:require [schema.core :as s])) ; Plumatic Schema

(s/defn foo :- [s/Any]
  "Will do bar to the supplied collection"
  [coll :- [s/Any]]
  (if (not-empty coll)
    (mapv bar coll)
    [:some :default :value]))
The two uses of the notation :- [s/Any] check that the argument and the return value are both declared to be a sequential collection (list or vector). Each element is unrestricted by the s/Any part.
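Note that s/defn schemas are only checked when validation is enabled. A self-contained sketch (the function name and body are illustrative, not from the answer above):
(require '[schema.core :as s])
(s/set-fn-validation! true) ; turn checking on, e.g. in a test fixture

(s/defn doubled :- [s/Num]
  [coll :- [s/Num]]
  (mapv #(* 2 %) coll))

(doubled [1 2 3]) ;; => [2 4 6]
(doubled 5)       ;; throws ExceptionInfo: input to doubled does not match schema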
If you can't enforce the above strategy for some reason, I would just modify your first approach as follows:
(defn not-empty-coll? [x]
(and (coll? x) (t/not-empty? x)))
I'm hoping you know at least a little about the param x, so the question becomes: is x a scalar or a non-empty vector? Then you could say something like:
(defn not-empty-coll? [x]
(and (sequential? x) (t/not-empty? x)))
The REPL returns 2 when it is expected to return 5.
(defn counter []
  (let [count 1]
    (fn []
      (+ count 1))))

(defn test-counter []
  (let [increment (counter)]
    (increment)
    (increment)
    (increment)
    (increment)))
count is not a mutable variable, so (+ count 1) does not change its value. If you want mutation, you can store the count in an atom and update it using swap!:
(defn counter []
(let [count (atom 0)]
(fn [] (swap! count inc))))
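With that change, each call advances the state (a quick REPL check; seed the atom with 1 instead of 0 if you want the fourth call in test-counter to return 5):
(def increment (counter))
(increment) ;; => 1
(increment) ;; => 2
(increment) ;; => 3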
I developed a function in Clojure to fill in an empty column from the last non-empty value. I'm assuming this works, given
(:require [flambo.api :as f])
(defn replicate-val
[ rdd input ]
(let [{:keys [ col ]} input
result (reductions (fn [a b]
(if (empty? (nth b col))
(assoc b col (nth a col))
b)) rdd )]
(println "Result type is: "(type result))))
Got this:
;=> "Result type is: clojure.lang.LazySeq"
The question is how do I convert this back to type JavaRDD, using Flambo (a Spark wrapper).
I tried (f/map result #(.toJavaRDD %)) in the let form to attempt to convert to the JavaRDD type.
I got this error:
"No matching method found: map for class clojure.lang.LazySeq"
which is expected, because result is of type clojure.lang.LazySeq.
The question is how do I make this conversion, or how can I refactor the code to accommodate this.
Here is a sample input rdd:
(type rdd) ;=> "org.apache.spark.api.java.JavaRDD"
But it looks like:
[["04" "2" "3"] ["04" "" "5"] ["5" "16" ""] ["07" "" "36"] ["07" "" "34"] ["07" "25" "34"]]
Required output is:
[["04" "2" "3"] ["04" "2" "5"] ["5" "16" ""] ["07" "16" "36"] ["07" "16" "34"] ["07" "25" "34"]]
Thanks.
First of all, RDDs are not iterable (they don't implement ISeq), so you cannot use reductions. Ignoring that, the whole idea of accessing the previous record is rather tricky: you cannot directly access values from another partition, and only transformations which don't require shuffling preserve order.
The simplest approach here would be to use DataFrames and window functions with an explicit order, but as far as I know Flambo doesn't implement the required methods. It is always possible to use raw SQL or to access the Java/Scala API directly, but if you want to avoid that, you can try the following pipeline.
First, let's create a broadcast variable with the last values per partition:
(require '[flambo.broadcast :as bd])
(import org.apache.spark.TaskContext)
(def last-per-part (f/fn [it]
(let [context (TaskContext/get) xs (iterator-seq it)]
[[(.partitionId context) (last xs)]])))
(def last-vals-bd
(bd/broadcast sc
(into {} (-> rdd (f/map-partitions last-per-part) (f/collect)))))
Next, some helpers for the actual job:
(defn fill-pair [col]
(fn [x] (let [[a b] x] (if (empty? (nth b col)) (assoc b col (nth a col)) b))))
(def fill-pairs
(f/fn [it] (let [part-id (.partitionId (TaskContext/get)) ;; Get partion ID
xs (iterator-seq it) ;; Convert input to seq
prev (if (zero? part-id) ;; Find previous element
(first xs) ((bd/value last-vals-bd) part-id))
;; Create seq of pairs (prev, current)
pairs (partition 2 1 (cons prev xs))
;; Same as before
{:keys [ col ]} input
;; Prepare mapping function
mapper (fill-pair col)]
(map mapper pairs))))
Finally you can use fill-pairs to map-partitions:
(-> rdd (f/map-partitions fill-pairs) (f/collect))
A hidden assumption here is that the order of the partitions follows the order of the values. It may or may not hold in the general case, but without explicit ordering it is probably the best you can get.
An alternative approach is to zipWithIndex, swap the order of the values, and perform a join with an offset.
(require '[flambo.tuple :as tp])
(def rdd-idx (f/map-to-pair (.zipWithIndex rdd) #(.swap %)))
(def rdd-idx-offset
(f/map-to-pair rdd-idx
(fn [t] (let [p (f/untuple t)] (tp/tuple (dec' (first p)) (second p))))))
(f/map (f/values (.rightOuterJoin rdd-idx-offset rdd-idx)) f/untuple)
Next, you can map using a similar approach as before.
Edit
Quick note on using atoms. The problem there is the lack of referential transparency: you're leveraging incidental properties of a given implementation, not a contract. There is nothing in the semantics of map that requires elements to be processed in a given order, so if the internal implementation changes, the approach may no longer be valid. With map the calls below happen sequentially as the seq is realized; with pmap they race. Compare, in Clojure:
(def a (atom 0))
(defn foo [x] (let [aa @a] (swap! a (fn [& args] x)) aa))
(map foo (range 1 20))
;; => (0 1 2 ... 18) - each call sees the value left by the previous one
compared to:
(def a (atom 0))
(pmap foo (range 1 20))
This seems paradoxical:
(def foo ["some" "list" "of" "strings"])
`[ ~@(apply concat (map (fn [a] [a (symbol a)]) foo)) ]
; ["some" some "list" list "of" of "strings" strings]
; Changing only the outer [] into {}
`{ ~@(apply concat (map (fn [a] [a (symbol a)]) foo)) }
; RuntimeException Map literal must contain an even number of forms
; However, this works:
`{"some" some "list" list "of" of "strings" strings}
; {"list" clojure.core/list, "of" user/of, "strings" user/strings, "some" clojure.core/some}
What's going on?
The exception is triggered by the reader: before evaluation, your unsplice form is a single form, so the reader sees a map literal with an odd number of forms and can't read it.
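You can see that it is a read-time failure, independent of what the splice would eventually produce (the exact exception wording may vary by Clojure version):
(read-string "`{~@[:a 1]}")
;; RuntimeException Map literal must contain an even number of forms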
Workaround:
`{~@(apply concat (map (fn [a] [a (symbol a)]) foo)) ~@[]}
Unless you are writing a macro, it may be easiest to say:
(into {} (map (fn [a] [a (symbol a)]) foo))
;=> {"some" some, "list" list, "of" of, "strings" strings}
I was just going through various documentation on Clojure concurrency and came across this example on the website (http://clojure.org/concurrent_programming):
(import '(java.util.concurrent Executors))
(defn test-stm [nitems nthreads niters]
(let [refs (map ref (replicate nitems 0))
pool (Executors/newFixedThreadPool nthreads)
tasks (map (fn [t]
(fn []
(dotimes [n niters]
(dosync
(doseq [r refs]
(alter r + 1 t))))))
(range nthreads))]
(doseq [future (.invokeAll pool tasks)]
(.get future))
(.shutdown pool)
(map deref refs)))
I understand what it does and how it works, but I don't get why the second anonymous function fn [] is needed.
Many thanks,
dusha.
P.S. Without this second fn [] I get a NullPointerException.
Here is a classic example of using higher-order functions:
;; a function returns another function
(defn make-multiplyer [times]
(fn [x]
(* x times)))
;; now we bind returned function to a symbol to use it later
(def multiply-by-two (make-multiplyer 2))
;; let's use it
(multiply-by-two 100) ; => 200
In that code sample, fn inside fn works the same way. When map invokes (fn [t] (fn [] ...)), it gets the inner fn back.
(def list-of-funcs (map (fn [t]
(fn [] (* t 10))) ; main part
(range 5)))
;; Nearly same as
;; (def list-of-funcs (list (fn [] (* 0 10))
;; (fn [] (* 1 10))
;; ...
;; (fn [] (* 4 10))))
(for [i list-of-funcs]
(i))
; => (0 10 20 30 40)
Update: And as Alex said, tasks in the code sample is bound to a list of callables, which is then passed to .invokeAll().
The first fn is what map uses to create a seq of fns -- one for each of the threads. This is because tasks is a seq of functions! The .invokeAll() method expects a Collection of Callables (Clojure functions implement the Callable interface).
from Clojure.org: Special Forms
fns implement the Java Callable, Runnable and Comparator interfaces.
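A quick way to see this in action (a sketch; the ^Callable type hint selects the Callable overload of submit, which is otherwise ambiguous with Runnable):
(import '(java.util.concurrent Executors Callable))

(let [pool (Executors/newFixedThreadPool 1)
      fut  (.submit pool ^Callable (fn [] 42))] ; a plain fn passed as a Callable
  (try
    (.get fut) ;; => 42
    (finally (.shutdown pool))))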