Streaming data to the caller in JVM - clojure

I have a function that fetches data periodically and eventually stops. It has to return the data it fetches to its caller in one of two ways:
1. As and when it arrives
2. In one shot
The 2nd one is easy to implement, i.e. you block the caller, fetch all the data, and then send it in one shot.
But I want to implement the 1st one (and I want to avoid callbacks). Are streams the thing to use here? If so, how? If not, how do I return something that the caller can query for data, and that also signals when there is no more data?
Note: I am on the JVM ecosystem, Clojure to be specific. I have looked at the Clojure library core.async, which solves this kind of problem with channels. But I was wondering whether there is another way that looks something like this (assuming streams can be used).
Java snippet
//Function which will periodically fetch MyData until there is no data
public Stream<MyData> myFunction() {
...
}
myFunction().filter(myData -> myData.text.equals("foo"))

Maybe you can just use a seq, which is lazy by default (like Stream), so the caller can decide when to pull the data in. When there is no more data, myFunction can simply end the sequence. While doing this, you can also encapsulate some optimisation within myFunction, e.g. fetching data in batches to minimise round trips, or fetching data periodically as per your original requirement.
Here is one naive implementation:
(defn my-function []
  (let [batch 100]
    (->> (range)
         (map #(let [from (* batch %)
                     to (+ from batch)]
                 (db-get from to)))
         ;; take while we have data from db-get
         (take-while identity)
         ;; returns as one single seq/Stream
         (apply concat))))
;; use it as a normal seq/Stream
(->> (my-function)
     (filter odd?))
where db-get would be something like:
(defn db-get [from to]
  ;; return first 1000 records only, i.e. returns nil to signal completion
  (when (< from 1000)
    ;; returns a range of records
    (range from to)))
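If the data arrives over time from a background fetcher rather than by paging, the same lazy-seq idea can be backed by a blocking queue. The following is only a minimal sketch under assumptions: the names queue-seq and done are hypothetical, and (range 5) with a short sleep stands in for the real periodic fetch. The producer puts each item on a java.util.concurrent.LinkedBlockingQueue and finally puts a sentinel; the caller sees an ordinary lazy seq that blocks until the next item is available and ends at the sentinel.

```clojure
(import '(java.util.concurrent LinkedBlockingQueue))

(defn queue-seq
  "Lazy seq of the items taken from q, ending when the sentinel `done` is taken.
  Realizing the next element blocks until the producer has put something."
  [^LinkedBlockingQueue q done]
  (lazy-seq
    (let [item (.take q)]
      (when-not (identical? item done)
        (cons item (queue-seq q done))))))

(defn my-function
  "Starts a background producer and returns a lazy seq of what it produces."
  []
  (let [q        (LinkedBlockingQueue.)
        done     (Object.)            ; hypothetical end-of-data sentinel
        producer (Thread.
                   ^Runnable
                   (fn []
                     ;; stand-in for the real periodic fetch (assumption)
                     (doseq [x (range 5)]
                       (Thread/sleep 10)
                       (.put q x))
                     ;; signal that there is no more data
                     (.put q done)))]
    (doto producer (.setDaemon true) (.start))
    (queue-seq q done)))

;; the caller treats it as a normal seq
(comment
  (filter odd? (my-function)))
```

The caller never sees the queue or the sentinel, so the fetching strategy can change without touching the calling code.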

You might want to check https://github.com/ReactiveX/RxJava and https://github.com/ReactiveX/RxClojure (seems no longer maintained?)

Related

Clojure - caching with atoms?

I have a number of threads that need to access a collection of values, some of these values also need to be persisted to a database when any changes are made to them so that I don't lose the state in case of a server reboot etc.
Currently I'm using an atom to store these values, and I have a set of functions that I call when something in the atom needs to change.
Inside these functions I also persist the data to a database before calling swap!. I chose this approach because I need to read the values inside the atom frequently, and it doesn't seem performant to open/close db connections every time I'm interested in one of the values.
The question:
Is this approach viable? I'm interested to know if anyone has had success implementing a similar solution or are there pitfalls I should be aware of?
Atoms are fine.
An alternative approach would be to use https://github.com/clojure/core.memoize or core.cache directly, as suggested by Stefan Kamphausen.
Approach:
Cache query results on the function level. This way you are sure that what you get back is exactly how the database would return it and not the way you would think it would serialize/deserialize.
Invalidate key/args after you have inserted/changed something in the database.
One benefit of this approach is that you can tweak the caching behavior: TTL, LRU, FIFO, etc.
Demo:
(require '[clojure.core.memoize :as memo])
;; suppose this is a real DB
(def db (atom {}))
(defn my-get [k]
  ;; expensive database call
  (Thread/sleep 5000)
  (get @db k))
(def my-get-cached
  (memo/memo my-get))
(defn my-put
  [k val]
  (swap! db assoc k val)
  ;; invalidate only the cached entry for this key
  (memo/memo-clear! my-get-cached [k]))
(comment
  (my-put :foo "the value")
  (my-get-cached :foo) ;; wait 5 seconds, "the value"
  (my-get-cached :foo) ;; "the value", instantly
  (my-put :bar "other-value")
  (my-get-cached :foo) ;; "the value", still instantly
  (my-get-cached :bar) ;; wait 5 seconds, "other-value"
  (my-put :foo "changed")
  (my-get-cached :foo) ;; wait 5 seconds, "changed"
  )

Idiomatic error/exception handling with threading macros

I'm fetching thousands of entities from an API one at a time using http requests. As next step in the pipeline I want to shovel all of them into a database.
(->> ids
     (pmap fetch-entity)
     (pmap store-entity)
     (doall))
fetch-entity expects a String id and tries to retrieve an entity using an http request and either returns a Map or throws an exception (e.g. because of a timeout).
store-entity expects a Map and tries to store it in a database. It possibly throws an exception (e.g. if the Map doesn't match the database schema or if it didn't receive a Map at all).
Inelegant Error Handling
My first "solution" was to write wrapper functions fetch-entity' and store-entity' to catch exceptions of their respective original functions.
fetch-entity' returns its input on failure, basically passing along a String id if the http request failed. This ensures that the whole pipeline keeps on trucking.
store-entity' checks the type of its argument. If the argument is a Map (fetch entity was successful and returned a Map) it attempts to store it in the database.
If the attempt of storing to the database throws an exception or if store-entity' got passed a String (id) instead of a Map it will conj to an external Vector of error_ids.
This way I can later use error_ids to figure out how often there was a failure and which ids were affected.
It doesn't feel like the above is a sensible way to achieve what I'm trying to do. For example the way I wrote store-entity' complects the function with the previous pipeline step (fetch-entity') because it behaves differently based on whether the previous pipeline step was successful or not.
Also having store-entity' be aware of an external Vector called error_ids does not feel right at all.
Is there an idiomatic way to handle these kinds of situations where you have multiple pipeline steps where some of them can throw exceptions (e.g. because they are I/O) where I can't easily use predicates to make sure the function will behave predictable and where I don't want to disturb the pipeline and only later check in which cases it went wrong?
It is possible to use a type of Try monad, for example from the cats library:
It represents a computation that may either result in an exception or return a successfully computed value. Is very similar to the Either monad, but is semantically different. It consists of two types: Success and Failure. The Success type is a simple wrapper, like Right of the Either monad. But the Failure type is slightly different from Left, because it always wraps an instance of Throwable (or any value in cljs since you can throw arbitrary values in the JavaScript host). (...) It is an analogue of the try-catch block: it replaces try-catch’s stack-based error handling with heap-based error handling. Instead of having an exception thrown and having to deal with it immediately in the same thread, it disconnects the error handling and recovery.
Heap-based error-handling is what you want.
Below is an example of fetch-entity and store-entity. I made fetch-entity throw an ExceptionInfo on the first id (1), and store-entity throws an ArithmeticException ("Divide by zero") on the second id (0).
(ns your-project.core
  (:require [cats.core :as cats]
            [cats.monad.exception :as exc]))
(def ids [1 0 2]) ;; `fetch-entity` throws on 1, `store-entity` on 0, 2 works
(defn fetch-entity
  "Throws an exception when the id is 1..."
  [id]
  (if (= id 1)
    (throw (ex-info "id is 1, help!" {:id id}))
    id))
(defn store-entity
  "Unfortunately this function still needs to be aware that it receives a Try.
  It throws an `ArithmeticException` (divide by zero) when the id is 0"
  [id-try]
  (if (exc/success? id-try)                   ; was the previous step a success?
    (exc/try-on (/ 1 (exc/extract id-try)))   ; if so: extract, apply fn, and rewrap
    id-try))                                  ; else return original for later processing
(def results
  (->> ids
       (pmap #(exc/try-on (fetch-entity %)))
       (pmap store-entity)))
Now you can filter results on successes or failures with success? or failure? respectively, and retrieve the values via cats/extract:
(def successful-results
  (->> results
       (filter exc/success?)
       (mapv cats/extract)))
successful-results ;; => [1/2]
(def error-messages
  (->> results
       (filter exc/failure?)
       (mapv cats/extract) ; gets exceptions without raising them
       (mapv #(.getMessage %))))
error-messages ;; => ["id is 1, help!" "Divide by zero"]
Note that if you only want to loop over the results once, you can use a transducer as follows:
(transduce (comp
            (filter exc/success?)
            (map cats/extract))
           conj
           results)
;; => [1/2]
My first thought is to combine fetch-entity and store-entity into a single operation:
(defn fetch-and-store [id]
  (try
    (store-entity (fetch-entity id))
    (catch Exception e
      ;; println stands in for whatever logging you use
      (println "Failed for id" id ":" (.getMessage e)))))
(doall (pmap fetch-and-store ids))
Would something like this work?
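To also answer the "which ids failed?" part without an external vector, the combined step can return a small result map instead of only logging. This is just a sketch under assumptions: the fetch-entity and store-entity below are hypothetical stand-ins that fail on specific ids, and the :result / :error keys are an illustrative choice rather than an established convention.

```clojure
;; Hypothetical stand-ins for the real I/O functions (assumptions).
(defn fetch-entity [id]
  (if (= id "bad-fetch")
    (throw (ex-info "fetch failed" {:id id}))
    {:id id}))

(defn store-entity [entity]
  (if (= (:id entity) "bad-store")
    (throw (ex-info "store failed" entity))
    :stored))

(defn fetch-and-store
  "Runs both steps for one id and returns a map describing the outcome
  instead of letting the exception escape the pipeline."
  [id]
  (try
    {:id id :result (store-entity (fetch-entity id))}
    (catch Exception e
      {:id id :error (.getMessage e)})))

;; pmap keeps the input order, so results line up with ids
(def results
  (doall (pmap fetch-and-store ["a" "bad-fetch" "bad-store" "b"])))

;; ids whose fetch or store step threw
(def error-ids
  (->> results
       (filter :error)
       (mapv :id)))
```

Because every element is plain data, counting failures or retrying only the failed ids becomes an ordinary filter over results.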

Clojure: How to Serialize a Function and Reuse it Later

(defn my-func [opts]
(assoc opts :something :else))
What I want to be able to do is serialize a reference to the function (maybe via #'my-func ?) to a string in such a way that, upon deserializing it, I can invoke it with args.
How does this work?
Edit-- Why This is Not a Duplicate
The other question asked how to serialize a function body-- the entire function code. I am not asking how to do that. I am asking how to serialize a reference.
Imagine a cluster of servers all running the same jar, attached to an MQ. The MQ publishes fn-reference and fn-args for functions in the jar, and a server in the cluster runs the function and acks the message. That's what I'm trying to do, not pass function bodies around.
In some ways, this is like building a "serverless" engine in clojure.
Weirdly, a commit for serializing var identity was just added to Clojure yesterday: https://github.com/clojure/clojure/commit/a26dfc1390c53ca10dba750b8d5e6b93e846c067
So as of the latest master snapshot version, you can serialize a Var (like #'clojure.core/conj) and deserialize it on another JVM with access to the same loaded code, and invoke it.
(import [java.io File FileOutputStream FileInputStream ObjectOutputStream ObjectInputStream])
(defn write-obj [o f]
  (let [oos (ObjectOutputStream. (FileOutputStream. (File. f)))]
    (.writeObject oos o)
    (.close oos)))
(defn read-obj [f]
  (let [ois (ObjectInputStream. (FileInputStream. (File. f)))
        o (.readObject ois)]
    (.close ois)
    o))
;; in one JVM
(write-obj #'clojure.core/conj "var.ser")
;; in another JVM
(read-obj "var.ser")
As suggested in the comments, if you can just serialize a keyword label for the function and store/retrieve that, you are finished.
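A minimal sketch of the keyword-label idea, assuming every server runs the same code and so has the same lookup table. The names fn-registry, serialize-call and invoke-serialized are hypothetical; the message is plain EDN written with pr-str and read back with clojure.edn/read-string.

```clojure
(require '[clojure.edn :as edn])

(defn my-func [opts]
  (assoc opts :something :else))

;; Hypothetical registry mapping serializable labels to functions.
;; Every JVM in the cluster is assumed to load the same table.
(def fn-registry
  {:my-func my-func})

(defn serialize-call
  "Encodes a function label and its args as an EDN string."
  [label args]
  (pr-str {:fn label :args args}))

(defn invoke-serialized
  "Decodes a message produced by serialize-call and invokes the function."
  [s]
  (let [{label :fn, args :args} (edn/read-string s)
        f (get fn-registry label)]
    (when-not f
      (throw (ex-info "Unknown function label" {:label label})))
    (apply f args)))

(comment
  ;; sender
  (serialize-call :my-func [{:a 1}])
  ;; receiver
  (invoke-serialized (serialize-call :my-func [{:a 1}])))
```

An explicit registry also acts as a whitelist: only functions you have listed can be invoked from a message.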
If you need to transmit the function from one place to another, you essentially need to send the function source code as a string and then have it compiled via eval on the other end. This is what Datomic does when a Database Function is stored in the DB and automatically run by Datomic for any new additions/changes to the DB (these can perform automatic data validation, for example). See:
http://docs.datomic.com/database-functions.html
http://docs.datomic.com/clojure/index.html#datomic.api/function
A similar technique is used in the book Clojure in Action (1st Edition) for the distributed compute engine example using RabbitMQ.

Idiomatic use of atom in a web crawler

For example, a web crawler maintains a global set of visited URLs. Once a worker has started working on a URL or has completed it, other workers should not take the same URL. One way to implement this in Java is to put visited URLs in a ConcurrentHashMap (a Set is probably better). Each worker looks at the map before visiting a URL:
if (visited.putIfAbsent(url, true) == null) {
    crawl(url);
} else {
    // do nothing
}
In Clojure, I use a set in an atom. Each time I'm about to swap in a new set with the latest visited URL, the swap function should check if the set has this URL already. If the URL exists, the worker should stop from there. To be able to tell the worker if swap succeeds, I had to save the return value in the global state like [visited-urls last-swap-succeeded]
(def state (atom [#{} nil]))
(defn f [state key]
  (let [[visited-urls _] state]
    (if (visited-urls key)
      [visited-urls false]
      [(conj visited-urls key) true])))
Workers should do
(when (second (swap! state f url)) (crawl url))
It works but looks quite ugly to me. The problem is that the swap function doesn't allow a return value to the call site. Is there a better way to do this in Clojure?
Refs were kinda made for this sort of thing. Here's a simple way to do it
(when (dosync (when-not (@visited-urls-ref url-to-visit)
                (alter visited-urls-ref conj url-to-visit)))
  ; continue crawling url-to-visit
  )
I can't imagine it would add any significant overhead for a web crawler.
Personally, assuming the order in which urls are visited is unimportant, I'd create a core.async channel with a deduplicating transducer (distinct rather than dedupe, since dedupe only drops consecutive duplicates) and simply have all the workers put/take urls to/from that.
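If you would rather stay with an atom, one option on Clojure 1.9 or later is swap-vals!, which returns both the old and the new value, so the caller can tell whether it was the one that added the URL. A minimal sketch; the names visited and claim-url! are hypothetical, and crawl is assumed to exist elsewhere.

```clojure
;; the set of URLs that some worker has already claimed
(def visited (atom #{}))

(defn claim-url!
  "Adds url to the visited set. Returns true only for the one caller
  whose swap actually added it, false if it was already present."
  [url]
  (let [[old-set _new-set] (swap-vals! visited conj url)]
    (not (contains? old-set url))))

(comment
  ;; in each worker; `crawl` is assumed to be defined elsewhere
  (when (claim-url! url)
    (crawl url)))
```

This keeps the atom's value a plain set, with no success flag stored in the shared state.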

Getting most recent response from a core.async

I am trying to validate a form using core.async by making a request to a validation function every time the form changes. The validation function is asynchronous itself. It hits an external service and returns either an array of error messages or an empty array.
(go-loop []
  (when-let [value (<! field-chan)]
    (go (let [errors (<! (validate value))]
          (put! field-error-chan errors)))))
The above code is what I have at the moment. It works most of the time, but sometimes the response time from the server varies, so the response to the second request arrives before the response to the first. If the value is invalid in the second case but valid in the first, we would pull an array of errors followed by an empty array off the field-error-chan.
I could of course take the validation out of a go loop and have everything return in the correct order, but I would then end up taking values from the field-chan only after checking for errors. What I would like to do is validate values as they come, but put the validation responses on the errors channel in the order the values came, not the order of the responses.
Is this possible with core.async? If not, what would be my best approach to getting ordered responses?
Thanks
Assuming you can modify the external validation service, the simplest approach would probably be to attach timestamps (or simply counters) to validation requests and to have the validation service include them in their responses. Then you could always tell whether you're dealing with the response to the latest request.
Incidentally, the internal go form serves no purpose and could be merged into the outer go-loop. (Well, go forms return channels, but if the go-loop is actually meant to loop, this probably isn't important.)
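The counter idea can be sketched on the client side alone, assuming the response can be matched back to the id of the request that produced it. The names latest-request-id, current-errors, new-request-id! and accept-response! are hypothetical, and plain atoms are used here instead of channels to keep the sketch self-contained.

```clojure
;; id of the most recently issued validation request
(def latest-request-id (atom 0))

;; errors from the newest response accepted so far
(def current-errors (atom nil))

(defn new-request-id!
  "Call when sending a validation request; returns the id to attach to it."
  []
  (swap! latest-request-id inc))

(defn accept-response!
  "Stores errors only if request-id belongs to the newest request.
  Returns true if the response was accepted, false if it was stale."
  [request-id errors]
  (if (= request-id @latest-request-id)
    (do (reset! current-errors errors)
        true)
    false))
```

A response that arrives late for an older request is simply dropped, so the displayed errors always correspond to the latest value of the field.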
You can write a switch function (inspired by RxJs):
(defn switch [in]
  (let [out (chan)]
    (go (loop [subchannel (<! in)]
          (let [[v c] (alts! [subchannel in])]
            (if (= c subchannel)
              (do (>! out v) (recur subchannel))
              (recur v)))))
    out))
Then map validate over field-chan and wrap the resulting channel of channels with switch:
(let [validate-last (switch (async/map validate [field-chan]))]
  ...)
But note that the switch does not handle closing channels.