Persisting State from a DRPC Spout in Trident - clojure

I'm experimenting with Storm and Trident for this project, and I'm using Clojure and Marceline to do so. I'm trying to expand the wordcount example given on the Marceline page, such that the sentence spout comes from a DRPC call rather than from a local spout. I'm having problems which I think stem from the fact that the DRPC stream needs to have a result to return to the client, but I would like the DRPC call to effectively return null, and simply update the persisted data.
(defn build-topology
  ([]
   (let [trident-topology (TridentTopology.)]
     (let [;; ### Two alternatives here ###
           ;collect-stream (t/new-stream trident-topology "words" (mk-fixed-batch-spout 3))
           collect-stream (t/drpc-stream trident-topology "words")]
       (-> collect-stream
           (t/group-by ["args"])
           (t/persistent-aggregate (MemoryMapState$Factory.)
                                   ["args"]
                                   count-words
                                   ["count"]))
       (.build trident-topology)))))
There are two alternatives in the code - the one using a fixed batch spout loads with no problem, but when I try to load the code using a DRPC stream instead, I get this error:
InvalidTopologyException(msg:Component: [b-2] subscribes from non-existent component [$mastercoord-bg0])
I believe this error comes from the fact that the DRPC stream must be trying to subscribe to an output in order to have something to return to the client - but persistent-aggregate doesn't offer any such outputs to subscribe to.
So how can I set up my topology so that a DRPC stream leads to my persisted data being updated?
Minor update: Looks like this might not be possible :( https://issues.apache.org/jira/browse/STORM-38

Related

In Flink is it possible to use state with a non keyed stream?

Let's assume that I have an input DataStream and want to implement some functionality that requires "memory", so I need a ProcessFunction that gives me access to state. Is it possible to do this directly on the DataStream, or is the only way to keyBy the initial stream and work in a keyed context?
I'm thinking that one solution would be to keyBy the stream with a hardcoded unique key so the whole input stream ends up in the same group. Then technically I have a KeyedStream and I can normally use keyed state, like I'm showing below with keyBy(x->1). But is this a good solution?
DataStream<Integer> inputStream = env.fromSource(...)
DataStream<Integer> outputStream = inputStream
    .keyBy(x -> 1)
    .process(...) // I've got access to state here
As I understand it, that's not a common use case because the main purpose of Flink is to partition the stream, process the partitions separately, and then merge the results. In my scenario that's exactly what I'm doing, but the problem is that the merge step requires state to produce the final "global" result. What I actually want to do is something like this:
DataStream<Integer> inputStream = env.fromElements(1,2,3,4,5,6,7,8,9)
// two groups: group1=[1,2,3,4] & group2=[5,6,7,8,9]
DataStream<Integer> partialResult = inputStream
    .keyBy(val -> val / 5)
    .process(<..stateful processing..>)
// Can't do stateful processing here because partialResult is not a KeyedStream
DataStream<Integer> outputStream = partialResult
    .process(<..stateful processing..>)
outputStream.print();
But Flink doesn't seem to let me do the final "merge partial results" operation, because I can't access state in the process function since partialResult is not a KeyedStream.
I'm a beginner to Flink, so I hope what I'm writing makes sense.
In general, I haven't found a good way to do the "merging" step, especially when it comes to complex logic.
I hope someone can give me some info or tips, or correct me if I'm missing something.
Thank you for your time.
Is "keyBy the stream with a hardcoded unique key" a good idea? Well, normally no, since it forces all data to flow through a single sub-task, so you get no benefit from the full parallelism in your Flink cluster.
If you want to get a global result (e.g. the "best" 3 results, from any results generated in the preceding step) then yes, you'll have to run all records through a single sub-task. So you could have a fixed key value, and use a global window. But note (as the docs state) you need to come up with some kind of "trigger condition", otherwise with a streaming workflow you never know when you really have the best N results, and thus you'd never emit any final result.
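The pattern the answer describes — route every record to one fixed key, keep state there, and only emit when a trigger condition fires — can be sketched outside Flink. Below is a minimal Python analogue (the GlobalTopN class, its names, and the count-based trigger are illustrative stand-ins, not Flink API):

```python
import heapq

class GlobalTopN:
    """Keeps the N largest values seen so far and emits them every
    `trigger_count` records -- a stand-in for Flink keyed state on a
    fixed key plus a count trigger on a global window."""

    def __init__(self, n, trigger_count):
        self.n = n
        self.trigger_count = trigger_count
        self.state = []   # the "keyed state": a min-heap of the best N values
        self.seen = 0

    def process(self, value):
        """Returns the current top N when the trigger fires, else None."""
        if len(self.state) < self.n:
            heapq.heappush(self.state, value)
        else:
            heapq.heappushpop(self.state, value)  # keep only the N largest
        self.seen += 1
        if self.seen % self.trigger_count == 0:   # the "trigger condition"
            return sorted(self.state, reverse=True)
        return None

merger = GlobalTopN(n=3, trigger_count=5)
outputs = []
for v in [4, 9, 1, 7, 6, 2, 8, 3, 5, 10]:
    result = merger.process(v)
    if result is not None:       # fires after the 5th and 10th records
        outputs.append(result)
```

The trade-off is exactly the one described above: all records flow through one stateful operator, so the merge step is correct but not parallel.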

Connection Pooling in Clojure

I am unable to understand the use of pool-db and connection function
in this connection pooling guide.
(defn- get-pool
  "Creates Database connection pool to be used in queries"
  [{:keys [host-port db-name username password]}]
  (let [pool (doto (ComboPooledDataSource.)
               (.setDriverClass "com.mysql.cj.jdbc.Driver")
               (.setJdbcUrl (str "jdbc:mysql://" host-port "/" db-name))
               (.setUser username)
               (.setPassword password)
               ;; expire excess connections after 30 minutes of inactivity:
               (.setMaxIdleTimeExcessConnections (* 30 60))
               ;; expire connections after 3 hours of inactivity:
               (.setMaxIdleTime (* 3 60 60)))]
    {:datasource pool}))
(def pool-db (delay (get-pool db-spec)))

(defn connection [] @pool-db)

; usage in code
(jdbc/query (connection) ["SELECT SUM(1, 2, 3)"])
Why can't we simply do?
(def connection (get-pool db-spec))
; usage in code
(jdbc/query connection ["SELECT SUM(1, 2, 3)"])
The delay ensures that you create the connection pool the first time you try to use it, rather than when the namespace is loaded.
This is a good idea because your connection pool may fail to be created for any one of a number of reasons, and if it fails during namespace load you will get some odd behaviour - any defs after your failing connection pool creation will not be evaluated, for example.
In general, top level var definitions should be constructed so they cannot fail at runtime.
Bear in mind they may also be evaluated during the AOT compile process, as amalloy notes below.
In your application you want to create the pool just once and reuse it. For this reason, delay is used to wrap the (get-pool db-spec) call, so that it is invoked only the first time it is forced with deref/@; the pool is then cached and returned on subsequent forces.
The difference is that in the delay version a pool will be created only if it is called (which might not be the case if everything was cached), but the non-delay version will instantiate a pool no matter what, i.e. always, even if a database connection is not used.
delay runs only if deref is called and does nothing otherwise.
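This force-once-then-cache behavior isn't specific to Clojure; a rough Python analogue makes it concrete (the Delay class and make_pool function here are illustrative stand-ins, not real APIs):

```python
class Delay:
    """Evaluates a zero-argument function once, on first force, and
    caches the result -- roughly what Clojure's `delay`/`deref` do."""

    _UNSET = object()

    def __init__(self, fn):
        self._fn = fn
        self._value = self._UNSET

    def force(self):
        if self._value is self._UNSET:
            self._value = self._fn()   # runs only on the first force
        return self._value

calls = []

def make_pool():
    # hypothetical expensive pool construction; we just record the call
    calls.append("pool created")
    return {"datasource": "pool"}

pool_db = Delay(make_pool)        # nothing happens at definition time
assert calls == []
conn = pool_db.force()            # the pool is created here, on first use
assert pool_db.force() is conn    # subsequent forces reuse the cached pool
assert calls == ["pool created"]
```

If make_pool raised, it would do so at the first force inside running code (where it can be handled), not during module definition — the same argument made above for namespace loading.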
I would suggest you use an existing library to handle connection pooling, something like hikari-cp, which is highly configurable and works across many SQL implementations.

Onyx: Can't pick up trigger/emit results in the next task

I'm trying to get started with Onyx, the distributed computing platform in Clojure. In particular, I'm trying to understand how to aggregate data. If I understand the documentation correctly, a combination of a window and a :trigger/emit function should allow me to do this.
So, I modified the aggregation example (Onyx 0.13.0) in three ways (cf. gist with complete code):
in -main I println any segments put on the output channel; this works as expected with the original code in that it picks up all segments and prints them to stdout.
I add an emit function like this:
(defn make-ds
  [event window trigger {:keys [lower-bound upper-bound event-type] :as state-event} extent-state]
  (println "make-ds called")
  {:ds window})
I add a trigger configuration (the original dump-words trigger is omitted for brevity):
(def triggers
  [{:trigger/window-id :word-counter
    :trigger/id :make-ds
    :trigger/on :onyx.triggers/segment
    :trigger/fire-all-extents? true
    :trigger/threshold [5 :elements]
    :trigger/emit ::make-ds}])
I change the :count-words task from calling the identity function to the reduce type, so that it doesn't hand all input segments over to the output (and I added config options so that Onyx treats this as a batch):
{:onyx/name :count-words
 ;:onyx/fn :clojure.core/identity
 :onyx/type :reduce ; :function
 :onyx/group-by-key :word
 :onyx/flux-policy :kill
 :onyx/min-peers 1
 :onyx/max-peers 1
 :onyx/batch-size 1000
 :onyx/batch-fn? true}
When I run this now, I can see in the output that the emit function (i.e. make-ds) gets called for each input segment (first output coming from the dump-words trigger of the original code):
> lein run
[....]
Om -> 1
name -> 1
My -> 2
a -> 1
gone -> 1
Coffee -> 1
to -> 1
get -> 1
Time -> 1
make-ds called
make-ds called
make-ds called
make-ds called
[....]
However, the segments built by make-ds don't make it through to the output channel; they are never printed. If I revert the :count-words task to the identity function, this works just fine. Also, it looks as if the emit function is called for each input segment, whereas I would expect it to be called only when the threshold condition is true (i.e. whenever 5 elements have been aggregated in the window).
As the test for this functionality within the Onyx code base (onyx.windowing.emit-aggregate-test) is passing just fine, I guess I'm making a stupid mistake somewhere, but I'm at a loss figuring out what.
I finally saw that there was a warning in the log file onyx.log like this:
[clojure.lang.ExceptionInfo: Windows cannot be checkpointed with ZooKeeper unless
:onyx.peer/storage.zk.insanely-allow-windowing? is set to true in the peer config.
This should only be turned on as a development convenience.
[clojure.lang.ExceptionInfo: Handling uncaught exception thrown inside task
lifecycle :lifecycle/checkpoint-state. Killing the job. -> Exception type:
clojure.lang.ExceptionInfo. Exception message: Windows cannot be checkpointed with
ZooKeeper unless :onyx.peer/storage.zk.insanely-allow-windowing? is set to true in
the peer config. This should only be turned on as a development convenience.
As soon as I set this, I finally got some segments handed over to the next task. I.e., I had to change the peer config to:
(def peer-config
  {:zookeeper/address "127.0.0.1:2189"
   :onyx/tenancy-id id
   :onyx.peer/job-scheduler :onyx.job-scheduler/balanced
   :onyx.peer/storage.zk.insanely-allow-windowing? true
   :onyx.messaging/impl :aeron
   :onyx.messaging/peer-port 40200
   :onyx.messaging/bind-addr "localhost"})
Now, :onyx.peer/storage.zk.insanely-allow-windowing? doesn't sound like a good thing to do. Lucas Bradstreet recommended on the Clojurians Slack channel switching to S3 checkpointing.

Composing Flow Graphs

I've been playing around with Akka Streams and get the idea of creating Flows and wiring them together using FlowGraphs.
I know this part of Akka is still under development so some things may not be finished and some other bits may change, but is it possible to create a FlowGraph that isn't "complete" - i.e. isn't attached to a Sink - and pass it around to different parts of my code to be extended by adding Flow's to it and finally completed by adding a Sink?
Basically, I'd like to be able to compose FlowGraphs but don't understand how... Especially if a FlowGraph has split a stream by using a Broadcast.
Thanks
The next week (December) will be documentation writing for us, so I hope this will help you to get into akka streams more easily! Having that said, here's a quick answer:
Basically you need a PartialFlowGraph instead of FlowGraph. In those we allow the usage of UndefinedSink and UndefinedSource, which you can then "attach" afterwards. In your case, we also provide a simple helper builder to create graphs which have exactly one "missing" sink – those can be treated exactly as if they were a Source, see below:
// for akka-streams 1.0-M1
val source = Source() { implicit b ⇒
  // prepare an undefined sink, which can be replaced by a proper sink afterwards
  val sink = UndefinedSink[Int]
  // build your processing graph
  Source(1 to 10) ~> sink
  // return the undefined sink which you mean to "fill in" afterwards
  sink
}

// use the partial graph (source) multiple times, each time with a different sink
source.runWith(Sink.ignore)
source.runWith(Sink.foreach(x ⇒ println(x)))
Hope this helps!
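The underlying idea — define a processing chain with the sink left open, then finish it later in different ways — is not Akka-specific. Here is a minimal, language-agnostic sketch in Python using plain iterables (source, run_with, and the sink functions are illustrative names, not Akka API):

```python
def source():
    """A reusable 'partial graph': the processing chain is defined
    (doubling each of 1..10), but no sink is attached yet."""
    return (x * 2 for x in range(1, 11))

def run_with(partial_graph, sink):
    """Attaches a sink to the open end of the graph and runs it.
    A fresh generator is materialized on every run, so the same
    partial graph can be completed many times."""
    return sink(partial_graph())

collected = []
# the same source, completed with two different sinks
run_with(source, lambda xs: collected.extend(xs))   # a collecting sink
total = run_with(source, sum)                       # a folding sink
```

Note the key property matching the answer above: the partial graph carries no sink of its own and can be rerun with a different one each time.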

redis.py: How to flush all the queries in a pipeline

I have a redis pipeline say:
r = redis.Redis(...).pipeline()
Suppose I need to remove any residual queries present in the pipeline without executing them. Is there anything like r.clear()?
I have searched the docs and source code and I am unable to find anything.
The command list is simply a Python list object. You can inspect it like so:
from redis import StrictRedis
r = StrictRedis()
pipe = r.pipeline()
pipe.set('KEY1', 1)
pipe.set('KEY2', 2)
pipe.set('KEY3', 3)
pipe.command_stack
[(('SET', 'KEY1', 1), {}), (('SET', 'KEY2', 2), {}), (('SET', 'KEY3', 3), {})]
This has not yet been sent to the server so you can just pop() or remove the commands you don't want. You can also just assign an empty list, pipe.command_stack = [].
If there are a lot of them, you could simply re-assign a new Pipeline object to pipe.
Hope this is what you meant.
Cheers
Joe
Use:
pipe.reset()
Other than the obvious advantage of ignoring implementation details (such as the command_stack mentioned before), this method will take care of interrupting the current ongoing transaction (if any) and returning the connection to the pool.
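Both answers rest on the same model: queued commands are just a client-side buffer until execute() is called, so discarding them never touches the server. A tiny stand-in class (not redis-py itself; redis-py's real reset() additionally releases the connection and aborts any open MULTI) makes the pop/clear semantics concrete:

```python
class FakePipeline:
    """A minimal stand-in for redis-py's Pipeline: commands are buffered
    client-side in `command_stack`, mirroring the structure shown above,
    and nothing is sent to the server until execute()."""

    def __init__(self):
        self.command_stack = []

    def set(self, key, value):
        # redis-py stores (args, options) tuples; we mimic that shape
        self.command_stack.append((("SET", key, value), {}))
        return self

    def reset(self):
        # only models the command-buffer clearing part of reset()
        self.command_stack = []

pipe = FakePipeline()
pipe.set("KEY1", 1).set("KEY2", 2).set("KEY3", 3)
assert len(pipe.command_stack) == 3
pipe.command_stack.pop()        # drop just the last queued command
assert len(pipe.command_stack) == 2
pipe.reset()                    # or discard everything at once
assert pipe.command_stack == []
```

Since the buffer is local, popping individual entries, clearing the list, or calling reset() are all safe before execute(); after execute() the commands have already reached the server and can no longer be withdrawn.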