How to set `setNumTasks` for a storm bolt in clojure? - clojure

I would like to set the number of tasks for a bolt in my Storm topology, but I can't find a way to do this in Clojure. I didn't see anything about it in the Clojure DSL documentation [1] or in bolt-spec. Am I missing something? For fine-tuning my application I need a way to set the number of tasks. Is this possible?
[1] : http://storm.apache.org/releases/0.10.0/Clojure-DSL.html

I believe you can pass the number of tasks in a configuration map when calling bolt-spec, like this:
(bolt-spec {"1" :shuffle} geocode-lookup :p 8 :conf {TOPOLOGY_TASKS 64})
Here is an example topology:
(defn heatmap-topology []
  (topology
   {"1" (spout-spec checkins :p 4)}
   {"2" (bolt-spec {"1" :shuffle} geocode-lookup :p 8 :conf {TOPOLOGY_TASKS 64})
    "3" (bolt-spec {"2" :shuffle} time-interval-extractor :p 4)
    "4" (bolt-spec {"3" ["time-interval" "city"]} heatmap-builder :p 4)
    "5" (bolt-spec {"4" :shuffle} persistor :conf {TOPOLOGY_TASKS 4})}))
I was wondering the same thing and found this:
org.apache.storm.clojure/bolt-spec is an alias for org.apache.storm.thrift/mk-bolt-spec link
bolt-spec is defined using org.apache.storm.util/defnk link, so it returns a map like {:obj bolt :inputs inputs :p parallelism-hint :conf conf} with conf defaulting to nil
org.apache.storm.clojure/topology is an alias for org.apache.storm.thrift/mk-topology, which calls org.apache.storm.topology.TopologyBuilder.addConfigurations, passing conf as the only argument. link
Also, setNumTasks simply calls addConfiguration(Config.TOPOLOGY_TASKS, val) link
And last, the correct constant for setting the number of tasks for a spout or bolt is TOPOLOGY_TASKS link

According to Clojure-DSL page:
To create topology configs, it's easiest to use the
backtype.storm.config namespace which defines constants for all of the
possible configs. The constants are the same as the static constants
in the Config class, except with dashes instead of underscores. For
example, here's a topology config that sets the number of workers to
15 and configures the topology in debug mode:
{TOPOLOGY-DEBUG true
TOPOLOGY-WORKERS 15}
And from Storm documentation (look for section named "Number of tasks"):
Number of tasks
Description: How many tasks to create per component.
Configuration option: TOPOLOGY_TASKS
How to set in your code (examples):
ComponentConfigurationDeclarer#setNumTasks()
Here is an example code snippet to show these settings in practice:
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
.setNumTasks(4)
.shuffleGrouping("blue-spout");
It looks like using the backtype.storm.config namespace and setting TOPOLOGY-TASKS (not TOPOLOGY-WORKERS, which controls the number of worker processes) would be equivalent to calling setNumTasks.
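Putting the quoted docs together, a minimal (untested) sketch of setting the task count from the Clojure DSL with the dashed constant would look like this; it assumes the backtype.storm.config constants are referred into the namespace:
(use 'backtype.storm.config)
;; TOPOLOGY-TASKS is the dashed Clojure constant for Config.TOPOLOGY_TASKS,
;; so this should be equivalent to calling .setNumTasks(64) on the bolt.
(bolt-spec {"1" :shuffle} geocode-lookup
           :p 8
           :conf {TOPOLOGY-TASKS 64})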

Related

Connection Pooling in Clojure

I am unable to understand the use of the pool-db and connection functions in this connection pooling guide.
(defn- get-pool
  "Creates Database connection pool to be used in queries"
  [{:keys [host-port db-name username password]}]
  (let [pool (doto (ComboPooledDataSource.)
               (.setDriverClass "com.mysql.cj.jdbc.Driver")
               (.setJdbcUrl (str "jdbc:mysql://" host-port "/" db-name))
               (.setUser username)
               (.setPassword password)
               ;; expire excess connections after 30 minutes of inactivity:
               (.setMaxIdleTimeExcessConnections (* 30 60))
               ;; expire connections after 3 hours of inactivity:
               (.setMaxIdleTime (* 3 60 60)))]
    {:datasource pool}))
(def pool-db (delay (get-pool db-spec)))
(defn connection [] @pool-db)
; usage in code
(jdbc/query (connection) ["Select SUM(1, 2, 3)"])
Why can't we simply do?
(def connection (get-pool db-spec))
; usage in code
(jdbc/query connection ["SELECT SUM(1, 2, 3)"])
The delay ensures that you create the connection pool the first time you try to use it, rather than when the namespace is loaded.
This is a good idea because your connection pool may fail to be created for any one of a number of reasons, and if it fails during namespace load you will get some odd behaviour - any defs after your failing connection pool creation will not be evaluated, for example.
In general, top level var definitions should be constructed so they cannot fail at runtime.
Bear in mind they may also be evaluated during the AOT compile process, as amalloy notes below.
In your application, you want to create the pool just once and reuse it. For this reason, delay is used to wrap the (get-pool db-spec) call so that it is invoked only the first time it is forced with deref/@; the pool is then cached and returned on subsequent forces.
The difference is that with the delay version a pool will be created only if it is actually used (which might not happen if everything was cached), whereas the non-delay version will instantiate a pool no matter what, even if a database connection is never used.
delay runs only if deref is called and does nothing otherwise.
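A small REPL sketch of that delay behavior (the pool creation is stubbed out with a println and a placeholder value):
(def pool-db (delay (do (println "creating pool...") :the-pool)))
(realized? pool-db) ;=> false, nothing has been created yet
@pool-db            ;; prints "creating pool..." and returns :the-pool
@pool-db            ;=> :the-pool (cached; the body does not run again)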
I would suggest you use an existing library to handle connection pooling, something like hikari-cp, which is highly configurable and works across many SQL implementations.
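As a rough, hypothetical sketch of the hikari-cp route (the option names follow its README and should be adjusted to your adapter; it still benefits from the delay wrapper discussed above):
(require '[hikari-cp.core :as hikari])

(def datasource-options
  {:adapter       "mysql"      ;; assumed adapter name, check hikari-cp's README
   :username      "username"
   :password      "password"
   :database-name "db-name"
   :server-name   "localhost"
   :port-number   3306})

;; pool is still created lazily, on first deref
(def pool-db (delay {:datasource (hikari/make-datasource datasource-options)}))

(jdbc/query @pool-db ["SELECT 1"])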

Onyx: Can't pick up trigger/emit results in the next task

I'm trying to get started with Onyx, the distributed computing platform in Clojure. In particular, I try to understand how to aggregate data. If I understand the documentation correctly, a combination of a window and a :trigger/emit function should allow me to do this.
So, I modified the aggregation example (Onyx 0.13.0) in three ways (cf. gist with complete code):
in -main I println any segments put on the output channel; this works as expected with the original code in that it picks up all segments and prints them to stdout.
I add an emit function like this:
(defn make-ds
  [event window trigger {:keys [lower-bound upper-bound event-type] :as state-event} extent-state]
  (println "make-ds called")
  {:ds window})
I add a trigger configuration (the original dump-words trigger is omitted for brevity):
(def triggers
  [{:trigger/window-id :word-counter
    :trigger/id :make-ds
    :trigger/on :onyx.triggers/segment
    :trigger/fire-all-extents? true
    :trigger/threshold [5 :elements]
    :trigger/emit ::make-ds}])
I change the :count-words task from calling the identity function to the reduce type, so that it doesn't hand all input segments over to the output (and added config options so that Onyx treats this as a batch):
{:onyx/name :count-words
 ;:onyx/fn :clojure.core/identity
 :onyx/type :reduce ; :function
 :onyx/group-by-key :word
 :onyx/flux-policy :kill
 :onyx/min-peers 1
 :onyx/max-peers 1
 :onyx/batch-size 1000
 :onyx/batch-fn? true}
When I run this now, I can see in the output that the emit function (i.e. make-ds) gets called for each input segment (first output coming from the dump-words trigger of the original code):
> lein run
[....]
Om -> 1
name -> 1
My -> 2
a -> 1
gone -> 1
Coffee -> 1
to -> 1
get -> 1
Time -> 1
make-ds called
make-ds called
make-ds called
make-ds called
[....]
However, the segments built from make-ds don't make it through to the output channel; they are never printed. If I revert the :count-words task to the identity function, this works just fine. Also, it looks as if the emit function is called for each input segment, whereas I would expect it to be called only when the threshold condition is true (i.e. whenever 5 elements have been aggregated in the window).
As the test for this functionality within the Onyx code base (onyx.windowing.emit-aggregate-test) is passing just fine, I guess I'm making a stupid mistake somewhere, but I'm at a loss figuring out what.
I finally saw that there was a warning in the log file onyx.log like this:
[clojure.lang.ExceptionInfo: Windows cannot be checkpointed with ZooKeeper unless
:onyx.peer/storage.zk.insanely-allow-windowing? is set to true in the peer config.
This should only be turned on as a development convenience.
[clojure.lang.ExceptionInfo: Handling uncaught exception thrown inside task
lifecycle :lifecycle/checkpoint-state. Killing the job. -> Exception type:
clojure.lang.ExceptionInfo. Exception message: Windows cannot be checkpointed with
ZooKeeper unless :onyx.peer/storage.zk.insanely-allow-windowing? is set to true in
the peer config. This should only be turned on as a development convenience.
As soon as I set this, I finally got some segments handed over to the next task. I.e., I had to change the peer config to:
(def peer-config
  {:zookeeper/address "127.0.0.1:2189"
   :onyx/tenancy-id id
   :onyx.peer/job-scheduler :onyx.job-scheduler/balanced
   :onyx.peer/storage.zk.insanely-allow-windowing? true
   :onyx.messaging/impl :aeron
   :onyx.messaging/peer-port 40200
   :onyx.messaging/bind-addr "localhost"})
Now, :onyx.peer/storage.zk.insanely-allow-windowing? doesn't sound like a good thing to do. Lucas Bradstreet recommended on the Clojurians Slack channel switching to S3 checkpointing.
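For reference, a hypothetical sketch of what switching the checkpoint storage to S3 might look like; the storage-related keys below are assumptions on my part and should be verified against the Onyx cheat sheet for your version:
(def peer-config
  {:zookeeper/address "127.0.0.1:2189"
   :onyx/tenancy-id id
   :onyx.peer/job-scheduler :onyx.job-scheduler/balanced
   ;; assumed keys -- verify against the Onyx cheat sheet:
   :onyx.peer/storage :s3
   :onyx.peer/storage.s3.bucket "my-onyx-checkpoints"
   :onyx.peer/storage.s3.region "us-east-1"
   :onyx.messaging/impl :aeron
   :onyx.messaging/peer-port 40200
   :onyx.messaging/bind-addr "localhost"})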

How do I suppress the ANSI coloring ring.middleware.logger is putting in my logs?

I've inherited a project that gets some logging magic through [ring.middleware.logger "0.5.0" :exclusions [org.slf4j/slf4j-log4j12]] in the project.clj. As the middlewares get set up ring.middleware.logger/wrap-with-logger comes in and that gets me some nice logging on each request like...
2016-03-25 15:46:03,787 a939 level=INFO [qtp509784188-34] core:288 - Starting :delete /v4/events/c.c.t.p.v4.api-a9c6d846-1da5-4593-a711-18d90aa8490f/test-layer/2015-05-31T00:00:00.000Z for 127.0.0.1 {"host" "localhost:50654", "user-agent" "Apache-HttpClient/4.3.6 (java 1.5)", "accept-encoding" "gzip, deflate", "connection" "close"}
2016-03-25 15:46:03,788 a939 level=INFO [qtp509784188-34] core:288 - \ - - - - Params: {}
2016-03-25 15:46:03,794 a939 level=INFO [qtp509784188-34] core:288 - Finished :delete /v4/events/c.c.t.p.v4.api-a9c6d846-1da5-4593-a711-18d90aa8490f/test-layer/2015-05-31T00:00:00.000Z for 127.0.0.1 in (6 ms) Status: 404
...the problem is that some of the fields in this logging come out ANSI-colorized. There is a request-id-like thing (above it's the "a939" field), as well as the "Starting", "Finished", and "Status" code, which are presented with ANSI colors. This has the unpleasant side effect of making it challenging to regex the logs in Splunk, as there are now control characters, which appear as ASCII digits, mucking up the works.
2016-03-25 15:46:03,794 [0m[35m[44ma939[0m level=INFO [qtp509784188-34] onelog.core - [36mFinished...Status: [39m200[0m
How can I suppress the ANSI colorization of the logging output through the ring.middleware.logger thing?
An alternative is to migrate to [ring-logger-onelog "0.7.6"]. I started it as a fork of ring.middleware.logger with the goal of making it more flexible. For example it includes an option :printer :no-color with which you can avoid all the ANSI colorization.
The migration path is very smooth, as shown in the README:
Replace dependency in project.clj from [ring.middleware.logger "0.5.0"] to [ring-logger-onelog "0.7.6"]
Replace the require from [ring.middleware.logger :as logger] to [ring.logger.onelog :as logger]
Pass options to logger/wrap-with-logger using a proper map instead of keyword arguments.
There's an example app that you can run to see what would be logged.
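Put together, a minimal sketch of the migrated setup using the :printer :no-color option mentioned above (handler is a placeholder for your own Ring handler):
(require '[ring.logger.onelog :as logger])

(def app
  (-> handler
      (logger/wrap-with-logger {:printer :no-color})))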
There's also the core ring-logger library that works with different logging backends (as opposed to ring-logger-onelog which relies on onelog, which ultimately goes through log4j... or slf4j, not sure). ring-logger doesn't have request-id built-in, but there's an example in the README that shows how you can implement it yourself.
It seems that ring.middleware.logger has ANSI coloring implemented by default and there is no configuration option to easily disable it without providing your own pre-logger, post-logger etc.
There is however a chance that you can create a workaround: ring.middleware.logger uses clansi for applying ANSI colors. clansi provides without-ansi macro to disable coloring.
You can write your own middleware that needs to wrap ring.middleware.logger:
;; without-ansi comes from clansi, the library ring.middleware.logger uses for coloring
(require '[clansi.core :refer [without-ansi]]) ; ns name may differ by clansi version

(defn wrap-no-ansi-colors [handler]
  (fn [rq]
    (without-ansi
      (handler rq))))

(def app
  (-> handler
      (wrap-with-logger)
      (wrap-no-ansi-colors)))

Persisting State from a DRPC Spout in Trident

I'm experimenting with Storm and Trident for this project, and I'm using Clojure and Marceline to do so. I'm trying to expand the wordcount example given on the Marceline page, such that the sentence spout comes from a DRPC call rather than from a local spout. I'm having problems which I think stem from the fact that the DRPC stream needs to have a result to return to the client, but I would like the DRPC call to effectively return null, and simply update the persisted data.
(defn build-topology
  ([]
   (let [trident-topology (TridentTopology.)]
     (let [;; ### Two alternatives here ###
           ;collect-stream (t/new-stream trident-topology "words" (mk-fixed-batch-spout 3))
           collect-stream (t/drpc-stream trident-topology "words")]
       (-> collect-stream
           (t/group-by ["args"])
           (t/persistent-aggregate (MemoryMapState$Factory.)
                                   ["args"]
                                   count-words
                                   ["count"]))
       (.build trident-topology)))))
There are two alternatives in the code - the one using a fixed batch spout loads with no problem, but when I try to load the code using a DRPC stream instead, I get this error:
InvalidTopologyException(msg:Component: [b-2] subscribes from non-existent component [$mastercoord-bg0])
I believe this error comes from the fact that the DRPC stream must be trying to subscribe to an output in order to have something to return to the client - but persistent-aggregate doesn't offer any such outputs to subscribe to.
So how can I set up my topology so that a DRPC stream leads to my persisted data being updated?
Minor update: Looks like this might not be possible :( https://issues.apache.org/jira/browse/STORM-38

Interleaving Watch Multi/exec on a single Redis connection. Expected or weird behavior?

Consider a front-facing app where every request shares the same Redis Connection, which I believe is the recommended way (?).
In this situation I believe I'm seeing some weird watch multi/exec behavior. Specifically, I would expect one of two transactions to fail because of optimistic locking failure (i.e.: the watch guard) but both seem to go through without throwing a tantrum, but result in the wrong final value.
To illustrate, see the contrived scenario below. It's in Node, but I believe it's a general thing. This runs 2 processes in parallel which both update a counter. (It basically implements the canonical example of WATCH as seen in the Redis docs.)
The expected result is that the first process results in an increment of 1 while the second fails to update and returns null. Instead, the result is that both processes update the counter with 1. However one is based on a stale counter so in the end the counter is incremented with 1 instead of 2.
//NOTE: db is a promisified version of node-redis, but that really doesn't matter
var db = Source.app.repos.redis._raw;

Promise.all(_.reduce([1, 2], function(arr, val) {
  db.watch("incr");
  var p = Promise.resolve()
    .then(function() {
      return db.get("incr");
    })
    .then(function(val) { //say 'val' returns '4' for both processes.
      console.log(val);
      val++;
      db.multi();
      db.set("incr", val);
      return db.exec();
    })
    .then(function(resultShouldBeNullAtLeastOnce) {
      console.log(resultShouldBeNullAtLeastOnce);
      return; //explicit end
    });
  arr.push(p);
  return arr;
}, [])).then(function() {
  console.log("done all");
  next(undefined);
})
The resulting interleaving is seen when tailing Redis' MONITOR command:
1414491001.635833 [0 127.0.0.1:60979] "watch" "incr"
1414491001.635936 [0 127.0.0.1:60979] "watch" "incr"
1414491001.636225 [0 127.0.0.1:60979] "get" "incr"
1414491001.636242 [0 127.0.0.1:60979] "get" "incr"
1414491001.636533 [0 127.0.0.1:60979] "multi"
1414491001.636723 [0 127.0.0.1:60979] "set" "incr" "5"
1414491001.636737 [0 127.0.0.1:60979] "exec"
1414491001.639660 [0 127.0.0.1:60979] "multi"
1414491001.639691 [0 127.0.0.1:60979] "set" "incr" "5"
1414491001.639704 [0 127.0.0.1:60979] "exec"
Is this expected behavior? Would using multiple redis connections circumvent this issue?
To answer my own question:
This is expected behavior. The first exec unwatches all keys. Therefore, the second multi/exec goes through without the watch guard.
It's in the docs, but it's fairly hidden.
Solution: use multiple connections, in spite of some answers on SO explicitly warning against this, since it (quote) 'shouldn't be needed'. In this situation IT IS needed.
Too late but for anyone reading this in the future, the solution suggested by Geert is not advised by Redis.
One request per connection
Many databases use the concept of REST as a primary interface—send a plain old HTTP request to an endpoint with arguments encoded as POST. The database grabs the information and returns it as a response with a status code and closes the connection. Redis should be used differently—the connection should be persistent and you should make requests as needed to a long-lived connection. However, well-meaning developers sometimes create a connection, run a command, and close the connection. While opening and closing connections per command will technically work, it’s far from optimal and needlessly cuts into the performance of Redis as a whole.
Using the OSS Cluster API, the connection to the nodes are maintained by the client as needed, so you’ll have multiple connections open to different nodes at any given time. With Redis Enterprise, the connection is actually to a proxy, which takes care of the complexity of connections at the cluster level.
TL;DR: Redis connections are designed to stay open across countless operations.
Best-practice alternative: Keep your connections open over multiple commands.
A better way to tackle this problem is to use Lua scripts to make your set of operations atomic.
See the EVAL command for running Redis scripts.
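As a rough illustration in Clojure (assuming the carmine client; the script and key names are made up for the example), the whole read-modify-write runs atomically inside Redis, so no WATCH/MULTI/EXEC bookkeeping is needed on the client:
(require '[taoensso.carmine :as car])

(def conn {:pool {} :spec {:host "127.0.0.1" :port 6379}})

;; The Lua script executes atomically server-side, so concurrent callers
;; cannot interleave between the read and the write.
(def incr-script "return redis.call('INCRBY', KEYS[1], ARGV[1])")

(car/wcar conn
  (car/eval incr-script 1 "incr" 1))
For a plain counter, a single INCR would of course already be atomic; the script form is what generalizes to multi-step read-modify-write logic.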