Apache Kafka and Storm Clojure implementation - clojure

This is my first time implementing stream processing infrastructure, and my poison of choice was Storm 1.0.1,
Kafka 0.9.0 and Clojure 1.5.
Now I have background working with a messaging system (RabbitMQ) and I liked it for a couple of reasons.
Simple to install and maintain
Nice frontend web portal
Persistent message states are maintained, so I can start a consumer and it knows which messages have not been consumed, i.e. "Exactly once"
However it cannot achieve the throughput I desire.
Having gone through Kafka, I see that it depends heavily on manually maintaining offsets (internally in the Kafka broker, in ZooKeeper, or externally).
I at long last managed to create a spout in Clojure with the Kafka broker as its source, which was a nightmare.
As in most scenarios, what I want is "Exactly once messaging", and the Kafka documentation states:
So effectively Kafka guarantees at-least-once delivery by default and allows the user to implement at most once delivery by disabling retries on the producer and committing its offset prior to processing a batch of messages. Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
What does this translate to for a Clojure Kafka spout? I am finding it hard to conceptualize.
I may have several bolts along the way, but the end point is a Postgres cluster. Am I to store the offset in the database (sounds like a race hazard waiting to happen) and fetch the offset from Postgres when my Storm cluster initializes?
Also is there any danger of setting my parallelism for the Kafka spout to a number greater than one?
I generally used this as a starting point, as examples for many things are just not available in Clojure, with a few minor tweaks for the version I am using. (My messages don't quite come out as I expect them, but at least I can see them.)
(def ^{:private true
       :doc "kafka spout config definition"}
  spout-config
  (let [cfg (SpoutConfig. (ZkHosts. "") "test" "/broker" (.toString (UUID/randomUUID)))]
    ;;(set! (. cfg scheme) (StringScheme.)) deprecated
    (set! (. cfg scheme) (SchemeAsMultiScheme. (StringScheme.)))
    ;;(.forceStartOffsetTime cfg -2)
    ;; return the configured object so the def holds it
    cfg))

(defn mk-topology []
  ;; first map: spouts, second map: bolts
  (topology
   {;;"1" (spout-spec sentence-spout)
    "1" (spout-spec my-kafka-spout :p 1)
    "2" (spout-spec (sentence-spout-parameterized
                     ["the cat jumped over the door"
                      "greetings from a faraway land"])
                    :p 2)}
   ;; the bolt implementations are not shown in the snippet as posted
   {"3" (bolt-spec {"1" :shuffle}
                   :p 5)
    "4" (bolt-spec {"3" ["word"]}
                   :p 1)}))

With any distributed system it's impossible to ensure that a portion of the work will be worked on exactly once. At some point something will fail, and the work will either need to be retried (this is called "at least once" processing) or not retried (this is called "at most once" processing). You can't land exactly in the middle of those two and get true "exactly once" processing. What you can get is very close to exactly-once processing.
The trick is to throw out the second copy at the end of your process if you find that the work was done twice. This is where the index (the Kafka offset) comes in. When you are saving the result into the database, check whether work with this index or a later one has already been saved. If it has, throw the work out and don't save it. As for the documentation, that's the kind of explanation that's only "straight-forward" to people who have done it many times...
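The "throw out the second copy" idea can be sketched roughly as follows. This is only an illustration in Python, using an in-memory SQLite database as a stand-in for Postgres; the table names, column names and the `save_once` helper are all made up for the example, and it treats "already saved" as "an offset equal to or greater than this one is recorded for the same partition". The result and the offset are written in one transaction, which is what avoids the race the question worries about.

```python
import sqlite3


def make_sink():
    """Create an in-memory database standing in for the Postgres end point."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (part INTEGER, msg_offset INTEGER, payload TEXT)")
    conn.execute(
        "CREATE TABLE progress (part INTEGER PRIMARY KEY, last_offset INTEGER NOT NULL)"
    )
    return conn


def save_once(conn, part, msg_offset, payload):
    """Save payload unless this offset (or a later one) was already saved.

    Returns True if the row was written, False if it was discarded as a replay.
    """
    with conn:  # one transaction: result and offset commit together or not at all
        row = conn.execute(
            "SELECT last_offset FROM progress WHERE part = ?", (part,)
        ).fetchone()
        if row is not None and row[0] >= msg_offset:
            return False  # duplicate delivery after a retry: throw it out
        conn.execute(
            "INSERT INTO results (part, msg_offset, payload) VALUES (?, ?, ?)",
            (part, msg_offset, payload),
        )
        conn.execute(
            "INSERT OR REPLACE INTO progress (part, last_offset) VALUES (?, ?)",
            (part, msg_offset),
        )
        return True
```

Tracking the offset per partition matters once the spout parallelism is greater than one, because offsets are only ordered within a single Kafka partition.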


How can I create "n" number of arbitrary Actors to do concurrent processing?

I am very new to Akka (just started looking today) and believe I have a need to create a program using Akka that reads messages from Kafka. Say for example my message looks like "{weather: Rainy, zipcode: 123456, temperature: 55 }". I would like to route each message as it comes in based on specific Zipcode to an Actor that handles messages related to that zipcode.
I guess I have 2 problems here.
1) At the start of my application, I'm not sure what number of Actors I will need.
2) How to specify that an Actor belongs to a specific zipcode and route the messages there?
3-ish) Is this something I can use Akka + Kafka for? Or something more suited towards streaming like Alpakka?
1) Not a problem, given the answer to 2.
2) Have a look at cluster sharding: https://doc.akka.io/docs/akka/current/cluster-sharding.html
3) It really depends on what exactly you need; you could also use pure Kafka Streams.
If you need to scale out (i.e., you have more than can fit inside a single JVM given the volume/throughput requirements), then you should consider either cluster sharding or pure Kafka Streams. Which one to pick is another question.
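To make the "one handler per zipcode, created on demand" idea concrete, here is a rough single-process sketch in Python. It is not Akka and not cluster sharding; `ZipcodeWorker` and `Router` are hypothetical names, and the point is only that the number of handlers does not need to be known up front, because the key extracted from each message (the zipcode) decides which handler gets it.

```python
class ZipcodeWorker:
    """Stand-in for an actor that owns all messages for one zipcode."""

    def __init__(self, zipcode):
        self.zipcode = zipcode
        self.temperatures = []

    def handle(self, message):
        self.temperatures.append(message["temperature"])


class Router:
    """Routes each message to the worker for its zipcode, creating it lazily."""

    def __init__(self):
        self.workers = {}

    def route(self, message):
        zipcode = message["zipcode"]  # the routing key, like a sharding entity id
        worker = self.workers.get(zipcode)
        if worker is None:
            # First message for this zipcode: create its worker now.
            worker = self.workers[zipcode] = ZipcodeWorker(zipcode)
        worker.handle(message)
        return worker
```

Cluster sharding applies the same key-to-entity mapping, but spreads the entities across several nodes.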

Apache beam / PubSub time delay before processing files

I need to delay processing or publishing filenames (files).
I am looking for the best option.
Currently I have two Apache Beam Dataflow jobs with PubSub in between. The first job reads filenames from the source and pushes them to a PubSub topic. The other job reads them and processes them. However, my use case is to start processing/reading the actual files at least 1 hour after they are created in the source.
So I have two options:
1) Delay publishing a message, so that it can be processed right away but at the expected moment
2) Delay processing of retrieved files
As mentioned above, I am looking for the best solution. I am not sure whether the Guava retry mechanism should be used in Apache Beam. Any other ideas?
You could likely achieve what you want via triggering/window configuration in the publishing job.
Then, you could define a windowing configuration where the trigger does not fire until after a 1 hour delay. Something like:
Keep in mind that you'll end up with a job that's simply sitting doing not much of anything except holding onto state for an hour. Also, the above is based solely on processing time, so it will wait an hour after job start even if the actual creation time of the files is old enough that it could emit the results immediately.
You could refine this to an event time trigger, but you would likely need to write your own code to assign timestamps to your records (the filenames). To my knowledge, Beam does not currently have built-in support for reading the creation time of files. When reading files via TextIO, for example, I have observed that the records are all assigned a default static timestamp. You should check the specifics of the transform you're using to read filenames to see if it perhaps does something more useful for your purposes. You can also use a WithTimestamps transform to assign timestamps on your own.
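The difference between delaying by processing time and delaying by the file's own creation time can be illustrated outside Beam. The following is a plain-Python sketch, not Beam code; `DelayBuffer` is a hypothetical helper, and it assumes each filename arrives together with a creation timestamp in seconds, which you would have to supply yourself as discussed above.

```python
import heapq

DELAY_SECONDS = 60 * 60  # the one-hour minimum age from the question


class DelayBuffer:
    """Holds filenames until they are at least `delay` seconds old."""

    def __init__(self, delay=DELAY_SECONDS):
        self.delay = delay
        self._heap = []  # entries are (release_time, filename)

    def add(self, filename, created_at):
        # Release time is based on when the file was created,
        # not on when this buffer first saw it.
        heapq.heappush(self._heap, (created_at + self.delay, filename))

    def ready(self, now):
        """Pop and return every filename whose release time has passed."""
        released = []
        while self._heap and self._heap[0][0] <= now:
            released.append(heapq.heappop(self._heap)[1])
        return released
```

A file that is already older than an hour when it is added is released on the next call to `ready`, which is the behaviour the processing-time trigger cannot give you.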

Why should I use Simple Queue Service (SQS) over ElastiCache on AWS

SQS seems really easy to use but has some message size restrictions, e.g. a 256 KB message size (really, really small). On the other hand, ElastiCache seems to be more high-end? I am not sure if these assumptions are right - please correct me.
I am in the process of deploying an application on AWS that will use one or the other type of message passing (and/or caching) system. In what situations would I choose one over the other?
Comparing SQS to ElastiCache is somewhat like comparing... postcards to filing cabinets.
Which one is "better?"
It depends on what you want to accomplish, and in both cases, they have very little overlap in their functionality other than the fact that they both transiently store information.
A cache, like ElastiCache, is a place where information that's frequently accessed can be stored for frequent retrieval when repeatedly fetching that information from its authoritative source (often a database) is more expensive (typically in terms of resources or time) than fetching it from the cache would be. A cache is more like a filing cabinet with an open back that automatically drains old documents into the shredder any time there's a need to store new documents. This is referred to as eviction from the cache. Because of its purpose, the information stored in a cache is typically not considered durably stored. A node fails, or the data is evicted, and what you stored is not there any more.
“Also implicit is the fact that the cache can expire or evict values if they become too old or if the cache becomes full.”
— http://aws.amazon.com/blogs/aws/amazon-elasticache-distributed-in-memory-caching/
No problem, because the cache wasn't the authoritative data source... but while the data was there, you could access it very quickly. If you look something up and the cache doesn't have it, you go to the authoritative source for the data, and optionally, send a copy of it back to the cache so the next entity to ask for it may find it in the cache.
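The lookup flow just described (try the cache, fall back to the authoritative source, then copy the value back into the cache) might look like the sketch below. It is a toy in-process Python example, not ElastiCache client code; `LRUCache`, `fetch` and the `stats` dictionary are illustrative names, and a plain dict plays the role of the database.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # eviction: oldest entry is dropped


def fetch(key, cache, source, stats):
    """Read through the cache; on a miss, go to the authoritative source."""
    value = cache.get(key)
    if value is None:  # note: this sketch cannot cache a stored None
        stats["misses"] += 1
        value = source[key]
        cache.put(key, value)  # so the next caller may find it in the cache
    return value
```

Losing an entry to eviction only costs one extra trip to the source, which is why the cache does not need to be durable.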
When you want something from a cache, you go ask for that thing, specifically.
On the other hand, a queue, like Simple Queue Service (SQS), is more like a postcard -- or, at least, that's what queue messages are like. You write the message, send it into the queue, and it pops out the other end -- once, typically. Ordering of messages isn't guaranteed (though it's common for them to arrive in order) and messages are guaranteed to be delivered "at least once" (though, again, typically, duplicated messages are rare -- still, it's a massive, distributed infrastructure, so duplicated deliveries are possible).
When you want something from a queue, it sends you the next message in line -- you don't select which one.
If you need to cache information for random, quick, and repeated retrieval, and the information is disposable and recreatable, then of course ElastiCache is the choice of the two.
If you need to send messages between two parts of a system that run independently or at different speeds, then you're looking for a message queue, such as SQS. Typically, the small payload size limit is more than sufficient, because it's unnecessary to send a block of data in the queue message itself. Instead, you send a reference to the data, a pointer, an "id" from a transaction table or a URL to a web-accessible object (perhaps stored in S3) and the queue consumer can then fetch the chunk of data referenced by the queue message, and act on it.
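Sending a reference instead of the data itself can be sketched like this. The example is plain Python with in-memory stand-ins: `blob_store` plays the role of S3 and `work_queue` the role of SQS, and both names, like `send_large` and `receive`, are invented for the illustration rather than taken from any AWS SDK.

```python
import queue
import uuid

blob_store = {}             # stand-in for an object store such as S3
work_queue = queue.Queue()  # stand-in for the message queue


def send_large(payload):
    """Store the payload out of band and enqueue only a small pointer to it."""
    key = str(uuid.uuid4())
    blob_store[key] = payload
    work_queue.put({"payload_key": key, "size": len(payload)})
    return key


def receive():
    """Take the next message and fetch the data it points to."""
    message = work_queue.get_nowait()
    return blob_store[message["payload_key"]]
```

The queue message stays tiny no matter how large the payload is, so the size limit rarely gets in the way.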
Got the use case right, through experience. Actually deployed code on AWS about 2 years ago. Went with SQS because it was just what I needed to interface two parts of my cloud application. And indeed... 256 KB is way more than enough size ;).

Inotify-like feature in a distributed file system

As the title says, I want to trigger a notification when certain events happen.
Such an event can be user-defined, for example specified files being updated within 1 minute.
If the files are stored locally, I can easily do this with the inotify system call, but in my case the files are located on a distributed file system such as mfs.
How can I do this? I would like to know if there are any solutions or open-source projects that solve this problem. Thanks.
If you have only black-box access (e.g. NFS protocol) to the remote system(s), you don't have many options unless the protocol supports what you need. So I'll assume you have control over the remote systems.
The "trivial" approach is running a local inotify/fanotify listener on each computer that would forward the notification over the network. FAM can do this over NFS.
A problem with all notification-based systems is the risk of lost notifications in various edge cases. This becomes much more acute over a network - e.g. the client confirms receipt of a notification, then immediately crashes. There are reliable message queues you can build on but IMHO this way lies madness...
A saner approach is stateless hash-based scan.
I like to call the following design "hnotify" but that's not an established term. The ideas are widely used by many version control and backup systems, dating back to Plan 9.
The core idea is if you know cryptographic hashes for files, you can compose a single hash that represents a directory of files - it changes if any of the files changed - and you can build these bottom-up to represent the whole filesystem's state.
(Git stores things this way and is very efficient at it.)
Why are hash trees cool? If you have 2 hash trees — one representing the filesystem state you saw at point in the past, one representing the current state — you can easily find out what changed between them:
1) You start at the roots. If they are different you read the 2 root directories and compare hashes for subdirectories.
2) If a subdirectory has the same hash in both trees, then nothing under it changed. No point going there.
3) If a subdirectory's hash changed, compare its contents recursively: go back to step (1).
4) If one has a subdirectory the other doesn't, well that's a change. With some global table you can also detect moves/renames.
Note that if few files changed, you only read a small portion of the current state. So the remote system doesn't have to send you the whole tree of hashes, it can be an interactive ping-pong of "give me hashes for this directory; ok now for this...".
(This is akin to how Git's dumb http protocol worked; there is a newer protocol with fewer round trips.)
This is as robust and bug-proof as polling the whole filesystem for changes — you can't miss anything — but reasonably efficient!
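The comparison walk described above can be sketched in a few lines of Python. This is an assumption-laden toy: a directory is modelled as a nested dict and a file as a `bytes` value, `tree_hash` and `diff` are names made up for the example, and hashes are recomputed on every call, whereas a real system would store them per directory.

```python
import hashlib


def tree_hash(node):
    """Hash a node: bytes is a file, dict is a directory of name -> node."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"file:" + node).hexdigest()
    digest = hashlib.sha256(b"dir:")
    for name in sorted(node):
        # A directory's hash covers every child name and child hash.
        digest.update(name.encode() + b"\0" + tree_hash(node[name]).encode() + b"\n")
    return digest.hexdigest()


def diff(old, new, path=""):
    """Return paths that differ, skipping any subtree whose hash is unchanged."""
    if tree_hash(old) == tree_hash(new):
        return []  # identical hash: nothing under here changed
    if isinstance(old, bytes) or isinstance(new, bytes):
        return [path or "/"]
    changed = []
    for name in sorted(set(old) | set(new)):
        child = f"{path}/{name}"
        if name not in old or name not in new:
            changed.append(child)  # added or removed entry
        else:
            changed.extend(diff(old[name], new[name], child))
    return changed
```

For example, with `old = {"a": {"x": b"1", "y": b"2"}, "b": {"z": b"3"}}` and a `new` tree where only `y` changed and an empty directory `c` was added, `diff(old, new)` returns `["/a/y", "/c"]` and never descends into `b`.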
But how does the server track current hashes?
Unfortunately, fully hashing all disk writes is too expensive for most people. You may get it for free if you're lucky enough to be running a deduplicating filesystem, e.g. ZFS or Btrfs.
Otherwise you're stuck with re-reading all changed files (which is even more expensive than doing it in the filesystem layer) or using fake file hashes: upon any change to a file, invent a new random "hash" to invalidate it (and try to keep the fake hashes on moves). Still compute real hashes up the tree. Now you may have false positives — you "detect a change" when the content is the same — but never false negatives.
Anyway, the point is that whatever stateful hacks you do (e.g. inotify with periodic scans to be sure), you only do them locally on the server. Across the network, you only ever send hashes that represent snapshots of current state (or its subtrees)! This way you can have a distributed system with many servers and clients, intermittent connectivity, and still keep your sanity.
P.S. Btrfs can efficiently find differences from an older snapshot. But this is a snapshot taken on the server (and causing all data to be preserved!), less flexible than a client-side lightweight tree-of-hashes.
P.S. One of your tags is HadoopFS. I'm not really familiar with it, but I suspect a lot of its files are write-once-then-immutable, and it might be able to natively give you some kind of file/chunk ids that can serve as fake hashes?
Existing tools
The first tool that springs to my mind is bup index. bup is a very clever deduplicating backup tool built on git (but scalable to huge data), so it sits on the foundation described above. In theory, indexing data in bup on the server and doing git fetch over the network would even implement the hash-walking comparison of what's new that I described above. Unfortunately, the git repositories that bup produces are too big for git itself to cope with. Also you probably don't want bup to read and store all your data. But bup index is a separate subsystem that quickly scans a filesystem for potential changes, without yet reading the changed files.
Currently bup doesn't use inotify but it's been discussed in depth.
Oh, and bup uses Bloom filters, which are a nearly optimal way to represent sets with false positives. I'm almost certain Bloom filters have a role to play in optimizing stateless notification protocols ("here is a compressed bitmap of all I have; you should be able to narrow your queries with it" or "here is a compressed bitmap of what I want to be notified about"). Not sure if the way bup uses them is directly useful to you, but this data structure should definitely be in your toolbelt.
Another tool is git annex. It's also based on Git (are you noticing a trend?) but is designed to keep the data itself out of Git repos (so git fetch should just work!) and has a "WORM" option that uses fake hashes for faster performance.
Alternative design: compressed replayable journal
I used to think the above is the only sane stateless approach for clients to check what's changed. But I just read http://arstechnica.com/apple/2007/10/mac-os-x-10-5/7/ about OS X's FSEvents framework, which has a perhaps simpler design:
ALL changes are logged to a file. It's kept forever.
Clients can ask "replay for me everything since event 51348".
The magic trick is the log has coarse granularity ("something in this directory changed, go re-scan it to find out what", repeated changes within 30 seconds are combined) so this journal file is very compact.
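A coarse, replayable journal along these lines might be sketched as follows. This is a Python illustration of the idea only, not how FSEvents is implemented; the `Journal` class and its methods are hypothetical, and it assumes something calls `flush` periodically (for instance every 30 seconds) so that repeated changes in the same directory collapse into one logged event.

```python
import posixpath


class Journal:
    """Append-only log of 'something changed in this directory' events."""

    def __init__(self):
        self.events = []    # (event_id, directory), kept forever
        self._pending = {}  # directories changed since the last flush

    def record(self, changed_path):
        # Coarse granularity: remember only the containing directory.
        self._pending[posixpath.dirname(changed_path)] = True

    def flush(self):
        """Write one event per distinct directory touched since the last flush."""
        for directory in self._pending:
            self.events.append((len(self.events) + 1, directory))
        self._pending.clear()

    def replay_since(self, event_id):
        """Directories a client should re-scan, given the last event it saw."""
        return [d for (eid, d) in self.events if eid > event_id]
```

A client only has to remember the last event id it processed, so it can disconnect for any length of time and still catch up.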
At the low level you might resort to similar techniques — e.g. hashes — but the top-level interface is different: instead of snapshots you deal with a timeline of events. It may be an easier fit for some applications.

What's the difference between pub and mult in core.async? & a sample usecase?

I've been using core.async for some time, but avoided pub and mult, since I can't really grasp a useful usecase from their documentation.
Specifically what's the purpose of the topic-fn and how would you use it in practice?
Or maybe you can map a theoretical explanation onto the following fictive approach. I think this could help a lot to see how it works in practice (if applicable at all?)
Fictive approach explained:
There would be several different views to represent the state. To let them act and respond to state-changes, I would like to have several channels (on an application level), which are - for example - dedicated to state-changes and user-inputs (like key presses).
Each of the views should be able to sub(scribe) ? to this application channel, so they can react independently to changes. Also each of the views should be possible to put something on the state-channel (but not the user-input-chan).
Channels in core.async are single put, single take. That is to say, any message going in is given to only one taker. This doesn't work well in broadcast situations where many go blocks need a copy of each message put into a channel; for that you need something else. This is what mult is useful for. Mult could probably also be called "broadcast".
Pub is then mult + multimethods. topic-fn is a function that is applied to each input item. The output of the function decides the topic of the message. The input message is then only broadcast to those subscribers who are listening to that topic.
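The relationship between the two can be modelled outside Clojure. The sketch below is a Python analogy, not core.async: the `Mult` and `Pub` classes are stand-ins built on `queue.Queue`, and the message shape (a dict with a `"type"` key) is an assumption borrowed from the state-change / user-input example in the question.

```python
import queue


class Mult:
    """Broadcast: every tapped queue gets its own copy of each message."""

    def __init__(self):
        self._taps = []

    def tap(self):
        q = queue.Queue()
        self._taps.append(q)
        return q

    def put(self, message):
        for q in self._taps:
            q.put(message)


class Pub:
    """Broadcast per topic: topic_fn picks which subscribers see a message."""

    def __init__(self, topic_fn):
        self._topic_fn = topic_fn
        self._mults = {}  # topic -> Mult for that topic's subscribers

    def sub(self, topic):
        return self._mults.setdefault(topic, Mult()).tap()

    def put(self, message):
        mult = self._mults.get(self._topic_fn(message))
        if mult is not None:  # messages on topics nobody subscribed to are dropped
            mult.put(message)
```

With `pub = Pub(lambda m: m["type"])`, a view that calls `pub.sub("state-change")` receives every state change but never sees key presses published under `"user-input"`.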
More information is in the notes from my talk at the last Conj, available here: https://github.com/halgari/clojure-conj-2013-core.async-examples/blob/master/src/clojure_conj_talk/core.clj#L398