Exception handling in a batch of Event Hub events using Azure WebJobs Sdk - azure-eventhub

I use the EventHub support of the Azure WebJobs Sdk to process Events. Because of the throughput I decided to go for batch processing of those Events, e.g. my method looks like this:
public static void HandleBatchRaw([EventHubTrigger("std")] EventData[] events) {...}
Now one of those events within a batch might cause an Exception - what's the right way to handle that? When I leave the Exception uncaught the processing stops and the remainder of the Events in the EventData[] parameter get lost.
Options:
Catch the Exception manually, forward the Event to some place
else and continue
Let the SDK do the magic, e.g. it should just
'ACK' the Events processed until then (I probably would have to do that), mark this event as 'Poisoned', exit the method and continue on the next call of the function.
Move to Single Event Handling - but for performance
goals I don't feel that's right
I missed the point and should think of another strategy
How should I approach this?

There are only four choices in any messaging solution:
1 Stop
2 Drop
3 Retry
4 Dead letter
You have to do that. I don't believe that SDK will retry anything. Recall there is no ACK for Event Hubs read, you just read.
How are you checkpointing?

Your best bet is probably your option #1. WebJobs EventHub binding doesn't give you many options here. Feel free to file an issue at https://github.com/Azure/azure-webjobs-sdk/issues to request better error handling support here.
If you want to see exactly what it's doing under the hood, here's the spot in the WebJobs SDK EventHub binding that receives events via EventProcessorHost:
https://github.com/Azure/azure-webjobs-sdk/blob/dev/src/Microsoft.Azure.WebJobs.ServiceBus/EventHubs/EventHubListener.cs#L86

Related

Event Pattern for creating a rule/trigger (CloudWatch-Lambda) when AWS Step Function `ExecutionsFailed`

When my AWS Step Functions' State Machine fails ExecutionsFailed , I'd like to trigger a lambda function in response to it.
Seems that you have to create a rule on CloudWatch; but I couldn't find description on how to do that (in particular how the Event Patterns supposed to look like).
p.s. in my case it happens due to exceeding 25,000 history limit (so not quite so easy to handle within state machine; without having to add loop counters etc.; so I'd prefer for it to fail; and then handle it via lambda)
Current workaround is to create a cron rule for a scheduled event on a cloudwatch; and the check the state machine; and in case it is failed; handle it.

Azure Event Hub ServiceBusException causing skipped messages

We are using the Azure Java event hub library to read messages out of an event hub. Most of the time it works perfectly, but periodically we see exceptions of type "com.microsoft.azure.servicebus.ServiceBusException" occur that correspond to times when messages seem to be skipped that are in the event hub.
Here are some examples of exception details:
"The message container is being closed (some number here)."
This generally hits multiple partitions at the same time, but not all.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
"The link 'xxx' is force detached by the broker due to errors occurred in consumer(link#). Detach origin: InnerMessageReceiver was closed."
This is generally tied to com.microsoft.azure.servicebus.amqp.AmqpException exceptions.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
Example callstack:
at com.microsoft.azure.servicebus.ExceptionUtil.toException(ExceptionUtil.java:93)
at com.microsoft.azure.servicebus.MessageReceiver.onError(MessageReceiver.java:393)
at com.microsoft.azure.servicebus.MessageReceiver.onClose(MessageReceiver.java:646)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.processOnClose(BaseLinkHandler.java:83)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.onLinkRemoteClose(BaseLinkHandler.java:52)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:309)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:276)
at com.microsoft.azure.servicebus.MessagingFactory$RunReactor.run(MessagingFactory.java:340)
at java.lang.Thread.run(Thread.java:745)
There doesn't seem to be a way for clients of the library to recognize a problem occurs and avoid moving ahead in the event hub past our skipped messages. Has anyone else run into this? Is there some other way to recognize and avoid skipping or retrying missed messages?
This error DOESN'T SKIP any messages - it will throw an Exception, when it shouldn't have. This will result in EPH to RESTART the affected Partitions' Receiver. If the application using EventHubs javaclient doesn't handle the errors - they may experience loss of messages.
This is a bug in our retry logic - in the current version of EventHubs JavaClient - until 0.11.0.
Here's the corresponding issue to track progress.
In EventHubs service - these errors happen if - for any reason - the Container hosting your EventHubs' code has to close (for the sake of the explanation, imagine we run a set of Container's - like DockerContainers for every EventHub namespace) - this is a transient error - this Container will eventually be opened in another Node.
Our javaclient-retry logic should have handled this error and should have retried - Will keep this thread posted with the fix.
EDIT
We just released 0.12.0 - which fixes this issue.
Thanks!
Sreeram

Cloud Dataflow: Side effect when watermark advances

Working with a streaming, unbounded PCollection in Google Dataflow that originates from a Cloud PubSub subscription. We are using this as a firehose to simply deliver events to BigTable continuously. Everything with the delivery is performing nicely.
Our problem is that we have downstream batch jobs that expect to read a day's worth of data out of BigTable once it is delivered. I would like to utilize windowing and triggering to implement a side effect that will write a marker row out to bigtable when the watermark advances beyond the day threshold, indicating that dataflow has reason to believe that most of the events have been delivered (we don't need strong guarantees on completeness, just reasonable ones) and that downstream processing can begin.
What we've tried is write out the raw events as one sink in the pipeline, and then window into another sink, using the timing information in the pane to determine if the watermark has advanced. The problem with this approach is that it operates upon the raw events themselves again, which is undesirable since it would repeat writing the event rows. We can prevent this write, but the parallel path in the pipeline would still be operating over the windowed streams of events.
Is there an effecient way to attach a callback-of-sorts to the watermark, such that we can perform a single action when the watermark advances?
The general ability to set a timer in event time and receive a callback is definitely an important feature request, filed as BEAM-27, which is under active development.
But actually your approach of windowing into FixedWindows.of(Duration.standardDays(1)) seems like it will accomplish your goal using just the features of the Dataflow Java SDK 1.x. Instead of forking your pipeline, you can maintain the "firehose" behavior by adding the trigger AfterPane.elementCountAtLeast(1). It does incur the cost of a GroupByKey but does not duplicate anything.
The complete pipeline might look like this:
pipeline
// Read your data from Cloud Pubsub and parse to MyValue
.apply(PubsubIO.Read.topic(...).withCoder(MyValueCoder.of())
// You'll need some keys
.apply(WithKeys.<MyKey, MyValue>of(...))
// Window into daily windows, but still output as fast as possible
.apply(Window.into(FixedWindows.of(Duration.standardDays(1)))
.triggering(AfterPane.elementCountAtLeast(1)))
// GroupByKey adds the necessary EARLY / ON_TIME / LATE labeling
.apply(GroupByKey.<MyKey, MyValue>create())
// Convert KV<MyKey, Iterable<MyValue>>
// to KV<ByteString, Iterable<Mutation>>
// where the iterable of mutations has the "end of day" marker if
// it was ON_TIME
.apply(MapElements.via(new MessageToMutationWithEndOfWindow())
// Write it!
.apply(BigTableIO.Write.to(...);
Please do comment on my answer if I have missed some detail of your use case.

Does the Zookeeper Watches system have a bug, or is this a limitation of the CAP theorem?

The Zookeeper Watches documentation states:
"A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode." Furthermore, "Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper."
The point is, there is no guarantee you'll get a watch notification.
This is important, because in a sytem like Clojure's Avout, you're trying to mimic Clojure's Software Transactional Memory, over the network using Zookeeper. This relies on there being a watch notification for every change.
Now I'm trying to work out if this is a coding flaw, or a fundamental computer science problem, (ie the CAP Theorem).
My question is: Does the Zookeeper Watches system have a bug, or is this a limitation of the CAP theorem?
This seems to be a limitation in the way ZooKeeper implements watches, not a limitation of the CAP theorem. There is an open feature request to add continuous watch to ZooKeeper: https://issues.apache.org/jira/browse/ZOOKEEPER-1416.
etcd has a watch function that uses long polling. The limitation here which you need to account for is that multiple events may happen between receiving the first long poll result, and re-polling. This is roughly analogous to the issue with ZooKeeper. However they have a solution:
However, the watch command can do more than this. Using the index [passing the last index we've seen], we can watch for commands that have happened in the past. This is useful for ensuring you don't miss events between watch commands.
curl -L 'http://127.0.0.1:4001/v2/keys/foo?wait=true&waitIndex=7'

What happens to running processes on a continuous Azure WebJob when website is redeployed?

I've read about graceful shutdowns here using the WEBJOBS_SHUTDOWN_FILE and here using Cancellation Tokens, so I understand the premise of graceful shutdowns, however I'm not sure how they will affect WebJobs that are in the middle of processing a queue message.
So here's the scenario:
I have a WebJob with functions listening to queues.
Message is added to Queue and job begins processing.
While processing, someone pushes to develop, triggering a redeploy.
Assuming I have my WebJobs hooked up to deploy on git pushes, this deploy will also trigger the WebJobs to be updated, which (as far as I understand) will kick off some sort of shutdown workflow in the jobs. So I have a few questions stemming from that.
Will jobs in the middle of processing a queue message finish processing the message before the job quits? Or is any shutdown notification essentially treated as "this bitch is about to shutdown. If you don't have anything to handle it, you're SOL."
If we are SOL, is our best option for handling shutdowns essentially to wrap anything you're doing in the equivalent of DB transactions and implement your shutdown handler in such a way that all changes are rolled back on shutdown?
If a queue message is in the middle of being processed and the WebJob shuts down, will that message be requeued? If not, does that mean that my shutdown handler needs to handle requeuing that message?
Is it possible for functions listening to queues to grab any more queue messages after the Job has been notified that it needs to shutdown?
Any guidance here is greatly appreciated! Also, if anyone has any other useful links on how to handle job shutdowns besides the ones I mentioned, it would be great if you could share those.
After no small amount of testing, I think I've found the answers to my questions and I hope someone else can gain some insight from my experience.
NOTE: All of these scenarios were tested using .NET Console Apps and Azure queues, so I'm not sure how blobs or table storage, or different types of Job file types, would handle these different scenarios.
After a Job has been marked to exit, the triggered functions that are running will have the configured amount of time (grace period) (5 seconds by default, but I think that is configurable by using a settings.job file) to finish before they are exited. If they do not finish in the grace period, the function quits. Main() (or whichever file you declared host.RunAndBlock() in), however, will finish running any code after host.RunAndBlock() for up to the amount of time remaining in the grace period (I'm not sure how that would work if you used an infinite loop instead of RunAndBlock). As far as handling the quit in your functions, you can essentially "listen" to the CancellationToken that you can pass in to your triggered functions for IsCancellationRequired and then handle it accordingly. Also, you are not SOL if you don't handle the quits yourself. Huzzah! See point #3.
While you are not SOL if you don't handle the quit (see point #3), I do think it is a good idea to wrap all of your jobs in transactions that you won't commit until you're absolutely sure the job has ran its course. This way if your function exits mid-process, you'll be less likely to have to worry about corrupted data. I can think of a couple scenarios where you might want to commit transactions as they pass (batch jobs, for instance), however you would need to structure your data or logic so that previously processed entities aren't reprocessed after the job restarts.
You are not in trouble if you don't handle job quits yourself. My understanding of what's going on under the covers is virtually non-existent, however I am quite sure of the results. If a function is in the middle of processing a queue message and is forced to quit before it can finish, HAVE NO FEAR! When the job grabs the message to process, it will essentially hide it on the queue for a certain amount of time. If your function quits while processing the message, that message will "become visible" again after x amount of time, and it will be re-grabbed and ran against the potentially updated code that was just deployed.
So I have about 90% confidence in my findings for #4. And I say that because to attempt to test it involved quick-switching between windows while not actually being totally sure what was going on with certain pieces. But here's what I found: on the off chance that a queue has a new message added to it in the grace period b4 a job quits, I THINK one of two things can happen: If the function doesn't poll that queue before the job quits, then the message will stay on the queue and it will be grabbed when the job restarts. However if the function DOES grab the message, it will be treated the same as any other message that was interrupted: it will "become visible" on the queue again and be reran upon the restart of the job.
That pretty much sums it up. I hope other people will find this useful. Let me know if you want any of this expounded on and I'll be happy to try. Or if I'm full of it and you have lots of corrections, those are probably more welcome!