Azure Event Hub ServiceBusException causing skipped messages - azure-eventhub

We are using the Azure Java event hub library to read messages out of an event hub. Most of the time it works perfectly, but periodically we see exceptions of type "com.microsoft.azure.servicebus.ServiceBusException" that correspond to times when messages in the event hub appear to be skipped.
Here are some examples of exception details:
"The message container is being closed (some number here)."
This generally hits multiple partitions at the same time, but not all.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
"The link 'xxx' is force detached by the broker due to errors occurred in consumer(link#). Detach origin: InnerMessageReceiver was closed."
This is generally tied to com.microsoft.azure.servicebus.amqp.AmqpException exceptions.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
Example callstack:
at com.microsoft.azure.servicebus.ExceptionUtil.toException(ExceptionUtil.java:93)
at com.microsoft.azure.servicebus.MessageReceiver.onError(MessageReceiver.java:393)
at com.microsoft.azure.servicebus.MessageReceiver.onClose(MessageReceiver.java:646)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.processOnClose(BaseLinkHandler.java:83)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.onLinkRemoteClose(BaseLinkHandler.java:52)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:309)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:276)
at com.microsoft.azure.servicebus.MessagingFactory$RunReactor.run(MessagingFactory.java:340)
at java.lang.Thread.run(Thread.java:745)
There doesn't seem to be a way for clients of the library to recognize that a problem has occurred and avoid moving ahead in the event hub past our skipped messages. Has anyone else run into this? Is there some other way to recognize the problem and avoid skipping messages, or to retry the ones that were missed?

This error DOESN'T SKIP any messages - it throws an exception when it shouldn't have, which causes EPH to RESTART the affected partitions' receivers. If the application using the EventHubs Java client doesn't handle these errors, it may experience loss of messages.
This is a bug in our retry logic in the current EventHubs JavaClient - all versions up to and including 0.11.0.
Here's the corresponding issue to track progress.
In the EventHubs service, these errors happen when, for any reason, the container hosting your event hub's code has to close (for the sake of the explanation, imagine we run a set of containers - like Docker containers - for every EventHub namespace). This is a transient error; the container will eventually be brought up on another node.
Our Java client's retry logic should have handled this error and retried. I will keep this thread posted with the fix.
EDIT
We just released 0.12.0 - which fixes this issue.
Thanks!
Sreeram

Related

Akka EventsourcedBehavior JournalFailureException but stack trace doesn't show underlying cause

I have an Akka persistence app (EventSourcedBehavior-based actors, Akka 2.6.13) using akka-persistence-jdbc 3.5.3 for the journal/snapshot plugin, along with a CockroachDB cluster. Things work fine, but recently I've seen a lot of event persist failures, and the error logs do not show any underlying cause of the issue - no SQL-level exceptions in the trace at all. At the same time, we usually see errors while actors are being restored, with the journal again throwing JournalFailureExceptions but no underlying reason.
If I can't see any underlying reasons (the only thing the logs show is "async write timed out after 5.00 s" - is this timeout value configurable?), does this mean there is something else causing the issues, unrelated to the journal plugin implementation or the database? How can I debug this further? I've examined the message handler in my EventSourcedBehavior that failed when persisting an event to see if it is doing anything weird or blocking, but I can't see anything obviously wrong.
Any ideas welcome.
Thanks
The JournalFailureExceptions likely indicate connectivity or slow responses from the DB. If it's just slowness, scaling out/up the cockroach cluster might help.
"async write timed out after" is from cluster sharding's remember-entities feature (that's the only occurrence in Akka) which also indicates connectivity issues or slow responses from the DB.
There is most likely no problem with your behaviors. It's worth noting that remember-entities (especially in eventsourced mode; ddata mode is a little better in this regard if you're OK with not remembering entities across full-cluster restarts) itself puts a substantial load on persistence and your DB, and in my experience it is consistently counterproductive once you have more than a few hundred entities. Unless you've actually tried disabling it and seen an actual net negative effect, I suggest disabling remember-entities.

ZMQ - Client Server: Client is powered off unexpectedly, how server detects it?

Multiple clients are connected to a single ZMQ_PUSH socket. When a client is powered off unexpectedly, the server does not get an alert and keeps sending messages to it. Despite using ZMQ_OBLOCK and setting ZMQ_HWM to 5 (queue only 5 messages at max), my server doesn't get an error until the client reconnects and all the messages in the queue are received at once.
I recently ran into a similar problem when using ZMQ. We would cut power to interconnected systems, and the subscriber would be unable to reconnect automatically. It turns out that a heartbeat mechanism has recently (in the past year or so) been implemented in ZMTP, the underlying protocol used by ZMQ sockets.
If you are using ZMQ version 4.2.0 or greater, look into setting the ZMQ_HEARTBEAT_IVL and ZMQ_HEARTBEAT_TIMEOUT socket options (http://api.zeromq.org/4-2:zmq-setsockopt). These will set the interval between heartbeats (ZMQ_HEARTBEAT_IVL) and how long to wait for the reply until closing the connection (ZMQ_HEARTBEAT_TIMEOUT).
EDIT: You must set these socket options before connecting.
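For the PUSH server in the question, enabling those options might look roughly like the sketch below (plain libzmq C API compiled as C++; the endpoint, interval and timeout values are invented for the example):

    #include <zmq.h>

    int main() {
        void *ctx  = zmq_ctx_new();
        void *push = zmq_socket(ctx, ZMQ_PUSH);

        int ivl_ms     = 2000;   // send a ZMTP PING every 2 seconds
        int timeout_ms = 5000;   // give up on the peer if no PONG arrives within 5 seconds
        zmq_setsockopt(push, ZMQ_HEARTBEAT_IVL,     &ivl_ms,     sizeof ivl_ms);
        zmq_setsockopt(push, ZMQ_HEARTBEAT_TIMEOUT, &timeout_ms, sizeof timeout_ms);

        // The heartbeat options must be in place before the connection is established.
        zmq_bind(push, "tcp://*:5555");   // endpoint made up for the example

        // ... zmq_send() work items as usual; a powered-off client should now be
        //     detected by the engine and its connection torn down, rather than
        //     messages silently queueing up for it ...

        zmq_close(push);
        zmq_ctx_term(ctx);
        return 0;
    }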
There is nothing in zmq explicitly to detect the unexpected termination of a program at the other end of a socket, or the gratuitous and unexpected failure of a network connection.
There has been historical talk of adding some kind of underlying ping-pong are-you-still-alive internal messaging to zmq, but last time I looked (quite some time ago) it had been decided not to do this.
This does mean that crashes, network failures, etc. aren't necessarily handled very cleanly, and your application will not necessarily know what is going on or whether messages have been successfully sent. It is the Actor model, after all. As you're finding, your program may eventually determine that something had previously gone wrong: timeouts in ZMTP will spot the failure, and eventually the consequences bubble back up to your program.
To do anything better you'd have to layer something like a ping-pong on top yourself (e.g. have a separate socket just for that so that you can track the reachability of clients), but that then starts making it very hard to use the nice parts of ZMQ such as push/pull, which is probably why the (excellent) zmq authors decided not to put it in themselves.
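As a very rough sketch of that kind of home-grown layering (everything here - the endpoint, payloads and timeouts - is invented for illustration), the server could probe a per-client REP socket whose only job is to answer pings:

    #include <zmq.h>
    #include <cstdio>

    // Probe one client's "liveness" REP socket and report whether it answered.
    // A REQ socket that misses its reply is left stuck, so the simplest approach
    // is to throw the socket away and make a fresh one per probe.
    bool client_alive(void *ctx, const char *endpoint) {
        void *req = zmq_socket(ctx, ZMQ_REQ);
        int timeout_ms = 1000;                 // how long we wait for "pong"
        int linger = 0;                        // don't block on close if the ping never left
        zmq_setsockopt(req, ZMQ_RCVTIMEO, &timeout_ms, sizeof timeout_ms);
        zmq_setsockopt(req, ZMQ_SNDTIMEO, &timeout_ms, sizeof timeout_ms);
        zmq_setsockopt(req, ZMQ_LINGER,   &linger,     sizeof linger);
        zmq_connect(req, endpoint);

        char buf[16];
        zmq_send(req, "ping", 4, 0);
        bool alive = zmq_recv(req, buf, sizeof buf, 0) >= 0;   // -1 with EAGAIN on timeout

        zmq_close(req);
        return alive;
    }

    int main() {
        void *ctx = zmq_ctx_new();
        // In practice you'd track one liveness endpoint per client.
        std::printf("client reachable: %s\n",
                    client_alive(ctx, "tcp://client-host:6000") ? "yes" : "no");
        zmq_ctx_term(ctx);
        return 0;
    }

This only tells you a client is reachable, not what happened to any in-flight messages, which is exactly the kind of bookkeeping the answer above warns you end up taking on yourself.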
When faced with a similar problem I ended up writing my own transport library. I couldn't find one off the shelf that gave nice behaviour in the face of network failures, crashes, etc. It implemented CSP, not the actor model, wasn't terribly fast (an inevitability), and didn't do patterns in the zmq sense, but it did mean that programs knew exactly where messages were at all times, and knew whether clients were alive or unreachable at all times. The CSP-ness also meant message transfers were an execution rendezvous, so programs know what each other is doing too.

Wrong `SocketKind` in `SocketActivityTrigger` background task

During testing of my project on a background server, I have encountered a weird situation: every time I trigger a request to my suspended server using ServerTestingTask, the ServerTask is triggered twice with identical SocketActivityTriggerDetails (the trigger reason is SocketActivityTriggerReason::ConnectionAccepted, and the socket information is always SocketActivityKind::StreamSocketListener). The problem is that the first one supplies a valid StreamSocket in the information and my code handles the request perfectly, while the second trigger raises an invalid object exception just by accessing socketInformation->StreamSocket, which is somehow fatal and kills my server (I need to resume the app UI and click the button to start the server again). It feels like the first trigger should indicate the socket kind to be SocketActivityKind::StreamSocket instead. Is it a known problem, or is there some workaround?

Websphere MQ - error with reason code 2042 on a get

We're getting an intermittent error on a ImqQueue::get( ImqMsg &, ImqGetMessageOptions & ); call with reason code 2042, which Should Not Happen™ based on the Websphere documentation; we should only get that reason code on an open.
Would this error indicate that the server could not open a queue on its side, or does it indicate that there's a problem in our client? What is the best way to handle this error? Right now we just log that it occurs, but it's happening a lot. Unfortunately I'm not well-versed in Websphere MQ; I'm kind of picking this up as I go, so I don't have all the terminology correct.
Our client is written in C++ linking against libmq 6.0.2.4 and running on SLES-10. I don't know the particulars for the server other than it's running version 7.1. We're requesting an upgrade to bring our side up-to-date. We have multiple instances of the client running concurrently; all are using the same request queue, but each is creating its own dynamic reply queue with MQOO_INPUT_EXCLUSIVE + MQOO_INPUT_FAIL_IF_QUIESCING.
If the queue is not already open, the ImqQueue::get method will implicitly open the queue for you. This ends up using the MQOO_INPUT_AS_Q_DEF option, which in turn resolves against the DEFSOPT(EXCL|SHARED) attribute on the queue. You should also double-check that the queue is defined SHARE rather than NOSHARE, but I suspect that will already be correctly set.
You mention that you have multiple instances of the application running concurrently, so if one of them opens the queue implicitly with MQOO_INPUT_AS_Q_DEF and DEFSOPT resolves that to MQOO_INPUT_EXCLUSIVE, it will get 2042 (MQRC_OBJECT_IN_USE) if others already have the queue open. If nothing else had it open at the time, the implicit open will work, and later instances will instead get the 2042.
If it is intermittent, then I suggest there is a path through your application where the ImqQueue::open method is not invoked. While you look for that, changing the queue definition to DEFSOPT(SHARED) should get rid of the 2042s.
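If you would rather not depend on the queue definition at all, one defensive option is to open the queue explicitly with shared input before the first get. A rough sketch with the IMQI C++ classes follows (the queue manager and queue names are made up, and the method names are as I recall them from the IMQI documentation, so check them against your headers):

    #include <imqi.hpp>   // IBM MQ C++ (IMQI) classes
    #include <iostream>

    int main() {
        ImqQueueManager mgr;
        mgr.setName("QM1");                       // hypothetical queue manager name
        if (!mgr.connect()) {
            std::cerr << "connect failed, reason " << mgr.reasonCode() << std::endl;
            return 1;
        }

        ImqQueue queue;
        queue.setConnectionReference(mgr);
        queue.setName("APP.REQUEST.QUEUE");       // hypothetical shared input queue
        // Open explicitly with shared input so a later get() can never fall back to
        // an implicit MQOO_INPUT_AS_Q_DEF open that DEFSOPT(EXCL) would make exclusive.
        queue.setOpenOptions(MQOO_INPUT_SHARED | MQOO_FAIL_IF_QUIESCING);
        if (!queue.open()) {
            std::cerr << "open failed, reason " << queue.reasonCode() << std::endl;
            return 1;
        }

        ImqMessage msg;
        ImqGetMessageOptions gmo;
        gmo.setOptions(MQGMO_WAIT | MQGMO_FAIL_IF_QUIESCING);
        gmo.setWaitInterval(5000);                // wait up to 5 seconds for a message
        if (!queue.get(msg, gmo)) {
            std::cerr << "get failed, reason " << queue.reasonCode() << std::endl;
        }

        queue.close();
        mgr.disconnect();
        return 0;
    }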

"Specified network name is no longer available" in Httplistener

I have built a simple web service that simply uses HttpListener to receive and send requests. Occasionally, the service fails with "Specified network name is no longer available". It appears to be thrown when I write to the output buffer of the HttpListenerResponse.
Here is the error:
ListenerCallback() Error: The specified network name is no longer available at System.Net.HttpResponseStream.Write(Byte[] buffer, Int32 offset, Int32 size)
and here is the guilty portion of the code. responseString is the data being sent back to the client:
buffer = System.Text.Encoding.UTF8.GetBytes(responseString);
response.ContentLength64 = buffer.Length;
output = response.OutputStream;
output.Write(buffer, 0, buffer.Length);
It doesn't always seem to be a huge buffer; two examples are 3,816 bytes and 142,619 bytes, and those errors were thrown about 30 seconds apart. I would not think that my single client application would be overloading HttpListener; the client does occasionally send/receive data in bursts, with several exchanges happening one after another.
Google searches mostly show that this is a common IT problem that appears when there are network problems; most of the help is directed toward sysadmins diagnosing a problem with an app rather than developers tracking down a bug. My app has been tested on different machines, networks, etc. and I don't think it's simply a network configuration problem.
What may be the cause of this problem?
I'm getting this too, when a ContentLength64 is specified and KeepAlive is false. It seems as though the client is inspecting the Content-Length header (which, by all possible accounts, is set correctly, since I get an exception with any other value) and then saying "Whelp I'm done KTHXBYE" and closing the connection a little bit before the underlying HttpListenerResponse stream was expecting it to. For now, I'm just catching the exception and moving on.
I've only gotten this particular exception once so far when using HttpListener.
It occurred when I resumed execution after my application had been standing on a breakpoint for a while.
Perhaps there is some sort of internal timeout involved? Your application sends data in bursts, which means it's probably completely inactive a lot of the time. Did the exception occur immediately after a period of inactivity?
Same problem here, but other threads suggest ignoring the Exception.
C# problem with HttpListener
Maybe that's not the right thing to do.
For me, I find that whenever the client closes the web page before it loads fully, it gives me that exception. What I do is just add a try/catch block and print something when the exception happens. In other words, I just ignore the exception.
The problem occurs when you're trying to respond to an invalid request. Take a look at this. I found out that the only way to solve this problem is:
listener = new HttpListener();
listener.IgnoreWriteExceptions = true;
Just set IgnoreWriteExceptions to true after instantiating your listener and the errors are gone.
Update:
For a deeper explanation: the HTTP protocol is built on top of TCP, which works with streams to which each peer writes data. TCP is peer to peer, and each peer can close the connection. When the client sends a request to your HttpListener, there is a TCP handshake, then the server processes the data and responds to the client by writing into the connection's stream. If you try to write into a stream that has already been closed by the remote peer, the "Specified network name is no longer available" exception occurs.