Akka EventsourcedBehavior JournalFailureException but stack trace doesn't show underlying cause - akka

I have an akka persistence app (EventSourcedBehavior based actors, akka 2.6.13) and am using akka-persistence-jdbc 3.5.3 for the journal/snapshot plugin along with a cockroachdb cluster. Things work fine, but recently I've seen a lot of event persist failures, and the error logs do not show any underlying cause of the issue - no SQL level exceptions in the trace at all. At the same time, we usually see errors due to actors being restored, with the journal again throwing JournalFailureExceptions but giving no underlying reason.
If I can't see any underlying reasons (the only thing the logs show is "async write timed out after 5.00 s"; is this timeout value configurable?), does this mean there is something else causing the issues, unrelated to the journal plugin implementation or database? How can I debug this further? I've examined the message handler in my EventSourcedBehavior that failed when persisting an event to see if it is doing anything weird or blocking, but I can't see anything obviously wrong.
Any ideas welcome.
Thanks

The JournalFailureExceptions likely indicate connectivity or slow responses from the DB. If it's just slowness, scaling out/up the cockroach cluster might help.
"async write timed out after" is from cluster sharding's remember-entities feature (that's the only occurrence in Akka) which also indicates connectivity issues or slow responses from the DB.
There is most likely no problem with your behaviors. It's worth noting that remember-entities (especially in eventsourced mode... ddata mode is a little better in this regard if you're OK with not remembering entities across full-cluster restarts) itself puts a substantial load on persistence and your DB and is consistently (if you have more than a few hundred entities) counterproductive, in my experience. Unless you've actually tried disabling it and seen an actual net negative effect, I suggest disabling remember-entities.

Related

How to debug a hanging job resulting from reading from lustre?

I have a job in interruptible sleep state (S), hanging for a few hours.
I can't use gdb (gdb will hang when attaching to the PID).
I can't use strace either, because strace will resume the hanging job =(
The WCHAN field shows the PID is waiting for ptlrpc. After some searching online, it looks like this is a lustre operation. The print files also revealed the program is stuck reading data from lustre. Any ideas or suggestions on how to proceed with the diagnosis? Or possible reasons why the hang happens?
You can check /proc/$PID/stack on the client to see the whole stack of the process, which would give you some more information about what the process is doing (ptlrpc_set_wait() is just the generic "wait for RPC completion" function).
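As a rough sketch of collecting that information without attaching a debugger, the snippet below reads the `wchan` and `stack` entries for a process from procfs. The function names and the `proc_root` parameter are made up for illustration (the parameter only exists so the code can be tried against a fake directory), and reading `stack` normally requires root, so a missing or unreadable entry is reported as `None` rather than treated as an error.

```python
from pathlib import Path
from typing import Dict, Optional


def read_proc_entry(pid: int, entry: str, proc_root: str = "/proc") -> Optional[str]:
    """Return the text of <proc_root>/<pid>/<entry>, or None if it cannot be read."""
    try:
        return (Path(proc_root) / str(pid) / entry).read_text().strip()
    except OSError:
        # Process is gone, entry does not exist, or we lack privileges
        # (the kernel stack is usually readable by root only).
        return None


def describe_wait(pid: int, proc_root: str = "/proc") -> Dict[str, Optional[str]]:
    """Collect the wait channel and the kernel stack of one process."""
    return {
        "wchan": read_proc_entry(pid, "wchan", proc_root),
        "stack": read_proc_entry(pid, "stack", proc_root),
    }
```

Calling `describe_wait(pid)` as root on the client for the hanging PID should return the same wait channel that `ps` reports, plus the full kernel stack if the kernel exposes it.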
That said, what is more likely to be useful is to check the kernel console error messages (dmesg and/or /var/log/messages) to see what is going on. Lustre is definitely not shy about logging errors when there is a problem.
Very likely this will show that the client is waiting on a server to complete the RPC, so you'll also have to check dmesg and/or /var/log/messages on the server to see what the problem is there. There are several existing docs that go into detail about how to debug Lustre issues:
https://wiki.lustre.org/Diagnostic_and_Debugging_Tools
https://cug.org/5-publications/proceedings_attendee_lists/CUG11CD/pages/1-program/final_program/Wednesday/12A-Spitz-Paper.pdf
At that point, you are probably best off checking for existing Lustre bugs at https://jira.whamcloud.com/, searching for the first error messages that are reported, or maybe a stack trace. It is very likely (depending on what error is being hit) that there is already a fix available, and upgrading to the latest maintenance release (2.12.7 currently) or applying a patch (if the bug was recently fixed) will solve your problem.

Azure Event Hub ServiceBusException causing skipped messages

We are using the Azure Java event hub library to read messages out of an event hub. Most of the time it works perfectly, but periodically we see exceptions of type "com.microsoft.azure.servicebus.ServiceBusException" occur that correspond to times when messages seem to be skipped that are in the event hub.
Here are some examples of exception details:
"The message container is being closed (some number here)."
This generally hits multiple partitions at the same time, but not all.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
"The link 'xxx' is force detached by the broker due to errors occurred in consumer(link#). Detach origin: InnerMessageReceiver was closed."
This is generally tied to com.microsoft.azure.servicebus.amqp.AmqpException exceptions.
The callstack only includes com.microsoft.azure.servicebus and org.apache.qpid.proton.
Example callstack:
at com.microsoft.azure.servicebus.ExceptionUtil.toException(ExceptionUtil.java:93)
at com.microsoft.azure.servicebus.MessageReceiver.onError(MessageReceiver.java:393)
at com.microsoft.azure.servicebus.MessageReceiver.onClose(MessageReceiver.java:646)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.processOnClose(BaseLinkHandler.java:83)
at com.microsoft.azure.servicebus.amqp.BaseLinkHandler.onLinkRemoteClose(BaseLinkHandler.java:52)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:309)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:276)
at com.microsoft.azure.servicebus.MessagingFactory$RunReactor.run(MessagingFactory.java:340)
at java.lang.Thread.run(Thread.java:745)
There doesn't seem to be a way for clients of the library to recognize a problem occurs and avoid moving ahead in the event hub past our skipped messages. Has anyone else run into this? Is there some other way to recognize and avoid skipping or retrying missed messages?
This error DOESN'T SKIP any messages - it throws an Exception when it shouldn't. This causes EPH to RESTART the receivers of the affected partitions. If the application using the EventHubs javaclient doesn't handle the errors, it may experience loss of messages.
This is a bug in our retry logic - in the current version of EventHubs JavaClient - until 0.11.0.
Here's the corresponding issue to track progress.
In EventHubs service - these errors happen if - for any reason - the Container hosting your EventHubs' code has to close (for the sake of the explanation, imagine we run a set of Container's - like DockerContainers for every EventHub namespace) - this is a transient error - this Container will eventually be opened in another Node.
Our javaclient-retry logic should have handled this error and should have retried - Will keep this thread posted with the fix.
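To illustrate what "handling the error" on the application side could look like, here is a minimal, library-independent sketch in Python. It is not the Event Hubs client API: `TransientReceiveError`, `open_receiver`, and `make_flaky_source` are hypothetical stand-ins. The idea shown is only that the saved position advances after a message has been processed, and that the receiver is reopened from that position when a transient error closes it.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


class TransientReceiveError(Exception):
    """Hypothetical stand-in for a transient receiver failure."""


def consume(open_receiver: Callable[[int], Iterator[Tuple[int, str]]],
            handle: Callable[[str], None],
            start_offset: int = 0,
            max_restarts: int = 5) -> int:
    """Process messages in order; return the next offset to read.

    open_receiver(offset) is assumed to yield (offset, payload) pairs
    starting at the given offset.
    """
    next_offset = start_offset
    restarts = 0
    while True:
        try:
            for offset, payload in open_receiver(next_offset):
                handle(payload)
                # Advance only after handling succeeded, so a failure
                # never moves the position past an unprocessed message.
                next_offset = offset + 1
            return next_offset
        except TransientReceiveError:
            restarts += 1
            if restarts > max_restarts:
                raise
            # Loop around and reopen the receiver at next_offset.


def make_flaky_source(messages: List[str], fail_once_at: Iterable[int]):
    """Build a fake receiver factory that fails once at the given offsets."""
    pending_failures = set(fail_once_at)

    def open_receiver(offset: int) -> Iterator[Tuple[int, str]]:
        for i in range(offset, len(messages)):
            if i in pending_failures:
                pending_failures.discard(i)
                raise TransientReceiveError(f"receiver closed at offset {i}")
            yield i, messages[i]

    return open_receiver
```

With a source that fails once at offset 2, `consume` still hands every message to `handle` exactly once and in order, because the restart begins at the first unprocessed offset.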
EDIT
We just released 0.12.0 - which fixes this issue.
Thanks!
Sreeram

gRPC C++ client calls against Bigtable hangs occasionally

I am having a problem with a gRPC C++ client making calls against Google Cloud Bigtable. These calls occasionally hang, and a hanging call returns only if a call deadline is set. There is an issue filed with the gRPC team: https://github.com/grpc/grpc/issues/6278 with a stack trace and a piece of the gRPC tracing log provided.
The call that hangs most often is ReadRows stream read call. I have seen MutateRow call hanging a few times as well but that is rather rare.
gRPC tracing shows that there is some response coming back from the server, however that response seems to be insufficient for gRPC client to go on.
I did spend a fair amount of time debugging the code; no obvious problems found so far on the client side, no memory corruptions seen. This is a single-threaded application making one call at a time, so client side concurrency is not a suspect. The client runs on a google compute engine box, so the network is likely not an issue either. The gRPC client is kept up to date with the github repository main line.
Any suggestions would be appreciated. If anyone has debugging ideas that would be great as well. Using valgrind, gdb, and reducing the application to a subset with reproducible results did not help so far; I have not been able to find out what the problem is. The problem is random and shows up occasionally.
Additional note on May 17, 2016
There was a suggestion that re-tries may help to deal with the issue.
Unfortunately re-tries do not work very well for us because we would have to carry that over into the application logic. We can easily re-try updates, which are MutateRow calls, and we do that; these are not streaming calls and are easy to re-try. However, once the application has begun iterating over the DB query results, a failure means that the application needs to re-issue the query and start iterating over the results again, which is problematic. It is always possible to consider a change that would make our applications read the whole result set at once, so that the iteration can then be done in memory at the application level. Then re-tries can be handled. But that is problematic for all kinds of reasons, like memory footprint and application latencies. We want to process DB query results as soon as they arrive, not when all of them are in memory. The timeout is also added to the call latency whenever the call hangs. So re-tries of the query result iterations are costly to such a degree that they are not practical.
We've experienced hanging issues with gRPC in various languages. The gRPC team is investigating.

Does the Zookeeper Watches system have a bug, or is this a limitation of the CAP theorem?

The Zookeeper Watches documentation states:
"A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode." Furthermore, "Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper."
The point is, there is no guarantee you'll get a watch notification.
This is important, because in a system like Clojure's Avout, you're trying to mimic Clojure's Software Transactional Memory over the network using Zookeeper. This relies on there being a watch notification for every change.
Now I'm trying to work out if this is a coding flaw or a fundamental computer science problem (i.e. the CAP theorem).
My question is: Does the Zookeeper Watches system have a bug, or is this a limitation of the CAP theorem?
This seems to be a limitation in the way ZooKeeper implements watches, not a limitation of the CAP theorem. There is an open feature request to add continuous watch to ZooKeeper: https://issues.apache.org/jira/browse/ZOOKEEPER-1416.
etcd has a watch function that uses long polling. The limitation here which you need to account for is that multiple events may happen between receiving the first long poll result, and re-polling. This is roughly analogous to the issue with ZooKeeper. However they have a solution:
However, the watch command can do more than this. Using the index [passing the last index we've seen], we can watch for commands that have happened in the past. This is useful for ensuring you don't miss events between watch commands.
curl -L 'http://127.0.0.1:4001/v2/keys/foo?wait=true&waitIndex=7'
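A small Python sketch of that re-polling loop follows, using only the standard library. It assumes the etcd v2 keys API shown in the curl command and that each watch response is a JSON object carrying `node.modifiedIndex`; the helper names (`watch_url`, `next_wait_index`, `watch`) are made up for illustration, and error handling (for example, an index that has already been cleared from etcd's event history) is left out.

```python
import json
from typing import Callable, Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def watch_url(base: str, key: str, wait_index: Optional[int] = None) -> str:
    """Build the long-poll URL, optionally resuming from a known index."""
    params = {"wait": "true"}
    if wait_index is not None:
        params["waitIndex"] = str(wait_index)
    return f"{base}/v2/keys/{key}?{urlencode(params)}"


def next_wait_index(response_body) -> int:
    """Index to pass on the next poll: one past the event just seen."""
    event = json.loads(response_body)
    return event["node"]["modifiedIndex"] + 1


def watch(base: str, key: str, on_event: Callable[[dict], None],
          wait_index: Optional[int] = None) -> None:
    """Long-poll forever, never skipping events between polls."""
    while True:
        with urlopen(watch_url(base, key, wait_index)) as resp:
            body = resp.read()
        on_event(json.loads(body))
        # Asking for the next index makes etcd replay any change that
        # happened while we were busy handling the previous event.
        wait_index = next_wait_index(body)
```

For example, `watch_url("http://127.0.0.1:4001", "foo", 7)` reproduces the URL in the curl command above.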

How to avoid dropping messages zeromq pub sub

I have seen several questions about this, but none have answers I found satisfactory. One question in particular, "zeromq pattern: pub/sub with guaranteed delivery", is similar, though I am open to using any other zeromq mechanism to achieve the same effect.
My question is, is there any way to send messages in a fan-out pattern like publisher-subscriber in ZeroMQ with the assurance that the messages will be delivered? It seems as though a Dealer with zero-copy could do this okay, but it would be much messier than pub-sub. Is there a better option? What are the drawbacks of doing it this way besides having to write more code?
Reason for needing this:
I am writing code to analyze data coming from instrumentation. The module which connects to the instrumentation needs to be able to broadcast data to other modules for them to analyze. They, in turn, need to broadcast their analyzed data to output modules.
At first glance pub-sub with ZeroMQ seemed perfect for the job, but messages get dropped if any subscriber slows down and hits the high watermark. In the case of this system, it is not acceptable for messages to be dropped at only a fraction of the modules because of event continuity. All the modules need to analyze an event for the output to be meaningful. However, if no modules received the messages for an event, that would be fine. For this reason, it would be okay to block the publisher (the instrumentation module) if one of the analysis modules hit the high watermark.
I suppose another alternative is to deal with missed messages after the fact, but that just wastes processing time on events that would be discarded later.
EDIT:
I guess thinking about this further, I currently expect a message sent = message delivered because I'm using inproc and communicating between threads. However, if I were to send messages over TCP there is a chance that the message could be lost even if ZeroMQ does not drop it on purpose. Does this mean I probably need to deal with dropped messages even if I use a blocking send? Are there any guarantees about message delivery with inproc?
In general, I think there's no way of providing a guarantee for pub/sub on its own with 0MQ. If you really need completely reliable messaging, you're going to have to roll your own.
Networks are inherently unreliable, which is why TCP does so much handshaking just to get packets across.
As ever, it's a balance between latency and throughput. If you're prepared to sacrifice throughput, you can do message handshaking yourself - perhaps using REQ/REP - and handle the broadcasting yourself.
The 0MQ guide has some ideas on how to go about at least part of what you want here.
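For the between-threads case described in the question, "handling the broadcasting yourself" could look roughly like the sketch below. It uses Python's standard `queue` and `threading` modules as a stand-in, not ZeroMQ sockets, and all names (`BlockingFanout`, `subscriber`, `run_demo`) are hypothetical. Each subscriber gets its own bounded queue, whose size plays the role of the high watermark, and the publisher blocks instead of dropping when any queue is full.

```python
import queue
import threading
import time
from typing import Any, List, Sequence

_STOP = object()  # sentinel telling a subscriber thread to exit


class BlockingFanout:
    """Deliver every message to every subscriber queue, blocking when full."""

    def __init__(self, subscribers: int, high_watermark: int) -> None:
        self.queues = [queue.Queue(maxsize=high_watermark)
                       for _ in range(subscribers)]

    def publish(self, message: Any) -> None:
        # put() blocks while a queue is full, so the slowest subscriber
        # sets the pace and nothing is dropped.
        for q in self.queues:
            q.put(message)


def subscriber(q: "queue.Queue", out: List[Any], delay: float) -> None:
    """Drain one queue into `out`, simulating work with a sleep."""
    while True:
        item = q.get()
        if item is _STOP:
            return
        time.sleep(delay)
        out.append(item)


def run_demo(n_messages: int = 20,
             delays: Sequence[float] = (0.0, 0.001, 0.005),
             high_watermark: int = 2) -> List[List[Any]]:
    """Publish n_messages to subscribers of different speeds."""
    fanout = BlockingFanout(len(delays), high_watermark)
    outputs: List[List[Any]] = [[] for _ in delays]
    threads = [threading.Thread(target=subscriber, args=(q, out, d))
               for q, out, d in zip(fanout.queues, outputs, delays)]
    for t in threads:
        t.start()
    for i in range(n_messages):
        fanout.publish(i)
    fanout.publish(_STOP)
    for t in threads:
        t.join()
    return outputs
```

Every list returned by `run_demo()` contains all messages in order, regardless of subscriber speed. The trade-off is that a stalled subscriber stalls the publisher indefinitely; passing a `timeout` to `put()` would bound that wait at the cost of having to decide what to do when it expires.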
I agree with SteveL. If you truly need 100% reliability (or close to it), ZeroMQ is probably not your solution. You're better off going with commercial messaging products where guaranteed message delivery and persistence are addressed. Otherwise, you'll be coding reliability features in ZeroMQ and will likely pull your hair out in the process. Would you implement your own app server if you required ACID compliance between your application and database? Unless you want to implement your own transaction manager, you'd buy WebLogic, WebSphere, or JBoss to do it for you.
Does this mean I probably need to deal with dropped messages even if I use a blocking send?
I'd stay away from explicitly blocking anything, it's too brittle. A synchronous sender could hang indefinitely if something goes wrong on the consumption side. You could address this using polling and timeouts, but again, it's brittle and messy code; stick with asynchronous.
Are there any guarantees about message delivery with inproc?
Well, one thing is guaranteed; you're not dealing with physical sockets, so any network issues are eliminated.
This question comes up on search engines, so I just wanted to update.
You can stop ZeroMQ from dropping messages when using PUB sockets. You can set the ZMQ_XPUB_NODROP option, which will instead raise an error when the send buffer is full.
With that information, you could create something like a dead letter queue, as mentioned here, and keep trying to resend with sleeps in between.
Efficiently handling this problem may not be possible currently, as there does not appear to be a way to be notified when the send buffer in ZeroMQ is no longer full, which means timed sleeps / polling may be the only way to find out if the send queue has room again so the messages can be published.
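The resend-with-sleeps idea can be sketched as follows. This is a stand-in using Python's standard `queue.Queue`, not a ZeroMQ socket: `queue.Full` plays the role of the error raised when the send buffer is full, and `send_with_retry`, `publish_or_park`, and the `dead_letters` list are hypothetical names chosen for illustration.

```python
import queue
import time
from typing import Any, List


def send_with_retry(buffer: "queue.Queue", message: Any,
                    retries: int = 5, delay: float = 0.01) -> bool:
    """Try a non-blocking send, sleeping between attempts.

    Returns True once the message is accepted, or False if the buffer
    was still full after all attempts.
    """
    for attempt in range(retries + 1):
        try:
            buffer.put_nowait(message)
            return True
        except queue.Full:
            # No notification exists for "buffer has room again",
            # so the only option is to wait and poll.
            if attempt < retries:
                time.sleep(delay)
    return False


def publish_or_park(buffer: "queue.Queue", message: Any,
                    dead_letters: List[Any], **retry_options: Any) -> bool:
    """Send a message, parking it in dead_letters if it cannot be sent."""
    sent = send_with_retry(buffer, message, **retry_options)
    if not sent:
        dead_letters.append(message)
    return sent
```

Messages parked in `dead_letters` can be passed to `send_with_retry` again later, which keeps the publisher from blocking forever on one slow consumer.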