I have an integration test which sends a lot of messages to a remote Akka (2.0.5) actor. After each test run, the remote actor tree is restarted. After 43 successful test runs, according to the debug-level log messages, the remote actor started to send replies to itself, which obviously caused the test to fail.
Why might this happen?
There is only one place in the codebase where I am sending this type of message, and it clearly says
sender ! generateTheMessage()
I figured out why it's happening in my particular case. There are actually two actors involved here:
A -> B
A initially queues up messages until the system is initialised. Then it sends the queued up messages on to B, and forwards all further messages to B as soon as they arrive.
The problem is that when A sends on the queued-up messages, the original sender information has been lost, so A becomes the sender. The reply from B therefore goes back to A and is forwarded on to B again. I didn't initially realise the latter forwarding was happening, because I hadn't enabled logging for it.
So it's a race condition. If the system comes up quickly enough everything is OK, but if not, some initial replies will be misdirected.
How I fixed this was to pair up the sender with each queued message, and resend each queued message using the Java API, which allows specifying the sender explicitly.
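For reference, here is a minimal sketch of that fix in Akka-style Scala (the names Initialised and QueueingProxy are invented for the example, and the 2.0.x API differs slightly from modern Akka classic):

import akka.actor.{Actor, ActorRef}

// Illustrative message; the real protocol differs.
case object Initialised

class QueueingProxy(target: ActorRef) extends Actor {
  // Each queued message is paired with its original sender.
  private var queued = Vector.empty[(Any, ActorRef)]

  def receive: Receive = buffering

  private def buffering: Receive = {
    case Initialised =>
      // Replay with the original sender, so replies from `target`
      // go back to the right actor instead of to this proxy.
      queued.foreach { case (msg, origin) => target.tell(msg, origin) }
      queued = Vector.empty
      context.become(forwarding)
    case msg =>
      queued = queued :+ ((msg, sender()))
  }

  private def forwarding: Receive = {
    case msg => target.forward(msg) // forward preserves the original sender
  }
}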
I understand that standard SQS uses "at least once" delivery while FIFO messages are delivered exactly once. I'm trying to weigh standard queues vs FIFO for my application, and one factor is how long it takes for the duplicated message to arrive.
I intend to consume messages from SQS then post the data I received to an idempotent third-party API. I understand that with standard SQS, there's always a risk of me overwriting more recent data with the old duplicated data.
For example:
Message A arrives, I post it onwards.
Message A duplicate arrives, I post it onwards.
Message B arrives, I post it onwards.
All fine ✓
On the other hand:
Message A arrives, I post it onwards.
Message B arrives, I post it onwards.
Message A duplicate arrives - I post it and overwrite the latest data, which was B! ✖
I want to quantify this risk, i.e. I want to know how long a duplicate message typically takes to arrive. Will the duplicate message take roughly the same amount of time to arrive as the original message?
It may be useful to understand how message duplication occurs. As far as I know this isn't documented in the official docs; what follows is my mental model of how it works, i.e. an educated guess.
Whenever you send a message to SQS (SendMessage API), this message arrives at the SQS webservice endpoint, which is one of probably thousands of servers. This endpoint receives your message, duplicates it one or more times and stores these duplicates on more than one SQS server. After it has received confirmation from at least two SQS servers, it acknowledges to the client that the message has been received.
When you call the ReceiveMessage API, only a subset of the SQS servers that handle your queue are queried for messages. When a message is returned, these servers tell their peers that the message is currently in flight, and the visibility timeout starts. This doesn't happen instantaneously, since it's a distributed system. While this ReceiveMessage call takes place, another consumer might also issue a ReceiveMessage call and happen to query one of the servers holding a replica of the message before it's marked as in flight. That server hands out the message, and now you have two consumers working on it.
This is just one scenario, which is the result of this being a distributed system.
There are a couple of edge cases that can happen as a result of network issues, e.g. when the SQS response to the initial SendMessage gets lost and the client thinks the message didn't arrive and sends it again - poof, you've got another duplicate.
The point being: things fail in weird and complex ways. That makes measuring the risk of a delayed message difficult. If your use case can't handle duplicate and out of order messages, you should go for FIFO, but that will inherently limit your throughput. Alternatives are based on distributed locking mechanisms and keeping track of which messages you have already processed, which are complex tools to solve a complex problem.
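As a sketch of the "keeping track of which messages you have already processed" idea: one option is to version each message at the producer (for example with a timestamp) and have the consumer drop anything older than the latest value it has already posted for the same key. The DedupGuard below is purely illustrative, not part of any SDK:

import scala.collection.mutable

// Illustrative guard: only post a message onwards if it is at least as
// new as the latest message already posted for the same key, so a late
// duplicate of A (arriving after B) is dropped instead of overwriting B.
final class DedupGuard {
  private val lastPosted = mutable.Map.empty[String, Long]

  /** Returns true if the message should be posted to the third-party API. */
  def shouldPost(key: String, timestampMillis: Long): Boolean = synchronized {
    val newest = lastPosted.getOrElse(key, Long.MinValue)
    if (timestampMillis >= newest) {
      lastPosted(key) = timestampMillis
      true
    } else false
  }
}

Here key would be whatever identifies the record at the third-party API, and the timestamp would be attached by the producer before calling SendMessage.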
Here is the situation: one of my actors (A) is supervised by a Backoff supervisor (B).
The sequence of events that interests me is the following:
1. The system starts and everybody is happy.
2. A fails while processing a message.
3. B now considers A to be suspended until the backoff delay elapses.
4. B receives some messages (MM) that it is meant to forward to A.
5. The backoff delay elapses and B restarts A.
6. Everybody is happy again.
On step 4, what happens to those messages?
Are they lost? Sent to dead letters? Stashed inside B somewhere and sent to A when it restarts/resumes?
Now let's add another layer: A is not a standard Actor but an Actor with Stash.
What happens to the stash of messages between the failure of A and its restart/resume?
Is it discarded? Is it unstashed? Is it kept inside the stash?
After a few experiments, I think I can answer both of my questions above:
What happens to messages sent to an actor supervised by a Backoff supervisor between the moment the actor fails and the moment it is restarted?
They are sent to deadLetters.
What happens to the stash of messages between the failure of an actor and its restart?
Nothing happens to the stash until the restart is actually attempted. When the restart begins, the Stash trait's preRestart() hook calls unstashAll(), which prepends all stashed messages to the mailbox. So the messages are neither lost nor kept in the stash; they are simply unstashed back into the mailbox.
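A minimal sketch of the kind of setup behind these observations, assuming the classic Backoff/BackoffSupervisor factory API (the exact factory names have changed across Akka versions, and the Worker protocol here is invented):

import akka.actor.{Actor, ActorSystem, Props, Stash}
import akka.pattern.{Backoff, BackoffSupervisor}
import scala.concurrent.duration._

// A child that stashes everything until it receives "ready". If it
// crashes first, Stash's preRestart hook calls unstashAll(), so the
// stashed messages land back in the mailbox, as described above.
class Worker extends Actor with Stash {
  def receive: Receive = {
    case "ready" =>
      unstashAll()
      context.become(running)
    case "boom" =>
      throw new RuntimeException("simulated failure")
    case _ =>
      stash()
  }
  def running: Receive = {
    case msg => println(s"working on $msg") // stand-in for the real work
  }
}

object BackoffDemo extends App {
  val system = ActorSystem("demo")
  val supervisor = system.actorOf(
    BackoffSupervisor.props(
      Backoff.onFailure(Props[Worker], "worker", 3.seconds, 30.seconds, 0.2)),
    "workerSupervisor")

  // Messages sent to `supervisor` while the child is down (i.e. during
  // the backoff window) are not buffered; they end up in deadLetters.
  supervisor ! "hello"
}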
According to the Akka docs for PoisonPill:
You can also send an actor the akka.actor.PoisonPill message, which will stop the actor when the message is processed. PoisonPill is enqueued as ordinary messages and will be handled after messages that were already queued in the mailbox.
Although the usefulness of such a feature may be obvious to an Akka guru, to a newcomer this sounds completely useless, reckless and dangerous.
So I ask: What's the point of this message and when would one ever use it, for any reason?!?
We use a pattern called disposable actors:
A new temporary actor is created for each application request.
This actor may create some other actors to do some work related to the request.
Processed result is sent back to client.
All temporary actors related to this request are killed. That's the place where PoisonPill is used.
Creating an actor carries very low overhead (about 300 bytes of RAM), so this is quite a good practice.
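A rough sketch of the pattern (the names RequestHandler, HandleRequest and so on are illustrative, not from a real codebase):

import akka.actor.{Actor, ActorRef, PoisonPill, Props}

// Illustrative messages; the real protocol will differ.
final case class HandleRequest(payload: String, client: ActorRef)
final case class RequestResult(body: String)

// One disposable actor per application request.
class RequestHandler extends Actor {
  def receive: Receive = {
    case HandleRequest(payload, client) =>
      // ... possibly create further child actors to do parts of the work ...
      client ! RequestResult(s"processed: $payload")
      // The work for this request is done: poison ourselves. PoisonPill
      // is an ordinary message, so anything already in the mailbox is
      // still processed before the actor stops.
      self ! PoisonPill
  }
}

class FrontEnd extends Actor {
  def receive: Receive = {
    case payload: String =>
      // Spawn a fresh, disposable handler for each incoming request.
      val handler = context.actorOf(Props[RequestHandler])
      handler ! HandleRequest(payload, sender())
  }
}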
I am using Akka.NET to implement an actor system in which some actors are created on demand and are deleted after a configurable idle period (I use Akka's "ReceiveTimeout" mechanism for this). Each of these actors is identified by a key, and there should not exist two actors with the same key.
These actors are currently created and deleted by a common supervisor. The supervisor can be asked to return a reference to the actor matching a given key, either by returning an existing one or creating a new one, if an actor with this key doesn't exist yet. When an actor receives the "ReceiveTimeout" message, it notifies the supervisor who in turn kills it with a "PoisonPill".
I have an issue when sending a message to one of these actors right after it has been deleted. I noticed that sending a message to a dead actor doesn't generate an exception. Worse, when sending an "Ask" message, the sender remains blocked, waiting indefinitely (or until a timeout) for a response that he will never receive.
I first thought about Akka's "DeathWatch" mechanism to monitor an actor's lifecycle. But, if I'm not mistaken, the "Terminated" message sent by the dying actor will be received by the monitoring actor asynchronously just like any other message, so the problem may still occur between the target actor's death and the reception of its "Terminated" message.
To solve this problem, I made it so that anyone asking the supervisor for a reference to such an actor has to send a "close session" message to the supervisor to release the actor when it is no longer needed (this is done transparently by a disposable "ActorSession" object). As long as there are any open sessions on an actor, the supervisor will not delete it.
I suppose that this situation is quite common and am therefore wondering if there isn't a simpler pattern to follow to address this kind of problem. Any suggestion would be appreciated.
I have an issue when sending a message to one of these actors right after it has been deleted. I noticed that sending a message to a dead actor doesn't generate an exception.
This is by design. You will never receive an exception upon attempting to send a message - it will simply be routed to Deadletters and logged. There's a lot of reasons for this that I won't get into here, but the bottom line is that this is intended behavior.
DeathWatch is the right tool for this job, but as you point out - you might receive a Terminated message after you already sent a message to that actor.
A simpler pattern than tracking open / closed sessions is to simply use acknowledgement / reply messages from the recipient using Ask + Wait + a hard timeout. The downside of course is that if your recipient actor has a lot of long-running operations then you might block for a long period of time inside the sender.
The other option you can go with is to redesign your recipient actor to act as a state machine and have a soft-terminated or terminating state that it uses to drain connections / references with potential senders. That way the original actor can still reply and accept messages, but let callers know that it's no longer available to do work.
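Here is a sketch of the first option (ask plus an acknowledgement and a hard timeout), written in Scala/Akka classic form; the Akka.NET Ask API is analogous, and DoWork and Ack are invented message types:

import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.{Await, TimeoutException}
import scala.concurrent.duration._

object AckingSender {
  // Invented protocol: the recipient is expected to reply with Ack.
  final case class DoWork(key: String)
  case object Ack

  def sendWithAck(target: ActorRef, key: String): Boolean = {
    implicit val timeout: Timeout = Timeout(3.seconds)
    try {
      // If the target has already been stopped, the DoWork message goes to
      // dead letters, no Ack ever arrives, and the ask fails with a timeout
      // instead of the caller waiting forever.
      Await.result(target ? DoWork(key), timeout.duration)
      true
    } catch {
      case _: TimeoutException =>
        false // treat as "actor no longer exists", e.g. re-resolve it via the supervisor
    }
  }
}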
I solved this problem with entity actors created through Akka's Cluster Sharding mechanism:
If the state of the entities are persistent you may stop entities that are not used to reduce memory consumption. This is done by the application specific implementation of the entity actors for example by defining receive timeout (context.setReceiveTimeout). If a message is already enqueued to the entity when it stops itself the enqueued message in the mailbox will be dropped. To support graceful passivation without losing such messages the entity actor can send ShardRegion.Passivate to its parent Shard. The specified wrapped message in Passivate will be sent back to the entity, which is then supposed to stop itself. Incoming messages will be buffered by the Shard between reception of Passivate and termination of the entity. Such buffered messages are thereafter delivered to a new incarnation of the entity.
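A minimal sketch of that passivation handshake with the Scala Cluster Sharding API (the receive timeout value and the entity's protocol are illustrative):

import akka.actor.{Actor, PoisonPill, ReceiveTimeout}
import akka.cluster.sharding.ShardRegion
import scala.concurrent.duration._

class Entity extends Actor {
  // Ask to be passivated after 2 minutes of inactivity (illustrative value).
  context.setReceiveTimeout(2.minutes)

  def receive: Receive = {
    case ReceiveTimeout =>
      // Ask the parent Shard to passivate us. The Shard sends the wrapped
      // message (here PoisonPill) back to us and buffers any new incoming
      // messages until we have actually terminated, then delivers them to
      // the next incarnation of this entity.
      context.parent ! ShardRegion.Passivate(stopMessage = PoisonPill)
    case _ => // the entity's real protocol would be handled here
  }
}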
We are planning to use Akka pub-sub in our project. I have the following two questions regarding the Akka pub-sub behavior.
What if the publisher starts sending messages before any actor has subscribed? What will happen to the messages that were published before any subscriber came into existence? Will they be discarded silently?
What if the subscriber actor dies, so that there are no subscribers at all? Will the messages sent by the publisher get accumulated somewhere, or will they be discarded by the pub-sub system?
Message routing is decided on the spot: no subscribers, no sending. Buffering messages arbitrarily within the toolkit will only lead to surprising memory outages. If you want to buffer you will have to do that explicitly.
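If you do need buffering, here is a rough sketch of doing it explicitly, independent of which pub-sub implementation you use (the Publish/Subscribe message types below are invented for the example):

import akka.actor.{Actor, ActorRef, Terminated}

final case class Publish(event: Any)
case object Subscribe

// Buffers published events until at least one subscriber exists,
// then flushes and publishes directly. Bounded so it cannot grow forever.
class BufferingPublisher(maxBuffered: Int = 1000) extends Actor {
  private var subscribers = Set.empty[ActorRef]
  private var buffer = Vector.empty[Any]

  def receive: Receive = {
    case Subscribe =>
      context.watch(sender())
      subscribers += sender()
      buffer.foreach(e => subscribers.foreach(_ ! e))
      buffer = Vector.empty
    case Terminated(ref) =>
      subscribers -= ref
    case Publish(event) if subscribers.nonEmpty =>
      subscribers.foreach(_ ! event)
    case Publish(event) =>
      // No subscribers yet: buffer explicitly, dropping the oldest if full.
      buffer = (buffer :+ event).takeRight(maxBuffered)
  }
}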
If no one is subscribed, then the messages aren't caught. But if you set things up right, that won't be an issue.
For the second question, it depends on what you mean by dying. Actor death means the actor is gone for good, which is different from an actor failing and being restarted. If the actor is restarted by the fault-handling in its supervisor, it preserves its mailbox, so nothing is lost. Only if the actor completely dies (is not restarted) will it lose its mailbox.
So, with a good fault-handling scheme, you can preserve your messages across actor failures for most use cases. Just keep your listeners higher up in the actor hierarchy and push all risky things that are likely to fail, such as I/O, down to the bottom. Look into the error kernel pattern to see what I mean. That way, failure usually won't climb high enough for you to be losing mailboxes.
Since it's just a restart, all messages it is subscribed for will still end up in its mailbox and be waiting for it.