Akka Terminate Dead Letters

I completed the first Akka assignment for the Coursera Reactive Programming Class (week five - binary trees).
My question is about Akka itself.
My app runs correctly, but I notice a lot of non-fatal dead letter warnings. Here is one:
[INFO] [01/16/2014 15:09:41.668] [PostponeSpec-akka.actor.default-dispatcher-23] [akka://PostponeSpec/user/$c/$b/$a/$b/$a/$a/$b/$b/$a/$a] Message [akka.dispatch.sysmsg.Terminate] from Actor[akka://PostponeSpec/user/$c/$b/$a/$b/$a/$a/$b/$b/$a/$a#570299303] to Actor[akka://PostponeSpec/user/$c/$b/$a/$b/$a/$a/$b/$b/$a/$a#570299303] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
I notice others have asked about this, and the official answer is that this isn't a problem, it's just verbose information that can be ignored and hidden by updating the logging settings.
I understand the advice to simply ignore this, but it still seems like a sloppy flaw on Akka's part. In this simple learning exercise, I am confident that my actors are never sent a message after they initiate a graceful shutdown. Akka should not be putting anything in the dead letter queue under these idealized circumstances. What is the justification for these dead letters? I also note that the dead-letter message isn't one my app explicitly sends, but an internal Akka message.

As someone who also took the course and asked questions in the course feed, I recall the following: the child actor may decide to stop itself, but its parent may decide to do the same thing. At that point there is an inherent race between the parent's termination and the delivery of the Terminate(child) message; if the parent manages to stop itself before receiving that message, the message ends up in the dead letters queue.
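To make the shape of that race concrete, here is a minimal classic-Akka sketch (the names and the Thread.sleep are mine, not from the assignment; it reproduces the timing pattern, not a guaranteed dead letter):

import akka.actor.{Actor, ActorSystem, Props}

// Child that stops itself once its work is done.
class Child extends Actor {
  def receive = {
    case "done" => context.stop(self) // child decides to stop
  }
}

// Parent that stops itself at roughly the same moment.
class Parent extends Actor {
  private val child = context.actorOf(Props[Child](), "child")
  def receive = {
    case "finish" =>
      child ! "done"     // the child will stop on its own...
      context.stop(self) // ...while the parent stops too: the race
  }
}

object Demo extends App {
  val system = ActorSystem("demo")
  system.actorOf(Props[Parent](), "parent") ! "finish"
  Thread.sleep(500) // crude: give the actors time before shutdown
  system.terminate()
}

Depending on scheduling, the internal Terminate bookkeeping for the child can arrive after its recipient is already gone, which is exactly what the INFO message reports.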

Related

DDD - Concurrency and Command retrying with side-effects

I am developing an event-sourced Electric Vehicle Charging Station Management System, which is connected to several Charging Stations. In this domain, I've come up with an aggregate for the Charging Station, which includes the internal state of the Charging Station (whether it is network-connected, if a car is charging using the station's connectors).
The station notifies me about its state through messages defined in a standardized protocol:
Heartbeat: whether the station is still "alive"
StatusNotification: whether the station has encountered an error (under voltage), or if everything is correct
And my server can send commands to this station:
RemoteStartTransaction: tells the station to unlock and reserve one of its connectors, for a car to charge using the connector.
I've developed an Aggregate for this Charging Station. It contains the internal entities of its connectors: whether each is charging or not, whether it has a problem in the power system, ...
And the Aggregate, whose in-memory representation resides on the server that I control, not on the Charging Station itself, has a StationClient service, which is responsible for sending these commands to the physical Charging Station (pseudocode):
class StationAggregate {
  stationClient: StationClient
  URL: string
  connectors: Connector[]

  unlock(connectorId) {
    // guard: reject the command if the connector cannot be unlocked
    if (!this.connectors.find(connectorId).isAvailableToBeUnlocked()) {
      return ErrorConnectorNotAvailable
    }
    error = this.stationClient.sendRemoteStartTransaction(this.URL, connectorId)
    if (error) {
      return ErrorStationRejectedUnlock
    }
    this.applyEvents([
      StationUnlockedEvent(connectorId, now())
    ])
    return Ok
  }

  receiveHeartbeat(timestamp) {
    this.applyEvents([
      StationSentHeartbeat(timestamp)
    ])
    return Ok
  }
}
I am using optimistic concurrency: I load the Aggregate from a list of events, and I keep track of the current version of the Aggregate in its in-memory representation. If the StationAggregate is at version #2032 and a command is successfully processed and its event(s) applied, it would then be at version #2033, for example. That way, I can put a unique constraint on the (StationID, Version) tuple in my persistence layer and guarantee that only one event is persisted for a given version.
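To illustrate the mechanism, here is a deliberately simplified, hypothetical in-memory sketch in Scala (not my actual persistence layer):

// The (StationID, Version) unique constraint: whoever claims a version
// second gets a conflict instead of silently overwriting.
class EventStore {
  private var events = Map.empty[(String, Long), Any]

  def append(stationId: String, expectedVersion: Long, event: Any): Either[String, Long] =
    synchronized {
      val next = expectedVersion + 1
      if (events.contains((stationId, next))) Left(s"version $next already written")
      else {
        events += ((stationId, next) -> event)
        Right(next)
      }
    }
}

object VersionDemo extends App {
  val store = new EventStore
  println(store.append("station-1", 2032, "StationUnlockedEvent")) // Right(2033)
  println(store.append("station-1", 2032, "StationSentHeartbeat")) // Left(...): lost the race
}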
Suppose that a Heartbeat message and an Unlock command are received at the same time. Both threads would load the StationAggregate, and both would see it at version X. In the case of the Heartbeat there are no side effects, but in the case of the Unlock command there is a side effect: it tells the physical Charging Station to unlock. However, as I'm using optimistic concurrency, that StationUnlocked event could be rejected by the persistence layer. I don't know how I could handle that, as I can't retry the command, because it is inherently not idempotent (the physical Station would reject the second request).
I don't know if I'm modelling something wrong, or if it's really a hard domain to model.
I am not sure I fully understand the problem, but the idea of optimistic concurrency is to prevent writes in case of a race condition. Versions are used to ensure that the version you write is exactly one greater than the version you read from the database before executing the command.
So, in case there's a parallel write that won and you got the wrong version exception back from the event store, you retry the command execution entirely, meaning you read the stream again and by doing so you get the latest state with the new version. Then, you give the command to the aggregate, which decides if it makes sense to perform the operation or not.
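In code, the retry looks roughly like this (a sketch in Scala for concreteness; load, decide and append are hypothetical helpers standing in for your stream replay, aggregate logic and optimistic write):

sealed trait Decision
case class Accepted(events: List[Any]) extends Decision
case class Rejected(reason: String) extends Decision

sealed trait Result
case object Ok extends Result
case class Error(reason: String) extends Result

case class Command(stationId: String)

def load(stationId: String): (Any, Long) = ???                 // replay the stream
def decide(state: Any, cmd: Command): Decision = ???           // aggregate logic
def append(stationId: String, expectedVersion: Long,
           events: List[Any]): Either[String, Long] = ???      // optimistic write

@annotation.tailrec
def handleWithRetry(cmd: Command, attemptsLeft: Int = 3): Result = {
  val (state, version) = load(cmd.stationId) // fresh state and version
  decide(state, cmd) match {
    case Rejected(reason) => Error(reason)   // no longer makes sense: stop here
    case Accepted(newEvents) =>
      append(cmd.stationId, version, newEvents) match {
        case Right(_)                    => Ok
        case Left(_) if attemptsLeft > 1 => handleWithRetry(cmd, attemptsLeft - 1)
        case Left(err)                   => Error(err)
      }
  }
}

The important part is that the retry re-runs the whole decision, not just the write.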
The issue is not particularly related to Event Sourcing; it is just as relevant for any kind of persistence, and it is resolved in the same way.
Event Sourcing can bring you additional benefits, since you know what happened. Imagine that by accident you got the Unlock command twice. When you get the "wrong version" error back from the store, you can read the last event and decide whether the command has already been executed. That can be done logically (there's no need to unlock if it's already unlocked by the same customer), technically (put the command id in the event metadata and compare), or both ways.
When handling duplicate commands, it makes sense to ensure a decent level of idempotence in the command handling: ignore the duplicate and return OK instead of failing in the user's face.
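A sketch of both checks (logical and technical), assuming a hypothetical commandId stored in the event metadata:

case class UnlockCommand(commandId: String, connectorId: String, customerId: String)
case class EventEnvelope(commandId: String, payload: Any)
case class ConnectorState(unlocked: Boolean, customerId: Option[String])

// Duplicate detection: technically (same command id persisted already)
// or logically (the same customer already holds the unlocked connector).
def alreadyDone(cmd: UnlockCommand,
                lastEvent: EventEnvelope,
                connector: ConnectorState): Boolean =
  lastEvent.commandId == cmd.commandId ||
    (connector.unlocked && connector.customerId.contains(cmd.customerId))

If alreadyDone is true after a version conflict, return OK instead of an error.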
Another observation that I can deduce from the very limited amount of information about the domain, is that heartbeats are telemetry and locking and unlocking are business. I don't think it makes a lot of sense to combine those two distinctly different things in one domain object.
Update, following the discussion in comments:
What you got by sending the command to the station at the same time as producing the event is a variation of a two-phase commit. Since the two operations are not executed in a transaction, either of them could fail and leave the system in an inconsistent state: you don't know whether the station got the command to unlock itself if the send failed, and you don't know that it's unlocked if the event persistence failed. You only got as far as the second case, but the first could happen too.
There are quite a few ways to solve it.
First, you can solve it entirely technically. With MassTransit, it's quite easy to fix using the Outbox: it will not send any outgoing messages until the consumer of the original message has fully completed its work. Therefore, if the consumer of the Unlock command fails to persist the event, the command will not be sent. Then the retry filter would engage and the whole operation would be executed again; by that point you'd be out of the race condition, so the operation would complete properly.
But it won't solve the issue when your command to the physical station fails to send (I reckon it is an edge case).
This issue can also be solved easily, and here Event Sourcing is helpful. You'd need to move sending the command to the station out of the original (user-driven) command handler and into a subscriber: subscribe to the event stream for the StationUnlocked event and let the subscriber send the command to the station. With that, you only send commands to the station if the event was persisted, and you can retry sending the command as many times as you need.
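Sketched with hypothetical names, the split looks like this:

case class StationUnlockedEvent(connectorId: String)

trait StationClient {
  def sendRemoteStartTransaction(url: String, connectorId: String): Either[String, Unit]
}

// Runs on the subscription side, only for events that were actually
// persisted, so retrying the physical send is safe.
class UnlockSender(client: StationClient, stationUrl: String) {
  def onEvent(event: Any): Unit = event match {
    case StationUnlockedEvent(connectorId) =>
      var attempts = 0
      var sent = false
      while (!sent && attempts < 5) { // retry budget is arbitrary here
        attempts += 1
        sent = client.sendRemoteStartTransaction(stationUrl, connectorId).isRight
      }
    case _ => () // heartbeats and other events are ignored here
  }
}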
Finally, you can solve it in a more meaningful way and change the semantics. I already mentioned that heartbeats are telemetry messages. I could expect the station also to respond to lock and unlock commands, telling you if it actually did what you asked.
You can use the station telemetry to create a representation of the physical station, which is not a part of the aggregate. In fact, it's more like an ACL (anti-corruption layer) to the physical world, represented as a read model.
When you have such a mirror of the physical station on your side, then when you execute the Unlock command in your domain, you can engage a domain service to consult the current station state and make a decision. If you find that the station is already unlocked and the session id matches (yes, I remember our previous discussion :)), you return OK and safely ignore the command. If it's locked, you proceed. If it's unlocked and the session id doesn't match, it's obviously an error and you need to do something else.
With this last option, you clearly separate telemetry processing from the business, so heartbeats won't impact your domain model and you really won't have the versioning issue. You'd also always have a place to look to understand the current state of the physical station.
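The consultation step could look like this (again, hypothetical names; the view is maintained purely from telemetry):

// Read model mirroring the physical station.
case class PhysicalStationView(unlocked: Boolean, sessionId: Option[String])

sealed trait UnlockDecision
case object AlreadyUnlocked extends UnlockDecision // same session: return OK
case object ProceedToUnlock extends UnlockDecision
case object Inconsistent extends UnlockDecision    // unlocked under another session

def decideUnlock(view: PhysicalStationView, sessionId: String): UnlockDecision =
  view match {
    case PhysicalStationView(true, Some(s)) if s == sessionId => AlreadyUnlocked
    case PhysicalStationView(false, _)                        => ProceedToUnlock
    case _                                                    => Inconsistent
  }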

Can I persist events for other actors?

Using akka-typed, I'm trying to create an event-sourced application in which a command on one actor can cause an effect on another actor. Specifically I have the following situation:
RootActor
BranchActor (it's the representation of a child of the Root)
When RootActor is issued a CreateBranch command, validation happens, and if everything is OK the results must be:
RootActor will update its list of children
BranchActor will be initialized with some contents (previously given in the command)
RootActor replies to the issuer of the command with OperationDone
Right now the only thing I could come up with is: RootActor processes the Event and, as a side effect, issues a command to the BranchActor, which in turn saves an initialization event and replies to the RootActor, which finally replies to the original issuer.
This looks way too complicated, though, because:
I need to use a pipe to self mechanism, which implies that
I need to manage internal commands as well that allow me to reply to the original issuer
I need to manage the case where that operation might fail, and if this fails, it means that the creation of a branch is not atomic, whereas saving two events is atomic, in the sense that either both are saved or neither is.
I need to issue another command to another actor, but I shouldn't need to do that, because the primary command should take care of everything
The new command would have to be validated, even though that shouldn't be necessary, because in this case it comes from the system and not an "external" user.
My question then is: can't I just save from the RootActor two events, one for self, and one for a target BranchActor?
Also, as a bonus question: is this even a good practice for event-sourcing?
My question then is: can't I just save from the RootActor two events, one for self, and one for a target BranchActor?
No. Not to sound trite, but the only thing you can do to an actor is send a message to it. If you must do what you are doing, you are on the right path (e.g. pipeTo etc.).
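For reference, the pipe-to-self shape in Akka Typed looks roughly like this (message names are hypothetical; initBranch stands in for whatever ask you make of the BranchActor):

import akka.actor.typed.scaladsl.Behaviors
import akka.actor.typed.{ActorRef, Behavior}
import scala.concurrent.Future
import scala.util.{Failure, Success}

sealed trait Command
final case class CreateBranch(replyTo: ActorRef[String]) extends Command
// Internal command carrying the branch-initialization outcome back to self.
final case class BranchInitialized(result: Either[String, Unit],
                                   replyTo: ActorRef[String]) extends Command

def rootActor(initBranch: () => Future[Unit]): Behavior[Command] =
  Behaviors.receive { (context, message) =>
    message match {
      case CreateBranch(replyTo) =>
        // Kick off the branch initialization and pipe its result back
        // to self as an internal command.
        context.pipeToSelf(initBranch()) {
          case Success(_) => BranchInitialized(Right(()), replyTo)
          case Failure(e) => BranchInitialized(Left(e.getMessage), replyTo)
        }
        Behaviors.same
      case BranchInitialized(result, replyTo) =>
        replyTo ! result.fold(err => s"CreateBranch failed: $err", _ => "OperationDone")
        Behaviors.same
    }
  }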
is this even a good practice for event-sourcing?
It's not a good practice. Whether it's suboptimal or a flat-out anti-pattern is still debatable. (I feel I can say this confidently because of this Lightbend Discussion thread, where it was debated with one side arguing "tricky but I have no regrets" and the other side arguing "explicit anti-pattern".)
To quote someone from an internal Slack (I don't want to attribute it to him without his permission, but I saved it because it seemed to so elegantly sum up this kind of scenario):
If an event sourced actor needs to contact another actor to make the decision if it can persist an event, then we are not modeling a consistency boundary anymore. It should only rely on the state that [it has] in scope (own state and incoming command). … all the gymnastics (persist the fact that its awaiting confirmation, stash, pipe to self) to make it work properly is an indication that we are not respecting the consistency boundary.
If you can't fix your aggregates such that one actor is responsible for the entire consistency boundary, the better practice is to enrich the command beforehand: essentially building a Saga pattern.

What is the purpose of stopping actors in Akka?

I have read the Akka docs on fault tolerance & supervision, and I think I totally get them, with one big exception (no pun intended).
Why would you ever want/need to stop a child actor???
The only clue in the docs is:
Closer to the Erlang way is the strategy to just stop children when they fail and then take corrective action in the supervisor...
But to me, stopping a child is the same as saying "don't execute this code any longer", which, to me, is effectively the same as deploying new changes to the code with that actor removed entirely:
Every Actor plays some critical role in the actor system
To simply stop the actor means that actor currently doesn't have a role any longer, and presumes the system can now somehow (magically) work without it
So again, to me, this is no different than refactoring the code to not even have the actor any more, and then deploying those changes
I'm sure I'm just not seeing the forest through the trees on this one, but I just don't see any use cases where I'd have this big complex actor system, where each actor does critical work and then hands it off to the next critical actor, but then I stop an actor, and magically the whole system keeps on working perfectly.
In short: stopping an actor (to me) is like ripping the transmission out of a moving vehicle. How can this ever be a good/desirable thing?!?
The essence of the "error kernel" pattern is to delegate risky operations and protect essential state, it is common to spawn child-actors for one-off operations, and when that operation is completed and its result send off somewhere else, the child-actor or the parent-actor needs to stop it. (otherwise the child-actor will remain active/leak)
If the child actor is running a longer process that can be terminated safely, such as video encoding or some kind of file transformation, and you have to deploy a new build, then a terminate signal is useful to stop the running work gracefully.
Every Actor plays some critical role in the actor system
This is where you are running into trouble. I can create a child actor to do one job, for example to execute a query against a database or to maintain the state of a connected user, and that is its only purpose.
Once the database query is complete or the user has gracefully disconnected, the child actor no longer has any role to play and should be stopped so that it releases any resources it holds.
To simply stop the actor means that actor currently doesn't have a role any longer, and presumes the system can now somehow (magically) work without it
The system is able to continue because I can create new child actors if/when they are needed.
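A sketch of that lifecycle in classic Akka (the query protocol is made up):

import akka.actor.{Actor, ActorRef, Props}

// One-shot worker: created per query, replies once, then stops itself so
// its mailbox and any resources it holds are released.
class QueryWorker(replyTo: ActorRef) extends Actor {
  def receive = {
    case query: String =>
      val result = s"rows for: $query" // stand-in for the real database call
      replyTo ! result
      context.stop(self) // role fulfilled; nothing left for this actor to do
  }
}

object QueryWorker {
  def props(replyTo: ActorRef): Props = Props(new QueryWorker(replyTo))
}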

Does it make sense to watch(self) in akka?

As far as I understand, context.watch simply delivers an actor.Terminated message to the watcher. I wanted it to be the last message that the actor receives, yet I see that it is never delivered. I guess that's because the actor is terminated and doesn't process messages any more. As part of the answer, please say what the expected behaviour is, and what the right way to handle the stop condition is.
Seems like you've already answered your own question: watching self will not result in that actor receiving a Terminated message for itself. The real question is why you need that message. If you just need to clean up resources, override postStop and put that logic there.
postStop is guaranteed to be executed after messages have stopped being enqueued in that actor's mailbox so you can be sure nothing will come after it.
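A minimal sketch of that cleanup (the resource type is just illustrative):

import akka.actor.Actor

class ConnectionOwner extends Actor {
  private var connection: Option[AutoCloseable] = None // stand-in resource

  def receive = {
    case conn: AutoCloseable => connection = Some(conn)
  }

  // Runs after the mailbox has stopped accepting messages: the last code
  // of this actor guaranteed to execute, so release resources here rather
  // than watching self for a Terminated that never arrives.
  override def postStop(): Unit =
    connection.foreach(_.close())
}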

What happens to running processes on a continuous Azure WebJob when website is redeployed?

I've read about graceful shutdowns here using the WEBJOBS_SHUTDOWN_FILE and here using Cancellation Tokens, so I understand the premise of graceful shutdowns, however I'm not sure how they will affect WebJobs that are in the middle of processing a queue message.
So here's the scenario:
I have a WebJob with functions listening to queues.
Message is added to Queue and job begins processing.
While processing, someone pushes to develop, triggering a redeploy.
Assuming I have my WebJobs hooked up to deploy on git pushes, this deploy will also trigger the WebJobs to be updated, which (as far as I understand) will kick off some sort of shutdown workflow in the jobs. So I have a few questions stemming from that.
1. Will jobs in the middle of processing a queue message finish processing the message before the job quits? Or is any shutdown notification essentially treated as "this bitch is about to shut down. If you don't have anything to handle it, you're SOL."?
2. If we are SOL, is our best option for handling shutdowns essentially to wrap anything we're doing in the equivalent of DB transactions and implement our shutdown handler in such a way that all changes are rolled back on shutdown?
3. If a queue message is in the middle of being processed and the WebJob shuts down, will that message be requeued? If not, does that mean that my shutdown handler needs to handle requeuing that message?
4. Is it possible for functions listening to queues to grab any more queue messages after the Job has been notified that it needs to shut down?
Any guidance here is greatly appreciated! Also, if anyone has any other useful links on how to handle job shutdowns besides the ones I mentioned, it would be great if you could share those.
After no small amount of testing, I think I've found the answers to my questions and I hope someone else can gain some insight from my experience.
NOTE: All of these scenarios were tested using .NET Console Apps and Azure queues, so I'm not sure how blobs, table storage, or different Job file types would handle these scenarios.
1. After a Job has been marked to exit, the triggered functions that are running have a configured grace period (5 seconds by default, but I think it is configurable via a settings.job file; see the sketch after this list) to finish before they are exited. If they do not finish within the grace period, the function quits. Main() (or whichever file you declared host.RunAndBlock() in), however, will keep running any code after host.RunAndBlock() for up to the amount of time remaining in the grace period (I'm not sure how that would work if you used an infinite loop instead of RunAndBlock). As far as handling the quit in your functions, you can essentially "listen" to the CancellationToken that you can pass into your triggered functions by checking IsCancellationRequested and then handle it accordingly. Also, you are not SOL if you don't handle the quits yourself. Huzzah! See point #3.
2. While you are not SOL if you don't handle the quit (see point #3), I do think it is a good idea to wrap all of your jobs in transactions that you don't commit until you're absolutely sure the job has run its course. This way, if your function exits mid-process, you'll be less likely to have to worry about corrupted data. I can think of a couple of scenarios where you might want to commit transactions as you go (batch jobs, for instance); however, you would then need to structure your data or logic so that previously processed entities aren't reprocessed after the job restarts.
3. You are not in trouble if you don't handle job quits yourself. My understanding of what's going on under the covers is virtually non-existent; however, I am quite sure of the results. If a function is in the middle of processing a queue message and is forced to quit before it can finish, HAVE NO FEAR! When the job grabs the message to process, it essentially hides it on the queue for a certain amount of time (the visibility timeout). If your function quits while processing the message, that message "becomes visible" again after that time, and it will be re-grabbed and run against the potentially updated code that was just deployed.
4. I have about 90% confidence in my findings for this one, because testing it involved quick-switching between windows while not being totally sure what was going on with certain pieces. But here's what I found: on the off chance that a queue gets a new message in the grace period before a job quits, I THINK one of two things can happen. If the function doesn't poll that queue before the job quits, the message will stay on the queue and be grabbed when the job restarts. However, if the function DOES grab the message, it will be treated the same as any other message that was interrupted: it will "become visible" on the queue again and be rerun when the job restarts.
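For reference on the grace period mentioned in #1: my understanding is that you can raise it with a settings.job file placed next to the WebJob executable. A minimal sketch (the value is in seconds; the key name is from memory, so verify against the current Kudu/WebJobs docs):

{
  "stopping_wait_time": 60
}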
That pretty much sums it up. I hope other people will find this useful. Let me know if you want any of this expounded on and I'll be happy to try. Or if I'm full of it and you have lots of corrections, those are probably more welcome!