Akka: Persistence failure when replaying events - akka

We are working on an event sourced application with akka-persistence, using an Oracle database as the event store. The application has been running in production for some time now. Lately we have been seeing the following error for some of the persistent actors.
Persistence failure when replaying events for persistenceId [some-persistence-id]. Last known sequence number [0]
Can someone who faced a similar issue in their application share their experience of why this happens?
Also, going through the Akka documentation at https://doc.akka.io/docs/akka/current/persistence.html, onRecoveryFailure is responsible for handling such failures. Is there a way we can override this method to ignore the persisted events when we see failures while replaying them? In our scenario replaying the events is not critical, and we can serve users even by ignoring them.

That log is typically a manifestation of something else. Since the failure is from sequence number zero, that points to an actual query to the DB failing (e.g. a timeout). There should be other logs around the same time which provide further information.
Akka Persistence has a fairly strong assumption that the persisted state is important (otherwise why would you be persisting?). Off the top of my head, I would consider separating the parts of the actor which are affected by persistence from the parts which aren't: the non-persistent actor can spawn a persistent child and interact with it (it can do tricks with stashing, for instance, to present an illusion that it and its child are the same actor).
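A minimal sketch of that separation, using classic actors and hypothetical names (FrontActor, AuditLog): the front actor replies to users regardless of whether the persistent child recovers, so a replay failure does not block request handling.

    import akka.actor.{Actor, ActorRef, Props}
    import akka.persistence.PersistentActor

    // Hypothetical persistent child that owns the replayed state.
    class AuditLog(id: String) extends PersistentActor {
      override def persistenceId: String = s"audit-$id"
      private var events: List[String] = Nil
      override def receiveRecover: Receive = { case e: String => events ::= e }
      override def receiveCommand: Receive = { case e: String => persist(e)(events ::= _) }
    }

    // Non-persistent front actor: serves users even if the child's recovery fails.
    // (Stashing could be added here to hide the child entirely, as described above.)
    class FrontActor(id: String) extends Actor {
      private val audit: ActorRef = context.actorOf(Props(new AuditLog(id)), "audit")
      def receive: Receive = {
        case cmd: String =>
          audit ! cmd                     // best-effort persistence
          sender() ! s"handled: $cmd"     // reply regardless of persistence outcome
      }
    }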

Related

Akka Durable State restore

I have an existing Akka Typed application and am considering adding support for persistent actors using the Durable State feature. I am not currently using Cluster Sharding, but plan to implement it sometime in the future (after implementing Durable State).
I have read the documentation on how to implement Durable State to persist the actor's state, and that all makes sense. However, there does not appear to be any information in that document about how/when an actor's state gets recovered, and I'm not quite clear as to what I would need to do to recover persisted actors when the entire service is restarted.
My current architecture consists of an HTTP service (using AkkaHTTP), a "dispatcher" actor (which is the ActorSystem's guardian actor, and currently a singleton), and N number of "worker" actors, which are children of the dispatcher. Both the dispatcher actor and the worker actors are stateful.
The dispatcher actor's state contains a map of requestId->ActorRef. When a new job request comes in from the HTTP service, the dispatcher actor creates a worker actor, and stores its reference in the map. Future requests for the same requestId (i.e. status and result queries) are forwarded by the dispatcher to the appropriate worker actor.
Currently, if the entire service is restarted, the dispatcher actor is recreated as a blank slate, with an empty worker map. None of the worker actors exist anymore, and their status/results can no longer be retrieved.
What I want to accomplish when the service is restarted is that the dispatcher gets recreated with its last-persisted state. All of the worker actors that were in the dispatcher's worker map should get restored with their last-persisted states as well. I'm not sure how much of this is automatic, simply by refactoring my actors as persistent actors using Durable State, and what I need to do explicitly.
Questions:
Upon restart, if I create the dispatcher (guardian) actor with the same name, is that sufficient for Akka to know to restore its persisted state, or is there something more explicit that I need to do to tell it to do that?
Since persistent actors require the state to be serializable, will this work with the fact that the dispatcher's worker map references the workers by ActorRef? Are those serializable, or do I need to switch it to referencing them by name?
If I leave the references to the worker actors as ActorRefs, and the service is restarted, will those ActorRefs (that were restored as part of the dispatcher's persisted state) continue to work, and will the worker actors' persisted states be automatically restored? Or, again, do I need to do something explicit to tell it to revive those actors and restore their states?
Currently, since all of the worker actors are not persisted, I assume that their states are all held in memory. Is that true? I currently keep all workers around indefinitely so that the results of their work (which is part of their state) can be retrieved in the future. However, I'm worried about running out of memory on the server. I'd like to have workers that are done with their work be able to be persisted to disk only, kind of "putting them to sleep", so that the results of their work can be retrieved in the future, without taking up memory, days or weeks later. I'd like to have control over when an actor is "in memory", and when it's "on disk only". Can this Durable State persistence serve as a mechanism for this? If so, can I kill an actor, and then revive it on demand (and restore its state) when I need it?
The durable state is stored keyed by an akka.persistence.typed.PersistenceId. There's no necessary relationship between the actor's name and its persistence ID.
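For illustration, a minimal DurableStateBehavior sketch (names like DispatcherState and RegisterWorker are hypothetical): spawning the behavior again with the same PersistenceId, regardless of the actor's name, recovers the last saved state.

    import akka.actor.typed.Behavior
    import akka.persistence.typed.PersistenceId
    import akka.persistence.typed.state.scaladsl.{DurableStateBehavior, Effect}

    final case class DispatcherState(workers: Map[String, String] = Map.empty)

    sealed trait Command
    final case class RegisterWorker(requestId: String, workerName: String) extends Command

    // The stored state is keyed by the PersistenceId below, not by the actor's name.
    def dispatcher(entityId: String): Behavior[Command] =
      DurableStateBehavior[Command, DispatcherState](
        persistenceId = PersistenceId("Dispatcher", entityId),
        emptyState = DispatcherState(),
        commandHandler = (state, cmd) =>
          cmd match {
            case RegisterWorker(id, worker) =>
              Effect.persist(state.copy(workers = state.workers + (id -> worker)))
          }
      )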
ActorRefs are serializable (the included Jackson serializations (CBOR or JSON) do it out of the box; if using a custom serializer, you will need to use the ActorRefResolver), though in the persistence case, this isn't necessarily that useful: there's no guarantee that the actor pointed to by the ref is still there (consider, for instance, if the JVM hosting that actor system has stopped between when the state was saved and when it was read back).
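If you do end up serializing ActorRefs with a custom serializer, the round trip looks roughly like this (a sketch only; as noted above, the restored ref may point at nothing after a restart):

    import akka.actor.typed.{ActorRef, ActorRefResolver, ActorSystem}
    import akka.actor.typed.scaladsl.Behaviors

    val system: ActorSystem[String] = ActorSystem(Behaviors.ignore[String], "demo")
    val resolver = ActorRefResolver(system)

    val ref: ActorRef[String] = system // any typed ActorRef will do for the example
    val serialized: String = resolver.toSerializationFormat(ref)   // stable string form
    val restored: ActorRef[String] = resolver.resolveActorRef[String](serialized)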
Non-persistent actors keep all their state in memory until they're stopped (this assumes they're not themselves interacting directly with some persistent data store: nothing stops you from writing an actor that reads its state from somewhere else on startup, possibly stashing incoming commands until that read completes, and writes out state changes... that's basically all durable state is under the hood). The mechanism of stopping an actor is typically called "passivation": in typed you typically have a Passivate command in the actor's protocol. Bringing it back is then often called "rehydration". Both event-sourced and durable-state persistence are very useful for implementing this.
Note that it's absolutely possible to run a single-node Akka Cluster and have sharding. Sharding brings a notion of an "entity", which has a string name and is conceptually immortal/eternal (unlike an actor, which has a defined birth-to-death lifecycle). Sharding then has a given entity be incarnated by at most one actor at any given time in a cluster (I'm ignoring the multiple-datacenter case: if multiple datacenters are in use, you're probably going to want event sourced persistence). Once you have an EntityRef from sharding, the EntityRef will refer to whatever the current incarnation is: if a message is sent to the EntityRef and there's no living incarnation, a new incarnation is spawned. If the behavior for that TypeKey which was provided to sharding is a persistent behavior, then the persisted state will be recovered. Sharding can also implement passivation directly (with a few out-of-the-box strategies supported).
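A sketch of that wiring with Cluster Sharding, reusing the hypothetical dispatcher behavior and Command protocol from the Durable State sketch above: messaging an EntityRef spawns and recovers the entity on demand.

    import akka.actor.typed.ActorSystem
    import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

    val DispatcherKey = EntityTypeKey[Command]("Dispatcher")

    def initSharding(system: ActorSystem[_]): Unit =
      ClusterSharding(system).init(Entity(DispatcherKey) { entityContext =>
        dispatcher(entityContext.entityId) // the DurableStateBehavior sketched earlier
      })

    // Sending via an EntityRef spawns (and recovers) the entity if no incarnation is alive:
    // val ref = ClusterSharding(system).entityRefFor(DispatcherKey, "request-42")
    // ref ! RegisterWorker("request-42", "worker-1")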
You can implement similar functionality yourself (for situations where there aren't many children of the dispatcher, a simple map in the dispatcher and asks/watches will work).
The Akka Platform Guide tutorial works an example using cluster sharding and persistence (in this case, it's event sourced, but the durable state APIs are basically the same, especially if you ignore the CQRS bits).

AWS SQS BackUp Solution Design

Problem Statement
Informal State
We have some scenarios where the integration layer (a combination of AWS SNS/SQS components, etc.) is also responsible for distributing data to target systems. These are mostly async flows. In this case, we send a confirmation to the caller that we have received the data and will take responsibility for its delivery. Although the data does not originate from the integration layer, we are still holding it and need to make sure it is not lost, for example if the consumers are down, or if messages are sent to the DLQs on error and hence automatically deleted after the retention period.
Solution Design
My current idea is to back up the SQS/DLQ queues based on CloudWatch alerts configured on the ApproximateAgeOfOldestMessage metric with an applied threshold (something like the below):
Msg Expiration Event if ApproximateAgeOfOldestMessage / Message retention > Threshold
Now, the more I go forward with this idea, the more I doubt that it is actually the right approach…
In particular, I would like to build something unobtrusive that can be "attached" to our SQS queues and dump the messages that are about to expire into some repository, such as AWS S3, and then have a procedure to recover the messages from S3 back to the original queue.
The above procedure involves many challenges, such as message identification and consumption (ReceiveMessage is not designed to "query" for specific messages), dumping messages into the repository with a reference to the source queue, etc., which suggests to me that this approach might be an overly complex overkill.
That being said, I'm aware of other "alternatives" (such as this), but I would appreciate it if you could address the specific technical details described above, rather than challenging the "need" itself.
Similar to Mark B's suggestion, you can use the SQS extended client (https://github.com/awslabs/amazon-sqs-java-extended-client-lib) to send all your messages through S3 (which is a configuration knob: https://github.com/awslabs/amazon-sqs-java-extended-client-lib/blob/master/src/main/java/com/amazon/sqs/javamessaging/ExtendedClientConfiguration.java#L189).
The extended client is a drop-in replacement for the AmazonSQS interface so it minimizes the intrusion on business logic - usually it's a matter of just changing your dependency injection.
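A rough sketch of that wiring (assuming the 1.x extended client; method names may differ slightly between versions, and the bucket name is a placeholder): every message body is written to S3 and only a pointer travels through SQS.

    import com.amazon.sqs.javamessaging.{AmazonSQSExtendedClient, ExtendedClientConfiguration}
    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder

    val s3 = AmazonS3ClientBuilder.defaultClient()

    val extendedConfig = new ExtendedClientConfiguration()
      .withLargePayloadSupportEnabled(s3, "my-sqs-payload-bucket") // placeholder bucket
      .withAlwaysThroughS3(true) // the configuration knob linked above

    // Drop-in replacement for the AmazonSQS interface; inject it wherever AmazonSQS is used.
    val sqs = new AmazonSQSExtendedClient(AmazonSQSClientBuilder.defaultClient(), extendedConfig)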

Losing event publishing in Persistent Actor on crash

In this example from the Akka Persistence documentation:
val receiveRecover: Receive = {
  case evt: Evt => updateState(evt)
  case SnapshotOffer(_, snapshot: ExampleState) => state = snapshot
}

val snapShotInterval = 1000
val receiveCommand: Receive = {
  case Cmd(data) =>
    persist(Evt(s"${data}-${numEvents}")) { event =>
      updateState(event)
      context.system.eventStream.publish(event)
      if (lastSequenceNr % snapShotInterval == 0 && lastSequenceNr != 0)
        saveSnapshot(state)
    }
  case "print" => println(state)
}
I understand that this lambda:
event =>
  updateState(event)
  context.system.eventStream.publish(event)
  if (lastSequenceNr % snapShotInterval == 0 && lastSequenceNr != 0)
    saveSnapshot(state)
Is executed when the event has been successfully persisted.
What if the actor crashes while this lambda is being executed BEFORE the event has been successfully published, i.e. before context.system.eventStream.publish(event)?
Do I understand correctly that in such a case the event is never published, which may lead to an inconsistent state of the system? If so, is there any way to detect that this has happened?
[EDIT]
Also, if you use event publishing in your system, then correct me if I am wrong:
If your application is deployed in one JVM and you use Akka's default event publishing facilities, then a JVM crash will mean that all events published but not yet processed are lost, since that facility does not have any recovery mechanism.
If your application is deployed in a cluster, then you'll run into the same situation as above only if the whole cluster goes down.
For any production setup you should configure something like Kafka for event publishing/consuming.
I understand that this lambda:
...
Is executed when the event has been successfully persisted. What if the actor crashes while this lambda is being executed BEFORE successful publishing of the event, ie before context.system.eventStream.publish(event)?
The lambda is run after the state is persisted. And the actor essentially suspends itself (putting all pending work in the stash) until that persistence is complete so that it remains consistent.
Do I understand correctly that in such case the event is never published which may lead to an inconsistent state of the system?
No, it will remain consistent for the above reason.
If your application is deployed in one JVM and you use the default Akka's event publishing facilities, then JVM crash will mean that all events published but not yet processed will be lost since that facility does not have any recovery mechanisms.
I guess it depends on what you mean by default event publishing. For regular actors, yes: if you lose the JVM you lose "regular" actors. Regular actors live in memory, essentially like normal Java/Scala objects. Persistent actors are, of course, a different story.
You also say "published but not yet processed". Those, of course, are lost as well. Anything that isn't "processed" is essentially like a JDBC statement that hasn't been received by the database yet, or a message not transmitted to Kafka, etc. The design is essentially to save the event to the database immediately (almost like a transaction log) and then do the work after it is known to be safely persisted.
If your application is deployed in a cluster then you'll run in the same situation as above only if the whole cluster goes down.
A cluster essentially just gives a place for the persistent actor to recover. The cluster still relies on the persistent store for recovery.
(I'm keeping this answer focused on Akka Persistent Actors, the answers get more varied with things like Distributed Data.)
For any production setup you should configure something like Kafka for event publishing/consuming.
Not necessarily. The persistence module is definitely a consistent option. Kafka and Akka are really just different animals with different goals. Kafka is effectively pub/sub, while Akka essentially takes a much more event-sourced approach. I've worked on systems that use both, but they use them for very different purposes.
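Not part of the answer above, but as one hedged illustration of the journal being the source of truth: Akka Persistence Query can re-read a persistence id's events after a crash, which gives you a way to detect and re-drive any publications that were lost. The plugin id below is a placeholder; use the read journal that matches your journal plugin.

    import akka.actor.ActorSystem
    import akka.persistence.query.PersistenceQuery
    import akka.persistence.query.scaladsl.CurrentEventsByPersistenceIdQuery

    implicit val system: ActorSystem = ActorSystem("replayer")

    // Placeholder plugin id; e.g. the JDBC or Cassandra read journal identifier.
    val readJournal = PersistenceQuery(system)
      .readJournalFor[CurrentEventsByPersistenceIdQuery]("example-read-journal")

    readJournal
      .currentEventsByPersistenceId("some-persistence-id", 0L, Long.MaxValue)
      .runForeach { env =>
        // Re-publish (or reconcile) events; journal events are AnyRef in practice.
        system.eventStream.publish(env.event.asInstanceOf[AnyRef])
      }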

How to externalize akka sharded actor state to redis or ignite?

I am very new to Akka clustering and am working on a proof of concept. In my case I have an actor running in a cluster, and the actor holds its state as a Map[String,Any]. For each request the actor receives, it creates a new entity actor and its data map based on the incoming message. The problem is that the map is held in memory right now. Is it possible to store the sharded actor state somewhere external, such as redis or ignite?
You should probably start by having a look at akka-persistence (the persistence module included in akka). The snapshotting part is meant to persist the state directly, but you have to start with the command/event-sourcing part, the snapshotting part being an optional enhancement.
Then you can combine this with automatic passivation of your sharded actors after a certain inactivity timeout.
With the above, you'll have a solution that persists the state of your actors in an external storage system to free up memory, restoring your actor's state whenever they come back to life.
The last step would be to see which storage backends are available for akka-persistence and match them against your requirements; you can of course implement your own.
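As a hedged sketch of what those two pieces look like together in the typed API (the entity name, commands and state here are hypothetical, and the exact settings API varies slightly between Akka versions): snapshots persist the state to the external store, and sharding passivates idle entities so they drop out of memory until the next message arrives.

    import scala.concurrent.duration._
    import akka.actor.typed.{ActorSystem, Behavior}
    import akka.cluster.sharding.typed.ClusterShardingSettings
    import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}
    import akka.persistence.typed.PersistenceId
    import akka.persistence.typed.scaladsl.{Effect, EventSourcedBehavior, RetentionCriteria}

    sealed trait Command
    final case class Put(key: String, value: String) extends Command
    final case class KeyValueSet(key: String, value: String) // event
    final case class State(data: Map[String, String] = Map.empty)

    def entity(id: String): Behavior[Command] =
      EventSourcedBehavior[Command, KeyValueSet, State](
        persistenceId = PersistenceId("KeyValueEntity", id),
        emptyState = State(),
        commandHandler = (_, cmd) => cmd match { case Put(k, v) => Effect.persist(KeyValueSet(k, v)) },
        eventHandler = (state, evt) => State(state.data + (evt.key -> evt.value))
      ).withRetention(RetentionCriteria.snapshotEvery(numberOfEvents = 100, keepNSnapshots = 2))

    val EntityKey = EntityTypeKey[Command]("KeyValueEntity")

    def init(system: ActorSystem[_]): Unit =
      ClusterSharding(system).init(
        Entity(EntityKey)(ctx => entity(ctx.entityId))
          .withSettings(
            // Idle entities are stopped after 2 minutes and recovered on the next message.
            ClusterShardingSettings(system).withPassivateIdleEntityAfter(2.minutes)
          )
      )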

Configuring spray-servlet to avoid request bottleneck

I have an application which uses spray-servlet to bootstrap my custom Spray routing Actor via spray.servlet.Initializer. The requests are then handed off to my Actor via spray.servlet.Servlet30ConnectorServlet.
From what I can gather, the Servlet30ConnectorServlet simply retrieves my Actor out of the ServletContext that the Initializer had set when the application started, and hands the HttpServletRequest to my Actor's receive method. This leads me to believe that only one instance of my Actor will have to handle all requests. If my Actor blocks in its receive method, then subsequent requests will queue waiting for it to complete.
Now I realize that I can code my routing Actor to use detach() or a complete that returns a Future, however most of the documentation never alludes to having to do this.
If my above assumption is true (single Actor instance handling all requests), is there a way to configure the Servlet30ConnectorServlet to perhaps load balance the incoming requests amongst multiple instances of my routing Actor instead of just the one? Or is this something I'll have to roll myself by subclassing Servlet30ConnectorServlet?
I did some research and now I understand better how spray-servlet is working. It's not spray-servlet that dictates the strategy for how many Request Handler Actors are created but rather the plumbing code provided with the example I based my application on.
My assumption all along was that spray-servlet would essentially work like a traditional Java EE application dispatcher in a handler-per-request type of fashion (or some reasonable variant of that notion). That is not the case because it is routing the request to an Actor with a mailbox, not some singleton HttpServlet.
I am now delegating the requests to a pool of actors in order to reduce our potential for bottleneck when our system is under load.
val serviceActor = system.actorOf(RoundRobinPool(config.SomeReasonableSize).props(Props[BootServiceActor]), "my-route-actors")
I am still a bit baffled by the fact that the examples and documentation assume everyone will write non-blocking request handler actors under spray. All of their documentation essentially demonstrates complete with non-Future values, yet there is no mention in their literature that maybe, just maybe, you might want to create a reasonably sized pool of request handler actors to prevent a slew of requests from bottlenecking the poor single overworked actor. Or it's possible I've overlooked it.
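For comparison, the non-blocking alternative mentioned earlier would look roughly like this in spray-routing (a sketch only; blockingLookup and the route are hypothetical): completing with a Future frees the single routing actor instead of tying it up.

    import scala.concurrent.Future
    import spray.routing.HttpService

    trait MyRoute extends HttpService {
      implicit def ec = actorRefFactory.dispatcher

      // Placeholder for some blocking work (a DB call, etc.).
      def blockingLookup(id: String): String = ???

      val route =
        path("items" / Segment) { id =>
          get {
            // The routing actor returns immediately; the response completes when the Future does.
            complete(Future(blockingLookup(id)))
          }
        }
    }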