Losing event publishing in Persistent Actor on crash

Losing event publishing in Persistent Actor on crash - akka

In this example from Akka persistance documentation
val receiveRecover: Receive = {
case evt: Evt => updateState(evt)
case SnapshotOffer(_, snapshot: ExampleState) => state = snapshot
}
val snapShotInterval = 1000
val receiveCommand: Receive = {
case Cmd(data) =>
persist(Evt(s"${data}-${numEvents}")) { event =>
updateState(event)
context.system.eventStream.publish(event)
if (lastSequenceNr % snapShotInterval == 0 && lastSequenceNr != 0)
saveSnapshot(state)
}
case "print" => println(state)
}
I understand that this lambda:
event =>
updateState(event)
context.system.eventStream.publish(event)
if (lastSequenceNr % snapShotInterval == 0 && lastSequenceNr != 0)
saveSnapshot(state)
Is executed when the event has been successfully persisted.
What if the actor crashes while this lambda is being executed BEFORE successful publishing of the event, ie before context.system.eventStream.publish(event)?
Do I understand correctly that in such case the event is never published which may lead to an inconsistent state of the system? If so, is there any way to detect that such thing happened?
[EDIT]
Also, if you use the event publishing in your system, then correct me if I am wrong:
If your application is deployed in one JVM and you use the default Akka's event publishing facilities, then JVM crash will mean that all events published but not yet processed will be lost since that facility does not have any recovery mechanisms.
If your application is deployed in a cluster then you'll run in the same situation as above only if the whole cluster goes down.
For any production setup you should configure something like Kafka for event publishing/consuming.

I understand that this lambda:
...
Is executed when the event has been successfully persisted. What if
the actor crashes while this lambda is being executed BEFORE
successful publishing of the event, ie before
context.system.eventStream.publish(event)?
The lambda is run after the state is persisted. And the actor essentially suspends itself (putting all pending work in the stash) until that persistence is complete so that it remains consistent.
Do I understand correctly that in such case the event is never
published which may lead to an inconsistent state of the system?
No, it will remain consistent for the above reason.
If your application is deployed in one JVM and you use the default Akka's event publishing facilities, then JVM crash will mean that all events published but not yet processed will be lost since that facility does not have any recovery mechanisms.
I guess it depends on what you mean by default event publishing. Regular actors, yes. If you lose the JVM you lose "regular" actors. Regular actors are in memory, essentially like normal Java/Scala objects. Persistent Actors, are, of course a different story.
You also say "published but not yet processed". Those, of course, are lost as well. Anything that isn't "processed" is essentially like a JDBC statement that hasn't been received by the database yet, or a message not transmitted to Kafka, etc. The design is essentially to save the event to the database immediately (almost like a transaction log) and then do the work after it is known to be safely persisted.
If your application is deployed in a cluster then you'll run in the same situation as above only if the whole cluster goes down.
A cluster essentially just gives a place for the persistent actor to recover. The cluster still relies on the persistent store for recovery.
(I'm keeping this answer focused on Akka Persistent Actors, the answers get more varied with things like Distributed Data.)
For any production setup you should configure something like Kafka for event publishing/consuming.
Not necessarily. The persistent module is definitely a consistent option. Kafka and Akka are really just different animals with different goals. Kafka is effectively pub/sub, Akka essentially takes a much more event sourced approach. I've worked systems that use both, but they use them for very different purposes.

Related

Akka Durable State restore

I have an existing Akka Typed application, and am considering adding in support for persistent actors, using the Durable State feature. I am not currently using cluster sharding, but plan to implement that sometime in the future (after implementing Durable State).
I have read the documentation on how to implement Durable State to persist the actor's state, and that all makes sense. However, there does not appear to be any information in that document about how/when an actor's state gets recovered, and I'm not quite clear as to what I would need to do to recover persisted actors when the entire service is restarted.
My current architecture consists of an HTTP service (using AkkaHTTP), a "dispatcher" actor (which is the ActorSystem's guardian actor, and currently a singleton), and N number of "worker" actors, which are children of the dispatcher. Both the dispatcher actor and the worker actors are stateful.
The dispatcher actor's state contains a map of requestId->ActorRef. When a new job request comes in from the HTTP service, the dispatcher actor creates a worker actor, and stores its reference in the map. Future requests for the same requestId (i.e. status and result queries) are forwarded by the dispatcher to the appropriate worker actor.
Currently, if the entire service is restarted, the dispatcher actor is recreated as a blank slate, with an empty worker map. None of the worker actors exist anymore, and their status/results can no longer be retrieved.
What I want to accomplish when the service is restarted is that the dispatcher gets recreated with its last-persisted state. All of the worker actors that were in the dispatcher's worker map should get restored with their last-persisted states as well. I'm not sure how much of this is automatic, simply by refactoring my actors as persistent actors using Durable State, and what I need to do explicitly.
Questions:
Upon restart, if I create the dispatcher (guardian) actor with the same name, is that sufficient for Akka to know to restore its persisted state, or is there something more explicit that I need to do to tell it to do that?
Since persistent actors require the state to be serializable, will this work with the fact that the dispatcher's worker map references the workers by ActorRef? Are those serializable, or do I need to switch it to referencing them by name?
If I leave the references to the worker actors as ActorRefs, and the service is restarted, will those ActorRefs (that were restored as part of the dispatcher's persisted state) continue to work, and will the worker actors' persisted states be automatically restored? Or, again, do I need to do something explicit to tell it to revive those actors and restore their states.
Currently, since all of the worker actors are not persisted, I assume that their states are all held in memory. Is that true? I currently keep all workers around indefinitely so that the results of their work (which is part of their state) can be retrieved in the future. However, I'm worried about running out of memory on the server. I'd like to have workers that are done with their work be able to be persisted to disk only, kind of "putting them to sleep", so that the results of their work can be retrieved in the future, without taking up memory, days or weeks later. I'd like to have control over when an actor is "in memory", and when it's "on disk only". Can this Durable State persistence serve as a mechanism for this? If so, can I kill an actor, and then revive it on demand (and restore its state) when I need it?

The durable state is stored keyed by an akka.persistence.typed.PersistenceId. There's no necessary relationship between the actor's name and its persistence ID.
ActorRefs are serializable (the included Jackson serializations (CBOR or JSON) do it out of the box; if using a custom serializer, you will need to use the ActorRefResolver), though in the persistence case, this isn't necessarily that useful: there's no guarantee that the actor pointed to by the ref is still there (consider, for instance, if the JVM hosting that actor system has stopped between when the state was saved and when it was read back).
Non-persistent actors (assuming they're not themselves directly interacting with some persistent data store: there's nothing stopping you from having an actor that reads state on startup from somewhere else (possibly stashing incoming commands until that read completes) and writes state changes... that's basically all durable state is under the hood) keep all their state in memory, until they're stopped. The mechanism of stopping an actor is typically called "passivation": in typed you typically have a Passivate command in the actor's protocol. Bringing it back is then often called "rehydration". Both event-sourced and durable-state persistence are very useful for implementing this.
Note that it's absolutely possible to run a single-node Akka Cluster and have sharding. Sharding brings a notion of an "entity", which has a string name and is conceptually immortal/eternal (unlike an actor, which has a defined birth-to-death lifecycle). Sharding then has a given entity be incarnated by at most one actor at any given time in a cluster (I'm ignoring the multiple-datacenter case: if multiple datacenters are in use, you're probably going to want event sourced persistence). Once you have an EntityRef from sharding, the EntityRef will refer to whatever the current incarnation is: if a message is sent to the EntityRef and there's no living incarnation, a new incarnation is spawned. If the behavior for that TypeKey which was provided to sharding is a persistent behavior, then the persisted state will be recovered. Sharding can also implement passivation directly (with a few out-of-the-box strategies supported).
You can implement similar functionality yourself (for situations where there aren't many children of the dispatcher, a simple map in the dispatcher and asks/watches will work).
The Akka Platform Guide tutorial works an example using cluster sharding and persistence (in this case, it's event sourced, but the durable state APIs are basically the same, especially if you ignore the CQRS bits).

Akka: Persistence failure when replaying events

We are working on an event sourced application with akka-persistance using Oracle database as event store. The application have been running in production for sometime now. Lately we are seeing the following error in the application for some of the persistent actors.
Persistence failure when replaying events for persistenceId [some-persistence-id]. Last known sequence number [0]
Can someone who faced a similar issue in their application share their experience of why this happens?
Also, going through Akka documentation at: https://doc.akka.io/docs/akka/current/persistence.html, onRecoveryFailure is responsible for handling such failures. Is there a way we can override this method to ignore the persisted events in case we see failures while replaying events? In our scenario replaying the events is not very critical and we can serve the users even by ignoring the m.

That log is typically a manifestation of something else. Since the failure is from sequence number zero, that points an actual query to the DB failing (e.g. timeout). There should be other logs around the time of that log which will provide further information.
Akka Persistence has a fairly strong assumption that the persisted state is important (otherwise why would you be persisting?). Off the top of my head, I would consider separating the parts of the actor which are affected by persistence from the parts which aren't: the non-persistent actor can spawn a persistent child and interact with it (it can do tricks with stashing, for instance, to present an illusion that it and its child are the same actor).

How to stream events with GCP platform?

I am looking into building a simple solution where producer services push events to a message queue and then have a streaming service make those available through gRPC streaming API.
Cloud Pub/Sub seems well suited for the job however scaling the streaming service means that each copy of that service would need to create its own subscription and delete it before scaling down and that seems unnecessarily complicated and not what the platform was intended for.
On the other hand Kafka seems to work well for something like this but I'd like to avoid having to manage the underlying platform itself and instead leverage the cloud infrastructure.
I should also mention that the reason for having a streaming API is to allow for streaming towards a frontend (who may not have access to the underlying infrastructure)
Is there a better way to go about doing something like this with the GCP platform without going the route of deploying and managing my own infrastructure?

If you essentially want ephemeral subscriptions, then there are a few things you can set on the Subscription object when you create a subscription:
Set the expiration_policy to a smaller duration. When a subscriber is not receiving messages for that time period, the subscription will be deleted. The tradeoff is that if your subscriber is down due to a transient issue that lasts longer than this period, then the subscription will be deleted. By default, the expiration is 31 days. You can set this as low as 1 day. For pull subscribers, the subscribers simply need to stop issuing requests to Cloud Pub/Sub for the timer on their expiration to start. For push subscriptions, the timer starts based on when no messages are successfully delivered to the endpoint. Therefore, if no messages are published or if the endpoint is returning an error for all pushed messages, the timer is in effect.
Reduce the value of message_retention_duration. This is the time period for which messages are kept in the event a subscriber is not receiving messages and acking them. By default, this is 7 days. You can set it as low as 10 minutes. The tradeoff is that if your subscriber disconnects or gets behind in processing messages by more than this duration, messages older than that will be deleted and the subscriber will not see them.
Subscribers that cleanly shut down could probably just call DeleteSubscription themselves so that the subscription goes away immediately, but for ones that shut down unexpectedly, setting these two properties will minimize the time for which the subscription continues to exist and the number of messages (that will never get delivered) that will be retained.
Keep in mind that Cloud Pub/Sub quotas limit one to 10,000 subscriptions per topic and per project. Therefore, if a lot of subscriptions are created and either active or not cleaned up (manually, or automatically after expiration_policy's ttl has passed), then new subscriptions may not be able to be created.

I think your original idea was better than ephemeral subscriptions tbh. I mean it works, but it feels totally unnatural. Depending on what your requirements are. For example, do clients only need to receive messages while they're connected or do they all need to get all messages?
Only While Connected
Your original idea was better imo. What I probably would have done is to create a gRPC stream service that clients could connect to. The implementation is essentially an observer pattern. The consumer will receive a message and then iterate through the subscribers to do a "Send" to all of them. From there, any time a client connects to the service, it just registers itself with that observer collection and unregisters when it disconnects. Horizontal scaling is passive since clients are sticky to whatever instance they've connected to.
Everyone always get the message, if eventually
The concept is similar to the above but the client doesn't implicitly un-register from the observer on disconnect. Instead, it would register and un-register explicitly (through a method/command designed to do so). Modify the 'on disconnected' logic to tell the observer list that the client has gone offline. Then the consumer's broadcast logic is slightly different. Now it iterates through the list and says "if online, then send, else queue", and send the message to a ephemeral queue (that belongs to the client). Then your 'on connect' logic will send all messages that are in queue to the client before informing the consumer that it's back online. Basically an inbox. Setting up ephemeral, self-deleting queues is really easy in most products like RabbitMQ. I think you'll have to do a bit of managing whether or not it's ok to delete a queue though. For example, never delete the queue unless the client explicitly unsubscribes or has been inactive for so long. Fail to do that, and the whole inbox idea falls apart.
The selected answer above is most similar to what I'm subscribing here in that the subscription is the queue. If I did this, then I'd probably implement it as an internal bus instead of an observer (since it would be unnecessary) - You create a consumer on demand for a connecting client that literally just forwards the message. The message consumer subscribes and unsubscribes based on whether or not the client is connected. As Kamal noted, you'll run into problems if your scale exceeds the maximum number of subscriptions allowed by pubsub. If you find yourself in that position, then you can unshackle that constraint by implementing the pattern above. It's basically the same pattern but you shift the responsibility over to your infra where the only constraint is your own resources.
gRPC makes this mechanism pretty easy. Alternatively, for web, if you're on a Microsoft stack, then SignalR makes this pretty easy too. Clients connect to the hub, and you can publish to all connected clients. The consumer pattern here remains mostly the same, but you don't have to implement the observer pattern by hand.
(note: arrows in diagram are in the direction of dependency, not data flow)

Kafka and Akka Cluster

Following is my use case
Bunch of applications enqueue messages in Kafka under different topics.
Have consumer of each topic distribute the work to a worker in a cluster. The work can be classified as long running, memory intensive, simple etc and the worker is chosen accordingly.
This has me exploring Akka cluster for work distribution, routing and scaling. I can use Akka "Supervisor" as a Kafka consumer and assign incoming work to the appropriate worker based on its classification.
But what I am still trying to understand is the correct way to implement a resilient way of communication between the supervisor and workers in the Akka cluster. Because as soon as the supervisor consumes the message from Kafka, the Kafka offset is committed. If some error happens in processing after the offset commit, is the following acceptable way to recover and start from where it was last left?
Make the supervisor a persistent actor by using durable mailbox backed by Kafka. Supervisor enqueues work in Kafka and worker gets its work from Kafka and commits its offset only after completing the work.

As said by Jaakko, it really depends on the third-part library you are using.
As far as I'm concerned I have successfully used Akka Streams Kafka although I did enable offset auto-commit.
However, this library may meet your needs since it allows you to customize offset commit (see sections External Offset Storage and Offset Storage in Kafka).
The documentation says:
The Consumer.committableSource makes it possible to commit offset positions to Kafka. Compared to auto-commit this gives exact control of when a message is considered consumed.
In order to disable auto-commit, you have to complete your Akka application.conf file by adding an akka.kafka.consumer section:
akka.kafka.consumer {
# Properties defined by org.apache.kafka.clients.consumer.ConsumerConfig
# can be defined in this configuration section.
kafka-clients {
# Disable auto-commit by default
enable.auto.commit = false
}
}
Last version of akka-stream-kafka_2.11 (version 0.16) is compatible with Akka 2.5.x but you have to override akka-stream_2.11 dependency with the one of the Akka toolkit. Currently, I am using this library with Akka 2.5.3 and it works really well.
Hope you will find what your are looking for :)

Monitor system for Akka cluster

It's very difficult to keep track of the states of all actors in Akka cluster. I've been searching around the internet for a good system for monitoring Akka cluster system. However, the results were most likely systems to monitor JVM stats. I am curious if there is a system I can use to monitor the statistics below :
What are the active actors, their states and all other attributes.. i.e connect time, role, path, host etc
The status of all active shard regions and their shards
The messages buffered in Akka (Pending messages)
The deadletter mailbox
The status of the coordinators

You could just have some observing actor send messages to the actors you want to see the state of to tell them to send a message back to this observing actor with a snap shot of their state.
You can use Agents somewhat as well but I don't think they are distrubuted however.
If you were looking for some common framework or something to do this then I would suggest trying to bundle some of this behavior to a trait, I don't really know what that would like that because it depends allot on how you invision this behavior working, if all the messages sent back to the observer can be of the same case class or not, ect.

Kamon.io can gather matrixes as you want.
for instance :
val myHistogram = Kamon.metrics.histogram("my-histogram")
myHistogram.record(42)
myHistogram.record(43)
myHistogram.record(44)
val myCounter = Kamon.metrics.counter("my-counter")
myCounter.increment()
myCounter.increment(17)
val myMMCounter = Kamon.metrics.minMaxCounter("my-mm-counter", refreshInterval = 500 milliseconds)
myMMCounter.increment()
myMMCounter.decrement()
val myTaggedHistogram = Kamon.metrics.histogram("my-tagged-histogram", tags = Map("algorithm" -> "X"))
myTaggedHistogram.record(700L)
myTaggedHistogram.record(800L)
Also, Kamon.io supports several backends as a datastore of these metrixes.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js