Handle the crashed remote actor in a cluster - akka

I am new to Akka. I built an Akka cluster in which one node acts as the master and distributes work to the slave nodes. The master node is started first; then the slave nodes register themselves with the master. If a slave leaves gracefully, the master receives a message where
message instanceof Terminated
and can do some recovery for that slave node. But if a slave simply crashes, how can I handle it? Currently, the console just prints a "Connection refused" error. Could anyone tell me how I can catch this error and learn the ActorRef of the crashed slave, so that the master can do similar recovery for it?
Thank you very much

You can maintain a map of node addresses to the corresponding ActorRefs (or actor paths) on those nodes, subscribe to cluster events (such as UnreachableMember), and perform the recovery when you receive one.
Something like this:
import akka.actor.{Actor, ActorRef, Address}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{MemberEvent, MemberExited, MemberRemoved, UnreachableMember}

class ClusterRefRecoverExample extends Actor {
  // Map each node's Address to the ActorRef you hold on that node.
  // Populate it when the remote actors register themselves.
  private val membersWithActorRefs = collection.mutable.HashMap[Address, ActorRef]()

  override def preStart() {
    super.preStart()
    val cluster = Cluster(context.system)
    cluster.subscribe(self, classOf[MemberEvent])
    cluster.subscribe(self, classOf[UnreachableMember])
  }

  override def postStop() {
    super.postStop()
    Cluster(context.system).unsubscribe(self)
  }

  def recoverAddress(addr: Address) {
    membersWithActorRefs.get(addr) foreach { theRef =>
      // do your recovery here
    }
  }

  def removeAddress(addr: Address) {
    membersWithActorRefs.remove(addr)
  }

  def receive = {
    // ... your other message handling (e.g. registration) goes here ...
    case UnreachableMember(member) =>
      recoverAddress(member.address)
    case MemberRemoved(member, _) =>
      removeAddress(member.address)
    case MemberExited(member) =>
      removeAddress(member.address)
  }
}

From the Cluster documentation:
"Death watch uses the cluster failure detector for nodes in the cluster, i.e. it generates Terminated message from network failures and JVM crashes, in addition to graceful termination of watched actor." - http://doc.akka.io/docs/akka/2.2.3/scala/cluster-usage.html

Related

What is the difference between EventHub Client and EventHub Processor?

I find there are two ways to receive EventHub message data:
Using an EventHub Processor, which seems to use checkpoints to save progress, so that processing can resume when the process running the EventProcessor for a specific partition dies/crashes.
public class SimpleEventProcessor : IEventProcessor
{
    public Task CloseAsync(PartitionContext context, CloseReason reason)
    {
        Console.WriteLine($"Processor Shutting Down. Partition '{context.PartitionId}', Reason: '{reason}'.");
        return Task.CompletedTask;
    }

    public Task OpenAsync(PartitionContext context)
    {
        Console.WriteLine($"SimpleEventProcessor initialized. Partition: '{context.PartitionId}'");
        return Task.CompletedTask;
    }

    public Task ProcessErrorAsync(PartitionContext context, Exception error)
    {
        Console.WriteLine($"Error on Partition: {context.PartitionId}, Error: {error.Message}");
        return Task.CompletedTask;
    }

    public Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> messages)
    {
        foreach (var eventData in messages)
        {
            var data = Encoding.UTF8.GetString(eventData.Body.Array, eventData.Body.Offset, eventData.Body.Count);
            Console.WriteLine($"Message received. Partition: '{context.PartitionId}', Data: '{data}'");
        }
        return context.CheckpointAsync();
    }
}
Using the EventHub client to receive messages:
EventHubClient eventHub
var receiver = eventHub.CreateReceiver("consumer1", "0", EventPosition.FromStart());
var received = await receiver.ReceiveAsync(10);
What is the difference between them? Can we save checkpoints with the second way? How do we handle crashes with the second way? Why are there two different ways?
EventHubClient, a.k.a. the low-level API, is meant for building connectors. With it, the developer is responsible for managing partition receivers, checkpoints, load distribution, crash recovery, and so on. Most applications won't use this API to receive; again, it is for building source-to-sink connectors.
The Processor Host comes with built-in checkpointing, load distribution, and partition receiver management. This API might look like overkill, since you have to implement IEventProcessor and provide a storage checkpoint store, but in the long run it is more worry-free.

Canceling Apache Flink job from the code

I am in a situation where I want to stop/cancel a Flink job from code. This is in my integration test, where I submit a task to my Flink job and check the result. Since the job runs asynchronously, it doesn't stop even after the test fails/passes. I want the job to stop after the test is over.
I tried a few things, which I am listing below:
Get the JobManager actor
Get the running jobs
For each running job, send a cancel request to the JobManager
This, of course, is not working, but I am not sure whether the JobManager ActorRef is wrong or something else is missing.
The error I get is : [flink-akka.actor.default-dispatcher-5] [akka://flink/user/jobmanager_1] Message [org.apache.flink.runtime.messages.JobManagerMessages$RequestRunningJobsStatus$] from Actor[akka://flink/temp/$a] to Actor[akka://flink/user/jobmanager_1] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'
which means either the job manager actor ref is wrong or the message sent to it is incorrect.
The code looks like the following:
val system = ActorSystem("flink", ConfigFactory.load.getConfig("akka")) // I debugged to get this path
val jobManager = system.actorSelection("/user/jobmanager_1") // also got this akka path by debugging the jobmanager akka url

val responseRunningJobs = Patterns.ask(jobManager, JobManagerMessages.getRequestRunningJobsStatus,
  new FiniteDuration(10000, TimeUnit.MILLISECONDS))
try {
  val result = Await.result(responseRunningJobs, new FiniteDuration(5000, TimeUnit.MILLISECONDS))
  if (result.isInstanceOf[RunningJobsStatus]) {
    val runningJobs = result.asInstanceOf[RunningJobsStatus].getStatusMessages()
    val itr = runningJobs.iterator()
    while (itr.hasNext) {
      val jobId = itr.next().getJobId
      val killResponse = Patterns.ask(jobManager, new CancelJob(jobId),
        new Timeout(new FiniteDuration(2000, TimeUnit.MILLISECONDS)))
      try {
        Await.result(killResponse, new FiniteDuration(2000, TimeUnit.MILLISECONDS))
      } catch {
        case e: Exception => "Canceling the job with ID " + jobId + " failed. " + e
      }
    }
  }
} catch {
  case e: Exception => "Could not retrieve running jobs from the JobManager. " + e
}
Can someone check if this is the correct approach?
EDIT:
To completely stop the job, it is necessary to stop the TaskManager along with the JobManager, in the order TaskManager first and then JobManager.
You're creating a new ActorSystem and then trying to find an actor with the name /user/jobmanager_1 in that same actor system. This won't work, since the actual JobManager runs in a different ActorSystem.
If you want to obtain an ActorRef to the real JobManager, you either have to use the same ActorSystem for the selection (then you can use a local address), or you have to find out the remote address of the JobManager actor. The remote address has the format akka.tcp://flink@[address_of_actor_system]/user/jobmanager_[instance_number]. If you have access to the FlinkMiniCluster, you can use the leaderGateway promise to obtain the current leader's ActorGateway.
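For illustration, a minimal sketch of the second option, mirroring the asker's code but selecting the JobManager by its remote address; the system name, host, port, and actor name below are hypothetical and must match your actual JobManager:
import java.util.concurrent.TimeUnit
import akka.actor.ActorSystem
import akka.pattern.Patterns
import com.typesafe.config.ConfigFactory
import org.apache.flink.runtime.messages.JobManagerMessages
import scala.concurrent.Await
import scala.concurrent.duration.FiniteDuration

val system = ActorSystem("test", ConfigFactory.load.getConfig("akka"))

// Replace host and port with the address of the JobManager's actor system.
val jobManager = system.actorSelection(
  "akka.tcp://flink@127.0.0.1:6123/user/jobmanager_1")

// The rest of the asker's code can stay the same, e.g.:
val responseRunningJobs = Patterns.ask(jobManager,
  JobManagerMessages.getRequestRunningJobsStatus,
  new FiniteDuration(10000, TimeUnit.MILLISECONDS))
val result = Await.result(responseRunningJobs, new FiniteDuration(5000, TimeUnit.MILLISECONDS))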

Akka TestProbe to test context.watch() / Terminated handling

I'm testing an Akka system using TestKit. One actor of the system I'm testing, upon receiving a certain message type, context.watches the sender, and stops itself when the sender dies:
trait Handler extends Actor {
  override def receive: Receive = {
    case Init          => context.watch(sender)
    case _: Terminated => context.stop(self) // real Terminated messages are instances, not the companion object
  }
}
In my test I'm sending
val probe = TestProbe(system)
val target = TestActorRef(Props(classOf[Handler]))
probe.send(target, Init)
Now, to test the watch/Terminated behavior, I want to simulate the test probe being killed.
I can do
probe.send(target, Terminated)
But this presupposes that target has called context.watch(sender), else it would not receive a Terminated.
I can do
probe.testActor ! Kill
which doesn't send Terminated unless target has correctly called context.watch(sender); but I don't actually want the test probe killed, as it needs to remain responsive to test whether (for example) target continues to send messages instead of stopping itself.
I've come across this a few times now. What's the correct way to test whether an actor handles the above situation correctly?
You could watch the actor under test for termination with a separate probe instead of trying to do that via the 'sender' probe:
val probe = TestProbe(system)
val deathWatcher = TestProbe(system)
val target = TestActorRef(Props(classOf[Handler]))
deathWatcher.watch(target)
probe.send(target, Init)
// TODO make sure the message is processed.. perhaps ack it?
probe.ref ! Kill
deathWatcher.expectTerminated(target)

How to stop a poll actor

I'm using Akka actors for tasks that get scheduled, like a poll going live at a scheduled date/time.
This is how I'm creating an actor:
final ActorRef pollActor = pollSystem.actorOf(new Props(
    new UntypedActorFactory() {
        public UntypedActor create() {
            return new PollActor(pollObj);
        }
    }), "pollActor" + pollObj.getId() + ":" + pollMts);
But when I update an already created poll to change the scheduled go-live date, I create another actor, and I want the existing actor for the same poll to be stopped.
For that I'm doing this:
ActorRef pollActor = pollSystem
    .actorFor("akka://pollSystem/user/pollActor" + poll.getId() + ":" + oldPollMTS);
pollActor.tell(PoisonPill.getInstance(), null);
But the old actor is not getting stopped, and its postStop() method is never invoked. I tried Kill.getInstance() too, but in vain.
Help me find a way to stop the old actor (and the messages sent to it), so that I can then create a new one.

How to detect dead remote client or server in akka2

I'm new to Akka 2. Here is my question:
There is a server actor and several client actors.
The server stores the refs of all the client actors.
I wonder how the server can detect which client is disconnected (shutdown, crash, ...),
and whether there is a way to tell the clients that the server is dead.
There are two ways to interact with an actor's lifecycle. First, the parent of an actor defines a supervisory policy that handles actor failures and has the option to restart, stop, resume, or escalate after a failure. In addition, a non-supervisor actor can "watch" an actor to detect the Terminated message generated when the actor dies. This section of the docs covers the topic: http://doc.akka.io/docs/akka/2.0.1/general/supervision.html
Here's an example of using watch from a spec. I start an actor, then set up a watcher for its termination. When the actor gets a PoisonPill message, the event is detected by the watcher:
"be able to watch the proxy actor fail" in {
val myProxy = system.actorOf(Props(new VcdRouterActor(vcdPrivateApiUrl, vcdUser, vcdPass, true, sessionTimeout)), "vcd-router-" + newUuid)
watch(myProxy)
myProxy ! PoisonPill
expectMsg(Terminated(`myProxy`))
}
Here's an example of a custom supervisor strategy that stops the child actor if it failed due to an authentication exception (since that probably will not be correctable), or escalates the failure to a higher supervisor if it failed for another reason:
override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 1 minute) {
  // an authentication failure is unlikely to be fixed by retrying, so stop the child
  case e: AuthenticationException =>
    log.error(e.message + " Stopping proxy router for this host")
    Stop
  // don't know what it was, escalate it
  case e: Exception =>
    log.warning("Unknown exception from vCD proxy. Escalating a {}", e.getClass.getName)
    Escalate
}
Within an actor, you can generate the failure by throwing an exception or handling a PoisonPill message.
Another pattern that may be useful if you don't want to generate a failure is to respond with a failure to the sender. Then you can have a more personal message exchange with the caller. For example, the caller can use the ask pattern and use an onComplete block for handling the response. Caller side:
vcdRouter ? DisableOrg(id) mapTo manifest[VcdHttpResponse] onComplete {
  case Left(failure) => log.info("received a failure message")
  case Right(success) => log.info("org disabled")
}
Callee side:
val org0 = UUID.fromString("00000000-0000-0000-0000-000000000000")

def receive = {
  case DisableOrg(id: UUID) if id == org0 => sender ! Failure(new IllegalArgumentException("can't disable org 0"))
  case DisableOrg(id: UUID) => sender ! disableOrg(id)
}
In order to make your server react to changes in remote client status, you can use something like the following (the example is for Akka 2.1.4).
In Java
@Override
public void preStart() {
    context().system().eventStream().subscribe(getSelf(), RemoteLifeCycleEvent.class);
}
Or in Scala
override def preStart() = {
  context.system.eventStream.subscribe(self, classOf[RemoteLifeCycleEvent])
}
If you're only interested in when a client disconnects, you can register only for RemoteClientDisconnected.
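A minimal Scala sketch, assuming the Akka 2.1-era remoting events (RemoteClientDisconnected lives in akka.remote):
import akka.remote.RemoteClientDisconnected

override def preStart() = {
  context.system.eventStream.subscribe(self, classOf[RemoteClientDisconnected])
}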
More info is in the Akka 2.1 remoting documentation (there are Java and Scala versions of the page).
In the upcoming Akka 2.2 release (RC1 was released yesterday), death watch works both locally and remotely. If you watch the root guardian of the other system, then when you get Terminated for it, you know that the remote system is down.
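As a rough sketch of that remote death watch (Akka 2.2 style; the remote path below is hypothetical): resolve a remote actor with Identify, watch the resulting ActorRef, and treat its Terminated as "the remote system is gone".
import akka.actor.{Actor, ActorIdentity, Identify, Terminated}

class RemoteSystemWatcher(remotePath: String) extends Actor {
  // e.g. remotePath = "akka.tcp://clientSystem@host:2552/user/client"
  override def preStart() = context.actorSelection(remotePath) ! Identify("remote")

  def receive = {
    case ActorIdentity("remote", Some(ref)) =>
      context.watch(ref) // remote death watch
    case ActorIdentity("remote", None) =>
      context.stop(self) // nothing to watch at that path
    case Terminated(_) =>
      // the remote actor (or its whole system) is down; react here
      context.stop(self)
  }
}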
Hope that helps!