Akka: How to reconnect to restarted slave? - akka

I have two Docker containers running locally; one is the master, the other is the slave, and they communicate over Akka remoting. The slave can go OOM from time to time for certain messages, in which case Docker gracefully restarts it.
The code looks a little bit like this:
object Master {
  def main() {
    ...
    val slave =
      typedActorOf(TypedProps[Slave], resolveRemoteActor(..))
    val dispatcher =
      typedActorOf(TypedProps(classOf[Dispatcher], new DispatcherImpl(slave)))
    val httpServer =
      typedActorOf(TypedProps(classOf[HTTPServer], new HTTPServerImpl(dispatcher)))
  }
}

class Slave() { def compute() = ... }
class Dispatcher(s: Slave) { def compute() = s.compute() }
The problem is that the master shuts down the connection to the slave once the slave becomes unavailable due to OOM, and it never re-establishes it:
[ERROR] from a.r.EndpointWriter - AssociationError [akka.tcp://MasterSystem@localhost:0] -> [akka.tcp://SlaveSystem@localhost:1]: Error [Shut down address: akka.tcp://SlaveSystem@localhost:1] [akka.remote.ShutDownAssociation: Shut down address: akka.tcp://SlaveSystem@localhost:1 Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down. ]
[INFO] from a.r.RemoteActorRef - Message [akka.actor.TypedActor$MethodCall] from Actor[akka://MasterSystem/temp/$c] to Actor[akka.tcp://SlaveSystem@localhost:1/user/Slave#1817887555] was not delivered. [1] dead letters encountered.
So my question is: how can I force the master to reconnect to the slave once the slave restarts, and send all the pending messages that could not be delivered while it was down?

I'd recommend using Akka Cluster over raw remoting, for this case and in general. Cluster lets you listen for membership events, so you can react when a node leaves and reappears.
Making guarantees around delivery of messages requires some extra thought, though. This section of the docs is good to read for a better understanding of the issues around it.
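For illustration, here is a minimal sketch (assuming both systems join the same cluster and the slave node is started with a "slave" role; the role name and the re-wiring logic are placeholders, not from your code) of listening for membership events on the master:

import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

class SlaveWatcher extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  // subscribe to membership and reachability events
  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberUp], classOf[UnreachableMember], classOf[MemberRemoved])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(member) if member.hasRole("slave") =>
      log.info("Slave node is up again: {}", member.address)
      // re-create the typed-actor proxy and resend any buffered work here
    case UnreachableMember(member) =>
      log.warning("Member unreachable: {}", member.address)
    case MemberRemoved(member, _) =>
      log.info("Member removed: {}", member.address)
  }
}

Buffering the pending messages while the slave is down and flushing them on MemberUp is what would replace the dead letters you are seeing now.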

Related

Akka Daemon Services

Most of the beginner's Akka examples seem to advocate calling the actor system's stop() and shutdown() methods like so:
object Main extends App {
  // create the ActorSystem
  val system = ActorSystem("HelloSystem")
  // put your actors to work here ...
  // shut down the ActorSystem when the work is finished
  system.stop
  system.shutdown
}
However, what if your Akka app is meant to be a long-running service that should (conceivably) live forever? Meaning it starts, the actor system is created, and actors simply idle until work (perhaps coming in from connected clients, etc.) needs to be done?
Is it OK to just initialize/start the actor system and leave it be (that is, omit invoking stop and shutdown altogether)? Why/why not?
Yes, it is OK. This is similar to how Akka HTTP is implemented: in Akka HTTP, you start actors which open a socket and wait for requests.
One possible issue comes to mind: if you need some short-lived actors (inside your long-running service) to process a single request, you should stop them once they are no longer needed (to free resources), especially if the actors are stateful.
I wrote a blog post about that issue: https://mikulskibartosz.name/always-stop-unused-akka-actors-a2ceeb1ed41
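As a rough sketch of that point (the worker and message here are made up for illustration), a per-request actor can stop itself once the work is done, while the ActorSystem itself keeps running for as long as the service lives:

import akka.actor.{Actor, ActorSystem, Props}

// hypothetical per-request worker: handles one request, then stops itself,
// so the long-running service does not accumulate idle, stateful actors
class RequestWorker extends Actor {
  def receive = {
    case request: String =>
      println(s"processed: $request") // stand-in for real work
      context.stop(self)              // free the actor once the work is done
  }
}

object Service extends App {
  // the system is never shut down here; it lives as long as the service does
  val system = ActorSystem("ServiceSystem")
  val worker = system.actorOf(Props[RequestWorker], "request-worker-1")
  worker ! "some request"
}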

akka persistence, resume on failure, at least once semantics

I have a small mess in my head.
Avoid having a poisoned mailbox
http://doc.akka.io/docs/akka/2.4.2/general/supervision.html
The new actor then resumes processing its mailbox, meaning that the restart is not visible outside of the actor itself, with the notable exception that the message during which the failure occurred is not re-processed.
My case: an actor receives a "command" to run something. The actor tries to reach a remote service. The service is unavailable. An exception is thrown. I want the actor to keep trying to contact the remote server. I don't want the actor to skip the input command that caused the exception. Will Resume help me force the actor to keep going?
override val supervisorStrategy: SupervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = -1, withinTimeRange = 5.minutes) {
    case _: RemoteServiceIsDownException => Resume
    case _ => Stop
  }
By Resume, I mean: retry the invocation that caused the exception to be thrown. I suspect that Akka's Resume means keeping the actor instance, but not retrying the failed invocation.
Does Akka Persistence mean durable mailboxes?
Extending the first case: the actor tries to reach the remote service, but now the actor is persistent. The SupervisorStrategy forces the actor to keep contacting the remote service. Then the whole JVM shuts down and the Akka app is restarted. Will the actor resume from the point where it was desperately trying to reach the remote service?
Does Akka Persistence mean at-least-once semantics?
An actor receives a message, and then the JVM crashes. After the restart, will the actor re-receive the message it was processing during the crash?
Expanding my comment:
Will Resume help me force the actor to keep going? ... By Resume, I mean: retry the invocation that caused the exception to be thrown. I suspect that Akka's Resume means keeping the actor instance, but not retrying the failed invocation.
No, I do not believe so. The Resume directive will keep the actor chugging along after your message-processing failure. One way to retry the message, though, is to use Restart and take advantage of an Actor's preRestart hook:
override def preRestart(reason: Throwable, msgBeforeFailure: Option[Any]): Unit = {
  reason match {
    case _: RemoteServiceIsDownException if msgBeforeFailure.isDefined =>
      // re-enqueue the message that caused the failure
      self ! msgBeforeFailure.get
    case _ => // nothing to retry
  }
  super.preRestart(reason, msgBeforeFailure)
}
When the actor crashes, it will run this hook and offer you an opportunity to handle the message that caused it to fail.
Does Akka Persistence mean durable mailboxes?
Not necessarily: using a persistent actor just means that the actor's domain events, and consequently its internal state, are durable, not the mailbox it uses to process messages. That being said, it is possible to implement a durable mailbox pretty easily, see below.
Does Akka Persistence mean at-least-once semantics?
Again, not necessarily, but the toolkit does have a trait called AtLeastOnceDelivery to let you achieve this (and a durable mailbox)!
See http://doc.akka.io/docs/akka/current/scala/persistence.html#At-Least-Once_Delivery
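A rough sketch of what that can look like (the event and message types below are illustrative, not from your code): the sender persists what it has asked for, and redelivers it after a restart until the destination confirms it.

import akka.actor.ActorPath
import akka.persistence.{AtLeastOnceDelivery, PersistentActor}

// events persisted by the sender so that pending deliveries survive a JVM restart
case class CommandSent(payload: String)
case class CommandConfirmed(deliveryId: Long)

// messages exchanged with the destination
case class Command(deliveryId: Long, payload: String)
case class Confirm(deliveryId: Long)

class ReliableSender(destination: ActorPath) extends PersistentActor with AtLeastOnceDelivery {
  override def persistenceId = "reliable-sender"

  def receiveCommand = {
    case payload: String     => persist(CommandSent(payload))(updateState)
    case Confirm(deliveryId) => persist(CommandConfirmed(deliveryId))(updateState)
  }

  def receiveRecover = {
    case evt: CommandSent      => updateState(evt)
    case evt: CommandConfirmed => updateState(evt)
  }

  def updateState(evt: Any): Unit = evt match {
    case CommandSent(payload) =>
      // deliver() keeps redelivering until confirmDelivery() is called
      deliver(destination)(deliveryId => Command(deliveryId, payload))
    case CommandConfirmed(deliveryId) =>
      confirmDelivery(deliveryId)
  }
}

The destination has to reply with Confirm(deliveryId); at-least-once delivery therefore also means the destination must tolerate duplicates.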

akka cluster give false alarm in reporting node unreachable

I have a cluster event listener running on each node that sends me an email when nodes become unreachable, and I noticed two strange things:
most of the time, an unreachable event is followed by a reachable-again event
when an unreachable event occurs and I query the state of the cluster, it shows that all nodes are still UP
Here is my conf:
akka {
  loglevel = INFO
  loggers = ["akka.event.slf4j.Slf4jLogger"]
  jvm-exit-on-fatal-error = on
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
  }
  remote {
    // will be overridden at runtime
    log-remote-lifecycle-events = off
    netty.tcp {
      hostname = "127.0.0.1"
      port = 9989
    }
  }
  cluster {
    failure-detector {
      threshold = 12.0
      acceptable-heartbeat-pause = 10 s
    }
    use-dispatcher = cluster-dispatcher
  }
}

// reduce the rate of unreachable reports
cluster-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 4
    parallelism-max = 8
  }
}
Please read the cluster membership lifecycle section in the documentation: http://doc.akka.io/docs/akka/2.4.0/common/cluster.html#Membership_Lifecycle
Unreachability is temporary; it indicates that there were no heartbeats from the remote node for a while. It is reverted once heartbeats arrive again. This is useful for rerouting work from overloaded nodes to others, or for compensating for smaller, intermittent networking issues. Please note that a cluster member does not go from unreachable to DOWN automatically unless configured to do so: http://doc.akka.io/docs/akka/2.4.0/scala/cluster-usage.html#Automatic_vs__Manual_Downing
The reason why DOWNing is manual and not automatic by default is the risk of split-brain scenarios and their consequences, for example when Cluster Singletons are used (which won't be singletons after the cluster falls into two parts because of a broken network cable). For more options for automatically resolving such cases, there is the SBR (Split Brain Resolver) in the commercial version of Akka: http://doc.akka.io/docs/akka/rp-15v09p01/scala/split-brain-resolver.html
Also, DOWNing is permanent: a node, once marked as DOWN, is forever banished from the surviving part of the cluster, i.e. even if it turns out to be alive in the future, it won't be allowed back again (see Fencing and STONITH for an explanation: https://en.wikipedia.org/wiki/STONITH or http://advogato.org/person/lmb/diary/105.html).
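Given that, a listener that reacts to both UnreachableMember and ReachableMember, instead of alerting on the first unreachable event, might look roughly like this (the actual alerting policy, e.g. only emailing if the member stays unreachable for some time, is left out):

import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{InitialStateAsEvents, ReachableMember, UnreachableMember}

class ReachabilityListener extends Actor with ActorLogging {
  val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[UnreachableMember], classOf[ReachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case UnreachableMember(m) =>
      log.warning("{} is unreachable; deferring the alert", m.address)
    case ReachableMember(m) =>
      log.info("{} is reachable again; no alert needed", m.address)
  }
}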

Spark - Remote Akka Client Disassociated

I am setting up Spark 0.9 on AWS and am finding that when launching the interactive Pyspark shell, my executors / remote workers are first being registered:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Registered executor:
Actor[akka.tcp://sparkExecutor@ip-xx-xx-xxx-xxx.ec2.internal:54110/user/Executor#-862786598] with ID 0
and then disassociated almost immediately, before I have the chance to run anything:
14/07/08 22:48:05 INFO cluster.SparkDeploySchedulerBackend: Executor 0 disconnected,
so removing it
14/07/08 22:48:05 ERROR scheduler.TaskSchedulerImpl: Lost an executor 0 (already
removed): remote Akka client disassociated
Any idea what might be wrong? I've tried adjusting the JVM options spark.akka.frameSize and spark.akka.timeout, but I'm pretty sure this is not the issue since (1) I'm not running anything to begin with, and (2) my executors are disconnecting a few seconds after startup, which is well within the default 100s timeout.
Thanks!
Jack
I had a very similar problem, if not the same one.
It started working for me once the workers were connecting to the master using the very same name the master thought it had.
My log messages were something like:
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@idc1-hrm1.heylinux.com:7078] -> [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@vagrant-centos64.vagrantup.com:7077]].
ERROR remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@192.168.121.127:7078] -> [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]: Error [Association failed with [akka.tcp://sparkMaster@idc1-hrm1.heylinux.com:7077]]
WARN util.Utils: Your hostname, idc1-hrm1 resolves to a loopback address: 127.0.0.1; using 192.168.121.187 instead (on interface eth0)
So check the log of the master and see what name it thinks it has.
Then use that very same name on the workers.
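For example (addresses are placeholders, and this assumes the standalone deployment scripts), pinning the addresses in conf/spark-env.sh on each node avoids relying on reverse DNS:

# conf/spark-env.sh -- a sketch, not your actual values
export SPARK_MASTER_IP=10.0.0.1   # the name/IP the master registers itself under
export SPARK_LOCAL_IP=10.0.0.2    # this node's own address

# then start the worker against exactly that master address:
# ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://10.0.0.1:7077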

detecting failure from remote nodes from an akka router

Let's say I have a router which is configured to create actors on multiple remote nodes. Perhaps I have a configuration like this:
akka {
  actor {
    deployment {
      /fooRouter {
        router = round-robin
        resizer {
          lower-bound = 2
          upper-bound = 10
        }
        target {
          nodes = ["akka://mana@10.0.1.1:2555", "akka://mana@10.0.1.2:2555"]
        }
      }
    }
  }
}
Suppose one of these nodes, 10.0.1.1, has for some reason lost connectivity to the database server, so all messages passed to it result in failure. Is there some way the router could come to know that the 10.0.1.1 node is effectively useless and stop using it?
No, currently there is not. You can have the actors on the failed node commit suicide, but as soon as the resizer starts new ones, they will reappear. Even with clustering support—which is yet to come—this would not be automatic, because connections to some external resource are not part of the cluster’s reachability metric. This means that you would have to write code which takes that node down explicitly, upon which the actors could be migrated to some other node (details are not yet fully fleshed out).
So, currently you would have to write your own router as a real actor, which takes reachability into account.
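A rough sketch of that suggestion (the WorkerFailed message and the recovery path are assumptions for illustration): keep a set of healthy workers, round-robin over them, and drop a worker once it reports failure.

import akka.actor.{Actor, ActorRef, Terminated}

case class WorkerFailed(worker: ActorRef)

class HealthAwareRouter(initialWorkers: Seq[ActorRef]) extends Actor {
  private var healthy = initialWorkers.toVector
  private var index = 0

  initialWorkers.foreach(context.watch)

  def receive = {
    case WorkerFailed(worker) =>
      healthy = healthy.filterNot(_ == worker) // stop routing to the useless node
    case Terminated(worker) =>
      healthy = healthy.filterNot(_ == worker)
    case msg if healthy.nonEmpty =>
      index = (index + 1) % healthy.size       // plain round-robin over healthy workers
      healthy(index).forward(msg)
    case msg =>
      // no healthy workers left; drop, buffer, or escalate, depending on your needs
      context.system.deadLetters.forward(msg)
  }
}

Bringing a worker back into the healthy set (for example after a successful health check) would be the missing half of this.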