Canceling Apache Flink job from the code

Canceling Apache Flink job from the code - akka

I am in a situation where I want to stop/cancel the flink job from the code. This is in my integration test where I am submitting a task to my flink job and check the result. As the job runs, asynchronously, it doesn't stop even when the test fails/passes. I want to job the stop after the test is over.
I tried a few things which I am listing below :
Get the jobmanager actor
Get running jobs
For each running job, send a cancel request to the jobmanager
This, of course in not running but I am not sure whether the jobmanager actorref is wrong or something else is missing.
The error I get is : [flink-akka.actor.default-dispatcher-5] [akka://flink/user/jobmanager_1] Message [org.apache.flink.runtime.messages.JobManagerMessages$RequestRunningJobsStatus$] from Actor[akka://flink/temp/$a] to Actor[akka://flink/user/jobmanager_1] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'
which means either the job manager actor ref is wrong or the message sent to it is incorrect.
The code looks like the following:
val system = ActorSystem("flink", ConfigFactory.load.getConfig("akka")) //I debugged to get this path
val jobManager = system.actorSelection("/user/jobmanager_1") //also got this akka path by debugging and getting the jobmanager akka url
val responseRunningJobs = Patterns.ask(jobManager, JobManagerMessages.getRequestRunningJobsStatus, new FiniteDuration(10000, TimeUnit.MILLISECONDS))
try {
val result = Await.result(responseRunningJobs, new FiniteDuration(5000, TimeUnit.MILLISECONDS))
if(result.isInstanceOf[RunningJobsStatus]){
val runningJobs = result.asInstanceOf[RunningJobsStatus].getStatusMessages()
val itr = runningJobs.iterator()
while(itr.hasNext){
val jobId = itr.next().getJobId
val killResponse = Patterns.ask(jobManager, new CancelJob(jobId), new Timeout(new FiniteDuration(2000, TimeUnit.MILLISECONDS)));
try {
Await.result(killResponse, new FiniteDuration(2000, TimeUnit.MILLISECONDS))
}
catch {
case e : Exception =>"Canceling the job with ID " + jobId + " failed." + e
}
}
}
}
catch{
case e : Exception => "Could not retrieve running jobs from the JobManager." + e
}
}
Can someone check if this is the correct approach ?
EDIT :
To completely stop the job, it is necessary to stop the TaskManager along with the JobManager in the order TaskManager first and then JobManager.

You're creating a new ActorSystem and then try to find an actor with the name /user/jobmanager_1 in the same actor system. This won't work, since the actual job manager will run in a different ActorSystem.
If you want to obtain an ActorRef to the real job manager, you either have to use the same ActorSystem for the selection (then you can use a local address) or you have find out the remote address for the job manager actor. The remote address has the format akka.tcp://flink#[address_of_actor_system]/user/jobmanager_[instance_number]. If you have access to the FlinkMiniCluster then you can use the leaderGateway promise to obtain the current leader's ActorGateway.

Related

MismatchingMessageCorrelationException : Cannot correlate message ‘onEventReceiver’: No process definition or execution matches the parameters

We are facing an MismatchingMessageCorrelationException for the receive task in some cases (less than 5%)
The call back to notify receive task is done by :
protected void respondToCallWorker(
#NonNull final String correlationId,
final CallWorkerResultKeys result,
#Nullable final Map<String, Object> variables
) {
try {
runtimeService.createMessageCorrelation("callWorkerConsumer")
.processInstanceId(correlationId)
.setVariables(variables)
.setVariable("callStatus", result.toString())
.correlateWithResult();
} catch(Exception e) {
e.printStackTrace();
}
}
When i check the logs : i found that the query executed is this one :
select distinct RES.* from ACT_RU_EXECUTION RES
inner join ACT_RE_PROCDEF P on RES.PROC_DEF_ID_ = P.ID_
WHERE RES.PROC_INST_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0' and RES.SUSPENSION_STATE_ = '1'
and exists (select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = RES.ID_ and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer' )
Some times, When i look for the instance of the process in the database i found it waiting in the receive task
SELECT DISTINCT * FROM ACT_RU_EXECUTION RES
WHERE id_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
However, when i check the subscription event, it's not yet created in the database
select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer'
I think that the solution is to save the "receive task" before getting the response for respondToCallWorker, but sadly i can't figure it out.
I tried "asynch before" callWorker and "Message consumer" but it did not work,
I also tried camunda.bpm.database.jdbc-batch-processing=false and got the same results,
I tried also parallel branches but i get OptimisticLocak exception and MismatchingMessageCorrelationException
Maybe i am doing it wrong
Thanks for your help

This is an interesting problem. As you already found out, the error happens, when you try to correlate the result from the "worker" before the main process ended its transaction, thus there is no message subscription registered at the time you correlate.
This problem in process orchestration is described and analyzed in this blog post, which is definitely worth reading.
Taken from that post, here is a design that should solve the issue:
You make message send and receive parallel and put an async before the send task.
By doing so, the async continuation job for the send event and the message subscription are written in the same transaction, so when the async message send executes, you already have the subscription waiting.
Although this should work and solve the issue on BPMN model level, it might be worth to consider options that do not require remodeling the process.
First, instead of calling the worker directly from your delegate, you could (assuming you are on spring boot) publish a "CallWorkerCommand" (simple pojo) and use a TransactionalEventLister on a spring bean to execute the actual call. By doing so, you first will finish the BPMN process by subscribing to the message and afterwards, spring will execute your worker call.
Second: you could use a retry mechanism like resilience4j around your correlate message call, so in the rare cases where the result comes to quickly, you fail and retry a second later.
Another solution I could think of, since you seem to be using an "external worker" pattern here, is to use an external-task-service task directly, so the send/receive synchronization gets solved by the Camunda external worker API.
So many options to choose from. I would possibly prefer the external task, followed by the transactionalEventListener, but that is a matter of personal preference.

Akka TestProbe to test context.watch() / Terminated handling

I'm testing an akka system using TestKit . One actor of the system I'm testing, upon receiving a certain message type, context.watches the sender, and kills itself when the sender dies:
trait Handler extends Actor {
override def receive: Receive = {
case Init => context.watch(sender)
case Terminated => context.stop(self)
}
}
In my test I'm sending
val probe = TestProbe(system)
val target = TestActorRef(Props(classOf[Handler]))
probe.send(target, Init)
Now, to test the watch / Terminated behavior - I want to simulate the testprobe being killed.
I can do
probe.send(target, Terminated)
But, this presupposes that target has called context.watch(sender) , else it would not receive a Terminated.
I can do
probe.testActor ! Kill
with doesn't send Terminated unless target has correctly called context.watch(sender) , but I don't actually want the testprobe killed, as it needs to remain responsive to test if (for example) target continues to send messages instead of stopping itself .
I'm come across this a few times now, what's the correct way to test if an actor is handling the above situation correctly?

You could watch the actor under test for termination with a separate probe instead of trying to do that via the 'sender' probe:
val probe = TestProbe(system)
val deathWatcher = TestProbe(system)
val target = TestActorRef(Props(classOf[Handler]))
deathWatcher.watch(target)
probe.send(target, Init)
// TODO make sure the message is processed.. perhaps ack it?
probe ! Kill
deathWatcher.expectTerminated(target)

How to stop a poll actor

I'm using akka actors for doing some tasks that get scheduled, like a poll coming live on a date/time that was scheduled.
This way I'm creating an actor...
final ActorRef pollActor = pollSystem.actorOf(new Props(
new UntypedActorFactory() {
public UntypedActor create() {
return new PollActor(pollObj);
}
}), "pollActor" + pollObj.getId()+":"+pollMts);
But when I update a poll that was already created to change the scheduled go-live date, there I can create another actor and I want the existing actor for the same poll to be stopped.
For that I'm doing this...
ActorRef pollActor = pollSystem
.actorFor("akka://pollSystem/user/pollActor" + poll.getId()+":"+oldPollMTS);
pollActor.tell(PoisonPill.getInstance(), null);
But the old actor is not getting stopped and no postStop() method invoked. I tried Kill.getInstance() too, but in vain.
Help me to find a way that I can stop the old actor and the messages sent to it; thereby create a new actor.

Handle the crashed remote actor in a cluster

I am new to Akka. I built an Akka cluster. In the cluster, I have one node as the master, which will distribute works to the slave nodes. The master node will first be started. Then the slave nodes will register themslves to the master. If the slave leaves gracefully, the master will receive a message as
message instanceof Terminated
Then the master will do some recovery for the slave node. But if the slave simply crashed, How can I handle it. Currently, the console will print error as "Connection refused". Could anyone tell me how I can catch this error and know the ActorRef of this crashed slave so that the master will do similar recovery for the crashed slave node.
Thank you very much

You can maintain a list (or map) of other node addresses with corresponding ActorRef-s (or actor paths) on them. And you can subscribe to cluster messages (like UnreachableMember) and do some recover when receiving it.
Something like this:
class ClusterRefRecoverExample extends Actor {
private val membersWithActorRefs = collection.mutable.HashMap[Address, ActorRef]()
override def preStart() {
super.preStart()
val cluster = Cluster(context.system)
cluster.subscribe(self, classOf[MemberEvent])
cluster.subscribe(self, classOf[UnreachableMember])
}
override def postStop() {
super.postStop()
Cluster(context.system).unsubscribe(self)
}
def recoverAddress(addr: Address) {
membersWithActorRefs.get(addr) foreach {
theRef =>
// do your recover here
}
}
def removeAddress(addr: Address) {
membersWithActorRefs.remove(addr)
}
def receive = {
....
case UnreachableMember(member) =>
recoverAddress(member.address)
case MemberRemoved(member, _) =>
removeAddress(member.address)
case MemberExited(member) =>
removeAddress(member.address)
}
}

From the Cluster documentation:
"Death watch uses the cluster failure detector for nodes in the cluster, i.e. it generates Terminated message from network failures and JVM crashes, in addition to graceful termination of watched actor." - http://doc.akka.io/docs/akka/2.2.3/scala/cluster-usage.html

How to detect dead remote client or server in akka2

Im new to AKKA2.The following is my question:
There is a server actor and several client actors.
The server stores all the ref of the client actors.
I wonder how the server can detect which client is disconnected(shutdown, crash...)
And if there is a way to tell the clients that the server is dead.

There are two ways to interact with an actor's lifecycle. First, the parent of an actor defines a supervisory policy that handles actor failures and has the option to restart, stop, resume, or escalate after a failure. In addition, a non-supervisor actor can "watch" an actor to detect the Terminated message generated when the actor dies. This section of the docs covers the topic: http://doc.akka.io/docs/akka/2.0.1/general/supervision.html
Here's an example of using watch from a spec. I start an actor, then set up a watcher for the Termination. When the actor gets a PoisonPill message, the event is detected by the watcher:
"be able to watch the proxy actor fail" in {
val myProxy = system.actorOf(Props(new VcdRouterActor(vcdPrivateApiUrl, vcdUser, vcdPass, true, sessionTimeout)), "vcd-router-" + newUuid)
watch(myProxy)
myProxy ! PoisonPill
expectMsg(Terminated(`myProxy`))
}
Here's an example of a custom supervisor strategy that Stops the child actor if it failed due to an authentication exception since that probably will not be correctable, or escalates the failure to a higher supervisor if the failure was for another reason:
override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 1 minutes) {
// presumably we had a connection, and lost it. Let's restart the child and see if we can re-establish one.
case e: AuthenticationException ⇒
log.error(e.message + " Stopping proxy router for this host")
Stop
// don't know what it was, escalate it.
case e: Exception ⇒
log.warning("Unknown exception from vCD proxy. Escalating a {}", e.getClass.getName)
Escalate
}
Within an actor, you can generate the failure by throwing an exception or handling a PoisonPill message.
Another pattern that may be useful if you don't want to generate a failure is to respond with a failure to the sender. Then you can have a more personal message exchange with the caller. For example, the caller can use the ask pattern and use an onComplete block for handling the response. Caller side:
vcdRouter ? DisableOrg(id) mapTo manifest[VcdHttpResponse] onComplete {
case Left(failure) => log.info("receive a failure message")
case Right(success) ⇒ log.info("org disabled)
}
Callee side:
val org0 = new UUID("00000000-0000-0000-0000-000000000000")
def receive = {
case DisableOrg(id: UUID) if id == org0 => sender ! Failure(new IllegalArgumentException("can't disable org 0")
case DisableOrg(id: UUID) => sender ! disableOrg(id)
}

In order to make your server react to changes of remote client status you could use something like the following (example is for Akka 2.1.4).
In Java
#Override
public void preStart() {
context().system().eventStream().subscribe(getSelf(), RemoteLifeCycleEvent.class);
}
Or in Scala
override def preStart = {
context.system.eventStream.subscribe(listener, classOf[RemoteLifeCycleEvent])
}
If you're only interested when the client is disconnected you could register only for RemoteClientDisconnected
More info here(java)and here(scala)

In the upcoming Akka 2.2 release (RC1 was released yesterday), Death Watch works both locally and remote. If you watch the root guardian on the other system, when you get Terminated for him, you know that the remote system is down.
Hope that helps!

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js