Akka Streams: how to determine inside GraphStageLogic whether the stage failed

I have a network of nodes, all implemented using custom GraphStageLogic. I can't find any API to determine when a stage throws an exception (e.g. IllegalArgumentException for "Cannot pull port"). The only thing Akka does is fail the downstream connections. What I need is a way, for example in postStop or through a callback, to detect when a node shuts down due to a runtime exception, and to propagate that information to a Promise that monitors the state of the entire system. Using withAttributes(supervisionStrategy) does not have any effect either. It is bewildering to me that there seems to be no way to monitor exceptions thrown inside a GraphStageLogic. failStage is final, like basically the entire API of GraphStageLogic.

Using a decider when defining the ActorMaterializer used for materializing the graph should work:
implicit val materializer: ActorMaterializer = ActorMaterializer(
  ActorMaterializerSettings(actorSystem).withSupervisionStrategy(decider))
where decider is the typical
val decider: Supervision.Decider = {
  case e: IllegalArgumentException => ....
}

Related

MismatchingMessageCorrelationException : Cannot correlate message ‘onEventReceiver’: No process definition or execution matches the parameters

We are facing a MismatchingMessageCorrelationException for the receive task in some cases (less than 5%).
The callback that notifies the receive task is:
protected void respondToCallWorker(
    @NonNull final String correlationId,
    final CallWorkerResultKeys result,
    @Nullable final Map<String, Object> variables
) {
  try {
    runtimeService.createMessageCorrelation("callWorkerConsumer")
        .processInstanceId(correlationId)
        .setVariables(variables)
        .setVariable("callStatus", result.toString())
        .correlateWithResult();
  } catch (Exception e) {
    e.printStackTrace();
  }
}
When I check the logs, I find that the executed query is this one:
select distinct RES.* from ACT_RU_EXECUTION RES
inner join ACT_RE_PROCDEF P on RES.PROC_DEF_ID_ = P.ID_
WHERE RES.PROC_INST_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0' and RES.SUSPENSION_STATE_ = '1'
and exists (select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = RES.ID_ and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer' )
Sometimes, when I look for the process instance in the database, I find it waiting in the receive task:
SELECT DISTINCT * FROM ACT_RU_EXECUTION RES
WHERE id_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
However, when I check the event subscription, it has not yet been created in the database:
select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer'
I think the solution is to persist the "receive task" before the response arrives in respondToCallWorker, but sadly I can't figure out how.
I tried "async before" on callWorker and on "Message consumer", but it did not work.
I also tried camunda.bpm.database.jdbc-batch-processing=false and got the same results.
I also tried parallel branches, but then I get an OptimisticLockingException as well as the MismatchingMessageCorrelationException.
Maybe I am doing it wrong.
Thanks for your help.
This is an interesting problem. As you already found out, the error happens when you try to correlate the result from the "worker" before the main process has finished its transaction, so no message subscription is registered at the time you correlate.
This problem in process orchestration is described and analyzed in this blog post, which is definitely worth reading.
Taken from that post, here is a design that should solve the issue:
You make message send and receive parallel and put an async before the send task.
By doing so, the async continuation job for the send event and the message subscription are written in the same transaction, so when the async message send executes, you already have the subscription waiting.
Although this should work and solve the issue at the BPMN model level, it might be worth considering options that do not require remodeling the process.
First, instead of calling the worker directly from your delegate, you could (assuming you are on Spring Boot) publish a "CallWorkerCommand" (a simple POJO) and use a TransactionalEventListener on a Spring bean to execute the actual call. That way the process transaction, including the message subscription, is finished first, and only afterwards does Spring execute your worker call.
Second, you could use a retry mechanism like resilience4j around your correlate-message call, so that in the rare cases where the result comes too quickly, the call fails and is retried a second later.
Another solution I could think of, since you seem to be using an "external worker" pattern here, is to use an external-task-service task directly, so the send/receive synchronization gets solved by the Camunda external worker API.
So there are many options to choose from. I would probably prefer the external task, followed by the TransactionalEventListener, but that is a matter of personal preference.
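To illustrate the retry option mentioned above, here is a minimal plain-Java sketch of the idea. It deliberately does not use resilience4j or Camunda: CorrelationRetry and its retry method are hypothetical names made up for this example, and the exception type to retry on is passed in by the caller.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of Camunda or resilience4j.
final class CorrelationRetry {

    // Runs the action. If it throws an exception of type retryOn, waits delayMillis
    // and tries again, up to maxAttempts attempts in total. Other exceptions are
    // rethrown immediately; the last retryable exception is rethrown when attempts run out.
    static <T> T retry(int maxAttempts, long delayMillis,
                       Class<? extends RuntimeException> retryOn, Supplier<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (!retryOn.isInstance(e)) {
                    throw e; // not the "subscription not there yet" case
                }
                last = e;
                if (attempt < maxAttempts) {
                    sleep(delayMillis);
                }
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", ie);
        }
    }
}
```

In the code from the question, the supplier would wrap the createMessageCorrelation(...).correlateWithResult() chain, and retryOn would presumably be MismatchingMessageCorrelationException.class, so that only the "subscription not written yet" case is retried.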

Can a MassTransit Consumer Saga be InitiatedBy the same message(s) that it Orchestrates?

The new support for Event Hub Riders in 7.0 plus the existing InMemoryRepository backing for Sagas looks like it could provide a straightforward means of creating aggregate states based on a stream of correlated messages (e.g. across all sensors in a Building). In this scenario, the Building's Identifier would be used as the CorrelationId of the Messages, the Saga, and as the PartitionKey of the EventData messages sent to the Event Hub, ensuring the same consuming service instance receives all messages for that device at a given time. Given the way Event Hub's rebalancing works, it can be assumed that at some point while this service is running, the service instance managing messages for a Partition will shift to a new host, which will start reading messages sent by the sensors in the building. At that moment:
The new host does not know anything about the old host's processing. It just knows that it is now receiving messages for the Event Hub partition that includes that Building's messages.
The devices sending the messages do not know anything about the transition in state aggregation responsibility "downstream of them" - they are still happily reporting new measurements as always.
The challenge this creates is: on the new service instance, we need a new Saga to be created to take over for the previous Saga, but the only thing that knows no Saga lives for a given entity is MassTransit: nothing on the new instance knows a sensor reading from Building A is the first one from Building A since this service instance took over tracking the aggregate Building A state. We thought this could be handled by marking the same Message (DataCollected) with both InitiatedBy and Orchestrates:
public class BuildingAggregator:
ISaga,
InitiatedBy<DataCollected>, //init saga on first DataCollected with a given CorrelationId seen
Orchestrates<DataCollected> //then keep handling those in that saga
{
//saga Consume methods
}
However, this throws the following exception when the BuildingAggregator receives its second DataCollected message with a given Guid:
Saga exception on receipt of MassTransitFW_POC.Program+DataCollected: The message cannot be accepted by an existing saga
at MassTransit.Saga.Policies.NewSagaPolicy`2.MassTransit.Saga.ISagaPolicy<TSaga,TMessage>.Existing(SagaConsumeContext`2 context, IPipe`1 next)
at MassTransit.Saga.SendSagaPipe`2.<Send>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at MassTransit.Saga.SendSagaPipe`2.<Send>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at MassTransit.Saga.InMemoryRepository.InMemorySagaRepositoryContextFactory`1.<Send>d__4`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
Is there another way of achieving this logic? Is this the "wrong way" to apply Sagas?
As per Chris Patterson's comments on the question above, this is achievable with the state machine syntax:
Initially(
When(DataCollected)
.Then(f => _logger.LogInformation("Initiating Network Manager for Network: {NetworkId}", f.Data.NetworkId))
.TransitionTo(Running));
During(Running,
When(DataCollected)
.Then(f => { /* activities and state transitions */ }),
When(SimulationComplete)
.Then(f => _logger.LogInformation("Network {NetworkId} shutting down.", f.Instance.CorrelationId))
.TransitionTo(Final));
Note how the DataCollected event is handled both in the Initially block and in the During(Running, ...) block, i.e. in the state that the Initially handler transitions to.
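The shape of that state machine (the same event both creates the instance and keeps updating it) can be illustrated with a small library-free sketch. This is plain Java, not MassTransit: BuildingAggregates, its methods and the String correlation id are all hypothetical and only mirror the Initially / During(Running) structure above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory analogue of the state machine above; not MassTransit API.
final class BuildingAggregates {

    enum State { RUNNING, FINAL }

    static final class Instance {
        State state = State.RUNNING;
        int readings = 0;
    }

    // One instance per correlation id, comparable to an in-memory saga repository.
    private final Map<String, Instance> instances = new HashMap<>();

    // DataCollected: creates the instance when none exists yet (the "Initially" case),
    // otherwise updates the running instance (the "During(Running)" case).
    void onDataCollected(String correlationId) {
        Instance instance = instances.get(correlationId);
        if (instance == null) {
            instance = new Instance();
            instances.put(correlationId, instance);
        }
        if (instance.state == State.RUNNING) {
            instance.readings++;
        }
    }

    // SimulationComplete: moves an existing instance to its final state.
    void onSimulationComplete(String correlationId) {
        Instance instance = instances.get(correlationId);
        if (instance != null) {
            instance.state = State.FINAL;
        }
    }

    // Number of readings aggregated so far, or 0 if no instance exists for the id.
    int readings(String correlationId) {
        Instance instance = instances.get(correlationId);
        return instance == null ? 0 : instance.readings;
    }
}
```

The point is only that the handler for the first message and the handler for later messages are two branches on the instance's state, rather than two separate consumer interfaces for the same message type.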

Processing Dropped Message In Akka Streams

I have the following source queue definition.
lazy val (processMessageSource, processMessageQueueFuture) =
peekMatValue(
Source
.queue[(ProcessMessageInputData, Promise[ProcessMessageOutputData])](5, OverflowStrategy.dropNew))
def peekMatValue[T, M](src: Source[T, M]): (Source[T, M], Future[M]) = {
val p = Promise[M]
val s = src.mapMaterializedValue { m =>
p.trySuccess(m)
m
}
(s, p.future)
}
ProcessMessageInputData is essentially an artifact that is created when a caller calls a web server endpoint that is hooked up to this stream (i.e. the service endpoint's business logic puts messages into this queue). The Promise of ProcessMessageOutputData is completed downstream, in the sink of the application, and the web server has an onComplete callback on the corresponding future to return the response.
There are also other sources of ingress into this stream.
Now the buffer may be backed up, since the other sources may overload the system and thereby trigger stream back pressure. The existing code just drops the new message. But I still want the process message output promise to be completed with an exception stating something like "Throttled".
Is there a mechanism to write a custom overflow strategy, or a post processing on the overflowed element that allows me to do this?
According to https://github.com/akka/akka/blob/master/akkastream/src/main/scala/akka/stream/impl/QueueSource.scala#L83
dropNew would work just fine. On the client's end it would look like this:
processMessageQueue.offer((in, pr)).foreach {
  case QueueOfferResult.Enqueued => // handle the case where the element was successfully enqueued
  case QueueOfferResult.Dropped => // handle elements dropped because the buffer was full, e.g. fail pr with a "Throttled" exception
  case other => // Failure(cause) or QueueClosed: the stream has failed or completed
}
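The idea of failing the caller's promise when the offer is rejected can be shown without Akka. The following is a plain-Java analogue under stated assumptions: a bounded ArrayBlockingQueue stands in for Source.queue with OverflowStrategy.dropNew, a CompletableFuture stands in for the Promise, and ThrottlingQueue, Pending and submit are hypothetical names invented for this sketch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;

// Plain-Java analogue for illustration; not Akka Streams API.
final class ThrottlingQueue {

    // A request paired with the future the web layer waits on,
    // mirroring the (input, Promise) tuple from the question.
    static final class Pending {
        final String input;
        final CompletableFuture<String> output = new CompletableFuture<>();

        Pending(String input) {
            this.input = input;
        }
    }

    private final BlockingQueue<Pending> buffer;

    ThrottlingQueue(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    // Tries to enqueue without blocking. When the buffer is full the new element
    // is dropped and its future is failed right away with a "Throttled" exception.
    CompletableFuture<String> submit(String input) {
        Pending pending = new Pending(input);
        boolean enqueued = buffer.offer(pending);
        if (!enqueued) {
            pending.output.completeExceptionally(new IllegalStateException("Throttled"));
        }
        return pending.output;
    }
}
```

In the Akka version the same decision would be made in the Dropped branch of the offer result, where the promise that was offered together with the input is still in scope.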

Cause of actor termination or how to handle error

When an actor fails, I need to send the cause of the failure to another actor.
I know there are supervision strategies, and I use them. The problem is that I cannot find the correct place for such error reporting.
I tried watching the actor, but the Terminated message does not provide the cause of termination.
Currently, I have added error handling in the Decider:
override def supervisorStrategy: SupervisorStrategy =
OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = Duration(1, TimeUnit.SECONDS), loggingEnabled = true) {
case e: Exception =>
onActorError(sender(), e)
Stop
}
But I don't think this is the right time and place to do so: the "decider" should return a directive and not implicitly do something else.
So the question is: is there a proper place to catch actor exceptions and act on them?
The postRestart method of the supervised actor seems like a good place to do the postmortem logging.
From the documentation:
"The new actor's postRestart method is invoked with the exception which caused the restart. By default the preStart is called, just as in the normal start-up case."

How to detect dead remote client or server in akka2

I'm new to Akka 2. Here is my question:
There is a server actor and several client actors.
The server stores the refs of all the client actors.
I wonder how the server can detect which client is disconnected (shutdown, crash, ...),
and whether there is a way to tell the clients that the server is dead.
There are two ways to interact with an actor's lifecycle. First, the parent of an actor defines a supervisory policy that handles actor failures and has the option to restart, stop, resume, or escalate after a failure. In addition, a non-supervisor actor can "watch" an actor to detect the Terminated message generated when the actor dies. This section of the docs covers the topic: http://doc.akka.io/docs/akka/2.0.1/general/supervision.html
Here's an example of using watch from a spec. I start an actor, then set up a watcher for the Termination. When the actor gets a PoisonPill message, the event is detected by the watcher:
"be able to watch the proxy actor fail" in {
val myProxy = system.actorOf(Props(new VcdRouterActor(vcdPrivateApiUrl, vcdUser, vcdPass, true, sessionTimeout)), "vcd-router-" + newUuid)
watch(myProxy)
myProxy ! PoisonPill
expectMsg(Terminated(`myProxy`))
}
Here's an example of a custom supervisor strategy that Stops the child actor if it failed due to an authentication exception since that probably will not be correctable, or escalates the failure to a higher supervisor if the failure was for another reason:
override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 1 minutes) {
// authentication failures are probably not correctable, so stop the child.
case e: AuthenticationException ⇒
log.error(e.message + " Stopping proxy router for this host")
Stop
// don't know what it was, escalate it.
case e: Exception ⇒
log.warning("Unknown exception from vCD proxy. Escalating a {}", e.getClass.getName)
Escalate
}
Within an actor, you can generate the failure by throwing an exception or handling a PoisonPill message.
Another pattern that may be useful if you don't want to generate a failure is to respond with a failure to the sender. Then you can have a more personal message exchange with the caller. For example, the caller can use the ask pattern and use an onComplete block for handling the response. Caller side:
vcdRouter ? DisableOrg(id) mapTo manifest[VcdHttpResponse] onComplete {
  case Left(failure) => log.info("receive a failure message")
  case Right(success) ⇒ log.info("org disabled")
}
Callee side:
val org0 = new UUID("00000000-0000-0000-0000-000000000000")
def receive = {
case DisableOrg(id: UUID) if id == org0 => sender ! Failure(new IllegalArgumentException("can't disable org 0"))
case DisableOrg(id: UUID) => sender ! disableOrg(id)
}
In order to make your server react to changes of remote client status you could use something like the following (example is for Akka 2.1.4).
In Java
@Override
public void preStart() {
  context().system().eventStream().subscribe(getSelf(), RemoteLifeCycleEvent.class);
}
Or in Scala
override def preStart = {
context.system.eventStream.subscribe(listener, classOf[RemoteLifeCycleEvent])
}
If you're only interested in when a client is disconnected, you could register only for RemoteClientDisconnected.
More info here (Java) and here (Scala).
In the upcoming Akka 2.2 release (RC1 was released yesterday), Death Watch works both locally and remotely. If you watch the root guardian on the other system, then when you get Terminated for it, you know that the remote system is down.
Hope that helps!