Can a MassTransit Consumer Saga be InitiatedBy the same message(s) that it Orchestrates? - azure-eventhub

The new support for Event Hub Riders in 7.0 plus the existing InMemoryRepository backing for Sagas looks like it could provide a straightforward means of creating aggregate states based on a stream of correlated messages, e.g. across all sensors in a Building). In this scenario, the Building's Identifier would be used as the CorrelationId of the Messages, the Saga, and as the PartitionKey of the EventData messages sent to the Event Hub, ensuring the same consuming service instance receives all messages for that device at a given time. Given the way Event Hub's rebalancing works, it can be assumed that at some point while this service is running, the service instance managing messages for a Partition will shift to a new host, which will start reading messages sent by the sensors in the building. At that moment:
The new host does not know anything about the old host's processing. It just knows that it is now receiving messages for the Event Hub partition that includes that Building's messages.
The devices sending the messages do not know anything about the transition in state aggregation responsibility "downstream of them" - they are still happily reporting new measurements as always.
The challenge this creates is: on the new service instance, we need a new Saga to be created to take over for the previous Saga, but the only thing that knows no Saga lives for a given entity is MassTransit: nothing on the new instance knows a sensor reading from Building A is the first one from Building A since this service instance took over tracking the aggregate Building A state. We thought this could be handled by marking the same Message (DataCollected) with both InitiatedBy and Orchestrates:
public class BuildingAggregator:
ISaga,
InitiatedBy<DataCollected>, //init saga on first DataCollected with a given CorrelationId seen
Orchestrates<DataCollected> //then keep handling those in that saga
{
//saga Consume methods
}
However, this throws the following exception when the BuildingAggregator receives its second DataCollected message with a given Guid:
Saga exception on receipt of MassTransitFW_POC.Program+DataCollected: The message cannot be accepted by an existing saga
at MassTransit.Saga.Policies.NewSagaPolicy`2.MassTransit.Saga.ISagaPolicy<TSaga,TMessage>.Existing(SagaConsumeContext`2 context, IPipe`1 next)
at MassTransit.Saga.SendSagaPipe`2.<Send>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at MassTransit.Saga.SendSagaPipe`2.<Send>d__5.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at MassTransit.Saga.InMemoryRepository.InMemorySagaRepositoryContextFactory`1.<Send>d__4`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
Is there another way of achieving this logic? Is this the "wrong way" to apply Sagas?

As per Chris Patterson's comments on the question above, this is achievable with the state machine syntax:
Initially(
When(DataCollected)
.Then(f => _logger.LogInformation("Initiating Network Manager for Network: {NetworkId}", f.Data.NetworkId))
.TransitionTo(Running));
During(Running,
When(DataCollected)
.Then(f => { // activities and state transitions }),
When(SimulationComplete)
.Then(f => _logger.LogInformation("Network {NetworkId} shutting down.", f.Instance.CorrelationId))
.TransitionTo(Final));
Note how the DataCollected event is handled both in the Initially state transition and in a state transition set by the Initially condition.

Related

MismatchingMessageCorrelationException : Cannot correlate message ‘onEventReceiver’: No process definition or execution matches the parameters

We are facing an MismatchingMessageCorrelationException for the receive task in some cases (less than 5%)
The call back to notify receive task is done by :
protected void respondToCallWorker(
#NonNull final String correlationId,
final CallWorkerResultKeys result,
#Nullable final Map<String, Object> variables
) {
try {
runtimeService.createMessageCorrelation("callWorkerConsumer")
.processInstanceId(correlationId)
.setVariables(variables)
.setVariable("callStatus", result.toString())
.correlateWithResult();
} catch(Exception e) {
e.printStackTrace();
}
}
When i check the logs : i found that the query executed is this one :
select distinct RES.* from ACT_RU_EXECUTION RES
inner join ACT_RE_PROCDEF P on RES.PROC_DEF_ID_ = P.ID_
WHERE RES.PROC_INST_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0' and RES.SUSPENSION_STATE_ = '1'
and exists (select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = RES.ID_ and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer' )
Some times, When i look for the instance of the process in the database i found it waiting in the receive task
SELECT DISTINCT * FROM ACT_RU_EXECUTION RES
WHERE id_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
However, when i check the subscription event, it's not yet created in the database
select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer'
I think that the solution is to save the "receive task" before getting the response for respondToCallWorker, but sadly i can't figure it out.
I tried "asynch before" callWorker and "Message consumer" but it did not work,
I also tried camunda.bpm.database.jdbc-batch-processing=false and got the same results,
I tried also parallel branches but i get OptimisticLocak exception and MismatchingMessageCorrelationException
Maybe i am doing it wrong
Thanks for your help
This is an interesting problem. As you already found out, the error happens, when you try to correlate the result from the "worker" before the main process ended its transaction, thus there is no message subscription registered at the time you correlate.
This problem in process orchestration is described and analyzed in this blog post, which is definitely worth reading.
Taken from that post, here is a design that should solve the issue:
You make message send and receive parallel and put an async before the send task.
By doing so, the async continuation job for the send event and the message subscription are written in the same transaction, so when the async message send executes, you already have the subscription waiting.
Although this should work and solve the issue on BPMN model level, it might be worth to consider options that do not require remodeling the process.
First, instead of calling the worker directly from your delegate, you could (assuming you are on spring boot) publish a "CallWorkerCommand" (simple pojo) and use a TransactionalEventLister on a spring bean to execute the actual call. By doing so, you first will finish the BPMN process by subscribing to the message and afterwards, spring will execute your worker call.
Second: you could use a retry mechanism like resilience4j around your correlate message call, so in the rare cases where the result comes to quickly, you fail and retry a second later.
Another solution I could think of, since you seem to be using an "external worker" pattern here, is to use an external-task-service task directly, so the send/receive synchronization gets solved by the Camunda external worker API.
So many options to choose from. I would possibly prefer the external task, followed by the transactionalEventListener, but that is a matter of personal preference.

Akka Stream how to determine inside GraphStageLogic whether it failed

I have a network of nodes all implemented using custom GraphStageLogic. I can't find any API to determine when a stage throws an exception (e.g. IllegalArgumentException for Cannot pull port). The only thing Akka does is fail the down stream connections. What I need to determine is, for example in postStop or through a callback, when a node shuts down due to runtime exception, and propagate that information to a Promise that monitors the state of the entire system. Using withAttributes(supervisionStrategy) does not have any effect, either. It seems bewildering to me that there is no way to monitor exceptions thrown inside a GraphStageLogic? failStage is final like basically the entire API of GraphStageLogic.
Using decider when defining the ActorMaterializer used for materializing the Graph should work:
implicit val materializer: ActorMaterializer = ActorMaterializer(
ActorMaterializerSettings(actorSystem).withSupervisionStrategy(decider))
where decider is the typical
val decider: Supervision.Decider = {
case e: IllegalArgumentException => ....
}

Processing AWS Lambda messages in Batches

I am wondering something, and I really can't find information about it. Maybe it is not the way to go but, I would just like to know.
It is about Lambda working in batches. I know I can set up Lambda to consume batch messages. In my Lambda function I iterate each message, and if one fails, Lambda exits. And the cycle starts again.
I am wondering about slightly different approach
Let's assume I have three messages: A, B and C. I also take them in batches. Now if the message B fails (e.g. API call failed), I return message B to SQS and keep processing the message C.
Is it possible? If it is, is it a good approach? Because I see that I need to implement some extra complexity in Lambda and what not.
Thanks
There's an excellent article here. The relevant parts for you are...
Using a batchSize of 1, so that messages succeed or fail on their own.
Making sure your processing is idempotent, so reprocessing a message isn't harmful, outside of the extra processing cost.
Handle errors within your function code, perhaps by catching them and sending the message to a dead letter queue for further processing.
Calling the DeleteMessage API manually within your function after successfully processing a message.
The last bullet point is how I've managed to deal with the same problem. Instead of returning errors immediately, store them or note that an error has occurred, but then continue to handle the rest of the messages in the batch. At the end of processing, return or raise an error so that the SQS -> lambda trigger knows not to delete the failed messages. All successful messages will have already been deleted by your lambda handler.
sqs = boto3.client('sqs')
def handler(event, context):
failed = False
for msg in event['Records']:
try:
# Do something with the message.
handle_message(msg)
except Exception:
# Ok it failed, but allow the loop to finish.
logger.exception('Failed to handle message')
failed = True
else:
# The message was handled successfully. We can delete it now.
sqs.delete_message(
QueueUrl=<queue_url>,
ReceiptHandle=msg['receiptHandle'],
)
# It doesn't matter what the error is. You just want to raise here
# to ensure the trigger doesn't delete any of the failed messages.
if failed:
raise RuntimeError('Failed to process one or more messages')
def handle_msg(msg):
...
For Node.js, check out https://www.npmjs.com/package/#middy/sqs-partial-batch-failure.
const middy = require('#middy/core')
const sqsBatch = require('#middy/sqs-partial-batch-failure')
const originalHandler = (event, context, cb) => {
const recordPromises = event.Records.map(async (record, index) => { /* Custom message processing logic */ })
return Promise.allSettled(recordPromises)
}
const handler = middy(originalHandler)
.use(sqsBatch())
Check out https://medium.com/#brettandrews/handling-sqs-partial-batch-failures-in-aws-lambda-d9d6940a17aa for more details.
As of Nov 2019, AWS has introduced the concept of Bisect On Function Error, along with Maximum retries. If your function is idempotent this can be used.
In this approach you should throw an error from the function even if one item in the batch is failing. AWS with split the batch into two and retry. Now one half of the batch should pass successfully. For the other half the process is continued till the bad record is isolated.
Like all architecture decisions, it depends on your goal and what you are willing to trade for more complexity. Using SQS will allow you to process messages out of order so that retries don't block other messages. Whether or not that is worth the complexity depends on why you are worried about messages getting blocked.
I suggest reading about Lambda retry behavior and Dead Letter Queues.
If you want to retry only the failed messages out of a batch of messages it is totally doable, but does add slight complexity.
A possible approach to achieve this is iterating through a list of your events (ex [eventA, eventB, eventC]), and for each execution, append to a list of failed events if the event failed. Then, have an end case that checks to see if the list of failed events has anything in it, and if it does, manually send the messages back to SQS (using SQS sendMessageBatch).
However, you should note that this puts the events to the end of the queue, since you are manually inserting them back.
Anything can be a "good approach" if it solves a problem you are having without much complexity, and in this case, the issue of having to re-execute successful events is definitely a problem that you can solve in this manner.
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
To use this feature, your function must gracefully handle errors. Have your function logic catch all exceptions and report the messages that result in failure in batchItemFailures in your function response. If your function throws an exception, the entire batch is considered a complete failure.
To add to the answer by David:
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
I implemented this, but a batch of A, B and C, with B failing, would still mark all three as complete. It turns out you need to explicitly define the lambda event source mapping to expect a batch failure to be returned. It can be done by adding the key of FunctionResponseTypes with the value of a list containing ReportBatchItemFailures. Here is the relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
My sam template looks like this after adding this:
Type: SQS
Properties:
Queue: my-queue-arn
BatchSize: 10
Enabled: true
FunctionResponseTypes:
- ReportBatchItemFailures

Why are my requests handled by a single thread in spray-http?

I set up an http server using spray-can, spray-http 1.3.2 and akka 2.3.6.
my application.conf doesn't have any akka (or spray) entries. My actor code:
class TestActor extends HttpServiceActor with ActorLogging with PlayJsonSupport {
val route = get {
path("clientapi"/"orders") {
complete {{
log.info("handling request")
System.err.println("sleeping "+Thread.currentThread().getName)
Thread.sleep(1000)
System.err.println("woke up "+Thread.currentThread().getName)
Seq[Int]()
}}
}
}
override def receive: Receive = runRoute(route)
}
started like this:
val restService = system.actorOf(Props(classOf[TestActor]), "rest-clientapi")
IO(Http) ! Http.Bind(restService, serviceHost, servicePort)
When I send 10 concurrent requests, they are all accepted immediately by spray and forwarded to different dispatcher actors (according to logging config for akka I have removed from applicaiton.conf lest it influenced the result), but all are handled by the same thread, which sleeps, and only after waking up picks up the next request.
What should I add/change in the configuration? From what I've seen in reference.conf the default executor is a fork-join-executor, so I'd expect all the requests to execute in parallel out of the box.
From your code I see that there is only one TestActor to handle all requests, as you've created only one with system.actorOf. You know, actorOf doesn't create new actor per request - more than that, you have the val there, so it's only one actor. This actor handles requests sequntially one-by-one and your routes are processing inside this actor. There is no reason for dispatcher to pick-up some another thread, while the only one thread per time is used by only one actor, so you've got only one thread in the logs (but it's not guaranteed) - I assume it's first thread in the pool.
Fork-join executor does nothing here except giving first and always same free thread as there is no more actors requiring threads in parallel with current one. So, it receives only one task at time. Even with "work stealing" - it doesn't work til you have some blocked (and marked to have managed block) thread to "steal" resources from. Thread.sleep(1000) itself doesn't mark thread automatically - you should surround it with scala.concurrent.blocking to use "work stealing". Anyway, it still be only one thread while you have only one actor.
If you need to have several actors to process the requests - just pass some akka router actor (it has nothing in common with spray-router):
val restService = context.actorOf(RoundRobinPool(5).props(Props[TestActor]), "router")
That will create a pool (not thread-pool) with 5 actors to serve your requests.

How to detect dead remote client or server in akka2

Im new to AKKA2.The following is my question:
There is a server actor and several client actors.
The server stores all the ref of the client actors.
I wonder how the server can detect which client is disconnected(shutdown, crash...)
And if there is a way to tell the clients that the server is dead.
There are two ways to interact with an actor's lifecycle. First, the parent of an actor defines a supervisory policy that handles actor failures and has the option to restart, stop, resume, or escalate after a failure. In addition, a non-supervisor actor can "watch" an actor to detect the Terminated message generated when the actor dies. This section of the docs covers the topic: http://doc.akka.io/docs/akka/2.0.1/general/supervision.html
Here's an example of using watch from a spec. I start an actor, then set up a watcher for the Termination. When the actor gets a PoisonPill message, the event is detected by the watcher:
"be able to watch the proxy actor fail" in {
val myProxy = system.actorOf(Props(new VcdRouterActor(vcdPrivateApiUrl, vcdUser, vcdPass, true, sessionTimeout)), "vcd-router-" + newUuid)
watch(myProxy)
myProxy ! PoisonPill
expectMsg(Terminated(`myProxy`))
}
Here's an example of a custom supervisor strategy that Stops the child actor if it failed due to an authentication exception since that probably will not be correctable, or escalates the failure to a higher supervisor if the failure was for another reason:
override val supervisorStrategy = OneForOneStrategy(maxNrOfRetries = 5, withinTimeRange = 1 minutes) {
// presumably we had a connection, and lost it. Let's restart the child and see if we can re-establish one.
case e: AuthenticationException ⇒
log.error(e.message + " Stopping proxy router for this host")
Stop
// don't know what it was, escalate it.
case e: Exception ⇒
log.warning("Unknown exception from vCD proxy. Escalating a {}", e.getClass.getName)
Escalate
}
Within an actor, you can generate the failure by throwing an exception or handling a PoisonPill message.
Another pattern that may be useful if you don't want to generate a failure is to respond with a failure to the sender. Then you can have a more personal message exchange with the caller. For example, the caller can use the ask pattern and use an onComplete block for handling the response. Caller side:
vcdRouter ? DisableOrg(id) mapTo manifest[VcdHttpResponse] onComplete {
case Left(failure) => log.info("receive a failure message")
case Right(success) ⇒ log.info("org disabled)
}
Callee side:
val org0 = new UUID("00000000-0000-0000-0000-000000000000")
def receive = {
case DisableOrg(id: UUID) if id == org0 => sender ! Failure(new IllegalArgumentException("can't disable org 0")
case DisableOrg(id: UUID) => sender ! disableOrg(id)
}
In order to make your server react to changes of remote client status you could use something like the following (example is for Akka 2.1.4).
In Java
#Override
public void preStart() {
context().system().eventStream().subscribe(getSelf(), RemoteLifeCycleEvent.class);
}
Or in Scala
override def preStart = {
context.system.eventStream.subscribe(listener, classOf[RemoteLifeCycleEvent])
}
If you're only interested when the client is disconnected you could register only for RemoteClientDisconnected
More info here(java)and here(scala)
In the upcoming Akka 2.2 release (RC1 was released yesterday), Death Watch works both locally and remote. If you watch the root guardian on the other system, when you get Terminated for him, you know that the remote system is down.
Hope that helps!