Event Pattern for creating a rule/trigger (CloudWatch-Lambda) when AWS Step Function `ExecutionsFailed`

When my AWS Step Functions state machine fails (ExecutionsFailed), I'd like to trigger a Lambda function in response.
It seems that you have to create a rule in CloudWatch, but I couldn't find a description of how to do that (in particular, what the Event Pattern is supposed to look like).
P.S. In my case the failure happens because the execution exceeds the 25,000-event history limit (so it is not easy to handle within the state machine itself without adding loop counters etc.; I'd prefer to let it fail and then handle the failure via a Lambda).

My current workaround is to create a CloudWatch rule for a scheduled (cron) event, then check the state machine and, if it has failed, handle it.
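For what it's worth, Step Functions now publishes "Step Functions Execution Status Change" events to CloudWatch Events, so a rule can match FAILED executions directly. Below is a minimal sketch using boto3; the rule name, state machine ARN, and handler Lambda ARN are all placeholders, and the target function also needs a resource-based permission allowing events.amazonaws.com to invoke it.

import json
import boto3

events = boto3.client('events')

# Match only FAILED executions of one specific state machine.
pattern = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {
        "status": ["FAILED"],
        "stateMachineArn": ["arn:aws:states:us-east-1:123456789012:stateMachine:my-machine"]
    }
}

events.put_rule(Name='sfn-execution-failed', EventPattern=json.dumps(pattern))
events.put_targets(
    Rule='sfn-execution-failed',
    Targets=[{'Id': 'failure-handler',
              'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:handle-failure'}]
)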

Related

AWS lambda sequentially invoke same function

I have nearly 1000 items in my DB. I have to run the same operation on each item. The issue is that the operation calls a third-party service that has a 1-second rate limit per operation. Until now, I was able to do the entire thing inside a Lambda function, but it is now getting close to the 15-minute (900-second) timeout limit.
I was wondering what the best way for splitting this would be. Can I dump each item (or batches of items) into SQS and have a lambda function process them sequentially? But from what I understood, this isn't the recommended way to do this as I can't delay invocations sufficiently long. Or I would have to call lambda within a lambda, which also sounds weird.
Is AWS Step Functions the way to go here? I haven't used that service yet, so I was wondering if there are other options too. I am also using the serverless framework for doing this if it is of any significance.
Both methods you mentioned are options that would work. Within Lambda you could add a delay (sleep) after one item has been processed and then trigger another Lambda invocation following the delay. You'll be paying for that dead time, of course, if you use this approach, so Step Functions may be a more elegant solution. One Lambda can certainly invoke another, even invoking itself. If you invoke the next Lambda asynchronously, the initial function will finish while the newly invoked function starts to run. This article on Asynchronous invocation will be useful for that approach. Essentially, each Lambda invocation would be responsible for processing one item, delaying sufficiently to accommodate the service limit, and then invoking the function again for the next item.
If anything goes wrong you'd want to build appropriate exception handling so a problem with one item either halts the rest or allows the rest of the chain to continue, depending on what is appropriate for your use case.
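As a rough sketch of that chaining approach (Python; process() and the items/index payload shape are made-up placeholders), each invocation handles one item, sleeps to respect the rate limit, then asynchronously invokes the same function for the next index:

import json
import time
import boto3

lambda_client = boto3.client('lambda')

def process(item):
    ...  # call the rate-limited third-party service here

def handler(event, context):
    items = event['items']
    index = event.get('index', 0)

    process(items[index])
    time.sleep(1)  # respect the 1-second rate limit

    if index + 1 < len(items):
        # InvocationType='Event' is asynchronous: this call returns
        # immediately and the current invocation can finish.
        lambda_client.invoke(
            FunctionName=context.function_name,
            InvocationType='Event',
            Payload=json.dumps({'items': items, 'index': index + 1}),
        )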
Step Functions would also work well to handle this use case. With options like Wait and using a loop you could achieve the same result. For example, your step function flow could invoke one lambda that processes an item and returns the next item, then it could next run a wait step, then process the next item and so on until you reach the end. You could use a Map that runs a lambda task and a wait task:
The Map state ("Type": "Map") can be used to run a set of steps for
each element of an input array. While the Parallel state executes
multiple branches of steps using the same input, a Map state will
execute the same steps for multiple entries of an array in the state
input.
This article on Iterating a Loop Using Lambda is also useful.
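For illustration, here is a minimal Amazon States Language definition along those lines, built as a Python dict (the Lambda ARN and state names are placeholders); MaxConcurrency of 1 makes the Map process the array one item at a time, each followed by a one-second Wait:

import json

definition = {
    "StartAt": "ProcessAll",
    "States": {
        "ProcessAll": {
            "Type": "Map",
            "ItemsPath": "$.items",
            "MaxConcurrency": 1,  # process the array serially
            "Iterator": {
                "StartAt": "ProcessItem",
                "States": {
                    "ProcessItem": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
                        "Next": "RateLimitWait"
                    },
                    "RateLimitWait": {"Type": "Wait", "Seconds": 1, "End": True}
                }
            },
            "End": True
        }
    }
}

print(json.dumps(definition, indent=2))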
If you want the messages to be processed serially and are happy to dump them into SQS, set both the concurrency of the Lambda function and the BatchSize property of the SQS event that triggers it to 1.
Make it a FIFO queue so that messages don't potentially get processed more than once, if that is important.
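A sketch of that wiring with boto3 (the queue and function names are placeholders):

import boto3

sqs = boto3.client('sqs')
lam = boto3.client('lambda')

# FIFO queue; content-based deduplication avoids duplicate sends.
sqs.create_queue(
    QueueName='items.fifo',
    Attributes={'FifoQueue': 'true', 'ContentBasedDeduplication': 'true'},
)

# Only one concurrent execution, so messages are processed serially.
lam.put_function_concurrency(
    FunctionName='process-item',
    ReservedConcurrentExecutions=1,
)

# Deliver one message per invocation.
lam.create_event_source_mapping(
    EventSourceArn='arn:aws:sqs:us-east-1:123456789012:items.fifo',
    FunctionName='process-item',
    BatchSize=1,
)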

DDD - Concurrency and Command retrying with side-effects

I am developing an event-sourced Electric Vehicle Charging Station Management System, which is connected to several Charging Stations. In this domain, I've come up with an aggregate for the Charging Station, which includes the internal state of the Charging Station (whether it is network-connected, whether a car is charging using one of the station's connectors).
The station notifies me about its state through messages defined in a standardized protocol:
Heartbeat: whether the station is still "alive"
StatusNotification: whether the station has encountered an error (e.g. under-voltage), or if everything is correct
And my server can send commands to this station:
RemoteStartTransaction: tells the station to unlock and reserve one of its connectors, for a car to charge using the connector.
I've developed an Aggregate for this Charging Station. It contains the internal entities of its connectors: whether each one is charging or not, whether it has a problem in the power system, and so on.
The Aggregate, whose in-memory representation resides on the server that I control (not on the Charging Station itself), has a StationClient service, which is responsible for sending these commands to the physical Charging Station (pseudocode):
class StationAggregate {
  stationClient: StationClient
  URL: string
  connectors: Connector[]

  unlock(connectorId) {
    if !this.connectors.find(connectorId).isAvailableToBeUnlocked() {
      return ErrorConnectorNotAvailable
    }
    // Side effect: tell the physical station to unlock
    error = this.stationClient.sendRemoteStartTransaction(this.URL, connectorId)
    if error {
      return ErrorStationRejectedUnlock
    }
    this.applyEvents([
      StationUnlockedEvent(connectorId, now())
    ])
    return Ok
  }

  receiveHeartbeat(timestamp) {
    // No side effects, just records the telemetry
    this.applyEvents([
      StationSentHeartbeat(timestamp)
    ])
    return Ok
  }
}
I am using optimistic concurrency, which means that I load the Aggregate from a list of events and store the current version of the Aggregate in its in-memory representation: if the StationAggregate is at version #2032 when a command is successfully processed and its event(s) applied, it would then be at version #2033, for example. That way, I can put a unique constraint on the (StationID, Version) tuple in my persistence layer and guarantee that only one event is persisted per version.
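To make that mechanism concrete, here is a minimal in-memory sketch (all names are illustrative, not from my codebase); the stream length plays the role of the version, and the conflict check stands in for the unique (StationID, Version) constraint:

class VersionConflictError(Exception):
    pass

class InMemoryEventStore:
    def __init__(self):
        self.streams = {}  # station_id -> list of persisted events

    def load(self, station_id):
        return list(self.streams.get(station_id, []))

    def append(self, station_id, expected_version, events):
        stream = self.streams.setdefault(station_id, [])
        if len(stream) != expected_version:
            # Another writer persisted first; this stands in for the
            # unique (StationID, Version) constraint rejecting the write.
            raise VersionConflictError()
        stream.extend(events)

store = InMemoryEventStore()
store.append('station-1', expected_version=0, events=['StationSentHeartbeat'])
# A second writer that also loaded version 0 is rejected:
store.append('station-1', expected_version=0, events=['StationUnlocked'])  # raises VersionConflictError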
Suppose that a Heartbeat message and an Unlock command are received at the same time. Both threads would load the StationAggregate at version X. Handling the Heartbeat has no side effects, but handling the Unlock command has the side effect of telling the physical Charging Station to unlock. However, as I'm using optimistic concurrency, the StationUnlocked event could be rejected by the persistence layer. I don't know how I could handle that, as I can't retry the command: it is inherently not idempotent (the physical Station would reject the second request).
I don't know if I'm modelling something wrong, or if it's really a hard domain to model.
I am not sure I fully understand the problem, but the idea of optimistic concurrency is to prevent writes in case of a race condition. Versions are used to ensure that your write operation uses the version that is +1 from the version you got from the database before executing the command.
So, in case there's a parallel write that won and you got the wrong version exception back from the event store, you retry the command execution entirely, meaning you read the stream again and by doing so you get the latest state with the new version. Then, you give the command to the aggregate, which decides if it makes sense to perform the operation or not.
The issue is not particularly related to Event Sourcing, it is just as relevant for any persistence and it is resolved in the same way.
Event Sourcing could bring you additional benefits, since you know what happened. Imagine that by accident you got the Unlock command twice. When you get the "wrong version" error back from the store, you can read the last event and decide whether the command has already been executed. It can be done logically (there's no need to unlock if it's already unlocked by the same customer), technically (put the command id into the event metadata and compare), or both ways.
When handling duplicate commands, it makes sense to ensure a decent level of idempotence in the command handling: ignore the duplicate and return OK instead of failing in the user's face.
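A sketch of the technical variant (the replay helper and event shape are hypothetical): the command id travels in the event metadata, and on a version conflict the handler compares it against the last persisted event before deciding between "already done" and "retry".

def handle_unlock(command, store):
    events = store.load(command.station_id)
    state, version = replay(events)  # hypothetical replay of the stream
    try:
        # Events carried as (type, metadata) pairs for the sketch.
        store.append(command.station_id,
                     expected_version=version,
                     events=[('StationUnlocked', {'command_id': command.command_id})])
        return 'Ok'
    except VersionConflictError:
        _, last_metadata = store.load(command.station_id)[-1]
        if last_metadata.get('command_id') == command.command_id:
            return 'Ok'  # duplicate of a command that already succeeded
        return handle_unlock(command, store)  # re-read and decide again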
Another observation I can make from the very limited amount of information about the domain is that heartbeats are telemetry, while locking and unlocking are business. I don't think it makes a lot of sense to combine those two distinctly different things in one domain object.
Update, following the discussion in comments:
What you have, sending the command to the station at the same time as producing the event, is a variation of a two-phase commit. Since it's not executed in a transaction, either of the two operations could fail and leave the system in an inconsistent state. Either you don't know whether the station got the command to unlock itself (if the command failed to send), or you don't know that it's unlocked (if the event persistence failed). You only described the second case, but the first could happen too.
There are quite a few ways to solve it.
First, solving it in an entirely technical way. With MassTransit, it's quite easy to fix using the Outbox. It will not send any outgoing messages until the consumer of the original message has fully completed its work. Therefore, if the consumer of the Unlock command fails to persist the event, the command will not be sent. Then the retry filter would engage and the whole operation would be executed again; by then you are out of the race condition, so the operation would complete properly.
But it won't solve the issue when your command to the physical station fails to send (I reckon it is an edge case).
This issue can also be easily solved, and here Event Sourcing is helpful. You'd need to move sending the command to the station out of the original (user-driven) command consumer and into a subscriber. You subscribe to the event stream and let the subscriber send commands to the station when it sees a StationUnlocked event. With that, you only send commands to the station after the event has been persisted, and you can retry sending the command as many times as you need.
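A sketch of such a subscriber (the event shape and retry policy are made up); because it runs only after the event is durable, resending is safe:

import time

MAX_RETRIES = 5          # illustrative retry policy
RETRY_DELAY_SECONDS = 2

def on_station_unlocked(event, station_client):
    # Invoked by the event stream subscription, i.e. only after the
    # StationUnlocked event has been persisted.
    for _ in range(MAX_RETRIES):
        error = station_client.sendRemoteStartTransaction(
            event['station_url'], event['connector_id'])
        if not error:
            return
        time.sleep(RETRY_DELAY_SECONDS)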
Finally, you can solve it in a more meaningful way and change the semantics. I already mentioned that heartbeats are telemetry messages. I could expect the station also to respond to lock and unlock commands, telling you if it actually did what you asked.
You can use the station telemetry to create a representation of the physical station, which is not a part of the aggregate. In fact, it's more like an ACL to the physical world, represented as a read model.
When you have such a mirror of the physical station on your side, when you execute the Unlock command in your domain, you can engage a domain service to consult the current station state and make a decision. If you find out that the station is already unlocked and the session id matches (yes, I remember our previous discussion :)), you return OK and safely ignore the command. If it's locked, you proceed. If it's unlocked and the session id doesn't match, it's obviously an error and you need to do something else.
In this last option, you would clearly separate telemetry processing from the business, so heartbeats won't impact your domain model and you really won't have the versioning issue. You would also always have a place to look at to understand the current state of the physical station.
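A sketch of that decision step (the read model shape and result values are hypothetical):

def handle_unlock(command, station_read_model):
    # Mirror of the physical station, maintained from telemetry,
    # living outside the aggregate.
    station = station_read_model.get(command.station_id)
    connector = station.connectors[command.connector_id]

    if connector.unlocked and connector.session_id == command.session_id:
        return 'Ok'  # already done by this session: safely ignore
    if connector.unlocked:
        return 'ErrorUnexpectedSession'  # unlocked, but not by this session
    return proceed_with_unlock(command)  # the normal path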

Exception handling in a batch of Event Hub events using Azure WebJobs Sdk

I use the EventHub support of the Azure WebJobs Sdk to process Events. Because of the throughput I decided to go for batch processing of those Events, e.g. my method looks like this:
public static void HandleBatchRaw([EventHubTrigger("std")] EventData[] events) {...}
Now one of those events within a batch might cause an Exception - what's the right way to handle that? When I leave the Exception uncaught the processing stops and the remainder of the Events in the EventData[] parameter get lost.
Options:
1. Catch the Exception manually, forward the Event to some place else and continue
2. Let the SDK do the magic, e.g. it should just 'ACK' the Events processed until then (I probably would have to do that), mark this Event as 'Poisoned', exit the method and continue on the next call of the function
3. Move to Single Event Handling - but for performance goals I don't feel that's right
4. I missed the point and should think of another strategy
How should I approach this?
There are only four choices in any messaging solution:
1. Stop
2. Drop
3. Retry
4. Dead letter
You have to do that yourself. I don't believe the SDK will retry anything. Recall there is no ACK for an Event Hubs read; you just read.
How are you checkpointing?
Your best bet is probably your option #1. WebJobs EventHub binding doesn't give you many options here. Feel free to file an issue at https://github.com/Azure/azure-webjobs-sdk/issues to request better error handling support here.
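For illustration, the shape of option #1, sketched in Python-style pseudocode since the pattern is the same in C# (process() and poison_queue are placeholders): handle each event individually, forward failures somewhere durable, and keep going so the rest of the batch isn't lost.

def handle_batch_raw(events, poison_queue):
    for event in events:
        try:
            process(event)  # your per-event logic
        except Exception as exc:
            # Forward the poisoned event and continue with the rest,
            # instead of letting one bad event abort the whole batch.
            poison_queue.send(event, error=str(exc))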
If you want to see exactly what it's doing under the hood, here's the spot in the WebJobs SDK EventHub binding that receives events via EventProcessorHost:
https://github.com/Azure/azure-webjobs-sdk/blob/dev/src/Microsoft.Azure.WebJobs.ServiceBus/EventHubs/EventHubListener.cs#L86

How to make AWS Lambda stop execution?

I have an AWS Lambda function that does operations against Kinesis Firehose.
The function uses a backoff mechanism (which, at this point, I think is wasting my computation time).
But anyway, at some point in my code I would like to fail the execution.
What command should I use in order to make the execution stop?
P.S.
I found out that there are methods such as:
context.done()
context.succeed()
context.fail()
I've got to tell you, I could not find any documentation for these methods in the AWS documentation.
Those methods are available only for backward compatibility, since they were first introduced with Node.js v0.10.42. If you use Node.js version 4.x or 6.x, use the callback() function instead.
Check Using the Callback Parameter in Lambda for more information on how to take advantage of this function.
Here is my solution (probably not perfect, but it works for me)
time.sleep(context.get_remaining_time_in_millis() / 1000)
The code is in Python, but I am sure you can apply the same logic using any other language. The idea is to make my Lambda function "fall asleep" for the remaining execution time, so the 'retries' have no effect.
The full example may look like this:
import json
import time
from boto3 import client

...
some code that processes logs from CloudWatch when my ECS container stops (job is finished)
...

# Send an email notification
sns_client = client('sns')
sns_client.publish(
    TopicArn=sns_topic,
    Subject="ECS task success detected for container",
    Message=json.dumps(detail)
)

# Make sure it is sent only once, therefore 'sleep'
# until lambda stops 'retries'
time.sleep(context.get_remaining_time_in_millis() / 1000)
So, the email is sent only once. I hope it helps!

Difference between ExecutionListener and TaskListener

As I have read:
In general, the task listener event cycle is contained between execution listener events:
ExecutionListener#start
TaskListener#create
TaskListener#{assignment}*
TaskListener#{complete, delete}
ExecutionListener#end
see complete list at Camunda BPMN - Task listener vs Execution listeners
But now I have this question: what is the difference between ExecutionListener#start and TaskListener#create? As I noticed, the create event fires after the start event. Which business logic should I put in the start event and which in the create event? Are there any problems if I put all of my business logic in the start event?
I think the important difference to remember is that the ExecutionListener is available for all elements and allows access to the DelegateExecution, while the TaskListener only applies to tasks (bpmn and cmmn) and gives you access to the DelegateTask.
The DelegateTask is important for all task-lifecycle operations, like setting the due date or assigning candidate groups; you just cannot do this with the DelegateExecution.
So in general, we use ExecutionListeners on events and gateways, JavaDelegates on ServiceTasks and TaskListeners on UserTasks.