I've been tasked with looking at a service built on AWS Lambda that performs a long-running task of turning VMs on and off. Mind you, I come from the Azure team, so I am not familiar with the styling or best practices of AWS services.
The approach the original developer has taken is to send the entire workload to one Lambda function, have that function take a slice of the workload, and then recursively call itself with the remaining workload until all items are gone (workload = 0).
Pseudo-ish Code:
// Assume this gets sent to an HTTP Lambda endpoint as a whole
let workload = [1, 2, 3, 4, 5, 6, 7, 8]
// The Lambda HTTP endpoint
function Lambda(workload) {
  if (!workload.length) {
    return "No more work!"
  }
  const toDo = workload.splice(0, 2) // take the first two items
  doWork(toDo)
  // Then... except in reality it builds a new HTTP request with the AWS SDK
  Lambda(workload) // [3, 4, 5, 6, 7, 8], and so on
}
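To make that concrete, the "recursive call" is really an asynchronous self-invocation through the AWS SDK (or an HTTP call back to the same endpoint). A rough sketch of the idea, assuming the Node.js AWS SDK v2; the function name "vm-workload-processor" and doWork are placeholders, not the real service's names:

// Sketch only of the existing pattern
const AWS = require('aws-sdk')
const lambda = new AWS.Lambda()

exports.handler = async (event) => {
  const workload = event.workload || []
  if (!workload.length) {
    return 'No more work!'
  }

  const toDo = workload.splice(0, 2) // take the first two items
  await doWork(toDo)

  // Re-invoke this same function asynchronously with whatever is left
  await lambda.invoke({
    FunctionName: 'vm-workload-processor', // placeholder
    InvocationType: 'Event', // fire-and-forget
    Payload: JSON.stringify({ workload })
  }).promise()
}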
This seems highly inefficient and unreliable (correct me if I am wrong). There is a lot of state being stored in this process and in my opinion that creates a lot of failure points.
My plan is to suggest we re-engineer the entire service to use a queue/worker pattern instead, where ideally the endpoint would handle one work item at a time and be stateless.
The queue would be populated by a service (Jenkins? Lambda? Manually?), and a second service would read from the queue (and ideally scale out as needed).
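For illustration, a minimal sketch of that split, assuming SQS as the queue and the Node.js AWS SDK v2; the queue URL environment variable and doWork are placeholders:

// Producer: whatever populates the queue (a Jenkins job, a small Lambda, etc.)
const AWS = require('aws-sdk')
const sqs = new AWS.SQS()

async function enqueueWorkload(workload) {
  // One message per item so each unit of work can fail, retry, and scale independently
  for (const item of workload) {
    await sqs.sendMessage({
      QueueUrl: process.env.QUEUE_URL, // placeholder
      MessageBody: JSON.stringify({ item })
    }).promise()
  }
}

// Worker: a stateless Lambda subscribed to the queue, handling one record at a time
exports.handler = async (event) => {
  for (const record of event.Records) {
    const { item } = JSON.parse(record.body)
    await doWork([item])
  }
}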
UPDATE: AWS EventBridge now looks like the preferred solution.
It's "Coupling" that I was thinking of, see here: https://www.jeffersonfrank.com/insights/aws-lambda-design-considerations
Coupling
Coupling goes beyond Lambda design considerations; it's more about the system as a whole. Lambdas within a microservice are sometimes tightly coupled, but this is nothing to worry about as long as the data passed between Lambdas within their little black box of a microservice is not passed over pure HTTP and isn't synchronous.
Lambdas shouldn't be directly coupled to one another in a request/response fashion, but asynchronously. Consider the scenario where an S3 event invokes a Lambda function, and that Lambda then needs to call another Lambda within the same microservice, and so on.
[Diagram: direct AWS Lambda coupling]
You might be tempted to implement direct coupling, like allowing Lambda 1 to use the AWS SDK to call Lambda 2 and so on. This introduces some of the following problems:
If Lambda 1 invokes Lambda 2 synchronously, it needs to wait for the latter to be done first. Lambda 1 might not know that Lambda 2 also called Lambda 3 synchronously, and Lambda 1 may now need to wait for both Lambda 2 and Lambda 3 to finish successfully. Lambda 1 might time out as it needs to wait for all the Lambdas to complete first, and you're also paying for each Lambda while they wait.
What if Lambda 3 has a concurrency limit set and is also called by another service? The call between Lambda 2 and Lambda 3 will fail until it has concurrency again. The error can be returned all the way back to Lambda 1, but what does Lambda 1 then do with the error? It has to store that the S3 event was unsuccessful and that it needs to replay it.
This process can be redesigned to be event-driven: [Diagram: the same flow redesigned to be event-driven]
Not only is this the solution to all the problems introduced by the direct coupling method, but it also provides a method of replaying the DLQ if an error occurred for each Lambda. No message will be lost or need to be stored externally, and the demand is decoupled from the processing.
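To illustrate the DLQ point in the passage above: an SQS queue sitting between the Lambdas can be given a redrive policy, so messages that repeatedly fail processing are moved to a dead-letter queue where they can be inspected and replayed later. A rough sketch, assuming the Node.js AWS SDK v2 and arbitrary queue names:

const AWS = require('aws-sdk')
const sqs = new AWS.SQS()

async function createWorkQueueWithDlq() {
  // Dead-letter queue that holds messages which could not be processed
  const dlq = await sqs.createQueue({ QueueName: 'vm-work-dlq' }).promise()
  const dlqAttrs = await sqs.getQueueAttributes({
    QueueUrl: dlq.QueueUrl,
    AttributeNames: ['QueueArn']
  }).promise()

  // Main work queue: after 3 failed receives a message is moved to the DLQ
  return sqs.createQueue({
    QueueName: 'vm-work-queue',
    Attributes: {
      RedrivePolicy: JSON.stringify({
        deadLetterTargetArn: dlqAttrs.Attributes.QueueArn,
        maxReceiveCount: '3'
      })
    }
  }).promise()
}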
AWS Step Functions is one way you can achieve this. Step Functions is used to orchestrate multiple Lambda functions in any manner you want: parallel executions, sequential executions, or a mix of both. You can also add wait steps, condition checks, and retries in between if you need them.
Your overall step function might look something like this (say you want Tasks 1, 2, and 3 to execute in parallel; when all of these are complete, you want to execute Task 4, and then Tasks 5 and 6 in parallel again).
Configuring this is also pretty simple. It accepts a JSON definition like the following:
{
  "Comment": "An example of the Amazon States Language using a parallel state to execute two branches at the same time.",
  "StartAt": "Parallel",
  "States": {
    "Parallel": {
      "Type": "Parallel",
      "Next": "Task4",
      "Branches": [
        {
          "StartAt": "Task1",
          "States": {
            "Task1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
              "End": true
            }
          }
        },
        {
          "StartAt": "Task2",
          "States": {
            "Task2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
              "End": true
            }
          }
        },
        {
          "StartAt": "Task3",
          "States": {
            "Task3": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
              "End": true
            }
          }
        }
      ]
    },
    "Task4": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
      "Next": "Parallel2"
    },
    "Parallel2": {
      "Type": "Parallel",
      "Next": "Final State",
      "Branches": [
        {
          "StartAt": "Task5",
          "States": {
            "Task5": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
              "End": true
            }
          }
        },
        {
          "StartAt": "Task6",
          "States": {
            "Task6": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:ap-south-1:XXX:function:XXX",
              "End": true
            }
          }
        }
      ]
    },
    "Final State": {
      "Type": "Pass",
      "End": true
    }
  }
}
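Once the state machine exists, an execution can be started from whatever produces the workload. A small sketch, assuming the Node.js AWS SDK v2 and a placeholder state machine ARN:

const AWS = require('aws-sdk')
const stepFunctions = new AWS.StepFunctions()

// Kick off one execution of the state machine defined above
stepFunctions.startExecution({
  stateMachineArn: 'arn:aws:states:ap-south-1:XXX:stateMachine:XXX', // placeholder
  input: JSON.stringify({}) // initial input for the execution (placeholder)
}).promise()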
Related
I am writing a CloudFormation template for an AWS Batch job triggered by an EventBridge rule. However, I am getting the following error:
shareIdentifier must be specified. (Service: AWSBatch; Status Code: 400; Error Code: ClientException;
I cannot find any documentation on how to pass a shareIdentifier to my Batch job. How can I add it to my EventBridge rule's CloudFormation template?
I have tried passing it as the Input variable:
Input: |
  {
    "shareIdentifier": "mid"
  }
This is not picked up. I have also tried passing shareIdentifier/ShareIdentifier directly in the BatchParameters; this was an unrecognised key.
In the end, I couldn't crack this; I had to wrap it in a state machine step and call that from EventBridge instead. This was the logic in the state machine to add the share identifier:
"States": {
"Batch SubmitJob": {
"Type": "Task",
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobName": <name>,
"JobDefinition": <Arn>,
"JobQueue": <QueueName>,
"ShareIdentifier": <Share>
},
If anyone works out how to do it directly from EventBridge, I'd love to hear it.
I have an EMR step which is submitted through a Step Function. During the run I can see the task is submitted, but the EMR step is not executed and the EMR console doesn't have any information.
How can I debug this?
How can I send an SNS notification when a step doesn't start executing within a threshold timeframe? In my case the Step Function shows the EMR task as submitted, but there is no information in the EMR console, and the pipeline has been running for more than half an hour without failing.
You could start the debugging process through the Step Functions execution log to identify the specific step that has failed, and then look at the EMR console or at the specific service that failed. Usually, when the EMR step doesn't appear in the EMR console, it is due to a runtime error caused by an exception raised when calling the EMR step.
For this scenario, you can use the error handling that Step Functions provides through the Catch and TimeoutSeconds fields; you can find more details in the AWS documentation here.
Basically, you need to add these fields as shown below:
{
  "StartAt": "EmrStep",
  "States": {
    "EmrStep": {
      "Type": "Task",
      "Resource": "arn:aws:emr:execute-X-step",
      "Comment": "This is your EMR step",
      "TimeoutSeconds": 10,
      "Catch": [
        {
          "ErrorEquals": ["States.Timeout"],
          "Next": "ShutdownClusterAndSendSNS"
        }
      ],
      "End": true
    },
    "ShutdownClusterAndSendSNS": {
      "Type": "Pass",
      "Comment": "This step handles the timeout exception raised",
      "Result": "You can shutdown the EMR cluster to avoid increased cost here and later send a sns notification!",
      "End": true
    }
  }
}
Note: To catch the timeout exception you have to catch the States.Timeout error, but you can also define the same Catch field for other types of errors.
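As for the SNS part of the question: the Pass state above is only a placeholder. Step Functions has a direct SNS integration, so the catch target could publish the notification itself. A sketch of such a state (the topic ARN and message text are placeholders):

"NotifyTimeout": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish",
  "Parameters": {
    "TopicArn": "arn:aws:sns:us-east-1:XXX:emr-step-alerts",
    "Message": "EMR step did not start within the expected timeframe"
  },
  "End": true
}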
In the Standard Workflow we can happily invoke another Standard workflow using
{
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.sync:2",
  "Parameters": {
    "StateMachineArn": "${NestedStateMachineArn}",
    ...
  }
  ...
When we try to do the same with an Express workflow, we of course get the error Express state machine does not support '.sync' service integration. That is stated by AWS, so it is expected behaviour.
Is there another way to execute an Express workflow from another Express workflow and somehow get the execution result/output? I can think of a last resort: use a Lambda function to execute the nested workflow synchronously and wait for a response. That said, it will increase cost to have a function needlessly waiting on a state machine.
I tried to look around but couldn't find this documented anywhere.
You can execute Express executions synchronously using the StartSyncExecution API, which is now supported as an AWS SDK integration in Step Functions using "Resource": "arn:aws:states:::aws-sdk:sfn:startSyncExecution".
"NestedExpressWorkflow": {
"Type": "Task",
"Resource": "arn:aws:states:::aws-sdk:sfn:startSyncExecution",
"Parameters": {
"StateMachineArn": <your_express_state_machine>,
"Input.$": "$"
},
"Next": "NextState"
}
You can execute another workflow, you just can't wait for the results. I believe you just need to remove .sync from the resource. If you need to wait for the results of the second workflow, you won't be able to do that within an Express workflow.
From Service Integrations with AWS Step Functions
Standard Workflows and Express Workflows support the same set of service integrations but do not support the same integration patterns. Express Workflows do not support Run a Job (.sync) or Wait for Callback (.waitForTaskToken). For more information, see Standard vs. Express Workflows.
I need to get IoT device status reliably.
Right now, I have a Lambda connected to SELECT * FROM '$aws/events/presence/#' events on IoT.
But I can't get a reliable device status in the case where a connected device is disconnected and reconnected within ~40 seconds. The result of this scenario is events in the following order:
1. Connected - shortly after the device connected again
2. Disconnected - after ~40 seconds.
It looks like the disconnected message is not discarded when the device reconnects, and is emitted after the connection timeout in any case.
I've found a workaround: request the device connectivity from the AWS_Things IoT index. In fact, I also receive the previous connectivity state there, but it has a timestamp field. I then compare the current event.timestamp with the timestamp from the index, and if the difference is higher than 30 seconds I silently discard the disconnected event. But this approach is not reliable either, because I can still get the wrong behaviour when switching the device faster, with a 5-second interval. This is not acceptable for my project.
Is it possible to use IoT events to solve my problem? I would rather not resort to polling the device index.
You can also use an SQS delay queue and check after 5 seconds whether the disconnect is still true. That is way cheaper than using Step Functions. This is also the official solution:
Handling client disconnections
The best practice is to always have a wait state implemented for lifecycle events, including Last Will and Testament (LWT) messages. When a disconnect message is received, your code should wait a period of time and verify a device is still offline before taking action. One way to do this is by using SQS Delay Queues. When a client receives a LWT or a lifecycle event, you can enqueue a message (for example, for 5 seconds). When that message becomes available and is processed (by Lambda or another service), you can first check if the device is still offline before taking further action.
https://docs.aws.amazon.com/iot/latest/developerguide/life-cycle-events.html#connect-disconnect
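A rough sketch of that pattern with the Node.js AWS SDK v2; the queue URL environment variable and the isDeviceOffline/markDeviceOffline helpers are placeholders for your own logic:

const AWS = require('aws-sdk')
const sqs = new AWS.SQS()

// On a "disconnected" lifecycle event: park the event for a few seconds
async function onDisconnected(event) {
  await sqs.sendMessage({
    QueueUrl: process.env.DELAY_QUEUE_URL, // placeholder
    MessageBody: JSON.stringify(event),
    DelaySeconds: 5 // message only becomes visible after the delay
  }).promise()
}

// Consumer Lambda on the delay queue: act only if the device is still offline
exports.handler = async (event) => {
  for (const record of event.Records) {
    const { clientId } = JSON.parse(record.body)
    if (await isDeviceOffline(clientId)) { // e.g. a fleet-indexing lookup
      await markDeviceOffline(clientId)    // your actual business logic
    }
  }
}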
Well, at the moment I just use a Step Function, connected to the SELECT * FROM '$aws/events/presence/#' event, that checks the actual thing state after a delay (30 seconds in the definition below):
{
  "StartAt": "ChoiceEvent",
  "States": {
    "ChoiceEvent": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.eventType",
          "StringEquals": "disconnected",
          "Next": "WaitDelay"
        }
      ],
      "Default": "CheckStatus"
    },
    "WaitDelay": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "CheckStatus"
    },
    "CheckStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:xxxxxxx:function:connectivity-check",
      "End": true
    }
  }
}
The connectivity-check Lambda just checks the actual thing state in the IoT registry when eventType is disconnected.
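For reference, a sketch of what such a connectivity check can look like using the fleet indexing SearchIndex API (this assumes thing connectivity indexing is enabled; the thing name would come from the event):

const AWS = require('aws-sdk')
const iot = new AWS.Iot()

// Returns true if the AWS_Things index currently reports the thing as connected
async function isThingConnected(thingName) {
  const result = await iot.searchIndex({
    indexName: 'AWS_Things',
    queryString: `thingName:${thingName}`
  }).promise()

  const thing = result.things && result.things[0]
  return Boolean(thing && thing.connectivity && thing.connectivity.connected)
}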
If you use a "Push" subscription to a Google Cloud Pub/Sub, you'll be registering an HTTPS endpoint that receives messages from Google's managed service. This is great if you wish to avoid dependencies on Google Cloud's SDKs and instead trigger your asynchronous services via a traditional web request. However, the intended casing of the properties of the payload is not clear, and since I'm using Push subscriptions I don't have a SDK to defer to for deserialization.
If you look at this documentation, you see references to message_id using snake_case (Update 9/18/18: As stated in Kamal's answer, the documentation was updated since this was incorrect), e.g.:
{
  "message": {
    "attributes": {
      "key": "value"
    },
    "data": "SGVsbG8gQ2xvdWQgUHViL1N1YiEgSGVyZSBpcyBteSBtZXNzYWdlIQ==",
    "message_id": "136969346945",
    "publish_time": "2014-10-02T15:01:23.045123456Z"
  },
  "subscription": "projects/myproject/subscriptions/mysubscription"
}
If you look at this documentation, you see references to messageId using camelCase, e.g.:
{
  "message": {
    "attributes": {
      "key": "value"
    },
    "data": "SGVsbG8gQ2xvdWQgUHViL1N1YiEgSGVyZSBpcyBteSBtZXNzYWdlIQ==",
    "messageId": "136969346945",
    "publishTime": "2014-10-02T15:01:23.045123456Z"
  },
  "subscription": "projects/myproject/subscriptions/mysubscription"
}
If you subscribe to the topics and log the output, you actually get both formats, e.g.:
{
  "message": {
    "attributes": {
      "key": "value"
    },
    "data": "SGVsbG8gQ2xvdWQgUHViL1N1YiEgSGVyZSBpcyBteSBtZXNzYWdlIQ==",
    "messageId": "136969346945",
    "message_id": "136969346945",
    "publishTime": "2014-10-02T15:01:23.045123456Z",
    "publish_time": "2014-10-02T15:01:23.045123456Z"
  },
  "subscription": "projects/myproject/subscriptions/mysubscription"
}
An ideal response would answer both of these questions:
Why are there two formats?
Is one more correct or authoritative?
The officially correct names for the variables are camel case (messageId), based on the Google JSON style guide. In the early phases of Cloud Pub/Sub, snake case was used for message_id and publish_time, but this was changed later in order to conform to style standards. The snake case names were kept in addition to the camel case ones in order to ensure push endpoints depending on the original format did not break. The first documentation link you point to apparently was not updated at the time; it will be fixed shortly.
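In practice, a push endpoint can simply tolerate both spellings when deserializing. A minimal sketch in Node.js, using the field names from the payloads above:

// Parse a Pub/Sub push request body, accepting either casing of the fields
function parsePushPayload(body) {
  const message = body.message || {}
  return {
    messageId: message.messageId || message.message_id,
    publishTime: message.publishTime || message.publish_time,
    attributes: message.attributes || {},
    // "data" is base64-encoded by Pub/Sub
    data: message.data ? Buffer.from(message.data, 'base64').toString('utf8') : null,
    subscription: body.subscription
  }
}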