I created a Step Function with an Activity in it; when this SF starts, it remains Running while it waits for manual approval. Correct.
At this point, I can invoke a Lambda function (a worker) that calls the getActivityTask method. The response from this call is something like this:
{
  "taskToken": "xyz",
  "input": "<THE INPUT PROVIDED>"
}
The taskToken value is useful because I need it to call sendTaskSuccess or sendTaskFailure to resume or fail the SF execution.
Everything seems clear.
The thing I don't understand is why getActivityTask provides a result only the first time it is invoked. I mean: if I have an SF execution blocked waiting for manual approval and I call getActivityTask twice, the first call answers with the payload above, while the second one returns null.
I cannot find any info about this behaviour in the docs.
Related
I am looking for a way to "wake up" the cloud function when a related process is done.
To explain in depth, these are the functions I have:
1. A cloud function that gets called every X time; its purpose is to call another function (function #2).
2. An external provider's function that fetches information (I can't edit its code, I only have the request body). The information is not received in real time; once the process ends, it sends a callback. It should be noted that the process can take long minutes and even hours.
I want to create a process where, every X time, function 1 calls function 2, and as soon as function 2 finishes it returns the information to function 1, which stores it in the DB.
Example code for func1:
import requests

def entry_point():  # func1
    response = requests.get('https://outsourceapi.com/get_info')  # func2
    save_response_in_DB(response.json())  # This will happen after getting the response
Because I cannot keep function 1 awake for so long, is there a way to "wake it up" again?
Or, alternatively, is there another solution?
I have an SFN with an Activity. I want to get the taskToken in a Lambda, which executes SendTaskSuccess.
Multiple SFN executions are started, and I need the taskToken for a specific execution in order to continue it. The getActivityTask method returns a GetActivityTaskResult starting from the oldest SFN execution to the newest. That doesn't suit me; it would help if it returned a list of running executions, but it doesn't. How can I get the taskToken I need?
Thanks for your question and for your interest in Step Functions.
The task token for a Task state can be accessed from the global Context Object, and passed to the invoked Lambda function as part of its payload.
Here's what the state definition would look like for the waitForTaskToken task state example:
{
  "StartAt": "GetManualReview",
  "States": {
    "GetManualReview": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
      "Parameters": {
        "FunctionName": "get-model-review-decision",
        "Payload": {
          "model.$": "$.new_model",
          "token.$": "$$.Task.Token"
        },
        "Qualifier": "prod-v1"
      },
      "End": true
    }
  }
}
See: Invoke Lambda with Step Functions - AWS Step Functions
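The invoked Lambda receives that token in its payload, and whatever later completes the manual review can use it to resume or fail the paused execution. Here is a minimal sketch of that last step, assuming a Python worker with boto3; the function name and output fields are hypothetical:

import json

import boto3

sfn = boto3.client('stepfunctions')

def complete_review(task_token, approved):
    # Resume or fail the execution that is paused on this task token.
    if approved:
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({'decision': 'approved'}),
        )
    else:
        sfn.send_task_failure(
            taskToken=task_token,
            error='Rejected',
            cause='Manual review rejected the request',
        )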
This is more of a concern than a question, but still, has anyone experienced this before? Does anyone know how to prevent it?
I have a Lambda function (L1) which calls a second Lambda function (L2), all written in Node.js (runtime: Node.js 8.10, and aws-sdk should be v2.488.0, but I'm just pasting that from the documentation). The short story is that L1 is supposed to call L2 once, but when it does, L2 is executed twice! I discovered this by writing logs to CloudWatch, where I could see one L1 log and two L2 logs.
Here's a simplified version of L1 and L2.
L1:
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();
module.exports = {
  handler: async (event, context, callback) => {
    const payload = { rnd: Math.random() };
    const lambdaParams = {
      FunctionName: 'L2',
      Qualifier: `dev`,
      Payload: JSON.stringify(payload),
    };
    console.log(`L1 calling: ${JSON.stringify(payload)}`);
    return await lambda.invoke(lambdaParams).promise();
  },
};
L2:
module.exports = {
  handler: async (event, context, callback) => {
    console.log(`L2 called: ${JSON.stringify(event)}`);
  },
};
In CloudWatch I can see one L1 calling {"rnd": 0.012072353149807702} and two L2 called: {"rnd": 0.012072353149807702}!
BTW, this does not happen all the time. This is part of a Step Function process which was going to call L1 10k times. My code is written in a way that if L2 is executed twice (for one call), it will fail the whole process (because L2 inserts a record into the DB only if it does not exist and fails if it does). So far, I have managed to log this behaviour three times, all of them while processing the same 10k items, with the issue appearing at a different iteration each time.
Does anyone have the same experience? Or even better, knows how to make sure one call leads to exactly one execution?
Your lambda function must be idempotent, because it can be called twice in different situations.
https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-idempotent/
https://cloudonaut.io/your-lambda-function-might-execute-twice-deal-with-it/
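One common way to make the handler idempotent is to record each processed item with a conditional write and skip duplicates. A minimal sketch, assuming a DynamoDB table named processed-events keyed on eventId and a unique id field in the payload (all hypothetical names):

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client('dynamodb')

def handler(event, context):
    try:
        # Conditional write fails if this event was already processed.
        dynamodb.put_item(
            TableName='processed-events',  # hypothetical table
            Item={'eventId': {'S': event['id']}},  # hypothetical unique id in the payload
            ConditionExpression='attribute_not_exists(eventId)',
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return 'already processed'
        raise
    # ... do the real work exactly once ...
    return 'processed'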
With 10K Lambda invocations, it is most likely experiencing a failure somewhere and doing a retry.
From the documentation:
Asynchronous Invocation – Lambda retries function errors twice. If the
function doesn't have enough capacity to handle all incoming requests,
events may wait in the queue for hours or days to be sent to the
function. You can configure a dead-letter queue on the function to
capture events that were not successfully processed. For more
information, see Asynchronous Invocation.
If this is what is happening and you set up the dead-letter queue, you'll be able to isolate the failure event.
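For illustration, attaching a dead-letter queue for failed asynchronous invocations might look like this (a sketch using boto3; the function name and queue ARN are hypothetical, and the same setting is available in the console):

import boto3

lambda_client = boto3.client('lambda')

# Send events that fail all async retries to an SQS dead-letter queue.
lambda_client.update_function_configuration(
    FunctionName='L2',
    DeadLetterConfig={'TargetArn': 'arn:aws:sqs:us-east-1:123456789012:l2-dlq'},  # hypothetical ARN
)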
You can also use CloudWatch Logs Insights to quickly search for error messages from the Lambda. Once you select the log group, this query should help you get started; just change the time window.
fields #timestamp, #message
| filter #message like /(?i)(Exception|error|fail|5\d\d)/
| sort #timestamp desc
| limit 20
One case that may cause this is if your L2 Lambda doesn't return anything, which can lead the L1 Lambda (the caller) to think there was an error with L2, so the retry mechanism is triggered. Try returning something from L2, even a simple "OK".
In my case, when calling my second Lambda there was a try block that caught an exception and printed the traceback, and this caused the Lambda call to be retried; normally, without the traceback, this does not happen, and once I commented that module out it stopped happening.
Also, inside the try there was a condition that was bound to fail, since it had to query for a resource with boto3: if the resource existed there was no problem, but when it didn't, the error forced a general failure instead of being captured as an exception.
I am wondering something, and I really can't find information about it. Maybe it is not the way to go but, I would just like to know.
It is about Lambda working in batches. I know I can set up Lambda to consume messages in batches. In my Lambda function I iterate over each message, and if one fails, Lambda exits and the cycle starts again.
I am wondering about a slightly different approach.
Let's assume I have three messages: A, B and C, taken in one batch. Now, if message B fails (e.g. an API call fails), I return message B to SQS and keep processing message C.
Is this possible? If it is, is it a good approach? I can see that it requires implementing some extra complexity in the Lambda.
Thanks
There's an excellent article here. The relevant parts for you are...
- Using a batchSize of 1, so that messages succeed or fail on their own.
- Making sure your processing is idempotent, so reprocessing a message isn't harmful, outside of the extra processing cost.
- Handling errors within your function code, perhaps by catching them and sending the message to a dead letter queue for further processing.
- Calling the DeleteMessage API manually within your function after successfully processing a message.
The last bullet point is how I've managed to deal with the same problem. Instead of returning errors immediately, store them or note that an error has occurred, but then continue to handle the rest of the messages in the batch. At the end of processing, return or raise an error so that the SQS -> lambda trigger knows not to delete the failed messages. All successful messages will have already been deleted by your lambda handler.
import logging

import boto3

logger = logging.getLogger()
sqs = boto3.client('sqs')

def handler(event, context):
    failed = False
    for msg in event['Records']:
        try:
            # Do something with the message.
            handle_message(msg)
        except Exception:
            # Ok it failed, but allow the loop to finish.
            logger.exception('Failed to handle message')
            failed = True
        else:
            # The message was handled successfully. We can delete it now.
            sqs.delete_message(
                QueueUrl=<queue_url>,
                ReceiptHandle=msg['receiptHandle'],
            )

    # It doesn't matter what the error is. You just want to raise here
    # to ensure the trigger doesn't delete any of the failed messages.
    if failed:
        raise RuntimeError('Failed to process one or more messages')

def handle_message(msg):
    ...
For Node.js, check out https://www.npmjs.com/package/@middy/sqs-partial-batch-failure.

const middy = require('@middy/core')
const sqsBatch = require('@middy/sqs-partial-batch-failure')

const originalHandler = (event, context, cb) => {
  const recordPromises = event.Records.map(async (record, index) => { /* Custom message processing logic */ })
  return Promise.allSettled(recordPromises)
}

const handler = middy(originalHandler)
  .use(sqsBatch())

Check out https://medium.com/@brettandrews/handling-sqs-partial-batch-failures-in-aws-lambda-d9d6940a17aa for more details.
As of Nov 2019, AWS has introduced the concept of Bisect on Function Error, along with maximum retry attempts. If your function is idempotent, this can be used.
In this approach you throw an error from the function even if only one item in the batch is failing. AWS will split the batch into two and retry each half. One half of the batch should now pass successfully; for the other half, the process continues until the bad record is isolated.
Like all architecture decisions, it depends on your goal and what you are willing to trade for more complexity. Using SQS will allow you to process messages out of order so that retries don't block other messages. Whether or not that is worth the complexity depends on why you are worried about messages getting blocked.
I suggest reading about Lambda retry behavior and Dead Letter Queues.
If you want to retry only the failed messages out of a batch of messages it is totally doable, but does add slight complexity.
A possible approach to achieve this is to iterate through the list of your events (e.g. [eventA, eventB, eventC]) and, for each event that fails, append it to a list of failed events. Then have an end case that checks whether the list of failed events has anything in it, and if it does, manually send those messages back to SQS (using SQS sendMessageBatch).
However, you should note that this puts the events to the end of the queue, since you are manually inserting them back.
Anything can be a "good approach" if it solves a problem you are having without much complexity, and in this case, the issue of having to re-execute successful events is definitely a problem that you can solve in this manner.
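A rough sketch of that approach, assuming a Python handler with boto3; the queue URL is hypothetical and process() stands in for your own message handling (note that sendMessageBatch accepts at most 10 entries per call):

import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'  # hypothetical

def handler(event, context):
    failed_entries = []
    for i, record in enumerate(event['Records']):
        try:
            process(record)  # your own message handling
        except Exception:
            # Collect the failed message so it can be re-queued at the end.
            failed_entries.append({'Id': str(i), 'MessageBody': record['body']})

    if failed_entries:
        # Manually re-send only the failed messages; they land at the end of the queue.
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=failed_entries)

Returning normally lets the trigger delete the whole original batch, while the re-sent copies stand in for the failed messages.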
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
To use this feature, your function must gracefully handle errors. Have your function logic catch all exceptions and report the messages that result in failure in batchItemFailures in your function response. If your function throws an exception, the entire batch is considered a complete failure.
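A minimal sketch of such a handler in Python; process() is a stand-in for your own message handling and is not part of the AWS API:

def handler(event, context):
    batch_item_failures = []
    for record in event['Records']:
        try:
            process(record)  # your own message handling
        except Exception:
            # Report only this message as failed; the rest of the batch is deleted.
            batch_item_failures.append({'itemIdentifier': record['messageId']})
    return {'batchItemFailures': batch_item_failures}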
To add to the answer by David:
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
I implemented this, but a batch of A, B and C, with B failing, would still mark all three as complete. It turns out you need to explicitly configure the Lambda event source mapping to expect a batch failure to be returned. This can be done by adding the FunctionResponseTypes key with a value of a list containing ReportBatchItemFailures. Here are the relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
My SAM template event source looks like this after adding it:
Type: SQS
Properties:
  Queue: my-queue-arn
  BatchSize: 10
  Enabled: true
  FunctionResponseTypes:
    - ReportBatchItemFailures
I have a Lambda function that sends an HTTP call to an API (let's say 'A'). After getting the response from 'A', it should immediately return the result to the caller, i.e. callback(null, "success"), within 10 secs. Then it should save the data fetched from API 'A' to my external API (let's say 'B').
I tried it like below, but Lambda waits until the event loop is empty (it is waiting for the response from the second HTTP call).
I don't want to set eventLoopWaitEmpty to false, since that freezes the event loop and it executes the next time the function is invoked.
request.get({ url: endpointUrlA },
  function (errorA, responseA, bodyA) {
    callback(null, "success");
    request.post({
      url: endpointUrlB,
      body: bodyA,
      json: true
    }, function (errorB, responseB, bodyB) {
      // Don't want to wait for this response
    });
    /* Also tried the callback(null, "success"); here too */
  });
Does anybody have any thoughts on how I can implement this? Thanks!
PS: I read the previous similar questions, but they didn't seem to clear this up for me.
This seems like a good candidate for breaking this Lambda up into two Lambdas with some support code.
The first Lambda makes the request to 'A' and places a message onto SQS. It then returns the success status to the caller.
A separate process monitors the SQS queue and invokes a second Lambda when a message becomes available.
This has several benefits.
Firstly, you no longer have a long-running Lambda waiting for a second system, which may be down, to respond.
Secondly, you're doing things asynchronously in the background.
Take a look at this blog post for an overview of how this could work in practice.
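A rough sketch of what the two functions might look like, assuming Python handlers with boto3 and requests; the endpoint URLs and queue URL are hypothetical:

import json

import boto3
import requests

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/a-to-b'  # hypothetical

def handler_a(event, context):
    # Call API 'A', hand the result off to the queue, and return right away.
    body_a = requests.get('https://example.com/api-a').json()
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(body_a))
    return 'success'

def handler_b(event, context):
    # Triggered by the SQS queue; forwards the data to API 'B' in the background.
    for record in event['Records']:
        requests.post('https://example.com/api-b', json=json.loads(record['body']))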