It is unclear to me why the Lambda-based trigger I just recreated on top of my DynamoDB stream has stopped firing. Per the docs, I know that the stream on my single-shard DynamoDB table invokes the Lambda synchronously and will not send a subsequent batch until the previous one finishes.
Because I wanted to recreate the trigger with more records processed per batch (going from 100 to 5000), I took these steps:
Deleted the trigger.
Disabled the previous DynamoDB stream.
Re-enabled the stream (creating a new ARN with the updated timestamp).
Recreated the trigger tied to the same Lambda (with a batch size of 5000).
Either the poller that reads the stream and sends those batches to my Lambda is not polling, or one of the above steps has voided the stream so it has no new results. But I've since updated the table directly as well as inserted new rows, and the trigger still hasn't fired.
What am I missing?
Lambda functions may not execute for a variety of reasons:
Lack of permissions
Trigger not being enabled
DynamoDB Stream being disabled
Hitting Lambda region and account limits
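A quick way to rule out the trigger and stream issues from that list is to inspect the function's event source mappings directly. A minimal sketch with boto3, where the function name process-stream is just a placeholder:

```python
import boto3

lambda_client = boto3.client("lambda")

# List every event source mapping attached to the function.
mappings = lambda_client.list_event_source_mappings(
    FunctionName="process-stream"  # placeholder function name
)["EventSourceMappings"]

for m in mappings:
    # State should be "Enabled"; "Disabled" or "Creating" would explain a
    # trigger that never fires. Also confirm EventSourceArn is the *new*
    # stream ARN, since disabling and re-enabling a stream changes the ARN.
    print(m["EventSourceArn"], m["State"], m["BatchSize"])
```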
Related
I am using a Lambda function to read data from DynamoDB Streams. Lambda reads items from the stream and invokes my function once for each batch. Lambda invokes the function synchronously using an event source mapping.
From what I understand from the AWS docs, Lambda invokes the function once for each batch in the stream. Suppose 1000 items arrive in the stream at once and I configure my function to read 100 items per batch.
So will it invoke 10 instances of the function concurrently to process 10 batches of 100 items each?
I am learning AWS. Is my understanding correct? If yes, what does synchronously invoked mean?
DynamoDB uses shards* to partition the data inside a table. The data stored in each shard is determined by the table's hash key. DynamoDB Streams will trigger AWS Lambda once for each shard that was updated, aggregating that shard's records into a batch. So the number of records in each batch depends on how many records were updated in each shard, and it can of course differ between batches.
Synchronously invoked means that the service that triggered the function waits until the function finishes before continuing its own execution. When you invoke asynchronously, you send a request and forget about it; whether or not the downstream function processes the stream successfully is out of the upstream service's control. DynamoDB invokes the Lambda function synchronously and waits while it works. If it ends successfully, the stream position is marked as processed; if it ends with a failure, it is retried a few more times. This is important to allow at-least-once processing of DynamoDB streams.
*Shards are different partitions of the database. They allow DynamoDB to process queries and updates in parallel. Normally they reside on different storage nodes/Availability Zones.
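To make the synchronous/asynchronous distinction concrete, here is a minimal sketch using boto3; the function name my-function is only a placeholder:

```python
import json
import boto3

lambda_client = boto3.client("lambda")
payload = json.dumps({"hello": "world"}).encode()

# Synchronous: the caller blocks until the function returns and receives
# the function's result (or error) in the response.
sync = lambda_client.invoke(
    FunctionName="my-function",        # placeholder name
    InvocationType="RequestResponse",
    Payload=payload,
)
print(sync["StatusCode"], sync["Payload"].read())

# Asynchronous: the caller only gets an acknowledgement that the event was
# queued (HTTP 202); success or failure happens out of band.
async_resp = lambda_client.invoke(
    FunctionName="my-function",
    InvocationType="Event",
    Payload=payload,
)
print(async_resp["StatusCode"])
```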
Is there a way to run a Lambda on every DynamoDB table record?
I have a Dynamo table with name, last name and email, and a Lambda that takes name, last name and email as parameters. I am trying to configure the environment such that, every day, the Lambda runs automatically for every record it finds in the table; I can't do all the records in one Lambda invocation as it won't scale (it will time out once more users are added).
I currently have a CloudWatch rule set up that triggers the Lambda on a schedule, but I had to manually copy the parameters from Dynamo into the trigger; it's not automatic and not dynamic, i.e. not connected to Dynamo.
--
Another option would be to run a Lambda every time a DynamoDB record is updated... I could update all the records weekly, and the updates would then trigger the Lambda, but I don't know if that's possible either.
Some more insight on either one of these approaches would be appreciated!
For your specific case where all you want to do is process each row of a DynamoDB table in a scalable fashion, I'd try going with a Lambda -> SQS -> Lambdas fanout like this:
Set up a CloudWatch Events Rule that triggers on a schedule. Have this trigger a dispatch Lambda function.
The dispatch Lambda function's job is to read all of the entries in your DynamoDB table and write messages to a jobs SQS queue, one per DynamoDB item (a sketch of this dispatch function follows the list).
Create a worker Lambda function that does whatever you want it to do with any given item from your DynamoDB table.
Connect the worker Lambda to the jobs SQS queue so that an instance of it is invoked whenever something is put on the queue.
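A minimal sketch of the dispatch function, assuming a table named users and a queue named jobs (both names, and the use of a full table scan, are illustrative):

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

TABLE_NAME = "users"  # illustrative
QUEUE_NAME = "jobs"   # illustrative


def handler(event, context):
    table = dynamodb.Table(TABLE_NAME)
    queue_url = sqs.get_queue_url(QueueName=QUEUE_NAME)["QueueUrl"]

    # Paginate through the whole table and enqueue one message per item.
    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            # default=str handles DynamoDB's Decimal number type.
            sqs.send_message(QueueUrl=queue_url,
                             MessageBody=json.dumps(item, default=str))
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```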
Since the limiting factor is Lambda timeouts, run multiple Lambdas using Step Functions. Perform a paginated scan of the table; each Lambda returns the LastEvaluatedKey, which is passed to the next invocation as the start of the next page.
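A rough sketch of one iteration of such a state machine task, assuming the table is named users and that the state machine passes exclusive_start_key between invocations (both assumptions are illustrative):

```python
import boto3

ddb = boto3.client("dynamodb")


def process(item):
    # Placeholder for the real per-item work; item is in DynamoDB JSON form.
    print(item)


def handler(event, context):
    scan_kwargs = {"TableName": "users", "Limit": 500}  # illustrative name/size
    # The previous iteration passes its LastEvaluatedKey back in so this
    # invocation resumes where the last one stopped.
    if event.get("exclusive_start_key"):
        scan_kwargs["ExclusiveStartKey"] = event["exclusive_start_key"]

    page = ddb.scan(**scan_kwargs)
    for item in page.get("Items", []):
        process(item)

    # A Choice state in the Step Functions definition can loop while this
    # key is present and stop once it is null.
    return {"exclusive_start_key": page.get("LastEvaluatedKey")}
```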
I think your best option is, just as you pointed out, to run a Lambda every time a DynamoDB record is updated. This is possible thanks to DynamoDB streams.
Streams are an ordered record of the changes that happen to a table. They can invoke a Lambda, so it's automatic (however, beware that each change appears only once in the stream, so set up a DLQ in case your Lambda fails). This approach scales well and is also pretty evolvable. If need be, you can push the events from the stream to SQS or Kinesis, fan out, etc., depending on the requirements.
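A sketch of wiring this up with boto3, assuming a table named users and a function named process-user (both names illustrative):

```python
import boto3

dynamodb = boto3.client("dynamodb")
lambda_client = boto3.client("lambda")

# Enable the stream on the table. NEW_AND_OLD_IMAGES delivers both the
# before and after state of every changed item.
resp = dynamodb.update_table(
    TableName="users",  # illustrative
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
stream_arn = resp["TableDescription"]["LatestStreamArn"]

# Point the Lambda at the stream.
lambda_client.create_event_source_mapping(
    EventSourceArn=stream_arn,
    FunctionName="process-user",  # illustrative
    StartingPosition="LATEST",
    BatchSize=100,
)
```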
I have a Lambda function that’s triggered by a PUT to an S3 bucket.
I want to limit this Lambda function so that it’s only running one instance at a time – I don’t want two instances running concurrently.
I've had a look through the Lambda configuration and docs, but I can't see anything obvious. I could write my own locking system, but it would be nice if this were already a solved problem.
How can I limit the number of concurrent invocations of a Lambda?
AWS Lambda now supports concurrency limits on individual functions:
https://aws.amazon.com/about-aws/whats-new/2017/11/set-concurrency-limits-on-individual-aws-lambda-functions/
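With that feature, limiting the function to a single concurrent instance is a one-call change. A minimal sketch with boto3, where the function name is a placeholder:

```python
import boto3

lambda_client = boto3.client("lambda")

# Reserve exactly one concurrent execution for this function; additional
# invocations are throttled and retried by the event source where applicable.
lambda_client.put_function_concurrency(
    FunctionName="process-s3-put",     # placeholder name
    ReservedConcurrentExecutions=1,
)
```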
I would suggest using Kinesis Streams (or alternatively DynamoDB + DynamoDB Streams, which essentially have the same behavior).
You can see a Kinesis Stream as a queue. The good part is that you can use a Kinesis Stream as a trigger for your Lambda function, so anything that gets inserted into this queue will automatically be passed over to your function, in order. You will be able to process those S3 events one by one, one Lambda execution after the other (one instance at a time).
In order to do that, you'll need to create a Lambda function with the simple purpose of getting S3 Events and putting them into a Kinesis Stream. Then you'll configure that Kinesis Stream as your Lambda Trigger.
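A minimal sketch of that forwarding function; the stream name s3-events and the choice of object key as partition key are illustrative:

```python
import json
import boto3

kinesis = boto3.client("kinesis")


def handler(event, context):
    # S3 put notifications arrive as a list of records; forward each one to
    # the Kinesis stream so it can be processed one at a time later.
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        kinesis.put_record(
            StreamName="s3-events",            # illustrative stream name
            Data=json.dumps(record).encode(),
            PartitionKey=key,  # with a single-shard stream, order is preserved
        )
```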
When you configure the Kinesis Stream as your Lambda trigger, I suggest using the following configuration (a code sketch follows the two settings):
Batch size: 1
This means that your Lambda will be called with only one event from Kinesis. You can select a higher number and you'll get a list of events of that size (for example, if you want to process the last 10 events in one Lambda execution instead of 10 consecutive Lambda executions).
Starting position: Trim horizon
This means it'll behave as a queue (FIFO)
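Those two settings map directly onto the event source mapping. A sketch with boto3, where the stream ARN and function name are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/s3-events",  # placeholder
    FunctionName="process-s3-put",    # placeholder
    BatchSize=1,                      # one event per invocation
    StartingPosition="TRIM_HORIZON",  # start from the oldest unprocessed record
)
```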
A bit more info in the AWS May Webinar Series - Streaming Data Processing with Amazon Kinesis and AWS Lambda.
I hope this helps anyone with a similar problem.
P.S. Bear in mind that Kinesis Streams have their own pricing. Using DynamoDB + DynamoDB Streams might be cheaper (or even free due to the non-expiring Free Tier of DynamoDB).
No, this is one of the things I'd really like to see Lambda support, but currently it does not. One of the problems is that if there were a lot of S3 PUT operations happening, AWS would have to queue up all the Lambda invocations somehow, and there is currently no support for that.
If you built a locking mechanism into your Lambda function, what would you do with the requests you don't process due to a lock? Would you just throw those S3 notifications away?
The solution most people recommend is to have S3 send the notifications to an SQS queue, and then have your Lambda function scheduled to run periodically, like once a minute, and check if there is an item in the queue that needs to be processed.
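A rough sketch of that scheduled poller, assuming a queue named s3-notifications (illustrative) and draining a bounded number of messages per run:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_NAME = "s3-notifications"  # illustrative


def process(body):
    # Placeholder: handle one S3 notification.
    print(body)


def handler(event, context):
    queue_url = sqs.get_queue_url(QueueName=QUEUE_NAME)["QueueUrl"]

    # Drain a bounded number of messages each scheduled run so a single
    # invocation never risks running past the Lambda timeout.
    for _ in range(20):
        resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            process(msg["Body"])
            # Delete only after successful processing.
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```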
Alternatively, have S3 send the notifications to SQS and just have a t2.nano EC2 instance with a single-threaded service polling the queue.
I know this is an old thread, but I ran across it trying to figure out how to make sure my time-sequenced SQS messages were processed in order coming out of a FIFO queue and not processed simultaneously/out of order by multiple concurrent Lambda invocations.
Per the documentation:
For FIFO queues, Lambda sends messages to your function in the order that it receives them. When you send a message to a FIFO queue, you specify a message group ID. Amazon SQS ensures that messages in the same group are delivered to Lambda in order. Lambda sorts the messages into groups and sends only one batch at a time for a group. If your function returns an error, the function attempts all retries on the affected messages before Lambda receives additional messages from the same group.
Your function can scale in concurrency to the number of active message groups.
Link: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
So essentially, as long as you use a FIFO queue and submit the messages that need to stay in sequence with the same MessageGroupId, SQS/Lambda automatically handles the sequencing without any additional settings.
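For illustration, sending ordered messages to a FIFO queue looks roughly like this; the queue name, group ID and deduplication IDs are placeholders:

```python
import json
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="events.fifo")["QueueUrl"]  # placeholder queue

for i, event in enumerate([{"step": 1}, {"step": 2}, {"step": 3}]):
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(event),
        MessageGroupId="device-42",               # same group ID => strict ordering
        MessageDeduplicationId=f"device-42-{i}",  # or enable content-based dedup
    )
```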
Have the S3 "Put" events cause a message to be placed on the queue (instead of invoking a Lambda function directly). The message should contain a reference to the S3 object. Then SCHEDULE a Lambda to SHORT POLL the entire queue.
PS: S3 events cannot trigger a Kinesis Stream directly... only SQS, SNS and Lambda (see http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-destinations). Kinesis Streams are expensive and are intended for real-time event handling.
For example, I have Lambda functions that consume messages from a Kinesis stream. How do I stop and resume the function so that I don't incur charges and don't lose data in the stream?
I know that if the events keep failing, Kinesis will keep retrying and the cost can be very high.
I cannot delete the function because there is lots of automation around it through CloudFormation. Is there a way to stop and restart the function?
SOLUTION: http://alestic.com/2015/11/aws-lambda-kinesis-pause-resume
NOTE: Event sources such as CloudWatch Events rules and log streaming cannot be disabled through the event source mapping. You will not even get them in the list when calling the API using the SDK. For those you have to disable the Event Rule or the Log Subscription instead.
The updated Lambda console on AWS supports this in the UI now. Click on the Kinesis stream feeding your Lambda function, toggle the "Enabled/Disabled" toggle at the bottom, and Save. This will essentially pause/resume your function. (Screenshot: Toggling Kinesis input into Lambda.)
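The same pause/resume can also be done outside the console. A sketch with boto3, assuming the function is named consume-kinesis (illustrative):

```python
import boto3

lambda_client = boto3.client("lambda")

# Find the event source mapping that feeds the function.
mapping = lambda_client.list_event_source_mappings(
    FunctionName="consume-kinesis"  # illustrative name
)["EventSourceMappings"][0]

# Pause: stop polling the stream. Records stay in Kinesis (subject to the
# stream's retention period), so nothing is lost while the mapping is off.
lambda_client.update_event_source_mapping(UUID=mapping["UUID"], Enabled=False)

# Resume later: polling picks up from the last checkpoint.
lambda_client.update_event_source_mapping(UUID=mapping["UUID"], Enabled=True)
```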
Let's talk about Kinesis for a moment. When you pull records off the stream, Kinesis will not 'delete' those records until you 'checkpoint' the stream. You can read the same records over and over until you confirm with Kinesis that you don't need them anymore.
AWS Lambda does not checkpoint the stream until the function completes its execution without an error. (context.success())
If you deploy a Lambda function and it is broken in some way (exits with an exception/error), the Lambda function will not checkpoint the stream, and your records will stay in the stream until the retention period expires (24 hours, by default). The 'un-checkpointed' records can then be read in a subsequent Lambda execution.
During deployment, the same thing applies. Any currently executing Lambdas that are interrupted will not checkpoint the stream, and any currently executing Lambdas that complete successfully will checkpoint as you expect.
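A tiny sketch of what that means inside the handler; the per-record function is a placeholder:

```python
def handle_record(record):
    # Placeholder for the real per-record logic. If this raises, the whole
    # invocation fails, no checkpoint is written, and Lambda retries the same
    # batch until it succeeds or the records expire from the stream.
    print(record["kinesis"]["sequenceNumber"])


def handler(event, context):
    for record in event["Records"]:
        handle_record(record)
    # Returning without an error lets Lambda checkpoint the shard, so this
    # batch will not be delivered again.
    return "ok"
```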
My Lambda function contains some code that includes queries to DynamoDB. Once a query is executed, the Lambda continues with the rest of the code, which depends on the result of that query. What happens if I exceed the capacity limit of the DynamoDB table? I could push the query to SQS and process it later, but then I will not be able to continue the execution of the Lambda. Another solution would be to retry each query that fails, but if DynamoDB is extremely busy, my Lambda might exceed the 5-minute limit. Seems like a lose-lose situation. What would you do?
The most fault-tolerant solution would be to decouple the querying and the processing of the results.
Instead of processing results immediately, write the results to another SQS queue and send an SNS notification.
Move the processing to a second Lambda function. This new function can be triggered by the SNS notification. It can read the results queue and process any pending messages.
Modify the original function to queue any failed queries for later.
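A rough sketch of the first half of that flow (the querying function), assuming a results queue named query-results, a retry queue named failed-queries, an SNS topic ARN and a table named orders; all of these names are placeholders:

```python
import json
import boto3
from boto3.dynamodb.conditions import Key
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")
sns = boto3.client("sns")

RESULTS_QUEUE = "query-results"  # placeholder
RETRY_QUEUE = "failed-queries"   # placeholder
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:results-ready"  # placeholder


def handler(event, context):
    table = dynamodb.Table("orders")  # placeholder table
    results_url = sqs.get_queue_url(QueueName=RESULTS_QUEUE)["QueueUrl"]
    retry_url = sqs.get_queue_url(QueueName=RETRY_QUEUE)["QueueUrl"]

    try:
        result = table.query(KeyConditionExpression=Key("pk").eq(event["pk"]))
    except ClientError:
        # Throttled (or otherwise failed): park the query for a later retry
        # instead of blocking this invocation until it times out.
        sqs.send_message(QueueUrl=retry_url, MessageBody=json.dumps(event))
        return

    # Hand the results to the second function via the queue, then notify it.
    sqs.send_message(
        QueueUrl=results_url,
        MessageBody=json.dumps(result["Items"], default=str),
    )
    sns.publish(TopicArn=TOPIC_ARN, Message="results ready")
```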