Using DyanamoDb streaming I am triggering a Lambda function. In function I am retrieving data from dynamodb based on primary key and if key matches the row needs to be updated. If not matched then a new entry will be created in dynamodb.
There are possibilities where Lambda function can scale as per shards created in streaming.
If received requests with similar primary key in different shards, multiple instances of Lambda function will try to get and update the same row at same time. Which will eventually produce wrong data in database (Chances of data overwrite).
I am thinking about solution to use UUID column and insert condition based data in dynamodb so it will fail if updated by another instance. But then need to execute all steps again for failed data.
Another solution where "reservedConcurrentExecutions" property of lambda function to 1 and then lambda function does not scale. Not sure if it throws exception when more than 1 shards get created in dynamodb streaming
I would like to know how I can implement this scenario.
I am using DynamoDB as my database and have DynamoDB streams setup to do some extra things when a row is saved onto DynamoDB.
But my problem is that when i write to a database. The dynamo stream is getting triggered but when i try to fetch the data from the lambda that is triggered by the stream the event is not present in the dynamo table.
Really not sure why it is happening. Is it something that happens when the table gets big and we try to add a lot of data in dynamo at the same time?
I have a combination of partition key and sort key. partition key is the userId and sort key is the timestamp. Could it be something due to that?
Could it be that the entries are lot of a lot of different services try to write to dynamo at the same time?
I have created a custom error log table in Redshift.
In which rows are inserted when any error occurs in my stored procedures.
Is there a way to get a notification like SNS whenever a new row gets inserted in that error table?
There is no capability in Amazon Redshift to trigger events on row insertion.
Is there a way to run a Lambda on every DynamoDb table record?
I have a Dynamo table with name, last name, email and a Lambda that takes name, last name, email as parameters. I am trying to configure the environment such that, every day, the Lambda runs automatically for every value it finds within Dynamo; can't do all the records in one Lambda as it won't scale (will timeout once more users are added).
I currently have a CloudWatch rule set up that triggers the lambda on schedule but I had to manually add the parameters to the trigger from Dynamo - It's not automatic and not dynamic/not connected to dynamo.
--
Another option would be to run a lambda every time a DynamoDb record is updated... I could update all the records weekly and then upon updating them the Lambda would be triggered but I don't know if that's possible either.
Some more insight on either one of these approaches would be appreciated!
Is there a way to run a Lambda on every DynamoDb table record?
For your specific case where all you want to do is process each row of a DynamoDB table in a scalable fashion, I'd try going with a Lambda -> SQS -> Lambdas fanout like this:
Set up a CloudWatch Events Rule that triggers on a schedule. Have this trigger a dispatch Lambda function.
The dispatch Lambda function's job is to read all of the entries in your DynamoDB table and write messages to a jobs SQS queue, one per DynamoDB item.
Create a worker Lambda function that does whatever you want it to do with any given item from your DynamoDB table.
Connect the worker Lambda to the jobs SQS queue so that an instance of it will dispatch whenever something is put on the queue.
Since the limiting factor is lambda timeouts, run multiple lambdas using step functions. Perform a paginated scan of the table; each lambda will return the LastEvaluatedKey and pass it to the next invocation for the next page.
I think your best option is, just as you pointed out, to run a Lambda every time a DynamoDB record is updated. This is possible thanks to DynamoDB streams.
Streams are a ordered record of changes that happen to a table. These can invoke a Lambda, so it's automatic (however beware that the change appears only once in the stream, set up a DLQ in case your Lambda fails). This approach scales well and is also pretty evolvable. If need be, you can either push the events from the stream to an SQS or Kinesis, fan out, etc., depending on the requirements.
I'm trying to set up ElasticSearch import process from DynamoDB table. I have already created AWS Lambda and enabled DynamoDB stream with trigger that invokes my lambda for every added/updated record. Now I want to perform initial seed operation (import all records that are currently in my DynamoDB table to ElasticSearch). How do I do that? Is there any way to make all records in a table be "reprocessed" and added to stream (so they can be processed by my lambda)? Or is it better to write a separate function that will manually read all data from the table and send it to ElasticSearch - so basically have 2 lambdas: one for initial data migration (executed only once and triggered manually by me), and another one for syncing new records (triggered by DynamoDB stream events)?
Thanks for all the help :)
Depending on how Large your dataset is you won't be able to seed your database in Lambda as there is a max timeout of 300 seconds (EDIT: This is now 15 minutes, thanks #matchish).
You could spin up an EC2 instance and use the SDK to perform a DynamoDB scan operation and batch write to your Elasticsearch instance.
You could also use Amazon EMR to perform a Map Reduce Job to export to S3 and from there process all your data.
I would write a script that will touch each records in dynamodb. For each items in your dynamodb, add a new property called migratedAt or whatever you wish. Adding this property will trigger dynamodb stream which in turn will trigger your lambda handler. Based on your question, your lambda handler already handles the update so there is no change there.