Dynamodb table stream trigger lambda for existing records - amazon-web-services

I'm a beginner with DynamoDB and DynamoDB table streams. I have already created an AWS Lambda function and enabled a DynamoDB stream with a trigger that invokes my Lambda for every added/updated/deleted record. Now I want to perform an initial sync operation for all my existing records. How can I do this?
Is there any way to have all existing records in the table "reprocessed" and added to the stream (so they can be processed by my Lambda)?
Do I have to write a custom script?

To my knowledge there is no way to do this without writing some custom script.
You could, for instance, write a script that reads every current item out of the table and then writes it back, overwriting itself and putting a new entry in the stream that would then be handled by your existing Lambda.
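As a rough illustration, here is a minimal sketch of such a "touch" script, assuming boto3, a placeholder table name, and that re-putting an item is enough to emit a stream record for your existing Lambda:

```python
# Hypothetical "touch" script: scan every item and write it back unchanged so
# that a stream record is emitted and the existing Lambda reprocesses it.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # placeholder table name

def touch_all_items():
    scan_kwargs = {}
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            # Writing the identical item back counts as a write, so the stream
            # receives a record that the existing Lambda can process.
            table.put_item(Item=item)
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

if __name__ == "__main__":
    touch_all_items()
```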
Another option is to not try to use the stream at all for the existing items in the table. Leave the stream and Lambda as-is for all future writes to the table, and write a script that goes through all the existing items and processes them accordingly.

I think that by creating another Lambda function and setting the startingPosition of its event source mapping to TRIM_HORIZON, you will be able to get all the records still in the stream again (note that a DynamoDB stream retains records for 24 hours).
https://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html#services-dynamodb-eventsourcemapping
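For reference, a sketch of creating such an event source mapping with boto3; the stream ARN and function name below are placeholders:

```python
# Hypothetical example: map an existing DynamoDB stream to a Lambda function,
# starting from the oldest record still retained in the stream.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/2020-01-01T00:00:00.000",  # placeholder
    FunctionName="my-backfill-lambda",  # placeholder
    StartingPosition="TRIM_HORIZON",  # read from the oldest available record
    BatchSize=100,
)
```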

Related

Options to export selective data from one dynamodb table to another table in same region

I need to move data from one DynamoDB table to another table after doing a transformation.
What is the best approach to do that?
Do I need to write a script to read selective data from one table and put it in another table,
or do I need to follow a CSV export?
You need to write a script to do so. However, you may wish to first export the data to S3 using DynamoDB's native export feature, as it does not consume capacity on the table, ensuring you do not impact production traffic, for example.
If your table is not serving production traffic, or the size of the table is not too large, then you can simply use Lambda functions to read your items, transform them, and then write them to the new table.
If your table is large, you can use AWS Glue to achieve the same result in a distributed fashion.
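As a rough sketch of the script/Lambda route, assuming boto3, placeholder table names, and a hypothetical transform() function:

```python
# Scan the source table, apply a transformation, and batch-write the results
# into the destination table. All names and the transformation are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
source = dynamodb.Table("source-table")            # placeholder
destination = dynamodb.Table("destination-table")  # placeholder

def transform(item):
    # Placeholder transformation: adjust attributes as required.
    item["migrated"] = True
    return item

def migrate():
    scan_kwargs = {}
    while True:
        page = source.scan(**scan_kwargs)
        with destination.batch_writer() as batch:
            for item in page["Items"]:
                batch.put_item(Item=transform(item))
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

if __name__ == "__main__":
    migrate()
```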
Is this a live table that is used on prod?
If it is, what I usually do is:
Enable DynamoDB streams (if not already enabled)
Create a Lambda function that has access to both tables
Place the transformation logic in the Lambda (see the sketch below)
Subscribe the Lambda to the DynamoDB stream
Update all items in the original table (for example, set a new field called 'migrate')
Now all items will flow through the Lambda, which can transform them and store them in the new table
You can now switch to the new table
Check if everything still works
Delete the Lambda and the old table, and disable DynamoDB streams (if needed)
This approach is the only one I found that can guarantee 100% uptime during the migration.
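A minimal sketch of that stream-subscribed Lambda, assuming boto3, a placeholder destination table name, and a hypothetical transform() function:

```python
# Stream-triggered migration Lambda: deserialize each stream record, apply the
# transformation, and write the result into the new table. Names are placeholders.
# Requires the stream view type to include new images (NEW_IMAGE or NEW_AND_OLD_IMAGES).
import boto3
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
destination = boto3.resource("dynamodb").Table("new-table")  # placeholder

def transform(item):
    # Placeholder transformation logic.
    item["migrated"] = True
    return item

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue  # this sketch only migrates inserts and updates
        new_image = record["dynamodb"]["NewImage"]
        # Stream images arrive in the low-level attribute-value format.
        item = {k: deserializer.deserialize(v) for k, v in new_image.items()}
        destination.put_item(Item=transform(item))
```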
If the table is not live, then you can just export it to S3 and then import it into the new table:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html

DynamoDB and computed columns: Run Lambda on GetItem / Query request but before data is returned to caller

Is it possible to run a Lambda function as part of a GetItem / Query request? I plan to use some kind of computed column that I would like to update before the value is returned to the caller. The current idea is to do this with a Lambda function and DynamoDB Streams. Up to now, I have missed the part in the docs where I can specify the exact moment when the Lambda is executed (before or after fetching the data). Of course, I am open to better ideas!
No, it is not possible. DynamoDB is designed to return items in a distributed system within milliseconds. There is no way to execute Lambdas synchronously with Put or Get requests. DynamoDB Streams are more like asynchronous table triggers and are only executed on new data.
One idea is to have the caller invoke a Lambda that collects and computes your data, instead of requesting DynamoDB directly.
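A minimal sketch of that idea, assuming boto3 and placeholder table, key, and attribute names; the caller invokes this Lambda instead of calling DynamoDB directly:

```python
# Hypothetical "computed read" Lambda: fetch the item, compute the derived
# attribute, and return the result. Table name, key, and attributes are placeholders.
import boto3
from decimal import Decimal

table = boto3.resource("dynamodb").Table("my-table")  # placeholder table name

def handler(event, context):
    # The caller passes the item key, e.g. {"id": "123"}.
    response = table.get_item(Key={"id": event["id"]})
    item = response.get("Item")
    if item is None:
        return {"found": False}
    # Compute the derived value on the fly before returning (placeholder logic).
    price = item.get("price", Decimal(0))
    quantity = item.get("quantity", Decimal(0))
    item["computed_total"] = price * quantity
    # DynamoDB numbers come back as Decimal, so convert for JSON serialization.
    plain = {k: (float(v) if isinstance(v, Decimal) else v) for k, v in item.items()}
    return {"found": True, "item": plain}
```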

Copying DynamoDB table to S3 every time table is updated

I have a DynamoDB table that's fairly large and is updated infrequently and sporadically (maybe once a month). When it is updated it happens in a large batch, i.e. a large number of items are inserted with batch update.
Every time the DynamoDB table changes, I want to convert the contents to a custom JSON format and store the result in S3 to be more easily accessed by clients (since the most common operation is downloading the entire table, and DynamoDB scans are not as efficient).
Is there a way to convert the table to JSON and copy to S3 after every batch update?
I saw Data Pipelines (https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-exportddbtos3.html) but it seems like this is triggered manually or on a schedule. I also looked at DynamoDB Streams, but it seems like an event is inserted for every record changed. I want to do the conversion and copying to S3 only once per batch update.

Can Dynamodb check the items regularly?

Can DynamoDB check the items regularly, instead of using a scheduled CloudWatch event to trigger a Lambda to scan the table?
In other words, does DynamoDB have any built-in functions so it can check the table itself, for example whether the value in the "count" column is bigger than 5, and then trigger a Lambda?
The short answer is no!
DynamoDB is a database. It stores data. At this date it does not have embedded functions like the stored procedures or triggers that are common in relational databases. You can, however, use DynamoDB streams to implement a kind of trigger.
DynamoDB streams can be used to start a Lambda function with the old data, the new data, or both the old and new data of the item updated/created in a table. You can then use that Lambda to check your count column and, if it is greater than 5, call another Lambda or run whatever procedure you need.
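A minimal sketch of that stream-triggered check, assuming a numeric "count" attribute and a hypothetical downstream Lambda name:

```python
# Stream-triggered Lambda that inspects the new image of each record and, when
# "count" exceeds 5, asynchronously invokes a follow-up Lambda (placeholder name).
import json
import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue
        new_image = record["dynamodb"].get("NewImage", {})
        # Stream images use the low-level format, so numbers arrive as strings.
        count = int(new_image.get("count", {}).get("N", "0"))
        if count > 5:
            lambda_client.invoke(
                FunctionName="handle-count-exceeded",  # placeholder
                InvocationType="Event",  # asynchronous invocation
                Payload=json.dumps({"keys": record["dynamodb"]["Keys"]}),
            )
```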

ETL Possible Between S3 and Redshift with Kinesis Firehose?

My team is attempting to use Redshift to consolidate information from several different databases. In our first attempt to implement this solution, we used Kinesis Firehose to write records of POSTs to our APIs to S3, and then issued a COPY command to write the data being inserted into the correct tables in Redshift. However, this only allowed us to insert new data and did not let us transform data, update rows when altered, or delete rows.
What is the best way to maintain an updated data warehouse in Redshift without using batch transformation? Ideally, we would like updates to occur "automatically" (< 5min) whenever data is altered in our local databases.
Neither Firehose nor Redshift has triggers; however, you could potentially use the approach of combining Lambda and Firehose to pre-process the data before it gets inserted, as described here: https://blogs.aws.amazon.com/bigdata/post/Tx2MUQB5PRWU36K/Persist-Streaming-Data-to-Amazon-S3-using-Amazon-Kinesis-Firehose-and-AWS-Lambda
In your case, you could extend it to trigger a Lambda on S3 as Firehose creates new files, which would then execute the COPY/SQL update.
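A rough sketch of such an S3-triggered Lambda, assuming psycopg2 is packaged with the function and that the connection settings, staging table, and IAM role are placeholders supplied via environment variables:

```python
# S3-triggered Lambda that loads each newly created Firehose file into Redshift
# via COPY. Connection details, table name, and IAM role are placeholders.
import os
import psycopg2

def handler(event, context):
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn, conn.cursor() as cur:
            for record in event["Records"]:
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                # Load the new file into a staging table; follow-up SQL can then
                # merge/update the target tables as required.
                cur.execute(
                    f"COPY staging_table FROM 's3://{bucket}/{key}' "
                    f"IAM_ROLE '{os.environ['REDSHIFT_COPY_ROLE']}' FORMAT AS JSON 'auto';"
                )
    finally:
        conn.close()
```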
Another alternative is just writing your own KCL client that implements what Firehose does, and then executing the required updates after the COPY of each micro-batch (500-1000 rows).
I've done such an implementation (we needed to update old records based on new records) and it works alright from a consistency point of view, though I'd advise against such an architecture in general due to Redshift's poor performance with regard to updates. Based on my experience, the key rule is that Redshift data is append-only, and it is often faster to use filters to remove unnecessary rows (with optional regular pruning, like daily) than to delete/update those rows in real time.
Yet another alternative is to have Firehose dump data into staging table(s), and then have scheduled jobs take whatever is in that table, do the processing, move the data, and rotate the tables.
As a general reference architecture for real-time inserts into Redshift, take a look at this: https://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1PGELDJU/Best-Practices-for-Micro-Batch-Loading-on-Amazon-Redshift
This has been implemented multiple times, and works well.