Is there a way to purge a dynamoDB stream? - amazon-web-services

Hi, I am working with DynamoDB Streams and Lambda triggers over them. I've got myself into a fix: my Lambda reads records from TRIM_HORIZON and it failed to process the very first record. Now the Lambda is hell-bent on retrying that specific record. Is there a way to purge the stream so that new records start flowing and can be processed?

If you only want new records (those arriving now, rather than historical records), use LATEST instead of TRIM_HORIZON.
To answer the question directly: there is currently no way to purge a Kinesis/DynamoDB stream.
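If you go the LATEST route with a Lambda trigger, note that StartingPosition can only be set when an event source mapping is created, so you would delete the stuck mapping and create a new one. Below is a minimal boto3 sketch, assuming placeholder values for the stream ARN and function name; capping retries and bisecting failing batches also keeps a single poison record from blocking the shard indefinitely.

```python
import boto3

lambda_client = boto3.client("lambda")

# Recreate the trigger so it starts at LATEST and cannot get stuck
# retrying a single poison record forever.
response = lambda_client.create_event_source_mapping(
    # Hypothetical stream ARN and function name - substitute your own.
    EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/2024-01-01T00:00:00.000",
    FunctionName="my-stream-processor",
    StartingPosition="LATEST",        # skip the backlog, read only new records
    MaximumRetryAttempts=3,           # stop retrying a failing batch after 3 attempts
    BisectBatchOnFunctionError=True,  # split the batch to isolate the bad record
)
print(response["UUID"])
```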

I think there is no way to purge a DynamoDB stream. One workaround is to delete the stream and recreate it (highly discouraged in production, as it incurs data loss).

Related

Prevent lambda from processing dynamodb stream events during table recovery

I'm working on preparing a disaster recovery plan for DynamoDB. In a DR situation we would create a temporary table to restore a snapshot to. From the temp table we would copy data to a table that has been provisioned with IaC. Our DynamoDB tables have a stream and an associated Lambda trigger, which would process all events copied in from the temp table; this is unwanted and would cause a bunch of downstream issues.
Ideally I would like to disable the stream/Lambda trigger until the restore is complete, then enable it again and ignore any of the changes from the copy/restore process.
I've read through the DynamoDB Streams documentation and it isn't clear to me whether disabling the stream will clear events. My understanding is that disabling/enabling the DynamoDB stream, although it gives you a new ARN, is still the same stream behind the scenes, that the log lives for 24 hours, and that once re-enabled the events would be sent to the Lambda trigger.
It seems I might be able to configure the trigger on the Lambda side: disable it, and then set the ShardIteratorType to 'LATEST' in order to avoid reading events from the copied data.
Thanks in advance for any advice.
For anyone traveling down this path in the future, I received some good answers on AWS re:Post here.
Some takeaways:
When you disable/enable a stream, it changes ARN and is completely separate from the old stream.
If the stream is disabled and you copy data to the table, the stream log will remain empty once you re-enable it.
Lambda event source mapping solution: LATEST will start reading from the point when you enable the trigger on the ESM. So if your copy puts events on the stream and you later create an ESM with iterator position LATEST, your Lambda will disregard all the data from the copy.
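As a rough sketch of the event source mapping side (the UUID below is a placeholder), you could disable the trigger before the copy and re-enable it afterwards, or create a fresh mapping with StartingPosition="LATEST" if the stream ARN has changed:

```python
import boto3

lambda_client = boto3.client("lambda")
ESM_UUID = "11111111-2222-3333-4444-555555555555"  # hypothetical mapping UUID

# Before the copy/restore: stop the trigger so no events are delivered.
lambda_client.update_event_source_mapping(UUID=ESM_UUID, Enabled=False)

# ... run the restore / table copy here ...

# Afterwards either re-enable the mapping, or (if the stream was
# disabled/re-enabled and now has a new ARN) create a new mapping
# with StartingPosition="LATEST" so the copied records are ignored.
lambda_client.update_event_source_mapping(UUID=ESM_UUID, Enabled=True)
```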

What's the best AWS approach to send a notification message to validate whether all records have been processed in DynamoDB

Introduction
We are building an application to process a monthly file, and there are many AWS components involved in this project:
A Lambda reads the file from S3, parses it and pushes it to DynamoDB with a flag (PENDING) for each record.
Another Lambda will process these records after the first Lambda is done, and flag each record as (PROCESSED) once it's done with it.
Problem:
We want to send a result to SQS after all records are processed.
Our approach
Is to use DynamoDB Streams to trigger a Lambda each time a record gets updated, and have that Lambda query DynamoDB to check if all records are processed, sending the notification when that's true.
Questions
Is there any other approach that can achieve this goal without triggering a Lambda each time a record gets updated?
Is there a better approach that doesn't involve DynamoDB Streams?
I would recommend DynamoDB Streams, as they are reliable enough; triggering a Lambda for an update is pretty cheap, and an execution will usually take 1-100 ms. Even with millions of executions it is a robust solution. There is also a way to have a shared counter of processed messages using ElastiCache: once you receive an update and the counter is 0, you are complete.
Is there any other approach that can achieve this goal without triggering a Lambda each time a record gets updated?
Another option is a scheduled Lambda execution that checks the status of all records in the database (query for PROCESSED) and sends the result to SQS. Depending on the load, you could decide how often it should run (trigger it using a CloudWatch scheduled event).
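A minimal sketch of that scheduled check, assuming a hypothetical table, a GSI keyed on the status attribute, and a queue URL (all placeholders, not from the question):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

TABLE_NAME = "monthly-records"   # hypothetical table name
STATUS_INDEX = "status-index"    # hypothetical GSI keyed on the status attribute
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-done"  # hypothetical

def handler(event, context):
    table = dynamodb.Table(TABLE_NAME)
    # Count records still flagged PENDING; Select="COUNT" avoids returning the items.
    resp = table.query(
        IndexName=STATUS_INDEX,
        KeyConditionExpression=Key("status").eq("PENDING"),
        Select="COUNT",
    )
    # Note: this checks a single query page; a real check would follow LastEvaluatedKey.
    if resp["Count"] == 0:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody="monthly file fully processed")
```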
What about having a table monthly_file_process with a row for every month and an extra counter?
Once the S3 file is read, count the records and persist the total as the counter. With every record marked PROCESSED, decrement the counter; if the counter is 0 after the update, send the SQS notification. This whole thing, including sending to SQS, could be done from the second Lambda that processes the records, with just the extra step of checking the counter.
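A rough sketch of that counter idea using DynamoDB's atomic ADD update (the key attribute and queue URL are assumptions for illustration):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

table = dynamodb.Table("monthly_file_process")  # table suggested above
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-done"  # hypothetical

def record_processed(month_key: str):
    # Atomically decrement the remaining-records counter for this month's file.
    resp = table.update_item(
        Key={"month": month_key},   # hypothetical key attribute
        UpdateExpression="ADD remaining :dec",
        ExpressionAttributeValues={":dec": -1},
        ReturnValues="UPDATED_NEW",
    )
    # Only the update that brings the counter to zero sends the notification,
    # so exactly one SQS message is emitted per monthly file.
    if resp["Attributes"]["remaining"] == 0:
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=f"{month_key} fully processed")
```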

Best strategy to archive specific records from RDS to a cheaper storage in AWS

I have the following requirements:
For every deleted record in RDS we need to archive it into somewhere cheaper on AWS.
Reduce storage cost
Not using Glacier
Context oriented (e.g. a file per table)
re-import is not a requirement
I'm not an experienced AWS user, so I'm still a bit lost among the number of options it has to offer, and I'd like to know if you have more ideas to help me clear it up.
Initial thoughts:
The microservice that deletes the record might send it to a broker (RabbitMQ, for example), and another microservice (let's call it the archiver) will listen to it, write it into a file, zip it and send it to S3. This approach has some technical challenges though: for creating big files to make sense, I need to wait for the queue to grow a bit, wrap it into a stream and zip it inside S3. The transaction control is very weak as well, since file writing and acking the messages are decoupled, i.e. I'll only remove the messages from the broker after the file is created.
Add a new column such as "deleted (bool)" to the "archivable" tables and run a separate job that fetches only those records and saves them into S3. Discarded: they don't want the new microservice to have access to the other services' databases.
Following the same approach as in the first item, but instead of saving into S3, save into a cheaper database. SimpleDB?
Option 1, but instead of RabbitMQ, write it to a Kinesis Firehose and direct that to an S3 location - it doesn't get much cheaper or easier than that.
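A minimal sketch of the archiver side of that, assuming a Firehose delivery stream (placeholder name) already configured to deliver to an S3 bucket:

```python
import json
import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "deleted-records-archive"  # hypothetical delivery stream pointing at S3

def archive_deleted_record(record: dict):
    # Firehose buffers incoming records by size/time and writes batched,
    # optionally compressed objects to S3, so no manual file handling is needed.
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```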

How to read the oldest unprocessed record in Kinesis Data Stream

I'm new to AWS and would like some guidance.
I want to process the oldest unprocessed record but I cannot seem to get the params right.
Current Architecture
For the shard iterator:
I've tried TRIM_HORIZON, which gave me all the records since the beginning.
I've also tried LATEST, which only gave me the one latest record.
Not sure if these additional details will help but...
I'm putting my own records in through Lambda on the AWS console
I'm debugging this by looking at the log files in CloudWatch
I'm getting records through the shard iterator (TRIM_HORIZON and LATEST)
My getRecords limit is set at 100
Thanks in advance!
There is no "oldest unprocessed record", as Kinesis doesn't know what you've processed (for example, you may have fetched the records but not done anything with them).
If you're using Kinesis, I strongly recommend using the Kinesis Client Library (KCL), which has the concept of checkpoints - these are essentially a nice wrapper on top of the AFTER_SEQUENCE_NUMBER shard iterator, which translates to "oldest uncheckpointed record" - or as close as you'll get to "oldest unprocessed record".
(You could always implement this logic yourself, but why not reuse work that Amazon has already done for you?)
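If you do roll it yourself, a rough boto3 sketch of "resume just after my last checkpoint" (the stream name, shard id and sequence number are placeholders) looks like this:

```python
import boto3

kinesis = boto3.client("kinesis")

STREAM_NAME = "my-stream"          # hypothetical stream name
SHARD_ID = "shardId-000000000000"  # hypothetical shard id
# Sequence number you persisted after the last record you finished processing.
last_checkpoint = "49590338271490256608559692538361571095921575989136588898"

# Resume just after the last checkpointed record.
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=SHARD_ID,
    ShardIteratorType="AFTER_SEQUENCE_NUMBER",
    StartingSequenceNumber=last_checkpoint,
)["ShardIterator"]

resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in resp["Records"]:
    # ... handle record["Data"] here ...
    last_checkpoint = record["SequenceNumber"]  # persist this somewhere durable
```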

Resume reading from kinesis after a KCL consumer outage [duplicate]

I can't find in the formal documentation of AWS Kinesis any explicit reference between TRIM_HORIZON and the checkpoint, and also any reference between LATEST and the checkpoint.
Can you confirm my theory:
TRIM_HORIZON - if the application name is new, I will read all the records available in the stream. Otherwise, if the application name was already used, I will read from my last checkpoint.
LATEST - if the application name is new, I will read only the records added to the stream after I subscribed. Otherwise, if the application name was already used, I will read messages from my last checkpoint.
So the difference between TRIM_HORIZON and LATEST only matters when the application name is new.
AT_TIMESTAMP
-- from a specific timestamp
TRIM_HORIZON
-- all the available messages in the Kinesis stream from the beginning (same as earliest in Kafka)
LATEST
-- from the latest messages, i.e. the current message that just came into Kinesis/Kafka and all the incoming messages from that point onwards
From GetShardIterator documentation (which lines up with my experience using Kinesis):
In the request, you can specify the shard iterator type AT_TIMESTAMP to read records from an arbitrary point in time, TRIM_HORIZON to cause ShardIterator to point to the last untrimmed record in the shard in the system (the oldest data record in the shard), or LATEST so that you always read the most recent data in the shard.
Basically, the difference is whether you want to start from the oldest record (TRIM_HORIZON), or from "right now" (LATEST - skipping data between latest checkpoint and now).
The question clearly asks how these options relate to the checkpoint. However, none of the existing answers addresses the checkpoint at all.
An authoritative answer to this question by Justin Pfifer appears in a GitHub issue here.
The most relevant portion is
The KCL will always use the value in the lease table if it's present. It's important to remember that Kinesis itself doesn't track the position of consumers. Tracking is provided by the lease table. Leases in the KCL server double duty. They provide both mutual exclusion, and position tracking. So for mutual exclusion a lease needs to be created, and to satisfy the position tracking an initial value must be selected.
(Emphasis added by me.)
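As an illustration, you can peek at the lease table the KCL maintains (by default it is named after the KCL application name; leaseKey/checkpoint/leaseOwner are the usual attribute names, shown here as an assumption rather than a contract):

```python
import boto3

dynamodb = boto3.resource("dynamodb")

APPLICATION_NAME = "my-consumer-app"  # hypothetical KCL application name
lease_table = dynamodb.Table(APPLICATION_NAME)

# Each item tracks one shard: which worker currently holds the lease and where
# it last checkpointed. If a checkpoint is present, the KCL resumes from it and
# the TRIM_HORIZON/LATEST initial position only applies to shards with no lease.
for item in lease_table.scan()["Items"]:
    print(item["leaseKey"], item.get("checkpoint"), item.get("leaseOwner"))
```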
I think choosing between them is a trade-off between starting from the most recent data or starting from the oldest data that hasn't yet been processed from Kinesis.
Imagine a scenario where there is a bug in your Lambda function and it throws an exception on the first record it gets, returning an error back to Kinesis. Because of that, none of the records in your Kinesis stream will be processed, and they will sit there for the 1-day retention period. After you fix the bug and redeploy your Lambda, it will start receiving all those messages from the buffer Kinesis has been holding, so your downstream service will have to process old data instead of the most recent data. This could add unwanted latency to your application if you choose TRIM_HORIZON.
But if you use LATEST, you can ignore all those previously stuck messages and have your Lambda start processing from new events/messages, improving the latency your system provides.
So you will have to decide which is more important for your customers: is losing a few data points acceptable, and what is your tolerance limit, or do you always need accurate results, such as when calculating a sum/counter?