I want to send data from DynamoDB to AWS Lambda, and to configure fault tolerance I am looking into the data retention.
As per the AWS docs, a DDB Stream keeps data for 24 hours. However, when we set up the trigger to AWS Lambda, we can set the maximum record age to 7 days. How is this possible?
When enabled, DynamoDB Streams captures a time-ordered sequence of item-level modifications in a DynamoDB table and durably stores the information for up to 24 hours.
DynamoDB Stream Docs
Enabling Trigger and Data Retention during Error
How can the trigger have a maximum record age of 7 days when the source DDB Stream only keeps data for 24 hours?
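For reference, the 7-day value corresponds to the MaximumRecordAgeInSeconds setting on the Lambda event source mapping. A minimal boto3 sketch of the configuration in question (the ARN, function name, and batch values are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Hypothetical ARNs/names; the point is the MaximumRecordAgeInSeconds knob,
# which accepts values up to 604800 (7 days) even for DynamoDB stream sources.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/MyTable/stream/2024-01-01T00:00:00.000",
    FunctionName="my-stream-consumer",
    StartingPosition="TRIM_HORIZON",
    BatchSize=100,
    MaximumRetryAttempts=10,
    MaximumRecordAgeInSeconds=604800,  # the 7-day setting from the console
)
```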
I'm new to AWS, and I'm working on archiving data from DynamoDB to S3. This is my solution, and I have built the following pipeline.
DynamoDB -> DynamoDB TTL + DynamoDB Stream -> Lambda -> Kinesis Firehose -> S3
But I found that the files in S3 have different numbers of JSON objects. Some files have 7 JSON objects, some have 6 or 4. I have done ETL in the Lambda, so S3 only stores REMOVE items, and the JSON has been unmarshalled.
I thought it would be one JSON object per file, since the TTL value is different for each item and the Lambda would deliver the item immediately when it is deleted by TTL.
Is it because Kinesis Firehose batches the items? (It would wait for some time to collect more items before saving them to a file.) Or is there another reason? Could I estimate how many files it will save if a new DynamoDB item is deleted by TTL every 5 minutes?
Thank you in advance.
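(For context, a minimal sketch of the kind of Lambda ETL described above, assuming boto3; the delivery stream name is a placeholder. It keeps only REMOVE records, unmarshalls the DynamoDB-typed JSON, and forwards the items to Firehose.)

```python
import json
import boto3
from boto3.dynamodb.types import TypeDeserializer

firehose = boto3.client("firehose")
deserializer = TypeDeserializer()

def handler(event, context):
    records = []
    for record in event["Records"]:
        # Keep only items removed from the table (e.g. expired by TTL).
        if record["eventName"] != "REMOVE":
            continue
        # Unmarshall the DynamoDB attribute-value format into plain JSON.
        old_image = record["dynamodb"].get("OldImage", {})
        item = {k: deserializer.deserialize(v) for k, v in old_image.items()}
        records.append({"Data": (json.dumps(item, default=str) + "\n").encode()})
    if records:
        # Firehose still buffers these on its side before writing a file to S3.
        firehose.put_record_batch(
            DeliveryStreamName="ddb-archive",  # placeholder
            Records=records,
        )
```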
Kinesis Firehose splits your data based on buffer size or interval.
Let's say you have a buffer size of 1MB and an interval of 1 minute.
If less than 1 MB arrives within the 1-minute interval, Kinesis Firehose will still create a batch file out of the received data, even though it is smaller than 1 MB.
This is likely what is happening in scenarios where little data arrives. You can adjust your buffer size and interval to your needs, e.g. increase the interval to collect more items within a single batch.
You can choose a buffer size of 1–128 MiBs and a buffer interval of 60–900 seconds. The condition that is satisfied first triggers data delivery to Amazon S3.
From the AWS Kinesis Firehose Docs: https://docs.aws.amazon.com/firehose/latest/dev/create-configure.html
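A minimal boto3 sketch of setting those buffering hints when creating a delivery stream (names, ARNs, and values are placeholders):

```python
import boto3

firehose = boto3.client("firehose")

# Firehose flushes a file to S3 as soon as EITHER hint is reached:
# 64 MiB of buffered data OR 900 seconds since the last flush.
firehose.create_delivery_stream(
    DeliveryStreamName="ddb-archive",  # placeholder
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-archive-bucket",                        # placeholder
        "BufferingHints": {
            "SizeInMBs": 64,           # allowed range 1-128
            "IntervalInSeconds": 900,  # allowed range 60-900
        },
    },
)
```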
On my DynamoDB table, Kinesis Firehose is triggered and dumps my data to S3 whenever records are added or updated. My DynamoDB table also has TTL enabled.
Will it also be triggered when a record is deleted?
When an item expires, will Kinesis Firehose be triggered at that time, and what happens on the S3 side?
My understanding is that the data format DynamoDB sends to Kinesis Data Streams is basically identical to the data it sends to regular DynamoDB Streams, and as a result I expect the behavior to be identical.
According to the Kinesis Data Streams integration docs (emphasis mine):
Amazon Kinesis Data Streams for Amazon DynamoDB operates asynchronously, so there is no performance impact on a table if a stream is enabled. Whenever items are created, updated, or deleted in the table, DynamoDB sends a data record to Kinesis. The record contains information about a data modification to a single item in a DynamoDB table. Specifically, a data record contains the primary key attribute of the item that was modified, together with the "before" and "after" images of the modified item.
That's essentially what a regular DynamoDB stream does as well, and concerning TTL deletes the docs for that say:
You can back up, or otherwise process, items that are deleted by Time to Live (TTL) by enabling Amazon DynamoDB Streams on the table and processing the streams records of the expired items.
The stream record contains a user identity field, Records[<index>].userIdentity. Items that are deleted by the Time to Live process after expiration have the following fields:
Records[<index>].userIdentity.type: "Service"
Records[<index>].userIdentity.principalId: "dynamodb.amazonaws.com"
tl;dr: Yes, the TTL-deletes should show up in the stream as well and will be handled by Firehose like any regular delete.
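If a consumer needs to tell TTL deletes apart from ordinary deletes, a minimal sketch of the check in a Lambda handler (assuming the stream record shape quoted above):

```python
def is_ttl_delete(record):
    """True if a stream record was produced by a TTL expiration."""
    user_identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and user_identity.get("type") == "Service"
        and user_identity.get("principalId") == "dynamodb.amazonaws.com"
    )

def handler(event, context):
    ttl_deletes = [r for r in event["Records"] if is_ttl_delete(r)]
    # ... archive or otherwise process ttl_deletes ...
    print(f"{len(ttl_deletes)} of {len(event['Records'])} records were TTL deletes")
```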
I have a DynamoDB stream and a Lambda trigger for a table. The Lambda trigger basically syncs the DynamoDB table to DocumentDB.
What if DocumentDB is down for more than 24 hours? How can I put all the activity (put, delete, update) that happened in DynamoDB back into the stream so that the Lambda trigger can access the records and sync the data to DocumentDB?
I see that a DynamoDB stream keeps records for a maximum of 24 hours.
By default it's not possible. Unlike regular Kinesis, which has a max retention of 7 days, the stream behind DynamoDB has a max retention of 24 hours; messages are discarded once they exceed the max retry attempts and are deleted after 24 hours.
So we need to build an exception-handling process. One such method:
Create an SQS queue with a higher MessageRetentionPeriod (max 14 days) and set a RedrivePolicy with maxReceiveCount for the number of times to retry.
Set up an on-failure destination on the Lambda pointing to the SQS queue (a configuration sketch follows below).
The same Lambda can be slightly modified to read either from Kinesis or from SQS, or a different Lambda can be used to read from SQS.
Throw an error back from the Lambda when it fails to write to DocumentDB. This will send the record back to Kinesis/SQS. This way we can get by for up to 14 days. We can also add a DLQ on the SQS queue pointing to another SQS queue, which can send leftover messages after 14 days to the DLQ, with a destination such as persistent storage.
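A minimal boto3 sketch of that on-failure destination, assuming an existing event source mapping (the UUID, queue ARN, and retry count are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Once retries on the stream are exhausted, details of the failed batch are
# sent to the SQS queue instead of being silently discarded.
lambda_client.update_event_source_mapping(
    UUID="<event-source-mapping-uuid>",  # placeholder
    MaximumRetryAttempts=5,
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:us-east-1:123456789012:ddb-sync-failures"  # placeholder
        }
    },
)
```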
I was assuming I
create a table and enable the stream, and I now have an ARN
create a Kinesis stream
configure somewhere to tell the DynamoDB stream to write to the Kinesis stream
I was looking at working with https://github.com/harlow/kinesis-consumer but this reads from Kinesis. Or can I use the ARN to read right from the DynamoDB stream?
The more I look, the more I think I have to write a Lambda to read from the DynamoDB stream and write to Kinesis. Is that correct?
thanks
Hey, can you provide a bit more information about your target setup? Do you plan to have some sort of ETL process for your DynamoDB table? AFAIK, when you bind a Kinesis stream to a DynamoDB table, every time you add, remove, or update rows in DynamoDB, a new event is published to the associated Kinesis stream, which you can consume and use in whatever way you want.
Maybe worth checking this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html
DynamoDB now supports Kinesis Data Streams natively:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/kds.html
You can choose either DynamoDB Streams or Kinesis Data Streams for your Change Data Capture (CDC).
| Properties | Kinesis Data Streams for DynamoDB | DynamoDB Streams |
| --- | --- | --- |
| Data retention | Up to 1 year. | 24 hours. |
| Kinesis Client Library (KCL) support | Supports KCL versions 1.X and 2.X. | Supports KCL version 1.X. |
| Number of consumers | Up to 5 simultaneous consumers per shard, or up to 20 simultaneous consumers per shard with enhanced fan-out. | Up to 2 simultaneous consumers per shard. |
| Throughput quotas | Unlimited. | Subject to throughput quotas by DynamoDB table and AWS Region. |
| Record delivery model | Pull model over HTTP using GetRecords; with enhanced fan-out, Kinesis Data Streams pushes the records over HTTP/2 by using SubscribeToShard. | Pull model over HTTP using GetRecords. |
| Ordering of records | The timestamp attribute on each stream record can be used to identify the actual order in which changes occurred in the DynamoDB table. | For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item. |
| Duplicate records | Duplicate records might occasionally appear in the stream. | No duplicate records appear in the stream. |
| Stream processing options | Process stream records using AWS Lambda, Kinesis Data Analytics, Kinesis Data Firehose, or AWS Glue streaming ETL. | Process stream records using AWS Lambda or the DynamoDB Streams Kinesis Adapter. |
| Durability level | Availability Zones to provide automatic failover without interruption. | Availability Zones to provide automatic failover without interruption. |
You can use Amazon Kinesis Data Streams to capture changes to Amazon DynamoDB. According to the AWS documentation:
Kinesis Data Streams captures item-level modifications in any DynamoDB table and replicates them to a Kinesis data stream. Your applications can access this stream and view item-level changes in near-real time. You can continuously capture and store terabytes of data per hour. You can take advantage of longer data retention time—and with enhanced fan-out capability, you can simultaneously reach two or more downstream applications. Other benefits include additional audit and security transparency.
You can also enable streaming to Kinesis from your DynamoDB table.
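A minimal boto3 sketch of wiring a table to an existing Kinesis data stream, so no extra Lambda is needed for the copy (table and stream names are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")
dynamodb = boto3.client("dynamodb")

# Create (or reuse) the target Kinesis data stream.
kinesis.create_stream(StreamName="my-table-changes", ShardCount=1)  # placeholder name

# Ask DynamoDB to replicate item-level changes into that stream;
# no intermediate Lambda is required for the copy.
dynamodb.enable_kinesis_streaming_destination(
    TableName="MyTable",  # placeholder
    StreamArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-table-changes",  # placeholder
)
```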
From AWS documentation:
Data delivery to your S3 bucket might fail for reasons such as the bucket doesn’t exist anymore, the IAM role that Kinesis Firehose assumes doesn’t have access to the bucket, network failure, or similar events. Under these conditions, Kinesis Firehose keeps retrying for up to 24 hours until the delivery succeeds. The maximum data storage time of Kinesis Firehose is 24 hours and your data is lost if data delivery fails for more than 24 hours.
Now what happens to data that is lost?
Are there any logs or metrics to check for such failures?
I have created an alarm on the DeliveryToS3.Success metric (if the metric value is < 1 for 1 minute, the alarm triggers). So whenever there is a failure while sending to S3, Firehose retries for up to 24 hours, but the metric shows a value < 1 for that period and the alarm triggers. Also, I am not seeing any CloudWatch error (S3Delivery) logs.
My aim is to trigger the alarm only when we are ultimately unable to send data to S3 (even after 24 hours).
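(For reference, a minimal boto3 sketch of the kind of alarm described above, stretched to evaluate a longer window so it only fires after sustained failure; the alarm name, stream name, period, and thresholds are illustrative assumptions.)

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Instead of alarming on a single 1-minute datapoint, evaluate 1-hour periods
# across 24 hours, so the alarm fires only after a full day of failed delivery.
cloudwatch.put_metric_alarm(
    AlarmName="firehose-s3-delivery-failing",  # placeholder
    Namespace="AWS/Firehose",
    MetricName="DeliveryToS3.Success",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "ddb-archive"}],  # placeholder
    Statistic="Average",
    Period=3600,            # 1-hour periods
    EvaluationPeriods=24,   # 24 periods = 24 hours
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
)
```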
Note: Please let me know if any explanation or correction is required.