In my DynamoDB table, Kinesis Firehose is triggered which dumps my data to S3 whenever some records is added/updated. My DynamoDB table also has TTL enabled.
Will it also be triggered when some record is deleted?
When the item expires, will Kinesis Firehose be triggered at that time and what happen on the S3 side?
My understanding is, that the data format DynamoDB sends to Kinesis Data Streams is basically identical as the data it sends to the regular DynamoDB streams and as a result of that, I expect the behavior to be identical.
According to the Kinesis Data Streams integration docs (emphasis mine):
Amazon Kinesis Data Streams for Amazon DynamoDB operates asynchronously, so there is no performance impact on a table if a stream is enabled. Whenever items are created, updated, or deleted in the table, DynamoDB sends a data record to Kinesis. The record contains information about a data modification to a single item in a DynamoDB table. Specifically, a data record contains the primary key attribute of the item that was modified, together with the "before" and "after" images of the modified item.
That's essentially what a regular DynamoDB stream does as well and concerning TTL-deletes the docs for that say:
You can back up, or otherwise process, items that are deleted by Time to Live (TTL) by enabling Amazon DynamoDB Streams on the table and processing the streams records of the expired items.
The streams record contains a user identity field Records[].userIdentity.
Items that are deleted by the Time to Live process after expiration have the following fields:
Records[<index>].userIdentity.type
"Service"
Records[<index>].userIdentity.principalId
"dynamodb.amazonaws.com"
tl;dr: Yes, the TTL-deletes should show up in the stream as well and will be handled by Firehose like any regular delete.
Related
I'm looking at the diagram here, and from my understanding, a DynamoDB stream into a redshift table via kinesis firehose will send the updates as redshift commands to the table (i.e. update, insert etc). So this will keep a redshift version of a dynamodb table in sync.
But how do you deal with the historical data? Is there a good process for filling the redshift table with data to date, that can then be kept in sync via a dynamodb stream? It's not trivial, because some updates may be lost if I manually copy the data into a redshift table and then switch on a dynamodb stream depending on the timing.
So regarding the diagram, it shows kinesis firehose delivering data to s3, queryable by athena. I feel like I'm missing sometthing because if the data going to s3 are only updates and new records, it doesn't seem like something that works well for athea (a partitioned snapshot of the entire table makes more sense).
So if I have a dynamodb table that is currently receiving data, and I want to create a new redshift table that contains all the same data up to a given time, and then gets all the updates via a dynamodb stream after, how do I go about doing that?
I am using dynamodb and I'd like to enable dynamodb stream to process any data change in the dynamodb table. By looking at the stream options, there are two streams Amazon Kinesis data stream and DynamoDB stream. From the doc of these two streams, both are handling the data change from dynamodb table but I am not sure what the main different between using these two.
There are quite a few of the differences, which are listed in:
Streaming Options for Change Data Capture
Few notable ones are that DynamoDB Streams, unlike Kinesis Data Streams for DynamoDB, guarantees no duplicates, the record retention time is only 24 hours, and the are throughout capacity limits.
Another important difference is that DynamoDB Streams guarantees order while Kinesis (associated with a DynamoDB table) does not.
I was assuming I
create a table and enable stream and I now have an ARN
create a kinesis stream
configure somewhere to tell the dynamoDb stream to write to kinesis stream
I was looking at working with https://github.com/harlow/kinesis-consumer but this reads from kinesis or can I use the ARN and use it to read right from the dynamoDB stream?
The more I look, the more I seem to think, I have to write a lambda to read dynamoDB and write to kinesis. Is that correct?
thanks
Hey can you provide a bit more of information about your target setup? do you plan to have some sort of ETL process for your dynamoDB table? AFAIK when you bound a kinesis stream to a dynamodb table, everytime you add, remove or update rows on the dynamodb a new event will be publish in the associated kinesis stream which you can consume from and use the event in whatever way you want.
maybe worth checking this one:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html
DynamoDB now support Kinesis Data Streams natively:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/kds.html
You can choose either DynamoDB Streams or Kinesis Data Streams for your Change Data Capture (CDC).
Properties
Kinesis Data Streams for DynamoDB
DynamoDB Streams
Data retention
Up to 1 year.
24 hours.
Kinesis Client Library (KCL) support
Supports KCL versions 1.X and 2.X.
Supports KCL version 1.X.
Number of consumers
Up to 5 simultaneous consumers per shard, or up to 20 simultaneous consumers per shard with enhanced fan-out.
Up to 2 simultaneous consumers per shard.
Throughput quotas
Unlimited.
Subject to throughput quotas by DynamoDB table and AWS Region.
Record delivery model
Pull model over HTTP using GetRecords and with enhanced fan-out, Kinesis Data Streams pushes the records over HTTP/2 by using SubscribeToShard.
Pull model over HTTP using GetRecords.
Ordering of records
The timestamp attribute on each stream record can be used to identify the actual order in which changes occurred in the DynamoDB table.
For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.
Duplicate records
Duplicate records might occasionally appear in the stream.
No duplicate records appear in the stream.
Stream processing options
Process stream records using AWS Lambda, Kinesis Data Analytics, Kinesis data firehose , or AWS Glue streaming ETL.
Process stream records using AWS Lambda or DynamoDB Streams Kinesis adapter.
Durability level
Availability zones to provide automatic failover without interruption.
Availability zones to provide automatic failover without interruption.
You can use Amazon Kinesis Data Streams to capture changes to Amazon DynamoDB. According to the AWS documentation:
Kinesis Data Streams captures item-level modifications in any DynamoDB table and replicates them to a Kinesis data stream. Your applications can access this stream and view item-level changes in near-real time. You can continuously capture and store terabytes of data per hour. You can take advantage of longer data retention timeāand with enhanced fan-out capability, you can simultaneously reach two or more downstream applications. Other benefits include additional audit and security transparency.
Also You can enable streaming to Kinesis from your DynamoDB table.
I want to move(export) data from DynamoDB to S3
I have seen this tutorial but i'm not sure if the extracted data of dynamoDB will be deleted or coexits in DynamoDB and S3 at the same time.
What I expect is the data from dynamoDB will be deleted and stored in s3 (after X time stored in DynamoDB)
The main purpose of the project could be similar to this
There are any way to do this without have to develop a lambda function?
In resume, I have found this 2 different ways:
DynamoDB -> Pipeline -> S3 (Are the dynamoDB data deleted?)
DynamoDB -> TTL DynamoDB + DynamoDB stream -> Lambda -> firehose -> s3 (this appears to be more difficult)
Is this post currently valid for this purpouse?
What would be the simpliest and fasted way?
In your first option, as per default, data is not removed from dynamoDB. You can design a pipeline to make this work, but I think that is not the best solution.
In your second option, you must evaluate the solution based on your expected data volume:
If the data volume that will expire in TTL definition is not very
large, you can use lambda to persist removed data into S3 without
Firehose. You can design a simple lambda function to be triggered by
DynamoDB Stream and persist each stream event as a S3 object. You
can even trigger another lambda function to consolidate the objects
in a single file in the end of the day, week or month. But again,
based on your expected volume.
If you have a lot of data being expired at the same time and you
must perform transformations on this data, the best solution is to
use Firehose. Firehose can proceed with the transformation,
encryption and compact your data before sending it to S3. If the
volume of data is to big, using functions in the end of the day,
week or month may not be feasible. So it's better to perform all
this procedures before persisting it.
You can use AWS Pipeline to dump DynamoDB table to S3 and it will not be deleted.
Firehose->S3 uses the current date as a prefix for creating keys in S3. So this partitions the data by the time the record is written. My firehose stream contains events which have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date suffix of the final S3 objects is generated.
The only option with you is to add a post-processing layer after Kinesis Firehose. For e.g., you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in last hour and publishes them to correct S3 destinations.
It's not an answer for the question, however I would like to explain a little bit the idea behind storing records in accordance with event arrival time.
First a few words about streams. Kinesis is just a stream of data. And it has a concept of consuming. One can reliable consume a stream only by reading it sequentially. And there is also an idea of checkpoints as a mechanism for pausing and resuming the consuming process. A checkpoint is just a sequence number which identifies a position in the stream. Via specifying this number, one can start reading the stream from the certain event.
And now go back to default s3 firehose setup... Since the capacity of kinesis stream is quite limited, most probably one needs to store somewhere the data from kinesis to analyze it later. And the firehose to s3 setup does this right out of the box. It just stores raw data from the stream to s3 buckets. But logically this data is the still the same stream of records. And to be able to reliable consume (read) this stream one needs these sequential numbers for checkpoints. And these numbers are records arrival times.
What if I want to read records by creation time? Looks like the proper way to accomplish this task is to read the s3 stream sequentially, dump it to some [time series] database or data warehouse and do creation-time-based readings against this storage. Otherwise there will be always a non-zero chance to miss some bunches of events while reading the s3 (stream). So I would not suggest the reordering of s3 buckets at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by an arrival-time.
Then, you define another Athena table, say, events_by_event_time which is partitioned by the event_time attribute on your event, or however it's been defined in the schema.
Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and automatically repartitions them to events_by_event_time and now your events are partitioned by event_time without requiring EMR, data pipelines, or any other infrastructure.
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
I actually wrote more about this in a blog post here.
For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a lambda function which takes your records, processes them, changes the partition key and then sends them back to firehose to be added. You would also have the change the firehose to enable this partitioning and also define your custom partition key/prefix/suffix.