Streaming data from DynamoDB to Redshift with Kinesis - backfilling history? - amazon-web-services

I'm looking at the diagram here, and from my understanding, a DynamoDB stream into a Redshift table via Kinesis Firehose will send the updates as Redshift commands to the table (e.g. UPDATE, INSERT, etc.), so this will keep a Redshift copy of a DynamoDB table in sync.
But how do you deal with the historical data? Is there a good process for filling the Redshift table with the data to date, which can then be kept in sync via a DynamoDB stream? It's not trivial: depending on the timing, some updates may be lost if I manually copy the data into a Redshift table and only then switch on a DynamoDB stream.
Also, the diagram shows Kinesis Firehose delivering data to S3, queryable by Athena. I feel like I'm missing something, because if the data going to S3 is only updates and new records, that doesn't seem like a good fit for Athena (a partitioned snapshot of the entire table makes more sense).
So if I have a DynamoDB table that is currently receiving data, and I want to create a new Redshift table that contains all the same data up to a given time and then receives all subsequent updates via a DynamoDB stream, how do I go about doing that?

Related

Sync DynamoDB into a Redshift table with Kinesis Firehose

I've found a tutorial for how to get item-level changes from DynamoDB into S3 via Kinesis Firehose here,
but how do I get these into a Redshift table? If an item is updated, a new record is created for it and posted to S3, so is there a tutorial or guidance on how to take these item-level changes and read them into a table?
Kinesis Firehose has multiple destinations that you can choose from. S3 is only one of them, and Redshift is another.
You can use the following configuration to set up Redshift as the destination.
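As a rough sketch, setting up a delivery stream with a Redshift destination via boto3 might look like the following (every name, ARN, and credential here is a placeholder, not a value from the question):

```python
# Hedged sketch: a Firehose delivery stream with Redshift as the destination.
# All names, ARNs, and credentials are placeholders.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="ddb-to-redshift",  # placeholder
    DeliveryStreamType="DirectPut",
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-redshift-role",
        "ClusterJDBCURL": "jdbc:redshift://my-cluster.abc123.us-east-1"
                          ".redshift.amazonaws.com:5439/dev",
        "CopyCommand": {
            "DataTableName": "my_ddb_mirror",        # target Redshift table
            "CopyOptions": "FORMAT AS JSON 'auto'",  # records arrive as JSON
        },
        "Username": "firehose_user",
        "Password": "REPLACE_ME",
        # Firehose always stages the data in S3 before running COPY.
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-redshift-role",
            "BucketARN": "arn:aws:s3:::my-staging-bucket",
            "Prefix": "firehose-staging/",
        },
    },
)
```

With this destination type, Firehose stages the incoming records in the S3 bucket and then issues a Redshift COPY into the target table on your behalf.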

Recommendation for near real time data sync between DynamoDb and S3/Redshift

I have a bunch of tables in DynamoDB (80 right now, but this can grow in the future) and I'm looking to sync data from these tables to either Redshift or S3 (using Glue on top of it to query with Athena) for running analytics queries.
Very frequent updates (to existing entries) and deletes also happen in the DynamoDB tables, and I want to sync these along with the addition of newer entries.
I checked the write capacity units (WCU) consumed for all tables, and the rate comes out to around 35-40 WCU per second at peak times.
Solutions I considered:
Use Kinesis Firehose along with Lambda (which reads updates from DDB streams) to push data to Redshift in small batches (Issue: it cannot support updates and deletes and is only good for adding new entries, because it uses the Redshift COPY command under the hood to upload data to Redshift)
Use Lambda (which reads updates from DDB streams) to copy data to S3 directly as JSON, with each entry being a separate file. This can support updates and deletes if the S3 file path matches the primary key of the DynamoDB table; see the sketch after this list (Issue: it will result in tons of small files in S3, which might not scale for querying with AWS Glue)
Use Lambda (which reads updates from DDB streams) to write data to Redshift directly as soon as a new update happens (Issue: too many small writes to Redshift can cause scaling issues, as Redshift is more suited to batch writes/updates)
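A minimal sketch of the second option, assuming a Lambda function subscribed to the table's stream with a NEW_IMAGE (or NEW_AND_OLD_IMAGES) view type; the bucket name and the single "pk" key attribute are assumptions:

```python
# Hedged sketch of option 2: a Lambda handler triggered by a DynamoDB stream that
# mirrors each item into S3 as one JSON object keyed by the table's primary key.
# "my-ddb-mirror-bucket" and the single "pk" partition key are assumptions.
import json
import boto3
from boto3.dynamodb.types import TypeDeserializer

s3 = boto3.client("s3")
deserializer = TypeDeserializer()
BUCKET = "my-ddb-mirror-bucket"

def handler(event, context):
    for record in event["Records"]:
        # Convert the typed DynamoDB attribute values into plain Python values.
        keys = {k: deserializer.deserialize(v)
                for k, v in record["dynamodb"]["Keys"].items()}
        object_key = f"items/{keys['pk']}.json"  # assumes one partition key named "pk"

        if record["eventName"] == "REMOVE":
            # Deletes in DynamoDB become deletes of the mirrored object.
            s3.delete_object(Bucket=BUCKET, Key=object_key)
        else:
            # INSERT / MODIFY: overwrite the object with the latest item image.
            item = {k: deserializer.deserialize(v)
                    for k, v in record["dynamodb"]["NewImage"].items()}
            s3.put_object(Bucket=BUCKET, Key=object_key,
                          Body=json.dumps(item, default=str))
```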

What benefit does Firehose+S3 provide instead of using Athena directly on DynamoDb?

In this article - https://aws.amazon.com/blogs/database/how-to-perform-advanced-analytics-and-build-visualizations-of-your-amazon-dynamodb-data-by-using-amazon-athena/ - and similarly in this article - https://aws.amazon.com/blogs/database/simplify-amazon-dynamodb-data-extraction-and-analysis-by-using-aws-glue-and-amazon-athena/ - the DynamoDB data is first landed in S3 before Athena queries it.
Why not use Athena to query DynamoDB directly?
First of all, Athena cannot query DynamoDB directly.
To do so, you need to make the data available in another location that can be identified as a valid data source by AWS Glue.
The most common are actually S3 and Kinesis (for performance and cost reasons), but there are other options such as:
JDBC
Amazon RDS
MongoDB
Amazon DocumentDB
Kafka
(other options will be displayed according to the method you choose to map the data)
For DynamoDB, you must extract data from the desired table before it can be used, or, as in the first example, use real-time streams.
Explaining each scenario.
First scenario: uses DynamoDB Streams connected to Kinesis Firehose, which makes the data emitted by the real-time DynamoDB streams available in S3. This way, Athena can use S3 as the source for the data.
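Nowadays the DynamoDB-to-Kinesis side of such a pipeline can also be wired up with the native Kinesis Data Streams integration; here is a rough boto3 sketch with placeholder names (the Firehose delivery stream that lands the records in S3 is configured separately):

```python
# Hedged sketch: routing a DynamoDB table's item-level changes into a Kinesis
# data stream via the native integration. Table and stream names are placeholders;
# the Firehose delivery stream that writes the records to S3 is set up separately.
import boto3

kinesis = boto3.client("kinesis")
dynamodb = boto3.client("dynamodb")

# A Kinesis data stream to receive the table's change records.
kinesis.create_stream(StreamName="prices-changes", ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName="prices-changes")
stream_arn = kinesis.describe_stream(
    StreamName="prices-changes")["StreamDescription"]["StreamARN"]

# Point the DynamoDB table at that stream.
dynamodb.enable_kinesis_streaming_destination(
    TableName="prices",  # placeholder table name
    StreamArn=stream_arn,
)
```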
Second scenario: uses a Glue crawler to map the data schema from DynamoDB and create a table in your Data Catalog containing the schema map of the object properties. To extract the data itself, it uses a Glue job that reads from that schema-map table and writes the data to S3, creating another table in your Data Catalog, this time pointing to S3, which makes it available for Athena to query.
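A minimal Glue job along the lines of that second scenario might look like this (the Data Catalog database and table names and the output bucket are assumptions):

```python
# Hedged sketch of the second scenario as a Glue (PySpark) job: read the
# DynamoDB-backed Data Catalog table created by the crawler and write it to S3
# as Parquet so Athena can query it. Database, table, and bucket names are
# assumptions.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# The table the crawler registered from the DynamoDB source.
source = glue_context.create_dynamic_frame.from_catalog(
    database="ddb_catalog", table_name="prices")

# Extract to S3; a second crawler (or manual DDL) then exposes this to Athena.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/prices/"},
    format="parquet",
)

job.commit()
```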
The DynamoDB data structure and storage are not optimized for the relational queries Athena expects; you can read more about this in the DynamoDB docs.

Are there any benefits to storing data in DynamoDB vs S3 for use with Redshift?

My particular scenario: I expect to amass TBs or even PBs of JSON data entries which track price history for many items. New data will be written to the data store hundreds or even thousands of times per day. This data will be analyzed by Redshift and possibly AWS ML. I don't expect to query outside of Redshift or ML.
Question: How do I decide whether to store my data in S3 or DynamoDB? I am having trouble deciding because I know that both stores work with Redshift, but I did notice that Redshift Spectrum exists specifically for S3 data.
Firstly, DynamoDB is far more expensive than S3. S3 is only a storage solution, while DynamoDB is a full-fledged NoSQL database.
If you want to query using Redshift, you have to load the data into Redshift. Redshift is again an independent, full-fledged database (a warehousing solution).
You can use Athena to query data directly from S3.
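For the load-into-Redshift path, the COPY step could be issued through the Redshift Data API along these lines (cluster, database, user, role, bucket, and table names are all placeholders):

```python
# Hedged sketch: loading JSON price-history objects from S3 into a Redshift table
# with COPY, issued through the Redshift Data API. All identifiers are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

copy_sql = """
    COPY price_history
    FROM 's3://my-price-data-bucket/prices/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS JSON 'auto';
"""

redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",  # placeholder cluster
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
```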

Copying only new records from AWS DynamoDB to AWS Redshift

I see there are tons of examples and documentation for copying data from DynamoDB to Redshift, but we are looking at an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to drop and reload the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?
DynamoDB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time ordered sequence of item level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
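At the lowest level, the Streams API boils down to walking shards and fetching records, roughly as below (the stream ARN is a placeholder; in practice most consumers attach a Lambda trigger or use the Kinesis adapter rather than iterating shards by hand):

```python
# Hedged sketch: reading item-level changes directly with the DynamoDB Streams API.
# The stream ARN is a placeholder; this only fetches one batch per shard to keep
# the example short.
import boto3

streams = boto3.client("dynamodbstreams")
STREAM_ARN = ("arn:aws:dynamodb:us-east-1:123456789012:"
              "table/prices/stream/2024-01-01T00:00:00.000")

description = streams.describe_stream(StreamArn=STREAM_ARN)["StreamDescription"]

for shard in description["Shards"]:
    iterator = streams.get_shard_iterator(
        StreamArn=STREAM_ARN,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest available change
    )["ShardIterator"]

    for record in streams.get_records(ShardIterator=iterator)["Records"]:
        # Each record carries the event type plus the item's before/after images,
        # depending on the table's stream view type.
        print(record["eventName"], record["dynamodb"].get("Keys"))
```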
Redshift's COPY from DynamoDB can only copy the entire table. There are several ways to achieve an incremental copy:
Using an AWS EMR cluster and Hive - If you set up an EMR cluster, you can use Hive tables to execute queries on the DynamoDB data and move it to S3. That data can then easily be moved to Redshift.
You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If the data is stored this way, the DynamoDB tables can be dropped after they are copied to Redshift.
This can be solved with a secondary DynamoDB table that tracks only the keys that were changed since the last backup. This table has to be updated whenever the initial DynamoDB table is updated (add, update, delete). At the end of the backup process you delete those keys, or you delete each one as its row is backed up.
If your DynamoDB table has
a timestamp as an attribute, or
a binary flag which conveys data freshness as an attribute,
then you can write a Hive query to export only the current day's (or otherwise fresh) data to S3 and then do a 'KEEP_EXISTING' copy of this incremental S3 data to Redshift.
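If Hive is not already in the picture, the same "export only fresh rows" idea can also be sketched in plain Python. This swaps the Hive query for a filtered scan and assumes a string `last_updated` timestamp attribute plus placeholder table and bucket names; note that a filtered Scan still consumes read capacity for every item it examines:

```python
# Hedged sketch of the incremental export without Hive: scan for items whose
# "last_updated" timestamp falls on the current day and stage them in S3 as
# newline-delimited JSON for a subsequent Redshift COPY. Table, attribute, and
# bucket names are assumptions.
import json
import datetime
import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

table = dynamodb.Table("prices")  # placeholder table name
today = datetime.date.today().isoformat()

lines = []
scan_kwargs = {
    # Assumes last_updated is stored as an ISO-8601 string, e.g. "2024-01-01T12:00:00".
    "FilterExpression": "begins_with(last_updated, :d)",
    "ExpressionAttributeValues": {":d": today},
}
while True:
    page = table.scan(**scan_kwargs)
    lines.extend(json.dumps(item, default=str) for item in page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# One newline-delimited JSON object per item, ready for COPY ... FORMAT AS JSON.
s3.put_object(
    Bucket="my-staging-bucket",  # placeholder bucket
    Key=f"incremental/{today}.json",
    Body="\n".join(lines),
)
```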