Sync AWS DynamoDB data with local DynamoDB instance - amazon-web-services

If I am not wrong, for local DynamoDB the data is stored in the shared-local-instance.db file. Is there a way to sync data from DynamoDB on AWS to my local DynamoDB (shared-local-instance.db)?
Also, if a new table is created on AWS DynamoDB, can I pull that table along with its records? I don't want to manually create tables or enter records to keep them in sync with my local DynamoDB tables. Hoping for an easy way to do this. Thanks in advance.
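One way this could be scripted is sketched below with boto3: copy a table's key schema and items from AWS into DynamoDB Local. The table name MyTable, the region, and the local endpoint http://localhost:8000 are assumptions; a full scan like this is only practical for small tables, and secondary indexes are not recreated.

```python
import boto3

TABLE_NAME = "MyTable"  # hypothetical table name

# Remote (AWS) and local (DynamoDB Local) low-level clients.
remote = boto3.client("dynamodb", region_name="us-east-1")
local = boto3.client(
    "dynamodb",
    endpoint_url="http://localhost:8000",   # assumed DynamoDB Local endpoint
    region_name="us-east-1",
    aws_access_key_id="local",
    aws_secret_access_key="local",
)

# Recreate the table schema locally if it does not exist yet (GSIs/LSIs are skipped).
desc = remote.describe_table(TableName=TABLE_NAME)["Table"]
key_attrs = {k["AttributeName"] for k in desc["KeySchema"]}
if TABLE_NAME not in local.list_tables()["TableNames"]:
    local.create_table(
        TableName=TABLE_NAME,
        KeySchema=desc["KeySchema"],
        AttributeDefinitions=[
            a for a in desc["AttributeDefinitions"] if a["AttributeName"] in key_attrs
        ],
        BillingMode="PAY_PER_REQUEST",
    )

# Copy the items over with a full scan and write them to the local table.
for page in remote.get_paginator("scan").paginate(TableName=TABLE_NAME):
    for item in page["Items"]:
        local.put_item(TableName=TABLE_NAME, Item=item)
```

Running the same steps for every table returned by list_tables() on the remote client would also pick up newly created tables.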

Related

Build s3 Datalake Using Dynamo DB data source

I'm a data engineer using AWS. We want to build a data pipeline in order to visualise our DynamoDB data in QuickSight. As you know, it's not possible to connect DynamoDB directly to QuickSight; you have to go through S3.
S3 will be our data lake. The issue is that the data updates frequently (for example, a column name can change, or a customer's status can evolve).
So I'm looking for a batch solution in order to always get the latest data from DynamoDB into my S3 data lake and visualise it in QuickSight.
Thank you.
You can open your tables in the DynamoDB console and export data to S3 under the Streams and Exports tab. This blog post from AWS explains just what you need.
You could also try this approach with Athena instead of S3.
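The same export can also be kicked off from code instead of the console. A minimal sketch with boto3 is below; the table ARN, bucket, and prefix are placeholders, and the export feature requires point-in-time recovery to be enabled on the table.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Start a full export of the table to S3 (requires PITR on the table).
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/MyTable",  # placeholder
    S3Bucket="my-datalake-bucket",          # placeholder bucket
    S3Prefix="dynamodb-exports/mytable/",   # export files land under this prefix
    ExportFormat="DYNAMODB_JSON",
)

export = response["ExportDescription"]
print(export["ExportArn"], export["ExportStatus"])
```

The files written under that prefix can then be catalogued (for example with a Glue crawler) and queried from Athena for QuickSight.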

Batch file processing in AWS using Data Pipeline

I have a requirement of reading a CSV batch file that was uploaded to an S3 bucket, encrypting data in some columns, and persisting this data in a DynamoDB table. While persisting each row in the DynamoDB table, depending on the data in each row, I need to generate an ID and store that in the DynamoDB table too. It seems AWS Data Pipeline allows you to create a job to import S3 bucket files into DynamoDB, but I can't find a way to add custom logic there to encrypt some of the column values in the file, or to generate the ID mentioned above.
Is there any way that I can achieve this requirement using AWS Data Pipeline? If not what would the best approach that I can follow using AWS services?
We also have a situation where we need to fetch data from S3 and populate it into DynamoDB after performing some transformations (business logic).
We also use AWS Data Pipeline for this process.
We first trigger an EMR cluster from Data Pipeline, where we fetch the data from S3, transform it, and populate DynamoDB (DDB). You can include all the logic you require in the EMR cluster.
We have a schedule set in the pipeline which triggers the EMR cluster once every day to perform the task.
Note that this approach can incur additional costs too.
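As an illustration of the kind of custom step the question asks about (regardless of whether it runs on EMR, in Lambda, or elsewhere), here is a rough sketch in Python with boto3: read the CSV from S3, derive an ID from the row contents, encrypt a sensitive column with KMS, and write the row to DynamoDB. The bucket, key, table, KMS alias, and the ssn column are all assumed names.

```python
import base64
import csv
import hashlib
import io

import boto3

s3 = boto3.client("s3")
kms = boto3.client("kms")
table = boto3.resource("dynamodb").Table("EncryptedRecords")  # placeholder table

# Read the uploaded batch file from S3 (placeholder bucket/key).
obj = s3.get_object(Bucket="my-input-bucket", Key="batch/input.csv")
rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

with table.batch_writer() as batch:
    for row in rows:
        # Derive a deterministic ID from the raw row contents before encrypting.
        row["id"] = hashlib.sha256(
            "|".join(sorted(f"{k}={v}" for k, v in row.items())).encode("utf-8")
        ).hexdigest()

        # Encrypt a sensitive column with KMS and store it base64-encoded.
        ciphertext = kms.encrypt(
            KeyId="alias/my-batch-key",            # placeholder key alias
            Plaintext=row["ssn"].encode("utf-8"),  # hypothetical column name
        )["CiphertextBlob"]
        row["ssn"] = base64.b64encode(ciphertext).decode("ascii")

        batch.put_item(Item=row)
```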

How to copy only new records from DynamoDB to S3 using Pipeline

I have a DynamoDB database with a lot of records, which increase every day. Recently I exported all my records from DynamoDB to an S3 bucket using Data Pipeline and it was OK. But now I want to create another pipeline and export only the new records from DynamoDB to this bucket. How can I do that?
I just switched to the DynamoDB-to-Redshift option instead.

Backup only new records from DynamoDB to S3 and load them into RedShift

I saw that similar questions already exist:
Backup AWS Dynamodb to S3
Copying only new records from AWS DynamoDB to AWS Redshift
Loading data from Amazon dynamoDB to redshift
Unfortunately, most of them are outdated (since Amazon has introduced new services) and/or have different answers.
In my case I have two databases (RedShift and DynamoDB) and I have to:
Keep RedShift database up-to-date
Store database backup on S3
To do that I want to use this approach:
Backup only new/modified records from DynamoDB to S3 at the end of the day (one file per day)
Update the RedShift database using the file from S3
So my question is: what is the most efficient way to do that?
I read this tutorial, but I am not sure that AWS Data Pipeline can be configured to "catch" only new records from DynamoDB. If that is not possible, scanning the entire database every time is not an option.
Thank you in advance!
You can use AWS Lambda with a DynamoDB stream (documentation).
You can configure your Lambda function to receive the updated records (from the DynamoDB stream) and then update the Redshift DB.
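A minimal sketch of such a Lambda handler is below, assuming the function is triggered by the table's stream with a NEW_IMAGE view type and writes through the Redshift Data API. The cluster, database, table, and attribute names are placeholders, and the SQL is deliberately simplistic (no batching or upsert handling).

```python
import boto3

redshift_data = boto3.client("redshift-data")


def handler(event, context):
    # Each invocation receives a batch of item-level changes from the stream.
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        new_image = record["dynamodb"]["NewImage"]

        # Hypothetical attributes; a real function would map types properly.
        user_id = new_image["user_id"]["S"]
        status = new_image["status"]["S"]

        redshift_data.execute_statement(
            ClusterIdentifier="my-redshift-cluster",  # placeholder
            Database="analytics",                     # placeholder
            DbUser="loader",                          # placeholder
            Sql="INSERT INTO users_staging (user_id, status) VALUES (:uid, :status)",
            Parameters=[
                {"name": "uid", "value": user_id},
                {"name": "status", "value": status},
            ],
        )
```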

Copying only new records from AWS DynamoDB to AWS Redshift

I see there are tons of examples and documentation for copying data from DynamoDB to Redshift, but we are looking at an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to kill the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?
DynamoDB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time ordered sequence of item level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
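For illustration only, here is a rough sketch of reading a table's stream with the low-level Streams API described above (boto3's dynamodbstreams client). The table name is a placeholder, and production code would normally let Lambda triggers or the Kinesis adapter manage shard iteration instead of looping like this.

```python
import boto3

dynamodb = boto3.client("dynamodb")
streams = boto3.client("dynamodbstreams")

# Find the stream attached to the table (streams must be enabled on it).
stream_arn = dynamodb.describe_table(TableName="MyTable")["Table"]["LatestStreamArn"]
shards = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"]

for shard in shards:
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",  # read from the oldest available change
    )["ShardIterator"]

    # Each record carries one item-level change (INSERT / MODIFY / REMOVE).
    for record in streams.get_records(ShardIterator=iterator)["Records"]:
        print(record["eventName"], record["dynamodb"].get("Keys"))
```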
The Redshift COPY command from DynamoDB can only copy the entire table. There are several ways to achieve this:
Using an AWS EMR cluster and Hive: if you set up an EMR cluster, you can use Hive tables to run queries on the DynamoDB data and move it to S3. From there the data can easily be moved to Redshift.
You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way (for example, one table per time period), then the DynamoDB tables can be dropped after they are copied to Redshift.
This can be solved with a secondary DynamoDB table that tracks only the keys that were changed since the last backup. This table has to be updated wherever the initial DynamoDB table is updated (add, update, delete). At the end of the backup process you delete those keys, or delete each one right after you back up the corresponding row.
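A minimal sketch of that dual-write, under assumed table names (Orders and OrdersChangedKeys) and an assumed order_id key, might look like this:

```python
import time

import boto3

dynamodb = boto3.resource("dynamodb")
main_table = dynamodb.Table("Orders")               # placeholder main table
changed_keys = dynamodb.Table("OrdersChangedKeys")  # placeholder tracking table


def save_order(order):
    # Every write to the main table also records the key in the tracking table.
    main_table.put_item(Item=order)
    changed_keys.put_item(Item={
        "order_id": order["order_id"],   # same primary key as the main table
        "changed_at": int(time.time()),
    })


def changed_keys_for_backup():
    # The backup job reads the tracked keys (pagination omitted), backs up the
    # matching rows, then deletes each marker once the row has been handled.
    for item in changed_keys.scan()["Items"]:
        yield item["order_id"]
        changed_keys.delete_item(Key={"order_id": item["order_id"]})
```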
If your DynamoDB table has either
a timestamp attribute, or
a binary flag attribute which conveys data freshness,
then you can write a Hive query to export only the current day's (fresh) data to S3, and then use a 'KEEP_EXISTING' copy to load this incremental S3 data into Redshift.
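The answer above does the filtering in Hive on EMR; as a rough sketch of the same idea in plain boto3, a filtered scan can pull only the items stamped with today's date and write them to S3 as one file per day. The Events table, updated_at attribute, and bucket name are assumptions, and a scan still reads the whole table, so this only avoids exporting old rows, not reading them.

```python
import datetime
import json

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("Events")  # placeholder table
s3 = boto3.client("s3")

today = datetime.date.today().isoformat()
items = []
scan_kwargs = {"FilterExpression": Attr("updated_at").begins_with(today)}

# Paginate through the scan, keeping only items updated today.
while True:
    page = table.scan(**scan_kwargs)
    items.extend(page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# One incremental file per day, to be loaded into Redshift afterwards.
s3.put_object(
    Bucket="my-backup-bucket",  # placeholder bucket
    Key=f"dynamodb-incremental/{today}.json",
    Body=json.dumps(items, default=str).encode("utf-8"),
)
```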