DynamoDB Global Table Backup & Restore - amazon-web-services

I have a global DynamoDB table that is currently replicated across 3 regions (eu-west-1, eu-west-2, eu-central-1).
As part of a PoC piece of work I am looking to use AWS Backup to schedule automated backups, I was wondering what the best practice for this was?
Is it acceptable to take backups of a single region, i.e. only schedule the backups for the table in eu-west-1? Then, when it comes to recovering the table, I can go through the process of first restoring to a non-global table and then adding replicas.
Or is it better practice to ensure all regions' tables are backed up at the same time?

I would suggest that you back up from a single region (if you have a primary region for writes, use that one).
When you restore the DynamoDB table, a new DynamoDB table resource is created. Once it is restored, you would then add your replicas, which would replicate the data currently stored in the restored table.
If you kept backups in multiple regions, you would need a strategy to reconcile any differences between them before restoring.
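For the restore-then-re-replicate flow, a rough boto3 sketch looks like this. The table name, regions and backup ARN are placeholders, and it assumes the recovery point is restorable through DynamoDB's RestoreTableFromBackup API (recovery points created by AWS Backup may need to go through its StartRestoreJob API instead):

```python
# Hypothetical sketch: restore a single-region backup, then re-create the replicas.
# Region names, table name and backup ARN are placeholders.
import boto3

REGION = "eu-west-1"
REPLICA_REGIONS = ["eu-west-2", "eu-central-1"]
BACKUP_ARN = "arn:aws:dynamodb:eu-west-1:123456789012:table/users/backup/..."  # placeholder
RESTORED_TABLE = "users-restored"

ddb = boto3.client("dynamodb", region_name=REGION)

# 1. Restore the backup into a brand-new (non-global) table.
ddb.restore_table_from_backup(TargetTableName=RESTORED_TABLE, BackupArn=BACKUP_ARN)
ddb.get_waiter("table_exists").wait(TableName=RESTORED_TABLE)

# 2. Stream settings are not part of the backup, and global table replicas
#    require a NEW_AND_OLD_IMAGES stream, so enable it explicitly.
ddb.update_table(
    TableName=RESTORED_TABLE,
    StreamSpecification={"StreamEnabled": True, "StreamViewType": "NEW_AND_OLD_IMAGES"},
)
ddb.get_waiter("table_exists").wait(TableName=RESTORED_TABLE)

# 3. Add the replicas back, one region per UpdateTable call. The waiter only
#    checks the base table's status; production code should also poll the
#    replica status before moving on to the next region.
for region in REPLICA_REGIONS:
    ddb.update_table(
        TableName=RESTORED_TABLE,
        ReplicaUpdates=[{"Create": {"RegionName": region}}],
    )
    ddb.get_waiter("table_exists").wait(TableName=RESTORED_TABLE)
```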

Related

Can I stop the replicas of a DynamoDB global table from synchronising?

I have a DynamoDB table in a specific region, but the data it contains supports application instances in multiple regions. I want to create a DDB-per-region setup without downtime.
In the end I want to have multiple instances running, each one in its own region with its own regional database table, but I also want the tables to be in sync while the migration is rolling out.
I know that I can use DynamoDB Streams with Lambda to keep the two tables in sync for as long as I need, but I wonder if there's an easier way.
The idea is to add the extra region to the existing table, making it a global table. This will allow each local instance to use its local database while also keeping the data in sync among regions.
But I don't want to maintain a global table forever, since after the migration is completed there's no reason to keep the replicas in sync.
So, is it possible to stop the replicas of a global table from syncing?
Is it possible to split a global table into its local parts?
I couldn't find anything in the docs, but maybe I missed something.
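For what it's worth, the add-then-remove flow described above maps onto UpdateTable's ReplicaUpdates parameter in the current (2019.11.21) global tables version. Removing a replica stops the syncing, but it deletes the replica table rather than leaving an independent copy, so "splitting" the table would require copying the data out first. A minimal, hypothetical boto3 sketch (table and region names are placeholders):

```python
# Hypothetical sketch of the replica lifecycle (global tables version 2019.11.21).
# Table and region names are placeholders; error handling and status polling omitted.
import boto3

ddb = boto3.client("dynamodb", region_name="us-east-1")

# Make the existing regional table global by adding a replica in the new region.
# (The table must already have a NEW_AND_OLD_IMAGES stream enabled.)
ddb.update_table(
    TableName="my-table",
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)

# ...after the migration is complete, remove the replica again.
# Note: this deletes the replica table in eu-west-1 rather than leaving it as an
# independent copy, so splitting a global table into standalone regional tables
# requires copying the data to a separate, unlinked table first.
ddb.update_table(
    TableName="my-table",
    ReplicaUpdates=[{"Delete": {"RegionName": "eu-west-1"}}],
)
```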

Migrating DynamoDB Data between regions

I have a requirement to copy or move (depending on the case) items between AWS DynamoDB regions.
For example, let's say, there is a USER table in ap-south-1 region and I want to move some items of it to the us-east-1 region.
So is there any way to migrate the data from one region to another, keeping a large data set in mind?
I've read this solution [https://tvernon.tech/blog/move-dynamodb-data-between-regions]. But I am not sure how feasible it would be with a large dataset.
Note that I am not talking about global tables here, which provide multi-region replication as a feature.
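One straightforward option, sketched below under the assumption that the table fits a paginated Scan within your time and RCU budget, is to scan the source region and batch-write into the destination region with boto3; for very large datasets an export to S3 (e.g. via Data Pipeline/EMR, as in the linked post) is usually a better fit. Names are placeholders:

```python
# Hedged sketch: copy items from one region's table to another using a paginated
# Scan and BatchWriteItem. Table/region names and the filter are placeholders.
import boto3

src = boto3.resource("dynamodb", region_name="ap-south-1").Table("USER")
dst = boto3.resource("dynamodb", region_name="us-east-1").Table("USER")

scan_kwargs = {}  # add FilterExpression/ProjectionExpression here to copy a subset
done = False
start_key = None

with dst.batch_writer() as batch:          # batches writes, retries unprocessed items
    while not done:
        if start_key:
            scan_kwargs["ExclusiveStartKey"] = start_key
        page = src.scan(**scan_kwargs)
        for item in page["Items"]:
            batch.put_item(Item=item)      # for a "move", delete from src afterwards
        start_key = page.get("LastEvaluatedKey")
        done = start_key is None
```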

How to perform AWS DynamoDB backup and restore operations by utilizing minimal read/write units?

We are looking for a solution which uses the minimum read/write units of a DynamoDB table for performing full backup, incremental backup and restore operations. Backups should be stored in AWS S3 (open to other alternatives). We have thought of a few options:
1) Using the Python multiprocessing and boto modules we were able to perform full backup and restore operations; it performs well, but it consumes a lot of DynamoDB read/write units.
2) Using the AWS Data Pipeline service, we were able to perform full backup and restore operations.
3) Using DynamoDB Streams with the Kinesis Adapter, or DynamoDB Streams with a Lambda function, we were able to perform incremental backups.
Are there other alternatives for full backup, incremental backup and restore operations? The main limitation/need is a scalable solution that uses minimal read/write units of the DynamoDB table.
Options #1 and #2 are almost the same: both do a Scan operation on the DynamoDB table, thereby consuming the maximum number of RCUs.
Option #3 will save RCUs, but restoring becomes a challenge. If a record is updated more than once, you'll have multiple copies of it in the S3 backup, because each update appears as a separate record in the DynamoDB stream. So, while restoring, you need to pick the latest version of each record. You also need to handle deleted records correctly.
You should choose option #3 if restores are infrequent, in which case you can run an EMR job over the incremental backups when needed. Otherwise, choose #1 or #2.
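As an illustration of option #3, a stream-triggered Lambda along these lines writes each change to S3. The bucket, prefix and key layout are assumptions, chosen so that a restore job can keep only the latest sequence number per item key and treat REMOVE records as deletions:

```python
# Hedged sketch of option #3: a Lambda function subscribed to the table's stream
# writes every change to S3. The bucket name and key layout are placeholders.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-ddb-incremental-backup"   # placeholder

def handler(event, context):
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        seq = record["dynamodb"]["SequenceNumber"]
        # Flatten the item's key values into an S3 prefix, one "folder" per item.
        key_part = "_".join(str(v) for attr in keys.values() for v in attr.values())
        body = {
            "event": record["eventName"],                # INSERT / MODIFY / REMOVE
            "keys": keys,
            "item": record["dynamodb"].get("NewImage"),  # absent for REMOVE
        }
        s3.put_object(
            Bucket=BUCKET,
            Key=f"incremental/{key_part}/{seq}.json",
            Body=json.dumps(body),
        )
```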
On-demand backups are a feature built into the DynamoDB service (accessible via the API, the AWS Management Console and the CLI, as usual), which allows you to take a full backup of a table at a point in time.
This task has no impact on the performance or availability of your tables. All backups are automatically encrypted, cataloged, easily discoverable, and retained until you explicitly delete them.
Additionally, you can restore these backups to a new table at any point.
Along with data, the following is included in the backups:
Global secondary indexes (GSIs)
Local secondary indexes (LSIs)
Streams
Provisioned read and write capacity
The following is NOT included in the backups:
Auto scaling policies
AWS Identity and Access Management (IAM) policies
Amazon CloudWatch metrics and alarms
Tags
Stream settings
Time To Live (TTL) settings
I've blogged more information and a walkthrough here: https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups/
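Scheduling these yourself is mostly a matter of calling CreateBackup on a timer (e.g. from a Lambda on a CloudWatch Events schedule) and pruning old backups, since they are retained until explicitly deleted. A rough sketch, with the table name and retention period as placeholders:

```python
# Hedged sketch of scheduling on-demand backups from a Lambda function.
from datetime import datetime, timedelta, timezone
import boto3

ddb = boto3.client("dynamodb")
TABLE = "my-table"            # placeholder
RETENTION_DAYS = 35           # placeholder

def handler(event, context):
    # Take a new full, on-demand backup of the table.
    ddb.create_backup(
        TableName=TABLE,
        BackupName=f"{TABLE}-{datetime.now(timezone.utc):%Y%m%d-%H%M%S}",
    )
    # Delete user-created backups older than the retention window; they are kept
    # until explicitly deleted, so pruning is up to you.
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    old = ddb.list_backups(TableName=TABLE, TimeRangeUpperBound=cutoff, BackupType="USER")
    for summary in old.get("BackupSummaries", []):
        ddb.delete_backup(BackupArn=summary["BackupArn"])
```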

Do I need to set up backup data pipeline for AWS Dynamo DB on a daily basis?

I am considering using AWS DynamoDB for an application we are building. I understand that setting up a backup job that exports data from DynamoDB to S3 involves a data pipeline with EMR. But my question is: do I need to worry about having a backup job set up on day 1? What are the chances that data loss would happen?
There are multiple use cases for copying DynamoDB table data elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or, worse yet, a dropped table (code bugs?).
(2) Create a backup in S3 as the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS data (RDS or on-premises) or other S3 data such as log files. Data integration workflows could involve EMR jobs whose output is ultimately loaded into Redshift (ETL) for BI queries, or you can load the data directly into Redshift in a more ELT style, so transforms happen within Redshift.
(3) Copy (the whole set or a subset of) data from one table to another, either within the same region or in another region, so the old table can be garbage-collected for controlled growth and cost containment. This table-to-table copy could also serve as a readily consumable backup table in case of, say, region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that uses it.
(4) Periodically restore data from S3, possibly as a way to load post-analytics data back into DynamoDB for serving online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat with these solutions is that they are not point-in-time backups: any changes made to the underlying table while the backup is running may leave the backup inconsistent.
This is really subjective. IMO you shouldn't worry about them 'now'.
You can also use simpler solutions than Data Pipeline. Perhaps that would be a good place to start.
After running DynamoDB as our main production database for more than a year I can say it is a great experience. No data loss and no downtime. The only things we have to deal with are the occasional SDK misbehaviour and tweaking provisioned throughput.
Data Pipeline is only available in a limited set of regions:
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if you or a colleague deleted a table from the console by mistake?). So I would suggest setting up a daily backup - it doesn't cost much in any case.
You can tell the pipeline to consume only, say, 25% of the read capacity while the backup is running so that your real users don't see any delay. Every backup is "full" (not incremental), so at some periodic interval you can delete old backups if you are concerned about storage.
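Data Pipeline exposes that limit as a read-throughput ratio on the export. If you roll your own export instead, the same idea looks roughly like the sketch below, which illustrates the pacing concept rather than what the pipeline actually does; the table name and ratio are placeholders, and it assumes a provisioned-capacity table:

```python
# Hedged sketch of the "use only ~25% of read capacity" idea for a hand-rolled
# export: measure the RCUs each Scan page consumes and sleep long enough to stay
# under the budget. Assumes a provisioned-capacity (not on-demand) table.
import time
import boto3

TABLE = "my-table"        # placeholder
READ_RATIO = 0.25         # fraction of provisioned RCUs the backup may use

client = boto3.client("dynamodb")
provisioned_rcu = client.describe_table(TableName=TABLE)["Table"][
    "ProvisionedThroughput"]["ReadCapacityUnits"]
budget_per_second = provisioned_rcu * READ_RATIO

start_key = None
while True:
    kwargs = {"TableName": TABLE, "ReturnConsumedCapacity": "TOTAL"}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key
    page = client.scan(**kwargs)

    # ...write page["Items"] to S3 here...

    consumed = page["ConsumedCapacity"]["CapacityUnits"]
    time.sleep(consumed / budget_per_second)   # pace the scan to the budget

    start_key = page.get("LastEvaluatedKey")
    if not start_key:
        break
```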

Copying only new records from AWS DynamoDB to AWS Redshift

I see there are tons of examples and documentation for copying data from DynamoDB to Redshift, but we are looking at an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to reload the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?
DynamoDB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time ordered sequence of item level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
Redshift's COPY from DynamoDB can only copy the entire table. There are several ways to achieve an incremental copy:
Using an AWS EMR cluster and Hive: if you set up an EMR cluster, you can use Hive tables to execute queries on the DynamoDB data and move the results to S3. That data can then be easily moved to Redshift.
You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, for example with one table per time period, the DynamoDB tables can be dropped after they are copied to Redshift.
This can be solved with a secondary DynamoDB table that tracks only the keys that have changed since the last copy. That table has to be updated whenever the main DynamoDB table is updated (add, update, delete). At the end of the copy process you delete the tracked keys, either all at once or one by one as each row is copied.
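A minimal sketch of that tracking-table idea, with hypothetical table and attribute names (in practice you might wrap the two writes in a transaction, or populate the tracking table from the main table's stream instead):

```python
# Hedged sketch of the tracking-table approach: every write to the main table also
# records the item's key in a small "changed keys" table, which the nightly copy
# job scans and then clears. Table and attribute names are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
main = dynamodb.Table("orders")             # placeholder
changed = dynamodb.Table("orders-changed")  # placeholder, same key schema as "orders"

def save_order(item):
    # Application write path: update the main table and the tracking table together.
    main.put_item(Item=item)
    changed.put_item(Item={"order_id": item["order_id"]})

def copy_changed_rows(copy_fn):
    # Nightly job: copy only the changed rows, then clear the tracking table.
    # (Pagination of the scan is omitted for brevity.)
    page = changed.scan()
    for key in page["Items"]:
        row = main.get_item(Key=key).get("Item")
        copy_fn(key, row)                   # e.g. stage to S3 for the Redshift COPY
        changed.delete_item(Key=key)        # delete as soon as the row is handled
```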
If your DynamoDB table has either a timestamp attribute, or a binary flag that conveys data freshness as an attribute, then you can write a Hive query to export only the current day's (or otherwise fresh) data to S3 and then copy this incremental S3 data to Redshift with 'KEEP_EXISTING'.
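If you'd rather not run Hive for this, the same timestamp filter can be expressed with a plain boto3 Scan and a FilterExpression, as in the hypothetical sketch below; note that the filter is applied after the read, so the Scan still consumes RCUs for the whole table, which is exactly why the tracking-table and Streams approaches above exist. The attribute, table and bucket names are assumptions:

```python
# Hedged illustration: scan only rows whose hypothetical "updated_at" attribute
# falls on the current day and stage them to S3 for an incremental Redshift COPY.
import json
from datetime import datetime, timezone
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("USER")     # placeholder
s3 = boto3.client("s3")

today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
fresh = []
kwargs = {"FilterExpression": Attr("updated_at").begins_with(today)}
while True:
    page = table.scan(**kwargs)
    fresh.extend(page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

s3.put_object(
    Bucket="my-staging-bucket",                      # placeholder
    Key=f"incremental/{today}.json",
    Body=json.dumps(fresh, default=str),
)
```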