So this is essentially the same as this RDS question: Should I stick only to AWS RDS Automated Backup or DB Snapshots?
Dynamo now has Snapshots and PITR continuous backups.
In RDS it seems PITR backups will fail you if you delete the actual DB instance. What happens if I accidentally delete my DynamoDB table? Will I similarly lose all the PITR backups?
I'm thinking scheduling my own snapshots is only necessary to guard against accidental table deletion, or if I want backups older than 35 days. Is this reasoning correct?
Also, how does dynamo achieve PITR without traditional relational transaction logs?
Is this reasoning correct?
Yes, for the most part.
There appears to be a safety net for dropped tables...
If you need to recover a deleted table that had point in time recovery enabled, you need to contact AWS Support to restore that table within the 35-day recovery window
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/pointintimerecovery_beforeyoubegin.html
...it seems foolhardy to assume that nothing can go wrong, here.
For example:
Important
If you disable point-in-time recovery and later re-enable it on a table, you reset the start time for which you can recover that table
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery_Howitworks.html
...and, of course, as a matter of practice, the privileges required for deleting a table should be granted sparingly.
As a long-time DBA, I would also assert that if you like your data, you never entrust any single entity with the data. For data that isn't ephemeral and can't be reproduced from another source, the data needs to be somewhere else, as well.
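Putting that advice into practice, here is a minimal boto3 sketch that enables PITR on a table and also takes an on-demand backup of it (the table name is a placeholder, and pairing the two like this is my suggestion, not an AWS-documented recipe):

```python
from datetime import datetime, timezone

def backup_name(table_name, now=None):
    """Build a timestamped name for an on-demand backup (pure helper)."""
    now = now or datetime.now(timezone.utc)
    return f"{table_name}-{now:%Y%m%d-%H%M%S}"

def protect_table(table_name):
    """Enable PITR and take an on-demand snapshot of one table."""
    import boto3  # deferred import so the pure helper stays importable anywhere
    client = boto3.client("dynamodb")
    client.update_continuous_backups(
        TableName=table_name,
        PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
    )
    client.create_backup(TableName=table_name, BackupName=backup_name(table_name))
```

Run `protect_table` from a scheduled job (EventBridge + Lambda, cron, etc.) and the on-demand backups survive a table deletion, unlike the PITR window.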
How does dynamo achieve PITR without traditional relational transaction logs?
There must internally be some type of "transaction" logs -- and we already know that the necessary underpinnings are present, otherwise how would DynamoDB Streams and Global Tables be possible? Theoretically, you could roll your own PITR by capturing everything using Streams (although that seems unlikely to be worth the effort)... but it would be a viable mechanism for off-site/off-platform backup.
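If you did want to roll your own change log from Streams, a Lambda handler along these lines could archive each batch to S3. The bucket name and key scheme are hypothetical; this is a sketch of the mechanism, not a hardened backup tool:

```python
import json

def records_to_jsonl(records):
    """Flatten DynamoDB Streams records into JSON-lines change-log entries."""
    lines = []
    for r in records:
        lines.append(json.dumps({
            "event": r["eventName"],                     # INSERT / MODIFY / REMOVE
            "keys": r["dynamodb"]["Keys"],
            "new_image": r["dynamodb"].get("NewImage"),  # absent on REMOVE
        }, sort_keys=True))
    return "\n".join(lines)

def handler(event, context):
    """Lambda entry point: archive one stream batch to S3 (bucket is hypothetical)."""
    import boto3  # deferred so records_to_jsonl stays testable without boto3
    s3 = boto3.client("s3")
    key = f"changelog/{event['Records'][0]['eventID']}.jsonl"
    s3.put_object(Bucket="my-offsite-backup-bucket",
                  Key=key, Body=records_to_jsonl(event["Records"]))
```

Replaying such a log into another table (or another platform) is what would give you the off-site restore path.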
Related
Recently, S3 announced strong read-after-write consistency. I'm curious as to how one can program that. Doesn't it violate the CAP theorem?
In my mind, the simplest way is to wait for the replication to happen and then return, but that would result in performance degradation.
AWS says that there is no performance difference. How is this achieved?
Another thought is that Amazon has a giant index table that keeps track of all S3 objects and where each is stored (triple replication, I believe), and that it updates this index on every PUT/DELETE. Is that technically feasible?
As indicated by Martin above, there is a link to Reddit which discusses this. The top response from u/ryeguy gave this answer:
If I had to guess, s3 synchronously writes to a cluster of storage nodes before returning success, and then asynchronously replicates it to other nodes for stronger durability and availability. There used to be a risk of reading from a node that didn't receive a file's change yet, which could give you an outdated file. Now they added logic so the lookup router is aware of how far an update is propagated and can avoid routing reads to stale replicas.
I just pulled all this out of my ass and have no idea how s3 is actually architected behind the scenes, but given the durability and availability guarantees and the fact that this change doesn't lower them, it must be something along these lines.
Better answers are welcome.
Our single-system assumptions do not carry over to cloud systems. There are many factors involved in the risk analysis: availability, consistency, disaster recovery, backup mechanisms, maintenance burden, cost, and so on. Also, theorems are reference points while designing; we can create our own approach by merging several of them. So I would like to share the link provided by AWS which illustrates the process in detail:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
When you create a cluster with consistent view enabled, Amazon EMR uses an Amazon DynamoDB database to store object metadata and track consistency with Amazon S3. You must grant the EMRFS role permissions to access DynamoDB. If consistent view determines that Amazon S3 is inconsistent during a file system operation, it retries that operation according to rules that you can define. By default, the DynamoDB database has 400 read capacity and 100 write capacity. You can configure read/write capacity settings depending on the number of objects that EMRFS tracks and the number of nodes concurrently using the metadata. You can also configure other database and operational parameters. Using consistent view incurs DynamoDB charges, which are typically small, in addition to the charges for Amazon EMR.
I use AWS RDS as a database for my Spring boot application. I would like to archive earlier than 6 months of data from one specific table. In this context, I have gone through a few articles here but did not get any concrete idea of how to do this. Could anyone please help here?
If you are looking to back up with RDS itself, your options are limited. You can, of course, use automated RDS snapshots, but that won't let you pick a specific table (it will back up the entire database) and can't be set for retention longer than 35 days. Alternatively, you could manually initiate a snapshot, but you can't indicate a retention period. In this case, though, you could instead use the AWS-published rds-snapshot-tool, which will help you automate the snapshot process and let you specify a maximum age of the snapshot. This is likely the easiest way to use RDS for your question. If you only wanted to restore one specific table (and didn't care about having the other tables in the backup), you could restore the snapshot and immediately DROP the tables you don't care about before you start using it.
However, if you really care about backing up only one specific table, then RDS itself is out as a possible means of taking the backups on your behalf. I am assuming a MySQL database for your Spring application, in which case you will need to use the mysqldump tool to grab the table you are interested in. You will need to call that tool from an application and then store the data persistently somewhere (perhaps S3). You will also need to manage the lifecycle of that backup, but if you do use S3, you can set a lifecycle policy to automatically age out and drop old files (backups, in this case).
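A minimal sketch of that approach, assuming MySQL and placeholder connection details (in practice, credentials would come from an option file or Secrets Manager, omitted here):

```python
import subprocess

def dump_command(host, user, database, table, outfile):
    """Build the mysqldump invocation for a single table (pure helper)."""
    return ["mysqldump", "-h", host, "-u", user,
            "--single-transaction", f"--result-file={outfile}",
            database, table]

def archive_table(host, user, database, table, bucket):
    """Dump one table and push the file to S3 (all names here are placeholders)."""
    outfile = f"/tmp/{table}.sql"
    subprocess.run(dump_command(host, user, database, table, outfile), check=True)
    import boto3  # deferred import; only needed for the upload step
    boto3.client("s3").upload_file(outfile, bucket, f"archive/{table}.sql")
```

An S3 lifecycle rule on the `archive/` prefix then handles expiring old dumps, per the paragraph above.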
I have a global DynamoDB table that is currently replicated across 3 regions (eu-west-1, eu-west-2, eu-central-1).
As part of a PoC piece of work I am looking to use AWS Backup to schedule automated backups, I was wondering what the best practice for this was?
Is it acceptable to take backups from a single region, i.e. only schedule backups for the table in eu-west-1? Then when it comes to recovering the table, I can go through the process of first restoring to a non-global table, then adding replicas.
Or is it better practice to ensure all regions' tables are backed up at the same time?
I would suggest that you back up from a single region (if you have a primary region for writes, use that one).
If you restore the DynamoDB table, it needs to create a new DynamoDB table resource. Once this is restored you would then add your global tables which would replicate the data currently stored in the restored DynamoDB table.
If you kept backups from multiple regions instead, you would need a strategy for reconciling any differences between them before restoring.
We have a completely serverless application, with only lambdas and DynamoDB.
The lambdas are running in two regions, and the originals are stored in Cloud9.
DynamoDB is configured with all tables global (bidirectional multi-master replication across the two regions), and the schema definitions are stored in Cloud9.
The only data loss we need to worry about is DynamoDB, which even if it crashed in both regions is presumably diligently backed up by AWS.
Given all of that, what is the point of classic backups? If both regions were completely obliterated, we'd likely be out of business anyway, and anything short of that would be recoverable from AWS.
Not all AWS regions support backup and restore functionality. You'll need to roll your own solution for backups in unsupported regions.
If all the regions your application runs in support the backup functionality, you probably don't need to do it yourself. That is the point of going serverless. You let the platform handle simple DevOps tasks.
Redundancy through regional or cross-regional replication in DynamoDB mainly provides durability, availability, and fault tolerance for your data storage. Even with these built-in capabilities, there can still be a need for backups.
For instance, if data is corrupted by an external threat (such as an attack) or by an application malfunction, you may still want to restore it. This is one place where backups are useful: restoring data to a recent point in time.
There can also be compliance-related requirements that mandate taking backups of your database system.
Another use case is creating new DynamoDB tables for your build pipeline and quality assurance: it is more practical to reuse an existing backup than to copy from the live database (since the copy can consume the provisioned IOPS and affect application behavior).
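That QA-seeding pattern can be sketched with boto3's `restore_table_from_backup`. The client is injected (e.g. `boto3.client("dynamodb")`), and the per-build naming scheme is just an illustration:

```python
def qa_table_name(base, build_id):
    """Derive a per-build table name so pipeline runs don't collide (pure helper)."""
    return f"{base}-qa-{build_id}"

def restore_for_testing(client, backup_arn, base, build_id):
    """Seed a disposable QA table from an on-demand backup, not the live table."""
    target = qa_table_name(base, build_id)
    client.restore_table_from_backup(TargetTableName=target, BackupArn=backup_arn)
    return target
```

The restored table is independent, so tests can write to it freely and drop it when the pipeline finishes.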
I am considering using AWS DynamoDB for an application we are building. I understand that setting a backup job that exports data from DynamoDB to S3 involves a data pipeline with EMR. But my question is do I need to worry about having a backup job set up on day 1? What are the chances that a data loss would happen?
There are multiple use-cases for DynamoDB table data copy elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or, worse yet, a dropped table (code bugs?)
(2) Create a backup in S3 to become the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS system (RDS or on-premise) or other S3 data from log files. Data Integration workflows could involve EMR jobs to be ultimately loaded into Redshift (ETL) for BI queries. Or directly load these into Redshift to do more ELT style - so transforms happen within Redshift
(3) Copy (the whole set or a subset of) data from one table to another (either within the same region or another region) - so the old table can be garbage collected for controlled growth and cost containment. This table-to-table copy could also be used as a readily consumable backup table in case of, say region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that is using it.
(4) Periodic restore of data from S3. Possibly as a way to load back post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat when using these solutions is to note that this is not a point-in-time backup: so any changes to the underlying table happening during the backup might be inconsistent.
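The same caveat applies to any scan-based export you write yourself. A minimal sketch (client injected, e.g. `boto3.client("dynamodb")`) that streams a table out as JSON lines:

```python
import json

def dump_table(client, table_name, write):
    """Scan a table page by page and emit each item as one JSON line.

    Not a point-in-time snapshot: writes landing mid-scan may or may not appear.
    """
    count = 0
    for page in client.get_paginator("scan").paginate(TableName=table_name):
        for item in page["Items"]:
            write(json.dumps(item, sort_keys=True) + "\n")
            count += 1
    return count
```

If you need a consistent snapshot rather than a fuzzy scan, that is exactly what PITR and on-demand backups provide instead.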
This is really subjective. IMO you shouldn't worry about them 'now'.
You can also use simpler solutions than Data Pipeline. Perhaps that would be a good place to start.
After running DynamoDB as our main production database for more than a year, I can say it has been a great experience. No data loss and no downtime. The only things we have to pay attention to are occasional SDK misbehavior and tweaking provisioned throughput.
Note that Data Pipeline is only available in a limited set of regions:
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if, by mistake, you or a colleague deleted a table from the console?). So I would suggest setting up a daily backup; it doesn't cost much in any case.
You can tell the pipeline to consume only, say, 25% of the table's capacity while the backup runs, so your real users don't see any delay. Every backup is "full" (not incremental), so you can periodically delete old backups if you are concerned about storage.