Archiving data of specific table in AWS RDS periodically - amazon-web-services

I use AWS RDS as the database for my Spring Boot application. I would like to archive data older than 6 months from one specific table. In this context, I have gone through a few articles here but did not get any concrete idea of how to do this. Could anyone please help?

If you are looking to back up with RDS itself, your options are limited. You can, of course, use automated RDS snapshots, but those won't let you pick a specific table (they back up the entire database) and can't be retained for longer than 35 days. Alternatively, you could manually initiate a snapshot, but you can't specify a retention period. In that case, though, you could instead use the AWS-published rds-snapshot-tool, which helps you automate the snapshot process and lets you specify a maximum snapshot age. This is likely the easiest way to do this with RDS itself. If you only wanted to restore that one specific table (and didn't care about having the other tables in the backup), you could restore the snapshot and immediately DROP the tables you don't care about before you start using the restored instance.
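For illustration only, a rough boto3 sketch of that restore-and-drop idea might look like the following; the instance and snapshot identifiers are hypothetical placeholders, not anything from the question.

```python
# Hedged sketch: restore a snapshot to a temporary instance, then drop everything
# except the table you actually want. All identifiers below are hypothetical.
import boto3

rds = boto3.client("rds")

rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="orders-archive-restore",     # hypothetical temporary instance
    DBSnapshotIdentifier="rds:mydb-2019-01-01-00-00",  # hypothetical snapshot name
    DBInstanceClass="db.t3.medium",
)

# Once the new instance is available, connect with your usual SQL client and
# DROP the tables you don't care about, keeping only the table of interest.
```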
However, if you really only care about backing up one specific table, then RDS itself is out as a means of taking the backups on your behalf. I am assuming a MySQL database for your Spring application, in which case you will need to use the mysqldump tool to grab just the table you are interested in. You will need to call that tool from an application or scheduled job and then store the output somewhere persistent (perhaps S3). You will also need to manage the lifecycle of those backups, but if you do use S3, you can set a lifecycle policy to automatically age out and delete old files (backups, in this case). A sketch of this approach follows.
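A minimal sketch of such a job, if you go the mysqldump route: the endpoint, credentials, database, table, and bucket names are all hypothetical placeholders, and the MySQL password is assumed to come from ~/.my.cnf or the MYSQL_PWD environment variable rather than being hard-coded.

```python
# Hedged sketch: dump a single table with mysqldump and push it to S3.
import subprocess, gzip, datetime
import boto3

DB_HOST = "mydb.xxxxxxxx.us-east-1.rds.amazonaws.com"  # hypothetical endpoint
DB_USER = "backup_user"                                # hypothetical user
DB_NAME = "appdb"                                      # hypothetical database
TABLE = "orders"                                       # hypothetical table
BUCKET = "my-table-archives"                           # hypothetical bucket

def backup_table():
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H%M%SZ")
    key = f"{TABLE}/{TABLE}-{stamp}.sql.gz"

    # mysqldump writes the table's schema and data to stdout
    dump = subprocess.run(
        ["mysqldump", "-h", DB_HOST, "-u", DB_USER, DB_NAME, TABLE],
        check=True, capture_output=True,
    )

    # compress and upload; an S3 lifecycle rule on the bucket can expire old dumps
    boto3.client("s3").put_object(
        Bucket=BUCKET, Key=key, Body=gzip.compress(dump.stdout)
    )

if __name__ == "__main__":
    backup_table()
```

You could run something like this from a cron job, a scheduled container task, or the Spring application itself, with an S3 lifecycle rule handling expiry of old dumps.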

Related

Back up RDS snapshots to S3 automatically

I have an RDS instance whose automated backup period is 7 days.
I have found that I can export an RDS snapshot to S3 manually.
However, I want to export RDS snapshots to S3 automatically.
How can I do this? Should I set up an EventBridge rule?
The first stop for an answer about an AWS service is normally the AWS documentation.
Since finding the right section in the sea of information can sometimes be a bit overwhelming, please find below the references that should answer your question.
There are 3 ways you could export an RDS snapshot to S3:
The management console
The AWS CLI
The RDS APIs
The Exporting DB snapshot data to Amazon S3 AWS document explains each process in detail.
As described in previous comments, you could, for instance, use a Lambda function to call the RDS APIs.
Even more interesting, AWS provides a GitHub repository with code to automate the export. Please find the code here.
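As a rough illustration of the Lambda-plus-RDS-API approach (this is not the code from the linked repository), a minimal boto3 handler might look like this; the ARNs, bucket, and identifiers are hypothetical placeholders.

```python
# Hedged sketch of a Lambda that starts a snapshot export to S3 via the RDS API.
# All ARNs and names below are hypothetical; the IAM role must be allowed to write
# to the bucket, and the KMS key is required by the export API.
import datetime
import boto3

rds = boto3.client("rds")

def handler(event, context):
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    return rds.start_export_task(
        ExportTaskIdentifier=f"mydb-export-{stamp}",
        SourceArn="arn:aws:rds:us-east-1:123456789012:snapshot:rds:mydb-snapshot",
        S3BucketName="my-rds-exports",
        IamRoleArn="arn:aws:iam::123456789012:role/rds-s3-export-role",
        KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/1234abcd-ex-am-pl-e",
    )
```

A function like this could be invoked on a schedule by an EventBridge rule, which matches the EventBridge idea in the question.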
As mentioned in the document, please note that:
Exporting RDS snapshots can take a while depending on your database type and size. The export task first restores and scales the entire database before extracting the data to Amazon S3. The task's progress during this phase displays as Starting. When the task switches to exporting data to S3, progress displays as In progress. The time it takes for the export to complete depends on the data stored in the database. For example, tables with well-distributed numeric primary key or index columns export the fastest. Tables that don't contain a column suitable for partitioning and tables with only one index on a string-based column take longer. This longer export time occurs because the export uses a slower single-threaded process.

dynamodb PITR replacement for scripted Snapshots?

So this is essentially the same as this RDS question: Should I stick only to AWS RDS Automated Backup or DB Snapshots?
Dynamo now has Snapshots and PITR continuous backups.
In RDS it seems PITR backups will fail you if you delete the actual DB instance. What happens if I delete my DynamoDB table accidentally? Will I similarly lose all the PITR backups?
I'm thinking that scheduling my own snapshots is only necessary to guard against accidental table deletion, or if I want backups older than 35 days. Is this reasoning correct?
Also, how does dynamo achieve PITR without traditional relational transaction logs?
Is this reasoning correct?
Yes, for the most part.
There appears to be a safety net for dropped tables...
If you need to recover a deleted table that had point in time recovery enabled, you need to contact AWS Support to restore that table within the 35-day recovery window
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/pointintimerecovery_beforeyoubegin.html
...it seems foolhardy to assume that nothing can go wrong, here.
For example:
Important
If you disable point-in-time recovery and later re-enable it on a table, you reset the start time for which you can recover that table
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery_Howitworks.html
...and, of course, as a matter of practice, the privileges required for deleting a table should be granted sparingly.
As a long-time DBA, I would also assert that if you like your data, you never entrust any single entity with the data. For data that isn't ephemeral and can't be reproduced from another source, the data needs to be somewhere else, as well.
How does dynamo achieve PITR without traditional relational transaction logs?
There must internally be some type of "transaction" log -- and we already know that the necessary underpinnings are present; otherwise, how would DynamoDB Streams and Global Tables be possible? Theoretically, you could roll your own PITR by capturing everything with Streams (although that seems unlikely to be worth the effort)... but it would be a viable mechanism for off-site/off-platform backup.
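As a sketch of what such a Streams-based, off-platform capture could look like (not a full PITR implementation; the bucket name is a hypothetical placeholder, and the stream is assumed to be enabled with NEW_AND_OLD_IMAGES):

```python
# Hedged sketch: a Lambda subscribed to a table's DynamoDB Stream that writes each
# change record to S3 as a building block for off-platform backup.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dynamodb-change-log"  # hypothetical bucket

def handler(event, context):
    for record in event["Records"]:
        table_name = record["eventSourceARN"].split("/")[1]
        seq = record["dynamodb"]["SequenceNumber"]
        s3.put_object(
            Bucket=BUCKET,
            Key=f"changes/{table_name}/{seq}.json",
            Body=json.dumps({
                "eventName": record["eventName"],        # INSERT / MODIFY / REMOVE
                "keys": record["dynamodb"].get("Keys"),
                "newImage": record["dynamodb"].get("NewImage"),
                "oldImage": record["dynamodb"].get("OldImage"),
            }),
        )
```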

AWS S3 w/ tags, DynamoDB, Redshift?

I'm comparing cloud storage options for a large set of files with certain 'attributes' to query. Right now it's about 2.5 TB of files and growing quickly. I need high-throughput writes and queries. I'll first write the file and its attributes to the store, then query to summarize attributes (counts, etc.), and additionally query attributes to pull a small set of files (by date, name, etc.).
I've explored Google Cloud Datastore as a noSQL option, but trying to compare it to AWS services.
One option would be to store the files in S3 with 'tags'. I believe you can query these with the REST API, but I'm concerned about performance. I have also seen suggestions to connect Athena, but I'm not sure whether it will pull in the tags or whether that's the right use case.
The other option would be something like DynamoDB, or possibly a large RDS instance? Redshift says it's for petabyte scale, which we're not quite at...
Thoughts on the best AWS storage solution? Pricing is a consideration, but I'm more concerned with the best solution moving forward.
You don't want to store the files themselves in a database like RDS or Redshift. You should definitely store the files in S3, but you should probably store or copy the metadata somewhere that is more indexable and searchable.
I would suggest setting up a new-object trigger in S3 that invokes a Lambda function whenever a new file is uploaded. The Lambda function could take the file location, size, any tags, etc. and insert that metadata into Redshift, DynamoDB, Elasticsearch, or an RDS database like Aurora, where you could then run queries against that metadata. Unless you are talking about many millions of files, the metadata will be fairly small and you probably won't need the scale of Redshift. The exact database you pick for the metadata will depend on your use case, such as the specific queries you want to perform.
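A hedged sketch of that trigger, assuming DynamoDB as the metadata store and a hypothetical table named file-metadata with fileKey as its partition key:

```python
# Hedged sketch: Lambda invoked by S3 ObjectCreated events; reads each object's size
# and tags and indexes them in a DynamoDB table for querying.
from urllib.parse import unquote_plus
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("file-metadata")  # hypothetical table

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded

        head = s3.head_object(Bucket=bucket, Key=key)
        tags = s3.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]

        table.put_item(Item={
            "fileKey": key,                              # assumed partition key
            "bucket": bucket,
            "size": head["ContentLength"],
            "lastModified": head["LastModified"].isoformat(),
            "tags": {t["Key"]: t["Value"] for t in tags},
        })
```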

Amazon Redshift Automated Snapshot Trigger

What actually triggers an automatic incremental backup/snapshot for Amazon Redshift? Is it time-based? The site says it "periodically takes snapshots and tracks incremental changes to the cluster since the last snapshot", and I know that whenever I modify the cluster itself (delete it, resize it, or change the node type), a snapshot is taken. But what about when a database on the cluster is altered? I have inserted, loaded, and deleted many rows, but no automatic snapshot is taken. Would I just have to do manual backups then?
I have asked around and looked online, and no one has been able to give me an answer. I am trying to figure out an optimal backup strategy for my workload.
Automated backups are taken every 8 hours or every 5 GB of inserted data, whichever happens first.
Source: I work for AWS Redshift.
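If that cadence isn't frequent enough for your recovery objectives, you can take manual snapshots on your own schedule (for example, from a scheduled Lambda). A minimal boto3 sketch, with a hypothetical cluster identifier:

```python
# Hedged sketch: create a manual Redshift snapshot; the cluster identifier is a placeholder.
import datetime
import boto3

redshift = boto3.client("redshift")

def handler(event, context):
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    return redshift.create_cluster_snapshot(
        SnapshotIdentifier=f"my-cluster-manual-{stamp}",
        ClusterIdentifier="my-cluster",
    )
```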

Do I need to set up backup data pipeline for AWS Dynamo DB on a daily basis?

I am considering using AWS DynamoDB for an application we are building. I understand that setting up a backup job that exports data from DynamoDB to S3 involves a Data Pipeline with EMR. But my question is: do I need to worry about having a backup job set up on day 1? What are the chances that data loss would happen?
There are multiple use-cases for DynamoDB table data copy elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or, worse yet, a dropped table (code bugs?).
(2) Create a backup in S3 to serve as the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, data from your RDBMS (RDS or on-premises) or other S3 data such as log files. Data integration workflows could involve EMR jobs whose output is ultimately loaded into Redshift (ETL) for BI queries, or you could load the data directly into Redshift and do the transforms there (more ELT style).
(3) Copy (the whole set or a subset of) data from one table to another (either within the same region or to another region), so the old table can be garbage-collected for controlled growth and cost containment. This table-to-table copy could also serve as a readily consumable backup table in case of, say, region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that uses it (a sketch of such a copy appears below).
(4) Periodic restore of data from S3, possibly as a way to load post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat with these solutions is that they are not point-in-time backups: any changes to the underlying table that happen while the backup is running may leave it inconsistent.
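For use case (3), a table-to-table copy can be as simple as a scan plus batched writes. A hedged boto3 sketch follows; the table names are hypothetical, and a plain Scan is only practical for modest table sizes (Data Pipeline/EMR is the heavier-duty option).

```python
# Hedged sketch: copy all items from one DynamoDB table to another.
import boto3

dynamodb = boto3.resource("dynamodb")
src = dynamodb.Table("orders")           # hypothetical source table
dst = dynamodb.Table("orders-archive")   # hypothetical destination table

def copy_table():
    scan_kwargs = {}
    with dst.batch_writer() as batch:    # batches and retries writes automatically
        while True:
            page = src.scan(**scan_kwargs)
            for item in page["Items"]:
                batch.put_item(Item=item)
            if "LastEvaluatedKey" not in page:
                break
            scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```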
This is really subjective. IMO you shouldn't worry about it 'now'.
You can also use simpler solutions than Data Pipeline; perhaps that would be a good place to start (one example follows below).
After running DynamoDB as our main production database for more than a year, I can say it has been a great experience. No data loss and no downtime. The only things we have to deal with are the occasional SDK misbehavior and tweaking provisioned throughput.
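One such simpler option (my reading of "simpler solutions"; it is not named explicitly above) is DynamoDB's native on-demand backup, created here from a scheduled function. The table name is a hypothetical placeholder.

```python
# Hedged sketch: take an on-demand DynamoDB backup on a schedule.
import datetime
import boto3

dynamodb = boto3.client("dynamodb")

def handler(event, context):
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    return dynamodb.create_backup(
        TableName="orders",                  # hypothetical table
        BackupName=f"orders-daily-{stamp}",
    )
```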
Data Pipeline is only available in a limited set of regions:
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if you or a colleague ended up deleting a table from the console by mistake?). So I would suggest setting up a daily backup; it doesn't cost much in any case.
You can tell the pipeline to consume only, say, 25% of the table's read capacity while the backup is running, so that your real users don't see any delay. Every backup is "full" (not incremental), so at some periodic interval you can delete old backups if you are concerned about storage.
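For the housekeeping part, an S3 lifecycle rule can age out old backup objects automatically. A hedged sketch with a hypothetical bucket, prefix, and 90-day window:

```python
# Hedged sketch: expire backup objects under a prefix after 90 days.
import boto3

boto3.client("s3").put_bucket_lifecycle_configuration(
    Bucket="my-dynamodb-backups",                # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-old-backups",
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},    # hypothetical prefix for the exports
            "Expiration": {"Days": 90},
        }]
    },
)
```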