Back up RDS snapshots to S3 automatically - amazon-web-services

I have an RDS instance whose automated backup retention period is 7 days.
I have found that I can export an RDS snapshot to S3 manually.
However, I want to export RDS snapshots to S3 automatically.
How can I do this? Should I set up an EventBridge rule?

The first stop for an answer about an AWS service is normally the AWS documentation.
Since finding the right section in the sea of information can sometimes be a bit overwhelming, please find below the references that should answer your question.
There are three ways you can export an RDS snapshot to S3:
- the AWS Management Console
- the AWS CLI
- the RDS APIs
The AWS documentation page Exporting DB snapshot data to Amazon S3 explains each process in detail.
As described in previous comments, you could, for instance, use a Lambda function to call the RDS APIs.
Even more interesting, AWS provides a GitHub repository with code to automate the export. Please find the code here.
As mentioned in the document, please note that:
Exporting RDS snapshots can take a while depending on your database type and size. The export task first restores and scales the entire database before extracting the data to Amazon S3. The task's progress during this phase displays as Starting. When the task switches to exporting data to S3, progress displays as In progress. The time it takes for the export to complete depends on the data stored in the database. For example, tables with well-distributed numeric primary key or index columns export the fastest. Tables that don't contain a column suitable for partitioning and tables with only one index on a string-based column take longer. This longer export time occurs because the export uses a slower single-threaded process.
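If you do go the Lambda route, the heart of the function is a single start_export_task call via boto3. Below is a minimal sketch, not the code from the repository above; the environment variable names are assumptions, and you would wire the function to an EventBridge schedule (or to the RDS snapshot-completed event) yourself:

    import os
    import datetime
    import boto3

    rds = boto3.client("rds")

    def handler(event, context):
        """Export the most recent automated snapshot of one RDS instance to S3."""
        instance_id = os.environ["DB_INSTANCE_ID"]   # e.g. "my-db" (placeholder)
        bucket = os.environ["EXPORT_BUCKET"]         # target S3 bucket (placeholder)
        iam_role = os.environ["EXPORT_ROLE_ARN"]     # role RDS assumes to write to S3
        kms_key = os.environ["EXPORT_KMS_KEY_ID"]    # KMS key required by the export API

        # Find the newest automated snapshot for this instance.
        snapshots = rds.describe_db_snapshots(
            DBInstanceIdentifier=instance_id, SnapshotType="automated"
        )["DBSnapshots"]
        latest = max(snapshots, key=lambda s: s["SnapshotCreateTime"])

        # Export task identifiers must be unique, so include a timestamp.
        task_id = "export-{}-{}".format(
            instance_id, datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
        )

        rds.start_export_task(
            ExportTaskIdentifier=task_id,
            SourceArn=latest["DBSnapshotArn"],
            S3BucketName=bucket,
            IamRoleArn=iam_role,
            KmsKeyId=kms_key,
        )
        return task_id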

Related

How to push data from Amazon Connect (e.g. Historical Metrics) to Amazon Redshift?

Is there an API I can use, or should I write a Lambda function (GetMetricData)?
What steps should I follow?
What I want to do is push Amazon Connect data (such as historical reports) to Redshift.
What are the possible ways to accomplish this? The data can be pushed at regular intervals; in other words, I want to get the data from my Amazon Connect instance into Redshift.
copy Agent1 from 's3://my-bucket/connect/oblab2/Reports/Historical Metrics Report (1).csv'
iam_role 'arn:aws:iam::my-rule:role/RedshiftRoleForS3'
csv
null as '\000'
IGNOREHEADER 1;
I am using the above to copy the data from S3 into a Redshift table. It works fine, but there is one problem: the very first time the data is copied it is inserted into the table, but when the file in the S3 bucket is updated and we run the same query again, it appends a whole new set of rows instead of overwriting the rows that were already created.
Solutions Architects at AWS created an opinionated solution for this a few years ago, which is available via a Quick Start (CloudFormation template) here. It may not be fully up to date, as it was created some time ago, but it will likely give you a functional model to follow and the example lambda code you would need to build a solution that works for you.
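On the COPY behaviour in the question: COPY always appends, so the usual workaround is to load the file into a staging table and then merge it into the target within one transaction. A rough sketch using the Redshift Data API via boto3; the cluster, database, user, and the "agent" key column are all hypothetical, so adjust them to your schema:

    import boto3

    rds_data = boto3.client("redshift-data")

    SQL_STATEMENTS = [
        # 1. Load the new file into an empty staging table.
        "CREATE TEMP TABLE agent1_stage (LIKE agent1)",
        r"""COPY agent1_stage
            FROM 's3://my-bucket/connect/oblab2/Reports/Historical Metrics Report (1).csv'
            iam_role 'arn:aws:iam::my-rule:role/RedshiftRoleForS3'
            csv null as '\000' IGNOREHEADER 1""",
        # 2. Delete the rows that are being replaced, keyed on the (hypothetical) agent column.
        "DELETE FROM agent1 USING agent1_stage WHERE agent1.agent = agent1_stage.agent",
        # 3. Insert the fresh rows.
        "INSERT INTO agent1 SELECT * FROM agent1_stage",
    ]

    def upsert():
        # batch_execute_statement runs the list of statements as one transaction.
        return rds_data.batch_execute_statement(
            ClusterIdentifier="my-cluster",   # placeholder
            Database="dev",                   # placeholder
            DbUser="awsuser",                 # placeholder
            Sqls=SQL_STATEMENTS,
        )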

How to configure AWS DMS to keep multiple full load files in the same s3 target destination?

I have a table running on AWS RDS. I want to use AWS DMS to export all the data in the table every week. Each week after the export I will truncate the table, so in every subsequent phase the source table will contain only new data, and I planned to use the DMS task to safely offload the data from the RDS table.
I have configured an RDS source and an S3 bucket as the target to export the data as CSV. The replication type is full-load only, and it migrates existing data (no ongoing replication).
The problem I found is that DMS drops the old LOADXXXXXXX.csv file from the target S3 bucket whenever I perform the reload-target operation on the DMS task the next week.
How can I achieve my goal? How to configure AWS DMS to keep multiple full load files in the same s3 target destination?
I was able to keep the old load file in S3 with a bit of a trick on the S3 target side. It is true that AWS DMS doesn't provide anything to keep the old load file after restarting the DMS task, but if you turn on versioning on the target S3 bucket, you can keep the old load file as a previous version.
This solution was able to fulfill my requirements.
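For reference, turning versioning on is a one-time call (or a checkbox in the console). A small boto3 sketch, with the bucket name as a placeholder:

    import boto3

    s3 = boto3.client("s3")

    BUCKET = "my-dms-target-bucket"  # placeholder: your DMS target bucket

    # Turn on versioning so DMS overwrites become previous versions.
    s3.put_bucket_versioning(
        Bucket=BUCKET,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Older LOAD*.csv files then stay retrievable as previous versions.
    versions = s3.list_object_versions(Bucket=BUCKET, Prefix="LOAD")
    for v in versions.get("Versions", []):
        print(v["Key"], v["VersionId"], v["IsLatest"])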
This is said in another thread [here][1]:
For DMS the incremental counter is started over from 1 each time the task is run. It does not have a "Don't override existing objects" feature.
And to this day, it's not possible to change the file naming. So, to do this, you are forced to write your execution results to different folders.
[1]: https://stackoverflow.com/a/60385265

Archiving data of specific table in AWS RDS periodically

I use AWS RDS as the database for my Spring Boot application. I would like to archive data older than 6 months from one specific table. In this context, I have gone through a few articles here but did not get any concrete idea of how to do this. Could anyone please help?
If you are looking to back up with RDS itself, your options are limited. You can, of course, use automated RDS snapshots, but that won't let you pick a specific table (it will back up the entire database) and can't be set for retention longer than 35 days. Alternatively, you could manually initiate a snapshot, but you can't indicate a retention period. In this case though, you could instead use the AWS published rds-snapshot-tool which will help you automate the snapshot process and let you specify a maximum age of the snapshot. This is likely the easiest way to use RDS for your question. If you only wanted to restore one specific table (and didn't care about having the other tables in the backup), you could restore the snapshot and just immediately DROP the tables you don't care about before you start using the snapshot.
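If you do go the manual-snapshot route, taking a snapshot on a schedule is a single boto3 call, for example from a scheduled Lambda function; the instance identifier below is a placeholder, and cleanup of old snapshots would be up to you (or to rds-snapshot-tool):

    import datetime
    import boto3

    rds = boto3.client("rds")

    def take_snapshot(instance_id="my-db-instance"):  # placeholder identifier
        # Manual snapshots are not subject to the 35-day automated-backup limit,
        # but they are never deleted automatically either.
        snapshot_id = "archive-{}-{}".format(instance_id, datetime.date.today().isoformat())
        rds.create_db_snapshot(
            DBSnapshotIdentifier=snapshot_id,
            DBInstanceIdentifier=instance_id,
        )
        return snapshot_id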
However, if you really care about only backing up one specific table, then RDS itself is out as a possible means for taking the backups on your behalf. I am assuming a mysql database for your spring application, in which case you will need to use the mysqldump tool to grab the database table you are interested in. You will need to manually call that tool from an application and then store the data persistently somewhere (perhaps S3). You will also need to manage the lifecycle on that backup, but if you do use S3, you can set a lifecycle policy to automatically age out and drop old files (backups, in this case).
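A minimal sketch of that approach, assuming a MySQL engine and somewhere to run it that can reach the RDS endpoint; all hosts, names, and credentials below are placeholders, and the lifecycle rule that expires old dumps would be configured separately on the bucket:

    import datetime
    import subprocess
    import boto3

    # All connection details and names below are placeholders.
    DB_HOST = "mydb.xxxxxxxx.us-east-1.rds.amazonaws.com"
    DB_USER = "backup_user"
    DATABASE = "appdb"
    TABLE = "orders"
    BUCKET = "my-table-archive-bucket"

    def dump_table_to_s3():
        today = datetime.date.today().isoformat()
        dump_file = "/tmp/{}-{}.sql".format(TABLE, today)

        # Dump only the one table; supply the password via ~/.my.cnf or the
        # MYSQL_PWD environment variable rather than on the command line.
        with open(dump_file, "w") as out:
            subprocess.run(
                ["mysqldump", "-h", DB_HOST, "-u", DB_USER, DATABASE, TABLE],
                stdout=out,
                check=True,
            )

        # Store the dump in S3; a lifecycle rule on this prefix can expire
        # objects older than your retention window.
        boto3.client("s3").upload_file(
            dump_file, BUCKET, "archive/{}/{}".format(TABLE, dump_file.split("/")[-1])
        )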

Move data from S3 to Amazon Aurora Postgres

I have multiple files in different buckets in S3. I need to move these files to Amazon Aurora PostgreSQL every day on a schedule. Every day I will get a new file and, based on the data, an insert or update will happen. I was using Glue for inserts, but for upserts Glue doesn't seem to be the right option. Is there a better way to handle this? I saw that the load command from S3 to RDS would solve the issue, but I didn't get enough details on it. Any recommendations, please?
You can trigger a Lambda function from S3 events; it can then process the file(s) and insert them into Aurora. Alternatively, you can create a cron-type schedule that runs the function daily, on whatever schedule you define.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
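A rough sketch of the S3-event-triggered variant, assuming a CSV with an id column, the psycopg2 driver packaged with the function, and hypothetical table and column names:

    import csv
    import io
    import os
    import boto3
    import psycopg2  # packaged as a Lambda layer or in the deployment bundle

    s3 = boto3.client("s3")

    def handler(event, context):
        # The S3 event tells us which file arrived.
        record = event["Records"][0]["s3"]
        bucket, key = record["bucket"]["name"], record["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = csv.DictReader(io.StringIO(body))

        conn = psycopg2.connect(
            host=os.environ["DB_HOST"],      # Aurora writer endpoint (placeholder)
            dbname=os.environ["DB_NAME"],
            user=os.environ["DB_USER"],
            password=os.environ["DB_PASSWORD"],
        )
        with conn, conn.cursor() as cur:
            for row in rows:
                # Upsert keyed on a hypothetical primary-key column "id".
                cur.execute(
                    """
                    INSERT INTO my_table (id, col_a, col_b)
                    VALUES (%s, %s, %s)
                    ON CONFLICT (id) DO UPDATE
                    SET col_a = EXCLUDED.col_a, col_b = EXCLUDED.col_b
                    """,
                    (row["id"], row["col_a"], row["col_b"]),
                )
        conn.close()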

AWS data pipeline: dump data to 3 s3 nodes

I have a use case wherein I want to take data from DynamoDB and do some transformations on it. After this I want to create 3 CSV files (there will be 3 transformations on the same data) and dump them to 3 different S3 locations.
My architecture would be sort of the following:
Is it possible to do so? I can't seem to find any documentation regarding it. If it's not possible using pipeline, are there any other services which could help me with my use case?
These dumps will be scheduled daily. My other consideration was using AWS Lambda, but as I understand it, Lambda is event-triggered rather than time-scheduled; is that correct?
Yes, it is possible, but not using HiveActivity; use EmrActivity instead. If you look at the Data Pipeline documentation for HiveActivity, it clearly states its purpose, which does not suit your use case:
Runs a Hive query on an EMR cluster. HiveActivity makes it easier to set up an Amazon EMR activity and automatically creates Hive tables based on input data coming in from either Amazon S3 or Amazon RDS. All you need to specify is the HiveQL to run on the source data. AWS Data Pipeline automatically creates Hive tables with ${input1}, ${input2}, and so on, based on the input fields in the HiveActivity object.
Below is how your data pipeline should look. There is also a built-in template, Export DynamoDB table to S3, in the AWS Data Pipeline UI which creates the basic structure for you; you can then extend/customize it to suit your requirements.
To your next question about using Lambda: of course, Lambda can be configured with event-based or schedule-based triggering, but I wouldn't recommend AWS Lambda for any ETL operations, as Lambda functions are time-bound and typical ETL jobs run longer than Lambda's time limits.
AWS has offerings specifically optimized for ETL, AWS Data Pipeline and AWS Glue, and I would always recommend choosing between those two. If your ETL involves data sources not managed within AWS compute and storage services, or any specialty use case which can't be satisfied by the two options above, then AWS Batch would be my next consideration.
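On the scheduling point above: a time-based Lambda trigger is just an EventBridge (CloudWatch Events) rule with a schedule expression. A small boto3 sketch, with the rule name and function ARN as placeholders; the function also needs a resource-based policy that lets EventBridge invoke it:

    import boto3

    events = boto3.client("events")

    # The rule name and function ARN below are placeholders.
    RULE_NAME = "daily-dynamodb-dump"
    FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:dump-to-s3"

    # Fire once a day; cron expressions such as "cron(0 2 * * ? *)" also work.
    events.put_rule(Name=RULE_NAME, ScheduleExpression="rate(1 day)")

    events.put_targets(
        Rule=RULE_NAME,
        Targets=[{"Id": "dump-lambda", "Arn": FUNCTION_ARN}],
    )

    # The function also needs permission for events.amazonaws.com to invoke it,
    # which can be granted with the Lambda add-permission API.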
Thanks amith for your answer. I have been busy for quite some time now, but I did some digging after you posted your answer. It turns out we can dump the data to different S3 locations using HiveActivity as well.
This is how the data pipeline would look in that case.
But I believe writing multiple Hive activities, when your input source is a DynamoDB table, is not a good idea, since Hive doesn't load any data into memory. It does all the computations against the actual table, which could degrade the table's performance. Even the documentation suggests exporting the data in case you need to run multiple queries against the same data. Reference:
Enter a Hive command that maps a table in the Hive application to the data in DynamoDB. This table acts as a reference to the data stored in Amazon DynamoDB; the data is not stored locally in Hive and any queries using this table run against the live data in DynamoDB, consuming the table’s read or write capacity every time a command is run. If you expect to run multiple Hive commands against the same dataset, consider exporting it first.
In my case I needed to perform different types of aggregation on the same data once a day. Since DynamoDB doesn't support aggregations, I turned to Data Pipeline with Hive. In the end we ended up using AWS Aurora, which is MySQL-based.