I have a database created in my Databricks environment which is mounted to an AWS S3 location. I am looking for a way to take a snapshot of the database so that I can store it somewhere else and restore it in case of any failure.
Databricks is not like a traditional database where all data is stored "inside" the database. For example, Amazon RDS provides a "snapshot" feature that can dump the entire contents of a database, and the snapshot can then be restored to a new database server if required.
The equivalent in Databricks would be Delta Lake time travel, which allows you to access the database as it was at a previous point in time. Data is not "restored" -- rather, it is simply presented as it previously was at a given timestamp. It is a snapshot without the need to actually create a snapshot.
From Configure data retention for time travel:
To time travel to a previous version, you must retain both the log and the data files for that version.
The data files backing a Delta table are never deleted automatically; data files are deleted only when you run VACUUM. VACUUM does not delete Delta log files; log files are automatically cleaned up after checkpoints are written.
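For example, querying an earlier state of a table only requires a time-travel clause (a minimal PySpark sketch; the table name, timestamp, and version number are placeholders):

```
# Sketch: read a Delta table as it was at an earlier point in time.
# "sales", the timestamp, and the version number are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# As of a specific timestamp
df_asof = spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2023-01-01 00:00:00'")

# Or pin to an explicit version listed by DESCRIBE HISTORY sales
df_v3 = spark.sql("SELECT * FROM sales VERSION AS OF 3")
```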
If, instead, you do want to keep a "snapshot" of the database, a good method would be to create a deep clone of a table, which includes all data. See:
CREATE TABLE CLONE | Databricks on AWS
Using Deep Clone for Disaster Recovery with Delta Lake on Databricks
I think you would need to write your own script to loop through each table and perform this operation. It is not as simple as clicking the "Create Snapshot" button in Amazon RDS.
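Something along these lines would work as a starting point (a rough PySpark sketch; the database names and the idea of a separate "backup" database are my own placeholders, not something Databricks provides out of the box):

```
# Sketch: deep-clone every table in a database into a backup database.
# Database names are placeholders; adjust the target catalog/location as needed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_db = "my_database"
backup_db = "my_database_backup"

spark.sql(f"CREATE DATABASE IF NOT EXISTS {backup_db}")

for row in spark.sql(f"SHOW TABLES IN {source_db}").collect():
    table = row["tableName"]
    spark.sql(
        f"CREATE OR REPLACE TABLE {backup_db}.{table} "
        f"DEEP CLONE {source_db}.{table}"
    )
```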
Related
I have a transactional table that I load from SQL Server to Amazon S3 using AWS DMS. For handling updates I move the old files to an archive and then process only the incremental records every time.
This is fine when I have only insert operations in my database. But the problem comes when we need to accommodate updates. Right now, for any update we read the entire S3 file and modify the records that changed as part of the incremental load. As the data keeps growing, reading the entire file in the S3 bucket and updating it will take more and more time, and in future the job might not finish in time (the job needs to end within 1 hour).
This can be handled using Databricks, where we can use a Delta table to update the records and finally overwrite the existing file. But Databricks is a bit expensive.
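For reference, that Delta update is typically an upsert with MERGE, roughly like this sketch (the table name, S3 path, and join key are placeholders):

```
# Sketch: upsert incremental records into a Delta table (PySpark + Delta Lake).
# The target table, source path, and join key are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

incremental = spark.read.parquet("s3://my-bucket/incremental/")  # today's changes
target = DeltaTable.forName(spark, "historical_snapshot")

(target.alias("t")
    .merge(incremental.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```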
How do we handle the same using AWS Glue?
I use AWS RDS as the database for my Spring Boot application. I would like to archive data older than 6 months from one specific table. In this context, I have gone through a few articles here but did not get any concrete idea of how to do this. Could anyone please help here?
If you are looking to back up with RDS itself, your options are limited. You can, of course, use automated RDS snapshots, but that won't let you pick a specific table (it will back up the entire database) and can't be set for retention longer than 35 days. Alternatively, you could manually initiate a snapshot, but you can't indicate a retention period. In this case, though, you could instead use the AWS-published rds-snapshot-tool, which will help you automate the snapshot process and let you specify a maximum age of the snapshot. This is likely the easiest way to use RDS for your question. If you only wanted to restore one specific table (and didn't care about having the other tables in the backup), you could restore the snapshot and just immediately DROP the tables you don't care about before you start using the snapshot.
However, if you really care about backing up only one specific table, then RDS itself is out as a possible means for taking the backups on your behalf. I am assuming a MySQL database for your Spring application, in which case you will need to use the mysqldump tool to grab the database table you are interested in. You will need to call that tool from an application and then store the dump persistently somewhere (perhaps S3). You will also need to manage the lifecycle of that backup, but if you do use S3, you can set a lifecycle policy to automatically age out and drop old files (backups, in this case).
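A bare-bones sketch of that flow (the host, credentials, table, and bucket names are placeholders, and in practice the password should come from a secrets store, not the code):

```
# Sketch: dump one table with mysqldump and push the dump file to S3.
# Host, credentials, database/table, and bucket names are placeholders.
import subprocess
from datetime import datetime, timezone

import boto3

dump_file = f"orders-{datetime.now(timezone.utc):%Y%m%d}.sql"

# Dump only the table you want to archive
subprocess.run(
    ["mysqldump", "-h", "my-rds-host", "-u", "backup_user", "-psecret",
     "mydb", "orders", f"--result-file={dump_file}"],
    check=True,
)

# Upload to S3; an S3 lifecycle policy can expire old dumps automatically
boto3.client("s3").upload_file(dump_file, "my-backup-bucket", f"archive/{dump_file}")
```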
So I have some historical data on S3 in .csv/.parquet format. Every day I have a batch job running which gives me 2 files: the list of records that need to be deleted from the historical snapshot, and the new records that need to be inserted into the historical snapshot. I cannot run insert/delete queries on Athena. What options (cost-effective and managed by AWS) do I have to solve this?
Objects in Amazon S3 are immutable. This means that they can be replaced, but they cannot be edited.
Amazon Athena, Amazon Redshift Spectrum and Hive/Hadoop can query data stored in Amazon S3. They typically look in a supplied path and load all files under that path, including sub-directories.
To add data to such data stores, simply upload an additional object in the given path.
To delete all data in one object, delete the object.
However, if you wish to delete data within an object, then you will need to replace the object with a new object that has those rows removed. This must be done outside of S3. Amazon S3 cannot edit the contents of an object.
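For illustration, the replace step could look like this (a sketch using pandas and boto3; the bucket, key, and filter condition are placeholders):

```
# Sketch: "delete" rows from a CSV object in S3 by rewriting the whole object.
# Bucket, key, and the filter condition are placeholders.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket, key = "my-data-bucket", "path/data.csv"

# Read the existing object
body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
df = pd.read_csv(io.BytesIO(body))

# Drop the rows that should be deleted
df = df[~df["id"].isin([101, 102, 103])]

# Overwrite the object with the filtered data
s3.put_object(Bucket=bucket, Key=key, Body=df.to_csv(index=False).encode("utf-8"))
```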
See: AWS Glue adds new transforms (Purge, Transition and Merge) for Apache Spark applications to work with datasets in Amazon S3
Databricks has a product called Delta Lake that adds an additional layer between query tools and Amazon S3:
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake supports deleting data from a table because it sits "in front of" Amazon S3.
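For example, a delete can be expressed directly against the Delta table, and Delta rewrites the affected Parquet files in S3 as part of the transaction (a sketch; the table name and predicate are placeholders):

```
# Sketch: delete rows from a Delta table; table name and predicate are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("DELETE FROM historical_snapshot WHERE id IN (101, 102, 103)")
```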
I have a service using GCS to store data, and I am working on a backup plan. Currently, I am using Storage Transfer Service to transfer that data to the backup bucket. My problem is that I want the backup bucket data to be consistent with my other data backups, so I just want a snapshot as of the time the backup started, without shutting down my service.
So my questions would be:
1. What happens if the service writes new data or updates existing data (creating a new version) in the source bucket during the transfer? Will the new records be transferred to the sink bucket?
2. My problem is I don't want any new records or new versions created after the backup timestamp to be in the sink bucket. Is this even possible? Is there a practical solution for this, or an alternative?
If you want to avoid having objects written after the backup start timestamp you'll need to write code to check the object's timestamp before backing it up -- or find a piece of backup software that can do that.
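A rough sketch of that check with the google-cloud-storage client (bucket names are placeholders; note that objects overwritten after the cutoff are skipped entirely, and recovering their earlier generation would require bucket versioning):

```
# Sketch: copy only objects created before the backup start time.
# Bucket names are placeholders; objects overwritten after the cutoff are
# skipped entirely (their older generation would require bucket versioning).
from datetime import datetime, timezone

from google.cloud import storage

client = storage.Client()
source = client.bucket("my-source-bucket")
sink = client.bucket("my-backup-bucket")

backup_started_at = datetime.now(timezone.utc)

for blob in client.list_blobs(source):
    if blob.time_created < backup_started_at:
        source.copy_blob(blob, sink, new_name=blob.name)
```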
I am considering using AWS DynamoDB for an application we are building. I understand that setting up a backup job that exports data from DynamoDB to S3 involves a data pipeline with EMR. But my question is: do I need to worry about having a backup job set up on day 1? What are the chances of data loss happening?
There are multiple use cases for copying DynamoDB table data elsewhere:
(1) Create a backup in S3 on a daily basis, in order to restore in case of accidental deletion of data or, worse yet, a dropped table (code bugs?)
(2) Create a backup in S3 to become the starting point of your analytics workflows. Once this data is backed up in S3, you can combine it with, say, your RDBMS system (RDS or on-premises) or other S3 data such as log files. Data integration workflows could involve EMR jobs that ultimately load into Redshift (ETL) for BI queries. Or load the data directly into Redshift in a more ELT style, so transforms happen within Redshift.
(3) Copy (the whole set or a subset of) data from one table to another (either within the same region or another region), so the old table can be garbage-collected for controlled growth and cost containment. This table-to-table copy could also be used as a readily consumable backup table in case of, say, region-specific availability issues. Or, use this mechanism to copy data from one region to another to serve it from an endpoint closer to the DynamoDB client application that is using it.
(4) Periodic restore of data from S3. Possibly as a way to load post-analytics data back into DynamoDB for serving it in online applications with high-concurrency, low-latency requirements.
AWS Data Pipeline helps schedule all these scenarios with flexible data transfer solutions (using EMR underneath).
One caveat when using these solutions is to note that this is not a point-in-time backup: so any changes to the underlying table happening during the backup might be inconsistent.
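For a sense of what the simplest version of (1) boils down to, here is a bare-bones boto3 sketch rather than Data Pipeline itself (table and bucket names are placeholders; default=str is used because items can contain Decimal values):

```
# Bare-bones sketch of use case (1): dump a DynamoDB table to S3 as JSON.
# Table and bucket names are placeholders; Data Pipeline/EMR does this at scale.
import json
from datetime import datetime, timezone

import boto3

table = boto3.resource("dynamodb").Table("my_table")

items, response = [], table.scan()
items.extend(response["Items"])
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    items.extend(response["Items"])

key = f"dynamodb-backups/my_table/{datetime.now(timezone.utc):%Y-%m-%d}.json"
boto3.client("s3").put_object(
    Bucket="my-backup-bucket",
    Key=key,
    Body=json.dumps(items, default=str).encode("utf-8"),
)
```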
This is really subjective. IMO you shouldn't worry about backups 'now'.
You can also use simpler solutions than Data Pipeline. Perhaps that would be a good place to start.
After running DynamoDB as our main production database for more than a year, I can say it has been a great experience: no data loss and no downtime. The only things we have to deal with are the SDK occasionally misbehaving and tweaking provisioned throughput.
Data Pipeline is only available in a limited set of regions.
https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region
I would recommend setting up a Data Pipeline to back up to an S3 bucket on a daily basis, if you want to be really safe.
DynamoDB itself might be very reliable, but nobody can protect you from your own accidental deletions (what if you or a colleague deleted a table from the console by mistake?). So I would suggest setting up a daily backup; it doesn't cost much in any case.
You can tell the Pipeline to only consume, say, 25% of the capacity while the backup is going on so that your real users don't see any delay. Every backup is "full" (not incremental), so at some periodic interval you can delete old backups if you are concerned about storage.
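To illustrate the capacity point, a hand-rolled scan can throttle itself against a fraction of the table's provisioned read capacity (a sketch that assumes provisioned, not on-demand, capacity; the table name and the 25% budget are placeholders):

```
# Sketch: scan a table while averaging roughly 25% of its provisioned read capacity.
# Assumes a provisioned-capacity table; the table name and budget are placeholders.
import time

import boto3

table = boto3.resource("dynamodb").Table("my_table")
read_budget_per_sec = 0.25 * table.provisioned_throughput["ReadCapacityUnits"]

kwargs = {"ReturnConsumedCapacity": "TOTAL"}
while True:
    response = table.scan(**kwargs)
    # ... write response["Items"] to the backup destination here ...

    # Sleep long enough that the capacity just consumed averages out to the budget
    consumed = float(response["ConsumedCapacity"]["CapacityUnits"])
    time.sleep(consumed / read_budget_per_sec)

    if "LastEvaluatedKey" not in response:
        break
    kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
```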