Copy all files from S3 to EBS drive - amazon-web-services

Is there a way to copy all files from S3 to an EBS drive belonging to an EC2 instance (which may belong to a different AWS account than the S3 buckets)?
We are performing a migration of the whole account and upgrading the instances from t1 to t2 type, and we would like to back up the data from S3 somewhere outside S3 (and outside Glacier, since Glacier is closely tied to S3) in case something goes wrong and we lose the data.
I found only articles and docs talking about EBS snapshots, but I am not sure whether S3 data can actually be copied to EBS (in some way other than manually).
According to the docs, I can SSH to my instance and copy the data from S3 buckets to my local EBS drive, but I have to specify the name of the bucket. Is there a way to copy all the buckets there?
aws s3 sync s3://mybucket <destination>
I would like to achieve this:
Pseudocode:
for each bucket
do
    aws s3 sync s3://bucketName bucketName
done
Is there a way to do this using the AWS CLI?

Amazon S3 is designed to provide 99.999999999% durability of objects over a given year and achieves this by automatically replicating the data you put into a bucket across 3 separate facilities (think datacenters), across Availability Zones, within a Region. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years. In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities.
If you are still concerned about losing your data, you may consider copying the contents of the buckets into new buckets set up in another region. That means you have your data in one system that offers eleven nines of durability, with a copy in another system that also offers eleven nines. Say your original buckets reside in the Dublin (eu-west-1) region: create corresponding 'backup' buckets in the Frankfurt (eu-central-1) region and use the sync command.
e.g.
aws s3 sync s3://originalbucket s3://backupbucket
That way you will have six copies of your data in six different facilities spread across Europe (naturally this is just as relevant if you use multiple regions in the US or Asia). This is a much more redundant configuration than pumping the data into EBS volumes, which have a meagre (when compared to S3) 99.999% availability. The economics are also better, with S3 rates lower than EBS (1 TB in S3 = US$30 vs 1 TB in EBS (Magnetic) = US$50), and you only pay for the capacity you consume, whereas EBS is billed on what you provision.
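For illustration, a minimal sketch of that setup, assuming hypothetical bucket names and the Dublin/Frankfurt regions mentioned above:

# Create the backup bucket in eu-central-1 (Frankfurt)
aws s3 mb s3://backupbucket --region eu-central-1
# Copy everything from the original Dublin bucket into it
aws s3 sync s3://originalbucket s3://backupbucket --source-region eu-west-1 --region eu-central-1

Re-running the sync later only copies objects that are new or have changed.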
Happy days...
References
http://aws.amazon.com/s3/faqs/
http://aws.amazon.com/ebs/faqs/
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

I would agree with rdp-cloud's answer, but if you insist on creating EBS backups, to answer your question: there is no single AWS CLI command that will sync all available buckets in one go. You can use a bash script to get the list of all available buckets and then sync them in a loop:
#!/bin/bash
# Collect all bucket names in the account (third column of `aws s3 ls` output)
BUCKETS=($(aws s3 ls | awk '{print $3}'))
for (( i=0; i<${#BUCKETS[@]}; i++ ))
do
    # Sync each bucket into its own subdirectory under <destination>
    aws s3 sync "s3://${BUCKETS[$i]}" "<destination>/${BUCKETS[$i]}"
done
Make sure to test that aws s3 ls | awk '{print $3}' gives you the exact list you intend to sync before running the above.

Related

Move large volumes (> 50 TB) of data from one S3 bucket to another S3 bucket in another account cost-effectively

I have some S3 buckets in one AWS account which hold a large amount of data (50+ TB).
I want to move it entirely to new S3 buckets in another account and use the first AWS account for another purpose.
The method I know is the AWS CLI using s3 cp / s3 sync / s3 mv, but this would take days when run from my laptop.
I also want it to be cost-effective when the data transfer is taken into account.
The buckets mainly contain zip and rar files ranging in size from 1 GB to 150+ GB, plus other files.
Can someone suggest methods to do this that would be both cost-effective and less time-consuming?
You can use Skyplane, which is much faster and cheaper than aws s3 cp (up to 110x for large files). Skyplane will automatically compress data to reduce egress costs, and will also give you cost estimates before running the transfer.
You can transfer data between buckets in region A and region B with:
skyplane cp -r s3://<region-A-bucket>/ s3://<region-B-bucket>/
If the destination bucket is in the same region as the source bucket (even if it's in a different account), there's no data transfer cost for running s3 cp/sync/mv according to the docs (check the Data transfer tab).
For a fast solution, consider using S3 Transfer Acceleration, but note that this does incur transfer costs.
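For the same-region, cross-account case mentioned above, a rough sketch (the bucket names and account ID are made up): grant the destination account read access to the source bucket, then run the sync with the destination account's credentials.

# policy.json - attach to the SOURCE bucket; lets the destination account (222222222222) read it
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::source-bucket", "arn:aws:s3:::source-bucket/*"]
    }
  ]
}

aws s3api put-bucket-policy --bucket source-bucket --policy file://policy.json

# Run this with the destination account's credentials so it owns the copied objects
aws s3 sync s3://source-bucket s3://destination-bucket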

Data Lakes - S3 and Databricks

I understand Data Lake Zones in S3 and I am looking at establishing 3 zones - LANDING, STAGING, CURATED. If I were in an Azure environment, I would create the Data Lake and have multiple folders as various zones.
How would I do the equivalent in AWS - would it be a separate bucket for each zone (s3://landing_data/, s3://staging_data, s3://curated_data) or a single bucket with multiple folders (e.g. s3://bucket_name/landing/..., s3://bucket_name/staging/...)? I understand that AWS S3 buckets are nothing more than containers.
Also, would I be able to mount multiple S3 buckets on Databricks AWS? If so is there any reference documentation?
Is there any best/recommended approach given that we can read and write to S3 in multiple ways?
I looked at this as well.
S3 Performance Best Practices
There is no single solution - the actual implementation depends on the amount of data, number of consumers/producers, etc. You need to take into account AWS S3 limits, like:
By default you may have only 100 buckets in an account, although this limit can be increased
You may issue 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix (directory) in a single bucket (although the number of prefixes is not limited)
You can mount each of the buckets, or individual folders, into the Databricks workspace as described in the documentation. But it's really not recommended from a security standpoint, as everyone in the workspace will have the same permissions as the role that was used for mounting. Instead, just use full S3 URLs in combination with instance profiles.

Fastest and most cost efficient way to copy over an S3 bucket from another AWS account

I have an S3 bucket that is 9TB and I want to copy it over to another AWS account.
What would be the fastest and most cost efficient way to copy it?
I know I can rsync them and also use S3 replication.
Rsync, I think, will take too long, and I suspect it would be a bit pricey.
I have not played with S3 replication so I am not sure of its speed and cost.
Are there any other methods that I might not be aware of?
FYI - The source and destination buckets will be in the same region (but different accounts).
There is no quicker way to do it than using sync, and I do not believe it is that pricey. You do not mention the number of files you are copying, though.
You will pay $0.004 / 10,000 requests on the GET operations on the files you are copying and then $0.005 / 1,000 requests on the PUT operations on the files you are writing. Also, I believe you won't pay data transfer costs if this is in the same region.
If you want to speed this up you could use multiple sync jobs, if the bucket has a way of being logically divided, e.g. s3://examplebucket/job1 and s3://examplebucket/job2.
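A hedged sketch of that approach, with hypothetical prefixes and destination bucket, running one sync job per prefix in parallel:

# Each prefix becomes its own background sync job
for prefix in job1 job2; do
    aws s3 sync "s3://examplebucket/${prefix}" "s3://destinationbucket/${prefix}" &
done
wait  # block until all background syncs have finished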
You can use S3 Batch Operations to copy large quantities of objects between buckets in the same region.
It can accept a CSV file containing a list of objects, or you can use the output of Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
While copying, it can also update tags, metadata and ACLs.
See: Cross-account bulk transfer of files using Amazon S3 Batch Operations | AWS Storage Blog
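As a very rough sketch of what such a copy job looks like from the CLI (the account ID, bucket names, IAM role and manifest ETag below are all placeholders, not values from the question):

aws s3control create-job \
    --account-id 111111111111 \
    --operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::destination-bucket"}}' \
    --manifest '{"Spec":{"Format":"S3BatchOperations_CSV_20180820","Fields":["Bucket","Key"]},"Location":{"ObjectArn":"arn:aws:s3:::manifest-bucket/manifest.csv","ETag":"<manifest-etag>"}}' \
    --report '{"Bucket":"arn:aws:s3:::report-bucket","Format":"Report_CSV_20180820","Enabled":true,"ReportScope":"AllTasks"}' \
    --priority 10 \
    --role-arn arn:aws:iam::111111111111:role/batch-operations-role

The role has to be able to read the manifest and the source objects and write to the destination bucket.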
I wound up finding the page below and used replication with the copy to itself method.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/

Copying multiple files in large volume between two s3 buckets which are in the different regions

I need to copy a large chunk of data, around 300 GB of files, from bucket A, which is in the us-east region, to bucket B, which is in the ap-southeast region. I also need to change the structure of the bucket: I need to push the files into different folders in bucket B according to the image name in bucket A. I tried using AWS Lambda, but it's not available in ap-southeast.
Also how much would it cost since data will be transferred between regions?
Method
The AWS Command-Line Interface (CLI) has the aws s3 cp command that can be used to move objects between buckets (even in different regions), and can rename them at the same time.
aws s3 cp s3://bucket-in-us/foo/bar.txt s3://bucket-in-ap/foo1/foo2/bar3.txt
There is also the aws s3 sync option that can be used to synchronize content between two buckets, but that doesn't help with your requirement to rename objects.
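For the renaming part, a minimal sketch; the bucket names reuse the example above, and the folder-from-image-name rule (everything before the first underscore) is a made-up placeholder for your actual naming logic:

# Walk every key in the US bucket and copy it into a folder in the AP bucket
# derived from the image file name (here: the part before the first underscore)
aws s3api list-objects-v2 --bucket bucket-in-us --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
    name=$(basename "$key")
    folder="${name%%_*}"
    aws s3 cp "s3://bucket-in-us/${key}" "s3://bucket-in-ap/${folder}/${name}"
done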
Cost
Data Transfer charges from US regions to another region are shown on the Amazon S3 pricing page as US$0.02/GB.
Use bucket replication: create another bucket in your target region, replicate into it, and then do your S3 object key manipulation there.
Read more on S3 cross-region replication.
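A minimal sketch of such a replication rule, assuming hypothetical bucket names and IAM role, and versioning enabled on both buckets:

# replication.json - replicate everything from the source bucket to the target-region bucket
{
  "Role": "arn:aws:iam::111111111111:role/replication-role",
  "Rules": [
    {
      "ID": "replicate-all",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {"Prefix": ""},
      "DeleteMarkerReplication": {"Status": "Disabled"},
      "Destination": {"Bucket": "arn:aws:s3:::bucket-in-ap"}
    }
  ]
}

aws s3api put-bucket-replication --bucket bucket-in-us --replication-configuration file://replication.json

Note that replication only applies to objects written after the rule is in place, so existing objects still need a one-time copy.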

How to make automated S3 Backups

I am working on an app which uses S3 to store important documents. These documents need to be backed up on a daily/weekly rotation basis, much like database backups are maintained.
Does S3 support a feature where a bucket can be backed up into multiple buckets periodically, or perhaps into Amazon Glacier? I want to avoid using an external service as much as possible, and was hoping S3 had some mechanism to do this, as it's a common use case.
Any help would be appreciated.
Quote from Amazon S3 FAQ about durability:
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years
These numbers mean, first of all, that the durability is almost unbeatable. In other words, your data is safe in Amazon S3.
Thus, the only reason you would need to back up your data objects is to prevent their accidental loss (by your own mistake). To solve this problem, Amazon S3 supports versioning of S3 objects. Enable this feature on your S3 bucket and you're safe.
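A minimal sketch of turning that on, with a hypothetical bucket name:

# Keep previous versions of overwritten or deleted objects
aws s3api put-bucket-versioning --bucket my-important-documents --versioning-configuration Status=Enabled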
P.S. Actually, there is one more possible reason - cost optimization. Amazon Glacier is cheaper than S3. I would recommend using AWS Data Pipeline to move S3 data to Glacier routinely.
Regarding Glacier, you can configure your bucket to archive (old) S3 data to Glacier once it is older than a specified duration. This can save you cost if you want infrequently accessed data to be archived.
S3 buckets have lifecycle rules which can automatically move data from S3 to Glacier (a sketch of such a rule is shown below).
But if you want to access these important documents frequently from the backup, you can also back up your data to another S3 bucket. This backup can be scheduled using AWS Data Pipeline daily, weekly, etc.
*Glacier is significantly cheaper than S3 for data you rarely need to retrieve.
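A minimal sketch of such a lifecycle rule, with a hypothetical bucket name and a 90-day threshold:

# lifecycle.json - move objects to Glacier 90 days after creation
{
  "Rules": [
    {
      "ID": "archive-old-documents",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}]
    }
  ]
}

aws s3api put-bucket-lifecycle-configuration --bucket my-important-documents --lifecycle-configuration file://lifecycle.json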
I created a Windows application that will allow you to schedule S3 bucket backups. You can create three kinds of backups: Cumulative, Synchronized and Snapshots. You can also include or exclude root level folders and files from your backups. You can try it free with no registration at https://www.bucketbacker.com