Archiving millions of small files on S3 to S3 Glacier Deep Archive

Archiving millions of small files on S3 to S3 Glacier Deep Archive - amazon-web-services

I have about 80,000,000 50KB files on S3 (4TB), which I want to transfer to Glacier DA.
I have come to realize there's a cost inefficiency in transferring a lot of small files to Glacier.
Assuming I don't mind archiving my files into a single (or multiple) tar/zips - what would be the best practice to transition those files to Glacier DA?
It is important to note that I only have these files on S3, and not on any local machine.

The most efficient way would be:
Launch an Amazon EC2 instance in the same region as the bucket. Choose an instance type with high-bandwidth networking (eg t3 family). Launch it with spot pricing because you can withstand the small chance that it is stopped. Assign plenty of EBS disk space. (Alternatively, you could choose a Storage Optimized instance since the disk space is included free, but the instance is more expensive. Your choice!)
Download a subset of the files to the instance using the AWS Command-Line Interface (CLI) by specifying a path (subdirectory) to copy. Don't try and do it all at once!
Zip/compress the files on the EC2 instance
Upload the compressed files to S3 using --storage-class DEEP_ARCHIVE
Check that everything seems good, and repeat for another subset!
The above would incur very little charge since you can terminate the EC2 when it is no longer needed, and EBS is only charged while the volumes exist.
If it takes too long to list a subset of the files, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then use this list to specifically copy files, or identify a path/subdirectory to copy.
As an extra piece of advice... if your system is continuing to collect even more files, you might consider collecting the data in a different way (eg streaming to Kinesis Firehose to batch data together), or combining the data on a regular basis rather than letting it creep up to so many files again. Fewer, larger files are much easier to use in processes if possible.

Related

Import data to Amazon AWS SageMaker from S3 or EC2

For an AI project I want to train a model over a dataset which is about 300 GB. I want to use the AWS SageMaker framework.
In SageMaker documentation, they write that SageMaker can import data from AWS S3 bucket. Since the dataset is huge, I zipped it (to several zip files) and uploaded it to a S3 bucket. It took several hours. However, in order to use it I need to unzip the dataset. There are several options:
Unzip directly in S3. This might be impossible to do. See refs below.
Upload the uncompressed data directly, I tried it but it takes too much time and stopped in the middle, uploading only 9% of the data.
Uploading the data to a AWS EC2 machine and unzip it there. But can I import the data to SageMaker from EC2?
Many solutions offer a Python script that downloading the data from S3, unzipping it locally (on the desktop) and then streaming it back to the S3 bucket (see references below). Since I have the original files I can simply upload them to S3, but this takes too long (see 2).
Added in Edit:
I am now trying to upload the uncompressed data using AWS CLI V2.
References:
How to extract files in S3 on the fly with boto3?
https://community.talend.com/s/question/0D53p00007vCjNSCA0/unzip-aws-s3?language=en_US
https://www.linkedin.com/pulse/extract-files-from-zip-archives-in-situ-aws-s3-using-python-tom-reid
https://repost.aws/questions/QUI8fTOgURT-ipoJmN7qI_mw/unzipping-files-from-s-3-bucket
https://dev.to/felipeleao18/how-to-unzip-zip-files-from-s3-bucket-back-to-s3-29o9

The main strategy most commonly used, and also least expensive (since space has its own cost * GB), is not to use the space of the EC2 instance used for the training job but rather to take advantage of the high transfer rate from bucket to instance memory.
This is on the basis that the bucket resides in the same region as the EC2 instance. Otherwise you have to increase the transmission performance, for a fee of course.
You can implement all the strategies for reading files in parallel in your script or reads by chunks, but my advice is to use automated frameworks such as dask/pyspark/pyarrow (in case you need to read dataframes) or review the nature of the storage of these zippers if it can be transformed into a more facilitative form (e.g., a csv transformed into parquet.gzip).
If the nature of the data is different (e.g., images or other), an appropriate lazy data-loading strategy must be identified.
For example, for your zipper problem, you can easily get the list of your files from an S3 folder and read them sequentially.

You already have the data in S3 zipped. What's left is:
Provision a SageMaker notebook instance, or an EC2 instance with enough EBS storage (say 800GB)
Login to the notebook instance, open a shell, copy the data from S3 to local disk.
Unzip the data.
Copy unzip data back to S3.
terminate the instance and the EBS to avoid extra cost.
This should be fast (no less than 250MB/sec) as both the instance has high bandwidth to S3 within the same AWS Region.
Assuming you refer to Training, when talking about using the dataset in SageMaker, read this guide on different storage options for large datasets.

Compress billions of files in S3 bucket

We have lots of files in S3 (>1B), I'd like to compress those to reduce storage costs.
What would be a simple and efficient way to do this?
Thank you
Alex

Amazon S3 cannot compress your data.
You would need to write a program to run on an Amazon EC2 instance that would:
Download the objects
Compress them
Upload the files back to S3
An alternative is to use Storage Classes:
If the data is infrequently accessed, use S3 Standard - Infrequent Access -- this is available immediately and is cheaper as long as data is accessed less than once per month
Glacier is substantially cheaper but takes some time to restore (speed of restore is related to cost)

Fastest and most cost efficient way to copy over an S3 bucket from another AWS account

I have an S3 bucket that is 9TB and I want to copy it over to another AWS account.
What would be the fastest and most cost efficient way to copy it?
I know I can rsync them and also use S3 replication.
Rsync I think will take too long and I think be a bit pricey.
I have not played with S3 replication so I am not sure of its speed and cost.
Are there any other methods that I might not be aware of?
FYI - The source and destination buckets will be in the same region (but different accounts).

There is no quicker way to do it then using sync and I do not believe it is that pricey. You do not mention the number of files you are copying though.
You will pay $0.004 / 10,000 requests on the GET operations on the files you are copying and then $0.005 / 1,000 requests on the PUT operations on the files you are writing. Also, I believe you won't pay data transfer costs if this is in the same region.
If you want to speed this up you could use multiple sync jobs if the bucket has a way of being logically divisible i.e. s3://examplebucket/job1 and s3://examplebucket/job2

You can use S3 Batch Operations to copy large quantities of objects between buckets in the same region.
It can accept a CSV file containing a list of objects, or you can use the output of Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
While copying, it can also update tags, metadata and ACLs.
See: Cross-account bulk transfer of files using Amazon S3 Batch Operations | AWS Storage Blog

I wound up finding the page below and used replication with the copy to itself method.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/

Checking the integrity of an archive uploaded to AWS Glacier

We have daily database backups created and stored on a server. In order to free up space, it was decided that all the backups older than 30 days should be archived using AWS Glacier.
So far so good, I managed to write a PowerShell script to select the required files and upload them to Glacier, but since I am new to all the AWS stuff, I have one question: is it possible to check that the files I have uploaded are indeed in the archive and that there has been no information loss?
My first approach was to send job retrieval requests for all the files that we have uploaded, and 4 hours later compare the checksums and archive ids of our original files and the ones we retrieved from Glacier. However, I think this process takes long, costs extra money, and most importantly, makes no sense at all..
I have also found that I can use inventory retrieval, but as far as I can tell this approach would be very similar to the one described above, just without downloading all the files again.
Lastly, is there even a point to trying to ensure that a file upload was successful if there are no errors? My vague understanding is that AWS would come back with error messages should an upload to Glacier fail, and it computes checksums internally during uploads.
I know that StackOverflow has seen more precisely worded questions, but any clarification regarding this would be immensely appreciated.

You have to try pretty hard to upload a corrupt file to Glacier, because Glacier requires checksums sent with each API request, and will reject the uploads if they don't match the hashes. Obviously you need to spot check your archives, but each one does not need to be downloaded and verified because of the built-in protections.
See Computing Checksums in the Amazon S3 Glacier Developer Guide for descriptions of how this works, on the wire.
Then, consider not using Glacier at all... not directly, anyway. Use S3, and upload your files using the GLACIER or DEEP_ARCHIVE storage class. Or upload them as Standard, with a lifecycle policy that moves them into one of the archive storage classes after 1 day. (Useful because if you delete Glacier or Deep Archive uploads before the minimum storage time, you're billed for the entire minimum time... this way you have a 24 hour "oops I don't like the way I set this up" window, since Standard storage has no minimum storage time period).
Using S3 is a far better solution, because S3 has a much better API and console, but the pricing is identical, because S3 is actually using Glacier as its backend, while you have the advantage of S3 as the frontend. Glacier has essentially no console functionality, is very opaque, and is not really designed for human interaction -- Glacier appears to have been designed as a backing store for an archiving system or service, which is exactly how S3 uses Glacier.
Amazon Simple Storage Service (Amazon S3) supports lifecycle configuration on an S3 bucket, which enables you to transition objects to the Amazon S3 GLACIER storage class for archival. When you transition Amazon S3 objects to the GLACIER storage class, Amazon S3 internally uses Glacier for durable storage at lower cost. Although the objects are stored in Glacier, they remain Amazon S3 objects that you manage in Amazon S3, and you cannot access them directly through Glacier.
https://docs.aws.amazon.com/amazonglacier/latest/dev/introduction.html
It is confusing and unfortunate that AWS recently confused this issue by dumbing things down, rebranding "Glacier" as "S3 Glacier," as if they were the same thing, when they are two very different services, one of which operates in a mode that gives you a gateway to the other. It's similarly unfortunate how Glacier has traditionally been marketed. Without S3 in front, Glacier is not well suited for very many applications.

Best way to transfer data from on-prem to AWS

I have a requirement to transfer data(one time) from on prem to AWS S3. The data size is around 1 TB. I was going through AWS Datasync, Snowball etc... But these managed services are better to migrate if the data is in petabytes. Can someone suggest me the best way to transfer the data in a secured way cost effectively

You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.

If you have no specific requirements (apart from the fact that it needs to be encrypted and the file-size is 1TB) then I would suggest you stick to something plain and simple. S3 supports an object size of 5TB so you wouldn't run into trouble. I don't know if your data is made up of many smaller files or 1 big file (or zip) but in essence its all the same. Since the end-points or all encrypted you should be fine (if your worried, you can encrypt your files before and they will be encrypted while stored (if its backup of something). To get to the point, you can use API tools for transfer or just file-explorer type of tools which have also connectivity to S3 (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). some other point: cost-effectiviness of storage/transfer all depends on how frequent you need the data, if just a backup or just in case. archiving to glacier is much cheaper.

1 TB is large but it's not so large that it'll take you weeks to get your data onto S3. However if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.

This can be done in multiple ways.
Using AWS Cli, we can copy files from local to S3
AWS Transfer using FTP or SFTP (AWS SFTP)
Please refer
There are tools like cloudberry clients which has a UI interface
You can use AWS DataSync Tool

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js