We have lots of files in S3 (>1B), I'd like to compress those to reduce storage costs.
What would be a simple and efficient way to do this?
Thank you
Alex
Amazon S3 cannot compress your data.
You would need to write a program to run on an Amazon EC2 instance that would:
Download the objects
Compress them
Upload the compressed files back to S3 (a sketch of these steps is below)
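A minimal sketch of those steps with the AWS CLI, assuming a hypothetical bucket and prefix (the compressed copies get a .gz suffix, and the original uncompressed objects would still need to be deleted afterwards):
aws s3 cp s3://my-bucket/some/prefix/ ./work/ --recursive
gzip -r ./work/
aws s3 cp ./work/ s3://my-bucket/some/prefix/ --recursive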
An alternative is to use Storage Classes:
If the data is infrequently accessed, use S3 Standard - Infrequent Access -- this is available immediately and is cheaper as long as data is accessed less than once per month
Glacier is substantially cheaper but takes some time to restore (speed of restore is related to cost)
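For example, an existing object can be moved to Standard-IA with a copy in place (the bucket and key here are placeholders):
aws s3 cp s3://my-bucket/path/file.dat s3://my-bucket/path/file.dat --storage-class STANDARD_IA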
I have some S3 buckets in one AWS account which hold a large amount of data (50+ TB).
I want to move it to new S3 buckets in another account entirely and use the first AWS account for another purpose.
The method I know is the AWS CLI using s3 cp/s3 sync/s3 mv, but this would take days when running on my laptop.
I also want it to be cost-effective when considering the data transfer.
The buckets contain mainly zip and rar files ranging in size from 1 GB to 150+ GB, along with other files.
Can someone suggest methods to do this that would be cost-effective as well as less time-consuming?
You can use Skyplane which is much faster and cheaper than aws s3 cp (up to 110x for large files). Skyplane will automatically compress data to reduce egress costs, and will also give you cost estimates before running the transfer.
You can transfer data between buckets in region A and region B with:
skyplane cp -r s3://<region-A-bucket>/ s3://<region-B-bucket>/
If the destination bucket is in the same region as the source bucket (even if it's in a different account), there's no data transfer cost for running s3 cp/sync/mv according to the docs (check the Data transfer tab).
For a fast solution, consider using S3 Transfer Acceleration, but note that this does incur transfer costs.
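One common approach, assuming the destination account's bucket policy grants your IAM identity write access (bucket names here are placeholders), is a plain same-region sync; the --acl flag gives the destination account ownership of the copied objects:
aws s3 sync s3://source-bucket/ s3://destination-bucket/ --acl bucket-owner-full-control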
I have about 80,000,000 50KB files on S3 (4TB), which I want to transfer to Glacier DA.
I have come to realize there's a cost inefficiency in transferring a lot of small files to Glacier.
Assuming I don't mind archiving my files into a single (or multiple) tar/zips - what would be the best practice to transition those files to Glacier DA?
It is important to note that I only have these files on S3, and not on any local machine.
The most efficient way would be:
Launch an Amazon EC2 instance in the same region as the bucket. Choose an instance type with high-bandwidth networking (eg t3 family). Launch it with spot pricing because you can withstand the small chance that it is stopped. Assign plenty of EBS disk space. (Alternatively, you could choose a Storage Optimized instance since the disk space is included free, but the instance is more expensive. Your choice!)
Download a subset of the files to the instance using the AWS Command-Line Interface (CLI) by specifying a path (subdirectory) to copy. Don't try to do it all at once!
Zip/compress the files on the EC2 instance
Upload the compressed files to S3 using --storage-class DEEP_ARCHIVE
Check that everything seems good, and repeat for another subset! (The commands for one iteration are sketched below.)
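A rough sketch of one iteration, with hypothetical bucket names and a date-based prefix:
aws s3 cp s3://my-bucket/2020/01/ ./batch-2020-01/ --recursive
tar -czf batch-2020-01.tar.gz ./batch-2020-01/
aws s3 cp batch-2020-01.tar.gz s3://my-bucket/archives/ --storage-class DEEP_ARCHIVE
rm -rf ./batch-2020-01/ batch-2020-01.tar.gz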
The above would incur very little charge since you can terminate the EC2 when it is no longer needed, and EBS is only charged while the volumes exist.
If it takes too long to list a subset of the files, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then use this list to specifically copy files, or identify a path/subdirectory to copy.
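Setting up such an inventory report can be done from the console or the CLI; a rough example, with placeholder bucket names and report ID:
aws s3api put-bucket-inventory-configuration --bucket my-bucket --id daily-inventory --inventory-configuration '{"Destination": {"S3BucketDestination": {"Bucket": "arn:aws:s3:::inventory-reports-bucket", "Format": "CSV"}}, "IsEnabled": true, "Id": "daily-inventory", "IncludedObjectVersions": "Current", "Schedule": {"Frequency": "Daily"}}'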
As an extra piece of advice... if your system is continuing to collect even more files, you might consider collecting the data in a different way (e.g. streaming to Kinesis Data Firehose to batch data together), or combining the data on a regular basis rather than letting it creep up to so many files again. Fewer, larger files are much easier to work with in downstream processes.
I have an S3 bucket that is 9TB and I want to copy it over to another AWS account.
What would be the fastest and most cost efficient way to copy it?
I know I can rsync them and also use S3 replication.
I think rsync will take too long, and I think it will be a bit pricey.
I have not played with S3 replication so I am not sure of its speed and cost.
Are there any other methods that I might not be aware of?
FYI - The source and destination buckets will be in the same region (but different accounts).
There is no quicker way to do it than using sync, and I do not believe it is that pricey. You do not mention the number of files you are copying, though.
You will pay $0.004 / 10,000 requests on the GET operations on the files you are copying and then $0.005 / 1,000 requests on the PUT operations on the files you are writing. Also, I believe you won't pay data transfer costs if this is in the same region.
If you want to speed this up, you could use multiple sync jobs if the bucket has a way of being logically divided, e.g. s3://examplebucket/job1 and s3://examplebucket/job2
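For example, with two top-level prefixes you could run both syncs in parallel from a shell (the destination bucket name is a placeholder):
aws s3 sync s3://examplebucket/job1 s3://destination-bucket/job1 &
aws s3 sync s3://examplebucket/job2 s3://destination-bucket/job2 &
wait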
You can use S3 Batch Operations to copy large quantities of objects between buckets in the same region.
It can accept a CSV file containing a list of objects, or you can use the output of Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
While copying, it can also update tags, metadata and ACLs.
See: Cross-account bulk transfer of files using Amazon S3 Batch Operations | AWS Storage Blog
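As a rough idea of what creating such a copy job looks like with the CLI -- the account ID, role, bucket ARNs and manifest ETag below are all placeholders, and the linked post covers the cross-account permissions you also need:
aws s3control create-job \
  --account-id 111122223333 \
  --operation '{"S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::destination-bucket"}}' \
  --manifest '{"Spec": {"Format": "S3BatchOperations_CSV_20180820", "Fields": ["Bucket", "Key"]}, "Location": {"ObjectArn": "arn:aws:s3:::manifest-bucket/manifest.csv", "ETag": "example-manifest-etag"}}' \
  --report '{"Bucket": "arn:aws:s3:::report-bucket", "Format": "Report_CSV_20180820", "Enabled": true, "ReportScope": "AllTasks"}' \
  --priority 10 \
  --role-arn arn:aws:iam::111122223333:role/batch-operations-role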
I wound up finding the page below and used replication with the copy to itself method.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/
I have a streaming server running on an EC2 instance, and the video chunks are 8 seconds long. I want to archive the stream for auditing purposes, so I re-record the stream as one file per minute.
Should I save the 8-second chunks to S3 and then to Glacier, or save the combined 1-minute file?
Which choice is better in terms of cost and performance, first for S3 and then for Glacier?
So, to answer your question:
You should upload the bigger file, which is the combined 1-minute file.
In terms of cost, both S3 and Glacier charge you per request in addition to the per-GB storage you use, so uploading bigger chunks means fewer requests made to S3 and Glacier, thus saving costs.
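As a rough illustration: with 8-second chunks a single stream generates about 10,800 PUT requests per day (86,400 seconds / 8), while 1-minute files generate 1,440 per day -- about 7.5x fewer requests for exactly the same number of stored bytes.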
In terms of performance, you said in the comments that you rarely need to retrieve the files, so I recommend you use Glacier. Be aware, though, that once you put a file into Glacier it can take a couple of hours to retrieve it, so it is only suitable if you very rarely, if ever, need the data.
If you need to retrieve the data often, you should use S3 (data retrieval is instant). But S3 charges more for storage than Glacier, so there are pros and cons to each.
I am working on an app which uses S3 to store important documents. These documents need to be backed up on a daily, weekly rotation basis much like how database backups are maintained.
Does S3 support a feature where a bucket can be backed up into multiple buckets periodically, or perhaps into Amazon Glacier? I want to avoid using an external service as much as possible, and was hoping S3 had some mechanism to do this, as it's a common use case.
Any help would be appreciated.
Quote from Amazon S3 FAQ about durability:
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years
First of all, these numbers mean that the durability is almost unbeatable. In other words, your data is safe in Amazon S3.
Thus, the only reason why you would need to back up your data objects is to prevent their accidental loss (by your own mistake). To solve this problem, Amazon S3 offers versioning of S3 objects. Enable this feature on your S3 bucket and you're safe.
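Turning versioning on is a one-liner with the CLI (the bucket name is a placeholder):
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled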
P.S. Actually, there is one more possible reason: cost optimization. Amazon Glacier is cheaper than S3. I would recommend using AWS Data Pipeline to move S3 data to Glacier routinely.
Regarding Glacier, you can configure rules on your bucket to move (old) S3 data to Glacier once it is older than a specified duration. This can save you cost if you want infrequently accessed data to be archived.
S3 buckets have lifecycle rules, with which we can automatically move data from S3 to Glacier.
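A minimal example of such a rule via the CLI, assuming a placeholder bucket, prefix and 30-day threshold:
aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration '{"Rules": [{"ID": "archive-old-documents", "Filter": {"Prefix": "documents/"}, "Status": "Enabled", "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]}]}'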
But if you want to access these important documents frequently from the backup, you can also use another S3 bucket to back up your data. This backup can be scheduled using AWS Data Pipeline daily, weekly, etc.
*Glacier is cheaper than S3 because it is an archival storage class (with retrieval delays and retrieval fees), not because data is stored in a compressed format.
I created a Windows application that will allow you to schedule S3 bucket backups. You can create three kinds of backups: Cumulative, Synchronized and Snapshots. You can also include or exclude root level folders and files from your backups. You can try it free with no registration at https://www.bucketbacker.com