Costs Related to Individual Bucket Items in S3

My AWS S3 costs have been going up pretty quickly for usage type "DataTransfer-Out-Bytes". I have thousands of images in this one bucket and I can't seem to find a way to drill down into the bucket to see which individual items might be causing the increase. Is there a way to see which individual files are contributing to the higher data transfer cost?

Use CloudFront if you can - it's cheaper than serving directly from S3 (if you set your cache headers properly!), and CloudFront includes a popular objects report, which would answer your question.
If you're using S3 alone, you need to enable server access logging on the bucket (more storage cost) and then crunch the data in the logs (more data transfer cost) to get your answer. You can use AWS Athena to query the S3 access logs, or use Unix command-line tools like grep/wc/uniq/cut on the log files locally or from a server to find the culprits.
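If you go the raw-log route, a sketch of the command-line approach: assuming the standard space-delimited S3 server access log format, with default awk field splitting field 9 is the object key and field 15 is "Bytes Sent" (verify these positions against your own log lines). The sample log line here is illustrative; point awk at your real log files instead.

```shell
# Sum bytes sent per object key across S3 server access logs and list the
# biggest contributors to DataTransfer-Out. The sample file below stands in
# for real downloaded log files.
cat > access-log-sample <<'EOF'
79a5 examplebucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 79a5 3E57427F REST.GET.OBJECT photos/cat.jpg "GET /examplebucket/photos/cat.jpg HTTP/1.1" 200 - 5242880 5242880 70 50 "-" "curl/7.0" -
79a5 examplebucket [06/Feb/2019:00:01:12 +0000] 192.0.2.4 79a5 3E57427G REST.GET.OBJECT photos/cat.jpg "GET /examplebucket/photos/cat.jpg HTTP/1.1" 200 - 5242880 5242880 70 50 "-" "curl/7.0" -
EOF
# $9 = object key, $15 = bytes sent ("-" when nothing was transferred)
awk '$15 != "-" { bytes[$9] += $15 }
     END { for (k in bytes) printf "%d %s\n", bytes[k], k }' access-log-sample |
  sort -rn | head -20
```

The same awk one-liner works unchanged across thousands of log files (`access-log-*`), which is usually enough to spot a handful of hot objects without standing up Athena.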

Related

Techniques for AWS CloudTrail and VPC Flow log S3 archival

Following AWS-recommended best practices, we have organization-wide CloudTrail and VPC flow logging configured to log to a centralized logs archive account. Since CloudTrail and VPC flow are organization-wide in multiple regions, we're getting a high number of new log files saved to S3 daily. Most of these files are quite small (several KB).
The high number of small log files is fine while they're in the STANDARD storage class, since you just pay for total data size without any minimum object size overhead. However, we've found it challenging to deep-archive these files after 6 or 12 months, since every storage class other than STANDARD imposes per-object overhead: STANDARD-IA has a 128 KB minimum billable object size, and GLACIER, while it has no minimum size, adds 40 KB of metadata per object, etc.
What are the best practices for archiving a large number of small S3 objects? I could use a Lambda to download multiple files, re-bundle them into a larger file, and re-store it, but that would be pretty expensive in terms of compute time and GET/PUT requests. As far as I can tell, S3 Batch Operations has no support for this. Any suggestions?
Consider using a tool like S3-utils concat. This is not an AWS-supported tool, but an open source tool that performs the type of action you require.
You'll probably want the pattern-matching syntax, which allows you to create a single file for each day's logs.
$ s3-utils concat my.bucket.name 'date-hierarchy/(\d{4})/(\d{2})/(\d{2})/*.gz' 'flat-hierarchy/$1-$2-$3.gz'
This could be run as a daily job so that each day is condensed into one file. It is definitely recommended to run this from a resource on the Amazon network (e.g. an EC2 instance in your VPC with an S3 gateway endpoint attached) to improve file transfer performance and avoid data transfer out fees.
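Worth noting: CloudTrail and VPC Flow Log objects are delivered gzip-compressed, and gzip members can be concatenated byte-for-byte - `cat a.gz b.gz` is itself a valid gzip stream. So a bundling job (whether s3-utils or your own script) can simply concatenate a day's objects without recompressing anything. A local sketch of the idea (the S3 download/upload side is omitted):

```shell
# Two stand-ins for small gzip-compressed log objects.
printf 'log line A\n' | gzip > part1.gz
printf 'log line B\n' | gzip > part2.gz
# Concatenating gzip members yields one valid gzip file:
cat part1.gz part2.gz > day.gz   # one archive object instead of many small ones
gzip -dc day.gz                  # decompresses to both lines, in order
```

Because no decompression or recompression is involved, this is cheap in compute even for very large numbers of files.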

Fastest and most cost efficient way to copy over an S3 bucket from another AWS account

I have an S3 bucket that is 9TB and I want to copy it over to another AWS account.
What would be the fastest and most cost efficient way to copy it?
I know I can rsync them and also use S3 replication.
I think rsync will take too long and be a bit pricey.
I have not played with S3 replication so I am not sure of its speed and cost.
Are there any other methods that I might not be aware of?
FYI - The source and destination buckets will be in the same region (but different accounts).
There is no quicker way to do it than using sync, and I do not believe it is that pricey. You do not mention how many files you are copying, though.
You will pay $0.004 per 10,000 requests for the GET operations on the files you are copying, and then $0.005 per 1,000 requests for the PUT operations on the files you are writing. Also, I believe you won't pay data transfer costs, since this is within the same region.
If you want to speed this up, you could run multiple sync jobs, provided the bucket can be divided logically, e.g. s3://examplebucket/job1 and s3://examplebucket/job2.
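To put those request prices in perspective, a back-of-envelope sketch (using the figures quoted in this answer; verify against the current S3 pricing page for your region) for copying one million objects:

```shell
# Estimate request costs for a bucket-to-bucket copy of N objects in the
# same region (no data transfer charge assumed, per the answer above).
objects=1000000
awk -v n="$objects" 'BEGIN {
  get = n / 10000 * 0.004   # GET requests against the source bucket
  put = n / 1000  * 0.005   # PUT requests against the destination bucket
  printf "GET: $%.2f  PUT: $%.2f  total: $%.2f\n", get, put, get + put
}'
# prints: GET: $0.40  PUT: $5.00  total: $5.40
```

For a 9 TB bucket the dominant variable is therefore object count, not size - millions of small objects cost far more in requests than a few thousand large ones.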
You can use S3 Batch Operations to copy large quantities of objects between buckets in the same region.
It can accept a CSV file containing a list of objects, or you can use the output of Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
While copying, it can also update tags, metadata and ACLs.
See: Cross-account bulk transfer of files using Amazon S3 Batch Operations | AWS Storage Blog
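For reference, a Batch Operations CSV manifest is simply one object per line as bucket,key (with an optional third column for a version ID); the bucket name and keys below are placeholders:

```
examplebucket,photos/cat.jpg
examplebucket,photos/dog.jpg
examplebucket,backups/2020-01-01.tar.gz
```

An S3 Inventory report can be used directly in place of a hand-built CSV.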
I wound up finding the page below and used replication with the copy to itself method.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/

How to optimize download speeds from AWS S3 bucket?

We keep user-specific downloadable files in AWS S3 buckets in the N. Virginia region. Our clients download the files from these buckets all over the world. Files range in size from 1-20 GB. For larger files, clients in non-US locations experience and complain about slow or interrupted downloads. How can we optimize these downloads?
We are thinking about the following approaches:
1. Accelerated downloads via S3 Transfer Acceleration (higher costs)
2. CloudFront CDN with an S3 origin (since each file is downloaded just once or twice, will a CDN help, given that the first request still has to fetch the data from the US bucket?)
3. Akamai as the CDN (same concern as with CloudFront; the only difference is that we have a better price deal with Akamai at the org level)
4. Since we know where each download will happen, keep each file in a bucket created in that user's AWS region
So, I want recommendations in terms of cost+download speed. Which may be a better option to explore further?
As each file will only be downloaded a few times, you won't benefit from CloudFront's caching: the likelihood that all download requests hit the same CloudFront node, and that this node hasn't yet evicted the file from its cache, is probably near zero, especially for such large files.
On the other hand, you gain something else by using CloudFront or S3 Transfer Acceleration (the latter being essentially the former without caching): requests enter AWS' network already at the edge, so you avoid congested networks between the user's location and the location of your S3 bucket, which is usually the main reason for slow and interrupted downloads.
Storing the data depending on the user's location would improve the situation as well, although CloudFront edge locations are usually closer to a user than the nearest AWS region with S3. Another reason not to distribute the files across S3 buckets by user location is the management overhead: you need to manage multiple S3 buckets, store each file in the correct bucket, and point each user to the correct bucket. While storing could be simplified with S3 Replication (you could use a filter to replicate only the objects meant for a specific region's target bucket), the overhead of managing multiple endpoints for multiple customers remains. Also, while you state that you know the location of your customers, what happens if a customer changes location and suddenly wants to download an object that is now stored on the other side of the world? You'd have the same problem again.
In your situation, I'd probably choose option 2 and set up CloudFront in front of S3. I'd prefer CloudFront over S3 Transfer Acceleration, as it gives you more flexibility: you can use your own domain with HTTPS, you can later reconfigure origins when the location of the files changes, etc. Depending on how far you want to go, you could even combine that with S3 Replication and have multiple origins for your CloudFront distribution, directing requests for different files to S3 buckets in different regions.
Which solution to choose depends on your use case and constraints. One constraint seems to be cost; another could be, for example, the maximum file size of 20 GB supported by CloudFront, if you have files larger than that to distribute.

Best way to transfer data from on-prem to AWS

I have a requirement to transfer data (one time) from on-prem to AWS S3. The data size is around 1 TB. I was looking at AWS DataSync, Snowball, etc., but these managed services seem better suited to migrations where the data is in petabytes. Can someone suggest the best way to transfer the data securely and cost-effectively?
You can use the AWS Command-Line Interface (CLI). This command will copy data to Amazon S3:
aws s3 sync c:/MyDir s3://my-bucket/
If there is a network failure or timeout, simply run the command again. It only copies files that are not already present in the destination.
The time taken will depend upon the speed of your Internet connection.
You could also consider using AWS Snowball, which is a piece of hardware that is sent to your location. It can hold 50TB of data and costs $200.
If you have no specific requirements (apart from the fact that it needs to be encrypted and the total size is 1 TB), then I would suggest you stick to something plain and simple. S3 supports an object size of up to 5 TB, so you won't run into trouble there. I don't know if your data is made up of many smaller files or one big file (or zip), but in essence it's all the same. Since the endpoints are all encrypted, you should be fine; if you're worried, you can encrypt your files beforehand, and they will also be encrypted while stored (useful if this is a backup). To get to the point: you can use API tools for the transfer, or file-explorer-type tools that also connect to S3 (e.g. https://www.cloudberrylab.com/explorer/amazon-s3.aspx). One other point: the cost-effectiveness of storage/transfer depends on how frequently you need the data; if it's just a backup or just-in-case data, archiving to Glacier is much cheaper.
1 TB is large, but it's not so large that it'll take you weeks to get your data onto S3. However, if you don't have a good upload speed, use Snowball.
https://aws.amazon.com/snowball/
Snowball is a device shipped to you which can hold up to 100TB. You load your data onto it and ship it back to AWS and they'll upload it to the S3 bucket you specify when loading the data.
This can be done in multiple ways:
Using the AWS CLI, you can copy files from local storage to S3.
AWS Transfer Family supports transfers over FTP or SFTP (AWS SFTP).
There are tools like CloudBerry clients, which have a UI.
You can use the AWS DataSync tool.

How to make automated S3 Backups

I am working on an app which uses S3 to store important documents. These documents need to be backed up on a daily, weekly rotation basis much like how database backups are maintained.
Does S3 support a feature where a bucket can be backed up into multiple buckets periodically, or perhaps into Amazon Glacier? I want to avoid using an external service as much as possible, and was hoping S3 had some mechanism to do this, as it's a common use case.
Any help would be appreciated.
Quote from Amazon S3 FAQ about durability:
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years
These numbers are, first of all, almost unbeatable. In other words, your data is safe in Amazon S3.
Thus, the only reason you would need to back up your data objects is to prevent their accidental loss (by your own mistake). To solve this problem, Amazon S3 enables versioning of S3 objects. Enable this feature on your S3 bucket and you're safe.
P.S. Actually, there is one more possible reason - cost optimization. Amazon Glacier is cheaper than S3. I would recommend using AWS Data Pipeline to move S3 data to Glacier routinely.
Regarding Glacier: you can configure your bucket to move (old) S3 data to Glacier once it is older than a specified duration. This can save you cost if you want infrequently accessed data to be archived.
S3 buckets have lifecycle rules with which we can automatically move data from S3 to Glacier.
But if you want to access these important documents frequently from the backup, then you can also use another S3 bucket to back up your data. This backup can be scheduled using AWS Data Pipeline daily, weekly, etc.
*Glacier is cheaper than S3, as it is designed for archival data that is rarely retrieved.
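As a sketch of the lifecycle approach mentioned above, a minimal lifecycle configuration that transitions all objects to GLACIER after a year might look like the following (the rule ID and day count are placeholders); it can be applied to a bucket with the aws s3api put-bucket-lifecycle-configuration command:

```json
{
  "Rules": [
    {
      "ID": "archive-old-documents",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 365, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

A narrower Prefix restricts the rule to part of the bucket, and an Expiration action can be added to delete objects after a further period.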
I created a Windows application that will allow you to schedule S3 bucket backups. You can create three kinds of backups: Cumulative, Synchronized and Snapshots. You can also include or exclude root level folders and files from your backups. You can try it free with no registration at https://www.bucketbacker.com