Reducing AWS Data Transfer Cost for AWS S3 files

We have a standard public AWS S3 bucket and folder (Asia Pacific region) which hosts ~30 GB of images/media. The website and app access these images using direct S3 object URLs. Unknowingly, we ran into high data transfer costs, which are significantly disproportionate:
Amazon Simple Storage Service: USD 30
AWS Data Transfer: USD 110
I have also read that if EC2 and S3 are in the same region the cost will be significantly lower, but the problem is that the S3 objects are accessed directly from client machines anywhere in the world and no EC2 is involved in between.
Can someone please suggest how data transfer costs can be reduced?

The Data Transfer charge is directly related to the amount of information that goes from AWS to the Internet. Depending on your region, it is typically charged at 9c/GB.
If you are concerned about the Data Transfer charge, there are a few things you could do:
Activate Amazon S3 Server Access Logging, which will create a log file for each web request. You can then see how many requests are coming and possibly detect strange access behaviour (eg bots, search engines, abuse).
You could try reducing the size of files that are typically accessed, such as making images smaller. Take a look at the Access Logs and determine which objects are being accessed the most, and are therefore causing the most costs.
Use fewer large files on your website (eg videos). Again, look at the Access Logs to determine where the data is being used.
A cost of $110 suggests about 1.2TB of data being transferred.
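As a rough check of that estimate, the arithmetic can be sketched as follows. The 9c/GB rate is the typical figure quoted above, not an exact price for any particular region, and the function name is only illustrative.

```python
def estimate_transfer_gb(charge_usd: float, rate_usd_per_gb: float = 0.09) -> float:
    """Return the approximate GB sent to the internet for a given Data Transfer charge."""
    return charge_usd / rate_usd_per_gb


gb = estimate_transfer_gb(110)
print(f"{gb:.0f} GB, about {gb / 1000:.1f} TB")  # prints "1222 GB, about 1.2 TB"
```

With only ~30 GB stored, this means that on average each stored byte is downloaded roughly 40 times per month.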

Related

Costs Related to Individual Bucket Items in S3

My AWS S3 costs have been going up pretty quickly for usage type "DataTransfer-Out-Bytes". I have thousands of images in this one bucket and I can't seem to find a way to drill down into the bucket to see which individual bucket items might be causing the increase. Is there a way to see which individual files are contributing to the higher data transfer cost?

Use CloudFront if you can - it's cheaper (if you properly set your cache headers!) than serving directly from S3, and CloudFront includes a popular objects report, which would answer your question.
If you're using S3 alone you need to enable logging on the bucket (more storage cost) and then crunch the data in the logs (more data transfer cost) to get your answer. You can use Amazon Athena to process the S3 access logs, or use Unix command line tools like grep/wc/uniq/cut on the log files locally or from a server to find the culprits.
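As a minimal sketch of the "crunch the logs" step, the script below sums bytes sent per object key. It assumes the space-delimited S3 server access log layout (owner, bucket, [time], remote IP, requester, request ID, operation, key, "request URI", HTTP status, error code, bytes sent, ...); the sample records are made up for illustration, and in practice the lines would come from log files downloaded from the logging bucket.

```python
import re
from collections import Counter

# Leading fields of one access log record, in the order assumed here:
# owner, bucket, [time], remote IP, requester, request ID, operation, key,
# "request URI", HTTP status, error code, bytes sent.
LOG_LINE = re.compile(
    r'^\S+ \S+ \[[^\]]+\] \S+ \S+ \S+ (?P<operation>\S+) (?P<key>\S+) '
    r'(?:"[^"]*"|-) (?P<status>\S+) \S+ (?P<bytes_sent>\S+)'
)


def bytes_sent_per_key(lines):
    """Sum the bytes sent per object key over GET requests."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        if not match["operation"].endswith("GET.OBJECT"):
            continue  # only object downloads are of interest here
        sent = match["bytes_sent"]
        if sent.isdigit():  # "-" means no bytes were sent
            totals[match["key"]] += int(sent)
    return totals


# Hypothetical sample records, not taken from a real bucket.
sample = [
    'ownerid example-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 - REQ1 REST.GET.OBJECT images/a.jpg "GET /images/a.jpg HTTP/1.1" 200 - 500000 500000 20 10 "-" "curl/7.0" -',
    'ownerid example-bucket [06/Feb/2019:00:00:40 +0000] 192.0.2.3 - REQ2 REST.GET.OBJECT images/a.jpg "GET /images/a.jpg HTTP/1.1" 200 - 500000 500000 20 10 "-" "curl/7.0" -',
    'ownerid example-bucket [06/Feb/2019:00:01:00 +0000] 192.0.2.4 - REQ3 REST.GET.OBJECT images/b.jpg "GET /images/b.jpg HTTP/1.1" 200 - 1000 1000 5 4 "-" "curl/7.0" -',
    'ownerid example-bucket [06/Feb/2019:00:02:00 +0000] 192.0.2.3 - REQ4 REST.GET.OBJECT images/a.jpg "GET /images/a.jpg HTTP/1.1" 304 - - 500000 3 - "-" "curl/7.0" -',
]

for key, total in bytes_sent_per_key(sample).most_common(10):
    print(key, total)
```

Grouping by the remote IP field instead of the key would show whether a single client (for example a bot) accounts for most of the traffic.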

Do data transfer costs apply when accessing images from Amazon S3?

The scenario is that I want to store images uploaded by users through my application in Amazon S3 and later retrieve them on demand.
So I was looking into how much it will cost to store images in S3 and I came across the term "data transfer cost".
I am a bit confused: suppose I upload an image to S3 and make it public. Whenever I access that image, say in a browser tab through its public S3 link, do I have to pay data transfer cost for that? Or is accessing an image from its public link free?
Likewise, if I download my image from S3 using my front-end application, do I also have to pay data transfer cost?
Second part of my question:
If data transfer cost applies, can I avoid it by first fetching my image from S3 in a Node.js application running on an EC2 instance and then sending it back to the client from that application?

Yes, you pay in both cases, that is exactly what the data transfer cost means.
Additionally you pay for the storage and for the number of requests.
You can e.g. go through https://calculator.aws/#/createCalculator to get a rough estimate of what cost you may incur.
No, you cannot avoid the cost. Loading the images into the EC2 machine is free (assuming same region etc.), but transferring the data out from the EC2 instance to the client is then charged again.
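To make the comparison concrete, here is a small sketch of the two paths. The 0.09 USD/GB internet egress rate is an assumed, illustrative figure (actual rates vary by region and volume), and the function names are hypothetical.

```python
EGRESS_RATE = 0.09     # USD per GB sent to the internet; assumed, varies by region
S3_TO_EC2_RATE = 0.0   # USD per GB from S3 to EC2 in the same region


def direct_from_s3_cost(gb: float) -> float:
    """Client downloads straight from the public S3 link."""
    return gb * EGRESS_RATE


def via_ec2_cost(gb: float) -> float:
    """Client downloads through an application on EC2 that first reads from S3."""
    return gb * S3_TO_EC2_RATE + gb * EGRESS_RATE


# Both paths pay for the same bytes leaving AWS.
print(direct_from_s3_cost(100), via_ec2_cost(100))
```

The data transfer charge comes out the same either way, and the EC2 path additionally pays for the instance itself.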

Should I use nginx reverse proxy for cloud object storage?

I am currently implementing image storing architecture for my service.
As I read in one article, it is a good idea to move the whole
image upload and download traffic to external cloud object storage.
https://medium.com/#jgefroh/software-architecture-image-uploading-67997101a034
As I noticed there are many cloud object storage providers:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Blob Storage
- Alibaba Object Storage
- Oracle Object Storage
- IBM Object Storage
- Backblaze B2 Object
- Exoscale Object Storage
- Aruba Object Storage
- OVH Object Storage
- DreamHost DreamObjects
- Rackspace Cloud Files
- Digital Ocean Spaces
- Wasabi Hot Object Storage
My first choice was Amazon S3 because
almost all of my system infrastructure is located on AWS.
However I see a lot of problems with this object storage.
(Please correct me if I am wrong in any point below)
1) Expensive log delivery
AWS charges for all operational requests. If I have to pay for all requests, I would like to see all request logs, and I would like to get these logs as fast as possible. AWS S3 provides log delivery, but with a big delay, and each log is delivered as a separate file in another S3 bucket, so each log is a separate S3 write request. Write requests are more expensive; they cost approximately $5 per 1M requests. Another option is to trigger AWS Lambda whenever a request is made, but that is also an additional cost of $0.20 per 1M Lambda invocations. In summary, in my opinion log delivery of S3 requests is way too expensive.
2) Cannot configure maximum object content-length globally for a whole bucket.
I have not found a way to configure a maximum object size (content-length) restriction for a whole bucket. In short, I want to be able to block uploads of files larger than a specified limit for a chosen bucket. I know that it is possible to specify the content-length of an uploaded file in presigned PUT URLs; however, I think this should be configurable globally for a whole bucket.
3) Cannot configure a request rate limit per IP number per minute directly on a bucket.
Because all S3 requests are chargeable, I would like to be able
to limit the number of requests made on my bucket from one IP number.
I want to prevent massive uploads and downloads from one IP number,
and I want this to be configurable for a whole bucket.
I know that this functionality can be provided by AWS WAF attached to CloudFront;
however, such WAF-inspected requests are way too expensive!
You have to pay $0.60 per 1M inspected requests.
Direct Amazon S3 requests cost $0.40 per 1M requests,
so there is no point in using AWS WAF as a rate limit option for S3 requests,
as "wallet protection" against DoS attacks; it is simply not profitable.
4) Cannot create "one time - upload" presigned URL.
Generated presigned URLs can be used multiple times as long as they have not expired.
This means that you can upload one file many times using the same presigned URL.
It would be great if the AWS S3 API provided a way to create "one time upload" presigned URLs. I know that I can implement such "one time - upload" functionality myself.
For example, see this link: https://serverless.com/blog/s3-one-time-signed-url/
However, in my opinion such functionality should be provided directly via the S3 API.
5) Every request to S3 is chargeable!
Let's say you created a private bucket.
No one can access the data in it; however...
Anybody on the internet can run bulk requests against your bucket...
and Amazon will charge you for all those forbidden 403 requests!!!
It is not very comfortable that someone can "drain my wallet"
at any time by knowing only the name of my bucket!
It is far from being secure, especially if you give someone
a direct S3 presigned URL with the bucket address.
Everyone who knows the name of a bucket can run bulk 403 requests and drain my wallet!!!
Someone already asked that question here and I guess it is still a problem:
https://forums.aws.amazon.com/message.jspa?messageID=58518
In my opinion, forbidden 403 requests should not be chargeable at all!
6) Cannot block network traffic to S3 via NACL rules
Because every request to S3 is chargeable,
I would like to be able to completely block
network traffic to my S3 bucket at a lower network layer.
Because S3 buckets cannot be placed in a private VPC,
I cannot block traffic from a particular IP number via NACL rules.
In my opinion AWS should provide such NACL rules for S3 buckets
(and I mean NACL rules, not ACL rules that block only at the application layer).
Because of all these problems I am considering using nginx
as a proxy for all requests made to my private S3 buckets
Advantages of this solution:
I can rate limit requests to S3 for free however I want
I can cache images on my nginx for free - fewer requests to S3
I can add extra layer of security with Lua Resty WAF (https://github.com/p0pr0ck5/lua-resty-waf)
I can quickly cut off requests with request body greater than specified
I can provide additional request authentication with the use of openresty
(custom lua code can be executed on each request)
I can easily and quickly obtain all access logs from my EC2 nginx machine and forward them to CloudWatch using the CloudWatch agent.
Disadvantages of this solution:
I have to transfer all the traffic to S3 through my EC2 machines and scale my EC2 nginx machines with the use of autoscaling group.
Direct traffic to S3 bucket is still possible from the internet for everyone who knows my bucket name!
(No possibility to hide S3 bucket in private network)
MY QUESTIONS
Do you think that such an approach, with an nginx reverse proxy in front of object storage, is good?
Or maybe a better way is to just find an alternative cloud object storage provider and not proxy object storage requests at all?
I would be very thankful for recommendations of alternative storage providers.
The following information about each recommendation would be preferred:
Object storage provider name
A. What is the price for INGRESS traffic?
B. What is the price for EGRESS traffic?
C. What is the price for REQUESTS?
D. What payment options are available?
E. Are there any long-term agreements?
F. Where are the data centers located?
G. Does it provide S3 compatible API?
H. Does it provide full access for all request logs?
I. Does it provide configurable rate limit per IP number per min for a bucket?
J. Does it allow hiding object storage in a private network, or allowing network traffic only from particular IP numbers?
In my opinion a PERFECT cloud object storage provider should:
1) Provide access logs of all requests made on bucket (IP number, response code, content-length, etc.)
2) Provide possibility to rate limit buckets requests per IP number per min
3) Provide possibility to cut off traffic from malicious IP numbers in network layer
4) Provide possibility to hide object storage buckets in private network or give access only for specified IP numbers
5) Do not charge for forbidden 403 requests
I would be very thankful for all the answers, comments and recommendations
Best regards

Using nginx as a reverse proxy for cloud object storage is a good idea for many use cases, and you can find some guides online on how to do so (at least with S3).
I am not familiar with all the features offered by all cloud storage providers, but I doubt that any of them will give you all the features and flexibility you have with nginx.
Regarding your disadvantages:
Scaling is always an issue, but benchmark tests show
that nginx can handle a lot of throughput even on small machines.
There are solutions in AWS for the direct-access problem. First make your S3 bucket private, and then you can either:
Allow access to your bucket only from the EC2 instance(s) running your nginx servers, or
Generate pre-signed URLs to your S3 bucket and serve them to your clients using nginx.
Note that both solutions for your second problem require some development.
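As a sketch of the first option, a bucket policy can limit object reads to the addresses the nginx servers use. The example below only builds the policy document; it assumes the nginx instances reach S3 from fixed public IPs (for example Elastic IPs), and the bucket name, CIDR and function name are placeholders. If the instances reach S3 through a VPC endpoint instead, a condition on aws:SourceVpce would be needed rather than aws:SourceIp.

```python
import json


def nginx_only_bucket_policy(bucket_name, proxy_cidrs):
    """Build a bucket policy that allows object reads only from the given source IP ranges."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadsFromNginxProxies",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # Only requests arriving from these addresses match the Allow.
                "Condition": {"IpAddress": {"aws:SourceIp": list(proxy_cidrs)}},
            }
        ],
    }


# Placeholder bucket name and documentation-range IP, for illustration only.
policy = nginx_only_bucket_policy("example-images-bucket", ["203.0.113.10/32"])
print(json.dumps(policy, indent=2))
```

The resulting JSON would then be attached to the bucket as its policy. Granting the instances an IAM role with read access to the bucket is another way to reach the same goal.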

If you have AWS infrastructure and want to run an on-prem S3-compatible API, you can look into MinIO.
It is a performant object store that protects data through erasure coding.

Difference between S3 bucket vs host files for Amazon CloudFront

Background
We have developed an e-commerce application where I want to use a CDN to improve the speed of the app and also to reduce the load on the host.
The application is hosted on an EC2 server and now we are going to use CloudFront.
Questions
After reading a lot of articles and documents, I have created a distribution for my sample site. After experimenting, I have come to the following conclusions. I want to be sure whether I am right about these points or not.
When we create a distribution, it pulls all the accessible data from the given origin path. We don't need to copy/sync our files to CloudFront.
We just have to change the paths in our application to use this distribution's CNAME (if a CNAME is given).
There is no difference between placing the images/JS/CSS files on S3 or on our own host. CloudFront will just fetch them by itself.
The application will have thousands of product pictures; should we place them on S3, or is it OK if they are on the host itself? Please share any good article to understand the difference between the two techniques.
Because if S3 is significantly better, then I'll have to write a program to sync all such data to S3.
Thanks for the help.

Some reasons to store the images on Amazon S3 rather than your own host (and then serve them via Amazon CloudFront):
Less load on your servers
Even though content is cached in Amazon CloudFront, your servers will still be hit with requests for the first access of each object from every edge location (each edge location maintains its own cache), repeated every time the object expires. (Refreshes are sent as conditional requests, so content is only re-downloaded if it has changed or has been flushed from the cache.)
More durable storage
Amazon S3 keeps copies of your data across multiple Availability Zones within the same Region. You could also replicate data between your servers to improve durability but then you would need to manage the replication and pay for storage on every server.
Lower storage cost
Storing data on Amazon S3 costs less than storing it on Amazon EBS volumes. If you plan to keep your data in both locations, then obviously adding S3 is more expensive, but you should also consider storing it only on S3, which lowers cost, improves durability, and leaves less for you to back up on your server.
Reasons to NOT use S3:
More moving parts -- maintaining code to move files to S3
Not as convenient as using a local file system
Having to merge log files from S3 and your own servers to gather usage information

AWS S3 pricing - data transfer in/out

I am planning to store objects in S3 Standard storage. Each object could be 100 MB in size, so monthly it could go up to 1 TB. I will use a single region to store these objects in S3.
I want to create a mobile app to store and fetch these objects using POST/GET APIs,
and then show them in my app.
S3 pricing has different components; I understand storage and request (POST/GET) pricing.
My question is about data transfer in/out pricing: in my case above, will I be billed for data transfer in/out? If not, why not?

Yes, you will be billed, because your mobile app will connect from the internet. Even when connecting from within AWS, there are fees associated with the number of requests and the data transferred (inside the region or outside the region).
You can use the AWS Calc to get an estimate for the cost associated: https://calculator.s3.amazonaws.com/index.html

All traffic FROM mobile phones to S3 or EC2 is free.
All traffic TO mobile phones from S3/CloudFront is billed according to a selected region. Take a look at https://aws.amazon.com/s3/pricing/.
Keep in mind that incoming traffic (to S3) is free only if you're NOT using S3 Transfer Acceleration.
