I am currently implementing the image storage architecture for my service.
As I read in one article, it is a good idea to move the whole
image upload and download traffic to an external cloud object storage.
https://medium.com/#jgefroh/software-architecture-image-uploading-67997101a034
As I noticed, there are many cloud object storage providers:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Blob Storage
- Alibaba Object Storage
- Oracle Object Storage
- IBM Object Storage
- Backblaze B2 Object
- Exoscale Object Storage
- Aruba Object Storage
- OVH Object Storage
- DreamHost DreamObjects
- Rackspace Cloud Files
- Digital Ocean Spaces
- Wasabi Hot Object Storage
My first choice was Amazon S3, because
almost all of my system infrastructure is located on AWS.
However, I see a lot of problems with this object storage.
(Please correct me if I am wrong on any point below.)
1) Expensive log delivery
AWS charges for all operational requests. If I have to pay for all requests, I would like to see all request logs, and I would like to get these logs as fast as possible. AWS S3 provides log delivery, but with a big delay, and each log is delivered as a separate file in another S3 bucket, so each log is a separate S3 write request. Write requests are more expensive; they cost approximately $5 per 1M requests. There is another option, triggering an AWS Lambda whenever a request is made, however that is also an additional cost of $0.20 per 1M Lambda invocations. In summary, in my opinion the log delivery of S3 requests is way too expensive.
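To put rough numbers on this, here is a small sketch of the log-cost arithmetic using the per-request prices quoted above (actual AWS pricing varies by region and over time, and the monthly request count is purely an illustrative assumption):

```python
# Prices quoted in the point above; real AWS pricing varies by region and time.
S3_PUT_PRICE_PER_M = 5.00     # USD per 1M S3 write (PUT) requests
LAMBDA_PRICE_PER_M = 0.20     # USD per 1M Lambda invocations

def log_delivery_cost(requests, price_per_million):
    """Cost of emitting one log record per S3 request."""
    return requests / 1_000_000 * price_per_million

monthly_requests = 50_000_000  # assumed traffic, for illustration only
cost_via_s3_writes = log_delivery_cost(monthly_requests, S3_PUT_PRICE_PER_M)  # 250.0 USD
cost_via_lambda = log_delivery_cost(monthly_requests, LAMBDA_PRICE_PER_M)     # 10.0 USD
```

Either way, logging the requests costs real money on top of the requests being logged, which is the core of the complaint.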
2) Cannot configure a maximum object content-length globally for a whole bucket.
I have not found a way to configure a maximum object size (content-length) restriction for a whole bucket. In short, I want the possibility to block uploads of files larger than a specified limit for a chosen bucket. I know that it is possible to specify the content-length of an uploaded file in presigned PUT URLs; however, I think this should be configurable globally for a whole bucket.
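For reference, the closest S3 comes to this is a per-upload cap: browser-based POST uploads are authorized by a signed policy document that can contain a content-length-range condition. A minimal sketch of building such a policy (the signing step is omitted, and the bucket name and size limit are made up for illustration):

```python
import base64
import json

def build_upload_policy(bucket, key_prefix, max_bytes, expiration):
    """Build an S3 POST policy that rejects uploads larger than max_bytes."""
    policy = {
        "expiration": expiration,
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],
            ["content-length-range", 0, max_bytes],  # per-upload size cap
        ],
    }
    # The policy travels base64-encoded in the POST form, alongside its signature.
    return base64.b64encode(json.dumps(policy).encode()).decode()

encoded = build_upload_policy("images-bucket", "uploads/", 5 * 1024 * 1024,
                              "2030-01-01T00:00:00Z")
decoded = json.loads(base64.b64decode(encoded))
```

But this still has to be generated for every single upload, which is exactly the per-bucket gap described above.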
3) Cannot configure a request rate limit per IP number per minute directly on a bucket.
Because all S3 requests are chargeable, I would like the possibility
to restrict the number of requests that can be made on my bucket from one IP number.
I want to prevent massive uploads and downloads from one IP number,
and I want it to be configurable for a whole bucket.
I know that this functionality can be provided by AWS WAF attached to CloudFront,
however such WAF-inspected requests are way too expensive!
You have to pay $0.60 per 1M inspected requests.
Direct Amazon S3 requests cost $0.40 per 1M requests,
so there is completely no point, and it is completely unprofitable,
to use AWS WAF as a rate-limit option for S3 requests, i.e. as "wallet protection" against DoS attacks.
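The kind of per-IP limiting I mean is essentially what nginx's limit_req module implements: a token bucket per client address. A minimal sketch of that behaviour (the class name and parameters are mine, not an AWS or nginx API):

```python
import time
from collections import defaultdict

class PerIpRateLimiter:
    """Token bucket per client IP: a steady refill rate plus a bounded burst."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = defaultdict(lambda: float(burst))
        self.last_seen = {}

    def allow(self, ip, now=None):
        """Return True if this request from `ip` is within the limit."""
        now = time.monotonic() if now is None else now
        elapsed = now - self.last_seen.get(ip, now)
        self.last_seen[ip] = now
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens[ip] = min(self.burst, self.tokens[ip] + elapsed * self.rate)
        if self.tokens[ip] >= 1:
            self.tokens[ip] -= 1
            return True
        return False
```

A per-minute limit is the same mechanism with rate_per_sec = limit / 60.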
4) Cannot create a "one time upload" presigned URL.
Generated presigned URLs can be used multiple times as long as they haven't expired.
It means that you can upload a file many times using the same presigned URL.
It would be great if the AWS S3 API provided a possibility to create "one time upload" presigned URLs. I know that I can implement such "one time upload" functionality myself.
For example, see this link: https://serverless.com/blog/s3-one-time-signed-url/
However, in my opinion such functionality should be provided directly via the S3 API.
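The linked approach boils down to tracking one-time tokens server-side and invalidating each one on first use. A minimal sketch of the idea, with an in-memory set standing in for a shared store such as DynamoDB or Redis (all names here are mine, purely illustrative):

```python
import secrets

class OneTimeUploadTokens:
    """Issue upload tokens that can be redeemed exactly once."""

    def __init__(self):
        self._unused = set()  # stand-in for a shared store (DynamoDB, Redis, ...)

    def issue(self):
        """Create a fresh token to embed in the presigned upload URL."""
        token = secrets.token_urlsafe(16)
        self._unused.add(token)
        return token

    def redeem(self, token):
        """Return True on first use, False on any replay."""
        try:
            self._unused.remove(token)
            return True
        except KeyError:
            return False
```

The upload endpoint would only hand out (or honour) a presigned URL when `redeem` returns True, so a replayed URL is rejected even if it has not yet expired.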
5) Every request to S3 is chargeable!
Let's say you created a private bucket.
No one can access the data in it, however...
anybody on the internet can run bulk requests against your bucket,
and Amazon will charge you for all those forbidden 403 requests!
It is not very comfortable that someone can "drain my wallet"
at any time just by knowing the name of my bucket!
It is far from secure, especially if you give someone
a direct S3 presigned URL containing the bucket address.
Everyone who knows the name of a bucket can run bulk 403 requests and drain my wallet!
Someone already asked that question here, and I guess it is still a problem:
https://forums.aws.amazon.com/message.jspa?messageID=58518
In my opinion, forbidden 403 requests should not be chargeable at all!
6) Cannot block network traffic to S3 via NACL rules.
Because every request to S3 is chargeable,
I would like the possibility to completely block
network traffic to my S3 bucket at a lower network layer.
Because S3 buckets cannot be placed in a private VPC,
I cannot block traffic from a particular IP number via NACL rules.
In my opinion, AWS should provide such NACL rules for S3 buckets
(and I mean network ACL rules, not bucket ACL rules, which act only at the application layer).
Because of all these problems I am considering using nginx
as a proxy for all requests made to my private S3 buckets.
Advantages of this solution:
- I can rate limit requests to S3 for free, however I want
- I can cache images on my nginx for free - fewer requests to S3
- I can add an extra layer of security with lua-resty-waf (https://github.com/p0pr0ck5/lua-resty-waf)
- I can quickly cut off requests with a request body greater than a specified size
- I can provide additional request authentication with the use of OpenResty (custom Lua code can be executed on each request)
- I can easily and quickly obtain all access logs from my EC2 nginx machine and forward them to CloudWatch using the CloudWatch agent
Disadvantages of this solution:
- I have to transfer all the traffic to S3 through my EC2 machines and scale my EC2 nginx machines with an autoscaling group
- Direct traffic to the S3 bucket is still possible from the internet for everyone who knows my bucket name! (There is no possibility to hide an S3 bucket in a private network.)
MY QUESTIONS
Do you think that such an approach, with a reverse proxy nginx server in front of the object storage, is good?
Or is a better way to just find an alternative cloud object storage provider and not proxy object storage requests at all?
I would be very thankful for recommendations of alternative storage providers.
The following info about a given recommendation would be preferred:
Object storage provider name
A. What is the price for INGRESS traffic?
B. What is the price for EGRESS traffic?
C. What is the price for REQUESTS?
D. What payment options are available?
E. Are there any long-term agreements?
F. Where are the data centers located?
G. Does it provide an S3-compatible API?
H. Does it provide full access to all request logs?
I. Does it provide a configurable rate limit per IP number per minute for a bucket?
J. Does it allow hiding the object storage in a private network, or restricting network traffic to particular IP numbers?
In my opinion a PERFECT cloud object storage provider should:
1) Provide access logs of all requests made on a bucket (IP number, response code, content-length, etc.)
2) Provide the possibility to rate limit bucket requests per IP number per minute
3) Provide the possibility to cut off traffic from malicious IP numbers at the network layer
4) Provide the possibility to hide object storage buckets in a private network, or to give access only to specified IP numbers
5) Not charge for forbidden 403 requests
I would be very thankful for all the answers, comments and recommendations.
Best regards
Using nginx as a reverse proxy for cloud object storage is a good idea for many use cases, and you can find some guides online on how to do so (at least with S3).
I am not familiar with all the features offered by all cloud storage providers, but I doubt that any of them will give you all the features and flexibility you have with nginx.
Regarding your disadvantages:
Scaling is always an issue, but benchmark tests show
that nginx can handle a lot of throughput even on small machines.
There are solutions for that in AWS. First make your S3 bucket private, and then you can:
- Allow access to your bucket only from the EC2 instance(s) running your nginx servers
- Generate pre-signed URLs to your S3 bucket and serve them to your clients using nginx
Note that both solutions for your second problem require some development.
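On pre-signed URLs: a presigned GET is just a deterministic SigV4 signature over the request, so it can be generated without any call to AWS. A stdlib-only sketch of the query-string presigning scheme (simplified: virtual-hosted addressing, a single signed host header, placeholder credentials; the real SDKs handle many more cases):

```python
import hashlib
import hmac
from urllib.parse import quote

def presign_get(bucket, key, access_key, secret_key, region,
                timestamp, expires=3600):
    """Build a SigV4 presigned GET URL for an S3 object (sketch only)."""
    host = f"{bucket}.s3.{region}.amazonaws.com"
    date = timestamp[:8]                       # "20240101" from "20240101T000000Z"
    scope = f"{date}/{region}/s3/aws4_request"
    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": timestamp,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    query = "&".join(f"{k}={quote(v, safe='')}" for k, v in sorted(params.items()))
    canonical_request = "\n".join([
        "GET",
        "/" + quote(key),                      # canonical URI (key may contain '/')
        query,                                 # canonical query string, sorted
        f"host:{host}\n",                      # canonical headers
        "host",                                # signed header names
        "UNSIGNED-PAYLOAD",                    # payload is not signed for presigned GETs
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        timestamp,
        scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])
    # Derive the signing key via the HMAC chain, then sign.
    signing_key = f"AWS4{secret_key}".encode()
    for part in (date, region, "s3", "aws4_request"):
        signing_key = hmac.new(signing_key, part.encode(), hashlib.sha256).digest()
    signature = hmac.new(signing_key, string_to_sign.encode(),
                         hashlib.sha256).hexdigest()
    return f"https://{host}/{quote(key)}?{query}&X-Amz-Signature={signature}"
```

nginx would then hand such a URL to the client (or proxy to it), so the bucket itself can stay private.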
If you have an AWS infrastructure and want to implement an on-prem S3-compatible API, you can look into MinIO.
It is a performant object storage which protects data through erasure coding.
Related
AWS S3 has a standard public bucket and folder (Asia Pacific region) which hosts ~30 GB of images/media. On the other hand, the website and app access these images using direct S3 object URLs. Unknowingly, we ran into high data transfer costs, which are significantly disproportionate:
Amazon Simple Storage Service: USD 30
AWS Data Transfer: USD 110
I have also read that if EC2 and S3 are in the same region the cost will be significantly lower, but the problem is that the S3 objects are accessed directly from client machines anywhere in the world, with no EC2 involved in between.
Can someone please suggest how data transfer costs can be reduced?
The Data Transfer charge is directly related to the amount of data that goes from AWS to the Internet. Depending on your region, it is typically charged at around $0.09/GB.
If you are concerned about the Data Transfer charge, there are a few things you could do:
Activate Amazon S3 Server Access Logging, which will create a log file for each web request. You can then see how many requests are coming in and possibly detect strange access behaviour (e.g. bots, search engines, abuse).
You could try reducing the size of the files that are typically accessed, such as making images smaller. Take a look at the access logs and determine which objects are accessed the most and are therefore causing the most cost.
Use fewer large files on your website (e.g. videos). Again, look at the access logs to determine where the data is being used.
A cost of $110 suggests about 1.2TB of data being transferred.
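That estimate can be sanity-checked directly (assuming the ~$0.09/GB rate mentioned above):

```python
# Back-of-the-envelope check of the 1.2 TB figure.
data_transfer_bill = 110.0    # USD, from the bill above
price_per_gb = 0.09           # USD per GB, typical internet-egress rate
gb_transferred = data_transfer_bill / price_per_gb
# gb_transferred comes out to roughly 1222 GB, i.e. about 1.2 TB
```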
I am a remote developer working out of India. My client is based out of North America and has his EC2 servers and S3 data in the us-west-2 region.
The number of network hops needed to fetch the data is obviously big, and thus wastes a lot of my time during testing, as we depend on large data coming in from S3.
How can I replicate the existing EC2/S3 system to have an endpoint in India so that my testing performance can be improved?
How to geographically replicate an entire environment is a very broad topic.
But there is a potential solution you should investigate: S3 Transfer Acceleration, which optimizes your connection to the bucket from distant locations. It creates a mesh of global endpoints for the bucket using the AWS Edge Network (the same global network of edge locations that provides services like CloudFront and Route 53), so your traffic is routed to the nearest edge, where it hops onto the managed AWS network and then rides back to the actual bucket location. The content isn't replicated, but your connection is transparently proxied, providing significant optimization.
There's a test page at the link above that illustrates the impact of transfer acceleration on your uploads; the improvement for downloads is similar.
When the feature is enabled on a bucket, the bucket works the same as always, with no change, unless you access it using the accelerated endpoint, bucket-name.s3-accelerate[.dualstack].amazonaws.com, which causes you to connect to the nearest accelerate endpoint rather than all the way back to the actual bucket in its home region. (Add .dualstack for IPv6.) The SDKs provide a way to specify that the accelerate endpoint be used.
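The endpoint choice described above can be expressed as a small helper (a sketch, not an SDK API; one caveat mentioned elsewhere in this document is that transfer acceleration does not support bucket names containing dots):

```python
def s3_endpoint(bucket, region="us-east-1", accelerate=False, dualstack=False):
    """Choose the hostname for a bucket (illustrative sketch only)."""
    if accelerate:
        if "." in bucket:
            # Transfer acceleration is unsupported for dotted bucket names.
            raise ValueError("transfer acceleration requires dot-free bucket names")
        suffix = ".dualstack" if dualstack else ""
        # Accelerated endpoints are region-agnostic.
        return f"{bucket}.s3-accelerate{suffix}.amazonaws.com"
    return f"{bucket}.s3.{region}.amazonaws.com"
```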
Background
We have developed an e-commerce application where I want to use CDN to improve the speed of the app and also to reduce the load on the host.
The application is hosted on an EC2 server, and now we are going to use CloudFront.
Questions
After reading a lot of articles and documents, I have created a distribution for my sample site. After all these experiments I have come to understand the following things, and I want to be sure whether I am right about these points or not.
- When we create a distribution, it takes all the accessible data from the given origin path. We don't need to copy/sync our files to CloudFront.
- We just have to change the paths in our application according to the distribution's CNAME (if a CNAME is given).
- There is no difference between placing the images/JS/CSS files on S3 or on our own host; CloudFront will just fetch them by itself.
The application will have thousands of pictures of products. Should we place them on S3, or is it OK if they stay on the host itself? Please share any good article explaining the difference between the two techniques.
Because if S3 is significantly better, then I'll have to write a program to sync all such data to S3.
Thanks for the help.
Some reasons to store the images on Amazon S3 rather than your own host (and then serve them via Amazon CloudFront):
Less load on your servers
Even though content is cached in Amazon CloudFront, your servers will still be hit with requests for the first access of each object from every edge location (each edge location maintains its own cache), repeated every time that the object expires. (Refreshes will generate a HEAD request, and will only re-download content that has changed or been flushed from the cache.)
More durable storage
Amazon S3 keeps copies of your data across multiple Availability Zones within the same Region. You could also replicate data between your servers to improve durability but then you would need to manage the replication and pay for storage on every server.
Lower storage cost
Storing data on Amazon S3 is lower cost than storing it on Amazon EBS volumes. If you are planning on keeping your data in both locations, then obviously using S3 is more expensive, but you should also consider storing it only on S3, which makes it lower cost, more durable, and less for you to back up on your server.
Reasons to NOT use S3:
More moving parts -- maintaining code to move files to S3
Not as convenient as using a local file system
Having to merge log files from S3 and your own servers to gather usage information
I have the following S3 buckets:
"client1"
"client2"
...
"clientX"
and our clients upload data to their buckets via a jar app (client1 to bucket client1, etc.). Here is a piece of the code:
BasicAWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonS3 s3client = new AmazonS3Client(credentials);
s3client.setRegion(Region.getRegion(Regions.US_EAST_1));
File file = new File(OUTPUT_DIRECTORY + '/' + fileName);
s3client.putObject(new PutObjectRequest(bucketName, datasource + '/' + fileName, file));
and the problem is that they have a firewall for outbound traffic. They must allow the URL .amazonaws.com in the firewall. Is it possible to set the endpoint to our domain, storage.domain.com?
We expect that we will change the region in the future, but all our clients are locked to amazonaws.com = US_EAST_1 region now -> all our clients would need to change the rules in their firewalls.
If the endpoint is storage.domain.com, everything will be OK :)
Example of expected clients URL
client1 will put data to URL client1.storage.domain.com
client2 will put data to URL client2.storage.domain.com
clientX will put data to URL clientX.storage.domain.com
We know about the setting in CloudFront, but it's per bucket. We are looking for a solution with one global AWS setting. How can we do that?
Thank you very much
Not sure if this will be affordable for you (due to the extra fees you may incur), but this should work:
Create Route53 with your domain and subdomains (client1, client2.. clientX)
Create (or use default) VPC with endpoints (https://aws.amazon.com/blogs/aws/new-vpc-endpoint-for-amazon-s3/)
Route all traffic from Route53 to your VPC through the internet gateway (IGW)
You may need to have Security group and NACL things configured. Let me know if you need further details.
There are numerous factors at play, here, not the least of which is support for SSL.
First, we need to eliminate one obvious option that will not work:
S3 supports naming each bucket after a domain name, then using a CNAME to point at each bucket. So, for example, if you name a bucket client-1.storage.example.com and then point a DNS CNAME (in Route 53) for client-1.storage.example.com at client-1.storage.example.com.s3.amazonaws.com, then this bucket becomes accessible on the Internet as client-1.storage.example.com.
This works only if you do not try to use HTTPS. The reason for the limitation is a combination of factors which are outside the scope of this answer. There is no workaround that uses only S3; workarounds require additional components.
Even though the scenario above will not work for your application, let's assume for a moment that it will, since it makes another problem easy to illustrate:
We are finding solution with one global AWS setting
This may not be a good idea, even if it is possible. In the above scenario, it would be tempting to set up a wildcard CNAME so that *.storage.example.com CNAME s3[-region].amazonaws.com, which would give you a magical DNS entry that works for any bucket with a name matching *.storage.example.com and created in the appropriate region... but there is a serious vulnerability in such a configuration: I could create a bucket called sqlbot.storage.example.com (assuming no such bucket already existed), and now I have a bucket that you do not control, using a hostname under your domain, and you have no way to know about it or to stop it. I could potentially use this to breach your clients' security, because now my bucket is accessible from inside your client's firewall, thanks to the wildcard configuration.
No, you really need to automate the steps to deploy each client, regardless of the ultimate solution, rather than relying on a single global setting. All AWS services (S3, Route 53, etc.) lend themselves to automation.
CloudFront seems like it holds the key to the simplest solution, by allowing you to map each client hostname to its own bucket. Yes, this does require a CloudFront distribution to be configured for each client, but this operation can also be automated, and there isn't a charge for each CloudFront distribution; the only charges for CloudFront are usage-related (per request and per GB transferred). Additional advantages here include SSL support (including a wildcard *.storage.example.com certificate from ACM, which can be shared across multiple CloudFront distributions) and the fact that, with CloudFront in the path, the bucket name and the hostname do not need to be the same.
This also gives you the advantage of being able to place each bucket in the most desirable region for that specific bucket. It is, however, limited to files not exceeding 20 GB in size, due to the size limit imposed by CloudFront.
But the problem with using CloudFront for applications with a large number of uploads of course is that you're going to pay bandwidth charges for the uploads. In Europe, Canada, and the US, it's cheap ($0.02/GB) but in India it is much more expensive ($0.16/GB), with other areas varying in price between these extremes. (You pay for downloads with CloudFront, too, but in that case, S3 does not bill you for any bandwidth charges when downloads are pulled through CloudFront... so the consideration is not usually as significant, and adding CloudFront in front of S3 for downloads can actually be slightly cheaper than using S3 alone).
So, while CloudFront is probably the official answer, there are a couple of considerations that are still potentially problematic.
S3 transfer acceleration avoids the other problem you mentioned -- the bucket regions. Buckets with transfer acceleration enabled are accessible at https://bucket-name.s3-accelerate.amazonaws.com regardless of the bucket region, so that's a smaller hole to open, but the transfer acceleration feature is only supported for buckets without dots in their bucket names. And transfer acceleration comes with additional bandwidth charges.
So where does this leave you?
There's not a built-in, "serverless" solution that I can see that would be simple, global, automatic, and inexpensive.
It seems unlikely, in my experience, that a client who is so security-conscious as to restrict web access by domain would simultaneously be willing to whitelist what is in effect a wildcard (*.storage.example.com), which could result in trusting traffic that should not be trusted. Granted, it would be better than *.amazonaws.com, but it's not clear just how much better.
I'm also reasonably confident that many security configurations rely on static IP address whitelisting, rather than whitelisting by name... filtering by name in an HTTPS environment has implications and complications of its own.
Faced with such a scenario, my solution would revolve around proxy servers deployed in EC2 -- in the same region as the buckets -- which would translate the hostnames in the requests into bucket names and forward the requests to S3. These could be deployed behind ELB or could be deployed on Elastic IP addresses, load balanced using DNS from Route 53, so that you have static endpoint IP addresses for clients that need them.
Note also that any scenario involving Host: header rewrites for requests authorized by AWS Signature V4 will mean that you have to modify your application's code to sign the requests with the real hostname of the target S3 endpoint, while sending the requests to a different hostname. Sending requests directly to the bucket endpoint (including the transfer acceleration endpoint) is the only way to avoid this.
OK, so I have an Amazon S3 bucket to which I want to allow users to upload files directly from the client over HTTPS.
In order to do this it became apparent that I would have to change the bucket name from a format using periods to a format using dashes. So:
my.bucket.com
became :
my-bucket-com
This is required due to a limitation of HTTPS certificate validation, which can't deal with periods in the bucket name when resolving the S3 endpoint.
So everything is peachy, except now I'd like to allow access to those files while hiding the fact that they are being stored on Amazon S3.
The obvious choice seems to be to use Route 53 zone configuration records to add a CNAME record to point my url at the bucket, given that I already have the 'bucket.com' domain :
my.bucket.com > CNAME > my-bucket-com.s3.amazonaws.com
However, I now seem to have hit another limitation: Amazon insists that the name of the CNAME record must match the bucket name exactly, so the above example will not work.
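The reason for this restriction is that S3 derives the bucket name from the incoming Host header: a request arriving via a CNAME is looked up under a bucket named exactly like the hostname. A simplified conceptual sketch of that mapping (this is not real S3 code, and real S3 also handles path-style and regional endpoints):

```python
def bucket_for_host(host):
    """How S3 conceptually maps an incoming Host header to a bucket name."""
    suffix = ".s3.amazonaws.com"
    if host.endswith(suffix):
        # Virtual-hosted style: the bucket name is the prefix before the endpoint.
        return host[: -len(suffix)]
    # CNAME access: the entire hostname must be the bucket name.
    return host
```

So `my.bucket.com` pointing (via CNAME) at `my-bucket-com.s3.amazonaws.com` fails, because S3 looks for a bucket literally named `my.bucket.com`.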
My temporary solution is to use a reverse proxy on an EC2 instance while traffic volumes are low. But this is not a good long-term solution, as it means all S3 access is funneled through the proxy server, causing extra server load and data transfer charges. Not to mention the solution really isn't scalable once traffic volumes start to increase.
So is it possible to achieve both of my goals above or are they mutually exclusive?
If I want to be able to upload directly from clients over https, I can't then hide the S3 url from end users accessing that content and vice versa?
Well, there simply doesn't seem to be a straightforward way of achieving this.
There are two possible solutions:
1) Put your S3 bucket behind Amazon CloudFront - but this does incur a lot more charges, albeit with the added benefit of lower-latency regional access to your content.
2) The solution we will go with is simply to split the bucket in two:
one for upload from HTTPS clients (my-bucket-com), and one for CNAME-aliased access to that content (my.bucket.com). This keeps the costs down, although it will involve extra steps in organising the content before it can be accessed.