I'm developing a CMS that runs as a single instance but serves multiple websites for different users. The CMS needs to store files somewhere. A website may have only a few images, or it may have thousands of objects. We currently serve around 5 websites but plan to have hundreds, so the solution must scale easily.
Now I'm thinking about two possible ways to go. I want to use S3 for storage.
1. Have a single bucket for all files in my app.
2. Have one bucket for each website.
According to the AWS docs, S3 can handle a "virtually unlimited amount of bytes", so I think the first solution could work well, but I'm wondering about a few other aspects:
Isn't it just cleaner to have one bucket per website? Is it better for maintenance?
Is one solution more secure than the other? Are there any security concerns to be aware of?
Does the same apply to other S3-compatible services like Minio or DigitalOcean Spaces?
Thank you very much for your answers.
I'd go for solution 1.
From a technical perspective there is virtually no limit to the number of objects you can put in a bucket; S3 is built for extreme scale. For 5 websites option 2 might sound tempting, but it doesn't scale well.
There's a soft limit (i.e. one you can ask to have raised) of 100 buckets per account, which is a hint that using hundreds of buckets is probably an anti-pattern. Securing hundreds of buckets is also no easier than securing one bucket.
Concerning security: You can be very granular with bucket policies in S3 if you need that. You can also choose how you want to encrypt each object individually if that is a requirement. Features like pre-signed URLs can help you grant temporary access to specific objects in S3.
If your goal is to serve static content to end users, you'll have to either make the objects publicly readable, use the aforementioned pre-signed URLs, or set up CloudFront as a CDN in front of your bucket.
I don't know how this relates to S3-like services.
Related
What's the best practice for using S3 to store user-uploaded images: a single bucket, or multiple buckets for different purposes? The use case is a B2B application.
There is no limit to the amount of data you can store in an Amazon S3 bucket. Therefore you could, in theory, simply use one bucket for everything. (However, if you want data in multiple regions, then you would need to use a separate bucket per region.)
To best answer your question, you would need to think about how data is accessed:
If controlling access for IAM Users, then giving each user a separate folder is easy for access control using IAM Policy Elements: Variables and Tags
If controlling access for application users, then users will authenticate to an application, which will determine their access to objects. The application can then generate Amazon S3 pre-signed URLs to grant access to specific objects, so separation by bucket/folder is less important
If the data is managed by different Admins/Developers it is a good idea to keep the data in separate buckets to simplify access permissions (e.g. keeping HR data separate from customer data)
Basically, as long as you have a good reason to separate the data (e.g. test vs prod, different apps, different admins), then use separate buckets. But, for a single app, it might make better sense to use a single bucket.
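As a hedged example of the per-user-folder pattern mentioned above, here is what an IAM policy using the `${aws:username}` policy variable might look like, so each IAM user can only touch keys under their own prefix. The bucket name and prefix layout are invented for illustration.

```python
import json

# Illustrative policy: each IAM user gets access only to home/<username>/*.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::my-app-bucket/home/${aws:username}/*",
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::my-app-bucket",
            "Condition": {
                "StringLike": {"s3:prefix": ["home/${aws:username}/*"]}
            },
        },
    ],
}
print(json.dumps(policy, indent=2))
```

One policy then covers every user in the single bucket, instead of one policy per bucket.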
I believe it's the same in terms of performance and availability. As for splitting content by purpose: it's probably OK to use a single bucket as long as the content is split into different folders (paths).
We used to have one bucket for user-uploaded content and another one for static (CSS/JS/IMG) files that were auto-generated.
After reading some AWS documentation, I am wondering about the difference between these use cases if I want to deliver content (JS, CSS, images and API requests) in Asia (including China), the US, and the EU.
1. Store my images and static files on an S3 US region bucket and set up cross-region replication to EU and Asia (Japan or Singapore) buckets to sync with the US bucket.
2. Store my images and static files on an S3 US region bucket and set up CloudFront as a CDN to cache my content in different locations after the initial request.
3. Do both of the above (if there is a significant performance improvement).
What is the most cost-effective solution for global deployment? And how can I make requests from China consistent and stable? (I tried CloudFront + S3 (us-west); it's fast, but the performance is not consistent.)
PS: In the early stage I don't expect many user requests, but users are spread globally and I want them to have a similar experience. The majority of my content is panorama images; I'd expect each visit to load ~30 MB (10 high-res images) sequentially.
Cross region replication will copy everything in a bucket in one region to a different bucket in another region. This is really only for extra backup/redundancy in case an entire AWS region goes down. It has nothing to do with performance. Note that it replicates to a different bucket, so you would need to use different URLs to access the files in each bucket.
CloudFront is a Content Delivery Network. S3 is simply a file storage service. Serving a file directly from S3 can have performance issues, which is why it is a good idea to put a CDN in front of S3. It sounds like you definitely need a CDN, and it sounds like you have tested CloudFront and are unimpressed. It also sounds like you need a CDN with a larger presence in China.
There is no reason you have to choose CloudFront as your CDN just because you are using other AWS services. You should look at other CDN services and see what their edge networks look like. Given your requirements I would highly recommend you take a look at CloudFlare. They have quite a few edge network locations in China.
Another option might be to use a CDN that you can actually push your files to. I've used this feature in the past with MaxCDN. You would push your files to the CDN via FTP, and the files would automatically be pushed to all edge network locations and cached until you push an update. For your use case of large image downloads, this might provide a more performant caching mechanism. MaxCDN doesn't appear to have a large China presence though, and the bandwidth charges would be more expensive than CloudFlare.
If you want to serve files in your S3 buckets all around the world, you may want to consider S3 Transfer Acceleration. It can be used to speed up both uploads to and downloads from your S3 bucket. You may also try AWS Global Accelerator.
CloudFront's job is to cache content at hundreds of caches ("edge locations") around the world, making them more quickly accessible to users around the world. By caching content at locations close to users, users can get responses to their requests more quickly than they otherwise would.
S3 Cross-Region Replication (CRR) copies the contents of an S3 bucket from one region into a bucket in another region. This is useful for backing up data, and it can also be used to speed up content delivery for a particular region. Unlike CloudFront, CRR keeps the replica continuously up to date as objects change (replication is asynchronous but typically quick), which may matter where data needs to be current (e.g. a website with frequently-changing content). However, it's also more of a hassle to manage than CloudFront, and more expensive at multi-region scale.
If you want to achieve global deployment in a cost-effective way, then CloudFront would probably be the better of the two, except in the special situation outlined in the previous paragraph.
Currently my app uploads images to a bucket in ap-southeast-1 (Singapore), and since my app is mostly used in South East Asia, everything is pretty fast. I'm wondering how I can support multiple regions. Say I also want people in the US to use my app; right now their uploads will be slow, since the bucket is in Singapore. I know S3 has a feature to replicate data across regions, but how can I detect the user's location and generate a pre-signed upload URL for the bucket closest to that particular user? Right now I hard-code it to Singapore... Any ideas? Thanks!
Your best bet is probably putting a CloudFront distribution in front of the bucket.
That said, have you benchmarked this? My understanding is that latency is more of a concern for a flurry of small requests than it would be for something like one or just a couple of large image uploads.
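If you do go the multi-bucket route, the region selection itself is simple; the harder part is populating the mapping. A sketch, where the bucket names are made up and the user's region might come from a GeoIP lookup or a CDN-provided country header:

```python
# Map user regions to the nearest upload bucket (names are assumptions).
REGION_BUCKETS = {
    "ap-southeast-1": "uploads-apse1",  # Singapore, the current bucket
    "us-east-1": "uploads-use1",        # hypothetical US replica
}
DEFAULT_REGION = "ap-southeast-1"

def bucket_for(user_region: str) -> str:
    """Pick the closest bucket, falling back to Singapore."""
    return REGION_BUCKETS.get(user_region, REGION_BUCKETS[DEFAULT_REGION])

# The chosen bucket then feeds into generate_presigned_url for the upload.
print(bucket_for("us-east-1"))  # uploads-use1
print(bucket_for("eu-west-1"))  # uploads-apse1 (no EU bucket, use default)
```

Remember that uploads landing in different regional buckets then need replication (or application logic) to converge, which is why a single bucket behind CloudFront is usually simpler.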
I'd like to set up a separate S3 bucket folder for each of my mobile app users to store their files in. However, I also want to set size limits so that they don't use up too much storage. Additionally, if they do go over the limit, I'd like to offer them increased space if they sign up for a premium service.
Is there a way to set per-folder file size limits through S3 configuration or the API? If not, would I have to use the API to calculate folder size on every upload? I know Amazon has the DevPay feature, but it might be a hassle for users to sign up with Amazon if they just want a small amount of free space.
There does not appear to be a way to do this, probably at least in part because there is actually no such thing as "folders" in S3. There is only the appearance of folders.
Amazon S3 does not have concept of a folder, there are only buckets and objects. The Amazon S3 console supports the folder concept using the object key name prefixes.
— http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
All of the keys in an S3 bucket are actually in a flat namespace, with the / delimiter used as desired to conceptually divide objects into logical groupings that look like folders, but it's only a convenient illusion. It seems impossible that S3 would have a concept of the size of a folder, when it has no actual concept of "folders" at all.
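A pure-Python illustration of that point: "folders" are just key prefixes, and grouping keys on the `/` delimiter is essentially what the S3 console (and `ListObjects` with `Delimiter='/'`) does for display. The keys below are invented.

```python
# S3 stores these as three independent objects in one flat namespace.
keys = [
    "site-1/images/a.png",
    "site-1/images/b.png",
    "site-2/docs/readme.txt",
]

# The "folders" only appear when we choose to split on the delimiter.
folders = {k.split("/", 1)[0] for k in keys}
print(sorted(folders))  # ['site-1', 'site-2']
```

Deleting every object under `site-1/` would make that "folder" vanish, because it was never a real entity to begin with.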
If you don't maintain an authoritative database of what's been stored by clients (which suggests that all uploads should pass through an app server rather than going directly to S3, which is the only approach that makes sense to me at all) then your only alternative is to poll S3 to discover what's there. An imperfect shortcut would be for your application to read the S3 bucket logs to discover what had been uploaded, but that is only provided on a best-effort basis. It should be reliable but is not guaranteed to be perfect.
This service provides a best effort attempt to log all access of objects within a bucket. Please note that it is possible that the actual usage report at the end of a month will slightly vary.
Your other option is to develop your own service that sits between users and Amazon S3, that monitors all requests to your buckets/objects.
— http://aws.amazon.com/articles/1109#13
Again, having your app server mediate all requests seems to be the logical approach, and would also allow you to detect immediately (as opposed to "discover later") that a user had exceeded a threshold.
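A minimal sketch of that app-server-mediated approach, with the quota numbers and the in-memory stand-in for your real database being assumptions:

```python
# Hypothetical quota: 100 MB free, 10x that for premium users.
FREE_QUOTA_BYTES = 100 * 1024 * 1024

# Stand-in for the authoritative usage database the answer describes.
usage = {"alice": 99 * 1024 * 1024}

def may_upload(user: str, size: int, premium: bool = False) -> bool:
    """Gate the upload before issuing any S3 access to the client."""
    limit = FREE_QUOTA_BYTES * 10 if premium else FREE_QUOTA_BYTES
    return usage.get(user, 0) + size <= limit

print(may_upload("alice", 2 * 1024 * 1024))        # False: offer premium
print(may_upload("alice", 2 * 1024 * 1024, True))  # True
```

Because the check runs before the upload, the user hits the limit immediately rather than being discovered over quota later.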
I would maintain a separate database in the cloud to hold each user's total storage usage. The count is easy to keep up to date via S3 event notifications, which can trigger a Lambda that in turn writes to the DB.
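A sketch of that event-driven variant: a Lambda-style handler consuming S3 event notification records and updating a per-user counter. The `<user>/<file>` key layout and the in-memory "database" are assumptions; note that `ObjectRemoved` events may not carry an object size, which is one reason an authoritative database is still valuable.

```python
# Stand-in for a DynamoDB table or similar.
usage_bytes = {}

def handler(event, context=None):
    """Update per-user usage from S3 event notification records."""
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)  # may be absent on delete
        user = key.split("/", 1)[0]  # assumes keys look like '<user>/<file>'
        if record["eventName"].startswith("ObjectCreated"):
            usage_bytes[user] = usage_bytes.get(user, 0) + size
        elif record["eventName"].startswith("ObjectRemoved"):
            usage_bytes[user] = max(0, usage_bytes.get(user, 0) - size)

# Simulate one ObjectCreated notification.
handler({"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"object": {"key": "alice/photo.jpg", "size": 1024}},
}]})
print(usage_bytes)  # {'alice': 1024}
```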
Is it possible to set up Amazon Simple Storage Service (S3) to use custom domains (storage-01.example.com, storage-02.example.com, storage-03.example.com, ...) without using CloudFront? I don't really care about having an 'edge' network, but I do want browsers to make parallel requests for assets. Thanks!
No, unless you duplicate your keys into multiple S3 buckets. This is because S3 uses the Host header value as a reference to the bucket.
I guess you could be sneaky and take advantage of the different URL styles, but it's a horrible suggestion and I would never implement it:
http://www.mybucketdomain.com/foo.jpg
http://www.mybucketdomain.com.s3.amazonaws.com/foo.jpg
http://s3.amazonaws.com/www.mybucketdomain.com/foo.jpg