S3: what is the recommended hierarchy for storing objects? - amazon-web-services

I've been playing around with Amazon S3 and I wonder why I would need to use multiple buckets. I thought I would simply name my objects according to the hierarchy they belong to, e.g. blog/articles/2016/08/article-title.jpg, and store them all in one bucket; the folders would be created implicitly. Or is there any reason why I would need multiple buckets to store uploaded files?
And if so, what is the proper design for multiple buckets? Let's say I need to categorise files by type, year, and month. I suppose I can't have buckets inside a bucket.

AWS guidance in S3 Bucket Restrictions and Limitations states:
There is no limit to the number of objects that can be stored in a bucket and no difference in performance whether you use many buckets or just a few. You can store all of your objects in a single bucket, or you can organize them across several buckets.
I would keep it simple, and store that type of asset data in a single bucket, perhaps divided up into a few 'top level' key name prefixes (folders) such as images, scripts, etc.
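For example, a minimal boto3 sketch (bucket and key names below are made up) showing that the 'folder' path is just part of the object key:

import boto3

s3 = boto3.client("s3")

# "Folders" are just key name prefixes; uploading this object makes the
# blog/articles/2016/08/ hierarchy appear in the S3 console automatically.
s3.upload_file(
    Filename="article-title.jpg",
    Bucket="my-assets-bucket",  # placeholder bucket name
    Key="blog/articles/2016/08/article-title.jpg",
)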

Related

Using S3 for User Image Content: Single or multiple buckets?

What's the best practice for using S3 to store image uploads from users: a single bucket, or multiple buckets for different purposes? The use case is a B2B application.
There is no limit to the amount of data you can store in an Amazon S3 bucket. Therefore you could, in theory, simply use one bucket for everything. (However, if you want data in multiple regions, then you would need to use a separate bucket per region.)
To best answer your question, you would need to think about how data is accessed:
If controlling access for IAM Users, then giving each user a separate folder is easy for access control using IAM Policy Elements: Variables and Tags
If controlling access for application users, then users will authenticate to an application, which will determine their access to objects. The application can then generate Amazon S3 pre-signed URLs to grant access to specific objects (see the sketch after this list), so separation by bucket/folder is less important
If the data is managed by different Admins/Developers it is a good idea to keep the data in separate buckets to simplify access permissions (eg keeping HR data separate from customer data)
Basically, as long as you have a good reason to separate the data (eg test vs prod, different apps, different admins), then use separate buckets. But, for a single app, it might make better sense to use a single bucket.
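As a minimal sketch of the pre-signed URL approach mentioned above (bucket and key names are made up), the application hands out a temporary link rather than relying on bucket or folder boundaries:

import boto3

s3 = boto3.client("s3")

# Grant a user temporary read access to one object, regardless of which
# bucket or "folder" it lives in.
url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": "my-app-bucket", "Key": "uploads/user-42/avatar.jpg"},
    ExpiresIn=3600,  # link is valid for one hour
)
print(url)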
I believe it's the same in terms of performance and availability. As for splitting content by purpose, it's probably OK to use a single bucket as long as the content is split into different folders (paths).
We used to have one bucket for user-uploaded content and another one for static (CSS/JS/IMG) files that were auto-generated.

Sorting S3 picture files according to size

I want to map the sizes of the picture files in an S3 bucket.
Is it possible to get the percentage of a bucket's files which are bigger than 5 MB?
Your question isn't too clear, but it appears that you want to obtain information about the size of objects in an Amazon S3 bucket.
The GET Bucket (List Objects) Version 2 API call (and its equivalent in various SDKs such as list-objects in the AWS CLI and list_objects_v2() in Python) will return a list of objects in a bucket, including the size of the objects. You could then use this information to calculate which objects are consuming the most storage space.
When listing objects, the only filter is the ability to specify a path (folder). It is not possible to list files based upon their size. Instead, all objects in the desired path will be returned.
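For example, a rough sketch using boto3's list_objects_v2 paginator (bucket name and prefix are placeholders) that tallies how many objects exceed 5 MB:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

total = 0
over_5mb = 0

# Page through every object under the given prefix and tally the sizes
for page in paginator.paginate(Bucket="my-bucket", Prefix="pictures/"):
    for obj in page.get("Contents", []):
        total += 1
        if obj["Size"] > 5 * 1024 * 1024:
            over_5mb += 1

if total:
    print(f"{over_5mb / total:.1%} of {total} objects are larger than 5 MB")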
If you have many objects (eg millions), it might be easier to use Amazon S3 Inventory, which can provide a daily CSV file listing all objects in a bucket.

Boto3 / Python - Listing recently modified objects?

I'm looking to list objects but I am only really interested in the last 1,000 that have been modified that same day.
I have seen that Boto3 supports pagination and getting specific objects by key name + a modified date. However, I can't see any mechanism that allows listing objects by their modified date?
Boto3 S3 Object.get() - This supports returning a key if it was modified on a certain day.
Boto3 Paginators - This allows listing a certain number of objects, but doesn't allow you to determine the listing method.
I can achieve this by first listing all objects and then iterating over that list, but that means listing the entire bucket, which is exactly the listing cost and overhead I'm trying to avoid.
No, there is no base functionality that offers this capability.
An alternative is to activate Amazon S3 Inventory:
Amazon S3 inventory is one of the tools Amazon S3 provides to help manage your storage. You can simplify and speed up business workflows and big data jobs using the Amazon S3 inventory, which provides a scheduled alternative to the Amazon S3 synchronous List API operation. Amazon S3 inventory provides a comma-separated values (CSV) flat-file output of your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (that is, objects that have names that begin with a common string).
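As an illustration, here is a hedged sketch of filtering an inventory report for today's modifications. The bucket, key, and CSV column order below are assumptions; the real locations come from the inventory's manifest, and the columns depend on the optional fields chosen in your inventory configuration.

import csv
import gzip
import io
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Placeholder locations; assumed column order: Bucket, Key, Size, LastModifiedDate.
obj = s3.get_object(
    Bucket="my-inventory-destination",
    Key="source-bucket/daily-inventory/data/part-00000.csv.gz",
)
today = datetime.now(timezone.utc).date().isoformat()

modified_today = []
with gzip.open(io.BytesIO(obj["Body"].read()), mode="rt", newline="") as report:
    for bucket, key, size, last_modified in csv.reader(report):
        # LastModifiedDate is ISO 8601, e.g. 2016-08-15T17:50:30.000Z
        if last_modified.startswith(today):
            modified_today.append(key)

print(f"{len(modified_today)} objects were modified today")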

Storing multiple results in Amazon S3

I am running multiple plugins as cron jobs every day to gather some data from different data sources. I am planning to store this data in two places: Amazon DynamoDB and Amazon S3. The metadata of the results will be stored in DynamoDB, which will also hold the name of the S3 bucket where the data will be stored. My question is: how do I group these daily results in S3? I am thinking of a couple of ways:
(1) Let's say for plugin1, every day I run it I store the data in a different bucket, where the bucket name will be -. The merit of this approach is that it is easy to retrieve the data for each day, but the demerit is that we now have 365 buckets for just one plugin. So if I have n plugins, I will have 365 times n buckets over a year. We could delete buckets after some time interval (say 3 months) to reduce the number of buckets.
(2) I could also use one bucket per plugin and use a GUID as a prefix for my keys, like guid/result_n, where result_n is the nth result I get for that plugin. I would also add a key, let's call it plugin_runs, that would hold a list of dictionaries, where each dictionary has the format {date: execution_id}. Then, for a given date, I could find the prefix for that date's execution_id and retrieve the contents of those keys.
Which approach would be better? Any other suggestions?
Given that AWS will only allow you to create 100 buckets per account, I would say #2 is a much better approach.
But you really only need a single bucket, with a key prefix on each object to organize them. Here, for example, is how AWS Kinesis Firehose creates objects for you, and the naming convention they use. If it works for them, it should work for you:
Amazon S3 Object Name Format
Firehose adds a UTC time prefix in the format YYYY/MM/DD/HH before putting objects to Amazon S3. The prefix translates into the Amazon S3 folder structure, where each label separated by a forward slash (/) becomes a sub-folder. You can modify this folder structure by adding your own top-level folder with a forward slash (for example, myApp/YYYY/MM/DD/HH) or prepending text to the YYYY top-level folder name (for example, myApp YYYY/MM/DD/HH). This is accomplished by specifying an S3 Prefix when creating the delivery stream, either by using the Firehose console or the Firehose API.
http://docs.aws.amazon.com/firehose/latest/dev/basic-deliver.html
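To sketch what that could look like with boto3 (the bucket name, key layout, and helper below are just placeholders, not a prescribed convention), each daily run writes under a plugin/YYYY/MM/DD/ prefix and is retrieved later by listing that prefix:

from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def store_result(plugin_name, execution_id, payload):
    # Assumed key layout: <plugin>/YYYY/MM/DD/<execution_id>.json
    now = datetime.now(timezone.utc)
    key = f"{plugin_name}/{now:%Y/%m/%d}/{execution_id}.json"
    s3.put_object(Bucket="plugin-results", Key=key, Body=payload)  # placeholder bucket
    return key

# Retrieve everything plugin1 produced on a given day by listing its date prefix
response = s3.list_objects_v2(Bucket="plugin-results", Prefix="plugin1/2016/08/15/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])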

Is it better to have multiple s3 buckets or one bucket with sub folders?

Is it better to have multiple S3 buckets per category of uploads, or one bucket with sub-folders, OR a linked S3 bucket? I know for sure there will be more user-images than profile-pics, and that there is a 5TB limit per bucket and 100 buckets per account. I'm doing this using the AWS boto library and https://github.com/amol-/depot
Which of the following structures should my folders take?
/app_bucket
/profile-pic-folder
/user-images-folder
OR
profile-pic-bucket
user-images-bucket
OR
/app_bucket_1
/app_bucket_2
The last one implies that it's really a 10TB "bucket" where a new bucket is created when the files within bucket_1 exceed 5TB, but all uploads are read as if they were in one bucket. Or is there a better way of doing what I'm trying to do? Many thanks!
I'm not sure if this is correct... 100 buckets per account?
https://www.reddit.com/r/aws/comments/28vbjs/requesting_increase_in_number_of_s3_buckets/
Yes, there is actually a 100-bucket limit per account. I asked an architect at an AWS event the reason for that. He said it is to avoid people hosting unlimited static websites on S3, as they think this could be abused. But you can apply for an increase.
By default, you can create up to 100 buckets in each of your AWS accounts. If you need additional buckets, you can increase your bucket limit by submitting a service limit increase.
Source: http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html
Also, please note that there are actually no folders in S3, just a flat file structure:
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects.
Source: http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
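As an illustration, here is a rough boto3 sketch (the bucket name is made up) of how those key name prefixes can be listed as if they were folders:

import boto3

s3 = boto3.client("s3")

# Delimiter="/" groups keys by their next path segment and returns those
# groups as CommonPrefixes, which is how the console shows "folders".
response = s3.list_objects_v2(Bucket="app_bucket", Delimiter="/")

for prefix in response.get("CommonPrefixes", []):
    print("folder:", prefix["Prefix"])   # e.g. profile-pic-folder/
for obj in response.get("Contents", []):
    print("object:", obj["Key"])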
Finally, the 5TB limit only applies to a single object. There is no limit on the number of objects or total size of the bucket.
Q: How much data can I store?
The total volume of data and number of objects you can store are unlimited.
Source: https://aws.amazon.com/s3/faqs/
Also, the documentation states there is no performance difference between using a single bucket or multiple buckets, so I guess both option 1 and option 2 would be suitable for you.
Hope this helps.
Simpler Permissions with Multiple Buckets
If the images are used in different use cases, using multiple buckets will simplify the permissions model, since you can give clients/users bucket level permissions instead of directory level permissions.
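For instance, a hedged sketch of a bucket-level grant applied with boto3 (the account ID, role name, and bucket name below are hypothetical):

import json

import boto3

s3 = boto3.client("s3")

# Hypothetical policy: give one client's IAM role read access to the whole
# bucket, which is simpler than maintaining per-prefix ("directory") conditions.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/ImageClientRole"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::user-images-bucket/*",
        }
    ],
}

s3.put_bucket_policy(Bucket="user-images-bucket", Policy=json.dumps(policy))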
2-way doors and migrations
On a similar note, using 2 buckets is more flexible down the road.
1 to 2:
If you switch from 1 bucket to 2, you now have to move all clients to the new set-up. You will need to update permissions for all clients, which can require IAM policy changes for both you and the client. Then you can move your clients over by releasing a new client library during the transition period.
2 to 1:
If you switch from 2 buckets to 1 bucket, your clients will already have access to the 1 bucket. All you need to do is update the client library and move your clients onto it during the transition period.
*If you don't have a client library, then code changes are required in both cases for the clients.