Organizing files in S3 - amazon-web-services

I have a social media web application. Users upload pictures such as profile pictures, project pictures, etc. What's the best way to organize these files in an S3 bucket?
I thought of creating a folder named after the user ID inside the bucket, and then inside that multiple other folders, i.e. profile, projects, etc.
I'm not sure if that's the best approach to follow!

The names (Keys) you assign an object in Amazon S3 are frankly irrelevant.
What matters is that you have a database that tracks the objects, their ownership and their purpose.
You should not use the filename (Key) of an Amazon S3 object as a way of storing information about the object, because your application might have millions of objects in S3 and it is too slow to scan the list of objects to see which ones exist. Instead, consult a database to find them.
To answer your question: yes, create a prefix per user if you wish, but then give each object a unique name (e.g. a universally unique identifier, UUID) so that name clashes are avoided.
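As a rough illustration of the advice above, here is a minimal Python sketch that generates a UUID-based key under a per-user prefix and records ownership and purpose in a database. The table layout, the function names and the use of an in-memory SQLite database are assumptions made for the example, not something the answer prescribes; a real application would use its own database and upload the object to S3 under the returned key.

```python
import os
import sqlite3
import uuid


def make_object_key(user_id: str, category: str, original_name: str) -> str:
    """Build a key like '<user_id>/<category>/<uuid><ext>' (hypothetical layout)."""
    ext = os.path.splitext(original_name)[1].lower()
    return f"{user_id}/{category}/{uuid.uuid4().hex}{ext}"


def record_upload(db: sqlite3.Connection, user_id: str, category: str,
                  original_name: str) -> str:
    """Generate a key and record who owns the object and what it is for."""
    key = make_object_key(user_id, category, original_name)
    db.execute(
        "INSERT INTO uploads (s3_key, user_id, category, original_name) "
        "VALUES (?, ?, ?, ?)",
        (key, user_id, category, original_name),
    )
    db.commit()
    return key


# In-memory database used purely for illustration.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE uploads ("
    "s3_key TEXT PRIMARY KEY, user_id TEXT, category TEXT, original_name TEXT)"
)

key = record_upload(db, "user42", "profile", "Me at the beach.JPG")

# The application finds a user's objects by querying the database,
# not by listing the bucket.
rows = db.execute(
    "SELECT s3_key FROM uploads WHERE user_id = ? AND category = ?",
    ("user42", "profile"),
).fetchall()
```

The original filename is kept in the database rather than in the key, so the key itself carries no meaning beyond being unique.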

In the past, there was a need to add random prefixes to key names for better performance. More details here and here.
The following is an extract from one of those pages:
Pay Attention to Your Naming Scheme If:
Distributing the Key names
Don't start your object key names with a date or another standard sequence. Doing so increases complexity in the S3 index and reduces performance, because, based on the index, such objects are saved in a single storage partition.
Amazon S3 maintains keys lexicographically in its internal indices.
However, as of the 17 Jul 2018 announcement, adding a random prefix to S3 keys is no longer required to improve performance.

Related

Efficient way to find and delete s3 objects with extension

I have a bucket in S3 from which I want to delete all objects with a particular extension.
The easiest solution is to list all keys, check whether each one ends with the extension, and delete it, but this solution is very costly. Can anyone suggest an efficient way to achieve this?

Look at the S3 Inventory report if you do not need up-to-the-minute accuracy.
Alternatively, you might have to create an index of your S3 objects in DynamoDB or elsewhere so that you can easily find objects with a given suffix. Or even consider restructuring your keys so that they begin with the file extension, then you can list a prefix such as csv/ (obviously this might have negative consequences elsewhere in your application so is not necessarily a good solution).
Note that the price of listing objects in S3 Standard is $0.005 per 1,000 requests and each of those requests will return up to 1,000 S3 keys. I'm not sure how many keys you would be listing but that's $0.005 per million objects.
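For reference, the listing arithmetic above and the plain list-filter-delete approach can be sketched in Python as follows. This is only a sketch: the `client` argument is assumed to be a boto3 S3 client (for example `boto3.client("s3")`), the function names are made up for the example, the price constant is the figure quoted above and may have changed, and per-key errors returned by the delete call are not inspected.

```python
import math
from typing import Any, Iterator, List

LIST_PRICE_PER_1000_REQUESTS = 0.005  # USD, the figure quoted above
KEYS_PER_LIST_REQUEST = 1000          # each list request returns up to 1,000 keys
MAX_KEYS_PER_DELETE = 1000            # DeleteObjects accepts up to 1,000 keys per call


def estimate_listing_cost(num_objects: int) -> float:
    """Approximate cost in USD of listing num_objects keys once."""
    requests = math.ceil(num_objects / KEYS_PER_LIST_REQUEST)
    return requests * LIST_PRICE_PER_1000_REQUESTS / 1000


def keys_with_suffix(client: Any, bucket: str, suffix: str,
                     prefix: str = "") -> Iterator[str]:
    """Yield every key under prefix that ends with suffix."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                yield obj["Key"]


def delete_by_suffix(client: Any, bucket: str, suffix: str,
                     prefix: str = "") -> int:
    """Delete matching objects in batches; returns the number of keys submitted."""
    submitted = 0
    batch: List[dict] = []
    for key in keys_with_suffix(client, bucket, suffix, prefix):
        batch.append({"Key": key})
        if len(batch) == MAX_KEYS_PER_DELETE:
            client.delete_objects(Bucket=bucket,
                                  Delete={"Objects": batch, "Quiet": True})
            submitted += len(batch)
            batch = []
    if batch:
        client.delete_objects(Bucket=bucket,
                              Delete={"Objects": batch, "Quiet": True})
        submitted += len(batch)
    return submitted
```

With these numbers, `estimate_listing_cost(1_000_000)` comes to about $0.005, which matches the estimate above; the delete requests themselves are a separate matter.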

Is there anything to be gained by using 'folders' in an s3 bucket?

I am moving a largish number of jpgs (several hundred thousand) from a static filesystem to amazon s3.
On the old filesystem, I grouped files into subfolders to keep the total number of files per folder manageable.
For example, a file
4aca29c7c0a76c1cbaad40b2693e6bef.jpg
would be saved to:
/4a/ca/29/4aca29c7c0a76c1cbaad40b2693e6bef.jpg
From what I understand, s3 doesn't respect hierarchical namespaces. So if I were to use 'folders' on s3, the object, including the /'s, would really just be in a flat namespace.
Still, according to the docs, amazon recommends mimicking a structured filesystem when working with s3.
So I am wondering: Is there anything to be gained using the above folder structure to organize files on s3? Or in this case am I better off just adding the files to s3 without any kind of 'folder' structure.

Performance is not impacted by the use (or non-use) of folders.
Some systems can use folders for easier navigation of the files. For example, Amazon Athena can scan specific sub-directories when querying data rather than having to read every file.
If your bucket is being used for one specific purpose, there is no reason to use folders. However, if it contains different types of data, then you might consider at least a top-level set of folders to keep data separated.
Another potential reason for using folders is security. A bucket policy can grant access to objects based upon a key prefix (which is effectively a folder name). However, this is likely not relevant for your use-case.
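To make the prefix-based security point concrete, here is a minimal sketch of such a bucket policy, written as a Python dictionary and serialized to JSON. The bucket name, the account ID, the user name and the `images/` prefix are all placeholders invented for the example.

```python
import json

# Hypothetical policy: allow one IAM user to read objects under images/ only.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadImagesPrefixOnly",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:user/ExampleUser"},
            "Action": "s3:GetObject",
            # The trailing /* restricts the grant to keys beginning with images/
            "Resource": "arn:aws:s3:::example-bucket/images/*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)

# With boto3 this could then be applied along the lines of:
#   boto3.client("s3").put_bucket_policy(Bucket="example-bucket", Policy=policy_json)
```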

Using "folders" has no performance impact on S3, either way. It doesn't make it faster, and it doesn't make it slower.
The value of delimiting your object keys with / is in organization, both machine-friendly and human-friendly.
If you're trawling through a bucket in the console while troubleshooting, those meaningless noise-filled keys are a hassle to paginate through, only a few dozen at a time.
The console automatically groups objects into imaginary folders based on the / delimiters, so finding an object to inspect it (check headers, metadata, etc.) is much easier if you can just click on 4a, then ca, then 29.
The S3 ListObjects APIs support requesting all the objects with a certain key prefix, but they also support finding all the common prefixes before the next delimiter, so you can send API requests to list prefix 4a/ca/ with delimiter / and it will only return the "folders" one level deep, which it refers to as "common prefixes."
This is less meaningful if your object keys are fully opaque and convey nothing more about the objects, as opposed to using key prefixes like images/ and thumbnails/ and videos/.
Having been an admin and working with S3 for a number of years, and having worked with buckets with key naming schemes designed by different teams, I would definitely recommend using some / delimiters for organization purposes. The buckets without them become more of a hassle to navigate over time.
Note that the console does allow you to "create folders," but this is more of an illusion -- there is no need to actually do this, unless you're loading a bucket manually. When you create a folder in the console, it just creates an empty object with a / at the end.
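The prefix-plus-delimiter listing described above can be sketched in Python like this. It assumes `client` is a boto3 S3 client; `sharded_key` and `list_subfolders` are names made up for the example, and the key layout mirrors the one in the question, minus the leading slash.

```python
from typing import Any, List


def sharded_key(filename: str, depth: int = 3, width: int = 2) -> str:
    """Turn '4aca29...jpg' into '4a/ca/29/4aca29...jpg'."""
    parts = [filename[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join(parts + [filename])


def list_subfolders(client: Any, bucket: str, prefix: str,
                    delimiter: str = "/") -> List[str]:
    """Return the common prefixes ("folders") one level below prefix."""
    paginator = client.get_paginator("list_objects_v2")
    folders: List[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix,
                                   Delimiter=delimiter):
        # Keys that share the text up to the next delimiter are rolled up
        # into a single CommonPrefixes entry instead of being listed.
        for entry in page.get("CommonPrefixes", []):
            folders.append(entry["Prefix"])
    return folders


# Example use (assuming boto3 is installed and credentials are configured):
#   import boto3
#   list_subfolders(boto3.client("s3"), "example-bucket", "4a/ca/")
```

Nothing needs to be created in advance for this to work: the "folders" exist only because some object key contains them.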

Easy way to create dated subdirectories on AWS S3

I'm trying to create a web service that is able to store user-uploaded files in S3. The problem is that we want the files stored in "dated directories".
For example, if a user uploads a.txt on 12/1/2017 at 9:15am, the file should look like this in S3:
https://s3-eu-west-1.amazonaws.com/test-bucket/uploaded/2017/12/1/9/a.txt
Does S3 have any API to help us achieve this, or do we need to hand-craft this solution?

There is no such API in S3. Think of Amazon S3 as a storage service, not an application or database.
It is the responsibility of your application to store the data in the desired naming format -- just like storing data on a disk.
By the way, your naming format could do with some improvement:
Always expand fields to the correct number of digits (use 01 for January rather than 1) so that they sort correctly.
Think about your use-case -- if you will be scanning documents by year, then the /2017/12/01/09/a.txt naming format makes sense since you can look in the 2017 directory (not that directories really exist in S3). If not, then simply store it as /2017-12-01-09-a.txt.
Make it very clear which one is month vs day -- the USA is one of very few countries that read "12/1/2017" as December 1st. Most of the rest of the world reads it as "12 January". Using the format of 2017-12-01 makes it clear that it is 1-December-2017.
What about naming conflicts? Can only one person upload a file with a given name on a given day? How are you going to differentiate between different users uploading a file with the same name?
The reality is, the filename is totally irrelevant -- your application should use a database to keep track of objects that users upload and assign each of them a unique name. When a file is later requested, look up the filename in the database and then provide that file. Do not use S3 filenames as a pseudo-database where the name conveys particular meaning, otherwise you'll often have to rename files to add more meaning!
Directories don't actually exist in S3 -- they are just part of the filename. So, you can create a file in a given directory just by storing it -- there is no need to pre-create directories.
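As a small sketch of the naming advice above, the following Python function builds a dated key with zero-padded fields. The `uploaded` prefix comes from the question; the function name and the optional `unique` flag (which prepends a UUID to avoid clashes between users) are assumptions made for the example.

```python
import uuid
from datetime import datetime


def dated_key(filename: str, when: datetime, prefix: str = "uploaded",
              unique: bool = False) -> str:
    """Build a key such as 'uploaded/2017/12/01/09/a.txt'."""
    # %m, %d and %H are zero-padded, so keys sort chronologically.
    folder = when.strftime("%Y/%m/%d/%H")
    name = f"{uuid.uuid4().hex}-{filename}" if unique else filename
    return f"{prefix}/{folder}/{name}"


key = dated_key("a.txt", datetime(2017, 12, 1, 9, 15))
# key is "uploaded/2017/12/01/09/a.txt"
```

Uploading an object under that key is all that is needed; no "directory" has to be created first.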

AWS S3 does not provide you with such logic. But it should be fairly easy to use the time information in your application to create such an S3 object key ("path").
Good luck!

Is there a way to query S3 object key names for the latest per prefix?

In an S3 bucket, I have thousands and thousands of files stored with names having a structure that comes down to prefix and number:
A-0001
A-0002
A-0003
B-0001
B-0002
C-0001
C-0002
C-0003
C-0004
C-0005
New objects for a given prefix should come in with varying frequency, but might not. Older objects may disappear.
Is there a way to efficiently query S3 for the highest number of every prefix, i.e. without listing the entire bucket? The result I want is:
A-0003
B-0002
C-0005
The S3 API itself does not seem to offer anything usable for that. However, perhaps another service, like Athena, could do it? So far I have only found it capable of searching within objects, but all I care about are their key names. If it can report on the contents of objects in the bucket, can't it on the bucket itself?
I would be okay with the latest modification date per prefix, but I want to avoid having to switch to a versioned bucket with just the prefixes as names to achieve that.

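I think this is what you are looking for:
In Athena, the "$path" pseudo-column holds the path of the object each row came from, and you can apply a regular expression to it to match the pattern you are querying, for example: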
WHERE regexp_extract(sp."$path", '[^/]+$') like concat('%',cast(current_date - interval '1' day as varchar),'.csv')

The S3 API itself does not seem to offer anything usable for that.
However, perhaps another service, like Athena, could do it?
Yes, at the moment there is no direct way of doing it with AWS S3 alone. Even with Athena, it will go through the files to query their content, but it will be easier thanks to Athena's standard SQL support, and faster since the queries run in parallel.
So far I have only found it capable of searching within objects, but
all I care about are their key names.
Both Athena and S3 Select query by content, not by key.
The best approach I can recommend is to use AWS DynamoDB to keep the metadata of the files, including file names for faster querying.
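The metadata-index idea above can be sketched in Python as follows. An in-memory dictionary stands in for the DynamoDB table here, and the class and method names are made up for the example; in a real system the update would be a write to the metadata store performed whenever an object is uploaded. Note that this sketch does not handle deletions of the current highest object.

```python
from typing import Dict, Iterable, List, Tuple


def split_key(key: str) -> Tuple[str, int]:
    """Split 'A-0003' into ('A', 3)."""
    prefix, _, number = key.rpartition("-")
    return prefix, int(number)


class LatestKeyIndex:
    """Tracks the highest-numbered key seen for each prefix."""

    def __init__(self) -> None:
        self._latest: Dict[str, Tuple[int, str]] = {}

    def record_upload(self, key: str) -> None:
        """Call this whenever an object is written to the bucket."""
        prefix, number = split_key(key)
        current = self._latest.get(prefix)
        if current is None or number > current[0]:
            self._latest[prefix] = (number, key)

    def latest(self) -> List[str]:
        """Return the highest key per prefix, ordered by prefix."""
        return [self._latest[p][1] for p in sorted(self._latest)]


def build_index(keys: Iterable[str]) -> LatestKeyIndex:
    """Backfill the index from an existing list of keys."""
    index = LatestKeyIndex()
    for key in keys:
        index.record_upload(key)
    return index


index = build_index(["A-0001", "A-0002", "A-0003", "B-0001", "B-0002",
                     "C-0001", "C-0002", "C-0003", "C-0004", "C-0005"])
result = index.latest()
```

Reading the answer then means reading one small record per prefix rather than listing the whole bucket.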
