Optimize photo storage nomenclature on Amazon S3 - amazon-web-services

I have to store a lot of photos (more than 1,000,000, each at most 5 MB). I have a database in which every record has 5 photos. Which of these is the best solution?
Create a directory for each record's slug/id and upload its photos inside it.
Put all photos into one directory and include the record's id or slug in each file name.
Put all photos into one directory and add a field to each database record holding the names of its photos.
I use Amazon S3.

I would suggest naming your photos like this when uploading in batch:
user1/image1.jpeg
user2/image2.jpeg
These names do not affect the way objects are stored on S3. They are simply the keys of the objects, because S3 has no folder-like hierarchy. They do, however, make the objects appear in folders, which makes it easy to segregate the images later if you want to.
For example, suppose you stored all images under unique names and you use a UUID to map records in the database to images in your bucket.
If you later want all 5 photos of a particular user, you have to:
scan the database for that username,
retrieve the UUIDs of that user's images,
and then use those UUIDs to fetch the images from S3.
But if you prefix each image name with the username, you can fetch the images directly from S3 without consulting your database at all.
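For example, to list all photos of user1, you can use this small Python snippet:
import boto3
# Resource-level handle to S3; credentials are read from the environment.
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')
# Iterate only over objects whose keys start with 'user1/'.
for obj in bucket.objects.filter(Prefix='user1/'):
    print(obj.key)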
If you do not put a user ID in the object key, you have to consult the database to map photos to records, even just to list the images of a particular user.

A lot of this depends on your use-case, such as how the database and the photos will be used. There is not enough information here to give a definitive answer.
However, some recommendations for the storage side...
The easiest option is just to use a UUID for each photo. This is effectively a random name that has no meaning. Store that name in your database and your system will know which image relates to which record. There is no need to ever rename the images because the names are just Unique IDs and convey no further information.
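As a rough sketch of that idea (not the only way to do it), the snippet below generates a random key per photo and records it against the owning record. The `photos` table, its columns, and the helper names are hypothetical, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import sqlite3
import uuid


def new_photo_key(extension: str = "jpeg") -> str:
    """Return a random, meaningless object key such as '<32 hex chars>.jpeg'."""
    return f"{uuid.uuid4().hex}.{extension}"


def add_photo(conn: sqlite3.Connection, record_id: int, extension: str = "jpeg") -> str:
    """Generate a key for a new photo and remember which record it belongs to."""
    key = new_photo_key(extension)
    conn.execute(
        "INSERT INTO photos (record_id, s3_key) VALUES (?, ?)", (record_id, key)
    )
    return key


def photo_keys(conn: sqlite3.Connection, record_id: int) -> list[str]:
    """Look up every stored key for one record, in insertion order."""
    rows = conn.execute(
        "SELECT s3_key FROM photos WHERE record_id = ? ORDER BY id", (record_id,)
    )
    return [row[0] for row in rows]


# In-memory database used only for illustration (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE photos (id INTEGER PRIMARY KEY, record_id INTEGER, s3_key TEXT UNIQUE)"
)
keys_for_record_42 = [add_photo(conn, 42) for _ in range(5)]
```

The key returned by `add_photo` is what you would pass as the object key when uploading the file to S3; it never needs to change afterwards.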
When you want to provide access to a particular image, your application can generate an Amazon S3 pre-signed URL that grants time-limited access to an object. After the expiry time, the URL does not work so the object remains private. Granting access in this manner means that there is no need to group images into directories by "owner", since access is granted per-object rather than per-owner.
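A minimal sketch of generating such a URL, assuming you use the boto3 S3 client. The helper name and the 15-minute default expiry are illustrative choices, not something from the question. The client is passed in so the function itself does not need credentials at import time.

```python
def presigned_photo_url(s3_client, bucket: str, key: str, expires_in: int = 900) -> str:
    """Return a URL that allows GET access to one object for `expires_in` seconds."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


# Typical use (needs boto3 installed and AWS credentials configured):
#   import boto3
#   url = presigned_photo_url(boto3.client("s3"), "bucket_name", "user1/image1.jpeg")
```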
Also, please note that Amazon S3 doesn't actually support folders. Rather, the Key ("filename") of the object is the entire path (eg user-2/foo.jpg). This makes it more human-readable (because the objects 'appear' to be in folders), but doesn't actually impact the way data is stored behind-the-scenes.
Bottom line: It doesn't really matter how you store the images. What matters is that you store the image name in your database so you know which image matches which record. Avoid situations where you need to rename images - just give them a name and keep it.

Related

AWS S3 filename

I'm trying to build an application with a Java backend that allows users to create text with images in it (something like a personal blog). I'm planning to store these images in an S3 bucket. When uploading image files to the bucket, I hash the original name and store the file under the hashed name. The images are for display only; no user will be able to download them. The frontend displays these images by getting their paths from the server. So the question is: is there any need to store the original name of the image file in the database? And what are the reasons, if any, for doing so?
I guess in general it is not needed because what is more important is how these resources are used or managed in the system.
If your service were something like a file-access service (similar to Google Drive), I still don't think it would be necessary to store the original name in the DB, unless you want to make search queries faster.

Storing S3 Urls vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit and preview and/or Download said attachments.
I was planning on storing the S3 urls in DB and then pre-signing them when the User needs them. I'm finding a caveat here is that this can lead to edge cases between S3 and the DB.
I.e. if a file gets removed from S3 but its url does not get removed from DB (or vice-versa). This can lead to data inconsistency and may mislead users.
I was thinking of just getting the urls via the network by using listObjects in the s3 client SDK. I don't really need to store the urls and this guarantees the user gets what's actually in S3.
Only con here is that it makes 1 API request (as opposed to DB hit)
Any insights?
Thanks!
Using a database to store an index to files is a good idea, especially once the volume of objects increases. The ListObjects() API returns at most 1,000 objects per call, so larger listings need repeated calls. This might be okay if every user has their own path (so you can use ListObjects(Prefix='user1/')), but that's not ideal if you want to allow document sharing between users.
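If you do go the listing route, the 1,000-object limit means you need to paginate. Here is a small sketch assuming the boto3 client and its `list_objects_v2` paginator; the function name `iter_keys` is made up for illustration.

```python
from typing import Iterator


def iter_keys(s3_client, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every key under `prefix`, following pagination past 1,000 objects."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Typical use (needs boto3 installed and AWS credentials configured):
#   import boto3
#   for key in iter_keys(boto3.client("s3"), "bucket_name", "user1/"):
#       print(key)
```

Each page is still a separate API request, which is part of why a database index tends to be faster for large listings.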
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created/deleted outside of your app, or if there is an error in the app. If this concerns you, use Amazon S3 Inventory to provide a regular listing of the objects in the bucket, and write some code to compare it against the database entries. This will highlight anything that is going wrong.
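The comparison itself can be very simple. The sketch below assumes you have already loaded the keys from an inventory report and from your database into two Python collections; how you load them, and the function name, are left as assumptions.

```python
from typing import Iterable


def compare_listings(db_keys: Iterable[str], s3_keys: Iterable[str]) -> dict[str, list[str]]:
    """Report keys that exist on only one side of the database/S3 pair."""
    db, s3 = set(db_keys), set(s3_keys)
    return {
        "missing_from_s3": sorted(db - s3),  # DB rows pointing at deleted objects
        "missing_from_db": sorted(s3 - db),  # objects the app does not know about
    }


report = compare_listings(
    db_keys=["user1/a.pdf", "user1/b.pdf"],
    s3_keys=["user1/b.pdf", "user1/c.pdf"],
)
```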
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.

Storing of S3 Keys vs URLs

I have some functionality that uploads Documents to an S3 Bucket.
The key names are programmatically generated via some proprietary logic for the layout/naming convention needed.
The result of my S3 upload command is the actual URL itself, in the format
REGION/BUCKET/KEY
I was planning on storing that full url into my DB so that users can access their uploads.
Given that REGION and BUCKET probably wouldn't change, does it make sense to just store the KEY - and then dynamically generate the full url when the client needs it?
Just want to know what the desired pattern here is and what others do. Thanks!
Storing the full URL is a bad idea. As you said in the question, the region and bucket are already known, so storing the full URL is a waste of disk space. Also, if you later want to migrate your assets to a different bucket, perhaps in a different region, having full URLs stored in the DB just makes things harder.
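A sketch of building the URL on demand from the stored key. It assumes the virtual-hosted-style S3 URL layout; the region and bucket values are placeholders that would come from your app's configuration.

```python
from urllib.parse import quote

# Placeholder configuration values (assumptions, not from the question).
REGION = "us-east-1"
BUCKET = "example-bucket"


def object_url(key: str, bucket: str = BUCKET, region: str = REGION) -> str:
    """Build a virtual-hosted-style S3 URL from a stored object key."""
    # Percent-encode the key but keep '/' so the path structure stays readable.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key, safe='/')}"


url = object_url("docs/report 1.pdf")
```

Moving to another bucket or region then only means changing the two configuration values, not rewriting rows in the DB.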

Organizing files in S3

I have a social media web application. Users upload pictures such as profile pictures, project pictures, etc. What's the best way to organize these files in an S3 bucket?
I thought of creating a folder named after the user ID inside the bucket, and inside that several other folders, e.g. profile, projects, etc.
Not sure if that's the best approach to follow!
The names (Keys) you assign an object in Amazon S3 are frankly irrelevant.
What matters is that you have a database that tracks the objects, their ownership and their purpose.
You should not use the filename (Key) of an Amazon S3 object as a way of storing information about the object, because your application might have millions of objects in S3 and it is too slow to scan the list of objects to see which ones exist. Instead, consult a database to find them.
To answer your question: Yes, create a prefix by username if you wish, but then give each object a unique name (e.g. a universally unique identifier, UUID) that avoids name clashes.
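One possible way to combine a per-user prefix with a unique name is sketched below. The `user_id/category/uuid.ext` layout and the function name are just illustrations of the idea, not a required convention.

```python
import uuid
from pathlib import PurePosixPath


def build_key(user_id: int, category: str, original_filename: str) -> str:
    """Return a key like '17/profile/<32 hex chars>.jpg' that cannot clash."""
    # Keep only the (lower-cased) extension of the uploaded file name.
    suffix = PurePosixPath(original_filename).suffix.lower()
    return f"{user_id}/{category}/{uuid.uuid4().hex}{suffix}"


key = build_key(17, "profile", "Me.JPG")
```

You would still record the resulting key, its owner and its purpose in the database, as described above.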
There used to be a need to add random prefixes for better performance. More details here and here.
The following is an extract from one of those pages:
Pay Attention to Your Naming Scheme If:
Distributing the Key names
Don’t save your object's key name starts with a date or standard key
names, it improves complexity in the S3 indexing and will reduce
performance, because based on the indexing objects saves in the single
storage partition .
Amazon S3 maintains keys lexicographically in its internal indices.
However, as of the 17 Jul 2018 announcement, adding a random prefix to S3 keys is no longer required to improve performance.

File Storage usage with Django and Amazon S3

I have a scenario where storage use needs to be determined per user in my app, since it will be placing limits on how much storage each user can use. I'm currently using django-storages, boto, and S3 to manage and store files. What is the best way to aggregate storage use on a per-user basis?
I thought about keeping track of each file that's uploaded, incrementing the total on upload and decrementing it on delete, and storing that aggregated file size in the DB, but I'm wondering if there is a cleaner way to get this. What solutions have others used? Many thanks.
It's not a full solution, but a start might be to store the user-uploaded files in folders that are specific to the user. The "upload_to" attribute of a model field can be a callable that returns the location to store the media; this could end up putting each user's uploads under a user-specific folder (key prefix) within your S3 bucket.
Then you can use boto (or other tools) to query for the total size of the objects in a user's folder.
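A rough sketch of both halves follows. The `upload_to` callable uses Django's `(instance, filename)` signature and assumes the model has a `user` attribute. The size query uses the newer boto3 client rather than boto, which is a substitution on my part, and both function names are hypothetical.

```python
def user_directory_path(instance, filename: str) -> str:
    """Callable for `upload_to`: store each upload under the owner's prefix."""
    # Assumes the model instance has a `user` foreign key.
    return f"user_{instance.user.id}/{filename}"


def prefix_size_bytes(s3_client, bucket: str, prefix: str) -> int:
    """Sum the sizes of all objects whose keys start with `prefix`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when no object matches the prefix.
        total += sum(obj["Size"] for obj in page.get("Contents", []))
    return total


# Typical use (needs Django, boto3 and AWS credentials):
#   upload = models.FileField(upload_to=user_directory_path)
#   used = prefix_size_bytes(boto3.client("s3"), "bucket_name", "user_7/")
```

Listing every object on each check can get slow for heavy users, so keeping a running total in the DB, as the question suggests, may still be worth it.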