File Storage usage with Django and Amazon S3

I have a scenario where storage use needs to be determined per user in my app, since it will be placing limits on how much storage each user can consume. I'm currently using django-storages, boto, and S3 to manage and store files. What is the best way to aggregate storage use on a per-user basis?
I thought about keeping track of each file that's uploaded, incrementing the aggregate file size on upload and decrementing it on delete, and storing that aggregated size in the DB, but I'm wondering if there is a cleaner way to do this. What solutions have others used out there? Many thanks.

It's not a full solution, but a start might be to store the user-uploaded files in folders that are specific to the user. The "upload_to" attribute of a model field can be a callable that returns the location to store the media; this can put each user's uploads under a user-specific prefix (folder) within your S3 bucket.
Then you can use boto (or other tools) to query the total size of the objects under a user's prefix.
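A rough sketch of both pieces, using boto3 rather than the older boto and assuming an "uploads/<user_id>/<filename>" layout (all names here are illustrative, not a definitive implementation):
import boto3

# Hypothetical Django upload_to callable: places each user's files under a
# per-user prefix such as "uploads/<user_id>/<filename>".
# Assumes the model has a ForeignKey to the user called "owner".
def user_directory_path(instance, filename):
    return f"uploads/{instance.owner.id}/{filename}"

def storage_used_by_user(bucket_name, user_id):
    """Return the total size (in bytes) of all objects under a user's prefix."""
    s3 = boto3.resource("s3")
    bucket = s3.Bucket(bucket_name)
    # Note: this walks every object under the prefix, so caching the running
    # total in the database (as you suggested) is still worthwhile for users
    # with many files.
    return sum(obj.size for obj in bucket.objects.filter(Prefix=f"uploads/{user_id}/"))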

Related

Storing S3 Urls vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit them to preview and/or download said attachments.
I was planning on storing the S3 URLs in the DB and then pre-signing them when the user needs them. One caveat I'm finding is that this can lead to edge cases between S3 and the DB.
I.e. if a file gets removed from S3 but its URL does not get removed from the DB (or vice versa). This can lead to data inconsistency and may mislead users.
I was thinking of just getting the URLs over the network by using listObjects in the S3 client SDK. I don't really need to store the URLs, and this guarantees the user gets what's actually in S3.
The only con here is that it makes an API request (as opposed to a DB hit).
Any insights?
Thanks!
Using a database to store an index to files is a good idea, especially once the volume of objects increases. The ListObjects() API returns a maximum of 1000 objects per call. This might be okay if every user has their own path (so you can use ListObjects(Prefix='user1/')), but that's not ideal if you want to allow document sharing between users.
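For example, a minimal boto3 sketch (bucket and prefix names are placeholders, not a definitive implementation) that pages through a per-user prefix, letting the paginator handle the 1000-object page size:
import boto3

def list_user_keys(bucket_name, user_prefix):
    """List every key under a prefix, paging past the 1000-object-per-call limit."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=user_prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

# Usage (hypothetical names):
# keys = list_user_keys("my-attachments-bucket", "user1/")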
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created/deleted outside of your app, or if there is an error in the app. If this concerns you, use Amazon S3 Inventory to provide a regular listing of objects in the bucket, and write some code to compare it against the database entries. This will highlight if anything is going wrong.
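A rough sketch of that comparison, assuming a CSV-formatted, gzip-compressed inventory data file whose second column is the object key (all names here are placeholders):
import csv
import gzip
import io
import boto3

def keys_from_inventory_file(inventory_bucket, inventory_key):
    """Extract the object keys from one S3 Inventory data file (gzipped CSV)."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=inventory_bucket, Key=inventory_key)["Body"].read()
    reader = csv.reader(io.StringIO(gzip.decompress(body).decode("utf-8")))
    # Keys in inventory reports are URL-encoded; decode them if your keys need it.
    return {row[1] for row in reader}

def reconcile(inventory_keys, db_keys):
    """Return keys present in S3 but not the DB, and keys in the DB but not S3."""
    return inventory_keys - db_keys, db_keys - inventory_keys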
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.

Storing of S3 Keys vs URLs

I have some functionality that uploads Documents to an S3 Bucket.
The key names are programmatically generated via some proprietary logic for the layout/naming convention needed.
The result of my S3 upload command is the actual URL itself. So, it's in the format of
REGION/BUCKET/KEY
I was planning on storing that full url into my DB so that users can access their uploads.
Given that REGION and BUCKET probably wouldn't change, does it make sense to just store the KEY - and then dynamically generate the full url when the client needs it?
Just want to know what the desired pattern here is and what others do. Thanks!
Storing the full URL is a bad idea. As you said in the question, the region and bucket are already known, so storing the full URL is a waste of disk space. Also, if in the future you want to migrate your assets to a different bucket, perhaps in a different region, having full URLs stored in the DB just makes things harder.
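For illustration, a small sketch of rebuilding the URL from the stored key at read time (the region, bucket, and URL style here are assumptions; in practice they would come from your configuration):
# Assumed configuration values; read these from settings in a real app.
REGION = "us-east-1"
BUCKET = "my-app-uploads"

def object_url(key):
    """Build the full virtual-hosted-style URL from the stored key."""
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/{key}"

# Usage:
# object_url("some/generated/key.pdf")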

how to restrict google cloud storage upload

I have a mobile application that uses Google Cloud Storage. The application allows each registered user to upload a specific number of files.
My question is, is there a way to do some kind of checks before the storage upload? Or do I need to implement a separate reservation API of sorts that OKs an upload step?
Any alternative suggestions are welcome too, of course.
warning: Not an authoritative answer. Happy to accept removal or update requests.
I am not aware of any GCS or Firebase Cloud Storage mechanisms that will inherently limit the number of files (objects) that a given user can create. If it were me, this is how I would approach the puzzle.
I would create a database (eg. Firestore / Datastore) that has a key for each user and a value which is the number of files they have uploaded. When a user wants to upload a new file, it would first make a REST call to a Cloud Function that I would write. This Cloud Function would implicitly know the identity of the calling user. It would look up the record in the database and determine if we are allowed to upload a new file. If no, then return an error and end of story. If yes, then increment the value in the database. Next I would create a GCS "signed URL" that can be used to permit an upload. It would be that signed URL that the Cloud Function would return. The app that now wishes to upload can use that signed URL to perform the actual upload.
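A rough Python sketch of that flow; the Firestore collection, field names, quota, and bucket are all assumptions, and the runtime service account is assumed to have permission to sign URLs:
import datetime
import uuid
from google.cloud import firestore, storage

MAX_FILES_PER_USER = 20     # assumed quota
BUCKET_NAME = "my-uploads"  # placeholder bucket

db = firestore.Client()

@firestore.transactional
def reserve_slot(transaction, counter_ref):
    """Atomically check and increment the user's upload count."""
    snapshot = counter_ref.get(transaction=transaction)
    count = snapshot.to_dict().get("count", 0) if snapshot.exists else 0
    if count >= MAX_FILES_PER_USER:
        return False
    transaction.set(counter_ref, {"count": count + 1}, merge=True)
    return True

def signed_upload_url(user_id):
    """Return a short-lived PUT URL if the user is under quota, else None."""
    counter_ref = db.collection("upload_counts").document(user_id)
    if not reserve_slot(db.transaction(), counter_ref):
        return None
    blob = storage.Client().bucket(BUCKET_NAME).blob(f"{user_id}/{uuid.uuid4()}")
    return blob.generate_signed_url(
        version="v4",
        expiration=datetime.timedelta(minutes=15),
        method="PUT",
    )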
I would also add metadata to each file uploaded to identify the logical uploader (user) of the file. That can be then used for reconciliation if needed. We could examine all the files in the bucket and re-build the database of how many files each user had uploaded.
A possible alternative to this story is for the Cloud Function to not return a signed URL but instead receive the data to be uploaded in the same request. If the check on the number of files passes, then the Cloud Function could act as a proxy to a GCS write and create the file directly. This alternative needs to be carefully examined as a function of the sizes of the files to be uploaded; if the files are large, it may be a very poor solution. We want to be in and out of Cloud Functions as quickly as possible, and keeping a Cloud Function around to serve as a data pass-through isn't great. We may want to look at Cloud Run in that case, as it supports concurrency within an instance without increasing the cost per call.
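For completeness, a hedged sketch of that proxy variant as an HTTP Cloud Function (the identity header and names are assumptions), only sensible for small files:
import uuid
from google.cloud import storage

BUCKET_NAME = "my-uploads"  # placeholder bucket

def upload_proxy(request):
    """HTTP Cloud Function that writes the request body straight to GCS."""
    user_id = request.headers.get("X-User-Id")  # assumed identity scheme; verify properly in practice
    # ... perform the same quota check against the database here ...
    blob = storage.Client().bucket(BUCKET_NAME).blob(f"{user_id}/{uuid.uuid4()}")
    blob.upload_from_string(
        request.get_data(),
        content_type=request.content_type or "application/octet-stream",
    )
    return ("created", 201)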

Organizing files in S3

I have a social media web application. Users upload pictures such as profile pictures, project pictures, etc. What's the best way to organize these files in an S3 bucket?
I thought of creating a folder inside the bucket named with the user ID, and inside that multiple other folders, i.e. profile, projects, etc.
Not sure if that's the best approach to follow!
The names (Keys) you assign to an object in Amazon S3 are frankly irrelevant.
What matters is that you have a database that tracks the objects, their ownership and their purpose.
You should not use the filename (Key) of an Amazon S3 object as a way of storing information about the object, because your application might have millions of objects in S3 and it is too slow to scan the list of objects to see which ones exist. Instead, consult a database to find them.
To answer your question: yes, create a prefix by username if you wish, but then just give each object a unique name (e.g. a universally unique identifier, UUID) that avoids name clashes.
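A tiny sketch of that naming approach (the helper name and "purpose" segment are purely illustrative):
import os
import uuid

def build_key(username, purpose, original_filename):
    """Prefix by username/purpose, but use a UUID for the name itself to avoid clashes."""
    extension = os.path.splitext(original_filename)[1]
    return f"{username}/{purpose}/{uuid.uuid4()}{extension}"

# e.g. build_key("alice", "profile", "avatar.png") -> "alice/profile/<uuid>.png"
# Store the returned key in your database alongside the owner and purpose.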
Earlier there used to be a need to add random key prefixes for better performance. The following is an extract from one of the pages covering this:
Pay attention to your naming scheme if you are distributing key names:
Don't start your object key names with a date or other standard, sequential value; it adds complexity to S3's indexing and will reduce performance, because, based on that indexing, the objects are saved to a single storage partition.
Amazon S3 maintains keys lexicographically in its internal indices.
However, as of the announcement on 17 Jul 2018, adding a random prefix to S3 keys is no longer required for improving performance.

Optimize photo storage nomenclature on Amazon S3

I have to store lots of photos (1,000,000+, each max 5 MB) and I have a database where every record has 5 photos. What is the best solution?
Create a directory for each record's slug/ID, and upload its photos inside it.
Put all photos into one directory, and include the record's ID or slug in each photo's name.
Put all photos into one directory, and add a field to each database record with the names of its photos.
I use Amazon S3.
I would suggest naming your photos like this while uploading in batch:
user1/image1.jpeg
user2/image2.jpeg
These names will not affect the way objects are stored on S3; they are simply the 'keys' of 'objects', as there is no folder-like hierarchical structure in S3. But naming them this way will make the objects appear to be in folders, which will help you segregate images easily if you later want to do so.
For example, suppose you stored all images under unique names and used a unique UUID to map records in your database to images in your bucket.
If you later want all 5 photos of a particular user, you would have to:
scan the database for that particular username,
retrieve the UUIDs for that user's images,
and then use those UUIDs to fetch the images from S3.
But if you name images by prefixing the username to the key, you can fetch the images directly from S3 without making any reference to your database.
For example, to list all the photos of user1, you can use this small code snippet in Python:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket_name')  # substitute your bucket name

# Iterate over every object whose key starts with the user's prefix
for obj in bucket.objects.filter(Prefix='user1/'):
    print(obj.key)
Whereas if you don't include a user ID in the object's key, you have to refer to the database to map photos to records even just to get a list of a particular user's images.
A lot of this depends on your use-case, such as how the database and the photos will be used. There is not enough information here to give a definitive answer.
However, some recommendations for the storage side...
The easiest option is just to use a UUID for each photo. This is effectively a random name that has no meaning. Store that name in your database and your system will know which image relates to which record. There is no need to ever rename the images because the names are just Unique IDs and convey no further information.
When you want to provide access to a particular image, your application can generate an Amazon S3 pre-signed URL that grants time-limited access to an object. After the expiry time, the URL does not work so the object remains private. Granting access in this manner means that there is no need to group images into directories by "owner", since access is granted per-object rather than per-owner.
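As an illustrative boto3 sketch (bucket name and expiry are placeholders):
import boto3

def presigned_image_url(bucket_name, key, expires_in=900):
    """Generate a time-limited URL for a private object; it stops working after expiry."""
    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket_name, "Key": key},
        ExpiresIn=expires_in,
    )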
Also, please note that Amazon S3 doesn't actually support folders. Rather, the Key ("filename") of the object is the entire path (eg user-2/foo.jpg). This makes it more human-readable (because the objects 'appear' to be in folders), but doesn't actually impact the way data is stored behind-the-scenes.
Bottom line: It doesn't really matter how you store the images. What matters is that you store the image name in your database so you know which image matches which record. Avoid situations where you need to rename images - just give them a name and keep it.