Do we need directory structure logic for storing millions of images on Amazon S3/CloudFront?

In order to support millions of potential images we have previously followed this sort of directory structure:
/profile/avatars/44/f2/47/48px/44f247d4e3f646c66d4d0337c6d415eb.jpg
The filename is an MD5 hash; we take the first six characters of the hash and build the folder structure from them.
So in the above example the filename:
44f247d4e3f646c66d4d0337c6d415eb.jpg
produces a directory structure of:
/44/f2/47/
We always did this in order to minimize the number of photos in any single directory, ultimately to aid filesystem performance.
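The scheme described above can be sketched as a small helper. This is just an illustration (the function names `shard_dirs` and `avatar_path` are mine, not from the original app):

```python
import hashlib

def shard_dirs(hashed_filename: str) -> str:
    """Derive the nested directory path from an md5-hashed filename,
    e.g. '44f247d4e3f646c66d4d0337c6d415eb.jpg' -> '/44/f2/47/'."""
    h = hashed_filename.split(".")[0]
    return f"/{h[0:2]}/{h[2:4]}/{h[4:6]}/"

def avatar_path(original_name: str, size: str = "48px") -> str:
    """Full storage path: md5-hash the original name, then shard it."""
    hashed = hashlib.md5(original_name.encode("utf-8")).hexdigest() + ".jpg"
    return f"/profile/avatars{shard_dirs(hashed)}{size}/{hashed}"

print(shard_dirs("44f247d4e3f646c66d4d0337c6d415eb.jpg"))  # /44/f2/47/
```

With two hex characters per level, each directory level fans out into at most 256 subdirectories, which keeps any single directory small even at tens of millions of files.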
However, our new app is using Amazon S3 with CloudFront.
My understanding is that any folders you create on Amazon S3 are actually just references and are not directories on the filesystem.
If that is correct, is it still recommended to split files into folders/directories using the above or a similar method? Or can we simply remove this complexity from our application code and provide image links like so:
/profile/avatars/48px/filename.jpg
Bearing in mind that this app is intended to serve tens of millions of photos.
Any guidance would be greatly appreciated.

Although S3 folders are basically just another way of writing the key name (as @E.J.Brennan already said in his answer), there are reasons to think about the naming structure of your "folders".
With the number of photos you expect, and depending on your access patterns, it may make sense to choose key names that speed up S3 key-name lookups, making sure that operations on photos get spread out over multiple partitions. There is a great article on the AWS blog explaining the details.

You don't need to set up that structure on S3 unless you are doing it for your own convenience. All of the folders you create on S3 are really just an illusion for you; the files are stored in one big contiguous container. So if you don't have a reason to organize the files in a pseudo-folder hierarchy, don't bother.
If you needed to control access for different groups of people based on your folder structure, that might be a reason to keep the structure, but beyond that there probably isn't a benefit.
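To illustrate that "folders" are just key prefixes, here is a pure-Python sketch (no AWS call involved) of how S3's delimiter-based listing merely groups flat keys. The function name `common_prefixes` is my own, mimicking the `CommonPrefixes` field returned by `ListObjects`:

```python
# Flat keys, as S3 actually stores them; the slashes have no intrinsic meaning.
keys = [
    "profile/avatars/48px/a.jpg",
    "profile/avatars/48px/b.jpg",
    "profile/covers/c.jpg",
]

def common_prefixes(keys, prefix="", delimiter="/"):
    """Mimic S3's ListObjects CommonPrefixes behaviour: group keys sharing
    the same segment after `prefix`, up to the next `delimiter`."""
    out = set()
    for k in keys:
        if not k.startswith(prefix):
            continue
        rest = k[len(prefix):]
        if delimiter in rest:
            out.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
    return sorted(out)

print(common_prefixes(keys, "profile/"))  # ['profile/avatars/', 'profile/covers/']
```

The "directories" only exist at listing time; nothing about how the objects are stored changes when you add or remove slashes from the keys.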

Related

Retrieving data from AWS S3 too slow in Shiny app

I know that this question can mostly be answered generally for any web app, but because I am specifically using Shiny, I figured your answers may be considerably more useful.
I have made a relatively complex app. The data is not complex, but the user interface is.
I am storing the data in S3 using the aws.s3 package, and have built my app using golem. Because most shiny apps are used to analyse or enter some data, they usually deal with a couple of datasets, and a relational database is very useful and fast for that type of app.
However, my app is quite UI/UX-extensive. Users can have their own or shared whiteboard space(s) where they drag items around. The coordinates of the items are stored in .rds files in my S3 bucket, per user. Users can customise many aspects of the app just for themselves: font size, colours of various experimental groups (it's a research app), and experimental visits that store .pdf, .html and .rds files.
The stored .rds files can contain variables, lists, data.frames, reactiveValues, renderUI() objects, etc., so they vary widely.
As such, I have dozens of .rds files stored in a bucket, and every time the app loads, each of these files needs to be read one by one to recreate the environment appropriate for each user. The number of files/folders in the directories is queried to know how many divs need to be generated for the user to click inside their files, etc.
The range of objects stored is too wide for me to use a relational database, but my app is taking at least 40 seconds to load. It is also generally slow when submitting data, mostly because the data entered often modifies many UI elements that then need to be pushed to S3 again. Because I have no background in proper web development, I have no idea of the best way to store user-related UX/UI elements and retrieve them seamlessly.
Could anyone please point me to appropriate resources to learn more about this?
Am I doing it completely wrong? I honestly do not know how else to store and retrieve all these R objects.
Thank you in advance for your help with the above.

Is there an implementation of a single instance blob store for Django?

I am new to Django, so I apologize if I missed something. I would like a library that gives me a single-instance data store for blob/binary data: one that masks whether the files are stored in the database, on the file system, or in a back end like Amazon S3. I want a single API that lets me add files and get back URLs for serving them. It would also be nice if the implementation supported some kind of migration: if I had blobs in a database when a site started out, I could later move those blobs to an S3 bucket behind the scenes without changing how my application stores and serves the data.
An important sub-aspect of this is that the files have to be only shown to properly authorized users (i.e. just putting them in an open /media/ folder as files is not sufficient).
Perhaps I am asking too much - but I find this kind of service very useful in my applications. The main reason that I am asking is that unless I find such a thing - I will wander off and build my own library - I just don't want to waste the time if this kind of thing already exists.

Can you request an object from S3 without knowing its extension?

Say I have a bucket called uploads with two directories, both of which contain images.
The first directory, called catalog, has images with various extensions (.jpg, .png, etc.)
The second directory, called brands, has images with no extensions.
I can request uploads/catalog/some-image.jpg and uploads/brands/extensionless-image, and they both return an image as I expect.
We're already using a third-party service, imgix, which is just an image-processing CDN that links to the S3 bucket so that we can request, say, a smaller or cropped version of the image in the bucket.
Ideally, I'd like to keep the images and objects in their current formats in the bucket, but I would like the client-side to be agnostic about which file it is requesting. In other words, I'd like to request some-image, and even though it may or may not actually have an extension in the bucket, I'd still like to somehow "intelligently guess" the image I'm requesting. We'll also assume that there are no collisions, i.e., there will never be an image some-image.jpg and some-image with both the same name (our objects are named with a collision-less algorithm).
This is what I've tried:
Simply request images in one directory with their extension, and images in the other directory without their extension. (However, even though the policy of requesting an image is the same, the mechanism has to be implemented in two different ways; I would like a single mechanism.)
Another solution is to programmatically remove the extensions from all the images in catalog and re-sync the bucket
Anyone run into something similar before? Thoughts?
I suspect your best bet is going to be renaming the images. Not that there aren't other solutions, but because that is probably going to be the simplest and most straightforward approach.
First, S3 will not guess. The key on an S3 object is an opaque string from S3's perspective. The extension has no meaning, and even the slashes delimiting "directories" have no intrinsic meaning to S3. (Deleting a "directory" in S3 means sending a delete request for every individual object in the directory. The console creates a convenient illusion by doing this for you.)
S3 has redirect rules, but they only match and manipulate path prefixes, not suffixes, so no help there.
It would be possible, using a reverse proxy in front of S3, to inspect requests; on any 404 or 403, the proxy could retry the request with alternate extensions until it found one that worked, and it could potentially "learn" the right extension for subsequent requests. But then you'd have the added round-trip time and additional cost of multiple requests.
I have developed systems whose job it is to "find" things requested over HTTP by trying multiple back-end URLs, without the requester being aware of the "hunting" going on in the background, and it can be very useful... but that is a much more complicated solution than you would probably want to consider, particularly in light of the fact that every millisecond counts when it comes to image loading.
There is no native solution for magic guessing with S3. You pretty much have to ask it for exactly what you want. Storage in S3 is cheap enough, of course, that you could probably duplicate your content, with and without extensions, without giving too much thought to the cost. If you used a Lambda event on the bucket, you could even automate the process of copying "kitten.jpg" to "kitten" each time "kitten.jpg" was modified.
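A hedged sketch of that Lambda idea, assuming the standard S3 event-notification shape (the helper and handler names are mine, and the pure key-stripping logic is separated out from the AWS call):

```python
import os

def extensionless_key(key: str) -> str:
    """Strip the final extension from an S3 key,
    e.g. 'uploads/catalog/kitten.jpg' -> 'uploads/catalog/kitten'."""
    root, _ext = os.path.splitext(key)
    return root

def handler(event, context):
    """Sketch of a Lambda triggered by S3 ObjectCreated events: copy each
    new object to the same key minus its extension."""
    import boto3  # available in the Lambda runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=extensionless_key(key),
        )
```

A key with no extension copies onto itself, so in practice you would want to skip those records; this is a sketch, not a production handler.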
If the content-type is set correctly in your object metadata, you should be fine regardless of extensions. If content-type header is not set, you can set it, for example using ImageMagick Identify to discover the image type and AWS CLI to set it.
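The content-type fix above can be sketched with Python's stdlib `mimetypes` for the guess, plus boto3's self-copy trick for rewriting the metadata (the helper names are mine; objects without an extension would still need real inspection, e.g. with ImageMagick's identify):

```python
import mimetypes

def content_type_for(key: str) -> str:
    """Guess a Content-Type from the key's extension; fall back to a
    generic binary type when there is no recognizable extension."""
    return mimetypes.guess_type(key)[0] or "application/octet-stream"

def set_content_type(s3, bucket: str, key: str) -> None:
    """Sketch: S3 object metadata can only be changed by copying the
    object onto itself with MetadataDirective='REPLACE'."""
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": key},
        Key=key,
        ContentType=content_type_for(key),
        MetadataDirective="REPLACE",
    )
```

With the correct Content-Type set, browsers (and imgix) treat `extensionless-image` exactly like `some-image.jpg`.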

Where to store user file uploads?

In my compojure app, where should I store user upload files? Do I just make a user-upload dir in my project root and stick everything in there? Is there anything special I should do (classpath, permissions, etc)?
To properly answer your question, you need to think of the lifecycle of the uploaded files. I would start answering questions such as:
how big are the files going to be?
what storage options will hold enough data to store all the uploads?
how about SLAs, redundancy and disaster avoidance?
how and who to monitor the free space and health of the storage?
In general, the file system location is much less relevant than the block device sitting behind it: as long as your data is stored safely enough for your application, user-upload can be anywhere and be anything from a regular disk to an S3 bucket e.g. via s3fs-fuse.
Putting such a folder on your classpath sounds odd to me. It gives no essential benefit, as you will always need a configuration entry stating where to store and read files from.
Permission-wise, your application will require at least write access to the upload storage (and most likely read access as well). Granting such permissions depends on the physical device you choose: if you opt for the local file system, as you suggest in your question, you need to make sure the Clojure app runs as a user with read/write permission on that directory, whereas with S3 you would configure API keys instead.
For anything other than a practice problem, I would suggest using a database such as Postgres or Datomic. This way, you get the reliability of a DB with real transactions, along with the ability to access the files across a network from any location.

where to store 10kb pieces of text in amazon aws?

These will be indexed and randomly accessed in a web app, like SO questions. SimpleDB has a 1,024-byte limit per attribute; you could span a value across multiple attributes, but that sounds inelegant.
Examples: blog posts; facebook status messages; recipes (in a blogging application; facebook-like application; recipe web site).
If I were to build such an application on Amazon AWS, where/how should I store the pieces of text?
With S3, you could put all the actual files in S3, then index them with Amazon RDS, or Postgres on Heroku, or whatever suits you at that time.
Also, you can have the client download the multi-kB text blurbs directly from S3, so your app could just deliver URLs to the messages, thereby creating a massively parallel server: even if the main server is just a single thread on one machine, it only constructs the page from S3 asset URLs. S3 could store all assets, like images, etc.
The advantages are big. This also solves backup, etc. And allows you to play with many indexing and searching schemes. Search could for instance be done using Google...
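That "deliver URLs, let the client fetch from S3" idea can be sketched as follows. This is a minimal sketch: the bucket name and region are assumptions, and the bucket would need a policy allowing public reads (private objects would need presigned URLs instead):

```python
def s3_public_url(bucket: str, key: str, region: str = "us-east-1") -> str:
    """Virtual-hosted-style URL for an S3 object (assumes public-read access)."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

def render_page(post_keys):
    """The app only emits URLs; browsers fetch the text bodies from S3
    directly, so the app server never touches the content itself."""
    return [s3_public_url("my-posts-bucket", k) for k in post_keys]
```

The index in RDS (or wherever) only needs to map a post ID to its S3 key; the bytes themselves never pass through your application server.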
I'd say you would want to look at Amazon RDS, running a relational database like MySQL in the cloud. A single DynamoDB read capacity unit can only (consistently) read a 1 KB item, so that's probably not going to work for you.
Alternatively, you could store the text files in S3 and put pointers to these files in SimpleDB. It depends on a lot of factors which is going to be more cost-effective: how many files you add every day, how often these files are expected to change, how often they are requested, etc.
Personally, I think that using S3 would not be the best approach. If you store all questions and answers in separate text files, you're looking at a number of requests for displaying even a simple page. Let alone search, which would require you to fetch all the files from S3 and search through them. So for search, you need a database anyway.
You could use SDB for keeping an index but frankly, I would just use MySQL on Amazon RDS (there's a free two-month trial period right now, I think) where you can do all the nice things that relational databases can do, and which also offers support for full-text search. RDS should be able to scale up to huge numbers of visitors every day: you can easily scale up all the way to a High-Memory Quadruple Extra Large DB Instance with 68 GB of memory and 26 ECUs.
As far as I know, SO is also built on top of a relational database: https://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/
DynamoDB might be what you want; there is even a forum use case in the documentation: Example Tables and Data in Amazon DynamoDB.
There is insufficient information in the question to provide a reasonable answer to "where should I store text that I'm going to use?"
Depending on how you build your application and what the requirements are for speed, redundancy, latency, volume, scalability, size, cost, robustness, reliability, searchability, modifiability, security, etc., the answer could be any of:
Drop the text in files on an EBS volume attached to an instance.
Drop the text into a MySQL or RDS database.
Drop the text into a distributed file system spread across multiple instances.
Upload the text to S3.
Store the text in SimpleDB.
Store the text in DynamoDB.
Cache the text in ElastiCache.
There are also a number of variations on this like storing the master copy in S3, caching copies in ElastiCache and on the local disk, indexing it with specific keys in DynamoDB and making it searchable in Cloud Search.