Can you request an object from S3 without knowing its extension?

Say I have a bucket called uploads with two directories, both of which contain images.
The first directory, called catalog, has images with various extensions (.jpg, .png, etc.)
The second directory, called brands, has images with no extensions.
I can request uploads/catalog/some-image.jpg and uploads/brands/extensionless-image, and they both return an image as I expect.
We're already using a third-party service, imgix, which is just an image-processing CDN that links to the S3 bucket so that we can request, say, a smaller or cropped version of the image in the bucket.
Ideally, I'd like to keep the images and objects in their current formats in the bucket, but I would like the client side to be agnostic about which file it is requesting. In other words, I'd like to request some-image, and even though it may or may not actually have an extension in the bucket, I'd still like to somehow "intelligently guess" the image I'm requesting. We'll also assume that there are no collisions, i.e., there will never be both a some-image.jpg and a some-image with the same base name (our objects are named with a collision-less algorithm).
This is what I've tried:
Simply request images in one directory with their extension and images in the other directory without one (however, even though the policy, requesting an image, is the same, the mechanism has to be implemented in two different ways; I would like a single mechanism)
Another solution is to programmatically remove the extensions from all the images in catalog and re-sync the bucket
Anyone run into something similar before? Thoughts?

I suspect your best bet is going to be renaming the images. Not that there aren't other solutions, but because that is probably going to be the simplest and most straightforward approach.
First, S3 will not guess. The key on an S3 object is an opaque string from S3's perspective. The extension has no meaning, and even the slashes delimiting "directories" have no intrinsic meaning to S3. (Deleting a "directory" in S3 means sending a delete request for every individual object in the directory. The console creates a convenient illusion by doing this for you.)
S3 has redirect rules, but they only match and manipulate path prefixes, not suffixes, so no help there.
It would be possible, using a reverse proxy in front of S3, to inspect requests and, on any 404 or 403, have the proxy retry the request with alternate extensions until it found one that worked; it could even potentially "learn" the right extension for use on subsequent requests. But then you'd have the added turnaround time and the additional cost of multiple requests.
I have developed systems whose job it is to "find" things requested over HTTP by trying multiple back-end URLs, without the requester being aware of the "hunting" going on in the background, and it can be very useful... but that is a much more complicated solution than you would probably want to consider, particularly in light of the fact that every millisecond counts when it comes to image loading.
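A minimal sketch of that "hunting" idea, using boto3's head_object to probe for the key under a few candidate extensions (the bucket name and suffix list here are just assumptions):

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    # Candidate suffixes to probe, most likely first (an assumption).
    CANDIDATE_SUFFIXES = ["", ".jpg", ".png", ".gif", ".webp"]

    def resolve_key(bucket, base_key):
        """Return the first key that actually exists, trying each suffix in turn."""
        for suffix in CANDIDATE_SUFFIXES:
            key = base_key + suffix
            try:
                s3.head_object(Bucket=bucket, Key=key)  # cheap existence check, no body
                return key
            except ClientError as e:
                if e.response["Error"]["Code"] in ("404", "403", "NoSuchKey"):
                    continue
                raise
        return None

    # e.g. resolve_key("uploads", "catalog/some-image") -> "catalog/some-image.jpg"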
There is no native solution for magic guessing with S3. You pretty much have to ask it for exactly what you want. Storage in S3 is cheap enough, of course, that you could probably duplicate your content, with and without extensions, without giving too much thought to the cost. If you used a Lambda event on the bucket, you could even automate the process of copying "kitten.jpg" to "kitten" each time "kitten.jpg" was modified.
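If you went the Lambda route, a sketch of that copy-on-upload handler might look something like this (the handler name, and the choice to keep extensionless copies alongside the originals, are assumptions):

    import os
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        """Fires on s3:ObjectCreated:* events; copies 'kitten.jpg' to 'kitten'."""
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            base, ext = os.path.splitext(key)
            if not ext:
                continue  # already extensionless, nothing to copy
            s3.copy_object(
                Bucket=bucket,
                Key=base,                                   # e.g. catalog/kitten
                CopySource={"Bucket": bucket, "Key": key},  # e.g. catalog/kitten.jpg
            )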

If the content-type is set correctly in your object metadata, you should be fine regardless of extensions. If the content-type header is not set, you can set it, for example by using ImageMagick's identify to discover the image type and the AWS CLI to set it.
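For anyone preferring to stay in Python, here is a rough equivalent (Pillow stands in for ImageMagick's identify; the in-place copy with MetadataDirective='REPLACE' is what rewrites the metadata):

    import io

    import boto3
    from PIL import Image  # stand-in for ImageMagick's identify

    s3 = boto3.client("s3")

    def fix_content_type(bucket, key):
        """Sniff the image format and rewrite the object's Content-Type in place."""
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        fmt = Image.open(io.BytesIO(body)).format.lower()  # e.g. 'jpeg', 'png'
        # Copying an object onto itself with REPLACE lets you change its metadata.
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            ContentType=f"image/{fmt}",
            MetadataDirective="REPLACE",
        )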

Related

How to add custom authentication to aws s3 download of large files

I'm trying to figure out how to implement these requirements for S3 downloads:
Signed URL (links should become invalid after some amount of time).
Download only 1 time - any other requests to the same URL should fail.
Need to restrict downloads to the user/browser who made the request to generate the signed URL - no other user should be able to download.
Be able to deal with large files (ideally, streaming, just like when someone downloads directly from a standard S3 access point).
Things that I've tried:
S3 Object Lambda + Access Point
Generate pre-signed URL to lambda access point, this works well.
Make use of S3 object metadata to store download state / restrict downloads to just 1 time. This works well.
No way to access user-agent or requestor's IP.
Large files are a problem. The timeout has been configured to 15 minutes (the max), but the request still times out much earlier. This was done with NodeJS.
Lambda + Lambda URL
Pre-signed URL is generated and passed to lambda URL as encoded param - the lambda makes the request if auth/validation passes. This approach seems to work fine.
Can use same approach of leveraging S3 object metadata to limit downloads to just 1 time.
User-agent and requestor IP are available, which is great.
Large files are a problem. I've tried NodeJS and it behaves the same as the S3 Object Lambda (it eventually times out, even earlier than the configured time). I also implemented the Java streaming handler, but it dies with an "out of memory" error, even when I bump the memory up to 3GB (the file is only 1GB, and I thought streaming would get around the memory problem anyway). I've tried several ways to stream (Java 11), but it really seems like the streaming handler is not actually streaming, but buffering somewhere outside of the Lambda.
I'm now unsure if AWS lambda will be able to handle all of these requirements, but I would really like to know if others might have ideas, or if I'm missing something.
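For illustration, a sketch of the metadata-based "one download only" check combined with a short-lived presigned URL. If the Lambda only validates and then returns (or redirects to) the URL instead of proxying the bytes itself, S3 does the streaming and the function never hits the timeout or memory limits. The bucket name and the "downloaded" metadata key are assumptions, and the head/copy pair is not atomic, so a real version would want a conditional write (e.g. in DynamoDB):

    import time

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-downloads-bucket"  # hypothetical bucket name

    def issue_download_url(key, expires_in=300):
        """Allow at most one download, then hand back a short-lived S3 URL."""
        head = s3.head_object(Bucket=BUCKET, Key=key)
        if head.get("Metadata", {}).get("downloaded") == "true":
            raise PermissionError("This object has already been downloaded once.")
        # Record the download state in the object's metadata via a self-copy.
        # (Not atomic; a race here needs a lock or a DynamoDB conditional write.)
        s3.copy_object(
            Bucket=BUCKET,
            Key=key,
            CopySource={"Bucket": BUCKET, "Key": key},
            Metadata={"downloaded": "true", "downloaded-at": str(int(time.time()))},
            MetadataDirective="REPLACE",
        )
        # S3 streams the object itself; the Lambda never touches the bytes.
        return s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": BUCKET, "Key": key},
            ExpiresIn=expires_in,
        )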

Best practice for streaming images in S3 to clients through a server

I am trying to find the best practice for streaming images from S3 to a client's app.
I created a grid-like layout using Flutter on a mobile device (similar to Instagram). How can my client access all its images?
Here is my current setup: the client opens its profile screen (which contains the grid-like layout for all images, sorted by timestamp). This automatically requests all images from the server. My Python 3 backend server uses boto3 to access S3 and DynamoDB tables. The DynamoDB table has a list of all the image paths the client uploaded, sorted by timestamp. Once I get the paths, I use them to download all the images to my server first and then send them to the client.
Basically, my server is the middleman, downloading the images and then sending them back to the client. Is this the right way of doing it? It seems that if the client accessed S3 directly, it would be faster, but I'm not sure if that is safe. Plus, I don't know how I can give clients access to S3 without giving them AWS credentials...
Any suggestions would be appreciated. Thank you in advance!
What you are doing will work, and it's probably the best option if you are optimising for getting something working quickly, without worrying too much about wasted server resources and unnecessary computation, and if you don't have scalability concerns.
However, if you are worried about scalability and lower latency, as well as secure access to these image resources, you might want to improve your current architecture.
Once I get the paths, I use that to download all images to my server first and then send it to the client.
This is the first part I would try to get rid of, as you don't really need your backend to download these images and stream them itself. It does still seem necessary to control access to the resources based on who owns them, though. I would consider switching to the setup below to improve latency and spend fewer server resources:
Once you get the paths in your backend service, generate presigned URLs for the S3 objects, which will give your client temporary access to these resources (depending on your needs, you can adjust how long a URL stays valid); see the sketch below.
Then send these links to your client so that it can stream the objects directly from S3, rather than your server becoming the middleman.
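A minimal boto3 sketch of that first step, assuming you already have the key paths from DynamoDB (the bucket name and expiry are placeholders):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "user-images-bucket"  # hypothetical bucket name

    def presign_image_urls(keys, expires_in=900):
        """Turn a list of S3 keys into temporary, directly fetchable URLs."""
        return [
            s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": BUCKET, "Key": key},
                ExpiresIn=expires_in,  # seconds the link stays valid
            )
            for key in keys
        ]

    # urls = presign_image_urls(paths_from_dynamodb)  # hand these to the Flutter client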
Once you have this setup working, I would then consider using Amazon CloudFront to improve access to your objects through the CDN capabilities CloudFront gives you, especially if your clients are distributed across different geographical regions. As far as I can see, you can also make CloudFront work with presigned URLs.
Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe
Presigned URLs are your way of mitigating uncontrolled access to your S3 objects. You probably need to worry about edge cases, though (e.g. how the client should behave when its access to an S3 object has expired, so that users won't notice, etc.). All of these are the costs of making something work at scale, if you have those scalability concerns.

Returning images through AWS API Gateway

I'm trying to use AWS API Gateway as a proxy in front of an image service.
I'm able to get the image to come through but it gets displayed as a big chunk of ASCII because Content-Type is getting set to "application/json".
Is there a way to tell the gateway NOT to change the source Content-Type at all?
I just want "image/jpeg", "image/png", etc. to come through.
I was trying to format a string to be returned w/o quotes and discovered the Integration Response functionality. I haven't tried this fix myself, but something along these lines should work:
Go to the Method Execution page of your Resource,
click on Integration Response,
expand Method Response Status 200,
expand Mapping Templates,
click "application/json",
click the pencil next to Output Passthrough,
change "application/json" to "image/png"
Hope it works!
I apologize, in advance, for giving an answer that does not directly answer the question, and instead suggests you adopt a different approach... but based on the question and comments, and my own experience with what I believe to be a similar application, it seems like you may be using the wrong tool for the problem, or at least a tool that is not the optimal choice within the AWS ecosystem.
If your image service was running inside Amazon Lambda, the need for API Gateway would be more apparent. Absent that, I don't see it.
Amazon CloudFront provides fetching of content from a back-end server, caching of content (at over 50 "edge" locations globally), no charge for the storage of cached content, and you can configure up to 100 distinct hostnames pointing to a single Cloudfront distribution, in addition to the default xxxxxxxx.cloudfront.net hostname. It also supports SSL. This seems like what you are trying to do, and then some.
I use it, quite successfully, for exactly the scenario you describe: "a proxy in front of an image service." Exactly what my image service and your image service do may differ (mine is a resizer that can look up the source URL of missing/never-before-requested images, fetch, and resize), but fundamentally it seems like we're accomplishing a similar purpose.
Curiously, the pricing structure of CloudFront in some regions (such as us-east-1 and us-west-2) is such that it's not only cost-effective, but in fact using CloudFront can be almost $0.005 cheaper than not using it per gigabyte downloaded.
In my case, in addition to the back-end image service, I also have an S3 bucket with a single file in it, attached to a single path in the CloudFront distribution (as a second "custom origin"), for the sole purpose of serving up /robots.txt, to control direct access to my images by well-behaved crawlers. This allows the robots.txt file to be managed separately from the image service itself.
If this doesn't seem to address your need, feel free to comment and I will clarify or withdraw this answer.
@kjsc: we finally figured out how to get this working on an alternate question with base64-encoded image data, which you may find helpful in your solution:
AWS Gateway API base64Decode produces garbled binary?
To answer your question, to get the Content-Type to come through as a hard-coded value, you would first go into the Method Response screen and add a Content-Type header with whatever content type you want.
Then you'd go into the Integration Response screen and set the Content type to your desired value (image/png in this example). Wrap 'image/png' in single quotes.
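Roughly the same two steps done with boto3, in case you script your API setup (the API and resource IDs are placeholders):

    import boto3

    apigw = boto3.client("apigateway")

    REST_API_ID = "abc123"   # placeholder IDs
    RESOURCE_ID = "def456"

    # Method Response: declare that the 200 response carries a Content-Type header.
    apigw.put_method_response(
        restApiId=REST_API_ID,
        resourceId=RESOURCE_ID,
        httpMethod="GET",
        statusCode="200",
        responseParameters={"method.response.header.Content-Type": False},
    )

    # Integration Response: hard-code that header to image/png. The single quotes
    # mark a static value rather than a mapping expression.
    apigw.put_integration_response(
        restApiId=REST_API_ID,
        resourceId=RESOURCE_ID,
        httpMethod="GET",
        statusCode="200",
        responseParameters={"method.response.header.Content-Type": "'image/png'"},
    )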

Do we need directory structure logic for storing millions of images on Amazon S3/Cloudfront?

In order to support millions of potential images we have previously followed this sort of directory structure:
/profile/avatars/44/f2/47/48px/44f247d4e3f646c66d4d0337c6d415eb.jpg
The filename is md5 hashed, then we extract the first 6 characters in the string and build the folder structure from that.
So in the above example the filename:
44f247d4e3f646c66d4d0337c6d415eb.jpg
produces a directory structure of:
/44/f2/47/
We always did this in order to minimize the number of photos in any single directory, ultimately to aid filesystem performance.
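For reference, a small sketch of how that kind of sharded key could be derived (the exact input being hashed and the path layout are assumptions based on the example above):

    import hashlib

    def sharded_key(original_name, size="48px"):
        """Build a /aa/bb/cc/ style prefix from the md5 of the file name."""
        digest = hashlib.md5(original_name.encode()).hexdigest()    # e.g. '44f247d4...'
        shards = "/".join(digest[i:i + 2] for i in range(0, 6, 2))  # '44/f2/47'
        return f"profile/avatars/{shards}/{size}/{digest}.jpg"

    # sharded_key(...) -> 'profile/avatars/44/f2/47/48px/44f247d4....jpg'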
However, our new app is using Amazon S3 with CloudFront.
My understanding is that any folders you create on Amazon S3 are actually just references and are not directories on the filesystem.
If that is correct, is it still recommended to split into folders/directories in the above (or a similar) method? Or can we simply remove this complexity in our application code and provide image links like so:
/profile/avatars/48px/filename.jpg
Bearing in mind that this app is intended to serve tens of millions of photos.
Any guidance would be greatly appreciated.
Although S3 folders are basically only another way of writing the key name (as @E.J.Brennan already said in his answer), there are reasons to think about the naming structure of your "folders".
With your current number of photos and probably your access patterns, it might make sense to think about a way to speed up the S3 keyname lookups, making sure that operations on photos get spread out over multiple partitions. There is a great article on the AWS blog explaining all the details.
You don't need to setup that structure on s3 unless you are doing it for your own convenience. All of the folders you create on s3 are really just an illusion for you, the files are stored in one big continuous container, so if you don't have a reason to organize the files in a pseudo-folder hierarchy, then don't bother.
If you needed to control access for different groups of people based on your folder structure, that might be a reason to keep the structure, but besides that there probably isn't a benefit.

Push files up to Amazon Cloudfront: Possible?

I've been reading up about pull and push CDNs. I've been using Cloudfront as a pull CDN for resized images:
Receive image from client
Put image in S3
Later on, when a client makes a request to CloudFront for a URL, CloudFront does not have the image, so it has to forward the request to my server, which:
Receive request
Pull image from S3
Resize image
Push image back to Cloudfront
However, this takes a few seconds, which is a really annoying wait when you first upload your beautiful image and want to see it. The delay appears to be mostly the download/reuploading time, rather than the resizing, which is pretty fast.
Is it possible to pro-actively push the resized image to Cloudfront and attach it to a URL, such that future requests can immediately get the prepared image? Ideally I would like to
Receive image from client
Put image in S3
Resize image for common sizes
Pre-emptively push these sizes to CloudFront
This avoids the whole download/reupload cycle, making the common sizes really fast, but the less-common sizes can still be accessed (albeit with a delay the first time). However, to do this I'd need to push the images up to Cloudfront. This:
http://www.whoishostingthis.com/blog/2010/06/30/cdns-push-vs-pull/
seems to suggest it can be done, but everything else i've seen makes no mention of it. My question is: is it possible? Or are there any other solutions to this problem that I am missing?
We have tried similar things with different CDN providers, and for CloudFront I don't think there is any existing way for you to push (what we call pre-feeding) your specific content to nodes/edges if the CloudFront distribution is using your custom origin.
One way I can think of, as also mentioned by @Xint0, is to set up another S3 bucket specifically to host those files you would like to push (in your case, the resized images). Basically you will have two CloudFront distributions: one to pull the rarely accessed files and another to push the frequently accessed files, including the images you expect to be resized. This sounds a little bit complex, but I believe that's the tradeoff you have to make.
Another provider I can recommend you look at is EdgeCast, another CDN, which provides a function called load_to_edge (I spent quite a lot of time last month integrating it with our service, which is why I remember it clearly) that does exactly what you expect. They also support custom origin pull, so you could take a trial there.
The OP asks for a push CDN solution, but it sounds like he's really just trying to make things faster. I'm venturing that you probably don't really need to implement a CDN push, you just need to optimize your origin server pattern.
So, OP, I'm going to assume you're supporting at most a handful of image sizes--let's say 128x128, 256x256 and 512x512. It also sounds like you have your original versions of these images in S3.
This is what currently happens on a cache miss:
CDN receives request for a 128x128 version of an image
CDN does not have that image, so it requests it from your origin server
Your origin server receives the request
Your origin server downloads the original image from S3 (presumably a larger image)
Your origin resizes that image and returns it to the CDN
CDN returns that image to user and caches it
What you should be doing instead:
There are a few options here depending on your exact situation.
Here are some things you could fix quickly, with your current setup:
If you have to fetch your original images from S3, you're basically making it so that a cache miss results in every image taking as long to download as the original sized image. If at all possible, you should try to stash those original images somewhere that your origin server can access quickly. There's a million different options here depending on your setup, but fetching them from S3 is about the slowest of all of them. At least you aren't using Glacier ;).
You aren't caching the resized images. That means that every edge node CloudFront uses is going to request this image, which triggers the whole resizing process. CloudFront may have hundreds of individual edge node servers, meaning hundreds of misses and resizes per image. Depending on what CloudFront does for tiered distribution, and how you set your file headers, it may not actually be that bad, but it won't be good.
I'm going out on a limb here, but I'm betting you aren't setting custom expiration headers, which means Cloudfront is only caching each of these images for 24 hours. If your images are immutable once uploaded, you'd really benefit from returning expiration headers telling the CDN not to check for a new version for a long, long time.
Here are a couple ideas for potentially better patterns:
When someone uploads a new image, immediately transcode it into all the sizes you support and upload those to S3 (see the sketch after this list). Then just point your CDN at that S3 bucket. This assumes you have a manageable number of supported image sizes. However, I would point out that if you support too many image sizes, a CDN may be the wrong solution altogether. Your cache hit rate may be so low that the CDN is really getting in the way. If that's the case, see the next point.
If you are supporting something like continuous resizing (i.e., I could request image_57x157.jpg or image_315x715.jpg, etc., and the server would return it), then your CDN may actually be doing you a disservice by introducing an extra hop without offloading much from your origin. In that case, I would probably spin up EC2 instances in all the available regions, install your origin server on them, and then swap image URLs to regionally appropriate origins based on client IP (effectively rolling your own CDN).
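A minimal sketch of that transcode-on-upload idea, using Pillow and boto3 and also setting the long expiration headers mentioned above (the bucket name, sizes, and key pattern are assumptions):

    import io

    import boto3
    from PIL import Image

    s3 = boto3.client("s3")
    BUCKET = "resized-images-bucket"  # hypothetical bucket the CDN points at
    SIZES = [128, 256, 512]           # the handful of supported sizes

    def publish_resized_versions(original_bytes, base_key):
        """Resize once at upload time and store every size the CDN will ever serve."""
        original = Image.open(io.BytesIO(original_bytes))
        for size in SIZES:
            resized = original.copy()
            resized.thumbnail((size, size))  # preserves aspect ratio
            buf = io.BytesIO()
            resized.convert("RGB").save(buf, format="JPEG")
            buf.seek(0)
            s3.put_object(
                Bucket=BUCKET,
                Key=f"{base_key}_{size}x{size}.jpg",
                Body=buf,
                ContentType="image/jpeg",
                # Immutable once uploaded, so tell the CDN to cache for a year.
                CacheControl="max-age=31536000, immutable",
            )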
And if you reeeeeally want to push to Cloudfront:
You probably don't need to, but if you simply must, here are a couple options:
Write a script that uses the webpagetest.org APIs to fetch your image from a variety of different places around the world. In a sense, you'd be pushing a pull command to all the different edge locations. This isn't guaranteed to populate every edge location, but you could probably get close. Note that I'm not sure how thrilled webpagetest.org would be about being used this way, but I don't see anything in their terms of use about it (IANAL).
If you don't want to use a third party or risk irking webpagetest.org, just spin up a micro EC2 instance in every region, and use those to fetch the content, same as in #1.
AFAIK CloudFront uses S3 buckets as the datastore. So, after resizing the images you should be able to save the resized images to the S3 bucket used by CloudFront directly.