I've been reading up about pull and push CDNs. I've been using Cloudfront as a pull CDN for resized images:
Receive image from client
Put image in S3
later on, when a client makes a request to cloudfront for a URL, Cloudfront does not have the image, hence it has to forward it to my server, which:
Receive request
Pull image from S3
Resize image
Push image back to Cloudfront
However, this takes a few seconds, which is a really annoying wait when you first upload your beautiful image and want to see it. The delay appears to be mostly the download/reuploading time, rather than the resizing, which is pretty fast.
Is it possible to pro-actively push the resized image to Cloudfront and attach it to a URL, such that future requests can immediately get the prepared image? Ideally I would like to
Receive image from client
Put image in S3
Resize image for common sizes
Pre-emptively push these sizes to cloudfront
This avoids the whole download/reupload cycle, making the common sizes really fast, but the less-common sizes can still be accessed (albeit with a delay the first time). However, to do this I'd need to push the images up to Cloudfront. This:
seems to suggest it can be done, but everything else I've seen makes no mention of it. My question is: is it possible? Or are there any other solutions to this problem that I am missing?

We have tried to do similar things with different CDN providers, and for CloudFront I don't think there is any existing way for you to push (what we call pre-feeding) your specific contents to nodes/edges if the CloudFront distribution is using your custom origin.
One way I can think of, as also mentioned by #Xint0, is to set up another S3 bucket specifically for hosting those files you would like to push (in your case those resized images). Basically you will have two CloudFront distributions: one to pull those files rarely accessed and another to push those files accessed frequently and also those images you expect to be resized. This sounds a little bit complex but I believe that's the tradeoff you have to make.
Another thing I can recommend you look at is EdgeCast, which is another CDN provider, and they do provide a function called load_to_edge (which I spent quite a lot of time last month integrating with our service, that's why I remember it clearly) which does exactly what you expect. They also support custom origin pull, so maybe you can take a trial there.

The OP asks for a push CDN solution, but it sounds like he's really just trying to make things faster. I'm venturing that you probably don't really need to implement a CDN push, you just need to optimize your origin server pattern.
So, OP, I'm going to assume you're supporting at most a handful of image sizes--let's say 128x128, 256x256 and 512x512. It also sounds like you have your original versions of these images in S3.
This is what currently happens on a cache miss:
CDN receives request for a 128x128 version of an image
CDN does not have that image, so it requests it from your origin server
Your origin server receives the request
Your origin server downloads the original image from S3 (presumably a larger image)
Your origin resizes that image and returns it to the CDN
CDN returns that image to user and caches it
What you should be doing instead:
There are a few options here depending on your exact situation.
Here are some things you could fix quickly, with your current setup:
If you have to fetch your original images from S3, you're basically making it so that a cache miss results in every image taking as long to download as the original sized image. If at all possible, you should try to stash those original images somewhere that your origin server can access quickly. There's a million different options here depending on your setup, but fetching them from S3 is about the slowest of all of them. At least you aren't using Glacier ;).
You aren't caching the resized images. That means that every edge node Cloudfront uses is going to request this image, which triggers the whole resizing process. Cloudfront may have hundreds of individual edge node servers, meaning hundreds of misses and resizes per image. Depending on what Cloudfront does for tiered distribution, and how you set your file headers, it may not actually be that bad, but it won't be good.
I'm going out on a limb here, but I'm betting you aren't setting custom expiration headers, which means Cloudfront is only caching each of these images for 24 hours. If your images are immutable once uploaded, you'd really benefit from returning expiration headers telling the CDN not to check for a new version for a long, long time.
Here are a couple ideas for potentially better patterns:
When someone uploads a new image, immediately transcode it into all the sizes you support and upload those to S3. Then just point your CDN at that S3 bucket. This assumes you have a manageable number of supported image sizes. However, I would point out that if you support too many image sizes, a CDN may be the wrong solution altogether. Your cache hit rate may be so low that the CDN is really getting in the way. If that's the case, see the next point.
If you are supporting something like continuous resizing (ie, I could request image_57x157.jpg or image_315x715.jpg, etc and the server would return it) then your CDN may actually be doing you a disservice by introducing an extra hop without offloading much from your origin. In that case, I would probably spin up EC2 instances in all the available regions, install your origin server on them, and then swap image URLs to regionally appropriate origins based on client IP (effectively rolling your own CDN).
And if you reeeeeally want to push to Cloudfront:
You probably don't need to, but if you simply must, here are a couple options:
1. Write a script to use the webpagetest.org APIs to fetch your image from a variety of different places around the world. In a sense, you'd be pushing a pull command to all the different edge locations. This isn't guaranteed to populate every edge location, but you could probably get close. Note that I'm not sure how thrilled webpagetest.org would be about using it this way, but I don't see anything in their terms of use about it (IANAL).
2. If you don't want to use a third party or risk irking webpagetest.org, just spin up a micro EC2 instance in every region, and use those to fetch the content, same as in #1.

AFAIK CloudFront uses S3 buckets as the datastore. So, after resizing the images you should be able to save the resized images to the S3 bucket used by CloudFront directly.


submit PUT request through CloudFront

Can anyone please help me before I go crazy?
I have been searching for any documentation/sample-code (in JavaScript) for uploading files to S3 via CloudFront but I can't find a proper guide.
I know I could use the Transfer Acceleration feature for faster uploads and yeah, Transfer Acceleration essentially does the job through CloudFront Edge Points, but as far as I searched, it is possible to make the POST/PUT request via AWS.CloudFront...
I also read an article posted in 2013 that says AWS just added a functionality to make POST/PUT requests, but it says not a single thing about how to do it!?
CloudFront documentation for JavaScript sucks, it does not even show any sample code. All they do is assume that we already know all the things about the subject. If I knew, why would I dive into documentation in the first place.
I believe there is some confusion here about adding these requests. This feature was added simply to allow POST/PUT requests to be supported for your origin so that functionality in your application such as form submissions or API requests would now function.
The recommended approach as you pointed out is to make use of S3 transfer acceleration, which actually makes use of the CloudFront edge locations.
Transfer Acceleration takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.

best practice for streaming images in S3 to clients through a server

I am trying to find the best practice for streaming images from s3 to client's app.
I created a grid-like layout using flutter on a mobile device (similar to instagram). How can my client access all its images?
Here is my current setup: Client opens its profile screen (which contains the grid like layout for all images sorted by timestamp). This automatically requests all images from the server. My python3 backend server uses boto3 to access S3 and dynamodb tables. Dynamodb table has a list of all image paths client uploaded, sorted by timestamp. Once I get the paths, I use that to download all images to my server first and then send it to the client.
Basically my server is the middleman downloading and sending the images back to the client. Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe. Plus I don't know how I can give clients access to S3 without giving them AWS credentials...
Any suggestions would be appreciated. Thank you in advance!
What you are doing will work, and it's probably the best option if you are optimising for getting something working quickly, w/o worrying too much about waste of server resources, unnecessary computation, and if you don't have scalability concerns.
However, if you're worrying about scalability and lower latency, as well as secure access to these image resources, you might want to improve your current architecture.
Once I get the paths, I use that to download all images to my server first and then send it to the client.
This part is the first part I would try to get rid of as you don't really need your backend to download these images, and stream them itself. However, it seems still necessary to control the access to resources based on who owns them. I would consider switching this to below setup to improve on latency, and spend less server resources to make this work:
Once I get the paths in your backend service, generate Presigned urls for s3 objects which will give your client temporary access to these resources (depending on your needs, you can adjust the time frame of how long you want a URL access to work).
Then, send these links to your client so that it can directly stream the URLs from S3, rather than your server becoming the middle man for this.
Once you have this setup working, I would try to consider using Amazon CloudFront to improve access to your objects through the CDN capabilities that CloudFront gives you, especially if your clients are distributed in different geographical regions. As far as I can see, you can also make CloudFront work with presigned URLs.
Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe
Presigned URLs are your way of mitigating the uncontrolled access to your S3 objects. You probably need to worry about edge cases though (e.g. how the clients should act when their access to an S3 object has expired, so that users won't notice this, etc.). All of these are costs of making something work at scale, if you have those scalability concerns.

Uploading various sized Images to AWS Cloudfront versus post processing

We are using AWS CloudFront to render static contents on our site with origin as S3 BUCKET. Now as next steps, the user can dynamically upload images which we want to push to CDN. But we would require different sizes of it so that we can use it later in the site. One option is to actually do preprocessing of images before pushing to S3 BUCKET. This ends up creating multiple images based on sizes. Can we do post processing something like http://imageprocessor.org/imageprocessor-web/ does but still use CloudFront. Any feedback would be helpful.
Well, yes, it is possible to do post-processing and use CloudFront but you need an intermediate layer between CloudFront and S3. I designed a system using the following high-level implementation:
Request arrives at CloudFront, which serves the image from cache if available; otherwise CloudFront sends the request to the origin server.
The origin server is not S3. The origin server is Varnish, on EC2.
Varnish sends the request to S3, where all the resized image results are stored. If S3 returns 200 OK, the image is returned to CloudFront and to the requesting browser and the process is complete. Since the Varnish machine runs in the same AWS region as the S3 bucket, the performance is essentially indistinguishable between CloudFront >> S3 and CloudFront >> Varnish >> S3.
Otherwise, Varnish is configured to retry the failed request by sending it to the resizer platform, which also runs in EC2.
The resizer examines the request to determine what image is being requested, and what size. In my application, the desired size is in the last few characters of the filename, so xxxxx_300_300_.jpg means 300 x 300. The resizer fetches the source image... resizes it... stores the result in S3... and returns the new image to Varnish, which returns it to CloudFront and to the requester. The resizer itself is Imagemagick wrapped in Mojolicious and uses a MySQL database to identify the source URI where the original image can be fetched.
Storing the results in a backing store, like S3, and checking there, first, on each request, is a critical part of this process, because CloudFront does not work like many people seem to assume. Check your assumptions against the following assertions:
CloudFront has 50+ edge locations. Requests are routed to the edge that is optimal for (usually, geographically close to) the viewer. The edge caches are all independent. If I request an object through CloudFront, and you request the same object, and our requests arrive at different edge locations, then neither of us will be served from cache. If you are generating content on demand, you want to save your results to S3 so that you do not have to repeat the processing effort.
CloudFront honors your Cache-Control: header (or overridden values in configuration) for expiration purposes, but does not guarantee to retain objects in cache until they expire. Caches are volatile and CloudFront is no exception. For this reason, too, your results need to be stored in S3 to avoid duplicate processing.
This is a much more complex solution than pre-processing.
I have a pool of millions of images, a large percentage of which would have a very low probability of being viewed, and this is an appropriate solution, here. It was originally designed as a parallel solution to make up for deficiencies in a poorly-architected preprocessor that sometimes "forgot" to process everything correctly, but it worked so well that it is now the only service providing images.
However, if your motivation revolves around avoiding the storage cost of the preprocessed results, this solution won't entirely solve that.
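The origin-side decision described in this answer, as an added example that is not the answer author's code: parse the `name_W_H_.jpg` request, check the backing store first, and invoke the resizer only on a miss. The Varnish, S3 and resizer tiers are collapsed into one object with in-memory stand-ins; the size allowlist, the class and function names and the status codes are assumptions.

```python
import re

# Assumption: only a fixed set of sizes is served; any other size is refused
# before any backend work is done, which bounds resizer load and stored objects.
ALLOWED_SIZES = {(128, 128), (300, 300), (512, 512)}

# Naming scheme from the answer: "<name>_<width>_<height>_.jpg"
NAME_RE = re.compile(r"([A-Za-z0-9-]+)_(\d{1,4})_(\d{1,4})_\.jpg")


def parse_request(filename):
    """Return (source_name, (w, h)), or None for a malformed name or unlisted size."""
    m = NAME_RE.fullmatch(filename)
    if not m:
        return None
    size = (int(m.group(2)), int(m.group(3)))
    return (m.group(1), size) if size in ALLOWED_SIZES else None


def fake_resize(original, size):
    # Stand-in for the real resizer (ImageMagick in the answer).
    return b"%dx%d:" % size + original


class Origin:
    def __init__(self, sources, resize=fake_resize):
        self.sources = sources   # source name -> original bytes (stand-in for the source lookup)
        self.store = {}          # stand-in for the S3 bucket of resized results
        self.resize = resize
        self.resize_calls = 0

    def handle(self, filename):
        parsed = parse_request(filename)
        if parsed is None:
            return 400, b""
        # Check the backing store first: each edge cache misses independently,
        # so the same object can be requested from the origin many times.
        if filename in self.store:
            return 200, self.store[filename]
        name, size = parsed
        original = self.sources.get(name)
        if original is None:
            return 404, b""
        self.resize_calls += 1
        self.store[filename] = self.resize(original, size)
        return 200, self.store[filename]
```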

How can I stagger purging multiple cascading CDNs to assure a complete purge?

I use an Amazon S3 bucket, a cloudinary cache, and a fastly cache. In conjunction, they deliver images of any shape, size, or other transformation you ask for, and very fast. However, they propagate purge requests at different rates.
Here is the cascading arrangement:
When an image is requested, Fastly tries to serve that image from its cache.
If the image is absent, Fastly asks Cloudinary for that image.
Cloudinary tries to serve the image from its cache.
If the image is absent, Cloudinary checks to see if the requested image had associated transformation parameters.
After finding the transformation params, Cloudinary tries to find an untransformed version of the image in its cache to apply the transformation to.
If the untransformed image is absent, Cloudinary will request it from the S3 bucket.
Cloudinary then applies the transformation and caches both the original and the transformed versions.
Cloudinary serves the transformed image to Fastly.
Fastly caches the transformed image and serves it.
I'd like to completely remove an image and all transformed versions (derivatives) of that image from all of my services. Cloudinary takes an hour to propagate the DELETE request to all of its servers.
I see that it is best to delete first in S3, then Cloudinary, and finally to purge fastly. How best does one delay a purge call for an hour though?
What is the best practice, programmatically speaking, in this situation?
As part of its solution, Cloudinary provides CDN services via Akamai (integration to other CDNs is also available).
Images are cached on the CDN after initial delivery and until a purge is requested. The propagation to all CDN nodes takes time and may last up to 1 hour, however usually only takes several minutes.
Coordinating between two CDN layers is a quite complex task, therefore the best practice is to avoid using your own CDN in front of Cloudinary's, especially if invalidations are commonly required.

Can you request an object from S3 without knowing its extension?

Say I have a bucket called uploads with two directories, both of which contain images.
The first directory, called catalog, has images with various extensions (.jpg, .png, etc.)
The second directory, called brands, has images with no extensions.
I can request uploads/catalog/some-image.jpg and uploads/brands/extensionless-image, and they both return an image as I expect.
We're already using a third-party service, imgix, which is just an image-processing CDN that links to the S3 bucket so that we can request, say, a smaller or cropped version of the image in the bucket.
Ideally, I'd like to keep the images and objects in their current formats in the bucket, but I would like the client-side to be agnostic about which file it is requesting. In other words, I'd like to request some-image, and even though it may or may not actually have an extension in the bucket, I'd still like to somehow "intelligently guess" the image I'm requesting. We'll also assume that there are no collisions, i.e., there will never be an image some-image.jpg and some-image with both the same name (our objects are named with a collision-less algorithm).
This is what I've tried:
Simply request images in one directory by their extension, and the images in the other bucket without their extension (however, even though the policy is the same of requesting an image, the mechanism has to be implemented in two different ways. I would like a singular mechanism)
Another solution is to programmatically remove the extensions from all the images in catalog and re-sync the bucket
Anyone run into something similar before? Thoughts?
I suspect your best bet is going to be renaming the images. Not that there aren't other solutions, but because that is probably going to be the simplest and most straightforward approach.
First, S3 will not guess. The key on an S3 object is an opaque string from S3's perspective. The extension has no meaning, and even the slashes delimiting "directories" have no intrinsic meaning to S3. (Deleting a "directory" in S3 means sending a delete request for every individual object in the directory. The console creates a convenient illusion by doing this for you.)
S3 has redirect rules, but they only match and manipulate path prefixes, not suffixes, so no help there.
It would be possible, using a reverse proxy in front of S3, to inspect requests and for any 404 or 403, the proxy could retry the request with alternate extensions, until it found one that worked, and it could potentially "learn" the right extension for use on subsequent requests, but then you'd have the added turn-around time and additional cost for multiple requests.
I have developed systems whose job it is to "find" things requested over HTTP by trying multiple back-end URLs, without the requester being aware of the "hunting" going on in the background, and it can be very useful... but that is a much more complicated solution than you would probably want to consider, particularly in light of the fact that every millisecond counts when it comes to image loading.
There is no native solution for magic guessing with S3. You pretty much have to ask it for exactly what you want. Storage in S3 is cheap enough, of course, that you could probably duplicate your content, with and without extensions, without giving too much thought to the cost. If you used a Lambda event on the bucket, you could even automate the process of copying "kitten.jpg" to "kitten" each time "kitten.jpg" was modified.
If the content-type is set correctly in your object metadata, you should be fine regardless of extensions. If content-type header is not set, you can set it, for example using ImageMagick Identify to discover the image type and AWS CLI to set it.