How can I stagger purging multiple cascading CDNs to ensure a complete purge? - amazon-web-services

I use an Amazon S3 bucket, a Cloudinary cache, and a Fastly cache. Together they deliver images in any shape, size, or other transformation you ask for, very quickly. However, they propagate purge requests at different rates.
Here is the cascading arrangement:
When an image is requested, Fastly tries to serve that image from its cache.
If the image is absent, Fastly asks Cloudinary for that image.
Cloudinary tries to serve the image from its cache.
If the image is absent, Cloudinary checks whether the requested image has associated transformation parameters.
After finding the transformation params, Cloudinary tries to find an untransformed version of the image in its cache to apply the transformation to.
If the untransformed image is absent, Cloudinary will request it from the S3 bucket.
Cloudinary then applies the transformation and caches both the original and the transformed versions.
Cloudinary serves the transformed image to Fastly.
Fastly caches the transformed image and serves it.
I'd like to completely remove an image and all transformed versions (derivatives) of that image from all of my services. Cloudinary takes an hour to propagate the DELETE request to all of its servers.
I gather that it is best to delete from S3 first, then Cloudinary, and finally to purge Fastly. But how does one best delay a purge call for an hour?
What is the best practice, programmatically speaking, in this situation?
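One approach, offered only as a sketch under assumptions, is to run the S3 and Cloudinary deletes immediately and schedule the Fastly purge to fire after Cloudinary's propagation window. The bucket name, Fastly service ID and token, surrogate key, and the use of threading.Timer below are all illustrative placeholders; in production a durable delayed job (a Celery countdown task, an SQS delayed message, a cron entry) would be a better fit than an in-process timer.

```python
# A minimal sketch, not a drop-in solution: bucket name, Fastly service ID,
# API token, and surrogate key are placeholders, and CLOUDINARY_URL is
# assumed to be configured in the environment.
import threading
import boto3
import cloudinary.uploader
import requests

BUCKET = "my-image-bucket"            # hypothetical S3 bucket
FASTLY_SERVICE_ID = "SERVICE_ID"      # hypothetical Fastly service ID
FASTLY_KEY = "FASTLY_API_TOKEN"       # hypothetical Fastly API token
PURGE_DELAY_SECONDS = 60 * 60         # wait out Cloudinary's ~1 hour propagation


def purge_fastly(surrogate_key):
    # Purge everything Fastly has cached under the image's surrogate key.
    requests.post(
        f"https://api.fastly.com/service/{FASTLY_SERVICE_ID}/purge/{surrogate_key}",
        headers={"Fastly-Key": FASTLY_KEY},
        timeout=10,
    )


def delete_everywhere(s3_key, cloudinary_public_id, surrogate_key):
    # 1. Delete the original from S3 so Cloudinary can never re-fetch it.
    boto3.client("s3").delete_object(Bucket=BUCKET, Key=s3_key)

    # 2. Delete the image and its derivatives from Cloudinary, asking it to
    #    invalidate its own CDN copies as well.
    cloudinary.uploader.destroy(cloudinary_public_id, invalidate=True)

    # 3. Purge Fastly only after Cloudinary has had time to propagate the
    #    delete; purging earlier risks Fastly re-caching a stale copy from a
    #    Cloudinary node that has not yet processed it.
    threading.Timer(PURGE_DELAY_SECONDS, purge_fastly, args=[surrogate_key]).start()
```

Purging by surrogate key assumes the Fastly configuration tags image responses with that key; otherwise you would issue an HTTP PURGE request against each derivative URL, which means enumerating them all.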

As part of its solution, Cloudinary provides CDN services via Akamai (integration with other CDNs is also available).
Images are cached on the CDN after initial delivery and until a purge is requested. Propagation to all CDN nodes takes time: it may take up to an hour, though it usually completes within several minutes.
Coordinating between two CDN layers is quite a complex task, so the best practice is to avoid putting your own CDN in front of Cloudinary's, especially if invalidations are frequently required.

Related

GCP CDN not caching data

We have deployed our website on a GCP VM and enabled GCP CDN in front of it. When we browse the website, in most cases GCP CDN makes requests to the origin VM.
I am using the Stackdriver query below to check cache hits.
resource.type="http_load_balancer"
resource.labels.forwarding_rule_name="rule_name"
httpRequest.serverIp="gcpvmip"
httpRequest.requestUrl="request_url"
httpRequest.cacheFillBytes > 0
Based on your latest comment, it sounds like you're expecting all requests to your site to be served from Cloud CDN's caches without contacting your origin server. However, it's normal to see cache misses when using a CDN. Each CDN operates numerous caches, not one big global cache. The fact that content for one URL has been inserted into one cache does not mean it will be present in all caches everywhere. Further, unpopular cache entries are routinely evicted from cache to make room for more popular content.
Here are some relevant excerpts from the Cloud CDN docs:
Cloud CDN uses caches in numerous locations around the world. Caching is reactive in that an object is stored in a particular cache if a request goes through that cache and if the response is cacheable. An object stored in one cache does not automatically replicate into other caches; cache-to-cache fill happens only in response to a client-initiated request.
https://cloud.google.com/cdn/docs/overview
Note that the expiration time is an upper bound on how long a cache entry remains valid. There is no guarantee that a cache entry will remain in the cache until it expires.
https://cloud.google.com/cdn/docs/caching
Note, though, that Cloud CDN operates numerous caches around the world, and old cache entries are routinely evicted to make room for new content. As a result, multiple cache fills per resource are expected as part of normal operation.
https://cloud.google.com/cdn/docs/support#low-hit-rate
If you're seeing low cache hit rates for popular content, that last link has suggestions that should help.
I know exactly what the problem is... GCP CDN does not have an Origin Shield feature. Even worse, with GCP almost every request arrives from a different one of its very many CDN PoPs around the world. Without Origin Shield, your app server is the origin server, and it has to fill the cache of every CDN edge point.
In my experience you should use GCP CDN only for DoS protection and for improving the TTFB of HTML requests (especially to offload the SSL handshake). Use another CDN with a better cache-hit ratio for caching the other assets.
Some CDN providers have Origin Shield, which helps with the cache-hit ratio. For example, create cdn.yourdomain.com with a CDN provider that has an Origin Shield feature and serve all other static content from there.
I know it may sound crazy to put a CDN in front of your CDN, but trust me, it works amazingly well, and you can even save money if you go with a CDN that charges less for bandwidth. Also, GCP CDN only caches content up to 10 MB.

Uploading various sized Images to AWS Cloudfront versus post processing

We are using AWS CloudFront to serve static content on our site, with an S3 bucket as the origin. As a next step, users can dynamically upload images, which we want to push to the CDN. But we need them in different sizes so we can use them later on the site. One option is to preprocess the images before pushing them to the S3 bucket, which ends up creating multiple images per size. Can we do post-processing, something like http://imageprocessor.org/imageprocessor-web/ does, but still use CloudFront? Any feedback would be helpful.
Regards
Raghav
Well, yes, it is possible to do post-processing and use CloudFront but you need an intermediate layer between CloudFront and S3. I designed a system using the following high-level implementation:
Request arrives at CloudFront, which serves the image from cache if available; otherwise CloudFront sends the request to the origin server.
The origin server is not S3. The origin server is Varnish, on EC2.
Varnish sends the request to S3, where all the resized image results are stored. If S3 returns 200 OK, the image is returned to CloudFront and to the requesting browser and the process is complete. Since the Varnish machine runs in the same AWS region as the S3 bucket, the performance is essentially indistinguishable between CloudFront >> S3 and CloudFront >> Varnish >> S3.
Otherwise, Varnish is configured to retry the failed request by sending it to the resizer platform, which also runs in EC2.
The resizer examines the request to determine what image is being requested, and what size. In my application, the desired size is in the last few characters of the filename, so xxxxx_300_300_.jpg means 300 x 300. The resizer fetches the source image... resizes it... stores the result in S3... and returns the new image to Varnish, which returns it to CloudFront and to the requester. The resizer itself is Imagemagick wrapped in Mojolicious and uses a MySQL database to identify the source URI where the original image can be fetched.
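As a rough illustration of the resizer step just described, here is a minimal sketch that substitutes Pillow and boto3 for the answer's Imagemagick/Mojolicious/MySQL stack; the bucket names and the direct keyed lookup of the source image are simplifying assumptions made for brevity.

```python
# A rough sketch of the resizer step; bucket names and the key format are
# illustrative, following the "_300_300_.jpg" convention from the answer.
import io
import re
import boto3
from PIL import Image

s3 = boto3.client("s3")
RESULT_BUCKET = "resized-images"     # hypothetical bucket of resized results
SOURCE_BUCKET = "original-images"    # hypothetical bucket of source images

SIZE_RE = re.compile(r"^(?P<base>.+)_(?P<w>\d+)_(?P<h>\d+)_\.jpg$")


def resize_on_miss(requested_key):
    # Parse the desired size from the filename, e.g. "xxxxx_300_300_.jpg".
    match = SIZE_RE.match(requested_key)
    if not match:
        raise ValueError(f"unrecognised key format: {requested_key}")
    base, width, height = match["base"], int(match["w"]), int(match["h"])

    # Fetch the untransformed source image.
    source = s3.get_object(Bucket=SOURCE_BUCKET, Key=f"{base}.jpg")
    img = Image.open(io.BytesIO(source["Body"].read()))

    # Resize to fit within the requested box, preserving aspect ratio.
    img.thumbnail((width, height))
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    body = buf.getvalue()

    # Persist the result so the next cache miss anywhere is served from S3
    # rather than triggering another resize.
    s3.put_object(Bucket=RESULT_BUCKET, Key=requested_key, Body=body,
                  ContentType="image/jpeg")
    return body
```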
Storing the results in a backing store, like S3, and checking there, first, on each request, is a critical part of this process, because CloudFront does not work like many people seem to assume. Check your assumptions against the following assertions:
CloudFront has 50+ edge locations. Requests are routed to the edge that is optimal for (usually, geographically close to) the viewer. The edge caches are all independent. If I request an object through CloudFront, and you request the same object, and our requests arrive at different edge locations, then neither of us will be served from cache. If you are generating content on demand, you want to save your results to S3 so that you do not have to repeat the processing effort.
CloudFront honors your Cache-Control: header (or overridden values in configuration) for expiration purposes, but does not guarantee to retain objects in cache until they expire. Caches are volatile and CloudFront is no exception. For this reason, too, your results need to be stored in S3 to avoid duplicate processing.
This is a much more complex solution than pre-processing.
I have a pool of millions of images, a large percentage of which would have a very low probability of being viewed, and this is an appropriate solution, here. It was originally designed as a parallel solution to make up for deficiencies in a poorly-architected preprocessor that sometimes "forgot" to process everything correctly, but it worked so well that it is now the only service providing images.
However, if your motivation revolves around avoiding the storage cost of the preprocessed results, this solution won't entirely solve that.

GET from CloudFront after PUT returns old data

I am using CloudFront to access my S3 bucket.
I perform both GET and PUT operations to retrieve and update the data. The problem is that after I send a PUT request with new data, a GET request still returns the old data. I do see that the file is updated in the S3 bucket.
I am performing both GET and PUT from an iOS application. However, I tried performing the GET request using regular browsers and I still receive the old data.
Do I need to do anything in addition to make CloudFront refresh its data?
Cloudfront caches your data. How long depends on the headers the origin serves content with and the distribution settings.
Amazon has a document with the full details of how they interact, but if you haven't set your Cache-Control headers and haven't changed any CloudFront settings, then by default data is cached for up to 24 hours.
You can either:
Set headers indicating how long to cache content for (e.g. Cache-Control: max-age=300 to allow caching for up to 5 minutes). Exactly how you do this depends on how you are uploading the content; in a pinch you can use the console.
Use the console / API to invalidate content. Beware that only the first 1,000 invalidations a month are free; beyond that Amazon charges. In addition, invalidations take 10-15 minutes to process. (A boto3 sketch of these first two options follows this list.)
Change the naming strategy for your S3 data so that new data is served under a different name (perhaps less relevant in your case).
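For reference, here is a short boto3 sketch of the first two options; the bucket name, object key, and distribution ID are placeholders.

```python
# Sketch of option 1 (Cache-Control header on upload) and option 2
# (explicit CloudFront invalidation); all identifiers are placeholders.
import time
import boto3

s3 = boto3.client("s3")
cloudfront = boto3.client("cloudfront")

# Option 1: upload the object with a Cache-Control header so CloudFront
# (and browsers) cache it for at most 5 minutes.
s3.put_object(
    Bucket="my-bucket",
    Key="data/latest.json",
    Body=b'{"version": 42}',
    ContentType="application/json",
    CacheControl="max-age=300",
)

# Option 2: explicitly invalidate the cached copy. Remember the 1,000 free
# invalidation paths per month and the 10-15 minute processing time.
cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/data/latest.json"]},
        "CallerReference": str(time.time()),
    },
)
```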
When you PUT an object into S3 by sending it through Cloudfront, Cloudfront proxies the PUT request back to S3, without interpreting it within Cloudfront... so the PUT request changes the S3 object, but the old version of the object, if cached, would have no reason to be evicted from the Cloudfront cache, and would continue to be served until it is expired, evicted, or invalidated.
"The" Cloudfront cache is not a single thing. Cloudfront has over 50 global edge locations (reqests are routed to what should be the closest one, using geolocating DNS), and objects are only cached in locations through which they have been requested. Sending an invalidation request to purge an object from cache causes a background process at AWS to contact all of the edge locations and request the object be purged, if it exists.
What's the point of uploading this way, then? The point has to do with the impact of packet loss, latency, and overall network performance on the throughput of a TCP connection.
The Cloudfront edge locations are connected to the S3 regions by high bandwidth, low loss, low latency (within the bounds of the laws of physics) connections... so the connection from the "back side" of Cloudfront towards S3 may be a connection of higher quality than the browser would be able to establish.
Since the Cloudfront edge location is also likely to be closer to the browser than S3 is, the browser connection is likely to be of higher quality and more resilient... thereby improving the net quality of the end-to-end logical connection, by splitting it into two connections. This feature is solely about performance:
http://aws.amazon.com/blogs/aws/amazon-cloudfront-content-uploads-post-put-other-methods/
If you don't have any issues sending directly to S3, then uploads "through" Cloudfront serve little purpose.

Push files up to Amazon Cloudfront: Possible?

I've been reading up about pull and push CDNs. I've been using Cloudfront as a pull CDN for resized images:
Receive image from client
Put image in S3
Later on, when a client makes a request to CloudFront for a URL, CloudFront does not have the image, so it has to forward the request to my server, which must:
Receive request
Pull image from S3
Resize image
Push image back to Cloudfront
However, this takes a few seconds, which is a really annoying wait when you first upload your beautiful image and want to see it. The delay appears to be mostly the download/reuploading time, rather than the resizing, which is pretty fast.
Is it possible to pro-actively push the resized image to Cloudfront and attach it to a URL, such that future requests can immediately get the prepared image? Ideally I would like to
Receive image from client
Put image in S3
Resize image for common sizes
Pre-emptively push these sizes to cloudfront
This avoids the whole download/reupload cycle, making the common sizes really fast, but the less-common sizes can still be accessed (albeit with a delay the first time). However, to do this I'd need to push the images up to Cloudfront. This:
http://www.whoishostingthis.com/blog/2010/06/30/cdns-push-vs-pull/
seems to suggest it can be done, but everything else I've seen makes no mention of it. My question is: is it possible? Or are there any other solutions to this problem that I am missing?
We have tried similar things with different CDN providers, and for CloudFront I don't think there is any existing way for you to push (what we call pre-feeding) your specific content to nodes/edges if the CloudFront distribution is using your custom origin.
One way I can think of, as also mentioned by #Xint0, is to set up another S3 bucket specifically to host those files you would like to push (in your case, the resized images). Basically you will have two CloudFront distributions: one to pull the files that are rarely accessed, and another to push the files that are accessed frequently, including the images you expect to be resized. This sounds a little complex, but I believe that's the trade-off you have to make.
Another option I can recommend looking at is EdgeCast, another CDN provider that offers a function called load_to_edge (I spent quite a lot of time last month integrating it with our service, which is why I remember it clearly) that does exactly what you expect. They also support custom-origin pull, so you could give it a trial there.
The OP asks for a push CDN solution, but it sounds like he's really just trying to make things faster. I'm venturing that you probably don't really need to implement a CDN push; you just need to optimize your origin server pattern.
So, OP, I'm going to assume you're supporting at most a handful of image sizes, say 128x128, 256x256 and 512x512. It also sounds like you have your original versions of these images in S3.
This is what currently happens on a cache miss:
CDN receives request for a 128x128 version of an image
CDN does not have that image, so it requests it from your origin server
Your origin server receives the request
Your origin server downloads the original image from S3 (presumably a larger image)
Your origin resizes that image and returns it to the CDN
CDN returns that image to user and caches it
What you should be doing instead:
There are a few options here depending on your exact situation.
Here are some things you could fix quickly, with your current setup:
If you have to fetch your original images from S3, you're basically making it so that a cache miss results in every image taking as long to download as the original-sized image. If at all possible, you should try to stash those original images somewhere that your origin server can access quickly. There are a million different options here depending on your setup, but fetching them from S3 is about the slowest of all of them. At least you aren't using Glacier ;).
You aren't caching the resized images. That means that every edge node CloudFront uses is going to request this image, which triggers the whole resizing process. CloudFront may have hundreds of individual edge node servers, meaning hundreds of misses and resizes per image. Depending on what CloudFront does for tiered distribution, and how you set your file headers, it may not actually be that bad, but it won't be good.
I'm going out on a limb here, but I'm betting you aren't setting custom expiration headers, which means Cloudfront is only caching each of these images for 24 hours. If your images are immutable once uploaded, you'd really benefit from returning expiration headers telling the CDN not to check for a new version for a long, long time.
Here are a couple ideas for potentially better patterns:
When someone uploads a new image, immediately transcode it into all the sizes you support and upload those to S3. Then just point your CDN at that S3 bucket (a sketch of this pattern follows the next point). This assumes you have a manageable number of supported image sizes. However, I would point out that if you support too many image sizes, a CDN may be the wrong solution altogether. Your cache hit rate may be so low that the CDN is really getting in the way. If that's the case, see the next point.
If you are supporting something like continuous resizing (i.e., I could request image_57x157.jpg or image_315x715.jpg, etc. and the server would return it), then your CDN may actually be doing you a disservice by introducing an extra hop without offloading much from your origin. In that case, I would probably spin up EC2 instances in all the available regions, install your origin server on them, and then swap image URLs to regionally appropriate origins based on client IP (effectively rolling your own CDN).
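Here is a sketch of the first pattern (transcode at upload time), assuming Pillow and boto3; the size list, bucket name, and key naming are illustrative choices, not anything prescribed by the answer.

```python
# A sketch of the "transcode at upload time" pattern; all names are placeholders.
import io
import boto3
from PIL import Image

s3 = boto3.client("s3")
BUCKET = "my-image-bucket"                     # hypothetical bucket behind the CDN
SUPPORTED_SIZES = [(128, 128), (256, 256), (512, 512)]


def handle_upload(image_bytes, image_id):
    original = Image.open(io.BytesIO(image_bytes)).convert("RGB")

    for width, height in SUPPORTED_SIZES:
        resized = original.copy()
        resized.thumbnail((width, height))     # fit within the box, keep aspect ratio

        buf = io.BytesIO()
        resized.save(buf, format="JPEG")

        # Long-lived Cache-Control so the CDN keeps immutable derivatives around.
        s3.put_object(
            Bucket=BUCKET,
            Key=f"{image_id}_{width}x{height}.jpg",
            Body=buf.getvalue(),
            ContentType="image/jpeg",
            CacheControl="public, max-age=31536000, immutable",
        )
```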
And if you reeeeeally want to push to Cloudfront:
You probably don't need to, but if you simply must, here are a couple options:
Write a script to use the webpagetest.org APIs to fetch your image from a variety of different places around the world. In a sense, you'd be pushing a pull command to all the different edge locations. This isn't guaranteed to populate every edge location, but you could probably get close. Note that I'm not sure how thrilled webpagetest.org would be about being used this way, but I don't see anything in their terms of use about it (IANAL).
If you don't want to use a third party or risk irking webpagetest.org, just spin up a micro EC2 instance in every region, and use those to fetch the content, same as in #1.
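For option 2, the warm-up script run from an instance in each region could be as small as the following sketch; the CloudFront URLs are placeholders, and each run only warms the edge locations those instances happen to be routed to.

```python
# A tiny cache warm-up script for a regional instance; URLs are illustrative.
import requests

URLS_TO_WARM = [
    "https://dxxxxxxxxxxxx.cloudfront.net/images/hero_512x512.jpg",
    "https://dxxxxxxxxxxxx.cloudfront.net/images/hero_256x256.jpg",
]

def warm():
    for url in URLS_TO_WARM:
        resp = requests.get(url, timeout=30)
        # The X-Cache response header shows whether this hit or filled the edge cache.
        print(url, resp.status_code, resp.headers.get("X-Cache"))

if __name__ == "__main__":
    warm()
```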
AFAIK CloudFront pulls from an S3 bucket when you use one as the origin. So, after resizing the images, you should be able to save the resized images directly to the S3 bucket that CloudFront serves from.

How does IMDb do its on-the-fly image resize and cropping?

I've been researching CDNs and image thumbnail generation and I was impressed with how IMDb does its image manipulation. Here's an example of a thumbnail version:
http://ia.media-imdb.com/images/M/MV5BMTc0MzU5ODQ5OF5BMl5BanBnXkFtZTYwODIwODk1._V1._SY98_CR1,0,67,98_.jpg
And here's a tweaked version that plays with size and cropping:
http://ia.media-imdb.com/images/M/MV5BMTc0MzU5ODQ5OF5BMl5BanBnXkFtZTYwODIwODk1._V1._SY400_CR10,40,213,314_.jpg
It seems pretty straightforward: everything from '.V1._...' onward is used to determine how to manipulate the image. This is all done impressively fast, and I decided to find an existing solution that mimics this functionality.
I was able to find plenty of solutions for image resizing, and I did find Google App Engine's page on Transforming Images in Java. However, I don't think Amazon's IMDb is using Google to serve its images, and since all my images are on Amazon's S3, I don't think I can use that solution.
After four hours of online searching, I decided to ask the intelligent crowd here.
Further context: I am building a web application on Amazon's Elastic Beanstalk and I'm thinking of having a separate server (perhaps another Beanstalk) to handle the images...similar to what IMDb is doing.
Thanks in advance for your insight.
I can't speak to IMDB's specific implementation, but I have implemented a similar solution on Amazon EC2 and S3 in the past. Here is an overview of my implementation:
All master (full-sized) images stored on S3, but NOT publicly accessible.
All image src urls point to an EC2 web server.
Smaller (thumbnail) versions of the images are also stored on S3, with a naming convention that identifies their size AND aspect ratio:
"myimage1_s200.jpg" is a square version of "myimage1.jpg" that has been resized as a 200x200 square.
"myimage1_h100.jpg" has been resized to a maximum height of 100 (with a variable width that conforms to the original aspect ratio).
When the EC2 server receives a request for a specific image size: it checks to see if that size already exists, and if so it returns the existing image to the requester.
When the EC2 server receives a request for an image size that DOES NOT exist: it retrieves a copy of the next larger size version of the same image, resizes it, returns the new image to the requester, AND ALSO saves a copy to S3 for future use.
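A compact sketch of that "serve it if it exists, otherwise generate, save, and serve" decision might look like the following, assuming boto3 and Pillow; the bucket names are placeholders, and for brevity it resizes from the master image rather than the next larger size described above.

```python
# Sketch of the on-demand resize decision; bucket names and suffix parsing
# (the "_s200" / "_h100" convention) follow the answer and are illustrative.
import io
import re
import boto3
from botocore.exceptions import ClientError
from PIL import Image

s3 = boto3.client("s3")
THUMB_BUCKET = "thumbnails"      # hypothetical bucket of resized copies
MASTER_BUCKET = "masters"        # hypothetical private bucket of originals


def get_or_create(requested_key):
    # requested_key is e.g. "myimage1_s200.jpg" or "myimage1_h100.jpg".
    try:
        obj = s3.get_object(Bucket=THUMB_BUCKET, Key=requested_key)
        return obj["Body"].read()                     # this size already exists
    except ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchKey":
            raise

    base, kind, size = re.match(r"(.+)_([sh])(\d+)\.jpg$", requested_key).groups()
    master = s3.get_object(Bucket=MASTER_BUCKET, Key=f"{base}.jpg")
    img = Image.open(io.BytesIO(master["Body"].read()))

    size = int(size)
    if kind == "s":
        # Square version; a real implementation would center-crop before resizing.
        img = img.resize((size, size))
    else:
        # Maximum height, preserving the original aspect ratio.
        img.thumbnail((img.width, size))

    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    body = buf.getvalue()

    # Save for future requests, then return to the requester.
    s3.put_object(Bucket=THUMB_BUCKET, Key=requested_key, Body=body,
                  ContentType="image/jpeg")
    return body
```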
Performance Notes:
Pointing image src's directly to previously resized images on S3 is a lot faster if you know they exist!
Resizing the next larger version of the image vs always going back to the original is a LOT faster under load!