Path based AWS API Caching Keys Issue - amazon-web-services

I have several API paths set up in a test API Gateway setup with a simple 'api' stage. I am using AWS Lambda and wish to cache the results of the lambda call.
There are three test paths (no authentication)
/a/{thing} (GET Caching turned on in stage)
/b/{thing} (GET Caching turned off in stage)
/c/{thing} (GET Caching turned off in stage)
They all map to the same lambda function. The lambda function returns the current time and the value of {thing}.
If I request /a/0000 through /a/1000 I get back the same result for a function that ran for thing=0000.
If I request /b/0000 through /b/1000 (or /c/) I get back uncached results.
thing is selected as 'cache' in resources /a/{thing}. Nothing else is set 'cache'.
It is my understanding that selecting 'cache' next to a path element, query element, or header would construct a cache key - possibly a multi-key cache key hash. That would be ideal!
Ideally /a/0000 and /a/1234 would return a cached version keyed to the {thing} value.
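A simplified mental model of the behaviour described above may help. This sketch is not how API Gateway is implemented internally; `cache_key` and its arguments are hypothetical names used only for illustration. It shows why every request collapses onto one cached entry when {thing} is not part of the key, and why each value gets its own entry once it is.

```python
from typing import Mapping, Sequence, Tuple


def cache_key(method: str, resource: str, params: Mapping[str, str],
              key_params: Sequence[str]) -> Tuple:
    """Build a lookup key from the resource plus only the parameters marked 'cache'."""
    selected = tuple((name, params.get(name)) for name in sorted(key_params))
    return (method, resource, selected)


# {thing} is NOT part of the key: both requests map to the same entry.
no_key_a = cache_key("GET", "/a/{thing}", {"thing": "0000"}, [])
no_key_b = cache_key("GET", "/a/{thing}", {"thing": "1234"}, [])

# {thing} IS part of the key: each value gets its own entry.
keyed_a = cache_key("GET", "/a/{thing}", {"thing": "0000"}, ["thing"])
keyed_b = cache_key("GET", "/a/{thing}", {"thing": "1234"}, ["thing"])

print(no_key_a == no_key_b)  # True
print(keyed_a == keyed_b)    # False
```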
What did I do wrong or misread or step over? Am I hitting a bug when it comes to AWS Lambda? Is caching keyed to authorization - these URLs are public and unauthenticated. I'm just using curl to request these and nothing is being cached on the client side of course.
I've also tried using a query argument as the only cache key, flushed the cache, and waited 30 minutes before trying again. I'm still not getting the results I would expect.

Pro Tip:
You still have to deploy from resources to stage when you set up cache keys. This makes sense of course but it would be good if the management console showed more about the method parameters than it does.
I am using Chalice, which is why I wasn't deploying in the normal fashion.
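For anyone scripting the redeploy step instead of using the console, here is a minimal sketch. It assumes `client` is a boto3 API Gateway client (for example `boto3.client("apigateway")`); the REST API id and the helper name `redeploy_stage` are placeholders, not values from this question.

```python
def redeploy_stage(client, rest_api_id: str, stage_name: str) -> str:
    """Create a new deployment so that resource-level changes reach the stage.

    `client` is assumed to be a boto3 "apigateway" client.
    Returns the id of the deployment that was created.
    """
    response = client.create_deployment(restApiId=rest_api_id, stageName=stage_name)
    return response["id"]


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   redeploy_stage(boto3.client("apigateway"), "abc123", "api")
```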

Related

Google Cloud CDN started ignoring query strings for storage buckets

Some months ago we activated Cloud CDN for storage buckets. Our storage data is regularly changed via a backend. So, to invalidate the cached version, we added a query param with the changedDate to the URL that is served to the client.
Back then this worked well.
Sometime in the last months (probably weeks) Google seemed to change that and is now ignoring the query string for caching from storage buckets.
First part: Does anyone know why this was changed and why no one was notified about it?
Second part: How can you invalidate the cache for a particular object in a storage bucket without sending a cache-invalidation request (which you shouldn't) every time?
I don't like the idea of deleting the old file and uploading a new file with a changed filename every time something is uploaded...
EDIT:
For clarification: the official documentation (cloud.google.com/cdn/docs/caching) already states that query strings are now ignored for storage buckets:
For backend buckets, the cache key consists of the URI without the query string. Thus https://example.com/images/cat.jpg, https://example.com/images/cat.jpg?user=user1, and https://example.com/images/cat.jpg?user=user2 are equivalent.
We were also affected by this. After we contacted Google Support, they confirmed that this is a permanent change. The recommended workaround is to either use versioning in the object name or use cache invalidation. The latter sounds a bit odd, as the cache invalidation documentation states:
Invalidation is intended for use in exceptional circumstances, not as part of your normal workflow.
For backend buckets, the cache key consists of the URI without the query string, as the official documentation states [1]. The bucket does not evaluate the query string, but the CDN should still do that. I could reproduce this same scenario, and it is currently still possible to use a query string as a cache buster.
It seems the reason for the change is that the old behavior resulted in lost caching opportunities, higher costs, and higher latency. The only recommended workaround for now is to create new objects with the version incorporated into the object's name (which does not seem to be a valid option in your case), or to use cache invalidation.
Invalidating the cache for a particular object requires issuing a specific request for it. A Cache-Control header that allows such objects to be cached only for a certain time may be a workaround for you. The Cloud CDN cache has an expiration time defined by the "Cache-Control: s-maxage", "Cache-Control: max-age", and/or Expires headers [2].
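The "version in the object name" workaround could be sketched as follows. This is only an illustration under assumptions: the helper name `versioned_object_name` and the timestamp layout are hypothetical, and the upload itself is left out. The idea is to derive the name from the same changedDate that used to go into the query string.

```python
import posixpath
from datetime import datetime, timezone


def versioned_object_name(name: str, changed: datetime) -> str:
    """Insert a UTC timestamp between the object's base name and its extension."""
    root, ext = posixpath.splitext(name)
    stamp = changed.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{root}.{stamp}{ext}"


changed_date = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(versioned_object_name("images/cat.jpg", changed_date))
# images/cat.20200102T030405Z.jpg
```

Each change then produces a new URL, so the CDN never serves a stale copy, at the cost of having to clean up old objects.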
According to the documentation, when a backend bucket is used as the origin for Cloud CDN, query strings in the request URL are not included in the cache key:
For backend buckets, the cache key consists of the URI without the protocol, host, or query string.
Using the query string to identify different versions of cached content may not be the best practice promoted by GCP, but for some legacy cases it is unavoidable.
So, one way to work around this is to make the backend bucket a static website (do NOT enable CDN there), then use a custom origin (Cloud CDN backed by an Internet network endpoint group backend service) that points to that static website.
For a backend service, the query string IS part of the cache key:
For backend services, Cloud CDN defaults to using the complete request URI as the cache key
That's it. Yes, it is tedious, but it works!
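The difference between the two quoted rules can be illustrated with a small sketch. The two functions below are hypothetical models of the documented default cache keys, not Cloud CDN code.

```python
from urllib.parse import urlsplit


def backend_bucket_key(url: str) -> str:
    """Backend bucket rule as quoted: URI without protocol, host, or query string."""
    return urlsplit(url).path


def backend_service_key(url: str) -> str:
    """Backend service default as quoted: the complete request URI."""
    return url


urls = [
    "https://example.com/images/cat.jpg",
    "https://example.com/images/cat.jpg?user=user1",
    "https://example.com/images/cat.jpg?user=user2",
]

print(len({backend_bucket_key(u) for u in urls}))   # 1
print(len({backend_service_key(u) for u in urls}))  # 3
```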

How to invalidate AWS APIGateway cache

We have a service which inserts certain values into DynamoDB. For the sake of this question, let's say it's a key:value pair, i.e. customer_id:customer_email. The inserts don't happen that frequently, and once an insert is done, that specific key doesn't get updated.
We have created a client library which, given a customer_id, fetches the customer_email from DynamoDB.
Given that the customer_id data is static, we were thinking of adding a cache in front of the table, but we are not sure what will happen in the following use case:
client_1 uses our library to fetch customer_email for customer_id = 2.
The customer doesn't exist so API Gateway returns not found
APIGateway will cache this response
For any subsequent calls, this cached response will be sent
Now another system inserts customer_id = 2 with its email. This system doesn't know whether this response has been cached previously. It doesn't even know that any other system has fetched this specific data. How can we invalidate the cache for this specific customer_id when it gets inserted into DynamoDB?
You can send a request to the API endpoint with a Cache-Control: max-age=0 header which will cause it to refresh.
This could open your application up to attack as a bad actor can simply flood an expensive endpoint with lots of traffic and buckle your servers/database. In order to safeguard against that it's best to use a signed request.
In case it's useful to people, here's .NET code to create the signed request:
https://gist.github.com/secretorange/905b4811300d7c96c71fa9c6d115ee24
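In Python, the refresh request described above could be built like this. It is a minimal, unsigned sketch: the URL is a placeholder, `build_refresh_request` is a hypothetical helper, and if the stage requires authorization for cache invalidation the request would additionally need to be signed, which is not shown here.

```python
from urllib.request import Request


def build_refresh_request(url: str) -> Request:
    """Build a GET request asking the cache to refresh the entry for `url`."""
    return Request(url, headers={"Cache-Control": "max-age=0"}, method="GET")


# Placeholder endpoint; pass the result to urllib.request.urlopen() to send it.
request = build_refresh_request("https://example.com/api/customers/2")
print(request.get_method(), request.full_url)
```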
We've built a Lambda which takes care of refilling the cache with updated results. It's quite a manual process, with very little reusable code, but it works.
The Lambda is triggered by the application itself, according to the application's needs. For example, in CRUD operations the Lambda is triggered upon successful execution of POST, PATCH and DELETE on a specific resource, in order to clear the general GET request (i.e. clear GET /books whenever POST /book succeeds).
Unfortunately, if you have a view with a server-side paginated table, you are going to face all sorts of issues, because invalidating /books is not enough: you may also have /books?page=2, /books?page=3 and so on... a nightmare!
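To make the pagination problem concrete, here is a small sketch that enumerates every URL that would need refreshing for one logical resource. It assumes the page count is known and that the parameter is called `page`, as in the example above; `paginated_urls` is a hypothetical helper.

```python
from typing import List
from urllib.parse import urlencode


def paginated_urls(base_url: str, page_count: int) -> List[str]:
    """List the unpaginated URL plus one URL per page from 2 to page_count."""
    urls = [base_url]
    urls += [f"{base_url}?{urlencode({'page': n})}" for n in range(2, page_count + 1)]
    return urls


print(paginated_urls("/books", 3))
# ['/books', '/books?page=2', '/books?page=3']
```

Every extra query parameter (sorting, filtering, page size) multiplies this list, which is why per-entry invalidation scales poorly here.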
I believe APIG should allow for more granular control of cache entries, otherwise many use cases aren't covered. It would be enough if they would allow to choose a root cache group for each request, so that we could manage cache entries by group rather than by single request (which, imho, is also less common).
Did you look at this: https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-caching.html ?
There is a way to invalidate the entire cache or a particular cache entry.
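Flushing the entire stage cache can also be scripted. A minimal sketch, assuming `client` is a boto3 API Gateway client; the API id and stage name in the usage comment are placeholders.

```python
def flush_cache(client, rest_api_id: str, stage_name: str) -> None:
    """Flush every cached entry for one stage.

    `client` is assumed to be a boto3 "apigateway" client.
    """
    client.flush_stage_cache(restApiId=rest_api_id, stageName=stage_name)


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   flush_cache(boto3.client("apigateway"), "abc123", "prod")
```

This drops all entries, so the next request for every key goes to the backend.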

API Gateway caching not calling Lambda function

I'm using Amazon API Gateway to execute a Lambda function when the API endpoint is called. In my Lambda function I'm updating a DynamoDB table.
Whenever I call the API with caching disabled using Chrome Developer Tools, the DynamoDB table is updated.
When I have caching enabled, the first request to my API updates the table; every subsequent request is much faster but doesn't update the table.
I'm assuming that CloudFront is caching the responses so as to not have to call the Lambda function each time.
Is there any way to force the Lambda function to be executed with each request?
A few possible solutions:
CloudFront should be used only when you want caching. In this case you don't need it, so call the API endpoint directly from the browser instead of calling the CloudFront endpoint. This will also save you the CloudFront cost.
Add a timestamp to each request.
If you have to use CloudFront, you can easily configure which requests should ALWAYS go to the API endpoints (which serve dynamic content) and which ones should be cached.
You are probably calling CloudFront with a GET request; just make it a POST, which is NEVER cached. Ideally, since you are updating a table, it should be a POST request anyway. This should be the simplest solution, with minimal and correct changes.
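The "add a timestamp" option above can be sketched like this. The parameter name `_ts` and the helper `add_cache_buster` are assumptions made for illustration, and the trick only helps if the cache in front of the API includes the query string in its cache key.

```python
import time
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url: str, now: Optional[int] = None) -> str:
    """Append a millisecond timestamp so that every request URL is unique."""
    stamp = int(time.time() * 1000) if now is None else now
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("_ts", str(stamp)))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_cache_buster("https://example.com/update?id=7", now=1234))
# https://example.com/update?id=7&_ts=1234
```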

Resize images on the fly in CloudFront and get them in the same URL instantly: AWS CloudFront -> S3 -> Lambda -> CloudFront

TLDR: We have to work around CloudFront's caching of 307 redirects by creating a new cache behavior for responses coming from our Lambda function.
You will not believe how close we are to achieving this. We are badly stuck on the last step.
Business case:
Our application stores images in S3 and serves them with CloudFront in order to avoid any geographic slow downs around the globe.
Now, we want to be really flexible with the design and be able to request new image dimensions directly in the CloudFront URL!
Each new image size will be created on demand and then stored in S3, so the second time it is requested it will be served really quickly, as it will exist in S3 and will also be cached in CloudFront.
Let's say the user has uploaded the image chucknorris.jpg.
Only the original image will be stored in S3 and will be served on our page like this:
//xxxxx.cloudfront.net/chucknorris.jpg
We have calculated that we now need to display a thumbnail of 200x200 pixels.
Therefore we set the image src in our template to:
//xxxxx.cloudfront.net/chucknorris-200x200.jpg
When this new size is requested, AWS has to create it on the fly in the same bucket and under the requested key.
This way the image will be directly loaded in the same URL of CloudFront.
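The naming convention above implies that the resizing function has to recover the original key and the target size from the requested key. A minimal sketch of that parsing step follows; the helper name `parse_resized_key` and the exact regular expression are assumptions, not the poster's code.

```python
import re
from typing import Optional, Tuple

# <stem>-<width>x<height>.<ext>, e.g. chucknorris-200x200.jpg
_RESIZED = re.compile(r"^(?P<stem>.+)-(?P<w>\d+)x(?P<h>\d+)(?P<ext>\.[^./]+)$")


def parse_resized_key(key: str) -> Optional[Tuple[str, int, int]]:
    """Return (original_key, width, height), or None if the key has no size suffix."""
    match = _RESIZED.match(key)
    if match is None:
        return None
    return (match["stem"] + match["ext"], int(match["w"]), int(match["h"]))


print(parse_resized_key("chucknorris-200x200.jpg"))  # ('chucknorris.jpg', 200, 200)
print(parse_resized_key("chucknorris.jpg"))          # None
```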
I made an ugly drawing with the architecture overview and the workflow on how we are doing this in AWS:
Here is how the Python Lambda ends:
    # Redirect the client to the URL of the newly created image.
    return {
        'statusCode': '301',
        'headers': {'location': redirect_url},
        'body': ''
    }
The problem:
If we make the Lambda function redirect to S3, it works like a charm.
If we redirect to CloudFront, it goes into a redirect loop because CloudFront caches 307 (as well as 301, 302 and 303).
As soon as our Lambda function redirects to CloudFront, CloudFront calls the API Gateway URL instead of fetching the image from S3:
I would like to create a new cache behavior in CloudFront's Behaviors settings tab.
This behavior should not cache responses from Lambda or S3 (I don't know what exactly is happening internally there), but should still cache any subsequent requests for this very same resized image.
I am trying to set the path pattern -\d+x\d+\..+$, add the ARN of the Lambda function under "Lambda Function Association", and set the Event Type to Origin Response.
Next to that, I am setting the "Default TTL" to 0.
But I cannot save the behavior due to some error:
Are we on the right track, or is the idea of this "Lambda Function Association" totally different?
Finally I was able to solve it. Although this is not really a structural solution, it does what we need.
First, thanks to the answer of Michael, I have used path patterns to match all media types. Second, the Cache Behavior page was a bit misleading to me: indeed the Lambda association is for Lambda@Edge, although I did not see this mentioned in any of the cache behavior tooltips; all you see is just Lambda. This feature cannot help us, as we do not want to extend our AWS service scope with Lambda@Edge just because of that particular problem.
Here is the solution approach:
I have defined multiple cache behaviors, one per media type that we support:
For each cache behavior I set the Default TTL to be 0.
And the most important part: In the Lambda function, I have added a Cache-Control header to the resized images when putting them in S3:
    s3_resource.Bucket(BUCKET).put_object(
        Key=new_key,
        Body=edited_image_obj,
        CacheControl='max-age=12312312',
        ContentType=content_type,
    )
To validate that everything works, I can now see that the new image dimension is served with the cache header in CloudFront:
You're on the right track... maybe... but there are at least two problems.
The "Lambda Function Association" that you're configuring here is called Lambda@Edge, and it's not yet available. The only users who can access it are users who have applied to be included in the limited preview. The "maximum allowed is 0" error means you are not a preview participant. I have not seen any announcements related to when this will be live for all accounts.
But even once it is available, it's not going to help you here in the way you seem to expect, because I don't believe an Origin Response trigger allows you to do anything to make CloudFront try a different destination and follow the redirect. If you see documentation that contradicts this assertion, please bring it to my attention.
However... Lambda@Edge will be useful for setting Cache-Control: no-cache on the 307 so CloudFront won't cache it, but the redirect itself will still need to go all the way back to the browser.
Note also that Lambda@Edge only supports Node, not Python... so maybe this isn't even part of your plan yet. I can't really tell from the question.
Read about the Lambda@Edge limited preview.
The second problem:
I am trying to set path pattern -\d+x\d+\..+$
You can't do that. Path patterns are string matches supporting * wildcards. They are not regular expressions. You might get away with /*-*x*.jpg, though, since multiple wildcards appear to be supported.
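The difference between a regular expression and a wildcard match can be approximated locally with Python's `fnmatch`. This is only an approximation of CloudFront's matching: `fnmatch` also gives `[...]` a special meaning, which the answer above does not mention for path patterns. `matches_behavior` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

PATTERN = "/*-*x*.jpg"  # the wildcard pattern suggested above


def matches_behavior(path: str, pattern: str = PATTERN) -> bool:
    """Case-sensitive wildcard match, where * stands for any run of characters."""
    return fnmatchcase(path, pattern)


print(matches_behavior("/chucknorris-200x200.jpg"))  # True
print(matches_behavior("/chucknorris.jpg"))          # False
print(matches_behavior("/my-taxi.jpg"))              # True
```

The last case shows that the wildcard pattern is much looser than the intended regular expression: it also matches names that merely contain a "-" followed later by an "x".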

How do I set the cache key for the AWS API Gateway?

I have a Lambda function that is mapped to a HTTP endpoint using the AWS API Gateway. This works fine, I have mapped query string params to the Lambda event, everything works:
https://api.buzzcloud.xyz/?count=999
Which I can call from http://buzzcloud.xyz
I would like to enable caching, but it seems that by default the API Gateway uses the URL for caching, and so changes in my query string parameters are not triggering a different cache result.
The result is that with caching on, my page returns whatever data was first requested and put in the cache.
How do I set a custom cache key or ensure querystring is part of the cache identifier?
It turns out there is a not-so-secret setting that I totally missed, which lets you choose exactly which query string params should be used for the cache key.
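The same setting can probably be applied from a script. The sketch below assumes `client` is a boto3 API Gateway client and that the patch path format `/cacheKeyParameters/method.request.querystring.<name>` is accepted by `update_integration`; please verify that format against the API Gateway documentation before relying on it. The ids in the usage comment are placeholders, and the helper names are hypothetical.

```python
def cache_key_patch(location: str, name: str) -> dict:
    """Build a patch operation marking one request parameter as a cache key.

    `location` is "querystring", "path" or "header". The path format is an
    assumption, not confirmed by the answer above.
    """
    return {"op": "add", "path": f"/cacheKeyParameters/method.request.{location}.{name}"}


def add_cache_key(client, rest_api_id: str, resource_id: str,
                  http_method: str, location: str, name: str):
    """Apply the patch to the method's integration via a boto3 apigateway client."""
    return client.update_integration(
        restApiId=rest_api_id,
        resourceId=resource_id,
        httpMethod=http_method,
        patchOperations=[cache_key_patch(location, name)],
    )


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   add_cache_key(boto3.client("apigateway"), "abc123", "res456", "GET", "querystring", "count")
```

The parameter generally also has to be declared on the method request, and the API has to be redeployed to the stage before the new cache key takes effect.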