How to invalidate AWS APIGateway cache - amazon-web-services

We have a service which inserts into dynamodb certain values. For sake of this question let's say its key:value pair i.e., customer_id:customer_email. The inserts don't happen that frequently and once the inserts are done, that specific key doesn't get updated.
What we have done is create a client library which, provided with customer_id will fetch customer_email from dynamodb.
Given that customer_id data is static, what we were thinking is to add cache to the table but one thing which we are not sure that what will happen in the following use-case
client_1 uses our library to fetch customer_email for customer_id = 2.
The customer doesn't exist so API Gateway returns not found
APIGateway will cache this response
For any subsequent calls, this cached response will be sent
Now another system inserts customer_id = 2 with its email id. This system doesn't know if this response has been cached previously or not. It doesn't even know that any other system has fetched this specific data. How can we invalidate cache for this specific customer_id when it gets inserted into dynamodb

You can send a request to the API endpoint with a Cache-Control: max-age=0 header which will cause it to refresh.
This could open your application up to attack as a bad actor can simply flood an expensive endpoint with lots of traffic and buckle your servers/database. In order to safeguard against that it's best to use a signed request.
In case it's useful to people, here's .NET code to create the signed request:
https://gist.github.com/secretorange/905b4811300d7c96c71fa9c6d115ee24

We've built a Lambda which takes care of re-filling cache with updated results. It's a quite manual process, with very little re-usable code, but it works.
Lambda is triggered by the application itself following application needs. For example, in CRUD operations the Lambda is triggered upon successful execution of POST, PATCH and DELETE on a specific resource, in order to clear the general GET request (i.e. clear GET /books whenever POST /book succeeded).
Unfortunately, if you have a View with a server-side paginated table you are going to face all sorts of issues because invalidating /books is not enough since you actually may have /books?page=2, /books?page=3 and so on....a nightmare!
I believe APIG should allow for more granular control of cache entries, otherwise many use cases aren't covered. It would be enough if they would allow to choose a root cache group for each request, so that we could manage cache entries by group rather than by single request (which, imho, is also less common).

Did you look at this https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-caching.html ?
There is way to invalidate entire cache or a particular cache entry

Related

How to handle background jobs in "Cloud Run" when new instance is created immediately?

I have a FastAPI project in Cloud Run and it has some background jobs inside it. (Not heavy stuff)
However, when a new instance is being created by Cloud Run due to number of requests etc. every instance runs the background job concurrently.
For example;
I have a task that creates some invoices for customers in the background and if three instances is created immediately, three invoices will be created.
I researched about "FOR UPDATE" usage in PostgreSQL etc. It seems like I can solve by modifying my database but I just wonder if it can be solved in Cloud's side.
I don't want to limit the max. number of instances to 1
What would you do in this situation?
Thank you for your time.
If you can potentially have N instances of a job (because you don't want to set the max limit to 1), you need to implement your jobs in an idempotent way. Broadly speaking, you have a few ways to achieve idempotency:
by enforcing a business constraint.
by storing an idempotency key.
by using the Etag HTTP response header.
For example, Stripe lets you define an idempotency key for all of your API requests. Stripe stores this key on its servers, and when you make a POST request with the same payload of a previous one, Stripe returns you the same result. POST requests are not idempotent, but using this "trick" they become idempotent.
Stripe's idempotency works by saving the resulting status code and body of the first request made for any given idempotency key, regardless of whether it succeeded or failed. Subsequent requests with the same key return the same result, including 500 errors.
https://stripe.com/docs/api/idempotent_requests
Tip: you could expand your question by clarifying how these background tasks are created, and where they run.

S3 consistency for sucessful read after write

Could not find a definitive answer here.
According to Amazon S3 docs, the caveat for read after write is if I got 404 for GET, then PUT a new object, then GET.
My question is, after I do GET a successful read,
does subsequent reads will be successful too?
Example:
GET key 404
PUT key 200
GET key 404 # because caveat
GET key 200
From now on, does any subsequent GET key is guaranteed to be successful?
The caveat AWS describes in the S3 documentation suggests that they use a caching layer on top of the database they use to store details of objects in S3 like it's key and meta data.
If you do a PUT for a object as first operation and a GET afterwards, there will be a cache miss for the GET operation so the caching layer will fetch information about this object from the database.
If you do a GET before the PUT the caching layer will query the database, will receive the information that this object doesn't exist and cache that information, even though after the PUT creates the mentioned object shortly after. So the GET after the PUT will receive the information that the object doesn't exist from the cache.
That's probably why this caveat exists. Unfortunately that doesn't answer your question, because we don't know how that caching layer works. If this layer uses shared state, then you should receive a 200 response for all requests, once you received one response with 200. My guess is that they don't use shared state for the caching layer, as that's easier to scale. Without shared state it depends on your luck, the time-to-live for items in the cache and if they employ some kind of cache invalidation for updated objects whether you receive a 200 or a 404 for requests even after the first successful 200 request.
Because the details of the inner workings of S3 are unknown I wouldn't rely on ubsequent calls to succeed, but my guess is that the probability of receiving a 404 after a successful 200 is rather low. In the end you have to decide based on your use case if and how it makes sense to account for this situation or not.
Updated answer
Snippets from the official AWS blog
S3 is Now Strongly Consistent
After that overly-long introduction, I
am ready to share some good news!
Effective immediately, all S3 GET, PUT, and LIST operations, as well
as operations that change object tags, ACLs, or metadata, are now
strongly consistent. What you write is what you will read, and the
results of a LIST will be an accurate reflection of what’s in the
bucket. This applies to all existing and new S3 objects, works in all
regions, and is available to you at no extra charge! There’s no impact
on performance, you can update an object hundreds of times per second
if you’d like, and there are no global dependencies.

Google Cloud CDN started ignoring query strings for storage buckets

Some months ago activated Cloud CDN for storage buckets. Our storage data is regularly changed via a backend. So to invalidate the cached version we added a query param with the changedDate to the url that is served to the client.
Back then this worked well.
Sometime in the last months (probably weeks) Google seemed to change that and is now ignoring the query string for caching from storage buckets.
First part: Does anyone know why this is changed and why noone was
notified about it?
Second part: How can you invalidate the Cache for a particular object
in a storage bucket without sending a cache-invalidation request
(which you shouldn't) everytime?
I don't like the idea of deleting the old file and uploading a new file with changed filename everytime something is uploaded...
EDIT:
for clarification: the official docu ( cloud.google.com/cdn/docs/caching ) already states that they now ignore query strings for storage buckets:
For backend buckets, the cache key consists of the URI without the query > string. Thus https://example.com/images/cat.jpg, https://example.com/images/cat.jpg?user=user1, and https://example.com/images/cat.jpg?user=user2 are equivalent.
We were affected by this also. After contacting Google Support, they have confirmed this is a permanent change. The recommended work around is to either use versioning in the object name, or use cache invalidation. The latter sounds a bit odd as the cache invalidation documentation states:
Invalidation is intended for use in exceptional circumstances, not as part of your normal workflow.
For backend buckets, the cache key consists of the URI without the query string, as the official documentation states.1 The bucket is not evaluating the query string but the CDN should still do that. I could reproduce this same scenario and currently is still possible to use a query string as cache buster.
Seems like the reason for the change is that the old behavior resulted in lost caching opportunities, higher costs and higher latency. The only recommended workaround for now is to create the new objects by incorporating the version into the object's name (which seems is not valid options for your case), or using cache invalidation.
Invalidating the cache for a particular object will require to use a particular query. Maybe a Cache-Control header allowing such objects to be cached for a certain time may be your workaround. Cloud CDN cache has an expiration time defined by the "Cache-Control: s-maxage", "Cache-Control: max-age", and/or Expires headers 2.
According to the doc, when using backend bucket as origin for Cloud CDN, query strings in the request URL are not included in the cache key:
For backend buckets, the cache key consists of the URI without the protocol, host, or query string.
Maybe using the query string to identify different versions of cached content is not the best practices promoted by GCP. But for some legacy issues, it has to be.
So, one way to workaround this is make backend bucket to be a static website (do NOT enable CDN here), then use custom origins (Cloud CDN backed by Internet network endpoint groups backend service) which points to that static website.
For backend service, query string IS part of cache key.
For backend services, Cloud CDN defaults to using the complete request URI as the cache key
That's it. Yes, It is tedious but works!

How to avoid sending 2 duplicate POST requests to a webservice

I send a POST request to create an object. That object is created successfully on the server, but I cannot receive the response (dropped somewhere), so I try to send the POST request again (and again). The result is there are many duplicated objects on the server side.
What is the official way to handle that issue? I think it is a very common problem, but I don't know its exact name, so cannot google it. Thanks.
In REST terminology, which is how interfaces where POST is used to create an object (and PUT to modify, DELETE to delete and GET to retrieve) are called, the POST operation is attributed un-'safe' and non-'idempotent, because the second operation of every other type of petition has no effect in the collection of objects.
I doubt there is an "official" way to deal with this, but there are probably some design patterns to deal with it. For example, these two alternatives may solve this problem in certain scenarios:
Objects have unicity constraints. For example, a record that stores a unique username cannot be duplicated, since the database will reject it.
Issue an one-time use token to each client before it makes the POST request, usually when the client loads the page with the input form. The first POST creates an object and marks the token as used. The second POST will see that the token is already used and you can answer with a "Yes, yes, ok, ok!" error or success message.
Useful link where you can read more about REST.
It is unreliable to fix these issues on the client only.
In my experience, RESTful services with lots of traffic are bound to receive duplicate incoming POST requests unintentionally - e.g. sometimes a user will click 'Signup' and two requests will be sent simultaneously; this should be expected and handled by your backend service.
When this happens, two identical users will be created even if you check for uniqueness on the User model. This is because unique checks on the model are handled in-memory using a full-table scan.
Solution: these cases should be handled in the backend using unique checks and SQL Server Unique Indices.

Updating a hit counter when an image is accessed in Django

I am working on doing some simple analytics on a Django webstite (v1.4.1). Seeing as this data will be gathered on pretty much every server request, I figured the right way to do this would be with a piece of custom middleware.
One important metric for the site is how often given images are accessed. Since each image is its own object, I thought about using django-hitcount, but figured that was unnecessary for what I was trying to do. If it proves easier, I may use it though.
The current conundrum I face is that I don't want to query the database and look for a given object for every HttpRequest that occurs. Instead, I would like to wait until a successful response (indicated by an HttpResponse.status of 200 or whatever), and then query the server and update a hit field for the corresponding image. The reason the only way to access the path of the image is in process_request, while the only way to access the status code is in process_response.
So, what do I do? Is it as simple as creating a class variable that can hold the path and then lookup the file once the response code of 200 is returned, or should I just use django-hitcount?
Thanks for your help
Set up a cron task to parse your Apache/Nginx/whatever access logs on a regular basis, perhaps with something like pylogsparser.
You could use memcache to store the counters and then periodically persist them to the database. There are risks that memcache will evict the value before it's been persisted but this could be acceptable to you.
This article provides more information and highlights a risk arising when using hosted memcache with keys distributed over multiple servers. http://bjk5.com/post/36567537399/dangers-of-using-memcache-counters-for-a-b-tests