varnish invalidate url REGEX from backend

varnish invalidate url REGEX from backend - regex

Say I have some highly-visited front-page, which displays number of some items by categories.
When some item is added / deleted I need to invalidate this front-page/url and some 2 others.
What is the best practice how to invalidate those urls from backend in Varnish (4.x)?
From what I captured, I can:
implement my HTTP PURGE handler in VCL configuration file, that "bans" urls matching received regex
from backend to Varnish, send 3x HTTP PURGE requests for those 3 urls.
But is this approach safe for this automatic usage? Basicly I need to invalidate some views everytime some related entity is inserted/updated/deleted.
Can it lead to ban list cumulation and increasing CPU consumption?
Is there any other approach? Thanks.

According this brilliant article http://www.smashingmagazine.com/2014/04/23/cache-invalidation-strategies-with-varnish-cache/ the solution are Tags.
X-depends-on: 3483 4376 32095 28372 #http-header created by backend
ban obj.http.x-depends-on ~ “\D4376\D” #ban rule emitted to discard dependant objects
What I missed is, that there is background process "ban-lurker", that iterates over cached objects, for which exists applicable and yet not tryed ban-rules and if all applicable objects were tested, ban rule is discarded. The ban rule only needs to be written such as it uses only data stored with cached object, not using e.g. req.url, since req object is not stored with object in cache and so lurker-process does not have it.
So now ban-way + tags looks pretty reliable to me.
Thanks Per Buer :)

Related

REST API - Update of single resource changes multiple others

I'm looking for a way how to deal with a following problem:
Imagine you modify a resource and that subsequently causes update of other resources.
E.g. you issue a PUT to, say /api/orders/1234, which by definition changes state of all other Orders of given user. There may be UI clients that display the table of Orders and they should know that not only single item in the table was updated, but eventually other as well.
Now, is there any standard way how inform a clients about such a situation?
So far I can only think of sending back the 205 Reset Content HTTP status code to inform the client that he should refresh the state, as not just a single thing was changed.

There are multiple solutions.
You can define specific resources as non-cacheable, so the client does not cache them at all. (no-store)
You can try giving a max-age of 0, so the client will have to re-validate those resources always. In this case you might have to implement ETags and conditional GETs, but it will be easier on the server than option 1.
Some push method like WebSockets.
If you really want to "notify" potentially multiple clients of a change, then it sounds like you would need option 3.
However, correctly configured caching is normally good enough. For example you could mark not-yet-executed orders as not cached (max-age=0), but as soon as it is executed, you might mark it to be cached indefinitely, since it can not change anymore.

Invalidate all files in a folder in cloudfront console

I know cloudfront provides a mechanism to invalidate a file, but what if I want to invalidate all files in a specific folder ? The documentation mentions that I can't use wildcards to do this.
Here's the instruction taken from the official documentation:
You must explicitly invalidate every object and every directory that you want CloudFront to stop serving. You cannot use wildcards to invalidate groups of objects, and you cannot invalidate all of the objects in a directory by specifying the directory path.

Back in 2013, in a previous version of this answer, I wrote:
You can't do this because "files" in cloudfront are not in "folders." Everything is an object and every object is independent.
At the time, that was entirely true. It's still true that everything is an object and every object is independent, but CloudFront has changed its invalidation logic. Keep reading.
At the time, this was also true, and again, to a certain extent, it still is:
The cloudfront documentation mentions "invalidating directories," but this refers to web sites that actually allow a directory listing [when] the listing is what you want to invalidate, so this won't help you either.
However, times have changed significantly.
Technically, each object is still independent, and CloudFront does not really store them in hierarchical folders, but the invalidation interface has been enhanced, to support a left-anchored wildcard match. You can invalidate the contents of a "folder" or any number of objects that you can match with a wildcard at the end of the string. Anything that matches will be evicted from the cache:
To invalidate objects, you can specify either the path for individual objects or a path that ends with the * wildcard, which might apply to one object or to many, as shown in the following examples:
/images/image1.jpg
/images/image*
/images/*
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html
Nice enhancement. But is there a catch?
Other than the fact that an invalidation requires -- as always -- 10 to 15 minutes to complete under normal operations, the answer is no, there's not really a catch. The first 1,000 invalidation paths (formerly "requests," and a "request" was for a single object) you submit within a month are free; after that, there is a charge, but:
The price is the same whether you're invalidating individual objects or using the * wildcard to invalidate multiple objects.
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html#PayingForInvalidation
Note that if you don't include the * at the end, then an invalidation for /images/ (for example) will only tell CloudFront to invalidate whatever single object your origin server returns for requests for /images/.
The leading slash is documented as optional.

As of 2015-05-25, you can invalidate using a wildcard. Ex: /* or /images/*
It is also far less costly to do it this way, as something like /images/* counts as one object for invalidation, rather than being charged for the thousands of images in the /images directory.
http://aws.amazon.com/about-aws/whats-new/2015/05/amazon-cloudfront-makes-it-easier-to-invalidate-multiple-objects/

As long as you want to invalidate a reasonable amount of objects, one of the easier ways I've found is to select the objects in Cyberduck, right click > select Info and click on Distribution tab and you can invalidate from there. Cyberduck will submit one invalidation request to your Cloudfront with the list of selected files.
Cyberduck is open source too.
ps: not affiliated with the product in any way. Just listing an alternative.

RESTful search. Return actual resources or URIs?

Pretty new to all this REST stuff.
I'm designing my API, and am not sure what I'm supposed to return from a search query. I was assuming I would just return all objects that match the query in their entirety, but after reading up a bit about HATEOAS I am thinking I should be returning a list of URI's instead?
I can see that this could help with caching of items, but I'm worried that there will be a lot of overhead generated by the subsequent multiple HTTP requests required to get the actual object info.
Am I misunderstanding? Is it acceptable to return object instances instead or URIs?

I would return a list of resources with links to more details on those resources.
From RESTFull Web Services Cookbook 2010 - Subbu Allamaraju
Design the response of a query as a representation of a collection
resource. Set the appropriate expiration caching headers. If the query
does not match any resources, return an empty collection.

IMHO it is important to always remember that "pure REST" and "real world REST" are two quite different beasts.
How are you returning the list of URIs from your query in the first place? If you return e.g. application/json, this certainly does not tell the client how it is supposed to interpret the content; therefore, the interaction is already being driven by out-of-band information (the client magically already knows where to look for the data it needs) in conflict with HATEOAS.
So, to answer your question: I find it quite acceptable to return object instances instead of URIs -- but be careful because in the general case this means you are generating all this data without knowing if the client is even going to use it. That's why you will see a hybrid approach quite often: the object instances are not full objects (i.e. a portion of the information the server has is not returned), but they do contain a unique identifier that allows the client to fetch the full representation of selected objects if it chooses to do so.

REST API Design : Is it ok to change the resource identifier during a PUT call?

I'm curious to learn more about RESTful design patterns around the PUT call. Specifically, am I violating norms by changing the resource ID as part of a PUT call?
Consider the following...
POST /api/event/ { ... } - returns the resource ID (eventid) of the new event in the body
GET /api/event/eventid
PUT /api/event/eventid - returns the (possibly new) resource ID depending on request body
GET /api/event/eventid - fails if the original eventid was used in the URI
The endpoints for GET and PUT can quickly access the resource if the eventid represents internal resources (like a database record). If the PUT results in the server moving the underlying resource, the ID can change.
Am I violating norms when I do this?

REST is not a strict specification, but more a set of guidelines and best practices that can be followed to build web-services that are easy to understand and work with. So there's nothing that prevents you from changing a resource IDs during a PUT.
That being said, doing so is IMO a bad practice. One of the ideas behind REST is that each resource can be referenced using a URI. In your case this URI is the concatenation of the path and (I assume) an internal ID. This URI could be used by other "systems" and stored as references. If you change the ID of a resource on a PUT, you change the URI and all references to that resource will be broken (404).
If you feel the need to change the ID that is part of the URI, you may not have picked the right property for it. Consider something else that would be immutable (e.g.: tag your resource with a UUID and use it rather than an internal DB ID).

Not addressing your question full on, but this makes me worry:
returns the resource ID (eventid) of the new event in the body
You aren't returning an integer id, and then letting the client construct urls from this, are you? A proper REST application should give url's to resources, not ids.
As for your question - PUT means something like "Create a new resource at this location". You could conceivably reply with a redirect and a Location header, but it's a bit of a strange thing to do. Besides, the semantics of PUT dictates that you send the entire entity with the request, which is probably not what you want in this scenario. Maybe it would be more fitting to use POST in this situation? (E.g. POST on /api/event/1234

I think it's ok; PUT is still idempotent (repeated calls will not lead to other modifications).
Just: I would ensure that the old ID is not reused, and have the api return 301 codes for calls to old ID (in case other clients had links to the resource).
Maybe the initial PUT that modifies the ID should return a 303 code that point to the new resource location, I'm not sure here.

How to solve two REST problems: the interface document; loss of privacy in descriptive URLs

Coming from a lot of frustrating times with WSDL/Soap, I very much like the REST paradigm, but am trying to solve two basic problems in our application, before moving over to REST. The first problem relates to the lack of an interface document. I think I finally see how to handle this situation: One can query his way down from a top-level "/resources" resource using various requests of GET, HEAD, and OPTIONS to find the one needed resource in the correct hypermedia format. Is this the idea? If so, the client need only be provided with a top-level resource URI: http://www.mywebservicesite.com/mywebservice/resources. He will then have to do some searching and possible keep track of what he is discovering, so that he can use the URIs again efficiently in future to do GETs, POSTs, PUTs, and DELETEs. Are there any thoughts on what should happen here?
The other problem is that we cannot use descriptive URLs like /resources/../customer/Madonna/phonenumber. We do have an implementation of opaque URLs we use in the context of a session, and I'm wondering how opaque URLs might be applied to REST. The general problem is how to keep domain-specific details out of URLs, and still benefit from what REST has to offer.

The other problem is that we cannot use descriptive URLs like /resources/../customer/Madonna/phonenumber.
I think you've misunderstood the point of opaque URIs. The notion of opaque URIs is with respect to clients: A client shall not decipher a URI to guess anything of semantic meaning from it. So a service may well have URIs like /resources/.../customer/Madonna/phonenumber, and that's quite a good idea. The URIs should be treated as opaque by clients: not infer from the URI that it represents Madonna's phone number, and that Madonna is a customer of some sort. That knowledge can only be obtained by looking inside the URI itself, or perhaps by remembering where the URI was discovered.
Edit:
A consequence of this is that navigation should happen by links, not by deconstructing the URI. So if you see /resouces/customer/Madonna/phonenumber (and it actually represents Customer Madonna's phone number) you should have links in that resource to point to the Madonna resource: e.g.
{
"phone_number" : "01-234-56",
"customer_URI": "/resources/customer/Madonna"
}
That's the only way to navigate from a phone number resource to a customer resource. An important aspect is that the server implementation might or might not have domain specific information in the URI, The Madonna record might just as well live somewhere else: /resources/customers/byid/81496237. This is why clients should treat URIs as opaque.
Edit 2:
Another question you have (in the comments) is then how a client, with the required no knowledge of the server's URIs is supposed to be able to find anything. Clients have the following possibilities to find resources:
Provide a search interface. This could be done by providing an OpenSearch description document, which tells clients how to search for items. An OpenSearch template can include several variables, and several endpoints, depending on what you're looking for. So if you have a "customer ID" that's unique, you could have the following template: /customers/byid/{proprietary:customerid}", the customerid element needs to be documented somewhere, inside the proprietary namespace. A client can then know how to use such a template.
Provide a custom form. This implies making a custom media type in which you explicitly define how (based on an instance of the document) a URI to a customer can be forged. <customers template="/customers/byid/{id}"/>. The documentation (for the media type) would have to state that the template attribute must be interpreted as a relative URI after the string substitution "{id}" to an actual customer ID.
Provide links to all resources. Some resources aren't innumerable, so you can simply make a link to each and every one of them, optionally including identifying information along with the links. This could also be done in a custom media type: <customer id="12345" href="/customer/byid/12345"/>.
It should be noted that #1 and #2 are two ways of saying the same thing: Clients are allowed to create URIs if they
haven't got the URI structure a priori
a media type exists for which the documentation states that URIs should be created
This is much the same way as a web browser has no idea of any URI structure on the web, except for the rules laid out in the definition of HTML forms, to add a ? and then all the query parameters separated by &.
In theory, if you have a customer with id 12345, then you could actually dispense with the href, since you could plug the customer id 12345 into #1 or #2. It's more common to actually provide real links between resources, rather than always relying on lookup or search techniques.

I haven't really used web RPC systems (WSDL/Soap), but i think the 'interface document' is there mostly to allow client libraries to create the service API, right? if so, REST shouldn't need it, because the verbs are already defined and don't really need to be documented again.
AFAIUI, the REST way is to document the structure of each resource (usually encoded in XML or JSON). In that document, you'll also have to document the relationship between those resources. In my case, a resource is often a container of other resources (sometimes more than one type), therefore the structure doc specifies what field holds a list of URLs pointing to the contained resources. Ideally, only one unique resource will need a single, fixed (documented) URL. everithing else follows from there.
The URL 'style' is meaningless to the client, since it shouldn't 'construct' an URL. Every URL it needs should be already constructed on a resource field. That let's you change the URL structure without changing the client (that has saved tons of time to me). Your URLs can be as opaque or as descriptive as you like. (personally, i don't like text keys or slugs; my keys are all BIGINTs or UUIDs)

I am currently building a REST "agent" that addresses the first part of your question. The agent offers a temporary bookmarking service. The client code that is interacting with the agent can request that an URL be bookmarked using some identifier. If the client code needs to retrieve that representation again, it simply asks the agent for the url that corresponds to the saved bookmark and then navigates to that bookmark. Currently those bookmarks are not persisted so they only last for the lifetime of the client application, but I have found it a useful mechanism for accessing commonly used resources. E.g. The root representation provides a login link. I bookmark that link and if the client ever receives a 401 then I can redirect to the "login" bookmark.
To address an issue you mentioned in a comment, the agent also has the ability to store retrieved representations in a dictionary. If it becomes necessary to aggregate and manipulate multiple representations at the same time then I can simply request that the agent store the current representation in a dictionary associated to a key and then continue navigating to the next resource. Once the client has accumulated all the necessary representation it can do what it needs to do.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js