Optional trailing slash in the URL config of a website / web service

If I define a URL pattern like "^optional/slash/?$" - so that the page it is bound to will be available under both URL versions, with and without the trailing slash - will I violate any conventions or standards by doing that?

Wouldn't a redirection be more appropriate?
If I remember correctly, trailing slashes should be used with resources that list other resources. Like a directory that lists files, a list of articles, or a category query (e.g. http://www.example.com/category/cakes/). Without a trailing slash, the URI should point to a single resource. Like a file, an article, or a complex query with parameters (e.g. http://www.example.com/search?ingredients=strawberry&taste=good).
Just use the HTTP code 302 FOUND to redirect typos to their correct URIs.
EDIT: Thanks to AndreD for pointing it out: an HTTP 301 MOVED PERMANENTLY code is more appropriate for permanently aliasing typos. Search engines and other clients should stop querying the misspelled URL after getting a 301 once, and Google recommends using it for changing the URL of a page in their index.
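As a concrete sketch of that 301 approach, assuming a Django-style URL config (which the regex syntax in the question suggests) - the category/cakes route and cakes_view are illustrative names, not from the question:

from django.http import HttpResponse
from django.urls import re_path
from django.views.generic import RedirectView

def cakes_view(request):
    # placeholder view serving the canonical trailing-slash URL
    return HttpResponse("cakes")

urlpatterns = [
    re_path(r"^category/cakes/$", cakes_view),
    # the slashless variant is permanently (301) redirected to the canonical one
    re_path(r"^category/cakes$",
            RedirectView.as_view(url="/category/cakes/", permanent=True)),
]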

According to RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
Section 6.2.4. Protocol-Based Normalization -
"Substantial effort to reduce the incidence of false negatives is
often cost-effective for web spiders. Therefore, they implement even
more aggressive techniques in URI comparison. For example, if they
observe that a URI such as
http://example.com/data
redirects to a URI differing only in the trailing slash
http://example.com/data/
they will likely regard the two as equivalent in the future. This
kind of technique is only appropriate when equivalence is clearly
indicated by both the result of accessing the resources and the
common conventions of their scheme's dereference algorithm (in this
case, use of redirection by HTTP origin servers to avoid problems
with relative references)."
My interpretation of this statement is that making the two URIs functionally equivalent (e.g. by means of an .htaccess rule, a redirect, or similar) does not violate any standard conventions. According to the RFC, web spiders are prepared to treat them as functionally equivalent if they point to the same resource.

No, you are not violating any standards by making the trailing slash optional in your URLs,
but you should stay on the safe side, because servers handle the issue in different ways:
Sometimes it doesn't matter for SEO: many web servers simply redirect with a 301 status code to the default version;
Some web servers may return a 404 page for the non-trailing-slash address = wasted link juice and effort;
Some web servers may return a 302 redirect to the correct version = wasted link juice and effort;
Some web servers may return a 200 response for both versions = wasted link juice and effort, as well as potential duplicate-content problems.
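If you are unsure which of these cases a given server falls into, a quick probe shows the raw status codes (a Python sketch using the requests library; the example.com URLs stand in for the real site):

import requests

for url in ("http://www.example.com/category/cakes",
            "http://www.example.com/category/cakes/"):
    r = requests.get(url, allow_redirects=False)  # keep the raw redirect visible
    print(url, r.status_code, r.headers.get("Location"))

A 301 pointing one variant at the single canonical version is the outcome you want to see.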

Varnish: invalidate URLs matching a regex from the backend

Say I have a highly visited front page that displays counts of items by category.
When an item is added or deleted, I need to invalidate this front-page URL and two others.
What is the best practice for invalidating those URLs from the backend in Varnish (4.x)?
From what I have gathered, I can:
implement an HTTP PURGE handler in the VCL configuration file that "bans" URLs matching a received regex;
from the backend, send three HTTP PURGE requests to Varnish, one for each of those URLs.
But is this approach safe for automatic usage? Basically, I need to invalidate some views every time a related entity is inserted, updated, or deleted.
Can it lead to ban-list accumulation and increasing CPU consumption?
Is there any other approach? Thanks.
According to this brilliant article, http://www.smashingmagazine.com/2014/04/23/cache-invalidation-strategies-with-varnish-cache/, the solution is tags:
X-depends-on: 3483 4376 32095 28372   # HTTP header created by the backend
ban obj.http.x-depends-on ~ "\D4376\D"   # ban rule issued to discard dependent objects
What I had missed is that there is a background process, the "ban lurker", which iterates over cached objects against any applicable, not-yet-tried ban rules; once a ban rule has been tested against all applicable objects, it is discarded. A ban rule only needs to be written so that it uses data stored with the cached object - it must not use e.g. req.url, because the req object is not stored with the object in the cache, so the lurker process does not have access to it.
So now the ban approach plus tags looks pretty reliable to me.
Thanks Per Buer :)
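On the backend side, emitting such a ban could look like the following sketch (Python; the host, port, BAN method, and X-Ban-Depends-On header are illustrative choices that your VCL has to handle - none of them are built-in Varnish behavior):

import requests

def invalidate_item(item_id):
    # \D on both sides guards against partial matches (e.g. 4376 vs 43761)
    pattern = r"\D{}\D".format(item_id)
    requests.request("BAN", "http://varnish.internal:6081/",
                     headers={"X-Ban-Depends-On": pattern})

invalidate_item(4376)  # discards every cached object whose x-depends-on lists 4376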

Using HTTP URIs as identifiers of resources in RESTful web API

Usually to retrieve a resource one uses:
GET http://ws.mydomain.com/resource/123212
But what if your item IDs are HTTP URIs?:
GET http://ws.mydomain.com/resource/http://id.someotherdomain.com/SGX.3211
Browsers replace two slashes with one, and the request turns into:
GET http://ws.mydomain.com/resource/http:/id.someotherdomain.com/SGX.3211
which will not work.
URI-encoding the "http://id.someotherdomain.com/SGX.3211" part results in HTTP 400 Bad Request.
Is there a best practice for handling this?
Edit:
Then of course, if we needed to have (I don't at the moment) requests of the form:
resources/ID/collections/ID
and all IDs are HTTP URIs, things get out of hand... Possibly one could do something like this and parse the contents inside the curly braces:
resources/{http://id...}/collections/{http://id...}
Encode the other system's URI, and then pass the value as a query parameter:
GET http://ws.mydomain.com/resource?ref=http%3A%2F%2Fid.someotherdomain.com%2FSGX.3211
Ugly looking, but no one said that URIs used in a REST architecture have to be beautiful. :)
By the way, a GET actually looks like this when it's sent:
GET /resource?ref=http%3A%2F%2Fid.someotherdomain.com%2FSGX.3211 HTTP/1.1
Host: ws.mydomain.com
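In Python, for instance, the standard library does the encoding for you (a sketch; the URLs are the hypothetical ones from the question):

from urllib.parse import urlencode
import requests

ref = "http://id.someotherdomain.com/SGX.3211"
# urlencode percent-encodes ':' and '/' in the value, producing
# ref=http%3A%2F%2Fid.someotherdomain.com%2FSGX.3211
url = "http://ws.mydomain.com/resource?" + urlencode({"ref": ref})
response = requests.get(url)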
UPDATE: apparently you no longer have to encode "/" and "?" within a query component. From RFC 3986:
The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators. However, as query components are
often used to carry identifying information in the form of "key=value"
pairs and one frequently used value is a reference to another URI, it
is sometimes better for usability to avoid percent-encoding those
characters.
So you could legally do this:
GET /resource?ref=http://id.someotherdomain.com/SGX.3211 HTTP/1.1
Host: ws.mydomain.com

Browser behavior for multiple cookies with same name/path

I'm interested in the behavior of various browsers when there are multiple cookies with the same name and path which are valid for the current domain. E.g. the browser has stored these two cookies:
key=value; path=/; domain=foo.bar.baz
key=value; path=/; domain=bar.baz
What will be the content of the Cookie header when the user visits foo.bar.baz?
RFC 2965 has this to say about the issue:
If multiple cookies satisfy the criteria above, they are ordered in
the Cookie header such that those with more specific Path attributes
precede those with less specific. Ordering with respect to other
attributes (e.g., Domain) is unspecified.
(which is IMO a very weird design choice, but that is what we have). I suppose server-side frameworks use the first value, because that is at least sometimes the more specific one (I checked PHP, and it indeed does so).
What I would like to know is the behavior of the major browsers: which cookie would they send first? (In other words, how much can I rely on my application getting the "correct", more specific value?)
As per comments above:
The easiest defense against this obviously "undefined behaviour (standards-wise)" from my POV is not to use PHPSESSID on the main domain bar.baz but on www.bar.baz instead - the subdomains will work fine, since according to the standard there is no "fallback" in that case, so the cookie stays on its own subdomain.
One possible problem needs to be checked:
PHP scripts running on a subdomain can be explicitly configured to set their cookie on the main domain... IF that is the case (the code looks similar to ini_set('session.cookie_domain', 'bar.baz');), then you need to change this config back to the standard behaviour (by removing the code shown), which means that a script on a subdomain only sets cookies on its own subdomain.
EDIT - as per comments:
IF you don't have any control over some other subdomain, then the "ultimate defense" is to rename your PHPSESSID cookie to something really unique (like a GUID with PHPSESSID as a prefix), either by calling session_name() BEFORE session_start(), OR by setting it in the config - this way you circumvent the whole problem regardless of subdomains, browser versions, etc.

Detail question on REST URLs

This is one of those little detail (and possibly religious) questions. Let's assume we're constructing a REST architecture, and for definiteness lets assume the service needs three parameters, x, y, and z. Reading the various works about REST, it would seem that this should be expressed as a URI like
http://myservice.example.com/service/x/y/z
Having written a lot of CGIs in the past, it seems about as natural to express this
http://myservice.example.com/service?x=val&y=val&z=val
Is there any particular reason to prefer the all-slashes form?
The reason is small but here it is.
Cool URIs Don't Change.
The http://myservice.example.com/resource/x/y/z/ form makes a claim in front of God and everybody that this is the path to a specific resource.
Note that I changed the name. There may be a service involved, but the REST principle is that you're describing a specific web resource, named /x/y/z/.
The http://myservice.example.com/service?x=val&y=val&z=val form doesn't make as strong a claim. It says there's a piece of code named service that will try to do some sort of query. No guarantees.
Query parameters are rarely "cool". Take a look at the Google Chart API. Should that use a /full/path/notation for all of the fields? Would each URL be cool if it did?
Query parameters are useful. Optional fields can be omitted. New keys can be added to support new functionality. Over time, old fields can be deprecated and removed. Doing this is clumsier with a /path/notation.
Quoting from http://www.xml.com/pub/a/2004/08/11/rest.html
URI Opacity [BP]
The creator of a URI decides the encoding
of the URI, and users should not derive
metadata from the URI itself. URI opacity
only applies to the path of a URI. The
query string and fragment have special
meaning that can be understood by users.
There must be a shared vocabulary between
a service and its consumers.
This sounds like query strings are what you want.
One downside to query strings is that they are unordered: a GET ending with "?x=1&y=2" is a different URL from one ending with "?y=2&x=1". This means the browser and any other intermediate systems will cache the two separately, because caching is keyed on the full URL. If this is a concern, generate the query string in a well-defined order, as sketched below.
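A minimal Python sketch of that well-defined ordering (parameter names taken from the question):

from urllib.parse import urlencode

params = {"y": 2, "x": 1}
query = urlencode(sorted(params.items()))  # always 'x=1&y=2', whatever the insertion order
url = "http://myservice.example.com/service?" + query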
While constructing URIs, this is the principle I follow. I don't know whether it is perfectly acceptable in all cases.
Say, for instance, that I have to get the details of an employee; then the URI will be of the form:
GET /employees/1/ and not GET /employees?id=1, since I treat every employee as a resource, and the whole URI "employees/{id}" is used to identify the resource.
On the other hand, if I have algorithmic operations that do not identify a specific resource as such, but merely require inputs to the algorithm which in turn identify the resource, then I use query strings.
For instance GET /employees?empname='%Bob%'&maxResults=100 might give me all employees whose names have the word Bob in them, with the maximum results returned by the query limited to 100.
Hope this answers your question.
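To illustrate that split, here is a minimal sketch (Flask assumed; the routes and field names mirror the examples above):

from flask import Flask, request

app = Flask(__name__)

@app.route("/employees/<int:emp_id>/")
def get_employee(emp_id):
    # the path itself identifies a specific resource
    return {"id": emp_id}

@app.route("/employees")
def search_employees():
    # the query string merely feeds an algorithmic search
    name = request.args.get("empname", "")
    max_results = int(request.args.get("maxResults", 100))
    return {"query": name, "limit": max_results}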
URIs are strictly split into a hierarchical part (the path) and a non-hierarchical part (the query), and both serve to identify the resource.
The URI spec itself (RFC 3986) clearly treats the path and the query portions of a URI as equal.
Section 3.3:
The path component contains data [...] that along with [the] query component
serves to identify a resource.
Section 3.4:
The query component contains [...] data that, along with
[...] the path component serves to identify a resource
So your choice between x/y/z and x=val&y=val&z=val mainly comes down to whether x, y, and z are hierarchical in nature, whether you can expect them to remain that way for the foreseeable future, and any technical limitations that might push you toward one option over the other.
But to answer your question, as others have noted: Neither is more RESTful than the other, since they both end up identifying a resource.
If the resource is the service, independent of parameters, it should be
http://myservice.example.com/service?x=val&y=val&z=val
This is a GET query. One of the principles behind REST is that you GET to read (but not modify!) the resource; you can POST to modify a resource & get a response; you can PUT to write to a resource; and you can DELETE to remove a resource.
If the resource specific with those parameters is a persistent resource, it needs a name. You could (if you organized your webservice this way) POST to http://myservice.example.com/service?x=val&y=val&z=val to create a particular instance of the service and have it return an ID to name this instance, e.g.
http://myservice.example.com/service/12312549
then use GET/POST/PUT/DELETE to interact with that instance.
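A hedged sketch of that create-then-address pattern (Python; it assumes the service returns the new instance's URL in a Location header, which is a common convention rather than something specified above):

import requests

resp = requests.post("http://myservice.example.com/service",
                     data={"x": "val", "y": "val", "z": "val"})
instance_url = resp.headers["Location"]  # e.g. http://myservice.example.com/service/12312549
result = requests.get(instance_url)      # read the named instance
requests.delete(instance_url)            # remove it when no longer needed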
First of all, defining URIs as part of your API violates a constraint of the REST architecture. You cannot do that and call your API RESTful.
Secondly, the reason query parameters are bad for non-query resource access is that they are generally not cached. It is also a violation of HTTP standards.
A URL with slashes like /x/y/z/ would impose a hierarchy and is not suited for the exact case of just passing three parameters.
If, like you said, x y z are indeed just parameters and the order is not important, it would be more RESTful to use semicolons:
http://myservice.example.com/service/x;y;z/
If your "service" however is just an algorithm that works the same with different parameters, there would also be nothing unRESTful with using ?x=val format.

Best way to decide on XML or HTML response?

I have a resource at a URL that both humans and machines should be able to read:
http://example.com/foo-collection/foo001
What is the best way to distinguish between human browsers and machines, and return either HTML or a domain-specific XML response?
(1) The Accept header field in the request?
(2) An additional bit of URL? eg:
http://example.com/foo-collection/foo001 -> returns HTML
http://example.com/foo-collection/foo001?xml -> returns, er, XML
I do not wish to oblige machines reading the resource to parse HTML (or XHTML, for that matter). Some machines, like the googlebot, should still receive the HTML response.
It is reasonable to assume I control the machine readers.
If this is under your control, rather than adding a query parameter why not add a file extension:
http://example.com/foo-collection/foo001.html - return HTML
http://example.com/foo-collection/foo001.xml - return XML
Apart from anything else, that means if someone fetches it with wget or saves it from their browser, it'll have an appropriate filename without any fuss.
My preference is to make it a first-class part of the URI. This is debatable, since there are -- in a sense -- multiple URIs for the same resource. And is "format" really part of the URI?
http://example.com/foo-collection/html/foo001
http://example.com/foo-collection/xml/foo001
These are very easy to deal with in a web framework that has URI parsing to direct the request to the proper application.
If this is indeed the same resource with two different representations, HTTP invites you to use the Accept header as you suggest. This is probably a very reliable way to distinguish between the two scenarios. You can be plenty sure that user agents (including search engine spiders) send the Accept header properly.
As for the machine agents you are going to give XML to: are they under your control? In that case you can be doubly sure that Accept will work. If they do not set this header properly, you can serve XML as the default. User agents DO set the header properly.
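A minimal sketch of that negotiation (Flask assumed; the route and payloads are placeholders):

from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/foo-collection/<foo_id>")
def get_foo(foo_id):
    # pick the best representation the client's Accept header allows,
    # falling back to HTML when nothing matches
    best = request.accept_mimetypes.best_match(["text/html", "application/xml"])
    if best == "application/xml":
        return Response("<foo id='%s'/>" % foo_id, mimetype="application/xml")
    return Response("<html><body>foo %s</body></html>" % foo_id, mimetype="text/html")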
I would try to use the Accept header for this, because that is exactly what the Accept header is there for.
The problem with having two different URLs is that it is not automatically apparent that they represent the same underlying resource. This can be bad if a user finds a URL in one program, which renders HTML, and pastes it into another, which needs XML. At that point a smart user could probably change the URL appropriately, but this is just a source of error that you don't need.
I would say adding a query-string parameter is your best bet. The only way to automatically detect whether your client is a browser (human) or an application would be to read the User-Agent string from the HTTP request, but since this is easily set by any application to mimic a browser, you're not guaranteed that it will work.