Using HTTP URIs as identifiers of resources in a RESTful web API


Usually to retrieve a resource one uses:
GET http://ws.mydomain.com/resource/123212
But what if your item IDs are themselves HTTP URIs?
GET http://ws.mydomain.com/resource/http://id.someotherdomain.com/SGX.3211
Browsers replace two slashes with one, and the request turns into:
GET http://ws.mydomain.com/resource/http:/id.someotherdomain.com/SGX.3211
which will not work.
URI-encoding the "http://id.someotherdomain.com/SGX.3211" part results in HTTP 400 (Bad Request).
Is there a best practice for handling this?
Edit:
Then, of course, if we needed (I don't at the moment) requests of the form:
resources/ID/collections/ID
and all IDs are HTTP URIs, things get out of hand... Possibly one could do something like this and parse the contents inside the curly braces:
resources/{http://id...}/collections/{http://id...}

Encode the other system's URI, and then pass the value as a query parameter:
GET http://ws.mydomain.com/resource?ref=http%3A%2F%2Fid.someotherdomain.com%2FSGX.3211
Ugly looking, but no one said that URIs used in a REST architecture have to be beautiful. :)
By the way, a GET actually looks like this when it's sent:
GET /resource?ref=http%3A%2F%2Fid.someotherdomain.com%2FSGX.3211 HTTP/1.1
Host: ws.mydomain.com
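A minimal sketch of the encoding step in Python, using only the standard library. The function names build_lookup_url and extract_ref are made up for illustration; the parameter name ref and the URLs come from the example above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_lookup_url(base: str, resource_id: str) -> str:
    """Client side: append the foreign URI as a percent-encoded 'ref' parameter."""
    # urlencode percent-encodes ':' and '/', so the id cannot be mistaken for path segments.
    return f"{base}?{urlencode({'ref': resource_id})}"


def extract_ref(url: str) -> str:
    """Server side: recover the original id from the query string."""
    return parse_qs(urlsplit(url).query)["ref"][0]


url = build_lookup_url("http://ws.mydomain.com/resource",
                       "http://id.someotherdomain.com/SGX.3211")
print(url)               # the encoded request URL
print(extract_ref(url))  # the original id, decoded again
```

Most web frameworks do the decoding for you when you read a query parameter, so extract_ref is only there to show the round trip.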
UPDATE: apparently you no longer have to encode "/" and "?" within a query component. From RFC 3986:
The characters slash ("/") and question mark ("?") may represent data
within the query component. Beware that some older, erroneous
implementations may not handle such data correctly when it is used as
the base URI for relative references (Section 5.1), apparently
because they fail to distinguish query data from path data when
looking for hierarchical separators. However, as query components are
often used to carry identifying information in the form of "key=value"
pairs and one frequently used value is a reference to another URI, it
is sometimes better for usability to avoid percent-encoding those
characters.
So you could legally do this:
GET /resource?ref=id.someotherdomain.com/SGX.3211 HTTP/1.1
Host: ws.mydomain.com
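A sketch of the relaxed form in Python, assuming you only want to keep "/", "?" and ":" readable (the RFC 3986 grammar also permits ":" inside a query). Characters that delimit key=value pairs, such as "&" and "=", still have to be encoded. The helper name relaxed_query_value is hypothetical.

```python
from urllib.parse import quote, urlsplit, parse_qs


def relaxed_query_value(value: str) -> str:
    """Percent-encode a query value but leave '/', '?' and ':' as they are."""
    # '&', '=', '#', '+' and spaces are not in `safe`, so they are still encoded.
    return quote(value, safe="/?:")


target = "/resource?ref=" + relaxed_query_value("http://id.someotherdomain.com/SGX.3211")
print(target)

# A standard query parser still recovers the full value.
print(parse_qs(urlsplit(target).query)["ref"][0])
```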

Related

REST service, resource with two different outputs - how would you do it?

I'm currently working on a more or less RESTful web service, a type of content API for my company's articles. We currently have a resource for getting all the content of a specific article:
http://api.com/content/articles/{id}
which returns the full set of article data for the given article ID.
Currently we control a lot of the article's business logic, because we only serve a native app from the web service. This means we convert tags, links, images and so on in the body text of the article into a protocol the native app can understand. The same goes for a lot of other attributes and data on the article: we transform and modify its original (web) state into a state that the native app will understand.
E.g., img tags will be converted from a normal <img src="http://source.com"/> into an <img src="inline-image//{imageId}"/> tag; the same goes for anchor tags, etc.
Now I have to implement a resource that can return the article's data in a new representation,
and I'm puzzled over how best to do this.
1. I could just implement a completely new resource on a different URL, like content/articles/web/{id}, and move the old one to content/articles/app/{id}.
2. I could also specify in my documentation of the resource that a client should always send a specific request header, maybe the Accept header, so that the web service can determine which representation of the article to return.
3. I could also just use the original URL with a URL parameter, like .../{id}/?version=app or .../{id}/?version=web.
Which would you reckon is the best option? My personal preference leans towards option 1, simply because I think it's easier to understand for clients of the web service.
Regards, Martin.
EDIT:
I have chosen to go with option 1. Thanks for helping out and giving pros and cons. :)

I would choose #1. If you need to preserve the existing URLs, you could add a new one: content/articles/{id}/native or content/native-articles/{id}/. Both are REST enough.
Working with paths makes content more easily cacheable than either the header or the parameter option. Using Content-Type overcomplicates the service, especially when both are returning JSON.

Use the HTTP concept of Content Negotiation. Use the Accept header with vendor types.
Get the article in the native representation:
GET http://api.com/content/articles/1234
Accept: application/vnd.com.example.article.native+json
Get the article in the original representation:
GET http://api.com/content/articles/1234
Accept: application/vnd.com.example.article.orig+json
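On the server side, the dispatch could look roughly like the sketch below. It is deliberately simplified: q-values are ignored and the first acceptable range wins. The function name pick_representation and the choice of the native type as the default are assumptions, and the two vendor media types follow the naming used in this answer.

```python
from typing import Optional

# Vendor media types as named in the answer (spelled with "example").
NATIVE = "application/vnd.com.example.article.native+json"
ORIG = "application/vnd.com.example.article.orig+json"
SUPPORTED = (NATIVE, ORIG)


def pick_representation(accept_header: str, default: str = NATIVE) -> Optional[str]:
    """Return the media type to serve, or None if the server should answer 406."""
    if not accept_header.strip():
        return default  # no Accept header: the server may choose freely
    for item in accept_header.split(","):
        media_range = item.split(";", 1)[0].strip().lower()  # drop parameters such as q=
        if media_range in SUPPORTED:
            return media_range
        if media_range in ("*/*", "application/*"):
            return default
    return None


print(pick_representation(ORIG))         # the original (web) representation
print(pick_representation("text/html"))  # None, i.e. respond with 406 Not Acceptable
```

If you go this way, send the chosen type back in Content-Type and add "Vary: Accept" to the response, so that caches keep the two representations apart.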

Option 1 and Option 3
Both are perfectly good solutions. I like the way Option 1 looks better, but that is just aesthetics. It doesn't really matter. If you choose one of these options, you should have requests to the old URL redirect to the new location using a 301.
Option 2
This could work as well, but only if the two responses have a different Content-Type. From the description, I couldn't really tell if this was the case. I would not define a custom Content-Type in this case just so you could use Content Negotiation. If the media type is not different, I would not use this option.

Perhaps option 2 - with the header being a Content-Type?
That seems to be the way resources are served in differing formats; e.g. XML, JSON, some custom format

In REST, how to discover acceptable media types?

Given a REST API,
I want to learn what media types I can set in the Accept header.
How should I do this?
I know I could do a random
GET http://some.api.com/
Accept:flying/elephants
and hope for a 406 with a body that has the correct acceptable media types.
Is there a better way?

In theory, an API could indicate its supported content types via HTTP OPTIONS.
Usually, an API offers either:
documentation, or
a specific resource listing the supported Accept header values.
Also (as you might know), Accept header values are usually bound to IANA-registered MIME types.

One issue with this is any URI within the API can respond with different media types. It's very common to have different endpoints in the API return different content types.
You could use multiple wildcard requests to probe for support.
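You can start with Accept: */* and then try application/*, text/*, and concrete types such as application/json and application/xml (a subtype-only wildcard like */json is not a valid media range in HTTP). You would receive a non-exhaustive list, but you'd get the big ones and the preferred ones.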
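The probing idea can be sketched as follows. To keep the example runnable without a network, the HTTP call is passed in as a function; fake_fetch is a stand-in for a real request (for instance one built on urllib.request) and simulates a hypothetical endpoint that serves JSON and HTML. All names here are illustrative.

```python
from typing import Callable, List, Optional, Tuple

PROBES = ["*/*", "application/*", "text/*",
          "application/json", "application/xml", "text/html"]

# fetch(accept) -> (status code, Content-Type of the response or None)
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def probe_media_types(fetch: Fetch, probes: List[str] = PROBES) -> List[str]:
    """Send one request per Accept value and collect the distinct media types returned."""
    seen: List[str] = []
    for accept in probes:
        status, content_type = fetch(accept)
        if status == 406 or not content_type:
            continue  # the server could not satisfy this Accept value
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type not in seen:
            seen.append(media_type)
    return seen


def fake_fetch(accept: str) -> Tuple[int, Optional[str]]:
    """Simulated endpoint that prefers JSON and can also serve HTML."""
    for media_type in ("application/json", "text/html"):
        main_type = media_type.split("/")[0]
        if accept in ("*/*", main_type + "/*", media_type):
            return 200, media_type + "; charset=utf-8"
    return 406, None


print(probe_media_types(fake_fetch))
```

As noted above, the result only describes the one URI you probed; other endpoints of the same API may answer differently.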
There are other weird edge cases. For example, OData allows you to specify a $format parameter in the URL to define the response type. This overrides the Accept header. Thus every format has its own URI.
It'd be cool if APIs made more use of the alternate link relationship (http://www.w3.org/TR/html5/links.html#rel-alternate); I think that would be the most appropriate. That, combined with the type attribute of the link, would let you know all the formats for any resource you retrieve. Again, it would be specific to each URI though.

Optional trailing slash in a URL in the URL config of a website

If I define a URL pattern like "^optional/slash/?$", so that the web page it is bound to is available at both versions of the URL, with and without the trailing slash, will I violate any conventions or standards by doing that?

Wouldn't a redirection be more appropriate?
If I remember correctly, trailing slashes should be used with resources that list other resources, like a directory that lists files, a list of articles or a category query (e.g. http://www.example.com/category/cakes/). Without a trailing slash the URI should point to a single resource, like a file, an article or a complex query with parameters (e.g. http://www.example.com/search?ingredients=strawberry&taste=good).
Just use the HTTP code 302 FOUND to redirect typos to their correct URIs.
EDIT: Thanks to AndreD for pointing it out, a HTTP code 301 MOVED PERMANENTLY is more appropriate for permanently aliasing typos. Search engines and other clients should stop querying for the misspelled URL after getting a 301 code once, and Google recommends using it for changing the URL of a page in their index.
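A small sketch of the redirect approach, independent of any particular web framework. It assumes the pattern from the question ends in "$", that the form with the trailing slash is the canonical one, and that paths are matched without a leading slash as in the question; the function name route is made up.

```python
import re
from typing import Optional, Tuple

# Pattern from the question, assumed to end with '$'.
PATTERN = re.compile(r"^optional/slash/?$")


def route(path: str) -> Tuple[int, Optional[str]]:
    """Return (status code, Location header value or None) for a request path."""
    if not PATTERN.match(path):
        return 404, None
    if path.endswith("/"):
        return 200, None        # canonical form: serve the page
    return 301, path + "/"      # alias: redirect permanently to the canonical form


print(route("optional/slash"))   # redirected to the version with the slash
print(route("optional/slash/"))  # served directly
```

With this, both versions of the URL work for visitors, but only one of them ever returns the page itself.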

According to RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
Section 6.2.4. Protocol-Based Normalization -
"Substantial effort to reduce the incidence of false negatives is
often cost-effective for web spiders. Therefore, they implement even
more aggressive techniques in URI comparison. For example, if they
observe that a URI such as
http://example.com/data
redirects to a URI differing only in the trailing slash
http://example.com/data/
they will likely regard the two as equivalent in the future. This
kind of technique is only appropriate when equivalence is clearly
indicated by both the result of accessing the resources and the
common conventions of their scheme's dereference algorithm (in this
case, use of redirection by HTTP origin servers to avoid problems
with relative references)."
My interpretation of this statement would be that making the two URIs functionally equivalent (e.g. by means of an .htaccess statement, redirect, or similar) does not violate any standard conventions. According to the RFC, web spiders are prepared to treat them as functionally equivalent if they point to the same resource.

No, you are not violating any standards by doing that; you can make the trailing slash optional in your website's URLs.
However, you need to stay on the safe side, because servers handle the issue in different ways:
Sometimes it doesn't matter for SEO: many web servers will just redirect to the default version using a 301 status code;
Some web servers may return a 404 page for the non-trailing-slash address = wasted link juice and effort;
Some web servers may return a 302 redirect to the correct version = wasted link juice and effort;
Some web servers may return a 200 response for both versions = wasted link juice and effort, as well as potential duplicate content problems.

How to solve two REST problems: the interface document; loss of privacy in descriptive URLs

Coming from a lot of frustrating times with WSDL/SOAP, I very much like the REST paradigm, but am trying to solve two basic problems in our application before moving over to REST. The first problem relates to the lack of an interface document. I think I finally see how to handle this situation: one can query his way down from a top-level "/resources" resource using various GET, HEAD, and OPTIONS requests to find the one needed resource in the correct hypermedia format. Is this the idea? If so, the client need only be provided with a top-level resource URI: http://www.mywebservicesite.com/mywebservice/resources. He will then have to do some searching and possibly keep track of what he is discovering, so that he can use the URIs again efficiently in the future to do GETs, POSTs, PUTs, and DELETEs. Are there any thoughts on what should happen here?
The other problem is that we cannot use descriptive URLs like /resources/../customer/Madonna/phonenumber. We do have an implementation of opaque URLs we use in the context of a session, and I'm wondering how opaque URLs might be applied to REST. The general problem is how to keep domain-specific details out of URLs, and still benefit from what REST has to offer.

The other problem is that we cannot use descriptive URLs like /resources/../customer/Madonna/phonenumber.
I think you've misunderstood the point of opaque URIs. The notion of opaque URIs is with respect to clients: a client shall not decipher a URI to guess anything of semantic meaning from it. So a service may well have URIs like /resources/.../customer/Madonna/phonenumber, and that's quite a good idea. Clients should still treat the URIs as opaque: they should not infer from the URI that it represents Madonna's phone number, or that Madonna is a customer of some sort. That knowledge can only be obtained by looking inside the resource itself, or perhaps by remembering where the URI was discovered.
Edit:
A consequence of this is that navigation should happen by links, not by deconstructing the URI. So if you see /resources/customer/Madonna/phonenumber (and it actually represents customer Madonna's phone number), you should have links in that resource that point to the Madonna resource, e.g.:
{
"phone_number" : "01-234-56",
"customer_URI": "/resources/customer/Madonna"
}
That's the only way to navigate from a phone number resource to a customer resource. An important aspect is that the server implementation might or might not have domain-specific information in the URI. The Madonna record might just as well live somewhere else: /resources/customers/byid/81496237. This is why clients should treat URIs as opaque.
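From the client's point of view, following the link could look like this sketch. The field name customer_URI comes from the JSON above; the host name is borrowed from the question, and the function name customer_link is made up. Note that the client never inspects the path segments, it only resolves whatever link the server supplied.

```python
import json
from urllib.parse import urljoin


def customer_link(phone_resource_url: str, body: str) -> str:
    """Resolve the customer link found in a phone number representation."""
    document = json.loads(body)
    # Resolve the (possibly relative) link against the URL the document was fetched from.
    return urljoin(phone_resource_url, document["customer_URI"])


body = '{"phone_number": "01-234-56", "customer_URI": "/resources/customer/Madonna"}'
here = "http://www.mywebservicesite.com/resources/customer/Madonna/phonenumber"
print(customer_link(here, body))
```

If the server later moves customers to /resources/customers/byid/81496237, the same client code keeps working, because only the link value in the representation changes.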
Edit 2:
Another question you have (in the comments) is how a client, which is required to have no knowledge of the server's URIs, is supposed to be able to find anything. Clients have the following possibilities to find resources:
1. Provide a search interface. This could be done by providing an OpenSearch description document, which tells clients how to search for items. An OpenSearch template can include several variables, and several endpoints, depending on what you're looking for. So if you have a "customer ID" that's unique, you could have the following template: /customers/byid/{proprietary:customerid}. The customerid element needs to be documented somewhere, inside the proprietary namespace. A client can then know how to use such a template.
2. Provide a custom form. This implies making a custom media type in which you explicitly define how (based on an instance of the document) a URI to a customer can be constructed, e.g. <customers template="/customers/byid/{id}"/>. The documentation (for the media type) would have to state that the template attribute must be interpreted as a relative URI after substituting an actual customer ID for the string "{id}".
3. Provide links to all resources. Some collections are finite and small enough to enumerate, so you can simply make a link to each and every resource, optionally including identifying information along with the links. This could also be done in a custom media type: <customer id="12345" href="/customer/byid/12345"/>.
It should be noted that #1 and #2 are two ways of saying the same thing: clients are allowed to create URIs if
they haven't got the URI structure a priori, and
a media type exists for which the documentation states how URIs should be created.
This is much the same way as a web browser has no idea of any URI structure on the web, except for the rules laid out in the definition of HTML forms, to add a ? and then all the query parameters separated by &.
In theory, if you have a customer with id 12345, then you could actually dispense with the href, since you could plug the customer id 12345 into #1 or #2. It's more common to actually provide real links between resources, rather than always relying on lookup or search techniques.
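The template idea from the custom-form option can be sketched as below. This is plain "{name}" substitution, not a full URI Template implementation, and the function name expand_template is hypothetical; the template string itself would come from the server's document, not from the client's code.

```python
import re
from urllib.parse import quote, urljoin


def expand_template(document_url: str, template: str, values: dict) -> str:
    """Fill '{name}' placeholders in a server-supplied template and resolve the result."""
    def substitute(match: "re.Match") -> str:
        # Percent-encode the value so it stays inside a single path segment.
        return quote(str(values[match.group(1)]), safe="")

    relative = re.sub(r"\{([^{}]+)\}", substitute, template)
    # The documentation says the result is a relative URI, so resolve it
    # against the URL of the document that contained the template.
    return urljoin(document_url, relative)


print(expand_template("http://www.mywebservicesite.com/mywebservice/resources",
                      "/customers/byid/{id}", {"id": 12345}))
```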

I haven't really used web RPC systems (WSDL/SOAP), but I think the 'interface document' is there mostly to allow client libraries to create the service API, right? If so, REST shouldn't need it, because the verbs are already defined and don't really need to be documented again.
AFAIUI, the REST way is to document the structure of each resource (usually encoded in XML or JSON). In that document, you'll also have to document the relationships between those resources. In my case, a resource is often a container of other resources (sometimes of more than one type), so the structure doc specifies which field holds a list of URLs pointing to the contained resources. Ideally, only one unique resource will need a single, fixed (documented) URL. Everything else follows from there.
The URL 'style' is meaningless to the client, since it shouldn't 'construct' a URL. Every URL it needs should already be present in a resource field. That lets you change the URL structure without changing the client (which has saved me tons of time). Your URLs can be as opaque or as descriptive as you like. (Personally, I don't like text keys or slugs; my keys are all BIGINTs or UUIDs.)

I am currently building a REST "agent" that addresses the first part of your question. The agent offers a temporary bookmarking service. The client code that is interacting with the agent can request that a URL be bookmarked using some identifier. If the client code needs to retrieve that representation again, it simply asks the agent for the URL that corresponds to the saved bookmark and then navigates to that bookmark. Currently those bookmarks are not persisted, so they only last for the lifetime of the client application, but I have found it a useful mechanism for accessing commonly used resources. E.g., the root representation provides a login link. I bookmark that link, and if the client ever receives a 401, I can redirect to the "login" bookmark.
To address an issue you mentioned in a comment, the agent also has the ability to store retrieved representations in a dictionary. If it becomes necessary to aggregate and manipulate multiple representations at the same time, I can simply request that the agent store the current representation in a dictionary under a key and then continue navigating to the next resource. Once the client has accumulated all the necessary representations, it can do what it needs to do.

Best way to decide on XML or HTML response?

I have a resource at a URL that both humans and machines should be able to read:
http://example.com/foo-collection/foo001
What is the best way to distinguish between human browsers and machines, and return either HTML or a domain-specific XML response?
(1) The Accept type field in the request?
(2) An additional bit of URL? eg:
http://example.com/foo-collection/foo001 -> returns HTML
http://example.com/foo-collection/foo001?xml -> returns, er, XML
I do not wish to oblige machines reading the resource to parse HTML (or XHTML for that matter). Machines like the googlebot should receive the HTML response.
It is reasonable to assume I control the machine readers.

If this is under your control, rather than adding a query parameter why not add a file extension:
http://example.com/foo-collection/foo001.html - return HTML
http://example.com/foo-collection/foo001.xml - return XML
Apart from anything else, that means if someone fetches it with wget or saves it from their browser, it'll have an appropriate filename without any fuss.

My preference is to make it a first-class part of the URI. This is debatable, since there are -- in a sense -- multiple URIs for the same resource. And is "format" really part of the URI?
http://example.com/foo-collection/html/foo001
http://example.com/foo-collection/xml/foo001
These are very easy to deal with in a web framework that has URI parsing to direct the request to the proper application.

If this is indeed the same resource with two different representations, HTTP invites you to use the Accept header, as you suggest. This is probably a very reliable way to distinguish between the two scenarios. You can be fairly sure that user agents (including search engine spiders) send the Accept header properly.
About the machine agents that you are going to give XML: are they under your control? In that case you can be doubly sure that Accept will work. If they do not set this header properly, you can give XML as the default. User agents DO set the header properly.
I would try to use the Accept header for this, because this is exactly what the Accept header is there for.
The problem with having two different URLs is that it is not automatically apparent that the two represent the same underlying resource. This can be bad if a user finds a URL in one program, which renders HTML, and pastes it into another, which needs XML. At this point a smart user could probably change the URL appropriately, but this is just a source of error that you don't need.
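A rough sketch of how the server could make that decision, honouring q-values and falling back to XML when no Accept header is sent, as suggested above. It is a simplification (media type parameters other than q are ignored), and the names parse_accept, quality and choose are made up for this example.

```python
from typing import List, Optional, Sequence, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header into (media range, q) pairs."""
    ranges = []
    for part in header.split(","):
        if not part.strip():
            continue
        fields = [field.strip() for field in part.split(";")]
        q = 1.0
        for param in fields[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        ranges.append((fields[0].lower(), q))
    return ranges


def quality(media_type: str, ranges: List[Tuple[str, float]]) -> float:
    """q of the most specific range matching media_type, or 0.0 if none matches."""
    main_type = media_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in ranges:
        if media_range == media_type:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def choose(header: str, offered: Sequence[str] = ("application/xml", "text/html")) -> Optional[str]:
    """Pick the offered type the client prefers; the first offered type wins ties."""
    if not header.strip():
        return offered[0]  # no Accept header: default to XML
    ranges = parse_accept(header)
    best, best_q = None, 0.0
    for media_type in offered:
        q = quality(media_type, ranges)
        if q > best_q:
            best, best_q = media_type, q
    return best  # None means nothing acceptable (406)


browser = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
print(choose(browser))            # a browser-style header gets HTML
print(choose("application/xml"))  # a machine client asking for XML gets XML
```

Since the readers are under your control, having them send "Accept: application/xml" explicitly avoids relying on the tie-breaking default.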

I would say adding a query string parameter is your best bet. The only way to automatically detect whether your client is a browser (human) or an application would be to read the User-Agent string from the HTTP request. But since any application can easily set this to mimic a browser, you're not guaranteed that this is going to work.
