REST API for data processing and method chaining - web-services

I apologize in advance if the quality of the question is bad. I am still beginning to learn the concepts of REST API. I am trying to implement a scalable REST API for data processing. Here is what I could think of so far.
Consider some numerical data that can be retrieved using a GET call:
GET http://my.api/data/123/
Users can apply a sequence of arithmetic operations such as add and multiply. A non-RESTful way to do that is:
GET http://my.api/data/123?add=10&multiply=5
Assumptions:
The original data in the DB is not changed. Only an altered version of it is returned to the user.
The data is large in size (say a large multi-dimensional array), so we can't afford to return the whole data with every operation call. Instead, we want to apply operations as a batch and return the final modified data in the end.
There are 2 RESTful ways I am currently considering:
1. Model arithmetic operations as subresources of data.
Suppose we treat add and multiply as subresources of data, as here. In this case, we can use:
GET http://my.api/data/123/add/10/
which would be safe and idempotent, given that the original data is never changed. However, we need to chain multiple operations. Can we do that?
GET http://my.api/data/123/add/10/multiply/5/
Where multiply is creating a subresource of add/10/ which itself is a subresource of data/123
Pros:
Statelessness: The server doesn't keep any information about the modified data.
Easy access to modified data: It is just a simple GET call.
Cons:
Chaining: I don't know if it can be easily implemented.
Long URIs: with each operation applied, the URI gets longer and longer.
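On the chaining concern: a minimal sketch of how a server could parse a chained path such as /data/123/add/10/multiply/5/ is shown here. The OPERATIONS registry, the function names, and the rule that every operation takes exactly one numeric argument are assumptions made for illustration, not part of the question.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical registry of supported operations (names are assumptions).
OPERATIONS: Dict[str, Callable[[float, float], float]] = {
    "add": lambda x, arg: x + arg,
    "multiply": lambda x, arg: x * arg,
}


def parse_chain(path: str) -> Tuple[str, List[Tuple[str, float]]]:
    """Split '/data/<id>/<op>/<arg>/...' into the data id and (op, arg) pairs."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2 or parts[0] != "data":
        raise ValueError("path must start with /data/<id>")
    data_id, rest = parts[1], parts[2:]
    if len(rest) % 2 != 0:
        raise ValueError("each operation needs exactly one argument")
    chain = []
    for name, arg in zip(rest[0::2], rest[1::2]):
        if name not in OPERATIONS:
            raise ValueError(f"unknown operation: {name}")
        chain.append((name, float(arg)))
    return data_id, chain


def apply_chain(values: List[float], chain: List[Tuple[str, float]]) -> List[float]:
    """Apply the operations left to right without mutating the input."""
    result = list(values)
    for name, arg in chain:
        result = [OPERATIONS[name](v, arg) for v in result]
    return result
```

Because the path is read left to right, the order of operations is unambiguous, which a plain query string does not guarantee by itself.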
2. Create an editable data object:
In this case, a user creates an editable version of the original data:
POST http://my.api/data/123/
will return
201 Created
Location: http://my.api/data/123/edit/{uniqueid}
Users can then PATCH this editable data
PATCH http://my.api/data/123/edit/{uniqueid}
{add:10, multiply:5}
And finally, GET the edited data
GET http://my.api/data/123/edit/{uniqueid}
Pros:
Clean URIs.
Cons:
The server has to save the state of edited data.
Editing is no longer idempotent.
Getting edited data requires users to make at least 3 calls.
Is there a cleaner, more semantic way to implement data processing RESTfully?
Edit:
If you are wondering what is the real world problem behind this, I am dealing with digital signal processing.
As a simple example, you can think of applying visual filters to images. Following this example, a RESTful web service can do:
GET http://my.api/image/123/blur/5px/rotate/90deg/?size=small&format=png

A couple of things worth reviewing in your question.
REST-based APIs are resource based
So looking at your first example, trying to chain transformation properties into the URL path following a resource identifier..
GET http://my.api/data/123/add/10/multiply/5/
..does not fit well (as well as being complicated to implement dynamically, as you already guessed)
Statelessness
The idea of statelessness in REST is built around a single HTTP call containing enough information to process the request and provide a result without going back to the client for more information. Storing the result of an HTTP call on the server is not state, it’s cache.
Now, given that a REST based API is probably not the best fit for your usage, if you do still want to use it here are your options:
1. Use the Querystring with a common URL operation
You could use the Querystring but simplify the resource path to accept all transformations upon a single URI. Given your examples and reluctance to store transformed results this is probably your best option.
GET http://my.api/data/123/transform?add=10&multiply=5
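A rough sketch of how the server side of option 1 might read the transformations from the query string. It assumes the parameters are applied in the order they appear in the URL and that only add and multiply exist; the function name is hypothetical.

```python
from typing import List
from urllib.parse import parse_qsl, urlsplit


def transform_from_url(url: str, values: List[float]) -> List[float]:
    """Apply the query-string operations to a copy of the values, in URL order."""
    result = list(values)
    # parse_qsl returns (name, value) pairs in the order they appear in the query.
    for name, raw in parse_qsl(urlsplit(url).query):
        arg = float(raw)
        if name == "add":
            result = [v + arg for v in result]
        elif name == "multiply":
            result = [v * arg for v in result]
        else:
            raise ValueError(f"unknown operation: {name}")
    return result
```

Note that ?add=10&multiply=5 and ?multiply=5&add=10 give different results under this assumption, so the API would need to document that parameter order is significant.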
2. Use POST non-RESTfully
You could use POST requests, and leverage the HTTP body to send in the transformation parameters. This will ensure that you don’t ever run out of space on the query string if you ever decide to do a lot of processing and it will also keep your communication tidier. This isn’t considered RESTful if the POST returns the image data.
3. Use POST RESTfully
Finally, if you decide that you do want to cache things, your POST can in fact store the transformed object (note that REST doesn’t dictate how this is stored, in memory or DB etc.) which can be re-fetched by Id using a GET.
Option A
POSTing to the URI creates a subordinate resource.
POST http://my.api/data/123
{add:10, multiply:5}
returns
201 Created
Location: http://my.api/data/123/edit/{uniqueid}
then GET the edited data
GET http://my.api/data/123/edit/{uniqueid}
Option B
Remove the resource identifier from the URL to make it clear that you're creating a new item, not changing the existing one. The resulting URL is also at the same level as the original one since it's assumed it's the same type of result.
POST http://my.api/data
{original: 123, add:10, multiply:5}
returns
201 Created
Location: http://my.api/data/{uniqueid}
then GET the edited data
GET http://my.api/data/{uniqueid}
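To make Option B concrete, here is a minimal in-memory sketch of the POST-then-GET flow. The ORIGINALS and DERIVED stores, the handler names, and the fixed "add first, then multiply" order are all assumptions for illustration; a real service would sit behind an HTTP framework and a proper store.

```python
import uuid
from typing import Dict, List, Optional, Tuple

# Hypothetical stores: originals are never modified, derived results are cached.
ORIGINALS: Dict[str, List[float]] = {"123": [1.0, 2.0, 3.0]}
DERIVED: Dict[str, List[float]] = {}


def post_data(body: dict) -> Tuple[int, Dict[str, str]]:
    """Create a derived resource and return (status, headers)."""
    source = ORIGINALS[str(body["original"])]
    add = body.get("add", 0)
    multiply = body.get("multiply", 1)
    # Assumed order: add first, then multiply.
    result = [(v + add) * multiply for v in source]
    new_id = uuid.uuid4().hex
    DERIVED[new_id] = result
    return 201, {"Location": f"http://my.api/data/{new_id}"}


def get_data(data_id: str) -> Tuple[int, Optional[List[float]]]:
    """Fetch either an original or a derived resource by id."""
    if data_id in ORIGINALS:
        return 200, ORIGINALS[data_id]
    if data_id in DERIVED:
        return 200, DERIVED[data_id]
    return 404, None
```

Originals and derived results share one URL space here, which matches the idea that the result is the same type of resource as the input.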

There are multiple ways this can be done. In the end it should be clean, regardless of what label you want to give it (REST or non-REST). REST is not a protocol with an RFC, so don't worry too much about whether you pass information as URL paths or URL params. The underlying web service should be able to get you the data regardless of how it is passed. For example, Java Jersey will give you your params whether they are query params or URL path segments; it's just an annotation difference.
Going back to your specific problem I think that the resource in this REST type call is not so much the data that is being used to do the numerical operations on but the actual response. In that case, a POST where the data ID and the operations are fields might suffice.
POST http://my.api/operations/
{
"dataId": "123",
"operations": [
{
"type": "add",
"value": 10
},
{
"type": "multiply",
"value": 5
}
]
}
The response would have to point to the location where the result can be retrieved, as you have pointed out. The result, referenced by the location (and ID) in the response, is essentially an immutable object. So that is in fact the resource being created by the POST, not the data used to calculate that result. It's just a different way of viewing it.
EDIT: In response to your comment about not wanting to store the outcome of the operations, you can use a callback to transmit the results of the operation to the caller. You can easily add a field in the JSON input for the host or URL of the callback. If the callback URL is present, then you can POST to that URL with the results of the operation.
{
"dataId": "123",
"operations": [
{
"type": "add",
"value": 10
},
{
"type": "multiply",
"value": 5
}
],
"callBack": "<HOST or URL>"
}
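A small sketch of the callback idea using only the Python standard library. It builds the POST request that would carry the result to the caller; the shape of the callback body ("dataId" plus "result") and the function name are assumptions, since the answer does not specify them.

```python
import json
import urllib.request
from typing import List, Optional


def build_callback_request(payload: dict, result: List[float]) -> Optional[urllib.request.Request]:
    """Return a POST request for the callback URL, or None if no callback was given."""
    callback_url = payload.get("callBack")
    if not callback_url:
        return None
    # Assumed body shape: echo the data id and attach the computed result.
    body = json.dumps({"dataId": payload["dataId"], "result": result}).encode("utf-8")
    return urllib.request.Request(
        callback_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would then be a call to urllib.request.urlopen(request), ideally from a background worker so the original request is not blocked.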

Please don't view this as me answering my own question, but rather as a contribution to the discussion.
I have given a lot of thought to this. The main problem with the currently suggested architectures is scalability, since the server creates copies of data each time it is operated on.
The only way to avoid this is to model operations and data separately. So, similar to Jose's answer, we create a resource:
POST http://my.api/operations/
{add:10, multiply:5}
Note here, I didn't specify the data at all. The created resource represents a series of operations only. The POST returns:
201 Created
Location: http://my.api/operations/{uniqueid}
The next step is to apply the operations on the data:
GET http://my.api/data/123/operations/{uniqueid}
This separate modeling approach has several advantages:
Data is not replicated each time a user applies a different set of operations.
Users create only operations resources, and since their size is tiny, we don't have to worry about scalability.
Users create a new resource only when they need a new set of operations. Going back to the image example: if I am designing a greyscale website, and I want all images to be converted to greyscale, I can do
POST http://my.api/operations/
{greyscale: "50%"}
And then apply this operation on all my images by:
GET http://my.api/image/{image_id}/operations/{greyscale_id}
As long as I don't want to change the operation set, I can use GET only.
Common operations can be created and stored on the server, so users don't have to create them. For example:
GET http://my.api/image/{image_id}/operations/flip
Where operations/flip is already an available operation set.
It is easy to apply the same set of operations to different data, and vice versa.
GET http://my.api/data/{id1},{id2}/operations/{some_operation}
Enables you to compare two datasets that are processed similarly. Alternatively:
GET http://my.api/data/{id1}/operations/{some_operation},{another_operation}
Allows you to see how different processing procedures affect the result.
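The separation of operations from data described above could be sketched like this. The in-memory DATA and OPERATION_SETS stores and the function names are assumptions; the point is only that the server stores the small operation set once and computes results on GET without copying the data.

```python
import uuid
from typing import Dict, List, Tuple

# Hypothetical stores. Only operation sets are created by users.
DATA: Dict[str, List[float]] = {"1": [1.0, 2.0], "2": [3.0, 4.0]}
OPERATION_SETS: Dict[str, List[Tuple[str, float]]] = {}


def post_operations(ops: Dict[str, float]) -> str:
    """Store an operation set and return its Location URL."""
    op_id = uuid.uuid4().hex
    # Python dicts keep insertion order, so the order of the operations is preserved.
    OPERATION_SETS[op_id] = list(ops.items())
    return f"http://my.api/operations/{op_id}"


def get_processed(data_id: str, op_id: str) -> List[float]:
    """Compute the result on the fly; nothing is written to DATA."""
    values = list(DATA[data_id])
    for name, arg in OPERATION_SETS[op_id]:
        if name == "add":
            values = [v + arg for v in values]
        elif name == "multiply":
            values = [v * arg for v in values]
        else:
            raise ValueError(f"unknown operation: {name}")
    return values
```

One caveat: JSON object members are not formally ordered, so a list of operations may be a safer request body than an object if order matters.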

I wouldn't try to describe your math function using the URI or request body. We have a more or less standard language to describe math, so you could use some kind of template.
GET http://my.api/data/123?transform="5*(data+10)"
POST http://my.api/data/123 {"transform": "5*({data}+10)"}
You need code on the client side that can build these kinds of templates, and code on the server side that can verify, parse, and so on, the templates built by the client.
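For the server-side verification and parsing, one possible sketch uses Python's ast module to accept only arithmetic on numbers and the name data, instead of calling eval on client input. The set of allowed operators and the function name are assumptions.

```python
import ast
import operator

# Assumed whitelist of binary operators.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(template: str, data: float) -> float:
    """Evaluate a template such as '5*({data}+10)' for a single value."""
    tree = ast.parse(template.replace("{data}", "data"), mode="eval")

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.Name) and node.id == "data":
            return data
        # Anything else (calls, attributes, other names) is rejected.
        raise ValueError("unsupported expression")

    return walk(tree)
```

Function calls, attribute access, and any name other than data raise ValueError, so the client cannot run arbitrary code through the template.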

Related

Lazily create database records on GET requests

First, I understand GET requests should be safe and idempotent. However, my current situation is a little bit different from all the examples I have seen, so I'm not sure what to do.
The web app is some kind of metadata database for all online videos (by "all" I actually mean "all YouTube, Vimeo, XXX, ...", i.e., a known range of mainstream online video websites). Users can POST to http://www.example.com/api/video/:id to add metadata to a certain video, and GET from http://www.example.com/api/video/:id to get back all the current metadata for the given video.
The problem is how to get the video ID for a URL (say https://youtu.be/foobarqwe12). I think the users can query the server somehow, perhaps with a GET at http://www.example.com/api/find_video?url=xxx. The idea is that as long as the URL is valid, the query should always return the information of the video (including its ID); this seems to require that the server creates the record for a video if it doesn't exist yet.
My opinion is that although this seems to violate the safety and idempotence requirements for GET requests, it can also be seen as implementation detail (ideally there is a record for every video for every URL at the beginning of time, and lazily creating records on GETs is just a kind of optimization).
Nonsense, it doesn't violate anything.
If "every valid resource name" has a "valid representation", how that representation is manifested is an internal detail that's outside scope.
Your GET is idempotent. Just because you create a new row in a DB on first access doesn't make it not so.
When you GET /missingurl, you get a representation -- not a 404, but a 200 and some kind of result. This representation could also just be a templated boilerplate that all entities get (only with the URL linked filled in).
Whether you simply print some templated boilerplate, or create a row in the DB, the representation to the client is the same. They make the request, they get the representation -- all the time, all the same. That's idempotent. The fact "something happens" on the backend is an implementation detail hidden from the client.
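A tiny sketch of that argument: the record is created on first access, yet every call returns the same representation. The VIDEOS store, the id scheme, and the find_video name are assumptions for illustration.

```python
from typing import Dict, Tuple

# Hypothetical store keyed by video URL.
VIDEOS: Dict[str, dict] = {}


def find_video(url: str) -> Tuple[int, dict]:
    """Return (status, representation); create the record lazily on first access."""
    record = VIDEOS.get(url)
    if record is None:
        record = {"id": len(VIDEOS) + 1, "url": url, "metadata": {}}
        VIDEOS[url] = record
    # Return a copy so callers cannot mutate the stored record.
    return 200, {"id": record["id"], "url": record["url"], "metadata": dict(record["metadata"])}
```

From the client's point of view the first and the hundredth GET are indistinguishable, which is the property that matters here.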

RESTful search. Return actual resources or URIs?

Pretty new to all this REST stuff.
I'm designing my API, and am not sure what I'm supposed to return from a search query. I was assuming I would just return all objects that match the query in their entirety, but after reading up a bit about HATEOAS I am thinking I should be returning a list of URI's instead?
I can see that this could help with caching of items, but I'm worried that there will be a lot of overhead generated by the subsequent multiple HTTP requests required to get the actual object info.
Am I misunderstanding? Is it acceptable to return object instances instead of URIs?
I would return a list of resources with links to more details on those resources.
From the RESTful Web Services Cookbook (2010) by Subbu Allamaraju:
Design the response of a query as a representation of a collection
resource. Set the appropriate expiration caching headers. If the query
does not match any resources, return an empty collection.
IMHO it is important to always remember that "pure REST" and "real world REST" are two quite different beasts.
How are you returning the list of URIs from your query in the first place? If you return e.g. application/json, this certainly does not tell the client how it is supposed to interpret the content; therefore, the interaction is already being driven by out-of-band information (the client magically already knows where to look for the data it needs) in conflict with HATEOAS.
So, to answer your question: I find it quite acceptable to return object instances instead of URIs -- but be careful because in the general case this means you are generating all this data without knowing if the client is even going to use it. That's why you will see a hybrid approach quite often: the object instances are not full objects (i.e. a portion of the information the server has is not returned), but they do contain a unique identifier that allows the client to fetch the full representation of selected objects if it chooses to do so.
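The hybrid approach could look roughly like this: the search returns a collection of partial objects, each carrying an identifier and a link to its full representation. The CATALOG contents, field names, and /items/ URL scheme are made up for the example.

```python
from typing import Dict, List

# Hypothetical full records held by the server.
CATALOG: Dict[int, dict] = {
    1: {"name": "REST Cookbook", "description": "A long description...", "pages": 314},
    2: {"name": "Signal Processing", "description": "Another long text...", "pages": 520},
}


def search(term: str) -> Dict[str, List[dict]]:
    """Return summaries only; clients follow 'link' for the full object."""
    items = [
        {"id": item_id, "name": item["name"], "link": f"/items/{item_id}"}
        for item_id, item in sorted(CATALOG.items())
        if term.lower() in item["name"].lower()
    ]
    # An unmatched query yields an empty collection rather than an error.
    return {"items": items}
```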

Create single and multiple resources using restful HTTP

In my API server I have this route defined:
POST /categories
To create one category you do:
POST /categories {"name": "Books"}
I thought that if you want to create multiple categories, then you could do:
POST /categories [{"name": "Books"}, {"name": "Games"}]
I just wanna confirm that this is a good practice for Restful HTTP API.
Or should one have a
POST /bulk
for allowing them to do whatever operations at once (Creating, Reading, Updating and Deleting)?
In true REST, you should probably POST this in multiple separate calls. The reason is that each one will result in a new representation. How would you expect to get that back otherwise?
Each post should return the resultant resource location:
POST -> New Resource Location
POST -> New Resource Location
...
However, if you need a bulk, then create a bulk. Be dogmatic where possible, but if not, pragmatism gets the job done. If you get too hung up on dogmatism, then you never get anything done.
Here is a similar question
Here is one that suggests HTTP Pipelining to make this more efficient
There's nothing particularly wrong with having a bulk operation that you POST to, to activate (it'll be non-idempotent so POST is the right verb) but there are some caveats:
You're making multiple resources, so you need to respond with multiple URLs. This means you can't use the redirect pattern: you'll have to send a list of URLs back in some form.
You have a problem in that bulk operations are often not very discoverable. Discoverability is one of the most important things about RESTfulness, as it means that someone can come along and figure out how to write a client without lots of help from the server author.
Dealing with partial failures when you've got bulk operations remains problematic. It's a problem with any other paradigm too (I've watched people tie themselves in knots over this when working with extensions to SOAP) so it isn't a surprise, but unless you can guarantee that all the creations will work, you're going to have to work out what happens when you make one resource and fail to make the second. (Also, if the bulk request wanted a third one done, would you go on and try that?)
The simplest approach is just to support one create per request; that's a much easier pattern to get right and is better understood all round.
There's nothing wrong with creating multiple resources at once with POST (just don't try it with PUT). It's not "un-REST-ful", especially if you create a representation for the bulk operation itself. I suggest you create an index resource at the same time you create the individual resources, and return a "303 See Other" to it. That index representation would then contain links to all of the created resources (and possibly error information if any of them failed).
POST /categories/uploads/
[{"name": "Books"}, {"name": "Games"}]
303 See Other
Location: /categories/uploads/321/
(actually, now that I think about it, 201 might be better than 303)
GET /categories/uploads/321/
200 OK
Content-Type: application/json
[{"name": "Books", "link": "/categories/Books/"},
{"name": "Games", "error": "The 'Games' category already exists."}]
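As a sketch of this index-resource idea, the handler here records a per-item outcome and returns 201 with the location of the index. The in-memory CATEGORIES and UPLOADS stores, the counter-based upload id, and the function names are assumptions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical state: 'Games' already exists before the bulk upload.
CATEGORIES: Set[str] = {"Games"}
UPLOADS: Dict[str, List[dict]] = {}


def post_upload(items: List[dict]) -> Tuple[int, Dict[str, str]]:
    """Create what can be created and store a per-item index of the outcome."""
    entries = []
    for item in items:
        name = item["name"]
        if name in CATEGORIES:
            entries.append({"name": name, "error": f"The '{name}' category already exists."})
        else:
            CATEGORIES.add(name)
            entries.append({"name": name, "link": f"/categories/{name}/"})
    upload_id = str(len(UPLOADS) + 1)
    UPLOADS[upload_id] = entries
    return 201, {"Location": f"/categories/uploads/{upload_id}/"}


def get_upload(upload_id: str) -> Tuple[int, List[dict]]:
    """Return the index resource listing links and errors."""
    return 200, UPLOADS[upload_id]
```

This sketch keeps going after a failure; whether to stop at the first error instead is a design decision the API would need to state.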
In your case I would also go the /bulk resource way. But the pattern I would suggest is the following and from my understanding the most natural: Work with the 202 Accepted status code.
The idea of a bulk request is that the server should not be forced to answer immediately, as this would mean the client needs to wait until its bulk request has completed.
Here is the pattern:
POST /bulk [{"name": "Books"}, {"name": "Games"}]
202 Accepted | Location: /bulk/processing/status/resourceId
GET /bulk/processing/status/resourceId
entry = "REST in peace" | completed | 0 errors | /categories/category/resourceId
entry = "Walking dead" | processing | 0 errors ->
So, the client POSTs the bulk information to the server. The server just accepts them with a 202 which gives no guarantee about the processing state at the time of response.
But the server also provides the link to a status resource. Here the client can look at each of the created resources and its processing state. When processing has finished, the client can access the resource via the given link.
Error cases can be identified by the client, and erroneous data might be resent with a PUT on the completed resource.
Finally, a piece of good advice I usually follow: whenever you hit a resource in your design that cannot be mapped onto an HTTP feature, it is probably because of a missing resource.
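The 202 Accepted pattern above might be sketched as follows. The JOBS store, the counter-based ids, and process_next (a stand-in for a real background worker) are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical job store: job id -> list of per-entry status records.
JOBS: Dict[str, List[dict]] = {}


def post_bulk(items: List[dict]) -> Tuple[int, Dict[str, str]]:
    """Accept the bulk request without processing it yet."""
    job_id = str(len(JOBS) + 1)
    JOBS[job_id] = [
        {"entry": item["name"], "state": "processing", "errors": 0, "link": None}
        for item in items
    ]
    return 202, {"Location": f"/bulk/processing/status/{job_id}"}


def process_next(job_id: str) -> bool:
    """Stand-in for a background worker: complete one pending entry."""
    for index, entry in enumerate(JOBS[job_id], start=1):
        if entry["state"] == "processing":
            entry["state"] = "completed"
            entry["link"] = f"/categories/category/{index}"
            return True
    return False


def get_status(job_id: str) -> Tuple[int, List[dict]]:
    """Return a snapshot of the status resource."""
    return 200, [dict(entry) for entry in JOBS[job_id]]
```

The client polls the status URL from the Location header and follows each link once the corresponding entry is completed.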
This is still a hot topic today. To simplify things, I usually say that there is a better-suited scenario for each practice.
Eg:
1. If you are receiving likes on a post, you don't need bulk, since there is only one like per comment.
2. If you are receiving favorite comments, bulk can fit well: consider someone reviewing the comments he reads, checking the boxes of all his favorites, and sending them at once.
Again, this is based on my experience working with RESTful APIs. Currently, for the sake of multitasking and other things, my colleague and I find ourselves using bulk all the time in most of the MIS (Management Information System) projects we do. This is because modern web and mobile apps can do a lot of work and send the final results to the back end; this way the back end has little to do as long as the data received doesn't violate the business logic.

Should the parameters provided in a web service call be included in the response

I've not got much experience with creating web services; however, I do spend a lot of time interfacing with them.
I wondered if there was a best practice that stated whether or not parameters that are provided in the request should be included in the response.
E.g.
Request:
a.com/getStuff?key=123
(JSON) Response:
{"key":"123",
"value":"abc"}
or..
(JSON) Response:
{"value":"abc"}
I much prefer the more verbose first option because it does not enforce coupling between the request and the response, i.e. the response doesn't care what the request was, so you do not need to pass state around.
Is there a best practice?
If you are referencing a record in a database, or some other entity that is uniquely identified by an integer, GUID, or specifically-formatted string value, you should ALWAYS return that unique ID with the response, particularly if you are planning to allow the user to update that entity or reference it in a subsequent operation for creating related data or searching for related data.
If you are returning a derived value that may be a composite of many records' values, or of environmentally specific data (such as "How much free disk space is on my server?"), then the supplied parameters wouldn't mean anything in the response, and therefore shouldn't be returned.
Your point on coupling request-response is right on the money. If you are doing multiple simultaneous asynchronous calls, then the key value is very useful when handling the responses.
Referring to your example: I think id should always be part of the resource representation (in your case JSON). The representation should be as self-explanatory and self-referential as possible. On top of an id attribute/field, I also like to use a link field:
{
"id":123,
"link":{
"href":"http://api.com/item/123",
"rel":"self"
},
otherData...
}
If your example GET /getStuff?key=123 is more of a search (the parameter looks a bit like that), then it is good to present the user with a "summary" of the search:
{
"items":[{
item1...
},
{
item2...
}
],
"submitted-params":{
"key":"123",
"other-param":"paramValue"
}
}

How best to design a RESTful API for initiating an action

I'm building a RESTful web service that has the usual flavor of CRUD operations for a set of data types. The HTTP verb mappings for these APIs are obvious.
The interesting part comes in where the client can request that a long-running (i.e., hours) operation against one of the data objects be initialized; the status of the operation is reported by querying the data type itself.
For example, assume an object with the following characteristics:
SomeDataType
{
Name: "Some name",
CurrentOperation: "LongOperationA",
CurrentOperationPercent: 0.75,
CurrentOperationEtaSeconds: 3600
}
My question, then, is what the best RESTful approach should be for starting LongOperationA?
The most obvious approach would seem to be making the operation itself the identifier, perhaps something along the lines of POST https://my-web-service.com/api/StartLongOperationA?DataID=xxxx, but that seems a bit clunky, even if I don't specify the data identifier as a query parameter.
It's also pretty trivial to implement this as an idempotent action, so using POST seems like a waste; on the other hand, PUT is awkward, since no data is actually being written to the service.
Has anybody else faced this type of scenario in their services? What have you done to expose an API for initializing actions that honors RESTful principles?
TIA,
-Mark
You could do,
POST /LongRunningOperations?DataId=xxxx
to create a new LongRunningOperation. The URI of the long running operation would be returned in the Location header along with a 201 status code.
Or if you want to keep the long running operations associated to the DataId you could do
POST /Data/xxx/LongRunningOperations
Both these options will give you the opportunity to inquire if there are long running operations still executing. If you need information after the operation has completed you can create things like
GET /CompletedLongRunningOperations
GET /Data/xxx/CompletedLongRunningOperations
GET /Data/xxx/LastCompletedLongRunningOperation
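A rough sketch of treating the long-running operation as its own resource. The RUNNING store, the percent field used to decide completion, and the function names are assumptions; only the URL shapes come from the answer.

```python
from typing import Dict, List, Tuple

# Hypothetical store of operation resources keyed by operation id.
RUNNING: Dict[str, dict] = {}


def start_operation(data_id: str, name: str) -> Tuple[int, Dict[str, str]]:
    """POST /Data/<id>/LongRunningOperations: create an operation resource."""
    op_id = str(len(RUNNING) + 1)
    RUNNING[op_id] = {"id": op_id, "dataId": data_id, "name": name, "percent": 0.0}
    return 201, {"Location": f"/Data/{data_id}/LongRunningOperations/{op_id}"}


def list_operations(data_id: str, completed: bool) -> List[dict]:
    """List operations for one data object, either still running or completed."""
    return [
        dict(op)
        for op in RUNNING.values()
        if op["dataId"] == data_id and (op["percent"] >= 1.0) == completed
    ]
```

A worker would update percent as the job progresses; listing with completed=True then corresponds to the CompletedLongRunningOperations resource.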