Why use pagination tokens? - web-services

I am implementing pagination on a web service. My first thought was to use the query params page and size, like Spring Data.
However, we are basing some of our design on the Google web service APIs. I notice that they use pagination tokens, with each page of results containing a nextPageToken. What are the advantages of this approach? Handling changing data? What kind of info would be encoded in such a token?

When paginating with an offset, inserts and deletes into the data between your requests will cause rows to be skipped or included twice. If you are able to keep track in the token of what you returned previously (via a key or whatever), you can guarantee you don't return the same result twice, even when there are inserts/deletes between requests.
I'm a little uncertain of how you would encode a token, but for a single table at least, it seems you could use an encoded version of the primary key as a limit: "I just returned everything before key=200. Next time I'll only return things after 200." I guess this assumes a new item inserted between requests 1 and 2 will be given a key greater than the existing keys.
https://use-the-index-luke.com/no-offset
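To make the "only return things after key=200" idea concrete, here is a minimal sketch in Python. It assumes an in-memory list of rows with an increasing integer id standing in for the table, and a token that is just base64-encoded JSON holding the last id returned. The names fetch_page, encode_token and decode_token are made up for illustration; a real service would run something like WHERE id > :after ORDER BY id LIMIT :size against the database instead of filtering a list.

```python
import base64
import json
from typing import Optional


def encode_token(last_id: int) -> str:
    """Pack the last returned key into a URL-safe string."""
    raw = json.dumps({"after": last_id}).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token: str) -> int:
    """Recover the last returned key from a token."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return int(json.loads(raw)["after"])


def fetch_page(rows: list, size: int, page_token: Optional[str] = None) -> dict:
    """Return up to `size` rows whose id is greater than the key in the token."""
    after = decode_token(page_token) if page_token else None
    ordered = sorted(rows, key=lambda r: r["id"])
    # Stand-in for: WHERE id > :after ORDER BY id
    remaining = [r for r in ordered if after is None or r["id"] > after]
    page = remaining[:size]
    has_more = len(remaining) > size
    next_token = encode_token(page[-1]["id"]) if has_more else None
    return {"items": page, "nextPageToken": next_token}
```

Because each request resumes from the last key rather than from a row count, deleting or inserting rows between requests does not shift the next page. As noted above, this relies on new rows getting keys greater than the ones already returned.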

One reason opaque strings are used for pagination tokens is so that you can change how pagination is implemented without breaking your clients. A query param like page (I assume you mean page number) is transparent to your client and allows them to make assumptions about it.

Related

WSO2 ESB: Store state between sequence invocations

I was wondering about the proper way to store state between sequence invocations in WSO2 ESB. In other words, if I have a scheduled task that invokes sequence S, at the end of iteration 0 I want to store some String variable (let's call it ID), and then I want to read this ID at the start (or in the middle) of iteration 1, and so on.
To be more precise, I want to get a list of new SMS messages from an existing service, Twilio to be exact. However, Twilio only lets me get messages for selected days, i.e. there's no way for me to say give me only new messages (since I last checked / newer than certain message ID). Therefore, I'd like to create a scheduled task that will query Twilio and pass only new messages via REST call to my service. In order to do this, my sequence needs to query Twilio and then go through the returned list of messages, and discard messages that were already reported in the previous invocation. Now, to do this I need to store some state between different task/sequence invocations, i.e. at the end of the sequence I need to store the ID of the newest message in the current batch. This ID can then be used in subsequent invocation to determine which messages were already reported in the previous invocation.
I could use the DBLookup and DB Report mediators, but that seems like overkill (using a database to store a single string) and not very performance friendly. On the other hand, as far as I can see, Class mediators are instantiated as singletons, so I could create a custom Class mediator that would manage this state and filter the list of messages to be sent to my service. I am quite sure that this will work, but I was wondering if this is the way to go, or whether there is a more elegant solution that I missed.
We can think of 3 options here.
1. Using DBLookup/Report as you've suggested
2. Using the Carbon registry to store the values (this again uses DBs in the back end)
3. Using a Custom mediator to hold the state and read/write it from/to properties
Of these three, the third will obviously deliver the best performance since everything stays in memory. It's also quite simple to implement; some time back I did something similar and wrote a blog post here.
On the other hand, the first two options preserve the state even if the server crashes, if that's a concern for your use case.
Since ESB 4.9.0 you can persist properties to the registry and read them back using the Property mediator.
https://docs.wso2.com/display/ESB490/Property+Mediator

Lazily create database records on GET requests

First, I understand GET requests should be safe and idempotent. However, my current situation is a little bit different from all the examples I have seen, so I'm not sure what to do.
The web app is some kind of metadata database for all online videos (by "all" I actually mean "all YouTube, Vimeo, XXX, ...", i.e., a known range of mainstream online video websites). Users can POST to http://www.example.com/api/video/:id to add metadata to a certain video, and GET from http://www.example.com/api/video/:id to get back all the current metadata for the given video.
The problem is how to get the video ID for a URL (say https://youtu.be/foobarqwe12). I think the users can query the server somehow, perhaps with a GET at http://www.example.com/api/find_video?url=xxx. The idea is that as long as the URL is valid, the query should always return the information of the video (including its ID); this seems to require that the server creates the record for a video if it doesn't exist yet.
My opinion is that although this seems to violate the safety and idempotence requirements for GET requests, it can also be seen as implementation detail (ideally there is a record for every video for every URL at the beginning of time, and lazily creating records on GETs is just a kind of optimization).
Nonsense, it doesn't violate anything.
If "every valid resource name" has a "valid representation", how that representation is manifested is an internal detail that's outside scope.
Your GET is idempotent. Just because you create a new row in a DB on first access doesn't make it not so.
When you GET /missingurl, you get a representation -- not a 404, but a 200 and some kind of result. This representation could also just be templated boilerplate that all entities get (with only the linked URL filled in).
Whether you simply print some templated boilerplate, or create a row in the DB, the representation to the client is the same. They make the request, they get the representation -- all the time, all the same. That's idempotent. The fact that "something happens" on the backend is an implementation detail hidden from the client.
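A small Python sketch of the point, assuming an in-memory dictionary in place of the database and a made-up VideoCatalog class: the first lookup creates the record, but the caller sees the same representation on every call.

```python
import itertools


class VideoCatalog:
    """Toy stand-in for the video table, keyed by URL (illustrative only)."""

    def __init__(self) -> None:
        self._by_url = {}
        self._ids = itertools.count(1)

    def find_video(self, url: str) -> dict:
        """Handler body for GET /api/find_video?url=..."""
        record = self._by_url.get(url)
        if record is None:
            # Lazily create the row on first access; invisible to the client.
            record = {"id": next(self._ids), "url": url}
            self._by_url[url] = record
        return dict(record)

    def record_count(self) -> int:
        return len(self._by_url)
```

Repeating the same GET never creates a second record or changes the response. With a real database and concurrent requests, you would presumably want a unique constraint on the URL so that two simultaneous first lookups cannot create two rows.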

Is it necessary to encrypt cookie data if it just stores user preferences?

I am creating a basic website which customizes its look based on user preference data stored in cookies. The data in the cookie is stored like this:
country,option1,option2...and so on.
I read this data from the cookies directly and act on it. Is there a reason why I should encrypt it? I don't think this poses any security threat.
The answer is a bit more nuanced than what you'd expect ...
Encryption would make the data unreadable, but that's the only thing that bare encryption does; it doesn't actually prevent the data from being modified. What you'd want instead (or in addition to encryption) is message authentication.
These two techniques are often used together (and if you do encryption, you surely do have to do authentication as well), but are slightly different in what they do.
If you don't store any private, sensitive information in the cookie, then you'd probably be fine without hiding (encrypting) it from the user. However, you absolutely MUST implement a message authentication mechanism.
Even if you don't think it is currently a security threat, that might be because you haven't considered all possible attack vectors or you're not actually aware of all of them. And even if it is safe now, that doesn't mean it will be in the future when you or your colleagues add more features or otherwise alter the application's logic.
Therefore, you must never trust unvalidated user input, no matter how, where, when it got into the system or how harmless it may seem at first.
Edit note: Your question doesn't reference PHP, but I assume it as the most popular language for web application development. And I do need some language to produce an example. :)
The easiest way to implement message authentication in PHP is by using the hash_hmac() function, which takes your data, a key and a cryptographic hash function name in order to produce a Hash-based Message Authentication Code (HMAC):
$hmac = hash_hmac('sha256', $stringCookieData, $key);
You can then append (or prepend) the resulting HMAC to your cookie data, which is effectively your "signature" for it:
$stringCookieData = $hmac.$stringCookieData;
And then of course, you'll need to verify that signature when you receive the cookie. In order to do that, you need to re-create the HMAC using the data that you received in the cookie and the same secret key, and finally compare the two hashes:
// A missing cookie is treated as empty data, which fails verification
$cookie = $_COOKIE['cookieName'] ?? '';
// 64 is the length of a hex-encoded SHA-256 hash
$hmacReceived = substr($cookie, 0, 64);
$dataReceived = substr($cookie, 64);
$hmacComputed = hash_hmac('sha256', $dataReceived, $key);
if (hash_equals($hmacComputed, $hmacReceived))
{
    // All is fine
}
else
{
    // ERROR: The received data has been modified outside of the application
}
There are two details that you need to note here:
- Just as with encryption, the key is NOT a password and it must be random and unpredictable (and that does NOT mean hashing the current timestamp). Always use random_bytes() (PHP7) or random_compat to generate a key. You need 32 bytes of random data for SHA-256.
- Do NOT replace hash_equals() with a simple comparison operator. This function is specifically designed to prevent timing attacks, which could be used to break your key.
That's probably a lot of info to digest, but unfortunately, implementing secure applications today is very complicated. Trust me, I really did try to make it as short as possible - it is important that you understand all of the above.
If you are fine with
- the user being able to see it
- the user being able to change it
then you do not need to encrypt it.
"I read this data from the cookies directly and act on it"
Make sure you have proper validations in place. Don't trust the cookie to not be tampered with.
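One way to do that validation is to accept only values from a known list and fall back to defaults for anything else. The sketch below is in Python and assumes the country,option1,... layout from the question; the allowed values, the "theme" option and the parse_preferences name are all invented for the example.

```python
from typing import Optional

# Hypothetical whitelists; a real site would list its own supported values.
ALLOWED_COUNTRIES = {"US", "GB", "DE"}
ALLOWED_THEMES = {"light", "dark"}
DEFAULTS = {"country": "US", "theme": "light"}


def parse_preferences(cookie_value: Optional[str]) -> dict:
    """Parse 'country,theme' and ignore anything that is not whitelisted."""
    parts = cookie_value.split(",") if cookie_value else []
    prefs = dict(DEFAULTS)
    if len(parts) >= 1 and parts[0] in ALLOWED_COUNTRIES:
        prefs["country"] = parts[0]
    if len(parts) >= 2 and parts[1] in ALLOWED_THEMES:
        prefs["theme"] = parts[1]
    return prefs
```

With this shape, a tampered cookie can at worst select another valid preference; it never reaches the page as raw text.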

RESTful search. Return actual resources or URIs?

Pretty new to all this REST stuff.
I'm designing my API, and am not sure what I'm supposed to return from a search query. I was assuming I would just return all objects that match the query in their entirety, but after reading up a bit about HATEOAS I am thinking I should be returning a list of URIs instead?
I can see that this could help with caching of items, but I'm worried that there will be a lot of overhead generated by the subsequent multiple HTTP requests required to get the actual object info.
Am I misunderstanding? Is it acceptable to return object instances instead of URIs?
I would return a list of resources with links to more details on those resources.
From the RESTful Web Services Cookbook (2010) by Subbu Allamaraju:
"Design the response of a query as a representation of a collection resource. Set the appropriate expiration caching headers. If the query does not match any resources, return an empty collection."
IMHO it is important to always remember that "pure REST" and "real world REST" are two quite different beasts.
How are you returning the list of URIs from your query in the first place? If you return e.g. application/json, this certainly does not tell the client how it is supposed to interpret the content; therefore, the interaction is already being driven by out-of-band information (the client magically already knows where to look for the data it needs) in conflict with HATEOAS.
So, to answer your question: I find it quite acceptable to return object instances instead of URIs -- but be careful because in the general case this means you are generating all this data without knowing if the client is even going to use it. That's why you will see a hybrid approach quite often: the object instances are not full objects (i.e. a portion of the information the server has is not returned), but they do contain a unique identifier that allows the client to fetch the full representation of selected objects if it chooses to do so.
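The hybrid approach might look roughly like this in Python. The field names (id, title, body), the /articles/ path and the search_articles function are assumptions for the sake of the example: each hit carries a few summary fields plus a link, and the full object is fetched only if the client follows the link.

```python
def search_articles(articles: list, query: str) -> dict:
    """Return a collection of partial representations with links to the full ones."""
    needle = query.lower()
    matches = [a for a in articles if needle in a["title"].lower()]
    return {
        "items": [
            {
                "id": a["id"],
                "title": a["title"],
                # The large "body" field is left out; the link leads to it.
                "href": "/articles/{}".format(a["id"]),
            }
            for a in matches
        ]
    }
```

A query with no matches still returns a collection, just an empty one, which lines up with the cookbook advice quoted earlier.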

Client side id generation strategy for REST web service

Let's say I want to build a REST service for making notes that looks something like this:
GET /notes/ // gives me all notes
GET /notes/{id} // gives the note identified by {id}
DELETE /notes/{id} // delete note
PUT /notes/{id} // creates a new note if there is no note identified by {id}
// otherwise the existing note is updated
Since I want my service to be idempotent, I'm using PUT to create and update my notes, which implies that the ids of new notes are set/generated by the client.
I thought of using GUIDs/UUIDs, but they are pretty long and would make remembering the URLs rather difficult. Also, from a database perspective, such long string ids can be troublesome from a performance point of view when used as a primary key in big tables.
Do you know a good id generation strategy, that generates short ids and of course avoids collisions?
There is a reason why highly distributed systems (like Git, MongoDB, etc.) use long UUIDs/hashes while centralized relational databases (or SVN for that matter) can simply use ints. There is no easy way of creating short ids on the client side in a distributed fashion. Either the server handles them or you must live with wasteful ids. Typically they contain an encoded timestamp, client/computer id, hashed content, etc.
That's why REST services typically use
POST /notes
for a non-idempotent save, and the client then reads the new note's URL from the Location: header of the response:
Location: /notes/42
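To contrast the two styles, here is a toy Python model with an in-memory store. The NoteStore class and its method names are invented, and the (status, headers) tuples only mimic what an HTTP framework would send.

```python
import itertools


class NoteStore:
    """In-memory stand-in for the notes table (illustrative only)."""

    def __init__(self) -> None:
        self._notes = {}
        self._next_id = itertools.count(1)

    def post_note(self, text: str):
        """POST /notes: the server picks a short id and reports it via Location."""
        note_id = next(self._next_id)
        self._notes[note_id] = text
        return 201, {"Location": "/notes/{}".format(note_id)}

    def put_note(self, note_id, text: str):
        """PUT /notes/{id}: the client picks the id; repeating the call is harmless."""
        created = note_id not in self._notes
        self._notes[note_id] = text
        return (201 if created else 200), {}

    def count(self) -> int:
        return len(self._notes)
```

Sending the same POST twice creates two notes with short server-assigned ids, while sending the same PUT twice leaves a single note. That is the trade-off described above: short ids come from the server, and idempotent creation needs an id the client can choose safely.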