In order to stick with the REST concepts, such as safe operations, idempotency, etc., how can one implement a complex search operation involving multiple parameters?
I have seen Google's implementation, and that is creative. What are the options other than that?
The idempotent requirement is what is tripping me up, as the operation will definitely not return the same results for the same criteria: say, searching for customers named "Smith" will not return the same set every time, because more "Smith" customers are added all the time. My instinct is to use GET for this, but for a true search feature the result would not seem to be idempotent, and it would need to be marked as non-cacheable due to its fluid result set.
To put it another way, the basic idea behind idempotency is that the GET operation doesn't affect the resource itself. That is, the GET can safely be repeated with no ill side effects.
However, an idempotent request has nothing to do with the representation of the resource.
Two contrived examples:
GET /current-time
GET /current-weather/90210
As should be obvious, these resources will change over time; some resources change more rapidly than others. But the GET operation itself does not affect the actual resource.
Contrast to:
GET /next-counter
This is, obviously I hope, not an idempotent request. The request itself is changing the resource.
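To make the contrast concrete, here is a minimal sketch, assuming a Python/Flask (2.x) stack with made-up routes and handler names: the first handler only reads state, while the second mutates the very resource it returns.

from datetime import datetime, timezone

from flask import Flask, jsonify

app = Flask(__name__)
counter = 0  # server-side state


@app.get("/current-time")
def current_time():
    # Safe and idempotent: the representation changes over time,
    # but the request itself does not touch any server-side state.
    return jsonify(now=datetime.now(timezone.utc).isoformat())


@app.get("/next-counter")
def next_counter():
    # NOT idempotent: every GET advances the counter, so the request
    # itself changes the resource. This belongs behind POST, not GET.
    global counter
    counter += 1
    return jsonify(counter=counter)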
Also, there's nothing that says an idempotent operation has NO side effects. Clearly, many systems log accesses and requests, including GETs. Therefore, when you do GET /resource, the logs will change as a result of that GET. That kind of side effect doesn't make the GET non-idempotent. The fundamental premise is the effect on the resource itself.
But what about, say:
GET /logs
If the logs register every request, and the GET is returning the logs in their current state, does that mean that the GET in this case is not idempotent? Yup! Does it really matter? Nope. Not for this one edge case. Just the nature of the game.
What about:
GET /random-number
If you're using a pseudo-random number generator, most of those feed upon themselves: starting with a seed and feeding their results back into themselves to get the next number. So, using a GET here may not be idempotent. But is it? How do you know how the random number is generated? It could be a white noise source. And why do you care? If the resource is simply a random number, you really don't know if the operation is changing it or not.
But just because there may be exceptions to the guidelines, doesn't necessarily invalidate the concepts behind those guidelines.
Resources change; that's a simple fact of life. The representation of a resource does not have to be universal, or consistent across requests, or consistent across users. Literally, the representation of a resource is what GET delivers, and it is up to the application, using who knows what criteria, to determine that representation for each request. Idempotent requests are very nice because they work well with the rest of the REST model -- things like caching and content negotiation.
Most resources don't change quickly, and relying on specific transactions, using non-idempotent verbs, offers a more predictable and consistent interface for clients. When a method is supposed to be idempotent, clients will be quite surprised when it turns out not to be the case. But in the end, it's up to the application and its documented interface.
GET is safe and idempotent when properly implemented. That means:
It will cause no client-visible side-effects on the server side
When directed at the same URI, it causes the same server-side function to be executed each time, regardless of how many times it is issued, or when
What is not said above is that GET to the same URI always returns the same data.
GET causes the same server-side function to be executed each time, and that function is typically, "return a representation of the requested resource". If that resource has changed since the last GET, the client will get the latest data. The function which the server executes is the source of the idempotency, not the data which it uses as input (the state of the resource being requested).
If a timestamp is used in the URI to make sure that the server data being requested is the same each time, that just means that something which is already idempotent (the function implementing GET) will act upon the same data, thereby guaranteeing the same result each time.
It would be idempotent for the same dataset. You could achieve this with a timestamp filter.
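As a sketch of that idea, assuming Flask with an in-memory list standing in for a real datastore (all names are illustrative): pinning the search to an explicit as_of timestamp means repeated GETs of the same URI evaluate the same function over the same dataset.

from datetime import datetime

from flask import Flask, jsonify, request

app = Flask(__name__)

CUSTOMERS = [
    {"name": "Smith", "created": datetime(2023, 1, 5)},
    {"name": "Smith", "created": datetime(2023, 6, 1)},
    {"name": "Jones", "created": datetime(2023, 3, 2)},
]


@app.get("/customers")
def search_customers():
    # e.g. GET /customers?name=Smith&as_of=2023-04-01T00:00:00
    name = request.args.get("name", "")
    as_of = datetime.fromisoformat(request.args["as_of"])
    # Customers created after "as_of" are excluded, so the same URI
    # always yields the same result set, even as new rows are added.
    matches = [c for c in CUSTOMERS
               if c["name"] == name and c["created"] <= as_of]
    return jsonify(results=[c["name"] for c in matches])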
I have noticed that using "Exactly once delivery" affects performance when using pull and acknowledge. Pull and acknowledge requests take up to 5 times longer, ~0.2s. If I disable "Exactly once delivery", the response is much faster, under 0.05s for both pull and acknowledge. I tested using curl and PHP with similar results (reusing an existing connection).
I am concerned about the consequences of disabling this feature. How often do duplicates occur if this feature is disabled? Are there ways to avoid duplicates without enabling this feature?
For example, if I have an acknowledge deadline of 60 seconds, and I pull a message then pull again after 10 seconds, could I get the same message again? It's unclear from the docs how often duplicates will occur and under what circumstances they will occur if this option is disabled.
How often do duplicates occur if this feature is disabled?
Not super often in my experience, but this doesn't matter: your system needs to be able to handle them one way or another, because it will happen.
Are there ways to avoid duplicates without enabling this feature?
On Google's side? No; otherwise, what would be the point of the option? You should either de-duplicate with the messageID, by only processing each ID once, or make sure that whatever operation you perform is idempotent. Or you don't bother, hope it doesn't happen often, and live with the consequences (be it crashing, having corruption somewhere that you may or may not fix, ...).
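A de-duplication sketch along those lines, using the Python Pub/Sub client (the project and subscription names are made up, and the in-memory set of seen IDs stands in for a durable store such as Redis or a database):

from google.cloud import pubsub_v1

seen_ids = set()  # illustration only: a real system needs a durable store

def process(data: bytes) -> None:
    print("processing", data)  # stand-in for your (ideally idempotent) handler

def callback(message):
    if message.message_id in seen_ids:
        message.ack()  # already processed: acknowledge the duplicate and drop it
        return
    process(message.data)
    seen_ids.add(message.message_id)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()  # block and keep pulling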
It's unclear from the docs how often duplicates will occur and under what circumstances they will occur if this option is disabled.
Pub/Sub is a complex, highly scalable distributed system; duplicated messages are not an intended feature on a fixed schedule, they are a necessary evil if you want high performance. Nobody can predict when they will happen, only that they can occur.
In the system I use, duplicates were happening often enough to cause us massive problems.
User sends a write request.
Someone then sends a read request for that resource.
The read request arrives before the write request, so the data it returns is stale, but the reader has no way of knowing that it's stale yet.
Likewise, you could also have two write requests to the same resource where the later write request arrives first.
How is it possible to provide strong consistency in a distributed system when race conditions like this can happen?
What is consistency? You say two writes arrive "out of order", but what established that order? The thing that establishes that order is your basis for consistency.
A simple basis is a generation number: any object O is augmented with a version N. When you retrieve O, you also retrieve N. When you write, you write to O.N. If the object is already at O.N+1 when the write to O.N arrives, the write is stale and generates an error. Multiple versions of O remain available for some period.
Of course, you can't readily replicate the object with this in any widely distributed system, since two disconnected owners of O could be permitting different operations that would be impossible to unify. Etcd, for example, solves this in a limited sense. Blockchain solves it in a wider sense.
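A minimal single-owner sketch of that generation-number check (in-memory, made-up names, ignoring the replication problem just mentioned):

class StaleWriteError(Exception):
    pass

class VersionedStore:
    def __init__(self):
        self._objects = {}  # key -> (version, value)

    def read(self, key):
        return self._objects.get(key, (0, None))  # (version, value)

    def write(self, key, value, expected_version):
        current_version, _ = self._objects.get(key, (0, None))
        if expected_version != current_version:
            # The object has moved on to a newer generation; this write is stale.
            raise StaleWriteError(
                f"{key}: write against v{expected_version}, object is at v{current_version}")
        self._objects[key] = (current_version + 1, value)
        return current_version + 1

store = VersionedStore()
version, _ = store.read("O")
store.write("O", "first write", expected_version=version)      # succeeds, O is now at v1
try:
    store.write("O", "late write", expected_version=version)   # stale: still targets v0
except StaleWriteError as err:
    print(err)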
In order to reduce data transfer size and the computational time for serializing world objects for each worldUpdate, I was wondering if it is possible to omit syncs for objects whose physics can be entirely, faithfully simulated on the client-side gameEngine (they are not playerObjects so playerInput does not affect them directly, and their physics are entirely deterministic). Interactions with these GameObjects would be entirely handled by GameEvents that are much less frequent. I feel like this should be possible if the client is running the same physics as the server and has access to the same initial conditions.
When I try to omit GameObjects from subsequent worldUpdates, I see that their motion becomes more choppy and they move faster than if they were not omitted; however, when I stop the game server while keeping the client open, their motion is more like what I would expect if I hadn't omitted them. This is all on my local machine with extrapolation synchronization.
The short answer is that the latest version of Lance (1.0.8 at the time of this writing) doesn't support user omission of game objects from world updates, but it does implement a diffing mechanism that omits objects from the update if their netScheme properties haven't changed, saving on bandwidth.
This means that if you have static objects, like walls, for example, they will only get transmitted once for each player. Not transmitting them at all would be an interesting feature to have.
If the objects you're referring to are not static, then there is no real way to know their position deterministically. You might have considered using the world step count, but different clients process different world steps at different times due to the web's inherent latency. A client can't know which step the server is truly handling at a given point in time, so it cannot deterministically decide on such an object's position. This is why Lance uses the authoritative server model - to allow one single source of truth, and make sure clients are synched up.
If you still want to manually avoid sending updates for an object, you can edit its netScheme so that it doesn't return anything but its id, for example:
static get netScheme() {
    return {
        id: { type: Serializer.TYPES.INT32 }
    };
}
Though this is not a typical use, for the aforementioned reasons, so if you encounter specific sync issues and this is still a feature you're interested in, it's best to submit a feature request in the Lance issue tracker. Make sure to include details on your use case to promote a healthy discussion.
I'm always checking the return value of Message::GetDescriptor() before using it, but when would it ever return null? Is it perhaps unnecessary to check the return value?
The docs:
https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.message#Message.GetDescriptor.details
Declaration:
const Descriptor *
Message::GetDescriptor() const
You should always check the return value of virtually every API your code invokes, and should never make any kind of assumption, however reliable the API may be. APIs fail for a variety of reasons beyond human control:
Network condition fluctuations, including PHY disruption (not applicable in this case)
The system running the API implementation running out of resources, such as disk space
System overload (too busy with other processes)
Unreliable API implementation (bug)
etc.
Since the API is from Google, naively assuming that the fourth reason can never be true simply reduces the robustness of your software. For 99.99% of the time it might just seem to be a redundant check or overprotective code - but for that 0.01% of the time when it fails, you have unreliable behavior from your software.
The costliest bugs that could have easily been avoided (if not fixed), from my experience over the years, are a result of overlooking simple and basic error handling.
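If you do keep the check, a tiny guard helper keeps it from cluttering call sites. This is a generic defensive pattern sketched in Python, not part of the protobuf API; the wrapped call in the comment is purely hypothetical:

def require(value, what: str):
    # Fail loudly at the call site instead of letting a null/None
    # propagate deeper into the program.
    if value is None:
        raise RuntimeError(f"{what} unexpectedly returned None")
    return value

# Hypothetical usage with a binding that may return None:
# descriptor = require(message.GetDescriptor(), "Message.GetDescriptor()")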
You don't have to check it; for each message you should get a non-NULL pointer.
I've got some data that I want to save on Amazon S3. Some of this data is encrypted and some is compressed. Should I be worried about single bit flips? I know of the MD5 hash header that can be added. This (from my experience) will prevent flips in the most unreliable portion of the deal (network communication); however, I'm still wondering whether I need to guard against flips on disk.
I'm almost certain the answer is "no", but if you want to be extra paranoid you can precalculate the MD5 hash before uploading, compare that to the MD5 hash you get after upload, then when downloading calculate the MD5 hash of the downloaded data and compare it to your stored hash.
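That belt-and-braces check is only a few lines with hashlib (the file names here are made up; the actual upload and download are left to whichever S3 client you use):

import hashlib

def md5_hex(path: str) -> str:
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = md5_hex("backup.tar.gz")          # before upload
# ... upload to S3, later download to backup.restored.tar.gz ...
actual = md5_hex("backup.restored.tar.gz")   # after download
if actual != expected:
    raise RuntimeError("downloaded data does not match the hash stored before upload")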
I'm not sure exactly what risk you're concerned about. At some point you have to defer the risk to somebody else. Does "corrupted data" fall under Amazon's Service Level Agreement? Presumably they know what the file hash is supposed to be, and if the hash of the data they're giving you doesn't match, then it's clearly their problem.
I suppose there are other approaches too:
Store your data with an FEC so that you can detect and correct N bit errors up to your choice of N.
Store your data more than once in Amazon S3, perhaps across their US and European data centers (I think there's a new one in Singapore coming online soon too), with RAID-like redundancy so you can recover your data if some number of sources disappear or become corrupted.
It really depends on just how valuable the data you're storing is to you, and how much risk you're willing to accept.
I see your question from two points of view, a theoretical and practical.
From a theoretical point of view, yes, you should be concerned - and not only about bit flipping, but about several other possible problems. In particular, section 11.5 of the customer agreement says that Amazon
MAKE NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, WHETHER EXPRESS, IMPLIED, STATUTORY OR OTHERWISE WITH RESPECT TO THE SERVICE OFFERINGS. (..omiss..) WE AND OUR LICENSORS DO NOT WARRANT THAT THE SERVICE OFFERINGS WILL FUNCTION AS DESCRIBED, WILL BE UNINTERRUPTED OR ERROR FREE, OR FREE OF HARMFUL COMPONENTS, OR THAT THE DATA YOU STORE WITHIN THE SERVICE OFFERINGS WILL BE SECURE OR NOT OTHERWISE LOST OR DAMAGED.
Now, in practice, I'd not be concerned. If your data is lost, you'll blog about it, and (although they might not face any legal action) their business will be pretty much over.
On the other hand, it depends on how vital your data is. Suppose that you were rolling your own stuff in your own data center(s). How would you plan for disaster recovery there? If you say "I'd just keep two copies in two different racks", then just use the same technique with Amazon, maybe keeping two copies in two different data centers (since you wrote that you are not interested in how to protect against bit flips, I'm providing only a trivial example here).
Probably not: Amazon uses checksums to protect against bit flips, regularly combing through data at rest and ensuring that no bit flips have occurred. So, unless you have corruption in all instances of the data within the interval of the integrity-check loops, you should be fine.
Internally, S3 uses MD5 checksums throughout the system to detect/protect against bitflips. When you PUT an object into S3, we compute the MD5 and store that value. When you GET an object we recompute the MD5 as we stream it back. If our stored MD5 doesn't match the value we compute as we're streaming the object back we'll return an error for the GET request. You can then retry the request.
We also continually loop through all data at rest, recomputing checksums and validating them against the MD5 we saved when we originally stored the object. This allows us to detect and repair bit flips that occur in data at rest. When we find a bit flip in data at rest, we repair it using the redundant data we store for each object.
You can also protect yourself against bitflips during transmission to and from S3 by providing an MD5 checksum when you PUT the object (we'll error if the data we received doesn't match the checksum) and by validating the MD5 when you GET an object.
Source:
https://forums.aws.amazon.com/thread.jspa?threadID=38587
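On the client side, for instance with boto3 (the bucket and key names here are made up), you can supply the Content-MD5 header on PUT so S3 rejects a corrupted upload, and re-check the hash yourself after GET:

import base64
import hashlib

import boto3

s3 = boto3.client("s3")

with open("backup.tar.gz", "rb") as f:
    body = f.read()
md5_digest = hashlib.md5(body).digest()

# S3 verifies the body against Content-MD5 and rejects the PUT on mismatch.
s3.put_object(
    Bucket="my-backup-bucket",
    Key="backup.tar.gz",
    Body=body,
    ContentMD5=base64.b64encode(md5_digest).decode("ascii"),
)

# Later: download and compare against the hash we kept.
downloaded = s3.get_object(Bucket="my-backup-bucket", Key="backup.tar.gz")["Body"].read()
assert hashlib.md5(downloaded).digest() == md5_digest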
There are two ways of reading your question:
"Is Amazon S3 perfect?"
"How do I handle the case where Amazon S3 is not perfect?"
The answer to (1) is almost certainly "no". They might have lots of protection to get close, but there is still the possibility of failure.
That leaves (2). The fact is that devices fail, sometimes in obvious ways and other times in ways that appear to work but give an incorrect answer. To deal with this, many databases use a per-page CRC to ensure that a page read from disk is the same as the one that was written. This approach is also used in modern filesystems (for example ZFS, which can write multiple copies of a page, each with a CRC, to handle RAID controller failures. I have seen ZFS correct single-bit errors from a disk by reading a second copy; disks are not perfect.)
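The per-page checksum idea is easy to sketch (a toy version using zlib.crc32; real databases and filesystems do this at the storage layer, with stronger checksums and redundant copies):

import struct
import zlib

def write_page(data: bytes) -> bytes:
    # Store a 4-byte CRC header in front of the page payload.
    return struct.pack(">I", zlib.crc32(data)) + data

def read_page(stored: bytes) -> bytes:
    (crc,) = struct.unpack(">I", stored[:4])
    data = stored[4:]
    if zlib.crc32(data) != crc:
        raise IOError("page checksum mismatch: silent corruption detected")
    return data

page = write_page(b"hello, page")
assert read_page(page) == b"hello, page"

flipped = page[:10] + bytes([page[10] ^ 0x01]) + page[11:]  # simulate a single bit flip
try:
    read_page(flipped)
except IOError as err:
    print(err)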
In general you should have a check to verify that your system is operating as you expect. Using a hash function is a good approach. What approach you take when you detect a failure depends on your requirements. Storing multiple copies is probably the best approach (and certainly the easiest) because you can get protection from site failures, connectivity failures, and even vendor failures (by choosing a second vendor), instead of just redundancy in the data itself via FEC.