What is the modern programming standard for synchronizing data between a web service and a client? - web-services

The question is a little general, so to help narrow the focus, I'll share my current setup that is motivating this question. I have a LAMP web service running a RESTful API. We have two client implementations: one browser-based javascript client (local storage store) and one iOS-based client (core data store). Obviously these two clients store data very differently, but the data itself needs to be kept in two-way sync with the remote server as often as possible.
Currently, our "sync" process is a little dumb (as in, non-smart). Conceptually, it looks like:
Client periodically asks the server for ALL of the most-recent data.
Server sends down the remote data, which overwrites the current set of local data in the client's store.
Any local creates/updates/deletes after this point are treated as gold, and immediately sent to the server.
The data itself is stored relationally, and updated occasionally by client users. The clients in my specific case don't care too much about the relationships themselves (which is why we can get away with local storage in the browser client for now).
Obviously this isn't true synchronization. I want to move to a system where, conceptually, a "diff" of the most recent changes are sent to the server periodically, and the server sends back a "diff" of the most recent changes it knows about. It seems very difficult to get to this point, but maybe I just don't understand the problem very well.
REST feels like a good start, but REST only talks about the way two data stores talk to each other, not how the data itself is synchronized between them. (This sync process is left up to the implementer of each store.) What is the best way to implement this process? Is there a modern set of programming design patterns that apply to inform a specific solution to this problem? I'm mostly interested in a general (technology agnostic) approach if possible... but specific frameworks would be useful to look at too, if they exist.

Multi-master replication is always (and will always be) difficult and bespoke, because how conflicts are handled will be specific to your application.
IMO A more robust approach is to use Master-slave replication, with your web service as the master and the clients as slaves. To keep the clients in sync, use an archived atom feed of the changes (see event sourcing) as per RFC5005. This is the closest you'll get to a modern standard for this type of replication and it's RESTful.
When the clients are online, they do not update their replica directly, instead they send commands to the server and have their replica updated via the atom feed.
When the clients are offline things get difficult. Your clients will need to have a model of how your web service behaves. It will need to have an offline copy of your replica, which should be copied on write from the online replica (the online replica is the one that is updated by the atom feed). When the client executes commands that modify the data, it should store the command (for later replay against the web service), the expected result (for verification during replay) and update the offline replica.
When the client goes back online, it should replay the commands, compare the result with the expected result and notify the client of any variances. How these variances are handled will vary based on your application. The offline replica can then be discarded.

CouchDB replication works over HTTP and does what you are looking to do. Once databases are synced on either end it will send diffs for adds/updates/deletes.
Couch can do this with other Couch machines or with a mobile framework like TouchDB.
https://github.com/couchbaselabs/TouchDB-iOS
I've done a fair amount of it, but you can always set up CouchDB on one machine, set up TouchDB on a mobile device and then watch the HTTP traffic go back and forth to get an idea of how they do it.
Or read this: http://guide.couchdb.org/draft/replication.html
Maybe something from the link above will help you get an idea of how to do your own diffs for your REST service. (Since they are both over HTTP thought it could be useful.)

You may want to look into the Dropbox Datastore API:
https://www.dropbox.com/developers/datastore
It sounds like it might be a very good fit for your purposes. They have iOS and javascript clients.

Lately, I've been interested in Meteor.
The platform sets up Mongo on the server and minimongo in the browser. The client subscribes to some data and when that data changes, the platform automatically sends down the new data to the client.
It's a clever solution to the syncing problem, and it solves several other problems as well. It will be interesting to see if more platforms do this in the future.

Related

Offline capabilities when using REST (Firebase)

So I found this here in the Docs:
Every client sharing a Firebase maintains its own internal version of any active data. When data is updated or saved, it is written to this local version of the Firebase. The Firebase client then synchronizes that data with the Firebase servers and with other clients on a 'best-effort' basis.
As a result, all writes to Firebase will trigger local events immediately, before any data has even been written to the server. This means the app will remain responsive regardless of network latency or Internet connectivity.
Once connectivity is reestablished, we'll receive the appropriate set of events so that the client "catches up" with the current server state, without having to write any custom code.
Yeah, I got that going for me, but to be more specific (and I wasn't able to find an answer to that):
I am using the REST-API in a c++-program, which executes a curl-request. Everything is working so far. For the insertion, this is not a big deal. In case of an error, I can easily store the data via redis or something and update them later, but how does reading work?
To give you a scenario:
I made a scanner, which recognizes an ID. After this process, it is inserted (as explained above) into the Firebase. People can also register on a regarding homepage (and insert their ID manually). This will be saved in the Firebase as well. Same node obviously.
Firebase is designed to provide the data from one endpoint to another by accessing the db, which is fine. The user registered on the site and is inserted in the DB. Suddenly, my internet connection went away.
Is there any way to get the last "stack" or "full dataset", which was used, before my connection went away? Is there a way to replicate the DB and queue jobs, which will sync after the connection is re-established?
Disclaimer: I work for Firebase.
The passage you're quoting above specifically refers to the client libraries that we maintain (currently in Objective-C, Java, and JavaScript) - which are pieces of code that we've written that you would run in your app.
In this case, you're specifically not using a client library - you're just hitting our regular REST endpoint, so you won't get any of the benefits. To implement your own client would be a significant undertaking; it's the client code that maintains the internal view of the data, compensates when it's offline, triggers local events, etc.

Publish/Subscribe/Request for exchange of big, complex, and confidential data?

I am working on a project where a website needs to exchange complex and confidential (and thus encrypted) data with other systems. The data includes personal information, technical drawings, public documents etc.
We would prefer to avoid the Request-Reply pattern to the dependent systems (and there are a LOT of them), as that would create an awful lot of empty traffic.
On the other hand, I am not sure that a pure Publisher/Subscriber pattern would be apropriate -- mainly because of the complex and bulky nature of the data to be exchanged.
For that reason we have discussed the possibility of a "publish/subscribe/request" solution. The Publish/Subscribe part would be to publish a message to the dependent systems, that something is ready for pickup. The actual content is then picked up by old-school Request-Reply action.
How does this sound to you??
Regards,
Morten
If the systems are always online, it sounds good.
You might want to look at PubSubHubbub because:
1. Don't solve a problem that has already been solved 2. It is scalable and represents a good separation of concern.
It involves 3 parties:
Publishers (who publish stuff)
Subscribers (who are interested in certain publications)
Hubs (who mediate and get rid of 'polling')
It works in the following way:
A subscriber, registers their interest in a URL with a Hub and provides a callback URL.
A publisher, notifies the hub when publishing content.
A hub fetches the 'delta' and pushes it to interested subscribers.
The protocol itself is an extension to Atom, but it seems to fit your requirement, e.g. the new Atom 'content' could be an item containing URLs to newly published documents (which can then be downloaded separately).
New/modified documents => new/modified items in feed containing URLs to fetch them => Hub => Subscribers => Pull documents from Publisher
I don't have a great experience about this, but a messaging queue should help you accomplish what you need. I am using such a system while managing publishing data to multiple front end clients from a backend.
If the client is off, the data is not consumed and the server receives no acknowledgement of data being reveived. Once the client comes back online he consumes the data and remains listening for more messages onve the queue is clear. And ofc the publisher receives a ack for data being consumed. In this way we can identify and notify people who have problems at the receiving end as a bonus. Could this do it in your case?
This approach works if the dependent systems are always online - you can't send messages to PCs that are turned off for the night/weekend.
So if the clients are servers that run 24/7, this works. Otherwise, try this approach:
Let clients register themselves
When new documents come in, add an entry "client X needs to see this" in your database
When clients connect, send them all the entries.
When clients successfully downloaded a document, delete the "client X needs to see this" entry. That keeps the work table small.
This has several advantages:
Clients don't need to run 24/7
The flag is only removed after the client has seen the document (so no updates can be lost).
You have one place where you can see which client never pulls it's documents. A simple select client, count(*) group by client having count(*) > 10 tells you about problems.
Most clients will fetch their data timely, so the work table will stay small. That means there is little overhead when you have to collect the "what's now" data.
EDIT The problem with off-line subscribers is that they don't know what they're missing. So the sending side needs to keep track of the failed push/pull requests. Which means you must implement my suggested pseudo-code to make sure broken connections can be resumed.

Designing an architecture for exchanging data between two systems

I've been tasked with creating an intermediate layer which needs to exchange data (over HTTP) between two independent systems (e.g. Receiver <=> Intermediate Layer (IL) <=> Sender). Receiver and Sender both expose a set of API's via Web Services. Everytime a transaction occurs in the Sender system, the IL should know about it (I'm thinking of creating a Windows Service which constantly pings the Sender), massage the data, then deliver it to the Receiver. The IL can temporarily store the data in a SQL database until it is transferred to the Receiver. I have the following questions -
Can WCF (haven't used it a lot) be used to talk to the Sender and Receiver (both expose web services)?
How do I ensure guaranteed delivery?
How do I ensure security of the messages over the Internet?
What are best practices for handling concurrency issues?
What are best practices for error handling?
How do I ensure reliability of the data (data is not tampered along the way)
How do I ensure the receipt of the data back to the Sender?
What are the constraints that I need to be aware of?
I need to implement this on MS platform using a custom .NET solution. I was told not to use any middleware like BizTalk. The receiver is an SDFC instance, if that matters.
Any pointers are greatly appreciated. Thank you.
A Windows Service that orchestras the exchange sounds fine.
Yes WCF can deal with traditional Web Services.
How do I ensure guaranteed delivery?
To ensure delivery you can use TransactionScope to handle the passing of data between the
Receiver <=> Intermediate Layer and Intermediate Layer <=> Sender but I wouldn't try and do them together.
You might want to consider some sort of queuing mechanism to send the data to the receiver; I guess I'm thinking more of a logical queue rather than an actual queuing component. A workflow framework could also be an option.
make sure you have good logging / auditing in place; make sure it's rock solid, has the right information and is easy to read. Assuming you write a service it will execute without supervision so the operational / support aspects are more demanding.
Think about scenarios:
How do you manage failed deliveries?
What happens if the reciever (or sender) is unavailbale for periods of time (and how long is that?); for example: do you need to "escalate" to an operator via email?
How do I ensure security of the messages over the Internet?
HTTPS. Assuming other existing clients make calls to the Web Services how do they ensure security? (I'm thinking encryption).
What are best practices for handling concurrency issues?
Hmm probably a separate question. You should be able to find information on that easily enough. How much data are we taking? what sort of frequency? How many instances of the Windows Service were you thinking of having - if one is enough why would concurrency be an issue?
What are best practices for error handling?
Same as for concurrency, but I can offer some pointers:
Use an established logging framework, I quite like MS EntLibs but there are others (re-using whatever's currently used is probably going to make more sense - if there is anything).
Remember that execution is unattended so ensure information is complete, clear and unambiguous. I'd be tempted to log more and dial it down once a level of comfort is reached.
use a top level handler to ensure nothing get's lost; but don;t be afraid to log deep in the application where you can still get useful context (like the metadata of the data being sent / recieved).
How do I ensure the receipt of the data back to the Sender?
Include it (sending the receipt) as a step that is part of the transaction.
On a different angle - have a look on CodePlex for ESB type libraries, you might find something useful: http://www.codeplex.com/site/search?query=ESB&ac=8
For example ESBasic which seems to be a class library which you could reuse.

Ideal way/architecture to deliver large data over Web Services

We are trying to design 6 web services, which will serve another client component. The client component requires data from the web service we are implementing.
Now, the problem is, there is not 1 Web Service we are implementing, there is one Web Service which the client component hits, this initiates a series (5 more) of Web Services which gather data from their respective data stores and finally provide the data back to the original Web Service, which then delivers the data back to the client component.
So, if the requested data becomes huge, then, this will be a serious problem for our internal communication channel.
So, what do you guys suggest? What can be done to avoid overloading of the communication channel between the internal Web Service and at the same time, also delivering the data to the client component.
Update 1
Using 5 WS, where, 1WS does not know about the others, except the next one is a business requirement. Actually, 5 companies "small services" are being integrated.
We use Java and Axis2
We've had a similar problem. Apart from trying to avoid it (eg for internal communication go direct to db instead of web service) you can mitigate it by at least not performing the 5 or so tasks in series. Make new threads to collect them all in parallel and process them at the end to reduce latency (except where they might contend for the same resource and bottle neck).
But before I'd do anything load test it and see if it is even an issue and get some baseline stats so you can see what improvement each change makes. Also sometimes you might be better off tweaking network settings or the actual network rather than trying to optimise the code - but again test and see.
Put all the data on a temporary compressed file and give back the ftp url of the file.
The client fetches the big data chunk uncompress it and reads it. (maybe some authentication mechanism for the ftp server)

Message queuing solutions?

(Edited to try to explain better)
We have an agent, written in C++ for Win32. It needs to periodically post information to a server. It must support disconnected operation. That is: the client doesn't always have a connection to the server.
Note: This is for communication between an agent running on desktop PCs, to communicate with a server running somewhere in the enterprise.
This means that the messages to be sent to the server must be queued (so that they can be sent once the connection is available).
We currently use an in-house system that queues messages as individual files on disk, and uses HTTP POST to send them to the server when it's available.
It's starting to show its age, and I'd like to investigate alternatives before I consider updating it.
It must be available by default on Windows XP SP2, Windows Vista and Windows 7, or must be simple to include in our installer.
This product will be installed (by administrators) on a couple of hundred thousand PCs. They'll probably use something like Microsoft SMS or ConfigMgr. In this scenario, "frivolous" prerequisites are frowned upon. This means that, unless the client-side code (or a redistributable) can be included in our installer, the administrator won't be happy. This makes MSMQ a particularly hard sell, because it's not installed by default with XP.
It must be relatively simple to use from C++ on Win32.
Our client is an unmanaged C++ Win32 application. No .NET or Java on the client.
The transport should be HTTP or HTTPS. That is: it must go through firewalls easily; no RPC or DCOM.
It should be relatively reliable, with retries, etc. Protection against replays is a must-have.
It must be scalable -- there's a lot of traffic. Per-message impact on the server should be minimal.
The server end is C#, currently using ASP.NET to implement a simple HTTP POST mechanism.
(The slightly odd one). It must support client-side in-memory queues, so that we can avoid spinning up the hard disk. It must allow flushing to disk periodically.
It must be suitable for use in a proprietary product (i.e. no GPL, etc.).
How is your current solution showing its age?
I would push the logic on to the back end, and make the clients extremely simple.
Messages are simply stored in the file system. Have the client write to c:/queue/{uuid}.tmp. When the file is written, rename it to c:/queue/{uuid}.msg. This makes writing messages to the queue on the client "atomic".
A C++ thread wakes up, scans c:\queue for "*.msg" files, and if it finds one it then checks for the server, and HTTP POSTs the message to it. When it receives the 200 status back from the server (i.e. it has got the message), then it can delete the file. It only scans for *.msg files. The *.tmp files are still being written too, and you'd have a race condition trying to send a msg file that was still being written. That's what the rename from .tmp is for. I'd also suggest scanning by creation date so early messages go first.
Your server receives the message, and here it can to any necessary dupe checking. Push this burden on the server to centralize it. You could simply record every uuid for every message to do duplication elimination. If that list gets too long (I don't know your traffic volume), perhaps you can cull it of items greater than 30 days (I also don't know how long your clients can remain off line).
This system is simple, but pretty robust. If the file sending thread gets an error, it will simply try to send the file next time. The only time you should be getting a duplicate message is in the window between when the client gets the 200 ack from the server and when it deletes the file. If the client shuts down or crashes at that point, you will have a file that has been sent but not removed from the queue.
If your clients are stable, this is a pretty low risk. With the dupe checking based on the message ID, you can mitigate that at the cost of some bookkeeping, but maintaining a list of uuids isn't spectacularly daunting, but again it does depend on your message volume and other performance requirements.
The fact that you are allowed to work "offline" suggests you have some "slack" in your absolute messaging performance.
To be honest, the requirements listed don't make a lot of sense and show you have a long way to go in your MQ learning. Given that, if you don't want to use MSMQ (probably the easiest overall on Windows -- but with [IMO severe] limitations), then you should look into:
qpid - Decent use of AMQP standard
zeromq - (the best, IMO, technically but also requires the most familiarity with MQ technologies)
I'd recommend rabbitmq too, but that's an Erlang server and last I looked it didn't have usuable C or C++ libraries. Still, if you are shopping MQ, take a look at it...
[EDIT]
I've gone back and reread your reqs as well as some of your comments and think, for you, that perhaps client MQ -> server is not your best option. I would maybe consider letting your client -> server operations be HTTP POST or SOAP and allow the HTTP endpoint in turn queue messages on your MQ backend. IOW, abstract away the MQ client into an architecture you have more control over. Then your C++ client would simply be HTTP (easy), and your HTTP service (likely C# / .Net from reading your comments) can interact with any MQ backend of your choice. If all your HTTP endpoint does is spawn MQ messages, it'll be pretty darned lightweight and can scale through all the traditional load balancing techniques.
Last time I wanted to do any messaging I used C# and MSMQ. There are MSMQ libraries available that make using MSMQ very easy. It's free to install on both your servers and never lost a message to this day. It handles reboots etc all by itself. It's a thing of beauty and 100,000's of message are processed daily.
I'm not sure why you ruled out MSMQ and I didn't get point 2.
Quite often for queues we just dump record data into a database table and another process lifts rows out of the table periodically.
How about using Asynchronous Agents library from .NET Framework 4.0. It is still beta though.
http://msdn.microsoft.com/en-us/library/dd492627(VS.100).aspx