Publish/Subscribe/Request for exchange of big, complex, and confidential data? - web-services

I am working on a project where a website needs to exchange complex and confidential (and thus encrypted) data with other systems. The data includes personal information, technical drawings, public documents etc.
We would prefer to avoid the Request-Reply pattern to the dependent systems (and there are a LOT of them), as that would create an awful lot of empty traffic.
On the other hand, I am not sure that a pure Publisher/Subscriber pattern would be apropriate -- mainly because of the complex and bulky nature of the data to be exchanged.
For that reason we have discussed the possibility of a "publish/subscribe/request" solution. The Publish/Subscribe part would be to publish a message to the dependent systems, that something is ready for pickup. The actual content is then picked up by old-school Request-Reply action.
How does this sound to you??
Regards,
Morten

If the systems are always online, it sounds good.
You might want to look at PubSubHubbub because:
1. Don't solve a problem that has already been solved 2. It is scalable and represents a good separation of concern.
It involves 3 parties:
Publishers (who publish stuff)
Subscribers (who are interested in certain publications)
Hubs (who mediate and get rid of 'polling')
It works in the following way:
A subscriber, registers their interest in a URL with a Hub and provides a callback URL.
A publisher, notifies the hub when publishing content.
A hub fetches the 'delta' and pushes it to interested subscribers.
The protocol itself is an extension to Atom, but it seems to fit your requirement, e.g. the new Atom 'content' could be an item containing URLs to newly published documents (which can then be downloaded separately).
New/modified documents => new/modified items in feed containing URLs to fetch them => Hub => Subscribers => Pull documents from Publisher

I don't have a great experience about this, but a messaging queue should help you accomplish what you need. I am using such a system while managing publishing data to multiple front end clients from a backend.
If the client is off, the data is not consumed and the server receives no acknowledgement of data being reveived. Once the client comes back online he consumes the data and remains listening for more messages onve the queue is clear. And ofc the publisher receives a ack for data being consumed. In this way we can identify and notify people who have problems at the receiving end as a bonus. Could this do it in your case?

This approach works if the dependent systems are always online - you can't send messages to PCs that are turned off for the night/weekend.
So if the clients are servers that run 24/7, this works. Otherwise, try this approach:
Let clients register themselves
When new documents come in, add an entry "client X needs to see this" in your database
When clients connect, send them all the entries.
When clients successfully downloaded a document, delete the "client X needs to see this" entry. That keeps the work table small.
This has several advantages:
Clients don't need to run 24/7
The flag is only removed after the client has seen the document (so no updates can be lost).
You have one place where you can see which client never pulls it's documents. A simple select client, count(*) group by client having count(*) > 10 tells you about problems.
Most clients will fetch their data timely, so the work table will stay small. That means there is little overhead when you have to collect the "what's now" data.
EDIT The problem with off-line subscribers is that they don't know what they're missing. So the sending side needs to keep track of the failed push/pull requests. Which means you must implement my suggested pseudo-code to make sure broken connections can be resumed.

Related

What is the modern programming standard for synchronizing data between a web service and a client?

The question is a little general, so to help narrow the focus, I'll share my current setup that is motivating this question. I have a LAMP web service running a RESTful API. We have two client implementations: one browser-based javascript client (local storage store) and one iOS-based client (core data store). Obviously these two clients store data very differently, but the data itself needs to be kept in two-way sync with the remote server as often as possible.
Currently, our "sync" process is a little dumb (as in, non-smart). Conceptually, it looks like:
Client periodically asks the server for ALL of the most-recent data.
Server sends down the remote data, which overwrites the current set of local data in the client's store.
Any local creates/updates/deletes after this point are treated as gold, and immediately sent to the server.
The data itself is stored relationally, and updated occasionally by client users. The clients in my specific case don't care too much about the relationships themselves (which is why we can get away with local storage in the browser client for now).
Obviously this isn't true synchronization. I want to move to a system where, conceptually, a "diff" of the most recent changes are sent to the server periodically, and the server sends back a "diff" of the most recent changes it knows about. It seems very difficult to get to this point, but maybe I just don't understand the problem very well.
REST feels like a good start, but REST only talks about the way two data stores talk to each other, not how the data itself is synchronized between them. (This sync process is left up to the implementer of each store.) What is the best way to implement this process? Is there a modern set of programming design patterns that apply to inform a specific solution to this problem? I'm mostly interested in a general (technology agnostic) approach if possible... but specific frameworks would be useful to look at too, if they exist.
Multi-master replication is always (and will always be) difficult and bespoke, because how conflicts are handled will be specific to your application.
IMO A more robust approach is to use Master-slave replication, with your web service as the master and the clients as slaves. To keep the clients in sync, use an archived atom feed of the changes (see event sourcing) as per RFC5005. This is the closest you'll get to a modern standard for this type of replication and it's RESTful.
When the clients are online, they do not update their replica directly, instead they send commands to the server and have their replica updated via the atom feed.
When the clients are offline things get difficult. Your clients will need to have a model of how your web service behaves. It will need to have an offline copy of your replica, which should be copied on write from the online replica (the online replica is the one that is updated by the atom feed). When the client executes commands that modify the data, it should store the command (for later replay against the web service), the expected result (for verification during replay) and update the offline replica.
When the client goes back online, it should replay the commands, compare the result with the expected result and notify the client of any variances. How these variances are handled will vary based on your application. The offline replica can then be discarded.
CouchDB replication works over HTTP and does what you are looking to do. Once databases are synced on either end it will send diffs for adds/updates/deletes.
Couch can do this with other Couch machines or with a mobile framework like TouchDB.
https://github.com/couchbaselabs/TouchDB-iOS
I've done a fair amount of it, but you can always set up CouchDB on one machine, set up TouchDB on a mobile device and then watch the HTTP traffic go back and forth to get an idea of how they do it.
Or read this: http://guide.couchdb.org/draft/replication.html
Maybe something from the link above will help you get an idea of how to do your own diffs for your REST service. (Since they are both over HTTP thought it could be useful.)
You may want to look into the Dropbox Datastore API:
https://www.dropbox.com/developers/datastore
It sounds like it might be a very good fit for your purposes. They have iOS and javascript clients.
Lately, I've been interested in Meteor.
The platform sets up Mongo on the server and minimongo in the browser. The client subscribes to some data and when that data changes, the platform automatically sends down the new data to the client.
It's a clever solution to the syncing problem, and it solves several other problems as well. It will be interesting to see if more platforms do this in the future.

How can you effectively use web services in an enterprise environment if you can't use transactions?

The place I'm working at is trying to establish some ground rules, and the debate we're having now is local libraries vs web services for code reuse. Web services seem to be the popular pick in most companies, and that's what most of the developers here are leaning toward.
I just can't see how you can effectively use web services for any serious work. How can I safely execute multiple service calls if I can't use a transaction?
Let's say I have a cron job that grabs customers from our database who meet a certain condition that they need to be notified of. They are sent a fax, an email, and a ticket is created to track the issue internally. That is 3 different service calls that would happen for each customer in a for loop.
If an error occurs anywhere in there, it's possible that, for example, a fax and email is sent to the customer, but a ticket is not created. Or worse, this cron job could contain a bug on that causes it to fail at the same point every time, and it repeatedly emails the same customer. If the libraries were all local, everything could just be wrapped in a transaction, and none of that would happen. But we're using web services in this example.
Note that the email and fax methods actually insert the data into an email queue and a fax queue, which in turn are handled with a cron job. So the call to the "send email" and "send fax" service methods would be safe to rollback.
An option is to put this entire chunk of code in the web service itself, so the web service itself would call the email, fax, and ticket creating methods in a transaction. But then we're creating a web service method just for the use of a transaction; there is no valid reason we would ever actually need to call this method from anywhere except this one cron script.
How would you generally handle this method?
I would handle it by building a SAGA, an abstraction over a long running business process that has internal state, responds to external events, and interacts with external systems.
I would do this because your problem statement is incomplete: what happens when you can't send the email because the server is down? What if the fax system isn't working, but the other two aren't?
When you can't invoke one, should you retry? For how long? What happens if you can't raise a ticket for four hours, should you escalate to someone? Should this get a response, so something needs to track ticket status and escalate after some time? Should you email the original submitter some time, if you can't carry out any notification actions?
Using a saga is a model for when you can't just have a transaction, because it might potentially span hours of real time before the actions are complete - and holding a database lock that long, ouch.
Moving to a SOA means moving away from some of your older assumptions. One of those is that you should write methods and invoke them, to one where you encapsulate the behaviour of the system at a higher level and expose that as services.
If you try and build a web service that is like a local library, your life is going to suck. Approach this from the view that you want services that own data, and own behaviour related to that data, and encapsulate the details inside that.
(As an aside, I suspect that sending those things off from cron is actually part of a bigger business process, right, where cron does stuff, and sends notification as a consequence of that. Your service might well want to expose that entire sequence as a saga instead.)
Anyway, point is: you don't encapsulate things in a service because you want a transaction, and you don't put things in a transaction just to make them atomic. Those are separate concerns, and should be treated separately.
PS: if you use a transaction, don't you email twice if the email sent but the ticket wasn't created? You actually need a finer grained set of updates anyhow.

Designing an architecture for exchanging data between two systems

I've been tasked with creating an intermediate layer which needs to exchange data (over HTTP) between two independent systems (e.g. Receiver <=> Intermediate Layer (IL) <=> Sender). Receiver and Sender both expose a set of API's via Web Services. Everytime a transaction occurs in the Sender system, the IL should know about it (I'm thinking of creating a Windows Service which constantly pings the Sender), massage the data, then deliver it to the Receiver. The IL can temporarily store the data in a SQL database until it is transferred to the Receiver. I have the following questions -
Can WCF (haven't used it a lot) be used to talk to the Sender and Receiver (both expose web services)?
How do I ensure guaranteed delivery?
How do I ensure security of the messages over the Internet?
What are best practices for handling concurrency issues?
What are best practices for error handling?
How do I ensure reliability of the data (data is not tampered along the way)
How do I ensure the receipt of the data back to the Sender?
What are the constraints that I need to be aware of?
I need to implement this on MS platform using a custom .NET solution. I was told not to use any middleware like BizTalk. The receiver is an SDFC instance, if that matters.
Any pointers are greatly appreciated. Thank you.
A Windows Service that orchestras the exchange sounds fine.
Yes WCF can deal with traditional Web Services.
How do I ensure guaranteed delivery?
To ensure delivery you can use TransactionScope to handle the passing of data between the
Receiver <=> Intermediate Layer and Intermediate Layer <=> Sender but I wouldn't try and do them together.
You might want to consider some sort of queuing mechanism to send the data to the receiver; I guess I'm thinking more of a logical queue rather than an actual queuing component. A workflow framework could also be an option.
make sure you have good logging / auditing in place; make sure it's rock solid, has the right information and is easy to read. Assuming you write a service it will execute without supervision so the operational / support aspects are more demanding.
Think about scenarios:
How do you manage failed deliveries?
What happens if the reciever (or sender) is unavailbale for periods of time (and how long is that?); for example: do you need to "escalate" to an operator via email?
How do I ensure security of the messages over the Internet?
HTTPS. Assuming other existing clients make calls to the Web Services how do they ensure security? (I'm thinking encryption).
What are best practices for handling concurrency issues?
Hmm probably a separate question. You should be able to find information on that easily enough. How much data are we taking? what sort of frequency? How many instances of the Windows Service were you thinking of having - if one is enough why would concurrency be an issue?
What are best practices for error handling?
Same as for concurrency, but I can offer some pointers:
Use an established logging framework, I quite like MS EntLibs but there are others (re-using whatever's currently used is probably going to make more sense - if there is anything).
Remember that execution is unattended so ensure information is complete, clear and unambiguous. I'd be tempted to log more and dial it down once a level of comfort is reached.
use a top level handler to ensure nothing get's lost; but don;t be afraid to log deep in the application where you can still get useful context (like the metadata of the data being sent / recieved).
How do I ensure the receipt of the data back to the Sender?
Include it (sending the receipt) as a step that is part of the transaction.
On a different angle - have a look on CodePlex for ESB type libraries, you might find something useful: http://www.codeplex.com/site/search?query=ESB&ac=8
For example ESBasic which seems to be a class library which you could reuse.

Windows Phone: Updating backend datastore (via web service) while keeping UI very responsive

I am developing a Windows Phone app where users can update a list. Each update, delete, add etc need to be stored in a database that sits behind a web service. As well as ensuring all the operations made on the phone end up in the cloud, I need to make sure the app is really responsive and the user doesn’t feel any lag time whatsoever.
What’s the best design to use here? Each check box change, each text box edit fires a new thread to contact the web service? Locally store a list of things that need to be updated then send to the server in batch every so often (what about the back button)? Am I missing another even easier implementation?
Thanks in advance,
Data updates to your web service are going to take some time to execute, so in terms of providing the very best responsiveness to the user your best approach would be to fire these off on a background thread.
If updates not taking place (until your app resumes) due to a back press is a concern for your app then you can increase the frequency of sending these updates off.
Storing data locally would be a good idea following each change to make sure nothing is lost since you don't know if your app will get interrupted such as by a phone call.
You are able to intercept the back button which would allow you to handle notifying the user of pending updates being processed or requesting confirmation to defer transmission (say in the case of poor performing network location). Perhaps a visual queue in your UI would be helpful to indicate pending requests in your storage queue.
You may want to give some thought to the overall frequency of data updates in a typical usage scenario for your application and how intensely this would utilise the network connection. Depending on this you may want to balance frequency of updates with potential power consumption.
This may guide you on whether to fire updates off of field level changes, a timer when the queue isn't empty, and/or manipulating a different row of data among other possibilities.
General efficiency guidance with mobile network communications is to have larger and less frequent transmissions rather than a "chatty" or frequent transmissions pattern, however this is up to you to decide what is most applicable for your application.
You might want to look into something similar to REST or SOAP.
Each update, delete, add would send a request to the web service. After the request is fulfilled, the web service sends a message back to the Phone application.
Since you want to keep this simple on the Phone application, you would send a URL to the web service, and the web service would respond with a simple message you can easily parse.
Something like this:
http://webservice?action=update&id=10345&data=...
With a reply of:
Update 10345 successful
The id number is just an incrementing sequence to identify the request / response pair.
There is the Microsoft Sync Framework recently released and discussed some weeks back on DotNetRocks. I must admit I didnt consider this till I read your comment.
I've not looked into the sync framework's dependencies and thus capability for running on the wp7 platform as yet, but it's probably worth checking out.
Here's a link to the framework.
And a link to Carl and Richard's show with Lev Novik, an architect on the project if you're interested in some background info. It was quite an interesting show.

Message queuing solutions?

(Edited to try to explain better)
We have an agent, written in C++ for Win32. It needs to periodically post information to a server. It must support disconnected operation. That is: the client doesn't always have a connection to the server.
Note: This is for communication between an agent running on desktop PCs, to communicate with a server running somewhere in the enterprise.
This means that the messages to be sent to the server must be queued (so that they can be sent once the connection is available).
We currently use an in-house system that queues messages as individual files on disk, and uses HTTP POST to send them to the server when it's available.
It's starting to show its age, and I'd like to investigate alternatives before I consider updating it.
It must be available by default on Windows XP SP2, Windows Vista and Windows 7, or must be simple to include in our installer.
This product will be installed (by administrators) on a couple of hundred thousand PCs. They'll probably use something like Microsoft SMS or ConfigMgr. In this scenario, "frivolous" prerequisites are frowned upon. This means that, unless the client-side code (or a redistributable) can be included in our installer, the administrator won't be happy. This makes MSMQ a particularly hard sell, because it's not installed by default with XP.
It must be relatively simple to use from C++ on Win32.
Our client is an unmanaged C++ Win32 application. No .NET or Java on the client.
The transport should be HTTP or HTTPS. That is: it must go through firewalls easily; no RPC or DCOM.
It should be relatively reliable, with retries, etc. Protection against replays is a must-have.
It must be scalable -- there's a lot of traffic. Per-message impact on the server should be minimal.
The server end is C#, currently using ASP.NET to implement a simple HTTP POST mechanism.
(The slightly odd one). It must support client-side in-memory queues, so that we can avoid spinning up the hard disk. It must allow flushing to disk periodically.
It must be suitable for use in a proprietary product (i.e. no GPL, etc.).
How is your current solution showing its age?
I would push the logic on to the back end, and make the clients extremely simple.
Messages are simply stored in the file system. Have the client write to c:/queue/{uuid}.tmp. When the file is written, rename it to c:/queue/{uuid}.msg. This makes writing messages to the queue on the client "atomic".
A C++ thread wakes up, scans c:\queue for "*.msg" files, and if it finds one it then checks for the server, and HTTP POSTs the message to it. When it receives the 200 status back from the server (i.e. it has got the message), then it can delete the file. It only scans for *.msg files. The *.tmp files are still being written too, and you'd have a race condition trying to send a msg file that was still being written. That's what the rename from .tmp is for. I'd also suggest scanning by creation date so early messages go first.
Your server receives the message, and here it can to any necessary dupe checking. Push this burden on the server to centralize it. You could simply record every uuid for every message to do duplication elimination. If that list gets too long (I don't know your traffic volume), perhaps you can cull it of items greater than 30 days (I also don't know how long your clients can remain off line).
This system is simple, but pretty robust. If the file sending thread gets an error, it will simply try to send the file next time. The only time you should be getting a duplicate message is in the window between when the client gets the 200 ack from the server and when it deletes the file. If the client shuts down or crashes at that point, you will have a file that has been sent but not removed from the queue.
If your clients are stable, this is a pretty low risk. With the dupe checking based on the message ID, you can mitigate that at the cost of some bookkeeping, but maintaining a list of uuids isn't spectacularly daunting, but again it does depend on your message volume and other performance requirements.
The fact that you are allowed to work "offline" suggests you have some "slack" in your absolute messaging performance.
To be honest, the requirements listed don't make a lot of sense and show you have a long way to go in your MQ learning. Given that, if you don't want to use MSMQ (probably the easiest overall on Windows -- but with [IMO severe] limitations), then you should look into:
qpid - Decent use of AMQP standard
zeromq - (the best, IMO, technically but also requires the most familiarity with MQ technologies)
I'd recommend rabbitmq too, but that's an Erlang server and last I looked it didn't have usuable C or C++ libraries. Still, if you are shopping MQ, take a look at it...
[EDIT]
I've gone back and reread your reqs as well as some of your comments and think, for you, that perhaps client MQ -> server is not your best option. I would maybe consider letting your client -> server operations be HTTP POST or SOAP and allow the HTTP endpoint in turn queue messages on your MQ backend. IOW, abstract away the MQ client into an architecture you have more control over. Then your C++ client would simply be HTTP (easy), and your HTTP service (likely C# / .Net from reading your comments) can interact with any MQ backend of your choice. If all your HTTP endpoint does is spawn MQ messages, it'll be pretty darned lightweight and can scale through all the traditional load balancing techniques.
Last time I wanted to do any messaging I used C# and MSMQ. There are MSMQ libraries available that make using MSMQ very easy. It's free to install on both your servers and never lost a message to this day. It handles reboots etc all by itself. It's a thing of beauty and 100,000's of message are processed daily.
I'm not sure why you ruled out MSMQ and I didn't get point 2.
Quite often for queues we just dump record data into a database table and another process lifts rows out of the table periodically.
How about using Asynchronous Agents library from .NET Framework 4.0. It is still beta though.
http://msdn.microsoft.com/en-us/library/dd492627(VS.100).aspx