The problem: You have a big dictionary on the server and you are distributing it to lots of clients.
The dictionary is updated only on the server side but you want to allow the clients to update the dictionary by minimizing the data being transfered.
Also you can assume that you have a huge number of clients requesting updates, probably daily or so.
If a key is removed from the server you expect it to be removed from the client on sync.
How would you solve this problem?
Additional request: the solution should be easy to implement on different platforms including desktop (Windows,Linux,OS X) and mobile ones (iOS, Android,...). If this request the usage of third-party library their license has to be very liberal, like BSD.
If this is at a file level, you use rsync (or the awesome bsdiff or xdelta or such).
If this is at an application level, then one approach is to write journal updates to the dictionary (key-value-store) in the server - you write a log of all updates, adds and removes in the order they occur. Your clients then periodically hit the server and say the position in the log they last received, and the server sends them all log items newer than that. The server may also skip journal items that are superseded (e.g. an add that was later removed). If the server keeps track of the clients, it can keep track of the minimum client journal position, and so get rid of journal items it doesn't need any more.
If the dictionary is large and yet requests are low, the clients can just hit the server for each lookup and always get the newest key. This often scales better than you imagine.
Ideally you can find a solution that supports your requirements rather than build your own.
I suggest that you take a look at CouchDB. It has the following features that make it relevant for your problem imo:
It's a key-value store = dictionary, so should easily fit your data model.
Supports replication from machine to machine (or multiple machines) in an occasionally connected environment. That should fit your use case of clients connecting to a server once in a while to pull all updates.
Works well in a distributed environment, so you should be able to handle the huge number of clients, e.g. by maintaining several servers.
Good scaling - works on servers and any kind of client (including mobile). Also, runs on multiple OSs.
It has a rather efficient data protocol for the replication process.
It's free.
Related
I'm new to AWS and back-end architecture in general. My current configuration is an EC2 instance (south-east region Singapore) running a Twisted real-time server for a real-time chat app.
Currently, in my implementation, whenever a sender sends a message to the server, it is stored in a python dictionary on the server if the receiver is not online. So basically it is storing this message in the instance's RAM. Now, I want to make the app available worldwide, so I'll be running it on instances of different regions. So my question is, how am I supposed to duplicate/replicate this dictionary stored in RAM of one instance to all the other instance, so it is readily available in all regions? (The reason of storing the messages in RAM and not in a database is the nature of the app. The app involves a large volume of messages sent in bursts, which requires it to be considerably faster than speeds offered by a persistent DB store's I/O read-writes.) My aim is to make the app available globally, and having real-time performance.
(Kindly don't flag this question as an "opinion-based" question and close it. I'm new to server side architecture and I really need someone to at least just point me in the right direction. And I don't think I'll be able to find help on this anywhere other than StackOverflow.)
Here's a few things I would think of if I had to build it myself (I've implemented most of these pointers in our own project and it took me quite a while).
If you really really need all servers to be in sync you'll need a consensus protocol. If you do. Don't built this yourself. It's going to take a lot of time and errors.
If you can, partition your chat data into chatrooms and have only a few servers handle one chatroom.
I've used msgpack to encode my data. It's faster and smaller than json.
You'll benefit a lot of compressing your data before you send it over the wire. Have a look at something like zlib or lz4
Even though the size of compressed msgpack is almost the same of that compressed json. I'd choose msgpack because it's faster. It's easier to parse because it's length prefixed encoded.
I would try to send messages together. Batch up all messages every x ms. In my project I chose 100ms batching up messages will save you a lot of bandwidth since your compression algorithm can remove more duplication.
You'll have to handle connection timeouts. Only regard a message as sent and done when you get a reply back (you'll have to design/choose your protocol to handle that)
Think of what is acceptable, how much data you're willing to loose when something crashes or otherwise fails. If you're not willing to loose data you'll have to implement something that stores data to disk.
I've had the problem that writes to database we use (Google Cloud Datastore) take a long time as well. Like somewhere between 100ms and 900ms depending on how much I store. What I did was only store this data every x seconds and set flags on objects that need to be saved next run. Of course you can only do this if you're willing to loose some data when your program crashes.
You'll need something to keep track of what servers are running and which server is responsible for which piece of data
Set up something that checks whether your connection is alive. For example send echoRequests and echos every x time. The sooner you detect a faillure the better. Note however if your reactor is blocked by some cpu intensive task it will not send your echo in time.
If you're not in control of how much data comes in you'll have to slow down or penalize connections that would otherwise take up all of your server time.
EDIT: I only now see that you're looking into redis. As far as I know it's a good queueing system. Use that if you can. Implementing the stuff above would take a lot of time to get it right.
I'm trying to modify a game engine so it records events (like key presses), and store these in a MySQL database on a remote server. The game engine is written in C++, and I currently have the following straightforward architecture, using mysql++ to directly INSERTrecords into appropriate databases:
Unfortunately, there's a very large overhead when connecting to the MySQL server, and the game stops for a significant amount of time. Pushing a batch of Xs worth of events to the server causes a significant delay in gameplay (60s worth of events can take 12s to synchronise). There are also apparently security concerns with leaving the MySQL port accessible publicly.
I was considering an alternative option, instead sending commands to the server, which can interact with the database in its own time:
Here the game would only send the necessary information (e.g. the table to update and the data to insert). I'm not sure whether the speed increase would be sufficient, or what system would be appropriate for managing the commands sent from the game.
Someone else suggested Log4j, but obviously I need a C++ solution. Is there an appropriate existing framework for accomplishing what I want?
Most applications gathering user-interface interaction data (in your case keystrokes) put it into a local file of some sort.
Then at an appropriate time (for example at the end of the game, or the beginning of another game), they POST that file, often in compressed form, to a publicly accessible web server. The software on the web server decompresses the data and loads it into the analytics system (the MySQL server in your case) for processing.
So, I suggest the following.
stop making your MySQL server's port available to people you don't know and trust.
get your game to gather keystrokes locally somehow.
get it to upload that data in big bunches when your game is not in realtime mode.
write a web service to receive and interpret these files.
That way you'll build a more secure analytics system and a more responsive game.
The question is a little general, so to help narrow the focus, I'll share my current setup that is motivating this question. I have a LAMP web service running a RESTful API. We have two client implementations: one browser-based javascript client (local storage store) and one iOS-based client (core data store). Obviously these two clients store data very differently, but the data itself needs to be kept in two-way sync with the remote server as often as possible.
Currently, our "sync" process is a little dumb (as in, non-smart). Conceptually, it looks like:
Client periodically asks the server for ALL of the most-recent data.
Server sends down the remote data, which overwrites the current set of local data in the client's store.
Any local creates/updates/deletes after this point are treated as gold, and immediately sent to the server.
The data itself is stored relationally, and updated occasionally by client users. The clients in my specific case don't care too much about the relationships themselves (which is why we can get away with local storage in the browser client for now).
Obviously this isn't true synchronization. I want to move to a system where, conceptually, a "diff" of the most recent changes are sent to the server periodically, and the server sends back a "diff" of the most recent changes it knows about. It seems very difficult to get to this point, but maybe I just don't understand the problem very well.
REST feels like a good start, but REST only talks about the way two data stores talk to each other, not how the data itself is synchronized between them. (This sync process is left up to the implementer of each store.) What is the best way to implement this process? Is there a modern set of programming design patterns that apply to inform a specific solution to this problem? I'm mostly interested in a general (technology agnostic) approach if possible... but specific frameworks would be useful to look at too, if they exist.
Multi-master replication is always (and will always be) difficult and bespoke, because how conflicts are handled will be specific to your application.
IMO A more robust approach is to use Master-slave replication, with your web service as the master and the clients as slaves. To keep the clients in sync, use an archived atom feed of the changes (see event sourcing) as per RFC5005. This is the closest you'll get to a modern standard for this type of replication and it's RESTful.
When the clients are online, they do not update their replica directly, instead they send commands to the server and have their replica updated via the atom feed.
When the clients are offline things get difficult. Your clients will need to have a model of how your web service behaves. It will need to have an offline copy of your replica, which should be copied on write from the online replica (the online replica is the one that is updated by the atom feed). When the client executes commands that modify the data, it should store the command (for later replay against the web service), the expected result (for verification during replay) and update the offline replica.
When the client goes back online, it should replay the commands, compare the result with the expected result and notify the client of any variances. How these variances are handled will vary based on your application. The offline replica can then be discarded.
CouchDB replication works over HTTP and does what you are looking to do. Once databases are synced on either end it will send diffs for adds/updates/deletes.
Couch can do this with other Couch machines or with a mobile framework like TouchDB.
https://github.com/couchbaselabs/TouchDB-iOS
I've done a fair amount of it, but you can always set up CouchDB on one machine, set up TouchDB on a mobile device and then watch the HTTP traffic go back and forth to get an idea of how they do it.
Or read this: http://guide.couchdb.org/draft/replication.html
Maybe something from the link above will help you get an idea of how to do your own diffs for your REST service. (Since they are both over HTTP thought it could be useful.)
You may want to look into the Dropbox Datastore API:
https://www.dropbox.com/developers/datastore
It sounds like it might be a very good fit for your purposes. They have iOS and javascript clients.
Lately, I've been interested in Meteor.
The platform sets up Mongo on the server and minimongo in the browser. The client subscribes to some data and when that data changes, the platform automatically sends down the new data to the client.
It's a clever solution to the syncing problem, and it solves several other problems as well. It will be interesting to see if more platforms do this in the future.
I am working on a project where a website needs to exchange complex and confidential (and thus encrypted) data with other systems. The data includes personal information, technical drawings, public documents etc.
We would prefer to avoid the Request-Reply pattern to the dependent systems (and there are a LOT of them), as that would create an awful lot of empty traffic.
On the other hand, I am not sure that a pure Publisher/Subscriber pattern would be apropriate -- mainly because of the complex and bulky nature of the data to be exchanged.
For that reason we have discussed the possibility of a "publish/subscribe/request" solution. The Publish/Subscribe part would be to publish a message to the dependent systems, that something is ready for pickup. The actual content is then picked up by old-school Request-Reply action.
How does this sound to you??
Regards,
Morten
If the systems are always online, it sounds good.
You might want to look at PubSubHubbub because:
1. Don't solve a problem that has already been solved 2. It is scalable and represents a good separation of concern.
It involves 3 parties:
Publishers (who publish stuff)
Subscribers (who are interested in certain publications)
Hubs (who mediate and get rid of 'polling')
It works in the following way:
A subscriber, registers their interest in a URL with a Hub and provides a callback URL.
A publisher, notifies the hub when publishing content.
A hub fetches the 'delta' and pushes it to interested subscribers.
The protocol itself is an extension to Atom, but it seems to fit your requirement, e.g. the new Atom 'content' could be an item containing URLs to newly published documents (which can then be downloaded separately).
New/modified documents => new/modified items in feed containing URLs to fetch them => Hub => Subscribers => Pull documents from Publisher
I don't have a great experience about this, but a messaging queue should help you accomplish what you need. I am using such a system while managing publishing data to multiple front end clients from a backend.
If the client is off, the data is not consumed and the server receives no acknowledgement of data being reveived. Once the client comes back online he consumes the data and remains listening for more messages onve the queue is clear. And ofc the publisher receives a ack for data being consumed. In this way we can identify and notify people who have problems at the receiving end as a bonus. Could this do it in your case?
This approach works if the dependent systems are always online - you can't send messages to PCs that are turned off for the night/weekend.
So if the clients are servers that run 24/7, this works. Otherwise, try this approach:
Let clients register themselves
When new documents come in, add an entry "client X needs to see this" in your database
When clients connect, send them all the entries.
When clients successfully downloaded a document, delete the "client X needs to see this" entry. That keeps the work table small.
This has several advantages:
Clients don't need to run 24/7
The flag is only removed after the client has seen the document (so no updates can be lost).
You have one place where you can see which client never pulls it's documents. A simple select client, count(*) group by client having count(*) > 10 tells you about problems.
Most clients will fetch their data timely, so the work table will stay small. That means there is little overhead when you have to collect the "what's now" data.
EDIT The problem with off-line subscribers is that they don't know what they're missing. So the sending side needs to keep track of the failed push/pull requests. Which means you must implement my suggested pseudo-code to make sure broken connections can be resumed.
(Edited to try to explain better)
We have an agent, written in C++ for Win32. It needs to periodically post information to a server. It must support disconnected operation. That is: the client doesn't always have a connection to the server.
Note: This is for communication between an agent running on desktop PCs, to communicate with a server running somewhere in the enterprise.
This means that the messages to be sent to the server must be queued (so that they can be sent once the connection is available).
We currently use an in-house system that queues messages as individual files on disk, and uses HTTP POST to send them to the server when it's available.
It's starting to show its age, and I'd like to investigate alternatives before I consider updating it.
It must be available by default on Windows XP SP2, Windows Vista and Windows 7, or must be simple to include in our installer.
This product will be installed (by administrators) on a couple of hundred thousand PCs. They'll probably use something like Microsoft SMS or ConfigMgr. In this scenario, "frivolous" prerequisites are frowned upon. This means that, unless the client-side code (or a redistributable) can be included in our installer, the administrator won't be happy. This makes MSMQ a particularly hard sell, because it's not installed by default with XP.
It must be relatively simple to use from C++ on Win32.
Our client is an unmanaged C++ Win32 application. No .NET or Java on the client.
The transport should be HTTP or HTTPS. That is: it must go through firewalls easily; no RPC or DCOM.
It should be relatively reliable, with retries, etc. Protection against replays is a must-have.
It must be scalable -- there's a lot of traffic. Per-message impact on the server should be minimal.
The server end is C#, currently using ASP.NET to implement a simple HTTP POST mechanism.
(The slightly odd one). It must support client-side in-memory queues, so that we can avoid spinning up the hard disk. It must allow flushing to disk periodically.
It must be suitable for use in a proprietary product (i.e. no GPL, etc.).
How is your current solution showing its age?
I would push the logic on to the back end, and make the clients extremely simple.
Messages are simply stored in the file system. Have the client write to c:/queue/{uuid}.tmp. When the file is written, rename it to c:/queue/{uuid}.msg. This makes writing messages to the queue on the client "atomic".
A C++ thread wakes up, scans c:\queue for "*.msg" files, and if it finds one it then checks for the server, and HTTP POSTs the message to it. When it receives the 200 status back from the server (i.e. it has got the message), then it can delete the file. It only scans for *.msg files. The *.tmp files are still being written too, and you'd have a race condition trying to send a msg file that was still being written. That's what the rename from .tmp is for. I'd also suggest scanning by creation date so early messages go first.
Your server receives the message, and here it can to any necessary dupe checking. Push this burden on the server to centralize it. You could simply record every uuid for every message to do duplication elimination. If that list gets too long (I don't know your traffic volume), perhaps you can cull it of items greater than 30 days (I also don't know how long your clients can remain off line).
This system is simple, but pretty robust. If the file sending thread gets an error, it will simply try to send the file next time. The only time you should be getting a duplicate message is in the window between when the client gets the 200 ack from the server and when it deletes the file. If the client shuts down or crashes at that point, you will have a file that has been sent but not removed from the queue.
If your clients are stable, this is a pretty low risk. With the dupe checking based on the message ID, you can mitigate that at the cost of some bookkeeping, but maintaining a list of uuids isn't spectacularly daunting, but again it does depend on your message volume and other performance requirements.
The fact that you are allowed to work "offline" suggests you have some "slack" in your absolute messaging performance.
To be honest, the requirements listed don't make a lot of sense and show you have a long way to go in your MQ learning. Given that, if you don't want to use MSMQ (probably the easiest overall on Windows -- but with [IMO severe] limitations), then you should look into:
qpid - Decent use of AMQP standard
zeromq - (the best, IMO, technically but also requires the most familiarity with MQ technologies)
I'd recommend rabbitmq too, but that's an Erlang server and last I looked it didn't have usuable C or C++ libraries. Still, if you are shopping MQ, take a look at it...
[EDIT]
I've gone back and reread your reqs as well as some of your comments and think, for you, that perhaps client MQ -> server is not your best option. I would maybe consider letting your client -> server operations be HTTP POST or SOAP and allow the HTTP endpoint in turn queue messages on your MQ backend. IOW, abstract away the MQ client into an architecture you have more control over. Then your C++ client would simply be HTTP (easy), and your HTTP service (likely C# / .Net from reading your comments) can interact with any MQ backend of your choice. If all your HTTP endpoint does is spawn MQ messages, it'll be pretty darned lightweight and can scale through all the traditional load balancing techniques.
Last time I wanted to do any messaging I used C# and MSMQ. There are MSMQ libraries available that make using MSMQ very easy. It's free to install on both your servers and never lost a message to this day. It handles reboots etc all by itself. It's a thing of beauty and 100,000's of message are processed daily.
I'm not sure why you ruled out MSMQ and I didn't get point 2.
Quite often for queues we just dump record data into a database table and another process lifts rows out of the table periodically.
How about using Asynchronous Agents library from .NET Framework 4.0. It is still beta though.
http://msdn.microsoft.com/en-us/library/dd492627(VS.100).aspx