Real-time duplication of data among EC2 instances located in different regions - amazon-web-services

I'm new to AWS and back-end architecture in general. My current configuration is an EC2 instance (south-east region Singapore) running a Twisted real-time server for a real-time chat app.
Currently, in my implementation, whenever a sender sends a message to the server, it is stored in a Python dictionary on the server if the receiver is not online. So basically it is storing this message in the instance's RAM. Now, I want to make the app available worldwide, so I'll be running it on instances in different regions. So my question is, how am I supposed to duplicate/replicate this dictionary stored in the RAM of one instance across all the other instances, so it is readily available in all regions? (The reason for storing the messages in RAM and not in a database is the nature of the app: it involves a large volume of messages sent in bursts, which requires it to be considerably faster than the I/O read/write speeds offered by a persistent DB store.) My aim is to make the app available globally with real-time performance.
(Kindly don't flag this question as an "opinion-based" question and close it. I'm new to server side architecture and I really need someone to at least just point me in the right direction. And I don't think I'll be able to find help on this anywhere other than StackOverflow.)

Here are a few things I would think of if I had to build it myself (I've implemented most of these pointers in our own project and it took me quite a while).
If you really, really need all servers to be in sync you'll need a consensus protocol. If you do, don't build this yourself. It's going to take a lot of time and produce a lot of errors.
If you can, partition your chat data into chatrooms and have only a few servers handle one chatroom.
I've used msgpack to encode my data. It's faster and smaller than JSON.
You'll benefit a lot from compressing your data before you send it over the wire. Have a look at something like zlib or lz4.
Even though the size of compressed msgpack is almost the same as that of compressed JSON, I'd choose msgpack because it's faster, and it's easier to parse because it's length-prefix encoded.
I would try to send messages together. Batch up all messages every x ms; in my project I chose 100 ms. Batching up messages will save you a lot of bandwidth, since your compression algorithm can remove more duplication. (See the sketch below, after these pointers.)
You'll have to handle connection timeouts. Only regard a message as sent and done when you get a reply back (you'll have to design/choose your protocol to handle that)
Think of what is acceptable: how much data are you willing to lose when something crashes or otherwise fails? If you're not willing to lose data you'll have to implement something that stores data to disk.
I've had the problem that writes to the database we use (Google Cloud Datastore) take a long time as well. Like somewhere between 100 ms and 900 ms, depending on how much I store. What I did was store this data only every x seconds and set flags on objects that need to be saved on the next run. Of course you can only do this if you're willing to lose some data when your program crashes.
You'll need something to keep track of which servers are running and which server is responsible for which piece of data.
Set up something that checks whether your connection is alive, for example by sending echo requests and echoes every so often. The sooner you detect a failure the better. Note, however, that if your reactor is blocked by some CPU-intensive task it will not send your echo in time.
If you're not in control of how much data comes in you'll have to slow down or penalize connections that would otherwise take up all of your server time.
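Here is a minimal sketch of the msgpack/compression/batching pointers above. The Batcher class, the send callable and the 100 ms interval are just illustrative choices; it assumes the msgpack package is installed.

```python
import time
import zlib

import msgpack  # pip install msgpack


class Batcher:
    """Collects messages and flushes them as one compressed msgpack blob."""

    def __init__(self, send, interval=0.1):
        self.send = send          # callable that puts bytes on the wire
        self.interval = interval  # 0.1 s = the 100 ms mentioned above
        self.pending = []
        self.last_flush = time.monotonic()

    def add(self, message):
        self.pending.append(message)
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.pending:
            # Batching before compressing lets zlib remove the duplication
            # between messages, which is where most of the savings come from.
            payload = zlib.compress(msgpack.packb(self.pending))
            self.send(payload)
            self.pending = []
        self.last_flush = time.monotonic()


def decode(payload):
    # Reverse of flush(): decompress, then unpack the batch.
    return msgpack.unpackb(zlib.decompress(payload))
```

In a Twisted server you would typically drive flush() from a periodic call (e.g. a LoopingCall) rather than from add(), but the idea is the same.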
EDIT: I only now see that you're looking into redis. As far as I know it's a good queueing system. Use that if you can. Implementing the stuff above would take a lot of time to get it right.

Related

Debugging network applications and testing for synchronicity?

If I have a server running on my machine, and several clients running on other networks, what are some concepts of testing for synchronicity between them? How would I know when a client goes out-of-sync?
I'm particularly interested in how network programmers in the field of game design do this (or just any continuous network exchange application), where realtime synchronicity would be a commonly vital aspect of success.
I can see how this may be easily achieved on LAN via side-by-side comparisons on separate machines... but once you branch out the scenario to include clients from foreign networks, I'm just not sure how it can be done without clogging up your messaging system with debug information, and therefore effectively changing the way that synchronicity would result without that debug info being passed over the network.
So what are some ways that people get around this issue?
For example, do they simply induce/simulate latency on the local network before launching to foreign networks, and then hope for the best? I'm hoping there are some more concrete solutions, but this is what I'm doing in the meantime...
When you say synchronized, I believe you are talking about network latency. Meaning, that a client on a local network may get its gaming information sooner than a client on the other side of the country. Correct?
If so, then I'm sure you can look for books or papers that cover this kind of topic, but I can give you at least one way to detect this latency and provide a way to manage it.
To detect latency, your server can use a type of trace route program to determine how long it takes for data to reach each client. A common Linux program example can be found here http://linux.about.com/library/cmd/blcmdl8_traceroute.htm. While the server is handling client data, it can also continuously collect the latency statistics and provide the data to the clients. For example, the server can update each client on its own network latency and what the longest latency is for the group of clients that are playing each other in a game.
The clients can then use the latency differences to determine when they should process the data they receive from the server. For example, a client is told by the server that its network latency is 50 milliseconds and the maximum latency for its group is 300 milliseconds. The client then knows to wait 250 milliseconds before processing game data from the server. That way, each client processes game data from the server at approximately the same time.
There are many other (and probably better) ways to handle this situation, but that should get you started in the right direction.

What is the modern programming standard for synchronizing data between a web service and a client?

The question is a little general, so to help narrow the focus, I'll share my current setup that is motivating this question. I have a LAMP web service running a RESTful API. We have two client implementations: one browser-based javascript client (local storage store) and one iOS-based client (core data store). Obviously these two clients store data very differently, but the data itself needs to be kept in two-way sync with the remote server as often as possible.
Currently, our "sync" process is a little dumb (as in, non-smart). Conceptually, it looks like:
Client periodically asks the server for ALL of the most-recent data.
Server sends down the remote data, which overwrites the current set of local data in the client's store.
Any local creates/updates/deletes after this point are treated as gold, and immediately sent to the server.
The data itself is stored relationally, and updated occasionally by client users. The clients in my specific case don't care too much about the relationships themselves (which is why we can get away with local storage in the browser client for now).
Obviously this isn't true synchronization. I want to move to a system where, conceptually, a "diff" of the most recent changes are sent to the server periodically, and the server sends back a "diff" of the most recent changes it knows about. It seems very difficult to get to this point, but maybe I just don't understand the problem very well.
REST feels like a good start, but REST only talks about the way two data stores talk to each other, not how the data itself is synchronized between them. (This sync process is left up to the implementer of each store.) What is the best way to implement this process? Is there a modern set of programming design patterns that apply to inform a specific solution to this problem? I'm mostly interested in a general (technology agnostic) approach if possible... but specific frameworks would be useful to look at too, if they exist.
Multi-master replication is always (and will always be) difficult and bespoke, because how conflicts are handled will be specific to your application.
IMO a more robust approach is to use master-slave replication, with your web service as the master and the clients as slaves. To keep the clients in sync, use an archived atom feed of the changes (see event sourcing), as per RFC 5005. This is the closest you'll get to a modern standard for this type of replication, and it's RESTful.
When the clients are online, they do not update their replica directly, instead they send commands to the server and have their replica updated via the atom feed.
When the clients are offline things get difficult. Your clients will need to have a model of how your web service behaves. It will need to have an offline copy of your replica, which should be copied on write from the online replica (the online replica is the one that is updated by the atom feed). When the client executes commands that modify the data, it should store the command (for later replay against the web service), the expected result (for verification during replay) and update the offline replica.
When the client goes back online, it should replay the commands, compare the result with the expected result and notify the client of any variances. How these variances are handled will vary based on your application. The offline replica can then be discarded.
CouchDB replication works over HTTP and does what you are looking to do. Once databases are synced on either end it will send diffs for adds/updates/deletes.
Couch can do this with other Couch machines or with a mobile framework like TouchDB.
https://github.com/couchbaselabs/TouchDB-iOS
I've done a fair amount of it, but you can always set up CouchDB on one machine, set up TouchDB on a mobile device and then watch the HTTP traffic go back and forth to get an idea of how they do it.
Or read this: http://guide.couchdb.org/draft/replication.html
Maybe something from the link above will help you get an idea of how to do your own diffs for your REST service. (Since they are both over HTTP, I thought it could be useful.)
You may want to look into the Dropbox Datastore API:
https://www.dropbox.com/developers/datastore
It sounds like it might be a very good fit for your purposes. They have iOS and javascript clients.
Lately, I've been interested in Meteor.
The platform sets up Mongo on the server and minimongo in the browser. The client subscribes to some data and when that data changes, the platform automatically sends down the new data to the client.
It's a clever solution to the syncing problem, and it solves several other problems as well. It will be interesting to see if more platforms do this in the future.

Jabber and expensive data (XML is trash)

I'm working on a project that uses Jabber as its communication platform.
The thing is that I need clients (a lot of clients) to communicate with each other, not only for signalling but also to exchange data between them.
Imagine that client A has 3 services available. Client B could request that A start sending it info from each service (like a stream service) until client B tells A to stop the services.
These services might send only one character every 100 ms, or 1000 characters every 100 ms, or even send some data only when it's needed.
When the info sent to B arrives, B has to know which service it corresponds to, which action, and the values (for example), so I'm using JSON over Jabber.
My problem is that I'm wasting a lot of bandwidth with the Jabber XMPP protocol just to send a message with a body like:
{"s":"x", "x":5} //each 100ms (5 represents any number)
I really don't want to have a parallel communication channel (like direct sockets), because Jabber has all of that implemented already: it's easily scalable, avoids firewall problems, and sometimes I use HTTP communication (I'm using BOSH in this case).
I know that there is some compression I can do, but I'm wondering if you can recommend something else that wouldn't have such an amount of XML behind my message while still using Jabber.
Thanks a lot for your help.
Best Regards,
Eduardo
It sounds like, except for your significant data transfer, XMPP suits your application well.
As you probably know, XMPP was never designed or intended to be used as a big pipe for data transfer. Most applications that involve significant data transfer, such as file transfers and voice/video, use XMPP just for negotiation of a separate "out of band" stream. You say this might cause problems for you because of firewalls and web clients.
If your application is mostly transferring text, then you really should try out compression... it offers significant savings on bandwidth, if that's your most constrained resource. The downside is that it will take more client and server memory (around 300KB by default, but that can be reduced with marginal compression loss).
Alternatively you can look at tunneling your data base64-encoded using In-Band Bytestreams. I don't have your sample data, or know how you are wrapping them for transport, and this could come off worse or better. I would say it would come off better if you stripped out your JSON and made it into a more efficient binary format instead. Base64 data will not compress so well, and is roughly 33% larger than the raw data. The savings would be in being able to strip out JSON and any other extraneous wrappings, while keeping the data within the XMPP stream.
In the end, scaling most applications is hard, whichever technologies you use. It requires primarily insight: you shouldn't change anything without testing it first, and you should be testing beforehand to find out what you ought to change. You should be analyzing your system for the primary bottlenecks (is it really the client's bandwidth?). Rarely in my experience has XML itself been the direct bottleneck. However, ultimately all these things are unique to your application; it's not easy to give generic advice at scale.
No, XML is not trash. It's human readable, very extensible, and can be compressed extremely well.
XMPP supports stream compression, and this stream compression (mostly zlib) works extremely well according to all my tests. So if it's important for you to optimize the number of bytes you send over the wire, or you are on low bandwidth, then use stream compression when you are on sockets. When you are on BOSH you have to use either a server which supports HTTP compression or a proxy in between to enable compression. But keep in mind that BOSH also has lots of overhead from all the HTTP headers.

Finding and optimal way of distributing dictionary changes over the network

The problem: You have a big dictionary on the server and you are distributing it to lots of clients.
The dictionary is updated only on the server side, but you want to allow the clients to keep their copy of the dictionary up to date while minimizing the data being transferred.
Also you can assume that you have a huge number of clients requesting updates, probably daily or so.
If a key is removed from the server you expect it to be removed from the client on sync.
How would you solve this problem?
Additional requirement: the solution should be easy to implement on different platforms, including desktop (Windows, Linux, OS X) and mobile ones (iOS, Android, ...). If this requires the use of a third-party library, its license has to be very liberal, like BSD.
If this is at a file level, you use rsync (or the awesome bsdiff or xdelta or such).
If this is at an application level, then one approach is to write journal updates to the dictionary (key-value-store) in the server - you write a log of all updates, adds and removes in the order they occur. Your clients then periodically hit the server and say the position in the log they last received, and the server sends them all log items newer than that. The server may also skip journal items that are superseded (e.g. an add that was later removed). If the server keeps track of the clients, it can keep track of the minimum client journal position, and so get rid of journal items it doesn't need any more.
If the dictionary is large and yet requests are low, the clients can just hit the server for each lookup and always get the newest key. This often scales better than you imagine.
Ideally you can find a solution that supports your requirements rather than build your own.
I suggest that you take a look at CouchDB. It has the following features that make it relevant for your problem imo:
It's a key-value store = dictionary, so should easily fit your data model.
Supports replication from machine to machine (or multiple machines) in an occasionally connected environment. That should fit your use case of clients connecting to a server once in a while to pull all updates.
Works well in a distributed environment, so you should be able to handle the huge number of clients, e.g. by maintaining several servers.
Good scaling - works on servers and any kind of client (including mobile). Also, runs on multiple OSs.
It has a rather efficient data protocol for the replication process.
It's free.

Increasing SSL handshaking performance

I've got a short-lived client process that talks to a server over SSL. The process is invoked frequently and only runs for a short time (typically for less than 1 second). This process is intended to be used as part of a shell script used to perform larger tasks and may be invoked pretty frequently.
The SSL handshaking it performs each time it starts up is showing up as a significant performance bottleneck in my tests and I'd like to reduce this if possible.
One thing that comes to mind is taking the session id and storing it somewhere (kind of like a cookie), and then re-using this on the next invocation, however this is making me feel uneasy as I think there would be some security concerns around doing this.
So, I've got a couple of questions,
Is this a bad idea?
Is this even possible using OpenSSL?
Are there any better ways to speed up the SSL handshaking process?
After the handshake, you can get the SSL session information from your connection with SSL_get_session(). You can then use i2d_SSL_SESSION() to serialise it into a form that can be written to disk.
When you next want to connect to the same server, you can load the session information from disk, then unserialise it with d2i_SSL_SESSION() and use SSL_set_session() to set it (prior to SSL_connect()).
The on-disk SSL session should be readable only by the user that the tool runs as, and stale sessions should be overwritten and removed frequently.
You should be able to use a session cache securely (which OpenSSL supports), see the documentation on SSL_CTX_set_session_cache_mode, SSL_set_session and SSL_session_reused for more information on how this is achieved.
Could you perhaps use a persistent connection, so the setup is a one-time cost?
You could abstract away the connection logic so your client code still thinks its doing a connect/process/disconnect cycle.
Interestingly enough I encountered an issue with OpenSSL handshakes just today. The implementation of RAND_poll, on Windows, uses the Windows heap APIs as a source of random entropy.
Unfortunately, due to a "bug fix" in Windows 7 (and Server 2008), the heap enumeration APIs (which are debugging APIs after all) can now take over a second per call once the heap is full of allocations. This means that both SSL connects and accepts can take anywhere from 1 second to more than a few minutes.
The Ticket contains some good suggestions on how to patch openssl to achieve far FAR faster handshakes.