Optimizing Jetty for heartbeat detection of thousands of machines? - jetty

I have a large number of machines (thousands or more) that each perform an HTTP request to a Jetty server every X seconds to signal that they are alive. For what values of X should I use persistent HTTP connections (which limits the number of monitored machines to the number of concurrent connections), and for what values of X should the client re-establish a TCP connection each time (which in theory would allow monitoring more machines with the same Jetty server)?
How would the answer change for HTTPS connections? (Assuming CPU is not a constraint)
This question ignores scaling-out with multiple Jetty web servers on purpose.
Update: Basically the question can be reduced to the smallest recommended value of lowResourcesMaxIdleTime.

I would say that this is less of a Jetty scaling issue and more of a network scaling issue, in which case 'it depends' on your network infrastructure. Only you know how your network is laid out and what latencies are involved, so only you can come up with a value of X.
From an overhead perspective, persistent HTTP connections will of course have some minor effect (I say minor, but it depends on your network), and HTTPS will have a larger impact, but only in terms of traffic volume, since you are assuming CPU is not a constraint.
So Jetty doesn't really need to be involved in the question. You seem ultimately to be asking for help minimizing bytes on the wire, so what you are really looking for is the best protocol. Since HTTP makes you send headers with each request, you may be well served by something like SPDY or WebSocket, which give you persistent connections but are optimized for low round-trip network overhead. But... they seem sort of overkill for a heartbeat. :)
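As a back-of-envelope sketch of the trade-off in the question: with persistent connections every machine holds one connection open all the time, whereas with a fresh connection per heartbeat the average number of simultaneously open connections is roughly the arrival rate multiplied by how long each connection stays open (Little's law). The function names and the 0.2 second hold time used in the comments are illustrative assumptions, not Jetty settings or measured values.

```python
def persistent_connections(machines: int) -> int:
    """With keep-alive, every monitored machine pins one open connection."""
    return machines


def reconnect_connections(machines: int, interval_s: float, hold_s: float) -> float:
    """Average open connections when each heartbeat opens a fresh connection.

    Little's law: average in flight = arrival rate * time in system.
    hold_s is an assumed duration for one connection (handshake, request,
    response, close); measure it on your own network before relying on it.
    """
    arrival_rate = machines / interval_s  # heartbeats per second
    return arrival_rate * hold_s
```

For example, under the assumed 0.2 second hold time, 10,000 machines with X = 10 seconds average about 200 open connections when reconnecting, against 10,000 when persistent. The smaller X gets relative to the hold time, the less reconnecting saves.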

How about just making them send requests at different times? When the first machine sends a request, pick a time for that machine's next heartbeat and return it in the response (also keeping the id/time on the Jetty server). When the second machine sends a request, pick a different time and return that to the second machine.
This way each machine performs its heartbeat request at a different time, so there is no concurrency issue.
You can also use a random delay before the first heartbeat in case all machines start up at the same time.
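A minimal sketch of that staggering idea, written in Python for brevity rather than as Jetty code. The class name, the slot-based assignment and the in-memory dictionary are all assumptions made for illustration; the answer only says to pick a different time per machine and remember it on the server.

```python
import random


class HeartbeatScheduler:
    """Hands each machine its own offset inside the heartbeat interval."""

    def __init__(self, interval_s: float, slots: int):
        self.interval_s = interval_s
        self.slots = slots
        self.next_slot = 0
        self.assigned = {}  # machine id -> slot index, kept on the server

    def offset_for(self, machine_id: str) -> float:
        """Return the machine's offset (seconds) within each interval."""
        if machine_id not in self.assigned:
            self.assigned[machine_id] = self.next_slot
            self.next_slot = (self.next_slot + 1) % self.slots
        return self.assigned[machine_id] * self.interval_s / self.slots

    def next_heartbeat(self, machine_id: str, now: float) -> float:
        """Time of the machine's next heartbeat, strictly after now."""
        offset = self.offset_for(machine_id)
        cycles = int((now - offset) // self.interval_s) + 1
        return cycles * self.interval_s + offset


def initial_delay(interval_s: float, rng: random.Random) -> float:
    """Random first delay so machines that boot together do not align."""
    return rng.uniform(0, interval_s)
```

The server would return the value of `next_heartbeat` in each heartbeat response, and a client would sleep for `initial_delay` before its very first request.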

Related

What is the cause of websocket latency spikes?

I am running a server (built on Tornado, in Python) on a single AWS instance and I am running into spikes in websocket latency.
Profiling the round trip time from when a websocket message is sent to the client, which then immediately sends an ack message back to the server, to when the server receives the ack message yields an average of <.1 second, however I note sometimes it goes up to 3 seconds. Note: there are no spikes when running the server locally.
What could be the cause or fix for this? I looked at the CPU usage and it only goes up to 40% max. The spikes are not correlated with heavy traffic (2 or 3 clients usually) and the client's internet seems fine. I find it hard to believe the instance is going beyond capacity with such low usage.
The fact that the spike is 3 seconds actually tells you a lot more than you may suspect about the nature of the problem.
It's packet loss.
TCP, as you likely know, is said to provide "reliable" transport, guaranteeing that payload sent is received by the far end in the order in which it was sent, because TCP reassembles things in the correct order before delivering the payload. One significant way in which this is accomplished is by the automatic retransmission of packets that are considered to have been lost.
You'll never guess the default initial timer value for retransmissions of lost packets. Or, perhaps, now, you will.
It's 3 seconds in many, if not most, implementations, based on standards established several years ago in a time when the bandwidth and latency of today's transmission links were unheard of, perhaps unimagined.
You won't see evidence of the retransmission at the websocket server or in the client software, because TCP shields the higher layers from knowing that it occurs... but 3 seconds is a dead giveaway that this is exactly the problem.
You'll see the retransmissions of the traffic occurring if you observe the network traffic with a packet sniffer, though that will only serve to confirm that this is the issue.
It could be loss from server to client, or loss from client to server. The latter is generally more likely, since clients often have a lower amount of available upstream bandwidth... but the directionality of the packet loss doesn't clearly indicate the physical location where it is occurring. Unless your client keeps track of local time, so that request and response initiation times can be correlated, you don't know whether the delay is in the message, or in the acknowledgement.
Under relatively light load, it seems unlikely that the problem is on your instance or in the AWS network on your side, and you obviously can't connect a sniffer to arbitrary points on the Internet to pinpoint the problem.
Given a case like this, it may be easier -- and surprisingly feasible -- to prove where the problem isn't, rather than where it is.
One technique for this would be to create a deliberate detour for the traffic through different equipment located elsewhere -- such as a different AWS region or another cloud provider.
First, of course, you'll want to learn to spot these retransmissions using Wireshark.
Then, configure a proxy server at a different location, using a simple TCP connection proxy -- such as HAProxy, or even a simple tool like redir or socat.
Such a proxy listens for connections from clients and, when one is established, creates a new TCP connection to the destination (your websocket server). Importantly, it ties the two connections together only at the payload level, not at the TCP level and certainly nothing lower, so retransmissions will be seen on the wire only between the intermediate server and the end of the connection that has the packet loss problem. The other end will show no evidence of the retransmissions, just data arriving later than expected.
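To make "tied together at the payload level" concrete, here is a minimal sketch of such a relay using Python's asyncio. It is an illustration of the idea rather than a substitute for HAProxy, redir or socat; the function names are made up for this example and the host and port in `main` are placeholders.

```python
import asyncio


async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy payload bytes one way until the sending side closes."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


async def start_relay(listen_host, listen_port, target_host, target_port):
    """Accept clients and splice each one to a fresh upstream TCP connection."""

    async def handle(client_reader, client_writer):
        try:
            upstream_reader, upstream_writer = await asyncio.open_connection(
                target_host, target_port)
        except OSError:
            client_writer.close()
            return
        # Two independent TCP connections; only the bytes are passed across.
        await asyncio.gather(
            pipe(client_reader, upstream_writer),
            pipe(upstream_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)


async def main():
    # Placeholder addresses: listen on 9000, forward to the websocket server.
    server = await start_relay("0.0.0.0", 9000, "ws.example.internal", 8888)
    async with server:
        await server.serve_forever()
```

Running `asyncio.run(main())` on the intermediate host starts the relay. Because each side is its own TCP connection, a packet capture on either leg shows only that leg's retransmissions. This sketch closes both directions as soon as one side finishes, which is a simplification.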
For this test to be meaningful, the proxy needs to be located away from the server and the client, and with no meaningful common infrastructure -- hence the suggestion of placing it in a different AWS region. A different availability zone in the same region may share common Internet infrastructure at some level, so that's not far enough away for this purpose.
If client <--> proxy <--> server shows TCP retransmissions on the path between proxy and server, and not between client and proxy, the problem really is likely to be in your server, its hardware, network, or Internet connection, and you'll have to proceed accordingly.
Conversely (and, I would suggest, more likely) if the path between proxy and server is free of retransmissions but the path between client and proxy is still dirty, you have eliminated the server and its infrastructure as the source of the problem. How to proceed is up to you, but at this point you do know what the problem... isn't.
Two other possibilities:
Both sides remain dirty, which is the least likely scenario. Rule 1 of troubleshooting is to assume initially that you only have one problem, not two.
Or, both sides are suddenly and unexpectedly clean when traffic uses this setup, which suggests that your test setup has routed around a broken piece of the Internet. You've "solved" it but have no idea how. We'll also hope this isn't the outcome, but given the vagaries of the global Internet, it's not unthinkable that your stack may include components like this, with geolocation-DNS-based selection of an intermediate endpoint. This seems like a convolution but does have its place.
Such a tactic is actually part of the logic behind the S3 transfer acceleration feature. The content is not any closer to the end user, but the TCP connection from the browser is being terminated on equipment in the AWS edge network, at a location that is often nearer to the browser, and a second TCP connection back to the bucket is established, with the payload connected together... and, yes, it's faster and more stable, with the significance of the change becoming more notable as distance and connection quality vary.

Debugging network applications and testing for synchronicity?

If I have a server running on my machine, and several clients running on other networks, what are some concepts of testing for synchronicity between them? How would I know when a client goes out-of-sync?
I'm particularly interested in how network programmers in the field of game design do this (or just any continuous network exchange application), where realtime synchronicity would be a commonly vital aspect of success.
I can see how this may be easily achieved on LAN via side-by-side comparisons on separate machines... but once you branch out the scenario to include clients from foreign networks, I'm just not sure how it can be done without clogging up your messaging system with debug information, and therefore effectively changing the way that synchronicity would result without that debug info being passed over the network.
So what are some ways that people get around this issue?
For example, do they simply induce/simulate latency on the local network before launching to foreign networks, and then hope for the best? I'm hoping there are some more concrete solutions, but this is what I'm doing in the meantime...
When you say synchronized, I believe you are talking about network latency. Meaning, that a client on a local network may get its gaming information sooner than a client on the other side of the country. Correct?
If so, then I'm sure you can look for books or papers that cover this kind of topic, but I can give you at least one way to detect this latency and provide a way to manage it.
To detect latency, your server can use a traceroute-style program to determine how long it takes for data to reach each client. A common Linux example is documented here: http://linux.about.com/library/cmd/blcmdl8_traceroute.htm. While the server is handling client data, it can also continuously collect latency statistics and provide them to the clients. For example, the server can update each client on its own network latency and on the longest latency in the group of clients that are playing each other in a game.
The clients can then use the latency differences to determine when they should process the data they receive from the server. For example, a client is told by the server that its network latency is 50 milliseconds and the maximum latency for its group is 300 milliseconds. The client then knows to wait 250 milliseconds before processing game data from the server. That way, each client processes game data from the server at approximately the same time.
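The arithmetic in that example can be written as a small sketch. The function names and the dictionary of per-client latencies are assumptions for illustration; the answer only specifies that each client waits for the difference between the group maximum and its own latency.

```python
def processing_delay_ms(own_latency_ms: float, group_max_latency_ms: float) -> float:
    """How long a client should hold server data before processing it."""
    return max(0.0, group_max_latency_ms - own_latency_ms)


def delays_for_group(latencies_ms: dict) -> dict:
    """Per-client hold time so all clients process at about the same moment."""
    slowest = max(latencies_ms.values())
    return {client: processing_delay_ms(latency, slowest)
            for client, latency in latencies_ms.items()}
```

With the numbers from the answer, `processing_delay_ms(50, 300)` gives 250, and the slowest client in a group gets a delay of 0.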
There are many other (and probably better) ways to handle this situation, but that should get you started in the right direction.

How heavy for the server to transmit data over HTTPS?

I am trying to implement web service and web client applications using Ruby on Rails 3. I am considering using SSL for them, but I would like to know: how "heavy" is it for a server to handle a lot of HTTPS connections instead of HTTP? What is the difference in response time and in overall performance?
The cost of the SSL/TLS handshake (which accounts for most of the overall "slowdown" SSL/TLS adds) is nowadays much less than the cost of TCP connection establishment and the other actions associated with session establishment (logging, user lookup etc). And if you worry about speed and want to save every nanosecond, there are hardware SSL accelerators that you can install in your server.
It is several times slower to go with HTTPS; however, most of the time that's not what is actually going to slow your app down. Especially if you're running on Rails, your performance scaling is going to be bottlenecked elsewhere in the system. If you are doing anything that requires passing secrets of any kind over the wire (including a shared session cookie), SSL is the only way to go and you probably won't notice the cost. If you happen to scale up to the point where you do start to see a performance hit from encryption, there are hardware acceleration appliances out there that help tremendously. However, Rails is likely to fall over long before that point.

How good is NTP for distributed time synchronization?

How accurate is NTP for keeping a set of servers time synchronized?
I'm writing a service which requires a set of servers (some acting as clients, some as servers) synchronized to second level granularity. I'm wondering if NTP is the best thing to use, or if there's something better?
Should I run a ntp server on one of them, and have the others use that as their source? Any other recommendations/horror stories with NTP?
All the servers are linux.
Update: Service levels:
I'd like the one server to be accurate UTC (second level, not microsecond or such), and I'd like all the other servers to have the same timestamp as that one server, regardless of whether it's accurate UTC or not. (Events are received by this one server from multiple locations at various intervals, and I require all those events to be at the same "relative" timestamp. No, I can't have the main server timestamp the events as they come in, because that would require storing an offset between when the event actually happened and when it was logged, which requires a whole lot of extra work and complicates matters needlessly.)
I've currently set up one server as a stratum 2 timeserver, using some stratum 1 GPS sources as servers in ntp.conf. On the other servers, I've set this server to be the sole server in ntp.conf.
I hope this will be enough.
Thank you!
NTP will keep you within a second well enough for most applications.
If you need higher precision, and all the servers are running *nix I would investigate implementing Precision Time Protocol. It involves multiple parent clocks and negotiation to find a reliable source in the network. This is the time protocol recommended for timestamping events in the power industry (e.g. accurate timestamping in the log files for relay actions and metering alarms aided in the investigation of the Northeast Blackout of 2003).
First off, you might have a look at the Wikipedia NTP page.
Basically, to start with (I preach this regularly), state what service levels you want. Do you need accurate UTC? To what tolerance? That is, do you really need to know what time it is?
Or do you simply want precise synchronization among the systems?
How many machines are we talking about, and are they geographically distributed?
Some options:
accurate time: Set up at least one server as stratum 2, and have it reference at least 3 stratum 1 servers. If you have lots of servers, make that more than one; obviously you get more reliability by having no single point of failure.
precise synchronization: set up NTP peers.
accurate time and geographical distribution: more than one stratum 2 server, as above, with one "near" each cluster; they can peer at stratum 2 to improve the voting.
I don't think there's anything well known better than NTP that's available.
Update: Another question mentions the Precision Time Protocol, PTP (IEEE 1588). This is excellent for precise synchronization, but depends on multicast.
Also, it's worth considering getting a GPS time source.
Yes, set up one of your servers as your in-house NTP server, and sync the others to that. It gives you accuracy typically within milliseconds, as I remember.
If any of your servers are way off -- and I can't remember what constitutes 'way off' -- NTP won't fix it. There is a way to automatically fix that but I can't remember at the moment.

World Clock Webservice?

What is the most reliable World Clock Webservice that you use?
Unfortunately, you'll probably never get a really accurate atomic clock webservice due to latency issues with the transport of the messages/packets back and forth from your machine to the server.
Most atomic clocks that are accessible over the internet use a specific protocol called the Network Time Protocol that includes a jitter buffer which specifically accounts for and adjusts based upon the latency of the transport. This provides a more accurate representation of the atomic clock's time than using a web-service over HTTP.
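As a sketch of how that latency adjustment works: an NTP exchange records four timestamps, and the standard on-wire calculation derives the clock offset and the round-trip delay from them. The function name is made up for this example. The offset is exact only when the network delay is the same in both directions, which is why asymmetric or variable latency limits accuracy.

```python
def ntp_offset_and_delay(t0: float, t1: float, t2: float, t3: float):
    """Standard NTP on-wire calculation.

    t0: client clock when the request left the client
    t1: server clock when the request arrived
    t2: server clock when the reply left the server
    t3: client clock when the reply arrived
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2  # how far the client clock is behind
    delay = (t3 - t0) - (t2 - t1)         # round trip minus server processing
    return offset, delay
```

For instance, if the client clock is 5 seconds behind, each direction takes 0.1 seconds and the server spends 0.01 seconds on the request, then `ntp_offset_and_delay(100.0, 105.1, 105.11, 100.21)` yields an offset of about 5.0 and a delay of about 0.2. A plain HTTP time service returns only a single server timestamp, so the client cannot separate latency from offset in this way.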
I think if you must use a webservice, the most accurate one will be the one hosted on a server that is physically and geographically closest to you and also has the least number of network hops to get from your own machine to the server, since this will reduce the latency of the packets.
Understood about latency. With that in mind, I go to NIST's site for US times and World Time Server for the rest. Don't know if either is the "best".
I think due to latency, there is no such thing as a reliable atomic clock webservice.
Here's a blog post which comes to the same conclusion.
Purists are quick to point to the accuracy problem. But I bet you could not even get perfectly accurate time even if your application was sitting on the same server as the atomic clock software itself.
I think there is a need for a clock Web Service. I can think of a few scenarios where it doesn't matter being off a few seconds.
Aside from accuracy, another challenging area of serving up date and time is taking into account the daylight saving details of most countries. That is something even the latest OSes struggle to get right. But that is definitely something that would make a clock Web Service valuable.
Since there are so few web services out there delivering time, http://www.timeapi.org/utc/now is the only reliable web service that I know of (besides http://www.earthtools.org/timezone/0/0, which does not appear to be reliable). Therefore it's the most accurate one I can recommend, especially if you are just using it for determining the difference between local time and UTC time, which can be rounded to the nearest 15 minutes. And if you want the time in a specific time zone, replace utc with the three-letter abbreviation for the time zone -- i.e., http://www.timeapi.org/est/now for Eastern Standard Time.
An NTP webservice would be fine as long as the latency is predictable. NTP is a wire protocol and is kept very lightweight to remove any moving pieces that may cause additional variation in latency (aka jitter). A SOAP stack would introduce more variability.