I have a client for whom I developed a Rails app. The app relies on his customers uploading a variety of images, files, and PDFs, ranging in size from 1 MB to 100 MB.
He's been telling me that many of his customers are complaining about slow and unstable upload speeds.
The uploads go directly to Amazon S3. I explained to him that there could be factors affecting upload speed that are out of my control.
But he insists that there is something we can do to improve upload speed.
I'm running out of ideas and expertise here. Does anyone have a solution?
On the surface, there are two answers -- no, of course there's nothing you can do, the Internet is a best-effort transport, etc., etc.,... and no, there really shouldn't be a problem, because S3 uploads perform quite well.
There is an option worth considering, though.
You can deploy a global network of proxy servers in front of S3 and use geographic DNS to route those customers to their nearest proxy. Then install high-speed, low latency optical circuits from the proxies back to S3, reducing the amount of "unknown" in the path, as well as reducing the round-trip time and packet loss potential between the browser and the chosen proxy node at the edge of your network, improving throughput.
I hope the previous paragraph is amusing on first reading, since it sounds like a preposterously grandiose plan for improving uploads to S3... but of course, I'm referring to CloudFront.
You don't actually have to use it for downloads; you can, if you want, just use it for uploads.
your users can now benefit from accelerated content uploads. After you enable the additional HTTP methods for your application’s distribution, PUT and POST operations will be sent to the origin (e.g. Amazon S3) via the CloudFront edge location, improving efficiency, reducing latency, and allowing the application to benefit from the monitored, persistent connections that CloudFront maintains from the edge locations to the origin servers.
https://aws.amazon.com/blogs/aws/amazon-cloudfront-content-uploads-post-put-other-methods/
To illustrate that the benefit here does have a solid theoretical basis...
Back in the day when we still used telnet, when T1s were fast Internet connections and 33.6kbps was a good modem, I discovered that I had far better responsiveness from home, making a telnet connection to a distant system, if I first made a telnet connection to a server immediately on the other side of the modem link, then made a telnet connection to the distant node from within the server.
A direct telnet connection to the distant system followed exactly the same path, through all the same routers and circuits, and yet, it was so sluggish as to be unusable. Why the stark difference, and what caused the substantial improvement?
The explanation was that making the intermediate connection to the server meant there were two independent TCP connections, with only their payload tied together: me to the server... and the server to the distant system. Both connections were bad in their own way -- high latency on my modem link, and congestion/packet loss on the distant link (which had much lower round-trip times, but was overloaded with traffic). The direct connection meant I had a TCP connection that had to recover from packet loss while dealing with excessive latency. Making the intermediate connection meant that the recovery from the packet loss was not further impaired by the additional latency added by my modem connection, because the packet loss was handled only on the 2nd leg of the connection.
Using CloudFront in front of S3 promises to solve the same sort of problem in reverse -- improving the responsiveness, and therefore the throughput, of a connection of unknown quality by splitting the TCP connection into two independent connections, at the user's nearest CloudFront edge.
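To put rough numbers on why the split helps, here is a minimal sketch using the well-known Mathis et al. approximation for steady-state TCP throughput, rate ≈ 1.22 × MSS / (RTT × √loss). The latencies, loss rate, and link capacity below are made-up values, not measurements, and the model ignores many real-world effects, so treat it as an illustration only.

```python
import math

MSS_BYTES = 1460  # assumed TCP segment payload size


def mathis_throughput(rtt_s, loss_rate, mss_bytes=MSS_BYTES):
    """Rough upper bound on steady-state TCP throughput, in bytes/second."""
    return 1.22 * mss_bytes / (rtt_s * math.sqrt(loss_rate))


def split_path_throughput(clean_leg_capacity, lossy_leg_rtt_s, loss_rate):
    """With a relay in the middle, each leg recovers from loss on its own,
    so the end-to-end rate is limited by the slower of the two legs."""
    return min(clean_leg_capacity, mathis_throughput(lossy_leg_rtt_s, loss_rate))


# Hypothetical path: a 250 ms loss-free leg plus a 50 ms leg with 2% loss.
CLEAN_RTT_S, LOSSY_RTT_S, LOSS = 0.250, 0.050, 0.02
CLEAN_CAPACITY = 1_000_000  # bytes/second the clean leg could carry (assumed)

# One end-to-end connection: loss recovery has to span the full round trip.
direct = mathis_throughput(CLEAN_RTT_S + LOSSY_RTT_S, LOSS)
# Two connections: loss recovery only spans the short, lossy leg.
split = split_path_throughput(CLEAN_CAPACITY, LOSSY_RTT_S, LOSS)

print(f"direct: {direct:,.0f} B/s, split: {split:,.0f} B/s")
```

With these assumed numbers, the only difference between the two cases is the round-trip time over which loss recovery has to happen, so the split path comes out ahead by the ratio of the two RTTs.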
Related
I am running a server (that uses tornado python) on a single AWS instance and I am running into spikes in websocket latency.
Profiling the round trip time, from when a websocket message is sent to the client (which immediately sends an ack message back) to when the server receives that ack, yields an average of under 0.1 seconds. However, it sometimes goes up to 3 seconds. Note: there are no spikes when running the server locally.
What could be the cause or fix for this? I looked at the CPU usage and it only goes up to 40% max. The spikes are not correlated with heavy traffic (2 or 3 clients usually) and the client's internet seems fine. I find it hard to believe the instance is going beyond capacity with such low usage.
The fact that the spike is 3 seconds is actually telling you a lot more than you may suspect, about the nature of the problem.
It's packet loss.
TCP, as you likely know, is said to provide "reliable" transport, guaranteeing that payload sent is received by the far end in the order in which it was sent, because TCP reassembles things in the correct order before delivering the payload. One significant way in which this is accomplished is by the automatic retransmission of packets that are considered to have been lost.
You'll never guess the default initial timer value for retransmissions of lost packets. Or, perhaps, now, you will.
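It's 3 seconds in many, if not most, implementations, based on standards established years ago (RFC 1122, and later RFC 2988, specified a 3-second initial retransmission timeout; RFC 6298 has since lowered the recommendation to 1 second), in a time when the bandwidth and latency of today's transmission links were unheard of, perhaps unimagined.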
You won't see evidence of the retransmission at the websocket server or in the client software, because TCP shields the higher layers from knowing that it occurs... but 3 seconds is a dead giveaway that this is exactly the problem.
You'll see the retransmissions of the traffic occurring if you observe the network traffic with a packet sniffer, though that will only serve to confirm that this is the issue.
It could be loss from server to client, or loss from client to server. The latter is generally more likely, since clients often have a lower amount of available upstream bandwidth... but the directionality of the packet loss doesn't clearly indicate the physical location where it is occurring. Unless your client keeps track of local time, so that request and response initiation times can be correlated, you don't know whether the delay is in the message, or in the acknowledgement.
Under relatively light load, it seems unlikely that the problem is on your instance or in the AWS network on your side, and you obviously can't connect a sniffer to arbitrary points on the Internet to pinpoint the problem.
Given a case like this, it may be easier -- and surprisingly feasible -- to prove where the problem isn't, rather than where it is.
One technique for this would be to create a deliberate detour for the traffic through different equipment located elsewhere -- such as a different AWS region or another cloud provider.
First, of course, you'll want to learn to spot these retransmissions using Wireshark.
Then, configure a proxy server at a different location, using a simple TCP connection proxy -- such as HAProxy, or even a simple tool like redir or socat.
Such a configuration will listen for connections from clients, and when one is established, will create a new TCP connection to the destination (your websocket server) but -- importantly -- it only ties the two connections together at the payload level -- not the TCP level, and of course nothing lower -- so retransmissions will only be seen on the wire between this intermediate server and the end of the connection with the packet loss problem. The other end will show no evidence of the retransmissions -- just data arriving later than expected.
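As a rough illustration of what such a payload-level proxy does, here is a minimal sketch in Python using only the standard library's asyncio streams. The function names are made up for this example, and it is not a substitute for HAProxy, redir, or socat; it just shows two independent TCP connections joined only by the bytes they carry.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes from one connection to the other until EOF, then close
    the connection being written to."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass  # the peer went away; nothing more to copy
    finally:
        writer.close()


async def start_relay(listen_host, listen_port, target_host, target_port):
    """Accept client connections and, for each one, open a separate TCP
    connection to the target. Only the payload crosses between the two."""
    async def handle(client_reader, client_writer):
        try:
            target_reader, target_writer = await asyncio.open_connection(
                target_host, target_port)
        except OSError:
            client_writer.close()
            return
        # Relay both directions concurrently.
        await asyncio.gather(
            pipe(client_reader, target_writer),
            pipe(target_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

To try it, await `start_relay(...)` with your own listen address and the address of the websocket server inside a coroutine run by `asyncio.run`, then call `serve_forever()` on the returned server and point the client at the relay instead of at the server.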
For this test to be meaningful, the proxy needs to be located away from the server and the client, and with no meaningful common infrastructure -- hence the suggestion of placing it in a different AWS region. A different availability zone in the same region may share common Internet infrastructure at some level, so that's not far enough away for this purpose.
If client <--> proxy <--> server shows TCP retransmissions on the path between proxy and server, and not between client and proxy, the problem really is likely to be in your server, its hardware, network, or Internet connection, and you'll have to proceed accordingly.
Conversely (and, I would suggest, more likely) if the path between proxy and server is free of retransmissions but the path between client and proxy is still dirty, you have eliminated the server and its infrastructure as the source of the problem. How to proceed is up to you, but at this point you do know what the problem... isn't.
Two other possibilities:
Both sides remain dirty, which is the least likely scenario. Rule 1 of troubleshooting is to assume initially that you only have one problem, not two.
Or, both sides are suddenly and unexpectedly clean when traffic uses this setup, which suggests that your test setup has routed around a broken piece of the Internet. You've "solved" it but have no idea how. We'll also hope this isn't the outcome, but given the vagaries of the global Internet, it's not unthinkable that your stack may include components like this, with geolocation-DNS-based selection of an intermediate endpoint. This seems like a convolution but does have its place.
Such a tactic is actually part of the logic behind the S3 transfer acceleration feature. The content is not any closer to the end user, but the TCP connection from the browser is being terminated on equipment in the AWS edge network, at a location that is often nearer to the browser, and a second TCP connection back to the bucket is established, with the payload connected together... and, yes, it's faster and more stable, with the significance of the change becoming more notable as distance and connection quality vary.
I am running an Openfire server on an AWS EC2 instance and am able to connect to the server from my mobile devices and send messages back and forth. Of course, since XMPP is a client-server based protocol, I incur costs for running this traffic over the AWS server. However, for most use cases, this cost is not very high at all, as normal XMPP stanzas rarely seem to go above ca. 1 KB, so from this end all is ok.
I would now, however, like to include the ability to send images from one client to another. One way would be to use an HTTP server, to which user A uploads the picture and then sends the URL of the image to user B via XMPP, so that the user can now get the image via HTTP. There are also several other methods for sending images via XMPP. However, I am interested in doing this via Jingle.
As far as I understand, Jingle is an out-of-band peer-to-peer extension to XMPP. My simple question is, since Jingle communicates peer-to-peer, i.e. without the use of a server, for the multimedia aspect of the session, will I even incur any data cost on AWS for transferring multimedia from one client to another using Jingle? Or put differently, if Jingle is peer-to-peer, does any data go via my AWS server using Jingle (except the session initiate, ack, session terminate stanzas)? If not, what kind of route does this data take, and how can anyone be billed for this traffic cost, if it is peer-to-peer?
Jingle is a negotiation mechanism, and there are a couple of different transports it could negotiate for file transfer. The most common transport is peer-to-peer bytestreams, defined in http://xmpp.org/extensions/xep-0260.html. Here the only traffic you'd see via the server would be the Jingle negotiation, which is a similar sort of volume to other XMPP traffic. There is also an in-band bytestream (IBB) transport defined in http://xmpp.org/extensions/xep-0261.html that some clients will use, typically for smaller transfers, as it's inefficient but has the advantage of working in hostile networks with NAT and firewalls. If you control the clients, simply not supporting IBB would be your best bet for ensuring the traffic doesn't travel via the server. If you don't, I'd suggest configuring your server to block IBB traffic.
I note as well that running a server-side proxy will drastically increase the odds of the out-of-band mechanism in XEP-0260 working in the face of hostile networks, at the cost of server bandwidth.
There is also the not-widely-deployed http://xmpp.org/extensions/xep-0343.html out of band transport.
If I have a server running on my machine, and several clients running on other networks, what are some concepts of testing for synchronicity between them? How would I know when a client goes out-of-sync?
I'm particularly interested in how network programmers in the field of game design do this (or just any continuous network exchange application), where realtime synchronicity would be a commonly vital aspect of success.
I can see how this may be easily achieved on LAN via side-by-side comparisons on separate machines... but once you branch out the scenario to include clients from foreign networks, I'm just not sure how it can be done without clogging up your messaging system with debug information, and therefore effectively changing the way that synchronicity would result without that debug info being passed over the network.
So what are some ways that people get around this issue?
For example, do they simply induce/simulate latency on the local network before launching to foreign networks, and then hope for the best? I'm hoping there are some more concrete solutions, but this is what I'm doing in the meantime...
When you say synchronized, I believe you are talking about network latency. Meaning, that a client on a local network may get its gaming information sooner than a client on the other side of the country. Correct?
If so, then I'm sure you can look for books or papers that cover this kind of topic, but I can give you at least one way to detect this latency and provide a way to manage it.
To detect latency, your server can use a type of trace route program to determine how long it takes for data to reach each client. A common Linux program example can be found here http://linux.about.com/library/cmd/blcmdl8_traceroute.htm. While the server is handling client data, it can also continuously collect the latency statistics and provide the data to the clients. For example, the server can update each client on its own network latency and what the longest latency is for the group of clients that are playing each other in a game.
The clients can then use the latency differences to determine when they should process the data they receive from the server. For example, a client is told by the server that its network latency is 50 milliseconds and the maximum latency for its group is 300 milliseconds. The client then knows to wait 250 milliseconds before processing game data from the server. That way, each client processes game data from the server at approximately the same time.
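The waiting rule described above is simple enough to sketch in a few lines of Python. The function names and the sample latencies are hypothetical; in practice the server would supply the measured values.

```python
def processing_delay_ms(own_latency_ms, group_max_latency_ms):
    """How long a client should hold server data before acting on it, so
    that every client in the group acts at roughly the same moment."""
    return max(0, group_max_latency_ms - own_latency_ms)


def delays_for_group(latencies_ms):
    """Map each client id to its hold time, given the measured latencies."""
    slowest = max(latencies_ms.values())
    return {client: processing_delay_ms(latency, slowest)
            for client, latency in latencies_ms.items()}


# Hypothetical measurements, in milliseconds.
group = {"alice": 50, "bob": 300, "carol": 120}
print(delays_for_group(group))  # {'alice': 250, 'bob': 0, 'carol': 180}
```

The client with the worst latency waits 0 ms, and everyone else waits out the difference.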
There are many other (and probably better) ways to handle this situation, but that should get you started in the right direction.
I have a large number of machines (thousands and more) that every X seconds perform an HTTP request to a Jetty server to notify it that they are alive. For what values of X should I use persistent HTTP connections (which limits the number of monitored machines to the number of concurrent connections), and for what values of X should the client re-establish a TCP connection each time (which in theory would allow monitoring more machines with the same Jetty server)?
How would the answer change for HTTPS connections? (Assuming CPU is not a constraint)
This question ignores scaling-out with multiple Jetty web servers on purpose.
Update: Basically the question can be reduced to the smallest recommended value of lowResourcesMaxIdleTime.
I would say that this is less of a Jetty scaling issue and more of a network scaling issue, in which case 'it depends' on your network infrastructure. Only you really know how your network is laid out and what sort of latencies are involved in order to come up with a value of X.
From an overhead perspective, the persistent HTTP connections will of course have some minor effect (well, I say minor, but it depends on your network), and HTTPS will have a larger impact... but only from a volume-of-traffic perspective, since you are assuming CPU is not a constraint.
So Jetty really doesn't need to be involved in the question. You seem to ultimately be asking for help optimizing bytes of traffic on the wire, so really you are looking for the best protocol at this point. Since with HTTP you have to send headers with each request, you may be well served by something like SPDY or WebSocket, which give you persistent connections but are optimized for low round-trip network overhead. But... they seem sort of overkill for a heartbeat. :)
How about just making them send their requests at different times? When the first machine sends a request, pick a time for its next heartbeat and return it in the response (also keep the id/time on the Jetty server). When the second machine sends a request, pick a different time and return that to the second machine.
In this way, each machine performs its heartbeat request at a different time, so there is no concurrency issue.
You can also use a random time for the first heartbeat, in case all machines start up at the same time.
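One possible shape for that scheduling logic, sketched in Python rather than as Jetty code: the class, its slot scheme, and the method names are all assumptions made for illustration. The server would call `next_heartbeat` while handling each request and put the result in the response.

```python
import random


class HeartbeatScheduler:
    """Hands each machine its own offset within the heartbeat period."""

    def __init__(self, period_s, slots):
        self.period_s = period_s
        self.slots = slots      # number of distinct offsets per period
        self.assigned = {}      # machine id -> slot index
        self.last_seen = {}     # machine id -> time of last heartbeat

    def _slot_for(self, machine_id):
        # Give each newly seen machine the next slot, round-robin.
        if machine_id not in self.assigned:
            self.assigned[machine_id] = len(self.assigned) % self.slots
        return self.assigned[machine_id]

    def next_heartbeat(self, machine_id, now_s):
        """Record a heartbeat and return the absolute time of the next one."""
        self.last_seen[machine_id] = now_s
        offset = self._slot_for(machine_id) * self.period_s / self.slots
        # First time strictly after now_s that lands on this machine's offset.
        periods_elapsed = (now_s - offset) // self.period_s + 1
        return offset + periods_elapsed * self.period_s


def initial_delay(period_s, rng=random):
    """Random first-heartbeat delay so machines booting together spread out."""
    return rng.uniform(0, period_s)
```

For example, with a 60-second period and 4 slots, the first three machines to check in at t=10 would be told to come back at t=60, t=15, and t=30 respectively, and from then on each keeps its own offset. The first interval after a machine is assigned a slot can be shorter than the period; later ones are a full period apart.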
I am trying to implement web service and web client applications using Ruby on Rails 3. For that I am considering using SSL, but I would like to know: how "heavy" is it for servers to handle a lot of HTTPS connections instead of HTTP? What is the difference in response time and in overall performance?
The cost of the SSL/TLS handshake (which accounts for most of the overall "slowdown" SSL/TLS adds) is nowadays much less than the cost of TCP connection establishment and other actions associated with session establishment (logging, user lookup, etc.). And if you worry about speed and want to save every nanosecond, there are hardware SSL accelerators that you can install in your server.
HTTPS is several times slower; however, most of the time that's not what is actually going to slow your app down. Especially if you're running on Rails, your performance scaling is going to be bottlenecked elsewhere in the system. If you are doing anything that requires passing secrets of any kind over the wire (including a shared session cookie), SSL is the only way to go and you probably won't notice the cost. If you happen to scale up to the point where you do start to see a performance hit from encryption, there are hardware acceleration appliances out there that help tremendously. However, Rails is likely to fall over long before that point.