I have issues with websocket performance on AWS EC2.
I use websockets to listen to a server with incoming network rate 100-300 Kb/sec. Just listening, not sending. On EC2, every 10-20 minutes, I get disconnected (code 1006 - abnormal connection loss - no reason given). I have tested with t2.micro (which I believe should be more than enough for such a small task) and t2.large. I use US East, which should be close to the source.
This is to be compared with only one disconnection every few hours when I run the same app on my personal computer, in a different country. I have used two different libraries (Python aiohttp and websockets) to confirm that I have the same issues.
This points to an issue with network quality on EC2. However, I don't think this websocket task is demanding, so this is surprising.
Did anyone experience this before? What other diagnostics can I do to better understand the root cause?
I am experiencing the same issue that many others have had (based on my research), but none of them seem to have figured out a solution.
I am connecting from localhost to my AWS database (server located in Dublin - eu-west-1) and I am located in Denmark.
From my investigation I concluded that none of the following is the culprit:
Internet speed I am using is 1GB/s so it is definitely not a problem!
Bandwidth download - 500 GB, upload - 150 GB.
Max connections to database is 312.
Latency shouldn't be a problem as it is within the continent.
I referred to this post, but none of the posted solutions fits my case: "AWS RDS painfully slow when connecting from local machine".
My queries are taking way too long (40+ seconds).
What could be causing it?!
EDIT:
I currently am using PDO.
I have tested it with ODBC connection, same problem there.
If you are thinking that the amount of data is affecting it, that's wrong. I am fetching the same amount of data from the server-hosted website, but at lightning speed.
I am working on a new Amazon Redshift database that I recently started.
I am experiencing an issue where after I connect to the database, I can run queries without any issue. However, if I spend some time without running anything (like, 5 minutes), when I try running another query or command, it never finishes.
I am using DBeaver Community 21.2.2 to interact with the connection, and it stays on "Executing query" forever. The only way I can get it to work is by cancelling, disconnecting from Redshift, connecting again, and then it executes correctly. Until I stop using it for some minutes, and then it happens all over again.
I thought this was a DBeaver issue, as we have a Metabase instance connected to this same cluster without any issues. But today, I tried manipulating this cluster with R using RJDBC, and the same thing happens: I can run queries, until I stop, and then when I try running something else it never finishes, until I disconnect and connect again.
I'm sorry if I wasn't able to explain it clearly. I tried searching for similar issues but couldn't find any.
I suspect that the queries in question are not even being launched on the database. You can check this by reviewing svl_statementtext to see if the query is even being seen. Put a unique comment in the query to help determine if it is actually the query in question.
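As a rough sketch of the "unique comment" check, the helper below tags a statement and builds the matching lookup against svl_statementtext. The function names and the "probe-" tag format are made up for illustration; only the view name comes from the text above, and the selected columns are the ones I'd expect to be useful.

```python
import uuid


def tag_query(sql, tag=None):
    """Prefix sql with a unique comment so it can be found in the statement log."""
    # Hypothetical tag format: "probe-" plus 12 random hex characters.
    tag = tag or "probe-" + uuid.uuid4().hex[:12]
    return tag, "/* {} */ {}".format(tag, sql)


def lookup_sql(tag):
    """Build a query that searches svl_statementtext for the tagged statement."""
    # The tag is embedded directly in the SQL, so only allow harmless characters.
    if not tag.replace("-", "").isalnum():
        raise ValueError("tag must be alphanumeric plus '-'")
    return (
        "SELECT starttime, pid, xid, sequence, text "
        "FROM svl_statementtext "
        "WHERE text LIKE '%{}%' "
        "ORDER BY starttime, sequence;".format(tag)
    )
```

Run the tagged statement on the connection that has been idle, then run the lookup from a fresh connection. The lookup statement contains the tag as well, so expect it to show up in its own results; what matters is whether the tagged statement appears too.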
Since I've seen similar behavior before I'll write up a possible way this can happen. In this case the queries were not being seen by the database or the connection to the database was being dropped mid execution. The cause is network switches and their configurations.
Typical network connections are fairly quick - you ask for a web page and it is given to you. Connection is complete. When you click on a link, a new connection is established and also ends quickly. These network actions are atomic from a network connection point of view. However, database connections are different. One connection is made and many back and forth transmissions of data happen while the connection is open. This is no problem, and with the right set of network configurations these connections can be open and idle for days.
The problem comes in when the operators of the network equipment decide that connections that have no data flowing are "stale" after some fixed amount of time. They do this so that the network equipment can "forget" about these connections and focus on "active" connections. ISPs drop idle connections a lot so that they can handle the load of traffic and connections that flow through their equipment. This doesn't cause any issues for web pages and APIs, but database connections get clobbered.
When this happens it looks exactly like what you describe. Both sides (client and database) think that the connection is still active, but the network equipment has dropped the connection. Nothing gets through, but no notification is sent to either party. You will likely see corresponding open sessions on the Redshift side for these dropped connections, and the database is just waiting for the client to give a command on each of them. An administrator will need to go through and close (terminate) these sessions for them to go away.
Now the thing that doesn't align with experience is the speed at which these connections are being marked as "stale". In my case my ISP was closing connections that were idle for more than 30 min. You seem to be timing out much faster than this. In some cases corporate firewalls will be configured with short idle connection timeouts for routes out of the private network to the internet. So there are cases where the timeouts can be short. The networks at AWS do not have these timeouts so if your connections are completely within AWS then this isn't your answer.
To address this there are a few ways to go. The easy way is to set up a tunnel into AWS with "keep alive" packets sent every 30 sec or so. You will need an EC2 instance at AWS, so it isn't cost free. SSH tunneling is the usual tool for this and there are write-ups online for setting it up.
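This is not the SSH tunnel itself, but the same "keep alive" idea can be sketched at the socket level: ask the operating system to send TCP keepalive probes on an otherwise idle connection. The function name and the 30 second defaults are assumptions for illustration, and whether a given JDBC or ODBC driver lets you reach these settings is something to check in its documentation.

```python
import socket


def enable_keepalive(sock, idle=30, interval=30, count=3):
    """Turn on TCP keepalive probes for sock and return it.

    idle:     seconds of silence before the first probe
    interval: seconds between probes
    count:    unanswered probes before the OS gives up on the connection
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options are platform dependent, so set only those that
    # this Python build actually exposes.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

The idle value has to be shorter than whatever timeout the network equipment applies, otherwise the connection is forgotten before the first probe goes out.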
The hard way (but likely the most correct way) is to work with network experts to understand when the timeout is happening and why. If the timeout cannot be changed, then it may be possible to configure a different network topology for your use case. Network peering or a VPN could address this.
In some cases you may be able to avoid JDBC or ODBC connections altogether. These protocols are valid, but they are old and most networking doesn't work this way anymore, which is why they suffer from these issues. The Redshift Data API lets you issue SQL to Redshift in a single package and check on completion later on. These API calls are each independent connections, so there is no possibility of "timing out" between them. The downside is that this process is not interactive and therefore not supported by workbenches.
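A minimal sketch of the submit-then-poll pattern with the Data API follows. It assumes you pass in a boto3 "redshift-data" client created elsewhere; the function name, the polling interval and the timeout are illustrative choices, not anything prescribed by AWS.

```python
import time


def run_statement(client, sql, cluster_id, database, db_user,
                  poll_seconds=1.0, timeout=300.0):
    """Submit sql through the Redshift Data API and poll until it completes.

    client is assumed to be boto3.client("redshift-data"), built by the caller.
    Returns the final describe_statement response.
    """
    submitted = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    statement_id = submitted["Id"]
    deadline = time.monotonic() + timeout
    while True:
        # Each poll is an independent API call; nothing stays open in between.
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            return desc
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(
                "statement {} {}: {}".format(statement_id, status, desc.get("Error"))
            )
        if time.monotonic() > deadline:
            raise TimeoutError("statement {} still {}".format(statement_id, status))
        time.sleep(poll_seconds)
```

Rows, if any, would then be fetched with a separate get_statement_result call using the same statement Id.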
So does this match what you have going on?
For some reason, download traffic from a virtual machine on GCP (Google Cloud Platform) running Debian 9 is limited to 50K/s. Upload seems to be fine, in line with my local upload link.
It is the same with scp or https download. Any suggestions what might be wrong, where to search?
Machine type: n1-standard-1 (1 vCPU, 3.75 GB memory)
CPU platform: Intel Skylake
Zone: europe-west4-a
Network interfaces: Premium tier
Thanks,
Mihaelus
Simple test:
wget https://hrcki.primasystems.si/Nova/assets/download.test.html
Output:
--2018-10-18 15:21:00-- https://hrcki.primasystems.si/Nova/assets/download.test.html
Resolving hrcki.primasystems.si (hrcki.primasystems.si)... 35.204.252.248
Connecting to hrcki.primasystems.si (hrcki.primasystems.si)|35.204.252.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 541422592 (516M) [text/html]
Saving to: `download.test.html.1'
0% [] 1,073,152 48.7K/s eta 2h 59m
Always good to minimize variables when trying to diagnose. So while it is unlikely the use of HTTP is why things are that very slow, you might consider using netperf or iperf3 to measure TCP bulk transfer performance between your VM in GCP and your local system. You can do that either "by hand" or via PerfKit Benchmarker https://cloud.google.com/blog/products/networking/perfkit-benchmarker-for-evaluating-cloud-network-performance
It can be helpful to have packet traces - from both ends when possible - to look at. You want the packet traces to be started before the test - it is important to see the packets used to establish the TCP connection(s). They do not need to be "full packet" traces, and often you don't want them to be. Capturing just the first 96 bytes of each packet would be sufficient for this sort of investigating.
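One way to get a truncated trace like that is tcpdump with a 96 byte snap length. The helper below only builds the argument list; the function name, interface, peer address and file name are placeholders.

```python
def tcpdump_command(interface, peer_host, out_file, snaplen=96):
    """Build a tcpdump argument list that keeps only the first snaplen bytes
    of each packet exchanged with peer_host and writes them to out_file."""
    return [
        "tcpdump",
        "-i", interface,      # capture interface, e.g. "eth0"
        "-s", str(snaplen),   # truncate each packet to this many bytes
        "-w", out_file,       # write raw packets to a pcap file
        "host", peer_host,    # capture filter: traffic to or from the peer
    ]
```

The list can be handed to subprocess.Popen (usually as root) before the transfer starts and the process stopped once it ends, so that the connection setup packets are included in the trace.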
You might also consider taking snapshots of the network statistics offered by the OSes running in your GCP VM and local system. For example, if running *nix taking a snapshot of "netstat -s" before and after the test. And perhaps a traceroute from each end towards the other.
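To compare two "netstat -s" snapshots without reading them line by line, something like the following sketch can help. It assumes the Linux output layout (unindented section headers, indented counters written either as "123 description" or "Name: 123") and silently ignores lines in any other shape; the function names are made up.

```python
import re

# "    17 segments retransmitted"
_COUNT_FIRST = re.compile(r"^\s+(\d+)\s+(.+?)\s*$")
# "    TCPLostRetransmit: 4"
_NAME_FIRST = re.compile(r"^\s+([^:]+):\s+(\d+)\s*$")


def parse_netstat_s(text):
    """Parse `netstat -s` output into {(section, counter_name): value}."""
    counters = {}
    section = ""
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: a section header such as "Tcp:".
            section = line.strip().rstrip(":")
            continue
        match = _COUNT_FIRST.match(line)
        if match:
            counters[(section, match.group(2))] = int(match.group(1))
            continue
        match = _NAME_FIRST.match(line)
        if match:
            counters[(section, match.group(1).strip())] = int(match.group(2))
    return counters


def diff_counters(before, after):
    """Return only the counters whose value changed between two snapshots."""
    changed = {}
    for key, value in after.items():
        delta = value - before.get(key, 0)
        if delta:
            changed[key] = delta
    return changed
```

Counters such as retransmitted segments that grow during the slow download are the ones worth a closer look.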
Network statistics and packet traces, along with as many details about the two endpoints as possible are among the sorts of things support organizations are likely to request when looking to help resolve an issue of this sort.
I am running a server (that uses tornado python) on a single AWS instance and I am running into spikes in websocket latency.
Profiling the round trip time from when a websocket message is sent to the client, which then immediately sends an ack message back to the server, to when the server receives the ack message yields an average of <.1 second, however I note sometimes it goes up to 3 seconds. Note: there are no spikes when running the server locally.
What could be the cause or fix for this? I looked at the CPU usage and it only goes up to 40% max. The spikes are not correlated with heavy traffic (2 or 3 clients usually) and the client's internet seems fine. I find it hard to believe the instance is going beyond capacity with such low usage.
The fact that the spike is 3 seconds is actually telling you a lot more than you may suspect, about the nature of the problem.
It's packet loss.
TCP, as you likely know, is said to provide "reliable" transport, guaranteeing that payload sent is received by the far end in the order in which it was sent, because TCP reassembles things in the correct order before delivering the payload. One significant way in which this is accomplished is by the automatic retransmission of packets that are considered to have been lost.
You'll never guess the default initial timer value for retransmissions of lost packets. Or, perhaps, now, you will.
It's 3 seconds in many, if not most, implementations, based on standards established several years ago in a time when the bandwidth and latency of today's transmission links were unheard of, perhaps unimagined.
You won't see evidence of the retransmission at the websocket server or the client software, because TCP shields the higher layers from knowing that it occurs... but 3 seconds is a dead giveaway that this is exactly the problem.
You'll see the retransmissions of the traffic occurring if you observe the network traffic with a packet sniffer, though that will only serve to confirm that this is the issue.
It could be loss from server to client, or loss from client to server. The latter is generally more likely, since clients often have a lower amount of available upstream bandwidth... but the directionality of the packet loss doesn't clearly indicate the physical location where it is occurring. Unless your client keeps track of local time, so that request and response initiation times can be correlated, you don't know whether the delay is in the message, or in the acknowledgement.
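If the client can report the local time at which it received each message, the direction of a spike can be estimated even when the two clocks disagree. The sketch below assumes the clock offset stays constant over the sample window and that the client acks immediately on receipt; the function name and the sample layout are made up for illustration.

```python
def excess_delays(samples):
    """Split each round trip into per-direction delay above the best case.

    samples: list of (server_send, client_recv, server_recv) in seconds.
             server_send and server_recv use the server clock,
             client_recv uses the client clock.
    Returns a list of (forward_excess, backward_excess) tuples.
    """
    # Each difference mixes the true one-way delay with the unknown offset.
    forward = [client - sent for sent, client, _ in samples]
    backward = [received - client for _, client, received in samples]
    # The offset is the same in every sample, so subtracting the smallest
    # value in each direction removes it and leaves the extra delay.
    best_forward, best_backward = min(forward), min(backward)
    return [(f - best_forward, b - best_backward) for f, b in zip(forward, backward)]
```

A sample whose backward excess is close to 3 seconds while its forward excess is close to zero points at the acknowledgement path, and vice versa.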
Under relatively light load, it seems unlikely that the problem is on your instance or in the AWS network on your side, and you obviously can't connect a sniffer to arbitrary points on the Internet to pinpoint the problem.
Given a case like this, it may be easier -- and surprisingly feasible -- to prove where the problem isn't, rather than where it is.
One technique for this would be to create a deliberate detour for the traffic through different equipment located elsewhere -- such as a different AWS region or another cloud provider.
First, of course, you'll want to learn to spot these retransmissions using wireshark.
Then, configure a proxy server at a different location, using a simple TCP connection proxy -- such as HAProxy, or even a simple tool like redir or socat.
Such a configuration will listen for connections from clients, and when one is established, will create a new TCP connection to the destination (your websocket server) but -- importantly -- they only tie the two connections together at the payload level -- not the TCP level, and of course nothing lower -- so retransmissions will only be seen on the wire at all between this intermediate server and the end of the connection with the packet loss problem. The other end will show no evidence of the retransmissions -- just data arriving later than expected.
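For illustration only, here is a minimal payload-level TCP relay written with Python's asyncio. It is a sketch of the same idea, not a replacement for HAProxy, redir or socat, and the function names are made up.

```python
import asyncio


async def pipe(reader, writer):
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
    except ConnectionError:
        pass
    finally:
        writer.close()


async def start_proxy(listen_host, listen_port, target_host, target_port):
    """Listen for clients and relay each one to the target over a NEW TCP
    connection, so the two legs share payload but not TCP state."""

    async def handle(client_reader, client_writer):
        try:
            upstream_reader, upstream_writer = await asyncio.open_connection(
                target_host, target_port
            )
        except OSError:
            client_writer.close()
            return
        # Relay both directions until either side closes.
        await asyncio.gather(
            pipe(client_reader, upstream_writer),
            pipe(upstream_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

Because each leg is its own TCP connection, a retransmission between client and relay never appears in a capture taken between relay and server, which is what makes the comparison meaningful.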
For this test to be meaningful, the proxy needs to be located away from the server and the client, and with no meaningful common infrastructure -- hence the suggestion of placing it in a different AWS region. A different availability zone in the same region may share common Internet infrastructure at some level, so that's not far enough away for this purpose.
If client <--> proxy <--> server shows TCP retransmissions on the path between proxy and server, and not between client and proxy, the problem really is likely to be in your server, its hardware, network, or Internet connection, and you'll have to proceed accordingly.
Conversely (and, I would suggest, more likely) if the path between proxy and server is free of retransmissions but the path between client and proxy is still dirty, you have eliminated the server and its infrastructure as the source of the problem. How to proceed is up to you, but at this point you do know what the problem... isn't.
Two other possibilities:
Both sides remain dirty, which is the least likely scenario. Rule 1 of troubleshooting is to assume initially that you only have one problem, not two.
Or, both sides are suddenly and unexpectedly clean when traffic uses this setup, which suggests that your test setup has routed around a broken piece of the Internet. You've "solved" it but have no idea how. We'll also hope this isn't the outcome, but given the vagaries of the global Internet, it's not unthinkable that your stack may include components like this, with geolocation-DNS-based selection of an intermediate endpoint. This seems like a convolution but does have its place.
Such a tactic is actually part of the logic behind the S3 transfer acceleration feature. The content is not any closer to the end user, but the TCP connection from the browser is being terminated on equipment in the AWS edge network, at a location that is often nearer to the browser, and a second TCP connection back to the bucket is established, with the payload connected together... and, yes, it's faster and more stable, with the significance of the change becoming more notable as distance and connection quality vary.
I have a Winsock IOCP server written in C++ using TCP/IP connections. I have tested this server locally, using the loopback address with a client simulator. I have been able to get upwards of 60,000 clients no sweat. The issue I am having is when I run the server at my house and the client simulator at a friend's house. Everything works fine up until we hit around 3700 connections; after that, every call to connect() fails from the client side with a return of 10060 (this is the Winsock timed out error). Last night this number was 3700, but it has been around 300 before, and we also saw it near 1000. But whatever the number is, every time we try to simulate it, it will fail right around that number (within 10 or so).
Both computers are using Windows 7 Ultimate. We have also both modified the TCPIP registry setting MaxTcpConnections to around 16 million. We also changed the MaxUserPort setting from its 5000 default to 65k. No useful information is showing up in the event viewer. We also both watched our resource monitor, and we haven't even gotten to 1% network utilization; the CPU is also close to 0% usage.
We just got off the phone with our ISP, and they are saying that they are not limiting us in any way but the guy was kinda unsure and ended up hanging up on us anyway after a 30 minute hold time...
We are trying everything to figure this issue out, but cannot come up with the solution. I would be very grateful if someone out there could give us a hand with this issue.
P.S. Both computers are on Verizon FIOS with the same verizon router. Another thing to note, the server is using WSAAccept and NOT AcceptEx. The client simulator is attempting to connect over many seconds though, so I am pretty sure the connects are not getting backlogged. We have tried to change the speed at which the client simulator connects, and no matter what speed it is set to it fails right around the same number each time.
UPDATE
We simulated 2 separate clients (on 2 separate machines) on network A. The server was running on network B. Each client was only able to connect half (about 1600) connections to the server. We were initially using a port below 1,000, this has been changed to above 50,000. The router log on both machines showed nothing. We are both using the Actiontec MI424WR verizon FIOS router. This leads me to believe the problem is not with the client code. The server throws no errors and has no unexpected behavior. Could this be an ISP/Router issue?
UPDATE
The solution has been found. The verizon router we were using (MI424WR revision C) is unable to handle any more than 3700 connections, we tested this with a separate set of networks. Thanks for the help guys!
Thanks
- Rick
I would have guessed that this was a MaxUserPort issue, but you say you've changed that. Did you reboot after changing it?
Run the test on the exact same computers on your local network (this will take the computers out of the equation).
The issue could be one of your routers not being up to the job?
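To compare the local-network run with the run across the two routers, a small probe that counts how many connections succeed before the first failure may be enough. This is a sketch in Python rather than the original C++ simulator, and the function name is made up.

```python
import socket


def count_successful_connects(host, port, max_connections, timeout=5.0):
    """Open TCP connections one after another, holding them open, until one
    fails or max_connections is reached.

    Returns (number_connected, first_error_or_None).
    """
    sockets = []
    error = None
    try:
        for _ in range(max_connections):
            try:
                sockets.append(socket.create_connection((host, port), timeout=timeout))
            except OSError as exc:  # includes timeouts and refusals
                error = exc
                break
        return len(sockets), error
    finally:
        # Release everything so the next run starts from a clean state.
        for sock in sockets:
            sock.close()
```

If the count stops at the same number only when the traffic crosses a particular router, that router becomes the prime suspect. Note that the probe itself can hit the local limit on open sockets, so that limit needs to be higher than the number being tested.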