I'm trying to get a starting point on where to begin understanding what could cause a socket stall and would appreciate any insights any of you might have.
So, the server is a modern dual-socket Xeon (2 x 6 cores @ 3.5 GHz) running Windows Server 2012. In a single process there are 6 blocking TCP sockets with default options, each running on its own thread (no NUMA/core affinity specified). 5 of them are connected to the same remote server and receive very heavy loads (hundreds of thousands of small ~75-byte messages per second). The last socket is connected to a different server with a very light send/receive load for administrative messaging.
The problem I ran into was a 5-second stall on the admin messaging socket. Multiple send calls on the socket returned successfully, but for 5 seconds nothing was received from the remote server (a protocol ack should arrive within milliseconds) and nothing was received BY the remote admin server. It was as if that socket just turned off for a bit. After the 5-second stall passed, all of the acks came in a burst, and afterwards everything continued normally. During this time the other sockets were receiving far more messages than normal, but there was no indication of any interruption or stall on them, as the data logs showed nothing unusual (light logging, maybe 500 msgs/sec).
From what I understand, the socket send call does not ensure that data has gone out on the wire, just that a transfer to the tcp stack was successful. So, I'm trying to understand the different scenarios that could have taken place that would cause a 5 second stall on the admin socket. Is it possible that due to the large amount of data being received the tcp stack was essentially overwhelmed and prioritized those sockets that were being most heavily utilized? What other situations could have potentially caused this?
Thank you!
If the sockets are receiving hundreds of thousands of 75-byte messages per second, the server may be running out of some resource. Probably not bandwidth: 100K messages per second at 75 bytes each is only about 60 Mbps of payload (more once TCP/IP headers are counted, but still well under what a gigabit link can carry). It could be CPU utilization, though.
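As a rough check on the bandwidth estimate, here is the arithmetic as a small sketch. The function name payload_mbps is made up for illustration, and it counts payload bytes only, ignoring TCP/IP and Ethernet headers.

```python
def payload_mbps(msgs_per_sec: int, msg_bytes: int) -> float:
    """Payload bandwidth in megabits per second (protocol headers not included)."""
    return msgs_per_sec * msg_bytes * 8 / 1_000_000


# 100,000 messages per second of 75 bytes each, as in the question
print(payload_mbps(100_000, 75))  # prints 60.0
```

Even at several hundred thousand messages per second the payload stays in the low hundreds of Mbps, which is why CPU, interrupts and context switches are worth checking first.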
You should use two tools to understand your problem:
perfmon, to see utilization of CPU (user and privileged time, https://technet.microsoft.com/en-us/library/aa173932(v=sql.80).aspx), memory, bandwidth, and disk queue length. You can also check the number of interrupts and context switches with perfmon.
A sniffer like Wireshark, to see whether data is actually being transmitted and responses received at the TCP level.
Something else I would do is to write a timestamp right after the send call and right before and after the read call in the thread in charge of admin socket. Maybe it is a coding problem.
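A minimal sketch of that kind of timestamping, assuming nothing about the real application: the helper name timed is made up, and socket.socketpair() stands in for the actual admin connection so the example runs on its own.

```python
import socket
import time


def timed(label, func, *args):
    """Call func(*args), print a timestamp and the call duration, return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{time.time():.6f} {label} took {elapsed * 1000:.3f} ms")
    return result, elapsed


# socketpair() is only a local stand-in for the real admin socket in this sketch.
admin, peer = socket.socketpair()

sent, send_secs = timed("send", admin.send, b"admin-request")
peer.sendall(b"ack")  # the "remote server" answers
reply, recv_secs = timed("recv", admin.recv, 4096)

admin.close()
peer.close()
```

If the send timestamps look normal but the gap before recv returns is 5 seconds, the stall is below the application (stack or network); if the thread itself was not scheduled for 5 seconds, the timestamps around the calls will show that instead.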
The fact that send calls return successfully doesn't mean data was immediately sent. In TCP data will be stored in the send buffer and from there, TCP stack will send the data to the other end.
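This can be demonstrated with a small sketch. It assumes a local socket.socketpair() as a stand-in for a real TCP connection (on Linux that is a Unix-domain pair, not TCP, but the buffering behaviour it illustrates is the same idea): the sender keeps "succeeding" although the receiver never reads a byte.

```python
import socket

sender, receiver = socket.socketpair()
sender.setblocking(False)  # so a full buffer raises instead of blocking forever

chunk = b"x" * 1024
accepted = 0
try:
    # Safety cap so the loop always terminates, even with very large buffers.
    while accepted < 64 * 1024 * 1024:
        accepted += sender.send(chunk)  # succeeds while the buffers have room
except BlockingIOError:
    pass  # buffers are full; a blocking socket would stall in send() here

print(f"stack accepted {accepted} bytes; the receiver has read none of them")
sender.close()
receiver.close()
```

Every one of those successful send calls only means the data was copied into a buffer, which matches what was observed on the admin socket.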
If your system is CPU bound (perfmon will tell you whether it is), then you should pay attention to the comments written by @EJP; this is something that can happen when the machine is under heavy load. With the tools I mentioned, you can see whether the receive window of the admin socket is closed or whether the socket read in the admin thread is simply taking a long time.
Related
I have 2 C++ applications communicating over TCP/IP. One acts as a client and the other as a server. We have observed a delay in the client receiving data. Over the course of a day this delay grows from a few seconds to a few minutes.
How did we conclude that the delay is in the communication?
We have a debug statement that prints a timestamp when data is ready to be written in the server, and debug statements in the client that print a timestamp when that data is received. Comparing the timestamps, we found that the client received data a few minutes after the server wrote it. Each piece of data has an ID, so it is easy to confirm that the timestamps recorded at the server and the client belong to the same data.
Send/receive buffer sizes from the netstat command:
We have a 1 GB send buffer in the server, which fills up to at most 300 MB when this delay is seen.
We have a 512 MB receive buffer in the client, which always shows 0 whenever the delay is seen. That indicates the client is processing data fast enough that it should not slow down the sender (server).
My assumption is that data is somehow accumulating in the server's send buffer and that this is causing the delay.
Is my assumption correct? Is there a solution for this?
Update 1: One important fact I forgot to mention is that both apps are running on the same machine. They are supposed to run on different machines, which is why they use TCP, but currently they run on the same machine, so bandwidth should not be a problem.
I am trying to develop a multi-threaded server to handle video streams. Multiple users can be connected at the same time. Each user sends a frame from his camera to the server, the server processes it, sends a response to the user, and so on. There were 2 options:
Use UDP. The server receives frames in the main thread, hands them to the thread pool for processing, and from there sends a response to the client when it is ready. This approach is bad because receiving a frame in the main thread takes a long time, and users got out of sync with each other, which led to client hangs.
Use TCP. The server listens for client connections in the main thread and gives the thread pool not the processing of a single frame but the processing of all frames of that client. This approach solved the synchronization problem but slowed the work down by 5-6 times. It is not clear why: the image processing takes place on the GPU, which forces all frames to be processed sequentially, so the thread pool does not speed up the calculations and only provides multi-user support. So I concluded that the slowdown was due to TCP (on localhost there was almost no change in operating time: an increase of about 1 fps for TCP).
I would like to figure out whether such a slowdown is really possible and what to do in this case.
All frames are sent in an uncompressed state, which was chosen empirically because the compression took longer.
P.S. It is very disappointing when the main algorithm works in 14 ms and all efforts were directed to its optimization, and the bottleneck turned out to be data transfer - about 250-500 ms.
I wrote a server in C which receives UDP data from clients on port X. I use a non-blocking socket with epoll for UDP listening and have only one worker thread. The pseudocode is as follows:
on_data_receive(socket) {
    process();              // takes 2-4 milliseconds
    send_response(socket);
}
But when I send 5000 concurrent requests (using threads), the server misses 5-10% of them: on_data_receive() is never called for those requests. I am testing on a local network, so you can assume there is no packet loss on the wire. My question is why on_data_receive() is not called for some requests. What is the connection limit for a socket? The loss ratio also increases as the number of concurrent requests increases.
Note: I used a random sleep of up to 200 milliseconds before sending each request to the server.
There is no 'connection' for UDP. All packets are just sent between peers, and the OS does some magic buffering to avoid packet loss to some degree.
But when too many packets arrive, or if the receiving application is too slow reading the packets, some packets get dropped without notice. This is not an error.
For example, Linux has a UDP receive buffer that is about 128k by default (I think; the actual value is set by net.core.rmem_default). You can probably change that, but it is unlikely to solve the systematic problem that UDP is exposed to packet loss.
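To see what the receive buffer actually is on a given machine, and to request a larger one, something like the following sketch can be used. The 1 MiB figure is an arbitrary assumption for illustration; the OS is free to cap or adjust whatever is requested.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Size of the receive buffer the OS gave this UDP socket by default.
default_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("default receive buffer:", default_size)

# Ask for 1 MiB; the value actually applied may differ from the request.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1 << 20)
actual_size = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print("receive buffer after request:", actual_size)

sock.close()
```

A bigger buffer only absorbs longer bursts. If the receiver is slower than the senders on average, packets are still dropped once the buffer fills.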
With UDP there is no congestion control like there is for TCP. The raw artifacts of the underlying transport (Ethernet, local network) are exposed. Your 5000 senders probably get more CPU time in total than your receiver, so they can send more packets than the receiver can receive. With UDP, senders do not get blocked (e.g. in sendto()) when the receiver cannot keep up with the incoming packets. The sender always needs to control and limit the data rate explicitly. There is no back pressure from the network side (*).
(*) Theoretically there is no back pressure in UDP. But on some operating systems (e.g. Linux) you can observe that there is back pressure (at least to some degree) when sending for example over a local Ethernet. The OS blocks in sendto() when the network driver of the physical network interface reports that it is busy (or that its buffer is full). But this back pressure stops working when the local network adapter cannot determine the network 'being busy' for the whole network path. See also "Ethernet flow control (Pause Frames)". Through this the sending side can block the sending application even when the receive-buffer on the receiving side is full. This explains why often there seems to be a UDP back-pressure like a TCP back pressure, although there is nothing in the UDP protocol to support back pressure.
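One common way for a sender to limit its own rate is a token bucket. The sketch below is only an illustration of that idea, not something from the question: the class name Pacer and its parameters are made up, and a fake clock is injected so the example runs instantly.

```python
import time


class Pacer:
    """Allow at most `rate` packets per second, with bursts of up to `burst` packets."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum number of tokens that can accumulate
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def try_send(self):
        """Return True if a packet may be sent now, False if the caller should wait."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Fake clock so the example does not need to sleep.
fake_now = [0.0]
pacer = Pacer(rate=1000, burst=5, clock=lambda: fake_now[0])

allowed = sum(pacer.try_send() for _ in range(10))        # only the burst of 5 passes
fake_now[0] += 0.010                                      # 10 ms later the bucket is full again
allowed_later = sum(pacer.try_send() for _ in range(10))  # again 5 pass
print(allowed, allowed_later)
```

In a real sender, sendto() would be called only when try_send() returns True, and the rate would be chosen so that the receiver (2-4 ms per request here) can keep up.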
I have a blocking client/server connected locally via Winsock. The client uses firefox to retrieve data from websites, passing certain data along to the server for extra processing. The server always responds, and the processing can take anywhere from 1/10th second to a few minutes. The client has no winsock connection to anything but the server; all web data is retrieved to hard-drive via firefox.
This setup works quite well until, seemingly randomly, the client's recv returns -1 (SOCKET_ERROR) with error code 10054 (WSAECONNRESET). This means the server supposedly terminated connection, but the server is actually still waiting to recv as if nothing is wrong. The connection has failed in this way as early as 5 minutes in or after working for as long as about an hour and a half. The client sends about 10 different types of requests to the server, and failure has occurred on a variety of them. The frequency of requests is roughly constant, probably an average of 10-15 a minute. When the connection breaks, neither computer experiences internet problems and remote desktop does not disconnect.
Initially I suspected memory leaks, but after extensive debugging I am reasonably certain no more exist. Firefox is engaged in considerable HTTP traffic at times, so I thought maybe that could be filling the available socket buffer space or something; it seems doubtful, but at this point I'm really not sure. So, could it be more memory leaks, maybe a hidden buffer overrun, too much web traffic? What is causing my Winsock app to randomly fail?
Sounds like a firewall at work.
Many firewalls are configured to terminate idle connections (i.e. open TCP sessions on which no data is transferred for a while), especially HTTP connections, which are typically not persistent.
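If an idle timeout really is the cause (an assumption, since nothing in the question confirms it), one common mitigation is to keep some traffic flowing on the connection, either with an application-level heartbeat or with TCP keepalive. A minimal sketch of switching keepalive on:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Ask the OS to send periodic keepalive probes when the connection is otherwise idle.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print("keepalive enabled:", bool(keepalive_on))

sock.close()
```

Note that the default keepalive interval is typically very long (on the order of hours), so it usually has to be shortened with platform-specific settings before it helps against a firewall that drops connections after a few minutes.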
I need a little help if someone's got a minute.
I've written a web server using IO completion ports, but I am having some trouble sending out large files. Web pages seem to load fine, but during large file transfers, WSASend() fails after a few minutes with error "The specified network name is no longer available."
Right now, my server just closes the associated connection when any overlapped operation fails. Is this the right thing to do? or should I retry failed overlapped operations a few times before I close the socket? I am using tcp/stream sockets.
(fixed) I am also receiving what seems like random 0 byte packets from WSARecv. I am not sure what to make of this, or if the problem is related.(/fixed)
Thanks for any help
edit: now that the server properly handles connections, and has a much more comprehensive log, it seems like Len is right. The client is closing the connection for some reason.
The log:
Initializing Windows Sockets...
Forwarding port 80...
Starting server...
Waiting for incoming connections...
Socket 1128: Client connected.
Socket 1128: Request received
Socket 1128: Sent response
Socket 1128: Error 64: SendChunk() failed. //WSASend()
Socket 1128: Closing connection - GetQueueCompletionStatus == FALSE
So the question now is: why would the client close the connection? It takes anywhere from 2-5 minutes to happen. I have decreased the buffer size to 4098 bytes per send, and only send the next chunk when the previous one has completed.
Thanks again for any ideas on this.
p.s. I even just implemented a retry function so that it will retry a failed overlapped IO operation five times before giving up....still no luck =(
A return value of 0 from recv (what looks like a zero-length packet) indicates that the client on the other end has closed the connection.
That also explains why your subsequent send to the client failed.
http://www.opengroup.org/onlinepubs/009695399/functions/recv.html
If no messages are available to be received and the peer has performed an orderly shutdown, recv() shall return 0.
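A small sketch of what that looks like in a read loop, using a local socket.socketpair() as a stand-in for the real client connection (the names server and client are just labels for the two ends):

```python
import socket

server, client = socket.socketpair()

client.sendall(b"GET /file")
client.shutdown(socket.SHUT_WR)  # orderly shutdown of the client's sending side

received = []
while True:
    data = server.recv(4096)
    if not data:  # empty result: the peer has closed, this is not an empty packet
        break
    received.append(data)

print("peer closed after sending", b"".join(received))
server.close()
client.close()
```

Once recv has returned 0 there is no point in retrying reads on that socket, and further sends to a peer that has fully closed can be expected to fail.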
Are you doing anything to impose some form of flow control on your data transmission?
If not then you are probably using up resources which is causing the send to fail.
For example, if you are simply issuing LOTS of WSASend() calls one after the other rather than pacing them based on when they complete, then each one will use system resources (non-paged pool and/or locked pages, which count towards the 'locked pages limit'). You will then likely fail eventually with ENOBUFS or similar errors.
What you need to do is build a flow control system that works off of the send completions so that you only ever have a known number of sends outstanding at a time.
See these questions for more detail:
Implement a good performing "to-send" queue with TCP
Limiting TCP sends with a "to-be-sent" queue and other design issues
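The flow control idea described above can be sketched roughly as follows. This is a language-neutral illustration, not Winsock code: the class name SendWindow is made up, and issue_send is a hypothetical stand-in for whatever actually starts the overlapped WSASend().

```python
from collections import deque


class SendWindow:
    """Keep at most `max_outstanding` sends in flight and queue the rest."""

    def __init__(self, issue_send, max_outstanding=4):
        self.issue_send = issue_send  # hypothetical stand-in for starting a WSASend()
        self.max_outstanding = max_outstanding
        self.outstanding = 0
        self.pending = deque()

    def submit(self, chunk):
        """Send now if there is room in the window, otherwise queue the chunk."""
        if self.outstanding < self.max_outstanding:
            self.outstanding += 1
            self.issue_send(chunk)
        else:
            self.pending.append(chunk)

    def on_send_complete(self):
        """Call from the completion handler; starts the next queued send, if any."""
        self.outstanding -= 1
        if self.pending:
            self.outstanding += 1
            self.issue_send(self.pending.popleft())


issued = []
window = SendWindow(issued.append, max_outstanding=2)
for i in range(5):
    window.submit(f"chunk-{i}")

# Only chunk-0 and chunk-1 have been issued at this point.
window.on_send_complete()  # one completion arrives, so chunk-2 is issued
print(issued)
```

In a real server the pending queue should be bounded as well (for example by reading the next piece of the file only when a slot frees up), otherwise the memory problem just moves from the kernel into the application.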
Finally figured it out.
from Rogers Internet Terms of Service:
Without limitation, you may not use (or allow anyone else to use) our Services to:
(xvi) operate a server in connection with the Services, including, without limitation, mail, news, file, gopher, telnet, chat, Web, or host configuration servers, multimedia streamers or multi-user interactive forums;
how lame is that? O_o
good news: server works fine =)
edit- called Rogers. They verified that they are cutting me off, and told me that I need a business account to run a web server.