I have 2 c++ applications communicating over tcp/ip. One application acts as a client and another one acts as a server. We have observed some delay in receiving data into client. This delay keeps on increasing during a day from few seconds to few minutes in a day.
How we concluded that delay is in communication ?
We have debug statement that prints timestamp when data in ready for write in server. Also we have debug statements in client when we receive that data. After comparing those timestamp we realized that we received data in client after few minutes it was written by server. Each data ha id so it's easy for us to know it's same data whose timestamp is recorded at server & client.
Send/Receive buffer sizes from netstat command : -
We have 1GB send buffer in server which is filled max upto 300MB when this delay is seen.
We have 512MB receive buffer in client which always shows 0 whenever delay is seen. That indicates client is processing data fast enough to make sure that sender(server) will not slow down.
It's my assumption that somehow data is accumulated in send buffer of server that is causing this delay?
Is my assumption correct? Is there solution for this?
Update 1 :- One important fact that I forgot to mention that both the apps are running on same machine. They are suppose to run different machine that's why they use tcp but in current situation they are running on same machine so there should not be a problem of bandwidth.
Related
I am trying to develop a multi-threaded server to handle video streams. Multiple users can be connected at the same time. Each user sends a frame from his camera to the server, the server processes it, sends a response to the user, and so on. There were 2 options:
Use UDP protocol. The server in the main thread receives frames, throws them into the thread pool for processing, and from there, when ready, sends a response to the client. This approach is bad because the acceptance of a frame in the main thread is very long and there was an out of sync between users, which led to client hangs.
Use TCP protocol. The server in the main thread listens for the connection of clients and puts into the thread pool not the processing of 1 frame, but the processing of all frames of this client. This approach solved the problem of synchronization, but slowed down the work by 5-6 times. In fact, it is not clear why, because the image processing takes place on the GPU, which forces all frames to be processed sequentially and the thread pool does not play a role in speeding up calculations, but support for multiuser operation.So I concluded that this slowdown was due to TCP (on localhost, there were almost no changes in the operating time - an increase of about 1 fps for TCP).
I would like to figure out whether such a slowdown is really possible and how to be in this case.
All frames are sent in an uncompressed state, which was chosen empirically because the compression took longer.
P.S. It is very disappointing when the main algorithm works in 14 ms and all efforts were directed to its optimization, and the bottleneck turned out to be data transfer - about 250-500 ms.
I've written a program in C++/Qt that reads data from the serial port (ttyS0) and writes it over the network (on a specific TCP port).
At the other end of the network, I read the data from the corresponding socket and write it back over the serial port.
I' ve essentially created a transparent serial-to-serial bridge over the ip network.
So far so good.
The problem starts when the input serial data comes in NON-STOP. Sending the non-stop serial stream (#19200bps) over the network (#100Mbps) and back over the remote serial port (#19200bps) creates an ACCUMULATIVE DELAY.
The first serial packets are sent over the REMOTE SERIAL interface with neglible delay (i.e. the total delay from the time I receive the serial data # local side to the time i write it over the remote serial port is negligible)
But the delay just adds up and after some time, the "SERIAL pkts" received at local side can be seen at the remote serial port with huge delays (in the order of "minutes" and counting up)!!
Some notes:
I am using 2 linux boxes that are directly connected (so network delay is not an issue)
no pkt/serial data is lost in the process (the only problem is accumulative delays)
my OS is not real time, does this have to do with reat time issues!!
I believe the problem stems from small delays that result in the remote serial port becoming IDLE for short amounts of time.
Since data can leave the system #19200 bps, any delay in the process that might cause the remote serial port becoming idle with lead to an accumulative delay.
But I've no idea on how to measure the delays or make the program to work timely.
Any help/hints are highly appreciated
I'm trying to get a starting point on where to begin understanding what could cause a socket stall and would appreciate any insights any of you might have.
So, server is a modern dual socket xeon (2 x 6 core # 3.5 ghz) running windows 2012. In a single process, there are 6 blocking tcp sockets with default options, each of which are running on their own threads (not numa/core specified). 5 of them are connected to the same remote server and receiving very heavy loads (hundreds of thousands of small ~75 byte msgs per second). The last socket is connected to a different server with a very light send/receive load for administrative messaging.
The problem I ran into was a 5 second stall in the admin messaging socket. Multiple send calls to the socket returned successfully, however nothing was received from the remote server (should receive a protocol ack within milliseconds) or received BY the remote admin server for 5 seconds. It was as if that socket just turned off for a bit. After the 5 seconds stall passed, all of the acks came in a burst, and afterwards everything continued normally. During this, the other sockets were receiving much higher numbers of messages than normal, however there was no indication of any interruption or stall as the data logs displayed nothing unusual (light logging, maybe 500 msgs/sec).
From what I understand, the socket send call does not ensure that data has gone out on the wire, just that a transfer to the tcp stack was successful. So, I'm trying to understand the different scenarios that could have taken place that would cause a 5 second stall on the admin socket. Is it possible that due to the large amount of data being received the tcp stack was essentially overwhelmed and prioritized those sockets that were being most heavily utilized? What other situations could have potentially caused this?
Thank you!
If the sockets are receiving hundreds of thousands of 75-byte messages per second there is a possibility that the server is at maximum capacity with some resources. Maybe not bandwidth, as with 100K messages you might be consuming around 10Mbps. But it could be CPU utilization.
You should use two tools to understand you problem:
perfmon to see utilization of CPU (user and priviledged https://technet.microsoft.com/en-us/library/aa173932(v=sql.80).aspx) , memory, bandwidth, and disk queue length. You can also check number of interrupts and context switches with perfmon.
A sniffer like Wireshark to see if at TCP level data is being transmitted and responses received.
Something else I would do is to write a timestamp right after the send call and right before and after the read call in the thread in charge of admin socket. Maybe it is a coding problem.
The fact that send calls return successfully doesn't mean data was immediately sent. In TCP data will be stored in the send buffer and from there, TCP stack will send the data to the other end.
If your system is CPU bound (you can see with perfmon if this is true), then you should put attention to the comments written by #EJP, this is something that could happen when the machine is under heavy load. With the tools I mentioned, you can see if the receive window in the admin socket is closed or if it is just that socket read is taking time in the admin socket.
I have a problem with client-server application. As I've almost run out of sane ideas for its solving I am asking for help. I've stumbled into described situation about three or four times now. Provided data is from last failure, when I've turned all the possible logging, messages dumping and so on.
System description
1) Client. Works under Windows. I take as an assumption that there is no problem with its work (judging from logs)
2) Server. Works under Linux (RHEL 5). It is server where I has a problem.
3) Two connections are maintained between client and server: one command and one for data sending. Both work asynchronously. Both connections live in one thread and on one boost::asio::io_service.
4) Data to be sent from client to server is messages delimeted by '\0'.
5) Data load is about 50 Mb/hour, 24 hours a day.
6) Data is read on server side using boost::asio::async_read_until with corresponding delimeter
Problem
- For two days system worked as expected
- On third day at 18:55 server read one last message from client and then stopped reading them. No info in logs about new data.
- From 18:55 to 09:00 (14 hours) client reported no errors. So it sent data (about 700 Mb) successfully and no errors arose.
- At 08:30 I started investigation of a problem. Server process was alive, both connections between server and client were alive too.
- At 09:00 I attached to server process with gdb. Server was in sleeping state, waiting for some signal from system. I believe I accidentally hit Ctrl + C and may be there was some message.
- Later in logs I found message with smth like 'system call interrupted'. After that both connections to client were dropped. Client reconnected and server started to worked normally.
- The first message processed by server was timestamped at 18:57 on client side. So after restarting normal work, server didn't drop all the messages up to 09:00, they were stored somewhere and it processed them accordingly after that.
Things I've tried
- Simulated scenario above. As server dumped all incoming messages I've wrote a small script which presented itself as client and sent all the messages back to server again. Server dropped with out of memory error, but, unfortunately, it was because of high data load (about 3 Gb/hour this time), not because of the same error. As it was Friday evening I had no time to correctly repeat the experiment.
- Nevertheless, I've run server through Valgrind to detect possible memory leaks. Nothing serious was found (except the fact that server was dropped because of high load), no huge memory leaks.
Questions
- Where were these 700 Mb of data which client sent and server didn't get? Why they were persistent and weren't lost when server restarted the connection?
- It seems to me that problem is someway connected with server not getting message from boost::asio::io_service. Buffer is get filled with data, but no calls to read handler are made. Could this be problem on OS side? Something wrong with asynchronous calls may be? If it is so, how could this be checked?
- What can I do to detect the source of problem? As i said I've run out of sane ideas and each experiment costs very much in terms of time (it takes about two or three days to get the system to described state), so I need to run as much possible checks for experiment as I could.
Would be grateful for any ideas I can use to get to the error.
Update: Ok, it seems that error was in synchronous write left in the middle of asynchronous client-server interaction. As both connections lived in one thread, this synchronous write was blocking thread for some reason and all interaction both on command and data connection stopped. So, I changed it to async version and now it seems to work.
As i said I've run out of sane ideas and each experiment costs very
much in terms of time (it takes about two or three days to get the
system to described state)
One way to simplify investigation of this problem is to run server inside some Virtual Machine until it reaches this broken state. Then you can make snapshot of whole system and revert to it every time when things go wrong during investigation. At least you will not have to wait 3 days to get this state again.
I have a blocking client/server connected locally via Winsock. The client uses firefox to retrieve data from websites, passing certain data along to the server for extra processing. The server always responds, and the processing can take anywhere from 1/10th second to a few minutes. The client has no winsock connection to anything but the server; all web data is retrieved to hard-drive via firefox.
This setup works quite well until, seemingly randomly, the client's recv returns -1 (SOCKET_ERROR) with error code 10054 (WSAECONNRESET). This means the server supposedly terminated connection, but the server is actually still waiting to recv as if nothing is wrong. The connection has failed in this way as early as 5 minutes in or after working for as long as about an hour and a half. The client sends about 10 different types of requests to the server, and failure has occurred on a variety of them. The frequency of requests is roughly constant, probably an average of 10-15 a minute. When the connection breaks, neither computer experiences internet problems and remote desktop does not disconnect.
Initially I thought memory leaks, but after extensive debugging I am reasonably certain no more exist. Firefox is engaged in considerable HTTP traffic at times, so I thought maybe that could be filling available socket bufferspace or something -- seems doubtful but at this point I'm really not sure. So, could it be more memory leaks, maybe a hidden buffer overrun, too much web traffic? What is causing my Winsock app to randomly fail?
Sounds like a firewall at work.
Many firewalls are configured to terminate idle connections (i.e. open TCP sessions on which no data is transferred for awhile). Especially if it's an HTTP connection, which are typically not persistent.