Process terminated, but its network resource remained - c++

My program has a TCP server and always has several long-lived connections. Sometimes I close the program without closing all the connections, and when I then run netstat -ano on the command line, all the connections are still listed in the "ESTABLISHED" state with a PID that does not exist in Task Manager! Restarting the network card doesn't help. The only solution is to log off and back on, or to restart the computer. Has anybody run into this problem?

The sockets may be in a 'half-closed' state.
Such sockets usually disappear after a timeout, which can be fairly long (from 5 to 30 minutes), depending on your system.

Related

How to free a port locked by a killed process?

I have the same problem as in this question.
PID exists in netstat but does not exist in task manager
I have discovered a running process with PID 26376 listening on port
9001 and 9002 as when I try to run my program(as a service) which
binds to that port it fails.
But when I try to kill it using taskkill /PID it says that the process
26376 is not found. Similarly when I try to find the process in task
manager with "Show processes from all users" selected, I couldn't find
it anywhere.
And the accepted answer says:
What may be happening is that your process had a TCP port open when it
crashed or otherwise exited without explicitly closing it. Normally
the OS cleans up these sorts of things, but only when the process
record goes away.
I am working on C++ code, and to work around an unknown/unresolved issue I kill the process with taskkill, which sometimes leaves the port locked. How do I free it for the next run without restarting the whole OS? Is there any way to free such ports?
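A connection in the TCP TIME-WAIT state remains there for 2MSL (twice the maximum segment lifetime), so the port is still held during that time. Setting the SO_REUSEADDR socket option with setsockopt before binding should allow the port to be reused.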
See this.
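As a minimal sketch of the SO_REUSEADDR suggestion above, assuming POSIX sockets: the helper name open_listener and the choice of the loopback address are hypothetical, not taken from the question. On Winsock the option value is passed as a const char*, and SO_REUSEADDR has different semantics there, so treat this as an illustration of the call order (set the option, then bind) rather than a drop-in fix.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical helper: open a TCP listening socket on 127.0.0.1:port with
// SO_REUSEADDR enabled. Returns the descriptor, or -1 on failure.
int open_listener(std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    int optval = 1;
    // The option is set before bind(), because bind() is the call that
    // checks whether the address may be reused.
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof optval) < 0) {
        close(fd);
        return -1;
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);  // port 0 lets the kernel pick a free port

    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A restarted server would call open_listener with its fixed port (for example 9001 from the question) and treat -1 as "port still unavailable".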

Cannot connect to a local port anymore that is still being listened by a process

I have a server application (unimrcpserver.exe) that is answering requests from client processes. This server process listens to several ports.
with netstat -a command I get the following lines for my process.
TCP 192.168.10.65:2544 MERTB-PC:0 LISTENING
TCP 192.168.10.65:2554 MERTB-PC:0 LISTENING
TCP 192.168.10.65:9060 MERTB-PC:0 LISTENING
(netstat output is long I only put relevant lines here)
Normally when the system works I make requests to the server from these ports and each of them works fine.
While running stress tests, I saw a situation where the system no longer responded to the requests that I made through port 2554.
netstat -a still gives me the above lines, so the server is somehow still listening on this port. When I run telnet on the same machine it gives an error:
telnet 192.168.10.65 2554
Connecting To 192.168.10.65...Could not open connection to the host, on port 2554: Connect failed
I also wrote a simple program in C++ to get the exact error message that the system generates for a connect() request. This time I get the following error:
No connection could be made because the target machine actively refused it
Additional info: Everything is on the same Windows machine. The firewall is disabled. This situation occurred only once, while I was running stress tests that send multiple requests at the same time. Before it occurred, the system had handled around 13000 requests, which took around half an hour.
So the question is: How can this situation occur? The port is being reported as "LISTENING" with netstat but I cannot connect to it. If it can be caused by a programming error, what kind of error can cause this kind of behavior?
A new connection can be "actively refused" under several conditions:
1. There is no LISTENING socket on the IP:Port being connected to.
2. There is a LISTENING socket, but its backlog of pending connections is full, so it cannot accept a new connection at that moment.
3. A firewall is blocking it, though a firewall is more likely to use a different error, if it sends an error at all.
Since there is a LISTENING socket, #2 is the most likely/common case. If so, it means the server app is not accepting clients from its backlog fast enough, if at all.
A client cannot differentiate between these conditions. All it can do is detect the connect failure - WSAECONNREFUSED or ECONNREFUSED, depending on platform - and try again later.
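The "detect the failure and try again later" advice can be sketched as follows, assuming POSIX sockets and a server on the loopback address. The helper name connect_with_retry, the doubling delay, and the decision to retry only on ECONNREFUSED are illustrative assumptions; on Winsock the corresponding check would be against WSAECONNREFUSED from WSAGetLastError().

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: try to connect to 127.0.0.1:port up to max_attempts
// times, doubling the delay after each refusal. Returns a connected
// descriptor, or -1 with errno set to the error of the last attempt.
int connect_with_retry(std::uint16_t port, int max_attempts,
                       std::chrono::milliseconds first_delay) {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    auto delay = first_delay;
    int last_error = EINVAL;  // reported if max_attempts is not positive
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return -1;
        if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof addr) == 0)
            return fd;
        last_error = errno;  // save before close() can overwrite it
        close(fd);           // use a fresh socket for the next attempt
        if (last_error != ECONNREFUSED) break;  // only refusals are retried
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(delay);
            delay *= 2;
        }
    }
    errno = last_error;
    return -1;
}
```

Retrying cannot tell the client which of the conditions caused the refusal; it only gives a server with a temporarily full backlog time to catch up.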
So the question is: How can this situation occur? The port is being reported as "LISTENING" with netstat but I cannot connect to it. If it can be caused by a programming error what kind of an error can cause this kind of behavior?
Yes, it could be caused by a programming error on the server. I have seen it happen when the server's listening thread is deadlocked. The socket's state is still "listening", but if the listening thread shares some global state and is blocked waiting for other threads to release a mutex, you will encounter this.
Also, as others here have stated, if the CPU is heavily loaded by your stress test, the server might refuse connections because its threads are busy processing and the listening thread never gets a chance to accept the connection.

Asynchronous server stopping getting data from client with no visible reason

I have a problem with a client-server application. As I've almost run out of sane ideas for solving it, I am asking for help. I've run into the situation described below about three or four times now. The data provided is from the latest failure, when I had turned on all possible logging, message dumping and so on.
System description
1) Client. Runs on Windows. I assume that there is no problem with it (judging from the logs).
2) Server. Runs on Linux (RHEL 5). The server is where I have the problem.
3) Two connections are maintained between client and server: one for commands and one for sending data. Both work asynchronously. Both connections live in one thread and on one boost::asio::io_service.
4) Data sent from client to server consists of messages delimited by '\0'.
5) Data load is about 50 Mb/hour, 24 hours a day.
6) Data is read on the server side using boost::asio::async_read_until with the corresponding delimiter.
Problem
- For two days the system worked as expected.
- On the third day at 18:55 the server read one last message from the client and then stopped reading. There is nothing in the logs about new data.
- From 18:55 to 09:00 (14 hours) the client reported no errors, so it sent data (about 700 Mb) successfully and no errors arose.
- At 08:30 I started investigating the problem. The server process was alive, and both connections between server and client were alive too.
- At 09:00 I attached to the server process with gdb. The server was in a sleeping state, waiting for some signal from the system. I believe I accidentally hit Ctrl + C, and there may have been some message.
- Later I found a message in the logs saying something like 'system call interrupted'. After that both connections to the client were dropped. The client reconnected and the server started to work normally.
- The first message processed by the server was timestamped 18:57 on the client side. So after resuming normal work, the server had not lost the messages up to 09:00; they were stored somewhere and it processed them in order after that.
Things I've tried
- Simulated the scenario above. As the server had dumped all incoming messages, I wrote a small script which presented itself as a client and sent all the messages back to the server again. The server crashed with an out-of-memory error, but, unfortunately, that was because of the high data load (about 3 Gb/hour this time), not because of the same error. As it was Friday evening I had no time to repeat the experiment properly.
- Nevertheless, I've run the server under Valgrind to detect possible memory leaks. Nothing serious was found (apart from the fact that the server crashed because of the high load), and no huge memory leaks.
Questions
- Where were the 700 Mb of data which the client sent and the server didn't get? Why did they persist rather than get lost when the server restarted the connection?
- It seems to me that the problem is somehow connected with the server not getting notifications from boost::asio::io_service. The buffer gets filled with data, but no calls to the read handler are made. Could this be a problem on the OS side? Maybe something is wrong with the asynchronous calls? If so, how could this be checked?
- What can I do to detect the source of the problem? As I said I've run out of sane ideas and each experiment costs very much in terms of time (it takes about two or three days to get the system to the described state), so I need to run as many checks per experiment as I can.
I would be grateful for any ideas I can use to track down the error.
Update: OK, it seems that the error was a synchronous write left in the middle of the asynchronous client-server interaction. As both connections lived in one thread, this synchronous write was blocking the thread for some reason, and all interaction on both the command and data connections stopped. So I changed it to the async version and now it seems to work.
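One plausible way a synchronous write can block (an assumption, since the update does not say why it did) is that the peer stops reading and the kernel socket buffers fill up. The sketch below uses a POSIX socket in non-blocking mode to find that point: it writes until the kernel reports EAGAIN/EWOULDBLOCK. A blocking write issued at the same moment would not return until the peer reads, which on a single-threaded event loop stalls every other connection too. The helper name fill_until_would_block is hypothetical, and the example deliberately avoids Boost.Asio to stay self-contained.

```cpp
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper: switch fd to non-blocking mode and write to it until
// the kernel buffers are full. Returns the number of bytes the kernel
// accepted before reporting EAGAIN/EWOULDBLOCK, or -1 on any other error.
long fill_until_would_block(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) return -1;

    char chunk[4096] = {};
    long total = 0;
    for (;;) {
        ssize_t n = send(fd, chunk, sizeof chunk, 0);
        if (n > 0) {
            total += n;  // the kernel buffered these bytes; the peer has not read them
            continue;
        }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;  // a blocking send() would wait here for the peer
        return -1;
    }
}
```

Calling this on one end of a socketpair() whose other end never reads returns once the buffers are full, which makes the stall point easy to observe without waiting days for the real system to reach it.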
As I said I've run out of sane ideas and each experiment costs very
much in terms of time (it takes about two or three days to get the
system to described state)
One way to simplify investigating this problem is to run the server inside a virtual machine until it reaches the broken state. Then you can take a snapshot of the whole system and revert to it every time things go wrong during the investigation. At least you will not have to wait 3 days to reach this state again.

Socket re-connection failure

System Background:
It's basically a client/server application. The server is an embedded device and the client is a Windows app developed in C++.
Issue: After running for about a week, communication between client and server breaks, and because of this the server is not able to connect back to the client and needs a restart to recover. It looks like the system is experiencing a socket re-connection problem. The network also sometimes experiences intermittent failures.
Abrupt Termination at remote end
Port locking
I would like some suggestions on how to clean up the socket or shut down cleanly so that re-connection happens properly. Are there alternative solutions?
Thanks,
Hussain
It does not sound like you are in a position to easily write a stress test app to reproduce this more quickly out of band, which is what I would normally suggest. A pragmatic solution might be to periodically restart the server and client at a time when you think the system is least busy, or when problems arise. This sounds like cheating but many production systems I have been involved with take this approach to maximize system uptime.
My preferred solution here would be to abstract the server and client socket code (hopefully your design allows this to be done without too much work) and use it to implement client and server test apps that can be used to stress test only the socket code by simulating a lot of normal socket traffic in a short space of time - this helps identify timing windows and edge cases that could cause problems over time, and might speed up the process of obtaining a debuggable repro - you can simulate network error in your test code by dropping the socket on the client or server periodically.
A further step to take on the strategic front would be to ensure that you have good diagnostics in your socket handlers on both the client and server side. Track socket open and close, with special focus on your socket error and reconnect paths, given that you know the network is unreliable. Make sure the log entries are written sequentially, each with a timestamp. Something as simple as this might quickly show you what errors or conditions trigger your problems. You can quickly make sure the logs are correct and complete using the test apps I mentioned above.
One thing you might want to check is that you are not being hit by an inability to reuse addresses. Sometimes when a socket gets closed, it cannot be immediately reused for a reconnect attempt because there is still residual activity on one end or the other. You may be able to get around this (based on my Windows/Winsock experience) by experimenting with SO_REUSEADDR and SO_LINGER on your sockets. However, my first focus in your case would be on ensuring the socket code on client and server handles all errors and mainline cases correctly, before worrying about this.
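For the "shut down cleanly" part of the question, one common pattern is an orderly close: stop sending, wait for the peer to finish, then release the descriptor. The sketch below assumes POSIX sockets, and the helper name graceful_close is hypothetical. On Winsock the equivalent calls would be shutdown with SD_SEND and closesocket. The receive loop can block indefinitely if the peer never closes, so real code on an unreliable network would likely add a receive timeout (for example via SO_RCVTIMEO), which is omitted here for brevity.

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper: close a connected stream socket in an orderly way.
//  1. shutdown(SHUT_WR) tells the peer that no more data will be sent.
//  2. Incoming data is read and discarded until the peer closes its side.
//  3. The descriptor is released with close().
// Returns true if the peer's end-of-stream was observed before closing.
bool graceful_close(int fd) {
    bool saw_eof = false;
    if (shutdown(fd, SHUT_WR) == 0) {
        char buf[1024];
        for (;;) {
            ssize_t n = recv(fd, buf, sizeof buf, 0);
            if (n > 0) continue;             // discard late data from the peer
            if (n == 0) { saw_eof = true; break; }
            if (errno == EINTR) continue;    // interrupted by a signal, retry
            break;                           // reset or another error: give up
        }
    }
    close(fd);
    return saw_eof;
}
```

Logging the return value on every disconnect fits the diagnostics advice above: a run of false results would point at abrupt terminations on the remote end.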
A common issue is that when a connection is dropped, the OS keeps it open in the TIME_WAIT state. If you want to restart the server socket, it will not be able to reopen the same port immediately because, as far as the OS is concerned, the port is still in use.
To avoid that, you need to set the SO_REUSEADDR option so that the OS allows a server socket to reuse the port while it is in the TIME_WAIT state.
Example:
int optval=1;
// set SO_REUSEADDR on a socket to true (1):
setsockopt(s1, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof optval);
I'm experiencing something similar with encrypted connections. I believe in my case it is because the client dropped the connection and reconnected in less than the 4 minute FIN_WAIT period. The initial connection is recycled (by the OS) and the server doesn't see the dropout. The SSL authentication is lost when the client loses the connection, so the client tries to re-authenticate. This happens during what the server considers the middle of a conversation. The server then hangs up on the client. I think the server's SSL code considers this a man-in-the-middle attack, or just gets confused and closes the connection.

telnet client connection stops receiving data, server is still sending

I'm working in an embedded Linux environment.
It launches a telnet daemon on startup which listens on a particular port and launches a program when a connection is received.
i.e.
telnetd -l /usr/local/bin/PROGA -p 1234
PROGA - will output some data at irregular intervals. When it is not outputting data, every X period of time it sends out a 'heartbeat' type string to let the client know that we are still active i.e. "heartbeat\r\n"
After a random amount of time, the client (a Linux version of telnet, launched by: telnet xxx.xxx.xxx.xxx 1234) will fail to receive the 'heartbeat\r\n'
The data the client sees:
heartbeat
heartbeat
heartbeat
...
heartbeat
[nothing, should have received heartbeat]
[nothing forever]
heartbeat is sent:
result = printf("%s", heartbeat);
Checking result, it is always the length of heartbeat. Logging to syslog shows that printf() executes successfully at the proper intervals.
I've since added in a tcdrain and fflush which both return success, but do not seem to help the situation.
Any help would be appreciated.
**Update: I got a Wireshark capture from the server side. Very clearly the heartbeat is being sent continuously. No hiccups, no delays. I found something interesting on the client though. The client in this test case (telnet on Ubuntu 9.04) seems to suddenly stop receiving the heartbeat (as described above). Wireshark confirms this: there is a big pause in packets. Once the client had stopped receiving the heartbeat, pressing any key (on the client) seems to trigger a spew of data from the client's buffer (all heartbeats). Wireshark on the client also shows this massive amount of data all in one packet.
Unfortunately I don't really know what this means. Is this a line mode on/off thing? Line endings (\r\n) are very clearly coming through.
**Update 2: running netcat instead of telnetd, the problem is not reproducible.
The first thing I would do is get out Wireshark and try to find out if the server is truly sending the message. It would be instructive to run Wireshark on the server as well as on a third-party PC. Is there anything different about the last heartbeat?
Edit. Well, that was an interesting find on your client.
It seems like there's some sort of terminal thing in the way. You may want to use the netcat program rather than telnetd. netcat is designed for sending arbitrary data over a TCP session in raw mode, without any special formatting, and it has the ability to hook up an arbitrary process to a socket. On a Windows machine you can use PuTTY in raw mode to accomplish the same thing.
It may still be worth examining the traffic with a third party between your client and server. The kernel may be optimizing away writes to the network and buffering data internally. That's the only way to ensure that what you see is what's really happening on the wire.