I have a very annoying bug showing up.
We left our iPhone app running overnight.
Every 2 seconds it sends a broadcast ping out onto the network via the open socket to announce that the device is alive. The other application detects that ping and attempts to send messages back. The problem is that, despite the ping continuing to go out, no packets are ever received.
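For context, the ping side is essentially a plain UDP broadcast loop along these lines (a simplified sketch, not the actual app code; the port and broadcast address are placeholders):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    int yes = 1;
    // Broadcasting must be enabled explicitly on the socket.
    setsockopt(sock, SOL_SOCKET, SO_BROADCAST, &yes, sizeof(yes));

    sockaddr_in dest{};
    dest.sin_family = AF_INET;
    dest.sin_port = htons(5000);                        // placeholder port
    dest.sin_addr.s_addr = inet_addr("192.168.1.255");  // placeholder broadcast address

    const char ping[] = "ALIVE";
    for (;;) {
        // Announce ourselves, then poll the same socket for any replies.
        sendto(sock, ping, sizeof(ping), 0, (sockaddr*)&dest, sizeof(dest));
        char buf[512];
        ssize_t n = recvfrom(sock, buf, sizeof(buf), MSG_DONTWAIT, nullptr, nullptr);
        if (n > 0)
            printf("received %zd bytes\n", n);
        sleep(2);
    }
}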
This only seems to happen after several hours (annoyingly, we've only ever managed to reproduce it overnight). It then seems to leave the iPhone in a very confused state where, even after restarting the app, it is still unable to receive the packets. Eventually, after a while (sorry, I have no idea how long), the phone starts acting normally again and I can continue.
I'm guessing that somewhere along the line iOS is blocking the socket from receiving data (but, oddly, not from sending on the same socket!).
Has anyone any idea what this might be and, more importantly, how I might solve the issue?
Well this turned out to be a very strange problem.
I broke out a packet sniffer to inspect what was going on, and I found that my PC was sending out ARP broadcasts trying to identify who had the IP address. These ARP requests were not being answered by the router or the iPhone.
This was very strange.
In the end I started checking the Wi-Fi access point I was attached to. I disabled the Wi-Fi on it, forcing us to use a different (though slightly weaker) access point, and suddenly the ARP requests were being answered and everything sprang into life.
It was at this point that I remembered that my boss had tripped over the wire of the access point and it had come crashing to the ground. It "seemed" to work... but, evidently, he broke "something" :(
The problem is now no more!
I am developing a client-server application on Windows using C++ and the Winsock library. It works fine, but once the server has started listening on the network, if I remove the network cable the server doesn't show any error in any thread. So how can the server socket know that the network cable has been unplugged?
If anybody knows, please help me.
While it should be possible to detect that the network cable is unplugged on the host, you will still have the same problem if the network is disrupted somewhere else between your server and the clients.
One common (if not the most common) way to solve this is to send a "keep-alive" message periodically. If no reply to that message is received within some timeout, you simply close the connection and release all resources associated with it.
Edit
A "keep-alive" message is like using the "ping" command to see if a remote machine can be reached. It is simply a message that is sent, either by the server or the client (it doesn't matter who initiate it) to see if the other end of the connection is alive and can be reached.
It can be as simple as sending the string "Are you there?" and expecting a reply containing "Yes I am". If you send it once every minute, and don't get a reply withing (for example) one minute, you can consider the connection being dead. The other end, that receives the "Are you there?", knows it will get the message once every minute. If it hasn't arrived for two minutes then the sender is no longer reachable.
If the protocol can't be modified to add such messages, then see if some other message can be used instead.
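As a rough illustration, a keep-alive probe over a connected Winsock socket could look something like this (a minimal sketch; the strings, the 60-second timeout and the error handling are placeholders, and in a real protocol the reply would normally be picked up by your ordinary receive path rather than a dedicated recv):

#include <winsock2.h>

// Returns false when the peer should be considered dead.
// (WSAStartup and connection setup are omitted.)
bool keep_alive_ping(SOCKET s)
{
    const char probe[] = "Are you there?";
    if (send(s, probe, sizeof(probe), 0) == SOCKET_ERROR)
        return false;                      // send failed: connection is gone

    // Wait up to 60 seconds for any reply before giving up.
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(s, &readable);
    timeval timeout = { 60, 0 };
    if (select(0, &readable, NULL, NULL, &timeout) <= 0)
        return false;                      // timeout or error: consider the peer dead

    char reply[64];
    int n = recv(s, reply, sizeof(reply), 0);
    return n > 0;                          // 0 = orderly close, SOCKET_ERROR = failure
}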
Also, remember that the best, and in some cases the only, way to know if something is wrong with a connection is to attempt to read from the socket.
You can unplug a network and then plug it back in, or your Wi-Fi laptop can lose reception for a second and then pick it back up. It would be frustrating if such resumable cases were treated as an error in all the programs we use.
From this Winsock "newbie" FAQ:
The previous question deals with detecting when a protocol connection is dropped normally, but what if you want to detect other problems, like unplugged network cables or crashed workstations? In these cases, the failure prevents notifying the remote peer that something is wrong. My feeling is that this is usually a feature, because the broken component might get fixed before anyone notices, so why demand that the connection be reestablished?
If you feel you have a "special needs" situation you can be aggressive with timeouts. But I wouldn't do that unless there was a really good reason.
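If you do go the aggressive-timeout route, a hedged sketch of what that can look like on Winsock is to put a receive timeout on the socket and interpret what recv() reports (the 30-second value below is just an example):

#include <winsock2.h>

enum ConnState { Alive, Closed, TimedOut, Failed };

ConnState check_by_reading(SOCKET s, char* buf, int len)
{
    // Give up on recv() after 30 seconds with no data (example value).
    DWORD timeout_ms = 30000;
    setsockopt(s, SOL_SOCKET, SO_RCVTIMEO,
               reinterpret_cast<const char*>(&timeout_ms), sizeof(timeout_ms));

    int n = recv(s, buf, len, 0);
    if (n > 0)
        return Alive;                        // got data: connection is fine
    if (n == 0)
        return Closed;                       // peer closed the connection cleanly

    if (WSAGetLastError() == WSAETIMEDOUT)
        return TimedOut;                     // nothing arrived within the timeout
    return Failed;                           // reset, unreachable host, etc.
}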
I have a problem with a client-server application. As I've almost run out of sane ideas for solving it, I am asking for help. I've stumbled into the described situation about three or four times now. The data provided is from the last failure, when I had turned on all possible logging, message dumping and so on.
System description
1) Client. Works under Windows. I take it as an assumption that there is no problem with its work (judging from the logs).
2) Server. Works under Linux (RHEL 5). It is the server where I have the problem.
3) Two connections are maintained between client and server: one for commands and one for sending data. Both work asynchronously. Both connections live in one thread and on one boost::asio::io_service.
4) Data to be sent from client to server consists of messages delimited by '\0'.
5) Data load is about 50 Mb/hour, 24 hours a day.
6) Data is read on the server side using boost::asio::async_read_until with the corresponding delimiter (roughly as in the sketch below)
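For reference, the read loop in 6) is essentially of this shape (a simplified sketch with placeholder names, not the actual server code):

#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <iostream>
#include <string>

class DataConnection {
public:
    explicit DataConnection(boost::asio::io_service& io) : socket_(io) {}

    boost::asio::ip::tcp::socket& socket() { return socket_; }

    // Queue an asynchronous read that completes when a '\0' delimiter arrives.
    void start_reading() {
        boost::asio::async_read_until(
            socket_, buffer_, '\0',
            boost::bind(&DataConnection::on_message, this,
                        boost::asio::placeholders::error,
                        boost::asio::placeholders::bytes_transferred));
    }

private:
    void on_message(const boost::system::error_code& ec, std::size_t /*bytes*/) {
        if (ec) {
            std::cerr << "read error: " << ec.message() << std::endl;
            return;
        }
        // Pull one '\0'-terminated message out of the streambuf.
        std::istream is(&buffer_);
        std::string msg;
        std::getline(is, msg, '\0');
        // ... process msg ...
        start_reading();   // queue the next read
    }

    boost::asio::ip::tcp::socket socket_;
    boost::asio::streambuf buffer_;
};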
Problem
- For two days the system worked as expected.
- On the third day at 18:55 the server read one last message from the client and then stopped reading them. There was no info in the logs about new data.
- From 18:55 to 09:00 (14 hours) the client reported no errors. So it sent data (about 700 Mb) successfully and no errors arose.
- At 08:30 I started investigating the problem. The server process was alive, and both connections between server and client were alive too.
- At 09:00 I attached to the server process with gdb. The server was in a sleeping state, waiting for some signal from the system. I believe I accidentally hit Ctrl+C and maybe there was some message.
- Later in the logs I found a message saying something like 'system call interrupted'. After that both connections to the client were dropped. The client reconnected and the server started working normally.
- The first message processed by the server was timestamped 18:57 on the client side. So after resuming normal work, the server didn't drop all the messages sent up to 09:00; they had been stored somewhere, and it processed them accordingly.
Things I've tried
- Simulated the scenario above. Since the server dumped all incoming messages, I wrote a small script which presented itself as the client and sent all the messages back to the server again. The server died with an out-of-memory error, but unfortunately that was because of the high data load (about 3 Gb/hour this time), not because of the original error. As it was Friday evening, I had no time to repeat the experiment properly.
- Nevertheless, I ran the server through Valgrind to detect possible memory leaks. Nothing serious was found (apart from the fact that the server died because of the high load); no huge memory leaks.
Questions
- Where were these 700 Mb of data which the client sent and the server didn't get? Why were they persistent, and why weren't they lost when the server restarted the connection?
- It seems to me that the problem is somehow connected with the server not getting messages from boost::asio::io_service. The buffer gets filled with data, but no calls to the read handler are made. Could this be a problem on the OS side? Maybe something wrong with the asynchronous calls? If so, how could this be checked?
- What can I do to find the source of the problem? As I said, I've run out of sane ideas, and each experiment costs a lot in terms of time (it takes about two or three days to get the system into the described state), so I need to run as many checks per experiment as I can.
I would be grateful for any ideas I can use to get to the error.
Update: OK, it seems that the error was a synchronous write left in the middle of the asynchronous client-server interaction. As both connections lived in one thread, this synchronous write was blocking the thread for some reason, and all interaction on both the command and data connections stopped. So I changed it to the async version, and now it seems to work.
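To make the difference concrete, a rough before/after sketch (class, member and handler names are placeholders, not the real server code):

#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/shared_ptr.hpp>
#include <string>

class Session {
public:
    explicit Session(boost::asio::io_service& io) : socket_(io) {}

    // Before: a synchronous write performed on the single io_service thread
    // blocks every other handler (command and data connection alike) until
    // the peer has accepted all the bytes.
    void send_reply_blocking(const std::string& reply) {
        boost::asio::write(socket_, boost::asio::buffer(reply));
    }

    // After: async_write returns immediately and reports completion through a
    // handler, so the thread stays free to run the read handlers.
    void send_reply_async(const std::string& reply) {
        boost::shared_ptr<std::string> data(new std::string(reply)); // keep buffer alive
        boost::asio::async_write(socket_, boost::asio::buffer(*data),
            boost::bind(&Session::on_write_done, this, data,
                        boost::asio::placeholders::error,
                        boost::asio::placeholders::bytes_transferred));
    }

private:
    void on_write_done(boost::shared_ptr<std::string> /*data*/,
                       const boost::system::error_code& ec,
                       std::size_t /*bytes*/) {
        if (ec) { /* log the error and drop the connection */ }
    }

    boost::asio::ip::tcp::socket socket_;
};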
As I said, I've run out of sane ideas, and each experiment costs a lot in terms of time (it takes about two or three days to get the system into the described state)
One way to simplify the investigation of this problem is to run the server inside a virtual machine until it reaches this broken state. Then you can take a snapshot of the whole system and revert to it every time things go wrong during the investigation. At least you will not have to wait three days to get to this state again.
I have a Winsock IOCP server written in C++ using TCP/IP connections. I have tested this server locally, using the loopback address with a client simulator, and I have been able to get upwards of 60,000 clients, no sweat. The issue I am having is when I run the server at my house and the client simulator at a friend's house. Everything works fine up until we hit around 3700 connections; after that, every call to connect() fails from the client side with a return of 10060 (the Winsock "connection timed out" error). Last night this number was 3700, but it has been around 300 before, and we have also seen it near 1000. But whatever the number is, every time we try to simulate it, it fails right around that number (within 10 or so).
Both computers are using Windows 7 Ultimate. We have also both modified the TCP/IP registry setting MaxTcpConnections to around 16 million. We also changed the MaxUserPort setting from its default of 5000 to 65k. No useful information is showing up in the Event Viewer. We also both watched our Resource Monitor, and we haven't even reached 1% network utilization; CPU usage is also close to 0%.
We just got off the phone with our ISP, and they say that they are not limiting us in any way, but the guy was kind of unsure and ended up hanging up on us anyway after a 30-minute hold time...
We are trying everything to figure this issue out, but cannot come up with the solution. I would be very grateful if someone out there could give us a hand with this issue.
P.S. Both computers are on Verizon FIOS with the same Verizon router. Another thing to note: the server is using WSAAccept and NOT AcceptEx. The client simulator spreads its connection attempts over many seconds, though, so I am pretty sure the connects are not getting backlogged. We have tried changing the speed at which the client simulator connects, and no matter what speed it is set to, it fails right around the same number each time.
UPDATE
We simulated 2 separate clients (on 2 separate machines) on network A. The server was running on network B. Each client was only able to make about half of the connections (about 1600 each) to the server. We were initially using a port below 1,000; this has since been changed to above 50,000. The router log on both machines showed nothing. We are both using the Actiontec MI424WR Verizon FIOS router. This leads me to believe the problem is not with the client code. The server throws no errors and shows no unexpected behavior. Could this be an ISP/router issue?
UPDATE
The solution has been found. The Verizon router we were using (MI424WR revision C) is unable to handle any more than about 3700 connections; we confirmed this with a separate set of networks. Thanks for the help, guys!
Thanks
- Rick
I would have guessed that this was a MaxUserPort issue, but you say you've changed that. Did you reboot after changing it?
Run the test on the exact same computers on your local network (this will take the computers out of the equation).
Could the issue be that one of your routers isn't up to the job?
I have a blocking client/server connected locally via Winsock. The client uses firefox to retrieve data from websites, passing certain data along to the server for extra processing. The server always responds, and the processing can take anywhere from 1/10th second to a few minutes. The client has no winsock connection to anything but the server; all web data is retrieved to hard-drive via firefox.
This setup works quite well until, seemingly randomly, the client's recv returns -1 (SOCKET_ERROR) with error code 10054 (WSAECONNRESET). This means the server supposedly terminated the connection, but the server is actually still waiting in recv as if nothing is wrong. The connection has failed in this way as early as 5 minutes in, and after working for as long as about an hour and a half. The client sends about 10 different types of requests to the server, and the failure has occurred on a variety of them. The frequency of requests is roughly constant, probably an average of 10-15 a minute. When the connection breaks, neither computer experiences internet problems and Remote Desktop does not disconnect.
Initially I suspected memory leaks, but after extensive debugging I am reasonably certain none remain. Firefox is engaged in considerable HTTP traffic at times, so I thought maybe that could be filling the available socket buffer space or something; that seems doubtful, but at this point I'm really not sure. So, could it be more memory leaks, maybe a hidden buffer overrun, or too much web traffic? What is causing my Winsock app to randomly fail?
Sounds like a firewall at work.
Many firewalls are configured to terminate idle connections (i.e. open TCP sessions on which no data has been transferred for a while), especially HTTP connections, which are typically not persistent.
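If an idle timeout like that is indeed the culprit, one common mitigation is to turn on TCP keep-alives with an interval shorter than the firewall's idle limit. A sketch on Winsock (the 60-second and 10-second values are arbitrary examples, not recommendations):

#include <winsock2.h>
#include <mstcpip.h>

// Enable TCP keep-alive probes on an already-connected socket so the
// connection never looks idle to middleboxes. Returns true on success.
bool enable_keepalive(SOCKET s)
{
    tcp_keepalive ka;
    ka.onoff             = 1;          // turn keep-alive on
    ka.keepalivetime     = 60 * 1000;  // first probe after 60 s of idle (ms)
    ka.keepaliveinterval = 10 * 1000;  // then every 10 s while unanswered (ms)

    DWORD bytes_returned = 0;
    return WSAIoctl(s, SIO_KEEPALIVE_VALS, &ka, sizeof(ka),
                    NULL, 0, &bytes_returned, NULL, NULL) == 0;
}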
I'm working in an embedded Linux environment.
It launches a telnet daemon on startup which listens on a particular port and launches a program when a connection is received.
i.e.
telnetd -l /usr/local/bin/PROGA -p 1234
PROGA will output some data at irregular intervals. When it is not outputting data, every X period of time it sends out a 'heartbeat'-type string to let the client know that we are still active, i.e. "heartbeat\r\n".
After a random amount of time, the client (a Linux version of telnet, launched by: telnet xxx.xxx.xxx.xxx 1234) will fail to receive the 'heartbeat\r\n'.
The data the client sees:
heartbeat
heartbeat
heartbeat
...
heartbeat
[nothing, should have received heartbeat]
[nothing forever]
heartbeat is sent:
result = printf("%s", heartbeat);
Checking result, it is always the length of heartbeat. Logging to syslog shows us that the printf() is executing successfully at the proper intervals.
I've since added a tcdrain and an fflush, which both return success but do not seem to help the situation.
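To be concrete, the send path boils down to roughly this (simplified, not the exact code; logging omitted). stdout here is the pty/socket that telnetd hands us:

#include <cstdio>
#include <termios.h>
#include <unistd.h>

void send_heartbeat()
{
    const char* heartbeat = "heartbeat\r\n";

    int result = printf("%s", heartbeat);  // always returns the length of the string
    fflush(stdout);                        // push it out of the stdio buffer
    tcdrain(STDOUT_FILENO);                // wait until the terminal layer has written it
    (void)result;                          // result and both flush calls report success
}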
Any help would be appreciated.
UPDATE: I got a Wireshark capture from the server side. Very clearly, the heartbeat is being sent continuously. No hiccups, no delays. I found something interesting on the client, though. The client in this test case (telnet on Ubuntu 9.04) seems to suddenly stop receiving the heartbeat (as described above). Wireshark confirms this: a big pause in packets. However, once the client has stopped receiving the heartbeat, pressing any keystroke (on the client) seems to trigger a spew of data from the client's buffer (all heartbeats). Wireshark on the client also shows this mass of data arriving all in one packet.
Unfortunately I don't really know what this means. Is this a line-mode on/off thing? The line endings (\r\n) are very clearly coming through.
Update 2: Running netcat instead of telnetd, the problem is not reproducible.
The first thing I would do is get out Wireshark and try to find out whether the server is truly sending the message. It would be instructive to run Wireshark on the server as well as on a third-party PC. Is there anything different about the last heartbeat?
Edit. Well, that was an interesting find on your client.
It seems like there's some sort of terminal thing in the way. You may want to use the netcat program rather than telnetd. netcat is designed for sending arbitrary data over a TCP session in raw mode, without any special formatting, and it has the ability to hook up an arbitrary process to a socket. On a Windows machine you can use PuTTY in raw mode to accomplish the same thing.
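For example, something along these lines in place of the telnetd invocation above (assuming your netcat build supports the -e option; the exact flags vary between netcat variants):
nc -l -p 1234 -e /usr/local/bin/PROGA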
It may still be worth examining the traffic with a third party between your client and server. The kernel may be optimizing away writes to the network and buffering data internally. That's the only way to ensure that what you see is what's really happening on the wire.