I have a client-server architecture implemented in C++ with blocking sockets under Windows 7. Everything runs well up to a certain level of load. If several clients (e.g. more than 4) are receiving or sending megabytes of data, the communication with one client sometimes freezes for approximately 5 seconds. All other clients keep working as expected in that case.
The buffer size is 8192 bytes and logging on the server side reads as follows:
TimeStamp (s.ms) - received bytes
…
1299514524.618 - 8192
1299514524.618 - 8192
1299514524.618 - 0004
1299514529.641 - 8192
1299514529.641 - 3744
1299514529.641 - 1460
1299514529.641 - 1460
1299514529.641 - 8192
…
It seems that only 4 bytes can be read within those 5 seconds. Furthermore, I found that the freeze is always around 5 seconds - never 4 or less, never 6 or more...
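For reference, a log like the one above would come from an ordinary blocking recv() loop; a minimal sketch (hypothetical, since the actual code isn't shown):

// Hypothetical sketch of the logging receive loop; the buffer size matches the 8192 above.
#include <winsock2.h>
#include <windows.h>
#include <cstdio>

void receiveLoop(SOCKET clientSocket)
{
    char buffer[8192];
    for (;;)
    {
        int received = recv(clientSocket, buffer, sizeof(buffer), 0); // blocks until data arrives
        if (received <= 0)
            break;                                // connection closed or error

        SYSTEMTIME st;
        GetSystemTime(&st);                       // wall-clock time with a millisecond field
        printf("%02u:%02u:%02u.%03u - %d\n",
               st.wHour, st.wMinute, st.wSecond, st.wMilliseconds, received);
    }
}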
Any ideas?
Best regards
Michael
This is a Windows bug.
KB 2020447 - Socket communication using the loopback address will intermittently encounter a five second delay
A Hotfix is available in
KB 2861819 - Data transfer stops for five seconds in a Windows Socket-based application in Windows 7 and Windows Server 2008 R2
I've had this problem under high load: the last packet of TCP data sometimes arrived before the second-to-last one, and since the default stack does not reorder the packets, this disorder produced a result on the receiving side similar to what you describe.
The solution we adopted was to distribute the load across more servers.
Here's the summary: I send a packet from a server to a client running on the same computer. For some reason the packet sent is not the same as the packet received.
Here's the details:
The packet was sent using RakNet with the calling function:
rakPeer->Send(&bitStream, MEDIUM_PRIORITY, RELIABLE_ORDERED, 0, UNASSIGNED_RAKNET_GUID, true);
Here are the first 10 bytes of the packet sent by the server:
27,50,39,133,202,135,0,0,0,99 ... 1180 more bytes
Here are the first 10 bytes of the packet as seen by the receiving client (Note: 50% of the time it is right, the other half it is this):
27,50,43,40,247,134,255,255,255,99 ... 1180 more bytes
The first byte is ID_TIMESTAMP. Bytes 2-5 contain the time stamp, which I presume RakNet messes with somehow. Byte 6 is the packet ID, which is clearly changed, as are the following 3 bytes.
My suspicion is that the error is somehow caused by the length of the packet, as smaller packets seem to be sent without any detectable errors; however, I understand RakNet automatically handles packet corruption and internally splits packets that are too large.
Any help is appreciated.
Well for anyone who has the same issue, here is the solution.
RakNet timestamps are 32-bit or 64-bit depending on your build configuration. In this case I was sending 32-bit timestamps from a 64-bit build. That is a no-no, since RakNet rewrites the bits it thinks are the timestamp to account for the relative time between computers.
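For illustration, a sketch of a timestamped send done that way (using the standard RakNet BitStream/GetTime API; exact headers and namespaces depend on your RakNet version):

#include "BitStream.h"
#include "GetTime.h"
#include "MessageIdentifiers.h"
#include "RakPeerInterface.h"

// Sketch: when a message starts with ID_TIMESTAMP, the field that follows must be
// written as RakNet::Time, whose width (32 or 64 bit) is fixed by the build
// configuration. Writing a raw 32-bit value in a 64-bit build shifts every byte
// after it, which matches the corruption described in the question.
void sendTimestamped(RakNet::RakPeerInterface *rakPeer)
{
    RakNet::BitStream bitStream;
    bitStream.Write(static_cast<RakNet::MessageID>(ID_TIMESTAMP));
    bitStream.Write(RakNet::GetTime());   // RakNet::Time, matching the build's width
    bitStream.Write(static_cast<RakNet::MessageID>(ID_USER_PACKET_ENUM)); // your own packet ID
    // ... payload ...
    rakPeer->Send(&bitStream, MEDIUM_PRIORITY, RELIABLE_ORDERED, 0,
                  UNASSIGNED_RAKNET_GUID, true);
}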
This is my first "question", I hope I do it right :)
I am experimenting with network programming and in particular I want to broadcast data from one machine to some other >10 devices using UDP, over a wireless network. The data comes in packets of about 300 bytes, and at about 30 frames per second, i.e., one every ~33ms.
My implementation is based on the qt example: http://qt-project.org/doc/qt-4.8/network-broadcastreceiver.html
I am testing the application with just one client and I am experiencing quite a few dropped frames, and I am not really sure why. Everything works fine if I use Ethernet cables. I hope someone here can help me find the reason.
I can spot dropped frames because the packets contain a timestamp: after I receive a datagram, I check the difference between its timestamp and the previous one; if it is greater than e.g. 50 ms, it means I lost a packet on the way.
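The check itself is roughly this (a sketch with hypothetical names, assuming a millisecond timestamp at the start of each datagram):

#include <QtGlobal>   // quint64, qWarning

// Sketch: detect dropped frames from the sender timestamp carried in each datagram.
// Frames arrive every ~33 ms, so a gap well above that implies at least one loss.
static quint64 lastTimestampMs = 0;

void checkForDrop(quint64 timestampMs)
{
    if (lastTimestampMs != 0 && timestampMs - lastTimestampMs > 50) {
        qWarning("gap of %llu ms since the previous frame",
                 static_cast<unsigned long long>(timestampMs - lastTimestampMs));
    }
    lastTimestampMs = timestampMs;
}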
These drops happen quite often, even though I have a dedicated Wi-Fi network (not connected to the internet and with just 3 machines connected to a router I just bought). Most of the time I drop one or two packets, which would not be a problem, but sometimes the difference between the timestamps suggests that more than 30 packets are lost, which is not good for what I am trying to achieve.
When I ping from one machine to the other, I get these values:
50 packets transmitted, 50 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.244/91.405/508.959/119.074 ms
pretty bad for a new router, in a dedicated network with just 3 clients, isn't it? The router is advertised as a very fast Wi-Fi router, with three times faster performance than 802.11n routers.
Compare it with the values I get from an older router, sitting in the same room, with some 10 machines connected to it, during office hours:
39 packets transmitted, 39 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.458/47.297/142.201/37.186 ms
Perhaps the router is defective?
One thing I cannot explain is that, if I ping while running my UDP client/server application, the statistics improve:
55 packets transmitted, 55 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.164/6.174/197.962/26.181 ms
I was wondering if anyone has tips on what to test, or hints on how to achieve a "reliable" UDP connection between these machines over Wi-Fi. By reliable I mean that I would be OK dropping 2 consecutive packets, but not more.
Thanks.
Edit
It seems that the router (?) sends the packets in bursts. I am measuring the time that passes between receiving two datagrams on the client: it is about 3 ms within a sequence of ~10 packets, and then around 300 ms before the next packet. I think my issue at the client is related more to this inconsistency in the intervals between frames than to the dropped frames themselves. I probably just need a queue and a playout delay of >300 ms with respect to the server.
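If burstiness really is the problem, a small playout queue on the client would smooth it out; a rough sketch of the idea (hypothetical, not taken from my actual code):

#include <QQueue>
#include <QByteArray>
#include <QDebug>

// Sketch: buffer incoming datagrams and consume them at a fixed ~33 ms cadence,
// starting only once ~300 ms worth of frames has accumulated, so the bursts
// described above are absorbed before the data is used.
class PlayoutBuffer
{
public:
    PlayoutBuffer() : m_started(false) {}

    // Call this from the readyRead() handler for every datagram received.
    void enqueue(const QByteArray &frame) { m_queue.enqueue(frame); }

    // Call this from a QTimer firing every 33 ms.
    void playNext()
    {
        if (!m_started && m_queue.size() < 10)    // ~10 frames = ~330 ms of delay
            return;
        m_started = true;
        if (!m_queue.isEmpty())
            qDebug() << "consume frame of" << m_queue.dequeue().size() << "bytes";
    }

private:
    QQueue<QByteArray> m_queue;
    bool m_started;
};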
The first and easiest way to tackle any network-related problem is to capture the traffic with Wireshark.
Also check whether the packets are really being sent out by the broadcasting machine.
And based on your description, if the packets are transmitted fine over Ethernet cables but not over Wi-Fi, the issue could also be with the UDP port.
I am working on a C++ application that can be described as a router. This application receives UDP packets on a given port (nearly 37 bytes each second) and must multicast them to other destinations within a 10 ms period. However, sometimes after packet reception the retransmission exceeds the 10 ms limit and can reach 100 ms. These off-limit delays are random.
The application also receives, on the same Ethernet interface but on a different port, another kind of packet (up to 200 packets of nearly 100 bytes each second). I am not sure whether this latter flow is disrupting the first one, because these delay peaks are too scarce (2 packets among 10000).
What can be the causes of these sporadic delays, and how can I solve them?
P.S. My application is running on Linux 2.6.18-238.el5PAE. The delay is measured between the reception of a packet and the successful completion of its retransmission.
(An image was attached here for clarity.)
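For reference, the delay is measured roughly like this (a sketch, not the actual code):

#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>
#include <stdio.h>

// Sketch: time the interval between receiving a datagram and completing its
// retransmission, using a monotonic clock (link with -lrt on this kernel).
static double elapsed_ms(const struct timespec &a, const struct timespec &b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

void forwardOne(int inSock, int outSock, const struct sockaddr *dst, socklen_t dstLen)
{
    char buf[1500];
    struct timespec t0, t1;

    ssize_t n = recv(inSock, buf, sizeof(buf), 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);          // packet received

    if (n > 0) {
        sendto(outSock, buf, n, 0, dst, dstLen);  // multicast retransmission
        clock_gettime(CLOCK_MONOTONIC, &t1);      // transmission completed
        printf("forwarding delay: %.3f ms\n", elapsed_ms(t0, t1));
    }
}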
10ms is a tough deadline for a non-realtime OS.
Assign your process to one of the realtime scheduling policies, e.g. SCHED_RR or SCHED_FIFO (some reading). It can be done in the code via sched_setscheduler() or from command line via chrt. Adjust the priority as well, while you're at it.
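For example, a minimal sketch of switching the current process to SCHED_FIFO from code (priority 50 is an arbitrary mid-range choice; this needs root or CAP_SYS_NICE):

#include <sched.h>
#include <cstdio>

// Sketch: put the calling process under the SCHED_FIFO real-time policy.
bool enableRealtimeScheduling()
{
    struct sched_param param;
    param.sched_priority = 50;                    // 1 (lowest) .. 99 (highest)
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        perror("sched_setscheduler");
        return false;
    }
    return true;
}

From the command line the equivalent would be along the lines of chrt -f 50 ./your_app (your_app being a placeholder for the router binary).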
Make sure your code doesn't consume CPU more than it has to, or it will affect entire system performance.
You may also need RT_PREEMPT patch.
Overall, generating Ethernet traffic on a strict schedule on Linux is not an easy task. E.g. see BRUTE, a high-performance traffic generator; maybe you'll find something useful in its code or in the research paper.
I wrote a C++ application (running on Linux) that serves an RTP stream of about 400 kbps. To most destinations this works fine, but some destinations experience packet loss. The problematic destinations seem to have a slower connection in common, but it should still be plenty fast enough for the stream I'm sending.
Since these destinations are able to receive similar RTP streams for other applications without packet loss, my application might be at fault.
I already verified a few things:
- in a tcpdump, I see all RTP packets going out on the sending machine
- there is a UDP send buffer in place (I tried sizes between 64KB and 300KB)
- the RTP packets mostly stay below 1400 bytes to avoid fragmentation
What can a sending application do to minimize the possibility of packet loss, and what would be the best way to debug such a situation?
Don't send out packets in big bursty chunks.
The packet loss is usually caused by slow routers with limited packet buffer sizes. The slow router might be able to handle 1 Mbps just fine if it has time to send out say, 10 packets before receiving another 10, but if the 100 Mbps sender side sends it a big chunk of 50 packets it has no choice but to drop 40 of them.
Try spreading out the sending so that you write only what is necessary to write in each time period. If you have to write one packet every fifth of a second, do it that way instead of writing 5 packets per second.
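In code that could look something like this (a sketch; the one-packet-per-200-ms cadence is just the example above):

#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>

// Sketch: pace packets out individually instead of sending a burst per interval.
// Here one packet is sent every 200 ms rather than 5 back to back each second.
void sendPaced(int sock, const struct sockaddr *dst, socklen_t dstLen,
               const char *packets[], const size_t lengths[], size_t count)
{
    struct timespec interval = { 0, 200 * 1000 * 1000 };    // 200 ms between packets
    for (size_t i = 0; i < count; ++i) {
        sendto(sock, packets[i], lengths[i], 0, dst, dstLen);
        nanosleep(&interval, NULL);                         // spread the load out
    }
}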
netstat has several useful options for debugging the situation.
First one is netstat -su (dump UDP statistics):
dima@linux-z8mw:/media> netstat -su
IcmpMsg:
InType3: 679
InType4: 20
InType11: 548
OutType3: 100
Udp:
12945 packets received
88 packets to unknown port received.
0 packet receive errors
13139 packets sent
RcvbufErrors: 0
SndbufErrors: 0
UdpLite:
InDatagrams: 0
NoPorts: 0
InErrors: 0
OutDatagrams: 0
RcvbufErrors: 0
SndbufErrors: 0
IpExt:
InNoRoutes: 0
InTruncatedPkts: 0
InMcastPkts: 3877
OutMcastPkts: 3881
InBcastPkts: 0
OutBcastPkts: 0
InOctets: 7172779304
OutOctets: 785498393
InMcastOctets: 525749
OutMcastOctets: 525909
InBcastOctets: 0
OutBcastOctets: 0
Notice "RcvbufErrors" and "SndbufErrors"
Additional option is to monitor receive and send UDP buffers of the process:
dima@linux-z8mw:/media> netstat -ua
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
udp 0 0 *:bootpc *:*
udp 0 0 *:40134 *:*
udp 0 0 *:737 *:*
udp 0 0 *:mdns *:*
Here you need to look at the Recv-Q and Send-Q columns of the connection you're interested in. If the values are high and don't drop to zero, then the process cannot handle the load.
You can use these commands on both the sending and the receiving machine.
Also you can use mtr, which combines traceroute and ping - it pings each hop in the route.
This may detect a slow hop in your route. Run it on both machines to check connectivity to the other one.
RTP typically uses UDP, which is inherently lossy. Packets could be lost anywhere between sender and receiver, so local debug will show you nothing useful.
Obvious things to do:
a: Reduce the overall data rate.
b: Reduce the 'peak' data rate by sending small packets more often rather than one huge chunk every few seconds, i.e. REDUCE your UDP send buffer - maybe even to just 1400 bytes (see the sketch after this list).
c: See if you can switch to a TCP variant of RTP.
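For point b, shrinking the UDP send buffer is a single setsockopt() call; a sketch (POSIX sockets assumed - whether 1400 bytes is really the right value depends on your traffic):

#include <sys/socket.h>
#include <cstdio>

// Sketch: shrink the socket's send buffer so the kernel cannot queue up a large burst.
// Note: the kernel may round the requested size up to its own minimum.
bool shrinkSendBuffer(int sock)
{
    int size = 1400;   // roughly one MTU-sized packet, as suggested in point b
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) != 0) {
        perror("setsockopt(SO_SNDBUF)");
        return false;
    }
    return true;
}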
If all else fails, Wireshark is your friend. It will give you a true picture of how much data - and when - is being sent by your app.
You should try reducing the rate you send packets. A slow connection can mean all sorts of things, and trying to send it packets (small or large) at a high rate won't help.
This may not be the answer you want, but if I had packet loss problems I'd try to switch my application to use TCP, and have most worries of packet loss taken off my mind.
I have two programs written in C++ that use Winsock. They both accept TCP connections and one sends data the other receives data. They are compiled in Visual Studio 2008. I also have a program written in C# that connects to both C++ programs and forwards the packets it receives from one and sends them to the other. In the process it counts and displays the number of packets forwarded. Also, the elapsed time from the first to the most recent packet is displayed.
The C++ program that sends packets simply loops 1000 times sending the exact same data. When I run all three apps on my development machine (using loopback or actual IP) the packets get run through the entire system in around 2 seconds. When I run all three on any other PC in our lab it always takes between 15 and 16 seconds. Each PC has different processors and amounts of memory but all of them run Windows XP Professional. My development PC actually has an older AMD Athlon with half as much memory as one of the machines that takes longer to perform this task. I have watched the CPU time graph in Task Manager on my machine and one other and neither of them is using a significant amount of the processor (i.e. more than 10%) while these programs run.
Does anyone have any ideas? I can only think to install Visual Studio on a target machine to see if it has something to do with that.
Problem Solved ====================================================
I first installed Visual Studio to see if that had any effect and it didn't. Then I tested the programs on my new development PC and it ran just as fast as my old one. Running the programs on a Vista laptop yielded 15 second times again.
I printed timestamps on either side of certain instructions in the server program to see which was taking the longest, and I found that the delay was being caused by a Sleep() call of 1 millisecond. Apparently, on both my old and new systems the Sleep(1) was being ignored, because I would have anywhere from 10 to more than 20 packets being sent in the same millisecond. Occasionally I would have a break in execution of around 15 or 16 milliseconds, which led to the total time of around 2 seconds for 1000 packets. On the systems that took around 15 seconds to run through 1000 packets, there was either a 15 or 16 millisecond gap between sending each packet.
I commented out the Sleep() method call and now the packets get sent immediately. Thanks for the help.
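For anyone who wants to verify this on a given machine, here is a small sketch that measures what Sleep(1) actually costs (using QueryPerformanceCounter; the ~15-16 ms figure corresponds to the default Windows timer resolution):

#include <windows.h>
#include <cstdio>

// Sketch: measure the real duration of Sleep(1). On machines where the default
// timer resolution is ~15.6 ms, each call blocks roughly that long, which is
// exactly the ~15 s for 1000 packets seen above.
int main()
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    for (int i = 0; i < 100; ++i)
        Sleep(1);
    QueryPerformanceCounter(&end);

    double totalMs = (end.QuadPart - start.QuadPart) * 1000.0 / freq.QuadPart;
    printf("average Sleep(1) took %.3f ms\n", totalMs / 100.0);
    return 0;
}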
You should profile your application in the good 2-second case and in the 15-second lab case and see where they differ. The difference could be due to any number of problems (disk, antivirus, network) - without any data backing it up, we'd just be shooting in the dark.
If you don't have access to a profiler, you can add timing instrumentation to various phases of your program to see which phase is taking longer.
You could try checking the Winsock performance tuning registry settings - it may be the installation of some game or utility has tweaked those on your PC.