Packet Delay Variation (PDV) - c++

I am currently implementing video streaming application where the goal is to utilize as much as possible gigabit ethernet bandwidth
Application protocol is built over tcp/ip
Network library is using asynchronous iocp mechanism
Only streaming over LAN is needed
No need for packets to go through routers
This simplifies many things. Nevertheless, I am experiencing problems with packet delay variation.
It means that a video frame which should arrive for example every 20 ms (1280 x 720p 50Hz video signal) sometimes arrives delayed by tens of milliseconds. More:
Average frame rate is kept
Maximum video frame delay is dependent on network utilization
The more data on LAN, the higher the maximum video frame delay
For example, when bandwidth usage is 800mbps, PDV is about 45 - 50 ms.
To my questions:
What are practical boundaries in lowering that value?
Do you know about measurement report available on internet dealing with this?
I want to know if there is some subtle error in my application (perhaps excessive locking) or if there is no way to make numbers better with current technology.

For video streaming, I would recommend using UDP instead of TCP, as it has less overhead and packet confirmation is usually not needed, as the retransmited data would already be obsolete.

Related

Minimizing dropped UDP packets at high packet rates (Windows 10)

IMPORTANT NOTE: I'm aware that UDP is an unreliable protocol. But, as I'm not the manufacturer of the device that delivers the data, I can only try to minimize the impact. Hence, please don't post any more statements about UDP being unreliable. I need suggestions to reduce the loss to a minimum instead.
I've implemented an application C++ which needs to receive a large amount of UDP packets in short time and needs to work under Windows (Winsock). The program works, but seems to drop packets, if the Datarate (or Packet Rate) per UDP stream reaches a certain level... Note, that I cannot change the camera interface to use TCP.
Details: It's a client for Gigabit-Ethernet cameras, which send their images to the computer using UDP packets. The data rate per camera is often close to the capacity of the network interface (~120 Megabytes per second), which means even with 8KB-Jumbo Frames the packet rate is at 10'000 to 15'000 per camera. Currently we have connected 4 cameras to one computer... and this means up to 60'000 packets per second.
The software handles all cameras at the same time and the stream receiver for each camera is implemented as a separate thread and has it's own receiving UDP socket.
At a certain frame rate the software seems miss a few UDP frames (even the network capacity is used only by ~60-70%) every few minutes.
Hardware Details
Cameras are from foreign manufacturers! They send UDP streams to a configurable UDP endpoint via ethernet. No TCP-support...
Cameras are connected via their own dedicated network interface (1GBit/s)
Direct connection, no switch used (!)
Cables are CAT6e or CAT7
Implementation Details
So far I set the SO_RCVBUF to a large value:
int32_t rbufsize = 4100 * 3100 * 2; // two 12 MP images
if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, (char*)&rbufsize, sizeof(rbufsize)) == -1) {
perror("SO_RCVBUF");
throw runtime_error("Could not set socket option SO_RCVBUF.");
}
The error is not thrown. Hence, I assume the value was accepted.
I also set the priority of the main process to HIGH-PRIORITY_CLASS by using the following code:
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
However, I didn't find any possibility to change the thread priorities. The threads are created after the process priority is set...
The receiver threads use blocking IO to receive one packet at a time (with a 1000 ms timeout to allow the thread to react to a global shutdown signal). If a packet is received, it's stored in a buffer and the loop immediately continues to receive any further packets.
Questions
Is there any other way how I can reduce the probability of a packet loss? Any possibility to maybe receive all packets that are stored in the sockets buffer with one call? (I don't need any information about the sender side; just the contained payload)
Maybe, you can also suggest some registry/network card settings to check...
To increase the UDP Rx performance for GigE cameras on Widnows you may want to look into writing a custom filter driver (NDIS). This allows you to intercept the messages in the kernel, stop them from reaching userspace, pack them into some buffer and then send to userspace via a custom ioctl to your application. I have done this, it took about a week of work to get done. There is a sample available from Microsoft which I used as base for it.
It is also possible to use an existing generic driver, such as pcap, which I also tried and that took about half a week. This is not as good because pcap cannot determine when the frames end so packet grouping will be sub optimal.
I would suggest first digging deep in the network stack settings and making sure that the PC is not starved for resources. Look at guides for tuning e.g. Intel network cards for this type of load, that could potentially have a larger impact than a custom driver.
(I know this is an older thread and you have probably solved your problem. But things like this is good to document for future adventurers..)
IOCP and WSARecv in overlapped mode, you can setup around ~60k WSARecv
on the thread that handles the GetQueuedCompletionStatus process the data and also do a WSARecv in that thread to comnpensate for the one being used when receiving the data
please note that your udp packet size should stay below the MTU above it will cause drops depending on all the network hardware between the camera and the software
write some UDP testers that mimuc the camera to test the network just to be sure that the hardware will support the load.
https://www.winsocketdotnetworkprogramming.com/winsock2programming/winsock2advancediomethod5e.html

Latency measurement over UDP on Linux

I want to measure UDP latency and drop rate between two machines on Linux. Preferably (but not crucial) to perform measurement between multiple machines at the same time.
As a result I want to get a histogram, e.g. RTT times of each individual packet at every moment during measurement. Expected frequency is about 10 packets per second.
Do you know of any tool that I can use for this purpose?
What I tried so far is:
ping - uses icmp instead of UDP
iperf - measures only jitter but not latency.
D-ITG - measures per flow statistics, no histograms
tshark - uses TCP for pings instead UDP
I have also created a simple C++ socket program where I have Client and Server on each side, and I send UDP packets with counter and timestamp. My program seems to work ok, although since I am not a network programmer I am not 100% sure that I handled buffers correctly (specifically in the case of partial packets etc). So I would prefer to use some proven software for this task.
Can you recommend something?
Thanks
It depends. If all you want is a trace with timestamps, Wireshark is your friend: https://www.wireshark.org/
I would like to remind you that UDP is a message based protocol and packets have definite boundaries. There cannot be reception of partial packets. That is, you will either get the complete message or you will not get it. So, you need not worry about partial packets in UDP.
The method of calculating packet drop using counter & calculating latency using time delta appears fine for UDP. However the important point to be taken in to consideration is ensuring the synchronization of the system time of client and server.

Low latency serial communication on Linux

I'm implementing a protocol over serial ports on Linux. The protocol is based on a request answer scheme so the throughput is limited by the time it takes to send a packet to a device and get an answer. The devices are mostly arm based and run Linux >= 3.0. I'm having troubles reducing the round trip time below 10ms (115200 baud, 8 data bit, no parity, 7 byte per message).
What IO interfaces will give me the lowest latency: select, poll, epoll or polling by hand with ioctl? Does blocking or non blocking IO impact latency?
I tried setting the low_latency flag with setserial. But it seemed like it had no effect.
Are there any other things I can try to reduce latency? Since I control all devices it would even be possible to patch the kernel, but its preferred not to.
---- Edit ----
The serial controller uses is an 16550A.
Request / answer schemes tends to be inefficient, and it shows up quickly on serial port. If you are interested in throughtput, look at windowed protocol, like kermit file sending protocol.
Now if you want to stick with your protocol and reduce latency, select, poll, read will all give you roughly the same latency, because as Andy Ross indicated, the real latency is in the hardware FIFO handling.
If you are lucky, you can tweak the driver behaviour without patching, but you still need to look at the driver code. However, having the ARM handle a 10 kHz interrupt rate will certainly not be good for the overall system performance...
Another options is to pad your packet so that you hit the FIFO threshold every time. It will also confirm that if it is or not a FIFO threshold problem.
10 msec # 115200 is enough to transmit 100 bytes (assuming 8N1), so what you are seeing is probably because the low_latency flag is not set. Try
setserial /dev/<tty_name> low_latency
It will set the low_latency flag, which is used by the kernel when moving data up in the tty layer:
void tty_flip_buffer_push(struct tty_struct *tty)
{
unsigned long flags;
spin_lock_irqsave(&tty->buf.lock, flags);
if (tty->buf.tail != NULL)
tty->buf.tail->commit = tty->buf.tail->used;
spin_unlock_irqrestore(&tty->buf.lock, flags);
if (tty->low_latency)
flush_to_ldisc(&tty->buf.work);
else
schedule_work(&tty->buf.work);
}
The schedule_work call might be responsible for the 10 msec latency you observe.
Having talked to to some more engineers about the topic I came to the conclusion that this problem is not solvable in user space. Since we need to cross the bridge into kernel land, we plan to implement an kernel module which talks our protocol and gives us latencies < 1ms.
--- edit ---
Turns out I was completely wrong. All that was necessary was to increase the kernel tick rate. The default 100 ticks added the 10ms delay. 1000Hz and a negative nice value for the serial process gives me the time behavior I wanted to reach.
Serial ports on linux are "wrapped" into unix-style terminal constructs, which hits you with 1 tick lag, i.e. 10ms. Try if stty -F /dev/ttySx raw low_latency helps, no guarantees though.
On a PC, you can go hardcore and talk to standard serial ports directly, issue setserial /dev/ttySx uart none to unbind linux driver from serial port hw and control the port via inb/outb to port registers. I've tried that, it works great.
The downside is you don't get interrupts when data arrives and you have to poll the register. often.
You should be able to do same on the arm device side, may be much harder on exotic serial port hw.
Here's what setserial does to set low latency on a file descriptor of a port:
ioctl(fd, TIOCGSERIAL, &serial);
serial.flags |= ASYNC_LOW_LATENCY;
ioctl(fd, TIOCSSERIAL, &serial);
In short: Use a USB adapter and ASYNC_LOW_LATENCY.
I've used a FT232RL based USB adapter on Modbus at 115.2 kbs.
I get about 5 transactions (to 4 devices) in about 20 mS total with ASYNC_LOW_LATENCY. This includes two transactions to a slow-poke device (4 mS response time).
Without ASYNC_LOW_LATENCY the total time is about 60 mS.
With FTDI USB adapters ASYNC_LOW_LATENCY sets the inter-character timer on the chip itself to 1 mS (instead of the default 16 mS).
I'm currently using a home-brewed USB adapter and I can set the latency for the adapter itself to whatever value I want. Setting it at 200 µS shaves another mS off that 20 mS.
None of those system calls have an effect on latency. If you want to read and write one byte as fast as possible from userspace, you really aren't going to do better than a simple read()/write() pair. Try replacing the serial stream with a socket from another userspace process and see if the latencies improve. If they don't, then your problems are CPU speed and hardware limitations.
Are you sure your hardware can do this at all? It's not uncommon to find UARTs with a buffer design that introduces many bytes worth of latency.
At those line speeds you should not be seeing latencies that large, regardless of how you check for readiness.
You need to make sure the serial port is in raw mode (so you do "noncanonical reads") and that VMIN and VTIME are set correctly. You want to make sure that VTIME is zero so that an inter-character timer never kicks in. I would probably start with setting VMIN to 1 and tune from there.
The syscall overhead is nothing compared to the time on the wire, so select() vs. poll(), etc. is unlikely to make a difference.

UDP packets not sent on time

I am working on a C++ application that can be qualified as a router. This application receives UDP packets on a given port (nearly 37 bytes each second) and must multicast them to another destinations within a 10 ms period. However, sometimes after packet reception, the retransmission exceeds the 10 ms limit and can reach the 100 ms. these off-limits delays are random.
The application receives on the same Ethernet interface but on a different port other kind of packets (up to 200 packets of nearly 100 bytes each second). I am not sure that this later flow is disrupting the other one because these delay peaks are too scarce (2 packets among 10000 packets)
What can be the causes of these sporadic delays? And how to solve them?
P.S. My application is running on a Linux 2.6.18-238.el5PAE. Delays are measured between the reception of the packet and after the success of the transmission!
An image to be more clear :
10ms is a tough deadline for a non-realtime OS.
Assign your process to one of the realtime scheduling policies, e.g. SCHED_RR or SCHED_FIFO (some reading). It can be done in the code via sched_setscheduler() or from command line via chrt. Adjust the priority as well, while you're at it.
Make sure your code doesn't consume CPU more than it has to, or it will affect entire system performance.
You may also need RT_PREEMPT patch.
Overall, the task of generating Ethernet traffic to schedule on Linux is not an easy one. E.g. see BRUTE, a high-performance traffic generator; maybe you'll find something useful in its code or in the research paper.

Measure data transfer rate (bandwidth) between 2 apps across network using C++, how to get unbiased and accurate result?

I am trying to measure IO data transfer rate (bandwidth) between 2 simulation applications (written in C++). I created a very simple perfclient and perfserver program just to verify that my approach in calculating the network bandwidth is correct before implementing this calculation approach in the real applications. So in this case, I need to do it programatically (NOT using Iperf).
I tried to run my perfclient and perfserver program on various domain (localhost, computer connected to ethernet,and computer connected to wireless connection). However I always get about the similar bandwidth on each of these different hosts, around 1900 Mbps (tested using data size of 1472 bytes). Is this a reasonable result, or can I get a better and more accurate bandwidth?
Should I use 1472 (which is the ethernet MTU, not including header) as the maximum data size for each send() and recv(), and why/why not? I also tried using different data size, and here are the average bandwidth that I get (tested using ethernet connection), which did not make sense to me because the number exceeded 1Gbps and reached something like 28 Gbps.
SIZE BANDWIDTH
1KB 1396 Mbps
2KB 2689 Mbps
4KB 5044 Mbps
8KB 9146 Mbps
16KB 16815 Mbps
32KB 22486 Mbps
64KB 28560 Mbps
HERE is my current approach:
I did a basic ping-pong fashion loop, where the client continuously send bytes of data stream to the server program. The server will read those data, and reflect (send) the data back to the client program. The client will then read those reflected data (2 way transmission). The above operation is repeated 1000 times, and I then divided the time by 1000 to get the average latency time. Next, I divided the average latency time by 2, to get the 1 way transmission time. Bandwidth can then be calculated as follow:
bandwidth = total bytes sent / average 1-way transmission time
Is there anything wrong with my approach? How can I make sure that my result is not biased? Once I get this right, I will need to test this approach in my original application (not this simple testing application), and I want to put this performance testing result in a scientific paper.
EDIT:
I have solved this problem. Check out the answer that I posted below.
Unless you have a need to reinvent the wheel iperf was made to handle just this problem.
Iperf was developed by NLANR/DAST as a modern alternative for measuring maximum TCP and UDP bandwidth performance. Iperf allows the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, datagram loss.
I was finally able to figure and solve this out :-)
As I mentioned in the question, regardless of the network architecture that I used (localhost, 1Gbps ethernet card, Wireless connection, etc), my achieved bandwidth scaled up for up to 28Gbps. I have tried to bind the server IP address to several different IP addresses, as follow:
127.0.0.1
IP address given by my LAN connection
IP address given by my wireless connection
So I thought that this should give me correct result, in fact it didn't.
This was mainly because I was running both of the client and server program on the same computers (different terminal window, even though the client and server are both bound to different IP addresses). My guess is that this is caused by the internal loopback. This is the main reason why the result is so biased and not accurate.
Anyway, so I then tried to run the client on one workstation, and the server on another workstation, and I tested them using the different network connection, and it worked as expected :-)
On 1Gbps connection, I got about 9800 Mbps (0.96 Gbps), and on 10Gbps connection, I got about 10100 Mbps (9.86 Gbps). So this work exactly as I expected. So my approach is correct. Perfect !!