Measuring Round Trip Time using DPDK - dpdk

My system is CentOS 8 with kernel 4.18.0-240.22.1.el8_3.x86_64 and I am using DPDK 20.11.1.
I want to calculate the round trip time in an optimized manner such that the packet sent from Machine A to Machine B is looped back from Machine B to A and the time is measured. While this is being done, Machine B has a DPDK forwarding application running (like testpmd or l2fwd/l3fwd).
One approach could be the DPDK pktgen application (https://pktgen-dpdk.readthedocs.io/en/latest/), but I could not find it calculating the Round Trip Time in such a way. Ping is another option, but when Machine B receives a ping packet from Machine A, it has to process the packet and then respond back to Machine A, which adds some cycles (undesired in my case).
Open to suggestions and approaches for calculating this time. Also, a benchmark comparing the RTT (Round Trip Time) of a DPDK-based application versus a non-DPDK setup would give a useful comparison.
Edit: There is a way to enable latency measurement in DPDK pktgen. Can anyone share some information on how this latency is calculated and what it signifies? (I could not find solid information regarding the latency in the documentation.)

It really depends on the kind of round trip you want to measure. Consider the following timestamps:
host_A                                                             host_B
 -> t1 -> send() -> NIC_A -> t2  --link-->  t3 -> NIC_B -> recv() -> t4
 <- t1' <- recv() <- NIC_A <- t2' <--link-- t3' <- NIC_B <- send() <- t4'
Do you want to measure t1' - t1? Then it's just a matter of writing a small DPDK program that stores the TSC value right before/after each transmit/receive function call on host A. (On host B runs a forwarding application.) See also rte_rdtsc_precise() and rte_get_tsc_hz() for converting the TSC deltas to nanoseconds.
For non-DPDK programs you can read out the TSC values/frequency by other means. Depending on your resolution needs you could also just call clock_gettime(CLOCK_REALTIME) which has an overhead of 18 ns or so.
This works for single packet transmits via rte_eth_tx_burst() and single packet receives - which aren't necessarily realistic for your target application. For larger bursts you would have to get a timestamp before the first transmit and after the last transmit and then compute the average delta.
Timestamps t2, t3, t2', t3' are hardware transmit/receive timestamps provided by (more serious) NICs.
If you want to compute the roundtrip t2' - t2 then you first need to discipline the NIC's clock (e.g. with phc2sys), enable timestamping and get those timestamps. However, AFAICS DPDK doesn't support obtaining the TX timestamps, in general.
Thus, when using SFP transceivers, an alternative is to install passive optical TAPs on the RX/TX end of NIC_A and connect the monitor ports to a packet capture NIC that supports receive hardware timestamping. With such a setup, computing the t2' - t2 roundtrip is just a matter of writing a script that reads the timestamps of the matching packets from your pcap and computes the deltas between them.

The ideal way to measure latency for sending and receiving packets through an interface is to set up an external loopback device on the Machine A NIC port. This ensures the packet sent is received back on the same NIC without any processing.
The next best alternative is to enable internal loopback; this ensures the desired packet is converted to a PCIe payload and DMAed to the hardware packet buffer. Based on the PCIe config, the packet buffer is shared with the RX descriptors, leading to RX of the sent packet. But for this one needs a NIC that supports internal loopback and can suppress loopback error handlers.
Another way is to use a PCIe port-to-port cross connect. In DPDK, we can run RX_BURST for port-1 on core-A and RX_BURST for port-2 on core-B. This ensures an almost accurate Round Trip Time.
Note: Newer hardware supports a doorbell mechanism, so on both TX and RX we can have the HW send a callback to the driver/PMD, which can then be used to fetch HW-assisted PTP timestamps for nanosecond accuracy.
But in my recommendation, using an external machine (Machine-B) is not desirable because:
Depending upon the quality of the transfer medium, the latency varies.
Machine-B has to be configured to the ideal settings (for almost 0 latency).
Machine-A and Machine-B, even if their physical configurations are the same, need to be maintained and run at the same thermal settings to allow the right clocking.
Both Machine-A and Machine-B have to run with the same PTP grandmaster to synchronize the clocks.
If DPDK is used, either modify the PMD or use rte_eth_tx_buffer_flush to ensure the packet is sent out to the NIC.
With these changes, a dummy UDP packet can be created, where:
the first 8 bytes carry the actual TX time taken before tx_burst on Machine-A (T1);
the second 8 bytes are added by Machine-B when it actually receives the packet in SW via rx_burst (T2);
the third 8 bytes are added by Machine-B when tx_burst is completed (T3);
the fourth 8 bytes are filled in on Machine-A when the packet is actually received via rx_burst (T4).
With these, Round Trip Time = (T4 - T1) - (T3 - T2), where T4 and T1 give the receive and transmit times on Machine A, and T3 and T2 give the processing overhead on Machine B.
Note: depending upon the processor and generation, an invariant TSC is available. This ensures the tick rate of rte_get_tsc_cycles() does not vary with frequency and power states.
[Edit-1] as mentioned in comments
@AmmerUsman, I highly recommend editing your question to reflect the real intention, namely how the round trip time is to be measured, rather than TX-RX latency from the DUT. This is because you are referring to the DPDK latency stats/metrics, but those measure min/max/avg latency between Rx-Tx on the same DUT.
@AmmerUsman The latency library in DPDK provides stats representing the difference between the TX callback and the RX callback, not your described use case. As Keith's explanation pointed out, the packet sent out by the traffic generator should carry a timestamp in the payload; the receiver application should forward it back on the same port. The receiver app can then measure the difference between the receive timestamp and the timestamp embedded in the packet. For this, you need to send it back on the same port, which does not match your setup diagram.

Related

Proper DPDK device and port initialization for transmission

While writing a simple DPDK packet generator I noticed some additional initialization steps that are required for reliable and successful packet transmission:
calling rte_eth_link_get() or rte_eth_timesync_enable() after rte_eth_dev_start()
waiting 2 seconds before sending the first packet with rte_eth_tx_burst()
So these steps are necessary when I use the ixgbe DPDK vfio driver with an Intel X553 NIC.
When I'm using the AF_PACKET DPDK driver, it works without those extra steps.
Is this a bug or a feature?
Is there a better way than waiting 2 seconds before the first send?
Why is the wait required with the ixgbe driver? Is this a limitation of that NIC or the involved switch (Mikrotik CRS326 with Marvell chipset)?
Is there a more idiomatic function to call than rte_eth_link_get() in order to complete the device initialization for transmission?
Is there some way to keep the VFIO NIC initialized (while keeping its link up) to avoid re-initializing it over and over again during the usual edit/compile/test cycle? (i.e. to speed up that cycle ...)
Additional information: When I connect the NIC to a mirrored port (which is configured via Mikrotik's mirror-source/mirror-target ethernet switch settings) and the sleep(2) is removed then I see the first packet transmitted to the mirror target but not to the primary destination. Thus, it seems like the sleep is necessary to give the switch some time after the link is up (after the dpdk program start) to completely initialize its forwarding table or something like that?
Waiting just 1 second before the first transmission works less reliably, i.e. the packet reaches the receiver only every other time.
My device/port initialization procedure implements the following setup sequence:
rte_eth_dev_count_avail()
rte_eth_dev_is_valid_port()
rte_eth_dev_info_get()
rte_eth_dev_adjust_nb_rx_tx_desc()
rte_eth_dev_configure(port_id, 0 /* rxrings */, 1 /* txrings */, &port_conf)
rte_eth_tx_queue_setup()
rte_eth_dev_start()
rte_eth_macaddr_get()
rte_eth_link_get() // <-- REQUIRED!
rte_eth_dev_get_mtu()
Without rte_eth_link_get() (or rte_eth_timesync_enable()) the first transmitted packet doesn't even show up on the mirrored port.
The above functions (and rte_eth_tx_burst()) complete successfully with/without rte_eth_link_get()/sleep(2) being present. In particular, the MAC address and MTU read back have the expected values (MTU -> 1500), and rte_eth_tx_burst() returns 1 for a burst of one UDP packet.
The returned link status is: Link up at 1 Gbps FDX Autoneg
The fact that rte_eth_link_get() can be replaced with rte_eth_timesync_enable() probably can be explained by the latter calling ixgbe_start_timecounters() which calls rte_eth_linkstatus_get() which is also called by rte_eth_link_get().
I've checked the DPDK examples and most of them don't call rte_eth_link_get() before sending something. There is also no sleep after device initialization.
I'm using DPDK 20.11.2.
Even more information - to answer the comments:
I'm running this on Fedora 33 (5.13.12-100.fc33.x86_64).
Ethtool reports: firmware-version: 0x80000877
I had called rte_eth_timesync_enable() in order to work with the transmit timestamps. However, during debugging I removed it to arrive at a minimal reproducer. At that point I noticed that removing it made things actually worse (i.e. no packet transmitted over the mirror port). I thus investigated what part of that function might make the difference and found rte_eth_link_get(), which has similar side-effects.
When switching to AF_PACKET I'm using the stock ixgbe kernel driver, i.e. ixgbe with default settings on a device that is initialized by networkd (dhcp enabled).
My expectation was that when rte_eth_dev_start() terminates that the link is up and the device is ready for transmission.
However, it would be nice, I guess, if one could avoid resetting the device after program restarts. I don't know if DPDK supports this.
Regarding delays: I've just tested the following: rte_eth_link_get() can be omitted if I increase the sleep to 6 seconds. Whereas a call to rte_eth_link_get() takes 3.3 s. So yeah, it's probably just helping due to the additional delay.
The difference between the two attempted approaches
In order to use af_packet PMD, you first bind the device in question to the kernel driver. At this point, a kernel network interface is spawned for that device. This interface typically has the link active by default. If not, you typically run ip link set dev <interface> up. When you launch your DPDK application, af_packet driver does not (re-)configure the link. It just unconditionally reports the link to be up on device start (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L266) and vice versa when doing device stop. Link update operation is also no-op in this driver (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L416).
In fact, with af_packet approach, the link is already active at the time you launch the application. Hence no need to await the link.
With the VFIO approach, the device in question has its link down, and it is the responsibility of the corresponding PMD to activate it. Hence the need to test link status in the application.
Is it possible to avoid waiting on application restarts?
Long story short, yes. Awaiting link status is not the only problem with application restarts. You effectively re-initialise EAL as a whole when you restart, and that procedure is also eye-wateringly time consuming. In order to cope with that, you should probably check out multi-process support available in DPDK (see https://doc.dpdk.org/guides/prog_guide/multi_proc_support.html).
This requires that you re-implement your application to have its control logic in one process (also, the primary process) and Rx/Tx datapath logic in another one (the secondary process). This way, you can keep the first one running all the time and re-start the second one when you need to change Rx/Tx logic / re-compile. Restarting the secondary process will re-attach to the existing EAL instance all the time. Hence no PMD restart being involved, and no more need to wait.
Based on the interaction via comments, the real question is summarized as I'm just asking myself if it's possible to keep the link-up between invocations of a DPDK program (when using a vfio device) to avoid dealing with the relatively long wait times before the first transmit comes through. IOW, is it somehow possible to skip the device reset when starting the program for the 2nd time?
The short answer is No for the packet-generator program between restarts, because any physical NIC which uses PCIe config space for both PF (IXGBE for X553) and VF (IXGBE_VF for X553) bound with uio_pci_generic|igb_uio|vfio-pci requires a PCIe reset & configuration. But when using the AF_PACKET (ixgbe kernel driver) DPDK PMD, this is a virtual device that does not do any PCIe resets and directly sets dev->data->dev_link.link_status = ETH_LINK_UP; in the eth_dev_start function.
For the second part, "Is the delay for the first TX packets expected?"
[Answer] No, as a few factors contribute to delay in the first packet transmission
Switch software & firmware (PF only)
Switch port Auto-neg or fixed speed (PF only)
X553 software and firmware (PF and VF)
Autoneg enable or disabled (PF and VF)
link medium SFP (fiber) or DAC (Direct Attached Copper) or RJ-45 (cat5|cat6) connection (PF only)
PF driver version for NIC (X553 ixgbe) (PF and VF)
As per the Intel driver: software-generated layer-two frames, like IEEE 802.3x (link flow control), IEEE 802.1Qbb (priority-based flow control), and others of this type (VF only)
Note: Since the issue is mentioned for VF ports only (and not PF ports), my assumption is
TX packet uses the SRC MAC address of VF to avoid MAC Spoof check on ASIC
configure all SR-IOV enabled ports for VLAN tagging from the administrative interface on the PF to avoid flooding of traffic to VF
PF driver is updated to avoid old driver issues such as (VF reset causes PF link to reset). This can be identified by checking dmesg
Steps to isolate whether the problem is the NIC:
Check if (X553) PF DPDK has the same delay as VF DPDK (isolate whether it is a PF or VF problem).
Cross-connect 2 NICs (X553) on the same system. Then compare Linux vs DPDK link-up events (to check whether it is a NIC problem or a PCIe link issue).
Disable auto-negotiation for DPDK X553 and compare PF vs VF in DPDK.
Steps to isolate whether the problem is the switch:
Disable auto-negotiation on the switch, set full duplex at a fixed speed of 1 Gbps, and check the behaviour.
[EDIT-1] I do agree with the workaround solution suggested by @stackinside using the DPDK primary-secondary process concept: the primary is responsible for link and port bring-up, while the secondary is used for the actual RX and TX bursts.

Minimizing dropped UDP packets at high packet rates (Windows 10)

IMPORTANT NOTE: I'm aware that UDP is an unreliable protocol. But, as I'm not the manufacturer of the device that delivers the data, I can only try to minimize the impact. Hence, please don't post any more statements about UDP being unreliable. I need suggestions to reduce the loss to a minimum instead.
I've implemented an application in C++ which needs to receive a large amount of UDP packets in a short time and needs to work under Windows (Winsock). The program works, but seems to drop packets if the data rate (or packet rate) per UDP stream reaches a certain level... Note that I cannot change the camera interface to use TCP.
Details: It's a client for Gigabit-Ethernet cameras, which send their images to the computer using UDP packets. The data rate per camera is often close to the capacity of the network interface (~120 Megabytes per second), which means even with 8KB-Jumbo Frames the packet rate is at 10'000 to 15'000 per camera. Currently we have connected 4 cameras to one computer... and this means up to 60'000 packets per second.
The software handles all cameras at the same time, and the stream receiver for each camera is implemented as a separate thread and has its own receiving UDP socket.
At a certain frame rate the software seems to miss a few UDP frames (even though the network capacity is only ~60-70% used) every few minutes.
Hardware Details
Cameras are from foreign manufacturers! They send UDP streams to a configurable UDP endpoint via ethernet. No TCP-support...
Cameras are connected via their own dedicated network interface (1GBit/s)
Direct connection, no switch used (!)
Cables are CAT6e or CAT7
Implementation Details
So far I set the SO_RCVBUF to a large value:
int32_t rbufsize = 4100 * 3100 * 2; // two 12 MP images
if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, (char*)&rbufsize, sizeof(rbufsize)) == -1) {
perror("SO_RCVBUF");
throw runtime_error("Could not set socket option SO_RCVBUF.");
}
The error is not thrown. Hence, I assume the value was accepted.
I also set the priority of the main process to HIGH-PRIORITY_CLASS by using the following code:
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
However, I didn't find any possibility to change the thread priorities. The threads are created after the process priority is set...
The receiver threads use blocking IO to receive one packet at a time (with a 1000 ms timeout to allow the thread to react to a global shutdown signal). If a packet is received, it's stored in a buffer and the loop immediately continues to receive any further packets.
Questions
Is there any other way how I can reduce the probability of a packet loss? Any possibility to maybe receive all packets that are stored in the sockets buffer with one call? (I don't need any information about the sender side; just the contained payload)
Maybe, you can also suggest some registry/network card settings to check...
To increase the UDP Rx performance for GigE cameras on Windows, you may want to look into writing a custom filter driver (NDIS). This allows you to intercept the messages in the kernel, stop them from reaching userspace, pack them into some buffer, and then send them to userspace via a custom ioctl to your application. I have done this; it took about a week of work to get done. There is a sample available from Microsoft which I used as a base for it.
It is also possible to use an existing generic driver, such as pcap, which I also tried; that took about half a week. This is not as good because pcap cannot determine when the frames end, so packet grouping will be suboptimal.
I would suggest first digging deep in the network stack settings and making sure that the PC is not starved for resources. Look at guides for tuning e.g. Intel network cards for this type of load, that could potentially have a larger impact than a custom driver.
(I know this is an older thread and you have probably solved your problem. But things like this is good to document for future adventurers..)
Use IOCP and WSARecv in overlapped mode; you can set up around ~60k pending WSARecv calls.
On the thread that handles GetQueuedCompletionStatus, process the data and also post a WSARecv on that thread to compensate for the one consumed when receiving the data.
Please note that your UDP packet size should stay below the MTU; above it, drops will occur depending on all the network hardware between the camera and the software.
Write some UDP testers that mimic the camera to test the network, just to be sure that the hardware will support the load.
https://www.winsocketdotnetworkprogramming.com/winsock2programming/winsock2advancediomethod5e.html

Latency measurement over UDP on Linux

I want to measure UDP latency and drop rate between two machines on Linux. Preferably (but not crucial) to perform measurement between multiple machines at the same time.
As a result I want to get a histogram, e.g. RTT times of each individual packet at every moment during measurement. Expected frequency is about 10 packets per second.
Do you know of any tool that I can use for this purpose?
What I tried so far is:
ping - uses icmp instead of UDP
iperf - measures only jitter but not latency.
D-ITG - measures per flow statistics, no histograms
tshark - uses TCP for pings instead of UDP
I have also created a simple C++ socket program where I have Client and Server on each side, and I send UDP packets with counter and timestamp. My program seems to work ok, although since I am not a network programmer I am not 100% sure that I handled buffers correctly (specifically in the case of partial packets etc). So I would prefer to use some proven software for this task.
Can you recommend something?
Thanks
It depends. If all you want is a trace with timestamps, Wireshark is your friend: https://www.wireshark.org/
I would like to remind you that UDP is a message based protocol and packets have definite boundaries. There cannot be reception of partial packets. That is, you will either get the complete message or you will not get it. So, you need not worry about partial packets in UDP.
The method of calculating packet drop using a counter and calculating latency using the time delta appears fine for UDP. However, an important point to take into consideration is ensuring the synchronization of the system time of client and server.

UDP packets not sent on time

I am working on a C++ application that can be qualified as a router. This application receives UDP packets on a given port (nearly 37 bytes each second) and must multicast them to other destinations within a 10 ms period. However, sometimes after packet reception, the retransmission exceeds the 10 ms limit and can reach 100 ms. These off-limit delays are random.
The application receives, on the same Ethernet interface but on a different port, another kind of packet (up to 200 packets of nearly 100 bytes each second). I am not sure whether this latter flow is disrupting the other one, because these delay peaks are too scarce (2 packets among 10,000).
What can be the causes of these sporadic delays? And how to solve them?
P.S. My application is running on Linux 2.6.18-238.el5PAE. Delays are measured between the reception of the packet and the successful completion of the transmission!
10ms is a tough deadline for a non-realtime OS.
Assign your process to one of the realtime scheduling policies, e.g. SCHED_RR or SCHED_FIFO (some reading). It can be done in the code via sched_setscheduler() or from command line via chrt. Adjust the priority as well, while you're at it.
Make sure your code doesn't consume CPU more than it has to, or it will affect entire system performance.
You may also need RT_PREEMPT patch.
Overall, generating Ethernet traffic on a schedule on Linux is not an easy task. E.g. see BRUTE, a high-performance traffic generator; maybe you'll find something useful in its code or in the research paper.

Measure data transfer rate (bandwidth) between 2 apps across network using C++, how to get unbiased and accurate result?

I am trying to measure the IO data transfer rate (bandwidth) between 2 simulation applications (written in C++). I created a very simple perfclient and perfserver program just to verify that my approach to calculating the network bandwidth is correct before implementing it in the real applications. So in this case, I need to do it programmatically (NOT using iperf).
I tried to run my perfclient and perfserver programs on various domains (localhost, a computer connected via Ethernet, and a computer connected via wireless). However, I always get roughly the same bandwidth on each of these different hosts, around 1900 Mbps (tested using a data size of 1472 bytes). Is this a reasonable result, or can I get a better and more accurate bandwidth?
Should I use 1472 (which is the ethernet MTU, not including header) as the maximum data size for each send() and recv(), and why/why not? I also tried using different data size, and here are the average bandwidth that I get (tested using ethernet connection), which did not make sense to me because the number exceeded 1Gbps and reached something like 28 Gbps.
SIZE BANDWIDTH
1KB 1396 Mbps
2KB 2689 Mbps
4KB 5044 Mbps
8KB 9146 Mbps
16KB 16815 Mbps
32KB 22486 Mbps
64KB 28560 Mbps
HERE is my current approach:
I did a basic ping-pong fashion loop, where the client continuously sends a stream of bytes to the server program. The server reads the data and reflects (sends) it back to the client program. The client then reads the reflected data (2-way transmission). The above operation is repeated 1000 times, and I then divided the total time by 1000 to get the average latency. Next, I divided the average latency by 2 to get the 1-way transmission time. Bandwidth can then be calculated as follows:
bandwidth = total bytes sent / average 1-way transmission time
Is there anything wrong with my approach? How can I make sure that my result is not biased? Once I get this right, I will need to test this approach in my original application (not this simple testing application), and I want to put this performance testing result in a scientific paper.
EDIT:
I have solved this problem. Check out the answer that I posted below.
Unless you have a need to reinvent the wheel iperf was made to handle just this problem.
Iperf was developed by NLANR/DAST as a modern alternative for measuring maximum TCP and UDP bandwidth performance. Iperf allows the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, datagram loss.
I was finally able to figure and solve this out :-)
As I mentioned in the question, regardless of the network architecture that I used (localhost, 1 Gbps Ethernet card, wireless connection, etc.), my achieved bandwidth scaled up to 28 Gbps. I tried to bind the server to several different IP addresses, as follows:
127.0.0.1
IP address given by my LAN connection
IP address given by my wireless connection
So I thought that this would give me correct results; in fact it didn't.
This was mainly because I was running both the client and server programs on the same computer (in different terminal windows, even though the client and server were bound to different IP addresses). My guess is that this is caused by internal loopback. This is the main reason why the result was so biased and inaccurate.
Anyway, I then ran the client on one workstation and the server on another, tested them using the different network connections, and it worked as expected :-)
On the 1 Gbps connection, I got about 980 Mbps (0.96 Gbps), and on the 10 Gbps connection, I got about 9860 Mbps (9.86 Gbps). So this works exactly as I expected, and my approach is correct. Perfect!