Timestamp source for dpdk-pdump packets on E810 and DPDK 19.11 - dpdk

I am using an Intel E810-XXVDA4 and DPDK version 19.11. During troubleshooting we captured outgoing traffic with dpdk-pdump:
dpdk-pdump -- --pdump 'port=1,queue=*,tx-dev=/tmp/tx.pcap'
I have opened the pcap file in Wireshark and I can see that some packets are delayed. However, system behavior indicates that the delays shown in the pcap file are not correct; they are too large.
My question is: how is the timestamp on packets in the pcap from dpdk-pdump created? Is it generated by the NIC hardware, taken from CPU time at disk write, or something else?
Are there any run-time or build-time options to change the source of the timestamps?

Answer from dpdk users mailing list, many thanks to S.H
Follow-up: the older dpdk-pdump in 19.11 takes the timestamp when the packet is read from the ring, which is bad.
You might have better luck with DPDK 20.11 and dpdk-dumpcap, which takes the timestamp when the packet is put into the ring.
Let me give a picture:
  Application (primary)
    |  1                    2
    +--+----------------------> wire
       |
       |      dumpcap (secondary)
       +---========-----------> capture file
                   3
The ==== is the ring buffer between the processes.
1. Where 20.11 with dpdk-dumpcap gets the timestamp
2. Where you want the timestamp, but that is not possible
3. Where 19.11 and dpdk-pdump get the timestamp

Related

Proper DPDK device and port initialization for transmission

While writing a simple DPDK packet generator I noticed some additional initialization steps that are required for reliable and successful packet transmission:
calling rte_eth_link_get() or rte_eth_timesync_enable() after rte_eth_dev_start()
waiting 2 seconds before sending the first packet with rte_eth_tx_burst()
So these steps are necessary when I use the ixgbe DPDK vfio driver with an Intel X553 NIC.
When I'm using the AF_PACKET DPDK driver, it works without those extra steps.
Is this a bug or a feature?
Is there a better way than waiting 2 seconds before the first send?
Why is the wait required with the ixgbe driver? Is this a limitation of that NIC or the involved switch (Mikrotik CRS326 with Marvell chipset)?
Is there a more idiomatic function to call than rte_eth_link_get() in order to complete the device initialization for transmission?
Is there some way to keep the VFIO NIC initialized (while keeping its link up) to avoid re-initializing it over and over again during the usual edit/compile/test cycle? (i.e. to speed up that cycle ...)
Additional information: When I connect the NIC to a mirrored port (which is configured via Mikrotik's mirror-source/mirror-target ethernet switch settings) and the sleep(2) is removed then I see the first packet transmitted to the mirror target but not to the primary destination. Thus, it seems like the sleep is necessary to give the switch some time after the link is up (after the dpdk program start) to completely initialize its forwarding table or something like that?
Waiting just 1 second before the first transmission works less reliably, i.e. the packet reaches the receiver only every odd time.
My device/port initialization procedure implements the following setup sequence:
rte_eth_dev_count_avail()
rte_eth_dev_is_valid_port()
rte_eth_dev_info_get()
rte_eth_dev_adjust_nb_rx_tx_desc()
rte_eth_dev_configure(port_id, 0 /* rxrings */, 1 /* txrings */, &port_conf)
rte_eth_tx_queue_setup()
rte_eth_dev_start()
rte_eth_macaddr_get()
rte_eth_link_get() // <-- REQUIRED!
rte_eth_dev_get_mtu()
Without rte_eth_link_get() (or rte_eth_timesync_enable()) the first transmitted packet doesn't even show up on the mirrored port.
The above functions (and rte_eth_tx_burst()) complete successfully with/without rte_eth_link_get()/sleep(2) being present. Especially, the read MAC address, MTU have the expected values (MTU -> 1500) and rte_eth_tx_burst() returns 1 for a burst of one UDP packet.
The returned link status is: Link up at 1 Gbps FDX Autoneg
The fact that rte_eth_link_get() can be replaced with rte_eth_timesync_enable() probably can be explained by the latter calling ixgbe_start_timecounters() which calls rte_eth_linkstatus_get() which is also called by rte_eth_link_get().
I've checked the DPDK examples and most of them don't call rte_eth_link_get() before sending something. There is also no sleep after device initialization.
I'm using DPDK 20.11.2.
Even more information - to answer the comments:
I'm running this on Fedora 33 (5.13.12-100.fc33.x86_64).
Ethtool reports: firmware-version: 0x80000877
I had called rte_eth_timesync_enable() in order to work with the transmit timestamps. However, during debugging I removed it to arrive at a minimal reproducer. At that point I noticed that removing it actually made things worse (i.e. no packet was transmitted over the mirror port). I thus investigated which part of that function might make the difference and found rte_eth_link_get(), which has similar side effects.
When switching to AF_PACKET I'm using the stock ixgbe kernel driver, i.e. ixgbe with default settings on a device that is initialized by networkd (dhcp enabled).
My expectation was that when rte_eth_dev_start() returns, the link is up and the device is ready for transmission.
However, it would be nice, I guess, if one could avoid resetting the device after program restarts. I don't know if DPDK supports this.
Regarding delays: I've just tested the following: rte_eth_link_get() can be omitted if I increase the sleep to 6 seconds. Whereas a call to rte_eth_link_get() takes 3.3 s. So yeah, it's probably just helping due to the additional delay.
The difference between the two attempted approaches
In order to use af_packet PMD, you first bind the device in question to the kernel driver. At this point, a kernel network interface is spawned for that device. This interface typically has the link active by default. If not, you typically run ip link set dev <interface> up. When you launch your DPDK application, af_packet driver does not (re-)configure the link. It just unconditionally reports the link to be up on device start (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L266) and vice versa when doing device stop. Link update operation is also no-op in this driver (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L416).
In fact, with af_packet approach, the link is already active at the time you launch the application. Hence no need to await the link.
With the VFIO approach, the device in question has its link down, and it is the responsibility of the corresponding PMD to activate it. Hence the need to test link status in the application.
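As a rough illustration of testing link status instead of sleeping for a fixed time, the sketch below polls a callback until it reports the link as up or a timeout expires. The names wait_for_link, link_up_fn and fake_link_up are hypothetical; in a DPDK program the callback would presumably wrap rte_eth_link_get_nowait() and check link_status, but the sketch is kept free of DPDK so it builds on its own. Note that the experiments in the question suggest the switch may need extra time even after the link is up, so polling alone may not remove every delay.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns true once the link is reported up.
 * In a DPDK application it could wrap rte_eth_link_get_nowait(). */
typedef bool (*link_up_fn)(void *ctx);

/* Poll is_up() every interval_ms until it returns true or roughly
 * timeout_ms have been spent sleeping.
 * Returns the number of polls performed on success, or -1 on timeout. */
static int wait_for_link(link_up_fn is_up, void *ctx,
                         unsigned timeout_ms, unsigned interval_ms)
{
    assert(interval_ms > 0); /* a zero interval could never time out */

    struct timespec pause = {
        .tv_sec = interval_ms / 1000,
        .tv_nsec = (long)(interval_ms % 1000) * 1000000L,
    };
    unsigned waited_ms = 0;
    int polls = 0;

    for (;;) {
        polls++;
        if (is_up(ctx))
            return polls;
        if (waited_ms >= timeout_ms)
            return -1;
        nanosleep(&pause, NULL);
        waited_ms += interval_ms;
    }
}

/* Stand-in link source for demonstration: reports "up" once the
 * counter pointed to by ctx has been decremented to zero. */
static bool fake_link_up(void *ctx)
{
    int *calls_left = ctx;
    return --(*calls_left) <= 0;
}
```

A caller would pass its own callback and, for example, a 10 second timeout with a 100 ms interval; both values are illustrative rather than recommendations.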
Is it possible to avoid waiting on application restarts?
Long story short, yes. Awaiting link status is not the only problem with application restarts. You effectively re-initialise EAL as a whole when you restart, and that procedure is also eye-wateringly time consuming. In order to cope with that, you should probably check out multi-process support available in DPDK (see https://doc.dpdk.org/guides/prog_guide/multi_proc_support.html).
This requires that you re-implement your application to have its control logic in one process (also, the primary process) and Rx/Tx datapath logic in another one (the secondary process). This way, you can keep the first one running all the time and re-start the second one when you need to change Rx/Tx logic / re-compile. Restarting the secondary process will re-attach to the existing EAL instance all the time. Hence no PMD restart being involved, and no more need to wait.
Based on the interaction via comments, the real question can be summarized as: "I'm just asking myself if it's possible to keep the link up between invocations of a DPDK program (when using a vfio device) to avoid dealing with the relatively long wait times before the first transmit comes through. IOW, is it somehow possible to skip the device reset when starting the program for the 2nd time?"
The short answer is no for the packet-generator program between restarts, because any physical NIC which uses PCIe config space for both PF (IXGBE for X553) and VF (IXGBE_VF for X553), bound with uio_pci_generic|igb_uio|vfio-pci, requires a PCIe reset and configuration. When using the AF_PACKET DPDK PMD (ixgbe kernel driver), however, this is a virtual device that does not do any PCIe resets and directly sets dev->data->dev_link.link_status = ETH_LINK_UP; in the eth_dev_start function.
For the second part, "Is the delay for the first TX packets expected?"
[Answer] No, as a few factors contribute to the delay in the first packet transmission:
Switch software & firmware (PF only)
Switch port Auto-neg or fixed speed (PF only)
X553 software and firmware (PF and VF)
Autoneg enable or disabled (PF and VF)
link medium SFP (fiber) or DAC (Direct Attached Copper) or RJ-45 (cat5|cat6) connection (PF only)
PF driver version for NIC (X553 ixgbe) (PF and VF)
As per the Intel driver: software-generated layer-two frames, like IEEE 802.3x (link flow control), IEEE 802.1Qbb (priority-based flow control), and others of this type (VF only)
Note: Since the issue is mentioned for VF ports only (and not PF ports), my assumption is
TX packet uses the SRC MAC address of VF to avoid MAC Spoof check on ASIC
configure all SR-IOV enabled ports for VLAN tagging from the administrative interface on the PF to avoid flooding of traffic to VF
PF driver is updated to avoid old driver issues such as a VF reset causing the PF link to reset. This can be identified by checking dmesg.
Steps to isolate whether the problem is the NIC:
Check if the (X553) PF in DPDK has the same delay as the VF in DPDK (to isolate whether it is a PF or VF problem).
Cross-connect 2 NICs (X553) on the same system. Then compare Linux vs DPDK link-up events (to check whether it is a NIC problem or a PCIe link issue).
Disable auto-negotiation for the DPDK X553 and compare PF vs VF in DPDK.
Steps to isolate whether the problem is the switch:
Disable auto-negotiation on the switch, set full duplex at a fixed speed of 1 Gbps, and check the behaviour.
[EDIT-1] I do agree with the workaround suggested by #stackinside using the DPDK primary-secondary process concept: the primary is responsible for link and port bring-up, while the secondary is used for the actual RX and TX bursts.

Measuring Round Trip Time using DPDK

My system is CentOS 8 with kernel 4.18.0-240.22.1.el8_3.x86_64 and I am using DPDK 20.11.1.
I want to calculate the round trip time in an optimized manner, such that a packet sent from Machine A to Machine B is looped back from Machine B to A and the time is measured. While this is being done, Machine B has a DPDK forwarding application running (like testpmd or l2fwd/l3fwd).
One approach would be to use the DPDK pktgen application (https://pktgen-dpdk.readthedocs.io/en/latest/), but I could not find a way for it to calculate the round trip time like this. Ping is another option, but when Machine B receives a ping packet from Machine A, it has to process the packet and then respond to Machine A, which adds some cycles (which is undesired in my case).
I am open to suggestions and approaches to calculate this time. A benchmark comparing the RTT (round trip time) of a DPDK-based application with a non-DPDK setup would also give a better comparison.
Edit: There is a way to enable latency in DPDK pktgen. Can anyone share some information on how this latency is calculated and what it signifies? (I could not find solid information regarding the latency page in the documentation.)
It really depends on the kind of round trip you want to measure. Consider the following timestamps:
        -> t1  -> send() -> NIC_A -> t2  --link--> t3  -> NIC_B -> recv() -> t4
host_A                                                                           host_B
        <- t1' <- recv() <- NIC_A <- t2' <--link-- t3' <- NIC_B <- send() <- t4'
Do you want to measure t1' - t1? Then it's just a matter of writing a small DPDK program that stores the TSC value right before/after each transmit/receive function call on host A. (Host B runs a forwarding application.) See also rte_rdtsc_precise() and rte_get_tsc_hz() for converting the TSC deltas to nanoseconds.
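The conversion from a TSC delta to nanoseconds can be sketched as below. The function name tsc_delta_to_ns is hypothetical; in a DPDK program the two counter values would come from rte_rdtsc_precise() and the frequency from rte_get_tsc_hz(), while here they are plain parameters so the example builds without DPDK.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a TSC delta to nanoseconds, given the TSC frequency in Hz.
 * The delta is split into whole seconds and a remainder so that the
 * multiplication by 1e9 does not overflow 64 bits for long intervals.
 * Assumes tsc_hz is well below ~18 GHz, which keeps rem * 1e9 in range. */
static uint64_t tsc_delta_to_ns(uint64_t start, uint64_t end, uint64_t tsc_hz)
{
    assert(tsc_hz > 0);

    uint64_t cycles = end - start; /* unsigned subtraction tolerates wrap */
    uint64_t secs = cycles / tsc_hz;
    uint64_t rem = cycles % tsc_hz;

    return secs * 1000000000ULL + (rem * 1000000000ULL) / tsc_hz;
}
```

For example, 2500 cycles at an assumed 2.5 GHz TSC correspond to 1000 ns.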
For non-DPDK programs you can read out the TSC values/frequency by other means. Depending on your resolution needs you could also just call clock_gettime(CLOCK_REALTIME) which has an overhead of 18 ns or so.
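For the non-DPDK case, a minimal sketch of taking two clock_gettime() readings and computing their difference follows. The helper names are hypothetical. It uses CLOCK_MONOTONIC rather than the CLOCK_REALTIME mentioned above; that is an assumption made here because a monotonic clock is not affected by wall-clock adjustments during the measurement.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Difference b - a in nanoseconds. */
static int64_t timespec_diff_ns(const struct timespec *a,
                                const struct timespec *b)
{
    return (int64_t)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (int64_t)(b->tv_nsec - a->tv_nsec);
}

/* Time an arbitrary operation (e.g. a send followed by a receive)
 * passed in as a callback; returns the elapsed nanoseconds. */
static int64_t time_operation_ns(void (*op)(void *), void *arg)
{
    struct timespec before, after;

    clock_gettime(CLOCK_MONOTONIC, &before);
    op(arg);
    clock_gettime(CLOCK_MONOTONIC, &after);

    return timespec_diff_ns(&before, &after);
}

/* Placeholder operation used only for demonstration. */
static void noop(void *arg)
{
    (void)arg;
}
```

The measured value includes the overhead of the two clock reads, so very short operations are better timed in batches.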
This works for single-packet transmits via rte_eth_tx_burst() and single-packet receives, which aren't necessarily realistic for your target application. For larger bursts you would have to take a timestamp before the first transmit and after the last transmit and then compute the average delta.
Timestamps t2, t3, t2', t3' are hardware transmit/receive timestamps provided by (more serious) NICs.
If you want to compute the round trip t2' - t2 then you first need to discipline the NIC's clock (e.g. with phc2sys), enable timestamping and get those timestamps. However, AFAICS DPDK doesn't support obtaining the TX timestamps, in general.
Thus, when using SFP transceivers, an alternative is to install passive optical TAPs on the RX/TX end of NIC_A and connect the monitor ports to a packet capture NIC that supports receive hardware timestamping. With such a setup, computing the t2' - t2 round trip is just a matter of writing a script that reads the timestamps of the matching packets from your pcap and computes the deltas between them.
The ideal way to measure latency for sending and receiving packets through an interface is to set up an external loopback device on the Machine A NIC port. This ensures that the packet sent is received back on the same NIC without any processing.
The next best alternative is to enable internal loopback. This ensures that the desired packet is converted to a PCIe payload and DMAed to the hardware packet buffer. Based on the PCIe config, the packet buffer is then shared with the RX descriptors, leading to RX of the sent packet. But for this one needs a NIC that
supports internal loopback
and can suppress loopback error handlers.
Another way is to use a PCIe port-to-port cross connect. In DPDK, we can run RX_BURST for port-1 on core-A and RX_BURST for port-2 on core-B. This gives an almost accurate round trip time.
Note: Newer hardware supports a doorbell mechanism, so on both TX and RX we can enable the HW to send a callback to the driver/PMD, which can then be used to fetch HW-assisted PTP timestamps for nanosecond accuracy.
In my view, using an external machine (Machine-B) is not desirable, because:
Depending upon the quality of the transfer Medium, the latency varies
Machine-B has to be configured to the ideal settings (for almost 0 latency)
Machine-A and Machine-B even if physical configurations are the same, need to be maintained and run at the same thermal settings to allow the right clocking.
Both Machine-A and Machine-B has to run with same PTP grand master to synchronize the clocks.
If DPDK is used, either modify the PMD or use rte_eth_tx_buffer_flush to ensure the packet is sent out to the NIC
With these changes, a dummy UDP packet can be created, where
first 8 bytes should carry the actual TX time before tx_burst from Machine-A (T1).
second 8 bytes are added by Machine-B when it actually receives the packet in SW via rx_burst (T2).
third 8 bytes is added by Machine-B when tx_burst is completed (T3).
fourth 8 bytes are found in Machine-A when packet is actually received via rx-burst (T4)
with these Round trip Time = (T4 - T1) - (T3 - T2), where T4 and T1 gives receive and transmit time from Machine A and T3 and T2 gives the processing overhead.
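The formula above can be written out as a small helper. The struct layout and the names rtt_stamps and round_trip are hypothetical; the only assumption is that all four values use the same unit (for example nanoseconds). Each subtraction pairs two timestamps taken on the same machine, T4 and T1 on Machine-A and T3 and T2 on Machine-B.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical payload layout: four 8-byte timestamps, all in the
 * same unit, stored at the start of the dummy UDP payload. */
struct rtt_stamps {
    uint64_t t1; /* Machine-A, just before tx_burst */
    uint64_t t2; /* Machine-B, after rx_burst */
    uint64_t t3; /* Machine-B, after tx_burst */
    uint64_t t4; /* Machine-A, after rx_burst */
};

/* Round trip time = (T4 - T1) - (T3 - T2): the total time seen on
 * Machine-A minus the processing time spent on Machine-B. */
static uint64_t round_trip(const struct rtt_stamps *s)
{
    assert(s->t4 >= s->t1 && s->t3 >= s->t2);
    return (s->t4 - s->t1) - (s->t3 - s->t2);
}
```

With T1 = 100, T2 = 5000, T3 = 5200 and T4 = 900, Machine-A sees 800 units in total, Machine-B spends 200 of them, and the remaining 600 is the round trip.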
Note: depending upon the processor and generation, an invariant TSC is available. This ensures that the ticks returned by rte_get_tsc_cycles do not vary with frequency and power states.
[Edit-1] as mentioned in comments
#AmmerUsman, I highly recommend editing your question to reflect the real intention, namely how to measure the round trip time, rather than the TX-RX latency of the DUT. You are referring to the DPDK latency stats/metrics, but those measure the min/max/avg latency between RX and TX on the same DUT.
#AmmerUsman, the latency library in DPDK provides stats representing the difference between the TX callback and the RX callback, and is not meant for the use case you describe. As per Keith's explanation, packets sent out by the traffic generator should carry a timestamp in the payload, and the receiver application should forward them to the same port. The receiver app can then measure the difference between the received timestamp and the timestamp embedded in the packet. For this, you need to send the packet back on the same port, which does not match your setup diagram.

dpdk-19.11,ixgbe PMD got imissed when up to 8Mpps(decreased 3Mpps),any one have solution or know the reason

Using DPDK 17.02 with my custom application, I see imissed at 11 Mpps. Using DPDK 19.11, this has dropped to 8 Mpps. Are there compile flags or code changes for the ixgbe PMD which have reduced it?
New updates:
The application architecture is rx cores (3) --> worker cores (16) --> tx cores (2).
When I increased the tx cores to 3, it could reach 10 Mpps, but increasing to 4 had no further effect.
imissed stands for packets missed in HW because the RX thread is not polling often enough. The lower the value, the better. Hence with DPDK 19.11 the impact is reduced.
Reasons for imissed to be lower in 19.11 can be:
better compiler flags
better code optimization
assuming you are using static libraries, the code logic might fit better into the instruction cache.
Note: you should really run a profiler and use objdump to decipher this.
Reasons for imissed in your application could be the following:
CPU frequency governor is in powersave and not performance (impact of not disabling C-states in BIOS).
There is additional work or sleep in the RX thread loop, which prevents rx_burst from being invoked frequently.
RSS is enabled, but traffic sent to the DPDK port is falling on 1 queue only. With many RX queues, such as 16, this increases the delay before a packet is picked up from the relevant queue.
If the RX thread is feeding worker cores based on flow ID, there could be retries when a ring is full, leading to missed rx_burst calls.
If testpmd, examples/skeleton or examples/l2fwd does not show imissed, then here are my suggestions to debug your application:
try doing RX-TX from a single core
set the power governor from powersave to performance.
ensure you are running the core threads on isolated cores.
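When comparing runs while trying these suggestions, it can help to look at the share of packets missed per sampling interval rather than at the raw counter. The sketch below is one way to compute that from two counter snapshots. The struct port_sample and the function missed_ratio are hypothetical; the field names mirror ipackets and imissed from DPDK's struct rte_eth_stats, which an application would presumably fill via rte_eth_stats_get().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of two port counters, named after the
 * ipackets and imissed fields of struct rte_eth_stats. */
struct port_sample {
    uint64_t ipackets; /* packets successfully received */
    uint64_t imissed;  /* packets dropped by HW, RX queues full */
};

/* Fraction of offered packets that were missed between two samples:
 * missed / (received + missed). Returns 0.0 if nothing arrived. */
static double missed_ratio(const struct port_sample *prev,
                           const struct port_sample *cur)
{
    assert(cur->ipackets >= prev->ipackets && cur->imissed >= prev->imissed);

    uint64_t rx = cur->ipackets - prev->ipackets;
    uint64_t missed = cur->imissed - prev->imissed;

    if (rx + missed == 0)
        return 0.0;
    return (double)missed / (double)(rx + missed);
}
```

Sampling once per second and logging this ratio makes it easier to see whether a change, such as switching the governor or isolating cores, actually moved the number.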

c++ drops udp packets with recvmmsg

I am developing a C++ program which consumes a stream of UDP data from an FPGA over Ethernet. There is no hub or router between the FPGA and my Ethernet card. The data is 10446 pps with a rate of 125350.0 kbps.
My C++ app uses a dedicated thread and recvmmsg to drain the data. Each packet has a sequence number as the first 4 bytes, followed by 1468 bytes of stream data. I have tried VLEN values of 10 and 100 and combinations of MSG_WAITFORONE, MSG_DONTWAIT and 0 for flags.
The symptoms I am seeing are this:
Before the program starts, the stream is running at a fixed speed.
When the program starts, I have a short initial period where the return value of recvmmsg is the same as VLEN. If I understand correctly, this is the draining of the Linux kernel buffer.
After this, I always get a value of 1 for the return value of recvmmsg
If I cause small load on the system (resizing a gui window for instance). I see a drop of UDP packets, as indicated by missing sequence numbers. (not reordered, just missing).
During/after a drop, I do not get a larger return value for recvmmsg
Wireshark/tcpdump do not show any missing data, all sequence numbers are present.
If I watch the output of netstat -suna, I see the value of RcvbufErrors: increase.
If I watch the output of ifconfig I do not see any dropped packets (RX packets:602492703 errors:0 dropped:0 overruns:0 frame:0).
These are my questions:
Why do I never get more than one packet from recvmmsg during a drop condition?
Why is wireshark able to capture the packets, but my c++ cannot?
What tools can I use to get a better understanding of why I am dropping?
I have tried adjusting the following tunables:
sysctl -w net.core.netdev_max_backlog=10000
sysctl -w net.core.rmem_max=9926214400
Please do not suggest that I should switch to TCP. This is not an option for this particular application. Thanks.
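Increasing the size of the receive buffer for the receiving socket should solve this. SO_RCVBUF is a socket-level option, so the level is SOL_SOCKET, and the value is passed by pointer together with its size:
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &desired_receive_buffer_size, sizeof(desired_receive_buffer_size));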
Documentation here.
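A minimal sketch of this suggestion is shown below: it creates a UDP socket, requests a receive buffer size and reads back what the kernel actually applied. The function name open_udp_with_rcvbuf is hypothetical. On Linux the value read back is typically double the request (the kernel accounts for bookkeeping overhead) and is capped by net.core.rmem_max, so checking it is a quick way to see whether the request took effect.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP socket and request a receive buffer of `requested`
 * bytes. The size the kernel reports afterwards is stored in *applied.
 * Returns the socket descriptor, or -1 on failure. */
static int open_udp_with_rcvbuf(int requested, int *applied)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* SO_RCVBUF lives at the socket level, hence SOL_SOCKET. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) < 0) {
        close(fd);
        return -1;
    }

    socklen_t len = sizeof(*applied);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, applied, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

If the applied size is far below the request, raising net.core.rmem_max (as in the question) before starting the program is the usual next step.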

UDP packets are dropped when its size is less than 12 byte in a certain PC. how do i figure it out the reason?

I'm stuck on a problem that I have never heard of before.
I'm making an online game which uses UDP packets for a certain character action. After I developed the UDP module, it seemed to work fine. Most of our team members have no problem, but one person, my boss, told me that something is wrong with that module.
I investigated the problem and finally found that on his PC, if the UDP packet size is less than 12 bytes, the packet is never delivered to the other host.
the following is some additional information:
1~11 bytes udp packets are dropped, 12 bytes and over 12 bytes packets are OK.
O/S: Microsoft Windows Vista Business
NIC: Attansic L1 Gigabit Ethernet 10/100/1000Base-T Controller
WSASendTo returns TRUE.
loopback udp packet works fine.
What do you think of this problem, and what could be causing it?
What should my next step be to find the cause?
PS. I don't want to add padding to bring all packets up to 12 bytes.
Just to get one of the non-obvious answers in: maybe UDP checksum offload is broken on that card, i.e. the packets are sent, but dropped by the receiver?
You can check for this by looking at the received packets using Wireshark.
If you have already checked the firewall, antivirus, network firewall and network intrusion detection, read this.
For a UDP packet: ethernet_header (14 bytes) + IPv4_header (20 bytes min) + UDP_header (8 bytes) = 42 bytes
Since that is less than the minimum Ethernet frame size (64 bytes including the 4-byte FCS, or 60 bytes as seen on Linux), the network driver pads the frame with zeros (60 - 42 = 18 bytes for an empty payload) to make it 60 bytes before it sends out the packet.
That is the minimum frame length for a UDP packet.
Theoretically you can send a packet with 0 data bytes, but I haven't tried it yet.
As for your issue, it must be an OS or driver issue. Check your network driver's manual or check with the manufacturer, because this isn't supposed to happen.
REF:http://www.freesoft.org/CIE/Course/Section4/8.htm
REF:http://en.wikipedia.org/wiki/User_Datagram_Protocol
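The frame-size arithmetic from this answer can be checked with a short helper. It uses the header sizes listed there and assumes the standard 60-byte minimum Ethernet frame (64 bytes once the 4-byte FCS is counted) and an IPv4 header without options; the constant and function names are hypothetical.

```c
#include <assert.h>

/* Sizes in bytes; IPv4 header without options is assumed. */
enum {
    ETH_HDR_LEN = 14,
    IPV4_HDR_MIN_LEN = 20,
    UDP_HDR_LEN = 8,
    ETH_MIN_FRAME_NO_FCS = 60 /* 64 with the 4-byte FCS */
};

/* Number of zero bytes added to reach the minimum frame size for a
 * UDP datagram carrying payload_len bytes of data. */
static unsigned udp_frame_padding(unsigned payload_len)
{
    unsigned frame = ETH_HDR_LEN + IPV4_HDR_MIN_LEN + UDP_HDR_LEN + payload_len;

    return frame < ETH_MIN_FRAME_NO_FCS ? ETH_MIN_FRAME_NO_FCS - frame : 0;
}
```

An 11-byte payload gets 7 bytes of padding and a 12-byte payload gets 6, so both go out as minimum-size frames. Under these assumptions the frame size itself does not change at the 12-byte boundary, which points away from padding as the explanation for the threshold.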
Run Wireshark on his PC AND on the destination PC.
Does the log show the udp packet leaving his machine? Does it show it arriving on the destination PC?
What kind of router hardware or switches are between his PC and the destination? Can you remove them and link the two with a cross over cable? (or replace the destination with a laptop and link that to his PC with a cross over cable?)
Have you removed or at least listed all anti virus and firewall products on his machine and anything that installs a Winsock LSP ?
Do ALL packets of less than 12 bytes get dropped, or just some? Can you generate packets with random content and see if it's something in the content, rather than just the size, that's causing the issue?
Assuming your problem is with sending from his PC: First, run a packet sniffer on the problematic PC to see if it arrives at the NIC. If it makes it there, there may be a problem in the NIC or NIC driver.
Next, check for any running firewall software. Try disabling it and see what happens.
If that doesn't work, clear out any Winsock Layered Service Providers with netsh winsock catalog reset.
If that doesn't work, I'm stumped :)
Finally, you're probably going to find other customers with the same problem; you might want to think about that workaround anyway. Try sending a few small-size UDP packets on connect, and if they consistently fail to go through, enable a padding workaround. For hosts where the probe packets make it through, you don't need to pad them out.
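The padding half of that workaround could look roughly like the sketch below, which zero-fills a payload up to a minimum length only when it is shorter. The name pad_payload and the 12-byte threshold in the usage are illustrative, taken from the symptom in the question. The receiver would need some way to recover the real length, for example a length byte in the application protocol, which is an assumption not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero-pad the first len bytes of buf up to min_len bytes.
 * cap is the total capacity of buf.
 * Returns the length to send, or 0 if buf cannot hold min_len bytes. */
static size_t pad_payload(unsigned char *buf, size_t len,
                          size_t cap, size_t min_len)
{
    if (len >= min_len)
        return len;      /* already long enough, send unchanged */
    if (cap < min_len)
        return 0;        /* no room for the padding */

    memset(buf + len, 0, min_len - len);
    return min_len;
}
```

A sender would call this only for peers whose small probe packets failed to arrive, and pass the returned length to its send call.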
Pure conjecture: RTP, which is a very common packet to send on UDP, defines a 12 byte header. I wonder if some layer of network software is assuming that anything smaller is a malformed RTP packet and throwing it away?