TCP packet loss occurs occasionally when using DPDK 19.11 with an i40e NIC - dpdk

I am using an XL710 i40e NIC with DPDK 19.11. I found that the NIC occasionally loses packets when I enable the TSO feature.
The detailed information is as follows:
https://github.com/JiangHeng12138/dpdk-issue/issues/1
I guessed the lost packets are caused by the i40e NIC driver, but I don't know how to debug the i40e driver code. Could you please suggest an effective way?

Based on the problem statement, TCP packet loss occurs occasionally when using DPDK 19.11 with an i40e NIC; one first needs to isolate whether it is the client (peer system) or the server (DPDK DUT) that causes the packet loss. To debug the issue on the DPDK server side, one needs to evaluate both RX and TX issues. The DPDK tool dpdk-procinfo can retrieve port statistics, which can be used for the analysis of the issue.
Diagnose the issue:
Run the application (DPDK primary) to reproduce the issue in terminal 1.
In terminal 2, run the command dpdk-procinfo -- --stats (refer to the dpdk-procinfo documentation for more details).
Check the RX-errors counter; this will show whether faulty packets were dropped at the PMD level.
Check the RX-nombuf counter; this will show whether packets from the NIC could not be DMA'd to DDR memory on the host.
Check the TX-errors counter; this will show whether the copy of packet descriptors (DMA descriptors) to the NIC was faulty.
Also check the HW NIC statistics with dpdk-procinfo -- --xstats for any error or drop counter updates.
(Sample capture of the stats and xstats counters on the NIC under test.)
Note:
"tx_good_packets" means the number of packets sent by the dpdk NIC. if the number of packets tried to be sent is equal to "tx_good_packets", there is no packet dropped at the sent client.
"rx-missed-errors" means packets loss at the receiver; this means you are processing packets more than what the Current CPU can handle. So either you will need to increase CPU frequency, or use additional cores to distribute the traffic.
If none of these counters is updated or errors are found, then the issue is at the peer (client non-dpdk) side.

Related

dpdk-testpmd command executed and then hangs

I set up a DPDK-compatible environment and then tried to send packets using dpdk-testpmd, expecting to see them received on another server.
I am using vfio-pci driver in no-IOMMU (unsafe) mode.
I ran
$./dpdk-testpmd -l 11-15 -- -i
which had output like
EAL: Detected NUMA nodes: 2
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: VFIO support initialized
EAL: Using IOMMU type 8 (No-IOMMU)
EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.1 (socket 0)
TELEMETRY: No legacy callbacks, legacy socket not created
Interactive-mode selected
testpmd: create a new mbuf pool <mb_pool_1>: n=179456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
testpmd: create a new mbuf pool <mb_pool_0>: n=179456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.
Configuring Port 0 (socket 0)
Port 0: E4:43:4B:4E:82:00
Checking link statuses...
Done
then
testpmd> set nbcore 4
Number of forwarding cores set to 4
testpmd> show config fwd
txonly packet forwarding - ports=1 - cores=1 - streams=1 - NUMA support enabled, MP allocation mode: native
Logical Core 12 (socket 0) forwards packets on 1 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=BE:A6:27:C7:09:B4
my nbcore is not being set correctly, and even 'txonly' mode was not being set before I set the eth-peer addr, though some parameters do work. Moreover, if I don't change the burst delay, my server crashes as soon as I start transmitting, even though it has a 10G Ethernet port (80 MBps available bandwidth by my calculation). Hence I am not seeing packets at the receiving server when tailing tcpdump on the corresponding receiving interface. What is happening here, and what am I doing wrong?
Based on the question and the answers in the comments, the real intention is to send packets from DPDK testpmd using an Intel Fortville (net_i40e) NIC to the remote server.
The real issue for traffic not being generated is that neither the application command line nor the interactive commands configure dpdk-testpmd to create packets.
In order to generate packets there are 2 options in testpmd
start tx_first: this sends out a default burst of 32 packets as soon as the port is started.
forward mode tx-only: this puts the port under dpdk-testpmd in transmission-only mode; once the port is started, it transmits packets of the default packet size.
Neither of these options is utilized, hence my suggestions are:
please walk through the DPDK documentation on testpmd and its configuration
make use of either --tx-first or --forward-mode=txonly as per the DPDK Testpmd Command-line Options
make use of either start tx_first, set fwd txonly, or set fwd flowgen in interactive mode (refer to Testpmd Runtime Functions)
With this, traffic will be generated from testpmd and sent to the device (remote server). A quick example of the same is "dpdk-testpmd --file-prefix=test1 -a81:00.0 -l 7,8 --socket-mem=1024 -- --burst=128 --txd=8192 --rxd=8192 --mbcache=512 --rxq=1 --txq=1 --nb-cores=2 -a --forward-mode=io --rss-udp --enable-rx-cksum --no-mlockall --no-lsc-interrupt --enable-drop-en --no-rmv-interrupt -i"
From the above example, the config parameters are:
the number of packets per RX-TX burst is set by --burst=128
the number of RX-TX queues is configured by --rxq=1 --txq=1
the number of cores to use for RX-TX is set by --nb-cores=2
the flowgen, txonly, rxonly, or io mode is selected with --forward-mode=io
Hence, as mentioned in the comments, neither set nbcore 4 nor any of the configuration in the testpmd args or interactive commands sets the application up for TX only.
The second part of the query is really confusing, because as it states
Moreover if I don't change the burst delay my server gets crashed as
soon as I start transmitting through it has 10G ethernet port (80MBps
available bandwidth by calculation). Hence, I am not seeing packets at
receiving server by tailing tcpdump at corresponding receiving
interface. What is happening here and what am I doing wrong?
Assuming "my server" is the remote server to which packets are being sent by dpdk-testpmd (since there is mention of seeing packets with tcpdump, and an Intel Fortville X710 bound to a UIO driver removes the kernel netlink interface):
The mention of 80 MBps, which is around 0.64 Gbps, is really strange. If the remote interface is set to promiscuous mode and an AF_XDP application or raw-socket application is configured to receive traffic, it works at line rate (10 Gbps). Since there are no logs or crash dumps from the remote server, and it is highly unlikely actual traffic was generated from testpmd, this looks more like a config or setup issue on the remote server.
[EDIT-1] Based on the live debug, it is confirmed that:
DPDK was not installed - fixed by running ninja install
the DPDK NIC port eno2 is not connected to the remote server directly
the DPDK NIC port eno2 is connected through a switch
the DPDK application testpmd is not crashing - confirmed with pgrep testpmd
instead, when set fwd txonly is used, packets flood the switch and SSH packets on the other port are dropped
Solution: please use a separate switch for data-path testing, or a direct connection to the remote server.

Rss hash for fragmented packet

I am using a Mellanox Technologies MT27800 Family [ConnectX-5] NIC, using DPDK multi RX queue with RSS "ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP".
I analyze traffic and need all packets of the same session to arrive at the same process (a session for now is ip+port),
so packets that have the same ip + port arrive at the same queue.
But if packets are IP-fragmented, they arrive at a different process. That is a problem!
How can I calculate the hash value in C++ code, the same way it is done in the NIC, so I can reassemble fragments and send them to the same process as the non-fragmented packets?
Instead of ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP, you can use only ETH_RSS_IP to calculate the RSS hash solely from the IP addresses of the packet. This way, even if a packet is fragmented, all segments of that packet will arrive at the same CPU core.
The RSS value of a packet can also be calculated in software using the rte_thash library: https://doc.dpdk.org/api/rte__thash_8h.html
While this option is possible, I would still recommend checking out the proposed setting of ETH_RSS_IP only.
When ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP is enabled, the RSS function takes the IP addresses plus the src and dst ports to calculate the RSS hash value. As the ports are not present in IP-fragmented packets, you cannot compute the same value as for non-fragmented IP packets.
You can either:
reassemble the IP fragments to form a complete IP packet and then compute the RSS value using the rte_thash library, or
compute the RSS value only from the IP addresses (use the ETH_RSS_IP setting only).
As you are only doing load-balancing on CPU cores I think the latter option suits your use-case well enough.
@Davidboo based on the question and the explanation in the comments, what you have described is:
all packets of the same session arrive at the same process (a session for now is ip+port) - which means you are looking for a symmetric hash
some packets are IP-fragmented and arrive at a different process - you need packet reassembly before symmetric RSS
Mellanox Technologies MT27800 Family [ConnectX-5] - the current NIC does not support reassembly in the NIC (embedded switch).
Hence the actual question is what the right way is to solve the problem under these constraints. There are 3 solutions (1 HW and 2 SW).
option 1 (HW): Use a smart NIC or network-appliance offload card that can ingress the traffic and reassemble fragments before sending them to the host server.
option 2 (SW): Disable RSS and use a single RX queue. Check whether each packet is a fragment. If yes, reassemble, and then use rte_distributor or rte_eventdev with atomic flows to spread traffic to worker cores.
option 3 (SW): Disable RSS, but use n + 1 RX queues and n SW rings. By default all packets will be received on queue 0. Based on JHASH (for SW RSS), add rte_flow rules pinning the flows to queues 1 to n.
Assuming you cannot change the NIC, my recommendation is option 2 with eventdev, for the following reasons:
It is much easier and simpler to let either HW or SW load balancing (eventdev) spread traffic across multiple cores.
The distributor approach is similar to HW RSS, where static pinning will lead to congestion and packet drops.
Unlike option 3, one need not maintain a state flow table to keep track of the flows.
How to achieve this:
Use the DPDK example skeleton to create the basic port initialization.
Enable DPDK rte_ethdev PTYPES to identify IP, non-IP, and IP-fragmented packets without parsing the frame and payload.
Check whether packets are fragmented using RTE_ETH_IS_IPV4_HDR and rte_ipv4_frag_pkt_is_fragmented for IPv4, and RTE_ETH_IS_IPV6_HDR and rte_ipv6_frag_get_ipv6_fragment_header for IPv6 (refer to the DPDK example ip_reassembly).
Extract SRC-IP, DST-IP, SRC-PORT, DST-PORT (if the packet is not SCTP, UDP, or TCP, set src port and dst port to 0) and use rte_hash_crc_2byte. This will ensure a symmetric hash (SW RSS).
Then feed the packet with its hash value to the eventdev stage or flow-distributor stage (refer to eventdev_pipeline or distributor).
Note:
each worker core will be running the business logic.
On a Broadwell Xeon, a core can handle around 30 Mpps.
On an Ice Lake Xeon, a core can handle around 100 Mpps.

DPDK getting too many rx_crc_errors on one port

What may cause rx_crc_errors on DPDK ports?
Is it a software thing? Or a hardware thing related to the port or the traffic coming from the other end?
DPDK Version: 19.02
PMD: I40E
This port is running on a customer network. Worth mentioning that this is the only port (out of 4) having this behaviour, so it may be a router/traffic thing, but I couldn't verify that.
I used dpdk-proc-info to get this data.
I could not do any additional activity as this is running on a customer site.
The DPDK I40E PMD has only an option to keep or strip the CRC on the port. Hence the assumption that the DPDK I40E PMD is causing CRC errors on 1 port out of 4 can be fully ruled out.
RX packets are validated per port by the ASIC for CRC and then DMA'd to mbufs in packet-buffer memory. The PMD copies the descriptor state into the mbuf struct (the CRC result among it); the packet descriptor indicates the CRC result of the packet buffer to the driver (kernel or DPDK PMD). So CRC errors on a given port can arise for the following reasons:
the port connected to ASIC is faulty (very rare case)
the SFP+ is not properly connected (possible).
the SFP+ is not the recommended one (possible).
the traffic coming from the other end contains packets with a faulty CRC (possible).
One needs to isolate the issue by
binding the port to the Linux driver i40e and checking the statistics via ethtool -S [port];
checking SFP+ compatibility on the faulty port by swapping with a working one;
re-seating the SFP+;
swapping the data cables between a working and the faulty port, then checking whether the error follows.
If in all 4 cases above the error appears only on the faulty port, then indeed the NIC card has only 3 working ports out of 4. The NIC card needs replacement, or one should ignore the faulty port altogether. Either way, this is not a DPDK PMD or library issue.

How to receive 40Gbps line rate traffic without zero loss?

My goal is capturing incoming packets with DPDK. To do this I want to integrate the DPDK ETH API into my project to receive all incoming packets (NIC rate: 40 Gbps, packet size 1500 bytes) with zero packet loss.
I don't know how I can do this.
I installed DPDK from [DPDK Quick Installation.][1]
We shared a debug session and showcased that 4 x 10 Gbps X710 (Fortville) ports are able to send 40 Gbps of traffic from pktgen. On the receiver end, it was shown how examples/skeleton with 1 core receives and processes the full 40 Gbps.
The packet size is 1500, as requested by @Alexanov.
Hence there is no issue in the DPDK library receiving packets at this rate.

Unable to receive burst of 1 packet with rte_eth_rx_burst

I am trying to receive packets transmitted by another DPDK application on a different system. I am able to transmit a burst of 1 packet with the rte_eth_tx_burst API, but unable to receive a burst of 1 with the rte_eth_rx_burst API. I am able to receive packets only if the RX burst value is 4 or higher. Is it because of some ethdev configuration?
The ixgbe and i40e devices have this problem; virtio-net does not. Their vector RX paths process RX descriptors in batches, so a burst request smaller than the batch size returns no packets. Modify RTE_*_DESCS_PER_LOOP in the corresponding *_rxtx.h file (or simply request a burst of at least that size).