l2fwd application achieves good rate when using VFs on the same port, but low rate when sending traffic from a VF on one port to a VF on a different port - dpdk

Issue Summary:
On a dual-port 10Gbps NIC, my DPDK application can sustain ~9Gbps of traffic on each port (I receive traffic on one port, process it, and send it out via the same port; a second application does the same on the second port).
However, if my application receives traffic on one port and sends it internally to the second port, where a different application receives it, I can receive at most 3.4Gbps. Beyond this rate packets are dropped, but the imissed count in the DPDK statistics does not increase.
Issue in Detail:
I'm running on a server that has an "X710 for 10GbE SFP+ 1572" Ethernet controller with 2 ports/physical functions. I have created 4 virtual functions on each physical function.
Physical functions:
0000:08:00.0 'Ethernet Controller X710 for 10GbE SFP+ 1572' if=ens2f0 drv=i40e unused=vfio-pci *Active*
0000:08:00.1 'Ethernet Controller X710 for 10GbE SFP+ 1572' if=ens2f1 drv=i40e unused=vfio-pci *Active*
Machine specification:
CentOS 7.8.2003
Hardware:
Intel(R) Xeon(R) CPU L5520 @ 2.27GHz
L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 8192K
NIC: X710 for 10GbE SFP+ 1572
RAM: 70GB
PCI:
Intel Corporation 5520/5500/X58 I/O Hub PCI Express
Capabilities: [90] Express (v2) Root Port (Slot-), MSI 00
LnkSta: Speed 5GT/s, Width x4,
isolcpus: 0,1,2,3,4,5,6,7,8,9,10
NUMA hardware:
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 36094 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 36285 MB
Hugepages: size 2MB, count 1024
Model-1 (intra-VF):
Running 2 instances of the DPDK l2fwd application, namely App1 and App2. App1 is bound to 2 VFs and 2 cores; App2 is bound to 1 VF and 1 core.
Traffic handling:
App1 receives external traffic on VF-0 and sends it out via VF-1. App2 receives external traffic on VF-2 and sends it out via VF-2 itself.
In this model App1 & App2 together receive 8.8 Gbps and transmit the same without any loss.
Model-2 (inter-VF):
I have modified the l2fwd application so that App1 sends the external traffic to App2; App2 receives it and sends it back to App1, and App1 then sends the traffic out to the external destination.
Model-2 diagram
Traffic handling:
App1 receives external traffic on VF0 and sends it to App2 via VF1. App2 receives the packets on VF2 and sends them back to App1 via VF2 itself. App1 receives the packets from App2 on VF1 and sends them out to the external destination via VF0.
In this model App1 & App2 together receive only 3.5 Gbps and transmit the same without any loss.
If I try to increase the traffic rate, not all packets sent by App1 are received by App2, and vice versa. Note that there is no increase in the imissed count in the port-level statistics (leading to the inference that packets are dropped not for lack of CPU cycles but rather in the PCIe communication between the VFs).
I came across the following link: https://community.intel.com/t5/Ethernet-Products/SR-IOV-VF-performance-and-default-bandwidth-rate-limiting/m-p/277795
However, in my case there is no throughput issue with intra-VF communication.
My limited understanding is that communication between two different physical functions would happen via the PCI Express switch.
Is so much performance deterioration expected (two 10Gbps ports giving less than 4Gbps of throughput), and do I therefore need to change my design?
Could it be because of some misconfiguration?
Please suggest any pointers on how to proceed further.

Based on analysis of the issue, there seems to be a platform configuration problem that can cause these effects.
Problem (throughput): unable to achieve 20Gbps bidirectional (simulator ingress and application egress via VF); the maximum received is only about 3.4Gbps.
[Solution] This is most likely due to one of the following reasons:
Interconnect cables (fiber, copper, DAC) might be faulty. Very unlikely for both ports.
Both ports might be negotiating to half duplex. Not likely, because the default DPDK settings force full duplex.
The platform or motherboard is not negotiating the right PCIe generation or not allocating sufficient lanes. Most likely.
To identify a PCIe lane issue use `sudo lspci -vvvs [PCIe BDF] | grep Lnk` and compare LnkCap against LnkSta. If there is a mismatch, then it is a PCIe lane issue.
[Edit based on live debug] The issue has indeed been identified as the PCIe link. The current Xeon platform only supports PCIe gen-2 x4 lanes, while the X710-T2 card requires PCIe gen-3 x4 lanes.
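A back-of-the-envelope check makes this plausible (my arithmetic, not from the original exchange): PCIe gen-2 runs at 5 GT/s per lane with 8b/10b encoding, i.e. roughly 4 Gbps of usable bandwidth per lane, so a x4 link offers about 16 Gbps per direction before TLP/DMA overhead, typically 12-14 Gbps effective. In Model-2 every externally received packet crosses the PCIe link three times in each direction (RX on VF0, the VF1-to-VF2 hop, TX on VF0), so 3.4 Gbps of external traffic already generates roughly 10 Gbps of PCIe traffic per direction - right at the practical limit of a gen-2 x4 link.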
The recommendation is to upgrade the CPU and motherboard to at least a Broadwell Xeon or better.


RSS hash for fragmented packets

I am using a Mellanox Technologies MT27800 Family [ConnectX-5] NIC, with DPDK multi RX queue and RSS set to "ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP".
I analyze traffic and need all packets of the same session to arrive at the same process (a session for now can be IP + port),
so packets that have the same IP + port arrive at the same queue.
But if some packets are IP-fragmented, they arrive at a different process. That is a problem!
How can I calculate the hash value in my C++ code, the same way it is done in the NIC, so I can reassemble the fragments and send them to the same process as the non-fragmented packets?
Instead of ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP you can use only ETH_RSS_IP, to calculate the RSS hash from the IP addresses of the packet alone. This way, even if the packet is fragmented, all segments of the packet will arrive at the same CPU core.
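A minimal sketch of that configuration (the function name and queue-count parameter are mine; queue/mempool setup and error handling are omitted, and the pre-20.11 macro names are used to match this question's DPDK era):

```c
#include <string.h>
#include <rte_ethdev.h>

/* Sketch: configure a port so RSS hashes on IP addresses only, so all
 * fragments of a flow land on the same RX queue. */
static int
configure_rss_ip_only(uint16_t port_id, uint16_t nb_rx_queues)
{
    struct rte_eth_conf conf;

    memset(&conf, 0, sizeof(conf));
    conf.rxmode.mq_mode = ETH_MQ_RX_RSS;
    conf.rx_adv_conf.rss_conf.rss_key = NULL;       /* keep the PMD default key */
    conf.rx_adv_conf.rss_conf.rss_hf  = ETH_RSS_IP; /* no L4 ports in the hash */

    return rte_eth_dev_configure(port_id, nb_rx_queues, 1, &conf);
}
```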
The RSS value of a packet can also be calculated in software using the rte_thash library: https://doc.dpdk.org/api/rte__thash_8h.html
While this option is possible, I would still recommend you check out the proposed setting of ETH_RSS_IP only.
When ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP is enabled, the RSS function takes the IP addresses and the src + dst ports to calculate the RSS hash value. As the ports are not present in IP fragments, you cannot compute the same value as for the non-fragmented IP packets.
You can either:
reassemble the IP fragments to form a complete IP packet and then compute the RSS value using the rte_thash library (see the sketch after this list), or
compute the RSS value from the IP addresses only (use the ETH_RSS_IP setting alone).
As you are only doing load balancing across CPU cores, I think the latter option suits your use case well enough.
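For the reassemble-then-hash option, here is a hedged sketch of recomputing the Toeplitz (RSS) hash in software with rte_thash. The helper name is mine, and byte-order handling is glossed over (see rte_softrss_be() when working directly on network-order headers); the port's actual 40-byte key can be read back with rte_eth_dev_rss_hash_conf_get().

```c
#include <stdint.h>
#include <rte_thash.h>

/* Sketch: software Toeplitz (RSS) hash over an IPv4/L4 tuple, using the
 * same 40-byte key that is programmed into the NIC. Fields are assumed
 * to be in host byte order here. */
static uint32_t
soft_rss_ipv4(uint32_t src_ip, uint32_t dst_ip,
              uint16_t src_port, uint16_t dst_port,
              const uint8_t *rss_key /* 40-byte RSS key */)
{
    union rte_thash_tuple tuple;

    tuple.v4.src_addr = src_ip;
    tuple.v4.dst_addr = dst_ip;
    tuple.v4.sport    = src_port;
    tuple.v4.dport    = dst_port;

    return rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L4_LEN, rss_key);
}
```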
@Davidboo, based on the question and the explanation in the comments, what you have described is:
"all packets of the same session arrive at the same process (a session for now can be IP + port)" - which means you are looking for a symmetric hash
"some packets are IP-fragmented, and they arrive at a different process" - which means you need packet reassembly before the symmetric RSS
"Mellanox Technologies MT27800 Family [ConnectX-5]" - the current NIC does not support reassembly in the NIC (embedded switch).
Hence the actual question is what the right way is to solve the problem under these constraints. There are 3 solutions (1 HW and 2 SW).
Option 1 (HW): use a SmartNIC or network-appliance offload card that can ingest the traffic and reassemble fragments before sending them to the host server.
Option 2 (SW): disable RSS and use a single RX queue. Check whether each packet is a fragment; if so, reassemble it, then use rte_distributor or rte_eventdev with atomic flows to spread traffic across worker cores.
Option 3 (SW): disable RSS, but use n + 1 RX queues and n SW rings. By default all packets will be received on queue 0. Based on jhash (for SW RSS), add rte_flow rules pinning the flows to queues 1 to n (a sketch of such a rule follows).
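To illustrate option 3, a hedged rte_flow sketch that pins one flow to one RX queue; the /32 destination-IP match and the queue index are placeholders of mine, and real rules would be derived from the jhash-based table mentioned above:

```c
#include <rte_flow.h>

/* Sketch: steer ingress IPv4 packets with a given destination address
 * (network byte order) to a specific RX queue. */
static struct rte_flow *
pin_dst_ip_to_queue(uint16_t port_id, uint32_t dst_ip_be, uint16_t queue,
                    struct rte_flow_error *err)
{
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_item_ipv4 spec = { .hdr.dst_addr = dst_ip_be };
    struct rte_flow_item_ipv4 mask = { .hdr.dst_addr = 0xffffffff };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &spec, .mask = &mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    struct rte_flow_action_queue q = { .index = queue };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &q },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(port_id, &attr, pattern, actions, err);
}
```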
Assuming you cannot change the NIC, my recommendation is option 2 with eventdev, for the following reasons:
It is much easier and simpler to let either a HW or SW load balancer (eventdev) spread traffic across multiple cores.
rte_distributor is similar to HW RSS, where static pinning can lead to congestion and packet drops.
Unlike option 3, one need not maintain a stateful flow table to keep track of the flows.
How to achieve this:
Use the DPDK skeleton example for the basic port initialization.
Enable ptype parsing via rte_ethdev to identify IP, non-IP, and IP-fragmented packets without parsing the frame and payload.
Check whether packets are fragmented using RTE_ETH_IS_IPV4_HDR and rte_ipv4_frag_pkt_is_fragmented for IPv4, and RTE_ETH_IS_IPV6_HDR and rte_ipv6_frag_get_ipv6_fragment_header for IPv6 (refer to the DPDK ip_reassembly example).
Extract SRC-IP, DST-IP, SRC-PORT, DST-PORT (if the packet is not SCTP, UDP or TCP, set the src and dst ports to 0) and use rte_hash_crc_2byte. This ensures a symmetric hash (SW RSS); see the sketch after this list.
Then feed the packet, along with its hash value, to the eventdev stage or the distributor stage (refer to the eventdev_pipeline or distributor examples).
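A hedged sketch of steps 3 and 4 for IPv4 (the helper name, the CRC seed, and the assumption of no IP options are mine, not from the post):

```c
#include <netinet/in.h>
#include <rte_ether.h>
#include <rte_hash_crc.h>
#include <rte_ip.h>
#include <rte_ip_frag.h>
#include <rte_mbuf.h>

/* Sketch: symmetric SW RSS for an IPv4 mbuf. Ports are forced to 0 for
 * fragments and for non-TCP/UDP packets; XOR-folding src/dst makes the
 * hash direction-independent, i.e. symmetric. */
static uint32_t
symmetric_sw_rss_ipv4(struct rte_mbuf *m)
{
    struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m,
            struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
    uint16_t sport = 0, dport = 0;

    if (!rte_ipv4_frag_pkt_is_fragmented(ip) &&
        (ip->next_proto_id == IPPROTO_TCP ||
         ip->next_proto_id == IPPROTO_UDP)) {
        /* L4 ports sit right after the IPv4 header (no options assumed). */
        const uint16_t *l4 = (const uint16_t *)(ip + 1);
        sport = l4[0];
        dport = l4[1];
    }

    uint32_t hash = rte_hash_crc_4byte(ip->src_addr ^ ip->dst_addr,
                                       0xdeadbeef /* arbitrary seed */);
    return rte_hash_crc_2byte(sport ^ dport, hash);
}
```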
Note:
each worker core will be running the business logic.
On a Broadwell Xeon, a core can handle around 30Mpps.
On an Ice Lake Xeon, a core can handle around 100Mpps.

DPDK getting too many rx_crc_errors on one port

What may cause rx_crc_errors on DPDK ports?
Is it a software thing, or a hardware thing related to the port or to the traffic coming from the other end?
DPDK Version: 19.02
PMD: I40E
This port is running in a customer network. Worth mentioning that this is the only port (out of 4) showing this behaviour, so it may be a router/traffic thing, but I couldn't verify that.
I used dpdk-proc-info to get this data.
I could not do any additional debugging, as this is running at a customer site.
The DPDK I40E PMD only has an option to enable or disable CRC stripping on the port. Hence the assumption that the DPDK I40E PMD is causing the CRC errors on 1 port out of 4 can be fully ruled out.
RX packets are validated for CRC by the ASIC, per port, and then DMA'd into the mbuf packet buffer. The PMD copies the descriptor state into the mbuf struct (the CRC result among it); the packet descriptor reports the CRC result of the packet buffer to the driver (kernel i40e or DPDK PMD). So a CRC error on a given port can arise for the following reasons:
the port's connection to the ASIC is faulty (a very rare case)
the SFP+ is not properly connected (possible)
the SFP+ is not the recommended one (possible)
the traffic coming from the other end arrives with faulty CRCs.
One needs to isolate the issue by:
binding the port to the Linux driver i40e and checking the statistics via `ethtool -S [port]`;
checking the SFP+ on the faulty port for compatibility by swapping it with a working one;
re-seating the SFP+;
swapping the data cables between a working port and the faulty port, then checking whether the error follows.
If in all 4 of the above cases the error appears only on the faulty port, then the NIC card indeed has only 3 working ports out of 4. The NIC card needs replacement, or one should ignore the faulty port altogether. Hence this is not a DPDK PMD or library issue.
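As a side note on monitoring: besides dpdk-proc-info, the same counters can be read from inside the application via xstats. A minimal sketch (the function name is mine; error handling is trimmed, and the port is assumed to be started):

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <rte_ethdev.h>

/* Sketch: print every extended statistic on a port whose name contains
 * "crc" (e.g. rx_crc_errors on i40e). */
static void
print_crc_xstats(uint16_t port_id)
{
    int n = rte_eth_xstats_get(port_id, NULL, 0);
    if (n <= 0)
        return;

    struct rte_eth_xstat xstats[n];
    struct rte_eth_xstat_name names[n];

    if (rte_eth_xstats_get(port_id, xstats, n) != n ||
        rte_eth_xstats_get_names(port_id, names, n) != n)
        return;

    for (int i = 0; i < n; i++)
        if (strstr(names[i].name, "crc") != NULL)
            printf("%s: %" PRIu64 "\n", names[i].name, xstats[i].value);
}
```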

DPDK driver compatible with Intel X710 NIC

Could you please suggest which DPDK driver in the virtual machine is compatible with the Intel X710 NIC driver on the host? The igb_uio driver we are currently using may only be compatible with Intel NICs like the 82599.
Since the question is not clear, I have to make certain assumptions.
Assumptions:
You want to run your DPDK application in the guest OS.
You have an X710 (Fortville) NIC on the host.
To achieve this, you have 3 options:
a. X710 passthrough to the guest OS.
b. X710 SR-IOV VFs passed to the guest OS.
c. Using an intermediate application like OVS, a virtual switch, VPP or Snabb Switch to connect to the guest OS.
For cases a and b you can still use `igb_uio` or `vfio-pci`, as the kernel driver is still i40e and the device is seen as an X710. For case c you can use `igb_uio` with virtio-pci as the kernel driver.
Thanks for updating the details; this clarifies the environment and setup. Please find below the answers to the queries and what can be done to fix things.
Environment:
host OS: RHEL 7.6, X710 PF (call it eno1), kernel PF driver i40e
guest OS: RHEL 7.6, X710 VFs created from eno1 (call them eno2 and eno3), passed through to the VM and bound to igb_uio
expected behaviour: Ingress (RX) and Egress (TX) should both work
observed behaviour: only Egress (TX) works; Ingress (RX) to the VM ports does not work
The fix for "incoming packets from the host's physical port are not reaching the VM via the VF" is to redirect traffic from the physical X710 to the required SR-IOV port. We have 2 options:
using virtual switches like OVS, Snabb Switch or VPP
using PF flow director rules (a hedged rte_flow sketch follows below).
In the current description I am not able to find either of these in place.
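For the flow-director option, a hedged sketch of what such a rule could look like with rte_flow on the host PF. The very broad IPv4 pattern and the VF index are illustrative only, and whether the i40e PMD accepts this exact rule depends on the firmware and DPDK version:

```c
#include <rte_flow.h>

/* Sketch: on the host PF port, steer all ingress IPv4 traffic to VF 0. */
static struct rte_flow *
steer_ipv4_to_vf(uint16_t pf_port_id, struct rte_flow_error *err)
{
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    struct rte_flow_action_vf vf = { .id = 0 };  /* target VF index (assumed) */
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_VF, .conf = &vf },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(pf_port_id, &attr, pattern, actions, err);
}
```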
Answers to your queries:
Why does the X710 VF driver remove the VLAN tag without the RX offload VLAN strip flag being set? Is this unexpected VLAN removal behaviour of the X710 VF under vfio-pci a known bug?
I believe this is to do with the port init configuration you pass: you might be passing a default eth_conf to rte_eth_dev_configure. This uses the default RX offload behaviour, and the device capabilities (dev_info->rx_offload_capa) include DEV_RX_OFFLOAD_VLAN_STRIP | DEV_RX_OFFLOAD_QINQ_STRIP.
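A minimal sketch of passing an explicit eth_conf instead of the defaults (the function name and queue counts are mine; whether this resolves the stripping depends on the PMD's default behaviour):

```c
#include <string.h>
#include <rte_ethdev.h>

/* Sketch: configure the VF port with an explicit rte_eth_conf whose
 * rxmode.offloads requests neither DEV_RX_OFFLOAD_VLAN_STRIP nor
 * DEV_RX_OFFLOAD_QINQ_STRIP, rather than inheriting defaults. */
static int
configure_port_keep_vlan(uint16_t port_id)
{
    struct rte_eth_conf conf;

    memset(&conf, 0, sizeof(conf));
    conf.rxmode.offloads = 0;   /* no VLAN/QinQ stripping requested */

    return rte_eth_dev_configure(port_id, 1 /* rxq */, 1 /* txq */, &conf);
}
```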
The outgoing packets from the DPDK application are leaving the VM via the VF towards the host's physical ports.
This is because, with the default config for rte_eth_dev_configure, the TX offloads support VLAN.
But the incoming packets from the host's physical port are not reaching the VM via the VF.
This is dictated by the host PF, the flow director rules, and the VF settings. I assume you are not using flow director on the host and have left rte_eth_dev_configure at default values in the guest OS.

Does the PF_RING Linux kernel module have functionality to split incoming traffic to 3 different devices?

Currently my C++ program uses the PF_RING kernel module to read from a 1 Gbps NIC on Linux. The application has a bottleneck: it cannot process more than 700 Mbps because it decodes all the headers and maintains some state.
Now there is an urgent need to scale the application to handle 10 Gbps of traffic from a pipe. After making all the necessary changes inside my application, it can reach up to 1.2 Gbps.
I have a few queries:
Would PF_RING be able to handle 10 Gbps of traffic without any packet loss?
If yes, is it possible to split the incoming 10 Gbps of traffic across 4 different devices in a round-robin fashion using PF_RING module concepts?
Does anyone have a good reference document on PF_RING?
What are the different ways to split the incoming traffic on a particular interface to multiple interface devices in Solaris?

Socket throughput on localhost? [duplicate]

How efficient is it to use sockets when doing IPC as compared to named pipes and other methods on Windows and Linux?
Right now, I have 4 separate apps on 4 separate boxes that need to communicate. Two are .NET 3.5 applications running on Windows Server 2003 R2. Two are Linux (SUSE Linux 10). They're not generally CPU bound. The amount of traffic is not that large, but it's very important that it be low latency. We're using sockets right now with Nagle disabled and the SLES 10 slow-start patch installed on the Linux machines.
How much of a speed boost do you think we would get by simply running the two Windows apps on the same Windows box and the two Linux apps on the same Linux box, making no code changes (i.e. still using sockets)?
Will the OSes realize that the endpoints are on the same machine and know not to send the packets out over Ethernet? Will the packets still have to go through the whole networking stack? How much faster would it be if we took the time to change to named pipes or memory-mapped files or something else?
As for TCP performance, I did this sort of test recently on an HP-UX server (8 Intel Itanium 2 processors at 1.5 GHz, 6 MB cache, 400 MT/s bus) and on Red Hat Linux (2 IA-64 CPUs at 1.6 GHz). I used iperf to test TCP performance. I found that TCP exchange is more than ten times faster when I run iperf on the same machine compared with running it on two different machines.
You can also give it a try, as iperf has options that might be of interest to you - the length of the buffer to read or write, TCP no-delay, and so on. You can also compare the speed of TCP exchange on Windows machines, as there is a version of iperf for Windows.
This is a more detailed comparison:
1) TCP exchange between two iperf applications running on different HP-UX servers, default TCP window 32K: 387 Mbits/sec
2) TCP exchange between two iperf applications running on different HP-UX servers, TCP window 512K: 640 Mbits/sec
3) TCP exchange between two iperf applications running on the same HP-UX server, default TCP window 32K: 5.60 Gbits/sec
4) TCP exchange between two iperf applications running on the same HP-UX server, TCP window 512K: 5.70 Gbits/sec
5) TCP exchange between two iperf applications running on the same Linux server, TCP window 512K: 7.06 Gbits/sec
6) TCP exchange between two iperf applications running on HP-UX and Linux, TCP window 512K: 699 Mbits/sec
Local named pipes will be faster since they run in kernel mode.