DPDK: MPLS packet processing - dpdk

I am trying to build a multi-RX-queue DPDK program, using RSS to split the incoming traffic into RX queues on a single port. A Mellanox ConnectX-5 and DPDK version 19.11 are used for this purpose. It works fine when I use IP over Ethernet packets as input. However, when the packet contains IP over MPLS over Ethernet, RSS does not seem to work. As a result, all packets belonging to various flows (with different src & dst IPs and ports) over MPLS are sent into the same RX queue.
My questions are:
1. Is there any parameter or technique in DPDK to distribute MPLS packets to multiple RX queues?
2. Is there any way to strip off MPLS tags (between Eth and IP) in hardware, something like hw_vlan_strip?
My port configuration is:
const struct rte_eth_conf default_port_conf = {
    .rxmode = {
        .hw_vlan_strip = 0,  /* VLAN strip disabled. */
        .header_split = 0,   /* Header Split disabled. */
        .hw_ip_checksum = 0, /* IP checksum offload disabled. */
        .hw_strip_crc = 0,   /* CRC stripping by hardware disabled. */
    },
    .rx_adv_conf = {
        .rss_conf = {
            .rss_key = NULL,
            .rss_key_len = 0,
            .rss_hf = ETH_RSS_IP,
        },
    },
};

POP_MPLS and RSS on MPLS can be enabled through RTE_FLOW on NIC PMDs that support them. The Mellanox mlx5 PMD, however, supports only RTE_FLOW_ACTION_TYPE_OF_POP_VLAN and RTE_FLOW_ACTION_TYPE_OF_PUSH_VLAN, and the only tunnelled MPLS formats it handles are MPLSoGRE and MPLSoUDP. Hence popping MPLS in hardware is not possible with the mlx5 PMD in DPDK 19.11 LTS.
For any PMD, RSS hashes on the outer/inner IP addresses together with the TCP/UDP/SCTP port numbers. I therefore interpret "RSS for MPLS" as distributing packets with different MPLS labels across the RX queues. This can be achieved with RTE_FLOW, matching on RTE_FLOW_ITEM_TYPE_MPLS and using RTE_FLOW_ACTION_TYPE_QUEUE as the action. Using the mask/range fields, one can write patterns that split the label space so that each RX queue receives 2^20 (the number of MPLS label values) / number of RX queues labels. Note that this gives no IP/port RSS hashing for those packets.
To test this you can either
use DPDK testpmd and set the flow rules, or
make use of the RTE_FLOW code snippet from the rte_flow link.
Note: for POP MPLS I highly recommend using PTYPES to identify the MPLS packets from the mbuf metadata and an RX callback to modify the packet header.
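As a rough illustration of the two ideas above (spreading by label and popping the label stack in software), here is a minimal sketch in plain C that works on a raw frame buffer rather than on an rte_mbuf. The helper names (mpls_label, mpls_label_to_queue, mpls_strip) are hypothetical and not part of DPDK. The sketch assumes an untagged Ethernet frame, a power-of-two queue count, and an IPv4 payload under the label stack (MPLS itself carries no next-protocol field).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHER_HDR_LEN   14
#define ETHER_TYPE_IPV4 0x0800
#define ETHER_TYPE_MPLS 0x8847 /* MPLS unicast */
#define MPLS_HDR_LEN    4

/* Extract the 20-bit label from one 4-byte MPLS header (network order). */
uint32_t mpls_label(const uint8_t *h)
{
    return ((uint32_t)h[0] << 12) | ((uint32_t)h[1] << 4) | (uint32_t)(h[2] >> 4);
}

/* Spread labels over queues with a mask; nb_queues must be a power of two. */
uint16_t mpls_label_to_queue(uint32_t label, uint16_t nb_queues)
{
    return (uint16_t)(label & (uint32_t)(nb_queues - 1));
}

/*
 * Remove the whole MPLS label stack in place and rewrite the EtherType.
 * Returns the new frame length, or 0 if the frame is not MPLS or is truncated.
 * Assumption: the payload below the bottom label is IPv4.
 */
size_t mpls_strip(uint8_t *frame, size_t len)
{
    size_t off = ETHER_HDR_LEN;

    if (len < ETHER_HDR_LEN + MPLS_HDR_LEN)
        return 0;
    if (((frame[12] << 8) | frame[13]) != ETHER_TYPE_MPLS)
        return 0;

    /* Walk the stack until the bottom-of-stack bit (lowest bit of byte 2). */
    for (;;) {
        if (off + MPLS_HDR_LEN > len)
            return 0;
        int bottom = frame[off + 2] & 0x01;
        off += MPLS_HDR_LEN;
        if (bottom)
            break;
    }

    /* Pull the payload up over the label stack. */
    memmove(frame + ETHER_HDR_LEN, frame + off, len - off);
    frame[12] = ETHER_TYPE_IPV4 >> 8;
    frame[13] = ETHER_TYPE_IPV4 & 0xff;
    return len - (off - ETHER_HDR_LEN);
}
```

In an actual RX callback one would more likely move the 14-byte Ethernet header forward and trim the front of the mbuf with rte_pktmbuf_adj() instead of moving the payload; the buffer version above only shows the parsing logic.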

Related

i40e XL710 send packets failed on dpdk19.11, report TX descriptor 219 is not done(port=0 queue=0)

dpdk version : dpdk-19.11
firmware-version: 6.80 0x80003cfb 1.2007.0
NIC:
Network devices using DPDK-compatible driver
============================================
0000:05:00.1 'Ethernet Controller XL710 for 40GbE QSFP+ 1583' drv=vfio-pci unused=i40e,igb_uio
I used rte_eth_tx_prepare before rte_eth_tx_burst, and rte_eth_tx_prepare returns ok;
I enable TSO offload.
I found the following mailing list, but the issue seems to be inconclusive : http://mails.dpdk.org/archives/dev/2017-August/073154.html
[Based on the updates from the comments]
Based on the limited logs, the issue lies in the mbuf fields after tx_prepare has updated them. The easiest approach is therefore to compare the TX fields of the mbuf before and after the flags are changed. Please follow these steps:
Use the DPDK API rte_pktmbuf_dump to dump the first 16 bytes of the packet before calling tx_prepare.
After tx_prepare, call the DPDK API rte_pktmbuf_dump again for 16 bytes.
After the TSO update (SW or HW offload), dump the packet contents for 16 bytes with rte_pktmbuf_dump once more.
Compare the mbuf fields, especially those used for TX offloads, for anomalies:
If FAST_MBUF_FREE (DEV_TX_OFFLOAD_MBUF_FAST_FREE) is enabled, refcnt should be 1.
If segments are used, nb_segs should not be greater than the maximum number of segments supported by the NIC.
The total pkt_len should equal the sum of data_len over all chained segments.
The NIC should support every offload flag that is used.
Note: with a valid device configuration there are only a few reasons for a TX descriptor to fail.
[Update] Based on the suggestion, the asker replied: "Thanks for the reminder! I used rte_pktmbuf_dump and found that the m->nb_segs value is not correct when TCP retransmits."
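The mbuf checks listed above can be sketched as a small validator. This is only an illustration: struct seg is a hypothetical stand-in that copies the few rte_mbuf fields involved (next, pkt_len, data_len, nb_segs, refcnt), and check_chain and its error codes are made-up names, so the sketch compiles without DPDK. The max_segs argument is assumed to come from the NIC's advertised limit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the rte_mbuf fields that matter here. */
struct seg {
    struct seg *next;
    uint32_t pkt_len;  /* total packet length, valid in the first segment */
    uint16_t data_len; /* bytes held by this segment */
    uint16_t nb_segs;  /* number of segments, valid in the first segment */
    uint16_t refcnt;
};

enum chain_err {
    CHAIN_OK = 0,
    CHAIN_NULL,
    CHAIN_BAD_REFCNT,    /* fast free requires refcnt == 1 */
    CHAIN_BAD_NB_SEGS,   /* nb_segs does not match the real chain length */
    CHAIN_TOO_MANY_SEGS, /* more segments than the NIC accepts */
    CHAIN_BAD_PKT_LEN    /* pkt_len != sum of data_len */
};

enum chain_err check_chain(const struct seg *head, uint16_t max_segs, int fast_free)
{
    uint32_t total = 0;
    uint16_t count = 0;

    if (head == NULL)
        return CHAIN_NULL;

    for (const struct seg *s = head; s != NULL; s = s->next) {
        if (fast_free && s->refcnt != 1)
            return CHAIN_BAD_REFCNT;
        total += s->data_len;
        count++;
    }
    if (count != head->nb_segs)
        return CHAIN_BAD_NB_SEGS;
    if (count > max_segs)
        return CHAIN_TOO_MANY_SEGS;
    if (total != head->pkt_len)
        return CHAIN_BAD_PKT_LEN;
    return CHAIN_OK;
}
```

Running such a check just before rte_eth_tx_burst on the retransmit path would have flagged the stale nb_segs value reported in the update.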

DEV_TX_OFFLOAD_VXLAN_TNL_TSO Offload Testing - DPDK

I am working on Mellanox ConnectX-5 cards and using DPDK 20.11 with CentOS 8 (4.18.0-147.5.1.el8_1.x86_64).
I want to test the DEV_TX_OFFLOAD_VXLAN_TNL_TSO offload. What should the structure of the packet be (I am using scapy) that I send to the DPDK application, so that this offload comes into action and performs segmentation (since it is a VXLAN_TNL_TSO)?
I am modifying the dpdk-ip_fragmentation example and have added: DEV_TX_OFFLOAD_IP_TNL_TSO inside the port_conf
static struct rte_eth_conf port_conf = {
    .rxmode = {
        .max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE,
        .split_hdr_size = 0,
        .offloads = (DEV_RX_OFFLOAD_CHECKSUM |
                     DEV_RX_OFFLOAD_SCATTER |
                     DEV_RX_OFFLOAD_JUMBO_FRAME),
    },
    .txmode = {
        .mq_mode = ETH_MQ_TX_NONE,
        .offloads = (DEV_TX_OFFLOAD_IPV4_CKSUM |
                     DEV_TX_OFFLOAD_VXLAN_TNL_TSO),
    },
};
And at the ol_flags:
ol_flags |= (PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TUNNEL_VXLAN );
In short, to test this offload it would be great if someone can help me with 2 things:
What should the packet structure be that I should send (using scapy, such that the offload comes into action)?
Required settings to do in the DPDK example application (It is not necessary to use the ip_fragmentation example, any other example would be fine too).
Note: after a 3-hour debug session it became clear that the title and the question as posted are incorrect. The question will be re-edited to reflect the actual requirement: how to enable a DPDK port with TCP TSO offload for tunnelled VXLAN packets.
Answer to the first question, what the scapy settings should be for sending a packet to the DPDK DUT for TSO and receiving the segmented traffic:
Disable all TSO-related offloads on the scapy interface using ethtool -K [scapy interface] rx off tx off tso off gso off gro off lro off
Set the MTU high enough to send larger frames, e.g. 9000.
Send large frames as payload, but keep them smaller than the interface MTU.
Run tcpdump for ingress traffic with the direction flag: tcpdump -eni [scapy interface] -Q in
Answer to the second question, the required settings in the DPDK example application:
The DPDK testpmd application can enable HW and SW TSO offloads, depending on NIC support.
The next best application is tep_termination, but it requires a vhost interface (VM) or DPDK vhost to achieve the same.
Since the requirement targets any generic application such as skeleton or l2fwd, TSO can be enabled as follows:
Use DPDK 20.11 LTS (to get the latest and best support for tunnel TSO).
In the application, check the TX offload capability (tx_offload_capa) with rte_eth_dev_info_get.
Cross-check that HW TSO is supported for tunnelled (VXLAN) packets.
If TSO needs to be done for a UDP payload, check for UDP_TSO support in HW.
Configure the NIC with no multi-segment, jumbo frames, and a max frame length > 9000 bytes.
Receive the packet via rx_burst and ensure it is IPv4, UDP, VXLAN (tunnel) with nb_segs equal to 1.
Set the mbuf's l2_len, l3_len and l4_len fields.
Mark the packet's ol_flags with PKT_TX_IPV4 | PKT_TX_TUNNEL_VXLAN | PKT_TX_TUNNEL_IP. For a UDP inner payload use PKT_TX_TUNNEL_UDP. The segmentation itself is requested by also setting PKT_TX_TCP_SEG.
Then set the segment size (tso_segsz) to DPDK MTU (1500 by default) - l3_len - l4_len.
This lets a PMD that supports HW TSO offload fill in the appropriate descriptor fields so that the given payload is transmitted as multiple packets. In our test case the 9000-byte packet sent by scapy is converted into 7 packets of up to 1500 bytes each. This can be observed in tcpdump.
Note:
Reference code is present in tep_termination and testpmd.
If there is no HW offload, the software rte_gso library is available.
For HW offload, all PMDs as of today require the mbuf to be a single contiguous non-external buffer, so make sure to create the mbuf pool or mempool with elements large enough to receive large packets.
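The segment-size arithmetic from the steps above can be checked with a small sketch. The function names are hypothetical. The formula follows the answer as written (MTU - l3_len - l4_len) and, as an assumption of this sketch, ignores the extra outer headers that a VXLAN tunnel adds.

```c
#include <stdint.h>

/* Segment size as described above: MTU minus the L3 and L4 header lengths. */
uint16_t tso_segsz(uint16_t mtu, uint16_t l3_len, uint16_t l4_len)
{
    return (uint16_t)(mtu - l3_len - l4_len);
}

/* Number of packets the NIC emits for a given L4 payload length (rounded up). */
uint32_t tso_segment_count(uint32_t payload_len, uint16_t segsz)
{
    if (segsz == 0)
        return 0; /* TSO not configured */
    return (payload_len + segsz - 1u) / segsz;
}
```

With a 1500-byte MTU and 20-byte IPv4 and TCP headers the segment size is 1460, and a 9000-byte IP packet (8960 bytes of TCP payload) is split into 7 segments, which is consistent with the 7 packets seen in the test case.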

How dpdk disable `CRC strip`, `header split`, `IP checksum offload`, and `jumbo frame support`

In the older version of dpdk, the struct rte_eth_rxmode has these members.
struct rte_eth_rxmode {
header_split = 0, /**< Header Split disabled */
hw_ip_checksum = 0, /**< IP checksum offload disabled */
hw_vlan_filter = 0, /**< VLAN filtering disabled */
jumbo_frame = 0, /**< Jumbo Frame Support disabled */
hw_strip_crc = 0, /**< CRC stripped by hardware */
...
}
But after updating to dpdk-stable-19.11.3, these members are removed. According to the docs, the testpmd app supports command-line options such as --disable-crc-strip but these are not EAL command-line options. How can I disable these five options listed above in dpdk-stable-19.11.3? Or are these options disabled by default? If so, how can I check these status?
In addition, the member variable txq_flags of struct rte_eth_txconf is also removed from dpdk-stable-19.11.3. How can I set this in dpdk-stable-19.11.3?
I haven't used DPDK for a long time. It has changed a lot, and I am struggling with these changes. Is there a suggested way to catch up with them?
With DPDK 19.11.3 one can enable the desired features (crc-keep, jumbo, ipv4-cksum, and header split) programmatically, either by editing the default configuration as
static struct rte_eth_conf port_conf = {
    .rxmode = {
        .max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE,
        .split_hdr_size = 0,
        .offloads = DEV_RX_OFFLOAD_JUMBO_FRAME | DEV_RX_OFFLOAD_KEEP_CRC | DEV_RX_OFFLOAD_IPV4_CKSUM | DEV_RX_OFFLOAD_HEADER_SPLIT,
    },
    .txmode = {
        .mq_mode = ETH_MQ_TX_NONE,
    }
};
or by modifying the offload features in port_init after fetching the device capabilities and comparing against them:
port_conf.rxmode.offloads |= DEV_RX_OFFLOAD_JUMBO_FRAME | DEV_RX_OFFLOAD_HEADER_SPLIT | DEV_RX_OFFLOAD_KEEP_CRC | DEV_RX_OFFLOAD_IPV4_CKSUM;
Note: only a handful of NICs support DEV_RX_OFFLOAD_HEADER_SPLIT, so requesting it will most likely fail in port_init. Use http://doc.dpdk.org/guides/nics/overview.html as a generic guide to offload features.
For testpmd, see https://doc.dpdk.org/guides/testpmd_app_ug/run_app.html#eal-command-line-options for enabling the features:
--max-pkt-len=[size] - enables jumbo frames
--disable-crc-strip - keeps the CRC instead of stripping it
--enable-rx-cksum - enables HW checksum (including the IPv4 checksum)
Note: DEV_RX_OFFLOAD_HEADER_SPLIT does not appear to have been added to testpmd, as not many NIC PMDs support it.
If a requested feature is not supported by the NIC PMD, one can expect error messages like
Ethdev port_id=0 requested Rx offloads 0x2000e doesn't match Rx offloads capabilities 0x92e6f in rte_eth_dev_configure()
For a more detailed description, run with --log-level=pmd,8
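The two bitmasks in that error message can be decoded by hand. The following is a minimal sketch with hypothetical helper names; in a real application the capability mask would be dev_info.rx_offload_capa as returned by rte_eth_dev_info_get().

```c
#include <stdint.h>

/* Offload bits that were requested but are not advertised by the device. */
uint64_t unsupported_offloads(uint64_t requested, uint64_t capa)
{
    return requested & ~capa;
}

/* Subset of the requested offloads that the device can actually provide. */
uint64_t supported_offloads(uint64_t requested, uint64_t capa)
{
    return requested & capa;
}
```

For the message above, unsupported_offloads(0x2000e, 0x92e6f) yields 0x20000, so a single requested bit is missing from the capabilities. Masking the request with supported_offloads() before calling rte_eth_dev_configure() avoids the failure, at the cost of silently dropping that feature.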
Yes, in DPDK 19.11 hardware offloads are enabled through a single member field, uint64_t offloads, in struct rte_eth_rxmode, unlike the individual offload parameters of older DPDK versions.
In addition, hardware offloads in 19.11 are divided into per-port and per-queue offloads. The per-port and per-queue offloads that a device supports can be fetched using rte_eth_dev_info_get().
As shown below, the offloads fields in struct rte_eth_rxmode and struct rte_eth_rxconf are used to set per-port and per-queue offloads respectively.
struct rte_eth_rxmode {
    ...
    /**
     * Per-port Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
     * Only offloads set on rx_offload_capa field on rte_eth_dev_info
     * structure are allowed to be set.
     */
    uint64_t offloads;
    ...
};
struct rte_eth_rxconf {
    ...
    /**
     * Per-queue Rx offloads to be set using DEV_RX_OFFLOAD_* flags.
     * Only offloads set on rx_queue_offload_capa or rx_offload_capa
     * fields on rte_eth_dev_info structure are allowed to be set.
     */
    uint64_t offloads;
    ...
};
Note: the offloads a device is capable of are enabled using the DEV_RX_OFFLOAD_* macros defined here - Rx offload capabilities of a device
For testpmd you can set the offloads as a bitmask of DEV_RX_OFFLOAD_* flags, as shown below, provided the device supports the features:
--rx-offloads=0xXXXXXXXX: hexadecimal bitmask of RX queue offloads
--tx-offloads=0xXXXXXXXX: hexadecimal bitmask of TX queue offloads
I haven't used DPDK for a long time. It has changed a lot, and I am struggling with these changes. Is there a suggested way to catch up with them?
I would suggest registering on the DPDK development mailing list (dev@dpdk.org) to follow upstream patches and updates.

DPDK 18.11 HW checksum support for X722 NIC?

I am running dpdk-stable-18.11.8 on Centos 7, targeting an Intel X722 NIC.
I want ipv4 and udp header checksums to be calculated by hardware, so I set the device configuration to:
struct rte_eth_conf local_port_conf;
memset(&local_port_conf, 0, sizeof(struct rte_eth_conf));
local_port_conf.rxmode.split_hdr_size = 0;
local_port_conf.txmode.mq_mode = ETH_MQ_TX_NONE;
local_port_conf.txmode.offloads = DEV_TX_OFFLOAD_OUTER_UDP_CKSUM | DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM;
rte_eth_dev_configure(0,1,1,&local_port_conf);
rte_eth_dev_configure returns:
0xffffffea (-22)
Does this mean that DPDK 18.11 doesn't support checksum offload to the X722 NIC?
DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM is used for the outer header of tunnelled packets, for which the X710 has to be loaded with DDP. If the intent is a normal packet, DEV_TX_OFFLOAD_IPV4_CKSUM is to be used.
Note: the right way to configure any DPDK port is to first fetch its capabilities with rte_eth_dev_info_get, then check dev_info.tx_offload_capa & DEV_TX_OFFLOAD_IPV4_CKSUM and configure the offload only if it is present.

No DPDK packet fragmentation supported in Mellanox ConnectX-3?

Hello Stackoverflow Experts,
I am using DPDK on a Mellanox NIC, but am struggling with applying packet fragmentation in a DPDK application.
sungho@c3n24:~$ lspci | grep Mellanox
81:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
The DPDK applications (l3fwd, ip-fragmentation, ip-assemble) did not recognize the received packets as having an IPv4 header.
At first I crafted my own packets when sending IPv4 headers, so I assumed that I was crafting the packets in a wrong way.
So I used DPDK-pktgen, but the DPDK applications (l3fwd, ip-fragmentation, ip-assemble) still did not recognize the IPv4 header.
As a last resort, I tested dpdk-testpmd and found this in the status info.
********************* Infos for port 1 *********************
MAC address: E4:1D:2D:D9:CB:81
Driver name: net_mlx4
Connect to socket: 1
memory allocation on the socket: 1
Link status: up
Link speed: 10000 Mbps
Link duplex: full-duplex
MTU: 1500
Promiscuous mode: enabled
Allmulticast mode: disabled
Maximum number of MAC addresses: 127
Maximum number of MAC addresses of hash filtering: 0
VLAN offload:
strip on
filter on
qinq(extend) off
No flow type is supported.
Max possible RX queues: 65408
Max possible number of RXDs per queue: 65535
Min possible number of RXDs per queue: 0
RXDs number alignment: 1
Max possible TX queues: 65408
Max possible number of TXDs per queue: 65535
Min possible number of TXDs per queue: 0
TXDs number alignment: 1
testpmd> show port
According to the DPDK documentation, the supported flow types should be shown in the info status of port 1, but mine shows that no flow type is supported.
The below example should be the one that needs to be displayed in flow types:
Supported flow types:
ipv4-frag
ipv4-tcp
ipv4-udp
ipv4-sctp
ipv4-other
ipv6-frag
ipv6-tcp
ipv6-udp
ipv6-sctp
ipv6-other
l2_payload
port
vxlan
geneve
nvgre
So does my NIC, a Mellanox ConnectX-3, not support DPDK IP fragmentation? Or is there additional configuration that needs to be done before trying out packet fragmentation?
-- [EDIT]
So I have checked the packets from DPDK-PKTGEN and the packets received by DPDK application.
The packets that I receive is the exact one that I have sent from the application. (I get the correct data)
The problem begins at the code
struct rte_mbuf *pkt
RTE_ETH_IS_IPV4_HDR(pkt->packet_type)
This determines whether the packet is IPv4 or not.
The value of pkt->packet_type is zero both for packets from DPDK-PKTGEN and for packets from my own application, and if pkt->packet_type is zero the DPDK application treats the packet as NOT having an IPv4 header.
This basic type check is wrong from the start.
So what I believe is that either the DPDK sample is wrong or the NIC cannot support IPv4 for some reason.
The data I received has a pattern: at the beginning I receive the correct message, but the packets that follow have different data between the MAC address and the data offset.
So what I assume is that they are interpreting the data differently and getting the wrong result.
I am pretty sure any NIC, including Mellanox ConnectX-3 MUST support ip fragments.
The flow type you are referring to is for the Flow Director, i.e. mapping specific flows to specific RX queues. Even if your NIC does not support Flow Director, that does not matter for IP fragmentation.
I guess there is an error in the setup or in the app. You wrote:
the dpdk application did not recognized the received packet as the ipv4 header.
I would look into this more closely. Try to dump those packets with dpdk-pdump, or even by simply dumping the received packet on the console with rte_pktmbuf_dump().
If you still suspect the NIC, the best option would be to temporarily substitute it with another brand or a virtual device, just to confirm it is indeed the NIC.
EDIT:
Have a look at mlx4_ptype_table: for fragmented IPv4 packets it should return packet_type set to RTE_PTYPE_L2_ETHER | RTE_PTYPE_L3_IPV4_EXT_UNKNOWN | RTE_PTYPE_L4_FRAG
Please note the functionality was added in DPDK 17.11.
I suggest you dump pkt->packet_type on the console to make sure it is indeed zero. Also make sure you have the latest libmlx4 installed.
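If packet_type really is zero, one possible workaround (an assumption on my side, not something the PMD or the sample applications do) is to classify the frame in software from the EtherType instead of relying on the hardware ptype. A minimal sketch on a raw frame buffer, with hypothetical helper names and untagged Ethernet assumed:

```c
#include <stddef.h>
#include <stdint.h>

#define ETHER_HDR_LEN    14
#define ETHER_TYPE_IPV4  0x0800
#define IPV4_MIN_HDR_LEN 20

/* Returns 1 if the frame carries an IPv4 header right after the Ethernet header. */
int frame_is_ipv4(const uint8_t *frame, size_t len)
{
    if (len < ETHER_HDR_LEN + IPV4_MIN_HDR_LEN)
        return 0;
    uint16_t ether_type = (uint16_t)((frame[12] << 8) | frame[13]);
    if (ether_type != ETHER_TYPE_IPV4)
        return 0;
    return (frame[ETHER_HDR_LEN] >> 4) == 4; /* IP version nibble */
}

/* Returns 1 if the IPv4 packet is a fragment (MF flag set or non-zero offset). */
int ipv4_is_fragment(const uint8_t *frame, size_t len)
{
    if (!frame_is_ipv4(frame, len))
        return 0;
    const uint8_t *ip = frame + ETHER_HDR_LEN;
    uint16_t frag = (uint16_t)((ip[6] << 8) | ip[7]);
    return (frag & 0x3fff) != 0; /* 0x2000 = MF, 0x1fff = fragment offset */
}
```

DPDK's rte_net_get_ptype() performs a fuller software parse of the same kind and fills in a packet type that can then be tested with RTE_ETH_IS_IPV4_HDR.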