No DPDK packet fragmentation supported in Mellanox ConnectX-3? - dpdk

Hello Stackoverflow Experts,
I am using DPDK on Mellanox NIC, but am struggling with applying the packet
fragmentation in DPDK application.
sungho#c3n24:~$ lspci | grep Mellanox
81:00.0 Ethernet controller: Mellanox Technologies MT27500 Family
[ConnectX-3]
the dpdk application(l3fwd, ip-fragmentation, ip-assemble) did not
recognized the received packet as the ipv4 header.
At first, I have crafted my own packets when sending ipv4 headers so I
assumed that I was crafting the packets in a wrong way.
So I have used DPDK-pktgen but dpdk-application (l3fwd, ip-fragmentation,
ip-assemble) did not recognized the ipv4 header.
As the last resort, I have tested the dpdk-testpmd, and found out this in
the status info.
********************* Infos for port 1 *********************
MAC address: E4:1D:2D:D9:CB:81
Driver name: net_mlx4
Connect to socket: 1
memory allocation on the socket: 1
Link status: up
Link speed: 10000 Mbps
Link duplex: full-duplex
MTU: 1500
Promiscuous mode: enabled
Allmulticast mode: disabled
Maximum number of MAC addresses: 127
Maximum number of MAC addresses of hash filtering: 0
VLAN offload:
strip on
filter on
qinq(extend) off
No flow type is supported.
Max possible RX queues: 65408
Max possible number of RXDs per queue: 65535
Min possible number of RXDs per queue: 0
RXDs number alignment: 1
Max possible TX queues: 65408
Max possible number of TXDs per queue: 65535
Min possible number of TXDs per queue: 0
TXDs number alignment: 1
testpmd> show port
According to DPDK documentation.
in the flow type of the info status of port 1 should show, but mine shows
that no flow type is supported.
The below example should be the one that needs to be displayed in flow types:
Supported flow types:
ipv4-frag
ipv4-tcp
ipv4-udp
ipv4-sctp
ipv4-other
ipv6-frag
ipv6-tcp
ipv6-udp
ipv6-sctp
ipv6-other
l2_payload
port
vxlan
geneve
nvgre
So Is my NIC, Mellanox Connect X-3 does not support DPDK IP fragmentation? Or is
there additional configuration that needs to be done before trying out the packet fragmentation?
-- [EDIT]
So I have checked the packets from DPDK-PKTGEN and the packets received by DPDK application.
The packets that I receive is the exact one that I have sent from the application. (I get the correct data)
The problem begins at the code
struct rte_mbuf *pkt
RTE_ETH_IS_IPV4_HDR(pkt->packet_type)
This determines the whether the packet is ipv4 or not.
and the value of pkt->packet_type is both zero from DPDK-PKTGEN and DPDK application. and if the pkt-packet_type is zero then the DPDK application reviews this packet as NOT IPV4 header.
This basic type checker is wrong from the start.
So what I believe is that either the DPDK sample is wrong or the NIC cannot support ipv4 for some reason.
The data I received have some pattern at the beginning I receive the correct message but after that sequence of packets have different data between the MAC address and the data offset
So what I assume is they are interpreting the data differently, and getting the wrong result.

I am pretty sure any NIC, including Mellanox ConnectX-3 MUST support ip fragments.
The flow type you are referring is for the Flow Director, i.e. mapping specific flows to specific RX queues. Even if your NIC does not support flow director, it does not matter for the IP fragmentation.
I guess there is an error in the setup or in the app. You wrote:
the dpdk application did not recognized the received packet as the ipv4 header.
I would look into this more closely. Try to dump those packets with dpdk-pdump or even by simply dumping the receiving packet on the console with rte_pktmbuf_dump()
If you still suspect the NIC, the best option would be to temporary substitute it with another brand or a virtual device. Just to confirm it is the NIC indeed.
EDIT:
Have a look at mlx4_ptype_table for fragmented IPv4 packets it should return packet_type set to RTE_PTYPE_L2_ETHER | RTE_PTYPE_L3_IPV4_EXT_UNKNOWN | RTE_PTYPE_L4_FRAG
Please note the functionality was added in DPDK 17.11.
I suggest you to dump pkt->packet_type on console to make sure it is zero indeed. Also make sure you have the latest libmlx4 installed.

Related

i40e XL710 send packets failed on dpdk19.11, report TX descriptor 219 is not done(port=0 queue=0)

dpdk version : dpdk-19.11
firmware-version: 6.80 0x80003cfb 1.2007.0
NIC:
Network devices using DPDK-compatible driver
============================================
0000:05:00.1 'Ethernet Controller XL710 for 40GbE QSFP+ 1583' drv=vfio-pci unused=i40e,igb_uio
I used rte_eth_tx_prepare before rte_eth_tx_burst, and rte_eth_tx_prepare returns ok;
I enable TSO offload.
I found the following mailing list, but the issue seems to be inconclusive : http://mails.dpdk.org/archives/dev/2017-August/073154.html
[Based on the updates from the comments]
Based on the limited logs, the issue lies within MBUF descriptor fields after the update of tx_prepare. Hence easiest ways to compare the TX fields of mbuf before and after making the flags changes. Please follow the steps
Use DPDK API rte_pktmbuf_dump and dump the packet contents before calling tx_prepare for 16B.
After tx_prepare invoke DPK API rte_pktmbuf_dump for 16B
After update of TSO (SW or HW offload), dump the packet contents for 16B using rte_pktmbuf_dump
Compare the MBUF fields, especially for TX offloads for anomalies.
If FAST_MBUF_FREE is enabled, n_ref should be 1.
If segments are used, n_seg should not be greater than MAX_SEG supported by NIC
Total pkt_len should be equal to the sum of all data_len of each chained segment
Offload NIC should support flags used.
Note: there is a minimal reason why a TX descriptor will fail for a valid device_configure.
[Update] based on the suggestion, it is updated as thanks for the remainder!I used rte_pktmbuf_dump and found that m->nb_segs value is not correct when tcp retransmits. .

DEV_TX_OFFLOAD_VXLAN_TNL_TSO Offload Testing - DPDK

I am working on Mellanox ConnectX-5 cards and using DPDK 20.11 with CentOS 8 (4.18.0-147.5.1.el8_1.x86_64).
I wanted to test the DEV_TX_OFFLOAD_VXLAN_TNL_TSO offload and what I want to ask is that what should the packet structure be like (I am using scapy) that I should send to the DPDK application such that this offload will come into action and perform segmentation (since it is a VXLAN_TNL_TSO).
I am modifying the dpdk-ip_fragmentation example and have added: DEV_TX_OFFLOAD_IP_TNL_TSO inside the port_conf
static struct rte_eth_conf port_conf = {
.rxmode = {
.max_rx_pkt_len = JUMBO_FRAME_MAX_SIZE,
.split_hdr_size = 0,
.offloads = (DEV_RX_OFFLOAD_CHECKSUM |
DEV_RX_OFFLOAD_SCATTER |
DEV_RX_OFFLOAD_JUMBO_FRAME),
},
.txmode = {
.mq_mode = ETH_MQ_TX_NONE,
.offloads = (DEV_TX_OFFLOAD_IPV4_CKSUM |
DEV_TX_OFFLOAD_VXLAN_TNL_TSO
),
},
};
And at the ol_flags:
ol_flags |= (PKT_TX_IPV4 | PKT_TX_IP_CKSUM | PKT_TX_TUNNEL_VXLAN );
In short, to test this offload it would be great if someone can help me with 2 things:
What should the packet structure be that I should send (using scapy, such that the offload comes into action)?
Required settings to do in the DPDK example application (It is not necessary to use the ip_fragmentation example, any other example would be fine too).
note: Based on the 3 hours debug session, it is been clarified the title and question shared is incorrect. Hence the question will be re-edited to reflect actual requirement as how enable DPDK port with TCP-TSO offloads for tunnelled VXLAN packets.
Answer to the first question what should be scapy settings for sending a packet to DPDK DUT for TSO and receiving segmented traffic is
Disable all TSO related offload on the SCAPY interface using ethtool -K [scapy interface] rx off tx off tso off gso off gro off lro off
Set MTU to send larger frames like 9000
Ensure to send large frames as payload but less than interface MTU.
Run tcpdump for ingress traffic with the directional flag as tcpdump -eni [scapy interface] -Q in
Answers to the second question Required settings to do in the DPDK example application is as follows
dpdk testpmd application can enable HW and SW TSO offloads based on NIC support.
next best application is tep_termination, but requires vhost interface (VM) or DPDK vhost to achieve the same.
Since the requirement is targeted for any generic application like skeleton, l2fwd, one can enable as follows
Ensure to use DPDK 20.11 LTS (to get the latest and best support for TUNNEL TSO)
In application check for tx_offload capability with dev_get_info API.
Cross-check for HW TSO for tunnelled (VXLAN) packets.
If the TSO needs to done for UDP payload, check for UDP_TSO support in HW.
configure the NIC with no-multisegment, jumbo frame, max frame len > 9000 Bytes.
Receive the packet via rx_burst, and ensure the packet is ipv4, UDP, VXLAN (tunnel) with nb_segs as 1.
modify the mbuf to point to l2_len, l3_len, l4_len.
mark the packets ol_flags as PKT_TX_IPV4 | PKT_TX_TUNNEL_VXLAN | PKT_TX_TUNNEL_IP. For UDP inner payload PKT_TX_TUNNEL_UDP.
then set the segment size as DPDK MTU (1500 as default) - l3_len - l4_len
This will enable the PMD which support HW TSO offload to update appropriate fields in descriptors for the given payload to transmitted as multiple packets. For our test case scapy send packet of 9000 bytes will be converted into 7 * 1500 byte packets. this can be observed as part tcpdump.
Note:
the reference code is present in tep_termination and test_pmd.
If there is no HW offload, SW library of rte gso is available.
For HW offload all PMD as of today require the MBUF is a continuous single non-external buffer. So make sure to create mbufpool or mempool with sufficient size for receiving large packets.

DPDK: MPLS packet processing

I am trying to build a multi-RX-queue dpdk program, using RSS to split the incoming traffic into RX queues on a single port. Mellanox ConnectX-5 and DPDK Version 19.11 is used for this purpose. It works fine when I use IP over Ethernet packets as input. However when the packet contains IP over MPLS over Ethernet, RSS does not seems to work. As a result, all packets belonging to various flows (with different src & dst IPs, ports) over MPLS are all sent into the same RX queue.
My queries are
Is there any parameter/techniques in DPDK to distribute MPLS packets to multiple RX queues?
Is there any way to strip off MPLS tags (between Eth and IP) in hardware, something like hw_vlan_strip?
My Port configuration is
const struct rte_eth_conf default_port_conf = {
.rxmode = {
.hw_vlan_strip = 0, /* VLAN strip enabled. */
.header_split = 0, /* Header Split disabled. */
.hw_ip_checksum = 0, /* IP checksum offload disabled. */
.hw_strip_crc = 0, /* CRC stripping by hardware disabled. */
},
.rx_adv_conf = {
.rss_conf = {
.rss_key = NULL,
.rss_key_len = 0,
.rss_hf = ETH_RSS_IP,
},
} };
The requirement of POP_MPLS and RSS on MPLS can be activated via RTE_FLOW for supported NIC PMD. But mellanox mxl5 PMD supports only RTE_FLOW_ACTION_TYPE_OF_PUSH_VLAN & RTE_FLOW_ACTION_TYPE_OF_PUSH_VLAN. Only options supported for tunneled packets by mxl5 PMD are MPLSoGRE, MPLSoUD. Hence POP MPLS in HW via PMD is not possible on MXL5 PMD for DPDK 19.11 LTS
For any PMD RSS is reserved for outer/inner IP address along with TCP/UDP/SCTP port numbers. Hence I have to interpret RSS for MPLS as I would like to distribute/ spread packets with different MPLS to various queues. This can be achieved by again using RTE_FLOW for RTE_FLOW_ITEM_TYPE_MPLS and action field as RTE_FLOW_ACTION_TYPE_QUEUE. Using mask/range fields one can set patterns which can satisfy condition as 2 ^ 20 (MPLS id max value) / number of RX queues. hence the recommendation is to use RTE_FLOW_ITEM_TYPE_MPLS from RTE_FLOW and RTE_FLOW_ACTION_TYPE_QUEUE. But there is no IP/PORT RSS hashing for the same.
to test the same you can use
DPDK testpmd and set the flow rules or
make use of RTE_FLOW code snippet from rte_flow link
note: for POP MPLS I highly recommend to use PTYPES to identify the metadata and use RX-callabck to modify the packet header.

pcap ethertype not consistent with ip version

I'm using pcap to capture ip (both v4 and v6) packets on my router. It works just fine but I've noticed that sometimes the ethertype of an ethernet frame (LINKTYPE_ETHERNET) or a linux cooked capture encapsulation (LINKTYPE_LINUX_SLL) does not correctly indicate the version of the ip packet they contain.
I was expecting that if I get a frame whose ethertype is 0x0800 (ETHERTYPE_IP) then it should contain an ipv4 packet with version == 4 and if I get a frame whose ethertype is 0x86DD (ETHERTYPE_IPV6) then it should contain an ipv6 packet with version == 6.
Most of the time the above is true but sometimes it's not. I would get a frame whose ethertype is ETHERTYPE_IP but somehow it contains an ipv6 packet or I get a frame whose ethertype is ETHERTYPE_IPV6 but it contains an ipv4 packet.
I seem to have heard "ipv4 over ipv6" or "ipv6 over ipv4" but I don't know exactly how they work or if they apply to my problem, but otherwise I'm not sure what's causing this inconsistency.
EDIT
I think my actually question is whether such behavior is normal. If so should I simply ignore the ethertype field and just check the version field in the ip header to determine if it's ipv4 or ipv6.
According to my (limited) understanding, both IPv4 and IPv6 can appear after an IPv4 (0x800) Ethernet type (ethtype). This is related to transmitting IPv4 packets over IPv6. When the ethtype is 0x800 and the IP header is version 6, then the address in the IP header is an IPv4 address mapped to IPv6.
One example that shows this is Linux UDP receive code, which checks for ethtype 0x800 and then converts the source address to ipv4 using ipv6_addr_set_v4mapped

How to send and receive data using DPDK

I have a quad port Intel 1G network card. I am using DPDK to send data on one physical port and receive on another.
I saw a few examples in DPDK code, but could not make it work. If anybody knows how to do that please send me simple instructions so I can follow and understand. I setup my PC properly for huge pages, loading driver, and assigning network port to use dpdk driver etc... I can run helloworld from DPDK so system setup looks ok to me.
Thanks in advance.
temp5556
After building DPDK:
cd to the DPDK directory.
Run sudo build/app/testpmd -- --interactive
You should see output like this:
$ sudo build/app/testpmd -- --interactive
EAL: Detected 8 lcore(s)
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Multi-process socket /var/run/.rte_unix
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0002:00:02.0 on NUMA socket 0
EAL: probe driver: 15b3:1004 net_mlx4
PMD: net_mlx4: PCI information matches, using device "mlx4_0" (VF: true)
PMD: net_mlx4: 1 port(s) detected
PMD: net_mlx4: port 1 MAC address is 00:0d:3a:f4:6e:17
Interactive-mode selected
testpmd: create a new mbuf pool <mbuf_pool_socket_0>: n=203456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Warning! port-topology=paired and odd forward ports number, the last port
will pair with itself.
Configuring Port 0 (socket 0)
Port 0: 00:0D:3A:F4:6E:17
Checking link statuses...
Done
testpmd>
Don't worry about the "No free hugepages" message. It means it couldn't find any 1024 MB hugepages but it since it continued OK, it must have found some 2 MB hugepages. It'd be nice if it said "EAL: Using 2 MB huge pages" instead.
At the prompt type, start tx_first, then quit. You should see something like:
testpmd> start tx_first
io packet forwarding - ports=1 - cores=1 - streams=1 - NUMA support enabled, MP over anonymous pages disabled
Logical Core 1 (socket 0) forwards packets on 1 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
io packet forwarding packets/burst=32
nb forwarding cores=1 - nb forwarding ports=1
port 0:
CRC stripping enabled
RX queues=1 - RX desc=1024 - RX free threshold=0
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX queues=1 - TX desc=1024 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX RS bit threshold=0 - TXQ offloads=0x0
testpmd> quit
Telling cores to stop...
Waiting for lcores to finish...
---------------------- Forward statistics for port 0 ----------------------
RX-packets: 0 RX-dropped: 0 RX-total: 0
TX-packets: 32 TX-dropped: 0 TX-total: 32
----------------------------------------------------------------------------
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 0 RX-dropped: 0 RX-total: 0
TX-packets: 32 TX-dropped: 0 TX-total: 32
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
In my system there is only one DPDK port, so I sent 32 packets but did not receive any. If I had a multi-port card with a cable directly between the ports, then I'd see the RX count also increase.
you can use TESTPMD to test DPDK.
TestPMD can work as a packet generator (tx_only mode) , a receiver (rx_only mode) , or a forwarder(io mode).
you will need generator nodes to be connected to your box if you are willing to use TESTPMD as a forwarder only.
I propose that you start with the following examples :
generator(pktgen) ------> testPMD (io mode )----------> recevier (testPMD rx_only mode).
at the pktgen generator specify the mac address destination which is the MAC address of the receive's receiving PORT.
PKTGEN and how it works in detail is explained more in this link :
http://pktgen.readthedocs.io/en/latest/getting_started.html
TESTPMD and how it works is explained here :
http://www.intel.com/content/dam/www/public/us/en/documents/guides/dpdk-testpmd-application-user-guide.pdf
I hope this helps.