DPDK: Problem with Rx offloads capabilities

I'm trying to set environments to test programs written in the P4 language, using t4p4s as the compiler. When a P4 program is compiled with t4p4s, a C/C++ program using DPDK is generated, which in turn is compiled and run.
Compiling the program works fine. The resulting executable is run like so:
./build/l2switch -c 0xc -n 4 --no-pci --vdev net_pcap0,rx_iface_in=veth3-s,tx_iface=veth3-s,tx_iface=veth3-s --vdev net_pcap1,rx_iface_in=veth5-s,tx_iface=veth5-s,tx_iface=veth5-s -- -p 0x3 --config "\"(0,0,2)(1,0,3)\""
On a Raspberry Pi, this works with every network interface I've tried so far (virtual ethernet devices as seen in the command above, the built-in Ethernet port and a Realtek USB NIC).
Inside an Ubuntu 21.04 VM using virtual ethernet devices, I get the following error:
--- Init switch
EAL: Detected CPU lcores: 4
EAL: Detected NUMA nodes: 1
EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: VFIO support initialized
TELEMETRY: No legacy callbacks, legacy socket not created
2 :::: Allocating DPDK mbuf pool on socket 0
2 :::: Init ports
2 :::: Init port 0
2 :::: Creating queues: nb_rxq=1 nb_txq=2
Ethdev port_id=0 requested Rx offloads 0xe doesn't match Rx offloads capabilities 0x0 in rte_eth_dev_configure()
EAL: Error - exiting with code: 1
Cause: Cannot configure device: err=-22, port=0
I've had this problem pop up on the Raspberry Pis too, but it would fix itself after a restart or enough tries. On the VM, this problem is persistent.
Questions:
In both cases, I'm using virtual ethernet devices for the interfaces, and both cases use the same driver and software NIC. How can I find out what the difference between the VM and the Raspberry Pi is? After all, if there were no difference, it would work in both cases.
What does the error try to tell me? I've tried searching for it online to no avail and my knowledge of DPDK is very limited.
What can I try in order to fix this issue?

Solved it!
While looking through the files to find the program listing #stackinside requested, I found an argument of t4p4s called "vethmode". There are plenty of arguments like this, and I've yet to find complete documentation for them. Turning it on results in the macro T4P4S_VETH_MODE being defined when the C program is compiled, which in turn changes the composition of the struct rte_eth_conf that is passed to rte_eth_dev_configure some time later.
For the sake of completeness, here is the relevant file.
Line 40 is where the struct rte_eth_conf is defined/initialized.
Line 244 is the start of the function in which the call to rte_eth_dev_configure (Line 261) fails.
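For context, the requested offload mask 0xe in the error message corresponds to the IPv4, UDP and TCP checksum Rx offloads, which the net_pcap vdev (capabilities 0x0) cannot provide. Below is a minimal sketch, under the assumption that t4p4s guards the offload flags with that macro and using current DPDK flag names; the exact flags t4p4s requests may differ:

#include <rte_ethdev.h>

static struct rte_eth_conf port_conf = {
    .rxmode = {
#ifdef T4P4S_VETH_MODE
        /* Software NICs such as net_pcap report rx_offload_capa == 0,
         * so request no Rx offloads at all. */
        .offloads = 0,
#else
        /* Hardware NICs can honor checksum offloads; this mask is 0xe. */
        .offloads = RTE_ETH_RX_OFFLOAD_IPV4_CKSUM |
                    RTE_ETH_RX_OFFLOAD_UDP_CKSUM |
                    RTE_ETH_RX_OFFLOAD_TCP_CKSUM,
#endif
    },
};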

Related

Proper DPDK device and port initialization for transmission

While writing a simple DPDK packet generator I noticed some additional initialization steps that are required for reliable and successful packet transmission:
calling rte_eth_link_get() or rte_eth_timesync_enable() after rte_eth_dev_start()
waiting 2 seconds before sending the first packet with rte_eth_tx_burst()
These steps are necessary when I use the DPDK ixgbe driver (device bound via vfio-pci) with an Intel X553 NIC.
When I'm using the AF_PACKET DPDK driver, it works without those extra steps.
Is this a bug or a feature?
Is there a better way than waiting 2 seconds before the first send?
Why is the wait required with the ixgbe driver? Is this a limitation of that NIC or the involved switch (Mikrotik CRS326 with Marvell chipset)?
Is there a more idiomatic function to call than rte_eth_link_get() in order to complete the device initialization for transmission?
Is there some way to keep the VFIO NIC initialized (while keeping its link up) to avoid re-initializing it over and over again during the usual edit/compile/test cycle? (i.e. to speed up that cycle ...)
Additional information: When I connect the NIC to a mirrored port (which is configured via Mikrotik's mirror-source/mirror-target ethernet switch settings) and the sleep(2) is removed then I see the first packet transmitted to the mirror target but not to the primary destination. Thus, it seems like the sleep is necessary to give the switch some time after the link is up (after the dpdk program start) to completely initialize its forwarding table or something like that?
Waiting just 1 second before the first transmission works less reliably, i.e. the packet reaches the receiver only every other time.
My device/port initialization procedure implements the following setup sequence (a condensed code sketch follows the list):
rte_eth_dev_count_avail()
rte_eth_dev_is_valid_port()
rte_eth_dev_info_get()
rte_eth_dev_adjust_nb_rx_tx_desc()
rte_eth_dev_configure(port_id, 0 /* rxrings */, 1 /* txrings */, &port_conf)
rte_eth_tx_queue_setup()
rte_eth_dev_start()
rte_eth_macaddr_get()
rte_eth_link_get() // <-- REQUIRED!
rte_eth_dev_get_mtu()
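A condensed sketch of that sequence, assuming DPDK 20.11 names and a TX-only port; mempool setup and error handling are trimmed, and the descriptor count is a placeholder:

#include <rte_ethdev.h>

static int port_init_tx_only(uint16_t port_id)
{
    struct rte_eth_conf port_conf = {0};
    struct rte_eth_link link;
    uint16_t nb_rxd = 0, nb_txd = 512;

    if (!rte_eth_dev_is_valid_port(port_id))
        return -1;
    rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd);
    rte_eth_dev_configure(port_id, 0 /* rxrings */, 1 /* txrings */, &port_conf);
    rte_eth_tx_queue_setup(port_id, 0, nb_txd,
                           rte_eth_dev_socket_id(port_id), NULL);
    rte_eth_dev_start(port_id);
    /* For ixgbe this call blocks until the PHY reports link-up (or a
     * driver-internal timeout expires), which is the side effect the
     * question is about. */
    rte_eth_link_get(port_id, &link);
    return link.link_status == ETH_LINK_UP ? 0 : -1;
}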
Without rte_eth_link_get() (or rte_eth_timesync_enable()) the first transmitted packet doesn't even show up on the mirrored port.
The above functions (and rte_eth_tx_burst()) complete successfully with or without rte_eth_link_get()/sleep(2) being present. In particular, the MAC address and MTU that are read back have the expected values (MTU is 1500), and rte_eth_tx_burst() returns 1 for a burst of one UDP packet.
The returned link status is: Link up at 1 Gbps FDX Autoneg
The fact that rte_eth_link_get() can be replaced with rte_eth_timesync_enable() probably can be explained by the latter calling ixgbe_start_timecounters() which calls rte_eth_linkstatus_get() which is also called by rte_eth_link_get().
I've checked the DPDK examples and most of them don't call rte_eth_link_get() before sending something. There is also no sleep after device initialization.
I'm using DPDK 20.11.2.
Even more information - to answer the comments:
I'm running this on Fedora 33 (5.13.12-100.fc33.x86_64).
Ethtool reports: firmware-version: 0x80000877
I had called rte_eth_timesync_enable() in order to work with the transmit timestamps. However, during debugging I removed it to arrive at a minimal reproducer. At that point I noticed that removing it actually made things worse (i.e. no packet transmitted over the mirror port). I thus investigated what part of that function might make the difference and found rte_eth_link_get(), which has similar side effects.
When switching to AF_PACKET I'm using the stock ixgbe kernel driver, i.e. ixgbe with default settings on a device that is initialized by networkd (dhcp enabled).
My expectation was that when rte_eth_dev_start() returns, the link is up and the device is ready for transmission.
However, it would be nice, I guess, if one could avoid resetting the device after program restarts. I don't know if DPDK supports this.
Regarding delays: I've just tested the following: rte_eth_link_get() can be omitted if I increase the sleep to 6 seconds. Whereas a call to rte_eth_link_get() takes 3.3 s. So yeah, it's probably just helping due to the additional delay.
The difference between the two attempted approaches
In order to use the af_packet PMD, you first bind the device in question to the kernel driver. At this point, a kernel network interface is spawned for that device. This interface typically has the link active by default. If not, you typically run ip link set dev <interface> up. When you launch your DPDK application, the af_packet driver does not (re-)configure the link. It just unconditionally reports the link to be up on device start (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L266) and vice versa on device stop. The link update operation is also a no-op in this driver (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L416).
In fact, with the af_packet approach, the link is already active at the time you launch the application, hence there is no need to await the link.
With the VFIO approach, the device in question has its link down, and it is the responsibility of the corresponding PMD to activate it. Hence the need to test the link status in the application.
Is it possible to avoid waiting on application restarts?
Long story short, yes. Awaiting link status is not the only problem with application restarts. You effectively re-initialise EAL as a whole when you restart, and that procedure is also eye-wateringly time consuming. In order to cope with that, you should probably check out multi-process support available in DPDK (see https://doc.dpdk.org/guides/prog_guide/multi_proc_support.html).
This requires that you re-implement your application to have its control logic in one process (the so-called primary process) and the Rx/Tx datapath logic in another one (the secondary process). This way, you can keep the first one running all the time and restart the second one whenever you need to change the Rx/Tx logic and re-compile. Restarting the secondary process re-attaches to the existing EAL instance each time. Hence no PMD restart is involved, and there is no more need to wait.
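A minimal sketch of that split, assuming the secondary process is launched with the standard EAL flag --proc-type=secondary (all port and datapath logic elided):

#include <rte_eal.h>

int main(int argc, char **argv)
{
    /* A secondary process attaches to the hugepage memory and ethdev
     * state of the already-running primary instead of re-probing. */
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
        /* Long-lived control process: configure and start the port
         * once, then keep running so the link stays up. */
    } else {
        /* Short-lived datapath process: the port is already started,
         * so it can call rte_eth_tx_burst() without re-initialization. */
    }
    return 0;
}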
Based on the interaction via comments, the real question can be summarized as: "I'm just asking myself if it's possible to keep the link up between invocations of a DPDK program (when using a vfio device) to avoid dealing with the relatively long wait times before the first transmit comes through. IOW, is it somehow possible to skip the device reset when starting the program for the 2nd time?"
The short answer is no for the packet-generator program between restarts, because any physical NIC which uses the PCIe config space for both the PF (ixgbe for the X553) and the VF (ixgbevf for the X553), bound with uio_pci_generic|igb_uio|vfio-pci, requires a PCIe reset & configuration. But when using the AF_PACKET DPDK PMD (on top of the ixgbe kernel driver), this is a virtual device that does not do any PCIe resets and directly sets dev->data->dev_link.link_status = ETH_LINK_UP; in its eth_dev_start function.
For the second part, "Is the delay for the first TX packets expected?"
[Answer] No; however, a few factors contribute to the delay of the first packet transmission:
Switch software & firmware (PF only)
Switch port Auto-neg or fixed speed (PF only)
X553 software and firmware (PF and VF)
Autoneg enable or disabled (PF and VF)
link medium SFP (fiber) or DAC (Direct Attached Copper) or RJ-45 (cat5|cat6) connection (PF only)
PF driver version for NIC (X553 ixgbe) (PF and VF)
As per the Intel driver, software-generated layer-2 frames, like IEEE 802.3x (link flow control), IEEE 802.1Qbb (priority-based flow control), and others of this type (VF only)
Note: Since the issue is mentioned for VF ports only (and not PF ports), my assumptions are:
the TX packet uses the source MAC address of the VF, to avoid the MAC-spoof check on the ASIC
all SR-IOV enabled ports are configured for VLAN tagging from the administrative interface on the PF, to avoid flooding of traffic to the VF
the PF driver is updated, to avoid old driver issues (such as a VF reset causing the PF link to reset); this can be identified by checking dmesg
Steps to isolate whether the problem is the NIC:
Check if the (X553) PF under DPDK has the same delay as the VF under DPDK (to isolate whether it is a PF or a VF problem).
Cross-connect 2 NICs (X553) on the same system, then compare Linux vs DPDK link-up events (to check whether it is a NIC problem or a PCIe link issue).
Disable auto-negotiation for the DPDK X553 and compare PF vs VF under DPDK.
Steps to isolate whether the problem is the switch:
Disable auto-negotiation on the switch, set a fixed speed of 1 Gbps full duplex, and check the behaviour.
[EDIT-1] I do agree with the workaround suggested by #stackinside using the DPDK primary/secondary process concept: the primary is responsible for the link and port bring-up, while the secondary is used for the actual RX and TX bursts.

Running Helloworld example for DPDK?

I'm new to DPDK and running the Helloworld example given as part of the DPDK build. I'm able to run it successfully and get the expected output, i.e. printing "hello" from different lcores.
./dpdk-helloworld -l 0-3 -n 4
EAL: Detected 112 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: No available hugepages reported in hugepages-2048kB
EAL: Probing VFIO support...
EAL: No legacy callbacks, legacy socket not created
hello from core 1
hello from core 2
hello from core 3
hello from core 0
But when I run htop, I can see only the first core being used; I was expecting all four cores to be utilized.
Is there something wrong in my understanding of how EAL works in DPDK?
There can be a couple of reasons which might lead to your observation, such as:
1. The DPDK hello world example follows a simple run-to-completion program model which launches lcore_hello on all worker threads and the main thread. There is no infinite while or for loop to keep running on the desired lcores.
2. The DPDK application is running inside a VM with an isolated CPU set (isolcpus) limited to a single core.
3. htop's sampling interval is not short enough to show that all 4 CPUs were utilized.
Since limited information is shared about the setup and environment, I have to assume the cause of your observation is 3. In order to achieve 100% utilization across all threads, edit lcore_hello to be an infinite loop; example code below:
static int
lcore_hello(__rte_unused void *arg)
{
    unsigned lcore_id;

    lcore_id = rte_lcore_id();
    printf("hello from core %u\n", lcore_id);
    /* Spin forever so this lcore stays 100% busy and visible in htop. */
    while (1)
        ;
    return 0; /* never reached */
}
Note: I highly recommend using perf record with an appropriate sample rate to capture events across all cores for specific application launches, too.
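A hypothetical invocation (the 99 Hz sample frequency is a placeholder):
sudo perf record -F 99 -a -- ./dpdk-helloworld -l 0-3 -n 4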

How to run send and receive traffic over 2 instances of dpdk-testpmd running on the same host?

Note: I'm new to networking and DPDK, so there might be some misunderstanding of fundamental concepts...
I want to run 2 instances of dpdk-testpmd on the same host to send and receive traffic over separate NIC.
Configuration:
NIC:
PMD: MLX5
version: 5.0-1.0.0.0
firmware-version: 16.26.1040 (MT_0000000011)
NUMA-Socket: same
PCIe: 0000:3b:00.0, 0000:3b:00.1
Update
DPDK Version: 20.11.1
Hugepage Total: 32768
Hugepage Free: 32768
TESTPMD Logs:
Logical Core 12 (socket 0) forwards packets on 6 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
Questions:
How to set the MAC address of instance 2's port in instance 1?
What's the meaning of RX P=0/Q=0?
Will the NIC receive packets from its RXQ 0 (a ring buffer in memory)?
Will the NIC put the packets (just taken from the RXQ) into its TXQ 0?
What will happen next?
Where is the packet's destination?
How to set/check it?
Hope for your help.
Thanks in advance.
Note: I highly recommend reading up on dpdk-testpmd, as it covers all your questions in detail. As per the StackOverflow guidelines, multiple sub-questions make answering difficult; please ask a well-formatted and well-formed question for better reach and answers.
#Alexcured, since you have mentioned that you know how to run 2 separate instances of dpdk-testpmd, I will only recommend reading up heavily on the dpdk-testpmd documentation, which has answers to most of your questions too.
The assumption is made that both PCIe NICs are working properly and that the interconnect between the 2 has been tested with either arping or ping (kernel driver). After binding both PCIe devices to DPDK-supported drivers, one should use options for DPDK 20.11.1 such as:
use file-prefix option as unique names
set socket-memory to fetch memory from the desired NUMA-SOCKET
set socket-limit to prevent ballooning of huge page mmap
use w|b option to whitelist|blacklist PCIe devices (0000:3b:00.0 and 0000:3b:00.1)
Since these are separate physical devices, ensure there is a physical cable connection between the 2 PCIe ports.
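A hypothetical pair of invocations under those assumptions (the core lists, memory sizes and file prefixes are placeholders; the PCI addresses are taken from the question):
sudo dpdk-testpmd -l 0-1 -n 4 --file-prefix pmd1 --socket-mem 1024 -w 0000:3b:00.0 -- -i
sudo dpdk-testpmd -l 2-3 -n 4 --file-prefix pmd2 --socket-mem 1024 -w 0000:3b:00.1 -- -i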
[Q.1] How to set the MAC address of instance 2's port in instance 1?
[A.1] In DPDK instance 1 & instance 2, execute `show port 0 mac` to show the current port MAC address, and note the MAC addresses down correctly. To set the MAC address of the peer, use the command `set eth-peer (port_id) (peer_addr)`.
In non-interactive mode this can be done from the command line too.
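For example (the MAC address is a placeholder), in the interactive prompt of instance 1:
testpmd> set eth-peer 0 aa:bb:cc:dd:ee:ff
or non-interactively via the testpmd command-line option --eth-peer=0,aa:bb:cc:dd:ee:ff.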
[Q.2] What's the meaning of RX P=0/Q=0?
[A.2] P stands for `port`, and Q stands for `queue`. So P=0/Q=0 means `port 0/queue 0`
[Q.3] Is the NIC will receive packets from it's RXQ 0(a ring buffer in memory?)
[A.3] The NIC can receive on port 0 / RXQ 0, provided no offload|RSS|flow rules are set (since you have not shared the command line, I will assume there are none).
The NIC hardware receive queue is mmap'ed so that the PMD can track its head/tail pointers and trigger DMA copies into the mempool (a hugepage-backed area). This is different from the DPDK librte_ring buffer area.
[Q.4] Is the NIC will put packets(just got from RXQ) to it's TXQ 0?
[A.4] If the option `start tx-first` is not used, then as per the current understanding there will not be any packets TXed from the NIC (for Intel FVL, LLDP packets will be sent from the NIC HW unless disabled).
In the default looped (io) forwarding configuration with 1 DPDK port, whatever the port receives is always sent back out as TX.
[Q.5] What will happen next?
[A.5] The packet will be sent out with the modified eth-peer MAC address.
[Q.6] Where is the packet's destination?
[A.6] In an Ethernet packet, the first 6 bytes are the destination MAC address. Once `set eth-peer` has been issued, the destination MAC address is set to that value.
[Q.7] How to set/check it?
[A.7] Use the option `set verbose 2`. This will show the details of RX packets received on the port.

Can multi-queue be used with DPDK vdev rx_pcap

I could not tell from the documentation if it is possible to use vdev rx_pcap to simulate RSS with a pcap file, using multiple cores.
It seemed like an interesting proposition, after reading this:
For ease of use, the DPDK EAL also has been extended to allow
pseudo-Ethernet devices, using one or more of these drivers, to be
created at application startup time during EAL initialization.
To do so, the --vdev= parameter must be passed to the EAL. This takes
options to allow ring and pcap-based Ethernet to be allocated and
used transparently by the application. This can be used, for example,
for testing on a virtual machine where there are no Ethernet ports.
Pcap-based devices can be created using the virtual device --vdev
option.
This is how I read one PCAP file and write to another, using their example with the dpdk-testpmd application:
sudo build/app/dpdk-testpmd -l 0-3 --vdev 'net_pcap0,rx_pcap=file_rx.pcap,tx_pcap=file_tx.pcap' -- --port-topology=chained --no-flush-rx
This works fine, and I get the file_tx.pcap generated. But if I try to set the number of RX queues to 4 it tells me that I can't:
$ sudo build/app/dpdk-testpmd -l 0-3 --vdev 'net_pcap0,rx_pcap=file_rx.pcap,tx_pcap=file_tx.pcap' -- --port-topology=chained --no-flush-rx --rxq=4
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No available hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: Invalid NUMA socket, default to 0
EAL: Invalid NUMA socket, default to 0
Fail: input rxq (4) can't be greater than max_rx_queues (1) of port 0
EAL: Error - exiting with code: 1
Cause: rxq 4 invalid - must be >= 0 && <= 1
Is it possible to change max_rx_queues for vdev rx_pcap at all, or is there a better alternative?
The number of queues available on a port depends on the NIC configuration and the driver from the OS. Hence, expecting a pcap PMD that emulates an RX device from an rx_pcap file to have multiple RX queues is not right. You would need to use an actual interface from the OS which has multiple queues.
Explanation below:
As per the DPDK NIC feature matrix, there is no support for RSS in the PCAP PMD. So the option of receiving packets on multiple queues, based on the 3-tuple (IP) or 5-tuple (IP + TCP|UDP|SCTP) hash, is not present natively and needs to be implemented in SW (a rough sketch follows).
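As an illustration only, here is a hypothetical sketch of such a software dispatch, assuming IPv4 over Ethernet and a user-supplied 40-byte RSS key; rte_softrss() is DPDK's software Toeplitz hash helper:

#include <rte_byteorder.h>
#include <rte_common.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>
#include <rte_thash.h>

static uint8_t rss_key[40]; /* fill with an RSS key of your choice */

/* Pick a worker queue for an IPv4 packet using a software 2-tuple hash. */
static uint16_t
soft_rss_select(struct rte_mbuf *m, uint16_t nb_workers)
{
    const struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m,
            const struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
    uint32_t tuple[2] = {
        rte_be_to_cpu_32(ip->src_addr),
        rte_be_to_cpu_32(ip->dst_addr),
    };
    return rte_softrss(tuple, RTE_DIM(tuple), rss_key) % nb_workers;
}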
As per the PCAP PMD documentation, if one needs to read from a physical port, one has to use the option rx_iface and not rx_pcap. Similarly, to send on a physical interface, one has to use the option tx_iface and not tx_pcap.
If your requirement is to capture the RX or TX packets of a specific DPDK port, you should look at the DPDK pdump application, which uses the rte_pdump API. The pdump documentation explains clearly how to grab packets from specific queues too.
If one needs to read packets using the PCAP PMD, use rx_iface in the primary application. Then, to write packets from the desired port-queue into a PCAP file, use dpdk-pdump as a secondary application with the option --pdump 'port=[your desired DPDK port],queue=[your desired DPDK port queue],rx-dev=/tmp/rx.pcap'.
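For example, with placeholder port and queue values:
sudo dpdk-pdump -- --pdump 'port=0,queue=0,rx-dev=/tmp/rx.pcap'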

How to obtain a backend transport type/configuration in MPI (openMPI application)?

I have access to submit jobs to a small cluster. How can I find out, from inside an MPI application, what type of backend the MPI is running on (InfiniBand, Ethernet, etc.)?
Open MPI ranks the network interconnects it finds on each host and selects the fastest one that allows communication with the other nodes. InfiniBand always wins over Ethernet unless one fiddles with the BTL component priorities (one usually doesn't).
To see the components being selected, set the verbosity level of the BTL framework to at least 5:
$ mpiexec --mca btl_base_verbose 5 -np 2 ./a.out
[host:08691] mca: bml: Using self btl to [[56717,1],1] on node host
[host:08690] mca: bml: Using self btl to [[56717,1],0] on node host
[host:08691] mca: bml: Using vader btl to [[56717,1],0] on node host
[host:08690] mca: bml: Using vader btl to [[56717,1],1] on node host
What you see here is that modules from two BTL components were instantiated:
self, which Open MPI uses to communicate within the same process;
vader, previously known as sm, which implements message passing via shared-memory for processes on the same node.
If TCP/IP over the 10G Ethernet or IPoIB is used, you'll see the tcp BTL being selected. Otherwise, the output depends on the Open MPI version you have. With the older versions, Mellanox InfiniBand HCAs are driven natively by the openib BTL component. With the newer versions, the mxm MTL takes over and you might need to increase the verbosity of the MTL framework instead by setting mtl_base_verbose to 5.
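Equivalently, Open MPI maps MCA parameters to environment variables of the form OMPI_MCA_<name>, so the same verbosity can be requested without touching the mpiexec command line:
OMPI_MCA_btl_base_verbose=5 mpiexec -np 2 ./a.out
OMPI_MCA_mtl_base_verbose=5 mpiexec -np 2 ./a.out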