I'm new to DPDK and running the Hello World example that comes with the DPDK build. I'm able to run it successfully and get the expected output, i.e. "hello" printed from the different lcores.
./dpdk-helloworld -l 0-3 -n 4
EAL: Detected 112 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: No available hugepages reported in hugepages-2048kB
EAL: Probing VFIO support...
EAL: No legacy callbacks, legacy socket not created
hello from core 1
hello from core 2
hello from core 3
hello from core 0
But when I run htop, I can see only the first core being used; I was expecting all four cores to be utilized.
Is there something wrong in my understanding of how EAL works in DPDK?
There can be a couple of reasons that might lead to this observation, such as:
1. The DPDK hello world example uses a simple run-to-completion programming model: it launches lcore_hello on all worker lcores and the main lcore, and there is no while/for loop keeping the desired lcores busy.
2. The DPDK application is running inside a VM with isolcpus set for a single core.
3. The htop sampling interval is too short to show that all 4 CPUs were utilized.
Since limited information is shared about the setup and environment, I have to assume the cause is reason 3. To achieve 100% utilization across all threads, edit lcore_hello to spin in an infinite loop; example code below:
static int
lcore_hello(__rte_unused void *arg)
{
    unsigned lcore_id;

    lcore_id = rte_lcore_id();
    printf("hello from core %u\n", lcore_id);

    /* spin forever so this lcore stays 100% busy and shows up in htop */
    while (1)
        ;

    return 0;
}
Note: I also highly recommend using perf record with an appropriate sample rate to capture events across all cores for the specific application launch.
How do you figure out what settings need to be used to correctly configure a DPDK mempool for your application?
Specifically using rte_pktmbuf_pool_create():
n, the number of elements in the mbuf pool
cache_size
priv_size
data_room_size
EAL arguments:
-n: number of memory channels
-r: number of memory ranks
-m: amount of memory to preallocate at startup
--in-memory: no shared data structures
--iova-mode: IOVA mode
--huge-worker-stack
My setup:
2 x Intel Xeon Gold 6348 CPU @ 2.6 GHz
28 cores per socket
Max 3.5 GHz
Hyperthreading disabled
Ubuntu 22.04.1 LTS
Kernel 5.15.0-53-generic
Cores set to performance governor
4 x Sabrent 2TB Rocket 4 Plus in RAID0 Config
128 GB DDR4 Memory
10 1GB HugePages (Can change to what is required)
1 x Mellanox ConnectX-5 100 GbE NIC
31:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Firmware-version: 16.35.1012
UDP Source:
100 GbE NIC
9000 MTU packets
IPv4/UDP packets
Will be receiving 10 GB/s of UDP packets over a 100 GbE link. The plan is to strip the headers and write the payload to a file. Right now I'm trying to get it working at 2 GB/s to a single queue.
I've reviewed the DPDK Programmer's Guide on the mempool library: https://doc.dpdk.org/guides/prog_guide/mempool_lib.html
I've also searched online, but the resources seem limited. Would appreciate any help or a push in the right direction.
Based on the updates from the comments, the question can be summarized as:
What are the correct settings for the DPDK mbuf/mempool to handle 9000B UDP payloads when processing 10 Gbps of packets on a 100 Gbps Mellanox ConnectX-5 NIC, with a single queue or multiple queues?
Let me summarize my suggestions for this case below.
[for 100Gbps]
As per the Mellanox DPDK performance report (test case 4), for a packet size of 1518B the theoretical and practical rate is about 8.13 million packets per second (Mpps).
Hence for a 9000B payload (9000B/1518B ≈ 6) this scales to roughly 8.13/6 ≈ 1.355 Mpps.
A Mellanox ConnectX-5 can achieve a maximum of about 36 Mpps on a single queue, so with 1 queue and jumbo frames enabled you should be able to receive the 9000B packets on a single queue.
Note: at 10 Gbps this works out to about 0.1355 Mpps.
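(As a rough cross-check of my own, ignoring Ethernet preamble/IFG and header overhead: 10 Gbps / (9000 B x 8 bits) ≈ 0.139 Mpps, which is in the same ballpark as the 0.1355 Mpps scaled from the report.)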
Settings for the mbuf/mempool:
If your application logic requires 0.1 seconds to process the payload, I recommend sizing the pool at 3 * the maximum number of packets expected in that window, so roughly 10000 mbufs.
Each payload gets a total buffer of 10000B (data_room_size) as a single contiguous buffer.
priv_size is wholly dependent on the metadata your logic needs to store per packet.
Note: with multiple queues I always configure for the worst-case scenario, i.e. I assume there can be an elephant flow that lands entirely on a specific queue. So if 10000 elements are enough for 1 queue, for multiple queues I use 2.5 * 10000.
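As a starting point, here is a minimal sketch of creating such a pool with rte_pktmbuf_pool_create() (my own illustration for the single-queue case; the name, counts and buffer sizes are assumptions to adapt, not tuned values):

/* Pool sized for ~9000B jumbo payloads held in a single contiguous buffer. */
#include <stdio.h>
#include <rte_errno.h>
#include <rte_mbuf.h>

#define NB_MBUFS   10240                            /* ~3x packets in flight   */
#define MBUF_CACHE 256                              /* per-lcore cache         */
#define PRIV_SIZE  0                                /* raise to store metadata */
#define DATA_ROOM  (9216 + RTE_PKTMBUF_HEADROOM)    /* fits a 9000B payload    */

static struct rte_mempool *
create_jumbo_pool(int socket_id)
{
    struct rte_mempool *mp = rte_pktmbuf_pool_create("jumbo_pool", NB_MBUFS,
                                                     MBUF_CACHE, PRIV_SIZE,
                                                     DATA_ROOM, socket_id);
    if (mp == NULL)
        printf("mbuf pool creation failed: %s\n", rte_strerror(rte_errno));
    return mp;
}

Since data_room_size covers the whole payload plus RTE_PKTMBUF_HEADROOM, a 9000B frame should fit in a single mbuf segment, matching the "single contiguous buffer" point above.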
I'm trying to set up environments to test programs written in the P4 language, using t4p4s as the compiler. When a P4 program is compiled with t4p4s, a C/C++ program using DPDK is generated, which in turn is compiled and run.
Compiling the program works fine. The resulting executable is run like so:
./build/l2switch -c 0xc -n 4 --no-pci --vdev net_pcap0,rx_iface_in=veth3-s,tx_iface=veth3-s,tx_iface=veth3-s --vdev net_pcap1,rx_iface_in=veth5-s,tx_iface=veth5-s,tx_iface=veth5-s -- -p 0x3 --config "\"(0,0,2)(1,0,3)\""
On a Raspberry Pi, this works with every network interface I've tried so far (virtual ethernet devices as seen in the command above, the builtin ethernet port and a Realtek USB NIC).
Inside a Ubuntu 21.04 VM using virtual ethernet devices, I get the following error:
--- Init switch
EAL: Detected CPU lcores: 4
EAL: Detected NUMA nodes: 1
EAL: Detected shared linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: VFIO support initialized
TELEMETRY: No legacy callbacks, legacy socket not created
2 :::: Allocating DPDK mbuf pool on socket 0
2 :::: Init ports
2 :::: Init port 0
2 :::: Creating queues: nb_rxq=1 nb_txq=2
Ethdev port_id=0 requested Rx offloads 0xe doesn't match Rx offloads capabilities 0x0 in rte_eth_dev_configure()
EAL: Error - exiting with code: 1
Cause: Cannot configure device: err=-22, port=0
I've had this problem pop up on the Raspberry Pis too, but it would fix itself after a restart or enough tries. On the VM, this problem is persistent.
Questions:
In both cases, I'm using virtual ethernet devices for the interfaces. Both cases use the same driver and software NIC. How can I find out what the difference between the VM and the Raspberry Pi is? After all, if there was no difference then it would work in both cases.
What does the error try to tell me? I've tried searching for it online to no avail and my knowledge of DPDK is very limited.
What can I try in order to fix this issue?
Solved it!
While looking through the files to find the program listing @stackinside requested, I found a t4p4s argument called "vethmode". There are plenty of arguments like this, and I've yet to find complete documentation for them. Turning it on results in the macro T4P4S_VETH_MODE being defined when the C program is compiled. This in turn changes the contents of the struct rte_eth_conf that is passed to rte_eth_dev_configure later on.
For the sake of completeness, here is the relevant file.
Line 40 is where the struct rte_eth_conf is defined/initialized.
Line 244 is the start of the function in which the call to rte_eth_dev_configure (Line 261) fails.
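For anyone who cannot follow the link, here is an illustrative sketch (my own, not the actual t4p4s listing) of the mechanism. My reading is that the requested Rx offloads 0xe in the error correspond to the IPv4/UDP/TCP checksum offload bits; the exact macro names depend on the DPDK version (older releases use DEV_RX_OFFLOAD_*).

/* Illustration only: a veth/software-NIC build flag gating the Rx offloads
 * requested via rte_eth_conf, so that rte_eth_dev_configure() succeeds on
 * PMDs that report offload capabilities of 0x0 (e.g. net_pcap on veth). */
#include <rte_ethdev.h>

static struct rte_eth_conf port_conf = {
    .rxmode = {
#ifdef T4P4S_VETH_MODE
        .offloads = 0,                            /* request no Rx offloads      */
#else
        .offloads = RTE_ETH_RX_OFFLOAD_CHECKSUM,  /* IPv4/UDP/TCP checksum (0xe) */
#endif
    },
};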
While writing a simple DPDK packet generator I noticed some additional initialization steps that are required for reliable and successful packet transmission:
calling rte_eth_link_get() or rte_eth_timesync_enable() after rte_eth_dev_start()
waiting 2 seconds before sending the first packet with rte_eth_tx_burst()
These steps are necessary when I use the DPDK ixgbe driver (NIC bound to vfio-pci) with an Intel X553 NIC.
When I'm using the DPDK AF_PACKET driver, it works without those extra steps.
Is this a bug or a feature?
Is there a better way than waiting 2 seconds before the first send?
Why is the wait required with the ixgbe driver? Is this a limitation of that NIC or the involved switch (Mikrotik CRS326 with Marvell chipset)?
Is there a more idiomatic function to call than rte_eth_link_get() in order to complete the device initialization for transmission?
Is there some way to keep the VFIO NIC initialized (while keeping its link up) to avoid re-initializing it over and over again during the usual edit/compile/test cycle? (i.e. to speed up that cycle ...)
Additional information: When I connect the NIC to a mirrored port (which is configured via Mikrotik's mirror-source/mirror-target ethernet switch settings) and the sleep(2) is removed then I see the first packet transmitted to the mirror target but not to the primary destination. Thus, it seems like the sleep is necessary to give the switch some time after the link is up (after the dpdk program start) to completely initialize its forwarding table or something like that?
Waiting just 1 second before the first transmission works less reliably, i.e. the packet reaches the receiver only every other time.
My device/port initialization procedure implements the following setup sequence (condensed into a code sketch after the list):
rte_eth_dev_count_avail()
rte_eth_dev_is_valid_port()
rte_eth_dev_info_get()
rte_eth_dev_adjust_nb_rx_tx_desc()
rte_eth_dev_configure(port_id, 0 /* rxrings */, 1 /* txrings */, &port_conf)
rte_eth_tx_queue_setup()
rte_eth_dev_start()
rte_eth_macaddr_get()
rte_eth_link_get() // <-- REQUIRED!
rte_eth_dev_get_mtu()
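Condensed into code, the sequence looks roughly like the sketch below (my own minimal TX-only illustration against the DPDK 20.11 API; error handling and the contents of port_conf are omitted/assumed):

/* Sketch of the setup sequence above: TX-only port, error handling omitted. */
#include <rte_ethdev.h>

static int
setup_tx_port(uint16_t port_id, const struct rte_eth_conf *port_conf)
{
    struct rte_eth_dev_info dev_info;
    struct rte_ether_addr mac;
    struct rte_eth_link link;
    uint16_t nb_rxd = 0, nb_txd = 1024, mtu;

    if (!rte_eth_dev_is_valid_port(port_id))
        return -1;
    rte_eth_dev_info_get(port_id, &dev_info);
    rte_eth_dev_adjust_nb_rx_tx_desc(port_id, &nb_rxd, &nb_txd);
    rte_eth_dev_configure(port_id, 0 /* rxrings */, 1 /* txrings */, port_conf);
    rte_eth_tx_queue_setup(port_id, 0, nb_txd, rte_eth_dev_socket_id(port_id), NULL);
    rte_eth_dev_start(port_id);
    rte_eth_macaddr_get(port_id, &mac);
    rte_eth_link_get(port_id, &link);   /* <-- may wait for autoneg; REQUIRED here */
    rte_eth_dev_get_mtu(port_id, &mtu);
    return 0;
}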
Without rte_eth_link_get() (or rte_eth_timesync_enable()) the first transmitted packet doesn't even show up on the mirrored port.
The above functions (and rte_eth_tx_burst()) complete successfully with or without rte_eth_link_get()/sleep(2) being present. In particular, the MAC address and MTU that are read back have the expected values (MTU = 1500), and rte_eth_tx_burst() returns 1 for a burst of one UDP packet.
The returned link status is: Link up at 1 Gbps FDX Autoneg
The fact that rte_eth_link_get() can be replaced with rte_eth_timesync_enable() probably can be explained by the latter calling ixgbe_start_timecounters() which calls rte_eth_linkstatus_get() which is also called by rte_eth_link_get().
I've checked the DPDK examples and most of them don't call rte_eth_link_get() before sending something. There is also no sleep after device initialization.
I'm using DPDK 20.11.2.
Even more information - to answer the comments:
I'm running this on Fedora 33 (5.13.12-100.fc33.x86_64).
Ethtool reports: firmware-version: 0x80000877
I had called rte_eth_timesync_enable() in order to work with transmit timestamps. However, during debugging I removed it to arrive at a minimal reproducer. At that point I noticed that removing it actually made things worse (i.e. no packet transmitted over the mirror port). I thus investigated what part of that function might make the difference and found rte_eth_link_get(), which has similar side effects.
When switching to AF_PACKET I'm using the stock ixgbe kernel driver, i.e. ixgbe with default settings on a device that is initialized by networkd (dhcp enabled).
My expectation was that when rte_eth_dev_start() terminates that the link is up and the device is ready for transmission.
However, it would be nice, I guess, if one could avoid resetting the device after program restarts. I don't know if DPDK supports this.
Regarding delays: I've just tested the following: rte_eth_link_get() can be omitted if I increase the sleep to 6 seconds, whereas a call to rte_eth_link_get() itself takes 3.3 s. So yeah, it's probably just helping due to the additional delay.
The difference between the two attempted approaches
In order to use the af_packet PMD, you first bind the device in question to the kernel driver. At that point, a kernel network interface is spawned for the device. This interface typically has its link active by default; if not, you typically run ip link set dev <interface> up. When you launch your DPDK application, the af_packet driver does not (re-)configure the link. It just unconditionally reports the link as up on device start (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L266) and as down on device stop. The link update operation is also a no-op in this driver (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L416).
In other words, with the af_packet approach the link is already active at the time you launch the application, hence no need to wait for it.
With the VFIO approach, the device in question has its link down, and it is the responsibility of the corresponding PMD to bring it up. Hence the need to test the link status in the application.
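If the goal is simply to avoid a hard-coded sleep(2), one option is to poll the link status explicitly after rte_eth_dev_start() and only transmit once the port reports link up. A minimal sketch (my own, against the DPDK 20.11 API; the ~9-second timeout is arbitrary) is below. Note this only covers the NIC side; any settle time the switch needs after link-up (e.g. to populate its forwarding tables) would still have to be accounted for separately.

/* Poll the link after rte_eth_dev_start() instead of sleeping a fixed time. */
#include <string.h>
#include <rte_ethdev.h>
#include <rte_cycles.h>

static int
wait_for_link_up(uint16_t port_id)
{
    struct rte_eth_link link;
    int i;

    for (i = 0; i < 90; i++) {                  /* up to ~9 seconds */
        memset(&link, 0, sizeof(link));
        rte_eth_link_get_nowait(port_id, &link);
        if (link.link_status == ETH_LINK_UP)
            return 0;                           /* safe to rte_eth_tx_burst() */
        rte_delay_ms(100);
    }
    return -1;                                  /* link never came up */
}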
Is it possible to avoid waiting on application restarts?
Long story short, yes. Awaiting link status is not the only problem with application restarts. You effectively re-initialise EAL as a whole when you restart, and that procedure is also eye-wateringly time consuming. In order to cope with that, you should probably check out multi-process support available in DPDK (see https://doc.dpdk.org/guides/prog_guide/multi_proc_support.html).
This requires that you re-implement your application to have its control logic in one process (the primary process) and the Rx/Tx datapath logic in another (the secondary process). This way, you can keep the first one running all the time and restart the second one whenever you need to change the Rx/Tx logic and recompile. Restarting the secondary process re-attaches to the existing EAL instance every time, so no PMD restart is involved and there is no more need to wait.
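A rough outline of that split (my own sketch, assuming both processes are launched with compatible EAL arguments such as --proc-type=auto; the actual port handling is only indicated by comments):

/* Sketch: the primary owns device bring-up and stays resident; the secondary
 * attaches to the running EAL instance and only runs the datapath. */
#include <unistd.h>
#include <rte_eal.h>
#include <rte_ethdev.h>

int
main(int argc, char **argv)
{
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
        /* configure and start the port(s) once, then keep them up */
        /* ... rte_eth_dev_configure(), rte_eth_dev_start(), control logic ... */
        for (;;)
            pause();
    } else {
        /* secondary: ports are already up, just run the Rx/Tx logic and exit */
        /* ... rte_eth_tx_burst() / rte_eth_rx_burst() loop ... */
    }
    return 0;
}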
Based on the interaction via comments, the real question can be summarized as: "I'm just asking myself if it's possible to keep the link up between invocations of a DPDK program (when using a VFIO device) to avoid dealing with the relatively long wait times before the first transmit comes through. IOW, is it somehow possible to skip the device reset when starting the program for the 2nd time?"
The short answer is no for the packet-generator program between restarts, because any physical NIC that uses PCIe config space, for both the PF (ixgbe for X553) and the VF (ixgbe_vf for X553), bound with uio_pci_generic|igb_uio|vfio-pci requires a PCIe reset and reconfiguration. But when using the AF_PACKET DPDK PMD (on top of the ixgbe kernel driver), that is a virtual device which does not do any PCIe reset and simply sets dev->data->dev_link.link_status = ETH_LINK_UP; in its eth_dev_start function.
For the second part, is the delay before the first TX packets expected?
[Answer] No; rather, a few factors contribute to the delay of the first packet transmission:
Switch software & firmware (PF only)
Switch port Auto-neg or fixed speed (PF only)
X553 software and firmware (PF and VF)
Auto-neg enabled or disabled (PF and VF)
link medium SFP (fiber) or DAC (Direct Attached Copper) or RJ-45 (cat5|cat6) connection (PF only)
PF driver version for NIC (X553 ixgbe) (PF and VF)
As per the Intel driver documentation: software-generated layer-two frames, like IEEE 802.3x (link flow control), IEEE 802.1Qbb (priority-based flow control), and others of this type (VF only)
Note: since the issue is mentioned for VF ports only (and not PF ports), my assumptions are:
The TX packet uses the source MAC address of the VF, to avoid the MAC spoof check on the ASIC.
All SR-IOV-enabled ports are configured for VLAN tagging from the administrative interface on the PF, to avoid flooding traffic to the VFs.
The PF driver is updated, to avoid old driver issues (such as a VF reset causing the PF link to reset). This can be checked via dmesg.
Steps to isolate whether the problem is the NIC:
Check if PF DPDK (X553) shows the same delay as VF DPDK (to isolate whether it is a PF or VF problem).
Cross-connect 2 NICs (X553) on the same system, then compare Linux vs DPDK link-up events (to check whether it is a NIC problem or a PCIe link issue).
Disable auto-negotiation for the DPDK X553 ports and compare PF vs VF in DPDK.
Steps to isolate whether the problem is the switch:
Disable auto-negotiation on the switch, set full duplex at a fixed speed of 1 Gbps, and check the behaviour.
[EDIT-1] I do agree with the workaround suggested by @stackinside using the DPDK primary-secondary process concept: the primary is responsible for link and port bring-up, while the secondary is used for the actual RX and TX bursts.
I could not tell from the documentation if it is possible to use vdev rx_pcap to simulate RSS with a pcap file, using multiple cores.
It seemed like an interesting proposition, after reading this:
For ease of use, the DPDK EAL also has been extended to allow
pseudo-Ethernet devices, using one or more of these drivers, to be
created at application startup time during EAL initialization.
To do so, the --vdev= parameter must be passed to the EAL. This takes options to allow ring- and pcap-based Ethernet to be allocated and
used transparently by the application. This can be used, for example,
for testing on a virtual machine where there are no Ethernet ports.
Pcap-based devices can be created using the virtual device --vdev
option.
This is how I read one PCAP file and write to another, using their example with the dpdk-testpmd application:
sudo build/app/dpdk-testpmd -l 0-3 --vdev 'net_pcap0,rx_pcap=file_rx.pcap,tx_pcap=file_tx.pcap' -- --port-topology=chained --no-flush-rx
This works fine, and I get the file_tx.pcap generated. But if I try to set the number of RX queues to 4 it tells me that I can't:
$ sudo build/app/dpdk-testpmd -l 0-3 --vdev 'net_pcap0,rx_pcap=file_rx.pcap,tx_pcap=file_tx.pcap' -- --port-topology=chained --no-flush-rx --rxq=4
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: No available hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: Invalid NUMA socket, default to 0
EAL: Invalid NUMA socket, default to 0
Fail: input rxq (4) can't be greater than max_rx_queues (1) of port 0
EAL: Error - exiting with code: 1
Cause: rxq 4 invalid - must be >= 0 && <= 1
Is it possible to change max_rx_queues for vdev rx_pcap at all, or is there a better alternative?
The number of queues available on a port depends upon the NIC's configuration and its driver in the OS. Hence expecting a PCAP PMD that emulates an RX device from an rx_pcap file to have multiple RX queues is not right. You would need to use an actual OS interface that has multiple queues.
Explanation below:
As per the DPDK NIC feature matrix, there is no RSS support in the PCAP PMD. So the option of distributing received packets across multiple queues based on the 3-tuple (IP) or 5-tuple (IP + TCP|UDP|SCTP) is not present natively and needs to be implemented in SW.
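To give a concrete idea of what "implemented in SW" could look like, below is a rough sketch (my own illustration, not part of the PCAP PMD): hash the IPv4 source/destination addresses of each packet read from the single pcap Rx queue and spread the packets over per-worker rings. Ring creation and non-IPv4 handling are left out.

/* Software "RSS" sketch: distribute packets from one Rx queue to
 * per-worker rings based on a CRC hash of the IPv4 src/dst addresses. */
#include <rte_mbuf.h>
#include <rte_ring.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_hash_crc.h>

#define NB_WORKERS 4

static void
soft_rss_distribute(struct rte_mbuf **pkts, uint16_t nb_pkts,
                    struct rte_ring *worker_rings[NB_WORKERS])
{
    uint16_t i;

    for (i = 0; i < nb_pkts; i++) {
        const struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(
            pkts[i], const struct rte_ipv4_hdr *,
            sizeof(struct rte_ether_hdr));
        uint32_t hash = rte_hash_crc_4byte(ip->src_addr,
                        rte_hash_crc_4byte(ip->dst_addr, 0));

        if (rte_ring_enqueue(worker_rings[hash % NB_WORKERS], pkts[i]) != 0)
            rte_pktmbuf_free(pkts[i]);      /* drop if the worker ring is full */
    }
}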
As per the PCAP PMD documentation, if one needs to read from a physical port one has to use the option rx_iface and not rx_pcap. Similarly, to send on a physical interface one has to use the option tx_iface and not tx_pcap.
If your requirement is to capture the RX or TX packets of a specific DPDK port, you should look at the DPDK PDUMP application, which uses the rte_pdump API. The PDUMP documentation explains clearly how to grab packets from specific queues too.
If one needs to read packets using the PCAP PMD, use rx_iface in the primary application. Then, to write packets from the desired port-queue into a PCAP file, use dpdk-pdump as a secondary application with the option --pdump 'port=[your desired DPDK port],queue=[your desired DPDK port queue],rx-dev=/tmp/rx.pcap'.
I have a CentOS minimal hexacore 3.5 GHz machine and I do not understand why a SCHED_FIFO realtime thread pinned to only 1 core freezes the terminal. How can I avoid this while keeping the realtime behaviour of the thread, without using sleep in the loop or blocking it? To simplify my problem: this thread tries to dequeue items from a non-blocking, lock-free, concurrent queue in an infinite loop.
The kernel runs on core 0, all the other cores are free. All other threads, and my process too, are SCHED_OTHER with the same priority, 20. This is the only thread where I need ultra-low latency for some high-frequency calculations. After starting the application everything seems to work fine, but my terminal freezes (I connect remotely through ssh). I am able to see the threads created and to force-close my app from htop. The RT thread runs the assigned core at 100%, as expected. When I kill the app, the frozen terminal is released and I can use it again.
It looks like that thread has higher priority than everything else across all cores, but I want this only on the core I pinned it to.
Thank you
Hi Victor, you need to isolate the core from the Linux scheduler so that it does not try to assign lower-priority tasks (such as the one running your terminal) to a core that is running SCHED_* jobs with higher priority. In your case you can isolate core 1 by adding the kernel option isolcpus=1 to your grub.cfg (or whatever boot loader config you are using).
After rebooting, you can confirm that you have successfully isolated core 1 by running dmesg | grep isol
and seeing that your kernel was booted with the option.
Here is some more info on isolcpus:
https://www.linuxtopia.org/online_books/linux_kernel/kernel_configuration/re46.html
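For completeness, here is a minimal sketch (my own, plain pthreads, nothing CentOS-specific) of how the busy-polling thread can be pinned to the isolated core and given SCHED_FIFO priority. The core number and priority value are just examples, and the scheduling change needs root or CAP_SYS_NICE.

/* Pin the target thread to (isolated) core 1 and switch it to SCHED_FIFO.
 * Returns 0 on success, an errno-style code on failure. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int
make_realtime_on_core(pthread_t thread, int core)
{
    cpu_set_t cpus;
    struct sched_param sp = { .sched_priority = 80 };   /* 1..99 for SCHED_FIFO */
    int ret;

    CPU_ZERO(&cpus);
    CPU_SET(core, &cpus);
    ret = pthread_setaffinity_np(thread, sizeof(cpus), &cpus);
    if (ret != 0)
        return ret;
    return pthread_setschedparam(thread, SCHED_FIFO, &sp);
}

/* e.g. from inside the dequeue thread: make_realtime_on_core(pthread_self(), 1); */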