How do you figure out what settings need to be used to correctly configure a DPDK mempool for your application?
Specifically using rte_pktmbuf_pool_create():
n, the number of elements in the mbuf pool
cache_size
priv_size
data_room_size
EAL arguments:
-n: number of memory channels
-r: number of memory ranks
-m: amount of memory to preallocate at startup
--in-memory: no shared data structures
--iova-mode: IOVA mode
--huge-worker-stack
My setup:
2 x Intel Xeon Gold 6348 CPU @ 2.6 GHz
28 cores per socket
Max 3.5 GHz
Hyperthreading disabled
Ubuntu 22.04.1 LTS
Kernel 5.15.0-53-generic
Cores set to performance governor
4 x Sabrent 2TB Rocket 4 Plus in RAID0 Config
128 GB DDR4 Memory
10 x 1GB hugepages (can change to whatever is required)
1 x Mellanox ConnectX-5 100gbe NIC
31:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Firmware-version: 16.35.1012
UDP Source:
100 gbe NIC
9000 MTU Packets
ipv4-udp packets
I will be receiving 10GB/s of UDP packets over a 100GbE link. The plan is to strip the headers and write the payload to a file. Right now I am trying to get it working at 2GB/s on a single queue.
I reviewed the DPDK Programmer's Guide: https://doc.dpdk.org/guides/prog_guide/mempool_lib.html
I also searched online, but the resources seem limited. I would appreciate any help or a push in the right direction.
Based on the updates in the comments, the question can be summarized as:
What are the correct DPDK mbuf/mempool settings to handle 9000B UDP payloads at 10Gbps on a 100Gbps MLX CX-5 NIC with single or multiple queues?
Let me summarize my suggestions for this specific case.
[for 100Gbps]
As per the MLX DPDK performance report, test case 4, for packet size 1518 both the theoretical and the measured rate are 8.13 million packets per second (Mpps).
Hence for a 9000B payload, 9000B/1518B ≈ 6, so the rate is around 8.13/6 = 1.355 Mpps.
An MLX CX-5 can achieve a maximum of 36 Mpps with 1 queue, so with 1 queue and jumbo frames enabled you should be able to receive the 9000B packets on a single queue.
Note: at 10Gbps the rate will be 0.1355 Mpps.
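The packet-rate estimate above can be cross-checked with a few lines of arithmetic. This is only a sketch: line_rate_pps is a name made up for illustration, and it assumes the usual 20 bytes of per-frame wire overhead on Ethernet (8B preamble/SFD plus 12B inter-frame gap) and treats the quoted sizes as full frame sizes.

```python
# Per-frame overhead on the wire: 8 B preamble/SFD + 12 B inter-frame gap (assumption).
WIRE_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Upper bound on frames per second for a link speed and frame size."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame


if __name__ == "__main__":
    for label, bps in (("100 Gbps", 100e9), ("10 Gbps", 10e9)):
        for frame in (1518, 9000):
            mpps = line_rate_pps(bps, frame) / 1e6
            print(f"{label}, {frame} B frames: {mpps:.3f} Mpps")
```

For 1518B frames at 100Gbps this lands on the 8.13 Mpps figure from the report, and for 9000B frames it gives a value close to the rounded 8.13/6 estimate.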
Settings for the mbuf pool (mempool):
If your application logic requires 0.1 seconds to process the payload, I recommend sizing the pool at 3 * the maximum expected number of packets, so roughly 10000 packets.
Each element has a total size of 10000B (data_room_size) as a single contiguous buffer.
priv_size depends entirely on how much metadata your logic needs to store.
Note: with multiple queues, I always configure for the worst-case scenario, that is, I assume there will be an elephant flow that can fall onto one specific queue. So if you created 10000 elements for 1 queue, for multiple queues I use 2.5 * 10000.
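To check that these pool sizes fit into the available hugepages, a rough lower bound on the pool footprint can be computed as below. This is a sketch under stated assumptions: pool_bytes is a hypothetical helper, struct rte_mbuf is taken as 128 bytes (two 64-byte cache lines), and the mempool's per-object header and alignment padding are ignored, so the real footprint is somewhat larger.

```python
# Assumption: struct rte_mbuf spans two 64-byte cache lines.
RTE_MBUF_STRUCT_BYTES = 128


def pool_bytes(n_elements: int, data_room_size: int, priv_size: int = 0) -> int:
    """Lower bound on pool memory: n * (mbuf struct + private area + data room)."""
    return n_elements * (RTE_MBUF_STRUCT_BYTES + priv_size + data_room_size)


if __name__ == "__main__":
    for n in (10_000, 25_000):  # single-queue and multi-queue counts from above
        mib = pool_bytes(n, data_room_size=10_000) / (1024 * 1024)
        print(f"{n} elements x 10000 B data room: about {mib:.0f} MiB")
```

With the numbers above, both the 10000-element and the 25000-element pool come out well below 1GB, so they should fit comfortably into the ten 1GB hugepages mentioned in the question.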
DPDK: 22.03
PMD: Amazon ENA
We have a DPDK application that only calls rte_eth_rx_burst() (we do not transmit packets) and it must process the payload very quickly. The payload of a single network packet MUST be in contiguous memory.
The DPDK API is optimized around memory pools of fixed-size mbufs. If a packet received on the DPDK port is larger than the mbuf size but smaller than the max MTU, it will be segmented as shown in the figure below:
This leads us to the following problems:
If we configure the memory pool to store large packets (for example, max MTU size), then we will always store the payload in contiguous memory, but we will waste huge amounts of memory when we receive traffic containing small packets. Imagine that our mbuf size is 9216 bytes, but we are receiving mostly packets of size 100-300 bytes. We are wasting memory by a factor of 90!
If we reduce the size of the mbufs to, let's say, 512 bytes, then we need special handling of those segments in order to store the payload in contiguous memory. Special handling and copying hurt our performance, so they should be limited.
My final question:
What strategy is recommended for a DPDK application that needs to process the payload of network packets in contiguous memory? With both small (100-300 bytes) and large (9216) packets, without wasting huge amounts of memory with 9K-sized mbuf pools? Is copying segmented jumbo frames into a larger mbuf the only option?
There are a couple of ways, involving HW and SW logic, to make use of mempools with multiple object sizes.
via hardware:
If the NIC PMD supports parsing of the packet or of metadata (RX descriptor), one can use RTE_FLOW RAW to program flow direction to a specific queue, where each queue can be set up with the desired rte_mempool.
If the NIC PMD does not support parsing of metadata (RX descriptors) but the user knows the specific protocol fields, such as ETH + MPLS|VLAN, ETH + IP + UDP, or ETH + IP + UDP + Tunnel (Geneve|VxLAN), one can use RTE_FLOW to distribute that traffic over specific queues (which have the larger mempool object size), leaving the default traffic to fall on queue-0 (which has the smaller mempool object size).
If the hardware option of flow bifurcation is available, one can set up RTE_FLOW with raw or tunnel headers to redirect traffic to a VF. The PF can then use the smaller-object mempool and the VF the larger-object mempool.
via software (if HW support is absent or limited):
Using an RX callback (rte_rx_callback_fn), one can check mbuf->nb_segs > 1 to confirm that multiple segments are present, then allocate an mbuf from the larger mempool, attach it as the first segment, and invoke rte_pktmbuf_linearize to move the content into the first buffer.
Preset all queues with the large-object mempool; in the RX callback check mbuf->pkt_len < [threshold size]; if so, allocate an mbuf from the smaller pool, memcpy the content (packet data and necessary metadata), swap the original mbuf with the new mbuf, and free the original mbuf.
Pros and Cons:
SW-1: this is a costly process, as accessing multiple segments means non-contiguous memory access, and it will be done for the larger payloads, such as 2K to 9K. The hardware NIC also has to support RX scatter (multi-segment).
SW-2: this is less expensive than SW-1. As there are no multiple segments, the cost can be amortized with mtod and prefetch of the payload.
Note: in both cases, the cost of freeing mbufs within the RX callback can be reduced by maintaining a list of original mbufs to free.
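To get a feeling for what SW-2 buys in memory terms, here is a toy model (not DPDK code) of the per-packet pool choice. Everything in it is an assumption made for illustration: the helper name held_bytes, the 9216B and 512B object sizes borrowed from the question, and the threshold value.

```python
def held_bytes(packet_lengths, large_obj=9216, small_obj=512,
               threshold=512, use_sw2=True):
    """Bytes of pool objects occupied by the packets the application holds.

    With SW-2, packets shorter than `threshold` are copied into a small-pool
    object and the large mbuf is returned; otherwise they stay in a large one.
    """
    total = 0
    for length in packet_lengths:
        if use_sw2 and length < threshold:
            total += small_obj   # payload copied into the small pool
        else:
            total += large_obj   # packet keeps its large-pool mbuf
    return total


if __name__ == "__main__":
    mix = [100, 300, 9000]  # two small packets and one jumbo frame
    print("large pool only:", held_bytes(mix, use_sw2=False), "bytes")
    print("with SW-2      :", held_bytes(mix, use_sw2=True), "bytes")
```

The model only counts object sizes; it says nothing about the CPU cost of the memcpy, which is the price SW-2 pays for the smaller footprint.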
Alternative option-1 (involves modifying the PMD):
Modify the PMD probe or create code to allocate mempools for both large and small objects.
Set the maximum number of elements per RX burst to 1.
Use the scalar code path only.
Change the recv function to:
check the packet size by reading the RX descriptor;
comment out the original code that replenishes per threshold;
allocate from either the large or the small object mempool.
[edit-1] Based on the comment update, the DPDK version is 22.03 and the PMD is Amazon ENA. The DPDK NIC summary and the ENA PMD point to:
No RTE_FLOW RSS to specific queues.
No RTE_FLOW_RAW for packet size.
In file in function ena_rx_queue_setup; it supports individual rte_mempool
Hence the current options are:
Modify the ENA PMD to support multiple mempool sizes.
Use SW-2: an RX callback that copies smaller payloads to a new mbuf and swaps out the original.
Note: there is an alternate approach:
Create an empty pool with an external mempool.
Use a modified ENA PMD to get pool objects either as single small buffers or as multiple contiguous pool objects.
Recommendation: use a PMD or programmable NIC which can bifurcate based on packet size, and then use RTE_FLOW to steer to a specific queue. To allow multiple CPUs to process multiple flows, set up Q-0 as the default for small packets, and the other queues with RTE_FLOW RSS and a specific mempool.
My system is CentOS 8 with kernel 4.18.0-240.22.1.el8_3.x86_64 and I am using DPDK 20.11.1.
I want to calculate the round trip time in an optimized manner, such that the packet sent from Machine A to Machine B is looped back from Machine B to A and the time is measured. While this is being done, Machine B has a DPDK forwarding application running (like testpmd or l2fwd/l3fwd).
One approach is to use the DPDK pktgen application (https://pktgen-dpdk.readthedocs.io/en/latest/), but I could not find a way for it to calculate the round trip time like this. Ping is another way, but when Machine B receives a ping packet from Machine A, it has to process the packet and then respond to Machine A, which adds some cycles (undesired in my case).
I am open to suggestions and approaches to calculate this time. A benchmark comparing the RTT (Round Trip Time) of a DPDK-based application versus a non-DPDK setup would also give a better comparison.
Edit: There is a way to enable latency in DPDK pktgen. Can anyone share some information on how this latency is calculated and what it signifies? (I could not find solid information regarding the latency page in the documentation.)
It really depends on the kind of round trip you want to measure. Consider the following timestamps:
-> t1 -> send() -> NIC_A -> t2 --link--> t3 -> NIC_B -> recv() -> t4
host_A host_B
<- t1' <- recv() <- NIC_A <- t2' <--link-- t3' <- NIC_B <- send() <- t4'
Do you want to measure t1' - t1? Then it's just a matter of writing a small DPDK program that stores the TSC value right before/after each transmit/receive function call on host A. (Host B runs a forwarding application.) See also rte_rdtsc_precise() for reading the TSC and rte_get_tsc_hz() for converting the TSC deltas to nanoseconds.
For non-DPDK programs you can read out the TSC values/frequency by other means. Depending on your resolution needs you could also just call clock_gettime(CLOCK_REALTIME), which has an overhead of 18 ns or so.
This works for single-packet transmits via rte_eth_tx_burst() and single-packet receives, which aren't necessarily realistic for your target application. For larger bursts you would have to get a timestamp before the first transmit and after the last transmit and then compute the average delta.
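The conversion itself is simple arithmetic. The sketch below shows it in Python with made-up helper names; in a DPDK program the cycle counts would come from rte_rdtsc_precise() and the frequency from rte_get_tsc_hz(), while here they are passed in as plain numbers.

```python
NS_PER_SECOND = 1_000_000_000


def tsc_delta_to_ns(start_cycles: int, end_cycles: int, tsc_hz: int) -> float:
    """Convert a TSC cycle delta into nanoseconds."""
    return (end_cycles - start_cycles) * NS_PER_SECOND / tsc_hz


def average_per_packet_ns(first_cycles: int, last_cycles: int,
                          tsc_hz: int, burst_size: int) -> float:
    """Average time per packet when only the burst boundaries are timestamped."""
    return tsc_delta_to_ns(first_cycles, last_cycles, tsc_hz) / burst_size


if __name__ == "__main__":
    hz = 2_600_000_000  # example value: a 2.6 GHz TSC
    print(tsc_delta_to_ns(0, 2_600_000, hz), "ns for one packet")
    print(average_per_packet_ns(0, 2_600_000, hz, 32), "ns per packet in a burst of 32")
```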
Timestamps t2, t3, t2', t3' are hardware transmit/receive timestamps provided by (more serious) NICs.
If you want to compute the roundtrip t2' - t2 then you first need to discipline the NIC's clock (e.g. with phc2sys), enable timestamping and get those timestamps. However, AFAICS DPDK doesn't support obtaining the TX timestamps in general.
Thus, when using SFP transceivers, an alternative is to install passive optical TAPs on the RX/TX end of NIC_A and connect the monitor ports to a packet capture NIC that supports receive hardware timestamping. With such a setup, computing the t2' - t2 roundtrip is just a matter of writing a script that reads the timestamps of the matching packets from your pcap and computes the deltas between them.
The ideal way to measure latency for sending and receiving packets through an interface is to set up an external loopback device on the Machine A NIC port. This ensures the packet sent is received back on the same NIC without any processing.
The next best alternative is to enable internal loopback. This ensures the packet is converted to a PCIe payload and DMAed to the hardware packet buffer. Based on the PCIe config, the packet buffer is then shared with the RX descriptors, leading to RX of the sent packet. But for this one needs a NIC that
supports internal loopback
and can suppress loopback error handlers.
Another way is to use a PCIe port-to-port cross connect. In DPDK, we can run RX_BURST for port-1 on core-A and RX_BURST for port-2 on core-B. This gives an almost accurate round trip time.
Note: Newer hardware supports a doorbell mechanism, so on both TX and RX we can enable the HW to send a callback to the driver/PMD, which can then be used to fetch HW-assisted PTP timestamps for nanosecond accuracy.
But in my view, using an external machine (Machine-B) is not desirable because:
The latency varies depending on the quality of the transfer medium.
Machine-B has to be configured to the ideal settings (for almost 0 latency).
Machine-A and Machine-B, even if their physical configurations are the same, need to be maintained and run at the same thermal settings to allow the right clocking.
Both Machine-A and Machine-B have to run with the same PTP grandmaster to synchronize the clocks.
If DPDK is used, either modify the PMD or use rte_eth_tx_buffer_flush to ensure the packet is sent out to the NIC.
With these changes, a dummy UDP packet can be created, where
the first 8 bytes carry the actual TX time before tx_burst from Machine-A (T1);
the second 8 bytes are added by Machine-B when it actually receives the packet in SW via rx_burst (T2);
the third 8 bytes are added by Machine-B when tx_burst is completed (T3);
the fourth 8 bytes are filled in on Machine-A when the packet is actually received via rx_burst (T4).
With these, Round Trip Time = (T4 - T1) - (T3 - T2), where T4 and T1 give the receive and transmit times on Machine-A, and T3 and T2 give the processing overhead on Machine-B.
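As an illustration of the payload layout and the formula, here is a small sketch. The field order follows the description above; the network byte order, the nanosecond unit, and the helper names are assumptions of this example, not something the scheme prescribes.

```python
import struct

# Four unsigned 64-bit timestamps T1..T4, network byte order (assumption).
TIMESTAMP_FORMAT = "!4Q"
TIMESTAMP_BYTES = struct.calcsize(TIMESTAMP_FORMAT)  # 32 bytes


def pack_timestamps(t1: int, t2: int, t3: int, t4: int) -> bytes:
    """Build the first 32 bytes of the dummy UDP payload."""
    return struct.pack(TIMESTAMP_FORMAT, t1, t2, t3, t4)


def round_trip_time(payload: bytes) -> int:
    """RTT = (T4 - T1) - (T3 - T2), in the same unit as the timestamps."""
    t1, t2, t3, t4 = struct.unpack_from(TIMESTAMP_FORMAT, payload)
    return (t4 - t1) - (t3 - t2)


if __name__ == "__main__":
    # T1/T4 taken on Machine-A, T2/T3 on Machine-B (values are made up).
    payload = pack_timestamps(1_000, 5_000, 5_200, 1_900)
    print("RTT:", round_trip_time(payload))
```

Each difference in the formula only subtracts timestamps taken on the same machine, so both machines need to report time in the same unit (for example nanoseconds after converting TSC ticks).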
Note: depending upon the processor and generation, an invariant TSC is available. This ensures the tick rate of rte_get_tsc_cycles does not vary with frequency and power states.
[Edit-1] as mentioned in comments
@AmmerUsman, I highly recommend editing your question to reflect the real intention, which is how to measure the round trip time rather than the TX-RX latency of the DUT. You are referring to the DPDK latency stats/metrics, but those measure the min/max/avg latency between RX and TX on the same DUT.
@AmmerUsman, the latency library in DPDK provides stats representing the difference between the TX callback and the RX callback, and is not meant for the use case you describe. As per Keith's explanation, the traffic generator should embed a timestamp in the payload of each packet it sends out, and the receiver application should forward the packet back to the same port. The receive timestamp can then be compared with the timestamp embedded in the packet. For this, you need to send the packet back on the same port, which does not match your setup diagram.
I'm trying to find the proper raw perf event descriptor to monitor QPI traffic (bandwidth) on an Intel Xeon E5-2600 (Sandy Bridge).
I've found an event that seems relevant here (qpi_data_bandwidth_tx: Number of data flits transmitted. Derived from unc_q_txl_flits_g0.data. Unit: uncore_qpi) but I can't use it on my system. So these events probably refer to a different micro-architecture.
Moreover, I've looked into the "Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring Guide" and the most relevant reference I found is the following:
To calculate "data" bandwidth, one should therefore do:
data flits * 8B / time (for L0)
or 4B instead of 8B for L0p
The events that monitor the data flits are:
RxL_FLITS_G0.DATA
RxL_FLITS_G1.DRS_DATA
RxL_FLITS_G2.NCB_DATA
Q1: Are those the correct events?
Q2: If yes, should I monitor all these events and add them in order to get the total data flits or just the first?
Q3: I don't quite understand what the 8B and time refer to.
Q4: Is there any way to validate?
Also, please feel free to suggest alternatives in monitoring QPI traffic bandwidth in case there are any.
Thank you!
A Xeon E5-2600 processor has two QPI ports. Each port can send up to one flit and receive up to one flit per QPI domain clock cycle. Not all flits carry data, but all non-idle flits consume bandwidth. It seems to me that you're interested in counting only data flits, which is useful for detecting remote access bandwidth bottlenecks at the socket level (instead of a particular agent within a socket).
The event RxL_FLITS_G0.DATA can be used to count the number of data flits received. This is equal to the sum of RxL_FLITS_G1.DRS_DATA and RxL_FLITS_G2.NCB_DATA. You only need to measure the latter two events if you care about the breakdown. Note that there are only 4 event counters per QPI port. The event TxL_FLITS_G0.DATA can be used to count the number of data flits transmitted to other sockets.
The events RxL_FLITS_G0.DATA and TxL_FLITS_G0.DATA together measure the total number of data flits transferred through the specified port. So it takes two of the four counters available in each port to count total data flits.
There is no accurate way to convert data flits to bytes. A flit may contain up to 8 valid bytes. This depends on the type of transaction and the power state of the link direction (power states are per link per direction). A good estimate can be obtained by reasonably assuming that most data flits are part of full cache line packets and are being transmitted in the L0 power state, so each flit contains exactly 8 valid bytes. Alternatively, you can just measure port utilization in terms of data flits rather than bytes.
The unit of time is up to you. Ultimately, if you want to determine whether QPI bandwidth is a bottleneck, the bandwidth has to be measured periodically and compared against the theoretical maximum bandwidth. You can, for example, use total QPI clock cycles, which can be counted on one of the free QPI port PMU counters. The QPI frequency is fixed on JKT.
For validation, you can write a simple program that allocates a large buffer in remote memory and reads it. The measured number of bytes should be about the same as the size of the buffer in bytes.
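Putting the pieces together, the estimate from the counter values could look like the sketch below. The function names are made up, the counter values are placeholders rather than real measurements, and the 8 bytes per flit is the L0 full-cache-line assumption discussed above (4 for L0p, as in the guide's formula).

```python
def qpi_data_bandwidth(data_flits: int, seconds: float,
                       bytes_per_flit: int = 8) -> float:
    """Estimated data bandwidth in bytes/second: data flits * 8B / time."""
    return data_flits * bytes_per_flit / seconds


def data_flit_utilization(data_flits: int, qpi_clock_cycles: int) -> float:
    """Fraction of QPI clock cycles carrying a data flit in one direction.

    A port can move at most one flit per direction per QPI clock cycle.
    """
    return data_flits / qpi_clock_cycles


if __name__ == "__main__":
    rx_flits = 1_000_000  # placeholder for an RxL_FLITS_G0.DATA reading
    print(qpi_data_bandwidth(rx_flits, seconds=0.5), "bytes/s received (L0)")
    print(qpi_data_bandwidth(rx_flits, seconds=0.5, bytes_per_flit=4), "bytes/s (L0p)")
    print(data_flit_utilization(250, 1_000), "of cycles carried data flits")
```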
I am working on a C++ application that can be described as a router. This application receives UDP packets on a given port (nearly 37 bytes each second) and must multicast them to other destinations within a 10 ms period. However, sometimes after packet reception, the retransmission exceeds the 10 ms limit and can reach 100 ms. These out-of-limit delays are random.
The application also receives, on the same Ethernet interface but on a different port, another kind of packet (up to 200 packets of nearly 100 bytes each second). I am not sure that this latter flow is disrupting the other one, because these delay peaks are too rare (2 packets among 10000 packets).
What can be the causes of these sporadic delays? And how to solve them?
P.S. My application is running on a Linux 2.6.18-238.el5PAE. Delays are measured between the reception of the packet and after the success of the transmission!
An image to be more clear :
10ms is a tough deadline for a non-realtime OS.
Assign your process to one of the realtime scheduling policies, e.g. SCHED_RR or SCHED_FIFO. This can be done in the code via sched_setscheduler() or from the command line via chrt. Adjust the priority as well, while you're at it.
Make sure your code doesn't consume more CPU than it has to, or it will affect overall system performance.
You may also need the PREEMPT_RT patch.
Overall, generating Ethernet traffic on schedule on Linux is not an easy task. E.g. see BRUTE, a high-performance traffic generator; maybe you'll find something useful in its code or in the research paper.
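For reference, the sched_setscheduler() suggestion looks roughly like this. The sketch uses Python's os wrappers around the same system call purely for illustration (a C++ application would call sched_setscheduler() directly); request_realtime and clamp_priority are made-up names, and the call normally needs root or CAP_SYS_NICE, so failure is reported instead of raised.

```python
import os


def clamp_priority(priority: int, lowest: int, highest: int) -> int:
    """Keep the requested priority inside the range the policy allows."""
    return max(lowest, min(highest, priority))


def request_realtime(priority: int = 50) -> bool:
    """Try to switch the calling process to SCHED_FIFO; return True on success."""
    if not hasattr(os, "sched_setscheduler"):
        return False  # platform without the POSIX scheduling API
    lowest = os.sched_get_priority_min(os.SCHED_FIFO)
    highest = os.sched_get_priority_max(os.SCHED_FIFO)
    param = os.sched_param(clamp_priority(priority, lowest, highest))
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, param)  # pid 0 = this process
    except OSError:
        return False  # typically EPERM without sufficient privileges
    return True


if __name__ == "__main__":
    print("realtime scheduling enabled:", request_realtime())
```

A realtime process that busy-loops can starve everything else on its core, which is why the advice about not consuming more CPU than necessary matters here.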
I am trying to measure the IO data transfer rate (bandwidth) between 2 simulation applications (written in C++). I created a very simple perfclient and perfserver program just to verify that my approach to calculating the network bandwidth is correct before implementing it in the real applications. So in this case, I need to do it programmatically (NOT using Iperf).
I tried to run my perfclient and perfserver programs on various setups (localhost, a computer connected via Ethernet, and a computer connected via a wireless connection). However, I always get roughly the same bandwidth in each of these setups, around 1900 Mbps (tested using a data size of 1472 bytes). Is this a reasonable result, or can I get a better and more accurate bandwidth?
Should I use 1472 (which is the Ethernet MTU, not including headers) as the maximum data size for each send() and recv(), and why/why not? I also tried using different data sizes, and here are the average bandwidths that I got (tested using an Ethernet connection), which did not make sense to me because the numbers exceeded 1Gbps and reached something like 28 Gbps.
SIZE BANDWIDTH
1KB 1396 Mbps
2KB 2689 Mbps
4KB 5044 Mbps
8KB 9146 Mbps
16KB 16815 Mbps
32KB 22486 Mbps
64KB 28560 Mbps
HERE is my current approach:
I did a basic ping-pong loop, where the client continuously sends a stream of data bytes to the server program. The server reads the data and reflects (sends) it back to the client program. The client then reads the reflected data (2-way transmission). This operation is repeated 1000 times, and I then divide the total time by 1000 to get the average latency. Next, I divide the average latency by 2 to get the 1-way transmission time. Bandwidth can then be calculated as follows:
bandwidth = total bytes sent / average 1-way transmission time
Is there anything wrong with my approach? How can I make sure that my result is not biased? Once I get this right, I will need to test this approach in my original application (not this simple testing application), and I want to put this performance testing result in a scientific paper.
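Written out as code, the calculation described above is roughly the following. This is a sketch of the arithmetic only (no sockets); bandwidth_mbps is a made-up name, and it assumes the timer covers all iterations of the ping-pong loop and that Mbps means 10^6 bits per second.

```python
def bandwidth_mbps(bytes_per_message: int, total_elapsed_s: float,
                   iterations: int) -> float:
    """Bandwidth estimate from a ping-pong test, in Mbps (10^6 bits/s)."""
    average_round_trip_s = total_elapsed_s / iterations
    one_way_s = average_round_trip_s / 2
    return bytes_per_message * 8 / one_way_s / 1e6


if __name__ == "__main__":
    # Made-up numbers: 1000 round trips of 1250-byte messages in 20 ms total.
    print(f"{bandwidth_mbps(1250, 0.02, 1000):.1f} Mbps")
```

Note that each round trip in a ping-pong loop includes a fixed per-message latency, so this estimate tends to grow with the message size until the link itself becomes the limit.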
EDIT:
I have solved this problem. Check out the answer that I posted below.
Unless you have a need to reinvent the wheel, iperf was made to handle just this problem.
Iperf was developed by NLANR/DAST as a modern alternative for measuring maximum TCP and UDP bandwidth performance. Iperf allows the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, datagram loss.
I was finally able to figure this out and solve it :-)
As I mentioned in the question, regardless of the network setup that I used (localhost, 1Gbps Ethernet card, wireless connection, etc.), my measured bandwidth scaled up to 28Gbps. I tried binding the server to several different IP addresses, as follows:
127.0.0.1
IP address given by my LAN connection
IP address given by my wireless connection
I thought that this should give me a correct result, but in fact it didn't.
This was mainly because I was running both the client and the server program on the same computer (in different terminal windows, even though the client and server were bound to different IP addresses). My guess is that this is caused by the internal loopback. This is the main reason why the result was so biased and inaccurate.
So I then ran the client on one workstation and the server on another workstation, tested them using the different network connections, and it worked as expected :-)
On the 1Gbps connection, I got about 980 Mbps (0.96 Gbps), and on the 10Gbps connection, I got about 10100 Mbps (9.86 Gbps). So this works exactly as I expected, and my approach is correct. Perfect!!