How to avoid wasting huge amounts of memory when processing DPDK mbufs? - dpdk

DPDK: 22.03
PMD: Amazon ENA
We have a DPDK application that only calls rte_eth_rx_burst() (we do not transmit packets) and it must process the payload very quickly. The payload of a single network packet MUST be in contiguous memory.
The DPDK API is optimized around memory pools of fixed-size mbufs. If a packet received on the DPDK port is larger than the mbuf size but smaller than the max MTU, it will be split across a chain of mbuf segments (each segment in its own mbuf, linked via the next pointer).
This leads us to the following problems:
If we configure the memory pool to store large packets (for example, max MTU size), then the payload will always be stored in contiguous memory, but we will waste huge amounts of memory whenever we receive traffic consisting of small packets. Imagine that our mbuf size is 9216 bytes, but we are mostly receiving packets of 100-300 bytes. We are wasting memory by a factor of up to 90!
If we reduce the size of mbufs to, say, 512 bytes, then we need special handling of those segments in order to store the payload in contiguous memory. Special handling and copying hurt our performance, so they should be limited.
My final question:
What strategy is recommended for a DPDK application that needs to process the payload of network packets in contiguous memory, with both small (100-300 byte) and large (9216 byte) packets, without wasting huge amounts of memory on 9K-sized mbuf pools? Is copying segmented jumbo frames into a larger mbuf the only option?

There are a couple of ways, involving HW and SW logic, to make use of mempools with multiple object sizes.
via hardware:
If the NIC PMD supports packet or metadata (RX descriptor) parsing, one can use an rte_flow RAW pattern to steer matching traffic to a specific queue, where each queue can be set up with its desired rte_mempool (see the sketch after this list).
If the NIC PMD does not support parsing of metadata (RX descriptors) but the user is aware of specific protocol fields such as ETH + MPLS|VLAN, ETH + IP + UDP, or ETH + IP + UDP + tunnel (Geneve|VxLAN), one can use rte_flow to distribute that traffic over specific queues (which have a larger mempool object size), thus making the default traffic fall onto queue 0 (which has a smaller mempool object size).
If the hardware option of flow bifurcation is available, one can set up rte_flow rules with raw or tunnel headers to be redirected to a VF; the PF can then make use of a small-object mempool while the VF makes use of a large-object mempool.
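For illustration, here is a minimal sketch of the per-queue mempool idea with an rte_flow QUEUE action. The pool names and sizes, queue ids, and the IPv4/UDP match are my own illustrative assumptions (not from the question); device configure/start calls and most error handling are omitted, and whether a given pattern is accepted depends on the PMD.

    // Hedged sketch: two mempools with different object sizes, one per RX
    // queue, plus an rte_flow rule steering IPv4/UDP traffic to the
    // large-buffer queue. Everything else lands on queue 0 (small buffers).
    #include <rte_ethdev.h>
    #include <rte_flow.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    static int setup_two_pools(uint16_t port_id)
    {
        // Small objects for queue 0, large (9K) objects for queue 1.
        rte_mempool *small_pool = rte_pktmbuf_pool_create(
            "small_pool", 8192, 256, 0, 512 + RTE_PKTMBUF_HEADROOM, rte_socket_id());
        rte_mempool *large_pool = rte_pktmbuf_pool_create(
            "large_pool", 4096, 256, 0, 9216 + RTE_PKTMBUF_HEADROOM, rte_socket_id());
        if (small_pool == NULL || large_pool == NULL)
            return -1;

        // Each RX queue gets its own mempool (after rte_eth_dev_configure()
        // with at least two RX queues).
        rte_eth_rx_queue_setup(port_id, 0, 1024, rte_socket_id(), NULL, small_pool);
        rte_eth_rx_queue_setup(port_id, 1, 1024, rte_socket_id(), NULL, large_pool);

        // Match any IPv4/UDP packet and send it to queue 1.
        rte_flow_attr attr = {};
        attr.ingress = 1;

        rte_flow_item pattern[4] = {};
        pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
        pattern[1].type = RTE_FLOW_ITEM_TYPE_IPV4;
        pattern[2].type = RTE_FLOW_ITEM_TYPE_UDP;
        pattern[3].type = RTE_FLOW_ITEM_TYPE_END;

        rte_flow_action_queue queue_conf = {};
        queue_conf.index = 1;

        rte_flow_action actions[2] = {};
        actions[0].type = RTE_FLOW_ACTION_TYPE_QUEUE;
        actions[0].conf = &queue_conf;
        actions[1].type = RTE_FLOW_ACTION_TYPE_END;

        rte_flow_error err;
        return rte_flow_create(port_id, &attr, pattern, actions, &err) ? 0 : -1;
    }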
via software (if HW support is absent or limited):
Using an RX callback (rte_rx_callback_fn), one can check mbuf->nb_segs > 1 to confirm that multiple segments are present, then allocate an mbuf from the larger mempool, attach it as the first segment, and invoke rte_pktmbuf_linearize to move the content into that first buffer.
Pre-set all queues with large-size mempool objects; using an RX callback, check mbuf->pkt_len < [threshold size]. If so, allocate an mbuf from the smaller pool, memcpy the content (packet data and the necessary metadata), swap the original mbuf with the new one, and free the original mbuf (see the sketch after the pros and cons below).
Pros and Cons:
SW-1: this is a costly process, as accessing multiple segments means non-contiguous memory, and it will be done for larger payloads such as 2K to 9K. The NIC hardware also has to support RX scatter (multi-segment receive).
SW-2: this is less expensive than SW-1. As there are no multiple segments, the cost can be amortized with mtod and prefetch of the payload.
note: in both cases, the cost of mbuf_free within the RX callback can be reduced by maintaining a list of the original mbufs to free later.
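Here is a minimal sketch of the SW-2 idea. It assumes a small-object pool created elsewhere; the 512-byte threshold, the pool and callback names, and the set of copied metadata fields are illustrative assumptions, not taken from the answer above.

    // Hedged sketch of SW-2: in an RX callback, copy small single-segment
    // packets into mbufs from a smaller pool and swap them in, freeing the
    // large original.
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_memcpy.h>

    #define SMALL_PKT_THRESHOLD 512            // illustrative threshold

    static rte_mempool *small_pool;            // assumed created at init time

    static uint16_t
    rx_swap_small_cb(uint16_t port, uint16_t queue, rte_mbuf *pkts[],
                     uint16_t nb_pkts, uint16_t max_pkts, void *user_param)
    {
        (void)port; (void)queue; (void)max_pkts; (void)user_param;

        for (uint16_t i = 0; i < nb_pkts; i++) {
            rte_mbuf *m = pkts[i];

            // Only single-segment packets below the threshold are copied.
            if (m->nb_segs == 1 && rte_pktmbuf_pkt_len(m) < SMALL_PKT_THRESHOLD) {
                rte_mbuf *small = rte_pktmbuf_alloc(small_pool);
                if (small == NULL)
                    continue;                  // keep the original on alloc failure

                // Copy the payload plus whatever metadata the application needs.
                rte_memcpy(rte_pktmbuf_mtod(small, void *),
                           rte_pktmbuf_mtod(m, void *),
                           rte_pktmbuf_pkt_len(m));
                small->data_len = m->data_len;
                small->pkt_len  = m->pkt_len;
                small->port     = m->port;
                small->ol_flags = m->ol_flags;

                pkts[i] = small;               // swap in the small mbuf
                rte_pktmbuf_free(m);           // or queue it for bulk freeing later
            }
        }
        return nb_pkts;
    }

    // Registration, per queue, after rte_eth_rx_queue_setup():
    //   rte_eth_add_rx_callback(port_id, queue_id, rx_swap_small_cb, NULL);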
Alternative option-1 (involves modifying the PMD):
modify the PMD probe or create code to allocate mempools for both large and small objects
set the max elements per RX burst to 1
use the scalar code path only
change the recv function to:
check the packet size by reading the RX descriptor
comment out the original code that replenishes per threshold
allocate from either the large- or small-size mempool based on that packet size
[edit-1] Based on the comment, the DPDK version is 22.03 and the PMD is Amazon ENA. Based on the DPDK NIC feature summary and the ENA PMD, it points to:
No RTE_FLOW RSS to specific queues.
No RTE_FLOW_RAW for packet size.
In the driver source, the function ena_rx_queue_setup supports an individual rte_mempool per RX queue.
Hence the current options are:
Modify the ENA PMD to support multiple mempool sizes, or
Use SW-2 as an rx_callback to copy smaller payloads into new mbufs and swap them out.
Note: There is an alternate approach:
create an empty pool backed by an external mempool, and
use a modified ENA PMD to get pool objects either as single small buffers or as multiple contiguous pool objects.
Recommendation: Use a PMD or programmable NIC that can bifurcate based on packet size, then use rte_flow to steer traffic to a specific queue. To allow multiple CPUs to process multiple flows, set up queue 0 as the default for small packets and the other queues with rte_flow RSS and their own mempools.

Related

Sandy Bridge QPI bandwidth perf event

I'm trying to find the proper raw perf event descriptor to monitor QPI traffic (bandwidth) on Intel Xeon E5-2600 (Sandy Bridge).
I've found an event that seems relevant here (qpi_data_bandwidth_tx: Number of data flits transmitted. Derived from unc_q_txl_flits_g0.data. Unit: uncore_qpi) but I can't use it on my system, so these events probably refer to a different micro-architecture.
Moreover, I've looked into the "Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring Guide" and the most relevant reference I found is the following:
To calculate "data" bandwidth, one should therefore do:
data flits * 8B / time (for L0)
or 4B instead of 8B for L0p
The events that monitor the data flits are:
RxL_FLITS_G0.DATA
RxL_FLITS_G1.DRS_DATA
RxL_FLITS_G2.NCB_DATA
Q1: Are those the correct events?
Q2: If yes, should I monitor all these events and add them in order to get the total data flits or just the first?
Q3: I don't quite understand what the 8B and the time refer to.
Q4: Is there any way to validate?
Also, please feel free to suggest alternatives in monitoring QPI traffic bandwidth in case there are any.
Thank you!
A Xeon E5-2600 processor has two QPI ports, and each port can send up to one flit and receive up to one flit per QPI domain clock cycle. Not all flits carry data, but all non-idle flits consume bandwidth. It seems to me that you're interested in counting only data flits, which is useful for detecting remote-access bandwidth bottlenecks at the socket level (instead of at a particular agent within a socket).
The event RxL_FLITS_G0.DATA can be used to count the number of data flits received. This is equal to the sum of RxL_FLITS_G1.DRS_DATA and RxL_FLITS_G2.NCB_DATA. You only need to measure the latter two events if you care about the breakdown. Note that there are only 4 event counters per QPI port. The event TxL_FLITS_G0.DATA can be used to count the number of data flits transmitted to other sockets.
The events RxL_FLITS_G0.DATA and TxL_FLITS_G0.DATA together measure the total number of data flits transferred through the specified port, so it takes two of the four counters available in each port to count total data flits.
There is no accurate way to convert data flits to bytes. A flit may contain up to 8 valid bytes, depending on the type of transaction and the power state of the link direction (power states are per link per direction). A good estimate can be obtained by reasonably assuming that most data flits are part of full cache line packets and are transmitted in the L0 power state, so each flit contains exactly 8 valid bytes. Alternatively, you can just measure port utilization in terms of data flits rather than bytes.
The unit of time is up to you. Ultimately, if you want to determine whether QPI bandwidth is a bottleneck, the bandwidth has to be measured periodically and compared against the theoretical maximum bandwidth. You can, for example, use total QPI clock cycles, which can be counted on one of the free QPI port PMU counters. The QPI frequency is fixed on JKT.
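As a worked example of that arithmetic (the counts here are made up, not measured): if RxL_FLITS_G0.DATA shows 5 x 10^8 data flits over a 1-second interval, the incoming data bandwidth estimate is 5 x 10^8 flits * 8 B / 1 s = 4 GB/s, assuming the L0 power state (8 valid bytes per data flit); for L0p you would use 4 B per flit instead, as the excerpt in the question notes.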
For validation, you can write a simple program that allocates a large buffer in remote memory and reads it. The measured number of bytes should be about the same as the size of the buffer in bytes.

When do you need to modify the receive buffer size of sockets?

From time to time I see network related code in legacy source code and elsewhere modifying the receive buffer size for sockets (using setsockopt with the SO_RCVBUF option). On my Windows 10 system the default buffer size for sockets seems to be 64kB. The legacy code I am working on now (written 10+ years ago) sets the receive buffer size to 256kB for each socket.
Some questions related to this:
Is there any reason at all to modify receive buffer sizes when sockets are monitored and read continuously, e.g. using select?
If not, was there some motivation for this 10+ years ago?
Are there any examples, use cases or applications, where modification of receive buffer sizes (or even send buffer sizes) for sockets are needed?
Typically receive-buffer sizes are modified to be larger because the code's author is trying to reduce the likelihood of the condition where the socket's receive-buffer becomes full and therefore the OS has to drop some incoming packets because it has no place to put the data. In a TCP-based application, that condition will cause the stream to temporarily stall until the dropped packets are successfully resent; in a UDP-based application, that condition will cause incoming UDP packets to be silently dropped.
Whether or not doing that is necessary depends on two factors: how quickly data is expected to fill up the socket's receive-buffer, and how quickly the application can drain the socket's receive-buffer via calls to recv(). If the application is reliably able to drain the buffer faster than the data is received, then the default buffer size is fine; OTOH if you see that it is not always able to do so, then a larger receive-buffer-size may help it handle sudden bursts of incoming data more gracefully.
Is there any reason at all to modify receive buffer sizes when sockets
are monitored and read continuously, e.g. using select?
There could be, if the incoming data rate is high (e.g. megabytes per second, or even just occasional bursts of data at that rate), or if the thread is doing something between select()/recv() calls that might keep it busy for a significant period of time -- e.g. if the thread ever needs to write to disk, disk-write calls might take several hundred milliseconds in some cases, potentially allowing the socket's receive buffer to fill during that period.
For very high-bandwidth applications, even a very short pause (e.g. due to the thread being kicked off of the CPU for a few quanta, so that another thread can run for a quantum or two) might be enough to allow the buffer to fill up. It depends a lot on the application's use-case, and of course on the speed of the CPU hardware relative to the network.
As for when to start messing with receive-buffer-sizes: don't do it unless you notice that your application is dropping enough incoming packets that it is noticeably limiting your app's network performance. There's no sense allocating more RAM than you need to.
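If you do decide to raise it, a minimal set-and-verify sketch looks like the following (POSIX-style; the helper name is mine, and on Windows the option value is passed as a char pointer instead):

    // Hedged sketch: request a larger receive buffer and read back what the
    // OS actually granted (Linux, for example, doubles the requested value
    // and caps it at net.core.rmem_max).
    #include <stdio.h>
    #include <sys/socket.h>

    static void set_and_check_rcvbuf(int sock, int requested)
    {
        int actual = 0;
        socklen_t len = sizeof(actual);

        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                       &requested, sizeof(requested)) != 0)
            perror("setsockopt(SO_RCVBUF)");

        if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
            printf("requested %d bytes, kernel granted %d bytes\n",
                   requested, actual);
    }

    // e.g. set_and_check_rcvbuf(fd, 256 * 1024);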
For TCP, the RECVBUF buffer is the maximum number of unread bytes that the kernel can hold. In TCP the window size reflects the maximum number of unacknowledged bytes the sender can safely send. The sender will receive an ACK which includes a new window that depends on the free space in the RECVBUF.
When the RECVBUF is full, the sender will stop sending data. This mechanism means the sender cannot send more data than the receiving application can receive.
A small RECVBUF works well on low-latency networks, but on high-bandwidth, high-latency networks ACKs may take too long to get back to the sender, and since the sender has run out of window, it will not make use of the full bandwidth.
Increasing the RECVBUF size increases the window, which means the sender can send more data while waiting for an ACK; this allows the sender to make use of the entire bandwidth. It does mean that things are less responsive.
Shrinking the RECVBUF means the sender is more responsive to, and aware of, the receiver not consuming the data, and can back off much more quickly.
The same logic applies to the SENDBUF as well.
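To put a rough number on the window argument (a textbook bandwidth-delay-product calculation, not taken from the question): on a 1 Gbit/s path with a 50 ms round-trip time, the sender needs about 1 Gbit/s * 0.05 s = 50 Mbit, roughly 6.25 MB, in flight to keep the pipe full, so a 64 kB or even 256 kB receive buffer (and hence window) would cap TCP throughput far below the link rate on such a path.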

Using ASIO to capture lots of UDP packets

I'm using the asio (non-Boost version) library to capture incoming UDP packets via a 10 Gb Ethernet adapter.
150k packets a second is fine, but I start getting dropped packets when I go to higher rates like 300k packets/sec.
I'm pretty sure the bottleneck is in DMA'ing 300k separate transfers from the network card to the host system. The transfers aren't big, only 1400 bytes per transfer, so it's not a bandwidth issue.
Ideally I would like a mechanism to coalesce the data from multiple packets into a single DMA transfer to the host. Currently I am using asio::receive to do synchronous transfers, which gives better performance than async_receive.
I have tried using the receive call with a larger buffer, or using an array of multiple buffers, but I always seem to get a single read of 1400 bytes.
Is there any way around this?
Ideally I would like to read some multiple of the 1400 bytes at a time, as long as it didn't take too long for the total to be filled.
i.e. wait up to 4 ms and then return 4 x 1400 bytes, or simply return after 4 ms with however many bytes are available...
I do not control the entire network so I cannot force jumbo frames :(
Cheers,
I would remove the asio layer and go direct to the metal.
If you're on Linux you should use recvmmsg(2) rather than recvmsg() or recvfrom(), as it at least allows for the possibility of transferring multiple messages at a time within the kernel, which the others don't.
If you can't do either of these things, you need to at least moderate your expectations. recvfrom() and recvmsg() and whatever lies over them in asio will never deliver more than one UDP datagram at a time. You need to:
speed up your receiving loop as much as possible, eliminating all possible overhead, especially dynamic memory allocation and I/O to other sockets or files.
ensure that the socket receive buffer is as large as possible, at least a megabyte, via setsockopt()/SO_RCVBUF, and don't assume that what you set is what you got: read it back via getsockopt() to see if the platform has limited you in some way.
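Regarding the recvmmsg(2) suggestion above, here is a minimal sketch (Linux-specific; the batch size, buffer size, and function name are illustrative assumptions):

    // Hedged sketch: drain up to BATCH datagrams per system call. The first
    // recvmmsg() blocks until a datagram arrives, then returns as many
    // already-queued datagrams as fit, up to BATCH; msg_len holds each length.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH    32
    #define PKT_SIZE 1500

    static int read_batch(int sock)
    {
        static char bufs[BATCH][PKT_SIZE];
        struct mmsghdr msgs[BATCH];
        struct iovec iovecs[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iovecs[i].iov_base = bufs[i];
            iovecs[i].iov_len  = PKT_SIZE;
            msgs[i].msg_hdr.msg_iov    = &iovecs[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        int n = recvmmsg(sock, msgs, BATCH, 0, NULL);
        if (n < 0)
            perror("recvmmsg");
        return n;     // number of datagrams received, or -1 on error
    }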
Maybe you can try a workaround with tcpdump, using the libpcap library (http://www.tcpdump.org/) and filtering to receive only UDP packets.

Can boost::asio only receive full UDP datagrams?

I am working on a UDP server built with boost::asio and I started from the tutorial customizing to my needs. When I call socket.receive_from(boost::asio::buffer(buf), remote, 0, error); it fills my buffer with data from the packet, but, if my understanding is correct, it drops any data that won't fit in the buffer. Subsequent calls to receive_from will receive the next datagram available, so it looks to me like there is some loss of data without even a notice. Am I understanding this the wrong way?
I tried reading over and over the boost::asio documentation, but I didn't manage to find clues as to how I am supposed to do this the right way. What I'd like to do is reading a certain amount of data so that I can process it; if reading an entire datagram is the only way, I can manage that, but then how can I be sure not to lose the data I am receiving? What buffer size should I use to be sure? Is there any way to tell that my buffer is too small and I'm losing information?
I have to assume that I may be receiving huge datagrams by design.
This is not specific to boost; it's just how datagram sockets work. You have to specify the buffer size, and if the packet doesn't fit into the buffer, then it will be truncated and there is no way to recover the lost information.
For example, the SNMP protocol specifies that:
An implementation of this protocol need not accept messages whose length exceeds 484 octets. However, it is recommended that implementations support larger datagrams whenever feasible.
In short: you have to take it into account when designing your communication protocol that datagrams may be lost, or they may be truncated beyond some specified size.
For IPv4, the datagram size field in the UDP header is 16 bits, giving a maximum size of 65,535 bytes; when you subtract 8 bytes for the header, you end up with a maximum of 65,527 bytes of data. (Note that this would require fragmentation of the enclosing IPv4 datagram regardless of the underlying interface MTU due to the 16-bit IPv4 packet/fragment length field.)
I just use a 64 KiB buffer because it's a nice round number.
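For reference, a minimal receive_from() sketch along those lines (the port number and names are illustrative; recent Boost uses io_context, older releases call it io_service):

    // Hedged sketch: receive into a buffer sized for the largest possible
    // IPv4 UDP payload, so no datagram can ever be truncated.
    #include <array>
    #include <cstddef>
    #include <iostream>
    #include <boost/asio.hpp>

    int main()
    {
        using boost::asio::ip::udp;

        boost::asio::io_context io;
        udp::socket socket(io, udp::endpoint(udp::v4(), 12345)); // illustrative port

        std::array<char, 65527> buf;   // 65535 minus 8 bytes of UDP header
        udp::endpoint remote;
        boost::system::error_code error;

        std::size_t n = socket.receive_from(boost::asio::buffer(buf), remote, 0, error);
        if (!error)
            std::cout << "received " << n << " bytes from " << remote << std::endl;
    }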
You'll want to keep in mind that on the transmitting side you may need to explicitly enable fragmentation if you want to send datagrams larger than will fit in the interface MTU. From my Ubuntu 12.04 UDP(7) manpage:
By default, Linux UDP does path MTU (Maximum Transmission Unit) discovery. This means the kernel will keep track of the MTU to a specific target IP address and return EMSGSIZE when a UDP packet write exceeds it. When this happens, the application should decrease the packet size. Path MTU discovery can be also turned off using the IP_MTU_DISCOVER socket option or the /proc/sys/net/ipv4/ip_no_pmtu_disc file; see ip(7) for details. When turned off, UDP will fragment outgoing UDP packets that exceed the interface MTU. However, disabling it is not recommended for performance and reliability reasons.
Use getsockopt with the SO_NREAD option.
From the Mac OS X manpage:
SO_NREAD returns the amount of data in the input buffer that is available to be received. For datagram-oriented sockets, SO_NREAD returns the size of the first packet -- this differs from the ioctl() command FIONREAD that returns the total amount of data available.
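A minimal sketch of that call (macOS/BSD-specific; the helper name is mine), using the socket's native descriptor:

    // Hedged sketch: ask how large the next queued datagram is before
    // reading it, via getsockopt(SO_NREAD).
    #include <sys/socket.h>

    static int next_datagram_size(int fd)
    {
        int bytes = 0;
        socklen_t len = sizeof(bytes);
        if (getsockopt(fd, SOL_SOCKET, SO_NREAD, &bytes, &len) != 0)
            return -1;      // option not supported or other error
        return bytes;       // size of the first queued datagram, 0 if none
    }

    // With boost::asio, pass socket.native_handle() as fd.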

Why Fragmentation is Done at IP why not for TCP/UDP

I am looking for the reason why fragmentation is done at the IP level but not at the TCP/UDP level.
Suppose my frame looks like this: |MAC|IP|TCP|Payload|FCS, and the whole size is, say, 1600 bytes. Path MTU comes into play here; my question is why fragmentation is implemented at the IP level and not in the TCP/UDP level/code.
Thanks in advance.
That's exactly what the multiple layers in the TCP/IP stack and in the ISO/OSI model are for. TCP/UDP are transport protocols, and they shouldn't have to care about fragmentation: it's not their problem. The IP level deals with the network, and it deals with fragmentation since the fragment size depends on the network's properties. The layer that has the best conditions for solving the problem solves it.
Some TCP implementations also determine the MTU and size their segments to avoid fragmentation as well. Doing so improves reliability under lossy conditions, as any TCP segment that is received can be acknowledged and not retransmitted. Only lost TCP segments are retransmitted. In contrast, if any IP datagram fragment is lost, then no useful information is received.
Layer 4 (TCP/UDP) comes into the picture only at the end points (sender/receiver), while layer 3 (IP) comes into the picture on a per-hop basis.
MTU is a property of the link, but fragmentation on the basis of this link property (MTU) is always done at the IP layer on a router (hop).
The link between each pair of hops can have a different bandwidth and MTU, so at each hop it has to be decided how to forward the packet toward the destination. As the MTU is the maximum amount of data that can be pushed onto the link, if it is smaller than the size of the packet to be sent out, one has to fragment the packet into smaller chunks to fit onto the link.
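As a worked example (standard textbook numbers, not from the question): a 4000-byte IPv4 datagram (20-byte header + 3980 bytes of data) forwarded onto a link with a 1500-byte MTU is split into three fragments carrying 1480, 1480 and 1020 data bytes, each with its own 20-byte IP header and fragment offset; only the final destination reassembles them.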
Fragmentation and reassembly have several drawbacks:
1. a small increase in CPU and memory overhead
2. more overhead per packet due to the addition of fragment headers
3. if one fragment is lost, the sender has to retransmit the entire packet
To solve the above issues:
1. Path MTU Discovery can be used.
2. At layer 4, TCP MSS clamping can be used.
If fragmentation were performed on higher layers (TCP, UDP, etc.) then this would make fragmentation/reassembly redundantly implemented (once per protocol); if fragmentation were performed on a lower layer (Ethernet, ATM, etc.) then this would require fragmentation/reassembly to be performed on each hop (could be quite costly) and redundantly implemented (once per link layer protocol). Therefore, the IP layer is the most efficient one for fragmentation.
It makes less sense to fragment TCP than it does to fragment UDP. Since TCP provides a reliable segmentation/reassembly/retransmission mechanism, one can just send smaller TCP segments and avoid the whole necessity for fragmentation (this is what d3jones is talking about).
In UDP, however, fragmentation still makes sense. You can send a single UDP datagram greater in length than the MTU, and the IP layer will fragment it correctly and invisibly. The application developer doesn't have to determine the MTU or know anything about the network in order to code the application-layer protocol.