I am writing a Linux network server using poll/epoll.
I plan to have lots of connected clients, say 5000-10000 or even 20000.
Each client will send some data as a request, then the server will send some data back. For simplicity, I have decided to limit this data to 16 KB for now.
I am using C++11.
As I see it,
I can create a huge "static" array of 16 KB blocks, one per client slot; e.g. for 10K connections, 10000 x 16 KB = 160 MB.
I can create an array of buffers (std::vector<char>) and push_back into them as data arrives.
I can create a std::vector of buffers.
In all cases the server will use 160 MB at full load, but if I use the "static" array there will be no memory allocations or movements after the initial allocation.
What is the best way to proceed, and is there some other solution I am missing here?
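To make the first option concrete, here is a minimal sketch of what I have in mind (kMaxClients and the index-per-connection mapping are just placeholders for illustration, not my actual code):

    #include <array>
    #include <cstddef>
    #include <vector>

    // One fixed 16 KB slot per connection, allocated once up front.
    constexpr std::size_t kBufSize    = 16 * 1024;
    constexpr std::size_t kMaxClients = 10000;

    struct ConnBuffer {
        std::array<char, kBufSize> data;
        std::size_t used = 0;              // bytes currently filled
    };

    class BufferPool {
    public:
        BufferPool() : slots_(kMaxClients) {}          // ~160 MB allocated once, here

        ConnBuffer& slot(std::size_t client_index) { return slots_[client_index]; }

    private:
        std::vector<ConnBuffer> slots_;                // never reallocated afterwards
    };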
How do you figure out what settings need to be used to correctly configure a DPDK mempool for your application?
Specifically using rte_pktmbuf_pool_create():
n, the number of elements in the mbuf pool
cache_size
priv_size
data_room_size
EAL arguments:
-n: number of memory channels
-r: number of memory ranks
-m: amount of memory to preallocate at startup
--in-memory: no shared data structures
--iova-mode
--huge-worker-stack
My setup:
2 x Intel Xeon Gold 6348 CPU @ 2.60 GHz
28 cores per socket
Max 3.5 GHz
Hyperthreading disabled
Ubuntu 22.04.1 LTS
Kernel 5.15.0-53-generic
Cores set to performance governor
4 x Sabrent 2TB Rocket 4 Plus in RAID0 Config
128 GB DDR4 Memory
10 x 1 GB HugePages (can change to what is required)
1 x Mellanox ConnectX-5 100 GbE NIC
31:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Firmware-version: 16.35.1012
UDP Source:
100 GbE NIC
9000 MTU packets
IPv4 UDP packets
Will be receiving 10 GB/s of UDP packets over a 100 GbE link. The plan is to strip the headers and write the payload to a file. Right now I am trying to get it working at 2 GB/s to a single queue.
I have reviewed the DPDK Programmer's Guide: https://doc.dpdk.org/guides/prog_guide/mempool_lib.html
I have also searched online, but the resources seem limited. I would appreciate any help or a push in the right direction.
Based on the updates from the comments, the question can be summarized as:
What are the correct settings for a DPDK mbuf/mempool that needs to handle 9000 B UDP payloads while processing 10 Gbps of packets on a 100 Gbps MLX CX-5 NIC with single or multiple queues?
Let me summarize my suggestions for this case below.
[for 100 Gbps]
As per the Mellanox DPDK performance report (test case 4), for packet size 1518 B the theoretical and practical rate is about 8.13 million packets per second (Mpps).
Hence for a 9000 B payload, which is roughly 9000/1518 ≈ 6 times larger, the rate is around 8.13/6 ≈ 1.355 Mpps.
With the MLX CX-5, a single queue can achieve a maximum of about 36 Mpps, so with 1 queue and jumbo frames enabled you should be able to receive the 9000 B packets on a single queue.
Note: at 10 Gbps this scales down to about 0.1355 Mpps.
Settings for the mbuf/mempool:
If your application logic requires 0.1 seconds to process a payload, I recommend sizing the pool at 3 * the maximum number of packets expected in that interval, so roughly 10000 packets.
Each payload needs a total buffer of about 10000 B (data_room_size) as a single contiguous buffer.
priv_size depends wholly on whether your logic needs to store per-packet metadata.
Note: with multiple queues I always configure for the worst-case scenario, i.e. I assume there will be an elephant flow that can land on a specific queue. So if with 1 queue you created 10000 elements, for multiple queues I use 2.5 * 10000.
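As a rough illustration only (a minimal sketch based on the numbers above; the element count, cache size and data room size are placeholders to be tuned, not definitive values):

    #include <cstdio>
    #include <cstdint>
    #include <rte_errno.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    // Sizing follows the discussion above; adjust to your own measurements.
    static const unsigned NB_MBUFS  = 10000;  // ~3x the packets in flight
    static const unsigned CACHE_SZ  = 250;    // per-lcore cache; keep NB_MBUFS a multiple of it
    static const uint16_t PRIV_SZ   = 0;      // no per-packet private metadata here
    static const uint16_t DATA_ROOM = 10000;  // 9000 B payload + RTE_PKTMBUF_HEADROOM + slack

    static struct rte_mempool *create_rx_pool(void)
    {
        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "rx_pool_jumbo", NB_MBUFS, CACHE_SZ, PRIV_SZ, DATA_ROOM,
            rte_socket_id());
        if (pool == NULL)
            std::fprintf(stderr, "mempool create failed: %s\n",
                         rte_strerror(rte_errno));
        return pool;
    }

Remember that the port itself also needs jumbo frames enabled (rxmode MTU / max packet length) for the 9000 B packets to arrive unsegmented.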
DPDK: 22.03
PMD: Amazon ENA
We have a DPDK application that only calls rte_eth_rx_burst() (we do not transmit packets) and it must process the payload very quickly. The payload of a single network packet MUST be in contiguous memory.
The DPDK API is optimized around memory pools of fixed-size mbufs. If a packet received on the DPDK port is larger than the mbuf size, but smaller than the max MTU, then it will be segmented into a chain of mbufs.
This leads us to the following problems:
If we configure the memory pool to store large packets (for example max MTU size), then we will always store the payload in contiguous memory, but we will waste huge amounts of memory when we receive traffic containing small packets. Imagine that our mbuf size is 9216 bytes, but we are receiving mostly packets of size 100-300 bytes. We are wasting memory by a factor of 90!
If we reduce the size of the mbufs to, let's say, 512 bytes, then we need special handling of those segments in order to store the payload in contiguous memory. Special handling and copying hurt our performance, so they should be limited.
My final question:
What strategy is recommended for a DPDK application that needs to process the payload of network packets in contiguous memory, with both small (100-300 bytes) and large (9216 bytes) packets, without wasting huge amounts of memory on 9K-sized mbuf pools? Is copying segmented jumbo frames into a larger mbuf the only option?
There are a couple of ways, involving HW and SW logic, to make use of multiple mempool sizes.
via hardware:
If the NIC PMD supports packet or metadata (RX descriptor) parsing, one can use RTE_FLOW RAW to steer flows to specific queues, where each queue can be set up with the desired rte_mempool.
If the NIC PMD does not support parsing of metadata (RX descriptors) but the user knows specific protocol fields, like ETH + MPLS|VLAN, ETH + IP + UDP, or ETH + IP + UDP + tunnel (Geneve|VxLAN), one can use RTE_FLOW to distribute that traffic to specific queues (which have a larger mempool object size), letting default traffic fall on queue-0 (which has a smaller mempool object size); a rough sketch follows after this list.
If the hardware option of flow bifurcation is available, one can set up RTE_FLOW with raw or tunnel headers to redirect traffic to a VF; the PF can then use a small-object mempool and the VF a large-object mempool.
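As a rough sketch of the RTE_FLOW-to-queue idea (assuming the PMD supports the ETH/IPV4/UDP pattern and the QUEUE action; queue 1 here stands for the queue that was set up with the large-object mempool):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <rte_flow.h>

    // Steer all ETH+IPv4+UDP traffic to queue 1 (large-object mempool),
    // leaving everything else on queue 0 (small-object mempool).
    static struct rte_flow *steer_udp_to_big_queue(uint16_t port_id)
    {
        struct rte_flow_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.ingress = 1;

        struct rte_flow_item pattern[4];
        std::memset(pattern, 0, sizeof(pattern));
        pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
        pattern[1].type = RTE_FLOW_ITEM_TYPE_IPV4;
        pattern[2].type = RTE_FLOW_ITEM_TYPE_UDP;   // add spec/mask here to narrow the match
        pattern[3].type = RTE_FLOW_ITEM_TYPE_END;

        struct rte_flow_action_queue queue;
        std::memset(&queue, 0, sizeof(queue));
        queue.index = 1;                            // queue backed by the large mempool

        struct rte_flow_action actions[2];
        std::memset(actions, 0, sizeof(actions));
        actions[0].type = RTE_FLOW_ACTION_TYPE_QUEUE;
        actions[0].conf = &queue;
        actions[1].type = RTE_FLOW_ACTION_TYPE_END;

        struct rte_flow_error err;
        struct rte_flow *flow = rte_flow_create(port_id, &attr, pattern, actions, &err);
        if (flow == NULL)
            std::printf("rte_flow_create failed: %s\n",
                        err.message ? err.message : "(no message)");
        return flow;
    }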
via software (if HW support is absent or limited):
SW-1: Using an RX callback (rte_rx_callback_fn), check mbuf->nb_segs > 1 to confirm multiple segments are present, then allocate an mbuf from the larger mempool, attach it as the first segment and invoke rte_pktmbuf_linearize to move the content into that first buffer.
SW-2: Pre-set all queues with the large-size mempool; in an RX callback check mbuf->pkt_len < [threshold size], and if so allocate an mbuf from the smaller pool, memcpy the content (packet data and necessary metadata), then swap the original mbuf for the new one and free the original (a rough sketch appears after the pros and cons below).
Pros and Cons:
SW-1: this is a costly process, as accessing multiple segments means non-contiguous memory, and it is done for larger payloads such as 2K to 9K. The NIC hardware also has to support RX scatter/multi-segment.
SW-2: this is less expensive than SW-1. As there are no multiple segments, the cost can be amortized with mtod and prefetching of the payload.
Note: in both cases, the cost of freeing mbufs within the RX callback can be reduced by maintaining a list of original mbufs to free later.
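A rough sketch of SW-2 (this is not production code; small_pool and the 512 B threshold are assumptions for illustration):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_memcpy.h>
    #include <rte_mempool.h>

    #define SMALL_THRESHOLD 512

    // Created elsewhere, e.g. rte_pktmbuf_pool_create() with ~512 B + headroom data room.
    static struct rte_mempool *small_pool;

    static uint16_t
    rx_shrink_cb(uint16_t port __rte_unused, uint16_t queue __rte_unused,
                 struct rte_mbuf *pkts[], uint16_t nb_pkts,
                 uint16_t max_pkts __rte_unused, void *user_param __rte_unused)
    {
        for (uint16_t i = 0; i < nb_pkts; i++) {
            struct rte_mbuf *m = pkts[i];
            if (m->nb_segs != 1 || m->pkt_len >= SMALL_THRESHOLD)
                continue;                        // keep large packets in the big mbuf

            struct rte_mbuf *s = rte_pktmbuf_alloc(small_pool);
            if (s == NULL)
                continue;                        // small pool exhausted: keep the original

            // copy the payload plus whatever metadata your logic needs
            rte_memcpy(rte_pktmbuf_mtod(s, void *),
                       rte_pktmbuf_mtod(m, const void *), m->data_len);
            s->data_len    = m->data_len;
            s->pkt_len     = m->pkt_len;
            s->port        = m->port;
            s->packet_type = m->packet_type;

            pkts[i] = s;                         // swap the new mbuf into the burst
            rte_pktmbuf_free(m);                 // or stash it in a list and free in bulk
        }
        return nb_pkts;
    }

    // Registered once per queue, e.g. after rte_eth_rx_queue_setup():
    // rte_eth_add_rx_callback(port_id, queue_id, rx_shrink_cb, NULL);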
Alternative option-1 (involves modifying the PMD):
Modify the PMD probe/create code to allocate mempools for both large and small objects.
Set the max elements per RX burst to 1 element.
Use the scalar code path only.
Change the recv function to:
check the packet size by reading the RX descriptor,
comment out the original per-threshold replenish code,
allocate from either the large or the small mempool accordingly.
[edit-1] Based on the comment update, the DPDK version is 22.03 and the PMD is Amazon ENA. Based on the DPDK NIC feature summary and the ENA PMD, this points to:
No RTE_FLOW RSS to specific queues.
No RTE_FLOW_RAW for packet size.
In the function ena_rx_queue_setup, it supports an individual rte_mempool per RX queue.
Hence the current options are:
Modify the ENA PMD to support multiple mempool sizes.
Use SW-2 with an RX callback to copy smaller payloads into a new mbuf and swap it in.
Note: there is an alternate approach:
creating an empty pool backed by an external mempool,
then using a modified ENA PMD to get pool objects as single small buffers or as multiple contiguous pool objects.
Recommendation: use a PMD or programmable NIC that can bifurcate based on packet size and then use RTE_FLOW to send traffic to a specific queue. To allow multiple CPUs to process multiple flows, set up queue-0 as the default for small packets, and the other queues via RTE_FLOW RSS, each with its specific mempool.
I'm using the asio (non-Boost version) library to capture incoming UDP packets via a 10 Gb Ethernet adapter.
150k packets a second is fine, but I start getting dropped packets when I go to higher rates like 300k packets/sec.
I'm pretty sure the bottleneck is DMA'ing 300k separate transfers from the network card to the host system. The transfers aren't big, only 1400 bytes per transfer, so it's not a bandwidth issue.
Ideally I would like a mechanism to coalesce the data from multiple packets into a single DMA transfer to the host. Currently I am using asio::receive to do synchronous transfers, which gives better performance than async_receive.
I have tried using the receive command with a larger buffer, or using an array of multiple buffers, but I always seem to get a single read of 1400 bytes.
Is there any way around this?
Ideally I would like to read some multiple of the 1400 bytes at a time, as long as it didn't take too long for the total to be filled.
I.e. wait up to 4 ms and then return 4 x 1400 bytes, or simply return after 4 ms with however many bytes are available...
I do not control the entire network, so I cannot force jumbo frames. :(
Cheers,
I would remove the asio layer and go direct to the metal.
If you're on Linux you should use recvmmsg(2) rather than recvmsg() or recvfrom(), as it at least allows for the possibility of transferring multiple messages at a time within the kernel, which the others don't (a rough sketch follows after the list below).
If you can't do either of these things, you need to at least moderate your expectations. recvfrom() and recvmsg() and whatever lies over them in asio will never deliver more than one UDP datagram at a time. You need to:
speed up your receiving loop as much as possible, eliminating all possible overhead, especially dynamic memory allocation and I/O to other sockets or files.
ensure that the socket receive buffer is as large as possible, at least a megabyte, via setsockopt()/SO_RCVBUF, and don't assume that what you set is what you got: read it back via getsockopt() to see whether the platform has limited you in some way.
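A rough sketch of what that could look like on Linux (the batch size, buffer sizes and blocking-socket assumption are illustrative only, not tuned values):

    #include <cstdio>
    #include <cstring>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    // Requires _GNU_SOURCE (g++ defines it by default) and a Linux kernel with recvmmsg().
    enum { BATCH = 64, DGRAM_MAX = 1500 };

    static void rx_loop(int fd)
    {
        int rcvbuf = 4 * 1024 * 1024;                     // ask for 4 MB
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
        socklen_t len = sizeof(rcvbuf);
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
        std::printf("effective SO_RCVBUF: %d\n", rcvbuf); // the kernel may clamp it

        static char bufs[BATCH][DGRAM_MAX];
        struct mmsghdr msgs[BATCH];
        struct iovec   iovecs[BATCH];

        for (;;) {
            std::memset(msgs, 0, sizeof(msgs));
            for (int i = 0; i < BATCH; i++) {
                iovecs[i].iov_base         = bufs[i];
                iovecs[i].iov_len          = DGRAM_MAX;
                msgs[i].msg_hdr.msg_iov    = &iovecs[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
            }
            // Blocks until at least one datagram arrives, then drains up to BATCH of them.
            int n = recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, nullptr);
            if (n < 0)
                break;
            for (int i = 0; i < n; i++) {
                // msgs[i].msg_len bytes of payload are now in bufs[i]
            }
        }
    }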
Maybe you can try a workaround with packet capture, using the libpcap library (http://www.tcpdump.org/) and filtering to receive the UDP packets.
I have a C++ application that reads data from a serial port, but I'm seeing some weird behavior with regard to the amount of time it takes to complete a select() call. I'm reading data from another device which streams out 5 kB blocks at 115200 baud; the device streams out a total of 200 kB during an entire transaction. The actual read calls grab 512 bytes at a time.
When grabbing the 5 kB blocks, the first select() call takes ~475 ms and the subsequent ones take 48 ms. So it looks like select() only unblocks once all the data reaches the port: 475 ms at 115200 baud gets me about 5 kB, and 48 ms gets me about 512 bytes, which is the size of the UART buffer on my device. What's strange is that when grabbing the last 3-4 blocks, the first select() call returns in 48 ms, which is what I'd expect it to do in the first place.
From what I gather, it looks like select() blocks until the device has finished writing all the data over UART, since it's only the initial call in each block that's affected. Is select() supposed to return once there's any data to read on the device, or are there some other conditions it's looking for? Is there any way I can configure my serial port so that select() returns once there is data available on the handle?
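For reference, the port is opened and polled roughly like this (a simplified sketch, not my exact code; the device path, raw-mode settings and baud constant are assumptions):

    #include <cstdio>
    #include <fcntl.h>
    #include <sys/select.h>
    #include <termios.h>
    #include <unistd.h>

    // Raw (non-canonical) setup; with VMIN=1 and VTIME=0 I expect select()/read()
    // to be satisfied as soon as a single byte is available.
    static int open_serial(const char *path)
    {
        int fd = open(path, O_RDWR | O_NOCTTY);
        if (fd < 0)
            return -1;

        struct termios tio;
        if (tcgetattr(fd, &tio) != 0) { close(fd); return -1; }

        cfmakeraw(&tio);              // no line buffering, no character translation
        cfsetispeed(&tio, B115200);
        cfsetospeed(&tio, B115200);
        tio.c_cc[VMIN]  = 1;          // report readable as soon as 1 byte arrives
        tio.c_cc[VTIME] = 0;          // no inter-byte timer
        tcsetattr(fd, TCSANOW, &tio);
        return fd;
    }

    static void read_loop(int fd)
    {
        char buf[512];                // reads grab 512 bytes at a time
        for (;;) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            if (select(fd + 1, &rfds, nullptr, nullptr, nullptr) <= 0)
                break;
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n <= 0)
                break;
            std::printf("got %zd bytes\n", n);
        }
    }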
I've developed a TCP network application using boost::asio with an asynchronous approach. The application sends around 1 GB of data in the following way:
Send a 5-byte command (using async_write())
Send 1024 bytes of data (using another async_write())
Repeat until the entire 1 GB of data is sent
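Roughly, the asynchronous path looks like this (a simplified sketch of the pattern above, not my exact code; the helper type and its names are placeholders):

    #include <boost/asio.hpp>
    #include <array>
    #include <cstddef>
    #include <cstdint>

    using boost::asio::ip::tcp;

    // Each chunk is a 5-byte command followed by 1024 bytes of data, each sent
    // with its own async_write(); the next chunk is queued from the handler.
    struct Sender {
        tcp::socket &sock;
        std::array<std::uint8_t, 5>    cmd;
        std::array<std::uint8_t, 1024> block;
        std::size_t chunks_left;

        void send_next() {
            if (chunks_left == 0)
                return;
            --chunks_left;
            boost::asio::async_write(sock, boost::asio::buffer(cmd),
                [this](const boost::system::error_code &ec, std::size_t) {
                    if (ec) return;
                    boost::asio::async_write(sock, boost::asio::buffer(block),
                        [this](const boost::system::error_code &ec2, std::size_t) {
                            if (!ec2)
                                send_next();      // repeat until ~1 GB has been sent
                        });
                });
        }
    };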
When I use a synchronous approach the performance is as expected (around 9 seconds to transmit 1 GB of data over 1 Gb Ethernet), but when I use asynchronous calls the performance decreases and 20 seconds are needed to transmit the same amount of data.
I have tried disabling Nagle's algorithm, but it doesn't solve the problem.
Do you know if using several async_write() calls with small amounts of data can have a negative impact on performance?
Thanks!