My current network stack running on DPDK supports an MTU of 1500 bytes. How can I send and receive payloads larger than 1500 bytes?
I am thinking in terms of Generic Receive Offload (GRO).
DPDK has a GRO library; have a look at the DPDK Programmer's Guide.
There are calls like rte_gro_reassemble_burst(), and there is an example in testpmd.
There is also an IP fragmentation/reassembly library with examples.
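For illustration, here is a minimal sketch of using the GRO library on the receive path, assuming TCP/IPv4 traffic on a single queue; the port/queue IDs, burst size, and flow limits are arbitrary values for this sketch, not tuned ones:

```c
#include <rte_ethdev.h>
#include <rte_gro.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Receive a burst and coalesce packets belonging to the same TCP/IPv4 flow.
 * Returns the number of (possibly merged) packets left in pkts[]. */
static uint16_t
rx_and_coalesce(uint16_t port_id, uint16_t queue_id, struct rte_mbuf **pkts)
{
    struct rte_gro_param gro_param = {
        .gro_types = RTE_GRO_TCP_IPV4,  /* which flows to coalesce */
        .max_flow_num = 4,              /* illustrative limits */
        .max_item_per_flow = BURST_SIZE,
    };

    uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
    if (nb_rx == 0)
        return 0;

    /* Merges same-flow packets in place and returns the new packet count. */
    return rte_gro_reassemble_burst(pkts, nb_rx, &gro_param);
}
```

Note that GRO works on the receive side; for transmitting payloads larger than the MTU, enabling jumbo frames on the port or using the IP fragmentation/GSO libraries would be the corresponding options.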
How do you figure out what settings need to be used to correctly configure a DPDK mempool for your application?
Specifically using rte_pktmbuf_pool_create():
n, the number of elements in the mbuf pool
cache_size
priv_size
data_room_size
EAL arguments:
-n: number of memory channels
-r: number of memory ranks
-m: amount of memory to preallocate at startup
--in-memory: no shared data structures
--iova-mode
--huge-worker-stack
My setup:
2 x Intel Xeon Gold 6348 CPU @ 2.6 GHz
28 cores per socket
Max 3.5 GHz
Hyperthreading disabled
Ubuntu 22.04.1 LTS
Kernel 5.15.0-53-generic
Cores set to performance governor
4 x Sabrent 2TB Rocket 4 Plus in RAID0 Config
128 GB DDR4 Memory
10 x 1GB HugePages (can change to what is required)
1 x Mellanox ConnectX-5 100GbE NIC
31:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Firmware-version: 16.35.1012
UDP Source:
100GbE NIC
9000-byte MTU packets
IPv4/UDP packets
I will be receiving UDP packets at 10 GB/s over a 100GbE link. The plan is to strip the headers and write the payload to a file. Right now I am trying to get it working at 2 GB/s to a single queue.
Reviewed the DPDK Programmer's Guide: https://doc.dpdk.org/guides/prog_guide/mempool_lib.html
Also searched online, but the resources seem limited. I would appreciate any help or a push in the right direction.
Based on the updates from the comments, the question can be summarized as:
What are the correct settings for the DPDK mbuf/mempool to handle 9000B UDP payloads while processing 10Gbps of packets on a 100Gbps MLX CX-5 NIC with single or multiple queues?
Let me summarize my suggestions below for this unique case.
[For 100Gbps]
As per the Mellanox DPDK performance report (test case 4), for a packet size of 1518B the theoretical and practical rate is about 8.13 million packets per second (Mpps).
Hence for a 9000B payload, since 9000B / 1518B is roughly 6, the rate works out to about 8.13 / 6 = 1.355 Mpps.
A single queue on the MLX CX-5 can achieve a maximum of 36 Mpps, so with 1 queue and jumbo frames enabled you should be able to get the 9000B packets into a single queue.
Note: at 10Gbps this will be 0.1355 Mpps.
Settings for the MBUF or mempool:
If your application logic requires 0.1 seconds to process the payload, I recommend sizing the pool at 3 * the maximum expected packets in flight, so roughly 10000 mbufs.
Give each mbuf a data_room_size of about 10000B so the whole payload fits in a single contiguous buffer.
priv_size is wholly dependent on whether your logic needs to store per-packet metadata.
Note: with multiple queues I always configure for the worst-case scenario, that is, I assume there will be an elephant flow that can land on a specific queue. So if with 1 queue you have created 10000 elements, for multiple queues I use 2.5 * 10000.
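To make the sizing concrete, here is a minimal sketch of the corresponding rte_pktmbuf_pool_create() call, assuming a single socket and the rough numbers above; the pool name, mbuf count, cache size, and data room are illustrative, not tuned values:

```c
#include <stdint.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

/* Create an mbuf pool sized for roughly 10000 in-flight 9000B payloads. */
static struct rte_mempool *
create_jumbo_pool(void)
{
    const unsigned int nb_mbufs   = 10240;  /* ~3x the expected packets in flight */
    const unsigned int cache_size = 256;    /* per-lcore cache; divides nb_mbufs  */
    const uint16_t     data_room  = 10240;  /* headroom + headers + 9000B payload */

    return rte_pktmbuf_pool_create("jumbo_pool",
                                   nb_mbufs,
                                   cache_size,
                                   0,        /* priv_size: per-mbuf metadata, if any */
                                   data_room,
                                   rte_socket_id());
}
```

The caller should check the return value for NULL (rte_errno holds the cause), and with jumbo frames the port's maximum RX packet length / MTU has to be raised to match.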
I am using an Intel E810-XXVDA4 and DPDK version 19.11. During troubleshooting we have captured outgoing traffic with dpdk-pdump:
dpdk-pdump -- --pdump 'port=1,queue=*,tx-dev=/tmp/tx.pcap'
I have opened the pcap file in Wireshark and I can see that some packets are delayed. However, system behavior indicates that the delays I see in the pcap file are not correct; they are too big.
My question is: how is the timestamp on packets in the pcap from dpdk-pdump created? Is it NIC-HW generated, CPU time at disk write, or something else?
Are there any run-time or build-time options to change the source of the timestamps?
Answer from the DPDK users mailing list, many thanks to S.H.:
Follow-up: the older dpdk-pdump in 19.11 takes the timestamp when the packet is read from the ring, which is bad.
You might have better luck with DPDK 20.11 and dpdk-dumpcap, which puts the timestamp in when the packet is put into the ring.
Let me give a picture:
```
Application (primary)
   |            1      2
   +--+-------------------------> wire
      |
      |        dumpcap (secondary)
      +---========--------------> capture file
            3
```
The ==== is the ring buffer between the processes.
1 - where 20.11 with dpdk-dumpcap gets the timestamp
2 - where you want the timestamp, but that is not possible
3 - where 19.11 and dpdk-pdump gets the timestamp
Note: I'm new to networking and DPDK, so there might be some misunderstanding of fundamental concepts...
I want to run 2 instances of dpdk-testpmd on the same host to send and receive traffic over separate NIC ports.
Configuration:
NIC:
PMD: MLX5
version: 5.0-1.0.0.0
firmware-version: 16.26.1040 (MT_0000000011)
NUMA-Socket: same
PCIe: 0000:3b:00.0, 0000:3b:00.1
Update
DPDK Version: 20.11.1
Hugepage Total: 32768
Hugepage Free: 32768
TESTPMD Logs:
Logical Core 12 (socket 0) forwards packets on 6 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
Questions:
How to set the MAC address of instance 2's port in instance 1?
What's the meaning of RX P=0/Q=0?
Will the NIC receive packets from its RXQ 0 (a ring buffer in memory)?
Will the NIC put the packets (just received from the RXQ) into its TXQ 0?
What will happen next?
Where is the packet's destination?
How to set/check it?
Hope for your help.
Thanks in advance.
Note: I highly recommend reading about dpdk-testpmd, as it covers all your questions in detail. As per the StackOverflow guidelines, multiple sub-questions make answering difficult. Please ask a well-formatted and well-formed question for better reach and answers.
@Alexcured, since you have mentioned you know how to run 2 separate instances of dpdk-testpmd, I will only recommend heavily reading up on the dpdk-testpmd documentation, which has answers to most of your questions too.
The assumption is that both PCIe NIC ports are working properly and the interconnect between the two has been tested with either arping or ping (kernel driver). After binding both PCIe devices to DPDK-supported drivers, one should use options for DPDK 20.11.1 such as the following (example launch lines are sketched after this list):
use the file-prefix option with unique names
set socket-mem to fetch memory from the desired NUMA socket
set socket-limit to prevent ballooning of the huge page mmap
use the w|b option to whitelist|blacklist the PCIe devices (0000:3b:00.0 and 0000:3b:00.1)
Since these are separate physical devices, ensure there is a physical cable connection between the 2 PCIe ports.
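As a rough illustration only (the core lists, channel count, and memory sizes below are made-up values for this sketch), the two instances might be launched along these lines:
dpdk-testpmd -l 1-2 -n 4 --file-prefix=pmd1 --socket-mem=2048 -w 0000:3b:00.0 -- -i
dpdk-testpmd -l 3-4 -n 4 --file-prefix=pmd2 --socket-mem=2048 -w 0000:3b:00.1 -- -i
Each instance then sees only its own PCIe device and uses its own hugepage files, so the two primary processes do not clash.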
[Q.1] How to set the MAC address of instance 2's port in instance 1?
[A.1] In DPDK instance 1 and instance 2, execute `show port 0 mac` to show the current port's MAC address. Note down the MAC addresses correctly. To set the MAC address of the peer, use the command `set eth-peer (port_id) (peer_addr)`.
In non-interactive mode this can also be done from the command line with testpmd's `--eth-peer=<port_id>,<MAC>` option.
[Q.2] What's the meaning of RX P=0/Q=0?
[A.2] P stands for `port`, and Q stands for `queue`. So P=0/Q=0 means `port 0/queue 0`
[Q.3] Will the NIC receive packets from its RXQ 0 (a ring buffer in memory)?
[A.3] The NIC can receive on port 0 / rxq 0, provided no offload|RSS|rules are set (since you have not shared the cmdline, I will assume there are none).
The NIC HW RX queue is mmapped to expose the HEAD-TAIL pointers, which lets the PMD trigger the DMA copy into the mempool (huge-page-backed area). This is different from the DPDK lib_rte_ring buffer area.
[Q.4] Will the NIC put the packets (just received from the RXQ) into its TXQ 0?
[A.4] If the option `start tx-first` is not used, as per the current understanding there will not be any packets transmitted from the NIC (for Intel FVL, LLDP packets will be sent by the NIC HW unless disabled).
In the default (looped) forwarding configuration, and with 1 DPDK port, whatever is received on the port will always be sent back out as TX.
[Q.5] What will happen next?
[A.5] The packet will be sent out with the modified eth-peer MAC address.
[Q.6] Where is the packet's destination?
[A.6] An Ethernet packet starts with a 6-byte destination MAC address. Once `set eth-peer` has been configured, the destination MAC address will be set to that value.
[Q.7] How to set/check it?
[A.7] use option `set verbose 2`. This will show the RX packet details received from the port
I am trying to measure the IO data transfer rate (bandwidth) between 2 simulation applications (written in C++). I created a very simple perfclient and perfserver program just to verify that my approach to calculating the network bandwidth is correct before implementing it in the real applications. So in this case, I need to do it programmatically (NOT using iperf).
I tried to run my perfclient and perfserver programs on various hosts (localhost, a computer connected via Ethernet, and a computer connected via wireless). However, I always get about the same bandwidth on each of these different hosts, around 1900 Mbps (tested using a data size of 1472 bytes). Is this a reasonable result, or can I get a better and more accurate bandwidth?
Should I use 1472 bytes (the maximum UDP payload for a 1500-byte Ethernet MTU, i.e. not including headers) as the maximum data size for each send() and recv(), and why/why not? I also tried using different data sizes, and here are the average bandwidths that I got (tested over the Ethernet connection), which did not make sense to me because the numbers exceeded 1 Gbps and reached something like 28 Gbps.
SIZE BANDWIDTH
1KB 1396 Mbps
2KB 2689 Mbps
4KB 5044 Mbps
8KB 9146 Mbps
16KB 16815 Mbps
32KB 22486 Mbps
64KB 28560 Mbps
Here is my current approach:
I did a basic ping-pong loop, where the client continuously sends a stream of bytes to the server program. The server reads that data and reflects (sends) it back to the client program. The client then reads the reflected data (2-way transmission). The above operation is repeated 1000 times, and I then divide the total time by 1000 to get the average latency. Next, I divide the average latency by 2 to get the 1-way transmission time. The bandwidth can then be calculated as follows:
bandwidth = total bytes sent / average 1-way transmission time
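For reference, here is a minimal sketch of the client side of this ping-pong loop, assuming a connected UDP socket on Linux; the server address, port, and message size are placeholders for this sketch:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define MSG_SIZE 1472   /* max UDP payload for a 1500-byte MTU */
#define ITERS    1000

int main(void)
{
    char buf[MSG_SIZE];
    memset(buf, 0, sizeof(buf));

    /* Placeholder server address/port for this sketch. */
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(5000);
    inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr);

    int s = socket(AF_INET, SOCK_DGRAM, 0);
    connect(s, (struct sockaddr *)&srv, sizeof(srv));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        send(s, buf, sizeof(buf), 0);   /* client -> server      */
        recv(s, buf, sizeof(buf), 0);   /* server echoes it back */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double one_way = total_s / ITERS / 2.0;              /* avg 1-way time       */
    double mbit_s  = (MSG_SIZE * 8.0) / one_way / 1e6;   /* bytes per 1-way trip */
    printf("avg one-way: %.1f us, ~%.1f Mbit/s\n", one_way * 1e6, mbit_s);

    close(s);
    return 0;
}
```

Error handling is omitted; with UDP a lost datagram would stall this loop, so a real measurement would need a receive timeout.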
Is there anything wrong with my approach? How can I make sure that my result is not biased? Once I get this right, I will need to use this approach in my original application (not this simple testing application), and I want to put the performance results in a scientific paper.
EDIT:
I have solved this problem. Check out the answer that I posted below.
Unless you have a need to reinvent the wheel, iperf was made to handle just this problem.
Iperf was developed by NLANR/DAST as a modern alternative for measuring maximum TCP and UDP bandwidth performance. Iperf allows the tuning of various parameters and UDP characteristics, and reports bandwidth, delay jitter, and datagram loss.
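For example (the host name is a placeholder), a basic run starts the server first and points the client at it; add -u for a UDP test and -b to set the offered rate:
iperf -s
iperf -c server-host -u -b 1000M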
I was finally able to figure this out and solve it :-)
As I mentioned in the question, regardless of the network architecture that I used (localhost, 1Gbps Ethernet card, wireless connection, etc.), my measured bandwidth scaled up to 28 Gbps. I had tried to bind the server to several different IP addresses, as follows:
127.0.0.1
IP address given by my LAN connection
IP address given by my wireless connection
I thought this would give me the correct result; in fact it didn't.
This was mainly because I was running both the client and the server program on the same computer (in different terminal windows, even though the client and server were bound to different IP addresses). My guess is that the traffic went over the internal loopback, which is the main reason why the result was so biased and inaccurate.
Anyway, I then ran the client on one workstation and the server on another workstation, tested them over the different network connections, and it worked as expected :-)
On the 1 Gbps connection I got about 980 Mbps (0.96 Gbps), and on the 10 Gbps connection I got about 9860 Mbps (9.86 Gbps). So this works exactly as I expected, and my approach is correct. Perfect!!
I'm stuck on a problem I have never heard of before.
I'm making an online game which uses UDP packets for a certain character action. After I developed the UDP module, it seemed to work fine. Most of our team members had no problem, but one man, who is my boss, told me something was wrong with that module.
I investigated the problem, and finally I found that, on his PC, if the UDP payload is less than 12 bytes, the packet is never delivered to the other host.
The following is some additional information:
UDP packets of 1-11 bytes are dropped; packets of 12 bytes and over are OK.
O/S: Microsoft Windows Vista Business
NIC: Attansic L1 Gigabit Ethernet 10/100/1000Base-T Controller
WSASendTo returns TRUE.
loopback udp packet works fine.
What do you think of this problem, and what do you think causes it?
What should I do next to find the cause?
PS. I don't want to pad every packet out to a length of 12 bytes.
Just to get one of the non-obvious answers in: maybe UDP checksum offload is broken on that card, i.e. the packets are sent, but dropped by the receiver?
You can check for this by looking at the received packets using Wireshark.
If you have already checked the firewall, antivirus, network firewall, and network intrusion detection, read this.
For a UDP packet, ethernet_header (14 bytes) + IPv4_header (20 bytes min) + UDP_header (8 bytes) = 42 bytes.
Now, since that is less than the 60-byte minimum frame size (64 bytes including the 4-byte FCS), the network driver will pad the frame with 60 - 42 = 18 zero bytes before it sends the packet out.
That is the minimum on-wire length for a UDP packet.
Theoretically you can send a packet with 0 data bytes, but I haven't tried it yet.
As for your issue, it must be an OS or driver issue. Check your network driver's manual or check with the manufacturer, because this isn't supposed to happen.
REF: http://www.freesoft.org/CIE/Course/Section4/8.htm
REF: http://en.wikipedia.org/wiki/User_Datagram_Protocol
Run Wireshark on his PC AND on the destination PC.
Does the log show the udp packet leaving his machine? Does it show it arriving on the destination PC?
What kind of router hardware or switches are between his PC and the destination? Can you remove them and link the two with a crossover cable? (Or replace the destination with a laptop and link that to his PC with a crossover cable?)
Have you removed, or at least listed, all antivirus and firewall products on his machine and anything that installs a Winsock LSP?
Do ALL packets of 12 bytes or less get dropped, or just some? Can you generate packets with random content and see if it's something in the content, rather than just the size, that's causing the issue?
Assuming your problem is with sending from his PC: First, run a packet sniffer on the problematic PC to see if it arrives at the NIC. If it makes it there, there may be a problem in the NIC or NIC driver.
Next, check for any running firewall software. Try disabling it and see what happens.
If that doesn't work, clear out any Winsock Layered Service Providers with netsh winsock reset catalog.
If that doesn't work, I'm stumped :)
Finally, you're probably going to find other customers with the same problem, so you might want to consider that padding workaround anyway. Try sending a few small UDP packets on connect, and if they consistently fail to go through, enable the padding workaround for that host. For hosts where the probe packets make it through, you don't need to pad them out.
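As an illustration of that probing idea, here is a rough POSIX-sockets sketch; it assumes the server echoes whatever it receives, which is an assumption about your protocol, and on Windows the SO_RCVTIMEO option takes a DWORD in milliseconds rather than a struct timeval:

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Returns 1 if small UDP datagrams appear to reach the server (assuming it
 * echoes them back), 0 if none of the probes come back within the timeout. */
static int small_udp_probe(int sock, const struct sockaddr_in *srv)
{
    const char probe[] = "probe";                 /* well below 12 bytes   */
    char reply[64];
    struct timeval tv = { .tv_sec = 0, .tv_usec = 200 * 1000 };

    setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    for (int i = 0; i < 3; i++) {
        sendto(sock, probe, sizeof(probe), 0,
               (const struct sockaddr *)srv, sizeof(*srv));
        if (recvfrom(sock, reply, sizeof(reply), 0, NULL, NULL) > 0)
            return 1;                             /* a small probe got through     */
    }
    return 0;                                     /* enable padding for this host  */
}
```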
Pure conjecture: RTP, which is a very common packet type sent over UDP, defines a 12-byte header. I wonder if some layer of network software is assuming that anything smaller is a malformed RTP packet and throwing it away?