How to receive 40Gbps line rate traffic with zero loss? - dpdk

My goal is to capture incoming packets with DPDK. To do this I want to integrate the DPDK ethdev API into my project so it receives all incoming packets (NIC rate: 40Gbps, packet size 1500 bytes) with zero packet loss.
I don't know how to do this.
I installed DPDK following [DPDK Quick Installation.][1]

Shared a debug session and showcased that 4 x 10Gbps X710 (Fortville) ports are able to send 40Gbps of traffic from pktgen. On the receiver end, examples/skeleton with 1 core shows how the total 40Gbps is received and processed.
pkt-size is 1500, as requested by @Alexanov.
Hence there is no issue in the DPDK library receiving packets at this rate.

Related

tcp packet loss occurs occasionally when use dpdk19.11 i40e NIC

I am using an XL710 i40e NIC with DPDK 19.11. I found that the NIC occasionally lost packets when I enabled the TSO feature.
The detailed information is as follows:
https://github.com/JiangHeng12138/dpdk-issue/issues/1
I guessed the packet loss is caused by the i40e NIC driver, but I don't know how to debug the i40e driver code. Could you please suggest an effective approach?
Based on the problem statement (tcp packet loss occurs occasionally when use dpdk19.11 i40e NIC), one first needs to isolate whether it is the client (peer system) or the server (DPDK DUT) that leads to packet loss. To debug the issue on the DPDK server side, one needs to evaluate both RX and TX. The DPDK tool dpdk-procinfo can retrieve port statistics, which can be used for the analysis of the issue.
Diagnose the issue:
Run the application (dpdk primary) to reproduce the issue in terminal-1.
In terminal-2, run the command dpdk-procinfo -- --stats (refer to the dpdk-procinfo documentation for more details).
Check the RX-errors counter; it shows whether faulty packets were dropped at the PMD level.
Check the RX-nombuf counter; it shows whether packets from the NIC could not be DMA-ed to DDR memory on the host because no mbufs were available.
Check the TX-errors counter; it shows whether the copy of packet descriptors (DMA descriptors) to the NIC was faulty.
Also check the HW NIC statistics with dpdk-procinfo -- --xstats for any error or drop counter updates.
(sample capture of the stats and xstats counters on the desired NIC)
Note:
"tx_good_packets" means the number of packets sent by the dpdk NIC. if the number of packets tried to be sent is equal to "tx_good_packets", there is no packet dropped at the sent client.
"rx-missed-errors" means packets loss at the receiver; this means you are processing packets more than what the Current CPU can handle. So either you will need to increase CPU frequency, or use additional cores to distribute the traffic.
If none of these counters is updated or errors are found, then the issue is at the peer (client non-dpdk) side.

Rss hash for fragmented packet

I am using a Mellanox Technologies MT27800 Family [ConnectX-5], with DPDK multi RX queue and RSS "ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP".
I analyze traffic and need all packets of the same session to arrive at the same process (a session, for now, is ip+port),
so packets that have the same ip+port arrive at the same queue.
But if some packets are IP fragmented, they arrive at a different process. That is a problem!
How can I calculate the hash value in C++ code, the way it is done in the card, so I can reassemble fragments and send them to the same process as the non-fragmented packets?
Instead of ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP you can use only ETH_RSS_IP to calculate the RSS hash from the IP addresses alone. This way, even if a packet is fragmented, all segments of the packet will arrive at the same CPU core.
The RSS value of a packet can be calculated in software using the following library: https://doc.dpdk.org/api/rte__thash_8h.html
While this option is possible, I would still recommend checking out the proposed setting of ETH_RSS_IP only.
When ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP is enabled, the RSS function takes IP addresses and src + dst ports to calculate the RSS hash value. As you don't have ports present in the IP fragmented packets, you are unable to compute the same value as non-fragmented IP packets.
You either can:
reassemble the IP fragments to form a complete IP packet and then using the rte_thash library compute the RSS value,
compute the RSS value only from the IP addresses (use the ETH_RSS_IP setting only).
As you are only doing load-balancing on CPU cores I think the latter option suits your use-case well enough.
@Davidboo based on the question and the explanation in the comments, what you have described is:
all packets of the same session arrive at the same process (a session, for now, is ip+port) - which means you are looking for a symmetric hash
some packets are IP fragmented and arrive at a different process - you need packet reassembly before the symmetric RSS
Mellanox Technologies MT27800 Family [ConnectX-5] - the current NIC does not support reassembly in the NIC (embedded switch).
Hence the actual question is: what is the right way to solve the problem within the following constraints? There are 3 solutions (1 HW and 2 SW).
option 1 (HW): Use a smart NIC or network appliance offload card that can ingress the traffic and reassemble fragments before sending them to the host server.
option 2 (SW): Disable RSS and use a single RX queue. Check whether each packet is a fragment. If yes, reassemble, then use rte_flow_distributor or rte_eventdev with atomic flows to spread traffic to worker cores.
option 3 (SW): Disable RSS, but use n + 1 RX queues and n SW rings. By default all packets will be received on queue 0. Based on JHASH (for SW RSS), add rte_flow rules pinning the flows to queues 1 to n.
Assuming you cannot change the NIC, my recommendation is option-2 with eventdev, for the following reasons.
It is much easier and simpler to let either HW or SW DLB (eventdev) spread traffic across multiple cores.
rte_flow_distributor is similar to HW RSS, where static pinning will lead to congestion and packet drops.
Unlike option-3, one need not maintain a state flow table to keep track of the flows.
How to achieve this:
Use the DPDK example code skeleton to create the basic port initialization.
Enable DPDK rte_ethdev PTYPES to identify IP, non-IP and IP-fragmented packets without parsing the frame and payload.
Check whether packets are fragmented using RTE_ETH_IS_IPV4_HDR and rte_ipv4_frag_pkt_is_fragmented for IPv4, and RTE_ETH_IS_IPV6_HDR and rte_ipv6_frag_get_ipv6_fragment_header for IPv6 (refer to the DPDK example ip_reassembly).
Extract SRC-IP, DST-IP, SRC-PORT, DST-PORT (if the packet is not SCTP, UDP or TCP, set src port and dst port to 0) and use rte_hash_crc_2byte. This ensures a symmetric hash (SW RSS).
Then feed the packet with the hash value to eventdev stage or flow distributor stage (refer eventdev_pipeline or distributor).
Note:
each worker core will run the business logic.
On Broadwell Xeon, a core can handle around 30Mpps.
On Icelake Xeon, a core can handle around 100Mpps.

Why do I see strange UDP fragmentation on my C++ server?

I have built a UDP server in C++ and I have a couple of questions about it.
Goal:
I have incoming TCP traffic and I need to send it on as UDP traffic. My own UDP server then processes this UDP data.
The size of the TCP packets can vary.
Details:
In my example I have a TCP packet that consists of a total of 2000 bytes (4 random bytes, 1995 'a' (0x61) bytes and the last byte being 'b' (0x62)).
My UDP server has a buffer (recvfrom buffer) with a size larger than 2000 bytes.
My MTU size is 1500 everywhere.
My server is receiving this packet correctly. In my UDP server I can see the received packet has a length of 2000 and if I check the last byte buffer[1999], it prints 'b' (0x62), which is correct. But if I open tcpdump -i eth0 I see only one UDP packet: 09:06:01.143207 IP 192.168.1.1.5472 > 192.168.1.2.9000: UDP, bad length 2004 > 1472.
With the tcpdump -i eth0 -X command, I see the data of the packet, but only ~1472 bytes, which does not include the 'b' (0x62) byte.
The ethtool -k eth0 command prints udp-fragmentation-offload: off.
So my questions are:
Why do I only see one packet and not two (fragmented part 1 and 2)?
Why don't I see the 'b' (0x62) byte in the tcpdump?
In my C++ server, what buffer size is best to use? I have it at 65535 now because the incoming TCP packets can be any size.
What will happen if the size exceeds 65535 bytes? Will I have to implement my own fragmentation scheme before sending the TCP data as UDP?
OK, circumstances are more complicated than they appear from the question. Extracted from your comments, the following information is available:
Some client sends data to a server – both not modifiable – via TCP.
In between both resides a firewall, though, only allowing uni-directional communication to server via UDP.
You now intend to implement a proxy consisting of two servers residing in between and tunneling TCP data via UDP.
Not being able for the server to reply backwards does not impose a problem either.
My personal approach would then be as follows:
Let the proxy servers be entirely data unaware! Let the outbound receiver accept (recv or recvfrom depending on a single or multiple clients being available) chunks of data that yet fit into UDP packets and simply forward them as they are.
Apply some means to assure lost data is at least detected, better such that lost data can be reconstructed. As confirmation or re-request messages are impossible due to firewall limitation, only chance to increase reliability is via redundancy, though.
Configure the final target server to listen on loopback only.
Let the inbound proxy connect to the target server via TCP and as long as no (non-recoverable) errors occur just forward any incoming data as is.
To be able to detect lost messages I'd at very least prepend a packet counter to any UDP message sent. If two subsequent messages do not provide consecutive counter values then a message has been lost in between.
As no backwards communication is possible, the only way to increase reliability is unconditional redundancy, trading some of your transfer rate for it, e.g. by sending every message more than once and ignoring surplus duplicates on the reception side.
A more elaborate approach might distribute redundant data over several packets such that a missing one can be reconstructed from the remaining ones – maybe similar to what RAID level 5 does. Admitted, you need to be pretty committed to try that...
Final question would be how routing looks like. There's no guarantee with UDP for packets being received in the same order as they are sent. If there's really only one fix route available from outbound proxy to inbound one via firewall then packets shouldn't overtake one another – you might still want to at least log to file appropriately to monitor the inbound UDP packets and in case of errors occurring apply appropriate means (buffering packets and re-ordering them if need be).
The size of the TCP packets can vary.
While there is no code shown, the sentence above and your description suggest that you are working with wrong assumptions about how TCP works.
Contrary to UDP, TCP is not a message-based protocol but a byte stream. This especially means that it does not guarantee that a single send at the sender will be matched by a single recv at the recipient. Thus even if the send is done with 2000 bytes, it might still be that the first recv only gets 1400 bytes while another recv gets the rest - no matter whether everything would fit into the socket buffer at once.

Unable to receive burst of 1 packet with rte_eth_rx_burst

I am trying to receive packets transmitted by another DPDK application on a different system. I am able to transmit a burst of 1 packet with the rte_eth_tx_burst API, but unable to receive with a burst value of 1 in the rte_eth_rx_burst API. I can receive packets only if the rx burst value is 4 or higher. Is this because of some ethdev configuration?
The ixgbe and i40e devices have this behaviour, while virtio-net does not: their vectorized RX paths process descriptors in groups, so rte_eth_rx_burst returns nothing when asked for fewer packets than the group size. Either call rte_eth_rx_burst with nb_pkts of at least 4, or modify RTE_*_DESCS_PER_LOOP in the driver's *_rxtx.h file and rebuild.

Failed to post large data (>9MB) using libCurl

I am getting a failure while sending large data (nearly 9-10 MB) using C++ libcurl, whereas data of a smaller size works fine.
I am trying to send the data from a Linux machine to an Exchange Server (Windows 2008). From the wireshark capture, I can see the first few packets being uploaded successfully, but then I see errors like TCP retransmissions, TCP zero-size window, TCP window update, etc. Apart from these packets, I can see one packet related to HTTP 400, followed by a TCP RST (thereby terminating the connection).
source address (Linux machine): 10.65.156.77
destination IP address (Exchange server): 10.76.214.38
libcurl version: 7.30
Any help will be appreciated.