RSS hash for fragmented packets - DPDK

I am using a Mellanox Technologies MT27800 Family [ConnectX-5] NIC with DPDK multi RX queue RSS ("ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP").
I analyze traffic and need all packets of the same session to arrive at the same process (a session for now can be IP + port), so packets that have the same IP + port arrive at the same queue.
But if some packets are IP fragmented, they arrive at a different process. That is a problem!
How can I calculate the hash value in my C++ code, the same way it is done in the card, so I can reassemble fragmented packets and send them to the same process as the non-fragmented packets?

Instead of ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP, you can use only ETH_RSS_IP to calculate the RSS hash from the IP addresses of the packet alone. That way, even if the packet is fragmented, all fragments of the packet will arrive at the same CPU core.
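For illustration, a minimal sketch of such a port configuration (using the same pre-21.11 ETH_* constants as the question; the rest of the init follows the usual rte_eth_dev_configure() flow):

```cpp
#include <rte_ethdev.h>

// Sketch: RSS over IP addresses only, so every fragment of a flow carries
// the same hash and lands on the same RX queue.
static struct rte_eth_conf make_port_conf(void)
{
    struct rte_eth_conf conf = {};                 // zero everything else
    conf.rxmode.mq_mode = ETH_MQ_RX_RSS;           // enable RSS
    conf.rx_adv_conf.rss_conf.rss_key = NULL;      // keep the NIC's default key
    conf.rx_adv_conf.rss_conf.rss_hf = ETH_RSS_IP; // hash on L3 only, no ports
    return conf;
}
```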
The RSS value of a packet can be calculated in software using the rte_thash library (https://doc.dpdk.org/api/rte__thash_8h.html); a sketch follows at the end of this answer.
While this option is possible, I would still recommend that you check out the proposed setting of ETH_RSS_IP only.
When ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP is enabled, the RSS function takes the IP addresses and the src + dst ports to calculate the RSS hash value. As the ports are not present in IP fragments, you are unable to compute the same value as for non-fragmented IP packets.
You can either:
reassemble the IP fragments to form a complete IP packet and then compute the RSS value using the rte_thash library, or
compute the RSS value from the IP addresses only (use the ETH_RSS_IP setting alone).
As you are only doing load balancing across CPU cores, I think the latter option suits your use case well enough.
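The following is a minimal sketch (my own illustration, not from the answer above) of computing that L3-only RSS value in software with rte_thash. The key below is a placeholder; the key actually programmed into the NIC can be read back with rte_eth_dev_rss_hash_conf_get(), and the two must match for the software hash to equal the hardware one:

```cpp
#include <rte_byteorder.h>
#include <rte_thash.h>

// Placeholder 40-byte RSS key: replace it with the key read back from the
// port via rte_eth_dev_rss_hash_conf_get() so SW and HW hashes match.
static uint8_t rss_key[40];

// Compute the RSS hash over source/destination IPv4 addresses only,
// mirroring what the NIC does when ETH_RSS_IP alone is enabled.
static uint32_t softrss_ipv4_l3(rte_be32_t src_addr, rte_be32_t dst_addr)
{
    union rte_thash_tuple tuple;
    // rte_softrss() expects the tuple fields in host byte order.
    tuple.v4.src_addr = rte_be_to_cpu_32(src_addr);
    tuple.v4.dst_addr = rte_be_to_cpu_32(dst_addr);
    return rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L3_LEN, rss_key);
}
```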

@Davidboo, based on the question and the explanation in the comments, what you have described is:
all packets of the same session arrive at the same process (a session for now can be IP + port) - which means you are looking for a symmetric hash
some packets are IP fragmented and arrive at different processes - you need packet reassembly before symmetric RSS
Mellanox Technologies MT27800 Family [ConnectX-5] - the current NIC does not support reassembly in the NIC (embedded switch).
Hence the actual question is what the right way is to solve the problem under the following constraints. There are 3 solutions (1 HW and 2 SW).
Option 1 (HW): use a SmartNIC or a network-appliance offload card that can ingress the traffic and reassemble fragments before sending them to the host server.
Option 2 (SW): disable RSS and use a single RX queue. Check whether each packet is a fragment; if yes, reassemble it, and then use rte_flow_distributor or rte_eventdev with atomic flows to spread the traffic across worker cores.
Option 3 (SW): disable RSS, but use n + 1 RX queues and n SW rings. By default all packets will be received on queue 0. Based on JHASH (for SW RSS), add rte_flow rules pinning the flows to queues 1 to (n + 1).
Assuming you cannot change the NIC, my recommendation is option 2 with eventdev, for the following reasons.
It is much easier and simpler to let either a HW or SW DLB (eventdev) spread traffic across multiple cores.
rte_flow_distributor is similar to HW RSS, where static pinning will lead to congestion and packet drops.
Unlike option 3, one does not need to maintain a flow-state table to keep track of the flows.
How to achieve this:
Use the DPDK example code skeleton to create the basic port initialization.
Enable DPDK rte_ethdev PTYPES to identify IP, non-IP and IP-fragmented packets without parsing the frame and payload.
Check whether packets are fragmented using RTE_ETH_IS_IPV4_HDR and rte_ipv4_frag_pkt_is_fragmented for IPv4, and RTE_ETH_IS_IPV6_HDR and rte_ipv6_frag_get_ipv6_fragment_header for IPv6 (refer to the DPDK example ip_reassembly).
Extract SRC-IP, DST-IP, SRC-PORT, DST-PORT (if the packet is not SCTP, UDP or TCP, set src port and dst port to 0) and use rte_hash_crc_2byte to build a symmetric hash (SW RSS); a sketch follows after this list.
Then feed the packet, along with its hash value, to the eventdev stage or flow-distributor stage (refer to the eventdev_pipeline or distributor examples).
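As a sketch of the hashing step: the min/max ordering below is my own addition to make the hash direction-independent (the answer itself only names rte_hash_crc_2byte), so (src, dst) and (dst, src) produce the same value:

```cpp
#include <rte_common.h>
#include <rte_hash_crc.h>

// Symmetric SW hash: order the endpoints before hashing. Ports are 0 for
// fragments and non-L4 packets, as suggested above.
static uint32_t
symmetric_sw_hash(uint32_t src_ip, uint32_t dst_ip,
                  uint16_t src_port, uint16_t dst_port)
{
    uint32_t hash = rte_hash_crc_4byte(RTE_MIN(src_ip, dst_ip), 0);
    hash = rte_hash_crc_4byte(RTE_MAX(src_ip, dst_ip), hash);
    hash = rte_hash_crc_2byte(RTE_MIN(src_port, dst_port), hash);
    hash = rte_hash_crc_2byte(RTE_MAX(src_port, dst_port), hash);
    return hash;
}
```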
Note:
Each worker core will be running the business logic.
On a Broadwell Xeon, a core can handle around 30 Mpps.
On an Ice Lake Xeon, a core can handle around 100 Mpps.

TCP packet loss occurs occasionally when using DPDK 19.11 with an i40e NIC

I am using an XL710 i40e NIC with DPDK 19.11. I found that the NIC occasionally lost packets when I enabled the TSO feature.
The detailed information is as follows:
https://github.com/JiangHeng12138/dpdk-issue/issues/1
I guessed the packet loss is caused by the i40e NIC driver, but I don't know how to debug the i40e driver code. Could you please suggest an effective approach?
Based on the problem statement (TCP packet loss occurs occasionally when using DPDK 19.11 with an i40e NIC), one needs to isolate whether it is the client (peer system) or the server (DPDK DUT) that causes the packet loss. To debug the issue on the DPDK server side, one needs to evaluate both RX and TX. The DPDK tool dpdk-procinfo can retrieve port statistics, which can be used for the analysis.
Diagnose the issue:
Run the application (dpdk primary) to reproduce the issue in terminal-1.
In terminal-2, run the command dpdk-procinfo -- --stats (refer to the link for more details).
Check the RX-errors counter; this shows whether faulty packets were dropped at the PMD level.
Check the RX-nombuf counter; this shows whether packets from the NIC could not be DMA-ed into DDR memory on the host.
Check the TX-errors counter; this shows whether the copy of packet descriptors (DMA descriptors) to the NIC was faulty.
Also check the HW NIC statistics with dpdk-procinfo -- --xstats for any error or drop counter updates.
[Image: sample capture of the stats and xstats counters on the desired NIC]
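If it is more convenient to read the counters from inside the DPDK application rather than from dpdk-procinfo, a minimal sketch (my own illustration) using the standard rte_eth_xstats_get() API produces the same information:

```cpp
#include <cinttypes>
#include <cstdio>
#include <vector>
#include <rte_ethdev.h>

// Sketch: print the same extended statistics that
// `dpdk-procinfo -- --xstats` shows, polled from inside the application.
static void dump_xstats(uint16_t port_id)
{
    int n = rte_eth_xstats_get_names(port_id, NULL, 0); // query count only
    if (n <= 0)
        return;

    std::vector<rte_eth_xstat_name> names(n);
    std::vector<rte_eth_xstat> stats(n);
    if (rte_eth_xstats_get_names(port_id, names.data(), n) != n ||
        rte_eth_xstats_get(port_id, stats.data(), n) != n)
        return;

    // Non-zero error/drop counters (rx_errors, rx_nombuf, tx_errors, ...)
    // point at where the loss happens.
    for (int i = 0; i < n; i++)
        if (stats[i].value != 0)
            printf("%s: %" PRIu64 "\n", names[i].name, stats[i].value);
}
```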
Note:
"tx_good_packets" means the number of packets sent by the dpdk NIC. if the number of packets tried to be sent is equal to "tx_good_packets", there is no packet dropped at the sent client.
"rx-missed-errors" means packets loss at the receiver; this means you are processing packets more than what the Current CPU can handle. So either you will need to increase CPU frequency, or use additional cores to distribute the traffic.
If none of these counters is updated or errors are found, then the issue is at the peer (client non-dpdk) side.

Why do I see strange UDP fragmentation on my C++ server?

I have built a UDP server in C++ and I have a couple of questions about it.
Goal:
I have incoming TCP traffic and I need to send it on as UDP traffic. My own UDP server then processes this UDP data.
The size of the TCP packets can vary.
Details:
In my example I have a TCP packet that consists of a total of 2000 bytes (4 random bytes, 1995 'a' (0x61) bytes and the last byte being 'b' (0x62)).
My UDP server has a buffer (the recvfrom buffer) with a size larger than 2000 bytes.
My MTU size is 1500 everywhere.
My server receives this packet correctly. In my UDP server I can see the received packet has a length of 2000, and if I check the last byte, buffer[1999], it prints 'b' (0x62), which is correct. But if I open tcpdump -i eth0, I see only one UDP packet: 09:06:01.143207 IP 192.168.1.1.5472 > 192.168.1.2.9000: UDP, bad length 2004 > 1472.
With the tcpdump -i eth0 -X command, I see the data of the packet, but only ~1472 bytes, which do not include the 'b' (0x62) byte.
The ethtool -k eth0 command prints udp-fragmentation-offload: off.
So my questions are:
Why do I only see one packet and not two (fragmented part 1 and 2)?
Why don't I see the 'b' (0x62) byte in the tcpdump output?
In my C++ server, what buffer size is best to use? I have it at 65535 now because the incoming TCP packets can be any size.
What will happen if the size exceeds 65535 bytes? Will I have to implement my own fragmentation scheme before sending the TCP data as UDP?
OK, circumstances are more complicated than they appear from the question. Extracted from your comments, the following information is available:
Some client sends data to a server - both not modifiable - via TCP.
In between resides a firewall that only allows uni-directional communication to the server via UDP.
You now intend to implement a proxy consisting of two servers residing in between, tunneling the TCP data via UDP.
The server not being able to reply backwards does not pose a problem either.
My personal approach would then be as follows:
Let the proxy servers be entirely data-unaware! Let the outbound receiver accept (recv or recvfrom, depending on whether a single or multiple clients are present) chunks of data that still fit into UDP packets and simply forward them as they are.
Apply some means to ensure lost data is at least detected, or better, can be reconstructed. As confirmation or re-request messages are impossible due to the firewall limitation, the only way to increase reliability is redundancy.
Configure the final target server to listen on loopback only.
Let the inbound proxy connect to the target server via TCP and, as long as no (non-recoverable) errors occur, just forward any incoming data as-is.
To be able to detect lost messages, I'd at the very least prepend a packet counter to every UDP message sent. If two subsequent messages do not carry consecutive counter values, a message has been lost in between.
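A minimal sketch of such a counter-prefixed framing over POSIX sockets (my own illustration; MAX_PAYLOAD and the function names are placeholders):

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <cstring>
#include <sys/socket.h>

// Hypothetical framing: a 32-bit sequence counter prepended to every
// datagram. MAX_PAYLOAD is a placeholder chosen to stay below the MTU.
static const size_t MAX_PAYLOAD = 1400;

// Sender side: prefix the payload with the next counter value.
ssize_t send_with_seq(int sock, uint32_t &next_seq, const void *data,
                      size_t len, const sockaddr *dst, socklen_t dst_len)
{
    if (len > MAX_PAYLOAD)
        return -1;
    uint8_t frame[sizeof(uint32_t) + MAX_PAYLOAD];
    uint32_t seq_be = htonl(next_seq++);
    memcpy(frame, &seq_be, sizeof(seq_be));
    memcpy(frame + sizeof(seq_be), data, len);
    return sendto(sock, frame, sizeof(seq_be) + len, 0, dst, dst_len);
}

// Receiver side: a jump between consecutive counters means loss.
bool seq_is_consecutive(uint32_t &expected, const uint8_t *frame)
{
    uint32_t seq_be;
    memcpy(&seq_be, frame, sizeof(seq_be)); // avoid unaligned access
    uint32_t seq = ntohl(seq_be);
    bool ok = (seq == expected);
    expected = seq + 1;
    return ok;
}
```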
As no backwards communication is possible, the only way to increase reliability is unconditional redundancy, trading some of your transfer rate for it, e.g. by sending every message more than once and ignoring surplus duplicates on the receiving side.
A more elaborate approach might distribute redundant data over several packets such that a missing one can be reconstructed from the remaining ones - maybe similar to what RAID level 5 does. Admittedly, you need to be pretty committed to try that...
The final question is what the routing looks like. There's no guarantee with UDP that packets are received in the same order as they are sent. If there's really only one fixed route available from the outbound proxy to the inbound one via the firewall, then packets shouldn't overtake one another - you might still want to at least log to file appropriately to monitor the inbound UDP packets and, if errors occur, apply appropriate means (buffering packets and re-ordering them if need be).
The size of the TCP packets can vary.
While there is no code shown, the sentence above and your description suggest that you are working with wrong assumptions about how TCP works.
Contrary to UDP, TCP is not a message-based protocol but a byte stream. This especially means that it does not guarantee that a single send at the sender will be matched by a single recv at the recipient. Thus even if the send is done with 2000 bytes, it might still be that the first recv only gets 1400 bytes while another recv gets the rest - no matter whether everything would fit into the socket buffer at once.
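The usual consequence for receiver code is a loop that keeps calling recv until the expected number of bytes has arrived, rather than assuming one recv() per send(); a minimal sketch:

```cpp
#include <cstddef>
#include <sys/socket.h>

// Sketch: because TCP is a byte stream, loop until `want` bytes arrive.
ssize_t recv_all(int sock, char *buf, size_t want)
{
    size_t got = 0;
    while (got < want) {
        ssize_t n = recv(sock, buf + got, want - got, 0);
        if (n <= 0)            // error, or the peer closed the connection
            return n;
        got += n;
    }
    return (ssize_t)got;
}
```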

Does the PF_RING Linux kernel module have functionality to split the incoming traffic to 3 different devices?

Currently my C++ program uses the PF_RING kernel module to read from a 1 Gbps NIC on Linux. The application has a bottleneck: it cannot process more than 700 Mbps because it decodes all the headers and maintains some state.
Now there is an immediate urgency to scale the application to handle 10 Gbps of traffic from a pipe. After making all the necessary changes inside my application, it can reach up to 1.2 Gbps.
I have a few queries:
Would PF_RING be able to handle 10 Gbps of traffic without any packet loss?
If yes, is it possible to split the incoming 10 Gbps of traffic across 4 different devices in a round-robin fashion using PF_RING module concepts?
Does anyone have a good reference document for PF_RING?
What are the different ways to split the incoming traffic on a particular interface to multiple interface devices in Solaris?

Connecting to multiple servers from a single client

I have 10,000 small devices and each has one server port (waiting for connections). I want to connect to all the devices at the same time from one server (PC). Can I open a port for each device? Is this possible on Windows? Thanks.
Read section 4.8 on this page. It looks like the answer is yes, in principle, but you need to do asynchronous IO, because you can't run 10,000 threads on Windows at the same time.
As long as you use a reasonably efficient polling strategy (e.g. I/O completion ports, if you're using Windows) and keep the kernel socket buffers fairly small, it's possible in principle. However, if reliability is not a large concern and you can control both ends of the protocol (i.e., you're designing the devices), UDP would be far more efficient - with UDP you could read from all devices using a single socket.
If TCP is a requirement, you'll have an absolute limit of around 60,000 connections from a single interface, because TCP port numbers are only 16 bits, i.e., 64k possible values. Eventually you'll run out of local port numbers, unless you do something exotic like giving your network interface more than one IP address.
The way to actually do this is to have the devices listen on a specific multicast group. This way you can just broadcast a packet and the machines listening on that group will (most likely) receive it; a sketch follows below.
This also gives you the benefits of dividing things into groups using multicast addresses.
Note that there is a possibility of lost packets, so I suggest a sequence-numbering/recovery method if every single message is important.
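For illustration, a minimal POSIX-sockets sketch of a device joining a multicast group (on Windows the Winsock calls are analogous); the group address and port are placeholders:

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>

// Sketch: join a multicast group so the device receives packets sent to
// the group, e.g. join_group("239.0.0.1", 9000) - both values placeholders.
int join_group(const char *group_addr, uint16_t port)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    sockaddr_in local{};                      // bind to the group's UDP port
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(port);
    if (bind(sock, (sockaddr *)&local, sizeof(local)) < 0)
        return -1;

    ip_mreq mreq{};                           // ask the kernel to join
    inet_pton(AF_INET, group_addr, &mreq.imr_multiaddr);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
                   &mreq, sizeof(mreq)) < 0)
        return -1;
    return sock;
}
```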

Sending arbitrary (raw) packets

I've seen it asked elsewhere, but no one answers it to my satisfaction: how can I receive and send raw packets?
By "raw packets", I mean packets where I have to generate all the headers and data myself, so that the bytes are completely arbitrary and I am not restricted in any way. This is why Microsoft's raw sockets won't work: you can't send TCP or UDP packets with incorrect source addresses.
I know you can send packets like I want with WinPcap, but you cannot receive raw information with it, which I also need to do.
First of all, decide which protocol layer you want to test malformed data on:
Ethernet
If you want to generate and receive invalid Ethernet frames with a wrong Ethernet checksum, you are more or less out of luck, as the checksumming is often done in hardware, and in the cases where it is not, the NIC driver performs the checksumming and there's no way around that, at least on Windows. NetBSD does provide that option for most of its drivers that do Ethernet checksumming in the OS driver, though.
The alternative is to buy specialized hardware (e.g. cards from Napatech; you might find cheaper ones) that provides an API for sending and receiving Ethernet frames, however invalid you want them to be.
Be aware that when sending invalid Ethernet frames, the receiving end or a router in between will just throw the frames away; they will never reach the application nor the OS IP layer. You'll be testing the NIC or the NIC driver on the receiving end.
IP
If all you want is to send/receive invalid IP packets, WinPcap lets you do this. Generate the packets, set up WinPcap to capture, and use WinPcap to send; a sketch follows at the end of this section.
Be aware that for packets with an invalid IP checksum or other invalid fields, the TCP/IP stack the receiving application runs on will just throw the IP packets away, as will any IP/layer-3 router in between the sender and receiver. They will not reach the application. If you're generating valid IP packets, you'll also need to generate valid UDP, or implement a TCP session with valid TCP packets yourself, in order for the application to process them; otherwise they'll also be thrown away by the TCP/IP stack.
You'll be testing the lower part of the TCP/IP stack on the receiving end.
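For illustration (my own sketch), sending an arbitrary frame and capturing on the same handle with the pcap API (libpcap/WinPcap/Npcap); "eth0" is a placeholder, on Windows you would use an adapter name returned by pcap_findalldevs():

```cpp
#include <cstdint>
#include <cstdio>
#include <pcap.h>

int main()
{
    char errbuf[PCAP_ERRBUF_SIZE];
    // Open the device for capture and injection (promiscuous, 100 ms timeout).
    pcap_t *h = pcap_open_live("eth0", 65536, 1, 100, errbuf);
    if (!h) { fprintf(stderr, "%s\n", errbuf); return 1; }

    uint8_t frame[64] = {0};   // build Ethernet/IP headers here as you wish
    if (pcap_sendpacket(h, frame, sizeof(frame)) != 0)
        fprintf(stderr, "send failed: %s\n", pcap_geterr(h));

    // Receive raw frames on the same handle.
    struct pcap_pkthdr *hdr;
    const u_char *data;
    if (pcap_next_ex(h, &hdr, &data) == 1)
        printf("captured %u bytes\n", hdr->caplen);

    pcap_close(h);
    return 0;
}
```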
TCP/UDP
This is not that different from sending/receiving invalid IP packets. You can do all this with WinPcap, and routers will not throw the packets away as long as the Ethernet/IP headers are OK. An application will not receive these packets, though; they'll be thrown away by the TCP/IP stack.
You'll be testing the upper part of the TCP/IP stack on the receiving end.
Application Layer
This is the (sane) way of actually testing the application (unless your "application" actually is a TCP/IP stack, or lower). You send/receive data as any application would, using sockets, but generate malformed application data as you want. The application will receive this data; it's not thrown away by lower protocol layers.
One particular form of test can still be hard to perform with TCP, namely varying the TCP segmentation, e.g. if you want to test that an application correctly interprets the TCP data as a stream (say, you want to send the string "hello" in 5 segments and somehow cause the receiving application to read() the characters one by one). If you don't need speed, you can usually get that behaviour by inserting pauses in the sending and turning off Nagle's algorithm (TCP_NODELAY) and/or tuning the NIC MTU; a sketch follows below.
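For illustration (POSIX shown; Winsock is analogous), turning off Nagle's algorithm on a connected socket:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Sketch: disable Nagle's algorithm so small send() calls tend to leave
// as separate segments, useful when testing stream reassembly.
void disable_nagle(int sock)
{
    int one = 1;
    setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```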
Remember that any muckery with lower-level protocols in a TCP stream, e.g. causing one of the packets to have an invalid/different IP source address, just gets thrown away by the lower-level layers.
You'll be testing an application running on top of TCP/UDP (or any other IP protocol).
Alternatives
Switch to another OS, where you can at least use raw sockets without the restrictions of recent Windows versions.
Implement a transparent drop-insert solution based on the "Ethernet" or "IP" alternative above: you have your normal client application and your normal server application; you break a cable in between them and insert a box with 2 NICs, where you programmatically alter bytes of the frames received and send them back out on the other NIC. This also allows you to easily introduce packet delays into the system. Linux's netfilter already has this capability, which you can build on top of, often with just configuration or scripting.
If you can alter the receiving application you want to test, have it read data from something else, such as a file or a pipe, and feed it random bytes/packets as you wish.
Hybrid model, mainly for TCP application testing, but also useful for e.g. testing UDP ICMP responses: set up a TCP connection using sockets and send your invalid application data using sockets, then introduce random malformed packets (much easier than programming with raw sockets to set up a TCP session and then introduce lower-layer errors). Send malformed IP or UDP/TCP packets, or perhaps ICMP packets, using WinPcap, but let the socket code communicate with the WinPcap code so that you get the addresses/ports correct and the receiving application sees them.
Check out NS/2