DPDK IPv4 Flow Filtering on Mellanox - c++

I have a DPDK application that uses Boost Asio to join a multicast group and receives multicast IPv4 UDP packets over a VLAN on a particular UDP port (other UDP ports are also used for other traffic). I am trying to receive only those multicast UDP packets on that port in the DPDK application and place them into an RX queue, while having all other ingress network traffic act as if the DPDK application were not running (i.e., go to the kernel). As such, I am using flow isolated mode (rte_flow_isolate()). The flow filtering part of my application is based on the flow_filtering example provided by DPDK, with the additions of the call to rte_flow_isolate() and a VLAN filter. The filters I'm using are below:
action[0].type = RTE_FLOW_ACTION_TYPE_QUEUE;
action[0].conf = &queue;
action[1].type = RTE_FLOW_ACTION_TYPE_END;
pattern[0].type = RTE_FLOW_ITEM_TYPE_ETH;
pattern[1].type = RTE_FLOW_ITEM_TYPE_VLAN;
//vlan id here
pattern[2].type = RTE_FLOW_ITEM_TYPE_IPV4;
//no specific ip address given
pattern[3].type = RTE_FLOW_ITEM_TYPE_UDP;
//udp port here
pattern[4].type = RTE_FLOW_ITEM_TYPE_END;
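For context, the two arrays are handed to the flow API just as in the flow_filtering example. A condensed sketch, with error handling omitted and port_id assumed to be the port being configured:
struct rte_flow_attr attr = { .ingress = 1 };
struct rte_flow_error err;
struct rte_flow *flow = NULL;

/* validate first, then create the rule on the isolated port */
if (rte_flow_validate(port_id, &attr, pattern, action, &err) == 0)
    flow = rte_flow_create(port_id, &attr, pattern, action, &err);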
Using these filters, I am unable to receive any packets and the same is true if I only remove the UDP filter. However, if I remove both the IPV4 and UDP filters (keeping the ETH and VLAN filters), I can receive all the packets I need, along with other ones I don't want (and would like to be sent to the kernel).
Here's an entry for the packet I need to receive from a Wireshark capture. Currently my theory is that because the reserved bit (evil bit) is being set in the IPv4 header, the packet is not being recognized as IPv4. This is probably a stretch:
Frame 100: 546 bytes on wire (4368 bits), 546 bytes captured (4368 bits) on interface 0
Ethernet II, Src: (src MAC), Dst: IPv4mcast_...
802.1Q Virtual LAN, PRI: 0, CFI: 0, ID: 112
Internet Protocol Version 4, Src: (src IP), Dst: (Dst mcast IP)
Version: 4
Header length: 20 bytes
Differentiated Services Field: 0x00 (DSCP 0x00: Default; ECN: 0x00: Not-ECT (Not ECN-Capable Transport))
Total length: 1108
Identification: 0x0000 (0)
Flags: 0x04 (RESERVED BIT HAS BEEN SET)
Fragment offset: 0
Time to live: 64
Protocol: UDP (17)
Header checksum: 0xd8c4 [validation disabled]
Source: srcip
Destination: dstip
User Datagram Protocol, Src Port: (src port), Dst Port: (dst port)
Data (N bytes)
The hardware I'm running on has a Mellanox ConnectX-5 card and, as such, DPDK is using the MLX5 driver, which does not support RTE_FLOW_ITEM_TYPE_RAW along with many other items in the RTE Flow API. I am on DPDK 19.11, and the OFED version I'm using is 4.6 for RHEL 7.6 (x86_64).
What am I doing wrong here, and why does adding the RTE_FLOW_ITEM_TYPE_IPV4 filter (without an IP address; spec and mask both memset to 0) cause my application to not receive any packets, even though they are IPv4 packets? Is there a way around this with the MLX5 driver for DPDK?

The answer is quite simple: the packets are fragmented. There are two reasons they can't be matched:
the UDP header is present only in the first IP fragment;
so, from the NIC's perspective, fragmented UDP is just a series of plain IP packets.
Try matching non-fragmented packets to confirm this.
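If fragmentation is indeed the cause, one way to match only non-fragmented packets is to mask the IPv4 fragment_offset field. A minimal sketch against the pattern above (note: support for matching on fragment_offset in the mlx5 PMD was only added in later DPDK releases, so this mask may be rejected on 19.11):
struct rte_flow_item_ipv4 ip_spec;
struct rte_flow_item_ipv4 ip_mask;

memset(&ip_spec, 0, sizeof(ip_spec));
memset(&ip_mask, 0, sizeof(ip_mask));

/* fragment_offset == 0 under a mask covering MF + the 13-bit offset
 * means "not a fragment"; the DF bit is deliberately left unmasked */
ip_spec.hdr.fragment_offset = 0;
ip_mask.hdr.fragment_offset = rte_cpu_to_be_16(0x3fff);

pattern[2].type = RTE_FLOW_ITEM_TYPE_IPV4;
pattern[2].spec = &ip_spec;
pattern[2].mask = &ip_mask;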

Is it possible to use flow isolation mode while receiving multicast with DPDK? I thought that flow isolation and promiscuous or allmulticast modes were not compatible.

Related

dpdk-testpmd command executed and then hangs

I set up a DPDK-compatible environment and then tried to send packets using dpdk-testpmd, expecting to see them received on another server.
I am using the vfio-pci driver in no-IOMMU (unsafe) mode.
I ran
$./dpdk-testpmd -l 11-15 -- -i
which had output like
EAL: Detected NUMA nodes: 2
EAL: Detected static linkage of DPDK
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'PA'
EAL: VFIO support initialized
EAL: Using IOMMU type 8 (No-IOMMU)
EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.1 (socket 0)
TELEMETRY: No legacy callbacks, legacy socket not created
Interactive-mode selected
testpmd: create a new mbuf pool <mb_pool_1>: n=179456, size=2176, socket=1
testpmd: preferred mempool ops selected: ring_mp_mc
testpmd: create a new mbuf pool <mb_pool_0>: n=179456, size=2176, socket=0
testpmd: preferred mempool ops selected: ring_mp_mc
Warning! port-topology=paired and odd forward ports number, the last port will pair with itself.
Configuring Port 0 (socket 0)
Port 0: E4:43:4B:4E:82:00
Checking link statuses...
Done
then
testpmd> set nbcore 4
Number of forwarding cores set to 4
testpmd> show config fwd
txonly packet forwarding - ports=1 - cores=1 - streams=1 - NUMA support enabled, MP allocation mode: native
Logical Core 12 (socket 0) forwards packets on 1 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=BE:A6:27:C7:09:B4
My nbcore is not being set correctly, and even 'txonly' mode was not being set before I set the eth-peer address, though some parameters are working. Moreover, if I don't change the burst delay, my server crashes as soon as I start transmitting, even though it has a 10G Ethernet port (80 MBps available bandwidth by calculation). Hence, I am not seeing packets at the receiving server when tailing tcpdump on the corresponding receiving interface. What is happening here and what am I doing wrong?
Based on the question and the answers in the comments, the real intention is to send packets from DPDK testpmd using an Intel Fortville (net_i40e) NIC to the remote server.
The real issue for traffic not being generated is that neither the application command line nor any interactive option is set to create packets via dpdk-testpmd.
In order to generate packets, there are 2 options in testpmd:
start tx_first: this will send out a default burst of 32 packets as soon as the port is started.
forward mode tx-only: this puts the port under dpdk-testpmd in transmission-only mode. Once the port is started, it will transmit packets with the default packet size.
Neither of these options is utilized, hence my suggestion is:
please have a walk through the DPDK documentation on testpmd and its configuration
make use of either --tx-first or --forward-mode=txonly as per DPDK Testpmd Command-line Options
make use of either start tx_first, set fwd txonly or set fwd flowgen under interactive mode; refer to Testpmd Runtime Functions
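For instance, a minimal interactive sequence (responses abbreviated) that switches to TX-only forwarding and starts it would look like:
testpmd> set fwd txonly
Set txonly packet forwarding mode
testpmd> start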
With this, traffic will be generated from testpmd and sent to the device (remote server). A quick example of the same is "dpdk-testpmd --file-prefix=test1 -a81:00.0 -l 7,8 --socket-mem=1024 -- --burst=128 --txd=8192 --rxd=8192 --mbcache=512 --rxq=1 --txq=1 --nb-cores=2 -a --forward-mode=io --rss-udp --enable-rx-cksum --no-mlockall --no-lsc-interrupt --enable-drop-en --no-rmv-interrupt -i"
From the above example config parameters:
the number of packets for each RX-TX burst is set by --burst=128
the number of RX-TX queues is configured by --rxq=1 --txq=1
the number of cores to use for RX-TX is set by --nb-cores=2
to set flowgen, txonly, rxonly or io mode we use --forward-mode=io
Hence, as mentioned in the comments, neither set nbcore 4 nor any of the configurations in the testpmd args or interactive commands sets the application to TX only.
The second part of the query is really confusing, because as it states
Moreover, if I don't change the burst delay, my server crashes as soon as I start transmitting, even though it has a 10G Ethernet port (80 MBps available bandwidth by calculation). Hence, I am not seeing packets at the receiving server when tailing tcpdump on the corresponding receiving interface. What is happening here and what am I doing wrong?
Assuming "my server" is the remote server to which packets are being sent by dpdk-testpmd: the mention of seeing packets with tcpdump is confusing, since an Intel Fortville X710, once bound to a UIO driver, no longer exposes a kernel network interface.
The figure of 80 MBps (around 0.64 Gbps) is also really strange. If the remote interface is set to promiscuous mode and an AF_XDP application or raw socket application is configured to receive traffic, it works at line rate (10 Gbps). Since there are no logs or crash dumps from the remote server, and it is highly unlikely that actual traffic was generated from testpmd, this looks more like a config or setup issue on the remote server.
[EDIT-1] Based on the live debug, it is confirmed that:
DPDK was not installed - fixed by running ninja install
the DPDK NIC port eno2 is not connected to the remote server directly; it is connected through a switch
the DPDK application testpmd is not crashing - confirmed with pgrep testpmd
instead, when used with set fwd txonly, packets flood the switch and SSH packets from the other port are dropped.
Solution: please use another switch for data-path testing, or use a direct connection to the remote server.

RSS hash for fragmented packets

I am using a Mellanox Technologies MT27800 Family [ConnectX-5] NIC with DPDK multi RX queue RSS, configured with "ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP".
I analyze traffic and need all packets of the same session to arrive at the same process (a session, for now, can be IP + port),
so packets that have the same IP + port arrive at the same queue.
But if some packets are IP fragmented, they arrive at a different process. This is a problem!
How can I calculate the hash value in my C++ code, the same way it is done in the card, so I can reassemble fragments and send them to the same process as the non-fragmented packets?
Instead of ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP you can use only ETH_RSS_IP, to calculate the RSS hash from the IP addresses of the packet alone. This way, even if the packet is fragmented, all fragments of the packet will arrive at the same CPU core.
The RSS value of a packet can also be calculated in software using the rte_thash library: https://doc.dpdk.org/api/rte__thash_8h.html
While this option is possible, I would still recommend that you check out the proposed setting of ETH_RSS_IP only.
When ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP is enabled, the RSS function takes IP addresses and src + dst ports to calculate the RSS hash value. As you don't have ports present in the IP fragmented packets, you are unable to compute the same value as non-fragmented IP packets.
You can either:
reassemble the IP fragments to form a complete IP packet and then compute the RSS value using the rte_thash library, or
compute the RSS value only from the IP addresses (use the ETH_RSS_IP setting only).
As you are only doing load-balancing on CPU cores I think the latter option suits your use-case well enough.
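If you do go with the software computation, a minimal sketch with rte_thash could look as follows, assuming ip_hdr and udp_hdr point at the parsed packet headers and rss_key is the same 40-byte key that was programmed into the NIC:
#include <rte_byteorder.h>
#include <rte_thash.h>

static const uint8_t rss_key[40] = { 0 /* replace with the NIC's RSS key */ };

union rte_thash_tuple tuple;
uint32_t hash;

/* rte_softrss() expects the tuple fields in host byte order */
tuple.v4.src_addr = rte_be_to_cpu_32(ip_hdr->src_addr);
tuple.v4.dst_addr = rte_be_to_cpu_32(ip_hdr->dst_addr);
tuple.v4.sport = rte_be_to_cpu_16(udp_hdr->src_port);
tuple.v4.dport = rte_be_to_cpu_16(udp_hdr->dst_port);

/* RTE_THASH_V4_L4_LEN hashes addresses + ports; with ETH_RSS_IP only,
 * hash RTE_THASH_V4_L3_LEN words (the addresses) instead */
hash = rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L4_LEN, rss_key);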
@Davidboo, based on the question and the explanation in the comments, what you have described is:
all packets of the same session arrive at the same process (a session, for now, can be IP + port) - which means you are looking for a symmetric hash
some packets are IP fragmented and arrive at a different process - you need packet reassembly before symmetric RSS
Mellanox Technologies MT27800 Family [ConnectX-5] - the current NIC does not support reassembly in the NIC (embedded switch).
Hence the actual question is what is the right way to solve the problem within the following constraints. There are 3 solutions (1 HW and 2 SW):
option 1 (HW): use a smart NIC or network appliance offload card that can ingress the traffic and reassemble fragments before sending them to the host server
option 2 (SW): disable RSS and use a single RX queue. Check whether each packet is a fragment or not; if yes, reassemble it, then use rte_flow_distributor or rte_eventdev with atomic flows to spread traffic to worker cores.
option 3 (SW): disable RSS, but use n + 1 RX queues and n SW rings. By default all packets will be received on queue 0. Based on JHASH (for SW RSS), add rte_flow rules pinning the flows to queues 1 to n.
Assuming you cannot change the NIC, my recommendation is option 2 with eventdev, for the following reasons:
it is much easier and simpler to allow either HW or SW DLB (eventdev) to spread traffic across multiple cores.
rte_flow_distributor is similar to HW RSS, where static pinning will lead to congestion and packet drops.
unlike option 3, one need not maintain a state flow table to keep track of the flows.
How to achieve this (see the sketch after this list):
use the DPDK example code skeleton to create the basic port initialization.
enable DPDK rte_ethdev PTYPES to identify IP, non-IP and IP-fragmented packets without parsing the frame and payload.
check whether packets are fragmented by using RTE_ETH_IS_IPV4_HDR and rte_ipv4_frag_pkt_is_fragmented for IPv4, and RTE_ETH_IS_IPV6_HDR and rte_ipv6_frag_get_ipv6_fragment_header for IPv6 (refer to the DPDK example ip_reassembly).
extract SRC-IP, DST-IP, SRC-PORT, DST-PORT (if the packet is not SCTP, UDP or TCP, set src port and dst port to 0) and use rte_hash_crc_2byte. This will ensure a symmetric hash (SW RSS).
then feed the packet, with its hash value, to the eventdev stage or flow distributor stage (refer to eventdev_pipeline or distributor).
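A minimal sketch of the fragment check and symmetric SW hash under these assumptions (IPv4 only; the caller is assumed to have already checked RTE_ETH_IS_IPV4_HDR on the mbuf packet type; XOR-ing the tuple halves before the CRC is one simple way to make the hash symmetric, at the cost of some entropy):
#include <netinet/in.h>
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_hash_crc.h>
#include <rte_ip.h>
#include <rte_ip_frag.h>
#include <rte_mbuf.h>
#include <rte_udp.h>

/* Illustrative helper: symmetric flow hash for an IPv4 mbuf; ports stay
 * 0 for fragments and non-TCP/UDP packets, as described above. */
static uint32_t
flow_hash_ipv4(struct rte_mbuf *m)
{
    struct rte_ipv4_hdr *ip = rte_pktmbuf_mtod_offset(m,
            struct rte_ipv4_hdr *, sizeof(struct rte_ether_hdr));
    uint16_t sport = 0, dport = 0;

    if (!rte_ipv4_frag_pkt_is_fragmented(ip) &&
        (ip->next_proto_id == IPPROTO_UDP ||
         ip->next_proto_id == IPPROTO_TCP)) {
        size_t ihl = (ip->version_ihl & RTE_IPV4_HDR_IHL_MASK) *
                RTE_IPV4_IHL_MULTIPLIER;
        /* src/dst ports sit at the same offsets in TCP and UDP headers */
        struct rte_udp_hdr *l4 = (struct rte_udp_hdr *)((char *)ip + ihl);
        sport = rte_be_to_cpu_16(l4->src_port);
        dport = rte_be_to_cpu_16(l4->dst_port);
    }

    /* XOR makes the input order-independent, hence the hash symmetric */
    uint32_t h = rte_hash_crc_4byte(rte_be_to_cpu_32(ip->src_addr) ^
            rte_be_to_cpu_32(ip->dst_addr), 0);
    return rte_hash_crc_2byte((uint32_t)(sport ^ dport), h);
}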
Note:
each worker core will be running the business logic.
on a Broadwell Xeon, a core can handle around 30 Mpps.
on an Ice Lake Xeon, a core can handle around 100 Mpps.

How to send a packet to another server using DPDK?

My question is how to send a packet to another physical server from my computer using DPDK.
I already looked at the example code rxtx_callbacks and I want to use this code.
But there is no place to enter a specific IP and port for another server.
How can I send packets to a given place on a server using DPDK, with a specified IP and port?
And how can I receive packets using DPDK?
Is l3fwd the right example, or is this another concept?
Help me.
DPDK is an open-source library that allows one to bypass the kernel and the ETH-IP-TCP stack to send packets from userspace directly to a NIC or other custom hardware. There are multiple examples and projects, like pktgen and TRex, which use it to generate user-defined packets (with the desired MAC address, VLAN, IP and TCP/UDP payload).
For the queries
how i can send packets to places on a server using dpdk with specified ip and port?
[Answer] Make use of DPDK PKTGEN as an easy way to generate traffic. Other examples are pcap-based burst replay and TRex.
But the easiest way to generate and send traffic is using scapy with the DPDK sample application skeleton. The following steps are required to achieve this:
install DPDK on the desired platform (preferably Linux)
build the DPDK example skeleton found in path [dpdk root folder]/examples/skeleton
bind a physical NIC (if traffic needs to be sent out of the server) with a userspace driver like igb_uio, uio_pci_generic or vfio-pci
start the application with the options '-l 1 --vdev=net_tap0,iface=scapyEth'. This will create a TAP interface named scapyEth.
using scapy, now create your custom packet with the desired MAC, VLAN, IP and port numbers.
and how i can receive packets using dpdk?
[Answer] On the receiver side, run a DPDK application like testpmd, l2fwd or skeleton if the packets need to be received by a userspace DPDK application; otherwise any Linux socket can receive the UDP packets.
Note: the easiest way to check whether packets are received is to run tcpdump, for example tcpdump -eni eth1 -Q in (where eth1 is the physical interface on the receiver server).
Note: since the request "how i can send packets to places on a server" is not clear:
using DPDK, one can send packets through a physical interface using a dedicated NIC, FPGA or wireless device
DPDK can send packets among applications using memif interfaces
DPDK can send packets between VMs using virtio and vhost
DPDK can send and receive packets to and from the kernel, where the kernel routing stack and ARP table determine which kernel interface will forward the packets.

DPDK getting too many rx_crc_errors on one port

What may cause rx_crc_errors on DPDK ports?
Is it a software thing, or a hardware thing related to the port or the traffic coming from the other end?
DPDK Version: 19.02
PMD: I40E
This port is running on a customer network. Worth mentioning that this is the only port (out of 4) having this behaviour, so it may be a router/traffic thing, but I couldn't verify that.
I used dpdk-proc-info to get this data, and could not do any additional activity as this is running on a customer site.
The DPDK I40E PMD only has an option to enable or disable CRC stripping on the port. Hence the assumption that the DPDK I40E PMD is causing CRC errors on 1 port out of 4 can be ruled out entirely.
RX packets are validated per port for CRC by the ASIC and then DMA-ed to mbufs in the packet buffer. The PMD copies the descriptor status into the mbuf struct (one of the status fields is the CRC result). The packet descriptor thus indicates the CRC result of the packet buffer to the driver (kernel driver or DPDK PMD). So CRC errors on a given port can arise for the following reasons:
the port connected to ASIC is faulty (very rare case)
the SFP+ is not properly connected (possible).
the SFP+ is not the recommended one (possible).
the traffic coming from the other end contains packets with faulty CRCs (possible).
One needs to isolate the issue by:
binding the port to the Linux driver i40e and checking the statistics via ethtool -S [port].
checking the SFP+ for compatibility on the faulty port by swapping it with a working one.
re-seating the SFP+ again.
swapping the data cables between a working and the faulty port, then checking whether the error is present or not.
If in all the above 4 cases the error appears only on the faulty port, then the NIC indeed has only 3 working ports out of 4; the NIC needs replacement, or one should ignore the faulty port altogether. Hence this is not a DPDK PMD or library issue.
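For the first isolation step, a command sequence along these lines can be used (the PCI address and interface name are placeholders for the faulty port):
dpdk-devbind.py -b i40e 0000:01:00.1   # rebind the port to the kernel i40e driver
ethtool -S enp1s0f1 | grep -i crc      # inspect the per-port CRC error counters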

Sending UDP packets through SOCKS proxy

I'm building a simple application which should send UDP datagrams through a SOCKS4/5 proxy. I chose UDP so I don't have to keep connection(s) open.
However, it wasn't as easy as I thought. According to this schema, I conclude that I cannot send UDP data through a proxy without first establishing a TCP connection with the proxy server.
Nonetheless, I couldn't find any suitable example of building such a connection in C++. I would be thankful for any resources :)
It is possible. You have to specify the value 0x03 in field 2 of your client's connection request, according to the Wikipedia description of the SOCKS5 protocol.
The client's connection request is
field 1: SOCKS version number, 1 byte (must be 0x05 for this version)
field 2: command code, 1 byte:
0x01 = establish a TCP/IP stream connection
0x02 = establish a TCP/IP port binding
0x03 = associate a UDP port
field 3: reserved, must be 0x00
field 4: address type, 1 byte:
0x01 = IPv4 address
0x03 = Domain name
0x04 = IPv6 address
field 5: destination address, one of:
4 bytes for IPv4 address
1 byte of name length followed by the name for Domain name
16 bytes for IPv6 address
field 6: port number in a network byte order, 2 bytes
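To make this concrete, below is a minimal C sketch of the TCP-side handshake for UDP ASSOCIATE (no authentication, IPv4 only; error handling is abbreviated, and the function name and variables are illustrative):
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connects to the proxy, performs the SOCKS5 UDP ASSOCIATE handshake and
 * fills in the relay address to which UDP datagrams must then be sent.
 * Returns the TCP socket, which must stay open for the association's life. */
int socks5_udp_associate(const char *proxy_ip, uint16_t proxy_port,
                         struct sockaddr_in *relay)
{
    struct sockaddr_in sa = { .sin_family = AF_INET,
                              .sin_port = htons(proxy_port) };
    unsigned char greet[3] = { 0x05, 0x01, 0x00 };      /* ver 5, 1 method: no auth */
    unsigned char req[10]  = { 0x05, 0x03, 0x00, 0x01,  /* ver, UDP ASSOCIATE, rsv, IPv4 */
                               0, 0, 0, 0,              /* DST.ADDR 0.0.0.0 = any */
                               0, 0 };                  /* DST.PORT 0 = any */
    unsigned char resp[10];
    int tcp = socket(AF_INET, SOCK_STREAM, 0);

    inet_pton(AF_INET, proxy_ip, &sa.sin_addr);
    if (connect(tcp, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(tcp);
        return -1;
    }

    write(tcp, greet, sizeof(greet));
    read(tcp, resp, 2);                      /* expect 05 00: no auth accepted */
    write(tcp, req, sizeof(req));
    read(tcp, resp, 10);                     /* reply carries the relay endpoint */
    if (resp[0] != 0x05 || resp[1] != 0x00) { /* REP 0x00 = succeeded */
        close(tcp);
        return -1;
    }

    relay->sin_family = AF_INET;
    memcpy(&relay->sin_addr, &resp[4], 4);   /* BND.ADDR */
    memcpy(&relay->sin_port, &resp[8], 2);   /* BND.PORT, network byte order */
    return tcp;
}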
As Hasturkun pointed out
Your code doesn't work because you aren't sending a connection request
at all. You must send a UDP ASSOCIATE request (on the TCP connection),
and you need to use the port and address from the response to get your
datagrams relayed.
You should really take a look at the SOCKS5 RFC (RFC 1928).
Mike, your code does not work because you are trying to send the UDP ASSOCIATE command inside a UDP datagram. The SOCKS5 handshake must happen over a TCP control connection.
Your server may need to keep one open TCP connection per client, but each client does not need many TCP connections open - one TCP connection can handle any number of UDP ASSOCIATE commands.
If your sole purpose is to avoid TCP entirely on the server side, this will not achieve what you want. The TCP connection is needed so that the SOCKS proxy knows when to disassociate the UDP ports (namely, when the TCP connection goes down).
However, your server application does not need to worry about this at all: the TCP control connection terminates at the SOCKS server, just as your diagram shows.
I advise reading the RFC.
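For completeness, once the association from the earlier sketch is established, every datagram sent to the relay must carry the SOCKS5 UDP request header (RFC 1928, section 7). A sketch, where relay is the endpoint returned by the handshake and dst, payload and payload_len are placeholders for the final destination and data:
unsigned char buf[2048];
size_t hdr_len = 10;               /* RSV(2) + FRAG(1) + ATYP(1) + ADDR(4) + PORT(2) */

buf[0] = 0x00; buf[1] = 0x00;      /* RSV */
buf[2] = 0x00;                     /* FRAG: standalone datagram */
buf[3] = 0x01;                     /* ATYP: IPv4 */
memcpy(&buf[4], &dst.sin_addr, 4); /* final destination address */
memcpy(&buf[8], &dst.sin_port, 2); /* final destination port, network order */
memcpy(&buf[hdr_len], payload, payload_len);
sendto(udp_sock, buf, hdr_len + payload_len, 0,
       (struct sockaddr *)&relay, sizeof(relay));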