Is it possible to assign priority to multiple RX queues on a single core - DPDK?

Hi, I'm relatively new to DPDK. I want to know if it's possible to assign more than 1 RX queue per core in DPDK.
But here is more about the question:
1st queue always takes priority and will be used for main packet processing (Rx -> decode -> do_some_oper -> Tx).
2nd queue should only be used if there are no more mbufs left for the 1st queue (Rx -> save_5_tuple_info -> drop).
A packet must first be tried on the 1st queue; if that fails, it is put in the 2nd queue.
So basically my end goal is to capture a few important details (the 5-tuple and a few others) from packets that would otherwise be dropped because the queue is full.

@RoninGoda there are multiple questions and sub-sections, so let me try to explain step by step.
[Question-1] Is it possible to assign more than 1 RX queue per core in DPDK?
[Answer] Yes. If a DPDK port supports multiple queues and you have configured and started multiple queues, a single DPDK logical core can poll all of the queues.
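A minimal sketch of what that looks like, assuming port 0 with queues 0 and 1 already configured and started (names and the burst size are illustrative):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

static int lcore_main(void *arg)
{
    struct rte_mbuf *bufs[BURST_SIZE];
    const uint16_t port_id = 0;

    for (;;) {
        for (uint16_t q = 0; q < 2; q++) {
            uint16_t nb_rx = rte_eth_rx_burst(port_id, q, bufs, BURST_SIZE);
            if (nb_rx == 0)
                continue;
            /* process (queue-0) or account-and-drop (queue-1) the nb_rx packets */
        }
    }
    return 0;
}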
[Question-2] But here is more about the question:
1st queue always takes priority and will be used for main packet processing (Rx -> decode -> do_some_oper -> Tx).
2nd queue should only be used if there are no more mbufs left for the 1st queue (Rx -> save_5_tuple_info -> drop).
A packet must first be tried on the 1st queue; if that fails, it is put in the 2nd queue.
[Answer]
First part: here is where some clarity is required. Whether it is a physical NIC (PF or VF) or a vdev NIC supporting multiple queues, one needs to either enable RSS, use RTE_FLOW, or use an external flow director to program the NIC to distribute packets onto queue-0 and queue-1.
If none of these mechanisms is configured, all packets will land on the default queue-0; hence rx_burst on queue-0 will always return n packets as long as traffic flows, and queue-1 will stay empty.
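As an illustration, here is a hedged rte_flow sketch that steers one hypothetical flow (UDP destination port 5000) to queue-1 while everything else stays on the default queue-0; the function name and the chosen port are assumptions, not part of the question:

#include <rte_ethdev.h>
#include <rte_flow.h>
#include <rte_byteorder.h>

static int steer_udp_port_to_queue(uint16_t port_id, uint16_t udp_dst_port,
                                   uint16_t queue_id)
{
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_item_udp udp_spec = {
        .hdr.dst_port = rte_cpu_to_be_16(udp_dst_port),
    };
    struct rte_flow_item_udp udp_mask = {
        .hdr.dst_port = rte_cpu_to_be_16(0xffff),
    };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
        { .type = RTE_FLOW_ITEM_TYPE_UDP, .spec = &udp_spec, .mask = &udp_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    struct rte_flow_action_queue queue = { .index = queue_id };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };
    struct rte_flow_error err;

    if (rte_flow_validate(port_id, &attr, pattern, actions, &err) != 0)
        return -1;                       /* NIC/PMD cannot offload this rule */
    return rte_flow_create(port_id, &attr, pattern, actions, &err) ? 0 : -1;
}

Usage would be something like steer_udp_port_to_queue(0, 5000, 1) after the port is configured; whether the rule is accepted depends on the PMD.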
Second part: assuming RSS, RTE_FLOW, or a flow director is enabled for queue-0 and queue-1, there can be the following 4 scenarios:
queue-0 has packets, queue-1 has no packets
queue-0 has no packets, queue-1 has packets
queue-0 has packets, queue-1 has packets
queue-0 has no packets, queue-1 has no packets
For scenario 1 your current approach always processes the packets and tx_bursts them as you have shared. For scenario 2, since there are no packets in queue-0, the queue-1 packets will be fetched, accounted for, and dropped.
The problem scenario is 3, because when queue-0 and queue-1 both receive continuous packets and your logic always processes and transmits packets from queue-0, that loop runs continuously. Meanwhile packets on queue-1 accumulate and the RX descriptors (into which packets are DMA'd to host memory) run out. Once there are no more RX descriptors for the queue, packets destined for queue-1 are dropped in the NIC itself. Hence you start losing packets in the NIC.
Scenario 4 is not an issue since there are no packets in either queue.
Note: packets will also accumulate on queue-0 if the processing thread consumes too many cycles to process and tx_burst. There can also be scenarios in which mbufs run out, leading to starvation on both queue-0 and queue-1.
[Recommended solution to avoid drops]
Check if the DPDK NIC supports a separate mempool per queue. If yes, set up queue-0 with a dedicated mempool with larger elements and queue-1 with a mempool with smaller elements (see the setup sketch after this list).
Set the RX descriptor count for queue-0 to a larger value like 2048 or 4096 to allow more accumulation, while setting the queue-1 RX descriptor count to 512 or 1024.
If the processing of queue-0 is heavy, try using rte_eth_tx_buffer rather than rte_eth_tx_burst to amortize the cost of PCIe writes.
If the NIC PMD parses and stores the L2, L3, L4 header information in rte_mbuf, make use of that rather than parsing the 5-tuple of queue-1 packets yourself, to save CPU cycles.
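A hedged setup sketch for the first two points, assuming port 0 has been configured with two RX queues; the pool names, sizes and descriptor counts are illustrative only:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

struct rte_mempool *pool_q0, *pool_q1;

static int setup_rx_queues(uint16_t port_id, int socket_id)
{
    /* queue-0: larger pool and 4096 descriptors for the main processing path */
    pool_q0 = rte_pktmbuf_pool_create("pool_q0", 16384, 256, 0,
                                      RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
    /* queue-1: smaller pool and 1024 descriptors, only the 5-tuple is harvested */
    pool_q1 = rte_pktmbuf_pool_create("pool_q1", 4096, 256, 0,
                                      RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
    if (pool_q0 == NULL || pool_q1 == NULL)
        return -1;

    if (rte_eth_rx_queue_setup(port_id, 0, 4096, socket_id, NULL, pool_q0) < 0)
        return -1;
    if (rte_eth_rx_queue_setup(port_id, 1, 1024, socket_id, NULL, pool_q1) < 0)
        return -1;
    return 0;
}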
Important change: rather than having a fixed loop, for example:
burst1 = rte_eth_rx_burst(port, 0, pkts, BURST_SIZE);
burst2 = rte_eth_rx_burst(port, 0, pkts, BURST_SIZE);
burst3 = rte_eth_rx_burst(port, 0, pkts, BURST_SIZE);
if (burst1 == 0 && burst2 == 0 && burst3 == 0)
    burst = rte_eth_rx_burst(port, 1, pkts, BURST_SIZE);
have a packet-counter-driven approach:
uint16_t portqueue_to_rxburst[4] = {0, 0, 0, 0}; /* global array: queue id polled at each slot */

for (index = 0; index < 4; index++)
{
    burst = rte_eth_rx_burst(port, portqueue_to_rxburst[index], pkts, BURST_SIZE);
    if (unlikely(portqueue_to_rxburst[index] == 1)) /* queue-1 slot: drop logic */
    {
        /* save_5_tuple_info(pkts, burst); then free the mbufs */
        continue;
    }
    /* process queue-0 packets */
}
Use an external timer thread or service core to call rte_eth_xstats_get or rte_eth_stats_get on the port and check the per-queue stats. If packets are arriving on queue-1, update global array index 3 to queue-1.
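A hedged sketch of that stats-driven update, reusing the global array from the loop above (the helper name is an assumption, and on some NICs q_ipackets[] only gets populated for queues below RTE_ETHDEV_QUEUE_STAT_CNTRS or after a queue-stats mapping is set):

#include <rte_ethdev.h>

extern uint16_t portqueue_to_rxburst[4];     /* global array from the polling loop */

static void poll_queue_stats(uint16_t port_id)
{
    static uint64_t prev_q1_ipackets;
    struct rte_eth_stats stats;

    if (rte_eth_stats_get(port_id, &stats) != 0)
        return;

    /* enable the queue-1 slot only while packets are actually arriving on it */
    if (stats.q_ipackets[1] != prev_q1_ipackets)
        portqueue_to_rxburst[3] = 1;
    else
        portqueue_to_rxburst[3] = 0;

    prev_q1_ipackets = stats.q_ipackets[1];
}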

Related

How to calculate the network bandwidth of the device?

To achieve an effective data transfer mechanism, I need to find out how many bits can fill up a network link.
Let me explain the situation:
Once I send data (application protocol), the receiver replies with an ACK after it processes the data (in the application layer). If the RTT is high (like 500 ms), it takes too much time for the ACK to come back. Until the ACK is received, no more data is sent and the link sits idle. To rectify the situation, I need to keep some data in flight during those intervals.
So I decided to keep transferring data until the amount of data sent exhausts the bandwidth-delay product (how many bits can fill up the network link):
BDP = Bandwidth (bits per sec) x RTT (in secs).
How do I find the network bandwidth of the device?
Is there any Windows API or other way to find the bandwidth of the link?
PS: I am a newbie to network programming.
You do not calculate bandwidth. The bandwidth is a property of the network interface. A 100 Mbps ethernet interface always has a 100 Mbps bandwidth. You are using the incorrect term.
If you are using TCP, the sender will constantly increase the send/congestion window until there is a problem, then it exponentially reduces the window size, and again starts increasing it until there is again a problem, repeating that over and over. Only a sender will know this window.
The receiver has a buffer that is the receive window, and it will communicate the current window size to the sender in every acknowledgement. The receive window will shrink as the buffer is filled, and grows as the buffer is emptied. The receive window determines how much data the sender is allowed to send before stopping to wait for an acknowledgement.
TCP handles all that automatically, calculating the SRTT and automatically adjusting to give you a good throughput for the conditions. You seem to want to control what TCP inherently does for you. You can tweak things like the receive buffer to increase the throughput, but you need to write your own transport protocol to do what you propose because you will overrun the receive buffer, losing data or crashing the receiving host.
Also, remember that TCP creates a connection between two equal TCP peers. Both are senders and both are receivers. Either side can send and receive, and either side can initiate closing the connection or kill it with a RST.
The Win32 API GetIpNetworkConnectionBandwidthEstimates() returns "historical" bandwidth "estimates" for a network connection on the specified interface (this is more relevant than the whole interface/link).
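A hedged sketch of calling it, assuming IPv4 and an interface index of 1 purely for illustration (the struct and field names are from the iphlpapi documentation as I recall them; verify against netioapi.h):

#include <winsock2.h>
#include <ws2tcpip.h>
#include <iphlpapi.h>
#include <stdio.h>
#pragma comment(lib, "iphlpapi.lib")

int main(void)
{
    MIB_IP_NETWORK_CONNECTION_BANDWIDTH_ESTIMATES est;
    NET_IFINDEX if_index = 1;            /* hypothetical interface index */

    DWORD rc = GetIpNetworkConnectionBandwidthEstimates(if_index, AF_INET, &est);
    if (rc != NO_ERROR) {
        printf("GetIpNetworkConnectionBandwidthEstimates failed: %lu\n", rc);
        return 1;
    }
    /* bandwidth estimates are reported in bits per second */
    printf("inbound  estimate: %llu bps\n",
           est.InboundBandwidthInformation.Bandwidth);
    printf("outbound estimate: %llu bps\n",
           est.OutboundBandwidthInformation.Bandwidth);
    return 0;
}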

Measuring Round Trip Time using DPDK

My system is CentOS 8 with kernel 4.18.0-240.22.1.el8_3.x86_64 and I am using DPDK 20.11.1.
I want to calculate the round trip time in an optimized manner, such that a packet sent from Machine A to Machine B is looped back from Machine B to A and the time is measured. While this is being done, Machine B has a DPDK forwarding application running (like testpmd or l2fwd/l3fwd).
One approach could be to use the DPDK pktgen application (https://pktgen-dpdk.readthedocs.io/en/latest/), but I could not find it calculating the Round Trip Time in such a way. Ping is another way, but when Machine B receives a ping packet from Machine A, it has to process the packet and then respond back to Machine A, which adds some cycles (undesired in my case).
I am open to suggestions and approaches to measure this time. A benchmark comparing the RTT (Round Trip Time) of a DPDK based application versus a non-DPDK setup would also give a better comparison.
Edit: There is a way to enable latency measurement in DPDK pktgen. Can anyone share some information on how this latency is calculated and what it signifies (I could not find solid information regarding this latency in the documentation)?
It really depends on the kind of round trip you want to measure. Consider the following timestamps:
         -> t1 -> send() -> NIC_A -> t2 --link--> t3 -> NIC_B -> recv() -> t4
host_A                                                                          host_B
         <- t1' <- recv() <- NIC_A <- t2' <--link-- t3' <- NIC_B <- send() <- t4'
Do you want to measure t1' - t1? Then it's just a matter of writing a small DPDK program that stores the TSC value right before/after each transmit/receive function call on host A. (On host B a forwarding application runs.) See also rte_rdtsc_precise() and rte_get_tsc_hz() for converting the TSC deltas to nanoseconds.
For non-DPDK programs you can read out the TSC values/frequency by other means. Depending on your resolution needs you could also just call clock_gettime(CLOCK_REALTIME), which has an overhead of 18 ns or so.
This works for single-packet transmits via rte_eth_tx_burst() and single-packet receives - which aren't necessarily realistic for your target application. For larger bursts you would have to get a timestamp before the first transmit and after the last receive and then compute the average delta.
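A minimal sketch of that t1' - t1 measurement on host A, assuming port 0, queue 0, a probe mbuf that has already been built, and no timeout handling:

#include <rte_ethdev.h>
#include <rte_cycles.h>
#include <rte_mbuf.h>

/* returns the round trip in nanoseconds for one probe packet, or -1 on loss */
static double measure_rtt_ns(uint16_t port_id, struct rte_mbuf *probe)
{
    struct rte_mbuf *rx[1];
    const uint64_t hz = rte_get_tsc_hz();

    uint64_t t1 = rte_rdtsc_precise();
    if (rte_eth_tx_burst(port_id, 0, &probe, 1) != 1)
        return -1;

    /* spin until the forwarded packet comes back (add a timeout in real code) */
    while (rte_eth_rx_burst(port_id, 0, rx, 1) == 0)
        ;
    uint64_t t1p = rte_rdtsc_precise();

    rte_pktmbuf_free(rx[0]);
    return (double)(t1p - t1) * 1e9 / (double)hz;
}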
Timestamps t2, t3, t2', t3' are hardware transmit/receive timestamps provided by (more serious) NICs.
If you want to compute the roundtrip t2' - t2 then you first need to discipline the NIC's clock (e.g. with phc2sys), enable timestamping and get those timestamps. However, AFAICS DPDK doesn't support obtaining the TX timestamps, in general.
Thus, when using SFP transceivers, an alternative is to install passive optical TAPs on the RX/TX end of NIC_A and connect the monitor ports to a packet capture NIC that supports receive hardware timestamping. With such a setup, computing the t2' - t2 roundtrip is just a matter of writing a script that reads the timestamps of the matching packets from your pcap and computes the deltas between them.
The ideal way to measure the latency of sending and receiving packets through an interface is to set up an external loopback device on the Machine A NIC port. This ensures the packet sent is received back on the same NIC without any processing.
The next best alternative is to enable internal loopback; this ensures the desired packet is converted to a PCIe payload and DMA'd to the hardware packet buffer. Based on the PCIe config, the packet buffer is shared with the RX descriptors, leading to RX of the sent packet. But for this one needs a NIC that supports internal loopback and can suppress loopback error handlers.
Another way is to use a PCIe port-to-port cross connect. In DPDK, we can run rx_burst for port-1 on core-A and rx_burst for port-2 on core-B. This will give an almost accurate Round Trip Time.
Note: newer hardware supports a doorbell mechanism, so on both TX and RX we can have the HW send a callback to the driver/PMD, which can then be used to fetch HW-assisted PTP timestamps for nanosecond accuracy.
But in my recommendation, using an external machine (Machine-B) is not desirable because:
Depending upon the quality of the transfer medium, the latency varies.
Machine-B has to be configured to ideal settings (for almost zero added latency).
Machine-A and Machine-B, even if their physical configurations are the same, need to be maintained and run at the same thermal settings to allow the right clocking.
Both Machine-A and Machine-B have to run with the same PTP grandmaster to synchronize their clocks.
If DPDK is used, either modify the PMD or use rte_eth_tx_buffer_flush to ensure the packet is sent out to the NIC.
With these changes, a dummy UDP packet can be created, where:
the first 8 bytes carry the TX timestamp taken just before tx_burst on Machine-A (T1);
the second 8 bytes are added by Machine-B when it actually receives the packet in SW via rx_burst (T2);
the third 8 bytes are added by Machine-B when its tx_burst is completed (T3);
the fourth 8 bytes are filled in on Machine-A when the packet is actually received via rx_burst (T4).
With these, Round Trip Time = (T4 - T1) - (T3 - T2), where T4 and T1 give the receive and transmit times on Machine A, and T3 - T2 gives the processing overhead on Machine B.
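A hedged sketch of that payload layout and the final computation; the struct and helper names are made up for illustration, both machines are assumed to write their timestamps in a common unit (nanoseconds here), and T2/T3 are filled in by Machine-B's forwarding app (e.g. at the UDP payload offset via rte_pktmbuf_mtod_offset()):

#include <stdint.h>
#include <rte_cycles.h>

struct rtt_payload {
    uint64_t t1_ns;   /* Machine-A, just before tx_burst           */
    uint64_t t2_ns;   /* Machine-B, right after rx_burst           */
    uint64_t t3_ns;   /* Machine-B, right after its tx_burst       */
    uint64_t t4_ns;   /* Machine-A, right after the packet returns */
};

/* convert the local TSC to nanoseconds so both machines write comparable units */
static uint64_t now_ns(void)
{
    return (uint64_t)((double)rte_get_tsc_cycles() * 1e9 /
                      (double)rte_get_tsc_hz());
}

static uint64_t rtt_ns(const struct rtt_payload *p)
{
    return (p->t4_ns - p->t1_ns) - (p->t3_ns - p->t2_ns);
}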
Note: depending upon the processor and generation, an invariant TSC is available. This ensures the tick count from rte_get_tsc_cycles() does not vary with frequency and power states.
[Edit-1] As mentioned in the comments:
@AmmerUsman, I highly recommend editing your question to reflect the real intention, i.e. how the round trip time is measured, rather than TX-RX latency from the DUT. This is because you are referring to the DPDK latency stats/metrics, but those measure min/max/avg latency between Rx and Tx on the same DUT.
@AmmerUsman, the latency library in DPDK produces stats representing the difference between the TX callback and the RX callback, not your described use case. As Keith's explanation points out, the packet sent out by the traffic generator should carry a timestamp in the payload, and the receiver application should forward it back on the same port; the sender can then measure the difference between the time of reception and the timestamp embedded in the packet. For this, you need to send it back on the same port, which does not match your setup diagram.

DPDK buffers received from the RX ring and freed on the TX path

Consider a DPDK program where each EAL thread:
receives a packet on its own RX queue
modifies the buffer in place
puts it back on the TX ring to echo it back to the sender
The RX buffers are not explicitly freed, as they are re-used on the TX ring. Is it good practice to depend on the TX queue being processed by the NIC to free up entries in the RX ring?
The buffers successfully put in the Tx queue will be freed by the PMD. That’s the only option, so yes it’s a good practice.
Please note though, that placing a burst of packets in the Tx queue might fail, as the queue might be full for some reason. So if there are any packets left unqueued after rte_eth_tx_burst(), those must be freed manually or the transmission must be retried.
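A minimal sketch of that leftover handling, assuming the echo loop from the question (port/queue ids and the burst size are illustrative):

uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
/* ... modify the buffers in place ... */
uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_rx);

/* mbufs accepted by the Tx queue are freed by the PMD after transmission;
 * anything left over must be freed (or retried) by the application */
if (unlikely(nb_tx < nb_rx)) {
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(pkts[i]);
}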

Minimizing dropped UDP packets at high packet rates (Windows 10)

IMPORTANT NOTE: I'm aware that UDP is an unreliable protocol. But, as I'm not the manufacturer of the device that delivers the data, I can only try to minimize the impact. Hence, please don't post any more statements about UDP being unreliable. I need suggestions to reduce the loss to a minimum instead.
I've implemented an application in C++ which needs to receive a large number of UDP packets in a short time and needs to work under Windows (Winsock). The program works, but seems to drop packets if the data rate (or packet rate) per UDP stream reaches a certain level... Note that I cannot change the camera interface to use TCP.
Details: It's a client for Gigabit-Ethernet cameras, which send their images to the computer using UDP packets. The data rate per camera is often close to the capacity of the network interface (~120 megabytes per second), which means even with 8 KB jumbo frames the packet rate is 10,000 to 15,000 per camera. Currently we have connected 4 cameras to one computer... and this means up to 60,000 packets per second.
The software handles all cameras at the same time, and the stream receiver for each camera is implemented as a separate thread with its own receiving UDP socket.
At a certain frame rate the software seems to miss a few UDP frames (even though only ~60-70% of the network capacity is used) every few minutes.
Hardware Details
Cameras are from foreign manufacturers! They send UDP streams to a configurable UDP endpoint via ethernet. No TCP-support...
Cameras are connected via their own dedicated network interface (1GBit/s)
Direct connection, no switch used (!)
Cables are CAT6e or CAT7
Implementation Details
So far I set the SO_RCVBUF to a large value:
int32_t rbufsize = 4100 * 3100 * 2; // two 12 MP images
if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, (char*)&rbufsize, sizeof(rbufsize)) == -1) {
perror("SO_RCVBUF");
throw runtime_error("Could not set socket option SO_RCVBUF.");
}
The error is not thrown. Hence, I assume the value was accepted.
I also set the priority of the main process to HIGH-PRIORITY_CLASS by using the following code:
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
However, I didn't find any possibility to change the thread priorities. The threads are created after the process priority is set...
The receiver threads use blocking IO to receive one packet at a time (with a 1000 ms timeout to allow the thread to react to a global shutdown signal). If a packet is received, it's stored in a buffer and the loop immediately continues to receive any further packets.
Questions
Is there any other way I can reduce the probability of packet loss? Is there any possibility to receive all packets that are stored in the socket's buffer with one call? (I don't need any information about the sender side; just the contained payload.)
Maybe, you can also suggest some registry/network card settings to check...
To increase the UDP Rx performance for GigE cameras on Windows you may want to look into writing a custom filter driver (NDIS). This allows you to intercept the messages in the kernel, stop them from reaching userspace, pack them into some buffer and then send them to userspace via a custom ioctl to your application. I have done this; it took about a week of work to get done. There is a sample available from Microsoft which I used as the base for it.
It is also possible to use an existing generic driver, such as pcap, which I also tried; that took about half a week. This is not as good because pcap cannot determine when the frames end, so packet grouping will be suboptimal.
I would suggest first digging deep into the network stack settings and making sure that the PC is not starved for resources. Look at guides for tuning e.g. Intel network cards for this type of load; that could potentially have a larger impact than a custom driver.
(I know this is an older thread and you have probably solved your problem. But things like this are good to document for future adventurers.)
Use IOCP and WSARecv in overlapped mode; you can set up around ~60k outstanding WSARecv calls.
On the thread that handles GetQueuedCompletionStatus, process the data and also post another WSARecv from that thread to compensate for the one consumed when receiving the data.
Please note that your UDP packet size should stay below the MTU; above it, you will get drops depending on all the network hardware between the camera and the software.
Write some UDP testers that mimic the camera to test the network, just to be sure that the hardware will support the load.
https://www.winsocketdotnetworkprogramming.com/winsock2programming/winsock2advancediomethod5e.html
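A hedged sketch of that IOCP approach: one overlapped UDP socket with a pool of outstanding WSARecvFrom requests drained through a completion port. The buffer size, request count and port number (3956) are illustrative only, and error handling is omitted:

#include <winsock2.h>
#include <ws2tcpip.h>
#pragma comment(lib, "ws2_32.lib")

#define PENDING_RECVS 256
#define BUF_SIZE      9000   /* large enough for a jumbo-frame payload */

struct RecvCtx {
    WSAOVERLAPPED      ov;
    WSABUF             wsabuf;
    char               buf[BUF_SIZE];
    struct sockaddr_in from;
    int                fromlen;
};

static void post_recv(SOCKET s, struct RecvCtx *ctx)
{
    DWORD flags = 0;
    ZeroMemory(&ctx->ov, sizeof(ctx->ov));
    ctx->wsabuf.buf = ctx->buf;
    ctx->wsabuf.len = BUF_SIZE;
    ctx->fromlen = sizeof(ctx->from);
    /* completion (or failure) is reported through the completion port */
    WSARecvFrom(s, &ctx->wsabuf, 1, NULL, &flags,
                (struct sockaddr *)&ctx->from, &ctx->fromlen, &ctx->ov, NULL);
}

int main(void)
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    SOCKET s = WSASocket(AF_INET, SOCK_DGRAM, IPPROTO_UDP, NULL, 0,
                         WSA_FLAG_OVERLAPPED);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(3956);          /* hypothetical camera stream port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(s, (struct sockaddr *)&addr, sizeof(addr));

    HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
    CreateIoCompletionPort((HANDLE)s, iocp, (ULONG_PTR)s, 0);

    static struct RecvCtx ctx[PENDING_RECVS];
    for (int i = 0; i < PENDING_RECVS; i++)
        post_recv(s, &ctx[i]);

    for (;;) {
        DWORD bytes;
        ULONG_PTR key;
        LPOVERLAPPED ov;
        if (GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
            struct RecvCtx *done = (struct RecvCtx *)ov;
            /* ... copy 'bytes' bytes out of done->buf into the image buffer ... */
            post_recv(s, done);           /* immediately re-arm this request */
        }
    }
}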

C++ Reading UDP packets [duplicate]

I have a Java app on Linux which opens a UDP socket and waits for messages.
After a couple of hours under heavy load, there is packet loss, i.e. the packets are received by the kernel but not by my app (we see the lost packets in a sniffer, we see UDP packets lost in netstat, we don't see those packets in our app logs).
We tried enlarging the socket buffers, but this didn't help - we started losing packets later than before, but that's it.
For debugging, I want to know how full the OS udp buffer is, at any given moment. Googled, but didn't find anything. Can you help me?
P.S. Guys, I'm aware that UDP is unreliable. However - my computer receives all UDP messages, while my app is unable to consume some of them. I want to optimize my app to the max, that's the reason for the question. Thanks.
UDP is a perfectly viable protocol. It is the same old case of the right tool for the right job!
If you have a program that waits for UDP datagrams and then goes off to process them before returning to wait for another, then your elapsed processing time must always keep up with the worst-case arrival rate of datagrams. If it does not, the UDP socket receive queue will begin to fill.
This can be tolerated for short bursts. The queue does exactly what it is supposed to do – queue datagrams until you are ready. But if the average arrival rate regularly causes a backlog in the queue, it is time to redesign your program. There are two main choices here: reduce the elapsed processing time via crafty programming techniques, and/or multi-thread your program. Load balancing across multiple instances of your program may also be employed.
As mentioned, on Linux you can examine the proc filesystem to get status about what UDP is up to. For example, if I cat the /proc/net/udp node, I get something like this:
$ cat /proc/net/udp
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
40: 00000000:0202 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 3466 2 ffff88013abc8340 0
67: 00000000:231D 00000000:0000 07 00000000:0001E4C8 00:00000000 00000000 1006 0 16940862 2 ffff88013abc9040 2237
122: 00000000:30D4 00000000:0000 07 00000000:00000000 00:00000000 00000000 1006 0 912865 2 ffff88013abc8d00 0
From this, I can see that a socket owned by user id 1006, is listening on port 0x231D (8989) and that the receive queue is at about 128KB. As 128KB is the max size on my system, this tells me my program is woefully weak at keeping up with the arriving datagrams. There have been 2237 drops so far, meaning the UDP layer cannot put any more datagrams into the socket queue, and must drop them.
You could watch your program's behaviour over time e.g. using:
watch -d 'cat /proc/net/udp|grep 00000000:231D'
Note also that the netstat command does about the same thing: netstat -c --udp -an
My solution for my weenie program will be to multi-thread.
Cheers!
Linux provides the files /proc/net/udp and /proc/net/udp6, which lists all open UDP sockets (for IPv4 and IPv6, respectively). In both of them, the columns tx_queue and rx_queue show the outgoing and incoming queues in bytes.
If everything is working as expected, you usually will not see any value different from zero in those two columns: as soon as your application generates packets they are sent through the network, and as soon as those packets arrive from the network your application will wake up and receive them (the recv call immediately returns). You may see the rx_queue go up if your application has the socket open but is not invoking recv to receive the data, or if it is not processing such data fast enough.
rx_queue will tell you the queue length at any given instant, but it will not tell you how full the queue has been, i.e. the highwater mark. There is no way to constantly monitor this value, and no way to get it programmatically (see How do I get amount of queued data for UDP socket?).
The only way I can imagine monitoring the queue length is to move the queue into your own program. In other words, start two threads -- one is reading the socket as fast as it can and dumping the datagrams into your queue; and the other one is your program pulling from this queue and processing the packets. This of course assumes that you can assure each thread is on a separate CPU. Now you can monitor the length of your own queue and keep track of the highwater mark.
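A hedged sketch of that two-thread idea in C: a bounded single-producer/single-consumer ring whose depth (and high-water mark) the application can observe. Sizes, names and the busy-waiting are purely illustrative:

#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/socket.h>

#define RING_SLOTS 4096
#define DGRAM_MAX  2048

static char ring[RING_SLOTS][DGRAM_MAX];
static ssize_t lens[RING_SLOTS];
static atomic_uint head, tail;          /* head: producer, tail: consumer */
static atomic_uint highwater;           /* deepest the queue has ever been */

static void *rx_thread(void *arg)       /* drains the socket as fast as it can */
{
    int s = *(int *)arg;
    for (;;) {
        unsigned h = atomic_load(&head), t = atomic_load(&tail);
        unsigned depth = h - t;
        if (depth > atomic_load(&highwater))
            atomic_store(&highwater, depth);
        if (depth == RING_SLOTS)
            continue;                    /* ring full: datagrams back up in the kernel */
        ssize_t n = recv(s, ring[h % RING_SLOTS], DGRAM_MAX, 0);
        if (n < 0)
            continue;
        lens[h % RING_SLOTS] = n;
        atomic_store(&head, h + 1);
    }
    return NULL;
}

static void *worker_thread(void *arg)   /* processes datagrams from the ring */
{
    (void)arg;
    for (;;) {
        unsigned t = atomic_load(&tail);
        if (t == atomic_load(&head))
            continue;                    /* empty; a condvar would be kinder */
        /* process_datagram(ring[t % RING_SLOTS], lens[t % RING_SLOTS]); */
        atomic_store(&tail, t + 1);
    }
    return NULL;
}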
The process is simple:
If desired, pause the application process.
Open the UDP socket. You can snag it from the running process using /proc/<PID>/fd if necessary. Or you can add this code to the application itself and send it a signal -- it will already have the socket open, of course.
Call recvmsg in a tight loop as quickly as possible.
Count how many packets/bytes you got.
This will discard any datagrams currently buffered, but if that breaks your application, your application was already broken.
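A minimal sketch of steps 3-4, written as a function you could drop into the application itself (the fd is whatever socket the app already has open); MSG_DONTWAIT makes the loop stop as soon as the queue is drained:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <stdio.h>

/* drain everything currently queued on 'fd' and report how much was there */
static void drain_udp_socket(int fd)
{
    char buf[65536];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    long packets = 0, bytes = 0;

    for (;;) {
        ssize_t n = recvmsg(fd, &msg, MSG_DONTWAIT);   /* don't block when empty */
        if (n < 0)
            break;                                     /* EAGAIN: queue drained */
        packets++;
        bytes += n;
    }
    fprintf(stderr, "drained %ld datagrams, %ld bytes\n", packets, bytes);
}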