I am starting to work with and understand the basics of DPDK and its integration with VMware (the VMXNET3 PMD). I started browsing through the code base and found references to 3 ring structures in vmxnet3_tx_queue_t (in vmxnet3_ring.h), namely cmd_ring, data_ring and comp_ring.
I tried searching to understand their use cases and how they work, but couldn't quite find documentation on them, or was unable to understand it.
Any pointers / direction would be of great help.
The vmxnet3 is pretty decently described in the DPDK NIC documentation:
http://doc.dpdk.org/guides/nics/vmxnet3.html
The driver pre-allocates the packet buffers and loads the command ring descriptors in advance. The hypervisor fills those packet buffers on packet arrival and writes completion ring descriptors, which are eventually pulled by the PMD. After reception, the DPDK application frees the descriptors and loads new packet buffers for the incoming packets.
In the transmit routine, the DPDK application fills packet buffer pointers into the descriptors of the command ring and notifies the hypervisor. In response, the hypervisor takes the packets, passes them to the vSwitch and writes the completion ring descriptors. The rings are read by the PMD in the next transmit routine call, and the buffers and descriptors are freed from memory.
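As a toy illustration of the producer/consumer index arithmetic behind such rings (this is not the actual driver code; the real vmxnet3 descriptors use a generation bit to detect wrap-around, while a free-slot counter is used here for brevity):

```cpp
#include <array>

// Toy model of one descriptor ring: the driver produces descriptors at
// next2fill, the "device" (hypervisor) consumes them at next2comp. The
// real vmxnet3 rings track wrap-around with a generation bit in each
// descriptor; a free-slot counter is used here to keep the sketch short.
struct Ring {
    static constexpr unsigned SIZE = 8;
    std::array<int, SIZE> desc{};   // stands in for descriptor contents
    unsigned next2fill = 0;         // producer index (driver)
    unsigned next2comp = 0;         // consumer index (completions)
    unsigned avail = SIZE;          // free descriptors

    bool fill(int v) {              // driver loads one descriptor
        if (avail == 0) return false;    // ring full: wait for completions
        desc[next2fill] = v;
        next2fill = (next2fill + 1) % SIZE;
        --avail;
        return true;
    }
    bool complete(int& v) {         // device reports one descriptor done
        if (avail == SIZE) return false; // nothing outstanding
        v = desc[next2comp];
        next2comp = (next2comp + 1) % SIZE;
        ++avail;
        return true;
    }
};
```

Once the ring fills up, the driver must wait for completions before loading new descriptors, which is exactly the back-pressure mechanism described above.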
Not sure though if those details are the "basics of DPDK", as those low-level queues are abstracted away by the DPDK Poll Mode Driver API:
https://doc.dpdk.org/guides/prog_guide/poll_mode_drv.html
So you'd better refer to that document and use this API, as you won't be able to use the vmxnet3 rings directly in your app anyway...
Related
While writing a simple DPDK packet generator I noticed some additional initialization steps that are required for reliable and successful packet transmission:
calling rte_eth_link_get() or rte_eth_timesync_enable() after rte_eth_dev_start()
waiting 2 seconds before sending the first packet with rte_eth_tx_burst()
So these steps are necessary when I use the ixgbe DPDK vfio driver with an Intel X553 NIC.
When I'm using the AF_PACKET DPDK driver, it works without those extra steps.
Is this a bug or a feature?
Is there a better way than waiting 2 seconds before the first send?
Why is the wait required with the ixgbe driver? Is this a limitation of that NIC or the involved switch (Mikrotik CRS326 with Marvell chipset)?
Is there a more idiomatic function to call than rte_eth_link_get() in order to complete the device initialization for transmission?
Is there some way to keep the VFIO NIC initialized (while keeping its link up) to avoid re-initializing it over and over again during the usual edit/compile/test cycle? (i.e. to speed up that cycle ...)
Additional information: When I connect the NIC to a mirrored port (which is configured via Mikrotik's mirror-source/mirror-target ethernet switch settings) and the sleep(2) is removed then I see the first packet transmitted to the mirror target but not to the primary destination. Thus, it seems like the sleep is necessary to give the switch some time after the link is up (after the dpdk program start) to completely initialize its forwarding table or something like that?
Waiting just 1 second before the first transmission works less reliably, i.e. the packet reaches the receiver only every other time.
My device/port initialization procedure implements the following setup sequence:
rte_eth_dev_count_avail()
rte_eth_dev_is_valid_port()
rte_eth_dev_info_get()
rte_eth_dev_adjust_nb_rx_tx_desc()
rte_eth_dev_configure(port_id, 0 /* rxrings */, 1 /* txrings */, &port_conf)
rte_eth_tx_queue_setup()
rte_eth_dev_start()
rte_eth_macaddr_get()
rte_eth_link_get() // <-- REQUIRED!
rte_eth_dev_get_mtu()
Without rte_eth_link_get() (or rte_eth_timesync_enable()) the first transmitted packet doesn't even show up on the mirrored port.
The above functions (and rte_eth_tx_burst()) complete successfully with or without rte_eth_link_get()/sleep(2) being present. In particular, the MAC address and MTU that are read have the expected values (MTU -> 1500), and rte_eth_tx_burst() returns 1 for a burst of one UDP packet.
The returned link status is: Link up at 1 Gbps FDX Autoneg
The fact that rte_eth_link_get() can be replaced with rte_eth_timesync_enable() probably can be explained by the latter calling ixgbe_start_timecounters() which calls rte_eth_linkstatus_get() which is also called by rte_eth_link_get().
I've checked the DPDK examples and most of them don't call rte_eth_link_get() before sending something. There is also no sleep after device initialization.
I'm using DPDK 20.11.2.
Even more information - to answer the comments:
I'm running this on Fedora 33 (5.13.12-100.fc33.x86_64).
Ethtool reports: firmware-version: 0x80000877
I had called rte_eth_timesync_enable() in order to work with the transmit timestamps. However, during debugging I removed it to arrive at a minimal reproducer. At that point I noticed that removing it actually made things worse (i.e. no packet was transmitted over the mirror port). I thus investigated which part of that function might make the difference and found rte_eth_link_get(), which has similar side effects.
When switching to AF_PACKET I'm using the stock ixgbe kernel driver, i.e. ixgbe with default settings on a device that is initialized by networkd (dhcp enabled).
My expectation was that when rte_eth_dev_start() terminates that the link is up and the device is ready for transmission.
However, it would be nice, I guess, if one could avoid resetting the device after program restarts. I don't know if DPDK supports this.
Regarding delays: I've just tested the following: rte_eth_link_get() can be omitted if I increase the sleep to 6 seconds. Whereas a call to rte_eth_link_get() takes 3.3 s. So yeah, it's probably just helping due to the additional delay.
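Rather than a blind sleep, a common pattern is to poll the link status with a bounded timeout. Below is a minimal sketch; the predicate is passed in as a stub so the snippet is self-contained, but in a real DPDK app it would wrap rte_eth_link_get_nowait() and check link.link_status:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Poll a link-status predicate until it reports up or the timeout
// expires, instead of sleeping for a fixed 2 seconds. In a real DPDK
// app the predicate would call rte_eth_link_get_nowait(port_id, &link)
// and return link.link_status == RTE_ETH_LINK_UP.
bool wait_link_up(const std::function<bool()>& link_is_up,
                  std::chrono::milliseconds timeout,
                  std::chrono::milliseconds interval) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!link_is_up()) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;               // give up: link never came up
        std::this_thread::sleep_for(interval);
    }
    return true;
}
```

This way the startup delay is only as long as the link actually needs, and a dead link is reported as an error instead of silently dropping the first packets.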
The difference between the two attempted approaches
In order to use af_packet PMD, you first bind the device in question to the kernel driver. At this point, a kernel network interface is spawned for that device. This interface typically has the link active by default. If not, you typically run ip link set dev <interface> up. When you launch your DPDK application, af_packet driver does not (re-)configure the link. It just unconditionally reports the link to be up on device start (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L266) and vice versa when doing device stop. Link update operation is also no-op in this driver (see https://github.com/DPDK/dpdk/blob/main/drivers/net/af_packet/rte_eth_af_packet.c#L416).
In fact, with af_packet approach, the link is already active at the time you launch the application. Hence no need to await the link.
With the VFIO approach, the device in question has its link down, and it is the responsibility of the corresponding PMD to activate it. Hence the need to check the link status in the application.
Is it possible to avoid waiting on application restarts?
Long story short, yes. Awaiting link status is not the only problem with application restarts. You effectively re-initialise EAL as a whole when you restart, and that procedure is also eye-wateringly time consuming. In order to cope with that, you should probably check out multi-process support available in DPDK (see https://doc.dpdk.org/guides/prog_guide/multi_proc_support.html).
This requires that you re-implement your application to have its control logic in one process (the primary process) and its Rx/Tx datapath logic in another one (the secondary process). This way, you can keep the first one running all the time and restart the second one whenever you need to change the Rx/Tx logic / re-compile. Restarting the secondary process simply re-attaches to the existing EAL instance each time. Hence no PMD restart is involved, and there is no more need to wait.
Based on the interaction via comments, the real question is summarized as I'm just asking myself if it's possible to keep the link-up between invocations of a DPDK program (when using a vfio device) to avoid dealing with the relatively long wait times before the first transmit comes through. IOW, is it somehow possible to skip the device reset when starting the program for the 2nd time?
The short answer is No for the packet-generator program between restarts, because any physical NIC which uses PCIe config space for both PF (IXGBE for X553) and VF (IXGBE_VF for X553) bound with uio_pci_generic|igb_uio|vfio-pci requires a PCIe reset & configuration. But when using the AF_PACKET (ixgbe kernel driver) DPDK PMD, this is a virtual device that does not do any PCIe reset and directly sets dev->data->dev_link.link_status = ETH_LINK_UP; in its eth_dev_start function.
For the second part Is the delay for the first TX packets expected?
[Answer] No; a few factors contribute to the delay in the first packet transmission:
Switch software & firmware (PF only)
Switch port Auto-neg or fixed speed (PF only)
X553 software and firmware (PF and VF)
Autoneg enable or disabled (PF and VF)
link medium SFP (fiber) or DAC (Direct Attached Copper) or RJ-45 (cat5|cat6) connection (PF only)
PF driver version for NIC (X553 ixgbe) (PF and VF)
As per the Intel driver: software-generated layer-two frames, like IEEE 802.3x (link flow control), IEEE 802.1Qbb (priority-based flow control), and others of this type (VF only)
Note: Since the issue is mentioned for VF ports only (and not PF ports), my assumptions are:
The TX packet uses the SRC MAC address of the VF to avoid the MAC spoof check on the ASIC
All SR-IOV enabled ports are configured for VLAN tagging from the administrative interface on the PF, to avoid flooding of traffic to the VF
The PF driver is updated to avoid old driver issues (such as a VF reset causing the PF link to reset). This can be identified by checking dmesg
Steps to isolate whether the problem is the NIC:
Check whether (X553) PF DPDK has the same delay as VF DPDK (to isolate whether it is a PF or VF problem).
Cross-connect 2 NICs (X553) on the same system. Then compare Linux vs DPDK link-up events (to check whether it is a NIC problem or a PCIe link issue).
Disable auto-negotiation for DPDK X553 and compare PF vs VF in DPDK.
Steps to isolate whether the problem is the switch:
Disable auto-negotiation on the switch, set a fixed full-duplex 1 Gbps speed, and check the behaviour.
[EDIT-1] I do agree with the workaround suggested by #stackinside using the DPDK primary-secondary process concept, where the primary is responsible for link and port bring-up while the secondary is used for the actual RX and TX bursts.
IMPORTANT NOTE: I'm aware that UDP is an unreliable protocol. But, as I'm not the manufacturer of the device that delivers the data, I can only try to minimize the impact. Hence, please don't post any more statements about UDP being unreliable. I need suggestions to reduce the loss to a minimum instead.
I've implemented a C++ application which needs to receive a large amount of UDP packets in a short time and needs to work under Windows (Winsock). The program works, but seems to drop packets if the data rate (or packet rate) per UDP stream reaches a certain level... Note that I cannot change the camera interface to use TCP.
Details: It's a client for Gigabit-Ethernet cameras, which send their images to the computer using UDP packets. The data rate per camera is often close to the capacity of the network interface (~120 Megabytes per second), which means even with 8KB-Jumbo Frames the packet rate is at 10'000 to 15'000 per camera. Currently we have connected 4 cameras to one computer... and this means up to 60'000 packets per second.
The software handles all cameras at the same time; the stream receiver for each camera is implemented as a separate thread and has its own receiving UDP socket.
At a certain frame rate the software seems to miss a few UDP frames (even though only ~60-70% of the network capacity is used) every few minutes.
Hardware Details
Cameras are from foreign manufacturers! They send UDP streams to a configurable UDP endpoint via ethernet. No TCP-support...
Cameras are connected via their own dedicated network interface (1GBit/s)
Direct connection, no switch used (!)
Cables are CAT6e or CAT7
Implementation Details
So far I set the SO_RCVBUF to a large value:
int32_t rbufsize = 4100 * 3100 * 2; // two 12 MP images
if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, (char*)&rbufsize, sizeof(rbufsize)) == -1) {
perror("SO_RCVBUF");
throw runtime_error("Could not set socket option SO_RCVBUF.");
}
The error is not thrown. Hence, I assume the value was accepted.
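One caveat worth checking: a successful setsockopt() does not guarantee that the kernel granted the full size. On Linux the value is silently capped at net.core.rmem_max, and Windows can clamp it too; reading the value back with getsockopt() shows what was actually applied. A POSIX sketch (the Winsock version is the same apart from the char* casts used in the snippet above):

```cpp
#include <sys/socket.h>
#include <unistd.h>

// Set SO_RCVBUF and read back what the kernel actually granted. A
// successful setsockopt() only means the request was accepted, not that
// the full size was applied: Linux silently caps it at net.core.rmem_max
// (and reports roughly double the usable value for bookkeeping).
// Returns the effective size, or -1 on error.
int set_and_verify_rcvbuf(int sock, int requested) {
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof requested) != 0)
        return -1;
    int effective = 0;
    socklen_t len = sizeof effective;
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &effective, &len) != 0)
        return -1;
    return effective;
}
```

If the effective value is far below the ~25 MB requested here, the large buffer you think you have may not exist, and raising the OS-level limit is the first thing to try.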
I also set the priority of the main process to HIGH-PRIORITY_CLASS by using the following code:
SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
However, I didn't find any possibility to change the thread priorities. The threads are created after the process priority is set...
The receiver threads use blocking IO to receive one packet at a time (with a 1000 ms timeout to allow the thread to react to a global shutdown signal). If a packet is received, it's stored in a buffer and the loop immediately continues to receive any further packets.
Questions
Is there any other way how I can reduce the probability of a packet loss? Any possibility to maybe receive all packets that are stored in the sockets buffer with one call? (I don't need any information about the sender side; just the contained payload)
Maybe, you can also suggest some registry/network card settings to check...
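Regarding receiving all packets stored in the socket buffer with one call: Winsock has no direct equivalent (the closest tools are overlapped WSARecv/WSARecvMsg or Registered I/O), but on Linux recvmmsg() does exactly that, which can be useful for comparison testing of the cameras on a Linux box. A self-contained sketch over loopback UDP:

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

// Linux-only sketch: queue several datagrams on a loopback UDP socket,
// then drain them with a single recvmmsg() call. Returns the number of
// datagrams received in that one call, or -1 on error.
int batch_receive_demo(int to_send) {
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    if (rx < 0 || tx < 0) return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       // let the kernel pick a port
    if (bind(rx, (sockaddr*)&addr, sizeof addr) != 0) return -1;
    socklen_t alen = sizeof addr;
    getsockname(rx, (sockaddr*)&addr, &alen);

    for (int i = 0; i < to_send; ++i) {
        char payload[16];
        int n = snprintf(payload, sizeof payload, "pkt%d", i);
        sendto(tx, payload, n, 0, (sockaddr*)&addr, sizeof addr);
    }
    usleep(100 * 1000);                      // let the datagrams queue up

    constexpr unsigned VLEN = 32;            // max datagrams per call
    mmsghdr msgs[VLEN] = {};
    iovec iovs[VLEN];
    static char bufs[VLEN][2048];
    for (unsigned i = 0; i < VLEN; ++i) {
        iovs[i] = { bufs[i], sizeof bufs[i] };
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    int got = recvmmsg(rx, msgs, VLEN, MSG_DONTWAIT, nullptr);
    close(rx);
    close(tx);
    return got;
}
```

Fewer syscalls per packet means fewer user/kernel transitions, which is exactly where the time goes at tens of thousands of packets per second.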
To increase the UDP Rx performance for GigE cameras on Windows you may want to look into writing a custom filter driver (NDIS). This allows you to intercept the messages in the kernel, stop them from reaching userspace, pack them into some buffer and then send them to userspace via a custom ioctl to your application. I have done this; it took about a week of work to get done. There is a sample available from Microsoft which I used as a base for it.
It is also possible to use an existing generic driver, such as pcap, which I also tried; that took about half a week. This is not as good because pcap cannot determine when the frames end, so packet grouping will be suboptimal.
I would suggest first digging deep in the network stack settings and making sure that the PC is not starved for resources. Look at guides for tuning e.g. Intel network cards for this type of load, that could potentially have a larger impact than a custom driver.
(I know this is an older thread and you have probably solved your problem. But things like this is good to document for future adventurers..)
Use IOCP and WSARecv in overlapped mode; you can set up around ~60k outstanding WSARecv calls.
On the thread that handles GetQueuedCompletionStatus, process the data and also post another WSARecv to compensate for the one consumed when receiving the data.
Please note that your UDP packet size should stay below the MTU; above it, drops can occur depending on the network hardware between the camera and the software.
Write some UDP testers that mimic the camera, to test the network and make sure that the hardware will support the load.
https://www.winsocketdotnetworkprogramming.com/winsock2programming/winsock2advancediomethod5e.html
I am working on a video streaming application in C++, and I have questions about serialization of the packets. I know there are a number of serialization frameworks out there, from Google Protocol Buffers to Apache Thrift, Cereal, etc. Am I correct in thinking that the packets will need to be serialized?
Since any little bit of overhead can cause latency, is it worthwhile to use an existing framework to do the serialization?
Would it even be worth my time to try and roll my own or just stick to the frameworks?
I know most of the frameworks will take into account big-endian and little-endian which is nice.
From the research I have done, I was leaning towards trying to use Google Protocol Buffers. Is this a good serialization library for a streaming application?
Are there any other suggestions?
I appreciate the advice and suggestions.
The framework should help you send and retrieve packets. Filling in the packet with a payload is your task.
If you are worried about the latency and don't want to delay your processing by waiting for packets to be sent, you should consider using a separate thread and buffer for the packets.
For example, the processing thread would create or fill in a packet and put it into the outgoing buffer (e.g. circular queue). The packet sending thread would take packets from the buffer and send them out. The thread would sleep until either the packet has been sent or a new packet is added to the queue.
This technique will "blast" the network with packets, which you may not want to do. If you continuously blast packets, nobody else will be able to send out packets. Whether this is a problem depends on the routers and switches on the network and your packet framework.
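The buffer-plus-sender-thread pattern described above can be sketched as follows; the actual socket send is stubbed out as a byte counter, since the real call (send(), WSASend(), or whatever the framework provides) depends on your setup:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Outgoing packet buffer: the processing thread push()es packets, a
// dedicated sender thread pop()s and transmits them. close() lets the
// sender drain the remaining packets and exit cleanly.
class PacketQueue {
public:
    void push(std::vector<char> pkt) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(pkt)); }
        cv_.notify_one();
    }
    // Blocks until a packet is available; returns false once the queue
    // has been closed and fully drained.
    bool pop(std::vector<char>& pkt) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        pkt = std::move(q_.front());
        q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::vector<char>> q_;
    bool closed_ = false;
};

// Demo: enqueue n packets of `size` bytes while a sender thread drains
// them; the real send call is replaced by a byte counter.
std::size_t run_sender_demo(int n, std::size_t size) {
    PacketQueue q;
    std::size_t sent_bytes = 0;
    std::thread sender([&] {
        std::vector<char> pkt;
        while (q.pop(pkt)) sent_bytes += pkt.size();  // real code: send the packet here
    });
    for (int i = 0; i < n; ++i) q.push(std::vector<char>(size));
    q.close();
    sender.join();
    return sent_bytes;
}
```

A bounded queue (rejecting or blocking the producer when full) is the natural place to add rate limiting if blasting the network turns out to be a problem.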
I want to write (but first I want to understand how to do it) applications (more than one) based on GStreamer framework that would share the same hardware resource at the same time.
For example: there is a hardware with HW acceleration for video decoding. I want to start simultaneously two applications that are able to decode different video streams, using HW acceleration. Of course I assume that HW is able to handle such requests, there is appropriate driver (but not GStreamer element) for doing this, but how to write GStreamer element that would support such resource sharing between separate processes?
I would appreciate any links, suggestions where to start...
You have h/w that can be accessed concurrently. Hence two GStreamer elements accessing it concurrently should work! There is nothing GStreamer-specific here.
Say you wanted to write a decoding element, it is like any decoding element and you access your hardware correctly. Your drivers should take care of the concurrent access.
The starting place is the Gstreamer plugin writer's guide.
So you need a single process that controls the HW decoder, and decodes streams from multiple sources.
I would recommend building a daemon, possibly itself based on GStreamer. The gdppay and gdpdepay elements provide quite a simple way to pass data through sockets to the daemon and back. The daemon would wait for connections on a specified port (or Unix socket) and open a virtual decoder for each connection. The video decoder elements in the separate applications would internally connect to the daemon and get back the decoded video.
Using 2 PC's with Windows XP, 64kB Tcp Window size, connected with a crossover cable
Using Qt 4.5.3, QTcpServer and QTcpSocket
Sending 2000 messages of 40kB takes 2 seconds (40MB/s)
Sending 1 message of 80MB takes 80 seconds (1MB/s)
Anyone has an explanation for this? I would expect the larger message to go faster, since the lower layers can then fill the Tcp packets more efficiently.
This is hard to comment on without seeing your code.
How are you timing this on the sending side? When do you know you're done?
How does the client read the data, does it read into fixed sized buffers and throw the data away or does it somehow know (from the framing) that the "message" is 80MB and try and build up the "message" into a single data buffer to pass up to the application layer?
It's unlikely to be the underlying Windows sockets code that's making this work poorly.
TCP, from the application side, is stream-based which means there are no packets, just a sequence of bytes. The kernel may collect multiple writes to the connection before sending it out and the receiving side may make any amount of the received data available to each "read" call.
TCP, on the IP side, is packets. Since standard Ethernet has an MTU (maximum transfer unit) of 1500 bytes and both TCP and IP have 20-byte headers, each packet transferred over Ethernet will pass 1460 bytes (or less) of the TCP stream to the other side. 40KB or 80MB writes from the application will make no difference here.
How long it appears to take data to transfer will depend on how and where you measure it. Writing 40KB will likely return immediately since that amount of data will simply get dropped in TCP's "send window" inside the kernel. An 80MB write will block waiting for it all to get transferred (well, all but the last 64KB which will fit, pending, in the window).
TCP transfer speed is also affected by the receiver. It has a "receive window" that contains everything received from the peer but not fetched by the application. The amount of space available in this window is passed to the sender with every return ACK so if it's not being emptied quickly enough by the receiving application, the sender will eventually pause. WireShark may provide some insight here.
In the end, both methods should transfer in the same amount of time since an application can easily fill the outgoing window faster than TCP can transfer it no matter how that data is chunked.
I can't speak for the operation of Qt, however.
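As an aside, the stream property described above is easy to demonstrate. The sketch below uses a Unix-domain stream socket pair so it is self-contained; TCP behaves the same way with respect to write/read boundaries:

```cpp
#include <sys/socket.h>
#include <unistd.h>

// Demonstrates the byte-stream property: several small writes arrive as
// one undifferentiated run of bytes, and read() boundaries need not
// match write() boundaries. A Unix-domain stream socket pair stands in
// for a TCP connection here. Returns total bytes read, or -1 on error.
long stream_demo(int writes, int chunk_size) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    char chunk[1024] = {};                    // chunk_size must be <= 1024
    for (int i = 0; i < writes; ++i)
        if (write(sv[0], chunk, chunk_size) != chunk_size) return -1;

    long total = 0;
    char buf[4096];
    while (total < (long)writes * chunk_size) {
        ssize_t n = read(sv[1], buf, sizeof buf);  // may return any amount
        if (n <= 0) return -1;
        total += n;
    }
    close(sv[0]);
    close(sv[1]);
    return total;
}
```

However the sender chunks its writes, the receiver sees only a byte count; any "message" structure (such as the 80 MB message above) has to be reconstructed by the application from its own framing.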
Bug in Qt 4.5.3