I am using DPDK 21.11 for my application. After a certain time, the API rte_eth_tx_burst stops sending any packets out.
NIC: Ethernet Controller X710 for 10GbE SFP+ (device ID 1572), bound to drv=vfio-pci
#define MAX_RETRY_COUNT_RTE_ETH_TX_BURST 3

do
{
    num_sent_pkt = rte_eth_tx_burst(eth_port_id, queue_id, &mbuf[mbuf_idx], pkt_count);
    pkt_count -= num_sent_pkt;
    retry_count++;
} while (pkt_count && (retry_count != MAX_RETRY_COUNT_RTE_ETH_TX_BURST));
To debug, I tried to use telemetry to print out the xstats. However, I do not see any errors.
--> /ethdev/xstats,1
{"/ethdev/xstats": {"rx_good_packets": 97727, "tx_good_packets": 157902622, "rx_good_bytes": 6459916, "tx_good_bytes": 229590348448, "rx_missed_errors": 0, "rx_errors": 0, "tx_errors": 0, "rx_mbuf_allocation_errors": 0, "rx_unicast_packets": 95827, "rx_multicast_packets": 1901, "rx_broadcast_packets": 0, "rx_dropped_packets": 0, "rx_unknown_protocol_packets": 97728, "rx_size_error_packets": 0, "tx_unicast_packets": 157902621, "tx_multicast_packets": 0, "tx_broadcast_packets": 1, "tx_dropped_packets": 0, "tx_link_down_dropped": 0, "rx_crc_errors": 0, "rx_illegal_byte_errors": 0, "rx_error_bytes": 0, "mac_local_errors": 0, "mac_remote_errors": 0, "rx_length_errors": 0, "tx_xon_packets": 0, "rx_xon_packets": 0, "tx_xoff_packets": 0, "rx_xoff_packets": 0, "rx_size_64_packets": 967, "rx_size_65_to_127_packets": 96697, "rx_size_128_to_255_packets": 0, "rx_size_256_to_511_packets": 64, "rx_size_512_to_1023_packets": 0, "rx_size_1024_to_1522_packets": 0, "rx_size_1523_to_max_packets": 0, "rx_undersized_errors": 0, "rx_oversize_errors": 0, "rx_mac_short_dropped": 0, "rx_fragmented_errors": 0, "rx_jabber_errors": 0, "tx_size_64_packets": 0, "tx_size_65_to_127_packets": 46, "tx_size_128_to_255_packets": 0, "tx_size_256_to_511_packets": 0, "tx_size_512_to_1023_packets": 0, "tx_size_1024_to_1522_packets": 157902576, "tx_size_1523_to_max_packets": 0, "rx_flow_director_atr_match_packets": 0, "rx_flow_director_sb_match_packets": 13, "tx_low_power_idle_status": 0, "rx_low_power_idle_status": 0, "tx_low_power_idle_count": 0, "rx_low_power_idle_count": 0, "rx_priority0_xon_packets": 0, "rx_priority1_xon_packets": 0, "rx_priority2_xon_packets": 0, "rx_priority3_xon_packets": 0, "rx_priority4_xon_packets": 0, "rx_priority5_xon_packets": 0, "rx_priority6_xon_packets": 0, "rx_priority7_xon_packets": 0, "rx_priority0_xoff_packets": 0, "rx_priority1_xoff_packets": 0, "rx_priority2_xoff_packets": 0, "rx_priority3_xoff_packets": 0, "rx_priority4_xoff_packets": 0, "rx_priority5_xoff_packets": 0, "rx_priority6_xoff_packets": 0, "rx_priority7_xoff_packets": 0, "tx_priority0_xon_packets": 0, "tx_priority1_xon_packets": 0, "tx_priority2_xon_packets": 0, "tx_priority3_xon_packets": 0, "tx_priority4_xon_packets": 0, "tx_priority5_xon_packets": 0, "tx_priority6_xon_packets": 0, "tx_priority7_xon_packets": 0, "tx_priority0_xoff_packets": 0, "tx_priority1_xoff_packets": 0, "tx_priority2_xoff_packets": 0, "tx_priority3_xoff_packets": 0, "tx_priority4_xoff_packets": 0, "tx_priority5_xoff_packets": 0, "tx_priority6_xoff_packets": 0, "tx_priority7_xoff_packets": 0, "tx_priority0_xon_to_xoff_packets": 0, "tx_priority1_xon_to_xoff_packets": 0, "tx_priority2_xon_to_xoff_packets": 0, "tx_priority3_xon_to_xoff_packets": 0, "tx_priority4_xon_to_xoff_packets": 0, "tx_priority5_xon_to_xoff_packets": 0, "tx_priority6_xon_to_xoff_packets": 0, "tx_priority7_xon_to_xoff_packets": 0}}
I have RX-DESC = 128 and TX-DESC = 512 configured.
I am assuming there is some descriptor leak. Is there a way to know if the drop is due to no descriptors being available? Which counter should I check for that?
[More Info]
Debugging refcnt led to a dead end.
Following the code, it seems that the NIC card does not set the DONE status on the descriptor.
When rte_eth_tx_burst is called, it internally invokes i40e_xmit_pkts, which in turn calls i40e_xmit_cleanup.
When the issue occurs, the following condition fails, and the NIC stops sending packets out.
if ((txd[desc_to_clean_to].cmd_type_offset_bsz &
        rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
        rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
    PMD_TX_LOG(DEBUG, "TX descriptor %4u is not done "
               "(port=%d queue=%d)", desc_to_clean_to,
               txq->port_id, txq->queue_id);
    return -1;
}
If I comment out the "return -1" (of course not the fix, and it will lead to other issues), I can see that traffic stays stable for a very long time.
I tracked all the mbufs from the start of traffic until the issue is hit; there is no problem, at least in the mbufs that I could see.
I40E_TX_DESC_DTYPE_DESC_DONE is set by the hardware on the descriptor. Is there any way I can see that code? Is it part of the X710 driver code?
I still doubt my own code, since the issue is present even after the NIC card was replaced.
However, how could my code prevent the NIC from setting the DONE status on the descriptor?
Any suggestions would really be helpful.
[UPDATE]
Found out that 2 cores were using the same TX queue ID to send packets:
- the data processing and TX core
- ARP request/response handling by the data RX core
Did this lead to some potential corruption?
Found some info on this:
http://mails.dpdk.org/archives/dev/2014-January/001077.html
After creating a separate queue for ARP messages, the issue has not been seen anymore (yet) for 2+ hours.
[EDIT-2] The error is narrowed down to multiple threads using the same port ID-queue ID pair, which causes the NIC to stall on transmit. Earlier the debugging was not focused on the slow path (ARP reply), hence this was missed.
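For reference, the separation now looks roughly like the sketch below (queue IDs and function names are illustrative, not the actual application code); each (port, queue) pair is used from exactly one lcore, since tx_burst on a given queue is not thread-safe:

#include <rte_ethdev.h>

/* assumes rte_eth_dev_configure() requested at least 2 TX queues and both
 * were set up with rte_eth_tx_queue_setup() */
#define DATA_TX_QUEUE_ID 0   /* used only by the data-processing/TX lcore */
#define ARP_TX_QUEUE_ID  1   /* used only by the RX lcore for ARP replies */

/* fast path, data TX lcore */
static inline uint16_t
send_data_burst(uint16_t port_id, struct rte_mbuf **pkts, uint16_t nb_pkts)
{
    return rte_eth_tx_burst(port_id, DATA_TX_QUEUE_ID, pkts, nb_pkts);
}

/* slow path, RX lcore sending an ARP reply */
static inline uint16_t
send_arp_reply(uint16_t port_id, struct rte_mbuf *arp_mbuf)
{
    return rte_eth_tx_burst(port_id, ARP_TX_QUEUE_ID, &arp_mbuf, 1);
}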
[Edit-1] Based on the limited debug opportunities and the updates from the discussion, the findings are:
- The internal TX code updates refcnt by 2 (that is, refcnt is 3).
- Once the reply is received, the refcnt is decremented by 2.
- Corner cases are now addressed for mbuf free.
- Tested on RHEL and CentOS; both have the issue, hence it is the software and not the OS.
- Updated the NIC firmware; now all platforms consistently show the error after a couple of hours of running.
Note:
- All pointers lead to gaps in the code and in corner-case handling, since testpmd, l2fwd and l3fwd do not show the error with the DPDK library or the platform.
- Since the code base is not shared, the only option is to rely on updates.
- Hence, after extensive debugging and analysis, the root cause of the issue is not DPDK, the NIC or the platform, but a gap in the code being used.
If the code's intent is to retry up to MAX_RETRY_COUNT_RTE_ETH_TX_BURST times until all pkt_count packets are sent, the current code snippet needs a few corrections. Let me explain:
- mbuf is the array of valid packets to be transmitted.
- mbuf_idx is the current index into that array for TX.
- pkt_count is the number of packets still to be sent in the current attempt.
- num_sent_pkt is the number of packets actually handed to the NIC for DMA (physical transmission).
- retry_count is the local variable keeping count of the retries.
There are 2 corner cases to be taken care of (not handled in the current snippet):
- If MAX_RETRY_COUNT_RTE_ETH_TX_BURST is exceeded and the number of packets actually transmitted is less than intended, at the end of the while loop one needs to free the non-transmitted mbufs.
- If there are any mbufs with refcnt greater than 1 (especially with multicast, broadcast or packet duplication), one needs a mechanism to free those too (a sketch for this case follows the snippet below).
A possible code snippet could be:
#define MAX_RETRY_COUNT_RTE_ETH_TX_BURST 3

uint16_t retry_count = 0;
uint16_t mbuf_idx = 0;
uint16_t num_sent_pkt = 0;
uint16_t pkt_count = try_sent; /* try_sent: intended number of packets to send */

/* if there are any mbufs with ref_cnt > 1, separate logic is needed to handle those */
do {
    num_sent_pkt = rte_eth_tx_burst(eth_port_id, queue_id, &mbuf[mbuf_idx], pkt_count);
    pkt_count -= num_sent_pkt;
    mbuf_idx += num_sent_pkt;
    retry_count++;
} while (pkt_count && (retry_count < MAX_RETRY_COUNT_RTE_ETH_TX_BURST));

/* free whatever was not transmitted to prevent an mbuf leak */
if (pkt_count) {
    rte_pktmbuf_free_bulk(&mbuf[mbuf_idx], pkt_count);
}
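For the second corner case, a minimal sketch of one possible approach (the helper name and the assumption that the application itself took the extra reference, e.g. via rte_pktmbuf_refcnt_update(), are illustrative): rte_pktmbuf_free() only returns an mbuf to its pool once the reference count reaches zero, so every extra reference must eventually be matched by its own free.

#include <rte_mbuf.h>

/* illustrative only: drop the extra reference the application took earlier
 * for this mbuf (e.g. to keep it around for a possible retransmit) */
static void app_release_extra_ref(struct rte_mbuf *m)
{
    if (rte_pktmbuf_refcnt_read(m) > 1)
        rte_pktmbuf_free(m);   /* decrements refcnt; freed to the pool at zero */
}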
Note: the easiest way to identify an mbuf leak is to run the DPDK proc-info tool as a secondary process and check the mbuf/mempool free counts.
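If attaching a secondary process is not convenient, the same check can also be done from inside the application; a small sketch (the pool pointer name mbuf_pool is assumed):

#include <stdio.h>
#include <rte_mempool.h>

/* log how many mbufs are currently taken from / left in the TX pool */
static void log_mbuf_pool_usage(const struct rte_mempool *mbuf_pool)
{
    printf("mbufs in use: %u, mbufs free: %u\n",
           rte_mempool_in_use_count(mbuf_pool),
           rte_mempool_avail_count(mbuf_pool));
}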
[EDIT-1] Based on the debugging, it has been identified that the refcnt is indeed greater than 1. Accumulating such corner cases leads to mempool depletion.
logs:
dump mbuf at 0x2b67803c0, iova=0x2b6780440, buf_len=9344
pkt_len=1454, ol_flags=0x180, nb_segs=1, port=0, ptype=0x291
segment at 0x2b67803c0, data=0x2b67804b8, len=1454, off=120, refcnt=3
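For reference, a dump like the one above can be produced with rte_pktmbuf_dump(); a minimal sketch (the wrapper function is just illustrative):

#include <stdio.h>
#include <rte_mbuf.h>

/* print the header fields and the first pkt_len bytes of a suspect mbuf */
static void dump_suspect_mbuf(const struct rte_mbuf *m)
{
    rte_pktmbuf_dump(stdout, m, rte_pktmbuf_pkt_len(m));
}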
I'm stepping through some code in a third party library that our executable is linked to, specifically the "shutdown" code. I'm sending a SIGQUIT to our application, which shuts down the third party objects.
For some reason, a call that the library makes to pthread_mutex_destroy reliably fails and returns 16: EBUSY. The documentation says this occurs when "the implementation has detected an attempt to destroy the object referenced by mutex while it is locked or referenced (for example, while being used in a pthread_cond_timedwait() or pthread_cond_wait()) by another thread."
I've put a breakpoint right where the pthread_mutex_destroy() gets called.
a) I don't believe it is locked, since the mutex's state looks like this:
$6 = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 4294967293, __kind = 0, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' "\375\377\377\377", '\000' , __align = 0}
And my guess is that __lock = 0 means "unlocked". However, I don't know what __nusers really represents.
b) I don't see any evidence of pthread_cond_wait() or pthread_cond_timedwait(). I got backtraces of all threads running and none were waiting on this mutex.
What could be going on here?
Obviously, your problem is with the __nusers member. I would presume you unlocked an already-unlocked mutex somewhere.
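To illustrate, a minimal sketch consistent with that guess (POSIX leaves unlocking an unlocked default mutex undefined, so this only shows what glibc happens to do):

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    /* Undefined behaviour per POSIX: unlocking a mutex that is not locked.
     * On glibc, each of these calls decrements the internal __nusers field,
     * which underflows to a huge unsigned value (e.g. 4294967293 after three
     * extra unlocks, matching the dump in the question). */
    pthread_mutex_unlock(&m);
    pthread_mutex_unlock(&m);
    pthread_mutex_unlock(&m);

    /* glibc sees __nusers != 0 and considers the mutex still referenced */
    int rc = pthread_mutex_destroy(&m);
    printf("pthread_mutex_destroy returned %d (EBUSY is %d)\n", rc, EBUSY);
    return 0;
}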
I wrote a super simple wrapper for a pthread_mutex_t meant to be used between two processes:
//basic version just to test using it between two processes
struct MyLock
{
public:
    MyLock() {
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
        pthread_mutex_init(&lock, &attr);
    }

    ~MyLock() {
        pthread_mutex_destroy(&lock);
        pthread_mutexattr_destroy(&attr);
    }

    void lock() {
        pthread_mutex_lock(&lock);
    }

    void unlock() {
        pthread_mutex_unlock(&lock);
    }

private:
    pthread_mutexattr_t attr;
    pthread_mutex_t lock;
};
I am able to see this lock work fine between regular threads in a process but when I run process A which does the following in a shared memory region:
void* mem; //some shared memory from shm_open
MyLock* myLock = new(mem) MyLock;
//loop sleeping random amounts and calling ->lock and ->unlock
Then process B opens the shared memory object (I verified, by filling it with patterns of characters, that it is the same region of memory) and does this:
MyLock* myLock = reinterpret_cast<MyLock*>(mem);
//same loop for locking and unlocking as process A
but process B segfaults when trying to lock, with the backtrace leading to pthread_mutex_lock() in libpthread.so.0.
What am I doing wrong?
The backtrace I get from process B looks like this:
in pthread_mutex_lock () from /lib64/libpthread.so.0
in MyLock::lock at MyLock.H:50
in Server::setUpSharedMemory at Server.C:59
in Server::Server at Server.C
in main.C:52
The call was the very first call to lock after reinterpret casting the memory into a MyLock*. If I dump the contents of MyLock in gdb in the crashing process I see:
{
attr = {
__size = "\003\000\000\200",
__align = -2147483645
},
lock = {
__data = {
__lock = 1,
__count = 0,
__owner = 6742, //this is the lightweight process id of a thread in process A
__nusers = 1,
__kind = 131,
__spins = 0,
__list = {
__prev = 0x0,
__next = 0x0
}
},
__size = "\001\000\000\000\000 //etc,
__align = 1
}
}
So it looks all right (it looks like this in the other process's gdb as well). I am compiling both applications the same way, with no additional optimization flags either.
You didn't post the code that opens and initializes the shared memory region, but I suspect that part might be responsible for your problem.
Because pthread_mutex_t is much larger than a "combination of characters", you should check your shm_open(3)-ftruncate(2)-mmap(2) sequence by reading and writing a longer (~1 KB) string.
Don't forget to check that both endpoints can really write to the shm region and that the written data is really visible to the other side.
Process A: [open and initialize the shm]-[write AAA...AA]-[sleep 5 sec]-[read BBB...BB]-[close the shm]
Process B: (a second or two later) [open the shm]-[read AAA...AA]-[write BBB...BB]-[close the shm]
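For example, a minimal sketch of the owner-side setup I have in mind (the shm name, size handling and error handling are illustrative, not taken from your code); the reader side would shm_open() the same name without O_CREAT, mmap() it, and reinterpret_cast the pointer without constructing a second MyLock:

#include <fcntl.h>      // O_CREAT, O_RDWR
#include <sys/mman.h>   // shm_open, mmap
#include <unistd.h>     // ftruncate
#include <new>          // placement new

MyLock* create_shared_lock()
{
    int fd = shm_open("/mylock_region", O_CREAT | O_RDWR, 0666);
    if (fd < 0)
        return nullptr;
    if (ftruncate(fd, sizeof(MyLock)) != 0)      // size the region BEFORE mapping
        return nullptr;
    void* mem = mmap(nullptr, sizeof(MyLock), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED)
        return nullptr;
    return new (mem) MyLock;                     // construct exactly once, in the owner
}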
I had a similar issue where the writer process runs as root and the reader processes are regular users (the case of a hardware daemon).
This would segfault in the readers as soon as any pthread_mutex_lock() or pthread_cond_wait(), or their unlock counterparts, were called.
I solved it by modifying the SHM file permissions using an appropriate umask:
Writer
umask(0); /* do not mask out any of the mode bits requested below */
FD=shm_open("the_SHM_file", O_CREAT|O_TRUNC|O_RDWR, S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH);
ftruncate(FD, 28672);
SHM=mmap(0, 28672, PROT_READ|PROT_WRITE, MAP_SHARED, FD, 0);
Readers
FD=shm_open("the_SHM_file", O_RDWR, S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH);
SHM=mmap(0, 28672, PROT_READ|PROT_WRITE, MAP_SHARED, FD, 0);
You don't say what OS you are using, but you don't check the return value of the pthread_mutexattr_setpshared call. It's possible your OS does not support shared mutexes and this call is failing.
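For illustration, a sketch of the constructor with that check added (the fprintf/abort error handling is just an example, not a recommendation for production code; it needs <cstdio> and <cstdlib>):

MyLock() {
    pthread_mutexattr_init(&attr);
    // Fail loudly instead of silently initialising a process-private mutex
    // when PTHREAD_PROCESS_SHARED is not supported on this platform.
    int rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_mutexattr_setpshared failed: %d\n", rc);
        std::abort();
    }
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    pthread_mutex_init(&lock, &attr);
}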
I'm trying to set the affinity of my thread to a certain mask each time I run a thread by pressing a button. It will work the first time I do it after opening the window, but not after that. However, my OutputDebugString code produces output that suggests it has been changed. I've tried using CloseHandle() but that didn't seem to have an effect. Is there something else it could be?
void CSMPDemoDlg::OnBnClickedButton1()
{
    // Start thread
    DWORD_PTR affinityMask = (static_cast<DWORD_PTR>(1) << NumberOfCores) - 1;
    HANDLE WorkThreadHandle = CreateThread(NULL, 0, WorkThread, &tp, 0, NULL);
    DWORD_PTR z = SetThreadAffinityMask(WorkThreadHandle, affinityMask);
    if (z != 0) {
        char bb[100];
        sprintf_s(bb, 100, "Affinity changed from %d to %d", z, affinityMask);
        OutputDebugString(bb);
    }
}
So, you want something like this:
static int count = 0;
DWORD_PTR affinityMask = (static_cast<DWORD_PTR>(1) << NumberOfCores) - 1;
affinityMask <<= ((count++ * NumberOfCores) % totalCores);
That means that it will run on the next set of cores in the group, so if you run on, say, 4 cores, the first time it will run on cores 0..3, then 4..7, then 8..11.
It does assume that totalCores is a multiple of NumberOfCores, so if you have 16 cores and NumberOfCores = 3, you'll get weird results.
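Putting it together with your handler, something like the sketch below (names such as NumberOfCores, totalCores, WorkThread and tp are assumed from your code; the handle is closed once we no longer need it, which does not stop the thread):

void CSMPDemoDlg::OnBnClickedButton1()
{
    static int runCount = 0;   // how many times the button has been pressed

    DWORD_PTR affinityMask = (static_cast<DWORD_PTR>(1) << NumberOfCores) - 1;
    affinityMask <<= ((runCount * NumberOfCores) % totalCores);   // rotate to the next core group
    ++runCount;

    HANDLE workThread = CreateThread(NULL, 0, WorkThread, &tp, 0, NULL);
    if (workThread != NULL) {
        DWORD_PTR previous = SetThreadAffinityMask(workThread, affinityMask);
        if (previous != 0) {
            char bb[100];
            sprintf_s(bb, 100, "Affinity changed from %llu to %llu",
                      static_cast<unsigned long long>(previous),
                      static_cast<unsigned long long>(affinityMask));
            OutputDebugStringA(bb);
        }
        CloseHandle(workThread);   // releases our handle only; the thread keeps running
    }
}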
I have a problem where select() does not time out when I run the program from inside a Bash script. This is my implementation:
#include <sys/select.h>
bool checkKeyPressed()
{
    struct timeval tv;
    tv.tv_sec = 1;
    tv.tv_usec = 0;

    fd_set descriptor;
    const int input = 0;
    FD_ZERO(&descriptor);
    FD_SET(input, &descriptor);

    return select(1, &descriptor, NULL, NULL, &tv) > 0;
}
// strace result after running the program directly (correct that there is a timeout)
select(1, [0], NULL, NULL, {1, 0}) = 0 (Timeout)
// strace result to run the application inside a bash script file (no timeout)
select(1, [0], NULL, NULL, {1, 0}) = 1 (in [0], left {0, 999996})
read(0, "", 1) = 0
How can I change the function so that it also works when running under the Bash script?
If you look closer at the read call in the trace, you will notice it returns zero, meaning end-of-file.
When a file descriptor is at EOF (or the remote socket is closed, etc.), the descriptor is reported readable, with read returning zero.
If you had pressed Ctrl+D in the interactive shell, you would have gotten the same result.
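For example, one way (a sketch, not the only possible fix) is to treat a zero-byte read as end-of-file rather than as a key press:

#include <sys/select.h>
#include <unistd.h>

bool checkKeyPressed()
{
    struct timeval tv = { 1, 0 };          /* 1-second timeout, as before */
    fd_set descriptor;
    const int input = 0;

    FD_ZERO(&descriptor);
    FD_SET(input, &descriptor);

    if (select(input + 1, &descriptor, NULL, NULL, &tv) <= 0)
        return false;                       /* timeout or error: no key */

    char c;
    return read(input, &c, 1) > 0;          /* 0 bytes read means EOF, not a key */
}

Note that once stdin is at EOF this variant returns immediately on every call, so the caller may want to stop polling the keyboard after the first EOF.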
If you just need a 1-second timeout, don't pass any file descriptors to select(). In that case select() works as a portable sleep() function.
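For example (sketch):

#include <sys/select.h>

/* portable one-second pause: no descriptors, only the timeout */
static void sleepOneSecond(void)
{
    struct timeval tv = { 1, 0 };
    select(0, NULL, NULL, NULL, &tv);
}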