I am using DPDK 21.11 for my application. After a certain time, the API rte_eth_tx_burst stops sending any packets out.
NIC: Ethernet Controller X710 for 10GbE SFP+ (device ID 1572)
Driver: vfio-pci
#define MAX_RETRY_COUNT_RTE_ETH_TX_BURST 3
do
{
    num_sent_pkt = rte_eth_tx_burst(eth_port_id, queue_id, &mbuf[mbuf_idx], pkt_count);
    pkt_count -= num_sent_pkt;
    retry_count++;
} while (pkt_count && (retry_count != MAX_RETRY_COUNT_RTE_ETH_TX_BURST));
To debug, I tried using telemetry to print out the xstats. However, I do not see any errors.
--> /ethdev/xstats,1
{"/ethdev/xstats": {"rx_good_packets": 97727, "tx_good_packets": 157902622, "rx_good_bytes": 6459916, "tx_good_bytes": 229590348448, "rx_missed_errors": 0, "rx_errors": 0, "tx_errors": 0, "rx_mbuf_allocation_errors": 0, "rx_unicast_packets": 95827, "rx_multicast_packets": 1901, "rx_broadcast_packets": 0, "rx_dropped_packets": 0, "rx_unknown_protocol_packets": 97728, "rx_size_error_packets": 0, "tx_unicast_packets": 157902621, "tx_multicast_packets": 0, "tx_broadcast_packets": 1, "tx_dropped_packets": 0, "tx_link_down_dropped": 0, "rx_crc_errors": 0, "rx_illegal_byte_errors": 0, "rx_error_bytes": 0, "mac_local_errors": 0, "mac_remote_errors": 0, "rx_length_errors": 0, "tx_xon_packets": 0, "rx_xon_packets": 0, "tx_xoff_packets": 0, "rx_xoff_packets": 0, "rx_size_64_packets": 967, "rx_size_65_to_127_packets": 96697, "rx_size_128_to_255_packets": 0, "rx_size_256_to_511_packets": 64, "rx_size_512_to_1023_packets": 0, "rx_size_1024_to_1522_packets": 0, "rx_size_1523_to_max_packets": 0, "rx_undersized_errors": 0, "rx_oversize_errors": 0, "rx_mac_short_dropped": 0, "rx_fragmented_errors": 0, "rx_jabber_errors": 0, "tx_size_64_packets": 0, "tx_size_65_to_127_packets": 46, "tx_size_128_to_255_packets": 0, "tx_size_256_to_511_packets": 0, "tx_size_512_to_1023_packets": 0, "tx_size_1024_to_1522_packets": 157902576, "tx_size_1523_to_max_packets": 0, "rx_flow_director_atr_match_packets": 0, "rx_flow_director_sb_match_packets": 13, "tx_low_power_idle_status": 0, "rx_low_power_idle_status": 0, "tx_low_power_idle_count": 0, "rx_low_power_idle_count": 0, "rx_priority0_xon_packets": 0, "rx_priority1_xon_packets": 0, "rx_priority2_xon_packets": 0, "rx_priority3_xon_packets": 0, "rx_priority4_xon_packets": 0, "rx_priority5_xon_packets": 0, "rx_priority6_xon_packets": 0, "rx_priority7_xon_packets": 0, "rx_priority0_xoff_packets": 0, "rx_priority1_xoff_packets": 0, "rx_priority2_xoff_packets": 0, "rx_priority3_xoff_packets": 0, "rx_priority4_xoff_packets": 0, "rx_priority5_xoff_packets": 0, "rx_priority6_xoff_packets": 0, "rx_priority7_xoff_packets": 0, "tx_priority0_xon_packets": 0, "tx_priority1_xon_packets": 0, "tx_priority2_xon_packets": 0, "tx_priority3_xon_packets": 0, "tx_priority4_xon_packets": 0, "tx_priority5_xon_packets": 0, "tx_priority6_xon_packets": 0, "tx_priority7_xon_packets": 0, "tx_priority0_xoff_packets": 0, "tx_priority1_xoff_packets": 0, "tx_priority2_xoff_packets": 0, "tx_priority3_xoff_packets": 0, "tx_priority4_xoff_packets": 0, "tx_priority5_xoff_packets": 0, "tx_priority6_xoff_packets": 0, "tx_priority7_xoff_packets": 0, "tx_priority0_xon_to_xoff_packets": 0, "tx_priority1_xon_to_xoff_packets": 0, "tx_priority2_xon_to_xoff_packets": 0, "tx_priority3_xon_to_xoff_packets": 0, "tx_priority4_xon_to_xoff_packets": 0, "tx_priority5_xon_to_xoff_packets": 0, "tx_priority6_xon_to_xoff_packets": 0, "tx_priority7_xon_to_xoff_packets": 0}}
I have RX-DESC = 128 and TX-DESC = 512 configured.
I am assuming there is some descriptor leak. Is there a way to know whether the drops are due to no descriptors being available, and which counter should I check for that?
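For reference (an addition, not from the original post), DPDK also lets you query the state of a TX descriptor directly via rte_eth_tx_descriptor_status(). A minimal sketch, assuming the same eth_port_id/queue_id as in the snippet above and a configured ring size of nb_tx_desc:

#include <stdio.h>
#include <rte_ethdev.h>

/* Hedged sketch: probe a TX descriptor partway around the ring to see
 * whether the hardware is still reclaiming descriptors. */
static void check_tx_ring(uint16_t eth_port_id, uint16_t queue_id, uint16_t nb_tx_desc)
{
    int status = rte_eth_tx_descriptor_status(eth_port_id, queue_id, nb_tx_desc / 2);

    if (status == RTE_ETH_TX_DESC_FULL)
        printf("descriptor still in use by hardware: the TX ring is backing up\n");
    else if (status == RTE_ETH_TX_DESC_DONE)
        printf("descriptor done: the TX ring is being reclaimed\n");
    else if (status == RTE_ETH_TX_DESC_UNAVAIL)
        printf("descriptor unavailable/reserved by the driver\n");
}

If the ring keeps reporting FULL while tx_good_packets stops increasing, the transmit path is stalled rather than dropping.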
[More Info]
Debugging refcnt led to a dead end.
Following the code, it seems that the NIC does not set the DONE status on the descriptor.
When rte_eth_tx_burst is called, the driver's burst function internally calls i40e_xmit_pkts -> i40e_xmit_cleanup.
When the issue occurs, the following condition fails, which is why the NIC stops sending packets out:
if ((txd[desc_to_clean_to].cmd_type_offset_bsz &
     rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
    rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE)) {
        PMD_TX_LOG(DEBUG, "TX descriptor %4u is not done "
                   "(port=%d queue=%d)", desc_to_clean_to,
                   txq->port_id, txq->queue_id);
        return -1;
}
If I comment out the "return -1" (of course not the fix, and it will lead to other issues), I can see that traffic is stable for a very long time.
I tracked all the mbufs from the start of traffic until the issue is hit; at least in the mbufs I could inspect, there is no problem visible.
I40E_TX_DESC_DTYPE_DESC_DONE is supposed to be set by the hardware on the descriptor. Is there any way I can see that code? Is it part of the X710 driver code?
I still suspect my own code, since the issue is present even after the NIC was replaced.
However, how can my code cause the NIC to stop setting the DONE status on the descriptor?
Any suggestions would really be helpful.
[UPDATE]
Found out that two cores were using the same TX queue ID to send packets:
the data processing and TX core
ARP request/response handling by the data RX core
This led to some potential corruption?
Found some info on this:
http://mails.dpdk.org/archives/dev/2014-January/001077.html
After creating a separate queue for ARP messages, the issue has not been seen again (yet) for 2+ hours.
[EDIT-2] The error is narrowed down to multiple threads using the same port-id/queue-id pair, which stalls transmission on the NIC. Earlier debugging was not focused on the slow path (ARP reply), hence this was missed.
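For context (an addition, not from the original post): DPDK TX queues are not thread-safe, so each (port_id, queue_id) pair must only ever be used by one thread at a time. A minimal sketch of the usual arrangement, assuming the port was configured with at least one TX queue per worker lcore (names here are illustrative):

#include <rte_ethdev.h>
#include <rte_lcore.h>

/* Each worker lcore owns its own TX queue, so no two threads ever share a
 * (port_id, queue_id) pair and no locking is needed on the TX path. */
static inline uint16_t
send_burst_on_own_queue(uint16_t port_id, struct rte_mbuf **pkts, uint16_t n)
{
    uint16_t queue_id = (uint16_t)rte_lcore_id(); /* or a per-lcore mapping table */

    return rte_eth_tx_burst(port_id, queue_id, pkts, n);
}

Slow-path traffic such as ARP replies then either gets its own dedicated queue (as done in the update above) or is handed to the owning core, for example over a ring.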
[Edit-1] Based on the limited debug opportunities and the updates from the comments, the findings are:
The internal TX code increments refcnt by 2 (that is, refcnt becomes 3).
Once the reply is received, refcnt is decremented by 2.
Corner cases are now addressed for mbuf_free.
Tested on both RHEL and CentOS; both show the issue, hence it is the software and not the OS.
Updated the NIC firmware; now all platforms consistently show the error after a couple of hours of running.
Note:
Since testpmd, l2fwd and l3fwd do not show the error, all pointers lead to gaps in the application code and its corner-case handling rather than to the DPDK library or the platform.
Since the code base is not shared, the only option is to rely on updates.
Hence, after extensive debugging and analysis, the root cause of the issue is not DPDK, the NIC or the platform, but a gap in the code being used.
If the code's intent is to try to send all pkt_count packets within MAX_RETRY_COUNT_RTE_ETH_TX_BURST attempts, the current code snippet needs a few corrections. Let me explain:
mbuf is the array of valid packets to be transmitted.
mbuf_idx is the index of the next packet to be sent.
pkt_count is the number of packets still to be sent in the current attempt.
num_sent_pkt is the number of packets actually handed to the NIC for DMA.
retry_count is a local variable counting the retries.
There are 2 corner cases to take care of (not shown in the current snippet):
If MAX_RETRY_COUNT_RTE_ETH_TX_BURST is exhausted and not all packets were transmitted, the non-transmitted mbufs must be freed after the while loop.
If there are any mbufs with ref_cnt greater than 1 (especially with multicast, broadcast or packet duplication), a mechanism is needed to free those too.
A possible code snippet could be:
#define MAX_RETRY_COUNT_RTE_ETH_TX_BURST 3

retry_count = 0;
mbuf_idx = 0;
pkt_count = try_sent; /* try_sent: intended number of packets to send */

/* if there are any mbuf with ref_cnt > 1, separate logic is needed to handle those */
do {
    num_sent_pkt = rte_eth_tx_burst(eth_port_id, queue_id, &mbuf[mbuf_idx], pkt_count);
    pkt_count -= num_sent_pkt;
    mbuf_idx += num_sent_pkt;
    retry_count++;
} while (pkt_count && (retry_count < MAX_RETRY_COUNT_RTE_ETH_TX_BURST));

/* free the unsent packets to prevent an mbuf leak */
if (pkt_count) {
    rte_pktmbuf_free_bulk(&mbuf[mbuf_idx], pkt_count);
}
Note: the easiest way to identify an mbuf leak is to run the DPDK secondary process dpdk-proc-info and check the mbuf free count.
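As a complement to dpdk-proc-info (an addition, not from the original answer), pool usage can also be logged from inside the application. A minimal sketch, assuming mbuf_pool is the pool the TX mbufs are allocated from (illustrative name):

#include <stdio.h>
#include <rte_mempool.h>

/* Hedged sketch: with steady traffic, a continuously growing in_use count
 * points to mbufs that are never returned to the pool. */
static void log_mbuf_pool_usage(struct rte_mempool *mbuf_pool)
{
    unsigned int avail  = rte_mempool_avail_count(mbuf_pool);
    unsigned int in_use = rte_mempool_in_use_count(mbuf_pool);

    printf("mbuf pool %s: avail=%u in_use=%u\n", mbuf_pool->name, avail, in_use);
}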
[EDIT-1] Based on the debugging, it has been identified that refcnt is indeed greater than 1. Accumulating such corner cases leads to mempool depletion.
logs:
dump mbuf at 0x2b67803c0, iova=0x2b6780440, buf_len=9344
pkt_len=1454, ol_flags=0x180, nb_segs=1, port=0, ptype=0x291
segment at 0x2b67803c0, data=0x2b67804b8, len=1454, off=120, refcnt=3
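For reference (an addition, not from the original answer), a per-mbuf dump like the one above can be produced with rte_pktmbuf_dump(), and the reference count can be read directly. A minimal sketch, where m is whatever mbuf is being inspected:

#include <stdio.h>
#include <rte_mbuf.h>

static void inspect_mbuf(struct rte_mbuf *m)
{
    /* Prints the same kind of header/segment summary as the log above. */
    rte_pktmbuf_dump(stdout, m, 0);

    if (rte_mbuf_refcnt_read(m) > 1)
        printf("mbuf %p is still referenced elsewhere (refcnt=%u)\n",
               (void *)m, (unsigned int)rte_mbuf_refcnt_read(m));
}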
I am implementing some basic communication over serial port.
According to the protocol I must respond within 225 ms of receiving the request. The maximum package size is 256 B.
So, when I receive a request I build the response [header, length, payload, crc16] (256 B total), which takes 30-40 ms on average. The actual problem occurs when I pass that response (a byte array) to asio::async_write: it takes around 240 ms on average to complete.
Everything works fine except when I send maximum-length packages: 240 ms (asio::async_write) + 40 ms (package assembly) comes to around 280-300 ms.
Port settings: 9600 baud, 8 data bits, one stop bit.
Any idea how can I speed it up?
void Example::do_write()
{
    if (pimpl_->WriteBuffer == nullptr)
    {
        boost::lock_guard<boost::mutex> l(pimpl_->WriteQueueMutex);
        pimpl_->WriteBufferSize = pimpl_->WriteQueue.size();
        pimpl_->WriteBuffer.reset(new byte[pimpl_->WriteQueue.size()]);
        std::move(pimpl_->WriteQueue.begin(), pimpl_->WriteQueue.end(), pimpl_->WriteBuffer.get());
        pimpl_->WriteQueue.clear();
        begin = boost::chrono::steady_clock::now();
        async_write(pimpl_->Port, asio::buffer(pimpl_->WriteBuffer.get(), pimpl_->WriteBufferSize), boost::bind(&Example::write_end, this, asio::placeholders::error));
    }
}

void Example::write_end(const system::error_code& error)
{
    if (!error)
    {
        boost::lock_guard<boost::mutex> l(pimpl_->WriteQueueMutex);
        if (pimpl_->WriteQueue.empty())
        {
            pimpl_->WriteBuffer.reset();
            pimpl_->WriteBufferSize = 0;
            end = boost::chrono::steady_clock::now();
            OutputDebugString(string("\nWRITE TIME: " + to_string(boost::chrono::duration_cast<boost::chrono::milliseconds>(end - begin).count()) + "\n").c_str());
            return;
        }
        pimpl_->WriteBufferSize = pimpl_->WriteQueue.size();
        pimpl_->WriteBuffer.reset(new byte[pimpl_->WriteQueue.size()]);
        std::move(pimpl_->WriteQueue.begin(), pimpl_->WriteQueue.end(), pimpl_->WriteBuffer.get());
        pimpl_->WriteQueue.clear();
        async_write(pimpl_->Port, asio::buffer(pimpl_->WriteBuffer.get(), pimpl_->WriteBufferSize), boost::bind(&Example::write_end, this, asio::placeholders::error));
    }
    else
    {
        set_error_status(true);
        do_close();
    }
}
In my experience boost::asio itself adds only fractions of a millisecond. You spend 30-40 ms assembling the data, and putting 256 bytes on the wire at 9600 baud with 8N1 framing takes roughly 267 ms, which is essentially the physical minimum; together that accounts for the 280-300 ms you measure.
What to do?
The most profitable change is to increase the baud rate, e.g. to 115200 baud (roughly 1.04 ms/byte at 9600 baud versus 0.087 ms/byte at 115200 baud); see the quick calculation below.
Next, profile where any remaining milliseconds go (likely something unneeded in the message loop you wrote).
Finally, it should also be possible to shave something off the 30-40 ms spent assembling the data.
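A quick calculation of the raw wire time, assuming 8N1 framing (1 start bit + 8 data bits + 1 stop bit = 10 bit times per byte); the figures above come from this:

#include <cstdio>

int main()
{
    const int packet_bytes = 256;

    for (double baud : {9600.0, 115200.0})
    {
        double ms_per_byte = 10.0 / baud * 1000.0; // 10 bit times per byte
        std::printf("%6.0f baud: %.3f ms/byte, %d bytes ~= %.0f ms\n",
                    baud, ms_per_byte, packet_bytes, ms_per_byte * packet_bytes);
    }
}

At 9600 baud the wire time of a full 256-byte reply is already above 225 ms, so if the deadline covers the complete transmission, only a higher baud rate can meet it.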
I have to display 3 images in my application window, with a 10-second delay for each image (i.e. each image should stay on screen for 10 seconds).
How can I do this using OnTimer(), without using Sleep()?
Use ON_WM_TIMER() in the message map and start the timer with:
SetTimer(TIMER_ID, 10000, NULL);
Here TIMER_ID can be any unique ID, and 10000 milliseconds = 10 seconds.
void CYOURDlg::OnTimer(UINT_PTR nIDEvent)
{
    if (nIDEvent == TIMER_ID) // check the timer ID
    {
        // Write your code to show the image here
    }
    CDialog::OnTimer(nIDEvent);
}
OnTimer will then be called every 10 seconds, since that is the delay we have given.
Call KillTimer(TIMER_ID) when you no longer want the timer to run.
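A minimal sketch of cycling through the three images, assuming a member index m_imageIndex and a hypothetical ShowImage() helper that displays the image for a given index (both names are illustrative, not MFC APIs):

static const UINT_PTR TIMER_ID = 1;

void CYOURDlg::StartSlideshow()
{
    m_imageIndex = 0;                  // which of the 3 images is currently shown
    ShowImage(m_imageIndex);           // hypothetical helper that draws the image
    SetTimer(TIMER_ID, 10000, NULL);   // fire every 10 seconds
}

void CYOURDlg::OnTimer(UINT_PTR nIDEvent)
{
    if (nIDEvent == TIMER_ID)
    {
        if (++m_imageIndex < 3)
            ShowImage(m_imageIndex);   // show the next image
        else
            KillTimer(TIMER_ID);       // all 3 images shown, stop the timer
    }
    CDialog::OnTimer(nIDEvent);
}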
I have a strange problem.
void MySocket::OnReceive(int nErrorCode)
{
    static CMutex mutex;
    static int depth = 0;
    static int counter = 0;

    CSingleLock lock(&mutex, true);
    Receive(pBuff, iBuffSize - 1);
    counter++;
    depth++; // <-- Breakpoint
    log("Onreceive: enter %d %d %d", GetCurrentThreadId(), counter, depth);

    // ... code handling the received data ...

    depth--;
    log("Onreceive: exit %d %d %d", GetCurrentThreadId(), counter, depth);
}
Results in this log statement:
02/19/2014 08:33:14:982 [DEBUG] Onreceive Enter: 3200 1 2
02/19/2014 08:34:13:726 [DEBUG] Onreceive Exit : 3200 2 1
02/19/2014 08:32:34:193 [DEBUG] Onreceive Enter: 3200 0 1 <- Log statement was created but interrupted before it was written to disk
02/19/2014 08:34:13:736 [DEBUG] Onreceive Exit : 3200 2 0
Now what happens:
I start the program and the debugger stops at the breakpoint.
I step into the log call.
Somewhere inside the log call the debugger jumps back to the breakpoint.
This is a second entry into OnReceive.
The second call completes.
The first call then continues.
My questions:
How is it possible to get two concurrent calls to OnReceive?
Why does the mutex not work (because both calls have the same thread ID?)
How can I have two executing paths with the same thread ID?
And of course, how can I fix this?
Note that this only happens if I send a lot of small messages (<50 bytes) until Send blocks; in total it is around 500 KB/s. If I put a Sleep(1) after each send it does not happen, but that of course kills my transfer speed.
OK, I found the root cause. Inside the log statement a Win32 mutex is used together with the following wait:
DWORD dwResult = MsgWaitForMultipleObjects(nNoOfHandle, handle, FALSE, dwTimeout,
                                           QS_POSTMESSAGE | QS_ALLPOSTMESSAGE | QS_SENDMESSAGE | QS_TIMER);
if (dwResult == WAIT_OBJECT_0 + nNoOfHandle) // a new message is in the queue, let's clear it
{
    MSG Msg;
    while (PeekMessage(&Msg, NULL, 0, 0, PM_REMOVE))
    {
        ::TranslateMessage(&Msg);
        ::DispatchMessage(&Msg);
    }
}
This waits for the mutex to be released OR for a message to be posted. CSocket posts a message to the thread when it receives data, and dispatching that message calls OnReceive. So this code produced the problem: while waiting for the mutex it would pump incoming messages and effectively call OnReceive again, re-entering it on the same thread.
One way of solving this was to prevent CSocket from posting further read notifications while OnReceive is running:
void MySocket::OnReceive(int nErrorCode)
{
    /* Remove FD_READ notifications */
    VERIFY(AsyncSelect(FD_WRITE | FD_OOB | FD_ACCEPT | FD_CONNECT | FD_CLOSE));

    OldOnReceive(nErrorCode);

    /* Restore default notifications */
    VERIFY(AsyncSelect());
}
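A small variation on the same fix (an addition, not from the original answer), sketched as an RAII guard so the FD_READ mask is restored even on an early return; it assumes the same MySocket/OldOnReceive names as above:

/* Masks FD_READ for the lifetime of the guard, then restores the default
 * CAsyncSocket notification mask in the destructor. */
class ScopedReadMask
{
public:
    explicit ScopedReadMask(CAsyncSocket& socket) : m_socket(socket)
    {
        VERIFY(m_socket.AsyncSelect(FD_WRITE | FD_OOB | FD_ACCEPT | FD_CONNECT | FD_CLOSE));
    }
    ~ScopedReadMask()
    {
        VERIFY(m_socket.AsyncSelect());  // back to the default mask, including FD_READ
    }
private:
    CAsyncSocket& m_socket;
};

void MySocket::OnReceive(int nErrorCode)
{
    ScopedReadMask guard(*this);   // no re-entrant FD_READ while we are inside
    OldOnReceive(nErrorCode);
}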
I have two threads in a producer-consumer pattern. The code works, but after a while the consumer thread gets starved, and then the producer thread gets starved.
When working, program outputs:
Send Data...semValue = 1
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
Then something changes and threads get starved, program outputs:
Send Data...semValue = 1
Send Data...semValue = 2
Send Data...semValue = 3
...
Send Data...semValue = 256
Send Data...semValue = 257
Send Data...semValue = 258
Recv Data...semValue = 257
Recv Data...semValue = 256
Recv Data...semValue = 255
...
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
Send Data...semValue = 1
Recv Data...semValue = 0
I know threads are scheduled by the OS and can run at different rates and in random order. My question: when I call YieldThread() (which calls pthread_yield), shouldn't the Talker give the Listener a chance to run? Why am I getting this bizarre scheduling?
A snippet of the code is below. The Thread and Semaphore classes are abstraction classes. I went ahead and stripped out the queue used for passing data between the threads, so I could eliminate that variable.
const int LOOP_FOREVER = 1;

class Listener : public Thread
{
public:
    Listener(Semaphore* dataReadySemaphorePtr)
        : Thread("Listener"),
          dataReadySemaphorePtr(dataReadySemaphorePtr)
    {
        //Intentionally left blank.
    }

private:
    void ThreadTask(void)
    {
        while(LOOP_FOREVER)
        {
            this->dataReadySemaphorePtr->Wait();
            printf("Recv Data...");
            YieldThread();
        }
    }

    Semaphore* dataReadySemaphorePtr;
};

class Talker : public Thread
{
public:
    Talker(Semaphore* dataReadySemaphorePtr)
        : Thread("Talker"),
          dataReadySemaphorePtr(dataReadySemaphorePtr)
    {
        //Intentionally left blank
    }

private:
    void ThreadTask(void)
    {
        while(LOOP_FOREVER)
        {
            printf("Send Data...");
            this->dataReadySemaphorePtr->Post();
            YieldThread();
        }
    }

    Semaphore* dataReadySemaphorePtr;
};

int main()
{
    Semaphore dataReadySemaphore(0);
    Listener listener(&dataReadySemaphore);
    Talker talker(&dataReadySemaphore);

    listener.StartThread();
    talker.StartThread();

    while (LOOP_FOREVER); //Wait here so threads can run
}
No. Unless you are using a lock to prevent it, even if one thread yields its quantum, there is no requirement that the other thread receives the next quantum.
In a multithreaded environment, you can never ever ever make assumptions about how processor time is going to be scheduled; if you need to enforce correct behavior, use a lock.
Believe it or not, it runs that way because it's more efficient. Every time the processor switches between threads, it performs a context switch that wastes a certain amount of time. My advice is to let it go, unless you have another requirement like a maximum latency or queue size, in which case you need another semaphore for "ready for more data" in addition to your "data ready for listening" one (sketched below).
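A minimal sketch of that two-semaphore arrangement, using standard C++20 semaphores instead of the question's Thread/Semaphore wrappers (names and the bound of 8 are illustrative): a second "slots" semaphore bounds how far the Talker can run ahead of the Listener.

#include <cstdio>
#include <semaphore>
#include <thread>

constexpr int kMaxQueued = 8;                            // maximum producer lead
std::counting_semaphore<kMaxQueued> slots(kMaxQueued);   // free slots
std::counting_semaphore<kMaxQueued> items(0);            // produced, unconsumed items

void talker()
{
    for (int i = 0; i < 100; ++i)
    {
        slots.acquire();                 // block once kMaxQueued items are pending
        std::printf("Send Data %d\n", i);
        items.release();                 // signal data ready
    }
}

void listener()
{
    for (int i = 0; i < 100; ++i)
    {
        items.acquire();                 // block until data is ready
        std::printf("Recv Data %d\n", i);
        slots.release();                 // free a slot for the producer
    }
}

int main()
{
    std::thread t(talker);
    std::thread l(listener);
    t.join();
    l.join();
}

With this, the Talker can never get more than kMaxQueued posts ahead, regardless of how the OS schedules the two threads.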