dpdk mbuf ref count check after detach

I use mbufs from the Intel DPDK library.
I had a check in my code for mbuf double release: I used to check whether the reference count was <= 0.
Since the following change I can no longer check this way, because the reference count is 1 even after the mbuf is released to the pool: http://mails.dpdk.org/archives/dev/2017-January/056188.html
Is there a way to check whether an mbuf is detached (in the pool or out of the pool)?

When we free and the refcnt is 1, it's either the first or the second (double) free. So just use another mbuf field to distinguish the two, e.g.:
if (rte_mbuf_refcnt_read(m) == 1) {
    if (m->port != UINT16_MAX - 1) {
        /* first free: mark the mbuf */
        m->port = UINT16_MAX - 1;
    } else {
        /* already marked: this is a double free */
        RTE_LOG(ERR, USER1, "Double free!\n");
    }
}
Update:
I just realized that you might not be aware of the standard double-free check in DPDK. Normally you just enable RTE_LIBRTE_MEMPOOL_DEBUG to check for double frees and memory corruption.
Another safe place to store the marker would be a private 8-byte area allocated for each mbuf. See the priv_size argument of rte_pktmbuf_pool_create() for more details.
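A minimal sketch of that approach (the sentinel value and pool parameters are illustrative, not from the question):
#include <rte_mbuf.h>

#define FREED_MARKER 0xDEADBEEFDEADBEEFULL /* illustrative sentinel */

/* The per-mbuf private area starts directly after the mbuf header. */
static inline uint64_t *free_marker(struct rte_mbuf *m)
{
    return (uint64_t *)(m + 1);
}

/* Create the pool with an 8-byte private area per mbuf: */
struct rte_mempool *pool = rte_pktmbuf_pool_create(
    "MBUF_POOL", 8192, 256, sizeof(uint64_t),
    RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

/* On free: if (*free_marker(m) == FREED_MARKER) it's a double free;
 * otherwise set *free_marker(m) = FREED_MARKER before releasing. */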


Embedding cache-aligned metadata inside an mbuf

I'm developing my own DPDK application and I want received packets to go through several threads in series. Each thread has its own duty of inspecting packets and generating some metadata for each individual packet. The easiest and most efficient way to transfer packets between threads appears to be rte rings. However, I need to transfer the metadata generated by each thread to the next thread as well. I tried doing this with an array of metadata structures, passing a pointer to the next thread, but this method proved inefficient since I got a lot of cache misses.
As a solution I came up with the idea of putting the metadata generated by each thread into the mbufs themselves. This seems to be doable with the "dynamic fields" of mbufs, but the documentation of this feature is very limited. For my application I would like to use a metadata struct inside a dynamic field, something like this:
typedef struct {
    uint32_t packet_id;
    uint64_t time_stamp;
    uint8_t  ip_v;
    uint32_t length;
    .........
    .........
} my_metadata_field;
What I don't understand is how much space I can use for a dynamic field. The only thing the DPDK documentation says about this is:
"10.6.1. Dynamic fields and flags
The size of the mbuf is constrained and limited; while the amount of
metadata to save for each packet is quite unlimited. The most basic
networking information already find their place in the existing mbuf
fields and flags.
If new features need to be added, the new fields and flags should fit
in the “dynamic space”, by registering some room in the mbuf
structure:
dynamic field -
named area in the mbuf structure, with a given size (at least 1 byte) and alignment constraint."
which doesn't make much sense to me. How much memory do I have for this field? If it's almost unlimited, what are the trade-offs (performance-wise) of using a large metadata field?
I use DPDK 20.08.
Edit:
After some digging I have abandoned the idea of using a dynamic field for the metadata, due to the lack of documentation and because it doesn't appear to be able to hold more than 64 bits.
I am looking for an easy way to embed my metadata inside cache-aligned mbufs (preferably using a struct like the one above) so I can use rte rings to share them between threads. I'm looking for any documentation or reference project to begin with.
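For reference, the dynamic-field registration API looks roughly like this (a minimal sketch; the field name is made up and my_metadata_field is assumed to be a complete type):
#include <rte_mbuf_dyn.h>

static const struct rte_mbuf_dynfield meta_param = {
    "example_my_metadata",      /* name (hypothetical) */
    sizeof(my_metadata_field),  /* size */
    alignof(my_metadata_field), /* align */
    0                           /* flags (reserved) */
};

/* Returns the byte offset inside struct rte_mbuf, or -1 if the
 * remaining dynamic space cannot fit the requested size/alignment. */
int meta_offset = rte_mbuf_dynfield_register(&meta_param);

/* Access on an mbuf m once registered: */
my_metadata_field *meta =
    RTE_MBUF_DYNFIELD(m, meta_offset, my_metadata_field *);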
There are several ways to carry metadata along with an mbuf:
1. In rte_mempool_create, instead of passing private_data_size as 0, pass the custom metadata size.
2. In rte_pktmbuf_pool_create, instead of passing priv_size as 0, pass the custom metadata size.
3. If the metadata is smaller than 128 bytes, use a typecast to access the memory area right after struct rte_mbuf.
4. If no external buffers are used in the DPDK application, reuse the rte_mbuf shinfo (or next) field.
Solution 1:
rte_mempool_create("FIPS_SESS_PRIV_MEMPOOL", 16, sess_sz, 0,
                   sizeof(my_metadata_field),
                   NULL, NULL, NULL, NULL, rte_socket_id(), 0);
Solution 2:
rte_pktmbuf_pool_create("MBUF_POOL", NUM_MBUFS * nb_ports, MBUF_CACHE_SIZE,
                        sizeof(my_metadata_field),
                        RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
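With Solution 2 each mbuf carries a private area immediately after struct rte_mbuf; a hedged sketch of reading it back (rte_mbuf_to_priv() is available in DPDK 20.08; the write is illustrative):
/* m comes from a pool created with priv_size = sizeof(my_metadata_field). */
struct rte_mbuf *m = bufs[index];
my_metadata_field *meta = (my_metadata_field *)rte_mbuf_to_priv(m);
meta->packet_id = 42; /* travels with the mbuf through the rings */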
Solution 3:
struct rte_mbuf *bufs[BURST_SIZE];
const uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
if (unlikely(nb_rx == 0))
    continue;
for (int index = 0; index < nb_rx; index++)
{
    /* the metadata must fit in the area right after the mbuf header */
    assert(sizeof(my_metadata_field) <= RTE_CACHE_LINE_SIZE);
    my_metadata_field *ptr = (my_metadata_field *)(bufs[index] + 1);
    ...
    ...
    ...
}
Solution 4:
privdata_ptr = rte_mempool_create("METADATA_POOL", 16 * 1024,
                                  sizeof(my_metadata_field), 0, 0,
                                  NULL, NULL, NULL, NULL, rte_socket_id(), 0);
struct rte_mbuf *bufs[BURST_SIZE];
const uint16_t nb_rx = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
if (unlikely(nb_rx == 0))
    continue;
for (int index = 0; index < nb_rx; index++)
{
    void *msg = NULL;
    if (0 == rte_mempool_get(privdata_ptr, &msg))
    {
        assert(msg != NULL);
        /* stash the metadata object in the (otherwise unused) shinfo field */
        bufs[index]->shinfo = msg;
        continue;
    }
    /* free the mbuf as we were not able to get a private data object */
}
/* before transmit or pkt free, ensure the object is released back to the mempool via rte_mempool_put */
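Following that note, a hedged sketch of the cleanup path for Solution 4:
/* Return the private object to its pool before freeing the mbuf. */
if (bufs[index]->shinfo != NULL) {
    rte_mempool_put(privdata_ptr, bufs[index]->shinfo);
    bufs[index]->shinfo = NULL;
}
rte_pktmbuf_free(bufs[index]);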

Validation Layer Error with Queues: QueueFamilyIndex is not unique within pCreateInfo->pQueueCreateInfos array

I'm creating a Vulkan renderer and setting up a Vulkan device and its corresponding queues. Previously I didn't have any problems creating queues since I was only creating one, but now that I am creating several of them (one for graphics, one for compute, and one for transfer), the validation layer throws this error during device creation:
VUID-VkDeviceCreateInfo-queueFamilyIndex-00372(ERROR / SPEC): msgNum: 0 - vkCreateDevice: pCreateInfo->pQueueCreateInfos[1].queueFamilyIndex (=0) is not unique within pCreateInfo->pQueueCreateInfos array. The Vulkan spec states: (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkDeviceCreateInfo-queueFamilyIndex-00372)
Objects: 1
[0] 0x16933778a10, type: 2, name: NULL
VUID-VkDeviceCreateInfo-queueFamilyIndex-00372(ERROR / SPEC): msgNum: 0 - vkCreateDevice: pCreateInfo->pQueueCreateInfos[2].queueFamilyIndex (=0) is not unique within pCreateInfo->pQueueCreateInfos array. The Vulkan spec states: (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkDeviceCreateInfo-queueFamilyIndex-00372)
Objects: 1
[0] 0x16933778a10, type: 2, name: NULL
VUID-VkDeviceCreateInfo-queueFamilyIndex-00372(ERROR / SPEC): msgNum: 0 - CreateDevice(): pCreateInfo->pQueueCreateInfos[1].queueFamilyIndex (=0) is not unique within pQueueCreateInfos. The Vulkan spec states: (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkDeviceCreateInfo-queueFamilyIndex-00372)
Objects: 1
[0] 0x16933778a10, type: 3, name: NULL
VUID-VkDeviceCreateInfo-queueFamilyIndex-00372(ERROR / SPEC): msgNum: 0 - CreateDevice(): pCreateInfo->pQueueCreateInfos[2].queueFamilyIndex (=0) is not unique within pQueueCreateInfos. The Vulkan spec states: (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-VkDeviceCreateInfo-queueFamilyIndex-00372)
Objects: 1
[0] 0x16933778a10, type: 3, name: NULL
And the VkResult from vkCreateDevice is VK_ERROR_INITIALIZATION_FAILED.
My three queues are created like this (this is non-standard code; it's wrapped in my own structure):
GraphicsQueueInfo.QueueFlag = VK_QUEUE_GRAPHICS_BIT;
GraphicsQueueInfo.QueuePriority = 1.0f;
ComputeQueueInfo.QueueFlag = VK_QUEUE_COMPUTE_BIT;
ComputeQueueInfo.QueuePriority = 1.0f;
TransferQueueInfo.QueueFlag = VK_QUEUE_TRANSFER_BIT;
TransferQueueInfo.QueuePriority = 1.0f;
The QueueFlag member is used to determine the type of queue we want to make from it. This is later used in the queue-selection function (here's a snippet):
uint8 i = 0;
while (true)
{
    if ((queueFamilies[i].queueCount > 0) && (queueFamilies[i].queueFlags & _QI.QueueFlag))
    {
        break;
    }
    i++;
}
QueueCreateInfo.queueFamilyIndex = i;
It seems all queues end up having the same queueFamilyIndex (which is set from i) and that causes the error, but I don't know if I'm doing something wrong.
vulkan-1.dll also crashes when vkGetDeviceQueue is called after the failed device creation.
In your second block you're populating QueueCreateInfo.queueFamilyIndex (I assume of type VkDeviceQueueCreateInfo) with i, based on the first queue you find with _QI.QueueFlag.
I also assume you're calling this block in some kind of loop in order to get queues for graphics, compute, and transfer. So let's suppose your block up there is in a function called findQueueFamilyIndex(...) and that you're calling it something like this:
std::vector<VkDeviceQueueCreateInfo> deviceQueueInfos;
deviceQueueInfos.push_back({});
findQueueFamilyIndex(VK_QUEUE_GRAPHICS_BIT, deviceQueueInfos.back());
deviceQueueInfos.push_back({});
findQueueFamilyIndex(VK_QUEUE_COMPUTE_BIT, deviceQueueInfos.back());
deviceQueueInfos.push_back({});
findQueueFamilyIndex(VK_QUEUE_TRANSFER_BIT, deviceQueueInfos.back());
The problem here is that you will almost certainly get the same queue family index for all three queues here, and that's illegal to request. Every graphics queue must support compute and transfer operations, so your loop of
if ((queueFamilies[i].queueCount > 0) && (queueFamilies[i].queueFlags & _QI.QueueFlag))
{
    break;
}
is a bad way to pick a queue family index. What you want is the queue family index of the queue that has a given flag, and as few other flags as possible. Something like this:
uint32_t targetIndex = UINT32_MAX;
uint32_t targetFlags = 0xFFFFFFFF;
for (uint32_t i = 0; i < queueFamilyCount; ++i) {
    // doesn't have the flag? ignore this family
    if (0 == (queueFamilies[i].queueFlags & _QI.QueueFlag)) {
        continue;
    }
    // first matching family? use it and continue
    if (targetIndex == UINT32_MAX) {
        targetIndex = i;
        targetFlags = queueFamilies[i].queueFlags;
        continue;
    }
    // matching family, but with fewer flags than the current best? use it
    if (countBits(queueFamilies[i].queueFlags) < countBits(targetFlags)) {
        targetIndex = i;
        targetFlags = queueFamilies[i].queueFlags;
        continue;
    }
}
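countBits isn't shown above; it's assumed to be a simple population count, e.g. (std::popcount would do in C++20):
// Counts the number of set bits in a flags mask.
static uint32_t countBits(uint32_t v) {
    uint32_t n = 0;
    for (; v != 0; v >>= 1)
        n += v & 1u;
    return n;
}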
If you want N queues from a given queue family, you must specify the family ONCE and say you want N queues. So if you want multiple queues for different kinds of work, the best way to approach it is to first find the best queue family index for a given kind of work. This will be the one with the fewest VkQueueFlagBits other than the one you're requesting. For instance, an NVIDIA RTX 2080 has 3 queue families: one is dedicated to compute, one is dedicated to transfer, and one supports all 3.
So let's assume you write a function that takes a list of queue families and returns the best family index for a given queue:
uint32_t findBestQueue(
    VkQueueFlags desiredFlags,
    const std::vector<VkQueueFamilyProperties>& queueFamilies)
{ ... }
Then what you can do is something like this:
std::vector<VkQueueFamilyProperties> qfps;
... populate qfps using vkGetPhysicalDeviceQueueFamilyProperties ...
std::map<uint32_t, uint32_t> queueFamilyToQueueCount;
auto qfi = findBestQueue(VK_QUEUE_GRAPHICS_BIT, qfps);
queueFamilyToQueueCount[qfi] += 1;
qfi = findBestQueue(VK_QUEUE_COMPUTE_BIT, qfps);
queueFamilyToQueueCount[qfi] += 1;
qfi = findBestQueue(VK_QUEUE_TRANSFER_BIT, qfps);
queueFamilyToQueueCount[qfi] += 1;
Now you have a map of queue family indices to the count of queues you need from each. You can then turn that into a std::vector<VkDeviceQueueCreateInfo> and use it to populate the appropriate members of VkDeviceCreateInfo.
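A hedged sketch of that conversion (the priorities array and its size are illustrative; needs <vector>, <map>, <algorithm>):
std::vector<float> priorities(16, 1.0f); // must outlive the vkCreateDevice call
std::vector<VkDeviceQueueCreateInfo> queueInfos;
for (const auto& entry : queueFamilyToQueueCount) {
    VkDeviceQueueCreateInfo info{};
    info.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    info.queueFamilyIndex = entry.first;
    // clamp to what the family actually offers
    info.queueCount = std::min(entry.second, qfps[entry.first].queueCount);
    info.pQueuePriorities = priorities.data();
    queueInfos.push_back(info);
}
VkDeviceCreateInfo deviceInfo{};
deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
deviceInfo.queueCreateInfoCount = static_cast<uint32_t>(queueInfos.size());
deviceInfo.pQueueCreateInfos = queueInfos.data();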
Note this is not a perfect way of grabbing queues. For instance, it's possible that there will be only a single queue family with one queue, in which case this code would fail because it requests 3 queues from a family with only one available; but for most hardware this should get you past this particular failure.

HeapWalk not working as expected in Release mode

So I used this example of the HeapWalk function to implement it in my app. I played around with it a bit and saw that when I added
LPVOID d = HeapAlloc(hHeap, 0, sizeof(int));
int* f = new(d) int;
after creating the heap, some new output would be logged:
Allocated block Data portion begins at: 0X037307E0
Size: 4 bytes
Overhead: 28 bytes
Region index: 0
Seeing this, I thought I could check Entry.wFlags for PROCESS_HEAP_ENTRY_BUSY to keep track of how much allocated memory I'm using on the heap. So I have:
HeapLock(heap);
int totalUsedSpace = 0, totalSize = 0, largestFreeSpace = 0, largestCounter = 0;
PROCESS_HEAP_ENTRY entry;
entry.lpData = NULL;
while (HeapWalk(heap, &entry) != FALSE)
{
    int entrySize = entry.cbData + entry.cbOverhead;
    if ((entry.wFlags & PROCESS_HEAP_ENTRY_BUSY) != 0)
    {
        // We have allocated memory in this block
        totalUsedSpace += entrySize;
        largestCounter = 0;
    }
    else
    {
        // We do not have allocated memory in this block
        largestCounter += entrySize;
        if (largestCounter > largestFreeSpace)
        {
            // Save this value as we've found a bigger space
            largestFreeSpace = largestCounter;
        }
    }
    // Keep track of the total size of this heap
    totalSize += entrySize;
}
HeapUnlock(heap);
And this appears to work when built in debug mode (totalSize and totalUsedSpace are different values). However, when I run it in Release mode totalUsedSpace is always 0.
I stepped through it with the debugger in Release mode; for each heap the loop runs three times, and I get the following flags in entry.wFlags from calling HeapWalk:
1 (PROCESS_HEAP_REGION)
0
2 (PROCESS_HEAP_UNCOMMITTED_RANGE)
It then exits the while loop and GetLastError() returns ERROR_NO_MORE_ITEMS as expected.
From here I found that a flag value of 0 means "the committed block which is free, i.e. not being allocated or not being used as control structure."
Does anyone know why it does not work as intended when built in Release mode? I don't have much experience with how memory is handled by the computer, so I'm not sure where the error might be coming from. Searching on Google didn't come up with anything, so hopefully someone here knows.
UPDATE: I'm still looking into this myself. If I monitor the app using vmmap, I can see that the process has 9 heaps, but GetProcessHeaps returns that there are 22 heaps. Also, none of the heap handles it returns match the return value of GetProcessHeap() or _get_heap_handle(). It seems like GetProcessHeaps is not behaving as expected. Here is the code to get the list of heaps:
// Count how many heaps there are and allocate enough space for them
DWORD numHeaps = GetProcessHeaps(0, NULL);
HANDLE* handles = new HANDLE[numHeaps];
// Get a handle to known heaps for us to compare against
HANDLE defaultHeap = GetProcessHeap();
HANDLE crtHeap = (HANDLE)_get_heap_handle();
// Get a list of handles to all the heaps
DWORD retVal = GetProcessHeaps(numHeaps, handles);
And retVal is the same value as numHeaps, which indicates that there was no error.
Application Verifier had been set up previously to do full page-heap verification of my executable and was interfering with the heaps returned by GetProcessHeaps. I'd forgotten it was set up, as that was done for a different issue several days earlier and then closed without clearing the tests. It wasn't happening in the debug build because the application builds to a different file name for debug builds.
We managed to detect this by adding a breakpoint and looking at the call stack of the thread. We could see the Application Verifier DLL had been injected, and that let us know where to look.

How to reduce CPU usage during real-time data transfer on TCP ports

I have a socket program which acts as both client and server.
It initiates a connection on an input port and reads data from it. In a real-time scenario it reads data on the input port and sends it (record by record) to the output port.
The problem is that while sending data to the output port, CPU usage rises to 50%, which is not permissible.
while(1)
{
    if(IsInputDataAvail==1)//check if data is available on input port
    {
        //condition to avoid duplications while sending
        if( LastRecordSent < LastRecordRecvd )
        {
            record_time temprt;
            list<record_time> BufferList;
            list<record_time>::iterator j;
            list<record_time>::iterator i;
            // Storing into a temp list
            for(i=L.begin(); i != L.end(); ++i)
            {
                if((i->recordId > LastRecordSent) && (i->recordId <= LastRecordRecvd))
                {
                    temprt.listrec = i->listrec;
                    temprt.recordId = i->recordId;
                    temprt.timestamp = i->timestamp;
                    BufferList.push_back(temprt);
                }
            }
            //Sending to output port
            for(j=BufferList.begin(); j != BufferList.end(); ++j)
            {
                LastRecordSent = j->recordId;
                std::string newlistrecord = j->listrec;
                newlistrecord.append("\n");
                char* newrecord= new char [newlistrecord.size()+1];
                strcpy (newrecord, newlistrecord.c_str());
                if ( s.OutputClientAvail() == 1) //check if output client is available
                {
                    int ret = s.SendBytes(newrecord,strlen(newrecord));
                    if ( ret < 0)
                    {
                        log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
                        --connected;
                        return;
                    }
                }
                else
                {
                    log1.AddLogFormatFatal("Nice Send Thread : Nice Client Timedout..connection closed");
                    --connected; //if output client not available disconnect after a timeout
                    return;
                }
            }
        }
    }
    // Sleep(100); if we include a sleep here CPU usage is less, but to send data in real time I need to remove it.
}//End of while loop
If I remove the Sleep(), CPU usage goes very high while sending data to the output port.
Are there any ways to maintain real-time data transfer while reducing CPU usage? Please suggest.
There are two potential CPU sinks in the listed code. First, the outer loop:
while (1)
{
    if (IsInputDataAvail == 1)
    {
        // Not run most of the time
    }
    // Sleep(100);
}
Given that the Sleep call significantly reduces your CPU usage, this spin-loop is the most likely culprit. It looks like IsInputDataAvail is a variable set by another thread (though it could be a preprocessor macro), which would mean that almost all of that CPU is being used to run this one comparison instruction and a couple of jumps.
The way to reclaim that wasted power is to block until input is available. Your reading thread probably does so already, so you just need some sort of semaphore to communicate between the two, with a system call to block the output thread. Where available, the ideal option would be sem_wait() in the output thread, right at the top of your loop, and sem_post() in the input thread, where it currently sets IsInputDataAvail. If that's not possible, the self-pipe trick might work in its place.
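A minimal sketch of that semaphore handoff (POSIX; the variable name is hypothetical):
#include <semaphore.h>

sem_t dataAvail;            // shared between the two threads
sem_init(&dataAvail, 0, 0); // initial count 0: nothing to send yet

// Input thread, after appending a record to the shared list:
sem_post(&dataAvail);

// Output thread, at the top of its loop, blocks instead of spinning:
sem_wait(&dataAvail);
// ... send the newly available record(s) ...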
The second potential CPU sink is in s.SendBytes(). If a positive result indicates that the record was fully sent, then that method must be using a loop. It probably uses a blocking call to write the record; if it doesn't, then it could be rewritten to do so.
Alternatively, you could rewrite half the application to use select(), poll(), or a similar method to merge reading and writing into the same thread, but that's far too much work if your program is already mostly complete.
if(IsInputDataAvail==1)//check if data is available on input port
Get rid of that. Just read from the input port. It will block until data is available. This is where most of your CPU time is going. However there are other problems:
std::string newlistrecord = j->listrec;
Here you are copying data.
newlistrecord.append("\n");
char* newrecord= new char [newlistrecord.size()+1];
strcpy (newrecord, newlistrecord.c_str());
Here you are copying the same data again. You are also dynamically allocating memory, and leaking it.
if ( s.OutputClientAvail() == 1) //check if output client is available
I don't know what this does but you should delete it. The following send is the time to check for errors. Don't try to guess the future.
int ret = s.SendBytes(newrecord,strlen(newrecord));
Here you are recomputing the length of a string whose length you already knew back when you set j->listrec. It would be much more efficient to call s.SendBytes() directly with j->listrec and then again with "\n" than to do all this; TCP will coalesce the data anyway.
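Putting those points together, the send loop could be reduced to something like this (a hedged sketch, assuming listrec is a std::string and SendBytes accepts a const char* and a length):
for (j = BufferList.begin(); j != BufferList.end(); ++j)
{
    LastRecordSent = j->recordId;
    // No temporary copies, no new[]: send the stored record, then the separator.
    if (s.SendBytes(j->listrec.c_str(), j->listrec.size()) < 0 ||
        s.SendBytes("\n", 1) < 0)
    {
        log1.AddLogFormatFatal("Nice Send Thread : Nice Client Disconnected");
        --connected;
        return;
    }
}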

Speeding up non-blocking Unix Sockets (C++)

I've implemented a simple socket wrapper class. It includes a function to put the socket in non-blocking mode:
void Socket::set_non_blocking(const bool b) {
    mNonBlocking = b; // class member for reference elsewhere
    int opts = fcntl(m_sock, F_GETFL);
    if (opts < 0) return;
    if (b)
        opts |= O_NONBLOCK;
    else
        opts &= ~O_NONBLOCK;
    fcntl(m_sock, F_SETFL, opts);
}
The class also contains a simple receive function:
int Socket::recv(std::string& s) const {
    char buffer[MAXRECV + 1];
    s = "";
    memset(buffer, 0, MAXRECV + 1);
    int status = ::recv(m_sock, buffer, MAXRECV, 0);
    if (status == -1) {
        if (!mNonBlocking)
            std::cout << "Socket, error receiving data\n";
        return 0;
    } else if (status == 0) {
        return 0;
    } else {
        s = buffer;
        return status;
    }
}
In practice, there seems to be a ~15ms delay when Socket::recv() is called. Is this delay avoidable? I've seen some non-blocking examples that use select(), but don't understand how that might help.
It depends on how you are using the sockets. If you have multiple sockets and you loop over all of them checking for data, that may account for the delay.
With non-blocking recv you are depending on data being there. If your application needs to use more than one socket, you will have to constantly poll each socket in turn to find out whether any of them has data available.
This is bad for system resources because it means your application is constantly running even when there is nothing to do.
You can avoid that with select. You basically set up your sockets, add them to a group, and select on the group. When anything happens on any of the selected sockets, select returns, specifying what happened and on which socket.
For some code showing how to use select, look at Beej's Guide to Network Programming.
select will let you specify a timeout and can test whether the socket is ready to be read from, so you can use something smaller than 15 ms. Incidentally, you need to be careful with the code you have: if the data on the wire can contain embedded NULs, s won't contain all the read data. You should use something like s.assign(buffer, status);.
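A minimal sketch of using select() with a short timeout before calling ::recv (the 1 ms value is illustrative):
#include <sys/select.h>

fd_set readfds;
FD_ZERO(&readfds);
FD_SET(m_sock, &readfds);

struct timeval tv;
tv.tv_sec = 0;
tv.tv_usec = 1000; // 1 ms, much shorter than the observed ~15 ms delay

// Blocks until the socket is readable or the timeout expires.
int ready = select(m_sock + 1, &readfds, NULL, NULL, &tv);
if (ready > 0 && FD_ISSET(m_sock, &readfds)) {
    // ::recv() will now return without blocking
}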
In addition to stefanB's point, I see that you are zeroing out your buffer every time. Why bother? recv returns how many bytes were actually read. Just zero out the one byte after the data (buffer[status] = '\0').
How big is your MAXRECV? It might just be that you incur a page fault on the stack growth. Others have already mentioned that zeroing out the receive buffer is completely unnecessary. You also take a memory-allocation and copy hit when you create a std::string out of the received character data.