Concurrent modification of Linux BPF hashtab map, how to make it safe without resorting to BPF_F_NO_PREALLOC flag - concurrency

A BPF map preallocates memory for items by default. The BPF_F_NO_PREALLOC flag turns preallocation off.
A preallocated map is faster. Sleepable programs could only work with preallocated maps until recently.
When it comes to concurrency, there's a significant difference:
When items are allocated and freed individually, RCU ensures that the memory is not reused until it is safe to do so.
A preallocated map can reuse an item right away.
Imagine a BPF program P obtained a pointer to a value with bpf_map_lookup_elem(). Neither P nor userspace ever modifies values. However, the value DOES change when the item is deleted and reused for an unrelated element in a preallocated map.
It looks like there's nothing P could do to obtain a stable view of the value, even with explicit locking (as bpf_map_update_elem() doesn't lock a brand new item before copying in data, which might be the same object P is currently looking into).
Even worse, a lookup in a map could yield a completely bogus result if the key is modified concurrently with lookup.
Is it even possible to use preallocated maps safely?

Related

Does std::unordered_map::erase actually perform dynamic deallocation?

It isn't difficult to find information on the big-O time behavior of stl container operations. However, we operate in a hard real-time environment, and I'm having a lot more trouble finding information on their heap memory usage behavior.
In particular I had a developer come to me asking about std::unordered_map. We're allowed to be non-realtime at startup, so he was hoping to perform a .reserve() at startup time. However, he's finding he gets overruns at runtime. The operations he uses are lookups, insertions, and deletions with .erase().
I'm a little worried about whether .reserve() actually prevents later runtime memory allocations (I don't really understand the explanation of what it does with respect to heap usage), but for .erase() in particular I don't see any guarantee whatsoever that it won't ask the heap for a dynamic deallocation when called.
So the question is: what are the specified heap interactions (if any) for std::unordered_map::erase, and if it actually does deallocations, is there some kind of trick that can be used to avoid them?
The standard doesn't specify container allocation patterns per-se. These are effectively derived from iterator/reference invalidation rules. For example, vector::insert only invalidates all references if the number of elements inserted causes the size of the container to exceed its capacity. Which means reallocation happened.
By contrast, the only operations on unordered_map which invalidate references are those which actually remove that particular element. Even a rehash (which likely allocates memory) does not invalidate references (this is why reserve changes nothing).
This means that each element must be stored separately from the hash table itself. They are individual nodes (which is why it has a node_type extraction interface), and must be able to be allocated and deallocated individually.
So it is reasonable to assume that each insertion or erasure represents at least one allocation/deallocation.
If you're all right with nodes continuing to consume memory, even after they've been removed from the container, you could pretty easily write an Allocator class that basically made deallocation a NOP.
Quite a few real-time systems basically allocate all the memory they're going to use up-front, then once they've finished initialization they neither allocate nor release memory. This would allow you to do pretty much the same thing with an unordered_map.
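A minimal sketch of such an allocator, assuming a fixed 64 KiB arena that is never reclaimed. The names Arena, BumpAllocator, NodeAlloc and Map are made up for illustration, and the arena size is arbitrary. Note that every insertion still consumes fresh arena memory, because erased nodes are not recycled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <unordered_map>
#include <utility>

// Hypothetical fixed-size arena: bytes are handed out once and never reclaimed.
struct Arena {
    alignas(std::max_align_t) unsigned char buffer[1 << 16];
    std::size_t used = 0;

    void* take(std::size_t bytes, std::size_t align) {
        assert(align != 0 && (align & (align - 1)) == 0);  // must be a power of two
        std::size_t start = (used + align - 1) & ~(align - 1);
        if (start + bytes > sizeof(buffer)) throw std::bad_alloc();
        used = start + bytes;
        return buffer + start;
    }
};

// Minimal allocator: allocation bumps the arena offset, deallocation does nothing.
template <class T>
struct BumpAllocator {
    using value_type = T;
    Arena* arena;

    explicit BumpAllocator(Arena* a) : arena(a) {}
    template <class U>
    BumpAllocator(const BumpAllocator<U>& other) : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->take(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}  // deliberately a no-op
};

template <class T, class U>
bool operator==(const BumpAllocator<T>& a, const BumpAllocator<U>& b) {
    return a.arena == b.arena;
}
template <class T, class U>
bool operator!=(const BumpAllocator<T>& a, const BumpAllocator<U>& b) {
    return !(a == b);
}

using NodeAlloc = BumpAllocator<std::pair<const int, int>>;
using Map = std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, NodeAlloc>;
```

Usage would look like `Arena arena; Map m(16, std::hash<int>(), std::equal_to<int>(), NodeAlloc(&arena));`, with the arena outliving the map. After that, m.erase(key) never touches the heap, but repeated insert/erase cycles will eventually exhaust the arena.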
That said, I'm somewhat skeptical about the benefit in this case. The main strength of unordered_map is supporting insertion and deletion that are usually fast. If you're not going to be doing insertion at runtime, chances are pretty good that it's not a particularly great choice.
If it's a collection that's mostly filled during initialization, then used mostly as-is, with a few items being "removed", but no more being inserted after you finish initialization, you're likely to be better off with a simple sorted array and an interpolating search (or, if the data is distributed extremely unpredictably, maybe a binary search--but an interpolating search is usually better). In this case, I'd handle removal by simply adding a boolean to each item saying whether that item is valid or not. Erase by setting that value to false. If you find such a value during a search, you basically just ignore it.
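A rough sketch of the sorted-array idea, assuming integer keys and using a plain binary search (std::lower_bound) in place of the interpolating search mentioned above. Entry, build_table, find_entry and erase_entry are illustrative names, not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Entry {
    int key;
    int value;
    bool valid;  // false once the entry has been "erased"
};

// Initialization-time step: sort once by key. Keys are assumed to be unique.
inline void build_table(std::vector<Entry>& table) {
    std::sort(table.begin(), table.end(),
              [](const Entry& a, const Entry& b) { return a.key < b.key; });
    auto dup = std::adjacent_find(
        table.begin(), table.end(),
        [](const Entry& a, const Entry& b) { return a.key == b.key; });
    assert(dup == table.end());
    (void)dup;  // silence unused-variable warnings when NDEBUG is defined
}

// Runtime lookup: no allocation, entries marked invalid are ignored.
inline Entry* find_entry(std::vector<Entry>& table, int key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
                               [](const Entry& e, int k) { return e.key < k; });
    if (it == table.end() || it->key != key || !it->valid) return nullptr;
    return &*it;
}

// Runtime "erase": only flips the flag, the storage stays in place.
inline bool erase_entry(std::vector<Entry>& table, int key) {
    Entry* e = find_entry(table, key);
    if (e == nullptr) return false;
    e->valid = false;
    return true;
}
```

Neither find_entry nor erase_entry allocates or frees memory, which is the property that matters for the real-time phase.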

std::vector increasing peak memory

This is a continuation of my last question. I fail to understand the memory taken up by a vector. Problem skeleton:
Consider a vector which is a collection of lists, where each list is a collection of pointers. Exactly like:
std::vector<std::list<ABC*> > vec;
where ABC is my class. We work on 64bit machines, so size of pointer is 8 bytes.
At the start of my flow in the project, I resize this vector to a number so that I can store lists at their respective indexes.
vec.resize(613284686);
At this point, capacity and size of the vector would be 613284686. Right. After resizing, I am inserting the lists at corresponding indexes as:
// Some where down in the program, make these lists. Simple push for now.
std::list<ABC*> l1;
l1.push_back(<pointer_to_class_ABC>);
l1.push_back(<pointer_to_class_ABC>);
// Copy the list at location
setInfo(613284686, l1);
void setInfo(uint64_t index, std::list<ABC*> list) {
    std::copy(list.begin(), list.end(), std::back_inserter(vec.at(index)));
}
Alright. So inserting is done. Notable things are:
Size of vector is : 613284686
Entries in the vector is : 3638243731 // Calculated this by going over vector indexes and add the size of std::lists at each index.
Now, since there are 3638243731 entries of pointers, I would expect memory taken by this vector is ~30Gb. 3638243731 * 8(bytes) = ~30Gb.
BUT when I have this data in memory, memory usage peaks at 400G.
And then I clear up this vector with:
std::vector<std::list<nl_net> >& ccInfo = getVec(); // getVec defined somewhere and return me original vec.
std::vector<std::list<nl_net> >::iterator it = ccInfo.begin();
for(; it != ccInfo.end(); ++it) {
(*it).clear();
}
ccInfo.clear(); // Since it is a reference
std::vector<std::list<nl_net> >().swap(ccInfo); // This makes the capacity of the vector 0.
Well, after clearing this vector, memory only drops to 100G. That is too much memory to still be held after clearing a vector.
Could you point out what I am failing to understand here?
P.S. I cannot reproduce it on smaller cases; it only shows up in my project.
vec.resize(613284686);
At this point, capacity and size of the vector would be 613284686
It would be at least 613284686. It could be more.
std::vector<std::list<nl_net> >().swap(ccInfo); // This makes the capacity of the vector 0.
Technically, there is no guarantee by the standard that a default constructed vector wouldn't have capacity other than 0... But in practice, this is probably true.
Now, since there are 3638243731 entries of pointers, I would expect memory taken by this vector is ~30Gb. 3638243731 * 8(bytes)
But the vector doesn't contain pointers. It contains std::list<ABC*> objects. So, you should expect vec.capacity() * sizeof(std::list<ABC*>) bytes used by the buffer of the vector itself. Each list has at least a pointer to beginning and the end.
Furthermore, you should expect each element in each of the lists to use memory as well. Since the list is doubly linked, you should expect about two pointers plus the data (a third pointer) worth of memory for each element.
Also, each pointer in the lists apparently points to an ABC object, and each of those use sizeof(ABC) memory as well.
Furthermore, since each element of the linked lists is allocated separately, and each dynamic allocation requires book-keeping so that it can be individually de-allocated, and each allocation must be aligned to the maximum native alignment, and the free store may have fragmented during the execution, there will be much overhead associated with each dynamic allocation.
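The estimate described above can be written down as a small helper. This is only a sketch: estimate_bytes is a made-up name, the per-node cost is an input you have to guess or measure for your allocator, and the ABC objects themselves are not counted.

```cpp
#include <cassert>
#include <cstdint>
#include <list>

struct ABC;  // stand-in for the question's class; only pointers to it are stored

// Rough lower bound on the bytes held by std::vector<std::list<ABC*>>:
// the vector's buffer of list objects plus one heap node per stored pointer.
// bytes_per_node is an assumption supplied by the caller (two links, the
// stored pointer, plus whatever padding and book-keeping the allocator adds).
inline std::uint64_t estimate_bytes(std::uint64_t vector_capacity,
                                    std::uint64_t total_nodes,
                                    std::uint64_t bytes_per_node) {
    assert(bytes_per_node >= 3 * sizeof(void*));  // prev + next + stored pointer
    return vector_capacity * sizeof(std::list<ABC*>) + total_nodes * bytes_per_node;
}
```

For the numbers in the question, estimate_bytes(613284686, 3638243731, 48) with an assumed 48 bytes per node already gives well over 170 GB, several times the 30 GB expected from counting pointers alone.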
Well, after clearing up this vector, memory drops down to 100G.
It is quite typical for the language implementation to retain (some) memory it has allocated from the OS. If your target system documents an implementation specific function for explicitly requesting release of such memory, then you could attempt using that.
However, if the vector buffer wasn't the latest dynamic allocation, then its deallocation may have left a massive reusable area in the free store, but if later allocations exist, then all that memory might not be releasable back to the OS.
Even if the language implementation has released the memory to the OS, it is quite typical for the OS to keep the memory mapped for the process until another process actually needs the memory for something else. So, depending on how you're measuring memory use, the results might not necessarily be meaningful.
General rules of thumb that may be useful:
Don't use a vector unless you use all (or most) of the indices. In cases where you don't, consider a sparse array instead (there is no standard container for such a data structure though).
When using vector, reserve before resize if you know the upper bound of allocation.
Don't use linked lists without a good reason.
Don't rely on getting all memory back from peak usage (back to the OS that is; The memory is still usable for further dynamic allocations).
Don't stress about virtual memory usage.
std::list is a fragmented memory container. Typically each node MUST have the data it is storing, plus the 2 prev/next pointers, and then you have to add in the space required within the OS allocation table (typically 16 or 32 bytes per allocation - depending on OS). You then have to account for the fact all allocations must be returned on a 16byte boundary (on Intel/AMD based 64bit machines anyway).
So using the example of std::list<ABC*>, the size of a pointer is 8 bytes, but you will need at least 48 bytes to store each element.
So memory usage for ONLY the list entries is going to be around: 3638243731 * 48(bytes) = ~162Gb.
This is of course assuming that there is no memory fragmentation (where there may be a block of 62bytes free, and the OS returns the entire block of 62 rather than the 48 requested). We are also assuming here that the OS has a minimum allocation size of 48 bytes (and not say, 64bytes, which would not be overly silly, but would push the usage up far higher).
The size of the std::lists themselves within the vector comes to around 18GB. So in total we are looking at 180Gb at least to store that vector. It would not be beyond the realm of possibility that the extra allocations are additional OS book keeping info, for all of those individual memory allocations (e.g. lists of loaded memory pages, lists of swapped out memory pages, the read/write/mmap permissions, etc, etc).
As a final note, instead of using swap on a newly constructed vector, you can just use shrink to fit.
ccInfo.clear();
ccInfo.shrink_to_fit();
The main vector needs some more consideration. I get the impression it will always be a fixed size. So why not use a std::array instead? A std::vector may allocate more memory than it needs to allow for growth. The bigger your vector, the bigger the reservation of memory to allow for even more growth. The reasoning behind this is to keep relocations in memory to a minimum. Relocations of really big vectors take up huge amounts of time, so a lot of extra memory is reserved to prevent this.
No vector function that deletes elements (such as vector::clear and ::erase) also deallocates memory (i.e. lowers the capacity). The size will decrease but the capacity doesn't. Again, this is meant to prevent relocations; if you delete you are also very likely to add again. ::shrink_to_fit also doesn't guarantee that the memory is actually released.*
Next is the choice of a list to store elements. Is a list really applicable? Lists are strong at insertion and removal at arbitrary positions. Are you really constantly adding and removing ABC objects to the list in random locations? Or is another container type with different properties but with contiguous memory more suitable? Another std::vector or std::array perhaps. If the answer is yes, then you're pretty much stuck with a list and its scattered memory allocations. If no, then you could win back a lot of memory by using a different container type.
So, what is it you really want to do? Do you really need dynamic growth on both the main container and its elements? Do you really need random manipulation? Or can you use fixed-size arrays for both container and ABC objects and use iteration instead? When contemplating this you might want to read up on the available containers and their properties on en.cppreference.com. It will help you decide what is most appropriate.
*For the fun of it I dug around in VS2017's implementation and it creates an entirely new vector without the growth segment, copies the old elements and then reassigns the internal pointers of the old vector to the new one while deleting the old memory. So at least with that compiler you can count on memory being released.

What is the fastest way for multiple threads to insert into a vector safely?

I have a program where multiple threads share the same data structure, which is basically a 2D array of vectors. Sometimes two or more threads might have to insert at the same position (i.e. into the same vector), which might result in a crash if no precautions are taken. What is the fastest and most efficient way to implement a safe solution for this issue? Since this does not happen very often (no high contention), I have a 2D array of mutexes where each mutex maps to a vector; each thread locks the mutex, updates the corresponding vector, and then unlocks it. If this is a good solution, I would like to know if there is something faster than a mutex to use.
Note, I am using OpenMP for the multithreading.
The solution greatly depends on the exact problem. For example:
Whether the vector size may exceed its capacity (i.e. reallocation is required).
Whether the vector is only being read, elements are being inserted or elements can be both inserted and removed.
In the first case, you don't have any other possibility than using locks, since you always need to check whether the vector is being reallocated, and wait for the reallocation to complete if necessary.
On the other hand, if you are completely sure that the vector is only initialized once by a single thread (which is not your case), probably you would not need any synchronization mechanism to perform access to vector elements (inside-element access synchronization may still be required though).
If elements are being inserted and removed from the back of the vector only (queue style), then using atomic compare and swap would be enough (atomically increase the size of the vector, and insert in position size-1 when the swap was successful).
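A minimal sketch of the append-only case, assuming the capacity is fixed up front so no reallocation can ever happen. It uses an atomic fetch_add to reserve a slot rather than an explicit compare-and-swap loop, which is a simplification; FixedAppendBuffer is an illustrative name. Removal is not handled, and readers need their own synchronization (for example, reading only after all writer threads have been joined).

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

class FixedAppendBuffer {
public:
    explicit FixedAppendBuffer(std::size_t capacity) : slots_(capacity), next_(0) {}

    // Reserves the next free index atomically, then writes into it.
    // Returns false once the preallocated capacity is used up.
    bool append(int value) {
        std::size_t i = next_.fetch_add(1, std::memory_order_relaxed);
        if (i >= slots_.size()) return false;
        slots_[i] = value;  // each successful caller owns a distinct slot
        return true;
    }

    // Number of reserved slots. Only meaningful once writers have finished.
    std::size_t size() const {
        return std::min(next_.load(std::memory_order_acquire), slots_.size());
    }

    int at(std::size_t i) const {
        assert(i < size());
        return slots_[i];
    }

private:
    std::vector<int> slots_;         // sized once, never reallocated
    std::atomic<std::size_t> next_;  // next index to hand out
};
```

Because two threads can never receive the same index, the element writes themselves do not race with each other.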
If elements may be removed at any point of the vector, its contents may need to be moved to remove empty holes. This case is similar to a reallocation. You can use a customized heap to manage the empty positions in your vector, although this will increase the complexity.
At the end of the day, probably you will need to either develop your own parallel data structure or rely on a library, such as TBB or Boost.
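For reference, the one-mutex-per-vector approach from the question could look roughly like this. It is a sketch using std::mutex rather than OpenMP locks (omp_lock_t would be the OpenMP counterpart); LockedGrid and its members are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

class LockedGrid {
public:
    LockedGrid(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), cells_(rows * cols), locks_(rows * cols) {}

    // Only the mutex of the addressed cell is taken, so threads working on
    // different cells never wait for each other.
    void push(std::size_t row, std::size_t col, int value) {
        std::size_t i = index(row, col);
        std::lock_guard<std::mutex> guard(locks_[i]);
        cells_[i].push_back(value);  // may reallocate, safely under the lock
    }

    std::size_t size_at(std::size_t row, std::size_t col) {
        std::size_t i = index(row, col);
        std::lock_guard<std::mutex> guard(locks_[i]);
        return cells_[i].size();
    }

private:
    std::size_t index(std::size_t row, std::size_t col) const {
        assert(row < rows_ && col < cols_);
        return row * cols_ + col;
    }

    std::size_t rows_;
    std::size_t cols_;
    std::vector<std::vector<int>> cells_;  // the 2D array, flattened
    std::vector<std::mutex> locks_;        // one mutex per cell
};
```

With low contention an uncontended lock and unlock is usually cheap, so it is worth measuring before replacing this with something more elaborate.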

C++ Stack Walking on Windows

I'm building a memory manager for C++ using a very .NET style approach. In doing so I need to know which objects are considered reachable; an object is considered reachable if a reachable object has a handle to the object in question. So this poses the question of which object(s) are the root of our search? The answer would be that these "eve" objects are on the stack, be it in the form of a handle to a managed object or an instance of a scope-local object that itself has a handle to a managed object.
I've read through some articles on this and also checked out implementation details on the MSDN about the StackWalk method in the Win32 API.
As always any help is greatly appreciated. And please don't advise against making a memory manager, or suggest alternatives such as smart pointers. I fully understand what I am doing. Thanks!
Your requirements sort of seem similar to a small project I’m working on at the moment, but my goal isn’t to make a memory manager, my goal is to instrument dmalloc (and the debug-mode long-running application within which it is running) with the ability to periodically halt execution and scan memory looking for heap allocations for which there are no references. Sort of like a “dumb” garbage collector, but not with the goal of freeing memory; instead, with the goal of logging leaked allocations for later analysis (along with stacktraces captured at allocation-time, which I’ve already added to dmalloc). Note that as a general-purpose memory manager’s garbage collector, this will be a pretty inefficient process and will take a “long” time to run (I’m not done yet, but I won’t be surprised if each time it runs it halts normal program execution for over 10 seconds), but for my own purposes I don’t care too much about performance because I’ll enable it only once every few months to test for new memory leaks in my company’s product.
In any case, I assume your memory manager will be the only source of heap memory in your application? And that threads in your system operate in a fully shared-memory environment, where no thread has any memory, including stack space and thread-local storage space, that cannot be seen from other threads? If so...
I believe there are just four categories of memory within which you may find pointers to heap allocations:
On the callstacks of each thread
Within heap allocations themselves
In statically allocated writable memory (.bss & .data/.sdata, but not .rdata/.rodata)
In thread-local storage space for each thread
You are already aware that pointers to heap allocations may occur on the stack. Pointers to allocations may also (may instead) be stored in heap objects themselves, and not even stored on the stack. Your question suggests you may be hoping to use the stack as a “root” of your garbage collector’s search; I’m taking this to mean you hope to be able to follow pointers on the stack outwards to other allocations, searching from one object to another through memory until you’ve traversed all objects in memory and found all pointers to all allocations. "Root" pointers may also exist in statically allocated objects, which can be referenced directly without there even being a pointer to such an object on the stack, so you can't just assume all allocations are reachable from "pointers" you find in the stack. Also, unfortunately with C++, unless you’re able to know the structure of each allocation (which you won’t without help from the compiler), you’ll have to assume that any location is possibly a pointer. So you’ll have to scan through each of these four categories of memory looking for potential pointers to all existing allocations, flagging each with a “possibly still in use” flag if you find a value in memory that matches the address of an allocation, whether or not it’s actually a pointer. As you scan through memory, at each byte location (or at each byte location evenly divisible by sizeof(void*), if you know your platform can’t have pointers at misaligned addresses), you’ll have to search your list of allocations to see if that value is in your list of allocations.
Since you're confident that you know what you’re doing, your memory manager is probably tracking these allocations in a balanced tree structure (perhaps a red-black tree or Andersson tree) which gives you O(log n) insertion & lookup on those allocations, but the constant of proportionality for navigating those trees is going to really kill your garbage collector’s performance. Before doing your garbage collection scan, you’ll want to copy the tree’s allocation pointers into a flat contiguous buffer (i.e. an “array”) in order (i.e. ascending or descending using inorder traversal). I suggest an array of void* of each allocation’s address and a separate bit-array (not bool array) with one bit per allocation, initialized to all-zeros, where an allocation’s corresponding bit is set to 1 if you find a potential reference to it. This will still give you O(log n) lookup (using binary search) while you’re scanning for garbage collection, but with a much more manageable constant of proportionality for your lookups; in addition, this more compact data structure will tend to have better cache hit performance than a balanced tree.
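A sketch of that flat structure, assuming allocation start addresses fit in std::uintptr_t and that only exact matches of a start address count as a reference. AllocationIndex and its member names are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

class AllocationIndex {
public:
    // sorted_addresses: allocation start addresses in ascending order,
    // e.g. produced by an in-order traversal of the allocator's tree.
    explicit AllocationIndex(std::vector<std::uintptr_t> sorted_addresses)
        : addresses_(std::move(sorted_addresses)),
          marks_((addresses_.size() + 63) / 64, 0) {
        assert(std::is_sorted(addresses_.begin(), addresses_.end()));
    }

    // Treats `word` as a possible pointer: if it equals the start address of
    // a known allocation, sets that allocation's bit and returns true.
    bool mark_if_allocation(std::uintptr_t word) {
        auto it = std::lower_bound(addresses_.begin(), addresses_.end(), word);
        if (it == addresses_.end() || *it != word) return false;
        std::size_t i = static_cast<std::size_t>(it - addresses_.begin());
        marks_[i / 64] |= std::uint64_t{1} << (i % 64);
        return true;
    }

    // Conservatively scans a pointer-aligned memory range word by word.
    void scan(const std::uintptr_t* begin, const std::uintptr_t* end) {
        for (const std::uintptr_t* p = begin; p != end; ++p) mark_if_allocation(*p);
    }

    bool is_marked(std::size_t i) const {
        assert(i < addresses_.size());
        return ((marks_[i / 64] >> (i % 64)) & 1u) != 0;
    }

private:
    std::vector<std::uintptr_t> addresses_;  // sorted, contiguous
    std::vector<std::uint64_t> marks_;       // one bit per allocation
};
```

Each lookup is a binary search over contiguous memory, and the mark bits for 64 allocations share a single 8-byte word.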
Now I’ll discuss each of the four categories of memory you’d have to scan:
The callstacks of each thread
For this, you’ll have to be able to query your thread manager for the top & bottom of each thread’s stacks. If you can only get the current stack pointer for each thread, then you may be able to use a “backtrace” API to get a list of function return addresses on that stack. From that, you can scan back toward each stack’s base (which you don’t know), ticking off each return address in order until you get to the last return address, where you’ve then found the stack base (or close enough). And for the “current thread”, be sure to not include any stackframes associated with your memory manager; i.e., back up a few stackframes & ignore the ones associated with your garbage collector, or else you might find addresses of leaked allocations in your garbage collector’s local variables and mistake them for
Within heap allocations themselves
Heap objects can reference each other, and you could have a network of leaked objects that all reference each other yet as a group, they are leaked. You don't want to see their pointers to each other & treat them as "in-use", so you have to handle these carefully... and last. Once all other categories are finished, you can collapse/split your flat array of void* allocation addresses, making a separate list of "considered in-use" allocations and "not yet verified" allocations. Scan through the "considered in-use" allocations looking for potential pointers to allocations still in the "not yet verified" list. As you find any, move them from the "not yet verified" list to the end of the "considered in-use" list so that you'll eventually scan those as well.
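The two-list pass can be sketched with allocations identified by index. Here refs[i] is assumed to hold the indices of the allocations whose addresses were found inside allocation i, and roots are the allocations already flagged by the stack, static-data and thread-local scans. Both inputs and the function name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns one flag per allocation: true if it is considered in-use.
inline std::vector<bool> propagate_in_use(
        const std::vector<std::vector<std::size_t>>& refs,
        const std::vector<std::size_t>& roots) {
    std::vector<bool> in_use(refs.size(), false);
    std::vector<std::size_t> considered;  // the "considered in-use" list

    for (std::size_t r : roots) {
        assert(r < refs.size());
        if (!in_use[r]) {
            in_use[r] = true;
            considered.push_back(r);
        }
    }

    // Scan each in-use allocation. Newly discovered ones are appended to the
    // end of the list, so they get scanned later in the same loop.
    for (std::size_t next = 0; next < considered.size(); ++next) {
        for (std::size_t target : refs[considered[next]]) {
            assert(target < refs.size());
            if (!in_use[target]) {
                in_use[target] = true;
                considered.push_back(target);
            }
        }
    }
    return in_use;  // anything still false is "not yet verified", i.e. a leak candidate
}
```

A group of allocations that only reference each other never enters the list, so the whole cycle is reported as leaked.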
In statically allocated writable memory (.bss & .data/.sdata, but not .rdata/.rodata)
For this, you’ll need to get symbols from your linker to the start & end (or length) of each of these sections. If such symbols don’t already exist or you can’t get that information from a platform API, you’ll need to get your linker command script (linker script) and modify it to add & initialize global symbols to the start address & end address (or length) of each of these sections. The .bss section contains uninitialized global, file scope, and class static data members. The .data/.sdata section(s) contain non-const pre-initialized global, file scope, and class static data members. You don’t need to worry about the .rdata/.rodata section(s) because your program won’t be writing heap-allocation addresses into static const data.
In thread-local storage space for each thread
For this, you’ll have to be able to query your thread manager for the thread-local storage space for each thread, or else part of the startup of each thread must be to add its thread-local storage to a list of thread-local space for the application, and remove it when the thread exits.
If you’re still on board and want to do this, by now you’ve probably realized it’s a bigger project than you may have initially thought. Let me know how it goes!

std::list vs std::vector iteration

It is said that iterating through a vector (as in reading all its elements) is faster than iterating through a list, because of better cache usage.
Is there any resource on the web that quantifies how much it impacts performance?
Also, would it be better to use a custom linked list, whose elements would be preallocated so that they are consecutive in memory?
The idea behind that is that I want to store elements in a certain order that won't change. I still need to be able to insert some in the middle quickly at run time, but most of them will still be consecutive, because the order won't change.
Does the fact that the elements are consecutive have an impact on the cache, or does it not improve anything because I'll still call list_element->next instead of ++list_element?
The main difference between vectors and lists is that in a vector the elements are constructed one after another inside a preallocated buffer, while in a list each element is allocated and constructed individually.
As a consequence, elements in a vector are guaranteed to occupy a contiguous memory space, while list elements (barring specific situations, like a custom allocator working that way) are not, and can be scattered around memory.
Now, since the processor operates on a cache (that can be up to 1000 times faster than the main RAM) that remaps entire pages of the main memory, if elements are consecutive it is highly probable that they fit in the same memory page and hence are moved all together into the cache when iteration begins. While proceeding, everything happens in the cache without further moving of data or further access to the slower RAM.
With lists, since elements are scattered everywhere, "going to the next" means referring to an address that may not be in the same memory page as the previous one, and hence the cache needs to be updated on every iteration step, accessing the slower RAM on each iteration.
The performance difference greatly depends on the processor and on the type of memory used for both the main RAM and the cache, and on the way std::allocator (and ultimately operator new and malloc) is implemented, so a general number is impossible to give.
(Note: a great difference means slow RAM relative to the cache, but may also mean a bad list implementation.)
The efficiency gains from cache coherency due to compact representation of data structures can be rather dramatic. In the case of vectors compared to lists, compact representation can be better not just for read but even for insertion (shifting in vectors) of elements up to the order of 500K elements for some particular architecture as demonstrated in Figure 3 of this article by Bjarne Stroustrup:
http://www2.research.att.com/~bs/Computer-Jan12.pdf
(Publisher site: http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2011.353)
I think that if this is a critical factor for your program, you should profile it on your architecture.
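A very rough starting point for such a measurement, assuming int elements and a single pass. Timed, Comparison, timed_sum and compare_iteration are illustrative names. A freshly built list tends to have its nodes allocated close together, so this probably understates the gap you would see with a list that has been modified for a while.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

struct Timed {
    std::int64_t sum;     // result of the traversal, so it cannot be optimized away
    double microseconds;  // wall-clock time of one pass
};

// Sums every element once and reports how long the pass took.
template <class Container>
Timed timed_sum(const Container& c) {
    auto start = std::chrono::steady_clock::now();
    std::int64_t sum = std::accumulate(c.begin(), c.end(), std::int64_t{0});
    auto stop = std::chrono::steady_clock::now();
    return Timed{sum, std::chrono::duration<double, std::micro>(stop - start).count()};
}

struct Comparison {
    Timed vector_result;
    Timed list_result;
};

// Builds a vector and a list with the same n values and times one pass over each.
inline Comparison compare_iteration(std::size_t n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    std::list<int> l(v.begin(), v.end());
    Comparison result{timed_sum(v), timed_sum(l)};
    assert(result.vector_result.sum == result.list_result.sum);  // same data seen
    return result;
}
```

Run it with optimizations enabled, with sizes both smaller and larger than your cache, and repeat each measurement several times before drawing conclusions.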
Not sure if I can explain it right, but here's my view (I'm thinking along the lines of the translated machine instructions below):
Vector iterator (contiguous memory):
When you increment a vector iterator, the size of the object (known at compile time) is simply added to the iterator value to point to the next object. On most CPUs this takes anything from one to three instructions at most.
List iterator (linked list http://www.sgi.com/tech/stl/List.html):
When you increment a list iterator, the location of the forward link is found by adding some offset to the base of the pointed-to object, and the link is then loaded as the new value of the iterator. This takes more than one memory access and is slower than the vector iteration operation.