Overriding global operator new to track huge memory allocations?

Overriding global operator new to track huge memory allocations? - c++

I am trying to produce a special build of a large monolithic application. The problem I am trying to solve is tracking hard-to-reproduce huge memory allocations (30-80 gigabytes, judging by what OS reports).
I believe the problem is an std::vector resized to a negative 32-bit integer value. The only platform exhibiting this behavior is Solaris (maybe it's the only platform that manages to successfully allocate such chunks of contiguous memory).
Can I globally replace std::vector with my class, delegating all calls to the real vector, watching for suspicious allocations (size > 0x7FFFFFFFu)? Maybe selectively replace the constructor that takes size_t and the resize() methods? Maybe even hijacking the global operator new?

Why not to do something like this?
void *operator new(size_t size)
{
// if (size > MAX_SIZE) ...
return malloc(size);
}
void *operator new [](size_t size)
{
// if (size > MAX_SIZE) ...
return malloc(size);
}
Setting a breakpoint in the if would find the problem right away.

You can provide a custom allocator on your vector at the time it's constructed.
You could just delegate to std::allocator, and firewall the requested memory size, in the first instance.

Take a look at the implementation of the std::vector class on the problem platform. Each implementation handles memory management differently (e.g. some double the currently allocated space when you add an object outside the vector's currently allocation size). If your objects are sufficiently large and/or you have a large number of entries being added to the vector, would be possible to attempt to allocate beyond the available (contiguous) memory on the computer. If that is the case, you'll want to look into a custom allocator for that vector.
If you're storing that many large items in a vector, you may want to look into another collection (e.g. std::list) or try storing pointers instead of actual objects.

You can supply your own allocator type to std::vector to track allocation.
But I doubt that's the reason. First, looking at the sizes (30-80GB) I conclude it's a 64-bit code. How could 32-bit negative integer value make it to vector size, which is 64-bit, it would have been promoted to 64-bit first to preserve value? Second, if this problem only occurs on Solaris then it can indicate a different problem. As far as I remember, Solaris is the only OS that commits memory on allocation, the other operating systems only mark the address space allocated until those memory pages are actually used. So I would search for unused allocations.

Related

C++ replacement for C99 VLAs (goal: preserve performance)

I am porting some C99 code that makes heavy use of variable length arrays (VLA) to C++.
I replaced the VLAs (stack allocation) with an array class that allocates memory on the heap. The performance hit was huge, a slowdown of a factor of 3.2 (see benchmarks below). What fast VLA replacement can I use in C++? My goal is to minimize performance hit when rewriting the code for C++.
One idea that was suggested to me was to write an array class that contains a fixed-size storage within the class (i.e. can be stack-allocated) and uses it for small arrays, and automatically switches to heap allocation for larger arrays. My implementation of this is at the end of the post. It works fairly well, but I still cannot reach the performance of the original C99 code. To come close to it, I must increase this fixed-size storage (MSL below) to sizes which I am not comfortable with. I don't want to allocate too-huge arrays on the stack even for the many small arrays that don't need it because I worry that it will trigger a stack overflow. A C99 VLA is actually less prone to this because it will never use more storage than needed.
I came upon std::dynarray, but my understanding is that it was not accepted into the standard (yet?).
I know that clang and gcc support VLAs in C++, but I need it to work with MSVC too. In fact better portability is one of the main goals of rewriting as C++ (the other goal being making the program, which was originally a command line tool, into a reusable library).
Benchmark
MSL refers to the array size above which I switch to heap-allocation. I use different values for 1D and 2D arrays.
Original C99 code: 115 seconds.
MSL = 0 (i.e. heap allocation): 367 seconds (3.2x).
1D-MSL = 50, 2D-MSL = 1000: 187 seconds (1.63x).
1D-MSL = 200, 2D-MSL = 4000: 143 seconds (1.24x).
1D-MSL = 1000, 2D-MSL = 20000: 131 (1.14x).
Increasing MSL further improves performance more, but eventually the program will start returning wrong results (I assume due to stack overflow).
These benchmarks are with clang 3.7 on OS X, but gcc 5 shows very similar results.
Code
This is the current "smallvector" implementation I use. I need 1D and 2D vectors. I switch to heap-allocation above size MSL.
template<typename T, size_t MSL=50>
class lad_vector {
const size_t len;
T sdata[MSL];
T *data;
public:
explicit lad_vector(size_t len_) : len(len_) {
if (len <= MSL)
data = &sdata[0];
else
data = new T[len];
}
~lad_vector() {
if (len > MSL)
delete [] data;
}
const T &operator [] (size_t i) const { return data[i]; }
T &operator [] (size_t i) { return data[i]; }
operator T * () { return data; }
};
template<typename T, size_t MSL=1000>
class lad_matrix {
const size_t rows, cols;
T sdata[MSL];
T *data;
public:
explicit lad_matrix(size_t rows_, size_t cols_) : rows(rows_), cols(cols_) {
if (rows*cols <= MSL)
data = &sdata[0];
else
data = new T[rows*cols];
}
~lad_matrix() {
if (rows*cols > MSL)
delete [] data;
}
T const * operator[] (size_t i) const { return &data[cols*i]; }
T * operator[] (size_t i) { return &data[cols*i]; }
};

Create a large buffer (MB+) in thread-local storage. (Actual memory on heap, management in TLS).
Allow clients to request memory from it in FILO manner (stack-like). (this mimics how it works in C VLAs; and it is efficient, as each request/return is just an integer addition/subtraction).
Get your VLA storage from it.
Wrap it pretty, so you can say stack_array<T> x(1024);, and have that stack_array deal with construction/destruction (note that ->~T() where T is int is a legal noop, and construction can similarly be a noop), or make stack_array<T> wrap a std::vector<T, TLS_stack_allocator>.
Data will be not as local as the C VLA data is because it will be effectively on a separate stack. You can use SBO (small buffer optimization), which is when locality really matters.
A SBO stack_array<T> can be implemented with an allocator and a std vector unioned with a std array, or with a unique ptr and custom destroyer, or a myriad of other ways. You can probably retrofit your solution, replacing your new/malloc/free/delete with calls to the above TLS storage.
I say go with TLS as that removes need for synchronization overhead while allowing multi-threaded use, and mirrors the fact that the stack itself is implicitly TLS.
Stack-buffer based STL allocator? is a SO Q&A with at least two "stack" allocators in the answers. They will need some adaption to automatically get their buffer from TLS.
Note that the TLS being one large buffer is in a sense an implementation detail. You could do large allocations, and when you run out of space do another large allocation. You just need to keep track each "stack page" current capacity and a list of stack pages, so when you empty one you can move onto an earlier one. That lets you be a bit more conservative in your TLS initial allocation without worrying about running OOM; the important part is that you are FILO and allocate rarely, not that the entire FILO buffer is one contiguous one.

I think you have already enumerated most options in your question and the comments.
Use std::vector. This is the most obvious, most hassle-free but maybe also the slowest solution.
Use platform-specific extensions on those platforms that provide them. For example, GCC supports variable-length arrays in C++ as an extension. POSIX specifies alloca which is widely supported to allocate memory on the stack. Even Microsoft Windows provides _malloca, as a quick web search told me.
In order to avoid maintenance nightmares, you'll really want to encapsulate these platform dependencies into an abstract interface that automatically and transparently chooses the appropriate mechanism for the current platform. Implementing this for all platforms will be a bit of work but if this single feature accounts for 3 × speed differences as you're reporting, it might be worth it. As a fallback for unknown platforms, I'd keep std::vector in reserve as a last resort. It is better to run slow but correctly than to behave erratic or not run at all.
Build your own variable-sized array type that implements a “small array” optimization embedded as a buffer inside the object itself as you have shown in your question. I'll just note that I'd rather try using a union of a std::array and a std::vector instead of rolling my own container.
Once you have a custom type in place, you can do interesting profiling such as maintaining a global hash table of all occurrences of this type (by source-code location) and recording each allocation size during a stress test of your program. You can then dump the hash table at program exit and plot the distributions in allocation sizes for the individual arrays. This might help you to fine-tune the amount of storage to reserve for each array individually on the stack.
Use a std::vector with a custom allocator. At program startup, allocate a few megabytes of memory and give it to a simple stack allocator. For a stack allocator, allocation is just comparing and adding two integers and deallocation is simply a subtraction. I doubt that the compiler-generated stack allocation can be much faster. Your “array stack” would then pulsate correlated to your “program stack”. This design would also have the advantage that accidental buffer overruns – while still invoking undefined behavior, trashing random data and all that bad stuff – wouldn't as easily corrupt the program stack (return addresses) as they would with native VLAs.
Custom allocators in C++ are a somewhat dirty business but some people do report they're using them successfully. (I don't have much experience with using them myself.) You might want to start looking at cppreference. Alisdair Meredith who is one of those people that promote the usage of custom allocators gave a double-session talk at CppCon'14 titled “Making Allocators Work” (part 1, part 2) that you might find interesting as well. If the std::allocator interface it too awkward to use for you, implementing your own variable (as opposed to dynamically) sized array class with your own allocator should be doable as well.

Regarding support for MSVC:
MSVC has _alloca which allocates stack space. It also has _malloca which allocates stack space if there is enough free stack space, otherwise falls back to dynamic allocation.
You cannot take advantage of the VLA type system, so you would have to change your code to work based in a pointer to first element of such an array.
You may end up needing to use a macro which has different definitions depending on the platform. E.g. invoke _alloca or _malloca on MSVC, and on g++ or other compilers, either calls alloca (if they support it), or makes a VLA and a pointer.
Consider investigating ways to rewrite the code without needing to allocate an unknown amount of stack. One option is to allocate a fixed-size buffer that is the maximum you will need. (If that would cause stack overflow it means your code is bugged anyway).

`std::string` allocations are my current bottleneck - how can I optimize with a custom allocator?

I'm writing a C++14 JSON library as an exercise and to use it in my personal projects.
By using callgrind I've discovered that the current bottleneck during a continuous value creation from string stress test is an std::string dynamic memory allocation. Precisely, the bottleneck is the call to malloc(...) made from std::string::reserve.
I've read that many existing JSON libraries such as rapidjson use custom allocators to avoid malloc(...) calls during string memory allocations.
I tried to analyze rapidjson's source code but the large amount of additional code and comments, plus the fact that I'm not really sure what I'm looking for, didn't help me much.
How do custom allocators help in this situation?
Is a memory buffer preallocated somewhere (where? statically?) and std::strings take available memory from it?
Are strings using custom allocators "compatible" with normal strings?
They have different types. Do they have to be "converted"? (And does that result in a performance hit?)
Code notes:
Str is an alias for std::string.

By default, std::string allocates memory as needed from the same heap as anything that you allocate with malloc or new. To get a performance gain from providing your own custom allocator, you will need to be managing your own "chunk" of memory in such a way that your allocator can deal out the amounts of memory that your strings ask for faster than malloc does. Your memory manager will make relatively few calls to malloc, (or new, depending on your approach) under the hood, requesting "large" amounts of memory at once, then deal out sections of this (these) memory block(s) through the custom allocator. To actually achieve better performance than malloc, your memory manager will usually have to be tuned based on known allocation patterns of your use cases.
This kind of thing often comes down to the age-old trade off of memory use versus execution speed. For example: if you have a known upper bound on your string sizes in practice, you can pull tricks with over-allocating to always accommodate the largest case. While this is wasteful of your memory resources, it can alleviate the performance overhead that more generalized allocation runs into with memory fragmentation. As well as making any calls to realloc essentially constant time for your purposes.
#sehe is exactly right. There are many ways.
EDIT:
To finally address your second question, strings using different allocators can play nicely together, and usage should be transparent.
For example:
class myalloc : public std::allocator<char>{};
myalloc customAllocator;
int main(void)
{
std::string mystring(customAllocator);
std::string regularString = "test string";
mystring = regularString;
std::cout << mystring;
return 0;
}
This is a fairly silly example and, of course, uses the same workhorse code under the hood. However, it shows assignment between strings using allocator classes of "different types". Implementing a useful allocator that supplies the full interface required by the STL without just disguising the default std::allocator is not as trivial. This seems to be a decent write up covering the concepts involved. The key to why this works, in the context of your question at least, is that using different allocators doesn't cause the strings to be of different type. Notice that the custom allocator is given as an argument to the constructor not a template parameter. The STL still does fun things with templates (such as rebind and Traits) to homogenize allocator interfaces and tracking.

What often helps is the creation of a GlobalStringTable.
See if you can find portions of the old NiMain library from the now defunct NetImmerse software stack. It contains an example implementation.
Lifetime
What is important to note is that this string table needs to be accessible between different DLL spaces, and that it is not a static object. R. Martinho Fernandes already warned that the object needs to be created when the application or DLL thread is created / attached, and disposed when the thread is destroyed or the dll is detached, and preferrably before any string object is actually used. This sounds easier than it actually is.
Memory allocation
Once you have a single point of access that exports correctly, you can have it allocate a memory buffer up-front. If the memory is not enough, you have to resize it and move the existing strings over. Strings essentially become handles to regions of memory in this buffer.
Placement new
Something that often works well is called the placement new() operator, where you can actually specify where in memory your new string object needs to be allocated. However, instead of allocating, the operator can simply grab the memory location that is passed in as an argument, zero the memory at that location, and return it. You can also keep track of the allocation, the actual size of the string etc.. in the Globalstringtable object.
SOA
Handling the actual memory scheduling is something that is up to you, but there are many possible ways to approach this. Often, the allocated space is partitioned in several regions so that you have several blocks per possible string size. A block for strings <= 4 bytes, one for <= 8 bytes, and so on. This is called a Small Object Allocator, and can be implemented for any type and buffer.
If you expect many string operations where small strings are incremented repeatedly, you may change your strategy and allocate larger buffers from the start, so that the number of memmove operations are reduced. Or you can opt for a different approach and use string streams for those.
String operations
It is not a bad idea to derive from std::basic_str, so that most of the operations still work but the internal storage is actually in the GlobalStringTable, so that you can keep using the same stl conventions. This way, you also make sure that all the allocations are within a single DLL, so that there can be no heap corruption by linking different kinds of strings between different libraries, since all the allocation operations are essentially in your DLL (and are rerouted to the GlobalStringTable object)

Custom allocators can help because most malloc()/new implementations are designed for maximum flexibility, thread-safety and bullet-proof workings. For instance, they must gracefully handle the case that one thread keeps allocating memory, sending the pointers to another thread that deallocates them. Things like these are difficult to handle in a performant way and drive the cost of malloc() calls.
However, if you know that some things cannot happen in your application (like one thread deallocating stuff another thread allocated, etc.), you can optimize your allocator further than the standard implementation. This can yield significant results, especially when you don't need thread safety.
Also, the standard implementation is not necessarily well optimized: Implementing void* operator new(size_t size) and void operator delete(void* pointer) by simply calling through to malloc() and free() gives an average performance gain of 100 CPU cycles on my machine, which proves that the default implementation is suboptimal.

I think you'd be best served by reading up on the EASTL
It has a section on allocators and you might find fixed_string useful.

The best way to avoid a memory allocation is don't do it!
BUT if I remember JSON correctly all the readStr values either gets used as keys or as identifiers so you will have to allocate them eventually, std::strings move semantics should insure that the allocated array are not copied around but reused until its final use. The default NRVO/RVO/Move should reduce any copying of the data if not of the string header itself.
Method 1:
Pass result as a ref from the caller which has reserved SomeResonableLargeValue chars, then clear it at the start of readStr. This is only usable if the caller actually can reuse the string.
Method 2:
Use the stack.
// Reserve memory for the string (BOTTLENECK)
if (end - idx < SomeReasonableValue) { // 32?
char result[SomeReasonableValue] = {0}; // feel free to use std::array if you want bounds checking, but the preceding "if" should insure its not a problem.
int ridx = 0;
for(; idx < end; ++idx) {
// Not an escape sequence
if(!isC('\\')) { result[ridx++] = getC(); continue; }
// Escape sequence: skip '\'
++idx;
// Convert escape sequence
result[ridx++] = getEscapeSequence(getC());
}
// Skip closing '"'
++idx;
result[ridx] = 0; // 0-terminated.
// optional assert here to insure nothing went wrong.
return result; // the bottleneck might now move here as the data is copied to the receiving string.
}
// fallback code only if the string is long.
// Your original code here
Method 3:
If your string by default can allocate some size to fill its 32/64 byte boundary, you might want to try to use that, construct result like this instead in case the constructor can optimize it.
Str result(end - idx, 0);
Method 4:
Most systems already has some optimized allocator that like specific block sizes, 16,32,64 etc.
siz = ((end - idx)&~0xf)+16; // if the allocator has chunks of 16 bytes already.
Str result(siz);
Method 5:
Use either the allocator made by google or facebooks as global new/delete replacement.

To understand how a custom allocator can help you, you need to understand what malloc and the heap does and why it is quite slow in comparison to the stack.
The Stack
The stack is a large block of memory allocated for your current scope. You can think of it as this
([] means a byte of memory)
[P][][][][][][][][][][][][][][][]
(P is a pointer that points to a specific byte of memory, in this case its pointing at the first byte)
So the stack is a block with only 1 pointer. When you allocate memory, what it does is it performs a pointer arithmetic on P, which takes constant time.
So declaring int i = 0; would mean this,
P + sizeof(int).
[i][i][i][i][P][][][][][][][][][][][],
(i in [] is a block of memory occupied by an integer)
This is blazing fast and as soon as you go out of scope, the entire chunk of memory is emptied simply by moving P back to the first position.
The Heap
The heap allocates memory from a reserved pool of bytes reserved by the c++ compiler at runtime, when you call malloc, the heap finds a length of contiguous memory that fits your malloc requirements, marks it as used so nothing else can use it, and returns that to you as a void*.
So, a theoretical heap with little optimization calling new(sizeof(int)), would do this.
Heap chunk
At first : [][][][][][][][][][][][][][][][][][][][][][][][][]
Allocate 4 bytes (sizeof(int)):
A pointer goes though every byte of memory, finds one that is of correct length, and returns to you a pointer.
After : [i][i][i][i][][][]][][][][][][][][][]][][][][][][][]
This is not an accurate representation of the heap, but from this you can already see numerous reasons for being slow relative to the stack.
The heap is required to keep track of all already allocated memory and their respective lengths. In our test case above, the heap was already empty and did not require much, but in worst case scenarios, the heap will be populated with multiple objects with gaps in between (heap fragmentation), and this will be much slower.
The heap is required to cycle though all the bytes to find one that fits your length.
The heap can suffer from fragmentation since it will never completely clean itself unless you specify it. So if you allocated an int, a char, and another int, your heap would look like this
[i][i][i][i][c][i2][i2][i2][i2]
(i stands for bytes occupied by int and c stands for bytes occupied by a char. When you de-allocate the char, it will look like this.
[i][i][i][i][empty][i2][i2][i2][i2]
So when you want to allocate another object into the heap,
[i][i][i][i][empty][i2][i2][i2][i2][i3][i3][i3][i3]
unless an object is the size of 1 char, the overall heap size for that allocation is reduced by 1 byte. In more complex programs with millions of allocations and deallocations, the fragmentation issue becomes severe and the program will become unstable.
Worry about cases like thread safety (Someone else said this already).
Custom Heap/Allocator
So, a custom allocator usually needs to address these problems while providing the benefits of the heap, such as personalized memory management and object permanence.
These are usually accomplished with specialized allocators. If you know you dont need to worry about thread safety or you know exactly how long your string will be or a predictable usage pattern you can make your allocator fast than malloc and new by quite a lot.
For example, if your program requires a lot of allocations as fast as possible without lots of deallocations, you could implement a stack allocator, in which you allocate a huge chunk of memory with malloc at startup,
e.g
typedef char* buffer;
//Super simple example that probably doesnt work.
struct StackAllocator:public Allocator{
buffer stack;
char* pointer;
StackAllocator(int expectedSize){ stack = new char[expectedSize];pointer = stack;}
allocate(int size){ char* returnedPointer = pointer; pointer += size; return returnedPointer}
empty() {pointer = stack;}
};
Get expected size, get a chunk of memory from the heap.
Assign a pointer to the beginning.
[P][][][][][][][][][] ..... [].
then have one pointer that moves for each allocation. When you no longer need the memory, you simply move the pointer to the beginning of your buffer. This gives your the advantage of O(1) speed allocations and deallocations as well as object permanence for the lack of flexible deallocation and large initial memory requirements.
For strings, you could try a chunk allocator. For every allocation, the allocator gives a set chunk of memory.
Compatibility
Compatibility with other strings is almost guaranteed. As long as you are allocating a contiguous chunk of memory and preventing anything else from using that block of memory, it will work.

Is it possible to implement a memory pool that works with arrays instead of single objects?

I know it's easy to make a memory pool for single objects, however I need to make a memory pool for arrays. The memory pool I have currently has a vector of addresses to contiguous memory blocks and a stack that points to each object from these blocks, so when you allocate from the pool you just pop the stack and when you free, you just push an object's address back to it. However I also need an array equivalent. Something like this:
template<typename T>
class ArrayPool
{
public:
ArrayPool();
~ArrayPool();
T* AllocateArray(int x); //Returns a pointer to a T array that contains 'x' elements.
void FreeArray(T* arr, int x); //Returns the array to the free address list/stack/whatever/
};
Has such a thing been implemented? I imagine a big problem from having such a pool - if make sure arrays returned by ALlocateArray are contiguous in memory, I'm basically doing the same as if not having a memorypool. Just allocating arrays on the spot. With the normal object pool every time I just allocate 1 object. With the arrays I may allocate a different sized array every time, so once an array is freed, it won't be compatible with a new one of different size, unless I stich arrays together with some linkedlist-like structure, but then they won't be contiguous.

Currently your allocator takes advantage of the fact that all allocations are the same size. This simplifies and speeds up allocation and freeing, and means memory fragmentation is impossible.
If you have to allocate arrays of any size, then what you want is a general-purpose allocator, not a pool allocator. What to do next depends why you're using a pool allocator in the first place. I can think of two other features of a pool allocator that might be relevant, and there may be others:
all memory comes from a particular region specified when you create the pool
all memory can be freed at once without freeing each individual allocation, by resetting the pool.
If you don't need any special features of controlling allocation yourself then just use vector or global operator new or malloc to allocate your memory. If you do need special features then you'll probably want to take an allocator off the shelf rather than implementing your own. If you really want to get into the details of how a good memory allocator works then look at http://g.oswego.edu/dl/html/malloc.html and perhaps adapt it to your use.
But if you really need to hand-roll an allocator for limited purposes, then the basic idea is that instead of a list of free nodes from which you can always take the first, you need some data structure (your choice what) containing free blocks of different sizes, that allows you to quickly find a block that's big enough to satisfy the current request. In the case where it's much bigger you might choose to split the block, return part of it, and keep the rest as a new smaller free block. In the case where two free blocks are adjacent you might choose to merge them into a single larger free block.
One common strategy is to keep pool-like lists of blocks of certain sizes (for example 16, 32, 64...). If the request is small enough, satisfy it using one of these. If not, do something more complex. But as I say, if you want to see a lot of tricks working together then look at dlmalloc.

What you could do is having fixed sizes and only work on those. For example 400st 32 byte arrays, 200 128b, 100 1024b, 50 8096b or something like that. When something ask for an array of size N you match to the closest size with a free array.
How many you need to each size is probably up for a lot of tweaking.
That would allow you to re-use arrays much more freely than allowing custom sizes.

What exactly are you trying to win from this? Why isn't it enough just to treat each array as an object? Unless you are direly strapped for memory or the time to construct the array elements is really excessive and not to be wasted, this sounds like a classic case of premature optimization. And if the above are your problems, I'd explore other data structures (not arrays) first before plunging into this.
Your time (getting this working and its quirks ironed out will be a week or so, methinks) is way more valuable than a few pennies of computer time or memory saved.

Linux Memory Usage in top when using std::vector versus std::list

I have noticed some interesting behavior in Linux with regard to the Memory Usage (RES) reported by top. I have attached the following program which allocates a couple million objects on the heap, each of which has a buffer that is around 1 kilobyte. The pointers to those objects are tracked by either a std::list, or a std::vector. The interesting behavior I have noticed is that if I use a std::list, the Memory Usage reported by top never changes during the sleep periods. However if I use std::vector, the memory usage will drop to near 0 during those sleeps.
My test configuration is:
Fedora Core 16
Kernel 3.6.7-4
g++ version 4.6.3
What I already know:
1. std::vector will re-allocate (doubling its size) as needed.
2. std::list (I beleive) is allocating its elements 1 at a time
3. both std::vector and std::list are using std::allocator by default to get their actual memory
4. The program is not leaking; valgrind has declared that no leaks are possible.
What I'm confused by:
1. Both std::vector and std::list are using std::allocator. Even if std::vector is doing batch re-allocations, wouldn't std::allocator be handing out memory in almost the same arrangement to std::list and std::vector? This program is single threaded after all.
2. Where can I learn about the behavior of Linux's memory allocation. I have heard statements about Linux keeping RAM assigned to a process even after it frees it, but I don't know if that behavior is guaranteed. Why does using std::vector impact that behavior so much?
Many thanks for reading this; I know this is a pretty fuzzy problem. The 'answer' I'm looking for here is if this behavior is 'defined' and where I can find its documentation.
#include <string.h>
#include <unistd.h>
#include <iostream>
#include <vector>
#include <list>
#include <iostream>
#include <memory>
class Foo{
public:
Foo()
{
data = new char[999];
memset(data, 'x', 999);
}
~Foo()
{
delete[] data;
}
private:
char* data;
};
int main(int argc, char** argv)
{
for(int x=0; x<10; ++x)
{
sleep(1);
//std::auto_ptr<std::list<Foo*> > foos(new std::list<Foo*>);
std::auto_ptr<std::vector<Foo*> > foos(new std::vector<Foo*>);
for(int i=0; i<2000000; ++i)
{
foos->push_back(new Foo());
}
std::cout << "Sleeping before de-alloc\n";
sleep(5);
while(false == foos->empty())
{
delete foos->back();
foos->pop_back();
}
}
std::cout << "Sleeping after final de-alloc\n";
sleep(5);
}

The freeing of memory is done on a "chunk" basis. It's quite possible that when you use list, the memory gets fragmented into little tiny bits.
When you allocate using a vector, all elements are stored in one big chunk, so it's easy for the memory freeing code to say "Golly, i've got a very large free region here, I'm going to release it back to the OS". It's also entirely possible that when growing the vector, the memory allocator goes into "large chunk mode", which uses a different allocation method than "small chunk mode" - say for example you allocate more than 1MB, the memory allocation code may see that as a good time to start using a different strategy, and just ask the OS for a "perfect fit" piece of memory. This large block is very easy to release back to he OS when it's being freed.
On the ohter hand if you are adding to a list, you are constantly asking for little bits, so the allocator uses a different strategy of asking for large block and then giving out small portions. It's both difficult and time-consuming to ensure that ALL blocks within a chunk have been freed, so the allocator may well "not bother" - because chances are that there are some regions in there "still in use", and then it can't be freed at all anyways.
I would also add that using "top" as a memory measure isn't a particularly accurate method, and is very unreliable, as it very much depends on what the OS and the runtime library does. Memory belonging to a process may not be "resident", but the process still hasn't freed it - it's just not "present in actual memory" (out in the swap partition instead!)
And to your question "is this defined somewhere", I think it is in the sense that the C/C++ library source cod defines it. But it's not defined in the sense that somewhere it's written that "This is how it's meant to work, and we promise never to hange it". The libraries supplied as glibc and libstdc++ are not going to say that, they will change the internals of malloc, free, new and delete as new technologies and ideas are invented - some may make things better, others may make it worse, for a given scenario.
As has been pointed out in the comments, the memory is not locked to the process. If the kernel feels that the memory is better used for something else [and the kernel is omnipotent here], then it will "steal" the memory from one running process and give it to another. Particularly memory that hasn't been "touched" for a long time.

1 . Both std::vector and std::list are using std::allocator. Even if std::vector is doing batch re-allocations, wouldn't std::allocator be
handing out memory in almost the same arrangement to std::list and
std::vector? This program is single threaded after all.
Well, what are the differences?
std::list allocates nodes one-by-one (each node needs two pointers in addition to your Foo *). Also, it never re-allocates these nodes (this is guaranteed by the iterator invalidation requirements for list). So, the std::allocator will request a sequence of fixed-size chunks from the underlying mechanism (probably malloc which will in turn use the sbrk or mmap system calls). These fixed-size chunks may well be larger than a list node, but if so they'll all be the same default chunk size used by std::allocator.
std::vector allocates a contiguous block of pointers with no book-keeping overhead (that's all in the vector parent object). Every time a push_back would overflow the current allocation, the vector will allocate a new, larger chunk, move everything across to the new chunk, and release the old one. Now, the new chunk will be something like double (or 1.6 times, or whatever) the size of the old one, as is required to keep the amortized constant time guarantee for push_back. So, pretty quickly, I'd expect the sizes it requests to exceed any sensible default chunk size for std::allocator.
So, the the interesting interactions are different: one between between std::vector and the allocator's underlying mechanism, and one between the std::allocator itself and that underlying mechanism.
2 . Where can I learn about the behavior of Linux's memory allocation. I have heard statements about Linux keeping RAM assigned to a process
even after it frees it, but I don't know if that behavior is
guaranteed. Why does using std::vector impact that behavior so much?
There are several levels you might care about:
The container's own allocation pattern: which is hopefully described above
note that in real-world applications, the way a container is used is just as important
std::allocator itself, which may provide a layer of buffering for small allocations
I don't think this is required by the standard, so it's specific to your implementation
The underlying allocator, which depends on your std::allocator implementation (it could for example be malloc, however that is implemented by your libc)
The VM scheme used by the kernel, and its interactions with whatever syscall (3) ultimately uses
In your particular case, I can think of a possible explanation for the vector apparently releasing more memory than the list.
Consider that the vector ends up with a single contiguous allocation, and lots of the Foos will also be allocated contiguously. This means that when you release all this memory, it's pretty easy to figure out that most of the underlying pages are genuinely free.
Now consider that the list node allocations are interleaved 1:1 with the Foo instances. Even if the allocator did some batching, it seems likely that the heap is much more fragmented than in the std::vector case. Therefore, when you release the allocated records, some work would be required to figure out whether an underlying page is now free, and there's no particular reason to expect this will happen (unless a subsequent large allocation encouraged coalescing of heap records).

The answer is the malloc "fastbins" optimization.
std::list creates tiny (less then 64 bytes) allocations and when it frees them up they are not actually freed - but goes to the fastblock pool.
This behavior means that the heap stays fragmented even AFTER the list is cleared and therefore it does not return to the system.
You can either use malloc_trim(128*1024) in order to forcibly clear them.
Or use mallopt(M_MXFAST, 0) in order to disable fastbins altogether.
I find the first solution to be more correct if you call it when you really don't need the memory anymore.

Smaller chunks go through brk and adjusting the data segment and constant splitting and fusion and bigger chunks mmap the process is a little less disturbed. more info (PDF)
also ptmalloc source code.

Efficient Array Reallocation in C++

How would I efficiently resize an array allocated using some standards-conforming C++ allocator? I know that no facilities for reallocation are provided in the C++ alloctor interface, but did the C++11 revision enable us to work with them more easily? Suppose that I have a class vec with a copy-assignment operator foo& operator=(const foo& x) defined. If x.size() > this->size(), I'm forced to
Call allocator.destroy() on all elements in the internal storage of foo.
Call allocator.deallocate() on the internal storage of foo.
Reallocate a new buffer with enough room for x.size() elements.
Use std::uninitialized_copy to populate the storage.
Is there some way that I more easily reallocate the internal storage of foo without having to go through all of this? I could provide an actual code sample if you think that it would be useful, but I feel that it would be unnecessary here.

Based on a previous question, the approach that I took for handling large arrays that could grow and shrink with reasonable efficiency was to write a container similar to a deque that broke the array down into multiple pages of smaller arrays. So for example, say we have an array of n elements, we select a page size p, and create 1 + n/p arrays (pages) of p elements. When we want to re-allocate and grow, we simply leave the existing pages where they are, and allocate the new pages. When we want to shrink, we free the totally empty pages.
The downside is the array access is slightly slower, in that given and index i, you need the page = i / p, and the offset into the page i % p, to get the element. I find this is still very fast however and provides a good solution. Theoretically, std::deque should do something very similar, but for the cases I tried with large arrays it was very slow. See comments and notes on the linked question for more details.
There is also a memory inefficiency in that given n elements, we are always holding p - n % p elements in reserve. i.e. we only ever allocate or deallocate complete pages. This was the best solution I could come up with in the context of large arrays with the requirement for re-sizing and fast access, while I don't doubt there are better solutions I'd love to see them.

A similar problem also arises if x.size() > this->size() in foo& operator=(foo&& x).
No, it doesn't. You just swap.

There is no function that will resize in place or return 0 on failure (to resize). I don't know of any operating system that supports that kind of functionality beyond telling you how big a particular allocation actually is.
All operating systems do however have support for implementing realloc, however, that does a copy if it cannot resize in place.
So, you can't have it because the C++ language would not be implementable on most current operating systems if you had to add a standard function to do it.

There are the C++11 rvalue reference and move constructors.
There's a great video talk on them.

Even if re-allocate exists, actually, you can only avoid #2 you mentioned in your question in a copy constructor. However in the case of internal buffer growing, re-allocate can save these four operations.
Is internal buffer of your array continuous? if so see the answer of your link
if not, Hashed array tree or array list may be your choice to avoid re-allocate.

Interestingly, the default allocator for g++ is smart enough to use the same address for consecutive deallocations and allocations of larger sizes, as long as there is enough unused space after the end of the initially-allocated buffer. While I haven't tested what I'm about to claim, I doubt that there is much of a time difference between malloc/realloc and allocate/deallocate/allocate.
This leads to a potentially very dangerous, nonstandard shortcut that may work if you know that there is enough room after the current buffer so that a reallocation would not result in a new address. (1) Deallocate the current buffer without calling alloc.destroy() (2) Allocate a new, larger buffer and check the returned address (3) If the new address equals the old address, proceed happily; otherwise, you lost your data (4) Call allocator.construct() for elements in the newly-allocated space.
I wouldn't advocate using this for anything other than satisfying your own curiosity, but it does work on g++ 4.6.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js