Avoid memory fragmentation when memory pools are a bad idea - C++

I am developing a C++ application where the program runs endlessly, allocating and freeing millions of strings (char*) over time, and RAM usage is a serious consideration. The RAM usage just keeps getting higher and higher over time. I think the problem is heap fragmentation, and I really need to find a solution.
You can see in the image that after millions of allocations and frees the usage keeps increasing, even though, the way I am testing it, I know for a fact that the amount of data the program stores is not increasing. You will probably ask, "How are you sure of that? How are you sure it's not just a memory leak?" Well:
This test ran much longer. I call malloc_trim(0) whenever possible in my program, and it seems the application can eventually return the unused memory to the OS; usage drops almost to zero (the actual size of the data my program currently holds). This implies the problem is not a memory leak. But I can't rely on this behavior: the allocation and freeing pattern of my program is random, so what if it never releases the memory?
I said in the title that memory pools are a bad idea for this project. Of course I can't know that with absolute certainty, but the strings I am allocating can be anything between 30 and 4000 bytes, which makes many optimizations and clever ideas much harder. Memory pools are one of them.
I am using GCC 11 / G++ 11 as the compiler, so if some old versions had bad allocators, I shouldn't have that problem.
How am I measuring memory usage? With the Python psutil module: proc.memory_full_info()[0], which gives me the RSS.
Of course, you don't know the details of my program, so it is still a valid question whether this is indeed heap fragmentation. What I can say is that I keep up-to-date counts of how many allocations and frees have taken place, and I know the element count of every container in my program. But if you still have ideas about the cause of the problem, I am open to suggestions.
I can't just allocate, say, 4096 bytes for every string so it becomes easier to optimize; that's the opposite of what I am trying to do.
So my question is: what do programmers do (what should I do) in an application where millions of allocs and frees of different sizes take place over time, so that memory pools are hard to use efficiently? I can't change what the program does, I can only change implementation details.
Bounty Edit: When trying to use memory pools, isn't it possible to make multiple pools, to the point where there is a pool for every possible byte count? For example, my strings can be anywhere between 30 and 4000 bytes, so couldn't someone make 4000 - 30 + 1 = 3971 memory pools, one for each and every possible allocation size in the program? Isn't this applicable? All pools could start small (so as not to lose much memory), then grow, in a balance between performance and memory. I am not trying to make use of a memory pool's ability to reserve big spaces beforehand; I am just trying to effectively reuse freed space, given the frequent allocs and frees.
Last edit: It turns out that the memory growth appearing in the graphs was actually from an HTTP request queue in my program. I failed to see that the hundreds of thousands of tests I did bloated this queue (something like a webhook). The reasonable explanation of figure 2 is that I finally got banned from the server for the DDOS-like load (or couldn't open a connection anymore for some reason), the queue emptied, and the RAM issue resolved itself. So, anyone reading this question later in the future: consider every possibility. It would never have crossed my mind that it was something like this; not a memory leak, but an implementation detail. Still, I think @Hajo Kirchhoff deserves the bounty, his answer was really enlightening.

If everything really is/works as you say it does and there is no bug you have not yet found, then try this:
malloc and other memory allocators usually work in chunks of 16 bytes anyway, even if the actual requested size is smaller than that. So you only need 4000/16 - 30/16 ≈ 250 different memory pools.
const int chunk_size = 16;
memory_pools pool[250]; // 250 memory pools, pool[idx] managing chunks of (idx+1)*chunk_size bytes

char* reserve_mem(size_t sz)
{
    size_t pool_idx_to_use = sz / chunk_size;
    char* rc = pool[pool_idx_to_use].allocate();
    return rc;
}
IOW, you have 250 memory pools. pool[0] allocates and manages chunks with a length of 16 bytes, pool[100] manages chunks of 1616 bytes ((100+1)*chunk_size), etc...
If you know the length distribution of your strings in advance, you can reserve initial memory for the pools based on this knowledge. Otherwise I'd probably just reserve memory for the pools in 4096-byte increments.
Because while the C heap behind malloc usually allocates memory in multiples of 16 bytes, it will (at least under Windows, but I am guessing Linux is similar here) ask the OS for memory, and that usually works with 4K pages. IOW, the "outer" memory heap managed by the operating system reserves and frees memory in units of 4096 bytes.
So growing your own internal memory pools in 4096-byte steps means no fragmentation in the OS-level app heap. This 4096-byte page size (or a multiple of it) comes from the processor architecture: Intel processors have a built-in page size of 4K (or multiples of it). I don't know about other processors, but I suspect similar architectures there.
So, to sum it up:
Use chunks that are multiples of 16 bytes for your strings, one chunk size per memory pool.
Grow each memory pool in multiples of 4K bytes.
That will align the memory use of your application with the memory management of the OS and avoid fragmentation as much as possible.
From the OS point of view, your application will only increment memory in 4K chunks. That's very easy to allocate and release. And there is no fragmentation.
From the internal (lib) C heap management point of view, your application will use memory pools and waste at most 15 bytes per string. Also, all allocations of similar length will be grouped together, so again no fragmentation.
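To make the idea concrete, here is a minimal sketch of what one such pool could look like. This is an illustration, not the answer's actual implementation: the class name memory_pool, the intrusive free list and the 4096-byte growth step are all assumptions.

#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// One pool per chunk size; it grows in whole 4096-byte pages and keeps freed
// chunks on an intrusive free list so they are reused, never returned to the
// C heap (chunk_bytes must be at least sizeof(node), which holds here since
// every chunk is at least 16 bytes).
class memory_pool {
public:
    explicit memory_pool(std::size_t chunk_bytes) : chunk_bytes_(chunk_bytes) {}

    ~memory_pool() { for (char* p : pages_) std::free(p); }

    char* allocate() {
        if (!free_list_) grow();                 // add another 4096-byte page
        node* n = free_list_;
        free_list_ = n->next;
        return reinterpret_cast<char*>(n);
    }

    void deallocate(char* p) {                   // chunk goes back on the free list
        node* n = reinterpret_cast<node*>(p);
        n->next = free_list_;
        free_list_ = n;
    }

private:
    struct node { node* next; };

    void grow() {
        const std::size_t page = 4096;
        char* block = static_cast<char*>(std::malloc(page));
        if (!block) throw std::bad_alloc();
        pages_.push_back(block);
        for (std::size_t off = 0; off + chunk_bytes_ <= page; off += chunk_bytes_)
            deallocate(block + off);             // thread the fresh page onto the free list
    }

    std::size_t chunk_bytes_;
    node* free_list_ = nullptr;
    std::vector<char*> pages_;                   // released only when the pool is destroyed
};

reserve_mem() above would then call allocate() on pool[sz/chunk_size], and freeing a string would call deallocate() on the same pool, which is exactly the "reuse freed space" behavior the question is after.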

Here is a solution which might help if your application has the following traits:
Runs on Windows
Has episodes where the working set is large, but all of that data is released entirely at around the same point in time (e.g. the data is processed in some kind of batch mode, the work is done when it's done, and then the related data is freed).
This approach uses a (unique?) Windows feature called custom heaps. Possibly you can build something similar yourself on other OSes.
The functions you need are declared in a header called <heapapi.h>.
Here is how it would look:
At the start of a memory-intensive phase of your program, use HeapCreate() to create a heap where all your data will go.
Perform the memory-intensive tasks.
At the end of the memory-intensive phase, free your data and call HeapDestroy().
Depending on the detailed behavior of your application (e.g. whether this memory-intensive computation runs in one or multiple threads), you can configure the heap accordingly and possibly even gain a little speed (if only one thread uses the data, you can give HeapCreate() the HEAP_NO_SERIALIZE flag, so it does not take a lock). You can also give an upper bound on what the heap will allow to store.
And once your computation is complete, destroying the heap also prevents long-term fragmentation, because when the time comes for the next computation phase, you start with a fresh heap.
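A minimal sketch of that create/use/destroy pattern (the batch loop and the sizes are placeholders, and error handling is trimmed):

#include <windows.h>   // pulls in <heapapi.h>

void run_batch_phase()
{
    // One private heap for the whole memory-intensive phase.
    HANDLE phaseHeap = HeapCreate(HEAP_NO_SERIALIZE /* single-threaded phase */, 0, 0);
    if (!phaseHeap) return;

    for (int i = 0; i < 100000; ++i) {
        char* s = static_cast<char*>(HeapAlloc(phaseHeap, 0, 4000));
        // ... fill and use s ...
        HeapFree(phaseHeap, 0, s);   // optional; HeapDestroy below reclaims everything anyway
    }

    // Returns all pages of the phase heap to the OS in one go,
    // so no long-term fragmentation is left behind.
    HeapDestroy(phaseHeap);
}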
Here is the documentation of the heap api.
In fact, I found this feature so useful that I replicated it on FreeBSD and Linux for an application I ported, using virtual memory functions and my own heap implementation. That was a few years back and I do not have the rights to that piece of code, so I cannot show or share it.
You could also combine this approach with fixed-element-size pools, using one heap for one dedicated block size, and then expect less/no fragmentation within each of those heaps (because all blocks have the same size).
#include <cassert>
#include <cstddef>
#include <windows.h>

struct EcoString {
    size_t heapIndex;
    size_t capacity;
    char* data;

    char* get() { return data; }
};

struct FixedSizeHeap {
    const size_t SIZE;
    HANDLE m_heap;

    explicit FixedSizeHeap(size_t size)
        : SIZE(size)
        , m_heap(HeapCreate(0, 0, 0))     // one private heap per block size
    {
    }

    ~FixedSizeHeap() {
        HeapDestroy(m_heap);
    }

    bool allocString(size_t capacity, EcoString& str) {
        assert(capacity <= SIZE);
        str.capacity = SIZE;              // we alloc SIZE bytes anyway...
        str.data = (char*)HeapAlloc(m_heap, 0, SIZE);
        if (nullptr != str.data)
            return true;
        str.capacity = 0;
        return false;
    }

    void freeString(EcoString& str) {
        HeapFree(m_heap, 0, str.data);
        str.data = nullptr;
        str.capacity = 0;
    }
};
struct BucketHeap {
    using Buckets = std::vector<FixedSizeHeap>; // or std::array
    Buckets buckets;

    /*
    (loop for i from 0
          for size = 40 then (+ size (ash size -1))
          while (< size 80000)
          collecting (cons i size))
    ((0 . 40) (1 . 60) (2 . 90) (3 . 135) (4 . 202) (5 . 303) (6 . 454) (7 . 681)
     (8 . 1021) (9 . 1531) (10 . 2296) (11 . 3444) (12 . 5166) (13 . 7749)
     (14 . 11623) (15 . 17434) (16 . 26151) (17 . 39226) (18 . 58839))

    Init the Buckets with index (first item) and SIZE (rest item)
    from the above item list.
    */

    // allocate(nbytes) looks for the correct bucket (linear or binary
    // search) and calls allocString() on that bucket. It stores the
    // bucket's index in the EcoString.heapIndex field (so free has an
    // easier time).

    // free(EcoString& str) uses heapIndex to find the right bucket and
    // then calls freeString() on that bucket.
};

What you are trying to do is called a slab allocator and it is a very well studied problem with lots of research papers.
You don't need every possible size. Usually slabs come in power-of-2 sizes.
Don't reinvent the wheel; there are plenty of open-source implementations of a slab allocator. The Linux kernel uses one.
Start by reading the Wikipedia page on Slab Allocation.
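A tiny sketch of the power-of-2 size-class idea (not taken from any particular slab implementation; the 32-byte minimum class is an assumption): round each request up to the next power of two and use the exponent to index a table of slabs.

#include <cstddef>

// Map a requested size to a power-of-2 size class.
// E.g. 30 -> class 5 (32 bytes), 4000 -> class 12 (4096 bytes).
std::size_t size_class(std::size_t n)
{
    std::size_t cls = 5;                        // smallest class: 32 bytes
    std::size_t size = std::size_t(1) << cls;
    while (size < n) {
        size <<= 1;
        ++cls;
    }
    return cls;                                 // index into an array of slabs/pools
}

For the 30-4000 byte strings from the question this gives only 12 - 5 + 1 = 8 size classes instead of one pool per byte count, at the cost of at most roughly 50% internal waste per allocation.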

I don't know the details of your program. There are many patterns for working with memory, each of which leads to different results and problems. However, you can take a look at the .NET garbage collector. It uses several generations of objects (similar to pools of memory). Each object in a generation is movable (its address in memory changes when the generation is compacted). Such a trick allows the runtime to "compress/compact" memory and reduce fragmentation. You can implement a memory manager with similar semantics; it's not necessary to implement a full garbage collector at all.
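A very small sketch of that "movable objects" idea, assuming a handle table of my own invention (this is not how .NET actually implements it): callers hold an index into a table rather than a raw pointer, so the manager is free to move the bytes and compact its buffer.

#include <cstddef>
#include <string>
#include <vector>

// Strings live in one contiguous buffer; callers hold handles (indices),
// so compact() may move the bytes and only has to patch the offsets.
// A handle must not be used after remove() has been called on it.
class MovableStringStore {
public:
    using Handle = std::size_t;

    Handle add(const std::string& s) {
        entries_.push_back({buffer_.size(), s.size(), true});
        buffer_.insert(buffer_.end(), s.begin(), s.end());
        return entries_.size() - 1;
    }

    void remove(Handle h) { entries_[h].live = false; }   // leaves a hole in the buffer

    std::string get(Handle h) const {
        const Entry& e = entries_[h];
        return std::string(buffer_.data() + e.offset, e.length);
    }

    // Slide all live strings to the front of the buffer, eliminating the holes.
    void compact() {
        std::vector<char> packed;
        packed.reserve(buffer_.size());
        for (Entry& e : entries_) {
            if (!e.live) continue;
            std::size_t new_offset = packed.size();
            packed.insert(packed.end(),
                          buffer_.begin() + e.offset,
                          buffer_.begin() + e.offset + e.length);
            e.offset = new_offset;
        }
        buffer_.swap(packed);
    }

private:
    struct Entry { std::size_t offset, length; bool live; };
    std::vector<char> buffer_;
    std::vector<Entry> entries_;
};

Whether the copying cost of compact() is acceptable obviously depends on the program; the point is only that the extra level of indirection is what makes defragmentation possible.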

Related

What are the next steps to improve a malloc() algorithm? [closed]

I'm writing my own simple malloc() function and I would like to make it faster and more efficient. I have written a function that uses linear search and allocates sequentially and contiguously in memory.
What is the next step to improve this algorithm? What are the main shortcomings of my current version? I would be very grateful for any feedback and recommendations.
#include <stddef.h>
#include <stdbool.h>

typedef struct heap_block
{
    struct heap_block* next;
    size_t size;
    bool isfree;
} header;

#define Heap_Capacity 100000

static char heap[Heap_Capacity];
size_t heap_size;

void set_data_to_block(header* block, size_t sz);
header* to_end_data(header* block);

void* malloc(size_t sz)
{
    if(sz == 0 || sz > Heap_Capacity) { return NULL; }

    header* block = (header*)heap;
    if(heap_size == 0)
    {
        set_data_to_block(block, sz);
        return (void*)(block + 1);
    }

    /* linear search for the last block in the list */
    while(block->next != NULL) { block = block->next; }

    /* place the new block just past the previous block's data */
    block->next = (header*)((char*)to_end_data(block) + 8);
    header* new_block = block->next;
    set_data_to_block(new_block, sz);
    return (void*)(new_block + 1);
}

void set_data_to_block(header* block, size_t sz)
{
    block->size = sz;
    block->isfree = false;
    block->next = NULL;
    heap_size += sz;
}

header* to_end_data(header* block)
{
    return (header*)((size_t)(block + 1) + block->size);
}
Notice that malloc is often built on top of lower-level memory-related syscalls (e.g. mmap(2) on Linux). See this answer which mentions GNU glibc and musl-libc. Look also inside tcmalloc; in other words, study the source code of several free-software malloc implementations.
Some general ideas for your malloc:
retrieve memory from the OS using mmap (and release it back to the OS kernel eventually with munmap). You certainly should not allocate a fixed-size heap (since on a 64-bit computer with 128 GB of RAM you would want to succeed in malloc-ing a 10-billion-byte zone).
segregate small allocations from big ones, i.e. handle a malloc of 16 bytes differently from a malloc of a megabyte. The typical threshold between small and large allocations is generally a small multiple of the page size (which is often 4 KB). Small allocations happen inside pages; large allocations are rounded up to whole pages. You might even handle a malloc of two words (like in many linked lists) very specially.
round up the requested size to some fancy number (e.g. powers of 2, or 3 times a power of 2).
manage memory zones of similar sizes together, that is, of the same "fancy" size.
for small memory zones, avoid reclaiming the memory too eagerly; keep previously free-d zones of the same (small) size around to reuse them in future calls to malloc.
you might use some tricks on the address (but your system might have ASLR), or keep near each memory zone a word of meta-data describing the chunk of which it is a member.
a significant issue is, given an address previously returned by malloc and now passed to free, finding out the allocated size of that memory zone. You could manipulate the address bits, you could store that size in the word just before the returned pointer, you could use some hash table, etc. Details are tricky.
Notice that details are tricky, and it might be quite hard to write a malloc implementation better than your system one. In practice, writing a good malloc is not a simple task. You should find many academic papers on the subject.
Look also into garbage collection techniques. Consider perhaps Boehm's conservative GC: you will replace malloc by GC_MALLOC and you won't bother about free... Learn about memory pools.
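A toy sketch of the "store the size in the word just before the returned pointer" idea from the list above, combined with mmap for whole pages (this is a simplification, not how glibc or musl actually lay things out; alignment is ignored):

#include <stddef.h>
#include <sys/mman.h>

/* Toy large-allocation path: get whole pages from the kernel with mmap and
   remember the mapping length in a header word placed just before the
   pointer handed back to the caller. */
void* toy_alloc(size_t sz)
{
    size_t total = (sz + sizeof(size_t) + 4095) & ~(size_t)4095;  /* round up to 4K pages */
    void* p = mmap(NULL, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    *(size_t*)p = total;                  /* header word: size of the whole mapping */
    return (char*)p + sizeof(size_t);     /* user data starts right after the header */
}

void toy_free(void* user)
{
    if (!user) return;
    char* base = (char*)user - sizeof(size_t);
    munmap(base, *(size_t*)base);         /* the header tells us how much to unmap */
}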
There are 3 ways to improve:
make it more robust
optimise the way memory is allocated
optimise the code that allocates the memory
Make It More Robust
There are many common programmer mistakes that could be easily detected (e.g. modifying data beyond the end of the allocated block). For a very simple (and relatively fast) example, it's possible to insert "canaries" before and after the allocated block, and detect programmer mistakes during free and realloc by checking whether the canaries are still intact (if they aren't, then the programmer has trashed something by mistake). This only works "sometimes, maybe". The problem is that (for a simple implementation of malloc) the meta-data is mixed in with the allocated blocks, so there's a chance that the meta-data has been corrupted even if the canaries haven't. To fix that, it'd be a good idea to separate the meta-data from the allocated blocks. Also, merely reporting "something was corrupted" doesn't help as much as you'd hope. Ideally you'd want some sort of identifier for each block (e.g. the name of the function that allocated it) so that if/when problems occur you can report what was corrupted. Of course this could/should be done via a macro, so that those identifiers can be omitted when not debugging.
The main problem here is that the interface provided by malloc is lame and broken - there's simply no way to return acceptable error conditions ("failed to allocate" is the only error it can return) and no way to pass additional information. You'd want something more like int malloc(void **outPointer, size_t size, char *identifier) (with similar alterations to free and realloc to enable them to return a status code and identifier).
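A minimal sketch of the canary idea described above (the guard value and the header layout are made up for illustration, and alignment of the returned pointer is ignored):

#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

static const std::uint32_t CANARY = 0xDEADBEEF;   // arbitrary guard value

// Layout: [front canary][size][user data ...][back canary]
void* guarded_malloc(std::size_t size)
{
    char* raw = static_cast<char*>(std::malloc(2 * sizeof CANARY + sizeof size + size));
    if (!raw) return nullptr;
    std::memcpy(raw, &CANARY, sizeof CANARY);                 // front canary
    std::memcpy(raw + sizeof CANARY, &size, sizeof size);     // remember the size
    char* user = raw + sizeof CANARY + sizeof size;
    std::memcpy(user + size, &CANARY, sizeof CANARY);         // back canary
    return user;
}

void guarded_free(void* p)
{
    if (!p) return;
    char* user = static_cast<char*>(p);
    char* raw = user - sizeof(std::size_t) - sizeof CANARY;
    std::uint32_t front, back;
    std::size_t size;
    std::memcpy(&front, raw, sizeof front);
    std::memcpy(&size, raw + sizeof front, sizeof size);
    std::memcpy(&back, user + size, sizeof back);
    assert(front == CANARY && "data was written before the start of the block");
    assert(back == CANARY && "data was written past the end of the block");
    std::free(raw);
}

Separating this meta-data from the block itself, as the answer suggests, would make the check more trustworthy, since here a bad enough overflow can also trash the stored size.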
Optimise the Way Memory is Allocated
It's naive to assume that all memory is the same. It's not. Cache locality (including TLB locality) and other cache effects, and things like NUMA optimisation, all matter. For a simple example, imagine you're writing an application that has a structure describing a person (including a hash of their name) and a pointer to the person's name string; and both the structure and the name string are allocated via malloc. The normal end result is that those structures and strings end up mixed together in the heap, so when you're searching through these structures (e.g. trying to find the structure that contains the correct hash) you end up pounding caches and TLBs. To optimise this properly you'd want to ensure that all the structures are close together in the heap. For that to work, malloc needs to know the difference between allocating 32 bytes for the structure and allocating 32 bytes for the name string. You need to introduce the concept of "memory pools" (e.g. where everything in "memory pool number 1" is kept close together in the heap).
Other important optimisations include "cache colouring" (see http://en.wikipedia.org/wiki/Cache_coloring ). For NUMA systems, it can be important to know the difference between allocations where maximum bandwidth is needed (where using memory from multiple NUMA domains increases bandwidth) and allocations where minimum latency is needed (where memory in the NUMA domain closest to the CPU that uses it is preferable).
Finally, it'd be nice (to manage heap fragmentation, etc.) to use different strategies for "temporary, likely to be freed soon" allocations and for longer-term allocations (e.g. where it's worth doing a little extra work to minimise fragmentation and wasted space/RAM).
Note: I'd estimate that getting all of this right can mean software running up to 20% faster in specific cases, due to far less cache misses, more bandwidth where it's needed, etc.
The main problem here is that the interface provided by malloc is lame and broken - there's simply no way to pass the additional information to malloc in the first place. You'd want something more like int malloc(void **outPointer, size_t size, char *identifier, int pool, int optimisationFlags) (with similar alterations to realloc).
Optimise the Code That Allocates the Memory
Given that you can assume memory is used more frequently than it is allocated, this is the least important (e.g. less important than getting things like cache locality for the allocated blocks right).
Quite frankly, anyone that actually wants decent performance or decent debugging shouldn't be using malloc to begin with - generic solutions to specific problems are never ideal. With this in mind (and not forgetting that the interface to malloc is lame and broken and prevents everything that's important anyway) I'd recommend simply not bothering with malloc and creating something that's actually good (but non-standard) instead. For this you can adapt the algorithms used by existing implementations of malloc.

Allocating more memory than there exists using malloc

This code snippet will allocate 2 GB every time it reads the letter 'u' from stdin, and will initialize all the allocated chars once it reads 'a'.
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <vector>

#define bytes 2147483648

using namespace std;

int main()
{
    char input[8] = {0};
    vector<char *> activate;

    while(input[0] != 'q')
    {
        fgets(input, sizeof input, stdin);   // gets() is unsafe and overruns a 1-char buffer

        if(input[0] == 'u')
        {
            char *m = (char*)malloc(bytes);
            if(m == NULL) cout << "cant allocate mem" << endl;
            else cout << "ok" << endl;
            activate.push_back(m);
        }
        else if(input[0] == 'a')
        {
            for(size_t i = 0; i < activate.size(); i++)
            {
                char *m = activate[i];
                for(unsigned long long x = 0; x < bytes; x++)
                {
                    // touching every byte forces the kernel to actually commit the pages
                    m[x] = 'a';
                }
            }
        }
    }
    return 0;
}
I am running this code on a Linux virtual machine that has 3 GB of RAM. While monitoring the system resource usage with the htop tool, I have realized that the malloc operation is not reflected in the resource usage.
For example, when I input 'u' only once (i.e. allocate 2 GB of heap memory), I don't see the memory usage increasing by 2 GB in htop. It is only when I input 'a' (i.e. initialize) that I see the memory usage increasing.
As a consequence, I am able to "malloc" more heap memory than there exists. For example, I can malloc 6 GB (which is more than my RAM and swap memory combined) and malloc allows it (i.e. NULL is not returned by malloc). But when I try to initialize the allocated memory, I can see the memory and swap filling up until the process is killed.
My questions:
1. Is this a kernel bug?
2. Can someone explain to me why this behavior is allowed?
It is called memory overcommit. You can disable it by running as root:
echo 2 > /proc/sys/vm/overcommit_memory
and it is not a kernel feature that I like (so I always disable it). See malloc(3) and mmap(2) and proc(5).
NB: echo 0 instead of echo 2 often - but not always - works too. Read the docs (in particular the proc man page that I just linked to).
from man malloc (online here):
By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL there is no guarantee that the memory really is available.
So when you just ask to allocate too much, it "lies" to you; when you actually want to use the allocated memory, the kernel will try to find enough physical memory for you, and the process might crash (be killed) if it can't.
No, this is not a kernel bug. You have discovered something known as late paging (or overcommit).
Until you write a byte to the address allocated with malloc (...) the kernel does little more than "reserve" the address range. This really depends on the implementation of your memory allocator and operating system of course, but most good ones do not incur the majority of kernel overhead until the memory is first used.
The hoard allocator is one big offender that comes to mind immediately; through extensive testing I have found it almost never takes advantage of a kernel that supports late paging. You can always mitigate the effects of late paging in any allocator if you zero-fill the entire memory range immediately after allocation.
Real-time operating systems like VxWorks will never allow this behavior because late paging introduces serious latency. Technically, all it does is put the latency off until a later indeterminate time.
For a more detailed discussion, you may be interested to see how IBM's AIX operating system handles page allocation and overcommitment.
This is a result of what Basile mentioned: memory overcommit. However, the explanation is kind of interesting.
Basically when you attempt to map additional memory in Linux (POSIX?), the kernel will just reserve it, and will only actually end up using it if your application accesses one of the reserved pages. This allows multiple applications to reserve more than the actual total amount of ram / swap.
This is desirable behavior on most Linux environments unless you've got a real-time OS or something where you know exactly who will need what resources, when and why.
Otherwise somebody could come along, malloc up all the RAM (without actually doing anything with it) and OOM your apps.
Another example of this lazy allocation is mmap(), where you have a virtual map that the file you're mapping can fit inside, but only a small amount of real memory is dedicated to the effort. This allows you to mmap() huge files (larger than your available RAM) and use them like normal file handles, which is nifty.
Initializing / working with the memory should work:
memset(m, 0, bytes);
Also, you could use calloc, which not only allocates the memory but also fills it with zeros for you:
char* m = (char*) calloc(1, bytes);
1. Is this a kernel bug?
No.
2. Can someone explain to me why this behavior is allowed?
There are a few reasons:
Mitigate the need to know the eventual memory requirement - it's often convenient for an application to be able to ask for an amount of memory that it considers an upper limit on what it might actually need. For example, if it's preparing some kind of report, either an initial pass just to calculate the eventual size of the report or a realloc() of successively larger areas (with the risk of having to copy) may significantly complicate the code and hurt performance, whereas multiplying some maximum length of each entry by the number of entries could be very quick and easy. If you know virtual memory is relatively plentiful as far as your application's needs are concerned, then making a larger allocation of virtual address space is very cheap.
Sparse data - if you have the virtual address space spare, being able to have a sparse array and use direct indexing, or allocate a hash table with generous capacity() to size() ratio, can lead to a very high performance system. Both work best (in the sense of having low overheads/waste and efficient use of memory caches) when the data element size is a multiple of the memory paging size, or failing that much larger or a small integral fraction thereof.
Resource sharing - consider an ISP offering a "1 giga-bit per second" connection to 1000 consumers in a building - they know that if all the consumers use it simultaneously they'll get about 1 mega-bit, but rely on their real-world experience that, though people ask for 1 giga-bit and want a good fraction of it at specific times, there's inevitably some lower maximum and much lower average for concurrent usage. The same insight applied to memory allows operating systems to support more applications than they otherwise would, with reasonable average success at satisfying expectations. Much as the shared Internet connection degrades in speed as more users make simultaneous demands, paging from swap memory on disk may kick in and reduce performance. But unlike an internet connection, there's a limit to the swap memory, and if all the apps really do try to use the memory concurrently such that that limit's exceeded, some will start getting signals/interrupts/traps reporting memory exhaustion. Summarily, with this memory overcommit behaviour enabled, simply checking malloc()/new returned a non-NULL pointer is not sufficient to guarantee the physical memory is actually available, and the program may still receive a signal later as it attempts to use the memory.

How to allocate large memory blocks on Windows 7

I am trying to allocate a large block (5 GB) of memory under Windows 7 Professional. I have a 64-bit machine and 16 GB RAM and I'm using MS Visual Studio 10. For those of you who might ask why: it's because I need to hold a 2-D raster representation of a map of reference numbers to polygon data, and the maps can be up to 40,000 x 40,000 units. This has to go into RAM, and fragmenting it into smaller blocks would be too expensive at runtime.
So if I do this:
#include <climits>
#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
    int t = INT_MAX;
    int * test = new int[t/6];
    delete[] test;   // array form of delete
    return 0;
}
The call to new fails, but
int * test = new int[t/7]; succeeds.
Investigating a bit more I found that the memory allocation is only using the memory which is designated as 'free' by the resource monitor. So when this is smaller than the allocation requested the allocation fails. The resource monitor tells me that (when I looked) I had ~5GB in use, ~10GB on standby, and a little over 1GB free.
As far as I understand it this should not happen. Surely if memory is requested it should be taken from the standby memory? If this is not the case, is there a way to reduce the amount of standby memory used by Windows from inside C++?
The latter can be done outside C++ using RAMMap, as I discovered from this post: Clear the windows 7 standby memory programmatically, but unfortunately there was no useful answer to the question of programmatic clearing. Perhaps I will be luckier in C++.
Of course the more likely scenario is that I am just missing something obvious.
Thanks
Remember that when you allocate on the heap, the memory has to be one contiguous block. If the memory is fragmented, no block big enough to hold the allocation may be available, even though the total amount of free memory is much larger than what you request.
Memory fragmentation might be one of the issues: you could have that amount of memory free, but it is not contiguous.
A better approach would be to allocate rather big chunks of memory (several megabytes) and link-list them. The probability that you will find space for such a chunk (and thus for every one of the chunks) is much higher than the probability of finding contiguous space for several gigabytes.
As for performance: as long as you are working on exactly one chunk, you will lose no speed, as the data remains in the CPU cache. If you switch often between two chunks (e.g. you swap items between them), you will get cache misses.
Anyway, workarounds like clearing the standby memory are like poker: you might get your chunk or you might not; it depends on too many factors you can't control.
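A sketch of that chunked approach for the 40,000 x 40,000 raster (the 4 MB chunk size and the class name are illustrative assumptions): store the raster as a list of moderately sized blocks and translate (x, y) into a chunk index plus an offset.

#include <cstddef>
#include <cstdint>
#include <vector>

// 2-D raster of 32-bit reference numbers stored as a list of ~4 MB chunks,
// so no single multi-gigabyte contiguous allocation is ever needed.
class ChunkedRaster {
public:
    ChunkedRaster(std::size_t width, std::size_t height)
        : width_(width)
    {
        std::size_t cells = width * height;
        std::size_t chunks = (cells + CELLS_PER_CHUNK - 1) / CELLS_PER_CHUNK;
        chunks_.resize(chunks, std::vector<std::uint32_t>(CELLS_PER_CHUNK, 0));
    }

    std::uint32_t& at(std::size_t x, std::size_t y) {
        std::size_t idx = y * width_ + x;                  // linear cell index
        return chunks_[idx / CELLS_PER_CHUNK][idx % CELLS_PER_CHUNK];
    }

private:
    static const std::size_t CELLS_PER_CHUNK = 1u << 20;   // 1M cells = 4 MB per chunk
    std::size_t width_;
    std::vector<std::vector<std::uint32_t>> chunks_;
};

As the answer notes, as long as neighbouring accesses stay inside one chunk the cache behaviour is essentially the same as with one flat array; only accesses that cross chunk boundaries pay extra.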

Why is deleted memory unable to be reused

I am using C++ on Windows 7 with MSVC 9.0, and have also been able to test and reproduce on Windows XP SP3 with MSVC 9.0.
If I allocate 1 GB of 0.5 MB sized objects, when I delete them everything is OK and behaves as expected. However, if I allocate 1 GB of 0.25 MB sized objects, when I delete them the memory remains reserved (yellow in Address Space Monitor) and from then on can only be used for allocations smaller than 0.25 MB.
This simple code will let you test both scenarios by changing which struct is typedef'd. After it has allocated and deleted the structs it will then allocate 1 GB of 1 MB char buffers to see if the char buffers will use the memory that the structs once occupied.
struct HalfMegStruct
{
    HalfMegStruct() : m_Next(0) {}

    /* return the number of objects needed to allocate one gig */
    static int getIterations() { return 2048; }

    int m_Data[131071];
    HalfMegStruct* m_Next;
};

struct QuarterMegStruct
{
    QuarterMegStruct() : m_Next(0) {}

    /* return the number of objects needed to allocate one gig */
    static int getIterations() { return 4096; }

    int m_Data[65535];
    QuarterMegStruct* m_Next;
};

// which struct to use
typedef QuarterMegStruct UseType;

int main()
{
    UseType* first = new UseType;
    UseType* current = first;

    for ( int i = 0; i < UseType::getIterations(); ++i )
        current = current->m_Next = new UseType;

    while ( first->m_Next )
    {
        UseType* temp = first->m_Next;
        delete first;
        first = temp;
    }
    delete first;

    for ( unsigned int i = 0; i < 1024; ++i )
        // one meg buffer, i'm aware this is a leak but its for illustrative purposes.
        new char[ 1048576 ];

    return 0;
}
Below you can see my results from within Address Space Monitor. Let me stress that the only difference between these two end results is the size of the structs being allocated up to the 1 GB marker.
This seems like quite a serious problem to me, and one that many people could be suffering from and not even know it.
So is this by design or should this be considered a bug?
Can I make small deleted objects actually be free for use by larger allocations?
And more out of curiosity, does a Mac or a Linux machine suffer from the same problem?
I cannot positively state this is the case, but this does look like memory fragmentation (in one of its many forms). The allocator (malloc) might be keeping buckets of different sizes to enable fast allocation; after you release the memory, instead of giving it directly back to the OS, it keeps the buckets so that later allocations of the same size can be served from the same memory. If this is the case, the memory would be available for further allocations of the same size.
This type of optimization is usually disabled for big objects, as it requires reserving memory even if it is not in use. If the threshold is somewhere between your two sizes, that would explain the behavior.
Note that while you might see this as weird, in most programs (not test, but real life) the memory usage patterns are repeated: if you asked for 100k blocks once, it more often than not is the case that you will do it again. And keeping the memory reserved can improve performance and actually reduce fragmentation that would come from all requests being granted from the same bucket.
You can, if you want to invest some time, learn how your allocator works by analyzing the behavior. Write some tests, that will acquire size X, release it, then acquire size Y and then show the memory usage. Fix the value of X and play with Y. If the requests for both sizes are granted from the same buckets, you will not have reserved/unused memory (image on the left), while when the sizes are granted from different buckets you will see the effect on the image on the right.
I don't usually code for windows, and I don't even have Windows 7, so I cannot positively state that this is the case, but it does look like it.
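A sketch of the kind of probe suggested above (the sizes and counts are placeholders; the actual memory footprint would be observed externally, e.g. in Address Space Monitor, at the marked pauses):

#include <cstddef>
#include <cstdio>
#include <vector>

// Allocate many blocks of size X, free them, then allocate blocks of size Y.
// If X and Y fall into the same allocator bucket, the second round reuses the
// freed memory and the footprint stays flat; if not, it grows by about count*Y.
void probe(std::size_t X, std::size_t Y, int count)
{
    std::vector<char*> blocks;

    for (int i = 0; i < count; ++i) blocks.push_back(new char[X]);
    for (std::size_t i = 0; i < blocks.size(); ++i) delete[] blocks[i];
    blocks.clear();

    std::printf("freed %d blocks of %lu bytes - check the footprint now\n",
                count, (unsigned long)X);
    std::getchar();                       // pause: observe memory usage externally

    for (int i = 0; i < count; ++i) blocks.push_back(new char[Y]);
    std::printf("allocated %d blocks of %lu bytes - check the footprint again\n",
                count, (unsigned long)Y);
    std::getchar();

    for (std::size_t i = 0; i < blocks.size(); ++i) delete[] blocks[i];
}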
I can confirm the same behaviour with g++ 4.4.0 under Windows 7, so it's not in the compiler. In fact, the program fails when getIterations() returns 3590 or more -- do you get the same cutoff? This looks like a bug in Windows system memory allocation. It's all very well for knowledgeable souls to talk about memory fragmentation, but everything got deleted here, so the observed behaviour definitely shouldn't happen.
Using your code I performed your test and got the same result. I suspect that David Rodríguez is right in this case.
I ran the test and had the same result as you. It seems there might be this "bucket" behaviour going on.
I tried two different tests too. Instead of allocating 1 GB of data using 1 MB buffers, I allocated it the same way the memory was first allocated, after deleting. In the second test I allocated the half-meg buffers, cleaned up, then allocated the quarter-meg buffers, adding up to 512 MB each. Both tests had the same memory result in the end: only 512 MB is allocated and there is no large chunk of reserved memory.
As David mentions, most applications tend to make allocation of the same size. One can see quite clearly why this could be a problem though.
Perhaps the solution to this is that, if you are allocating many smaller objects in this way, you would do better to allocate one large block of memory and manage it yourself. Then, when you're done, free the large block.
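A minimal sketch of that manage-it-yourself idea, as a bump-pointer arena that is released in one go (the class name, the fixed capacity, and the 16-byte alignment are assumptions for illustration):

#include <cstddef>
#include <new>

// Carve many small objects out of one big allocation; freeing the arena
// releases everything at once, so the small objects never leave holes
// behind in the process heap.
class Arena {
public:
    explicit Arena(std::size_t bytes)
        : buffer_(new char[bytes]), size_(bytes), used_(0) {}

    ~Arena() { delete[] buffer_; }            // one delete frees every object carved from it

    void* allocate(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);      // keep 16-byte alignment
        if (used_ + n > size_) return 0;      // arena exhausted
        void* p = buffer_ + used_;
        used_ += n;
        return p;
    }

private:
    char* buffer_;
    std::size_t size_;
    std::size_t used_;
};

// Usage with the structs from the question (placement new):
//   Arena arena(1024u * 1024u * 1024u);
//   QuarterMegStruct* s = new (arena.allocate(sizeof(QuarterMegStruct))) QuarterMegStruct;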
I spoke with some authorities on the subject (Greg, if you're out there, say hi ;D) and can confirm that what David is saying is basically right.
As the heap grows in the first pass of allocating ~0.25MB objects, the heap is reserving and committing memory. As the heap shrinks in the delete pass, it decommits at some pace but does not necessarily release the virtual address ranges it reserved in the allocation pass. In the last allocation pass, the 1MB allocations are bypassing the heap due to their size and thus begin to compete with the heap for VA.
Note that the heap is reserving the VA, not keeping it committed. VirtualAlloc and VirtualFree can help explain the difference if you're curious. This fact doesn't solve the problem you ran into, which is that the process ran out of virtual address space.
This is a side-effect of the Low-Fragmentation Heap.
http://msdn.microsoft.com/en-us/library/aa366750(v=vs.85).aspx
You should try disabling it to see if that helps. Run against both GetProcessHeap and the CRT heap (and any other heaps you may have created).

Extreme memory usage for individual dynamic allocation

Here's a simple test I did with MSVC++ 2010 under Windows 7:
// A struct with sizeof(s) == 4, e.g. 4 bytes
struct s
{
    int x;
};

// Allocate 1 million structs in one array
s* test1 = new s[1000000];
// Memory usage shows that the increase in memory is roughly 4 bytes * 1000000 - as expected

// NOW! If I run this:
for (int i = 0; i < 1000000; i++)
    new s();
// The memory usage is disproportionately large. Divided by 1000000, it indicates 64 bytes per s!!!
Is this common knowledge or am I missing something? I have always created objects on the fly when needed, for example new Triangle() for every triangle in a mesh, etc.
Is there indeed an order-of-magnitude overhead for dynamic memory allocation of individual instances?
Cheers
EDIT:
Just compiled and ran the same program at work on Windows XP using g++:
Now the overhead is 16 bytes, not 64 as observed before! Very interesting.
Not necessarily, but the operating system will usually reserve memory on your behalf in whatever sized chunks it finds convenient; on your system, I'd guess it gives you multiples of 64 bytes per request.
There is an overhead associated with keeping track of the memory allocations, after all, and reserving very small amounts isn't worthwhile.
Is that for a debug build? Because in a debug build MSVC will allocate "guards" around objects to see if you overwrite past your object's boundary.
There is usually overhead with any single memory allocation. Now this is from my knowledge of malloc rather than new but I suspect it's the same.
A section of the memory arena, when carved out for an allocation of (say) 30 bytes, will typically have a header (e.g., 16 bytes, and all figures like that are examples only below, they may be different) and may be padded to a multiple of 16 bytes for easier arena management.
The header is usually important to allow the section to be re-integrated into the free memory pool when you're finished with it.
It contains information about the size of the block at a bare minimum and may have memory guards as well (to detect corruption of the arena).
So, when you allocate your one million structure array, you'll find that it uses an extra 16 bytes for the header (four million and sixteen bytes). When you try to allocate one million individual structures, each and every one of them will have that overhead.
I answered a related question here with more details. I suspect there will be more required header information for C++ since it will probably have to store the number of items over and above the section size (for proper destructor calls) but that's just supposition on my part. It doesn't affect the fact that accounting information of some sort is needed per allocated item.
If you really want to see what the space is being used for, you'll need to dig through the MSVC runtime source code.
You should check the malloc implementation. Probably this will clear things up.
I'm not sure, though, whether MSVC++'s malloc source can be viewed somewhere. If not, look at some other implementation; they are probably similar to some degree.
Don't expect the malloc implementation to be easy. It needs to search for free space in the already-allocated virtual pages or allocate a new virtual page, it must do this fast (as fast as possible), and it must be thread-safe. Maybe your malloc implementation has some sort of bit vector where it saves which 64-bit chunks of a page are free, and it just takes the next free chunk.