Why is deleted memory unable to be reused? - C++

I am using C++ on Windows 7 with MSVC 9.0, and have also been able to test and reproduce on Windows XP SP3 with MSVC 9.0.
If I allocate 1 GB of 0.5 MB-sized objects and then delete them, everything is fine and behaves as expected. However, if I allocate 1 GB of 0.25 MB-sized objects and then delete them, the memory remains reserved (yellow in Address Space Monitor) and from then on can only be used for allocations smaller than 0.25 MB.
This simple code will let you test both scenarios by changing which struct is typedef'd. After it has allocated and deleted the structs it will then allocate 1 GB of 1 MB char buffers to see if the char buffers will use the memory that the structs once occupied.
struct HalfMegStruct
{
    HalfMegStruct() : m_Next(0) {}

    /* return the number of objects needed to allocate one gig */
    static int getIterations() { return 2048; }

    int m_Data[131071];
    HalfMegStruct* m_Next;
};

struct QuarterMegStruct
{
    QuarterMegStruct() : m_Next(0) {}

    /* return the number of objects needed to allocate one gig */
    static int getIterations() { return 4096; }

    int m_Data[65535];
    QuarterMegStruct* m_Next;
};
// which struct to use
typedef QuarterMegStruct UseType;
int main()
{
    UseType* first = new UseType;
    UseType* current = first;

    for ( int i = 0; i < UseType::getIterations(); ++i )
        current = current->m_Next = new UseType;

    while ( first->m_Next )
    {
        UseType* temp = first->m_Next;
        delete first;
        first = temp;
    }
    delete first;

    for ( unsigned int i = 0; i < 1024; ++i )
        // one-meg buffer; I'm aware this is a leak, but it's for illustrative purposes.
        new char[ 1048576 ];

    return 0;
}
Below you can see my results from within Address Space Monitor. Let me stress that the only difference between these two end results is the size of the structs being allocated up to the 1 GB marker.
This seems like quite a serious problem to me, and one that many people could be suffering from and not even know it.
So is this by design or should this be considered a bug?
Can I make small deleted objects actually be free for use by larger allocations?
And more out of curiosity, does a Mac or a Linux machine suffer from the same problem?

I cannot positively state this is the case, but it does look like memory fragmentation (in one of its many forms). The allocator (malloc) might be keeping buckets of different sizes to enable fast allocation; after you release the memory, instead of giving it back to the OS directly, it keeps the buckets so that later allocations of the same size can be served from the same memory. If this is the case, the memory would be available for further allocations of the same size.
This type of optimization is usually disabled for big objects, as it requires reserving memory even when it is not in use. If the threshold is somewhere between your two sizes, that would explain the behavior.
Note that while you might see this as weird, in most programs (not tests, but real life) memory usage patterns repeat: if you asked for 100k blocks once, more often than not you will do it again. And keeping the memory reserved can improve performance and actually reduce the fragmentation that would come from all requests being granted from the same bucket.
You can, if you want to invest some time, learn how your allocator works by analyzing its behavior. Write some tests that acquire size X, release it, then acquire size Y and then show the memory usage. Fix the value of X and play with Y. If the requests for both sizes are granted from the same buckets, you will not have reserved/unused memory (image on the left), while when the sizes are granted from different buckets you will see the effect in the image on the right.
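A minimal sketch of such a probe (the block sizes and counts are arbitrary choices of mine; the pauses let you inspect the address space from outside at each phase):

#include <cstdio>
#include <vector>

// Probe the allocator: grab many blocks of size X, free them all, then
// grab blocks of size Y; pause between phases so the address space can
// be inspected externally (Address Space Monitor, Task Manager, ...).
int main()
{
    const size_t X = 256 * 1024;    // first-phase block size (0.25 MB)
    const size_t Y = 1024 * 1024;   // second-phase block size (1 MB)
    const int    N = 1024;          // number of first-phase blocks

    std::vector<char*> blocks;
    for (int i = 0; i < N; ++i)
        blocks.push_back(new char[X]);
    std::printf("Phase 1: %d blocks of %lu bytes. Press Enter.\n", N, (unsigned long)X);
    std::getchar();

    for (size_t i = 0; i < blocks.size(); ++i)
        delete[] blocks[i];
    blocks.clear();
    std::printf("Phase 2: everything freed. Press Enter.\n");
    std::getchar();

    for (int i = 0; i < N / 4; ++i)     // same total volume, larger blocks
        blocks.push_back(new char[Y]);
    std::printf("Phase 3: %d blocks of %lu bytes. Press Enter.\n", N / 4, (unsigned long)Y);
    std::getchar();
    return 0;
}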
I don't usually code for Windows, and I don't even have Windows 7, so I cannot positively state that this is the case, but it does look like it.

I can confirm the same behaviour with g++ 4.4.0 under Windows 7, so it's not in the compiler. In fact, the program fails when getIterations() returns 3590 or more -- do you get the same cutoff? This looks like a bug in Windows system memory allocation. It's all very well for knowledgeable souls to talk about memory fragmentation, but everything got deleted here, so the observed behaviour definitely shouldn't happen.

Using your code I performed your test and got the same result; I suspect David Rodríguez is right in this case, and that this "bucket" behaviour is what's going on.
I tried two further tests as well. In the first, instead of allocating 1 GB of data using 1 MB buffers, I re-allocated the memory the same way it was first allocated after deleting it. In the second, I allocated the half-meg buffers, cleaned up, and then allocated the quarter-meg buffers, adding up to 512 MB each. Both tests had the same memory result in the end: only 512 MB is allocated, and there is no large chunk of reserved memory.
As David mentions, most applications tend to make allocations of the same size. One can see quite clearly why this could be a problem, though.
Perhaps the solution is that if you are allocating many smaller objects in this way, you would be better off allocating a large block of memory and managing it yourself. Then, when you're done, free the large block.

I spoke with some authorities on the subject (Greg, if you're out there, say hi ;D) and can confirm that what David is saying is basically right.
As the heap grows in the first pass of allocating ~0.25MB objects, the heap is reserving and committing memory. As the heap shrinks in the delete pass, it decommits at some pace but does not necessarily release the virtual address ranges it reserved in the allocation pass. In the last allocation pass, the 1MB allocations are bypassing the heap due to their size and thus begin to compete with the heap for VA.
Note that the heap is reserving the VA, not keeping it committed. VirtualAlloc and VirtualFree can help explain the difference if you're curious. This fact doesn't solve the problem you ran into, which is that the process ran out of virtual address space.
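The reserve/commit distinction is easy to demonstrate in isolation with those APIs; a small sketch (error checks omitted) - not the heap's actual logic, just the same primitives it builds on:

#include <windows.h>

int main()
{
    const SIZE_T size = 64 * 1024 * 1024;   // 64 MB

    // Reserve address space only: nothing is charged against the commit
    // limit yet, but the range is gone from this process's free VA.
    char* p = (char*)VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);

    // Commit it: now it counts against the commit charge and is usable.
    VirtualAlloc(p, size, MEM_COMMIT, PAGE_READWRITE);
    p[0] = 42;

    // Decommit: storage is given back, but the address range stays
    // reserved -- the state the heap leaves things in after your deletes.
    VirtualFree(p, size, MEM_DECOMMIT);

    // Release: only now is the virtual address range free again.
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}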

This is a side-effect of the Low-Fragmentation Heap.
http://msdn.microsoft.com/en-us/library/aa366750(v=vs.85).aspx
You should try disabling it to see if that helps. Run against both GetProcessHeap and the CRT heap (and any other heaps you may have created).
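You can at least confirm whether the LFH is active on a given heap with HeapQueryInformation; a quick check (per the documentation, the compatibility value is 0 for the standard heap, 1 for look-aside lists, 2 for the LFH):

#include <windows.h>
#include <malloc.h>
#include <cstdio>

// Print the compatibility mode of a heap: 2 means the LFH is in use.
static void reportHeap(HANDLE heap, const char* name)
{
    ULONG info = 0;
    if (HeapQueryInformation(heap, HeapCompatibilityInformation,
                             &info, sizeof(info), NULL))
        std::printf("%s: compatibility = %lu%s\n",
                    name, info, info == 2 ? " (LFH)" : "");
}

int main()
{
    reportHeap(GetProcessHeap(), "process heap");
    reportHeap((HANDLE)_get_heap_handle(), "CRT heap");
    return 0;
}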

Related

Avoid memory fragmentation when memory pools are a bad idea

I am developing a C++ application where the program runs endlessly, allocating and freeing millions of strings (char*) over time, and RAM usage is a serious consideration. RAM usage gets higher and higher over time. I think the problem is heap fragmentation, and I really need to find a solution.
You can see in the image that after millions of allocations and frees in the program, the usage just keeps increasing. And the way I am testing it, I know for a fact that the data it stores is not increasing. I can guess you will ask, "How are you sure of that?", "How are you sure it's not just a memory leak?" Well.
This test ran much longer. I run malloc_trim(0) whenever possible in my program, and it seems the application can finally return the unused memory to the OS; it goes down almost to zero (the actual data size my program currently has). This implies the problem is not a memory leak. But I can't rely on this behavior: the allocation and freeing pattern of my program is random. What if it never releases the memory?
I said in the title that memory pools are a bad idea for this project. Of course I don't have absolute knowledge, but the strings I am allocating can be anything between 30 and 4000 bytes, which makes many optimizations and clever ideas much harder. Memory pools are one of them.
I am using GCC 11 / G++ 11 as the compiler, so if some old versions had bad allocators, I shouldn't have that problem.
How am I getting memory usage? The Python psutil module: proc.memory_full_info()[0], which gives me the RSS.
Of course, you don't know the details of my program, and it is still a valid question whether this is indeed heap fragmentation. What I can say is that I keep up-to-date counts of how many allocations and frees took place, and I know the element count of every container in my program. But if you have other ideas about the cause of the problem, I am open to suggestions.
I can't just allocate, say, 4096 bytes for all the strings so it would become easier to optimize; that's the opposite of what I am trying to do.
So my question is: what do programmers do (what should I do) in an application where millions of allocs and frees of varying sizes take place over time, making memory pools hard to use efficiently? I can't change what the program does, only the implementation details.
Bounty edit: When trying to utilize memory pools, isn't it possible to make multiple of them, to the extent that there is a pool for every possible byte count? For example, my strings can be anything between 30 and 4000 bytes, so couldn't somebody make 4000 - 30 + 1 = 3971 memory pools, one for each possible allocation size in the program? Isn't this applicable? All pools could start small (so as not to lose much memory), then grow, in a balance between performance and memory. I am not trying to exploit a memory pool's ability to reserve big spaces beforehand; I am just trying to effectively reuse freed space, given the frequent allocs and frees.
Last edit: It turns out that the memory growth appearing in the graphs was actually from an HTTP request queue in my program. I failed to see that the hundreds of thousands of tests I ran bloated this queue (something like a webhook). And the reasonable explanation of figure 2 is that I finally got DDOS-banned from the server (or could no longer open a connection for some reason), the queue emptied, and the RAM issue resolved itself. So, anyone reading this question later in the future: consider every possibility. It would never have crossed my mind that it was something like this; not a memory leak, but an implementation detail. Still, I think @Hajo Kirchhoff deserves the bounty; his answer was really enlightening.
If everything really is/works as you say it does and there is no bug you have not yet found, then try this:
malloc and other memory allocators usually work in chunks of 16 bytes anyway, even if the requested size is smaller. So you only need about 4000/16 - 30/16 ≈ 250 different memory pools.
const int chunk_size = 16;
memory_pools pool[250]; // 250 memory pools; pool[idx] manages chunks of (idx+1)*chunk_size bytes

char* reserve_mem(size_t sz)
{
    // round up, then map the size to a pool: 1..16 -> pool[0], 17..32 -> pool[1], ...
    size_t pool_idx_to_use = (sz + chunk_size - 1) / chunk_size - 1;
    char* rc = pool[pool_idx_to_use].allocate();
    return rc;
}
IOW, you have 250 memory pools. pool[0] allocates and manages chunks with a length of 16 bytes, pool[99] manages chunks of 1600 bytes, etc...
If you know the length distribution of your strings in advance, you can reserve initial memory for the pools based on this knowledge. Otherwise I'd probably just reserve memory for the pools in 4096-byte increments.
Because while the malloc C heap usually allocates memory in multiples of 16 bytes, it will (at least under Windows, but I am guessing Linux is similar here) ask the OS for memory, which usually works in 4K pages. IOW, the "outer" memory heap managed by the operating system reserves and frees 4096-byte pages.
So growing your own internal memory pool in 4096-byte increments means no fragmentation in the OS app heap. This 4096 page size (or a multiple of it) comes from the processor architecture: Intel processors have a built-in page size of 4K (or a multiple of it). I don't know about other processors, but I suspect similar architectures there.
So, to sum it up:
Use chunks of multiple of 16 bytes for your strings per memory pool.
Use chunks of multiple of 4K bytes to increase your memory pool.
That will align the memory use of your application with the memory management of the OS and avoid fragmentation as much as possible.
From the OS point of view, your application will only increment memory in 4K chunks. That's very easy to allocate and release. And there is no fragmentation.
From the internal (lib) C heap management point of view, your application will use memory pools and waste at most 15 bytes per string. Also all similar length allocations will be heaped together, so also no fragmentation.
Here is a solution which might help if your application has the following traits:
Runs on Windows.
Has episodes where the working set is large, but all of that data is released at around the same point in time entirely (e.g. if the data is processed in some kind of batch mode, the work is done when it's done, and the related data is then freed).
This approach uses a (unique?) Windows feature called custom heaps. Possibly you can build something similar yourself on other OSes.
The functions you need are in a header called <heapapi.h>.
Here is how it would look like:
At the start of a memory-intensive phase of your program, use HeapCreate() to create a heap where all your data will go.
Perform the memory-intensive tasks.
At the end of the memory-intensive phase, free your data and call HeapDestroy().
Depending on the detailed behavior of your application (e.g. whether this memory-intensive computation runs in one thread or in multiple threads), you can configure the heap accordingly and possibly even gain a little speed (if only one thread uses the data, you can give HeapCreate() the HEAP_NO_SERIALIZE flag, so it will not take a lock). You can also give an upper bound on what the heap will allow to be stored.
And once your computation is complete, destroying the heap also prevents long-term fragmentation, because when the time comes for the next computation phase, you start with a fresh heap.
Here is the documentation of the heap api.
In fact, I found this feature so useful, that I replicated it on FreeBSD and Linux for an application I ported, using virtual memory functions and my own heap implementation. Was a few years back and I do not have the rights for that piece of code, so I cannot show or share.
You could also combine this approach with fixed element size pools, using one heap for one dedicated block size and then expect less/no fragmentation within each of those heaps (because all blocks have same size).
#include <windows.h>
#include <cassert>
#include <cstddef>

struct EcoString {
    size_t heapIndex;
    size_t capacity;
    char* data;
    char* get() { return data; }
};

struct FixedSizeHeap {
    const size_t SIZE;
    HANDLE m_heap;
    explicit FixedSizeHeap(size_t size)
        : SIZE(size)
        , m_heap(HeapCreate(0, 0, 0))
    {
    }
    ~FixedSizeHeap() {
        HeapDestroy(m_heap);
    }
    bool allocString(size_t capacity, EcoString& str) {
        assert(capacity <= SIZE);
        str.capacity = SIZE; // we alloc SIZE bytes anyway...
        str.data = (char*)HeapAlloc(m_heap, 0, SIZE);
        if (nullptr != str.data)
            return true;
        str.capacity = 0;
        return false;
    }
    void freeString(EcoString& str) {
        HeapFree(m_heap, 0, str.data);
        str.data = nullptr;
        str.capacity = 0;
    }
};
#include <vector>

struct BucketHeap {
    using Buckets = std::vector<FixedSizeHeap>; // or std::array
    Buckets buckets;
    /*
    (loop for i from 0
          for size = 40 then (+ size (ash size -1))
          while (< size 80000)
          collecting (cons i size))
    ((0 . 40) (1 . 60) (2 . 90) (3 . 135) (4 . 202) (5 . 303) (6 . 454) (7 . 681)
     (8 . 1021) (9 . 1531) (10 . 2296) (11 . 3444) (12 . 5166) (13 . 7749)
     (14 . 11623) (15 . 17434) (16 . 26151) (17 . 39226) (18 . 58839))
    Init Buckets with index (first item) and SIZE (rest item)
    from the item list above.
    */
    // allocate(nbytes) looks for the correct bucket (linear or binary
    // search) and calls allocString() on that bucket, storing the
    // bucket's index in the EcoString.heapIndex field (so free has an
    // easier time).
    // free(EcoString& str) uses heapIndex to find the right bucket and
    // then calls freeString() on that bucket.
};
What you are trying to do is called a slab allocator and it is a very well studied problem with lots of research papers.
You don't need every possible size. Usually slabs come in power of 2 sizes.
Don't reinvent the wheel: there are plenty of open-source implementations of a slab allocator. The Linux kernel uses one.
Start by reading the Wikipedia page on Slab Allocation.
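To illustrate the power-of-two size classes: mapping a request to its slab is just rounding up to the next class, something like this sketch:

#include <cstddef>

// Round a request up to its power-of-two size class, the usual slab
// scheme: sizeClass(300) == 512, sizeClass(3000) == 4096, etc.
size_t sizeClass(size_t n, size_t minClass = 16)
{
    size_t c = minClass;
    while (c < n)
        c <<= 1;
    return c;
}

Every allocation of class c comes from (and returns to) the slab for c, so a freed 512-byte slot is always reusable by the next request in the 300-to-512-byte range.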
I don't know the details of your program. There are many patterns for working with memory, each of which leads to different results and problems. However, you can take a look at the .NET garbage collector. It uses several generations of objects (similar to pools of memory), and each object in a generation is movable (its address in memory changes while the generation is compacted). Such a trick allows it to "compress"/compact memory and reduce fragmentation. You could implement a memory manager with similar semantics; it's not necessary to implement a full garbage collector.

Memory Demands: Heap vs Stack in C++

So I had a strange experience this evening.
I was working on a program in C++ that required some way of reading a long list of simple data objects from file and storing them in the main memory, approximately 400,000 entries. The object itself is something like:
class Entry
{
public:
    Entry(int x, int y, int type);
    Entry();
    ~Entry();
    // some other basic functions
private:
    int m_X, m_Y;
    int m_Type;
};
Simple, right? Well, since I needed to read them from file, I had some loop like
Entry** globalEntries;
globalEntries = new Entry[totalEntries]; // totalEntries read from file, about 400,000
for (int i = 0; i < totalEntries; i++)
{
    globalEntries[i] = new Entry(.......);
}
That addition added about 25 to 35 megabytes to the program's footprint when I tracked it in Task Manager. A simple change to stack allocation:
Entry* globalEntries;
globalEntries = new Entry[totalEntries];
for (int i = 0; i < totalEntries; i++)
{
    globalEntries[i] = Entry(.......);
}
and suddenly it only required 3 megabytes. Why is that happening? I know pointer objects have a little bit of extra overhead to them (4 bytes for the pointer address), but it shouldn't be enough to make THAT much of a difference. Could it be because the program is allocating memory inefficiently, and ending up with chunks of unallocated memory in between allocated memory?
Your code is wrong, or I don't see how this worked. With new Entry[count] you create a new array of Entry (its type is Entry*), yet you assign it to an Entry**, so I presume you actually used new Entry*[count].
What you did next was create another new Entry object on the heap and store it in the globalEntries array. So you need memory for 400,000 pointers plus 400,000 elements. 400,000 pointers take 3 MiB of memory on a 64-bit machine. Additionally, you have 400,000 single Entry allocations, which will all require sizeof(Entry) plus potentially some more memory (for the memory manager -- it might have to store the size of the allocation, the associated pool, alignment/padding, etc.). This additional bookkeeping memory can quickly add up.
If you change your second example to:
Entry* globalEntries;
globalEntries = new Entry[count];
for (...) {
    globalEntries[i] = Entry(...);
}
memory usage should be equal to the stack approach.
Of course, ideally you'll use a std::vector<Entry>.
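A sketch of the vector version (readEntry() is a hypothetical stand-in for however the fields are parsed from the file):

#include <vector>

Entry readEntry();   // hypothetical: parses one entry from the file

std::vector<Entry> loadEntries(int totalEntries)
{
    std::vector<Entry> entries;
    entries.reserve(totalEntries);   // one contiguous allocation up front
    for (int i = 0; i < totalEntries; ++i)
        entries.push_back(readEntry());
    return entries;
}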
First of all, without specifying which column exactly you were watching, the number in Task Manager means nothing. On a modern operating system it's difficult even to define what you mean by "used memory": are we talking about private pages? The working set? Only the stuff that stays in RAM? Does reserved but not committed memory count? Who pays for memory shared between processes? Are memory-mapped files included?
If you are watching some meaningful metric, it's impossible to see only 3 MB of memory used: your object is at least 12 bytes (assuming 32-bit integers and no padding), so 400,000 elements need about 4.58 MB. Also, I'd be surprised if it worked with actual stack allocation: the default stack size in VC++ is 1 MB, so you should already have had a stack overflow.
Anyhow, it is reasonable to expect a different memory usage:
the stack is (mostly) allocated right from the beginning, so that's memory you nominally consume even without really using it for anything (actually virtual memory and automatic stack expansion makes this a bit more complicated, but it's "true enough");
the CRT heap is opaque to the task manager: all it sees is the memory given by the operating system to the process, not what the C heap has "really" in use; the heap grows (requesting memory to the OS) more than strictly necessary to be ready for further memory requests - so what you see is how much memory it is ready to give away without further syscalls;
your "separate allocations" method has a significant overhead. The all-contiguous array you'd get with new Entry[size] costs size*sizeof(Entry) bytes, plus the heap bookkeeping data (typically a few integer-sized fields); the separated allocations method costs at least size*sizeof(Entry) (size of all the "bare elements") plus size*sizeof(Entry *) (size of the pointer array) plus size+1 multiplied by the cost of each allocation. If we assume a 32 bit architecture with a cost of 2 ints per allocation, you quickly see that this costs size*24+8 bytes of memory, instead of size*12+8 for the contiguous array in the heap;
the heap normally really gives away blocks that aren't exactly the size you asked for, because it manages blocks of fixed sizes; so, if you allocate single objects like that, you are probably also paying for some extra padding - supposing it has 16-byte blocks, you are paying 4 extra bytes per element by allocating them separately; this moves our memory estimate to size*28+8, i.e. an overhead of 16 bytes for each 12-byte element (the arithmetic is sketched below).
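Plugging the question's 400,000 elements into those estimates, as a back-of-the-envelope sketch under the same 32-bit assumptions (12-byte Entry, 4-byte pointers, 8 bytes of heap bookkeeping, 16-byte blocks):

#include <cstdio>

int main()
{
    const unsigned long n     = 400000;  // number of Entry objects
    const unsigned long entry = 12;      // sizeof(Entry): three 32-bit ints

    // one contiguous array: elements plus a single heap header
    const unsigned long contiguous = n * entry + 8;

    // separate allocations: pointer array, plus per object the element,
    // the heap header, and padding up to a 16-byte block
    const unsigned long separate = n * 4 + n * (entry + 8 + 4) + 8;

    std::printf("contiguous: %lu bytes (~%.2f MB)\n", contiguous, contiguous / 1048576.0);
    std::printf("separate:   %lu bytes (~%.2f MB)\n", separate, separate / 1048576.0);
    return 0;
}

That prints roughly 4.58 MB for the contiguous layout versus 10.68 MB for the separate allocations, before any additional allocator overhead.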

Allocating more memory than there exists using malloc

This code snippet will allocate 2Gb every time it reads the letter 'u' from stdin, and will initialize all the allocated chars once it reads 'a'.
#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <vector>
#define bytes 2147483648
using namespace std;

int main()
{
    char input[8] = {0};   // initialized, so the first loop test is defined
    vector<char *> activate;
    while (input[0] != 'q')
    {
        fgets(input, sizeof input, stdin);  // fgets bounds the read; gets() cannot
        if (input[0] == 'u')
        {
            char *m = (char*) malloc(bytes);
            if (m == NULL)
                cout << "cant allocate mem" << endl;
            else
            {
                cout << "ok" << endl;
                activate.push_back(m);  // only keep pointers that are valid
            }
        }
        else if (input[0] == 'a')
        {
            for (size_t i = 0; i < activate.size(); i++)
            {
                char *m = activate[i];
                for (size_t x = 0; x < bytes; x++)
                    m[x] = 'a';
            }
        }
    }
    return 0;
}
I am running this code on a Linux virtual machine that has 3 GB of RAM. While monitoring the system resource usage using the htop tool, I realized that the malloc operation is not reflected in the resources.
For example, when I input 'u' only once (i.e. allocate 2 GB of heap memory), I don't see the memory usage increasing by 2 GB in htop. It is only when I input 'a' (i.e. initialize) that I see the memory usage increasing.
As a consequence, I am able to "malloc" more heap memory than exists. For example, I can malloc 6 GB (which is more than my RAM and swap memory combined) and malloc allows it (i.e. NULL is not returned). But when I try to initialize the allocated memory, I can see the memory and swap filling up until the process is killed.
My questions:
1. Is this a kernel bug?
2. Can someone explain to me why this behavior is allowed?
It is called memory overcommit. You can disable it by running as root:
echo 2 > /proc/sys/vm/overcommit_memory
and it is not a kernel feature that I like (so I always disable it). See malloc(3), mmap(2) and proc(5).
NB: echo 0 instead of echo 2 often - but not always - works too. Read the docs (in particular the proc man page I just linked to).
from man malloc (online here):
By default, Linux follows an optimistic memory allocation strategy.
This means that when malloc() returns non-NULL there is no guarantee
that the memory really is available.
So when you just ask for too much, it "lies" to you; when you actually want to use the allocated memory, it will try to find enough memory for you, and it might crash if it can't.
No, this is not a kernel bug. You have discovered something known as late paging (or overcommit).
Until you write a byte to the address allocated with malloc (...) the kernel does little more than "reserve" the address range. This really depends on the implementation of your memory allocator and operating system of course, but most good ones do not incur the majority of kernel overhead until the memory is first used.
The Hoard allocator is one big offender that comes to mind immediately; through extensive testing I have found it almost never takes advantage of a kernel that supports late paging. You can always mitigate the effects of late paging in any allocator if you zero-fill the entire memory range immediately after allocation.
Real-time operating systems like VxWorks will never allow this behavior because late paging introduces serious latency. Technically, all it does is put the latency off until a later indeterminate time.
For a more detailed discussion, you may be interested to see how IBM's AIX operating system handles page allocation and overcommitment.
This is a result of what Basile mentioned: memory overcommit. However, the explanation is kind of interesting.
Basically, when you attempt to map additional memory in Linux (POSIX?), the kernel will just reserve it, and will only actually end up using it if your application accesses one of the reserved pages. This allows multiple applications to reserve more than the actual total amount of RAM/swap.
This is desirable behavior on most Linux environments unless you've got a real-time OS or something where you know exactly who will need what resources, when and why.
Otherwise somebody could come along, malloc up all the ram (without actually doing anything with it) and OOM your apps.
Another example of this lazy allocation is mmap(), where you have a virtual map that the file you're mapping can fit inside, but only a small amount of real memory dedicated to the effort. This allows you to mmap() huge files (larger than your available RAM) and use them like normal file handles, which is nifty.
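A minimal sketch of that file-mapping case (POSIX calls; "huge.dat" is a placeholder, error handling kept short):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("huge.dat", O_RDONLY);   // may be far larger than RAM
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // The whole file gets a virtual mapping, but pages are only brought
    // into real memory lazily, on first access.
    char* data = (char*)mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte: %d\n", data[0]);   // faults in just one page

    munmap(data, (size_t)st.st_size);
    close(fd);
    return 0;
}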
Initializing / working with the memory should work:
memset(m, 0, bytes);
Also you could use calloc that not only allocates memory but also fills it with zeros for you:
char* m = (char*) calloc(1, bytes);
1. Is this a kernel bug?

No.

2. Can someone explain to me why this behavior is allowed?

There are a few reasons:
Mitigating the need to know the eventual memory requirement - it's often convenient for an application to be able to ask for an amount of memory that it considers an upper limit on what it might actually need. For example, if it's preparing some kind of report, either an initial pass just to calculate the eventual size of the report or a realloc() of successively larger areas (with the risk of having to copy) may significantly complicate the code and hurt performance, whereas multiplying some maximum length of each entry by the number of entries can be very quick and easy. If you know virtual memory is relatively plentiful as far as your application's needs are concerned, then making a larger allocation of virtual address space is very cheap.
Sparse data - if you have the virtual address space spare, being able to have a sparse array and use direct indexing, or to allocate a hash table with a generous capacity()-to-size() ratio, can lead to a very high-performance system. Both work best (in the sense of having low overheads/waste and efficient use of memory caches) when the data element size is a multiple of the memory paging size, or, failing that, much larger or a small integral fraction thereof.
Resource sharing - consider an ISP offering a "1 gigabit per second" connection to 1000 consumers in a building. They know that if all the consumers use it simultaneously they'll each get about 1 megabit, but they rely on their real-world experience that, though people ask for 1 gigabit and want a good fraction of it at specific times, there's inevitably some lower maximum and a much lower average for concurrent usage. The same insight applied to memory allows operating systems to support more applications than they otherwise could, with reasonable average success at satisfying expectations. Much as the shared Internet connection degrades in speed as more users make simultaneous demands, paging from swap memory on disk may kick in and reduce performance. But unlike an Internet connection, there's a limit to the swap memory, and if all the apps really do try to use the memory concurrently such that the limit is exceeded, some will start getting signals/interrupts/traps reporting memory exhaustion. In summary, with this memory-overcommit behaviour enabled, simply checking that malloc()/new returned a non-NULL pointer is not sufficient to guarantee the physical memory is actually available, and the program may still receive a signal later as it attempts to use the memory.

Why is memory not reusable after allocating/deallocating a number of small objects?

While investigating a memory leak in one of our projects, I've run into a strange issue. Somehow, the memory allocated for objects (a vector of shared_ptr to objects, see below) is not fully reclaimed when the parent container goes out of scope, and can't be reused except for small objects.
The minimal example: when the program starts, I can allocate a single continuous block of 1.5 GB without a problem. After I use the memory somewhat (by creating and destroying a number of small objects), I can no longer do the big block allocation.
Test program:
#include <iostream>
#include <memory>
#include <vector>
using namespace std;

class BigClass
{
private:
    double a[10000];
};

void TestMemory() {
    cout << "Performing TestMemory" << endl;
    vector<shared_ptr<BigClass>> list;
    for (int i = 0; i < 10000; i++) {
        shared_ptr<BigClass> p(new BigClass());
        list.push_back(p);
    }
}

void TestBigBlock() {
    cout << "Performing TestBigBlock" << endl;
    char* bigBlock = new char[1024*1024*1536];
    delete[] bigBlock;
}

int main() {
    TestBigBlock();
    TestMemory();
    TestBigBlock();
}
The problem also repeats if using plain pointers with new/delete or malloc/free in the loop, instead of shared_ptr.
The culprit seems to be that after TestMemory(), the application's virtual memory stays at 827125760 (regardless of the number of times I call it). As a consequence, there's no free VM region big enough to hold 1.5 GB. But I'm not sure why, since I'm definitely freeing the memory I used. Is it some "performance optimization" the CRT does to minimize OS calls?
Environment is Windows 7 x64 + VS2012 + 32-bit app without LAA
Sorry for posting yet another answer since I am unable to comment; I believe many of the others are quite close to the answer really :-)
Anyway, the culprit is most likely address space fragmentation. I gather you are using Visual C++ on Windows.
The C / C++ runtime memory allocator (invoked by malloc or new) uses the Windows heap to allocate memory. The Windows heap manager has an optimization in which it will hold on to blocks under a certain size limit, in order to be able to reuse them if the application requests a block of similar size later. For larger blocks (I can't remember the exact value, but I guess it's around a megabyte) it will use VirtualAlloc outright.
Other long-running 32-bit applications with a pattern of many small allocations have this problem too; the one that made me aware of the issue is MATLAB - I was using the 'cell array' feature to basically allocate millions of 300-400 byte blocks, causing exactly this issue of address space fragmentation even after freeing them.
A workaround is to use the Windows heap functions (HeapCreate() etc.) to create a private heap, allocate your memory through that (passing a custom C++ allocator to your container classes as needed), and then destroy that heap when you want the memory back. This also has the happy side-effect of being very fast vs delete-ing a zillion blocks in a loop.
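A sketch of what that can look like - a minimal C++11 allocator drawing from a private heap (names are mine, error handling minimal; assumes a standard library that fills in the rest via std::allocator_traits):

#include <windows.h>
#include <cstddef>
#include <new>
#include <vector>

// Allocator that takes all its memory from one private Windows heap.
// Destroying that heap releases the address space wholesale.
template <class T>
struct PrivateHeapAllocator {
    typedef T value_type;
    HANDLE heap;

    explicit PrivateHeapAllocator(HANDLE h) : heap(h) {}
    template <class U>
    PrivateHeapAllocator(const PrivateHeapAllocator<U>& o) : heap(o.heap) {}

    T* allocate(std::size_t n) {
        void* p = HeapAlloc(heap, 0, n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) { HeapFree(heap, 0, p); }

    template <class U>
    bool operator==(const PrivateHeapAllocator<U>& o) const { return heap == o.heap; }
    template <class U>
    bool operator!=(const PrivateHeapAllocator<U>& o) const { return heap != o.heap; }
};

int main()
{
    HANDLE heap = HeapCreate(0, 0, 0);   // growable private heap
    {
        std::vector<int, PrivateHeapAllocator<int>> v{PrivateHeapAllocator<int>(heap)};
        v.resize(1000000);
    }                                    // container gone before its heap
    HeapDestroy(heap);                   // everything handed back at once
    return 0;
}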
Re. "what is remaining in memory" to cause the issue in the first place: Nothing is remaining 'in memory' per se, it's more a case of the freed blocks being marked as free but not coalesced. The heap manager has a table/map of the address space, and it won't allow you to allocate anything which would force it to consolidate the free space into one contiguous block (presumably a performance heuristic).
There is absolutely no memory leak in your C++ program. The real culprit is memory fragmentation.
Just to be sure (regarding the memory-leak point), I ran this program under Valgrind, and it did not report any memory leak:
//Valgrind Report
mantosh#mantosh4u:~/practice$ valgrind ./basic
==3227== HEAP SUMMARY:
==3227== in use at exit: 0 bytes in 0 blocks
==3227== total heap usage: 20,017 allocs, 20,017 frees, 4,021,989,744 bytes allocated
==3227==
==3227== All heap blocks were freed -- no leaks are possible
Please find below my responses to the queries/doubts raised in the original question.
The culprit seems to be that after TestMemory(), the application's
virtual memory stays at 827125760 (regardless of number of times I
call it).
Yes, the real culprit is the hidden fragmentation that happens during the TestMemory() function. Just to explain fragmentation, I have taken this snippet from Wikipedia:

"...free memory is separated into small blocks and is interspersed by allocated memory. It is a weakness of certain storage allocation algorithms, when they fail to order memory used by programs efficiently. The result is that, although free storage is available, it is effectively unusable because it is divided into pieces that are too small individually to satisfy the demands of the application. For example, consider a situation wherein a program allocates 3 continuous blocks of memory and then frees the middle block. The memory allocator can use this free block of memory for future allocations. However, it cannot use this block if the memory to be allocated is larger in size than this free block."
The paragraph above explains memory fragmentation very nicely. Some allocation patterns (such as frequent allocation and deallocation) lead to memory fragmentation, but its end impact (i.e. the 1.5 GB allocation failing) will vary greatly between systems, as different OS/heap managers use different strategies and implementations.
As an example, your program ran perfectly fine on my machine (Linux), whereas you encountered the memory allocation failure.
Regarding your observation that the VM size remains constant: the VM size seen in the task manager is not directly proportional to our memory allocation calls. It mainly depends on how many bytes are in the committed state. When you allocate some dynamic memory (using new/malloc) and do not write/initialize anything in those memory regions, it does not go into the committed state, and hence the VM size is not impacted by it. The VM size depends on many other factors and is a bit complicated, so we should not rely on it completely while reasoning about the dynamic memory allocation of our program.
As a consequence, there's no free VM region big enough to hold 1.5 GB.

Yes, due to fragmentation there is no contiguous 1.5 GB of memory. Note that the total remaining (free) memory will be more than 1.5 GB, but it is fragmented, hence there is no single big contiguous block.
But I'm not sure why - since I'm definitely freeing the memory I used.
Is it some "performance optimization" CRT does to minimize OS calls?
I have explained why this may happen even though you have freed all your memory. Now, in order to fulfill the user program's request, the OS calls its virtual memory manager and tries to allocate the memory which will be used by the heap memory manager. But grabbing additional memory depends on many other complex factors which are not easy to understand.
Possible Resolution of Memory Fragmentation
We should try to reuse memory allocations rather than allocate and free frequently. There can be patterns (such as a particular request size allocated in a particular order) which drive the overall memory into a fragmented state. It could take substantial design changes in your program to improve memory fragmentation; this is a complex topic and requires internal understanding of the memory manager to find the complete root cause of such things.
However, there are tools on Windows-based systems which I am not very familiar with. I found one excellent SO post about which tools (on Windows) can be useful for checking the fragmentation status of your program yourself:
https://stackoverflow.com/a/1684521/2724703
This is not a memory leak. The memory you used was allocated by the C/C++ runtime. The runtime acquires a large block of memory from the OS once, and each new you call is then served from that block. When you delete an object, the runtime does not return the memory to the OS immediately; it may hold on to it for performance.
There is nothing here which indicates a genuine "leak". The pattern of memory you describe is not unexpected. Here are a few points which might help to understand. What happens is highly OS dependent.
A program often has a single heap which can be extended or shrunk in length. It is however one contiguous memory area, so changing the size is just changing where the end of the heap is. This makes it very difficult to ever "return" memory to the OS, since even one little tiny object in that space will prevent its shrinking. On Linux you can lookup the function 'brk' (I know you're on Windows, but I presume it does something similar).
Large allocations are often done with a different strategy. Rather than putting them in the general-purpose heap, an extra block of memory is created. When it is deleted, this memory can actually be "returned" to the OS, since it's guaranteed nothing else is using it.
Large blocks of unused memory don't tend to consume a lot of resources. If you generally aren't using the memory any more they might just get paged to disk. Don't presume that because some API function says you're using memory that you are actually consuming significant resources.
APIs don't always report what you think. Due to a variety of optimizations and strategies it may not actually be possible to determine how much memory is in use and/or available on a system at a particular moment. Unless you have intimate details of the OS you won't know for sure what those values mean.
The first two points can explain why a bunch of small blocks and one large block result in different memory patterns. The latter points indicate why this approach to detecting leaks is not useful. To detect genuine object-based "leaks" you generally need a dedicated profiling tool which tracks allocations.
For example, in the code provided:
1. TestBigBlock allocates and deletes an array; assume this uses a special memory block, so the memory is returned to the OS.
2. TestMemory extends the heap for all the small objects, and never returns any of it to the OS. Here the heap is entirely available from the application's point of view, but from the OS's point of view it is assigned to the application.
3. TestBigBlock now fails: although it would use a special memory block, it shares the overall address space with the heap, and there just isn't enough left after step 2 is complete.

How to allocate large memory blocks on Windows 7

I am trying to allocate a large block (5 GB) of memory under Windows 7 Professional. I have a 64-bit machine and 16 GB of RAM, and I'm using MS Visual Studio 10. For those of you who might ask why: it's because I need to hold a 2-D raster representation of a map of reference numbers to polygon data, and the maps can be up to 40,000 x 40,000 units. This has to go into RAM, and fragmenting it into smaller blocks would be too expensive at runtime.
So if I do this:
int _tmain(int argc, _TCHAR* argv[])
{
    int t = INT_MAX;
    int* test = new int[t/6];
    delete[] test;   // array delete to match new[]
    return 0;
}
The call to new fails, but
int * test = new int[t/7]; succeeds.
Investigating a bit more I found that the memory allocation is only using the memory which is designated as 'free' by the resource monitor. So when this is smaller than the allocation requested the allocation fails. The resource monitor tells me that (when I looked) I had ~5GB in use, ~10GB on standby, and a little over 1GB free.
As far as I understand it this should not happen. Surely if memory is requested it should be taken from the standby memory? If this is not the case then is there a way to reduce the amount of standby memory used by windows from inside C++?
The latter can be done outside C++ using RAMMap, as I discovered from this post: Clear the windows 7 standby memory programmatically. Unfortunately there was no useful answer to the question of programmatic clearing; perhaps I'll be luckier in C++.
Of course, the more likely scenario is that I am just missing something obvious.
Thanks
Remember that when you allocate on the heap, the memory has to be one contiguous block. If the memory is fragmented, no block big enough to hold the allocation may be available, even though the total free memory is much larger than what you request.
Memory fragmentation might be one of the issues: you could have that amount of memory free, but not contiguous.
A better approach would be to allocate rather big chunks of memory (several megabytes each) and keep them in a linked list; a sketch follows below. The probability that you will find space for such a chunk (and thus for every chunk) is far higher than the probability of finding contiguous space for several gigabytes.
As for performance: as long as you are working within exactly one chunk, you lose no speed, as the data remains in the CPU cache. If you switch often between two chunks (e.g. you swap items between them), you will get cache misses.
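A sketch of the chunked approach (the chunk size is illustrative); std::deque does something similar internally, just with much smaller chunks:

#include <vector>
#include <cstddef>

// Grows in fixed-size chunks instead of one huge contiguous block, so the
// allocator only ever has to find a few megabytes of contiguous space.
class ChunkedBuffer {
    static const size_t CHUNK = 4 * 1024 * 1024;   // 4 MB per chunk
    std::vector<char*> chunks;
public:
    ~ChunkedBuffer() {
        for (size_t i = 0; i < chunks.size(); ++i)
            delete[] chunks[i];
    }
    // Index into the logical buffer, growing on demand.
    char& at(size_t index) {
        size_t chunk = index / CHUNK;
        while (chunks.size() <= chunk)
            chunks.push_back(new char[CHUNK]);
        return chunks[chunk][index % CHUNK];
    }
};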
Anyway, workarounds like clearing the standby memory are like poker: you might get your chunk then, or you might not. It depends on too many factors you can't control.