Stack memory not released - c++

I have the following loop, which pops a C++ concurrent queue I have, from the implementation here. https://juanchopanzacpp.wordpress.com/2013/02/26/concurrent-queue-c11/
while (!interrupted)
{
pxData data = queue->pop();
if (data.value == -1)
{
break; // exit loop on terminating condition
}
usleep(7000); // stub to simulate processing
}
I am looking at the memory history using System Monitor in CentOS7.
I'm trying to free up the memory taken up by the queue, after reading the value from the queue. However, as the while loop above runs, I don't see the memory usage going down. I've verified that the queue length does go down.
It does go down, however, when -1 is encountered and the loop exits. (program is still running) But I can't have this, because where usleep is, I want to do some intensive processing.
Question: Why doesn't the memory occupied by data get free-ed? (according to System Monitor) Isn't the stack allocated memory supposed to be free-ed when the variable goes out of scope?
The struct is defined as follows, and populated at the beginning of the program.
typedef struct pxData
{
float value; // -1 value terminates the loop
float x, y, z;
std::complex<float> valueData[65536];
} pxData;
It's populated with ~10000 pxData, which roughly translates to 5GB. System only has ~8GB.
So it's important that the memory is free-ed up for doing other processing in the system.

There are a few things at play here.
Virtual Memory
First, you need to understand that just because your program is "using" 5 GB of memory does not mean that there are only 3 GB of RAM left for other programs. Virtual memory means that those 5 GB might be only 1 GB of actual "resident" data, and the other 4 GB may actually be on disk rather than in RAM. So it's important to look at the "resident set size" rather than the "virtual size" when you're looking at your program. And note that if your system actually runs low on RAM, the OS may shrink the RSS of some programs by "paging out" some of their memory. So don't worry too much about "5 GB" appearing in the system monitor--worry if you have a real, concrete performance problem.
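To compare the two numbers from inside the process rather than from the system monitor, a minimal Linux-only sketch could read /proc/self/statm, whose first two fields are the total virtual size and the resident set size, both in pages. The MemoryUsage type and the function name are hypothetical helpers, not part of the question's code.

```cpp
#include <fstream>
#include <unistd.h>  // sysconf, _SC_PAGESIZE

// Hypothetical helper type, introduced only for this illustration.
struct MemoryUsage {
    long virtual_bytes;   // total virtual size of the process
    long resident_bytes;  // resident set size (pages currently in RAM)
};

// Reads the first two fields of /proc/self/statm (Linux-specific).
// Both fields stay 0 if the file cannot be read.
MemoryUsage current_memory_usage() {
    long size_pages = 0;
    long resident_pages = 0;
    std::ifstream statm("/proc/self/statm");
    statm >> size_pages >> resident_pages;
    const long page_size = sysconf(_SC_PAGESIZE);
    return {size_pages * page_size, resident_pages * page_size};
}
```

Calling this before and after draining part of the queue shows whether the resident figure, rather than the virtual one, is what stays high.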
Heap Allocation
The second aspect is why your virtual size does not decrease as you remove items from the queue. We can guess that you put those elements into the queue by creating them with malloc or new one-by-one, then pushing them onto the back of the queue. This means that the first element you allocated will come out of the queue first. And that in turn means that when you have drained 90% of the queue, your memory allocation might look like this:
[program|------------------unused-------------------|pxData]
The problem here is that in the real world, just because you free or delete something does not mean the operating system instantly reclaims that memory. In fact, it may not be able to reclaim any unused spans unless they are at the "end" (i.e. most recently allocated). Since C++ does not have garbage collection and cannot move items around in memory without your consent, you end up with this big "hole" in your program's virtual memory. That hole would be used to satisfy future memory allocation requests, but if you haven't got any, it just sits there, until the queue is completely empty:
[program|------------------unused--------------------------]
Then the system is able to shrink your virtual address space back down:
[program]
Which brings you back to where you started.
Solutions
If you want to "fix" this, one option is to allocate your memory in "reverse", i.e. put the last items allocated into the front of the queue.
Another option is to allocate the elements for the queue via mmap, which is something that e.g. Linux will do automatically for allocations which are "large." You can change the threshold for this by calling mallopt(3) with M_MMAP_THRESHOLD and setting it to be a little bit smaller than your struct size. This makes the allocations independent of each other, so the OS can reclaim them individually. This technique can even be applied to existing programs without recompilation, so is often useful if you need to solve this problem in a program you cannot modify.
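As a rough sketch of the mallopt approach, assuming glibc on Linux: the helper below (its name is made up for this example) sets the mmap threshold just below the size of one object, so that each such allocation gets its own mapping. The struct is copied from the question. For the "without recompilation" case, mallopt(3) also documents a MALLOC_MMAP_THRESHOLD_ environment variable.

```cpp
#include <malloc.h>  // mallopt, M_MMAP_THRESHOLD (glibc-specific)
#include <climits>
#include <complex>
#include <cstddef>

// Struct from the question, repeated so the sketch is self-contained.
struct pxData {
    float value;
    float x, y, z;
    std::complex<float> valueData[65536];
};

// Hypothetical helper: ask malloc to serve allocations of at least
// object_size bytes with mmap. Returns true if mallopt accepted the value.
bool use_mmap_for_objects_of(std::size_t object_size) {
    if (object_size < 2 || object_size > static_cast<std::size_t>(INT_MAX)) {
        return false;  // mallopt takes an int, so reject sizes that do not fit
    }
    // "A little bit smaller than your struct size", as described above.
    return mallopt(M_MMAP_THRESHOLD, static_cast<int>(object_size - 1)) == 1;
}
```

The call would go near the start of main(), before the queue is populated, e.g. use_mmap_for_objects_of(sizeof(pxData)).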

A C++ implementation would call some operator delete to release dynamically allocated (using some operator new) memory. In several C++ standard libraries, new calls malloc and delete calls free.
(I am focusing on a Linux point of view, but the principles are similar on other OSes.)
But while malloc (or ::operator new) sometimes asks the OS kernel for more memory through system calls that change the virtual address space, like mmap(2), free (or ::operator delete) often simply marks the released memory zone as available again for future calls to malloc (or new).
So from the kernel's point of view (e.g. as seen through /proc/, see proc(5)...), the virtual address space does not change, and the memory remains consumed, even though inside the application it is marked as "freed" and will be reused by some future allocation (a future call to malloc or new).
And most C++ standard containers are internally using heap data. In particular your local (stack-allocated) std::map or std::vector (or std::deque) variable will call new & delete for internal data.
BTW, I find your declaration quite strange. Unless every struct pxData has exactly 65536 used valueData slots, I would suggest using a std::vector, so that you have
std::vector<std::complex<float>> valueData;
and improve your code accordingly. You'll probably need to do some valueData.reserve(somesize); and/or valueData.resize(somesize); and/or valueData.push_back(somecomplexnumber); etc....

Related

Optimal Way of Using mlockall() for Real-time Application (nanosecond sensitive)

I am reading mlockall()'s manpage: http://man7.org/linux/man-pages/man2/mlock.2.html
It mentions
Real-time processes that are using mlockall() to prevent delays on page
faults should reserve enough locked stack pages before entering the time-
critical section, so that no page fault can be caused by function calls. This
can be achieved by calling a function that allocates a sufficiently large
automatic variable (an array) and writes to the memory occupied by this array in
order to touch these stack pages. This way, enough pages will be mapped for the
stack and can be locked into RAM. The dummy writes ensure that not even copy-
on-write page faults can occur in the critical section.
I am a bit confused by this statement:
This can be achieved by calling a function that allocates a sufficiently large
automatic variable (an array) and writes to the memory occupied by this array in
order to touch these stack pages.
All the automatic variables (variables on stack) are created "on the fly" on the stack when the function is called. So how can I achieve what the last statement says?
For example, let's say I have this function:
void foo() {
char a;
uint16_t b;
std::deque<int64_t> c;
// do something with those variables
}
Or does it mean before I call any function, I should call a function like this in main():
void reserveStackPages() {
int64_t stackPage[4096/8 * 1024 * 1024];
memset(stackPage, 0, sizeof(stackPage));
}
If yes, does it make a difference if I first allocate the stackPage variable on heap, write and then free? Probably yes, because heap and stack are 2 different region in the RAM?
std::deque exists above is just to bring up another related question -- what if I want to reserve memory for things using both stack pages and heap pages. Will calling "heap" version of reserveStackPages() help?
The goal is to minimize all the jitters in the application (yes, I know there are many other things to look at such as TLB miss, etc; just trying to deal with one kind of jitter at once, and slowly progressing into all).
Thanks in advance.
P.S. This is for a low latency trading application if it matters.
You generally don't need to use mlockall unless you are coding (more or less hard) real-time applications (I have actually never used it).
If you do need it, you had better code the most real-time parts of your code in C (not in genuine C++), because you surely want to understand the details of memory allocation. Notice that unless you dive into the std::deque implementation, you don't know exactly where its data sits (probably most of the data is heap allocated, even if your c is an automatic variable).
You should first understand in detail the virtual address space of your process. For that, proc(5) is useful: from inside your process, you can read /proc/self/maps; from outside (e.g. in some terminal) you can run cat /proc/1234/maps for a process of pid 1234. Or use pmap(1).
because heap and stack are 2 different regions in the RAM?
In fact, your process' address space contains many segments (listed in /proc/1234/maps), many more than two. Typically every dynamically linked shared library (such as libc.so) brings a few segments.
Try cat /proc/self/maps and cat /proc/$$/maps in your terminal to get a better intuition about virtual address spaces. On my machine, the first gives 19 segments of the cat process -each displayed as a line- and the second 97 segments of the zsh (my shell) process.
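The same count can be taken from inside a program. A minimal Linux-only sketch, with a hypothetical function name, that counts the lines of /proc/self/maps (one line per segment):

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Counts the mappings of the calling process by counting lines in
// /proc/self/maps. Returns 0 if the file cannot be opened.
std::size_t count_mappings() {
    std::ifstream maps("/proc/self/maps");
    std::size_t count = 0;
    std::string line;
    while (std::getline(maps, line)) {
        ++count;
    }
    return count;
}
```

Printing each line instead of counting it would show the address range, permissions, and backing file of every segment.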
To ensure that your stack has enough space, you indeed could call a function allocating a large enough automatic variable, like your reserveStackPages. Beware that call stacks are practically of limited size (a few megabytes usually, see also setrlimit(2)).
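A minimal sketch of that idea, assuming Linux: the reserve size (256 KiB) and the touch stride (4096 bytes) are arbitrary choices for this example and would have to be tuned to the real call depth and kept well below the stack limit. The function names are hypothetical.

```cpp
#include <sys/mman.h>  // mlockall, MCL_CURRENT, MCL_FUTURE
#include <cstddef>

// Assumed values for illustration only.
constexpr std::size_t kStackReserveBytes = 256 * 1024;
constexpr std::size_t kTouchStride = 4096;

// Allocates a large automatic array and writes one byte per stride so
// that the corresponding stack pages get mapped. Returns the number of
// writes performed. volatile keeps the compiler from dropping the writes.
std::size_t prefault_stack() {
    volatile unsigned char dummy[kStackReserveBytes];
    std::size_t touched = 0;
    for (std::size_t i = 0; i < kStackReserveBytes; i += kTouchStride) {
        dummy[i] = 0;
        ++touched;
    }
    return touched;
}

// Locks current and future pages, then touches the stack so the touched
// pages are mapped and locked before the time-critical section starts.
// Returns false if mlockall fails (e.g. due to RLIMIT_MEMLOCK).
bool lock_memory_and_prefault_stack() {
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        return false;
    }
    prefault_stack();
    return true;
}
```

This only covers the stack; heap memory used later (for example by a std::deque) would need to be allocated and written before the critical section as well.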
If you really need mlockall (which is unlikely), you might consider linking your program statically (to have fewer segments in your virtual address space).
Look also into madvise(2) (and perhaps mincore(2)). It is generally much more useful than mlockall. BTW, in practice, most of your virtual memory is in RAM (unless your system is thrashing, and then you'll see it immediately).
Read also Operating Systems: Three Easy Pieces to understand the role of paging.
PS. Nanosecond-sensitive applications do not make much sense (because of cache misses that the software does not control).

Is there any benefit to use multiple heaps for memory management purposes?

I am a student of a system software faculty. Now I'm developing a memory manager for Windows. Here's my simple implementation of malloc() and free():
HANDLE heap = HeapCreate(0, 0, 0);
void* hmalloc(size_t size)
{
return HeapAlloc(heap, 0, size);
}
void hfree(void* memory)
{
HeapFree(heap, 0, memory);
}
int main()
{
int* ptr1 = (int*)hmalloc(100*sizeof(int));
int* ptr2 = (int*)hmalloc(100*sizeof(int));
int* ptr3 = (int*)hmalloc(100*sizeof(int));
hfree(ptr2);
hfree(ptr3);
hfree(ptr1);
return 0;
}
It works fine. But I can't understand is there a reason to use multiple heaps? Well, I can allocate memory in the heap and get the address to an allocated memory chunk. But here I use ONE heap. Is there a reason to use multiple heaps? Maybe for multi-threaded/multi-process applications? Please explain.
The main reason for using multiple heaps/custom allocators is better memory control. Usually, after lots of new/delete calls, the memory can get fragmented and the application loses performance (it will also consume more memory). Using the memory in a more controlled environment can reduce heap fragmentation.
Another use is preventing memory leaks in the application: you can just free the entire heap you allocated, and you don't need to bother with freeing each object allocated there.
Another use is for tightly packed objects: if you have, for example, a list, you could allocate all the nodes in a smaller dedicated heap, and the app will gain performance because there will be fewer cache misses when iterating over the nodes.
Edit: memory management is however a hard topic and in some cases it is not done right. Andrei Alexandrescu had a talk at one point and he said that for some application replacing the custom allocator with the default one increased the performance of the application.
This is a good link that elaborates on why you may need multiple heap:
https://caligari.dartmouth.edu/doc/ibmcxx/en_US/doc/libref/concepts/cumemmng.htm
"Why Use Multiple Heaps?
Using a single runtime heap is fine for most programs. However, using multiple
heaps can be more efficient and can help you improve your program's performance
and reduce wasted memory for a number of reasons:
1- When you allocate from a single heap, you may end up with memory blocks on
different pages of memory. For example, you might have a linked list that
allocates memory each time you add a node to the list. If you allocate memory for
other data in between adding nodes, the memory blocks for the nodes could end up
on many different pages. To access the data in the list, the system may have to
swap many pages, which can significantly slow your program.
With multiple heaps, you can specify which heap you allocate from. For example,
you might create a heap specifically for the linked list. The list's memory blocks
and the data they contain would remain close together on fewer pages, reducing the
amount of swapping required.
2- In multithread applications, only one thread can access the heap at a time to
ensure memory is safely allocated and freed. For example, say thread 1 is
allocating memory, and thread 2 has a call to free. Thread 2 must wait until
thread 1 has finished its allocation before it can access the heap. Again, this
can slow down performance, especially if your program does a lot of memory
operations.
If you create a separate heap for each thread, you can allocate from them
concurrently, eliminating both the waiting period and the overhead required to
serialize access to the heap.
3- With a single heap, you must explicitly free each block that you allocate. If you
have a linked list that allocates memory for each node, you have to traverse the
entire list and free each block individually, which can take some time.
If you create a separate heap for that linked list, you can destroy it with a
single call and free all the memory at once.
4- When you have only one heap, all components share it (including the IBM C and
C++ Compilers runtime library, vendor libraries, and your own code). If one
component corrupts the heap, another component might fail. You may have trouble
discovering the cause of the problem and where the heap was damaged.
With multiple heaps, you can create a separate heap for each component, so if
one damages the heap (for example, by using a freed pointer), the others can
continue unaffected. You also know where to look to correct the problem."
A reason would be the scenario that you need to execute a program internally e.g. running simulation code. By creating your own heap you could allow that heap to have execution rights which by default for security reasons is turned off. (Windows)
You have some good thoughts and this would work for C, but in C++ you have destructors, and it is VERY important that they run.
You can think of all types as having constructors/destructors, just ones that logically "do nothing".
This is about allocators. See "The buddy algorithm" which uses powers of two to align and re-use stuff.
If I allocate 4 bytes somewhere, my allocator might allocate a 4 KB section just for 4-byte allocations. That way I can fit 1024 4-byte things in the block; if I need more, it adds another block, and so forth.
Ask it for 4 KB and it won't allocate that in the 4-byte block; it might have a separate one for larger requests.
This means you can keep big things together. If I allocate 17 bytes, then 13 bytes, then 1 byte, and the 13-byte block gets freed, I can only stick something of <= 13 bytes in there.
Hence the buddy system and powers of 2, which are easy to do using left shifts: if I want a 2.5 KB block, I allocate it as the smallest power of 2 that will fit (4 KB in this case); that way I can use the slot afterwards for items of <= 4 KB.
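The rounding step on its own can be sketched with a left shift. This is only the size calculation, not a buddy allocator, and the function name is made up for the example.

```cpp
#include <cstddef>

// Returns the smallest power of two that is >= n (and 1 for n == 0).
// Assumes n is at most half the range of std::size_t, so the shift
// cannot overflow.
std::size_t round_up_to_power_of_two(std::size_t n) {
    std::size_t power = 1;
    while (power < n) {
        power <<= 1;  // double the candidate block size
    }
    return power;
}
```

With this, a request for 2560 bytes (2.5 KB) lands in a 4096-byte slot, and a request for exactly 4096 bytes stays at 4096.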
This is not for garbage collection; this is just keeping things more compact and neat. Using your own allocator can avoid calls to the OS (depending on your compiler, the default implementations of new and delete might already do this) and make new/delete very quick.
Heap compacting is very different: you need a list of every pointer that points into your heap, or some way to traverse the entire memory graph (as Java does), so that when you move stuff around and "compact" it, you can update everything that pointed to that thing to point to where it currently is.
The only time I ever used more than one heap was when I wrote a program that would build a complicated data structure. It would have been non-trivial to free the data structure by walking through it and freeing the individual nodes, but luckily for me the program only needed the data structure temporarily (while it performed a particular operation), so I used a separate heap for the data structure so that when I no longer needed it, I could free it with one call to HeapDestroy.
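The same "free everything with one call" pattern can be sketched portably, without the Windows heap API, as a small arena built on malloc. This is a stand-in for HeapCreate/HeapDestroy, not a wrapper around them; the Arena class and its default block size are assumptions for illustration. It never runs destructors, so it only suits trivially destructible data.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical bump allocator: hands out memory from large blocks and
// frees all blocks at once in release() or in the destructor.
class Arena {
public:
    explicit Arena(std::size_t block_size = 64 * 1024)
        : block_size_(block_size) {}
    ~Arena() { release(); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Returns storage aligned for any fundamental type, or nullptr if
    // malloc fails. Individual allocations are never freed separately.
    void* allocate(std::size_t size) {
        constexpr std::size_t kAlign = alignof(std::max_align_t);
        size = (size + kAlign - 1) / kAlign * kAlign;  // round up to alignment
        if (blocks_.empty() || size > capacity_ - used_) {
            const std::size_t new_capacity =
                size > block_size_ ? size : block_size_;
            blocks_.push_back(nullptr);  // reserve the slot first
            void* block = std::malloc(new_capacity);
            if (block == nullptr) {
                blocks_.pop_back();
                return nullptr;
            }
            blocks_.back() = block;
            capacity_ = new_capacity;
            used_ = 0;
        }
        void* result = static_cast<char*>(blocks_.back()) + used_;
        used_ += size;
        return result;
    }

    // Frees every block; all pointers handed out become invalid.
    void release() {
        for (void* block : blocks_) {
            std::free(block);
        }
        blocks_.clear();
        capacity_ = 0;
        used_ = 0;
    }

    std::size_t block_count() const { return blocks_.size(); }

private:
    std::size_t block_size_;
    std::size_t capacity_ = 0;  // size of the current (last) block
    std::size_t used_ = 0;      // bytes handed out from the current block
    std::vector<void*> blocks_;
};
```

A temporary data structure would take all of its nodes from one Arena and then be dropped with a single release(), with no traversal.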

Why is memory not reusable after allocating/deallocating a number of small objects?

While investigating a memory leak in one of our projects, I've run into a strange issue. Somehow, the memory allocated for objects (vector of shared_ptr to object, see below) is not fully reclaimed when the parent container goes out of scope and can't be used except for small objects.
The minimal example: when the program starts, I can allocate a single contiguous block of 1.5 GB without problem. After I use the memory somewhat (by creating and destroying a number of small objects), I can no longer do a big block allocation.
Test program:
#include <iostream>
#include <memory>
#include <vector>
using namespace std;
class BigClass
{
private:
double a[10000];
};
void TestMemory() {
cout<< "Performing TestMemory"<<endl;
vector<shared_ptr<BigClass>> list;
for (int i = 0; i<10000; i++) {
shared_ptr<BigClass> p(new BigClass());
list.push_back(p);
};
};
void TestBigBlock() {
cout<< "Performing TestBigBlock"<<endl;
char* bigBlock = new char [1024*1024*1536];
delete[] bigBlock;
}
int main() {
TestBigBlock();
TestMemory();
TestBigBlock();
}
Problem also repeats if using plain pointers with new/delete or malloc/free in cycle, instead of shared_ptr.
The culprit seems to be that after TestMemory(), the application's virtual memory stays at 827125760 (regardless of the number of times I call it). As a consequence, there's no free VM region big enough to hold 1.5 GB. But I'm not sure why, since I'm definitely freeing the memory I used. Is it some "performance optimization" the CRT does to minimize OS calls?
Environment is Windows 7 x64 + VS2012 + 32-bit app without LAA
Sorry for posting yet another answer since I am unable to comment; I believe many of the others are quite close to the answer really :-)
Anyway, the culprit is most likely address space fragmentation. I gather you are using Visual C++ on Windows.
The C / C++ runtime memory allocator (invoked by malloc or new) uses the Windows heap to allocate memory. The Windows heap manager has an optimization in which it will hold on to blocks under a certain size limit, in order to be able to reuse them if the application requests a block of similar size later. For larger blocks (I can't remember the exact value, but I guess it's around a megabyte) it will use VirtualAlloc outright.
Other long-running 32-bit applications with a pattern of many small allocations have this problem too; the one that made me aware of the issue is MATLAB - I was using the 'cell array' feature to basically allocate millions of 300-400 byte blocks, causing exactly this issue of address space fragmentation even after freeing them.
A workaround is to use the Windows heap functions (HeapCreate() etc.) to create a private heap, allocate your memory through that (passing a custom C++ allocator to your container classes as needed), and then destroy that heap when you want the memory back - This also has the happy side-effect of being very fast vs delete()ing a zillion blocks in a loop..
Re. "what is remaining in memory" to cause the issue in the first place: Nothing is remaining 'in memory' per se, it's more a case of the freed blocks being marked as free but not coalesced. The heap manager has a table/map of the address space, and it won't allow you to allocate anything which would force it to consolidate the free space into one contiguous block (presumably a performance heuristic).
There is absolutely no memory leak in your C++ program. The real culprit is memory fragmentation.
Just to be sure(regarding memory leak point), I ran this program on Valgrind, and it did not give any memory leak information in the report.
//Valgrind Report
mantosh#mantosh4u:~/practice$ valgrind ./basic
==3227== HEAP SUMMARY:
==3227== in use at exit: 0 bytes in 0 blocks
==3227== total heap usage: 20,017 allocs, 20,017 frees, 4,021,989,744 bytes allocated
==3227==
==3227== All heap blocks were freed -- no leaks are possible
Please find my response to your query/doubt asked in original question.
The culprit seems to be that after TestMemory(), the application's
virtual memory stays at 827125760 (regardless of number of times I
call it).
Yes, the real culprit is hidden fragmentation caused during the TestMemory() function. To explain fragmentation, I have taken this snippet from Wikipedia:
"
when free memory is separated into small blocks and is interspersed by allocated memory. It is a weakness of certain storage allocation algorithms, when they fail to order memory used by programs efficiently. The result is that, although free storage is available, it is effectively unusable because it is divided into pieces that are too small individually to satisfy the demands of the application.
For example, consider a situation wherein a program allocates 3 continuous blocks of memory and then frees the middle block. The memory allocator can use this free block of memory for future allocations. However, it cannot use this block if the memory to be allocated is larger in size than this free block."
The above paragraph explains memory fragmentation very nicely. Some allocation patterns (such as frequent allocation and deallocation) lead to memory fragmentation, but the end impact (i.e. the 1.5 GB allocation failing) varies greatly between systems, as different OSes/heap managers have different strategies and implementations.
As an example, your program ran perfectly fine on my machine(Linux) however you have encountered the memory allocation failure.
Regarding your observation that the VM size remains constant: the VM size seen in Task Manager is not directly proportional to our memory allocation calls. It mainly depends on how many bytes are in the committed state. When you allocate some dynamic memory (using new/malloc) and do not write/initialize anything in those memory regions, it does not go into the committed state, and hence the VM size is not affected. The VM size depends on many other factors and is a bit complicated, so we should not rely on it completely when trying to understand the dynamic memory allocation of our program.
As a consequence, there's no free VM region big enough to hold 1.5
GB.
Yes, due to fragmentation, there is no contiguous 1.5 GB of memory. Note that the total remaining (free) memory may be more than 1.5 GB, but it is in a fragmented state. Hence there is no big contiguous block.
But I'm not sure why - since I'm definitely freeing the memory I used.
Is it some "performance optimization" CRT does to minimize OS calls?
I have explained why it may happen even though you have freed all your memory. To fulfil the user program's request, the OS calls its virtual memory manager and tries to allocate memory for the heap memory manager to use. But grabbing the additional memory depends on many other complex factors which are not very easy to understand.
Possible Resolution of Memory Fragmentation
We should try to reuse allocated memory rather than frequently allocating and freeing it. Some patterns (like allocating particular request sizes in a particular order) may lead the overall memory into a fragmented state. Reducing fragmentation could require a substantial design change in your program. This is a complex topic, and it requires an internal understanding of the memory manager to understand the complete root cause of such things.
There are tools for Windows-based systems, though I am not very familiar with them. I found one excellent SO post about which tools (on Windows) can help you check the fragmentation status of your program yourself:
https://stackoverflow.com/a/1684521/2724703
This is not a memory leak. The memory you used was allocated by the C/C++ runtime. The runtime requests a bulk of memory from the OS once, and then each new you call allocates from that bulk memory. When you delete an object, the runtime does not return the memory to the OS immediately; it may hold on to that memory for performance.
There is nothing here which indicates a genuine "leak". The pattern of memory you describe is not unexpected. Here are a few points which might help to understand. What happens is highly OS dependent.
A program often has a single heap which can be extended or shrunk in length. It is however one contiguous memory area, so changing the size is just changing where the end of the heap is. This makes it very difficult to ever "return" memory to the OS, since even one little tiny object in that space will prevent its shrinking. On Linux you can lookup the function 'brk' (I know you're on Windows, but I presume it does something similar).
Large allocations are often done with a different strategy. Rather than putting them in the general purpose heap, an extra block of memory is created. When it is deleted this memory can actually be "returned" to the OS since its guaranteed nothing is using it.
Large blocks of unused memory don't tend to consume a lot of resources. If you generally aren't using the memory any more they might just get paged to disk. Don't presume that because some API function says you're using memory that you are actually consuming significant resources.
APIs don't always report what you think. Due to a variety of optimizations and strategies it may not actually be possible to determine how much memory is in use and/or available on a system at a particular moment. Unless you have intimate details of the OS you won't know for sure what those values mean.
The first two points can explain why a bunch of small blocks and one large block result in different memory patterns. The latter points indicate why this approach to detecting leaks is not useful. To detect genuine object-based "leaks" you generally need a dedicated profiling tool which tracks allocations.
For example, in the code provided:
1. TestBigBlock allocates and deletes the array; assume this uses a special memory block, so the memory is returned to the OS.
2. TestMemory extends the heap for all the small objects, and never returns any heap to the OS. Here the heap is entirely available from the application's point of view, but from the OS's point of view it is assigned to the application.
3. TestBigBlock now fails, since although it would use a special memory block, it shares the overall memory space with the heap, and there just isn't enough left after 2 is complete.

Find huge blocks of allocated memory

I have a program (daemon) that is written in C/C++. It runs flawlessly, but after some period of time (it can be 5 days, a week, 2 weeks) it begins to allocate many megabytes of memory. I can't work out which parts of the code do not free allocated memory. At startup, memory usage is about 20-30 megabytes. Then after some period, or maybe some event, it grows slowly, at about 1 MB per hour, and if not terminated it can crash because no memory is available.
I tried Valgrind and shut the daemon down in the usual way when it had already allocated about 500 MB of memory. The shutdown took a really long time, but when it finished Valgrind said no memory leaks were found, except for the mysql_init/mysql_close procedures (about 504 bytes are definitely lost). Google says not to worry about this MySQL leak, and gives some reasons why memory diagnostic tools like Valgrind think that it is a leak.
I don't really know which parts of the code allocate memory but free it only on program shutdown. Please help me find this out.
Valgrind only detects pointers that aren't deleted, more or less. Keeping them around when you don't need them is a different problem.
Firstly, all objects and memory are freed at shutdown. If there's a leak, valgrind will detect it as memory not referenced by an object, etc. Any leaks however are freed by the operating system in the end.
If you're catching all exceptions (...) and not doing anything with them, well, don't do that. It's a common cause.
Secondly, a logfile of destructors that are called during shutdown might be helpful. Perhaps at the end of main(), set a global flag; any destructors called while that flag is set can output that they exist. See if there are lots of objects that shouldn't be there.
A bit easier, you can use a global variable, each ctor can increment it by 1, and dtor decrement by 1. If you find that the number of objects isn't staying relatively the same, you can investigate which ones are making the problem using similar techniques.
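A minimal sketch of that counter, written as a reusable base class so each type gets its own count. The InstanceCounted template and the Session type are hypothetical names for this example; a real daemon would periodically log live() for its suspect classes.

```cpp
#include <atomic>

// Hypothetical mix-in: every constructed T increments the counter and
// every destroyed T decrements it, so live() is the number of objects
// of that type currently alive.
template <typename T>
class InstanceCounted {
public:
    static long live() { return count_.load(); }

protected:
    InstanceCounted() { ++count_; }
    InstanceCounted(const InstanceCounted&) { ++count_; }  // copies count too
    ~InstanceCounted() { --count_; }

private:
    static std::atomic<long> count_;
};

template <typename T>
std::atomic<long> InstanceCounted<T>::count_{0};

// Example type, made up for illustration.
struct Session : InstanceCounted<Session> {};
```

If Session::live() keeps climbing over hours while the workload is steady, that type is a good place to start looking.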
Thirdly, use Boost and its scoped smart pointers to help, but do not rely on smart pointers as the holy grail.
There is a possible underlying issue that I have come across. For long-running programs, memory fragmentation can lead to large memory usage. You may delete a 1mb object, then try to create a 2mb object; the creation will be in new space because that 1mb 'free chunk' is not big enough. Then when you make a 512kb object it may go into that 1mb object's space, only using 1/2 of available space, but making it so that your next 1mb object needs to be allocated in big space.
Unfortunately this problem can become bad, due to small objects being allocated in persistent places. There may be, say, 50-byte classes 300kb apart in memory, and like 100 of them, but no 512kb objects can be allocated in that space, so it allocates an additional 512kb for each new object, effectively wasting 90% of actual 'free' space even though your program owns more than enough already.
This problem is hard to track down as the definite cause, but if you examine your program's flow, look for small allocations. Remember std::list/vector/etc. can all cause this; if you're looking to make a daemon that does lots of memory ops run for weeks, it's a good idea to pre-allocate memory using reserve(). Memory pools are even better.
Depending on the time you want to put in, you can also either make (or find) a custom memory allocator that will report on objects when it shuts down, too.
Try to use Valgrind Massif tool. From Massif manual:
Also, there are certain space leaks that aren't detected by
traditional leak-checkers, such as Memcheck's. That's because the
memory isn't ever actually lost -- a pointer remains to it -- but it's
not in use. Programs that have leaks like this can unnecessarily
increase the amount of memory they are using over time. Massif can
help identify these leaks.
Massif should show you what's happening with memory and where it is allocated and not freeing until shutdown.
Since you are sure, there's no memory leak, your program might be allocating memory and storing data without leaking.
For example, let's say your program uses a linked list...
struct list{
DATA_ARRAY arr; //Some data
struct list *next; // pointer to the next node
};
while (true) // infinite loop
{
// Add new nodes to list
// Store some data in the node
}
There's no leak here. But the loop adds new nodes forever and stores data and everything is perfectly valid. But memory usage increases all the time. Since you are running for 2-5 days, something like this is certainly possible.
You may have to inspect the code and free memory if no longer needed.

Is this normal behavior for a std::vector?

I have a std::vector of a class called OGLSHAPE.
each shape has a vector of SHAPECONTOUR struct which has a vector of float and a vector of vector of double. it also has a vector of an outline struct which has a vector of float in it.
Initially, my program starts up using 8.7 MB of RAM. I noticed that when I started filling these up, e.g. adding doubles and floats, the memory got fairly high quickly, then leveled off. When I clear the OGLSHAPE vector, still about 19 MB is used. Then if I push about 150 more shapes, then clear those, I'm now using around 19.3 MB of RAM. I would have thought that logically, if the first time it went from 8.7 to 19, the next time it would go up to around 30. I'm not sure what it is. I thought it was a memory leak but now I'm not sure. All I do is push numbers into std::vectors, nothing else. So I'd expect to get all my memory back. What could cause this?
Thanks
*edit, okay its memory fragmentation
from allocating lots of small things,
how can that be solved?
Calling std::vector<>::clear() does not necessarily free all allocated memory (it depends on the implementation of std::vector<>). This is often done as an optimization, to avoid unnecessary memory allocations.
In order to really free the memory held by an instance just do:
template <typename T>
inline void really_free_all_memory(std::vector<T>& to_clear)
{
std::vector<T> v;
v.swap(to_clear);
}
// ...
std::vector<foo> objs;
// ...
// really free instance 'objs'
really_free_all_memory(objs);
which creates a new (empty) instance and swaps it with your vector instance you would like to clear.
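The difference between the two can be observed through capacity(). A small sketch with two hypothetical helper functions; each takes the vector by value, so the caller's vector is left untouched. The zero capacity after the swap is what the common standard library implementations give for an empty vector.

```cpp
#include <cstddef>
#include <vector>

// clear() destroys the elements but keeps the allocated storage.
template <typename T>
std::size_t capacity_after_clear(std::vector<T> v) {
    v.clear();
    return v.capacity();
}

// Swapping with an empty temporary hands the storage to the temporary,
// which frees it when it is destroyed at the end of the statement.
template <typename T>
std::size_t capacity_after_swap(std::vector<T> v) {
    std::vector<T>().swap(v);
    return v.capacity();
}
```

For a vector of 1000 ints, the first returns at least 1000 and the second returns 0. Since C++11 there is also shrink_to_fit(), but the standard defines it as a non-binding request.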
Use the correct tools to observe your memory usage, e.g. (on Windows) use Process Explorer and observe Private Bytes. Don't look at Virtual Address Space since that shows the highest memory address in use. Fragmentation is the cause of a big difference between both values.
Also realize that there are a lot of layers in between your application and the operating system:
the std::vector does not necessarily free all memory immediately (see tip of hkaiser)
the C Run Time does not always return all memory to the operating system
the Operating System's heap routines may not be able to free all memory because they can only free full pages (of 4 KB). If 1 byte of a 4 KB page is still used, the page cannot be freed.
There are a few possible things at play here.
First, the way memory works in most common C and C++ runtime libraries is that once it is allocated to the application from the operating system it is rarely ever given back to the OS. When you free it in your program, the memory manager keeps it around in case you ask for more memory again. If you do, it gives it back to you for re-use.
The other reason is that vectors themselves typically don't reduce their size, even if you clear() them. They keep the "capacity" that they had at their highest so that it is faster to re-fill them. But if the vector is ever destroyed, that memory will then go back to the runtime library to be allocated again.
So, if you are not destroying your vectors, they may be keeping the memory internally for you. If you are using something in the operating system to view memory usage, it is probably not aware of how much "free" memory is waiting around in the runtime libraries to be used, rather than being given back to the operating system.
The reason your memory usage increases slightly (instead of not at all) is probably fragmentation. This is a sort of complicated tangent, but suffice it to say that allocating a lot of small objects can make it harder for the runtime library to find a big chunk when it needs one. In that case, it can't reuse some of the memory it has lying around that you already freed, because it is in lots of small pieces. So it has to go to the OS and request a big piece.