Why are new and delete so slow in a loop under MSVC 2010 - C++

I ran into a problem when creating and deleting instances of a class in a loop.
The execution time of the iterations varies quite a lot. As I understand it, this is related to the removal of objects from memory, but I don't understand the behaviour of this operation. Why does the time differ? How do I fix it? The time is steady when I delete the object in a separate thread.
#include <iostream>
#include <vector>
#include <ctime>
using namespace std;

class NODE {
public:
    NODE() {}

    NODE* add(NODE* node)
    {
        children.push_back(node);
        return node;
    }

    virtual ~NODE()
    {
        for (vector<NODE*>::iterator it = children.begin(); it != children.end(); ++it)
        {
            delete *it;
        }
    }

    vector<NODE*> children;
};

NODE* create()
{
    NODE* node( new NODE() );
    for (int i = 0; i < 200; i++) {
        NODE* subnode = node->add( new NODE() );
        for (int k = 0; k < 20; k++) subnode->add( new NODE() );
    }
    return node;
}

int main()
{
    NODE* root;
    unsigned t;
    for (int i = 0; i < 30; i++) {
        t = clock();
        cout << "Create... ";
        root = create();
        delete root;
        cout << clock() - t << endl;
    }
}
ADDED:
I'm confused. When I run the program outside of VS, it works fine...

When you create an object, sometimes a new block of memory for it will be allocated, and sometimes it will fit into an already-existing block. This will cause two allocations to potentially take differing amounts of time.
If you want allocations and frees to take a consistent amount of time, handle them inside your application - grab a big chunk of memory and service allocations from that block. Of course, when you need another chunk, the allocation that triggers it is going to take longer... but with large chunks that should happen infrequently.
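A minimal sketch of that idea, assuming objects are only ever released all at once (the Arena name and the 1 MB block size are made up for illustration; objects placed in it must not be deleted individually):

#include <cstddef>
#include <vector>

// Tiny bump allocator: carve small allocations out of large blocks.
class Arena {
    std::vector<char*> blocks_;
    std::size_t blockSize_;
    std::size_t used_;
public:
    explicit Arena(std::size_t blockSize = 1 << 20)
        : blockSize_(blockSize), used_(blockSize) {}

    ~Arena() {
        for (std::size_t i = 0; i < blocks_.size(); ++i) delete[] blocks_[i];
    }

    // Assumes n is small compared to blockSize_.
    void* allocate(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);                 // keep 8-byte alignment (simplified)
        if (used_ + n > blockSize_) {                  // current block exhausted:
            blocks_.push_back(new char[blockSize_]);   // the occasional slow allocation
            used_ = 0;
        }
        void* p = blocks_.back() + used_;
        used_ += n;
        return p;
    }
    // No per-object free: everything is released when the Arena is destroyed.
};

NODE could then overload operator new/operator delete to draw from an Arena, or the tree could be built with placement new; either way each iteration pays for a handful of large allocations instead of thousands of small ones.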
The only way to make allocations and deallocations take a perfectly consistent amount of time would be to slow down the fast ones until they take the maximum amount of time any request could spend. This would be the Harrison Bergeron approach to performance optimization. I don't recommend it.

There are real-time heaps in existence, but generally memory heap operations (dynamic allocations and deallocations) are the canonical example of a non-deterministic operation. This means that the runtime varies and doesn't even have a very good bound.
The issue is that adjacent free blocks of memory generally need to be merged into a single block as they appear. If that isn't done, you eventually end up with oodles of tiny blocks, and a large allocation may fail even though there really is plenty of memory available for it in total. On any given call there may or may not be merging to do, and the amount of work varies. It all depends on the pattern of allocations/deallocations your program has performed recently, which is not something you can plan for at all. So we call it "non-deterministic".
If you don't like this behavior, there are really two possibilities:
Switch to using a real-time heap. Your OS probably doesn't have one built in, so you'd have to go buy or download one and use it for all your memory operations. One I've used in the past is TLSF.
Don't perform dynamic memory allocation/deallocation in your main loop (IOW: not after initialization). This is how we realtime programmers have trained ourselves to code for ages.

In addition to what other answers say, you have to account for the Visual C++ 10 runtime heap being thread-safe. This means that even when you only have one thread, you incur some overhead keeping heap operations thread-safe. So one reason you see such poor results is that you are using a universal but rather slow heap implementation.
The reason you get different timing with and without the debugger is that a special debug heap is used when a program is started under the debugger (even in a Release configuration), and that special heap is also relatively slow.

Generally it's not possible to predict memory allocation/deallocation times. For example, times may vary significantly if the user-space heap runs out of pages and needs to request more from the kernel, and again later when accessing a newly allocated page triggers a page fault.
So even if you go ahead and implement your own heap using big chunks of memory, your allocation times will vary, because laziness is in the nature of the underlying memory system.

If you really want to know why it's slow you need to run a real profiler on it, such as AMD's CodeAnalyst; clock() isn't exactly a high-precision timer.
The reason it runs differently each time depends on the paging behaviour of the underlying system, the CPU load, and whether your data has been cached by the processor.
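If you just want a finer-grained timer than clock() on Windows, QueryPerformanceCounter is the usual choice (a sketch; error checking omitted):

#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);          // ticks per second

    QueryPerformanceCounter(&start);
    // ... code to measure, e.g. root = create(); delete root; ...
    QueryPerformanceCounter(&stop);

    double ms = 1000.0 * (stop.QuadPart - start.QuadPart) / freq.QuadPart;
    std::cout << ms << " ms" << std::endl;
    return 0;
}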

Related

vector of list memory leak in C++ on OSX 10.8

I'm having memory problems in implementing a vector< list<size_t> >. Here's the code:
#include <vector>
#include <list>
using namespace std;

struct structure {
    vector< list<size_t> > Ar;

    structure(int n) {
        Ar.reserve(n);
        for (size_t i = 0; (int)i < n; i++) {
            list<size_t> L;
            Ar.push_back(L);
        }
    }

    ~structure() {
        vector< list<size_t> >::iterator it = Ar.begin();
        while (it < Ar.end()) {
            (*it).clear();
            ++it;
        }
        Ar.clear();
    }
};

int main() {
    for (size_t k = 0; k < 100; k++) {
        structure test_s = structure(1000 * (100 - k));
    }
    return 0;
}
The physical memory allocation to this program should be going down as time progresses, because less and less memory is being allocated to test_s, through the use of 100 - k in the constructor. This isn't what I'm observing! Rather, the physical memory increases around half way through the run!
Since I am using this code in a bigger program that eats up a huge amount of memory, this is a bit of a catastrophe!
There are two details that I find strange:
Firstly, there is no progressive increase in physical memory usage, even though the size of the object changes at every stage of the for loop; rather, the memory increases suddenly at around the 50th iteration of the for loop. This happens consistently on each run (and I've done many!). The iteration at which the memory increases is not random!
Secondly, when I pass a static size_t (e.g. 10000) to the structure(size_t) constructor, I don't get the problem anymore. As you can probably guess, a static value isn't very useful for my code, as I need to be able to dynamically allocate the size of the structure object.
I am compiling with g++ on OS X 10.8.3. I haven't tried compiling on another platform, as I would prefer to keep working on my Mac.
All of the memory management tools I have tried (Apple Instruments and Valgrind) haven't been particularly helpful. Valgrind only returns references to libraries and not to the program itself.
Any help would be much appreciated!!
Cheers,
Plamen
The C++ allocator doesn't necessarily return memory to the OS when it's done with it, but usually keeps it around since you're probably going to need it soon.
I'm not familiar with the details of the OS X allocator, but it's quite common for an allocator to grab memory from the OS in larger chunks and then treat them as separate pools.
This may be what you're seeing, with sudden growth as the first chunk of memory is filled.
It's also possible that you're passing some threshold between "larger" allocations and "smaller" allocations, and you're just seeing an added pool for slightly smaller things - some allocators do that.
It's also possible that the cause is something entirely different, of course.
The difference when you're using the same size for each one is most likely because it's easy for the allocator to fill the request using a block of the same size that was recently freed.
When the blocks have different sizes it's faster to allocate a new block with a different size than to divide a free block into two smaller ones.
(This can also unfortunately lead to memory fragmentation. If you get many scattered small blocks, it may happen that a large allocation request can't be fulfilled despite there being enough room in total.)
In summary: Memory allocators and operating systems are quite complicated these days, and you can't look at growth in memory allocation and say for certain that you have a memory leak.
I would trust valgrind and Instruments in your case.
I don't see any leak in that code, but a lot of unneeded code. The simplified code would be:
struct structure {
    vector< list<size_t> > Ar;

    structure(int n) : Ar(n) // initialize the vector with n empty lists
    {
    }

    // destructor unneeded, since Ar will be destroyed anyway, with all of its elements
};
But this doesn't answer your question.
Heap memory allocation doesn't mean physical memory allocation. Modern OSes use virtual memory, usually backed by a paging file. A heap allocation gets memory from virtual memory, and the OS decides whether more or less physical memory is needed. Returning memory to virtual memory doesn't mean that the physical memory is freed (if it is not needed by another process, why do that right away?).
Also, heap memory allocations are not directly translated into virtual memory allocations. Virtual memory allocation usually has a large granularity, so it is not suitable for small allocations. Instead, the heap manager allocates blocks of virtual memory and manages them for all heap allocations (if virtual memory runs out, the heap manager asks for more). When unused blocks of virtual memory are returned depends on how the heap manager is implemented.
To make things a bit more complex, allocating and deallocating different sizes of memory produces heap fragmentation, depending on the allocation/deallocation pattern and on how the heap is implemented.
Physical memory is not a good indicator of a memory leak in this type of program. Private (virtual) bytes or a similar measure would be better.

profiling and performance issues

After observing some performance issues in my program, I decided to run a profiling session. The results seem to indicate that something like 87% of samples taken were somehow related to my Update() function.
In this function, I am going through a list of A*, where sizeof(A) equals 72, and deleting them after processing.
void Update()
{
    //...
    for (auto i = myList.begin(); i != myList.end(); i++)
    {
        A* pA = *i;
        // Process item before deleting it.
        delete pA;
    }
    myList.clear();
    //...
}
where myList is a std::list<A*>. On average, I am calling this function anywhere from 30 to 60 times per second while the list contains an average of 5 items. That means I'm deleting anywhere from 150 to 300 A objects per second.
Would calling delete this many times be enough to cause a performance issue in most cases? Is there any way to track down exactly where in the function the problem is occurring? Is delete generally considered an expensive operation?
Very difficult to tell, since you brush over what is probably the bulk of the work done in the loop and give no hint as to what A is...
If A is a simple collection of data, particularly primitives, then the deletion is almost certainly not the culprit. You can test the theory by splitting your update function in two - update and uninit, as sketched below. Update does all the processing; uninit deletes the objects and clears the list.
If only update is slow, then it's the processing. If only uninit is slow, then it's the deletion. If both are slow then memory fragmentation is probably the culprit.
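A rough sketch of that split, assuming the A class and myList from the question (the function names are made up):

// Pass 1: processing only - time this separately.
void UpdateOnly()
{
    for (std::list<A*>::iterator i = myList.begin(); i != myList.end(); ++i)
    {
        // ... process *i, but do not delete it yet ...
    }
}

// Pass 2: deletion only - time this separately too.
void Uninit()
{
    for (std::list<A*>::iterator i = myList.begin(); i != myList.end(); ++i)
        delete *i;
    myList.clear();
}

Timing the two halves separately tells you whether the processing or the deletion dominates.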
As others have pointed out in the comments, std::vector may give you a performance increase. But be careful since it may also cause performance problems elsewhere depending on how you build the data structure.
You could have a look at tcmalloc from gperftools (Google Performance Tools). gperftools also contains a profiler (both libraries only need to be linked in, very easy). tcmalloc keeps a memory pool for small objects and re-uses this memory when possible.
The profiler can be used for cpu and heap profiling.
Totally easy to tell what's going on.
Do yourself a favor and use this method.
It's been analyzed to the nth degree, and is very effective.
In a nutshell, if 87% of time is in Update, then if you just stop it a few times with Ctrl-C or whatever, the probability is 87% each time that you will catch it in the act.
You will not only see that it's in Update. You will see where in Update, and what it's doing. If it is in the process of delete, or accessing the data structure, you will see that. You will also see, further down the stack, the reason why that operation takes time.

Why Does a Memory Leak not Continue after Peaking?

I created an intentional memory leak to demonstrate a point to people who will shortly be learning pointers.
int main()
{
    while (1)
    {
        int *a = new int [2];
        //delete [] a;
    }
}
If this is run with the delete line uncommented, the memory stays low and doesn't rise, as expected. However, if it is run as is (with the delete commented out), then on a machine with 2GB of RAM the memory usage rapidly rises to about 1.5GB, or whatever is not in use by the system. Once it hits this point, though, the CPU usage (which was previously at max) drops sharply, and the memory usage does as well, down to about 100MB.
What exactly caused this intervening action (if there's something more specific than "Windows", that'd be great), and why does the program not take up the CPU it would by looping, yet not terminate either? It seems like it's stuck between the end of the loop and the end of main.
Windows XP, GCC, MinGW.
What's probably happening is that your code allocates all available physical RAM. When it reaches that limit, the system starts to allocate space on the swap file for it. That means it's (nearly) constantly waiting on the disk, so its CPU usage drops to (almost) zero.
The system may easily keep track of the fact that it never actually writes to the memory it allocates, so when it needs to be stored on the swap file, it'll just make a small record basically saying "process X has N bytes of uninitialized storage" instead of actually copying all the data to the hard drive (but I'm not sure of that, and it may well depend on the exact system you're using).
To paraphrase Inigo Montoya, "I don't think that means what you think that means." The Windows task manager doesn't display the memory usage data that you are looking for.
The "Mem Usage" column displays something related to the working set size (or the resident set size) of the process. That is, "Mem Usage" displays a number related to the amount of physical memory currently allocated to your process.
The "VM Size" column displays a number wholly unrelated to the virtual memory subsystem (it is actually the size of the private heaps allocated by the process).
Try using a different tool to visualize virtual memory usage. I suggest Process Explorer.
I guess when the program exhausts the available physical memory, it starts to use on-disk (virtual) memory, and it becomes so slow, it seems as if it's inactive. Try adding some speed visualization:
int counter = 0;
while (1)
{
    int *a = new int [2];
    ++counter;
    if (counter % 1000000 == 0)
        std::cout << counter << '\n';   // requires <iostream>
}
The default Memory column in the task manager of XP is the size of the working set of the process (the amount of physical memory allocated to that process), not the actual memory usage.
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684891%28v=vs.85%29.aspx
http://blogs.msdn.com/b/salvapatuel/archive/2007/10/13/memory-working-set-explored.aspx
The "Mem Usage" column of the task manager is probably the "working set", as explained by a few answers to this question, although to be honest I still get confused by how the task manager refers to memory, since it changes from version to version. This value goes up and down because you are obviously not actually using much memory at any given time. If you look at the "VM Size" you should see it constantly increase until something bad happens.
You can also give Process Explorer a try, which I find easier to understand in how it displays things.
Several things: first, if you're only allocating 2 ints at a time, it could take hours before you notice that the total memory usage is going up because of it. And second, on a lot of systems, allocation doesn't commit until you actually access the memory; the address space may be reserved, but you don't really have the memory (and the program will crash if you try to access the memory and there isn't any available).
If you want to simulate a leak, I'd recommend allocating at least a page at a time, if not considerably more, and writing at least one byte in each allocated page.
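A sketch of that suggestion; the 4096-byte page size is an assumption, but anything in that ballpark makes the leak visible quickly because every allocation is touched and therefore committed:

#include <cstddef>

int main()
{
    const std::size_t pageSize = 4096;   // assumed page size
    while (true)
    {
        char* p = new char[pageSize];
        p[0] = 1;                        // touch the page so the OS actually commits it
    }
}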

std::sort on container of pointers

I want to explore the performance differences for multiple dereferencing of data inside a vector of new-ly allocated structs (or classes).
#include <vector>

struct Foo
{
    int val;
    // some variables
};

std::vector<Foo*> vectorOfFoo;

// Foo objects are new-ed and pushed into vectorOfFoo
for (int i = 0; i < N; i++)
{
    Foo* f = new Foo;
    vectorOfFoo.push_back(f);
}
In the parts of the code where I iterate over the vector I would like to enhance locality of reference through the many iterator dereferencings; for example, I very often have to perform a double nested loop
for (vector<Foo*>::iterator iter1 = vectorOfFoo.begin(); iter1 != vectorOfFoo.end(); ++iter1)
{
    int somevalue = (*iter1)->val;
}
Obviously if the pointers inside vectorOfFoo are very far apart, I think locality of reference is somewhat lost.
What about the performance if I sort the vector before iterating over it? Should I get better performance in the repeated dereferencings?
Am I guaranteed that consecutive new calls allocate pointers which are close in the memory layout?
Just to answer your last question: no, there is no guarantee whatsoever where new allocates memory. The allocations can be distributed throughout the memory. Depending on the current fragmentation of the memory you may be lucky that they are sometimes close to each other but no guarantee is - or, actually, can be - given.
If you want to improve the locality of reference for your objects then you should look into Pool Allocation.
But that's pointless without profiling.
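One simple way to get that locality without a full pool allocator (a sketch, assuming the Foo struct from the question and that an upper bound on the element count is known up front so the backing vector never reallocates):

#include <cstddef>
#include <vector>

std::vector<Foo>  fooStorage;    // owns the objects, contiguous in memory
std::vector<Foo*> vectorOfFoo;   // the existing pointer vector, unchanged

void buildFoos(std::size_t n)
{
    fooStorage.reserve(n);       // reserve up front so the pointers stay valid
    vectorOfFoo.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        fooStorage.push_back(Foo());
        vectorOfFoo.push_back(&fooStorage.back());
    }
}

Iterating vectorOfFoo then walks memory mostly in order, which is what the hardware prefetcher likes; the trade-off is that individual objects can no longer be deleted one by one.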
It depends on many factors.
First, it depends on how the objects being pointed to from the vector were allocated. If they were allocated on different pages, then you cannot help it except by fixing the allocation part and/or trying to use software prefetching.
You can generally check what virtual addresses malloc gives out, but as a part of the larger program the result of separate allocations is not deterministic. So if you want to control the allocation, you have to do it smarter.
In the case of a NUMA system, you have to make sure that the memory you are accessing is allocated from the physical memory of the node on which your process is running. Otherwise, no matter what you do, the memory will be coming from the other node, and you cannot do much in that case except transfer your program back to its "home" node.
You have to check the stride that is needed in order to jump from one object to another. The prefetcher can recognize a stride within a 512-byte window. If the stride is greater, you are talking about random memory access from the prefetcher's point of view. It will then shut itself off so as not to evict your data from the cache, and the best you can do there is to try software prefetching, which may or may not help (always test it).
So if sorting the vector of pointers makes the objects pointed to by them contiguously placed one after another with a relatively small stride, then yes, you will improve the memory access speed by making it more friendly for the prefetch hardware.
You also have to make sure that sorting the vector doesn't result in a worse gain/loss ratio.
On a side note, depending on how you use each element, you may want to allocate them all at once and/or split those objects into different smaller structures and iterate over smaller data chunks.
At any rate, you absolutely must measure the performance of the whole application before and after your changes. This sort of optimization is a tricky business, and things can get worse even though in theory the performance should have improved. There are many tools that can help you profile memory access, for example cachegrind; Intel's VTune does the same, and there are many others. So don't guess, experiment and verify the results.

Thread safe memory pool

My application is currently highly performance-critical and requests 3-5 million objects per frame. Initially, to get the ball rolling, I new'd everything and got the application to work and test my algorithms. The application is multi-threaded.
Once I was happy with the performance, I started to create a memory manager for my objects. The obvious reason is memory fragmentation and wastage. The application could not continue for more than a few frames before crashing due to memory fragmentation. I have checked for memory leaks and know the application is leak free.
So I started creating a simple memory manager using TBB's concurrent_queue. The queue stores a maximum set of elements the application is allowed to use. The class requiring new elements pops elements from the queue. The try_pop method is, according to Intel's documentation, lock-free. This worked quite well as far as memory consumption goes (although there is still memory fragmentation, but not nearly as much as before). The problem I am facing now is that the application's performance has slowed down approximately 4 times according to my own simple profiler (I do not have access to commercial profilers or know of any that will work on a real-time application... any recommendation would be appreciated).
My question is, is there a thread-safe memory pool that is scalable. A must-have feature of the pool is fast recycling of elements and making them available. If there is none, any tips/tricks performance wise?
EDIT: I thought I would explain the problem a bit more. I could easily initialize n number of arrays where n is the number of threads and start using the objects from the arrays per thread. This will work perfectly for some cases. In my case, I am recycling the elements as well (potentially every frame) and they could be recycled at any point in the array; i.e. it may be from elementArray[0] or elementArray[10] or elementArray[1000] part of the array. Now I will have a fragmented array of elements consisting of elements that are ready to be used and elements that are in-use :(
As said in comments, don't get a thread-safe memory allocator, allocate memory per-thread.
As you implied in your update, you need to manage free/in-use effectively. That is a pretty straightforward problem, given a constant type and no concurrency.
For example (off the top of my head, untested):
#include <cstddef>
#include <vector>

template<typename T>
class ThreadStorage
{
    std::vector<T> m_objs;
    std::vector<size_t> m_avail;

public:
    explicit ThreadStorage(size_t count) : m_objs(count, T()) {
        m_avail.reserve(count);
        for (size_t i = 0; i < count; ++i) m_avail.push_back(i);
    }

    T* alloc() {
        T* retval = &m_objs[0] + m_avail.back();
        m_avail.pop_back();
        return retval;
    }

    void free(T* p) {
        *p = T(); // Assuming this is enough destruction.
        m_avail.push_back(p - &m_objs[0]);
    }
};
Then, for each thread, have a ThreadStorage instance, and call alloc() and free() as required.
You can add smart pointers to manage calling free() for you, and you can optimise constructor/destructor calling if that's expensive.
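For the smart-pointer suggestion, one option is boost::shared_ptr with a custom deleter that returns the object to the pool instead of calling delete (a sketch; Foo is a made-up payload type):

#include <boost/shared_ptr.hpp>

// Deleter that hands the object back to its ThreadStorage.
template<typename T>
struct PoolDeleter
{
    ThreadStorage<T>* pool;
    explicit PoolDeleter(ThreadStorage<T>* p) : pool(p) {}
    void operator()(T* obj) const { pool->free(obj); }   // back to the pool, no delete
};

// Usage:
// ThreadStorage<Foo> storage(1000);
// boost::shared_ptr<Foo> p(storage.alloc(), PoolDeleter<Foo>(&storage));
// ... when the last copy of p goes away, storage.free() is called automatically.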
You can also look at boost::pool.
Update:
The new requirement for keeping track of things that have been used so that they can be processed in a second pass seems a bit unclear to me. I think you mean that when the primary processing is finished on an object, you need to not release it, but keep a reference to it for second-stage processing. Some objects will simply be released back to the pool and not used for second-stage processing.
I assume you want to do this in the same thread.
As a first pass, you could add a method like this to ThreadStorage, and call it when you want to do processing on all unreleased instances of T. No extra book keeping required.
// Call f on every instance of T that is currently in use (i.e. not in m_avail).
// Requires <algorithm> and <boost/function.hpp>.
void do_processing(boost::function<void (T* p)> const& f) {
    std::sort(m_avail.begin(), m_avail.end());
    size_t o = 0;
    for (size_t i = 0; i != m_avail.size(); ++i) {
        if (o < m_avail[i]) {
            // Process the in-use objects that come before the next free index.
            do {
                f(&m_objs[o]);
            } while (++o < m_avail[i]);
            ++o;
        } else if (o == m_avail[i])
            ++o;
    }
    // Process any remaining in-use objects after the last free index.
    for (; o < m_objs.size(); ++o) f(&m_objs[o]);
}
Assumes no other thread is using the ThreadStorage instance, which is reasonable because it is thread-local by design. Again, off the top of my head, untested.
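A quick usage sketch tying it together (MyObject, secondPass and the counts are all made up; one ThreadStorage per thread, as suggested above):

struct MyObject { int state; MyObject() : state(0) {} };   // hypothetical payload

void secondPass(MyObject* p) { /* second-stage processing */ }

void frame(ThreadStorage<MyObject>& storage)
{
    MyObject* a = storage.alloc();
    MyObject* b = storage.alloc();

    // ... first-pass processing on a and b ...

    storage.free(b);                      // b is finished, back to the pool

    storage.do_processing(&secondPass);   // called for a (and any other live object), not b

    storage.free(a);                      // release the rest once the second pass is done
}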
Google's TCMalloc:
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
Performance:
TCMalloc is faster than the glibc 2.3 malloc... ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair...
You may want to have a look at jemalloc.