My application is highly performance-critical and requests 3-5 million objects per frame. Initially, to get the ball rolling, I new'd everything just to get the application working and to test my algorithms. The application is multi-threaded.
Once I was happy with the performance, I started to create a memory manager for my objects. The obvious reason is memory fragmentation and waste: the application could not continue for more than a few frames before crashing due to memory fragmentation. I have checked for memory leaks and know the application is leak-free.
So I started creating a simple memory manager using TBB's concurrent_queue. The queue is pre-filled with the maximum number of elements the application is allowed to use, and any class requiring new elements pops elements from the queue. The try_pop method is, according to Intel's documentation, lock-free. This worked quite well as far as memory consumption goes (there is still memory fragmentation, but not nearly as much as before). The problem I am facing now is that the application's performance has slowed down by roughly 4x according to my own simple profiler (I do not have access to commercial profilers, nor do I know of any that will work on a real-time application... any recommendation would be appreciated).
My question is: is there a thread-safe memory pool that is scalable? A must-have feature of the pool is fast recycling of elements, making them available again quickly. If there is none, any performance tips/tricks?
EDIT: I thought I would explain the problem a bit more. I could easily initialize n arrays, where n is the number of threads, and have each thread use objects from its own array. This works perfectly for some cases. In my case, though, I am also recycling the elements (potentially every frame), and they can be recycled at any point in the array; i.e. an element may come back at elementArray[0], elementArray[10] or elementArray[1000]. Now I have a fragmented array of elements, consisting of elements that are ready to be used and elements that are in use :(
As said in the comments, don't get a thread-safe memory allocator; allocate memory per thread.
As you implied in your update, you need to manage free/in-use effectively. That is a pretty straightforward problem, given a constant type and no concurrency.
For example (off the top of my head, untested):
#include <cassert>
#include <cstddef>
#include <vector>

template<typename T>
class ThreadStorage
{
    std::vector<T> m_objs;          // fixed-size backing store
    std::vector<size_t> m_avail;    // indices of free slots

public:
    explicit ThreadStorage(size_t count) : m_objs(count, T()) {
        m_avail.reserve(count);
        for (size_t i = 0; i < count; ++i) m_avail.push_back(i);
    }

    T* alloc() {
        assert(!m_avail.empty());   // pool exhausted otherwise
        T* retval = &m_objs[0] + m_avail.back();
        m_avail.pop_back();
        return retval;
    }

    void free(T* p) {
        *p = T(); // Assuming this is enough destruction.
        m_avail.push_back(p - &m_objs[0]);
    }
};
Then, for each thread, have a ThreadStorage instance, and call alloc() and free() as required.
You can add smart pointers to manage calling free() for you, and you can optimise constructor/destructor calling if that's expensive.
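For instance, here is a minimal sketch of that smart-pointer idea, built on the ThreadStorage class above (the PoolDeleter/PoolPtr/make_pooled names are mine, not part of any library):
#include <memory>

// Hypothetical deleter: returns the object to its ThreadStorage instead of deleting it.
template<typename T>
struct PoolDeleter {
    ThreadStorage<T>* storage;
    void operator()(T* p) const { storage->free(p); }
};

template<typename T>
using PoolPtr = std::unique_ptr<T, PoolDeleter<T>>;

template<typename T>
PoolPtr<T> make_pooled(ThreadStorage<T>& storage) {
    return PoolPtr<T>(storage.alloc(), PoolDeleter<T>{&storage});
}

// Usage:
//   ThreadStorage<Foo> storage(1000);
//   { PoolPtr<Foo> p = make_pooled(storage); /* use *p */ }  // returned to the pool here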
You can also look at boost::pool.
Update:
The new requirement of keeping track of objects that have been used, so that they can be processed in a second pass, seems a bit unclear to me. I think you mean that when the primary processing is finished on an object, you need to not release it, but keep a reference to it for second-stage processing. Other objects will simply be released back to the pool and not used for second-stage processing.
I assume you want to do this in the same thread.
As a first pass, you could add a method like this to ThreadStorage, and call it when you want to do processing on all unreleased instances of T. No extra bookkeeping required.
// Requires <algorithm> for std::sort and <boost/function.hpp>.
void do_processing(boost::function<void (T* p)> const& f) {
    std::sort(m_avail.begin(), m_avail.end());
    size_t o = 0; // index of the next object to consider
    for (size_t i = 0; i != m_avail.size(); ++i) {
        if (o < m_avail[i]) {
            // Every index between o and the next free slot is in use.
            do {
                f(&m_objs[o]);
            } while (++o < m_avail[i]);
            ++o; // skip the free slot itself
        } else if (o == m_avail[i])
            ++o; // slot is free, skip it
    }
    // Everything past the last free slot is in use.
    for (; o < m_objs.size(); ++o) f(&m_objs[o]);
}
Assumes no other thread is using the ThreadStorage instance, which is reasonable because it is thread-local by design. Again, off the top of my head, untested.
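Usage might then look like this (a toy sketch; Particle is just a placeholder type, and ThreadStorage with do_processing is assumed to be defined as above):
#include <iostream>

struct Particle { float x, y, z; };

int main() {
    ThreadStorage<Particle> storage(1024);   // this thread's pool
    Particle* a = storage.alloc();
    Particle* b = storage.alloc();
    storage.free(b);                          // b is available again, a is still in use

    // Second pass over every unreleased object (here, only *a).
    storage.do_processing([](Particle* p) {
        std::cout << "second-stage processing of " << p << '\n';
    });
}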
Have a look at Google's TCMalloc:
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
Performance:
TCMalloc is faster than the glibc 2.3 malloc... ptmalloc2 takes approximately 300 nanoseconds to execute a malloc/free pair on a 2.8 GHz P4 (for small objects). The TCMalloc implementation takes approximately 50 nanoseconds for the same operation pair...
You may want to have a look at jemalloc.
I am currently designing a user space scheduler in C11 for a custom co-processor under Linux (user space, because the co-processor does not run its own OS, but is controlled by software running on the host CPU). It keeps track of all the tasks' states with an array. Task states are regular integers in this case. The array is dynamically allocated and each time a new task is submitted whose state does not fit into the array anymore, the array is reallocated to twice its current size. The scheduler uses multiple threads and thus needs to synchronize its data structures.
Now, the problem is that I very often need to read entries in that array, since I need to know the states of tasks for scheduling decisions and resource management. If the base address were guaranteed to always be the same after each reallocation, I would simply use C11 atomics for accessing it. Unfortunately, realloc obviously cannot give such a guarantee. So my current approach is wrapping each access (reads AND writes) with one big lock in the form of a pthread mutex. Obviously, this is really slow, since there is locking overhead for each read, even though each read is tiny (a single integer).
To clarify the problem, I give some code here showing the relevant passages:
Writing:
// pthread_mutex_t mut;
// size_t len_arr;
// int *array, idx, x;
pthread_mutex_lock(&mut);
if (idx >= len_arr) {
    len_arr *= 2;
    array = realloc(array, len_arr * sizeof(int));
    if (array == NULL)
        abort();
}
array[idx] = x;
pthread_mutex_unlock(&mut);
Reading:
// pthread_mutex_t mut;
// int *array, idx;
pthread_mutex_lock(&mut);
int x = array[idx];
pthread_mutex_unlock(&mut);
I have already used C11 atomics for efficient synchronization elsewhere in the implementation and would love to use them to solve this problem as well, but I could not find an efficient way to do so. In a perfect world, there would be an atomic accessor for arrays which performs address calculation and memory read/write in a single atomic operation. Unfortunately, I could not find such an operation. But maybe there is a similarly fast or even faster way of achieving synchronization in this situation?
EDIT:
I forgot to specify that I cannot reuse slots in the array when tasks terminate. Since I guarantee access to the state of every task ever submitted since the scheduler was started, I need to store the final state of each task until the application terminates. Thus, static allocation is not really an option.
Do you need to be so economical with virtual address space? Can't you just set a very big upper limit and allocate enough address space for it (maybe even a static array, or dynamic if you want the upper limit to be set at startup from command-line options)?
Linux does lazy memory allocation, so virtual pages that you never touch aren't actually using any physical memory. See Why is iterating though `std::vector` faster than iterating though `std::array`?, which shows by example that reading or writing an anonymous page for the first time causes a page fault. If it was a read access, the kernel CoW (copy-on-write) maps it to a shared physical zero page. Only an initial write, or a write to a CoW page, triggers actual allocation of a physical page.
Leaving virtual pages completely untouched avoids even the overhead of wiring them into the hardware page tables.
If you're targeting a 64-bit ISA like x86-64, you have boatloads of virtual address space. Using up more virtual address space (as long as you aren't wasting physical pages) is basically fine.
Practical example of allocating more virtual address space than you can use:
If you allocate more memory than you could ever practically use (touching it all would definitely segfault or invoke the kernel's OOM killer), that will be as large or larger than you could ever grow via realloc.
To allocate this much, you may need to globally set /proc/sys/vm/overcommit_memory to 1 (no checking) instead of the default 0 (heuristic which makes extremely large allocations fail). Or use mmap(MAP_NORESERVE) to allocate it, making that one mapping just best-effort growth on page-faults.
The documentation says you might get a SIGSEGV on touching memory allocated with MAP_NORESERVE, which is different than invoking the OOM killer. But I think once you've already successfully touched memory, it is yours and won't get discarded. I think it's also not going to spuriously fail unless you're actually running out of RAM + swap space. IDK how you plan to detect that in your current design (which sounds pretty sketchy if you have no way to ever deallocate).
Test program:
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
int main(void) {
    size_t sz = 1ULL << 46; // 2**46 = 64 TiB = max power of 2 for x86-64 with 48-bit virtual addresses
    // In practice 1ULL << 40 (1 TiB) should be more than enough.
    // The smaller you pick, the less impact if multiple things use this trick in the same program.

    //int *p = aligned_alloc(64, sz); // doesn't use NORESERVE, so it would be limited by overcommit settings
    int *p = mmap(NULL, sz, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    madvise(p, sz, MADV_HUGEPAGE); // for good measure, to reduce page faults and TLB misses, since you're using large contiguous chunks of this array

    p[1000000000] = 1234;  // or sz/sizeof(int) - 1 also works; this touches only 1 page somewhere in the array
    printf("%p\n", (void*)p);
}
$ gcc -Og -g -Wall alloc.c
$ strace ./a.out
... process startup
mmap(NULL, 70368744177664, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x15c71ef7c000
madvise(0x15c71ef7c000, 70368744177664, MADV_HUGEPAGE) = 0
... stdio stuff
write(1, "0x15c71ef7c000\n", 15) = 15
0x15c71ef7c000
exit_group(0) = ?
+++ exited with 0 +++
My desktop has 16GiB of RAM (a lot of it in use by Chromium and some big files in /tmp) + 2GiB of swap. Yet this program allocated 64 TiB of virtual address space and touched 1 int of it nearly instantly. Not measurably slower than if it had only allocated 1MiB. (And future performance from actually using that memory should also be unaffected.)
The largest power-of-2 you can expect to work on current x86-64 hardware is 1ULL << 46. The total lower canonical range of the 48-bit virtual address space is 47 bits (user-space virtual address space on Linux), and some of that is already allocated for stack/code/data. Allocating a contiguous 64 TiB chunk of that still leaves plenty for other allocations.
(If you do actually have that much RAM + swap, you're probably waiting for a new CPU with 5-level page tables so you can use even more virtual address space.)
Speaking of page tables, the larger the array the more chance of putting some other future allocations very very far from existing blocks. This can have a minor cost in TLB-miss (page walk) time, if your actual in-use pages end up more scattered around your address space in more different sub-trees of the multi-level page tables. That's more page-table memory to keep in cache (including cached within the page-walk hardware).
The allocation size doesn't have to be a power of 2 but it might as well be. There's also no reason to make it that big. 1ULL << 40 (1TiB) should be fine on most systems. IDK if having more than half the available address space for a process allocated could slow future allocations; bookkeeping is I think based on extents (ptr + length) not bitmaps.
Keep in mind that if everyone starts doing this for random arrays in libraries, that could use up a lot of address space. This is great for the main array in a program that spends a lot of time using it. Keep it as small as you can while still being big enough to always be more than you need. (Optionally make it a config parameter if you want to avoid a "640kiB is enough for everyone" situation). Using up virtual address space is very low-cost, but it's probably better to use less.
Think of this as reserving space for future growth but not actually using it until you touch it. Even though by some ways of looking at it, the memory already is "allocated". But in Linux it really isn't. Linux defaults to allowing "overcommit": processes can have more total anonymous memory mapped than the system has physical RAM + swap. If too many processes try to use too much by actually touching all that allocated memory, the OOM killer has to kill something (because the "allocate" system calls like mmap have already returned success). See https://www.kernel.org/doc/Documentation/vm/overcommit-accounting
(With MAP_NORESERVE, it's only reserving address space which is shared between threads, but not reserving any physical pages until you touch them.)
You probably want your array to be page-aligned: #include <stdalign.h> so you can use something like
alignas(4096) struct entry process_array[MAX_LEN];
Or for non-static, allocate it with C11 aligned_alloc().
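For example, a minimal sketch of the dynamic case (struct entry and MAX_LEN are placeholders; shown here in C++ syntax, but the C11 call is identical apart from the cast; note that aligned_alloc requires the size to be a multiple of the alignment):
#include <cstddef>
#include <cstdlib>

struct entry { int state; };                  // placeholder task-state record

int main() {
    const std::size_t MAX_LEN = 1u << 20;     // placeholder upper bound on tasks
    std::size_t bytes = MAX_LEN * sizeof(entry);
    bytes = (bytes + 4095) & ~std::size_t(4095);   // round up to a multiple of the 4096 alignment
    entry* process_array = static_cast<entry*>(std::aligned_alloc(4096, bytes)); // page-aligned
    if (!process_array)
        return 1;
    // ... index process_array[0 .. MAX_LEN - 1] ...
    std::free(process_array);
}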
Give back early parts of the array when you're sure all threads are done with it
Page alignment makes it easy to do the calculations to "give back" a memory page (4 kiB on x86) if your array's logical size shrinks enough: madvise(addr, 4096*n, MADV_FREE); (Linux 4.5 and later). This is kind of like mmap(MAP_FIXED) to replace some pages with new untouched anonymous pages (that will read as zeroes), except it doesn't split up the logical mapping extents and create more bookkeeping for the kernel.
Don't bother with this unless you're returning multiple pages, and leave at least one page unfreed above the current top to avoid page faults if you grow again soon. For example, maintain a high-water mark of pages you've ever touched (without giving back) and a current logical size. If high_water - logical_size > 16 pages, give back all pages from 4 pages past the logical size up to the high-water mark, as in the sketch below.
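A rough sketch of that policy (high_water, logical_size and maybe_give_back are names I am inventing; assumes a page-aligned array and Linux 4.5+ for MADV_FREE):
#include <cstddef>
#include <sys/mman.h>

const std::size_t PAGE = 4096;

// Pages ever touched (and not yet given back) vs. pages currently needed.
std::size_t high_water   = 0;
std::size_t logical_size = 0;

void maybe_give_back(char* array_base) {
    // Only bother when there are plenty of reclaimable pages, and keep a few
    // slack pages above the logical size so growing again soon stays cheap.
    if (high_water > logical_size && high_water - logical_size > 16) {
        std::size_t keep  = logical_size + 4;
        std::size_t count = high_water - keep;
        madvise(array_base + keep * PAGE, count * PAGE, MADV_FREE);
        high_water = keep;
    }
}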
If you will typically be actually using/touching at least 2MiB of your array, use madvise(MADV_HUGEPAGE) when you allocate it to get the kernel to prefer using transparent hugepages. This will reduce TLB misses.
(Use strace to see return values from your madvise system calls, and look at /proc/PID/smaps, to see if your calls are having the desired effect.)
If up-front allocation is unacceptable, RCU (read-copy-update) might be viable if it's read-mostly. https://en.wikipedia.org/wiki/Read-copy-update. But copying a gigantic array every time an element changes isn't going to work.
You'd want a different data structure entirely where only small parts need to be copied. Or something other than RCU; like your answer, you might not need the read side to always be wait-free. The choice will depend on acceptable worst-case latency and/or average throughput, and also how much contention there is for any kind of ref counter that has to bounce around between all threads.
Too bad there isn't a realloc variant that attempts to grow without copying so you could attempt that before bothering other threads. (e.g. have threads with idx>len spin-wait on len in case it increases without the array address changing.)
So, I came up with a solution:
Reading:
while (true) {
    cnt++;
    if (wait) {
        cnt--;
        yield();
    } else {
        break;
    }
}
int x = array[idx];
cnt--;
Writing:
if (idx == len) {
    wait = true;
    while (cnt > 0); // busy wait to minimize latency of reallocation
    array = realloc(array, 2*len*sizeof(int));
    if (!array) abort(); // shit happens
    len *= 2; // must not be updated before reallocation completed
    wait = false;
}
// this is why len must be updated after realloc,
// it serves for synchronization with other writers
// exceeding the current length limit
while (idx > len) { yield(); }
while (true) {
    cnt++;
    if (wait) {
        cnt--;
        yield();
    } else {
        break;
    }
}
array[idx] = x;
cnt--;
wait is an atomic bool initialized to false; cnt is an atomic int initialized to zero.
This only works because I know that task IDs are assigned in ascending order without gaps, and that no task state is read before it has been initialized by the write operation. So I can always rely on exactly one thread pulling an ID that exceeds the current array length by 1; that thread performs the reallocation. New tasks created concurrently will block their threads until the responsible thread has performed the reallocation. Hence the busy wait, since the reallocation should happen quickly, so the other threads do not have to wait for too long.
This way, I eliminate the bottlenecking big lock. Array accesses can be made concurrently at the cost of two atomic additions. Since reallocation occurs seldom (due to exponential growth), array access is practically free of blocking.
EDIT:
After taking a second look, I noticed that one has to be careful about reordering of stores around the length update. Also, the whole thing only works if concurrent writes always use different indices. This is the case for my implementation, but might not generally be. Thus, this is not as elegant as I thought and the solution presented in the accepted answer should be preferred.
I have a RenderQueue that sorts a list of elements to render.
Now that the RenderQueue creates a RenderCommandBuffer with all the "low level" rendering operations, performance drops from 1400 FPS to 40 FPS for 1000 elements.
I profiled the app and the problem lies here (those per-frame allocations):
std::for_each(element.meshCommands.begin(), element.meshCommands.end(), [&] (auto& command) {
    std::vector<std::pair<std::string, glm::mat4>> p{ { "MVP", VPmatrix * command.second } };
    m_commandBuffer.addCommand(std::make_shared<SetShaderValuesCommand>(element.material, p));
    m_commandBuffer.addCommand(std::make_shared<BindMaterialCommand>(element.material));
    m_commandBuffer.addCommand(std::make_shared<RenderMeshCommand>(meshProperty.mesh));
});
I know that I can group my meshes by material, but the problem remains more or less the same: allocation of many objects per frame. How would you avoid this situation? How do game engines deal with this problem? Memory pools?
Details are scant, but I see two opportunities for tuning.
m_commandBuffer is a polymorphic container of some kind. I completely understand why you would build it this way, but it presents a problem - each element must be separately allocated.
You may well get much better performance by amalgamating all render operations into a variant, and implementing m_commandBuffer as a vector (or queue) of such variants. This allows you to reserve() space for the 1000 commands with 1 memory allocation, rather than the (at least) 1000 you currently require.
It also means that you only incur the cost of one memory fence during the allocation, again rather than the thousands you are suffering while incrementing and decrementing all the reference counts in all those shared_ptrs.
So:
using Command = boost::variant< SetShaderValuesCommand, BindMaterialCommand, RenderMeshCommand>;
using CommandQueue = std::deque<Command>;
Executing the commands then becomes:
for (auto& cmd : m_commandBuffer) {
    boost::apply_visitor([](auto& actualCmd) {
        actualCmd.run(); /* or whatever is the interface */
    }, cmd);
}
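Here is a self-contained toy version of the whole idea; the three command structs are simplified stand-ins for the real ones. Note that std::deque has no reserve(), so this sketch uses std::vector to get the single up-front allocation:
#include <boost/variant.hpp>
#include <iostream>
#include <string>
#include <vector>

// Simplified stand-ins for the real command types.
struct SetShaderValuesCommand { std::string name; void run() { std::cout << "set "  << name << '\n'; } };
struct BindMaterialCommand    { std::string name; void run() { std::cout << "bind " << name << '\n'; } };
struct RenderMeshCommand      { std::string name; void run() { std::cout << "draw " << name << '\n'; } };

using Command      = boost::variant<SetShaderValuesCommand, BindMaterialCommand, RenderMeshCommand>;
using CommandQueue = std::vector<Command>;

int main() {
    CommandQueue commandBuffer;
    commandBuffer.reserve(3000);   // one allocation for ~1000 elements * 3 commands each

    commandBuffer.push_back(SetShaderValuesCommand{"MVP"});
    commandBuffer.push_back(BindMaterialCommand{"stone"});
    commandBuffer.push_back(RenderMeshCommand{"rock"});

    for (auto& cmd : commandBuffer) {
        boost::apply_visitor([](auto& actualCmd) { actualCmd.run(); }, cmd);
    }
}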
I have noticed some interesting behavior in Linux with regard to the Memory Usage (RES) reported by top. I have attached the following program which allocates a couple million objects on the heap, each of which has a buffer that is around 1 kilobyte. The pointers to those objects are tracked by either a std::list, or a std::vector. The interesting behavior I have noticed is that if I use a std::list, the Memory Usage reported by top never changes during the sleep periods. However if I use std::vector, the memory usage will drop to near 0 during those sleeps.
My test configuration is:
Fedora Core 16
Kernel 3.6.7-4
g++ version 4.6.3
What I already know:
1. std::vector will re-allocate (doubling its size) as needed.
2. std::list (I believe) is allocating its elements 1 at a time
3. both std::vector and std::list are using std::allocator by default to get their actual memory
4. The program is not leaking; valgrind has declared that no leaks are possible.
What I'm confused by:
1. Both std::vector and std::list are using std::allocator. Even if std::vector is doing batch re-allocations, wouldn't std::allocator be handing out memory in almost the same arrangement to std::list and std::vector? This program is single threaded after all.
2. Where can I learn about the behavior of Linux's memory allocation. I have heard statements about Linux keeping RAM assigned to a process even after it frees it, but I don't know if that behavior is guaranteed. Why does using std::vector impact that behavior so much?
Many thanks for reading this; I know this is a pretty fuzzy problem. The 'answer' I'm looking for here is if this behavior is 'defined' and where I can find its documentation.
#include <string.h>
#include <unistd.h>
#include <iostream>
#include <vector>
#include <list>
#include <memory>

class Foo
{
public:
    Foo()
    {
        data = new char[999];
        memset(data, 'x', 999);
    }

    ~Foo()
    {
        delete[] data;
    }

private:
    char* data;
};

int main(int argc, char** argv)
{
    for (int x = 0; x < 10; ++x)
    {
        sleep(1);
        //std::auto_ptr<std::list<Foo*> > foos(new std::list<Foo*>);
        std::auto_ptr<std::vector<Foo*> > foos(new std::vector<Foo*>);
        for (int i = 0; i < 2000000; ++i)
        {
            foos->push_back(new Foo());
        }
        std::cout << "Sleeping before de-alloc\n";
        sleep(5);
        while (false == foos->empty())
        {
            delete foos->back();
            foos->pop_back();
        }
    }
    std::cout << "Sleeping after final de-alloc\n";
    sleep(5);
}
The freeing of memory is done on a "chunk" basis. It's quite possible that when you use list, the memory gets fragmented into little tiny bits.
When you allocate using a vector, all elements are stored in one big chunk, so it's easy for the memory-freeing code to say "Golly, I've got a very large free region here, I'm going to release it back to the OS". It's also entirely possible that when growing the vector, the memory allocator goes into "large chunk mode", which uses a different allocation method than "small chunk mode" - say, for example, when you allocate more than 1MB, the memory allocation code may see that as a good time to start using a different strategy and just ask the OS for a "perfect fit" piece of memory. This large block is very easy to release back to the OS when it's freed.
On the other hand, if you are adding to a list, you are constantly asking for little bits, so the allocator uses a different strategy of asking for a large block and then giving out small portions. It's both difficult and time-consuming to ensure that ALL blocks within a chunk have been freed, so the allocator may well "not bother" - because chances are that there are some regions in there "still in use", and then it can't be freed at all anyway.
I would also add that using "top" as a memory measure isn't a particularly accurate method, and is very unreliable, as it very much depends on what the OS and the runtime library do. Memory belonging to a process may not be "resident", even though the process still hasn't freed it - it's just not "present in actual memory" (out in the swap partition instead!).
And to your question "is this defined somewhere", I think it is in the sense that the C/C++ library source code defines it. But it's not defined in the sense that somewhere it's written "This is how it's meant to work, and we promise never to change it". The libraries supplied as glibc and libstdc++ are not going to say that; they will change the internals of malloc, free, new and delete as new technologies and ideas are invented - some may make things better, others may make them worse, for a given scenario.
As has been pointed out in the comments, the memory is not locked to the process. If the kernel feels that the memory is better used for something else [and the kernel is omnipotent here], then it will "steal" the memory from one running process and give it to another. Particularly memory that hasn't been "touched" for a long time.
1. Both std::vector and std::list are using std::allocator. Even if std::vector is doing batch re-allocations, wouldn't std::allocator be handing out memory in almost the same arrangement to std::list and std::vector? This program is single threaded after all.
Well, what are the differences?
std::list allocates nodes one-by-one (each node needs two pointers in addition to your Foo *). Also, it never re-allocates these nodes (this is guaranteed by the iterator invalidation requirements for list). So, the std::allocator will request a sequence of fixed-size chunks from the underlying mechanism (probably malloc which will in turn use the sbrk or mmap system calls). These fixed-size chunks may well be larger than a list node, but if so they'll all be the same default chunk size used by std::allocator.
std::vector allocates a contiguous block of pointers with no book-keeping overhead (that's all in the vector parent object). Every time a push_back would overflow the current allocation, the vector will allocate a new, larger chunk, move everything across to the new chunk, and release the old one. Now, the new chunk will be something like double (or 1.6 times, or whatever) the size of the old one, as is required to keep the amortized constant time guarantee for push_back. So, pretty quickly, I'd expect the sizes it requests to exceed any sensible default chunk size for std::allocator.
So the interesting interactions are different: one between std::vector and the allocator's underlying mechanism, and one between the std::allocator itself and that underlying mechanism.
2. Where can I learn about the behavior of Linux's memory allocation. I have heard statements about Linux keeping RAM assigned to a process even after it frees it, but I don't know if that behavior is guaranteed. Why does using std::vector impact that behavior so much?
There are several levels you might care about:
1. The container's own allocation pattern, which is hopefully described above. Note that in real-world applications, the way a container is used is just as important.
2. std::allocator itself, which may provide a layer of buffering for small allocations. I don't think this is required by the standard, so it's specific to your implementation.
3. The underlying allocator, which depends on your std::allocator implementation (it could for example be malloc, however that is implemented by your libc).
4. The VM scheme used by the kernel, and its interactions with whatever syscalls the underlying allocator (3) ultimately uses.
In your particular case, I can think of a possible explanation for the vector apparently releasing more memory than the list.
Consider that the vector ends up with a single contiguous allocation, and lots of the Foos will also be allocated contiguously. This means that when you release all this memory, it's pretty easy to figure out that most of the underlying pages are genuinely free.
Now consider that the list node allocations are interleaved 1:1 with the Foo instances. Even if the allocator did some batching, it seems likely that the heap is much more fragmented than in the std::vector case. Therefore, when you release the allocated records, some work would be required to figure out whether an underlying page is now free, and there's no particular reason to expect this will happen (unless a subsequent large allocation encouraged coalescing of heap records).
The answer is the malloc "fastbins" optimization.
std::list creates tiny (less than 64 bytes) allocations, and when it frees them they are not actually released; they go into the fastbin pool instead.
This behavior means that the heap stays fragmented even AFTER the list is cleared, and therefore the memory does not return to the system.
You can either use malloc_trim(128*1024) in order to forcibly clear them.
Or use mallopt(M_MXFAST, 0) in order to disable fastbins altogether.
I find the first solution to be more correct if you call it when you really don't need the memory anymore.
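For illustration, a minimal sketch of where those calls would go (glibc-specific, declared in <malloc.h>):
#include <malloc.h>
#include <list>

int main() {
    // mallopt(M_MXFAST, 0);   // alternative: disable fastbins entirely (call before any allocation)

    {
        std::list<int> lots_of_nodes;
        for (int i = 0; i < 2000000; ++i)
            lots_of_nodes.push_back(i);
    }   // the list is destroyed here, but its tiny node chunks stay cached in the heap

    // Now that the memory really isn't needed any more, hand back what we can,
    // keeping 128 KiB of slack at the top of the heap.
    malloc_trim(128 * 1024);
}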
Smaller chunks go through brk (adjusting the data segment) with constant splitting and coalescing, while bigger chunks use mmap, so the process is a little less disturbed. More info (PDF).
See also the ptmalloc source code.
I want to explore the performance impact of repeatedly dereferencing data held in a vector of new-ly allocated structs (or classes).
struct Foo
{
    int val;
    // some variables
};

std::vector<Foo*> vectorOfFoo;

// Foo objects are new-ed and pushed into vectorOfFoo
for (int i = 0; i < N; i++)
{
    Foo* f = new Foo;
    vectorOfFoo.push_back(f);
}
In the parts of the code where I iterate over the vector, I would like to enhance locality of reference across the many iterator dereferences; for example, I very often have to perform a doubly nested loop like
for (std::vector<Foo*>::iterator iter = vectorOfFoo.begin(); iter != vectorOfFoo.end(); ++iter)
{
    int somevalue = (*iter)->val;
}
Obviously, if the pointers inside vectorOfFoo point to objects that are far apart, locality of reference is somewhat lost.
What about performance if I sort the vector before iterating over it? Should I then get better performance from the repeated dereferencing?
Am I guaranteed that consecutive `new`s allocate objects that are close together in memory?
Just to answer your last question: no, there is no guarantee whatsoever where new allocates memory. The allocations can be distributed throughout the memory. Depending on the current fragmentation of the memory you may be lucky that they are sometimes close to each other but no guarantee is - or, actually, can be - given.
If you want to improve the locality of reference for your objects then you should look into Pool Allocation.
But that's pointless without profiling.
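If profiling does show the scattered allocations hurting, a pool-based sketch could look like this (using boost::object_pool; Foo mirrors the struct from the question):
#include <boost/pool/object_pool.hpp>
#include <vector>

struct Foo {
    int val;
    Foo() : val(0) {}
};

int main() {
    const int N = 1000000;

    boost::object_pool<Foo> pool;      // hands out Foos from large contiguous chunks
    std::vector<Foo*> vectorOfFoo;
    vectorOfFoo.reserve(N);

    for (int i = 0; i < N; ++i)
        vectorOfFoo.push_back(pool.construct());   // instead of new Foo

    long long sum = 0;
    for (std::vector<Foo*>::iterator it = vectorOfFoo.begin(); it != vectorOfFoo.end(); ++it)
        sum += (*it)->val;

    // pool.destroy(p) releases a single object; the pool's destructor frees everything at once.
    return (int)sum;
}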
It depends on many factors.
First, it depends on how the objects being pointed to from the vector were allocated. If they were allocated on different pages, you cannot do much except fix the allocation part and/or try to use software prefetching.
You can generally check what virtual addresses malloc gives out, but as a part of the larger program the result of separate allocations is not deterministic. So if you want to control the allocation, you have to do it smarter.
In the case of a NUMA system, you have to make sure that the memory you are accessing is allocated from the physical memory of the node on which your process is running. Otherwise, no matter what you do, the memory will be coming from the other node, and you cannot do much in that case except migrate your program back to its "home" node.
You have to check the stride that is needed in order to jump from one object to another. The prefetcher can recognize a stride within a 512-byte window. If the stride is greater, you are talking about random memory access from the prefetcher's point of view. It will then shut off so as not to evict your data from the cache, and the best you can do is to try software prefetching, as in the sketch below. It may or may not help (always test it).
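A sketch of the software-prefetch idea with GCC/Clang's __builtin_prefetch (the look-ahead of 8 elements is a guess you would have to tune, and Foo mirrors the question's struct):
#include <cstddef>
#include <vector>

struct Foo { int val; };

long long sum_values(const std::vector<Foo*>& vectorOfFoo) {
    long long sum = 0;
    const std::size_t n = vectorOfFoo.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(vectorOfFoo[i + 8], 0, 1);  // prefetch for read, low temporal locality
        sum += vectorOfFoo[i]->val;   // the dereference whose latency we are trying to hide
    }
    return sum;
}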
So if sorting the vector of pointers makes the objects they point to sit contiguously one after another with a relatively small stride, then yes, you will improve memory access speed by making it more friendly for the prefetch hardware.
You also have to make sure that the cost of sorting the vector doesn't outweigh what you gain.
On a side note, depending on how you use each element, you may want to allocate them all at once and/or split those objects into different smaller structures and iterate over smaller data chunks.
At any rate, you absolutely must measure the performance of the whole application before and after your changes. This sort of optimization is a tricky business, and things can get worse even though in theory the performance should improve. There are many tools that can help you profile memory access; for example cachegrind, and Intel's VTune does the same, among many others. So don't guess: experiment and verify the results.
I ran into a problem when trying to create and delete instances of a class in a loop.
The execution time of the iterations varies considerably. As I understand it, this is associated with the removal of objects from memory, but I do not understand the behavior of this operation. Why does the time differ? How do I fix it? The time is steady when I delete the object in a separate thread.
#include <ctime>
#include <iostream>
#include <vector>
using namespace std;

class NODE {
public:
    NODE() {}

    NODE* add(NODE* node)
    {
        children.push_back(node);
        return node;
    }

    virtual ~NODE()
    {
        for (vector<NODE*>::iterator it = children.begin(); it != children.end(); ++it)
        {
            delete *it;
        }
    }

    vector<NODE*> children;
};

NODE* create()
{
    NODE* node( new NODE() );
    for (int i = 0; i < 200; i++) {
        NODE* subnode = node->add( new NODE() );
        for (int k = 0; k < 20; k++) subnode->add( new NODE() );
    }
    return node;
}

int main()
{
    NODE* root;
    unsigned t;
    for (int i = 0; i < 30; i++) {
        t = clock();
        cout << "Create... ";
        root = create();
        delete root;
        cout << clock() - t << endl;
    }
}
ADDED:
I'm confused. When I run the program outside of VS, it works fine...
When you create an object, sometimes a new block of memory for it will be allocated, and sometimes it will fit into an already-existing block. This will cause two allocations to potentially take differing amounts of time.
If you want allocations and frees to take a consistent amount of time, handle them inside your application: grab a big chunk of memory and service allocations from that block, as in the sketch below. Of course, when you need another chunk, the allocation that causes that to happen is going to take longer... but with large chunks it should happen infrequently.
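A very small sketch of that pattern, assuming a single thread and a fixed-size object type (Arena is my name for it; it is the "big chunk" idea, not a general-purpose allocator):
#include <cstddef>
#include <vector>

template <typename T>
class Arena {
    std::vector<T>  slots;      // the one big chunk, allocated up front
    std::vector<T*> freeList;   // slots currently available
public:
    explicit Arena(std::size_t capacity) : slots(capacity) {
        freeList.reserve(capacity);
        for (std::size_t i = 0; i < capacity; ++i)
            freeList.push_back(&slots[i]);
    }
    // Constant-time, no heap traffic; assumes capacity is never exceeded.
    T* alloc()      { T* p = freeList.back(); freeList.pop_back(); return p; }
    void free(T* p) { freeList.push_back(p); }
};

// Usage idea: Arena<NODE> arena(1 + 200 * 21); then replace "new NODE()" with
// arena.alloc() (and reset/clear the node before reuse) so create/delete cost stays steady.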
The only way to make allocations and deallocations take a perfectly consistent amount of time would be to slow down the fast ones until they take the maximum amount of time any request could spend. This would be the Harrison Bergeron approach to performance optimization. I don't recommend it.
There are real-time heaps in existence, but generally memory heap operations (dynamic allocations and deallocations) are the canonical example of a non-deterministic operation. This means that the runtime varies and doesn't even have a very good bound.
The issue is that you generally need adjacent free blocks of memory merged into a single block as they appear. If that isn't done, eventually you'll just have oodles of tiny blocks, and a large allocation may fail even though there really is plenty of memory available for it. On any given call there may or may not be merging to do, and the amount of work varies. It all depends on the pattern of allocations/deallocations your system has performed recently, which is not something you can plan for at all. So we call it "non-deterministic".
If you don't like this behavior, there are really two possibilities:
Switch to using a real-time heap. Your OS probably doesn't have one built in, so you'd have to go buy or download one and use it for all your memory operations. One I've used in the past is TLSF.
Don't perform dynamic memory allocation/deallocation in your main loop (IOW: not after initialization). This is how we realtime programmers have trained ourselves to code for ages.
In addition to what other answers say you have to account for Visual C++ 10 runtime heap being thread-safe. This implies that even when you only have one thread you encounter some overhead facilitating heap operations being thread-safe. So one reason you see such poor results is that you use a universal but rather slow heap implementation.
The reason why you get different timing with/without the debugger is that a special heap is used when a program is started under a debugger (even in a Release configuration), and that special heap is also relatively slow.
Generally it's not possible to predict memory allocation/deallocation times. For example, times may vary significantly if the user-space heap runs out of pages and needs to request more from the kernel, or when accessing a newly allocated page triggers a page fault.
So even if you go ahead and implement your own heap using big chunks of memory, your allocation times will vary, because laziness is in the nature of the underlying memory system.
If you really want to know why it's slow, you need to run a real profiler on it, such as AMD's CodeAnalyst; clock() isn't exactly a high-precision timer.
The reason it runs differently each time depends on the paging of the underlying system, the CPU load, and whether your data has been cached by the processor.