Multithread Memory Profiling in C++

I am working on profiling the memory usage of multiple threads in my application. I would like to be able to track the maximum and current allocation of any given thread that is running. In order to do so, I planned on interposing on malloc/free. During each call to malloc, I would update the allocation record for the particular thread in a static map that associates thread ids with their metadata records.
I am currently having issues during process exit. I think the issue is that when all the destructors are called for cleanup, the static map and the lock protecting it have to be destroyed. My interposed malloc/free, however, acquire the lock before updating the profiling metadata structures. Eventually the lock is destroyed, but there are subsequent calls to malloc/free that attempt to acquire the no-longer-existent lock, resulting in a segfault.
Another issue I am concerned about is that my interposed malloc itself generates internal calls to malloc in order to allocate entries in the map.
Any ideas on ways of approaching the problem of profiling memory usage on a per-thread basis? Any suggestions on data structures to track the usage of each thread? Does the above approach seem reasonable, or are there better ways of approaching the problem?

If you store your "extra" data as part of the allocation itself (putting it before the user data is easier, but you could put it after too; you just need the size stored somewhere), then you shouldn't need any locks at all. Just a tad more memory. Of course, you will need to use atomics to update any lists of items.
If you look at this answer:
Setting memory on a custom heap
and imagine that HeapAlloc and HeapFree are malloc and free respectively. Then add code to store which thread is being used for the allocation.
So, instead of using a map, you simply update a linked list (using atomics to prevent concurrent updates from racing). This does of course make it a little more difficult to get up-to-date measurements per thread; you'll have to scan the list of allocations.
Of course, this only works for DIRECT calls to malloc and free.
The same principle would be possible by "injecting" a replacement malloc/free function (built along the principles in the other post, but of course not using the original malloc to allocate the memory, and not using free to free the memory).
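To make the idea concrete, here is a minimal sketch of an interposed allocator that keeps its bookkeeping in a header in front of each block and in a fixed array of per-thread atomic counters (counters rather than the linked list described above, for brevity). Everything here is illustrative: the slot limit, the names, and the fact that it calls std::malloc directly instead of the underlying malloc obtained via dlsym(RTLD_NEXT, ...). Because the counters are plain statics with trivial destructors, calls that arrive during process shutdown no longer race with a destroyed map or lock.

    #include <atomic>
    #include <cstddef>
    #include <cstdlib>

    // Per-thread allocation statistics; plain statics with trivial destructors,
    // so late calls during process exit cannot hit a destroyed object.
    static const int kMaxThreads = 256;            // illustrative upper bound

    struct ThreadStats {
        std::atomic<std::size_t> current{0};
        std::atomic<std::size_t> peak{0};
    };

    static ThreadStats g_stats[kMaxThreads];
    static std::atomic<int> g_nextSlot{0};

    static int mySlot() {
        // One slot per thread, assigned on first use (slots wrap if more than
        // kMaxThreads threads are ever created -- a simplification).
        static thread_local int slot = g_nextSlot.fetch_add(1) % kMaxThreads;
        return slot;
    }

    // Header stored immediately before the user data.
    struct Header {
        std::size_t size;
        int         slot;
    };

    void* tracked_malloc(std::size_t size) {
        // A real interposer would call the underlying allocator obtained via
        // dlsym(RTLD_NEXT, "malloc"); calling std::malloc here is a stand-in.
        Header* h = static_cast<Header*>(std::malloc(sizeof(Header) + size));
        if (!h) return nullptr;
        h->size = size;
        h->slot = mySlot();

        ThreadStats& s = g_stats[h->slot];
        std::size_t cur  = s.current.fetch_add(size) + size;
        std::size_t peak = s.peak.load();
        while (cur > peak && !s.peak.compare_exchange_weak(peak, cur)) {
            // compare_exchange_weak reloads 'peak' on failure; retry until the
            // recorded peak is at least 'cur'.
        }
        return h + 1;
    }

    void tracked_free(void* p) {
        if (!p) return;
        Header* h = static_cast<Header*>(p) - 1;
        g_stats[h->slot].current.fetch_sub(h->size);
        std::free(h);
    }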

This is a complicated thing to do and make work for all cases. There are many issues that you'll miss and only ever find through trial and error. I should know, I've been responsible for building a tool that does what you are trying to do. We've been doing this since 1999, available commercially since 2002.
If you are using Windows, C++ Memory Validator can give you per-thread profiling statistics.
http://www.softwareverify.com/cpp-memory.php.
The Objects tab and Sizes tab both have Threads sub-tabs which allow you to view data per thread. You can also run advanced queries on the Analysis tab that will allow you to view data on a per-thread basis.
Spend your time on your job, not writing tools.

Related

Is access to the heap serialized?

One rule every programmer quickly learns about multithreading is:
If more than one thread has access to a data structure, and at least one of the threads might modify that data structure, then you'd better serialize all accesses to that data structure, or you're in for a world of debugging pain.
Typically this serialization is done via a mutex -- i.e. a thread that wants to read or write the data structure locks the mutex, does whatever it needs to do, and then unlocks the mutex to make it available again to other threads.
Which brings me to the point: the memory-heap of a process is a data structure which is accessible by multiple threads. Does this mean that every call to default/non-overloaded new and delete is serialized by a process-global mutex, and is therefore a potential serialization-bottleneck that can slow down multithreaded programs? Or do modern heap implementations avoid or mitigate that problem somehow, and if so, how do they do it?
(Note: I'm tagging this question linux, to avoid the correct-but-uninformative "it's implementation-dependent" response, but I'd also be interested in hearing about how Windows and MacOS/X do it as well, if there are significant differences across implementations)
new and delete are thread safe
The following functions are required to be thread-safe:
The library versions of operator new and operator delete
User replacement versions of global operator new and operator delete
std::calloc, std::malloc, std::realloc, std::aligned_alloc, std::free
Calls to these functions that allocate or deallocate a particular unit of storage occur in a single total order, and each such deallocation call happens-before the next allocation (if any) in this order.
With gcc, new is implemented by delegating to malloc, and we see that their malloc does indeed use a lock. If you are worried about your allocation causing bottlenecks, write your own allocator.
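As a point of reference, a user-provided replacement of the global allocation functions (which, per the wording above, must itself be thread-safe) can be as small as the following sketch; it just forwards to malloc/free, which the C library already makes thread-safe, and is the natural place to hook in a custom allocator:

    #include <cstdlib>
    #include <new>

    // Minimal replacement of the global allocation functions. Both must be
    // thread-safe; forwarding to malloc/free satisfies that, and a custom
    // (e.g. per-thread) allocator could be substituted here instead.
    void* operator new(std::size_t size) {
        if (void* p = std::malloc(size ? size : 1))
            return p;
        throw std::bad_alloc();
    }

    void operator delete(void* p) noexcept {
        std::free(p);
    }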
The answer is yes, but in practice it is usually not a problem.
If it is a problem for you, you may try replacing your implementation of malloc with tcmalloc, which reduces, but does not eliminate, possible contention (since there is still only one heap that needs to be shared among the threads of a process).
TCMalloc assigns each thread a thread-local cache. Small allocations are satisfied from the thread-local cache. Objects are moved from central data structures into a thread-local cache as needed, and periodic garbage collections are used to migrate memory back from a thread-local cache into the central data structures.
There are also other options like using custom allocators and/or specialized containers and/or redesigning your application.
You tried to head off the "it is architecture/system dependent" answer, but the problem of multiple threads having to serialize their accesses really only arises when the heap has to grow or shrink, i.e. when the program needs to expand it or return part of it to the system.
The first answer has to be simply that it is implementation dependent, without any system dependence, because normally the library obtains large chunks of memory on which to base the heap and administers them internally, which actually makes the problem operating system and architecture independent.
The second answer is that, of course, if you have only one single heap for all the threads, you will have a possible bottleneck in case all of the active threads compete for a single chunk of memory. There are several approaches to this. You can have a pool of heaps to allow parallelism, and make the different threads use different pools for their requests; note that the biggest potential problem is in requesting memory, as that is where the bottleneck appears. On freeing there is no such issue: you can act more like a garbage collector, queueing the returned chunks of memory for a dedicated thread to dispatch and put back in the proper places, preserving the integrity of the heaps. Having multiple heaps even lets you classify them by priority, by chunk size, etc., so the risk of collision is kept low by the class of problem you are going to deal with. This is the case for operating system kernels like *BSD, which use several memory heaps, each somewhat dedicated to the kind of use it is going to receive (there is one for the disk I/O buffers, one for virtual-memory-mapped segments, one for process virtual memory space management, etc.).
I recommend reading The Design and Implementation of the FreeBSD Operating System, which explains very well the approach used in the kernel of BSD systems. It is general enough that a large percentage of other systems probably follow the same or a very similar approach.

Is dynamic memory access detrimental in real-time programs?

I'm working on a project that contains a real-time software component, using the RT PREEMPT patch on Linux.
During "idle" operation the software just sits waiting for incoming TCP connections and requests. Depending on the request, the software may create a real-time thread that runs for a period of time. So the entire application doesn't need to operate in real-time, only this one thread from time to time.
My question is this: I'm well aware that dynamic memory allocation is non-deterministic and is detrimental to real-time code. However, is accessing existing memory on the heap also detrimental to real-time constraints?
I ask because I'm considering a situation where the program starts up, allocates any required structures on the heap, then triggers a real-time thread that accesses the heap.
EDIT: Once the real-time thread has started, other threads are prevented from writing to variables the real-time thread needs to access using locks (well, except for one variable that must be updated, but access is still restricted using a separate lock).
EDIT2: I forgot to mention that the program will ultimately be deployed on a system that doesn't have any swap space, so I don't think the paging of memory should be an issue. (Though of course this doesn't avoid the issue of page-faults through memory that hasn't yet been provisioned by the OS.)
It is possible that the virtual memory manager might move your memory to swap, making your thread generate a major page fault when it runs. You need to lock the memory pages using mlock(). I also recommend allocating memory in chunks and writing to all of it with memset() before using it, to avoid minor page faults at run time, and using placement new instead of the regular one to create your objects in the already-allocated memory.
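A minimal sketch of that recipe on Linux, assuming a single pre-allocated pool and an illustrative Sample type; error handling is reduced to early returns, and mlock() may also require a suitable RLIMIT_MEMLOCK or CAP_IPC_LOCK:

    #include <sys/mman.h>
    #include <cstdlib>
    #include <cstring>
    #include <new>

    struct Sample { double values[64]; };          // illustrative payload

    int main() {
        const std::size_t kPoolSize = 4 * 1024 * 1024;      // 4 MiB, for example

        void* pool = std::malloc(kPoolSize);
        if (!pool) return 1;

        if (mlock(pool, kPoolSize) != 0)           // pin the pages in RAM
            return 1;                              // (mlockall(MCL_CURRENT | MCL_FUTURE) is the blunter alternative)

        std::memset(pool, 0, kPoolSize);           // pre-fault every page now,
                                                   // not inside the real-time thread

        // Later, in the real-time thread: construct in already-provisioned memory.
        Sample* s = new (pool) Sample();
        s->values[0] = 42.0;

        s->~Sample();
        munlock(pool, kPoolSize);
        std::free(pool);
        return 0;
    }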
is accessing existing memory on the heap also detrimental to real-time constraints?
No, unless your system is thrashing.
BTW, you could consider writing your own allocation (e.g. above mmap(2)...) and using mlock(2) for the memory that ought to be in RAM.

Can multithreading speed up memory allocation?

I'm working with an 8 core processor, and am using Boost threads to run a large program.
Logically, the program can be split into groups, where each group is run by a thread.
Inside each group, some classes invoke the 'new' operator a total of 10000 times.
Rational Quantify shows that the 'new' memory allocation is taking up the maximum processing time when the program runs, and is slowing down the entire program.
One way I can speed up the system could be to use threads inside each 'group', so that the 10000 memory allocations can happen in parallel.
I'm unclear on how the memory allocation will be managed here. Will the OS scheduler really be able to allocate memory in parallel?
Standard CRT
While with older versions of Visual Studio the default CRT allocator was blocking, this is no longer true, at least for Visual Studio 2010 and newer, which calls the corresponding OS functions directly. The Windows heap manager was blocking until Windows XP; in XP the optional Low Fragmentation Heap is not blocking, while the default one is, and newer OSes (Vista/Win7) use the LFH by default. The performance of recent (Windows 7) allocators is very good, comparable to the scalable replacements listed below (you still might prefer them if targeting older platforms or when you need some other features they provide). There exist several "scalable allocators", with different licenses and different drawbacks. I think on Linux the default runtime library already uses a scalable allocator (some variant of PTMalloc).
Scalable replacements
I know about:
HOARD (GNU + commercial licenses)
MicroQuill SmartHeap for SMP (commercial license)
Google Perf Tools TCMalloc (BSD license)
NedMalloc (BSD license)
JemAlloc (BSD license)
PTMalloc (GNU, no Windows port yet?)
Intel Thread Building Blocks (GNU, commercial)
You might want to check Scalable memory allocator experiences for my experiences with trying to use some of them in a Windows project.
In practice most of them work by having a per-thread cache and per-thread pre-allocated regions for allocations, which means that small allocations most often happen entirely within the context of the calling thread; OS services are called only infrequently.
Dynamic allocation of memory uses the heap of the application/module/process (but not of the thread). The heap can only handle one allocation request at a time. If you try to allocate memory from "parallel" threads, the requests will be handled in due order by the heap. You will not get behaviour like: one thread is waiting to get its memory while another can ask for some, while a third one is getting some. The threads will have to line up in a queue to get their chunks of memory.
What you would need is a pool of heaps. Use whichever heap is not busy at the moment to allocate the memory. But then, you have to watch out throughout the life of this variable that it does not get deallocated on another heap (that would cause a crash).
I know that the Win32 API has functions such as GetProcessHeap(), HeapCreate(), HeapAlloc() and HeapFree(), which allow you to create a new heap and allocate/deallocate memory from a specific heap HANDLE. I don't know of equivalents in other operating systems (I have looked for them, but to no avail).
You should, of course, try to avoid doing frequent dynamic allocations. But if you can't, you might consider (for portability) creating your own "heap" class (it doesn't have to be a heap per se, just a very efficient allocator) that can manage a large chunk of memory, along with a smart pointer class that holds a reference to the heap it came from. This would enable you to use multiple heaps (make sure they are thread-safe).
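A rough sketch of such a pool of heaps using the Win32 calls just mentioned: allocations are spread across several private heaps, and a small header records which heap each block came from so it is always freed back to the right one. The pool size, the round-robin policy and the header layout are illustrative only.

    #include <windows.h>
    #include <atomic>
    #include <cstddef>

    class HeapPool {
    public:
        explicit HeapPool(unsigned count = 8) : count_(count), heaps_(new HANDLE[count]) {
            for (unsigned i = 0; i < count_; ++i)
                heaps_[i] = HeapCreate(0, 0, 0);               // growable private heap
        }

        ~HeapPool() {
            for (unsigned i = 0; i < count_; ++i)
                HeapDestroy(heaps_[i]);
            delete[] heaps_;
        }

        void* allocate(std::size_t size) {
            unsigned idx = next_.fetch_add(1) % count_;        // spread load round-robin
            void* raw = HeapAlloc(heaps_[idx], 0, kHeaderSize + size);
            if (!raw) return nullptr;
            *static_cast<unsigned*>(raw) = idx;                // remember the owning heap
            return static_cast<char*>(raw) + kHeaderSize;
        }

        void deallocate(void* p) {
            if (!p) return;
            void* raw = static_cast<char*>(p) - kHeaderSize;
            HeapFree(heaps_[*static_cast<unsigned*>(raw)], 0, raw);   // free on the owning heap
        }

    private:
        // Header is padded to the strictest alignment so user data stays aligned.
        static constexpr std::size_t kHeaderSize = alignof(std::max_align_t);

        unsigned count_;
        HANDLE*  heaps_;
        std::atomic<unsigned> next_{0};
    };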
There are 2 scalable drop-in replacements for malloc that I know of:
Google's tcmalloc
Facebook's jemalloc (link to a performance study comparing to tcmalloc)
I don't have any experience with Hoard (which performed poorly in the study), but Emery Berger lurks on this site and was astonished by the results. He said he would have a look and I surmise there might have been some specifics to either the test or implementation that "trapped" Hoard as the general feedback is usually good.
One word of caution with jemalloc, it can waste a bit of space when you rapidly create then discard threads (as it creates a new pool for each thread you allocate from). If your threads are stable, there should not be any issue with this.
I believe the short answer to your question is: yes, probably. And, as several people here have already pointed out, there are ways to achieve this.
Aside from your question and the answers already posted here, it would be good to start with your expectations for the improvement, because that will pretty much tell you which path to take. Maybe you need to be 100x faster. Also, do you see yourself making speed improvements in the near future as well, or is there a level which will be good enough? Not knowing your application or problem domain, it's difficult to advise you specifically. Are you, for instance, in a problem domain where speed continually has to be improved?
One good thing to start off with when doing performance improvements is to question whether you need to do things the way you currently do them. In this case, can you pre-allocate objects? Is there a maximum number of X objects in the system? Could you re-use objects? All of this helps, because you don't necessarily need to do allocations on the critical path. E.g. if you can re-use objects, a custom allocator with pre-allocated objects would work well. Also, what OS are you on?
If you don't have concrete expectations or a certain level of performance, just start experimenting with any of the advices here and you'll find out more.
Good luck!
Roll your own non-multithreaded memory allocator, and give each thread a distinct copy of it.
(you can override new and delete)
Each copy allocates in large chunks that it then works through, and it needs no locking because each copy is owned by a single thread; see the sketch below.
limit your threads to the number of cores you have available.
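A minimal sketch of such a per-thread allocator, using a thread_local bump allocator and a class that opts in by overriding its own new/delete. It deliberately never frees individual objects or old blocks (memory comes back only when the process exits), which is a simplification a real implementation would need to address; all names and sizes are illustrative.

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    class ThreadArena {
    public:
        void* allocate(std::size_t size) {
            // Round up so every returned pointer keeps the strictest alignment.
            size = (size + alignof(std::max_align_t) - 1)
                 & ~(alignof(std::max_align_t) - 1);
            if (size > kBlockSize)
                throw std::bad_alloc();                       // sketch: small objects only
            if (!block_ || used_ + size > kBlockSize) {
                // Grab a fresh large chunk; old chunks are intentionally leaked here.
                block_ = static_cast<char*>(std::malloc(kBlockSize));
                if (!block_) throw std::bad_alloc();
                used_ = 0;
            }
            void* p = block_ + used_;
            used_ += size;
            return p;
        }

    private:
        static constexpr std::size_t kBlockSize = 1 << 20;     // 1 MiB blocks
        char*       block_ = nullptr;
        std::size_t used_  = 0;
    };

    inline ThreadArena& thisThreadArena() {
        static thread_local ThreadArena arena;                 // one arena per thread
        return arena;
    }

    // A class opts in by overriding its own operator new/delete.
    struct Node {
        int value = 0;
        static void* operator new(std::size_t size) { return thisThreadArena().allocate(size); }
        static void  operator delete(void*) noexcept {}        // arena memory is not reclaimed per object
    };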
new is pretty much blocking: it has to find the next free bit of memory, which is tricky to do if you have lots of threads all asking for it at once.
Memory allocation is slow; if you are doing it more than a few times, especially on lots of threads, then you need a redesign. Can you pre-allocate enough space at the start? Can you just allocate a big chunk with 'new' and then partition it out yourself?
You need to check your compiler documentation to see whether it makes the allocator thread-safe or not. If it does not, then you will need to overload your new operator and make it thread-safe.
Otherwise it will result in either a segfault or UB.
On some platforms like Windows, access to the global heap is serialized by the OS. Having a thread-separate heap could substantially improve allocation times.
Of course, in this case, it might be worth questioning whether or not you genuinely need heap allocation as opposed to some other form of dynamic allocation.
You may want to take a look at The Hoard Memory Allocator, which "is a drop-in replacement for malloc() that can dramatically improve application performance, especially for multithreaded programs running on multiprocessors."
The best you can hope for is roughly 8 memory allocations in parallel (since you have 8 physical cores), not 10000 as you wrote.
The standard malloc uses a mutex, and the standard STL allocator does the same. Therefore it will not speed up automatically when you introduce threading.
Nevertheless, you can use another malloc library (search for e.g. "ptmalloc") that does not use global locking. If you allocate using the STL (e.g. strings, vectors), you have to write your own allocator.
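For reference, this is roughly the minimal shape of a custom STL-compatible allocator since C++11; the malloc/free calls are placeholders for whatever non-locking or per-thread allocation scheme you actually plug in:

    #include <cstddef>
    #include <cstdlib>
    #include <new>

    // Minimal allocator skeleton: only allocate/deallocate matter, the rest is
    // boilerplate required by the containers. Replace malloc/free with your own
    // scheme (per-thread pool, arena, ptmalloc-style allocator, ...).
    template <typename T>
    struct PoolAllocator {
        using value_type = T;

        PoolAllocator() noexcept = default;
        template <typename U>
        PoolAllocator(const PoolAllocator<U>&) noexcept {}

        T* allocate(std::size_t n) {
            if (void* p = std::malloc(n * sizeof(T)))
                return static_cast<T*>(p);
            throw std::bad_alloc();
        }
        void deallocate(T* p, std::size_t) noexcept { std::free(p); }
    };

    template <typename T, typename U>
    bool operator==(const PoolAllocator<T>&, const PoolAllocator<U>&) noexcept { return true; }
    template <typename T, typename U>
    bool operator!=(const PoolAllocator<T>&, const PoolAllocator<U>&) noexcept { return false; }

    // Usage: std::vector<int, PoolAllocator<int>> v;  v.push_back(42);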
Rather interesting article: http://developers.sun.com/solaris/articles/multiproc/multiproc.html

Efficiently allocating many short-lived small objects

I've got a small class (16 bytes on a 32bit system) which I need to dynamically allocate. In most cases the life-time of any given instance is very short. Some instances may also be passed across thread boundaries.
Having done some profiling, I found that my program appears to be spending more time allocating and deallocating the things than it's actually spending using them, so I want to replace the default new and delete with something a little more efficient.
For a large object (db connections as it happens, which are expensive to construct rather than allocate), I'm already using a pooling system, however that involves a list for storing the "free" objects, and also a mutex for thread safety. Between the mutex and the list it actually performs worse than with the basic new/delete for the small objects.
I found a number of small object allocators on Google, however they seem to be using a global/static pool which is not used in a thread safe manner, making them unsuitable for my use :(
What other options have I got for efficient memory management of such small objects?
Maybe try using Google's tcmalloc? It is optimized for fast allocation/deallocation in a threaded program, and has low overhead for small objects.
Some instances may also be passed across thread boundaries
Only "some"? So perhaps you can afford to pay extra for these, if it makes the ones that don't get passed to other threads cheaper. There are various ways I can think of to get to one allocator per thread and avoid the need to lock when allocating or freeing in the thread to which the allocator belongs. I don't know which might be possible in your program:
Copy things across the thread boundary, instead of passing them.
Arrange that if they're passed to another thread for any reason, then they're passed back to the original thread to free. This doesn't necessarily have to happen very often, you could queue up a few in the receiving thread and pass them all back in a message later. This assumes of course that the thread which owns the allocator is going to stick around.
Have two free lists per allocator, one synchronised (to which objects are added when they're freed from another thread), and one unsynchronised. Only if the unsynchronised list is empty, and you're allocating (and hence in the thread which owns the allocator), do you need to lock the synchronised free list and move all of its current contents to the unsynchronised list. If objects are only rarely passed to other threads, this basically eliminates contention on the mutex and massively reduces the number of times it's taken at all (a sketch follows this list).
If all the above fails, having one allocator per thread might still allow you to get rid of the mutex and use a lock-free queue for the free list (multiple writers freeing, single reader allocating), which could improve performance a bit. Implementing a lock-free queue is platform-specific.
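A rough sketch of that two-free-list idea, assuming each pool is owned by exactly one thread and that the caller knows whether a free comes from the owning thread or a foreign one; a real implementation would also carve cells out of larger blocks instead of falling back to malloc.

    #include <cstddef>
    #include <cstdlib>
    #include <mutex>
    #include <vector>

    class SmallObjectPool {
    public:
        explicit SmallObjectPool(std::size_t cellSize) : cellSize_(cellSize) {}

        // Called only from the owning thread.
        void* allocate() {
            if (localFree_.empty())
                drainRemoteFrees();                 // lock taken only when starved
            if (!localFree_.empty()) {
                void* p = localFree_.back();
                localFree_.pop_back();
                return p;
            }
            return std::malloc(cellSize_);          // placeholder for a real block carver
        }

        // Called from the owning thread: no synchronisation needed.
        void freeLocal(void* p) { localFree_.push_back(p); }

        // Called from any other thread: goes to the synchronised list.
        void freeRemote(void* p) {
            std::lock_guard<std::mutex> lock(remoteMutex_);
            remoteFree_.push_back(p);
        }

    private:
        void drainRemoteFrees() {
            std::lock_guard<std::mutex> lock(remoteMutex_);
            localFree_.insert(localFree_.end(), remoteFree_.begin(), remoteFree_.end());
            remoteFree_.clear();
        }

        std::size_t cellSize_;
        std::vector<void*> localFree_;   // touched only by the owning thread
        std::vector<void*> remoteFree_;  // shared; protected by remoteMutex_
        std::mutex remoteMutex_;
    };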
Taking a step further back, does your app frequently hit a state in which you know that all cells allocated after a certain point (perhaps a little in the past), are no longer in use? If so, and assuming the destructor of your small objects doesn't do anything terribly urgent, then don't bother freeing cells at all - at the "certain point" create a new allocator and mark the old one as no longer in use for new allocations. When you "hit the state", free the whole allocator and its underlying buffer. If the "certain point" and the "state" are simultaneous, all the easier - just reset your allocator.
You might make sure that you are using the low fragmentation heap. It is on by default in Vista and later, but I do not think that is so with earlier OS's. That can make a big difference in allocation speed for small objects.
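On OS versions where the low fragmentation heap is not already the default, it can be requested explicitly; a brief sketch of the documented call, where 2 is the HeapCompatibilityInformation value that selects the LFH:

    #include <windows.h>

    int main() {
        // Ask for the low fragmentation heap on the process default heap.
        // On Vista and later this is already the default.
        ULONG heapType = 2;   // 2 == enable the low fragmentation heap
        HeapSetInformation(GetProcessHeap(),
                           HeapCompatibilityInformation,
                           &heapType, sizeof(heapType));
        return 0;
    }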

Can Win32 "move" heap-allocated memory?

I have a .NET/native C++ application. Currently, the C++ code allocates memory on the default heap which persists for the life of the application. Basically, functions/commands are executed in the C++ which results in allocation/modification of the current persistent memory. I am investigating an approach for cancelling one of these functions/commands mid-execution. We have hundreds of these commands, and many are very complicated (legacy) code.
The brute-force approach that I am trying to avoid is modifying each and every command/function to check for the cancellation and do all the appropriate clean-up (freeing heap memory). I am investigating a multi-threaded approach in which an additional thread receives the cancellation request and terminates the command-execution thread. I would want all dynamic memory to be allocated on a "private heap" using HeapCreate() (Win32). This way, the private heap could be destroyed by the thread handling the cancellation request. However, if the command runs to completion, I need the dynamic memory to persist. In this case, I would like to do the logical equivalent of "moving" the private heap memory to the default/process heap without incurring the cost of an actual copy. Is this in any way possible? Does this even make sense?
Alternatively, I recognize that I could just have a new private heap for every command/function execution (each will be a new thread). The private heap could be destroyed if the command is cancelled, or it would survive if the command completes. Is there any problem with the number of heaps growing indefinitely? I know there is some overhead involved with each heap. What limitations might I run into?
I am running on Windows 7 64-bit with 8GB RAM (consider this the target platform). The application I am working with is about 1 million SLOC (half C++, half C#). I am looking for any experience/suggestions with private heap management, or just alternatives to my solution.
You might be better off with separate processes instead of separate threads:
use memory mapped files (i.e. not a file at all, just cross-process shared memory)
killing a process is 'cleaner' than killing a thread
I think you can have the shared memory 'survive' the killing without a move - you map/unmap instead of move
although you might need to do some memory management on your own.
Anyhow, worth looking into. I was looking into using inter-process memory for a few other things, and it had some unusual properties (I can't recall all of it clearly, it was a while ago), and you might be able to take advantage of it.
Just an idea!
From MSDN's Heap Functions page:
"Memory allocated by HeapAlloc is not movable. The address returned by HeapAlloc is valid until the memory block is freed or reallocated; the memory block does not need to be locked."
Can you re-link the legacy apps against your own malloc() implementation? If so, you should be able to manage without modifying the rest of the code. Your custom malloc library can track allocated blocks by thread, and have a FreeAllByThreadId() function which you call after killing the legacy function's thread. You could use private heaps inside the library.
An alternative to private heaps might be doing your own allocation from memory-mapped files. See "Creating Named Shared Memory." You create the shared memory while initializing the alloc library for the legacy thread. On success, map it into the main thread so your c# can access it; on termination, close it and it is released to the system.
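One hedged sketch of what "private heaps inside the library" could look like: every thread gets its own Win32 heap, so FreeAllByThreadId() (the function named above) can release everything a terminated thread allocated with a single HeapDestroy. The map, the locking and the other names are illustrative, and freeing from a thread other than the allocator would need an extra header to find the owning heap.

    #include <windows.h>
    #include <map>
    #include <mutex>

    namespace threadalloc {

    std::mutex              g_mutex;
    std::map<DWORD, HANDLE> g_heaps;                 // thread id -> private heap

    HANDLE heapForThread(DWORD tid) {
        std::lock_guard<std::mutex> lock(g_mutex);
        auto it = g_heaps.find(tid);
        if (it != g_heaps.end())
            return it->second;
        HANDLE h = HeapCreate(0, 0, 0);              // growable private heap
        g_heaps[tid] = h;
        return h;
    }

    void* my_malloc(size_t size) {
        return HeapAlloc(heapForThread(GetCurrentThreadId()), 0, size);
    }

    void my_free(void* p) {
        // Simplification: assumes the block is freed by the thread that
        // allocated it. Otherwise, store the owning heap in a header before p.
        HeapFree(heapForThread(GetCurrentThreadId()), 0, p);
    }

    // Call after terminating the legacy command's thread: every block that
    // thread ever allocated goes away in one call.
    void FreeAllByThreadId(DWORD tid) {
        std::lock_guard<std::mutex> lock(g_mutex);
        auto it = g_heaps.find(tid);
        if (it != g_heaps.end()) {
            HeapDestroy(it->second);
            g_heaps.erase(it);
        }
    }

    } // namespace threadalloc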
A heap is a sort of big chunk of memory; it is a user-level memory manager. A heap is created by lower-level system memory calls (e.g., sbrk in Linux and VirtualAlloc in Windows). Within a heap, you can then request or return small chunks of memory with malloc/new/free/delete. By default, a process has a single heap (unlike the stack, the heap is shared by all threads), but you can have many heaps.
Is it possible to combine two heaps without copying? A heap is essentially a data structure that maintains a list of used and freed memory chunks, so a heap has to carry bookkeeping data, i.e. metadata, and of course this metadata is per heap. AFAIK, no heap manager supports a merge operation on two heaps. I have reviewed the entire source code of the malloc implementation in Linux glibc (Doug Lea's implementation), and there is no such operation. The Windows Heap* functions are implemented in a similar way. So, it is currently impossible to move or merge two separate heaps.
Is it possible to have many heaps? I don't think there should be a big problem with having many heaps. As I said before, a heap is just a data structure that keeps track of used/freed memory chunks, so there is some amount of overhead, but it is not that severe. When you look at a malloc implementation, there is malloc_state, which is the basic data structure for each heap. For example, you can create another heap with create_mspace (on Windows, HeapCreate), and then you get a new malloc state. It's not that big. So, if this trade-off (some heap overhead vs. ease of implementation) is fine, then you may go on.
If I were you, I would try the approach you describe. It makes sense to me, and having a lot of heap objects would not add much overhead.
Also, it should be noted that moving memory regions is technically impossible: pointers that pointed into the moved memory region would become dangling pointers.
P.S. Your problem looks like a transaction, specifically Software Transactional Memory. A typical implementation of STM buffers pending memory writes, and then commits them to the real system memory if the transaction has no conflict.
No. Memory cannot be moved between heaps.