Capping allocated memory in multi-threaded C++ library - c++

I've developed a library in C++ that allows multi-threaded usage. I want to support an option for the caller to specify a cap on the memory allocated by a given thread. (We can ignore the case of one thread allocating memory and others using it.)
Possibly making this more complicated is that my library uses various open source components (boost, ICU, etc), some of which are statically linked and others dynamically.
One option I've been looking into is overriding the allocation functions (new/delete/etc) to do the bookkeeping per thread ID. Natural concerns come up around the bookkeeping: performance, etc.
But an even bigger question/concern is whether this approach will work with the open source components without code changes to them?
I can't seem to find pre-existing solutions for this, though it seems to me like it's not very unusual.
Any suggestions on this approach, or another approach?
EDIT: More background: The library can allocate a significantly large range of memory per calling thread depending on the input provided (ie. KBs to GBs).
So the goal of this request is to (more graciously & deterministically) support running in RAM-constrained environments. This is not for a hard-real-time environment with strict memory limits--it's to support a number of concurrent threads which each have a "safe" allocation cap to avoid engaging the page/swap file.
Basic example use case: a system with 32GB RAM, 20GB free, the application using my library may configure itself to use a max of 10 threads and configure the library to use a max of 1GB per thread.
Upon hitting the cap the current thread's call into the library will cease further work and return a suitable error. (The code is already fully RAII so unwinding cleanly is easy.)
BTW I found some interesting content on the web already, sadly none provide a lot of hope for a "simple & effective" solution. But this one is especially insightful.

Related

Implementing a memory manager in multithreaded C/C++ with dynamically sized memory pool?

Background: I'm developing a multiplatform framework of sorts that will be used as base for both game and util/tool creation. The basic idea is to have a pool of workers, each executing in its own thread. (Furthermore, workers will also be able to spawn at runtime.) Each thread will have it's own memory manager.
I have long thought about creating my own memory management system, and I think this project will be perfect to finally give it a try. I find such a system fitting due to the types of usages of this framework will often require memory allocations in realtime (games and texture edition tools).
Problems:
No generally applicable solution(?) - The framework will be used for both games/visualization (not AAA, but indie/play) and tool/application creation. My understanding is that for game development it is usual (at least for console games) to allocate a big chunk of memory only once in the initialization, and then use this memory internally in the memory manager. But is this technique applicable in a more general application?
In a game you could theoretically know how much memory your scenes and resources will need, but for example, a photo editing application will load resources of all different sizes... So in the latter case a more dynamic memory "chunk size" would be needed? Which leads me to the next problem:
Moving already allocated data and keeping valid pointers - Normally when allocating on the heap, you will acquire a simple pointer to the memory chunk. In a custom memory manager, as far as I understand it, a similar approach is then to return a pointer to somewhere free in the pre-allocated chunk. But what happens if the pre-allocated chunk is too small and needs to be resized or even defragmentated? The data would be needed to be moved around in the memory and the old pointers would be invalid. Is there a way to transparently wrap these pointers in some way, but still use them as normally "outside" the memory management as if they were usual C++ pointers?
Third party libraries - If there is no way to transparently use a custom memory management system for all memory allocation in the application, every third party library I'm linking with, will still use the "old" OS memory allocations internally. I have learned that it is common for libraries to expose functions to set custom allocation functions that the library will use, but it is not guaranteed every library I will use will have this ability.
Questions: Is it possible and feasible to implement a memory manager that can use a dynamically sized memory chunk pool? If so, how would defragmentation and memory resize work, without breaking currently in-use pointers? And finally, how is such a system best implemented to work with third party libraries?
I'm also thankful for any related reading material, papers, articles and whatnot! :-)
As someone who has previously written many memory managers and heap implementations for AAA games for the last few generations of consoles let me tell you its simply not worth it anymore.
Your information is old - back in the gamecube era [circa 2003] we used to do what you said- allocate a large chunk and carve out that chunk manually using custom algorithms tweaked for each game.
Once virtual memory came along (xbox era), games got more complicated [and so made more allocations and became multimthreaded] address fragmentation made this untenable. So we switched to custom allocators to handle certain types of requests only - for instance physical memory, or lock free small block low fragmentation heaps or thread local cache of recently used blocks.
As built in memory managers become better it gets harder to do better than those - certainly in the general case and a close thing for a specific use cases. Doug Lea Allocator [or whatever the mainstream c++ linux compilers come with now] and the latest Windows low fragmentation heaps are really very good, and you'd do far better investing your time elsewhere.
I've got spreadsheets at work measuring all kinds of metrics for a whole load of allocators - all the big name ones and a fair few I've collected over the years. And basically whilst the specialist allocators can win on a few metrics [lowest overhead per alloc, spacial proximity, lowest fragmentation, etc] for overall metrics the mainstream ones are simply the best.
As a user of your library, my personal preferred option is you just allocate memory when you need it. Use operator new/the new operator and I can use the standard C++ mechanisms to replace those and use my custom heap (if I indeed have one), or alternatively I can use platform specific ways of replacing your allocations (e.g. XMemAlloc on Xbox). I don't need tagging [capturing callstacks is far superior which I can do if I want]. Lower down that list comes you giving me an interface that you'll call when you need to allocate memory - this is just a pain for you to implement and I'll probably just pass it onto operator new anyway. The worst thing you can do is 'know best' and create your own custom heaps. If memory allocation performance is a problem, I'd much rather you share the solution the whole game uses than roll your own.
If you're looking to write your own malloc()/free(), etc., you probably should start by checking out the source code for existing systems such as dlmalloc. This is a hard problem, though, for what it's worth. Writing your own malloc library is Hard. Beating existing general purpose malloc libraries will be Even Harder.
And now, here is the correct answer: DON'T IMPLEMENT YET ANOTHER MEMORY MANAGER.
It is incredibly hard to implement a memory manager that does not fail under different kinds of usage patterns and events. You may be able to build a specific manager that works well under YOUR usage patterns, but to write one which works well for MANY users is a full-time job that almost no one has really done well. Worse, it is fantastically easy to implement a memory manager that works great 99% of the time and then 1% of the time crash or suddenly consume most or all available memory on your system due to unexpected heap fragmentation.
I say this as someone who has written multiple memory managers, watched multiple people write their own memory managers, and watched even more people attempt to write memory managers and fail. This problem is deceptively difficult, not because it's hard to write templated allocators and generic types with inheritance and such, but because the other solutions given in this thread tend to fail under corner types of load behavior. Once you start supporting byte alignments (as all real-world allocators must) then heap fragmentation rears its ugly head. Cute heuristics that work great for small test programs, fail miserably when subjected to large, real-world programs.
And once you get it working, someone else will need: cookies to verify against memory stomps; heap usage reporting; memory pools; pools of pools; memory leak tracking and reporting; heap auditing; chunk splitting and coalescing; thread-local storage; lookasides; CPU and process-level page faulting and protection; setting and checking and clearing "free-memory" patterns aka 0xdeadbeef; and whatever else I can't think of off the top of my head.
Writing yet another memory manager falls squarely under the heading of Premature Optimization. Since there are multiple free, good, memory managers with thousands of hours of development and testing behind them, you have to justify spending the cost of your own time in such a way that the result would provide some sort of measurable improvement over what other people have done, and you can use, for free.
If you are SURE you want to implement your own memory manager (and hopefully you are NOT sure after reading this message), read through the dlmalloc sources in detail, then read through the tcmalloc sources in detail as well, THEN make sure you understand the performance trade-offs in implementing a thread-safe versus a thread-unsafe memory manager, and why the naive implementations tend to give poor performance results.
Prepare more than one solution and let the user of the framework adopt any particular one. Policy classes to the generic allocator you develop would do this nicely.
A nice way to get around this is to wrap up pointers in a class with overloaded * operator. Make the internal data of that class only an index to the memory pool. Now, you can just change the index quickly after a background thread copies the data over.
Most good C++ libraries support allocators and you should implement one. You can also overload the global new so your version gets used. And keep in mind that you generally won't need to think about a library allocating or deallocating a large amount of data, which is generally a responsibility of client code.

Troubleshooting parallel code

What tools are there for troubleshooting parallel programs?
Say I have a code that performs worse than expected (4 times instead of theoretical 8 times of serial version executing speed). I suspect the cause is either some locking caused by threads accessing shared variables (say adjacent elements of a shared vector), or locking caused by threads accessing heap (which I suppose is also a shared resource). But I don't know what tools are there available to check what might be the cause of excessive thread sleeps, threads switching, etc. E.g. profiler will tell me what function took how much time, and perhaps that there was a lot of activity related to threads management, but not what the cause and state of threads was (or perhaps I don't know how to use one well).
I'm working in C++ on OS X.
The following may be of interest
Vampir -- costs money
DTrace -- is already installed on your Mac, provides the tools you need, but is far from an out-of-the-box solution
TAU
Those are just the first three tools which spring to mind, I'm sure some more diligent googling will turn up more.
Your final comment perhaps I don't know how to use one well is well made, these tools generally require a significant commitment to use them, to understand what they are telling you and to make appropriate and performance-enhancing modifications to your programs.

Can multithreading speed up memory allocation?

I'm working with an 8 core processor, and am using Boost threads to run a large program.
Logically, the program can be split into groups, where each group is run by a thread.
Inside each group, some classes invoke the 'new' operator a total of 10000 times.
Rational Quantify shows that the 'new' memory allocation is taking up the maximum processing time when the program runs, and is slowing down the entire program.
One way I can speed up the system could be to use threads inside each 'group', so that the 10000 memory allocations can happen in parallel.
I'm unclear of how the memory allocation will be managed here. Will the OS scheduler really be able to allocate memory in parallel?
Standard CRT
While with older of Visual Studio the default CRT allocator was blocking, this is no longer true at least for Visual Studio 2010 and newer, which calls corresponding OS functions directly. The Windows heap manager was blocking until Widows XP, in XP the optional Low Fragmentation Heap is not blocking, while the default one is, and newer OSes (Vista/Win7) use LFH by default. The performance of recent (Windows 7) allocators is very good, comparable to scalable replacements listed below (you still might prefer them if targeting older platforms or when you need some other features they provide). There exist several multiple "scalable allocators", with different licenses and different drawbacks. I think on Linux the default runtime library already uses a scalable allocator (some variant of PTMalloc).
Scalable replacements
I know about:
HOARD (GNU + commercial licenses)
MicroQuill SmartHeap for SMP (commercial license)
Google Perf Tools TCMalloc (BSD license)
NedMalloc (BSD license)
JemAlloc (BSD license)
PTMalloc (GNU, no Windows port yet?)
Intel Thread Building Blocks (GNU, commercial)
You might want to check Scalable memory allocator experiences for my experiences with trying to use some of them in a Windows project.
In practice most of them work by having a per thread cache and per thread pre-allocated regions for allocations, which means that small allocations most often happen inside of a context of thread only, OS services are called only infrequently.
Dynamic allocation of memory uses the heap of the application/module/process (but not thread). The heap can only handle one allocation request at a time. If you try to allocate memory in "parallel" threads, they will be handled in due order by the heap. You will not get a behaviour like: one thread is waiting to get its memory while another can ask for some, while a third one is getting some. The threads will have to line-up in queue to get their chunk of memory.
What you would need is a pool of heaps. Use whichever heap is not busy at the moment to allocate the memory. But then, you have to watch out throughout the life of this variable such that it does not get de-allocated on another heap (that would cause a crash).
I know that Win32 API has functions such as GetProcessHeap(), CreateHeap(), HeapAlloc() and HeapFree(), that allow you to create a new heap and allocate/deallocate memory from a specific heap HANDLE. I don't know of an equivalence in other operating systems (I have looked for them, but to no avail).
You should, of course, try to avoid doing frequent dynamic allocations. But if you can't, you might consider (for portability) to create your own "heap" class (doesn't have to be a heap per se, just a very efficient allocator) that can manage a large chunk of memory and surely a smart pointer class that would hold a reference to the heap from which it came. This would enable you to use multiple heaps (make sure they are thread-safe).
There are 2 scalable drop-in replacements for malloc that I know of:
Google's tcmalloc
Facebook's jemalloc (link to a performance study comparing to tcmalloc)
I don't have any experience with Hoard (which performed poorly in the study), but Emery Berger lurks on this site and was astonished by the results. He said he would have a look and I surmise there might have been some specifics to either the test or implementation that "trapped" Hoard as the general feedback is usually good.
One word of caution with jemalloc, it can waste a bit of space when you rapidly create then discard threads (as it creates a new pool for each thread you allocate from). If your threads are stable, there should not be any issue with this.
I believe the short answer to your question is : yes, probably. And as already pointed out by several people here there are ways to achieve this.
Aside from your question and the answers already posted here, it would be good to start with your expectations on improvements, because that will pretty much tell which path to take. Maybe you need to be 100x faster. Also, do you see yourself doing speed improvements in the near future as well or is there a level which will be good enough? Not knowing your application or problem domain it's difficult to also advice you specifically. Are you for instance in a problem domain where speed continuously have to be improved?
One good thing to start off with when doing performance improvements is to question if you need to do things the way you currently do it? In this case, can you pre-allocate objects? Is there a maximum number of X objects in the system? Could you re-use objects? All of this is better, because you don't necessarily need to do allocations on the critical path. E.g. if you can re-use objects, a custom allocator with pre-allocated objects would work well. Also, what OS are you on?
If you don't have concrete expectations or a certain level of performance, just start experimenting with any of the advices here and you'll find out more.
Good luck!
Roll your own non-multi-threaded new memory allocator a distinct copy of which each thread has.
(you can override new and delete)
So it's allocating in large chunks that it works through and doesn't need any locking as each is owned by a single thread.
limit your threads to the number of cores you have available.
new is pretty much blocking, it has to find the next free bit of memory which is tricky to do if you have lots of threads all asking for that at once.
Memory allocation is slow - if you are doing it more than a few times, especially on lots of threads then you need a redesign. Can you pre-allocate enough space at the start, can you just allocate a big chunk with 'new' and then partition it out yourself?
You need to check your compiler documentation whether it makes the allocator thread safe or not. If it does not, then you will need to overload your new operator and make it thread safe.
Else it will result in either a segfault or UB.
On some platforms like Windows, access to the global heap is serialized by the OS. Having a thread-separate heap could substantially improve allocation times.
Of course, in this case, it might be worth questioning whether or not you genuinely need heap allocation as opposed to some other form of dynamic allocation.
You may want to take a look at The Hoard Memory Allocator: "is a drop-in replacement for malloc() that can dramatically improve application performance, especially for multithreaded programs running on multiprocessors."
The best what you can try to reach ~8 memory allocation in parallel (since you have 8 physical cores), not 10000 as you wrote
standard malloc uses mutex and standard STL allocator does the same. Therefore it will not speed up automatically when you introduce threading.
Nevertheless, you can use another malloc library (google for e.g. "ptmalloc") which does not use global locking. if you allocate using STL (e.g. allocate strings, vectors) you have to write your own allocator.
Rather interesting article: http://developers.sun.com/solaris/articles/multiproc/multiproc.html

On Sandboxing a memory-leaky 3rd-Party DLL

I am looking for a way to cure at least the symptoms of a leaky DLL i have to use. While the library (OpenCascade) claims to provides a memory manager, i have as of yet being unable to make it release any memory it allocated.
I would at least wish to put the calls to this module in a 'sandbox', in order to keep my application from not losing memory while the OCC-Module isn't even running any more.
My question is: While I realise that it would be an UGLY HACK (TM) to do so, is it possible to preallocate a stretch of memory to be used specifically by the libraries, or to build some kind of sandbox around it so i can track what areas of memory they used in order to release them myself when i am finished?
Or would that be to ugly a hack and I should try to resolve the issues otherwise?
The only reliable way is to separate use of the library into a dedicated process. You will start that process, pass data and parameters to it, run the library code, retrieve results. Once you decide the memory consumption is no longer tolerable you restart the process.
Using a library that isn't broken would probably be much easier, but if a replacement ins't available you could try intercepting the allocation calls. If the library isn't too badly 'optimized' (specifically function inlining) you could disassemble it and locate the malloc and free functions; on loading, you could replace every 4 (or 8 on p64 system) byte sequence that encodes that address with one that points to your own memory allocator. This is almost guaranteed to be a buggy, unreadable timesink, though, so don't do this if you can find a working replacement.
Edit:
Saw #sharptooth's answer, which has a much better chance of working. I'd still advise trying to find a replacement though.
You should ask Roman Lygin's opinion - he used to work at occ. He has at least one post that mentions memory management http://opencascade.blogspot.com/2009/06/developing-parallel-applications-with_23.html.
If you ask nicely, he might even write a post that explains mmgt's internals.

shared memory, MPI and queuing systems

My unix/windows C++ app is already parallelized using MPI: the job is splitted in N cpus and each chunk is executed in parallel, quite efficient, very good speed scaling, the job is done right.
But some of the data is repeated in each process, and for technical reasons this data cannot be easily splitted over MPI (...).
For example:
5 Gb of static data, exact same thing loaded for each process
4 Gb of data that can be distributed in MPI, the more CPUs are used, smaller this per-CPU RAM is.
On a 4 CPU job, this would mean at least a 20Gb RAM load, most of memory 'wasted', this is awful.
I'm thinking using shared memory to reduce the overall load, the "static" chunk would be loaded only once per computer.
So, main question is:
Is there any standard MPI way to share memory on a node? Some kind of readily available + free library ?
If not, I would use boost.interprocess and use MPI calls to distribute local shared memory identifiers.
The shared-memory would be read by a "local master" on each node, and shared read-only. No need for any kind of semaphore/synchronization, because it wont change.
Any performance hit or particular issues to be wary of?
(There wont be any "strings" or overly weird data structures, everything can be brought down to arrays and structure pointers)
The job will be executed in a PBS (or SGE) queuing system, in the case of a process unclean exit, I wonder if those will cleanup the node-specific shared memory.
One increasingly common approach in High Performance Computing (HPC) is hybrid MPI/OpenMP programs. I.e. you have N MPI processes, and each MPI process has M threads. This approach maps well to clusters consisting of shared memory multiprocessor nodes.
Changing to such a hierarchical parallelization scheme obviously requires some more or less invasive changes, OTOH if done properly it can increase the performance and scalability of the code in addition to reducing memory consumption for replicated data.
Depending on the MPI implementation, you may or may not be able to make MPI calls from all threads. This is specified by the required and provided arguments to the MPI_Init_Thread() function that you must call instead of MPI_Init(). Possible values are
{ MPI_THREAD_SINGLE}
Only one thread will execute.
{ MPI_THREAD_FUNNELED}
The process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are ``funneled'' to the main thread).
{ MPI_THREAD_SERIALIZED}
The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are ``serialized'').
{ MPI_THREAD_MULTIPLE}
Multiple threads may call MPI, with no restrictions.
In my experience, modern MPI implementations like Open MPI support the most flexible MPI_THREAD_MULTIPLE. If you use older MPI libraries, or some specialized architecture, you might be worse off.
Of course, you don't need to do your threading with OpenMP, that's just the most popular option in HPC. You could use e.g. the Boost threads library, the Intel TBB library, or straight pthreads or windows threads for that matter.
I haven't worked with MPI, but if it's like other IPC libraries I've seen that hide whether other threads/processes/whatever are on the same or different machines, then it won't be able to guarantee shared memory. Yes, it could handle shared memory between two nodes on the same machine, if that machine provided shared memory itself. But trying to share memory between nodes on different machines would be very difficult at best, due to the complex coherency issues raised. I'd expect it to simply be unimplemented.
In all practicality, if you need to share memory between nodes, your best bet is to do that outside MPI. i don't think you need to use boost.interprocess-style shared memory, since you aren't describing a situation where the different nodes are making fine-grained changes to the shared memory; it's either read-only or partitioned.
John's and deus's answers cover how to map in a file, which is definitely what you want to do for the 5 Gb (gigabit?) static data. The per-CPU data sounds like the same thing, and you just need to send a message to each node telling it what part of the file it should grab. The OS should take care of mapping virtual memory to physical memory to the files.
As for cleanup... I would assume it doesn't do any cleanup of shared memory, but mmaped files should be cleaned up since files are closed (which should release their memory mappings) when a process is cleaned up. I have no idea what caveats CreateFileMapping etc. have.
Actual "shared memory" (i.e. boost.interprocess) is not cleaned up when a process dies. If possible, I'd recommend trying killing a process and seeing what is left behind.
With MPI-2 you have RMA (remote memory access) via functions such as MPI_Put and MPI_Get. Using these features, if your MPI installation supports them, would certainly help you reduce the total memory consumption of your program. The cost is added complexity in coding but that's part of the fun of parallel programming. Then again, it does keep you in the domain of MPI.
MPI-3 offers shared memory windows (see e.g. MPI_Win_allocate_shared()), which allows usage of on-node shared memory without any additional dependencies.
I don't know much about unix, and I don't know what MPI is. But in Windows, what you are describing is an exact match for a file mapping object.
If this data is imbedded in your .EXE or a .DLL that it loads, then it will automatically be shared between all processes. Teardown of your process, even as a result of a crash will not cause any leaks or unreleased locks of your data. however a 9Gb .dll sounds a bit iffy. So this probably doesn't work for you.
However, you could put your data into a file, then CreateFileMapping and MapViewOfFile on it. The mapping can be readonly, and you can map all or part of the file into memory. All processes will share pages that are mapped the same underlying CreateFileMapping object. it's good practice to close unmap views and close handles, but if you don't the OS will do it for you on teardown.
Note that unless you are running x64, you won't be able to map a 5Gb file into a single view (or even a 2Gb file, 1Gb might work). But given that you are talking about having this already working, I'm guessing that you are already x64 only.
If you store your static data in a file, you can use mmap on unix to get random access to the data. Data will be paged in as and when you need access to a particular bit of the data. All that you will need to do is overlay any binary structures over the file data. This is the unix equivalent of CreateFileMapping and MapViewOfFile mentioned above.
Incidentally glibc uses mmap when one calls malloc to request more than a page of data.
I had some projects with MPI in SHUT.
As i know , there are many ways to distribute a problem using MPI, maybe you can find another solution that does not required share memory,
my project was solving an 7,000,000 equation and 7,000,000 variable
if you can explain your problem,i would try to help you
I ran into this problem in the small when I used MPI a few years ago.
I am not certain that the SGE understands memory mapped files. If you are distributing against a beowulf cluster, I suspect you're going to have coherency issues. Could you discuss a little about your multiprocessor architecture?
My draft approach would be to set up an architecture where each part of the data is owned by a defined CPU. There would be two threads: one thread being an MPI two-way talker and one thread for computing the result. Note that MPI and threads don't always play well together.