How to evaluate the quality of custom memory allocator? - c++

What characteristics should be checked when evaluating memory allocation?
Performance of allocation and de-allocation? Are simple stress-tests enough? How to check the quality of allocation?
For example, I found Oracle's test for malloc, but it's only Oracle's view of the problem. And this test is oriented only to multi-threaded performance.
How do people usually check their allocators?

Just to give more focus on the "how", rather than the "what", which the other answers seem to deal with. Here's how I would do it.
First Step - Make it possible to compare approaches
Determine what qualities you value. Make a list, prioritize and finally, make a value function.
That is, figure out which measurements are the most useful indicators of quality, in your view/case. A few good measurements could be average time to allocate a memory block, total runtime of the application (if applicable), average frame rate, total or average memory consumption ... It all depends on what you wish to achieve.
Then, create a function which, given these measurements from a test run, gives you a value which can be used as quality measure. The simplest case would be to simply decide a weight factor for each of the measurements. These weight factors should embody both the importance of each measurement and, if they use different units (such as nanoseconds for average allocation time and bytes for average memory consumption), attempt to scale them to compare fairly.
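As a minimal sketch, a weighted value function might look like the following. The struct fields, the scale factors, and the lower-is-better convention are illustrative assumptions you would tune for your own case:

```cpp
#include <vector>

// Hypothetical sketch of a value function: each measurement is scaled to a
// comparable range and weighted by importance. The fields and the
// lower-is-better convention are illustrative assumptions, not a standard.
struct Measurement {
    double value;   // raw measurement (e.g. ns per allocation, bytes used)
    double scale;   // divisor bringing the raw value into a comparable range
    double weight;  // relative importance of this measurement
};

double quality_score(const std::vector<Measurement>& ms) {
    double score = 0.0;
    for (const auto& m : ms)
        score += m.weight * (m.value / m.scale);
    return score;   // lower is better under this convention
}
```

Feeding the measurements from each test run through such a function gives you a single comparable number per allocator.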
Second Step - Devise a test scenario
This should be as close to a realistic case as possible. The best would be simply the actual code that you want to use your memory allocator for, with added code for taking all the measurements needed to compute your value function.
Third Step - Test
Write a bunch of different allocators and test them all against each other, as well as the default or without any allocator (if applicable). Measure all results, compute the value function for each and rank them according to the results. Keep in mind all the different considerations that you always need to think of when performing performance measurements.
Fourth Step - Evaluate and re-iterate
Look at how the different solutions stack up against each other. Apply some critical thinking. Do these results actually correspond to how you experienced the quality of each allocator during the tests?
If they don't, it's time to scrutinize your approach. For example, if the allocator that seemed blazing fast and gave a total runtime half a minute shorter than the rest ends up with a mediocre score, perhaps there's a bug in your measuring, or perhaps you need to re-evaluate your chosen value function... Re-iterate steps one through four until the results are clear and in accordance with your actual experience in testing them.

Usually, the performance of a memory allocator is about the speed of finding and creating a memory chunk in the heap, depending on the size of the manipulated memory blocks. And also (more recently), how it behaves in the case of multithreaded allocations. You can find interesting studies and benchmarks in the following list:
ptmalloc - a multi-thread malloc implementation
Benchmarks of the Lockless Memory Allocator
Dynamic memory allocator implementations in Linux system libraries
A Scalable Concurrent malloc(3) Implementation for FreeBSD
... probably many others ...

I guess my answer is not genius but - it depends.
If you are writing a custom memory allocator, you probably know what its characteristics should be. E.g. if you want an allocator that lets you quickly allocate a lot of small objects and you don't really care about memory usage overhead, you should probably have different tests than when you are creating an allocator for big objects and want to save as much memory as possible, even at the cost of CPU time.
Stress tests are always good because they can help you find race conditions and check whether your allocator is bug-free, but performance tests depend on what you wanted to achieve.

Here are the metrics that one should consider when optimizing/analyzing the dynamic memory allocation mechanism in the system.
Implementation overhead - how much memory it costs to keep the allocator's internal data structures operational. Additionally, whether these structures grow over time or are pre-allocated once (both approaches have pros and cons and both are valid).
Operational efficiency - how long does it take to allocate/free a memory block. Here usually the allocation is a challenge, because it is almost never a constant time and depends on characteristics of previously allocated blocks of memory. Freeing the block looks straightforward but if combined with memory de-fragmentation, deserves further attention (will not be covered here).
Thread safety has less to do with allocation itself and more with the decision to use certain solution in the system. Basically here, if you don't have threads - there is nothing to worry about. If you do have threads - make sure that your allocation will not be interrupted while at work.
Memory fragmentation - the actual layout of the allocated memory. Here two completely contradictory requirements come into play - either you allocate as soon as you find the right spot in your buffer, or you make sure you cause as little fragmentation as possible. The former is fast, while the latter is more resource-friendly (and also potentially slower).
Garbage collection - this is a separate topic and a field of study on its own; it is mentioned only for the sake of completeness. What is important to understand is that even if you don't plan on releasing allocated memory too often, GC can still be of use to help with analysis of already allocated memory, preparing internal data structures for the next efficient memory allocation. Idle CPU time is arguably the best moment to do this housekeeping task. This topic, however, is out of the scope of this question.
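To make the operational-efficiency metric concrete, here is a rough timing harness. It uses the system malloc as a stand-in for the allocator under test, and the block size and iteration count are arbitrary illustrative choices:

```cpp
#include <chrono>
#include <cstdlib>
#include <vector>

// Rough sketch of measuring operational efficiency: average wall-clock time
// per allocate/free pair, using the system malloc as a stand-in for the
// allocator under test. Block size and iteration count are arbitrary.
double avg_alloc_free_ns(std::size_t block_size, int iterations) {
    using clock = std::chrono::steady_clock;
    std::vector<void*> blocks;
    blocks.reserve(iterations);
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        blocks.push_back(std::malloc(block_size));  // allocation phase
    for (void* p : blocks)
        std::free(p);                               // deallocation phase
    const auto stop = clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / iterations;
}
```

A real benchmark would also vary the block sizes and interleave frees with allocations, since allocation cost depends heavily on the history of previous requests.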

Related

C++17 PMR:: Set number of blocks and their size in a unsynchronized_pool_resource

Is there any rule for setting, in the most effective way, the number of blocks in a chunk (max_blocks_per_chunk) and the largest required block (largest_required_pool_block) in an unsynchronized_pool_resource?
How to avoid unnecessary memory allocations?
For example, have a look at this demo.
How to reduce the number of allocations that take place as much as possible?
Pooled allocators function on a memory waste vs upstream allocator calls trade-off. Reducing one will almost always increase the other and vice-versa.
On top of that, one of the primary reasons behind their use (in my experience, at least) is to limit or outright eliminate memory fragmentation for long-running processes in memory-constrained scenarios. So it is sort of assumed that "throwing more memory at the problem" is going to be counterproductive more often than not.
Because of this, there is no universal one-size-fits-all rule here. What is preferable will invariably be dictated by the needs of your application.
Figuring out the correct values for max_blocks_per_chunk and largest_required_pool_block is ideally based on a thorough memory usage analysis so that the achieved balance benefits the application as much as possible.
However, given the wording of the question:
How to avoid unnecessary memory allocations?
How to reduce the number of allocations that take place as much as possible?
If you want to minimize upstream allocator calls as much as possible, then it's simple:
Make largest_required_pool_block the largest frequent allocation size you expect the allocator to face. A larger threshold means more allocations will qualify for pooled allocation.
Make max_blocks_per_chunk as large as you dare, up to the maximum number of concurrent allocations for any given block size. More blocks per chunk means more allocations served between requests to the upstream allocator.
The only limiting factor is how much memory footprint bloat you are willing to tolerate for your application.
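A small sketch of what that tuning looks like in code. The concrete numbers (1000 blocks per chunk, a 256-byte largest pooled block) are illustrative assumptions, not recommendations:

```cpp
#include <memory_resource>
#include <vector>

// Sketch using C++17 <memory_resource>: the pool is tuned for a workload of
// many small allocations. The concrete numbers are illustrative assumptions.
int pooled_sum(int n) {
    std::pmr::pool_options opts;
    opts.max_blocks_per_chunk = 1000;        // blocks fetched per upstream call
    opts.largest_required_pool_block = 256;  // larger requests bypass the pools
    std::pmr::unsynchronized_pool_resource pool(opts);

    std::pmr::vector<int> v(&pool);          // container allocates from the pool
    for (int i = 1; i <= n; ++i)
        v.push_back(i);
    int sum = 0;
    for (int x : v)
        sum += x;
    return sum;                              // pool memory released on destruction
}
```

Note that the implementation is allowed to treat pool_options only as a hint, so the effective values may be clamped.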

Does using heap memory (malloc/new) create a non-deterministic program?

I started developing software for real-time systems a few months ago in C for space applications, and also for microcontrollers with C++. There's a rule of thumb in such systems that one should never create heap objects (so no malloc/new), because it makes the program non-deterministic. I wasn't able to verify the correctness of this statement when people tell me that. So, is this a correct statement?
The confusion for me is that, as far as I know, determinism means that running a program twice will lead to the exact same execution path. From my understanding this is an issue with multithreaded systems, since running the same program multiple times could have different threads running in a different order every time.
In the context of realtime systems, there is more to determinism than a repeatable "execution path". Another required property is that timing of key events is bounded. In hard realtime systems, an event that occurs outside its allowed time interval (either before the start of that interval, or after the end) represents a system failure.
In this context, usage of dynamic memory allocation can cause non-determinism, particularly if the program has a varying pattern of allocating, deallocating, and reallocating. The timing of allocations, deallocations, and reallocations can vary over time, making the timing of the system as a whole unpredictable.
The comment, as stated, is incorrect.
Using a heap manager with non-deterministic behavior creates a program with non-deterministic behavior. But that is obvious.
Slightly less obvious is the existence of heap managers with deterministic behavior. Perhaps the most well-known example is the pool allocator. It has an array of N*M bytes, and an available[] mask of N bits. To allocate, it checks for the first available entry (bit test, O(N), deterministic upper bound). To deallocate, it sets the available bit (O(1)). malloc(X) will round X up to the next pool size M to choose the right pool.
This might not be very efficient, especially if your choices of N and M are too high. And if you choose too low, your program can fail. But the limits for N and M can be lower than for an equivalent program without dynamic memory allocation.
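A minimal sketch of such a deterministic pool, using a plain boolean array rather than a packed bitmask for clarity (a real implementation would also pick among multiple pools by size, as described above):

```cpp
#include <cstddef>

// Minimal sketch of the deterministic pool described above: N blocks of M
// bytes each and an availability mask. allocate() scans at most N entries
// (a deterministic upper bound); deallocate() is O(1).
template <std::size_t N, std::size_t M>
class FixedPool {
    alignas(std::max_align_t) unsigned char storage_[N * M];
    bool available_[N];
public:
    FixedPool() {
        for (std::size_t i = 0; i < N; ++i) available_[i] = true;
    }

    void* allocate() {
        for (std::size_t i = 0; i < N; ++i) {  // bounded scan
            if (available_[i]) {
                available_[i] = false;
                return storage_ + i * M;
            }
        }
        return nullptr;                        // exhausted: fails deterministically
    }

    void deallocate(void* p) {
        const std::size_t i = static_cast<std::size_t>(
            static_cast<unsigned char*>(p) - storage_) / M;
        available_[i] = true;                  // O(1)
    }
};
```

Both the success and failure paths have a known worst-case cost, which is exactly the property hard real-time systems need.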
Nothing in the C11 standard or in n1570 says that malloc is deterministic (or that it is not); nor does other documentation like malloc(3) on Linux. BTW, many malloc implementations are free software.
But malloc can (and does) fail, and its performance is not known (a typical call to malloc on my desktop practically takes less than a microsecond, but I could imagine weird situations where it might take much more, perhaps many milliseconds on a very loaded computer; read about thrashing). And my Linux desktop has ASLR (address space layout randomization), so running the same program twice gives different malloc-ed addresses (in the virtual address space of the process). BTW, here is a deterministic (under specific assumptions that you need to elaborate) but practically useless malloc implementation.
determinism means that running a program twice will lead to the exact, same execution path
This is practically wrong in most embedded systems, because the physical environment is changing; for example, the software driving a rocket engine cannot expect that the thrust, or the drag, or the wind speed, etc... is exactly the same from one launch to the next one.
(So I am surprised that you believe or wish that real-time systems are deterministic; they never are! Perhaps you care about WCET, worst-case execution time, which is increasingly difficult to predict because of caches.)
BTW some "real-time" or "embedded" systems are implementing their own malloc (or some variant of it). C++ programs can have their allocator-s, usable by standard containers. See also this and that, etc, etc.....
And high-level layers of embedded software (think of an autonomous automobile and its planning software) are certainly using heap allocation and perhaps even garbage collection techniques (some of which are "real-time"), but are generally not considered safety critical.
tl;dr: It's not that dynamic memory allocation is inherently non-deterministic (as you defined it in terms of identical execution paths); it's that it generally makes your program unpredictable. Specifically, you can't predict whether the allocator might fail in the face of an arbitrary sequence of inputs.
You could have a non-deterministic allocator. This is actually common outside of your real-time world, where operating systems use things like address layout randomization. Of course, that would make your program non-deterministic.
But that's not an interesting case, so let's assume a perfectly deterministic allocator: the same sequence of allocations and deallocations will always result in the same blocks in the same locations and those allocations and deallocations will always have a bounded running time.
Now your program can be deterministic: the same set of inputs will lead to exactly the same execution path.
The problem is that if you're allocating and freeing memory in response to inputs, you can't predict whether an allocation will ever fail (and failure is not an option).
First, your program could leak memory. So if it needs to run indefinitely, eventually an allocation will fail.
But even if you can prove there are no leaks, you would need to know that there's never an input sequence that could demand more memory than is available.
But even if you can prove that the program will never need more memory than is available, the allocator might, depending on the sequence of allocations and frees, fragment memory and thus eventually be unable to find a contiguous block to satisfy an allocation, even though there is enough free memory overall.
It's very difficult to prove that there's no sequence of inputs that will lead to pathological fragmentation.
You can design allocators to guarantee there won't be fragmentation (e.g., by allocating blocks of only one size), but that puts a substantial constraint on the caller and possibly increases the amount of memory required due to the waste. And the caller must still prove that there are no leaks and that there is a satisfiable upper bound on the total memory required regardless of the sequence of inputs. This burden is so high that it's actually simpler to design the system so that it doesn't use dynamic memory allocation.
The deal with real-time systems is that the program must strictly meet certain computation and memory restrictions regardless of the execution path taken (which may still vary considerably depending on input). So what does the use of generic dynamic memory allocation (such as malloc/new) mean in this context? It means that at some point the developer is not able to determine the exact memory consumption, making it impossible to tell whether the resulting program will be able to meet the requirements, both for memory and for computation power.
Yes it is correct. For the kind of applications you mention, everything that can occur must be specified in detail. The program must handle the worst-case scenario according to specification and set aside exactly that much memory, no more, no less. The situation where "we don't know how many inputs we get" does not exist. The worst-case scenario is specified with fixed numbers.
Your program must be deterministic in a sense that it can handle everything up to the worst-case scenario.
The very purpose of the heap is to allow several unrelated applications to share RAM memory, such as in a PC, where the amount of programs/processes/threads running isn't deterministic. This scenario does not exist in a real-time system.
In addition, the heap is non-deterministic in its nature, as segments get added or removed over time.
More info here: https://electronics.stackexchange.com/a/171581/6102
Even if your heap allocator has repeatable behavior (the same sequence of allocation and free calls yield the same sequence of blocks, hence (hopefully) the same internal heap state), the state of the heap may vary drastically if the sequence of calls is changed, potentially leading to fragmentation that will cause memory allocation failures in an unpredictable way.
The reason heap allocation is frowned upon or downright forbidden in embedded systems, especially mission-critical systems such as aircraft or spacecraft guidance or life-support systems, is that there is no way to test all possible variations in the sequence of malloc/free calls that can happen in response to intrinsically asynchronous events.
The solution is for each handler to have its own memory set aside for its purpose; then it no longer matters (at least as far as memory use is concerned) in what order these handlers are invoked.
The problem with using the heap in hard real-time software is that heap allocations can fail. What do you do when you run out of heap?
You are talking about space applications. You have pretty hard no-fail requirements. You must have no possibility of leaking memory to the point where there is not enough for at least the safe-mode code to run. You must not fall over. You must not throw exceptions that have no catch block. You probably don't have an OS with protected memory, so one crashing application can in theory take out everything.
You probably don't want to use heap at all. The benefits don't outweigh the whole-program costs.
Non-deterministic normally means something else, but in this case the best reading is that they want the entire program behavior to be completely predictable.
Let me introduce the Integrity RTOS from GHS:
https://www.ghs.com/products/rtos/integrity.html
and LynxOS:
http://www.lynx.com/products/real-time-operating-systems/lynxos-178-rtos-for-do-178b-software-certification/
LynxOS and the Integrity RTOS are among the software used in space applications, missiles, aircraft, etc., as many others are not approved or certified by the authorities (e.g., the FAA).
https://www.ghs.com/news/230210r.html
To meet the stringent criteria of space applications, the Integrity RTOS actually provides formal verification, i.e., mathematically proven logic, that its software behaves according to its specification.
Among these criteria, to quote from here:
https://en.wikipedia.org/wiki/Integrity_(operating_system)
and here:
Green Hills Integrity Dynamic memory allocation
is this:
I am not a specialist in formal methods, but perhaps one of the requirements for this verification is to remove the uncertainties in the timing required for memory allocation. In an RTOS, all events are precisely planned, milliseconds apart from each other. And dynamic memory allocation always has a problem with the timing required.
Mathematically, you really need to prove that everything works from the most fundamental assumptions about timing and amount of memory.
Then think of the alternative to heap memory: static memory. The address is fixed, the allocated size is fixed, the position in memory is fixed. So it is very easy to reason about memory sufficiency, reliability, availability, etc.
Short answer
There are effects on the data values, or on their statistical uncertainty distributions, of e.g. a first- or second-level trigger scintillator device, that can derive from the non-reproducible amount of time you may have to wait for malloc/free.
The worst aspect is that these effects are related neither to the physical phenomenon nor to the hardware, but somehow to the state of the memory (and its history).
Your goal, in that case, is to reconstruct the original sequence of events from the data affected by those errors. The reconstructed/guessed sequence will be affected by errors too. This iteration will not always converge on a stable solution, and there is no guarantee it will be the correct one; your data are no longer independent... You risk a logical short-circuit...
Longer answer
You stated "I wasn't able to verify the correctness of this statement when people tell me that".
I will try to give you a purely hypothetical situation/ case study.
Let's imagine you are dealing with a CCD or with some first- and second-level scintillator triggers on a system that has to economize resources (you're in space).
The acquisition rate will be set so that the background will be at x% of the MAXBINCOUNT.
There's a burst, you have a spike in the counts and an overflow in the bin counter.
You want it all: you switch to the maximum acquisition rate and you fill up your buffer.
You go off to free/allocate more memory, and meanwhile you exhaust the extra buffer.
What will you do?
Will you keep the counter active, risking overflow (the second level will try to properly count the timing of the data packages), in which case you will underestimate the counts for that period?
Or will you stop the counter, introducing a hole in the time series?
Note that:
While waiting for the allocation, you will lose the transient (or at least its beginning).
Whatever you do it depends on the state of your memory and it is not reproducible.
Now suppose instead that the signal varies around maxbincount at the maximum acquisition rate allowed by your hardware, and the event is longer than usual.
You run out of space and ask for more... meanwhile, you run into the same problem as above.
Overflow and systematic underestimation of the peak counts, or holes in the time series?
Let's move to the second level (it can happen on the first-level trigger too).
From your hardware, you receive more data than you can store or transmit.
You have to cluster the data in time or space (2x2, 4x4, ... 16x16 ... 256x256... pixel scaling...).
The uncertainty from the previous problem may affect the error distribution.
There are CCD settings for which you have the pixels at the border with counts close to maxbincount (it depends on "where" you want to see better).
Now you can have a shower on your CCD, or a single big spot with the same total number of counts but with a different statistical uncertainty (the part introduced by the waiting time)...
So, for example, where you are expecting a Lorentzian profile, you can obtain its convolution with a Gaussian (a Voigt profile), or, if the second is really dominant, a dirty Gaussian...
There is always a trade-off. It's the program's running environment and the tasks it performs that should be the basis for deciding whether the heap should be used or not.
A heap object is efficient when you want to share data between multiple function calls; you just need to pass the pointer, since the heap is globally accessible. There are disadvantages as well: some function might free this memory while references to it still exist elsewhere.
If heap memory is not freed after its work is done and the program keeps allocating more, at some point the heap will run out of memory, which affects the deterministic character of the program.

Most efficient way to grow array C++

Apologies if this has been asked before, I can't find a question that fully answers what I want to know. They mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take. Therefore I don't know how long my time arrays will be. This will have a maximum time of 100,000s, but the time step could be as small as .001, so it could be as many as 1e8 doubles in length in the worst case (not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and its size-increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well
If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that the append is amortized O(1).
Unless you are running on something very weird, the OS memory allocation should take care of most fragmentation issues, and of the difficulty of finding an 800 MB free memory block.
As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
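A small sketch of that reserve-then-push_back pattern (the function name and step counts are illustrative; the loop body stands in for the real time-stepping):

```cpp
#include <cstddef>
#include <vector>

// Sketch of the reserve + push_back pattern: capacity for the worst case is
// requested once up front, but pages are only touched as elements are
// appended. The loop is a placeholder for the real simulation.
std::vector<double> record_times(std::size_t max_steps, std::size_t actual_steps) {
    std::vector<double> times;
    times.reserve(max_steps);        // one allocation, memory not yet touched
    for (std::size_t i = 0; i < actual_steps; ++i)
        times.push_back(i * 0.001);  // pages are faulted in only as needed
    return times;
}
```

Using resize(max_steps) or the fill constructor instead would value-initialize (and therefore touch) every element immediately, defeating the page-on-demand behavior.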
Yet I tend to avoid this solution for the most part at these kinds of input scales, for a few simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia. The third might just be plain dumb. Yet at least on operating systems like Windows, when using a debug build, the memory is initialized in its entirety early, so reserving capacity for such a vector maps the allocated pages to DRAM immediately. We can then end up with a slight startup delay and a task manager showing 800 megabytes of memory usage for the debug build even before we've done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
With these chunks, we can aggregate them together either by linking them (forming an unrolled list) or by storing a vector of pointers to them in a separate aggregate, depending on whether random access is needed or merely sequential access will suffice. For the random-access case, this incurs a slight overhead, yet one I've tended to find relatively small at these input scales, where times are often dominated by the upper levels of the memory hierarchy rather than the register and instruction level.
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.
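A minimal sketch of the vector-of-pointers variant of that chunky structure (the class name is made up; the 512-double chunk size is the illustrative value from above, and per-chunk metadata is omitted):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Sketch of the chunked approach: fixed-size chunks of doubles aggregated
// through a vector of pointers, giving O(1) random access without one huge
// contiguous allocation. New chunks are allocated only on demand.
class ChunkedDoubles {
    static constexpr std::size_t kChunk = 512;
    std::vector<std::unique_ptr<double[]>> chunks_;
    std::size_t size_ = 0;
public:
    void push_back(double v) {
        if (size_ % kChunk == 0)   // current chunk full (or no chunk yet)
            chunks_.push_back(std::make_unique<double[]>(kChunk));
        chunks_[size_ / kChunk][size_ % kChunk] = v;
        ++size_;
    }
    double operator[](std::size_t i) const {
        return chunks_[i / kChunk][i % kChunk];  // two indexings per access
    }
    std::size_t size() const { return size_; }
};
```

Growing never copies existing elements, and release/debug behavior stays close since memory is only ever touched as it is filled.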
The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.

Which memory allocation algorithm suits best for performance and time critical c++ applications?

I ask this question to determine which memory allocation algorithm gives better results for performance-critical applications, like game engines, or embedded applications. Results actually depend on the percentage of memory fragmented and the time-determinism of memory requests.
There are several algorithms in the textbooks (e.g. buddy memory allocation), but there are also others, like TLSF. Therefore, regarding the memory allocation algorithms available, which one of them is fastest and causes the least fragmentation? BTW, garbage collectors should not be included.
Please also note that this question is not about profiling; it just aims to find the optimum algorithm for the given requirements.
It all depends on the application. Server applications which can clear out all memory relating to a particular request at defined moments will have a different memory access pattern than video games, for instance.
If there was one memory allocation algorithm that was always best for performance and fragmentation, wouldn't the people implementing malloc and new always choose that algorithm?
Nowadays, it's usually best to assume that the people who wrote your operating system and runtime libraries weren't brain dead; and unless you have some unusual memory access pattern don't try to beat them.
Instead, try to reduce the number of allocations (or reallocations) you make. For instance, I often use a std::vector, but if I know ahead of time how many elements it will have, I can reserve that all in one go. This is much more efficient than letting it grow "naturally" through several calls to push_back().
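The reserve pattern just described, as a tiny sketch (function name is made up):

```cpp
#include <cstddef>
#include <vector>

// When the element count is known in advance, reserve once so the loop
// performs no reallocation-and-copy cycles as the vector grows.
std::vector<int> build_known_size(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);                          // single allocation up front
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i));  // no reallocations up to n
    return v;
}
```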
Many people coming from languages where new just means "gimme an object" will allocate things for no good reason. If you don't have to put it on the heap, don't call new.
As for fragmentation: it still depends. Unfortunately I can't find the link now, but I remember a blog post from somebody at Microsoft who had worked on a C++ server application that suffered from memory fragmentation. The team solved the problem by allocating memory from two regions. Memory for all requests would come from region A until it was full (requests would free memory as normal). When region A was full, all memory would be allocated from region B. By the time region B was full, region A was completely empty again. This solved their fragmentation problem.
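A hypothetical sketch of that two-region scheme (the class name is made up, bump-pointer allocation stands in for a real allocator, and the scheme's assumption that a region has been fully freed by the time allocation switches back to it is simply taken on faith here):

```cpp
#include <cstddef>

// Sketch of the two-region scheme: all requests come from the active region;
// when it fills, allocation switches to the other region, which (per the
// scheme's assumption) has been fully freed by then and is reset.
// Alignment and individual frees are ignored for brevity.
template <std::size_t Size>
class TwoRegionArena {
    unsigned char regions_[2][Size];
    std::size_t used_[2] = {0, 0};
    int active_ = 0;
public:
    void* allocate(std::size_t n) {
        if (n > Size) return nullptr;        // request can never fit
        if (used_[active_] + n > Size) {     // active region full: switch
            active_ = 1 - active_;
            used_[active_] = 0;              // assumed drained by now
        }
        void* p = regions_[active_] + used_[active_];
        used_[active_] += n;
        return p;
    }
};
```

Because each region is always emptied wholesale, free space never gets interleaved with live allocations, which is what eliminated their fragmentation.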
Will it solve yours? I have no idea. Are you working on a project which services several independent requests? Are you working on a game?
As for determinism: it still depends. What is your deadline? What happens when you miss the deadline (astronauts lost in space? the music being played back starts to sound like garbage?)? There are real time allocators, but remember: "real time" means "makes a promise about meeting a deadline," not necessarily "fast."
I did just come across a post describing various things Facebook has done to both speed up and reduce fragmentation in jemalloc. You may find that discussion interesting.
Barış:
Your question is very general, but here's my answer/guidance:
I don't know about game engines, but for embedded and real-time applications, the general goals of an allocation algorithm are:
1- Bounded execution time: You have to know in advance the worst case allocation time so you can plan your real time tasks accordingly.
2- Fast execution: Well, the faster the better, obviously
3- Always allocate: Especially for real-time, security critical applications, all requests must be satisfied. If you request some memory space and get a null pointer: trouble!
4- Reduce fragmentation: Although this depends on the algorithm used, generally, less fragmented allocations provide better performance, due to a number of reasons, including caching effects.
In most critical systems, you are not allowed to dynamically allocate any memory to begin with. You analyze your requirements, determine your maximum memory use, and allocate a large chunk of memory as soon as your application starts. If you can't, the application does not even start; if it does start, no new memory blocks are allocated during execution.
If speed is a concern, I'd recommend following a similar approach. You can implement a memory pool which manages your memory. The pool could initialize a "sufficient" block of memory at the start of your application and serve your memory requests from this block. If you require more memory, the pool can do another -probably large- allocation (in anticipation of more memory requests), and your application can start using this newly allocated memory. There are various memory pooling schemes around as well, and managing these pools is a whole topic of its own.
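A minimal sketch of such a pool, assuming single-threaded use (the class name and parameters are illustrative, not from any real library): it grabs a large chunk up front and bumps a pointer for each request, grabbing another chunk only when the current one runs out.

```cpp
#include <cstddef>
#include <new>
#include <vector>

class ChunkPool {
public:
    explicit ChunkPool(std::size_t chunkSize = 1 << 20)
        : chunkSize_(chunkSize) { addChunk(chunkSize_); }
    ~ChunkPool() { for (char* c : chunks_) ::operator delete(c); }

    void* allocate(std::size_t n) {
        // round the request up so every returned pointer is suitably aligned
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) & ~(a - 1);
        if (used_ + n > cap_)                           // current chunk exhausted:
            addChunk(n > chunkSize_ ? n : chunkSize_);  // do another large allocation
        void* p = chunks_.back() + used_;
        used_ += n;
        return p;
    }

private:
    void addChunk(std::size_t size) {
        chunks_.push_back(static_cast<char*>(::operator new(size)));
        cap_ = size;
        used_ = 0;
    }
    std::vector<char*> chunks_;
    std::size_t chunkSize_, cap_ = 0, used_ = 0;
};
```

Note that this sketch never returns individual blocks; everything is released when the pool is destroyed, which is exactly the trade-off pooling schemes differ on.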
As for some examples: VxWorks RTOS used to employ a first-fit allocation algorithm where the algorithm analyzed a linked list to find a big enough free block. In VxWorks 6, they're using a best-fit algorithm, where the free space is kept in a tree and allocations traverse the tree for a big enough free block. There's a white paper titled Memory Allocation in VxWorks 6.0, by Zoltan Laszlo, which you can find by Googling, that has more detail.
Going back to your question about speed/fragmentation: It really depends on your application. Things to consider are:
Are you going to make lots of very small allocations, or relatively larger ones?
Will the allocations come in bursts, or spread equally throughout the application?
What is the lifetime of the allocations?
If you're asking this question because you're going to implement your own allocator, you should probably design it in such a way that you can change the underlying allocation/deallocation algorithm, because if the speed/fragmentation is really that critical in your application, you're going to want to experiment with different allocators. If I were to recommend something without knowing any of your requirements, I'd start with TLSF, since it has good overall characteristics.
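The "design it so you can change the underlying algorithm" advice might look like this minimal sketch, where the strategy hides behind an interface (the names are hypothetical, and the baseline just forwards to malloc/free so other strategies can be benchmarked against it):

```cpp
#include <cstddef>
#include <cstdlib>

// Interface behind which first-fit, best-fit, TLSF, ... can be swapped.
struct Allocator {
    virtual ~Allocator() = default;
    virtual void* allocate(std::size_t n) = 0;
    virtual void deallocate(void* p) = 0;
};

// Trivial baseline strategy: forward everything to the CRT allocator.
struct MallocAllocator : Allocator {
    void* allocate(std::size_t n) override { return std::malloc(n); }
    void deallocate(void* p) override { std::free(p); }
};
```

Application code takes an `Allocator&`, so swapping the experiment is a one-line change at the call site (the virtual-call overhead is the price of that flexibility; a template parameter avoids it if needed).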
As others have already written, there is no "optimum algorithm" for every possible application. It has been proven that for any possible algorithm you can find an allocation sequence which will cause fragmentation.
Below are a few hints from my game development experience:
Avoid allocations if you can
A common practice in the game development field was (and to a certain extent still is) to solve dynamic memory allocation performance issues by avoiding memory allocations like the plague. It is quite often possible to use stack-based memory instead - even for dynamic arrays you can often come up with an estimate which will cover 99% of cases for you, and you need to allocate only when you exceed this boundary. Another commonly used approach is "preallocation": estimate how much memory you will need in some function or for some object, create a kind of small and simplistic "local heap" you allocate up front, and perform the individual allocations from this heap only.
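The "estimate covers 99% of cases" idea can be sketched as a buffer that lives on the stack up to some size N and falls back to the heap only beyond it (names are illustrative; `std::pmr::monotonic_buffer_resource` offers a production-grade variant of this pattern):

```cpp
#include <cstddef>
#include <new>

template <std::size_t N>
class SmallBuffer {
public:
    explicit SmallBuffer(std::size_t n) {
        // stay on the stack when the estimate N covers the request
        data_ = (n <= N) ? local_ : static_cast<char*>(::operator new(n));
    }
    ~SmallBuffer() {
        if (data_ != local_) ::operator delete(data_); // heap fallback only
    }
    char* data() { return data_; }
    bool onStack() const { return data_ == local_; }

private:
    char  local_[N];  // the preallocated "99% of cases" storage
    char* data_;
};
```

In the common case no allocator is touched at all; only the rare oversized request pays for a heap allocation.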
Memory allocator libraries
Another option is to use one of the memory allocation libraries - they are usually created by experts in the field to fit some special requirements, and if you have similar requirements, they may fit your needs.
Multithreading
There is one particular case in which you will find the "default" OS/CRT allocator performs badly, and that is multithreading. If you are targeting Windows, be aware that both the OS and CRT allocators provided by Microsoft (including the otherwise excellent Low Fragmentation Heap) are currently blocking. If you want to perform significant threading, you need either to reduce allocation as much as possible, or to use one of the alternatives. See Can multithreading speed up memory allocation?
The best practice is: use whatever gets the job done in time (in your case - the default allocator). If the whole thing is very complex - write tests and samples that will emulate parts of the whole thing. Then, run performance tests and benchmarks to find bottlenecks (they will probably have nothing to do with memory allocation :).
From this point you will see what exactly slows down your code and why. Only with such precise knowledge can you ever optimize something and choose one algorithm over another. Without tests it's just a waste of time since you can't even measure how much your optimization will speed up your app (in fact such "premature" optimizations can actually slow it down).
Memory allocation is a very complex thing and it really depends on many factors. For example, the following allocator is simple and damn fast but can be used only in a limited number of situations:
#include <cstddef> // for size_t

char pool[MAX_MEMORY_REQUIRED_TO_RENDER_FRAME];
char *poolHead = pool;

void *alloc(size_t sz) { char *p = poolHead; poolHead += sz; return p; }
void free() { poolHead = pool; }
So there is no "the best algorithm ever".
One constraint that's worth mentioning, which has not been mentioned yet, is multi-threading: Standard allocators must be implemented to support several threads, all allocating/deallocating concurrently, and passing objects from one thread to another so that it gets deallocated by a different thread.
As you may have guessed from that description, it is a tricky task to implement an allocator that handles all of this well. And it does cost performance, as it is impossible to satisfy all these constraints without inter-thread communication (= use of atomic variables and locks), which is quite costly.
As such, if you can avoid concurrency in your allocations, you stand a good chance to implement your own allocator that significantly outperforms the standard allocators: I once did this myself, and it saved me roughly 250 CPU cycles per allocation with a fairly simple allocator that's based on a number of fixed-size memory pools for small objects, stacking free objects with an intrusive linked list.
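A pool of the kind described above might be sketched like this (single-threaded by design, names illustrative): fixed-size slots, with the free slots threaded into an intrusive singly linked list, i.e. the "next" link is stored inside each free slot itself, so there is no per-object bookkeeping overhead.

```cpp
#include <cstddef>
#include <new>

class FixedPool {
public:
    FixedPool(std::size_t slotSize, std::size_t slotCount)
        : slotSize_(slotSize < sizeof(void*) ? sizeof(void*) : slotSize) {
        storage_ = static_cast<char*>(::operator new(slotSize_ * slotCount));
        for (std::size_t i = 0; i < slotCount; ++i)
            push(storage_ + i * slotSize_);   // thread every slot onto the free list
    }
    ~FixedPool() { ::operator delete(storage_); }

    void* allocate() {                        // pop the free-list head: O(1), no locks
        void* p = head_;
        if (p) head_ = *static_cast<void**>(p);
        return p;                             // nullptr when the pool is exhausted
    }
    void deallocate(void* p) { push(p); }     // push back onto the free list: O(1)

private:
    void push(void* p) { *static_cast<void**>(p) = head_; head_ = p; }
    char* storage_;
    std::size_t slotSize_;
    void* head_ = nullptr;
};
```

Both operations are a couple of pointer moves with no atomics, which is where savings on the order described above come from; the price is that this pool must never be shared between threads without external synchronization.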
Of course, avoiding concurrency is likely a no-go for you, but if you don't use it anyway, exploiting that fact might be something worth thinking about.

Memory management while loading huge XML files

We have an application which imports objects from an XML file. The XML is around 15 GB. The application invariably starts running out of memory. We tried to free memory in between operations but this has led to degraded performance, i.e. it takes more time to complete the import operation, and CPU utilization reaches 100%.
The application is written in C++.
Do frequent calls to free() lead to performance issues?
Promoted from a comment by the OP: the parser being used is expat, which is a SAX parser with a very small footprint and customisable memory management.
Use a SAX parser instead of a DOM parser.
Have you tried reusing the memory and your classes as opposed to freeing and reallocating them? Constant allocation/deallocation cycles, especially if they are coupled with small (less than 4096 bytes) data fragments, can lead to serious performance problems and memory address space fragmentation.
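The reuse idea can be sketched as a recycler that keeps finished objects on a free list and hands them back out, so the allocator is only hit when the list is empty (`Record` is a made-up stand-in for whatever the XML import builds):

```cpp
#include <string>
#include <vector>

struct Record {
    std::string name;
    std::vector<int> values;
};

class RecordRecycler {
public:
    Record* acquire() {
        if (free_.empty()) return new Record;  // allocate only when nothing to reuse
        Record* r = free_.back();
        free_.pop_back();
        r->name.clear();                       // reset contents, but keep the
        r->values.clear();                     // already-allocated capacity
        return r;
    }
    void release(Record* r) { free_.push_back(r); }
    ~RecordRecycler() { for (Record* r : free_) delete r; }

private:
    std::vector<Record*> free_;
};
```

Because `clear()` keeps the strings' and vectors' capacity, a recycled object usually serves the next parse step with no heap traffic at all.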
Profile the application during one of these bothersome loads, to see where it is spending most of its time.
I believe that free() can sometimes be costly, but that is of course very much dependent on the platform's implementation.
Also, you don't say a lot about the lifetime of the objects loaded; if the XML is 15 GB, how much of that is kept around for each "object", once the markup is parsed and thrown away?
It sounds sensible to process an input document of this size in a streaming fashion, i.e. not trying a DOM-approach which loads and builds the entire XML parse tree at once.
If you want to minimise your memory usage, take a look at How to read the XML data from a file by using Visual C++.
One thing that often helps is to use a lightweight low-overhead memory pool. If you combine this with "frame" allocation methods (ignoring any delete/free until you're all done with the data), you can get something that's ridiculously fast.
We did this for an embedded system recently, mostly for performance reasons, but it saved a lot of memory as well.
The trick was basically to allocate a big block -- slightly bigger than we'd need (you could allocate a chain of blocks if you like) -- and just keep returning a "current" pointer (bumping it up by allocSize, rounded up to maximum align requirement of 4 in our case, each time). This cut our overhead per alloc from on the order of 52-60 bytes down to <= 3 bytes. We also ignored "free" calls until we were all done parsing and then freed the whole block.
If you're clever enough with your frame allocation you can save a lot of space and time. It might not get you all the way to your 15GiB, but it would be worth looking at how much space overhead you really have... My experience with DOM-based systems is that they use tons of small allocs, each with a relatively high overhead.
(If you have virtual memory, a large "block" might not even hurt that much, if your access at any given time is local to a page or three anyway...)
Obviously you have to keep the memory you actually need in the long run, but the parser's "scratch memory" becomes a lot more efficient this way.
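The bump-and-align trick described above can be sketched as follows, assuming a preallocated block (the class name is illustrative): each allocation just bumps an offset, rounded up to the alignment requirement (4 in the answer's case), and "free" is ignored until the whole block is released at once.

```cpp
#include <cstddef>

class FrameArena {
public:
    FrameArena(char* block, std::size_t size) : block_(block), size_(size) {}

    void* alloc(std::size_t n, std::size_t align = 4) {
        std::size_t off = (offset_ + align - 1) & ~(align - 1); // round up
        if (off + n > size_) return nullptr;                    // block exhausted
        offset_ = off + n;                                      // bump the pointer
        return block_ + off;
    }

    void releaseAll() { offset_ = 0; }  // the only "free": reset everything at once

private:
    char* block_;
    std::size_t size_;
    std::size_t offset_ = 0;
};
```

With no per-allocation header, the only overhead is the alignment padding, which is where the "52-60 bytes down to <= 3 bytes" kind of saving comes from.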
We tried to free memory in between operations but this has led to degraded performance.
Do frequent calls to free() lead to performance issues?
Based on the evidence supplied, yes.
Since you're already using expat, a SAX parser, what exactly are you freeing? If you can free it, why are you mallocing it in a loop in the first place?
Maybe it should say profiler.
Also don't forget that work with the heap is single-threaded: if both of your threads allocate/free memory at the same time, one of them will wait until the first is done.
If you are allocating and freeing memory for the same objects, you could create a pool of these objects and do the allocation/freeing only once.
Try and find a way to profile your code.
If you have no profiler, try and organize your work so that you only have a few free() commands (instead of the many you suggest).
It is very common to have to find the right balance between memory consumption and time efficiency.
I did not try it myself, but have you heard of XMLLite? There's an MSDN article introducing it. It's used by MS Office internally.