Does using heap memory (malloc/new) create a non-deterministic program? - c++

I started developing software for real-time systems a few months ago in C for space applications, and also for microcontrollers with C++. There's a rule of thumb in such systems that one should never create heap objects (so no malloc/new), because it makes the program non-deterministic. I wasn't able to verify the correctness of this statement when people told me that. So, is this a correct statement?
The confusion for me is that, as far as I know, determinism means that running a program twice will lead to the exact same execution path. From my understanding this is an issue with multithreaded systems, since running the same program multiple times could have different threads running in a different order every time.

In the context of realtime systems, there is more to determinism than a repeatable "execution path". Another required property is that timing of key events is bounded. In hard realtime systems, an event that occurs outside its allowed time interval (either before the start of that interval, or after the end) represents a system failure.
In this context, the use of dynamic memory allocation can cause non-determinism, particularly if the program has a varying pattern of allocating, deallocating, and reallocating. The timing of allocations, deallocations, and reallocations can vary over time, making the timing of the system as a whole unpredictable.

The comment, as stated, is incorrect.
Using a heap manager with non-deterministic behavior creates a program with non-deterministic behavior. But that is obvious.
Slightly less obvious is the existence of heap managers with deterministic behavior. Perhaps the most well-known example is the pool allocator. It has an array of N*M bytes, and an available[] mask of N bits. To allocate, it checks for the first available entry (bit test, O(N), deterministic upper bound). To deallocate, it sets the available bit (O(1)). malloc(X) will round up X to the next biggest value of M to choose the right pool.
This might not be very efficient, especially if your choices of N and M are too high, and if you choose them too low, your program can fail. Still, the limits for N and M can be lower than for an equivalent program without dynamic memory allocation.
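A minimal sketch of such a pool allocator, assuming a single pool of N blocks of M bytes (a bool array stands in for the bit mask, and alignment, bounds checks and the multi-pool malloc(X) rounding are left out):

#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 64;                  // number of blocks (example value)
constexpr std::size_t M = 32;                  // block size in bytes (example value)

static std::uint8_t pool[N * M];
static bool available[N];                      // true = block is free

void pool_init()
{
    for (std::size_t i = 0; i < N; ++i) available[i] = true;
}

void* pool_alloc()
{
    for (std::size_t i = 0; i < N; ++i) {      // "bit test": O(N), deterministic upper bound
        if (available[i]) {
            available[i] = false;
            return &pool[i * M];
        }
    }
    return nullptr;                            // pool exhausted: fail deterministically
}

void pool_free(void* p)                        // O(1): set the available flag again
{
    std::size_t i = (static_cast<std::uint8_t*>(p) - pool) / M;
    available[i] = true;
}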

Nothing in the C11 standard or in n1570 says that malloc is deterministic (or is not); nor does other documentation such as malloc(3) on Linux. BTW, many malloc implementations are free software.
But malloc can (and does) fail, and its performance is not known (a typical call to malloc on my desktop would practically take less than a microsecond, but I could imagine weird situations where it might take much more, perhaps many milliseconds on a very loaded computer; read about thrashing). And my Linux desktop has ASLR (address space layout randomization), so running the same program twice gives different malloc-ed addresses (in the virtual address space of the process). BTW, a deterministic (under specific assumptions that you need to spell out) but practically useless malloc implementation is sketched below.
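For illustration, here is a guess at what such an always-failing (hence trivially constant-time) allocator could look like; this is a sketch, not the original code referred to:

#include <cerrno>
#include <cstddef>

// Always fails, in constant time: deterministic, but practically useless.
void* my_malloc(std::size_t) { errno = ENOMEM; return nullptr; }
void  my_free(void*) { }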
determinism means that running a program twice will lead to the exact, same execution path
This is practically wrong in most embedded systems, because the physical environment is changing; for example, the software driving a rocket engine cannot expect that the thrust, or the drag, or the wind speed, etc... is exactly the same from one launch to the next one.
(so I am surprised that you believe or wish that real-time systems are deterministic; they never are! Perhaps you care about WCET, which is increasingly difficult to predict because of caches)
BTW some "real-time" or "embedded" systems are implementing their own malloc (or some variant of it). C++ programs can have their allocator-s, usable by standard containers. See also this and that, etc, etc.....
And high-level layers of embedded software (think of an autonomous automobile and its planning software) are certainly using heap allocation and perhaps even garbage collection techniques (some of which are "real-time"), but are generally not considered safety critical.

tl;dr: It's not that dynamic memory allocation is inherently non-deterministic (as you defined it in terms of identical execution paths); it's that it generally makes your program unpredictable. Specifically, you can't predict whether the allocator might fail in the face of an arbitrary sequence of inputs.
You could have a non-deterministic allocator. This is actually common outside of your real-time world, where operating systems use things like address layout randomization. Of course, that would make your program non-deterministic.
But that's not an interesting case, so let's assume a perfectly deterministic allocator: the same sequence of allocations and deallocations will always result in the same blocks in the same locations and those allocations and deallocations will always have a bounded running time.
Now your program can be deterministic: the same set of inputs will lead to exactly the same execution path.
The problem is that if you're allocating and freeing memory in response to inputs, you can't predict whether an allocation will ever fail (and failure is not an option).
First, your program could leak memory. So if it needs to run indefinitely, eventually an allocation will fail.
But even if you can prove there are no leaks, you would need to know that there's never an input sequence that could demand more memory than is available.
But even if you can prove that the program will never need more memory than is available, the allocator might, depending on the sequence of allocations and frees, fragment memory and thus eventually be unable to find a contiguous block to satisfy an allocation, even though there is enough free memory overall.
It's very difficult to prove that there's no sequence of inputs that will lead to pathological fragmentation.
You can design allocators to guarantee there won't be fragmentation (e.g., by allocating blocks of only one size), but that puts a substantial constraint on the caller and possibly increases the amount of memory required due to the rounding waste. And the caller must still prove that there are no leaks and that there's a satisfiable upper bound on the total memory required regardless of the sequence of inputs. This burden is so high that it's actually simpler to design the system so that it doesn't use dynamic memory allocation.

The deal with real-time systems is that the program must strictly meet certain computation and memory restrictions regardless of the execution path taken (which may still vary considerably depending on input). So what does the use of generic dynamic memory allocation (such as malloc/new) mean in this context? It means that at some point the developer is not able to determine the exact memory consumption, and it would be impossible to tell whether the resulting program will be able to meet the requirements, both for memory and for computation power.

Yes it is correct. For the kind of applications you mention, everything that can occur must be specified in detail. The program must handle the worst-case scenario according to specification and set aside exactly that much memory, no more, no less. The situation where "we don't know how many inputs we get" does not exist. The worst-case scenario is specified with fixed numbers.
Your program must be deterministic in a sense that it can handle everything up to the worst-case scenario.
The very purpose of the heap is to allow several unrelated applications to share RAM, such as in a PC, where the number of programs/processes/threads running isn't deterministic. This scenario does not exist in a real-time system.
In addition, the heap is non-deterministic in its nature, as segments get added or removed over time.
More info here: https://electronics.stackexchange.com/a/171581/6102

Even if your heap allocator has repeatable behavior (the same sequence of allocation and free calls yield the same sequence of blocks, hence (hopefully) the same internal heap state), the state of the heap may vary drastically if the sequence of calls is changed, potentially leading to fragmentation that will cause memory allocation failures in an unpredictable way.
The reason heap allocation is frowned upon or downright forbidden in embedded systems, esp. mission-critical systems such as aircraft or spacecraft guidance or life support systems, is that there is no way to test all possible variations in the sequence of malloc/free calls that can happen in response to intrinsically asynchronous events.
The solution is for each handler to have its own memory set aside for its purpose; then it does not matter anymore (at least as far as memory use is concerned) in what order these handlers are invoked.
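For example (a sketch with made-up handler names and sizes), each handler can own a statically reserved buffer dimensioned for its worst case:

#include <cstdint>

// Each handler works only inside its own statically reserved, worst-case-sized buffer.
static std::uint8_t telemetry_buffer[2 * 1024];   // sizes are placeholders
static std::uint8_t command_buffer[512];

void handle_telemetry() { /* uses telemetry_buffer only */ }
void handle_command()   { /* uses command_buffer only   */ }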

The problem with using the heap in hard realtime software is that heap allocations can fail. What do you do when you run out of heap?
You are talking about space applications. You have pretty hard no-fail requirements. You must have no possibility of leaking memory to the point where there is not enough left for at least the safe-mode code to run. You must not fall over. You must not throw exceptions that have no catch block. You probably don't have an OS with protected memory, so one crashing application can in theory take out everything.
You probably don't want to use heap at all. The benefits don't outweigh the whole-program costs.
Non-deterministic normally means something else, but in this case the best reading is that they want the entire program behavior to be completely predictable.

Let me introduce the Integrity RTOS from GHS:
https://www.ghs.com/products/rtos/integrity.html
and LynxOS:
http://www.lynx.com/products/real-time-operating-systems/lynxos-178-rtos-for-do-178b-software-certification/
LynxOS and the Integrity RTOS are among the software used in space applications, missiles, aircraft etc., as many others are not approved or certified by the authorities (e.g., the FAA).
https://www.ghs.com/news/230210r.html
To meet the stringent criteria of space applications, the Integrity RTOS actually provides formal verification, i.e., mathematically proven logic, that its software behaves according to specification.
Among these criteria, see here:
https://en.wikipedia.org/wiki/Integrity_(operating_system)
and here: Green Hills Integrity Dynamic memory allocation.
I am not a specialist in formal methods, but perhaps one of the requirements for this verification is to remove the uncertainties in the timing required for memory allocation. In an RTOS, every event is precisely planned, milliseconds apart from each other. And dynamic memory allocation always has a problem with the timing required.
Mathematically you really need to prove that everything works, starting from the most fundamental assumptions about timing and the amount of memory.
And then think of the alternative to heap memory: static memory. The address is fixed, the size allocated is fixed, the position in memory is fixed. So it is very easy to reason about memory sufficiency, reliability, availability, etc.

Short answer
There are effects on the data values, or on their statistical uncertainty distributions, of e.g. a first- or second-level trigger scintillator device, that can derive from the non-reproducible amount of time you may have to wait for malloc/free.
The worst aspect is that they are related neither to the physical phenomenon nor to the hardware, but somehow to the state of the memory (and its history).
Your goal, in that case, is to reconstruct the original sequence of events from data affected by those errors. The reconstructed/guessed sequence will be affected by errors too. This iteration will not always converge to a stable solution, and it is not guaranteed to be the correct one; your data are no longer independent... You risk a logical short-circuit...
Longer answer
You stated "I wasn't able to verify the correctness of this statement when people tell me that".
I will try to give you a purely hypothetical situation/case study.
Let's imagine you deal with a CCD or with some 1st- and 2nd-level scintillator triggers on a system that has to economize resources (you're in space).
The acquisition rate will be set so that the background will be at x% of the MAXBINCOUNT.
There's a burst, you have a spike in the counts and an overflow in the bin counter.
I want it all: you switch to the max acquisition rate and you finish your buffer.
You go to free/allocate more memory while you finish the extra buffer.
What will you do?
Will you keep the counter active, risking the overflow (the second level will try to properly count the timing of the data packages), but in this case underestimating the counts for that period?
Or will you stop the counter, introducing a hole in the time series?
Note that:
While waiting for the allocation you will lose the transient (or at least its beginning).
Whatever you do depends on the state of your memory and it is not reproducible.
Now suppose instead the signal varies around the maxbincount at the maximum acquisition rate allowed by your hardware, and the event is longer than usual.
You run out of space and ask for more... and meanwhile you run into the same problem as above.
Overflow and systematic underestimation of peak counts, or holes in the time series?
Let's move to the second level (it can be the 1st-level trigger too).
From your hardware, you receive more data than you can store or transmit.
You have to cluster the data in time or space (2x2, 4x4, ... 16x16 ... 256x256... pixel scaling...).
The uncertainty from the previous problem may affect the error distribution.
There are CCD settings for which the border pixels have counts close to the maxbincount (it depends on "where" you want to see better).
Now you can have a shower on your CCD, or a single big spot with the same total number of counts but with a different statistical uncertainty (the part that is introduced by the waiting time)...
So, for example, where you are expecting a Lorentzian profile you can obtain its convolution with a Gaussian one (a Voigt profile), or, if the latter is really dominant, a dirty Gaussian...

There is always a trade-off. It's the program's running environment and the tasks it performs that should be the basis for deciding whether the heap should be used or not.
Heap objects are efficient when you want to share data between multiple function calls: you just need to pass the pointer, since the heap is globally accessible. There are disadvantages as well; some function might free this memory while references to it still exist at other places.
If heap memory is not freed after its work is done and the program keeps allocating more memory, at some point the heap will run out of memory, which affects the deterministic character of the program.

Related

Why Stack and Heap Size are not defined in User Manual of microcontroller?

I am quite new to embedded programming, so maybe this is quite an easy question for you.
I have seen different linker script files/linker configuration files of different SDKs (e.g. IAR EWARM, Tasking etc.) in which the sizes of the stack/heap are defined.
The size/range of RAM and flash of every microcontroller is also defined in the linker file. These are usually taken from the memory map in the user manual (the address ranges are provided in the user manual).
My question is: how are these sizes of stack and heap calculated?
Can I select any value for the stack/heap size? Or are there any criteria for that?
These are not defined in the microcontroller user manual because they are not hardware defined constraints. Rather they are application defined. It is a software dependent partitioning of memory, not hardware dependent.
Local, non-static variables, function arguments and call return addresses are generally stored on the stack; so the required stack size depends on the call depth and the number and size of local-variables and parameters for each function in a call-tree. The stack usage is dynamic, but there will be some worst-case path where the combination of variables and call-depth causes a peak usage.
On top of that, on many architectures you also have to account for interrupt handler stack usage, which is generally less deterministic, but still has a "worst case" of interrupt nesting and call depth. For this reason, ISRs should generally be short, deterministic, and use few variables.
Further, if you have a multi-threaded environment such as an RTOS scheduler, each thread will have a separate stack. Typically these thread stacks are statically allocated arrays or dynamically (heap) allocated, rather than defined by the linker script. The linker script normally defines only the system stack for the main() thread and interrupt/exception handlers.
Estimating the required stack usage is not always easy, but methods for doing so exist, using either static or dynamic analysis. Some examples (partly toolchain specific) at:
https://www.keil.com/support/man/docs/armclang_intro/armclang_intro_hla1474359990839.htm
https://www.keil.com/appnotes/docs/apnt_316.asp for example.
Many default linker scripts automatically expand the heap to fill all remaining space available after static data and stack allocation. One notable exception is the Keil ARM-MDK toolchain, which requires you to explicitly set a heap size.
A linker script may reserve memory regions for other purposes, especially if the memory is not homogeneous - for example, on-chip MCU memory will typically be faster to access than external RAM, and may itself be subdivided across different busses, so for example there might be a small segment useful for DMA on a separate bus, avoiding bus contention and yielding more deterministic execution.
The use of dynamic memory (heap) allocation in embedded systems needs to be carefully considered (or even banned, as @Lundin would suggest, but not all embedded systems are subject to the same constraints). There are a number of issues to consider, including:
Memory constraints - many embedded systems have very small memories, you have to consider the response, safety and functionality of the system in the event an allocation request cannot be satisfied.
Memory leaks - your own, your colleagues on a team and third party code may not be as high a quality as you would hope; you need to be certain that the entire code base is free of memory leaks (failing to deallocate/free memory appropriately).
Determinism - most heap allocators take a variable and non-deterministic length of time to allocate memory, and even freeing can be non-deterministic if it involves block consolidation.
Heap corruption - an owner of an allocated block can easily under/overrun an allocation and corrupt adjacent memory. Typically such memory contains the heap-management meta-data for the block or other blocks, and the actual data for other allocations. Corrupting this data has non-deterministic effects on other code, most often unrelated to the code that caused the error, so it is common for the failure to occur some time after, and in code unrelated to, the event that caused the error. Such bugs are hard to spot and resolve. If the heap meta-data is corrupted, the error is often detected when further heap operations (alloc/free) fail.
Efficiency - heap allocations made by malloc() et al. are normally 8-byte aligned and have a block of pre-pended meta-data. Some implementations may add some "buffer" region to help detect overruns (especially in debug builds). As such, making numerous allocations of very small blocks can be a remarkably inefficient use of a scarce resource.
Common strategies in embedded system to deal with these issues include:
Disallowing any dynamic memory allocations. This is common in safety critical and MISRA compliant applications for example.
Allowing dynamic memory allocation only during initialisation, and disallowing free() (a minimal sketch of this follows after this list). This may seem counterintuitive, but can be useful where an application itself is "dynamic" and perhaps in some configurations not all tasks or device drivers etc. are started, where static allocation might lead to a great deal of unused/unusable memory.
Replacing the default heap with a deterministic memory allocation mechanism such as a fixed-block allocator. Often these have a separate API rather than overriding malloc/free, so they are not strictly a replacement, just a different solution.
Disallowing dynamic memory allocation in hard-real-time critical code. This addresses only the determinism issue, but in systems with large memories, carefully designed code, and perhaps MMU protection of allocations, there may be mitigations for the other issues.
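As a minimal sketch of the "allocation only during initialisation" strategy above (the wrapper names are assumptions, not from any particular library), allocation can be routed through a wrapper that is locked down once start-up completes:

#include <cassert>
#include <cstdlib>

static bool init_phase = true;   // cleared once the system is up and running

void* init_alloc(std::size_t sz)
{
    assert(init_phase && "dynamic allocation is only allowed during initialisation");
    return std::malloc(sz);      // never freed; the blocks live for the life of the system
}

void end_of_init() { init_phase = false; }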
Basically the stack size is picked depending on expected program size. For larger and more complex programs, you will want more stack size. It also depends on architecture, 32 bitters will generally consume slightly more memory than 8 and 16 bitters. The exact value is picked based on experience, though once you know exactly how much RAM your program actually uses, you can increase the stack size to use most of the unused memory.
It's also custom to map the stack so that it grows into a harmless area upon overflow, such as non-mapped memory or flash. Ideally so that you get a hardware exception, "software interrupt" or similar when stack overflow happens. You should never map it so that it grows into .data/.bss and overwrites other variables there.
As for the heap, the size is almost always picked to 0 and the segment is removed completely from the linker script. Heap allocation is banned in almost every microcontroller application.
The stack and heap are part of your program itself. Their sizes depend on how your program is structured and written, and on how much memory it takes up; the rest of the free memory will work as stack or heap depending on how you set it up.
In the linker script you can define these values.

Is it practical to delete all heap-allocated memory after you have finished using it?

Are there any specific situations in which it would not be practical nor necessary to delete the heap-allocated memory when you are done using it? Or does not deleting it always affect programs to a large extent?
In a few cases, I've had code that allocated lots of stuff on the heap. A typical run of the program took at least a few hours, and with larger data sets, that could go up to a couple of days or so. When it finished and you exited the program, all the destructors ran, and freed all the memory.
That led to a bit of a problem though. Especially after a long run (which allocated many blocks on the heap) it could take around five minutes for all the destructors to run.
So, I rewrote some destructors to do nothing--not even free memory an object had allocated.
The program had a pretty simple memory usage pattern, so everything it allocated remained in use until you shut it down. Disabling the destructors so they no longer released the memory that had been allocated reduced the time to shut down the program from ~5 minutes to what appeared instant (but was still actually pretty close to 100 ms).
That said, this is really only rarely an option. The vast majority of the time, a program should clean up after itself. With well written code it's usually pretty trivial anyway.
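The original code is not shown here, but one way to get the same effect (a hedged sketch, not the author's exact approach) is to give up ownership at shutdown so that neither the destructors nor the per-block frees ever run; the OS reclaims the pages when the process exits:

#include <memory>
#include <vector>

struct BigDataSet {                  // stand-in for the real, heavily heap-allocated structure
    std::vector<std::vector<int>> blocks;
};

int main()
{
    auto data = std::make_unique<BigDataSet>();
    // ... hours of work filling `data` ...

    // Deliberately "leak" at shutdown: releasing ownership skips the slow destructor walk.
    (void)data.release();            // the OS reclaims the memory when the process exits
}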
Are there any specific situations in which it would not be practical nor necessary to delete the heap-allocated memory when you are done using it?
Yes.
In certain types of telecomm embedded systems I have seen:
1) an operator commanded software-revision-update can also perform (or remind the user to perform) a software reset as the last step in the upgrade. This is not a power bounce, and (typically) the associated hw continues to run.
Note: There are two (or more) kinds of revision updates: 1) processor code; and 2) firmware (for the FPGAs, typically stored in EPROM).
In this case, there need not be a delete of long-term heap allocated memory. The embedded software I am familiar with has many new'd data structures that last the life of the code. Software reset is the user-commanded end-of-life, and the memory is zero'd at system startup (not shutdown). No dtor's are used at this point, either.
There is often a customer requirement about the upper limit on how long a system reboot takes. The time starts when the customer wants ... perhaps at the start of the download of a new revision ... so a fast reset can help achieve that additional requirement.
2) I have worked on (embedded telecom) systems with a 'Watchdog' feature to detect certain inconsistencies (including thread 'hangs'). This failure mechanism generates a log entry in some persistent store (such as battery-back-static-ram or eprom or file system).
The log entry is evidence of some 'self-detected' inconsistency.
Any attempt to delete heap memory would be suspect, as the inconsistency might have already corrupted the system. This reset is not user-commanded, but may have site policy based controls. A fast reset is also desired here to restore functionality when the reset occurs with no user at the console.
Note:
IMHO, The most useful "development features" for embedded system (none of which trigger heap clean up efforts) are :
a) a soft-reset switch (fairly common availability) - reboots the processor with no impact to the hw that the software controls/monitors. Is used often.
b) a hard-reset switch (availability rare) - power bounces the card .. both processor and the equipment it controls, without impact to the rest of the cards in the shelf. (Unknown utility.)
c) a shelf-reset switch (some times the shelf has its own switch) - power bounces the shelf and all cards, processors and equipment within. This is seldom used, (except for system startup issues) but the alternative is to clumsily 'pull the power plug'.
d) computer control of these three switches - I've never seen it.
Are there any specific situations in which it would not be practical nor necessary to delete the heap-allocated memory when you are done using it?
Any heap memory you allocate and never free will remain allocated until your process exits. During that time, no other program will be able to use that portion of the computer's RAM for any purpose.
So the question is, will that cause a problem? The answer will depend on a number of variables:
How much RAM has your process allocated?
How much RAM does the computer have physically installed and available for other programs to use?
How long will your process continue running (and thus holding on to that memory) for?
If your program is of the type that runs, does its thing, and then exits (more-or-less) immediately, then there's likely no problem with it "leaking" memory, since the leaked memory will be reclaimed by the OS when your process exits (note: some very primitive/embedded/old OS's may not reclaim the resources of an exited process, so make sure your OS does -- that said, almost all commonly in-use modern OS's do)
If your program is of the type that can continue running indefinitely, on the other hand, then memory leaks are going to be a problem, because if the program keeps allocating memory and never freeing it, eventually it will eat up all of the computer's available RAM and then bad things will start to happen.
In general, there is no reason why you should ever have to leak memory in a modern C++ program -- smart pointers (e.g. std::unique_ptr and std::shared_ptr) are there specifically to make memory-leaks easy to avoid, and they are easier to use than the old/leak-prone raw C pointers, so there's no reason not to use them.
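For example, a minimal sketch:

#include <memory>

struct Sensor { double value = 0.0; };

void process()
{
    auto s = std::make_unique<Sensor>();   // heap allocation owned by the smart pointer
    s->value = 42.0;
    // ... use s ...
}   // the memory is released here automatically, even if an exception is thrown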

Why we use memory managers?

I have seen that a lot of code bases, especially server code, have basic (sometimes advanced) memory managers. Is the real purpose of a memory manager to reduce the number of malloc calls, or is it mainly for memory analysis, corruption checks, or other application-centric purposes?
Is the argument of saving malloc calls reasonable enough, given that malloc is itself a memory manager? The only performance gain I can see is when we know that the system always asks for the same size of memory.
Or is the reason for having a memory manager that free does not return memory to the OS but keeps it in a list, so over the lifetime of the process the heap usage of the process may increase if we keep doing malloc/free, because of fragmentation?
malloc is a general-purpose allocator - "not slow" is more important than "always fast".
Consider a feature that would be a 10% improvement in many common cases, but might cause significant performance degradation in a few rare cases. An application specific allocator can avoid the rare case and reap the benefits. A general purpose allocator should not.
Besides number of calls to malloc, there are other relevant attributes:
locality of allocations
On current hardware, this is easily the most important factor for performance. An application has more knowledge of the access patterns and can optimize the allocations accordingly.
multithreading
A general purpose allocator must allow calls to malloc and free from different threads. This usually requires a lock or similar concurrency handling. If the heap is very busy, this leads to massive contention.
An application that knows that some high-frequency alloc/frees come only from one thread can use its own thread-specific heap, which not only avoids contention for these allocations, but also increases their locality and takes load off the default allocator.
fragmentation
This is still a problem for long running applications on systems with limited physical memory or address space. Fragmentation may require more and more memory or address space from the OS, even without the actual working set increasing. This is a significant problem for applications that need to run uninterrupted.
Last time I looked deeper into allocators (which is probably half a decade ago), the consensus was that naive attempts to reduce fragmentation often conflict with the never-slow rule.
Again, an application that knows (some of) its allocation patterns can take a lot of load off the default allocator. One very common use case is building a syntax tree or something similar: there are gazillions of small allocations which are never freed individually, only as a whole. Such a pattern can be served efficiently with a very trivial allocator (see the sketch after this list).
resilience and diagnostics
Last but not least, the diagnostic and self-protection capabilities of the default allocator may not be sufficient for many applications.
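A minimal sketch of such a trivial "allocate individually, free as a whole" arena (assumptions: one fixed-capacity buffer, sizes rounded up to max_align_t, no growth):

#include <cstddef>
#include <cstdint>

class Arena {
public:
    explicit Arena(std::size_t capacity)
        : buf_(new std::uint8_t[capacity]), cap_(capacity), used_(0) {}
    ~Arena() { delete[] buf_; }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    void* alloc(std::size_t sz)
    {
        // round up so the next allocation stays suitably aligned
        sz = (sz + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
        if (used_ + sz > cap_) return nullptr;   // out of arena space
        void* p = buf_ + used_;
        used_ += sz;
        return p;
    }

    void release_all() { used_ = 0; }            // frees every node in O(1)

private:
    std::uint8_t* buf_;
    std::size_t   cap_;
    std::size_t   used_;
};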
Why do we have custom memory managers rather than the built-in ones?
The number one reason is probably that the codebase was originally written 20-30 years ago, when the provided one wasn't any good, and nobody dares change it.
But otherwise, as you say, it's because the application needs to manage fragmentation, grab memory at startup to ensure that memory will always be available, for security, or for a bunch of other reasons - most of which could be achieved by correct use of the built-in manager.
C and C++ are designed to be stripped down. They don't do much that is not explicitly asked for, so when a program asks for memory, it gets the minimum possible effort required to deliver that memory.
In other words, if you don't need it, you don't pay for it.
If finer-grained control of the memory is required, that's the domain of the programmer. If the programmer wishes to trade bare metal speed for a system that will provide higher performance on the target hardware in conjunction with the program's often unique goals, better debugging support, or simply likes the look and feel and warm fuzzies that come from using a manager, that is up to them. The programmer either writes something smarter or finds a third party library to do what they want.
You briefly touched on a lot of the different reasons why you would use a memory manager in your question.
Is the real purpose of a memory manager to reduce the number of malloc calls or mainly for the purpose of memory analysis, corruption check or other application centric purposes?
This is the big question. A memory manager in any application can be generic (like malloc) or it can be more specific. The more specialized the memory manager becomes, the more likely it is to be efficient at the specific task it is supposed to accomplish.
Take this overly-simplified example:
#include <cstdlib>

struct Foo { int x; };            // placeholder payload type (assumed for the example)

#define MAX_OBJECTS 1000
Foo globalObjects[MAX_OBJECTS];   // the "custom allocator": a static pool of Foo objects

int main(int argc, char** argv)
{
    void* mallocObjects[MAX_OBJECTS] = {0};
    void* customObjects[MAX_OBJECTS] = {0};

    for (int i = 0; i < MAX_OBJECTS; ++i)
    {
        mallocObjects[i] = std::malloc(sizeof(Foo)); // general-purpose heap allocation
        customObjects[i] = &globalObjects[i];        // "allocation" from the static pool
    }
}
In the above I am pretending that this global object list is our "custom memory allocator." This is just to simplify what I am explaining.
When you allocate with malloc there is no guarantee it is right next to the previous allocation. Malloc is a general purpose allocator and does a good job at that but doesn't necessarily make the most efficient choice for every application.
With a custom allocator you might be able to up front allocate room for 1000 custom objects and since they are a fixed size return the exact amount of memory you need to prevent fragmentation and to efficiently allocate that block.
There is also the difference between memory abstraction and custom memory allocators. STL allocators are arguably an abstraction model and not a custom memory allocator.
Take a look at this link for some more information on custom allocators and why they are useful: gamedev.net link
There are many reasons why we would want to do this and it really depends on the application itself. In fact all the reasons you mentioned are valid.
I once built a very simple memory manager that kept track of shared_ptr allocations in order for me to see what was not being released properly on application end.
I would say stick to your runtime unless you need something that it does not provide.
Memory managers are used basically to manage your memory reservations efficiently. Normally processes have access to a limited amount of memory (4GB on 32-bit systems); from this you have to subtract the virtual memory space reserved for the kernel (1GB or 2GB depending on your OS configuration). Thus, virtually, the process has access to, let's say, 3GB of memory that will be used to hold all of its segments (code, data, bss, heap and stack).
Memory managers (malloc for example) try to fulfill the different memory reservation requests issued by the process by requesting new memory pages from the OS (using the sbrk or mmap system calls). Every time this happens it implies an extra cost on the program execution, since the OS has to look for a suitable memory page to be assigned to the process (physical memory is limited and all the running processes want to use it) and update the process tables (TMP, etc). These operations are time consuming and hit the process execution and performance. Thus, the memory manager normally tries to request the needed pages to fulfill the process reservations cleverly. For example, it could ask for some more pages to avoid more mmap calls in the near future. Additionally, it tries to deal with issues like fragmentation, memory alignment, etc. This basically unloads the process from this responsibility; otherwise everybody writing a program that needs dynamic memory allocation would have to perform this manually!
Actually, there are cases where one could be interested in doing the memory management manually. This is the case for embedded or high-availability systems which have to run 24/7 all year round. In these cases, even if the memory fragmentation is low, it could become a problem after a very long period of running (1 year for example). So one of the solutions used in this case is to use a memory pool to allocate the memory for the application objects beforehand. Afterwards, each time you need memory for some object, you just use the already reserved memory.
For server-based applications, or any application that needs to run for long periods of time or indefinitely, the main issue is paged memory fragmentation. After a long series of mallocs/new and free/delete, paged memory can end up with gaps in the pages that waste space and could eventually run out of virtual address space. Microsoft deals with this in its .NET framework by occasionally pausing a process to repack paged memory for a process.
To avoid slowdown during repacking of memory in a process, a server type application can use multiple processes for the application, so that during repacking of one process, the other process(es) take more of the load.

How to evaluate the quality of custom memory allocator?

What characteristics should be checked when evaluating memory allocation?
Performance of allocation and de-allocation? Are simple stress-tests enough? How to check the quality of allocation?
For example, I found Oracle's test for malloc, but it's only Oracle's view of the problem. And this test is oriented only to multi-threaded performance.
How do people usually check their allocators?
Just to give more focus on the "how", rather than the "what", which the other answers seem to deal with, here's how I would do it.
First Step - Make it possible to compare approaches
Determine what qualities you value. Make a list, prioritize and finally, make a value function.
That is, figure out which measurements are the most useful indicators of quality, in your view/case. A few good measurements could be average time to allocate a memory block, total runtime of the application (if applicable), average frame rate, total or average memory consumption ... It all depends on what you wish to achieve.
Then, create a function which, given these measurements from a test run, gives you a value which can be used as quality measure. The simplest case would be to simply decide a weight factor for each of the measurements. These weight factors should embody both the importance of each measurement and, if they use different units (such as nanoseconds for average allocation time and bytes for average memory consumption), attempt to scale them to compare fairly.
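As a tiny illustration (the metrics and weights here are made up), the value function can be a weighted sum of the scaled measurements:

struct Measurements {
    double avg_alloc_ns;       // average time per allocation, in nanoseconds
    double peak_memory_bytes;  // peak memory consumption, in bytes
};

// Lower is better for both metrics, so a lower score means a better allocator.
// The weights both express importance and scale the different units so they compare fairly.
double quality_score(const Measurements& m)
{
    const double w_time   = 1.0;          // weight per nanosecond (assumption)
    const double w_memory = 1.0 / 1024.0; // weight per byte, scaled to KiB (assumption)
    return w_time * m.avg_alloc_ns + w_memory * m.peak_memory_bytes;
}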
Second Step - Devise a test scenario
This should be as close to a realistic case as possible. The best would be simply the actual code that you want to use your memory allocator for, with added code for taking all the measurements needed to compute your value function.
Third Step - Test
Write a bunch of different allocators and test them all against each other, as well as the default or without any allocator (if applicable). Measure all results, compute the value function for each and rank them according to the results. Keep in mind all the different considerations that you always need to think of when performing performance measurements.
Fourth Step - Evaluate and re-iterate
Look at how the different solutions stack up against each other. Apply some critical thinking. Do these results actually correspond to how you experienced the quality of each allocator during the tests?
If the results do not match what you thought you saw - for example, if the one which seemed blazing fast and gave a total runtime half a minute shorter than the rest gets a mediocre score - well, then it's time to scrutinize your approach. Perhaps there's a bug in your measuring? Or perhaps you need to re-evaluate your chosen value function... Re-iterate steps one through four until the results are clear and in accordance with your actual experience in testing them.
Usually, the performance of a memory allocator is about the speed of finding and creating a memory chunk in the heap, depending on the size of the manipulated memory blocks, and also (more recently) about how it behaves in the case of multithreaded allocations. You can find interesting studies and benchmarks in the following list:
ptmalloc - a multi-thread malloc implementation
Benchmarks of the Lockless Memory Allocator
Dynamic memory allocator implementations in Linux system libraries
A Scalable Concurrent malloc(3) Implementation for FreeBSD
... probably many others ...
I guess my answer is not genius but - it depends.
If you are writing a custom memory allocator, you probably know what its characteristics should be. E.g. if you want an allocator that lets you quickly allocate a lot of small objects and you don't really care about memory usage overhead, you should probably have different tests than when you are creating an allocator for big objects and you want to save as much memory as possible, even at the cost of CPU time.
Stress tests are always good because they can help you find race conditions and check that your allocator is bug-free, but performance tests depend on what you wanted to achieve.
Here are the metrics that one should consider when optimizing/analyzing the dynamic memory allocation mechanism in the system.
Implementation overhead - how much memory does it cost to keep the allocator's internal data structures operational? Additionally, do these structures grow over time or are they pre-allocated once (both approaches have pros and cons and both are valid)?
Operational efficiency - how long does it take to allocate/free a memory block. Here usually the allocation is a challenge, because it is almost never a constant time and depends on characteristics of previously allocated blocks of memory. Freeing the block looks straightforward but if combined with memory de-fragmentation, deserves further attention (will not be covered here).
Thread safety has less to do with allocation itself and more with the decision to use certain solution in the system. Basically here, if you don't have threads - there is nothing to worry about. If you do have threads - make sure that your allocation will not be interrupted while at work.
Memory fragmentation - the actual layout of the allocated memory. Here two completely contradicting requirements come into play: you allocate as soon as you find the right spot in your buffer, or you make sure you cause as little fragmentation as possible. The former is fast, whilst the latter is more resource-friendly (and also potentially slower).
Garbage collection - this is a separate topic and a field of study on its own; it is mentioned only for the sake of completeness. It is important to understand that even if you don't plan on releasing allocated memory too often, GC can still be of use to help with analysis of already allocated memory, preparing internal data structures for the next efficient memory allocation. Idle CPU time is arguably the best moment to do this housekeeping task. This topic, however, is out of the scope of this question.

Which memory allocation algorithm suits best for performance and time critical c++ applications?

I ask this question to determine which memory allocation algorithm gives better results for performance-critical applications, like game engines or embedded applications. The results actually depend on the percentage of fragmented memory and on the time-determinism of memory requests.
There are several algorithms in the text books (e.g. buddy memory allocation), but there are also others like TLSF. Therefore, regarding the memory allocation algorithms available, which one of them is the fastest and causes the least fragmentation? BTW, garbage collectors should not be included.
Please also note that this question is not about profiling; it just aims to find out the optimum algorithm for the given requirements.
It all depends on the application. Server applications which can clear out all memory relating to a particular request at defined moments will have a different memory access pattern than video games, for instance.
If there was one memory allocation algorithm that was always best for performance and fragmentation, wouldn't the people implementing malloc and new always choose that algorithm?
Nowadays, it's usually best to assume that the people who wrote your operating system and runtime libraries weren't brain dead; and unless you have some unusual memory access pattern don't try to beat them.
Instead, try to reduce the number of allocations (or reallocations) you make. For instance, I often use a std::vector, but if I know ahead of time how many elements it will have, I can reserve that all in one go. This is much more efficient than letting it grow "naturally" through several calls to push_back().
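For example, a minimal sketch:

#include <vector>

std::vector<int> make_values(int n)
{
    std::vector<int> values;
    values.reserve(n);            // one allocation up front...
    for (int i = 0; i < n; ++i)
        values.push_back(i);      // ...instead of repeated reallocations while growing
    return values;
}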
Many people coming from languages where new just means "gimme an object" will allocate things for no good reason. If you don't have to put it on the heap, don't call new.
As for fragmentation: it still depends. Unfortunately I can't find the link now, but I remember a blog post from somebody at Microsoft who had worked on a C++ server application that suffered from memory fragmentation. The team solved the problem by allocating memory from two regions. Memory for all requests would come from region A until it was full (requests would free memory as normal). When region A was full, all memory would be allocated from region B. By the time region B was full, region A was completely empty again. This solved their fragmentation problem.
Will it solve yours? I have no idea. Are you working on a project which services several independent requests? Are you working on a game?
As for determinism: it still depends. What is your deadline? What happens when you miss the deadline (astronauts lost in space? the music being played back starts to sound like garbage?)? There are real time allocators, but remember: "real time" means "makes a promise about meeting a deadline," not necessarily "fast."
I did just come across a post describing various things Facebook has done to both speed up and reduce fragmentation in jemalloc. You may find that discussion interesting.
Barış:
Your question is very general, but here's my answer/guidance:
I don't know about game engines, but for embedded and real-time applications, the general goals of an allocation algorithm are:
1- Bounded execution time: You have to know in advance the worst case allocation time so you can plan your real time tasks accordingly.
2- Fast execution: Well, the faster the better, obviously
3- Always allocate: Especially for real-time, security critical applications, all requests must be satisfied. If you request some memory space and get a null pointer: trouble!
4- Reduce fragmentation: Although this depends on the algorithm used, generally, less fragmented allocations provide better performance, due to a number of reasons, including caching effects.
In most critical systems, you are not allowed to dynamically allocate any memory to begin with. You analyze your requirements and determine your maximum memory use and allocate a large chunk of memory as soon as your application starts. If you can't, then the application does not even start, if it does start, no new memory blocks are allocated during execution.
If speed is a concern, I'd recommend following a similar approach. You can implement a memory pool which manages your memory. The pool could initialize a "sufficient" block of memory in the start of your application and serve your memory requests from this block. If you require more memory, the pool can do another -probably large- allocation (in anticipation of more memory requests), and your application can start using this newly allocated memory. There are various memory pooling schemes around as well, and managing these pools is another whole topic.
As for some examples: VxWorks RTOS used to employ a first-fit allocation algorithm where the algorithm analyzed a linked list to find a big enough free block. In VxWorks 6, they're using a best-fit algorithm, where the free space is kept in a tree and allocations traverse the tree for a big enough free block. There's a white paper titled Memory Allocation in VxWorks 6.0, by Zoltan Laszlo, which you can find by Googling, that has more detail.
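To make the first-fit idea concrete, here is a heavily reduced sketch (not VxWorks code; it walks a singly linked free list and deliberately ignores block splitting, coalescing and alignment):

#include <cstddef>

// A header sits at the start of every free block.
struct Block {
    std::size_t size;   // usable bytes following the header
    Block*      next;   // next block in the free list
};

static Block* free_list = nullptr;   // assumed to be seeded at startup from one big buffer

// First-fit: take the first block on the list that is large enough.
void* ff_alloc(std::size_t sz)
{
    Block** link = &free_list;
    for (Block* b = free_list; b != nullptr; link = &b->next, b = b->next) {
        if (b->size >= sz) {
            *link = b->next;         // unlink the block from the free list
            return b + 1;            // usable memory starts right after the header
        }
    }
    return nullptr;                  // no block is big enough
}

void ff_free(void* p)
{
    Block* b = static_cast<Block*>(p) - 1;
    b->next = free_list;             // push the block back onto the free list
    free_list = b;
}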
Going back to your question about speed/fragmentation: It really depends on your application. Things to consider are:
Are you going to make lots of very small allocations, or relatively larger ones?
Will the allocations come in bursts, or spread equally throughout the application?
What is the lifetime of the allocations?
If you're asking this question because you're going to implement your own allocator, you should probably design it in such a way that you can change the underlying allocation/deallocation algorithm, because if the speed/fragmentation is really that critical in your application, you're going to want to experiment with different allocators. If I were to recommend something without knowing any of your requirements, I'd start with TLSF, since it has good overall characteristics.
As others have already written, there is no "optimum algorithm" for every possible application. It has already been proven that for any possible algorithm you can find an allocation sequence which will cause fragmentation.
Below I write a few hints from my game development experience:
Avoid allocations if you can
A common practice in the game development field was (and to a certain extent still is) to solve the dynamic memory allocation performance issues by avoiding memory allocations like the plague. It is quite often possible to use stack-based memory instead - even for dynamic arrays you can often come up with an estimate which will cover 99% of cases for you, and you need to allocate only when you are over this boundary. Another commonly used approach is "preallocation": estimate how much memory you will need in some function or for some object, create a kind of small and simplistic "local heap" you allocate up front, and perform the individual allocations from this heap only.
Memory allocator libraries
Another option is to use some of the memory allocation libraries - they are usually created by experts in the field to fit some special requirements, and if you have similar requirements, they may fit your needs.
Multithreading
There is one particular case in which you will find the "default" OS/CRT allocator performs badly, and that is multithreading. If you are targeting Windows, be aware that both the OS and CRT allocators provided by Microsoft (including the otherwise excellent Low Fragmentation Heap) are currently blocking. If you want to perform significant threading, you need either to reduce allocation as much as possible, or to use one of the alternatives. See Can multithreading speed up memory allocation?
The best practice is: use whatever you can to get the thing done in time (in your case - the default allocator). If the whole thing is very complex, write tests and samples that emulate parts of the whole thing. Then run performance tests and benchmarks to find bottlenecks (probably they will have nothing to do with memory allocation :).
From this point you will see what exactly slows down your code and why. Only based on such precise knowledge can you ever optimize something and choose one algorithm over another. Without tests it's just a waste of time, since you can't even measure how much your optimization will speed up your app (in fact such "premature" optimizations can really slow it down).
Memory allocation is a very complex thing and it really depends on many factors. For example, the following allocator is simple and damn fast, but can be used only in a limited number of situations:
#include <cstddef>  // for size_t

#define MAX_MEMORY_REQUIRED_TO_RENDER_FRAME (8 * 1024 * 1024)  // example budget (an assumption)

char pool[MAX_MEMORY_REQUIRED_TO_RENDER_FRAME];
char *poolHead = pool;
void *alloc(size_t sz) { char *p = poolHead; poolHead += sz; return p; }  // bump the pointer; no bounds check
void free() { poolHead = pool; }  // "free" everything at once, e.g. at the end of a frame
So there is no "the best algorithm ever".
One constraint that's worth mentioning, which has not been mentioned yet, is multi-threading: Standard allocators must be implemented to support several threads, all allocating/deallocating concurrently, and passing objects from one thread to another so that it gets deallocated by a different thread.
As you may have guessed from that description, it is a tricky task to implement an allocator that handles all of this well, and it does cost performance, as it is impossible to satisfy all these constraints without inter-thread communication (i.e. the use of atomic variables and locks), which is quite costly.
As such, if you can avoid concurrency in your allocations, you stand a good chance to implement your own allocator that significantly outperforms the standard allocators: I once did this myself, and it saved me roughly 250 CPU cycles per allocation with a fairly simple allocator that's based on a number of fixed sized memory pools for small objects, stacking free objects with an intrusive linked list.
Of course, avoiding concurrency is likely a no-go for you, but if you don't use it anyway, exploiting that fact might be something worth thinking about.
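To make that description concrete, here is a hedged sketch along those lines (one fixed-size pool shown; a real version would keep one pool per small size class, this single-threaded variant deliberately has no locks or atomics, and the class name and layout are illustrative, not the answerer's code):

#include <cstddef>
#include <cstdint>

// Free slots are chained through the slots themselves (an intrusive singly linked list),
// so alloc and free are O(1) pointer pops and pushes with no heap-manager involvement.
template <std::size_t SlotSize, std::size_t SlotCount>
class FixedPool {
    static_assert(SlotSize >= sizeof(void*), "a slot must be able to hold a next pointer");
    static_assert(SlotSize % alignof(void*) == 0, "slot size must preserve pointer alignment");
public:
    FixedPool() {
        for (std::size_t i = 0; i < SlotCount; ++i) {
            void* slot = storage_ + i * SlotSize;
            *static_cast<void**>(slot) = head_;   // push the slot onto the free list
            head_ = slot;
        }
    }
    void* alloc() {
        if (!head_) return nullptr;               // pool exhausted
        void* slot = head_;
        head_ = *static_cast<void**>(slot);       // pop
        return slot;
    }
    void free(void* slot) {
        *static_cast<void**>(slot) = head_;       // push
        head_ = slot;
    }
private:
    alignas(std::max_align_t) std::uint8_t storage_[SlotSize * SlotCount];
    void* head_ = nullptr;
};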