I am debugging a potential memory leak problem in a debug DLL.
The situation is that a process runs a sub-test which loads/unloads a DLL dynamically. During the test a lot of memory is reserved and committed (1.3 GB). After the test finishes and the DLL is unloaded, a massive amount of memory (1.2 GB) remains reserved.
The reason I say this reserved memory is allocated by the DLL is that if I use a release DLL (nothing else changed, same test), reserved memory is ~300 MB, so all the additional reserved memory must be allocated by the debug DLL.
It looks like a lot of memory is committed during the test but only decommitted (not released back to the free state) after the test. So I want to track what reserves and decommits that large amount of memory. But in the source code there is no call to VirtualAlloc, so my questions are:
Is VirtualAlloc the only way to reserve memory?
If not, what other APIs can do that? If it is, what other APIs will internally call VirtualAlloc? Quite a few people online say that HeapAlloc internally calls VirtualAlloc; how does that work?
[Parts of this are purely implementation detail, and not things that your application should rely on, so take them only for informational purposes, not as official documentation or contract of any kind. That said, there is some value in understanding how things are implemented under the hood, if only for debugging purposes.]
Yes, the VirtualAlloc() function is the workhorse function for memory allocation in Windows. It is a low-level function, one that the operating system makes available to you if you need its features, but also one that the system uses internally. (To be precise, it probably doesn't call VirtualAlloc() directly, but rather an even lower level function that VirtualAlloc() also calls down to, like NtAllocateVirtualMemory(), but that's just semantics and doesn't change the observable behavior.)
Therefore, HeapAlloc() is built on top of VirtualAlloc(), as are GlobalAlloc() and LocalAlloc() (although the latter two became obsolete in 32-bit Windows and should basically never be used by applications—prefer explicitly calling HeapAlloc()).
Of course, HeapAlloc() is not just a simple wrapper around VirtualAlloc(). It adds some logic of its own. VirtualAlloc() always allocates memory in large chunks, defined by the system's allocation granularity, which is hardware-specific (retrievable by calling GetSystemInfo() and reading the value of SYSTEM_INFO.dwAllocationGranularity). HeapAlloc() allows you to allocate smaller chunks of memory at whatever granularity you need, which is much more suitable for typical application programming. Internally, HeapAlloc() handles calling VirtualAlloc() to obtain a large chunk, and then divvying it up as needed. This not only presents a simpler API, but is also more efficient.
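For example, you can inspect the granularity like this (a minimal sketch; the typical values of 64 KB granularity and 4 KB pages are only an expectation, not a guarantee):

    #include <windows.h>
    #include <cstdio>

    // Prints the reservation granularity and page size on the current machine
    // (commonly 64 KB and 4 KB on x86/x64, but neither should be hard-coded).
    int main() {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        std::printf("allocation granularity: %lu bytes\n", si.dwAllocationGranularity);
        std::printf("page size:              %lu bytes\n", si.dwPageSize);
        return 0;
    }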
Note that the memory allocation functions provided by the C runtime library (CRT)—namely, C's malloc() and C++'s new operator—are a higher level yet. These are built on top of HeapAlloc() (at least in Microsoft's implementation of the CRT). Internally, they allocate a sizable chunk of memory that basically serves as a "master" block of memory for your application, and then divvy it up into smaller blocks upon request. As you free/delete those individual blocks, they are returned to the pool. Once again, this extra layer provides a simplified interface (and in particular, the ability to write platform-independent code), as well as increased efficiency in the general case.
Memory-mapped files and other functionality provided by various OS APIs are also built upon the virtual memory subsystem, and therefore internally call VirtualAlloc() (or a lower-level equivalent).
So yes, fundamentally, the lowest level memory allocation routine for a normal Windows application is VirtualAlloc(). But that doesn't mean it is the workhorse function that you should generally use for memory allocation. Only call VirtualAlloc() if you actually need its additional features. Otherwise, either use your standard library's memory allocation routines, or if you have some compelling reason to avoid them (like not linking to the CRT or creating your own custom memory pool), call HeapAlloc().
Note also that you must always free/release memory using the corresponding mechanism to the one you used to allocate the memory. Just because all memory allocation functions ultimately call VirtualAlloc() does not mean that you can free that memory by calling VirtualFree(). As discussed above, these other functions implement additional logic on top of VirtualAlloc(), and thus require that you call their own routines to free the memory. Only call VirtualFree() if you allocated the memory yourself via a call to VirtualAlloc(). If the memory was allocated with HeapAlloc(), call HeapFree(). For malloc(), call free(); for new, call delete.
As for the specific scenario described in your question, it is unclear to me why you are worrying about this. It is important to keep in mind the distinction between reserved memory and committed memory. Reserved simply means that this particular block in the address space has been reserved for use by the process. Reserved blocks cannot be used. In order to use a block of memory, it must be committed, which refers to the process of allocating a backing store for the memory, either in the page file or in physical memory. This is also sometimes known as mapping. Reserving and committing can be done as two separate steps, or they can be done at the same time. For example, you might want to reserve a contiguous address space for future use, but you don't actually need it yet, so you don't commit it. Memory that has been reserved but not committed is not actually allocated.
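To make the distinction concrete, here is a minimal sketch of the two-step approach with VirtualAlloc()/VirtualFree() (the 64 MB size is arbitrary and error checking is omitted):

    #include <windows.h>

    // Illustrative two-step reserve/commit: the reserved range occupies only
    // address space; it costs backing store (page file / RAM) only once committed.
    int main() {
        const SIZE_T size = 64 * 1024 * 1024;

        // Step 1: reserve address space. Touching it now would be an access violation.
        char* p = static_cast<char*>(
            VirtualAlloc(nullptr, size, MEM_RESERVE, PAGE_NOACCESS));

        // Step 2: commit the range (or a subrange) before use.
        VirtualAlloc(p, size, MEM_COMMIT, PAGE_READWRITE);
        p[0] = 42;   // now legal

        // Decommitting frees the backing store but keeps the range reserved...
        VirtualFree(p, size, MEM_DECOMMIT);
        // ...and releasing frees the address space itself (size must be 0 here).
        VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }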
In fact, all of this reserved memory may not be a leak at all. A rather common strategy used in debugging is to reserve a specific range of memory addresses, without committing them, to trap attempts to access memory within this range with an "access violation" exception. The fact that your DLL is not making these large reservations when compiled in Release mode suggests that, indeed, this may be a debugging strategy. And it also suggests a better way of determining the source: rather than scanning through your code looking for all of the memory-allocation routines, scan your code looking for the conditional code that depends upon the build configuration. If you're doing something different when DEBUG or _DEBUG is defined, then that is probably where the magic is happening.
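Purely as an illustration of the kind of build-dependent code to grep for, such a debug-only reservation might look something like this (entirely hypothetical; the size and variable name are made up, nothing here is from your code base):

    #include <windows.h>

    // Hypothetical debug-only guard reservation: the range is reserved but never
    // committed, so any access to it raises an access violation.
    #ifdef _DEBUG
    static void* g_guard_region =
        VirtualAlloc(nullptr, 512ull * 1024 * 1024, MEM_RESERVE, PAGE_NOACCESS);
    #endif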
Another possible explanation is the CRT's implementation of malloc() or new. When you allocate a small chunk of memory (say, a few KB), the CRT will actually reserve a much larger block but only commit a chunk of the requested size. When you subsequently free/delete that small chunk of memory, it will be decommitted, but the larger block will not be released back to the OS. The reason for this is to allow future calls to malloc/new to re-use that reserved block of memory. If a subsequent request is for a larger block than can be satisfied by the currently reserved address space, it will reserve additional address space. If, in debugging builds, you are repeatedly allocating and freeing increasingly large chunks of memory, what you're seeing may be the result of memory fragmentation. But this is really not a problem, aside from a minor performance hit, which is really not worth worrying about in debugging builds.
My program is a daemon and runs for a long time. It only needs a lot of memory some of the time.
I want to increase my program performance by increasing memory locality.
And PMR seems like a good tool for this purpose.
However, it seems that the memory resources provided by the standard do not return memory to the upstream resource when a lot of it is currently unused.
(i.e. synchronized_pool_resource, unsynchronized_pool_resource, monotonic_buffer_resource)
I want my program to use less memory when the load is not high (kind of like calling malloc_trim when needed).
Is there a memory resource that will cache only a small amount of currently unused memory and return the rest to the upstream resource?
A memory resource can be written to do whatever you want. However, since what you've described (returning memory that is unused) is what the default allocator does (and is one of the main reasons to use it), there wasn't much point in adding more standard library memory resources that do this.
Most of the defined memory resources are all about not returning unused memory, because returning and reallocating memory is expensive. They provide different strategies for keeping that memory accessible, so that later allocation calls are as fast as possible. That is, their whole point is to avoid the cost of allocating memory from the heap.
So you'll have to write a resource with the functionality you're looking for.
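For example, a sketch of such a resource might look like this (the name, the single fixed block size, and the caching policy are all my own illustrative choices, not anything from the standard):

    #include <memory_resource>
    #include <vector>
    #include <cstddef>

    // Caches at most `max_cached` freed blocks of one fixed size and hands
    // everything else straight back to the upstream resource.
    class trimming_resource : public std::pmr::memory_resource {
    public:
        trimming_resource(std::size_t block_size, std::size_t max_cached,
                          std::pmr::memory_resource* upstream =
                              std::pmr::get_default_resource())
            : block_size_(block_size), max_cached_(max_cached), upstream_(upstream) {}

        ~trimming_resource() override { trim(); }

        // Return every cached block to the upstream resource (like malloc_trim).
        void trim() {
            for (void* p : cache_)
                upstream_->deallocate(p, block_size_, alignof(std::max_align_t));
            cache_.clear();
        }

    private:
        bool cacheable(std::size_t bytes, std::size_t align) const {
            return bytes == block_size_ && align <= alignof(std::max_align_t);
        }

        void* do_allocate(std::size_t bytes, std::size_t align) override {
            if (cacheable(bytes, align)) {
                if (!cache_.empty()) {
                    void* p = cache_.back();
                    cache_.pop_back();
                    return p;
                }
                // Cacheable blocks always use one fixed size/alignment upstream.
                return upstream_->allocate(block_size_, alignof(std::max_align_t));
            }
            return upstream_->allocate(bytes, align);
        }

        void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
            if (cacheable(bytes, align)) {
                if (cache_.size() < max_cached_) {
                    cache_.push_back(p);            // keep a few blocks for reuse
                    return;
                }
                upstream_->deallocate(p, block_size_, alignof(std::max_align_t));
                return;
            }
            upstream_->deallocate(p, bytes, align); // everything else goes back
        }

        bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
            return this == &other;
        }

        std::size_t block_size_;
        std::size_t max_cached_;
        std::pmr::memory_resource* upstream_;
        std::vector<void*> cache_;
    };

A pool-style resource that handles several block sizes and trims on demand follows the same shape, just with one cache per size class.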
I am quite new to embedded programming, so this may be a quite easy question for you.
I have seen different linker script files / linker configuration files of different SDKs (e.g. IAR EWARM, Tasking, etc.) in which the sizes of the stack/heap are defined.
The size/range of RAM and flash of every microcontroller is also defined in the linker file. These are usually taken from the memory map in the user manual (the address ranges are provided there).
My question is: how are these stack and heap sizes calculated?
Can I select any value for the stack/heap size, or are there criteria for that?
These are not defined in the microcontroller user manual because they are not hardware defined constraints. Rather they are application defined. It is a software dependent partitioning of memory, not hardware dependent.
Local, non-static variables, function arguments and call return addresses are generally stored on the stack; so the required stack size depends on the call depth and the number and size of local-variables and parameters for each function in a call-tree. The stack usage is dynamic, but there will be some worst-case path where the combination of variables and call-depth causes a peak usage.
On top of that, on many architectures you also have to account for interrupt handler stack usage, which is generally less deterministic, but still has a "worst case" of interrupt nesting and call depth. For these reasons ISRs should generally be short, deterministic and use few variables.
Further, if you have a multi-threaded environment such as an RTOS scheduler, each thread will have a separate stack. Typically these thread stacks are statically allocated arrays or dynamically (heap) allocated rather than defined by the linker script. The linker script normally defines only the system stack for the main() thread and interrupt/exception handlers.
Estimating the required stack usage is not always easy, but methods for doing so exist, using either static or dynamic analysis. Some examples (partly toolchain specific) at:
https://www.keil.com/support/man/docs/armclang_intro/armclang_intro_hla1474359990839.htm
https://www.keil.com/appnotes/docs/apnt_316.asp for example.
Many default linker scripts automatically expand the heap to fill all remaining space available after static data and stack allocation. One notable exception is the Keil ARM-MDK toolchain, which requires you to explicitly set a heap size.
A linker script may reserve memory regions for other purposes; especially if the memory is not homogeneous - for example, on-chip MCU memory will typically be faster to access than external RAM, and may itself be subdivided across different buses, so for example there might be a small segment useful for DMA on a separate bus, avoiding bus contention and yielding more deterministic execution.
The use of dynamic memory (heap) allocation in embedded systems needs to be carefully considered (or even banned as #Lundin would suggest, but not all embedded systems are subject to the same constraints). There are a number of issues to consider, including:
Memory constraints - many embedded systems have very small memories; you have to consider the response, safety and functionality of the system in the event that an allocation request cannot be satisfied.
Memory leaks - your own code, your colleagues' code and third-party code may not be of as high a quality as you would hope; you need to be certain that the entire code base is free of memory leaks (failing to deallocate/free memory appropriately).
Determinism - most heap allocators take a variable and non-deterministic length of time to allocate memory, and even freeing can be non-deterministic if it involves block consolidation.
Heap corruption - the owner of an allocated block can easily under/overrun an allocation and corrupt adjacent memory. Typically such memory contains the heap-management metadata for that block or other blocks, and the actual data of other allocations. Corrupting this data has non-deterministic effects on other code, most often unrelated to the code that caused the error, so it is common for the failure to occur some time later, in code unrelated to the event that caused it. Such bugs are hard to spot and resolve. If the heap metadata is corrupted, the error is often detected only when a later heap operation (alloc/free) fails.
Efficiency - heap allocations made by malloc() et al. are normally 8-byte aligned and have a block of prepended metadata. Some implementations may add a "buffer" region to help detect overruns (especially in debug builds). As such, making numerous allocations of very small blocks can be a remarkably inefficient use of a scarce resource.
Common strategies in embedded system to deal with these issues include:
Disallowing any dynamic memory allocations. This is common in safety critical and MISRA compliant applications for example.
Allowing dynamic memory allocation only during initialisation, and disallowing free(). This may seem counterintuitive, but can be useful where an application itself is "dynamic" and perhaps in some configurations not all tasks or device drivers etc. are started, where static allocation might lead to a great deal of unused/unusable memory.
Replacing the default heap with a deterministic memory allocation mechanism such as a fixed-block allocator (a minimal sketch follows this list). Often these have a separate API rather than overriding malloc/free, so not strictly a replacement, just a different solution.
Disallowing dynamic memory allocation in hard real-time critical code. This addresses only the determinism issue, but in systems with large memories, carefully designed code, and perhaps MMU protection of allocations, there may be mitigations for the other issues.
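A minimal sketch of the fixed-block idea (sizes and names are arbitrary; a real implementation would add thread safety, alignment handling and diagnostics as required):

    #include <cstddef>

    // All blocks have the same size, so allocation and freeing are O(1)
    // and deterministic. BLOCK_SIZE and BLOCK_COUNT are example values.
    constexpr std::size_t BLOCK_SIZE  = 32;
    constexpr std::size_t BLOCK_COUNT = 64;

    class FixedBlockPool {
    public:
        FixedBlockPool() {
            // Thread every block onto the free list.
            for (std::size_t i = 0; i < BLOCK_COUNT; ++i) {
                Node* n = reinterpret_cast<Node*>(&storage_[i * BLOCK_SIZE]);
                n->next = free_list_;
                free_list_ = n;
            }
        }

        void* allocate() {                 // returns nullptr when exhausted
            if (!free_list_) return nullptr;
            Node* n = free_list_;
            free_list_ = n->next;
            return n;
        }

        void deallocate(void* p) {         // caller must pass a block from this pool
            Node* n = static_cast<Node*>(p);
            n->next = free_list_;
            free_list_ = n;
        }

    private:
        struct Node { Node* next; };
        alignas(std::max_align_t) unsigned char storage_[BLOCK_SIZE * BLOCK_COUNT];
        Node* free_list_ = nullptr;
    };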
Basically the stack size is picked depending on the expected program size. For larger and more complex programs, you will want more stack. It also depends on architecture: 32-bitters will generally consume slightly more stack than 8- and 16-bitters. The exact value is picked based on experience, though once you know exactly how much RAM your program actually uses, you can increase the stack size to use most of the unused memory.
It's also customary to map the stack so that it grows into a harmless area upon overflow, such as non-mapped memory or flash. Ideally so that you get a hardware exception, "software interrupt" or similar when a stack overflow happens. You should never map it so that it grows into .data/.bss and overwrites other variables there.
As for the heap, the size is almost always set to 0 and the segment is removed completely from the linker script. Heap allocation is banned in almost every microcontroller application.
Stack and heap are part of your program itself. Their sizes depend on how your program is structured and written, and on how much memory it takes up; the rest of the free memory will serve as stack or heap depending on how you set it up.
In the linker script you can define these values.
I have several large std::vectors of chars (bytes loaded from binary files).
When my program runs out of memory, I need to be able to cull some of the memory used by these vectors. These vectors are almost the entirety of my memory usage, and they're just caches for local and network files, so it's safe to just grab the largest one and chop it in half or so.
The only thing is, I'm currently using vector::resize and vector::shrink_to_fit, but this seems to require more memory (I imagine for a reallocation at the new size), then a bunch of time (for destruction of the now-removed elements, which I thought would be free?), and then copying the remainder to the new allocation. Note that this is on Windows, in a debug build, so the elements might not be destroyed this way in a Release build or on other platforms.
Is there something I can do to just say "C++, please tell the OS that I no longer need the memory located past location N in this vector"?
Alternatively, is there another container I'd be better off using? I do need to have random access though, or put effort into designing a way to easily keep iterators pointing at the place I'll want to read next, which would be possible, just not easy, so I'd prefer not using a std::list.
resize and shrink_to_fit are your best bets as long as we are talking about standard C++, but, as you noticed, these may not help at all if you are in a low-memory situation to begin with: since the allocator interface does not provide a realloc-like operation, vector is forced to allocate a new block, copy the data into it and deallocate the old block.
Now, I see essentially four easy ways out:
drop whole vectors, not just parts of them, possibly using an LRU or stuff like that; working with big vectors, the C++ allocator normally just forwards to the OS's memory management calls, so the memory should go right back to the OS;
write your own container which uses malloc/realloc, or OS-specific functionality;
use std::deque instead; you lose guaranteed contiguousness of data, but, since deques normally allocate the space for data in distinct chunks, doing a resize + shrink_to_fit should be quite cheap - simply all the unused blocks at the end are freed, with no need for massive reallocations (a small sketch of this follows below);
just leave this job to the OS. As already stated in the comments, the OS already has a file cache, and in normal cases it can handle it better than you or me, even just for the fact that it has a better vision of how much physical memory is left for that, what files are "hot" for most applications, and so on. Also, since you are working in a virtual address space you cannot even guarantee that your memory will actually stay in RAM; the very moment that the machine goes in memory pressure and you aren't using some memory pages so often, they get swapped to disk, so all your performance gain is lost (and you wasted space on the paging file for stuff that is already found on disk).
An additional way may be to just use memory mapped files - the system will do its own caching as usual, but you avoid any syscall overhead as long as the file remains in memory.
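As a small illustration of the deque option (a sketch only; whether the freed chunks actually go back to the OS still depends on the implementation):

    #include <deque>
    #include <cstddef>

    // A byte cache backed by std::deque instead of std::vector. Shrinking only
    // needs to drop the chunks past the new size, so no large contiguous
    // reallocation is required.
    void cull_cache(std::deque<char>& cache, std::size_t new_size) {
        if (cache.size() > new_size) {
            cache.resize(new_size);   // destroy the tail elements
            cache.shrink_to_fit();    // non-binding request to release unused chunks
        }
    }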
std::vector::shrink_to_fit() cannot result in more memory being used; if it does, that's a bug.
C++11 defines shrink_to_fit() as follows:
void shrink_to_fit(); Remarks: shrink_to_fit is a non-binding request to reduce capacity() to size(). [ Note: The request is non-binding to allow latitude for implementation-specific optimizations. — end note ]
As the note indicates, shrink_to_fit() may, but need not, actually free memory, and the standard gives C++ implementations a free hand to recycle and optimize memory usage internally, as they see fit. C++ does not make it mandatory for shrink_to_fit(), and the like, to result in memory actually being released to the operating system, and in many cases the C++ runtime library may not actually be able to, as I'll get to in a moment. The C++ runtime library is allowed to take the freed memory, stash it away internally, and reuse it automatically for future memory allocation requests (explicit news, or container growth).
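You can observe this for yourself; the result depends entirely on the standard library and heap in use, so treat this only as a probe:

    #include <iostream>
    #include <vector>

    // shrink_to_fit() is only a request: capacity() may or may not drop to
    // size(), and even when it does, pages are not necessarily returned to
    // the OS (that depends on the allocator and the heap).
    int main() {
        std::vector<char> v(10'000'000);   // ~10 MB owned by the vector
        v.resize(100);                     // logical size shrinks, capacity stays
        std::cout << "before: capacity = " << v.capacity() << '\n';
        v.shrink_to_fit();                 // non-binding request
        std::cout << "after:  capacity = " << v.capacity() << '\n';
    }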
Most modern operating systems are not designed to allocate and release memory blocks of arbitrary sizes. Details differ, but typically an operating system allocates and deallocates memory in whole chunks, typically 4 KB or larger, at page-aligned addresses. If you allocate a new object that's only a few hundred bytes long, the C++ library will request an entire page of memory to be allocated, take the first hundred bytes of it for the new object, then keep the spare amount of memory for future new requests.
Similarly, even if shrink_to_fit(), or delete, frees up a few hundred bytes, that memory can't go back to the operating system immediately, but only when an entire 4 KB contiguous memory range (or whatever allocation page size the operating system uses), suitably aligned, is completely unused. Only then can the process release that page back to the operating system. Until then, the library keeps track of the freed memory ranges, to be used for future new requests, without asking the operating system to allocate more pages of memory to the process.
I have a question about low level stuff of dynamic memory allocation.
I understand that there may be different implementations, but I need to understand the fundamental ideas.
So,
when a modern OS memory allocator or the equivalent allocates a block of memory, this block needs to be freed.
But before that happens, some system needs to exist to control the allocation process.
I need to know:
How this system keeps track of allocated and unallocated memory. I mean, the system needs to know what blocks have already been allocated and what their sizes are, in order to use this information in the allocation and deallocation process.
Is this process supported by modern hardware, like allocation bits or something like that?
Or is some kind of data structure used to store allocation information?
If there is a data structure, how much memory does it use compared to the allocated memory?
Is it better to allocate memory in big chunks rather than small ones and why?
Any answer that can help reveal fundamental implementation details is appreciated.
If there is a need for code examples, C or C++ will be just fine.
"How this system keeps track of allocated and unallocated memory." For non-embedded systems with operating systems, a virtual page table, which the OS is in charge of organizing (with hardware TLB support of course), tracks the memory usage of programs.
AS FAR AS I KNOW (and the community will surely yell at me if I'm mistaken), tracking individual malloc() sizes and locations has a good number of implementations and is runtime-library dependent. Generally speaking, whenever you call malloc(), the size and location is stored in a table. Whenever you call free(), the table entry for the provided pointer is looked up. If it is found, that entry is removed. If it is not found, the free() is ignored (which also indicates a possible memory leak).
When all malloc() entries in a virtual page are freed, that virtual page is then released back to the OS (this also implies that free() does not always release memory back to the OS since the virtual page may still have other malloc() entries in it). If there is not enough space within a given virtual page to support another malloc() of a specified size, another virtual page is requested from the OS.
Embedded processors usually don't have operating systems, virtual page tables, nor multiple processes. In this case, virtual memory is not used. Instead, the entire memory of the embedded processor is treated like one large virtual page (although the addresses are actually physical addresses) and memory management follows a similar process as previously described.
Here is a similar stack overflow question with more in-depth answers.
"Is it better to allocate memory in big chunks rather than small ones and why?" Allocate as much memory as you need, no more and no less. Compiler optimizations are very smart, and memory will almost always be managed more efficiently (i.e. reducing memory fragmentation) than the programmer can manually do. This is especially true in a non-embedded environment.
Here is a similar stack overflow question with more in-depth answers (note that it pertains to C and not C++, however it is still relevant to this discussion).
Well, there is more than one way to achieve that.
I once had to write a malloc() (and free()) implementation for educational purposes.
This is from my experience; real-world implementations surely vary.
I used a doubly linked list.
The memory chunk returned to the user after calling malloc() was in fact part of a struct containing the information relevant to my implementation (i.e. the next and prev pointers, and an is_used byte).
So when a user requested N bytes, I allocated N + sizeof(my_struct) bytes, hiding the next and prev pointers at the beginning of the chunk and returning what was left to the user.
Surely, this is a poor design for a program that makes a lot of small allocations (because each allocation of N bytes actually takes N + 2 pointers + 1 byte).
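In code, the header described above looked roughly like this (a sketch with illustrative field names, not necessarily the exact ones I used):

    #include <cstddef>

    // Each chunk handed to the user is preceded by a small bookkeeping struct
    // linking it into a doubly linked list of chunks.
    struct chunk_header {
        chunk_header* prev;
        chunk_header* next;
        std::size_t   size;      // usable size of the chunk
        unsigned char is_used;   // 0 = free, 1 = in use
    };

    // The pointer returned to the caller points just past the header,
    // so free() can recover the header with simple pointer arithmetic.
    void* user_pointer(chunk_header* h) { return h + 1; }
    chunk_header* header_of(void* p)    { return static_cast<chunk_header*>(p) - 1; }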
For a real-world implementation, you can take a look at the code of a good, well-known memory allocator.
Normally there exist two different layers.
One layer lives at application level, usually as part of the C standard library. This is what you call through functions like malloc and free (or operator new in C++, which in turn usually calls malloc). This layer takes care of your allocations, but does not know about memory or where it comes from.
The other layer, at OS level, does not know or care anything about your allocations. It only maintains a list of fixed-size memory pages that have been reserved, allocated, and accessed, along with per-page information such as where each page maps to.
There are many different implementations for either layer, but in general it works like this:
When you allocate memory, the allocator (the "application level part") looks whether it has a matching block somewhere in its books that it can give to you (some allocators will split a larger block in two, if need be).
If it doesn't find a suitable block, it reserves a new block (usually much larger than what you ask for) from the operating system. sbrk or mmap on Linux, or VirtualAlloc on Windows would be typical examples of functions it might use for that effect.
This does very little apart from showing intent to the operating system, and generating some page table entries.
The allocator then (logically, in its books) splits up that large area into smaller pieces according to its normal mode of operation, finds a suitable block, and returns it to you. Note that this returned memory does not necessarily even exist as physical memory (though most allocators write some metadata into the first few bytes of each allocated unit, so they necessarily pre-fault those pages).
In the meantime, invisibly, a background task zeroes out memory pages that were once in use by some process but have since been freed. This happens all the time, on a tentative basis, since sooner or later someone will ask for memory (often, that's what the idle task does).
Once you access an address in the page that contains your allocated block for the first time, you generate a fault. The page table entry of this as-yet non-existent page (it logically exists, just not physically) is replaced with a reference to a page from the pool of zero pages. In the uncommon case that there is none left, for example if huge amounts of memory are being allocated all the time, the OS swaps out a page which it believes will not be accessed any time soon, zeroes it, and returns this one.
Now the page becomes part of your working set, it corresponds to actual physical memory, and it counts towards your process's quota. While your process is running, pages may be moved in and out of your working set, or may be paged out and in, as you exceed certain limits, and according to how much memory is needed and how it is accessed.
Once you call free, the allocator puts the freed area back into its books. It may tell the OS that it does not need the memory any more instead, but usually this does not happen as it is not really necessary and it is more efficient to keep around a little extra memory and reuse it. Also, it may not be easy to free the memory because usually the units that you allocate/deallocate do not directly correspond with the units the OS works with (and, in the case of sbrk they'd need to happen in the correct order, too).
When the process ends, the OS simply throws away all page table entries and adds all pages to the list of pages that the idle task will zero out. So the physical memory becomes available to the next process asking for some.
What is the difference between malloc() and HeapAlloc()? As far as I understand malloc allocates memory from the heap, just as HeapAlloc, right?
So what is the difference?
Actually, malloc() (and the other C runtime heap functions) are module-dependent, which means that if you call malloc() in code from one module (i.e. a DLL), then you should call free() within code of the same module, or you could suffer some pretty bad heap corruption (and this has been well documented). Using HeapAlloc() with GetProcessHeap() instead of malloc(), including overloading the new and delete operators to do so, allows you to pass dynamically allocated objects between modules without having to worry about memory corruption when memory is allocated in one module and freed in another, once the pointer to a block of memory has been passed across to an external module.
You are right that they both allocate memory from a heap. But there are differences:
malloc() is portable, part of the standard.
HeapAlloc() is not portable, it's a Windows API function.
It's quite possible that, on Windows, malloc would be implemented on top of HeapAlloc. I would expect malloc to be faster than HeapAlloc.
HeapAlloc has more flexibility than malloc. In particular it allows you to specify which heap you wish to allocate from. This caters for multiple heaps per process.
For almost all coding scenarios you would use malloc rather than HeapAlloc. Although since you tagged your question C++, I would expect you to be using new!
With Visual C++, the function malloc() or the operator new eventually calls HeapAlloc(). If you debug the code, you will find that the function _heap_alloc_base() (in the file malloc.c) calls return HeapAlloc(_crtheap, 0, size), where _crtheap is a global heap created with HeapCreate().
The function HeapAlloc() does a good job of minimizing the memory overhead, with a minimum of 8 bytes of overhead per allocation. The largest I have seen is 15 bytes per allocation, for allocations ranging from 1 byte to 100,000 bytes. Larger blocks have larger overhead; however, as a percentage of the total allocated it remains less than 2.5% of the payload.
I cannot comment on performance because I have not benchmarked HeapAlloc() against a custom-made routine; however, as far as the memory overhead of using HeapAlloc() goes, the overhead is amazingly low.
malloc is a function in the C standard library (and also in the C++ standard library).
HeapAlloc is a Windows API function.
The latter lets you specify the heap to allocate from, which I imagine can be useful for avoiding serialization of allocation requests in different threads (note the HEAP_NO_SERIALIZE flag).
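For example, creating and using a private heap might look like this (a sketch with minimal error handling; HEAP_NO_SERIALIZE is only safe if the heap is ever used from a single thread):

    #include <windows.h>

    int main() {
        // Growable private heap with the internal lock disabled.
        HANDLE heap = HeapCreate(HEAP_NO_SERIALIZE, 0, 0);
        if (!heap) return 1;

        void* p = HeapAlloc(heap, 0, 256);   // 256-byte block from that heap
        // ... use p ...
        HeapFree(heap, 0, p);

        HeapDestroy(heap);                   // releases the whole heap at once
        return 0;
    }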
In systems where multiple DLLs may come and go (via LoadLibrary/FreeLibrary), and where memory may be allocated in one DLL but freed in another (see the previous answer), HeapAlloc and related functions seem to be the least-common-denominator for successful memory sharing.
Thread safe, presumably highly optimized by PhDs galore, HeapAlloc appears to work in all kinds of situations where our not-so-shareable code using malloc/free would fail.
We are a C++ embedded shop, so we have overloaded operator new/delete across our system to use HeapAlloc(GetProcessHeap()), which can be stubbed (on target) or native (on Windows) for code portability.
So far, no problems, now that we have bypassed malloc/free, which seem indisputably DLL-specific, with a new "heap" for each DLL load.
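A sketch of the kind of global overloads meant here (illustrative only; a complete version would also overload the array and nothrow forms):

    #include <windows.h>
    #include <new>
    #include <cstddef>

    // Every new/delete in every module goes through the single process heap,
    // so a block allocated in one DLL can safely be freed in another.
    void* operator new(std::size_t size) {
        void* p = HeapAlloc(GetProcessHeap(), 0, size ? size : 1);
        if (!p) throw std::bad_alloc();
        return p;
    }

    void operator delete(void* p) noexcept {
        if (p) HeapFree(GetProcessHeap(), 0, p);
    }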
Additionally, you can refer to:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366705(v=vs.85).aspx
It states that you can enable some features of the heap managed by the WinAPI memory allocator, e.g. "HeapEnableTerminationOnCorruption".
As I understand it, this enables some basic heap-overflow protection, which may be considered an added value for your application in terms of security.
(e.g., as an app owner, I would prefer my app to crash rather than execute arbitrary code)
Another thing is that it might be useful in the early phases of development, so you can catch memory issues before going to production.
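Enabling it is typically a single call early in startup; a minimal sketch:

    #include <windows.h>

    // After this call, detected heap corruption terminates the process
    // instead of letting it limp on.
    int main() {
        HeapSetInformation(nullptr, HeapEnableTerminationOnCorruption, nullptr, 0);
        // ... rest of the application ...
        return 0;
    }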
malloc is a function exported by the C run-time library (CRT), which is compiler-specific.
The name of the C run-time library DLL changes from one Visual Studio version to another.
The HeapAlloc function is exported by kernel32.dll, which lives in the Windows folder.
This is what MS has to say about it: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366533(v=vs.85).aspx
One thing no one has mentioned thus far is: "The malloc function has the disadvantage of being run-time dependent. The new operator has the disadvantage of being compiler dependent and language dependent."
Also, "HeapAlloc can be instructed to raise an exception if memory could not be allocated"
So if you want your program to run with any CRT, or perhaps with no CRT at all, you'd use HeapAlloc. Perhaps the only people who would do such a thing are malware writers. Another use might be if you are writing a very memory-intensive application with specific memory allocation/usage patterns, where you'd rather write your own heap allocator than use the CRT one.