How are malloc and free implemented?

How are malloc and free implemented? - c++

I want to implement my own dynamic memory management system in order to add new features that help to manage memory in C++.
I use Windows (XP) and Linux (Ubuntu).
What is needed to implement functions like 'malloc' and 'free'?
I think that I have to use lowest level system calls.
For Windows, I have found the functions: GetProcessHeap, HeapAlloc, HeapCreate, HeapDestroy and HeapFree.
For Linux, I have not found any system calls for heap management. On Linux, malloc and free are system calls, are not they?
Thanks
Edit:
C++ does not provide garbage collector and garbage collector is slow. Some allocations are easy to free, but there are allocations that needs a garbage collector.
I want to implement these functions and add new features:
* Whenever free() be called, check if the pointer belongs to heap.
* Help with garbage collection. I have to store some information about the allocated block.
* Use multiple heaps (HeapCreate/HeapDestroy on Windows). I can delete an entire heap with its allocated blocks quickly.

On linux, malloc and free are not system calls. malloc/free obtains memory from the kernel by extending and shrinking(if it can) the data segment using the brk system calls as well as obtaining anonymous memory with mmap - and malloc manages memory within those regions. Some basic information any many great references can be found here

In *nix, malloc() is implemented at the C library level. It uses brk()/sbrk() to grow/shrink the data segment, and mmap/munmap to request/release memory mappings. See this page for a description of the malloc implementation used in glibc and uClibc.

If you are simply wrapping the system calls then you are probably not gaining anything on using the standard malloc - thats all they are doing.
It's more common to malloc (or HeapAlloc() etc ) a single block of memory at the start of the program and manage the allocation into this yourself, this can be more efficient if you know you are going to be creating/discarding a lot of small blocks of memory regularly.

brk is the system call used on Linux to implement malloc and free. Try the man page for information.
You've got the Windows stuff down already.
Seeing the other answers here, I would like to note that you are probably reinventing the wheel; there are many good malloc implementations out there already. But programming malloc is a good thought exercise - take a look here for a nice homework assignment (originally CMU code) implementing the same. Their shell gives you a bit more than the Linux OS actually does, though :-).

garbage collector is slow
This is a completely meaningless statement. In many practical situations, programs can get a significant performance boost by using a Garbage Collector, especially in multi-threaded scenarios. In many other situations, Garbage Collectors do incur a performance penalty.

Try http://www.dent.med.uni-muenchen.de/~wmglo/malloc-slides.html for pointers.
This is a brief performance comparison, with pointers to eight different malloc/free implementations. A nice starting point, because a few good reference statistics will help you determine whether you've improved on the available implementations - or not.

Related

Allocating Memory to a Program Upon Initialization in C++?

I would like to allocate a set amount of memory for the program upon initialization so that other programs cannot steal memory from it. Essentially, I would like to create a Heap for my program (without having to program a heap module all for myself).
If this is not possible, can you please refer me to a heap module that I can import into my project?
Using C++17.
Edit: More specifically, I am trying to for example specify that it is only allowed to malloc 4MB of data for example. If it tries to allocate anymore, it should throw an error.

What you ask is not possible with the features provided by ISO C++.
However, on most common platforms, reserving physical RAM is possible using platform-specific extensions. For example, Linux provides the function mlock and Microsoft Windows provides the function VirtualLock. But, in order to use these functions, you must either
know which memory pages the default allocator is using for memory allocation, which can get messy and complicated, or
use your own implementation of a memory allocator, so that it can itself call mlock/VirtualLock whenever it receives memory from the operating system.
Your own implementation of a memory allocator could be as simple as forwarding all memory allocation request to the operating system's kernel, for example using mmap on Linux or VirtualAlloc on Windows. However, this has the disadvantage that the granularity of all memory allocation requests is the size of a memory page, which on most systems is at least 4096 bytes. This means that even very small memory allocation requests of a few bytes will actually take 4096 bytes of memory. This would be a big waste of memory. Also, in your question, you stated that you wanted to preallocate a certain amount of memory when you start your application, so that you can use that memory later to satisfy smaller memory allocation requests. This cannot be done using the method described above.
Therefore, you may want to consider using a "proper" memory allocator implementation, which is able to satisfy several smaller allocation requests using a single memory page. See this list on Wikipedia for a list of common implementations.
That said, what you describe may be an XY problem, depending on what operating system you are using. For example, in contrast to Windows, Linux will typically overcommit memory. This means that the Linux kernel will allow applications to allocate more memory than is actually available, on the assumption that most applications will not use all the memory they request. Therefore, a call to std::malloc or new will seldom fail on Linux (but it is still possible, depending on the configuration). Instead, under low memory conditions, the Linux OOM killer (out of memory killer) will start killing processes that are taking up large amounts of memory, in order to free up memory and to keep the system running.
For this reason, the methods described above are likely to work on Microsoft Windows, but on Linux, they could be counterproductive, as they would make your process more likely to fall prey to the OOM killer.
However, even if you are able to accomplish what you want using the methods described above, I generally don't recommend that you use these methods, as this behavior is unfair towards the other processes in the system. Generally, you should leave the task of deciding which process gets (fast) physical memory and which process gets (slow) swap space to the operating system, as the operating system can do a better job of fairly distributing its resources among its processes.

If you want to force actual allocation of memory pages to your process, there's no way around managing your own memory.
In C++, the canonical way to do this would be to write an implementation for operator new() and operator delete() (the global ones!) which are responsible to perform the actual memory allocation. The function signatures are:
void* operator new (size_t size);
void operator delete (void *pointer);
and you'll need to include the #include <new> header.
Your implementation can do its work via one of three possible routes:
It allocates the memory using the C function malloc(), and immediately touches each memory page by writing a value to it. This forces the system kernel to actually back the memory region with real memory.
It allocates the memory using malloc(), and proceeds to call mlockall(). This is the nuclear option for when you absolutely must avoid all paging, including paging of code segments and shared libraries.
It asks the kernel directly for some chunks of memory using mmap() and proceeds to lock them into RAM via mlock(). The effect is similar to the previous option, but it is targeted only at the memory you allocated for your operator new() implementation.
The first method works independent of the OS kernel, the other two assume a Linux kernel.
With GCC, you can perform the memory allocation before main() is called by using the __attribute__((constructor)).
Writing such a memory allocator is not rocket science. It's not even a lot of code if done right. I once wrote an operator new()/operator delete() implementation that fits into 170 lines, including all my special features, comments, empty lines, and the license declaration. It's really not that hard.

I would like to allocate a set amount of memory for the program upon initialization so that other programs cannot steal memory from it.
Why would you want to do that?
it is not your business to decide if your program is more important than others !
Imagine your program running in parallel with some printing utility driving the printer. This is a common occurrence: I have downloaded some long PDF document (e.g. several hundred pages, like the C++ standard n3337), and I want to print it on paper to study it in a train, an airplane, at home and annotate it with a pencil and paper. The printing is likely to last more than an hour, and require computing resources (e.g. on Linux some CUPS printer driver converting PDF to PCL). During the printing, I could use your program.
If I am a user of your program, you have decided (at my place) that printing that document is less important for me than using your program (while the printer is slowly spitting pages).
Leave the allocation and management of memory to the operating system of your user.
There are of course important exceptions to that common sense rule. A typical medical robot used in neurosurgery has some embedded software with constraints different of a web server software. See also this draft report. For Linux, read Advanced Linux Programming then syscalls(2).
More specifically, I am trying to for example specify that it is only allowed to malloc 4MB of data for example.
This is really simple. Some OSes provide the ability to limit resources (on Linux, see setrlimit(2)...). Write your own malloc routine, above operating system specific primitives such as (on Linux) mmap(2). See also this, this and that answers (all focused on Linux; adapt them to your particular operating system). You probably can find open source implementations of malloc (on github or gitlab) for your particular operating system. For Linux, look here, then study the source code of glibc or musl-libc. In C++, study the source code of GCC or Clang (probably ::operator new is using malloc)

What decides where on the heap memory is allocated?

Let me clear up: I understand how new and delete (and delete[]) work. I understand what the stack is, and I understand when to allocate memory on the stack and on the heap.
What I don't understand, however, is: where on the heap is memory allocated. I know we're supposed to look at the heap as this big pool of pretty much limitless RAM, but surely that's not the case.
What is in control of choosing where on the heap memory is stored and how does it choose that?
Also: the term "returning memory to the OS" is one I come across quite often. Does this mean that the heap is shared between all processes?
The reason I care about all this is because I want to learn more about memory fragmentation. I figured it'd be a good idea to know how the heap works before I learn how to deal with memory fragmentation, because I don't have enough experience with memory allocation, nor C++ to dive straight into that.

The memory is managed by the OS. So the answer depends on the OS/Plattform that is used. The C++ specification does not specify how memory on a lower level is allocated/freed, it specifies it in from of the lifetime.
While multi-user desktop/server/phone OS (like Windows, Linux, macOS, Android, …) have similar ways to how memory is managed, it could be completely different on embedded systems.
What is in control of choosing where on the heap memory is stored and how does it choose that?
Its the OS that is responsible for that. How exactly depends - as already said - on the OS. The OS could also be a thin layer in the form of a combination of the runtime library and minimal OS like includeos
Does this mean that the heap is shared between all processes?
Depends on the point of view. The address space is - for multiuser systems - in general not shared between processes. The OS ensures that one process cannot access memory of another process, which is ensured through virtual address spaces. But the OS can distribute the whole RAM among all processes.
For embedded systems, it could even be the case, that each process has a fixed amount a preallocated memory - that is not shared between processes - and with no way to allocated new memory or free memory. And then it is up to the developer to manage that preallocated memory by themselves by providing custom allocators to the objects of the stdlib, and to construct in allocated storage.
I want to learn more about memory fragmentation
There are two ways of fragmentation. The one is given by the memory addresses exposed by the OS to the C++ runtime. And the one on the hardware/OS side (which could be the same for embedded system) . How and in which form the memory might be fragmented organized by the OS can't be determined using the function provided by the stdlib. And how the fragmentation of the address spaces of the process behaves, depends again on the os and the also on the used stdlib.

None of these details are specified in the C++ specification standard. Each C++ implementation is free to implement these details in whichever way that works for it, as long as the end result is agreeable with the standard.
Each C++ compiler, and operating system implements these low level details in its own unique way. There is no specific answer to these questions that apply to every C++ compiler and every operating system. Over time, a lot of research has went into profiling and optimizing memory allocation and deallocation algorithms for a typical C++ application, and there are some tailored C++ implementation that offer alternative memory allocation algorithm that each application will choose, that it thinks will work best for it. Of course, none of this is covered by the C++ standard.
Of course, all memory in your computer must be shared between all processes that are running on it, and your operating system is responsible for divvying it up and parceling it out to all the processes, when they request more memory. All that "returning memory to the OS" means is that a process's memory allocator has determined that it no longer needs a sufficiently large continuous memory range, that was used before but not any more, and notifies the operating system that it no longer uses it and it can be reassigned to another process.

What decides where on the heap memory is allocated?
From the perspective of a C++ programmer: It is decided by the implementation (of the C++ language).
From the perspective of a C++ standard library implementer (as an example of what may hypothetically be true for some implementation): It is decided by malloc which is part of the C standard library.
From the perspective of malloc implementer (as an example of what may hypothetically be true for some implementation): The location of heap in general is decided by the operating system (for example, on Linux systems it might be whatever address is returned by sbrk). The location of any individual allocation is up to the implementer to decide as long as they stay within the limitations established by the operating system and the specification of the language.
Note that heap memory is called "free store" in C++. I think this is to avoid confusion with the heap data structure which is unrelated.
I understand what the stack is
Note that there is no such thing as "stack memory" in the C++ language. The fact that C++ implementations store automatic variables in such manner is an implementation detail.

The heap is indeed shared between processes, but in C++ the delete keyword does not return the memory to the operating system, but keeps it to reuse later on. The location of the allocated memory is dependent on how much memory you want to access, there has to be enough space and how the OS handles memory allocations, it can be one on first, best and worst hit (Read more on that topic on google). The name RAM basically tells you where to search for your memory :D
It is however possible to get the same memory location when you have a small program and restart it multiple times.

malloc() vs. HeapAlloc()

What is the difference between malloc() and HeapAlloc()? As far as I understand malloc allocates memory from the heap, just as HeapAlloc, right?
So what is the difference?

Actually, malloc() (and other C runtime heap functions) are module dependant, which means that if you call malloc() in code from one module (i.e. a DLL), then you should call free() within code of the same module or you could suffer some pretty bad heap corruption (and this has been well documented). Using HeapAlloc() with GetProcessHeap() instead of malloc(), including overloading new and delete operators to make use of such, allow you to pass dynamically allocated objects between modules and not have to worry about memory corruption if memory is allocated in code of one module and freed in code of another module once the pointer to a block of memory has been passed across to an external module.

You are right that they both allocate memory from a heap. But there are differences:
malloc() is portable, part of the standard.
HeapAlloc() is not portable, it's a Windows API function.
It's quite possible that, on Windows, malloc would be implemented on top of HeapAlloc. I would expect malloc to be faster than HeapAlloc.
HeapAlloc has more flexibility than malloc. In particular it allows you to specify which heap you wish to allocate from. This caters for multiple heaps per process.
For almost all coding scenarios you would use malloc rather than HeapAlloc. Although since you tagged your question C++, I would expect you to be using new!

With Visual C++, the function malloc() or the operator new eventually calls HeapAlloc(). If you debug the code, you will find the the function _heap_alloc_base() (in the file malloc.c) is calling return HeapAlloc(_crtheap, 0, size) where _crtheap is a global heap created with HeapCreate().
The function HeapAlloc() does a good job to minimize the memory overhead, with a minimum of 8 bytes overhead per allocation. The largest I have seen is 15 bytes per allocation, for allocations ranging from 1 byte to 100,000 bytes. Larger blocks have larger overhead, however as a percent of the total allocated it remains less than 2.5% of the payload.
I cannot comment on performance because I have not benchmarked the HeapAlloc() with a custom made routine, however as far as the memory overhead of using HeapAlloc(), the overhead is amazingly low.

malloc is a function in the C standard library (and also in the C++ standard library).
HeapAlloc is a Windows API function.
The latter lets you specify the heap to allocate from, which I imagine can be useful for avoiding serialization of allocation requests in different threads (note the HEAP_NO_SERIALIZE flag).

In systems where multiple DLLs may come and go (via LoadLibrary/Freelibrary), and when memory may be allocated within one DLL, but freed in another (see previous answer), HeapAlloc and related functions seem to be the least-common-denominator for successful memory sharing.
Thread safe, presumably highly optimized by PhDs galore, HeapAlloc appears to work in all kinds of situations where our not-so-shareable code using malloc/free would fail.
We are a C++ embedded shop, so we have overloaded operator new/delete across our system to use the HeapAlloc( GetProcessHeap( ) ) which can stubbed (on target) or native (to windows) for code portability.
So far no problems now that we have bypassed malloc/free which seem indisputably DLL specifically, a new "heap" for each DLL load.

Additionally, you can refer to:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366705(v=vs.85).aspx
Which stands that you can enable some features of the HEAP managed by WinApi memory allocator, eg "HeapEnableTerminationOnCorruption".
As I understand, it makes some basic heap overflow protections which may be considered as an added value to your application in terms of security.
(eg, I would prefer to crash my app (as an app owner) rather than execute arbitrary code)
Other thing is that it might be useful in early phase of development, so you could catch memory issues before going to the production.

malloc is exported function by C run-time library(CRT) which is compiler specific.
C run-time library dll name changes from visual studio versions to versions.
HeapAlloc function is exported by kernel32.dll present in windows folder.

This is what MS has to say about it: http://msdn.microsoft.com/en-us/library/windows/desktop/aa366533(v=vs.85).aspx
One thing one one mentioned thus far is: "The malloc function has the disadvantage of being run-time dependent. The new operator has the disadvantage of being compiler dependent and language dependent."
Also, "HeapAlloc can be instructed to raise an exception if memory could not be allocated"
So if you want your program to run with any CRT, or perhaps no CRT at all, you'd use HeapAlloc. Perhaps only people who would do such thing would be malware writers. Another use might be if you are writing a very memory intensive application with specific memory allocation/usage patterns that you'd rather write your own heap allocator instead of using a CRT one.

Can multithreading speed up memory allocation?

I'm working with an 8 core processor, and am using Boost threads to run a large program.
Logically, the program can be split into groups, where each group is run by a thread.
Inside each group, some classes invoke the 'new' operator a total of 10000 times.
Rational Quantify shows that the 'new' memory allocation is taking up the maximum processing time when the program runs, and is slowing down the entire program.
One way I can speed up the system could be to use threads inside each 'group', so that the 10000 memory allocations can happen in parallel.
I'm unclear of how the memory allocation will be managed here. Will the OS scheduler really be able to allocate memory in parallel?

Standard CRT
While with older of Visual Studio the default CRT allocator was blocking, this is no longer true at least for Visual Studio 2010 and newer, which calls corresponding OS functions directly. The Windows heap manager was blocking until Widows XP, in XP the optional Low Fragmentation Heap is not blocking, while the default one is, and newer OSes (Vista/Win7) use LFH by default. The performance of recent (Windows 7) allocators is very good, comparable to scalable replacements listed below (you still might prefer them if targeting older platforms or when you need some other features they provide). There exist several multiple "scalable allocators", with different licenses and different drawbacks. I think on Linux the default runtime library already uses a scalable allocator (some variant of PTMalloc).
Scalable replacements
I know about:
HOARD (GNU + commercial licenses)
MicroQuill SmartHeap for SMP (commercial license)
Google Perf Tools TCMalloc (BSD license)
NedMalloc (BSD license)
JemAlloc (BSD license)
PTMalloc (GNU, no Windows port yet?)
Intel Thread Building Blocks (GNU, commercial)
You might want to check Scalable memory allocator experiences for my experiences with trying to use some of them in a Windows project.
In practice most of them work by having a per thread cache and per thread pre-allocated regions for allocations, which means that small allocations most often happen inside of a context of thread only, OS services are called only infrequently.

Dynamic allocation of memory uses the heap of the application/module/process (but not thread). The heap can only handle one allocation request at a time. If you try to allocate memory in "parallel" threads, they will be handled in due order by the heap. You will not get a behaviour like: one thread is waiting to get its memory while another can ask for some, while a third one is getting some. The threads will have to line-up in queue to get their chunk of memory.
What you would need is a pool of heaps. Use whichever heap is not busy at the moment to allocate the memory. But then, you have to watch out throughout the life of this variable such that it does not get de-allocated on another heap (that would cause a crash).
I know that Win32 API has functions such as GetProcessHeap(), CreateHeap(), HeapAlloc() and HeapFree(), that allow you to create a new heap and allocate/deallocate memory from a specific heap HANDLE. I don't know of an equivalence in other operating systems (I have looked for them, but to no avail).
You should, of course, try to avoid doing frequent dynamic allocations. But if you can't, you might consider (for portability) to create your own "heap" class (doesn't have to be a heap per se, just a very efficient allocator) that can manage a large chunk of memory and surely a smart pointer class that would hold a reference to the heap from which it came. This would enable you to use multiple heaps (make sure they are thread-safe).

There are 2 scalable drop-in replacements for malloc that I know of:
Google's tcmalloc
Facebook's jemalloc (link to a performance study comparing to tcmalloc)
I don't have any experience with Hoard (which performed poorly in the study), but Emery Berger lurks on this site and was astonished by the results. He said he would have a look and I surmise there might have been some specifics to either the test or implementation that "trapped" Hoard as the general feedback is usually good.
One word of caution with jemalloc, it can waste a bit of space when you rapidly create then discard threads (as it creates a new pool for each thread you allocate from). If your threads are stable, there should not be any issue with this.

I believe the short answer to your question is : yes, probably. And as already pointed out by several people here there are ways to achieve this.
Aside from your question and the answers already posted here, it would be good to start with your expectations on improvements, because that will pretty much tell which path to take. Maybe you need to be 100x faster. Also, do you see yourself doing speed improvements in the near future as well or is there a level which will be good enough? Not knowing your application or problem domain it's difficult to also advice you specifically. Are you for instance in a problem domain where speed continuously have to be improved?
One good thing to start off with when doing performance improvements is to question if you need to do things the way you currently do it? In this case, can you pre-allocate objects? Is there a maximum number of X objects in the system? Could you re-use objects? All of this is better, because you don't necessarily need to do allocations on the critical path. E.g. if you can re-use objects, a custom allocator with pre-allocated objects would work well. Also, what OS are you on?
If you don't have concrete expectations or a certain level of performance, just start experimenting with any of the advices here and you'll find out more.
Good luck!

Roll your own non-multi-threaded new memory allocator a distinct copy of which each thread has.
(you can override new and delete)
So it's allocating in large chunks that it works through and doesn't need any locking as each is owned by a single thread.
limit your threads to the number of cores you have available.

new is pretty much blocking, it has to find the next free bit of memory which is tricky to do if you have lots of threads all asking for that at once.
Memory allocation is slow - if you are doing it more than a few times, especially on lots of threads then you need a redesign. Can you pre-allocate enough space at the start, can you just allocate a big chunk with 'new' and then partition it out yourself?

You need to check your compiler documentation whether it makes the allocator thread safe or not. If it does not, then you will need to overload your new operator and make it thread safe.
Else it will result in either a segfault or UB.

On some platforms like Windows, access to the global heap is serialized by the OS. Having a thread-separate heap could substantially improve allocation times.
Of course, in this case, it might be worth questioning whether or not you genuinely need heap allocation as opposed to some other form of dynamic allocation.

You may want to take a look at The Hoard Memory Allocator: "is a drop-in replacement for malloc() that can dramatically improve application performance, especially for multithreaded programs running on multiprocessors."

The best what you can try to reach ~8 memory allocation in parallel (since you have 8 physical cores), not 10000 as you wrote
standard malloc uses mutex and standard STL allocator does the same. Therefore it will not speed up automatically when you introduce threading.
Nevertheless, you can use another malloc library (google for e.g. "ptmalloc") which does not use global locking. if you allocate using STL (e.g. allocate strings, vectors) you have to write your own allocator.
Rather interesting article: http://developers.sun.com/solaris/articles/multiproc/multiproc.html

About Memory Management When mentioning C++

When someone mention the memory management the c++ is capable of doing, how can i see this thing? is this done in theory like guessing?
i took a logical design intro course and it covered the systems of numbers and boolean algebra and combinational logic,will this help?
so say in Visual Studio , is there some kind of a tool to visualize the memory,i hope im not being ridiculous here ?
thank you.

C++ has a variety of memory areas:
space used for global and static variables, which is pre-allocated by the compiler
"stack" memory that's used for preserving caller context during function calls, passing some function arguments (others may fit in CPU registers), and local variables
"heap" memory that's allocated using new or new[](C++'s preferred approach) or malloc (a lower-level function inherited from C), and released with delete, delete[] or free respectively.
The heap is important in that it supports run-time requests for arbitrary amounts of memory, and the usage persists until delete or free is explicitly used, rather than being tied to the lifetime of particular function calls as per stack memory.
I'm not aware of any useful tools for visualising and categorising the overall memory usage of a running C++ program, less still for relating that back to which pointers in the source code currently have how much memory associated with them. As a very general guideline, it's encouraged to write code in such a way that pointers are only introduced when the program is ready to point them at something, and they go out of scope when they will no longer pointer at something. When that's impractical, it can be useful to set them to NULL (0), so that if you're monitoring the executing program in a debugger you can tell the pointer isn't meant to point to legitimate data at that point.

Memory management is not something that you can easily visualize while you are programming. Instead, it refers to how your program is allocating and freeing memory while it is running. Many debuggers will provide a way to halt a program while it is running and view information about the dynamic memory that it has allocated. You can plan your classes and interfaces with proper memory management techniques, but it's not as simple as "hit this button for a chart of your memory usage".
You can also implement something like this to keep track of your memory allocations and warn you about anything that your program didn't free. A garbage collector can free you from some of the hassles associated with memory management.

In Visual Studio, there is a Memory Window (Alt+6) that will let you read/write memory manually provided it is a valid memory location for the operation you are trying to do, during debugging.
On Windows platform, you can get a first initial feel of memory management using tools like perfmon.exe, taskmgr.exe and many other tools from sysinternals

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js