I have a question about the low-level mechanics of dynamic memory allocation.
I understand that there may be different implementations, but I need to understand the fundamental ideas.
So,
when a modern OS memory allocator or the equivalent allocates a block of memory, this block needs to be freed.
But before that happens, some system needs to exist to control the allocation process.
I need to know:
How this system keeps track of allocated and unallocated memory. I mean, the system needs to know which blocks have already been allocated and what their sizes are, so it can use this information in the allocation and deallocation process.
Is this process supported by modern hardware, like allocation bits or something like that?
Or is some kind of data structure used to store allocation information?
If there is a data structure, how much memory does it use compared to the allocated memory?
Is it better to allocate memory in big chunks rather than small ones and why?
Any answer that can help reveal fundamental implementation details is appreciated.
If there is a need for code examples, C or C++ will be just fine.
"How this system keeps track of allocated and unallocated memory." For non-embedded systems with operating systems, a virtual page table, which the OS is in charge of organizing (with hardware TLB support of course), tracks the memory usage of programs.
AS FAR AS I KNOW (and the community will surely yell at me if I'm mistaken), tracking individual malloc() sizes and locations has a good number of implementations and is runtime-library dependent. Generally speaking, whenever you call malloc(), the size and location are stored in a table. Whenever you call free(), the table entry for the provided pointer is looked up. If it is found, that entry is removed. If it is not found, the free() is ignored (which also indicates a possible memory leak).
When all malloc() entries in a virtual page are freed, that virtual page is then released back to the OS (this also implies that free() does not always release memory back to the OS since the virtual page may still have other malloc() entries in it). If there is not enough space within a given virtual page to support another malloc() of a specified size, another virtual page is requested from the OS.
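As a toy sketch of that bookkeeping idea (purely illustrative; real runtime libraries use far more efficient structures, and all names here are made up), a wrapper around malloc()/free() might look like this:

#include <cstdio>
#include <cstdlib>
#include <map>

static std::map<void*, std::size_t> g_table; // pointer -> size of each live allocation

void* tracked_malloc(std::size_t size) {
    void* p = std::malloc(size);
    if (p) g_table[p] = size;          // record location and size in the table
    return p;
}

void tracked_free(void* p) {
    auto it = g_table.find(p);
    if (it == g_table.end()) return;   // unknown pointer: ignore it
    g_table.erase(it);                 // drop the table entry
    std::free(p);
}

int main() {
    void* a = tracked_malloc(64);
    std::printf("tracked allocations: %zu\n", g_table.size());
    tracked_free(a);
    return 0;
}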
Embedded processors usually don't have operating systems, virtual page tables, nor multiple processes. In this case, virtual memory is not used. Instead, the entire memory of the embedded processor is treated like one large virtual page (although the addresses are actually physical addresses) and memory management follows a similar process as previously described.
Here is a similar stack overflow question with more in-depth answers.
"Is it better to allocate memory in big chunks rather than small ones and why?" Allocate as much memory as you need, no more and no less. Compiler optimizations are very smart, and memory will almost always be managed more efficiently (i.e. reducing memory fragmentation) than the programmer can manually do. This is especially true in a non-embedded environment.
Here is a similar stack overflow question with more in-depth answers (note that it pertains to C and not C++, however it is still relevant to this discussion).
Well, there is more than one way to achieve that.
I once had to write a malloc() (and free()) implementation for educational purposes.
This is from my experience, and real-world implementations surely vary.
I used a doubly linked list.
The memory chunks returned to the user after calling malloc() in fact began with a struct containing information relevant to my implementation (i.e. the next and prev pointers, and an is_used byte).
So when a user requested N bytes, I allocated N + sizeof(my_struct) bytes, hiding the next and prev pointers at the beginning of the chunk, and returning what was left to the user.
Surely, this is a poor design for a program that makes a lot of small allocations (because each allocation takes up N bytes + 2 pointers + 1 byte of overhead).
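A minimal sketch of that header layout (names made up, and a size field added for illustration):

#include <cstddef>

struct chunk_header {            // hidden at the start of every chunk
    chunk_header* prev;
    chunk_header* next;
    std::size_t   size;          // user-visible size of the chunk
    unsigned char is_used;
};

// Recover the hidden header from the pointer handed to the user.
chunk_header* header_of(void* user_ptr) {
    return reinterpret_cast<chunk_header*>(
        static_cast<char*>(user_ptr) - sizeof(chunk_header));
}

// Compute the pointer handed to the user from its header.
void* user_ptr_of(chunk_header* h) {
    return reinterpret_cast<char*>(h) + sizeof(chunk_header);
}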
For a real-world implementation, you can take a look at the code of a good, well-known memory allocator.
Normally there exist two different layers.
One layer lives at application level, usually as part of the C standard library. This is what you call through functions like malloc and free (or operator new in C++, which in turn usually calls malloc). This layer takes care of your allocations, but does not know about memory or where it comes from.
The other layer, at OS level, does not know or care about your allocations. It only maintains a list of fixed-size memory pages that have been reserved, allocated, and accessed, and, for each page, information such as where it maps to.
There are many different implementations for either layer, but in general it works like this:
When you allocate memory, the allocator (the "application level part") looks whether it has a matching block somewhere in its books that it can give to you (some allocators will split a larger block in two, if need be).
If it doesn't find a suitable block, it reserves a new block (usually much larger than what you ask for) from the operating system. sbrk or mmap on Linux, or VirtualAlloc on Windows would be typical examples of functions it might use for that effect.
This does very little apart from showing intent to the operating system, and generating some page table entries.
The allocator then (logically, in its books) splits up that large area into smaller pieces according to its normal mode of operation, finds a suitable block, and returns it to you. Note that this returned memory does not necessarily even exist as physical memory (though most allocators write some metadata into the first few bytes of each allocated unit, so they necessarily pre-fault those pages).
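As a hedged sketch, the "reserve a new block" step might look like this on Linux (the Windows analogue would be VirtualAlloc):

#include <sys/mman.h>
#include <cstddef>

// Ask the OS for a large region. This mostly just creates page table
// entries; physical pages are faulted in on first access, as described below.
void* reserve_arena(std::size_t bytes) {
    void* p = mmap(nullptr, bytes,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS,  // anonymous: not backed by a file
                   -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}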
In the meantime, invisibly, a background task zeroes out memory pages that were once in use by some process but have since been freed. This happens all the time, on a tentative basis, since sooner or later someone will ask for memory (often, that's what the idle task does).
Once you access an address in the page that contains your allocated block for the first time, you generate a fault. The page table entry of this yet non-existent page (it logically exists, just not physically) is replaced with a reference to a page from the pool of zero pages. In the uncommon case that there is none left, for example if huge amounts of memory are being allocated all the time, the OS swaps out a page which it believes will not be accessed any time soon, zeroes it, and returns this one.
Now the page becomes part of your working set, it corresponds to actual physical memory, and it counts towards your process' quota. While your process is running, pages may be moved in and out of your working set, or may be paged out and in, as you exceed certain limits, and according to how much memory is needed and how it is accessed.
Once you call free, the allocator puts the freed area back into its books. It may instead tell the OS that it no longer needs the memory, but usually this does not happen, as it is not really necessary and it is more efficient to keep a little extra memory around and reuse it. Also, it may not be easy to free the memory, because usually the units that you allocate/deallocate do not directly correspond with the units the OS works with (and, in the case of sbrk, the frees would need to happen in the correct order, too).
When the process ends, the OS simply throws away all page table entries and adds all pages to the list of pages that the idle task will zero out. So the physical memory becomes available to the next process asking for some.
My question is a bit naive. I would like an overview that is as simple as possible, and I couldn't find any resource that made it clear to me. I am a developer and I want to understand what exactly the memory displayed by default in the "memory" column of Windows Task Manager is.
To make things a bit simpler, let's forget about the memory the process shares with other processes, and imagine the shared memory is negligible. Also, I'm focused on the big picture and mainly care about things at the GB level.
As far as I know, the memory reserved by the process, called "virtual memory", is partly stored in main memory (RAM) and partly on disk. The system decides what goes where. The system basically keeps in RAM the parts of the virtual memory that are accessed sufficiently frequently by the process. A process can reserve more virtual memory than there is RAM available in the computer.
From a developer point of view, the virtual memory may only be partially allocated by the program through its own memory manager (with malloc() or new X() for example). I guess the system has no awareness of what part of the virtual memory is allocated since this is handled by the process in a "private" way and depends on the language, runtime, compiler... Q: Is this correct?
My hypothesis is that the memory displayed by the task manager is essentially the part of the virtual memory being stored in RAM by the system. Q: Is it correct? And is there a simple way to know the total virtual memory reserved by the process?
Memory on Windows is... extremely complicated, and asking 'how much memory does my process use' is effectively a nonsensical question. To answer your questions, let's get a little background first.
Memory on Windows is allocated via ptr = VirtualAlloc(..., MEM_RESERVE, ...) and committed later with VirtualAlloc(ptr+n, MEM_COMMIT, ...).
Any reserved memory just uses up address space and so isn't interesting. Windows will let you MEM_RESERVE terabytes of memory just fine. Committing the memory does use up resources, but not in the way you'd think. When you call commit, Windows does a few sums, basically works out (total physical RAM + total swap - current commit), and lets you allocate the memory if there's enough free. BUT the Windows memory manager doesn't actually give you physical RAM until you actually use it.
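A small sketch of that reserve/commit split (illustrative only; error handling omitted):

#include <windows.h>

int main() {
    const SIZE_T total = (SIZE_T)1 << 30;   // reserve 1 GiB of address space
    char* base = (char*)VirtualAlloc(nullptr, total, MEM_RESERVE, PAGE_NOACCESS);

    // Commit only the first 64 KiB; only this counts against the commit charge.
    VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);

    base[0] = 42;   // first touch: only now is a physical page actually assigned

    VirtualFree(base, 0, MEM_RELEASE);      // release the whole region
    return 0;
}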
Later, however, if Windows is tight on physical RAM, it'll swap some of your RAM out to disk (it may compress it, throw away unused pages, throw away anything directly mapped from a file, and apply other optimisations). This means your total commit and total physical RAM usage for your program may be wildly different. Both numbers are useful, depending on what you're measuring.
There's one last large caveat: memory that is shared. When you load DLLs, the code and the read-only memory [and maybe even the read/write section, but that is COW'd] can be shared with other programs. This means that your app requires that memory, but you cannot count it against just your app; after all, it can be shared, and so doesn't take up as much physical memory as a naive count would suggest.
(If you are writing a game or similar you also need to count GPU memory but I'm no expert here)
All of the above goodness is normally wrapped up by the heap the application uses, and you see none of it: you just ask for and use memory. And it's about as optimal as it can be.
You can see all this by going to the Details tab and looking at the various options; commit size and working set are really useful. If you just look at the main window in Task Manager, with its single value, I'd hope you understand now that a single value for memory used has to be some kind of compromise, as it's not a question that makes sense.
Now to answer your questions
Firstly, the OS knows exactly how much memory your app has reserved and how much it has committed. What it doesn't know is whether the heap implementation you (or, more likely, the CRT) are using has kept some freed memory around that it hasn't released back to the operating system. Heaps often do this as an optimisation: asking for memory from the OS and freeing it back to the OS is a fairly expensive operation (and can only be done in large chunks known as pages), so most of them keep some around.
Second question: Don't use that value; go to Details and use the values there, as only you know what you actually want to ask.
EDIT:
For your comment: yes, but this depends on the size of the allocation. If you allocate a large block of memory (say >= 1 MB), then the heap in the CRT generally defers the allocation directly to the operating system, so freeing individual blocks will actually free them. For small allocations, the heap in the CRT asks the operating system for pages of memory and then subdivides them to hand out as allocations. So if you then free every other one of those, you'll be left with holes, and the heap cannot give those holes back to the OS, as the OS generally only works in whole pages. Anything you see in Task Manager will therefore show that all the memory is still used. Remember this memory isn't lost or leaked; it's just effectively pooled and will be used again for future allocations of that size. If you care about this memory, you can use the CRT heap-statistics family of functions to keep an eye on it, specifically _CrtMemDumpStatistics.
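For example, a hedged sketch of using those CRT debug-heap statistics (debug builds with the debug CRT only):

#include <crtdbg.h>

void dump_heap_stats() {
#ifdef _DEBUG
    _CrtMemState state;
    _CrtMemCheckpoint(&state);       // snapshot the CRT debug heap
    _CrtMemDumpStatistics(&state);   // dump block counts and byte totals
#endif
}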
I am debugging a potential memory leak problem in a debug DLL.
The case is that a process runs a sub-test which loads/unloads a DLL dynamically; during the test a lot of memory is reserved and committed (1.3 GB). After the test is finished and the DLL unloaded, a massive amount of memory still remains reserved (1.2 GB).
The reason I say this reserved memory is allocated by the DLL is that if I use a release DLL (nothing else changed, same test), reserved memory is ~300 MB, so all the additional reserved memory must be allocated in the debug DLL.
It looks like a lot of memory is committed during the test but decommitted (not released back to the free state) after the test. So I want to track what reserves/decommits that large amount of memory. But in the source code there is no VirtualAlloc call, so my questions are:
Is VirtualAlloc the only way to reserve memory?
If not, what other APIs can do that? If it is, what other APIs will internally call VirtualAlloc? Quite a few people online say that HeapAlloc will internally call VirtualAlloc. How does that work?
[Parts of this are purely implementation detail, and not things that your application should rely on, so take them only for informational purposes, not as official documentation or contract of any kind. That said, there is some value in understanding how things are implemented under the hood, if only for debugging purposes.]
Yes, the VirtualAlloc() function is the workhorse function for memory allocation in Windows. It is a low-level function, one that the operating system makes available to you if you need its features, but also one that the system uses internally. (To be precise, it probably doesn't call VirtualAlloc() directly, but rather an even lower level function that VirtualAlloc() also calls down to, like NtAllocateVirtualMemory(), but that's just semantics and doesn't change the observable behavior.)
Therefore, HeapAlloc() is built on top of VirtualAlloc(), as are GlobalAlloc() and LocalAlloc() (although the latter two became obsolete in 32-bit Windows and should basically never be used by applications—prefer explicitly calling HeapAlloc()).
Of course, HeapAlloc() is not just a simple wrapper around VirtualAlloc(). It adds some logic of its own. VirtualAlloc() always allocates memory in large chunks, defined by the system's allocation granularity, which is hardware-specific (retrievable by calling GetSystemInfo() and reading the value of SYSTEM_INFO.dwAllocationGranularity). HeapAlloc() allows you to allocate smaller chunks of memory at whatever granularity you need, which is much more suitable for typical application programming. Internally, HeapAlloc() handles calling VirtualAlloc() to obtain a large chunk, and then divvying it up as needed. This not only presents a simpler API, but is also more efficient.
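A quick sketch of querying that granularity:

#include <windows.h>
#include <cstdio>

int main() {
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    std::printf("page size: %lu bytes\n", si.dwPageSize);
    std::printf("allocation granularity: %lu bytes\n", si.dwAllocationGranularity);
    return 0;
}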
Note that the memory allocation functions provided by the C runtime library (CRT)—namely, C's malloc() and C++'s new operator—are a higher level yet. These are built on top of HeapAlloc() (at least in Microsoft's implementation of the CRT). Internally, they allocate a sizable chunk of memory that basically serves as a "master" block of memory for your application, and then divvy it up into smaller blocks upon request. As you free/delete those individual blocks, they are returned to the pool. Once again, this extra layer provides a simplified interface (and in particular, the ability to write platform-independent code), as well as increased efficiency in the general case.
Memory-mapped files and other functionality provided by various OS APIs is also built upon the virtual memory subsystem, and therefore internally calls VirtualAlloc() (or a lower-level equivalent).
So yes, fundamentally, the lowest level memory allocation routine for a normal Windows application is VirtualAlloc(). But that doesn't mean it is the workhorse function that you should generally use for memory allocation. Only call VirtualAlloc() if you actually need its additional features. Otherwise, either use your standard library's memory allocation routines, or if you have some compelling reason to avoid them (like not linking to the CRT or creating your own custom memory pool), call HeapAlloc().
Note also that you must always free/release memory using the corresponding mechanism to the one you used to allocate the memory. Just because all memory allocation functions ultimately call VirtualAlloc() does not mean that you can free that memory by calling VirtualFree(). As discussed above, these other functions implement additional logic on top of VirtualAlloc(), and thus require that you call their own routines to free the memory. Only call VirtualFree() if you allocated the memory yourself via a call to VirtualAlloc(). If the memory was allocated with HeapAlloc(), call HeapFree(). For malloc(), call free(); for new, call delete.
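To illustrate the pairing rule (sketch only; error handling omitted):

#include <windows.h>
#include <cstdlib>

int main() {
    void* a = VirtualAlloc(nullptr, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    VirtualFree(a, 0, MEM_RELEASE);      // VirtualAlloc() -> VirtualFree()

    void* b = HeapAlloc(GetProcessHeap(), 0, 100);
    HeapFree(GetProcessHeap(), 0, b);    // HeapAlloc() -> HeapFree()

    void* c = std::malloc(100);
    std::free(c);                        // malloc() -> free()

    int* d = new int(0);
    delete d;                            // new -> delete
    return 0;
}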
As for the specific scenario described in your question, it is unclear to me why you are worrying about this. It is important to keep in mind the distinction between reserved memory and committed memory. Reserved simply means that this particular block in the address space has been reserved for use by the process. Reserved blocks cannot be used. In order to use a block of memory, it must be committed, which refers to the process of allocating a backing store for the memory, either in the page file or in physical memory. This is also sometimes known as mapping. Reserving and committing can be done as two separate steps, or they can be done at the same time. For example, you might want to reserve a contiguous address space for future use, but you don't actually need it yet, so you don't commit it. Memory that has been reserved but not committed is not actually allocated.
In fact, all of this reserved memory may not be a leak at all. A rather common strategy used in debugging is to reserve a specific range of memory addresses, without committing them, to trap attempts to access memory within this range with an "access violation" exception. The fact that your DLL is not making these large reservations when compiled in Release mode suggests that, indeed, this may be a debugging strategy. And it also suggests a better way of determining the source: rather than scanning through your code looking for all of the memory-allocation routines, scan your code looking for the conditional code that depends upon the build configuration. If you're doing something different when DEBUG or _DEBUG is defined, then that is probably where the magic is happening.
Another possible explanation is the CRT's implementation of malloc() or new. When you allocate a small chunk of memory (say, a few KB), the CRT will actually reserve a much larger block but only commit a chunk of the requested size. When you subsequently free/delete that small chunk of memory, it will be decommitted, but the larger block will not be released back to the OS. The reason for this is to allow future calls to malloc/new to re-use that reserved block of memory. If a subsequent request is for a larger block than can be satisfied by the currently reserved address space, it will reserve additional address space. If, in debugging builds, you are repeatedly allocating and freeing increasingly large chunks of memory, what you're seeing may be the result of memory fragmentation. But this is really not a problem, aside from a minor performance hit, which is really not worth worrying about in debugging builds.
I have a C++ application where I sometimes require a large buffer of POD types (e.g. an array of 25 billion floats) to be held in memory at once in a contiguous block. This particular memory organization is driven by the fact that the application makes use of some C APIs that operate on the data. Therefore, a different arrangement (such as a list of smaller chunks of memory like std::deque uses) isn't feasible.
The application has an algorithm that is run on the array in a streaming fashion; think something like this:
std::vector<float> buf(<very_large_size>);
for (size_t i = 0; i < buf.size(); ++i) do_algorithm(buf[i]);
This particular algorithm is the conclusion of a pipeline of earlier processing steps that have been applied to the dataset. Therefore, once my algorithm has passed over the i-th element in the array, the application no longer needs it.
In theory, therefore, I could free that memory in order to reduce my application's memory footprint as it chews through the data. However, doing something akin to a realloc() (or a std::vector<T>::shrink_to_fit()) would be inefficient because my application would have to spend its time copying the unconsumed data to the new spot at reallocation time.
My application runs on POSIX-compliant operating systems (e.g. Linux, OS X). Is there any interface by which I could ask the operating system to free only a specified region from the front of the block of memory? This would seem to be the most efficient approach, as I could just notify the memory manager that, for example, the first 2 GB of the memory block can be reclaimed once I'm done with it.
If your entire buffer has to be in memory at once, then you probably will not gain much from freeing it partially later.
The main point of this post is basically to tell you NOT to do what you want to do, because the OS will not unnecessarily keep your application's memory in RAM if it's not actually needed. This is the difference between "resident memory usage" and "virtual memory usage": "resident" is what is currently used and in RAM, "virtual" is the total memory usage of your application. And as long as your swap partition is large enough, "virtual" memory is pretty much a non-issue. [I'm assuming here that your system will not run out of virtual memory space, which is true for a 64-bit application, as long as you are not using hundreds of terabytes of virtual space!]
If you still want to do that, and want to have some reasonable portability, I would suggest building a "wrapper" that behaves kind of like std::vector and allocates lumps of some megabytes (or perhaps a couple of gigabytes) of memory at a time, and then something like:
for (size_t i = 0; i < buf.size(); ++i) {
do_algorithm(buf[i]);
buf.done(i);
}
The done method will simply check if the value of i is (one element) past the end of the current buffer, and free it. [This should inline nicely, and produce very little overhead on the average loop - assuming elements are actually used in linear order, of course.]
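A rough sketch of such a wrapper, assuming POSIX mmap()/munmap() and lump sizes that are multiples of the page size (all names and sizes here are illustrative, and error handling is omitted):

#include <sys/mman.h>
#include <cstddef>

class streaming_buffer {
public:
    explicit streaming_buffer(std::size_t n)
        : size_(n),
          bytes_((n * sizeof(float) + lump_ - 1) / lump_ * lump_),
          base_(static_cast<char*>(mmap(nullptr, bytes_, PROT_READ | PROT_WRITE,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))),
          freed_(0) {}

    ~streaming_buffer() {
        if (bytes_ > freed_) munmap(base_ + freed_, bytes_ - freed_);
    }

    float&      operator[](std::size_t i) { return reinterpret_cast<float*>(base_)[i]; }
    std::size_t size() const              { return size_; }

    // Return to the OS any whole lumps lying entirely before element i.
    void done(std::size_t i) {
        std::size_t boundary = (i * sizeof(float)) / lump_ * lump_;
        if (boundary > freed_) {
            munmap(base_ + freed_, boundary - freed_);
            freed_ = boundary;
        }
    }

private:
    static constexpr std::size_t lump_ = 64 * 1024 * 1024; // 64 MiB lumps
    std::size_t size_, bytes_;
    char*       base_;
    std::size_t freed_;
};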
I'd be very surprised if this gains you anything, unless do_algorithm(buf[i]) takes quite some time (certainly many seconds, probably many minutes or even hours). And of course, it's only going to help if you actually have something else useful to do with that memory. And even then, the OS will reclaim memory that isn't actively used by swapping it out to disk, if the system is short of memory.
In other words, if you allocate 100GB, fill it, leave it sitting without touching, it will eventually ALL be on the hard-disk rather than in RAM.
Further, it is not at all unusual for the heap in the application to retain freed memory, and for the OS not to get the memory back until the application exits; and certainly, if only parts of a larger allocation are freed, the runtime will not release it until the whole block has been freed. So, as stated at the beginning, I'm not sure how much this will actually help your application.
As with everything regarding "tuning" and "performance improvements", you need to measure, compare against a benchmark, and see how much it helps.
Is it possible to partially free dynamically-allocated memory on a POSIX system?
You cannot do it using malloc()/realloc()/free().
However, you can do it in a semi-portable way using mmap() and munmap(). The key point is that if you munmap() some page, malloc() can later use that page:
create an anonymous mapping using mmap();
subsequently call munmap() for regions that you don't need anymore.
The portability issues are:
POSIX doesn't specify anonymous mappings. Some systems provide MAP_ANONYMOUS or MAP_ANON flag. Other systems provide special device file that can be mapped for this purpose. Linux provides both.
I don't think that POSIX guarantees that when you munmap() a page, malloc() will be able to use it. But I think it'll work on all systems that have mmap()/munmap().
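Here is a hedged sketch of that approach (Linux-flavored, given the MAP_ANONYMOUS caveat above):

#include <sys/mman.h>
#include <cstddef>

int main() {
    const std::size_t total = (std::size_t)1 << 30;  // a 1 GiB anonymous mapping
    void* m = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED) return 1;
    char* p = static_cast<char*>(m);

    // ... stream through the buffer, then give the first half back to
    // the OS; the rest of the mapping remains valid and usable.
    munmap(p, total / 2);

    munmap(p + total / 2, total - total / 2);        // finally release the rest
    return 0;
}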
Update
If your memory region is so large that most pages will surely be written to swap, you will not lose anything by using file mappings instead of anonymous mappings. File mappings are specified in POSIX.
If you can do without the convenience of std::vector (which won't give you much in this case anyway, because you'll never want to copy / return / move that beast), you can do your own memory handling. Ask the operating system for entire pages of memory (via mmap) and return them as appropriate (using munmap). You can tell mmap via its first argument and the optional MAP_FIXED flag to map the page at a particular address (which you must ensure is not otherwise occupied, of course) so you can build up an area of contiguous memory. If you allocate the entire memory upfront, then this is not an issue and you can do it with a single mmap and let the operating system choose a convenient place to map it. In the end, this is what malloc does internally. For platforms that don't have sys/mman.h, it's not difficult to fall back to using malloc if you can live with the fact that on those platforms, you won't return memory early.
I suspect that if your allocation sizes are always multiples of the page size, realloc will be smart enough not to copy any data. You'd have to try it out and see if it works (or consult your malloc's documentation) on your particular target platform, though.
mremap is probably what you need. As long as you're shifting whole pages, you can do a super fast realloc (actually the kernel would do it for you).
I have several large std::vectors of chars (bytes loaded from binary files).
When my program runs out of memory, I need to be able to cull some of the memory used by these vectors. These vectors are almost the entirety of my memory usage, and they're just caches for local and network files, so it's safe to just grab the largest one and chop it in half or so.
Only thing is, I'm currently using vector::resize and vector::shrink_to_fit, but this seems to require more memory (I imagine for a reallocation at the new size), then a bunch of time (for destruction of the now-removed elements, which I thought would be free?), and then copying the remainder to the new vector. Note, this is on the Windows platform, in debug, so the elements might not be destroyed in the Release build or on other platforms.
Is there something I can do to just say "C++, please tell the OS that I no longer need the memory located past location N in this vector"?
Alternatively, is there another container I'd be better off using? I do need to have random access though, or put effort into designing a way to easily keep iterators pointing at the place I'll want to read next, which would be possible, just not easy, so I'd prefer not using a std::list.
resize and shrink_to_fit are your best bets, as long as we are talking about standard C++, but these, as you noticed, may not help at all if you are in a low-memory situation to begin with: since the allocator interface does not provide a realloc-like operation, vector is forced to allocate a new block, copy the data into it, and deallocate the old block.
Now, I see essentially four easy ways out:
drop whole vectors, not just parts of them, possibly using an LRU policy or the like; when working with big vectors, the C++ allocator normally just forwards to the OS's memory-management calls, so the memory should go right back to the OS;
write your own container which uses malloc/realloc, or OS-specific functionality;
use std::deque instead; you lose guaranteed contiguousness of data, but, since deques normally allocate the space for data in distinct chunks, doing a resize+shrink_to_fit should be quite cheap - simply all the unused blocks at the end are freed, with no need for massive reallocations;
just leave this job to the OS. As already stated in the comments, the OS already has a file cache, and in normal cases it can handle caching better than you or me, if only because it has a better view of how much physical memory is left, which files are "hot" for most applications, and so on. Also, since you are working in a virtual address space, you cannot even guarantee that your memory will actually stay in RAM; the moment the machine comes under memory pressure and you aren't using some memory pages very often, they get swapped to disk, so all your performance gain is lost (and you've wasted space in the paging file for stuff that is already found on disk).
An additional way may be to just use memory mapped files - the system will do its own caching as usual, but you avoid any syscall overhead as long as the file remains in memory.
std::vector::shrink_to_fit() cannot result in more memory being used; if it did, that would be a bug.
C++11 defines shrink_to_fit() as follows:
void shrink_to_fit(); Remarks: shrink_to_fit is a non-binding request to reduce capacity() to size(). [ Note: The request is non-binding to allow latitude for implementation-specific optimizations. — end note ]
As the note indicates, shrink_to_fit() may, but does not necessarily, actually free memory, and the standard gives C++ implementations a free hand to recycle and optimize memory usage internally, as they see fit. C++ does not make it mandatory for shrink_to_fit(), and the like, to result in memory actually being released to the operating system, and in many cases the C++ runtime library may not actually be able to, as I'll get to in a moment. The C++ runtime library is allowed to take the freed memory, stash it away internally, and reuse it automatically for future memory allocation requests (explicit uses of new, or container growth).
Most modern operating systems are not designed to allocate and release memory blocks of arbitrary sizes. Details differ, but an operating system typically allocates and deallocates memory in whole chunks of 4 KB or larger, at page-aligned addresses. If you allocate a new object that's only a few hundred bytes long, the C++ library will request an entire page of memory to be allocated, take the first hundred bytes of it for the new object, then keep the spare memory for future new requests.
Similarly, even if shrink_to_fit(), or delete, frees up a few hundred bytes, that memory can't go back to the operating system immediately, but only when an entire contiguous 4 KB memory range (or whatever the allocation page size used by the operating system is), suitably aligned, is completely unused. Only then can a process release that page back to the operating system. Until then, the library keeps track of freed memory ranges, to be used for future new requests, without asking the operating system to allocate more pages of memory for the process.
Besides remembering the address of the pointer to the object, I think the OS also needs to record how large the block of memory is, so that when we use delete, the OS will know how much memory to free.
Can anyone tell me more details about this? What other information is recorded? And where is that information stored? What does the OS do after you delete the memory?
As already noted, new is a library function, not an OS feature.
The general case is approximately like this:
The C++ compiler translates the new keyword into function calls to malloc() (or an equivalent).
The allocator keeps a list of free blocks of memory and searches there for the best match.
Typically, the 'best' match is bigger than the amount asked for by your program. If so, the allocator splits the block, marks one part with the size (and maybe a few other pieces of metadata), puts the rest back into the free list, and returns the allocated block to your program.
If no appropriate free block is found, the allocator asks the OS for a chunk of memory. There are several ways to do this, but it's typically considered a slow operation, so it asks in bigger steps (at least one page at a time, usually 4 KB). When it gets the new free block, it splits it into the requested size, and the rest is put on the free list.
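A hedged sketch of that splitting step (illustrative names only; no thread safety or alignment handling):

#include <cstddef>

struct free_block {
    std::size_t size;   // size of this block, header included
    free_block* next;   // next block on the free list
};

// Carve 'wanted' user bytes (plus a header) out of free block 'b'.
// Returns the leftover piece to put back on the free list, or nullptr
// if 'b' is too small to be worth splitting.
free_block* split(free_block* b, std::size_t wanted) {
    std::size_t need = wanted + sizeof(free_block);
    if (b->size < need + sizeof(free_block))
        return nullptr;                      // use the block whole instead

    char* raw = reinterpret_cast<char*>(b);
    free_block* rest = reinterpret_cast<free_block*>(raw + need);
    rest->size = b->size - need;
    rest->next = b->next;
    b->size = need;                          // 'b' is now exactly the right size
    return rest;
}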
The OS is the one controlling the MMU (Memory Management Unit) of the processor. This unit is the one that translates the linear addresses as seen by the currently running process into the physical addresses of RAM pages. This allows the OS the flexibility it needs to allocate and deallocate RAM pages to each process.
Each process has a different memory map, that allows each one to 'see' a linear memory space while at the same time keeping each process isolated from the others. The OS is the one who loads and unloads the map into the MMU at each process switch. Allocating a new page to a process ultimately means adding it to the memory map of the process.
Can anyone tell me more details about this?
These are all details that are highly dependent on the OS and compiler, and I can only provide fairly broad answers (but you really shouldn't have to worry about this, unless you are thinking of going into that line of work, of course).
What other information is recorded?
Generally, freestore memory is commonly referred to as the "heap". This is basically because it is implemented as a heap, a priority-queue implementation that builds a tree of nodes (blocks of free memory). The actual implementations and the algorithms used are very complex, because they need to perform very well (i.e., free blocks need to be found very quickly when new memory is requested, and newly freed memory has to be merged back into the heap very quickly as well), and then there are all the fragmentation issues and so on. Look at the wiki page on buddy memory allocation for a very simplified version.
So, there is quite a bit more than just the list of allocated pointers and their corresponding memory-block sizes. In fact, the list of available memory blocks is far more important.
And where is that information stored?
The heap is a part of the running program (in fact, each process, and even each loaded module, will have one or more such structures). The heap does not reside in the OS (and I'm not even sure it needs to be provided by the OS, at least not for all OSes). However, the heap obviously asks the OS to provide it with large blocks of memory that it can incorporate into its tree of free memory blocks, whenever its capacity is exhausted by the memory requests of your program.
What does the OS do after you delete the memory?
Generally, when you delete memory, it simply gets added to the list of free (or available) memory blocks, and possibly gets merged with adjacent free memory blocks if present. There is no "filling the memory with zeros" or any other actual operation performed on the memory; that would be unnecessary and a waste of processing time.