What part of the process virtual memory does Windows Task Manager display - c++

My question is a bit naive. I'm willing to have an overview as simple as possible and couldn't find any resource that made it clear to me. I am a developer and I want to understand what exactly is the memory displayed in the "memory" column by default in Windows Task Manager:
To make things a bit simpler, let's forget about the memory the process shares with other processes, and imagine the shared memory is negligible. Also I'm focussed on the big picture and mainly care for things at GB level.
As far as I know, the memory reserved by the process called "virtual memory", is partly stored in the main memory (RAM), partly on the disk. The system decides what goes where. The system basically keeps in RAM the parts of the virtual memory that is accessed sufficiently frequently by the process. A process can reserve more virtual memory than RAM available in the computer.
From a developer point of view, the virtual memory may only be partially allocated by the program through its own memory manager (with malloc() or new X() for example). I guess the system has no awareness of what part of the virtual memory is allocated since this is handled by the process in a "private" way and depends on the language, runtime, compiler... Q: Is this correct?
My hypothesis is that the memory displayed by the task manager is essentially the part of the virtual memory being stored in RAM by the system. Q: Is it correct? And is there a simple way to know the total virtual memory reserved by the process?

Memory on windows is... extremely complicated and asking 'how much memory does my process use' is effectively a nonsensical question. TO answer your questions lets get a little background first.
Memory on windows is allocated via ptr = VirtualAlloc(..., MEM_RESERVE, ...) and committed later with VirtualAlloc(ptr+n, MEM_COMMIT, ...).
Any reserved memory just uses up address space and so isn't interesting. Windows will let you MEM_RESERVE terabytes of memory just fine. Committing the memory does use up resources but not in the way you'd think. When you call commit windows does a few sums and basically works out (total physical ram + total swap - current commit) and lets you allocate memory if there's enough free. BUT the windows memory manager doesn't actually give you physical ram until you actually use it.
Later, however, if windows is tight for physical RAM it'll swap some of your RAM out to disk (it may compress it and also throw away unused pages, throw away anything directly mapped from a file and other optimisations). This means your total commit and total physical ram usage for your program may be wildly different. Both numbers are useful depending on what you're measuring.
There's one last large caveat - memory that is shared. When you load DLLs the code, the read-only memory [and even maybe the read/write section but this is COW'd] can be shared with other programs. This means that your app requires that memory but you cannot count that memory against just your app - after all it can be shared and so doesn't take up as much physical memory as a naive count would think.
(If you are writing a game or similar you also need to count GPU memory but I'm no expert here)
All of the above goodness is normally wrapped up by the heap the application uses and you see none of this - you ask for and use memory. And its just as optimal as possible.
You can see this by going to the details tab and looking at the various options - commit-size and working-set are really useful. If you just look at the main window in task-manager and it has a single value I'd hope you understand now that a single value for memory used has to be some kind of compromise as its not a question that makes sense.
Now to answer your questions
Firstly the OS knows exactly how much memory your app has reserved and how much it has committed. What it doesn't know is if the heap implementation you (or more likely the CRT) are using has kept some freed memory about which it hasn't released back to the operation system. Heaps often do this as an optimisation - asking for memory from the OS and freeing it back to the OS is a fairly expensive operation (and can only be done in large chunks known as pages) and so most of them keep some around.
Second question: Dont use that value, go to details and use the values there as only you know what you actually want to ask.
EDIT:
For your comment, yes, but this depends on the size of the allocation. If you allocate a large block of memory (say >= 1MB) then the heap in the CRT generally directly defers the allocation to the operating system and so freeing individual ones will actually free them. For small allocations the heap in the CRT asks for pages of memory from the operating system and then subdivides that to give out in allocations. And so if you then free every other one of those you'll be left with holes - and the heap cannot give those holes back to the OS as the OS generally only works in whole pages. So anything you see in task manager will show that all the memory is still used. Remember this memory isn't lost or leaked, its just effectively pooled and will be used again if allocations ask for that size. If you care about this memory you can use the crt heap statistics famliy of functions to keep an eye on those - specifically _CrtMemDumpStatistics

Related

Is it possible to partially free dynamically-allocated memory on a POSIX system?

I have a C++ application where I sometimes require a large buffer of POD types (e.g. an array of 25 billion floats) to be held in memory at once in a contiguous block. This particular memory organization is driven by the fact that the application makes use of some C APIs that operate on the data. Therefore, a different arrangement (such as a list of smaller chunks of memory like std::deque uses) isn't feasible.
The application has an algorithm that is run on the array in a streaming fashion; think something like this:
std::vector<float> buf(<very_large_size>);
for (size_t i = 0; i < buf.size(); ++i) do_algorithm(buf[i]);
This particular algorithm is the conclusion of a pipeline of earlier processing steps that have been applied to the dataset. Therefore, once my algorithm has passed over the i-th element in the array, the application no longer needs it.
In theory, therefore, I could free that memory in order to reduce my application's memory footprint as it chews through the data. However, doing something akin to a realloc() (or a std::vector<T>::shrink_to_fit()) would be inefficient because my application would have to spend its time copying the unconsumed data to the new spot at reallocation time.
My application runs on POSIX-compliant operating systems (e.g. Linux, OS X). Is there any interface by which I could ask the operating system to free only a specified region from the front of the block of memory? This would seem to be the most efficient approach, as I could just notify the memory manager that, for example, the first 2 GB of the memory block can be reclaimed once I'm done with it.
If your entire buffer has to be in memory at once, then you probably will not gain much from freeing it partially later.
The main point on this post is basically to NOT tell you to do what you want to do, because the OS will not unnecessarily keep your application's memory in RAM if it's not actually needed. This is the difference between "resident memory usage" and "virtual memory usage". "Resident" is what is currently used and in RAM, "virtual" is the total memory usage of your application. And as long as your swap partition is large enough, "virtual" memory is pretty much a non-issue. [I'm assuming here that your system will not run out of virtual memory space, which is true in a 64-bit application, as long as you are not using hundreds of terabytes of virtual space!]
If you still want to do that, and want to have some reasonable portability, I would suggest building a "wrapper" that behaves kind of like std::vector and allocates lumps of some megabytes (or perhaps a couple of gigabytes) of memory at a time, and then something like:
for (size_t i = 0; i < buf.size(); ++i) {
do_algorithm(buf[i]);
buf.done(i);
}
The done method will simply check if the value if i is (one element) past the end of the current buffer, and free it. [This should inline nicely, and produce very little overhead on the average loop - assuming elements are actually used in linear order, of course].
I'd be very surprised if this gains you anything, unless do_algorithm(buf[i]) takes quite some time (certainly many seconds, probably many minutes or even hours). And of course, it's only going to help if you actually have something else useful to do with that memory. And even then, the OS will reclaim memory that isn't actively used by swapping it out to disk, if the system is short of memory.
In other words, if you allocate 100GB, fill it, leave it sitting without touching, it will eventually ALL be on the hard-disk rather than in RAM.
Further, it is not at all unusual that the heap in the application retains freed memory, and that the OS does not get the memory back until the application exits - and certainly, if only parts of a larger allocation is freed, the runtime will not release it until the whole block has been freed. So, as stated at the beginning, I'm not sure how much this will actually help your application.
As with everything regarding "tuning" and "performance improvements", you need to measure and compare a benchmark, and see how much it helps.
Is it possible to partially free dynamically-allocated memory on a POSIX system?
You can not do it using malloc()/realloc()/free().
However, you can do it in a semi-portable way using mmap() and munmap(). The key point is that if you munmap() some page, malloc() can later use that page:
create an anonymous mapping using mmap();
subsequently call munmap() for regions that you don't need anymore.
The portability issues are:
POSIX doesn't specify anonymous mappings. Some systems provide MAP_ANONYMOUS or MAP_ANON flag. Other systems provide special device file that can be mapped for this purpose. Linux provides both.
I don't think that POSIX guarantees that when you munmap() a page, malloc() will be able to use it. But I think it'll work an all systems that have mmap()/unmap().
Update
If your memory region is so large that most pages surely will be written to swap, you will not loose anything by using file mappings instead of anonymous mappings. File mappings are specified in POSIX.
If you can do without the convenience of std::vector (which won't give you much in this case anyway because you'll never want to copy / return / move that beast anyway), you can do your own memory handling. Ask the operating system for entire pages of memory (via mmap) and return them as appropriate (using munmap). You can tell mmap via its fist argument and the optional MAP_FIXED flag to map the page at a particular address (which you must ensure to be not otherwise occupied, of course) so you can build up an area of contiguous memory. If you allocate the entire memory upfront, then this is not an issue and you can do it with a single mmap and let the operating system choose a convenient place to map it. In the end, this is what malloc does internally. For platforms that don't have sys/mman.h, it's not difficult to fall back to using malloc if you can live with the fact that on those platforms, you won't return memory early.
I'm suspecting that if your allocation sizes are always multiples of the page size, realloc will be smart enough not to copy any data. You'd have to try this out and see if it works (or consult your malloc's documentation) on your particular target platform, though.
mremap is probably what you need. As long as you're shifting whole pages, you can do a super fast realloc (actually the kernel would do it for you).

Memory usage in C++ program, as reported by Gnome resource monitor: confusion

I am looking at the memory consumed by my app to make sure I am not allocating too much, and am confused as to what Gnome Resource Monitor is showing me. I have used the following pieces of code to allocate memory in two separate apps that are otherwise identical; they contain nothing other than this code and a scanf() call to pause execution whilst I grab the memory usage:
malloc(1024 * 1024 * 100);
and
char* p = new char[1204*1024*100];
The following image shows the memory usage of my app before and after each of these lines:
Now, I have read a lot (but obviously not enough) about memory usage (including this SO question), and am having trouble differentiating between writeable memory and virtual memory. According to the linked question,
"Writeable memory is the amount of address space that your process has
allocated with write privileges"
and
"Virtual memory is the address space that your application has
allocated"
1) If I have allocated memory myself, surely it has write privileges?
2) The linked question also states (regarding malloc)
"...which won't actually allocate any memory. (See the rant at the end
of the malloc(3) page for details.)"
I don't see any "rant", and my images show the virtual memory has increased! Can someone explain this please?
3) If I have purely the following code:
char* p = new char[100];
...the resource monitor shows that both Memory and Writeable Memory have increased by 8KB - the same as when I was allocating a full one megabyte! - with Virtual memory increasing by 0.1. What is happening here?
4) What column should I be looking at in the resource monitor to see how much memory my app is using?
Thanks very much in advance for participation, and sorry if have been unclear or missed anything that could have led me to find answers myself.
A more precise way to understand on Linux the memory usage of a running process is to use the proc(5) file system.
So, if your process pid is 1234, try
cat /proc/1234/maps
Notice that processes are having their address space in virtual memory. That address space can be changed by mmap(2) and other syscalls(2). For several efficency reasons malloc(3) and free avoid to make too much of these syscalls, and prefer to re-use previously free-d memory zones. So when your program is free-ing (or, in C++, delete-ing) some memory chunk, that chunk is often marked as re-usable but is not released back to the kernel (by e.g. munmap). Likewise, if you malloc only 100 bytes, your libc is allowed to e.g. request a whole megabyte using mmap (the next time you are calling malloc for e.g. 200 bytes, it will use part of that magabyte)
See also http://linuxatemyram.com/ and Advanced Linux Programming (and this question about memory overcommit)
The classes of memory reported by the Gnome resource monitor (and in fact, the vast majority of resource reporting tools) are not simply separate classes of memory - there is overlap between them because they are reporting on different characteristics of the memory. Some of those different characteristics include:
virtual vs physical - all memory in a processes address space on modern operating systems is virtual; that virtual address space is mapped to actual physical memory by the hardware capabilities of the CPU; how that mapping is done is a complex topic in itself, with a lot of differences between different architectures
memory access permissions - memory can be readable, writable, or executable, or any combination of the three (in theory - some combinations don't really make sense and so may actually not be allowed by hardware and/or software, but the point is that these permissions are treated separately)
resident vs non-resident - with a virtual memory system, much of the address space of a process may not actually be currently mapped to real physical memory, for a variety of reasons - it may not have been allocated yet; it may be part of the binary or one of the libraries, or even a data segment that has not yet been loaded because the program has not called for it yet; it may have been swapped out to a swap area to free up physical memory for a different program that needed it
shared vs private - parts of a processes virtual address space that are read-only (for example, the actual code of the program and most of the libraries) may be shared with other processes that use the same libraries or program - this is a big advantage for overall memory usage, as having 37 different xterm instances running does not mean that the code for xterm needs to be loaded 37 different times into memory - all the processes can share one copy of the code
Because of these, and a few other factors (IPC shared memory, memory-mapped files, physical devices that have memory regions mapped in hardware, etc.), determining the actual memory in use by any single process, or even the entire system, can be complicated.

File Based Memory Pool - Is it Possible?

Whenever a new / malloc is used, OS create a new(or reuse) heap memory segment, aligned to the page size and return it to the calling process. All these allocations will constitute to the Process's virtual memory. In 32bit computing, any process can scale only upto 4 GB. Higher the heap allocation, the rate of increase of process memory is higher. Though there are lot of memory management / memory pools available, all these utilities end up again in creating a heap and reusing it effeciently.
mmap (Memory mapping) on the other hand, provides the ablity to visualize a file as memory stream, and enables the program to use pointer manipulations directly on file. But here again, mmap is actually allocating the range of addresses in the process space. So if we mmap a 3GB file with size 3GB and take a pmap of the process, you could see the total memory consumed by the process is >= 3GB.
My question is, is it possible to have a file based memory pool [just like mmaping a file], however, does not constitute the process memory space. I visualize something like a memory DB, which is backed by a file, which is so fast for read/write, which supports pointer manipulations [i.e get a pointer to the record and store anything as if we do using new / malloc], which can grow on the disk, without touching the process virtual 4GB limit.
Is it possible ? if so, what are some pointers for me to start working.
I am not asking for a ready made solution / links, but to conceptually understand how it can be achieved.
It is generally possible but very coplicated. You would have to re-map if you wanted to acces different 3Gb segments of your file, which would probably kill the performance in case of scattered access. Pointers would only get much more difficult to work with, as remmpaing changes data but leaves the adresses the same.
I have seen STXXL project that might be interesting to you; or it might not. I have never used it so I cannot give you any other advice about it.
What you are looking for, is in principle, a memory backed file-cache. There are many such things in for example database implementations (where the whole database is way larger than the memory of the machine, and the application developer probably wants to have a bit of memory left for application stuff). This will involve having some sort of indirection - an index, hash or some such to indicate what area of the file you want to access, and using that indirection to determine if the memory is in memory or on disk. You would essentially have to replicate what the virtual memory handling of the OS and the processor does, by having tables that indicate where in physical memory your "virtual heap" is, and if it's not present in physical memory, read it in (and if the cache is full, get rid of some - and if it's been written, write it back again).
However, it's most likely that in today's world, you have a machine capable of 64-bit addressing, and thus, it would be much easier to recompile the application as a 64-bit application, usemmap or similar to access the large memory. In this case, even if RAM isn't sufficient, you can access the memory of the file via the virtual memory system, and it takes care of all the mapping back and forth between disk and RAM (physical memory).

Dynamic memory allocation and memory block metadata

I have a question about low level stuff of dynamic memory allocation.
I understand that there may be different implementations, but I need to understand the fundamental ideas.
So,
when a modern OS memory allocator or the equivalent allocates a block of memory, this block needs to be freed.
But, before that happends, some system needs to exist to control the allocation process.
I need to know:
How this system keeps track of allocated and unallocated memory. I mean, the system needs to know what blocks have already been allocated and what their size is to use this information in allocation and deallocation process.
Is this process supported by modern hardware, like allocation bits or something like that?
Or is some kind of data structure used to store allocation information.
If there is a data structure, how much memory it uses compared to the allocated memory?
Is it better to allocate memory in big chunks rather than small ones and why?
Any answer that can help reveal fundamental implementation details is appreciated.
If there is a need for code examples, C or C++ will be just fine.
"How this system keeps track of allocated and unallocated memory." For non-embedded systems with operating systems, a virtual page table, which the OS is in charge of organizing (with hardware TLB support of course), tracks the memory usage of programs.
AS FAR AS I KNOW (and the community will surely yell at me if I'm mistaken), tracking individual malloc() sizes and locations has a good number of implementations and is runtime-library dependent. Generally speaking, whenever you call malloc(), the size and location is stored in a table. Whenever you call free(), the table entry for the provided pointer is looked up. If it is found, that entry is removed. If it is not found, the free() is ignored (which also indicates a possible memory leak).
When all malloc() entries in a virtual page are freed, that virtual page is then released back to the OS (this also implies that free() does not always release memory back to the OS since the virtual page may still have other malloc() entries in it). If there is not enough space within a given virtual page to support another malloc() of a specified size, another virtual page is requested from the OS.
Embedded processors usually don't have operating systems, virtual page tables, nor multiple processes. In this case, virtual memory is not used. Instead, the entire memory of the embedded processor is treated like one large virtual page (although the addresses are actually physical addresses) and memory management follows a similar process as previously described.
Here is a similar stack overflow question with more in-depth answers.
"Is it better to allocate memory in big chunks rather than small ones and why?" Allocate as much memory as you need, no more and no less. Compiler optimizations are very smart, and memory will almost always be managed more efficiently (i.e. reducing memory fragmentation) than the programmer can manually do. This is especially true in a non-embedded environment.
Here is a similar stack overflow question with more in-depth answers (note that it pertains to C and not C++, however it is still relevant to this discussion).
Well, there are more than one way to achieve that.
I once had to wrote a malloc() (and free()) implementation for educational purpose.
This is from my experience, and real world implementation surely vary.
I used a double linked list.
Memory chunk returned to the user after calling malloc() were in fact a struct containing relevant information to my implementation (ie the next and prev pointer, and a is_used byte).
So when a user request N bytes I allocated N + sizeof(my_struct) bytes, hiding next and prev pointers at the begenning of the chunk, and returning what's left to the user.
Surely, this is poor design for a program that use a lot of small allocation (because each allocation takes up to N + 2 pointers + 1 byte).
For a real world implementation, you can take a look to the code of good and well known memory allocator.
Normally there exist two different layers.
One layer lives at application level, usually as part of the C standard library. This is what you call through functions like malloc and free (or operator new in C++, which in turn usually calls malloc). This layer takes care of your allocations, but does not know about memory or where it comes from.
The other layer, at OS level, does not know and does not care anything about your allocations. It only maintains a list of fixed-size memory pages that have been reserved, allocated, and accessed, and with each page information such as where it maps to.
There are many different implementations for either layer, but in general it works like this:
When you allocate memory, the allocator (the "application level part") looks whether it has a matching block somewhere in its books that it can give to you (some allocators will split a larger block in two, if need be).
If it doesn't find a suitable block, it reserves a new block (usually much larger than what you ask for) from the operating system. sbrk or mmap on Linux, or VirtualAlloc on Windows would be typical examples of functions it might use for that effect.
This does very little apart from showing intent to the operating system, and generating some page table entries.
The allocator then (logically, in its books) splits up that large area into smaller pieces according to its normal mode of operation, finds a suitable block, and returns it to you. Note that this returned memory does not necessarily even exist as phsyical memory (though most allocators write some metadata into the first few bytes of each allocated unit, so they necessarily pre-fault the pages).
In the mean time, invisibly, a background task zeroes out memory pages that were in use by some process once but have been freed. This happens all the time, on a tentative base, since sooner or later someone will ask for memory (often, that's what the idle task does).
Once you access an address in the page that contains your allocated block for the first time, you generate a fault. The page table entry of this yet non-existent page (it logically exists, just not phsyically) is replaced with a reference to a page from the pool of zero pages. In the uncommon case that there is none left, for example if huge amounts of memory are being allocated all the time, the OS swaps out a page which it believes will not be accessed any time soon, zeroes it, and returns this one.
Now the page becomes part of your working set, it corresponds to actual phsyical memory, and it accounts towards your process' quota. While your process is running, pages may be moved in and out of your working set, or may be paged out and in, as you exceed certain limits, and according to how much memory is needed and how it is accessed.
Once you call free, the allocator puts the freed area back into its books. It may tell the OS that it does not need the memory any more instead, but usually this does not happen as it is not really necessary and it is more efficient to keep around a little extra memory and reuse it. Also, it may not be easy to free the memory because usually the units that you allocate/deallocate do not directly correspond with the units the OS works with (and, in the case of sbrk they'd need to happen in the correct order, too).
When the process ends, the OS simply throws away all page table entries and adds all pages to the list of pages that the idle task will zero out. So the physical memory becomes available to the next process asking for some.

Dynamic allocation in uClinux

I'm new to embedded development, and the big differences I see between traditional Linux and uClinux is that uClinux lacks the MMU.
From this article:
Without VM, each process must be located at a place in memory where it can be run. In the simplest case, this area of memory must be contiguous. Generally, it cannot be expanded as there may be other processes above and below it. This means that a process in uClinux cannot increase the size of its available memory at runtime as a traditional Linux process would.
To me, this sounds like all data must reside on the stack, and that heap allocation is impossible, meaning malloc() and/or "new" are out of the question... is that accurate? Perhaps there are techniques/libraries which allow for managing a "static heap" (i.e. a stack based area from which "dynamic" allocations can be requested)?
Or am I over thinking it? Or over simplifying it?
Under regular Linux, the programmer does not need to deal with physical resources. The kernel takes care of this, and a user space process sees only its own address space. As the stack grows, or malloc-type requests are made, the kernel will map free memory into the process's virtual address space.
In uClinux, the programmer must be more concerned with physical memory. The MMU and VM are not available, and all address space is shared with the kernel. When a user space program is loaded, the process is allocated physical memory pages for the text, stack, and variables. The process's program counter, stack pointer, and data/bss table pointers are set to physical memory addresses. Heap allocations (via malloc-type calls) are made from the same pool.
You will not have to get rid of heap allocation in programs. You will need to be concerned with some new issues. Since the stack cannot grow via virtual memory, you must size it correctly during linking to prevent stack overflows. Memory fragmentation becomes an issue because there's no MMU to consolidate smaller free pages. Errant pointers become more dangerous because they can now cause unintended writes to anywhere in physical memory.
It's been a while since I've worked with uCLinux (it was before it was integrated into the main tree), but I thought malloc was still available as part of the c library. There was a lot higher chance of doing Very Bad Things (tm) in memory since the heap wasn't isolated, but it was possible.
yes you can use malloc in user space applications on uclinux ,but then you have to increase the size of stack of user space application(before running the program cause stack size would be static),so that when malloc runs it will get the space it needs.
for e.g. uclinux on arm-cortex
arm toolchain provides command to find and change size of stack used by binary of user application then you can tranfer it to your embedded system and run
----- > arm-uclinuxeabi-flthdr