Mapping of several big files into memory

Mapping of several big files into memory - c++

In our application we have to be able to map several (i.e. maybe up to 4) files into memory (via mapViewOfFile). For a long time this has not been a problem, but as the files were getting bigger and bigger over the last years, now memory fragmentation prevents us from mapping those big files (files will be about 200 MB). The problem may already exist if no other files are loaded at that moment.
I am now looking for a way t make sure that the mapping is always successful. Therefor I wanted to reserve a block of memory at program start only for the mapping and that would therefor suffer much less from the fragmentation.
My first approach was to HeapCreate a private heap, I would then HeapAlloc a block of memory large enough to hold the mapping for one file and then use MapViewOfFileEx with the address of that block. Of cause the address would have to match the memory allocation granularity. But the mapping still failed with error code ERROR_INVALID_ADDRESS (487).
Next I tried the same thing with VirtualAloc. My understanding was that when I pass the parameter MEM_RESERVE I would then be able to use that memory for what ever I wanted, e.g. to map a view of a file. But I found out that that is not possible (same error code as above) until i completely free the whole block with VirtualFree again. Therefor there would be no reserved memory for the next files anymore.
I'm already using the low fragmentation heap feature and it is of nearly no use to us. Rewriting our code to use only smaller views of the files is not an option at the moment. I also took a look at this post Can address space be recycled for multiple calls to MapViewOfFileEx without chance of failure? but didn't find any it very useful and was hoping for an other possibility.
Do you have any suggestions what I can do or where my design may be wrong?
Thank you.

Well, the documentation for MapViewOfFileEx is clear: "The suggested address is used to specify that a file should be mapped at the same address in multiple processes. This requires the region of address space to be available in all involved processes. No other memory allocation can take place in the region that is used for mapping, including the use of the VirtualAlloc"
The low fragmentation heap is intended to prevent even relatively small allocations from failing. I.e. it avoids 1 byte holes so 2 byte allocations will remain possible for longer. Your allocations are not small by 32 bits standards.
Realistically, this is going to hurt. If you really really need it, reimplement memory mapped files. All the necessary functions are available. Use a vectored exception handler to page in the source, and use QueryWorkingSet to figure out if pages are dirty.

Related

Checking number of actually allocated pages using anonymous mmap()

I'm allocating some memory using mmap() with MAP_ANONYMOUS flag. Because of the problem's features, often I need to write some data to one page ( for example, located in the middle ) of large memory chunk, and leave the other part untouched. So, I wouldn't this rather big part of unused memory to allocate physically.
In my view, such mmap() call just give me a pointer to some 0-filled virtual pages, but due to to copy-on-write and demand paging mechanisms, no one page ( besides first maybe ) isn't actually allocated in RAM until first write attempt to it's memory.
The problem is: one moment I've got MAP_FAILED from mmap() ( there were a lot of successful calls with fewer allocation queries before ), when tried to allocate a memory chunk larger than my physical RAM size. So, it seems there are much more pages actually allocated, even if there wasn't write access to them.
I need your help guys with two questions, first:
Is my views to anonymous memory allocation correct, and (if not) , what are the inaccuracies?
And the second:
How could I measure number of actually allocated pages after anonymous mmap() is done? I've tried use mincore(), and it's results show me that almost all pages are "resident" in memory ( i.e., physically allocated? ). So, it seems mincore() results are wrong, or I'm totally stuck :(
Upd.
It seems memory overcommiting mentioned by #Art indeed could influence this. But when I'm trying to disable it ( setting /proc/sys/vm/overcommit_memory to 1 mode or using mmap() with MAP_NORESERVE flag, my machine is heavily freezing, and hard reset is the only thing that helps.

"It's complicated." (caveat: I have much more experience of VM systems from not Linux, but I've picked up a few details from Linux over the years)
Generally your assumption is true. Many operating systems don't actually do much on mmap other than record "there is an allocation of X pages of object Y at V". Access to those pages causes a page fault which leads to an actual allocation. Some (or maybe all?) versions of Linux map a pre-zeroed read-only page to such allocations (not sure if it has changed or if it's still done) and the rest is handled like copy-on-write (for me it feels like a bad idea because it should generate more TLB flushes for a dubious optimization of reading zeroed memory which should happen rarely, but I guess Linux people have benchmarked it and found it good).
There are a few caveats. Just because you don't use the RAM doesn't mean that you'll be allowed to overcommit so much memory. Certain Linux distributions started default to a setting that disallows overcommit or disallows more than X% overcommit. Centos/RedHat was one of those (this was probably a decade ago, I don't know the state today). This means that even when you're just using 5% of your actual physical memory, if you have created anonymous mappings of 100+X% of your RAM, mmap will fail. There is a sysctl for it, look it up. It's something like /proc/sys/vm/overcommit_memory or overcommit_ratio or both, or probably even more under there.
Then you have to be aware of resource limits. Check with ulimit if you're allowed to create mappings that big (ulimit -v), the problem could be as simple as that. Also (I'm not sure how Linux does it here, but you can create simple test programs to try it), resource limits for data size (ulimit -d) can be accounted as potential pages you can use rather than actual pages you currently use (in other words no overcommit in resource limits).
Then let's get back to faulting in only the pages you use. There has been research into detecting access patterns and predicting future faults (so that mmap:ed files could do read-ahead), but I don't know the state of it in Linux. It feels like it would be dumb to apply this to anonymous mappings, but you never know, maybe someone is trying to be clever. To make sure, use madvise(..., MADV_RANDOM) on the mmap:ed block you get.
Finally mincore. My experience from it on Linux is that it's garbage, last time I tried it it didn't work for anonymous mappings (or was it private?). In your case it could be as simple as it reporting that the read-only copy-on-write zeroed page for anonymous mappings is considered to be "in core".

Loading large amount of binary data into RAM

My application needs to load from MegaBytes to dozens of GigaBytes of binary data (multiple files) into RAM. After some search, I decided to use std::vector<unsigned char> for this purpose, although I am not sure it's the best choice.
I would use one vector for each file. As application previously knows file size, it would call reserve() to allocate memory for it. Sometimes the application might need to fully read a file and in some others only part of it and vector's iterators are nice for that. It may need to unload a file from RAM and put other in place, std::vector::swap() and std::vector::shrink_to_fit() would be very useful. I don't want to have the hard work of dealing with low level memory allocation stuff (otherwise would go with C).
I've some questions:
Application must load the more files from a list it can into RAM. How would it know if there is enough memory space to load one more file? Should it call reserve() and look for errors? How? Reference only says reserve() throws an exception when requested size is greater than std::vector::max_size.
Is std::vector<unsigned char> applicable for getting such large amount of binary data into RAM? I'm worried about std::vector::max_size, since its reference says its value would depend on system or implementation limitations. I presume system limitation is free RAM, is it right? So, no problem. But what about implementations limitation? Are there anything regarding to implementations that could prevent me from doing what I want to? Case affirmative, please give me an alternative.
And what if I want to use entire RAM space, except N GigaBytes? Is the best way really to use sysinfo() and deduce based on free RAM if it is possible to load each file?
Obs.: This section of the application must be get the more performance (low processing time/CPU usage and RAM consumption) possible. I would appreciate your help.

How would it know if there is enough memory space to load one more file?
You wouldn't know before hand. Wrap the loading process in try - catch. If memory runs out, then a std::bad_alloc will be thrown (assuming you use default allocators). Assume that memory is sufficient in the loading code, and deal with the lack of memory in the exception handler.
But what about implementations limitation?
...
Are there anything regarding to implementations that could prevent me from doing what I want to?
You can check std::vector::max_size at run time to verify.
If the program is compiled with a 64 bit word size, then it is quite likely that the vector has sufficient max_size for a few hundred gigabytes.
This section of the application must be get the more performance
This conflicts with
I don't want to have the hard work of dealing with low level memory allocation stuff
But in case low level memory stuff is worth it for the performance, you could memory-map the file into memory.
I've read on some SO questions to avoid them on applications that need high performance and prefer dealing with return values, errno, etc
Unfortunately for you, non-throwing memory allocation is not an option if you use the standard containers. If you are allergic to exceptions, then you must use another implementation of a vector - or whatever container you decide to use. You don't need any container with mmap, though.
Won't handling exceptions break performance?
Luckily for you, run time cost of exceptions is insignificant compared to reading hundreds of gigabytes from disk.
May it be better to run sysinfo() and work on checking free RAM before loading a file?
sysinfo call may very well be slower than handling an exception (I haven't measured, that is just a conjecture) - and it won't tell you about process specific limits that may exist.
And also, it looks hard and costly to repetitively try load a file, catch exception and try load a smaller file (requires recursion?)
No recursion needed. You can use it if you prefer; it can be written with tail call, that can be optimized away.
About memory mapping: I took a look on it sometime ago and found boring to deal with. Would require to use C's open() and all that stuff and say bye to std::fstream.
Once you have mapped the memory, it is easier to use than std::fstream. You can skip the copying into vector part, and simply use the mapped memory as if it was an array that already exists in memory.
Looks like best way of partially reading a file using std::fstream is to derive std::streambuf
I don't see why you would need to derive anything. Just use std::basic_fstream::seekg() to skip to the part that you wish to read.

As an addition to #user2097303's answer, I want to add that vector guarantees contiguous allocation. For long running applications, this will result in memory fragmentation, and in the end, no contiguous block of memory will be present anymore, although between blocks, plenty of space is free.
Therefore it may be a good idea to store your data into deque

Dynamic memory allocation and memory block metadata

I have a question about low level stuff of dynamic memory allocation.
I understand that there may be different implementations, but I need to understand the fundamental ideas.
So,
when a modern OS memory allocator or the equivalent allocates a block of memory, this block needs to be freed.
But, before that happends, some system needs to exist to control the allocation process.
I need to know:
How this system keeps track of allocated and unallocated memory. I mean, the system needs to know what blocks have already been allocated and what their size is to use this information in allocation and deallocation process.
Is this process supported by modern hardware, like allocation bits or something like that?
Or is some kind of data structure used to store allocation information.
If there is a data structure, how much memory it uses compared to the allocated memory?
Is it better to allocate memory in big chunks rather than small ones and why?
Any answer that can help reveal fundamental implementation details is appreciated.
If there is a need for code examples, C or C++ will be just fine.

"How this system keeps track of allocated and unallocated memory." For non-embedded systems with operating systems, a virtual page table, which the OS is in charge of organizing (with hardware TLB support of course), tracks the memory usage of programs.
AS FAR AS I KNOW (and the community will surely yell at me if I'm mistaken), tracking individual malloc() sizes and locations has a good number of implementations and is runtime-library dependent. Generally speaking, whenever you call malloc(), the size and location is stored in a table. Whenever you call free(), the table entry for the provided pointer is looked up. If it is found, that entry is removed. If it is not found, the free() is ignored (which also indicates a possible memory leak).
When all malloc() entries in a virtual page are freed, that virtual page is then released back to the OS (this also implies that free() does not always release memory back to the OS since the virtual page may still have other malloc() entries in it). If there is not enough space within a given virtual page to support another malloc() of a specified size, another virtual page is requested from the OS.
Embedded processors usually don't have operating systems, virtual page tables, nor multiple processes. In this case, virtual memory is not used. Instead, the entire memory of the embedded processor is treated like one large virtual page (although the addresses are actually physical addresses) and memory management follows a similar process as previously described.
Here is a similar stack overflow question with more in-depth answers.
"Is it better to allocate memory in big chunks rather than small ones and why?" Allocate as much memory as you need, no more and no less. Compiler optimizations are very smart, and memory will almost always be managed more efficiently (i.e. reducing memory fragmentation) than the programmer can manually do. This is especially true in a non-embedded environment.
Here is a similar stack overflow question with more in-depth answers (note that it pertains to C and not C++, however it is still relevant to this discussion).

Well, there are more than one way to achieve that.
I once had to wrote a malloc() (and free()) implementation for educational purpose.
This is from my experience, and real world implementation surely vary.
I used a double linked list.
Memory chunk returned to the user after calling malloc() were in fact a struct containing relevant information to my implementation (ie the next and prev pointer, and a is_used byte).
So when a user request N bytes I allocated N + sizeof(my_struct) bytes, hiding next and prev pointers at the begenning of the chunk, and returning what's left to the user.
Surely, this is poor design for a program that use a lot of small allocation (because each allocation takes up to N + 2 pointers + 1 byte).
For a real world implementation, you can take a look to the code of good and well known memory allocator.

Normally there exist two different layers.
One layer lives at application level, usually as part of the C standard library. This is what you call through functions like malloc and free (or operator new in C++, which in turn usually calls malloc). This layer takes care of your allocations, but does not know about memory or where it comes from.
The other layer, at OS level, does not know and does not care anything about your allocations. It only maintains a list of fixed-size memory pages that have been reserved, allocated, and accessed, and with each page information such as where it maps to.
There are many different implementations for either layer, but in general it works like this:
When you allocate memory, the allocator (the "application level part") looks whether it has a matching block somewhere in its books that it can give to you (some allocators will split a larger block in two, if need be).
If it doesn't find a suitable block, it reserves a new block (usually much larger than what you ask for) from the operating system. sbrk or mmap on Linux, or VirtualAlloc on Windows would be typical examples of functions it might use for that effect.
This does very little apart from showing intent to the operating system, and generating some page table entries.
The allocator then (logically, in its books) splits up that large area into smaller pieces according to its normal mode of operation, finds a suitable block, and returns it to you. Note that this returned memory does not necessarily even exist as phsyical memory (though most allocators write some metadata into the first few bytes of each allocated unit, so they necessarily pre-fault the pages).
In the mean time, invisibly, a background task zeroes out memory pages that were in use by some process once but have been freed. This happens all the time, on a tentative base, since sooner or later someone will ask for memory (often, that's what the idle task does).
Once you access an address in the page that contains your allocated block for the first time, you generate a fault. The page table entry of this yet non-existent page (it logically exists, just not phsyically) is replaced with a reference to a page from the pool of zero pages. In the uncommon case that there is none left, for example if huge amounts of memory are being allocated all the time, the OS swaps out a page which it believes will not be accessed any time soon, zeroes it, and returns this one.
Now the page becomes part of your working set, it corresponds to actual phsyical memory, and it accounts towards your process' quota. While your process is running, pages may be moved in and out of your working set, or may be paged out and in, as you exceed certain limits, and according to how much memory is needed and how it is accessed.
Once you call free, the allocator puts the freed area back into its books. It may tell the OS that it does not need the memory any more instead, but usually this does not happen as it is not really necessary and it is more efficient to keep around a little extra memory and reuse it. Also, it may not be easy to free the memory because usually the units that you allocate/deallocate do not directly correspond with the units the OS works with (and, in the case of sbrk they'd need to happen in the correct order, too).
When the process ends, the OS simply throws away all page table entries and adds all pages to the list of pages that the idle task will zero out. So the physical memory becomes available to the next process asking for some.

What's the graceful way of handling out of memory situations in C/C++?

I'm writing an caching app that consumes large amounts of memory.
Hopefully, I'll manage my memory well enough, but I'm just thinking about what
to do if I do run out of memory.
If a call to allocate even a simple object fails, is it likely that even a syslog call
will also fail?
EDIT: Ok perhaps I should clarify the question. If malloc or new returns a NULL or 0L value then it essentially means the call failed and it can't give you the memory for some reason. So, what would be the sensible thing to do in that case?
EDIT2: I've just realised that a call to "new" can throw an exception. This could be caught at a higher level so I can perhaps gracefully exit further up. At that point, it may even be possible to recover depending on how much memory is freed. In the least I should by that point hopefully be able to log something. So while I have seen code that checks the value of a pointer after new, it is unnecessary. While in C, you should check the return value for malloc.

Well, if you are in a case where there is a failure to allocate memory, you're going to get a std::bad_alloc exception. The exception causes the stack of your program to be unwound. In all likelihood, the inner loops of your application logic are not going to be handling out of memory conditions, only higher levels of your application should be doing that. Because the stack is getting unwound, a significant chunk of memory is going to be free'd -- which in fact should be almost all the memory used by your program.
The one exception to this is when you ask for a very large (several hundred MB, for example) chunk of memory which cannot be satisfied. When this happens though, there's usually enough smaller chunks of memory remaining which will allow you to gracefully handle the failure.
Stack unwinding is your friend ;)
EDIT: Just realized that the question was also tagged with C -- if that is the case, then you should be having your functions free their internal structures manually when out of memory conditions are found; not to do so is a memory leak.
EDIT2: Example:
#include <iostream>
#include <vector>
void DoStuff()
{
std::vector<int> data;
//insert a whole crapload of stuff into data here.
//Assume std::vector::push_back does the actual throwing
//i.e. data.resize(SOME_LARGE_VALUE_HERE);
}
int main()
{
try
{
DoStuff();
return 0;
}
catch (const std::bad_alloc& ex)
{ //Observe that the local variable `data` no longer exists here.
std::cerr << "Oops. Looks like you need to use a 64 bit system (or "
"get a bigger hard disk) for that calculation!";
return -1;
}
}
EDIT3: Okay, according to commenters there are systems out there which do not follow the standard in this regard. On the other hand, on such systems, you're going to be SOL in any case, so I don't see why they merit discussion. But if you are on such a platform, it is something to keep in mind.

Doesn't this question make assumptions regarding overcommitted memory?
I.e., an out of memory situation might not be recoverable! Even if you have no memory left, calls to malloc and other allocators may still succeed until the program attempts to use the memory. Then, BAM!, some process gets killed by the kernel in order to satisfy memory load.

I don't have any specific experience on Linux, but I spent a lot of time working in video games on games consoles, where running out of memory is verboten, and on Windows-based tools.
On a modern OS, you're most likely to run out of address space. Running out of memory, as such, is basically impossible. So just allocate a large buffer, or buffers, on startup, in order to hold all the data you'll ever need, whilst leaving a small amount for the OS. Writing random junk to these regions would probably be a good idea in order to force the OS to actually assign the memory to your process. If your process survives this attempt to use every byte it's asked for, there's some kind of backing now reserved for all of this stuff, so now you're golden.
Write/steal your own memory manager, and direct it to allocate from these buffers. Then use it, consistently, in your app, or take advantage of gcc's --wrap option to forward calls from malloc and friends appropriately. If you use any libraries that can't be directed to call into your memory manager, junk them, because they'll just get in your way. Lack of overridable memory management calls is evidence of deeper-seated issues; you're better of without this particular component. (Note: even if you're using --wrap, trust me, this is still evidence of a problem! Life is too short to use those libraries that don't let you overload their memory management!)
Once you run out of memory, OK, you're screwed, but you've still got that space you left free before, so if freeing up some of the memory you've asked for is too difficult you can (with care) call system calls to write a message to the system log and then terminate, or whatever. Just make sure to avoid calls to the C library, because they'll probably try to allocate some memory when you least expect it -- programmers who work with systems that have virtualised address spaces are notorious for this kind of thing -- and that's the very thing that has caused the problem in the first place.
This approach might sound like a pain in the arse. Well... it is. But it's straightforward, and it's worth putting in a bit of effort for that. I think there's a Kernighan-and/or-Ritche quote about this.

If your application is likely to allocate large blocks of memory and risks hitting the per-process or VM limits, waiting until an allocation actually fails is a difficult situation from which to recover. By the time malloc returns NULL or new throws std::bad_alloc, things may be too far gone to reliably recover. Depending on your recovery strategy, many operations may still require heap allocations themselves, so you have to be extremely careful on which routines you can rely.
Another strategy you may wish to consider is to query the OS and monitor the available memory, proactively managing your allocations. This way you can avoid allocating a large block if you know it is likely to fail, and will thus have a better chance of recovery.
Also, depending on your memory usage patterns, using a custom allocator may give you better results than the standard built-in malloc. For example, certain allocation patterns can actually lead to memory fragmentation over time, so even though you have free memory, the available blocks in the heap arena may not have an available block of the right size. A good example of this is Firefox, which switched to dmalloc and saw a great increase in memory efficiency.

I don't think that capturing the failure of malloc or new will gain you much in your situation. linux allocates large chunks of virtual pages in malloc by means of mmap. By this you may find yourself in a situation where you allocate much more virtual memory than you have (real + swap).
The program then will only fail much later with a segfault (SIGSEGV) when you write to the first page for which there isn't any place in swap. In theory you could test for such situations by writing a signal handler and then dirtying all pages that you allocate.
But usually this will not help much either, since your application will be in a very bad state long before that: constantly swapping, computing mechanically with your harddisk...

It's possible for writes to the syslog to fail in low memory conditions: there's no way to know that for every platform without looking at the source for the relevant functions. They could need dynamic memory to format strings that are passed in, for instance.
Long before you run out of memory, however, you'll start paging stuff to disk. And when that happens, you can forget any performance advantages from caching.
Personally, I'm convinced by the design behind Varnish: the operating system offers services to solve a lot of the relevant problems, and it makes sense to use those services (minor editing):
So what happens with Squid's elaborate memory management is that it gets into fights with the kernel's elaborate memory management ...
Squid creates a HTTP object in RAM and it gets used some times rapidly after creation. Then after some time it get no more hits and the kernel notices this. Then somebody tries to get memory from the kernel for something and the kernel decides to push those unused pages of memory out to swap space and use the (cache-RAM) more sensibly for some data which is actually used by a program. This however, is done without Squid knowing about it. Squid still thinks that these http objects are in RAM, and they will be, the very second it tries to access them, but until then, the RAM is used for something productive. ...
After some time, Squid will also notice that these objects are unused, and it decides to move them to disk so the RAM can be used for more busy data. So Squid goes out, creates a file and then it writes the http objects to the file.
Here we switch to the high-speed camera: Squid calls write(2), the address it gives is a "virtual address" and the kernel has it marked as "not at home". ...
The kernel tries to find a free page, if there are none, it will take a little used page from somewhere, likely another little used Squid object, write it to the paging ... space on the disk (the "swap area") when that write completes, it will read from another place in the paging pool the data it "paged out" into the now unused RAM page, fix up the paging tables, and retry the instruction which failed. ...
So now Squid has the object in a page in RAM and written to the disk two places: one copy in the operating system's paging space and one copy in the filesystem. ...
Here is how Varnish does it:
Varnish allocate some virtual memory, it tells the operating system to back this memory with space from a disk file. When it needs to send the object to a client, it simply refers to that piece of virtual memory and leaves the rest to the kernel.
If/when the kernel decides it needs to use RAM for something else, the page will get written to the backing file and the RAM page reused elsewhere.
When Varnish next time refers to the virtual memory, the operating system will find a RAM page, possibly freeing one, and read the contents in from the backing file.
And that's it. Varnish doesn't really try to control what is cached in RAM and what is not, the kernel has code and hardware support to do a good job at that, and it does a good job.
You may not need to write caching code at all.

As has been stated, exhausting memory means that all bets are off. IMHO the best method of handling this situation is to fail gracefully (as opposed to simply crashing!). Your cache could allocate a reasonable amount of memory on instantiation. The size of this memory would equate to an amount that, when freed, will allow the program to terminate reasonably. When your cache detects that memory is becoming low then it should release this memory and instigate a graceful shutdown.

I'm writing an caching app that consumes large amounts of memory.
Hopefully, I'll manage my memory well enough, but I'm just thinking about what to do if I do run out of memory.
If you are writing deamon which should run 24/7/365, then you should not use dynamic memory management: preallocate all the memory in advance and manage it using some slab allocator/memory pool mechanism. That will also protect you again the heap fragmentation.
If a call to allocate even a simple object fails, is it likely that even a syslog call will also fail?
Should not. This is partially reason why syslog exists as a syscall: that application can report an error independent of its internal state.
If malloc or new returns a NULL or 0L value then it essentially means the call failed and it can't give you the memory for some reason. So, what would be the sensible thing to do in that case?
I generally try in the situations to properly handle the error condition, applying the general error handling rules. If error happens during initialization - terminate with error, probably configuration error. If error happens during request processing - fail the request with out-of-memory error.
For plain heap memory, malloc() returning 0 generally means:
that you have exhausted the heap and unless your application free some memory, further malloc()s wouldn't succeed.
the wrong allocation size: it is quite common coding error to mix signed and unsigned types when calculating block size. If the size ends up mistakenly negative, passed to malloc() where size_t is expected, it becomes very large number.
So in some sense it is also not wrong to abort() to produce the core file which can be analyzed later to see why the malloc() returned 0. Though I prefer to (1) include the attempted allocation size in the error message and (2) try to proceed further. If application would crash due to other memory problem down the road (*), it would produce core file anyway.
(*) From my experience of making software with dynamic memory management resilient to malloc() errors I see that often malloc() returns 0 not reliably. First attempts returning 0 are followed by a successful malloc() returning valid pointer. But first access to the pointed memory would crash the application. This is my experience on both Linux and HP-UX - and I have seen similar pattern on Solaris 10 too. The behavior isn't unique to Linux. To my knowledge the only way to make an application 100% resilient to memory problems is to preallocate all memory in advance. And that is mandatory for mission critical, safety, life support and carrier grade applications - they are not allowed dynamic memory management past initialization phase.

I don't know why many of the sensible answers are voted down. In most server environments, running out of memory means that you have a leak somewhere, and that it makes little sense to 'free some memory and try to go on'. The nature of C++ and especially the standard library is that it requires allocations all the time. If you are lucky, you might be able to free some memory and execute a clean shutdown, or at least emit a warning.
It is however far more likely that you won't be able to do a thing, unless the allocation that failed was a huge one, and there is still memory available for 'normal' things.
Dan Bernstein is one of the very few guys I know that can implement server software that operates in memory constrained situations.
For most of the rest of us, we should probably design our software that it leaves things in a useful state when it bails out because of an out of memory error.
Unless you are some kind of brain surgeon, there isn't a lot else to do.
Also, very often you won't even get an std::bad_alloc or something like that, you'll just get a pointer in return to your malloc/new, and only die when you actually try to touch all of that memory. This can be prevented by turning off overcommit in the operating system, but still.
Don't count on being able to deal with the SIGSEGV when you touch memory that the kernel hoped you wouldn't be.. I'm not quite sure how this works on the windows side of things, but I bet they do overcommit too.
All in all, this is not one of C++'s strong spots.

Reducing memory footprint of large unfamiliar codebase

Suppose you have a fairly large (~2.2 MLOC), fairly old (started more than 10 years ago) Windows desktop application in C/C++. About 10% of modules are external and don't have sources, only debug symbols.
How would you go about reducing application's memory footprint in half? At least, what would you do to find out where memory is consumed?

Override malloc()/free() and new()/delete() with wrappers that keep track of how big the allocations are and (by recording the callstack and later resolving it against the symbol table) where they are made from. On shutdown, have your wrapper display any memory still allocated.
This should enable you both to work out where the largest allocations are and to catch any leaks.

this is description/skeleton of memory tracing application I used to reduce memory consumption of our game by 20%. It helped me to track many allocations done by external modules.

It's not an easy task. Begin by chasing down any memory leaks you cand find (a good tool would be Rational Purify). Skim the source code and try to optimize data structures and/or algorithms.
Sorry if this may sound pessimistic, but cutting down memory usage by 50% doesn't sound realistic.

There is a chance is you can find some significant inefficiencies very fast. First you should check what is the memory used for. A tool which I have found very handy for this is Memory Validator
Once you have this "memory usage map", you can check for Low Hanging Fruit. Are there any data structures consuming a lot of memory which could be represented in a more compact form? This is often possible, esp. when the data access is well encapsulated and when you have a spare CPU power you can dedicate to compressing / decompressing them on each access.

I don't think your question is well posed.
The size of source code is not directly related to the memory footprint. Sure, the compiled code will occupy some memory but the application might will have memory requirements on it's own. Both static (the variables declared in the code) and dynamic (the object the application creates).
I would suggest you to profile program execution and study the code carefully.

First places to start for me would be:
Does the application do a lot of preallocation memory to be used later? Does this memory often sit around unused, never handed out? Consider switching to newing/deleting (or better use a smart_ptr) as needed.
Does the code use a static array such as
Object arrayOfObjs[MAX_THAT_WILL_EVER_BE_USED];
and hand out objs in this array? If so, consider manually managing this memory.

One of the tools for memory usage analysis is LeakDiag, available for free download from Microsoft. It apparently allows to hook all user-mode allocators down to VirtualAlloc and to dump process allocation snapshots to XML at any time. These snapshots then can be used to determine which call stacks allocate most memory and which call stacks are leaking. It lacks pretty frontend for snapshot analysis (unless you can get LDParser/LDGrapher via Microsoft Premier Support), but all the data is there.
One more thing to note is that you may have false leak positives from BSTR allocator due to caching, see "Hey, why am I leaking all my BSTR's?"

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js