Should I avoid static compilation because of cache misses?

The title pretty much sums up the whole story. I was reading this, and the key point is that
A bigger executable means more cache misses
and since a static executable is by definition bigger than one that is dynamically linked, I'm curious about the practical considerations in this case.

The article in the link discusses the side effect of inlining small functions in the OS kernel. This does have a noticeable effect on performance, because the same function is called from many different places throughout a sequence of system calls. For example, if you call open and then call read, seek and write, open will store a file handle somewhere in the kernel, and in the calls to read, seek and write that handle will have to be "found". If that's an inlined function, we now have three copies of that function in the cache, and no benefit at all from read having called the same function as seek and write do. If it's a "non-inline" function, it will indeed be ready in the cache when seek and write call it.
For a given process, whether the code is linked statically or dynamically will have very little impact once the application is fully loaded. If there are MANY copies of the application, then other processes may benefit from re-using the same memory for the shared libraries. But the size needed for that process remains the same whether it is shared with 0, 1, 3, or 100 other processes. The benefit of sharing the binary files across many executables comes from things like the C library, which is behind almost every single executable in the system - so when you have 1000 processes running in the system that ALL use the same basic runtime system, there is only one copy rather than 1000 copies of the code. But it is unlikely to have much effect on the cache efficiency of any particular application - perhaps common functions like strcpy are used often enough that there is a small chance that, when the OS task-switches, the code is still in the cache when the next application calls strcpy.
So, in summary: probably doesn't make any difference at all.

The overall memory footprint of the static version is the same as that of the dynamic version; remember that the dynamically-linked objects still need to be loaded into memory!
Of course, one could also argue that if there are multiple processes running, and they all dynamically link against the same object, then only one copy is required in memory, and so the aggregate footprint is lower than if they had all statically linked.
[Disclaimer: all of the above is educated guesswork; I've never measured the effect of linking on cache behaviour.]


How do DLLs handle concurrency from multiple processes?

I understand from Eric Lippert's answer that "two processes can share non-private memory pages. If twenty processes all load the same DLL, the processes all share the memory pages for that code. They don't share virtual memory address space, they share memory."
Now, if the same DLL file on the hard disk, after loaded into applications, would share the same physical memory (be it RAM or page files), but mapped to different virtual memory address spaces, wouldn't that make it quite difficult to handle concurrency?
As I understand it, the concurrency concept in C++ is mostly about threading -- a process can start multiple threads, each of which can run on an individual core, so when different threads call the DLL at the same time there might be data races, and we need mutexes, locks, signals, condition variables and so on.
But how would a DLL handle multiple processes? The same kind of data race will happen, won't it? What are the tools to handle that? Still the same toolset?
Now, if the same DLL file on the hard disk, after loaded into applications, would share the same physical memory (be it RAM or page files), but mapped to different virtual memory address spaces, wouldn't that make it quite difficult to handle concurrency?
As other answers have noted, the concurrency issues are of no concern if the shared memory is never written after it is initialized, which is typically the case for DLLs. If you are attempting to alter the code or resources in a DLL by writing into memory, odds are good you have a bad pointer somewhere and the best thing to do is to crash with an access violation.
However I wanted to also briefly follow up on your concern:
... mapped to different virtual memory address spaces ...
In practice we try very hard to avoid this happening because when it does, there can be a serious user-noticeable performance problem when loading code pages for the first time. (And of course a possible large increase in working set, which causes other performance problems.)
The code in a DLL often contains hard-coded virtual memory addresses, on the assumption that the code will be loaded into a known-at-compile-time virtual memory "base" address. If this assumption is violated at runtime -- because there's another DLL already there, for example -- then all those hard-coded addresses need to be patched at runtime, which is expensive.
If you want some historical details, see Raymond's article on the subject: https://blogs.msdn.microsoft.com/oldnewthing/20041217-00/?p=36953/
DLLs contain multiple "segments", and each segment has a descriptor telling Windows its Characteristics. This is a 32-bit DWORD. Code segments obviously have the code bit set, and generally also the shareable bit. Read-only data can also be shareable, whereas writeable data generally does not have the shareable flag.
Now you can set an unusual combination of characteristics on an extra segment: writeable and shareable. That is not the default, and indeed might cause race conditions. So the final answer to your question is: the problem is avoided chiefly by the default characteristics of segments, and secondly any DLL which has a segment with non-standard characteristics must deal with the self-inflicted problems.
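To make the "writeable and shareable" case concrete, here is a rough sketch of how such a segment is usually set up with MSVC, using `#pragma data_seg` plus the linker's `/SECTION` option. The names `.shared`, `g_attach_count`, `note_attach` and `note_detach` are made up for illustration, and the non-MSVC branch exists only so the snippet compiles elsewhere (where nothing is actually shared between processes). Because every process that loads the DLL sees the same counter, the updates have to be atomic.

```cpp
#include <cassert>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Hypothetical cross-process counter living in a shared, writeable segment.
#ifdef _MSC_VER
#pragma data_seg(".shared")                      // put initialized data in ".shared"
#endif
volatile long g_attach_count = 0;                // explicit initializer on purpose
#ifdef _MSC_VER
#pragma data_seg()                               // back to the default data segment
#pragma comment(linker, "/SECTION:.shared,RWS")  // read, write, shared
#endif

// Atomically increment the counter and return the new value.
long note_attach() {
#ifdef _MSC_VER
    return _InterlockedIncrement(&g_attach_count);
#else
    return __atomic_add_fetch(&g_attach_count, 1, __ATOMIC_SEQ_CST);
#endif
}

// Atomically decrement the counter and return the new value.
long note_detach() {
#ifdef _MSC_VER
    long now = _InterlockedDecrement(&g_attach_count);
#else
    long now = __atomic_sub_fetch(&g_attach_count, 1, __ATOMIC_SEQ_CST);
#endif
    assert(now >= 0 && "detach without a matching attach");
    return now;
}
```

Anything more complicated than a counter in such a segment would need a real cross-process primitive (a named mutex, for instance), which is exactly the self-inflicted problem mentioned above.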

Is program runtime affected by where the objects reside in the memory?

My program allocates all of its resources, slightly below 1MB in total, at startup and nothing more after that, except primitive local variables. The allocation was originally done with malloc, so on the heap, but I wondered whether putting the objects on the stack would make any difference.
In various tests, with program runtimes from 3 seconds to 3 minutes, accessing the stack consistently appears to be up to 10% faster. All I changed was whether to malloc the structs or to declare them as automatic variables.
Another interesting fact I found is that when I declare the objects as static, the program runs 20~30% slower. I have no idea why. I double checked whether I made a mistake, but the only difference really is whether static is there or not. Do static variables go somewhere else in the stack than automatic variables?
Previously I had quite the opposite experience: in a C++ class, when I changed a const member array from non-static to static, the program ran faster. The memory consumption was the same because there was only one instance of that object.
Is program runtime affected by where the objects reside in the memory? Even if so, can't the compiler manage to place the objects in the right place for maximum efficiency?
Well, yeah, program performance is affected by where objects reside in memory.
The problem is, unless you have intimate knowledge of how your compiler works and how it uses features of your particular host system (operating system services, hardware, processor cache, etc), and how those things are configured, you will not be able to consistently exploit it. Even if you do succeed, small changes (e.g. upgrading a compiler, changing optimisation settings, a change of process quotas, changing the amount of physical memory, changing a hard drive that is used for swap space) can affect performance, and it won't always be easy to predict whether a change will improve or degrade performance. Performance is sensitive to all these things - and to the interaction between them, in ways that are not always obvious without close analysis.
Is program runtime affected by where the objects reside in the memory?
Yes, program performance will be affected by the location of an object in memory, among other factors.
Whenever an object in the heap is accessed, it's done by dereferencing a pointer. Dereferencing pointers requires additional computation to find the next memory address and every additional pointer that's between you and your data (e.g. pointer-> pointer -> actual data) will make it slightly worse. In addition to this, it increases the cache miss rate for the CPU because the data that it expects to access next is not really in contiguous memory blocks. In other words, the assumption the CPU made to try and optimize its pipeline turns out to be false, and it pays the penalty for an incorrect prediction.
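A small sketch of the two layouts being compared, in case you want to measure this yourself. The function names are invented for the example. Both functions compute the same sum; the only difference is whether the values sit in one contiguous block or each behind its own pointer. How much slower the indirect version is (if at all) depends on the allocator, the data size and the hardware, so treat it as something to time on your own machine and not as a guaranteed result.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// Contiguous layout: one allocation, neighbouring values share cache lines.
long long sum_contiguous(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0LL);
}

// Indirect layout: every value lives in its own heap allocation,
// so each access goes through one extra pointer.
long long sum_indirect(const std::vector<std::unique_ptr<int>>& boxes) {
    long long total = 0;
    for (const auto& box : boxes) {
        assert(box != nullptr);  // every slot is expected to own a value
        total += *box;
    }
    return total;
}

// Helpers that fill both layouts with the values 0..n-1.
std::vector<int> make_contiguous(std::size_t n) {
    std::vector<int> values(n);
    std::iota(values.begin(), values.end(), 0);
    return values;
}

std::vector<std::unique_ptr<int>> make_indirect(std::size_t n) {
    std::vector<std::unique_ptr<int>> boxes;
    boxes.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        boxes.push_back(std::make_unique<int>(static_cast<int>(i)));
    return boxes;
}
```

Wrapping each call in a `std::chrono::steady_clock` measurement with a few million elements is usually enough to see whether the layout matters for your workload.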
Even if so, can't the compiler manage to place the objects in the right place for maximum efficiency?
C/C++ compilers will place objects wherever you tell them to. If you're using malloc, you shouldn't expect the compiler to put things on the stack, and vice-versa.
The compiler wouldn't be able to do this even in principle because malloc dynamically allocates memory at runtime, long after the compiler's job has ended. What is known at compile time is the size for the stack, but how the contents of this memory are organized depends entirely on you.
You might be able to get some benefit from compiler optimization settings, but most of your optimization effort will be better spent improving the data structures and/or algorithms being used.

How to profile a Linux executable's static memory usage?

I'm part of a team of developers who wrote a rather elaborate set of C++-based daemons, a dozen or so instances of which run simultaneously on an x86-based Xenomai/real-time Linux server.
The daemons are all compiled together into a single executable (BusyBox-style), whose main() function checks argv[1] and (based on its value) calls the appropriate daemon's subdaemonname_main() function.
I noticed the other day (by doing a "ps -ww -eio pid,%mem,rss,args" on the Linux server) that each of the processes takes up about 35 megabytes of RAM on the server, even if the process is doing nothing but sleeping at the very top of main(). For comparison, if I compile "hello world" (as a separate executable), ps shows that its process takes up practically no RAM.
The lesson I take from that is: the downside of compiling a number of C++ programs into a single executable like this is that each process that executable runs in will set up all the static/global objects declared in all of the .cpp files, even the ones that will never be used by the particular sub-daemon the process will run. This is a waste of RAM, and I was able to find a couple of large static C++ objects in the code and change them to be non-static to reduce RAM usage.
My question is, is there any (semi-)automated way for me to get an inventory of what composes that 35MB/process of before-main() memory usage? I can sort of do it by groveling through every .cpp file manually looking for static or global object declarations, but since it's a big codebase that would take quite a long time, and I might still miss something. Is there a quick way to profile the process's static-object-memory tables somehow? Having that information would give me a better idea of where the best opportunities for further reducing RAM usage might be.
Take a look at the link map (ld --print-map, or gcc -Wl,--print-map if you're linking with gcc), it should give you some idea re. which static objects went into the final executable.
Sure you get all of them in .data if they were declared that way. The linker has no way to tell some data will never be used, so it has to map everything.
Probably the only simple way to avoid this in reasonably readable C++ is placing "statics" in the stack, specifically the stack of their respective main_*() functions, and passing pointers. That's assuming "statics" are mostly empty. Any actually static data will still be in .data and there's no way to avoid it.
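A minimal sketch of that idea, with made-up names (`LoggerState`, `logger_main`, `idle_main`, `dispatch`). One deliberate deviation: the state is held through a `std::unique_ptr` owned by the sub-daemon's main function instead of being a plain automatic variable, on the assumption that multi-megabyte objects could exceed the default stack size. The point is the same either way: the object is only built in the process that actually runs that sub-daemon.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical per-daemon state that used to be a file-scope static.
struct LoggerState {
    std::vector<char> buffer;
    LoggerState() : buffer(8 * 1024 * 1024) {}  // large, but built on demand
};

int g_states_built = 0;  // bookkeeping for this example only

// Worker code receives a pointer instead of reaching for a global.
int run_logger(LoggerState* state) {
    assert(state != nullptr);
    return state->buffer.empty() ? 1 : 0;
}

int logger_main() {
    auto state = std::make_unique<LoggerState>();  // lives as long as this daemon
    ++g_states_built;
    return run_logger(state.get());
}

int idle_main() { return 0; }  // never pays for the logger's state

// BusyBox-style dispatch on the sub-daemon name (argv[1] in the real program).
int dispatch(const std::string& name) {
    if (name == "logger") return logger_main();
    if (name == "idle") return idle_main();
    return 2;  // unknown sub-daemon
}
```

With this shape, a process started as "idle" never constructs `LoggerState`, whereas a file-scope `static LoggerState` would be constructed before main() in every process.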
With some black ELF magic, it may be possible to have dedicated .data sections and map them on demand, but I strongly suspect it's not worth the effort. Especially with C++ and its quirks.

Force executable into memory?

I have a cpp executable (it contains static libraries), about 1MB in size. When I run the exe, it consumes less than 200kb memory.
From what I understand this means the computer reads the exe little by little when it's needed from the HDD.
I want to improve the performance, even a bit, so, how can I say "load the exe into memory" and don't touch the HDD? Will this bring any performance improvement?
The OS will load parts of the executable into memory as it is needed. This is where knowing more about the instruction cache might be useful. The idea is that you structure your program so that common code is grouped together. For example, you might have some functions that are getting inlined - in this case the OS would have to load the same code in multiple places which might be slow. By removing the inline you'd have the code in one chunk in memory which would get cached and thus reduce loading time.
I would agree with the others though that this type of optimization should really be reserved until after you profile and know for sure that this is the bottleneck, which is very unlikely.
If you really want to do this, you need to touch the memory pages by reading from them. But forcing pages into memory once does not guarantee that they will remain in memory. An apparent alternative would be to VirtualLock the region, but in practice this function doesn't work the way you'd think (at least on any system where I've used it), even if you have the appropriate privileges.
Note that the default minimum working set is only 16MB, so for larger executables, forcing pages into RAM will necessarily push others (which you need!) out of the working set, so this is in fact an anti-optimization, unless you have the necessary privileges to increase the working set size.
It's a bit tedious to find out where the executable's mapping starts and ends. Not that it is impossible, but it's much more complicated than just mapping the file again. Then you simply run a loop which reads one byte every 4096 bytes, and you are done. This will consume twice as much address space, but will consume the same amount of RAM (thanks to how memory mapping works).
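For illustration, the "map the file again and read one byte every 4096 bytes" loop could look roughly like the sketch below. Note that it uses POSIX `mmap` as a stand-in, not the Windows mapping calls the question is about, and `touch_pages` is a name invented here. The structure is the same on Windows: map the file read-only, walk it in page-sized steps, unmap.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map `path` read-only and read one byte every `step` bytes so that each
// page gets faulted in. Returns the number of reads done, or 0 on failure.
std::size_t touch_pages(const char* path, std::size_t step = 4096) {
    assert(step > 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return 0;
    }
    const std::size_t length = static_cast<std::size_t>(st.st_size);

    void* base = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) return 0;

    const volatile unsigned char* bytes =
        static_cast<const volatile unsigned char*>(base);
    std::size_t touched = 0;
    for (std::size_t offset = 0; offset < length; offset += step) {
        unsigned char sink = bytes[offset];  // volatile read cannot be optimized away
        (void)sink;
        ++touched;
    }

    munmap(base, length);
    return touched;
}
```

As said above, the pages are not pinned by this; the OS is free to evict them again a moment later.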
But, realistically, you will gain absolutely nothing from doing this.
The operating system does not need to load the entire executable and does not need to keep it resident at all times. Part of your executable will be debug info or import info, which the loader will maybe look at once (or won't look at) and never need afterwards. Forcing that stuff into memory only means you purge useful pages from the working set.
The OS likely has the parts (or most of it) that are not visible to you in the buffer cache anyway, but even if that isn't the case, you will hardly ever notice a difference.
Globally, forcing all of the program into RAM will slow it down. There are usually large parts of the code which aren't executed in any given run, and there's no need to ever read these from disk.
Where forcing all or parts of the program into RAM can make a difference is latency. If you're responding in real time to external events, having to load the code in order to respond will increase latency. This can only be done by using a system-specific request (e.g. mlock under Posix systems supporting the real-time extension). You'll probably have to have special rights to be able to do it, though. In practice, it should only be used on machines dedicated to a specific application, since it can have a very negative impact on the total system performance. (There's a reason that it's in the real-time extensions, and not in the basic Posix.) Locking the addresses used by the function in memory means that there can be no page faults when it is executed.
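A hedged sketch of what the Posix side looks like. The wrapper names (`pin_in_ram`, `unpin`, `pin_whole_process`) are invented here; `mlock`, `munlock` and `mlockall` are the actual calls. Whether they succeed depends on the process's rights and locked-memory limit, so the wrappers just hand back errno instead of assuming success.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/mman.h>

// Try to pin [addr, addr + len) in RAM. Returns 0 on success, otherwise the
// errno value set by mlock (typically ENOMEM or EPERM when limits or rights
// are missing).
int pin_in_ram(const void* addr, std::size_t len) {
    assert(addr != nullptr && len > 0);
    return mlock(addr, len) == 0 ? 0 : errno;
}

// Undo pin_in_ram for the same range.
int unpin(const void* addr, std::size_t len) {
    return munlock(addr, len) == 0 ? 0 : errno;
}

// Pin everything mapped now (code included) and everything mapped later.
// Only sensible on a machine dedicated to the application.
int pin_whole_process() {
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0 ? 0 : errno;
}
```

Calling `pin_whole_process()` once at startup is the usual way to cover the code pages of the handler itself, since it is awkward to find out exactly which addresses a function occupies.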

Accessing >2,3,4GB Files in 32-bit Process on 64-bit (or 32-bit) Windows

Disclaimer: I apologize for the verbosity of this question (I think it's an interesting problem, though!), yet I cannot figure out how to more concisely word it.
I have done hours of research as to the apparently myriad of ways in which to solve the problem of accessing multi-GB files in a 32-bit process on 64-bit Windows 7, ranging from /LARGEADDRESSAWARE to VirtualAllocEx AWE. I am somewhat comfortable in writing a multi-view memory-mapped system in Windows (CreateFileMapping, MapViewOfFile, etc.), yet can't quite escape the feeling that there is a more elegant solution to this problem. Also, I'm quite aware of Boost's interprocess and iostream templates, although they appear to be rather lightweight, requiring a similar amount of effort to writing a system utilizing only Windows API calls (not to mention the fact that I already have a memory-mapped architecture semi-implemented using Windows API calls).
I'm attempting to process large datasets. The program depends on pre-compiled 32-bit libraries, which is why, for the moment, the program itself is also running in a 32-bit process, even though the system is 64-bit, with a 64-bit OS. I know there are ways in which I could add wrapper libraries around this, yet, seeing as it's part of a larger codebase, it would indeed be a bit of an undertaking. I set the binary headers to allow for /LARGEADDRESSAWARE (at the expense of decreasing my kernel space?), such that I get up to around 2-3 GB of addressable memory per process, give or take (depending on heap fragmentation, etc.).
Here's the issue: the datasets are 4+GB, and have DSP algorithms run upon them that require essentially random access across the file. A pointer to the object generated from the file is handled in C#, yet the file itself is loaded into memory (with this partial memory-mapped system) in C++ (it's P/Invoked). Thus, I believe the solution is unfortunately not as simple as simply adjusting the windowing to access the portion of the file I need to access, as essentially I want to still have the entire file abstracted into a single pointer, from which I can call methods to access data almost anywhere in the file.
Apparently, most memory-mapped architectures rely upon splitting the single process into multiple processes. So, for example, I'd access a 6 GB file with 3 processes, each holding a 2 GB window to the file. I would then need to add a significant amount of logic to pull and recombine data from across these different windows/processes. VirtualAllocEx apparently provides a method of increasing the virtual address space, but I'm still not entirely sure if this is the best way of going about it.
But, let's say I want this program to function just as "easily" as a single 64-bit process on a 64-bit system. Assume that I don't care about thrashing; I just want to be able to manipulate a large file on the system, even if only, say, 500 MB were loaded into physical RAM at any one time. Is there any way to obtain this functionality without having to write a somewhat ridiculous memory system by hand? Or is there some better way than what I have found so far by combing SO and the internet?
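For reference, the arithmetic behind a single sliding window is small. MapViewOfFile requires the file offset of a view to be a multiple of the system allocation granularity (reported by GetSystemInfo, commonly 64 KB), so a request for an arbitrary offset has to be rounded down and the difference remembered. The sketch below only does that calculation; `ViewWindow` and `window_for` are hypothetical names, and the 65536 default is an assumption that real code should replace with the value queried from the system.

```cpp
#include <cassert>
#include <cstdint>

struct ViewWindow {
    std::uint64_t map_offset;  // aligned file offset at which the view starts
    std::uint64_t delta;       // position of the wanted byte inside the view
    std::uint64_t map_length;  // bytes to map so the whole request is covered
};

// Compute the aligned view needed to reach `length` bytes at `file_offset`.
ViewWindow window_for(std::uint64_t file_offset, std::uint64_t length,
                      std::uint64_t file_size,
                      std::uint64_t granularity = 65536) {
    assert(granularity > 0 && file_offset <= file_size);
    if (length > file_size - file_offset)
        length = file_size - file_offset;  // clamp to the end of the file
    const std::uint64_t map_offset = file_offset - (file_offset % granularity);
    const std::uint64_t delta = file_offset - map_offset;
    return ViewWindow{map_offset, delta, delta + length};
}
```

The 64-bit `map_offset` is what gets split into the high and low offset arguments of the mapping call, and `delta` is added to the returned base pointer.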
This lends itself to a secondary question: is there a way of limiting how much physical RAM would be used by this process? For example, what if I wanted to limit the process to only having 500 MB loaded into physical RAM at any one time (whilst keeping the multi-GB file paged on disk)?
I'm sorry for the long question, but I feel as though it's a decent summary of what appear to be many questions (with only partial answers) that I've found on SO and the net at large. I'm hoping that this can be an area wherein a definitive answer (or at least some pros/cons) can be fleshed out, and we can all learn something valuable in the process!
You could write an accessor class to which you give a base address and a length. It returns data, or throws an exception (or however else you want to signal errors) if an error condition arises (out of bounds, etc.).
Then, any time you need to read from the file, the accessor object can use SetFilePointerEx() before calling ReadFile(). You can then pass the accessor class to the constructor of whatever objects you create when you read the file. The objects then use the accessor class to read the data from the file. Then it returns the data to the object's constructor which parses it into object data.
If, later down the line, you're able to compile to 64-bit, you can just change (or extend) the accessor class to read from memory instead.
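A rough sketch of such an accessor follows. To keep it portable it uses `std::ifstream` with `seekg`/`read` in place of SetFilePointerEx() and ReadFile(); that substitution is mine, and the class name `FileAccessor` and its interface are invented for the example. The important parts are the 64-bit offsets and the bounds check before every read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical accessor: a [base, base + length) window onto a file on disk.
class FileAccessor {
public:
    FileAccessor(const std::string& path, std::uint64_t base, std::uint64_t length)
        : file_(path, std::ios::binary), base_(base), length_(length) {
        assert(length <= UINT64_MAX - base);  // the window must not wrap around
        if (!file_) throw std::runtime_error("cannot open " + path);
    }

    // Read `count` bytes starting `offset` bytes into the window.
    std::vector<char> read(std::uint64_t offset, std::size_t count) {
        if (offset > length_ || count > length_ - offset)
            throw std::out_of_range("read outside accessor window");
        std::vector<char> out(count);
        file_.clear();  // drop any earlier eof/fail state before seeking
        file_.seekg(static_cast<std::streamoff>(base_ + offset), std::ios::beg);
        file_.read(out.data(), static_cast<std::streamsize>(count));
        if (static_cast<std::size_t>(file_.gcount()) != count)
            throw std::runtime_error("short read");
        return out;
    }

private:
    std::ifstream file_;
    std::uint64_t base_;
    std::uint64_t length_;
};
```

Objects built from the file take a `FileAccessor&` in their constructor and call `read` for the ranges they need, so nothing has to hold the whole multi-GB file in the 32-bit address space.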
As for limiting the amount of RAM used by the process, that's mostly a matter of making sure that
A) you don't have memory leaks (especially obscene ones) and
B) you destroy objects you don't need at the moment. Even if you will need an object later down the line and its data won't change, just destroy it. Then recreate it later when you do need it, allowing it to re-read the data from the file.