How to profile a Linux executable's static memory usage?


I'm part of a team of developers who wrote a rather elaborate set of C++-based daemons, a dozen or so instances of which run simultaneously on an x86-based Xenomai/real-time Linux server.
The daemons are all compiled together into a single executable (BusyBox-style), whose main() function checks argv[1] and (based on its value) calls the appropriate daemon's subdaemonname_main() function.
I noticed the other day (by doing a "ps -ww -eio pid,%mem,rss,args" on the Linux server) that each of the processes takes up about 35 megabytes of RAM on the server, even if the process is doing nothing but sleeping at the very top of main(). For comparison, if I compile "hello world" (as a separate executable), ps shows that its process takes up practically no RAM.
The lesson I take from that is: the downside of compiling a number of C++ programs into a single executable like this is that each process that executable runs in will set up all the static/global objects declared in all of the .cpp files, even the ones that will never be used by the particular sub-daemon the process will run. This is a waste of RAM, and I was able to find a couple of large static C++ objects in the code and change them to be non-static to reduce RAM usage.
My question is, is there any (semi-)automated way for me to get an inventory of what composes that 35MB/process of before-main() memory usage? I can sort of do it by groveling through every .cpp file manually looking for static or global object declarations, but since it's a big codebase that would take quite a long time, and I might still miss something. Is there a quick way to profile the process's static-object-memory tables somehow? Having that information would give me a better idea of where the best opportunities for further reducing RAM usage might be.

Take a look at the link map (ld --print-map, or gcc -Wl,--print-map if you're linking with gcc); it should give you some idea of which static objects went into the final executable.
Sure, you get all of them in .data if they were declared that way. The linker has no way to tell that some data will never be used, so it has to map everything.
Probably the only simple way to avoid this in reasonably readable C++ is placing "statics" on the stack, specifically the stack of their respective main_*() functions, and passing pointers. That's assuming the "statics" are mostly empty. Any actual static data will still be in .data and there's no way to avoid it.
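A minimal sketch of that refactoring, assuming a hypothetical LookupTable type and a hypothetical exampledaemon_main() entry point (neither name comes from the question). The object is owned by a local variable of the sub-daemon's main function, so it is only constructed in processes that actually run that daemon. The sketch keeps the bulk of the storage in a std::vector rather than in a raw stack array to avoid overflowing the stack.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a large object that used to be a file-scope static.
struct LookupTable {
    std::vector<int> entries;
    explicit LookupTable(std::size_t n) : entries(n, 0) {}
};

// Before: `static LookupTable g_table(1 << 20);` at file scope would be
// constructed in every process, whichever sub-daemon it ends up running.

// After: helpers receive the state by reference instead of reaching for a global.
std::size_t table_size(const LookupTable& table) {
    return table.entries.size();
}

// Hypothetical sub-daemon entry point: the table is built only when this
// daemon is the one selected by argv[1].
int exampledaemon_main(int /*argc*/, char** /*argv*/) {
    LookupTable table(1 << 20);  // lives only for the duration of this daemon
    return table_size(table) == (1u << 20) ? 0 : 1;
}
```

The cost is that every function that used the global now needs the extra parameter, which is the "passing pointers" part mentioned above.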
With some black ELF magic, it may be possible to have dedicated .data sections and map them on demand, but I strongly suspect it's not worth the effort. Especially with C++ and its quirks.

Related

calling exe performance cost (compared to DLL)

We were discussing the possibility of using an exe instead of a DLL from C or C++ code. The idea would be, in some cases, to use an exe and pass arguments to it. (I guess it's equivalent to somehow loading its main function as if it were a DLL.)
The question we were wondering about is whether this implies a performance cost (especially in a loop with more than one iteration).
I tried to look in existing threads, but nobody answered this specific question. I saw that calling a function from a DLL has an overhead for the first call, but that subsequent calls only take 1 or 2 instructions.
For the exe case, a separate process has to be created each time so it can run (plus a second process if I need to open a shell that launches it, but from my research I can do it without calling a shell). This process creation should cost some performance, I'd guess. Moreover, I think that the exe will each time be loaded into RAM, destroyed at the end of the process, then reloaded for the next call and so on. I assume that problem is not present with a DLL.
PS: we were discussing this question more on a theoretical level than for implementing it, it's a question for the sake of learning.

The costs of running an exe are tremendous compared to calling a function from a DLL. If you can do it with a DLL, then you should if performance matters.
Of course, there may be other factors to consider. For example, when there is a bug in the called code and it crashes the process, in the case of an exe it is merely that exe that goes down and the caller survives, but if the bug is in a DLL, the caller crashes, too.

Clearly, a DLL is going to get loaded, and if you call to it many times in a short time, it will have a benefit. If the time between calls is long enough, the DLL content may get evicted from RAM and have to be loaded from disk again (yes, that's hard to specify, and partly depends on the memory usage on the system).
However, executable files do get cached in memory, so the cost of "loading the executable" isn't that big. Yes, you have to create a new process and destroy it at the end, with all the related memory management code. For a small executable, this will be relatively light work; for a large, complex executable, it may take quite a long time.
Bear in mind that executing the same program many times isn't unusual - compiling a large project or running some sort of script on a large number of files, just to give a couple of simple examples. So the performance of this will be tuned by OS developers.
Obviously, the "retain stuff in RAM" applies to both DLL and EXE - it's basic file-caching done by the OS.
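If you want to see the difference on your own machine, here is a rough measurement sketch. It is an assumption-laden stand-in, not a proper benchmark: add_one() is a hypothetical function playing the role of the DLL call, and std::system("exit 0") goes through the command interpreter, so it overstates what a direct CreateProcess or fork/exec would cost.

```cpp
#include <chrono>
#include <cstdlib>

// Hypothetical stand-in for work that could live in a DLL.
int add_one(int x) { return x + 1; }

// Time `calls` in-process function calls, in nanoseconds.
long long time_in_process_calls(int calls) {
    const auto start = std::chrono::steady_clock::now();
    volatile int sink = 0;  // volatile keeps the loop from being optimized away
    for (int i = 0; i < calls; ++i) {
        sink = add_one(sink);
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

// Time one process launch, in nanoseconds. The command does nothing, so the
// elapsed time is almost entirely process creation and teardown.
long long time_one_spawn() {
    const auto start = std::chrono::steady_clock::now();
    const int rc = std::system("exit 0");
    (void)rc;  // the exit status is irrelevant for this measurement
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
```

Comparing time_in_process_calls(1000000) against a single time_one_spawn() gives a feel for the orders of magnitude involved; the exact numbers depend heavily on the OS and on file caching.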

Force executable into memory?

I have a cpp executable (it contains static libraries), about 1MB in size. When I run the exe, it consumes less than 200kb memory.
From what I understand this means the computer reads the exe little by little when it's needed from the HDD.
I want to improve the performance, even a bit, so, how can I say "load the exe into memory" and don't touch the HDD? Will this bring any performance improvement?

The OS will load parts of the executable into memory as it is needed. This is where knowing more about the instruction cache might be useful. The idea is that you structure your program so that common code is grouped together. For example, you might have some functions that are getting inlined - in this case the OS would have to load the same code in multiple places which might be slow. By removing the inline you'd have the code in one chunk in memory which would get cached and thus reduce loading time.
I would agree with the others, though, that this type of optimization should really be reserved until after you profile and know for sure that this is the bottleneck, which is very unlikely.

If you really want to do this, you need to touch the memory pages by reading from them. But forcing pages into memory once does not guarantee that they will remain in memory. An apparent alternative solution would be to VirtualLock the region, but in practice this function doesn't work the way you'd think (at least on any system where I've used it), even if you have the appropriate privileges.
Note that the default minimum working set is only 16MB, so for larger executables, forcing pages into RAM will necessarily push others (which you need!) out of the working set, so this is in fact an anti-optimization, unless you have the necessary privileges to increase the working set size.
It's a bit tedious to find out where the executable's mapping starts and ends. Not that it is impossible, but it's much more complicated than just mapping the file again. Then you simply run a loop which reads one byte every 4096 bytes, and you are done. This will consume twice as much address space, but will consume the same amount of RAM (thanks to how memory mapping works).
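The page-touching loop could look roughly like this. It is a sketch that assumes you already have a pointer to the mapped region and its length (obtaining the mapping is platform specific and not shown), and that the page size is 4096 bytes as in the text above; touch_pages is a hypothetical name.

```cpp
#include <cstddef>

// Read one byte from each page in [base, base + length) so that the OS has
// to bring that page in. Returns the number of pages touched.
std::size_t touch_pages(const unsigned char* base, std::size_t length,
                        std::size_t page_size = 4096) {
    volatile unsigned char sink = 0;  // volatile so the reads are not elided
    std::size_t touched = 0;
    for (std::size_t offset = 0; offset < length; offset += page_size) {
        sink = base[offset];
        ++touched;
    }
    (void)sink;
    return touched;
}
```

On a real system you would query the page size (GetSystemInfo on Windows, sysconf on POSIX) instead of hard-coding 4096.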
But, realistically, you will gain absolutely nothing from doing this.
The operating system does not need to load the entire executable and does not need to keep it resident at all times. Part of your executable will be debug info or import info, which the loader will maybe look at once (or won't look at) and never need afterwards. Forcing that stuff into memory only means you purge useful pages from the working set.
The OS likely has the parts (or most of it) that are not visible to you in the buffer cache anyway, but even if that isn't the case, you will hardly ever notice a difference.

Globally, forcing all of the program into RAM will slow it down.
There are usually large parts of the code which aren't executed
in any given run, and there's no need to ever read these from
disk.
Where forcing all or parts of the program into RAM can make a difference
is latency. If you're responding in real time to external
events, having to load the code in order to respond will increase
latency. Avoiding this can only be done by using a system specific
request (e.g. mlock under Posix systems supporting the real-time
extension). You'll probably have to have special rights to
be able to do it, though. In practice, it should only be used
on machines dedicated to a specific application, since it can
have a very negative impact on the total system performance.
(There's a reason that it's in the real-time extensions, and not
in the basic Posix.) Locking the addresses used by the function in
memory means that there can be no page faults when it is executed.
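A minimal sketch of such a request on a POSIX system, assuming mlock is available. page_floor and lock_range are hypothetical helper names. The call can legitimately fail (for example with EPERM or ENOMEM when the process lacks the rights or exceeds its locked-memory limit), so the sketch returns the errno value instead of assuming success.

```cpp
#include <sys/mman.h>  // mlock (POSIX)
#include <unistd.h>    // sysconf
#include <cerrno>
#include <cstddef>
#include <cstdint>

// Round an address down to the start of the page that contains it.
std::uintptr_t page_floor(std::uintptr_t addr, std::uintptr_t page_size) {
    return addr - (addr % page_size);
}

// Try to pin [addr, addr + length) in RAM. Returns 0 on success, otherwise
// the errno value reported by mlock.
int lock_range(const void* addr, std::size_t length) {
    const auto page = static_cast<std::uintptr_t>(sysconf(_SC_PAGESIZE));
    const auto first = reinterpret_cast<std::uintptr_t>(addr);
    const auto begin = page_floor(first, page);  // some systems require page alignment
    const auto end = first + length;
    if (mlock(reinterpret_cast<const void*>(begin), end - begin) != 0) {
        return errno;
    }
    return 0;
}
```

Real-time applications often call mlockall(MCL_CURRENT | MCL_FUTURE) once at startup instead of locking individual ranges.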

I should avoid static compilation because of cache miss?

The title sums up pretty much the entire story, I was reading this and the key point is that
A bigger executable means more cache misses
and since a static executable is by definition bigger than one that is dynamically linked, I'm curious about what the practical considerations are in this case.

The article in the link discusses the side effect of inlining small functions in the OS kernel. This has indeed got a noticeable effect on performance, because the same function is called from many different places throughout a sequence of system calls - for example if you call open, and then call read, seek and write, open will store a file handle somewhere in the kernel, and in the calls to read, seek, and write, that handle will have to be "found". If that's an inlined function, we now have three copies of that function in the cache, and no benefit at all from read having called the same function as seek and write do. If it's a "non-inline" function, it will indeed be ready in the cache when seek and write call that function.
For a given process, whether the code is linked statically or dynamically will have very little impact once the application is fully loaded. If there are MANY copies of the application, then other processes may benefit from re-using the same memory for the shared libraries. But the size needed for that process remains the same whether it is shared with 0, 1, 3, or 100 other processes. The benefit of sharing the binary files across many executables comes from things like the C library that is behind almost every single executable in the system - so when you have 1000 processes running in the system that ALL use the same basic runtime system, there is only one copy rather than 1000 copies of the code. But it is unlikely to have much effect on the cache efficiency of any particular application - perhaps common functions like strcpy are used often enough that there is a small chance that when the OS task switches, the code is still in the cache when the next application calls strcpy.
So, in summary: probably doesn't make any difference at all.

The overall memory footprint of the static version is the same as that of the dynamic version; remember that the dynamically-linked objects still need to be loaded into memory!
Of course, one could also argue that if there are multiple processes running, and they all dynamically link against the same object, then only one copy is required in memory, and so the aggregate footprint is lower than if they had all statically linked.
[Disclaimer: all of the above is educated guesswork; I've never measured the effect of linking on cache behaviour.]

Accessing >2,3,4GB Files in 32-bit Process on 64-bit (or 32-bit) Windows

Disclaimer: I apologize for the verbosity of this question (I think it's an interesting problem, though!), yet I cannot figure out how to more concisely word it.
I have done hours of research as to the apparently myriad of ways in which to solve the problem of accessing multi-GB files in a 32-bit process on 64-bit Windows 7, ranging from /LARGEADDRESSAWARE to VirtualAllocEx AWE. I am somewhat comfortable in writing a multi-view memory-mapped system in Windows (CreateFileMapping, MapViewOfFile, etc.), yet can't quite escape the feeling that there is a more elegant solution to this problem. Also, I'm quite aware of Boost's interprocess and iostream templates, although they appear to be rather lightweight, requiring a similar amount of effort to writing a system utilizing only Windows API calls (not to mention the fact that I already have a memory-mapped architecture semi-implemented using Windows API calls).
I'm attempting to process large datasets. The program depends on pre-compiled 32-bit libraries, which is why, for the moment, the program itself is also running in a 32-bit process, even though the system is 64-bit, with a 64-bit OS. I know there are ways in which I could add wrapper libraries around this, yet, seeing as it's part of a larger codebase, it would indeed be a bit of an undertaking. I set the binary headers to allow for /LARGEADDRESSAWARE (at the expense of decreasing my kernel space?), such that I get up to around 2-3 GB of addressable memory per process, give or take (depending on heap fragmentation, etc.).
Here's the issue: the datasets are 4+GB, and have DSP algorithms run upon them that require essentially random access across the file. A pointer to the object generated from the file is handled in C#, yet the file itself is loaded into memory (with this partial memory-mapped system) in C++ (it's P/Invoked). Thus, I believe the solution is unfortunately not as simple as simply adjusting the windowing to access the portion of the file I need to access, as essentially I want to still have the entire file abstracted into a single pointer, from which I can call methods to access data almost anywhere in the file.
Apparently, most memory mapped architectures rely upon splitting the singular process into multiple processes.. so, for example, I'd access a 6 GB file with 3x processes, each holding a 2 GB window to the file. I would then need to add a significant amount of logic to pull and recombine data from across these different windows/processes. VirtualAllocEx apparently provides a method of increasing the virtual address space, but I'm still not entirely sure if this is the best way of going about it.
But, let's say I want this program to function just as "easily" as a singular 64-bit process on a 64-bit system. Assume that I don't care about thrashing, I just want to be able to manipulate a large file on the system, even if only, say, 500 MB were loaded into physical RAM at any one time. Is there any way to obtain this functionality without having to write a somewhat ridiculous memory system by hand? Or, is there some better way than what I have found thus far by combing SO and the internet?
This lends itself to a secondary question: is there a way of limiting how much physical RAM would be used by this process? For example, what if I wanted to limit the process to only having 500 MB loaded into physical RAM at any one time (whilst keeping the multi-GB file paged on disk)?
I'm sorry for the long question, but I feel as though it's a decent summary of what appear to be many questions (with only partial answers) that I've found on SO and the net at large. I'm hoping that this can be an area wherein a definitive answer (or at least some pros/cons) can be fleshed out, and we can all learn something valuable in the process!

You could write an accessor class to which you give a base address and a length. It returns data, or throws an exception (or however else you want to report errors) if an error condition arises (out of bounds, etc).
Then, any time you need to read from the file, the accessor object can use SetFilePointerEx() before calling ReadFile(). You can then pass the accessor class to the constructor of whatever objects you create when you read the file. The objects then use the accessor class to read the data from the file. Then it returns the data to the object's constructor which parses it into object data.
If, later down the line, you're able to compile to 64-bit, you can just change (or extend) the accessor class to read from memory instead.
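A rough sketch of such an accessor. To keep it portable it uses std::ifstream with seekg()/read() as stand-ins for SetFilePointerEx()/ReadFile(); the class name FileAccessor and its interface are hypothetical, not something from the question.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical accessor: a window [base, base + length) onto a file.
class FileAccessor {
public:
    FileAccessor(const std::string& path, std::uint64_t base, std::uint64_t length)
        : file_(path, std::ios::binary), base_(base), length_(length) {
        if (!file_) throw std::runtime_error("cannot open " + path);
    }

    // Read `count` bytes starting at `offset` within the window.
    std::vector<char> read(std::uint64_t offset, std::size_t count) {
        if (offset > length_ || count > length_ - offset) {
            throw std::out_of_range("read outside accessor window");
        }
        std::vector<char> out(count);
        file_.clear();  // reset any earlier EOF/fail state before seeking
        file_.seekg(static_cast<std::streamoff>(base_ + offset));     // SetFilePointerEx stand-in
        file_.read(out.data(), static_cast<std::streamsize>(count));  // ReadFile stand-in
        if (static_cast<std::size_t>(file_.gcount()) != count) {
            throw std::runtime_error("short read");
        }
        return out;
    }

private:
    std::ifstream file_;
    std::uint64_t base_;
    std::uint64_t length_;
};
```

Offsets are 64-bit throughout, so the window can sit anywhere in a multi-GB file even in a 32-bit process, provided the standard library's stream offsets are 64-bit on your toolchain. Only the bytes requested by each read() are held in memory.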
As for limiting the amount of RAM used by the process, that's mostly a matter of making sure that
A) you don't have memory leaks (especially obscene ones) and
B) you destroy objects you don't need at that very moment. Even if you will need an object later down the line and its data won't change, just destroy it. Then recreate it later when you do need it, allowing it to re-read the data from the file.

Is it safe to send a pointer to a static function over the network?

I was thinking about some RPC code that I have to implement in C++ and I wondered if it's safe (and under which assumptions) to send such a pointer over the network to the same binary code (assuming it's exactly the same and that both ends are running on the same architecture). I guess virtual memory should make the difference here.
I'm asking just out of curiosity, since it's a bad design in any case, but I would like to know if it's theoretically possible (and if it extends to other kinds of pointers to static data, other than functions, that the program may include).

In general, it's not safe for many reasons, but there are limited cases in which it will work. First of all, I'm going to assume you're using some sort of signing or encryption in the protocol that ensures the integrity of your data stream; if not, you have serious security issues already that are only compounded by passing around function pointers.
If the exact same program binary is running on both ends of the connection, if the function is in the main program (or in code linked from a static library) and not in a shared library, and if the program is not built as a position-independent executable (PIE), then the function pointer will be the same on both ends and passing it across the network should work. Note that these are very stringent conditions that would have to be documented as part of using your program, and they're very fragile; for instance if somebody upgrades the software on one side and forgets to upgrade the version on the other side of the connection at the same time, things will break horribly and dangerously.
I would avoid this type of low-level RPC entirely in favor of a higher-level command structure or abstract RPC framework, but if you really want to do it, a slightly safer approach would be to pass function names and use dlsym or equivalent to look them up. If the symbols reside in the main program binary rather than libraries, then depending on your platform you might need -rdynamic (GCC) or a similar option to make them available to dlsym. libffi might also be a useful tool for abstracting this.
Also, if you want to avoid depending on dlsym or libffi, you could keep your own "symbol table" hard-coded in the binary as a static const linear table or hash table mapping symbol names to function pointers. The hash table format used in ELF for this purpose is very simple to understand and implement, so I might consider basing your implementation on that.
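A minimal sketch of the hard-coded table variant, using a linear table rather than an ELF-style hash table. The handler names, the RpcHandler signature, and lookup_handler are all hypothetical and only there for illustration.

```cpp
#include <cstring>

// Assumed common signature for every remotely callable function.
using RpcHandler = int (*)(int);

// Hypothetical handlers that the peer is allowed to name.
int rpc_double(int x) { return 2 * x; }
int rpc_negate(int x) { return -x; }

struct SymbolEntry {
    const char* name;
    RpcHandler fn;
};

// The hand-written "symbol table": only functions listed here can be called.
static const SymbolEntry kSymbols[] = {
    {"rpc_double", rpc_double},
    {"rpc_negate", rpc_negate},
};

// Linear lookup by name. Returns nullptr for unknown names so the receiver
// can reject the request instead of jumping to an arbitrary address.
RpcHandler lookup_handler(const char* name) {
    for (const SymbolEntry& entry : kSymbols) {
        if (std::strcmp(entry.name, name) == 0) {
            return entry.fn;
        }
    }
    return nullptr;
}
```

The sender transmits the string "rpc_double" rather than an address, and the receiver calls lookup_handler() and checks for nullptr before invoking the result. The table also acts as a whitelist, which a raw pointer never could.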

What is it a pointer to?
Is it a pointer to a piece of static program memory? If so, don't forget that it's an address, not an offset, so you'd first need to convert between the two accordingly.
Second, if it's not a piece of static memory (ie: statically allocated array created at build time as opposed to run time) it's not really possible at all.
Finally, how are you ensuring the two pieces of code are the same? Are both binaries bit-identical (eg: diff -a binary1 binary2)? Even if they are bit-identical, depending on the virtual memory management on each machine, the entire program's program memory segment may not exist in a single page, or the alignment across multiple pages may be different for each system.
This is really a bad idea, no matter how you slice it. This is what message passing and APIs are for.

I don't know of any form of RPC that will let you send a pointer over the network (at least without doing something like casting to int first). If you do convert to int on the sending end, and convert that back to a pointer on the far end, you get pretty much the same as converting any other arbitrary int to a pointer: undefined behavior if you ever attempt to dereference it.
Normally, if you pass a pointer to an RPC function, it'll be marshalled -- i.e., the data it points to will be packaged up, sent across, put into memory, and a pointer to that local copy of the data passed to the function on the other end. That's part of why/how IDL gets a bit ugly -- you need to tell it how to figure out how much data to send across the wire when/if you pass a pointer. Most know about zero-terminated strings. For other types of arrays, you typically need to specify the size of the data (somehow or other).

This is highly system dependent. On systems with virtual addressing such that each process thinks it's running at the same address each time it executes, this could plausibly work for executable code. Darren Kopp's comment and link regarding ASLR is interesting - a quick read of the Wikipedia article suggests the Linux & Windows versions focus on data rather than executable code, except for "network facing daemons" on Linux, and on Windows it applies only when "specifically linked to be ASLR-enabled".
Still, "same binary code" is best assured by static linking - if different shared objects/libraries are loaded, or they're loaded in different order (perhaps due to dynamic loading - dlopen - driven by different ordering in config files or command line args etc.) you're probably stuffed.

Sending a pointer over the network is generally unsafe. The two main reasons are:
Reliability: the data/function pointer may not point to the same entity (data structure or function) on another machine due to a different location of the program or its libraries or dynamically allocated objects in memory. Relocatable code + ASLR can break your design. At the very least, if you want to point to a statically allocated object or a function, you should send its offset w.r.t. the image base if your platform is Windows, or do something similar on whatever OS you are on.
Security: if your network is open and there's a hacker (or they have broken into your network), they can impersonate your first machine and make the second machine either hang or crash, causing a denial of service, or execute arbitrary code and get access to sensitive information or tamper with it or hijack the machine and turn it into an evil bot sending spam or attacking other computers. Of course, there are measures and countermeasures here, but...
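To illustrate the offset idea from the reliability point: the sketch below converts a function pointer to an offset from a reference address and back. As an assumption for portability it uses the address of a known function (anchor_function) as the reference point instead of the real image base, which you would obtain through a platform API. All names are hypothetical, and casting function pointers to integers is only conditionally supported by the C++ standard, though mainstream compilers accept it.

```cpp
#include <cstdint>

using Handler = int (*)(int);

int anchor_function(int x) { return x; }      // hypothetical reference point
int target_function(int x) { return x + 1; }  // hypothetical function to name remotely

// Offset of `fn` from the reference point. Unsigned arithmetic wraps around
// if `fn` lies below `base`, and from_offset() undoes the same wrap.
std::uintptr_t to_offset(Handler fn, Handler base) {
    return reinterpret_cast<std::uintptr_t>(fn) -
           reinterpret_cast<std::uintptr_t>(base);
}

// Rebuild the pointer on the receiving side from its own reference point.
Handler from_offset(std::uintptr_t offset, Handler base) {
    return reinterpret_cast<Handler>(
        reinterpret_cast<std::uintptr_t>(base) + offset);
}
```

This only survives relocation if both sides run a bit-identical image, and the receiver still has to validate the offset against a list of allowed targets before calling anything, for the security reasons above.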
If I were you, I'd design something different. And I'd ensure that the transmitted data is either unimportant or encrypted and the receiving part does the necessary validation of it prior to using it, so there are no buffer overflows or execution of arbitrary things.

If you're looking for some formal guarantees, I cannot help you. You would have to look in the documentation of the compiler and OS that you're using - however I doubt that you would find the necessary guarantees - except possibly for some specialized embedded systems OS'.
I can however provide you with one scenario where I'm 99.99% sure that it will work without any problems:
Windows
32 bit process
Function is located in a module that doesn't have relocation information
The module in question is already loaded & initialized on the client side
The module in question is 100% identical on both sides
A compiler that doesn't do very crazy stuff (e.g. MSVC and GCC should both be fine)
If you want to call a function in a DLL you might run into problems. As per the list above the module (=DLL) may not have relocation information, which of course makes it impossible to relocate it (which is what we need). Unfortunately that also means that loading the DLL will fail, if the "preferred load address" is used by something else. So that would be kind-of risky.
If the function resides in the EXE however, you should be fine. A 32 bit EXE doesn't need relocation information, and most don't include it (MSVC default settings). BTW: ASLR is not an issue here since a) ASLR only moves modules that are tagged as wanting to be moved and b) ASLR could not move a 32 bit Windows module without relocation information, even if it wanted to.
Most of the above just makes sure that the function will have the same address on both sides. The only remaining question - at least that I can think of - is: is it safe to call a function via a pointer that we initialized by memcpy-ing over some bytes that we received from the network, assuming that the byte-pattern is the same that we would have gotten if we had taken the address of the desired function? That surely is something that the C++ standard doesn't guarantee, but I don't expect any real-world problems from current real-world compilers.
That being said, I would not recommend doing that, except for situations where security and robustness really aren't important.
