interpreting _dl_runtime_resolve_xsave'2 in callgrind output - c++

Looking at the output of callgrind for my program run, I see that 125% (!) of the cycles are spent in _dl_runtime_resolve_xsave'2 (apparently part of the dynamic linker), while 100% is spent in main. Callgrind also says that almost all the time spent inside _dl_runtime_resolve_xsave'2 is actually spent in inner methods (self = 0%), but it does not show any callees for this method.
Moreover, it looks like _dl_runtime_resolve_xsave'2 is called from several places in the program I am profiling.
I can understand that some time could be spent outside of main, because the program I am profiling uses the prototype pattern and many object prototypes are built when their dynamic libraries are loaded. But that cannot amount to anywhere close to 25% of the time of this particular run: if I do the run with no input data, it takes orders of magnitude less time than the run I am profiling now.
Also, the program does not use dlopen to open shared objects after program start; everything should be loaded at startup.
(A screenshot of the KCachegrind window was attached here.)
How can I interpret those calls to _dl_runtime_resolve_xsave'2? Do I need to be concerned by the time spent in this method?
Thank you for your help.

_dl_runtime_resolve_xsave is used by the glibc dynamic loader during lazy binding. It looks up the function symbol on the first call to a function and then performs a tail call to the implementation. Unless you use something like LD_BIND_NOT=1 in the environment when launching the program, this is a one-time operation that happens only on the first call to each function. Lazy binding has some cost, but unless you have many functions that are called exactly once, it will not contribute much to the execution cost. What you are seeing is more likely a reporting artifact, perhaps related to the tail call or to the rather exotic XSAVE instruction used in _dl_runtime_resolve_xsave.
You can disable lazy binding by launching the program with the LD_BIND_NOW=1 environment variable set: the dynamic loader trampoline will not be used, because all functions are resolved at startup. Alternatively, you can link with -Wl,-z,now to make this change permanent (at least for the code you link; system libraries may still use lazy binding for their own function symbols).
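To make the one-time resolver cost concrete, here is a minimal sketch that times the first and second call through the PLT. It assumes a shared library libfoo.so exporting a trivial function foo; both names are placeholders, as are the build commands in the comments:

// foo.cpp, the hypothetical library:
//   extern "C" void foo() {}
//
// Assumed build and run commands:
//   g++ -shared -fPIC -o libfoo.so foo.cpp
//   g++ -o demo demo.cpp -L. -lfoo -Wl,-rpath,'$ORIGIN'
//   ./demo                  # first call pays the resolver cost
//   LD_BIND_NOW=1 ./demo    # resolution is moved to startup
#include <chrono>
#include <cstdio>

extern "C" void foo();  // resolved lazily through the PLT by default

int main() {
    for (int i = 0; i < 2; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        foo();  // iteration 0 normally goes through _dl_runtime_resolve*
        auto t1 = std::chrono::steady_clock::now();
        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("call %d took %lld ns\n", i, ns);
    }
}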

Related

calling exe performance cost (compared to DLL)

We were discussing the possibility of using an exe instead of a DLL inside C or C++ code. The idea would be, in some cases, to use an exe and pass arguments to it. (I guess it's equivalent to somehow loading its main function as if it were a DLL.)
The question we were wondering about is whether this implies a performance cost (especially in a loop with more than one iteration).
I tried to look through existing threads, but nobody answered this specific question. I saw that calling a function from a DLL has an overhead on the first call, but subsequent calls only take one or two instructions.
For the exe case, each call will need to create a separate process so it can run (a second process if I need to open a shell to launch it, though from my research I can do it without calling a shell). This process creation should cost some performance, I'd guess. Moreover, I think the exe will be loaded into RAM each time, destroyed at the end of the process, then reloaded for the next call, and so on; a problem that is not present (?) with a DLL.
PS: we were discussing this question more on a theoretical level than with implementation in mind; it's a question for the sake of learning.
The costs of running an exe are tremendous compared to calling a function from a DLL. If you can do it with a DLL, then you should if performance matters.
Of course, there may be other factors to consider. For example, when there is a bug in the called code that crashes the process: in the case of an exe, it is merely that exe that goes down and the caller survives, but if the bug is in a DLL, the caller crashes too.
Clearly, a DLL is going to get loaded, and if you call to it many times in a short time, it will have a benefit. If the time between calls is long enough, the DLL content may get evicted from RAM and have to be loaded from disk again (yes, that's hard to specify, and partly depends on the memory usage on the system).
However, executable files do get cached in memory, so the cost of "loading the executable" isn't that big. Yes, you have to create a new process and destroy it at the end, with all the related memory management work. For a small executable this will be relatively light work; for a large, complex executable it may take quite a long time.
Bear in mind that executing the same program many times isn't unusual - compiling a large project or running some sort of script on a large number of files, just to give a couple of simple examples. So the performance of this will be tuned by OS developers.
Obviously, the "retain stuff in RAM" applies to both DLL and EXE - it's basic file-caching done by the OS.
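To put rough numbers on this, here is a hedged sketch of a Windows micro-benchmark; worker.dll (exporting work) and worker.exe are hypothetical stand-ins for the same functionality packaged both ways:

// Compares 1000 calls into a DLL against 1000 process launches.
#include <windows.h>
#include <chrono>
#include <cstdio>

int main() {
    using clock = std::chrono::steady_clock;

    // DLL path: pay LoadLibrary once, then each call is an indirect call.
    HMODULE mod = LoadLibraryA("worker.dll");
    if (!mod) { std::puts("LoadLibrary failed"); return 1; }
    auto work = reinterpret_cast<void (*)()>(GetProcAddress(mod, "work"));
    if (!work) { std::puts("GetProcAddress failed"); return 1; }

    auto t0 = clock::now();
    for (int i = 0; i < 1000; ++i) work();
    auto t1 = clock::now();

    // Exe path: every "call" is a full process create/wait/destroy cycle.
    for (int i = 0; i < 1000; ++i) {
        STARTUPINFOA si = { sizeof(si) };
        PROCESS_INFORMATION pi = {};
        char cmd[] = "worker.exe";
        if (CreateProcessA(nullptr, cmd, nullptr, nullptr, FALSE, 0,
                           nullptr, nullptr, &si, &pi)) {
            WaitForSingleObject(pi.hProcess, INFINITE);
            CloseHandle(pi.hThread);
            CloseHandle(pi.hProcess);
        }
    }
    auto t2 = clock::now();

    long long dll_ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    long long exe_ms = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::printf("1000 DLL calls: %lld ms, 1000 exe runs: %lld ms\n", dll_ms, exe_ms);
    FreeLibrary(mod);
}

Expect the exe column to be larger by several orders of magnitude, even with the executable fully cached by the OS.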

How do I get call parents for libc6 symbols (e.g. _int_malloc) with linux perf?

I'm profiling a C++ application using linux perf, and I'm getting a nice control flow graph using GProf2dot. However, some symbols from the C library (libc6-2.13.so) take a substantial portion of the total time, and yet have no in-edges.
For example:
_int_malloc takes 8% of the time but has no call parents.
__strcmp_sse42 and __cxxabiv1::__si_class_type_info::__do_dyncast together take about 10% of the time, and have a caller whose name is 0, which has callers 2d6935c, 2cc748c, and 6, which have no callers.
As a result, I can't find out which routines are responsible for all this mallocing and dynamic casting using just perf. However, it seems that other symbols (e.g. malloc but not _int_malloc) do have call parents.
Why doesn't perf show call parents for _int_malloc? Why can't I find the ultimate callers of __do_dyncast? And is there some way for me to modify my setup so that I can get this information? I'm on x86-64, so I'm wondering if I need a (non-standard) libc6 built with frame pointers.
Update: As of the 3.7.0 kernel, one can determine call parents of symbols in system libraries using perf record -gdwarf <command>.
Using -gdwarf, there is no need to compile with -fno-omit-frame-pointer.
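For anyone who wants to try this, here is a small, hedged test program (all names are made up) whose allocations should show up under _int_malloc, together with assumed build and record commands:

// Build and record (mirroring the update above; newer perf versions spell
// the option `perf record --call-graph dwarf`):
//   g++ -O2 -g -o alloc_demo alloc_demo.cpp
//   perf record -gdwarf ./alloc_demo
//   perf report
#include <string>
#include <vector>

// With DWARF unwinding, samples landing in _int_malloc should be
// attributed back to this caller even without frame pointers in libc.
static std::vector<std::string> make_strings(int n) {
    std::vector<std::string> v;
    for (int i = 0; i < n; ++i)
        v.push_back("string number " + std::to_string(i));
    return v;
}

int main() {
    long total = 0;
    for (int i = 0; i < 2000; ++i)
        total += static_cast<long>(make_strings(1000).size());
    return total == 0;  // keep the work from being optimized away
}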
Original answer:
Yes, one probably would need a libc6 compiled with frame pointers (-fno-omit-frame-pointer) on x86_64, at the moment (May 24, 2012).
However, developers are currently working on allowing the perf tools to use DWARF unwind info. This means that frame pointers are no longer needed to get backtrace information on x86_64. Linus, however, does not want a DWARF unwinder in the kernel. Thus, the perf tools will save registers as the system is running, and perform the DWARF unwinding in the userspace perf tool using the libunwind library.
This technique has been tested to successfully determine callers of (for example) malloc and dynamic_cast. However, the patch set is not yet integrated into the Linux kernel, and needs to undergo further revision before it is ready.
_int_malloc and __do_dyncast are being called from routines that the profiler can't identify because it doesn't have symbol table information for them.
What's more, it looks like you are showing self (exclusive) time.
That is only useful for finding hotspots in routines that a) have much self time, and b) you can fix.
There's a reason profilers subsequent to the original Unix profil were created: real software consists of functions that spend nearly all their time calling other functions, so you need to be able to find code that is on the stack much of the time, not code that holds the program counter much of the time.
So you need to configure perf to take stack samples and tell you the percent of time each of your routines is on the stack.
It is even better if it reports not just routines, but lines of code, as in Zoom.
It is best to take the samples on wall-clock time, so you're not blind to IO.
There's more to say on all this.

Newbie: Performance Analysis through the command line

I am looking for a performance analysis tool with the following properties:
1. Free.
2. Runs on Windows.
3. Does not require using a GUI (i.e., can be run from the command line or through a library in any programming language).
4. Runs on some x86-based architecture (preferably Intel).
5. Can measure the running time of my C++, MinGW-compiled program, except for the time spent in a few functions I specify (and all calls emanating from them).
6. Can measure the amount of memory used by my program, except for the memory allocated by the functions specified in (5) and all calls emanating from them.
A tool that has properties (1) to (5) (without (6)) can still be very valuable to me.
My goal is to be able to compare the running time and memory usage of different programs in a consistent way (i.e. the main requirement is that timing the same program twice would return roughly the same results).
MinGW should already include a gprof tool. To use it you just need to compile with the correct flags set; I think it was -g -pg.
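A minimal sketch of that workflow (file and function names are arbitrary):

// demo.cpp -- something for the profile to show.
// Assumed commands:
//   g++ -g -pg -o demo.exe demo.cpp
//   demo.exe                            (writes gmon.out on exit)
//   gprof demo.exe gmon.out > report.txt
#include <cmath>
#include <cstdio>

static double slow_part(int n) {
    double s = 0;
    for (int i = 1; i <= n; ++i) s += std::sqrt(static_cast<double>(i));
    return s;
}

static double fast_part(int n) {
    double s = 0;
    for (int i = 1; i <= n; ++i) s += i;
    return s;
}

int main() {
    // slow_part should dominate the flat profile in report.txt.
    std::printf("%f\n", slow_part(20000000) + fast_part(20000000));
}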
For heap analysis there is umdh.exe (free), a full heap dumper; you can also compare consecutive memory snapshots to check for leakage over time. You would have to filter the output yourself to remove functions that are not of interest, however.
I know that's not exactly what you were asking for in (6), but it might be useful. I think filtering like this is not going to be common in freeware.
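A hedged example of typical UMDH usage (the process ID and file names are placeholders; gflags must first enable user-mode stack traces for the target):

    gflags /i myprog.exe +ust
    umdh -p:1234 -f:snap1.log
    ... let the program run for a while ...
    umdh -p:1234 -f:snap2.log
    umdh snap1.log snap2.log > diff.txt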

Is a DLL slower than a static link?

I made a GUI library for games. My test demo runs at 60 fps. When I run the demo with the static version of my library, it takes 2-3% CPU in Task Manager. When I use the DLL version, it uses around 13-15%. Is that normal? If so, how could I optimize it? I already ask the compiler to use /O2 for maximum function inlining.
Do not start your performance timer until the DLL has had an opportunity to execute its functionality once; this gives it time to load into memory. Then start the timer and check performance. It should then basically match that of the static lib.
Also keep in mind that the load location of the DLL can greatly affect how quickly it loads. The default base address for DLLs built with Visual Studio is 0x10000000. If some other DLL is already loaded at that location, the loader must perform an expensive rebasing step, which will throw off your timing even more.
If you have such a conflict, just choose a different base address in Visual Studio.
You will have the overhead of loading the DLL (should be just once at the beginning). It isn't statically linked in with direct calls, so I would expect a small amount of overhead but not much.
However, some DLLs will have much higher overheads. I'm thinking of COM objects although there may be other examples. COM adds a lot of overhead on function calls between objects.
If you call DLL functions, they cannot be inlined into the caller. You should think a little about your DLL boundaries.
Maybe it would be better for your application to have a small bootstrap exe that just executes a main loop inside your DLL. That way you can avoid much of the function-call overhead, as in the sketch below.
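A hedged sketch of that bootstrap idea (gui.dll and RunMainLoop are hypothetical names for your library and its exported entry point):

#include <windows.h>
#include <cstdio>

int main() {
    // The load cost is paid once at startup.
    HMODULE gui = LoadLibraryA("gui.dll");
    if (!gui) { std::puts("failed to load gui.dll"); return 1; }

    auto run = reinterpret_cast<int (*)()>(GetProcAddress(gui, "RunMainLoop"));
    if (!run) { std::puts("RunMainLoop not exported"); return 1; }

    int rc = run();  // the 60 fps loop runs entirely inside the DLL
    FreeLibrary(gui);
    return rc;
}

This keeps the per-frame work on one side of the DLL boundary, so the inlining loss is limited to a single cross-boundary call per run instead of many calls per frame.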
It's a little unclear what's being statically/dynamically linked. Is the DLL of your lib statically linked with its dependencies? Is it possible that the DLL is calling other DLLs (that will be slow)? Maybe try running a profiler from the Valgrind suite on your executable to determine where all the CPU usage is coming from.

_dl_runtime_resolve -- When do the shared objects get loaded into memory?

We have a message processing system with high performance demands. Recently we noticed that the first message takes many times longer than subsequent messages. A bunch of transformation and message augmentation happens as a message goes through our system, much of it done by way of external libraries.
I just profiled this issue (using callgrind), comparing a "run" of just one message with a "run" of many messages (providing a baseline of comparison).
The main difference I see is the function do_lookup_x taking up a huge amount of time. Looking at the various calls to this function, they all seem to be made by a common function: _dl_runtime_resolve. I am not sure what this function does, but to me it looks like this is the first time the various shared libraries are being used, and they are then being loaded into memory by the dynamic linker.
Is this a correct assumption? That the binary will not load the shared libraries into memory until they are being prepped for use, so that we see a massive slowdown on the first message but on none of the subsequent ones?
How do we go about avoiding this?
Note: We operate on the microsecond scale.
From the ld.so(8) man page, ENVIRONMENT section:
LD_BIND_NOW
(libc5; glibc since 2.1.1) If set to a non-empty string, causes
the dynamic linker to resolve all symbols at program startup
instead of deferring function call resolution to the point when
they are first referenced. This is useful when using a debugger.
So, LD_BIND_NOW=y ./timesensitiveapp.
As an alternative to Ignacio Vazquez-Abrams's runtime suggestion, you can do the same thing at link time. When you link your shared library, pass the -z now flag to the linker.
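(Strictly speaking, the shared objects are mapped into memory at program start; what lazy binding defers is the per-symbol lookup that do_lookup_x performs on the first call.) If you cannot change the environment or relink, another common pattern at microsecond scale is to warm the path yourself at startup. A hedged sketch, where Pipeline and process are placeholders for your own types:

#include <string>

struct Message { std::string payload; };

struct Pipeline {
    void process(const Message& m) {
        // ... transformation and augmentation via the external libraries ...
        (void)m;
    }
};

int main() {
    Pipeline p;

    // Push one representative synthetic message through every code path at
    // startup, so lazy symbol resolution (and other first-touch costs) are
    // paid before the first real message arrives.
    p.process(Message{"warm-up"});

    // ... start accepting real messages on the hot path ...
}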