Is a DLL slower than a static link? - c++

I made a GUI library for games. My test demo runs at 60 fps. When I run this demo with the static version of my library it takes 2-3% CPU in Task Manager. When I use the DLL version it uses around 13-15%. Is that normal? If so, how can I optimize it? I already build with /O2 to get the most function inlining.

Do not start your performance timer until the DLL has had an opportunity to execute its functionality once; this gives it time to load into memory. Then start the timer and check performance. It should then essentially match that of the static lib.
Also keep in mind that the load address of the DLL can greatly affect how quickly it loads. The default base address for DLLs in Visual Studio is 0x10000000. If some other DLL is already loaded at that address, the loader must perform an expensive relocation (rebasing) step, which will throw off your timing even more.
If you have such a conflict, just choose a different base address in Visual Studio (the /BASE linker option).
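As a minimal sketch of that warm-up-then-measure approach (the DLL name mygui.dll and the exported function RenderFrame are hypothetical):

```cpp
// Warm-up-then-measure sketch. "mygui.dll" and "RenderFrame" are hypothetical.
#include <windows.h>
#include <chrono>
#include <cstdio>

int main() {
    HMODULE gui = LoadLibraryA("mygui.dll");   // pays the one-time load cost
    if (!gui) return 1;

    using RenderFn = void (*)();
    auto render = reinterpret_cast<RenderFn>(GetProcAddress(gui, "RenderFrame"));
    if (!render) return 1;

    render();   // first call: import fix-ups, page-in, cache warm-up

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 10000; ++i)
        render();                              // steady-state cost only
    auto t1 = std::chrono::steady_clock::now();

    std::printf("%lld us\n", static_cast<long long>(
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()));
    FreeLibrary(gui);
    return 0;
}
```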

You will have the overhead of loading the DLL (which should happen just once, at the beginning). It isn't statically linked with direct calls, so I would expect a small amount of overhead, but not much.
However, some DLLs have much higher overheads. I'm thinking of COM objects, although there may be other examples; COM adds a lot of overhead on function calls between objects.

Calls into a DLL cannot be inlined at the call site, so you should think a little about where you draw your DLL boundaries.
It may be better for your application to have a small bootstrap exe that just executes a main loop inside your DLL. That way you avoid much of the cross-DLL function-call overhead.
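A minimal sketch of that bootstrap pattern, assuming a hypothetical game_core.dll that exports a RunMainLoop entry point:

```cpp
// Bootstrap exe sketch. "game_core.dll" and "RunMainLoop" are hypothetical.
#include <windows.h>

int main() {
    HMODULE lib = LoadLibraryA("game_core.dll");
    if (!lib) return 1;

    using MainLoopFn = int (*)();
    auto runMainLoop =
        reinterpret_cast<MainLoopFn>(GetProcAddress(lib, "RunMainLoop"));
    if (!runMainLoop) { FreeLibrary(lib); return 1; }

    // A single cross-DLL call; the hot loop runs entirely inside the DLL,
    // where the compiler can inline freely across its own code.
    int rc = runMainLoop();

    FreeLibrary(lib);
    return rc;
}
```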

It's a little unclear what's being statically/dynamically linked. Is the DLL of your lib statically linked with its dependencies? Is it possible that the DLL is calling other DLLs (that will be slow)? Maybe try running a profiler such as callgrind (from the valgrind suite) on your executable to determine where all the CPU usage is coming from.

Related

interpreting _dl_runtime_resolve_xsave'2 in callgrind output

Looking at the callgrind output for my program's run, I see that 125% (!) of the cycles are spent in _dl_runtime_resolve_xsave'2 (apparently part of the dynamic linker), while 100% is spent in main. It also says that almost all the time inside _dl_runtime_resolve_xsave'2 is actually spent in inner methods (self = 0%), but callgrind does not show any callees for this method.
Moreover, it looks like _dl_runtime_resolve_xsave'2 is called from several places in the program I am profiling.
I can understand that some time is spent outside of main, because the program I am profiling uses the prototype pattern and many object prototypes are built when their dynamic libraries are loaded. But that cannot amount to anywhere close to 25% of the time of this particular run (if I do the run with no input data, it takes orders of magnitude less time than the run I am profiling now).
Also, the program does not use dlopen to open shared objects after program start; everything should be loaded at startup.
[Screenshot of the kcachegrind window omitted.]
How can I interpret those calls to _dl_runtime_resolve_xsave'2? Do I need to be concerned by the time spent in this method?
Thank you for your help.
_dl_runtime_resolve_xsave is used in the glibc dynamic loader during lazy binding. It looks up the function symbol during the first call to a function and then performs a tail call to the implementation. Unless you use something like LD_BIND_NOT=1 in the environment when launching the program, this is a one-time operation that happens only during the first call to the function. Lazy binding has some cost, but unless you have many functions that are called exactly once, it will not contribute much to the execution cost. It is more likely a reporting artifact, perhaps related to the tail call or the rather exotic XSAVE instruction used in _dl_runtime_resolve_xsave.
You can disable lazy binding by launching the program with the LD_BIND_NOW=1 environment variable set; the dynamic loader trampoline will then not be used, because all functions are resolved at startup. Alternatively, you can link with -Wl,-z,now to make this change permanent (at least for the code you link; system libraries may still use lazy binding for their own function symbols).
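For shared objects you open yourself, the programmatic counterpart is the mode flag to dlopen. A minimal sketch, with a hypothetical library and symbol name:

```cpp
// Eager vs. lazy binding for a hand-opened library. "libplugin.so" and
// "plugin_entry" are hypothetical. RTLD_NOW makes the loader resolve all of
// the library's undefined symbols at dlopen time, so the
// _dl_runtime_resolve trampoline never fires for them on first call.
#include <dlfcn.h>
#include <cstdio>

int main() {
    void* h = dlopen("libplugin.so", RTLD_NOW);  // RTLD_LAZY would defer binding
    if (!h) { std::fprintf(stderr, "%s\n", dlerror()); return 1; }

    using Fn = void (*)();
    auto entry = reinterpret_cast<Fn>(dlsym(h, "plugin_entry"));
    if (entry) entry();  // the library's own outgoing calls are already bound

    dlclose(h);
    return 0;
}
```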

calling exe performance cost (compared to DLL)

We were discussing the possibility of using an exe instead of a DLL from C or C++ code. The idea would be, in some cases, to run an exe and pass arguments to it. (I guess it's somehow equivalent to loading its main function as if it were a DLL.)
The question we were wondering about is whether this implies a performance cost (especially in a loop with more than one iteration).
I tried to look in existing threads, but nobody answered this specific question. I saw that calling a function from a DLL has an overhead for the first call, but that subsequent calls take only one or two instructions.
For the exe case, it will need to create a separate process each time so it can run (a second process if I needed a shell to launch it, but from my research I can do it without calling a shell). This process creation should cost some performance, I'd guess. Moreover, I think the exe will be loaded into RAM each time, destroyed at the end of the process, then reloaded for the next call, and so on; a problem that is not present (?) with a DLL.
PS: we were discussing this question more on a theoretical level than with a view to implementing it; it's a question for the sake of learning.
The costs of running an exe are tremendous compared to calling a function from a DLL. If you can do it with a DLL, then you should, if performance matters.
Of course, there may be other factors to consider. For example, when a bug in the called code crashes the process: in the exe case it is merely that exe that goes down and the caller survives, but if the bug is in a DLL, the caller crashes too.
Clearly, a DLL is going to get loaded, and if you call into it many times in a short period, that will be a benefit. If the time between calls is long enough, the DLL's contents may get evicted from RAM and have to be loaded from disk again (yes, that's hard to quantify, and it partly depends on the memory usage of the system).
However, executable files do get cached in memory, so the cost of "loading the executable" isn't that big. Yes, you have to create a new process and destroy it at the end, with all the related memory-management work. For a small executable this will be relatively light work; for a large, complex executable it may take quite a long time.
Bear in mind that executing the same program many times isn't unusual - compiling a large project or running some sort of script on a large number of files, just to give a couple of simple examples. So the performance of this will be tuned by OS developers.
Obviously, the "retain stuff in RAM" applies to both DLL and EXE - it's basic file-caching done by the OS.

does .so size affect virtual function performance

Recently I ran into a performance problem. In the VTune results, virtual function calls are always the number-one cost. When I reduced the .so size from 48 MB to 37 MB, performance improved by 3.9%.
I want to know: does the .so size really affect virtual function performance, and if so, why? Thanks!
It is not purely size (though of course size affects paging once the program is loaded), but the number of adjustments the loader must make when loading a program. You can see these measured by setting the environment variable
LD_DEBUG=statistics
Virtual functions in particular require a lot of adjustments during loading. For discussion of this, see:
Measure time taken dynamically linking at program startup?
Faster C++ program startups by improving runtime linking efficiency
Position Independent Executable (PIE) Performance
I used the method from the article (blogs.oracle.com/ali/entry/the_cost_of_elf_symbol) provided by @ErikAlapää: using RTLD_LAZY instead of RTLD_NOW when dlopen'ing the .so. However, it didn't seem to help. When I compile it from fewer object files it gets better, so it seems the paging cache does affect the process's performance.

Preserve state in DLL between successive calls from a C++ application

I am explicitly loading a DLL in my application. Is it possible to preserve state in that DLL between successive calls to it? My attempts using a global have so far failed.
Would I have to use implicit linking for this to work?
The type of linking shouldn't have any influence here. It only determines when the DLL is loaded and whether it's required for your program to start at all. E.g. with runtime loading you're able to load DLLs that aren't there at compile time (e.g. plugins), and you're able to handle missing dependencies yourself; with compile-time linking you'd just get a Windows error telling you a DLL is missing.
As for unloading, you don't have direct control over whether your DLL stays in memory, so it's possible it is unloaded between being used by two different programs. Also, what do you actually consider "successive calls"? Two calls from the same program? Two calls from the same program happening during two different executions? Two programs running at the same time? Depending on the scenario you might need some shared memory (or disk space) to actually pass the data.
You might have a look at DllCanUnloadNow to tell Windows whether you're ready to be unloaded, but depending on your use case this might be the wrong tool.
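For the common case of one program making repeated calls during a single run, a global inside the DLL does persist for as long as the DLL stays loaded in that process. A minimal sketch (all names hypothetical):

```cpp
// state.dll sketch: a DLL-side global keeps its value between exported calls
// for the lifetime of the DLL within one process. It is NOT shared across
// processes; every process that loads the DLL gets its own copy.
#include <windows.h>

static int g_counter = 0;   // lives in this DLL's per-process data section

extern "C" __declspec(dllexport) int IncrementCounter() {
    return ++g_counter;     // carries state over from the previous call
}

BOOL WINAPI DllMain(HINSTANCE, DWORD, LPVOID) { return TRUE; }
```

If this doesn't hold for you, check that the caller isn't pairing every call with FreeLibrary; reloading the DLL reinitializes its globals.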

Overhead of DLL

I have a quite basic question.
When a library is used by only a single process, should I keep it as a static library?
If I use the library as a DLL, but only a single process uses it, what will be the overhead?
There is almost no overhead to having a separate DLL. Basically, the first call to a function exported from a DLL will run a tiny stub that fixes up the function addresses so that subsequent calls are performed via a single jump through a jump table. The way CPUs work, this extra indirection is practically free.
The main "overhead" is actually an opportunity cost, not an "overhead" per-se. That is, modern compilers can do something called "whole program optimization" in which the entire module (.exe or .dll) is compiled and optimized at once, at link time. This means the compiler can do things like adjust calling conventions, inline functions and so on across all .cpp files in the whole program, rather than just within a single .cpp file.
This can result in fairly nice performance boost, for certain kinds of applications. But of course, whole program optimization cannot happen across DLL boundaries.
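To make the opportunity cost concrete, here is a consumer-side sketch; the compute function and the USE_DLL macro are made up for illustration:

```cpp
// The same call compiles differently depending on how the library is linked.
// Static lib: a direct call, and a cross-TU inlining candidate under whole
// program optimization (/GL when compiling, /LTCG when linking).
// DLL: __declspec(dllimport) turns it into an indirect call through the
// import address table, which the optimizer cannot see through.
#ifdef USE_DLL
__declspec(dllimport) int compute(int x);
#else
int compute(int x);          // resolved at link time from the static lib
#endif

int run() {
    return compute(42);      // identical source, different call machinery
}
```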
There are two overheads to a DLL. First, as the DLL is loaded into memory, its internal addresses must be fixed up for the actual address at which the DLL was loaded, versus the address assumed by the linker. This can be minimized by rebasing the DLLs. The second overhead occurs when the program and DLL are loaded, as the program's calls into the DLL have the addresses of the functions filled in. These overheads are generally negligible except for very large programs and DLLs.
If this is a real concern you can use delay-loaded DLLs which only get loaded as they are called. If the DLL is never used, for example it implements a very uncommon function, then it never gets loaded at all. The downside is that there's a short delay the first time the DLL is called.
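A minimal delay-load sketch, assuming a hypothetical rarely.dll that exports RareFunction:

```cpp
// Build with:  link ... /DELAYLOAD:rarely.dll delayimp.lib
// The import stays unresolved until the first call; if RareFunction is never
// called, rarely.dll is never loaded at all. Names are hypothetical.
__declspec(dllimport) int RareFunction(int);

int maybe_do_rare_thing(bool needed) {
    if (!needed)
        return 0;            // rarely.dll never touched on this path
    return RareFunction(7);  // first call triggers the delay-load helper
}
```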
I like to use statically linked libraries, not to decrease the overhead but to minimize the hassle of having to keep the DLL with the program.
Imported functions have no more overhead than virtual functions.