calling exe performance cost (compared to DLL) - c++

We were discussing the possibility of using an exe instead of a DLL from C or C++ code. The idea would be, in some cases, to invoke an exe and pass arguments to it. (I guess it's roughly equivalent to loading its main function as if it were a DLL.)
The question we were wondering about is whether this implies a performance cost (especially in a loop with more than one iteration).
I tried to look in existing threads, but nobody answered this specific question. I saw that calling a function from a DLL has an overhead for the first call, but subsequent calls only take 1 or 2 instructions.
For the exe case, each call needs to create a separate process so it can run (a second process if I need to open a shell to launch it, but from my research I can do it without calling a shell). This process creation should cost some performance, I'd guess. Moreover, I think the exe will be loaded into RAM each time, destroyed when the process ends, then reloaded for the next call, and so on; a problem that is not present (?) with a DLL.
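For concreteness, here is roughly what launching the exe directly looks like on Windows with CreateProcess, no shell involved; a minimal sketch, with the command line "worker.exe arg1" made up for illustration:

    #include <windows.h>
    #include <cstdio>

    // Launch worker.exe directly (no shell) once per loop iteration
    // and wait for it to exit.
    int main() {
        for (int i = 0; i < 10; ++i) {
            char cmdline[] = "worker.exe arg1";  // CreateProcess may modify this buffer
            STARTUPINFOA si = { sizeof(si) };
            PROCESS_INFORMATION pi = {};

            if (!CreateProcessA(nullptr, cmdline, nullptr, nullptr, FALSE,
                                0, nullptr, nullptr, &si, &pi)) {
                std::fprintf(stderr, "CreateProcess failed: %lu\n", GetLastError());
                return 1;
            }
            WaitForSingleObject(pi.hProcess, INFINITE);  // block until the exe finishes
            CloseHandle(pi.hThread);
            CloseHandle(pi.hProcess);
        }
        return 0;
    }

Each iteration pays for process creation, image mapping, and teardown; that per-iteration cost is what the question is about.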
PS: we were discussing this question more on a theoretical level than for implementing it; it's a question for the sake of learning.

The costs of running an exe are tremendous compared to calling a function from a DLL. If you can do it with a DLL, then you should, if performance matters.
Of course, there may be other factors to consider. For example, if a bug in the called code crashes the process, then in the exe case it is merely that exe that goes down and the caller survives; but if the bug is in a DLL, the caller crashes too.

Clearly, a DLL is going to get loaded, and if you call into it many times in a short period, it will have a benefit. If the time between calls is long enough, the DLL's content may get evicted from RAM and have to be loaded from disk again (yes, that's hard to quantify, and it partly depends on the memory usage on the system).
However, executable files do get cached in memory, so the cost of "loading the executable" isn't that big. Yes, you have to create a new process and destroy it at the end, with all the related memory management work. For a small executable, this is relatively light work; for a large, complex executable, it may take quite a long time.
Bear in mind that executing the same program many times isn't unusual - compiling a large project or running some sort of script on a large number of files, just to give a couple of simple examples. So the performance of this will be tuned by OS developers.
Obviously, the "retain stuff in RAM" applies to both DLL and EXE - it's basic file-caching done by the OS.

Related

C++, always running processes or invoked executable files?

I'm working on a project made up of several separate processes (services). Some services are called every second, some others every minute, and some services may not be called for days. (And some services are called at random, with no exact information about their call times.)
I have two approaches to developing the project: make the services always-running processes that use interprocess messaging, or write separate C++ programs and run the executable files when I need them.
I have two questions that I couldn't find a suitable answer to.
Is there any way I could calculate an approximate threshold that helps me decide when to use which approach?
How much faster are always-running processes? (I mean compared with the OS initializing and running an executable file.)
Edit 1: As mentioned in the comments and in Mats Petersson's answer, the answer to my questions depends heavily on the environment, so here is more about those conditions.
OS: CentOS 6.3
services are small (normally fewer than 1000 lines of code) and use no additional resources (such as a database)
I don't think anyone can answer your two direct questions, as it depends on many factors, such as "what OS", "what secondary storage", "how large the application is", and "what your application does" (loading the contents of a database with a million entries as initialization outside main takes much longer than int x = 73;).
There is overhead with both approaches, and assuming there isn't enough memory to hold EVERYTHING in RAM at all times (modern OSes will try to use RAM as disk cache or for other caching, rather than keep old, crusty application code that isn't running, so eventually your application code will be swapped out if it isn't run), you are going to have approximately the same amount of disk I/O with both solutions.
For me, "having memory available" trumps other things, so executing a process when itäs needed is better than leaving it running in the expectation that in some time, it will need to be reused. The only exceptions are if the executable takes a long time to start (in other words, it's large and has a complex starting procedure) AND it's being run fairly frequently (at the very least several times per minute). Or you have high real-time requirements, so the extra delay of starting the process is significantly worse than "we're holding it in memory" penalty (but bear in mind that holding it in memory isn't REALLY holding it in memory, since the content will be swapped out to disk anyway, if it isn't being used).
Starting a process that was recently run is typically done from cache, so it's less of an issue. Also, if the application uses shared libraries (.so, .dll, or .dylib depending on the OS) that are genuinely shared, the load time will normally be shorter if the shared library is already in memory.
Both Linux and Windows (and I expect OS X) are optimised to load a program much faster the second time it executes in short succession, because they cache things. So for frequent calls to the executable, this will definitely work in your favour.
I would start by "execute every time", and if you find that this is causing a problem, redesign the programs to stay around.
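If you want numbers for your own environment rather than a rule of thumb, it is easy to measure the spawn overhead directly. A minimal Linux sketch, assuming a trivial ./service binary (a made-up path) exists:

    #include <cstdio>
    #include <time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    // Time N spawn-run-exit cycles of ./service to estimate the
    // per-invocation process overhead on this particular machine.
    int main() {
        const int N = 100;
        timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (int i = 0; i < N; ++i) {
            pid_t pid = fork();
            if (pid == 0) {
                execl("./service", "service", (char*)NULL);
                _exit(127);               // exec failed
            }
            int status;
            waitpid(pid, &status, 0);     // wait for the child to finish
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        std::printf("avg per spawn: %.3f ms\n", secs * 1000.0 / N);
        return 0;
    }

Comparing that average against the latency budget of your most frequently called service gives you a rough version of the threshold asked about in question 1.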

identify memory code injection by memory dumping of process or dll

In order to identify in-memory code injection (on Windows systems), I want to hash the memory of all processes on the system. For example, if the memory of calc.exe is always x and now it is y, I know that someone injected code into calc.exe.
1: Is this thinking correct? What part of a process's memory always stays the same, and what part changes?
2: Does a DLL have separate memory, or does it live in the memory of the exe? In other words, can I generate a hash for the memory of a DLL?
3: How can I dump the memory of a process or of a DLL in C++?
Code is continually being injected into processes on a running Windows system.
One example is delay-loaded DLLs. When a process starts up, only the core DLLs are loaded. When certain features get exercised, the process first loads the new DLLs (code) from disk and then executes them.
Another example is .NET managed applications. Most code sits as uncompiled code on disk. When new parts of the application need to be run, the .NET runtime loads that uncompiled code, compiles it (aka JITs it) and then executes it.
The problem you are trying to solve is worthwhile, but extremely hard. The OS itself tries to solve this problem to protect your processes.
If you are trying to do something more advanced than what Windows does for you behind the scenes, the first thing to do is understand all the steps Windows takes to protect processes and validate the code being injected into them, while still enabling processes to load code dynamically (which is a necessity).
Good luck.
Or maybe you have a more specific problem you are trying to solve?
1) The idea is nice, but as long as processes run, they change their memory (or they do nothing), so hashing everything won't work. What you could do is hash the code part of the memory.
2) No, DLLs are libraries linked into your process, not separate processes. They are just loaded dynamically instead of statically (http://msdn.microsoft.com/en-us/library/windows/desktop/ms681914%28v=vs.85%29.aspx)
3) Normally your OS prohibits you from accessing the memory of neighbouring processes. If it allowed this freely, it would be very easy for malware to propagate, and your system would be very unstable, as one crashing process could crash all the others. So it's very tricky to do this kind of dump! But if your process has the right privileges, you can have a look at ReadProcessMemory().
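A minimal sketch of that approach, assuming you have the target PID and sufficient privileges: walk the process's address space with VirtualQueryEx and hash the committed, executable regions (FNV-1a here is just a simple stand-in for a real hash).

    #include <windows.h>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    // Hash all committed, executable regions of the given process.
    uint64_t hash_executable_memory(DWORD pid) {
        HANDLE proc = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
                                  FALSE, pid);
        if (!proc) return 0;

        uint64_t h = 1469598103934665603ULL;  // FNV-1a offset basis
        std::vector<unsigned char> buf;
        MEMORY_BASIC_INFORMATION mbi;
        unsigned char* addr = nullptr;

        while (VirtualQueryEx(proc, addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
            bool exec = (mbi.Protect & (PAGE_EXECUTE | PAGE_EXECUTE_READ |
                                        PAGE_EXECUTE_READWRITE)) != 0;
            if (mbi.State == MEM_COMMIT && exec) {
                buf.resize(mbi.RegionSize);
                SIZE_T read = 0;
                if (ReadProcessMemory(proc, mbi.BaseAddress, buf.data(),
                                      mbi.RegionSize, &read)) {
                    for (SIZE_T i = 0; i < read; ++i)
                        h = (h ^ buf[i]) * 1099511628211ULL;  // FNV-1a step
                }
            }
            addr = static_cast<unsigned char*>(mbi.BaseAddress) + mbi.RegionSize;
        }
        CloseHandle(proc);
        return h;
    }

    int main(int argc, char** argv) {
        DWORD pid = (argc > 1) ? static_cast<DWORD>(std::atoi(argv[1]))
                               : GetCurrentProcessId();
        std::printf("%016llx\n", (unsigned long long)hash_executable_memory(pid));
        return 0;
    }

Note that this covers the code of the exe and of every DLL mapped into the process (which speaks to question 2: a DLL's code lives inside the process that loaded it), and that relocation and delay loading mean the hash can legitimately differ between runs.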
I have just done something similar; I basically rewrote these scripts in C#:
http://www.exploit-monday.com/2012/03/powershell-live-memory-analysis-tools.html

Should I avoid static compilation because of cache misses?

The title pretty much sums up the entire story. I was reading this, and the key point is that
A bigger executable means more cache misses
and since a statically linked executable is by definition bigger than a dynamically linked one, I'm curious about the practical considerations in this case.
The article in the link discusses the side effect of inlining small functions in the OS kernel. This does have a noticeable effect on performance, because the same function is called from many different places throughout a sequence of system calls. For example, if you call open and then call read, seek, and write, open will store a filehandle somewhere in the kernel, and in the calls to read, seek, and write, that handle will have to be "found". If that's an inlined function, we now have three copies of that function in the cache, and no benefit at all from read having called the same function that seek and write call. If it's a non-inline function, it will already be in the cache when seek and write call it.
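A toy illustration of that effect (not the kernel code; __attribute__((noinline)) assumes GCC/Clang): with the noinline helper, do_read, do_seek, and do_write all jump to one shared copy that can stay hot in the instruction cache, whereas inlining it would stamp a copy into each caller.

    #include <cstdio>

    // Toy stand-in for the kernel's "find the filehandle" helper.
    __attribute__((noinline)) int find_handle(int fd) { return (fd * 31) % 97; }

    int do_read(int fd)  { return find_handle(fd) + 1; }  // all three share one
    int do_seek(int fd)  { return find_handle(fd) + 2; }  // copy of find_handle's
    int do_write(int fd) { return find_handle(fd) + 3; }  // machine code

    int main() {
        std::printf("%d %d %d\n", do_read(3), do_seek(3), do_write(3));
        return 0;
    }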
For a given process, whether the code is linked statically or dynamically will have very little impact once the application is fully loaded. If there are MANY copies of the application running, then other processes may benefit from reusing the same memory for the shared libraries. But the size needed by any one process remains the same whether it is shared with 0, 1, 3, or 100 other processes. The benefit of sharing binary files across many executables comes from things like the C library, which sits behind almost every executable in the system: when you have 1000 processes running that ALL use the same basic runtime, there is one copy of its code rather than 1000. But this is unlikely to have much effect on the cache efficiency of any particular application; perhaps common functions like strcpy are used often enough that there is a small chance that, when the OS task-switches, they are still in the cache when the next application calls strcpy.
So, in summary: probably doesn't make any difference at all.
The overall memory footprint of the static version is the same as that of the dynamic version; remember that the dynamically-linked objects still need to be loaded into memory!
Of course, one could also argue that if there are multiple processes running, and they all dynamically link against the same object, then only one copy is required in memory, and so the aggregate footprint is lower than if they had all statically linked.
[Disclaimer: all of the above is educated guesswork; I've never measured the effect of linking on cache behaviour.]

Is a DLL slower than a static link?

I made a GUI library for games. My test demo runs at 60 fps. When I run the demo with the static version of my library it takes 2-3% CPU in Task Manager. When I use the DLL version it uses around 13-15%. Is that normal? If so, how can I optimize it? I already build with /O2 for maximum function inlining.
Do not start your performance timer until the DLL has had an opportunity to execute its functionality one time. This gives it time to load into memory. Then start the timer and check performance; it should then basically match that of the static lib.
Also keep in mind that the load location of the DLL can greatly affect how quickly it loads. The default base address Visual Studio gives a DLL is 0x10000000 (0x400000 is the default for exes). If some other DLL is already loaded at that address, the loader must perform an expensive rebasing step, which will throw off your timing even more.
If you have such a conflict, just choose a different base address in Visual Studio (the /BASE linker option).
You will have the overhead of loading the DLL (which should happen just once, at the beginning). The calls aren't statically linked direct calls, so I would expect a small amount of overhead, but not much.
However, some DLLs will have much higher overheads. I'm thinking of COM objects although there may be other examples. COM adds a lot of overhead on function calls between objects.
If you call DLL functions, they cannot be inlined into the caller. You should think a little about your DLL boundaries.
Maybe it is better for your application to have a small bootstrap exe which just executes a main loop inside your DLL. This way you can avoid much of the function-call overhead.
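A minimal sketch of such a bootstrap, assuming the DLL exports a hypothetical RunMainLoop entry point (both names made up):

    #include <windows.h>
    #include <cstdio>

    typedef int (*RunMainLoopFn)(void);  // hypothetical export from the DLL

    int main() {
        HMODULE lib = LoadLibraryA("gui_library.dll");  // made-up DLL name
        if (!lib) {
            std::fprintf(stderr, "LoadLibrary failed: %lu\n", GetLastError());
            return 1;
        }
        auto run = reinterpret_cast<RunMainLoopFn>(GetProcAddress(lib, "RunMainLoop"));
        if (!run) {
            FreeLibrary(lib);
            return 1;
        }
        int rc = run();  // the per-frame calls now all stay inside the DLL
        FreeLibrary(lib);
        return rc;
    }

The point of this design is that only one call crosses the exe/DLL boundary; everything the loop does per frame remains inlinable inside the DLL.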
It's a little unclear what's being statically/dynamically linked. Is the DLL of your lib statically linked with its dependencies? Is it possible that the DLL is calling other DLLs (that would be slow)? Maybe try running a profiler from the valgrind suite on your executable to determine where all the CPU usage is coming from.

Random Complete System Unresponsiveness Running Mathematical Functions

I have a program that loads a file (anywhere from 10MB to 5GB) a chunk at a time (ReadFile), and for each chunk performs a set of mathematical operations (basically calculates the hash).
After calculating the hash, it stores info about the chunk in an STL map (basically <chunkID, hash>) and then writes the chunk itself to another file (WriteFile).
That's all it does. This program will cause certain PCs to choke and die. The mouse begins to stutter, the task manager takes > 2 min to show, ctrl+alt+del is unresponsive, running programs are slow.... the works.
I've done literally everything I can think of to optimize the program, and have triple-checked all objects.
What I've done:
Tried different (less intensive) hashing algorithms.
Switched all allocations to nedmalloc instead of the default new operator
Switched from std::map to unordered_set; found the performance still abysmal, so I switched again to Google's dense_hash_map.
Converted all objects to store pointers to objects instead of the objects themselves.
Caching all Read and Write operations. Instead of reading a 16k chunk of the file and performing the math on it, I read 4MB into a buffer and read 16k chunks from there instead. Same for all write operations - they are coalesced into 4MB blocks before being written to disk. (A sketch of this loop follows this list.)
Run extensive profiling with Visual Studio 2010, AMD Code Analyst, and perfmon.
Set the thread priority to THREAD_MODE_BACKGROUND_BEGIN
Set the thread priority to THREAD_PRIORITY_IDLE
Added a Sleep(100) call after every loop.
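For reference, the hot loop has roughly this shape (a sketch of what is described above, not the original code):

    #include <windows.h>
    #include <cstdio>
    #include <vector>

    // Read 4 MB at a time, process 16 KB sub-chunks from the buffer,
    // and throttle with Sleep(100), as listed above.
    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        const DWORD kBlock = 4 * 1024 * 1024;  // 4 MB read buffer
        const DWORD kChunk = 16 * 1024;        // 16 KB hashing unit

        SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_BEGIN);

        HANDLE f = CreateFileA(argv[1], GENERIC_READ, FILE_SHARE_READ, nullptr,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (f == INVALID_HANDLE_VALUE) return 1;

        std::vector<unsigned char> buf(kBlock);
        unsigned long long sum = 0;            // stand-in for the real hash
        DWORD got = 0;
        while (ReadFile(f, buf.data(), kBlock, &got, nullptr) && got > 0) {
            for (DWORD off = 0; off < got; off += kChunk) {
                DWORD n = (got - off < kChunk) ? got - off : kChunk;
                for (DWORD i = 0; i < n; ++i)
                    sum += buf[off + i];       // placeholder per-chunk "math"
            }
            Sleep(100);                        // the throttling mentioned above
        }
        CloseHandle(f);
        std::printf("%llu\n", sum);
        return 0;
    }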
Even after all this, the application still results in a system-wide hang on certain machines under certain circumstances.
Perfmon and Process Explorer show minimal CPU usage (with the sleep), no constant reads/writes from disk, few hard pagefaults (and only ~30k pagefaults in the lifetime of the application on a 5GB input file), little virtual memory (never more than 150MB), no leaked handles, no memory leaks.
The machines I've tested it on run Windows XP - Windows 7, x86 and x64 versions included. None have less than 2GB RAM, though the problem is always exacerbated under lower memory conditions.
I'm at a loss as to what to do next. I don't know what's causing it - I'm torn between CPU and memory as the culprit. CPU, because without the sleep and under different thread priorities the system's performance changes noticeably. Memory, because there's a huge difference in how often the issue occurs when using unordered_set vs Google's dense_hash_map.
What's really weird? Obviously, the NT kernel design is supposed to prevent this sort of behavior from ever occurring (a user-mode application driving the system to this sort of extreme poor performance!?)..... but when I compile the code and run it on OS X or Linux (it's fairly standard C++ throughout) it performs excellently even on poor machines with little RAM and weaker CPUs.
What am I supposed to do next? How do I know what the hell it is that Windows is doing behind the scenes that's killing system performance, when all the indicators are that the application itself isn't doing anything extreme?
Any advice would be most welcome.
I know you said you had monitored memory usage and that it seems minimal here, but the symptoms sound very much like the OS thrashing like crazy, which would definitely cause general loss of OS responsiveness like you're seeing.
When you run the application on a file say 1/4 to 1/2 the size of available physical memory, does it seem to work better?
What I suspect may be happening is that Windows is "helpfully" caching your disk reads into memory and not giving up that cache memory to your application for use, forcing it to go to swap. Thus, even though swap use is minimal (150MB), it's going in and out constantly as you calculate the hash. This then brings the system to its knees.
Some things to check:
Antivirus software. These often scan files as they're opened to check for viruses. Is your delay occurring before any data is read by the application?
General system performance. Does copying the file using Explorer also show this problem?
Your code. Break it down into the various stages. Write a program that just reads the file, then one that reads and writes the file, then one that just hashes random blocks of RAM (i.e. remove the disk I/O part), and see if any particular step is problematic. If you can get a profiler, use it as well to see if there are any slow spots in your code.
EDIT
More ideas. Perhaps your program is holding on to the GDI lock too much. This would explain everything else being slow without high CPU usage. Only one app at a time can have the GDI lock. Is this a GUI app, or just a simple console app?
You also mentioned RtlEnterCriticalSection. This is a costly operation, and can hang the system quite easily, e.g. with mismatched Enters and Leaves. Are you multi-threading at all? Is the slowdown due to race conditions between threads?
XPerf is your guide here - watch the PDC Video about it, and then take a trace of the misbehaving app. It will tell you exactly what's happening throughout the system, it is extremely powerful.
I like the disk-caching/thrashing suggestions, but if that's not it, here are some scattershot suggestions:
What non-MSVC libraries, if any, are you linking to?
Can your program be modified (#ifdef'd) to run without a GUI? Does the problem occur?
You added ::Sleep(100) after each loop in each thread, right? How many threads are you talking about? A handful or hundreds? How long does each loop take, roughly? What happens if you make that ::Sleep(10000)?
Is your program perhaps doing something else that locks a limited resource (ProcExp can show you which handles are being acquired ... of course you might have difficulty with ProcExp not responding :-[)
Are you sure CriticalSections are userland-only? I recall that was so back when I worked on Windows (or so I believed), but Microsoft could have modified that. I don't see any guarantee in the MSDN article Critical Section Objects (http://msdn.microsoft.com/en-us/library/ms682530%28VS.85%29.aspx) ... and this leads me to wonder: Anti-convoy locks in Windows Server 2003 SP1 and Windows Vista
Hmmm... presumably we're all multi-processor now, so are you setting the spin count on the CS? (A minimal example follows this list.)
How about running a debugging version of one of these OSes and monitoring the kernel debugging output (using DbgView)... possibly using the kernel debugger from the Platform SDK ... if MS still calls it that?
I wonder whether VMMap (another SysInternal/MS utility) might help with the Disk caching hypothesis.
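On the spin-count suggestion above: a minimal sketch of setting one (4000 is the figure MSDN cites for the heap manager, not a recommendation):

    #include <windows.h>

    CRITICAL_SECTION g_cs;

    int main() {
        // A waiter spins this many times before falling back to a kernel wait,
        // which can avoid the expensive transition for very short hold times.
        InitializeCriticalSectionAndSpinCount(&g_cs, 4000);

        EnterCriticalSection(&g_cs);
        // ... short critical work ...
        LeaveCriticalSection(&g_cs);

        DeleteCriticalSection(&g_cs);
        return 0;
    }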
It turns out that this is a bug in the Visual Studio compiler. Using a different compiler resolves the issue entirely.
In my case, I installed and used the Intel C++ Compiler, and even with all optimizations disabled I did not see the full-system hang I was experiencing with the Visual Studio 2005 - 2010 compilers on this library.
I'm not certain as to what is causing the compiler to generate such broken code, but it looks like we'll be buying a copy of the Intel compiler.
It sounds like you're poking around fixing things without knowing what the problem is. Take stackshots. They will tell you what your program is doing when the problem occurs. It might not be easy to get the stackshots if the problem occurs on other machines where you cannot use an IDE or a stack sampler. One possibility is to kill the app and get a stack dump when it's acting up. You need to reproduce the problem in an environment where you can get a stack dump.
Added: You say it performs well on OSX and Linux, and poorly on Windows. I assume the ratio of completion time is some fairly large number, like 10 or 100, if you've even had the patience to wait for it. I said this in the comment, but it is a key point. The program is waiting for something, and you need to find out what. It could be any of the things people mentioned, but it is not random.
Every program, all the time while it runs, has a call stack consisting of a hierarchy of call instructions at specific addresses. If at a point in time it is calculating, the last instruction on the stack is a non-call instruction. If it is in I/O the stack may reach into a few levels of library calls that you can't see into. That's OK. Every call instruction on the stack is waiting. It is waiting for the work it requested to finish. If you look at the call stack, and look at where the call instructions are in your code, you will know what your program is waiting for.
Your program, since it is taking so long to complete, is spending nearly all of its time waiting for something to finish, and as I said, that's what you need to find out. Get a stack dump while it's being slow, and it will give you the answer. The chance that it will miss it is 1/the-slowness-ratio.
Sorry to be so elemental about this, but lots of people (and profiler makers) don't get it. They think they have to measure.