Force executable into memory? (C++)

I have a C++ executable (it links in static libraries), about 1 MB in size. When I run the exe, it consumes less than 200 KB of memory.
From what I understand, this means the computer reads the exe from the HDD little by little, as it is needed.
I want to improve performance, even a little, so how can I tell the OS to load the whole exe into memory and not touch the HDD again? Will this bring any performance improvement?

The OS will load parts of the executable into memory as they are needed. This is where knowing more about the instruction cache might be useful. The idea is that you structure your program so that commonly used code is grouped together. For example, you might have some functions that are getting inlined; in that case the same code ends up duplicated in multiple places, which means more pages have to be loaded and cached. By removing the inlining you'd have the code in one chunk of memory, which gets cached once and thus reduces loading time.
I would agree with the others, though, that this type of optimization should really be reserved until after you profile and know for sure that this is the bottleneck, which is very unlikely.
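A minimal illustration of that point (the attribute spellings are compiler-specific and only shown as an example): keeping a commonly used helper out of line gives you one shared copy of its machine code instead of a duplicate at every call site.

```cpp
// Hypothetical helper kept out of line so every caller shares one copy
// of the machine code in the text segment.
#if defined(_MSC_VER)
  #define NOINLINE __declspec(noinline)
#else
  #define NOINLINE __attribute__((noinline))
#endif

NOINLINE int common_helper(int x)
{
    return x * 2 + 1;
}

int caller_a(int x) { return common_helper(x) + 3; }  // calls the shared copy
int caller_b(int x) { return common_helper(x) - 3; }  // same copy, likely already cached
```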

If you really want to do this, you need to touch the memory pages by reading from them. But forcing pages into memory once does not guarantee that they will remain in memory. An apparent alternative would be to VirtualLock the region, but in practice this function doesn't work the way you'd think (at least on any system where I've used it), even if you have the appropriate privileges.
Note that the default minimum working set is only 16MB, so for larger executables, forcing pages into RAM will necessarily push others (which you need!) out of the working set, so this is in fact an anti-optimization, unless you have the necessary privileges to increase the working set size.
It's a bit tedious to find out where the executable's mapping starts and ends; not that it's impossible, but it's much more complicated than simply mapping the file again. Then you run a loop which reads one byte every 4096 bytes, and you are done. This consumes twice as much address space, but the same amount of RAM (thanks to how memory mapping works).
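A sketch of that loop, assuming the simpler route of mapping the executable file a second time with the Win32 APIs (error handling kept minimal; this is illustrative, not a drop-in utility):

```cpp
#include <windows.h>

// Map our own executable file again and touch one byte per 4096-byte page,
// pulling the file's contents into physical memory / the system cache.
void TouchOwnExecutable()
{
    wchar_t path[MAX_PATH];
    if (GetModuleFileNameW(nullptr, path, MAX_PATH) == 0)
        return;

    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        return;

    LARGE_INTEGER size;
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (mapping && GetFileSizeEx(file, &size))
    {
        const char* view = static_cast<const char*>(
            MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
        if (view)
        {
            volatile char sink = 0;
            for (LONGLONG offset = 0; offset < size.QuadPart; offset += 4096)
                sink = view[offset];        // one read per page
            (void)sink;
            UnmapViewOfFile(view);
        }
    }
    if (mapping) CloseHandle(mapping);
    CloseHandle(file);
}
```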
But, realistically, you will gain absolutely nothing from doing this.
The operating system does not need to load the entire executable and does not need to keep it resident at all times. Part of your executable will be debug info or import info, which the loader may look at once (or not at all) and never needs afterwards. Forcing that stuff into memory only means you purge useful pages from the working set.
The OS likely has the parts that are not visible to you (or most of them) in the buffer cache anyway, but even if that isn't the case, you will hardly ever notice a difference.

Globally, forcing all of the program into RAM will slow it down. There are usually large parts of the code which aren't executed in any given run, and there's no need to ever read these from disk.
Where forcing all or parts of the program into RAM can make a difference is latency. If you're responding in real time to external events, having to load the code in order to respond will increase latency. This can only be done by using a system-specific request (e.g. mlock under POSIX systems supporting the real-time extension). You'll probably need special rights to be able to do it, though. In practice, it should only be used on machines dedicated to a specific application, since it can have a very negative impact on total system performance. (There's a reason that it's in the real-time extensions, and not in basic POSIX.) Locking the addresses used by the function in memory means that there can be no page faults when it is executed.
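For reference, a minimal sketch of the POSIX route mentioned above, using mlockall() to pin the process's mappings (including its code) so they cannot be paged out; this typically needs elevated privileges or a raised RLIMIT_MEMLOCK:

```cpp
#include <sys/mman.h>
#include <cstdio>

int main()
{
    // Lock everything currently mapped, and anything mapped later, into RAM.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
    {
        perror("mlockall");   // usually EPERM or ENOMEM without the right privileges
        return 1;
    }

    // ... latency-critical work runs here without page faults on locked pages ...

    munlockall();             // release the locks when low latency is no longer needed
    return 0;
}
```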

Related

Does image base address optimization make sense? [duplicate]

Rebasing a DLL means fixing up the DLL so that its preferred load address is the address at which the loader is actually able to load it.
This can be achieved either with a tool such as Rebase.exe or by specifying default load addresses for all your (own) DLLs so that they "fit" into your executable's process.
The whole point of managing the DLL base addresses this way is to speed up application loading. (Or so I understand.)
The question is now: Is it worth the trouble?
I have the book Windows via C/C++ by Richter/Nazarre, and they strongly recommend[a] making sure that the load addresses all match up so that the loader doesn't have to rebase the loaded DLLs.
They fail to argue, however, whether this speeds up application load times by any significant amount.
Also, with ASLR it seems dubious that this has any value at all, since the load addresses will be randomized anyway.
Are there any hard facts on the pro/cons of this?
[a]: In my WvC++/5th ed it is in the sections titled Rebasing Modules and Binding Modules on pages 568ff. in Chapter 20, DLL Advanced Techniques.
Patching the relocatable addresses isn't the big deal; that runs at memory speed, a matter of microseconds. The bigger issue is that the pages containing this code now need to be backed by the paging file instead of the DLL file. In other words, when pages containing that code are unmapped, they have to be written to the paging file instead of just being discarded.
The cost of this isn't that easy to measure, especially on modern machines with lots of RAM. It only counts when the machine starts to come under load, with lots of processes competing for memory. And it adds to fragmentation of the paging file.
But clearly, rebasing is a very cheap optimization. And it is very easy to see in the Debug + Windows + Modules window: there's a bright icon on the rebased DLLs. The Address column gives you a good hint of what base address would be a good choice. Leave ample space between them so you don't constantly have to tweak this as your program grows.
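For what it's worth, the two usual ways of doing this look roughly like the following (the addresses and file names are made up for illustration): let rebase.exe assign non-overlapping bases to a set of DLLs in one pass, or pick a preferred base yourself at link time with /BASE.

```
rebase.exe -b 0x60000000 first.dll second.dll third.dll

link /DLL /BASE:0x62000000 /OUT:first.dll first.obj
```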
I'd like to provide one answer myself, although the answers of Hans Passant and others already describe the tradeoffs pretty well.
After recently fiddling with DLL base addresses in our application, I will here give my conclusion:
I think that, unless you can prove otherwise, providing DLLs with a non-default Base Address is an exercise in futility. This includes rebasing my DLLs.
For the DLLs I control, given the average application, each DLL will be loaded into memory only once anyway, so the load on the paging file should be minimal. (But see the comment of Michal Burr in another answer about Terminal Server environment.)
If DLLs are given a fixed base address (without rebasing), it will actually increase address space fragmentation, as sooner or later these addresses won't match anymore. In our app we had given all DLLs a fixed base address (for other legacy reasons, not because of address space fragmentation) without using rebase.exe, and this significantly increased address space fragmentation for us, because you really can't get this right manually.
Rebasing (via rebase.exe) is not cheap. It is another step in the build process that has to be maintained and checked, so it has to have some benefit.
A large application will always have some DLLs loaded where the base address does not match, because of some hook DLLs (AV) and because you don't rebase 3rd party DLLs (or at least I wouldn't).
If you're using a RAM disk for the paging file, you might actually be better off if loaded DLLs get paged out :-)
So to sum up, I think that rebasing isn't worth the trouble except for special cases like the system DLLs.
I'd like to add a historical piece that I found on Old New Thing: How did Windows 95 rebase DLLs? --
When a DLL needed to be rebased, Windows 95 would merely make a note of the DLL's new base address, but wouldn't do much else. The real work happened when the pages of the DLL ultimately got swapped in. The raw page was swapped off the disk, then the fix-ups were applied on the fly to the raw page, thereby relocating it. The fixed-up page was then mapped into the process's address space and the program was allowed to continue.
Looking at how this process is done (read the whole thing), I personally suspect that part of the "rebasing is evil" stance dates back to the olden days of Win9x and low memory conditions.
Look, now there's a non-historical piece on Old New Thing:
How important is it nowadays to ensure that all my DLLs have non-conflicting base addresses?
Back in the day, one of the things you were exhorted to do was rebase your DLLs so that they all had nonoverlapping address ranges, thereby avoiding the cost of runtime relocation. Is this still important nowadays?
...
In the presence of ASLR, rebasing your DLLs has no effect because ASLR is going to ignore your base address anyway and relocate the DLL into a location of its pseudo-random choosing.
...
Conclusion: It doesn't hurt to rebase, just in case, but understand that the payoff will be extremely rare. Build your DLL with /DYNAMICBASE enabled (and with /HIGHENTROPYVA for good measure) and let ASLR do the work of ensuring that no base address collision occurs. That will cover pretty much all of the real-world scenarios. If you happen to fall into one of the very rare cases where ASLR is not available, then your program will still work. It just may run a little slower due to the relocation penalty.
... ASLR actually does a better job of avoiding collisions than manual rebasing, since ASLR can view the system as a whole, whereas manual rebasing requires you to know all the DLLs that are loaded into your process, and coordinating base addresses across multiple vendors is generally not possible.
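Following the quoted advice, the relevant switches are ordinary linker options; a hedged example of a build line (the file name is invented, and /HIGHENTROPYVA only matters for 64-bit images):

```
cl /LD mydll.cpp /link /DYNAMICBASE /HIGHENTROPYVA
```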
They fail to argue, however, whether this speeds up application load times by any significant amount.
The load-time change is minimal, because applying the relocation fix-ups runs at memory speed. However, if you are low on memory, enough that things get paged in and out of the page file, then the system has to keep the relocated DLL pages in the page file (since their contents were changed). If the DLLs were rebased, and the rebased DLLs don't collide with any other DLLs, then instead of swapping those pages out to the page file (and back), the system can simply reuse the memory and reload the DLL from the original file on the hard drive.
The benefit is only relevant when systems are paging stuff in and out of main memory. The last time I made efforts to keep databases of applications and their base addresses was back in VB6 days, when the computers in our offices and data centers were lucky to have even 256MB of RAM.
Also, with ASLR it seems dubious that this has any value at all, since the load addresses will be randomized anyway.
At the moment ASLR only affects DLLs and executables with the dynamic-relocation flag set. This includes Vista/Win7 system DLLs and executables, and any developer-built binaries where the developer intentionally set that flag during the build.
If you are going to set the dynamic-relocation flag, then don't bother rebasing the dlls. If all your clients have 4GB of RAM, then don't bother. If your boss is a cheapskate, then maybe.
You have to consider that user DLLs (those not already loaded into another process) have to be read from the HDD. Usually memory mapping is used for that (and it uses lazy loading), so if they have to be relocated, they will actually have to be read from the HDD before the process can start.
For DLLs already loaded by other processes, the copy-on-write mechanism is used. So, again, relocating them means additional operations.
As for ASLR, it's intended for security purposes, not for performance.
Yes, you should do it.
ASLR only impacts "system" DLLs, and therefore the ones you are writing should not be affected by it. Additionally, ASLR doesn't completely "randomize" the location of these system binaries; it simply shuffles them around near their basic spot in the VM map.

Release Memory Mapped Memory

I am memory mapping a large file (~200GB) into a single region/view and sequentially writing to it. Every now and then I perform a boost::interprocess::mapped_region::flush(last, current, false).
After a while the process uses up the entire system memory. Which, from what I understand, is normal, as the OS will release the memory when other processes request it.
This works well on Windows 8. However, running on Windows 7 it doesn't seem to play well with the drivers for AJA video cards and it starts affecting performance (dropping IO packets).
Is there any way I can force Windows 7 to flush parts of the memory to disk (after the data is written it is only of interest for a few seconds, and remember I am writing sequentially through the entire file), so as to not use up all of the available system memory?
Flushing has nothing to do with reclamation, if you ask me. It just makes sure dirty pages are written out (and I think you still need a disk sync to make sure the data actually hit the disk).
So, you're looking for a way to unmap.
Maybe you can use a function like EmptyWorkingSet to evict as many pages as possible, or SetProcessWorkingSetSize to temporarily reduce the allowed process working set.
Of course, in a more portable fashion, you might just get away with unmapping and remapping. If the access is to a spinning HDD and remains sequential across remaps, there might not be a performance penalty (there might be, though, if the kernel prefetched data, e.g. due to madvise() or the Windows equivalent thereof).
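A small sketch of the two Win32 calls mentioned above (they allow the memory manager to trim the process's pages; pair them with flush()/FlushViewOfFile if you also need the data safely on disk):

```cpp
#include <windows.h>
#include <psapi.h>
#pragma comment(lib, "psapi.lib")

void TrimOwnWorkingSet()
{
    HANDLE self = GetCurrentProcess();

    // Option 1: ask the OS to evict as many pages as possible right now.
    EmptyWorkingSet(self);

    // Option 2: temporarily cap the working set; the limits here are illustrative.
    SetProcessWorkingSetSize(self, 1 * 1024 * 1024, 64 * 1024 * 1024);

    // Passing (SIZE_T)-1 for both values later restores the default behaviour.
}
```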

Should I avoid static compilation because of cache misses?

The title sums up pretty much the entire story. I was reading this, and the key point is that
A bigger executable means more cache misses
and since a static executable is by definition bigger than a dynamically linked one, I'm curious about the practical considerations in this case.
The article in the link discusses the side effect of inlining small functions in the OS kernel. This indeed has a noticeable effect on performance, because the same function is called from many different places throughout a sequence of system calls. For example, if you call open and then call read, seek and write: open will store a file handle somewhere in the kernel, and in the calls to read, seek and write that handle will have to be "found". If that lookup is an inlined function, we now have three copies of that function in the cache, and no benefit at all from read having called the same function that seek and write do. If it's a non-inlined function, it will already be in the cache when seek and write call it.
For a given process, whether the code is linked statically or dynamically has very little impact once the application is fully loaded. If there are MANY copies of the application, then other processes may benefit from re-using the same memory for the shared libraries. But the size needed for that process remains the same whether it is shared with 0, 1, 3, or 100 other processes. The benefit of sharing the binary files across many executables comes from things like the C library that sits behind almost every single executable in the system: when you have 1000 processes running that ALL use the same basic runtime, there is only one copy of its code rather than 1000. But it is unlikely to have much effect on the cache efficiency of any particular application; perhaps common functions like strcpy are used often enough that there is a small chance that when the OS task-switches, they are still in the cache when the next application calls strcpy.
So, in summary: probably doesn't make any difference at all.
The overall memory footprint of the static version is the same as that of the dynamic version; remember that the dynamically-linked objects still need to be loaded into memory!
Of course, one could also argue that if there are multiple processes running, and they all dynamically link against the same object, then only one copy is required in memory, and so the aggregate footprint is lower than if they had all statically linked.
[Disclaimer: all of the above is educated guesswork; I've never measured the effect of linking on cache behaviour.]
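If you want a concrete feel for the size difference under discussion, one simple experiment (assuming a GCC toolchain; the file names are illustrative) is to build the same program both ways and compare the segment sizes:

```
g++ -O2 main.cpp -o app_dynamic          # libstdc++/libc linked dynamically
g++ -O2 -static main.cpp -o app_static   # libraries copied into the binary
size app_dynamic app_static              # compare text/data sizes
```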

Accessing >2,3,4GB Files in 32-bit Process on 64-bit (or 32-bit) Windows

Disclaimer: I apologize for the verbosity of this question (I think it's an interesting problem, though!), yet I cannot figure out how to word it more concisely.
I have done hours of research into the apparently myriad ways of solving the problem of accessing multi-GB files in a 32-bit process on 64-bit Windows 7, ranging from /LARGEADDRESSAWARE to VirtualAllocEx with AWE. I am somewhat comfortable writing a multi-view memory-mapped system on Windows (CreateFileMapping, MapViewOfFile, etc.), yet can't quite escape the feeling that there is a more elegant solution to this problem. Also, I'm quite aware of Boost's interprocess and iostreams templates, although they appear to be rather lightweight, requiring a similar amount of effort to writing a system using only Windows API calls (not to mention that I already have a memory-mapped architecture semi-implemented using Windows API calls).
I'm attempting to process large datasets. The program depends on pre-compiled 32-bit libraries, which is why, for the moment, the program itself also runs as a 32-bit process, even though the system and OS are 64-bit. I know there are ways I could add wrapper libraries around this, but seeing as it's part of a larger codebase, that would be quite an undertaking. I set the binary headers to allow /LARGEADDRESSAWARE (at the expense of reducing my kernel space?), so that I get up to around 2-3 GB of addressable memory per process, give or take (depending on heap fragmentation, etc.).
Here's the issue: the datasets are 4+GB, and have DSP algorithms run upon them that require essentially random access across the file. A pointer to the object generated from the file is handled in C#, yet the file itself is loaded into memory (with this partial memory-mapped system) in C++ (it's P/Invoked). Thus, I believe the solution is unfortunately not as simple as simply adjusting the windowing to access the portion of the file I need to access, as essentially I want to still have the entire file abstracted into a single pointer, from which I can call methods to access data almost anywhere in the file.
Apparently, most memory-mapped architectures rely upon splitting the single process into multiple processes, so, for example, I'd access a 6 GB file with 3 processes, each holding a 2 GB window into the file. I would then need to add a significant amount of logic to pull data from, and recombine it across, these different windows/processes. VirtualAllocEx apparently provides a method of increasing the virtual address space, but I'm still not entirely sure this is the best way to go about it.
But let's say I want this program to function just as "easily" as a single 64-bit process on a 64-bit system. Assume that I don't care about thrashing; I just want to be able to manipulate a large file on the system, even if only, say, 500 MB were loaded into physical RAM at any one time. Is there any way to obtain this functionality without having to write a somewhat ridiculous, manual memory system by hand? Or is there some better way than what I have found so far by combing SO and the internet?
This lends itself to a secondary question: is there a way of limiting how much physical RAM would be used by this process? For example, what if I wanted to limit the process to only having 500 MB loaded into physical RAM at any one time (whilst keeping the multi-GB file paged on disk)?
I'm sorry for the long question, but I feel as though it's a decent summary of what appear to be many questions (with only partial answers) that I've found on SO and the net at large. I'm hoping that this can be an area wherein a definitive answer (or at least some pros/cons) can be fleshed out, and we can all learn something valuable in the process!
You could write an accessor class to which you give a base address (offset) and a length. It returns the data, or throws an exception (or signals an error however else you like) if an error condition arises (out of bounds, etc.).
Then, any time you need to read from the file, the accessor object can use SetFilePointerEx() before calling ReadFile(). You can pass the accessor class to the constructor of whatever objects you create when you read the file. The objects then use the accessor class to read the data from the file, which they receive in their constructors and parse into object data.
If, later down the line, you're able to compile to 64-bit, you can just change (or extend) the accessor class to read from memory instead.
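A rough sketch of such an accessor, assuming plain Win32 file I/O (the class and member names are invented for illustration):

```cpp
#include <windows.h>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Random-access reader over a file of arbitrary size, usable from a
// 32-bit process because only the requested window is held in memory.
class FileAccessor
{
public:
    explicit FileAccessor(const wchar_t* path)
    {
        file_ = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file_ == INVALID_HANDLE_VALUE)
            throw std::runtime_error("cannot open file");

        LARGE_INTEGER size;
        if (!GetFileSizeEx(file_, &size))
            throw std::runtime_error("cannot query file size");
        size_ = static_cast<std::uint64_t>(size.QuadPart);
    }

    ~FileAccessor() { CloseHandle(file_); }

    // Read `length` bytes starting at the absolute 64-bit `offset`.
    std::vector<char> Read(std::uint64_t offset, std::uint32_t length) const
    {
        if (offset + length > size_)
            throw std::out_of_range("read past end of file");

        LARGE_INTEGER pos;
        pos.QuadPart = static_cast<LONGLONG>(offset);
        if (!SetFilePointerEx(file_, pos, nullptr, FILE_BEGIN))
            throw std::runtime_error("seek failed");

        std::vector<char> buffer(length);
        DWORD bytesRead = 0;
        if (!ReadFile(file_, buffer.data(), length, &bytesRead, nullptr) ||
            bytesRead != length)
            throw std::runtime_error("read failed");
        return buffer;
    }

private:
    HANDLE file_ = INVALID_HANDLE_VALUE;
    std::uint64_t size_ = 0;
};
```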
As for limiting the amount of RAM used by the process: that's mostly a matter of making sure that (a) you don't have memory leaks (especially obscene ones) and (b) you destroy objects you don't need at that very moment. Even if you will need an object later down the line, as long as the data won't change, just destroy it, then recreate it later when you do need it, allowing it to re-read the data from the file.

May a compiler ever generate code to unload parts of the code segment during execution?

Apart from the DLL concept, which provides the ability to load and unload methods or functions at run time, I'm wondering if a compiler may ever say something like: OK, since this particular part of the code takes a considerable amount of space in the code segment and is never going to be used again after this point during program execution, it would be good to generate some code to unload that part of the code segment once execution reaches that point, so that the overall space taken by the code segment gets smaller. Is that purely fictional, or can it happen?
Sure. There's a technique called overlaying that loads different code into the same bit of address space at different times. Sometimes it was done manually, other times compilers helped. Sometimes the loading is done in software, sometimes in hardware (with address multiplexing, so that e.g. during boot time one bit of address space reads from a ROM chip, but after boot it switches to address RAM or a different ROM instead).
Overlaying was much more common when computers had less memory, e.g. in the early days of DOS where you had 640K at best and often not even that. These days it still has applications for embedded systems where memory and/or address space are at a premium.
A compiler can do anything it wants to, as long as that doesn't violate the standard. If it can figure out that the code is never called again, it can ditch it completely.
It could even replace it with a smaller stub function that would reload the code, were it required.
But you'll be very unlikely to ever see that in a modern OS since the OS itself provides that capability under the covers.
Operating systems (at least the common ones) will swap out your physical pages when memory runs low, and they won't be reloaded until they're needed (usually by a page fault when trying to access them).
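For completeness, the manual route the question alludes to (explicitly loading code only while it is needed and unloading it afterwards) looks roughly like this on Windows; the DLL and function names are hypothetical:

```cpp
#include <windows.h>

typedef int (*RareFunc)(int);   // signature of a hypothetical exported function

int CallRareCodeOnce(int arg)
{
    HMODULE mod = LoadLibraryW(L"rarely_used.dll");   // code mapped in on demand
    if (!mod)
        return -1;

    int result = -1;
    if (RareFunc fn = reinterpret_cast<RareFunc>(GetProcAddress(mod, "RareFunction")))
        result = fn(arg);

    FreeLibrary(mod);   // code segment unmapped again, freeing address space
    return result;
}
```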
Yes: Windows device drivers use this technique. The LE file format has certain code segments marked as discardable. The OS can also at times decide to swap out code segments that have not been used for a long time.
Strictly speaking, however, this is not an area for the compiler to play around with; it is mostly the linker/loader/OS that affects this.
I don't know of a compiler which does this, but there's no rule prohibiting it. If the compiler can prove that doing so won't change the semantics of the program, then it is allowed under the as-if rule.
It is usually not necessary however, because unused code can already get swapped out as part of the pagefile mechanism associated with virtual memory. (and because you'd probably only save a few KB of memory space)