Are there any downsides to using UPX to compress a Windows executable?

Are there any downsides to using UPX to compress a Windows executable? - c++

I've used UPX before to reduce the size of my Windows executables, but I must admit that I am naive to any negative side effects this could have. What's the downside to all of this packing/unpacking?
Are there scenarios in which anyone would recommend NOT UPX-ing an executable (e.g. when writing a DLL, Windows Service, or when targeting Vista or Win7)? I write most of my code in Delphi, but I've used UPX to compress C/C++ executables as well.
On a side note, I'm not running UPX in some attempt to protect my exe from disassemblers, only to reduce the size of the executable and prevent cursory tampering.

... there are downsides to
using EXE compressors. Most notably:
Upon startup of a compressed EXE/DLL, all of the code is
decompressed from the disk image into
memory in one pass, which can cause
disk thrashing if the system is low on
memory and is forced to access the
swap file. In contrast, with
uncompressed EXE/DLLs, the OS
allocates memory for code pages on
demand (i.e. when they are executed).
Multiple instances of a compressed EXE/DLL create multiple
instances of the code in memory. If
you have a compressed EXE that
contains 1 MB of code (before
compression) and the user starts 5
instances of it, approximately 4 MB of
memory is wasted. Likewise, if you
have a DLL that is 1 MB and it is used
by 5 running applications,
approximately 4 MB of memory is
wasted. With uncompressed EXE/DLLs,
code is only stored in memory once and
is shared between instances.
http://www.jrsoftware.org/striprlc.php#execomp

I'm surprised this hasn't been mentioned yet but using UPX-packed executables also increases the risk of producing false-positives from heuristic anti-virus software because statistically a lot of malware also uses UPX.

There are three drawbacks:
The whole code will be fully uncompressed in virtual memory, while in a regular EXE or DLL, only the code actually used is loaded in memory. This is especially relevant if only a small portion of the code in your EXE/DLL is used at each run.
If there are multiple instances of your DLL and EXE running, their code can't be shared across the instances, so you'll be using more memory.
If your EXE/DLL is already in cache, or on a very fast storage medium, or if the CPU you're running on is slow, you will experience reduced startup speed as decompression will still have to take place, and you won't benefit from the reduced size. This is especially true for an EXE that will be invoked multiple times repeatedly.
Thus the above drawbacks are more of an issue if your EXE or DLLs contains lots of resources, but otherwise, they may not be much of a factor in practice, given the relative size of executables and available memory, unless you're talking of DLLs used by lots of executables (like system DLLs).
To dispell some incorrect information in other answers:
UPX will not affect your ability to run on DEP-protected machines.
UPX will not affect the ability of major anti-virus software, as they support UPX-compressed executables (as well as other executable compression formats).
UPX has been able to use LZMA compression for some time now (7zip's compression algorithm), use the --lzma switch.

The only time size matters is during download off the Internet. If you are using UPX then you actually get worse performance than if you use 7-zip (based on my testing 7-Zip is twice as good as UPX). Then when it is actually left compressed on the target computer your performance is decreased (see Lars' answer). So UPX is not a good solution for file size. Just 7zip the whole thing.
As far as to prevent tampering, it is a FAIL as well. UPX supports decompressing too. If someone wants to modify the EXE then they will see it is compress with UPX and then uncompress it. The percentage of possible crackers you might slow down does not justify the effort and performance loss.
A better solution would be to use binary signing or at least just a hash. A simple hash verification system is to take a hash of your binary and a secret value (usually a guid). Only your EXE knows the secret value, so when it recalculates the hash for verification it can use it again. This isn't perfect (the secret value can be retrieved). The ideal situation would be to use a certificate and a signature.

The final size of the executable on disk is largely irrelevant these days. Your program may load a few milliseconds faster, but once it starts running the difference is indistinguishable.
Some people may be more suspicious of your executable just because it is compressed with UPX. Depending on your end users, this may or may not be an important consideration.

The last time I tried to use it on a managed assembly, it munged it so bad that the runtime refused to load it. That's the only time I can think of that you wouldn't want to use it (and, really, it's been so long since I tried that that the situation may even be better now). I've used it extensively in the past on all types of unmanaged binaries, and never had an issue.

If your only interest is in decreasing the size of the executables, then have you tried comparing the size of the executable with and without runtime packages? Granted you will have to also include the sizes of the packages overall along with your executable, but if you have multiple executables which use the same base packages, then your savings would be rather high.
Another thing to look at would be the graphics/glyphs you use in your program. You can save quite a bit of space by consolidating them to a single Timagelist included in a global data module rather than have them repeated on each form. I believe each image is stored in the form resource as hex, so that would mean that each byte takes up two bytes...you can shrink this a bit by loading the image from a RCData resource using a TResourceStream.

There are no drawbacks.
But just FYI, there is a very common misconception regarding UPX as--
resources are NOT just being compressed
Essentially you are building a new executable that has a "loader" duty and the "real" executable, well, is being section-stripped and compressed, placed as a binary-data resource of the loader executable (regardless the types of resources were in the original executable).
Using reverse-engineering methods and tools either for education purposes or other will show you the information regarding the "loader executable", and not variable information regarding the original executable.

IMHO routinely UPXing is pointless, but the reasons are spelled above, mostly, memory is more expensive than disk.
Erik: the LZMA stub might be bigger. Even if the algorithm is better, it does not always be a net plus.

Virus scanners that look for 'unknown' viruses can flag UPX compressed executables as having a virus. I have been told this is because several viruses use UPX to hide themselves. I have used UPX on software and McAfee will flag the file as having a virus.

The reason UPX has so many false alarms is because its open licensing allows malware authors to use and modify it with impunity. Of course, this issue is inherent to the industry, but sadly the great UPX project is plagued by this problem.
UPDATE: Note that as the Taggant project is completed, the ability to use UPX (or anything else) without causing false positives will be enhanced, assuming UPX supports it.

I believe there is a possibility that it might not work on computers that have DEP (Data Execution Prevention) turned on.

When Windows load a binary, first thing it does is called Import/Export Table resolution. Ie, what ever API and DLL that is indicated in the Import Table, it will first load the DLL into a randomly generated base address. And using the base address plus offset into the DLL's function, this information will be updated to the Import Table.
EXE does not have Export Table.
All these happened even before jumping to the original entry point for execution.
Then after it start executing from the entry point, the EXE will run a small piece of code before starting the decompression algorithm. This small piece of code also means that the Windows API needed will be very small, resulting in a small Import Table.
But after the binary is decompressed, if it started to use any Windows API not resolved before, then likely it is going to crash. So it is essential that the decompression routine will resolve and update the Import Table for all the referenced Window API inside the decompressed codes, before executing the decompressed codes.
References:
https://malwaretips.com/threads/malware-analysis-2-pe-imports-static-analysis.62135/

Related

Has image base address optimization sense? [duplicate]

Rebasing a DLL means to fix up the DLL such, that it's preferred load adress is the load address that the Loader is actually able to load the DLL at.
This can either be achieved by a tool such as Rebase.exe or by specifying default load addresses for all your (own) dlls so that they "fit" in your executable process.
The whole point of managing the DLL base addresses this way is to speed up application loads. (Or so I understand.)
The question is now: Is it worth the trouble?
I have the book Windows via C/C++ by Richter/Nazarre and they strongly recommend[a] making sure that the load addresses all match up so that the Loader doesn't have to rebase the loaded DLLs.
They fail to argue however, if this speeds up application load times to any significant amount.
Also, with ASLR it seems dubious that this has any value at all, since the load addresses will be randomized anyway.
Are there any hard facts on the pro/cons of this?
[a]: In my WvC++/5th ed it is in the sections titled Rebasing Modules and Binding Modules on pages 568ff. in Chapter 20, DLL Advanced Techniques.

Patching the relocatable addresses isn't the big deal, that runs at memory speeds, microseconds. The bigger issue is that the pages that contains this code now need to be backed up by the paging file instead of the DLL file. In other words, when pages containing code are unmapped, they need to be written to the paging file instead of just getting discarded.
The cost of this isn't that easy to measure, especially on modern machines with lots of RAM. It only counts when the machine starts to get under load with lots of processes competing for memory. And the fragmentation of the paging file.
But clearly, rebasing is a very cheap optimization. And it is very easy to see in the Debug + Windows + Modules window, there's a bright icon on the rebased DLLs. The Address column gives you a good hint what base address would be a good choice. Leave ample space between them so you don't constantly have to tweak this as your program grows.

I'd like to provide one answer myself, although the answers of Hans Passant and others are describing the tradeoffs already pretty well.
After recently fiddling with DLL base addresses in our application, I will here give my conclusion:
I think that, unless you can prove otherwise, providing DLLs with a non-default Base Address is an exercise in futility. This includes rebasing my DLLs.
For the DLLs I control, given the average application, each DLL will be loaded into memory only once anyway, so the load on the paging file should be minimal. (But see the comment of Michal Burr in another answer about Terminal Server environment.)
If DLLs are provided with a fixed base address (without rebasing) it will actually increase address space fragmentation, as sooner or later these addresses won't match anymore. In our app we had given all DLLs a fixed base address (for other legacy reasons, and not because of address space fragmentation) without using rebase.exe and this significantly increased address space fragmentation for us because you really can't get this right manually.
Rebasing (via rebase.exe) is not cheap. It is another step in the build process that has to be maintained and checked, so it has to have some benefit.
A large application will always have some DLLs loaded where the base address does not match, because of some hook DLLs (AV) and because you don't rebase 3rd party DLLs (or at least I wouldn't).
If you're using a RAM disk for the paging file, you might actually be better of if loaded DLLs get paged out :-)
So to sum up, I think that rebasing isn't worth the trouble except for special cases like the system DLLs.
I'd like to add a historical piece that I found on Old New Thing: How did Windows 95 rebase DLLs? --
When a DLL needed to be rebased, Windows 95 would merely make a note
of the DLL's new base address, but wouldn't do much else. The real
work happened when the pages of the DLL ultimately got swapped in. The
raw page was swapped off the disk, then the fix-ups were applied on
the fly to the raw page, thereby relocating it. The fixed-up page was
then mapped into the process's address space and the program was
allowed to continue.
Looking at how this process is done (read the whole thing), I personally suspect that part of the "rebasing is evil" stance dates back to the olden days of Win9x and low memory conditions.
Look, now there's a non-historical piece on Old New Thing:
How important is it nowadays to ensure that all my DLLs have non-conflicting base addresses?
Back in the day, one of the things you were exhorted to do was rebase
your DLLs so that they all had nonoverlapping address ranges, thereby
avoiding the cost of runtime relocation. Is this still important
nowadays?
...
In the presence of ASLR, rebasing your DLLs has no effect because ASLR is going to ignore your base address anyway and relocate the DLL into a location of its pseudo-random choosing.
...
Conclusion: It doesn't hurt to rebase, just in case, but understand
that the payoff will be extremely rare. Build your DLL with
/DYNAMICBASE enabled (and with /HIGHENTROPYVA for good measure)
and let ASLR do the work of ensuring that no base address collision
occurs. That will cover pretty much all of the real-world scenarios.
If you happen to fall into one of the very rare cases where ASLR is
not available, then your program will still work. It just may run a
little slower due to the relocation penalty.
... ASLR actually does a better job of avoiding collisions than manual
rebasing, since ASLR can view the system as a whole, whereas manual
rebasing requires you to know all the DLLs that are loaded into your
process, and coordinating base addresses across multiple vendors is
generally not possible.

They fail to argue however, if this speeds up application load times to any significant amount.
The load time change is minimal, because the v-table is what gets updated with the new addresses. However, if you have low memory - enough that stuff gets loaded in/out of the page file, then the system has to keep the dll in the page file (since the addresses are changed). If the dlls were rebased - and the rebased dlls don't collide with any other dlls - then instead of swapping them out to the page file (and back), the system just overwrites the memory and reloads the dll from the original on the hard drive.
The benefit is only relevant when systems are paging stuff in and out of main memory. The last time I made efforts to keep databases of applications and their base addresses was back in VB6 days, when the computers in our offices and data centers were lucky to have even 256MB of RAM.
Also, with ASLR it seems dubious that this has any value at all, since the load addresses will be randomized anyway.
At the moment ASLR only affects dlls and executables with the dynamic-relocation flag set. This includes Vista/Win7 system dlls and executables, and any developer made items where the developer intentionally set that flag during the build.
If you are going to set the dynamic-relocation flag, then don't bother rebasing the dlls. If all your clients have 4GB of RAM, then don't bother. If your boss is a cheapskate, then maybe.

You have to consider that user DLLs (that are not already loaded into another processes) has to be read from HDD. Usually the memory mapping is used for that (and it uses lazy loading), so if they have to be relocated, they'll have to be actually read from HDD before the process can start.
For those loaded by other processes the copy-on-write mechanism is used. So, again, relocating them will mean additional operations.
What's about ASLR, it's intended for security purposes, not for performance.

Yes, you should do it.
ASLR only impacts "system" DLLs and therefore the ones you are writing should not be impacted by ASLR. Additionally, ASLR doesn't completely "randomize" the location of these system binaries, it simply shuffles them around in the basic spot in the vm map.

Force executable into memory?

I have a cpp executable (it contains static libraries), about 1MB in size. When I run the exe, it consumes less than 200kb memory.
From what I understand this means the computer reads the exe little by little when it's needed from the HDD.
I want to improve the performance, even a bit, so, how can I say "load the exe into memory" and don't touch the HDD? Will this bring any performance improvement?

The OS will load parts of the executable into memory as it is needed. This is where knowing more about the instruction cache might be useful. The idea is that you structure your program so that common code is grouped together. For example, you might have some functions that are getting inlined - in this case the OS would have to load the same code in multiple places which might be slow. By removing the inline you'd have the code in one chunk in memory which would get cached and thus reduce loading time.
I would agree with the others though that this type of optimization should really be reserved until after you profile and know for sure that this is the bottleneck, which is very unlikely

If you really want to do this, you need to touch the memory pages by reading from them. But forcing pages into memory once does not guarantee that they will remain in memory. An apparent alternative solution would be to VirtualLock the region, but in practice this function doesn't work the way you'd think (at least on any system where I've used it), even if you have the appropriate privilegues.
Note that the default minimum working set is only 16MB, so for larger executables, forcing pages into RAM will necessarily push others (which you need!) out of the working set, so this is in fact an anti-optimization. Unless you have the necessary privilegues to increase the working set size.
It's a bit tedious to find out where the executable's mapping starts and ends. Not that it is impossible, but it's much more complicated than just mapping the file again. Then you simply run a loop which reads one byte every 4096 bytes, and you are done. This will consume twice as much address space, but will consume the same amount of RAM (thanks to how memory mapping works).
But, realistically, you will gain absolutely nothing from doing this.
The operating system does not need to load the entire executable and does not need to keep it resident at all times. Part of your executable will be debug info or import info, which the loader will maybe look at once (or won't look at) and never need afterwards. Forcing that stuff into memory only means you purge useful pages from the working set.
The OS likely has the parts (or most of it) that are not visible to you in the buffer cache anyway, but even if that isn't the case, you will hardly ever notice a difference.

Globally, forcing all of the program into RAM will slow it down.
There are usually large parts of the code which aren't executed
in any given run, and there's no need to ever read these from
disk.
Where forcing all or parts of the program into RAM can make a difference
is latency. If you're responding in real time to external
events, having to load the code in order to respond will reduce
latency. This can only be done by using a system specific
request (e.g. mlock under Posix systems supporting the read
time extension). You'll probably have to have special rights to
be able to do it, though. In practice, it should only be used
on machines dedicated to a specific application, since it can
have a very negative impact on the total system performance.
(There's a reason that it's in the real-time extensions, and not
in the basic Posix.) Locking the addresses used by the function in memory means that there can be no page faults when it is executed.

Interlocked API between Windows CE 5 and 6

I'm currently porting a VS2005 C++ application from CE5 to CE6 and I'm experiencing severe performance problems. This goes so far that a single HTTP request retrieving dynamic content takes 40ms on CE5 and 350ms on CE6. These values used to be worse due to a bunch of inefficiencies that I already cleaned up, improving performance on both systems, but at the moment I'm stuck at that latency. For the record, both tests are made on the same machine and the webserver is not the one supplied with CE but a custom one implemented in C++. Note also that the problem is not the network IO, CE6 even outperforms CE5 on the same machine when serving static files, but it's the dynamic content handling.
While trying to figure out why the program performs so badly, I stumbled across something that puzzled me: Under CE5, the Interlocked* API for x86 use neither the compiler intrinsics nor real function calls but inline assembly code. This code has a comment saying that the intrinsic includes lock prefixes that are only required for multi-processor systems and that slow down code running on just a single core like CE5. On CE6, these functions are implemented using the compiler intrinsics including the lock prefix. Since these functions are used by e.g. Boost and STLport, both of which are used inside the webserver, I was wondering if those could be the culprit.
Another thing I noticed was that some string parsing functions take extremely long. Worse, it seems that calling the same function a second time after the first time takes less time, so it seems as if some kind of caching was going on. Since this is a short (<1kB) string received via TCP that is parsed in memory, I can't imagine which cache could be responsible for that. The only cache could be the instruction cache, but the program is not larger than the CE5 version and if the code was running from uncached memory it would not show these caching effects.
TLDR - Questions:
Is CE6 capable of handling multiple processors at all?
Is there an easy way to tell the compiler that it should omit the lock prefix? My current approach to achieve that is to simply copy the inline assembly from the CE5 SDK, but that's beyond ugly.
I'd also appreciate any other suggestions what to look at or what to try. Many thanks in advance!
Summary There is no problem that depends on the executable, let alone on the Interlocked API. Running the same executable proved that. However, running on a different machine with a different platform setup made a difference. We're now back to Platform Builder, trying to figure out the differences between the two platforms.

No. WEC7 is required for SMP support. Most likely in CE6 the OEM has disabled the other cores.
None that I am aware of.
Either use the performance profiling tools or instrument your code with timing calls to narrow down where things are taking too long.

I have finally found the reason for the performance behaviour, it's simply paging. CE6 has a pool manager (see http://blogs.msdn.com/b/ce_base/archive/2008/01/19/paging-and-the-windows-ce-paging-pool.aspx) which handles paging out unused mapped DLLs and EXEs. When the amount of mapped binaries exceeds a certain size, it starts (with low priority) to page out memory. The limit when it starts paging out is just 3MiB by default, which is rather low for current applications. Also, the cache is not an LRU cache but simply discarding the pages in the order they were loaded.
It turns out that our system exceeded this limit, which causes the paging to begin. Due to the algorithm used, it will always throw out used ones that will then have to be paged in again. The code that serves static files is small, so this wasn't affected as much by this limit. The code that serves dynamic pages is much larger though, so it wreaks havoc on the overall system with IO. This also explains why the problem couldn't be attributed to a specific piece of code, it wasn't the code itself but loading it.
I have detected this via IOCTL_HAL_GET_POOL_PARAMETERS, which gave me the relevant configuration parameters, current state, how often the pageout-thread ran and for how long (although the latter is only the time it took to swap out pages). I should be able to find the resulting page faults in the kernel tracker, too, now that I know what I'm looking for. I could also observe that the activity LED on the CF card adapter now lights up when first loading a file, but not on subsequent requests, where it is taken from cache. This used to always cause the LED to flash on dynamic pages.
The simple solution is to increase the limit for the pool manager, so it doesn't start throwing out things. This can be done easily in config.bib by patching kernel.dll with the according values. Alternatively, reducing the executable size would help, but that's not so easy.

Is rebasing DLLs (or providing an appropriate default load address) worth the trouble?

Rebasing a DLL means to fix up the DLL such, that it's preferred load adress is the load address that the Loader is actually able to load the DLL at.
This can either be achieved by a tool such as Rebase.exe or by specifying default load addresses for all your (own) dlls so that they "fit" in your executable process.
The whole point of managing the DLL base addresses this way is to speed up application loads. (Or so I understand.)
The question is now: Is it worth the trouble?
I have the book Windows via C/C++ by Richter/Nazarre and they strongly recommend[a] making sure that the load addresses all match up so that the Loader doesn't have to rebase the loaded DLLs.
They fail to argue however, if this speeds up application load times to any significant amount.
Also, with ASLR it seems dubious that this has any value at all, since the load addresses will be randomized anyway.
Are there any hard facts on the pro/cons of this?
[a]: In my WvC++/5th ed it is in the sections titled Rebasing Modules and Binding Modules on pages 568ff. in Chapter 20, DLL Advanced Techniques.

Patching the relocatable addresses isn't the big deal, that runs at memory speeds, microseconds. The bigger issue is that the pages that contains this code now need to be backed up by the paging file instead of the DLL file. In other words, when pages containing code are unmapped, they need to be written to the paging file instead of just getting discarded.
The cost of this isn't that easy to measure, especially on modern machines with lots of RAM. It only counts when the machine starts to get under load with lots of processes competing for memory. And the fragmentation of the paging file.
But clearly, rebasing is a very cheap optimization. And it is very easy to see in the Debug + Windows + Modules window, there's a bright icon on the rebased DLLs. The Address column gives you a good hint what base address would be a good choice. Leave ample space between them so you don't constantly have to tweak this as your program grows.

They fail to argue however, if this speeds up application load times to any significant amount.
The load time change is minimal, because the v-table is what gets updated with the new addresses. However, if you have low memory - enough that stuff gets loaded in/out of the page file, then the system has to keep the dll in the page file (since the addresses are changed). If the dlls were rebased - and the rebased dlls don't collide with any other dlls - then instead of swapping them out to the page file (and back), the system just overwrites the memory and reloads the dll from the original on the hard drive.
The benefit is only relevant when systems are paging stuff in and out of main memory. The last time I made efforts to keep databases of applications and their base addresses was back in VB6 days, when the computers in our offices and data centers were lucky to have even 256MB of RAM.
Also, with ASLR it seems dubious that this has any value at all, since the load addresses will be randomized anyway.
At the moment ASLR only affects dlls and executables with the dynamic-relocation flag set. This includes Vista/Win7 system dlls and executables, and any developer made items where the developer intentionally set that flag during the build.
If you are going to set the dynamic-relocation flag, then don't bother rebasing the dlls. If all your clients have 4GB of RAM, then don't bother. If your boss is a cheapskate, then maybe.

You have to consider that user DLLs (that are not already loaded into another processes) has to be read from HDD. Usually the memory mapping is used for that (and it uses lazy loading), so if they have to be relocated, they'll have to be actually read from HDD before the process can start.
For those loaded by other processes the copy-on-write mechanism is used. So, again, relocating them will mean additional operations.
What's about ASLR, it's intended for security purposes, not for performance.

Yes, you should do it.
ASLR only impacts "system" DLLs and therefore the ones you are writing should not be impacted by ASLR. Additionally, ASLR doesn't completely "randomize" the location of these system binaries, it simply shuffles them around in the basic spot in the vm map.

ramdisk with a file mirror

I wanted to speed up compilation so i was thinking i could have my files be build on a ramdisk but also have it flushed to the filesystem automatically and use the filesystem if there is not enough ram.
I may need something similar for an app i am writing where i would like files to be cached in ram and flushed into the FS. What are my options? Is there something like this that already exist? (perhaps fuse?) The app is a toy app (for now) and i would need to compile c++ code repeatedly. As we know, the longer it takes to compile when there is a specific problem to solve before progressing. the less we can get done.

Ram-disks went the way of the dodo with the file system cache. It can make much better decisions than a static cache, having awareness of RAM usage by other programs and the position of the disk write head. The lazy write-back is for free.

Compilation is CPU-bound, not disk bound. If you utilize all your CPU cores using the appropriate build flag, you can easily saturate them on typical PCs. Unless you have some sort of supercomputer, I don't think this will speed things up much.
For VS2008 this flag is /MP. It also exists on VS2005.

Not sure it helps answering a question from almost 12 years ago, but just now I was looking for some software to sync file systems or directories to do exactly that.
Both answers are correct, in the sense that the file system cache will attempt to predict what you need and make available in RAM, and, generally, building a software benefits a lot of multithreading, possibly maxing out your CPU. But right now I'm building with Unity and I don't have control over build optimization.
That said, a virtual RAM disk has advantages, because those files are always in RAM, different from file system cache (FSC) that has to deal with resources and applications competing for disk access.
Another difference is that when an application closes a file handle or forces a sync, FSC will attempt to get that files as soon as possible to the disk, in order to avoid problems (power failure and so on). I believe you can alter the FSC behavior on Linux.
A sync to a RAM disk does not write to disk, that may be responsible for the difference of performance you said in your comment.
That said, I still need to look for something to auto-sync my two file systems!

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js