How to tell if an executable (in any format) is compressed

My question is: are there any tools out there that can detect which compression tool was used to compress an executable?
It doesn't matter which executable format the executable is in (http://en.wikipedia.org/wiki/Category:Executable_file_formats).
I'm looking for a tool that can recognize which compression tools were used to compress the executable.
For example: say an executable was compressed with UPX (the Ultimate Packer for eXecutables) but I had no idea this compressor was used. Can I somehow determine which compressor was used to compress it through the use of a tool?
If you have any recommendations or can point me in the right direction, it would be greatly appreciated! I would like to find a tool that can detect various compressors.

A start is: try to compress it with something like gzip. Something already compressed will not compress much, or at all, or perhaps expand a smidge. If the executable compresses a fair bit, then it was not compressed in the first place. A quick check shows that most of my executables compress by a factor of almost two.
Then at least you'll know whether or not it's compressed.
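If you want to automate that quick check instead of eyeballing it, here is a minimal sketch using zlib's compress2() (the 90% threshold is just an illustrative assumption, not a hard rule):

    // Compressibility check: a file that is already packed will barely shrink.
    // Sketch only; assumes zlib is installed (link with -lz).
    #include <zlib.h>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: " << argv[0] << " <executable>\n"; return 1; }

        std::ifstream in(argv[1], std::ios::binary);
        std::vector<unsigned char> data((std::istreambuf_iterator<char>(in)),
                                        std::istreambuf_iterator<char>());
        if (data.empty()) return 1;

        uLongf compressed_size = compressBound(data.size());
        std::vector<unsigned char> out(compressed_size);
        if (compress2(out.data(), &compressed_size, data.data(), data.size(), Z_BEST_SPEED) != Z_OK)
            return 1;

        double ratio = static_cast<double>(compressed_size) / data.size();
        std::cout << "compressed to " << ratio * 100.0 << "% of the original size\n";
        // Rough heuristic: an ordinary executable usually shrinks well below ~90%.
        std::cout << (ratio > 0.9 ? "probably already compressed or packed\n"
                                  : "probably not compressed\n");
        return 0;
    }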
From there, you can try packing sample executables with each candidate compressor and look for common byte sequences at the start of the file, which is where the decompressor stub lives.

On Unix, use the file command. It tries to determine what kind of data is in a file by matching it against various signatures. It's not specific to executables or compression; those are just some of the kinds of files it can recognize.
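The engine behind file is also available as a library, libmagic, so you can run the same check from C++; a rough sketch, assuming libmagic and its headers are installed (link with -lmagic):

    // Ask libmagic (the engine behind `file`) what it thinks a binary is.
    // Sketch only; error handling kept minimal.
    #include <magic.h>
    #include <iostream>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;

        magic_t cookie = magic_open(MAGIC_NONE);            // default, human-readable output
        if (!cookie || magic_load(cookie, nullptr) != 0) {   // nullptr = default magic database
            std::cerr << "failed to initialise libmagic\n";
            return 1;
        }

        const char* description = magic_file(cookie, argv[1]);
        std::cout << argv[1] << ": " << (description ? description : "unknown") << "\n";
        // Whether a packer such as UPX is identified depends on the magic
        // database shipped with your system.

        magic_close(cookie);
        return 0;
    }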

Related

When does the decompression process take place and how does the solidbreak flag really work in Inno Setup?

There's such a thing in Inno Setup as SolidCompression, and there's a flag used in the [Files] section called solidbreak. Could anybody explain to me how the aforementioned flag works, when we really need to use it, and when the decompression process takes place?
Solid compression means that the files are compressed as though they were all just one big file. This usually results in better compression, because the compression knowledge built up during one file carries over to the next instead of starting over for each file. The downside is that in order to decompress a specific file during installation, one has to decompress all the files before it.
The solidbreak flag, when applied, tells the compression engine to split up the solid compression and start a new stream when it comes to the source the flag is applied to, so that if that file specifically needs to be decompressed, the decompression code can simply seek to the position in the file where it starts. Basically, the downside from above disappears, but then some of the bonus of that compression knowledge gets lost as well.
If you want to use solid compression and you have the sort of files that all have to be installed, don't use solidbreak; but if you have a list of checkboxes to select modules, you might want to consider applying solidbreak to some or all of the optional modules. If you don't, all the files will be decompressed even though only some are needed for the selected options. The exact result will vary with file size and so on, so I can't say more than that you might have to experiment to see the results.
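For illustration, a hypothetical script fragment along those lines (the file, directory, and component names are made up):

    [Setup]
    SolidCompression=yes

    [Files]
    ; Always-installed files share one solid stream.
    Source: "core\Engine.dll"; DestDir: "{app}"
    Source: "core\Game.exe"; DestDir: "{app}"
    ; The optional module starts a new stream, so it is only decompressed
    ; when its component is actually selected.
    Source: "extras\MapEditor.exe"; DestDir: "{app}"; Components: editor; Flags: solidbreak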

Should I use .tar.gz?

In the Unix world, there is a famous format called "tar.gz".
But now I want to develop a game, and random access to files will be more efficient. If everything is archived first, it forces sequential access.
I know that there is an alternative format called zip or 7z, but what about other formats?
Not only tar.gz: I'd like a small compression library that also gives me archiving features.
Should I use *.tar, or are other solutions available?
PS: I'm using C++.
"Random" access is not good on a .tar.gz, since that is a .tar file that has been wrapped in a .gz compression, so to get to things in the .tar file, you'd first have to decompress the .tar file.
It would be possible to use a .tar file that contains individual files compressed with gzip. You can read the table of contents of the .tar file, find and store where all the files are in the archive, and then extract them as you need. However, you may find that using your own format is "better": for example, if I remember correctly, a tar archive stores a header per file, whereas you may want to build your index in one lump before you store the files (which does mean at least enumerating all the relevant files first, then forming the compressed variants and "patching up" the index with the offsets of the compressed data).
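To make the "read the table of contents" idea concrete, here is a rough sketch that walks the 512-byte ustar headers of an uncompressed .tar and records each member's name, offset, and size (no error handling, and it ignores the header checksum and entry types):

    // Walk an uncompressed tar file and list the name, offset, and size of each member.
    // ustar layout: a 512-byte header per member; the name is the first 100 bytes, the
    // size is an octal string at offset 124; the data follows, padded to a 512-byte boundary.
    #include <cstdlib>
    #include <cstring>
    #include <fstream>
    #include <iostream>

    int main(int argc, char** argv) {
        if (argc < 2) return 1;
        std::ifstream tar(argv[1], std::ios::binary);

        char header[512];
        while (tar.read(header, 512)) {
            if (header[0] == '\0') break;            // an all-zero block marks the end of the archive

            char name[101];
            std::memcpy(name, header, 100);
            name[100] = '\0';
            unsigned long long size = std::strtoull(header + 124, nullptr, 8);  // octal size field
            std::streamoff offset = tar.tellg();     // member data starts right after the header

            std::cout << name << "  offset=" << offset << "  size=" << size << "\n";

            std::streamoff padded = static_cast<std::streamoff>((size + 511) / 512 * 512);
            tar.seekg(padded, std::ios::cur);        // skip the padded data to reach the next header
        }
        return 0;
    }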
For a game, one critical factor would probably be the decompression speed, so you may want to look at different libraries and which one has the best decompression speed. I found this when searching for a comparison:
http://catchchallenger.first-world.info//wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
You may also care about memory usage, which also varies a bit depending on algorithm.
And I'm guessing your individual files will be much smaller than the entire tar-ball of Linux, so you may want to do your own benchmark, with your own data - after all, the speed of different compression formats does, to some degree, depend on the format of the data.
Normally, for computer games, what you need is a format where each file is compressed individually before being assembled into one file. This is the crucial difference between .tar.gz and .zip / .7z formats, that is, tar-gz is a "compressed archive" while zip / 7z are "archives of compressed files". In fact, both file formats use the same compression algorithm (by default), and the only reason that .tar.gz files are typically smaller is because they compress the entire archive instead of file-by-file, which increases the overall compression ratio.
AFAIK, most computer games use a zip format or a custom format that closely matches it, because it does per-file compression. For instance, Quake engines (.pak, .pk3, .pk4) have relied on an essentially off-the-shelf zip format with a few minor additions (like a built-in checksum, I think).
The .tar.gz format is created by first making an archive that puts all the (uncompressed) files into one .tar file. Then, that big archive file is compressed with the gzip method to create the final .tar.gz file. The point is that to get any one of the files from the archive, you have to decompress the entire thing. This is very appropriate for backups or large transfers, but not appropriate at all for a game engine media archive.
That said, you could technically do the reverse of tar-gz, which is to compress each file individually with gzip and then put them together in a .tar archive. But this is probably not worth the extra trouble, as it is pretty much exactly what zip files are (in "one easy step"). So, it will be a lot easier to use an off-the-shelf all-in-one format like zip that will allow you to extract individual files one at a time. There are many off-the-shelf libraries for extracting and manipulating files in zip archives; just start with libzip (not to be confused with zlib, which is for gzip / .gz).
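As a sketch of the "archive of individually compressed files" idea, here is roughly what reading one member with libzip looks like (the archive and member names are made up, real code needs proper error checking, and this assumes a reasonably recent libzip):

    // Read one file out of a zip archive with libzip (link with -lzip).
    // Only that member is decompressed; the rest of the archive is untouched.
    #include <zip.h>
    #include <iostream>
    #include <vector>

    int main() {
        int err = 0;
        zip_t* archive = zip_open("assets.zip", ZIP_RDONLY, &err);     // hypothetical archive name
        if (!archive) { std::cerr << "cannot open archive\n"; return 1; }

        zip_stat_t st;
        if (zip_stat(archive, "textures/grass.png", 0, &st) != 0) {    // hypothetical member name
            std::cerr << "no such member\n"; return 1;
        }

        std::vector<char> buffer(st.size);
        zip_file_t* member = zip_fopen(archive, "textures/grass.png", 0);
        if (!member) return 1;
        zip_fread(member, buffer.data(), buffer.size());               // decompresses on the fly
        zip_fclose(member);
        zip_close(archive);

        std::cout << "read " << buffer.size() << " bytes\n";
        return 0;
    }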
In the Unix world, there is a famous format called "tar.gz".
Probably the biggest reason why "tar-balls" are so popular and famously used in Unix-like systems is that they preserve file permissions (and other meta-data, I guess). I think that some implementations of zip and 7z might provide that feature as an extension to the format, but most don't have it. The convenient thing with tar archives is that whatever you put in there comes out exactly the same at the other end, with all permissions and whatever else preserved. And the gzip compression (from zlib) has historically just been an industry-standard compression algorithm, although there are now better ones, also supported by tar, such as .tar.lzma (or .tlz) or .tar.xz.
but what about other formats?
There aren't really that many other formats. Compressed archive formats mostly reuse the same few algorithms (DEFLATE, LZ77 / LZMA / LZMA2, bzip2, etc.), and formats like zip / 7z / rar are really just container formats that can employ any of those compression algorithms (and even mix and match depending on the individual file types). The point is that you won't really find much that is better than zip or 7z, and their competitors are more or less gone today (like rar?).
Should I use *.tar or other solutions are available?
No, use zip or 7z. Tar-balls are for backups. They are optimized for that purpose (e.g., dump a large folder full of files into a tar-ball, and recover it later, with everything preserved and with best full-archive compression). For your application, zip or 7z is more appropriate.

C++ file container (e.g. zip) for easy access

I have a lot of small files I need to ship with an application I'm building, and I want to put these files into an archive to make copying and redistributing easier.
I also really like the idea of having them all in one place, so I only need to compare the MD5 of one file in case something goes wrong.
I'm thinking about a class which can load the archive and return a list of files within the archive and load a file into memory if I need to access it.
I already searched the Internet for different methods of achieving what I want and found out about zlib and the lzma sdk.
Neither really appealed to me: I couldn't really find out how portable zlib is, and I didn't like the LZMA SDK as it is just too much and I don't want to bloat the application because of this problem. Another downside with zlib is that I don't have the C/C++ experience (I'm really new to C++) to follow everything explained in the manual.
I also have to add that this is a time-critical problem. I thought for some time about implementing a simple format like tar in a way that lets me easily access the files from within my application, but I just haven't found the time to do that yet.
So what I'm searching for is a library that allows me to access the files within an archive. I'd be glad if anybody could point me in the right direction here.
Thanks in advance,
Robin.
Edit: I need the archive to be accessible under Linux and Windows. Sorry I didn't mention that in the beginning.
For zipping, I've always been partial to ZipUtils, which makes the process easy and is built on top of the zlib and info-zip libraries.
The answer depends on whether you plan to modify the archive via code after the archive is initially built.
If you don't need to modify it, you can use TAR - it's a handy and simple format. If you want compression, you can implement a tar.gz reader or find some library that does this (I believe there are some available, including open-source ones).
If your application needs random access to the data or it needs to modify the archive, then regular TAR or ZIP archives are not good. Virtual file system such as our SolFS or CodeBase file system will fit much better: virtual file systems are suited for frequent modifications of the storage, while archives target mainly write-once-read-many usage scenarios.
zlib is highly portable and very widely used. If you can't make sense of the C++ interface, there are alternatives for many other languages - see 'Related External Links' here.
Take another look before you search for something different.
If you're using Qt or Windows you can also pack data into the executable's resource area. You would only have to distribute the executable file using this technique. There's a well defined API already written and tested to access that data.
The zlib API is the way to go. Simple and portable. Look at the unzip.h header for APIs that access archive files. It is in C and very easy.
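A rough sketch of what that looks like with minizip's unzip.h (shipped in zlib's contrib directory; the archive and member names are made up):

    // Extract one file from a zip archive using minizip's unzip.h.
    #include <unzip.h>
    #include <iostream>
    #include <vector>

    int main() {
        unzFile zip = unzOpen("data.zip");                    // hypothetical archive name
        if (!zip) return 1;

        if (unzLocateFile(zip, "config/settings.ini", 0) != UNZ_OK) {  // 0 = OS-default case sensitivity
            std::cerr << "member not found\n"; return 1;
        }

        unz_file_info info;
        unzGetCurrentFileInfo(zip, &info, nullptr, 0, nullptr, 0, nullptr, 0);

        std::vector<char> buffer(info.uncompressed_size);
        unzOpenCurrentFile(zip);
        unzReadCurrentFile(zip, buffer.data(), buffer.size());  // decompressed straight into memory
        unzCloseCurrentFile(zip);
        unzClose(zip);

        std::cout << "read " << buffer.size() << " bytes\n";
        return 0;
    }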
If the files are small, you can dump them into string literals (search for the bin2h utility) and include them in your project, then change the code that reads the files. If all files are currently read using the ifstream class, simply switch it to the istringstream class and recompile the code.
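For example, something along these lines (the symbol names are made up; a bin2h-style tool would generate the header for you):

    // Read embedded data through istringstream instead of opening a file with ifstream.
    #include <cstddef>
    #include <iostream>
    #include <sstream>
    #include <string>

    // What a bin2h-style tool would generate into, say, "levels_data.h":
    static const char kLevel1[] = "3 rooms\n12 monsters\n";
    static const std::size_t kLevel1Size = sizeof(kLevel1) - 1;   // drop the trailing '\0'

    int main() {
        // Previously: std::ifstream in("level1.txt");
        std::istringstream in(std::string(kLevel1, kLevel1Size));

        std::string line;
        while (std::getline(in, line))
            std::cout << line << "\n";
        return 0;
    }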
Try using Quazip - it's quite simple to use. You can use it as a stream from which you read the compressed file on the fly.

WAV compression help

How do you programmatically compress a WAV file to another format (PCM, 11,025 Hz sampling rate, etc.)?
I'd look into Audacity... I'm pretty sure they don't have a command line utility that can do it, but they may have a library...
Update:
It looks like they use libsndfile, which is released under the LGPL. I, for one, would probably just try using that.
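If you go the libsndfile route, a minimal re-encoding sketch looks roughly like this; note that libsndfile changes the sample encoding and container but does not resample, so changing the sample rate itself needs a separate resampler (the file names and the ADPCM target format are just illustrative):

    // Re-encode a WAV file with libsndfile (link with -lsndfile).
    // This changes the sample encoding, not the sample rate.
    #include <sndfile.h>
    #include <iostream>
    #include <vector>

    int main() {
        SF_INFO in_info = {};                                     // must be zeroed before opening for read
        SNDFILE* in = sf_open("input.wav", SFM_READ, &in_info);   // hypothetical file names
        if (!in) { std::cerr << sf_strerror(nullptr) << "\n"; return 1; }

        SF_INFO out_info = in_info;
        out_info.format = SF_FORMAT_WAV | SF_FORMAT_IMA_ADPCM;    // e.g. 4-bit ADPCM inside a WAV
        SNDFILE* out = sf_open("output.wav", SFM_WRITE, &out_info);
        if (!out) { std::cerr << sf_strerror(nullptr) << "\n"; return 1; }

        std::vector<short> buffer(4096 * in_info.channels);
        sf_count_t frames;
        while ((frames = sf_readf_short(in, buffer.data(), 4096)) > 0)
            sf_writef_short(out, buffer.data(), frames);          // copy frames into the new encoding

        sf_close(in);
        sf_close(out);
        return 0;
    }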
Use SoX (Sound eXchange: the universal sound sample translator) on Linux:
SoX is a command line program that can convert most popular audio files to most other popular audio file formats. It can optionally change the audio sample data type and apply one or more sound effects to the file during this translation.
If you mean how do you compress the PCM data to a different audio format then there are a variety of libraries you can use to do this, depending on the platform(s) that you want to support. If you just want to change the sample rate of the PCM data then you need a sample rate conversion algorithm instead, which is a completely different problem. Can you be more specific in your requirements?
You're asking about resampling, and more specifically downsampling, not compression. While both processes are lossy (meaning that you will suffer loss of information), downsampling works on raw samples instead of in the frequency domain.
If you are interested in doing compression, then you should look into the LAME or Ogg Vorbis libraries; you are no doubt familiar with MP3 and Ogg technology, though I have a feeling from your question that you are interested in getting back a PCM file with a lower sampling rate.
In that case, you need a resampling library, of which there are a few possibilities. The most widely known is libsamplerate, which I honestly would not recommend due to quality issues, not only with the generated audio files but also with the stability of the code used in the library itself. The other non-commercial possibility is sox, as a few others have mentioned. Depending on the nature of your program, you can either exec sox as a separate process, or you can call it from your own code by using it as a library. I personally have not tried this approach, but I'm working on a product now where we use sox (for upsampling, actually), and we're quite happy with the results.
The other option is to write your own sample rate conversion library, which can be a significant undertaking. But if you are only interested in converting by an integer factor (i.e., from 44.1 kHz to 22.05 kHz, or from 44.1 kHz to 11.025 kHz), then it is actually very easy, since you only need to keep every Nth sample.
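A sketch of that integer-factor approach (plain decimation; a real implementation should low-pass filter first so that frequencies above the new Nyquist limit don't alias):

    // Downsample 16-bit mono PCM by an integer factor by keeping every Nth sample.
    // Naive decimation: without a low-pass filter first, high frequencies will alias,
    // so treat this as a starting point only.
    #include <cstddef>
    #include <vector>

    std::vector<short> decimate(const std::vector<short>& samples, std::size_t factor) {
        std::vector<short> out;
        out.reserve(samples.size() / factor + 1);
        for (std::size_t i = 0; i < samples.size(); i += factor)
            out.push_back(samples[i]);            // keep one sample out of every 'factor'
        return out;
    }

    // Example: 44,100 Hz -> 11,025 Hz is a factor of 4:
    // std::vector<short> at11k = decimate(at44k, 4);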
In Windows, you can make use of the Audio Compression Manager to convert between files (the acm... functions). You will also need a working knowledge of the WAVEFORMAT structure, and WAV file formats. Unfortunately, to write all this yourself will take some time, which is why it may be a good idea to investigate some of the open source options suggested by others.
I have written my own open source .NET audio library called NAudio that can convert WAV files from one format to another, making use of the ACM codecs that are installed on your machine. I know you have tagged this question with C++, but if .NET is acceptable then this may save you some time. Have a look at the NAudioDemo project for an example of converting files.

What compression/archive formats support inter-file compression?

This question on archiving PDFs got me wondering: if I wanted to compress (for archival purposes) lots of files which are essentially small changes made on top of a master template (a letterhead), it seems like huge compression gains could be had with inter-file compression.
Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats focus on compressing each single file.
Several formats do inter-file compression.
The oldest example is .tar.gz; a .tar has no compression but concatenates all the files together, with headers before each file, and a .gz can compress only one file. Both are applied in sequence, and it's a traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.
More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.
Take a look at Google's open-vcdiff.
http://code.google.com/p/open-vcdiff/
It is designed for calculating small compressed deltas and implements RFC 3284.
http://www.ietf.org/rfc/rfc3284.txt
Microsoft has an API for doing something similar, sans any semblance of a standard.
In general the algorithms you are looking for are ones based on Bentley/McIlroy:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470
In particular these algorithms will be a win if the size of the template is larger than the window size (~32k) used by gzip or the block size (100-900k) used by bzip2.
They are used by Google internally inside their BigTable implementation to store compressed web pages, for much the same reason you are seeking them.
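For a feel of what this looks like in practice, here is a rough open-vcdiff sketch that delta-encodes a document against the master template used as the dictionary (the class and header names follow the project's README, so treat the exact signatures as an assumption):

    // Delta-encode a document against a shared "template" dictionary with open-vcdiff.
    // The delta only records the differences, so near-duplicates shrink dramatically.
    #include <google/vcencoder.h>
    #include <google/vcdecoder.h>
    #include <iostream>
    #include <string>

    int main() {
        std::string dictionary = "...the letterhead / master template bytes...";  // placeholder
        std::string document   = "...a letter based on that template...";         // placeholder

        // Encode: store only the (small) delta against the dictionary.
        std::string delta;
        open_vcdiff::VCDiffEncoder encoder(dictionary.data(), dictionary.size());
        encoder.Encode(document.data(), document.size(), &delta);
        std::cout << "delta is " << delta.size() << " bytes\n";

        // Decode: reconstruct the full document from dictionary + delta.
        std::string restored;
        open_vcdiff::VCDiffDecoder decoder;
        decoder.Decode(dictionary.data(), dictionary.size(), delta, &restored);
        return 0;
    }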
Since the LZ-style compression that pretty much all of these formats use involves building a table of repeated sequences as you go along, such a scheme as you desire would limit you to having to decompress the entire archive at once.
If this is acceptable in your situation, it may be simpler to implement a method which just joins your files into one big file before compression.