Embedding compressed files into a c++ program


I want to create a cross-platform installer in C++. It can use any compression type, e.g. zip or gzip, with the archive embedded inside the program itself like an average installer. I don't want to make many changes for different platforms (Linux and Windows). How do I embed files in a C++ program and extract them again, cross-platform?

C++ is a poor choice for a cross-platform installer, because there's no such thing as cross-platform machine code.
C++ code can be extremely portable, but it needs to be compiled for each platform, and then you get a distinct output executable for each platform.
If you want to build installers for many platforms from a single source file, you can use C++. But if you want to build ONE installer that works on many platforms, you'll need to use an interpreted or JIT-compiled language with runtime support available on all your targets. Of those, the only one likely to already be installed on a majority of computers of each platform is Java.
Ok, assuming that you're building many single-platform installers from machine code, this is what is needed:
You need to get the compressed code into the program. You want to do this in a way that doesn't affect the load time badly, nor cause compilation to take a few months. So using an initialized global array is a bad idea.
One way is to link your data in as an additional section. There are tools to help with that, e.g. a binary-to-COFF converter, and I've seen an ELF version as well. But this might still cause the runtime library to try to load the entire file into memory before execution begins.
Another way is to use platform-specific resource APIs. This is efficient, but platform specific.
The most straightforward solution is to simply append the compressed archive to your executable, then append eight more bytes holding the file offset where the compressed archive begins. Unpacking is then as simple as opening the executable in read-only mode, calling fseek(f, -8, SEEK_END), reading the offset, then seeking to the beginning of the compressed data and passing that stream to your decompressor.
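A minimal sketch of the reading side of this append-and-trailer layout. The helper name findArchiveOffset and the choice of a little-endian trailer are assumptions made for illustration, not something the answer specifies. It uses iostreams instead of fseek so the 64-bit offset does not have to fit in a long.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: finds where the appended archive starts.
// Assumes the last 8 bytes of the file hold that offset as a
// little-endian unsigned integer (an assumed convention).
bool findArchiveOffset(const std::string& exePath, std::uint64_t& offset) {
    std::ifstream in(exePath, std::ios::binary);
    if (!in) return false;

    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    if (size < 8) return false;  // too small to contain a trailer

    // Read the 8-byte trailer at the very end of the file.
    unsigned char trailer[8];
    in.seekg(size - 8, std::ios::beg);
    if (!in.read(reinterpret_cast<char*>(trailer), 8)) return false;

    // Assemble the value byte by byte so host endianness does not matter.
    std::uint64_t value = 0;
    for (int i = 7; i >= 0; --i) value = (value << 8) | trailer[i];

    // The archive has to start somewhere before the trailer.
    if (value > static_cast<std::uint64_t>(size) - 8) return false;
    offset = value;
    return true;
}
```

With the offset in hand, the archive occupies the bytes from offset up to (file size - 8), and that range is what gets handed to the decompressor.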
Of course, now I find a website listing pretty much the same methods.
And here's a program which implements the last option, with additional ability to store multiple files. I wouldn't recommend doing that, let the compression library take care of storing the file metadata.

The only way I know of to portably embed data (strings or raw, binary
data) in a C++ program is to convert it into a data table, then compile
that. For raw data, this would look something like:
unsigned char data[] =
{
// raw data here.
};
It should be fairly trivial to write a small program which reads your
binary data, and writes it out as a C++ table, like the above. Compile
it, and link it into your program, and there you are.

Use zlib.
Have your packing program generate a list of the executables embedded in the program, i.e.
unsigned char x86_windows_version[] = { 0xff,...,0xff};
unsigned char arm_linux_version[] = { 0xff,...,0xff};
unsigned char* binary_files[MAX_BINARIES] = {x86_windows_version,arm_linux_version};
somewhere in your executable:
inflate(x86_windows_version);  // shorthand, not the real signature
And that's about it. Look at the zlib docs for the parameters of inflate() and deflate(); the real inflate() takes a z_stream structure that points at the input and output buffers, plus a flush argument.
It's a pattern used a lot on embedded platforms (that are not Linux), mostly for string tables and other image binaries. It should work for your needs.

Related

How to cross-platformly build binary resources into program?

We are developing C++ applications for Windows, Mac and Linux. The applications have many binary resources, and for some reason we need to pack them within the executable binary instead of laying them out in directories (or an Apple app bundle).
Currently, we use a script to convert these resources into C++ array constants, then compile and link them. However, this approach has several deficiencies:
You have to compile the resource source code, which takes time and is essentially unnecessary.
The resource source code is parsed by the IDE. As the files are large, code analysis is greatly slowed down.
MSVC has a limit on source code size, so large resources (several MB) must be split into many parts and then concatenated at run time.
After some study, I found some solutions:
In Windows, I can use .rc files and related WinAPI.
In Linux, I can directly convert arbitrary binary file into obj file via objcopy.
However, there are still some questions remaining:
Using WinAPI to fetch resources takes many function calls to access one resource. Is there a simpler way on Windows?
How to do it in Mac?

A quite common trick, most notably used for self-extracting archives or scripting language to executable compilers, is to append the resources at the end of the executable file.
Windows:
copy app.exe+all-resources app-with-resources.exe
Linux:
cp executable executable-with-resources
cat all-resources >>executable-with-resources
Then you can read your own executable using fopen(argv[0]) for example.
In order to jump at the correct position, i.e. beginning of resources, a possible solution is to store the size of the executable without resources as the last word of the file.
FILE* fp = fopen(argv[0], "rb");
fseek(fp, -(long)sizeof(int), SEEK_END);  /* cast first: sizeof is unsigned */
int beginResourcesOffset;
fread(&beginResourcesOffset, sizeof(int), 1, fp);
fseek(fp, beginResourcesOffset, SEEK_SET);
Be careful with this solution though: antivirus software on Windows sometimes doesn't like it. There are probably better solutions.

How to save a c++ readable .mat file

I am running DCT code in MATLAB and I would like to read the compressed file (.mat) into C code. However, I am not sure this is the right approach. I have not yet finished my code, but I would like an explanation of how to create a C++-readable file from my .mat file.
I am kind of confused when it comes to .mat, .txt, and the binary or float details of files. Could someone please explain this to me?

It seems that you have a lot of options here, depending on your exact needs, time, and skill level (in both Matlab and C++). The obvious ones are:
ASCII files
You can generate ASCII files in Matlab either using the save(filename, variablename, '-ascii') syntax, or you can create a more custom format using C-style fprintf commands. Then, within a C or C++ program, the files are read using fscanf.
This is often easiest, and good enough in many cases. The fact that a human can read the files using notepad++, emacs, etc. is a nice sanity check, (although this is often overrated).
There are two big downsides. First, the files are very large (an 8 byte double number requires about 19 bytes to store in ASCII). Second, you have to be very careful to minimize the inevitable loss of precision.
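On the C++ side, reading such a file could look like the sketch below. It uses iostreams rather than fscanf, and it assumes the file holds whitespace-separated numbers with one matrix row per line; the function name readAsciiMatrix is made up for the example.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads whitespace-separated numbers, one matrix row per text line.
// Blank lines are skipped; parsing of a line stops at the first token
// that is not a number.
std::vector<std::vector<double>> readAsciiMatrix(std::istream& in) {
    std::vector<std::vector<double>> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<double> row;
        double value;
        while (fields >> value) row.push_back(value);
        if (!row.empty()) rows.push_back(row);
    }
    return rows;
}
```

Passing a std::ifstream opened on the exported text file works the same way as the string stream used for testing.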
Bytes-on-a-disk
For a simple array of numbers (for example, a 32-by-32 array of doubles) you can simply use the fwrite Matlab function to write the array to a disk. Then within C/C++ use the parallel fread function.
This has no loss of precision, is pretty fast, and relatively small size on disk.
The downside with this approach is that complex Matlab structures cannot necessarily be saved.
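A sketch of the reading side, assuming the array was written with something like fwrite(fid, A, 'double') on a machine with the same byte order. Matlab writes the elements in column order, so the indexing helper below is column-major. The names readRawMatrix and elementAt are illustrative, and the dimensions have to be known to the reader because the raw file does not store them.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Reads rows*cols raw doubles from a file. Assumes writer and reader use
// the same byte order and the same 8-byte IEEE-754 double representation.
bool readRawMatrix(const std::string& path, std::size_t rows, std::size_t cols,
                   std::vector<double>& data) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    data.resize(rows * cols);
    const std::streamsize wanted =
        static_cast<std::streamsize>(data.size() * sizeof(double));
    in.read(reinterpret_cast<char*>(data.data()), wanted);
    return in.gcount() == wanted;  // false if the file was too short
}

// Column-major access: element (r, c) of a matrix with `rows` rows.
inline double elementAt(const std::vector<double>& data, std::size_t rows,
                        std::size_t r, std::size_t c) {
    return data[c * rows + r];
}
```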
Mathworks provided C library
Since this is a pretty common problem, the Mathworks has actually solved this by a direct C implementation of the functions needed to read/write to *.mat files. I have not used this particular library, but generally the libraries they provide are pretty easy to integrate. Some documentation can be found starting here: http://www.mathworks.com/help/matlab/read-and-write-matlab-mat-files-in-c-c-and-fortran.html
This should be a pretty robust solution, and relatively insensitive to changes, since it is part of the mainstream, supported Matlab toolset.
HDF5 based *.mat file
With recent versions of Matlab, you can use the notation save(filename, variablename, '-v7.3'); to force Matlab to save the file in an HDF5 based format. Then you can use tools from the HDF5 group to handle the file. Note a decent, java-based GUI viewer (http://www.hdfgroup.org/hdf-java-html/hdfview/index.html#download_hdfview) and libraries for C, C++ and Fortran.
This is a non-fragile method to store binary data. It is also a bit of work to get the libraries working in your code.
One downside is that the Mathworks may change the details of how they map Matlab data types into the HDF5 file. If you really want to be robust, you may want to try ...
Custom HDF5 file
Instead of just taking whatever format the Mathworks decides to use, it's not that hard to create an HDF5 file directly and push data into it from Matlab. This lets you control things like compression, chunk sizing, dataset hierarchy and names. It also insulates you from any future changes in the default *.mat file format. See the h5write command in Matlab.
It is still a bit of effort to get running from the C/C++ end, so I would only go down this path if your project warranted it.

.mat is a format specific to MATLAB itself.
What you can do is to load your .mat file in the MATLAB workspace:
load file.mat
Then use fopen and fprintf to write the data to file.txt and then you can read the content of that file in C.

You can also use MATLAB's dlmwrite to write a delimited ASCII file, which will be easy to read in C (and human-readable too), although it will not be as compact, if size is core to the issue.

Adding to what has already been mentioned you can save your data from MATLAB using -ascii.
save x.mat x
Becomes:
save x.txt x -ascii

how to encrypt data files in C++?

I am writing an app in C++, and currently my app reads models and parameters from several data files. Those files, e.g. a self-defined dictionary, are stored in plain text and loaded dynamically by the C++ code at runtime.
However, I don't want those files to be easily readable by my clients once they get the released application, so I need to encrypt the files first. What's the general practice for this situation?
Those files are huge, so compiling them into a resource file is not a good option.
Actually, I just need simple 'encryption', so that the released version at least does not store plain text. I also don't want an encryption library that loads the whole file into memory in order to decrypt it, since the files are huge and there is no need to load an entire file into memory at one time.
Thanks!

Usually, when you want to deal with encryption in C++, people tend to go for the OpenSSL libraries, which encompass all of the functionality in a pretty standard way.
You'd have to get yourself a copy of the library and some code samples, but it's a pretty common thing and there's lots of documentation around.

Simple API for random access into a compressed data file

Please recommend a technology suitable for the following task.
I have a rather big (500MB) data chunk, which is basically a matrix of numbers. The data entropy is low (it should be well-compressible) and the storage is expensive where it sits.
What I am looking for, is to compress it with a good compression algorithm (Like, say, GZip) with markers that would enable very occasional random access. Random access as in "read byte from location [64bit address] in the original (uncompressed) stream". This is a little different than the classic deflator libraries like ZLIB, which would let you decompress the stream continuously. What I would like, is have the random access at latency of, say, as much as 1MB of decompression work per byte read.
Of course, I hope to use existing library rather than reinvent the NIH wheel.

If you're working in Java, I just published a library for that: http://code.google.com/p/jzran.

Byte Pair Encoding allows random access to data.
You won't get as good compression with it, but you're sacrificing adaptive (variable) hash trees for a single tree, so you can access it.
However, you'll still need some kind of index in order to find a particular "byte". Since you're okay with 1 MB of latency, you'll be creating an index for every 1 MB. Hopefully you can figure out a way to make your index small enough to still benefit from the compression.
One of the benefits of this method is random access editing too. You can update, delete, and insert data in relatively small chunks.
If it's accessed rarely, you could compress the index with gzip and decode it when needed.

If you want to minimize the work involved, I'd just break the data into 1 MB (or whatever) chunks, then put the pieces into a PKZIP archive. You'd then need a tiny bit of front-end code to take a file offset, and divide by 1M to get the right file to decompress (and, obviously, use the remainder to get to the right offset in that file).
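The front-end arithmetic is small enough to sketch directly. The struct and function names here are made up for the example, and the 1 MB chunk size is just the figure from the question.

```cpp
#include <cstdint>

// Where a byte of the original stream lives after chunking.
struct ChunkLocation {
    std::uint64_t chunkIndex;     // which archive member to decompress
    std::uint64_t offsetInChunk;  // position inside the decompressed chunk
};

constexpr std::uint64_t kChunkSize = 1024 * 1024;  // 1 MB, as in the question

// Maps an offset in the uncompressed data to (chunk, offset within chunk).
ChunkLocation locate(std::uint64_t offset,
                     std::uint64_t chunkSize = kChunkSize) {
    return ChunkLocation{offset / chunkSize, offset % chunkSize};
}
```

The chunk index then selects the member of the archive to extract, for instance by formatting it into the member's file name.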
Edit: Yes, there is existing code to handle this. Recent versions of Info-zip's unzip (6.0 is current) include api.c. Among other things, that includes UzpUnzipToMemory -- you pass it the name of a ZIP file, and the name of one of the files in that archive that you want to retrieve. You then get a buffer holding the contents of that file. For updating, you'll need the api.c from zip3.0, using ZpInit and ZpArchive (though these aren't quite as simple to use as the unzip side).
Alternatively, you can just run a copy of zip/unzip in the background to do the work. This isn't quite as neat, but undoubtedly a bit simpler to implement (as well as allowing you to switch formats pretty easily if you choose).

Take a look at my project - csio. I think it is exactly what you are looking for: stdio-like interface and multithreaded compressor included.
It is a library, written in C, which provides a CFILE structure and the functions cfopen, cfseek, cftello, and others. You can use it with regular (uncompressed) files and with files compressed with the dzip utility. This utility is included in the project and is written in C++. It produces a valid gzip archive, which can be handled by standard utilities as well as by csio. dzip can compress using many threads (see the -j option), so it can compress very big files very quickly.
Typical usage:
dzip -j4 myfile
...
CFILE file = cfopen("myfile.dz", "r");
off_t some_offset = 673820;
cfseek(file, some_offset);
char buf[100];
cfread(buf, 100, 1, file);
cfclose(file);
It is MIT licensed, so you can use it in your projects without restrictions. For more information visit project page on github: https://github.com/hoxnox/csio

Compression algorithms usually work in blocks, I think, so you might be able to come up with something based on the block size.

I would recommend using the Boost Iostreams Library. Boost.Iostreams can be used to create streams to access TCP connections or as a framework for cryptography and data compression. The library includes components for accessing memory-mapped files, for file access using operating system file descriptors, for code conversion, for text filtering with regular expressions, for line-ending conversion and for compression and decompression in the zlib, gzip and bzip2 formats.
Boost is cross-platform. Note that Boost.Iostreams is not part of the C++ standard library: the std::tr2::sys namespace covers only the filesystem library that grew out of Boost.Filesystem, so Boost itself has to be installed.
See the Boost Releases page and the Boost Getting Started Guide.
NOTE: Only some parts of boost::iostreams are header-only and require no separately compiled library binaries or special treatment when linking; the compression filters are not among them.

Sort the big file first.
Divide it into chunks of your desired size (1 MB) with a sequence number in the name (File_01, File_02, ..., File_NN).
Take the first ID from each chunk plus the file name and put both into another file (the index).
Compress the chunks.
You will then be able to search the index file using whatever method you wish, for example a binary search, and open each chunk file as you need it.
If you need deeper indexing, you could use a B-tree in which the "pages" are the files.
Several implementations exist on the web, because the code is a little tricky.
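The lookup in the index file can be sketched with a binary search from the standard library. The IndexEntry layout and the function name findChunk are assumptions for illustration; loading the index from disk and decompressing the chosen chunk are not shown.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// One line of the index file: the first ID stored in a chunk and the
// name of the (compressed) chunk file.
struct IndexEntry {
    std::uint64_t firstId;
    std::string fileName;
};

// Returns the entry of the only chunk that can contain `id`, or nullptr
// if `id` is smaller than every first ID. `index` must be sorted by firstId.
const IndexEntry* findChunk(const std::vector<IndexEntry>& index,
                            std::uint64_t id) {
    // First entry whose firstId is greater than id; the chunk before it
    // is the candidate.
    auto it = std::upper_bound(
        index.begin(), index.end(), id,
        [](std::uint64_t value, const IndexEntry& e) { return value < e.firstId; });
    if (it == index.begin()) return nullptr;
    return &*(it - 1);
}
```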

You could use bzip2 and make your own API pretty easily based on James Taylor's seek-bzip2.

Unpacking an executable from within a library in C/C++

I am developing a library that uses one or more helper executable in the course of doing business. My current implementation requires that the user have the helper executable installed on the system in a known location. For the library to function properly the helper app must be in the correct location and be the correct version.
I would like to remove the requirement that the system be configured in the above manner.
Is there a way to bundle the helper executable in the library such that it could be unpacked at runtime, installed in a temporary directory, and used for the duration of one run? At the end of the run the temporary executable could be removed.
I have considered automatically generating a file containing an unsigned char array that holds the bytes of the executable. This would be done at compile time as part of the build process. At runtime this array would be written to a file, thus creating the executable.
Would it be possible to do such a task without writing the executable to a disk (perhaps some sort of RAM disk)? I could envision certain virus scanners and other security software objecting to such an operation. Are there other concerns I should be worried about?
The library is being developed in C/C++ for cross platform use on Windows and Linux.

"A clever person solves a problem. A
wise person avoids it." — Albert Einstein
In the spirit of this quote I recommend that you simply bundle this executable along with the end-application.
Just my 2 cents.

You can use xxd to convert a binary file to a C header file.
$ echo -en "\001\002\005" > x.binary
$ xxd -i x.binary
unsigned char x_binary[] = {
0x01, 0x02, 0x05
};
unsigned int x_binary_len = 3;
xxd is pretty standard on *nix systems, and it's available on Windows with Cygwin or MinGW, or Vim includes it in the standard installer as well. This is an extremely cross-platform way to include binary data into compiled code.
Another approach is to use objcopy to append data on to the end of an executable -- IIRC you can obtain objcopy and use it for PEs on Windows.
One approach I like a little better than that is to just append raw data straight onto the end of your executable file. In the executable, you seek to the end of the file, and read in a number, indicating the size of the attached binary data. Then you seek backwards that many bytes, and fread that data and copy it out to the filesystem, where you could treat it as an executable file. This is incidentally the way that many, if not all, self-extracting executables are created.
If you append the binary data, it works with both Windows PE files and *nix ELF files -- neither of them read past the "limit" of the executable.
Of course, if you need to append multiple files, you can either append a tar/zip file to your exe, or you'll need a slightly more advanced data structure to read what's been appended.
You'll also probably want to UPX your executables before you append them.
You might also be interested in the LZO library, which is reportedly one of the fastest-decompressing compression libraries. They have a MiniLZO library that you can use for a very lightweight decompressor. However, the LZO libraries are GPL licensed, so that might mean you can't include it in your source code unless your code is GPLed as well. On the other hand, there are commercial licenses available.

A slightly different approach from using an unsigned char array is to put the entire executable binary in as a resource of the DLL. At runtime, you can save the binary data as a local temp file and execute the app. I'm not sure if there is a way to execute an executable in memory, though.
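The save-to-a-temp-file step can be written portably with C++17's std::filesystem, however the bytes were embedded. This is only a sketch: the function name unpackToTemp is hypothetical, the execute permission bit is only meaningful on POSIX-style systems, and a fixed file name in a shared temp directory is predictable, so real code would want a unique name.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: writes embedded bytes to a file in the system temp
// directory and marks it executable for the owner.
// Returns the path of the new file, or an empty path on failure.
fs::path unpackToTemp(const unsigned char* data, std::size_t size,
                      const std::string& fileName) {
    std::error_code ec;
    const fs::path dir = fs::temp_directory_path(ec);
    if (ec) return {};
    const fs::path target = dir / fileName;
    {
        std::ofstream out(target, std::ios::binary | std::ios::trunc);
        out.write(reinterpret_cast<const char*>(data),
                  static_cast<std::streamsize>(size));
        if (!out) return {};
    }  // the stream is closed here, before permissions are changed
    fs::permissions(target, fs::perms::owner_exec, fs::perm_options::add, ec);
    if (ec) return {};
    return target;
}
```

Once the helper has finished running, fs::remove on the returned path cleans the file up again.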

"For the library to function properly the helper app must be in the correct location"
On Windows, would that be the Program Files directory or System32 directory?
This might be a problem. When an application is installed, particularly in a corporate environment, it usually happens in a context with administrative rights. On Vista and later with UAC enabled (the default), this is necessary to write to certain directories. And most Unix flavours have had sensible restrictions like that for as long as anyone can remember.
So if you try to do it at the time the host application calls into your library, that might not be in a context with sufficient rights to install the files, and so your library would put constraints on the host application.
(Another thing that will be ruled out is Registry changes, or config file updates on the various Unices, if the host application doesn't have the ability to elevate the process to an administrative level.)
Having said all that, you say you're considering unpacking the helpers into a temporary directory, so maybe this is all moot.

Qt has an excellent method of achieving this: QResource
"The Qt resource system is a platform-independent mechanism for storing binary files in the application's executable."
You don't say if you are currently using Qt, but you do say "C++ for cross platform use on Windows and Linux", so even if you aren't using it, you may want to consider starting.

There is a way in Windows to run an executable from within memory without writing it to disk. The problem is that due to modern security systems (DEP) this probably won't work on all systems and almost any anti-malware scanner will detect it and warn the user.
My advice is to simply package the executable into your distribution, it's certainly the most reliable way to achieve this.

Well, my first thought would be: what does this helper executable do that couldn't be done within your library's code itself, perhaps using a secondary thread if necessary. This might be something to consider.
But as for the actual question... If your "library" is actually bundled up as a dll (or even an exe) then at least Windows has relatively simple support for embedding files within your library.
The resource mechanism that allows things like version information and icons to be embedded within executables can also allow arbitrary chunks of data. Since I don't know what development environment you're using, I can't say exactly how to do this. But roughly speaking, you'd need to create a custom resource with a type of "FILE" or something sensible like that and point it at the exe you want to embed.
Then, when you want to extract it, you would write something like
HRSRC hResource = FindResource(NULL, MAKEINTRESOURCE(IDR_MY_EMBEDDED_FILE), "FILE");
HGLOBAL hResourceData = LoadResource(NULL, hResource);
LPVOID pData = LockResource(hResourceData);
HANDLE hFile = CreateFile("DestinationPath\\Helper.exe", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
DWORD dwBytesWritten = 0;
WriteFile(hFile, pData, SizeofResource(NULL, hResource), &dwBytesWritten, NULL);
CloseHandle(hFile);
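(filling in your own desired path, filename, and any appropriate error checking of course; note also that passing NULL as the module handle searches the .exe that started the process, so from inside a dll you would pass the dll's own module handle instead, e.g. the one received in DllMain)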
After that, the helper exe exists as a normal exe file and so you can execute it however you normally would.
For removing the file after use, you should investigate the flags for CreateFile, particularly FILE_FLAG_DELETE_ON_CLOSE. You might also look at using MoveFileEx when combining the MOVEFILE_DELAY_UNTIL_REBOOT flag with NULL passed for the new file name. And of course, you could always delete it in your own code if you can tell when the executable has ended.
I don't know enough about Linux executables, so I don't know if a similar feature is available there.
If Linux doesn't provide any convenient mechanism and/or if this idea doesn't suit your needs in Windows, then I suppose your idea of generating an unsigned char array from the contents of the helper exe would be the next best way to embed the exe in your library.
