I have a file that contains an embedded file that is compressed using zlib.
I know where the compressed file starts within the main file and I have the decompressed file size, but don't know where the compressed file ends.
My goal is to extract the compressed file from its containing file and save it as its own file.
Given the start location of the compressed file and the decompressed file size, is there anything that will tell me the compressed file size or where the compressed file ends?
Presently I pass from the zlib header to EOF into a inflate function. This works fine for smaller files as the inflate function will discard any data after the end of the compressed file and only return me the decompressed bytes. However for larger files where the compressed file is located near the top of the file, my system does not have enough memory to pass the majority of the file into the inflate function.
A zlib stream is self-terminating. You find the compressed length by decompressing and seeing how many bytes have been consumed when it finishes.
You mention something about not having enough memory. The main zlib functions process a piece of the stream at a time, which are the inflate* functions for decompression. Read zlib.h.
Related
I'm using zlib for c++.
Quote from
http://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/zlib-gzwrite-1.html regarding gzwrite function:
The gzwrite() function shall write data to the compressed file referenced by file, which shall have been opened in a write mode (see gzopen() and gzdopen()). On entry, buf shall point to a buffer containing len bytes of uncompressed data. The gzwrite() function shall compress this data and write it to file. The gzwrite() function shall return the number of uncompressed bytes actually written.
I interpret this as the return value will NOT tell me how much larger the file became when writing. Only how much data was compressed into the file.
The only way to know how large the file is would then be to close it, and read the size from the file system. I have a requirement to only continue to write to the file until it reaches a certain size. Can this be achieved without closing the file?
A workaround would be to write until the uncompressed size reaches my limit and then close the file, read the size from file system and update my best guess of file size based on that, and then re-open the file and continue writing. This would make me close and open the file a few times towards the end (as I'm approaching the size limit).
Another workaround, which would give more of an estimate (which is not what I want really) would be to write until uncompressed size reaches the limit, close the file, read the file size from the file system and calculate the compression ratio so far. The I can use this compression ratio to calculate a new limit for uncompressed file size where the compression should get me down to the limit for the compressed file size. If I repeat this the estimate would improve, but again, not what I'm looking for.
Are there better options?
Preferred option would be if zlib could tell me the compressed file size while the file is still open. I don't see why this information would not be available inside zlib at this point, since compression happens when I call gzwrite and not when i close the file.
zlib provides the function gzoffset(), which does exactly what you're asking.
If for some reason you are stuck with a version of zlib that is more than about eight years old, when gzoffset() was added, then this is easy to do with gzdopen(). You open the output file with fopen() or open(), and provide the file descriptor (using fileno() and dup() if you used fopen()), and then provide that descriptor to gzdopen(). Then you can use ftell() or lseek() at any time to see how much as been written. Be careful to not try to double-close the descriptor. See the comments for gzdopen().
You can work around this issue by using a pipe. The idea is to write the compressed data into a pipe. After that, you read the data from the other end of the pipe, count it and write it to the actual file.
To set this up you need to first open the file to write to via a simple open. Then create a pipe via pipe2 and initialize zlib by passing one of the pipe descriptors to gzdopen:
int out = open("/path/to/file", O_WRONLY | O_CREAT | O_TRUNC);
int p[2];
pipe2(p, O_NONBLOCK);
gzFile zFile = gzdopen(p[0], "w");
You can now write the data first to the pipe and then splice it from the pipe to the out file:
gzwrite(zFile, buf, 1024); //or any other length
size_t bytesWritten = 0;
do {
bytesWritten = splice(p[1], NULL, out, NULL, 1024, SPLICE_F_NONBLOCK | SPLICE_F_MORE);
} while(bytesWritten == 1024);
As you can see, you now have the bytesWritten to tell you how much data was actually written. Simply sum it up in another variable and stop splicing as soon as you have written as much data as you need to (or just splice it in one go by writing everything to the zFile and the splice once with the amount of data you are allowed to store as the fifth parameter. If you want to not compress uneccessary data, simply do it in chunks as shown above).
A note on splice: Splice is linux specific, and is basically just a very efficient copy. You can always replace it with a simple "read and write" combo, i.e. read data from fd[1] into a buffer and then write the data from that buffer into out - splice is just faster and less code.
I am looking for a way to read a part of a binary zip file (starting position and number of bytes to read). Currently I'm investigating this on Windows, but optimally it would be platform independent. For a normal binary file (unzipped), this can be achieved in the following way:
//Open the file
std::ifstream file (path, std::ios::in | std::ios::binary | std::ios::ate);
//Move to the position to start reading
file.seekg(64);
//Read 128 bytes of the file
std::vector<unsigned char> mDataBuffer;
mDataBuffer.resize( 128 ) ;
file.read( (char*)( &mDataBuffer[0]), 128 ) ;
//Read as string
std::string s_data( mDataBuffer.begin(), mDataBuffer.end());
file.close()
This example is a slightly modified version of this one.
There are also many unzip packages available (e.g. zlib or minizip). Each covering functions to unzip a file. I could simply unzip my zipped file, save it on the disk and read it using the method above.
Unfortunately, I didn't find an example to read only a part of a binary zip file (if that is even possible), straight from the zipped file. Because my file is quite large, I don't want to unzip it completely onto the hard drive. Furthermore, the part that I want to read is quite small, so it would be a waste of cpu time to completely unzip the file. For the same reasons, I also don't want to decompress the complete file into my memory. I am looking for a genuine way to read only a part of a zipped file.
How could this be accomplished in c++?
Apparently there is no general way to seek in zip files. This was according to:
A comment of #πάντα ῥεῖ.
A general thread on searching in zipped files here.
A similar question here (although the question itself is about Python).
I would like to send the contents of an audio file to another system over the network using socket. Both systems run on Windows operating system. Is there a tutorial on some way to store the audio contents into a C++ array or Stringstream datatype, so that it will be easier to send it to a different node.
I basically want to know how to extract data bytes from an audio file.
The easiest thing to do is to simply send the data in chunks of bytes. If you are starting with an audio file, just open it like any other binary file with something like file = fopen(filename, "rb"); (where filename is the name of the audio file). Then enter a loop to read a chunks of bytes until you reach the end of the file. Just use something like bytes_read = fread(buffer, sizeof(char), read_size, file); where buffer should probably be a char array of at least size read_size, which could be, say, 1024. After each fread, you can make your network send call. Alternately, you could read the whole file first and then send it chunk by chunk. Your call. Either way, when you reach the end of the file, send some sort of signal that you have reached the end. The receiving system should take these chunks and call fwrite to create a new audio file. You can either append each chunk as it comes in or buffer it all until you reach the end and then write it all out.
soundfile++ can be used if you have wav files only. Check the readtest and writetest demo programs here http://sig.sapp.org/doc/examples/soundfile/
I am writing a class that compresses binary data using a zlib stream. I have a buffer that I fill with the output stream and once it becomes full I dump the buffer out to a file using fopen(filename, 'ab');... What this means is that my program only opens up the file to write to it whenever it has a buffer full of data to dump, it goes and does it and immediately closes it.
The issue is in my format I use an 8 byte header at the beginning of each file which contains the original length and compressed length but I do not know these values until the end of the whole compression process.
What I wanted to do was write 8 bytes of zeros, then append with all my compressed data, then come back at the end during cleanup to fill in those 8 bytes with the size data, but I can't seem to find a way to open the file without bringing it all back into memory. I just want to edit the first 8 bytes of the file. Do I need to use mmap?
Since you're using the file in append mode, you do need to close and re-open it:
open with fopen(filename, "r+b");
write the 8 bytes;
close the file using fclose().
The r+ means
Open for reading and writing. The stream is positioned at the
beginning of the file.
and the b is needed to open in binary mode.
You can use this method to change the data at any position in the file, not just at the beginning: simply use fseek() to seek to the required position before writing.
Use rewind() to take the file pointer back to the start of the file after you write out the last few bytes of data. You can then output your 8 bytes of length info.
If you have flexibility in changing your format, I might suggest this. Define your compressed stream such that it is a sequence of an unknown number of blocks, and each block is preceded by a fixed length integer specifying the number of bytes in the block. The stream is finished when the next block has a size of zero.
The drawback to this format is that there no way for the reader of the stream to know how much data is coming until it's all been read. But the advantage is that it avoids this problem you are trying to solve.
More importantly, it allows you to send a compressed stream of data somewhere as you read the input and you don't have to save it all before sending it. For example, you could write a compression Unix filter that you could put in a pipe stream:
prog1 | yourprog -compress | rsh host yourprog -expand | prog2
Good luck.
I'm writing a browser plugin, similiar to Flash and Java in that it starts downloading a file (.jar or .swf) as soon as it gets displayed. Java waits (I believe) until the entire jar files is loaded, but Flash does not. I want the same ability, but with a compressed archive file. I would like to access files in the archive as soon as the bytes necessary for their decompression are downloaded.
For example I'm downloading the archive into a memory buffer, and as soon as the first file is possible to decompress, I want to be able to decompress it (also to a memory buffer).
Are there any formats/libraries that support this?
EDIT: If possible, I'd prefer a single file format instead of separate ones for compression and archiving, like gz/bzip2 and tar.
There are 2 issues here
How to write the code.
What format to use.
On the file format, You can't use the .ZIP format because .ZIP puts the table of contents at the end of the file. That means you'd have to download the entire file before you can know what's in it. Zip has headers you can scan for but those headers are not the official list of what's in the file.
Zip explicitly puts the table of contents at the end because it allows fast adding a files.
Assume you have a zip file with contains files 'a', 'b', and 'c'. You want to update 'c'. It's perfectly valid in zip to read the table of contents, append the new c, write a new table of contents pointing to the new 'c' but the old 'c' is still in the file. If you scan for headers you'll end up seeing the old 'c' since it's still in the file.
This feature of appending was an explicit design goal of zip. It comes from the 1980s when a zip could span multiple floppy discs. If you needed to add a file it would suck to have to read all N discs just to re-write the entire zip file. So instead the format just lets you append updated files to the end which means it only needs the last disc. It just reads the old TOC, appends the new files, writes a new TOC.
Gzipped tar files don't have this problem. Tar files are stored header, file, header file, and the compression is on top of that so it's possible to decompress as the file it's downloaded and use the files as they become available. You can create gzipped tar files easily in windows using winrar (commercial) or 7-zip (free) and on linux, osx and cygwin use the tar command.
On the code to write,
O3D does this and is open source so you can look at the code
http://o3d.googlecode.com
The decompression code is in o3d/import/cross/...
It targets the NPAPI using some glue which can be found in o3d/plugin/cross
Check out the boost::zlib filters. They make using zlib a snap.
Here's the sample from the boost docs that will decompress a file and write it to the console:
#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/zlib.hpp>
int main()
{
using namespace std;
ifstream file("hello.z", ios_base::in | ios_base::binary);
filtering_streambuf<input> in;
in.push(zlib_decompressor());
in.push(file);
boost::iostreams::copy(in, cout);
}
Sure, zlib for example uses z_stream for incremental compression and decompression via functions inflateInit, inflate, deflateInit, deflate. libzip2 has similar abilities.
For incremental extraction from the archive (as it gets deflated), look e.g. to the good old tar format.