Read only a part of a binary zip file - C++

I am looking for a way to read a part of a binary zip file (starting position and number of bytes to read). Currently I'm investigating this on Windows, but optimally it would be platform independent. For a normal binary file (unzipped), this can be achieved in the following way:
//Open the file
std::ifstream file (path, std::ios::in | std::ios::binary | std::ios::ate);
//Move to the position to start reading
file.seekg(64);
//Read 128 bytes of the file
std::vector<unsigned char> mDataBuffer;
mDataBuffer.resize( 128 ) ;
file.read( (char*)( &mDataBuffer[0]), 128 ) ;
//Read as string
std::string s_data( mDataBuffer.begin(), mDataBuffer.end());
file.close();
This example is a slightly modified version of this one.
There are also many unzip packages available (e.g. zlib or minizip), each providing functions to unzip a file. I could simply unzip my zipped file, save it to disk and read it using the method above.
Unfortunately, I didn't find an example that reads only a part of a binary zip file (if that is even possible), straight from the zipped file. Because my file is quite large, I don't want to unzip it completely onto the hard drive. Furthermore, the part that I want to read is quite small, so it would be a waste of CPU time to decompress the file completely. For the same reasons, I also don't want to decompress the complete file into memory. I am looking for a genuine way to read only a part of a zipped file.
How could this be accomplished in C++?

Apparently there is no general way to seek in zip files. This was according to:
A comment by @πάντα ῥεῖ.
A general thread on searching in zipped files here.
A similar question here (although the question itself is about Python).
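That said, since minizip is already mentioned as an option, it is at least possible to stream just the one entry you need without extracting anything to disk: the entry still has to be decompressed sequentially up to the requested offset, but everything before it is simply discarded. Below is a minimal sketch under that assumption, using the minizip unzip.h API with hypothetical archive and entry names:
#include <cstdio>
#include <vector>
#include "unzip.h" // from zlib's contrib/minizip

int main()
{
    unzFile archive = unzOpen("archive.zip");                  // hypothetical archive name
    if (!archive) return 1;

    if (unzLocateFile(archive, "data.bin", 0) != UNZ_OK ||     // hypothetical entry name
        unzOpenCurrentFile(archive) != UNZ_OK)
        return 1;

    // Skip the first 64 uncompressed bytes: decompressed sequentially, then discarded.
    char skip[64];
    unzReadCurrentFile(archive, skip, sizeof skip);

    // Read the 128 bytes of interest (short-read/error handling omitted in this sketch).
    std::vector<unsigned char> mDataBuffer(128);
    int got = unzReadCurrentFile(archive, mDataBuffer.data(), (unsigned)mDataBuffer.size());
    std::printf("read %d bytes\n", got);

    unzCloseCurrentFile(archive);
    unzClose(archive);
    return 0;
}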

Related

Error while downloading to memory using curl

I want to download a file to memory using curl.
I am currently using this and it looks like it works to a certain extent but corrupts my file.
I have used sigbench and the result is around 20% different (comparing the original and the downloaded file).
The file I want to download is a binary, so it won't work after it has been modified.
I am currently testing with the x86 version of this.
original binary
downloaded binary
This is the code I am using to write it to a file:
ofstream stream = ofstream("test.dll");
stream.write(chunk.memory, chunk.size);
Opening the file like this:
ofstream stream = ofstream("test.dll");
will cause line-end characters to be adjusted to match your target system.
You should instead open the file in binary mode:
ofstream stream = ofstream("test.dll", std::ios::binary);
This will leave the characters that could be interpreted as line-endings unchanged.
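For illustration, here is a minimal sketch of writing an in-memory buffer to disk byte-for-byte; the MemoryChunk struct is a hypothetical stand-in for whatever accumulator your curl write callback fills:
#include <fstream>
#include <vector>

struct MemoryChunk {
    std::vector<char> memory;    // assumption: bytes collected by the curl write callback
};

bool save_binary(const MemoryChunk& chunk, const char* path)
{
    // std::ios::binary prevents any newline translation on Windows.
    std::ofstream stream(path, std::ios::binary);
    stream.write(chunk.memory.data(), static_cast<std::streamsize>(chunk.memory.size()));
    return static_cast<bool>(stream);    // false if the open or any write failed
}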
Further reading: https://en.cppreference.com/w/cpp/io/c#Binary_and_text_modes

Knowing current compressed file size using gzwrite (zlib)

I'm using zlib with C++.
Quote from
http://refspecs.linuxbase.org/LSB_3.0.0/LSB-PDA/LSB-PDA/zlib-gzwrite-1.html regarding gzwrite function:
The gzwrite() function shall write data to the compressed file referenced by file, which shall have been opened in a write mode (see gzopen() and gzdopen()). On entry, buf shall point to a buffer containing len bytes of uncompressed data. The gzwrite() function shall compress this data and write it to file. The gzwrite() function shall return the number of uncompressed bytes actually written.
I interpret this as the return value will NOT tell me how much larger the file became when writing. Only how much data was compressed into the file.
The only way to know how large the file is would then be to close it, and read the size from the file system. I have a requirement to only continue to write to the file until it reaches a certain size. Can this be achieved without closing the file?
A workaround would be to write until the uncompressed size reaches my limit and then close the file, read the size from file system and update my best guess of file size based on that, and then re-open the file and continue writing. This would make me close and open the file a few times towards the end (as I'm approaching the size limit).
Another workaround, which would give more of an estimate (which is not what I want really) would be to write until uncompressed size reaches the limit, close the file, read the file size from the file system and calculate the compression ratio so far. The I can use this compression ratio to calculate a new limit for uncompressed file size where the compression should get me down to the limit for the compressed file size. If I repeat this the estimate would improve, but again, not what I'm looking for.
Are there better options?
The preferred option would be if zlib could tell me the compressed file size while the file is still open. I don't see why this information would not be available inside zlib at this point, since compression happens when I call gzwrite and not when I close the file.
zlib provides the function gzoffset(), which does exactly what you're asking.
If for some reason you are stuck with a version of zlib that is more than about eight years old, from before gzoffset() was added, then this is easy to do with gzdopen(). You open the output file with fopen() or open(), and provide the file descriptor (using fileno() and dup() if you used fopen()), and then provide that descriptor to gzdopen(). Then you can use ftell() or lseek() at any time to see how much has been written. Be careful not to double-close the descriptor. See the comments for gzdopen().
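A minimal sketch of the gzoffset() approach (assuming zlib 1.2.4 or newer and a hypothetical 10 MiB limit):
#include <cstdio>
#include <zlib.h>

int main()
{
    const z_off_t limit = 10 * 1024 * 1024;             // hypothetical compressed-size cap
    gzFile out = gzopen("out.gz", "wb");
    if (!out) return 1;

    char buf[8192] = {0};                               // stand-in for the real data
    while (gzoffset(out) < limit) {                     // compressed bytes written so far
        // gzwrite returns the number of *uncompressed* bytes consumed, 0 on error.
        if (gzwrite(out, buf, sizeof buf) <= 0) break;
        // Note: gzoffset() does not count data still buffered inside zlib, so the
        // finished file can end up somewhat larger than the value checked here.
    }
    gzclose(out);
    return 0;
}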
You can work around this issue by using a pipe. The idea is to write the compressed data into a pipe. After that, you read the data from the other end of the pipe, count it and write it to the actual file.
To set this up you need to first open the file to write to with a plain open(). Then create a pipe via pipe2 and initialize zlib by passing the pipe's write end to gzdopen:
int out = open("/path/to/file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
int p[2];
pipe2(p, O_NONBLOCK);
gzFile zFile = gzdopen(p[1], "w"); // p[1] is the write end of the pipe
You can now write the data to the gzFile first and then splice it from the read end of the pipe (p[0]) to the out file:
gzwrite(zFile, buf, 1024); // or any other length
ssize_t bytesWritten = 0;
do {
    bytesWritten = splice(p[0], NULL, out, NULL, 1024, SPLICE_F_NONBLOCK | SPLICE_F_MORE);
} while(bytesWritten == 1024);
As you can see, you now have bytesWritten to tell you how much data was actually written. Simply sum it up in another variable and stop splicing as soon as you have written as much data as you need (or splice it in one go by writing everything to the zFile and then calling splice once with the amount of data you are allowed to store as the fifth parameter; if you want to avoid compressing unnecessary data, simply do it in chunks as shown above).
A note on splice: splice is Linux-specific, and is basically just a very efficient copy. You can always replace it with a simple "read and write" combo, i.e. read data from p[0] into a buffer and then write that buffer to out - splice is just faster and less code.
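For reference, a sketch of that portable read/write replacement (assuming POSIX, with p[0] as the read end of the pipe and out as the destination descriptor, and with error handling kept minimal):
#include <unistd.h>

static size_t drain_pipe(int pipe_read_fd, int out_fd)
{
    char buf[4096];
    size_t total = 0;
    ssize_t n;
    // Because the pipe was opened with O_NONBLOCK, read() returns -1 (EAGAIN)
    // once the pipe is empty, which ends the loop.
    while ((n = read(pipe_read_fd, buf, sizeof buf)) > 0) {
        write(out_fd, buf, (size_t)n);   // short-write handling omitted in this sketch
        total += (size_t)n;
    }
    return total;                        // compressed bytes copied to the file so far
}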

C++ - Missing end of line characters in file read

I am using the C++ streams to read in a bunch of files in a directory and then write them to another directory. Since these files may be of different types, I am using the generic ios::binary flag when reading/writing these files.
Example code below:
std::fstream inf( "ex.txt", std::ios::in | std::ios::binary);
char c;
while( inf >> c ) {
    // writing to another file in binary format
}
The issue I have is that in the case of files containing text, the end of line characters in these text files are not being written to the output file.
Edit: Or at least they do not appear to be; when the newly written file is opened, there is only a single continuous line of characters.
Edit again: The problem (of the continuous string) appears to persist even when the read / write is made in text mode.
Thus, I was wondering if there was a way to check if a file has text or binary and then read/write it appropriately. Else, is there any way to preserve the end of line characters even when opening the file in binary format?
Edit: I am using the g++ 4.8.2 compiler
When you want to manipulate bytes, you need to use the read and write methods, not the >> and << operators.
You can get the intended behavior with inf.flags(inf.flags() & ~std::ios_base::skipws); (i.e. by clearing the skipws flag), though.
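A minimal sketch of the read()/write() approach with hypothetical file names; it copies the input byte for byte, newlines included:
#include <fstream>

int main()
{
    std::ifstream in("ex.txt", std::ios::binary);
    std::ofstream out("copy.txt", std::ios::binary);

    char buf[4096];
    // The extra gcount() check writes out the final partial block after the last read fails.
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        out.write(buf, in.gcount());
    }
}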

How do I read a huge .gz file (more than 5 GB uncompressed) in C

I have some .gz compressed files which are around 5-7 GB uncompressed.
These are flat files.
I've written a program that takes an uncompressed file and reads it line by line, which works perfectly.
Now I want to be able to open the compressed files in memory and run my little program.
I've looked into zlib, but I can't find a good solution.
Loading the entire file is impossible using gzread(gzFile, void *, unsigned), because of the 32-bit unsigned int limitation.
I've tried gzgets, but this almost doubles the execution time compared to reading with gzread (I tested on a 2 GB sample).
I've also looked into "buffering", such as splitting the gzread process into multiple 2 GB chunks, finding the last newline using strrchr, and then calling gzseek.
But gzseek will emulate a total decompression of the file, which is very slow.
I fail to see any sane solution to this problem.
I could always check whether or not the current line actually ends with a newline (which should only happen for the last, partially read line), and then read more data from the point in the program where this occurs.
But this could get very ugly.
Does anyone have any suggestions?
thanks
edit:
I don't need to have the entire file at once, just one line at a time, but I've got a fairly huge machine, so if that were the easiest way I would have no problem with it.
For all those that suggest piping in via stdin: I've experienced extreme slowdowns compared to opening the file. Here is a small code snippet I made some months ago that illustrates it.
time ./a.out 59846/59846.txt
# 59846/59846.txt
18255221
real 0m4.321s
user 0m2.884s
sys 0m1.424s
time ./a.out <59846/59846.txt
18255221
real 1m56.544s
user 1m55.043s
sys 0m1.512s
And the source code
#include <iostream>
#include <fstream>
#include <cstdio>   // for printf
#define LENS 10000
int main(int argc, char **argv){
    std::istream *pFile;
    if(argc==2) // if a filename argument was supplied
        pFile = new std::ifstream(argv[1],std::ios::in);
    else // otherwise use stdin
        pFile = &std::cin;
    char line[LENS];
    if(argc==2) // if we are using a filename, print it.
        printf("#\t%s\n",argv[1]);
    if(!pFile){
        printf("Do you have permission to open the file?\n");
        return 0;
    }
    int numRow=0;
    while(!pFile->eof()) {
        numRow++;
        pFile->getline(line,LENS);
    }
    if(argc==2)
        delete pFile;
    printf("%d\n",numRow);
    return 0;
}
Thanks for your replies; I'm still waiting for the golden apple.
edit2:
Using C-style FILE pointers instead of C++ streams is much, much faster. So I think this is the way to go.
Thanks for all your input.
gzip -cd compressed.gz | yourprogram
Just go ahead and read it line by line from stdin as it is decompressed.
EDIT: Response to your remarks about performance. You're saying that reading STDIN line by line is slow compared to reading an uncompressed file directly. The difference lies in buffering. Normally a pipe yields to STDIN as soon as output becomes available (with no, or very small, buffering there). You can do "buffered block reads" from STDIN and parse the read blocks yourself to gain performance.
You can achieve the same result with possibly better performance by using gzread() as well. (Read a big chunk, parse the chunk, read the next chunk, repeat)
gzread only reads chunks of the file; you loop on it as you would with a normal read() call.
Do you need to read the entire file into memory?
If what you need is to read lines, you'd gzread() a sizable chunk (say 8192 bytes) into a buffer, loop through that buffer to find all '\n' characters, and process those as individual lines. You'd have to save the last piece in case it is just part of a line, and prepend it to the data you read next time.
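A minimal sketch of that chunk-and-split approach, counting lines the way the benchmark above does (hypothetical file name, error handling kept short):
#include <cstdio>
#include <string>
#include <zlib.h>

int main()
{
    gzFile f = gzopen("bigfile.gz", "rb");        // hypothetical file name
    if (!f) return 1;

    std::string carry;                            // unfinished line from the previous chunk
    char buf[8192];
    int n;
    long lines = 0;
    while ((n = gzread(f, buf, sizeof buf)) > 0) {
        carry.append(buf, (size_t)n);
        size_t start = 0, nl;
        while ((nl = carry.find('\n', start)) != std::string::npos) {
            // carry.substr(start, nl - start) is one complete line; process/count it here.
            ++lines;
            start = nl + 1;
        }
        carry.erase(0, start);                    // keep only the unfinished tail
    }
    if (!carry.empty()) ++lines;                  // last line without a trailing newline
    gzclose(f);
    std::printf("%ld\n", lines);
    return 0;
}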
You could also read from stdin and invoke your app like
zcat bigfile.gz | ./yourprogram
in which case you can use fgets and similar on stdin. This is also beneficial in that you'd run the decompression on one processor and process the data on another processor :-)
I don't know if this will be an answer to your question, but I believe it's more than a comment:
Some months ago I discovered that the contents of Wikipedia can be downloaded in much the same way as the StackOverflow data dump. Both decompress to XML.
I came across a description of how the multi-gigabyte compressed dump file could be parsed. It was done by Perl scripts, actually, but the relevant part for you was that Bzip2 compression was used.
Bzip2 is a block compression scheme, so the compressed file can be split into manageable pieces and each part decompressed individually.
Unfortunately, I don't have a link to share with you, and I can't suggest how you would search for it, except to say that it was described on a Wikipedia 'data dump' or 'blog' page.
EDIT: Actually, I do have a link

Decompression and extraction of files from streaming archive on the fly

I'm writing a browser plugin, similar to Flash and Java in that it starts downloading a file (.jar or .swf) as soon as it gets displayed. Java waits (I believe) until the entire jar file is loaded, but Flash does not. I want the same ability, but with a compressed archive file. I would like to access files in the archive as soon as the bytes necessary for their decompression are downloaded.
For example I'm downloading the archive into a memory buffer, and as soon as the first file is possible to decompress, I want to be able to decompress it (also to a memory buffer).
Are there any formats/libraries that support this?
EDIT: If possible, I'd prefer a single file format instead of separate ones for compression and archiving, like gz/bzip2 and tar.
There are two issues here:
How to write the code.
What format to use.
On the file format: you can't use the .ZIP format, because .ZIP puts the table of contents at the end of the file. That means you'd have to download the entire file before you can know what's in it. Zip has local headers you can scan for, but those headers are not the official list of what's in the file.
Zip explicitly puts the table of contents at the end because that allows files to be appended quickly.
Assume you have a zip file that contains files 'a', 'b', and 'c', and you want to update 'c'. It's perfectly valid in zip to read the table of contents, append the new 'c', and write a new table of contents pointing to the new 'c'; the old 'c' is still in the file. If you scan for headers you'll end up seeing the old 'c', since it's still in the file.
This feature of appending was an explicit design goal of zip. It comes from the 1980s, when a zip could span multiple floppy discs. If you needed to add a file, it would suck to have to read all N discs just to re-write the entire zip file. So instead the format just lets you append updated files to the end, which means it only needs the last disc. It just reads the old TOC, appends the new files, and writes a new TOC.
Gzipped tar files don't have this problem. Tar files are stored header, file, header, file, and the compression is applied on top of that, so it's possible to decompress the archive as it's downloaded and use the files as they become available. You can create gzipped tar files easily on Windows with WinRAR (commercial) or 7-Zip (free), and on Linux, OS X and Cygwin with the tar command.
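To illustrate why tar streams so easily, here is a rough sketch of walking an uncompressed ustar stream (hypothetical file name; in the plugin case the 512-byte blocks would come from the gzip decompressor instead of a local file):
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    FILE* f = std::fopen("archive.tar", "rb");         // hypothetical archive
    if (!f) return 1;

    unsigned char header[512];
    while (std::fread(header, 1, 512, f) == 512) {
        if (header[0] == '\0') break;                  // all-zero blocks mark the end of the archive
        char name[101] = {0};
        std::memcpy(name, header, 100);                // entry name: offset 0, 100 bytes
        char sizefield[13] = {0};
        std::memcpy(sizefield, header + 124, 12);      // entry size: offset 124, 12 bytes, octal
        long size = std::strtol(sizefield, nullptr, 8);
        std::printf("%s (%ld bytes)\n", name, size);
        long padded = (size + 511) / 512 * 512;        // file data is padded to a 512-byte boundary
        std::fseek(f, padded, SEEK_CUR);               // skip (or, in the plugin, consume) the data
    }
    std::fclose(f);
    return 0;
}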
On the code to write:
O3D does this and is open source so you can look at the code
http://o3d.googlecode.com
The decompression code is in o3d/import/cross/...
It targets the NPAPI using some glue which can be found in o3d/plugin/cross
Check out the boost::zlib filters. They make using zlib a snap.
Here's the sample from the boost docs that will decompress a file and write it to the console:
#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_streambuf.hpp>
#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/filter/zlib.hpp>
int main()
{
    using namespace std;
    using namespace boost::iostreams;   // filtering_streambuf, input, zlib_decompressor

    ifstream file("hello.z", ios_base::in | ios_base::binary);
    filtering_streambuf<input> in;
    in.push(zlib_decompressor());
    in.push(file);
    boost::iostreams::copy(in, cout);
}
Sure - zlib, for example, uses z_stream for incremental compression and decompression via the functions inflateInit, inflate, deflateInit and deflate. libbzip2 has similar abilities.
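A minimal sketch of that incremental use of z_stream, feeding raw zlib-compressed data in whatever chunks arrive (the function name and buffer sizes here are illustrative, not part of the zlib API):
#include <zlib.h>

// Feed one downloaded chunk into the decompressor; call repeatedly as data arrives.
static int feed_chunk(z_stream* zs, const unsigned char* data, size_t len)
{
    unsigned char out[16384];
    zs->next_in  = const_cast<unsigned char*>(data);
    zs->avail_in = static_cast<uInt>(len);
    while (zs->avail_in > 0) {
        zs->next_out  = out;
        zs->avail_out = sizeof out;
        int ret = inflate(zs, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) return ret;   // Z_DATA_ERROR etc.
        size_t produced = sizeof out - zs->avail_out;
        // ...hand 'produced' bytes in 'out' to whatever consumes the decompressed data...
        (void)produced;
        if (ret == Z_STREAM_END) break;                       // end of the compressed stream
    }
    return Z_OK;
}

int main()
{
    z_stream zs = {};                        // zero-initialized: zalloc/zfree/opaque = Z_NULL
    if (inflateInit(&zs) != Z_OK) return 1;  // use inflateInit2 for gzip-wrapped data
    // ...call feed_chunk(&zs, chunk, chunk_len) as each chunk is downloaded...
    inflateEnd(&zs);
    return 0;
}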
For incremental extraction from the archive (as it gets deflated), look e.g. to the good old tar format.