Library for extracting zip on the fly

I have a rather large ZIP file that gets downloaded (I cannot change the file). The goal is to unzip the file while it is downloading, instead of having to wait until the end of the central directory has been received.
Does such a library exist?

I wrote "pinch" a while back. It's in Objective-C, but its method for decoding files from a zip might be a starting point for a C++ version. Yes, some coding will be necessary.
http://forrst.com/posts/Now_in_ObjC_Pinch_Retrieve_a_file_from_inside-I54
https://github.com/epatel/pinch-objc

I'm not sure such a library exists. Unless you are on a very fast line [or have a very slow processor], it's unlikely to save you a huge amount of time. Decompressing several gigabytes only takes a few seconds if all the data is in RAM [it may then take a while to write the uncompressed data to disk, and loading it from disk may add to the total time].
However, assuming the sending end supports "range" downloading, you could possibly write something that downloads the directory first [by reading the fixed-size end-of-central-directory record at the end of the file, then reading the directory it points to, and then downloading the rest of the file from start to finish]. Presumably that's how "pinch", linked in epatel's answer, works.
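As a rough sketch of the "read the directory first" idea: once the last few kilobytes of the file have been fetched with a range request, the end-of-central-directory record can be located by searching backwards for its signature (the bytes `PK\x05\x06`). The names `EndOfCentralDirectory` and `find_eocd` are made up for this example, the download itself is left out, and ZIP64 archives and signatures that happen to appear inside the archive comment are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fields of the ZIP end-of-central-directory (EOCD) record used here.
// The struct and function names are hypothetical, not from any library.
struct EndOfCentralDirectory {
    std::uint16_t total_entries;     // number of entries in the central directory
    std::uint32_t directory_size;    // size of the central directory in bytes
    std::uint32_t directory_offset;  // offset of the central directory in the archive
};

// ZIP stores all integers in little-endian order.
inline std::uint16_t read_u16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
inline std::uint32_t read_u32(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// `tail` holds the last bytes of the archive (e.g. from an HTTP range request).
// Searches backwards for the EOCD signature and fills `out` on success.
bool find_eocd(const std::vector<std::uint8_t>& tail, EndOfCentralDirectory& out) {
    const std::size_t kMinSize = 22;  // EOCD size with an empty comment
    if (tail.size() < kMinSize) return false;
    for (std::size_t i = tail.size() - kMinSize + 1; i-- > 0;) {
        const std::uint8_t* p = tail.data() + i;
        if (p[0] == 0x50 && p[1] == 0x4b && p[2] == 0x05 && p[3] == 0x06) {
            out.total_entries = read_u16(p + 10);
            out.directory_size = read_u32(p + 12);
            out.directory_offset = read_u32(p + 16);
            return true;
        }
    }
    return false;
}
```

With `directory_offset` and `directory_size` in hand, a second range request can fetch the central directory itself before the main download starts.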

Related

Is it possible to continue bzip2 decompressing?

Long story short: I have a big (700+ GB) .tar.bz2 archive and I wanted to decompress it. It is stored on a very slow HDD, so it took my computer about 110 hours of nonstop work to get 92% of the data. But then I accidentally closed the terminal running the unarchiving process.
If the decompression process was stopped, can it continue from where it stopped, skip files that were already extracted, or skip to some offset?

Yes, it is possible in principle since a bzip2 file consists of independent blocks, each of which starts with a specific marker that you can search for. Also a tar file consists of independent blocks for each file, for which you should be able to find headers on some 512-byte boundaries.
You would need to write your own code to poke around and try to find out where you left off, assuming you know what the last file extracted was. Then you could continue to decompress from there.
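A minimal sketch of the "search for the block marker" part, assuming the compressed data (or a window of it) is already in memory. The bzip2 block header starts with the 48-bit magic 0x314159265359, and blocks are not byte-aligned, so the scan has to go bit by bit. The function name `find_bzip2_blocks` is invented for this example; this only locates candidate block starts and does not decompress anything.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the bit offsets (from the start of `data`) at which the 48-bit
// bzip2 block magic 0x314159265359 begins. Blocks are bit-aligned, so the
// search slides a 48-bit window over the input one bit at a time.
// The function name is hypothetical, not part of libbzip2.
std::vector<std::uint64_t> find_bzip2_blocks(const std::vector<std::uint8_t>& data) {
    const std::uint64_t kBlockMagic = 0x314159265359ULL;
    const std::uint64_t kMask48 = 0xFFFFFFFFFFFFULL;
    std::vector<std::uint64_t> offsets;
    std::uint64_t window = 0;     // the most recent 48 bits
    std::uint64_t bits_seen = 0;  // total number of bits consumed so far
    for (std::uint8_t byte : data) {
        for (int bit = 7; bit >= 0; --bit) {  // bzip2 packs bits MSB first
            window = ((window << 1) | ((byte >> bit) & 1u)) & kMask48;
            ++bits_seen;
            if (bits_seen >= 48 && window == kBlockMagic) {
                offsets.push_back(bits_seen - 48);
            }
        }
    }
    return offsets;
}
```

The magic can in principle also occur by chance inside compressed data, so each candidate offset would still need to be verified by actually decoding the block.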

Why isn't lossless compression automatic on computers?

I was just wondering what the impact would be if, say, Microsoft decided to automatically apply lossless compression to every single file saved on a computer.
What are the pros? The cons? Is it feasible?

Speed.
When compressing a file of any kind you're encoding its contents in a more compact form, often using dictionaries and/or prefix codes (for example, Huffman coding). To access the data you have to decompress it, and this costs time and memory, because to access a specific piece of the file you have to decompress it as a whole. While decompressing you have to save the results somewhere, and the most appropriate place is RAM.
Of course, decompressing the whole file wouldn't be a great problem if all of it needed to be read anyway, or if it were being read as a stream. But if a program wanted to write to the compressed file, all the data would have to be compressed again, or at least part of it.
As you can see, compressing files in the filesystem would greatly reduce the bandwidth available to applications (to read a single byte you have to read a chunk of the file and decompress it) and would also require more RAM.

zip-file to buffer c++

I have to read a dat-file byte by byte from a zip-file into a char[] buffer. The zip-file contains only one dat-file. I guess unzipping chunk by chunk would be good. I am using Visual Studio 2013 with C++.
I have found zip-utils (http://www.codeproject.com/Articles/7530/Zip-Utils-clean-elegant-simple-C-Win). Would this be OK, given that it's nearly 10 years old? Would Minizip be a good way? I guess zlib alone would not be enough for this use case, right?
My question is: what's the best way to do the unzipping? I have no experience with handling zip-files and would like to hear a suggestion from somebody with experience.
Thank you,
Friedrich

Minizip would work. Please note that it still needs zlib to link against.
A zip file is not just chunks of zlib-compressed content.
It's an archive.
There is a directory header, and a per-entry header that you must decode too, even if the archive only contains a single file. Typically, the header tells you at which offset in the zip file you'll find the compressed content of your DAT file. Then you'll likely use zlib to decode it chunk by chunk, starting at the given offset.
Please also note that the zip file format does not always imply zlib (deflate) as the compressor; several different compression methods are possible. If you control the code that creates the zip file, this is not an issue. But if the file comes from a hostile user, you should actually check the compression method used and assert that it is one zlib can handle; otherwise you should refuse to decompress the file, because you will not be able to do so.
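To illustrate the per-entry header and the compressor check, here is a small sketch that decodes the fixed part of a ZIP local file header from a byte buffer. The names `LocalFileHeader`, `parse_local_header` and `is_supported_method` are invented for this example; it does no decompression, and it ignores ZIP64 and encrypted entries.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Subset of the ZIP local file header. All names here are hypothetical.
struct LocalFileHeader {
    std::uint16_t method;             // 0 = stored, 8 = deflate
    std::uint32_t compressed_size;    // may be 0 if flag bit 3 is set
    std::uint32_t uncompressed_size;  // (sizes then follow in a data descriptor)
    std::size_t data_offset;          // where entry data starts, relative to the header
};

inline std::uint16_t le16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}
inline std::uint32_t le32(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Parses the local file header at the start of `buf`.
// Returns false if the buffer is too short or the signature is wrong.
bool parse_local_header(const std::vector<std::uint8_t>& buf, LocalFileHeader& out) {
    const std::size_t kFixedSize = 30;  // fixed part, before name and extra field
    if (buf.size() < kFixedSize) return false;
    const std::uint8_t* p = buf.data();
    if (!(p[0] == 0x50 && p[1] == 0x4b && p[2] == 0x03 && p[3] == 0x04)) return false;
    out.method = le16(p + 8);
    out.compressed_size = le32(p + 18);
    out.uncompressed_size = le32(p + 22);
    const std::size_t name_len = le16(p + 26);
    const std::size_t extra_len = le16(p + 28);
    out.data_offset = kFixedSize + name_len + extra_len;
    return true;
}

// Only "stored" (0) and "deflate" (8) can be handled with plain zlib.
bool is_supported_method(std::uint16_t method) {
    return method == 0 || method == 8;
}
```

If `is_supported_method` returns true for method 8, the bytes starting at `data_offset` are a raw deflate stream, which zlib can inflate chunk by chunk when initialised for raw deflate (negative window bits in `inflateInit2`).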

Can zip files be sparse/non-contiguous?

The zip file format ends with a central directory section that then points to the individual zip entries within the file. This appears to allow zip entries to occur anywhere within the zip file itself. Indeed, self-extracting zip files are a good example: they start with an executable and all the zip entries occur after the executable bytes.
The question is: does the zip file format really allow sparse or non-contiguous zip entries? e.g. if there are empty or otherwise unaccounted bytes between zip entries? Both the definitive PK note and wikipedia article seem to allow this. Will all/most typical zip utilities work with such sparse zip files?
The use case is this: I want to be able to delete or replace zip entries in a zip file. To do this, the typical minizip etc. libraries want you to copy out the entire zip file while not copying out the deleted or replaced zip entry, which seems wasteful and slow.
Wouldn't it be better to over-allocate, say 1.5x the storage for an entry, then when deleting or replacing an entry you could figure out where the unallocated bytes were and use those directly? Using 1.5x the storage means that if the zip entry grew linearly, the reallocations should also happen amortized linearly. It would be similar to file system block allocation though probably not as sophisticated.
This also helps with a lot of the zip-based file formats out there. Instead of having to have some temp directory somewhere (or even in memory) with the temporarily unzipped files for editing/changing and then having to rezip the lot back into the file format, this would lessen the need for rezipping and rewriting portions of the zip file.
Are there any C/C++ libraries out there that do this?

No. Reading the central directory is optional. Zip decoders can, and some do, simply read the zip file sequentially from the beginning, expecting to see the local headers and entry data contiguously. They can complete the job of decoding without ever having looked at the central directory.
In order to do what you want, you would need to put in dummy zip entries between the useful entries in order to hold that space. At least if you want to be compatible with the rest of the zip world.

Truncating the file in c++

I am writing a program in C++ and wonder if anyone can help me with the situation explained here.
Suppose I have a log file of about 30MB, and I have copied the last 2MB of the file to a buffer within the program.
I delete the file (or clear its contents) and then write my 2MB back to the file.
Everything works fine up to here. But the concern is that I read the file (the last 2MB), clear the file (the 30MB file) and then write back the last 2MB.
Too much time will be needed in a scenario where I am copying the last 300MB of a 1GB file.
Does anyone have an idea for making this process simpler?
With a large log file, the following points have to be considered.
Disk Space: Log files are uncompressed plain text and consume large amounts of space. Typical compression reduces the file size by 10:1. However, a file cannot be compressed while it is in use (locked), so a log file must be rotated out of use.
System resources: Opening and closing a file regularly will consume a lot of system resources and reduce the performance of the server.
File size: Small files are easier to back up and restore in case of a failure.
I just do not want to copy, clear and re-write the last specific lines to a file. Just a simpler process.... :-)
EDIT: Not making any inhouse process to support log rotation.
logrotate is the tool.

I would suggest a slightly different approach.
1. Create a new temporary file
2. Copy the required data from the original file to the temporary file
3. Close both files
4. Delete the original file
5. Rename the temp file to the same name as the original file
To improve the performance of the copy, you can copy the data in chunks; you can play around with the chunk size to find the optimal value.
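The steps above could be sketched roughly like this in standard C++. The function name `keep_tail`, the ".tmp" suffix and the default chunk size are assumptions made for the example, and error handling is kept to a minimum.

```cpp
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Keeps only the last `keep_bytes` bytes of the file at `path` by copying
// them to a temporary file in chunks and then swapping the two files.
// Function name and the ".tmp" suffix are hypothetical choices.
bool keep_tail(const std::string& path, std::streamoff keep_bytes,
               std::size_t chunk_size = 64 * 1024) {
    if (chunk_size == 0) return false;
    const std::string tmp_path = path + ".tmp";
    {
        std::ifstream in(path, std::ios::binary);
        if (!in) return false;
        in.seekg(0, std::ios::end);
        const std::streamoff size = in.tellg();
        const std::streamoff start = size > keep_bytes ? size - keep_bytes : 0;
        in.seekg(start, std::ios::beg);

        std::ofstream out(tmp_path, std::ios::binary | std::ios::trunc);
        if (!out) return false;

        // Copy the tail chunk by chunk so that memory use stays bounded.
        std::vector<char> chunk(chunk_size);
        while (in) {
            in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
            const std::streamsize got = in.gcount();
            if (got > 0) out.write(chunk.data(), got);
        }
        out.flush();
        if (!out) return false;
    }  // both streams are closed here

    // Delete the original, then give the temporary file its name.
    if (std::remove(path.c_str()) != 0) return false;
    return std::rename(tmp_path.c_str(), path.c_str()) == 0;
}
```

For example, `keep_tail("app.log", 2 * 1024 * 1024)` would keep the last 2MB of a hypothetical `app.log`. Note that between the delete and the rename there is a short window in which the log file does not exist, so this only works if nothing else is writing to it at that moment.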

If this is your file before:
-----------------++++
Where - is what you don't want and + is what you do want, the most portable way of getting:
++++
...is just as you said. Read in the section you want (+), delete/clear the file (as with fopen(..., "wb") or something similar), and write out the bit you want (+).
Anything more complicated requires OS-specific help, and isn't portable. Unfortunately, I don't believe any major OS out there has support for what you want. There might be support for "truncate after position X" (a sort of head), but not the tail like operation you're requesting.
Such an operation would be difficult to implement, as varying block sizes on filesystems (if the filesystem has a block size) would cause trouble. At best, you'd be limited to cutting on block-size boundaries, but this would be hairy. It is such a rare case that this is probably why such a procedure is not directly supported.

A better approach might be not to let the file grow that big but rather use rotating log files with a set maximum size per log file and a maximum number of old files being kept.

If you can control the writing process, what you probably want to do here is to write to the file like a circular buffer. That way you can keep the last X bytes of data without having to do what you're suggesting at all.
Even if you can't control the writing process, if you can at least control what file it writes to, then maybe you could get it to write to a named pipe. You could attach your own program at the end of this named pipe that writes to a circular buffer as discussed.
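A minimal sketch of the circular-buffer idea, kept in memory for simplicity. The class name `TailBuffer` is made up for the example; the same index arithmetic would apply if the storage were a preallocated file and each write were preceded by a seek to the write position.

```cpp
#include <cstddef>
#include <string>

// Keeps only the most recent `capacity` bytes that were appended.
// Hypothetical illustration class, not taken from any library.
class TailBuffer {
public:
    explicit TailBuffer(std::size_t capacity) : data_(capacity, '\0') {}

    // Appends text, overwriting the oldest bytes once the buffer is full.
    void append(const std::string& text) {
        if (data_.empty()) return;
        for (char c : text) {
            data_[head_] = c;
            head_ = (head_ + 1) % data_.size();  // wrap around at the end
            if (size_ < data_.size()) ++size_;
        }
    }

    // Returns the retained bytes, oldest first.
    std::string contents() const {
        if (size_ < data_.size()) return data_.substr(0, size_);
        // When full, the oldest byte sits at head_.
        return data_.substr(head_) + data_.substr(0, head_);
    }

private:
    std::string data_;      // fixed-size storage
    std::size_t head_ = 0;  // next write position
    std::size_t size_ = 0;  // number of valid bytes
};
```

The program at the reading end of the named pipe could feed everything it receives into `append` and write `contents()` out whenever the tail of the log is needed.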
