Recovering a large archive created without zip64 - compression

I created a large archive using an old version of minizip (1.01h), which is based on the zlib library and does not support Zip64.
The source file was a text file much larger than 4 GB; the compressed size of the archive is 2 GB. Since it was created without Zip64 support, the archive is corrupt and I am unable to restore it. Is there a way to recover at least part of the text file from this corrupt archive?

You could try this streaming unzip, which ignores the central directory.
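One way such a tool can work is to ignore the broken central directory entirely: read the local file header at the start of the archive, then stream the raw deflate data until the stream's own end marker, never trusting the overflowed size fields. A minimal sketch in Python (`recover_first_entry` is a hypothetical helper, not the tool referred to above; it assumes the first entry is deflate-compressed):

```python
import struct
import zlib

def recover_first_entry(zip_path, out_path):
    """Read the first local file header directly and stream-inflate its data,
    ignoring the central directory (whose sizes overflow without Zip64)."""
    with open(zip_path, "rb") as f, open(out_path, "wb") as out:
        if f.read(4) != b"PK\x03\x04":
            raise ValueError("no local file header at start of archive")
        fields = struct.unpack("<HHHHHIIIHH", f.read(26))
        method, name_len, extra_len = fields[2], fields[8], fields[9]
        f.seek(name_len + extra_len, 1)        # skip file name and extra field
        if method != 8:
            raise ValueError("entry is not deflate-compressed")
        d = zlib.decompressobj(-15)            # raw deflate, no zlib wrapper
        while True:
            chunk = f.read(64 * 1024)
            if not chunk:
                break
            out.write(d.decompress(chunk))
            if d.eof:                          # the deflate stream marks its own end
                break
```

Because the deflate stream carries its own terminator, this recovers the full uncompressed data even when the recorded sizes are wrong modulo 4 GB.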

Related

Unzipping Large Files in AWS

We recently ran into an issue with file corruption after large files are unzipped. The unzip process completes without error, but the output can be missing the last ~5 KB.
Our current process: a .ZIP file is downloaded from S3 onto a Linux pod, Perl code using IO::Uncompress::Unzip extracts a single .JSON file, and the .JSON is uploaded back to S3.
There is another layer of challenge too. When using native Windows or Linux tools locally, the files unzip completely with no missing bytes. However, at times single characters are changed within the file (we've seen corrupted JSON, changing "}]}" to "}M}", or misspelled words, "item" to "idem"). This problem seems worse with tools like 7-Zip and WinRAR.
Checking the details of the .ZIP file, it looks to have been created on Windows, which research says means a GBK encoding. I suspect there may be a decoding issue on Linux with tools that assume UTF-8, but I've been unable to confirm that. Plus, we've seen even the local Windows unzip process change single characters.
We've tried using IO::Uncompress::Unzip locally, which resulted in an incomplete file.
We've tried using Archive::Zip locally, which errors out on any file over 4 GB.
We've tried using Compress::Raw::Zlib, but that also didn't work.
We've tried enabling autoflush on the file handle, which still resulted in an incomplete file.
Has anyone encountered similar behaviors?
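One way to separate "the archive is bad" from "our extraction is bad" is to check the extracted bytes against the length and CRC-32 recorded in the archive itself. A sketch in Python rather than the Perl pipeline above (`verify_zip_entries` is a hypothetical helper; note that Python's zipfile already raises BadZipFile on a CRC mismatch during read):

```python
import zipfile
import zlib

def verify_zip_entries(zip_path):
    """Re-read every entry and compare its length and CRC-32 against the
    values recorded in the archive; returns the names that fail either check."""
    bad = []
    with zipfile.ZipFile(zip_path) as z:
        for info in z.infolist():
            try:
                crc, size = 0, 0
                with z.open(info) as f:
                    while True:
                        chunk = f.read(64 * 1024)
                        if not chunk:
                            break
                        crc = zlib.crc32(chunk, crc)
                        size += len(chunk)
                if size != info.file_size or crc != info.CRC:
                    bad.append(info.filename)
            except zipfile.BadZipFile:          # CRC mismatch detected by zipfile
                bad.append(info.filename)
    return bad
```

If this passes on the pod but the delivered .JSON is still short or altered, the loss is happening after decompression (buffering, upload), not in the archive.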

Extract tar in memory and nonblocking

I need to extract a tar.gz data stream in memory. An additional constraint is that I cannot block.
Inflating in memory works great via zlib.
Now I need the untar part. Sadly, all the libraries I found either block or only work with tar files on disk. Is there any library that works similarly to zlib?
OK, there was no suitable library before, but now there will be, soon.
Check it out here, but be aware that it is not yet working.
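In the meantime, a push-style untar is not much code on top of zlib's incremental inflate: callers feed in whatever bytes have arrived, and completed files come back out. A rough Python sketch under stated assumptions (plain ustar headers, regular files only; `StreamingUntar` is a hypothetical class, not the library mentioned above):

```python
import zlib

class StreamingUntar:
    """Push-style tar.gz extractor: call feed() with whatever bytes have
    arrived; it returns (name, data) pairs for files completed so far."""
    def __init__(self):
        self._gz = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip wrapper
        self._buf = b""

    def feed(self, chunk):
        self._buf += self._gz.decompress(chunk)
        done = []
        while len(self._buf) >= 512:
            header = self._buf[:512]
            if header == b"\x00" * 512:                 # end-of-archive marker
                self._buf = b""
                break
            size = int(header[124:136].strip(b"\x00 ") or b"0", 8)
            total = 512 + size + (-size) % 512          # header + data + padding
            if len(self._buf) < total:
                break                                   # wait for more input
            if header[156:157] in (b"0", b"\x00"):      # regular file typeflag
                name = header[0:100].split(b"\x00", 1)[0].decode()
                done.append((name, self._buf[512:512 + size]))
            self._buf = self._buf[total:]
        return done
```

Since feed() never reads from a source itself, it fits straight into a nonblocking event loop: hand it each network chunk as it arrives.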

Free C/C++ based zip/zip64 library?

After false starts with POCO's zip support and minizip (both have issues: minizip can't decompress files larger than 2 GB, and POCO corrupts any zip file larger than 2 GB that it compresses), I was wondering what else is left.
So any suggestions for a C++ archive library that can handle zip AND zip64?
7-zip handles both, as far as I could tell from a quick glance at their source code. It's also LGPL, which should allow its use in a closed source app.
Well, there is the all-around well-proven zlib: http://zlib.net/ (though note it handles the deflate stream, not the zip container format itself).

What is the fastest way to access files in a zip file?

What is the fastest way to read individual files (in a random fashion) from a zip file?
As I understand it, zip files have a directory that stores the individual file entries, and I could scan this directory to build an external index. Are there any standardized ways (i.e. existing libraries) that already do that? Or could I use a specialized type of zip file?
Scanning the directory and building the index is the fastest and best way to provide random access to the compressed entries archived in a zip file. The directory is usually small and lies at the end of the archive. If you have seekable media, then this is what you want.
The zip format is documented pretty well, and it's not too hard to do. The devil is in the details, though: if your zip files use Zip64 extensions, encryption, or split archives, that's when it gets tricky. For simple zip files, doing what you imagine is not so difficult.
Still, it would be easier to use an existing library.
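For simple archives, this is essentially what Python's zipfile module does out of the box: opening the archive parses the central directory once, and each ZipInfo records the entry's offset, so later reads seek straight to the entry. A small illustrative wrapper (`RandomAccessZip` is a hypothetical name):

```python
import zipfile

class RandomAccessZip:
    """Index a zip's central directory once, then read entries in any order."""
    def __init__(self, path):
        self._zf = zipfile.ZipFile(path)            # parses the central directory
        self.index = {i.filename: i for i in self._zf.infolist()}

    def read(self, name):
        with self._zf.open(self.index[name]) as f:  # seeks to the entry's offset
            return f.read()

    def close(self):
        self._zf.close()
```

Only the requested entry is decompressed, so random access cost is proportional to the entry size, not the archive size.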
Minizip seems to be a good library for reading or writing zip files. It uses the zlib library.
http://www.winimage.com/zLibDll/minizip.html

File formats with included versioning

I like the idea of using compressed folders as containers for file formats; they are used by LibreOffice and Dia. If I want to define a special-purpose file format, I can define a folder and file structure, zip the root folder, and have all the data in a single file. Imported files live as originals inside the compressed file. Defining a binary file format from scratch with these features would be a lot of work.
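Packing a root folder into a single-file container is only a few lines with standard tooling; a Python sketch (`pack_folder` is a hypothetical helper):

```python
import os
import zipfile

def pack_folder(root, out_zip):
    """Pack a directory tree into a single zip container, preserving
    relative paths, so the folder layout defines the file format."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as z:
        for dirpath, _, filenames in os.walk(root):
            for fn in filenames:
                full = os.path.join(dirpath, fn)
                z.write(full, os.path.relpath(full, root))
```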
Now to my question: are there applications that use compressed folders as file formats and do versioning inside the folder? The benefits would be great: you could commit a state of your project into your file, with the versioning exposed through functions of your own application, and diffs could be presented in your own way.
Libraries for working with compressed files and for versioning are available. The versioning system should be a distributed one, where the repository lives inside the working folder rather than separately, as with Subversion's client-server model.
What do you think? I'm sure there are applications out there using this approach, but I couldn't find one. Or is there a major drawback to it?
Sounds like an interesting idea.
I know many applications claim they have "unlimited" undo and redo, but that only goes back to the most recent time I opened the file. With your system, the application could "undo" to versions of the file from before the last time I opened it; that might be a nifty feature.
Have you looked at TortoiseHg? TortoiseHg uses Mercurial, which is "a distributed system, where the repository lives inside your working folder". Rather than defining a new compressed versioned file format and all the software to work with it from scratch, perhaps you could use the Mercurial file format and borrow the TortoiseHg and Mercurial source code to work with it.
What happens if I'm working on a project using two different applications, and each application wants to store the entire project in its own slightly different compressed versioned file format?
What I found now is that OpenOffice, a.k.a. LibreOffice, has a kind of versioning built in. A LibreOffice file is a zip file with structured content (XML files, directories, ...) inside. You can mark the current content as a version, which creates a VersionList.xml containing information about all the versions. A Versions directory is added, containing files like Version1, Version2, and so on; these files are the actual documents at each state.
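That layout can be imitated with any zip library. A toy Python sketch that appends a LibreOffice-style snapshot entry (`add_version` is hypothetical; a real implementation would also maintain the VersionList.xml manifest):

```python
import zipfile

def add_version(container_path, snapshot_bytes):
    """Append the current document state as Versions/VersionN inside the
    zip container, mimicking LibreOffice's layout; returns N."""
    with zipfile.ZipFile(container_path, "a", zipfile.ZIP_DEFLATED) as z:
        n = sum(name.startswith("Versions/Version")
                for name in z.namelist()) + 1
        z.writestr(f"Versions/Version{n}", snapshot_bytes)
    return n
```

Each snapshot is a full copy, as in LibreOffice, so old versions stay readable even if the application's current format changes.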