Working on zip archives with C++

Does anyone have experience working with zip archives? I have a program that walks a filesystem and searches for keywords in XML files. The XML files are stored in zip64 archives, so every time I want to search for something I have to unzip the files first. Since I'm working with Qt, the first thing I tried was Quazip, but just like libarchive it doesn't seem to support zip64. Then I found libraries like the poco-library or zipstream, but I'm having trouble getting them to work.
Now I wanted to ask whether anyone can tell me how much longer a search on zipped files might take, because the search already takes up to 15 minutes. If it is a lot slower, it might not be worth the effort (e.g. if it takes more than 20 minutes afterwards, I wouldn't use it).
Is it possible to estimate the additional time needed to work with the zipped files?
Thanks in advance for any help!

InfoZip supports zip64. Either way, to search in compressed XML files you have to decompress them first, and that decompression is what takes most of your time.

Related

Does anyone know if ziplib has the ability to validate a zip library without actually extracting all the files

I'm looking to replace the zip library that I am using in a small utility with something a bit better.
One of the deficiencies in the library I am currently using is that it doesn't appear to validate zip files very well - I can corrupt a file by changing random characters and the library doesn't notice.
I am looking for a C++ zip library that has a function to validate a zip file without extracting all the files in the archive.
Someone recommended ziplib to me, but I don't see anything in there about checking the integrity of a zip archive.
Does anyone know if ziplib has this capability? Or have a better recommendation?
Libraries like libzip and libarchive allow you to read archive entries a chunk at a time. You can simply read the entire archive to verify it, repeatedly overwriting the same buffer in memory with the decompressed data and thereby discarding it.
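To make the "read in chunks and throw the data away" idea concrete, here is a minimal sketch. It does not use libzip or libarchive at all: a plain std::istream stands in for the library's entry reader, and crc32_of_stream is a hypothetical helper name. It reads through one reused 4 KiB buffer and computes the CRC-32 of everything it sees, which is the kind of checksum the ZIP format stores for each entry.

```cpp
#include <array>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Build the 256-entry lookup table for the reflected CRC-32 polynomial 0xEDB88320.
std::array<std::uint32_t, 256> make_crc_table() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t i = 0; i < 256; ++i) {
        std::uint32_t c = i;
        for (int k = 0; k < 8; ++k)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[i] = c;
    }
    return table;
}

// Hypothetical helper: reads the stream to the end through one fixed buffer.
// Each chunk overwrites the previous one, so memory use stays constant.
std::uint32_t crc32_of_stream(std::istream& in) {
    static const std::array<std::uint32_t, 256> table = make_crc_table();
    std::array<char, 4096> buffer;
    std::uint32_t crc = 0xFFFFFFFFu;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::streamsize got = in.gcount();  // may be a partial chunk at EOF
        for (std::streamsize i = 0; i < got; ++i) {
            const unsigned char byte = static_cast<unsigned char>(buffer[i]);
            crc = table[(crc ^ byte) & 0xFFu] ^ (crc >> 8);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

With a real archive library you would replace the std::istream with that library's per-entry read call and compare the result with the CRC recorded in the archive; check the library's documentation for whether it already does that comparison for you.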

Getting list of files and folders on the user's computer with the filename filtered by the text line

Currently I'm developing a project that should do the thing described above on Windows. I had the idea to recursively go through all of the user's drives and collect all the information on them, but that seems to be really time-consuming. Is there a better way to do this (maybe using the OS's index file or the NTFS MFT)?
I use C++/Qt.
You can search for any of the many code examples for this and use one.
The library functions you would use for this, FindFirstFile and FindNextFile, are optimized and will go directly to the FAT. They are written by Microsoft and I doubt that there is a faster way.
Btw, what do you mean by "filtered by the text line"? Do you mean you want only filenames matching a certain pattern (use the above), or files containing a string?

C++ file container (e.g. zip) for easy access

I have a lot of small files I need to ship with an application I build, and I want to put these files into an archive to make copying and redistributing easier.
I also really like the idea of having them all in one place, so that I only need to compare the md5 of one file in case something goes wrong.
I'm thinking about a class which can load the archive and return a list of files within the archive and load a file into memory if I need to access it.
I already searched the Internet for different methods of achieving what I want and found out about zlib and the lzma sdk.
Neither really appealed to me: I couldn't find out how portable zlib is, and I didn't like the lzma sdk because it is just too much and I don't want to bloat the application because of this problem. Another downside with zlib is that I don't have the C/C++ experience (I'm really new to C++) to understand everything explained in the manual.
I also have to add that this is a time-critical problem. I thought for some time about implementing a simple format like tar in a way that lets me easily access the files within my application, but I just haven't found the time to do that yet.
So what I'm searching for is a library that allows me to access the files within an archive. I'd be glad if anybody could point me in the right direction here.
Thanks in advance,
Robin.
Edit: I need the archive to be accessed under linux and windows. Sorry I didn't mention that in the beginning.
For zipping, I've always been partial to ZipUtils, which makes the process easy and is built on top of the zlib and info-zip libraries.
The answer depends on whether you plan to modify the archive via code after the archive is initially built.
If you don't need to modify it, you can use TAR - it's a handy and simple format. If you want compression, you can implement a tar.gz reader or find a library that does this (I believe there are some available, including open-source ones).
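To give a feel for how simple the uncompressed TAR layout is, here is a minimal sketch that lists the entries of a tar archive already loaded into memory. It relies only on the basic header layout (512-byte blocks, name at offset 0, size as octal text at offset 124) and ignores type flags, long names and checksums, so directories show up as zero-size entries. TarEntry, parse_octal and list_tar are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct TarEntry {
    std::string name;
    std::size_t offset;  // where the file data starts inside the archive buffer
    std::size_t size;    // size of the file data in bytes
};

// Parse an octal text field, skipping leading spaces and stopping at the
// first character that is not an octal digit (NUL or space terminator).
std::size_t parse_octal(const char* field, std::size_t len) {
    std::size_t i = 0;
    while (i < len && field[i] == ' ') ++i;
    std::size_t value = 0;
    for (; i < len && field[i] >= '0' && field[i] <= '7'; ++i)
        value = value * 8 + static_cast<std::size_t>(field[i] - '0');
    return value;
}

// Walk the header blocks and record name, data offset and size of each entry.
std::vector<TarEntry> list_tar(const std::string& archive) {
    const std::size_t kBlock = 512;
    std::vector<TarEntry> entries;
    std::size_t pos = 0;
    while (pos + kBlock <= archive.size()) {
        const char* header = archive.data() + pos;
        if (header[0] == '\0') break;  // an all-zero block marks the end
        std::size_t name_len = 0;
        while (name_len < 100 && header[name_len] != '\0') ++name_len;
        const std::size_t size = parse_octal(header + 124, 12);
        const std::size_t data = pos + kBlock;
        if (size > archive.size() - data) break;  // truncated archive
        entries.push_back({std::string(header, name_len), data, size});
        pos = data + (size + kBlock - 1) / kBlock * kBlock;  // data is padded to 512
    }
    return entries;
}
```

Loading one file is then just archive.substr(entry.offset, entry.size). For tar.gz you would have to decompress the whole stream first, for example with zlib.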
If your application needs random access to the data or needs to modify the archive, then regular TAR or ZIP archives are not a good fit. A virtual file system such as our SolFS or CodeBase file system will fit much better: virtual file systems are suited for frequent modifications of the storage, while archives target mainly write-once-read-many usage scenarios.
zlib is highly portable and very widely used. If you can't make sense of the C++ interface, there are alternatives for many other languages - see 'Related External Links' here.
Take another look before you search for something different.
If you're using Qt or Windows you can also pack data into the executable's resource area. You would only have to distribute the executable file using this technique. There's a well defined API already written and tested to access that data.
The zlib API is the way to go. Simple and portable. Look at the unzip.h header for the APIs that access archive files. It is in C and very easy.
If the files are small, you can dump them into string literals (search for the bin2h utility) and include them in your project. Then change the code that reads the files: if all files are currently read using the ifstream class, simply change it to the istringstream class and recompile the code.
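A minimal sketch of that ifstream-to-istringstream swap, assuming the reading code is written against std::istream. The array kConfigData and its contents are made up for illustration; a bin2h-style tool would generate something similar from a real file.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical embedded file contents, as a bin2h-style tool might emit them.
static const char kConfigData[] = "width=800\nheight=600\n";

// The reader only needs a std::istream, so it works unchanged with an
// ifstream (file on disk) or an istringstream (data compiled into the binary).
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Reads the embedded data through an istringstream.
std::vector<std::string> load_embedded_config() {
    // sizeof - 1 drops the terminating NUL and keeps any embedded NUL bytes.
    const std::string data(kConfigData, sizeof(kConfigData) - 1);
    std::istringstream in(data);
    return read_lines(in);
}
```

The design point is that only the place where the stream is constructed has to change; everything that consumes the stream stays as it was.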
Try using Quazip - it's quite simple to use. You can use it as a stream from which you read the compressed file on the fly.

Game File Archive Format

I want to create a single data file that holds all the data that my game will need, and I want it to be compressed. I looked into tar and gzip, but I downloaded their sources and I don't know where to begin. Can somebody give me some pointers to how I can use these?
Unless you will always load all files from the archive, TAR/GZ might not be a very good idea, because you cannot extract specific files as you need them. This is the reason many games use ZIP archives, which do allow you to extract individual files as required (a good example is Quake, whose PK3 files are nothing but ZIP files with a different extension).
A bit of searching brought up Minizip, which is a ZIP library built on top of zlib. I couldn't find any separate documentation for it, but the header files seem to include a lot of comments, and I believe you can get by with those.
If you mean that you want your game to read out of the archive at runtime, then I recommend decompressing each time the game is run into a temporary folder, and then using the files as required. This can be achieved through using a library for decompressing whatever archive format you use. Look into zlib.

library for doing diffs

I've been tasked with creating a tool that can diff and merge the configuration files for my company's product. The configurations are stored as either XML or URL-encoded strings. I'm looking for a library, preferably open source with a license compatible with commercial software, that can do these diffs. Our app is written in C++, so C++ libraries would be best, but I'm willing to look at libraries that are C#-specific since I can write a wrapper that exposes it to C++ via COM. Three-way diffs would be ideal, but two-way is acceptable. If it has an understanding of XML, that would also be a plus (since XML nodes can be reordered without changing the document, etc). Any library suggestions? Should I even consider writing my own diff tools in the hopes of giving it semantic knowledge of our formats?
Thanks to this similar question, I've already discovered this google library, which seems really great, but I'm still looking for other options. It also seems to be able to output the diffs in HTML format (using the <ins> and <del> tags that I didn't know existed before I discovered it), which could be really handy, but it seems to be a unified diff only. I'm going to need to display the results in a web browser, and probably have to build an interface for doing the merges in the browser as well. I don't expect a library to be able to help with these tasks, but it must produce output in a format that is amenable to me building this on top of it. I'm currently envisioning something along the lines of TortoiseMerge (side-by-side diffs, not unified), except browser-based. Any tips/tricks/design ideas on how to present this would be appreciated too.
Subversion comes with libsvn_diff and libsvn_delta licensed under Apache Software License.
Here is a C++ library that can diff what the author calls semistructured data. It deals nicely with HTML and XML. Since your data is XML it would make a lot of sense to use this instead of plain text diff. This is especially the case when the files are machine generated.
I am currently trying to use this library to build a tool that diffs Visual Studio project files. These are basically XML files and using a plain diff tool like Winmerge is too painful because Visual Studio pretty much mucks up the whole file by crazy reordering. The idea is to do some kind of a structured diff to address the problem.
For diffing the XML I would propose that you normalize it first: sort all the elements in alphabetic order, then generate a stream of tokens/xml that represents the original document but is independent of the original formatting. After running the diff, parse the result to get a tree containing what was added / removed.
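The "run the diff" step on the normalized token stream can be sketched as below. This is a plain longest-common-subsequence diff over strings, not an XML-aware algorithm, and it assumes the normalization has already turned both documents into token vectors. diff_tokens is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Returns an edit script: "  tok" for unchanged tokens, "- tok" for tokens
// only in `a` (removed) and "+ tok" for tokens only in `b` (added).
std::vector<std::string> diff_tokens(const std::vector<std::string>& a,
                                     const std::vector<std::string>& b) {
    const std::size_t n = a.size(), m = b.size();
    // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
    std::vector<std::vector<std::size_t>> lcs(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = n; i-- > 0;)
        for (std::size_t j = m; j-- > 0;)
            lcs[i][j] = (a[i] == b[j]) ? lcs[i + 1][j + 1] + 1
                                       : std::max(lcs[i + 1][j], lcs[i][j + 1]);

    // Walk the table from the start and emit the edit script.
    std::vector<std::string> out;
    std::size_t i = 0, j = 0;
    while (i < n && j < m) {
        if (a[i] == b[j]) { out.push_back("  " + a[i]); ++i; ++j; }
        else if (lcs[i + 1][j] >= lcs[i][j + 1]) { out.push_back("- " + a[i]); ++i; }
        else { out.push_back("+ " + b[j]); ++j; }
    }
    for (; i < n; ++i) out.push_back("- " + a[i]);
    for (; j < m; ++j) out.push_back("+ " + b[j]);
    return out;
}
```

The table needs memory proportional to the product of the two token counts, which should be fine for configuration files but not for very large documents. Parsing the "+" and "-" lines back into a tree of added and removed nodes is left out here.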