I'm trying to develop a diff format that covers multiple files, recursing through folders. Consider a source directory containing patched files and a destination directory containing the original files. I want to write a size-minimal diff file that expresses the difference between all files in the two directories and that can be applied to the original files to transform them into the patched files.
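The folder-level part of this can be sketched independently of any diff library. The following is a minimal Python sketch, not how dtl or HDiffPatch handle directories; the function names `list_files` and `classify` are made up for illustration. It walks both trees and sorts the relative paths into added, removed, and changed files, so that only the changed ones need a byte-level diff.

```python
import filecmp
import os


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def classify(original_dir, patched_dir):
    """Split relative paths into added, removed and changed files."""
    old = list_files(original_dir)
    new = list_files(patched_dir)
    changed = {
        rel for rel in old & new
        # shallow=False compares file contents, not just size and mtime
        if not filecmp.cmp(os.path.join(original_dir, rel),
                           os.path.join(patched_dir, rel), shallow=False)
    }
    return sorted(new - old), sorted(old - new), sorted(changed)
```

Added files have to be stored in full, removed files only need their path recorded, and unchanged files need no entry at all.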
For this purpose I found the dtl library. Which algorithm or feature of the library should I use to write a file diff to disk that I can later read back and apply to patch the file? Is there any example code for this? I tried writing the result of the shortest edit script (SES) to disk, but I realized that I needed to specify the character and operation for every single byte. This of course makes the output file bigger than the file being compared, which defeats the purpose of the diff format: storing the entire target file instead would have used less storage.
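One way around the per-byte overhead is to record ranges instead of single bytes: "copy N bytes from offset X of the original" for unchanged runs and literal data only for bytes that are new. The sketch below illustrates that idea in Python with the standard `difflib` module rather than dtl; the operation tuples are a hypothetical in-memory format, not the format of any existing tool, and `SequenceMatcher` is too slow for large binaries.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Describe new as byte ranges copied from old plus literal data."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))   # offset and length in old
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))    # bytes that only exist in new
        # "delete" needs no operation: those bytes are simply never copied
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the patched bytes from the original and the operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += old[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

With this layout an unchanged run of any length costs one small record, so the delta grows with the amount of changed data rather than with the file size.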
As another reference, this is very similar to how version control systems like git or svn operate, but I don't want to use those since I'm mainly dealing with binary files and only need to create and apply patches.
After some more searching, I found the HDiffPatch project.
It appears to work fine, but it seems to take a long time on bigger folder comparisons:
diff usage: hdiffz [options] oldPath newPath outDiffFile
patch usage: hpatchz [options] oldPath diffFile outNewPath
EDIT:
Another good option is open-vcdiff but it only supports individual files.
Use HDiffPatch: you can run hdiffz with "-s-48" to speed it up,
or try "-s-32", "-s-1k", "-s-128k", ...
Summary
Let's say I have a large number of files in a folder that I want to compress/zip before sending them to a server. After I've zipped them together, I realize I want to add/remove/modify a file. Can going through the entire compression process from scratch be avoided?
Details
I imagine there might be some way to cache part of the compression process (whether it is .zip, .gz or .bzip2), to make the compression incremental, even if it results in sub-optimal compression. For example, consider the naive dictionary encoding compression algorithm. I imagine it should be possible to use the encoding dictionary on a single file without re-processing all the files. I also imagine that the loss in compression provided by this caching mechanism would grow as more files are added/removed/edited.
Similar Questions
There are several questions related to this problem:
A C implementation, which implies it's possible
A C# related question, which implies it's possible by zipping individual files first?
A PHP implementation, which implies it isn't possible without a special file-system
A Java-specific adjacent question, which implies it's semi-possible?
The man page of zip documents several relevant options:
Update
-u
--update
Replace (update) an existing entry in the zip archive only if it has
been modified more recently than the version already in the zip
archive. For example:
zip -u stuff *
will add any new files in the current directory, and update any files
which have been modified since the zip archive stuff.zip was last
created/modified (note that zip will not try to pack stuff.zip into
itself when you do this).
Note that the -u option with no input file arguments acts like the -f
(freshen) option.
Delete
-d
--delete
Remove (delete) entries from a zip archive. For example:
zip -d foo foo/tom/junk foo/harry/\* \*.o
will remove the entry foo/tom/junk, all of the files that start with
foo/harry/, and all of the files that end with .o (in any path).
Note that shell pathname expansion has been inhibited with
backslashes, so that zip can see the asterisks, enabling zip to match
on the contents of the zip archive instead of the contents of the
current directory.
Yes. The entries in a zip file are all compressed individually. You can select and copy just the compressed entries you want from any zip file to make a new zip file, and you can add new entries to a zip file.
There is no need for any caching.
As an example, the zip command does this.
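The same behavior can be seen from Python's standard `zipfile` module. This is only an illustrative sketch with made-up entry names and an in-memory buffer standing in for a real archive file: opening the archive in append mode adds a new entry without touching the ones already stored.

```python
import io
import zipfile

buffer = io.BytesIO()  # stands in for an archive file on disk

# Build the initial archive with two entries.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("a.txt", "alpha " * 1000)
    archive.writestr("b.txt", "beta " * 1000)

# Mode "a" appends a new entry; existing entries are not recompressed.
with zipfile.ZipFile(buffer, "a", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("c.txt", "gamma " * 1000)

with zipfile.ZipFile(buffer) as archive:
    for info in archive.infolist():
        # Each entry carries its own original and compressed size.
        print(info.filename, info.file_size, info.compress_size)
```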
I am using .pvr files in my Android game, but when I run the APK through zipalign, the size of the .pvr files does not change (other file types worked well).
I tried using the newest zipalign tool and changing the flags:
tools/windows/zipalign -v -f 4 C:_Working\Game.apk release_apk\Game.apk
The zipalign tool does not compress anything; it "aligns" elements in the zip file, which means moving them to an offset in the file that is a multiple of the value you give, in bytes (in this case 4, so every uncompressed element is located at an offset that is a multiple of 4). Compression is completely orthogonal to zip-aligning.
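The arithmetic behind "an offset that is a multiple of 4" can be sketched in a few lines of Python. This only illustrates the alignment calculation; it says nothing about how zipalign inserts the padding internally, and `padding_for` is a made-up helper name.

```python
def padding_for(offset: int, alignment: int = 4) -> int:
    """Bytes to insert so that offset becomes a multiple of alignment."""
    return (-offset) % alignment


for offset in (100, 101, 102, 103):
    pad = padding_for(offset)
    # After padding, the data starts at the next multiple of 4.
    print(offset, "+", pad, "->", offset + pad)
```

At most `alignment - 1` bytes are added per element, which is why aligning changes file sizes only marginally and never shrinks them.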
Depending on what tool you use to build your APK, some build systems may keep some files uncompressed, so you should look at the documentation.
Another possibility is that the .pvr file is already compressed in itself so zipping it brings little gain in size.
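The last point is easy to reproduce with any compressor. The sketch below uses Python's `zlib` on made-up sample data (not a real .pvr file) and compresses it twice: the first pass removes most of the redundancy, so the second pass has almost nothing left to gain.

```python
import random
import zlib

random.seed(0)  # fixed seed so the demo is repeatable
words = [b"texture", b"pixel", b"alpha", b"mipmap", b"block", b"format"]
raw = b" ".join(random.choice(words) for _ in range(20000))

once = zlib.compress(raw, 9)    # first pass removes most redundancy
twice = zlib.compress(once, 9)  # second pass has little left to remove

print(len(raw), len(once), len(twice))
```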
So, I recently came across the .unity3d file for a game I used to play, and unpacked it using a tool (http://en.unity3d.netobf.com/). Now I've made the tweaks to the game that I needed to make it run on a local server, and have come across the issue of how to compress the files back into a .unity3d file. I've reverse engineered the tool and determined that .unity3d files are LZMA compressed (just like a .7z archive), but the header is "UnityWeb" instead of "7z". How might I achieve this?
7z is open source. If the only difference is indeed that header, then get the sources, find where the header is, change it and compile your own compression utility. Watch out for other constants describing the headers and signatures though (e.g. length of the signature). I'd suggest starting with line 9 of the file Xz.c (defining XZ_SIG and XZ_FOOTER_SIG).
I have a text file (>50k lines) of ascii numbers, with string identifiers, that can be thought of as a collection of data vectors. Based on user input, the application only needs one of these data vectors at runtime.
As far as I can see, I have 3 options for getting the information from this text file:
1. Keep it as a text file, extract the required vector at run-time. I believe the downside is that you can't have a relative path in the code, so the user would have to point to the file's correct location (?). Or alternatively, get the configure script to inject the absolute path as a macro.
2. Convert it to a static unsigned char using xxd (as explained here) and then include the resulting file. Downside is that a 5MB file turns into a 25MB include file. Am I correct in thinking that this 25MB is loaded into memory for the duration of the runtime?
3. Convert it to an object and link using objcopy as explained here. This seems to keep the file size about the same -- are there other trade-offs?
Is there a standard/recommended method for doing this? I can use C or C++ if that makes a difference.
Thanks.
(Running on linux with gcc)
I would go with number 1 and pass the filepath into the program as an argument. There's nothing wrong with doing that and it is simple and straight-forward.
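The approach is the same in any language; here is a minimal sketch in Python rather than C or C++. It assumes a file layout the question does not actually specify: one vector per line, with the string identifier first and whitespace-separated numbers after it. `load_vector` is a made-up name.

```python
def load_vector(path, identifier):
    """Return the numbers on the line that starts with identifier, or None."""
    with open(path, "r", encoding="ascii") as handle:
        for line in handle:
            fields = line.split()
            # Assumed layout: "<identifier> <number> <number> ..."
            if fields and fields[0] == identifier:
                return [float(value) for value in fields[1:]]
    return None
```

A command-line program would pass the file path and identifier from its arguments, for example `load_vector(sys.argv[1], sys.argv[2])`. Only the matching line is parsed, so the rest of the file never has to be held in memory.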
You should have a look at the answers here:
Directory of running program
The top voted answer gives you a clue about how to handle your data file. But instead of the home folder, I would suggest saving it under /usr/share, as explained in the link.
I'd prefer to use zlib (both ways are possible: a side file or an include with compressed data).
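As a rough illustration of what zlib does to this kind of data, the sketch below uses Python's `zlib` binding on synthetic ASCII numbers (a stand-in, not the actual data file) and checks that the round trip is lossless. The C library offers `compress` and `uncompress` for the same purpose.

```python
import zlib

# Hypothetical sample in the same spirit as the data file: ASCII numbers
# with a string identifier at the start of each line.
text = "\n".join(
    f"vec{i} " + " ".join(str(i * j) for j in range(20))
    for i in range(2000)
).encode("ascii")

packed = zlib.compress(text, 9)     # what would be stored or embedded
restored = zlib.decompress(packed)  # what the program does at startup

print(len(text), len(packed))
```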
I'm making a simple game with SFML 1.6 in C++. Of course, I have a lot of picture, level, and data files. Problem is, I don't want these files visible. Right now they're just plain picture files in a res/ subdirectory, and I want to either conceal them or encrypt them. Is it possible to put the raw data from the files into a resource file or something? Any solution is okay to me, I just don't want the files exposed to the user.
EDIT
Cross-platform solutions are best, but if they don't exist, that's okay; I'm working on Windows. I'd rather not use a library if it's not needed, though.
Most environments come with a resource compiler that converts images/icons/etc into string data and includes them in the source.
Another common technique is to copy them onto the end of the final .exe as the last part of the build process. Then at run time, open the .exe as a file and read the data from some predetermined offset; see Embedding a filesystem in an executable?
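A minimal sketch of the append-and-read-back idea, written in Python for brevity. It assumes a made-up convention that is not taken from the linked question: the last 8 bytes of the file hold the payload length, so the program can locate its data by seeking from the end instead of hard-coding an offset.

```python
import os
import struct

FOOTER = struct.Struct("<Q")  # hypothetical footer: payload length, 8 bytes


def append_payload(exe_path, payload: bytes):
    """Write payload plus a length footer at the end of the file."""
    with open(exe_path, "ab") as handle:
        handle.write(payload)
        handle.write(FOOTER.pack(len(payload)))


def read_payload(exe_path) -> bytes:
    """Locate the payload by reading the footer from the end of the file."""
    with open(exe_path, "rb") as handle:
        handle.seek(-FOOTER.size, os.SEEK_END)
        (length,) = FOOTER.unpack(handle.read(FOOTER.size))
        handle.seek(-(FOOTER.size + length), os.SEEK_END)
        return handle.read(length)
```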
The ideal way to do this is to make your own archive format, which would contain all of your files' data along with the extra information needed to tell the files apart within it.
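Such a format can be very small. The sketch below, in Python for brevity, uses a made-up layout: each entry is a fixed header with the name length and data length, followed by the name and the raw bytes. It only keeps the files out of plain sight; it does not encrypt anything.

```python
import struct

ENTRY = struct.Struct("<HI")  # hypothetical header: name length, data length


def pack(files: dict) -> bytes:
    """Concatenate named byte strings into one archive blob."""
    out = bytearray()
    for name, data in files.items():
        encoded = name.encode("utf-8")
        out += ENTRY.pack(len(encoded), len(data)) + encoded + data
    return bytes(out)


def unpack(blob: bytes) -> dict:
    """Split an archive blob back into its named byte strings."""
    files, pos = {}, 0
    while pos < len(blob):
        name_len, data_len = ENTRY.unpack_from(blob, pos)
        pos += ENTRY.size
        name = blob[pos:pos + name_len].decode("utf-8")
        pos += name_len
        files[name] = blob[pos:pos + data_len]
        pos += data_len
    return files
```

The game would then ship a single blob and load each picture or level from memory after unpacking it.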