Is there an open-source implementation of the Dictionary Huffman compression algorithm? - compression

I'm working on a library for Mobipocket-format ebook files, and I have LZ77-style PalmDoc decompression and compression working. However, PalmDoc compression is only one of the two types of text compression currently used on ebooks in the wild; the other is Dictionary Huffman, aka huffcdic.
I've found a couple of implementations of the huffcdic decoding algorithm, but I'd like to be able to compress to the same format, and so far I haven't been able to find any examples of how to do that. Has someone else already figured this out and published the code?

I have been trying to use http://bazaar.launchpad.net/~kovid/calibre/trunk/view/head:/src/calibre/ebooks/compression/palmdoc.c, but compression doesn't produce identical results, and there are 3-4 discrepancies. I also read one related thread, LZ77 compression of palmdoc.

Related

Trying to identify exact Lempel-Ziv variant of compression algorithm in firmware

I'm currently reverse engineering a firmware that seems to be compressed, but I'm really having a hard time identifying which algorithm it is using.
I have the original uncompressed data dumped from the flash chip; below is some of the human-readable data, uncompressed vs. (supposedly) compressed:
You can get the binary portion here, in case it helps: Link
From what I can tell, it might be using a Lempel-Ziv variant such as LZO, LZF, or LZ4.
gzip and zlib can be ruled out because there would be very little to no human-readable data after compression.
I did try compressing the dumped data with the Lempel-Ziv variants mentioned above using their respective Linux CLI tools, but none of them produced exactly the same output as the "compressed" data.
Another idea I have for now is to try decompressing the data with each algorithm and see what it gives, but this is very difficult due to the lack of headers in the compressed firmware. (Binwalk and signsrch both detected nothing.)
Any suggestion on how I can proceed?

"Priming" or "Training" a compression algorithm to be used for compression/decompression?

I'm trying to work out if there is a compression algorithm that can be trained beforehand, where you can use the trained data to compress and decompress data.
I don't know exactly how compression algorithms work, but I have an inkling that this is possible.
For example, if I compress these lines independently, they wouldn't compress very well.
banana: 1, tree: 2, frog: 3
banana: 7, tree: 9, elephant: 10
If I train the compression algorithm with hundreds of sample lines beforehand, it would compress very well because it would already have a way of mapping "banana" to a code/lookup value.
Pseudocode to help explain my question:
# Compressing side
rip = Rip()
trained = rip.train(data) # once off
send_trained_data_to_clients(trained)
compressed = rip.compress(data)
# And on the other end
rip = Rip()
rip.load_train_data(trained)
data = rip.decompress(compressed)
Is there a common compression algorithm (i.e., one with libraries for popular languages) that lets me do this?
What you are describing, in the parlance of most compression algorithms, would be a preset dictionary for the compressor.
I can't speak for all compression libraries, but zlib definitely supports this -- in the exact way you're imagining -- via the deflateSetDictionary() and inflateSetDictionary() functions. See the zlib manual for details.
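For illustration, here is a minimal sketch of that round trip using the zlib C API from C++. The dictionary and input strings are made up for the example; in practice you would build the dictionary from your training samples, and the zlib manual recommends putting the most common strings toward the end of it. Error checking is omitted for brevity.

#include <zlib.h>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // The "trained" data: strings expected to recur in the real input.
    const std::string dict  = "elephant: frog: tree: banana: ";
    const std::string input = "banana: 7, tree: 9, elephant: 10";

    // Compression side: set the preset dictionary before the first deflate() call.
    z_stream def{};
    deflateInit(&def, Z_BEST_COMPRESSION);
    deflateSetDictionary(&def, reinterpret_cast<const Bytef*>(dict.data()),
                         static_cast<uInt>(dict.size()));
    std::vector<unsigned char> compressed(compressBound(input.size()));
    def.next_in   = reinterpret_cast<Bytef*>(const_cast<char*>(input.data()));
    def.avail_in  = static_cast<uInt>(input.size());
    def.next_out  = compressed.data();
    def.avail_out = static_cast<uInt>(compressed.size());
    deflate(&def, Z_FINISH);
    compressed.resize(compressed.size() - def.avail_out);
    deflateEnd(&def);

    // Decompression side: inflate() stops with Z_NEED_DICT, then we hand it
    // the same dictionary and continue.
    z_stream inf{};
    inflateInit(&inf);
    std::vector<unsigned char> output(1024);
    inf.next_in   = compressed.data();
    inf.avail_in  = static_cast<uInt>(compressed.size());
    inf.next_out  = output.data();
    inf.avail_out = static_cast<uInt>(output.size());
    int ret = inflate(&inf, Z_NO_FLUSH);
    if (ret == Z_NEED_DICT) {
        inflateSetDictionary(&inf, reinterpret_cast<const Bytef*>(dict.data()),
                             static_cast<uInt>(dict.size()));
        ret = inflate(&inf, Z_NO_FLUSH);   // should now reach Z_STREAM_END
    }
    output.resize(output.size() - inf.avail_out);
    inflateEnd(&inf);

    std::cout << std::string(output.begin(), output.end()) << "\n";
}

Both ends must agree on the exact dictionary bytes; with zlib-format streams, inflate() checks the supplied dictionary against the Adler-32 checksum stored in the stream header before continuing.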
It exists, and it is called Lempel-Ziv coding; you can read more here:
http://en.wikipedia.org/wiki/LZ77_and_LZ78
It's one of several 'dictionary'-type lossless compression methods.
LZ is basically what your Zip archiver does.

String Compression Algorithm

I've been wanting to compress a string in C++ and display its compressed state on the console. I've been looking around and can't find anything appropriate. The closest thing I've found so far is this one:
How to simply compress a C++ string with LZMA
However, I can't find the lzma.h header which works with it anywhere.
Basically, I am looking for a function like this:
std::string compressString(std::string uncompressedString){
//Compression Code
return compressedString;
}
The compression algorithm choice doesn't really matter. Can anybody help me find something like this? Thank you in advance! :)
Based on the pointers in the article I'm fairly certain they are using XZ Utils, so download that project and make the produced library available in your project.
However, two caveats:
dumping a compressed string to the console isn't very useful, as that string will contain all possible byte values, most of which aren't displayable on a console
compressing short strings (actually, any small quantity of data) isn't what most general-purpose compressors were designed for; in many cases the compressed result will be as big as, or even bigger than, the input. However, I have no experience with LZMA on small data quantities; an extensive test with data representative of your use case will tell you whether it works as expected.
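If the algorithm really doesn't matter, here is a minimal sketch of the kind of function the question asks for, using zlib's one-shot helpers (compress2()/compressBound() and uncompress()) rather than LZMA; the function names are just placeholders, and error checking is omitted:

#include <zlib.h>
#include <string>

std::string compressString(const std::string& uncompressedString) {
    uLongf destLen = compressBound(uncompressedString.size());
    std::string compressedString(destLen, '\0');
    compress2(reinterpret_cast<Bytef*>(&compressedString[0]), &destLen,
              reinterpret_cast<const Bytef*>(uncompressedString.data()),
              uncompressedString.size(), Z_BEST_COMPRESSION);
    compressedString.resize(destLen);   // shrink to the actual compressed size
    return compressedString;
}

std::string decompressString(const std::string& compressedString, uLongf originalSize) {
    // The caller must know the uncompressed size, so store it alongside the data.
    std::string uncompressedString(originalSize, '\0');
    uncompress(reinterpret_cast<Bytef*>(&uncompressedString[0]), &originalSize,
               reinterpret_cast<const Bytef*>(compressedString.data()),
               compressedString.size());
    uncompressedString.resize(originalSize);
    return uncompressedString;
}

To print the result on a console, hex- or base64-encode it first, since (as noted above) the compressed bytes are mostly non-printable.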
One algorithm I've been playing with that gives good compression on small amounts of data (tested on data chunks sized 300-500 bytes) is range encoding.

Saving image without compressing

I know that JPG, BMP, GIF, and other formats compress images. But can I take a snapshot of the display and save it without compression (in a binary file) programmatically (using C/C++, for example, or something else)?
BMP files aren't compressed by default. See here: https://en.wikipedia.org/wiki/BMP_file_format
http://www.zlib.net is your best solution for lossless compression in C. It's well maintained, used in a host of different software, and compatible with external archivers such as WinZip.
C++ offers wrappers around it such as boost::iostreams::zlib and boost::iostreams::gzip.
Zlib uses the deflate algorithm (RFC 1951); here is a very good explanation of the algorithm: http://www.zlib.net/feldspar.html
The PAM format is uncompressed and really simple to understand.
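As a sketch of how simple it is: a PAM file is just a short text header followed by the raw samples. The example below writes a made-up RGB gradient standing in for a real screen grab (capturing the screen itself is platform-specific and not shown here):

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int width = 256, height = 128;
    std::vector<uint8_t> pixels(width * height * 3);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            uint8_t* p = &pixels[(y * width + x) * 3];
            p[0] = static_cast<uint8_t>(x);   // R
            p[1] = static_cast<uint8_t>(y);   // G
            p[2] = 128;                       // B
        }

    FILE* f = std::fopen("snapshot.pam", "wb");
    // PAM header: plain text, followed directly by the uncompressed sample bytes.
    std::fprintf(f, "P7\nWIDTH %d\nHEIGHT %d\nDEPTH 3\nMAXVAL 255\nTUPLTYPE RGB\nENDHDR\n",
                 width, height);
    std::fwrite(pixels.data(), 1, pixels.size(), f);
    std::fclose(f);
}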

What compression/archive formats support inter-file compression?

This question on archiving PDFs got me wondering -- if I wanted to compress (for archival purposes) lots of files which are essentially small changes made on top of a master template (a letterhead), it seems like huge compression gains could be had with inter-file compression.
Do any of the standard compression/archiving formats support this? AFAIK, all the popular formats focus on compressing each single file.
Several formats do inter-file compression.
The oldest example is .tar.gz; a .tar has no compression but concatenates all the files together, with headers before each file, and a .gz can compress only one file. Both are applied in sequence, and it's a traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.
More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.
Take a look at google's open-vcdiff.
http://code.google.com/p/open-vcdiff/
It is designed for calculating small compressed deltas and implements RFC 3284.
http://www.ietf.org/rfc/rfc3284.txt
Microsoft has an API for doing something similar, sans any semblance of a standard.
In general the algorithms you are looking for are ones based on Bentley/McIlroy:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8470
In particular these algorithms will be a win if the size of the template is larger than the window size (~32k) used by gzip or the block size (100-900k) used by bzip2.
They are used by Google internally inside of their BIGTABLE implementation to store compressed web pages for much the same reason you are seeking them.
Since LZW compression (which pretty much all of them use) involves building a table of repeated character sequences as you go along, such a scheme as you desire would limit you to having to decompress the entire archive at once.
If this is acceptable in your situation, it may be simpler to implement a method which just joins your files into one big file before compression.
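As a rough sketch of that "join, then compress" approach, the fragment below streams several files into a single gzip stream using zlib's gzopen()/gzwrite(). The file names are placeholders, and a real archiver would also record each file's name and size (as tar does) so the archive could be split apart again. Keep in mind the ~32k window caveat mentioned above: shared template content only helps compression if it falls within that window.

#include <zlib.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::vector<std::string> inputs = {"letter1.pdf", "letter2.pdf", "letter3.pdf"};
    gzFile out = gzopen("letters.bin.gz", "wb9");   // 9 = maximum compression

    char buf[1 << 16];
    for (const auto& name : inputs) {
        FILE* in = std::fopen(name.c_str(), "rb");
        if (!in) continue;                          // skip missing files in this sketch
        size_t n;
        // Stream every file into the same gzip stream, so repeated template
        // content from earlier files stays inside the compressor's window.
        while ((n = std::fread(buf, 1, sizeof buf, in)) > 0)
            gzwrite(out, buf, static_cast<unsigned>(n));
        std::fclose(in);
    }
    gzclose(out);
}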