File Compressor In Assembly - compression

In an effort to get better at programming assembly, and as an academic exercise, I would like to write a non-trivial program in x86 assembly. Since file compression has always been kind of an interest to me, I would like to write something like the zip utility in assembly.
I'm not exactly out of my element here, having written a simple web server using assembly and coded for embedded devices, and I've read some of the material for zlib (and others) and played with its C implementation.
My problem is finding a routine that is simple enough to port to assembly. Many of the utilities I've inspected thus far are full of #define's and other included code. Since this is really just for me to play with, I'm not really interested in super-awesome compression ratios or anything like that. I'm basically just looking for the RC4 of compression algorithms.
Is Huffman coding the path I should be looking down, or does anyone have another suggestion?

Here is a more sophisticated algorithm that should not be too hard to implement: LZ77 (containing assembly examples) or LZ77 (this site contains many different compression algorithms).
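If it helps to have something concrete to port, here is a rough, brute-force LZ77 sketch in C++. It is my own simplification (not code from either linked page): greedy matching over a small sliding window, emitting (offset, length, next byte) triples. Once the output format is clear, translating the inner matching loop to x86 assembly is fairly mechanical.

#include <cstdio>
#include <string>
#include <vector>

// An (offset, length, next) triple: copy `length` bytes starting `offset` bytes back,
// then append `next`. A decompressor only needs to replay these copies.
struct Token { int offset; int length; char next; };

// Greedy LZ77 with a small sliding window; O(n * window) brute-force matching.
std::vector<Token> lz77_compress(const std::string& s, int window = 255) {
    std::vector<Token> out;
    size_t pos = 0;
    while (pos < s.size()) {
        int best_len = 0, best_off = 0;
        size_t start = pos > (size_t)window ? pos - window : 0;
        for (size_t cand = start; cand < pos; ++cand) {
            int len = 0;
            while (pos + len + 1 < s.size() && s[cand + len] == s[pos + len])
                ++len;
            if (len > best_len) { best_len = len; best_off = (int)(pos - cand); }
        }
        char next = s[pos + best_len];        // literal byte that follows the match
        out.push_back({best_off, best_len, next});
        pos += best_len + 1;
    }
    return out;
}

int main() {
    for (const Token& t : lz77_compress("abracadabra abracadabra"))
        std::printf("(%d,%d,'%c') ", t.offset, t.length, t.next);
    std::printf("\n");
}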

One option would be to write a decompressor for DEFLATE (the algorithm behind zip and gzip). zlib's implementation is going to be heavily optimized, but the RFC gives pseudocode for a decoder. After you have learned the compressed format, you can move on to writing a compressor based on it.
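To give a feel for the very first step of such a decoder: RFC 1951 packs data elements starting from the least significant bit of each byte, so you need an LSB-first bit reader before anything else. The sketch below is my own illustration in C++ (not puff.c or zlib code), reading the BFINAL and BTYPE fields from the first byte of a block.

#include <cstdint>
#include <cstdio>
#include <vector>

// Minimal LSB-first bit reader of the kind a DEFLATE decoder needs.
struct BitReader {
    const std::vector<uint8_t>& data;
    size_t byte_pos = 0;
    int bit_pos = 0;   // next bit (0..7) to read within the current byte

    explicit BitReader(const std::vector<uint8_t>& d) : data(d) {}

    // Read `count` bits, least significant bit first (no bounds checking in this sketch).
    uint32_t read(int count) {
        uint32_t value = 0;
        for (int i = 0; i < count; ++i) {
            uint32_t bit = (data[byte_pos] >> bit_pos) & 1u;
            value |= bit << i;
            if (++bit_pos == 8) { bit_pos = 0; ++byte_pos; }
        }
        return value;
    }
};

int main() {
    // The first byte of a DEFLATE block holds BFINAL (1 bit) then BTYPE (2 bits).
    std::vector<uint8_t> stream = {0x05};          // 00000101: BFINAL=1, BTYPE=2
    BitReader br(stream);
    uint32_t bfinal = br.read(1);
    uint32_t btype = br.read(2);                   // 2 means "dynamic Huffman codes"
    std::printf("BFINAL=%u BTYPE=%u\n", (unsigned)bfinal, (unsigned)btype);
}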

I remember a project from second year computing science that was something similar to this (in C).
Basically, compressing involves replacing a run such as xxxxx (5 x's) with #\005x (a marker character, a byte with the value 5, and then the repeated byte). This algorithm is very simple. It doesn't work that well for English text, but works surprisingly well for bitmap images.
Edit: what I am describing is run length encoding.
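For what it's worth, a run-length encoder along these lines fits in a few lines of C++. The '#' marker and the run-length threshold of 4 are my own choices for the sketch; a real format would also need to escape the marker byte when it occurs in the data.

#include <cstdio>
#include <string>

// Toy run-length encoder in the style described above: a run of 4..255 identical
// bytes becomes MARKER, a count byte, then the repeated byte.
const char MARKER = '#';

std::string rle_encode(const std::string& in) {
    std::string out;
    size_t i = 0;
    while (i < in.size()) {
        size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255)
            ++run;
        if (run >= 4) {                 // only worth encoding runs of 4 or more
            out += MARKER;
            out += (char)run;
            out += in[i];
        } else {
            out.append(run, in[i]);     // short runs are copied through unchanged
        }
        i += run;
    }
    return out;
}

int main() {
    std::string input = "xxxxxyz";
    std::string encoded = rle_encode(input);
    std::printf("%zu bytes -> %zu bytes\n", input.size(), encoded.size());
}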

Take a look at the UPX executable packer. It contains some low-level decompression code as part of its unpacking procedures...

Related

What to do first to implement a JPG decoder

I want to understand JPG decompression so that I don't have to depend on other libraries that just do it.
After being able to identify the different parts of a JPG file in terms of file format, what do I need to do, understand or learn first in mathematical or algorithmic terms so I can start implementing decoding primitives?
Look at this answer to find all the specifications you need to read, and then read them. Several times. Front to back. Then start to implement, testing often along the way with many example JPEG files.
It wouldn't hurt to know a little bit about Fourier transforms and then the discrete cosine transform, and also how Huffman codes work, though you could pick up much of what you need from the specifications.
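As a taste of the math involved, here is a naive 8x8 forward DCT-II in C++, the transform JPEG is built on, written in the textbook O(n^4) form rather than the fast factorizations real codecs use (a decoder needs the inverse transform, but the structure is the same). This is only an illustration of the formula, not code from any JPEG library.

#include <cmath>
#include <cstdio>

// Naive 8x8 forward DCT-II. Input samples are assumed to have been level-shifted
// to roughly -128..127, as JPEG does before the transform.
void dct8x8(const double in[8][8], double out[8][8]) {
    const double pi = 3.14159265358979323846;
    for (int u = 0; u < 8; ++u) {
        for (int v = 0; v < 8; ++v) {
            double cu = (u == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < 8; ++x)
                for (int y = 0; y < 8; ++y)
                    sum += in[x][y]
                         * std::cos((2 * x + 1) * u * pi / 16.0)
                         * std::cos((2 * y + 1) * v * pi / 16.0);
            out[u][v] = 0.25 * cu * cv * sum;
        }
    }
}

int main() {
    double block[8][8], coeff[8][8];
    for (int x = 0; x < 8; ++x)
        for (int y = 0; y < 8; ++y)
            block[x][y] = ((x + y) % 2 ? 50.0 : -50.0);   // toy checkerboard block
    dct8x8(block, coeff);
    std::printf("DC: %.2f  highest-frequency AC: %.2f\n", coeff[0][0], coeff[7][7]);
}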

File compression/checking for data corruption in c/c++

For large files or other files that are not necessarily text, how can I compress them, and what are the most efficient methods to check for data corruption? Any tutorials on these kinds of algorithms would be greatly appreciated.
For compression, LZO should be helpful. Easy to use and library easily available.
For data corruption checks, a CRC can be used:
http://cppgm.blogspot.in/2008/10/calculation-of-crc.html
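For reference, the whole CRC-32 algorithm (the reflected 0xEDB88320 polynomial used by zip, gzip and PNG) fits in a dozen lines of C++. Table-driven versions are much faster; this bitwise sketch just shows what is actually being computed.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Bitwise CRC-32: invert, fold in each byte bit by bit, invert again.
uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}

int main() {
    const char* msg = "123456789";
    // The standard check value for this input is 0xCBF43926.
    std::printf("crc32(\"%s\") = 0x%08X\n", msg,
                (unsigned)crc32((const uint8_t*)msg, std::strlen(msg)));
}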
For general compression, I would recommend Huffman coding. It's very easy to learn, a full-featured (2-pass) coder/decoder can be written in <4 hours if you understand it. It is part of DEFLATE which is part of the .zip format. Once you have that down, learn LZ77, then put them together and make your own DEFLATE implementation.
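As a starting point, here is a minimal C++ sketch of the core of a Huffman coder: building the tree from byte frequencies and reading off the codes. It is only an illustration (no bit-level output, and the tree nodes are leaked for brevity); a real 2-pass coder would serialize the code table and then emit the code bits.

#include <cstdint>
#include <cstdio>
#include <queue>
#include <string>
#include <vector>

// A node in the Huffman tree: leaves carry a symbol, internal nodes carry two children.
struct Node {
    uint64_t freq;
    int symbol;                  // -1 for internal nodes
    Node *left, *right;
};

struct Cmp {                     // min-heap on frequency
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the finished tree and record the bit string assigned to each leaf symbol.
static void assign_codes(const Node* n, const std::string& prefix,
                         std::vector<std::string>& codes) {
    if (n->symbol >= 0) { codes[n->symbol] = prefix.empty() ? "0" : prefix; return; }
    assign_codes(n->left, prefix + "0", codes);
    assign_codes(n->right, prefix + "1", codes);
}

int main() {
    // Pass 1: count byte frequencies of a toy input.
    std::string input = "this is an example for huffman encoding";
    uint64_t freq[256] = {0};
    for (unsigned char c : input) ++freq[c];

    std::priority_queue<Node*, std::vector<Node*>, Cmp> pq;
    for (int s = 0; s < 256; ++s)
        if (freq[s]) pq.push(new Node{freq[s], s, nullptr, nullptr});

    // Repeatedly merge the two least frequent nodes until one tree remains.
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});
    }

    std::vector<std::string> codes(256);
    assign_codes(pq.top(), "", codes);

    // Frequent symbols get short codes, rare ones get long codes.
    for (int s = 0; s < 256; ++s)
        if (freq[s])
            std::printf("'%c' x%llu -> %s\n", s, (unsigned long long)freq[s],
                        codes[s].c_str());
}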
Alternatively, use zlib, the library everyone uses for zip files.
For large files, I wouldn't recommend CRC32 like everyone is telling you. Larger files suffer from 32-bit collisions pretty easily: as a file gets larger, a 32-bit checksum catches a shrinking fraction of the possible corruptions. A fast implementation of a hash, say MD5, would serve you well. Yes, MD5 is cryptographically broken, but given your question, I'm assuming you're not working on a security-sensitive problem.
Hamming codes are a possibility. The idea is to insert a few parity bits among every N bits of data, setting each to 0 or 1 so that particular subsets of the data bits and parity bits always have a fixed parity. When a check fails, the pattern of failing checks tells you which bit was corrupted (a small Hamming(7,4) sketch follows the link below).
There are lots of other possibilities, as the previous post says.
http://en.wikipedia.org/wiki/Hamming_code#General_algorithm
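As a concrete illustration of that idea, here is a small Hamming(7,4) encoder/decoder in C++. It is my own sketch of the textbook layout (parity bits at positions 1, 2 and 4); it corrects any single flipped bit in a 7-bit codeword.

#include <cstdint>
#include <cstdio>

// Hamming(7,4): parity bits at positions 1, 2 and 4, data bits d0..d3 at
// positions 3, 5, 6 and 7 (positions are 1-indexed).

// Encode 4 data bits (d0 = least significant) into a 7-bit codeword,
// packed with position 1 in bit 0 ... position 7 in bit 6.
uint8_t hamming74_encode(uint8_t data) {
    int d0 = (data >> 0) & 1, d1 = (data >> 1) & 1;
    int d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    int p1 = d0 ^ d1 ^ d3;   // covers positions 1,3,5,7
    int p2 = d0 ^ d2 ^ d3;   // covers positions 2,3,6,7
    int p4 = d1 ^ d2 ^ d3;   // covers positions 4,5,6,7
    return p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) | (d1 << 4) | (d2 << 5) | (d3 << 6);
}

// Decode a 7-bit codeword, correcting at most one flipped bit; returns the 4 data bits.
uint8_t hamming74_decode(uint8_t code) {
    auto bit = [&](int pos) { return (code >> (pos - 1)) & 1; };   // 1-indexed position
    int s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
    int s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
    int s4 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
    int syndrome = s1 | (s2 << 1) | (s4 << 2);   // 0 means no detected error
    if (syndrome) code ^= 1 << (syndrome - 1);   // the syndrome is the bad bit's position
    return ((code >> 2) & 1) | (((code >> 4) & 1) << 1)
         | (((code >> 5) & 1) << 2) | (((code >> 6) & 1) << 3);
}

int main() {
    for (unsigned d = 0; d < 16; ++d) {
        uint8_t c = hamming74_encode((uint8_t)d);
        uint8_t corrupted = c ^ (1 << 3);        // flip the bit at position 4
        std::printf("%X -> %02X, corrupted copy decodes back to %X\n",
                    d, (unsigned)c, (unsigned)hamming74_decode(corrupted));
    }
}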

File compression with C++

I want to make my own text file compression program. I don't know much about C++ programming, but I have learned all the basics, including reading and writing files.
I have searched on Google a lot about compression and have seen many different kinds of methods to compress a file, like LZW and Huffman. The problem is that most of them don't have source code, or they have a very complicated one.
I want to ask if you know any good webpages where I can learn how to make a compression program myself?
EDIT:
I will let this topic be open for a little longer, since I plan to study this the next few days, and if I have any questions, I'll ask them here.
Most of the algorithms are pretty complex, but they all have one thing in common: they take data that repeats, store it only once, and keep a record of how to uncompress it (putting the repeated segments back in place).
Here is a simple example you can try to implement.
We have this data file
XXXXFGGGJJ
DDDDDDDDAA
XXXXFGGGJJ
Here we have characters that repeat and two lines that repeat, so you could start by finding a way to reduce the file size.
Here's the same data after a simple run-length compression:
4XF3G2J
8D2A
4XF3G2J
So we have 4 of X, one of F, 3 of G etc.
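If you want something to experiment with, here is a small C++ decoder for exactly that digit-count format. It is my own sketch; note the toy format is ambiguous if the original data itself contains digits, so a real format needs an escape mechanism.

#include <cctype>
#include <iostream>
#include <string>

// Expand one line of the digit-count format shown above ("4XF3G2J" -> "XXXXFGGGJJ").
std::string expand(const std::string& line) {
    std::string out;
    size_t i = 0;
    while (i < line.size()) {
        int count = 0;
        while (i < line.size() && std::isdigit((unsigned char)line[i]))
            count = count * 10 + (line[i++] - '0');
        if (count == 0) count = 1;          // no digits means a single occurrence
        if (i < line.size()) out.append(count, line[i++]);
    }
    return out;
}

int main() {
    std::cout << expand("4XF3G2J") << "\n";  // XXXXFGGGJJ
    std::cout << expand("8D2A") << "\n";     // DDDDDDDDAA
}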
You can try this page; it contains a clear walk-through of the basics of compression and its first principles.
Compression is not the easiest task. I took a college class to learn about compression algorithms like LZW and Huffman, and I can tell you that they're not that easy. If C++ is your first language and you're just starting out with this sort of thing, I wouldn't recommend trying to write your own compression algorithm just yet. If you are more experienced, then try writing the source yourself without any code being provided to you - that shows you truly understand the compression algorithm.
This is how I was taught - the professor explained the algorithm in very broad terms, and then either we would implement it (in Java, mind you) or we would answer questions about how the algorithm would behave under certain circumstances. If we could do either of those, then we really knew the algorithm - without him showing us any source at all - it's a good skill to develop ;)
Huffman encoding trees are not too complicated; I'd start with them. Here's a link: Example: Huffman Encoding Trees

Full description of a compression algorithm

I am looking for a compression algorithm (for a programming competition) and I need a full description of how to implement it (all technical details). Any lossless and patent-free algorithm will do, but ease of implementation is a bonus :)
(Although possibly irrelevant) I plan to implement the algorithm in C++...
Thanks in advance.
EDIT:
I will be compressing text files only, no other file types...
Well, I can't go so far as to complete the competition for you, but please check out this article on wiki: Run Length Encoding. It is one of the simplest ways to compress data, albeit not always an efficient one. Compression is also domain specific; even amongst lossless algorithms, what you are compressing determines how best to encode it.
RFC 1951 describes inflate/deflate, including a brief description of the compressor's algorithm. Antaeus Feldspar's An Explanation of the Deflate Algorithm provides a bit more background.
Also, the zlib source distribution contains a simplified reference inflater in contrib/puff/puff.c that can be helpful reading to understand exactly how the bits are arranged (but it doesn't contain a deflate, only inflate).
I'd start here on Wikipedia.
There's a whole lot to choose from, but without knowing more about what you want, it's difficult to help more. Are you compressing text, images, video or just random files? Each one has its own set of techniques and challenges for optimal results.
If ease of implementation is the sole criterion I'd use "filecopy" compression. Guaranteed compression ratio of exactly 1:1, and trivial implementation...
Huffman is good if you're compressing plain text. And all the commenters below assure me it's a joy to implement ;D
Ease of implementation: Huffman, as stated before. I believe LZW is no longer under patent, but I don't know for sure. It's a relatively simple algorithm. LZ77 should be available, though. Lastly, the Burrows-Wheeler transform allows for compression, but it's significantly more difficult to implement.
I like this introduction to the Burrows-Wheeler Transform.
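To see how simple the forward transform is, here is a naive BWT in C++ (my own sketch, not from the linked introduction). It builds and sorts all rotations, which is far too slow for real inputs, where suffix arrays are used instead; the point is only to show what the output looks like.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Naive forward BWT: build all rotations, sort them, emit the last column.
// A '$' terminator (assumed not to occur in the input) marks the original rotation
// so the transform can be inverted.
std::string bwt(const std::string& input) {
    std::string s = input + '$';
    std::vector<std::string> rotations;
    for (size_t i = 0; i < s.size(); ++i)
        rotations.push_back(s.substr(i) + s.substr(0, i));
    std::sort(rotations.begin(), rotations.end());
    std::string last;
    for (const std::string& r : rotations)
        last += r.back();
    return last;
}

int main() {
    std::cout << bwt("banana") << "\n";  // prints "annb$aa" - like characters cluster
}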
If you go under "View" in your internet browser, there should be an option to either "Zoom Out" or make the text smaller.
Select one of those and...
BAM!
You just got more text on the same screen! Yay compression!
The Security Now! podcast recently put out an episode highlighting data compression algorithms. Steve Gibson gives a pretty good explanation of the basics of Huffman and Lempel-Ziv compression techniques. You can listen to the audio podcast or read the transcript for Episode 205.

What is the current state of text-only compression algorithms?

In honor of the Hutter Prize,
what are the top algorithms (and a quick description of each) for text compression?
Note: The intent of this question is to get a description of compression algorithms, not of compression programs.
The boundary-pushing compressors combine algorithms for insane results. Common algorithms include:
The Burrows-Wheeler Transform (see also here) - shuffle characters (or other bit blocks) with a predictable algorithm to increase repeated blocks, which makes the source easier to compress. Decompression occurs as normal, and the result is un-shuffled with the reverse transform. Note: BWT alone doesn't actually compress anything; it just makes the source easier to compress.
Prediction by Partial Matching (PPM) - an evolution of arithmetic coding where the prediction model (context) is built by gathering statistics about the source rather than using static probabilities. Even though its roots are in arithmetic coding, the result can be represented with Huffman encoding or a dictionary as well as arithmetic coding. (A tiny counting-model sketch in this spirit appears below.)
Context Mixing - Arithmetic coding uses a static context for prediction, PPM dynamically chooses a single context, Context Mixing uses many contexts and weighs their results. PAQ uses context mixing. Here's a high-level overview.
Dynamic Markov Compression - related to PPM but uses bit-level contexts versus byte or longer.
In addition, the Hutter prize contestants may replace common text with small-byte entries from external dictionaries and differentiate upper and lower case text with a special symbol versus using two distinct entries. That's why they're so good at compressing text (especially ASCII text) and not as valuable for general compression.
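To make the statistical-modelling idea behind PPM and context mixing slightly more concrete, here is a toy order-1 counting model in C++. It is my own illustration, not code from any of the programs above; a real compressor would feed these probabilities into an arithmetic coder rather than just printing them.

#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

// For each 1-byte context, count which byte followed it, and turn the counts into
// a Laplace-smoothed probability for the next byte.
struct Order1Model {
    std::array<std::array<uint32_t, 256>, 256> counts{};   // counts[context][symbol]
    std::array<uint32_t, 256> totals{};

    void update(uint8_t context, uint8_t symbol) {
        ++counts[context][symbol];
        ++totals[context];
    }
    double predict(uint8_t context, uint8_t symbol) const {
        return (counts[context][symbol] + 1.0) / (totals[context] + 256.0);
    }
};

int main() {
    static Order1Model model;                 // static: the count table is 256 KB
    std::string text = "the theory of the thing";
    for (size_t i = 1; i < text.size(); ++i)
        model.update((uint8_t)text[i - 1], (uint8_t)text[i]);

    // 'h' is far more likely than 'z' after 't' in this text, and the model reflects that.
    std::printf("P('h' after 't') = %.3f\n", model.predict('t', 'h'));
    std::printf("P('z' after 't') = %.3f\n", model.predict('t', 'z'));
}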
Maximum Compression is a pretty cool text and general compression benchmark site. Matt Mahoney publishes another benchmark. Mahoney's may be of particular interest because it lists the primary algorithm used per entry.
There's always lzip.
All kidding aside:
Where compatibility is a concern, PKZIP (DEFLATE algorithm) still wins.
bzip2 is the best compromise between enjoying a relatively broad install base and a rather good compression ratio, but it requires a separate archiver.
7-Zip (LZMA algorithm) compresses very well and is available under the LGPL. Few operating systems ship with built-in support, however.
rzip is a variant of bzip2 that in my opinion deserves more attention. It could be particularly interesting for huge log files that need long-term archiving. It also requires a separate archiver.
If you want to use PAQ as a program, you can install the zpaq package on Debian-based systems. Usage is (see also man zpaq):
zpaq c archivename.zpaq file1 file2 file3
Compression was to about 1/10th of a zip file's size (1.9M vs 15M).