File compression with C++

I want to make my own text file compression program. I don't know much about C++ programming, but I have learned the basics, including how to read and write files.
I have searched Google a lot for information about compression and have seen many different methods for compressing a file, like LZW and Huffman coding. The problem is that most of them don't come with source code, or the source is very complicated.
Do you know any good webpages where I can learn about compression and write a program myself?
EDIT:
I will leave this topic open a little longer, since I plan to study this over the next few days; if I have any questions, I'll ask them here.

Most of the algorithms are pretty complex, but they all have one thing in common: they find data that repeats, store it only once, and record enough information to decompress it (that is, to put the repeated segments back in place).
Here is a simple example you can try to implement.
We have this data file
XXXXFGGGJJ
DDDDDDDDAA
XXXXFGGGJJ
Here we have characters that repeat and two lines that repeat, so you could start by finding a way to exploit that repetition to reduce the file size.
Here's the same data after a simple run-length compression:
4XF3G2J
8D2A
4XF3G2J
So we have 4 of X, a single F, 3 of G, and so on; a count is written only when a character repeats.
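Here is a minimal sketch of that scheme in C++ (my own illustration, not code from the original answer). Note that the format is ambiguous if the input itself contains digits; a real format would need an escape rule.

    #include <iostream>
    #include <string>

    // Compress one line: a run of more than one identical character is
    // written as "<count><char>", a single character is written as-is,
    // so "XXXXFGGGJJ" becomes "4XF3G2J".
    std::string compressLine(const std::string& line) {
        std::string out;
        for (std::size_t i = 0; i < line.size(); ) {
            std::size_t run = 1;
            while (i + run < line.size() && line[i + run] == line[i]) ++run;
            if (run > 1) out += std::to_string(run);
            out += line[i];
            i += run;
        }
        return out;
    }

    int main() {
        std::cout << compressLine("XXXXFGGGJJ") << '\n'; // 4XF3G2J
        std::cout << compressLine("DDDDDDDDAA") << '\n'; // 8D2A
        return 0;
    }

Decompression is the mirror image: read an optional count, then repeat the next character that many times.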

You can try this page; it contains a clear walkthrough of the basics of compression and its first principles.

Compression is not the easiest of tasks. I took a college class to learn about compression algorithms like LZW and Huffman, and I can tell you that they're not that easy. If C++ is your first language and you're just getting into this sort of thing, I wouldn't recommend trying to write your own compression algorithm just yet. If you are more experienced, then I would try writing the implementation without any source code being provided to you - that shows that you truly understand the compression algorithm.
This is how I was taught - the professor explained the algorithm in very broad terms, and then either we would implement it (in Java, mind you) or we would answer questions about how the algorithm would behave under certain circumstances. If we could do either of those, then we really knew the algorithm - without him showing us any source at all - it's a good skill to develop ;)

Huffman encoding trees are not too complicated; I'd start with them. Here's a link: Example: Huffman Encoding Trees

Related

Is it feasible to have an algorithm in C++ which calls a Fortran program for the heavy computation parts?

I am developing an algorithm that has a large numerical computation part. My project supervisor recommended that I use Fortran because of this, so for the last few weeks I've been working on it (so far so good). It would be a new version of an algorithm of his, which is basically a lot of numerical computing.
Mine, however, would have more "logic" to it. Without going into much detail, the brute-force approach is done using just Fortran, because it's 95% reading from a file and doing the operations. However, since the aim of the project is to provide an efficient algorithm, I had been thinking about methods and wanted to start with a greedy approach (something like hill climbing), and that got me thinking that for this part in particular, it might be better to write the algorithm in C++ instead of Fortran.
So basically: how hard do you think it would be to develop the algorithm "logic" in C++ and then call Fortran whenever the bulk of the numerical computation has to be performed? Would it be worth it? Or should I just stick with one of the two languages?
Sorry if this is a very ignorant question, but I can't get an idea of whether writing an algorithm such as hill climbing would be harder in Fortran than in C++, or whether the benefits of Fortran in this case would be worth it.
Thanks for your time and have a nice day!
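For what it's worth, the mechanical side of mixing the two languages is small. Here is a minimal sketch of C++ calling a Fortran routine, assuming the Fortran side exposes a C-compatible interface via iso_c_binding; the routine heavy_sum and its signature are hypothetical, made up for illustration.

    // main.cpp
    // Assumed Fortran side (heavy.f90):
    //   subroutine heavy_sum(x, n, total) bind(c, name="heavy_sum")
    //     use iso_c_binding
    //     integer(c_int), value :: n
    //     real(c_double), intent(in) :: x(n)
    //     real(c_double), intent(out) :: total
    //     total = sum(x)
    //   end subroutine
    // Build: gfortran -c heavy.f90 && g++ main.cpp heavy.o -lgfortran
    #include <iostream>
    #include <vector>

    extern "C" void heavy_sum(const double* x, int n, double* total);

    int main() {
        std::vector<double> data(1000, 0.5);   // setup and "logic" live in C++
        double total = 0.0;
        // The heavy numerical kernel is one Fortran call.
        heavy_sum(data.data(), static_cast<int>(data.size()), &total);
        std::cout << "sum = " << total << '\n';
        return 0;
    }

With an interface like this, the hill-climbing logic can stay in C++ and each numerical kernel becomes a single Fortran call.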

What to do first to implement a JPG decoder

I need to understand JPG decompression so that I don't need other libraries that just do it.
After being able to identify the different parts of a JPG file in terms of file format, what do I need to do, understand or learn first in mathematical or algorithmic terms so I can start implementing decoding primitives?
Look at this answer to find all the specifications you need to read, and then read them. Several times. Front to back. Then start to implement, testing often along the way with many example JPEG files.
It wouldn't hurt to know a little bit about Fourier transforms and then the discrete cosine transform, and also how Huffman codes work, though you could pick up much of what you need from the specifications.
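To get a feel for the math, here is a naive 1-D DCT-II in C++, the transform JPEG applies to the rows and columns of each 8x8 block. This is an unnormalized textbook version for illustration only; real decoders use fast factored variants and fold in normalization and quantization scaling.

    #include <array>
    #include <cmath>
    #include <iostream>

    // X[k] = sum over n of x[n] * cos(pi/8 * (n + 0.5) * k)
    std::array<double, 8> dct8(const std::array<double, 8>& x) {
        std::array<double, 8> X{};
        const double pi = std::acos(-1.0);
        for (int k = 0; k < 8; ++k) {
            double sum = 0.0;
            for (int n = 0; n < 8; ++n)
                sum += x[n] * std::cos(pi / 8.0 * (n + 0.5) * k);
            X[k] = sum;
        }
        return X;
    }

    int main() {
        // One row of pixel values; after the DCT most of the energy
        // sits in the first few (low-frequency) coefficients.
        std::array<double, 8> row{52, 55, 61, 66, 70, 61, 64, 73};
        for (double c : dct8(row)) std::cout << c << ' ';
        std::cout << '\n';
        return 0;
    }

A decoder runs the inverse transform (DCT-III), so understanding this forward direction gets you most of the way there.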

Why don't we use word ranks for string compression?

I have 3 main questions:
Let's say I have a large text file. (1) Is replacing the words with their rank an effective way to compress the file? (I got an answer to this question: it's a bad idea.)
Also, I have come up with a new compression algorithm. I read about some widely used compression models and found that they rely on some pretty advanced concepts like statistical redundancy and probabilistic prediction. My algorithm does not use these concepts and is a rather simple set of rules to follow while compressing and decompressing. (2) My question is: am I wasting my time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes?
(3) Furthermore, if I manage to successfully compress a string, can I extend my algorithm to other content like videos, images, etc.?
(I understand that the third question is difficult to answer without knowledge of the compression algorithm. But I am afraid the algorithm is so rudimentary and nascent that I feel ashamed to share it. Please feel free to ignore the third question if you have to.)
Your question doesn't make sense as it stands (see answer #2), but I'll try to rephrase and you can let me know if I capture your question. Would modeling text using the probability of individual words make for a good text compression algorithm? Answer: No. That would be a zeroth order model, and would not be able to take advantage of higher order correlations, such as the conditional probability of a given word following the previous word. Simple existing text compressors that look for matching strings and varied character probabilities would perform better.
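To make the zeroth-order versus higher-order distinction concrete, here is a small illustration (my own example, not from the answer): unigram counts only record how often each word occurs, while bigram counts expose the conditional structure that better text compressors exploit.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    int main() {
        std::string text = "the cat sat on the mat and the cat ran";
        std::map<std::string, int> unigram;                       // zeroth-order: P(word)
        std::map<std::string, std::map<std::string, int>> bigram; // first-order: P(next | previous)

        std::istringstream in(text);
        std::string prev, word;
        while (in >> word) {
            ++unigram[word];
            if (!prev.empty()) ++bigram[prev][word];
            prev = word;
        }

        // "the" is followed by "cat" twice but "mat" only once --
        // structure a per-word ranking scheme cannot exploit.
        for (const auto& [next, n] : bigram["the"])
            std::cout << "the -> " << next << " : " << n << '\n';
        return 0;
    }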
Yes, you are wasting your time trying to come up with a new compression algorithm without having enough knowledge about existing compression schemes. You should first learn about the techniques that have been applied over time to model data, textual and others, and the approaches to use the modeled information to compress the data. You need to study what has already been researched for decades before developing a new approach.
The compression part may extend, but the modeling part won't.
Do you mean having a ranking table of words sorted by frequency and assigning smaller "symbols" to the words that are repeated the most, therefore reducing the amount of information that needs to be transmitted?
That's basically how Huffman coding works. The problem with compression is that you always hit a limit somewhere along the road. Of course, if the set of things you are trying to compress follows a particular pattern or distribution, then it's possible to be really efficient about it; but for general-purpose data (audio/video/text/encrypted data that appears to be random) there is no (and I believe there can't be) "best" compression technique.
Huffman coding uses the frequencies of individual letters. You can do the same with words, or with letter frequencies in more dimensions, i.e., combinations of letters and their frequencies.
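As a concrete illustration of that idea, here is a minimal sketch of the classic Huffman construction over single characters: count frequencies, then repeatedly merge the two least frequent nodes so that frequent symbols end up with short codes. (This builds the code table only; a real compressor would also pack the bits and store the tree.)

    #include <iostream>
    #include <map>
    #include <memory>
    #include <queue>
    #include <string>
    #include <vector>

    struct Node {
        char symbol;                       // meaningful only for leaves
        std::size_t freq;
        std::unique_ptr<Node> left, right;
    };

    struct ByFreq {
        bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
    };

    // Walk the tree: '0' for a left edge, '1' for a right edge.
    void assignCodes(const Node* n, const std::string& prefix,
                     std::map<char, std::string>& codes) {
        if (!n->left && !n->right) {
            codes[n->symbol] = prefix.empty() ? "0" : prefix;
            return;
        }
        assignCodes(n->left.get(), prefix + '0', codes);
        assignCodes(n->right.get(), prefix + '1', codes);
    }

    int main() {
        std::string text = "this is an example of a huffman tree";

        std::map<char, std::size_t> freq;
        for (char c : text) ++freq[c];

        // Min-heap by frequency; merge the two cheapest nodes until one root remains.
        std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
        for (const auto& [c, f] : freq) pq.push(new Node{c, f, nullptr, nullptr});
        while (pq.size() > 1) {
            Node* a = pq.top(); pq.pop();
            Node* b = pq.top(); pq.pop();
            pq.push(new Node{'\0', a->freq + b->freq,
                             std::unique_ptr<Node>(a), std::unique_ptr<Node>(b)});
        }

        std::map<char, std::string> codes;
        assignCodes(pq.top(), "", codes);
        for (const auto& [c, code] : codes)
            std::cout << "'" << c << "' -> " << code << '\n';

        delete pq.top();  // children are freed recursively via unique_ptr
        return 0;
    }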

File Compressor In Assembly

In an effort to get better at programming assembly, and as an academic exercise, I would like to write a non-trivial program in x86 assembly. Since file compression has always been kind of an interest to me, I would like to write something like the zip utility in assembly.
I'm not exactly out of my element here, having written a simple web server using assembly and coded for embedded devices, and I've read some of the material for zlib (and others) and played with its C implementation.
My problem is finding a routine that is simple enough to port to assembly. Many of the utilities I've inspected thus far are full of #define's and other included code. Since this is really just for me to play with, I'm not really interested in super-awesome compression ratios or anything like that. I'm basically just looking for the RC4 of compression algorithms.
Is a Huffman Coding the path I should be looking down or does anyone have another suggestion?
Here is a more sophisticated algorithm which should not be too hard to implement: LZ77 (this link contains assembly examples) or LZ77 (this site contains many different compression algorithms).
One option would be to write a decompressor for DEFLATE (the algorithm behind zip and gzip). zlib's implementation is going to be heavily optimized, but the RFC gives pseudocode for a decoder. After you have learned the compressed format, you can move on to writing a compressor based on it.
I remember a project from second year computing science that was something similar to this (in C).
Basically, compressing involves replacing a run such as xxxxx (5 x's) with #\005x (the escape character '#', a byte with the value 5, and then the repeated byte). This algorithm is very simple. It doesn't work that well for English text, but works surprisingly well for bitmap images.
Edit: what I am describing is run length encoding.
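A sketch of that escape-byte scheme (using '#' as the escape and skipping the question of a literal '#' in the input, which a real format would also have to escape):

    #include <iostream>
    #include <string>

    const char ESC = '#';

    // A run of N identical bytes (4 <= N <= 255) becomes <ESC><N><byte>;
    // everything else is copied through unchanged.
    std::string rleCompress(const std::string& in) {
        std::string out;
        for (std::size_t i = 0; i < in.size(); ) {
            std::size_t run = 1;
            while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
            if (run >= 4) {                  // the 3-byte encoding only pays off for runs of 4+
                out += ESC;
                out += static_cast<char>(run);
                out += in[i];
            } else {
                out.append(run, in[i]);
            }
            i += run;
        }
        return out;
    }

    int main() {
        std::string data = "xxxxxyz";
        std::string packed = rleCompress(data);
        std::cout << packed.size() << " bytes (from " << data.size() << ")\n";  // 5 bytes (from 7)
        return 0;
    }

Bitmap images compress well here because they tend to contain long runs of identical bytes; English text rarely does.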
Take a look at the UPX executable packer. It contains some low-level decompression code as part of its unpacking procedures...

Full description of a compression algorithm

I am looking for a compression algorithm (for a programming competition) and I need a full description of how to implement it (all the technical details). Any lossless and patent-free algorithm will do, but ease of implementation is a bonus :)
(Although possibly irrelevant) I plan to implement the algorithm in C++...
Thanks in advance.
EDIT:
I will be compressing text files only, no other file types...
Well, I can't go so far as to complete the competition for you, but please check out this Wikipedia article: Run Length Encoding. It is by far one of the simplest ways to compress data, albeit not always an efficient one. Compression is also domain-specific; even among lossless algorithms, you will find that what you are compressing determines how best to encode it.
RFC 1951 describes inflate/deflate, including a brief description of the compressor's algorithm. Antaeus Feldspar's An Explanation of the Deflate Algorithm provides a bit more background.
Also, the zlib source distribution contains a simplified reference inflater in contrib/puff/puff.c that can be helpful reading for understanding exactly how the bits are arranged (but it contains only an inflater, not a deflater).
I'd start here on Wikipedia.
There's a whole lot to choose from, but without knowing more about what you want, it's difficult to help further. Are you compressing text, images, video or just random files? Each one has its own set of techniques and challenges for optimal results.
If ease of implementation is the sole criterion I'd use "filecopy" compression. Guaranteed compression ratio of exactly 1:1, and trivial implementation...
Huffman is good if you're compressing plain text. And all the commenters below assure me it's a joy to implement ;D
Ease of implementation: Huffman, as stated before. I believe LZW is no longer under patent, but I don't know for sure; it's a relatively simple algorithm. LZ77 should be available, though. Lastly, the Burrows-Wheeler transform allows for compression, but it's significantly more difficult to implement.
I like this introduction to the Burrows-Wheeler Transform.
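For a feel of how it works, here is the naive forward transform: append a unique end marker, form every rotation of the string, sort them, and keep the last column. (Real implementations use suffix arrays rather than materializing all rotations, and the inverse transform is where most of the subtlety lives.)

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    std::string bwt(const std::string& s) {
        std::string t = s + '\x03';  // end marker, assumed absent from the input
        std::vector<std::string> rotations;
        for (std::size_t i = 0; i < t.size(); ++i)
            rotations.push_back(t.substr(i) + t.substr(0, i));
        std::sort(rotations.begin(), rotations.end());
        std::string lastColumn;
        for (const auto& r : rotations) lastColumn += r.back();
        return lastColumn;  // groups identical characters, helping later RLE/Huffman stages
    }

    int main() {
        std::cout << bwt("banana") << '\n';  // "annb<0x03>aa"
        return 0;
    }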
If you go under "View" in your internet browser, there should be an option to either "Zoom Out" or make the text smaller.
Select one of those and...
BAM!
You just got more text on the same screen! Yay compression!
The Security Now! podcast recently put out an episode highlighting data compression algorithms. Steve Gibson gives a pretty good explanation of the basics of Huffman and Lempel-Ziv compression techniques. You can listen to the audio podcast or read the transcript for Episode 205.