Standardized library to compress data as a sequence of small pieces in C++/C?

I need to compress data for logging by appending short strings to a log file using C++/C. I first tried gzip (zlib), but it builds a symbol table for each short string and actually makes the data longer rather than shorter. I believe what I'm looking for is a static Huffman table. Anyway, I was wondering whether there is a common algorithm for this; I would much rather use a format that anyone can read. I think the answer is no, but this is the place to ask. Thanks.

You should look at the examples/gzlog.[ch] source files in the zlib distribution. That code was written for precisely this purpose. It appends short strings to a growing compressed gzip file.
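For reference, here is a minimal sketch of how that code is typically used. It assumes the gzlog_open/gzlog_write/gzlog_close interface declared in zlib's examples/gzlog.h (you compile examples/gzlog.c into your own program); check the header for the exact signatures and return codes before relying on this.

    /* Sketch: append short strings to a growing gzip log using zlib's
     * examples/gzlog.[ch].  Assumes the gzlog_open/gzlog_write/gzlog_close
     * interface from gzlog.h -- verify the signatures in your copy of zlib. */
    #include <string.h>
    #include "gzlog.h"                      /* from zlib's examples/ directory */

    int main(void)
    {
        gzlog *log = gzlog_open("app.gz");  /* creates or reopens app.gz */
        if (log == NULL)
            return 1;

        const char *line = "short log entry\n";
        gzlog_write(log, (void *)line, strlen(line));  /* append; file stays a valid .gz */

        gzlog_close(log);
        return 0;
    }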

Related

compressing individual lines of text separately using common phrases in a global dictionary

Is there any open source library or algorithm available to look at what phrases or words are most common among individual lines of text in a file and create a global dictionary that would then be used to compress the lines of text separately? Preferably the code if available would be in C or C++.
I found this question, which I think is similar, but it does not have an answer that meets what I am looking for:
compressing a huge set of similar strings
There are three important things to recognize here.
The value of replacing a word by a code depends on its frequency and its length. Replacing "a" isn't worth a lot, even if it appears very often.
Once you've identified the most common words, phrases can be found by looking for occurrences of two common words appearing side by side. (In most grammars, word repetition is fairly rare.)
However, one of the biggest sources of redundancy in text is actually the number of bits needed to encode the next letter given the preceding text, which is typically only about 2. Do you really need word-based compression when letter-based compression is so much easier?
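To make the first point concrete, here is a rough sketch that scores candidate words by frequency times length, which is roughly the quantity that determines how much a dictionary entry saves. The linear-scan table and the threshold are just for illustration, not something you would run on huge files.

    /* Sketch: score words by (frequency * length) to pick dictionary
     * candidates.  Naive linear-scan table -- for illustration only. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_WORDS 10000

    struct entry { char word[64]; int count; };

    int main(void)
    {
        static struct entry tab[MAX_WORDS];
        int n = 0;
        char w[64];

        while (scanf("%63s", w) == 1) {
            int i;
            for (i = 0; i < n && strcmp(tab[i].word, w) != 0; i++)
                ;
            if (i == n && n < MAX_WORDS) {          /* new word */
                strcpy(tab[n].word, w);
                tab[n].count = 0;
                n++;
            }
            if (i < n)
                tab[i].count++;
        }

        /* A word's value as a dictionary entry is roughly how often it occurs
         * times how many bytes each occurrence saves; "a" scores low even if
         * it is very common. */
        for (int i = 0; i < n; i++) {
            long score = (long)tab[i].count * (long)strlen(tab[i].word);
            if (score > 100)                        /* arbitrary demo threshold */
                printf("%-20s count=%d score=%ld\n", tab[i].word, tab[i].count, score);
        }
        return 0;
    }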
I did some more searching, and I think I have found my answer.
I came across this page discussing how to boost text compression with dense codes:
http://mainroach.blogspot.com/2013/08/boosting-text-compression-with-dense.html
That page provided a link to the research paper
http://www.dcc.uchile.cl/~gnavarro/ps/tcj11.pdf
and also to the source code used to do the compression
http://vios.dc.fi.udc.es/codes/download.html
Yes. zlib, an open-source compression library written in C, provides the deflateSetDictionary() and inflateSetDictionary() routines for this purpose. You provide up to 32K of seed data in which the compressor will look for matching strings. The same dictionary needs to reside on both ends. If you are compressing lots of small pieces of data with a lot of commonality, this can greatly improve compression. Your "lines of text" certainly qualify as small pieces of data.
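A hedged sketch of how those two calls fit together (error checking trimmed; the dictionary string here is an invented example of the common substrings you might expect in your lines):

    /* Sketch: compress one short line with a shared preset dictionary using
     * zlib.  The same dictionary bytes must be available to both sides; they
     * are not stored in the compressed output. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    static const char dict[] =
        "GET POST HTTP/1.1 Host: Content-Length: Content-Type: text/html";

    int main(void)
    {
        char line[] = "GET /index.html HTTP/1.1";
        unsigned char comp[256], decomp[256];

        /* compress with the preset dictionary */
        z_stream c;
        memset(&c, 0, sizeof(c));
        deflateInit(&c, Z_BEST_COMPRESSION);
        deflateSetDictionary(&c, (const Bytef *)dict, sizeof(dict) - 1);
        c.next_in   = (Bytef *)line;
        c.avail_in  = (uInt)strlen(line);
        c.next_out  = comp;
        c.avail_out = sizeof(comp);
        deflate(&c, Z_FINISH);
        size_t clen = sizeof(comp) - c.avail_out;
        deflateEnd(&c);
        printf("compressed %zu -> %zu bytes\n", strlen(line), clen);

        /* decompress: inflate asks for the dictionary via Z_NEED_DICT */
        z_stream d;
        memset(&d, 0, sizeof(d));
        inflateInit(&d);
        d.next_in   = comp;
        d.avail_in  = (uInt)clen;
        d.next_out  = decomp;
        d.avail_out = sizeof(decomp);
        int ret = inflate(&d, Z_FINISH);
        if (ret == Z_NEED_DICT) {
            inflateSetDictionary(&d, (const Bytef *)dict, sizeof(dict) - 1);
            ret = inflate(&d, Z_FINISH);
        }
        size_t dlen = sizeof(decomp) - d.avail_out;
        inflateEnd(&d);
        printf("roundtrip: %.*s (ret=%d)\n", (int)dlen, decomp, ret);
        return 0;
    }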

How to check if a char is valid UTF-8 in C++

I need a program that reads the contents of a file and writes into another file only the characters that are valid UTF-8. The problem is that the file may come in any encoding, and its contents may or may not correspond to that encoding.
I know it's a mess, but that's the data I get to work with. The files I need to "clean" can be as big as a couple of terabytes, so I need the program to be as efficient as humanly possible. Currently I'm using a program I wrote in Python, but it takes as long as a week to clean 100 GB.
I was thinking of reading the characters with the wide-character (wchar_t) functions, then treating them as integers and discarding all the values that are not in a certain range. Is this the optimal solution?
Also, what's the most efficient way to read and write files in C/C++?
EDIT: The problem is not the I/O operations; that part of the question is just intended as extra help toward an even quicker program. The real issue is how to identify non-UTF-8 characters quickly. Also, I have already tried parallelization and RAM disks.
UTF-8 is just a nice way of encoding characters and has a very clearly defined structure, so fundamentally it is reasonably simple to read a chunk of memory and validate that it contains UTF-8. Mostly this consists of verifying that certain bit patterns do NOT occur, such as C0, C1, or F5 to FF (depending on position).
It is reasonably simple in C (sorry, I don't speak Python) to write something that does a plain fopen/fread and checks the bit patterns of each byte, although I would recommend finding some code to cut and paste (e.g. http://utfcpp.sourceforge.net/, though I haven't used those exact routines), as there are some caveats and special cases to handle. Just treat the input bytes as unsigned char and bitmask them directly. I would paste what I use, but I'm not in the office.
A C program will rapidly become I/O bound, so the suggestions about I/O apply if you want ultimate performance; however, direct byte inspection like this will be hard to beat if you do it right. UTF-8 is also nice in that you can find sequence boundaries even if you start in the middle of the file, which lends itself nicely to parallel algorithms.
If you build your own, watch for BOMs that might appear at the start of some files.
Links
http://en.wikipedia.org/wiki/UTF-8 a nice, clear overview with a table showing the valid bit patterns.
https://www.rfc-editor.org/rfc/rfc3629 the RFC describing UTF-8.
http://www.unicode.org/ home page of the Unicode Consortium.
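For what it's worth, here is a hedged sketch in C of the byte-pattern approach described above: a filter that copies well-formed UTF-8 sequences from stdin to stdout and drops everything else. The ranges follow RFC 3629, but verify it against the RFC and your real data before trusting it.

    /* Sketch: keep only well-formed UTF-8 sequences (RFC 3629 byte ranges),
     * dropping any byte that cannot start a valid sequence. */
    #include <stdio.h>
    #include <string.h>

    /* Return the length of a valid sequence starting at p (1..4), or 0. */
    static int seq_len(const unsigned char *p, size_t avail)
    {
        unsigned char b = p[0];
        if (b <= 0x7F) return 1;
        if (b >= 0xC2 && b <= 0xDF)
            return (avail >= 2 && (p[1] & 0xC0) == 0x80) ? 2 : 0;
        if (b == 0xE0)                       /* no overlong 3-byte forms */
            return (avail >= 3 && p[1] >= 0xA0 && p[1] <= 0xBF &&
                    (p[2] & 0xC0) == 0x80) ? 3 : 0;
        if ((b >= 0xE1 && b <= 0xEC) || b == 0xEE || b == 0xEF)
            return (avail >= 3 && (p[1] & 0xC0) == 0x80 &&
                    (p[2] & 0xC0) == 0x80) ? 3 : 0;
        if (b == 0xED)                       /* exclude UTF-16 surrogates */
            return (avail >= 3 && p[1] >= 0x80 && p[1] <= 0x9F &&
                    (p[2] & 0xC0) == 0x80) ? 3 : 0;
        if (b == 0xF0)                       /* no overlong 4-byte forms */
            return (avail >= 4 && p[1] >= 0x90 && p[1] <= 0xBF &&
                    (p[2] & 0xC0) == 0x80 && (p[3] & 0xC0) == 0x80) ? 4 : 0;
        if (b >= 0xF1 && b <= 0xF3)
            return (avail >= 4 && (p[1] & 0xC0) == 0x80 &&
                    (p[2] & 0xC0) == 0x80 && (p[3] & 0xC0) == 0x80) ? 4 : 0;
        if (b == 0xF4)                       /* cap at U+10FFFF */
            return (avail >= 4 && p[1] >= 0x80 && p[1] <= 0x8F &&
                    (p[2] & 0xC0) == 0x80 && (p[3] & 0xC0) == 0x80) ? 4 : 0;
        return 0;                            /* 0x80..0xC1, 0xF5..0xFF */
    }

    int main(void)
    {
        static unsigned char buf[1 << 16];
        size_t have = 0;
        for (;;) {
            size_t got = fread(buf + have, 1, sizeof(buf) - have, stdin);
            int last = (got == 0);
            have += got;
            /* hold back up to 3 bytes in case a sequence spans two reads */
            size_t i = 0, limit = last ? have : (have > 3 ? have - 3 : 0);
            while (i < limit) {
                int n = seq_len(buf + i, have - i);
                if (n > 0) { fwrite(buf + i, 1, (size_t)n, stdout); i += (size_t)n; }
                else       i++;              /* drop one bad byte and resync */
            }
            if (last) break;
            memmove(buf, buf + i, have - i);
            have -= i;
        }
        return 0;
    }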
Your best bet, in my opinion, is to parallelize. If you can parallelize the cleaning and clean many chunks simultaneously, the process will be more efficient. I'd look into a framework for parallelization, e.g. MapReduce, where you can multithread the task.
I would look at memory-mapped files. This is something I know from the Microsoft world; I'm not sure whether it exists on Unix etc., but it likely does.
Basically you open the file and point the OS at it, and it loads the file (or a chunk of it) into memory, which you can then access through a pointer. For a 100 GB file, you could map perhaps 1 GB at a time, process it, and then write to a memory-mapped output file.
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366556(v=vs.85).aspx
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366542(v=vs.85).aspx
This should, I would think, be the fastest way to perform big I/O, but you would need to test it to say for sure.
HTH, good luck!
Unix/Linux and any other POSIX-compliant OS support memory mapping (mmap) too.
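On the POSIX side, a minimal sketch looks like this (the Windows equivalent uses CreateFileMapping/MapViewOfFile from the MSDN links above). Chunked mapping and most error handling are left out, and the scan inside is just a placeholder.

    /* Sketch (POSIX): map a file into memory and scan it through a pointer.
     * For terabyte files you would mmap and process a window at a time. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        unsigned char *p = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* placeholder scan: count bytes that can never appear in UTF-8 */
        size_t bad = 0;
        for (off_t i = 0; i < st.st_size; i++)
            if (p[i] == 0xC0 || p[i] == 0xC1 || p[i] >= 0xF5)
                bad++;
        printf("%zu impossible UTF-8 bytes\n", bad);

        munmap(p, (size_t)st.st_size);
        close(fd);
        return 0;
    }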

File compression/checking for data corruption in c/c++

For large files, or other files that are not necessarily text, how can I compress them, and what are the most efficient methods to check for data corruption? Any tutorials on these kinds of algorithms would be greatly appreciated.
For compression, LZO should be helpful. It is easy to use and the library is readily available.
For data corruption checks, a CRC can be used:
http://cppgm.blogspot.in/2008/10/calculation-of-crc.html
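If you end up linking against zlib for the compression side anyway, it already exposes a CRC-32 routine, so you don't need a separate implementation. A minimal sketch, updated block by block so the file never has to fit in memory:

    /* Sketch: CRC-32 of a file using zlib's crc32(), fed incrementally. */
    #include <stdio.h>
    #include <zlib.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) return 1;
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        unsigned char buf[1 << 16];
        uLong crc = crc32(0L, Z_NULL, 0);       /* canonical initial value */
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            crc = crc32(crc, buf, (uInt)n);

        fclose(f);
        printf("CRC-32: %08lx\n", crc);
        return 0;
    }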
For general compression, I would recommend Huffman coding. It's very easy to learn, and a full-featured (two-pass) coder/decoder can be written in under 4 hours if you understand it. Huffman coding is part of DEFLATE, which is part of the .zip format. Once you have that down, learn LZ77, then put them together and make your own DEFLATE implementation.
Alternatively, use zlib, the library behind the DEFLATE compression used in zip and gzip files.
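If you just want to see it working, here is a hedged sketch of zlib's one-shot convenience calls (the streaming deflate API is what you would use for files that don't fit in memory):

    /* Sketch: one-shot compress/uncompress with zlib's convenience API.
     * Real code should size the output buffer with compressBound(). */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const char *msg = "hello hello hello hello hello";
        uLong slen = (uLong)strlen(msg) + 1;    /* include the NUL */

        Bytef comp[256];
        uLongf clen = sizeof(comp);
        if (compress(comp, &clen, (const Bytef *)msg, slen) != Z_OK)
            return 1;

        Bytef back[256];
        uLongf blen = sizeof(back);
        if (uncompress(back, &blen, comp, clen) != Z_OK)
            return 1;

        printf("%lu -> %lu bytes, roundtrip: %s\n",
               (unsigned long)slen, (unsigned long)clen, (char *)back);
        return 0;
    }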
For large files, I wouldn't recommend CRC32 like everyone is telling you. Larger files run into checksum collisions more easily: as a file gets larger, a 32-bit checksum can only detect a limited fraction of the possible corruptions. A fast implementation of a hash, say MD5, would serve you well. Yes, MD5 is cryptographically broken, but considering your question I'm assuming you're not working on a security-sensitive problem.
Hamming codes are a possibility. The idea is to insert a few parity bits for every N bits of data and set each one to 0 or 1 such that the parity over a chosen subset of the data bits and parity bits always comes out the same. When one of those parity checks fails, the pattern of failing checks tells you which bit of data was corrupted.
There are lots of other possibilities, as the previous post says.
http://en.wikipedia.org/wiki/Hamming_code#General_algorithm
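As an illustration of the idea, here is a hedged sketch of the classic Hamming(7,4) code: 4 data bits plus 3 parity bits, able to locate and correct any single flipped bit in the 7-bit codeword.

    /* Sketch: Hamming(7,4).  Parity bits sit at codeword positions 1, 2, 4;
     * data bits at positions 3, 5, 6, 7 (bit i of the word = position i+1). */
    #include <stdio.h>

    static unsigned encode(unsigned nibble)
    {
        unsigned d1 = (nibble >> 0) & 1, d2 = (nibble >> 1) & 1;
        unsigned d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
        unsigned p1 = d1 ^ d2 ^ d4;             /* covers positions 1,3,5,7 */
        unsigned p2 = d1 ^ d3 ^ d4;             /* covers positions 2,3,6,7 */
        unsigned p3 = d2 ^ d3 ^ d4;             /* covers positions 4,5,6,7 */
        return p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
               (d2 << 4) | (d3 << 5) | (d4 << 6);
    }

    static unsigned decode(unsigned cw, int *fixed_pos)
    {
        unsigned b[8];
        for (int i = 1; i <= 7; i++) b[i] = (cw >> (i - 1)) & 1;
        unsigned s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
        unsigned s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
        unsigned s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
        unsigned err = s1 | (s2 << 1) | (s3 << 2);   /* 0 = no error */
        if (err) b[err] ^= 1;                        /* flip the bad bit back */
        if (fixed_pos) *fixed_pos = (int)err;
        return b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3);
    }

    int main(void)
    {
        unsigned data = 0xB;                    /* 4-bit value 1011 */
        unsigned cw = encode(data);
        unsigned damaged = cw ^ (1u << 4);      /* flip codeword position 5 */
        int pos;
        unsigned out = decode(damaged, &pos);
        printf("data=%X codeword=%02X damaged=%02X decoded=%X (corrected position %d)\n",
               data, cw, damaged, out, pos);
        return 0;
    }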

File Compressor In Assembly

In an effort to get better at programming assembly, and as an academic exercise, I would like to write a non-trivial program in x86 assembly. Since file compression has always been kind of an interest to me, I would like to write something like the zip utility in assembly.
I'm not exactly out of my element here, having written a simple web server using assembly and coded for embedded devices, and I've read some of the material for zlib (and others) and played with its C implementation.
My problem is finding a routine that is simple enough to port to assembly. Many of the utilities I've inspected thus far are full of #define's and other included code. Since this is really just for me to play with, I'm not really interested in super-awesome compression ratios or anything like that. I'm basically just looking for the RC4 of compression algorithms.
Is a Huffman Coding the path I should be looking down or does anyone have another suggestion?
A more sophisticated algorithm that should not be too hard to implement is LZ77; there are write-ups with assembly examples, as well as sites cataloguing many different compression algorithms.
One option would be to write a decompressor for DEFLATE (the algorithm behind zip and gzip). zlib's implementation is going to be heavily optimized, but the RFC gives pseudocode for a decoder. After you have learned the compressed format, you can move on to writing a compressor based on it.
I remember a project from second year computing science that was something similar to this (in C).
Basically, compressing involves replacing a run like xxxxx (five x's) with #\005x (a marker character, a byte with a value of 5, and then the repeated byte). This algorithm is very simple. It doesn't work that well for English text, but it works surprisingly well for bitmap images.
Edit: what I am describing is run-length encoding.
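A hedged sketch of that run-length scheme in C (encoder only; it ignores the problem of the marker byte itself appearing in the data, which a real version must escape):

    /* Sketch: naive run-length encoding -- a run of N identical bytes
     * becomes marker, count, byte.  The count is capped at 255 and short
     * runs are passed through unchanged. */
    #include <stdio.h>

    #define MARKER  '#'
    #define MIN_RUN 4                 /* shorter runs aren't worth 3 bytes */

    static void flush(int byte, unsigned run)
    {
        if (run >= MIN_RUN) {
            putchar(MARKER);
            putchar((int)run);
            putchar(byte);
        } else {
            for (unsigned i = 0; i < run; i++)
                putchar(byte);
        }
    }

    int main(void)
    {
        int c, prev = EOF;
        unsigned run = 0;

        while ((c = getchar()) != EOF) {
            if (c == prev && run < 255) {
                run++;
                continue;
            }
            flush(prev, run);         /* emit the finished run */
            prev = c;
            run = 1;
        }
        flush(prev, run);             /* emit the final run */
        return 0;
    }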
Take a look at UPX executable packer. It contains some low-level decompressing code as part of unpacking procedures...

What is the best compression scheme for small data such as 1.66kBytes?

This data is stored in an array (using C++) and consists of repeated 125-bit records, each varying slightly from the others. It also has 8 messages of 12 ASCII characters each at the end. Please suggest whether I should use differential compression within the array, and if so, how.
Or should I apply some other compression scheme to the whole array?
Generally you can compress data that has some sort of predictability or redundancy. Dictionary-based compression (e.g. ZIP-style algorithms) traditionally doesn't work well on small chunks of data because of the need to share the selected dictionary.
In the past, when I have compressed very small chunks of data with somewhat predictable patterns, I have used SharpZipLib with a custom dictionary. Rather than embed the dictionary in the actual data, I hard-coded the dictionary in every program that needs to (de)compress the data. SharpZipLib gives you both options: custom dictionary, and keep dictionary separate from the data.
Again this will only work well if you can predict some patterns to your data ahead of time so that you can create an appropriate compression dictionary, and it's feasible for the dictionary itself to be separate from the compressed data.
You haven't given us enough information to help you. However, I can highly recommend the book Text Compression by Bell, Cleary, and Witten. Don't be fooled by the title; "Text" here just means "lossless"—all the techniques apply to binary data. Because the book is expensive you might try to get it on interlibrary loan.
Also, don't overlook the obvious Burrows-Wheeler (bzip2) or Lempel-Ziv (gzip, zlib) techniques. It's quite possible that one of these techniques will work well for your application, so before investigating alternatives, try compressing your data with standard tools.