LZ77: storing format - compression

I started writing a small program that compresses a single file using the LZ77 compression algorithm. It works fine. Now I'm thinking about how to store the data. In LZ77, compressed data consists of a series of triplets. Each triplet has the following format:
<"start reading at n. positions backwards", "go ahead for n. positions", "next character">
What would be a good way to store these triplets? I thought about <11, 5, 8> bits, which gives:
2048 positions for look backward
32 max length of matched string
next character is 1 byte.
This format works quite well for text compression, but it performs badly for my purpose (video made of binary images); it even increases the size compared to the original file. Do you have any suggestions?

What I think you mean is more like: <go back n, copy k, insert literal byte>.
You need to look at the statistics of your matches. You are likely getting many literal bytes with zero-length matches. For that case, a good start would be to use a single bit to decide between a match and no match. If the bit is a one, then it is followed by a distance, a length, and a literal byte. If it is a zero, it is followed by only a literal byte.
You can do better still by Huffman coding the literals, lengths, and distances. The lengths and literal could be combined into a single code, as deflate does, to remove even the one bit.
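As a rough illustration of the one-bit flag idea, here is a minimal sketch in Python. The token shapes ("lit" and "match" tuples), the function names, and the MSB-first bit order are assumptions made for the example; the 11/5/8 bit widths are the ones from the question.

```python
def pack_tokens(tokens, dist_bits=11, len_bits=5):
    """Pack LZ77 tokens into bytes using a 1-bit match/literal flag.

    Each token is either ("lit", byte) or ("match", distance, length, byte).
    The token layout is hypothetical, chosen only for this sketch.
    """
    bits = []  # list of 0/1 ints; simple rather than fast

    def put(value, width):
        # Append `value` as `width` bits, most significant bit first.
        for shift in range(width - 1, -1, -1):
            bits.append((value >> shift) & 1)

    for tok in tokens:
        if tok[0] == "lit":
            put(0, 1)                 # flag 0: only a literal byte follows
            put(tok[1], 8)
        else:
            _, distance, length, literal = tok
            put(1, 1)                 # flag 1: distance, length, literal follow
            put(distance, dist_bits)
            put(length, len_bits)
            put(literal, 8)

    bits.extend([0] * (-len(bits) % 8))  # pad with zeros to a whole byte
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


def unpack_tokens(data, count, dist_bits=11, len_bits=5):
    """Inverse of pack_tokens; `count` is the number of tokens to read."""
    pos = 0  # current bit position in `data`

    def get(width):
        nonlocal pos
        value = 0
        for _ in range(width):
            byte = data[pos >> 3]
            value = (value << 1) | ((byte >> (7 - (pos & 7))) & 1)
            pos += 1
        return value

    tokens = []
    for _ in range(count):
        if get(1) == 0:
            tokens.append(("lit", get(8)))
        else:
            distance = get(dist_bits)
            length = get(len_bits)
            tokens.append(("match", distance, length, get(8)))
    return tokens
```

With this layout a literal costs 9 bits and a match costs 25 bits, instead of 24 bits for every triplet, so data dominated by unmatched bytes should shrink noticeably. The token count has to be stored somewhere (or an end marker added), which this sketch leaves out.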

Related

How are Huffman trees transmitted?

I'm trying to understand how the DEFLATE algorithm works. I found this document published by UC Davis. I don't understand the part where it talks about how Huffman trees are transmitted:
Probably the trickiest part of the DEFLATE specification to understand
is the way trees are encoded to go along with the data, when that data
is compressed with specialized trees.
The trees are transmitted by their codelengths, as previously
discussed. The codelengths are put all together into a sequence of
numbers between 0 and 15 (the Huffman trees that are created must be
kept to codelengths of no more than 15; this is the tricky part, not
the part about constraining the order of the elements).
Not all the elements have to be given codelengths; if the last
elements of an alphabet are of 0 codelengths, they can and probably
should be left out. The number of elements in each of the two
alphabets will be transmitted, so the trimmed alphabets go together
into a single sequence.
First of all, what does codelength mean exactly, and why can it be 0?
Also, I didn't quite understand run-length compression, which they mention right after the last paragraph:
Once this sequence of codelengths is assembled, it is compressed with
a form of what is called run-length compression. When several elements
in a row have the same codelength (often 0) special symbols may be
used to indicate the number of elements with this codelength. Our
sequence is now a sequence of numbers between 0 and 18 (possibly with
extra bits forming integers to modify base values, as was the case
with the length and distance codes).
A Huffman tree is created for this alphabet of 0-18. Sigh. The
sequence of 0-18 codes and extra bits is prepared with the Huffman
codes replacing the 0-18 elements.
A codelength is the length of the code in bits for that symbol.
A zero codelength means that that symbol does not appear in the compressed data, so there is no code for that symbol.
Run-length encoding means, in this case, that a sequence of repeated codelengths, e.g. "7, 7, 7, 7, 7, 7", is replaced by "7, repeat the last length 5 times".
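To make the run-length step concrete, here is a small sketch in Python that turns a list of codelengths into the 0-18 alphabet. The symbol meanings (16 repeats the previous length 3 to 6 times, 17 repeats zero 3 to 10 times, 18 repeats zero 11 to 138 times) follow the DEFLATE specification; the function name and the (symbol, extra) output shape are assumptions made for the example, and a real encoder would go on to Huffman-code these symbols and write the extra bits.

```python
def rle_code_lengths(lengths):
    """Run-length encode a list of Huffman code lengths (each 0..15).

    Returns a list of (symbol, extra) pairs, where symbol is 0..18:
      0..15: a code length sent as itself (extra is None)
      16:    repeat the previous length 3..6 times   (extra = count - 3)
      17:    repeat a zero length 3..10 times        (extra = count - 3)
      18:    repeat a zero length 11..138 times      (extra = count - 11)
    """
    out = []
    i = 0
    n = len(lengths)
    while i < n:
        value = lengths[i]
        run = 1
        while i + run < n and lengths[i + run] == value:
            run += 1
        i += run
        if value == 0:
            while run >= 11:
                take = min(run, 138)
                out.append((18, take - 11))
                run -= take
            if run >= 3:
                out.append((17, run - 3))
                run = 0
            out.extend([(0, None)] * run)      # 0, 1 or 2 leftover zeros
        else:
            out.append((value, None))          # first occurrence is sent as itself
            run -= 1
            while run >= 3:
                take = min(run, 6)
                out.append((16, take - 3))
                run -= take
            out.extend([(value, None)] * run)  # 0, 1 or 2 leftover repeats
    return out
```

For the example above, `rle_code_lengths([7, 7, 7, 7, 7, 7])` gives `[(7, None), (16, 2)]`: the length 7 once, then "repeat the last length" with an extra value of 2, meaning 3 + 2 = 5 more times.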

DEFLATE method reasoning

Why does LZ77 DEFLATE use Huffman encoding for its second pass instead of LZW? Is there something about their combination that is optimal? If so, what is the nature of the output of LZ77 that makes it more suitable for Huffman compression than LZW or some other method entirely?
LZW tries to take advantage of repeated strings, just like the first "stage" as you call it of LZ77. It then does a poor job of entropy coding that information. LZW has been completely supplanted by more modern approaches. (Except for its legacy use in the GIF format.) Once LZ77 generates a list of literals and matches, there is nothing left for LZW to take advantage of, and it would then make an almost completely ineffective entropy coder for that information.
Mark Adler could best answer this question.
The details of how the LZ77 and Huffman work together need some closer examination. Once the raw data has been turned into a string of characters and special length, distance pairs, these elements must be represented with Huffman codes.
Though this is NOT, repeat, NOT standard terminology, call the point where we start reading in bits a "dial tone." After all, in our analogy, the dial tone is where you can start specifying a series of numbers that will end up mapping to a specific phone. So call the very beginning a "dial tone." At that dial tone, one of three things could follow: a character, a length-distance pair, or the end of the block. Since we must be able to tell which it is, all the possible characters ("literals"), elements that indicate ranges of possible lengths ("lengths"), and a special end-of-block indicator are all merged into a single alphabet. That alphabet then becomes the basis of a Huffman tree. Distances don't need to be included in this alphabet, since they can only appear directly after lengths. Once the literal has been decoded, or the length-distance pair decoded, we are at another "dial-tone" point and we start reading again. If we got the end-of-block symbol, of course, we're either at the beginning of another block or at the end of the compressed data.
Length codes or distance codes may actually be a code that represents a base value, followed by extra bits that form an integer to be added to the base value.
...
Read the whole deal here.
Long story short. LZ77 provides duplicate elimination. Huffman coding provides bit reduction. It's also on the wiki.
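The "base value plus extra bits" idea quoted above can be sketched for the length codes. The values follow the DEFLATE length table (codes 257 to 285 covering match lengths 3 to 258); the function and variable names are made up for this example, and distance codes work the same way with their own table.

```python
def build_length_table():
    """Return a list of (code, base_length, extra_bits) for codes 257..285."""
    table = []
    code, base = 257, 3
    # Codes 257..264 carry no extra bits; after that, each group of four
    # codes uses one more extra bit than the group before it.
    for extra in [0, 0, 1, 2, 3, 4, 5]:
        for _ in range(4):
            table.append((code, base, extra))
            code += 1
            base += 1 << extra
    # Code 285 is a special case: length 258 with no extra bits.
    table.append((285, 258, 0))
    return table


LENGTH_TABLE = build_length_table()


def encode_length(length, table=LENGTH_TABLE):
    """Map a match length (3..258) to (code, extra_bits, extra_value)."""
    if not 3 <= length <= 258:
        raise ValueError("length out of range")
    # Scan from the largest base downwards and take the first one that fits.
    for code, base, extra in reversed(table):
        if base <= length:
            return code, extra, length - base
```

For instance, a match of length 12 maps to code 265 with one extra bit whose value is 1 (base 11 plus 1). Only the code itself goes through the Huffman tree; the extra bits are written as a plain integer.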

Is it possible to 'trim' trailing spaces/tabs from a string in an arbitrary encoding using ICU without doing any conversions

Specifically, given the following:
A pointer to a buffer containing string data in some encoding X
supported by ICU
The length of the data in the buffer, in bytes
The encoding of the buffer (i.e. X)
Can I compute the length of the string, minus the trailing space/tab characters, without actually converting it into ICU's internal encoding first, then converting back? (this itself could be problematic due to unicode normalizations).
For certain encodings, such as any ascii-based encoding along with utf-8/16/32 the solution is pretty simple, just iterate from the back of the string comparing either 1/2/4 bytes at a time against the two constants.
For others it could be harder (variable-length encodings come to mind). I would like this to be as efficient as possible.
For a large subset of encodings, and for the limited set of U+0020 SPACE and HORIZONTAL TAB U+0009, this is pretty simple.
In ASCII, single-byte Windows code pages, and single-byte ISO code pages, these characters all have the same value. You can simply work backwards, byte-by-byte, lopping them off as long as the value is either 9 or 32.
This approach also works for UTF-8, which has the nice property that a byte less than 128 is always that ASCII character. You don't have to wonder whether it's a lead byte or a continuation byte, as those always have the high bit set.
Given UTF-16, you work two bytes at a time, looking for 0x0009 and 0x0020, being careful to handle byte order. Like UTF-8, UTF-16 has the nice property that if you see this value, you don't have to wonder if it's part of a surrogate pair, as those always have a distinct value.
The problematic cases are the variable-byte encodings that don't give you the assurance that a given unit is unique. If you see a byte with a value 9, you don't necessarily know whether it's a tab character or a random byte from a multibyte encoding. Even for some of these, however, it may be possible that the specific values you care about (9 and 32) are unique. For example, looking at Windows code page 950, it seems that lead bytes have the high value set, and tail bytes steer clear of the lower values (it would take a lot of checking to be absolutely sure). So for your limited case, this might be sufficient.
For the general problem of stripping out an arbitrary set of characters from absolutely any encoding, you need to parse the string according to the rules of that encoding (as well as knowing all the character mappings). For the general case, it's almost certainly best to convert the string to some Unicode encoding, do the trimming, and then convert back. This should round-trip correctly if you're careful to use the K normalization forms.
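For the easy cases described above (single-byte encodings, UTF-8, UTF-16, UTF-32), the backwards scan might look like the sketch below. It is written in Python for brevity; the function name and parameters are assumptions for the example, and it deliberately does not cover multibyte legacy encodings where a byte value of 9 or 32 could be part of a longer sequence.

```python
def trimmed_length(buf, unit_size=1, big_endian=False):
    """Length in bytes of `buf` without trailing U+0020 / U+0009.

    unit_size is the code-unit width in bytes: 1 for ASCII, single-byte
    code pages and UTF-8; 2 for UTF-16; 4 for UTF-32.
    """
    if len(buf) % unit_size != 0:
        raise ValueError("buffer length is not a multiple of the unit size")
    order = "big" if big_endian else "little"
    end = len(buf)
    while end >= unit_size:
        # Compare the whole code unit, not just one of its bytes.
        unit = int.from_bytes(buf[end - unit_size:end], order)
        if unit not in (0x09, 0x20):
            break
        end -= unit_size
    return end
```

Comparing whole code units matters for UTF-16: U+2020 (DAGGER) is stored as the bytes 0x20 0x20, and a byte-by-byte scan would wrongly strip it.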
I use the rather simplistic STL approach of:
std::string mystring;
mystring.erase(mystring.find_last_not_of(" \n\r\t")+1);
This has covered all my needs so far (your mileage may vary); after years of use it still does the job :)
Let me know if you need more information:)
If you restrict the "arbitrary encoding" requirement to "any encoding that uses the same code value for space and tab as ASCII", which is still rather general, you don't even need ICU at all. boost::trim_right or boost::trim_right_if is all you need.
http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/usage.html#idp206822440

How to find special values in large file using C++ or C

I have some values I want to find in a large (> 500 MB) text file using C++ or C. I know that a possible matching value can only exist at the very beginning of each line and that its length is exactly ten characters. Okay, I can read the whole file line by line, searching for the value with substr() or a regexp, but that is a little bit ugly and very slow. I'm considering using an embedded database (e.g. Berkeley DB), but the file I want to search in is very dynamic and I see a problem with bringing it into the database every time. Due to limited memory it is not possible to load the whole file into memory at once. Many thanks in advance.
This doesn't seem well suited to C/C++. Since the problem is defined with the need to parse whole lines of text, and perform pattern matching on the first 10-chars, something interpreted, such as python or perl would seem to be simpler.
How about:
pattern = '0123456789'  # <-- replace with pattern
with open('myfile.txt') as f:
    # iterating over the file reads one line at a time
    for line in f:
        if line.startswith(pattern):
            print("Eureka!")
I don't see how you're going to do this faster than using the stdio library, reading each line in turn into a buffer, and using strchr, strcmp, strncmp or some such. Given the description of your problem, that's already fairly optimal. There's no magic that will avoid the need to go through the file line by line looking for your pattern.
That said, regular expressions are almost certainly not needed here if you're dealing with a fixed pattern of exactly ten characters at the start of a line -- that would be needlessly slow and I wouldn't use the regex library.
If you really, really need to beat the last few microseconds out of this, and the pattern is literally constant and at the start of a line, you might be able to use memchr on read-in buffers to find each newline and then compare the bytes that follow against the pattern (in effect searching for "\npattern"), but you make it sound like the pattern is not precisely constant. Assuming it is not precisely constant, the most obvious method (see first paragraph) is the right thing to do.
If you have a large number of values that you are looking for then you want to use Aho-Corasick. This algorithm allows you to create a single finite state machine that can search for all occurrences of any string in a set simultaneously. This means that you can search through your file a single time and find all matches of every value you are looking for. The wikipedia link above has a link to a C implementation of Aho-Corasick. If you want to look at a Go implementation that I've written you can look here.
If you are looking for a single or a very small number of values then you'd be better off using Boyer-Moore. Although in this case you might want to just use grep, which will probably be just as fast as anything you write for this application.
How about using memory mapped files before search?
http://beej.us/guide/bgipc/output/html/multipage/mmap.html
One way may be to load and search, say, the first 64 MB in memory, unload it, then load the next 64 MB, and so on (in multiples of 4 KB, taking care not to overlook any text that is split at a block boundary).
Also view Boyer Moore String Search
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
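A minimal sketch of the memory-mapping suggestion, in Python for brevity (in C the equivalent would be mmap plus a search routine). It assumes the pattern is a constant byte string and that lines end in "\n"; the function name is made up for the example. The operating system pages the file in on demand, so the whole file does not need to fit in RAM, although on a 32-bit system a very large file may not fit in the address space.

```python
import mmap


def find_line_starts(path, pattern):
    """Yield byte offsets of lines that begin with `pattern` (bytes).

    Note: mapping an empty file raises ValueError, which this sketch
    does not handle.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The first line has no preceding newline, so check it separately.
            if mm[:len(pattern)] == pattern:
                yield 0
            needle = b"\n" + pattern
            pos = mm.find(needle)
            while pos != -1:
                yield pos + 1                 # skip the newline itself
                pos = mm.find(needle, pos + 1)
```

Usage would be along the lines of `for offset in find_line_starts("myfile.txt", b"0123456789"): ...`.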
Yes this can be done fast. Been there. Done that. It is easy to introduce bugs, however.
The trick is in managing end of buffer, since you will read a buffer full of data, search that buffer, and then go on to the next. Since the pattern could span the boundary between two buffers, you wind up writing most of your code to cover that case.
At any rate, outside of the boundary case, you have a loop that looks like the following:
unsigned short *p = (unsigned short *)buffer;
/* skip two-byte units that cannot be the start of the pattern */
while( (p < EOB) && ( !patterns[*p] ) ) ++p;
This assumes that EOB has been appropriately initialized, and that patterns[] is an array of 65536 values which are 0 if you can't be at the start of your pattern and 1 if you can.
Depending on your CR/LF and byte order conventions, patterns to set to 1 might include \nx or \rx where x is the first character in your 10 character pattern. Or x\n or x\r for the other byte order. And if you don't know the byte order or convention you can include all four.
Once you have a candidate location (EOL followed by the first byte) you do the work of checking the remaining 9 bytes. Building the patterns array is done offline, ahead of time. Two byte patterns fit in a small enough array that you don't have too much memory thrashing when doing the indexing, but you get to zip through the data twice as fast as if you did single byte.
There is one crazy optimization you can add into this, and that is to write a sentinel at the end of buffer, and put it in your patterns array. But that sentinel must be something that couldn't appear in the file otherwise. It gets the loop down to one test, one lookup and one increment, though.
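The buffer-boundary handling is the part that is easy to get wrong, so here is a simplified sketch of it in Python. It does not reproduce the two-byte lookup table; instead it splits each buffer on newlines and carries the incomplete last line over to the next read. The function name, the set of target values, and the chunk size are assumptions made for the example.

```python
def scan_chunks(path, targets, key_len=10, chunk_size=1 << 20):
    """Return the line-start keys that occur in `targets` (a set of bytes).

    The file is read in fixed-size chunks; an incomplete last line is
    carried over to the next chunk, so a key split across the boundary
    between two buffers is still seen whole.
    """
    found = []
    carry = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (carry + chunk).split(b"\n")
            carry = lines.pop()            # possibly incomplete last line
            for line in lines:
                if line[:key_len] in targets:
                    found.append(line[:key_len])
    if carry[:key_len] in targets:         # final line without a newline
        found.append(carry[:key_len])
    return found
```

Memory use stays around one chunk plus the longest line. Looking the key up in a set also means many values can be checked in a single pass over the file.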

Binary file special characters

I'm coding a suffix array sorting, and this algorithm appends a sentinel character to the original string. This character must not be in the original string.
Since this algorithm will process the bytes of binary files, is there any special byte value that I can be sure I won't find in any binary file?
If it exists, how do I represent this character in C++ coding?
I'm on linux, I'm not sure if it makes a difference.
No, there is not. Binary files can contain every combination of byte values. I wouldn't call them 'characters' though, because they are binary data, not (necessarily) representing characters. But whatever the name, they can have any value.
This is more like a question you should answer yourself. We do not know what binary data you have and what characters can be there and what cannot. If you are talking about generic binary data - there could be any combination of bits and bytes, and characters, so there is no such character.
From the other point of view, you are talking about strings. What kind of strings? ASCII strings? ASCII codes have very limited range, for example, so you can use 128, for example. Some old protocols use SOH (\1) for similar purposes. So there might be a way around if you know exactly what strings you are processing.
To the best of my knowledge, suffix array cannot be applied to arbitrary binary data (well, it can, but it won't make any sense).
A file contains only bits. Groups of bits can be interpreted as an ASCII character, a floating point number, a photo in JPEG format, or anything else you can imagine. The interpretation is based on the coding scheme (such as ASCII or BCD) that you choose. If your coding scheme doesn't fill the entire table of possible codes, you can pick an unused one for your special purposes (for example, digits can be encoded naively in 4 bits, 2^4 = 16, so you have 6 redundant codewords).
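Two possible workarounds, sketched in Python under stated assumptions. The first checks whether a particular input happens to leave some byte value unused. The second widens the alphabet: every byte is shifted up by one so that 0 becomes free as a sentinel. The second is a swapped-in technique, not something proposed in the answers above; in C++ it would amount to working on an array of int rather than char. The suffix sort here is deliberately naive and only meant to show the sentinel handling.

```python
def unused_byte(data):
    """Return a byte value that does not occur in `data`, or None."""
    present = set(data)
    for value in range(256):
        if value not in present:
            return value
    return None


def suffix_array_with_sentinel(data):
    """Naive suffix array over integer symbols with a sentinel appended.

    Each byte b becomes the symbol b + 1, so 0 is free to serve as a
    sentinel that is smaller than every real symbol.
    """
    symbols = [b + 1 for b in data] + [0]
    # Sorting slices is quadratic or worse; fine for a demonstration only.
    return sorted(range(len(symbols)), key=lambda i: symbols[i:])
```

The widened alphabet works even when the input uses all 256 byte values, at the cost of storing symbols in a type wider than one byte.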