C++ and RLE for sequences of symbols

C++ and RLE for sequences of symbols - c++

I have difficulties with how to use RLE on sequences of symbols.
For example, I can do RLE encoding on strings like
"ASSSAAAEERRRRRRRR"
which will be transformed to:
"A1S3A3E2R8".
But I'd like to perform RLE on strings like
"XXXYYYYY(1ADEFC)(EDCADD)(1ADEFC)(1ADEFC)(1ADEFC)"
which will be transformed to:
"X3Y5(1ADEFC)1(EDCADD)1(1ADEFC)3"
Is there is a way to reach it? This job becomes a bit easer because long strings always follows in brackets. Could give an advice to do this in C++?
If there is a better way to store values than using brackets, it will be also great if you recommend me.

You should break down this problem into smaller parts. First, you should have a function that tokenizes your stream and returns each individual part. For this example input stream:
"XXXYYYYY(1ADEFC)(EDCADD)(1ADEFC)(1ADEFC)(1ADEFC)"
this function will return the following elements, one per call:
X
X
X
Y
Y
Y
Y
Y
(1ADEFC)
(EDCADD)
(1ADEFC)
(1ADEFC)
(1ADEFC)
<eof>
If you get this function correctly implemented, then the RLE algorithm that you already implemented for single characters should be easily adapted to support longer strings.

Since you mention your intention is to RLE encode the data to later use gzip compression and achieve better compression, my answer is don't bother encoding it first. gzip compression uses DEFLATE, which is a generalization of run-length encoding that can take advantage of runs of strings of characters. You won't get better compression for applying the same algorithm twice, and in fact you may even loose compression a bit.
If you insist in performing your own RLE, then it may be better to store the set length instead of using parenthesis. That is, instead of (1ADEFC)3 use 61ADEFC3. Also note that you intend to compress pixels, which use the full range of byte values. Keep that in mind, as an algorithm written to work with strings would not be appropiate for raw data with embedded nulls and non-printable characters all around.

Related

LZ77: storing format

I started to write a little program that allow to compress a single file using LZ77 compression algorithm. It works fine. Now I'm thinking how to store the data. In LZ77, compressed data consists in a series of triplets. Each triplet has the following format:
<"start reading at n. positions backwards", "go ahead for n. positions", "next character">
What could be a right way to store these triplets? I thought about: <11, 5, 8> bits, then:
2048 positions for look backward
32 max length of matched string
next character is 1 byte.
This format works quite well in text compression, but it sucks for my purpose (video made of binary images), it also increase size if compared to the original filesize. Do you have any suggestions?

What I think you mean is more like: <go back n, copy k, insert literal byte>.
You need to look at the statistics of your matches. You are likely getting many literal bytes with zero-length matches. For that case, a good start would be to use a single bit to decide between a match and no match. If the bit is a one, then it is followed by a distance, length, and literal byte. If it is a zero, it is followed by only a literal bytes.
You can do better still by Huffman coding the literals, lengths, and distances. The lengths and literal could be combined into a single code, as deflate does, to remove even the one bit.

What's the best way to store binary

Ive recently implemented Hoffman compression in c++, if I were to store the results as binary it would take up a lot more space as each 1 and 0 is a character. Alternatively I was thinking maybe I could break the binary into sections of 8 and put characters in the text file, but that would kinda be annoying (so hopefully that can be avoided). My question here is what is the best way to store binary in a text file in terms of character efficietcy?

[To recap the comments...]
My question here is what is the best way to store binary in a text file in terms of character efficiently?
If you can store the data as-is, then do so (in other words, do not use any encoding; simply save the raw bytes).
If you need to store the data within a text file (for instance as a paragraph or as a quoted string), then you have many ways of doing so. For instance, base64 is a very common one, but there are many others.

DEFLATE method reasoning

Why does LZ77 DEFLATE use Huffman encoding for it's second pass instead of LZW? Is there something about their combination that is optimal? If so, what is the nature of the output of LZ77 that makes it more suitable for Huffman compression than LZW or some other method entirely?

LZW tries to take advantage of repeated strings, just like the first "stage" as you call it of LZ77. It then does a poor job of entropy coding that information. LZW has been completely supplanted by more modern approaches. (Except for its legacy use in the GIF format.) Once LZ77 generates a list of literals and matches, there is nothing left for LZW to take advantage of, and it would then make an almost completely ineffective entropy coder for that information.

Mark Adler could best answer this question.
The details of how the LZ77 and Huffman work together need some closer examination. Once the raw data has been turned into a string of characters and special length, distance pairs, these elements must be represented with Huffman codes.
Though this is NOT, repeat, NOT standard terminology, call the point where we start reading in bits a "dial tone." After all, in our analogy, the dial tone is where you can start specifying a series of numbers that will end up mapping to a specific phone. So call the very beginning a "dial tone." At that dial tone, one of three things could follow: a character, a length-distance pair, or the end of the block. Since we must be able to tell which it is, all the possible characters ("literals"), elements that indicate ranges of possible lengths ("lengths"), and a special end-of-block indicator are all merged into a single alphabet. That alphabet then becomes the basis of a Huffman tree. Distances don't need to be included in this alphabet, since they can only appear directly after lengths. Once the literal has been decoded, or the length-distance pair decoded, we are at another "dial-tone" point and we start reading again. If we got the end-of-block symbol, of course, we're either at the beginning of another block or at the end of the compressed data.
Length codes or distance codes may actually be a code that represents a base value, followed by extra bits that form an integer to be added to the base value.
...
Read the whole deal here.
Long story short. LZ77 provides duplicate elimination. Huffman coding provides bit reduction. It's also on the wiki.

Is it possible to 'trim' trailing spaces/tabs from a string in an arbitrary encoding using ICU without doing any conversions

Specifically, given the following:
A pointer to a buffer containing string data in some encoding X
supported by ICU
The length of the data in the buffer, in bytes
The encoding of the buffer (i.e. X)
Can I compute the length of the string, minus the trailing space/tab characters, without actually converting it into ICU's internal encoding first, then converting back? (this itself could be problematic due to unicode normalizations).
For certain encodings, such as any ascii-based encoding along with utf-8/16/32 the solution is pretty simple, just iterate from the back of the string comparing either 1/2/4 bytes at a time against the two constants.
For others it could be harder (variable-length encodings come to mind). I would like this to be as efficient as possible.

For a large subset of encodings, and for the limited set of U+0020 SPACE and HORIZONTAL TAB U+0009, this is pretty simple.
In ASCII, single-byte Windows code pages, and single-byte ISO code pages, these characters all have the same value. You can simply work backwards, byte-by-byte, lopping them off as long as the value is either 9 or 32.
This approach also works for UTF-8, which has the nice property that a byte less than 128 is always that ASCII character. You don't have to wonder whether it's a lead byte or a continuation byte, as those always have the high bit set.
Given UTF-16, you work two bytes at a time, looking for 0x0009 and 0x0020, being careful to handle byte order. Like UTF-8, UTF-16 has the nice property that if you see this value, you don't have to wonder if it's part of a surrogate pair, as those always have a distinct value.
The problematic cases are the variable-byte encodings that don't give you the assurance that a given unit is unique. If you see a byte with a value 9, you don't necessarily know whether it's a tab character or a random byte from a multibyte encoding. Even for some of these, however, it may be possible that the specific values you care about (9 and 32) are unique. For example, looking at Windows code page 950, it seems that lead bytes have the high value set, and tail bytes steer clear of the lower values (it would take a lot of checking to be absolutely sure). So for your limited case, this might be sufficient.
For the general problem of stripping out an arbitrary set of characters from absolutely any encoding, you need to parse the string according to the rules of that encoding (as well as knowing all the character mappings). For the general case, it's almost certainly best to convert the string to some Unicode encoding, do the trimming, and then convert back. This should round-trip correctly if you're careful to use the K normalization forms.

I use the rather simplistic STL approach of:
std::string mystring;
mystring.erase(mystring.find_last_not_of(" \n\r\t")+1);
Which seems to work for all my needs so far (your mileage may vary), but after years of using it it seems to do the job:)
Let me know if you need more information:)

If you restrict "arbitrary encoding" requirement to "any encoding that uses same codevalue for space and tab as ascii" which is still rather general you even don't need ICU at all. boost::trim_right or boost::trim_right_if is all you need.
http://www.boost.org/doc/libs/1_55_0/doc/html/string_algo/usage.html#idp206822440

String-Conversion: MBCS <-> UNICODE with multiple \0 within

I am trying to convert a std::string Buffer - containing data from a bitmap file - to std::wstring.
I am using MultiByteToWideChar, but that does not work, because the function stops after it encounters the first '\0'-character. Seems like it interprets it as the end of the string.
When i dont pass -1 as the length-parameter, but the real length of the data in the std::string-Buffer, it messes the Unicode-String up with characters that definetly not appeared at that position in the original string...
Do I have to write my own conversion function?
Or maybe shall i keep the data as a casual char-array, because the special-symbols will be converted incorrectly?
With regards

There are many, many things that will fail with this approach. Among other things, extra bytes may be added to your data without your realizing it.
It's odd that your only option takes a std::wstring(). If this is a home-grown library, you should take the trouble to write a new function. If it's not, make sure there's nothing more suitable before writing your own.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js