Sequence Compression? - compression

Lately I've faced a problem that has me confused.
The problem is:
I want to compress a sequence so that no information is lost, for example:
a,a,a,b --> a,b
a,b,a,a,c --> a,b,a,a,c (it can't be compressed to a,b,a,c because that way we lose a,a)
Is there an algorithm that does this? What is the name of this problem? Is it compression, or something else?
I would really appreciate any help.
Thanks in advance.

Every algorithm that transforms data so that it takes up less memory is called compression, whether it is lossless or lossy.
e.g. (compressed form for "example given" :-) )
The following is IMHO the simplest form, called run-length encoding, RLE for short:
a,a,a,b,c -> 3a,1b,1c
As you can see, each run of consecutive identical characters is compressed into a single count-and-character pair.
You can also search for repeating patterns, which is much more difficult:
a,b,a,b,a,c --> 2(a,b),1(a),1(c)
There is a lot of literature and there are many web sources about compression algorithms; you should use them to get a deeper view.
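As a minimal sketch of the run-length encoding described above, here is one way it could look in Python. The function names rle_encode and rle_decode are made up for this illustration, and the output is a list of (count, symbol) pairs rather than the "3a,1b,1c" text form.

```python
from itertools import groupby


def rle_encode(seq):
    """Collapse each run of identical consecutive items into (count, item)."""
    return [(len(list(group)), sym) for sym, group in groupby(seq)]


def rle_decode(pairs):
    """Expand (count, item) pairs back into the original sequence."""
    return [sym for count, sym in pairs for _ in range(count)]


if __name__ == "__main__":
    print(rle_encode("aaabc"))
    print("".join(rle_decode(rle_encode("abaac"))))
```

Because the counts are kept, decoding restores the input exactly, which is what makes this lossless.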

Another good algorithm is Lempel–Ziv–Welch
I found this simple JavaScript LZW function marvellous; it comes from the magicians at 140 bytes of JavaScript:
function (
a // String to compress and placeholder for 'wc'.
){
for (
var b = a + "Ā", // Append first "illegal" character (charCode === 256).
c = [], // dictionary
d = 0, // dictionary size
e = d, // iterator
f = c, // w
g = c, // result
h; // c
h = b.charAt(e++);
)
c[h] = h.charCodeAt(), // Fill in the dictionary ...
f = 1 + c[a = f + h] ? a : (g[d++] = c[f], c[a] = d + 255, h); // ... and use it to compress data.
return g // Array of compressed data.
}

RLE

Yep, compression. A simple algorithm would be run-length encoding. There is also information theory, which is the basis for compression algorithms.
Information theory: more common inputs should get shorter codes, thus making the encoded message shorter.
So, if you're encoding binary, where the sequence 0101 is very common (about 25% of the input), then a simple compression would be:
0101 = 0
anything else = 1[original 4 bits]
So the input: 0101 1100 0101 0101 1010 0101 1111 0101
Would be compressed to: 0 11100 0 0 11010 0 11111 0
That's a compression of 32 bits -> 20 bits.
An important lesson: the choice of compression algorithm depends entirely on the input. With the wrong algorithm you will likely make the data longer.
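The 0101 example above can be checked with a few lines of Python. This is only a sketch: encode_blocks is a hypothetical helper name, and it assumes the input length is a multiple of 4 bits.

```python
def encode_blocks(bits, common="0101"):
    """Encode 4-bit blocks: the common block becomes '0', others '1' + block."""
    out = []
    for i in range(0, len(bits), 4):
        block = bits[i:i + 4]
        out.append("0" if block == common else "1" + block)
    return out


if __name__ == "__main__":
    data = "0101 1100 0101 0101 1010 0101 1111 0101".replace(" ", "")
    encoded = encode_blocks(data)
    print(" ".join(encoded))
    print(len(data), "->", sum(len(code) for code in encoded), "bits")
```

With an input where 0101 is rare, every block grows from 4 to 5 bits, which illustrates the point about choosing the wrong algorithm.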

Unless you have to code some solution yourself, you could use some ZIP compression library for the programming language you're using.
And yes, this is data compression.

We can use the LZW compression algorithm to compress text files efficiently and quickly by making use of hash tables.
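For illustration, a minimal LZW compressor in Python might look like the sketch below. It uses a dict (a hash table) as the dictionary, as suggested above. The name lzw_compress is made up here, and the sketch assumes every input character has a code point below 256.

```python
def lzw_compress(text):
    """Return a list of integer codes for text (code points < 256 assumed)."""
    # Start with one dictionary entry per single character.
    table = {chr(i): i for i in range(256)}
    w = ""
    out = []
    for ch in text:
        wc = w + ch
        if wc in table:
            w = wc                      # keep extending the current match
        else:
            out.append(table[w])        # emit code for the longest match
            table[wc] = len(table)      # learn the new, longer string
            w = ch
    if w:
        out.append(table[w])            # flush the final match
    return out


if __name__ == "__main__":
    print(lzw_compress("abababab"))
```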

Related

tANS Minimum Size of State Set to Safely Encode a Symbol Frame

Hi I'm trying to implement tANS in a compute shader, but I am confused about the size of the state set. Also apologies but my account is too new to embed pictures of latex formatted equations.
Imagine we have a symbol frame S comprised of symbols s₁ to sₙ:
S = {s₁, s₂, s₁, s₂, ..., sₙ}
|S| = 2ᵏ
and the probability of each symbol is
pₛₙ = frequency(sₙ) / |S|
pₛ₁ + pₛ₂ + ... + pₛₙ = 1
According to Jarek Duda's slides (which can be found here) the first step in constructing the encoding function is to calculate the number of states L:
L = |S|
so that we can create a set of states
𝕃 = {L, ..., 2L - 1}
from which we can construct the encoding table. In our example this is simple: L = |S| = 2ᵏ. However, we don't necessarily want L to equal |S|, because |S| could be enormous, and constructing an encoding table of size |S| would be counterproductive to compression. Jarek's solution is to create a quantization function so that we can choose an
L : L < |S|
which approximates the symbol probabilities
Lₛ / L ≈ pₛₙ
However as L decreases, the quality of the compression decreases, so I have two questions:
How small can we make L while still achieving compression?
What is a "good" way of determining the size of L for a given |S|?
In Jarek's ANS toolkit he uses the depth of a Huffman tree created from S to get the size of L, but this seems like a lot of work when we already know the upper bound of L (|S|; as I understand it when L = |S| we are at the Shannon entropy; thus making L > |S| would not increase compression). Instead it seems like it would be faster to choose an L that is both less than |S| and above some minimum L. A "good" size of L therefore would achieve some amount of compression, but more importantly would be easy to calculate. However we would need to determine the minimum L. Based on the pictures of sample ANS tables it seems like the minimum size of L could be the frequency of the most probable symbol, but I don't know enough about ANS to confirm this.
After mulling it over for a while, both questions have very simple answers. The smallest L that still achieves lossless compression is L = |A|, where A is the alphabet of symbols to be encoded (I apologize, the lossless criterion should have been included in the original question). If L < |A| then we are pigeonholing symbols, thus losing information. When L = |A| what we essentially have is a fixed-length code, where each symbol has an equal probability weighting in our encoding table. The answer to the second part is even simpler now that we know the answer to the first question. L can be pretty much whatever you want as long as it is at least the size of the alphabet to be encoded. Usually we want L to be a power of two for computational efficiency, and we want L to be greater than |A| to achieve better compression, so a very common L size is 2 times the smallest power of two equal to or greater than the size of the alphabet. This can easily be found by something like this:
int alphabetSize = SizeOfAlphabet();
// 2 * (smallest power of two >= alphabetSize)
int L = (int)pow(2, ceil(log2(alphabetSize)) + 1);
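The same rule of thumb can be sketched in Python with integer arithmetic only, which avoids floating-point rounding in the logarithm. The function name table_size is hypothetical; it only implements the "2 times the smallest power of two at or above the alphabet size" heuristic described above and is not taken from any ANS library.

```python
def table_size(alphabet_size):
    """Return 2 * (smallest power of two >= alphabet_size)."""
    if alphabet_size < 1:
        raise ValueError("alphabet must contain at least one symbol")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    return 1 << ((alphabet_size - 1).bit_length() + 1)


if __name__ == "__main__":
    for n in (5, 8, 9, 256):
        print(n, table_size(n))
```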

Traversing lists of 0, 1 with constraint

My apologies if this was answered somewhere. I tried searching, but I do not know whether this kind of problem has a specific name, so nothing came up in my search...
I have a list of objects, and each of these objects can either be accepted or rejected. Every combination is assigned a value, while some combinations are not valid. (So for example we have 4 objects, and objects 1 and 2 don't go together, then every combination that has objects 1 and 2 as accepted is invalid.) It is not known beforehand which objects don't go together and it is not possible to find the invalid ones just by looking at pairs. (For example it is possible that objects 1, 2 are valid together, objects 2,3 are valid, objects 1,3 are valid, but 1,2,3 are invalid.) I modeled this as lists of 0 and 1, so now I want to traverse these lists to find the one with the maximum value in an efficient way.
My idea was to traverse the lists like a tree by starting at all zeros and then in each step flipping a zero to a one, so for example for 3 objects this gives the tree
                000
         /       |       \
      100       010       001
     /   \     /   \     /   \
   110   101 110   011 101   011
      \     \   \   /   /     /
                111
which is actually worse than just listing all 2^n options since there are duplicates, but at each node I could stop if I discovered that it is invalid. Saving the invalid combinations of ones and keeping a list of all already visited nodes I could make sure that I don't revisit already checked nodes. (But I would still have to check those if they were already visited)
Is there any better way to do this?
You can try to build a tree of variants (at most 2^n options, as you noticed), but cut inappropriate branches as early as possible.
In the example below I've set two binary masks: no 1,2,3 together and no 2,4 together.
def buildtree(x, maxsize, level, masks):
    if level == maxsize:
        print("{0:b}".format(x).zfill(maxsize))
    else:
        buildtree(x, maxsize, level + 1, masks)
        t = x | (1 << level)
        good = True
        for m in masks:
            if (t & m) == m:
                good = False
                break
        if good:
            buildtree(t, maxsize, level + 1, masks)

buildtree(0, 4, 0, [7, 10])
0000
1000
0100
1100
0010
0110
0001
1001
0101
1101
0011
It is also possible to remove some masks during the search, but the code will be more complicated.
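Since the question asks for the combination with the maximum value, the pruned traversal above could be extended as in the following sketch. It assumes the forbidden masks are already known, which the question says is not the case up front, and both best_combination and the value callable are hypothetical names introduced for this example.

```python
def best_combination(n, masks, value):
    """Depth-first search over n accept/reject bits, pruning forbidden masks.

    Returns (bitmask, value) for the best valid combination found.
    """
    best = (None, float("-inf"))

    def visit(x, level):
        nonlocal best
        if level == n:
            v = value(x)
            if v > best[1]:
                best = (x, v)
            return
        visit(x, level + 1)                  # reject object `level`
        t = x | (1 << level)                 # accept object `level`
        if all((t & m) != m for m in masks):
            visit(t, level + 1)              # only descend if still valid

    visit(0, 0)
    return best


if __name__ == "__main__":
    # Toy value function: the number of accepted objects.
    print(best_combination(4, [7, 10], lambda x: bin(x).count("1")))
```

Each combination is visited at most once, so no list of already visited nodes is needed.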

IO for julia reading fortran files

Noob question:
I have the output of a complex matrix done in Fortran, the contents looks like this:
(-0.594209719263636,1.463867815703586E-006)
(-0.783378034185788,-0.182301028756558) (-0.794024313844809,0.128219337674814)
(0.592814294881930,4.069892201461069E-002)
I want to read and use this data in a julia program.
1. No, I don't want to change the writing format. I would like to learn how to strip off
the "trash" characters like '(' or ','. This may be useful for arbitrary input files.
2. I have tried the following code:
file = open(pathtofilename, "r")
data_str = readall(file)
data_numbers_str = split(data_str)
data_numbers = split(data_numbers_str, ['('])
However, the manual is not quite self-explanatory [http://docs.julialang.org/en/release-0.2/stdlib/base/?highlight=split].
Here is what I'd do
data = "(-0.594209719263636,1.463867815703586E-006) (-0.783378034185788,-0.182301028756558) (-0.794024313844809,0.128219337674814) (0.592814294881930,4.069892201461069E-002)"

function pair_to_complex(pair)
    nums = float(split(pair[2:end-1], ","))
    return Complex(nums...)
end

numbers = map(pair_to_complex, split(data, " "))
To explain:
pair[2:end-1] removes the parentheses.
I then split that on the , to get an array with two numbers, still as strings.
I convert them to Float64 with float(), obtaining an array of floats.
I make a new complex number. The ... splats the array out so that it provides the two arguments to Complex; I could have written Complex(nums[1], nums[2]) instead.
I then apply this logic to every term in the data using map.
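For comparison, here is the same stripping idea sketched in Python rather than Julia, using a regular expression so that pairs spread over several lines, or several pairs on one line, are handled uniformly. The name parse_fortran_complex is invented for this example; it assumes every value is written as (real,imag) in the list-directed style shown in the question.

```python
import re

# One "(real,imag)" pair; whitespace around the numbers is tolerated.
PAIR = re.compile(r"\(\s*([^,()\s]+)\s*,\s*([^,()\s]+)\s*\)")


def parse_fortran_complex(text):
    """Return a list of complex numbers found in Fortran-style output."""
    return [complex(float(real_part), float(imag_part))
            for real_part, imag_part in PAIR.findall(text)]


if __name__ == "__main__":
    sample = """(-0.594209719263636,1.463867815703586E-006)
(-0.783378034185788,-0.182301028756558) (-0.794024313844809,0.128219337674814)
(0.592814294881930,4.069892201461069E-002)"""
    for z in parse_fortran_complex(sample):
        print(z)
```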

How to use Ruby's Enumerable .map method to do something similar to map in C++

map(-30, -89.75, 89.75, 0, 360)
I'm looking for something like this where:
-30 is the input value.
-89.75 to 89.75 is the range of possible input values
0 - 360 is the final range to be mapped to
I was told there is a way to do this using http://ruby-doc.org/core-1.9.3/Enumerable.html#method-i-map
... however, it's not readily apparent!
If I'm understanding correctly, I think you just want to uniformly map one range onto another. So, we just need to calculate how far through the input range it is, and return that fraction of the output range.
def map_range(input, in_low, in_high, out_low, out_high)
  # map onto [0,1] using input range (to_f avoids integer division)
  frac = (input - in_low).to_f / (in_high - in_low)
  # map onto output range
  frac * (out_high - out_low) + out_low
end
Also, I should note that map has a somewhat different meaning in Ruby (it applies a block to every element of a collection), so a more appropriate name for this operation would probably be transform.
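The linear mapping above is out = out_low + (input - in_low) / (in_high - in_low) * (out_high - out_low). A small Python sketch of the same formula, reusing the map_range name from the Ruby answer, makes it easy to check the end points:

```python
def map_range(value, in_low, in_high, out_low, out_high):
    """Linearly map value from [in_low, in_high] onto [out_low, out_high]."""
    frac = (value - in_low) / (in_high - in_low)   # position in [0, 1]
    return out_low + frac * (out_high - out_low)


if __name__ == "__main__":
    print(map_range(-30, -89.75, 89.75, 0, 360))
```

The lower end of the input range maps to the lower end of the output range, and the upper end to the upper end.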

Efficient way of storing Huffman tree

I am writing a Huffman encoding/decoding tool and am looking for an efficient way to store the generated Huffman tree inside the output file.
Currently there are two different versions I am implementing.
The first reads the entire file into memory character by character and builds a frequency table for the whole document. This only requires outputting the tree once, so efficiency is not that big of a concern, except when the input file is small.
The other method reads a chunk of data, about 64 kilobytes in size, runs the frequency analysis over it, creates a tree and encodes it. However, in this case I need to output my frequency tree before every chunk so that the decoder is able to rebuild its tree and properly decode the encoded file. This is where efficiency comes into play, since I want to save as much space as possible.
In my searches so far I have not found a good way of storing the tree in as little space as possible. I am hoping the StackOverflow community can help me find a good solution!
Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.
Do not store the actual frequencies, they're not needed for decoding. You do, however, need the actual tree.
So for each node, starting at root:
If leaf-node: Output 1-bit + N-bit character/byte
If not leaf-node, output 0-bit. Then encode both child nodes (left first then right) the same way
To read, do this:
Read bit. If 1, then read N-bit character/byte, return new node around it with no children
If bit was 0, decode left and right child-nodes the same way, and return new node around them with those children, but no value
A leaf-node is basically any node that doesn't have children.
With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.
Pseudo-code for calculation:
Tree-size = 10 * NUMBER_OF_CHARACTERS - 1
Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))
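The tree-size calculation takes the leaf and non-leaf nodes into account: each of the NUMBER_OF_CHARACTERS leaves costs 1 bit plus an 8-bit character (9 bits), and each of the NUMBER_OF_CHARACTERS - 1 internal nodes costs 1 bit, since there is one fewer internal node than there are characters.
For characters of SIZE_OF_ONE_CHARACTER bits, the general form is (SIZE_OF_ONE_CHARACTER + 2) * NUMBER_OF_CHARACTERS - 1. Both sizes are in bits, and together they give the total number of bits that my approach will occupy for the tree plus the encoded data.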
PATH(c) is a function/table that would yield the bit-path from root down to that character in the tree.
Here's some C#-looking pseudo-code to do it, which assumes one character is just a simple byte.
void EncodeNode(Node node, BitWriter writer)
{
    if (node.IsLeafNode)
    {
        // Leaf: a 1 bit followed by the 8-bit character.
        writer.WriteBit(1);
        writer.WriteByte(node.Value);
    }
    else
    {
        // Internal node: a 0 bit, then the left and right subtrees.
        writer.WriteBit(0);
        EncodeNode(node.LeftChild, writer);
        EncodeNode(node.RightChild, writer);
    }
}
To read it back in:
Node ReadNode(BitReader reader)
{
    if (reader.ReadBit() == 1)
    {
        return new Node(reader.ReadByte(), null, null);
    }
    else
    {
        Node leftChild = ReadNode(reader);
        Node rightChild = ReadNode(reader);
        return new Node(0, leftChild, rightChild);
    }
}
An example (simplified, use properties, etc.) Node implementation:
public class Node
{
    public Byte Value;
    public Node LeftChild;
    public Node RightChild;

    public Node(Byte value, Node leftChild, Node rightChild)
    {
        Value = value;
        LeftChild = leftChild;
        RightChild = rightChild;
    }

    public Boolean IsLeafNode
    {
        get
        {
            return LeftChild == null;
        }
    }
}
Here's a sample output from a specific example.
Input: AAAAAABCCCCCCDDEEEEE
Frequencies:
A: 6
B: 1
C: 6
D: 2
E: 5
Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.
The tree could look like this:
      20
  ----------
  |        8
  |     -------
 12     |     3
-----   |   -----
A   C   E   B   D
6   6   5   1   2
So the paths to each character are as follows (0 is left, 1 is right):
A: 00
B: 110
C: 01
D: 111
E: 10
So to calculate the output size:
A: 6 occurrences * 2 bits = 12 bits
B: 1 occurrence * 3 bits = 3 bits
C: 6 occurrences * 2 bits = 12 bits
D: 2 occurrences * 3 bits = 6 bits
E: 5 occurrences * 2 bits = 10 bits
Sum of encoded bits is 12+3+12+6+10 = 43 bits
Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 = 160 bits (20 bytes) necessary to store the original 20 characters unencoded, and you'll save 8 bytes.
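The size arithmetic above can be reproduced with a short Python sketch. The function names are made up for this illustration, and the frequency and path tables are copied from the example; nothing here builds the Huffman tree itself.

```python
import math

freqs = {"A": 6, "B": 1, "C": 6, "D": 2, "E": 5}
paths = {"A": "00", "B": "110", "C": "01", "D": "111", "E": "10"}


def tree_size_bits(num_symbols, symbol_bits=8):
    """Each leaf costs 1 + symbol_bits, each of the n - 1 internal nodes 1 bit."""
    return num_symbols * (1 + symbol_bits) + (num_symbols - 1)


def encoded_size_bits(freqs, paths):
    """Sum of frequency * code length over all symbols."""
    return sum(freq * len(paths[sym]) for sym, freq in freqs.items())


if __name__ == "__main__":
    tree_bits = tree_size_bits(len(freqs))
    data_bits = encoded_size_bits(freqs, paths)
    total_bits = tree_bits + data_bits
    print(tree_bits, data_bits, total_bits, math.ceil(total_bits / 8))
```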
The final output, with the tree at the beginning, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas each 0 and 1 is just a single bit. The space in the stream only separates the tree from the encoded data and does not take up any space in the final output.
001A1C01E01B1D 0000000000001100101010101011111111010101010
For the concrete example you have in the comments, AABCDEF, you will get this:
Input: AABCDEF
Frequencies:
A: 2
B: 1
C: 1
D: 1
E: 1
F: 1
Tree:
          7
    -------------
    |           4
    |       ---------
    3       2       2
  -----   -----   -----
  A   B   C   D   E   F
  2   1   1   1   1   1
Paths:
A: 00
B: 01
C: 100
D: 101
E: 110
F: 111
Tree: 001A1B001C1D01E1F = 59 bits
Data: 000001100101110111 = 18 bits
Sum: 59 + 18 = 77 bits = 10 bytes
Since the original was 7 characters of 8 bits = 56 bits (7 bytes), the tree adds too much overhead for such small pieces of data.
If you have enough control over the tree generation, you could make it produce a canonical tree (the same way DEFLATE does, for example), which basically means you create rules to resolve any ambiguous situations when building the tree. Then, like DEFLATE, all you actually have to store are the lengths of the codes for each character.
That is, if you had the tree/codes Lasse mentioned above:
A: 00
B: 110
C: 01
D: 111
E: 10
Then you could store those as:
2, 3, 2, 3, 2
And that's actually enough information to regenerate the Huffman table, assuming you're always using the same character set, say ASCII. (This means you can't skip letters: you have to list a code length for each one, even if it's zero.)
If you also put a limitation on the bit lengths (say, 7 bits), you could store each of these numbers using short binary strings. So 2,3,2,3,2 becomes 010 011 010 011 010, which fits in 2 bytes.
If you want to get really crazy, you could do what DEFLATE does: make another Huffman table of the lengths of these codes, and store its code lengths beforehand. This pays off especially because DEFLATE adds extra codes for "insert zero N times in a row" to shorten things further.
The RFC for DEFLATE isn't too bad, if you're already familiar with huffman coding: http://www.ietf.org/rfc/rfc1951.txt
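To show why the code lengths alone are enough, here is a minimal Python sketch of canonical code assignment in the style DEFLATE uses: symbols are ordered by code length and then by symbol, and each receives the next free code of its length. The function name canonical_codes is hypothetical, and a length of 0 is taken to mean "symbol not used".

```python
def canonical_codes(lengths):
    """Map each symbol to its canonical code string, given code lengths."""
    codes = {}
    code = 0
    prev_len = 0
    # Shorter codes first; ties are broken by symbol order.
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        if length == 0:
            continue                      # unused symbol, no code assigned
        code <<= length - prev_len        # append zeros when the length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


if __name__ == "__main__":
    print(canonical_codes({"A": 2, "B": 3, "C": 2, "D": 3, "E": 2}))
```

For the lengths 2, 3, 2, 3, 2 this happens to reproduce the same codes as the tree shown earlier, because that tree is already in canonical order.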
Branches are 0 and leaves are 1. Traverse the tree depth first to get its "shape".
e.g. the shape for this tree
0 - 0 - 1 (A)
|    \- 1 (E)
 \
  0 - 1 (C)
   \- 0 - 1 (B)
       \- 1 (D)
would be 001101011
Follow that with the bits for the characters in the same depth first order AECBD (when reading you'll know how many characters to expect from the shape of the tree). Then output the codes for the message. You then have a long series of bits that you can divide up into characters for output.
If you are chunking it, you could test whether storing a new tree for the next chunk is more efficient than just reusing the tree from the previous chunk, and use the tree shape "1" as an indicator that the previous chunk's tree is reused.
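The shape encoding described above can be sketched in Python as follows. For the purposes of this example a tree is represented as nested tuples, where a leaf is a one-character string and an internal node is a (left, right) pair; that representation and the name tree_shape are assumptions of this sketch.

```python
def tree_shape(node):
    """Return (shape_bits, symbols) from a depth-first walk of the tree."""
    if isinstance(node, tuple):
        left_shape, left_syms = tree_shape(node[0])
        right_shape, right_syms = tree_shape(node[1])
        # Branch: a 0 followed by both subtrees.
        return "0" + left_shape + right_shape, left_syms + right_syms
    # Leaf: a 1; the symbol is collected separately.
    return "1", node


if __name__ == "__main__":
    tree = (("A", "E"), ("C", ("B", "D")))
    print(tree_shape(tree))
```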
The tree is generally created from a frequency table of the bytes. So store that table, or just the bytes themselves sorted by frequency, and re-create the tree on the fly. This of course assumes that you're building the tree to represent single bytes, not larger blocks.
UPDATE: As pointed out by j_random_hacker in a comment, you actually can't do this: you need the frequency values themselves. They are combined and "bubble" upwards as you build the tree. This page describes the way a tree is built from the frequency table. As a bonus, it also saves this answer from being deleted by mentioning a way to save out the tree:
The easiest way to output the huffman tree itself is to, starting at the root, dump first the left hand side then the right hand side. For each node you output a 0, for each leaf you output a 1 followed by N bits representing the value.
A better approach
Tree:
          7
    -------------
    |           4
    |       ---------
    3       2       2
  -----   -----   -----
  A   B   C   D   E   F
  2   1   1   1   1   1   : frequencies
  2   2   3   3   3   3   : tree depth (encoding bits)
Now just derive this table:
depth  number of codes
-----  ---------------
2      2                [A B]
3      4                [C D E F]
You don't need to keep the binary tree itself, just the computed tree depth of each value, i.e. its number of encoding bits. So keep the vector of uncompressed values [A B C D E F] ordered by tree depth, and use relative indexes into this separate vector. Now recreate the aligned bit patterns for each depth:
depth  number of codes
-----  ---------------
2      2                [00x 01x]
3      4                [100 101 110 111]
What you immediately see is that only the first bit pattern in each row is significant. You get the following lookup table:
first pattern  depth  first index
-------------  -----  -----------
000            2      0
100            3      2
This LUT is very small (even if your Huffman codes can be 32 bits long, it will only contain 32 rows), and in fact the first pattern is always zero, so you can ignore it completely when performing a binary search of patterns in it (here only 1 pattern needs to be compared to know whether the bit depth is 2 or 3 and to get the first index at which the associated data is stored in the vector). In general you'll need to perform a fast binary search of input patterns in a search space of 31 values at most, i.e. a maximum of 5 integer compares. These compares can be unrolled into straight-line code to avoid all loops and any state management when browsing the integer binary lookup tree.
This whole table has a small fixed size (the LUT needs at most 31 rows for Huffman codes no longer than 32 bits, and the 2 other columns above fill at most 32 rows).
In other words, the LUT above requires 31 ints of 32 bits each, plus 32 bytes to store the bit depth values; but you can avoid storing the depths by implying the depth column (and the first row for depth 1):
first pattern  (depth)  first index
-------------  -------  -----------
(000)          (1)      (0)
000            (2)      0
100            (3)      2
000            (4)      6
000            (5)      6
...            ...      ...
000            (32)     6
So your LUT contains [000, 100, 000 (30 times)]. To search in it you must find the position where the input bit pattern lies between two patterns: it must be lower than the pattern at the next position in this LUT but still higher than or equal to the pattern at the current position (if both positions contain the same pattern, the current row will not match; the input pattern fits below). You'll then divide and conquer, using 5 tests at most (the binary search requires a single piece of code with 5 nested if/then/else levels; it has 32 branches, and the branch reached directly indicates the bit depth, which therefore does not need to be stored; you then perform a single directly indexed lookup in the second table to return the first index; you derive the final index in the vector of decoded values additively).
Once you have a position in the lookup table (search in the 1st column), you immediately have the number of bits to take from the input and then the start index into the vector. The bit depth you get can be used to derive the adjusted index position directly, by basic bitmasking after subtracting the first index.
In summary: never store linked binary trees, and you don't need any loop to perform the lookup, which just requires 5 nested ifs comparing patterns at fixed positions in a table of 31 patterns, and a table of 31 ints containing the start offset within the vector of decoded values (in the first branch of the nested if/then/else tests, the start offset into the vector is implied, as it is always zero; it is also the branch taken most often, as it matches the shortest code, which is for the most frequent decoded values).
There are two main ways to store Huffman code LUTs, as the other answers state. You can either store the geometry of the tree (0 for a node, 1 for a leaf) followed by all the leaf values, or you can use canonical Huffman encoding and store the lengths of the Huffman codes.
The thing is, one method is better than the other depending on the circumstances.
Let's say, the number of unique symbols in the data you wish to compress (aabbbcdddd, there are 4 unique symbols, a, b, c, d) is n.
The number of bits needed to store the geometry of the tree alongside the symbols in the tree is 10n - 1.
Assuming you store the code lengths in symbol order, and that each code length is stored in 8 bits (a code for a 256-symbol alphabet is at most 255 bits long, so its length fits in 8 bits), the size of the code length table will be a flat 2048 bits.
When you have a high number of unique symbols, say 256, it will take 2559 bits to store the geometry of the tree. In this case, the code length table is much more efficient. 511 bits more efficient, to be exact.
But if you only have 5 unique symbols, the tree geometry only takes 49 bits, and in this case, when compared to storing the code length table, storing the tree geometry is almost 2000 bits better.
The tree geometry is most efficient for n < 205, while a code length table is more efficient for n >= 205. So, why not get the best of both worlds, and use both? Have 1 bit at the start of your compressed data represent whether the next however many bits are going to be in the format of a code length table, or the geometry of the huffman tree.
In fact, why not add two bits, and when both of them are 0, there is no table, the data is uncompressed. Because sometimes, you can't get compression! And it would be best to have a single byte at the beginning of your file that is 0x00 telling your decoder not to worry about doing anything. Saves space by not including the table or geometry of a tree, and saves time, not having to unnecessarily compress and decompress data.
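The break-even point quoted above can be checked with a small Python sketch. The function names are invented for this example, and the sizes assume 8-bit symbols and an 8-bit length entry for each of 256 possible symbols, as in the answer.

```python
def tree_geometry_bits(n, symbol_bits=8):
    """n leaves at 1 + symbol_bits each, plus n - 1 one-bit internal nodes."""
    return n * (1 + symbol_bits) + (n - 1)


def length_table_bits(alphabet_size=256, length_bits=8):
    """One fixed-width code length per symbol of the alphabet."""
    return alphabet_size * length_bits


if __name__ == "__main__":
    table = length_table_bits()
    # Smallest n for which the tree geometry is no longer the smaller option.
    break_even = next(n for n in range(1, 257)
                      if tree_geometry_bits(n) >= table)
    print(tree_geometry_bits(5), tree_geometry_bits(256), table, break_even)
```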