How to compress sequences using a predictive probabilistic model?

Let's assume that we have sequences consisting of symbols from an alphabet containing 50 symbols. So, every symbol can be encoded with 6 bits (2^6 = 64 > 50). It means that every given sequence of symbols can be represented as a sequence of 0s and 1s.
Now, let's assume that the symbols in our sequences are not totally random, so they are to some extent predictable. In more detail, given the first N symbols in a sequence we can estimate how probable each symbol is as the next one in the sequence. For example, we can say that A is expected to come with probability 0.01, B is expected to come with probability 0.3, and so on.
I believe that such a predictive model can be used to compress data. And my question is how exactly it should be done. Or, in more detail, what is the best way to use the predictive model to compress the data?
I thought in the following direction. At a given stage, we have the estimated probabilities of all the symbols, so all the symbols can be ordered according to their probabilities (from the most probable symbol to the least probable). Then the first symbol is encoded by 0, the second by 1, the third by 00, and so on. So, the encodings are:
[0, 1, 00, 01, 10, 11, 000, 001, ..., 111110, 111111]
This way, symbols will usually get an encoding with a small number of bits. However, these encodings contain commas. For example, the original sequence of symbols can be represented as:
[0, 00, 1, 10, 0, 0, 1, 0110, ...]
And commas are not part of the binary alphabet.
I also thought about the following encoding of the symbols in the list ordered by probabilities:
[0, 10, 110, 1110, 11110, 111110, ....]
Then 0 is used as a separator (instead of commas) and the number of 1s represents the position of the symbol in the list. But again, I am not sure that this is the most efficient way to use bits or the best way to utilise the predictive model.
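To make that second scheme concrete, here is a rough sketch of what I have in mind in Python (predict stands in for the predictive model and has to behave identically during encoding and decoding; symbols is a list):
def rank_code(rank):
    # 0 -> "0", 1 -> "10", 2 -> "110", ...: rank ones followed by a terminating zero
    return "1" * rank + "0"
def encode(symbols, predict):
    out = []
    for i, s in enumerate(symbols):
        probs = predict(symbols[:i])                    # {symbol: probability}
        # most probable first; ties broken by the symbol itself so the
        # decoder reproduces exactly the same ordering
        order = sorted(probs, key=lambda x: (-probs[x], x))
        out.append(rank_code(order.index(s)))
    return "".join(out)
def decode(bits, predict, n):
    symbols, pos = [], 0
    for _ in range(n):
        probs = predict(symbols)
        order = sorted(probs, key=lambda x: (-probs[x], x))
        rank = 0
        while bits[pos] == "1":                         # count the leading ones
            rank += 1
            pos += 1
        pos += 1                                        # skip the terminating zero
        symbols.append(order[rank])
    return symbols
Because every codeword ends in a 0, the separators come for free and no commas are needed; whether this is bit-efficient presumably depends on how quickly the predicted probabilities fall off with rank.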

Yes, such a predictive model can be used to compress data, so long as the model has a good shot at predicting the next symbol. This is exactly the sort of probabilistic model that arithmetic coding was conceived for.
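Arithmetic coding spends roughly -log2(p) bits on a symbol that the model predicted with probability p, so good predictions translate directly into short output. Here is a minimal exact-arithmetic sketch in Python (Fractions, so no renormalisation; a production coder uses fixed-width integers and renormalises, but the idea is the same). model(prefix) is a stand-in for your predictor: it must give a positive probability to every symbol that can actually occur, the probabilities at each step must sum to at most 1, and it must behave identically on the encoding and decoding side.
from fractions import Fraction
def intervals(probs):
    # carve [0, 1) into one slice per symbol, in a fixed (sorted) order
    lo = Fraction(0)
    for sym in sorted(probs):
        p = Fraction(probs[sym])
        yield sym, lo, p
        lo += p
def encode(symbols, model):
    low, width = Fraction(0), Fraction(1)
    for i, s in enumerate(symbols):
        for sym, lo, p in intervals(model(symbols[:i])):
            if sym == s:                    # narrow to this symbol's slice
                low, width = low + lo * width, p * width
                break
    # emit the shortest binary fraction that lands inside [low, low + width)
    k = 0
    while True:
        k += 1
        num = -((-low.numerator * 2 ** k) // low.denominator)   # ceil(low * 2^k)
        if Fraction(num, 2 ** k) < low + width:
            return format(num, "0%db" % k)
def decode(bits, model, n):
    value = Fraction(int(bits, 2), 2 ** len(bits))
    low, width, symbols = Fraction(0), Fraction(1), []
    for _ in range(n):
        for sym, lo, p in intervals(model(symbols)):
            if low + (lo + p) * width > value:   # value falls inside this slice
                symbols.append(sym)
                low, width = low + lo * width, p * width
                break
    return symbols
model = lambda prefix: {"A": 0.5, "B": 0.3, "C": 0.2}    # toy fixed model
print(decode(encode(list("ABAAC"), model), model, 5))    # ['A', 'B', 'A', 'A', 'C']
Note that the number of symbols (or an explicit end-of-stream symbol) has to be transmitted alongside the bit string so the decoder knows when to stop.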

Related

Controlling newlines when writing out arrays in Fortran

So I have some code that does essentially this:
REAL, DIMENSION(31) :: month_data
INTEGER :: no_days
no_days = get_no_days()
month_data = [fill array with some values]
WRITE(1000,*) (month_data(d), d=1,no_days)
So I have an array with values for each month, in a loop I fill the array with a certain number of values based on how many days there are in that month, then write out the results into a file.
It took me quite some time to wrap my head around the whole 'write out an array in one go' aspect of WRITE, but this seems to work.
However this way, it writes out the numbers in the array like this (example for January, so 31 values):
0.00000 10.0000 20.0000 30.0000 40.0000 50.0000 60.0000
70.0000 80.0000 90.0000 100.000 110.000 120.000 130.000
140.000 150.000 160.000 170.000 180.000 190.000 200.000
210.000 220.000 230.000 240.000 250.000 260.000 270.000
280.000 290.000 300.000
So it prefixes a lot of spaces (presumably to make columns line up even when there are larger values in the array), and it wraps lines to make it not exceed a certain width (I think 128 chars? not sure).
I don't really mind the extra spaces (although they inflate my file sizes considerably, so it would be nice to fix that too...) but the breaking up of lines screws up my other tooling. I've tried reading several Fortran manuals, but while some of them mention 'output formatting', I have yet to find one that mentions newlines or columns.
So, how do I control how arrays are written out when using the syntax above in Fortran?
(Also, while we're at it, how do I control the number of decimal digits? I know these are all integer values so I'd like to leave out any decimals altogether, but I can't change the data type to INTEGER in my code because of reasons.)
You probably want something similar to
WRITE(1000,'(31(F6.0,1X))') (month_data(d), d=1,no_days)
Explanation:
The use of * as the format specification is called list directed I/O: it is easy to code, but you are giving away all control over the format to the processor. In order to control the format you need to provide explicit formatting, via a label to a FORMAT statement or via a character variable.
Use the F edit descriptor for real variables in decimal form. Its syntax is Fw.d, where w is the total width of the field (including the sign and the decimal point) and d is the number of digits after the decimal point. F6.0 therefore means a field 6 characters wide with no digits after the decimal point.
Spaces can be added with the X control edit descriptor.
Repetitions of an edit descriptor can be indicated by placing the number of repetitions before the descriptor.
Groups can be created with (...), and they can be repeated if preceded by a number of repetitions.
No more items are printed beyond the last provided variable, even if the format specifies how to print more items than the ones actually provided - so you can ask for 31 repetitions even if for some months you will only print data for 30 or 28 days.
Besides,
New lines could be added with the / control edit descriptor; e.g., if you wanted to print the data with 10 values per row, you could do
WRITE(1000,'(4(10(F6.0,:,1X),/))') (month_data(d), d=1,no_days)
Note the : control edit descriptor in this second example: it indicates that, if there are no more items to print, nothing else should be printed - not even spaces corresponding to control edit descriptors such as X or /. While it could have been used in the previous example, it is more relevant here, in order to ensure that, if no_days is a multiple of 10, there isn't an empty line after the 3 rows of data.
If you want to completely remove the decimal symbol, you would need to rather print the nearest integers using the nint intrinsic and the Iw (integer) descriptor:
WRITE(1000,'(31(I6,1X))') (nint(month_data(d)), d=1,no_days)

Find all partial matches to vector of unsigned

For an AI project of mine, I need to apply to a factored state all rules that apply to its partial components. This needs to be done very frequently so I'm looking for a way to make this as fast as possible.
I'm going to describe my problem with strings, however the true problem works in the same way with vectors of unsigned integers.
I have a bunch of entries (of length N) like this which I need to store in some way:
__a_b
c_e__
___de
abcd_
fffff
__a__
My input is a single entry ciede for which I must find, as fast as possible, all stored entries that match it. For example in this case the matches would be c_e__ and ___de. Removal and addition of entries should be supported, but I don't care how slow those operations are. What I would like to be as fast as possible is:
for ( const auto & entry : matchedEntries(input) )
My problem, as I said, is one where each letter is actually an unsigned integer, and the vector is of an unspecified (but known) length. I have no requirements for how entries should be stored, or what type of metadata is going to be associated with them. The naive algorithm of matching them all is O(N); is it possible to do better? The number of entries I need stored is reasonably <= 100k.
I'm thinking some kind of sorting might help, or some weird-looking tree structure, but I can't seem to figure out a good way to approach this problem. It also looks like something word processors already need to do, so someone might be able to help.
The easiest solution is to build a trie containing your entries. When searching the trie, you start in the root and recursively follow every edge that matches the current character of your input. There will be at most two such edges in each node, one for the wildcard _ and one for the actual letter.
In the worst case you have to follow two edges from each node, which adds up to O(2^n) complexity, where n is the length of the input, while the space complexity is linear.
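A rough Python sketch of that trie (names are illustrative only; fixed-length entries and '_' as the wildcard are assumed):
class TrieNode(object):
    def __init__(self):
        self.children = {}      # symbol (or '_') -> TrieNode
        self.entry = None       # full entry stored at a leaf
def insert(root, entry):
    node = root
    for ch in entry:
        node = node.children.setdefault(ch, TrieNode())
    node.entry = entry
def matches(node, word, i=0):
    # yield every stored entry matching word; '_' in an entry matches anything
    if i == len(word):
        if node.entry is not None:
            yield node.entry
        return
    # follow at most two edges per node: the literal symbol and the wildcard
    for key in (word[i], '_'):
        child = node.children.get(key)
        if child is not None:
            for m in matches(child, word, i + 1):
                yield m
root = TrieNode()
for e in ["__a_b", "c_e__", "___de", "abcd_", "fffff", "__a__"]:
    insert(root, e)
print(list(matches(root, "ciede")))   # ['c_e__', '___de']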
A different approach would be to preprocess the entries to allow for a linear search. This is basically what compiling regular expressions does. For your example, consider the following regular expression, which matches your desired input:
(..a.b|c.e..|...de|abcd.|fffff|..a..)
This expression can be implemented as a nondeterministic finite state automaton (NFSA), with the initial state having ε-moves to a deterministic automaton for each of the single entries. This NFSA can then be turned into a deterministic FSA using the standard powerset construction.
Although this construction can increase the number of states substantially, searching the input word can then be done in linear time, simply simulating the deterministic automaton.
Below is an example for entries ab, a_, ba, _a and __. First start with a nondeterministic automaton, which upon removing ε-moves and joining equivalent states is actually a trie for the set.
Then turn it into a deterministic machine, with states corresponding to subsets of states of the NFSA. Start in state 0 and, for each edge other than _, create the next state as the union of the states in the original machine that are reachable from any state in the current set.
For example, when the DFSA is in state 16, that means the NFSA could be in either state 1 or state 6. Upon a transition on a, the NFSA could get to state 3 (from 1), or to 7 or 8 (from 6) - that will be your next state in the DFSA.
The standard construction would preserve the _-edges, but we can omit them, as long as the input does not contain _.
Now if you have a word ab on the input, you simulate this automaton (i.e. traverse its transition graph) and end up in state 238, from which you can easily recover the original entries.
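This is not a transcription of the diagrams above, but here is a rough Python sketch of the same subset construction (state names are plain integers rather than concatenated labels like 16 or 238; the None edge stands for "any input symbol outside the template alphabet", which can only follow wildcard edges):
def build_dfa(entries, alphabet):
    # NFA states are (entry, position) pairs; a '_' edge is taken for any symbol
    start = frozenset((e, 0) for e in entries)
    ids, stack = {start: 0}, [start]
    dfa, accepts = {}, {}
    while stack:
        subset = stack.pop()
        sid = ids[subset]
        accepts[sid] = sorted(e for e, pos in subset if pos == len(e))
        for a in list(alphabet) + [None]:
            nxt = frozenset((e, pos + 1) for e, pos in subset
                            if pos < len(e) and e[pos] in ("_", a))
            if nxt:
                if nxt not in ids:
                    ids[nxt] = len(ids)
                    stack.append(nxt)
                dfa[(sid, a)] = ids[nxt]
    return dfa, accepts
def matched_entries(dfa, accepts, alphabet, word):
    state = 0
    for ch in word:
        key = (state, ch if ch in alphabet else None)
        if key not in dfa:
            return []
        state = dfa[key]
    return accepts.get(state, [])
entries = ["__a_b", "c_e__", "___de", "abcd_", "fffff", "__a__"]
alphabet = set("".join(entries)) - {"_"}
dfa, accepts = build_dfa(entries, alphabet)
print(matched_entries(dfa, accepts, alphabet, "ciede"))   # ['___de', 'c_e__']
As the answer notes, the number of DFA states can blow up in the worst case; whether that happens depends on how much the templates overlap.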
Store the data in a tree, 1st layer represents 1st element (character or integer), and so on. This means the tree will have a constant depth of 5 (excluding the root) in your example. Don't care about wildcards ("_") at this point. Just store them like the other elements.
When searching for the matches, traverse the tree by doing a breadth first search and dynamically build up your result set. Whenever you encounter a wildcard, add another element to your result set for all other nodes of this layer that do not match. If no subnode matches, remove the entry from your result set.
You should also skip redundant entries when building up the tree: in your example, __a_b is redundant, because whenever it matches, __a__ also matches.
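That redundancy test could look roughly like this (a small sketch assuming equal-length strings; only useful if you don't need the redundant entries reported individually):
def covers(general, specific):
    # True if `general` matches every word that `specific` matches,
    # i.e. `specific` is redundant and need not be stored
    return all(g in ("_", s) for g, s in zip(general, specific))
print(covers("__a__", "__a_b"))   # True  -> __a_b is redundant
print(covers("__a_b", "__a__"))   # False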
I've got an algorithm in mind which I plan to implement and benchmark, but I'll describe the approach already. It needs n_templates * template_length * n_symbols bits of storage (so 100k templates of length 100 over 256 distinct symbols need 2.56 Gbit = 320 MB of RAM). This does not scale nicely to a large number of symbols unless a succinct data structure is used.
A query takes O(n_templates * template_length) bit operations, done 64 at a time with word-sized bitwise operations, so it should perform quite well.
Let's say we have the given set of templates:
__a_b
c_e__
___de
abcd_
_ied_
bi__e
The set of symbols is abcdei; for each symbol we pre-calculate a bit mask indicating whether the template differs from the symbol at that location or not:
aaaaa bbbbb ccccc ddddd eeeee iiiii
....b ..a.. ..a.b ..a.b ..a.b ..a.b
c.e.. c.e.. ..e.. c.e.. c.... c.e..
...de ...de ...de ....e ...d. ...de
.bcd. a.cd. ab.d. abc.. abcd. abcd.
.ied. .ied. .ied. .ie.. .i.d. ..ed.
bi..e .i..e bi..e bi..e bi... b...e
Same tables expressed in binary:
aaaaa bbbbb ccccc ddddd eeeee iiiii
00001 00100 00101 00101 00101 00101
10100 10100 00100 10100 10000 10100
00011 00011 00011 00001 00010 00011
01110 10110 11010 11100 11110 11110
01110 01110 01110 01100 01010 00110
11001 01001 11001 11001 11000 10001
These are stored in column order, 64 templates per unsigned 64-bit integer. To determine which templates match ciede we check the 1st column of the c table, the 2nd column of the i table, the 3rd of the e table, and so forth:
ciede ciede
__a_b ..a.b 00101
c_e__ ..... 00000
___de ..... 00000
abcd_ abc.. 11100
_ied_ ..... 00000
bi__e b.... 10000
We find matching templates as rows of zeros, which indicates that no differences were found. We can check 64 templates at once, and the algorithm itself is very simple (python-like code):
for i_block in range(n_templates // 64):      # 64 templates per block
    mask = 0
    for i in range(template_length):
        # Accumulate difference-indicating bits
        mask |= tables[i_block][word[i]][i]
        if mask == (1 << 64) - 1:
            # All 64 templates in this block differ, we can stop early
            break
    for i in range(64):
        if mask & (1 << i) == 0:
            print('Match at template ' + str(i_block * 64 + i))
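For completeness, a rough sketch (a hypothetical helper, not part of the answer above) of how the per-symbol difference tables could be built, assuming equal-length templates and a template count that is a multiple of the block size:
def build_tables(templates, alphabet, block=64):
    # tables[i_block][symbol][i] has bit t set iff template number
    # i_block*block + t differs from `symbol` at position i ('_' never differs)
    length = len(templates[0])
    tables = []
    for b in range(0, len(templates), block):
        chunk = templates[b:b + block]
        tables.append({
            symbol: [sum(1 << t for t, tpl in enumerate(chunk)
                         if tpl[i] not in ("_", symbol))
                     for i in range(length)]
            for symbol in alphabet})
    return tables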
As I said I haven't yet actually tried implementing this, so I have no clue how fast it is in practice.

Calculating tf-idf among documents using python 2.7

I have a scenario where I have retrieved information/raw data from the internet and placed it into respective JSON or .txt files.
From there on I would like to calculate the frequencies of each term in each document and their cosine similarity by using tf-idf.
For example:
there are 50 different documents/text files that consist of 5000 words/strings each
I would like to take the first word from the first document/text, compare it against all 250,000 words in total and find its frequencies, then do the same for the second word, and so on for all 50 documents/texts.
The expected output of each frequency will be from 0 to 1.
How am I able to do so? I have been referring to the sklearn package, but most of the examples only involve a few strings in each comparison.
You really should show us your code and explain in more detail which part it is that you are having trouble with.
What you describe is not usually how it's done. What you usually do is vectorize documents, then compare the vectors, which yields the similarity between any two documents under this model. Since you are asking about scikit-learn, I will proceed on the assumption that you want this regular, traditional method.
Anyway, with a traditional word representation, cosine similarity between two words is meaningless -- either two words are identical, or they're not. But there are certainly other ways you could approach term similarity or document similarity.
Copying the code from https://stackoverflow.com/a/23796566/874188 so we have a baseline:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer._tfidf.idf_
print dict(zip(vectorizer.get_feature_names(), idf))
There is nothing here which depends on the length of the input. The number of features in idf will be larger if you have longer documents and there will be more of them in the corpus if you have more documents, but the algorithm as such will not need to change at all to accommodate more or longer documents.
If you don't want to understand why, you can stop reading here.
The vectors are basically an array of counts for each word form. The length of each vector is the number of word forms (i.e. the number of features). So if you have a lexicon with six entries like this:
0: a
1: aardvark
2: banana
3: fruit
4: flies
5: like
then the input document "a fruit flies like a banana" will yield a vector of six elements like this:
[2, 0, 1, 1, 1, 1]
because there are two occurrences of the word at index zero in the lexicon, zero occurrences of the word at index one, one of the one at index two, etc. This is a TF (term frequency) vector. It is already a useful vector; you can compare two of them using cosine distance, and obtain a measurement of their similarity.
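To make that concrete outside of sklearn, building that TF vector by hand looks roughly like this:
lexicon = ["a", "aardvark", "banana", "fruit", "flies", "like"]
document = "a fruit flies like a banana".split()
# one count per lexicon entry, in lexicon order
tf = [document.count(term) for term in lexicon]
print(tf)   # [2, 0, 1, 1, 1, 1]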
The purpose of the IDF factor is to normalize this. The normalization brings three benefits. Computationally, you don't need to do any per-document or per-comparison normalization, so it's faster. It also down-weights frequent words, so that many occurrences of "a" are properly regarded as insignificant if most documents contain many occurrences of this word (so you don't have to do explicit stop word filtering), whereas many occurrences of "aardvark" are immediately obviously significant in the normalized vector. Finally, the normalized output can be readily interpreted, whereas with plain TF vectors you have to take document length etc. into account to properly understand the result of the cosine similarity comparison.
So if the DF (document frequency) of "a" is 1000, and the DF of the other words in the lexicon is 1, the scaled vector will be
[0.002, 0, 1, 1, 1, 1]
(because we take the inverse of the document frequency, i.e. TF("a")*IDF("a") = TF("a")/DF("a") = 2/1000).
The cosine similarity basically interprets these vectors in an n-dimensional space (here, n=6) and sees how far from each other their arrows are. Just for simplicity, let's scale this down to three dimensions, and plot the (IDF-scaled) number of "a" on the X axis, the number of "aardvark" occurrences on the Y axis, and the number of "banana" occurrences on the Z axis. The end point [0.002, 0, 1] differs from [0.003, 0, 1] by just a tiny bit, whereas [0, 1, 0] ends up at quite another corner of the cube we are imagining, so the cosine distance is large. (The normalization means 1.0 is the maximum of any element, so we are talking literally a corner.)
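If you want to see the arithmetic rather than rely on sklearn, a bare-bones cosine similarity (assuming non-zero vectors) is:
import math
def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
print(cosine_similarity([0.002, 0, 1], [0.003, 0, 1]))   # almost 1.0: nearly parallel
print(cosine_similarity([0.002, 0, 1], [0, 1, 0]))       # 0.0: orthogonal vectors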
Now, returning to the lexicon, if you add a new document and it has words which are not already in the lexicon, they will be added to the lexicon, and so the vectors will need to be longer from now on. (Vectors you already created which are now too short can be trivially extended; the term weight for the hitherto unseen terms will obviously always be zero.) If you add the document to the corpus, there will be one more vector in the corpus to compare against. But the algorithm doesn't need to change; it will always create vectors with one element per lexicon entry, and you can continue to compare these vectors using the same methods as before.
You can of course loop over the terms and for each, synthesize a "document" consisting of just that single term. Comparing it to other single-term "documents" will yield 0.0 similarity to the others (or 1.0 similarity to a document containing the same term and nothing else), so that's not too useful, but a comparison against real-world documents will reveal essentially what proportion of each document consists of the term you are examining.
The raw DF (document frequency) counts behind the IDF vector tell you how common each term is. They usually express how many documents each term occurred in (so even if a term occurs more than once in a document, it only adds 1 to the DF for this term), though some implementations also allow you to use the bare term count.

unicode char value

Question: What is the correct order of Unicode extended symbols by value?
If I sort a list of Unicode chars in Excel, the order is different than if I use Excel's "=CODE()" and sort by those values. The purpose is that I want to measure the distance between chars, for example a-b = 1 and &-% = 1; yet when sorted with the Excel sort function, two chars that are within three positions of each other turn out to have values that are 134 apart.
Also, some char symbols are blank in Excel, several are found twice with 'find' and are two different symbols, and a couple are not found at all. Please explain the details of these 'special' chars.
http://en.wikipedia.org/wiki/List_of_Unicode_characters
sample code:
int charDist = abs(alpha[index] - code[0]);
EDIT:
To figure out the Unicode values in C++ (VS2008) I compared each code from 1 to 255 against code 1:
cout << mem << " code " << key << " is " << abs(key[0] - '') << " from " << endl;
In the brackets is a black happy face that this website does not have the font for but the command window does, in vs2008 it looks like a half-post | with the right half of a T. Excel leaves a blank.
The following Unicodes are not handled in c++ vs2008 with the std library and #include
9, 10, 13, 26, 34, 44,
And, the numerical 'distance' for codes 1 through 127 is correct, but at 128 the distance skips an extra and is one further away for some reason. Then from 128 to 255 the distance reverses and becomes closer; 255 is 2 away from 1 ''
It'd be nice if these followed something more logical and were just 1 through 255 without hiccups or skips and reversals, and 255-1 = 254 but hey, what do I know.
EDIT2: I found it - without the absolute - the collation for UNIFORMAT is 128 to 255 then 1 to 127 and yields 1 to 255 with the 6 skips for 9, 10, 13, 26, 34, 44 that are garbage. That was not intuitive. In the new order 128->255,1->127 the strange skip from 127 to 128 is clearer, it is because there is no 0 so the value is missing between 255 and 1.
SOLUTION: make my own hashtable with values for each symbol and do not rely on c++ std library or vs2008 to provide the UNIFORMAT values since they are not correct for measuring the char distance outside of several specific subsets of UNIFORMAT.
Unicode doesn't have a defined sort (or collation) order. When Excel sorts, it's using tables based on the currently selected language. For example, someone using Excel in English mode may get different sorting results than someone using Excel in Portuguese.
There are also issues of normalization. With Unicode, one "character" doesn't necessarily correspond to one value. Some characters can be represented in different ways. For example, a capital omega can be coded as a Greek letter or as a symbol for representing units of electrical resistance. In some languages, a single character may be composed from several consecutive values.
The blank values probably correspond to glyphs that you don't have any font coverage for. Some systems use so-called "Unicode fonts" which have a large percentage of the glyphs you need for every script. Windows tends to switch fonts on the fly when the current font doesn't have a necessary glyph. Neither approach will have every glyph necessary. Also, some Unicode values don't encode to a visible glyph (e.g., there are many different kinds of spaces in Unicode), some values act more like ASCII-style control codes (e.g., paragraph separator or bidi controls), and some values only make sense when they combine with another character, like many of the "combining" accents.
So there's not an answer you're going to be satisfied with. Perhaps if you gave more information about what you're ultimately trying to do, we could suggest a different approach.
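If the goal is really only the numeric distance between two characters' code points, that part is easy to do outside Excel. A rough Python illustration (it assumes each argument is a single code point after normalization; combining sequences that don't compose to one code point would need extra handling):
import unicodedata
def char_distance(a, b):
    # compose equivalent forms first so they compare consistently,
    # then take the difference of the Unicode code point values
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    return abs(ord(a) - ord(b))
print(char_distance(u"a", u"b"))   # 1
print(char_distance(u"&", u"%"))   # 1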
I don't think you can do what you want to do in Excel without limiting your approach significantly.
By experimentation, the CODE function will never return a value higher than 255. If you use any Unicode text that cannot be generated via the VBA code below, it will be interpreted as a question mark (?), i.e. 63.
For x = 1 To 255
    Cells(x, 1).Value = Chr(x)
Next
You should be able to determine the difference using Code. But if the character doesn't fall in that realm, you'll need to go outside of Excel, because even VBA will convert any other Unicode characters to the question mark(?) or 63.

Trying to generate a uniquely decodable code and decode it

I'm trying to encode arbitrary symbols into a bit string, and I don't really understand how I can either generate them or even decode a bit string containing those.
I want to work with arbitrary symbols to compress; I don't really know if a uniquely decodable code is what I am looking for - maybe an arithmetic code or a canonical Huffman code?
I just need a list of bit strings, describing the most frequent to least frequent, for any size of symbol table.
Encoding:
1.1. Generate Frequency Table per symbol
1.2. Generate Probability Tree based on this frequency table
1.3. Generate Code Table - such that the most frequent symbol gets the smallest bit string
1.4. Output: [Frequency Table + Sequence of Bit Strings (put together end to end)]
The important thing to note is that the sequence of these bit strings can later be segregated again all by itself, i.e. say [10010001] => {100, 1000, 1} (only an example)
Decoding:
2.1. Obtain Frequency Table & Bit String sequence
2.2. Generate Probability Tree (same as 1.2)
2.3. Generate Code Table (same as 1.3)
2.4. Recreate Data:
- Parse Bit String
- Match with the Code Table
- Output matched Symbol
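A rough Python sketch of those steps (plain, non-canonical Huffman; in practice you would transmit the frequency table from step 1.4, or use a canonical code, so that the decoder rebuilds exactly the same code table):
import heapq
from collections import Counter
def huffman_table(symbols):
    # 1.1 frequency table, 1.2 tree built bottom-up with a min-heap,
    # 1.3 code table: the most frequent symbols get the shortest bit strings
    freq = Counter(symbols)
    heap = [[f, i, {s: ""}] for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-symbol input
        return {s: "0" for s in freq}
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        codes = {}
        for s, c in lo[2].items():
            codes[s] = "0" + c               # lighter subtree gets a leading 0
        for s, c in hi[2].items():
            codes[s] = "1" + c               # heavier subtree gets a leading 1
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], codes])
    return heap[0][2]
def encode(symbols, table):
    return "".join(table[s] for s in symbols)        # 1.4, codes laid end to end
def decode(bits, table):
    # 2.4: the code is prefix-free, so the first codeword that matches is correct
    inverse = {code: sym for sym, code in table.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out
data = list("abracadabra")
table = huffman_table(data)
print(decode(encode(data, table), table) == data)    # True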