Does Eigen "Sparse matrix format" example contains error? - c++

Eigen 3.3.7 documentation for SparceMatrix
http://eigen.tuxfamily.org/dox/group__TutorialSparse.html
seems to contain an error in Sparse matrix format section:
This storage scheme is better explained on an example. The following matrix
0 3 0 0 0
22 0 0 0 17
7 5 0 1 0
0 0 0 0 0
0 0 14 0 8
and one of its possible sparse, column major representation:
Values: 22 7 _ 3 5 14 _ _ 1 _ 17 8
InnerIndices: 1 2 _ 0 2 4 _ _ 2 _ 1 4
OuterStarts: 0 3 5 8 10 12
InnerNNZs: 2 2 1 1 2
If 14 is moved from the third column to the second (i.e. its indices changed from [4,2] to [4,1]), then the first two arrays, Values and InnerIndices, make sense. OuterStarts doesn't seem to be correct for either 14 position, while InnerNNZs makes sense for 14 being in [4,2] element of the matrix, but is inconsistent with Values array.
Is this example incorrect or am I missing something?
In general, what is the best way of figuring out Eigen, besides examining the source code? I normally look at tests and examples, but building most benchmark and tests for sparse matrices results in compilation errors (were these tests written for older version of Eigen and not updated for version 3?)...

The key is that the user is supposed to reserve at least as many entries per column as they need. In this example the user only reserved 2 entries for the second column, so if you were to try to add another entry to that column, it would probably require an expensive reallocation, or at least a complicated shift to "steal" an unused entry from another column. (I have no idea how this is implemented.)
Upon a cursory look at the documentation you linked to, I didn't see anything about moving entries like you're trying to do. I'm not sure that Eigen supports such an operation. (Correct me if I'm wrong.) I'm also not sure why you would want to do that.
Your final question is probably too broad. I'm not an expert at Eigen, but it seems like a mature, powerful, and well-documented library. If you have any specific problems compiling examples, you should post them here or on an Eigen specific forum. Many people at scicomp.SE are well-versed in Eigen and are accommodating.

Related

Minimum description length and Huffman coding for two symbols?

I am confused about the interpretation of the minimum description length of an alphabet of two symbols.
To be more concrete, suppose that we want to encode a binary string where 1's occur with probability 0.80; for instance, here is a string of length 40, with 32 1's and 8 0's:
1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 1 1 0 0 1
Following standard MDL analysis, we can encode this string using prefix codes (like Huffman's) and the code of encoding this string would be (-log(0.8) * 32 - log(0.2) * 8), which is lower than duplicating the string without any encoding.
Intuitively, it is "cheaper" to encode this string than some string where 1's and 0's occur with equal probability. However, in practice, I don't see why this would be the case. At the very least, we need one bit to distinguish between 1's and 0's. I don't see how prefix codes could do better than just writing the binary string without encoding.
Can someone help me clarify this, please?
I don't see how prefix codes could do better than just writing the
binary string without encoding.
You can't with prefix codes, unless you combine bits to make more symbols. For example, if you code every two bits, you now have four symbols with probabilities 0.64, 0.16, 0.16, and 0.04. That would be coded with 0, 10, 110, 111. That gives an average of 1.56 bits per symbol, or 0.7800 bits per original bit. We're getting a little closer to the optimal 0.7219 bits per bit (-0.2 log20.2 - 0.8 log20.8).
Do that for three-bit groupings, and you get 0.7280 bits per bit. Surprisingly close to the optimum. In this case, the code lengths just happen to group really nicely with the probabilities. The code is 1 bit (0) for the symbol with probability 0.512, 3 bits (100, 101, 110) for the three symbols with probability 0.128, and 5 bits (11100, 11101, 11110, 11111) for both the three symbols with probability 0.032 and the one symbol with probability 0.008.
You can keep going and get asymptotically close to the optimal 0.7219 bits per bit. Though it becomes more inefficient in time and space for larger groupings. The Pareto Front turns out to be at multiples of three bits up through 15. 6 bits gives 0.7252 bits per bit, 9 gives 0.7251, 12 is 0.7250, and 15 is 0.7249. The approach is monumentally slow, where you need to go to 28 bits to get to 0.7221. So you might as well stop at 6. Or even just 3 is pretty good.
Alternatively you can use something other than prefix coding, such as arithmetic coding, range coding, or asymmetric numeral system coding. They effectively use fractional bits for each symbol.

Multiple responses in Stata

Let's say you have a survey dataset, with 12 variables that stem from the same question, and each variable reports a response option for that question (multiple-response options possible for this question). Each variable (i.e. response option) is numeric with yes/no options. I am trying to combine all of these variables into one, so that I can do cross-tabs with other variables such as village name, and draw out the frequencies of each individual response and graphs nicely without extensive formatting. Does anyone have a solution to this: either to combine the variables or to do a multivariable cross-tab that doesn't require a lot of time spent on formatting?
Example data:
A B C D E F
1 0 1 0 1 0
0 0 1 0 1 1
1 1 1 0 0 0
There are many tricks and techniques here.
Tricks include using egen's concat() function as well as the group() function mentioned by #Dimitriy V. Masterov.
Techniques include special tabulation or listing commands, including tabm and groups on SSC and mrtab at the Stata Journal; on the last, see this article.
See also this article in the Stata Journal for a general discussion of handling multiple responses.
Does egen pattern = group(A-F), label do what you desire? If not, perhaps you can clarify what the desired transformation would look like for the 3 respondents you have shown.

MATLAB IF VALUE LESS THAN

Hi I am writing a matlab code at the moment. I am trying to compare the values in a list to the number 10 and if the value is less than 10 add 1 to the total. However I cannot seem to get the code right. My code so far
tot = 0
for i=1:n
if(x(i)<10)
tot = +1
else
y=0;
end
end
tot
The value I get for tot always = 1 and never increases? Can someone help edit this or if not provide a solution to the problem?
I would agree with the answer mentioned above, that one should avoid for loops for this. There can be a faster solution. Since, he is just interested in the counts, and not value of numbers, so there is no need to index things back.
Given:
a = [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
Computing numbers less than 10 (you could put any number here)
answer = sum(a<10);
Good luck!
In languages like MATLAB and R, you really should not use for loops like this, even as an exercise. Each variable can be a vector, and operations can occur on the whole vector at once, rather than element-by-element.
Given:
x = [ 1 2 3 4 11 12 13 14 15 16 ]
To generate a list of all x less than 10 you would say:
x(x<10)
So to count them:
total = length(x(x<10))
No loop needed or wanted!

Huffman code with lookup table

I have read an article on Internet and know that the natural way of decoding by traversing from root but I want to do it faster with a lookup table.
After reading, I still cannot get the points.
ex:
input:"abcdaabaaabaaa"
code data
0 a
10 b
110 c
111 d
The article says that due to variable length, it determine the length by taking a bit of string of
max code length and use it as index.
output:"010110111001000010000"
Index Index(binary) Code Bits required
0 000 a 1
1 001 a 1
2 010 a 1
3 011 a 1
4 100 b 2
5 101 b 2
6 110 c 3
7 111 d 3
My questions are:
What does it means due to variable length, it determine the length by taking a bit of string of
max code length? How to determine the length?
How to generate the lookup table and how to use it? What is the algorithm behind?
For your example, the maximum code length is 3 bits. So you take the first 3 bits from your stream (010) and use that to index the table. This gives code, 'a' and bits = 1. You consume 1 bit from your input stream, output the code, and carry on. On the second go around you will get (101), which indexes as 'b' and 2 bits, etc.
To build the table, make it as large as 1 << max_code_length, and fill in details as if you are decoding the index as a huffman code. If you look at your example all the indices which begin '0' are a, indices beginning '10' are b, and so on.

regexp-like library for matrix pattern search

Is there a library (in any language) that can search patterns in matrixes like regular expressions work for strings ? Something like regular expresions for matrixes, or any matrix pattern search method ?
If you're not averse to using J, you can find out whether two matrices are equal by using the -: (match) operator. For example:
X =: 4 3 $ i.12
X
0 1 2
3 4 5
6 7 8
9 10 11
Y =: 4 3 $ (1+i.12)
Y
1 2 3
4 5 6
7 8 9
10 11 12
X -: X
1
X -: Y
0
One nice feature of the match operator is that you can use it to compare arrays of arbitrary dimension; if A is a 3x3x4 array and B is a 2x1 array, then A-:B returns 0.
To find out whether a matrix is a submatrix of another matrix, you can use the E: (member of interval) operator like so:
X =: 2 2 $ 1 2 4 5
X
1 2
4 5
Y =: 4 3 $ (1+i.12)
Y
1 2 3
4 5 6
7 8 9
10 11 12
X E. Y
1 0 0
0 0 0
0 0 0
0 0 0
The 1 at the top left of the result signifies that the part of Y that is equal to X has the given pixel as its upper left-hand corner. The reason for this is that there may be several overlapping copies of X embedded in Y, and only flagging the one pixel lets you see the location of every matching tile.
I found two things: gawk and a perl script.
It's a different problem because string regular expressions work (e.g., sed, grep) work line-by-line on one-dimensional strings.
Unless your matrices are one-dimensional (basically vectors), these programs and the algorithms they use won't work.
Good luck!
Just search rows of the pattern in each row of the input matrix using Aho-Corasick (time O(matrix size)). The result should be small enough to quickly join it into the final result.
I don't think there exists anything quite like regular expressions for dimensions higher than 1, but if you want to match an exact pattern instead of a class of patterns then I might suggest you read up on convolution (or rather Cross-correlation )
The reason being, there are many highly optimized library functions (eg. IPP) for doing this faster than you could ever hope to achieve on your own. Also this method scales to higher dimensions as well.
Also, this won't necessarily give you a "match", but rather a "peak" in a correlation map which will correspond to the match if that peak is equal to the sum of squared coefficients of the pattern you are searching for.