I am trying to do a PCA on some volatility data, and let's just say I can propose a model as the following:
volatility = beta0 + beta1*x + beta2*x^2
where x are some observations, say for example, moneyness and so on.
So in Matlab, what I did was to say Y=[ones x x^2] and then do pca(Y)
For some reason, the first row of my coefficient matrix is always something like 0 0 1, i.e., 0 everywhere except the last column, and the latent output always shows the highest value in the first row as well, no matter how I change the model.
Obviously, it can't be the case that the last term in every single model is explained well by the last term in the equation. And if I remove the constant term from Y (i.e., Y = [x x^2]), then the first row of the coefficient matrix becomes something more normal (i.e., non-zero values everywhere).
So my questions are:
is my way of doing PCA right?
Does PCA automatically rearrange the principal component and hence the first row in the coefficient matrix with all zeros except 1 at the last column may not necessarily represent the last term in the equation and
if it is wrong, what is the correct way of doing it?
From Matlab's documentation for princomp:
COEFF = princomp(X) performs principal components analysis (PCA) on
the n-by-p data matrix X, and returns the principal component
coefficients, also known as loadings. Rows of X correspond to
observations, columns to variables. COEFF is a p-by-p matrix, each
column containing coefficients for one principal component. The
columns are in order of decreasing component variance.
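A small NumPy sketch may help to see what is going on here. It is only an illustration under assumptions: pca_coeff is a hypothetical helper that mimics the centering and ordering described in the documentation above (it is not Matlab's implementation), and the x values are synthetic. Because PCA centers each column, the column of ones becomes all zeros and has zero variance, so it ends up alone in the last component:

```python
import numpy as np

def pca_coeff(Y):
    """Hypothetical helper: PCA via eigen-decomposition of the sample covariance.

    Returns (coeff, latent): columns of coeff are components, rows are the
    original variables, ordered by decreasing component variance.
    """
    Yc = Y - Y.mean(axis=0)                  # centering, as PCA does
    cov = Yc.T @ Yc / (Y.shape[0] - 1)
    latent, coeff = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(latent)[::-1]         # reorder to descending variance
    return coeff[:, order], latent[order]

rng = np.random.default_rng(0)
x = rng.uniform(0.5, 1.5, size=200)          # synthetic stand-in for moneyness
Y = np.column_stack([np.ones_like(x), x, x**2])

coeff, latent = pca_coeff(Y)
print(np.round(coeff, 3))   # first row (the constant) loads only on the last component
print(latent)               # last variance is (numerically) zero
```

So a first row of 0 0 1 says that the constant column contributes only to the last, zero-variance component. It does not say anything about the x^2 term.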
I tried to solve this problem using Dynamic Programming but it seems I am missing some cases that I am unable to find.
Here is the equation that I used for getting values from sub-problem
dp[i][j] = dp[i][j-1] + 3*(dp[i-1][j-1] - dp[i-2][j-2]) + dp[i-3][j-2]
(i = k = no of cells to be colored and j = n = number of columns, note the row is fixed i.e 3)
The terms are as defined below:
dp[i][j-1] : case when I don't color any cell in the nth column.
dp[i-1][j-1] - dp[i-2][j-2] : case when I color one cell in the last column and then have to subtract the case where I color the adjacent cell in the n-1th column and since this can be done for each of the 3 cells in the nth column I multiplied it with 3.
dp[i-3][j-2] : case when I color two cells(top and bottom ones) in the nth column and thus have only one choice for the n-1th column, that is the middle one, hence subtracting 3 from i and since we have already considered the last two columns I reduce 2 from j.
I couldn't find any mistake in the above approach. If you see one, please help.
Below is the actual question where an extra condition of P consecutive column not be empty is also mentioned and should be taken care of.
My approach is to first find all the possible ways to color k cells in 3xN matrix such that they are not adjacent and then finding the number of ways where P consecutive columns exist such that there are no cells colored in them and subtracting it with the total count, but in this approach, I'm missing the correct answer by a small margin for smaller inputs and a large margin for larger inputs. I must be missing something here.
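One way to hunt for the missing cases is to compare against a brute-force count on small inputs. The sketch below is not the recurrence from the question: it is a column-profile DP plus an exhaustive checker, assuming "adjacent" means sharing an edge, and it ignores the extra condition on P consecutive empty columns. The function names are made up for illustration.

```python
from itertools import combinations

# Column patterns (3 bits, one per row) with no two vertically adjacent cells.
VALID_MASKS = [m for m in range(8) if m & (m >> 1) == 0]   # 0, 1, 2, 4, 5

def count_profile_dp(n, k):
    """Ways to color exactly k cells of a 3 x n grid with no edge-adjacent pair."""
    # ways[mask][c]: colorings of the columns so far that end in `mask`
    # and use c colored cells. Start from an empty virtual column.
    ways = {m: [0] * (k + 1) for m in VALID_MASKS}
    ways[0][0] = 1
    for _ in range(n):
        nxt = {m: [0] * (k + 1) for m in VALID_MASKS}
        for prev in VALID_MASKS:
            for cur in VALID_MASKS:
                if prev & cur:               # same row colored in both columns
                    continue
                bits = bin(cur).count("1")
                for c in range(k + 1 - bits):
                    nxt[cur][c + bits] += ways[prev][c]
        ways = nxt
    return sum(ways[m][k] for m in VALID_MASKS)

def count_brute_force(n, k):
    """Exhaustive check, only practical for very small n."""
    cells = [(r, c) for r in range(3) for c in range(n)]
    total = 0
    for chosen in combinations(cells, k):
        s = set(chosen)
        if all((r + 1, c) not in s and (r, c + 1) not in s for r, c in s):
            total += 1
    return total

for n in range(1, 5):
    for k in range(0, 5):
        assert count_profile_dp(n, k) == count_brute_force(n, k)
```

Running the recurrence from the question against count_brute_force for the same small n and k should show the first pair (n, k) where the counts diverge.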
From time to time I have to port some Matlab Code to OpenCV.
Almost always there is a way to do it and an appropriate function in OpenCV. Nevertheless, it's not always easy to find.
Therefore I would like to start this summary to find and gather some equivalents between Matlab and OpenCV.
I use the Matlab function as heading and append its description from Matlab help. Afterwards, an OpenCV example or links to solutions are appreciated.
Repmat
Replicate and tile an array. B = repmat(A,M,N) creates a large matrix B consisting of an M-by-N tiling of copies of A. The size of B is [size(A,1)*M, size(A,2)*N]. The statement repmat(A,N) creates an N-by-N tiling.
B = repeat(A, M, N)
OpenCV Docs
Find
Find indices of nonzero elements. I = find(X) returns the linear indices corresponding to the nonzero entries of the array X. X may be a logical expression. Use IND2SUB(SIZE(X),I) to calculate multiple subscripts from the linear indices I.
Similar to Matlab's find
Conv2
Two dimensional convolution. C = conv2(A, B) performs the 2-D convolution of matrices A and B. If [ma,na] = size(A), [mb,nb] = size(B), and [mc,nc] = size(C), then mc = max([ma+mb-1,ma,mb]) and nc = max([na+nb-1,na,nb]).
Similar to Conv2
Imagesc
Scale data and display as image. imagesc(...) is the same as IMAGE(...) except the data is scaled to use the full colormap.
SO Imagesc
Imfilter
N-D filtering of multidimensional images. B = imfilter(A,H) filters the multidimensional array A with the multidimensional filter H. A can be logical or it can be a nonsparse numeric array of any class and dimension. The result, B, has the same size and class as A.
SO Imfilter
Imregionalmax
Regional maxima. BW = imregionalmax(I) computes the regional maxima of I. imregionalmax returns a binary image, BW, the same size as I, that identifies the locations of the regional maxima in I. In BW, pixels that are set to 1 identify regional maxima; all other pixels are set to 0.
SO Imregionalmax
Ordfilt2
2-D order-statistic filtering. B=ordfilt2(A,ORDER,DOMAIN) replaces each element in A by the ORDER-th element in the sorted set of neighbors specified by the nonzero elements in DOMAIN.
SO Ordfilt2
Roipoly
Select polygonal region of interest. Use roipoly to select a polygonal region of interest within an image. roipoly returns a binary image that you can use as a mask for masked filtering.
SO Roipoly
Gradient
Approximate gradient. [FX,FY] = gradient(F) returns the numerical gradient of the matrix F. FX corresponds to dF/dx, the differences in x (horizontal) direction. FY corresponds to dF/dy, the differences in y (vertical) direction. The spacing between points in each direction is assumed to be one. When F is a vector, DF = gradient(F) is the 1-D gradient.
SO Gradient
Sub2Ind
Linear index from multiple subscripts. sub2ind is used to determine the equivalent single index corresponding to a given set of subscript values.
SO sub2ind
backslash operator or mldivide
solves the system of linear equations A*x = B. The matrices A and B must have the same number of rows.
cv::solve
I am using PCA on binary attributes to reduce the dimensions (attributes) of my problem. The initial dimensions were 592 and after PCA the dimensions are 497. I used PCA before, on numeric attributes in another problem, and it managed to reduce the dimensions to a greater extent (half of the initial dimensions). I believe that binary attributes decrease the power of PCA, but I do not know why. Could you please explain to me why PCA does not work as well as on numeric data?
Thank you.
The principal components of 0/1 data can fall off slowly or rapidly,
and the PCs of continuous data too —
it depends on the data. Can you describe your data ?
The following picture is intended to compare the PCs of continuous image data
vs. the PCs of the same data quantized to 0/1: in this case, inconclusive.
Look at PCA as a way of getting an approximation to a big matrix,
first with one term: approximate A ~ c U VT, i.e. Aij ~ c Ui Vj.
Consider this a bit, with A say 10k x 500: U 10k long, V 500 long.
The top row is c U1 V, the second row is c U2 V ...
all the rows are proportional to V.
Similarly the leftmost column is c U V1 ...
all the columns are proportional to U.
But if all rows are similar (proportional to each other),
they can't get near an A matrix with rows or columns 0100010101 ...
With more terms, A ~ c1 U1 V1T + c2 U2 V2T + ...,
we can get nearer to A: the smaller the higher-order ci, the faster we get there.
(Of course, all 500 terms recreate A exactly, to within roundoff error.)
The top row is "lena", a well-known 512 x 512 matrix,
with 1-term and 10-term SVD approximations.
The bottom row is lena discretized to 0/1, again with 1 term and 10 terms.
I thought that the 0/1 lena would be much worse -- comments, anyone ?
(U VT is also written U ⊗ V, called a "dyad" or "outer product".)
(The wikipedia articles
Singular value decomposition
and Low-rank approximation
are a bit math-heavy.
An AMS column by
David Austin,
We Recommend a Singular Value Decomposition
gives some intuition on SVD / PCA -- highly recommended.)
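To try the comparison on your own data, here is a rough NumPy sketch of the k-term approximation described above. The matrix is synthetic (not lena), the 0/1 version is just a threshold at the mean, and rank_k_approx and relative_error are names I made up for illustration.

```python
import numpy as np

def rank_k_approx(A, k):
    """Sum of the first k terms c_i U_i V_i^T of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def relative_error(A, k):
    """Frobenius-norm error of the k-term approximation, relative to ||A||."""
    return np.linalg.norm(A - rank_k_approx(A, k)) / np.linalg.norm(A)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 60)
# Smooth synthetic "image" plus a little noise (an assumption, for illustration only)
A = np.sin(4 * t)[:, None] * np.cos(3 * t)[None, :] + 0.1 * rng.standard_normal((60, 60))
B = (A > A.mean()).astype(float)             # the same data quantized to 0/1

for name, M in (("continuous", A), ("0/1", B)):
    print(name, [round(relative_error(M, k), 3) for k in (1, 10, 60)])
```

How fast the errors drop for the two matrices depends on the data, so run it on yours before drawing conclusions.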
There is a Table of pairs, which defines the bounds of the pieces.
And we are using straightforward algorithm:
y = f(x)
Calculate index n in Table using x
Get Yn and Yn+1, compute linear interpolation Y
Y is the answer.
So I think there must be a more efficient method; could you please point me to one?
Depending on the number and distribution of pairs, you might be able to instead store a table T containing only the Y values at regular intervals. Pick the interval to be a power of 2: i=2^c. Then for a given X:
n=X>>c;
Y = T[n];
Y += ((T[n+1] - T[n]) * (X & (i-1))) >> c;
This should work as long as you have space for a table with small enough intervals to catch sudden changes in the slope of Y, and enough headroom in Y for the multiply.
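Here is the same idea as a small Python sketch, assuming non-negative integer X and integer Y values. build_table and interpolate are hypothetical names, and the linear f is only there to make the result easy to check.

```python
def build_table(f, x_max, c):
    """Sample f at multiples of 2**c, with one extra entry so T[n+1] always exists."""
    step = 1 << c
    n_entries = (x_max >> c) + 2
    return [f(n * step) for n in range(n_entries)]

def interpolate(table, x, c):
    """Fixed-point linear interpolation between table[n] and table[n+1]."""
    n = x >> c                        # index of the interval containing x
    frac = x & ((1 << c) - 1)         # position of x inside the interval
    y = table[n]
    y += ((table[n + 1] - table[n]) * frac) >> c
    return y

c = 4                                 # interval of 2**4 = 16
table = build_table(lambda x: 3 * x + 7, 1000, c)
print(interpolate(table, 100, c))     # a linear f is reproduced exactly: 3*100 + 7
```

For a non-linear f the result is only as good as the interval is small, which is the trade-off mentioned above.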
Use binary search for step 1.
EDIT: due to the comment you added afterwards, this is not necessary, since your intervals are equally spaced.
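For tables whose breakpoints are not equally spaced, the binary-search variant could look roughly like this in Python, using the standard bisect module. interp_lookup is a made-up name, and xs is assumed to be sorted in strictly increasing order.

```python
from bisect import bisect_right

def interp_lookup(xs, ys, x):
    """Piecewise-linear lookup: binary search for the interval, then interpolate."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the table")
    # Largest n with xs[n] <= x, clamped so that xs[n + 1] exists.
    n = min(bisect_right(xs, x) - 1, len(xs) - 2)
    t = (x - xs[n]) / (xs[n + 1] - xs[n])
    return ys[n] + t * (ys[n + 1] - ys[n])

xs = [0, 1, 4, 10]
ys = [0, 10, 40, 100]
print(interp_lookup(xs, ys, 2.5))     # halfway between (1, 10) and (4, 40)
```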
I have a matrix with dimensions 500 x 10000. Each row represents a sample. For each sample, I want to find a set of cells that uniquely identifies that sample. Thus I am looking for a reduced 500 x n matrix, where n < 10000. Is there an existing algorithm that would help me?
This example in R doesn't necessarily yield an optimal solution (which in wise foresight you didn't request), but is quite quick:
# generate sample matrix
m = matrix(sample(0:4, 500*10000, T), 500, 10000)
# find smallest i for which the first i columns are unique over all sample rows
for (i in 4:10000) if (nrow(unique(m[, 1:i])) == 500) break
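Since adding a column can never make two distinct rows equal again, "the first i columns are unique" is monotone in i, so the linear scan can be replaced by a binary search. A rough NumPy sketch of that variant follows; the function names are hypothetical, and like the R version it only finds the shortest prefix of columns, not an optimal subset.

```python
import numpy as np

def rows_unique(m, i):
    """True if the first i columns already distinguish every row of m."""
    return len(np.unique(m[:, :i], axis=0)) == m.shape[0]

def smallest_identifying_prefix(m):
    """Smallest i such that the first i columns are unique over all rows, or None."""
    n_cols = m.shape[1]
    if not rows_unique(m, n_cols):
        return None                   # some rows are identical in every column
    lo, hi = 1, n_cols
    while lo < hi:                    # binary search on the monotone predicate
        mid = (lo + hi) // 2
        if rows_unique(m, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

rng = np.random.default_rng(0)
m = rng.integers(0, 5, size=(50, 200))   # smaller than 500 x 10000, for a quick run
i = smallest_identifying_prefix(m)
print(i, rows_unique(m, i))
```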