I have a matrix with dimensions 500 x 10000, where each row represents a sample. For each sample I want to find a set of cells that uniquely identifies that sample, i.e. a reduced matrix of size 500 x n, where n < 10000. Is there an algorithm already developed that would help me?
This example in R doesn't necessarily yield an optimal solution (which in wise foresight you didn't request), but is quite quick:
# generate sample matrix
m = matrix(sample(0:4, 500*10000, T), 500, 10000)
# find smallest i for which the first i columns are unique over all sample rows
for (i in 4:10000) if (nrow(unique(m[, 1:i])) == 500) break
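# the reduced 500 x i matrix is then m[, 1:i]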
I tried to solve this problem (counting the ways to color k cells in a 3 x N matrix so that no two colored cells are adjacent) using dynamic programming, but it seems I am missing some cases that I am unable to find.
Here is the recurrence I used to compute values from the sub-problems:
dp[i][j] = dp[i][j-1] + 3*(dp[i-1][j-1] - dp[i-2][j-2]) + dp[i-3][j-2]
(here i = k = the number of cells to be colored and j = n = the number of columns; note that the number of rows is fixed at 3)
The terms are as defined below:
dp[i][j-1] : the case where I don't color any cell in the nth column.
dp[i-1][j-1] - dp[i-2][j-2] : the case where I color one cell in the nth column; I then subtract the cases where the adjacent cell in the (n-1)th column is also colored. Since this can be done for each of the 3 cells in the nth column, I multiply it by 3.
dp[i-3][j-2] : the case where I color two cells (the top and bottom ones) in the nth column, which leaves only one choice for the (n-1)th column, namely the middle cell. Hence I subtract 3 from i, and since the last two columns are already accounted for, I subtract 2 from j.
I couldn't find any mistake in the above approach. If you see any mistake, please help.
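For reference, here is a brute-force count that the dp values can be checked against for small inputs (a minimal Python sketch, assuming "adjacent" means sharing an edge):
from itertools import combinations

def brute_force(n, k):
    # count ways to color k cells in a 3 x n grid with no two colored cells sharing an edge
    cells = [(r, c) for r in range(3) for c in range(n)]
    total = 0
    for chosen in combinations(cells, k):
        s = set(chosen)
        if all((r + 1, c) not in s and (r, c + 1) not in s for r, c in s):
            total += 1
    return total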
Below is the actual question, where an extra condition is also mentioned and has to be handled: no P consecutive columns may be left completely uncolored.
My approach is to first count all the ways to color k cells in a 3 x N matrix such that no two are adjacent, then count the ways in which some P consecutive columns contain no colored cells, and subtract that from the total. With this approach I miss the correct answer by a small margin for small inputs and by a large margin for larger inputs, so I must be missing something here.
I hope I am posting in the correct forum.
I just want to sound out my ideas and approach for solving the problem. I would welcome any pointers or help (code would definitely be ideal :) ).
Problem:
I want to compute a probability distribution over a 400 x 400 map in order to find the spatial location (x, y) of another line (let us call it fL), based on the probability values in that map.
From prior processing I have obtained a nearly horizontal line cue (let us call it lC) that I can use to calculate this probability. fL is estimated to lie at a distance D away from this horizontal line cue. My task is to calculate this probability map.
Approach:
1) I would take the probability map distribution to be Gaussian:
P(fL | point) = exp( -(x - D)^2 / sigma^2 )
which gives the probability of the line fL given that the point in the line cue lC is a distance D away, depending on sigma (which defines how fast the probability decreases).
2) I would use a LineIterator to find every pixel that lies on the line cue lC (given that I know the start and end points of the line). Let's say I obtain n pixels on this line.
3) For every pixel in the 400 x 400 image, I would calculate the probability from 1) above for each of the n points obtained from the line, and sum up the contributions of the n line points.
4) After finishing the calculation for all pixels in the 400 x 400 image, I would normalize the probabilities by the largest pixel probability value. For this part I am unsure whether I should normalize by the sum of all pixel probabilities or by the maximum as described above.
5) After this I would multiply this probability map with other probability maps, so I would get
P(fL | Cuefromthisline, Cuefromsomeother, ...) = P(fL | Cuefromthisline) P(fL | Cuefromsomeother) ...
And I would set pixels with near-zero probability to 0.001.
6) That outlines my approach (a small code sketch follows below).
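To make steps 1) to 4) concrete, here is a minimal Python/NumPy sketch of what I have in mind; the function name is made up, and I am assuming the x in the Gaussian is the Euclidean distance from a map pixel to a cue pixel:
import numpy as np

def probability_map(lc_points, D, sigma, shape=(400, 400)):
    # lc_points: the n (x, y) pixels on the line cue lC from step 2)
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pmap = np.zeros(shape)
    for px, py in lc_points:
        # steps 1) and 3): each cue pixel contributes a Gaussian in its distance to the map pixel
        dist = np.sqrt((xs - px) ** 2 + (ys - py) ** 2)
        pmap += np.exp(-((dist - D) ** 2) / sigma ** 2)
    # step 4): normalize by the largest value (the alternative would be pmap / pmap.sum())
    return pmap / pmap.max()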
Question
1) Is this workable, or is there a better method for doing this, i.e. getting the probability map?
2) How do I normalize the map: by the sum of all pixel probabilities, or by the maximum value?
Thanks in advance for reading this long post.
I am trying to do PCA on some volatility data, and let's just say I propose a model like the following:
volatility = beta0 + beta1*x + beta2*x^2
where x is a vector of observations, for example moneyness, and so on.
So in Matlab, what I did was Y = [ones(size(x)) x x.^2] and then pca(Y),
and for some reason the first row of my coefficient matrix is always something like 0 0 1, i.e., zero everywhere except the last column, and the latent output always shows its highest value in the first row as well, no matter how I change the model.
Obviously, it can't be the case that in every single model this term is explained entirely by the last component. And if I remove the constant term from Y (i.e., Y = [x x.^2]), the first row of the coefficient matrix becomes something more normal (i.e., non-zero values everywhere).
So my questions are:
Is my way of doing PCA right?
Does PCA automatically reorder the principal components, so that the first row of the coefficient matrix being all zeros except for a 1 in the last column may not necessarily correspond to the last term in the equation? And
if it is wrong, what is the correct way of doing it?
From Matlab's documentation for princomp:
COEFF = princomp(X) performs principal components analysis (PCA) on
the n-by-p data matrix X, and returns the principal component
coefficients, also known as loadings. Rows of X correspond to
observations, columns to variables. COEFF is a p-by-p matrix, each
column containing coefficients for one principal component. The
columns are in order of decreasing component variance.
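Note that princomp (and pca) center each variable before computing the components, so a constant column has zero variance after centering and can only load on the last, zero-variance component; that is why its row of COEFF looks like 0 0 1. (And latent is just the component variances in decreasing order, so its largest value is always in the first row.) A small NumPy sketch of the same effect, with illustrative data rather than your volatility series:
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
Y = np.column_stack([np.ones_like(x), x, x**2])

# PCA by hand: center the variables, eigendecompose the covariance matrix,
# and sort the components by decreasing variance (as princomp/pca do).
Yc = Y - Y.mean(axis=0)
eigvals, coeff = np.linalg.eigh(np.cov(Yc, rowvar=False))
order = np.argsort(eigvals)[::-1]
coeff, eigvals = coeff[:, order], eigvals[order]

print(coeff[0])    # row for the constant column: roughly [0, 0, 1] (up to sign)
print(eigvals)     # the last component has (near-)zero variance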
I am using PCA on binary attributes to reduce the dimensionality of my problem. The initial number of dimensions was 592 and after PCA it is 497. I used PCA before on numeric attributes in another problem, and there it managed to reduce the dimensions to a much greater extent (about half of the initial number). I believe that binary attributes decrease the power of PCA, but I do not know why. Could you please explain why PCA does not work as well as it does on numeric data?
Thank you.
The principal components of 0/1 data can fall off slowly or rapidly, and so can the PCs of continuous data; it depends on the data. Can you describe your data?
The following picture is intended to compare the PCs of continuous image data
vs. the PCs of the same data quantized to 0/1: in this case, inconclusive.
Look at PCA as a way of getting an approximation to a big matrix, first with one term: approximate A ~ c U VT, i.e. Aij ~ c Ui Vj.
Consider this a bit, with A say 10k x 500: U 10k long, V 500 long.
The top row is c U1 V, the second row is c U2 V ...
all the rows are proportional to V.
Similarly the leftmost column is c U V1 ...
all the columns are proportional to U.
But if all rows are similar (proportional to each other),
they can't get near an A matrix with rows or columns 0100010101 ...
With more terms, A ~ c1 U1 V1T + c2 U2 V2T + ...,
we can get nearer to A: the smaller the higher ci are, the faster it converges.
(Of course, all 500 terms recreate A exactly, to within roundoff error.)
The top row is "lena", a well-known 512 x 512 matrix,
with 1-term and 10-term SVD approximations.
The bottom row is lena discretized to 0/1, again with 1 term and 10 terms.
I thought that the 0/1 lena would be much worse -- comments, anyone?
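For anyone who wants to reproduce the comparison, here is a minimal NumPy sketch of the 1-term / 10-term approximation, using a random matrix as a stand-in for the lena image (so the error values will differ from the picture):
import numpy as np

def svd_approx(A, k):
    # best rank-k approximation: keep the k largest singular triplets
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.random((512, 512))        # stand-in for a continuous grayscale image
B = (A > 0.5).astype(float)       # the same data quantized to 0/1

for k in (1, 10):
    for name, M in (("continuous", A), ("0/1", B)):
        err = np.linalg.norm(M - svd_approx(M, k)) / np.linalg.norm(M)
        print(name, "rank", k, "relative error", round(err, 3))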
(U VT is also written U ⊗ V, called a "dyad" or "outer product".)
(The Wikipedia articles Singular value decomposition and Low-rank approximation are a bit math-heavy. An AMS column by David Austin, We Recommend a Singular Value Decomposition, gives some intuition on SVD / PCA -- highly recommended.)
There is a table of (X, Y) pairs, which defines the bounds of the pieces.
And we are using a straightforward algorithm:
y = f(x)
Calculate index n in Table using x
Get Yn and Yn+1, compute linear interpolation Y
Y is the answer.
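(For reference, a minimal Python sketch of this straightforward version; the names and the linear scan for the index are only illustrative.)
def interpolate(table_x, table_y, x):
    # step 1: find index n with table_x[n] <= x < table_x[n + 1] (linear scan here)
    n = 0
    while n + 1 < len(table_x) - 1 and table_x[n + 1] <= x:
        n += 1
    # step 2: linear interpolation between (Xn, Yn) and (Xn+1, Yn+1)
    t = (x - table_x[n]) / (table_x[n + 1] - table_x[n])
    return table_y[n] + t * (table_y[n + 1] - table_y[n])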
So I think there must be a more efficient method; could you please point me to one?
Depending on the number and distribution of pairs, you might be able to instead store a table T containing only the Y values at regular intervals. Pick the interval to be a power of 2: i=2^c. Then for a given X:
n = X >> c;                                   /* index of the interval containing X */
Y = T[n];
Y += ((T[n+1] - T[n]) * (X & (i-1))) >> c;    /* interpolate within the interval */
This should work as long as you have space for a table with small enough intervals to catch sudden changes in the slope of Y, and enough headroom in Y for the multiply.
Use binary search for step 1.
EDIT: due to the comment you added afterwards, this is not necessary, since your intervals are equally spaced.