What kind of PCA is performed by the PCA widget - data-mining

I am wondering if PCA widget was performing centred and/or normalized PCA. I did'nt find any corresponding option in the widget.
Does anyone know the answer and if there is plan to add these options?
Thanks alot. Best regards,

Yes, the 'PCA' widget both centers and standardizes the data.

Usually, PCA is defined to be on standardized data.
The use of the term PCA for performing SVD on other matrixes (such as non-central correlation matrixes) is much less common - because there is a proper term for that, too: SVD.
Furthermore, using it on non-centered data means that the transformation is sensitive to data translation; an aspect one usually does not want.
So unless explicitly specified, you should assume that PCA is implemented by
Centering the data
Computing the covariance matrix
Applying SVD on the covariance matrix to decompose it as desired

Related

Principal component analysis on proportional data

Is it valid to run a PCA on data that is comprised of proportions? For example, I have data on the proportion of various food items in the diet of different species. Can I run a PCA on this type of data or should I transform the data or do something else beforehand?
I had a similar question. You should search for "compositional data analysis". There are transformation to apply to proportions in order to analyze them with multivariate tecniques such as PCA. You can find also "robust" PCA algorithms to run your analysis in R. Let us know if you find an appropriate solution to your specific problem.
I don't think so.
PCA will give you "impossible" answers. You might get principal components with values that proportions can't have, like negative values or values greater than 1. How would you interpret this component?
In technical terms, the support of your data is a subset of the support of PCA. Say you have $k$ classes. Then:
the support for PCA vectors is $\R^k$
the support for your proportion vectors is the $k$- dimensional simplex. By simplex I mean the set of $p$ vectors of length $k$ such that:
$0 \le p_i \le 1$ where $i = 1, ..., k$
$\sum_{i=1}^k{p_i} = 1$
One way around this is if there's a one to one mapping between the $k$-simplex to all of $\R^k$. If so, you could map from your proportions to $\R^k$, do PCA there, then map the PCA vectors to the simplex.
But I'm not sure the simplex is a self-contained linear space. If you add two elements of the simplex, you don't get an element of the simplex :/
A better approach, I think, is clustering, eg with Gaussian mixtures, or spectral clustering. This is related to PCA. But a nice property of clustering is you can express any element of your data as a "convex combination" of the clusters. If you analyze your proportion data and find clusters, they (unlike PCA vectors) will be within the simplex space, and any mixture of them will be, too.
I also recommend looking into nonnegative matrix factorization. This is like PCA but, as the name suggests, avoids negative components and also negative eigenvectors. It's very useful for inferring structure in strictly positive data, like proportions. But nmf does not give you a basis for simplex space.

How do I minimize global error across multiple image homographies?

I am stitching together multiple images with arbitrary 3D views of a planar surface. I have some estimation of which images overlap and a coarse estimate of each pairwise homography between pairs of overlapping images. However, I need to refine my homographies by minimizing the global error across all images.
I have read a few different papers with various methods for doing this, and I think the best way would be to use a non-linear optimization such as Levenberg–Marquardt, ideally in a fast way that is sparse and/or parallel.
Ideally I would like to use an existing library such as sba or pba, but I am really confused as to how to limit the calculation to just estimating the eight parameters of the homography rather than the full 3 dimensions for both camera pose and object position. I also found this handy explanation by Szeliski (see section 5.1 on page 50) but again, the math is all for a rotating camera rather than a flat surface.
How do I use L-M to minimize the global error for a set of homographies? Is there a speedy way to do this with existing bundle adjustment libraries?
Note: I cannot use methods that rely on rotation-only camera motion (such as in openCV) because those cannot accurately estimate camera poses, and I also cannot use full 3D reconstruction methods (such as SfM) because those have too many parameters which results in non-planar point clouds. I definitely need something specific to a full 8 parameter homography. Camera intrinsics don't really matter because I am already correcting those in an earlier step.
Thanks for your help!

What is class_weight parameter does in scikit-learn SGD

I am a frequent user of scikit-learn, I want some insights about the “class_ weight ” parameter with SGD.
I was able to figure out till the function call
plain_sgd(coef, intercept, est.loss_function,
penalty_type, alpha, C, est.l1_ratio,
dataset, n_iter, int(est.fit_intercept),
int(est.verbose), int(est.shuffle), est.random_state,
pos_weight, neg_weight,
learning_rate_type, est.eta0,
est.power_t, est.t_, intercept_decay)
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/stochastic_gradient.py
After this it goes to sgd_fast and I am not very good with cpython. Can you give some celerity on these questions.
I am having a class biased in the dev set where positive class is somewhere 15k and negative class is 36k. does the class_weight will resolve this problem. Or doing undersampling will be a better idea. I am getting better numbers but it’s hard to explain.
If yes then how it actually does it. I mean is it applied on the features penalization or is it a weight to the optimization function. How I can explain this to layman ?
class_weight can indeed help increasing the ROC AUC or f1-score of a classification model trained on imbalanced data.
You can try class_weight="auto" to select weights that are inversely proportional to class frequencies. You can also try to pass your own weights has a python dictionary with class label as keys and weights as values.
Tuning the weights can be achieved via grid search with cross-validation.
Internally this is done by deriving sample_weight from the class_weight (depending on the class label of each sample). Sample weights are then used to scale the contribution of individual samples to the loss function used to trained the linear classification model with Stochastic Gradient Descent.
The feature penalization is controlled independently via the penalty and alpha hyperparameters. sample_weight / class_weight have no impact on it.

Given that the data features are all nominal; does it make any sense to apply PCA to the data?

If PCA also helps to normalize the data, how a normalized data is going to be improved by PCA. Thanks
PCA does more than normalize the data (in fact, it only normalizes if whiten=True). It also projects into a different space using the n_components eigenvectors with maximal variation (largest eignevalues) - this can provide better performance for your classifier/clustering if your data is "stretched" along a particular dimension. See this example for more information.

How to move epipole to the outside of the image

Hi i had computed the fundamental matrix from two images and i found out that the epipoles lie within the image. I cannot do the rectification using matlab if the image contains epipole.
May i know how to compute the fundamental matrix that the epipole is not in the image?
The epipolar geometry is the intrinsic projective geometry between two
views. It is independent of scene structure, and only depends on the
cameras' internal parameters and relative pose.
So the intrinsics/extrinsics of the cameras define the fundamental matrix that you get (i.e. you cannot compute another fundamental, s.t. the epipoles are not in the image).
What you can do is either take a different pair of images (with a different camera geometry, for example) and you may get epipoles out of the image.
The problem you're actually having is that the rectification algorithm that you're using is limited and doesn't work for the case when the epipole is inside the image. Note, there exist other algorithms that do not have this limitation. I have implemented such an algorithm in the past, and may be can find the (MATLAB) code. So, please let me know if you're interested.
If you're in a mood to learn more about epipolar geometry and the fundamental matrix, I recommend you take a look here: