How to reconstruct an approximate data set in WEKA after PCA? - weka

I want to reconstruct an approximation of the original data set from the reduced-dimensionality data in WEKA. I tried the transformBackToOriginal option in PCA, but I still got a reduced number of attributes after doing this. How do I know which attribute of the original data set each one really represents?
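
To make the idea of "reconstructing an approximation" concrete, here is a minimal NumPy sketch of projecting onto k principal components and mapping back to the original attribute space. It is not WEKA's implementation; the data, the function name `reconstruct`, and the choice of centering without standardizing are all assumptions made for illustration.

```python
import numpy as np

def reconstruct(X, k):
    """Project X onto its first k principal components and map back."""
    mean = X.mean(axis=0)
    Xc = X - mean                      # center each attribute
    # Rows of Vt are the principal directions, ordered by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                       # shape (n_attributes, k)
    Z = Xc @ W                         # reduced data, shape (n_instances, k)
    return Z @ W.T + mean              # back in the original attribute space

# Synthetic data for illustration only: 100 instances, 5 attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

X_hat = reconstruct(X, k=2)
print(X_hat.shape)                     # same shape as X
print(np.linalg.norm(X - X_hat))       # approximation error
```

In this sketch the reconstructed matrix has as many columns as the original, and column j of `X_hat` approximates column j of `X`.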

Related

How to produce a smoothed curve in Superset?

Using Apache Superset, I would like to plot a smoothed curve for my data.
I have data where the frequency of datapoints is inconsistent; sometimes there are whole months without data:
I would like to smooth this using Apache Superset.
My approach was to resample the data to daily values, filling in days without values with a zero value:
This gives an appropriate result:
I then attempted to smooth the data with a rolling mean:
However, this does not give the result I expected:
I would have expected a smooth line. How should I modify my use of Superset to obtain a smoothed line given data with inconsistent frequency of datapoints?
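
For reference, the resample-then-rolling-mean pipeline described above can be reproduced outside Superset with pandas. This is only a sketch: the dates and values are made up, the window size is arbitrary, and the interpolation variant is an assumed alternative that does not appear in the question.

```python
import pandas as pd

# Hypothetical irregular observations (made-up dates and values).
obs = pd.Series(
    [10.0, 12.0, 11.0, 30.0],
    index=pd.to_datetime(
        ["2021-01-01", "2021-01-04", "2021-01-05", "2021-01-10"]
    ),
)

# Resample to daily frequency; days without data become NaN.
daily = obs.resample("D").mean()

# Approach from the question: fill gaps with zero, then take a rolling mean.
# The zeros are averaged in, so the curve is pulled toward zero in the gaps.
zero_filled = daily.fillna(0.0).rolling(window=3, min_periods=1).mean()

# Assumed alternative: interpolate across gaps before the rolling mean.
interpolated = (
    daily.interpolate(method="time").rolling(window=3, min_periods=1).mean()
)

print(pd.DataFrame({"zero_filled": zero_filled, "interpolated": interpolated}))
```

Comparing the two columns shows how much of the jaggedness comes from the zero fill rather than from the rolling mean itself.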

Principal component analysis on proportional data

Is it valid to run a PCA on data that consists of proportions? For example, I have data on the proportion of various food items in the diet of different species. Can I run a PCA on this type of data, or should I transform the data or do something else beforehand?
I had a similar question. You should search for "compositional data analysis". There are transformations to apply to proportions in order to analyze them with multivariate techniques such as PCA. You can also find "robust" PCA algorithms to run your analysis in R. Let us know if you find an appropriate solution to your specific problem.
I don't think so.
PCA will give you "impossible" answers. You might get principal components with values that proportions can't have, like negative values or values greater than 1. How would you interpret this component?
In technical terms, the support of your data is a subset of the support of PCA. Say you have $k$ classes. Then:
the support for PCA vectors is $\mathbb{R}^k$
the support for your proportion vectors is the $k$-dimensional simplex. By simplex I mean the set of vectors $p$ of length $k$ such that:
$0 \le p_i \le 1$ where $i = 1, ..., k$
$\sum_{i=1}^k{p_i} = 1$
One way around this is if there's a one-to-one mapping between the $k$-simplex and all of $\mathbb{R}^k$. If so, you could map your proportions to $\mathbb{R}^k$, do PCA there, then map the PCA vectors back to the simplex.
But I'm not sure the simplex is a self-contained linear space. If you add two elements of the simplex, you don't get an element of the simplex :/
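
As a sketch of the map, analyze, map-back idea, here is one standard choice from compositional data analysis, the centered log-ratio (clr) transform, followed by PCA via SVD. This is an assumption about which mapping to use, not something the answer specifies. Note that clr sends the simplex onto the zero-sum hyperplane of $\mathbb{R}^k$ rather than onto all of $\mathbb{R}^k$, and it requires strictly positive proportions. The data are synthetic.

```python
import numpy as np

def clr(p):
    """Centered log-ratio transform; rows of p must be strictly positive."""
    logp = np.log(p)
    return logp - logp.mean(axis=1, keepdims=True)

def clr_inverse(z):
    """Map clr coordinates back to proportions that sum to 1."""
    e = np.exp(z - z.max(axis=1, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=1, keepdims=True)

# Synthetic proportions for illustration: 50 samples, 4 diet categories.
rng = np.random.default_rng(0)
P = rng.dirichlet(alpha=[2.0, 3.0, 5.0, 1.0], size=50)

Z = clr(P)
mean = Z.mean(axis=0)
# PCA in clr space: rows of Vt are the principal directions.
_, _, Vt = np.linalg.svd(Z - mean, full_matrices=False)
pc1 = Vt[0]

# Move one unit along the first component and map back to the simplex.
point = clr_inverse((mean + 1.0 * pc1)[None, :])
print(point, point.sum())
```

Any point obtained this way has positive entries that sum to 1, so it can be read as a composition again.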
A better approach, I think, is clustering, e.g. with Gaussian mixtures or spectral clustering. This is related to PCA. But a nice property of clustering is that you can express any element of your data as a "convex combination" of the clusters. If you analyze your proportion data and find clusters, they (unlike PCA vectors) will be within the simplex space, and any mixture of them will be, too.
I also recommend looking into nonnegative matrix factorization (NMF). This is like PCA but, as the name suggests, it constrains both the components and their weights to be nonnegative. It's very useful for inferring structure in nonnegative data, like proportions. But NMF does not give you a basis for the simplex space.
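
A minimal sketch of NMF using the classic multiplicative update rules (for the squared-error objective) is shown below. The function name `nmf`, the rank, the iteration count, and the synthetic data are all assumptions for illustration; a library implementation would normally be preferable in practice.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic proportions: 50 samples, 4 categories, each row sums to 1.
rng = np.random.default_rng(1)
V = rng.dirichlet(alpha=[2.0, 3.0, 5.0, 1.0], size=50)

W, H = nmf(V, rank=2)
print(np.linalg.norm(V - W @ H))   # reconstruction error
```

The rows of `H` are nonnegative but are not forced to sum to 1, which is consistent with the remark that NMF does not give a basis for the simplex.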

Extract region from a Curvilinear satellite Dataset

I have satellite swath data from MODIS and need to extract a subset (region) of data to analyze (NOT PLOT). I am trying to find the best way to do this without loops, which can be slow. In the past I have used set.intersect, but this does not work on 2D data.
My issue is that both lat and lon are 2D, and I need to find the indices where my conditions are met, (lat>=x1)&(lat<=x2) and similarly for lon, and then use those 2D indices to slice my main data set (Aerosol Optical Depth).
Latitude Sample
Longitude Sample
Aerosol MetaData
Code so Far
Normally (for 1D lat/lon) I would use Opt_Depth_Land[:,goodlat,goodlon] to extract my data, but this does not work for this type of data set.
Any help would be greatly appreciated.
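import numpy as np
# lat and lon are 2D arrays with the same shape as one layer of the swath
valid_lat = (lat >= user_lat - radius) & (lat <= user_lat + radius)
valid_lon = (lon >= user_lon - radius) & (lon <= user_lon + radius)
# Row and column indices of the pixels that fall inside the box
Valid_Coord = np.where(valid_lat & valid_lon)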
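
One way to use such a mask without loops is boolean indexing on the last two axes. The sketch below uses small synthetic arrays in place of the MODIS fields; the shapes, the name `data`, and the leading layer axis are assumptions, since the real metadata is not shown.

```python
import numpy as np

# Synthetic stand-ins: 2D lat/lon of shape (6, 5) and data of shape (2, 6, 5).
lat, lon = np.meshgrid(
    np.linspace(30, 40, 6), np.linspace(-100, -90, 5), indexing="ij"
)
data = np.arange(2 * 6 * 5, dtype=float).reshape(2, 6, 5)

user_lat, user_lon, radius = 35.0, -95.0, 2.5
mask = (np.abs(lat - user_lat) <= radius) & (np.abs(lon - user_lon) <= radius)

# Option 1: flat list of valid pixels per layer, shape (n_layers, n_valid).
values = data[:, mask]

# Option 2: keep the 2D layout and blank out everything outside the region.
masked = np.where(mask, data, np.nan)

# Option 3: crop to the bounding box of the mask (assumes at least one hit).
rows, cols = np.where(mask)
box = data[:, rows.min():rows.max() + 1, cols.min():cols.max() + 1]

print(values.shape, box.shape)
```

On a real curvilinear swath the bounding box in option 3 can contain pixels outside the requested region, so combining it with the mask from option 2 may be necessary.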

SVC-Scikit Learn issue

I am getting this error in scikit-learn. Previously I worked with k-fold validation and never encountered this error. My data is sparse, and the training and testing sets are split in a 90:10 ratio.
ValueError: cannot use sparse input in 'SVC' trained on dense data
Is there any straightforward reason and solution for this?
This basically means that your testing set is not in the same format as your training set.
A code snippet would have been great, but make sure you are using the same array format for both sets.
Since SVC cannot use sparse input when it was trained on dense data, either convert your dense data to sparse data (recommended) or your sparse data to dense data. Use SciPy to create a sparse matrix from a dense one.
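
A minimal sketch of keeping both sets in the same format is shown below. The data are synthetic and the variable names are hypothetical, since the original code was not posted.

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC

# Synthetic data for illustration: 40 samples, 5 features, 90:10 split.
rng = np.random.default_rng(0)
X = rng.random((40, 5))
y = (X[:, 0] > 0.5).astype(int)
X_train, X_test = X[:36], X[36:]
y_train = y[:36]

# Convert BOTH sets to the same sparse format before fit and predict.
X_train_sp = sparse.csr_matrix(X_train)
X_test_sp = sparse.csr_matrix(X_test)

clf = SVC(kernel="linear").fit(X_train_sp, y_train)
pred = clf.predict(X_test_sp)
print(pred)
```

The reverse direction also works: if the model was trained on a dense array, calling `.toarray()` on the sparse test matrix before `predict` avoids the mismatch, at the cost of more memory.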

Stratified sampling in WEKA

How can I split a data set into training and test sets of 75% and 25% of the original data set, respectively, using stratified sampling to preserve the proportional class sizes in the new sets? I am trying to do this with WEKA.
The "RemovePercentage" filter does not do it in a stratified manner, and the "StratifiedRemoveFolds" filter does not do this using percentages.
I would appreciate any help or suggestion.
So, as a workaround, I split the data set into two using StratifiedRemoveFolds. In this case my number of folds was 2, yielding a 50%-50% split. Then I split one of the folds into two using the same method, yielding two subsets that are each 25% of the original data set. Then I merged one of the 25% subsets with the leftover 50%, yielding a 75%-25% stratified split, which was my target.
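
To illustrate what a stratified 75%-25% split is meant to achieve, here is a small sketch in plain Python. It is not WEKA code and does not reproduce the StratifiedRemoveFolds filter; the function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices), sampled class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class to keep proportions.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy labels: 80 instances of class "a" and 20 of class "b".
labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Both resulting sets keep the 80:20 class ratio of the toy data, which is the property the two-step fold workaround is also aiming for.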