Title: SVC-Scikit Learn issue - python-2.7

I am getting this error in scikit-learn. Previously I worked with k-fold cross-validation and never encountered this error. My data is sparse, and the training and testing sets are split in a 90:10 ratio.
ValueError: cannot use sparse input in 'SVC' trained on dense data
Is there any straightforward reason and solution for this?

This basically means that your testing set is not in the same format as your training set.
A code snippet would have been great, but make sure you are using the same array format for both sets.

Since SVC cannot use sparse input when it was trained on dense data, either convert your dense data to sparse (recommended) or your sparse data to dense. SciPy can create a sparse matrix from a dense one.
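For example, assuming X_train is a dense NumPy array and X_test is a SciPy sparse matrix (the names and shapes below are illustrative, not from the question), a minimal sketch of bringing both into the same format:

import numpy as np
from scipy import sparse

X_train = np.random.rand(90, 20)                            # dense training data
X_test = sparse.random(10, 20, density=0.1, format="csr")   # sparse test data

# Option 1 (recommended): make both sparse before fitting/predicting.
X_train_sparse = sparse.csr_matrix(X_train)

# Option 2: make both dense (only if the data fits in memory).
X_test_dense = X_test.toarray()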

Related

Using AutoML to evaluate the hyperparameters of the Word2Vec algorithm

Is it possible with AutoML (from H2O) to use only the Word2Vec algorithm and try out different values for its parameters to find out which parameter settings give me the most accurate vectors for my data set? I don't want AutoML to apply the DeepLearning, GBM, etc. algorithms to my dataset, only the Word2Vec algorithm. How do I do that?
So far I have only managed to build a word2vec model with H2O.
I would like to test different settings of the Word2Vec hyperparameters with AutoML to evaluate which settings are optimal...
The Word2Vec algorithm is a data transformation algorithm (converting rows of text to a matrix), not a supervised machine learning algorithm (which is what AutoML and all the algorithms inside of it do).
The typical way Word2Vec is used is to apply it to your text data so that the data can be used to train a supervised ML algorithm. From there you can run any supervised algorithm (GLM, Random Forest, GBM, etc.) on the transformed dataset -- or, my recommendation, just pass the transformed data to AutoML so it can find the best algorithm for you.
You will have to try out different settings for Word2Vec manually and see how well they do, given some particular supervised learning algorithm that you want to apply to your problem. Hopefully that clears up the confusion.
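A rough sketch of that workflow in H2O's Python API (the file path, column names, tokenization pattern, and hyperparameter values below are placeholders, not from the question):

import h2o
from h2o.estimators.word2vec import H2OWord2vecEstimator
from h2o.automl import H2OAutoML

h2o.init()
data = h2o.import_file("your_text_data.csv")   # placeholder path
words = data["text"].tokenize("\\W+")          # placeholder text column

# Train Word2Vec with one setting; repeat manually with other vec_size/epochs values.
w2v = H2OWord2vecEstimator(vec_size=100, epochs=10)
w2v.train(training_frame=words)

# Aggregate word vectors per document and attach the label column.
vecs = w2v.transform(words, aggregate_method="AVERAGE")
train = vecs.cbind(data["label"])              # placeholder label column

# Let AutoML search for the best supervised model on the transformed data.
aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="label", training_frame=train)
print(aml.leaderboard)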

t-SNE Choosing the Number of Dimensions

I am using t-SNE for exploratory data analysis. I am using this instead of PCA because PCA is linear and t-SNE is non-linear.
It's really straightforward to know how many dimensions are required to capture the necessary variance with PCA.
How do I know how many dimensions are required for my data using t-SNE?
I have read a popular website with very useful information, but it doesn't discuss dimensionality:
https://distill.pub/2016/misread-tsne/
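For reference, the "straightforward" PCA check mentioned above is usually the cumulative explained-variance ratio; a minimal scikit-learn sketch on made-up data:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 50))   # made-up data

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_dims = int(np.argmax(cumvar >= 0.95)) + 1            # dimensions for 95% of the variance
print(n_dims)

# t-SNE exposes no analogous variance ratio; in practice it is usually run
# with 2 or 3 components for visualization.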

Extract region from a Curvilinear satellite Dataset

I have satellite swath data from MODIS and need to extract a subset (region) of the data to analyze (NOT PLOT). I am trying to find the best way to do this without loops, which can be slow. In the past I have used set.intersect, but that does not work on 2D data.
My issue is that both lat and lon are 2D, and I need to find the indices where my conditions are met, (lat>=x1)&(lat<=x2) and similarly for lon, and then use those 2D indices to slice my main data set (Aerosol Optical Depth).
Latitude Sample
Longitude Sample
Aerosol MetaData
Code so Far
Normally (for 1D lat/lon) I would use Opt_Depth_Land[:,goodlat,goodlon] to extract my data, but this does not work for this type of data set.
Any help would be greatly appreciated.
valid_lat = (lat >= user_lat - radius) & (lat <= user_lat + radius)  # within +/- radius in latitude
valid_lon = (lon >= user_lon - radius) & (lon <= user_lon + radius)  # within +/- radius in longitude
Valid_Coord = np.where(valid_lat & valid_lon)  # row/column indices where both conditions hold
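A sketch of how the combined boolean mask can slice the AOD array directly, assuming lat, lon, and Opt_Depth_Land share the same 2D grid in their last two dimensions (shapes below are toy stand-ins for the real MODIS arrays):

import numpy as np

lat = np.random.uniform(-90, 90, size=(203, 135))
lon = np.random.uniform(-180, 180, size=(203, 135))
Opt_Depth_Land = np.random.rand(3, 203, 135)    # e.g. (band, along-track, across-track)

user_lat, user_lon, radius = 10.0, 25.0, 5.0

valid_lat = (lat >= user_lat - radius) & (lat <= user_lat + radius)
valid_lon = (lon >= user_lon - radius) & (lon <= user_lon + radius)
mask = valid_lat & valid_lon                    # 2D boolean mask, no loops

# Boolean indexing with a 2D mask selects and flattens the matching cells:
subset = Opt_Depth_Land[:, mask]                # shape (3, n_valid_pixels)

# If the row/column indices themselves are needed:
rows, cols = np.where(mask)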

PCA in thousands of dimensions

I have 4096-dimensional VLAD codes, each one representing an image.
I have to run PCA on them without reducing their dimension, on learning datasets with fewer than 4096 images (e.g., the Holiday dataset with <2k images), to obtain the rotation matrix A.
In this paper (which also explains why I need to run this PCA without dimensionality reduction) they solved the problem with this approach:
For efficient computation of A, we compute at most the first 1,024 eigenvectors by eigendecomposition of the covariance matrix, and
the remaining orthogonal complements up to D-dimensional are filled using Gram-Schmidt orthogonalization.
Now, how do I implement this using C++ libraries? I'm using OpenCV, but cv::PCA does not seem to offer such a strategy. Is there any way to do this?
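The question asks for C++/OpenCV, but the paper's recipe can be sketched language-agnostically; below is a NumPy illustration in which a QR factorization plays the role of the Gram-Schmidt fill-in (the function name and sizes are illustrative):

import numpy as np

def pca_rotation(X, k=1024):
    # Leading k eigenvectors of the covariance plus an orthonormal
    # complement, giving a full D x D rotation matrix A (sketch only).
    X = X - X.mean(axis=0)              # center the descriptors
    D = X.shape[1]
    cov = np.cov(X, rowvar=False)       # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]   # sort by decreasing eigenvalue
    V = eigvecs[:, order[:k]]           # D x k leading eigenvectors
    # With fewer images than dimensions the covariance is rank-deficient,
    # so the remaining D - k directions are filled with an orthonormal
    # complement: QR on random vectors projected off span(V).
    rng = np.random.default_rng(0)
    R = rng.standard_normal((D, D - k))
    R -= V @ (V.T @ R)                  # remove components along V
    Q, _ = np.linalg.qr(R)              # orthonormal basis of the complement
    return np.hstack([V, Q])            # D x D rotation matrix A

A = pca_rotation(np.random.rand(300, 512), k=128)   # small sizes for the demo

For the paper's "at most the first 1,024 eigenvectors", a partial eigensolver (e.g. scipy.sparse.linalg.eigsh) avoids computing the full spectrum; in OpenCV, cv::eigen on the covariance matrix followed by the same orthogonal fill-in step would mirror this structure.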

classifying a weighted feature vector

I want to give weights to the features of a data set before using them in a classification algorithm like KNN or J48, but I don't know how to evaluate a weighted feature vector.
Does any classification algorithm accept weights as input, rather than just '0' and '1'?
In particular, are any of Weka's ready-made classifiers capable of working with weights (not just 0 and 1 as filters)?
In most situations, you can just scale the data set according to your weights. This is trivial to prove for Minkowski distances such as Euclidean distance.
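As a quick illustration of that scaling trick on made-up data (with k-NN, multiplying each feature by the square root of its weight reproduces the weighted squared-Euclidean distance):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # made-up feature matrix
y = (X[:, 0] + 0.1 * X[:, 1] > 0).astype(int)    # made-up labels
w = np.array([4.0, 1.0, 0.25, 0.25])             # per-feature weights

# Plain Euclidean distance on X_scaled equals
# sqrt(sum_i w_i * (x_i - y_i)^2) on the original features.
X_scaled = X * np.sqrt(w)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_scaled, y)
print(knn.predict(X_scaled[:5]))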
Not all of Weka's classification algorithms support weights, but some do.
You need to set the weight information after loading your dataset; see the example code in the Weka wiki. I recall that Weka's J48 decision tree supports weights in the developer version, but I cannot find a reference. There is a patch, though.
This search for feature weights in the Weka wiki may help.
I suggest trying to add weights to your data set and training on your data.