I'm applying PCA to my training set and then want to do a classification, with SVM for example. How can I get the same features in the test set automatically (the same as in the transformed training set after PCA)?
In Python with scikit-learn, you fit the PCA and the classifier on the training data set, and then you transform the test data set with the already-fitted PCA and predict with the already-fitted classifier.
This is an example:
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
# load data
iris = load_iris()
# initiate PCA and classifier
pca = PCA()
classifier = DecisionTreeClassifier()
# transform / fit
X_transformed = pca.fit_transform(iris.data)
classifier.fit(X_transformed, iris.target)
# predict "new" data
# (I'm faking it here by using the original data)
newdata = iris.data
# transform new data using already fitted pca
# (don't re-fit the pca)
newdata_transformed = pca.transform(newdata)
# predict labels using the trained classifier
pred_labels = classifier.predict(newdata_transformed)
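Equivalently, you can wrap the PCA and the classifier in a scikit-learn Pipeline so the fitted transform is applied automatically whenever you call predict; a minimal sketch with the same iris data:
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()

# fit PCA and the classifier on the training data in one call
pipe = make_pipeline(PCA(), DecisionTreeClassifier())
pipe.fit(iris.data, iris.target)

# at predict time the already-fitted PCA is applied internally before the classifier
pred_labels = pipe.predict(iris.data)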
You should apply the same logic with Weka: apply the fitted PCA filter to the test data and then perform predictions on the PCA-transformed test set. You can check the following Weka-related topic:
Principal Component Analysis on Weka
I'm trying to create a Vertex AI Pipeline to perform a hyperparameter tuning job that reads the data from a Vertex AI Dataset to have the metadata functionality track the relationship between dataset, model and endpoint (once I deploy the best model).
I'm following this tutorial, which reads the data directly from tensorflow_datasets, but I don't see any way to pass a Vertex AI dataset to the hyperparameter tuning job op.
Does anyone know how to access a Vertex AI Dataset in a hyperparameter tuning job?
Thank you.
You will need to add hypertune to the notebook so that it can report each trial's hyperparameters and their performance back to the tuning service:
import hypertune
from sklearn.metrics import f1_score  # metric used for reporting below
hp_metric = f1_score(y_test, y_pred, average='weighted')
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(hyperparameter_metric_tag='accuracy',
                                        metric_value=hp_metric,
                                        global_step=100)
And also add command-line arguments for the hyperparameters you want to tune:
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Input Arguments
    parser.add_argument(
        '--max_depth',
        help='RF model parameter - depth',
        type=int,
        default=100
    )
    parser.add_argument(
        '--max_features',
        help='RF model parameter - features',
        type=int,
        default=34
    )
    parser.add_argument(
        '--max_leaf_nodes',
        help='RF max_leaf_nodes',
        type=int,
        default=8
    )
    parser.add_argument(
        '--min_samples_leaf',
        help='RF min_samples_leaf',
        type=int,
        default=1
    )
    args = parser.parse_args()
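Putting the two pieces together, a minimal sketch of what the training script could look like; the RandomForestClassifier and the X_train/X_test/y_train/y_test variables are assumptions here, not part of the original question:
import argparse
import hypertune
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def train_and_report(args, X_train, X_test, y_train, y_test):
    # build the model from the tuned hyperparameters parsed above
    model = RandomForestClassifier(
        max_depth=args.max_depth,
        max_features=args.max_features,
        max_leaf_nodes=args.max_leaf_nodes,
        min_samples_leaf=args.min_samples_leaf,
    )
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # report the metric back to the Vertex AI hyperparameter tuning job
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag='accuracy',
        metric_value=f1_score(y_test, y_pred, average='weighted'),
        global_step=100,
    )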
I am trying to get a feature from an existing feature store.
In the documentation https://docs.mlrun.org/en/latest/api/mlrun.feature_store.html, it says you can pass either a feature vector URI or a FeatureVector object to mlrun.feature_store.get_offline_features().
What is the uri for a feature store?
Where can I find an example?
In MLRun, a Feature Set is a group of features that are ingested together. A Feature Vector is a selection of features from Feature Sets (a few columns here, a few columns there, etc). This is great for joining several data sources together using a common entity/key.
A full example of creating and querying a feature set from MLRun can be found below:
import mlrun.feature_store as fs
from mlrun import set_environment
import pandas as pd
# Set project - for retrieving features later
set_environment(project="my-project")
# Feature set to ingest
df = pd.DataFrame({
"key" : [0, 1, 2, 3],
"value" : ["A", "B", "C", "D"]
})
# Create feature set with desired name and entity/key
fset = fs.FeatureSet("my-feature-set", entities=[fs.Entity("key")])
# Ingest
fs.ingest(featureset=fset, source=df)
# Create feature vector (allows for joining multiple feature sets together)
features = ["my-feature-set.*"] # can also do ["my-feature-set.A", "my-feature-set.B", ...]
vector = fs.FeatureVector("my-feature-vector", features)
# Retrieve offline features (vector object)
fs.get_offline_features(vector)
# Retrieve offline features (project + name)
fs.get_offline_features("my-project/my-feature-vector")
# Retrieve offline features as pandas dataframe
fs.get_offline_features("my-project/my-feature-vector").to_dataframe()
You can find more feature store examples in the documentation here: https://docs.mlrun.org/en/latest/feature-store/feature-store.html
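For real-time lookups there is also an online feature service in the same mlrun.feature_store module; the snippet below is a sketch based on the API docs linked above, so verify the exact signatures against your MLRun version:
# Online (real-time) retrieval - sketch, verify against your MLRun version
svc = fs.get_online_feature_service("my-project/my-feature-vector")
resp = svc.get([{"key": 0}])
svc.close()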
I have a DataFrame which has a column of coordinate lists that looks like the following:
I want to make this DataFrame a GeoPandas DataFrame with a geometry column. One way to do this is to create two lists for latitude and longitude, storing the first and second element of each coors entry respectively, and then use gpd.points_from_xy to build the geometry column. But this approach adds extra steps for building the GeoPandas DataFrame. My question is how to build the geometry directly from the coors list.
I add some test data here:
import pandas as pd
import geopandas as gpd
data = {'id':[0,1,2,3], 'coors':[[41,-80],[40,-76],[35,-70],[35,-87]]}
df = pd.DataFrame.from_dict(data)
You can just apply Point to the 'coors' column to generate point geometry.
from shapely.geometry import Point
df['geometry'] = df.coors.apply(Point)
gdf = gpd.GeoDataFrame(df) # you should also specify CRS
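If you want to skip apply entirely, geopandas.points_from_xy can also build the geometry straight from the coors column; note that it expects x (longitude) first, so this sketch assumes each pair is [latitude, longitude]:
import pandas as pd
import geopandas as gpd

data = {'id': [0, 1, 2, 3], 'coors': [[41, -80], [40, -76], [35, -70], [35, -87]]}
df = pd.DataFrame.from_dict(data)

# second element as x (longitude), first element as y (latitude)
geometry = gpd.points_from_xy([c[1] for c in df.coors], [c[0] for c in df.coors])
gdf = gpd.GeoDataFrame(df, geometry=geometry, crs="EPSG:4326")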
In the case that you have latitude and longitude in separate columns, you can do this:
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point
def create_geometry(row):
    row["geometry"] = Point(row["latitude"], row["longitude"])
    return row
df = pd.read_[parquet|csv|etc]("file")
df = df.apply(create_geometry,axis=1)
gdf = gpd.GeoDataFrame(df,crs=4326)
Then, you can verify it by doing:
gdf.geometry.head(1)
output
16 POINT (7.88507 -76.63130)
Name: geometry, dtype: geometry
I am pretty new to machine learning in general and scikit-learn in particular.
I am trying to use the example given on the site http://scikit-learn.org/stable/tutorial/basic/tutorial.html
For practicing on my own, I am using my own data-set. My data set is divided into two different CSV files:
Train_data.csv (contains 32 columns; the last column is the output value).
Test_data.csv (contains 31 columns; the output column is missing, which should be the case, no?)
So the test data has one column less than the training data.
I am using the following code to learn (using training data) and then predict (using test data).
The issue I am facing is the error:
ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time
Here is my code (sorry if it looks completely wrong :( )
import pandas as pd #import the library
from sklearn import svm
mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"] #my csv has header row, and the output label column is named "Desired"
data = mydata.ix[:,:-3] #select all but the last column as data
clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target) #Code from the URL above
test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column
clf.predict(test_data[-1:]) #Code from the URL above
The training data csv labels looks something like this:
Value1,Value2,Value3,Value4,Output
The test data csv labels looks something like this:
Value1,Value2,Value3,Value4.
Thanks :)
Your problem is a supervised problem: you have data in the form (input, output).
The input is the features describing your example and the output is the prediction that your model should produce given that input.
In your training data you have one extra attribute in your csv file because, in order to train your model, you need to give it the output.
The general workflow in sklearn with a Supervised Problem should look like this
X, Y = read_data(data)
n = len(X)
X_train, X_test = X[:int(n*0.8)], X[int(n*0.8):]
Y_train, Y_test = Y[:int(n*0.8)], Y[int(n*0.8):]
model.fit(X_train,Y_train)
model.score(X_test, Y_test)
To split your data you can use train_test_split, and there are several metrics you can use to judge your model's performance.
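For example, a minimal sketch with train_test_split, reusing the data and target variables from your code above:
from sklearn.model_selection import train_test_split
from sklearn import svm

# hold out 20% of the training data for evaluation
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split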
You should check the shape of your data
data.shape
It seems like you're dropping the last 3 columns instead of only the last one. Try instead:
data = mydata.ix[:,:-1]
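Note that .ix has since been removed from pandas, so on recent versions the equivalent selection is:
data = mydata.iloc[:, :-1]  # all rows, all columns except the last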
I have text classification data where the predictions depend on two categories, 'descriptions' and 'components'. I could do the classification in Python with scikit-learn using a bag of words on 'descriptions' alone. But I want to get predictions using both categories in the bag of words, with weights on the individual feature sets:
x = descriptions + 2 * components
How should I proceed?
You can train individual classifiers for descriptions and components, and obtain a final score using score = w1 * score_descriptions + w2 * score_components.
The values of w1 and w2 should be obtained using cross-validation.
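A minimal sketch of that first approach, with toy data standing in for your real 'descriptions' and 'components' columns (the weights here are only illustrative; pick them with cross-validation):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# toy stand-ins for the real columns and labels
descriptions = ["crash on login", "slow page load", "crash on save", "slow login page"]
components = ["auth", "frontend", "storage", "frontend"]
y = [1, 0, 1, 0]

X_desc = CountVectorizer().fit_transform(descriptions)
X_comp = CountVectorizer().fit_transform(components)

# one classifier per feature set
clf_desc = LogisticRegression().fit(X_desc, y)
clf_comp = LogisticRegression().fit(X_comp, y)

# weighted combination of the two predicted probabilities
w1, w2 = 1.0, 2.0
score = w1 * clf_desc.predict_proba(X_desc) + w2 * clf_comp.predict_proba(X_comp)
pred = score.argmax(axis=1)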
Alternatively, you can train a single multiclass classifier by combining the training dataset.
You will now have 4 classes:
Neither 'predictions' nor 'components'
'predictions' but not 'components'
not 'predictions' but 'components'
'predictions' and 'components'
And you can go ahead and train as usual.