I have text classification data where the prediction depends on two categories, 'descriptions' and 'components'. I can do the classification using bag of words in Python with scikit-learn on 'descriptions' alone, but I want to get predictions using both categories in bag of words, with weights applied to the individual feature sets:
x = descriptions + 2* components
How should I proceed?
You can train individual classifiers for 'descriptions' and 'components', and obtain a final score as score = w1 * score_descriptions + w2 * score_components.
The values of w1 and w2 should be obtained using cross validation.
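A minimal sketch of this first approach, assuming scikit-learn; the names train_descriptions, train_components, val_descriptions, val_components and y_train, as well as the choice of CountVectorizer and LogisticRegression, are illustrative assumptions:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# one bag-of-words classifier per feature set
desc_clf = make_pipeline(CountVectorizer(), LogisticRegression())
comp_clf = make_pipeline(CountVectorizer(), LogisticRegression())
desc_clf.fit(train_descriptions, y_train)   # 'descriptions' text column
comp_clf.fit(train_components, y_train)     # 'components' text column

# weighted combination of the two prediction scores; tune w1, w2 by cross validation
w1, w2 = 1.0, 2.0
score = (w1 * desc_clf.predict_proba(val_descriptions)
         + w2 * comp_clf.predict_proba(val_components))
pred = desc_clf.classes_[score.argmax(axis=1)]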
Alternatively, you can train a single multiclass classifier by combining the training dataset.
You will now have 4 classes:
Neither 'predictions' nor 'components'
'predictions' but not 'components'
not 'predictions' but 'components'
'predictions' and 'components'
And you can go ahead and train as usual.
I am working on a Bayesian hierarchical linear model in PyMC3.
The model has three input variables on a daily level: number of users, product category and product SKU, and the output variable is revenue. In total the data consists of roughly 73,000 records with 180 categories and 12,000 SKUs. Moreover, some categories/SKUs occur very often while others are rare. An example of the data is shown in the link:
Preview of the data
As the data on SKU level is very sparse, a hierarchical model has been chosen, with the intent that SKUs with little data shrink towards the category-level mean, and that scarce categories shrink towards the overall mean.
In the final model the categories are label encoded and the continuous variables users and revenue are min-max scaled.
At this point the model is formalized as follows:
with pm.Model() as model:
    sigma_overall = pm.HalfNormal("sigma_overall", mu=50)
    sigma_category = pm.HalfNormal("sigma_category", mu=sigma_overall)
    sigma_sku = pm.HalfNormal("sigma_sku", sigma=sigma_category, shape=n_sku)
    beta = pm.HalfNormal("beta", sigma=sigma_sku, shape=n_sku)
    epsilon = pm.HalfCauchy("epsilon", 1)
    y = pm.Deterministic('y', beta[category_idx][sku_idx] * df['users'].values)
    y_likelihood = pm.Normal("y_likelihood", mu=y, sigma=epsilon, observed=df['revenue'].values)
    trace = pm.sample(2000)
The main hurdle is that the model is very slow. It takes hours, sometimes a day, before sampling completes. Metropolis or NUTS sampling with find_MAP() did not make a difference. Furthermore, I doubt whether the model is formalized correctly, as I am pretty new to PyMC3.
A review of the model and advice to speed it up is very welcome.
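For reference, the shrinkage structure described in the question (SKU slopes pooled towards their category mean, and category means pooled towards an overall mean) could be written roughly as below. This is only an illustrative sketch, not the model above; it assumes the index arrays sku_to_category_idx and sku_idx and the counts n_category and n_sku already exist.
import pymc3 as pm

with pm.Model() as sketch:
    mu_overall = pm.Normal("mu_overall", mu=0.0, sigma=1.0)
    sigma_category = pm.HalfNormal("sigma_category", sigma=1.0)
    # category-level slopes, partially pooled towards the overall mean
    mu_category = pm.Normal("mu_category", mu=mu_overall, sigma=sigma_category, shape=n_category)
    sigma_sku = pm.HalfNormal("sigma_sku", sigma=1.0)
    # SKU-level slopes, partially pooled towards their category's mean
    beta_sku = pm.Normal("beta_sku", mu=mu_category[sku_to_category_idx], sigma=sigma_sku, shape=n_sku)
    epsilon = pm.HalfCauchy("epsilon", 1.0)
    y_likelihood = pm.Normal("y_likelihood", mu=beta_sku[sku_idx] * df["users"].values,
                             sigma=epsilon, observed=df["revenue"].values)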
I want to forecast some data (say, temperatures for several countries). Is there any way to add multiple countries' temperatures at once in DeepAR (the algorithm available in the AWS SageMaker marketplace) and have DeepAR forecast them independently? Is it possible to remove a particular country's data and add another after a few days?
I am new to forecasting and wanted to try DeepAR. If anyone has already worked on this, please give me some guidelines on how to do this using DeepAR.
Link - https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html
This is a late reply to this post, but it could be helpful to others in the future. The answer to your first question is yes.
The page you linked to references the cat field, which allows you to encode a vector representing different record groups. In your case, the cat field can just be a single value, but it can also encode more complex relationships with more dimensions in the vector.
Say you have 3 countries you want to make predictions for. Given some time-series temperature training data for each country, you would enter it as rows in the training JSON file like this:
Country 1:
{"start": "02/09/2019 00:00:00", "target": [T1,T2,T3,T4,...], "cat": [0]}
Country 2:
{"start": "02/09/2019 00:00:00", "target": [T1,T2,T3,T4,...], "cat": [1]}
Country 3:
{"start": "02/09/2019 00:00:00", "target": [T1,T2,T3,T4,...], "cat": [2]}
The category field indicates to DeepAR that these are independent data categories, in other words, different countries.
The frequency (time between temperature measurements) has to be the same for all data; however, the start time and the number of training points do not.
Once you've trained the model and opened the endpoint, you can make a prediction for one of the countries by passing its recent context along with the same cat value used for it above.
This allows you to make a single model that will allow you to make predictions from many independent groups of data.
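For illustration, a prediction request for country 2 could then look roughly like this; the exact request schema is described in the DeepAR docs linked above, and the configuration values here are only example assumptions:
{"instances": [{"start": "02/09/2019 00:00:00", "target": [T1,T2,T3,...], "cat": [1]}],
 "configuration": {"num_samples": 50, "output_types": ["mean", "quantiles"], "quantiles": ["0.1", "0.9"]}}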
I'm not sure exactly what you mean by the second question. If you mean to add more training data for another country later on, this would require you to create a different training dataset with an additional category for that country, then re-train the model.
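For example, if a fourth country were added later, the new training file would simply contain one more line with a new category value before re-training:
{"start": "02/09/2019 00:00:00", "target": [T1,T2,T3,T4,...], "cat": [3]}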
I'm applying PCA to my training set and want to do classification with an SVM, for example. How can I get the same features in the test set automatically (the same as in the new training set after PCA)?
In Python with scikit-learn, we fit the PCA and the classifier on the training data set, and then we transform the test data set using the already fitted PCA and predict with the already fitted classifier.
This is an example:
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
# load data
iris = load_iris()
# initiate PCA and classifier
pca = PCA()
classifier = DecisionTreeClassifier()
# transform / fit
X_transformed = pca.fit_transform(iris.data)
classifier.fit(X_transformed, iris.target)
# predict "new" data
# (I'm faking it here by using the original data)
newdata = iris.data
# transform new data using already fitted pca
# (don't re-fit the pca)
newdata_transformed = pca.transform(newdata)
# predict labels using the trained classifier
pred_labels = classifier.predict(newdata_transformed)
You should apply the same logic in Weka: apply the fitted PCA filter to the test data and then make predictions on the PCA-transformed test set. You can check the following Weka-related topic:
Principal Component Analysis on Weka
I am pretty new to machine learning in general and scikit-learn in particular.
I am trying to use the example given on the site http://scikit-learn.org/stable/tutorial/basic/tutorial.html
For practicing on my own, I am using my own data set. It is divided into two different CSV files:
Train_data.csv (contains 32 columns; the last column is the output value).
Test_data.csv (contains 31 columns; the output column is missing, which should be the case, no?)
The test data has one column fewer than the training data.
I am using the following code to learn (using training data) and then predict (using test data).
The issue I am facing is the error:
*ValueError: X.shape[1] = 31 should be equal to 29, the number of features at training time*
Here is my code (sorry if it looks completely wrong :( )
import pandas as pd #import the library
from sklearn import svm
mydata = pd.read_csv("Train - Copy.csv") #I read my training data set
target = mydata["Desired"] #my csv has header row, and the output label column is named "Desired"
data = mydata.ix[:,:-3] #select all but the last column as data
clf = svm.SVC(gamma=0.001, C=100.) #Code from the URL above
clf.fit(data,target) #Code from the URL above
test_data = pd.read_csv("test.csv") #I read my test data set. Without the output column
clf.predict(test_data[-1:]) #Code from the URL above
The training data csv labels looks something like this:
Value1,Value2,Value3,Value4,Output
The test data csv labels looks something like this:
Value1,Value2,Value3,Value4.
Thanks :)
Your problem is a supervised learning problem: you have some data in the form (input, output).
The input is the set of features describing your example, and the output is the prediction that your model should return given that input.
In your training data you have one more attribute in your CSV file, because in order to train your model you need to give it the output.
The general workflow in sklearn with a Supervised Problem should look like this
# read_data and model are placeholders for your own loading code and estimator
X, Y = read_data(data)
n = len(X)
X_train, X_test = X[:int(n * 0.8)], X[int(n * 0.8):]
Y_train, Y_test = Y[:int(n * 0.8)], Y[int(n * 0.8):]
model.fit(X_train, Y_train)
model.score(X_test, Y_test)
To split your data, you can use train_test_split and you can use several metrics in order to judge your model's performance.
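For example, a minimal split could look like this:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)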
You should check the shape of your data
data.shape
It seems like you're dropping the last 3 columns instead of only the last one. Try instead:
data = mydata.iloc[:, :-1]  # selects all but the last column (.ix is deprecated)
I would like to store multiple stations and multiple trains in my database. I was planning to have each station as one model and each train as one model.
But I would like to understand how we should store each train route in a dynamic way using the station model.
For example, we have stations A, B, C, D, E
And Train t1 route is A-C-B-D-E
And Train t2 route is A-B-E
So I would like to store these train routes under each row of the train model. Could someone help me with this?
Thanks
Something like this should work for you from a data model stand point.
from django.db import models

class Station(models.Model):
    pass

class Route(models.Model):
    class Meta:
        # Specify that the train and index are unique together to prevent
        # any duplicate stations at the same index value at the db level.
        # You'll probably want to validate this in the application logic as well.
        unique_together = ('train', 'index')

    # on_delete is required on ForeignKey in Django 2.0+
    station = models.ForeignKey('train.Station', on_delete=models.CASCADE)
    train = models.ForeignKey('train.Train', on_delete=models.CASCADE)
    index = models.IntegerField()

class Train(models.Model):
    stations = models.ManyToManyField('train.Station', through=Route)
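For example, a route could then be stored and read back roughly like this (a hypothetical usage sketch, assuming the Station objects a, c, b, d, e and a Train object t1 have already been saved):
# store the route A-C-B-D-E for train t1
for i, station in enumerate([a, c, b, d, e]):
    Route.objects.create(train=t1, station=station, index=i)

# read the stations of t1 back in travel order
stops = [r.station for r in Route.objects.filter(train=t1).order_by('index')]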
With your approach, you think of stations as if they were all connected. Each station could have a many-to-many field that allows you to "connect" stations with each other.
You then could use three models:
Trains
Routes
Stations
Stations are connected to stations. Routes have a starting and an ending point. Trains are related to routes (since one train can run on multiple routes, or one route can transport multiple trains).
You could also programmatically calculate, using graph algorithms, how many stations a train has to pass between its start and destination.
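A rough sketch of that alternative layout could look like this; the field names are illustrative assumptions, not a finished design:
from django.db import models

class Station(models.Model):
    # stations "connected" directly to neighbouring stations
    connections = models.ManyToManyField('self', blank=True)

class Route(models.Model):
    start = models.ForeignKey(Station, related_name='routes_starting', on_delete=models.CASCADE)
    end = models.ForeignKey(Station, related_name='routes_ending', on_delete=models.CASCADE)

class Train(models.Model):
    # one train can run on multiple routes and one route can serve multiple trains
    routes = models.ManyToManyField(Route)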