Can I make Amazon SageMaker deliver a recommendation based on historic data instead of a probability score? - amazon-web-services

We have a huge set of data in CSV format, containing a few numeric elements, like this:
Year,BinaryDigit,NumberToPredict,JustANumber, ...other stuff
1954,1,762,16, ...other stuff
1965,0,142,16, ...other stuff
1977,1,172,16, ...other stuff
The thing here is that there is a strong correlation between the third column and the columns before that. So I have pre-processed the data and it's now available in a format I think is perfect:
1954,1,762
1965,0,142
1977,1,172
What I want is a predicition on the value in the third column, using the first two as input. So in the case above, I want the input 1965,0 to return 142. In real life this file is thousands of rows, but since there's a pattern, I'd like to retrieve the most possible value.
So far I've setup a train job on the CSV file using the Linear Learner algorithm, with the following settings:
label_size = 1
feature_dim = 2
predictor_type = regression
I've also created a model from it, and setup an endpoint. When I invoke it, I get a score in return.
response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
ContentType='text/csv',
Body=payload)
My goal here is to get the third column prediction instead. How can I achieve that? I have read a lot of the documentation regarding this, but since I'm not very familiar with AWS, I might as well have used the wrong algorithms for what I am trying to do.
(Please feel free to edit this question to better suit AWS terminology)

For csv input, the label should be in the first column, as mentioned here: So you should preprocess your data to put the label (the column you want to predict) on the left.
Next, you need to decide whether this is a regression problem or a classification problem.
If you want to predict a number that's as close as possible to the true number, that's regression. For example, the truth might be 4, and the model might predict 4.15. If you need an integer prediction, you could round the model's output.
If you want the prediction to be one of a few categories, then you have a classification problem. For example, we might encode 'North America' = 0, 'Europe' = 1, 'Africa' = 2, and so on. In this case, a fractional prediction wouldn't make sense.
For regression, use 'predictor_type' = 'regressor' and for classification with more than 2 classes, use 'predictor_type' = 'multiclass_classifier' as documented here.
The output of regression will contain only a 'score' field, which is the model's prediction. The output of multiclass classification will contain a 'predicted_label' field, which is the model's prediction, as well as a 'score' field, which is a vector of probabilities representing the model's confidence. The index with the highest probability will be the one that's predicted as the 'predicted_label'. The output formats are documented here.

predictor_type = regression is not able to return the predicted label, according to
the linear-learner documentation:
For inference, the linear learner algorithm supports the application/json, application/x-recordio-protobuf, and text/csv formats. For binary classification models, it returns both the score and the predicted label. For regression, it returns only the score.
For more information on input and output file formats, see Linear
Learner Response Formats for inference, and the Linear Learner Sample
Notebooks.

Related

number expected.read Token[2015-02-02 14:19:00] weka project

i hope you all are doing well!
I have a project at data mining class.Τhe data consists of numerical data and many algorithms do not work.I have to do this:"you should compare the performance of the following categorization algorithms:
RandomForest, C4.5, JRip, Bayesian Network. Where necessary use them
Weka filters to replace or create values ​​for some properties
new properties. For comparison, adopt the Train / Test Percentage Split type with
percentage for training data equal to 80%.Describe your observations by giving tables with the results and
presenting the performance of the algorithms. Repeat the experiment by putting
percentage for training data equal to 70% and 50% presenting the results."
So my first try was to transform the data inside weka with preprocessing data numeric to nominal but a friend of mine suggest that is statistical wrong.So my second try was to use excel to transform all data even the date to numeric,remove the first row(id) and pass it to the weka(I leave double quotes only at date)
.But i have the error that i mention on the title.The dataset is:https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
Thank you for the time.
If you define date-like data as a DATE attribute in the ARFF file (using the right format for parsing the strings), then WEKA will treat it as a numeric attribute internally (Java epoch, ie milli-seconds since 1970-01-01).
Instead of using NumericToNominal, use either the supervised or unsupervised Discretize filter if the algorithm cannot handle numeric attributes.
Converting nominal attributes to numeric ones is not a recommended approach. Instead, try the supervised or unsupervised NominalToBinary filter.

Training and Test Set in Weka InCompatible in Text Classification

I have two datasets regarding whether a sentence contains a mention of a drug adverse event or not, both the training and test set have only two fields the text and the labels{Adverse Event, No Adverse Event} I have used weka with the stringtoWordVector filter to build a model using Random Forest on the training set.
I want to test the model built with removing the class labels from the test data set, applying the StringToWordVector filter on it and testing the model with it. When I try to do that it gives me the error saying training and test set not compatible probably because the filter identifies a different set of attributes for the test dataset. How do I fix this and output the predictions for the test set.
The easiest way to do this for a one off test is not to pre-filter the training set, but to use Weka's FilteredClassifier and configure it with the StringToWordVector filter, and your chosen classifier to do the classification. This is explained well in this video from the More Data Mining with Weka online course.
For a more general solution, if you want to build the model once then evaluate it on different test sets in future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data
by building a mapping between the training data that a classifier has
been built with and the incoming test instances' structure. Model
attributes that are not found in the incoming instances receive
missing values, so do incoming nominal attribute values that the
classifier has not seen before. A new classifier can be trained or an
existing one loaded from a file.
Weka requires a label even for the test data. It uses the labels or „ground truth“ of the test data to compare the result of the model against it and measure the model performance. How would you tell whether a model is performing well, if you don‘t know whether its predictions are right or wrong. Thus, the test data needs to have the very same structure as the training data in WEKA, including the labels. No worries, the labels are not used to help the model with its predictions.
The best way to go is to select cross validation (e.g. 10 fold cross validation) which automatically will split your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times so that each of the 10 parts has once been used as test data. The final performance verdict will be an average of all 10 rounds. Cross validation gives you a quite realistic estimate of the model performance on new, unseen data.
What you were trying to do, namely using the exact same data for training and testing is a bad idea, because the measured performance you end up with is way too optimistic. This means, you‘ll get very impressive figures like 98% accuracy during testing - but as soon as you use the model against new unseen data your accuracy might drop to a much worse level.

Overfitting with random forest though very successful cross validation results

I have moderate experience with data science. I have a data set with 9500 observations and more than 4500 features most of which are highly correlated. Here is briefly what I have tried: I have dropped columns where there are less than 6000 non-NAs and have imputed NAs with their corresponding columns' median values when there are at least 6000 non-NAs. As for correlation, I have kept only features having at most 0.7 correlation with others. By doing so, I have reduced the number of features to about 750. Then I have used those features in my binary classification task in random forest.
My data set is highly unbalanced where ratio of (0:1) is (10:1). So when I apply RF with 10-fold cv, I observe too good results in each cv (AUC of 99%) which is to good to be true and in my test set I got way worse results such as 0.7. Here is my code:
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init(port=23, nthreads=4)
train = fs_rf[fs_rf['Year'] <= '201705']
test = fs_rf[fs_rf['Year'] > '201705']
train = train.drop('Year',axis=1)
test = test.drop('Year',axis=1)
test.head()
train = h2o.H2OFrame(train)
train['BestWorst2'] = train['BestWorst2'].asfactor()
test = h2o.H2OFrame(test)
test['BestWorst2'] = test['BestWorst2'].asfactor()
training_columns = train.drop('BestWorst2',axis=1).col_names
response_column = 'BestWorst2'
model = H2ORandomForestEstimator(ntrees=100, max_depth=20, nfolds=10, balance_classes=True)
model.train(x=training_columns, y=response_column, training_frame=train)
performance = model.model_performance(test_data=test)
print(performance)
How could I avoid this over-fitting? I have tried many different parameters in grid search but none of them improved the results.
This is not what I would call "overfitting". The reason you are seeing really good cross-validation metrics compared to your test metrics is that you have time-series data and so you can't use k-fold cross-validation to give you an accurate estimate of performance.
Performing k-fold cross-validation on a time-series dataset will give you overly-optimistic performance metrics because you are not respecting the time-series component in your data. Regular k-fold cross-validation will randomly sample from your whole dataset to create a train & validation set. Essentially, your validation strategy is "cheating" because you have "future" data included in your CV training sets (if that makes any sense).
I can see by your code that you understand that you need to train with "past" data and predict on "future" data, but if you want to read more about this topic, I'd recommend this article or this article.
One solution is to simply look at test set performance as way to evaluate your model. Another option is to use what's called "rolling" or "time-series" cross-validation, but H2O does not currently support that (though it seems like it might be added soon). Here's a ticket for this if you want to keep track of the progress.

How to do prediction with weka

i'm using weka to do some text mining, i'm a little bit confused so i'm here to ask how can i ( with a set of comments that are in a some way classified as: notes, status of work, not conformity, warning) predict if a new comment belong to a specific class, with all the comment (9551) i've done a preprocess obtaining with the filter "stringtowordvector" a vector of tokens, and then i've used the simple kmeans to obtain a number of cluster.
So the question is: if a user post a new comment can i predict with those data if it belong to a category of comment?
sorry if my question is a little bit confused but so am i.
thank you
Trivial Training-validation-test
Create two datasets from your labelled instances. One will be training set and the other will be validation set. The training set will contain about 60% of the labelled data and the validation will contain 40% of the labelled data. There is no hard and fast rule for this split, but a 60-40 split is a good choice.
Use K-means (or any other clustering algorithm) on your training data. Develop a model. Record the model's error on training set. If the error is low and acceptable, you are fine. Save the model.
For now, your validation set will be your test dataset. Apply the model you saved on your validation set. Record the error. What is the difference between training error and validation error? If they both are low, the model's generalization is "seemingly" good.
Prepare a test dataset where you have all the features of your training and test dataset but the class/cluster is unknown.
Apply the model on the test data.
10-fold cross validation
Use all of your labelled data instances for this task.
Apply K-means (or any other algorithm of your choice) with a 10-fold CV setup.
Record the training error and CV error. Are they low? Is the difference between the errors is low? If yes, then save the model and apply it on the test data whose class/cluster is unknown.
NB: The training/test/validation errors and their differences will give you an "very initial" idea of overfitting/underfitting of your model. They are sanity tests. You need to perform other tests like learning curves to see if your model overfits or underfits or perfect. If there appears to be an overfitting and underfitting problem, you need to try many different techniques to overcome them.

Regression Tree Forest in Weka

I'm using Weka and would like to perform regression with random forests. Specifically, I have a dataset:
Feature1,Feature2,...,FeatureN,Class
1.0,X,...,1.4,Good
1.2,Y,...,1.5,Good
1.2,F,...,1.6,Bad
1.1,R,...,1.5,Great
0.9,J,...,1.1,Horrible
0.5,K,...,1.5,Terrific
.
.
.
Rather than learning to predict the most likely class, I want to learn the probability distribution over the classes for a given feature vector. My intuition is that using just the RandomForest model in Weka would not be appropriate, since it would be attempting to minimize its absolute error (maximum likelihood) rather than its squared error (conditional probability distribution). Is that intuition right? Is there a better model to be using if I want to perform regression rather than classification?
Edit: I'm actually thinking now that in fact it may not be a problem. Presumably, classifiers are learning the conditional probability P(Class | Feature1,...,FeatureN) and the resulting classification is just finding the c in Class that maximizes that probability distribution. Therefore, a RandomForest classifier should be able to give me the conditional probability distribution. I just had to think about it some more. If that's wrong, please correct me.
If you want to predict the probabilities for each class explicitly, you need different input data. That is, you would need to replace the value to predict. Instead of one data set with the class label, you would need n data sets (for n different labels) with aggregated data for each unique feature vector. Your data would look something like
Feature1,...,Good
1.0,...,0.5
0.3,...,1.0
and
Feature1,...,Bad
1.0,...,0.8
0.3,...,0.1
and so on. You would need to learn one model for each class and run them separately on any data to be classified. That is, for each label you learn a model to predict a number that is the probability of being in that class, given a feature vector.
If you don't need the probabilities to be predicted explicitly, have a look at the Bayesian classifiers in Weka, which make use of probabilities in the models that they learn.