I am using the Metacost function in Weka. I would like to view the total cost of the classifier. Could someone please tell me how to view the total cost? I am using the Weka GUI explorer.
I have tried enabling the cost-sensitive evaluation option under 'More options...' in the Classify tab. However, since I then have to enter the cost matrix twice (once in MetaCost and once under 'More options...'), would the system return the total cost for MetaCost or for the cost-sensitive classifier? I am a bit confused by it.
Thank you in advance
From my understanding, the cost matrix you supply for the classifier simply informs Weka how the classifier should be built. It will not affect the evaluation of the classifier. If you do want to evaluate the classifier's performance under the same cost structure, you of course would need to supply the cost matrix for cost sensitive evaluation.
In the procedure you describe, the output total cost would be from the cost sensitive evaluation.
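To make the idea concrete, here is a minimal sketch of how a total cost can be computed from a confusion matrix and a cost matrix: each cell count is multiplied by the cost of that (actual, predicted) pair and the products are summed. The function name `total_cost` and all the numbers are made up for illustration; this is not Weka's code, only the arithmetic behind a cost-sensitive evaluation.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over every (actual, predicted) cell."""
    return sum(
        n * c
        for conf_row, cost_row in zip(confusion, cost)
        for n, c in zip(conf_row, cost_row)
    )

# Hypothetical 2-class example: rows are actual classes, columns are predictions.
confusion = [[50, 10],
             [5, 35]]
# Hypothetical costs: correct predictions cost 0, the two error types cost 1 and 5.
cost = [[0, 1],
        [5, 0]]

total = total_cost(confusion, cost)              # 10 * 1 + 5 * 5
n_instances = sum(sum(row) for row in confusion)
average = total / n_instances                    # average cost per instance

print(total, average)
```

The matrix given to the evaluation is the one that enters this sum; the matrix given to the classifier only changes which cells the predictions land in.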
Is it possible in Weka to train a model minimizing a cost factor?
I have a data set containing a cost factor in each sample. It defines what using this sample would cost. Now, I would like to select as much of the samples as possible while minimizing this cost factor.
E.g. with Multilayer perceptron, I want to train the neurons in a way, that it chooses as many samples as possible while minimizing the sum of the cost factor.
I've checked all the model options and also searched the package manager for something like that, but I was unable to find anything. Could someone tell me whether this can be done using Weka?
What you are describing sounds more like an optimization problem rather than a classification or regression problem (for which you would use a Weka classifier).
Weka does have some limited support for optimization through its abstract weka.core.Optimization class (e.g., used internally by weka.classifiers.functions.Logistic). But that requires implementing some methods.
To cast your net wider, you might want to take a look at the following article that describes various optimization techniques:
https://machinelearningmastery.com/tour-of-optimization-algorithms/
I have moderate experience with data science. I have a data set with 9500 observations and more than 4500 features most of which are highly correlated. Here is briefly what I have tried: I have dropped columns where there are less than 6000 non-NAs and have imputed NAs with their corresponding columns' median values when there are at least 6000 non-NAs. As for correlation, I have kept only features having at most 0.7 correlation with others. By doing so, I have reduced the number of features to about 750. Then I have used those features in my binary classification task in random forest.
My data set is highly unbalanced: the ratio of (0:1) is (10:1). When I apply RF with 10-fold CV, I observe results in each fold that are too good to be true (AUC of 99%), while on my test set I get much worse results, such as an AUC of 0.7. Here is my code:
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init(port=23, nthreads=4)
train = fs_rf[fs_rf['Year'] <= '201705']
test = fs_rf[fs_rf['Year'] > '201705']
train = train.drop('Year',axis=1)
test = test.drop('Year',axis=1)
test.head()
train = h2o.H2OFrame(train)
train['BestWorst2'] = train['BestWorst2'].asfactor()
test = h2o.H2OFrame(test)
test['BestWorst2'] = test['BestWorst2'].asfactor()
training_columns = train.drop('BestWorst2',axis=1).col_names
response_column = 'BestWorst2'
model = H2ORandomForestEstimator(ntrees=100, max_depth=20, nfolds=10, balance_classes=True)
model.train(x=training_columns, y=response_column, training_frame=train)
performance = model.model_performance(test_data=test)
print(performance)
How could I avoid this over-fitting? I have tried many different parameters in grid search but none of them improved the results.
This is not what I would call "overfitting". The reason you are seeing really good cross-validation metrics compared to your test metrics is that you have time-series data and so you can't use k-fold cross-validation to give you an accurate estimate of performance.
Performing k-fold cross-validation on a time-series dataset will give you overly-optimistic performance metrics because you are not respecting the time-series component in your data. Regular k-fold cross-validation will randomly sample from your whole dataset to create a train & validation set. Essentially, your validation strategy is "cheating" because you have "future" data included in your CV training sets (if that makes any sense).
I can see by your code that you understand that you need to train with "past" data and predict on "future" data, but if you want to read more about this topic, I'd recommend this article or this article.
One solution is to simply look at test set performance as a way to evaluate your model. Another option is to use what's called "rolling" or "time-series" cross-validation, but H2O does not currently support that (though it seems like it might be added soon). Here's a ticket for this if you want to keep track of the progress.
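For illustration, here is a minimal sketch of what "rolling" (expanding-window) cross-validation splits look like. It is plain Python, not an H2O feature; the function name `expanding_window_splits` and its parameters are hypothetical, and it assumes the rows are already sorted by time so that a smaller index means an earlier observation.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices); training rows always precede test rows."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(0, test_start))                       # everything before the fold
        test_idx = list(range(test_start, test_start + test_size))   # the next block in time
        yield train_idx, test_idx

# Toy example: 10 time-ordered rows, 3 folds of 2 validation rows each.
splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Each fold would then be used to train on `train_idx` and score on `test_idx`, so no "future" rows ever leak into a training set.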
I have gone through many articles, and all of them suggest creating custom fields/keywords in TestLink to estimate the time for sprint execution.
Articles like:
http://www.softwaretestingconcepts.com/testlink-using-custom-fields-and-keywords-for-effective-testing
Is there any alternative approach or any scientific method to estimate sprint execution accurately?
I have found one article proposing the method below:
Number of Test Cases = (Number of Function Points) × 1.2
Source :- http://www.tutorialspoint.com/estimation_techniques/estimation_techniques_testing.htm
What should be the approach to estimating an execution cycle? Currently, I estimate based on my experience in my project. It is working fine, but management wants a concrete mechanism for it. Please suggest an approach and share your experience.
I have added Time Estimate and Actual Time using the option below:
Below is the result of the above setting:
I am not able to get this field data in a report. I also need the total estimate, and then a comparison between actual and estimated time.
Any help would be appreciated.
TestLink provides built-in support for recording the estimated execution time (defined when the test case is created) and the actual execution time (recorded when the test case is executed).
Hopefully you can build what you need on top of this feature.
I am trying to predict whether a particular service ticket raised by client needs a code change.
I have training data.
I have around 17k data points with problem description and tag (Y for code change required and N for no code change)
I did TF-IDF and it gave me 27k features. So I tried to fit RandomForestClassifier (sklearn python) with this 17k x 27k matrix.
I am getting very low scores on test set while training accuracy is very high.
Precision on train set: 89%
Precision on test set: 21%
Can someone suggest any workarounds?
I am using this model now:
sklearn.ensemble.RandomForestClassifier(n_jobs=3, n_estimators=100, class_weight='balanced', max_features=None, oob_score=True)
Please help!
EDIT:
I have 11k training data points with 900 positives (skewed). I tried LinearSVC to sparsify, but it didn't work as well as Truncated SVD (Latent Semantic Indexing). max_features=None performs better on the test set than without it.
I have also tried SVM, logistic regression (l2 and l1), and ExtraTrees. RandomForest is still working best.
Right now, I am at 92% precision on positives, but recall is only 3%.
Any other suggestions would be appreciated!
Update:
Feature engineering helped a lot. I pulled features out of the air (length in chars, length in words, their difference, their ratio, day of the week the problem was reported, day of month, etc.) and now I am at 19-20% recall with >95% accuracy.
Any thoughts on using word2vec average vectors as deep features for the free text instead of TF-IDF or bag of words?
[edited]
Random forest handles more features than data points quite well. RF is, e.g., used for micro-array studies with a 100:5000 data point/feature ratio, or in single-nucleotide polymorphism (SNP) studies with, e.g., a 5000:500,000 ratio.
I do disagree with the diagnosis provided by @ncfirth, but the suggested treatment of variable selection may help anyway.
Your default random forest is not badly overfitted. It is just not meaningful to pay any attention to a non-cross validated training set prediction performance for a RF model, because any sample will end in the terminal nodes/leafs it has itself defined. But the overall ensemble model is still robust.
[edit] If you would change the max_depth or min_samples_split, the training precision would probably drop, but that is not the point. The non-cross validated training error/precision of a random forest model or many other ensemble models simply does not estimate anything useful.
[I did before edit confuse max_features with n_estimators, sry I mostly use R]
Setting max_features=None is not random forest, but rather 'bagged trees'. You may or may not benefit from a somewhat lower max_features, which improves regularization and speed. I would try lowering max_features to somewhere between sqrt(27000) and 27000/3, the typical optimal range.
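As a quick worked example of that range for 27k features, here is the arithmetic in Python. The variable names are made up for illustration; the resulting integers are the kind of values one could pass as max_features.

```python
import math

n_features = 27000

lower = int(math.sqrt(n_features))  # about sqrt(p), the lower end of the range
upper = n_features // 3             # p / 3, the upper end of the range

print(lower, upper)
```

So the range to explore is roughly 164 to 9000 features per split, instead of all 27000.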
You may achieve better test set prediction performance by feature selection. You can run one RF model, keep the top ~5-50% most important features and then re-run the model with fewer features. "L1 lasso" variable selection as ncfirth suggests may also be a viable solution.
Your metric of prediction performance, precision, may not be optimal in the case of unbalanced data or if the costs of false negatives and false positives are quite different.
If your test set is still predicted much worse than the out-of-bag cross-validated training set, you may have problems with the I.I.D. assumption that any supervised ML model relies on, or you may need to wrap the entire data processing in an outer cross-validation loop to avoid over-optimistic estimates of prediction performance due to, e.g., the variable selection step.
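A small sketch of why precision alone can mislead on skewed data: the counts below are hypothetical (chosen only to resemble "high precision, very low recall"), and `precision_recall_f1` is a made-up helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 900 true positives exist, the model flags only 29 rows,
# and 27 of those flags are correct.
precision, recall, f1 = precision_recall_f1(tp=27, fp=2, fn=873)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Precision looks excellent here, yet almost all positives are missed, which is why recall, F1, or a cost-weighted metric is worth reporting alongside it.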
Seems like you've overfit on your training set. Basically the model has learnt noise in the data rather than the signal. There are a few ways to combat this, but it seems fairly obvious that your model has overfit because of the incredibly large number of features you're feeding it.
EDIT:
It seems I was perhaps too quick to jump to the conclusion of overfitting, however this may still be the case (left as an exercise to the reader!). However feature selection may still improve the generalisability and reliability of your model.
A good place to start for removing features in scikit-learn would be here. Using sparsity is a fairly common way to perform feature selection:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import numpy as np
# Create some data
X = np.random.random((1800, 2700))
# Boolean labels as the y vector
y = np.random.random(1800)
y = y > 0.5
y = y.astype(bool)
lsvc = LinearSVC(C=0.05, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(X_new.shape)
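Which returns a new matrix with far fewer columns, e.g. of shape (1800, 640); the exact number varies from run to run because the data are random. You can tune the number of features selected by altering the C parameter (called the penalty parameter in scikit-learn but sometimes called the sparsity parameter); with an L1 penalty, a smaller C selects fewer features.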
I've estimated a model via maximum likelihood in Stata and was surprised to find that estimated standard errors for one particular parameter are drastically smaller when clustering observations. I take it from the Stata manual on robust standard error estimation in ML that this can happen if the contributions of individual observations to the score (the derivative of the log-likelihood) tend to cancel each other within clusters.
I would now like to dig a little deeper into what exactly is happening and would therefore like to have a look at these score contributions. As far as I can see, however, Stata only gives me the total sum as e(gradient). Is there any way to pry the individual summands out of Stata?
If you have written your own command, you can create a new variable containing these scores using the ml score command. Official Stata commands and most finished user written commands will often have score as an option for predict, which does the same thing but with an easier syntax.
These will give you the score of the log likelihood ($\ell$) with respect to the linear predictor, $x\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$. To get the derivative of the log likelihood with respect to an individual parameter, say $\beta_1$, you just use the chain rule:
$\frac{\partial \ell}{\partial \beta_1} = \frac{\partial \ell }{\partial x\beta} \frac{\partial x\beta}{\partial \beta_1}$
The scores returned by Stata are $ \frac{\partial \ell }{\partial x\beta}$, and $\frac{\partial x\beta}{\partial \beta_1} = x_1$.
So, to get the score for $\beta_1$ you just multiply the score returned by Stata and $x_1$.
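To illustrate the chain rule outside Stata, here is a minimal Python sketch for a logit model, where the score with respect to the linear predictor is $y - p$. The data, cluster ids, and parameter values are hypothetical (they are not estimates of anything), and the function names are made up; the point is only that multiplying the per-observation score by $x_1$ gives the summands of the gradient for $\beta_1$, which can then be summed within clusters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical observations: (cluster id, x1, y)
data = [(1, 0.5, 1), (1, -1.0, 0), (2, 2.0, 1), (2, 1.5, 0)]
beta0, beta1 = 0.1, 0.3  # hypothetical parameter values

def loglik(b0, b1):
    """Logit log likelihood summed over all observations."""
    total = 0.0
    for _, x1, y in data:
        p = sigmoid(b0 + b1 * x1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def score_xb(b0, b1, x1, y):
    """d(loglik_i) / d(xb) for a logit model: y - p."""
    return y - sigmoid(b0 + b1 * x1)

# Chain rule: score for beta1 = score w.r.t. xb times x1, per observation.
score_b1 = [score_xb(beta0, beta1, x1, y) * x1 for _, x1, y in data]

# Sum the contributions within each cluster to see how much they cancel.
cluster_sums = {}
for (cid, _, _), s in zip(data, score_b1):
    cluster_sums[cid] = cluster_sums.get(cid, 0.0) + s

gradient_b1 = sum(score_b1)  # the total, analogous to one element of e(gradient)
print(score_b1, cluster_sums, gradient_b1)
```

Comparing the individual values in `score_b1` with the entries of `cluster_sums` shows directly whether contributions within a cluster tend to offset each other.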