Weka cross-validation wrong results

I am classifying 5 minutes of EEG data into 4 classes using a Bayesian network.
When I apply cross-validation I get 100% correct results, whereas when I use a training set and a supplied test set (the first 3.7 minutes for training, the remaining 1.3 minutes for testing, in a separate file) I get really low results (around 30%).
I am new to Weka and do not understand how this is possible. Any help would be highly appreciated :)

Related

Google AutoML Importing text items very slow

I'm importing text items into Google's AutoML. Each row contains around 5,000 characters and I'm adding 70K of these rows. This is a multi-label data set. There is no progress bar or indication of how long the process will take, and it has been running for a couple of hours. Is there any way to calculate the remaining or total estimated time? I'd like to add additional data sets, but I'm worried that the import will take a very long time before the training even begins. Any sort of formula for even a semi-wild guess would be great. Thanks!
I don't think that's possible today, but I filed a feature request [1] that you can follow for updates. I asked for an estimate for both importing and training data, since it would be useful for training too.
I tried training with 50K records (~300 bytes/record) and the load took more than 20 minutes, after which I killed it. I retried with 1K records, which ran for 20 minutes and then emailed me an error message saying I had multiple labels per input (yes, so what? training data is going to have some of those) and more than 100 labels. I simplified the classification buckets and re-ran; it took another 20 minutes and was successful. Then I ran training, which took 3 hours and billed me $11. That maps to $550 for 50K records, assuming linear behavior. The prediction results were not bad for a first pass, but I got the feeling that it is throwing a super-large neural net at the problem. It would help if they said which NN it was and its dimensions. They do say "beta" :)
Don't waste your time trying to use Google for text classification. I am a heavy GCP user, but Microsoft LUIS is far better, more precise, and so much faster that I can't believe both products are trying to solve the same problem.
LUIS has much better documentation, supports more languages, has a much better test interface, and is way faster. I don't know yet whether it is cheaper because the pricing model is different, but we are willing to pay more.

Training and Test Set in Weka Incompatible in Text Classification

I have two datasets indicating whether a sentence contains a mention of a drug adverse event or not. Both the training and test set have only two fields: the text and the label {Adverse Event, No Adverse Event}. I have used Weka with the StringToWordVector filter to build a Random Forest model on the training set.
I want to test the model by removing the class labels from the test data set, applying the StringToWordVector filter to it, and running the model on the result. When I try to do that, I get an error saying the training and test set are not compatible, probably because the filter derives a different set of attributes from the test dataset. How do I fix this and output predictions for the test set?
The easiest way to do this for a one-off test is not to pre-filter the training set, but to use Weka's FilteredClassifier, configured with the StringToWordVector filter and your chosen classifier. This is explained well in this video from the More Data Mining with Weka online course.
For a more general solution, if you want to build the model once and then evaluate it on different test sets in the future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data by building a mapping between the training data that a classifier has been built with and the incoming test instances' structure. Model attributes that are not found in the incoming instances receive missing values, so do incoming nominal attribute values that the classifier has not seen before. A new classifier can be trained or an existing one loaded from a file.
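The thread is about the Weka GUI, but the same idea of fitting the text filter only on the training data can be sketched in Python with scikit-learn (an analogue chosen for illustration, not Weka's own API): CountVectorizer plays the role of StringToWordVector and Pipeline plays the role of FilteredClassifier. The file and column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

train = pd.read_csv("train.csv")  # hypothetical columns: text, label
test = pd.read_csv("test.csv")    # same layout; labels kept only for evaluation

model = Pipeline([
    ("vec", CountVectorizer()),   # vocabulary is learned from the training text only
    ("rf", RandomForestClassifier(n_estimators=100)),
])
model.fit(train["text"], train["label"])

# The vectorizer maps the test text onto the *training* vocabulary,
# so the attribute sets always match and predictions can be produced.
print(model.predict(test["text"])[:10])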
Weka requires a label even for the test data. It uses the labels, or "ground truth", of the test data to compare the model's predictions against and to measure the model's performance. How would you tell whether a model is performing well if you don't know whether its predictions are right or wrong? Thus, the test data needs to have the very same structure as the training data in Weka, including the labels. No worries, the labels are not used to help the model with its predictions.
The best way to go is to select cross-validation (e.g. 10-fold cross-validation), which automatically splits your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times so that each of the 10 parts is used as test data exactly once. The final performance verdict is the average over all 10 rounds. Cross-validation gives you a fairly realistic estimate of the model's performance on new, unseen data.
What you were trying to do, namely using the exact same data for training and testing, is a bad idea, because the measured performance you end up with is far too optimistic. You'll get very impressive figures, such as 98% accuracy during testing, but as soon as you use the model on new, unseen data your accuracy might drop to a much worse level.
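To make the 10-fold procedure concrete, here is a minimal sketch in Python with scikit-learn (again an analogue for illustration; in the Weka Explorer you would simply select "Cross-validation, Folds 10" on the Classify tab). The file and column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("labelled_data.csv")  # hypothetical: numeric features plus a "label" column
X = data.drop(columns=["label"])
y = data["label"]

# 10-fold cross-validation: every example is used for testing exactly once,
# and the reported figure is the average over the 10 rounds.
scores = cross_val_score(RandomForestClassifier(), X, y, cv=10)
print(scores.mean(), scores.std())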

Overfitting with random forest though very successful cross validation results

I have moderate experience with data science. I have a data set with 9,500 observations and more than 4,500 features, most of which are highly correlated. Here, briefly, is what I have tried: I dropped columns with fewer than 6,000 non-NA values, and for columns with at least 6,000 non-NA values I imputed the NAs with the column's median. As for correlation, I kept only features having at most 0.7 correlation with the others. This reduced the number of features to about 750, which I then used for my binary classification task with a random forest.
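(For illustration, the preprocessing described above might look roughly like the following pandas sketch; the file name is hypothetical and all feature columns are assumed to be numeric.)
import numpy as np
import pandas as pd

df = pd.read_csv("features.csv")  # hypothetical raw feature table

# Keep only columns with at least 6000 non-missing values,
# then fill the remaining NAs with each column's median.
df = df.loc[:, df.notna().sum() >= 6000]
df = df.fillna(df.median())

# Drop one feature from every pair with absolute correlation above 0.7.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
df = df.drop(columns=to_drop)
print(df.shape)  # roughly 750 feature columns should remain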
My data set is highly imbalanced, with a 10:1 ratio of class 0 to class 1. So when I apply RF with 10-fold CV, I observe suspiciously good results in each fold (AUC of 0.99), which is too good to be true, while on my test set I get much worse results, around 0.7. Here is my code:
import h2o
from h2o.estimators import H2ORandomForestEstimator
h2o.init(port=23, nthreads=4)
# fs_rf: the full feature table (prepared earlier, not shown).
# Time-based split: months up to 2017-05 for training, later months for testing.
train = fs_rf[fs_rf['Year'] <= '201705']
test = fs_rf[fs_rf['Year'] > '201705']
train = train.drop('Year', axis=1)
test = test.drop('Year', axis=1)
test.head()
# Convert to H2OFrames and make the binary target a factor.
train = h2o.H2OFrame(train)
train['BestWorst2'] = train['BestWorst2'].asfactor()
test = h2o.H2OFrame(test)
test['BestWorst2'] = test['BestWorst2'].asfactor()
training_columns = train.drop('BestWorst2', axis=1).col_names
response_column = 'BestWorst2'
# 10-fold CV on the training frame, with class balancing for the 10:1 imbalance.
model = H2ORandomForestEstimator(ntrees=100, max_depth=20, nfolds=10, balance_classes=True)
model.train(x=training_columns, y=response_column, training_frame=train)
# Evaluate on the held-out "future" test frame.
performance = model.model_performance(test_data=test)
print(performance)
How can I avoid this overfitting? I have tried many different parameter combinations in a grid search, but none of them improved the results.
This is not what I would call "overfitting". The reason you are seeing really good cross-validation metrics compared to your test metrics is that you have time-series data and so you can't use k-fold cross-validation to give you an accurate estimate of performance.
Performing k-fold cross-validation on a time-series dataset will give you overly-optimistic performance metrics because you are not respecting the time-series component in your data. Regular k-fold cross-validation will randomly sample from your whole dataset to create a train & validation set. Essentially, your validation strategy is "cheating" because you have "future" data included in your CV training sets (if that makes any sense).
I can see from your code that you understand you need to train on "past" data and predict on "future" data, but if you want to read more about this topic, I'd recommend this article or this article.
One solution is simply to use the test-set performance to evaluate your model. Another option is what's called "rolling" or "time-series" cross-validation, but H2O does not currently support it (though it seems like it might be added soon). Here's a ticket for this if you want to keep track of the progress.
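As an illustration of what rolling / time-series cross-validation looks like outside H2O, here is a small sketch with scikit-learn's TimeSeriesSplit; it assumes the rows are sorted by date, the target is a binary 0/1 column, and the file name is hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

df = pd.read_csv("fs_rf.csv").sort_values("Year")  # hypothetical export of the frame above
X = df.drop(columns=["BestWorst2", "Year"])
y = df["BestWorst2"]

# Each split trains only on rows that come *before* the validation rows,
# so no "future" information leaks into the training data.
for train_idx, valid_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict_proba(X.iloc[valid_idx])[:, 1]
    print(roc_auc_score(y.iloc[valid_idx], preds))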

Training Tensorflow Inception-v3 Imagenet on modest hardware setup

I've been training Inception V3 on a modest machine with a single GPU (GeForce GTX 980 Ti, 6GB). The maximum batch size appears to be around 40.
I've used the default learning rate settings specified in the inception_train.py file: initial_learning_rate = 0.1, num_epochs_per_decay = 30 and learning_rate_decay_factor = 0.16. After a couple of weeks of training (about 500K-1M iterations), the best accuracy I was able to achieve is as follows:
2016-06-06 12:07:52.245005: precision @ 1 = 0.5767 recall @ 5 = 0.8143 [50016 examples]
2016-06-09 22:35:10.118852: precision @ 1 = 0.5957 recall @ 5 = 0.8294 [50016 examples]
2016-06-14 15:30:59.532629: precision @ 1 = 0.6112 recall @ 5 = 0.8396 [50016 examples]
2016-06-20 13:57:14.025797: precision @ 1 = 0.6136 recall @ 5 = 0.8423 [50016 examples]
I've tried fiddling with the settings towards the end of the training session, but couldn't see any improvements in accuracy.
I've started a new training session from scratch with num_epochs_per_decay = 10 and learning_rate_decay_factor = 0.001 based on some other posts in this forum, but I'm sort of grasping in the dark here.
Any recommendations on good defaults for a small hardware setup like mine?
TL;DR: There is no known method for training an Inception V3 model from scratch in a tolerable amount of time on a modest hardware setup. I would strongly suggest retraining a pre-trained model on your desired task.
On a small hardware setup like yours, it will be difficult to achieve maximum performance. Generally speaking, CNNs perform best with the largest batch sizes possible, which means that CNN training is often limited by the maximum batch size that fits in GPU memory.
The Inception V3 model available for download here was trained with an effective batch size of 1600 across 50 GPUs, where each GPU ran a batch size of 32.
Given your modest hardware, my number one suggestion would be to download the pre-trained model from the link above and retrain it for the individual task you have at hand. This would make your life much happier.
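As a rough sketch of the "retrain a pre-trained model" suggestion, here is what transfer learning on Inception V3 looks like with today's tf.keras API (an analogue, not the inception_train.py setup from the question; the class count and the train_ds/val_ds pipelines are assumptions):
import tensorflow as tf

NUM_CLASSES = 10  # hypothetical number of classes in your own task

# Load Inception V3 with ImageNet weights and freeze the convolutional base.
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / val_ds: tf.data pipelines of (image, label) batches, prepared elsewhere.
# model.fit(train_ds, validation_data=val_ds, epochs=10)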
As a thought experiment (though hardly practical), if you feel especially compelled to exactly match the training performance of the pre-trained model by training from scratch, you could run the following insane procedure on your single GPU:
Run with a batch size of 32
Store the gradients from the run
Repeat this 50 times.
Average the gradients from the 50 batches.
Update all variables with the gradients.
Repeat
I am only mentioning this to give you a conceptual sense of what would need to be accomplished to achieve the exact same performance. Given the speed numbers you mentioned, this procedure would take months to run. Hardly practical.
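For a concrete sense of the accumulate-then-update idea, here is a minimal gradient-accumulation sketch written in PyTorch purely for brevity (not the TensorFlow Inception code from the question; the tiny linear model and random data are stand-ins):
import torch
import torch.nn as nn

# Stand-in model and data; with Inception V3 only the model, loss and data loader change.
model = nn.Linear(100, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
ACCUM_STEPS = 50  # 50 micro-batches of 32 = effective batch size of 1600

optimizer.zero_grad()
for step in range(ACCUM_STEPS):
    x = torch.randn(32, 100)                   # one micro-batch that fits in GPU memory
    y = torch.randint(0, 4, (32,))
    loss = loss_fn(model(x), y) / ACCUM_STEPS  # scale so the summed gradients are averaged
    loss.backward()                            # gradients accumulate in .grad across steps
optimizer.step()                               # one weight update with the averaged gradients
optimizer.zero_grad()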
More realistically, if you are still strongly interested in training from scratch and doing the best you can, here are some general guidelines:
Always run with the largest batch size possible. It looks like you are already doing that. Great.
Make sure that you are not CPU bound. That is, make sure that the input processing queues are always modestly full as displayed on TensorBoard. If not, increase the number of preprocessing threads or use a different CPU if available.
Re: learning rate. If you are always running synchronous training (which must be the case if you only have 1 GPU), then the higher the batch size, the higher the tolerable learning rate. I would try a series of several quick runs (e.g. a few hours each) to identify the highest learning rate which does not lead to NaNs. After you find such a learning rate, knock it down by, say, 5-10% and run with that.
As for num_epochs_per_decay and decay_rate, there are several strategies. The strategy suggested by 10 epochs per decay and a 0.001 decay factor is to hammer the model for as long as possible until the eval accuracy asymptotes, and only then lower the learning rate. This is a simple strategy, which is nice. I would verify in your model monitoring that the eval accuracy really does asymptote before you allow the learning rate to decay. Finally, the decay factor is a bit ad hoc, but lowering by, say, a power of 10 seems to be a good rule of thumb.
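For reference, here is a tiny pure-Python sketch of the stepwise exponential decay that (to my understanding) inception_train.py applies via tf.train.exponential_decay, plugging in the default values from the question:
initial_learning_rate = 0.1
num_epochs_per_decay = 30
learning_rate_decay_factor = 0.16

def decayed_lr(epoch):
    # lr = initial_lr * decay_factor ** floor(epoch / epochs_per_decay)
    return initial_learning_rate * learning_rate_decay_factor ** (epoch // num_epochs_per_decay)

for epoch in (0, 30, 60, 90):
    print(epoch, decayed_lr(epoch))  # 0.1, 0.016, 0.00256, ~0.00041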
Note again that these are general guidelines and others might even offer differing advice. The reason why we can not give you more specific guidance is that CNNs of this size are just not often trained from scratch on a modest hardware setup.
Excellent tips.
There is precedent for training with a setup similar to yours.
Check this out - http://3dvision.princeton.edu/pvt/GoogLeNet/
They trained GoogLeNet, but using Caffe. Still, studying their experience would be useful.

Artificial Neural Network with large inputs & outputs

I've been following Dave Miller's ANN C++ Tutorial, and I've been having some problems getting it to function as expected.
You can view the code I'm working with here. It's an XCode project, but includes the main.cpp and data set file.
Previously, this program would only give outputs between -1 and 1, presumably due to the use of the tanh activation function. I've manipulated the data so that I can feed in my much larger values and still get valid outputs, simply by multiplying the input values by 0.0001 and multiplying the output values by 10000.
The training data I'm using is the included CSV file. The last column is the expected output, the rest are inputs. Am I using the wrong mathematical function for these data?
Would you say that this is actually learning? This whole thing has stressed me out so much; I understand the theory behind ANNs but just can't implement one from scratch myself.
The net recent average error definitely gets smaller and smaller, which to me suggests it is learning.
I'm sorry if I haven't explained myself very well; I'm very new to ANNs and this whole thing is very confusing to me. My university lecturers are useless when it comes to the practical side; they only teach us the theory.
I've been playing around with the eta and alpha values, along with the number of hidden layers.
You explained yourself quite well. If the net recent average error is getting lower and lower, it probably means that the network is actually learning, but here is my suggestion for how to be completely sure.
Take your CSV file and split it into two files: one with about 10% of all the data and the other with all the remaining rows.
Start with an untrained network, run your 10% file through the net, and for each line save the difference between the actual output and the expected result.
Then train the network only on the 90% file, and finally run the first 10% file through the net again and compare the differences from the first run with the latest ones.
You should find that the new results are much closer to the expected values than the first time, and this would be the final proof that your network is learning.
Does this make any sense? If not, please share some code or send me a link to the exercise you are running and I will try to explain it in code.
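To make that procedure concrete, here is a minimal Python sketch of the same idea using scikit-learn's small MLP as a stand-in for the C++ net (a barely-trained model approximates the "untrained" first pass; the CSV layout with the target in the last column matches the question, but the file name is hypothetical):
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor

data = pd.read_csv("training_data.csv", header=None)  # last column = expected output
X, y = data.iloc[:, :-1].to_numpy(), data.iloc[:, -1].to_numpy()

# Hold out roughly 10% of the rows; train only on the remaining 90%.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
cut = len(X) // 10
hold, rest = idx[:cut], idx[cut:]

barely = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5, random_state=0).fit(X[rest], y[rest])
trained = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0).fit(X[rest], y[rest])

print("holdout error, barely trained:", np.mean((barely.predict(X[hold]) - y[hold]) ** 2))
print("holdout error, fully trained: ", np.mean((trained.predict(X[hold]) - y[hold]) ** 2))
# If the second number is clearly smaller, the network really is learning.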