On loading the saved Keras sequential model, my test data gives low accuracy in the beginning - python-2.7

I am creating a simple sequential Keras model which will take 10k inputs in a batch of 100. Each input has 3 columns and the corresponding output is sum of that row.
Sequential model has 2 layers- LSTM(Stateful=true) , Dense.
Now, after compiling and fitting the model, I am saving it in 'model.h5' file.
Then, I read the saved model, and call model.predict with a test data (size=10k , batch_size = 100).
Problem: the prediction doesn't work properly for first 400-500 inputs and for the rest its working perfectly fine with very low val_loss.
Case1: I make the LSTM layer Stateless(i.e. Stateful=False)
In this case Keras is providing very accurate outputs for all the test data.
Case2: Instead of saving and then reading again, if I directly apply model.predict on the model created, all the outputs are coming accurately.
But, I need Stateful=True, also, I want to save my model and then resume work on that model later.
1.Is there any way to solve this?
2.Also, when I am providing test data, how is the model's accuracy increasing? ( because the first 400-500 tests provide inaccurate results and the rest are pretty accurate)

Your problem seems to come from losing the hidden states of your cells. During model building they might be reset and this might cause the problem.
So (it's a little bit cumbersome), but you could save and load also a states of your network:
How to save? (assuming that i-th layer is a recurrentone):
hidden_state = model.layers[i].states[0].eval()
cell_state = model.layers[i].states[0].eval()
numpy.save("some name", hidden_state)
numpy.save("some other name", cell_state)
Now when you can reload the hidden state, here you can read on how to set the hidden state in a layer.
Of course - it's the best to pack all of this methods in some kind of object and e.g. class constructor methods.

Related

Training and Test Set in Weka InCompatible in Text Classification

I have two datasets regarding whether a sentence contains a mention of a drug adverse event or not, both the training and test set have only two fields the text and the labels{Adverse Event, No Adverse Event} I have used weka with the stringtoWordVector filter to build a model using Random Forest on the training set.
I want to test the model built with removing the class labels from the test data set, applying the StringToWordVector filter on it and testing the model with it. When I try to do that it gives me the error saying training and test set not compatible probably because the filter identifies a different set of attributes for the test dataset. How do I fix this and output the predictions for the test set.
The easiest way to do this for a one off test is not to pre-filter the training set, but to use Weka's FilteredClassifier and configure it with the StringToWordVector filter, and your chosen classifier to do the classification. This is explained well in this video from the More Data Mining with Weka online course.
For a more general solution, if you want to build the model once then evaluate it on different test sets in future, you need to use InputMappedClassifier:
Wrapper classifier that addresses incompatible training and test data
by building a mapping between the training data that a classifier has
been built with and the incoming test instances' structure. Model
attributes that are not found in the incoming instances receive
missing values, so do incoming nominal attribute values that the
classifier has not seen before. A new classifier can be trained or an
existing one loaded from a file.
Weka requires a label even for the test data. It uses the labels or „ground truth“ of the test data to compare the result of the model against it and measure the model performance. How would you tell whether a model is performing well, if you don‘t know whether its predictions are right or wrong. Thus, the test data needs to have the very same structure as the training data in WEKA, including the labels. No worries, the labels are not used to help the model with its predictions.
The best way to go is to select cross validation (e.g. 10 fold cross validation) which automatically will split your data into 10 parts, using 9 for training and the remaining 1 for testing. This procedure is repeated 10 times so that each of the 10 parts has once been used as test data. The final performance verdict will be an average of all 10 rounds. Cross validation gives you a quite realistic estimate of the model performance on new, unseen data.
What you were trying to do, namely using the exact same data for training and testing is a bad idea, because the measured performance you end up with is way too optimistic. This means, you‘ll get very impressive figures like 98% accuracy during testing - but as soon as you use the model against new unseen data your accuracy might drop to a much worse level.

When training a single batch, is iteration of examples necessary (optimal) in python code?

Say I have one batch that I want to train my model on. Do I simply run tf.Session()'s sess.run(batch) once, or do I have to iterate through all of the batch's examples with a loop in the session? I'm looking for the optimal way to iterate/update the training ops, such as loss. I thought tensorflow would handle it itself, especially in the cases where tf.nn.dynamic_rnn() takes in a batch dimension for listing the examples. I thought, perhaps naively, that a for loop in the python code would be the inefficient method of updating the loss. I am using tf.losses.mean_squared_error(batch) for a regression problem.
My regression problem is given two lists of word vectors (300d each), and determines the similarity between the two lists on a continuous scale from [0, 5]. My supervised model is Deepmind's Differential Neural Computer (DNC). The problem is I do not believe it is learning anything. this is due to the fact that the all of the output from the model is centered around 0 and even negative. I do not know how it could possibly be negative given no negative labels provided. I only call sess.run(loss) for the single batch, I do not create a python loop to iterate through it.
So, what is the most efficient way to iterate the training of a model and how do people go about it? Do they really use python loops to do multiple calls to sess.run(loss) (this was done in the training file example for DNC, and I have seen it in other examples as well). I am certain I get the final loss from the below process, but I am uncertain if the model has actually been trained entirely just because the loss was processed in one go. I also do not understand the point of update_ops returned by some functions, and am uncertain if they are necessary to ensure the model has been trained.
Example of what I mean by processing a batch's loss once:
# assume the model has been defined prior through batch_output_logits
train_loss = tf.losses.mean_squared_error(labels=target,
predictions=batch_output_logits)
with tf.Session() as sess:
sess.run(init_op) # pseudo code, unnecessary for question
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
# is this the entire batch's loss && model has been trained for that batch?
loss_np = sess.run(train_step, train_loss)
coord.request_stop()
coord.join(threads)
Any input on why I am receiving negative values when the labels are in the range [0, 5] is welcomed as well(general abstract answers for this are fine, because its not the main focus). I am thinking of attempting to create a piece-wise function, if possible, for my loss, so that for any values out of bounds face a rapidly growing exponential loss function. Uncertain how to implement, or if it would even work.
Code is currently private. Once allowed, I will make the repo public.
To run DNC model, go to the project/ directory and run python -m src.main. If there are errors you encounter feel free to let me know.
This model depends upon Tensorflow r1.2, most recent Sonnet, and NLTK's punkt for Tokenizing sentences in sts_handler.py and tests/*.
In a regression model, the network calculates the model output based on the randomly initialized values for your model parameters. That's why you're seeing negative values here; you haven't trained your model enough for it to learn that your values are only between 0 and 5.
Unless I'm missing something, you are only calculating the loss, but you aren't actually training the model. You should probably be calling sess.run(optimizer) on an optimizer, not on your loss function.
You probably need to train your model for multiple epochs (training your model for one epoch = training your model once on the entire dataset).
Batches are used because it is more computationally efficient to train your model on a batch than it is to train it on a single example. However, your data seems to be small enough that you won't have that problem. As such, I would recommend reducing your batch size to as low as possible. As a general rule, you get better training from a smaller batch size, at the cost of added computation.
If you post all of your code, I can take a look.

tensorflow repeated running of fully connected model

Question:
How can I "rerun" tensorflow code that depends on queues? Is the best way really to close the session, build the model again, load variables and run?
Motivation:
In a standing unanswered question I asked how in a fully connected model one could interleave actions (such as generating cumulative summaries, calc AUC on test data, etc.) with training that reads data from tensorflow TFRecords files and tf.Queues.
For example, tf.train.string_input_producer returns a filename_queue. As part of the constructor it takes a "num_epochs" arg. Instead of setting "num_epochs" to 100, I'm thinking to just set "num_epochs" to "2" to generate summaries every other epoch. This requires running the same code 50 times, hence the need for an efficient answer to above.

Do OpenCV's machine learning algorithms continuously update a model?

Most machine learning algorithms implemented in OpenCV 2.4 built upon a CvStatModel which comes with a CvStatModel::train method.
There it says:
By default, the input feature vectors are stored as train_data rows, that is, all the components (features) of a training vector are stored continuously.
and
Usually, the previous model state is cleared by CvStatModel::clear() before running the training procedure. However, some algorithms may optionally update the model state with the new training data, instead of resetting it.
How do I know which ml algorithm isn't resetting the current model state. Since I wanted to use CvGBTrees::train which has a update parameter declared as being only a dummy parameter, I guess the model is discarded after every training call. Can I take it that if there is no such update parameter the current model state will always be discarded?
I need a machine learning algorithm which continuously trains one model and doesn't start with an initial model every training call.
Is this doable with the current ml implementations in OpenCV? and if so with which ones? Furthermore, if not are there other c++ libraries that would do so?

How to do prediction with weka

i'm using weka to do some text mining, i'm a little bit confused so i'm here to ask how can i ( with a set of comments that are in a some way classified as: notes, status of work, not conformity, warning) predict if a new comment belong to a specific class, with all the comment (9551) i've done a preprocess obtaining with the filter "stringtowordvector" a vector of tokens, and then i've used the simple kmeans to obtain a number of cluster.
So the question is: if a user post a new comment can i predict with those data if it belong to a category of comment?
sorry if my question is a little bit confused but so am i.
thank you
Trivial Training-validation-test
Create two datasets from your labelled instances. One will be training set and the other will be validation set. The training set will contain about 60% of the labelled data and the validation will contain 40% of the labelled data. There is no hard and fast rule for this split, but a 60-40 split is a good choice.
Use K-means (or any other clustering algorithm) on your training data. Develop a model. Record the model's error on training set. If the error is low and acceptable, you are fine. Save the model.
For now, your validation set will be your test dataset. Apply the model you saved on your validation set. Record the error. What is the difference between training error and validation error? If they both are low, the model's generalization is "seemingly" good.
Prepare a test dataset where you have all the features of your training and test dataset but the class/cluster is unknown.
Apply the model on the test data.
10-fold cross validation
Use all of your labelled data instances for this task.
Apply K-means (or any other algorithm of your choice) with a 10-fold CV setup.
Record the training error and CV error. Are they low? Is the difference between the errors is low? If yes, then save the model and apply it on the test data whose class/cluster is unknown.
NB: The training/test/validation errors and their differences will give you an "very initial" idea of overfitting/underfitting of your model. They are sanity tests. You need to perform other tests like learning curves to see if your model overfits or underfits or perfect. If there appears to be an overfitting and underfitting problem, you need to try many different techniques to overcome them.