Adding weka instances after classification but before evaluation?

Suppose X is a raw, labeled (i.e., with training labels) data set, and Process(X) returns a set of instances Y
that have been encoded with attributes and converted into a Weka-friendly file like Y.arff.
Also suppose Process() has some 'leakage':
some instances Leak = X - Y can't be encoded consistently, and need
to get a default classification FOO. The training labels are also known for the Leak set.
My question is: how can I best introduce instances from Leak into the
Weka evaluation stream AFTER some classifier has been applied to the
subset Y, folding the Leak instances in with their default
classification label, before performing evaluation across the full set X? In code:
DataSource LeakSrc = new DataSource("leak.arff");
Instances Leak = LeakSrc.getDataSet();
DataSource Ysrc = new DataSource("Y.arff");
Instances Y = Ysrc.getDataSet();
classfr.buildClassifier(Y);
// YunionLeak = ??
eval.crossValidateModel(classfr, YunionLeak, 10, new Random(1));
Maybe this is a specific example of folding together results
from multiple classifiers?

The bounty is closing, but Mark Hall, in another forum
(http://list.waikato.ac.nz/pipermail/wekalist/2015-November/065348.html), deserves what will have to count as the current answer:
You’ll need to implement building the classifier for the cross-validation
in your code. You can still use an evaluation object to compute stats for
your modified test folds though, because the stats it computes are all
additive. Instances.trainCV() and Instances.testCV() can be used to create
the folds:
http://weka.sourceforge.net/doc.stable/weka/core/Instances.html#trainCV(int,%20int,%20java.util.Random)
You can then call buildClassifier() to process each training fold, modify
the test fold to your heart's content, and then iterate over the instances
in the test fold while making use of either Evaluation.evaluateModelOnce()
or Evaluation.evaluateModelOnceAndRecordPrediction(). The latter version is
useful if you need the area under the curve summary metrics (as these
require predictions to be retained).
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnce(weka.classifiers.Classifier,%20weka.core.Instance)
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnceAndRecordPrediction(weka.classifiers.Classifier,%20weka.core.Instance)
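Putting that advice together, a minimal sketch (assuming Weka 3.7+, that Y and Leak are loaded as in the question with their class attribute set, and that "FOO" is a valid label of the class attribute):

import java.util.Random;
import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

int folds = 10;
Random rand = new Random(1);
Instances randY = new Instances(Y);
randY.randomize(rand);

Evaluation eval = new Evaluation(randY);
double fooIndex = randY.classAttribute().indexOfValue("FOO"); // assumed default label

for (int i = 0; i < folds; i++) {
    Instances train = randY.trainCV(folds, i, rand);
    Instances test = randY.testCV(folds, i);

    Classifier foldCls = AbstractClassifier.makeCopy(classfr);
    foldCls.buildClassifier(train);

    // Evaluate the trained classifier on the encodable instances of this fold
    for (int j = 0; j < test.numInstances(); j++) {
        eval.evaluateModelOnceAndRecordPrediction(foldCls, test.instance(j));
    }
}

// Fold the Leak instances in once, with their fixed default prediction FOO,
// via the evaluateModelOnce overload that takes a precomputed prediction
for (int j = 0; j < Leak.numInstances(); j++) {
    eval.evaluateModelOnce(fooIndex, Leak.instance(j));
}
System.out.println(eval.toSummaryString());

Since the Evaluation stats are additive (as Mark Hall notes), it doesn't matter that the Leak predictions were never produced by the classifier.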

Depending on your classifier, it could be very easy! Weka has an interface called UpdateableClassifier; any class implementing it can be updated after it has been built! The following classes implement this interface:
HoeffdingTree
IBk
KStar
LWL
MultiClassClassifierUpdateable
NaiveBayesMultinomialText
NaiveBayesMultinomialUpdateable
NaiveBayesUpdateable
SGD
SGDText
It can then be updated with something like the following:
import java.io.File;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

ArffLoader loader = new ArffLoader();
loader.setFile(new File("/data/data.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);

// Build on the header only, then stream the instances in one at a time
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);

Instance current;
while ((current = loader.getNextInstance(structure)) != null) {
    nb.updateClassifier(current);
}
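In this question's setting, a hypothetical follow-up (assuming the classifier was built on Y and leak.arff holds the Leak instances with compatible attributes) could stream the Leak set in the same way:

ArffLoader leakLoader = new ArffLoader();
leakLoader.setFile(new File("/data/leak.arff")); // assumed path
Instances leakStructure = leakLoader.getStructure();
leakStructure.setClassIndex(leakStructure.numAttributes() - 1);

Instance inst;
while ((inst = leakLoader.getNextInstance(leakStructure)) != null) {
    nb.updateClassifier(inst); // fold each Leak instance into the model
}

Note this updates the model with the Leak labels rather than assigning the default classification FOO at evaluation time, so it solves a slightly different problem than the cross-validation approach above.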

Related

serving_input_receiver_fn() function without the deprecated tf.placeholder method in TF 2.0

I have a functioning tf.estimator pipeline built in TF 1, but now I have decided to move to TF 2.0, and I have problems at the end of my pipeline, when I want to save the model in the .pb format.
I'm using this high level estimator export_saved_model method:
https://www.tensorflow.org/api_docs/python/tf/estimator/BoostedTreesRegressor#export_saved_model
I have two numeric features, 'age' and 'time_spent'
They're defined using tf.feature_column as such:
age = tf.feature_column.numeric_column('age')
time_spent = tf.feature_column.numeric_column('time_spent')
features = [age,time_spent]
After the model has been trained I turn the list of features into a dict using the method tf.feature_column.make_parse_example_spec() and feed it to another method, build_parsing_serving_input_receiver_fn(), exactly as outlined on TensorFlow's webpage, https://www.tensorflow.org/guide/saved_model under estimators.
columns_dict = tf.feature_column.make_parse_example_spec(features)
input_receiver_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(columns_dict)
model.export_saved_model(export_dir,input_receiver_fn)
I then inspect the output using the CLI tool:
saved_model_cli show --dir mydir --all
Somehow TensorFlow squashes my two useful numeric features into a single useless string input called "inputs".
In TF 1 this could be circumvented by creating a custom input_receiver_fn() using tf.placeholder, and I'd get the correct output with two distinct numeric features. But tf.placeholder doesn't exist in TF 2, so that route is gone.
Sorry about the raging, but TensorFlow is horribly documented; I'm working with high-level APIs here and this should just work out of the box, but no.
I'd really appreciate any help :)
"TensorFlow squashes my two useful numeric features into a useless
string input called 'inputs'"
is not exactly true: the exported model expects a serialized tf.Example proto. So you can wrap your age and time_spent into two features, which will look like:
features {
  feature {
    key: "age"
    value {
      float_list {
        value: 10.2
      }
    }
  }
  feature {
    key: "time_spent"
    value {
      float_list {
        value: 40.3
      }
    }
  }
}
You can then call your regress function with the serialized string.
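For example, a minimal TF 2 sketch of that call (assuming export_dir is the directory written by export_saved_model; the exact signature key may be "serving_default", "regression" or "predict" depending on the head, so check saved_model_cli):

import tensorflow as tf

# Build a tf.Example proto equivalent to the textual proto above
example = tf.train.Example(features=tf.train.Features(feature={
    "age": tf.train.Feature(float_list=tf.train.FloatList(value=[10.2])),
    "time_spent": tf.train.Feature(float_list=tf.train.FloatList(value=[40.3])),
}))

loaded = tf.saved_model.load(export_dir)
regress_fn = loaded.signatures["serving_default"]

# The single string input "inputs" takes a batch of serialized protos
prediction = regress_fn(inputs=tf.constant([example.SerializeToString()]))
print(prediction)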

Connecting nodes of different GraphDef's

From Python, I have a frozen graph.pb that I'm currently using in a C++ environment. The data for the input tensor is currently preprocessed on the CPU, but I would like to do this step in another GraphDef to run it on the GPU, yet I can't seem to find a way to connect nodes between two GraphDefs.
Let's assume my frozen graph has an input placeholder named mid that I'd like to connect with the preprocessing steps below:
tf::GraphDef create_graph_extension() {
    tf::Scope root = tf::Scope::NewRootScope();
    auto a = tf::ops::Const(root.WithOpName("in"), {(float) 23.0, (float) 31.0});
    auto b = tf::ops::Identity(root.WithOpName("mid"), a);
    tf::GraphDef graph;
    TF_CHECK_OK(root.ToGraphDef(&graph));
    return graph;
}
I usually use session->Extend() to run multiple graphs in the same session, always making sure their node names are unique. With the non-unique node names that I hoped to connect, I get an error:
Failed to install graph:
Invalid argument: GraphDef argument to Extend includes node 'mid', which
was created by a previous call to Create or Extend in this session.
P.S. It seems like it is possible in Python, at least (link).
You can achieve what you're looking for using the same idea that was suggested for Python - import one GraphDef into another and remap inputs.
In case you do use the C API (which has stability guarantees), you'd want to look at:
TF_GraphImportGraphDef (which is parallel to the tf.import_graph_def call in Python), and
TF_ImportGraphDefOptionsAddInputMapping which serves the same purpose as the input_map argument in Python.
These are implemented on top of the C++ ImportGraphDef function, which you might be able to use directly instead (though that doesn't yet seem to be part of the exported C++ API).
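A minimal sketch of those two calls (assuming graph already contains the preprocessing ops from create_graph_extension() and frozen_def is a TF_Buffer holding the bytes of the frozen graph.pb):

#include <stdio.h>
#include <tensorflow/c/c_api.h>

TF_Status* status = TF_NewStatus();
TF_ImportGraphDefOptions* opts = TF_NewImportGraphDefOptions();

// Prefix the imported nodes so their names stay unique in the session
TF_ImportGraphDefOptionsSetPrefix(opts, "frozen");

// Remap the frozen graph's input "mid" onto the preprocessing output mid:0
TF_Operation* mid = TF_GraphOperationByName(graph, "mid");
TF_Output pre_out = {mid, 0};
TF_ImportGraphDefOptionsAddInputMapping(opts, "mid", 0, pre_out);

TF_GraphImportGraphDef(graph, frozen_def, opts, status);
if (TF_GetCode(status) != TF_OK)
    fprintf(stderr, "Import failed: %s\n", TF_Message(status));

TF_DeleteImportGraphDefOptions(opts);
TF_DeleteStatus(status);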
Hope that helps.

getting transformer results from sklearn.pipeline.Pipeline

I am using a sklearn.pipeline.Pipeline object for my clustering.
pipe = sklearn.pipeline.Pipeline([('transformer1', transformer1),
                                  ('transformer2', transformer2),
                                  ('clusterer', clusterer)])
Then I am evaluating the result by using the silhouette score.
sil = sklearn.metrics.silhouette_score(X, y)
I'm wondering how I can get X, the transformed data, from the pipeline, as it only returns the result of clusterer.fit_predict(X).
I understand that I can do this by just splitting the pipeline as
pipe = sklearn.pipeline.Pipeline([('transformer1', transformer1),
                                  ('transformer2', transformer2)])
X = pipe.fit_transform(data)
res = clusterer.fit_predict(X)
sil = sklearn.metrics.silhouette_score(X, res)
but I would like to just do it all in one pipeline.
If you want to both fit and transform the data on intermediate steps of the pipeline, then it makes no sense to reuse the same pipeline; it is better to use a new one as you specified, because calling fit() will forget all previously learnt data.
However, if you only want to transform() and see the intermediate data on an already fitted pipeline, then it's possible by accessing the named_steps attribute:
new_pipe = sklearn.pipeline.Pipeline([('transformer1', old_pipe.named_steps['transformer1']),
                                      ('transformer2', old_pipe.named_steps['transformer2'])])
Or directly using the inner variable steps (each element of steps is already a (name, estimator) tuple):
transformer_steps = old_pipe.steps
new_pipe = sklearn.pipeline.Pipeline([transformer_steps[0], transformer_steps[1]])
Then call new_pipe.transform().
Update:
If you have version 0.18 or above, then you can set the non-required estimator inside the pipeline to None to get the result in the same pipeline. It's discussed in this issue on the scikit-learn github.
Usage for the above in your case:
pipe.set_params(clusterer=None)
pipe.transform(df)
But be aware that you may want to store the fitted clusterer somewhere else first, or else you will need to fit the whole pipeline again when you want to use it.
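Putting it together, a runnable sketch (the transformers and clusterer here are made-up stand-ins, since the question doesn't specify them):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = np.random.rand(100, 5)

pipe = Pipeline([('transformer1', StandardScaler()),    # hypothetical step
                 ('transformer2', PCA(n_components=2)),  # hypothetical step
                 ('clusterer', KMeans(n_clusters=3))])
labels = pipe.fit_predict(data)

# Recover the transformed data by pushing it back through the fitted steps
X_t = data
for name, step in pipe.steps[:-1]:
    X_t = step.transform(X_t)

print(silhouette_score(X_t, labels))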

Equivalent of tf.identity with control dependency for an operation node

I am writing a wrapper class that takes a generic graph with a special member "train_op" to manage the training, saving, and housekeeping of my model.
I wanted to cleanly keep track of the lifetime number of training steps like so:
with tf.control_dependencies([step_add_one]):
    self.train_op = tf.identity(self.training_graph.train_op)
but this raises:
TypeError: Expected binary or unicode string, got %r
I think the rub here is that train_op is the return of tf.Optimizer.minimize(), so it is not a tensor per se, but an operation.
An obvious workaround would be to call tf.identity on the training_graph.loss, but I lose a bit of abstraction because I have to then handle the learning rate etc externally. Moreover, I feel like I'm missing something.
How can I best remedy this?
You can use tf.group(), which will work with operations and tensors.
For instance:
x = tf.Variable(1.)
loss = tf.square(x)
optimizer = tf.train.GradientDescentOptimizer(0.1)
train_op = optimizer.minimize(loss)
step = tf.Variable(0)
step_add_one = step.assign_add(1)
with tf.control_dependencies([step_add_one]):
    train_op_2 = tf.group(train_op)
Now when you run train_op_2, the value of step will be incremented.
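For instance, in TF1-style session code matching the snippet above:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op_2)
    print(sess.run(step))  # 1 -- the control dependency ran step_add_one first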
However, the best way to go (if you can modify the code that builds the graph) is to add a global_step parameter to the minimize function:
train_op = optimizer.minimize(loss, global_step=step)

How do I use AdaBoost for feature selection?

I want to use AdaBoost to choose a good set of features from a large number (~100k). AdaBoost works by iterating through the feature set and adding in features based on how well they perform. It chooses features that perform well on samples that were mis-classified by the existing feature set.
I'm currently using OpenCV's CvBoost. I got an example working, but from the documentation it is not clear how to pull out the feature indexes that it has used.
Using either CvBoost, a 3rd party library, or implementing it myself, how can I pull out a set of features from a large feature set using AdaBoost?
With the help of @greeness' answer I made a subclass of CvBoost:
std::vector<int> RSCvBoost::getFeatureIndexes() {
    CvSeqReader reader;
    cvStartReadSeq( weak, &reader );
    cvSetSeqReaderPos( &reader, 0 );

    std::vector<int> featureIndexes;

    int weak_count = weak->total;
    for( int i = 0; i < weak_count; i++ ) {
        CvBoostTree* wtree;
        CV_READ_SEQ_ELEM( wtree, reader );

        const CvDTreeNode* node = wtree->get_root();
        CvDTreeSplit* split = node->split;
        const int index = split->condensed_idx;

        // Only add features that are not already added
        if (std::find(featureIndexes.begin(),
                      featureIndexes.end(),
                      index) == featureIndexes.end()) {
            featureIndexes.push_back(index);
        }
    }
    return featureIndexes;
}
Disclaimer: I am not a user of OpenCV. From the documentation, OpenCV's AdaBoost uses decision trees (either classification trees or regression trees) as the fundamental weak learner.
It seems to me this is the way to get the underlying weak learners:
CvBoost::get_weak_predictors
Returns the sequence of weak tree classifiers.
C++: CvSeq* CvBoost::get_weak_predictors()
The method returns the sequence of weak classifiers.
Each element of the sequence is a pointer to the CvBoostTree class or
to some of its derivatives.
Once you have access to the sequence of CvBoostTree*, you should be able to inspect which features are contained in each tree, what the split values are, etc.
If each tree is only a decision stump, only one feature is contained in each weak learner. But if we allow deeper depth of tree, a combination of features could exist in each individual weak learner.
I further took a look at the CvBoostTree class; unfortunately the class itself does not provide a public method to check the internal features used. But you might want to create your own subclass inheriting from CvBoostTree and expose whatever functionality you need, or walk the tree nodes yourself, as sketched below.
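For the deeper-tree case mentioned above, a hypothetical recursive walk (untested; relying on the public left/right/split members of CvDTreeNode in OpenCV 2.x) could collect every feature a weak tree uses, not just the root split:

#include <vector>
#include <opencv2/ml/ml.hpp>

// Collect every feature index used anywhere in one weak tree; a node with
// no split is a leaf. Call as collectFeatures(wtree->get_root(), indexes)
// inside the loop of getFeatureIndexes() above.
static void collectFeatures(const CvDTreeNode* node, std::vector<int>& out) {
    if (node == NULL || node->split == NULL)
        return;
    out.push_back(node->split->var_idx);
    collectFeatures(node->left, out);
    collectFeatures(node->right, out);
}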