I am using a sklearn.pipeline.Pipeline object for my clustering.
pipe = sklearn.pipeline.Pipeline([('transformer1', transformer1),
                                  ('transformer2', transformer2),
                                  ('clusterer', clusterer)])
Then I am evaluating the result by using the silhouette score.
sil = sklearn.metrics.silhouette_score(X, y)
I'm wondering how I can get the X or the transformed data from the pipeline as it only returns the clusterer.fit_predict(X).
I understand that I can do this by just splitting the pipeline as
pipe = sklearn.pipeline.Pipeline([('transformer1', transformer1),
                                  ('transformer2', transformer2)])
X = pipe.fit_transform(data)
res = clusterer.fit_predict(X)
sil = sklearn.metrics.silhouette_score(X, res)
but I would like to just do it all in one pipeline.
If you want to both fit and transform the data at intermediate steps of the pipeline, it makes no sense to reuse the same pipeline; it is better to use a new one as you specified, because calling fit() discards everything learnt previously.
However, if you only want to transform() and see the intermediate data on an already fitted pipeline, that is possible by accessing the named_steps attribute.
new_pipe = sklearn.pipeline.Pipeline([('transformer1',
                                       old_pipe.named_steps['transformer1']),
                                      ('transformer2',
                                       old_pipe.named_steps['transformer2'])])
Or directly using the steps attribute, which is a list of (name, estimator) tuples:
transformer_steps = old_pipe.steps
new_pipe = sklearn.pipeline.Pipeline([('transformer1', transformer_steps[0][1]),
                                      ('transformer2', transformer_steps[1][1])])
And then calling new_pipe.transform().
Update:
If you have scikit-learn 0.18 or above, you can set the estimator you do not need to None and get the transformed data from the same pipeline. It's discussed in this issue on the scikit-learn GitHub.
Usage for above in your case:
pipe.set_params(clusterer=None)
pipe.transform(df)
Be aware that this replaces the fitted clusterer inside the pipeline, so store it somewhere else first; otherwise you will need to fit the whole pipeline again when you want to cluster.
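Another way to keep everything in one pipeline is to slice the fitted pipeline. The sketch below assumes scikit-learn 0.21 or later, where Pipeline supports slicing. The StandardScaler, PCA and KMeans steps and the synthetic data are stand-ins for the transformers, clusterer and data in the question, not part of the original code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two well-separated blobs in 4 dimensions.
rng = np.random.RandomState(0)
data = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(8, 1, (50, 4))])

pipe = Pipeline([
    ("transformer1", StandardScaler()),
    ("transformer2", PCA(n_components=2)),
    ("clusterer", KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# Fit every step and get the cluster labels from the final step.
labels = pipe.fit_predict(data)

# pipe[:-1] is a pipeline holding the already fitted transformers only,
# so transform() returns the data exactly as the clusterer saw it.
X_transformed = pipe[:-1].transform(data)

sil = silhouette_score(X_transformed, labels)
print(X_transformed.shape, sil)
```

The slice shares the fitted estimators with the original pipeline, so the clusterer stays in place and nothing has to be refitted.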
From Python, I have a frozen graph.pb that I'm currently using in a C++ environment. The data for the input tensor is currently preprocessed on the CPU, but I would like to move this step into another GraphDef so that it runs on the GPU. However, I can't seem to find a way to connect nodes between two GraphDefs.
Let's assume my frozen graph has an input/placeholder named mid that I'd like to connect with the preprocessing steps below:
tf::GraphDef create_graph_extension() {
tf::Scope root = tf::Scope::NewRootScope();
auto a = tf::ops::Const(root.WithOpName("in"), {(float) 23.0, (float) 31.0});
auto b = tf::ops::Identity(root.WithOpName("mid"), a);
tf::GraphDef graph;
TF_CHECK_OK(root.ToGraphDef(&graph));
return graph;
}
I usually use session->Extend() to run multiple graphs in the same session, always making sure their node names are unique. With non-unique node names, which I hoped would connect the graphs, I get an error:
Failed to install graph:
Invalid argument: GraphDef argument to Extend includes node 'mid', which
was created by a previous call to Create or Extend in this session.
P.s. It seems like it is possible in python at least (link)
You can achieve what you're looking for using the same idea that was suggested for Python - import one GraphDef into another and remap inputs.
In case you do use the C API (which has stability guarantees), you'd want to look at:
TF_GraphImportGraphDef (which is parallel to the tf.import_graph_def call in Python), and
TF_ImportGraphDefOptionsAddInputMapping which serves the same purpose as the input_map argument in Python.
These are implemented on top of the C++ ImportGraphDef function, which you might be able to use directly instead (though that doesn't yet seem to be part of the exported C++ API).
Hope that helps.
I am writing a wrapper class that takes a generic graph with a special member "train_op" to manage the training, saving, and housekeeping of my model.
I wanted to cleanly keep track of the lifetime number of training steps like so:
with tf.control_dependencies([step_add_one]):
    self.train_op = tf.identity(self.training_graph.train_op)
raise TypeError('Expected binary or unicode string, got %r'
e, is_training=True, inputs=None)
I think the rub here is that train_op is the return value of tf.train.Optimizer.minimize(), so it is not a tensor per se, but an operation.
An obvious workaround would be to call tf.identity on the training_graph.loss, but I lose a bit of abstraction because I have to then handle the learning rate etc externally. Moreover, I feel like I'm missing something.
How can I best remedy this?
You can use tf.group(), which will work with operations and tensors.
For instance:
x = tf.Variable(1.)
loss = tf.square(x)
optimizer = tf.train.GradientDescentOptimizer(0.1)
train_op = optimizer.minimize(loss)
step = tf.Variable(0)
step_add_one = step.assign_add(1)
with tf.control_dependencies([step_add_one]):
    train_op_2 = tf.group(train_op)
Now when you run train_op_2, the value of step will be incremented.
However, the best way to go (if you can modify the code that created the graph) is to pass a global_step argument to the minimize function, which then increments that variable on every update:
train_op = optimizer.minimize(loss, global_step=step)
Suppose X is a raw, labeled (i.e., with training labels) data set, and Process(X) returns a set of Y instances that have been encoded with attributes and converted into a Weka-friendly file like Y.arff.
Also suppose Process() has some 'leakage': some instances Leak = X-Y can't be encoded consistently and need to get a default classification FOO. The training labels are also known for the Leak set.
My question is how I can best introduce instances from Leak into the Weka evaluation stream AFTER some classifier has been applied to the subset Y, folding the Leak instances in with their default classification label, before performing evaluation across the full set X. In code:
DataSource LeakSrc = new DataSource("leak.arff");
Instances Leak = LeakSrc.getDataSet();
DataSource Ysrc = new DataSource("Y.arff");
Instances Y = Ysrc.getDataSet();
classfr.buildClassifier(Y);
// YunionLeak = ??
eval.crossValidateModel(classfr, YunionLeak);
Maybe this is a specific example of folding together results
from multiple classifiers?
The bounty is closing, but Mark Hall, in another forum (http://list.waikato.ac.nz/pipermail/wekalist/2015-November/065348.html), deserves what will have to count as the current answer:
You’ll need to implement building the classifier for the cross-validation
in your code. You can still use an evaluation object to compute stats for
your modified test folds though, because the stats it computes are all
additive. Instances.trainCV() and Instances.testCV() can be used to create
the folds:
http://weka.sourceforge.net/doc.stable/weka/core/Instances.html#trainCV(int,%20int,%20java.util.Random)
You can then call buildClassifier() to process each training fold, modify
the test fold to your heart’s content, and then iterate over the instances
in the test fold while making use of either Evaluation.evaluateModelOnce()
or Evaluation.evaluateModelOnceAndRecordPrediction(). The latter version is
useful if you need the area under the curve summary metrics (as these
require predictions to be retained).
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnce(weka.classifiers.Classifier,%20weka.core.Instance)
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnceAndRecordPrediction(weka.classifiers.Classifier,%20weka.core.Instance)
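The fold-by-fold procedure above works because the evaluation statistics are additive. The following is a minimal, library-free sketch of that idea in Python rather than Weka's Java API. The majority-class train_majority function, the toy rows and the FOO default label are hypothetical stand-ins, not anything from Weka. Each test fold is scored with the model built on its training fold, and that fold's share of the Leak instances is then scored against the default label and added to the same counters.

```python
import random
from collections import Counter


def train_majority(train_rows):
    """Toy stand-in classifier: always predicts the most common training label."""
    majority = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: majority


def cross_validate_with_leak(y_rows, leak_rows, default_label, k=5, seed=0):
    """Manual k-fold CV over y_rows, folding leak_rows into the test folds."""
    rows = list(y_rows)
    random.Random(seed).shuffle(rows)
    correct = total = 0
    for fold in range(k):
        test = rows[fold::k]
        train = [row for i, row in enumerate(rows) if i % k != fold]
        predict = train_majority(train)
        # Score the regular test instances with this fold's model.
        for features, label in test:
            correct += int(predict(features) == label)
            total += 1
        # Add this fold's share of the leaked instances, which always
        # receive the default label instead of a model prediction.
        for _, label in leak_rows[fold::k]:
            correct += int(default_label == label)
            total += 1
    return correct, total


# Hypothetical data: (features, true_label) pairs.
y_rows = [((i,), "A") for i in range(8)] + [((i,), "B") for i in range(2)]
leak_rows = [((0,), "FOO"), ((1,), "FOO"), ((2,), "A")]

correct, total = cross_validate_with_leak(y_rows, leak_rows, "FOO", k=5)
print(correct, total)
```

Because the counters are simple sums, it does not matter which fold a leaked instance is attached to, as long as each one is counted exactly once.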
Depending on your classifier, it could be very easy! Weka has an interface called UpdateableClassifier; any classifier that implements it can be updated after it has been built. The following classes implement this interface:
HoeffdingTree
IBk
KStar
LWL
MultiClassClassifierUpdateable
NaiveBayesMultinomialText
NaiveBayesMultinomialUpdateable
NaiveBayesUpdateable
SGD
SGDText
It can then be updated something like the following:
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/data/data.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null) {
nb.updateClassifier(current);
}
This question is addressed to developers using C++ and the NDK of Nuke.
Context: Assume a custom Op which implements the interfaces of DD::Image::NoIop and
DD::Image::Executable. The node iterates over a range of frames, extracting information at
each frame, which is stored in a custom data structure. A custom knob, which is a member
variable of the above Op (but invisible in the UI), handles the loading and saving
(serialization) of the data structure.
Now I want to exchange that data structure between Ops.
So far I have come up with the following ideas:
Expression linking
Knobs can share information (matrices, etc.) using expression linking.
Can this feature be exploited for custom data as well?
Serialization to image data
The custom data would be serialized and written into a (new) channel. A
node further down the processing tree could grab that and de-serialize
again. Of course, the channel must not be altered between serialization
and de-serialization or else ... this is a hack, I know, but, hey, any port
in a storm!
GeoOp + renderer
In cases where the custom data is purely point-based (which, unfortunately,
it isn't in my case), I could turn the above node into a 3D node and pass
point data to other 3D nodes. At some point a render node would be required
to come back to 2D.
Am I going in the right direction with this? If not, what is a sensible
approach to make this data structure available to other nodes that rely on the
information contained in it?
This question has been answered on the Nuke-dev mailing list:
If you know the actual class of your Op's input, it's possible to cast the
input to that class type and access it directly. A simple example could be
this snippet below:
//! #file DownstreamOp.cpp
#include "UpstreamOp.h" // The Op that contains your custom data.
// ...
UpstreamOp * upstreamOp = dynamic_cast< UpstreamOp * >( input( 0 ) );
if ( upstreamOp )
{
YourCustomData * data = upstreamOp->getData();
// ...
}
// ...
UPDATE
Update with reference to a question that I received via email:
I am trying to do this exact same thing, pass custom data from one Iop
plugin to another.
But these two plugins are defined in different dso/dll files.
How did you get this to work ?
Short answer:
Compile your Ops into a single shared object.
Long answer:
Say
UpstreamOp.cpp
DownstreamOp.cpp
define the depending Ops.
In a first attempt I compiled the first plugin using only UpstreamOp.cpp,
as usual. For the second plugin I compiled both DownstreamOp.cpp and
UpstreamOp.cpp into that plugin.
Strangely enough that worked (on Linux; didn't test Windows).
However, by overriding
bool Op::test_input( int input, Op * op ) const;
things will break. Creating and saving a Comp using the above plugins still
works. But loading that same Comp again breaks the connection in the node graph
between UpstreamOp and DownstreamOp and it is no longer possible to connect
them again.
My hypothesis is this: since both plugins contain symbols for UpstreamOp, it
depends on the load order of the plugins whether a node uses instances of UpstreamOp
from the first or from the second plugin. So, if UpstreamOp from the first plugin
is used, then any dynamic_cast in Op::test_input() will fail and the two Ops cannot
be connected anymore.
It is still surprising that Nuke would even bother to start at all with the above
configuration, since it can be rather picky about symbols from plugins, e.g if they
are missing.
Anyway, to get around this problem I did the following:
compile both Ops into a single shared object, e.g. myplugins.so, and
add a TCL script or Python script (init.py/menu.py) which instructs Nuke how to load
the Ops correctly.
An example of a TCL script can be found in the dev guide, and the instructions
for your menu.py could be something like this:
menu = nuke.menu( 'Nodes' ).addMenu( 'my-plugins' )
menu.addCommand('UpstreamOp', lambda: nuke.createNode('UpstreamOp'))
menu.addCommand('DownstreamOp', lambda: nuke.createNode('DownstreamOp'))
nuke.load('myplugins')
So far, this works reliably for us (on Linux & Windows, haven't tested Mac).
I am trying to transform a vtkPolyData object by using vtkTransform.
However, the tutorials I found use a pipeline, for example: http://www.vtk.org/Wiki/VTK/Examples/Cxx/Filters/TransformPolyData
However, I am using VTK 6.1, which has removed the GetOutputPort method for stand-alone data objects, as mentioned here:
http://www.vtk.org/Wiki/VTK/VTK_6_Migration/Replacement_of_SetInput
I have tried to replace the line:
transformFilter->SetInputConnection()
with
transformFilter->SetInputData(polydata_object);
Unfortunately, the data was not read properly (as the pipeline was not set correctly?)
Do you know how to correctly transform a stand-alone vtkPolyData without using pipeline in VTK6?
Thank you!
GetOutputPort was never a method on a data-object. It was always a method on vtkAlgorithm and it still is present on vtkAlgorithm (and subclasses). Where is the polydata_object coming from? If it's an output of a reader, you have two options:
// update the reader to ensure it executes and reads data.
reader->Update();
// now you can get access to the data object.
vtkSmartPointer<vtkPolyData> data = vtkPolyData::SafeDownCast(reader->GetOutputDataObject(0));
// pass that to the transform filter.
transformFilter->SetInputData(data.GetPointer());
transformFilter->Update();
Second option is to simply connect the pipeline:
transformFilter->SetInputConnection(reader->GetOutputPort());
The key is to ensure that the data is updated/read before passing it to the transform filter when not using the pipeline.