How do I use AdaBoost for feature selection? - c++

I want to use AdaBoost to choose a good set of features from a large number (~100k). AdaBoost works by iterating through the feature set and adding in features based on how well they perform. It chooses features that perform well on samples that were misclassified by the existing feature set.
I'm currently using OpenCV's CvBoost. I got an example working, but from the documentation it is not clear how to pull out the feature indexes that it has used.
Using either CvBoost, a 3rd party library, or implementing it myself, how can I pull out a set of features from a large feature set using AdaBoost?

With the help of @greeness's answer I made a subclass of CvBoost:
std::vector<int> RSCvBoost::getFeatureIndexes() {
    CvSeqReader reader;
    cvStartReadSeq( weak, &reader );
    cvSetSeqReaderPos( &reader, 0 );

    std::vector<int> featureIndexes;

    int weak_count = weak->total;
    for( int i = 0; i < weak_count; i++ ) {
        CvBoostTree* wtree;
        CV_READ_SEQ_ELEM( wtree, reader );

        const CvDTreeNode* node = wtree->get_root();
        CvDTreeSplit* split = node->split;
        const int index = split->condensed_idx;

        // Only add features that are not already added
        if (std::find(featureIndexes.begin(),
                      featureIndexes.end(),
                      index) == featureIndexes.end()) {
            featureIndexes.push_back(index);
        }
    }
    return featureIndexes;
}

Disclaimer: I am not a user of OpenCV. From the documentation, OpenCV's AdaBoost uses decision trees (either classification or regression trees) as the fundamental weak learners.
It seems to me this is the way to get the underlying weak learners:
CvBoost::get_weak_predictors
Returns the sequence of weak tree classifiers.
C++: CvSeq* CvBoost::get_weak_predictors()
The method returns the sequence of weak classifiers.
Each element of the sequence is a pointer to the CvBoostTree class or
to some of its derivatives.
Once you have access to the sequence of CvBoostTree*, you should be able to inspect which features are contained in each tree, what the split values are, etc.
If each tree is only a decision stump, only one feature is contained in each weak learner. But if we allow deeper trees, a combination of features could exist in each individual weak learner.
I further took a look at the CvBoostTree class; unfortunately, the class itself does not provide a public method to check the internal features used. But you might want to create your own subclass inheriting from CvBoostTree and expose whatever functionality you need.
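For example, here is a minimal sketch of such a helper, assuming OpenCV 2.x's CvDTreeNode/CvDTreeSplit layout; it walks every split in a weak tree rather than only the root, which matters once the trees are deeper than stumps (whether var_idx or condensed_idx is the right field depends on how the training data was set up):
#include <set>

// Recursively collect the feature index of every split in a weak tree.
// For depth-1 trees (stumps) this visits only the root, matching the
// subclass above; for deeper trees it also visits internal nodes.
static void collectSplitIndexes(const CvDTreeNode* node, std::set<int>& indexes)
{
    if (node == NULL || node->split == NULL)
        return; // reached a leaf
    indexes.insert(node->split->var_idx); // or condensed_idx, as above
    collectSplitIndexes(node->left, indexes);
    collectSplitIndexes(node->right, indexes);
}
Calling collectSplitIndexes(wtree->get_root(), indexes) inside the loop above would then accumulate all used feature indexes across the ensemble.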

Related

TBB Dynamic Flow Graphs

I'm trying to come up with a way to define a flow graph (think TBB) defined at runtime. Currently, we use TBB to define the nodes and the edges between the nodes at compile time. This is sort of annoying because we have people who want to add processing steps and modify the processing chain without recompiling the whole application or really having to know anything about the application beyond how to add processing kernels. In an ideal world I would have some sort of plugin framework using dlls. We already have the software architected so that each node in TBB represents a processing step so it's pretty easy to add stuff if you're willing to recompile.
As a first step, I was trying to come up with a way to define a TBB flow graph in YAML but it was a massive rabbit hole. Does anyone know if something like this exists before I go all in on implementing this from scratch? It will be a fun project but no point in duplicating work.
I am not sure if anything like this exists in a TBB companion library, but it is definitely doable to implement a small, runtime-configurable subset of the Flow Graph functionality.
If the data that transit through your graph have a well-defined type, i.e. your nodes are basically function_node<T, T>, things are manageable. If the graph transforms data from one type to another, it gets more complicated; one solution would be to use a variant of these types and handle the possibly incompatible types at runtime. That really depends on the level of flexibility required.
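For the variant route, here is a minimal sketch; the payload types and the node body are illustrative assumptions, not part of TBB:
#include <string>
#include <variant>
#include <oneapi/tbb/flow_graph.h>

namespace flow = oneapi::tbb::flow;

// The set of payload types is an assumption; extend it as needed.
using payload_t = std::variant<int, double, std::string>;
using node_t    = flow::function_node<payload_t, payload_t>;

// A runtime-configured node that only knows how to handle doubles:
// it unwraps the variant, transforms on a match, and passes anything
// else through unchanged (or this could be reported as an error).
node_t make_doubler(flow::graph& g)
{
    return node_t(g, flow::unlimited, [](payload_t p) -> payload_t {
        if (auto* d = std::get_if<double>(&p))
            return *d * 2.0;
        return p;
    });
}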
With:
$ cat nodes.txt
# type concurrency params...
multiply unlimited 2
affine serial 5 -3
and
$ cat edges.txt
# src dst
0 1
1 2
where index 0 is a source node, here is a scaffold of how I would implement it:
namespace flow = oneapi::tbb::flow;

using data_t = double;
using node_t = flow::function_node<data_t, data_t>;

flow::graph g;
std::vector<node_t> nodes;

auto node_factory = [&g](std::string type, std::string concurrency, std::string params) -> node_t {
    // Implement a dynamic factory of nodes
};

// Add the source node first; it is kept separate because input_node
// and function_node are different types
flow::input_node<data_t> src(g,
    [&]( oneapi::tbb::flow_control &fc ) -> data_t { /*...*/ });

// Parse the node description file and populate the node vector using the factory
for (auto&& d : node_descriptions) // parsed from nodes.txt
    nodes.push_back(node_factory(d.type, d.concurrency, d.params));

// Parse the edge description file and call make_edge accordingly;
// edge index 0 refers to the source node, i > 0 to nodes[i - 1]
for (auto&& e : edges) { // parsed from edges.txt
    if (e.src == 0)
        flow::make_edge(src, nodes[e.dst - 1]);
    else
        flow::make_edge(nodes[e.src - 1], nodes[e.dst - 1]);
}

// Run the graph
src.activate();
g.wait_for_all();
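Building on that scaffold, here is one hedged way to flesh out node_factory for the two node types in nodes.txt; for brevity it assumes the params column has already been parsed into a std::vector<double>:
#include <stdexcept>
#include <string>
#include <vector>

// "multiply k" scales by k; "affine a b" computes a*x + b. The concurrency
// strings map onto TBB's flow::serial / flow::unlimited constants.
auto node_factory = [&g](const std::string& type,
                         const std::string& concurrency,
                         const std::vector<double>& params) -> node_t {
    std::size_t limit = (concurrency == "serial") ? flow::serial
                                                  : flow::unlimited;
    if (type == "multiply") {
        double k = params.at(0); // "multiply unlimited 2" -> k = 2
        return node_t(g, limit, [k](data_t x) { return k * x; });
    }
    if (type == "affine") {
        double a = params.at(0), b = params.at(1); // "affine serial 5 -3"
        return node_t(g, limit, [a, b](data_t x) { return a * x + b; });
    }
    throw std::runtime_error("unknown node type: " + type);
};
Returning the node by value relies on function_node's copy constructor, which creates another node bound to the same graph, so push_back into the vector works as in the scaffold.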

How to build a graph of specific function calls?

I have a project where I want to dynamically build a graph of specific function calls. For example, suppose I have 2 template classes, A and B, where A has a tracked method (saved as a graph node) and B has 3 methods (a non-tracked method, a tracked method, and a tracked method which calls A's tracked method). Then I want to register only the tracked method calls into the graph object as nodes. The graph object could be a singleton.
template <class TA>
class A
{
public:
    void runTracked()
    {
        // do stuff
    }
};

template <class TB>
class B
{
public:
    void runNonTracked()
    {
        // do stuff
    }

    void runTracked()
    {
        // do stuff
    }

    void callATracked()
    {
        auto a = A<TB>();
        a.runTracked();
        // do stuff
    }
};

void root()
{
    auto b1 = B<int>();
    auto b2 = B<double>();
    b1.runTracked();
    b2.runNonTracked();
    b2.callATracked();
}

int main()
{
    auto b = B<int>();
    b.runTracked();
    root();
    return 0;
}
This should produce a graph object similar to the one below:
root()
|-- B<int>::runTracked()
\-- B<double>::callATracked()
    \-- A<double>::runTracked()
The tracked functions should be adjustable. If the root would be adjustable (as in the above example) that would be the best.
Is there an easy way to achieve this?
I was thinking about introducing a macro for the tracked functionalities and a singleton graph object which would register the tracked functions as nodes. However, I'm not sure how to determine which is the last tracked function in the call stack, or (from the graph's perspective) which graph node should be the parent when I want to add a new node.
In general, you have 2 strategies:
1. Instrument your application with some sort of logging/tracing framework, and then try to replicate some sort of tracing mixin-like functionality to apply global/local tracing depending on which parts of the code you apply the mixins to.
2. Recompile your code with some sort of tracing instrumentation feature enabled for your compiler or runtime, and then use the associated compiler/runtime-specific tools/frameworks to transform/sift through the data.
For 1, this will require you to manually insert more code (something like _penter/_pexit for MSVC) or create some sort of ScopedLogger that would (hopefully!) log asynchronously to some external file/stream/process. This is not necessarily a bad thing, as having a separate process control the trace tracking would probably be better in the case where the traced process crashes. Regardless, you'd probably have to refactor your code, since C++ does not have great first-class support for metaprogramming to refactor/instrument code at a module/global level. However, this is not an uncommon pattern for larger applications; for example, AWS X-Ray is a commercial tracing service (though, typically, I believe it fits the use case of tracing network calls and RPC calls rather than in-process function calls).
For 2, you can try something like utrace or something compiler-specific: MSVC has various tools like Performance Explorer, LLVM has XRay, GCC has gprof. You essentially compile in a sort of "debug++" mode or there is some special OS/hardware/compiler magic to automatically insert tracing instructions or markers that help the runtime trace your desired code. These tracing-enabled programs/runtimes typically emit to some sort of unique tracing format that must then be read by a unique tracing format reader.
Finally, dynamically building the graph in memory is a similar story. Like the tracing strategies above, there are a variety of application- and runtime-level libraries to help trace your code that you can interact with programmatically. Even the simplest version, creating ScopedTracer objects that log to a tracing file, can then be fitted with a consumer thread that owns and updates the trace graph with whatever latency and data-durability requirements you have.
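For the simplest in-process version, here is a minimal, hedged sketch (all names are illustrative assumptions; __PRETTY_FUNCTION__ is a GCC/Clang extension, MSVC has __FUNCSIG__) of a singleton call graph plus a RAII tracer, where a thread_local stack of active tracers answers the question of which tracked call is the parent:
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Singleton that stores the call graph as parent -> children edges.
class CallGraph {
public:
    static CallGraph& instance() {
        static CallGraph g;
        return g;
    }
    void addEdge(const std::string& parent, const std::string& child) {
        edges_[parent].push_back(child);
    }
    void dump() const {
        for (const auto& [parent, children] : edges_)
            for (const auto& child : children)
                std::printf("%s -> %s\n", parent.c_str(), child.c_str());
    }
private:
    std::unordered_map<std::string, std::vector<std::string>> edges_;
};

// RAII tracer: the constructor records an edge from the innermost active
// tracer (top of the thread-local stack) to this call, then pushes itself.
class ScopedTracer {
public:
    explicit ScopedTracer(std::string name) : name_(std::move(name)) {
        CallGraph::instance().addEdge(
            stack().empty() ? "root" : stack().back(), name_);
        stack().push_back(name_);
    }
    ~ScopedTracer() { stack().pop_back(); }
private:
    static std::vector<std::string>& stack() {
        thread_local std::vector<std::string> s; // one call stack per thread
        return s;
    }
    std::string name_;
};

// The adjustable tracking macro: put TRACKED(); at the top of tracked methods.
#define TRACKED() ScopedTracer tracedScope_(__PRETTY_FUNCTION__)
With TRACKED(); placed in B::runTracked, B::callATracked and A::runTracked, running root() would record the parent/child edges of the expected graph, and CallGraph::instance().dump() prints them. Note that addEdge is not thread-safe as written; a real version would guard the map with a mutex.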
Edit: If you would like, OpenTelemetry/Jaeger may be a good place to start visualizing traces once you have extracted the data (and you can also report directly to it if you want), although it prefers a tree presentation format: Jaeger documentation for Trace Detail View

The balance of Single responsibility/unit testability and practicality

I'm still confused about unit testing. Suppose I have something as trivial as this:
class x {
    zzz someMethod(some input...) {
        BufferedImage image = getter.getImageFromFile(...);

        // determine resize mode:
        int width = image.getWidth();
        int height = image.getHeight();
        Scalr.Mode resizeMode = (width > height) ? Scalr.Mode.FIT_TO_WIDTH : Scalr.Mode.FIT_TO_HEIGHT;

        return ScalrWrapper.resize(image, resizeMode);
    }
}
Going by the rules, the Scalr.Mode resizeMode computation should probably be in a separate class for better unit testability of the aforementioned method, like so:
class xxx {
    mode getResizeMode(int width, int height)
    {
        return (width > height) ? Scalr.Mode.FIT_TO_WIDTH : Scalr.Mode.FIT_TO_HEIGHT;
    }
}

class x {
    zzz someMethod(some input...) {
        BufferedImage image = getter.getImageFromFile(...);

        // determine resize mode:
        int width = image.getWidth();
        int height = image.getHeight();
        Scalr.Mode resizeMode = xxx.getResizeMode(width, height);

        return ScalrWrapper.resize(image, resizeMode);
    }
}
But it looks like such overkill... I'm not sure which one is better, but I guess this way is. Suppose I go this route; would it be even better to do it this way?
class xxx {
    mode getResizeMode(Image image)
    {
        return (image.getWidth() > image.getHeight()) ? Scalr.Mode.FIT_TO_WIDTH : Scalr.Mode.FIT_TO_HEIGHT;
    }
}

class x {
    zzz someMethod(some input...) {
        BufferedImage image = getter.getImageFromFile(...);

        // determine resize mode:
        Scalr.Mode resizeMode = xxx.getResizeMode(image);

        return ScalrWrapper.resize(image, resizeMode);
    }
}
From what I understand, the correct way is the one where getResizeMode accepts integers, as it is decoupled from the type of data whose properties are width and height. However, personally, the use of getResizeMode(BufferedImage) actually justifies the creation of a separate class better, as more work is removed from the main method. And since I am not going to be using getResizeMode for any sort of data other than BufferedImage in my application anyway, there is no problem of reusability. Also, I don't think I should be doing getResizeMode(int, int) simply for reusability if I see no need for it, due to the YAGNI principle.
So my question is: would getResizeMode(BufferedImage) be a good way according to OOD in the real world? I understand it's textbook-good OOD, but I have been led to believe that 100% textbook OOD is impractical in the real world. So as I am trying to learn OOD, I just want to know which path I should follow.
...Or should I just leave everything in one method, like in the very first code snippet?
I don't think that resize mode calculation influences testability a lot.
As to Single Responsibility:
"A class should have only one reason to change" (https://en.wikipedia.org/wiki/Single_responsibility_principle).
Do you think that the resize mode calculation is going to change?
If not, then just put it in the class where this mode is needed.
This won't add any reasons to change for that class.
If the calculation is likely to change (and/or may have several versions), then move it to a separate class (make it a strategy).
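A minimal sketch of what that strategy could look like; the interface and class names are illustrative assumptions, not from the question:
import org.imgscalr.Scalr;

// The strategy interface: one method, easy to mock and to extend.
interface ResizeModeStrategy {
    Scalr.Mode getResizeMode(int width, int height);
}

// The current behaviour becomes just one implementation among possible others.
class FitToLargerDimension implements ResizeModeStrategy {
    @Override
    public Scalr.Mode getResizeMode(int width, int height) {
        return (width > height) ? Scalr.Mode.FIT_TO_WIDTH : Scalr.Mode.FIT_TO_HEIGHT;
    }
}
Class x would then hold a ResizeModeStrategy (injected, for example, through its constructor) and call strategy.getResizeMode(width, height), so a new version of the calculation can be swapped in without touching class x.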
Achieving the Single Responsibility Principle (SRP) is not about creating a new class every time one extracts a method. Moreover, the SRP depends on the context.
A module should conform to the SRP.
A class should conform to the SRP.
A method should conform to the SRP.
The message from Uncle Bob is: Extract till you Drop
Furthermore, he said:
Perhaps you think this is taking things too far. I used to think so too. But after programming for 40+ years, I’m beginning to come to the conclusion that this level of extraction is not taking things too far at all.
When it comes to the decision to create new classes, keep the metric of high cohesion in mind. Cohesion is the degree to which the elements of a module belong together. If all methods work in one specific context and on the same set of variables, they belong in one class.
Back to your case: I would extract all the methods and put them in one class. And this one class is also nicely testable.
A little bit late to the party, but here's my 2c.
To my mind, class x is not adhering to the SRP for a different reason.
It's currently responsible for
Getting an image from a file (getter.getImageFromFile)
Resizing that image
TL;DR
The TL;DR on this is that both of your approaches are fine, and both do in fact stick - with varying degrees of stickiness - to the SRP. However, if you want to adhere very tightly to the SRP (which tends to lead to very testable code), you could first split this into three classes:
Orchestrator
class imageResizeService
{
    ImageGetter _getter;
    ImageResizer _resizer;

    zzz ResizeImage(imageName)
    {
        image = _getter.GetImage(imageName);
        resizedImage = _resizer.ResizeImage(image);
        return resizedImage;
    }
}
This class has a single responsibility: namely, given an image name, return a resized version of it based on some criteria. To do so, it orchestrates two dependencies, but it only has a single reason to change, which is that the process used to get and resize an image, in general, has changed.
You can easily unit test this by mocking the getter and resizer and testing that they are called in order, that the resizer is called with the data given by the getter, and that the final return value equals that returned by the resizer, and so on (i.e. "White Box" testing)
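As a hedged sketch of such a test, assuming constructor injection for the two dependencies plus JUnit 4 and Mockito (none of which appear in the question):
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.*;

import java.awt.image.BufferedImage;
import org.junit.Test;

public class ImageResizeServiceTest {
    @Test
    public void resizesTheImageReturnedByTheGetter() {
        // Mock both dependencies and the dummy images flowing through them.
        ImageGetter getter = mock(ImageGetter.class);
        ImageResizer resizer = mock(ImageResizer.class);
        BufferedImage image = mock(BufferedImage.class);
        BufferedImage resized = mock(BufferedImage.class);

        when(getter.GetImage("photo")).thenReturn(image);
        when(resizer.ResizeImage(image)).thenReturn(resized);

        // Assumes the orchestrator takes its dependencies via constructor.
        imageResizeService service = new imageResizeService(getter, resizer);

        // The service must resize exactly what the getter returned.
        assertEquals(resized, service.ResizeImage("photo"));
        verify(resizer).ResizeImage(image);
    }
}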
ImageGetter
class ImageGetter
{
    BufferedImage GetImage(imageName)
    {
        image = io.LoadFromDisk(imageName) or explode;
        return image;
    }
}
Again, we have a single responsibility (load an image from disk, and return it).
The only reason to change this class would be if the mechanics of loading the image were to change - e.g. you are loading from a Database, not a Disk.
An interesting note here is that this class is ripe for further generalisation - for example, to compose it from a BufferedImageBuilder and a RawImageDataGetter abstraction, which could have multiple implementations for Disk, Database, Http, etc. But that's YAGNI right now and a conversation for another day :)
Note on testability
In terms of unit testing this, you may run into a small problem: namely, that you can't quite "unit test" it unless your framework has a mock for the file system. If it doesn't, you can either further abstract the loading of the raw data (as per the previous paragraph) or accept it and just perform an integration test off a known good file. Both approaches are perfectly valid, and you should not worry about which you choose - whatever is easier for you.
ImageResizer
class ImageResizer
{
    zzz ResizeImage(image)
    {
        int width = image.getWidth();
        int height = image.getHeight();
        Scalr.Mode resizeMode = getResizeMode(width, height);
        return ScalrWrapper.resize(image, resizeMode);
    }

    private mode getResizeMode(width, height)
    {
        return (width > height) ? Scalr.Mode.FIT_TO_WIDTH : Scalr.Mode.FIT_TO_HEIGHT;
    }
}
This class also has but a single job: to resize an image.
The question of whether or not the getResizeMode method - currently just a private method to keep the code clean - should be a separate responsibility has to be answered in the context of whether or not that operation is somehow independent of the image resizing.
Even if it's not, the SRP is still being followed, because it's part of the single responsibility "Resize an Image".
Test-wise, this is also really easy to test; because it doesn't cross any boundaries (you can create and supply the sole dependency - the image - at test runtime), you probably won't even need mocks.
Personally I would extract it to a separate class, just so that I could, in isolation, verify that given a width larger than a height, I was returned a Scalr.Mode.FIT_TO_WIDTH and vice-versa; it would also mean I could adhere to the Open Closed Principle whereby new scaling modes could be introduced without having to modify the ImageResizer class.
But really
The answer here has to be that it depends; for example, if you have a simple way to verify that, given a width of 100 and a height of 99, the resized image is indeed scaled to "Fit to Width", then you really don't need to extract it.
That being said, I suspect you'll have an easier time testing this if you do extract it to a separate method.
Just bear in mind that if you're using a decent IDE with good refactoring tools, that should really not take you more than a couple of keystrokes, so don't worry about the overhead.

Adding weka instances after classification but before evaluation?

Suppose X is a raw, labeled (i.e., with training labels) data set, and Process(X) returns a set of Y instances that have been encoded with attributes and converted into a Weka-friendly file like Y.arff.
Also suppose Process() has some 'leakage': some instances Leak = X - Y can't be encoded consistently, and need to get a default classification FOO. The training labels are also known for the Leak set.
My question is how I can best introduce instances from Leak into the Weka evaluation stream AFTER some classifier has been applied to the subset Y, folding the Leak instances in with their default classification label, before performing evaluation across the full set X. In code:
DataSource LeakSrc = new DataSource("leak.arff");
Instances Leak = LeakSrc.getDataSet();
DataSource Ysrc = new DataSource("Y.arff");
Instances Y = Ysrc.getDataSet();
classfr.buildClassifier(Y);
// YunionLeak = ??
eval.crossValidateModel(classfr, YunionLeak);
Maybe this is a specific example of folding together results
from multiple classifiers?
The bounty is closing, but Mark Hall, in another forum (http://list.waikato.ac.nz/pipermail/wekalist/2015-November/065348.html), deserves what will have to count as the current answer:
You’ll need to implement building the classifier for the cross-validation
in your code. You can still use an evaluation object to compute stats for
your modified test folds though, because the stats it computes are all
additive. Instances.trainCV() and Instances.testCV() can be used to create
the folds:
http://weka.sourceforge.net/doc.stable/weka/core/Instances.html#trainCV(int,%20int,%20java.util.Random)
You can then call buildClassifier() to process each training fold, modify
the test fold to your hearts content, and then iterate over the instances
in the test fold while making use of either Evaluation.evaluateModelOnce()
or Evaluation.evaluateModelOnceAndRecordPrediction(). The latter version is
useful if you need the area under the curve summary metrics (as these
require predictions to be retained).
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnce(weka.classifiers.Classifier,%20weka.core.Instance)
http://weka.sourceforge.net/doc.stable/weka/classifiers/Evaluation.html#evaluateModelOnceAndRecordPrediction(weka.classifiers.Classifier,%20weka.core.Instance)
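Putting that recipe together, here is a hedged sketch; the classifier choice, fold count, and the FOO label lookup are illustrative assumptions, and it assumes Evaluation's evaluateModelOnce(double, Instance) overload for scoring a fixed default prediction:
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LeakyCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances Y = new DataSource("Y.arff").getDataSet();
        Instances Leak = new DataSource("leak.arff").getDataSet();
        Y.setClassIndex(Y.numAttributes() - 1);
        Leak.setClassIndex(Leak.numAttributes() - 1);

        int folds = 10;
        Random rand = new Random(1);
        Instances randY = new Instances(Y);
        randY.randomize(rand);

        Evaluation eval = new Evaluation(randY);
        for (int f = 0; f < folds; f++) {
            Instances train = randY.trainCV(folds, f, rand);
            Instances test = randY.testCV(folds, f);

            Classifier clf = new J48(); // any classifier will do
            clf.buildClassifier(train);

            // Modify the test fold to your heart's content, then score it
            // instance by instance; Evaluation's stats are additive.
            for (int i = 0; i < test.numInstances(); i++) {
                eval.evaluateModelOnce(clf, test.instance(i));
            }
        }

        // Fold the Leak instances in once, scored with their fixed default
        // class FOO (assumed here to be a value of the class attribute).
        double foo = Leak.classAttribute().indexOfValue("FOO");
        for (int i = 0; i < Leak.numInstances(); i++) {
            eval.evaluateModelOnce(foo, Leak.instance(i));
        }

        System.out.println(eval.toSummaryString());
    }
}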
Depending on your classifier, it could be very easy! Weka has an interface called UpdateableClassifier; any class implementing it can be updated after it has been built! The following classes implement this interface:
HoeffdingTree
IBk
KStar
LWL
MultiClassClassifierUpdateable
NaiveBayesMultinomialText
NaiveBayesMultinomialUpdateable
NaiveBayesUpdateable
SGD
SGDText
A classifier can then be updated, something like the following:
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/data/data.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);

NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);

Instance current;
while ((current = loader.getNextInstance(structure)) != null) {
    nb.updateClassifier(current);
}

How to exchange custom data between Ops in Nuke?

This question is addressed to developers using C++ and the NDK of Nuke.
Context: Assume a custom Op which implements the interfaces of DD::Image::NoIop and DD::Image::Executable. The node iterates over a range of frames, extracting information at each frame, which is stored in a custom data structure. A custom knob, which is a member variable of the above Op (but invisible in the UI), handles the loading and saving (serialization) of the data structure.
Now I want to exchange that data structure between Ops.
So far I have come up with the following ideas:
Expression linking
Knobs can share information (matrices, etc.) using expression linking.
Can this feature be exploited for custom data as well?
Serialization to image data
The custom data would be serialized and written into a (new) channel. A
node further down the processing tree could grab that and de-serialize
again. Of course, the channel must not be altered between serialization
and de-serialization or else ... this is a hack, I know, but, hey, any port
in a storm!
GeoOp + renderer
In cases where the custom data is purely point-based (which, unfortunately,
it isn't in my case), I could turn the above node into a 3D node and pass
point data to other 3D nodes. At some point a render node would be required
to come back to 2D.
Am I going in the correct direction with this? If not, what is a sensible approach to make this data structure available to other nodes that rely on the information contained in it?
This question has been answered on the Nuke-dev mailing list:
If you know the actual class of your Op's input, it's possible to cast the
input to that class type and access it directly. A simple example could be
this snippet below:
//! @file DownstreamOp.cpp
#include "UpstreamOp.h" // The Op that contains your custom data.
// ...
UpstreamOp* upstreamOp = dynamic_cast< UpstreamOp* >( input( 0 ) );
if ( upstreamOp )
{
    YourCustomData* data = upstreamOp->getData();
    // ...
}
// ...
// ...
UPDATE
Update with reference to a question that I received via email:
I am trying to do this exact same thing, pass custom data from one Iop
plugin to another.
But these two plugins are defined in different dso/dll files.
How did you get this to work ?
Short answer:
Compile your Ops into a single shared object.
Long answer:
Say
UpstreamOp.cpp
DownstreamOp.cpp
define the dependent Ops.
In a first attempt I compiled the first plugin using only UpstreamOp.cpp,
as usual. For the second plugin I compiled both DownstreamOp.cpp and
UpstreamOp.cpp into that plugin.
Strangely enough that worked (on Linux; didn't test Windows).
However, by overriding
bool Op::test_input( int input, Op * op ) const;
things will break. Creating and saving a Comp using the above plugins still
works. But loading that same Comp again breaks the connection in the node graph
between UpstreamOp and DownstreamOp and it is no longer possible to connect
them again.
My hypothesis is this: since both plugins contain symbols for UpstreamOp, it depends on the load order of the plugins whether a node uses instances of UpstreamOp from the first or from the second plugin. So, if UpstreamOp from the first plugin is used, then any dynamic_cast in Op::test_input() will fail and the two Ops cannot be connected anymore.
It is still surprising that Nuke would even bother to start at all with the above configuration, since it can be rather picky about symbols from plugins, e.g. if they are missing.
Anyway, to get around this problem I did the following:
compile both Ops into a single shared object, e.g. myplugins.so, and
add a TCL or Python script (init.py/menu.py) which instructs Nuke how to load the Ops correctly.
An example for a TCL script can be found in the dev guide, and the instructions for your menu.py could be something like this:
menu = nuke.menu( 'Nodes' ).addMenu( 'my-plugins' )
menu.addCommand('UpstreamOp', lambda: nuke.createNode('UpstreamOp'))
menu.addCommand('DownstreamOp', lambda: nuke.createNode('DownstreamOp'))
nuke.load('myplugins')
So far, this works reliably for us (on Linux & Windows, haven't tested Mac).