Modification to J48 Algorithm in Weka - weka

I would like to modify Weka's J48 algorithm so that it splits the data the way the RandomForest algorithm does (that is, I want to replace the code responsible for finding the best split in a node).
What do I have to do? I know that I have to replace this part of the C45ModelSelection code with the corresponding code from RandomForest:
C45ModelSelection.java
...
// Find "best" attribute to split on.
minResult = 0;
for (i = 0; i < data.numAttributes(); i++) {
  if ((i != (data).classIndex()) &&
      (currentModel[i].checkModel()))
    // Use 1E-3 here to get a closer approximation to the original
    // implementation.
    if ((currentModel[i].infoGain() >= (averageInfoGain - 1E-3)) &&
        Utils.gr(currentModel[i].gainRatio(), minResult)) {
      bestModel = currentModel[i];
      minResult = currentModel[i].gainRatio();
    }
}
...

It looks like you want to replace J48's split-selection code with the one used by RandomForest. That code appears to live in the RandomTree.buildTree function in RandomTree.java.
The split code there is somewhat different from the J48 code, so you may need to consider what other changes are required besides the split code for everything to work correctly, but it would be a good starting point for what you are after.
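To make the idea concrete, here is a minimal Python sketch (not Weka code) of the main difference: a RandomTree-style node scores only K randomly chosen attributes instead of all of them. The names pick_split_attribute, scores, and k are made up for illustration, and the per-attribute scores are assumed to be computed already, the way J48 computes them in the currentModel loop from the question.

```python
import random

def pick_split_attribute(scores, k, rng):
    """Choose the best-scoring attribute among k randomly drawn candidates.

    scores: dict mapping attribute index -> split quality (hypothetical input)
    k:      how many attributes to consider at this node
    rng:    a random.Random instance, passed in so runs are repeatable
    """
    # Draw the candidate attributes without replacement.
    candidates = rng.sample(sorted(scores), min(k, len(scores)))
    # Among the candidates only, keep the one with the highest score.
    return max(candidates, key=lambda attr: scores[attr])

# Illustrative scores for four attributes (made-up numbers).
example_scores = {0: 0.10, 1: 0.30, 2: 0.90, 3: 0.20}
chosen = pick_split_attribute(example_scores, 2, random.Random(42))
```

With k equal to the number of attributes this degenerates to an ordinary "best of all attributes" search, which is a handy sanity check when porting the idea into the Java code.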
Hope this helps!


What are hp.Discrete and hp.RealInterval? Can I include more values in hp.RealInterval instead of just 2?

I am doing hyperparameter tuning with the HParams Dashboard in TensorFlow 2.0-beta0, as suggested here: https://www.tensorflow.org/tensorboard/r2/hyperparameter_tuning_with_hparams
I am confused by step 1 and could not find a better explanation. My questions relate to the following lines:
HP_NUM_UNITS = hp.HParam('num_units', hp.Discrete([16, 32]))
HP_DROPOUT = hp.HParam('dropout', hp.RealInterval(0.1, 0.2))
HP_OPTIMIZER = hp.HParam('optimizer', hp.Discrete(['adam', 'sgd']))
My question:
I want to try more dropout values instead of just two (0.1 and 0.2). If I pass more values, it throws an error: 'maximum 2 arguments can be given'. I tried to look for documentation but could not find anything about where these hp.Discrete and hp.RealInterval functions come from.
Any help would be appreciated. Thank you!
Good question. The notebook tutorial is lacking in many respects. At any rate, here is how you do it at a certain resolution res:
for dropout_rate in tf.linspace(
    HP_DROPOUT.domain.min_value,
    HP_DROPOUT.domain.max_value,
    res,
):
Looking at the implementation, to me it really doesn't seem to be grid search but Monte Carlo/random search (note: this is not 100% correct, please see my edit below).
So on every iteration a random float from that real interval is chosen.
If you want grid search behavior, just use "Discrete". That way you can even mix and match grid search with random search, pretty cool!
Edit, 27th of July '22 (based on the comment of @dpoiesz):
Just to make it a little clearer: because values are sampled from the intervals, concrete values are returned. Those are then added to the grid dimension, and grid search is performed using them.
RealInterval is a (min, max) pair from which the hparam picks a number.
Here is a link to the implementation for better understanding.
The thing is that, as currently implemented, there does not seem to be any difference between the two unless you call the sample_uniform method.
Note that tf.linspace breaks the sample code mentioned above when it saves the current value.
See https://github.com/tensorflow/tensorboard/issues/2348
In particular, see OscarVanL's comment about his quick & dirty workaround.
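One way to get more than two dropout values without tf.linspace is to build the grid as plain Python floats and hand that list to hp.Discrete. The sketch below is only an illustration: interval_grid is a made-up helper, and it assumes (my reading, not something confirmed above) that the trouble with tf.linspace comes from it yielding tensors rather than plain numbers.

```python
def interval_grid(min_value, max_value, res):
    """Return `res` evenly spaced plain Python floats from min_value to max_value."""
    if res < 1:
        raise ValueError("res must be at least 1")
    if res == 1:
        return [float(min_value)]
    step = (max_value - min_value) / (res - 1)
    # Round to hide floating-point noise such as 0.15000000000000002.
    return [round(min_value + k * step, 10) for k in range(res)]

# Five dropout values between the bounds used in the question.
dropout_values = interval_grid(0.1, 0.2, 5)

# With TensorBoard installed, the list could then serve as a discrete domain:
#   from tensorboard.plugins.hparams import api as hp
#   HP_DROPOUT = hp.HParam('dropout', hp.Discrete(dropout_values))
#   for dropout_rate in HP_DROPOUT.domain.values: ...
```

Because every value is an ordinary float, the loop behaves like the 'num_units' and 'optimizer' loops in the tutorial, which also iterate over domain.values.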

What exactly is Pairwise Matching and How it works?

I'm working on multiple image stitching and I came across the term Pairwise Matching. I have searched almost every site but cannot find a clear description of what exactly it is and how it works.
I'm working in Visual Studio 2012 with OpenCV. I have modified stitching_detailed.cpp according to my requirements and have managed to keep the quality while significantly reducing the run time, except for pairwise matching. I'm using ORB to find feature points. BestOf2NearestMatcher is used in stitching_detailed.cpp for pairwise matching.
What I know about Pairwise Matching and BestOf2NearestMatcher:
(Correct me if I'm wrong somewhere)
1) Pairwise Matching works similarly to other matchers such as Brute Force Matcher, Flann Based Matcher, etc.
2) Pairwise Matching works with multiple images, unlike the above matchers. With those, you have to go through the images one by one if you want to use them for multiple images.
3) In Pairwise Matching, the features of one image are matched with those of every other image in the data set.
4) BestOf2NearestMatcher finds two best matches for each feature and leaves the best one only if the ratio between descriptor distances is greater than the threshold match_conf.
What I want to know:
1) I want to know more details about pairwise matching, in case I'm missing something.
2) I want to know HOW pairwise matching works, the actual flow of it in detail.
3) I want to know HOW BestOf2NearestMatcher works, the actual flow of it in detail.
4) Where can I find code for BestOf2NearestMatcher? OR Where can I get similar code to BestOf2NearestMatcher?
5) Is there any alternative I can use for pairwise matching (or BestOf2NearestMatcher) which takes less time than the current one?
Why I want to know and what I'd do with it:
1) As I stated in the introduction, I want to reduce the time pairwise matching takes. If I can understand what pairwise matching actually is and how it works, I can create my own version according to my requirements or modify the existing one.
Here's where I posted a question in which I want to reduce the time for the entire program: here. I'm not asking the same question again; I'm asking about specifics here. There I wanted to know how I can reduce the time in pairwise matching as well as in other code sections, and here I want to know what pairwise matching is and how it works.
Any help is much appreciated!
EDIT: I found the code for pairwise matching in matchers.cpp. I created my own function in the main code to optimize the time. It works well.
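For readers who land here with the same question, the flow described in points 3) and 4) can be sketched in a few lines of Python. This is a toy illustration, not the code from matchers.cpp: the function names are made up, plain Euclidean distance stands in for the Hamming distance normally used with ORB's binary descriptors, and the way match_conf enters the test (best < (1 - match_conf) * second best) is an assumption about how the threshold is applied.

```python
from itertools import combinations

def distance(a, b):
    """Euclidean distance between two descriptors given as tuples of numbers."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def best_of_2_nearest(desc_a, desc_b, match_conf=0.3):
    """Match each descriptor in desc_a to desc_b using a 2-nearest-neighbour test."""
    matches = []
    for i, d in enumerate(desc_a):
        # Rank all candidates in the other image by distance.
        ranked = sorted((distance(d, e), j) for j, e in enumerate(desc_b))
        if len(ranked) < 2:
            continue  # the test needs a best and a second-best candidate
        (best, j), (second, _) = ranked[0], ranked[1]
        # Keep the best match only if it clearly beats the runner-up.
        if best < (1.0 - match_conf) * second:
            matches.append((i, j, best))
    return matches

def pairwise_match(all_descriptors, match_conf=0.3):
    """Run the matcher on every unordered pair of images."""
    return {
        (a, b): best_of_2_nearest(all_descriptors[a], all_descriptors[b], match_conf)
        for a, b in combinations(range(len(all_descriptors)), 2)
    }

# Three tiny "images", each with two 2-D descriptors (made-up numbers).
images = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.1, 0.0), (10.0, 9.9)],
    [(5.0, 5.0), (5.1, 5.0)],
]
pair_results = pairwise_match(images)
```

The number of image pairs grows quadratically with the number of images, which is one reason this stage tends to dominate the run time; skipping pairs that cannot overlap is the obvious place to save time.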

Weka improve model TP Rate

[Screenshot: J48 output in Weka]
Hi,
I have a problem with my model in Weka (J48, cross-validation): many instances of the second class are classified incorrectly. Is there any way to improve this, or not really? I'm not an expert in Weka. Thank you in advance. My output is above.
With NaiveBayes the results look better, but the TP Rate is still < 0.5 for the second class.
[Screenshot: NaiveBayes output in Weka]
It is hard to reproduce your example with the given information. However, the solution is probably to turn your classifier into a cost-sensitive classifier:
https://weka.wikispaces.com/CostSensitiveClassifier?responseToken=019a566fb2ce3b016b9c8c791c92e8e35
What it does is assign a higher cost to misclassifications of a certain class. In your case this would be the "True" class.
You can also simulate such an algorithm by oversampling your positive examples. That is, if you have n positive examples, you sample k*n positive examples while keeping your negative examples as they are. You could also simply double the positive examples.
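The oversampling idea is easy to try outside Weka first. Below is a small Python sketch of it; the function name oversample and the toy data are made up for illustration, and it uses plain duplication (every positive example repeated k times) rather than random sampling.

```python
def oversample(instances, labels, positive_label, k):
    """Return copies of the data in which every positive instance appears k times."""
    out_x, out_y = [], []
    for x, y in zip(instances, labels):
        # Negative examples are kept once, positive ones are repeated k times.
        copies = k if y == positive_label else 1
        out_x.extend([x] * copies)
        out_y.extend([y] * copies)
    return out_x, out_y

# Toy data: one positive example among four (made-up values).
X = [[1.0], [2.0], [3.0], [4.0]]
y = [True, False, False, False]
X_over, y_over = oversample(X, y, positive_label=True, k=3)
```

If you evaluate with cross-validation, apply the duplication to the training part of each fold only; otherwise copies of the same instance end up in both training and test data and the TP Rate looks better than it really is.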

Auto correct algorithm

I want to implement the following in C++:
1) Check whether the given word exists in a dictionary. The dictionary file is a huge file; consider 100MB or 3-4 Million words.
2) Suggest corrections for the incorrect word.
3) Autocomplete feature.
My Approach
1) I am planning to build a tree so that searching will be efficient.
2) I don't see how to implement the auto-correction feature.
3) I can implement the autocomplete feature using trees.
What's the best data structure and algorithm to implement all the above features?
I have been working on the same problem. So far the best solution I have come across is using a ternary search tree for auto completion. Ternary search trees are more space efficient than tries.
If I'm unable to find the looked-up string in my ternary search tree, I use an already built BK-tree to find the closest match. The BK-tree internally uses Levenshtein distance.
Metaphones are also something you might want to explore; however, I haven't gone into metaphones in depth.
I have a solution in Java for BK-tree + ternary search tree if you like.
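In case it helps to see the BK-tree part spelled out, here is a compact Python sketch (not the Java solution mentioned above). The class and method names are made up for illustration, the word list is assumed to be non-empty, and a real 3-4 million word dictionary would want a more memory-conscious node layout.

```python
def levenshtein(a, b):
    """Edit distance computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

class BKTree:
    """Each node is (word, {distance_to_child: child_node})."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # assumes at least one word
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of `word`."""
        results, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = levenshtein(word, w)
            if d <= max_dist:
                results.append((d, w))
            # Triangle inequality: only these children can hold close words.
            for dist, child in children.items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"])
suggestions = tree.search("bok", 1)
```

The pruning step is what makes this faster than comparing the misspelled word against every dictionary entry: whole subtrees are skipped when their edge distance falls outside the allowed band.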
You can do autocomplete by looking at all the strings in a given subtree. Some score to help you pick among them might help. It works something like this: if you have "te", you go down that path in the trie and then traverse the entire subtree there to get all the possible endings.
For corrections you need to implement something like http://en.wikipedia.org/wiki/Levenshtein_distance over the tree. You can use the fact that once you have processed a given path in the trie, you can reuse the result for all the strings in the subtree rooted at the end of that path.
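The autocomplete half of this can be sketched as follows in Python. It is a bare-bones trie with made-up names (Trie, complete), no scoring, and no attempt at memory efficiency, so treat it as an illustration of the subtree walk rather than something ready for a 100MB dictionary.

```python
class Trie:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a dictionary word ends here

    def insert(self, word):
        node = self
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Trie()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, prefix):
        """Follow `prefix` from the root; return the node reached or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def complete(self, prefix):
        """Collect every stored word that starts with `prefix`."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)

trie = Trie()
for w in ["tea", "team", "ten", "to", "inn"]:
    trie.insert(w)
endings = trie.complete("te")
```

The same _walk step answers the "does this word exist" question, so one structure covers features 1) and 3); ranking the completions would need a score stored at each word node.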
1) Aside from trees, another interesting method is the BWT:
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
A BWT suffix array can easily be used to find words with a given prefix.
2) For error correction, a modern approach is LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
Some links randomly provided by a Google search:
https://cs.stackexchange.com/questions/2093/efficient-map-data-structure-supporting-approximate-lookup
https://code.google.com/p/likelike/
http://aspguy.wordpress.com/2012/02/18/the-magic-behind-the-google-search/

recursively find subsets

Here is a recursive function I'm trying to write that finds all the subsets of an STL set that is passed in. The two params are an STL set to search for subsets and a number i >= 0 that specifies how big the subsets should be. If the integer is bigger than the set, return an empty subset.
I don't think I'm doing this correctly. Sometimes it's right, sometimes it's not. The STL set gets passed in fine.
list<set<int> > findSub(set<int>& inset, int i)
{
    list<set<int> > the_list;
    list<set<int> >::iterator el = the_list.begin();
    if(inset.size()>i)
    {
        set<int> tmp_set;
        for(int j(0); j<=i;j++)
        {
            set<int>::iterator first = inset.begin();
            tmp_set.insert(*(first));
            the_list.push_back(tmp_set);
            inset.erase(first);
        }
        the_list.splice(el,findSub(inset,i));
    }
    return the_list;
}
From what I understand, you are actually trying to generate all subsets of 'i' elements from a given set, right?
Modifying the input set is going to get you into trouble; you'd be better off not modifying it.
I think the idea is simple enough, though I would say that you got it backwards. Since it looks like homework, I won't give you a C++ algorithm ;)
generate_subsets(set, sizeOfSubsets)  # I assume sizeOfSubsets cannot be negative
                                      # use a type that enforces this for god's sake!
  if sizeOfSubsets is 0 then return {}
  else if sizeOfSubsets is 1 then
    result = []
    for each element in set do result <- result + {element}
    return result
  else
    result = []  # treat this as a set of subsets so that duplicates collapse
    baseSubsets = generate_subsets(set, sizeOfSubsets - 1)
    for each subset in baseSubsets
      for each element in set
        if element not in subset then result <- result + { subset + element }
    return result
The key points are:
generate the subsets of lower rank first, as you'll have to iterate over them
don't try to insert an element into a subset that already contains it; that would give you a subset of incorrect size
Now, you'll have to understand this and transpose it to 'real' code.
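For comparison, here is one way the recursion can look in runnable form. It is written in Python rather than C++ on purpose, and it is a different formulation from the pseudo-code above: instead of growing smaller subsets, it decides for each element whether it is in or out. The function names are made up, and the input set is never modified.

```python
def subsets_of_size(items, size):
    """Return every subset of `items` with exactly `size` elements."""
    elems = sorted(items)  # fixed order, the input itself is left untouched

    def pick(start, remaining):
        if remaining == 0:
            return [frozenset()]   # exactly one way to pick nothing
        if len(elems) - start < remaining:
            return []              # not enough elements left
        first = elems[start]
        # Subsets that contain `first` ...
        with_first = [s | {first} for s in pick(start + 1, remaining - 1)]
        # ... followed by the subsets that leave it out.
        without_first = pick(start + 1, remaining)
        return with_first + without_first

    return pick(0, size)

pairs = subsets_of_size({1, 2, 3, 4}, 2)
```

Because every element is considered once, in a fixed order, no subset is produced twice, so there is no need to check for duplicates afterwards.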
I have been staring at this for several minutes and I can't figure out what train of thought led you to think it would work. You are permanently removing several members of the input set before exploring every possible subset that they could participate in.
Try working out the solution you intend in pseudo-code and see if you can spot the problem without the STL interfering.
It seems (I'm not a native English speaker) that what you could do is compute the power set (the set of all subsets) and then select from it only the subsets that match your condition.
You can find methods for calculating the power set on the Wikipedia "Power set" page and on Math Is Fun (the link is in the External links section of that Wikipedia page, named "Power Set from Math Is Fun"; I cannot post it here directly because of the spam prevention mechanism). On Math Is Fun, see mainly the section "It's binary".
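The binary approach can be sketched like this in Python (the function names are made up for illustration): each number from 0 to 2^n - 1 is read as a bit pattern saying which elements are in the subset, and the size condition is applied afterwards.

```python
def power_set(items):
    """Return all subsets of `items`, one per bit pattern from 0 to 2**n - 1."""
    elems = sorted(items)
    n = len(elems)
    return [
        {elems[bit] for bit in range(n) if (mask >> bit) & 1}
        for mask in range(1 << n)
    ]

def subsets_matching_size(items, size):
    """Filter the power set down to the subsets with exactly `size` elements."""
    return [s for s in power_set(items) if len(s) == size]

all_subsets = power_set({1, 2, 3})
two_element_subsets = subsets_matching_size({1, 2, 3}, 2)
```

This always builds all 2^n subsets before filtering, so it is fine for small sets but much more work than generating only the subsets of the wanted size.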
I also can't see what this is supposed to achieve.
If this isn't homework with specific restrictions, I'd simply suggest testing against a temporary std::set with std::includes().