Does the Random tree algorithm in WEKA use information gain?

I'm using Weka with some databases and I'm trying to understand how the RandomTree algorithm works. I don't understand whether it chooses, at every node, a random attribute to split on, or whether it uses (at every node) the information gain of an attribute to decide which one to split on.
Thank you
Regards,
Andrea

Related

Slow insertion using Neptune and Gremlin

I'm having problems with insertion into Neptune using Gremlin.
I am trying to insert many nodes and edges, potentially hundreds of thousands of them, while checking for existence.
Currently, we are using inject to insert the nodes, and the problem is that it is slow.
After running the explain command, we figured out that the problem was the coalesce and the where steps - it takes more than 99.9% of the run duration.
I want to insert each node and edge only if it doesn’t exist, and that’s why I am using the coalesce and where steps.
For example, the query we use to insert nodes with inject:
properties_list = [{'uid':'1642'},{'uid':'1322'}…]
g.inject(properties_list).unfold().as_('node')
.sideEffect(__.V().where(P.eq('node')).by('uid').fold()
.coalesce(__.unfold(), __.addV(label).property(Cardinality.single,'uid','1')))
With 1000 nodes in the graph and properties_list with 100 elements, running the query above takes around 30 seconds, and it gets slower as the number of nodes in the graph increases.
Running a naive injection with the same environment as the query above, without coalesce and where, takes less than 1 second.
I'd like to hear your suggestions and to learn the best practices for inserting many nodes and edges (with checking for existence).
Thank you very much.
If you have a set of IDs that you want to check for existence, you can speed up the query significantly by also providing just a list of IDs to the query and calculating the intersection of the ones that exist up front. Then, having calculated the set that needs updates, you can apply them in one go. This will make a big difference. The reason you are running into problems is that the mid-traversal V() has a lot of work to do. In general it would be better to use actual vertex IDs rather than properties (uid in your case); if that is not an option, the same technique will work for property-based IDs. The steps are:
1. Using inject or sideEffect, insert the IDs to be found as one list and the corresponding map containing the changes to be conditionally applied as a separate map.
2. Find the intersection of the IDs that exist and those that do not.
3. Using the set of non-existing ones, apply the updates, using the values in that set to index into your map.
Here is a concrete example. I used the graph-notebook for this but you can do the same thing in code:
Given:
ids = "['1','2','9998','9999']"
and
data = "[['id':'1','value':'XYZ'],['id':'9998','value':'ABC'],['id':'9999','value':'DEF']]"
we can do something like this:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by()
which correctly finds the ones that do not already exist:
{'id': 9998, 'value': 'ABC'}
{'id': 9999, 'value': 'DEF'}
You can use this pattern to construct your conditional inserts a lot more efficiently (I hope :-) ). So to add the new vertices you might do:
g.V().hasId(${ids}).id().fold().as('exist').
constant(${data}).
unfold().as('d').
where(without('exist')).by('id').by().
addV('test').
property(id,select('d').select('id')).
property('value',select('d').select('value'))
v[9998]
v[9999]
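If you are driving this from Python with gremlinpython (as the as_/P.eq syntax in the question suggests), the same idea can also be applied client-side in two round trips. The following is only a sketch of the pattern, reusing the properties_list and label variables from the question; note that, unlike the single-traversal form above, it is not atomic under concurrent writers:
from gremlin_python.process.traversal import Cardinality, P

# Phase 1: one query to learn which uids already exist.
uids = [p['uid'] for p in properties_list]
existing = set(g.V().has('uid', P.within(*uids)).values('uid').to_list())

# Phase 2: add only the missing vertices, with no mid-traversal V().
missing = [p for p in properties_list if p['uid'] not in existing]
if missing:
    t = g
    for p in missing:
        # Chaining addV steps batches all the inserts into one traversal.
        t = t.add_v(label).property(Cardinality.single, 'uid', p['uid'])
    t.iterate()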
As a side note, we are adding two new steps to Gremlin, mergeV and mergeE, that will allow this to be done much more easily and in a more declarative style. Those new steps should be part of the TinkerPop 3.6 release.
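For illustration, a gremlinpython sketch of what such an upsert might look like once mergeV is available (syntax based on the proposed TinkerPop 3.6 API, so treat it as an assumption until you are on a 3.6 client and server; the 'test' label mirrors the example above):
from gremlin_python.process.traversal import Merge, T

# Match on the 'uid' property; create the vertex only if no match exists.
g.merge_v({'uid': '1642'}).option(
    Merge.on_create, {T.label: 'test', 'uid': '1642'}
).iterate()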

How does Weka calculate the output predictions in J48 and other classifiers?

I have used the output predictions of the J48 classifier in Weka and got results with predictions (probabilities). As I need to use these prediction numbers in my research, I need to know how Weka calculates them. What is the formula? Is it specific to each classifier?
In addition to Jan Eglinger's answer:
The J48 classifier is Weka's implementation of the well-known C4.5 decision tree classifier, a classification algorithm based on ID3 that classifies using information entropy.
The training data is a set S = {s_1, s_2, ...} of already-classified samples. Each sample s_i consists of a p-dimensional vector (x_{1,i}, x_{2,i}, ..., x_{p,i}), where the x_j represent attribute values or features of the sample, as well as the class in which s_i falls.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurs on the smaller sublists.
This algorithm has a few base cases:
- All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
- None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
- An instance of a previously-unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
You can find the information gain and entropy calculations in the Weka API packages; to see each step, you can start debugging the Weka Java API and work through the code.
In general, if you don't want to worry about the high-level mathematics of how the algorithm works internally, try calculating information gain and entropy yourself and explaining them in your research; apart from the decision tree classes, there are methods for computing both of these values.
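For reference, both quantities are easy to compute by hand. Here is a minimal Python sketch (this is not Weka's code, and the toy data at the end is made up) of entropy, information gain, and C4.5's normalized gain (gain ratio):
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    # Entropy reduction from partitioning the labels by an attribute's values.
    n = len(labels)
    subsets = {}
    for v, y in zip(attr_values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def gain_ratio(labels, attr_values):
    # C4.5's normalized information gain: gain divided by split information.
    split_info = entropy(attr_values)
    return information_gain(labels, attr_values) / split_info if split_info else 0.0

# Toy example: a perfectly informative attribute yields gain 1.0.
labels = ['yes', 'yes', 'no', 'no']
outlook = ['sunny', 'sunny', 'rain', 'rain']
print(information_gain(labels, outlook))  # 1.0
print(gain_ratio(labels, outlook))        # 1.0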
What is the formula?
Weka's J48 classifier is an implementation of the C4.5 algorithm.
I need to know how the weka calculates these numbers?
You can find implementation details in J48.java and in the weka.classifiers.trees.j48 package.

What are sufficient approaches for manipulating a list of thousands of individual words

I'm trying to design an application that manipulates a list of thousands of individual words, stored in a txt file, for the following tasks:
1- Randomly picking some words.
2- Checking whether some words entered by the user are actually in the list.
3- Retrieving the entire list from a txt file and storing it temporarily for subsequent manipulations.
I'm not asking for an implementation, nor for pseudocode. I'm looking for a suitable approach to deal with a massive list of words. For the time being, I might go with a vector of strings; however, searching thousands of words will take some time. Of course there must be strategies to cope with this kind of task, but since my background is not computer science, I don't know which direction to go in. Any suggestions are welcome.
A vector of strings is fine for this problem. Just sort them, and then you can use binary search to find a string in the list.
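As a minimal sketch of that idea in Python (the file name is hypothetical; the same approach carries over to a sorted std::vector<std::string> with std::binary_search in C++), covering the three tasks from the question:
import bisect
import random

# Task 3: load the list once and keep it sorted in memory.
with open('words.txt') as f:  # hypothetical file name
    words = sorted(line.strip() for line in f)

# Task 2: membership test via binary search, O(log n) per lookup.
def contains(word):
    i = bisect.bisect_left(words, word)
    return i < len(words) and words[i] == word

# Task 1: randomly pick k distinct words.
def random_words(k):
    return random.sample(words, k)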
Radix trees are a good solution for searching through word lists for matches. They reduce storage space, but you'll have to write some custom code for getting words into and out of the list, and the text file won't necessarily be easy to read unless you recreate the tree each time you load from a text file. Here's an implementation I committed to GitHub (I can't even remember the source material at this point) that might be of assistance to you.

Auto correct algorithm

I want to implement the following in C++:
1) Check whether a given word exists in a dictionary. The dictionary file is huge; consider 100 MB or 3-4 million words.
2) Suggest corrections for the incorrect word.
3) Autocomplete feature.
My Approach
1) I am planning to build a tree so that searching will be efficient.
2) I am not sure how to implement the auto-correction feature.
3) I can implement the autocomplete feature using trees.
What's the best data structure and algorithm to implement all the above features?
I have been working on the same problem. So far the best solution I have come across is using a ternary search tree for auto completion. Ternary Search Trees are more space efficient than tries.
If I'm unable to find the looked-up string in my ternary search tree, then I use an already-built BK tree to find the closest match. A BK tree internally uses Levenshtein distance.
Metaphones are also something you might want to explore; however, I haven't gone into metaphones in depth.
I have a Java solution for a BK tree + ternary search tree if you'd like.
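To make the BK tree idea concrete, here is a minimal Python sketch (not the Java solution mentioned above, and the word list is purely illustrative). It keys children by Levenshtein distance and prunes lookups with the triangle inequality:
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # node = (word, {distance: child})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word already present
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, word, max_dist):
        # Return (distance, word) pairs within max_dist edits of word.
        results, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = levenshtein(word, w)
            if d <= max_dist:
                results.append((d, w))
            # Triangle inequality: only edges labelled within
            # [d - max_dist, d + max_dist] can lead to matches.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(['hello', 'help', 'hell', 'shell', 'helmet'])
print(tree.search('helo', 1))  # [(1, 'hell'), (1, 'hello'), (1, 'help')]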
You can do autocomplete by looking at all the strings in a given subtree. Some score to help you pick might help. This works something like: if you have "te", you go down that path in the trie and then traverse the entire subtree there to get all the possible endings.
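A minimal Python sketch of that autocomplete pattern (the scoring idea is left out, and the word list is illustrative): walk down to the node for the prefix, then collect every word in the subtree below it:
class Trie:
    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def complete(self, prefix):
        # Walk down the prefix path, then traverse the whole subtree.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, prefix)]
        while stack:
            node, word = stack.pop()
            if node.is_word:
                results.append(word)
            for ch, child in node.children.items():
                stack.append((child, word + ch))
        return results

t = Trie()
for w in ['tea', 'ten', 'tent', 'test']:
    t.insert(w)
print(sorted(t.complete('te')))  # ['tea', 'ten', 'tent', 'test']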
For corrections you need to implement something like http://en.wikipedia.org/wiki/Levenshtein_distance over the tree. You can use the fact that once you have processed a given path in the trie, you can reuse the result for all the strings in the subtree rooted at the end of that path.
1) Aside from trees, another interesting method is BWT
http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
A BWT suffix array can easily be used to track words with a given prefix.
2) For error correction, a modern approach is LSH (locality-sensitive hashing):
http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
Some links turned up by a Google search:
https://cs.stackexchange.com/questions/2093/efficient-map-data-structure-supporting-approximate-lookup
https://code.google.com/p/likelike/
http://aspguy.wordpress.com/2012/02/18/the-magic-behind-the-google-search/

How to visualize generated RNA secondary structure

I'm working on a tool to visualize RNA secondary structure. For this purpose I have implemented Nussinov's algorithm, which generates the RNA secondary structure as a list with the corresponding pair indices; the code can be found here [0].
[0] http://dpaste.com/596262/
But I'm really stuck on understanding how I should visualize it (as a planar graph). The code above gives me a sequential list of the secondary structure, so can someone please suggest how I can visualize the structure? An example of such a tool can be found here [1].
[1] http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi
I know there are better algorithms, but for now I just want to visualize with this one; once I understand the visualization, I will move to a better algorithm.
Visualizing the secondary structure of RNA (or any graph, for that matter) algorithmically is a difficult problem. You need to take care that there are as few overlaps as possible while maintaining consistent link lengths. As the other answers have pointed out, there are a number of existing implementations that you can already use. I'll just throw in another one that's quite easy to use and requires no downloads:
forna - nibiru.tbi.univie.ac.at/forna
Here you just need to enter a dotbracket string:
>molecule_name
CGCUUCAUAUAAUCCUAAUGAUAUGGUUUGGGAGUUUCUACCAAGAGCCUUAAACUCUUGAUUAUGAAGUG
((((((((((..((((((.........))))))......).((((((.......))))))..)))))))))
This will give you a drawing of the structure, computed using a combination of the ViennaRNA RNAplot program and d3's force-directed graph algorithm.
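One practical gap worth noting: the Nussinov code in the question produces a list of pair indices, while forna expects a dot-bracket string. A minimal conversion sketch (assuming 0-based (i, j) pairs; Nussinov structures are non-crossing, so plain parentheses are enough):
def pairs_to_dotbracket(pairs, length):
    # Place '(' and ')' for each base pair in a dot-filled string.
    structure = ['.'] * length
    for i, j in pairs:
        structure[min(i, j)] = '('
        structure[max(i, j)] = ')'
    return ''.join(structure)

print(pairs_to_dotbracket([(0, 8), (1, 7), (3, 6)], 9))  # ((.(..)))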
You could do this with Jmol. Jmol allows you to add arbitrary bonds/atoms to a coordinate space using its Java API or, I believe, its JavaScript API as well.
In general, of course, PDB file formats would be used for such data.
RNAviz is old but still commonly used. JalView was apparently supposed to get RNA secondary structure rendering through a GSoC project last year, but I'm not sure what its current status is.