How could I use a graph mining method to get a multi-node graph? - data-mining

I am currently using the Apriori algorithm for a data mining project, and I get results such as: item1 <=> item2, item2 <=> item3, ...
I want to use graph mining to generate a graph that contains many nodes and illustrates the relations between them.
I have heard of data mining software (Weka, RapidMiner), graph libraries (igraph, NetworkX), and Tableau, but I am still confused. Could someone walk me through the detailed procedure?

I recommend using the Prefuse toolkit for your problem. Take a look at http://prefuse.org/gallery/ , which contains an example of the kind of graph you need.
Loosely speaking, D3.js is the browser counterpart of Prefuse. If you want to display your graph in a browser, use D3.js.
I have used Prefuse when I needed a graph on the desktop and D3.js when I needed one in the browser.

If by multi-node graph you are referring to this definition http://dl.acm.org/citation.cfm?id=1292799, then I would say that you could use Gephi to visualize your graph. Gephi is a powerful tool for network visualization/analysis, since you can annotate the vertices, apply clustering algorithms etc.
In your case, since multi-node graphs have multiple states, you can either use some annotation/coloring to show the different states of the nodes/edges, or even visualize these different states by importing different timestamps/versions of the network in Gephi. You can then observe the differences among them. Even if your graph is not multi-node, I would recommend Gephi for visualizing it.
If item1 <=> item2, item2 <=> item3, ... is your current data format, you can transform it into a format that Gephi recognizes, such as an adjacency list or an edge list.
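As a rough illustration of that conversion, here is a minimal Python sketch that turns rule strings of the form "a <=> b" into a CSV edge list. The rule strings and the function names are made up for the example, and the Source/Target/Type column headers are what Gephi's spreadsheet import expects to the best of my knowledge, so check them against your Gephi version.

```python
import csv
import io

# Hypothetical rule strings in the "itemA <=> itemB" form shown in the question.
RULES = ["item1 <=> item2", "item2 <=> item3", "item3 <=> item1"]

def rules_to_edges(rules):
    """Parse 'a <=> b' strings into a sorted list of unique undirected edges."""
    edges = set()
    for rule in rules:
        left, right = (part.strip() for part in rule.split("<=>", 1))
        if left != right:
            # Sort each pair so that a <=> b and b <=> a collapse to one edge.
            edges.add(tuple(sorted((left, right))))
    return sorted(edges)

def edges_to_csv(edges):
    """Render the edges as CSV text with Source, Target and Type columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type"])
    for source, target in edges:
        writer.writerow([source, target, "Undirected"])
    return buffer.getvalue()

if __name__ == "__main__":
    # Write this text to a .csv file and import it as an edge table.
    print(edges_to_csv(rules_to_edges(RULES)))
```

The same list of pairs can also be handed to a graph library such as NetworkX or igraph if you prefer to lay out the graph in code rather than in Gephi.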

Related

Clustering Using MapReduce

I have unstructured Twitter data that was retrieved with Apache Flume and stored in HDFS. Now I want to convert this unstructured data into structured data using MapReduce.
Tasks I want to do with MapReduce:
1. Convert the unstructured data into structured data.
2. Keep only the text part that contains the tweet itself.
3. Identify the tweets about a particular topic and group them by subtopic.
E.g. I have tweets about Samsung handsets, so I want to group them by handset: one group for the Samsung Note 4, one for the Samsung Galaxy, etc.
This is my college project, and my guide suggested that I use the k-means algorithm. I have searched a lot on k-means but failed to understand how to identify the centroids; basically, I do not understand how to apply k-means to this situation in MapReduce.
Please guide me if I am doing something wrong, as I am new to this concept.
K-means is a clustering algorithm. It groups similar data points into clusters and computes a centroid (the mean of the cluster's members) for each cluster. You can create time series for the questions you mention above, and group the tweets according to topic.
A k-means implementation in MapReduce:
https://github.com/himank/K-Means
For using k-means on Twitter datasets, you can check the following links:
https://github.com/JulianHill/R-Tutorials/blob/master/r_twitter_cluster.r
http://www.r-bloggers.com/cluster-your-twitter-data-with-r-and-k-means/
http://rstudio-pubs-static.s3.amazonaws.com/5983_af66eca6775f4528a72b8e243a6ecf2d.html
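To make the centroid question more concrete, here is a minimal single-machine sketch of how one k-means iteration splits into a map step and a reduce step. It is plain Python rather than Hadoop code, the function names are invented for the example, and it assumes each tweet has already been turned into a numeric vector (for instance word counts), which is a separate preprocessing step.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))

def map_phase(points, centroids):
    """Mapper: emit (cluster_index, point) for every input point."""
    for point in points:
        yield nearest(point, centroids), point

def reduce_phase(pairs, centroids):
    """Reducer: average the points of each cluster to get the new centroids."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    new_centroids = []
    for index, old in enumerate(centroids):
        members = groups.get(index)
        if not members:
            new_centroids.append(old)  # an empty cluster keeps its old centroid
            continue
        new_centroids.append(
            tuple(sum(axis) / len(members) for axis in zip(*members))
        )
    return new_centroids

def kmeans(points, centroids, max_iter=20):
    """Repeat map + reduce until the centroids stop changing."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

The initial centroids are simply guesses, commonly k points picked at random from the data; each MapReduce job then corresponds to one pass of the loop, with the current centroids shipped to every mapper.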

SGDClassifier with HashingVectorizer and TfidfTransformer

I would like to understand if it is possible to train an online SGDClassifier (with partial_fit) using HashingVectorizer and TfidfTransformer. Simply joining them in a Pipeline will not work as TfidfTransformer is stateful so that would break the online learning process. This post says it's not possible to use tf-idf in an online fashion but a comment on this post suggests that it may somehow be possible: "In particular if you use stateful transformers as TfidfTransformer you will need to do several passes on your data". Is that possible without loading the whole training set into memory? If so, how? If not, is there an alternative solution to combine HashingVectorizer with tf-idf on large datasets?
Is that possible without loading the whole training set into memory?
No. TfidfTransformer needs to have the entire X matrix in memory. You'll need to roll your own tf-idf estimator, use that to compute per-term document frequencies in one pass over the data, then do another pass to produce tf-idf features and fit a classifier to them.
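The two-pass idea can be sketched in plain Python as follows. This is not scikit-learn code: the hashing uses `zlib.crc32` as a stand-in for what HashingVectorizer does, `N_FEATURES` and the function names are assumptions made for the example, and the idf formula is the smoothed variant `ln((1 + n) / (1 + df)) + 1`. The vectors are left unnormalized.

```python
import math
import re
import zlib
from collections import Counter

N_FEATURES = 2 ** 10  # hypothetical size of the hash space

def hashed_counts(text):
    """Map a document to {bucket: term count} using the hashing trick."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(
        zlib.crc32(token.encode("utf-8")) % N_FEATURES for token in tokens
    )

def document_frequencies(docs):
    """Pass 1: count, for each bucket, how many documents contain it."""
    df = Counter()
    n_docs = 0
    for text in docs:
        df.update(hashed_counts(text).keys())
        n_docs += 1
    return df, n_docs

def tfidf_stream(docs, df, n_docs):
    """Pass 2: yield one sparse tf-idf vector (a dict) per document."""
    for text in docs:
        counts = hashed_counts(text)
        yield {
            bucket: tf * (math.log((1 + n_docs) / (1 + df[bucket])) + 1.0)
            for bucket, tf in counts.items()
        }
```

Only the document-frequency table (at most `N_FEATURES` integers) has to stay in memory between the passes. The second pass can be consumed in mini-batches, each converted to a sparse matrix and handed to `partial_fit`; `docs` must be re-iterable, for example a function that reopens the file.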

Face Recognition Using Backpropagation Neural Network?

I'm very new to image processing, and my first assignment is to make a working program that can recognize faces and match them to names.
So far, I have built a project that detects a face, crops the detected image, applies a Sobel filter, and converts the result to an array of floats.
But I'm very confused about how to implement a backpropagation MLP that learns from the images so it can recognize the correct name for a detected face.
I would be grateful if the experts on Stack Overflow could give me some examples of how to train a backpropagation network on the image array.
This is a standard machine learning task. You have a number of arrays of floats (instances in ML terms, or observations in statistics terms) and corresponding names (labels, class tags), one per array. This is enough for most ML algorithms. Specifically, in an ANN the elements of your array (i.e. the features) are the inputs of the network and the labels (names) are its outputs.
If you are looking for a theoretical description of backpropagation, take a look at Stanford's ml-class lectures (ANN section). If you need a ready implementation, read this question.
You haven't specified what the elements of your arrays are. If you use just the pixels of the original image, this should work, but not very well. If you need a production-level system (though still using an ANN), try to extract higher-level features (e.g. Haar-like features, which OpenCV uses itself).
Have you tried writing your feature vectors to an ARFF file and feeding them to Weka, just to see whether your approach might work at all?
Weka has a lot of classifiers integrated, including MLP.
From what I understand so far, I suspect that the features and the classifier you have chosen will not work.
To your original question: have you made any attempt to implement a neural network on your own? If so, where did you get stuck? Note that this is not the place to request a complete working implementation from the audience.
To provide a general answer on a general question:
Usually you have nodes in an MLP. Specifically input nodes, output nodes, and hidden nodes. These nodes are strictly organized in layers. The input layer at the bottom, the output layer on the top, hidden layers in between. The nodes are connected in a simple feed-forward fashion (output connections are allowed to the next higher layer only).
Then you connect each of your floats to a single input node and feed the feature vectors to your network. For backpropagation you need to supply an error signal, which you specify at the output nodes. So if you have n names to distinguish, you may use n output nodes (i.e. one for each name). Make them, for example, return 1 in case of a match and 0 otherwise. You could also use a single output node and let it return n different values for the names. It would probably be even better to use n completely separate perceptrons, i.e. one for each name, to avoid some side effects (catastrophic interference).
Note that the output of each node is a number, not a name. Therefore you need some sort of threshold to map numbers to names.
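The n-output-node scheme and the threshold step can be sketched like this. The names, the 0.5 threshold, and the function names are placeholders chosen for the example; the training of the network itself is not shown.

```python
NAMES = ["alice", "bob", "carol"]  # hypothetical labels, one output node each

def one_hot(name, names=NAMES):
    """Target vector for training: 1.0 at the node for name, 0.0 elsewhere."""
    return [1.0 if candidate == name else 0.0 for candidate in names]

def decode(outputs, names=NAMES, threshold=0.5):
    """Map the network's numeric outputs back to a name.

    Picks the strongest output node and returns None when even that
    node stays below the threshold (i.e. an unknown face).
    """
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return names[best] if outputs[best] >= threshold else None
```

During training, the error signal at the output layer is the difference between the network's output vector and `one_hot(name)` for the face being shown.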
Also note that you need a lot of training data to train a large network (i.e. to cope with the curse of dimensionality). It would be interesting to know the size of your float array.
Indeed, for a complex decision you may need a larger number of hidden nodes or even more hidden layers.
Further note that you may need to do a lot of evaluation (i.e. cross-validation) to find the optimal configuration (number of layers, number of nodes per layer), or even to find any working configuration at all.
Good luck, anyway!

How to visualize generated RNA secondary structure

I'm working on a tool to visualize RNA secondary structure. For this purpose I have implemented Nussinov's algorithm, which generates the RNA secondary structure as a list of the corresponding indices. The code can be found here [0]
[0] http://dpaste.com/596262/
But I am really stuck on how I should visualize it (as a planar graph). The code above gives me a sequential list for the secondary structure, so can someone please suggest how I can visualize the structure? An example of such a tool can be found here [1]
[1] http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi
I know there are better algorithms, but for now I just want to visualize the output of this one. Once I understand the visualization, I will move on to a better algorithm.
Visualizing the secondary structure of RNA (or any graph, for that matter) algorithmically is a difficult problem. You need to take care that there are as few overlaps as possible while maintaining consistent link lengths. As the other answers have pointed out, there are a number of existing implementations that you can already use. I'll just throw in another one that's quite easy to use and requires no downloads:
forna - nibiru.tbi.univie.ac.at/forna
Here you just need to enter a dotbracket string:
>molecule_name
CGCUUCAUAUAAUCCUAAUGAUAUGGUUUGGGAGUUUCUACCAAGAGCCUUAAACUCUUGAUUAUGAAGUG
((((((((((..((((((.........))))))......).((((((.......))))))..)))))))))
This will give you a visualization that looks something like this:
This is computed using a combination of the ViennaRNA RNAplot program and d3's force-directed graph algorithm.
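If your Nussinov implementation returns a list of paired indices, a small helper can turn it into the dot-bracket string that tools like this expect. The sketch below assumes 0-based indices and a nested structure without pseudoknots (which is what Nussinov's algorithm produces); the function names are made up for the example.

```python
def pairs_to_dotbracket(length, pairs):
    """Convert base pairs (i, j) into a dot-bracket string of given length."""
    structure = ["."] * length
    for i, j in pairs:
        if i > j:
            i, j = j, i
        structure[i] = "("  # the 5' partner opens the pair
        structure[j] = ")"  # the 3' partner closes it
    return "".join(structure)

def dotbracket_to_pairs(structure):
    """Inverse direction: recover the sorted list of base pairs with a stack."""
    stack, pairs = [], []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % index)
            pairs.append((stack.pop(), index))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return sorted(pairs)
```

For example, `pairs_to_dotbracket(9, [(0, 8), (1, 7), (2, 6)])` gives `(((...)))`, a three-pair stem closing a three-base loop.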
You could do this with Jmol. Jmol allows you to add arbitrary bonds and atoms to a coordinate space using its Java API or, I believe, its JavaScript API as well.
In general, of course, PDB file formats would be used for such data.
RNAviz is old but still commonly used. JalView apparently was supposed to get RNA secondary structure rendering through a GSoC project last year, but I'm not sure what the status in the program is.

CERN ROOT Extract Data from TNtuple

I am using CERN's ROOT framework (required), and I want to take data from a TNtuple and graph it. I can either graph the data when I create the TNtuple, or after I write it to a .root file. Some of the support documentation suggested that I create a TTree, but that seemed like it might be overkill/roundabout since I wouldn't be using it for anything else (and the TNtuple fulfills all of my other requirements). Does anyone have a better suggestion for how to extract data from the TNtuple and graph it?
As TNtuple inherits from TTree, you can use all the methods presented in the support documentation for TTrees directly on the TNtuple.
This especially means that you can use TTree::Draw() which is typically more than sufficient for quickly graphing the data. This function is documented here.
For more elaborate plots you will have to read the data from the TNtuple event by event and feed it to your favorite graphing tool in ROOT. This again follows the basic principles from a tree. The best example I could find on the ROOT homepage is in the user manual, section trees in the paragraph "Reading the Tree".
The methods used to create histograms and plots for TNtuples are essentially the same as for TTrees. The code:
ntuple->Draw("var");
will create a histogram of the variable var stored in the ntuple. If you want to plot one variable in the ntuple as a function of another, use the form "y:x": the first variable goes on the y-axis and the second on the x-axis.
ntuple->Draw("yVar:xVar");
You can do fancier things such as creating plots only when a logical condition is satisfied. For example, suppose you want a histogram of var only when var2 is greater than 2 and var3 is less than 0.
ntuple->Draw("var","var2 > 2 && var3 < 0");
By plotting in this way, ROOT automatically sets the binning and range for the x-axis. If you wish to control these features yourself, use
ntuple->Draw("var >> hist(Nbins,xmin,xmax)");
This creates the object hist, which you treat as a usual histogram object in ROOT. As stated in the previous post, this is documented in the ROOT manual along with several other features and tools. Unfortunately, the manual doesn't always give clear explanations.
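{
// Fill the selection without drawing ("goff"); V1 then holds py and V2 holds px.
ntuple->Draw("py:px","px>py","goff");
// Build a graph of py (y-axis) versus px (x-axis) from the selected rows.
TGraph *gr = new TGraph(ntuple->GetSelectedRows(),ntuple->GetV2(), ntuple->GetV1());
gr->Draw("AP");
}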