I'm working on a tool to visualize RNA secondary structure, for this purpose I have implemented Nussinov's algorithm which generates the RNA secondary structure as list with the corresponding indices, the code can be found here [0]
[0] http://dpaste.com/596262/
But I really stuck with understanding how I should visualize it (as a planar graph), the code above gives me a sequential list of the secondary structure, so can someone please suggest me as to how I can visualize the structure.An example of such tool can be found here [1]
[1] http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi
and I know there are better algorithms but for now I would just want to visualize with this and once I understand visualization, I will go for a better algorithm.
Visualizing the secondary structure of RNA (or any graph, for that matter) algorithmically is a difficult problem. You need to take care that there are as few overlaps as possible while maintaining consistent link lengths. As the other answers have pointed out, there are a number of existing implementations that you can already use. I'll just throw in another one that's quite easy to use and requires no downloads:
forna - nibiru.tbi.univie.ac.at/forna
Here you just need to enter a dotbracket string:
>molecule_name
CGCUUCAUAUAAUCCUAAUGAUAUGGUUUGGGAGUUUCUACCAAGAGCCUUAAACUCUUGAUUAUGAAGUG
((((((((((..((((((.........))))))......).((((((.......))))))..)))))))))
This will give you a visualization that looks something like this:
This is computed using a combination of the ViennaRNA RNAplot program and d3's force-directed graph algorithm.
You could do this with jmol . Jmol allows you to add arbitrary bonds / atoms to a coordinate space using its java or I believe its javascript api also.
In general, of course, PDB file formats would be used for such data.
RNAviz is old but still commonly used. JalView apparently was supposed to get RNA secondary structure rendering thru a GSoC project last year, but I'm not sure what the status in the program is.
Related
Is there a way to incorporate the uncertainties on my data set into the result of the Savitzky Golay fit? Since I am not passing this information into the function, I asume that it is simply calcuating the 'best fit' via an unweighted least-squares process. I am currently working with data that has non-uniform uncertainty, and so the fit of the data could be improved by including the errors that I have for my main dataset.
The wikipedia page for the Savitzky-Golay filter suggests how I might go about alter the process of calculating the coefficients of the fit, and I am staring at the code for scipy.signal.savgol_filter, but I cannot get my head around what I need to adjust so that this will do what I want it to.
Are there any ready-made weighted SG filters floating about? I find it hard to believe that no-one else has ever needed this tool in Python, but maybe I have missed something.
Check out this Python module: https://github.com/surhudm/savitzky_golay_with_errors
This python script improves upon the traditional Savitzky-Golay filter
by accounting for errors or covariance in the data. The inputs and
arguments are all modelled after scipy.signal.savgol_filter
Matlab function sgolayfilt supports weights. Check the documentation.
I am working user behavior project. Based on user interaction I have got some data. There is nice sequence which smoothly increases and decreases over the time. But there are little discrepancies, which are very bad. Please refer to graph below:
You can also find data here:
2.0789 2.09604 2.11472 2.13414 2.15609 2.17776 2.2021 2.22722 2.25019 2.27304 2.29724 2.31991 2.34285 2.36569 2.38682 2.40634 2.42068 2.43947 2.45099 2.46564 2.48385 2.49747 2.49031 2.51458 2.5149 2.52632 2.54689 2.56077 2.57821 2.57877 2.59104 2.57625 2.55987 2.5694 2.56244 2.56599 2.54696 2.52479 2.50345 2.48306 2.50934 2.4512 2.43586 2.40664 2.38721 2.3816 2.36415 2.33408 2.31225 2.28801 2.26583 2.24054 2.2135 2.19678 2.16366 2.13945 2.11102 2.08389 2.05533 2.02899 2.00373 1.9752 1.94862 1.91982 1.89125 1.86307 1.83539 1.80641 1.77946 1.75333 1.72765 1.70417 1.68106 1.65971 1.64032 1.62386 1.6034 1.5829 1.56022 1.54167 1.53141 1.52329 1.51128 1.52125 1.51127 1.50753 1.51494 1.51777 1.55563 1.56948 1.57866 1.60095 1.61939 1.64399 1.67643 1.70784 1.74259 1.7815 1.81939 1.84942 1.87731
1.89895 1.91676 1.92987
I would want to smooth out this sequence. The technique should be able to eliminate numbers with characteristic of X and Y, i.e. error in mono-increasing or mono-decreasing.
If not eliminate, technique should be able to shift them so that series is not affected by errors.
What I have tried and failed:
I tried to test difference between values. In some special cases it works, but for sequence as presented in this the distance between numbers is not such that I can cut out errors
I tried applying a counter, which is some X, then only change is accepted otherwise point is mapped to previous point only. Here I have great trouble deciding on value of X, because this is based on user-interaction, I am not really controller of it. If user interaction is such that its plot would be a zigzag pattern, I am ending up with 'no user movement data detected at all' situation.
Please share the techniques that you are aware of.
PS: Data made available in this example is a particular case. There is no typical pattern in which numbers are going to occure, but we expect some range to be continuous with all the examples. Solution I am seeking is generic.
I do not know how much effort you want to involve in this problem but if you want theoretical guaranties,
topological persistence seems well adapted to your problem imho.
Basically with that method, you can filtrate local maximum/minimum by fixing a scale
and there are theoritical proofs that says that if you sampling is
close from your function, then you extracts correct number of maximums with persistence.
You can see these slides (mainly pages 7-9 to get the idea) to get an idea of the method.
Basically, if you take your points as a landscape and imagine a watershed starting from maximum height and decreasing, you have some picks.
Every pick has a time where it is born which is the time where it becomes emerged and a time where it dies which is when it merges with an higher pick. Now a persistence diagram pictures a point for every pick where its x/y coordinates are its time of birth/death (by assumption the first pick does not die and is not shown).
If a pick is a global maximal, then it will be further from the diagonal in the persistence diagram than a local maximum pick. To remove local maximums you have to remove picks close to the diagonal. There are fours local maximums in your example as you can see with the persistence diagram of your data (thanks for providing the data btw) and two global ones (the first pick is not pictured in a persistence diagram):
If you noise your data like that :
You will still get a very decent persistence diagram that will allow you to filter local maximum as you want :
Please ask if you want more details or references.
Since you can not decide on a cut off frequency, and not even on the filter you want to use, I would implement several, and let the user set the parameters.
The first thing that I thought of is running average, and you can see that there are so many things to set, to get different outputs.
I now use apriori algorithm to do a data mining project,and I get result such as:item1 <=> iteam2、item2 <=> item3.......
I want use graph mining to generate a graph containing many nodes and illustrating relation between these node like this:
I heard some data ming software--weka,rapidminer;I also heard some graph library--igraph,networkx;I also heard--tableau.But I'm still confused,could someone give me a illustration about detailed procedure?
I recommend using Prefuse tool kit for your problem. Take a look here http://prefuse.org/gallery/ . This contains an example of the graph that you need.
Loosely speaking, Prefuse also has a browser version called D3.js . If you want to display your graph in browser then use D3.js
I have used Prefuse as well as D3.js when I needed a desktop graph and a graph in the browser.
If by multi-node graph you are referring to this definition http://dl.acm.org/citation.cfm?id=1292799, then I would say that you could use Gephi to visualize your graph. Gephi is a powerful tool for network visualization/analysis, since you can annotate the vertices, apply clustering algorithms etc.
In your case, since multi-node graphs have multiple states, you can either use some annotation/coloring to show the different states of the nodes/edges, or even visualize these different states by importing different timestamps/versions of the network in Gephi. You can then observe the differences among them. Even if your graph is not multi-node, I would recommend Gephi for visualizing it.
If item1 <=> iteam2、item2 <=> item3 .. is your current data format, you can transform it to a format that Gephi recognizes, like adjacency list or edge list.
I'm a little confused about online kmeans clustering. I know that it allows me to cluster with just one data at a time. But,is this all limited to one session? Suppose that I have a bunch of data clustered via this method and I get the clustered data result, would I be able to add more data to the cluster in the future?
I've also been looking for implementations of this code, and to no avail. Anyone know of any?
Update:
To clarify more. Here is how my code works right now:
Image is taken from live video feed, once enough pictures are saved, get kmeans of sift features.
Repeat step 1, a new batch of live feed pictures, get kmeans again. Combine the kmeans vectors with the previous kmeans like :[A B]
You can see that this is bad, because I quickly get too much clusters, and each batch of clusters will definitely have overlaps with another batch.
What I want:
Image taken from live video feed, once pics are saved, get kmeans
Repeat step 1, get kmeans again, which updates and adds new clusters to the previous cluster.
Nothing that I've seen could accommodate that, unless I'm just not understanding them correctly.
If you look at the original (!) publications, the method proposed by MacQueen - where the name k-means comes from - was in fact an online algorithm. I'm not sure if MacQueen did multiple passes over the data to improve the result. I believe he used a single pass, and objects would never be reassigned to a different cluster. If so, it was already an online algorithm!
Means are commonly computed as sum / count. This is not very sensible from a numerical point of view. E.g. in the classic Knuth book you can find a method for incrementally updating means. Wikipedia has it also.
Things get slightly more complicated once you actually want to reassign earlier points. But usually in a streaming context you do not know the previous points, so you cannot do that anyway.
I'm very new in image processing and my first assignment is to make a working program which can recognize faces and their names.
Until now, I successfully make a project to detect, crop the detected image, make it to sobel and translate it to array of float.
But, I'm very confused how to implement the Backpropagation MLP to learn the image so it can recognize the correct name for the detected face.
It's a great honor for all experts in stackoverflow to give me some examples how to implement the Image array to be learned with the backpropagation.
It is standard machine learning algorithm. You have a number of arrays of floats (instances in ML or observations in statistics terms) and corresponding names (labels, class tags), one per array. This is enough for use in most ML algorithms. Specifically in ANN, elements of your array (i.e. features) are inputs of the network and labels (names) are its outputs.
If you are looking for theoretical description of backpropagation, take a look at Stanford's ml-class lectures (ANN section). If you need ready implementation, read this question.
You haven't specified what are elements of your arrays. If you use just pixels of original image, this should work, but not very well. If you need production level system (though still with the use of ANN), try to extract more high level features (e.g. Haar-like features, that OpenCV uses itself).
Have you tried writing your feature vectors to an arff file and to feed them to weka, just to see if your approach might work at all?
Weka has a lot of classifiers integrated, including MLP.
As I understood so far, I suspect the features and the classifier you have chosen not to work.
To your original question: Have you made any attempts to implement a neural network on your own? If so, where you got stuck? Note, that this is not the place to request a complete working implementation from the audience.
To provide a general answer on a general question:
Usually you have nodes in an MLP. Specifically input nodes, output nodes, and hidden nodes. These nodes are strictly organized in layers. The input layer at the bottom, the output layer on the top, hidden layers in between. The nodes are connected in a simple feed-forward fashion (output connections are allowed to the next higher layer only).
Then you go and connect each of your float to a single input node and feed the feature vectors to your network. For your backpropagation you need to supply an error signal that you specify for the output nodes. So if you have n names to distinguish, you may use n output nodes (i.e. one for each name). Make them for example return 1 in case of a match and 0 else. You could very well use one output node and let it return n different values for the names. Probably it would even be best to use n completely different perceptrons, i.e. one for each name, to avoid some side-effects (catastrophic interference).
Note, that the output of each node is a number, not a name. Therefore you need to use some sort of thresholds, to get a number-name relation.
Also note, that you need a lot of training data to train a large network (i.e. to obey the curse of dimensionality). It would be interesting to know the size of your float array.
Indeed, for a complex decision you may need a larger number of hidden nodes or even hidden layers.
Further note, that you may need to do a lot of evaluation (i.e. cross validation) to find the optimal configuration (number of layers, number of nodes per layer), or to find even any working configuration.
Good luck, any way!