I am analyzing mouse RNA-seq data against mm10. I started with tophat2/bowtie, then ran cufflinks to generate FPKM values for genes and isoforms. I am using a GTF file that carries the gene biotype (i.e. whether it is a pseudogene, protein coding, snRNA, lincRNA, ...) in the second column as well as in the attribute field next to the gene names. Example lines from my GTF:
1 unprocessed_pseudogene exon 3054233 3054733 . + . exon_id "ENSMUSE00000848981"; exon_number "1"; gene_biotype "pseudogene"; gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; tag "mRNA_start_NF"; transcript_id "ENSMUST00000160944"; transcript_name "Gm16088-001"; transcript_source "havana"; tss_id "TSS82763";
1 unprocessed_pseudogene transcript 3054233 3054733 . + . gene_biotype "pseudogene"; gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; tag "mRNA_start_NF"; transcript_id "ENSMUST00000160944"; transcript_name "Gm16088-001"; transcript_source "havana"; tss_id "TSS82763";
1 snRNA exon 3102016 3102125 . + . exon_id "ENSMUSE00000522066"; exon_number "1"; gene_biotype "snRNA"; gene_id "ENSMUSG00000064842"; gene_name "Gm26206"; gene_source "ensembl"; transcript_id "ENSMUST00000082908"; transcript_name "Gm26206-201"; transcript_source "ensembl"; tss_id "TSS81070";
1 snRNA transcript 3102016 3102125 . + . gene_biotype "snRNA"; gene_id "ENSMUSG00000064842"; gene_name "Gm26206"; gene_source "ensembl"; transcript_id "ENSMUST00000082908"; transcript_name "Gm26206-201"; transcript_source "ensembl"; tss_id "TSS81070";
My cufflinks gene and isoform tracking output file looks like this:
tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
ENSMUSG00000090025 - - ENSMUSG00000090025 Gm16088 TSS82763 1:3054232-3054733 - - 0 0 0 OK
ENSMUSG00000064842 - - ENSMUSG00000064842 Gm26206 TSS81070 1:3102015-3102125 - - 0 0 0 OK
ENSMUSG00000025900 - - ENSMUSG00000025900 Rp1 TSS11475 1:4343506-4360314 - - 0 0 0 OK
ENSMUSG00000088333 - - ENSMUSG00000088333 Gm22848 TSS18078 1:3783875-3783933 - - 0 0 0 OK
ENSMUSG00000025902 - - ENSMUSG00000025902 Sox17 TSS56047,TSS74369 1:4490927-4496413 - - 0.611985 0.394887 0.829082 OK
ENSMUSG00000051951 - - ENSMUSG00000051951 Xkr4 TSS1201,TSS70682,TSS88403 1:3205900-3671498 - - 0 0 0 OK
As you can see, it lacks the second column of the GTF, which indicates the type of gene product. Is there any way to have cufflinks automatically incorporate this into its output files? There doesn't seem to be a simple option for it unless I am missing something. Please advise.
I tried several methods to get this to work directly while running cufflinks, but none worked (if anyone has a solution, let me know). However, I found a way around it. What I did is the following:
1- After running all your cufflinks commands, merge the data by the FPKM column into a single combined file (I have a script for this; there is a rough sketch of the idea at the end of this answer, and I'm happy to share the full version if anyone wants it).
2- Download the function table from the UCSC Table Browser.
3- Add that column to the merged output from step 1 using the commands below.
import pandas as pd  # `args` below comes from an argparse parser (not shown here)

Comb_Iso = pd.read_csv('Combined.Isoforms.txt', sep='\t')          # combined cufflinks output from step 1
fnc_library = pd.read_csv(args.library, sep='\t', usecols=[0, 1])  # only the first two columns of the function table
Comb_Iso_fnc = Comb_Iso.merge(fnc_library, on='TranscriptID', how='left')  # both tables need a 'TranscriptID' column
ncol_CIF = len(Comb_Iso_fnc.columns)
# move the newly added function column so it sits right after TranscriptID and GeneID
df_iso = Comb_Iso_fnc.iloc[:, [0, 1, ncol_CIF - 1] + list(range(2, ncol_CIF - 1))]
df_iso.to_csv(args.output_combined.strip() + '.Isoforms.txt', index=False, sep='\t')  # write the final table
Example of the output now is:
TranscriptID GeneID Function GeneName TSS-ID Locus-ID ExampleHeader1 ExampleHeader2 ExampleHeader3
ENSMUST00000115488 ENSMUSG00000088000 protein_coding Gm25493 TSS84820 1:4723276-4723379 9 9 9
ENSMUST00000027042 ENSMUSG00000096126 protein_coding Gm22307 TSS39629 1:4529016-4529123 8 8 8
ENSMUST00000140302 ENSMUSG00000025902 Test2 Sox17 TSS56047,TSS74369 1:4490927-4496413 5 5 5
ENSMUST00000115484 ENSMUSG00000051951 protein_coding Xkr4 TSS1201,TSS70682,TSS88403 1:3205900-3671498 6 6 6
ENSMUST00000135046 ENSMUSG00000089699 protein_coding Gm1992 TSS23899 1:3466586-3513553 7 -7 -7
ENSMUST00000132064 ENSMUSG00000025900 linc Rp1 TSS11475 1:4343506-4360314 3 -3 -3
ENSMUST00000072177 ENSMUSG00000090025 Gm16088 TSS82763 1:3054232-3054733 1 1 1
ENSMUST00000082125 ENSMUSG00000064842 protein_coding Gm26206 TSS81070 1:3102015-3102125 2 2 2
You can see the function in the 3rd column.
I would be glad to share any of my scripts for combining cufflinks output (they may not be the best; I am still optimizing them) if you want them.
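To give an idea of what step 1 involves, here is a minimal sketch of that kind of merge. It is an illustration rather than my exact script: the sample names and paths are made up, and it assumes one isoforms.fpkm_tracking file per sample.

import pandas as pd

# hypothetical sample names and paths; point these at your own cufflinks output
samples = {'ctrl': 'ctrl/isoforms.fpkm_tracking',
           'treated': 'treated/isoforms.fpkm_tracking'}

merged = None
for name, path in samples.items():
    df = pd.read_csv(path, sep='\t',
                     usecols=['tracking_id', 'gene_id', 'gene_short_name', 'tss_id', 'locus', 'FPKM'])
    df = df.rename(columns={'FPKM': 'FPKM_' + name})
    if merged is None:
        merged = df                                               # first sample keeps the annotation columns
    else:
        merged = merged.merge(df[['tracking_id', 'FPKM_' + name]],
                              on='tracking_id', how='outer')      # later samples add only their FPKM column

# rename to the headers used above (TranscriptID is what the step-3 merge joins on)
merged = merged.rename(columns={'tracking_id': 'TranscriptID', 'gene_id': 'GeneID',
                                'gene_short_name': 'GeneName', 'tss_id': 'TSS-ID', 'locus': 'Locus-ID'})
merged.to_csv('Combined.Isoforms.txt', sep='\t', index=False)

The same approach works for genes.fpkm_tracking; only the input file names change.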
I have a TreeModel with a database-like structure:
+Table1
-"Table1_key" -"name"
- 1 -"John"
- 2 -"Peter"
-...
+Table2
-"Table2_key" -"Table1_key" -"value1" -"value2"
- 1 - 1 - 1000 - 20000
- 2 - 1 - 3000 - 4000
- 3 - 2 - 1000 -2000
-...
+...
The data comes from an .xml file and is displayed in a TreeView, which works just fine.
However, I want to display some of the tables in different views and resolve the keys in those views.
For the example model, the required view could look like this:
+"John"
-"value1" -"value2"
- 1000 - 20000
- 3000 - 4000
+"Peter"
- 1000 -2000
I guess using a QAbstractProxyModel would be the proper way.
So my question is: how do I implement this?
I can't find any examples, and I have no idea how to map between the indexes of the source model and the proxy model in the mapToSource/mapFromSource methods.
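For what it's worth, if a live QAbstractProxyModel turns out to be more than is needed, a simpler alternative is to build a second, already-resolved model from the same parsed data and hand that to the extra view. A minimal PyQt5 sketch of that alternative (the function and field names are made up, and it assumes the XML has already been parsed into plain dicts):

from PyQt5.QtGui import QStandardItemModel, QStandardItem

def build_resolved_model(table1_rows, table2_rows):
    # table1_rows: dicts with 'Table1_key' and 'name'
    # table2_rows: dicts with 'Table1_key', 'value1' and 'value2'
    model = QStandardItemModel()
    model.setHorizontalHeaderLabels(['name / value1', 'value2'])
    parents = {}
    for row in table1_rows:
        item = QStandardItem(row['name'])                  # top-level node labelled with the resolved name
        model.appendRow(item)
        parents[row['Table1_key']] = item
    for row in table2_rows:
        parent = parents[row['Table1_key']]                # resolve the foreign key
        parent.appendRow([QStandardItem(str(row['value1'])),
                          QStandardItem(str(row['value2']))])
    return model

# usage: treeview.setModel(build_resolved_model(table1_rows, table2_rows))

The drawback compared to a real proxy model is that edits in the source model are not reflected automatically; the resolved model has to be rebuilt or kept in sync by hand.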
I want to create a small test data set with some specific values:
x
-
1
3
4
5
7
I can do this the hard way:
. set obs 5
. generate x = .
. replace x = 1 in 1
. replace x = 3 in 2
. replace x = 4 in 3
. replace x = 5 in 4
. replace x = 7 in 5
I can also use the data editor, but I'd like to create a .do file which can recreate this data set.
So how do I set the values of a variable from a list of numbers?
This can be done using a (to my mind) poorly documented feature of input:
clear
input x
1
3
4
5
7
end
I say poorly documented because the title of the input help page is
[D] Input -- Enter data from keyboard
which is clearly only a subset of what this command can do.
Here is another way
clear
mat x = (1,3,4,5,7)
set obs `=colsof(x)'
generate x = x[1, _n]
and another
clear
mata : x = (1,3,4,5,7)'
getmata x=x
I have to crawl Wikipedia to get the HTML pages of countries, which I have done successfully. Now, to build clusters, I have to run k-means. I am using Weka for that.
I used this code to convert my directory into ARFF format:
https://weka.wikispaces.com/file/view/TextDirectoryToArff.java
Here is its output:
[screenshot of the generated ARFF file]
Then I opened that file in Weka and applied the StringToWordVector filter (the parameters I used can be seen in the Relation line of the run information below).
Then I ran SimpleKMeans. The output I am getting is:
=== Run information ===
Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 5000 -S 10
Relation: text_files_in_files-weka.filters.unsupervised.attribute.StringToWordVector-R1,2-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"-weka.filters.unsupervised.attribute.StringToWordVector-R-W1000-prune-rate-1.0-C-T-I-N1-L-S-stemmerweka.core.stemmers.SnowballStemmer-M0-O-tokenizerweka.core.tokenizers.WordTokenizer -delimiters " \r\n\t.,;:\'\"()?!"
Instances: 28
Attributes: 1040
[list of attributes omitted]
Test mode:evaluate on training data
=== Model and evaluation on training set ===
kMeans
Number of iterations: 2
Within cluster sum of squared errors: 1915.0448503841326
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(28) (22) (6)
====================================================================================
...
bolsheviks 0.3652 0.3044 0.5878
book 0.3229 0.3051 0.3883
border 0.4329 0.5509 0
border-left-style 0.4329 0.5509 0
border-left-width 0.3375 0.4295 0
border-spacing 0.3124 0.3304 0.2461
border-width 0.5128 0.2785 1.372
boundary 0.309 0.3007 0.3392
brazil 0.381 0.3744 0.4048
british 0.4387 0.2232 1.2288
brown 0.2645 0.2945 0.1545
cache-control=max-age=87840 0.4913 0.4866 0.5083
california 0.5383 0.5085 0.6478
called 0.4853 0.6177 0
camp 0.4591 0.5451 0.1437
canada 0.3176 0.3358 0.251
canadian 0.2976 0.1691 0.7688
capable 0.2475 0.315 0
capita 0.388 0.1188 1.375
carbon 0.3889 0.445 0.1834
caribbean 0.4275 0.5441 0
carlsbad 0.548 0.5339 0.5998
caspian 0.4737 0.5345 0.2507
category 0.2216 0.2821 0
censorship 0.2225 0.0761 0.7596
center 0.4829 0.4074 0.7598
central 0.211 0.0805 0.6898
century 0.2645 0.2041 0.4862
chad 0.3636 0.0979 1.3382
challenger 0.5008 0.6374 0
championship 0.6834 0.8697 0
championships 0.2891 0.1171 0.9197
characteristics 0.237 0 1.1062
charon 0.5643 0.4745 0.8934
china
...
Time taken to build model (full training data) : 0.05 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 22 ( 79%)
1 6 ( 21%)
How do I check which DocId is in which cluster? I have searched a lot but didn't find anything.
Also, is there any other good Java library for k-means and agglomerative clustering?
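In the Weka Explorer, right-clicking the finished run in the result list should offer a "Visualize cluster assignments" view from which the per-instance assignments can be saved (I have not checked this against this exact version). If a non-Java library is acceptable, scikit-learn handles both k-means and agglomerative clustering and makes the document-to-cluster mapping explicit. A rough sketch, with doc_ids and texts as placeholders for the crawled pages:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, AgglomerativeClustering

doc_ids = ['Germany', 'France', 'Japan', 'Brazil']            # hypothetical document identifiers
texts = ['text of page 1', 'text of page 2', 'text of page 3', 'text of page 4']

X = TfidfVectorizer(stop_words='english').fit_transform(texts)

km = KMeans(n_clusters=2, random_state=10).fit(X)
for doc_id, label in zip(doc_ids, km.labels_):                # labels_ gives the cluster per document
    print(doc_id, '-> cluster', label)

agg = AgglomerativeClustering(n_clusters=2).fit(X.toarray())  # agglomerative clustering needs a dense array
print(dict(zip(doc_ids, agg.labels_)))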
Problem: I have a large number of scanned documents that are linked to the wrong records in a database. Each image has the correct ID on it somewhere that says where it belongs in the db.
For example, a DB row could be:
| user_id | img_id | img_loc |
| 1 | 1 | /img.jpg|
img.jpg would have the user_id (1) on the image somewhere.
Method/Solution: Loop through the database, pull the image text into a variable with OCR, and check whether the user_id is found anywhere in that text. If not, flag the record/image in a log; if so, do nothing and move on.
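Roughly, the loop looks like this (pytesseract stands in here for whatever OCR engine is actually used, and rows is a placeholder for the records pulled from the database):

from PIL import Image
import pytesseract

def flag_mismatches(rows, log_path='mismatches.log'):
    """rows: iterable of (user_id, img_loc) pairs pulled from the database."""
    with open(log_path, 'w') as log:
        for user_id, img_loc in rows:
            text = pytesseract.image_to_string(Image.open(img_loc))
            if str(user_id) not in text:
                log.write('%s\t%s\n' % (user_id, img_loc))    # flag the record for review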
My example is simple; in the real world I have a guarantee that a user_id wouldn't accidentally show up on the wrong form (it has a specific format with its own significance).
Right now it is working. However, it is incredibly strict. If you've worked with OCR you understand how fickle it can be: sometimes a 7 reads as a 1, or a 9 as a 7, etc. The result is a large number of false positives, especially among low-quality scans.
I've addressed some of the image quality issues with preprocessing on my side (increasing the image size, adjusting the black/white threshold) and had satisfying results. Now I'd like to add the ability for the program to recognize, for example, that "81*7*23103" is not very far from "81*9*23103".
The only way I know how to do that is to check strings at least as long as the one I'm looking for, calculate the distance between each pair of characters, take the average, and set a limit on what counts as a good average.
Some examples:
Ex 1
81723103 - Looking for this
81923103 - Found this
--------
00200000 - distances between characters
0 + 0 + 2 + 0 + 0 + 0 + 0 + 0 = 2
2/8 = .25 (pretty good match. 0 = perfect)
Ex 2
81723103 - Looking
81158988 - Found
--------
00635885 - distances
0 + 0 + 6 + 3 + 5 + 8 + 8 + 5 = 35
35/8 = 4.375 (Not a very good match. 9 = worst)
This way I can tell it "Flag the bottom 30% only" and dump anything with an average distance > 6.
I figure I'm reinventing the wheel and wanted to share this for feedback. I do see a big increase in run time from doing all these string operations compared to what I'm currently doing.
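For reference, the per-digit distance described above only takes a few lines; here is a sketch (the window scan over digit runs is an assumption about how candidate strings are located in the OCR text):

import re

def digit_distance(expected, found):
    """Average per-digit distance between two equal-length digit strings (0 = perfect, 9 = worst)."""
    return sum(abs(int(a) - int(b)) for a, b in zip(expected, found)) / len(expected)

def best_match(expected, ocr_text):
    """Slide a window of len(expected) over every digit run in the OCR text; return the best average."""
    best = None
    for run in re.findall(r'\d+', ocr_text):
        for i in range(len(run) - len(expected) + 1):
            d = digit_distance(expected, run[i:i + len(expected)])
            if best is None or d < best:
                best = d
    return best   # None if no digit run was long enough

print(digit_distance('81723103', '81923103'))   # 0.25, the first example above
print(digit_distance('81723103', '81158988'))   # 4.375, the second example

For the reinventing-the-wheel worry, a standard alternative is edit distance, e.g. difflib.SequenceMatcher from the standard library, which also tolerates dropped or inserted characters.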
I have downloaded the mp4 and srt files for the "Introduction to computer networks" course from Coursera, but there is a slight discrepancy between the names of the mp4 and srt files.
Here are some sample file names:
1 - 1 - 1-1 Goals and Motivation (1253).mp4
1 - 1 - 1-1 Goals and Motivation (12_53).srt
1 - 2 - 1-2 Uses of Networks (1316).mp4
1 - 2 - 1-2 Uses of Networks (13_16).srt
1 - 3 - 1-3 Network Components (1330).mp4
1 - 3 - 1-3 Network Components (13_30).srt
1 - 4 - 1-4 Sockets (1407).mp4
1 - 4 - 1-4 Sockets (14_07).srt
1 - 5 - 1-5 Traceroute (0736).mp4
1 - 5 - 1-5 Traceroute (07_36).srt
1 - 6 - 1-6 Protocol Layers (2225).mp4
1 - 6 - 1-6 Protocol Layers (22_25).srt
1 - 7 - 1-7 Reference Models (1409).mp4
1 - 7 - 1-7 Reference Models (14_09).srt
1 - 8 - 1-8 Internet History (1239).mp4
1 - 8 - 1-8 Internet History (12_39).srt
1 - 9 - 1-9 Lecture Outline (0407).mp4
1 - 9 - 1-9 Lecture Outline (04_07).srt
2 - 1 - 2-1 Physical Layer Overview (09_27).mp4
2 - 1 - 2-1 Physical Layer Overview (09_27).srt
2 - 2 - 2-2 Media (856).mp4
2 - 2 - 2-2 Media (8_56).srt
2 - 3 - 2-3 Signals (1758).mp4
2 - 3 - 2-3 Signals (17_58).srt
2 - 4 - 2-4 Modulation (1100).mp4
2 - 4 - 2-4 Modulation (11_00).srt
2 - 5 - 2-5 Limits (1243).mp4
2 - 5 - 2-5 Limits (12_43).srt
2 - 6 - 2-6 Link Layer Overview (0414).mp4
2 - 6 - 2-6 Link Layer Overview (04_14).srt
2 - 7 - 2-7 Framing (1126).mp4
2 - 7 - 2-7 Framing (11_26).srt
2 - 8 - 2-8 Error Overview (1745).mp4
2 - 8 - 2-8 Error Overview (17_45).srt
2 - 9 - 2-9 Error Detection (2317).mp4
2 - 9 - 2-9 Error Detection (23_17).srt
2 - 10 - 2-10 Error Correction (1928).mp4
2 - 10 - 2-10 Error Correction (19_28).srt
I want to rename the mp4 files to match the srt files so that VLC can automatically load the subtitles when I play the videos. Would someone discuss algorithms to do this? You can also provide solution code in any language, as I am familiar with many programming languages, but Python and C++ are preferable.
Edit:
Thanks to everyone who replied. I know it is easier to rename the srt files than the other way around. But I think it will be more interesting to rename the mp4 files. Any suggestions?
Here's a quick solution in python.
The job is simple if you make the following assumptions:
all files are in the same folder
you have the same number of srt and mp4 files in the directory
all srt are ordered alphabetically, all mp4 are ordered alphabetically
Note I do not assume anything about the actual names (e.g. that you only need to remove underscores).
So you don't need any special logic for matching the files, just go one-by-one.
import os, sys, re
from glob import glob

def mv(src, dest):
    print 'mv "%s" "%s"' % (src, dest)
    #os.rename(src, dest) # uncomment this to actually rename the files

dir = sys.argv[1]
vid_files = sorted(glob(os.path.join(dir, '*.mp4')))
sub_files = sorted(glob(os.path.join(dir, '*.srt')))
assert len(sub_files) == len(vid_files), "lists of different lengths"

for vidf, subf in zip(vid_files, sub_files):
    new_vidf = re.sub(r'\.srt$', '.mp4', subf)
    if vidf == new_vidf:
        print '%s OK' % (vidf,)
        continue
    mv(vidf, new_vidf)
Again, this is just a quick script. Suggested improvements:
support different file extensions
use a better cli, e.g. argparse
support taking multiple directories
support test-mode (don't actually rename the files)
better error reporting (instead of using assert)
more advanced: support undoing
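As a rough illustration of a few of those points (an argparse CLI, multiple directories, a test mode, and reporting instead of assert), under the same sorting assumptions; the --apply flag name is made up:

import argparse, os, re
from glob import glob

def main():
    parser = argparse.ArgumentParser(description='Rename .mp4 files to match .srt names')
    parser.add_argument('directories', nargs='+', help='one or more folders to process')
    parser.add_argument('--apply', action='store_true', help='actually rename (default: dry run)')
    args = parser.parse_args()

    for d in args.directories:
        vids = sorted(glob(os.path.join(d, '*.mp4')))
        subs = sorted(glob(os.path.join(d, '*.srt')))
        if len(vids) != len(subs):
            print('skipping %s: %d mp4 vs %d srt files' % (d, len(vids), len(subs)))
            continue
        for vidf, subf in zip(vids, subs):
            new_vidf = re.sub(r'\.srt$', '.mp4', subf)
            if vidf != new_vidf:
                print('mv "%s" "%s"' % (vidf, new_vidf))
                if args.apply:
                    os.rename(vidf, new_vidf)

if __name__ == '__main__':
    main()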
If all those files really follow this scheme the Python implementation is almost trivial:
import glob, os

for subfile in glob.glob("*.srt"):
    os.rename(subfile, subfile.replace("_", ""))
If your mp4 files also contain underscores, you will want to add a second loop for them.
for f in *.srt; do mv "$f" "${f//_/}"; done
This is just an implementation of what Zeta said :)
import os

for filename in os.listdir('.'):
    extension = os.path.splitext(filename)[1][1:]
    if extension == 'srt':
        newName = filename.replace('_', '')
        print 'Changing\t' + filename + ' to\t' + newName
        os.rename(filename, newName)
print 'Done!'