Extract Acc (Gene ID or accession number) from a fasta file - regex

What does ".gb\\|(.*)\\|.*", "\\1" in the function gsub mean?

If the file contains a single FASTA sequence, you can solve the problem by reading the first line of the file and splitting it on the pipe character |.
If it contains multiple sequences, check the first character of each line and look for the > character that marks a header.
Here is a code example in Python. If you need a different ID, change the index.
with open('AE004437.faa') as fh:
    header_line = fh.readline()
    ids = header_line.split('|')
    gene_ids = ids[3]
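For the multi-sequence case, a minimal sketch might look like the following. It assumes NCBI-style headers of the form >gi|...|gb|ACCESSION|description, where the fourth pipe-separated field holds the accession. The function name extract_ids and the sample headers are made up for illustration.

```python
def extract_ids(lines, index=3):
    """Collect one pipe-separated field from every FASTA header line."""
    ids = []
    for line in lines:
        # Header lines start with '>'; everything else is sequence data.
        if line.startswith('>'):
            fields = line.rstrip('\n').split('|')
            # Skip headers with too few fields instead of raising IndexError.
            if len(fields) > index:
                ids.append(fields[index])
    return ids


# Hypothetical sample data in the assumed header layout.
sample = [
    ">gi|0000001|gb|ABC00001.1| hypothetical protein one\n",
    "MKTAYIAKQR\n",
    ">gi|0000002|gb|ABC00002.1| hypothetical protein two\n",
    "MSDNELLQKA\n",
]
print(extract_ids(sample))
```

With a real file, the open file handle can be passed directly, for example extract_ids(fh) inside a with open('AE004437.faa') as fh: block, since iterating over the handle yields one line at a time.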

Related

Python - using raw_input() to search a text document

I am trying to write a simple script that lets a user enter what he/she wants to search for in a specified txt file. If the word they are searching for is found, it should be printed to a new text file. This is what I have so far.
import re
import os
os.chdir("C:\Python 2016 Training")
patterns = open("rtr.txt", "r")
what_directory_am_i_in = os.getcwd()
print what_directory_am_i_in
search = raw_input("What you looking for? ")
for line in patterns:
    re.findall("(.*)search(.*)", line)
    fo = open("test", "wb")
    fo.write(line)
    fo.close
This successfully creates a file called test, but the output is nothing close to what word was entered into the search variable.
Any advice appreciated.
First of all, note what this line gives you:
patterns = open("rtr.txt", "r")
This is a file object, not the content of the file. Iterating over it yields one line at a time; to read all lines into a list at once you can use
patterns.readlines()
Secondly, re.findall returns a list of matched strings, so you would want to store that. Your regex is also not correct, as pointed out by Hani. To use the text the user entered, it should be
matched = re.findall("(.*)" + search + "(.*)", line)
or, if you want the complete line,
matched = re.findall(".*" + search + ".*", line)
or simply
matched = line if search in line else None
Thirdly, you don't need to keep opening your output file inside the for loop. You are overwriting the file on every iteration, so it will capture only the last result. Also remember to actually call the close method on the files: fo.close without parentheses does nothing.
Hope this helps
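Putting these points together, a Python 3 sketch might look like this. It swaps the regex for a plain substring test, which is an assumption about what the asker wants. The function names are made up for illustration, and Python 3 uses input() where the question uses raw_input().

```python
def find_matching_lines(lines, search):
    """Return every line that contains the search text as a plain substring."""
    return [line for line in lines if search in line]


def search_file(in_path, out_path, search):
    """Write all lines of in_path that contain search to out_path."""
    with open(in_path, "r") as src:
        matched = find_matching_lines(src, search)
    # Open the output file once, outside any loop, so earlier matches
    # are not overwritten; the with-block also closes the file for us.
    with open(out_path, "w") as dst:
        dst.writelines(matched)
    return len(matched)
```

A call such as search_file("rtr.txt", "test", input("What are you looking for? ")) would then write every matching line to the file test and return how many lines matched.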
Here you are searching for all lines that contain the literal word "search".
You need the lines that contain the text you entered at the prompt,
so change this line
re.findall("(.*)search(.*)", line)
to
re.findall("(.*)"+search+"(.*)", line)

Python word search in a text file

I have a text file in which I need to search for three specific words using Python. For example, the words are account, online and offer, and I need the count of how many times each appears in the system.
with open('fixtures/file1.csv') as f:
    print len(filter(
        lambda line: "account" in line or "online" in line or "offer" in line,
        f.readlines()
    ))
You can also check directly whether the words are in each line.
Update
To count how many times each word appears in a file, the most effective way I have found is to iterate over the file once and check how many times each word is found in each line. For that, try the following:
keys = ('account', 'online', 'offer')
with open('fixtures/file1.csv') as f:
    found = dict((k, 0) for k in keys)
    for line in f.readlines():
        for k in keys:
            # count every occurrence in the line, not just one per line
            found[k] += line.count(k)
found will then be a dictionary with what you are looking for.
Hope this helps!
I am assuming it is a plain text document. In that case you would open('file.txt') as f, loop with for line in f, check if 'word' in line.lower(), and then increment a counter accordingly (say wordxtotal += 1).
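The counting idea from the answers above can be sketched in Python 3 as follows. Unlike a plain substring test, this version counts whole words only and ignores case, so "offers" does not count toward "offer"; that is an assumption about what is wanted. The function name count_words and the sample lines are invented for illustration.

```python
import re
from collections import Counter


def count_words(lines, keys):
    """Count whole-word, case-insensitive occurrences of each key."""
    wanted = set(k.lower() for k in keys)
    found = Counter({k: 0 for k in wanted})
    for line in lines:
        # \w+ pulls out runs of letters, digits and underscores.
        for word in re.findall(r"\w+", line.lower()):
            if word in wanted:
                found[word] += 1
    return dict(found)


# Hypothetical sample lines standing in for the file contents.
sample = ["Your Account is online.\n", "A special offer: online account offers\n"]
print(count_words(sample, ("account", "online", "offer")))
```

An open file handle can be passed in place of sample, since the function only iterates over its first argument once.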

Python :Replace repetitive line in a file with empty space but not on first / last occurrence

I have a file which has repetitive lines <this is repeated> that I would like to replace with empty space "". However, the first occurrence or last occurrence of the repeated line does not need to be replaced. I tried replace() before but this function will replace all the strings in the file. Is there any way to write it to get the expected result? Ps: It is a large text file
The file is as follows:
<this is repeated>
second line
another lines
third line
<this is repeated>
<this is repeated>
Note: I realized after posting that if the last occurrence is the very last line of the file, without a \n after it, this technique would leave it in place as well as the next-to-last occurrence.
First you would need to iterate over the file until you find the first occurrence:
file = <OPEN FILE>
rep_line = "<this is repeated>\n"
beginning = "" #record all data until found
while True: #broken when rep_line is found in file (or end of file is reached)
    line = file.readline()
    if not line:
        raise EOFError("reached end of file before finding first occurrence")
    beginning+=line
    if line == rep_line:
        break
rest = file.read() #you can read the rest after iterating over a few lines
Then you will have beginning, which contains everything up to and including the first occurrence, and rest, which contains everything after it.
So all you need to do with rest is to count how many times the line occurs and replace all but the last one:
reps = rest.count(rep_line)
new_text = beginning + rest.replace(rep_line,"",reps - 1)
# reps - 1: don't replace the last one
However, this direct approach will also pick up lines that merely end with the text (like "hello <this is repeated>", for example). This can be fixed by also checking that there is a \n right before the line:
reps = rest.count("\n"+rep_line)
new_text = beginning + rest.replace("\n"+rep_line,"\n",reps - 1)
# "\n": replace with a single newline
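As an alternative to the string-replace technique, a line-based sketch avoids both the missing trailing \n problem and partial matches such as "hello <this is repeated>". It is not the answer's method: it assumes the goal is to keep the first and last occurrence and drop the ones in between, and it holds all lines in memory (a very large file could instead be handled in two passes). The function name drop_middle_repeats is made up for illustration.

```python
def drop_middle_repeats(lines, rep_line="<this is repeated>"):
    """Remove every occurrence of rep_line except the first and the last."""
    lines = list(lines)  # allow any iterable, e.g. an open file handle
    # Compare without the line ending so a final line lacking '\n' still matches.
    hits = [i for i, line in enumerate(lines) if line.rstrip("\r\n") == rep_line]
    if len(hits) <= 2:
        return lines  # nothing sits between the first and last occurrence
    to_drop = set(hits[1:-1])
    return [line for i, line in enumerate(lines) if i not in to_drop]


# Hypothetical sample; the last line deliberately has no trailing newline.
sample = [
    "<this is repeated>\n",
    "second line\n",
    "<this is repeated>\n",
    "<this is repeated>\n",
    "third line\n",
    "<this is repeated>",
]
print("".join(drop_middle_repeats(sample)))
```

The result can be written back with writelines() on a file opened for writing.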

Python3.4 : Matching and returning list values

I have opened a text file in Python which has thousands of lines. I need to search each line to see if it contains 1 of many different specified values. I then need to return the specific value and the corresponding line that contains that value.
q1 = open('/home/lost/StockRec/StockIndex/edgar.full-index.2015.QTR1.master.idx', 'r')
list = ['1341234', '12341234', '4563456', '12341234', '6896786', '2727638']
for line in q1:
    for listValue in list:
        if listValue in line:
            print(listValue, line)
I know this code is wrong. I need to search each line in q1 for each of the specific values in the list. I need to then print the specific list value and the line containing that value.
Unless your file is already somehow separated into lines, it looks like you will have to split the file into lines first when you import it. Right now it is returning all of it because q1 is only one line.
Look for some identifying information in your file, such as newline characters ('\n'), or check whether each line starts with a specific character.
So once you open the file, you will include:
lines = q1.read().split('your identifying character here')
That will split the contents of your file into a list; then you can run the loops you have already written over lines instead of q1.
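In Python, iterating over an open text file normally yields one line at a time, so a sketch like the following may be all that is needed. It collects (value, line) pairs instead of printing them. The function name find_values and the sample lines are invented, and the plain substring test means a short value would also match inside a longer number.

```python
def find_values(lines, values):
    """Return (value, line) pairs for every value found in each line."""
    # dict.fromkeys drops duplicate values while keeping their order.
    unique_values = list(dict.fromkeys(values))
    hits = []
    for line in lines:
        for value in unique_values:
            if value in line:
                hits.append((value, line.rstrip("\n")))
    return hits


# Hypothetical pipe-delimited lines standing in for the index file.
sample = [
    "1341234|EXAMPLE CORP|10-K|2015-01-15|edgar/data/example-a.txt\n",
    "9999999|OTHER CORP|8-K|2015-02-01|edgar/data/example-b.txt\n",
    "4563456|THIRD CORP|10-Q|2015-03-10|edgar/data/example-c.txt\n",
]
for value, line in find_values(sample, ["1341234", "4563456", "4563456"]):
    print(value, line)
```

With the real file, the open handle q1 can be passed as the first argument in place of sample.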

parse text with Matlab

I have a text file (output from an old program) that I'd like to clean. Here's an example of the file contents.
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
|||||||||||||||||
|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08|3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
||||||||||||asasasas
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
*|comment
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
I'd like it to look like:
*|V|0|0|0|t|0|1|1|4|11|T4|H01||||||||||||||||||||||
*|comment
*|comment
P|40|0.01|10|1|1|0|40|1|1|1||1|*||0|0|0||||||||||||||||
*|A1|A1|A7|A16|F|F|F|F|F|F|F|||||||||||||||||||||||
*|||||kV|kV|kV|MW|MVAR|S|S||||||||||||||||||||||||
N|I|01|H01N01|H01N01|132|125.4|138.6|0|0|||||||||||||||||||||
N|I|01|H01N02|H01N02|20|19|21|0|0|||||||||||||||||||||||
N|I|01|H01N03|H01N03|20|19|21|0.42318823|0.204959433|||||||||||||||||||||
*|comment||||||||||||||||
*|comment|||||||||||||||||
L|I|H010203|H01N02|H01N03|1.884|360|0.41071|0.207886957||3.19E-08||3.19E-08|||||||||||
L|I|H010304|H01N03|H01N04|1.62|360|0.35316|0.1787563||3.19E-08||3.19E-08||||||||||||||
L|I|H010405|H01N04|H01N05|0.532|360|0.11598|0.058702686||3.19E-08||3.19E-08|||||||||||
L|I|H010506|H01N05|H01N06|1.284|360|0.27991|0.14168092||3.19E-08||3.19E-08||||||||||||
*|comment
*|comment
S|SH01|SEZIONE01|1|-3|+3|-100|+100|||||||||||||||||||
S|SH02|SEZIONE02|1|-3|+3|-100|+100|||||||||||||||||||
S|SH03|SEZIONE03|1|-3|+3|-100|+100|||||||||||||||||||
S|SH04|SEZIONE04|1|-3|+3|-100|+100|||||||||||||||||||
S|SH05|SEZIONE05|1|-3|+3|-100|+100|||||||||||||||||||
The data are divided into 'packages' distinguished by the first letter (P, N, L, S). Each package must have at least two dedicated lines starting with *|, which are then read as comments. Blank lines between different letters are filled with *|. Where the lines between different letters do not begin with *|, such lines must be added. Blank lines and 'random' characters between identical letters are removed.
Perhaps it is clearer in the example files.
How do I manipulate the text? Thank you in advance for the help.
Use fileread to get your file into MATLAB.
text = fileread('my file to clean.txt');
Split the resulting character string into lines by splitting on the newlines. (The newline characters depend on your operating system.)
lines = regexp(text, '\r\n', 'split');
It isn't entirely clear exactly how you want the file cleaned, but these things might get you started.
% Replace blank lines with comment string
blanks = cellfun(@isempty, lines);
comment = '*|comment';
lines(blanks) = cellstr(repmat(comment, sum(blanks), 1))
% Prepend comment string to lines that start with a pipe
lines = regexprep(lines, '^\|', '\*\|comment\|')
You'll need to know your way around regular expressions. There's a good guide to them at regular-expressions.info.