How to read a list of gene pairs and write a fasta file for each line of the list

I'm new to bioinformatics and would really appreciate some help!
I have a big multi-fasta file (genes.faa), like this:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
(...)
And a list of gene pairs (gene.pairs.txt), with two genes per line separated by a tab:
gene13_A \t gene33_B
gene2_A \t gene48_B
gene56_A \t gene2_B
I need a way to read the list of gene pairs and create a fasta file for each line of the list. So, in this case, I would have 3 fasta files (the names of the output fasta files are not important), like this:
fasta1
>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
fasta2
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
fasta3
>gene56_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
I tried to write a script in Python, but I couldn't find a way to read the list in a loop and write a fasta file for each line.
Thank you so much in advance for any help!

This code works. I have tested it on the following files:
Input FASTA file:
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA1
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA2
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA3
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPA4
Gene List file:
gene1_A gene3_B
gene2_A gene4_B
Code:
from Bio import SeqIO

# Read the gene pairs into a list of (gene_id1, gene_id2) tuples
my_list = []
with open('genelist.txt', 'r') as my_gene_file:
    for line in my_gene_file:
        line = line.strip()
        gene_id1, gene_id2 = line.split()  # split on any whitespace (tab or space)
        my_list.append((gene_id1, gene_id2))

# Map every record id in the fasta file to its sequence
my_dict = {}
with open('input.fasta', 'r') as my_fasta_file:
    for seq_record in SeqIO.parse(my_fasta_file, "fasta"):
        my_dict[seq_record.id] = seq_record.seq

# Write one two-record fasta file per pair found in the dictionary
for item in my_list:
    if item[0] in my_dict and item[1] in my_dict:
        output_file_name = f'{item[0]}_{item[1]}.fasta'
        with open(output_file_name, "w") as f:
            f.write(f'>{item[0]}\n{my_dict[item[0]]}\n>{item[1]}\n{my_dict[item[1]]}\n')
It outputs files with fasta record ids in the filenames.

My attempt. I'm sure there are faster, better ways to accomplish it; it makes me wonder if there could be a way to skip the creation of the big dictionary: sequences = { i.id : i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}.
I am using the Biopython library to parse the fasta file and to write the output (https://biopython.org/, https://github.com/biopython/biopython); anyway:
input 'big_fasta_2.fa' :
>gene1_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene2_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene3_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene4_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAC
>gene13_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAA
>gene33_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAY
>gene48_B
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAW
>gene56_A
MCTGTRNKIIRTCDNCRKRKIKCDRKRPAP
input "gene_pairs_3.txt":
gene13_A gene33_B
gene1344_A gene33_B
gene2_A gene48_B
gene23333_A gene48_B
gene56_A gene2_B
code :
from Bio import SeqIO, __version__

print('Biopython version : ', __version__)

# Index every record in the big fasta by its id
sequences = {i.id: i for i in SeqIO.parse('big_fasta_2.fa', 'fasta')}
print(sequences)

cnt = 1
with open("gene_pairs_3.txt", "r") as file:
    for line in file:
        a, b = line.split()
        print('++++++++++++')
        print('pairs N° : ', cnt)
        print(a)
        print(b)
        if a in sequences:
            print('ok A')
            print(sequences[a])
            if b in sequences:
                print('ok B')
                print(sequences[b])
                # Write the two records of this pair to FastaN.fa
                SeqIO.write([sequences[a], sequences[b]], 'Fasta' + str(cnt) + '.fa', 'fasta')
                print('\nwritten file : ', 'Fasta' + str(cnt) + '.fa')
                cnt += 1
            else:
                print('No B')
        else:
            print('No A')
        print('-----------\n')
Have a look at the output files and see if they are what you expected.
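On skipping the big dictionary: Biopython offers SeqIO.index(), which returns a lazy, dict-like index instead of loading every record into memory. The same idea can be sketched in plain Python by storing only byte offsets and reading sequences on demand (function names are illustrative):

```python
def index_fasta(path):
    # Map each record id to the byte offset of its '>' header line.
    # Only the offsets are held in memory, not the sequences.
    offsets = {}
    with open(path) as fh:
        pos = fh.tell()
        line = fh.readline()
        while line:
            if line.startswith('>'):
                offsets[line[1:].split()[0]] = pos
            pos = fh.tell()
            line = fh.readline()
    return offsets

def fetch_seq(path, offsets, rec_id):
    # Seek straight to one record and read only its sequence lines.
    with open(path) as fh:
        fh.seek(offsets[rec_id])
        fh.readline()  # skip the header line itself
        chunks = []
        for line in fh:
            if line.startswith('>'):
                break
            chunks.append(line.strip())
    return ''.join(chunks)
```

Lookups then cost one seek per record, so the memory footprint stays small no matter how big the fasta file is.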

Related

rstrip, split and sort a list from input text file

I am new to Python. I am trying to rstrip spaces, split, and append the list into words, and then sort in alphabetical order. I don't know what I am doing wrong.
fname = input("Enter file name: ")
fh = open(fname)
lst = list(fh)
for line in lst:
    line = line.rstrip()
    y = line.split()
    i = lst.append()
    k = y.sort()
print y
I have since been able to fix my code and get the expected output. This is what I was hoping to code:
name = input('Enter file: ')
handle = open(name, 'r')
wordlist = list()
for line in handle:
    words = line.split()
    for word in words:
        if word in wordlist: continue
        wordlist.append(word)
wordlist.sort()
print(wordlist)
If you are using Python 2.7, I believe you need to use raw_input(); in Python 3.x it is correct to use input(). Also, you are not using append() correctly: append() is a list method and takes the item to add as an argument.
fname = raw_input("Enter filename: ") # Stores the filename given by the user input
fh = open(fname,"r") # Here we are adding 'r' as the file is opened as read mode
lines = fh.readlines() # This will create a list of the lines from the file
# Sort the lines alphabetically
lines.sort()
# Rstrip each line of the lines list
y = [l.rstrip() for l in lines]
# Print out the result
print y
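For reference, in Python 3 the whole job can be condensed into one function; a minimal sketch (the function name is illustrative):

```python
def unique_sorted_words(path):
    # Read the file, split into words, drop duplicates, sort alphabetically
    with open(path) as fh:
        return sorted(set(fh.read().split()))
```

split() with no argument already strips surrounding whitespace and newlines, so no explicit rstrip() is needed.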

How to merge two .txt files into one file

I have this function that takes two input .txt files, deletes the punctuation marks, and adds the label pos or neg to each sentence.
I would like the content of these files converted to lowercase,
and then the two files merged into a single file named union.txt.
But my code does not work.
def extractor(feature_select):
    posFeatures = []
    negFeatures = []
    with open('positive.txt', 'r') as posSentences:
        for i in posSentences:
            posWords = re.findall(r"[\w']+|[(,.;:*##/?!$&)]", i.rstrip())
            posWords = [feature_select(posWords), 'pos']
            posFeatures.append(posWords)
    with open('negative.txt', 'r') as negSentences:
        for i in negSentences:
            negWords = re.findall(r"[\w']+|[(,.;:*##/?!$&)]", i.rstrip())
            negWords = [feature_select(negWords), 'neg']
            negFeatures.append(negWords)
    return posFeature, negFeature

filenames = [posFeature, negFeature]
with open('union.txt', 'w') as outfile:
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read())
Actually, you are trying to open files whose names are the contents of the two lists: fname holds the contents read from the input files, not a filename.
filenames = [posFeature, negFeature]
with open('union.txt', 'w') as outfile:
    for i in filenames:                   # refers to posFeature or negFeature, which is a list
        for j in i:                       # this loop reads each sentence entry from the list i
            outfile.write(str(j) + '\n')  # j is a [features, label] list, so stringify it before writing
There is no need to read back the contents that were already read and appended to posFeature and negFeature. The code above writes the contents of the lists in filenames directly, and your two files are merged.
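The lowercasing asked for in the question is not handled above; a minimal sketch that merges the raw text files and lowercases every line on the way (file names taken from the question):

```python
def merge_lowercase(paths, out_path):
    # Concatenate the input files into out_path, lowercasing each line
    with open(out_path, 'w') as out:
        for path in paths:
            with open(path) as infile:
                for line in infile:
                    out.write(line.lower())
```

Usage, for the files in the question: merge_lowercase(['positive.txt', 'negative.txt'], 'union.txt')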

Python: count frequency of words in a txt file

I am required to count the frequency of the key words from a text file. I am not allowed to use dictionaries or sets, and I also cannot import any Python methods. I honestly cannot figure out how to do it!!
This is how it's supposed to display:
car 4
dog 4
egg 3
Here's what I have so far, and it absolutely does not work.
fname = input("enter file name:")
ifile = open(fname, 'r')
list1 = ['car', 'dog', 'cat'.....'ect']
list2 = []
for word in ifile:
    if word in list1:
        list2.index(word)[1] += 1
    else:
        list2.append([word,])
print(list2,)
I played with this a little... I noticed I had to enter the file name in quotes for some reason.
fname = input('enter file name:')
ifile = open(fname, 'r')
list1 = []
list2 = []
for line in ifile.readlines():
    for word in line.split(' '):
        word = word.strip()
        if word in list1:
            list2[list1.index(word)] += 1
        else:
            list1.append(word)
            list2.append(1)
for item in list1:
    print item, list2[list1.index(item)]
Given you can't use any dictionary/set structures, why not use another string for storage, writing each unique word encountered and incrementing its count when it already exists. Pseudocode:
create empty string for storage
parse and extract words
iterate:
    check word against the string (if it exists: increment its count / if not: add it and set its count to 1)
output the string
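The two-parallel-list approach from the answer above can be packaged as a small Python 3 function that respects the no-dict/no-set constraint (names are illustrative):

```python
def count_words(text, keywords):
    # Parallel lists: words[i] is a keyword seen so far, counts[i] its frequency
    words, counts = [], []
    for word in text.split():
        if word in keywords:
            if word in words:
                counts[words.index(word)] += 1
            else:
                words.append(word)
                counts.append(1)
    return list(zip(words, counts))
```

list.index() makes this quadratic in the number of distinct keywords, which is fine for a homework-sized input.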

How can I read multiple txt files?

I want to get all the words of my documents, but I have a problem with file in this code.
How do I fill the fields of file with the content of the documents? This is my code:
textfilename = ['example' '*' '.txt'];
Alltextfiles = dir(textfilename);
for i = 1:length(Alltextfiles)
    fileID(i) = fopen(Alltextfiles(i).name, 'r+');
    file(i) = fscanf(fileID(i), '%c', inf);
    words(i) = regexp(file(i), ' ', 'split');
end
Make file and words cell arrays.
for i = 1:length(Alltextfiles)
    fileID(i) = fopen(Alltextfiles(i).name, 'r+');
    file{i} = fscanf(fileID(i), '%c', inf);
    words{i} = regexp(file{i}, ' ', 'split');
end
Also, consider splitting by '\s|\n', I assume your regexp is not getting you the desired output.
You could do the following to read all words of a file:
words = textscan(fileread(fname), '%s');
words will be a N-by-1 cell array containing all the words of the file.

Read fields from text file and store them in a structure

I am trying to read a file that looks as follows:
Data Sampling Rate: 256 Hz
*************************
Channels in EDF Files:
**********************
Channel 1: FP1-F7
Channel 2: F7-T7
Channel 3: T7-P7
Channel 4: P7-O1
File Name: chb01_02.edf
File Start Time: 12:42:57
File End Time: 13:42:57
Number of Seizures in File: 0
File Name: chb01_03.edf
File Start Time: 13:43:04
File End Time: 14:43:04
Number of Seizures in File: 1
Seizure Start Time: 2996 seconds
Seizure End Time: 3036 seconds
So far I have this code:
fid1 = fopen('chb01-summary.txt')
data = struct('id',{},'stime',{},'etime',{},'seizenum',{},'sseize',{},'eseize',{});
if fid1 == -1
    error('File cannot be opened ')
end
tline = fgetl(fid1);
while ischar(tline)
    i = 1;
    disp(tline);
end
I want to use regexp to find the expressions and so I did:
line1 = '(.*\d{2} (\.edf)'
data{1} = regexp(tline, line1);
tline = fgetl(fid1);
time = '^Time: .*\d{2]}: \d{2} :\d{2}';
data{2} = regexp(tline, time);
tline = getl(fid1);
seizure = '^File: .*\d';
data{4} = regexp(tline, seizure);
if data{4} > 0
    stime = '^Time: .*\d{5}';
    tline = getl(fid1);
    data{5} = regexp(tline, seizure);
    tline = getl(fid1);
    data{6} = regexp(tline, seizure);
end
I tried using a loop to find the line at which file name starts with:
for (firstline < 1) || (firstline > 1)
    firstline = strfind(tline, 'File Name')
    tline = fgetl(fid1);
end
and now I'm stumped.
Suppose that I am at the line where the information is; how do I store the information with regexp? I got an empty array for data after running the code once...
Thanks in advance.
I find it the easiest to read the lines into a cell array first using textscan:
%// Read lines as strings
fid = fopen('input.txt', 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
and then apply regexp on it to do the rest of the manipulations:
%// Parse field names and values
C = regexp(C{:}, '^\s*([^:]+)\s*:\s*(.+)\s*', 'tokens');
C = [C{:}]; %// Flatten the cell array
C = reshape([C{:}], 2, []); %// Reshape into name-value pairs
Now you have a cell array C of field names and their corresponding (string) values, and all you have to do is plug it into struct in the correct syntax (using a comma-separated list in this case). Note that the field names have spaces in them, so this needs to be taken care of before they can be used (e.g replace them with underscores):
C(1, :) = strrep(C(1, :), ' ', '_'); %// Replace spaces with underscores
data = struct(C{:});
Here's what I get for your input file:
data = 
                Data_Sampling_Rate: '256 Hz'
                         Channel_1: 'FP1-F7'
                         Channel_2: 'F7-T7'
                         Channel_3: 'T7-P7'
                         Channel_4: 'P7-O1'
                         File_Name: 'chb01_03.edf'
                   File_Start_Time: '13:43:04'
                     File_End_Time: '14:43:04'
        Number_of_Seizures_in_File: '1'
                Seizure_Start_Time: '2996 seconds'
                  Seizure_End_Time: '3036 seconds'
Of course, it is possible to prettify it even more by converting all relevant numbers to numerical values, grouping the 'channel' fields together and such, but I'll leave this to you. Good luck!