Using Interval tree to find overlapping regions - python-2.7

I have two files
File 1
chr1:4847593-4847993
TGCCGGAGGGGTTTCGATGGAACTCGTAGCA
File 2
Pbsn|X|75083240|75098962|
TTTACTACTTAGTAACACAGTAAGCTAAACAACCAGTGCCATGGTAGGCTTGAGTCAGCT
CTTTCAGGTTCATGTCCATCAAAGATCTACATCTCTCCCCTGGTAGCTTAAGAGAAGCCA
TGGTGGTTGGTATTTCCTACTGCCAGACAGCTGGTTGTTAAGTGAATATTTTGAAGTCC
File 1 has approximately 8000 more lines, each with a different header and the sequence below it.
I would first like to match the start and end coordinates from File 1 against File 2, or see whether they are close to each other, let's say within +/- 100. If they are, I want to match the sequence in File 2 and then print the File 2 header info and the matched sequence.
My approach: use an interval tree (in Python; I am still trying to get the hang of it) to store the coordinates?
I tried using re.match, but it's not giving me accurate results.
Any tips would be highly appreciated.
Thanks.
My first try is below.
However, I have now hit another roadblock: for my second file, if my start and end are 5000 and 8000 respectively, I want to change this by subtracting 2000 so that my new start and stop are 3000 and 5000. Here is my code:
    from intervaltree import IntervalTree
    from collections import defaultdict

    binding_factor = 'some.txt'
    genome = dict()
    with open('file2', 'r') as rows:
        for row in rows:
            #print row
            if row.startswith('>'):
                row = row.strip().split('|')
                chrom_name = row[5]
                start = int(row[3])
                # NOTE: reads the same field as start, as posted; start == end is an
                # empty interval, which intervaltree does not accept
                end = int(row[3])
                # one interval tree per chromosome
                if chrom_name not in genome:
                    # first time we've encountered this chromosome, create tree
                    genome[chrom_name] = IntervalTree()
                # index the feature
                genome[chrom_name].addi(start, end, row[2])
    #for key,value in genome.iteritems():
        #print key, ":", value
    mast = defaultdict(list)
    with open('file1', 'r') as f:
        for row in f:
            row = row.strip().split()
            row[0] = row[0].replace('chr', '') if row[0].startswith('chr') else row[0]
            row[0] = 'MT' if row[0] == 'M' else row[0]
            #print row[0]
            mast[row[0]].append({
                'start': int(row[1]),
                'end': int(row[2])
            })
    #for k,v in mast.iteritems():
        #print k, ":", v
    with open(binding_factor, 'w') as f:
        for k, v in mast.iteritems():
            for i in v:
                g = genome[k].search(i['start'], i['end'])
                if g:
                    print g
                    # 'gene' was undefined in the post; writing the search result instead
                    l = g
                    f.write(str(l) + '\n')
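The "+/- 100" matching described above comes down to a padded overlap test per chromosome. The following is a minimal sketch of that test using only the standard library, not the intervaltree package: it does a plain linear scan per chromosome, which an interval tree would replace for large inputs. It assumes headers look exactly like the two samples shown (`chr1:4847593-4847993` and `Pbsn|X|75083240|75098962|`); the helper names `parse_region`, `parse_gene`, `overlaps` and `find_hits` are made up for illustration.

```python
from collections import defaultdict

def parse_region(header):
    """Parse a File 1 header such as 'chr1:4847593-4847993'."""
    chrom, span = header.lstrip('>').split(':')
    start, end = span.split('-')
    # Drop the 'chr' prefix so names match the File 2 style ('1', 'X', ...)
    chrom = chrom[3:] if chrom.startswith('chr') else chrom
    return chrom, int(start), int(end)

def parse_gene(header):
    """Parse a File 2 header such as 'Pbsn|X|75083240|75098962|'."""
    # Assumed field order (from the sample): name, chromosome, start, end
    name, chrom, start, end = header.lstrip('>').split('|')[:4]
    return name, chrom, int(start), int(end)

def overlaps(a_start, a_end, b_start, b_end, pad=100):
    """True if the two ranges overlap or lie within pad bases of each other."""
    return a_start - pad <= b_end and b_start <= a_end + pad

# Index File 2 headers by chromosome (one sample header here)
genes = defaultdict(list)
for header in ['Pbsn|X|75083240|75098962|']:
    name, chrom, start, end = parse_gene(header)
    genes[chrom].append((start, end, name))

def find_hits(region_header, pad=100):
    """Return the names of indexed genes within pad bases of the region."""
    chrom, start, end = parse_region(region_header)
    return [name for g_start, g_end, name in genes.get(chrom, [])
            if overlaps(start, end, g_start, g_end, pad)]

print(find_hits('chrX:75098990-75099200'))  # region starts 28 bases past the gene end
print(find_hits('chr1:4847593-4847993'))    # no gene indexed on chromosome 1
```

Using `genes.get(chrom, [])` avoids a KeyError when File 1 mentions a chromosome that never appears in File 2.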


Cannot iterate through two csv files and compare

I'm relatively new to python (2.7) and need help looping through 2 CSV files. The first (outer loop) file is the row I want to write if certain conditions are met with the second (inner loop) file.
    import csv
    f = open('../CI Working Copy.csv')
    with open('../first.csv', 'wb') as n:
        theWriter = csv.writer(n)
        csv_f = csv.reader(f)
        g = open('../second.csv')
        csv_g = csv.reader(g)
        for row in csv_f:
            cbd = row[3]
            ced = row[4]
            rbd = row[5]
            red = row[6]
            ciCn = row[10]
            for iRow in csv_g:
                cn = iRow[0]
                startDate = iRow[1]
                endDate = iRow[2]
                iId = iRow[3]
                writeRow = 'false'
                if ciCn == cn:
                    if (cbd == startDate and ced == endDate) or (rbd == startDate and red == endDate):
                        theWriter.writerow(row)
        g.close()
    f.close()
It makes it into the second (inner loop) file, but never returns to the outer loop. I only need to write the row from the first file.
For each row of the first CSV file, the inner loop consumes the entire second file, so the next outer iteration finds nothing left to read. You need to go back to the beginning of the second file on each iteration.
The solution is:
    for row in csv_f:
        g.seek(0)  # go back to the start of the second file
        for iRow in csv_g:
            do_smth(iRow, row)
    g.close()
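If the second file fits in memory, an alternative to rewinding is to read it once into a set and test each outer row against that set. This is only a sketch: the column positions are taken from the question, while the function name `matching_rows` and the sample rows are invented for illustration.

```python
import csv
import io

def matching_rows(outer_file, inner_file):
    """Return outer rows whose number and either date pair occur in the inner file."""
    # Read the inner file once: (cn, startDate, endDate) triples
    periods = {(r[0], r[1], r[2]) for r in csv.reader(inner_file)}
    matches = []
    for row in csv.reader(outer_file):
        cbd, ced, rbd, red, ci_cn = row[3], row[4], row[5], row[6], row[10]
        if (ci_cn, cbd, ced) in periods or (ci_cn, rbd, red) in periods:
            matches.append(row)
    return matches

# Made-up sample data standing in for the two CSV files
outer = io.StringIO(
    "a,b,c,2020-01-01,2020-12-31,2021-01-01,2021-12-31,h,i,j,CN1\n"
    "a,b,c,2019-01-01,2019-12-31,2018-01-01,2018-12-31,h,i,j,CN2\n"
)
inner = io.StringIO(
    "CN1,2021-01-01,2021-12-31,id1\n"
    "CN2,2000-01-01,2000-12-31,id2\n"
)
result = matching_rows(outer, inner)
print(len(result))  # only the CN1 row matches
```

Each outer row is written at most once this way, and the second file is read a single time instead of once per outer row.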

Q: Python3 - If Statements for changing list lengths

I am attempting to analyze data sets as lists of differing lengths. I am calling lines (rows) of my data set one by one to be analyzed by my function. I want the function to still be run properly regardless of the length of the list.
My Code:
    f = open('DataSet.txt')
    for line in iter(f):
        remove_blanks = ['']
        entries = line.split()
        ''.join([i for i in entries if i not in remove_blanks])
        trash = (entries[0], entries[1])
        time = int(entries[2])
        column = [int(v) for v in entries[3:]]
        def myFun():
            print(entries)
            print_string = ''
            if column[0] == 100:
                if column[1] >= 250 and column[2] == 300:
                    if len(column) >= 9:
                        digit = [chr(x) for x in column[4:9]]
                        print_string = ('code: ' + ''.join(str(digit[l]) for l in range(5)) + ' ')
                    if len(column) >= 13:
                        optional_digit = [chr(d) for d in column[9:13]]
                        for m in range(0, 4):
                            print_string += 'Optional Field: ' + optional_digit[m] + ''
                    else:
                        print_string += 'No Optional Field '
                pass
            pass
            print(print_string)
            print('')
        myFun()
    f.close()
What is happening is if the length of a line of my data is not long enough (i.e. the list ends at column[6]), I get the error:
line 17, in function
print('Code: ' + digit[l])
IndexError: list index out of range
I want it to still print Code: #number #number #number #number and leave any non-existent columns as blanks when it is printed so that one line may print as Code: ABC9 and the next print as Code: AB if there are differing list lengths.
Please help! :)
Well, just make sure you're not looping over a list longer than available:
print_string = 'code: ' + ''.join(str(digit[l]) for l in range(min(5, len(digit)))) + ' '
or better:
print_string = "code {} ".format("".join(str(dig) for dig in digit[:5]))
Although I have a feeling you're over-complicating this.
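The slicing idea in the answer can be taken one step further: slices never raise IndexError, so a short row simply yields fewer characters. Below is a small sketch under that assumption; `format_code` is a hypothetical helper, and the column positions (4 to 8 for the code, 9 to 12 for the optional field) are copied from the question.

```python
def format_code(column):
    """Build the 'code: ...' string from whatever columns are present."""
    # A slice past the end of the list just returns fewer items
    code = ''.join(chr(x) for x in column[4:9])
    optional = ''.join(chr(x) for x in column[9:13])
    text = 'code: ' + code
    if optional:
        text += ' Optional Field: ' + optional
    else:
        text += ' No Optional Field'
    return text

print(format_code([100, 250, 300, 0, 65, 66]))
# code: AB No Optional Field
print(format_code([100, 250, 300, 0, 65, 66, 67, 57, 68, 69, 70]))
# code: ABC9D Optional Field: EF
```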

Why won't my csv list replace my blank values with "N"?

I'm attempting to create a function which reads a specific column of a csv file which currently alternates between empty values and "1", pops them into a list and then replaces them with an "N" for the empty value and "B" for the "1"'s. I'm pretty new to python, as well as programming in general, so any tips and all help is welcome. This is what I have so far, and it does process, but only replaces my "1"'s with "B"'s. I've double checked my csv and the position is definitely empty and does not contain spaces. I've also looked at other responses and tried to emulate some similar logic that appeared to be behind them, but something still doesn't seem to work. If someone could point me in the right direction it would be very much appreciated.
#sample data (for 195 entries):
["Header0","Header1","Foundation","Header3"],
["abc1","a12n","","123"],
["def2","d13b","1","456"],
["ghi3","g12n","","789"],
    def Foundation( csv_file_path, Remove_Header = False, Remove_SubHeader = False ):
        delineator = ','
        raw_file = file(csv_file_path, 'r')
        return_List = []
        n = 0
        #Process lines in file
        for line in raw_file.readlines():
            #Check if to include or remove header
            if (n == 0 ) and (Remove_Header == True):
                n = n + 1
                continue
            #Check if to include or remove sub header
            if (n == 1) and (Remove_SubHeader == True):
                n = n + 1
                continue
            sList2 = line.replace("\n","").strip().split( delineator )
            col_2 = str(sList2.pop(2))
            for n in col_2:
                if n == "1":
                    col_2 = col_2.replace("1", "B")
                elif n == "":
                    col_2 = col_2.replace("", "N")
            print col_2
            return_List.append(sList2) #add my secondary list back to my main List? right?
            sList2.insert(0, col_2)# insert back to my secondary list where it went
            n = n + 1 #add to counter and move down the line
        raw_file.close()
        #Return the list
        return return_List
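One likely reason the empty values are never replaced: `for n in col_2` iterates over the characters of the string, and an empty string has no characters, so the loop body (including the `elif n == ""` branch) never runs for blank cells. A possible sketch of the replacement that compares the whole cell instead is shown below. It uses the csv module rather than manual splitting, and `relabel_column` plus the in-memory sample are illustrative assumptions, not the original function.

```python
import csv
import io

def relabel_column(csv_file, index=2, remove_header=False):
    """Replace '1' with 'B' and empty cells with 'N' in one column."""
    rows = []
    for n, row in enumerate(csv.reader(csv_file)):
        if n == 0 and remove_header:
            continue
        value = row[index].strip()
        # Compare the whole cell, not its individual characters
        if value == '1':
            row[index] = 'B'
        elif value == '':
            row[index] = 'N'
        rows.append(row)
    return rows

# Sample rows mirroring the data in the question
sample = io.StringIO(
    'Header0,Header1,Foundation,Header3\n'
    'abc1,a12n,,123\n'
    'def2,d13b,1,456\n'
    'ghi3,g12n,,789\n'
)
rows = relabel_column(sample, remove_header=True)
print([r[2] for r in rows])  # ['N', 'B', 'N']
```

The value is changed in place, so the column stays at its original position in each row.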

Iterating through a .txt file in an odd way

What I am trying to do is write a program that opens a .txt file with movie reviews where the rating is a number from 0-4 followed by a short review of the movie. The program then prompts the user to open a second text file with words that will be matched against the reviews and given a number value based on the review.
For example, with these two sample reviews how they would appear in the .txt file:
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis . 2 Massoud 's story is an epic , but also a tragedy , the record of a tenacious , humane fighter who was also the prisoner -LRB- and ultimately the victim -RRB- of history .
So, if I were looking for the word "epic", it would increment the count for that word by 2 (which I already have figured out) since it appears twice, and then append the values 4 and 2 to a list of ratings for that word.
How do I append those ints to a list or dictionary related to that word? Keep in mind that I need to create a new list or dictionary key for every word in a list of words.
Please and thank you. And sorry if this was poorly worded, programming isn't my forte.
All of my code:
    def menu_validate(prompt, min_val, max_val):
        """ produces a prompt, gets input, validates the input and returns a value. """
        while True:
            try:
                menu = int(input(prompt))
                if menu >= min_val and menu <= max_val:
                    return menu
                    break
                elif menu.lower == "quit" or menu.lower == "q":
                    quit()
                print("You must enter a number value from {} to {}.".format(min_val, max_val))
            except ValueError:
                print("You must enter a number value from {} to {}.".format(min_val, max_val))
    def open_file(prompt):
        """ opens a file """
        while True:
            try:
                file_name = str(input(prompt))
                if ".txt" in file_name:
                    input_file = open(file_name, 'r')
                    return input_file
                else:
                    input_file = open(file_name+".txt", 'r')
                    return input_file
            except FileNotFoundError:
                print("You must enter a valid file name. Make sure the file you would like to open is in this programs root folder.")
    def make_list(file):
        lst = []
        for line in file:
            lst2 = line.split(' ')
            del lst2[-1]
            lst.append(lst2)
        return lst
    def rating_list(lst):
        '''iterates through a list of lists and appends the first value in each list to a second list'''
        rating_list = []
        for list in lst:
            rating_list.append(list[0])
        return rating_list
    def word_cnt(lst, word : str):
        cnt = 0
        for list in lst:
            for word in list:
                cnt += 1
        return cnt
    def words_list(file):
        lst = []
        for word in file:
            lst.append(word)
        return lst
##def sort(words, occurrences, avg_scores, std_dev):
## '''sorts and prints the output'''
## menu = menu_validate("You must choose one of the valid choices of 1, 2, 3, 4 \n Sort Options\n 1. Sort by Avg Ascending\n 2. Sort by Avg Descending\n 3. Sort by Std Deviation Ascending\n 4. Sort by Std Deviation Descending", 1, 4)
## print ("{}{}{}{}\n{}".format("Word", "Occurence", "Avg. Score", "Std. Dev.", "="*51))
## if menu == 1:
## for i in range (len(word_list)):
## print ("{}{}{}{}".format(cnt_list.sorted[i],)
    def make_odict(lst1, lst2):
        '''makes an ordered dictionary of keys/values from 2 lists of equal length'''
        dic = OrderedDict()
        for i in range (len(word_list)):
            dic[lst2[i]] = lst2[i]
        return dic
    cnt_list = []
    while True:
        menu = menu_validate("1. Get sentiment for all words in a file? \nQ. Quit \n", 1, 1)
        if menu == True:
            ratings_file = open("sample.txt")
            ratings_list = make_list(ratings_file)
            word_file = open_file("Enter the name of the file with words to score \n")
            word_list = words_list(word_file)
            for word in word_list:
                cnt = word_cnt(ratings_list, word)
                cnt_list.append(word_cnt(ratings_list, word))
Sorry, I know it's messy and very incomplete.
I think you mean:
    import collections
    counts = collections.defaultdict(int)
    word = 'epic'
    counts[word] += 1
Obviously, you can do more with word than I have, but you aren't showing us any code, so ...
EDIT
Okay, looking at your code, I'd suggest you make the separation between rating and text explicit. Take this:
    def make_list(file):
        lst = []
        for line in file:
            lst2 = line.split(' ')
            del lst2[-1]
            lst.append(lst2)
        return lst
And convert it to this:
    def parse_ratings(file):
        """
        Given a file of lines, each with a numeric rating at the start,
        parse the lines into score/text tuples, one per line. Return the
        list of parsed tuples.
        """
        ratings = []
        for line in file:
            text = line.strip().split()
            if text:
                score = text[0]
                ratings.append((score, text[1:]))
        return ratings
Then you can compute both values together:
    def match_reviews(word, ratings):
        cnt = 0
        scores = []
        for score, text in ratings:
            n = text.count(word)
            if n:
                cnt += n
                scores.append(score)
        return (cnt, scores)
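To keep one list of ratings per word, as the question asks, the two ideas above can be combined with a `defaultdict(list)`. This is a rough sketch rather than the answerer's code: `collect_scores` is a hypothetical name, scores are stored as ints here, and the ratings are shortened versions of the two sample reviews.

```python
from collections import defaultdict

def collect_scores(words, ratings):
    """Return (counts, scores): occurrences and review scores per word."""
    counts = defaultdict(int)
    scores = defaultdict(list)
    wanted = set(words)
    for score, text in ratings:
        seen = set()  # words already credited with this review's score
        for token in text:
            if token in wanted:
                counts[token] += 1
                if token not in seen:
                    scores[token].append(score)
                    seen.add(token)
    return counts, scores

# (score, tokens) tuples in the shape parse_ratings produces
ratings = [
    (4, "A comedy-drama of nearly epic proportions".split()),
    (2, "Massoud 's story is an epic , but also a tragedy".split()),
]
counts, scores = collect_scores(["epic", "tragedy"], ratings)
print(counts["epic"], scores["epic"])        # 2 [4, 2]
print(counts["tragedy"], scores["tragedy"])  # 1 [2]
```

A missing key in a defaultdict is created on first access, so no per-word setup is needed before the loop.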

Printing Results from Loops

I currently have a piece of code that works in two segments. The first segment opens an existing text file from a specific path on my local drive and then arranges the lines, based on certain indices, into a list of sub-lists. In the second segment I take the sub-lists I have created and group them on a shared index to simplify them (this starts at def merge_subs). I am getting no error, but I get no result when I try to print the variable answer. Am I not looping over the original list of sub-lists correctly? Ultimately I would like to have a variable that contains the final product of these loops so that I can write its contents to a new text file. Here is the code I am working with:
    from itertools import groupby, chain
    from operator import itemgetter
    with open ("somepathname") as g:
        # reads text from lines and turns them into a list sub-lists
        lines = g.readlines()
        for line in lines:
            matrix = line.split()
            JD = matrix [2]
            minTime= matrix [5]
            maxTime= matrix [7]
            newLists = [JD,minTime,maxTime]
    L = newLists
    def merge_subs(L):
        dates = {}
        for sub in L:
            date = sub[0]
            if date not in dates:
                dates[date] = []
            dates[date].extend(sub[1:])
        answer = []
        for date in sorted(dates):
            answer.append([date] + dates[date])
New code:
    def openfile(self):
        filename = askopenfilename(parent=root)
        self.lines = open(filename)
    def simplify(self):
        g = self.lines.readlines()
        for line in g:
            matrix = line.split()
            JD = matrix[2]
            minTime = matrix[5]
            maxTime = matrix[7]
            self.newLists = [JD, minTime, maxTime]
        print(self.newLists)
        dates = {}
        for sub in self.newLists:
            date = sub[0]
            if date not in dates:
                dates[date] = []
            dates[date].extend(sub[1:])
        answer = []
        for date in sorted(dates):
            print(answer.append([date] + dates[date]))
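Three things in the posted code appear to explain the missing output: each loop iteration overwrites `newLists` instead of appending to a list of sub-lists, `merge_subs` is defined without a `return` and is never called, and `list.append` returns `None`, so printing its result shows `None`. A minimal sketch with those points addressed follows. The field positions 2, 5 and 7 come from the question, while `parse_lines` and the sample lines are invented for illustration.

```python
def parse_lines(lines):
    """Turn each text line into a [JD, minTime, maxTime] sub-list."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) > 7:  # skip lines that are too short to index
            rows.append([fields[2], fields[5], fields[7]])
    return rows

def merge_subs(rows):
    """Group sub-lists by their first element and return the merged list."""
    dates = {}
    for sub in rows:
        dates.setdefault(sub[0], []).extend(sub[1:])
    return [[date] + dates[date] for date in sorted(dates)]

# Made-up lines standing in for the text file
sample = [
    "a b 2456001 c d 01:00 e 02:00",
    "a b 2456000 c d 03:00 e 04:00",
    "a b 2456001 c d 05:00 e 06:00",
]
answer = merge_subs(parse_lines(sample))
print(answer)
# [['2456000', '03:00', '04:00'], ['2456001', '01:00', '02:00', '05:00', '06:00']]
```

Because `merge_subs` returns its result, `answer` can then be written to a new text file, for example one merged row per line.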