Count the number of files in a folder that have certain strings - python-2.7

I have a folder with 200 files. Each file has data like
VISITERM_90 VISITERM_0 VISITERM_34 ..... etc.
The files do not all contain the same elements. I would like to count, for each element from VISITERM_0 to VISITERM_99, the number of files that contain it, so the output should look like:
VISITERM_0 200
VISITERM_1 140
VISITERM_2 150
and so on, depending on the number of files that contain the given element. I want to loop from VISITERM_0 to VISITERM_99 and, for each element, find the number of files.
My code is:
import os

vt = 'VISITERM_'
no = 0
while no < 10:
    for doc in os.listdir('/home/krupa/Krupa/Mirellas_Image_Annotation_Data/Test/sample_codes/Files'):
        doc2 = '/home/krupa/Krupa/Mirellas_Image_Annotation_Data/Test/sample_codes/Files/' + doc
        c = vt + (repr(no))
        with open(doc2, 'r') as inF:
            for line in inF:
                if c in line:
                    print c, doc2
                else:
                    print "DOES NOT EXIST", c, doc2
    no = no + 1
This code prints each VISITERM and every file that contains it. I just want each VISITERM_* and the corresponding number of files. Please help!

My python skills are a bit rusty, so bear with me. I think that you need a way to store the values while looping; I'll use a dictionary. This is not the complete solution, but it can help you figure out what you need to do:
counts = {}  # maps each VISITERM string to how many times it has been seen
vt = 'VISITERM_'
for doc in os.listdir('..'):
    doc2 = '..'
    with open(doc2, 'r') as inF:
        for line in inF:
            for no in range(10):
                c = vt + repr(no)
                if c in line:
                    if c in counts:
                        counts[c] += 1
                    else:
                        counts[c] = 1
for key in counts.keys():
    print key, counts[key]
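To count files rather than matching lines (and to count each term at most once per file), one option is to collect the terms found in each file into a set and feed that set to collections.Counter. A minimal sketch, assuming the same folder as in the question; splitting each file on whitespace also avoids VISITERM_9 matching inside VISITERM_90, which a plain substring test would do:

import os
from collections import Counter

folder = '/home/krupa/Krupa/Mirellas_Image_Annotation_Data/Test/sample_codes/Files'
terms = set('VISITERM_%d' % n for n in range(100))

file_counts = Counter()
for doc in os.listdir(folder):
    with open(os.path.join(folder, doc), 'r') as inF:
        tokens = inF.read().split()
    # a set, so each term is counted at most once per file
    found = set(tok for tok in tokens if tok in terms)
    file_counts.update(found)

for n in range(100):
    term = 'VISITERM_%d' % n
    print term, file_counts[term]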

Related

Creating an if/else that appends data from mult. scraped pages if counts differ?

I'm trying to scrape Oregon teacher licensure information that looks like this or this (this is publicly available data).
This is my code:
for t in range(0, 2):  # refers to txt file with ids
    address = 'http://www.tspc.oregon.gov/lookup_application/LDisplay_Individual.asp?id=' + lines2[t]
    page = requests.get(address)
    tree = html.fromstring(page.text)
    count = 0
    for license_row in tree.xpath(".//tr[td[1] = 'License Type']/following-sibling::tr[1]"):
        license_data = license_row.xpath(".//td/text()")
        count = count + 1
        if count == 1:
            ltest1.append(license_data)
        if count == 2:
            ltest2.append(license_data)
        if count == 3:
            ltest3.append(license_data)

with open('teacher_lic.csv', 'wb') as pensionfile:
    writer = csv.writer(pensionfile, delimiter=",")
    writer.writerow(["Name", "Lic1", "Lic2", "Lic3"])
    pen = zip(lname, ltest1, ltest2, ltest3)
    for penlist in pen:
        writer.writerow(list(penlist))
The problem occurs when this happens: Teacher A has 13 licenses and Teacher B has 2, so for A my total count reaches 13 and for B it only reaches 2. When I get to Teacher B and count equals 3, I want to say "if count==3 then ltest3.append(license_data), else if count==3 and license_data=='' then ltest3.append('')", but since count never reaches 3 for B, there's no way to tell it to append an empty entry.
I'd want the output to look like this:
Is there a way to do this? I might be approaching this completely wrong so if someone can point me in another direction, that would be helpful as well.
There's probably a more elegant way to do this, but this managed to work pretty well.
I created some blank entries to fill in when Teacher A has 13 licenses and Teacher B has 2. Calling license_row.xpath on the padding entries raised errors once count passed Teacher B's last row, and I exploited those errors to append '' to ltest3.
for t in range(0, 2):  # each txt file contains differing amounts
    address = 'http://www.tspc.oregon.gov/lookup_application/LDisplay_Individual.asp?id=' + lines2[t]
    page = requests.get(address)
    tree = html.fromstring(page.text)
    count = 0
    test = tree.xpath(".//tr[td[1] = 'License Type']/following-sibling::tr[1]")
    # pad the row list with blanks so every teacher yields the same number of entries
    difference = 15 - len(test)
    for i in range(0, difference):
        test.append('')
    for license_row in test:
        count = count + 1
        try:
            license_data = license_row.xpath(".//td/text()")
        except AttributeError:
            # the padding strings have no .xpath(), so they become empty entries
            license_data = ''
        if count == 1:
            ltest1.append(license_data)
        if count == 2:
            ltest2.append(license_data)
        if count == 3:
            ltest3.append(license_data)
        del license_data
    for endorse_row in tree.xpath(".//tr[td = 'Endorsements']/following-sibling::tr"):
        endorse_data = endorse_row.xpath(".//td/text()")
        lendorse1.append(endorse_data)
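If it helps, a more compact variant of the same padding idea is to build one list of licenses per teacher and pad it to a fixed width before writing the CSV. A minimal sketch, assuming lines2 (the id list) and lname (the name list) are already defined as in the question; max_licenses and rows_per_teacher are names I made up for illustration:

import csv
import requests
from lxml import html

max_licenses = 3  # keep the first three licenses, matching the Lic1-Lic3 columns
rows_per_teacher = []

for t in range(0, 2):
    address = 'http://www.tspc.oregon.gov/lookup_application/LDisplay_Individual.asp?id=' + lines2[t]
    tree = html.fromstring(requests.get(address).text)
    rows = tree.xpath(".//tr[td[1] = 'License Type']/following-sibling::tr[1]")
    licenses = [row.xpath(".//td/text()") for row in rows]
    # pad short lists with '' so every teacher has exactly max_licenses entries
    licenses = (licenses + [''] * max_licenses)[:max_licenses]
    rows_per_teacher.append(licenses)

with open('teacher_lic.csv', 'wb') as pensionfile:
    writer = csv.writer(pensionfile, delimiter=",")
    writer.writerow(["Name", "Lic1", "Lic2", "Lic3"])
    for name, licenses in zip(lname, rows_per_teacher):
        writer.writerow([name] + licenses)

This avoids the count bookkeeping and the three separate ltest lists.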

Using Interval tree to find overlapping regions

I have two files
File 1
chr1:4847593-4847993
TGCCGGAGGGGTTTCGATGGAACTCGTAGCA
File 2
Pbsn|X|75083240|75098962|
TTTACTACTTAGTAACACAGTAAGCTAAACAACCAGTGCCATGGTAGGCTTGAGTCAGCT
CTTTCAGGTTCATGTCCATCAAAGATCTACATCTCTCCCCTGGTAGCTTAAGAGAAGCCA
TGGTGGTTGGTATTTCCTACTGCCAGACAGCTGGTTGTTAAGTGAATATTTTGAAGTCC
File 1 has approximately 8000 more entries like this, each with a different header and its sequence below it.
I would first like to match the start and end coordinates from file 1 against file 2, or see whether they are close to each other, say within +/- 100. If they are, I then want to match the sequence in file 2 and print out the header info for file 2 along with the matched sequence.
My approach is to use an interval tree (in Python; I am still trying to get the hang of it) and store the coordinates in it.
I tried using re.match but it's not giving me accurate results.
Any tips would be highly appreciated.
Thanks.
My first try is below.
However, I have now hit another roadblock: for my second file, if my start and end are 5000 and 8000 respectively, I want to change this by subtracting 2000, so my new start and stop are 3000 and 5000. Here is my code:
from intervaltree import IntervalTree
from collections import defaultdict

binding_factor = 'some.txt'

genome = dict()
with open('file2', 'r') as rows:
    for row in rows:
        #print row
        if row.startswith('>'):
            # header fields look like >Pbsn|X|75083240|75098962|
            row = row.strip().split('|')
            chrom_name = row[1]
            start = int(row[2])
            end = int(row[3])
            # one interval tree per chromosome;
            # first time we've encountered this chromosome, create the tree
            if chrom_name not in genome:
                genome[chrom_name] = IntervalTree()
            # index the feature, storing the gene name as the interval's data
            genome[chrom_name].addi(start, end, row[0])
#for key, value in genome.iteritems():
#    print key, ":", value

mast = defaultdict(list)
with open('file1', 'r') as f:
    for row in f:
        row = row.strip().split()
        row[0] = row[0].replace('chr', '') if row[0].startswith('chr') else row[0]
        row[0] = 'MT' if row[0] == 'M' else row[0]
        #print row[0]
        mast[row[0]].append({
            'start': int(row[1]),
            'end': int(row[2])
        })
#for k, v in mast.iteritems():
#    print k, ":", v

with open(binding_factor, 'w') as f:
    for k, v in mast.iteritems():
        for i in v:
            g = genome[k].search(i['start'], i['end'])
            if g:
                print g
                f.write(str(g) + '\n')
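For the new roadblock (shifting the file 2 interval), if the intent is a window that ends at the original start, so that start=5000, end=8000 becomes (3000, 5000) as in the example above, one way is to adjust the coordinates before calling addi(). A minimal sketch under that assumption; the helper name and the upstream offset are mine:

def shift_interval(start, end, upstream=2000):
    # window ending at the original start: start=5000, end=8000 -> (3000, 5000)
    return start - upstream, start

# usage, in the file2 header branch, in place of the plain addi() call:
# new_start, new_end = shift_interval(start, end)
# genome[chrom_name].addi(new_start, new_end, row[0])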

Out of bounds error when using a list as an index

I have two files: one is a single column (call it pred) and has no headers, the other has two columns: ID and IsClick (it has headers). My goal is to use the column ID as an index to pred.
import pandas as pd
import numpy as np

def LinesInFile(path):
    with open(path) as f:
        for linecount, line in enumerate(f):
            pass
        f.close()
    print 'Found ' + str(linecount) + ' lines'
    return linecount

path = '/Users/mas/Documents/workspace/Avito/input/'  # path to testing file
submission = path + 'submission1234.csv'
lines = LinesInFile(submission)
lines = LinesInFile(path + 'sampleSubmission.csv')
sample = pd.read_csv(path + 'sampleSubmission.csv')
preds = np.array(pd.read_csv(submission, header=None))
index = sample.ID.values - 1
print index
print len(index)
sample['IsClick'] = preds[index]
sample.to_csv('submission.csv', index=False)
The output is:
Found 7816360 lines
Found 7816361 lines
[ 0 4 5 ..., 15961507 15961508 15961511]
7816361
Traceback (most recent call last):
File "/Users/mas/Documents/workspace/Avito/July3b.py", line 23, in <module>
sample['IsClick'] = preds[index]
IndexError: index 7816362 is out of bounds for axis 0 with size 7816361
Something seems wrong, because my file has 7816361 lines counting the header, while my list has an extra element (the len of the list is 7816361).
I don't have your csv files to recreate the problem, but the problem looks like it is being caused by your use of index.
index = sample.ID.values - 1 takes each of your sample IDs and subtracts 1. These are not valid index values into pred, which is only 7816360 long. Each of the last 3 items in your index array (based on your print output) would go out of bounds, as they are > 7816360. I suspect the error is showing you the first ID-1 value that goes out of bounds.
Assuming you just want to join the files based on their line number, you could do the following:
sample = pd.concat(
    (pd.read_csv(path + 'sampleSubmission.csv'),
     pd.read_csv(submission, header=None).rename(columns={0: 'IsClick'})),
    axis=1)
Otherwise you'll need to perform a join or merge on your two dataframes.
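Since the two files differ in length by one row here, an index join can make the mismatch visible instead of failing: rows without a matching prediction come out as NaN. A minimal sketch, assuming the predictions are in the same row order as sampleSubmission:

import pandas as pd

path = '/Users/mas/Documents/workspace/Avito/input/'  # same paths as in the question
sample = pd.read_csv(path + 'sampleSubmission.csv')
preds = pd.read_csv(path + 'submission1234.csv', header=None, names=['IsClick'])

# both frames use the default RangeIndex, so this joins on the row number;
# any sample row without a prediction gets NaN rather than raising IndexError
merged = sample[['ID']].join(preds)
merged.to_csv('submission.csv', index=False)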

IndexError, but more likely I/O error

Unsure of why I am getting this error. I'm reading from a file called columns_unsorted.txt, then trying to write to columns_sorted.txt. The error is on the fan_on = int(string_j[1]) line, saying list index out of range. Here's my code:
#!/usr/bin/python
import fileinput
import collections

# open document to read from
j = open('./columns_unsorted.txt', 'r')
# note this is a file of rows of space-delimited data in the format <1384055277275353 0 0 0 1 0 0 0 0 22:47:57> on each row, the first term being unix time, the last human time, the middle binary indicating which machine event happened

# open document to record results into
l = open('./columns_sorted.txt', 'w')

# CREATE ARRAY CALLED EVENTS
events = collections.deque()
i = 1

# FILL ARRAY WITH "FACTS" ROWS; SPLIT INTO FIELDS, CHANGE TYPES AS APPROPRIATE
for line in j:  # columns_unsorted
    line = line.rstrip('\n')
    string_j = line.split(' ')
    time = str(string_j[0])
    fan_on = int(string_j[1])
    fan_off = int(string_j[2])
    heater_on = int(string_j[3])
    heater_off = int(string_j[4])
    space_on = int(string_j[5])
    space_off = int(string_j[6])
    pump_on = int(string_j[7])
    pump_off = int(string_j[8])
    event_time = str(string_j[9])
    row = time, fan_on, fan_off, heater_on, heater_off, space_on, space_off, pump_on, pump_off, event_time
    events.append(row)
You are missing the readlines function, no?
You have to do:
j = open('./columns_unsorted.txt', 'r')
l = j.readlines()
for line in l:
    # what you want to do with each line
In the future, you should print some of your variables, just to be sure the code is working as you want it to, and to help you identify problems.
(For example, if you printed string_j in your code, you would see what kind of problem you have.)
The problem was an inconsistent line in the data file. Forgive my haste in posting.
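Since the root cause turned out to be an inconsistent line, one defensive option is to skip (or report) any row that doesn't split into the expected 10 fields before indexing into it. A small sketch under that assumption, using the same file and deque as the question:

import collections

events = collections.deque()
with open('./columns_unsorted.txt', 'r') as j:
    for line in j:
        string_j = line.rstrip('\n').split(' ')
        if len(string_j) != 10:
            # malformed row: report it and move on instead of raising IndexError
            print "skipping malformed line:", repr(line)
            continue
        time = str(string_j[0])
        event_time = str(string_j[9])
        flags = [int(x) for x in string_j[1:9]]
        events.append(tuple([time] + flags + [event_time]))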

Python dictionary construction from file with multiple similar values and keys

I am new to python (well to coding in general) and am trying to use it to analyze some data at work. I have a file like this:
HWI-ST591_0064:5:1101:1228:2111#0/1 + 7included 11 A>G - -
HWI-ST591_0064:5:1101:1205:2125#0/1 + genomic 17 A>G - -
HWI-ST591_0064:5:1101:1178:2129#0/1 + 7included 6 A>C 8 A>T
HWI-ST591_0064:5:1101:1176:2164#0/1 + 7included 6 A>T 8 A>G
HWI-ST591_0064:5:1101:1199:2234#0/1 + 7included 14 T>C 21 G>A
HWI-ST591_0064:5:1101:1208:2249#0/1 + 7included 32 C>T - -
Tab delimited. I am trying to create a dictionary where the key is the joined last 4 values of each line and the value is a list of the first values (the unique identifiers) of the lines that share that key, like this:
{'32C>T--': ['HWI-ST591_0064:5:1101:1208:2249#0/1'],
'6A>C8A>C': ['HWI-ST591_0064:5:1101:1318:2090#0/1'],
'36A>G--': ['HWI-ST591_0064:5:1101:1425:2093#0/1'],
'----': ['HWI-ST591_0064:5:1101:1222:2225#0/1'],
'6A>C8A>T': ['HWI-ST591_0064:5:1101:1178:2129#0/1','HWIST591_0064:5:1101:1176:2164#0/1']}
This way I can then get a list of the unique identifiers and count or sort or do the other things I need to do. I can get the dictionary made, but when I try to output it to a file I get an error. I think the problem is because the value is a list; I keep getting the error:
File "trial.py", line 33, in
outFile.write("%s\t%s\n" % ('\t' .join(key, mutReadDict[key])))
TypeError: unhashable type: 'list'
Is there a way to make this work so I can have it in a file? I tried .iteritems() in the for loop that builds the dictionary, but that didn't seem to work. Thanks, and here is my code:
inFile = open('path', 'rU')
outFile = open('path', 'w')

from collections import defaultdict
mutReadDict = defaultdict(list)

for line in inFile:
    entry = line.strip('\n').split('\t')
    fastQ_ID = entry[0]
    strand = entry[1]
    chromosome = entry[2]
    mut1pos = entry[3]
    mut1base = entry[4]
    mut2pos = entry[5]
    mut2base = entry[6]
    mutKey = mut1pos + mut1base + mut2pos + mut2base
    if chromosome == '7included':
        mutReadDict[mutKey].append(fastQ_ID)
    else:
        pass

keyList = [mutReadDict.keys()]
keyList.sort()
for key in keyList:
    outFile.write("%s\t%s\n" % ('\t'.join(key, mutReadDict[key])))
outFile.close()
I think you want:
keyList = mutReadDict.keys()
instead of
keyList = [mutReadDict.keys()]
You probably mean this too:
for key in keyList:
    outFile.write("%s\t%s\n" % (key, '\t'.join(mutReadDict[key])))
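Since the stated goal is also to count the identifiers per key, a small follow-up sketch that combines the two fixes and adds a per-key count (it reuses mutReadDict and outFile from the question's code):

# sorted() works directly on the dict in Python 2.7 and returns its keys in order
for key in sorted(mutReadDict):
    ids = mutReadDict[key]
    # write the key, how many reads matched it, and the tab-joined identifiers
    outFile.write("%s\t%d\t%s\n" % (key, len(ids), '\t'.join(ids)))

outFile.close()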