Adding 2 lists inside a dictionary - list

I've been trying to add up the numbers of 2 lists inside a dictionary. I need to check whether the value in the selected row and column is already in the dictionary; if so, I want to add the double-entry list to the value (another double-entry list) already stored in the dictionary. I'm using an Excel spreadsheet and xlrd to read it. I'm pretty new to this.
For example, the code checks the account (a number) in the specified row and column. Let's say the value is 10: if it's not in the dictionary, the code adds the 2 values corresponding to this account, let's say [100, 0], as the value for this key. This is working as intended.
Now, the hard part is when the account number is already in the dictionary. Let's say it's the second entry for account number 10, and it's [50, 20]. I want the value associated with the key "10" to be [150, 20].
I've tried the zip method, but it seems to return random results: sometimes it adds up, sometimes it doesn't.
import xlrd
book = xlrd.open_workbook("Entry.xls")
print ("The number of worksheets is", book.nsheets)
print ("Worksheet name(s):", book.sheet_names())
sh = book.sheet_by_index(0)
print (sh.name,"Number of rows", sh.nrows,"Number of cols", sh.ncols)
liste_compte = {}
for rx in range(4, 10):
    if (sh.cell_value(rowx=rx, colx=4)) not in liste_compte:
        liste_compte[((sh.cell_value(rowx=rx, colx=4)))] = [sh.cell_value(rowx=rx, colx=6), sh.cell_value(rowx=rx, colx=7)]
    elif (sh.cell_value(rowx=rx, colx=4)) in liste_compte:
        three = [x + y for x, y in zip(liste_compte[sh.cell_value(rowx=rx, colx=4)],[sh.cell_value(rowx=rx, colx=6), sh.cell_value(rowx=rx, colx=7)])]
        liste_compte[(sh.cell_value(rowx=rx, colx=4))] = three
print (liste_compte)

I'm not going to directly untangle your code, but just help you with a general example that does what you want:
def update_balance(existing_balance, new_balance):
    # Add each column of new_balance onto existing_balance in place.
    for column in range(len(existing_balance)):
        existing_balance[column] += new_balance[column]
def update_account(accounts, account_number, new_balance):
    if account_number in accounts:
        update_balance(existing_balance = accounts[account_number], new_balance = new_balance)
    else:
        # Store a copy so later in-place updates don't modify the caller's list.
        accounts[account_number] = list(new_balance)
And finally you'd do something like this (assuming each row of your xls looks like [account_number, balance 1, balance 2]):
accounts = dict()
for row in xls:
    update_account(accounts = accounts,
                   account_number = row[0],
                   new_balance = row[1:3])  # row[1:3] takes both balance columns
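A more compact variant of the same idea, offered only as a sketch: a defaultdict gives every unseen account a starting total, so the "is it already in the dictionary" test disappears. The function name add_entry and the sample rows are hypothetical stand-ins for the spreadsheet cells read with xlrd, not part of the original code.

```python
from collections import defaultdict

def add_entry(accounts, account_number, amounts):
    """Add `amounts` element-wise onto the running totals for one account."""
    totals = accounts[account_number]
    accounts[account_number] = [t + a for t, a in zip(totals, amounts)]

# Hypothetical rows standing in for the spreadsheet: (account, column 6, column 7).
rows = [(10, 100, 0), (20, 5, 5), (10, 50, 20)]

# Every unseen account starts at [0, 0], so no membership test is needed.
accounts = defaultdict(lambda: [0, 0])
for account, first, second in rows:
    add_entry(accounts, account, [first, second])

print(dict(accounts))
```

Building a new list on each update, rather than modifying the stored one in place, avoids accidentally sharing one list between two keys.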

Counting matrix pairs using a threshold

I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get an array or a matrix I can plot etc.
I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.
Secondly, I need a list of these pairs based on file names.
So the output for the example below would look like:
1
and
["doc1", "doc4"]
Will really appreciate your help as I feel a bit lost not knowing which direction to go.
This is an example of my script to get the matrix:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]
# In my real script I iterate through a folder (path) with txt files like this:
#def read_text(path):
#    documents = []
#    for filename in glob.iglob(path+'*.txt'):
#        _file = open(filename, 'r')
#        text = _file.read()
#        documents.append(text)
#    return documents
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
Out:
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
As I understand your question, you want a function that takes the output numpy array and a threshold and returns two things:
- how many pairs of docs have a similarity greater than or equal to the given threshold
- the names of these docs.
So, I've written the following function, which takes three arguments:
- the output numpy array from the cos_similarity() function.
- the list of document names.
- a number (the threshold).
And here it is:
def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        lst = [row+1+idx for idx, num in \
               enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append( (docs_names[row], docs_names[item]) )
    return len(output_tuples), output_tuples
Let's see it in action:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
Let's see how this function works:
First, I iterate over every row of the numpy array.
Second, I iterate over every item in the row whose column index is bigger than the row's index, so we only visit the upper triangle of the matrix, above the diagonal. That's because each pair of documents appears twice in the whole array: the two values arr[0][1] and arr[1][0] are the same. Notice also that the diagonal items aren't included, because we know for sure that they are 1, as every document is identical to itself :).
Finally, we keep the items whose values are greater than or equal to the given threshold, and return their indices. These indices are used later to get the document names.
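The same upper-triangle walk can also be sketched with itertools.combinations, which yields each index pair (i, j) with i < j exactly once. This is only an illustrative alternative: the name similar_pairs is hypothetical, and the matrix below is copied from the output shown in the question.

```python
from itertools import combinations

def similar_pairs(matrix, names, threshold):
    """Return (count, pairs) for document pairs whose similarity is >= threshold."""
    # combinations(range(n), 2) visits only i < j, so the diagonal and the
    # mirrored lower triangle are never looked at.
    pairs = [(names[i], names[j])
             for i, j in combinations(range(len(names)), 2)
             if matrix[i][j] >= threshold]
    return len(pairs), pairs

# Similarity values copied from the question's output.
sim = [[1.0, 0.1459739, 0.03613371, 0.76357693],
       [0.1459739, 1.0, 0.11459266, 0.19117117],
       [0.03613371, 0.11459266, 1.0, 0.04732164],
       [0.76357693, 0.19117117, 0.04732164, 1.0]]
names = ["doc1", "doc2", "doc3", "doc4"]

print(similar_pairs(sim, names, 0.5))
```

Because it indexes with matrix[i][j], the sketch works on a plain nested list as well as on a numpy array.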

For-loop error: list index out of range

So I am rather new to programming and just recently started with classes. We are supposed to make a phonebook that can be loaded from separate text files.
However, I keep running into a problem in this section: when I get into the for-loop, it hits a brick wall on
if storage[2] == permaStorage[i].number:
and tells me "IndexError: list index out of range". I am almost certain it is because permaStorage starts out empty, but even when I attempt to fill it with temporary instances of Phonebook it tells me it is out of range. The main reason the check is there is to see whether a phone number already exists within permaStorage.
Anyone got a good tip on how to solve this or work around it?
(Sorry if the text is badly written. Just joined this site and not sure on the style)
class Phonebook():
    def __init__(self):
        self.name = ''
        self.number = ''
def Add(name1, number1):
    y = Phonebook()
    y.name = name1
    y.number = number1
    return y
def Main():
    permaStorage = []
    while True:
        print " add name number\n lookup name\n alias name newname\n change name number\n save filename\n load filename\n quit\n"
        choices = raw_input ("What would you like to do?: ")
        storage = choices.split(" ")
        if storage[0] == "add":
            for i in range(0, len(permaStorage)+1):
                if storage[2] == permaStorage[i].number:
                    print "This number already exists. No two people can have the same phonenumber!\n"
                    break
                if i == len(permaStorage):
                    print "hej"
                    try:
                        tempbox = Add(storage[1], storage[2])
                        permaStorage.append(tempbox)
                    except:
                        raw_input ("Remember to write name and phonenumber! Press any key to continue \n")
I think the problem is that permaStorage is an empty list, so this:
for i in range(0, len(permaStorage)+1):
    if storage[2] == permaStorage[i].number:
will cause an error, because permaStorage has 0 items but you are trying to get the first item (i=0, permaStorage[0]).
I think you should swap the two if clauses, so the i == len(permaStorage) check comes first, and leave the loop once the new entry has been handled:
for i in range(0, len(permaStorage)+1):
    if i == len(permaStorage):
        # No existing entry matched, so add the new one and stop.
        print "hej"
        try:
            tempbox = Add(storage[1], storage[2])
            permaStorage.append(tempbox)
        except:
            raw_input ("Remember to write name and phonenumber! Press any key to continue \n")
        break
    if storage[2] == permaStorage[i].number:
        print "This number already exists. No two people can have the same phonenumber!\n"
        break
So in this case, if permaStorage is empty, the new entry is appended before permaStorage[i] is ever read, and the index error cannot occur.
Indexing starts at zero in Python. Hence, a list of length 5 has its last element at index 4. Change the range to range(0, len(permaStorage)).
You should iterate up to the last element of the list, not beyond.
Try -
for i in range(0, len(permaStorage)):
The list of numbers produced in range() is from the start, but not including the end, so range(3) == [0, 1, 2].
So if your list x has length 10, range(0, len(x)) will give you 0 through 9, which are the correct indices of the elements of your list.
Adding 1 to len(x) will produce the range 0 through 10, and when you try to access x[10], it will fail.
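One way to sidestep index arithmetic entirely is to loop over the entries themselves rather than over indices. The following is a minimal Python 3 sketch under that assumption; Entry and add_contact are hypothetical names standing in for the Phonebook class and the "add" branch in the question.

```python
class Entry:
    """Hypothetical stand-in for the Phonebook class in the question."""
    def __init__(self, name, number):
        self.name = name
        self.number = number

def add_contact(storage, name, number):
    """Append a new entry unless the number is already taken; return True if added."""
    # any() over an empty list is False, so an empty phonebook needs no special case.
    if any(entry.number == number for entry in storage):
        return False
    storage.append(Entry(name, number))
    return True

book = []
print(add_contact(book, "Anna", "123"))   # True: the phonebook was empty
print(add_contact(book, "Bert", "123"))   # False: the number is already used
print(add_contact(book, "Bert", "456"))   # True: a new number
```

Since no index is ever computed, there is nothing that can run past the end of the list.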

Creating an if/else that appends data from mult. scraped pages if counts differ?

I'm trying to scrape Oregon teacher licensure information that looks like this or this (this is publicly available data).
This is my code:
for t in range(0,2): #Refers to txt file with ids
    address = 'http://www.tspc.oregon.gov/lookup_application/LDisplay_Individual.asp?id=' + lines2[t]
    page = requests.get(address)
    tree = html.fromstring(page.text)
    count = 0
    for license_row in tree.xpath(".//tr[td[1] = 'License Type']/following-sibling::tr[1]"):
        license_data = license_row.xpath(".//td/text()")
        count = count + 1
        if count==1:
            ltest1.append(license_data)
        if count==2:
            ltest2.append(license_data)
        if count==3:
            ltest3.append(license_data)
with open('teacher_lic.csv', 'wb') as pensionfile:
    writer = csv.writer(pensionfile, delimiter="," )
    writer.writerow(["Name", "Lic1", "Lic2", "Lic3"])
    pen = zip(lname, ltest1, ltest2, ltest3)
    for penlist in pen:
        writer.writerow(list(penlist))
The problem occurs when this happens: teacher A has 13 licenses and Teacher B has 2. In A my total count = 13 and B = 2. When I get to Teacher B and count equal to 3, I want to say, "if count==3 then ltest3.append(licensure_data) else if count==3 and license_data=='' then license3.append('')" but since there's no count==3 in B there's no way to tell it to append an empty set.
I'd want the output to look like this:
Is there a way to do this? I might be approaching this completely wrong so if someone can point me in another direction, that would be helpful as well.
There's probably a more elegant way to do this but this managed to work pretty well.
I created some blank spaces to fill in when Teacher A has 13 licenses and Teacher B has 2. There were some errors that resulted when the license_row.xpath got to the count==3 in Teacher B. I exploited these errors to create the ltest3.append('').
for t in range(0, 2): #Each txt file contains differing amounts
    address = 'http://www.tspc.oregon.gov/lookup_application/LDisplay_Individual.asp?id=' + lines2[t]
    page = requests.get(address)
    tree = html.fromstring(page.text)
    count = 0
    test = tree.xpath(".//tr[td[1] = 'License Type']/following-sibling::tr[1]")
    difference = 15 - len(test)
    for i in range(0, difference):
        test.append('')
    for license_row in test:
        count = count + 1
        try:
            license_data = license_row.xpath(".//td/text()")
        except NameError:
            license_data = ''
            if license_data=='' and count==1:
                ltest1.append('')
            if license_data=='' and count==2:
                ltest2.append('')
            if license_data=='' and count==3:
                ltest3.append('')
        except AttributeError:
            license_data = ''
        if count==1 and True:
            print "True"
        if count==1:
            ltest1.append(license_data)
        if count==2 and True:
            print "True"
        if count==2:
            ltest2.append(license_data)
        if count==3 and True:
            print "True"
        if count==3:
            ltest3.append(license_data)
        del license_data
    for endorse_row in tree.xpath(".//tr[td = 'Endorsements']/following-sibling::tr"):
        endorse_data = endorse_row.xpath(".//td/text()")
        lendorse1.append(endorse_data)
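The padding idea above can also be expressed without relying on exceptions. The sketch below assumes the licenses for each teacher have already been collected into a list; pad_to and the sample data are hypothetical and only illustrate the fixed-width row shape that the CSV needs.

```python
def pad_to(items, length, fill=''):
    """Return the first `length` items, padded with `fill` when there are fewer."""
    items = list(items)[:length]
    return items + [fill] * (length - len(items))

# Hypothetical per-teacher license lists standing in for the scraped rows.
scraped = {
    "Teacher A": ["License 1", "License 2", "License 3", "License 4"],
    "Teacher B": ["License 1", "License 2"],
}

# Each row has exactly one name plus three license columns, whatever was scraped.
rows = [[name] + pad_to(licenses, 3) for name, licenses in scraped.items()]
for row in rows:
    print(row)
```

Building one complete row per teacher, instead of three parallel lists, also keeps names and licenses aligned when a teacher has fewer entries.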

Creating a list of sums

I'm a newbie in Python and I'm struggling to create a list of sums generated by a for loop.
I got a school assignment where my program has to simulate the scores of a class of blind students in a multiple-choice test.
def blindwalk(): # Generates the blind answers in a test with 21 questions
    import random
    resp = []
    gab = ["a","b","c","d"]
    for n in range(0,21):
        resp.append(random.choice(gab))
    return(resp)
def gabarite(): # Generates the official answer key of the tests
    import random
    answ_gab = []
    gab = ["a","b","c","d"]
    for n in range(0,21):
        answ_gab.append(random.choice(gab))
    return(answ_gab)
def class_tests(A): # A is the number of students
    alumni = []
    A = int(A)
    for a in range(0,A):
        alumni.append(blindwalk())
    return alumni
def class_total(A): # A is the number of students
    A = int(A)
    official_gab = gabarite()
    tests = class_tests(A)
    total_score = []*0
    for a in range(0,A):
        for n in range(0,21):
            if tests[a][n] == official_gab[n]:
                total_score[a].add(1)
    return total_score
When I run the class_total() function, I get this error:
total_score[a].add(1)
IndexError: list index out of range
The question is: how do I evaluate the score of each student and create a list of those scores? That is what I want the class_total() function to do.
I also tried
if tests[a][n] == official_gab[n]:
total_score[a] += 1
But I got the same error, so I think I don't fully understand how lists work in Python yet.
Thanks!
(Also, I'm not a English native-speaker, so please tell me if I couldn't be clear enough)
This line:
total_score = []*0
And in fact, any of the following lines:
total_score = []*30
total_score = []*3000
total_score = []*300000000
cause total_score to be instantiated as an empty list. It doesn't even have a 0th index in this case! If you'd like to initialize every value to x in a list of length l, the syntax would look more like:
my_list = [x]*l
Alternatively, instead of thinking about the size before-hand, you can use .append instead of trying to access a particular index, as in:
my_list = []
my_list.append(200)
# my_list is now [200], my_list[0] is now 200
my_list.append(300)
# my_list is now [200,300], my_list[0] is still 200 and my_list[1] is now 300
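Applied to the scoring problem in the question, the append-style approach might look like the sketch below. It is an assumption-laden rewrite, not the asker's code: random_answers is a hypothetical helper that replaces both blindwalk() and gabarite(), and each student's score is built in one step with sum() instead of being incremented by index.

```python
import random

CHOICES = ["a", "b", "c", "d"]
QUESTIONS = 21

def random_answers():
    """One random answer per question (used for both the key and the students)."""
    return [random.choice(CHOICES) for _ in range(QUESTIONS)]

def class_total(students):
    """Return a list with one score per student."""
    answer_key = random_answers()
    tests = [random_answers() for _ in range(int(students))]
    # For each test, count the positions where the answer matches the key.
    return [sum(given == correct for given, correct in zip(test, answer_key))
            for test in tests]

scores = class_total(5)
print(scores)
```

The list comprehension creates each element as it goes, so there is never an attempt to index a position that does not exist yet.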

Importing and analysing text data using Python 2.7

I have created code in Python 2.7 which saves sales data for various products into a text file using the write() method. My limited Python skills have hit the wall with the next step - I need code which can read this data from the text file and then calculate and display the mean average number of sales of each item. The data is stored in the text file like the data shown below (but I am able to format it differently if that would help).
Product A,30
Product B,26
Product C,4
Product A,40
Product B,18
Product A,31
Product B,13
Product C,3
After far too long Googling around this to no avail, any pointers on the best way to manage this would be greatly appreciated. Thanks in advance.
You can read the file, then split each line on the space (' ') and then on the comma. Then it is just a matter of building a dictionary, appending each new number to the list stored under each product's letter key, and using sum and len to get the average.
Example
products = {}
with open("myfile.txt") as product_info:
    data = product_info.read().split('\n') #Split by line
for item in data:
    if not item.strip():
        continue #Skip blank lines, e.g. after a trailing newline
    _temp = item.split(' ')[1].split(',')
    if _temp[0] not in products.keys():
        products[_temp[0]] = [int(_temp[1])] #Store numbers, not strings, so sum() works
    else:
        products[_temp[0]] = products[_temp[0]]+[int(_temp[1])]
product_list = [[item, float(sum(key))/len(key)] for item, key in products.items()]
product_list.sort(key=lambda x:x[0])
for item in product_list:
    print 'The average of {} is {}'.format(item[0], item[1])
from __future__ import division
dict1 = {}
dict2 = {}
file1 = open("input.txt",'r')
for line in file1:
    if len(line)>2:
        data = line.split(",")
        a,b = data[0].strip(),data[1].strip()
        if a in dict1:
            dict1[a] = dict1[a] + int(b)
        else:
            dict1[a] = int(b)
        if a in dict2:
            dict2[a] = dict2[a] + 1
        else:
            dict2[a] = 1
for k,v in dict1.items():
    for m,n in dict2.items():
        if k == m:
            avg = float(v/n)
            print "%s Average is: %0.6f"%(k,float(avg))
Output:
Product A Average is: 33.666667
Product B Average is: 19.000000
Product C Average is: 3.500000
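For comparison, here is a Python 3 sketch of the same grouping-and-averaging idea using collections.defaultdict. The sample data from the question is inlined so the snippet runs without a file; with a real file you would iterate over the open file object instead. The variable names are illustrative only.

```python
from collections import defaultdict

# The sample data from the question, inlined so the sketch runs without a file.
lines = """Product A,30
Product B,26
Product C,4
Product A,40
Product B,18
Product A,31
Product B,13
Product C,3
""".splitlines()

sales = defaultdict(list)
for line in lines:
    if not line.strip():
        continue  # ignore blank lines
    # Split on the last comma, so product names may themselves contain commas.
    product, count = line.rsplit(",", 1)
    sales[product.strip()].append(int(count))

averages = {product: sum(counts) / len(counts) for product, counts in sales.items()}
for product in sorted(averages):
    print("{} average is: {:.6f}".format(product, averages[product]))
```

Keeping the individual sales in a list per product means a single dictionary is enough, and other statistics (minimum, maximum, count) can be computed later from the same data.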