Importing and analysing text data using Python 2.7


I have created code in Python 2.7 which saves sales data for various products into a text file using the write() method. My limited Python skills have hit the wall with the next step - I need code which can read this data from the text file and then calculate and display the mean average number of sales of each item. The data is stored in the text file like the data shown below (but I am able to format it differently if that would help).
Product A,30
Product B,26
Product C,4
Product A,40
Product B,18
Product A,31
Product B,13
Product C,3
After far too long Googling around this to no avail, any pointers on the best way to manage this would be greatly appreciated. Thanks in advance.

You can read the file and split each line on the comma (','). Then it is just a matter of building a dictionary that maps each product name to a list of its sales figures, and using sum and len to get the average.
Example
products = {}
with open("myfile.txt") as product_info:
    for line in product_info:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, count = line.split(',')
        # collect every sales figure for this product as an int
        products.setdefault(name, []).append(int(count))
product_list = [[name, float(sum(sales)) / len(sales)] for name, sales in products.items()]
product_list.sort(key=lambda x: x[0])
for item in product_list:
    print('The average of {} is {}'.format(item[0], item[1]))

from __future__ import division
dict1 = {}  # product -> total sales
dict2 = {}  # product -> number of entries
with open("input.txt", 'r') as file1:
    for line in file1:
        if len(line) > 2:
            data = line.split(",")
            a, b = data[0].strip(), data[1].strip()
            if a in dict1:
                dict1[a] = dict1[a] + int(b)
            else:
                dict1[a] = int(b)
            if a in dict2:
                dict2[a] = dict2[a] + 1
            else:
                dict2[a] = 1
# both dicts share the same keys, so look the count up directly
for k, v in sorted(dict1.items()):
    avg = v / dict2[k]
    print("%s Average is: %0.6f" % (k, avg))
Output:
Product A Average is: 33.666667
Product B Average is: 19.000000
Product C Average is: 3.500000
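The running-total idea used above can also be sketched as a small function. This is only an illustration: the name mean_sales and the in-memory sample list are assumptions made for the example, and the lines could equally come from an open file object.

```python
from collections import defaultdict

def mean_sales(lines):
    """Return {product: mean sales} for 'name,count' lines; blank lines are skipped."""
    totals = defaultdict(int)   # product -> sum of sales
    counts = defaultdict(int)   # product -> number of entries
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, count = line.rsplit(",", 1)   # split on the last comma only
        name = name.strip()
        totals[name] += int(count)
        counts[name] += 1
    return {name: totals[name] / float(counts[name]) for name in totals}

# Sample data copied from the question
sample = [
    "Product A,30", "Product B,26", "Product C,4",
    "Product A,40", "Product B,18",
    "Product A,31", "Product B,13", "Product C,3",
]
averages = mean_sales(sample)
for name in sorted(averages):
    print("{} average: {:.2f}".format(name, averages[name]))
```

Splitting on the last comma keeps product names that themselves contain a comma intact, which the simple split(",") versions would not handle.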

Related

Counting matrix pairs using a threshold

I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get an array or a matrix I can plot etc.
I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.
Secondly, I need a list of these pairs based on file names.
So the output for the example below would look like:
1
and
["doc1", "doc4"]
Will really appreciate your help as I feel a bit lost not knowing which direction to go.
This is an example of my script to get the matrix:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]
# In my real script I iterate through a folder (path) with txt files like this:
#def read_text(path):
#    documents = []
#    for filename in glob.iglob(path+'*.txt'):
#        _file = open(filename, 'r')
#        text = _file.read()
#        documents.append(text)
#    return documents
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
Out:
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])

As I understand your question, you want a function that takes the output numpy array and a threshold and returns two things:
how many pairs of documents have a similarity greater than or equal to the threshold
the names of the documents in those pairs.
So I've written the following function, which takes three arguments:
the output numpy array from the cos_similarity() function
a list of document names
a number (the threshold).
Here it is:
def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        lst = [row+1+idx for idx, num in \
               enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append( (docs_names[row], docs_names[item]) )
    return len(output_tuples), output_tuples
Let's see it in action:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
Let's see how this function works:
First, I iterate over every row of the numpy array.
Second, I iterate over every item in the row whose index is bigger than the row's index, so we only visit the upper triangle of the matrix. That's because each pair of documents appears twice in the whole array: the two values arr[0][1] and arr[1][0] are the same. Notice also that the diagonal items aren't included, because we know for sure that they are 1, as every document is identical to itself :).
Finally, we keep the items whose values are greater than or equal to the given threshold and return their indices. These indices are used later to look up the document names.
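The same upper-triangle walk can be sketched without explicit loops using numpy.triu_indices. This is an alternative illustration rather than the function given above; the name pairs_above is an assumption, and the matrix is the output shown in the question.

```python
import numpy as np

def pairs_above(sim, names, threshold):
    """Return (count, pairs) for off-diagonal similarities >= threshold."""
    sim = np.asarray(sim)
    # k=1 selects the strict upper triangle, so the diagonal and the
    # mirrored lower half are never visited
    rows, cols = np.triu_indices(len(sim), k=1)
    keep = sim[rows, cols] >= threshold
    pairs = [(names[i], names[j]) for i, j in zip(rows[keep], cols[keep])]
    return len(pairs), pairs

# Similarity matrix copied from the question's output
sim = [
    [1.0,        0.1459739,  0.03613371, 0.76357693],
    [0.1459739,  1.0,        0.11459266, 0.19117117],
    [0.03613371, 0.11459266, 1.0,        0.04732164],
    [0.76357693, 0.19117117, 0.04732164, 1.0],
]
names = ["doc1", "doc2", "doc3", "doc4"]
print(pairs_above(sim, names, 0.5))
```

For a folder with hundreds of files this avoids a Python-level loop over every cell, at the cost of building the index arrays in memory.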

summing up a column in a csv file based on user search

I have the following csv file:
data.csv
school,students,teachers,subs
us-school1,10,2,0
us-school2,20,4,2
uk-school1,10,2,0
de-school1,10,3,1
de-school1,15,3,3
I am trying to have a user search for the school country (us, uk, or de)
and then sum up the corresponding columns (e.g. sum all students in us-*, etc.).
So far I am able to search using raw_input and display the column contents corresponding to the country. I would appreciate it if someone could give me some pointers on how I can achieve this.
desired output:
Country: us
Total students: 30
Total teachers: 6
Total subs: 2
--
import csv
import re
search = raw_input('Enter school (e.g. us): ')
with open('data.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        school = row['school']
        students = row['students']
        teachers = row['teachers']
        sub = row['subs']
        if re.match(search, school) is not None:
            print students

That's relatively easy to do: all you need is a dict to group your countries, and then you just add together all of the values:
import collections
import csv
result = {}  # store the results
with open("data.csv", "rb") as f:  # open our file
    reader = csv.DictReader(f)  # use csv.DictReader for convenience
    for row in reader:
        country = row.pop("school")[:2]  # get our country
        result[country] = result.get(country, collections.defaultdict(int))  # country group
        for column in row:  # loop through all other columns
            result[country][column] += int(row[column])  # add them together
# Now you can use or print your result by country:
for country in result:
    print("Country: {}".format(country))
    print("Total students: {}".format(result[country].get("students", 0)))
    print("Total teachers: {}".format(result[country].get("teachers", 0)))
    print("Total subs: {}\n".format(result[country].get("subs", 0)))
This is also universal as you can add additional number columns (e.g. janitors :D) and it will happily sum them together, but keep in mind that it works only with integers (if you want floats, replace the references to int with float) and it expects that every field except school is a number.
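To tie the grouping back to the user's search, here is a minimal sketch that sums only the rows for one country prefix. The function name totals_for is an assumption, the in-memory SAMPLE stands in for data.csv, and the example is written for Python 3 (so input() would replace raw_input for the prompt).

```python
import csv
import io
from collections import defaultdict

# Contents of data.csv from the question, held in memory for the example
SAMPLE = """school,students,teachers,subs
us-school1,10,2,0
us-school2,20,4,2
uk-school1,10,2,0
de-school1,10,3,1
de-school1,15,3,3
"""

def totals_for(csvfile, prefix):
    """Sum every numeric column for schools whose name starts with '<prefix>-'."""
    totals = defaultdict(int)
    for row in csv.DictReader(csvfile):
        if row["school"].startswith(prefix + "-"):
            for column, value in row.items():
                if column != "school":
                    totals[column] += int(value)
    return dict(totals)

us = totals_for(io.StringIO(SAMPLE), "us")
print("Country: us")
for column in ("students", "teachers", "subs"):
    print("Total {}: {}".format(column, us[column]))
```

Using startswith on the prefix plus the hyphen avoids the accidental partial matches that a bare regular expression search could produce.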

Your problem could be solved with something like this:
import csv
search = raw_input('Enter school (e.g. us): ')
with open('data.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    result_countrys = {}
    for row in reader:
        school = row['school']
        students = int(row['students'])
        teachers = int(row['teachers'])
        subs = int(row['subs'])
        country = school[:2]  # first two letters are the country code
        if country in result_countrys:
            count = result_countrys[country]
            count['students'] = count['students'] + students
            count['teachers'] = count['teachers'] + teachers
            count['subs'] = count['subs'] + subs
        else:
            dic = {}
            dic['students'] = students
            dic['teachers'] = teachers
            dic['subs'] = subs
            result_countrys[country] = dic
for k, v in result_countrys[search].iteritems():
    print("country " + str(search) + " has " + str(v) + " " + str(k))
I tried it out with this set of values (the first three rows of the question's data):
reader = [{'school': 'us-school1', 'students': 10, 'teachers': 2, 'subs': 0}, {'school': 'us-school2', 'students': 20, 'teachers': 4, 'subs': 2}, {'school': 'uk-school1', 'students': 10, 'teachers': 2, 'subs': 0}]
and the result is:
Enter school (e.g. us): us
country us has 30 students
country us has 6 teachers
country us has 2 subs

Iterating through a .txt file in an odd way

What I am trying to do is write a program that opens a .txt file with movie reviews where the rating is a number from 0-4 followed by a short review of the movie. The program then prompts the user to open a second text file with words that will be matched against the reviews and given a number value based on the review.
For example, with these two sample reviews how they would appear in the .txt file:
4 A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis .
2 Massoud 's story is an epic , but also a tragedy , the record of a tenacious , humane fighter who was also the prisoner -LRB- and ultimately the victim -RRB- of history .
So, if I were looking for the word "epic", it would increment the count for that word by 2 (which I already have figured out) since it appears twice, and then append the values 4 and 2 to a list of ratings for that word.
How do I append those ints to a list or dictionary related to that word? Keep in mind that I need to create a new list or dictionary key for every word in a list of words.
Please and thank you. And sorry if this was poorly worded, programming isn't my forte.
All of my code:
def menu_validate(prompt, min_val, max_val):
    """ produces a prompt, gets input, validates the input and returns a value. """
    while True:
        try:
            menu = int(input(prompt))
            if menu >= min_val and menu <= max_val:
                return menu
                break
            elif menu.lower == "quit" or menu.lower == "q":
                quit()
            print("You must enter a number value from {} to {}.".format(min_val, max_val))
        except ValueError:
            print("You must enter a number value from {} to {}.".format(min_val, max_val))
def open_file(prompt):
    """ opens a file """
    while True:
        try:
            file_name = str(input(prompt))
            if ".txt" in file_name:
                input_file = open(file_name, 'r')
                return input_file
            else:
                input_file = open(file_name+".txt", 'r')
                return input_file
        except FileNotFoundError:
            print("You must enter a valid file name. Make sure the file you would like to open is in this programs root folder.")
def make_list(file):
    lst = []
    for line in file:
        lst2 = line.split(' ')
        del lst2[-1]
        lst.append(lst2)
    return lst
def rating_list(lst):
    '''iterates through a list of lists and appends the first value in each list to a second list'''
    rating_list = []
    for list in lst:
        rating_list.append(list[0])
    return rating_list
def word_cnt(lst, word : str):
    cnt = 0
    for list in lst:
        for word in list:
            cnt += 1
    return cnt
def words_list(file):
    lst = []
    for word in file:
        lst.append(word)
    return lst
##def sort(words, occurrences, avg_scores, std_dev):
## '''sorts and prints the output'''
## menu = menu_validate("You must choose one of the valid choices of 1, 2, 3, 4 \n Sort Options\n 1. Sort by Avg Ascending\n 2. Sort by Avg Descending\n 3. Sort by Std Deviation Ascending\n 4. Sort by Std Deviation Descending", 1, 4)
## print ("{}{}{}{}\n{}".format("Word", "Occurence", "Avg. Score", "Std. Dev.", "="*51))
## if menu == 1:
## for i in range (len(word_list)):
## print ("{}{}{}{}".format(cnt_list.sorted[i],)
def make_odict(lst1, lst2):
    '''makes an ordered dictionary of keys/values from 2 lists of equal length'''
    dic = OrderedDict()
    for i in range (len(word_list)):
        dic[lst2[i]] = lst2[i]
    return dic
cnt_list = []
while True:
    menu = menu_validate("1. Get sentiment for all words in a file? \nQ. Quit \n", 1, 1)
    if menu == True:
        ratings_file = open("sample.txt")
        ratings_list = make_list(ratings_file)
        word_file = open_file("Enter the name of the file with words to score \n")
        word_list = words_list(word_file)
        for word in word_list:
            cnt = word_cnt(ratings_list, word)
            cnt_list.append(word_cnt(ratings_list, word))
Sorry, I know it's messy and very incomplete.

I think you mean:
import collections
counts = collections.defaultdict(int)
word = 'epic'
counts[word] += 1
Obviously, you can do more with word than I have, but you aren't showing us any code, so ...
EDIT
Okay, looking at your code, I'd suggest you make the separation between rating and text explicit. Take this:
def make_list(file):
    lst = []
    for line in file:
        lst2 = line.split(' ')
        del lst2[-1]
        lst.append(lst2)
    return lst
And convert it to this:
def parse_ratings(file):
    """
    Given a file of lines, each with a numeric rating at the start,
    parse the lines into score/text tuples, one per line. Return the
    list of parsed tuples.
    """
    ratings = []
    for line in file:
        text = line.strip().split()
        if text:
            score = int(text[0])  # store the rating as an int, not a string
            ratings.append((score, text[1:]))
    return ratings
Then you can compute both values together:
def match_reviews(word, ratings):
    cnt = 0
    scores = []
    for score, text in ratings:
        n = text.count(word)
        if n:
            cnt += n
            scores.append(score)
    return (cnt, scores)
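To keep one list of ratings per word, as the question asks, a defaultdict(list) keyed by word is one option. The sketch below is an illustration only: ratings_by_word is a hypothetical name, the two reviews are shortened versions of the question's samples, and the search words are assumed to be lowercase.

```python
from collections import defaultdict

def ratings_by_word(lines, words):
    """Map each word to the ratings of the reviews in which it appears."""
    ratings = defaultdict(list)
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        score = int(tokens[0])                       # leading rating, 0-4
        text = set(t.lower() for t in tokens[1:])    # unique words in the review
        for word in words:
            if word in text:
                ratings[word].append(score)
    return dict(ratings)

# Shortened versions of the two sample reviews from the question
reviews = [
    "4 A comedy-drama of nearly epic proportions rooted in a sincere performance .",
    "2 Massoud 's story is an epic , but also a tragedy .",
]
result = ratings_by_word(reviews, ["epic", "tragedy", "boring"])
print(result)
```

A word that never appears simply has no key in the result, so callers can use result.get(word, []) when they need an empty list instead.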

Printing Results from Loops

I currently have a piece of code that works in two segments. The first segment opens an existing text file from a specific path on my local drive and then arranges its contents, based on certain indices, into a list of sub-lists. In the second segment I take the sub-lists I have created and group them on a shared index to simplify them (this starts at def merge_subs). I am getting no error, but I am not receiving a result when I try to print the variable answer. Am I not correctly looping over the original list of sub-lists? Ultimately I would like to have a variable that contains the final product from these loops so that I can write its contents to a new text file. Here is the code I am working with:
from itertools import groupby, chain
from operator import itemgetter
with open ("somepathname") as g:
    # reads text from lines and turns them into a list sub-lists
    lines = g.readlines()
    for line in lines:
        matrix = line.split()
        JD = matrix [2]
        minTime= matrix [5]
        maxTime= matrix [7]
        newLists = [JD,minTime,maxTime]
    L = newLists
def merge_subs(L):
    dates = {}
    for sub in L:
        date = sub[0]
        if date not in dates:
            dates[date] = []
        dates[date].extend(sub[1:])
    answer = []
    for date in sorted(dates):
        answer.append([date] + dates[date])
new code
def openfile(self):
    filename = askopenfilename(parent=root)
    self.lines = open(filename)
def simplify(self):
    g = self.lines.readlines()
    for line in g:
        matrix = line.split()
        JD = matrix[2]
        minTime = matrix[5]
        maxTime = matrix[7]
        self.newLists = [JD, minTime, maxTime]
        print(self.newLists)
    dates = {}
    for sub in self.newLists:
        date = sub[0]
        if date not in dates:
            dates[date] = []
        dates[date].extend(sub[1:])
    answer = []
    for date in sorted(dates):
        print(answer.append([date] + dates[date]))
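Three things in the code as posted would explain the missing output: the loop overwrites newLists on every pass instead of collecting the rows, merge_subs is never called and has no return statement, and list.append returns None, so printing its result shows nothing useful. The following is a minimal sketch under those assumptions; the sample lines are hypothetical, with the date in column 2 and the times in columns 5 and 7 as in the question.

```python
def read_rows(lines):
    """Collect [JD, minTime, maxTime] from every line (columns 2, 5 and 7)."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) > 7:          # skip lines that are too short
            rows.append([fields[2], fields[5], fields[7]])
    return rows

def merge_subs(rows):
    """Group the times of all rows that share the same date."""
    dates = {}
    for sub in rows:
        dates.setdefault(sub[0], []).extend(sub[1:])
    # return the merged result so the caller can print or save it
    return [[date] + dates[date] for date in sorted(dates)]

# Hypothetical input lines, invented for the example
sample = [
    "a b 2451545 c d 10:00 e 11:00",
    "a b 2451545 c d 12:00 e 13:00",
    "a b 2451546 c d 09:00 e 09:30",
]
answer = merge_subs(read_rows(sample))
for row in answer:
    print(" ".join(row))
```

Because merge_subs returns the list, the same answer variable can be written to a new text file with one line per merged date.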

Adding 2 lists inside a dictionary

I've been trying to add the numbers of 2 lists inside a dictionary. The thing is, I need to check whether the value in the selected row and column is already in the dictionary; if so, I want to add the two-entry list to the value (another two-entry list) already in the dictionary. I'm using an Excel spreadsheet + xlrd so I can read it. I'm pretty new to this.
For example, the code checks the account (a number) in the specified row and column. Let's say the value is 10; if it's not in the dictionary, it adds the 2 values corresponding to this account, let's say [100, 0], as the value for this key. This is working as intended.
Now, the hard part is when the account number is already in the dictionary. Let's say it's the second entry for account number 10, and it's [50, 20]. I want the value associated with the key "10" to be [150, 20].
I've tried the zip method, but it seems to return random results: sometimes it adds up, sometimes it doesn't.
import xlrd
book = xlrd.open_workbook("Entry.xls")
print ("The number of worksheets is", book.nsheets)
print ("Worksheet name(s):", book.sheet_names())
sh = book.sheet_by_index(0)
print (sh.name,"Number of rows", sh.nrows,"Number of cols", sh.ncols)
liste_compte = {}
for rx in range(4, 10):
    if (sh.cell_value(rowx=rx, colx=4)) not in liste_compte:
        liste_compte[((sh.cell_value(rowx=rx, colx=4)))] = [sh.cell_value(rowx=rx, colx=6), sh.cell_value(rowx=rx, colx=7)]
    elif (sh.cell_value(rowx=rx, colx=4)) in liste_compte:
        three = [x + y for x, y in zip(liste_compte[sh.cell_value(rowx=rx, colx=4)],[sh.cell_value(rowx=rx, colx=6), sh.cell_value(rowx=rx, colx=7)])]
        liste_compte[(sh.cell_value(rowx=rx, colx=4))] = three
print (liste_compte)

I'm not going to directly untangle your code, but just help you with a general example that does what you want:
def update_balance(existing_balance, new_balance):
    # add the new amounts to the stored list in place, column by column
    for column in range(len(existing_balance)):
        existing_balance[column] += new_balance[column]
def update_account(accounts, account_number, new_balance):
    if account_number in accounts:
        update_balance(existing_balance = accounts[account_number], new_balance = new_balance)
    else:
        accounts[account_number] = list(new_balance)  # store a copy
And finally you'd do something like this (assuming your xls looks like [account_number, balance 1, balance 2]):
accounts = dict()
for row in xls:
    update_account(accounts = accounts,
                   account_number = row[0],
                   new_balance = row[1:3])  # both balance columns
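A self-contained run of the same idea, using the figures from the question, might look like the sketch below. The rows list stands in for the spreadsheet and is an assumption, as is the extra account 11, which is invented for the example; with xlrd the values would come from sh.cell_value instead.

```python
def update_account(accounts, account_number, new_balance):
    """Add new_balance to the stored pair, or store a copy on first sight."""
    if account_number in accounts:
        existing = accounts[account_number]
        for column, amount in enumerate(new_balance):
            existing[column] += amount
    else:
        # copy, so later in-place updates never modify the caller's list
        accounts[account_number] = list(new_balance)

# Each row is [account_number, balance 1, balance 2]
rows = [
    [10, 100, 0],
    [11, 5, 5],     # hypothetical second account
    [10, 50, 20],
]
accounts = {}
for row in rows:
    update_account(accounts, row[0], row[1:3])
print(accounts[10])
```

Note the slice row[1:3]: a slice stops before its end index, so row[1:2] would pass only the first balance.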
