I've been trying to wrap my head around this for a while now. I am trying to take a CSV file, extract all rows, concatenate two values (latitude and longitude) from each row, use them to calculate the distance from a third value that is not in the CSV, and store that distance alongside the rest of the row's data. Finally, I need to find the shortest distance and return a dict with all the values I have not used yet.
import csv
with open(filename, 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    # create a multi-dimensional dictionary with the store name as keyword
    new_dict = {}
    for row in reader:
        new_dict[row['name']] = {}
        new_dict[row['name']]['name'] = row['name']
        new_dict[row['name']]['dist'] = {}
        new_dict[row['name']]['address'] = row['address']
        new_dict[row['name']]['city'] = row['city']
        new_dict[row['name']]['state'] = row['state']
        new_dict[row['name']]['zip'] = row['zip']
        latt = str(row['latitude'])
        longi = str(row['longitude'])
        # concatenate latt and longi for use in great-circle distance calculation
        pharm_loc = latt + ',' + longi
        # add distance from usr_loc for each store to dict for each store
        new_dict[row['name']]['dist'] = str(calc_dist(usr_loc, pharm_loc))
I finally got this part fixed. Now I need help filtering out all but the closest result for 'dist'. I cannot seem to wrap my head around this for some reason. Any help would be greatly appreciated.
---EDIT---
Updated code that is working now. This produces a multidimensional dict as follows:
{'CONTINENTAL ': {'city': 'TOPEKA', 'dist': '50.3131329882', 'name': 'CONTINENTAL PHARMACY LLC', 'zip': '66603', 'state': 'KS', 'address': '821 SW 6TH AVE'}, 'DILLON ': {'city': 'TOPEKA', 'dist': '48.3573823197', 'name': 'DILLON PHARMACY', 'zip': '66605', 'state': 'KS', 'address': '2010 SE 29TH ST'}}
There are a lot more entries in the dict; I just need to filter down to the closest location and return only the values for that location.
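One possible way to pick the closest entry, sketched under the assumption that every inner dict has a 'dist' string that float() can parse (as in the sample above). The helper name closest_store is made up for illustration. Converting to float matters because comparing the strings directly would sort '9.5' after '10.2'.

```python
# Sample data copied from the dict shown above; 'dist' values are strings.
stores = {
    'CONTINENTAL ': {'city': 'TOPEKA', 'dist': '50.3131329882',
                     'name': 'CONTINENTAL PHARMACY LLC', 'zip': '66603',
                     'state': 'KS', 'address': '821 SW 6TH AVE'},
    'DILLON ': {'city': 'TOPEKA', 'dist': '48.3573823197',
                'name': 'DILLON PHARMACY', 'zip': '66605',
                'state': 'KS', 'address': '2010 SE 29TH ST'},
}


def closest_store(store_dict):
    """Return the inner dict with the smallest numeric 'dist' (hypothetical helper)."""
    if not store_dict:
        return None
    # Compare distances as numbers, not as strings.
    return min(store_dict.values(), key=lambda info: float(info['dist']))


nearest = closest_store(stores)
print('{} at {} ({})'.format(nearest['name'], nearest['address'], nearest['dist']))
```

The returned value is the whole inner dict, so the address, city, state and zip come along with the winning distance.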
I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get an array or a matrix I can plot etc.
I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.
Secondly, I need a list of these pairs based on file names.
So the output for the example below would look like:
1
and
["doc1", "doc4"]
Will really appreciate your help as I feel a bit lost not knowing which direction to go.
This is an example of my script to get the matrix:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]
# In my real script I iterate through a folder (path) with txt files like this:
#def read_text(path):
#    documents = []
#    for filename in glob.iglob(path+'*.txt'):
#        _file = open(filename, 'r')
#        text = _file.read()
#        documents.append(text)
#    return documents
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
Out:
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
As I understand your question, you want a function that takes the output numpy array and a threshold and returns two things:
how many pairs of docs have a similarity greater than or equal to the given threshold
the names of the docs in those pairs.
So I've written the following function, which takes three arguments:
the output numpy array from the cos_similarity() function.
a list of document names.
a number (the threshold).
Here it is:
def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        lst = [row+1+idx for idx, num in
               enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append( (docs_names[row], docs_names[item]) )
    return len(output_tuples), output_tuples
Let's see it in action:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
Let's see how this function works:
First, I iterate over every row of the numpy array.
Second, I iterate over every item in the row whose column index is greater than the row's index. In other words, we only visit the upper triangle of the matrix (the entries above the diagonal). That's because each pair of documents appears twice in the whole array: the two values arr[0][1] and arr[1][0] are the same. Notice also that the diagonal items aren't included, because we know for sure that they are 1, as every document is identical to itself :).
Finally, we keep the items whose values are greater than or equal to the given threshold and collect their column indices. These indices are then used to look up the document names.
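The same upper-triangle walk can also be written with itertools.combinations, which yields each index pair (i, j) with i < j exactly once. This is only a sketch: the name similar_pairs is made up, and the matrix is the example output from above typed in as a plain nested list (indexing with matrix[i][j] should work for a numpy array too).

```python
from itertools import combinations


def similar_pairs(matrix, names, threshold):
    """Return (name_i, name_j, score) for every pair i < j with score >= threshold."""
    pairs = []
    # combinations(range(n), 2) covers only the entries above the diagonal.
    for i, j in combinations(range(len(names)), 2):
        score = matrix[i][j]
        if score >= threshold:
            pairs.append((names[i], names[j], score))
    return pairs


# Example similarity values copied from the output shown earlier.
matrix = [
    [1.0, 0.1459739, 0.03613371, 0.76357693],
    [0.1459739, 1.0, 0.11459266, 0.19117117],
    [0.03613371, 0.11459266, 1.0, 0.04732164],
    [0.76357693, 0.19117117, 0.04732164, 1.0],
]
names = ["doc1", "doc2", "doc3", "doc4"]

pairs = similar_pairs(matrix, names, 0.5)
print('{} pair(s): {}'.format(len(pairs), pairs))
```

Keeping the score in each tuple makes it easy to sort the pairs by similarity afterwards.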
I need to return multiple calculated columns for each row of a pandas dataframe.
This error: ValueError: Shape of passed values is (4, 2), indices imply (4, 3) is raised when the apply function is executed in the following code snippet:
import pandas as pd
my_df = pd.DataFrame({
'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
'url': ['http://www.something', 'http://www.somethingelse', 'http://www.foo', 'http://www.bar' ],
'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
})
my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
my_df.sort_values(['datetime_stuff'], inplace=True)
print(my_df.head())
def calculate_stuff(row):
    if row['url'].startswith('http'):
        categories = row['categories'] if type(row['categories']) == list else []
        calculated_column_x = row['url'] + '_other_stuff_'
    else:
        calculated_column_x = None
    another_column = 'deduction_from_fields'
    return calculated_column_x, another_column
print(my_df.shape)
my_df['calculated_column_x'], my_df['another_column'] = zip(*my_df.apply(calculate_stuff, axis=1))
Each row of the dataframe I am working on is more complicated than the example above, and the function calculate_stuff I am applying is using many different columns for each row, then returning multiple new columns.
However, the previous example still raises this ValueError related to the shape of the dataframe that I am not able to understand how to fix.
How can I create multiple new columns (for each row) that are calculated from the existing columns?
When you return a list or tuple from a function that is being applied, pandas attempts to shoehorn it back into the dataframe you ran apply over. Instead, return a series.
Reconfigured Code
my_df = pd.DataFrame({
'datetime_stuff': ['2012-01-20', '2012-02-16', '2012-06-19', '2012-12-15'],
'url': ['http://www.something', 'http://www.somethingelse', 'http://www.foo', 'http://www.bar' ],
'categories': [['foo', 'bar'], ['x', 'y', 'z'], ['xxx'], ['a123', 'a456']],
})
my_df['datetime_stuff'] = pd.to_datetime(my_df['datetime_stuff'])
my_df.sort_values(['datetime_stuff'], inplace=True)
def calculate_stuff(row):
    if row['url'].startswith('http'):
        categories = row['categories'] if type(row['categories']) == list else []
        calculated_column_x = row['url'] + '_other_stuff_'
    else:
        calculated_column_x = None
    another_column = 'deduction_from_fields'
    # I changed this VVVV
    return pd.Series((calculated_column_x, another_column), ['calculated_column_x', 'another_column'])
my_df.join(my_df.apply(calculate_stuff, axis=1))
   categories    datetime_stuff  url                       calculated_column_x                    another_column
0  [foo, bar]    2012-01-20      http://www.something      http://www.something_other_stuff_      deduction_from_fields
1  [x, y, z]     2012-02-16      http://www.somethingelse  http://www.somethingelse_other_stuff_  deduction_from_fields
2  [xxx]         2012-06-19      http://www.foo            http://www.foo_other_stuff_            deduction_from_fields
3  [a123, a456]  2012-12-15      http://www.bar            http://www.bar_other_stuff_            deduction_from_fields
I am very new to coding and am seeking guidance on the problem below.
I have a csv output currently like this:
'Age, First Name, Last Name, Mark'
'21, John, Smith, 68'
'16, Alex, Jones, 52'
'42, Michael, Carpenter, 92 '
How do I create a dictionary that will end up looking like this:
dictionary = {('age' : 'First Name', 'Mark'), ('21' : 'John', '68'), etc}
I would like the first value to be the key - and only want two other values, and I'm having difficulty finding ways to approach this.
So far I've got
data = open('test.csv', 'r').read().split('\n')
I've tried to split each part into a string
for row in data:
    x = row.split(',')
EDIT:
Thank you to those who have given some input into solving my problem.
So after using
myDic = {}
for row in data:
    tmpLst = row.split(",")
    key = tmpLst[0]
    value = (tmpLst[1], tmpLst[-1])
    myDic[key] = value
my data came out as
['Age', 'First Name', 'Last Name', 'Mark']
['21', 'John', 'Smith', '68']
['16', 'Alex', 'Jones', '52']
['42', 'Michael', 'Carpenter', '92']
But I get an IndexError: list index out of range at the line
value = (tmpLst[1], tmpLst[-1])
even though I can see that it should be within the range of the index.
Does anyone know why this error is coming up or what needs to be changed?
Assuming an actual valid CSV file that looks like this:
Age,First Name,Last Name,Mark
21,John,Smith,68
16,Alex,Jones,52
42,Michael,Carpenter,92
the following code should do what you want:
from __future__ import print_function
import csv
with open('test.csv') as csv_file:
    reader = csv.reader(csv_file)
    d = { row[0]: (row[1], row[3]) for row in reader }
print(d)
# Output:
# {'Age': ('First Name', 'Mark'), '16': ('Alex', '52'), '21': ('John', '68'), '42': ('Michael', '92')}
If d = { row[0]: (row[1], row[3]) for row in reader } is confusing, consider this alternative:
d = {}
for row in reader:
    d[row[0]] = (row[1], row[3])
I guess you want output like this:
dictionary = {'age' : ('First Name', 'Mark')}
Then you can use the following code:
myDic = {}
for row in data:
    tmpLst = row.split(",")
    key = tmpLst[0]
    value = (tmpLst[1], tmpLst[-1])
    myDic[key] = value
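A likely cause of the IndexError in the question (an assumption, since the actual file isn't shown): if test.csv ends with a newline, read().split('\n') produces a trailing empty string, and ''.split(',') returns [''], which has no index 1. The sketch below simulates the file with a string and skips blank rows before indexing.

```python
# Simulated file contents; the trailing "\n" is an assumption about test.csv.
text = ("Age,First Name,Last Name,Mark\n"
        "21,John,Smith,68\n"
        "16,Alex,Jones,52\n"
        "42,Michael,Carpenter,92\n")

data = text.split('\n')
# data[-1] is '' because the text ends with a newline,
# and ''.split(',') gives [''], so tmpLst[1] would raise IndexError.

myDic = {}
for row in data:
    if not row.strip():
        continue  # skip blank lines instead of indexing into ['']
    # strip() also removes stray spaces around each field
    tmpLst = [field.strip() for field in row.split(',')]
    myDic[tmpLst[0]] = (tmpLst[1], tmpLst[-1])

print(myDic)
```

The csv module approach shown earlier sidesteps the problem as well, because iterating over the file line by line never yields that extra empty string.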
I have created code in Python 2.7 which saves sales data for various products into a text file using the write() method. My limited Python skills have hit the wall with the next step - I need code which can read this data from the text file and then calculate and display the mean average number of sales of each item. The data is stored in the text file like the data shown below (but I am able to format it differently if that would help).
Product A,30
Product B,26
Product C,4
Product A,40
Product B,18
Product A,31
Product B,13
Product C,3
I have spent far too long Googling this to no avail, so any pointers on the best way to manage this would be greatly appreciated. Thanks in advance.
You can read the file and split each line, first on the space (' ') and then on the comma, to get the product letter and the sales figure. Then it is just a matter of building a dictionary whose keys are the product letters and whose values are lists of sales figures, and using sum and len to get the average of each list.
Example
products = {}
with open("myfile.txt") as product_info:
    data = product_info.read().split('\n')  # Split by line
for item in data:
    if not item.strip():
        continue  # Skip blank lines, e.g. the one after a trailing newline
    _temp = item.split(' ')[1].split(',')  # 'Product A,30' -> ['A', '30']
    if _temp[0] not in products:
        products[_temp[0]] = [float(_temp[1])]  # Store numbers, not strings
    else:
        products[_temp[0]].append(float(_temp[1]))
product_list = [[item, sum(sales) / len(sales)] for item, sales in products.items()]
product_list.sort(key=lambda x: x[0])
for item in product_list:
    print('The average of {} is {}'.format(item[0], item[1]))
from __future__ import division
dict1 = {}  # running total of sales per product
dict2 = {}  # number of entries per product
file1 = open("input.txt",'r')
for line in file1:
    if len(line)>2:
        data = line.split(",")
        a,b = data[0].strip(),data[1].strip()
        if a in dict1:
            dict1[a] = dict1[a] + int(b)
        else:
            dict1[a] = int(b)
        if a in dict2:
            dict2[a] = dict2[a] + 1
        else:
            dict2[a] = 1
file1.close()
for k,v in dict1.items():
    avg = v / dict2[k]  # both dicts share the same keys
    print("%s Average is: %0.6f" % (k, avg))
Output:
Product A Average is: 33.666667
Product B Average is: 19.000000
Product C Average is: 3.500000
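For comparison, here is a compact variant of the same total-and-count idea using collections.defaultdict. It is a sketch, not a drop-in replacement: average_sales is a made-up name, and the lines are passed in as a list (taken from the sample data in the question) rather than read from a file.

```python
from collections import defaultdict


def average_sales(lines):
    """Map each product name to the mean of its sales figures (hypothetical helper)."""
    totals = defaultdict(float)  # sum of sales per product
    counts = defaultdict(int)    # number of entries per product
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last comma: 'Product A,30' -> ('Product A', ',', '30')
        name, _, amount = line.rpartition(',')
        name = name.strip()
        totals[name] += float(amount)
        counts[name] += 1
    return {name: totals[name] / counts[name] for name in totals}


# Sample data from the question.
sample = ["Product A,30", "Product B,26", "Product C,4", "Product A,40",
          "Product B,18", "Product A,31", "Product B,13", "Product C,3"]

for name, avg in sorted(average_sales(sample).items()):
    print("{} average is: {:.6f}".format(name, avg))
```

With a real file, passing the open file object as lines should work the same way, since iterating over it yields one line at a time.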
I am trying to create a dictionary from a list and tuple of tuples as illustrated below. I have to reverse map the tuples to the list and create a set of non-None column names.
Any suggestions on a Pythonic way to achieve the solution (the desired dictionary) are much appreciated.
MySQL table 'StateLog':
Name NY TX NJ
Amy 1 None 1
Kat None 1 1
Leo None None 1
Python code :
## Fetching data from MySQL table
#cursor.execute("select * from statelog")
#mydataset = cursor.fetchall()
## Fetching column names for mapping
#state_cols = [fieldname[0] for fieldname in cursor.description]
state_cols = ['Name', 'NY', 'TX', 'NJ']
mydataset = (('Amy', '1', None, '1'), ('Kat', None, '1', '1'), ('Leo', None, None, '1'))
temp = [zip(state_cols, each) for each in mydataset]
# Looks like I can't do a tuple comprehension for the following snippet : finallist = ((eachone[1], eachone[0]) for each in temp for eachone in each if eachone[1] if eachone[0] == 'Name')
for each in temp:
    for eachone in each:
        if eachone[1]:
            if eachone[0] == 'Name':
                k = eachone[1]
            print k, eachone[0]
print '''How do I get a dictionary in this format'''
print '''name_state = {"Amy": set(["NY", "NJ"]),
"Kat": set(["TX", "NJ"]),
"Leo": set(["NJ"])}'''
Output so far :
Amy Name
Amy NY
Amy NJ
Kat Name
Kat TX
Kat NJ
Leo Name
Leo NJ
Desired dictionary :
name_state = {"Amy": set(["NY", "NJ"]),
"Kat": set(["TX", "NJ"]),
"Leo": set(["NJ"])}
To be really honest, I would say your problem is that your code is becoming too cumbersome. Resist the temptation of "one-lining" it and create a function. Everything will become way easier!
mydataset = (
    ('Amy', '1', None, '1'),
    ('Kat', None, '1', '1'),
    ('Leo', None, None, '1')
)
def states(cols, data):
    """
    This function receives one of the tuples with data and returns a pair
    where the first element is the name from the tuple, and the second
    element is a set with all matched states. Well, at least *I* think
    it is more readable :)
    """
    name = data[0]
    states = set(state for state, value in zip(cols, data) if value == '1')
    return name, states
pairs = (states(state_cols, data) for data in mydataset)
# Since dicts can receive an iterator which yields pairs where the first one
# will become a key and the second one will become the value, I just pass
# the generator of all pairs to the dict constructor.
print dict(pairs)
The result is:
{'Amy': set(['NY', 'NJ']), 'Leo': set(['NJ']), 'Kat': set(['NJ', 'TX'])}
Looks like another job for defaultdict!
So let's create our defaultdict:
import collections
name_state = collections.defaultdict(set)
We now have a dictionary whose default value for any missing key is an empty set, so we can do something like this:
name_state['Amy'].add('NY')
From here, you just need to iterate over your data and add the right states to each name. Enjoy!
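The iteration described above might look like the following sketch, which reuses state_cols and mydataset from the question. It treats any truthy value as "has this state", which is an assumption about the data (NULL columns come back as None).

```python
import collections

# Column names and rows as given in the question.
state_cols = ['Name', 'NY', 'TX', 'NJ']
mydataset = (('Amy', '1', None, '1'),
             ('Kat', None, '1', '1'),
             ('Leo', None, None, '1'))

name_state = collections.defaultdict(set)
for row in mydataset:
    name = row[0]
    # Pair each state column with its value, skipping the 'Name' column.
    for state, value in zip(state_cols[1:], row[1:]):
        if value:  # None means the state is not set for this person
            name_state[name].add(state)

print(dict(name_state))
```

Converting back with dict(name_state) at the end is optional; it just avoids silently creating empty sets when a missing name is looked up later.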
You can do this as a dictionary comprehension (Python 2.7+):
from itertools import compress
name_state = {data[0]: set(compress(state_cols[1:], data[1:])) for data in mydataset}
or as a generator expression:
name_state = dict((data[0], set(compress(state_cols[1:], data[1:]))) for data in mydataset)