difflib.get_close_matches GET SCORE - python-2.7

I am trying to get the score of the best match using difflib.get_close_matches:
import difflib
best_match = difflib.get_close_matches(str,str_list,1)[0]
I know of the option to add 'cutoff' parameter, but couldn't find out how to get the actual score after setting the threshold.
Am I missing something? Is there a better solution to match unicode strings?

I found that difflib.get_close_matches is the simplest way to do fuzzy string matching, though there are a few more advanced libraries such as fuzzywuzzy, as mentioned in the comments.
But if you want to use difflib, you can use difflib.SequenceMatcher to get the score as follows:
import difflib
my_str = 'apple'
str_list = ['ape', 'fjsdf', 'aerewtg', 'dgyow', 'paepd']
best_match = difflib.get_close_matches(my_str,str_list,1)[0]
score = difflib.SequenceMatcher(None, my_str, best_match).ratio()
In this example, the best match between 'apple' and the list is 'ape' and the score is 0.75.
You can also loop through the list and compute all the scores to check:
for word in str_list:
    print "score for: " + my_str + " vs. " + word + " = " + str(difflib.SequenceMatcher(None, my_str, word).ratio())
For this example, you get the following:
score for: apple vs. ape = 0.75
score for: apple vs. fjsdf = 0.0
score for: apple vs. aerewtg = 0.333333333333
score for: apple vs. dgyow = 0.0
score for: apple vs. paepd = 0.4
Documentation for difflib can be found here: https://docs.python.org/2/library/difflib.html
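If you want both the match and its score in one call, a small helper keeps things tidy (a minimal sketch; the function name is my own):
import difflib

def best_match_with_score(word, candidates, cutoff=0.6):
    # Returns (match, score), or (None, 0.0) if nothing clears the cutoff.
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    if not matches:
        return None, 0.0
    return matches[0], difflib.SequenceMatcher(None, word, matches[0]).ratio()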

To answer the question: the usual route is to obtain the score for a match returned by get_close_matches() individually, like this:
match_ratio = difflib.SequenceMatcher(None, 'aple', 'apple').ratio()
Here's a way that increased speed in my case by about 10%.
I'm using get_close_matches() for spellchecking: it runs SequenceMatcher() under the hood, but normally strips the scores and returns just a list of matching strings.
With a small change in Lib/difflib.py (currently around line 736), the return value can instead be a dictionary with scores as values, so there is no need to run SequenceMatcher again on each list item to obtain its score ratio. In the examples I've shortened the output float values for clarity (e.g. 0.8888888888888888 to 0.889). Passing n=7 limits the return to the 7 highest-scoring items if there are more, which can matter when there are many candidates.
Current list-only return
In this example the result would normally be ['apple', 'staple', 'able', 'lapel'],
at the default cutoff of .6 when the parameter is omitted (as in Ben's answer, no judgement).
The change
in difflib.py is simple (the trailing comment preserves the original line):
return {v: k for (k, v) in result} # hack to return dict with scores instead of list, original was ... [x for score, x in result]
New dictionary return
includes scores like {'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667}
>>> to_match = 'aple'
>>> candidates = ['lapel', 'staple', 'zoo', 'able', 'apple', 'appealing']
Increasing the minimum score cutoff/threshold from .4 to .8:
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.4)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667, 'appealing': 0.461}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.7)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.8)
{'apple': 0.889, 'staple': 0.8}
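If you'd rather not patch the standard library, a small wrapper that mirrors what get_close_matches() does internally gives the same dictionary without touching difflib.py (a sketch modelled on the stdlib code; the function name is my own):
import difflib
import heapq

def get_close_matches_with_scores(word, possibilities, n=3, cutoff=0.6):
    # Same filtering as difflib.get_close_matches, but the ratios are kept.
    s = difflib.SequenceMatcher()
    s.set_seq2(word)
    result = []
    for x in possibilities:
        s.set_seq1(x)
        if (s.real_quick_ratio() >= cutoff and
                s.quick_ratio() >= cutoff and
                s.ratio() >= cutoff):
            result.append((s.ratio(), x))
    return {x: round(score, 3) for score, x in heapq.nlargest(n, result)}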


How to use for loop in list in python 3?

I have a question about writing for loops in python 3.
Basically, I don't understand how to write a for loop when I have a list whose members each contain two elements, like this one:
list1 = [("Berlin", 22), ("Zagreb", 30), ("New York", 25), ("Chicago", 20), ("Paris", 29)]
This is basically a list of cities and their temperatures in degrees Celsius, and I would like to create a new list with the same cities but their temperatures in Fahrenheit. The formula is:
F° = (9/5)*C° + 32
I don't understand how I am supposed to loop through this list when every member contains two elements.
Use a list comprehension:
list1 = [("Berlin", 22), ("Zagreb", 30), ("New York", 25), ("Chicago", 20), ("Paris", 29)]
list2 = [(city, 9/5 * temp + 32) for city, temp in list1]
print(list2)
# [('Berlin', 71.6), ('Zagreb', 86.0), ('New York', 77.0), ('Chicago', 68.0), ('Paris', 84.2)]
Here, you iterate through the list of tuples, unpacking each city name into city and each Celsius temperature into temp, and converting Celsius to Fahrenheit.
Without list comprehension:
list2 = []
for city, temp in list1:
    list2.append((city, 9/5 * temp + 32))
print(list2)
# [('Berlin', 71.6), ('Zagreb', 86.0), ('New York', 77.0), ('Chicago', 68.0), ('Paris', 84.2)]
The first thing you need to know is how to access the elements in a list of pairs. In your list1, the city is at position 0 and the temperature at position 1 of each item. Within a simple for loop, you can access them as follows. Remember to keep the indentation that marks the body of the for loop.
for element in list1:
    city = element[0]
    temp_c = element[1]
Then you can directly use temp_c to compute temperature in Fahrenheit (temp_f), within the loop.
temp_f = (9 / 5) * temp_c + 32
The next task is to append the calculated temp_f values, along with the city names, to a new list (list2).
list2.append((city, temp_f))
But before that you should define list2 (somewhere near where you define list1):
list2 = []
So that's it. You can check that it works using a print loop:
for element in list2:
    print(element)
This code could be written much more compactly; I expanded it to keep each step clear. Hope you got it.
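For reference, here are the pieces above assembled into one runnable snippet:
list1 = [("Berlin", 22), ("Zagreb", 30), ("New York", 25), ("Chicago", 20), ("Paris", 29)]
list2 = []

for element in list1:
    city = element[0]
    temp_c = element[1]
    temp_f = (9 / 5) * temp_c + 32
    list2.append((city, temp_f))

for element in list2:
    print(element)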

Python: referring to each duplicate item in a list by unique index

I am trying to extract particular lines from a txt output file. The lines I am interested in are a few lines above and a few below the key_string that I am using to search through the results. The key string is the same for each record.
fi = open('Inputfile.txt')
fo = open('Outputfile.txt', 'a')
lines = fi.readlines()
filtered_list = []
for item in lines:
    if item.startswith("key string"):
        filtered_list.append(lines[lines.index(item)-2])
        filtered_list.append(lines[lines.index(item)+6])
        filtered_list.append(lines[lines.index(item)+10])
        filtered_list.append(lines[lines.index(item)+11])
fo.writelines(filtered_list)
fi.close()
fo.close()
The output file contains the right lines for the first record, but they are repeated once for every record in the file. How can I update the indexing so it reads each individual record? I've tried to find a solution, but as a novice programmer I was struggling to use the enumerate() function or the collections package.
First of all, it would probably help if you said what exactly goes wrong with your code (a stack trace, it doesn't work at all, etc). Anyway, here are some thoughts. You can try to divide your problem into subproblems to make it easier to work with. In this case, let's separate finding the relevant lines from collecting them.
First, let's find the indexes of all the relevant lines.
key = "key string"
relevant = []
for i, item in enumerate(lines):
    if item.startswith(key):
        relevant.append(i)
enumerate is actually quite simple. It takes a list, and returns a sequence of (index, item) pairs. So, enumerate(['a', 'b', 'c']) returns [(0, 'a'), (1, 'b'), (2, 'c')].
What I had written above can be achieved with a list comprehension:
relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]
So, we have the indexes of the relevant lines. Now, let's collect them. You are interested in the line 2 lines before each one, and the lines 6, 10, and 11 lines after it. If one of your first lines contains the key, then you have a problem – a negative index such as lines[-1] refers to the last item! Also, you need to handle the situation in which your offset would take you past the end of the list: otherwise Python will raise an IndexError.
out = []
for r in relevant:
    for offset in (-2, 6, 10, 11):
        index = r + offset
        if 0 <= index < len(lines):
            out.append(lines[index])
You could also catch the IndexError, but that won't save us much typing, as we have to handle negative indexes anyway.
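For comparison, the exception-based version looks something like this (a sketch; the negative-index guard is still needed, since Python raises no error there):
out = []
for r in relevant:
    for offset in (-2, 6, 10, 11):
        index = r + offset
        if index < 0:
            continue  # a negative index would silently read from the end
        try:
            out.append(lines[index])
        except IndexError:
            pass  # offset ran past the end of the file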
The whole program would look like this:
key = "key string"

with open('Inputfile.txt') as fi:
    lines = fi.readlines()

relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]

out = []
for r in relevant:
    for offset in (-2, 6, 10, 11):
        index = r + offset
        if 0 <= index < len(lines):
            out.append(lines[index])

with open('Outputfile.txt', 'a') as fo:
    fo.writelines(out)
To get rid of duplicates you can cast the list to a set; example:
x = ['a', 'b', 'a']
y = set(x)
print(y)
will result in:
{'a', 'b'}
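Note that a set does not preserve the original order. If order matters, a common idiom (assuming Python 3.7+, where plain dicts preserve insertion order) is:
x = ['a', 'b', 'a']
y = list(dict.fromkeys(x))  # keeps first occurrences, in order
print(y)  # ['a', 'b']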

Search for item in a list of tuples

I have a list of tuples structured as per the below (data is example only):
[('aaa', 10), ('bbb', 10), ('ccc', 12), ('ddd', 12), ('eee', 14)]
I need to search the second item in each of the tuples (the number) to see if it exists in the list (e.g. search for 12 = found, search for 5 = not found).
Currently I am using the below, which works but may not be the best way in Python:
not_there = True
for a in final_set:
    if a[1] == episode_id:
        not_there = False
        break
What is the best / most efficient way in Python to do this?
Maybe you can try something like this, using the for/else construct (the else block runs only when the loop finishes without hitting break):
test = [('aaa', 10), ('bbb', 10), ('ccc', 12), ('ddd', 12), ('eee', 14)]
number = 10
for i in test:
    if number in i:
        print("Number {} found.".format(number))
        break
else:
    print("Number {} not found".format(number))
That works whether you're searching for the first element of the tuple (the 'aaa') or the second (the number).
Hope this helps.
What about this:
is_there = (len([item for item in final_set if item[1] == episode_id]) > 0)
Basically, [item for item in final_set if item[1] == episode_id] is a list comprehension expression which creates a list of the items in final_set such that item[1] == episode_id.
Then, you can check the length of the resulting list: if it is greater than 0, then something has been found.
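If you only need a yes/no answer, any() with a generator expression is arguably the most efficient way, since it stops at the first hit and builds no intermediate list (a sketch using the question's names):
is_there = any(item[1] == episode_id for item in final_set)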

Python: How to calculate tf-idf for a large data set

I have the following data frame df, which I converted from an SFrame:
URI name text
0 <http://dbpedia.org/resource/Digby_M... Digby Morrell digby morrell born 10 october 1979 i...
1 <http://dbpedia.org/resource/Alfred_... Alfred J. Lewy alfred j lewy aka sandy lewy graduat...
2 <http://dbpedia.org/resource/Harpdog... Harpdog Brown harpdog brown is a singer and harmon...
3 <http://dbpedia.org/resource/Franz_R... Franz Rottensteiner franz rottensteiner born in waidmann...
4 <http://dbpedia.org/resource/G-Enka> G-Enka henry krvits born 30 december 1974 i...
I have done the following:
from textblob import TextBlob as tb
import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

bloblist = []
for i in range(0, df.shape[0]):
    bloblist.append(tb(df.iloc[i, 2]))

for i, blob in enumerate(bloblist):
    print("Top words in document {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:3]:
        print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))
But this is taking a lot of time as there are 59000 documents.
Is there a better way to do it?
I was confused about this subject too, but I found a few solutions on the internet that use Spark. Here you can look at:
https://www.linkedin.com/pulse/understanding-tf-idf-first-principle-computation-apache-asimadi
On the other hand, I tried the following method and didn't get bad results. Maybe you want to try it:
I have a word list. This list contains each word and its count.
I found the average of these word counts.
I selected a lower limit and an upper limit based on the average value
(e.g. lower bound = average / 2 and upper bound = average * 5).
Then I created a new word list keeping only the words within the bounds.
With these I got this result:
Before normalization, word vector length: 11880
Mean: 19, lower bound: 9, upper bound: 95
After normalization, word vector length: 1595
And the cosine similarity results were better, too.
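If it helps, here is a minimal sketch of that pruning step, reusing the bloblist from the question (the /2 and *5 bound factors are just the examples suggested above):
from collections import Counter

# Corpus-wide count of every word across all documents.
word_counts = Counter(word for blob in bloblist for word in blob.words)

mean_count = sum(word_counts.values()) / len(word_counts)
lower, upper = mean_count / 2, mean_count * 5

# Keep only the words whose counts fall within the bounds and score just those.
pruned_vocab = {w for w, c in word_counts.items() if lower <= c <= upper}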

Python remove odd numbers and print only even

user = int(raw_input("Type 5 numbers"))
even = []

def purify(odd):
    for n in odd:
        even.append(n)
        if n % 2 > 0:
            print n

print purify(user)
Hello, I am a beginner and I would like to understand what is wrong with this code.
The user chooses 5 numbers and I want to print only the even ones.
Thanks for helping.
There are a few problems:
You can't apply int to an overall string, just to one integer at a time.
So if your numbers are space-separated, then you should split them into a list of strings. You can either convert them immediately after input, or wait and do it within your purify function.
Also, your purify function appends every value to the list even without testing it first.
Also, your test is backwards -- you are printing only odd numbers, not even.
Finally, you should return the value of even if you want to print it outside the function, instead of printing them as you loop.
I think this edited version should work.
user_raw = raw_input("Type some space-separated numbers")
user = user_raw.split()  # split() defaults to whitespace

def purify(odd):
    even = []
    for n in odd:
        if int(n) % 2 == 0:
            even.append(n)
    return even

print purify(user)
raw_input returns a single string, and a whole string of several numbers cannot be converted to an int at once.
You can use this:
user = raw_input("Input 5 numbers separated by commas: ").split(",")
user = [int(i) for i in user]

def purify(x):
    new_lst = []
    for i in x:
        if i % 2 == 0:
            new_lst.append(i)
    return new_lst
To pick out the even numbers, filter would be the simplest way:
output = filter(lambda x: ~x & 1, numbers)  # ~x & 1 is truthy exactly when x is even
(numbers is the input list; avoid calling it input, which shadows the built-in.)
def purify(list_number):
    s = []
    for number in list_number:
        if number % 2 == 0:
            s.append(number)
    return s
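For reference, the same filtering fits in a one-line list comprehension (numbers stands for any list of ints):
even = [n for n in numbers if n % 2 == 0]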