How do I fuzzy match items in a column of an array in python? - python-2.7

I have an array of NCAA team names, along with statistics associated with them. The school names are often shortened or left out entirely, but there is usually a common element in all variations of a name (like Alabama Crimson Tide vs Crimson Tide). These names are stored in an array in no particular order. I would like to fuzzy match all variations of a team name and rename the variants to a single name. I'm working in Python 2.7 and I have a numpy array with all of the data. Any help would be appreciated, as I have never used fuzzy matching before.
I have considered fuzzy matching through a for-loop, which would (despite being unbelievably slow) compare each element in the column of the array to every other element, but I'm not really sure how to build it.
Currently, my array looks like this:
{Names, info1, info2, info3}
The array is a few thousand rows long, so I'm trying to make the program as efficient as possible.

The Levenshtein edit distance is the most common way to perform fuzzy matching of strings. It is available in the python-Levenshtein package. Another popular measure is the Jaro-Winkler distance, available in the same package.
Assuming a simple numpy array:
import numpy as np
import Levenshtein as lv

ar = np.array([
    'string',
    'stum',
    'Such',
    'Say',
    'nay',
    'powder',
    'hiden',
    'parrot',
    'ming',
])
We define helpers that return boolean masks marking which strings in the array fall within a given Levenshtein or Jaro-Winkler distance of a query string.
def levenshtein(dist, string):
    return map(lambda x: x < dist, map(lambda x: lv.distance(string, x), ar))

def jaro(dist, string):
    return map(lambda x: x < dist, map(lambda x: lv.jaro_winkler(string, x), ar))
Now, note that the Levenshtein distance is an integer counted in characters, while lv.jaro_winkler returns a floating point similarity between 0 and 1 (1 means identical), so the jaro helper selects strings whose similarity falls below the threshold: a tiny threshold keeps only completely dissimilar strings, and a threshold near 1 keeps almost everything. Let's test this using np.where:
print ar[np.where(levenshtein(3, 'str'))]
print ar[np.where(levenshtein(5, 'str'))]
print ar[np.where(jaro(0.00000001, 'str'))]
print ar[np.where(jaro(0.9, 'str'))]
And we get:
['stum']
['string' 'stum' 'Such' 'Say' 'nay' 'ming']
['Such' 'Say' 'nay' 'powder' 'hiden' 'ming']
['string' 'stum' 'Such' 'Say' 'nay' 'powder' 'hiden' 'parrot' 'ming']
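Applied back to the original team-name problem, one possible approach (a sketch, not part of the answer above) is to pick a canonical name per school and overwrite every variant whose similarity to it clears a threshold. The sample rows, the canonical list, and the 0.7 threshold are illustrative assumptions that would need tuning on real data:

import numpy as np
import Levenshtein as lv

# Hypothetical data: column 0 holds the raw team names, the rest hold stats.
teams = np.array([
    ['Alabama Crimson Tide', '10', '2'],
    ['Crimson Tide', '11', '1'],
    ['Alabama', '12', '0'],
])

canonical = ['Alabama Crimson Tide']  # assumed list of "official" names

for name in canonical:
    # Boolean mask of rows whose name is similar enough to the canonical one
    similar = np.array([lv.jaro_winkler(name.lower(), raw.lower()) > 0.7
                        for raw in teams[:, 0]])
    teams[similar, 0] = name

print teams

This scans the name column once per canonical name, which for a few thousand rows is far cheaper than comparing every row against every other row.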

Related

Is there an algorithm/way to find out how different (or the minimum distance between) 2 list orders?

I have a bunch of items I want to rate in a specific order. For example:
["Person1", "Person2", "Person3", "Person4", "Person5"]
Which can be ordered like this:
["Person4", "Person5", "Person3", "Person1", "Person2"]
Given 2 different orders of the same list, is there a way to quantify how different they are?
I know Levenshtein distance exists for strings, and I'm looking for something similar.
My ideal measurement for distance would be the minimum number of swaps of two adjacent items required to change one list into the other - but I'm open to other algorithms if you think they're better.
The answer I'm looking for is an algorithm (and preferably, a [Python] implementation) to perform this kind of measurement (fast).
Thanks in advance!
To quantify how "different" two strings are, as you already noted, you can use Levenshtein distance, which is implemented in this library:
pip install levenshtein
>>> import Levenshtein
>>> Levenshtein.distance("lewenstein", "levenshtein")
2
To determine how "different" two lists are, you could assign each value in the list to a Unicode character.
import Levenshtein

def list_distance(A, B):
    # Assign each unique value of the list to a unicode character
    unique_map = {v: chr(k) for (k, v) in enumerate(set(A + B))}

    # Create string versions of the lists
    a = ''.join(map(unique_map.get, A))
    b = ''.join(map(unique_map.get, B))

    return Levenshtein.distance(a, b)
A = ["Person1", "Person2", "Person3", "Person4", "Person5"]
B = ["Person4", "Person5", "Person3", "Person1", "Person2"]
list_distance(A, B)
returns 4.
This works by making a unique mapping to arbitrary Unicode characters, for example:
the list A to the string '\x03\x02\x01\x00\x04' and
the list B to the string '\x00\x04\x01\x03\x02',
before taking the Levenshtein distance of the two strings.
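The asker's ideal measurement (the minimum number of adjacent swaps) is the Kendall tau, or bubble-sort, distance rather than Levenshtein; for two orderings of the same items it equals the number of inversions. A minimal sketch (the function name and the quadratic loop are our own illustration, not part of the answer above):

def adjacent_swap_distance(A, B):
    # Where each item of B sits in A's ordering
    index_in_A = {item: i for i, item in enumerate(A)}
    seq = [index_in_A[item] for item in B]

    # Count inversions: pairs that are out of order relative to A
    inversions = 0
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            if seq[i] > seq[j]:
                inversions += 1
    return inversions

A = ["Person1", "Person2", "Person3", "Person4", "Person5"]
B = ["Person4", "Person5", "Person3", "Person1", "Person2"]
print(adjacent_swap_distance(A, B))  # 8 adjacent swaps for this example

For long lists the same count can be computed in O(n log n) with a merge-sort-based inversion count, but the quadratic loop is the easiest to verify.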

Word2Vec: is it for words only in a sentence or for features as well?

I would like to ask more about Word2Vec:
I am currently trying to build a program that checks the embedding vectors for a sentence. At the same time, I am also building feature extraction using scikit-learn to extract lemma 0, lemma 1, and lemma 2 from the sentence.
From my understanding:
1) Feature extraction: lemma 0, lemma 1, lemma 2
2) Word embedding: vectors are embedded for each character (this can be achieved by using gensim word2vec; I have tried it)
More explanation:
Sentence = "I have a pen".
Word = token of the sentence, for example, "have"
1) Feature extraction
"I have a pen" --> lemma 0:I, lemma_1: have, lemma_2:a.......lemma 0:have, lemma_1: a, lemma_2:pen and so on.. Then when try to extract the feature by using one_hot then will produce:
[[0,0,1],
[1,0,0],
[0,1,0]]
2) Word embedding (Word2vec)
"I have a pen" ---> "I", "have", "a", "pen" (tokenized); then word2vec from gensim will produce matrices, for example with window_size = 2:
[[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345],
[0.31235,0.31345]
]
The floating point and integer numbers are for explanation purposes only; the real values will vary depending on the sentence. These are just dummy data to explain.
Questions:
1) Is my understanding about Word2Vec correct? If yes, what is the difference between feature extraction and word2vec?
2) I am curious whether I can use word2vec for the feature extraction embedding too, since from my understanding word2vec only finds an embedding for each word and not for the features.
Hopefully someone can help me with this.
It's not completely clear what you're asking, as you seem to have many concepts mixed-up together. (Word2Vec gives vectors per word, not character; word-embeddings are a kind of feature-extraction on words, rather than an alternative to 'feature extraction'; etc. So: I doubt your understanding is yet correct.)
"Feature extraction" is a very general term, meaning any and all ways of taking your original data (such as a sentence) and creating a numerical representation that's good for other kinds of calculation or downstream machine-learning.
One simple way to turn a corpus of sentences into numerical data is to use a "one-hot" encoding of which words appear in each sentence. For example, if you have the two sentences...
['A', 'pen', 'will', 'need', 'ink']
['I', 'have', 'a', 'pen']
...then you have 7 unique case-flattened words...
['a', 'pen', 'will', 'need', 'ink', 'i', 'have']
...and you could "one-hot" the two sentences as a 1-or-0 for each word they contain, and thus get the 7-dimensional vectors:
[1, 1, 1, 1, 1, 0, 0] # A pen will need ink
[1, 1, 0, 0, 0, 1, 1] # I have a pen
Even with this simple encoding, you can now compare sentences mathematically: a euclidean-distance or cosine-distance calculation between those two vectors will give you a summary distance number, and sentences with no shared words will have a high 'distance', and those with many shared words will have a small 'distance'.
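As a rough sketch of that comparison (the sentences and vocabulary come from the example above; the cosine-distance helper is an illustration, not a library call mentioned by the answer):

import numpy as np

vocab = ['a', 'pen', 'will', 'need', 'ink', 'i', 'have']

def one_hot(tokens):
    # 1 if the vocabulary word occurs in the (lower-cased) sentence, else 0
    words = set(t.lower() for t in tokens)
    return np.array([1 if w in words else 0 for w in vocab])

s1 = one_hot(['A', 'pen', 'will', 'need', 'ink'])  # [1 1 1 1 1 0 0]
s2 = one_hot(['I', 'have', 'a', 'pen'])            # [1 1 0 0 0 1 1]

# Cosine distance: 1 minus the cosine similarity of the two vectors
cosine_distance = 1 - np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
print(cosine_distance)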
Other very-similar possible alternative feature-encodings of these sentences might involve counts of each word (if a word appeared more than once, a number higher than 1 could appear), or weighted-counts (where words get an extra significance factor by some measure, such as the common "TF/IDF" calculation, and thus values scaled to be anywhere from 0.0 to values higher than 1.0).
Note that you can't encode a single sentence as a vector that's just as wide as its own words, such as "I have a pen" into a 4-dimensional [1, 1, 1, 1] vector. That then isn't comparable to any other sentence. They all need to be converted to the same-dimensional-size vector, and in "one hot" (or other simple "bag of words") encodings, that vector is of dimensionality equal to the total vocabulary known among all sentences.
Word2Vec is a way to turn individual words into "dense" embeddings with fewer dimensions but many non-zero floating-point values in those dimensions. This is instead of sparse embeddings, which have many dimensions that are mostly zero. The 7-dimensional sparse embedding of 'pen' alone from above would be:
[0, 1, 0, 0, 0, 0, 0] # 'pen'
If you trained a 2-dimensional Word2Vec model, it might instead have a dense embedding like:
[0.236, -0.711] # 'pen'
All the 7 words would have their own 2-dimensional dense embeddings. For example (all values made up):
[-0.101, 0.271] # 'a'
[0.236, -0.711] # 'pen'
[0.302, 0.293] # 'will'
[0.672, -0.026] # 'need'
[-0.198, -0.203] # 'ink'
[0.734, -0.345] # 'i'
[0.288, -0.549] # 'have'
If you have Word2Vec vectors, then one alternative simple way to make a vector for a longer text, like a sentence, is to average together all the word-vectors for the words in the sentence. So, instead of a 7-dimensional sparse vector for the sentence, like:
[1, 1, 0, 0, 0, 1, 1] # I have a pen
...you'd get a single 2-dimensional dense vector like:
[ 0.28925, -0.3335 ] # I have a pen
And again different sentences may be usefully comparable to each other based on these dense-embedding features, by distance. Or these might work well as training data for a downstream machine-learning process.
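A hedged sketch of that averaging, assuming a gensim Word2Vec model trained on your own tokenized corpus (the toy corpus and the 2-dimensional size are only for illustration; the size parameter is called size in older gensim releases and vector_size in gensim 4.x):

import numpy as np
from gensim.models import Word2Vec

# Toy corpus of tokenized, lower-cased sentences (illustrative only)
corpus = [['a', 'pen', 'will', 'need', 'ink'],
          ['i', 'have', 'a', 'pen']]

model = Word2Vec(corpus, vector_size=2, window=2, min_count=1)  # size=2 on older gensim

def sentence_vector(tokens):
    # Average the word vectors of the tokens the model knows about
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0)

print(sentence_vector(['i', 'have', 'a', 'pen']))

With only two toy sentences the resulting numbers are meaningless; the point is the shape of the pipeline: tokens in, one dense fixed-size vector per sentence out.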
So, this is a form of "feature extraction" that uses Word2Vec instead of simple word-counts. There are many other more sophisticated ways to turn text into vectors; they could all count as kinds of "feature extraction".
Which works best for your needs will depend on your data and ultimate goals. Often the most-simple techniques work best, especially once you have a lot of data. But there are few absolute certainties, and you often need to just try many alternatives, and test how well they do in some quantitative, repeatable scoring evaluation, to find which is best for your project.

When a draw occurs while tracking the most occurrences in a list, how do I find the element with the highest index?

lines = ["Pizza", "Vanilla","Los Angeles Pikes","Cookie Washington Tennis Festival","Water Fiesta","Watermelon"]
best= max(set(lines), key=lines.count)
print (best)
The code above returns the most frequent element in the list, but when there is a draw I want it to return the tied element with the greatest index. So here I want Watermelon to be printed, and if more items are added without breaking the tie, the tied element with the highest index should still be printed.
I need a solution with simple, basic code like that seen above, without importing libraries. If you could help find a good solution for this it would be really helpful.
You could add the index, divided by a value greater than the length of the list, to the result of count. The normalized index is always less than 1.0, so it cannot affect the first-order comparison on the counts, but it guarantees that there are no ties. I would use a small function to do this:
lines = ["Pizza", "Vanilla", "Los Angeles Pikes",
"Cookie Washington Tennis Festival",
"Water Fiesta", "Watermelon"]
def key(x):
return lines.count(x) + lines.index(x) / (len(lines) + 1)
best = max(set(lines), key=key)
print(best)
While your original code returned "Los Angeles Pikes" in my version of Python (because of the way the set hashing turned out), the new version returns "Watermelon", as expected.
You can also use a lambda, but I find that a bit harder to read:
best = max(set(lines), key=lambda x: lines.count(x) + lines.index(x) / float(len(lines) + 1))
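An alternative sketch that avoids floating point entirely (and keeps to the question's no-imports constraint) is to take the max over indices, breaking count ties with the index itself; this is a variant of ours, not the answer above:

lines = ["Pizza", "Vanilla", "Los Angeles Pikes",
         "Cookie Washington Tennis Festival",
         "Water Fiesta", "Watermelon"]

# Compare (count, index) tuples: higher count wins, then higher index
best_index = max(range(len(lines)), key=lambda i: (lines.count(lines[i]), i))
print(lines[best_index])  # "Watermelon" for this list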

converting python pandas column to numpy array in place

I have a csv file in which one of the columns is a semicolon-delimited list of floating point numbers of variable length. For example:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
When I read this into a pandas DataFrame, the datatype for that column is object. I want to convert it, ideally in place, to a numpy array (or just a regular float array, it doesn't matter too much at this stage).
I wrote a little function which takes a single one of those list elements and converts it to a numpy array:
def parse_list(data):
    data_list = data.split(';')
    return np.array(map(float, data_list))
This works fine, but what I want to do is do this conversion directly in the DataFrame so that I can use pandasql and the like to manipulate the whole data set after the conversion. Can someone point me in the right direction?
EDIT: I seem to have asked the question poorly. I would like to convert the following data frame:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
where the dtype of List is 'object'
to the following dataframe:
Index List
0 [900.0, 300.0, 899.2]
1 [123.4, 887.3, 900.1, 985.3]
where the datatype of List is numpy array of floats
EDIT2: some progress, thanks to the first answer. I now have the line:
df['List'] = df['List'].str.split(';')
which splits the column in place into lists, but the dtype remains object. When I then try to do
df['List'] = df['List'].astype(float)
I get the error:
return arr.astype(dtype)
ValueError: setting an array element with a sequence.
If I understand you correctly, you want to transform your data from pandas to numpy arrays.
I used this:
pandas_DataName.as_matrix(columns=None)
And it worked for me.
For more information, see the pandas documentation for as_matrix (it has since been deprecated in favour of .values and .to_numpy()). I hope this helps.
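A hedged sketch of the in-place conversion the question actually asks for, building on the asker's own split-and-float approach (the column name List and the sample values come from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'List': ['900.0;300.0;899.2',
                            '123.4;887.3;900.1;985.3']})

# Convert each semicolon-delimited string into a numpy array of floats, in place
df['List'] = df['List'].apply(lambda s: np.array([float(x) for x in s.split(';')]))

print(df['List'][0])     # the first cell is now a numpy array of floats
print(df['List'].dtype)  # still object, because each cell holds a variable-length array

The column dtype stays object because the rows have different lengths, which is also why the astype(float) call in EDIT2 fails; each cell is nevertheless now a real float array you can index and aggregate.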

Create a vector of occurrences the same size as an input string

I'm new to python and needed some help.
I have a string such as ACAACGG.
I would now like to create 3 vectors where the elements are the running counts of a particular letter.
For example, for "A", this would produce (1123333)
For "C", this would produce (0111222)
etc.
I'm not sure how to put the results of the counting into a string or a vector.
I believe this is similar to counting the occurrences of a character in a string, but I'm not sure how to have it run through the string and place the count value at each point.
For reference, I'm trying to implement the Burrows-Wheeler transform and use it for a string search. But, I'm not sure how to create the occurrence vector for the characters.
def bwt(s):
    s = s + '$'
    return ''.join([x[-1] for x in
                    sorted([s[i:] + s[:i] for i in range(len(s))])])
This gives me the transform and I'm trying to create the occurrence vector for it. Ultimately, I want to use this to search for repeats in a DNA string.
Any help would be greatly appreciated.
I'm not sure what type you want the vectors to be in, but here's a function that returns a list of ints.
In [1]: def countervector(s, char):
   ....:     c = 0
   ....:     v = []
   ....:     for x in s:
   ....:         if x == char:
   ....:             c += 1
   ....:         v.append(c)
   ....:     return v
   ....:
In [2]: countervector('ACAACGG', 'A')
Out[2]: [1, 1, 2, 3, 3, 3, 3]
In [3]: countervector('ACAACGG', 'C')
Out[3]: [0, 1, 1, 1, 2, 2, 2]
Also, here's a much shorter way to do it, but it will probably be inefficient on long strings:
def countervector(s, char):
    return [s[:i+1].count(char) for i, _ in enumerate(s)]
I hope it helps.
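Since the asker ultimately applies this to the BWT of long DNA strings, a vectorized variant may be worth sketching; it assumes numpy is available, which goes beyond the answer above:

import numpy as np

def countervector_np(s, char):
    # Cumulative count of char at every position of s
    hits = np.fromiter((c == char for c in s), dtype=int, count=len(s))
    return np.cumsum(hits).tolist()

print(countervector_np('ACAACGG', 'A'))  # [1, 1, 2, 3, 3, 3, 3]
print(countervector_np('ACAACGG', 'C'))  # [0, 1, 1, 1, 2, 2, 2]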
As promised, here is the finished script I wrote. For reference, I'm trying to use the Burrows-Wheeler transform to do repeat matching in strings of DNA. Basically the idea is to take a strand of DNA of some length M and find all repeats within that string. So, as an example, if I had the string acaacg and searched for all duplicated substrings of size 2, I would get a count of 1 and the starting locations 0,3. You could then type string[0:2] and string[3:5] to verify that they do actually match and their result is "ac".
If anyone is interested in learning about the Burrows-Wheeler transform, a Wikipedia search on it produces very helpful results. Here is another source from Stanford that also explains it well: http://www.stanford.edu/class/cs262/notes/lecture5.pdf
Now, there are a few issues that I did not address in this. First, I'm using n^2 space to create the BW transform. Also, I'm creating a suffix array, sorting it, and then replacing it with numbers, so creating that may take up a bit of space. However, at the end I'm only really storing the occ matrix, the end column, and the word itself.
Despite the RAM problems for strings larger than 4^7 (I got this to work with a string size of 40,000 but no larger), I would call this a success, seeing as before Monday the only thing I knew how to do in Python was to have it print my name and hello world.
# generate random string of DNA
def get_string(length):
    string = ""
    for i in range(length):
        string += random.choice("ATGC")
    return string

# Make the BW transform from the generated string
def make_bwt(word):
    word = word + '$'
    return ''.join([x[-1] for x in
                    sorted([word[i:] + word[:i] for i in range(len(word))])])

# Make the occurrence matrix from the transform
def make_occ(bwt):
    letters = set(bwt)
    occ = {}
    for letter in letters:
        c = 0
        occ[letter] = []
        for i in range(len(bwt)):
            if bwt[i] == letter:
                c += 1
            occ[letter].append(c)
    return occ

# Get the initial starting locations for the Pos(x) values
def get_starts(word):
    starts = {}
    word = word + "$"
    for letter in set(word):
        starts[letter] = len([i for i in word if i < letter])
    return starts

# Single range finder for the BWT. This produces a first and last position for one read.
def get_range(read, occ, pos):
    read = read[::-1]
    firstletter = read[0]
    newread = read[1:len(read)]
    readL = len(read)
    F0 = pos[firstletter]
    L0 = pos[firstletter] + occ[firstletter][-1] - 1
    F1 = F0
    L1 = L0
    for letter in newread:
        F1 = pos[letter] + occ[letter][F1 - 1]
        L1 = pos[letter] + occ[letter][L1] - 1
    return F1, L1

# Iterate the single read finder over the entire string to search for duplicates
def get_range_large(readlength, occ, pos, bwt):
    output = []
    for i in range(0, len(bwt) - readlength):
        output.append(get_range(word[i:(i + readlength)], occ, pos))
    return output

# Create suffix array to use later
def get_suf_array(word):
    suffix_names = [word[i:] for i in range(len(word))]
    suffix_position = range(0, len(word))
    output = zip(suffix_names, suffix_position)
    output.sort()
    output2 = []
    for i in range(len(output)):
        output2.append(output[i][1])
    return output2

# Remove single hits that were a result of using the substrings to scan the large string
def keep_dupes(bwtrange):
    mylist = []
    for i in range(0, len(bwtrange)):
        if bwtrange[i][1] != bwtrange[i][0]:
            mylist.append(tuple(bwtrange[i]))
    newset = set(mylist)
    newlist = list(newset)
    newlist.sort()
    return newlist

# Count the duplicate entries
def count_dupes(hits):
    c = 0
    for i in range(0, len(hits)):
        diff = hits[i][1] - hits[i][0]
        if diff > 0:
            c = c + diff
    return c

# Get the coordinates from BWT and use the suffix array to map them back to their original indices
def get_coord(hits):
    mylist = []
    for element in hits:
        mylist.append(sa[element[0] - 1:element[1]])
    return mylist

# Use the coordinates to get the actual strings that are duplicated
def get_dupstrings(coord, readlength):
    output = []
    for element in coord:
        temp = []
        for i in range(0, len(element)):
            string = word[element[i]:(element[i] + readlength)]
            temp.append(string)
        output.append(temp)
    return output

# Merge the strings and the coordinates together for one big list.
def together(dupstrings, coord):
    output = []
    for i in range(0, len(coord)):
        merge = dupstrings[i] + coord[i]
        output.append(merge)
    return output
Now run the commands as follows
import random # This is needed to generate a random string
readlength=12 # pick read length
word=get_string(4**7) # make random word
bwt=make_bwt(word) # make bwt transform from word
occ=make_occ(bwt) # make occurrence matrix
pos=get_starts(word) # gets start positions of sorted first row
bwtrange=get_range_large(readlength,occ,pos,bwt) # Runs the get_range function over all substrings in a string.
sa=get_suf_array(word) # This function builds a suffix array and numbers it.
hits=keep_dupes(bwtrange) # Pulls out the number of entries in the bwt results that have more than one hit.
dupes=count_dupes(hits) # counts hits
coord=get_coord(hits) # This part attempts to pull out the coordinates of the hits.
dupstrings=get_dupstrings(coord,readlength) # pulls out all the duplicated strings
strings_coord=together(dupstrings,coord) # puts coordinates and strings in one file for ease of viewing.
print dupes
print strings_coord