Is there an algorithm/way to find out how different (or the minimum distance between) 2 list orders?

I have a bunch of items I want to rate in a specific order. For example:
["Person1", "Person2", "Person3", "Person4", "Person5"]
Which can be ordered like this:
["Person4", "Person5", "Person3", "Person1", "Person2"]
Given 2 different orders of the same list, is there a way to quantify how different they are?
I know Levenshtein distance exists for strings, and I'm looking for something similar.
My ideal measurement for distance would be the minimum number of switches between two adjacent items required to change one list to the other - but I'm open to other algorithms if you think they're better.
The answer I'm looking for is an algorithm (and preferably, a [Python] implementation) to perform this kind of measurement (fast).
Thanks in advance!

To quantify how "different" two strings are, as you already noted, you can use Levenshtein distance, which is implemented in this library:
pip install levenshtein
>>> import Levenshtein
>>> Levenshtein.distance("lewenstein", "levenshtein")
2
To determine how "different" two lists are, you could assign each value in the list to a Unicode character.
import Levenshtein
def list_distance(A, B):
    # Assign each unique value of the list to a unicode character
    unique_map = {v: chr(k) for (k, v) in enumerate(set(A + B))}
    # Create string versions of the lists
    a = ''.join(map(unique_map.get, A))
    b = ''.join(map(unique_map.get, B))
    return Levenshtein.distance(a, b)
A = ["Person1", "Person2", "Person3", "Person4", "Person5"]
B = ["Person4", "Person5", "Person3", "Person1", "Person2"]
list_distance(A, B)
returns 4.
This works by making a unique mapping to arbitrary Unicode characters, for example:
the list A to the string '\x03\x02\x01\x00\x04' and
the list B to the string '\x00\x04\x01\x03\x02',
before taking the Levenshtein distance of the two strings.
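The Levenshtein approach above counts character edits rather than adjacent swaps. The measure described in the question, the minimum number of swaps between adjacent items, equals the number of inversions between the two orders (often called the Kendall tau distance). Below is a minimal sketch, assuming both lists hold the same unique, hashable items; the name adjacent_swap_distance is hypothetical and does not come from any library. It counts inversions with a merge sort, so it runs in O(n log n).

```python
def adjacent_swap_distance(a, b):
    """Minimum number of adjacent swaps needed to turn order a into order b.

    Assumption: a and b contain the same unique, hashable items.
    """
    # Express a as a sequence of positions in b; sorting this sequence
    # with adjacent swaps is the same as turning a into b.
    position_in_b = {item: i for i, item in enumerate(b)}
    perm = [position_in_b[item] for item in a]

    def sort_and_count(seq):
        # Merge sort that also returns the number of inversions in seq.
        if len(seq) <= 1:
            return seq, 0
        mid = len(seq) // 2
        left, inv_left = sort_and_count(seq[:mid])
        right, inv_right = sort_and_count(seq[mid:])
        merged = []
        inversions = inv_left + inv_right
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] jumps ahead of every remaining item in left.
                merged.append(right[j])
                j += 1
                inversions += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inversions

    return sort_and_count(perm)[1]


A = ["Person1", "Person2", "Person3", "Person4", "Person5"]
B = ["Person4", "Person5", "Person3", "Person1", "Person2"]
print(adjacent_swap_distance(A, B))
```

For the two example orders from the question, this sketch gives 8, compared with 4 from the Levenshtein-based approach.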

Related

Finding the max value of a list of tuples, (applying max to the second value of the tuple)

So I have a list of tuples which I created from zipping two lists like this:
zipped = list(zip(neighbors, cv_scores))
max(zipped) produces
(49, 0.63941769316909292) where 49 is the max value.
However, I'm interested in finding the max value among the latter value of the tuple (the .63941).
How can I do that?
The problem is that Python compares tuples lexicographically so it orders on the first item and only if these are equivalent, it compares the second and so on.
You can however use the key= in the max(..) function, to compare on the second element:
max(zipped,key=lambda x:x[1])
Note 1: You do not have to construct a list(..) if you are only interested in the maximum value. You can use
max(zip(neighbors,cv_scores),key=lambda x:x[1]).
Note 2: Finding the max(..) runs in O(n) (linear time) whereas sorting a list runs in O(n log n).
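The difference between the default lexicographic comparison and key= can be sketched as follows. The values in neighbors and cv_scores are hypothetical stand-ins for the question's data, and operator.itemgetter(1) is used here as an equivalent of lambda x: x[1].

```python
from operator import itemgetter

# Hypothetical data standing in for the question's neighbors and cv_scores.
neighbors = [1, 3, 49]
cv_scores = [0.61, 0.70, 0.64]
zipped = list(zip(neighbors, cv_scores))

# Default comparison is lexicographic, so the largest first element wins.
largest_by_first = max(zipped)

# key=itemgetter(1) compares on the second element only.
largest_by_second = max(zipped, key=itemgetter(1))

print(largest_by_first)
print(largest_by_second)
```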
max(zipped)[1]
# returns the second element of the tuple with the largest first element, which is not necessarily the largest second element
If you want to sort your data and then find the maximum, you can use itemgetter:
from operator import itemgetter
zipped.sort(key=itemgetter(1), reverse=True)
print(zipped[0][1])  # maximum of the second elements

Select duplicated lists from a list of lists (Python 2.7.13)

I have two lists, one of which is a list of lists. Both have the same number of entries (so list2 holds half as many values as list1 holds in total), like this:
list1=[['47', '43'], ['299', '295'], ['47', '43'], etc.]
list2=[9.649, 9.612, 9.42, etc.]
I want to detect repeated pairs of values in the first list (and delete the repeats), and sum the values at the corresponding indexes in the second list, creating an output like this:
list1=[['47', '43'], ['299', '295'], etc.]
list2=[19.069, 9.612, etc.]
The main problem is that the order of the values is important and I'm really stuck.
You could create a collections.defaultdict to sum the values together, with the sublists as keys (converted to tuples so that they are hashable):
list1=[['47', '43'], ['299', '295'], ['47', '43']]
list2=[9.649, 9.612, 9.42]
import collections
c = collections.defaultdict(float)
for l,v in zip(list1,list2):
    c[tuple(l)] += v
print(c)
An alternative using collections.Counter, which does the same (missing keys start at 0, so the values can be accumulated directly):
c = collections.Counter()
for k,v in zip(list1,list2):
    c[tuple(k)] += v
At this point, we have the related data:
defaultdict(<class 'float'>, {('299', '295'): 9.612, ('47', '43'): 19.069})
Now, if needed (the dictionary may already hold the data well enough), we can rebuild the lists. The two lists stay aligned with each other, but on Python versions where dictionaries do not preserve insertion order, the original order of the pairs is not guaranteed:
list1=[]
list2=[]
for k,v in c.items():
    list1.append(list(k))
    list2.append(v)
print(list1,list2)
result:
[['299', '295'], ['47', '43']]
[9.612, 19.069]
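Since the question states that order matters, here is a minimal sketch that keeps the pairs in the order of their first appearance. It uses collections.OrderedDict, which is available in Python 2.7 as well as Python 3; the function name merge_duplicates is hypothetical.

```python
from collections import OrderedDict


def merge_duplicates(pairs, values):
    """Sum the values of repeated pairs, keeping first-appearance order."""
    totals = OrderedDict()
    for pair, value in zip(pairs, values):
        key = tuple(pair)  # lists are not hashable, tuples are
        totals[key] = totals.get(key, 0.0) + value
    merged_pairs = [list(key) for key in totals]
    merged_values = list(totals.values())
    return merged_pairs, merged_values


list1 = [['47', '43'], ['299', '295'], ['47', '43']]
list2 = [9.649, 9.612, 9.42]
new_list1, new_list2 = merge_duplicates(list1, list2)
print(new_list1)
print(new_list2)
```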

How do I fuzzy match items in a column of an array in python?

I have an array of team names from NCAA, along with statistics associated with them. The school names are often shortened or left out entirely, but there is usually a common element in all variations of the name (like Alabama Crimson Tide vs Crimson Tide). These names are all contained in an array in no particular order. I would like to be able to take all variations of a team name by fuzzy matching them and rename all variants to one name. I'm working in python 2.7 and I have a numpy array with all of the data. Any help would be appreciated, as I have never used fuzzy matching before.
I have considered fuzzy matching through a for-loop, which would (despite being unbelievably slow) compare each element in the column of the array to every other element, but I'm not really sure how to build it.
Currently, my array looks like this:
{Names , info1, info2, info 3}
The array is a few thousand rows long, so I'm trying to make the program as efficient as possible.
The Levenshtein edit distance is the most common way to perform fuzzy matching of strings. It is available in the python-Levenshtein package. Another popular distance is Jaro Winkler's distance, also available in the same package.
Assuming a simple numpy array:
import numpy as np
import Levenshtein as lv
ar = np.array([
    'string',
    'stum',
    'Such',
    'Say',
    'nay',
    'powder',
    'hiden',
    'parrot',
    'ming',
])
We define helpers that return a boolean mask marking which strings in the array score below a given threshold when compared with a given string.
def levenshtein(dist, string):
    # True where the edit distance to `string` is below `dist`
    return [lv.distance(string, x) < dist for x in ar]
def jaro(dist, string):
    # True where the Jaro-Winkler score against `string` is below `dist`
    return [lv.jaro_winkler(string, x) < dist for x in ar]
Now, note that the Levenshtein distance is an integer counted in characters, whilst lv.jaro_winkler returns a floating-point score between 0 and 1, where 1 means the strings are identical, so the jaro helper above selects strings whose score falls below the threshold. Let's test this using np.where:
print(ar[np.where(levenshtein(3, 'str'))])
print(ar[np.where(levenshtein(5, 'str'))])
print(ar[np.where(jaro(0.00000001, 'str'))])
print(ar[np.where(jaro(0.9, 'str'))])
And we get:
['stum']
['string' 'stum' 'Such' 'Say' 'nay' 'ming']
['Such' 'Say' 'nay' 'powder' 'hiden' 'ming']
['string' 'stum' 'Such' 'Say' 'nay' 'powder' 'hiden' 'parrot' 'ming']
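If the python-Levenshtein package is not available, the edit distance used above can be computed in plain Python. The following is a minimal sketch of the standard dynamic-programming formulation; the name edit_distance is hypothetical, and it will be much slower than the C implementation in the package.

```python
def edit_distance(s, t):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]  # turning s[:i] into an empty string takes i deletions
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # delete char_s
                current[j - 1] + 1,      # insert char_t
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(edit_distance('str', 'stum'))
print(edit_distance('str', 'string'))
```

Because it only compares items for equality, the same function also accepts lists, not just strings.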

Applying regexp and finding the highest number in a list

I have got a list of different names. I have a script that prints out the names from the list.
req=urllib2.Request('http://some.api.com/')
req.add_header('AUTHORIZATION', 'Token token=hash')
response = urllib2.urlopen(req).read()
json_content = json.loads(response)
for name in json_content:
    print name['name']
Output:
Thomas001
Thomas002
Alice001
Ben001
Thomas120
I need to find the max number that comes with the name Thomas. Is there a simple way to apply a regexp to all the elements that contain "Thomas" and then apply max(list) to them? The only way that I have come up with is to go through each element in the list, match the regexp for Thomas, then strip the letters and put the remaining numbers into a new list, but this seems pretty bulky.
You don't need regular expressions, and you don't need sorting. As you said, max() is fine. To be safe in case the list contains names like "Thomasson123", you can use:
names = ((x['name'][:6], x['name'][6:]) for x in json_content)
max(int(b) for a, b in names if a == 'Thomas' and b.isdigit())
The first assignment creates a generator expression, so there will be only one pass over the sequence to find the maximum.
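You don't need to go for regex. Just store the names in a list and apply the sorted function to it. Note that this relies on the numbers being zero-padded to the same width, since the comparison is lexicographic.
>>> l = ['Thomas001',
...      'Thomas002',
...      'Alice001',
...      'Ben001',
...      'Thomas120']
>>> [i for i in sorted(l) if i.startswith('Thomas')][-1]
'Thomas120'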
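For completeness, since the question asked about regular expressions, here is a minimal regex-based sketch. It compares the numbers as integers, so it does not depend on zero-padding, and it ignores names such as "Thomasson123". The function name max_number_for and the sample list are hypothetical.

```python
import re


def max_number_for(names, prefix):
    """Largest integer suffix among names of the form <prefix><digits>."""
    # The whole name must be the prefix followed only by digits.
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    numbers = [int(m.group(1)) for m in map(pattern.match, names) if m]
    return max(numbers) if numbers else None


names = ['Thomas001', 'Thomas002', 'Alice001', 'Ben001',
         'Thomas120', 'Thomasson123']
print(max_number_for(names, 'Thomas'))
```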

How do I count the number of elements in a list?

I need to write a small Prolog program to count the number of occurrences of each element in a list.
numberOfRepetition(input, result)
For example:
numberOfRepetition([a,b,a,d,c,a,b], X)
can be satisfied with X=[a/3,b/2,d/1,c/1] because a occurs three times, b occurs twice, and c and d occur once each.
I don't want to give you the full answer, so I'm going to help you get started:
% Find the occurrences of given element in list
%
% occurrences([a,b,c,a],a,X).
% -> X = 2.
occurrences([],_,0).
occurrences([X|Y],X,N):- occurrences(Y,X,W),N is W + 1.
occurrences([X|Y],Z,N):- occurrences(Y,Z,N),X\=Z.
Depending on your effort and feedback, I can help you to get your answer.
Check out my answer to the related question "How to count number of element occurrences in a list in Prolog"!
In that answer I present the predicate list_counts/2, which should fit your needs.
Sample use:
?- list_counts([a,b,a,d,c,a,b],Ys).
Ys = [a-3, b-2, d-1, c-1].
Note that this predicate uses a slightly different representation for key-value pairs expressing multiplicity: principal functor (-)/2 instead of (/)/2.
If possible, switch to the representation using (-)/2 for better interoperability with standard library predicates (like keysort/2).
If you wish to find the element with the most occurrences:
occurrences([],_,0).
occurrences([X|Y],X,N):- occurrences(Y,X,W),N is W + 1.
occurrences([X|Y],Z,N):- occurrences(Y,Z,N),X\=Z.
make_list(Max):-
    findall((Num,Elem),occurrences([d,d,d,a,a,b,c,d,e],Elem,Num),L),
    sort(L,Sorted),
    last(Sorted,(_,Max)).