How to calculate all possibilities of very large string matrices in a timely manner? - python-2.7

Say I have a number of objects in different classifications, and I need to know every possible combination of these objects. My input looks like this:
{'raw':[{'AH':['P','C','R','Q','L']},
{'BG':['M','A','S','B','F']},
{'KH':['E','V','G','N','Y']},
{'KH':['E','V','G','N','Y']},
{'NM':['1','2','3','4','5']}]}
The keys AH, BG, KH and NM are groups, and each value is a list holding the individual objects. A finished group contains one member from each list; in this example KH is listed twice, so each finished group contains two members of KH. I built something that handles this:
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

class Builder():
    def __init__(self, data):
        self.raw = data['raw']
        # One counter per classification slot, all starting at 0
        node = []
        for item in self.raw:
            for k in item.keys():
                node.append({k:0})
        logger.debug('node: %s' % node)
        #Parse out groups#
        self.groups = []
        increment = -2
        while True:
            try:
                # Raises IndexError once the counter runs past the end of its list
                assert self.raw[increment].values()[0][node[increment][node[increment].keys()[0]]]
                increment = -2
                for x in self.raw[-1].values()[0]:
                    group = []
                    for k in range(0,len(node[:-1])):
                        position = node[k].keys()[0]
                        player = self.raw[k].values()[0][node[k][node[k].keys()[0]]]
                        group.append({position:player})
                    group.append({self.raw[-1].keys()[0]:x})
                    if self.repeatRemovals(group):
                        self.groups.append(group)
                node[increment][node[increment].keys()[0]]+=1
            except IndexError:
                # Reset this counter and carry into the slot to its left
                node[increment][node[increment].keys()[0]] = 0
                increment-=1
                try:
                    node[increment][node[increment].keys()[0]]+=1
                except IndexError:
                    break
        for group in self.groups:
            logger.debug(group)

    def repeatRemovals(self, group):
        # False if the same object appears twice in the group
        for x in range(0, len(group)):
            for y in range(0, len(group)):
                if group[x].values()[0] == group[y].values()[0] and x != y:
                    return False
        return True

if __name__ == '__main__':
    groups = Builder({'raw':[{'AH':['P','C','R','Q','L']},
                             {'BG':['M','A','S','B','F']},
                             {'KH':['E','V','G','N','Y']},
                             {'KH':['E','V','G','N','Y']},
                             {'NM':['1','2','3','4','5']}]})
    logger.debug("Total groups: %d" % len(groups.groups))
The output of running this should make my goal clear if the text above has not. My concern is the time it takes to handle large classifications: one classification has forty-odd objects and appears in the matrix three times, and there are 7 other classifications of comparable size. I think the numpy library could help me, but I am new to scientific programming and am not sure how to go about it, or whether it would be worth it. Could anyone provide some insight? Thank you.

Try this:
Remove the duplicated values.
Calculate the number of possibilities using the permutation and factorial formulas.
They are explained here:
https://www.youtube.com/watch?v=Oc50d2GqXx0
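A minimal sketch of that idea, under two assumptions that are not stated in the question: the order of the members drawn from a repeated classification does not matter (so E,V and V,E are the same group), and different classifications do not share objects. The helper names _summarise, count_groups and build_groups are made up for illustration. The count comes straight from the factorial formula without building any group, and itertools does the enumeration when the groups themselves are needed.

```python
from collections import Counter
from itertools import combinations, product
from math import factorial


def _summarise(raw):
    """Return (names in first-seen order, unique members by name, slots by name)."""
    names, members, slots = [], {}, Counter()
    for entry in raw:
        for name, objects in entry.items():
            if name not in members:
                names.append(name)
                members[name] = sorted(set(objects))  # remove duplicated values
            slots[name] += 1  # how many times this classification is listed
    return names, members, slots


def count_groups(raw):
    """Count the finished groups without enumerating them."""
    names, members, slots = _summarise(raw)
    total = 1
    for name in names:
        n, k = len(members[name]), slots[name]
        if k > n:
            return 0  # not enough distinct objects to fill every slot
        # "n choose k" written with factorials
        total *= factorial(n) // (factorial(k) * factorial(n - k))
    return total


def build_groups(raw):
    """Lazily yield each finished group as a flat tuple of objects."""
    names, members, slots = _summarise(raw)
    per_class = [combinations(members[name], slots[name]) for name in names]
    for parts in product(*per_class):
        yield tuple(obj for part in parts for obj in part)


raw = [{'AH': ['P', 'C', 'R', 'Q', 'L']},
       {'BG': ['M', 'A', 'S', 'B', 'F']},
       {'KH': ['E', 'V', 'G', 'N', 'Y']},
       {'KH': ['E', 'V', 'G', 'N', 'Y']},
       {'NM': ['1', '2', '3', '4', '5']}]

print(count_groups(raw))  # 5 * 5 * 10 * 5 = 1250
```

If the order inside a repeated classification does matter, as it does in the question's own code, itertools.permutations can replace combinations and the factorial term becomes n! / (n - k)!. Either way, a classification of 40 objects listed three times contributes 9880 unordered picks instead of 64000 loop iterations that are then filtered.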

Related

Counting matrix pairs using a threshold

I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get an array or a matrix I can plot etc.
I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.
Secondly, I need a list of these pairs based on file names.
So the output for the example below would look like:
1
and
["doc1", "doc4"]
Will really appreciate your help as I feel a bit lost not knowing which direction to go.
This is an example of my script to get the matrix:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]
# In my real script I iterate through a folder (path) with txt files like this:
#def read_text(path):
#    documents = []
#    for filename in glob.iglob(path+'*.txt'):
#        _file = open(filename, 'r')
#        text = _file.read()
#        documents.append(text)
#    return documents
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
Out:
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
As I understand your question, you want a function that takes the numpy array returned by cos_similarity() and a threshold, and returns two things:
how many pairs of documents have a similarity greater than or equal to the threshold
the names of the documents in those pairs.
So I've written the following function, which takes three arguments:
the numpy array returned by the cos_similarity() function
a list of document names
a number (the threshold).
Here it is:
def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        lst = [row+1+idx for idx, num in \
                  enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append( (docs_names[row], docs_names[item]) )
    return len(output_tuples), output_tuples
Let's see it in action:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
Let's see how this function works:
First, I iterate over every row of the numpy array.
Second, I iterate over every item in the row whose column index is greater than the row's index. So we are iterating over the upper triangle of the array, because each pair of documents appears twice in the whole array: the two values arr[0][1] and arr[1][0] are the same. You should also notice that the diagonal items aren't included, because we know for sure that they are 1, as every document is identical to itself :).
Finally, we keep the items whose values are greater than or equal to the given threshold, and collect their indices. These indices are used later to get the document names.
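For hundreds of files, the same upper-triangle walk can be handed to numpy instead of written as Python loops. This is only a sketch: get_docs_vectorized is a made-up name, and the matrix below is the one printed in the question, typed in by hand.

```python
import numpy as np


def get_docs_vectorized(arr, docs_names, threshold):
    """Return (pair count, list of name pairs) for similarities >= threshold."""
    arr = np.asarray(arr)
    # Row and column indices of every cell strictly above the diagonal
    rows, cols = np.triu_indices(arr.shape[0], k=1)
    keep = arr[rows, cols] >= threshold
    pairs = [(docs_names[r], docs_names[c])
             for r, c in zip(rows[keep], cols[keep])]
    return len(pairs), pairs


similarity = np.array([[1.0, 0.1459739, 0.03613371, 0.76357693],
                       [0.1459739, 1.0, 0.11459266, 0.19117117],
                       [0.03613371, 0.11459266, 1.0, 0.04732164],
                       [0.76357693, 0.19117117, 0.04732164, 1.0]])
names = ["doc1", "doc2", "doc3", "doc4"]

print(get_docs_vectorized(similarity, names, 0.5))
# (1, [('doc1', 'doc4')])
```

Because k=1 skips the diagonal, the self-comparisons with similarity 1 never enter the result, whatever threshold is chosen.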

For-loop error: list index out of range

So I am rather new to programming and just recently started with classes, and we are supposed to make a phonebook that can be loaded from separate text files.
However, I keep running into a problem in this section when I get into the for-loop. It hits a brick wall on
if storage[2] == permaStorage[i].number:
and tells me "IndexError: list index out of range". I am almost certain it is because permaStorage starts out empty, but even when I try to fill it with temporary instances of Phonebook it tells me the index is out of range. The loop is there mainly to check whether a phone number already exists within permaStorage.
Does anyone have a good tip on how to solve this or work around it?
(Sorry if the text is badly written. I just joined this site and am not sure about the style.)
class Phonebook():
    def __init__(self):
        self.name = ''
        self.number = ''

def Add(name1, number1):
    y = Phonebook()
    y.name = name1
    y.number = number1
    return y

def Main():
    permaStorage = []
    while True:
        print " add name number\n lookup name\n alias name newname\n change name number\n save filename\n load filename\n quit\n"
        choices = raw_input ("What would you like to do?: ")
        storage = choices.split(" ")
        if storage[0] == "add":
            for i in range(0, len(permaStorage)+1):
                if storage[2] == permaStorage[i].number:
                    print "This number already exists. No two people can have the same phonenumber!\n"
                    break
                if i == len(permaStorage):
                    print "hej"
                    try:
                        tempbox = Add(storage[1], storage[2])
                        permaStorage.append(tempbox)
                    except:
                        raw_input ("Remember to write name and phonenumber! Press any key to continue \n")
I think the problem is that permaStorage is an empty list, so this:
for i in range(0, len(permaStorage)+1):
    if storage[2] == permaStorage[i].number:
raises an error: permaStorage has 0 items, but you are trying to get the first one (i=0, permaStorage[0]).
I think you should swap the two if clauses, so that the check for the end of the list comes first, and break once the new entry has been added:
for i in range(0, len(permaStorage)+1):
    if i == len(permaStorage):
        print "hej"
        try:
            tempbox = Add(storage[1], storage[2])
            permaStorage.append(tempbox)
        except:
            raw_input ("Remember to write name and phonenumber! Press any key to continue \n")
        break
    if storage[2] == permaStorage[i].number:
        print "This number already exists. No two people can have the same phonenumber!\n"
        break
In this case, if permaStorage is empty, the new entry is appended and the loop ends before permaStorage[i] is ever indexed out of range.
Indexing starts at zero in Python. Hence, in a list of length 5 the last element has index 4. Change the range to range(0, len(permaStorage)).
You should iterate up to the last element of the list, not beyond it.
Try:
for i in range(0, len(permaStorage)):
The list of numbers produced by range() includes the start but not the end, so range(3) == [0, 1, 2].
So if your list x has length 10, range(0, len(x)) will give you 0 through 9, which are the correct indices of the elements of your list.
Adding 1 to len(x) will produce the range 0 through 10, and when you try to access x[10], it will fail.
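One way to sidestep the index arithmetic altogether is to loop over the entries themselves rather than over positions. The following is only a sketch: add_entry is a made-up helper, and the Phonebook constructor is given optional arguments here, which the class in the question does not have.

```python
class Phonebook(object):
    def __init__(self, name='', number=''):
        self.name = name
        self.number = number


def add_entry(entries, name, number):
    """Append a new Phonebook entry unless the number is already taken.

    Returns True if the entry was added, False if the number already exists.
    """
    # any() over an empty list is simply False, so no index is ever touched
    if any(entry.number == number for entry in entries):
        return False
    entries.append(Phonebook(name, number))
    return True


permaStorage = []
print(add_entry(permaStorage, "anna", "12345"))  # True
print(add_entry(permaStorage, "bert", "12345"))  # False
```

With an empty list the check is skipped naturally, so the first entry can be added without a special case.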

How to sort Python lists according to certain criteria

I would like to sort a list or an array using Python to achieve the following:
Say my initial list is:
example_list = ["retg_1_gertg","fsvs_1_vs","vrtv_2_srtv","srtv_2_bzt","wft_3_btb","tvsrt_3_rtbbrz"]
I would like to get all the elements that have 1 behind the first underscore together in one list and the ones that have 2 together in one list and so on. So the result should be:
sorted_list = [["retg_1_gertg","fsvs_1_vs"],["vrtv_2_srtv","srtv_2_bzt"],["wft_3_btb","tvsrt_3_rtbbrz"]]
My code:
import numpy as np
import string
example_list = ["retg_1_gertg","fsvs_1_vs","vrtv_2_srtv","srtv_2_bzt","wft_3_btb","tvsrt_3_rtbbrz"]
def sort_list(imagelist):
    # get number of wafers
    waferlist = []
    for image in imagelist:
        wafer_id = string.split(image,"_")[1]
        waferlist.append(wafer_id)
    waferlist = set(waferlist)
    waferlist = list(waferlist)
    number_of_wafers = len(waferlist)
    # create list
    sorted_list = []
    for i in range(number_of_wafers):
        sorted_list.append([])
    for i in range(number_of_wafers):
        wafer_id = waferlist[i]
        for image in imagelist:
            if string.split(image,"_")[1] == wafer_id:
                sorted_list[i].append(image)
    return sorted_list
sorted_list = sort_list(example_list)
This works, but it is really awkward and involves many for loops, which slow everything down when the lists are large.
Is there a more elegant way, using numpy or anything else?
Help is appreciated. Thanks.
I'm not sure how much more elegant this solution is; it is a bit more efficient. You could first sort the list and then go through and filter into final set of sorted lists:
example_list = ["retg_1_gertg","fsvs_1_vs","vrtv_2_srtv","srtv_2_bzt","wft_3_btb","tvsrt_3_rtbbrz"]
sorted_list = sorted(example_list, key=lambda x: x[x.index('_')+1])
result = [[]]
current_num = sorted_list[0][sorted_list[0].index('_')+1]
index = 0
for i in sorted_list:
    if current_num != i[i.index('_')+1]:
        current_num = i[i.index('_')+1]
        index += 1
        result.append([])
    result[index].append(i)
print result
If you can make assumptions about the values after the first underscore character, you could clean it up a bit (for example, if you knew that they would always be sequential numbers starting at 1).
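If no assumption about the values can be made, a dictionary keyed on the text between the first and second underscore groups everything in a single pass, with no sorting. This is a sketch only; group_by_wafer is a made-up name, and it assumes every item contains at least one underscore.

```python
from collections import OrderedDict


def group_by_wafer(image_names):
    """Group names by the field after the first underscore, in first-seen order."""
    groups = OrderedDict()
    for name in image_names:
        wafer_id = name.split("_")[1]  # works for multi-digit ids as well
        groups.setdefault(wafer_id, []).append(name)
    return list(groups.values())


example_list = ["retg_1_gertg", "fsvs_1_vs", "vrtv_2_srtv",
                "srtv_2_bzt", "wft_3_btb", "tvsrt_3_rtbbrz"]
print(group_by_wafer(example_list))
# [['retg_1_gertg', 'fsvs_1_vs'], ['vrtv_2_srtv', 'srtv_2_bzt'], ['wft_3_btb', 'tvsrt_3_rtbbrz']]
```

Each name is split once, so the work grows linearly with the length of the list instead of once per wafer.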

Creating a list of sums

I'm a newbie in Python and I'm struggling to create a list of sums generated by a for loop.
I have a school assignment where my program has to simulate the scores of a class of blind students in a multiple-choice test.
def blindwalk(): # Generates the blind answers in a test with 21 questions
    import random
    resp = []
    gab = ["a","b","c","d"]
    for n in range(0,21):
        resp.append(random.choice(gab))
    return(resp)

def gabarite(): # Generates the official answer key of the tests
    import random
    answ_gab = []
    gab = ["a","b","c","d"]
    for n in range(0,21):
        answ_gab.append(random.choice(gab))
    return(answ_gab)

def class_tests(A): # A is the number of students
    alumni = []
    A = int(A)
    for a in range(0,A):
        alumni.append(blindwalk())
    return alumni

def class_total(A): # A is the number of students
    A = int(A)
    official_gab = gabarite()
    tests = class_tests(A)
    total_score = []*0
    for a in range(0,A):
        for n in range(0,21):
            if tests[a][n] == official_gab[n]:
                total_score[a].add(1)
    return total_score
When I run the class_total() function, I get this error:
total_score[a].add(1)
IndexError: list index out of range
My question is: how do I evaluate each student's score and build a list of the scores? That is what I want the class_total() function to do.
I also tried
if tests[a][n] == official_gab[n]:
    total_score[a] += 1
But I got the same error, so I think I don't fully understand how lists work in Python yet.
Thanks!
(Also, I'm not a native English speaker, so please tell me if I wasn't clear enough.)
This line:
total_score = []*0
And in fact, any of the following lines:
total_score = []*30
total_score = []*3000
total_score = []*300000000
cause total_score to be instantiated as an empty list, because repeating an empty list any number of times still gives an empty list. It doesn't even have a 0th index in this case! If you'd like a list of length l with every value initialized to x, the syntax would look more like:
my_list = [x]*l
Alternatively, instead of thinking about the size before-hand, you can use .append instead of trying to access a particular index, as in:
my_list = []
my_list.append(200)
# my_list is now [200], my_list[0] is now 200
my_list.append(300)
# my_list is now [200,300], my_list[0] is still 200 and my_list[1] is now 300
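Applied to the function from the question, the first suggestion could look like the sketch below. It keeps the question's structure but collapses blindwalk() and gabarite() into one made-up helper, random_answers, since both only draw 21 random letters; the constant names are also invented here.

```python
import random

CHOICES = ["a", "b", "c", "d"]
NUM_QUESTIONS = 21


def random_answers():
    """One sheet of 21 randomly chosen answers."""
    return [random.choice(CHOICES) for _ in range(NUM_QUESTIONS)]


def class_total(A):  # A is the number of students
    A = int(A)
    official_gab = random_answers()
    tests = [random_answers() for _ in range(A)]
    # One score per student, all starting at 0, so total_score[a] exists
    total_score = [0] * A
    for a in range(A):
        for n in range(NUM_QUESTIONS):
            if tests[a][n] == official_gab[n]:
                total_score[a] += 1
    return total_score


scores = class_total(30)
print(len(scores))  # 30
```

Since the answers are random, the individual scores change on every run; only the length of the list is fixed.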

Find all occurrences inside a list

I'm trying to implement a function to find occurrences in a list, here's my code:
def all_numbers():
    num_list = []
    c.execute("SELECT * FROM myTable")
    for row in c:
        num_list.append(row[1])
    return num_list

def compare_results():
    look_up_num = raw_input("Lucky number: ")
    occurrences = [i for i, x in enumerate(all_numbers()) if x == look_up_num]
    return occurrences
I keep getting an empty list instead of the occurrences, even when I enter a number that is in the list.
Your code does the following:
1. It fetches everything from the database. Each row is a sequence.
2. Then, it takes all these results and adds them to a list.
3. It returns this list.
4. Next, your code goes through each item in the list (remember, it's a sequence, like a tuple) and fetches the item and its index (this is what enumerate does).
5. Next, you attempt to compare the sequence with a string, and if it matches, return it as part of a list.
At step 5, the script fails because you are comparing a tuple to a string. Here is a simplified example of what you are doing:
>>> def all_numbers():
... return [(1,5), (2,6)]
...
>>> lucky_number = 5
>>> for i, x in enumerate(all_numbers()):
...     print('{} {}'.format(i, x))
...     if x == lucky_number:
...         print 'Found it!'
...
0 (1, 5)
1 (2, 6)
As you can see, at each loop, your x is the tuple, and it will never equal 5; even though actually the row exists.
You can have the database do your dirty work for you, by returning only the number of rows that match your lucky number:
def get_number_count(lucky_number):
    """ Returns the number of times the lucky_number
    appears in the database """
    c.execute('SELECT COUNT(*) FROM myTable WHERE number_column = %s', (lucky_number,))
    result = c.fetchone()
    return result[0]

def get_input_number():
    """ Get the number to be searched in the database """
    lookup_num = raw_input('Lucky number: ')
    return get_number_count(lookup_num)
raw_input is returning a string. Try converting it to a number.
occurrences = [i for i, x in enumerate(all_numbers()) if x == int(look_up_num)]
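A slightly more defensive sketch of the same fix converts the input once, before the loop, and treats anything that is not a number as "no matches". It assumes the column holds integers; find_occurrences is a made-up name, and the list of numbers is passed in so the function can be tried without a database.

```python
def find_occurrences(numbers, raw_text):
    """Return the indices at which the number typed by the user appears."""
    try:
        target = int(raw_text.strip())  # convert the typed text once
    except ValueError:
        return []  # the input was not a whole number
    return [i for i, value in enumerate(numbers) if value == target]


print(find_occurrences([5, 3, 5, 8], "5"))    # [0, 2]
print(find_occurrences([5, 3, 5, 8], "five"))  # []
```

In the original code this would be called as find_occurrences(all_numbers(), look_up_num).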