What's slowing down this piece of python code? - python-2.7

I have been trying to implement the Stupid Backoff language model (the description is available here, though I believe the details are not relevant to the question).
The thing is, the code works and produces the expected result, but it runs slower than I expected. I figured out that the part slowing everything down is here (and NOT in the training part):
def compute_score(self, sentence):
    length = len(sentence)
    assert length <= self.n
    if length == 1:
        word = tuple(sentence)
        return float(self.ngrams[length][word]) / self.total_words
    else:
        words = tuple(sentence[::-1])
        count = self.ngrams[length][words]
        if count == 0:
            return self.alpha * self.compute_score(sentence[1:])
        else:
            return float(count) / self.ngrams[length - 1][words[:-1]]
def score(self, sentence):
    """ Takes a list of strings as argument and returns the log-probability of the
    sentence using your language model. Use whatever data you computed in train() here.
    """
    output = 0.0
    length = len(sentence)
    for idx in range(length):
        if idx < self.n - 1:
            current_score = self.compute_score(sentence[:idx+1])
        else:
            current_score = self.compute_score(sentence[idx-self.n+1:idx+1])
        output += math.log(current_score)
    return output
self.ngrams is a nested dictionary with n entries. Each entry is itself a dictionary that maps a tuple (word_i, word_i-1, word_i-2, ..., word_i-n) to the count of that combination.
self.alpha is a constant that defines the penalty for backing off to the (n-1)-gram.
self.n is the maximum length of the tuple that the program looks up in the dictionary self.ngrams. It is set to 3 (though setting it to 2 or even 1 doesn't change anything). It's weird because the unigram and bigram models work just fine in fractions of a second.
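For concreteness, a made-up example of what self.ngrams might look like for n = 2 (the inner keys are reversed n-grams, matching the tuple(sentence[::-1]) lookup in compute_score):

self.ngrams = {
    1: {('the',): 120, ('cat',): 7},
    2: {('cat', 'the'): 5, ('sat', 'cat'): 3},
}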
The answer I am looking for is not a refactored version of my own code, but rather a tip about which part of it is the most computationally expensive (so that I can figure out myself how to rewrite it and get the most educational profit from solving this problem).
Please, be patient, I am but a beginner (two months into the world of programming). Thanks.
UPD:
I timed the running time with the same data using time.time():
Unigram = 1.9
Bigram = 3.2
Stupid Backoff (n=2) = 15.3
Stupid Backoff (n=3) = 21.6
(These timings are on bigger data than the original runs because of time.time's limited precision.)

If the sentence is very long, most of the code that's actually running is here:
def score(self, sentence):
    for idx in range(len(sentence)):  # should use xrange in Python 2!
        self.compute_score(sentence[idx-self.n+1:idx+1])

def compute_score(self, sentence):
    words = tuple(sentence[::-1])
    count = self.ngrams[len(sentence)][words]
    if count == 0:
        self.compute_score(sentence[1:])
    else:
        self.ngrams[len(sentence) - 1][words[:-1]]
That's not meant to be working code--it just removes the unimportant parts.
The flow in the critical path is therefore, for each word in the sentence:

1. Call compute_score() on that word plus the following 2. This creates a new list of length 3. You could avoid that with itertools.islice().
2. Construct a 3-tuple with the words reversed. This creates a new tuple. You could avoid that by passing the -1 step argument when making the slice outside this function.
3. Look up in self.ngrams, a nested dict, with the first key being a number (this level might be faster as a list; there are only three keys anyway?) and the second being the tuple just created.
4. Either recurse with the first word removed, i.e. make a new tuple (sentence[2], sentence[1]), or
5. do another lookup in self.ngrams, implicitly creating another new tuple (words[:-1]).
In summary, I think the biggest problem you have is the repeated and nested creation and destruction of lists and tuples.
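As a rough sketch of how those temporaries could be avoided (passing indices instead of slicing is one option; itertools.islice is another; the attribute names self.ngrams, self.alpha, self.total_words and self.n are taken from the question, and this is an illustration rather than a drop-in replacement):

def score(self, sentence):
    output = 0.0
    for idx in xrange(len(sentence)):
        start = max(0, idx - self.n + 1)
        output += math.log(self.compute_score(sentence, start, idx))
    return output

def compute_score(self, sentence, start, end):
    length = end - start + 1
    if length == 1:
        return float(self.ngrams[1][(sentence[end],)]) / self.total_words
    # build the reversed key directly from indices, without slicing the sentence into a new list first
    words = tuple(sentence[i] for i in xrange(end, start - 1, -1))
    count = self.ngrams[length].get(words, 0)
    if count == 0:
        return self.alpha * self.compute_score(sentence, start + 1, end)
    return float(count) / self.ngrams[length - 1][words[:-1]]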

Related

compare two dictionaries, one with a list of float values per key, the other with one value per key (python)

I have a query sequence that I BLASTed online using NCBIWWW.qblast. In my XML BLAST result file I obtained, for a query sequence, a list of hits (i.e. gi|). Each hit or gi| has multiple HSPs. I made a dictionary my_dict1 where I placed gi| as the key and appended the bit scores as values, so there are multiple values for each key.
my_dict1 = {
gi|1002819492|: [437.702, 384.47, 380.86, 380.86, 362.83],
gi|675820360| : [2617.97, 2614.37, 122.112],
gi|953764029| : [414.258, 318.66, 122.112, 86.158],
gi|675820410| : [450.653, 388.08, 386.27] }
Then I looked for max value in each key using:
for key, value in my_dict1.items():
    max_value = max(value)
And made a second dictionary my_dict2:
my_dict2 = {
gi|1002819492|: 437.702,
gi|675820360| : 2617.97,
gi|953764029| : 414.258,
gi|675820410| : 450.653 }
I want to compare both dictionaries so I can extract the HSP with the highest bit score. I am also including other parameters like query coverage and identity percentage (not shown here). The goal is to get the best gi| with the highest bit score, coverage and identity percentage.
I tried many things to compare both dictionaries, like this:
First code:

matches[]
if my_dict1.keys() not in my_dict2.keys():
    matches[hit_id] = bit_score
else:
    matches = matches[hit_id], bit_score

Second code:

if hit_id not in matches.keys():
    matches[hit_id] = bit_score
else:
    matches = matches[hit_id], bit_score

Third code:

intersection = set(set(my_dict1.items()) & set(my_dict2.items()))
However I always end up with 2 types of errors:
1 ) TypeError: list indices must be integers, not unicode
2 ) ... float not iterable...
Please I need some help and guidance. Thank you very much in advance for your time. Best regards.
It's not clear what you're trying to do. What is hit_id? What is bit_score? It looks like your second dict is always going to have the same keys as your first if you're creating it by pulling the max value for each key of the first dict.
You say you're trying to compare them, but don't really state what you're actually trying to do. Find those with values under a certain max? Find those with the highest max?
Your first code doesn't work because I'm assuming you're trying to use a dict key value as an index to matches, which you define as a list. That's probably where your first error is coming from, though you haven't given the lines where the error is actually occurring.
See in-code comments below:
# First off, this needs to be a dict.
matches{}

# This will never happen if you've created these dicts as you stated.
if my_dict1.keys() not in my_dict2.keys():
    matches[hit_id] = bit_score  # Not clear what bit_score is?
else:
    # Also not sure what you're trying to do here. This will assign a tuple
    # to matches with whatever the value of matches[hit_id] is and bit_score.
    matches = matches[hit_id], bit_score
Regardless, we really need more information and the full code to figure out your actual goal and what's going wrong.
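If the goal is simply to find the gi| with the highest of the per-key maxima, a minimal sketch might look like this (this is an assumption about the intent, and the string keys below are just stand-ins for the gi| identifiers):

my_dict1 = {
    'gi|1002819492|': [437.702, 384.47, 380.86, 380.86, 362.83],
    'gi|675820360|': [2617.97, 2614.37, 122.112],
}

# build the "max per key" dict (the my_dict2 from the question) in one pass
my_dict2 = {key: max(values) for key, values in my_dict1.items()}

# pick the key whose best bit score is the highest overall
best_hit = max(my_dict2, key=my_dict2.get)
print best_hit, my_dict2[best_hit]  # gi|675820360| 2617.97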

how can I improve bulk calculation from file data

I have a file of binary values. The section I am looking at is 4-byte values in the pattern MW1, MVAR1, MW2, MVAR2, ...
I read the values in with
temp = array.array("f")
temp.fromfile(file, length *2)
mw_mvar = temp.tolist()
I then calculate the magnitude like this.
mag = [0] * length
for x in range(0, length * 2, 2):
    a = mw_mvar[x]
    b = mw_mvar[x + 1]
    mag[x / 2] = sqrt(a*a + b*b)
The calculations (not the read) are doubling the total run time of my script. I know there is (theoretically) a way to do this faster, because I am mimicking a script that ultimately calls Fortran (a pyd calling function DLLs written in Fortran, I think), which is able to do this calculation with a negligible effect on run time.
This is the best i can come up with. any suggestions for improvements?
I have also tried math.pow(), **.5, **2 with no differences.
With no luck improving the calculations, I went around the problem. I realised that I only needed 1% of those calculated values, so I created a class to calculate them on demand. It was important (to me) that the resulting code act as if it were a list of calculated values. A lot of the remainder of the process uses the values, and different versions of the data are pre-calculated. The class means I don't need a set of procedures for each version of the data.
class mag:
    def __init__(self, mw_mvar):
        self._mw_mvar = mw_mvar
        #_sgn = sgn
    def __len__(self):
        return len(self._mw_mvar) // 2
    def __getitem__(self, item):
        return sqrt(self._mw_mvar[2*item] ** 2 + self._mw_mvar[2*item+1] ** 2)
P.S. This could also be done in a function that takes both versions; I would have had to make more changes to the overall script.

def magnitude(a, b, x):
    if b[x] == 0:
        return a[x]
    else:
        return sqrt(a[x]**2 + b[x]**2)
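For what it's worth, if depending on numpy were an option (the original post doesn't mention it), the same magnitude calculation can be vectorised, which usually beats a Python-level loop by a wide margin:

import numpy as np

# mw_mvar is the interleaved list read from the file: MW1, MVAR1, MW2, MVAR2, ...
data = np.asarray(mw_mvar, dtype=np.float32)
mag = np.sqrt(data[0::2] ** 2 + data[1::2] ** 2)  # one vectorised pass, no Python loop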

Python: How to create a function that uses its own output and uses an array of randomly generated numbers

Disclaimer: I am quite new to Python and programming as a whole.
I have been trying to create a function to generate random stock prices using the following:
New stock price = previous price + (previous price*(return + (volatility * random number)))
The return and volatility numbers are fixed. Also, I have generated the random numbers N times.
The problem is how to create a function whose output is re-used as its own input, i.e. as the previous price.
Basically, I want an array of NEW stock prices generated from this formula, where the previous price variable is the function's own output from the step before.
I have been trying to do this for a couple of days and I am sure I am not fully equipped to do it (given that I am a newbie) but ANY HELP would really really be more than appreciated...!!!
Please any help would be useful.
import math
import random

initial_price = 10
return_daily = 0.12 / 252
vol_daily = 0.30 / (math.sqrt(252))

random_numbers = []
for i in range(5):
    random_numbers.append(random.gauss(0, 1))

def stock_prices(random_numbers):
    prices = []
    for i in range(0, len(random_numbers)):
        calc = initial_price + (initial_price * (return_daily + (vol_daily * random_numbers[i])))
        prices.append(calc)
    return prices
You can't really use recursion here, because you don't have a break condition that ends the recursion. You could construct one by passing an additional counter parameter that specifies how many more levels to recurse, but that would not be optimal in my opinion.
Instead, I recommend you to use a for loop that gets repeated a fixed number of times you can specify. This way you can add one new price value to a list per loop iteration step and access the previous one to calculate it:
first_price = 100
list_length = 20

def price_formula(previous_price):
    return previous_price * 1.2  # you would replace this with your actual calculation

prices = [first_price]  # create list with initial item
for i in range(list_length):  # repeats exactly 'list_length' times, turn number is 'i'
    prices.append(price_formula(prices[-1]))  # append new price to list
    # prices[-1] always returns the last element of the list, i.e. the previously added one.

print("\n".join(map(str, prices)))
My optimization of your code snippet:
import math
import random

initial_price = 10
return_daily = 0.12 / 252
vol_daily = 0.30 / (math.sqrt(252))

def stock_prices(number_of_prices):
    prices = [initial_price]
    for i in range(0, number_of_prices):
        prices.append(prices[-1] + (prices[-1] * (return_daily + (vol_daily * random.gauss(0, 1)))))
    return prices
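Called like this (output values will of course vary because of random.gauss), it returns the initial price followed by the simulated ones:

simulated = stock_prices(5)
print simulated  # six values: the initial price plus five simulated prices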
This is a classic Markov process: the present value depends upon its previous value, and only its previous value. The best thing to use in this case is an iterator, and you can write custom iterators that model this Markov behaviour.
Learn about how iterators can be generated here http://anandology.com/python-practice-book/iterators.html
Now that you have some understanding of how iterators work, you can create your own iterators for your problem. You need a class that implements the __iter__() method and the next() method.
Something like this:
import random
from math import sqrt

class Abc:
    def __init__(self, initPrice):
        self.v = initPrice  # This is the initial price
        self.dailyRet = 0.12/252
        self.dailyVol = 0.3/sqrt(252)
        return

    def __iter__(self): return self

    def next(self):
        self.v += self.v * (self.dailyRet + self.dailyVol*random.gauss(0,1))
        return self.v

if __name__ == '__main__':
    initPrice = 10
    temp = Abc(initPrice)
    for i in range(10):
        print temp.next()
This will give the output:
> python test.py
10.3035353791
10.3321905359
10.3963790497
10.5354048937
10.6345509793
10.2598381299
10.3336476153
10.6495914319
10.7915999185
10.6669136891
Note that this never raises StopIteration, so if you use it incorrectly, you may get into trouble. However, that is not difficult to implement and I hope you try to implement it ...
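One possible way to add that stop condition (a sketch only; capping the number of generated prices is an arbitrary choice, and the subclass name is made up):

class FiniteAbc(Abc):
    def __init__(self, initPrice, steps):
        Abc.__init__(self, initPrice)
        self.remaining = steps  # how many prices are left to generate

    def next(self):
        if self.remaining <= 0:
            raise StopIteration
        self.remaining -= 1
        return Abc.next(self)

# with StopIteration in place, the iterator can be consumed by a for loop directly
for price in FiniteAbc(10, 5):
    print price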

Finding length of list without using the 'len' function in python

In my high school assignment, part of it is to make a function that will find the average number in a list of floating points. We can't use len and such, so something like sum(numList)/float(len(numList)) isn't an option for me. I've spent an hour researching and racking my brain for a way to find the list length without using the len function, and I've got nothing, so I was hoping to either be shown how to do it or be pointed in the right direction. Help me Stack Overflow, you're my only hope. :)
Use a loop to add up the values from the list, and count them at the same time:
def average(numList):
    total = 0
    count = 0
    for num in numList:
        total += num
        count += 1
    return total / count
If you might be passed an empty list, you might want to check for that first and either return a predetermined value (e.g. 0), or raise a more helpful exception than the ZeroDivisionError you'll get if you don't do any checking.
If you're using Python 2 and the list might be all integers, you should either put from __future__ import division at the top of the file, or convert one of total or count to a float before doing the division (initializing one of them to 0.0 would also work).
Might as well show how to do it with a while loop since it's another opportunity to learn.
Normally, you won't need counter variable(s) inside a for loop. However, there are certain cases where it's helpful to keep a count as well as retrieve the item from the list, and this is where enumerate() comes in handy (see the short enumerate() version after the iterator-based solution below).
Basically, the solution below is what @Blckknght's solution is doing internally.
def average(items):
    """
    Takes in a list of numbers and finds the average.
    """
    if not items:
        return 0

    # iter() creates an iterator.
    # an iterator gives you the .next()
    # method, which will return the next item
    # in the sequence of items.
    it = iter(items)

    count = 0
    total = 0
    while True:
        try:
            # when there are no more
            # items in the list
            total += next(it)
            # a StopIteration is raised
        except StopIteration:
            # this gives us an opportunity
            # to break out of the infinite loop
            break
        # since the StopIteration will be raised
        # before a value is returned, we don't want
        # to increment the counter until after
        # a valid value is retrieved
        count += 1

    # perform the normal average calculation
    return total / float(count)
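For completeness, here is what the same count-while-you-iterate idea looks like with enumerate() (a sketch of my own, not taken from the answers above):

def average(items):
    if not items:
        return 0
    total = 0
    count = 0
    for count, num in enumerate(items, start=1):  # count is 1 on the first item, len(items) on the last
        total += num
    return total / float(count)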
def length_of_list(my_list):
    if not my_list:
        return 0
    return 1 + length_of_list(my_list[1:])

Iterating over a large unicode list taking a long time?

I'm working with the program Autodesk Maya.
I've made a naming convention script that will name each item in the scene according to a certain convention. However, I have it list everything in the scene, check whether the chosen name matches any current name in the scene, then rename the item and check the scene once more for duplicates.
However, when I run the code, it can take anywhere from 30 seconds to a minute or more to run through it all. At first I had no idea what was making my code run slowly, as it worked fine with a relatively small scene. But when I put print statements in the scene-checking code, I saw that it was taking a long time to check through all the items in the scene and look for duplicates.
The ls() command provides a unicode list of all the items in the scene. This list can be relatively large, a thousand entries or more if the scene has even a moderate amount of items; a normal scene would be several times larger than the test scene I have at the moment (which has about 794 items in this list).
Is this supposed to take this long? Is the method I'm using to compare things inefficient? I'm not sure what to do here; the code is taking an excessive amount of time, and I'm also wondering if it could be anything else in the code, but this seems like it might be it.
Here is some code below.
class Name(object):
    """A naming convention class that runs passed arguments through user
    dictionary, and returns formatted string of users input naming convention.
    """
    def __init__(self, user_conv):
        self.user_conv = user_conv
        # an example of a user convention is '${prefix}_${name}_${side}_${objtype}'

    @staticmethod
    def abbrev_lib(word):
        # a dictionary of abbreviated words is here, takes in a string
        # and returns an abbreviated string, if not found return given string

    @staticmethod
    def check_scene(name):
        """Checks entire scene for same name. If duplicate exists,
        Keyword Arguments:
        name -- (string) name of object to be checked
        """
        scene = ls()
        match = [x for x in scene if isinstance(x, collections.Iterable)
                 and (name in x)]
        if not match:
            return name
        else:
            return ''

    def convert(self, prefix, name, side, objtype):
        """Converts given information about object into user specified convention.
        Keyword Arguments:
        prefix -- what is prefixed before the name
        name -- name of the object or node
        side -- what side the object is on, example 'left' or 'right'
        obj_type -- the type of the object, example 'joint' or 'multiplyDivide'
        """
        prefix = self.abbrev_lib(prefix)
        name = self.abbrev_lib(name)
        side = ''.join([self.abbrev_lib(x) for x in side])
        objtype = self.abbrev_lib(objtype)
        i = 02
        checked = ''
        subs = {'prefix': prefix, 'name': name, 'side':
                side, 'objtype': objtype}
        while self.checked == '':
            newname = Template(self.user_conv.lower())
            newname = newname.safe_substitute(**subs)
            newname = newname.strip('_')
            newname = newname.replace('__', '_')
            checked = self.check_scene(newname)
            if checked == '' and i < 100:
                subs['objtype'] = '%s%s' % (objtype, i)
                i += 1
            else:
                break
        return checked
are you running this many times? You are potentially trolling a list of several hundred or a few thousand items for each iteration inside while self.checked =='', which would be a likely culprit. FWIW prints are also very slow in Maya, especially if you're printing a long list - so doing that many times will definitely be slow no matter what.
I'd try a couple of things to speed this up:
limit your searches to one type at a time - why troll through hundreds of random nodes if you only care about MultiplyDivide right now?
Use a set or a dictionary to search, rather than a list - sets and dictionaries are hash-based and much faster for membership lookups
If you're worried about maintaining a naming convention, definitely design it to be resistant to Maya's default behavior, which is to append numeric suffixes to keep names unique. Any naming convention which doesn't support this will be a pain in the butt for all time, because you can't prevent Maya from doing this in the ordinary course of business. On the other hand, if you use that for differentiating instances you don't need to do any uniquification at all - just use rename() on the object and capture the result. The weakness there is that Maya won't rename for global uniqueness, only local - so if you want to make unique node names for things that are not siblings you have to do it yourself.
Here's some cheapie code for finding unique node names:
def get_unique_scene_names(*nodeTypes):
    if not nodeTypes:
        nodeTypes = ('transform',)
    results = {}
    for longname in cmds.ls(type=nodeTypes, l=True):
        shortname = longname.rpartition("|")[-1]
        if not shortname in results:
            results[shortname] = set()
        results[shortname].add(longname)
    return results

def add_unique_name(item, node_dict):
    shortname = item.rpartition("|")[-1]
    if shortname in node_dict:
        node_dict[shortname].add(item)
    else:
        node_dict[shortname] = set([item])

def remove_unique_name(item, node_dict):
    shortname = item.rpartition("|")[-1]
    existing = node_dict.get(shortname, [])
    if item in existing:
        existing.remove(item)

def apply_convention(node, new_name, node_dict):
    if not new_name in node_dict:
        renamed_item = cmds.ls(cmds.rename(node, new_name), l=True)[0]
        remove_unique_name(node, node_dict)
        add_unique_name(renamed_item, node_dict)
        return renamed_item
    else:
        for n in range(99999):
            possible_name = new_name + str(n + 1)
            if not possible_name in node_dict:
                renamed_item = cmds.ls(cmds.rename(node, possible_name), l=True)[0]
                add_unique_name(renamed_item, node_dict)
                return renamed_item
        raise RuntimeError, "Too many duplicate names"
To use it on a particular node type, you just supply the right would-be name when calling apply_convention(). This would rename all the joints in the scene (naively!) to 'jnt_X' while keeping the suffixes unique. You'd do something smarter than that, like your original code did - this just makes sure that leaves are unique:
joint_names = get_unique_scene_names('joint')

existing = cmds.ls(type='joint', l=True)
existing.sort()
existing.reverse()
# do this to make sure it works from leaves backwards!

for item in existing:
    apply_convention(item, 'jnt_', joint_names)

# check the uniqueness constraint by looking for how many items share a short name in the dict:
for d in joint_names:
    print d, len(joint_names[d])
But, like I said, plan for those damn numeric suffixes; Maya makes them all the time without asking for permission, so you can't fight 'em :(
Instead of running ls for each and every name, you should run it once and store the result in a set (an unordered collection with fast membership tests). Then check against that when you run check_scene.
def check_scene(self, name):
    """Checks entire scene for same name. If duplicate exists,
    Keyword Arguments:
    name -- (string) name of object to be checked
    """
    if not hasattr(self, 'scene'):
        self.scene = set(ls())

    if name not in self.scene:
        return name
    else:
        return ''
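One caveat with caching the scene contents like this (my note, not part of the answer above): after an object is renamed, the cached set no longer matches the scene, so it needs to be kept in sync. Something along these lines, where old_name and newname are just illustrative names for the object being renamed:

checked = self.check_scene(newname)
if checked:
    rename(old_name, checked)       # Maya's rename call (cmds.rename / pymel rename)
    self.scene.discard(old_name)    # the old name no longer exists in the scene
    self.scene.add(checked)         # record the new name in the cached set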