Create Django objects in step increments

I'm trying to write a function that creates a total number of Django objects in step increments. Note, I'm using factory-boy to generate the fake data. Here is what I have come up with:
def create_todos_in_batch(total, step):
    quotient, remainder = divmod(total, step)
    for i in range(quotient):
        print(f"Inserting {i * step:,} to {(i + 1) * step:,}")
        todos = TodoFactory.build_batch(step)
        Todo.objects.bulk_create(todos)
    if remainder:
        print(f"Inserting {total - remainder:,} to {total:,}")
        todos = TodoFactory.build_batch(remainder)
        Todo.objects.bulk_create(todos)
To create 100 objects in 20 object increments, the function can be called like so:
create_todos_in_batch(100, 20)
I have the following unit test to prove this works correctly:
@pytest.mark.parametrize(
    "total, step",
    [
        (100, 20),
        (1_000, 100),
        (10_000, 1_000),
    ],
)
@pytest.mark.django_db
def test_create_in_batch(total, step):
    create_todos_in_batch(total=total, step=step)
    count = Todo.objects.count()
    assert count == total, f"{count:,} != {total:,}"
Is there a better way to write this function? I feel like this is too verbose because I'm repeating TodoFactory.build_batch() and Todo.objects.bulk_create(todos) multiple times.
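One possible refactor, as a sketch rather than a definitive answer (it assumes the same TodoFactory and Todo as above): compute the batch sizes up front, then run the build_batch/bulk_create pair in a single loop.

def create_todos_in_batch(total, step):
    quotient, remainder = divmod(total, step)
    # e.g. total=100, step=20 -> sizes == [20, 20, 20, 20, 20]
    sizes = [step] * quotient + ([remainder] if remainder else [])
    created = 0
    for size in sizes:
        print(f"Inserting {created:,} to {created + size:,}")
        Todo.objects.bulk_create(TodoFactory.build_batch(size))
        created += size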

How do I fit a pymc3 model when each person has multiple data points?

I'm trying to practice using pymc3 on the kinds of data I come across in my research, but I'm having trouble thinking through how to fit the model when each person gives me multiple data points, and each person comes from a different group (so I'm trying a hierarchical model).
Here's the practice scenario I'm using: suppose we have 2 groups of people, N = 30 in each group. All 60 people go through a 10-question survey, where each person can respond ("1") or not respond ("0") to each question. So, for each person, I have an array of length 10 with 1's and 0's.
To model these data, I assume each person has some latent trait "theta", and each item has a "discrimination" a and a "difficulty" b (this is just a basic item response model), and the probability of responding ("1") is given by: (1 + exp(-a(theta - b)))^(-1). (The logistic function applied to a(theta - b).)
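For reference, a minimal numpy sketch of that response probability (illustrative only, not necessarily the exact "item" helper used in getProbs below):

import numpy as np

def item(a, b, theta):
    # P(respond "1") given discrimination a, difficulty b, and ability theta
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))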
Here is how I tried to fit it using pymc3:
traces = {}
for grp in range(2):
    group = prac_data["Group ID"] == grp
    data = prac_data[group]["Response"]
    with pm.Model() as irt:
        # Priors
        a_tmp = pm.Normal('a_tmp', mu=0, sd=1, shape=10)
        a = pm.Deterministic('a', np.exp(a_tmp))
        # We do this transformation since we must have a >= 0
        b = pm.Normal('b', mu=0, sd=1, shape=10)
        # Now for the hyperpriors on the groups:
        theta_mu = pm.Normal('theta_mu', mu=0, sd=1)
        theta_sigma = pm.Uniform('theta_sigma', upper=2, lower=0)
        theta = pm.Normal('theta', mu=theta_mu,
                          sd=theta_sigma, shape=N)
        p = getProbs(Disc, Diff, theta, N)
        y = pm.Bernoulli('y', p=p, observed=data)
        traces[grp] = pm.sample(1000)
The function "getProbs" is supposed to give me an array of probabilities for the Bernoulli random variable, as the probability of responding 1 changes across trials/survey questions for each person. But this method gives me an error because it says to "specify one of p or logit_p", but I thought I did with the function?
Here's the code for "getProbs" in case it's helpful:
def getProbs(Disc, Diff, THETA, Nprt):
    # Get a large array of probabilities for the Bernoulli random variable
    n = len(Disc)
    m = Nprt
    probs = np.array([])
    for th in range(m):
        for t in range(n):
            p = item(Disc[t], Diff[t], THETA[th])
            probs = np.append(probs, p)
    return probs
I added the Nprt parameter because if I tried to get the length of THETA, it would give me an error since it is a FreeRV object. I know I can try and vectorize the "item" function, which is just the logistic function I put above, instead of doing it this way, but that also got me an error when I tried to run it.
I think I can do something with pm.Data to fix this, but the documentation isn't exactly clear to me.
Basically, I'm used to building models in JAGS, where you loop through each data point, but pymc3 doesn't seem to work like that. I'm confused about how to build/index my random variables in the model to make sure that the probabilities change how I'd like them to from trial-to-trial, and to make sure that the parameters I'm estimating correspond to the right person in the right group.
Thanks in advance for any help. I'm pretty new to pymc3 and trying to get the hang of it, and wanted to try something different from JAGS.
EDIT: I was able to solve this by first building the array I needed by looping through the trials, then transforming the array using:
p = theano.tensor.stack(p, axis = 0)
I then put this new variable in the "p" argument of the Bernoulli instance and it worked! Here's the updated full model: (below, I imported theano.tensor as T)
group = group.astype('int')
data = prac_data["Response"]
with pm.Model() as irt:
    # Priors
    # Item parameters:
    a = pm.Gamma('a', alpha=1, beta=1, shape=10)  # Discrimination
    b = pm.Normal('b', mu=0, sd=1, shape=10)  # Difficulty
    # Now for the hyperpriors on the groups: shape = 2 as there are 2 groups
    theta_mu = pm.Normal('theta_mu', mu=0, sd=1, shape=2)
    theta_sigma = pm.Uniform('theta_sigma', upper=2, lower=0, shape=2)
    # Individual-level person parameters:
    # group is a 2*N array that lets the model know which
    # theta_mu to use for each theta to estimate
    theta = pm.Normal('theta', mu=theta_mu[group],
                      sd=theta_sigma[group], shape=2*N)
    # Here, we're building an array of the probabilities we need for
    # each trial:
    p = np.array([])
    for n in range(2*N):
        for t in range(10):
            x = -a[t]*(theta[n] - b[t])
            p = np.append(p, x)
    # Here, we turn p into a tensor object to put as an argument to the
    # Bernoulli random variable
    p = T.stack(p, axis=0)
    y = pm.Bernoulli('y', logit_p=p, observed=data)
    # On my computer, this took about 5 minutes to run.
    traces = pm.sample(1000, cores=1)
print(az.summary(traces))  # Summary of parameter distributions

Counting matrix pairs using a threshold

I have a folder with hundreds of txt files I need to analyse for similarity. Below is an example of a script I use to run similarity analysis. In the end I get an array or a matrix I can plot etc.
I would like to see how many pairs there are with cos_similarity > 0.5 (or any other threshold I decide to use), removing cos_similarity == 1 when I compare the same files, of course.
Secondly, I need a list of these pairs based on file names.
So the output for the example below would look like:
1
and
["doc1", "doc4"]
I will really appreciate your help, as I feel a bit lost not knowing which direction to go.
This is an example of my script to get the matrix:
doc1 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints that it is failing to meet that pledge."
doc2 = "The BBC has been inundated with comments from Amazon Prime customers. Most reported problems with deliveries."
doc3 = "An Amazon spokesman told the BBC the ASA had confirmed to it there was no investigation at this time."
doc4 = "Amazon's promise of next-day deliveries could be investigated amid customer complaints..."
documents = [doc1, doc2, doc3, doc4]
# In my real script I iterate through a folder (path) with txt files like this:
#def read_text(path):
# documents = []
# for filename in glob.iglob(path+'*.txt'):
# _file = open(filename, 'r')
# text = _file.read()
# documents.append(text)
# return documents
import nltk, string, numpy
nltk.download('punkt')  # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()

def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

nltk.download('wordnet')  # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
tf_matrix = LemVectorizer.transform(documents).toarray()

from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
tfidf_matrix = tfidfTran.transform(tf_matrix)
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()

from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')

def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()

cos_similarity(documents)
Out:
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
As I understand your question, you want a function that takes the output numpy array and a certain value (threshold) and returns two things:
how many pairs have a similarity greater than or equal to the given threshold
the names of the documents in those pairs.
So, here I've made the following function, which takes three arguments:
the output numpy array from the cos_similarity() function.
the list of document names.
a certain number (threshold).
And here it is:
def get_docs(arr, docs_names, threshold):
    output_tuples = []
    for row in range(len(arr)):
        lst = [row + 1 + idx for idx, num in
               enumerate(arr[row, row+1:]) if num >= threshold]
        for item in lst:
            output_tuples.append((docs_names[row], docs_names[item]))
    return len(output_tuples), output_tuples
Let's see it in action:
>>> docs_names = ["doc1", "doc2", "doc3", "doc4"]
>>> arr = cos_similarity(documents)
>>> arr
array([[ 1. , 0.1459739 , 0.03613371, 0.76357693],
[ 0.1459739 , 1. , 0.11459266, 0.19117117],
[ 0.03613371, 0.11459266, 1. , 0.04732164],
[ 0.76357693, 0.19117117, 0.04732164, 1. ]])
>>> threshold = 0.5
>>> get_docs(arr, docs_names, threshold)
(1, [('doc1', 'doc4')])
>>> get_docs(arr, docs_names, 1)
(0, [])
>>> get_docs(arr, docs_names, 0.13)
(3, [('doc1', 'doc2'), ('doc1', 'doc4'), ('doc2', 'doc4')])
Let's see how this function works:
first, I iterate over every row of the numpy array.
Second, I iterate over every item in the row whose index is bigger than the row's index, so we only walk the upper triangle of the matrix.
That's because each pair of documents appears twice in the full array: the two values arr[0][1] and arr[1][0] are the same. You should also notice that the diagonal items aren't included, because we know for sure they are 1, as every document is perfectly similar to itself :).
Finally, we keep the items whose values are greater than or equal to the given threshold and record their indices. These indices are then used to look up the document names.
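If the matrix gets large, the same upper-triangle idea can be written with numpy indexing. A vectorized sketch (not part of the original answer; it assumes arr and docs_names as above):

import numpy as np

def get_docs_np(arr, docs_names, threshold):
    # indices of the upper triangle, excluding the diagonal
    rows, cols = np.triu_indices(len(arr), k=1)
    keep = arr[rows, cols] >= threshold
    pairs = [(docs_names[r], docs_names[c]) for r, c in zip(rows[keep], cols[keep])]
    return len(pairs), pairs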

Convert a concrete model into an abstract one

I am just starting with Pyomo and I have a big problem.
I want to run an Abstract Model without using the terminal. I can do it with a concrete model, but I am having serious problems doing it with the abstract one.
I just want to use F5 and run the code.
This is my program:
import pyomo
from pyomo.environ import *
#
# Model
#
model = AbstractModel()

# Set: Indices
model.Unit = Set()
model.Block = Set()
model.DemBlock = Set()

# Parameters
model.EnergyBid = Param(model.Unit, model.Block)
model.PriceBid = Param(model.Unit, model.Block)
model.EnergyDem = Param(model.DemBlock)
model.PriceDem = Param(model.DemBlock)
model.Pmin = Param(model.Unit)
model.Pmax = Param(model.Unit)

# Variables definition
model.PD = Var(model.DemBlock, within=NonNegativeReals)
model.PG = Var(model.Unit, model.Block, within=NonNegativeReals)
# Binary variable
model.U = Var(model.Unit, within=Binary)

# Objective
def SocialWellfare(model):
    SocialWellfare = sum([model.PriceDem[i]*model.PD[i] for i in model.DemBlock]) - sum([model.PriceBid[j,k]*model.PG[j,k] for j in model.Unit for k in model.Block])
    return SocialWellfare
model.SocialWellfare = Objective(rule=SocialWellfare, sense=maximize)

# Constraints
# Max and min power generated
def PDmax_constraint(model, p):
    return ((model.PD[p] - model.EnergyDem[p])) <= 0
model.PDmax = Constraint(model.DemBlock, rule=PDmax_constraint)

def PGmax_constraint(model, n, m):
    return ((model.PG[n,m] - model.EnergyBid[n,m])) <= 0
model.PGmax = Constraint(model.Unit, model.Block, rule=PGmax_constraint)

def Power_constraintDW(model, i):
    return ((sum(model.PG[i,k] for k in model.Block)) - (model.Pmin[i] * model.U[i])) >= 0
model.LimDemandDw = Constraint(model.Unit, rule=Power_constraintDW)

def Power_constraintUP(model, i):
    return ((sum(model.PG[i,k] for k in model.Block) - (model.Pmax[i])*model.U[i])) <= 0
model.LimDemandaUp = Constraint(model.Unit, rule=Power_constraintUP)

def PowerBalance_constraint(model):
    return (sum(model.PD[i] for i in model.DemBlock) - sum(model.PG[j,k] for j in model.Unit for k in model.Block)) == 0
model.PowBalance = Constraint(rule=PowerBalance_constraint)

model.pprint()

instance = model.create('datos_transporte.dat')

## Create the ipopt solver plugin using the ASL interface
solver = 'ipopt'
solver_io = 'nl'
opt = SolverFactory(solver, solver_io=solver_io)

results = opt.solve(instance)
results.write()
Any help with the last part??
Thanks anyway,
I think your example is actually working. Starting in Pyomo 4.1, the solution returned from the solver is stored directly into the instance that was solved, and is not returned in the solver results object. This change was made because generating the representation of the solution in the results object was rather expensive and was not easily parsed by people. Working with the model instance directly is more natural.
It is unfortunate that the results object reports the number of solutions: 0, although this is technically correct: the results object holds no solutions ... but the Solver section should indicate that a solution was returned and stored into the model instance.
If you want to see the result returned by the solver, you can print out the current status of the model using:
instance.display()
after the call to solve(). That will report the current Var values that were returned from the solver. You will want to pay attention to the stale column:
False indicates the value is not "stale" ... that is, it was either set by the user (before the call to solve()), or it was returned from the solver (after the call to solve()).
True indicates the solver did not return a value for that variable. This is usually because the variable was not referenced by the objective or any enabled constraints, so Pyomo never sent the variable to the solver.
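If you prefer to pull specific values programmatically instead of printing the whole model state, a minimal sketch (not from the original answer, assuming the model above):

from pyomo.environ import value

for i in instance.DemBlock:
    print(i, value(instance.PD[i]))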
[Note: display() serves a slightly different role than pprint(): pprint() outputs the model structure, whereas display() outputs the model state. So, for example, where pprint() will output the constraint expressions, display() will output the numerical value of the expression (using the current variable/param values).]
Edited to expand the discussion of display() & pprint()

How to calculate all possibilities of very large string matrices efficiently?

OK, so let's say I have a situation where I have a bunch of objects in different classifications and I need to know the total possible combinations of these objects. So I end up with an input that looks like this:
{'raw': [{'AH': ['P', 'C', 'R', 'Q', 'L']},
         {'BG': ['M', 'A', 'S', 'B', 'F']},
         {'KH': ['E', 'V', 'G', 'N', 'Y']},
         {'KH': ['E', 'V', 'G', 'N', 'Y']},
         {'NM': ['1', '2', '3', '4', '5']}]}
The keys AH, BG, KH, NM constitute groups, the values are lists that hold individual objects, and a finished group consists of one member from each list; in this example KH is listed twice, so each finished group has 2 members of KH in it. I built something that handles this; it looks like this:
class Builder():
    def __init__(self, data):
        self.raw = data['raw']
        node = []
        for item in self.raw:
            for k in item.keys():
                node.append({k: 0})
        logger.debug('node: %s' % node)
        # Parse out groups #
        self.groups = []
        increment = -2
        while True:
            try:
                assert self.raw[increment].values()[0][node[increment][node[increment].keys()[0]]]
                increment = -2
                for x in self.raw[-1].values()[0]:
                    group = []
                    for k in range(0, len(node[:-1])):
                        position = node[k].keys()[0]
                        player = self.raw[k].values()[0][node[k][node[k].keys()[0]]]
                        group.append({position: player})
                    group.append({self.raw[-1].keys()[0]: x})
                    if self.repeatRemovals(group):
                        self.groups.append(group)
                node[increment][node[increment].keys()[0]] += 1
            except IndexError:
                node[increment][node[increment].keys()[0]] = 0
                increment -= 1
                try:
                    node[increment][node[increment].keys()[0]] += 1
                except IndexError:
                    break
        for group in self.groups:
            logger.debug(group)

    def repeatRemovals(self, group):
        for x in range(0, len(group)):
            for y in range(0, len(group)):
                if group[x].values()[0] == group[y].values()[0] and x != y:
                    return False
        return True

if __name__ == '__main__':
    groups = Builder({'raw': [{'AH': ['P', 'C', 'R', 'Q', 'L']},
                              {'BG': ['M', 'A', 'S', 'B', 'F']},
                              {'KH': ['E', 'V', 'G', 'N', 'Y']},
                              {'KH': ['E', 'V', 'G', 'N', 'Y']},
                              {'NM': ['1', '2', '3', '4', '5']}]})
    logger.debug("Total groups: %d" % len(groups.groups))
The output of running this should make my intended goal clear, if I have failed to do so in the text. My concern is the time it takes to handle large classifications of objects: when a classification has some 40-odd objects in it, appears in the matrix three times, and there are 7 other classifications of comparable size. I think the numpy library could help me, but I am new to scientific programming and am not sure how to go about it, or whether it would be worth it. Could anyone provide some insight? Thank you.
Try this:
Remove duplicated values
Calculate all possibilities using permutation and factorial
Like that:
https://www.youtube.com/watch?v=Oc50d2GqXx0
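To make that suggestion concrete, here is one way it might look in code (an illustrative sketch only, using the example input from the question; the counting formula assumes values never repeat across different keys, as in that example):

import math
from collections import Counter
from itertools import product

raw = [{'AH': ['P', 'C', 'R', 'Q', 'L']},
       {'BG': ['M', 'A', 'S', 'B', 'F']},
       {'KH': ['E', 'V', 'G', 'N', 'Y']},
       {'KH': ['E', 'V', 'G', 'N', 'Y']},
       {'NM': ['1', '2', '3', '4', '5']}]

# Counting without enumerating: a key that appears k times with n distinct
# members contributes n!/(n-k)! ordered, repeat-free choices.
key_counts = Counter(next(iter(d)) for d in raw)
sizes = {next(iter(d)): len(next(iter(d.values()))) for d in raw}
total = 1
for key, k in key_counts.items():
    n = sizes[key]
    total *= math.factorial(n) // math.factorial(n - k)
print(total)  # 2500 for this example input

# Enumerating the groups themselves (and dropping repeats) with
# itertools.product instead of hand-rolled index bookkeeping:
lists = [next(iter(d.values())) for d in raw]
groups = [combo for combo in product(*lists) if len(set(combo)) == len(combo)]
print(len(groups))  # matches the count above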

Creating a list of sums

I'm a newbie in Python and I'm struggling to create a list of sums generated by a for loop.
I got a school assignment where my program has to simulate the scores of a class of blind students in a multiple-choice test.
def blindwalk():  # Generates the blind answers in a test with 21 questions
    import random
    resp = []
    gab = ["a", "b", "c", "d"]
    for n in range(0, 21):
        resp.append(random.choice(gab))
    return resp

def gabarite():  # Generates the official answer key of the tests
    import random
    answ_gab = []
    gab = ["a", "b", "c", "d"]
    for n in range(0, 21):
        answ_gab.append(random.choice(gab))
    return answ_gab

def class_tests(A):  # A is the number of students
    alumni = []
    A = int(A)
    for a in range(0, A):
        alumni.append(blindwalk())
    return alumni

def class_total(A):  # A is the number of students
    A = int(A)
    official_gab = gabarite()
    tests = class_tests(A)
    total_score = []*0
    for a in range(0, A):
        for n in range(0, 21):
            if tests[a][n] == official_gab[n]:
                total_score[a].add(1)
    return total_score
When I run the class_total() function, I get this error:
total_score[a].add(1)
IndexError: list index out of range
Question is: how do I evaluate the score of each student and create a list of those scores, because this is what I want to do with the class_total() function.
I also tried
if tests[a][n] == official_gab[n]:
    total_score[a] += 1
But I got the same error, so I think I don't fully understand how lists work in Python yet.
Thanks!
(Also, I'm not a native English speaker, so please tell me if I haven't been clear enough.)
This line:
total_score = []*0
And in fact, any of the following lines:
total_score = []*30
total_score = []*3000
total_score = []*300000000
These all cause total_score to be instantiated as an empty list. It doesn't even have a 0th index in this case! If you'd like to initialize every value to x in a list of length l, the syntax would look more like:
my_list = [x]*l
Alternatively, instead of thinking about the size before-hand, you can use .append instead of trying to access a particular index, as in:
my_list = []
my_list.append(200)
# my_list is now [200], my_list[0] is now 200
my_list.append(300)
# my_list is now [200,300], my_list[0] is still 200 and my_list[1] is now 300
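Applying the same idea to the class_total() function from the question, one possible fix (a sketch following the append approach above):

def class_total(A):  # A is the number of students
    A = int(A)
    official_gab = gabarite()
    tests = class_tests(A)
    total_score = []
    for a in range(A):
        score = 0
        for n in range(21):
            if tests[a][n] == official_gab[n]:
                score += 1
        total_score.append(score)
    return total_score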