View execution time is very long (above one minute) - Django

In our Django project, there is a view which creates multiple objects (from 5 to even 100). The problem is that the creation phase takes a very long time.
I don't know why that is, but I suppose it could be because for n objects there are n database queries and commits.
For example, creating 24 objects takes 67 seconds.
I want to speed up this process.
There are two things I think may be worth considering:
To create these objects in one query so only one commit is executed.
Create a ThreadPool and create these objects parallel.
This is the part of the view which causes problems (we use Postgres on localhost, so the connection is not the problem):
@require_POST
@login_required
def heu_import(request):
    ...
    ...
    product = Product.objects.create(user=request.user, name=name, manufacturer=manufacturer, category=category)
    product.groups.add(*groups)
    occurences = []
    counter = len(urls_xpaths)
    print 'Occurences creating'
    start = datetime.now()
    eur_currency = Currency.objects.get(shortcut='eur')
    for url_xpath in urls_xpaths:
        counter -= 1
        print counter
        url = url_xpath[1]
        xpath = url_xpath[0]
        occ = Occurence.objects.create(product=product, url=url, xpath=xpath, active=True if xpath else False, currency=eur_currency)
        occurences.append(occ)
    print 'End'
    print datetime.now() - start
    ...
    return render(request, 'main_app/dashboard/new-product.html', context)
Output:
Occurences creating
24
.
.
.
0
End
0:01:07.727000
EDIT:
I tried to put the for loop into a with transaction.atomic(): block, but it seems to help only a bit (47 seconds instead of 67).
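For reference, that change presumably looked something like the following minimal sketch, reusing the names from the view above (transaction comes from django.db):

from django.db import transaction

with transaction.atomic():
    for url_xpath in urls_xpaths:
        url = url_xpath[1]
        xpath = url_xpath[0]
        occ = Occurence.objects.create(product=product, url=url, xpath=xpath,
                                       active=True if xpath else False,
                                       currency=eur_currency)
        occurences.append(occ)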
EDIT2:
I'm not sure, but it seems that the SQL queries are not the problem.

Please use bulk_create for inserting multiple objects.
occurences = []
for url_xpath in urls_xpaths:
    counter -= 1
    print counter
    url = url_xpath[1]
    xpath = url_xpath[0]
    occurences.append(Occurence(product=product, url=url, xpath=xpath, active=True if xpath else False, currency=eur_currency))
Occurence.objects.bulk_create(occurences)
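bulk_create builds the objects in memory and inserts them with (typically) a single INSERT statement, and it does not call save() or send the per-object signals. If the number of objects grows, it also accepts a batch_size argument to cap the size of each INSERT; a minimal sketch, with 100 chosen arbitrarily:

Occurence.objects.bulk_create(occurences, batch_size=100)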

Related

Efficiently updating a large number of records based on a field in that record using Django

I have about a million Comment records that I want to update based on that comment's body field. I'm trying to figure out how to do this efficiently. Right now, my approach looks like this:
update_list = []
qs = Comments.objects.filter(word_count=0)
for comment in qs:
    model_obj = Comments.objects.get(id=comment.id)
    model_obj.word_count = len(model_obj.body.split())
    update_list.append(model_obj)
Comment.objects.bulk_update(update_list, ['word_count'])
However, this hangs and seems to time out in my migration process. Does anybody have suggestions on how I can accomplish this?
It's not easy to determine the memory footprint of a Django object, but an absolute minimum is the amount of space needed to store all of its data. My guess is that you may be running out of memory and page-thrashing.
You probably want to work in batches of, say, 1000 objects at a time. Use Queryset slicing, which returns another queryset. Try something like
BATCH_SIZE = 1000
start = 0
base_qs = Comments.objects.filter(word_count=0)
while True:
    batch_qs = base_qs[start:start + BATCH_SIZE]
    start += BATCH_SIZE
    if not batch_qs.exists():
        break
    update_list = []
    for comment in batch_qs:
        model_obj = Comments.objects.get(id=comment.id)
        model_obj.word_count = len(model_obj.body.split())
        update_list.append(model_obj)
    Comment.objects.bulk_update(update_list, ['word_count'])
    print(f'Processed batch starting at {start}')
Each trip around the loop will free the space occupied by the previous trip when it replaces batch_qs and update_list. The print statement will allow you to watch it progress at a hopefully acceptable, regular rate!
Warning - I have never tried this. I'm also wondering whether slicing and filtering will play nice with each other or whether one should use
base_qs = Comments.objects.all()
...
while True:
    batch_qs = base_qs[start:start + BATCH_SIZE]
    ....
    for comment in batch_qs.filter(word_count=0):
so you are slicing your way though rows in the entire DB table and retrieving a subset of each slice that needs updating. This feels "safer". Anybody know for sure?
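One way to sidestep the slicing/filtering interaction entirely is to walk the table by primary key instead of by offset. This is only a sketch, assuming Comments has an ordered auto-incrementing pk and following the question's model name:

BATCH_SIZE = 1000
last_pk = 0
while True:
    # next batch of rows that still need a word_count, in pk order
    batch = list(
        Comments.objects
        .filter(word_count=0, pk__gt=last_pk)
        .order_by('pk')[:BATCH_SIZE]
    )
    if not batch:
        break
    for comment in batch:
        comment.word_count = len(comment.body.split())
    Comments.objects.bulk_update(batch, ['word_count'])
    last_pk = batch[-1].pk

Tracking last_pk guarantees forward progress even if some rows legitimately keep word_count == 0 (for example, comments with an empty body).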

How to "create or update" using F object?

I need to create a LeaderboardEntry if it does not exist. If it exists, it should be updated with the new value (current + new). I want to achieve this with a single query. How can I do that?
The current code looks like this (2 queries):
reward_amount = 50
LeaderboardEntry.objects.get_or_create(player=player)
LeaderboardEntry.objects.filter(player=player).update(golds=F('golds') + reward_amount)
PS: Default value of "golds" is 0.
You can save one query hit by using defaults:
reward_amount = 50
leader_board, created = LeaderboardEntry.objects.get_or_create(
    player=player,
    defaults={
        "golds": reward_amount,
    }
)
if not created:
    leader_board.golds += reward_amount
    leader_board.save(update_fields=["golds"])
I think your problem is with the get_or_create() method: it returns a tuple of two values, (object, created), so you have to receive them in your code as follows:
reward_amount = 50
entry, __ = LeaderboardEntry.objects.get_or_create(player=player)
entry.golds += reward_amount
entry.save()
It will work better than your current code, as it avoids making the two queries. Of course, the save() method will still hit your database again.
You can solve this with update_or_create:
LeaderboardEntry.objects.update_or_create(
player=player,
defaults={
'golds': F('golds') + reward_amount
}
)
EDIT:
Sorry, F expressions in update_or_create are not yet supported.
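One common workaround that keeps the hot path to a single query is to try the update() first and only fall back to create() when no row was touched. A sketch only, assuming player is unique on LeaderboardEntry:

from django.db import IntegrityError
from django.db.models import F

reward_amount = 50

# One UPDATE in the common case; update() returns the number of rows it touched.
updated = LeaderboardEntry.objects.filter(player=player).update(
    golds=F('golds') + reward_amount
)
if not updated:
    try:
        # First reward for this player: create the row with the initial amount.
        LeaderboardEntry.objects.create(player=player, golds=reward_amount)
    except IntegrityError:
        # Another process created it concurrently; retry the update.
        LeaderboardEntry.objects.filter(player=player).update(
            golds=F('golds') + reward_amount
        )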

python performance issue while searching in a huge list

I need to speed up (dramatically) the search in a "huge" single-dimension list of unsigned values. The list has 389,114 elements, and I need to perform a check before adding an item to make sure it doesn't already exist.
I do this check 15 million times...
Of course, it takes too much time.
The fastest way I found was:
if this_item in my_list:
    i = my_list.index(this_item)
else:
    my_list.append(this_item)
    i = len(my_list)
...
I am building a dataset from time-series logs.
One column of these (huge) logs is a text message, which is very redundant.
To dramatically speed up the process, I transform this text into an unsigned integer with Adler32() and get a unique numeric value, which is great.
Then I store the messages in a PostgreSQL database, with this value as the index.
For each line of my log files (15 million altogether), I need to update my database of unique messages (389,114 unique messages).
It means that for each line, I need to check if the message ID belongs to my in-memory list.
I tried "... in list", the same with dictionaries, numpy arrays, transforming the list into a string and searching the string, SQL queries against the database with a good index...
Nothing is better than "if item in list" when the list is loaded into memory (very fast):
if this_item in my_list:
    i = my_list.index(this_item)
else:
    my_list.append(this_item)
    i = len(my_list)
For 15 million iterations with some other work and NO search in the list:
- It takes 8 minutes to generate the 2 tables of 15 million lines (features and targets)
- When I activate the code above to check if a message ID already exists, it takes 1 hour 35 min...
How could I optimize this?
Thank you for your help
If your code is, roughly, this:
my_list = []
for this_item in collection:
    if this_item in my_list:
        i = my_list.index(this_item)
    else:
        my_list.append(this_item)
        i = len(my_list)
    ...
Then it will run in O(n^2) time since the in operator for lists is O(n).
You can achieve linear time if you use a dictionary (which is implemented with a hash table) instead:
my_list = []
table = {}
for this_item in collection:
    i = table.get(this_item)
    if i is None:
        i = len(my_list)
        my_list.append(this_item)
        table[this_item] = i
    ...
Of course, if you don't care about processing the items in the original order, you can just do:
for i, this_item in enumerate(set(collection)):
    ...
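To see how large the difference is, here is a quick, purely illustrative timing sketch comparing membership tests on a list and on a set of roughly the size mentioned in the question (the sizes and probe value are arbitrary):

import random
import timeit

items = list(range(389114))   # roughly the size from the question
as_set = set(items)
probe = random.choice(items)

# 'in' on a list scans elements one by one: O(n) per lookup.
print(timeit.timeit(lambda: probe in items, number=1000))
# 'in' on a set (or dict) is a hash lookup: O(1) on average.
print(timeit.timeit(lambda: probe in as_set, number=1000))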

What's slowing down this piece of python code?

I have been trying to implement the Stupid Backoff language model (the description is available here, though I believe the details are not relevant to the question).
The thing is, the code is working and producing the expected result, but it runs slower than I expected. I figured out that the part slowing everything down is here (and NOT in the training part):
def compute_score(self, sentence):
    length = len(sentence)
    assert length <= self.n
    if length == 1:
        word = tuple(sentence)
        return float(self.ngrams[length][word]) / self.total_words
    else:
        words = tuple(sentence[::-1])
        count = self.ngrams[length][words]
        if count == 0:
            return self.alpha * self.compute_score(sentence[1:])
        else:
            return float(count) / self.ngrams[length - 1][words[:-1]]

def score(self, sentence):
    """ Takes a list of strings as argument and returns the log-probability of the
    sentence using your language model. Use whatever data you computed in train() here.
    """
    output = 0.0
    length = len(sentence)
    for idx in range(length):
        if idx < self.n - 1:
            current_score = self.compute_score(sentence[:idx+1])
        else:
            current_score = self.compute_score(sentence[idx-self.n+1:idx+1])
        output += math.log(current_score)
    return output
self.ngrams is a nested dictionary that has n entries. Each of these entries is a dictionary mapping (word_i, word_i-1, word_i-2, ..., word_i-n) to the count of this combination.
self.alpha is a constant that defines the penalty for backing off to n-1.
self.n is the maximum length of the tuple that the program looks for in the dictionary self.ngrams. It is set to 3 (though setting it to 2 or even 1 doesn't change anything). It's weird, because the unigram and bigram models work just fine in fractions of a second.
The answer that I am looking for is not a refactored version of my own code, but rather a tip which part of it is the most computationally expensive (so that I could figure out myself how to rewrite it and get the most educational profit from solving this problem).
Please, be patient, I am but a beginner (two months into the world of programming). Thanks.
UPD:
I timed the running time with the same data using time.time():
Unigram = 1.9
Bigram = 3.2
Stupid Backoff (n=2) = 15.3
Stupid Backoff (n=3) = 21.6
(This is on somewhat bigger data than originally because of time.time's poor precision.)
If the sentence is very long, most of the code that's actually running is here:
def score(self, sentence):
    for idx in range(len(sentence)): # should use xrange in Python 2!
        self.compute_score(sentence[idx-self.n+1:idx+1])

def compute_score(self, sentence):
    words = tuple(sentence[::-1])
    count = self.ngrams[len(sentence)][words]
    if count == 0:
        self.compute_score(sentence[1:])
    else:
        self.ngrams[len(sentence) - 1][words[:-1]]
That's not meant to be working code--it just removes the unimportant parts.
The flow in the critical path is therefore:
For each word in the sentence:
Call compute_score() on that word plus the following 2. This creates a new list of length 3. You could avoid that with itertools.islice().
Construct a 3-tuple with the words reversed. This creates a new tuple. You could avoid that by passing the -1 step argument when making the slice outside this function.
Look up in self.ngrams, a nested dict, with the first key being a number (might be faster if this level were a list; there are only three keys anyway?), and the second being the tuple just created.
Recurse with the first word removed, i.e. make a new tuple (sentence[2], sentence[1]), or
Do another lookup in self.ngrams, implicitly creating another new tuple (words[:-1]).
In summary, I think the biggest problem you have is the repeated and nested creation and destruction of lists and tuples.
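As an illustration only (not a drop-in fix), the two methods could be reshaped so that each position builds its reversed key exactly once and the backoff merely shortens that tuple. This sketch reuses the attribute names from the question (self.ngrams, self.alpha, self.n, self.total_words) and mirrors the original lookup semantics:

def compute_score_from_key(self, key):
    # key is the reversed n-gram, e.g. (w_i, w_i-1, w_i-2)
    length = len(key)
    if length == 1:
        return float(self.ngrams[1][key]) / self.total_words
    count = self.ngrams[length][key]
    if count == 0:
        # back off by dropping the oldest word (the last element of the reversed key)
        return self.alpha * self.compute_score_from_key(key[:-1])
    return float(count) / self.ngrams[length - 1][key[:-1]]

def score(self, sentence):
    # assumes 'import math' at module level, as in the original class
    output = 0.0
    for idx in range(len(sentence)):
        lo = max(0, idx - self.n + 1)
        # build the reversed key directly, without an intermediate slice plus [::-1] copy
        key = tuple(sentence[j] for j in range(idx, lo - 1, -1))
        output += math.log(self.compute_score_from_key(key))
    return output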

How can I share an updated values in dictionary among different process?

I am a newbie here and also in Python. I have a question about dictionaries and multiprocessing. I want to run this part of the code on the second core of my Raspberry Pi (the GUI application is running on the first). So, I created a dictionary (20 keys plus an array of length 256 for each key - the script below is just a short example). I initialized this dictionary in a separate script and set all of its values to zero. Script table1.py (this dictionary should be available from both cores):
diction = {}
diction['FrameID']= [0]*10
diction['Flag']= ["Z"]*10
In the second script (which should run on the second core), I read the values that I get from the serial port and put them into this dictionary (parsing + conversion) at the appropriate place. Since I get a lot of information through the serial port, the information is changing all the time. Script Readmode.py:
from multiprocessing import Process
import time
import serial
import table1

def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

def hexTobin(hex_num):
    scale = 16 ## equals to hexadecimal
    num_of_bits = len(hex_num)*4
    bin_num = bin(int(hex_num, scale))[2:].zfill(num_of_bits) #hex to binary
    return bin_num

def readSerial():
    port = "/dev/ttyAMA0"
    baudrate = 115200
    ser = serial.Serial(port, baudrate, bytesize=8, parity=serial.PARITY_NONE, stopbits=1, xonxoff=False, rtscts=False)
    line = []
    for x in xrange(1):
        ser.write(":AAFF:AA\r\n:F1\r\n") # new
    while True:
        for c in ser.read():
            line.append(c)
            a = ''.join(line[-2:])
            if a == '\r\n':
                b = ''.join(line)
                print("what I get:" + b)
                c = b[b.rfind(":"):len(b)] #string between last ":" start delimiter and stop delimiter
                reps = {':':'', '\r\n':''} #throw away start and stop delimiter
                txt = replace_all(c, reps)
                print("hex num: " + txt)
                bina_num = hexTobin(txt) # convert hex to bin
                print("bin num: " + bina_num)
                ssbit = bina_num[:3] # first three bits
                print("select first three bits: " + ssbit)
                abit = int(ssbit, 2) # binary to integer
                if abit == 5:
                    table1.diction['FrameID'][0] = abit
                if abit == 7:
                    table1.diction['FrameID'][5] = abit
                print("int num: ", abit)
                print(table1.diction)
                line = []
                break
    ser.close()

p1 = Process(target=readSerial)
p1.start()
During that time I want to read the information in this dictionary and use it in another process. But when I try to read those values, they are all zero.
My question is: how can I create a dictionary that is available to both processes and can be updated based on the data received from the serial port?
Thank you for your answer in advance.
In Python, when you start two different scripts, even if they import some common modules, they share nothing (except the code). If two scripts both import table1, then they both have their own instance of the table1 module and therefore their own instance of the table1.diction variable.
If you think about it, it has to be like that. Otherwise all Python scripts would share the same sys.stdout, for example. So, more or less nothing is shared between two different scripts executing at the same time.
The multiprocessing module lets you create more than one process from the same single script. So you'll need to merge your GUI and your reading function into the same script. But then you can do what you want. Your code will look something like this:
import multiprocessing
# create shared dictionary sd
p1 = multiprocessing.Process(target=readSerial, args=(sd,))
p1.start()
# create the GUI behaviour you want
But hang on a minute. This won't work either. Because when the Process is created it starts a new instance of the Python interpreter and creates all its own instances again, just like starting a new script. So even now, by default, nothing in readSerial will be shared with the GUI process.
But fortunately the multiprocessing module provides some explicit techniques to share data between the processes. There is more than one way to do that, but here is one that works:
import multiprocessing
import time

def readSerial(d):
    d["test"] = []
    for i in range(100):
        l = d["test"]
        l.append(i)
        d["test"] = l
        time.sleep(1)

def doGUI(d):
    while True:
        print(d)
        time.sleep(.5)

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        sd = manager.dict()
        p = multiprocessing.Process(target=readSerial, args=(sd,))
        p.start()
        doGUI(sd)
You'll notice that the append to the list in the readSerial function is a bit odd. That is because the dictionary object we are working with here is not a normal dictionary. It's actually a dictionary proxy that conceals a pipe used to send the data between the two processes. When I append to the list inside the dictionary the proxy needs to be notified (by assigning to the dictionary) so it knows to send the updated data across the pipe. That is why we need the assignment (rather than simply directly mutating the list inside the dictionary). For more on this look at the docs.
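Applied to the dictionary from the question, the same fetch-modify-reassign pattern would look roughly like this. This is only a sketch: the serial parsing is replaced by hard-coded frame IDs, and only the FrameID/Flag keys from table1.py are kept:

import multiprocessing
import time

def readSerial(d):
    # stand-in for the real serial parsing: pretend we decoded two frame IDs
    for abit in (5, 7):
        frame = d['FrameID']      # fetch the current list from the proxy
        frame[0 if abit == 5 else 5] = abit
        d['FrameID'] = frame      # reassign so the manager propagates the change
        time.sleep(1)

def doGUI(d):
    for _ in range(6):
        print(d)                  # the GUI process sees the updates
        time.sleep(0.5)

if __name__ == '__main__':
    with multiprocessing.Manager() as manager:
        sd = manager.dict()
        sd['FrameID'] = [0] * 10
        sd['Flag'] = ['Z'] * 10
        p = multiprocessing.Process(target=readSerial, args=(sd,))
        p.start()
        doGUI(sd)
        p.join()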