Multiple Postgres SELECT processes (Django GET requests) stuck, causing 100% CPU usage - django

I'll try to give as much information as I can here. Although a solution would be great, I mainly want guidance on how to tackle the problem: how to view more useful log files, etc., as I'm new to server maintenance. Any advice is welcome.
Here's what's happening, in chronological order:
I'm running 2 DigitalOcean droplets (Ubuntu 14.04 VPS)
Droplet #1 runs Django, nginx and gunicorn
Droplet #2 runs Postgres
Everything ran fine for a month, then suddenly the Postgres droplet's CPU usage spiked to 100%
You can see the htop output when this happens; I've attached a screenshot
Another screenshot shows the nginx error.log; you can see that the problem started at 15:56:14, which I highlighted with a red box
Running sudo poweroff on the Postgres droplet and restarting it doesn't fix the problem
Restoring the Postgres droplet to my last backup (20 hours ago) solves the problem, but it keeps happening again. This is the 7th time in 2 days
I'll continue to do research and add more information. Meanwhile, any opinions are welcome.
Thank you.
Update 20 May 2016
Enabled slow query logging on the Postgres server, as recommended by e4c5
6 hours later, the server froze (100% CPU usage) again at 8:07 AM. I've attached all the related screenshots
The browser displays a 502 error if you try to access the site during the freeze
sudo service postgresql restart (and restarting gunicorn and nginx on the Django server) does NOT fix the freeze (I think this is a very interesting point)
However, restoring the Postgres server to my previous backup (now 2 days old) does fix the freeze
The culprit Postgres log message is: could not send data to client: Broken pipe
The culprit nginx log message is a simple django-rest-framework API call which returns only 20 items (each with some foreign-key data queries)
Update #2, 20 May 2016
When the freeze occurred, I tried the following, in chronological order (turn everything off, then turn it back on one by one):
sudo service postgresql stop --> CPU usage falls to 0-10%
sudo service gunicorn stop --> CPU usage stays at 0-10%
sudo service nginx stop --> CPU usage stays at 0-10%
sudo service postgresql restart --> CPU usage stays at 0-10%
sudo service gunicorn restart --> CPU usage stays at 0-10%
sudo service nginx restart --> CPU usage rises to 100% and stays there
So this is not about server load or long query times, then?
This is very confusing, since if I restore the database to my latest backup (2 days ago), everything is back online even without touching the nginx/gunicorn/Django server...
Update 8 June 2016
I turned on slow query logging, set to log queries that take longer than 1000 ms.
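For reference, this is the postgresql.conf setting involved (a sketch of the setup described above; the value is in milliseconds):
log_min_duration_statement = 1000   # log every statement that runs longer than 1000 ms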
This one query shows up in the log many times:
SELECT
"products_product"."id",
"products_product"."seller_id",
"products_product"."priority",
"products_product"."media",
"products_product"."active",
"products_product"."title",
"products_product"."slug",
"products_product"."description",
"products_product"."price",
"products_product"."sale_active",
"products_product"."sale_price",
"products_product"."timestamp",
"products_product"."updated",
"products_product"."draft",
"products_product"."hitcount",
"products_product"."finished",
"products_product"."is_marang_offline",
"products_product"."is_seller_beta_program",
COUNT("products_video"."id") AS "num_video"
FROM "products_product"
LEFT OUTER JOIN "products_video" ON ( "products_product"."id" = "products_video"."product_id" )
WHERE ("products_product"."draft" = false AND "products_product"."finished" = true)
GROUP BY
"products_product"."id",
"products_product"."seller_id",
"products_product"."priority",
"products_product"."media",
"products_product"."active",
"products_product"."title",
"products_product"."slug",
"products_product"."description",
"products_product"."price",
"products_product"."sale_active",
"products_product"."sale_price",
"products_product"."timestamp",
"products_product"."updated",
"products_product"."draft",
"products_product"."hitcount",
"products_product"."finished",
"products_product"."is_marang_offline",
"products_product"."is_seller_beta_program"
HAVING COUNT("products_video"."id") >= 8
ORDER BY "products_product"."priority" DESC, "products_product"."hitcount" DESC
LIMIT 100
I know it's an ugly query (generated by Django aggregation). In English, this query just means "give me the list of products that have at least 8 videos".
And here is the EXPLAIN output of this query:
QUERY PLAN
------------------------------------------------------------------------------------
 Limit  (cost=351.90..358.40 rows=100 width=933)
   ->  GroupAggregate  (cost=351.90..364.06 rows=187 width=933)
         Filter: (count(products_video.id) >= 8)
         ->  Sort  (cost=351.90..352.37 rows=187 width=933)
               Sort Key: products_product.priority, products_product.hitcount, products_product.id, products_product.seller_id, products_product.media, products_product.active, products_product.title, products_product.slug, products_product.description, products_product.price, products_product.sale_active, products_product.sale_price, products_product."timestamp", products_product.updated, products_product.draft, products_product.finished, products_product.is_marang_offline, products_product.is_seller_beta_program
               ->  Hash Right Join  (cost=88.79..344.84 rows=187 width=933)
                     Hash Cond: (products_video.product_id = products_product.id)
                     ->  Seq Scan on products_video  (cost=0.00..245.41 rows=2341 width=8)
                     ->  Hash  (cost=88.26..88.26 rows=42 width=929)
                           ->  Seq Scan on products_product  (cost=0.00..88.26 rows=42 width=929)
                                 Filter: ((NOT draft) AND finished)
(11 rows)
--- Update 8 June 2016 #2 ---
Since there are many suggestions from many people, I'll try to apply the fixes one by one and report back periodically.
@e4c5
Here's the information you need:
You can think of my site somewhat like Udemy, an online course marketplace. There are "Products" (courses). Each Product contains a number of Videos. Users can comment on both the Product page itself and on each Video.
In many cases, I'll need to query a list of Products ordered by the TOTAL number of comments they got (the sum of comments on the Product AND comments on each Video of that Product).
The Django query that corresponds to the EXPLAIN output above:
all_products_exclude_draft = Product.objects.all().filter(draft=False)
products_that_contain_more_than_8_videos = all_products_exclude_draft.annotate(num_video=Count('video')).filter(finished=True, num_video__gte=8).order_by('timestamp')[:30]
I just noticed that I (or some other dev on my team) hit the database twice with these 2 Python lines.
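For what it's worth, the two lines can also be chained into a single queryset. A sketch (functionally the same as the code above; QuerySets are lazy, so only the final slice triggers a query):
from django.db.models import Count

products = (Product.objects
            .filter(draft=False, finished=True)
            .annotate(num_video=Count('video'))
            .filter(num_video__gte=8)
            .order_by('timestamp')[:30])  # evaluated lazily, as one query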
Here are the Django models for Product and Video:
from django_model_changes import ChangesMixin

class Product(ChangesMixin, models.Model):
    class Meta:
        ordering = ['-priority', '-hitcount']

    seller = models.ForeignKey(SellerAccount)
    priority = models.PositiveSmallIntegerField(default=1)
    media = models.ImageField(blank=True,
                              null=True,
                              upload_to=download_media_location,
                              default=settings.MEDIA_ROOT + '/images/default_icon.png',
                              storage=FileSystemStorage(location=settings.MEDIA_ROOT))
    active = models.BooleanField(default=True)
    title = models.CharField(max_length=500)
    slug = models.SlugField(max_length=200, blank=True, unique=True)
    description = models.TextField()
    product_coin_price = models.IntegerField(default=0)
    sale_active = models.BooleanField(default=False)
    sale_price = models.IntegerField(default=0, null=True, blank=True)  # 100.00
    timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
    updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
    draft = models.BooleanField(default=True)
    hitcount = models.IntegerField(default=0)
    finished = models.BooleanField(default=False)
    is_marang_offline = models.BooleanField(default=False)
    is_seller_beta_program = models.BooleanField(default=False)

    def __unicode__(self):
        return self.title

    def get_avg_rating(self):
        rating_avg = self.productrating_set.aggregate(Avg("rating"), Count("rating"))
        return rating_avg

    def get_total_comment_count(self):
        comment_count = self.video_set.aggregate(Count("comment"))
        comment_count['comment__count'] += self.comment_set.count()
        return comment_count

    def get_total_hitcount(self):
        amount = self.hitcount
        for video in self.video_set.all():
            amount += video.hitcount
        return amount

    def get_absolute_url(self):
        view_name = "products:detail_slug"
        return reverse(view_name, kwargs={"slug": self.slug})

    def get_product_share_link(self):
        full_url = "%s%s" % (settings.FULL_DOMAIN_NAME, self.get_absolute_url())
        return full_url

    def get_edit_url(self):
        view_name = "sellers:product_edit"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_video_list_url(self):
        view_name = "sellers:video_list"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_product_delete_url(self):
        view_name = "products:product_delete"
        return reverse(view_name, kwargs={"pk": self.id})

    @property
    def get_price(self):
        if self.sale_price and self.sale_active:
            return self.sale_price
        return self.product_coin_price

    @property
    def video_count(self):
        videoCount = self.video_set.count()
        return videoCount
class Video(models.Model):
    seller = models.ForeignKey(SellerAccount)
    title = models.CharField(max_length=500)
    slug = models.SlugField(max_length=200, null=True, blank=True)
    story = models.TextField(default=" ")
    chapter_number = models.PositiveSmallIntegerField(default=1)
    active = models.BooleanField(default=True)
    featured = models.BooleanField(default=False)
    product = models.ForeignKey(Product, null=True)
    timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
    updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
    draft = models.BooleanField(default=True)
    hitcount = models.IntegerField(default=0)

    objects = VideoManager()

    class Meta:
        unique_together = ('slug', 'product')
        ordering = ['chapter_number', 'timestamp']

    def __unicode__(self):
        return self.title

    def get_comment_count(self):
        comment_count = self.comment_set.all_jing_jing().count()
        return comment_count

    def get_create_chapter_url(self):
        return reverse("sellers:video_create", kwargs={"pk": self.id})

    def get_edit_url(self):
        view_name = "sellers:video_update"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_video_delete_url(self):
        view_name = "products:video_delete"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_absolute_url(self):
        try:
            return reverse("products:video_detail", kwargs={"product_slug": self.product.slug, "pk": self.id})
        except:
            return "/"

    def get_video_share_link(self):
        full_url = "%s%s" % (settings.FULL_DOMAIN_NAME, self.get_absolute_url())
        return full_url

    def get_next_url(self):
        current_product = self.product
        videos = current_product.video_set.all().filter(chapter_number__gt=self.chapter_number)
        next_vid = None
        if len(videos) >= 1:
            try:
                next_vid = videos[0].get_absolute_url()
            except IndexError:
                next_vid = None
        return next_vid

    def get_previous_url(self):
        current_product = self.product
        videos = current_product.video_set.all().filter(chapter_number__lt=self.chapter_number).reverse()
        next_vid = None
        if len(videos) >= 1:
            try:
                next_vid = videos[0].get_absolute_url()
            except IndexError:
                next_vid = None
        return next_vid
And here are the indexes of the Product and Video tables, which I got from this command:
my_database_name=# \di
Note: this is photoshopped and includes some other models as well.
--- Update 8 June 2016 #3 ---
@Jerzyk
As you suspected, after inspecting all my code again I found that I did indeed do 'slicing in memory': I tried to shuffle the first 10 results by doing this:
def get_queryset(self):
    all_product_list = Product.objects.all().filter(draft=False).annotate(
        num_video=Count(
            Case(
                When(
                    video__draft=False,
                    then=1,
                )
            )
        )
    ).order_by('-priority', '-num_video', '-hitcount')
    the_first_10_products = list(all_product_list[:10])
    the_11th_product_onwards = list(all_product_list[10:])
    random.shuffle(the_first_10_products)
    finalList = the_first_10_products + the_11th_product_onwards
Note: in the code above, I need to count the number of Videos that are not in draft status.
So this will be one of the things I need to fix as well. Thanks. >_<
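A sketch of one way to avoid the in-memory slicing, assuming one page of 30 products is enough (PAGE_SIZE is an illustrative name, not from the code above). The point is that all_product_list[10:] runs an unbounded query over the whole table, while a single bounded slice becomes one LIMIT query:
import random

PAGE_SIZE = 30
products = list(all_product_list[:PAGE_SIZE])  # one query, with LIMIT 30
head = products[:10]
random.shuffle(head)                           # shuffle only the first 10, in memory
final_list = head + products[10:]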
--- Here are the related screenshots ---
Postgres log when the freeze occurs (log_min_duration = 500 milliseconds)
Postgres log (continued from the above screenshot)
Nginx error.log in the same time period
DigitalOcean CPU usage graph just before freezing
DigitalOcean CPU usage graph just after freezing

We can't jump to the conclusion that your problems are caused by the slow query in question. By itself, each run of the query does not appear to be slow enough to cause timeouts. However, it's possible that several of these queries are executed concurrently, and that could lead to the meltdown. There are two things you can do to speed things up.
1) Cache the result
The result of a long-running query can be cached:
from django.core.cache import cache

def get_8x_videos():
    cache_key = 'products_videos_join'
    result = cache.get(cache_key, None)
    if not result:
        all_products_exclude_draft = Product.objects.all().filter(draft=False)
        result = all_products_exclude_draft.annotate(num_video=Count('video')).filter(finished=True, num_video__gte=8).order_by('timestamp')[:30]
        cache.set(cache_key, result)
    return result
This result now comes from memcached (or whatever you use for caching). That means if you have two hits in quick succession on the page that uses this query, the second one will have no impact on the database. You can control how long the object is cached in memory.
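The retention period is the third argument to cache.set; for example (the 15-minute value is an arbitrary choice):
cache.set(cache_key, result, 60 * 15)  # keep the cached result for 15 minutes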
2) Optimize the query
The first thing that leaps out from the EXPLAIN is that you are doing a sequential scan on both the products_product and products_video tables. Usually sequential scans are less desirable than index scans; however, an index scan may not be usable for this query because of the COUNT() and HAVING COUNT() clauses, as well as the massive GROUP BY clause.
Update:
Your query has a LEFT OUTER JOIN. It's possible that an INNER JOIN or a subquery might be faster. To do that, we need to recognize that grouping the Video table on product_id can give us the set of products that have at least 8 videos:
from django.db.models.expressions import RawSQL

inner = RawSQL('SELECT product_id FROM products_video GROUP BY product_id HAVING COUNT(product_id) >= 8', params=[])
products = Product.objects.filter(id__in=inner)
The above eliminates the LEFT OUTER JOIN and introduces a subquery. However, this doesn't give easy access to the actual number of videos for each product, so this query in its present form may not be fully usable.
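A pure-ORM sketch of the same idea, in case you prefer to avoid raw SQL (the variable names are illustrative; values() followed by annotate() produces the GROUP BY/HAVING subquery):
from django.db.models import Count

ids = (Video.objects.values('product_id')
       .annotate(n=Count('id'))
       .filter(n__gte=8)
       .values('product_id'))  # SELECT product_id ... GROUP BY product_id HAVING COUNT(id) >= 8
products = Product.objects.filter(id__in=ids)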
3) Improving indexes
While it may be tempting to create an index on the draft and finished columns, this would be futile, as those columns do not have sufficient cardinality to be good candidates for indexing. However, it may still be possible to create a conditional (partial) index. Again, the conclusion can only be drawn after seeing your tables.
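As an illustration only (the index name and migration dependency are hypothetical, and whether such an index helps can only be judged against the real data), a conditional index could be created with raw SQL in a migration:
from django.db import migrations

class Migration(migrations.Migration):
    dependencies = [('products', '0001_initial')]  # hypothetical dependency

    operations = [
        migrations.RunSQL(
            "CREATE INDEX products_product_live_idx "
            "ON products_product (priority DESC, hitcount DESC) "
            "WHERE NOT draft AND finished;",
            "DROP INDEX products_product_live_idx;",  # reverse operation
        ),
    ]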

*** Update 7 June 2016: The issue occurred again. CPU hit 100% and stayed there. This answer does help with performance, but unfortunately it is not the solution to this problem.
Thanks to a recommendation by the DigitalOcean support team, I tried the configuration suggested by this tool:
http://pgtune.leopard.in.ua/
It recommended the following values for my droplet with 1 CPU core and 1GB RAM:
in postgresql.conf:
max_connections = 200
shared_buffers = 256MB
effective_cache_size = 768MB
work_mem = 1310kB
maintenance_work_mem = 64MB
checkpoint_segments = 32
checkpoint_completion_target = 0.7
wal_buffers = 7864kB
default_statistics_target = 100
in /etc/sysctl.conf:
kernel.shmmax=536870912
kernel.shmall=131072
My Postgres server has now been running fine for 3-4 days, so I assume this is the solution. Thanks, everyone!

Related

Optimise query in a model method

I have a fairly simple model that's part of a double-entry bookkeeping system. Double entry just means that each transaction (Journal Entry) is made up of multiple LineItems. The LineItems should add up to zero, reflecting the fact that money always comes out of one category (Ledger) and into another. The CR column is for money out, DR is money in (I think the CR and DR abbreviations come from Latin words and are standard naming conventions in accounting systems).
My JournalEntry model has a method called is_valid() which checks that the line items balance, plus a few other checks. However, the method is very expensive on the database, and when I use it to check many entries at once the database can't cope.
Any suggestions on how I can optimise the queries within this method to reduce database load?
class JournalEntry(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.PROTECT, null=True, blank=True)
    date = models.DateField(null=False, blank=False)

    # Make choiceset global so that it can be accessed in filters.py
    global JOURNALENRTY_TYPE_CHOICES
    JOURNALENRTY_TYPE_CHOICES = (
        ('BP', 'Bank Payment'),
        ('BR', 'Bank Receipt'),
        ('TR', 'Transfer'),
        ('JE', 'Journal Entry'),
        ('YE', 'Year End'),
    )
    type = models.CharField(
        max_length=2,
        choices=JOURNALENRTY_TYPE_CHOICES,
        blank=False,
        null=False,
        default='0'
    )
    description = models.CharField(max_length=255, null=True, blank=True)

    def __str__(self):
        if self.description:
            return self.description
        else:
            return 'Journal Entry ' + str(self.id)

    @property
    def is_valid(self):
        """Checks if Journal Entry has valid data integrity"""
        # NEEDS TO BE OPTIMISED AS PERFORMANCE IS BAD
        cr = LineItem.objects.filter(journal_entry=self.id).aggregate(Sum('cr'))
        dr = LineItem.objects.filter(journal_entry=self.id).aggregate(Sum('dr'))
        if dr['dr__sum'] != cr['cr__sum']:
            return "Line items do not balance"
        if self.lineitem_set.filter(cr__isnull=True, dr__isnull=True).exists():
            return "Empty line item(s)"
        if self.lineitem_set.filter(cr__isnull=False, dr__isnull=False).exists():
            return "CR and DR values present on same lineitem(s)"
        if (self.type == 'BR' or self.type == 'BP' or self.type == 'TR') and len(self.lineitem_set.all()) != 2:
            return 'Incorrect number of line items'
        if len(self.lineitem_set.all()) == 0:
            return 'Has zero line items'
        return True
class LineItem(models.Model):
    journal_entry = models.ForeignKey(JournalEntry, on_delete=models.CASCADE)
    ledger = models.ForeignKey(Ledger, on_delete=models.PROTECT)
    description = models.CharField(max_length=255, null=True, blank=True)
    project = models.ForeignKey(Project, on_delete=models.SET_NULL, null=True, blank=True)
    cr = models.DecimalField(max_digits=8, decimal_places=2, null=True, blank=True)
    dr = models.DecimalField(max_digits=8, decimal_places=2, null=True, blank=True)
    reconciliation_date = models.DateField(null=True, blank=True)

    # def __str__(self):
    #     return self.description

    class Meta(object):
        ordering = ['id']
First things first: if it's an expensive operation, it shouldn't be a property - not that this will change the execution time / db load, but at least it doesn't break the expectation that you're mostly doing a (relatively cheap) attribute access.
With regard to possible optimisations, part of the cost is in the db roundtrip (including the time spent in the Python code - ORM and db adapter - itself), so a first step would be to make as few queries as possible:
1/ replacing len(self.lineitem_set.all()) with self.lineitem_set.count(), and avoiding calling it twice, could already save some time
2/ you could probably regroup the first two queries into a single one (not tested...):
crdr = self.lineitem_set.aggregate(Sum('cr'), Sum('dr'))
if crdr['dr__sum'] != crdr['cr__sum']:
    return "Line items do not balance"
and well, that's about all for the simple, obvious optimisations, and I don't think it will really solve your issue.
The next step would probably be a stored procedure that does the whole validation process - a single roundtrip and possibly more room for db-level optimisations (depending on your db vendor).
Then - assuming your db schema, settings, server etc. are fully optimized (which is a bit outside this site's on-topic policy) - the only solution left is denormalization, either at the db level (safer) or at the Django level using a local per-instance cache on your model - the issue being to make sure you properly invalidate this cache every time anything that might affect it changes.
NB: actually I'm a bit surprised your db "can't cope" with this, as it doesn't seem _that_ heavy - but it of course depends on how many line items per journal you have (on average and worst case) in your production data.
More info about your chosen RDBMS and setup (same server or a distinct one, and if distinct, the network connectivity between the servers, available RAM, RDBMS settings, etc.) could probably help too - even with the most optimized queries at the client level, there are limits to what your RDBMS can do... but then this becomes more of a sysadmin/dbadmin question.
EDIT
"Page load time is now long but it does complete. Yes, 2000 records to list and execute the method on."
You mean you're executing this in a view on a 2000+ record queryset? Well, I can understand that it's a bit heavy - and not only on the database, FWIW.
I think you might be able to optimize this quite a bit further for this use case. The first option would be to make use of the queryset's select_related, prefetch_related, annotate and extra features, and if that's not enough, to go for raw SQL.
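For example, a sketch of the annotate route (the annotation names are illustrative, not from your code): compute the sums and counts for the whole listing in one query instead of running per-entry aggregates:
from django.db.models import Count, Sum

entries = JournalEntry.objects.annotate(
    cr_sum=Sum('lineitem__cr'),
    dr_sum=Sum('lineitem__dr'),
    item_count=Count('lineitem'),
)
for entry in entries:  # one query for all 2000 entries
    if entry.dr_sum != entry.cr_sum:
        print(entry.id, "Line items do not balance")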

Django - What might cause the database to hang?

I have built a local network website using the Django framework and recently ran into problems I was not having before.
We are running experiments on a local network, collecting various measurements, and I set up this website to make sure we collect all the data in the same place.
I set up a PostgreSQL database and use Django to populate it on the fly as I receive measurements. The script that does this looks like:
ladrLogger.py
# various imports
import django
from django.db import IntegrityError

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "mysite.settings")
django.setup()
from logger.models import Measurement, Device, Type, Room, Experiment, ExperimentData

def logDevice(self, port):
    # Callback function executed each time I receive data, to log it in the database
    deviceData = port.data  # get the data
    # Do a bunch of tests and checks
    # ...
    # Get all measurements to add to the database;
    # measurements is a list of Measurement objects as defined in my Django models
    measurements = self.prepareMeasurement(...)
    self.saveMeasurements(measurements)
    print "Saved measurements successfully."

def saveMeasurements(self, meas):
    if not meas:
        return
    elif type(meas) is list:
        for m in meas:
            self.saveMeasurements(m)
    elif type(meas) is Measurement:
        try:
            meas.save()
        except IntegrityError as e:
            if 'unique constraint' in e.message:
                print "Skipping... Measurement already existed for device " + meas.device.name
            else:
                print "Skipping measurement due to error: " + e.message

def prepareMeasurement(self, nameDevice, typeDevice, time, data):
    # Takes the characteristics of a measurement (device, name and type)
    # and creates the appropriate measurements.
    measurements = []
    m = Measurement()
    m.device = Device.objects.get(name=nameDevice)
    m.date = time
    # Bunch of tests
    # ...
    for idv, v in enumerate(value):
        if v in data:
            m = Measurement()
            m.device = something
            m.date = something_else
            m.value = bla
            m.quantity = blabla
            measurements.append(m)
    return measurements

# Bunch of other methods
Note that this script is always running and waiting for more measurements to execute the logDevice callback.
EDIT: A custom library based on YARP takes care of the callback handling. The code that creates the callbacks looks like this:
portid = self.createPort(quer.group(1), True, True)            # creates a port
pyarp.connect(desc[0], self.fullPortPath(portid))              # establishes connection to the talking port
self.listenToPort(portid, lambda port: self.logDevice(port))   # execute that callback when a message is received
Callbacks are entirely dealt with in the background.
On the other hand, I have my Django website with various views displaying devices, measurements, plots and whatnot.
The problem I have is that I am logging measurements (a few (2-3) per second at busy times, usually fewer) and the logging seems to be fine. But when I call my views, for example asking for the latest measurement for device x, I get an old measurement. One example of the code:
def latestTemp(request, device_id):
    # Creates a csv file with the latest temperature measured.
    # For now, does not check what measurements are actually available.
    dev = get_object_or_404(Device, pk=device_id)
    tz = pytz.timezone('Europe/Zurich')
    # Create the HttpResponse object with the appropriate CSV header.
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="%s.csv"' % dev.name
    # Get the measurement
    lastMeas = Measurement.objects.filter(device=dev, quantity=Type.objects.get(quantity='Temperature')).latest('date')
    writer = csv.writer(response)
    # Form list of required timesteps
    date = lastMeas.date.astimezone(tz)
    writer.writerow([date.strftime('%Y-%m-%d'), date.strftime('%H:%M'), lastMeas.value])
    return response
EDIT (more details):
I have been logging data for a few hours, but the website only shows me something dating back a few hours. As I keep asking for that data, it gets more and more recent, as if it had been buffered somewhere and was now slowly falling into place and becoming visible to the website, until everything finally comes back to normal. On the other hand, if I kill the logging process, the data seems lost forever. What is strange, though, is that the logDevice method completes and I can see that the meas.save() calls were executed. I also added a listener for the Django post_save signal, and I catch the signals correctly.
Some information:
- I am using the PostgreSQL backend
- I am running all of this on a dedicated Mac machine
- let me know whatever else would be useful to know
My questions are:
- Do you see any reason why this might happen? (It didn't use to happen, so I guess it might have to do with the database getting big - 4GB right now.)
- As a side question, but maybe related: I suspect the way I am pushing new elements into the database is not really nice, since the code runs completely independently of the Django website itself. Any suggestions on how to improve this? I thought the ladrLogger code could send a request to a dedicated view that creates the new element, but that might be needlessly heavy.
EDIT: adding my models.py
class Room(models.Model):
    fullName = models.CharField(max_length=20, unique=True)
    shortName = models.CharField(max_length=5, unique=True, default="000")
    nickName = models.CharField(max_length=20, default="Random Room")

    def __unicode__(self):
        return self.fullName


class Type(models.Model):
    quantity = models.CharField(max_length=100, default="Temperature", unique=True)
    unit = models.CharField(max_length=5, default="C", blank=True)
    VALUE_TYPES = (
        ('float', 'float'),
        ('boolean', 'boolean'),
        ('integer', 'integer'),
        ('string', 'string'),
    )
    value_type = models.CharField(max_length=20, choices=VALUE_TYPES, default="float")

    def __unicode__(self):
        return self.quantity


class Device(models.Model):
    name = models.CharField(max_length=30, default="Unidentified Device", unique=True)
    room = models.ForeignKey(Room)
    description = models.CharField(max_length=500, default="", blank=True)
    indigoId = models.CharField(max_length=30, default="000")

    def __unicode__(self):
        # r = Room.objects.get(pk=self.room)
        return self.name  # + ' in room ' + r.name

    def latestMeasurement(self, *args):
        if len(args) == 0:
            # No argument, so just return the latest measurement
            meas = Measurement.objects.filter(device=self).latest('date')
        else:
            # Use first argument as the type
            meas = Measurement.objects.filter(device=self, quantity=args[0]).latest('date')
        if not meas:
            return None
        else:
            return meas

    def typeList(self):
        return Type.objects.filter(measurement__device=self).distinct()


class Measurement(models.Model):
    device = models.ForeignKey(Device)
    date = models.DateTimeField(db_index=True)
    value = models.CharField(max_length=100, default="")
    quantity = models.ForeignKey(Type)

    class Meta:
        unique_together = ('date', 'device', 'quantity',)
        index_together = ['date', 'device']

    def __unicode__(self):
        t = self.quantity
        return str(self.value) + " " + self.quantity.unit
        # return str(self.value)

Slow iteration over django queryset

I am iterating over a Django queryset that contains anywhere from 500-1000 objects. The corresponding model/table has 7 fields. The problem is that it takes about 3 seconds to iterate over the queryset, which seems way too long considering all the other data processing my application needs to do.
EDIT:
Here is my model:
class Node(models.Model):
    node_id = models.CharField(null=True, blank=True, max_length=30)
    jobs = models.TextField(null=True, blank=True)
    available_mem = models.CharField(null=True, blank=True, max_length=30)
    assigned_mem = models.CharField(null=True, blank=True, max_length=30)
    available_ncpus = models.PositiveIntegerField(null=True, blank=True)
    assigned_ncpus = models.PositiveIntegerField(null=True, blank=True)
    cluster = models.CharField(null=True, blank=True, max_length=30)
    datetime = models.DateTimeField(auto_now_add=False)
This is my initial query, which is very fast:
timestamp = models.Node.objects.order_by('-pk').filter(cluster=cluster)[0]
self.nodes = models.Node.objects.filter(datetime=timestamp.datetime)
But then I go to iterate, and it takes 3 seconds. I've tried two ways, as seen below:
def jobs_by_node(self):
    """Returns a dictionary containing keys that
    are strings of node ids and values that
    are lists of the jobs running on that node."""
    jobs_by_node = {}
    # Iterate over nodes and populate the jobs_by_node dictionary
    tstart = time.time()
    for node in self.nodes:
        pass  # code omitted because the slowdown is simply the iteration
    tend = time.time()
    tfinal = tend - tstart
    return jobs_by_node
Other method:
all_nodes = self.nodes.values('node_id')
tstart = time.time()
for node in all_nodes:
    pass
tend = time.time()
tfinal = tend - tstart
I tried the second method by referring to this post, but it still has not sped up my iteration one bit. I've scoured the web to no avail. Any help optimizing this process will be greatly appreciated. Thank you.
Note: I'm using Django version 1.5 and Python 2.7.3
Check the issued SQL query. You can use print statement:
print self.nodes.query # in general: print queryset.query
That should give you something like:
SELECT id, jobs, ... FROM app_node
Then run EXPLAIN SELECT id, jobs, ... FROM app_node and you'll know what exactly is wrong.
Assuming that you know what the problem is after running EXPLAIN, and that simple solutions like adding indexes aren't enough, you can think about e.g. fetching the relevant rows into a separate table every X minutes (in a cron job or Celery task) and using that separate table in your application.
If you are using PostgreSQL, you can also use materialized views and "wrap" them in an unmanaged Django model, as sketched below.
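A sketch of the unmanaged-model approach, assuming a materialized view named node_summary was already created in PostgreSQL (the view name and field list here are hypothetical):
class NodeSummary(models.Model):
    node_id = models.CharField(max_length=30)
    jobs = models.TextField(null=True, blank=True)

    class Meta:
        managed = False            # Django will not create or migrate this table
        db_table = 'node_summary'  # points at the materialized view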

UPDATE in raw sql does not hit all records although they meet the criteria

I am trying to update several records when I hit the save button in the admin, using raw SQL located in models.py (in def save(self, *args, **kwargs)).
The raw SQL looks like this (as a prototype):
cursor = connection.cursor()
cursor.execute("UPDATE sales_ordered_item SET oi_delivery = %s WHERE oi_order_id = %s", ['2011-05-29', '1105212105'])
Unfortunately it does not update all the records that meet the criteria - only one, sometimes more, but never all.
With the SQLite Manager and the following SQL everything works great and all the records get updated:
UPDATE sales_ordered_item
SET oi_delivery = '2011-05-29'
WHERE oi_order_id = '1105212105'
I was thinking of using a manager to update the table, but I have no idea how this would work when not using static data like '2011-05-29'. Anyway, it would be great to understand in the first place how to hit all the records with the raw SQL.
Any recommendations on how to solve the problem in a different way are also highly appreciated.
Here is the code, which I stripped down a little to keep it short:
# Orders of the customers
class Order(models.Model):
    """
    Defines the order data incl. payment, shipping and delivery
    """
    # Main Data
    o_customer = models.ForeignKey(Customer, related_name='customer',
                                   verbose_name=_(u'Customer'), help_text=_(u'Please select the related Customer'))
    o_id = models.CharField(_(u'Order ID'), max_length=10, primary_key=True,
                            help_text=_(u'ID has the format YYMMDDHHMM'))
    o_date = models.DateField(_(u'created'))
    # and more...


# Order Item
class Ordered_item(models.Model):
    """
    Defines the ordered item and to which order it belongs; pricing is decoupled from the
    catalogue to be free of any changes in the pricing. Pricing and description are copied
    from the item catalogue as a proposal and can be altered
    """
    oi_order = models.ForeignKey(Order, related_name='Order', verbose_name=_(u'Order ID'))
    oi_pos = models.CharField(_('Position'), max_length=2, default='01')
    oi_quantity = models.PositiveIntegerField(_('Quantity'), default=1)
    # Date of the delivery, to determine the status of the item: ordered or already delivered
    oi_delivery = models.DateField(_(u'Delivery'), null=True, blank=True)
    # and more ...

    def save(self, *args, **kwargs):
        # Does not hit all records; static values used for test purposes
        cursor = connection.cursor()
        cursor.execute("UPDATE sales_ordered_item SET oi_delivery = %s WHERE oi_order_id = %s", ['2011-05-29', '1105212105'])
        super(Ordered_item, self).save(*args, **kwargs)
This is probably happening because you are not committing the transaction (see https://docs.djangoproject.com/en/dev/topics/db/sql/#executing-custom-sql-directly).
Add these lines after your cursor.execute:
from django.db import transaction
transaction.commit_unless_managed()
You asked for a manager method.
SalesOrderedItem.objects.filter(oi_order='1105212105').update(oi_delivery='2011-05-29')
should do the job for you!
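The same call works with dynamic values; for example (delivery_date and order_id are illustrative variables, not from the question):
SalesOrderedItem.objects.filter(oi_order=order_id).update(oi_delivery=delivery_date)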
Edit:
I assume that you have two models (I am guessing this from your raw SQL):
class OiOrder(models.Model):
    pass

class SalesOrderedItem(models.Model):
    oi_order = models.ForeignKey(OiOrder)
    oi_delivery = models.DateField()
So:
SalesOrderedItem.objects.filter(oi_order='1105212105')
gives you all SalesOrderedItem objects which have an oi_order of 1105212105.
...update(oi_delivery='2011-05-29')
The update method then sets the oi_delivery attribute on all of them.

Django model manager live-object issues

I am using Django and am having a few issues with the caching of QuerySets for my news/category models:
class Category(models.Model):
    title = models.CharField(max_length=60)
    slug = models.SlugField(unique=True)


class PublishedArticlesManager(models.Manager):
    def get_query_set(self):
        return super(PublishedArticlesManager, self).get_query_set() \
            .filter(published__lte=datetime.datetime.now())


class Article(models.Model):
    category = models.ForeignKey(Category)
    title = models.CharField(max_length=60)
    slug = models.SlugField(unique=True)
    story = models.TextField()
    author = models.CharField(max_length=60, blank=True)
    published = models.DateTimeField(
        help_text=_('Set to a date in the future to publish later.'))
    created = models.DateTimeField(auto_now_add=True, editable=False)
    updated = models.DateTimeField(auto_now=True, editable=False)

    live = PublishedArticlesManager()
    objects = models.Manager()
Note - I have removed some fields to save on complexity...
There are a few (related) issues with the above.
Firstly, when I query for LIVE objects in my view via Article.live.all(), if I refresh the page repeatedly I can see (in the MySQL logs) the same database query being made with exactly the same date in the WHERE clause - i.e. datetime.datetime.now() is being evaluated at compile time rather than at runtime. I need the date to be evaluated at runtime.
Secondly, when I use the articles_set method on the Category object, this appears to work correctly - the datetime used in the query changes each time the query is run - again, I can see this in the logs. However, I am not quite sure why this works, since I don't have anything in my code saying that the articles_set query should return LIVE entries only!?
Finally, why is none of this being cached?
Any ideas how to make the correct time be used consistently? Can someone please explain why the latter setup appears to work?
Thanks
Jay
P.S. - database queries below; note the date variations.
SELECT LIVE ARTICLES, query #1:
SELECT `news_article`.`id`, `news_article`.`category_id`, `news_article`.`title`, `news_article`.`slug`, `news_article`.`teaser`, `news_article`.`summary`, `news_article`.`story`, `news_article`.`author`, `news_article`.`published`, `news_article`.`created`, `news_article`.`updated` FROM `news_article` WHERE `news_article`.`published` <= '2011-05-17 21:55:41' ORDER BY `news_article`.`published` DESC, `news_article`.`slug` ASC;
SELECT LIVE ARTICLES, query #2:
SELECT `news_article`.`id`, `news_article`.`category_id`, `news_article`.`title`, `news_article`.`slug`, `news_article`.`teaser`, `news_article`.`summary`, `news_article`.`story`, `news_article`.`author`, `news_article`.`published`, `news_article`.`created`, `news_article`.`updated` FROM `news_article` WHERE `news_article`.`published` <= '2011-05-17 21:55:41' ORDER BY `news_article`.`published` DESC, `news_article`.`slug` ASC;
CATEGORY SELECT ARTICLES, query #1:
SELECT `news_article`.`id`, `news_article`.`category_id`, `news_article`.`title`, `news_article`.`slug`, `news_article`.`teaser`, `news_article`.`summary`, `news_article`.`story`, `news_article`.`author`, `news_article`.`published`, `news_article`.`created`, `news_article`.`updated` FROM `news_article` WHERE (`news_article`.`published` <= '2011-05-18 21:21:33' AND `news_article`.`category_id` = 1 ) ORDER BY `news_article`.`published` DESC, `news_article`.`slug` ASC;
CATEGORY SELECT ARTICLES, query #2:
SELECT `news_article`.`id`, `news_article`.`category_id`, `news_article`.`title`, `news_article`.`slug`, `news_article`.`teaser`, `news_article`.`summary`, `news_article`.`story`, `news_article`.`author`, `news_article`.`published`, `news_article`.`created`, `news_article`.`updated` FROM `news_article` WHERE (`news_article`.`published` <= '2011-05-18 21:26:06' AND `news_article`.`category_id` = 1 ) ORDER BY `news_article`.`published` DESC, `news_article`.`slug` ASC;
You should check out conditional view processing:
def latest_entry(request, article_id):
    return Article.objects.latest("updated").updated

@conditional(last_modified_func=latest_entry)
def view_article(request, article_id):
    # your view code here
This should cache the page rather than reloading a new version every time.
I suspect that if you want now() to be evaluated at runtime, you could use raw SQL. I think this will solve the compile-time/runtime issue:
class PublishedArticlesManager(models.Manager):
    def get_query_set(self):
        return super(PublishedArticlesManager, self).get_query_set() \
            .raw("SELECT * FROM news_article WHERE published <= CURRENT_TIMESTAMP")
Note that this returns a RawQuerySet, which may differ a bit from a normal QuerySet.
I have now fixed this issue. It appears the problem was that the queryset returned by Article.live.all() was being cached in my urls.py! I was using function-based generic views:
url(r'^all/$', object_list, {
    'queryset': Article.live.all(),
}, 'news_all'),
I have now changed this to use the class-based approach, as advised in the latest Django documentation:
url(r'^all/$', ListView.as_view(
    model=Article,
), name="news_all"),
This now works as expected - by specifying the model attribute rather than the queryset attribute, the QuerySet is created at runtime rather than at compile time.
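An equivalent way to get runtime evaluation while keeping the custom manager would be to override get_queryset, so the manager's filter (and its datetime.datetime.now() call) runs on every request. A sketch (LiveArticleList is an illustrative name):
class LiveArticleList(ListView):
    def get_queryset(self):
        # Article.live.all() is built here, per request, so now() is current
        return Article.live.all()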