Caching of querysets and re-evaluation - django

I'm going to post some incomplete code to make the example simple. I'm running a recursive function to compute some metrics on a hierarchical structure.
class Category(models.Model):
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children', default=1)

    def compute_metrics(self, shop_object, metric_queryset=None, rating_queryset=None):
        if metric_queryset is None:
            metric_queryset = Metric.objects.all()
        if rating_queryset is None:
            rating_queryset = Rating.objects.filter(shop_object=shop_object)
        for child in self.children.all():
            # do stuff
            child_score = child.compute_metrics(shop_object, metric_queryset, rating_queryset)
        metrics_in_cat = metric_queryset.filter(category=self)
        for metric in metrics_in_cat:
            # do stuff
I hope that's enough code to see what's going on. What I'm after here is a recursive function that is only going to run those queries once each, then pass the results down. That doesn't seem to be happening right now and it's killing performance. Were this PHP/MySQL (as much as I dislike them after working with Django!) I could just run the queries once and pass them down.
From what I understand of Django's querysets, they aren't going to be evaluated in my if queryset == None then queryset=stuff part. How can I force this? Will it be re-evaluated when I do things like metric_queryset.filter(category=self)?
I don't care about data freshness. I just want to read from the DB once for each of metrics and rating, then filter on them later without hitting the DB again. It's a frustrating problem that feels like it should have a very simple answer. Pickling looks like it could work but it's not very well explained in the Django documentation.

I think the problem here is that you are not evaluating the querysets until after your recursive call. If you use list() to force evaluation, each query should hit the database only once. Note that you will then have to change the metrics_in_cat line to a Python-level filter rather than a queryset filter.
class Category(models.Model):
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children', default=1)

    def compute_metrics(self, shop_object, metric_queryset=None, rating_queryset=None):
        if metric_queryset is None:
            metric_queryset = list(Metric.objects.all())
        if rating_queryset is None:
            rating_queryset = list(Rating.objects.filter(shop_object=shop_object))
        for child in self.children.all():
            # do stuff
            child_score = child.compute_metrics(shop_object, metric_queryset, rating_queryset)
        # metrics_in_cat = metric_queryset.filter(category=self)
        metrics_in_cat = [m for m in metric_queryset if m.category == self]
        for metric in metrics_in_cat:
            # do stuff
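As a plain-Python illustration (FakeMetric here is an invented stand-in for model instances, not part of the original code): once the queryset has been materialised into a list, every later "filter" is ordinary list processing and never touches the database.

```python
class FakeMetric(object):
    """Stand-in for a Metric row; only .category matters for this sketch."""
    def __init__(self, category):
        self.category = category

# Stands in for: metric_queryset = list(Metric.objects.all())
metric_list = [FakeMetric('food'), FakeMetric('toys'), FakeMetric('food')]

# Replaces metric_queryset.filter(category=self) with a Python-level filter.
metrics_in_cat = [m for m in metric_list if m.category == 'food']
# Two matches, computed without any further queries.
```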


Django Models Unit Tests: Help for a newbie

Right, this is kind of the last place I would like to have asked, since the question is quite vague, but I'm at a loss.
Basically, I'm trying to learn how to code, and I'm currently working with Django to get to grips with the back end. I keep being reminded of the importance of unit testing, so I want to include tests in the dummy project I'm working on and understand them early in my programming journey... only... they seem easy enough on the surface, but are clearly more complicated than I, as a beginner, gave them credit for.
I'm not entirely sure where (or if) I'm going wrong here, so feel free to poke me and make fun, but try to point me in the right direction, or to some beginner-friendly resources (I've scoured the internet and there isn't a massive amount of info).
My project is a dummy GUI related to policing (I'm hoping to use it in my portfolio in future).
Here's the model class I want help with testing. There are others but I want to do those by myself once I know more:
class Warrant(models.Model):
    """
    """
    name = models.CharField(max_length=50)
    WARRANT_TYPE_CHOICES = (
        ('ARREST', 'Arrest'),
        ('SEARCH', 'Search'),
    )
    warrant_type = models.CharField(max_length=10, choices=WARRANT_TYPE_CHOICES, default='ARREST')
    start_date = models.DateTimeField(auto_now=True, null=True)
    expiry_date = models.DateTimeField(auto_now_add=True, null=True)
    renewal_date = models.DateTimeField(auto_now_add=True, null=True)
As you can see, fairly simple model.
Here are the tests that I'm currently aimlessly fiddling around with:
def test_model_creation(self):
    object = Warrant.objects.create()
    self.assertIsNotNone(object)

def test_warrant_str(self):
    warrant = Warrant.objects.create(
        name="Warrant Name",
    )
    self.assertEqual(str(warrant), "Warrant Name")

def test_datetime_model(self):
    start_date = Warrant.objects.create(start_date=datetime.now())
    expiry_date = Warrant.objects.create(expiry_date=datetime.now())
    renewal_date = Warrant.objects.create(renewal_date=datetime.now())
To me, this reads fine, and all tests return OK. However, I'm not sure if this is best practice or actually does what I think it's doing.
In addition, I'm not sure how to test the warrant_type / WARRANT_TYPE_CHOICES field either.
I know this isn't typically the type of question that should be asked here, but I'm essentially just typing stuff at random that I've picked up from tutorials and have no idea if it is even correct.
Thanks
It's good that you're thinking about tests first! It's usually not models that you test, but behavior, or in other words, methods. If you add a method to your model with some actual logic in it, then that's a good time to write a test for it. For example:
class Warrant(models.Model):
    name = models.CharField(max_length=50)
    WARRANT_TYPE_CHOICES = (
        ('ARREST', 'Arrest'),
        ('SEARCH', 'Search'),
    )
    warrant_type = models.CharField(max_length=10, choices=WARRANT_TYPE_CHOICES, default='ARREST')
    start_date = models.DateTimeField(auto_now=True, null=True)
    expiry_date = models.DateTimeField(null=True)
    renewal_date = models.DateTimeField(null=True)

    @property
    def is_expired(self):
        return self.expiry_date < timezone.now()
With these test cases:

import datetime

from django.test import TestCase
from django.utils import timezone

from .models import Warrant


class WarrantTestCase(TestCase):
    def test_is_expired_returns_true_if_expiry_date_before_now(self):
        now = timezone.now()
        hour_ago = now - datetime.timedelta(hours=1)
        warrant = Warrant.objects.create(expiry_date=hour_ago)
        self.assertTrue(warrant.is_expired)

    def test_is_expired_returns_false_if_expiry_date_after_now(self):
        now = timezone.now()
        in_an_hour = now + datetime.timedelta(hours=1)
        warrant = Warrant.objects.create(expiry_date=in_an_hour)
        self.assertFalse(warrant.is_expired)
Note that I removed auto_now_add=True from the fields expiry_date and renewal_date. That is probably behavior you don't want, since auto_now_add overwrites any value you pass in with the creation timestamp. I figured this out by writing this test case, because one of the tests failed until I removed it. Long live tests!
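The comparison inside is_expired can also be exercised without Django at all; a minimal plain-Python sketch (naive datetimes used purely for illustration, where the model uses timezone.now()):

```python
import datetime

def is_expired(expiry_date, now):
    # Mirrors the property's comparison: expired once `now` has passed it.
    return expiry_date < now

now = datetime.datetime(2021, 1, 1, 12, 0)
assert is_expired(now - datetime.timedelta(hours=1), now)      # an hour ago
assert not is_expired(now + datetime.timedelta(hours=1), now)  # in an hour
```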
I don't think your tests are actually doing anything useful, but there is a way to do what I believe you are looking for. If I wanted to write a test to make sure that a model was creating objects correctly, I would do something like this:
from django.test import TestCase

from .models import Warrant


class WarrantTestCase(TestCase):
    def setUp(self):
        self.warrant = Warrant.objects.create(name="Warrant Name", warrant_type="SEARCH")

    def test_warrant_create(self):
        warrant = Warrant.objects.get(name="Warrant Name")
        self.assertEqual(warrant.warrant_type, "SEARCH")
The setUp method first creates a sample object, and then the test grabs that object and checks that the warrant type equals what I expect it to be.
More info on unit testing in Django can be found in the docs: https://docs.djangoproject.com/en/3.1/topics/testing/overview/
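As for the warrant_type question above: a hedged, database-free way to sanity-check a choices tuple is to assert on the tuple itself (plain Python, not Django test API; DEFAULT is invented here to mirror default='ARREST'):

```python
# The choices tuple copied from the model; DEFAULT mirrors default='ARREST'.
WARRANT_TYPE_CHOICES = (
    ('ARREST', 'Arrest'),
    ('SEARCH', 'Search'),
)
DEFAULT = 'ARREST'

valid_keys = {key for key, label in WARRANT_TYPE_CHOICES}
assert DEFAULT in valid_keys              # the default is a legal choice
assert valid_keys == {'ARREST', 'SEARCH'}
```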

Slow iteration over django queryset

I am iterating over a django queryset that contains anywhere from 500-1000 objects. The corresponding model/table has 7 fields in it as well. The problem is that it takes about 3 seconds to iterate over which seems way too long when considering all the other data processing that needs to be done in my application.
EDIT:
Here is my model:
class Node(models.Model):
    node_id = models.CharField(null=True, blank=True, max_length=30)
    jobs = models.TextField(null=True, blank=True)
    available_mem = models.CharField(null=True, blank=True, max_length=30)
    assigned_mem = models.CharField(null=True, blank=True, max_length=30)
    available_ncpus = models.PositiveIntegerField(null=True, blank=True)
    assigned_ncpus = models.PositiveIntegerField(null=True, blank=True)
    cluster = models.CharField(null=True, blank=True, max_length=30)
    datetime = models.DateTimeField(auto_now_add=False)
This is my initial query, which is very fast:
timestamp = models.Node.objects.order_by('-pk').filter(cluster=cluster)[0]
self.nodes = models.Node.objects.filter(datetime=timestamp.datetime)
But then, I go to iterate and it takes 3 seconds, I've tried two ways as seen below:
def jobs_by_node(self):
    """Return a dictionary mapping node id strings to lists of
    the jobs running on that node."""
    jobs_by_node = {}
    # iterate over nodes and populate the jobs_by_node dictionary
    tstart = time.time()
    for node in self.nodes:
        pass  # code omitted because the slowdown is simply iteration
    tend = time.time()
    tfinal = tend - tstart
    return jobs_by_node
Other method:
all_nodes = self.nodes.values('node_id')
tstart = time.time()
for node in all_nodes:
    pass
tend = time.time()
tfinal = tend - tstart
I tried the second method by referring to this post, but it still has not sped up my iteration one bit. I've scoured the web to no avail. Any help optimizing this process will be greatly appreciated. Thank you.
Note: I'm using Django version 1.5 and Python 2.7.3
Check the SQL query being issued. You can simply print it (Python 2 syntax, matching the question):

print self.nodes.query  # in general: print queryset.query
That should give you something like:
SELECT id, jobs, ... FROM app_node
Then run EXPLAIN SELECT id, jobs, ... FROM app_node and you'll know what exactly is wrong.
Assuming you know what the problem is after running EXPLAIN, and that simple fixes like adding indexes aren't enough, you can consider e.g. fetching the relevant rows into a separate table every X minutes (in a cron job or Celery task) and using that separate table in your application.
If you are using PostgreSQL you can also use materialized views and "wrap" them in an unmanaged Django model.

Modeling hierarchical data in django with polymorphic node types

I have two basic models, a Command and a Flow. Flows can contain a series of commands or other nested Flows. So any given Flow can have a list of children that are either Command or Flow types. (This is similar to the files and directories you might model in a file system.)
I've tried to model this using ContentTypes, Generic relations, and mptt (which doesn't allow for generic content types AFAIK) but have had no success. Here are my basic models:
class Step(models.Model):
    parent = models.ForeignKey('Step', null=True)
    name = models.CharField(max_length=100)
    start_time = models.DateTimeField(null=True)
    end_time = models.DateTimeField(null=True)
    state = models.CharField(max_length=1, default='u')


class Flow(Step):
    type = models.CharField(max_length=1)

    def getChildren(self):
        # todo: the steps returned here need to be sorted by their order in the flow
        return Step.objects.filter(parent_id=self.parent_id)

    def run(self):
        for child in self.getChildren():
            print("DEBUG: run method processing a {0}".format(child.__class__.__name__))
            # if this is a flow, run it
            # else if it's a command, execute it


class Command(Step):
    exec_string = models.TextField()
I want to be able to create Flows in my app, query the children, then process each child differently depending on its type (commands get executed, flows get recursively processed.)
I would appreciate any correction of my code above which would make this possible or even comments that I'm approaching this problem the complete wrong way for Django.
Edit: I should add that I'm using Python 3.3 and Django dev (to be named 1.6)
I finally found an answer via some great help on IRC here and wanted to share it in case anyone else had the same problem.
The only thing I had to ultimately change was Flow.getChildren().
def getChildren(self):
    # Get a list of all the attrs relating to child models.
    child_attrs = dict(
        (rel.var_name, rel.get_cache_name())
        for rel in Step._meta.get_all_related_objects()
        if issubclass(rel.field.model, Step) and isinstance(rel.field, models.OneToOneField)
    )
    objs = []
    for obj in self.children.all().select_related(*child_attrs.keys()):
        # Try to find the concrete child instance...
        for child in child_attrs.values():
            sobj = obj.__dict__.get(child)
            if sobj is not None:
                break
        objs.append(sobj)
    return objs
If anyone has a cleaner solution I'd love to see it, especially since this seems like a lot of work for something it seems like the framework should handle more directly.
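The type dispatch that run() needs is plain Python once getChildren() returns concrete instances; a minimal sketch with the same class names (the constructors and the `executed` log are invented for illustration):

```python
class Step(object):
    pass

class Command(Step):
    def __init__(self, exec_string):
        self.exec_string = exec_string

class Flow(Step):
    def __init__(self, children):
        self.children = children

    def run(self, executed):
        for child in self.children:
            if isinstance(child, Flow):
                child.run(executed)                 # flows recurse
            elif isinstance(child, Command):
                executed.append(child.exec_string)  # commands "execute"

log = []
Flow([Command('ls'), Flow([Command('pwd')])]).run(log)
# log == ['ls', 'pwd']
```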
The thing that jumps out at me is "return Step.objects.filter(parent_id=self.parent_id)". I believe it should be "return Step.objects.filter(parent__pk=self.parent.pk)"

Django - Lazy results with a context processor

I am working on a django project that requires much of the common page data be dynamic. Things that appear on every page, such as the telephone number, address, primary contact email etc all need to be editable via the admin panel, so storing them in the settings.py file isn't going to work.
To get around this, I created a custom context processor which acts as an abstract reference to my "Live Settings" model. The model looks like this:
class LiveSetting(models.Model):
    id = models.AutoField(primary_key=True)
    title = models.CharField(max_length=255, blank=False, null=False)
    description = models.TextField(blank=True, null=True)
    key = models.CharField(max_length=100, blank=False, null=False)
    value = models.CharField(max_length=255, blank=True)
And the context processor like so:
from livesettings.models import LiveSetting


class LiveSettingsProcessor(object):
    def __getattr__(self, name):
        # __getattr__ receives the attribute name, e.g. 'primary_email'.
        val = LiveSetting.objects.get(key=name)
        setattr(self, val.key, val.value)
        return val.value


l = LiveSettingsProcessor()


def livesetting_processors(request):
    return {'settings': l}
It works really nicely, and in my template I can use {{ settings.primary_email }} and get the corresponding value from the database.
The problem with the above code is it handles each live setting request individually and will hit the database each time my {{ settings.*}} tag is used in a template.
Does anyone have any idea if I could make this process lazy, so rather than retrieve the value and return it, it instead updates a QuerySet then returns the results in one hit just before the page is rendered?
You are trying to invent something complex and there is no reason for that. Something as simple as this should work well enough for you:
def livesetting_processors(request):
    settings = LiveSetting.objects.get(request)
    return {'settings': settings}
EDIT:
This is how you will solve your problem in current implementation:
class LiveSettingsProcessor(object):
    def __getattr__(self, name):
        # __getattr__ is only called when `name` is not yet set on the
        # instance, so after the first lookup the setattr() below serves
        # every later access without touching the database again.
        val = LiveSetting.objects.get(key=name)
        setattr(self, val.key, val.value)
        return val.value
@Hanpan, I've updated my answer to show how you can solve your problem, but what I want to say is that what you are trying to achieve does not give any practical win; it increases complexity and takes your time. It might also be harder to set up caching on top of all of this later, and with caching enabled this will not improve performance at all.
I don't know if you have heard this before: premature optimization is the root of all evil. I think this thread on SO is useful to read: https://stackoverflow.com/questions/211414/is-premature-optimization-really-the-root-of-all-evil
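The memoising idea behind that __getattr__ can be demonstrated without Django (LazySettings and fake_db are invented stand-ins, not the original code): __getattr__ only fires when the attribute is missing, so storing the value with setattr() means each key is loaded exactly once.

```python
class LazySettings(object):
    def __init__(self, loader):
        self._loader = loader   # stands in for the LiveSetting query
        self._lookups = 0       # counts how often the "database" is hit

    def __getattr__(self, name):
        # Only reached when `name` is not already in the instance dict.
        if name.startswith('_'):
            raise AttributeError(name)
        self._lookups += 1
        value = self._loader(name)
        setattr(self, name, value)  # cached: later access skips __getattr__
        return value

def fake_db(key):
    return {'primary_email': 'info@example.com'}[key]

conf = LazySettings(fake_db)
conf.primary_email
conf.primary_email  # second access is served from the instance dict
# conf._lookups == 1
```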
Maybe you could try Django's caching?
In particular, you may want to check out the low-level caching feature. It seems like it would be a lot less work than what you plan on.

Is it "better" to have an update field or COUNT query?

In a Django App I'm working on I've got this going on:
class Parent(models.Model):
    name = models.CharField(...)

    def num_children(self):
        return Child.objects.filter(parent=self).count()

    def avg_child_rating(self):
        return Child.objects.filter(parent=self).aggregate(Avg('rating'))


class Child(models.Model):
    name = models.CharField(...)
    parent = models.ForeignKey(Parent)
    rating = models.IntegerField(default=0)
I plan on accessing avg_child_rating often. Would it be optimizing if I did the following:
class Parent(models.Model):
    ...
    num_children = models.IntegerField(default=0)
    avg_child_rating = models.FloatField(default=0.0)


def update_parent_child_stats(sender, instance, **kwargs):
    parent = instance.parent
    num_children = Child.objects.filter(parent=parent).count()
    if parent.num_children != num_children:
        parent.num_children = num_children
        parent.avg_child_rating = Child.objects.filter(parent=parent).aggregate(Avg('rating'))['rating__avg']
        parent.save()


post_save.connect(update_parent_child_stats, sender=Child)
post_delete.connect(update_parent_child_stats, sender=Child)
The difference now is that every time a child is created/rated/deleted, the Parent object is updated. I know that the created/rating will be done often.
What's more expensive?
Depends on the scale of the problem.
If you anticipate a lot of write traffic, this might be an issue: it's much harder to scale writes than reads (replication, caching, etc.). That said, you can probably go a long way before this extra query causes you any problems.
Depending on how up-to-date your stats must be you could have some other process (non-web session) come through and update these stats nightly.
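If the denormalised-field route is taken, the average itself can even be maintained incrementally; a sketch of the arithmetic (plain Python, not tied to the models above):

```python
def add_rating(avg, n, rating):
    """Fold one new rating into a running (average, count) pair."""
    new_n = n + 1
    new_avg = (avg * n + rating) / float(new_n)
    return new_avg, new_n

avg, n = 0.0, 0
for rating in (4, 2, 3):
    avg, n = add_rating(avg, n, rating)
# avg == 3.0, n == 3
```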