Slow iteration over Django queryset

I am iterating over a Django queryset that contains anywhere from 500-1000 objects. The corresponding model/table has 7 fields as well. The problem is that it takes about 3 seconds to iterate over, which seems far too long considering all the other data processing my application needs to do.
EDIT:
Here is my model:
class Node(models.Model):
    node_id = models.CharField(null=True, blank=True, max_length=30)
    jobs = models.TextField(null=True, blank=True)
    available_mem = models.CharField(null=True, blank=True, max_length=30)
    assigned_mem = models.CharField(null=True, blank=True, max_length=30)
    available_ncpus = models.PositiveIntegerField(null=True, blank=True)
    assigned_ncpus = models.PositiveIntegerField(null=True, blank=True)
    cluster = models.CharField(null=True, blank=True, max_length=30)
    datetime = models.DateTimeField(auto_now_add=False)
This is my initial query, which is very fast:
timestamp = models.Node.objects.order_by('-pk').filter(cluster=cluster)[0]
self.nodes = models.Node.objects.filter(datetime=timestamp.datetime)
But then iterating takes 3 seconds; I've tried two ways, as seen below:
def jobs_by_node(self):
    """Returns a dictionary whose keys are strings of node ids
    and whose values are lists of the jobs running on that node."""
    jobs_by_node = {}
    # iterate over nodes and populate the jobs_by_node dictionary
    tstart = time.time()
    for node in self.nodes:
        pass  # code omitted because the slowdown is the iteration itself
    tend = time.time()
    tfinal = tend - tstart
    return jobs_by_node
Other method:
all_nodes = self.nodes.values('node_id')
tstart = time.time()
for node in all_nodes:
    pass
tend = time.time()
tfinal = tend - tstart
I tried the second method by referring to this post, but it still has not sped up my iteration one bit. I've scoured the web to no avail. Any help optimizing this process will be greatly appreciated. Thank you.
Note: I'm using Django version 1.5 and Python 2.7.3

Check the SQL query being issued. You can use a print statement:
print self.nodes.query # in general: print queryset.query
That should give you something like:
SELECT id, jobs, ... FROM app_node
Then run EXPLAIN SELECT id, jobs, ... FROM app_node in your database shell, and you'll know exactly what is wrong.
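If you'd rather run EXPLAIN from Python, here is a minimal sketch, assuming the default database connection and that queryset.query.sql_with_params() is available (it is in modern Django; the print statement matches the Python 2 setup from the question):

from django.db import connection

sql, params = self.nodes.query.sql_with_params()
cursor = connection.cursor()
cursor.execute("EXPLAIN " + sql, params)
for row in cursor.fetchall():
    print row  # one line of the query plan per row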
Assuming that you know what the problem is after running EXPLAIN, and that simple solutions like adding indexes aren't enough, you can think about e.g. fetching the relevant rows into a separate table every X minutes (in a cron job or Celery task) and using that separate table in your application.
If you are using PostgreSQL you can also use materialized views and "wrap" them in an unmanaged Django model.
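A minimal sketch of that approach, assuming a materialized view named node_latest that you create and refresh yourself (both the view name and its SQL are illustrative, not from the question):

# In PostgreSQL, created once and refreshed periodically from cron/Celery:
#   CREATE MATERIALIZED VIEW node_latest AS
#       SELECT * FROM app_node
#       WHERE datetime = (SELECT MAX(datetime) FROM app_node);
#   REFRESH MATERIALIZED VIEW node_latest;

from django.db import models

class NodeLatest(models.Model):
    node_id = models.CharField(null=True, blank=True, max_length=30)
    jobs = models.TextField(null=True, blank=True)
    # ... mirror the remaining Node fields ...

    class Meta:
        managed = False            # Django never creates or migrates this table
        db_table = 'node_latest'   # points at the materialized view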

Related

Determine count of object retrieval per day in django

In a model like the one below
class Watched(Stamping):
    user = models.ForeignKey("User", null=True, blank=True, on_delete=models.CASCADE)
    count = models.PositiveIntegerField(default=0)
    created_at = models.DateTimeField(auto_now_add=True)
    updated_at = models.DateTimeField(auto_now=True)
Anytime an object is retrieved, I increment the count attribute.
Now my problem is how to get the number of times an object was retrieved for each day of the week.
For example, WatchedObject1 will have {'Sun': 10, 'Tue': 70, 'Wed': 35}
This seems like a use case for auditing, and there are Django plugins that can help you with that. If you don't want to add a dependency, you would have to create another model in which you store the intended data.
class RetrievalOfData(models.Model):
    date_of_retrieval = models.DateTimeField(auto_now_add=True)
    object_retrieved = models.ForeignKey("Watched", on_delete=models.CASCADE)
You could probably also override the manager to create these objects every time the model is queried: https://docs.djangoproject.com/en/3.2/topics/db/managers/
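Once retrievals are logged in RetrievalOfData, the per-weekday breakdown can be computed with aggregation. A hedged sketch (ExtractWeekDay requires Django 1.10+; watched_obj stands in for a Watched instance):

from django.db.models import Count
from django.db.models.functions import ExtractWeekDay

counts = (RetrievalOfData.objects
          .filter(object_retrieved=watched_obj)
          .annotate(weekday=ExtractWeekDay('date_of_retrieval'))
          .values('weekday')
          .annotate(total=Count('id')))
# e.g. [{'weekday': 1, 'total': 10}, ...] with 1=Sunday through 7=Saturday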
You might find it better to have a separate WatchedModelStats table, and perhaps link it to your model with Django signals. Whenever a countable event takes place, execute something like:
try:
    counter = WatchedModelStats.objects.get(name=model_name, date=today)
    counter.count += 1
except WatchedModelStats.DoesNotExist:
    counter = WatchedModelStats(name=model_name, date=today, count=1)
counter.save()
One advantage is extensibility. You could easily implement multiple counts for different event types, if the need later becomes apparent.
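A possible shape for such a table (illustrative only; the answer does not define WatchedModelStats, and the event_type field is the extension suggested above):

class WatchedModelStats(models.Model):
    name = models.CharField(max_length=255)
    date = models.DateField()
    event_type = models.CharField(max_length=32, default='retrieval')
    count = models.PositiveIntegerField(default=0)

    class Meta:
        unique_together = ('name', 'date', 'event_type')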

Django annotation on compound-ish primary key with filter ignoring primary key, resulting in too many annotated items

Please see EDIT1 below, as well.
Using Django 3.0.6 and Python 3.8, given the following models:
class Plants(models.Model):
    plantid = models.TextField(primary_key=True, unique=True)

class Pollutions(models.Model):
    pollutionsid = models.IntegerField(unique=True, primary_key=True)
    year = models.IntegerField()
    plantid = models.ForeignKey(Plants, models.DO_NOTHING, db_column='plantid')
    pollutant = models.TextField()
    releasesto = models.TextField(blank=True, null=True)
    amount = models.FloatField(db_column="amount", blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'pollutions'
        unique_together = (('plantid', 'releasesto', 'pollutant', 'year'),)

class Monthp(models.Model):
    monthpid = models.IntegerField(unique=True, primary_key=True)
    year = models.IntegerField()
    month = models.IntegerField()
    plantid = models.ForeignKey(Plants, models.DO_NOTHING, db_column='plantid')
    power = models.IntegerField(null=False)

    class Meta:
        managed = False
        db_table = 'monthp'
        unique_together = ('plantid', 'year', 'month')
I'd like to annotate each plant, based on a foreign key relationship and a filter, with the amount of CO2 and the sum of its power for a given year. For the sake of debugging, I have replaced Sum with Count in the following query:
annotated = tmp.all().annotate(
    energy=Count('monthp__power', filter=Q(monthp__year=YEAR)),
    co2=Count('pollutions__amount', filter=Q(pollutions__year=YEAR,
                                             pollutions__pollutant="CO2",
                                             pollutions__releasesto="Air")))
However, this returns too many items (and, correspondingly, a wrong number when using Sum):
annotated.first().co2 # 60, but it should be 1
annotated.first().energy # 252, but it should be 1
although my database guarantees, as noted, that (plantid, year, month) and (plantid, releasesto, pollutant, year) are unique together, which can easily be demonstrated:
pl = annotated.first().plantid
testplant = Plants.objects.get(pk=pl) # plant object
pco2 = Pollutions.objects.filter(plantid=testplant, year=YEAR, pollutant="CO2", releasesto="Air")
len(pco2) # 1, as expected
Why does Django return too many results, and how can I tell Django to limit the annotated elements to the "current" primary key, in other words to annotate only the elements whose foreign key matches the primary key?
I can achieve what I intend to do by using distinct and Max:
energy=Sum('yearly__power', distinct=True, filter=Q(yearly__year=YEAR)),
co2=Max('pollutions__amount', ...
However, the performance is unacceptable.
I have tested using model_to_dict and appending the wanted values "by hand" to the dict, which works for the values themselves, but not for sorting the resulting dict (e.g. by energy), and it is actually faster than the workaround directly above.
It strikes me as odd that the manual approach is faster than letting the database do what it is designed to do.
Is this a limitation of Django's ORM, or am I missing something?
EDIT1:
This behaviour has been a known bug for 11 years.
Even others "spent a whole day on this".
I am now trying it with subqueries. However, the foreign key I am using is not the primary key of its table, so the "usual" approach of filtering on pk=... does not work. More concretely, trying:
tmp = Plants.objects.filter(somefilter)
subq1 = Subquery(Yearly.objects.filter(pk=OuterRef('plantid'), year=YEAR))
tmp1 = tmp.all().annotate(
    energy=Count(Subquery(subq1))
)
returns
OperationalError at /xyz
no such column: U0.yid
That definitely makes sense, because Plants has no clue what a yid is; it only knows plantids. How do I adjust the subquery to account for that?
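For reference, the usual way out is to correlate the subquery on the foreign-key column instead of pk and to do the aggregation inside the subquery. A hedged sketch against the models above (untested against the actual schema):

from django.db.models import OuterRef, Subquery, Sum

# At most one matching CO2 row per plant and year (per the unique
# constraint), so the first 'amount' is the value itself.
co2_sq = (Pollutions.objects
          .filter(plantid=OuterRef('pk'), year=YEAR,
                  pollutant="CO2", releasesto="Air")
          .values('amount')[:1])

# Aggregate the power inside the subquery, grouped by the correlated plant.
energy_sq = (Monthp.objects
             .filter(plantid=OuterRef('pk'), year=YEAR)
             .values('plantid')
             .annotate(total=Sum('power'))
             .values('total')[:1])

annotated = tmp.annotate(co2=Subquery(co2_sq), energy=Subquery(energy_sq))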

Speed up Django query

I am working with Django to create a dashboard that presents many kinds of data. My problem is that the page loads slowly even though I hit the database (PostgreSQL) only once. The tables are loaded with data every 10 minutes, so they currently contain millions of records. When I run a query with the Django ORM, the data comes back slowly (1.4 seconds according to the Django toolbar). I know that is not too much in itself, but it is half of the total loading time (3.1 seconds), so if I could decrease the query time, the page load time would drop and the user experience would improve. The query fetches ~2800 rows. Is there any way to speed up this query? I don't know whether I am doing something wrong or whether this time is normal for this amount of data. I attach my query and model. Thank you in advance for your help.
My query (here I fetch a 6-hour time interval):
my_query = MyTable.objects.filter(time_stamp__range=(before_now, now)).values('time_stamp', 'value1', 'value2')
Here I tried to use .iterator() but the query wasn't faster.
My model:
class MyTable(models.Model):
    time_stamp = models.DateTimeField()
    value1 = models.FloatField(blank=True, null=True)
    value2 = models.FloatField(blank=True, null=True)
Add an index:
class MyTable(models.Model):
    time_stamp = models.DateTimeField()
    value1 = models.FloatField(blank=True, null=True)
    value2 = models.FloatField(blank=True, null=True)

    class Meta:
        indexes = [
            models.Index(fields=['time_stamp']),
        ]
Don't forget to run manage.py makemigrations and manage.py migrate after this.
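Optionally, once the index is in place you can check that the planner actually uses it; QuerySet.explain() is available from Django 2.1 onwards:

qs = MyTable.objects.filter(time_stamp__range=(before_now, now)) \
                    .values('time_stamp', 'value1', 'value2')
print(qs.explain())  # the plan should now show an index scan on time_stamp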

Django - Displaying result information while optimizing database queries with models that have multiple foreign key relationships

So I'm putting together a web application and I am currently having trouble building a results page for each user.
Here is what my models look like:
class Fault(models.Model):
    name = models.CharField(max_length=255)
    severity = models.PositiveSmallIntegerField(default=0)
    description = models.CharField(max_length=1024, null=False, blank=False)
    recommendation = models.CharField(max_length=1024, null=False, blank=False)
    date_added = models.DateTimeField(_('date added'), default=timezone.now)
    ...

class FaultInstance(models.Model):
    auto = models.ForeignKey(Auto)
    fault = models.ForeignKey(Fault)
    date_added = models.DateTimeField(_('date added'), default=timezone.now)
    objects = FaultInstanceManager()
    ...

class Auto(models.Model):
    label = models.CharField(max_length=255)
    model = models.CharField(max_length=255)
    make = models.CharField(max_length=255)
    year = models.IntegerField(max_length=4)
    user = models.ForeignKey(AUTH_USER_MODEL)
    ...
I don't know if my model relationships are ideal, but they made sense in my head. Each user can have multiple Auto objects associated with them, and each Auto can have multiple FaultInstance objects associated with it.
On the results page, I want to list all the FaultInstances that a user has across their Autos. Under each listed FaultInstance will be a list of all the autos the user owns that have the fault, with their information (here is roughly what I had in mind):
All FaultInstance Listing Ordered by Severity (high number to low number)
FaultInstance:
    FaultDescription:
    FaultRecommendation:
    ListofAutosWithFault:
        AutoLabel  AutoModel  AutoYear ...
        AutoLabel  AutoModel  AutoYear ...
Obviously, doing things the correct way means doing as much of the list creation as possible on the Python/Django side and avoiding any logic or processing in the template. I am able to create a list per severity with a model manager, as seen here:
class FaultInstanceManager(models.Manager):
    def get_faults_by_user_severity(self, user, severity):
        faults = defaultdict(list)
        qs_faultinst = self.model.objects.select_related().filter(
            auto__user=user, fault__severity=severity
        ).order_by('auto__make')
        for result in qs_faultinst:
            faults[result.fault].append(result)
        faults.default_factory = None
        return faults
I still need to specify each severity, but I guess if I only have 5 severity levels, I can create a list for each level and pass each one to the template. Any suggestions for this are appreciated. However, that's not my real problem. My stopping point right now is that I want to create a summary table at the top of the report that gives the user a breakdown of fault instances per make|model|year. I can't think of the proper query or data structure to pass to the template.
Summary (table of all the FaultInstances with the following column headers):
FaultInstance Make|Model|Year NumberOfAutosAffected
This will give me metrics for a make, a model, or a year (in the example below, it separates faults based on model). I say FaultInstances because I only list Faults that are connected to a user.
For example:
Bad Starter      Nissan    1
Bad Taillight    Honda     2
Bad Taillight    Nissan    1
And I am enough of a perfectionist that I want to do this while optimizing database queries. If I can create a data structure in my original query that is easily parsed in the template and still produces both sections of my report (maybe a defaultdict of a defaultdict(list)), that's what I want to do. Thanks for the help, and hopefully my question is thorough and makes sense.
It makes sense to use related names because they simplify your queries. Like this:
class FaultInstance(models.Model):
    auto = models.ForeignKey(Auto, related_name='fault_instances')
    fault = models.ForeignKey(Fault, related_name='fault_instances')
    ...

class Auto(models.Model):
    user = models.ForeignKey(AUTH_USER_MODEL, related_name='autos')
In this case you can use:
qs_faultinst = user.fault_instances.filter(fault__severity=severity).order_by('auto__make')
instead of:
qs_faultinst = self.model.objects.select_related().filter(
    auto__user=user, fault__severity=severity
).order_by('auto__make')
I can't figure out your summary table; maybe you meant:
Fault Make|Model|Year NumberOfAutosAffected
In this case you can use aggregation, as sketched below. But the grouping would still be slow if you have enough data. One easy solution is to denormalize the data by creating an extra model and a few signals to update it, or you can use a cache.
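A hedged sketch of that aggregation (field names follow the question's models):

from django.db.models import Count

summary = (FaultInstance.objects
           .filter(auto__user=user)
           .values('fault__name', 'fault__severity',
                   'auto__make', 'auto__model', 'auto__year')
           .annotate(autos_affected=Count('auto', distinct=True))
           .order_by('-fault__severity'))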
If you have a predefined set of severities then think about this:
class Fault(models.Model):
    SEVERITY_LOW = 0
    SEVERITY_MIDDLE = 1
    SEVERITY_HIGH = 2
    ...
    SEVERITY_CHOICES = (
        (SEVERITY_LOW, 'Low'),
        (SEVERITY_MIDDLE, 'Middle'),
        (SEVERITY_HIGH, 'High'),
        ...
    )
    ...
    severity = models.PositiveSmallIntegerField(default=SEVERITY_LOW,
                                                choices=SEVERITY_CHOICES)
    ...
In your templates you can just iterate through Fault.SEVERITY_CHOICES.
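On the view side, one way to build a group per severity, reusing the get_faults_by_user_severity manager method from the question (highest severity first):

report = [
    (label, FaultInstance.objects.get_faults_by_user_severity(user, sev))
    for sev, label in reversed(Fault.SEVERITY_CHOICES)
]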
Update:
Change your models:
Extract the model field into a separate model:
class AutoModel(models.Model):
    name = models.CharField(max_length=255)
Change the model field of the Auto model:
class Auto(models.Model):
    ...
    auto_model = models.ForeignKey(AutoModel, related_name='cars')
    ...
Add a model:
class MyDenormalizedModelForReport(models.Model):
    fault = models.ForeignKey(Fault, related_name='reports')
    auto_model = models.ForeignKey(AutoModel, related_name='reports')
    year = models.IntegerField(max_length=4)
    number_of_auto_affected = models.IntegerField(default=0)
Add a signal:
def update_denormalized_model(sender, instance, created, **kwargs):
    if created:
        rep, dummy_created = MyDenormalizedModelForReport.objects.get_or_create(
            fault=instance.fault,
            auto_model=instance.auto.auto_model,
            year=instance.auto.year)
        rep.number_of_auto_affected += 1
        rep.save()

post_save.connect(update_denormalized_model, sender=FaultInstance)

Caching of querysets and re-evaluation

I'm going to post some incomplete code to make the example simple. I'm running a recursive function to compute some metrics on a hierarchical structure.
class Category(models.Model):
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children', default=1)

    def compute_metrics(self, shop_object, metric_queryset=None, rating_queryset=None):
        if metric_queryset == None:
            metric_queryset = Metric.objects.all()
        if rating_queryset == None:
            rating_queryset = Rating.objects.filter(shop_object=shop_object)
        for child in self.children.all():
            # do stuff
            child_score = child.compute_metrics(shop_object, metric_queryset, rating_queryset)
        metrics_in_cat = metric_queryset.filter(category=self)
        for metric in metrics_in_cat:
            pass  # do stuff
I hope that's enough code to see what's going on. What I'm after here is a recursive function that runs each of those queries only once and then passes the results down. That doesn't seem to be happening right now, and it's killing performance. Were this PHP/MySQL (as much as I dislike them after working with Django!) I could just run the queries once and pass them down.
From what I understand of Django's querysets, they aren't going to be evaluated in my if queryset == None: queryset = ... part. How can I force evaluation? And will they be re-evaluated when I do things like metric_queryset.filter(category=self)?
I don't care about data freshness. I just want to read from the DB once each for metrics and ratings, then filter on them later without hitting the DB again. It's a frustrating problem that feels like it should have a very simple answer. Pickling looks like it could work, but it's not very well explained in the Django documentation.
I think the problem here is that you are never forcing evaluation of the querysets, so fresh queries are issued on every recursive call. If you use list() to force evaluation, each queryset should hit the database only once. Note that you will then have to change the metrics_in_cat line to a Python-level filter rather than a queryset filter.
class Category(models.Model):
    parent = models.ForeignKey('self', null=True, blank=True, related_name='children', default=1)

    def compute_metrics(self, shop_object, metric_queryset=None, rating_queryset=None):
        if metric_queryset is None:
            metric_queryset = list(Metric.objects.all())
        if rating_queryset is None:
            rating_queryset = list(Rating.objects.filter(shop_object=shop_object))
        for child in self.children.all():
            # do stuff
            child_score = child.compute_metrics(shop_object, metric_queryset, rating_queryset)
        # metrics_in_cat = metric_queryset.filter(category=self)  # queryset filter replaced:
        metrics_in_cat = [m for m in metric_queryset if m.category == self]
        for metric in metrics_in_cat:
            pass  # do stuff
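For what it's worth, a small sketch of why list() helps here (this is generic queryset behaviour, not code from the question; Metric and some_category are illustrative):

metrics = Metric.objects.all()           # lazy: no database access yet
evaluated = list(metrics)                # hits the DB once; results live in the list
metrics.filter(category=some_category)   # builds a NEW queryset and issues a NEW query;
                                         # the evaluated list above is never consulted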