Optimise query in a model method - django

I have a fairly simple model that's part of a double-entry bookkeeping system. Double entry just means that each transaction (Journal Entry) is made up of multiple LineItems. The line items should add up to zero, to reflect the fact that money always comes out of one category (Ledger) and into another. The CR column is for money out, DR is money in (I think the CR and DR abbreviations come from Latin and are standard naming conventions in accounting systems).
My JournalEntry model has a method called is_valid() which checks that the line items balance, plus a few other checks. However, the method is very expensive on the database, and when I use it to check many entries at once the database can't cope.
Any suggestions on how I can optimise the queries within this method to reduce database load?
class JournalEntry(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.PROTECT, null=True, blank=True)
    date = models.DateField(null=False, blank=False)

    # Make choiceset global so that it can be accessed in filters.py
    global JOURNALENRTY_TYPE_CHOICES
    JOURNALENRTY_TYPE_CHOICES = (
        ('BP', 'Bank Payment'),
        ('BR', 'Bank Receipt'),
        ('TR', 'Transfer'),
        ('JE', 'Journal Entry'),
        ('YE', 'Year End'),
    )
    type = models.CharField(
        max_length=2,
        choices=JOURNALENRTY_TYPE_CHOICES,
        blank=False,
        null=False,
        default='0'
    )
    description = models.CharField(max_length=255, null=True, blank=True)
    def __str__(self):
        if self.description:
            return self.description
        else:
            return 'Journal Entry ' + str(self.id)
    @property
    def is_valid(self):
        """Checks if Journal Entry has valid data integrity"""
        # NEEDS TO BE OPTIMISED AS PERFORMANCE IS BAD
        cr = LineItem.objects.filter(journal_entry=self.id).aggregate(Sum('cr'))
        dr = LineItem.objects.filter(journal_entry=self.id).aggregate(Sum('dr'))
        if dr['dr__sum'] != cr['cr__sum']:
            return "Line items do not balance"
        if self.lineitem_set.filter(cr__isnull=True, dr__isnull=True).exists():
            return "Empty line item(s)"
        if self.lineitem_set.filter(cr__isnull=False, dr__isnull=False).exists():
            return "CR and DR values present on same lineitem(s)"
        if (self.type == 'BR' or self.type == 'BP' or self.type == 'TR') and len(self.lineitem_set.all()) != 2:
            return 'Incorrect number of line items'
        if len(self.lineitem_set.all()) == 0:
            return 'Has zero line items'
        return True
class LineItem(models.Model):
    journal_entry = models.ForeignKey(JournalEntry, on_delete=models.CASCADE)
    ledger = models.ForeignKey(Ledger, on_delete=models.PROTECT)
    description = models.CharField(max_length=255, null=True, blank=True)
    project = models.ForeignKey(Project, on_delete=models.SET_NULL, null=True, blank=True)
    cr = models.DecimalField(max_digits=8, decimal_places=2, null=True, blank=True)
    dr = models.DecimalField(max_digits=8, decimal_places=2, null=True, blank=True)
    reconciliation_date = models.DateField(null=True, blank=True)

    #def __str__(self):
    #    return self.description

    class Meta(object):
        ordering = ['id']

First things first: if it's an expensive operation, it shouldn't be a property - not that it will change the execution time / db load, but at least it won't break the expectation that an attribute access is (relatively) cheap.
As for possible optimisations, part of the cost is in the db roundtrips (including the time spent in the Python code - ORM and db adapter - itself), so the first thing would be to make as few queries as possible:
1/ replacing len(self.lineitem_set.all()) with self.lineitem_set.count(), and not calling it twice, could already save some time
2/ you could probably combine the first two queries into a single one (not tested... - both points are put together in the sketch after this snippet):
crdr = self.lineitem_set.aggregate(Sum('cr'), Sum('dr'))
if crdr['dr__sum'] != crdr['cr__sum']:
    return "Line items do not balance"
And well, that's about all for the simple, obvious optimisations - I don't think they will really solve your issue.
The next step would probably be to try a stored procedure doing the whole validation process - one single roundtrip, and possibly more room for db-level optimisations (depending on your db vendor).
Then - assuming your db schema, settings, server etc. are already fully optimized (which is a bit outside this site's scope) - the only solution left is denormalization, either at the db level (safer) or at the Django level using a local per-instance cache on your model - the issue being to make sure you properly invalidate this cache every time anything that might affect it changes.
NB: actually I'm a bit surprised your db "can't cope" with this, as it doesn't seem _that_ heavy - but it of course depends on how many line items per journal entry you have (on average and worst case) in your production data.
More info about your chosen RDBMS and setup (same server or a distinct one - and if distinct, the network connectivity between the servers - available RAM, RDBMS settings, etc.) could probably help too. Even with the most optimized queries at the client level, there are limits to what your RDBMS can do... but then this becomes more of a sysadmin/dbadmin question.
EDIT
Page load time is now long but it does complete. Yes 2000 records to list and execute the method on
You mean you're executing this in a view on a 2000+ record queryset? Well, I can understand that being a bit heavy - and not only on the database, FWIW.
I think you might be able to optimize this quite a bit further for this use case, then. The first option would be to make use of the queryset's select_related, prefetch_related, annotate and extra features, and if that's not enough, to go for raw SQL.
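For example, for the list view you could compute everything the checks need in one single query over the whole queryset instead of calling is_valid() per entry. A rough, untested sketch (the annotation names are made up, and Count(..., filter=...) requires Django 2.0+):

from django.db.models import Count, Q, Sum

entries = JournalEntry.objects.annotate(
    cr_total=Sum('lineitem__cr'),
    dr_total=Sum('lineitem__dr'),
    nb_items=Count('lineitem'),
    nb_empty=Count('lineitem', filter=Q(lineitem__cr__isnull=True, lineitem__dr__isnull=True)),
    nb_both=Count('lineitem', filter=Q(lineitem__cr__isnull=False, lineitem__dr__isnull=False)),
)

for entry in entries:
    # same rules as is_valid(), applied to the annotated values -
    # no extra query is issued per journal entry
    balanced = entry.cr_total == entry.dr_total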

Related

Django annotation on compoundish primary key with filter ignoring primary key resulting in too many annotated items

Please see EDIT1 below, as well.
Using Django 3.0.6 and Python 3.8, given the following models:
class Plants(models.Model):
    plantid = models.TextField(primary_key=True, unique=True)

class Pollutions(models.Model):
    pollutionsid = models.IntegerField(unique=True, primary_key=True)
    year = models.IntegerField()
    plantid = models.ForeignKey(Plants, models.DO_NOTHING, db_column='plantid')
    pollutant = models.TextField()
    releasesto = models.TextField(blank=True, null=True)
    amount = models.FloatField(db_column="amount", blank=True, null=True)

    class Meta:
        managed = False
        db_table = 'pollutions'
        unique_together = (('plantid', 'releasesto', 'pollutant', 'year'))

class Monthp(models.Model):
    monthpid = models.IntegerField(unique=True, primary_key=True)
    year = models.IntegerField()
    month = models.IntegerField()
    plantid = models.ForeignKey(Plants, models.DO_NOTHING, db_column='plantid')
    power = models.IntegerField(null=False)

    class Meta:
        managed = False
        db_table = 'monthp'
        unique_together = ('plantid', 'year', 'month')
I'd like to annotate to each plant - based on a foreign key relationship and a filter - the amount of CO2 and the sum of its power for a given year. For the sake of debugging, I have replaced Sum with Count, using the following query:
annotated = tmp.all().annotate(
    energy=Count('monthp__power', filter=Q(monthp__year=YEAR)),
    co2=Count('pollutions__amount', filter=Q(pollutions__year=YEAR, pollutions__pollutant="CO2", pollutions__releasesto="Air")))
However, this returns too many items (respectively, a wrong number when using Sum):
annotated.first().co2 # 60, but it should be 1
annotated.first().energy # 252, but it should be 1
although my database guarantees - as denoted - that (plantid, year, month) and (plantid, releasesto, pollutant, year) are unique together, which can easily be demonstrated:
pl = annotated.first().plantid
testplant = Plants.objects.get(pk=pl) # plant object
pco2 = Pollutions.objects.filter(plantid=testplant, year=YEAR, pollutant="CO2", releasesto="Air")
len(pco2) # 1, as expected
Why does Django return too many results, and how can I tell Django to limit the elements to annotate to the 'current' primary key - in other words, to only annotate the elements where the foreign key matches the primary key?
I can achieve what I intend to do by using distinct and Max:
energy=Sum('yearly__power', distinct=True, filter=Q(yearly__year=YEAR)),
co2=Max('pollutions__amount', ...
However, the performance is unacceptable.
I have tried using model_to_dict and appending the wanted values "by hand" to the dict, which works for the values themselves, but not for sorting the resulting dict (e.g. by energy), and it is actually faster than the workaround directly above.
It strikes me as conceptually odd that the manual approach is faster than letting the database do what it is intended to do.
Is this a limitation of Django's ORM, or am I missing something?
EDIT1:
The behaviour has been known as a bug for 11 years.
Even others "spent a whole day on this".
I am now trying it with subqueries. However, the foreign key I am using is not the primary key of its table, so the kind of "usual" approach of using "pk=''" does not work. More clearly, trying:
tmp = Plants.objects.filter(somefilter)
subq1 = Subquery(Yearly.objects.filter(pk=OuterRef('plantid'), year=YEAR))
tmp1 = tmp.all().annotate(
    energy=Count(Subquery(subq1))
)
returns
OperationalError at /xyz
no such column: U0.yid
Which definitely makes sense, because Plants has no clue what a yid is; it only knows plantids. How do I adjust the subquery to account for that?
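(For reference, the usual pattern here - untested against these exact models, YEAR being a placeholder - is to filter the subquery on the outer primary key through the foreign key field, then reduce it to a single value with .values(...)[:1] before wrapping it in Subquery:

from django.db.models import FloatField, OuterRef, Subquery, Sum

co2_subq = (Pollutions.objects
            .filter(plantid=OuterRef('pk'),  # match the FK column against the outer Plants pk
                    year=YEAR, pollutant="CO2", releasesto="Air")
            .values('plantid')                # group by the FK...
            .annotate(total=Sum('amount'))    # ...and aggregate inside the subquery
            .values('total')[:1])

annotated = Plants.objects.annotate(co2=Subquery(co2_subq, output_field=FloatField())))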

Django pagination query duplicated, double the time

In my current project I want to do some filtering and ordering on a queryset and show it to the user in a paginated form.
This works fine, however I am not comfortable with the performance.
When I use an order_by statement, either explicitly or implicitly with the model's Meta ordering, I can see in the Debug Toolbar that this query is essentially executed twice.
Once for the paginator count (without the ORDER BY) and once to fetch the objects slice (with ORDER BY).
From my observation this leads to doubling the time it takes.
Is there any way this can be optimized?
Below is a minimal working example, in my actual app I use class based views.
class Medium(models.Model):
    title = models.CharField(verbose_name=_('title'),
                             max_length=256,
                             null=False, blank=False,
                             db_index=True,
                             )
    offered_by = models.ForeignKey(Institution,
                                   verbose_name=_('Offered by'),
                                   on_delete=models.CASCADE,
                                   )
    quantity = models.IntegerField(verbose_name=_('Quantity'),
                                   validators=[
                                       MinValueValidator(0)
                                   ],
                                   null=False, blank=False,
                                   )
    deleted = models.BooleanField(verbose_name=_('Deleted'),
                                  default=False,
                                  )

def index3(request):
    media = Medium.objects.filter(deleted=False, quantity__gte=0)
    media = media.exclude(offered_by_id=request.user.institution_id)
    media = media.filter(title__icontains="funktion")
    media = media.order_by('title')
    paginator = Paginator(media, 25)
    media = paginator.page(1)
    return render(request, 'media/empty2.html', {'media': media})
Debug toolbar sql timings
The query is not exactly duplicated: One is a COUNT query, the other one fetches the actual objects for the specific page requested. This is unavoidable, since Django's Paginator needs to know the total number of objects. However, if the queryset media isn't too large, you can optimise by forcing the media Queryset to be evaluated (just add a line len(media) before you define the Paginator).
But note that if media is very large, you might not want to force media to be evaluated as you're loading all the objects into memory.
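A minimal sketch of that suggestion, applied to the view above (everything else unchanged):

def index3(request):
    media = Medium.objects.filter(deleted=False, quantity__gte=0)
    media = media.exclude(offered_by_id=request.user.institution_id)
    media = media.filter(title__icontains="funktion")
    media = media.order_by('title')
    len(media)  # evaluates the queryset once and fills its result cache
    paginator = Paginator(media, 25)  # paginator.count now reuses the cache, no COUNT query
    media = paginator.page(1)         # the page slice is also served from the cache
    return render(request, 'media/empty2.html', {'media': media})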

Django ORM and SQL inner joins

I am trying to get all Horse objects which fall within a specific from_date and to_date range on a related listing object, e.g.
Horse.objects.filter(listings__to_date__lt=to_date.datetime,
listings__from_date__gt=from_date.datetime)
Now, as I understand it, this database query creates an inner join which then enables me to find all my horse objects based on the related listing dates.
My question is how exactly this works; it probably comes down to a major lack of understanding of how inner joins actually work. Would this query need to first 'check' each and every horse object to ascertain whether or not it has a related listing object? I'd imagine this could prove quite inefficient, because you might have 5 million horse objects with no related listing object, yet you would still have to check each and every one first?
Alternatively I could start with my Listings and do something like this first:
Listing.objects.filter(to_date__lt=to_date.datetime,
from_date__gt=from_date.datetime)
And then:
for listing in listing_objs:
    if listing.horse:
        horses.append(listing.horse)
But this seems like a rather odd way of achieving my results too.
If anyone could help me understand how queries work in Django and which is the most efficient way to go about doing such a query it would be a great help!
This is my current model setup:
class Listing(models.Model):
    to_date = models.DateTimeField(null=True, blank=True)
    from_date = models.DateTimeField(null=True, blank=True)
    promoted_to_date = models.DateTimeField(null=True, blank=True)
    promoted_from_date = models.DateTimeField(null=True, blank=True)

    # Relationships
    horse = models.ForeignKey('Horse', related_name='listings', null=True, blank=True)

class Horse(models.Model):
    created_date = models.DateTimeField(null=True, blank=True, auto_now=True)
    type = models.CharField(max_length=200, null=True, blank=True)
    name = models.CharField(max_length=200, null=True, blank=True)
    age = models.IntegerField(null=True, blank=True)
    colour = models.CharField(max_length=200, null=True, blank=True)
    height = models.IntegerField(null=True, blank=True)
The way you write your query really depends on what information you want back most of the time. If you are interested in the horses, then query from Horse. If you're interested in listings then you should query from Listing. That's generally the correct thing to do, especially when you're working with simple foreign keys.
Your first query is probably the better one with regards to Django. I've used slightly simpler models to illustrate the differences. I've created an active field rather than using datetimes.
In [18]: qs = Horse.objects.filter(listings__active=True)
In [19]: print(qs.query)
SELECT
"scratch_horse"."id",
"scratch_horse"."name"
FROM "scratch_horse"
INNER JOIN "scratch_listing"
ON ( "scratch_horse"."id" = "scratch_listing"."horse_id" )
WHERE "scratch_listing"."active" = True
The inner join in the query above will ensure that you only get horses that have a listing. (Most) databases are very good at using joins and indexes to filter out unwanted rows.
If Listing was very small, and Horse was rather large, then I would hope the database would only look at the Listing table, and then use an index to fetch the correct parts of Horse without doing a full table scan (inspecting every horse). You will need to run the query and check what your database is doing though. EXPLAIN (or whatever database you use) is extremely useful. If you're guessing what the database is doing, you're probably wrong.
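With Django 2.1+ you can get the plan straight from the queryset; on older versions, copy the SQL from print(qs.query) and run EXPLAIN in your database shell. A quick sketch:

qs = Horse.objects.filter(listings__active=True)
print(qs.explain())              # the database's query plan
print(qs.explain(analyze=True))  # PostgreSQL only: runs the query and reports actual timings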
Note that if you need to access the listings of each horse then you'll be executing another query each time you access horse.listings. prefetch_related can help you if you need to access listings, by executing a single query and storing it in cache.
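For example, using the simplified models from this answer (note the distinct(), since the join can return the same horse once per matching listing):

horses = (Horse.objects
          .filter(listings__active=True)
          .distinct()
          .prefetch_related('listings'))  # one extra query fetches all listings up front

for horse in horses:
    for listing in horse.listings.all():  # served from the prefetch cache, no per-horse query
        print(horse.name, listing.id)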
Now, your second query:
In [20]: qs = Listing.objects.filter(active=True).select_related('horse')
In [21]: print(qs.query)
SELECT
"scratch_listing"."id",
"scratch_listing"."active",
"scratch_listing"."horse_id",
"scratch_horse"."id",
"scratch_horse"."name"
FROM "scratch_listing"
LEFT OUTER JOIN "scratch_horse"
ON ( "scratch_listing"."horse_id" = "scratch_horse"."id" )
WHERE "scratch_listing"."active" = True
This does a LEFT join, which means that the right hand side can contain NULL. The right hand side is Horse in this instance. This would perform very poorly if you had a lot of listings without a Horse, because it would bring back every single active listing, whether or not a horse was associated with it. You could fix that with .filter(active=True, horse__isnull=False) though.
See that I've used select_related, which joins the tables so that you're able to access listing.horse without incurring another query.
Now I should probably ask why all your fields are nullable. That's usually a terrible design choice, especially for ForeignKeys. Will you ever have a listing that's not associated with a horse? If not, get rid of the null. Will you ever have a horse that won't have a name? If not, get rid of the null.
So the answer is, do what seems natural most of the time. If you know a particular table is going to be large, then you must inspect the query planner (EXPLAIN), look into adding/using indexes on filter/join conditions, or querying from the other side of the relation.

Slow iteration over django queryset

I am iterating over a Django queryset that contains anywhere from 500-1000 objects. The corresponding model/table has 7 fields as well. The problem is that it takes about 3 seconds to iterate over, which seems way too long considering all the other data processing that needs to be done in my application.
EDIT:
Here is my model:
class Node(models.Model):
    node_id = models.CharField(null=True, blank=True, max_length=30)
    jobs = models.TextField(null=True, blank=True)
    available_mem = models.CharField(null=True, blank=True, max_length=30)
    assigned_mem = models.CharField(null=True, blank=True, max_length=30)
    available_ncpus = models.PositiveIntegerField(null=True, blank=True)
    assigned_ncpus = models.PositiveIntegerField(null=True, blank=True)
    cluster = models.CharField(null=True, blank=True, max_length=30)
    datetime = models.DateTimeField(auto_now_add=False)
This is my initial query, which is very fast:
timestamp = models.Node.objects.order_by('-pk').filter(cluster=cluster)[0]
self.nodes = models.Node.objects.filter(datetime=timestamp.datetime)
But then, I go to iterate and it takes 3 seconds, I've tried two ways as seen below:
def jobs_by_node(self):
    """returns a dictionary containing keys that
    are strings of node ids and values that
    are lists of the jobs running on that node."""
    jobs_by_node = {}

    # iterate over nodes and populate jobs_by_node dictionary
    tstart = time.time()
    for node in self.nodes:
        pass  # I have omitted the code because the slowdown is simply iteration
    tend = time.time()
    tfinal = tend - tstart

    return jobs_by_node
Other method:
all_nodes = self.nodes.values('node_id')
tstart = time.time()
for node in all_nodes:
    pass
tend = time.time()
tfinal = tend-tstart
I tried the second method by referring to this post, but it still has not sped up my iteration one bit. I've scoured the web to no avail. Any help optimizing this process will be greatly appreciated. Thank you.
Note: I'm using Django version 1.5 and Python 2.7.3
Check the SQL query being issued. You can use a print statement:
print self.nodes.query # in general: print queryset.query
That should give you something like:
SELECT id, jobs, ... FROM app_node
Then run EXPLAIN SELECT id, jobs, ... FROM app_node and you'll know what exactly is wrong.
Assuming that you know what the problem is after running EXPLAIN, and that simple solutions like adding indexes aren't enough, you can think about e.g. fetching the relevant rows into a separate table every X minutes (in a cron job or Celery task) and using that separate table in your application.
If you are using PostgreSQL you can also use materialized views and "wrap" them in an unmanaged Django model.
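A rough sketch of what that can look like (the view name node_latest is made up; the materialized view itself would be created in PostgreSQL, e.g. via a RunSQL migration, and refreshed periodically with REFRESH MATERIALIZED VIEW):

class NodeLatest(models.Model):
    node_id = models.CharField(null=True, blank=True, max_length=30)
    jobs = models.TextField(null=True, blank=True)
    datetime = models.DateTimeField()

    class Meta:
        managed = False            # Django won't try to create or migrate this "table"
        db_table = 'node_latest'   # name of the materialized view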

What's the best way to ensure balanced transactions in a double-entry accounting app?

What's the best way to ensure that transactions are always balanced in double-entry accounting?
I'm creating a double-entry accounting app in Django. I have these models:
class Account(models.Model):
    TYPE_CHOICES = (
        ('asset', 'Asset'),
        ('liability', 'Liability'),
        ('equity', 'Equity'),
        ('revenue', 'Revenue'),
        ('expense', 'Expense'),
    )
    num = models.IntegerField()
    type = models.CharField(max_length=20, choices=TYPE_CHOICES, blank=False)
    description = models.CharField(max_length=1000)

class Transaction(models.Model):
    date = models.DateField()
    description = models.CharField(max_length=1000)
    notes = models.CharField(max_length=1000, blank=True)

class Entry(models.Model):
    TYPE_CHOICES = (
        ('debit', 'Debit'),
        ('credit', 'Credit'),
    )
    transaction = models.ForeignKey(Transaction, related_name='entries')
    type = models.CharField(max_length=10, choices=TYPE_CHOICES, blank=False)
    account = models.ForeignKey(Account, related_name='entries')
    amount = models.DecimalField(max_digits=11, decimal_places=2)
I'd like to enforce balanced transactions at the model level but there doesn't seem to be hooks in the right place. For example, Transaction.clean won't work because transactions get saved first, then entries are added due to the Entry.transaction ForeignKey.
I'd like balance checking to work within admin also. Currently, I use an EntryInlineFormSet with a clean method that checks balance in admin but this doesn't help when adding transactions from a script. I'm open to changing my models to make this easier.
(Hi Ryan! -- Steve Traugott)
It's been a while since you posted this, so I'm sure you're way past this puzzle. For others and posterity, I have to say yes, you need to be able to split transactions, and no, you don't want to take the naive approach and assume that transaction legs will always be in pairs, because they won't. You need to be able to do N-way splits, where N is any positive integer greater than 1. Ryan has the right structure here.
What Ryan calls Entry I usually call Leg, as in transaction leg, and I'm usually working with bare Python on top of some SQL database. I haven't used Django yet, but I'd be surprised (shocked) if Django doesn't support something like the following: Rather than use the native db row ID for transaction ID, I instead usually generate a unique transaction ID from some other source, store that in both the Transaction and Leg objects, do my final check to ensure debits and credits balance, and then commit both Transaction and Legs to the db in one SQL transaction.
Ryan, is that more or less what you wound up doing?
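In Django terms, a rough sketch of that approach might look like the following (the helper name create_transaction and the entry_specs format are made up for illustration):

from django.db import transaction as db_transaction  # aliased to avoid clashing with the Transaction model

def create_transaction(date, description, entry_specs):
    """entry_specs: iterable of (type, account, amount) tuples."""
    debits = sum(amount for kind, account, amount in entry_specs if kind == 'debit')
    credits = sum(amount for kind, account, amount in entry_specs if kind == 'credit')
    if debits != credits:
        raise ValueError('Transaction does not balance')
    with db_transaction.atomic():  # the Transaction and its Entries commit or roll back together
        txn = Transaction.objects.create(date=date, description=description)
        Entry.objects.bulk_create(
            Entry(transaction=txn, type=kind, account=account, amount=amount)
            for kind, account, amount in entry_specs
        )
    return txn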
This may sound terribly naive, but why not just record each transaction in a single record containing "to account" and "from account" foreign keys that link to an accounts table, instead of trying to create two records for each transaction? From my point of view, it seems that the essence of "double-entry" is that transactions always move money from one account to another. There is no advantage in using two records to store such transactions, and many disadvantages.