What is the best way to do intense read-only queries in Django?

We have a really big Django application which uses a Postgres database. We want to build an analytics module.
This module uses a base query e.g.
someFoo = SomeFoo.objects.all() # Around 100000 objects returned.
Then slice and dice this data. i.e.
someFoo.filter(Q(creator=owner) | Q(moderated=False))
These queries will be very intense, and since this is an analytics and reporting dashboard the queries will hit the database very hard.
What is the best way to handle complex queries in such conditions? i.e. when you have a base query that will be sliced and diced very often in a short span of time and never be used again.
A few possible solutions that we have thought of are:
A read-only database and a write-only database (see the router sketch after this list).
Writing raw SQL queries and using them, as the Django ORM can be quite inefficient for certain types of queries.
Caching heavily (we have not thought about or researched this yet).
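For the first option, here is a minimal sketch of a Django database router that sends reads to a replica; the 'replica' alias, router path, and module name are assumptions for illustration, not something from the question:

# settings.py (sketch): a second connection pointing at a read replica
# DATABASES = {'default': {...}, 'replica': {...}}
# DATABASE_ROUTERS = ['myapp.routers.AnalyticsRouter']

class AnalyticsRouter:
    # Route all reads to the replica and all writes to the primary.
    def db_for_read(self, model, **hints):
        return 'replica'

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        return True

Individual querysets can also be pinned explicitly, e.g. SomeFoo.objects.using('replica').all().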
Edit: example query
select sport."sportName", sport.id, pop.name, analytics_query.loc_id, "new count"
from "SomeFoo_sportpop" as sportpop
join "SomeFoo_pop" as pop on (sportpop.pop_id = pop.id)
join "SomeFoo_sport" as sport on (sportpop.sport_id = sport.id)
join (
    select ref.catcher_pop_id as loc_id,
           (select count(*) from "SomeFoo_pref"
            where catcher_pop_id = ref.catcher_pop_id and status = 'pending' and "exists" = TRUE) as "new count"
    from "SomeFoo_pref" as ref
    where ref."exists" = TRUE and ref.catcher_pop_id is not NULL
    group by ref.catcher_pop_id
) as analytics_query on (sportpop.pop_id = analytics_query.loc_id)
order by sport."sportName", pop.name asc
This is an example of a raw SQL query we are planning to run, and it's going to have a lot of WHERE clauses and GROUP BYs. Basically we are going to slice and dice the base query a lot.
Is there any other possible solution or method that you can point us to? Any help is highly appreciated.

I can think of PREPARED STATEMENTs and a faster server, maybe on Linux... (rough sketch of a prepared statement below).
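For what it's worth, a rough sketch of what that might look like against Postgres from Django's raw connection; the statement name and parameter value are made up, and prepared statements only live for the duration of the database session:

from django.db import connection

with connection.cursor() as cursor:
    # Prepare the parameterized query once per session...
    cursor.execute("""
        PREPARE pending_counts (integer) AS
        SELECT count(*) FROM "SomeFoo_pref"
        WHERE catcher_pop_id = $1 AND status = 'pending' AND "exists" = TRUE
    """)
    # ...then execute it repeatedly with different arguments.
    cursor.execute("EXECUTE pending_counts(%s)", [42])
    (pending,) = cursor.fetchone()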

Related

Cloud Spanner - read performance with large number of items in WHERE clause

I'm in the process of evaluating some different data stores for a project and I have a strange but inflexible requirement to check the existence of 1500 keys per query... Basically the only query I'll be running is of the form:
SELECT user_id, name, gender
WHERE user_id in (user1, user2, ..., user1500)
I will have around 3.5 billion rows in the table. One data store that has caught my eye is Spanner. I was wondering if querying the data in this way would be feasible or if I would run into performance issues due to the large number of items in my WHERE clause. I have only been able to test these queries on a small amount of data so far, so I'm leaning more on what the theoretical performance hit might look like instead of having the luxury to just "try and find out".
Also, are there other data stores that might work better for this read pattern? I expect to run no more than 80 queries per second. Also, the data will be bulk loaded on a weekly basis. The data is structured by nature but we don't use it in a relational way (i.e. no joins).
Anyways, sorry if this question is vague in any way. I'm happy to provide more detail if needed.
1500 keys should not be a problem if you use a bound array parameter to specify the keys:
SELECT user_id, name, gender
FROM table
WHERE user_id IN UNNEST(@users)
https://cloud.google.com/spanner/docs/sql-best-practices#write_efficient_queries_for_range_key_lookup
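For illustration, roughly how the bound array parameter could be passed from the Python client; the instance and database names here are placeholders, not from the question:

from google.cloud import spanner
from google.cloud.spanner_v1 import param_types

client = spanner.Client()
database = client.instance('my-instance').database('my-database')  # placeholder names

user_ids = ['user1', 'user2', 'user1500']  # up to ~1500 keys

with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        'SELECT user_id, name, gender FROM table WHERE user_id IN UNNEST(@users)',
        params={'users': user_ids},
        param_types={'users': param_types.Array(param_types.STRING)},
    )
    for row in results:
        print(row)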

Convert Django ORM query with large IN clause to table value constructor

I have a bit of Django code that builds a relatively complicated query in a programmatic fashion, with various filters getting applied to an initial dataset through a series of filter and exclude calls:
for filter in filters:
    if filter['name'] == 'revenue':
        accounts = accounts.filter(account_revenue__in=filter['values'])
    if filter['name'] == 'some_other_type':
        if filter['type'] == 'inclusion':
            accounts = accounts.filter(account__some_relation__in=filter['values'])
        if filter['type'] == 'exclusion':
            accounts = accounts.exclude(account__some_relation__in=filter['values'])
    # ...etc
return accounts
For most of these conditions, the possible values of the filters are relatively small and contained, so the IN clauses that Django's ORM generates are performant enough. However there are a few cases where the IN clauses can be much larger (10K - 100K items).
In plain postgres I can make this query much more optimal by using a table value constructor, e.g.:
SELECT domain
FROM accounts
INNER JOIN (
VALUES ('somedomain.com'), ('anotherdomain.com'), ...etc 10K more times
) VALS(v) ON accounts.domain=v
With a 30K+ IN clause in the original query it can take 60+ seconds to run, while the table value version of the query takes 1 second, a huge difference.
But I cannot figure out how to get Django ORM to build the query like I want, and because of the way all the other filters are constructed from ORM filters I can't really write the entire thing as raw SQL.
I was thinking I could get the raw SQL that Django's ORM is going to run, regexp parse it, but that seems very brittle (and surprisingly difficult to get the actual SQL that is about to be run, because of parameter handling etc). I don't see how I could annotate with RawSQL since I don't want to add a column to select, but instead want to add a join condition. Is there a simple solution I am missing?
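For reference, a minimal sketch of running the table value constructor form directly through Django's database connection, building one placeholder per value; the table and column names follow the example above, and this bypasses the ORM-built filters rather than integrating with them:

from django.db import connection

def domains_via_values(domains):
    # One '(%s)' placeholder per domain keeps everything parameterized.
    values_sql = ', '.join(['(%s)'] * len(domains))
    sql = (
        'SELECT accounts.domain FROM accounts '
        'INNER JOIN (VALUES ' + values_sql + ') AS vals(v) '
        'ON accounts.domain = vals.v'
    )
    with connection.cursor() as cursor:
        cursor.execute(sql, domains)
        return [row[0] for row in cursor.fetchall()]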

Django model count() with caching

I have a Django application with Apache Prometheus monitoring and a model called Sample.
I want to monitor the Sample.objects.count() metric
and cache this value for a set time interval
to avoid costly COUNT(*) queries on the database.
From this tutorial,
https://github.com/prometheus/client_python#custom-collectors
I read that I need to write a custom collector.
What is the best approach to achieve this?
Is there any way in Django to
get a cached Sample.objects.count() value and update it every K seconds?
I also use Redis in my application. Should I store this value there?
Should I make a separate thread to update the Sample.objects.count() cached value?
First thing to note is that you don't really need to cache the result of a count(*) query.
Though different RDBMSes handle count operations differently, they are slow across the board for large tables. One thing they have in common is that each provides an alternative to SELECT COUNT(*) which is in fact a cached result. Well, sort of.
You haven't mentioned what your RDBMS is, so let's see how it is in the popular ones used with Django.
mysql
Provided you have a primary key on your table and you are using MyISAM, SELECT COUNT(*) is really fast on mysql and scales well. But chances are that you are using Innodb, and that's the right storage engine for various reasons. Innodb is transaction aware and can't handle COUNT(*) as well as MyISAM, and the query slows down as the table grows.
The count query on a table with 2M records took 0.2317 seconds. The following query took 0.0015 seconds:
SELECT table_rows FROM information_schema.tables
WHERE table_name='for_count';
It reported a value of 1997289 instead of 2 million, but that's close enough!
So you don't need your own caching system.
Sqlite
Sqlite COUNT(*) queries aren't really slow, but they don't scale either. As the table size grows, the speed of the count query slows down. Using a table similar to the one used in mysql, SELECT COUNT(*) FROM for_count required 0.042 seconds to complete.
There isn't a shortcut. The sqlite_master table does not provide row counts, and neither does pragma table_info.
You need your own system to cache the result of SELECT COUNT(*)
Postgresql
Despite being the most feature-rich open source RDBMS, postgresql isn't good at handling count(*); it's slow and doesn't scale very well. In other words, no different from the poor relations!
The count query took 0.194 seconds on postgresql. On the other hand, the following query took 0.003 seconds:
SELECT reltuples FROM pg_class WHERE relname = 'for_count'
You don't need your own caching system.
SQL Server
The COUNT query on SQL server took 0.160 seconds on average but it fluctuated rather wildly. For all the databases discussed here the first count(*) query was rather slow but the subsequent queries were faster because the file was cached by the operating system.
I am not an expert on SQL server, so before answering this question I didn't know how to look up the row count using schema info. I found this Q&A helpful. One of the queries I tried produced the result in 0.004 seconds:
SELECT t.name, s.row_count from sys.tables t
JOIN sys.dm_db_partition_stats s
ON t.object_id = s.object_id
AND t.type_desc = 'USER_TABLE'
AND t.name ='for_count'
AND s.index_id = 1
You don't need your own caching system.
Integrate into Django
As can be seen, all databases considered except sqlite provide a built-in 'cached query count'. There isn't a need for us to create one of our own. It's a simple matter of creating a custom manager to make use of this functionality.
class CustomManager(models.Manager):
    def quick_count(self):
        from django.db import connection
        with connection.cursor() as cursor:
            cursor.execute("""SELECT table_rows FROM information_schema.tables
                              WHERE table_name='for_count'""")
            row = cursor.fetchone()
            return row[0]

class Sample(models.Model):
    ....
    objects = CustomManager()
The above example is for mysql (the information_schema query), but the same thing can be used for postgresql or sql server by simply swapping in the corresponding query listed above.
Prometheus
How to plug this into django prometheus? I leave that as an exercise.
A custom collector that returns the previous value if it's not too old, and fetches a fresh one otherwise, would be the way to go (a sketch follows). I'd keep it all in-process.
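A minimal sketch of such a collector with an in-process, time-based cache; the metric name, 60-second maximum age, and app path are assumptions for illustration:

import time
from prometheus_client.core import GaugeMetricFamily, REGISTRY

class SampleCountCollector:
    # Re-count at most once per max_age seconds; otherwise reuse the last value.
    def __init__(self, max_age=60):
        self.max_age = max_age
        self._value = None
        self._fetched_at = 0.0

    def collect(self):
        now = time.time()
        if self._value is None or now - self._fetched_at > self.max_age:
            from myapp.models import Sample  # hypothetical app path
            self._value = Sample.objects.count()
            self._fetched_at = now
        yield GaugeMetricFamily('sample_total', 'Number of Sample rows', value=self._value)

REGISTRY.register(SampleCountCollector())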
If you're using MySQL you might want to look at the collectors the mysqld_exporter offers, as there are some for table size that should be cheaper.

How to improve the performance of this django ORM query?

I'm using Django and I'm running a postgresql database with 2.1 million records. I have a complex query which takes 20 seconds to run, and it takes that long because inside the query there's an aggregate count() function, which ends up counting 1.5 million records. Having to wait 20 seconds is not acceptable for my application.
The django ORM "query" is as follows:
WebRequest.objects.values('FormUrl', 'Request__Platform','Request__Ip').annotate(total=Count('Request__Ip')).order_by('-total')[:10]
I tried using table indexes, but this hardly reduced the delay.
Now I'm considering saving the data in a table, and having the table regenerated every hour by pgadmin/cronjob/task scheduler, e.g.
drop table if exists tbl; select <columns> into tbl from <query>;
I do feel like this is a sloppy fix and assume there must be a better way to reduce the time.
Are there any better approaches or do you guys consider this to be an acceptable solution?
If you don't need an exact count, you could try using the postgresql statistics instead of running the count. Check here for a more detailed explanation: https://wiki.postgresql.org/wiki/Count_estimate
This would require using raw queries instead of the ORM, but that's often the way to go for performance-related issues (a sketch is below).
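For illustration, a rough sketch of reading that estimate through Django's connection; the helper name is made up, and the table name comes from the model's metadata:

from django.db import connection

def estimated_count(db_table):
    # Reads the planner's row estimate instead of scanning the table.
    with connection.cursor() as cursor:
        cursor.execute(
            'SELECT reltuples::bigint FROM pg_class WHERE relname = %s',
            [db_table],
        )
        return cursor.fetchone()[0]

# e.g. estimated_count(WebRequest._meta.db_table)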

Django does a useless join when filtering

I have an optional foreign key for a parameter:
class Club(models.Model):
    ...
    locationid = models.ForeignKey(location_models.Location, null=True)
    ...
I want to find entries of Club where this foreign key is not set. Here's the ORM query:
print Club.objects.filter(locationid=None).only('name').query
Produces
SELECT `club_club`.`id`, `club_club`.`name` FROM `club_club`
LEFT OUTER JOIN `location_location`
ON (`club_club`.`locationid_id` = `location_location`.`id`)
WHERE `location_location`.`id` IS NULL
The same query is produced when I do filter(locationid_id__isnull=True).
What I want is to query on locationid_id without involving a JOIN. I know I can write raw SQL, but is there an ORM-al way of doing this?
This seems to be quite a persistent issue, and the patch that fixed it had other side effects, so it never got applied to a release version of Django.
A solution to this is to use the extra method. This will require a small amount of raw SQL (quoted here in MySQL style to match the query output above):
location_null = '`%s`.`%s` IS NULL' % (Club._meta.db_table, Club.locationid.field.column)
Club.objects.extra(where=[location_null])
You can add this as a manager/queryset method for a more DRY solution (sketched below).
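A rough sketch of wrapping this in a custom QuerySet; the method name is made up, and the quoting matches the MySQL-style output above:

class ClubQuerySet(models.QuerySet):
    def without_location(self):
        # Same extra() trick, built from the model metadata so it lives in one place.
        column = '`%s`.`%s`' % (self.model._meta.db_table,
                                self.model._meta.get_field('locationid').column)
        return self.extra(where=['%s IS NULL' % column])

On the model, set objects = ClubQuerySet.as_manager(), after which Club.objects.without_location().only('name') behaves like the extra() call above.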
The other option is to just take the performance hit. This is what I would recommend, unless benchmarking shows that the performance hit really is unacceptable in your specific case.