Django model count() with caching

I have a Django application with Prometheus monitoring and a model called Sample.
I want to monitor a Sample.objects.count() metric
and cache that value for a fixed time interval
to avoid costly COUNT(*) queries against the database.
From this tutorial
https://github.com/prometheus/client_python#custom-collectors
I read that I need to write a custom collector.
What is the best approach to achieve this?
Is there any way in Django to
get a cached Sample.objects.count() value and refresh it every K seconds?
I also use Redis in my application. Should I store this value there?
Should I use a separate thread to update the cached Sample.objects.count() value?

The first thing to note is that you don't really need to cache the result of a COUNT(*) query yourself.
Though different RDBMSes handle count operations differently, they are slow across the board for large tables. But one thing most of them have in common is a built-in alternative to SELECT COUNT(*) that is, in effect, a cached result. Well, sort of.
You haven't mentioned which RDBMS you use, so let's see how it looks in the popular ones used with Django.
MySQL
Provided you have a primary key on your table and you are using MyISAM, SELECT COUNT(*) is really fast on MySQL and scales well. But chances are you are using InnoDB, and that's the right storage engine for various reasons. InnoDB is transaction aware and can't handle COUNT(*) as well as MyISAM, so the query slows down as the table grows.
The count query on a table with 2M records took 0.2317 seconds. The following query took 0.0015 seconds:
SELECT table_rows FROM information_schema.tables
WHERE table_name='for_count';
It reported a value of 1997289 instead of 2 million, but that's close enough!
So you don't need your own caching system.
SQLite
SQLite COUNT(*) queries aren't really slow, but they don't scale either: as the table grows, the count query slows down. Using a table similar to the one used for MySQL, SELECT COUNT(*) FROM for_count took 0.042 seconds to complete.
There isn't a shortcut. The sqlite_master table does not provide row counts, and neither does PRAGMA table_info.
You need your own system to cache the result of SELECT COUNT(*).
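For SQLite, or if you simply prefer an explicit cache, a minimal sketch using Django's cache framework could look like the following; the cache key and the 60-second timeout are illustrative, and the configured backend could just as well be the Redis instance the question already uses:

from django.core.cache import cache

from myapp.models import Sample   # hypothetical import path for the question's model


def cached_sample_count(timeout=60):
    # get_or_set runs Sample.objects.count() only when the key is missing or expired
    return cache.get_or_set('sample_count', Sample.objects.count, timeout)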
PostgreSQL
Despite being the most feature-rich open source RDBMS, PostgreSQL isn't good at handling COUNT(*); it's slow and doesn't scale very well. In other words, no different from the poor relations!
The count query took 0.194 seconds on PostgreSQL. On the other hand, the following query took 0.003 seconds:
SELECT reltuples FROM pg_class WHERE relname = 'for_count'
You don't need your own caching system.
SQL Server
The COUNT query on SQL Server took 0.160 seconds on average, but it fluctuated rather wildly. For all the databases discussed here, the first COUNT(*) query was rather slow, but subsequent queries were faster because the file was cached by the operating system.
I am not an expert on SQL Server, so before answering this question I didn't know how to look up the row count using schema info. I found this Q&A helpful. One of the queries I tried produced the result in 0.004 seconds:
SELECT t.name, s.row_count FROM sys.tables t
JOIN sys.dm_db_partition_stats s
ON t.object_id = s.object_id
AND t.type_desc = 'USER_TABLE'
AND t.name = 'for_count'
AND s.index_id = 1
You don't need your own caching system.
Integrate into Django
As can be seen, all the databases considered except SQLite provide a built-in 'cached query count', so there is no need to create one of our own. It's a simple matter of creating a custom manager to make use of this functionality.
class CustomManager(models.Manager):

    def quick_count(self):
        from django.db import connection
        with connection.cursor() as cursor:
            cursor.execute("""SELECT table_rows FROM information_schema.tables
                              WHERE table_name='for_count'""")
            row = cursor.fetchone()
            return row[0]


class Sample(models.Model):
    ...
    objects = CustomManager()
The above example is for MySQL, but the same approach works for PostgreSQL or SQL Server by simply swapping in the corresponding query from those listed above.
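With the manager in place, the approximate count is then available through the model, e.g.:

count = Sample.objects.quick_count()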
Prometheus
How to plug this into django prometheus? I leave that as an exercise.

A custom collector that returns the previous value if it's not too old, and fetches a fresh one otherwise, would be the way to go. I'd keep it all in-process.
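As a rough sketch of that in-process approach, using the custom-collector API from the prometheus_client tutorial linked in the question (the metric name, import path and 60-second maximum age are illustrative choices):

import time

from prometheus_client.core import GaugeMetricFamily, REGISTRY

from myapp.models import Sample   # hypothetical import path for the question's model


class SampleCountCollector:
    """Re-runs the count only when the cached value is older than max_age_seconds."""

    def __init__(self, max_age_seconds=60):
        self.max_age_seconds = max_age_seconds
        self._count = None
        self._fetched_at = 0.0

    def collect(self):
        if self._count is None or time.time() - self._fetched_at > self.max_age_seconds:
            self._count = Sample.objects.count()   # or Sample.objects.quick_count() from above
            self._fetched_at = time.time()
        yield GaugeMetricFamily(
            'django_sample_rows', 'Cached Sample.objects.count()', value=self._count)


REGISTRY.register(SampleCountCollector())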
If you're using MySQL, you might want to look at the collectors the mysqld_exporter offers, as there are some for table size that should be cheaper.

Related

How to temporarily disable Django indexes (for SQLite)

I'm trying to create a large SQLite database from around 500 smaller databases (each 50-200MB) to put into Django, and would like to speed up this process. I'm doing this via a custom command.
This answer helped me a lot, reducing the processing time to around a minute per smaller database. However, it's still quite a long time.
The one thing from that answer I haven't done is disable database indexing in Django and re-create the indexes afterwards. I think this matters for me, as my database has a few tables with many rows.
Is there a way to do that in Django while it's running live? If not in Django, perhaps there's some SQLite query to remove all the indexes and re-create them after I insert my records?
I just used raw SQL to remove the indexes and re-create them. This improved the time to create a big database from two of my small databases from 1:46 to 1:30, which is quite significant. It also reduced the size from 341.7MB to 321.1MB.
# Delete all indexes for faster database creation
with connection.cursor() as cursor:
    cursor.execute(f'SELECT name, sql FROM sqlite_master WHERE name LIKE "{app_label}_%" AND type == "index"')
    indexes = cursor.fetchall()
    names, create_sqls = zip(*indexes)
    for name in names:
        cursor.execute(f'DROP INDEX {name}')
After creating the databases, re-create the indexes:
# Re-create indexes
with connection.cursor() as cursor:
    for create_sql in create_sqls:
        cursor.execute(create_sql)

How to improve the performance of this django ORM query?

I'm using Django with a PostgreSQL database of 2.1 million records. I have a complex query which takes 20 seconds to run, and it takes that long because inside the query there's an aggregate Count() which ends up counting 1.5 million records. Having to wait 20 seconds is not acceptable for my application.
The Django ORM "query" is as follows:
WebRequest.objects.values('FormUrl', 'Request__Platform','Request__Ip').annotate(total=Count('Request__Ip')).order_by('-total')[:10]
I tried using table indexes, but this hardly reduced the delay.
Now I'm considering saving the data in a separate table and having that table regenerated every hour by pgAdmin/cron job/task scheduler, e.g.
drop table if exists <table> tbl; select into <table> tbl from query;
I do feel like this is a sloppy fix and assume there must be a better way to reduce the time.
Are there any better approaches or do you guys consider this to be an acceptable solution?
If you don't need an exact count, you could try using the PostgreSQL statistics instead of running the count. Check here for a more detailed explanation: https://wiki.postgresql.org/wiki/Count_estimate
This would require using raw queries instead of the ORM, but that's the way to go for a lot of performance-related issues.
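As a rough sketch of what that wiki page describes (assuming PostgreSQL and Django's default connection; the helper and table names are illustrative), you can ask the planner for its row estimate via EXPLAIN instead of executing the count:

import re

from django.db import connection


def estimated_rows(sql, params=None):
    """Return the planner's row estimate for `sql` instead of an exact count."""
    with connection.cursor() as cursor:
        cursor.execute("EXPLAIN " + sql, params)
        first_plan_line = cursor.fetchone()[0]
    match = re.search(r"rows=(\d+)", first_plan_line)
    return int(match.group(1)) if match else None


# e.g. estimated_rows('SELECT * FROM "app_webrequest" WHERE ...')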

What is the best way to do intense read only queries in Django

We have a really big application in Django which uses a Postgres database. We want to build an analytics module.
This module uses a base query e.g.
someFoo = SomeFoo.objects.all() # Around 100000 objects returned.
Then it slices and dices this data, i.e.
someFoo.filter(Q(creator=owner) | Q(moderated=False))
These queries will be very intense, and since this will be an analytics and reporting dashboard, the queries will hit the database very hard.
What is the best way to handle complex queries in such conditions, i.e. when you have a base query that will be sliced and diced very often over a short span of time and never used again?
A few possible solutions that we have thought of are:
A read-only database and a write-only database (a minimal router sketch follows this list).
Writing raw SQL queries and using them, as the Django ORM can be quite inefficient for certain types of queries.
Caching heavily (we have not thought about or done any research on this).
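As a minimal sketch of the first option (assuming a second 'replica' entry in settings.DATABASES that points at a read-only copy; the alias and module path are illustrative), a Django database router can send reads to the replica while writes stay on the default database:

class ReadReplicaRouter:
    """Send reads to the 'replica' alias and writes to 'default'."""

    def db_for_read(self, model, **hints):
        return 'replica'

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        return True


# settings.py (hypothetical module path):
# DATABASE_ROUTERS = ['myproject.routers.ReadReplicaRouter']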
Edit: an example query
select sport."sportName", sport.id, pop.name, analytics_query.loc_id, "new count"
from "SomeFoo_sportpop" as sportpop
join "SomeFoo_pop" as pop on (sportpop.pop_id = pop.id)
join "SomeFoo_sport" as sport on (sportpop.sport_id = sport.id)
join (
    select ref.catcher_pop_id as loc_id,
           (select count(*) from "SomeFoo_pref"
            where catcher_pop_id = ref.catcher_pop_id and status = 'pending' and exists = True) as "new count"
    from "SomeFoo_pref" as ref
    where ref.exists = TRUE and ref.catcher_pop_id is not NULL
    group by ref.catcher_pop_id
) as analytics_query on (sportpop.pop_id = analytics_query.loc_id)
order by sport."sportName", pop.name asc
This is an example of a raw SQL query we are planning to make, and it's going to have a lot of WHERE clauses and GROUP BYs. Basically, we are going to slice and dice the base query a lot.
Is there any other possible solution or method that you can point us to. Any help is highly appreciated.
I can only think of prepared statements and a faster server, maybe on Linux...

Cassandra: How to query the complete data set?

My table has 77k entries (and the number of entries keeps increasing at a high rate), and I need to make a select query in CQL 3. When I do select count(*) ... where (some_conditions) allow filtering I get:
count
-------
10000
(1 rows)
Default LIMIT of 10000 was used. Specify your own LIMIT clause to get more results.
Let's say 23k rows satisfy some_condition. The 10000 count above is of the first 10k of those 23k rows, right? But how do I get the actual count?
More importantly, how do I get access to all of these 23k rows, so that my Python API can perform some in-memory operations on the data in some of their columns? Is there some sort of pagination principle in Cassandra CQL 3?
I know I can just increase the limit to a very large number, but that's not efficient.
Working Hard is right, and LIMIT is probably what you want. But if you want to "page" through your results at a more detailed level, read through this DataStax document titled: Paging through unordered partitioner results.
This will involve using the token function on your partitioning key. If you want more detailed help than that, you'll have to post your schema.
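As a rough sketch of that token-based paging (assuming the cassandra-driver Session from the other answer, a table named my_table and a single partition-key column id, all illustrative):

last_id = None
while True:
    if last_id is None:
        rows = list(session.execute("SELECT id FROM my_table LIMIT 1000"))
    else:
        rows = list(session.execute(
            "SELECT id FROM my_table WHERE token(id) > token(%s) LIMIT 1000",
            [last_id]))
    if not rows:
        break
    # ... process this page of rows in memory ...
    last_id = rows[-1].id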
While I cannot see your complete table schema, by virtue of the fact that you are using ALLOW FILTERING I can tell that you are doing something wrong. Cassandra was not designed to serve data based on multiple secondary indexes. That approach may work with an RDBMS, but over time that query will get really slow. You should really design a column family (table) to suit each query you intend to use frequently. ALLOW FILTERING is not a long-term solution and should never be used in a production system.
You just have to specify a LIMIT with your query.
Assuming your table contains fewer than 100,000 (one lakh) records, executing the query below will give you the actual count of the records in the table:
select count(*) ... where (some_conditions) allow filtering limit 100000;
Another way is to write Python code; cqlsh itself is in fact a Python script.
Use:
# `session` is an already connected cassandra-driver Session
statement = "SELECT * FROM SOME_TABLE"   # fetch the rows themselves rather than COUNT(*)
future = session.execute_async(statement)
rows = future.result()
count = 0
for row in rows:   # iterating the result set transparently fetches further pages
    count = count + 1
The above relies on the Cassandra Python driver's paging ("page query") feature.

Accelerate bulk insert using Django's ORM?

I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using django's ORM.
Currently each file takes ~20min to process, and I was wondering if there's any way to accelerate this process.
I've taken the following measures:
Use @transaction.commit_manually and commit once every 5000 records
Set DEBUG=False so that django won't accumulate all the sql commands in memory
The loop that runs over records in a single file is completely contained in a single function (minimize stack changes)
Refrained from hitting the db for queries (used a local hash of objects already in the db instead of using get_or_create)
Set force_insert=True in the save() in hopes it will save django some logic
Explicitly set the id in hopes it will save django some logic
General code minimization and optimization
What else can I do to speed things up? Here are some of my thoughts:
Use some kind of Python compiler or version which is quicker (Psyco?)
Override the ORM and use SQL directly
Use some 3rd party code that might be better (1, 2)
Beg the django community to create a bulk_insert function
Any pointers regarding these items or any other idea would be welcome :)
Django 1.4 provides a bulk_create() method on the QuerySet object, see:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
https://docs.djangoproject.com/en/dev/releases/1.4/
https://code.djangoproject.com/ticket/7596
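A minimal usage sketch (the model and the shape of the input records are illustrative): build the instances in memory in batches and hand each batch to bulk_create(), so every batch becomes a single INSERT instead of one query per row:

from itertools import islice

from myapp.models import MyModel   # hypothetical model


def bulk_load(records, batch_size=5000):
    """records is an iterable of dicts holding the model's field values."""
    records = iter(records)
    while True:
        batch = [MyModel(**fields) for fields in islice(records, batch_size)]
        if not batch:
            break
        MyModel.objects.bulk_create(batch)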
This is not specific to the Django ORM, but recently I had to bulk insert >60 million rows of 8 columns of data from over 2000 files into a sqlite3 database. And I learned that the following three things reduced the insert time from over 48 hours to about 1 hour:
increase the cache size setting of your DB so it uses more RAM (the default is always very small; I used 3GB); in sqlite, this is done by PRAGMA cache_size = n_of_pages (see the sketch after this list);
do the journalling in RAM instead of on disk (this does cause a slight problem if the system fails, but something I consider negligible given that you already have the source data on disk); in sqlite this is done by PRAGMA journal_mode = MEMORY;
last and perhaps most important: do not build indexes while inserting. This also means not declaring UNIQUE or other constraints that cause the DB to build an index. Build indexes only after you are done inserting.
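A small sketch of the two PRAGMA settings above, applied on a plain sqlite3 connection before the bulk inserts (the database path and cache size are illustrative):

import sqlite3

conn = sqlite3.connect('bulk_load.db')          # hypothetical database file
# A negative cache_size is interpreted as KiB; roughly 3 GB here.
conn.execute('PRAGMA cache_size = -3000000')
# Keep the rollback journal in memory instead of on disk.
conn.execute('PRAGMA journal_mode = MEMORY')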
As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:
cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
The iterable_data could be a list or something similar, or even an open file reader.
Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:
Insert 3000 rows individually and get ids from populated objects using Django ORM: 3200ms
Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) and don't get IDs: 574ms
Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
from io import StringIO


def bulk_to_sql(df, columns, model_cls):
    """ Inserting 3000 takes 774ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)


def bulk_via_csv(df, columns, model_cls):
    """ Inserting 3000 takes 118ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    cursor = connection.cursor()
    output = StringIO()
    # Dump the frame as tab-separated text and stream it straight into COPY
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cursor.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cursor.close()
The performance stats were all obtained on a table already containing 3,000 rows running on OS X (i7 SSD 16GB), average of ten runs using timeit.
I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
Update 2020:
I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all the inserts into a single multi-row INSERT statement. Surprisingly, performance was worse than the single-row version, whether for Pandas 0.23, 0.24 or 1.1. Pandas' single-row inserts were also faster than a multi-row INSERT statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv with cursor.copy_from was still around 38% faster than the fastest alternative, which was single-row df.to_sql, and bulk_import was occasionally comparable but often slower still (up to double the time, Django 2.2).
There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.
This gives one INSERT command multiple value pairs (INSERT INTO x (val1, val2) VALUES (1,2), (3,4), etc.). This should greatly improve performance.
It also appears to be heavily documented, which is always a plus.
Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.
The other snippet wasn't as simple and was really focused on bulk inserts for relationships. This is just a plain bulk insert and just uses the same INSERT query.
Development Django has bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create