GIN trigram index not used when case-insensitive query is performed - django

When I query the table using trigram_similar or contains the index is used, as expected.
When I query the same table using icontains, Django performs a sequential scan using UPPER.
The difference is 3ms vs 240ms.
Should I create a lowercase index and match with contains? (How could this be done?)
Should I create a field where all the contents will be lower cased and index that field?
Something else?
The model:
class Name(models.Model):
name_en = models.CharField(max_length=127)
...
class Meta:
indexes = [
GinIndex(
name="name_en_gin_trigram",
fields=["name_en"],
opclasses=["gin_trgm_ops"],
)
]
The query that uses the index:
>>> Name.objects.filter(
Q(name_en__contains='eeth')
| Q(name_en__trigram_similar='eeth')
)
SELECT *
FROM "shop_name"
WHERE ("shop_name"."name_en"::text LIKE '%eeth%' OR "shop_name"."name_en" % 'eeth')
LIMIT 21;
The resulting query plan:
Limit (cost=64.06..90.08 rows=7 width=121) (actual time=0.447..2.456 rows=14 loops=1)
-> Bitmap Heap Scan on shop_name (cost=64.06..90.08 rows=7 width=121) (actual time=0.443..2.411 rows=14 loops=1)
Recheck Cond: (((name_en)::text ~~ '%eeth%'::text) OR ((name_en)::text % 'eeth'::text))
Rows Removed by Index Recheck: 236
Heap Blocks: exact=206
-> BitmapOr (cost=64.06..64.06 rows=7 width=0) (actual time=0.371..0.378 rows=0 loops=1)
-> Bitmap Index Scan on name_en_gin_trigram (cost=0.00..20.03 rows=4 width=0) (actual time=0.048..0.049 rows=15 loops=1)
Index Cond: ((name_en)::text ~~ '%eeth%'::text)
-> Bitmap Index Scan on name_en_gin_trigram (cost=0.00..44.03 rows=4 width=0) (actual time=0.318..0.320 rows=250 loops=1)
Index Cond: ((name_en)::text % 'eeth'::text)
Planning Time: 0.793 ms
Execution Time: 2.531 ms
(12 rows)
If I use icontains the index is not used:
>>> Name.objects.filter(
Q(name_en__icontains='eeth')
| Q(name_en__trigram_similar='eeth')
)
SELECT *
FROM "shop_name"
WHERE (UPPER("shop_name"."name_en"::text) LIKE UPPER('%eeth%') OR "shop_name"."name_en" % 'eeth')
LIMIT 21;
The resulting query plan:
Limit (cost=0.00..95.61 rows=21 width=121) (actual time=10.513..244.244 rows=14 loops=1)
-> Seq Scan on shop_name (cost=0.00..1356.79 rows=298 width=121) (actual time=10.509..244.195 rows=14 loops=1)
Filter: ((upper((name_en)::text) ~~ '%EETH%'::text) OR ((name_en)::text % 'eeth'::text))
Rows Removed by Filter: 36774
Planning Time: 0.740 ms
Execution Time: 244.299 ms
(6 rows)

Django runs icontains with UPPER(), and we can address this by making the index also UPPER():
CREATE INDEX upper_col_name_gin_idx ON table_name USING GIN (UPPER(col_name) gin_trgm_ops)
Django will then run WHERE UPPER("table_name"."col_name"::text) LIKE UPPER('%term%'), using this index.

Update: This approach will not work as expected. However, the mechanics can be used to address the accepted approach.
Should I create a lowercase index and match with contains? (How could this be done?)
From Django-3.2:
Positional argument *expressions allows creating functional indexes on
expressions and database functions.
For example:
Index(Lower('title').desc(), 'pub_date', name='lower_title_date_idx')
creates an index on the lowercased value of the title field in
descending order and the pub_date field in the default ascending
order.
It sounds like some quality Django music!
The code used to accomplish the above, is the following:
migrations/0001_initial.py:
'''
A fake migration used to install the necessary extensions.
It should be followed by
./manage.py makemigrations && ./manage.py migrate
'''
from django.contrib.postgres.operations import (
BtreeGinExtension,
TrigramExtension,
)
from django.db import migrations
class Migration(migrations.Migration):
dependencies = []
operations = [
BtreeGinExtension(),
TrigramExtension(),
]
models.py:
from django.contrib.postgres.indexes import GinIndex, OpClass
from django.db import models
from django.db.models.functions import Lower
class Name(models.Model):
name_en = models.CharField(max_length=127)
...
class Meta:
indexes = [
GinIndex(
OpClass(Lower("name_en"), name="gin_trgm_ops"),
name="name_en_gin_trigram_lowercase",
),
]
OpClass is used to avoid the error:
ValueError: Index.opclasses cannot be used with expressions. Use django.contrib.postgres.indexes.OpClass() instead.

Related

Slow spatial join on a single table with PostGIS

My goal is to calculate if a building have at least one shared wall with a building of another estate. I used a PostGIS query to do so but it is really slow. I have tweaked this for two weeks with some success but no breakthrough.
I have two tables:
Estate (a piece of land)
CREATE TABLE IF NOT EXISTS public.front_estate
(
id integer NOT NULL DEFAULT nextval('front_estate_id_seq'::regclass),
perimeter geometry(Polygon,4326),
CONSTRAINT front_estate_pkey PRIMARY KEY (id),
)
CREATE INDEX IF NOT EXISTS front_estate_perimeter_idx
ON public.front_estate USING spgist
(perimeter);
Building
CREATE TABLE IF NOT EXISTS public.front_building
(
id integer NOT NULL DEFAULT nextval('front_building_id_seq'::regclass),
type character varying(255) COLLATE pg_catalog."default",
footprint integer,
polygon geometry(Polygon,4326),
shared_wall integer,
CONSTRAINT front_building_pkey PRIMARY KEY (id)
)
CREATE INDEX IF NOT EXISTS front_building_polygon_idx
ON public.front_building USING spgist
(polygon)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS front_building_type_124fcf82
ON public.front_building USING btree
(type COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS front_building_type_124fcf82_like
ON public.front_building USING btree
(type COLLATE pg_catalog."default" varchar_pattern_ops ASC NULLS LAST)
TABLESPACE pg_default;
The m2m relation:
CREATE TABLE IF NOT EXISTS public.front_estate_buildings
(
id integer NOT NULL DEFAULT nextval('front_estate_buildings_id_seq'::regclass),
estate_id integer NOT NULL,
building_id integer NOT NULL,
CONSTRAINT front_estate_buildings_pkey PRIMARY KEY (id),
CONSTRAINT front_estate_buildings_estate_id_building_id_863b3358_uniq UNIQUE (estate_id, building_id),
CONSTRAINT front_estate_buildin_building_id_fc5c4235_fk_front_bui FOREIGN KEY (building_id)
REFERENCES public.front_building (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT front_estate_buildings_estate_id_2c28ec2a_fk_front_estate_id FOREIGN KEY (estate_id)
REFERENCES public.front_estate (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
DEFERRABLE INITIALLY DEFERRED
)
CREATE INDEX IF NOT EXISTS front_estate_buildings_building_id_fc5c4235
ON public.front_estate_buildings USING btree
(building_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS front_estate_buildings_estate_id_2c28ec2a
ON public.front_estate_buildings USING btree
(estate_id ASC NULLS LAST)
TABLESPACE pg_default;
To have a shared wall a building must touch another building which is not the same estate.
The final data set will have around 100 millions rows. Right now my developpement building table has 2 millions rows.
Here is the query I used to get all relations between buildings and estates:
SELECT b.id as b_id, rel.estate_id as e_id, swb.id as swb_id, sw_rel.estate_id as swe_id
FROM front_building b
JOIN front_building swb ON swb.id < b.id AND ST_Intersects(b.polygon, swb.polygon)
JOIN front_estate_buildings rel ON rel.building_id = b.id
JOIN front_estate_buildings sw_rel ON sw_rel.building_id = swb.id
ORDER BY b.id ASC;
Here is the EXPLAIN ANALYZE given by pgAdmin:
1. Limit (rows=500 loops=1)
2. Nested Loop Inner Join (rows=500 loops=1)
3. Nested Loop Inner Join (rows=695 loops=1)
4. Nested Loop Inner Join (rows=2985 loops=1)
5. Index Scan using front_estate_buildings_building_id_fc5c4235 on front_estate_buildings as rel (rows=2985 loops=1) 2985 1
6. Memoize (rows=1 loops=2985)
Buckets: Batches: Memory Usage: 715 kB
7. Index Scan using front_building_pkey on front_building as b (rows=1 loops=2751)
Index Cond: (id = rel.building_id)
8. Index Scan using front_building_polygon_idx on front_building as swb (rows=0 loops=2985)
Filter: ((id < b.id) AND st_intersects(b.polygon, polygon))
Index Cond: (polygon && b.polygon)
Rows Removed by Filter: 2
9. Index Scan using front_estate_buildings_building_id_fc5c4235 on front_estate_buildings as sw_rel (rows=1 loops=695)
Index Cond: (building_id = swb.id)
On my dev machine (MBP M1 - 16GB RAM) it completes in 20 minutes for 2M rows which is not good but is ok. On my production machine (Linode - 8 CPU Cores - 16 GB RAM) the CPU goes throufh the roof (continuous 250% capacity) and the query seems to never end.
Do you have any clue on how to proceed ? Change the query ? The db struct ? Use multiprocessing ?

Django PostgreSQL double index cleanup

We've got this table in our database with 80GB of data and 230GB of Indexes. We are constrained on our disk which is already maxed out.
What bothers me is we have two indexes that look pretty darn similar
CREATE INDEX tracks_trackpoint_id ON tracks_trackpoint USING btree (id)
CREATE UNIQUE INDEX tracks_trackpoint_pkey ON tracks_trackpoint USING btree (id)
I have no idea what's the history behind this, but the first one seems quite redundant. What could be the risk of dropping it ? This would buy us one year of storage.
You can drop the first index, it is totally redundant.
If your tables are 80GB and your indexes 230GB, I am ready to bet that you have too many indexes in your database.
Drop the indexes that are not used.
Careful as I am, I disabled the index to benchmark this, and the query seems to fallback nicely on the other index. I'll try a few variants.
appdb=# EXPLAIN analyze SELECT * FROM tracks_trackpoint where id=266082;
Index Scan using tracks_trackpoint_id on tracks_trackpoint (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: (id = 266082)
Total runtime: 0.040 ms
(3 rows)
appdb=# UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'tracks_trackpoint_id'::regclass;
appdb=# EXPLAIN analyze SELECT * FROM tracks_trackpoint where id=266082;
Index Scan using tracks_trackpoint_pkey on tracks_trackpoint (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: (id = 266082)
Total runtime: 0.036 ms
(3 rows)

Query to retrieve neighbors is too slow

I have a set of Django ORM models representing a directed graph, and I'm trying to retrieve all the adjacent vertices to a given vertex ignoring edge direction:
class Vertex(models.Model):
pass
class Edge(models.Model):
orig = models.ForeignKey(Vertex, related_name='%(class)s_orig', null=True, blank=True)
dest = models.ForeignKey(Vertex, related_name='%(class)s_dest', null=True, blank=True)
# ... other data about this edge ...
The query Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct() returns the correct result, but in my case it takes far too long to execute.
Typically for my application there will be around 50-100 vertices at any given time, and around a million edges. Even reducing it to only 20 vertices and 100000 edges, that query takes about a minute and a half to execute:
for i in range(20):
Vertex().save()
vxs = list(Vertex.objects.all())
for i in tqdm.tqdm(range(100000)):
Edge(orig = random.sample(vxs,1)[0], dest = random.sample(vxs,1)[0]).save()
v = vxs[0]
def f1():
return list( Vertex.objects.filter(
Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct() )
t1 = timeit.Timer(f1)
print( t1.timeit(number=1) ) # 84.21138522100227
On the other hand, if I split the query up into two pieces I can get the exact same result in only a handful of milliseconds:
def f2():
q1 = Vertex.objects.filter(edge_orig__dest=v).distinct()
q2 = Vertex.objects.filter(edge_dest__orig=v).distinct()
return list( {x for x in itertools.chain(q1, q2)} )
t2 = timeit.Timer(f2)
print( t2.timeit(number=100)/100 ) # 0.0109818680600074
This second version has issues though:
It's not atomic. The list of edges is almost guaranteed to change between the two queries in my application, meaning the results won't represent a single point in time.
I can't perform additional processing and aggregation on the results without manually looping over it. (e.g. If I wanted Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct().aggregate(avg=Avg('some_field')))
Why does the second version run so much faster than the first one?
How can I do this, and is there a way to get the first one to run fast enough for practical use?
I'm currently testing with Python 3.5.2, PostgreSQL 9.5.6, and Django 1.11, although if this is an issue with one of those I'm not stuck with them.
Here is the SQL generated by each query, as well as Postgres's explan:
The first one:
Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v))
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
LEFT OUTER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
LEFT OUTER JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061)
HashAggregate (cost=8275151.47..8275151.67 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Left Join (cost=3183.45..8154147.45 rows=48401608 width=4)
Hash Cond: (app_vertex.id = app_edge.orig_id)
Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
-> Hash Right Join (cost=1.45..2917.45 rows=100000 width=8)
Hash Cond: (t4.dest_id = app_vertex.id)
-> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
-> Hash (cost=1541.00..1541.00 rows=100000 width=8)
-> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)
The second ones:
Vertex.objects.filter(edge_orig__dest=v).distinct()
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
INNER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
WHERE "app_edge"."dest_id" = 1061
HashAggregate (cost=1531.42..1531.62 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Join (cost=848.11..1519.04 rows=4950 width=4)
Hash Cond: (app_edge.orig_id = app_vertex.id)
-> Bitmap Heap Scan on app_edge (cost=846.65..1449.53 rows=4950 width=4)
Recheck Cond: (dest_id = 1061)
-> Bitmap Index Scan on app_edge_dest_id_4254b90f (cost=0.00..845.42 rows=4950 width=0)
Index Cond: (dest_id = 1061)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
#khampson's version also takes a minute-and-a-half to run, so it's also a no-go.
Vertex.objects.raw( ... )
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061);
HashAggregate (cost=8275347.47..8275347.67 rows=20 width=4)
Group Key: app_vertex.id
-> Hash Join (cost=3183.45..8154343.45 rows=48401608 width=4)
Hash Cond: (app_vertex.id = app_edge.orig_id)
Join Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
-> Hash Join (cost=1.45..2917.45 rows=100000 width=12)
Hash Cond: (t4.dest_id = app_vertex.id)
-> Seq Scan on app_edge t4 (cost=0.00..1541.00 rows=100000 width=8)
-> Hash (cost=1.20..1.20 rows=20 width=4)
-> Seq Scan on app_vertex (cost=0.00..1.20 rows=20 width=4)
-> Hash (cost=1541.00..1541.00 rows=100000 width=8)
-> Seq Scan on app_edge (cost=0.00..1541.00 rows=100000 width=8)
The query plans for those two queries are radically different. The first (slower) one isn't hitting any indexes, and is doing two left joins, both of which result in way, way more rows being processed and returned. From what I interpret of the intention of the Django ORM syntax, it doesn't sound like you would truly want to do left joins here.
I would recommend considering dropping down into raw SQL in this case from within the Django ORM, and hybridize the two. e.g. if you take the first one, and transform it to something like this:
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061);
Two questions there: How does that version perform, and does it give you the results you're looking for?
For more info on raw queries, check out this section of the Django doc.
Response to comment from OP:
The query plan for the query I suggested also shows that it's not hitting any indexes.
Do you have indexes on both tables for the columns involved? I suspect not, especially since for this specific query, we're looking for a single value, which means if there were indexes, I would be very surprised if the query planner determined a sequential scan were a better choice (OTOH, if you were looking for a wide range of rows, say, over 10% of the rows in the tables, the query planner might correctly make such a decision).
I propose another query could be:
# Get edges which contain Vertex v, "only" optimizes fields returned
edges = Edge.objects.filter(Q(orig=v) | Q(dest=v)).only('orig_id', 'dest_id')
# Get set of vertex id's to discard duplicates
vertex_ids = {*edges.values_list('orig_id', flat=True), *edges_values_list('dest_id', flat=True)}
# Get list of vertices, excluding the original vertex
vertices = Vertex.objects.filter(pk__in=vertex_ids).exclude(pk=v.pk)
This shouldn't require any joins and shouldn't suffer from the race conditions you mention.

Improving query speed: simple SELECT with LIKE

I have inherited a large legacy codebase which runs in django 1.5 and my current task is to speed up a section of the site which takes ~1min to load.
I did a profile of the app and got this:
The culprit in particular is the following query (stripped for brevity):
SELECT COUNT(*) FROM "entities_entity" WHERE (
"entities_entity"."date_filed" <= '2016-01-21' AND (
UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
-- 34 more of these
UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
-- 34 more of these
)
)
which basically consist on a big like query on two fields, entity_city_state_zip and agent_city_state_zip which are character varying(200) | not null fields.
That query is performed twice (!), taking 18814.02ms each time, and one more time replacing the COUNT for a SELECT taking up an extra 20216.49 (I'm going to cache the result of the COUNT)
The explain looks like this:
Aggregate (cost=175867.33..175867.34 rows=1 width=0) (actual time=17841.502..17841.502 rows=1 loops=1)
-> Seq Scan on entities_entity (cost=0.00..175858.95 rows=3351 width=0) (actual time=0.849..17818.551 rows=145075 loops=1)
Filter: ((date_filed <= '2016-01-21'::date) AND ((upper((entity_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((entity_city_state_zip)::text) ~~ '%BERKELEY%'::text) (..skipped..) OR (upper((agent_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BERKELEY%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BURLINGAME%'::text) ))
Rows Removed by Filter: 310249
Planning time: 2.110 ms
Execution time: 17841.944 ms
I've tried using an index on entity_city_state_zip and agent_city_state_zip using various combinations like:
CREATE INDEX ON entities_entity (upper(entity_city_state_zip));
CREATE INDEX ON entities_entity (upper(agent_city_state_zip));
or using varchar_pattern_ops, with no luck.
The server is using something like this:
qs = queryset.filter(Q(entity_city_state_zip__icontains = all_city_list) |
Q(agent_city_state_zip__icontains = all_city_list))
to generate that query.
I don't know what else to try,
Thanks!
I think problem in "multiple LIKE" and in UPPER("entities_entity ...
You can use:
WHERE entities_entity.entity_city_state_zip SIMILAR TO '%Atherton%|%Berkeley%'
Or something like this:
WHERE entities_entity.entity_city_state_zip LIKE ANY(ARRAY['%Atherton%', '%Berkeley%'])
Edited
About Raw SQL query in Django:
https://docs.djangoproject.com/es/1.9/topics/db/sql/
How do I execute raw SQL in a django migration
Regards
I watched a course in Pluralsight that addressed a very similar issue. The course was "Postgres for .NET Developers" and this was in the section "Fun With Simple SQL", "Full Text Search."
To summarize their solution, using your example:
Create a new column in your table that will represent your entity_city_state_zip as a tsvector:
create table entities_entity (
date_filed date,
entity_city_state_zip text,
csz_search tsvector not null -- add this column
);
Initially you might have to make it nullable, then populate the data and make it non-nullable.
update entities_entity
set csz_search = to_tsvector (entity_city_state_zip);
Next, create a trigger that will cause the new field to be populated any time a record is added or modified:
create trigger entities_insert_update
before insert or update on entities_entity
for each row execute procedure
tsvector_update_trigger(csz_search,'pg_catalog.english',entity_city_state_zip);
Your search queries can now query on the tsvector field rather than the city/state/zip field:
select * from entities_entity
where csz_search ## to_tsquery('Atherton')
Some notes of interest on this:
to_tsquery, in case you haven't used it is WAY more sophisticated than the example above. It allows and conditions, partial matches, etc
it is also case-insensitive, so there is no need to do the upper functions you have in your query
As a final step, put a GIN index on the tsquery field:
create index entities_entity_ix1 on entities_entity
using gin(csz_search);
If I understand the course right, this should make your query fly, and it will overcome the issue of a btree index's inability to work on a like '% query.
Here is the explain plan on such a query:
Bitmap Heap Scan on entities_entity (cost=56.16..1204.78 rows=505 width=81)
Recheck Cond: (csz_search ## to_tsquery('Atherton'::text))
-> Bitmap Index Scan on entities_entity_ix1 (cost=0.00..56.04 rows=505 width=0)
Index Cond: (csz_search ## to_tsquery('Atherton'::text))

Slow PostgreSQL query not using index

I have a simple Django site, using a PostgreSQL 9.3 database, with a single table storing user accounts (e.g. name, email, address, phone, active, etc). However, my user model is fairly large, and has around 2.6 million records. I noticed Django's admin was a little slow, so using django-debug-toolbar, I noticed that almost all queries ran in under 1 ms, except for:
SELECT COUNT(*) FROM "myapp_myuser" WHERE "myapp_myuser"."active" = true;
which took about 7000 ms. However, the active column is indexed using Django's standard db_index=True, which generates the index:
CREATE INDEX myapp_myuser_active
ON myapp_myuser
USING btree
(active);
Checking out the query with EXPLAIN via:
EXPLAIN ANALYZE VERBOSE
SELECT COUNT(*) FROM "myapp_myuser" WHERE "myapp_myuser"."active" = true;
returns:
Aggregate (cost=109305.45..109305.46 rows=1 width=0) (actual time=7342.973..7342.974 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.myapp_myuser (cost=0.00..102638.16 rows=2666916 width=0) (actual time=0.035..4765.059 rows=2666337 loops=1)
Output: id, created, category_id, name, email, address_1, address_2, city, active, (...)
Filter: myapp_myuser.active
Total runtime: 7343.031 ms
It appears to not be using the index at all. Am I reading this right?
Running just SELECT COUNT(*) FROM "myapp_myuser" completed in about 500 ms. Why such a disparity in run times, even though the only column being used is indexed?
How can I better optimize this query?
You're selecting a lot of columns out of a wide table. So this might not help, even though it does result in a bitmap index scan.
Try a partial index.
create index on myapp_myuser (active) where active = true;
I made a test table with a couple million rows.
explain analyze verbose
select count(*) from test where active = true;
"Aggregate (cost=41800.79..41800.81 rows=1 width=0) (actual time=500.756..500.756 rows=1 loops=1)"
" Output: count(*)"
" -> Bitmap Heap Scan on public.test (cost=8085.76..39307.79 rows=997200 width=0) (actual time=126.233..386.834 rows=1000000 loops=1)"
" Output: id, active"
" Filter: test.active"
" -> Bitmap Index Scan on test_active_idx1 (cost=0.00..7836.45 rows=497204 width=0) (actual time=123.398..123.398 rows=1000000 loops=1)"
" Index Cond: (test.active = true)"
"Total runtime: 500.794 ms"
When you write queries that you hope will use a partial index, you need to match the expression and WHERE clause. Using WHERE active is true is valid in PostgreSQL, but it doesn't match the WHERE clause in the partial index. That means you'll get a sequential scan again.