I have a set of Django ORM models representing a directed graph, and I'm trying to retrieve all the adjacent vertices to a given vertex ignoring edge direction:
class Vertex(models.Model):
    pass

class Edge(models.Model):
    orig = models.ForeignKey(Vertex, related_name='%(class)s_orig', null=True, blank=True)
    dest = models.ForeignKey(Vertex, related_name='%(class)s_dest', null=True, blank=True)
    # ... other data about this edge ...
The query Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct() returns the correct result, but in my case it takes far too long to execute.
Typically for my application there will be around 50-100 vertices at any given time, and around a million edges. Even reducing it to only 20 vertices and 100000 edges, that query takes about a minute and a half to execute:
import itertools
import random
import timeit

import tqdm
from django.db.models import Q

for i in range(20):
    Vertex().save()
vxs = list(Vertex.objects.all())
for i in tqdm.tqdm(range(100000)):
    Edge(orig=random.sample(vxs, 1)[0], dest=random.sample(vxs, 1)[0]).save()
v = vxs[0]

def f1():
    return list(Vertex.objects.filter(
        Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct())

t1 = timeit.Timer(f1)
print(t1.timeit(number=1))  # 84.21138522100227
On the other hand, if I split the query up into two pieces I can get the exact same result in only a handful of milliseconds:
def f2():
    q1 = Vertex.objects.filter(edge_orig__dest=v).distinct()
    q2 = Vertex.objects.filter(edge_dest__orig=v).distinct()
    return list({x for x in itertools.chain(q1, q2)})

t2 = timeit.Timer(f2)
print(t2.timeit(number=100) / 100)  # 0.0109818680600074
This second version has issues though:
It's not atomic. The list of edges is almost guaranteed to change between the two queries in my application, meaning the results won't represent a single point in time.
I can't perform additional processing and aggregation on the results without manually looping over them (e.g. if I wanted Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct().aggregate(avg=Avg('some_field'))).
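One possible middle ground (a sketch, assuming Django 1.11's QuerySet.union() behaves as documented) is to combine the two fast queries with a SQL UNION, which executes as a single statement and therefore sees a single snapshot under PostgreSQL's default READ COMMITTED isolation:

# Hedged sketch: run both halves as one SQL statement via UNION.
q1 = Vertex.objects.filter(edge_orig__dest=v)
q2 = Vertex.objects.filter(edge_dest__orig=v)
adjacent = q1.union(q2)  # UNION de-duplicates, like distinct()
# Caveat: a combined queryset supports only a few further operations
# (slicing, ordering, count()), so the aggregate() chaining above
# still wouldn't compose onto it.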
Why does the second version run so much faster than the first one?
How can I do this, and is there a way to get the first one to run fast enough for practical use?
I'm currently testing with Python 3.5.2, PostgreSQL 9.5.6, and Django 1.11, although if this is an issue with one of those I'm not stuck with them.
Here is the SQL generated by each query, as well as Postgres's EXPLAIN output:
The first one:
Vertex.objects.filter(Q(edge_orig__dest=v) | Q(edge_dest__orig=v)).distinct()
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
LEFT OUTER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
LEFT OUTER JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061)
HashAggregate  (cost=8275151.47..8275151.67 rows=20 width=4)
  Group Key: app_vertex.id
  ->  Hash Left Join  (cost=3183.45..8154147.45 rows=48401608 width=4)
        Hash Cond: (app_vertex.id = app_edge.orig_id)
        Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
        ->  Hash Right Join  (cost=1.45..2917.45 rows=100000 width=8)
              Hash Cond: (t4.dest_id = app_vertex.id)
              ->  Seq Scan on app_edge t4  (cost=0.00..1541.00 rows=100000 width=8)
              ->  Hash  (cost=1.20..1.20 rows=20 width=4)
                    ->  Seq Scan on app_vertex  (cost=0.00..1.20 rows=20 width=4)
        ->  Hash  (cost=1541.00..1541.00 rows=100000 width=8)
              ->  Seq Scan on app_edge  (cost=0.00..1541.00 rows=100000 width=8)
The second version (only the first of its two symmetric queries is shown):
Vertex.objects.filter(edge_orig__dest=v).distinct()
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
INNER JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
WHERE "app_edge"."dest_id" = 1061
HashAggregate  (cost=1531.42..1531.62 rows=20 width=4)
  Group Key: app_vertex.id
  ->  Hash Join  (cost=848.11..1519.04 rows=4950 width=4)
        Hash Cond: (app_edge.orig_id = app_vertex.id)
        ->  Bitmap Heap Scan on app_edge  (cost=846.65..1449.53 rows=4950 width=4)
              Recheck Cond: (dest_id = 1061)
              ->  Bitmap Index Scan on app_edge_dest_id_4254b90f  (cost=0.00..845.42 rows=4950 width=0)
                    Index Cond: (dest_id = 1061)
        ->  Hash  (cost=1.20..1.20 rows=20 width=4)
              ->  Seq Scan on app_vertex  (cost=0.00..1.20 rows=20 width=4)
@khampson's version also takes a minute and a half to run, so it's also a no-go.
Vertex.objects.raw( ... )
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061);
HashAggregate  (cost=8275347.47..8275347.67 rows=20 width=4)
  Group Key: app_vertex.id
  ->  Hash Join  (cost=3183.45..8154343.45 rows=48401608 width=4)
        Hash Cond: (app_vertex.id = app_edge.orig_id)
        Join Filter: ((app_edge.dest_id = 1061) OR (t4.orig_id = 1061))
        ->  Hash Join  (cost=1.45..2917.45 rows=100000 width=12)
              Hash Cond: (t4.dest_id = app_vertex.id)
              ->  Seq Scan on app_edge t4  (cost=0.00..1541.00 rows=100000 width=8)
              ->  Hash  (cost=1.20..1.20 rows=20 width=4)
                    ->  Seq Scan on app_vertex  (cost=0.00..1.20 rows=20 width=4)
        ->  Hash  (cost=1541.00..1541.00 rows=100000 width=8)
              ->  Seq Scan on app_edge  (cost=0.00..1541.00 rows=100000 width=8)
The query plans for those two queries are radically different. The first (slower) one isn't hitting any indexes, and is doing two left joins, both of which result in way, way more rows being processed and returned. From what I interpret of the intention of the Django ORM syntax, you don't truly want left joins here.
I would recommend dropping down to raw SQL in this case from within the Django ORM, hybridizing the two. For example, if you take the first query and transform it into something like this:
SELECT DISTINCT "app_vertex"."id"
FROM "app_vertex"
JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
WHERE ("app_edge"."dest_id" = 1061
OR T4."orig_id" = 1061);
Two questions there: How does that version perform, and does it give you the results you're looking for?
For more info on raw queries, check out this section of the Django doc.
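For illustration, a hedged sketch of running that hybrid SQL through the ORM (parameterized instead of hard-coding 1061; raw() requires the primary key among the selected columns, which "id" satisfies):

def adjacent_raw(v):
    # Hypothetical helper; the table names are the ones Django
    # generated for the models in the question.
    return list(Vertex.objects.raw(
        """
        SELECT DISTINCT "app_vertex"."id"
        FROM "app_vertex"
        JOIN "app_edge" ON ("app_vertex"."id" = "app_edge"."orig_id")
        JOIN "app_edge" T4 ON ("app_vertex"."id" = T4."dest_id")
        WHERE ("app_edge"."dest_id" = %s OR T4."orig_id" = %s)
        """,
        [v.pk, v.pk],
    ))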
Response to comment from OP:
The query plan for the query I suggested also shows that it's not hitting any indexes.
Do you have indexes on both tables for the columns involved? I suspect not. For this specific query we're looking up a single value, so if there were indexes I would be very surprised if the query planner chose a sequential scan (OTOH, if you were looking for a wide range of rows, say over 10% of the table, the planner might correctly make that choice).
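If it helps to verify, here is one way to list the existing indexes from within Django (pg_indexes is a standard PostgreSQL view; the table names assume the app label used in the question):

from django.db import connection

# Print every index definition on the two tables in the join.
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT indexname, indexdef FROM pg_indexes "
        "WHERE tablename IN ('app_vertex', 'app_edge')"
    )
    for name, definition in cursor.fetchall():
        print(name, definition)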
I propose another query:

# Get the edges that touch Vertex v; only() trims the fields fetched
edges = Edge.objects.filter(Q(orig=v) | Q(dest=v)).only('orig', 'dest')

# Collect the vertex ids into a set to discard duplicates
vertex_ids = {*edges.values_list('orig_id', flat=True),
              *edges.values_list('dest_id', flat=True)}

# Fetch those vertices, excluding the original vertex
vertices = Vertex.objects.filter(pk__in=vertex_ids).exclude(pk=v.pk)
This shouldn't require any joins and shouldn't suffer from the race conditions you mention.
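And since vertices is an ordinary queryset, the aggregation from the question should, I believe, chain straight onto it:

from django.db.models import Avg

# some_field is the placeholder field from the question's example.
stats = vertices.aggregate(avg=Avg('some_field'))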
Related
I tried two kinds of join condition in Redshift: first I put the predicate in a WHERE clause after the JOIN ... ON, and second I attached it with AND inside the JOIN ... ON. I assumed that WHERE is executed after the join, so that in the first case far more rows would have to be scanned.
explain
select *
from table1 t
left join table2 t2 on t.key = t2.key
where t.snapshot_day = to_date('2021-12-18', 'YYYY-MM-DD');
XN Hash Right Join DS_DIST_INNER  (cost=43055.58..114637511640937.91 rows=2906695 width=3169)
  Inner Dist Key: t.key
  Hash Cond: (("outer".key)::text = ("inner".key)::text)
  ->  XN Seq Scan on table2 t2  (cost=0.00..39874539.52 rows=3987453952 width=3038)
  ->  XN Hash  (cost=35879.65..35879.65 rows=2870373 width=131)
        ->  XN Seq Scan on table1 t  (cost=0.00..35879.65 rows=2870373 width=131)
              Filter: (snapshot_day = '2021-12-18 00:00:00'::timestamp without time zone)
On the other hand, when the predicate is attached with AND (below), it is part of the join condition, so I assumed fewer rows would be scanned during the join. But the plan shows far more rows and a much higher cost than the WHERE version:
explain
select *
from table1 t
left join table2 t2 on t.key = t2.key
and t.snapshot_day = to_date('2021-12-18', 'YYYY-MM-DD');
XN Hash Right Join DS_DIST_INNER  (cost=40860915.20..380935317239623.75 rows=3268873216 width=3169)
  Inner Dist Key: t.key
  Hash Cond: (("outer".key)::text = ("inner".key)::text)
  Join Filter: ("inner".snapshot_day = '2021-12-18 00:00:00'::timestamp without time zone)
  ->  XN Seq Scan on table2 t2  (cost=0.00..39874539.52 rows=3987453952 width=3038)
  ->  XN Hash  (cost=32688732.16..32688732.16 rows=3268873216 width=131)
        ->  XN Seq Scan on table1 t  (cost=0.00..32688732.16 rows=3268873216 width=131)
What is the difference between them? Where is my misunderstanding in this case?
If anyone has an explanation or pointers to materials, please let me know.
Thanks
There are several things happening here (in my original answer I missed that you were doing an outer join, so this is a complete rewrite).
WHERE happens before JOIN (i.e., real-world databases don't evaluate queries as literal relational algebra)
A join merges two result-sets on common column(s). The query optimizer will attempt to reduce the size of those result-sets by applying any predicates from the WHERE clause that are independent of the join.
Conditions in the JOIN clause control how the two result-sets are merged, nothing else.
This is a semantic difference between your two queries: when you specify the predicate on t.snapshot_day in the WHERE clause, it limits the rows selected from t. When you specify it in the JOIN clause, it controls whether a row from t2 matches a row in t, not which rows are selected from t or which rows are returned from the join.
You're doing an outer join.
In an inner join, rows between the two result-sets are joined if and only if all conditions in the JOIN clause are matched, and only those rows are returned. In practice, this will limit the result set to only those rows in t that fulfill the predicate.
An outer join, however, selects all rows from the outer result-set (in this case, all rows from t), and includes values from the inner result-set iff all join conditions are true. So you'll only include data from t2 where the key matches and the predicate applies. For all other rows in t you'll get nulls.
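A toy illustration in plain Python (made-up rows, at most one match per key for simplicity) of why the two queries return different shapes:

# Two tiny "tables": t is the outer side, t2 the inner side.
t = [{"key": 1, "snapshot_day": "2021-12-18"},
     {"key": 2, "snapshot_day": "2021-12-17"}]
t2 = [{"key": 1, "val": "a"},
      {"key": 2, "val": "b"}]

# Predicate in WHERE: t is filtered first, then left-joined -> 1 row.
where_version = [
    (r, next((s for s in t2 if s["key"] == r["key"]), None))
    for r in t
    if r["snapshot_day"] == "2021-12-18"
]

# Predicate in JOIN ... ON: every row of t survives (outer side); the
# predicate only decides whether a t2 row gets attached -> 2 rows, the
# second with None (NULL) standing in for t2's columns.
on_version = [
    (r, next((s for s in t2
              if s["key"] == r["key"]
              and r["snapshot_day"] == "2021-12-18"), None))
    for r in t
]

print(len(where_version))  # 1
print(len(on_version))     # 2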
That DS_DIST_INNER may be a problem
Not related to the join semantics, but for large queries in Redshift it's very expensive to redistribute rows. To avoid this, you should explicitly distribute all tables on the column that's used most frequently in a join (or used with the joins that involve the largest number of rows).
If you don't pick an explicit distribution key, then Redshift will allocate rows to nodes on a round-robin basis, and your query performance will suffer (because every join will require some amount of redistribution).
We've got this table in our database with 80GB of data and 230GB of Indexes. We are constrained on our disk which is already maxed out.
What bothers me is we have two indexes that look pretty darn similar
CREATE INDEX tracks_trackpoint_id ON tracks_trackpoint USING btree (id)
CREATE UNIQUE INDEX tracks_trackpoint_pkey ON tracks_trackpoint USING btree (id)
I have no idea what the history behind this is, but the first one seems quite redundant. What would be the risk of dropping it? It would buy us a year of storage.
You can drop the first index, it is totally redundant.
If your tables are 80GB and your indexes 230GB, I am ready to bet that you have too many indexes in your database.
Drop the indexes that are not used.
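A hedged way to find candidates (pg_stat_user_indexes counts scans only since the last statistics reset, so trust it only on a system that has seen your full workload):

from django.db import connection

# Never-scanned, non-unique indexes, largest first. Unique indexes are
# excluded because they enforce constraints even when never scanned.
with connection.cursor() as cursor:
    cursor.execute("""
        SELECT s.indexrelname,
               pg_size_pretty(pg_relation_size(s.indexrelid))
        FROM pg_stat_user_indexes s
        JOIN pg_index i ON i.indexrelid = s.indexrelid
        WHERE s.idx_scan = 0
          AND NOT i.indisunique
        ORDER BY pg_relation_size(s.indexrelid) DESC
    """)
    for name, size in cursor.fetchall():
        print(name, size)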
Careful as I am, I disabled the index to benchmark this, and the query seems to fall back nicely to the other index. I'll try a few variants.
appdb=# EXPLAIN ANALYZE SELECT * FROM tracks_trackpoint WHERE id=266082;
Index Scan using tracks_trackpoint_id on tracks_trackpoint  (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
  Index Cond: (id = 266082)
Total runtime: 0.040 ms
(3 rows)

appdb=# UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'tracks_trackpoint_id'::regclass;

appdb=# EXPLAIN ANALYZE SELECT * FROM tracks_trackpoint WHERE id=266082;
Index Scan using tracks_trackpoint_pkey on tracks_trackpoint  (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
  Index Cond: (id = 266082)
Total runtime: 0.036 ms
(3 rows)
When I query the table using trigram_similar or contains the index is used, as expected.
When I query the same table using icontains, Django performs a sequential scan using UPPER.
The difference is 3ms vs 240ms.
Should I create a lowercase index and match with contains? (How could this be done?)
Should I create a field where all the contents will be lower cased and index that field?
Something else?
The model:
class Name(models.Model):
    name_en = models.CharField(max_length=127)
    ...

    class Meta:
        indexes = [
            GinIndex(
                name="name_en_gin_trigram",
                fields=["name_en"],
                opclasses=["gin_trgm_ops"],
            )
        ]
The query that uses the index:
>>> Name.objects.filter(
...     Q(name_en__contains='eeth')
...     | Q(name_en__trigram_similar='eeth')
... )
SELECT *
FROM "shop_name"
WHERE ("shop_name"."name_en"::text LIKE '%eeth%' OR "shop_name"."name_en" % 'eeth')
LIMIT 21;
The resulting query plan:
Limit  (cost=64.06..90.08 rows=7 width=121) (actual time=0.447..2.456 rows=14 loops=1)
  ->  Bitmap Heap Scan on shop_name  (cost=64.06..90.08 rows=7 width=121) (actual time=0.443..2.411 rows=14 loops=1)
        Recheck Cond: (((name_en)::text ~~ '%eeth%'::text) OR ((name_en)::text % 'eeth'::text))
        Rows Removed by Index Recheck: 236
        Heap Blocks: exact=206
        ->  BitmapOr  (cost=64.06..64.06 rows=7 width=0) (actual time=0.371..0.378 rows=0 loops=1)
              ->  Bitmap Index Scan on name_en_gin_trigram  (cost=0.00..20.03 rows=4 width=0) (actual time=0.048..0.049 rows=15 loops=1)
                    Index Cond: ((name_en)::text ~~ '%eeth%'::text)
              ->  Bitmap Index Scan on name_en_gin_trigram  (cost=0.00..44.03 rows=4 width=0) (actual time=0.318..0.320 rows=250 loops=1)
                    Index Cond: ((name_en)::text % 'eeth'::text)
Planning Time: 0.793 ms
Execution Time: 2.531 ms
(12 rows)
If I use icontains the index is not used:
>>> Name.objects.filter(
...     Q(name_en__icontains='eeth')
...     | Q(name_en__trigram_similar='eeth')
... )
SELECT *
FROM "shop_name"
WHERE (UPPER("shop_name"."name_en"::text) LIKE UPPER('%eeth%') OR "shop_name"."name_en" % 'eeth')
LIMIT 21;
The resulting query plan:
Limit  (cost=0.00..95.61 rows=21 width=121) (actual time=10.513..244.244 rows=14 loops=1)
  ->  Seq Scan on shop_name  (cost=0.00..1356.79 rows=298 width=121) (actual time=10.509..244.195 rows=14 loops=1)
        Filter: ((upper((name_en)::text) ~~ '%EETH%'::text) OR ((name_en)::text % 'eeth'::text))
        Rows Removed by Filter: 36774
Planning Time: 0.740 ms
Execution Time: 244.299 ms
(6 rows)
Django runs icontains with UPPER(), and we can address this by making the index also UPPER():
CREATE INDEX upper_col_name_gin_idx ON table_name USING GIN (UPPER(col_name) gin_trgm_ops)
Django will then run WHERE UPPER("table_name"."col_name"::text) LIKE UPPER('%term%'), using this index.
Update: this approach will not work as expected. However, the same mechanics can be used to implement the accepted approach below.
Should I create a lowercase index and match with contains? (How could this be done?)
From Django-3.2:
Positional argument *expressions allows creating functional indexes on
expressions and database functions.
For example:
Index(Lower('title').desc(), 'pub_date', name='lower_title_date_idx')
creates an index on the lowercased value of the title field in
descending order and the pub_date field in the default ascending
order.
It sounds like some quality Django music!
The code used to accomplish the above is the following:
migrations/0001_initial.py:
'''
A fake migration used to install the necessary extensions.
It should be followed by
./manage.py makemigrations && ./manage.py migrate
'''
from django.contrib.postgres.operations import (
    BtreeGinExtension,
    TrigramExtension,
)
from django.db import migrations


class Migration(migrations.Migration):
    dependencies = []

    operations = [
        BtreeGinExtension(),
        TrigramExtension(),
    ]
models.py:
from django.contrib.postgres.indexes import GinIndex, OpClass
from django.db import models
from django.db.models.functions import Lower


class Name(models.Model):
    name_en = models.CharField(max_length=127)
    ...

    class Meta:
        indexes = [
            GinIndex(
                OpClass(Lower("name_en"), name="gin_trgm_ops"),
                name="name_en_gin_trigram_lowercase",
            ),
        ]
OpClass is used to avoid the error:
ValueError: Index.opclasses cannot be used with expressions. Use django.contrib.postgres.indexes.OpClass() instead.
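A hedged usage sketch: to hit the lowercased expression index, the query has to wrap the column in Lower() too, e.g. via an annotation (name_en_lower is a name introduced here just for illustration):

from django.db.models.functions import Lower

term = "eeth"
qs = (
    Name.objects
    .annotate(name_en_lower=Lower("name_en"))
    .filter(name_en_lower__contains=term.lower())
)
# Emits roughly WHERE LOWER("shop_name"."name_en") LIKE '%eeth%',
# which matches the Lower(...) expression in the GIN index above.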
I have a simple Django site, using a PostgreSQL 9.3 database, with a single table storing user accounts (e.g. name, email, address, phone, active, etc). However, my user model is fairly large, and has around 2.6 million records. I noticed Django's admin was a little slow, so using django-debug-toolbar, I noticed that almost all queries ran in under 1 ms, except for:
SELECT COUNT(*) FROM "myapp_myuser" WHERE "myapp_myuser"."active" = true;
which took about 7000 ms. However, the active column is indexed using Django's standard db_index=True, which generates the index:
CREATE INDEX myapp_myuser_active
ON myapp_myuser
USING btree
(active);
Checking out the query with EXPLAIN via:
EXPLAIN ANALYZE VERBOSE
SELECT COUNT(*) FROM "myapp_myuser" WHERE "myapp_myuser"."active" = true;
returns:
Aggregate  (cost=109305.45..109305.46 rows=1 width=0) (actual time=7342.973..7342.974 rows=1 loops=1)
  Output: count(*)
  ->  Seq Scan on public.myapp_myuser  (cost=0.00..102638.16 rows=2666916 width=0) (actual time=0.035..4765.059 rows=2666337 loops=1)
        Output: id, created, category_id, name, email, address_1, address_2, city, active, (...)
        Filter: myapp_myuser.active
Total runtime: 7343.031 ms
It appears to not be using the index at all. Am I reading this right?
Running just SELECT COUNT(*) FROM "myapp_myuser" completed in about 500 ms. Why such a disparity in run times, even though the only column being used is indexed?
How can I better optimize this query?
You're selecting a lot of columns out of a wide table. So this might not help, even though it does result in a bitmap index scan.
Try a partial index.
create index on myapp_myuser (active) where active = true;
I made a test table with a couple million rows.
explain analyze verbose
select count(*) from test where active = true;
"Aggregate (cost=41800.79..41800.81 rows=1 width=0) (actual time=500.756..500.756 rows=1 loops=1)"
" Output: count(*)"
" -> Bitmap Heap Scan on public.test (cost=8085.76..39307.79 rows=997200 width=0) (actual time=126.233..386.834 rows=1000000 loops=1)"
" Output: id, active"
" Filter: test.active"
" -> Bitmap Index Scan on test_active_idx1 (cost=0.00..7836.45 rows=497204 width=0) (actual time=123.398..123.398 rows=1000000 loops=1)"
" Index Cond: (test.active = true)"
"Total runtime: 500.794 ms"
When you write queries that you hope will use a partial index, you need to match the expression and WHERE clause. Using WHERE active is true is valid in PostgreSQL, but it doesn't match the WHERE clause in the partial index. That means you'll get a sequential scan again.
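For completeness, if the index were managed from Django instead (Index supports condition=... since Django 2.2), a sketch using the model from the question, fields abridged:

from django.db import models
from django.db.models import Q

class MyUser(models.Model):
    active = models.BooleanField(default=True)
    # ... name, email, address, phone, etc. ...

    class Meta:
        indexes = [
            models.Index(
                fields=["active"],
                name="myuser_active_true_idx",
                condition=Q(active=True),  # partial: WHERE active = true
            ),
        ]

Django's .filter(active=True) emits "active" = true (as in the question's generated SQL), which matches that predicate, so the planner can use the partial index.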
I'm trying to optimize a slow query that was generated by the Django ORM. It is a many-to-many query. It takes over 1 min to run.
The tables have a good amount of data, but they aren't huge (400k rows in sp_article and 300k rows in sp_article_categories)
#categories.article_set.filter(post_count__lte=50)
EXPLAIN ANALYZE SELECT *
FROM "sp_article"
INNER JOIN "sp_article_categories" ON ("sp_article"."id" = "sp_article_categories"."article_id")
WHERE ("sp_article_categories"."category_id" = 1081
AND "sp_article"."post_count" <= 50 )
Nested Loop  (cost=0.00..6029.01 rows=656 width=741) (actual time=0.472..25.724 rows=1266 loops=1)
  ->  Index Scan using sp_article_categories_category_id on sp_article_categories  (cost=0.00..848.82 rows=656 width=12) (actual time=0.015..1.305 rows=1408 loops=1)
        Index Cond: (category_id = 1081)
  ->  Index Scan using sp_article_pkey on sp_article  (cost=0.00..7.88 rows=1 width=729) (actual time=0.014..0.015 rows=1 loops=1408)
        Index Cond: (sp_article.id = sp_article_categories.article_id)
        Filter: (sp_article.post_count <= 50)
Total runtime: 26.536 ms
I have indexes on:
sp_article_categories.article_id (type: btree)
sp_article_categories.category_id
sp_article.post_count (type: btree)
Any suggestions on how I can tune this to get the query speedy?
Thanks!
You've provided the vital information here: the EXPLAIN ANALYZE output. That isn't showing a one-minute runtime, though; it's showing about 26 milliseconds. So either that isn't the query being run, or the problem is elsewhere.
The only difference between EXPLAIN ANALYZE and a real application is that the results aren't actually returned to the client. You would need a lot of data to slow things down that much, though.
The other suggestions are all off the mark, since they ignore the fact that the query isn't slow. You have the relevant indexes (both sides of the join are using an index scan), and the planner is perfectly capable of filtering on the category table first (that's the whole point of having a half-decent query planner).
So - you first need to figure out what exactly is slow...
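One low-tech way to check from the Django side (connection.queries is populated only when DEBUG=True; `categories` is the object from the question's comment):

from django.db import connection, reset_queries

reset_queries()
articles = list(categories.article_set.filter(post_count__lte=50))
for q in connection.queries:
    print(q["time"], q["sql"][:120])  # time is in seconds, as a string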
Put an index on sp_article_categories.category_id
From a pure SQL perspective, your join is more efficient if your base table has fewer rows in it, and the WHERE conditions are performed on that table before it joins to another.
So see if you can get Django to select from the categories first, then filter the category_id before joining to the article table.
Pseudo-code follows:
SELECT * FROM categories c
INNER JOIN articles a
ON c.category_id = 1081
AND c.category_id = a.category_id
And put an index on category_id, as Steven suggests.
You can also list explicit field names instead of *:
select [fields] from ...
I assume you have run ANALYZE on the database to get fresh statistics.
It seems that the join between sp_article.id and sp_article_categories.article_id is costly. What data type is the article id? If it isn't numeric, you should perhaps consider making it an integer or bigint, whatever suits your needs. In my experience it can make a big difference in performance. Hope it helps.
Cheers!
// John