Fast query becomes super slow inside a stored procedure - postgresql-11

I have a problem with a query that, for privacy reasons, I can't show you (however, I'll provide the execution plan).
The problem is that when I execute this query outside a stored procedure it's quite fast (20 seconds for over 70,000 rows), but when I execute it inside a stored procedure it becomes super slow (5 minutes for the same data). How is this possible?
What can I try in order to improve this performance?
I've already tried changing how the query is executed by rewriting it in a dynamic form and putting the result into a temporary table, but the performance did not change.
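For illustration only, here is a minimal sketch of what such a dynamic form can look like in PL/pgSQL (the function, table, and parameter names are placeholders, since the real query is hidden). EXECUTE ... USING plans the statement with the concrete parameter values on every call, so it cannot get stuck on a cached generic plan the way a static query inside a PL/pgSQL function can:
CREATE OR REPLACE FUNCTION report_rows(p_scenario text, p_period numeric)
RETURNS SETOF hidden_table1 AS $$
BEGIN
    -- EXECUTE re-plans with the actual values of p_scenario and p_period,
    -- instead of reusing a generic plan cached by the function
    RETURN QUERY EXECUTE
        'SELECT * FROM hidden_table1 WHERE lll = $1 AND abc = $2'
        USING p_scenario, p_period;
END;
$$ LANGUAGE plpgsql;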
Subquery Scan on t (cost=16149.34..16152.92 rows=53 width=775) (actual time=4978.471..5382.467 rows=77616 loops=1)
Buffers: shared hit=11847, temp read=6433 written=6439
-> Unique (cost=16149.34..16151.86 rows=53 width=711) (actual time=4978.465..5268.859 rows=77616 loops=1)
Buffers: shared hit=11847, temp read=6433 written=6439
-> Sort (cost=16149.34..16149.47 rows=53 width=711) (actual time=4978.464..5139.298 rows=171110 loops=1)
Sort Key: HIDDEN_TABLE1.aaa, HIDDEN_TABLE1.bbb, HIDDEN_TABLE1.ccc, HIDDEN_TABLE1.ddd, HIDDEN_TABLE1.eee, HIDDEN_TABLE1.rrr, HIDDEN_TABLE1.jjj, HIDDEN_TABLE1.contract_element, HIDDEN_TABLE1.hhh, HIDDEN_TABLE1.lll, HIDDEN_TABLE1.abc, (sum(HIDDEN_TABLE1.original_amount) OVER (?)), HIDDEN_TABLE1.uuu, (sum(HIDDEN_TABLE1.amount) OVER (?)), (max(HIDDEN_TABLE1.posting_date) OVER (?)), HIDDEN_TABLE1.qqq, HIDDEN_TABLE2.ttt, (CASE WHEN ((COALESCE(HIDDEN_TABLE1.asdf, '0'::numeric) = '0'::numeric) AND ((HIDDEN_TABLE1.gfd)::text <> 'YTD'::text)) THEN '1'::numeric WHEN ((HIDDEN_TABLE1.gfd)::text = 'YTD'::text) THEN CASE WHEN (COALESCE(fx2.uyt, '0'::numeric) = '0'::numeric) THEN '1'::numeric ELSE (fx1.uyt / fx2.uyt) END ELSE HIDDEN_TABLE1.asdf END)
Sort Method: external merge Disk: 25896kB
Buffers: shared hit=11847, temp read=6433 written=6439
-> WindowAgg (cost=16144.51..16147.82 rows=53 width=711) (actual time=2394.118..3317.757 rows=171110 loops=1)
Buffers: shared hit=11847, temp read=3196 written=3199
-> Sort (cost=16144.51..16144.64 rows=53 width=645) (actual time=2394.081..2785.327 rows=171110 loops=1)
Sort Key: HIDDEN_TABLE1.aaa, HIDDEN_TABLE1.bbb, HIDDEN_TABLE1.ddd, HIDDEN_TABLE1.rrr, HIDDEN_TABLE1.jjj, HIDDEN_TABLE1.h, HIDDEN_TABLE1.hhh, HIDDEN_TABLE1.uuu, HIDDEN_TABLE1.qqq, HIDDEN_TABLE2.ttt, HIDDEN_TABLE1.lll, HIDDEN_TABLE1.abc
Sort Method: external merge Disk: 25568kB
Buffers: shared hit=11847, temp read=3196 written=3199
-> Gather (cost=1590.77..16142.99 rows=53 width=645) (actual time=13.657..503.346 rows=171110 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=11847
-> Hash Left Join (cost=590.77..15137.69 rows=22 width=645) (actual time=19.160..283.159 rows=57037 loops=3)
Hash Cond: ((((HIDDEN_TABLE1.lll)::text || 'LEA'::text) = (fx2.cod_scenario)::text) AND ((HIDDEN_TABLE1.abc)::numeric(18,0) = fx2.cod_periodo) AND ((HIDDEN_TABLE1.qqq)::text = (fx2.cod_valuta)::text))
Buffers: shared hit=11847
-> Hash Left Join (cost=300.18..14830.75 rows=22 width=640) (actual time=7.164..206.819 rows=57037 loops=3)
Hash Cond: ((((HIDDEN_TABLE1.lll)::text || 'LEA'::text) = (fx1.cod_scenario)::text) AND ((HIDDEN_TABLE1.abc)::numeric(18,0) = fx1.cod_periodo) AND ((HIDDEN_TABLE1.uuu)::text = (fx1.cod_valuta)::text))
Buffers: shared hit=11616
-> Hash Join (cost=9.58..14523.82 rows=22 width=635) (actual time=0.166..111.711 rows=57037 loops=3)
Hash Cond: ((HIDDEN_TABLE1.field_x)::text = (HIDDEN_TABLE2.field_y)::text)
Buffers: shared hit=11363
-> Parallel Seq Scan on HIDDEN_TABLE1 (cost=0.00..14510.62 rows=905 width=130) (actual time=0.013..67.993 rows=145561 loops=3)
Filter: (((uuu)::text <> (qqq)::text) AND ((COALESCE(booking_type, 'JOURNAL'::character varying))::text = 'JOURNAL'::text) AND ((measure)::text = '-'::text))
Rows Removed by Filter: 5918
Buffers: shared hit=11197
-> Hash (cost=9.57..9.57 rows=1 width=1042) (actual time=0.081..0.082 rows=11 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=104
-> Nested Loop (cost=0.27..9.57 rows=1 width=1042) (actual time=0.049..0.076 rows=11 loops=3)
Buffers: shared hit=104
-> Seq Scan on HIDDEN_TABLE2 (cost=0.00..1.24 rows=1 width=1032) (actual time=0.017..0.020 rows=11 loops=3)
Filter: (((setting_field_x)::text = 'VVVVVVV'::text) AND ((active)::text = 'Y'::text))
Rows Removed by Filter: 5
Buffers: shared hit=3
-> Index Only Scan using pk_conto on conto c (cost=0.27..8.29 rows=1 width=10) (actual time=0.004..0.004 rows=1 loops=33)
Index Cond: (cod_conto = (HIDDEN_TABLE2.field_y)::text)
Heap Fetches: 33
Buffers: shared hit=101
-> Hash (cost=154.67..154.67 rows=7767 width=22) (actual time=6.931..6.931 rows=7767 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 484kB
Buffers: shared hit=231
-> Seq Scan on HIDDEN_TABLE_3 fx1 (cost=0.00..154.67 rows=7767 width=22) (actual time=0.012..1.413 rows=7767 loops=3)
Buffers: shared hit=231
-> Hash (cost=154.67..154.67 rows=7767 width=22) (actual time=11.962..11.962 rows=7767 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 484kB
Buffers: shared hit=231
-> Seq Scan on HIDDEN_TABLE_3 fx2 (cost=0.00..154.67 rows=7767 width=22) (actual time=0.005..5.334 rows=7767 loops=3)
Buffers: shared hit=231
Planning Time: 2.465 ms
Execution Time: 5395.802 ms
I would like to bring the execution time down to something close to that of the query executed outside the stored procedure.
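One way to see which plan the statement actually gets inside the procedure is the auto_explain module with nested-statement logging turned on (a sketch of session settings, assuming the module is available on the server and you have the privileges to load it); comparing the logged plan with the one above should show where the two executions diverge:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = on;
SET auto_explain.log_nested_statements = on;
-- now call the procedure; the plan of the inner query is written to the server log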

Related

How do I find the most frequent element in a list in pyspark?

I have a PySpark dataframe with two columns, ID and Elements. Column "Elements" holds a list in each row. It looks like this:
ID | Elements
_______________________________________
X |[Element5, Element1, Element5]
Y |[Element Unknown, Element Unknown, Element_Z]
I want to add a column with the most frequent element of the column 'Elements'. The output should look like:
ID | Elements | Output_column
__________________________________________________________________________
X |[Element5, Element1, Element5] | Element5
Y |[Element Unknown, Element Unknown, Element_Z] | Element Unknown
How can I do that using pyspark?
Thanks in advance.
We can use higher-order functions here (available from Spark 2.4+).
First use transform and aggregate to get counts for each distinct value in the array.
Then sort the array of structs in descending order and take the first element.
from pyspark.sql import functions as F
temp = (df.withColumn("Dist", F.array_distinct("Elements"))
          .withColumn("Counts", F.expr("""transform(Dist, x ->
                          aggregate(Elements, 0, (acc, y) -> IF(y = x, acc + 1, acc)))"""))
          .withColumn("Map", F.arrays_zip("Dist", "Counts"))
       ).drop("Dist", "Counts")

out = temp.withColumn("Output_column",
                      F.expr("""element_at(array_sort(Map, (first, second) ->
                          CASE WHEN first['Counts'] > second['Counts'] THEN -1 ELSE 1 END), 1)['Dist']"""))
Output:
Note that I have added an empty array for ID Z to test. You can also drop the column Map by adding .drop("Map") to the output.
out.show(truncate=False)
+---+---------------------------------------------+--------------------------------------+---------------+
|ID |Elements |Map |Output_column |
+---+---------------------------------------------+--------------------------------------+---------------+
|X |[Element5, Element1, Element5] |[{Element5, 2}, {Element1, 1}] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|[{Element Unknown, 2}, {Element_Z, 1}]|Element Unknown|
|Z |[] |[] |null |
+---+---------------------------------------------+--------------------------------------+---------------+
For lower Spark versions, you can use a UDF with statistics.mode:
from pyspark.sql import functions as F, types as T
from statistics import mode

u = F.udf(lambda x: mode(x) if len(x) > 0 else None, T.StringType())
df.withColumn("Output", u("Elements")).show(truncate=False)
+---+---------------------------------------------+---------------+
|ID |Elements |Output |
+---+---------------------------------------------+---------------+
|X |[Element5, Element1, Element5] |Element5 |
|Y |[Element Unknown, Element Unknown, Element_Z]|Element Unknown|
|Z |[] |null |
+---+---------------------------------------------+---------------+
You can use PySpark SQL functions to achieve that (the SQL higher-order functions exist from Spark 2.4+; the sf.transform / sf.aggregate Python API used below was added in Spark 3.1).
Here is a generic function that adds a new column containing the most common element of another array column:
import pyspark.sql.functions as sf
def add_most_common_val_in_array(df, arraycol, drop=False):
    """Takes a spark df column of ArrayType() and returns the most common element
    in the array in a new column of the df called f"MostCommon_{arraycol}"

    Args:
        df (spark.DataFrame): dataframe
        arraycol (ArrayType()): array column in which you want to find the most common element
        drop (bool, optional): Drop the arraycol after finding most common element. Defaults to False.

    Returns:
        spark.DataFrame: df with additional column containing most common element in arraycol
    """
    dvals = f"distinct_{arraycol}"
    dvalscount = f"distinct_{arraycol}_count"
    startcols = df.columns
    df = df.withColumn(dvals, sf.array_distinct(arraycol))
    df = df.withColumn(
        dvalscount,
        sf.transform(
            dvals,
            lambda uval: sf.aggregate(
                arraycol,
                sf.lit(0),
                lambda acc, entry: sf.when(entry == uval, acc + 1).otherwise(acc),
            ),
        ),
    )
    countercol = f"ReverseCounter{arraycol}"
    df = df.withColumn(countercol, sf.map_from_arrays(dvalscount, dvals))
    mccol = f"MostCommon_{arraycol}"
    df = df.withColumn(mccol, sf.element_at(countercol, sf.array_max(dvalscount)))
    df = df.select(*startcols, mccol)
    if drop:
        df = df.drop(arraycol)
    return df

How can I optimize this Postgres query?

I have a very slow query that I found in my logs and I don't know how to optimize it with an index.
This is the query and the explain:
SELECT
"myapp_image"."id",
"myapp_image"."deleted",
"myapp_image"."title",
"myapp_image"."subject_type",
"myapp_image"."data_source",
"myapp_image"."objects_in_field",
"myapp_image"."solar_system_main_subject",
"myapp_image"."description",
"myapp_image"."link",
"myapp_image"."link_to_fits",
"myapp_image"."image_file",
"myapp_image"."uploaded",
"myapp_image"."published",
"myapp_image"."updated",
"myapp_image"."watermark_text",
"myapp_image"."watermark",
"myapp_image"."watermark_position",
"myapp_image"."watermark_size",
"myapp_image"."watermark_opacity",
"myapp_image"."user_id",
"myapp_image"."plot_is_overlay",
"myapp_image"."is_wip",
"myapp_image"."size",
"myapp_image"."w",
"myapp_image"."h",
"myapp_image"."animated",
"myapp_image"."license",
"myapp_image"."is_final",
"myapp_image"."allow_comments",
"myapp_image"."moderator_decision",
"myapp_image"."moderated_when",
"myapp_image"."moderated_by_id",
"auth_user"."id",
"auth_user"."password",
"auth_user"."last_login",
"auth_user"."is_superuser",
"auth_user"."username",
"auth_user"."first_name",
"auth_user"."last_name",
"auth_user"."email",
"auth_user"."is_staff",
"auth_user"."is_active",
"auth_user"."date_joined",
"myapp_userprofile"."id",
"myapp_userprofile"."deleted",
"myapp_userprofile"."user_id",
"myapp_userprofile"."updated",
"myapp_userprofile"."real_name",
"myapp_userprofile"."website",
"myapp_userprofile"."job",
"myapp_userprofile"."hobbies",
"myapp_userprofile"."timezone",
"myapp_userprofile"."about",
"myapp_userprofile"."premium_counter",
"myapp_userprofile"."company_name",
"myapp_userprofile"."company_description",
"myapp_userprofile"."company_website",
"myapp_userprofile"."retailer_country",
"myapp_userprofile"."avatar",
"myapp_userprofile"."exclude_from_competitions",
"myapp_userprofile"."default_frontpage_section",
"myapp_userprofile"."default_gallery_sorting",
"myapp_userprofile"."default_license",
"myapp_userprofile"."default_watermark_text",
"myapp_userprofile"."default_watermark",
"myapp_userprofile"."default_watermark_size",
"myapp_userprofile"."default_watermark_position",
"myapp_userprofile"."default_watermark_opacity",
"myapp_userprofile"."accept_tos",
"myapp_userprofile"."receive_important_communications",
"myapp_userprofile"."receive_newsletter",
"myapp_userprofile"."receive_marketing_and_commercial_material",
"myapp_userprofile"."language",
"myapp_userprofile"."seen_realname",
"myapp_userprofile"."seen_email_permissions",
"myapp_userprofile"."signature",
"myapp_userprofile"."signature_html",
"myapp_userprofile"."show_signatures",
"myapp_userprofile"."post_count",
"myapp_userprofile"."autosubscribe",
"myapp_userprofile"."receive_forum_emails"
FROM "myapp_image"
LEFT OUTER JOIN "myapp_apps_iotd_iotd"
ON ("myapp_image"."id" = "myapp_apps_iotd_iotd"."image_id")
INNER JOIN "auth_user"
ON ("myapp_image"."user_id" = "auth_user"."id")
LEFT OUTER JOIN "myapp_userprofile"
ON ("auth_user"."id" = "myapp_userprofile"."user_id")
WHERE ("myapp_image"."is_wip" = FALSE
AND NOT ("myapp_image"."id" IN (SELECT
U0."id" AS Col1
FROM "myapp_image" U0
LEFT OUTER JOIN "myapp_apps_iotd_iotdvote" U1
ON (U0."id" = U1."image_id")
WHERE U1."id" IS NULL)
)
AND "myapp_apps_iotd_iotd"."id" IS NULL
AND "myapp_image"."id" < 372320
AND "myapp_image"."deleted" IS NULL)
ORDER BY "myapp_image"."id" DESC
LIMIT 1;
QUERY PLAN:
Limit (cost=36302.74..36302.75 rows=1 width=1143) (actual time=1922.839..1923.002 rows=1 loops=1)
-> Sort (cost=36302.74..36302.75 rows=1 width=1143) (actual time=1922.836..1922.838 rows=1 loops=1)
Sort Key: myapp_image.id DESC
Sort Method: top-N heapsort Memory: 26kB
-> Nested Loop Left Join (cost=17919.42..36302.73 rows=1 width=1143) (actual time=1332.216..1908.796 rows=3102 loops=1)
-> Nested Loop (cost=17919.14..36302.40 rows=1 width=453) (actual time=1332.195..1867.675 rows=3102 loops=1)
-> Hash Left Join (cost=17918.85..36302.09 rows=1 width=321) (actual time=1332.164..1815.315 rows=3102 loops=1)
Hash Cond: (myapp_image.id = myapp_apps_iotd_iotd.image_id)
Filter: (myapp_apps_iotd_iotd.id IS NULL)
Rows Removed by Filter: 722
-> Seq Scan on myapp_image (cost=17856.32..35882.67 rows=135958 width=321) (actual time=1329.110..1801.409 rows=3824 loops=1)
Filter: ((NOT is_wip) AND (deleted IS NULL) AND (NOT (hashed SubPlan 1)) AND (id < 372320))
Rows Removed by Filter: 305733
SubPlan 1
-> Gather (cost=1217.31..17856.31 rows=1 width=4) (actual time=36.399..680.882 rows=305712 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Left Join (cost=217.31..16856.21 rows=1 width=4) (actual time=52.855..536.185 rows=101904 loops=3)
Hash Cond: (u0.id = u1.image_id)
Filter: (u1.id IS NULL)
Rows Removed by Filter: 2509
-> Parallel Seq Scan on myapp_image u0 (cost=0.00..14672.82 rows=128982 width=4) (actual time=0.027..175.375 rows=103186 loops=3)
-> Hash (cost=123.25..123.25 rows=7525 width=8) (actual time=52.601..52.602 rows=7526 loops=3)
Buckets: 8192 Batches: 1 Memory Usage: 358kB
-> Seq Scan on myapp_apps_iotd_iotdvote u1 (cost=0.00..123.25 rows=7525 width=8) (actual time=0.038..33.074 rows=7526 loops=3)
-> Hash (cost=35.57..35.57 rows=2157 width=8) (actual time=3.013..3.015 rows=2157 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 117kB
-> Seq Scan on myapp_apps_iotd_iotd (cost=0.00..35.57 rows=2157 width=8) (actual time=0.014..1.480 rows=2157 loops=1)
-> Index Scan using auth_user_id_pkey on auth_user (cost=0.29..0.31 rows=1 width=132) (actual time=0.012..0.012 rows=1 loops=3102)
Index Cond: (id = myapp_image.user_id)
-> Index Scan using myapp_userprofile_user_id on myapp_userprofile (cost=0.29..0.33 rows=1 width=690) (actual time=0.008..0.008 rows=1 loops=3102)
Index Cond: (auth_user.id = user_id)
Planning time: 1.722 ms
Execution time: 1925.867 ms
(34 rows)
There is a long Seq Scan on myapp_image, and I have tried adding the following index, but it made the query even slower:
create index on myapp_image using btree (is_wip, deleted, id);
What could be my optimization strategy?
This query is generated by the Django ORM and at this time I don't know yet what code path made it.
Based on the generated query:
SELECT *
FROM "myapp_image"
LEFT OUTER JOIN "myapp_apps_iotd_iotd"
ON ("myapp_image"."id" = "myapp_apps_iotd_iotd"."image_id")
INNER JOIN "auth_user"
ON ("myapp_image"."user_id" = "auth_user"."id")
LEFT OUTER JOIN "myapp_userprofile"
ON ("auth_user"."id" = "myapp_userprofile"."user_id")
WHERE ("myapp_image"."is_wip" = false
AND NOT ("myapp_image"."id" IN (SELECT U0."id" AS Col1
FROM "myapp_image" U0
LEFT OUTER JOIN "myapp_apps_iotd_iotdvote" U1
ON (U0."id" = U1."image_id")
WHERE U1."id" IS NULL))
AND "myapp_apps_iotd_iotd"."id" IS NULL
AND "myapp_image"."id" < 372320
AND "myapp_image"."deleted" IS NULL)
ORDER BY "myapp_image"."id" DESC
LIMIT 1;
I propose adding this index:
create index on myapp_image using btree (id, is_wip, deleted);
-- id as the first column
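Another idea worth testing (not verified against this data set): since the query always filters on is_wip = false and deleted is null and then takes the largest id, a partial index on id restricted to exactly those rows would let the planner walk the index backwards and stop at the first qualifying row:
create index on myapp_image using btree (id desc)
    where not is_wip and deleted is null;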

Presto - reducing an array of structs

I'm trying to reduce an array of complex types, however I'm running into a syntax error (maybe this isn't even supported?).
SYNTAX_ERROR: line 2:1: Unexpected parameters (array(row(count
double,name varchar)), integer,
com.facebook.presto.sql.analyzer.TypeSignatureProvider#16881774,
com.facebook.presto.sql.analyzer.TypeSignatureProvider#1718b83d) for
function reduce. Expected: reduce(array(T), S, function(S,T,S),
function(S,R)) T, S, R
The complex type is defined as counters array<struct<count:double,name:string>> in the table. I have tried selecting reduce(counters, 0, (state, counter) -> state + counter.count, s -> s) and reduce(counters, 0, (state, counter) -> state + counter['count'], s -> s), but neither works.
Your approach is correct (tested with Presto 0.205):
presto:default> desc t;
Column | Type | Extra | Comment
----------+----------------------------------------+-------+---------
counters | array(row(count double, name varchar)) | |
presto:default> select * from t;
counters
--------------------------------------------
[{count=1.0, name=a}, {count=2.0, name=b}]
presto:default> select reduce(
counters, 0,
(state, counter) -> state + counter.count,
state -> state) from t;
_col0
-------
3.0
(1 row)
You tagged the question prestodb and amazon-athena. If you are trying this on Athena, please bear in mind that Athena is based on Presto 0.172 (which was released in April 2017).

How should I set the distkey for a left join with conditionals in Redshift?

I have a query that looks like this:
select
a.col1,
a.col2,
b.col3
from
a
left join b on (a.id=b.id and b.attribute_id=3)
left join c on (a.id=c.id and c.attribute_id=4)
Even with the distkey set to id, I get a DS_BCAST_INNER in the query plan and end up with an extraordinarily long query time for a mere 1 million rows.
Setting the id to be the distribution key should co-locate the data and remove the need for the broadcast.
create table a (id int distkey, attribute_id int, col1 varchar(10), col2 varchar(10));
create table b (id int distkey, attribute_id int, col3 varchar(10));
create table c (id int distkey, attribute_id int);
You should see an explain plan something like this:
admin#dev=# explain select
a.col1,
a.col2,
b.col3
from
a
left join b on (a.id=b.id and b.attribute_id=3)
left join c on (a.id=c.id and c.attribute_id=4);
QUERY PLAN
--------------------------------------------------------------------------
XN Hash Left Join DS_DIST_NONE (cost=0.09..0.23 rows=3 width=99)
Hash Cond: ("outer".id = "inner".id)
-> XN Hash Left Join DS_DIST_NONE (cost=0.05..0.14 rows=3 width=103)
Hash Cond: ("outer".id = "inner".id)
-> XN Seq Scan on a (cost=0.00..0.03 rows=3 width=70)
-> XN Hash (cost=0.04..0.04 rows=3 width=37)
-> XN Seq Scan on b (cost=0.00..0.04 rows=3 width=37)
Filter: (attribute_id = 3)
-> XN Hash (cost=0.04..0.04 rows=1 width=4)
-> XN Seq Scan on c (cost=0.00..0.04 rows=1 width=4)
Filter: (attribute_id = 4)
(11 rows)
Time: 123.315 ms
If the tables contain 3 million rows or fewer and have a low frequency of writes, it should be safe to use DISTSTYLE ALL (see the DDL sketch below). If you do use DISTSTYLE KEY, verify that distributing your tables does not cause row skew (check with the following query):
select "schema", "table", skew_rows from svv_table_info;
"skew_rows" is the ratio of data between the slice with the most and the least data. It should be close 1.00.

Slow distance query in GeoDjango with PostGIS

I am using GeoDjango with Postgres 10 and PostGIS. I have two models as follows:
class Postcode(models.Model):
    name = models.CharField(max_length=8, unique=True)
    location = models.PointField(geography=True)

class Transaction(models.Model):
    transaction_id = models.CharField(max_length=60)
    price = models.IntegerField()
    date_of_transfer = models.DateField()
    postcode = models.ForeignKey(Postcode, on_delete=models.CASCADE)
    property_type = models.CharField(max_length=1, blank=True)
    street = models.CharField(blank=True, max_length=200)

    class Meta:
        indexes = [models.Index(fields=['-date_of_transfer',]),
                   models.Index(fields=['price',]),
                  ]
Given a particular postcode, I would like to find the nearest transactions within a specified distance. To do this, I am using the following code:
transactions = Transaction.objects.filter(price__gte=min_price) \
    .filter(postcode__location__distance_lte=(pc.location, D(mi=distance))) \
    .annotate(distance=Distance('postcode__location', pc.location)) \
    .order_by('distance')[0:25]
The query runs slowly, taking about 20-60 seconds (depending on the filter criteria) on a Windows PC with an i5 2500K and 16 GB RAM. If I order by date_of_transfer instead, it runs in under 1 second for larger distances (over 1 mile) but is still slow for small distances (e.g. 45 seconds for a distance of 0.1 miles).
So far I have tried:
* changing the location field from Geometry to Geography
* using dwithin instead of distance_lte
Neither of these had more than a marginal impact on the speed of the query.
The SQL generated by GeoDjango for the current version is:
SELECT "postcodes_transaction"."id",
"postcodes_transaction"."transaction_id",
"postcodes_transaction"."price",
"postcodes_transaction"."date_of_transfer",
"postcodes_transaction"."postcode_id",
"postcodes_transaction"."street",
ST_Distance("postcodes_postcode"."location",
ST_GeogFromWKB('\x0101000020e6100000005471e316f3bfbf4ad05fe811c14940'::bytea)) AS "distance"
FROM "postcodes_transaction" INNER JOIN "postcodes_postcode"
ON ("postcodes_transaction"."postcode_id" = "postcodes_postcode"."id")
WHERE ("postcodes_transaction"."price" >= 50000
AND ST_Distance("postcodes_postcode"."location", ST_GeomFromEWKB('\x0101000020e6100000005471e316f3bfbf4ad05fe811c14940'::bytea)) <= 1609.344
AND "postcodes_transaction"."date_of_transfer" >= '2000-01-01'::date
AND "postcodes_transaction"."date_of_transfer" <= '2017-10-01'::date)
ORDER BY "distance" ASC LIMIT 25
On the postcodes table, there is an index on the location field as follows:
CREATE INDEX postcodes_postcode_location_id
ON public.postcodes_postcode
USING gist
(location);
The transaction table has 22 million rows and the postcode table has 2.5 million rows. Any suggestions on what approaches I can take to improve the performance of this query?
Here is the query plan for reference:
"Limit (cost=2394838.01..2394840.93 rows=25 width=76) (actual time=19028.400..19028.409 rows=25 loops=1)"
" Output: postcodes_transaction.id, postcodes_transaction.transaction_id, postcodes_transaction.price, postcodes_transaction.date_of_transfer, postcodes_transaction.postcode_id, postcodes_transaction.street, (_st_distance(postcodes_postcode.location, '0101 (...)"
" -> Gather Merge (cost=2394838.01..2893397.65 rows=4273070 width=76) (actual time=19028.399..19028.407 rows=25 loops=1)"
" Output: postcodes_transaction.id, postcodes_transaction.transaction_id, postcodes_transaction.price, postcodes_transaction.date_of_transfer, postcodes_transaction.postcode_id, postcodes_transaction.street, (_st_distance(postcodes_postcode.location, (...)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Sort (cost=2393837.99..2399179.33 rows=2136535 width=76) (actual time=18849.396..18849.449 rows=387 loops=3)"
" Output: postcodes_transaction.id, postcodes_transaction.transaction_id, postcodes_transaction.price, postcodes_transaction.date_of_transfer, postcodes_transaction.postcode_id, postcodes_transaction.street, (_st_distance(postcodes_postcode.loc (...)"
" Sort Key: (_st_distance(postcodes_postcode.location, '0101000020e6100000005471e316f3bfbf4ad05fe811c14940'::geography, '0'::double precision, true))"
" Sort Method: quicksort Memory: 1013kB"
" Worker 0: actual time=18615.809..18615.948 rows=577 loops=1"
" Worker 1: actual time=18904.700..18904.721 rows=576 loops=1"
" -> Hash Join (cost=699247.34..2074281.07 rows=2136535 width=76) (actual time=10705.617..18841.448 rows=5573 loops=3)"
" Output: postcodes_transaction.id, postcodes_transaction.transaction_id, postcodes_transaction.price, postcodes_transaction.date_of_transfer, postcodes_transaction.postcode_id, postcodes_transaction.street, _st_distance(postcodes_postcod (...)"
" Inner Unique: true"
" Hash Cond: (postcodes_transaction.postcode_id = postcodes_postcode.id)"
" Worker 0: actual time=10742.668..18608.763 rows=5365 loops=1"
" Worker 1: actual time=10749.748..18897.838 rows=5522 loops=1"
" -> Parallel Seq Scan on public.postcodes_transaction (cost=0.00..603215.80 rows=6409601 width=68) (actual time=0.052..4214.812 rows=5491618 loops=3)"
" Output: postcodes_transaction.id, postcodes_transaction.transaction_id, postcodes_transaction.price, postcodes_transaction.date_of_transfer, postcodes_transaction.postcode_id, postcodes_transaction.street"
" Filter: ((postcodes_transaction.price >= 50000) AND (postcodes_transaction.date_of_transfer >= '2000-01-01'::date) AND (postcodes_transaction.date_of_transfer <= '2017-10-01'::date))"
" Rows Removed by Filter: 2025049"
" Worker 0: actual time=0.016..4226.643 rows=5375779 loops=1"
" Worker 1: actual time=0.016..4188.138 rows=5439515 loops=1"
" -> Hash (cost=682252.00..682252.00 rows=836667 width=36) (actual time=10654.921..10654.921 rows=1856 loops=3)"
" Output: postcodes_postcode.location, postcodes_postcode.id"
" Buckets: 131072 Batches: 16 Memory Usage: 1032kB"
" Worker 0: actual time=10692.068..10692.068 rows=1856 loops=1"
" Worker 1: actual time=10674.101..10674.101 rows=1856 loops=1"
" -> Seq Scan on public.postcodes_postcode (cost=0.00..682252.00 rows=836667 width=36) (actual time=5058.685..10651.176 rows=1856 loops=3)"
" Output: postcodes_postcode.location, postcodes_postcode.id"
" Filter: (_st_distance(postcodes_postcode.location, '0101000020e6100000005471e316f3bfbf4ad05fe811c14940'::geography, '0'::double precision, true) <= '1609.344'::double precision)"
" Rows Removed by Filter: 2508144"
" Worker 0: actual time=5041.442..10688.265 rows=1856 loops=1"
" Worker 1: actual time=5072.242..10670.215 rows=1856 loops=1"
"Planning time: 0.538 ms"
"Execution time: 19065.962 ms"