I'm trying to optimize a slow query that was generated by the Django ORM. It's a many-to-many query and takes over a minute to run.
The tables have a good amount of data, but they aren't huge (400k rows in sp_article and 300k rows in sp_article_categories).
#categories.article_set.filter(post_count__lte=50)
EXPLAIN ANALYZE SELECT *
FROM "sp_article"
INNER JOIN "sp_article_categories" ON ("sp_article"."id" = "sp_article_categories"."article_id")
WHERE ("sp_article_categories"."category_id" = 1081
AND "sp_article"."post_count" <= 50 )
Nested Loop (cost=0.00..6029.01 rows=656 width=741) (actual time=0.472..25.724 rows=1266 loops=1)
-> Index Scan using sp_article_categories_category_id on sp_article_categories (cost=0.00..848.82 rows=656 width=12) (actual time=0.015..1.305 rows=1408 loops=1)
Index Cond: (category_id = 1081)
-> Index Scan using sp_article_pkey on sp_article (cost=0.00..7.88 rows=1 width=729) (actual time=0.014..0.015 rows=1 loops=1408)
Index Cond: (sp_article.id = sp_article_categories.article_id)
Filter: (sp_article.post_count <= 50)
Total runtime: 26.536 ms
I have an index on:
sp_article_categories.article_id (type: btree)
sp_article_categories.category_id
sp_article.post_count (type: btree)
Any suggestions on how I can tune this to make the query speedy?
Thanks!
You've provided the vital information here - the EXPLAIN ANALYZE. It isn't showing a one-minute runtime though, it's showing about 26 milliseconds. So - either that isn't the query being run, or the problem is elsewhere.
The only difference between EXPLAIN ANALYZE and a real application is that the results aren't actually returned. You would need a lot of data to slow things down to a minute that way though.
The other suggestions are all off the mark since they're ignoring the fact that the query isn't slow. You have the relevant indexes (both sides of the join are using an index scan) and the planner is perfectly capable of filtering on the category table first (that's the whole point of having a half decent query planner).
So - you first need to figure out what exactly is slow...
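One quick sanity check (a sketch, assuming you have psql access to the same database): run the exact query the ORM emits with client-side timing on, so the result fetch is included:
\timing on
SELECT * FROM sp_article
INNER JOIN sp_article_categories ON sp_article.id = sp_article_categories.article_id
WHERE sp_article_categories.category_id = 1081 AND sp_article.post_count <= 50;
If that also comes back in tens of milliseconds, the minute is being spent in the application layer, not in Postgres.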
Put an index on sp_article_categories.category_id
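For reference, that would be something like the statement below - though the index list in the question suggests it may already exist:
CREATE INDEX sp_article_categories_category_id
ON sp_article_categories USING btree (category_id);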
From a pure SQL perspective, your join is more efficient if your base table has fewer rows in it, and the WHERE conditions are performed on that table before it joins to another.
So see if you can get Django to select from the categories first, then filter the category_id before joining to the article table.
Pseudo-code follows (rewritten against the actual tables from the question):
SELECT *
FROM sp_article_categories c
INNER JOIN sp_article a
    ON a.id = c.article_id
WHERE c.category_id = 1081
  AND a.post_count <= 50
And put an index on category_id like Steven suggests.
You can use field names instead of * too.
select [fields] from....
I assume you have run ANALYZE on the database to get fresh statistics.
It seems that the join between sp_article.id and sp_article_categories.article_id is costly. What data type is the article id - numeric? If it isn't, you should perhaps consider making it numeric - integer or bigint, whatever suits your needs. It can make a big difference in performance, in my experience. Hope it helps.
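For example, if the column turned out to be text, the conversion is a one-liner - a sketch against the question's schema; note it rewrites and locks the table while it runs:
ALTER TABLE sp_article_categories
ALTER COLUMN article_id TYPE integer
USING article_id::integer;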
Cheers!
// John
Related
We've got this table in our database with 80GB of data and 230GB of Indexes. We are constrained on our disk which is already maxed out.
What bothers me is we have two indexes that look pretty darn similar
CREATE INDEX tracks_trackpoint_id ON tracks_trackpoint USING btree (id)
CREATE UNIQUE INDEX tracks_trackpoint_pkey ON tracks_trackpoint USING btree (id)
I have no idea what the history behind this is, but the first one seems quite redundant. What would the risk of dropping it be? This would buy us one year of storage.
You can drop the first index, it is totally redundant.
If your tables are 80GB and your indexes 230GB, I am ready to bet that you have too many indexes in your database.
Drop the indexes that are not used.
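A sketch of how to find candidates, assuming the statistics collector has been running long enough to be representative:
SELECT schemaname, relname, indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0  -- never used since stats were last reset
ORDER BY pg_relation_size(indexrelid) DESC;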
Careful as I am, I disabled the index to benchmark this, and the query seems to fall back nicely on the other index. I'll try a few variants.
appdb=# EXPLAIN analyze SELECT * FROM tracks_trackpoint where id=266082;
Index Scan using tracks_trackpoint_id on tracks_trackpoint (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: (id = 266082)
Total runtime: 0.040 ms
(3 rows)
appdb=# UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'tracks_trackpoint_id'::regclass;
appdb=# EXPLAIN analyze SELECT * FROM tracks_trackpoint where id=266082;
Index Scan using tracks_trackpoint_pkey on tracks_trackpoint (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: (id = 266082)
Total runtime: 0.036 ms
(3 rows)
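Once you're satisfied the planner falls back cleanly, the actual drop is a one-liner (CONCURRENTLY avoids blocking writes while the index is removed, and cannot run inside a transaction block):
DROP INDEX CONCURRENTLY tracks_trackpoint_id;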
From
https://cloud.google.com/bigquery/docs/partitioned-tables:
you can shard tables using a time-based naming approach such as [PREFIX]_YYYYMMDD
This enables me to do:
SELECT count(*) FROM `xxx.xxx.xxx_*`
and query across all the shards. Is there a special notation that queries only the latest shard? For example say I had:
xxx_20180726
xxx_20180801
could I do something along the lines of
SELECT count(*) FROM `xxx.xxx.xxx_{{ latest }}`
to query xxx_20180801?
SINGLE QUERY INSPIRED BY Mikhail Berlyant:
SELECT COUNT(*) AS c
FROM `XXX.PREFIX_*`
WHERE _TABLE_SUFFIX IN (
  SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX_') + 1)
  FROM `XXX.__TABLES_SUMMARY__`
  WHERE table_id LIKE 'PREFIX_%'
)
If you do care about cost (meaning how many tables will be scanned by your query), the only way to control it is to do it in two steps, like below.
First query
#standardSQL
SELECT SUBSTR(MAX(table_id), LENGTH('PREFIX_') + 1)
FROM `xxx.xxx.__TABLES_SUMMARY__`
WHERE table_id LIKE 'PREFIX_%'
Second Query
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '<result of first query>'
So, if the result of the first query is 20180801, the second query will obviously look like below:
#standardSQL
SELECT COUNT(*)
FROM `xxx.xxx.PREFIX_*`
WHERE _TABLE_SUFFIX = '20180801'
If you don't care about cost but rather just need the result - you can easily combine the above two queries into one. But remember: even though the result comes only from the last table, the cost is as if you queried all tables matching xxx.xxx.PREFIX_*.
Forgot to mention (even though it should be obvious): of course, when you have only COUNT(1) in your SELECT, the cost will be 0 (zero) for both options - but in reality you will most likely select something more valuable than just COUNT(1).
I know this is kind of an old thread, but I was surprised that no one offered an answer using variables.
Héctor Neri already mentioned this in the comments, but I thought it might be better to have an actual answer with sample code posted.
#standardSQL
DECLARE SHARD_DATE STRING;
SET SHARD_DATE=(
SELECT MAX(REPLACE(table_name,'{TABLE}_',''))
FROM `{PRJ}.{DATASET}.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE '{TABLE}_20%'
);
SELECT * FROM `{PRJ}.{DATASET}.{TABLE}_*`
WHERE _TABLE_SUFFIX = SHARD_DATE
Make sure to replace {PRJ}, {DATASET}, and {TABLE} values with your table location.
If you run this on BigQuery Web UI, you will see this message:
WARNING: Could not compute bytes processed estimate for script.
But you can see that the variable properly reduces the table scan to the latest shard and does not cause any extra cost after running the script.
Which of these is the more efficient query to run:
one where the include/exclude filter condition is in the WHERE clause and is tested for each row
SELECT distinct fullvisitorid
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t, unnest(hits) as ht
WHERE (select max(if(cd.index = 1,cd.value,null))from unnest(ht.customDimensions) cd)
= 'high_worth'
one returning all rows, with an outer SELECT clause doing all the include/exclude filtering
SELECT distinct fullvisitorid
FROM
(
SELECT
fullvisitorid
, (select max(if(cd.index = 1,cd.value,null)) FROM unnest(ht.customDimensions) cd) hit_cd_1
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t
, unnest(hits) as ht
)
WHERE
hit_cd_1 = 'high_worth'
Both produce exactly the same results!
The goal is: a list of fullvisitorids that have ever sent a hit-level custom dimension (index = 1) with value = 'high_worth'.
Thanks for your inputs!
Cheers!
/Vibhor
I tried the two queries and compared their explanations; they are identical. I am assuming some sort of optimization magic occurs prior to the query being run.
As for your original two queries: obviously they are identical, even though you slightly rearranged their appearance. So from those two you should choose whichever is easier for you to read/maintain. I would pick the first query - but it is really a matter of personal preference.
Meantime, try the below (BigQuery Standard SQL) - it looks slightly more optimized to me, but I didn't have a chance to test it on real data.
SELECT DISTINCT fullvisitorid
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t,
UNNEST(hits) AS ht, UNNEST(ht.customDimensions) cd
WHERE cd.index = 1 AND cd.value = 'high_worth'
Obviously, it should produce the same result as your two queries.
The execution plan looks better to me, and the query is faster and much easier to read/manage.
I have inherited a large legacy codebase which runs on Django 1.5, and my current task is to speed up a section of the site which takes ~1 min to load.
I profiled the app, and the culprit in particular is the following query (stripped for brevity):
SELECT COUNT(*) FROM "entities_entity" WHERE (
"entities_entity"."date_filed" <= '2016-01-21' AND (
UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
-- 34 more of these
UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
-- 34 more of these
)
)
which basically consists of a big LIKE query on two fields, entity_city_state_zip and agent_city_state_zip, which are character varying(200) NOT NULL fields.
That query is performed twice (!), taking 18814.02 ms each time, and one more time with the COUNT replaced by a SELECT, taking an extra 20216.49 ms (I'm going to cache the result of the COUNT).
The explain looks like this:
Aggregate (cost=175867.33..175867.34 rows=1 width=0) (actual time=17841.502..17841.502 rows=1 loops=1)
-> Seq Scan on entities_entity (cost=0.00..175858.95 rows=3351 width=0) (actual time=0.849..17818.551 rows=145075 loops=1)
Filter: ((date_filed <= '2016-01-21'::date) AND ((upper((entity_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((entity_city_state_zip)::text) ~~ '%BERKELEY%'::text) (..skipped..) OR (upper((agent_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BERKELEY%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BURLINGAME%'::text) ))
Rows Removed by Filter: 310249
Planning time: 2.110 ms
Execution time: 17841.944 ms
I've tried using an index on entity_city_state_zip and agent_city_state_zip using various combinations like:
CREATE INDEX ON entities_entity (upper(entity_city_state_zip));
CREATE INDEX ON entities_entity (upper(agent_city_state_zip));
or using varchar_pattern_ops, with no luck.
The server is using something like this:
qs = queryset.filter(Q(entity_city_state_zip__icontains = all_city_list) |
Q(agent_city_state_zip__icontains = all_city_list))
to generate that query.
I don't know what else to try.
Thanks!
I think the problem is in the multiple LIKEs and in the UPPER("entities_entity"...) calls.
You can use:
WHERE entities_entity.entity_city_state_zip SIMILAR TO '%Atherton%|%Berkeley%'
Or something like this:
WHERE entities_entity.entity_city_state_zip LIKE ANY(ARRAY['%Atherton%', '%Berkeley%'])
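One caveat: plain LIKE and SIMILAR TO are case-sensitive, while the original query wrapped both sides in UPPER(). ILIKE ANY keeps the case-insensitive behaviour without the function calls (a sketch):
WHERE entities_entity.entity_city_state_zip ILIKE ANY(ARRAY['%Atherton%', '%Berkeley%'])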
Edited
About Raw SQL query in Django:
https://docs.djangoproject.com/es/1.9/topics/db/sql/
How do I execute raw SQL in a django migration
Regards
I watched a course on Pluralsight that addressed a very similar issue. The course was "Postgres for .NET Developers" and this was in the section "Fun With Simple SQL", "Full Text Search."
To summarize their solution, using your example:
Create a new column in your table that will represent your entity_city_state_zip as a tsvector:
create table entities_entity (
date_filed date,
entity_city_state_zip text,
csz_search tsvector not null -- add this column
);
Initially you might have to make it nullable, then populate the data and make it non-nullable.
update entities_entity
set csz_search = to_tsvector (entity_city_state_zip);
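Once every row is populated, the constraint can be tightened as described above:
alter table entities_entity
alter column csz_search set not null;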
Next, create a trigger that will cause the new field to be populated any time a record is added or modified:
create trigger entities_insert_update
before insert or update on entities_entity
for each row execute procedure
tsvector_update_trigger(csz_search,'pg_catalog.english',entity_city_state_zip);
Your search queries can now query on the tsvector field rather than the city/state/zip field:
select * from entities_entity
where csz_search @@ to_tsquery('Atherton')
Some notes of interest on this:
to_tsquery, in case you haven't used it, is WAY more sophisticated than the example above. It allows AND conditions, partial matches, etc. - see the sketch after these notes
it is also case-insensitive, so there is no need for the UPPER calls you have in your query
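For example (a sketch, not taken from the course), to_tsquery accepts boolean operators and prefix matching:
select * from entities_entity
where csz_search @@ to_tsquery('Atherton | Berkeley');  -- match either city
select * from entities_entity
where csz_search @@ to_tsquery('Ather:*');  -- prefix match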
As a final step, put a GIN index on the tsvector column:
create index entities_entity_ix1 on entities_entity
using gin(csz_search);
If I understand the course right, this should make your query fly, and it will overcome the issue of a btree index's inability to work on a LIKE '%...' query.
Here is the explain plan on such a query:
Bitmap Heap Scan on entities_entity (cost=56.16..1204.78 rows=505 width=81)
Recheck Cond: (csz_search @@ to_tsquery('Atherton'::text))
-> Bitmap Index Scan on entities_entity_ix1 (cost=0.00..56.04 rows=505 width=0)
Index Cond: (csz_search @@ to_tsquery('Atherton'::text))
What does an MS Access SQL query for ordering by date look like? My current query is:
newVal.Format(_T("SELECT * FROM Table WHERE (CDATE(DateStart) BETWEEN #%s# AND #%s#) "), strDateVal, strDateVal2);
where strDateVal and strDateVal2 are CStrings resulting from formatting COleDateTime variables. In this form I get all the dates between strDateVal and strDateVal2 (e.g. 10/20/2013 and 10/25/2013), but I can't figure out a way to sort the result, ascending or descending.
I've tried using
ORDER BY DateStart ASC
ORDER BY=([DateStart] ASC)
ORDER BY (CDATE(DateStart)) ASC
but none worked; I get an empty result.
I found the answer, and it was quite simple and silly: the correct syntax is ORDER BY Table.Field ASC. So you have to use the table name even in a simple SELECT, as if you were doing a JOIN.
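Applied to the query in the question, the working statement would look something like this (a sketch using the example dates from the question):
SELECT * FROM Table
WHERE (CDATE(DateStart) BETWEEN #10/20/2013# AND #10/25/2013#)
ORDER BY Table.DateStart ASC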