Heroku Postgres Performance with RegEx - Django

I have an iPhone app connected to a Django server running on Heroku. The user taps a word (like "cape") and then queries the server for any other passages containing that word. So right now I do a SQL query with some RegEx:
SELECT "connectr_passage"."id", "connectr_passage"."third_party_id",
"connectr_passage"."third_party_created", "connectr_passage"."source",
"connectr_passage"."text", "connectr_passage"."author",
"connectr_passage"."raw_data", "connectr_passage"."retweet_count",
"connectr_passage"."favorited_count", "connectr_passage"."lang",
"connectr_passage"."location",
"connectr_passage"."author_followers_count",
"connectr_passage"."created", "connectr_passage"."modified" FROM
"connectr_passage" WHERE ("connectr_passage"."text" ~
E'(?i)\ycape\y' AND NOT ("connectr_passage"."text" ~ E'https?://'
))
On a table with about 412K rows of data, using the $9 'dev' database, this query takes 1320 ms. So for the app user it feels pretty slow, as the total response time is even higher.
With the exact same database on my local machine (MBP, 8 GB RAM, SSD), this query takes 629.214 ms.
I understand the dev database has some limitations (no in-memory cache and such), so my questions are:
Is there some way I can speed things up? Adding an index on the text column didn't seem to help.
Will upgrading to one of the production databases significantly improve this performance? They're pretty expensive for my needs.
Any other good alternatives for hosting a database connected to Heroku that you know about?
Any recommended alternatives to doing a regex SQL query to search for terms? I was thinking about creating a custom index store of words or something; maybe there's a plugin for that somewhere. Haystack?
----- Edit -----
Here is what the elephant has to say about my query:
Sort (cost=16979.75..16979.83 rows=34 width=175) (actual time=616.131..616.132 rows=18 loops=1)
Sort Key: author_followers_count
Sort Method: quicksort Memory: 30kB
-> Seq Scan on connectr_passage (cost=0.00..16978.89 rows=34 width=175) (actual time=10.863..616.027 rows=18 loops=1)
Filter: (((text)::text ~ '(?i)\\ycape\\y'::text) AND ((text)::text !~ 'https?://'::text))
Total runtime: 616.229 ms
So it looks like it is doing a full table scan, which means the index isn't being used. I'm a Postgres newbie and not sure I have this right, but here is my index (created by setting db_index=True in the Django model):
public | connectr_passage_text | index | connectr | connectr_passage
Another edit:
Here is the latest - after using the pg_trgm add-on.
create extension pg_trgm;
create index passage_trgm_gin on connectr_passage using gin (text gin_trgm_ops);
First attempt:
d2lgd5pcso4g2k=> explain analyze select * from connectr_passage where text ~ E'cape\y';
QUERY PLAN
Seq Scan on connectr_passage (cost=0.00..28627.30 rows=95 width=177) (actual time=2647.828..2647.828 rows=0 loops=1)
Filter: ((text)::text ~ 'capey'::text)
Rows Removed by Filter: 970514
Total runtime: 2647.866 ms
(4 rows)
Damn, still super slow. But wait, what if I do a simple filter before the regex:
d2lgd5pcso4g2k=> explain analyze select * from connectr_passage where text like '%cape%' and text ~ E'(?i)\ycape\y';
QUERY PLAN
Bitmap Heap Scan on connectr_passage (cost=578.14..762.70 rows=1 width=177) (actual time=11.432..11.432 rows=0 loops=1)
Recheck Cond: ((text)::text ~~ '%cape%'::text)
Rows Removed by Index Recheck: 165
Filter: ((text)::text ~ '(?i)ycapey'::text)
Rows Removed by Filter: 468
-> Bitmap Index Scan on passage_trgm_gin (cost=0.00..578.14 rows=95 width=0) (actual time=8.845..8.845 rows=633 loops=1)
Index Cond: ((text)::text ~~ '%cape%'::text)
Total runtime: 11.479 ms
(8 rows)
Superfast!

So this is pretty much solved thanks to mu-is-too-short's suggestion and a bit of googling. Basically, PostgreSQL's pg_trgm extension solved the problem and led to an 800x faster query!
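For completeness, here is a minimal sketch of how the same combined filter could be expressed through the Django ORM, assuming a model named Passage in a connectr app with a text field (the model and function names are assumptions, not taken from the question). The contains lookup emits the LIKE '%...%' clause that the pg_trgm GIN index can serve, and the iregex lookup applies the case-insensitive word-boundary check as a recheck, mirroring the raw SQL above:

import re

from connectr.models import Passage  # assumed model name


def passages_containing(word):
    """Passages containing `word` as a whole word, excluding ones with links."""
    escaped = re.escape(word)
    return (
        Passage.objects
        .filter(text__contains=word)                      # LIKE '%word%' -> trigram index scan
        .filter(text__iregex=r'\y{0}\y'.format(escaped))  # word-boundary recheck
        .exclude(text__regex=r'https?://')                # drop passages containing URLs
        .order_by('-author_followers_count')
    )

Note that, just like the fast query in the edit above, the LIKE prefilter is case-sensitive while the regex is not, so capitalized matches such as "Cape" would be skipped unless icontains (with a matching expression index) is used instead.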

Related

Django PostgreSQL double index cleanup

We've got a table in our database with 80 GB of data and 230 GB of indexes. We are constrained on disk, which is already maxed out.
What bothers me is that we have two indexes that look pretty darn similar:
CREATE INDEX tracks_trackpoint_id ON tracks_trackpoint USING btree (id)
CREATE UNIQUE INDEX tracks_trackpoint_pkey ON tracks_trackpoint USING btree (id)
I have no idea what the history behind this is, but the first one seems quite redundant. What would be the risk of dropping it? It would buy us a year of storage.
You can drop the first index; it is totally redundant.
If your tables are 80GB and your indexes 230GB, I am ready to bet that you have too many indexes in your database.
Drop the indexes that are not used.
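As a starting point for finding indexes that are not used, here is a rough sketch (not from the original answer) that lists indexes PostgreSQL has never scanned, via Django's raw cursor and the standard pg_stat_user_indexes statistics view. Treat the output with care: the statistics may have been reset, and indexes used only on replicas or for enforcing constraints will still show idx_scan = 0.

from django.db import connection

# Indexes with zero recorded scans, largest first. idx_scan counts index
# scans since the statistics were last reset.
UNUSED_INDEXES_SQL = """
    SELECT schemaname,
           relname      AS table_name,
           indexrelname AS index_name,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
           idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY pg_relation_size(indexrelid) DESC;
"""

def list_unused_indexes():
    with connection.cursor() as cursor:
        cursor.execute(UNUSED_INDEXES_SQL)
        return cursor.fetchall()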
Careful as I am, I disabled the index to benchmark this, and the query seems to fall back nicely to the other index. I'll try a few variants.
appdb=# EXPLAIN analyze SELECT * FROM tracks_trackpoint where id=266082;
Index Scan using tracks_trackpoint_id on tracks_trackpoint (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: (id = 266082)
Total runtime: 0.040 ms
(3 rows)
appdb=# UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'tracks_trackpoint_id'::regclass;
appdb=# EXPLAIN analyze SELECT * FROM tracks_trackpoint where id=266082;
Index Scan using tracks_trackpoint_pkey on tracks_trackpoint (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: (id = 266082)
Total runtime: 0.036 ms
(3 rows)
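If the benchmark holds up, the actual drop could be done in a Django migration along these lines. This is only a sketch: the app label and dependency are assumptions, RunSQL needs Django 1.7+, and atomic = False (needed because DROP INDEX CONCURRENTLY cannot run inside a transaction) needs Django 1.10+.

from django.db import migrations


class Migration(migrations.Migration):
    # CONCURRENTLY cannot run inside a transaction block.
    atomic = False

    dependencies = [
        ('tracks', '0001_initial'),  # assumed previous migration
    ]

    operations = [
        migrations.RunSQL(
            sql="DROP INDEX CONCURRENTLY IF EXISTS tracks_trackpoint_id;",
            reverse_sql=(
                "CREATE INDEX CONCURRENTLY tracks_trackpoint_id "
                "ON tracks_trackpoint USING btree (id);"
            ),
        ),
    ]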

What is the difference between Scan and Query in DynamoDB? When should you use Scan, and when Query?

A Query operation, as described in the DynamoDB documentation:
A query operation searches only primary key attribute values and supports a subset of comparison operators on key attribute values to refine the search process.
and the scan operation:
A scan operation scans the entire table. You can specify filters to apply to the results to refine the values returned to you, after the complete scan.
Which is best based on performance and cost?
When creating a DynamoDB table, choose the primary key and Local Secondary Indexes (LSIs) so that a Query operation returns the items you want.
Query operations only support an equality comparison on the partition key, but allow conditional operators (=, <, <=, >, >=, BETWEEN, BEGINS_WITH) on the sort key.
Scan operations are generally slower and more expensive as the operation has to iterate through each item in your table to get the items you are requesting.
Example:
Table: CustomerId, AccountType, Country, LastPurchase
Primary Key: CustomerId + AccountType
In this example, you can use a Query operation to get:
A CustomerId with a conditional filter on AccountType
A Scan operation would need to be used to return:
All Customers with a specific AccountType
Items based on conditional filters by Country, i.e. all customers from the USA
Items based on conditional filters by LastPurchase, i.e. all customers that made a purchase in the last month
To avoid Scan operations for frequently used access patterns, create a Local Secondary Index (LSI) or Global Secondary Index (GSI).
Example:
Table: CustomerId, AccountType, Country, LastPurchase
Primary Key: CustomerId + AccountType
GSI: AccountType + CustomerId
LSI: CustomerId + LastPurchase
In this example a Query operation can allow you to get (see the sketch after this list):
A CustomerId with a conditional filter on AccountType
[GSI] A conditional filter on CustomerIds for a specific AccountType
[LSI] A CustomerId with a conditional filter on LastPurchase
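To make the difference concrete, here is a hedged boto3 sketch using the example schema above (the table name, GSI name, and literal values are assumptions, not from the original answer). The Query touches only one CustomerId partition, the GSI query addresses the AccountType + CustomerId index directly, and the Scan reads the entire table and only filters afterwards:

import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Customers')  # assumed table name

# Query: reads only the items under one partition key, optionally narrowed
# by a condition on the sort key (AccountType).
by_customer = table.query(
    KeyConditionExpression=Key('CustomerId').eq('12345')
                           & Key('AccountType').begins_with('premium')
)

# Query against the assumed GSI (AccountType + CustomerId): all customers
# with a given AccountType, without scanning the base table.
by_account_type = table.query(
    IndexName='AccountType-CustomerId-index',  # assumed GSI name
    KeyConditionExpression=Key('AccountType').eq('premium')
)

# Scan: reads every item in the table and applies the filter afterwards.
# The filter reduces what is returned, not what is read (or billed).
from_usa = table.scan(
    FilterExpression=Attr('Country').eq('USA')
)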
Suppose your DynamoDB table has customer_country as its partition key/primary key. If you use Query, customer_country is a mandatory part of the key condition, and any additional filters apply only to items within that customer_country.
If you perform a table Scan, the filter is applied across all partition keys: the operation first fetches all the data and only then applies the filter to what it read from the table.
For example, here customer_country is the partition key/primary key and id is the sort key:
-----------------------------------
customer_country | name | id
-----------------------------------
VV | Tom | 1
VV | Jack | 2
VV | Mary | 4
BB | Nancy | 5
BB | Lom | 6
BB | XX | 7
CC | YY | 8
CC | ZZ | 9
------------------------------------
If you perform a Query operation, it applies to a single customer_country value only, and only the equality operator (=) is allowed on it, so only items whose partition key equals that value are fetched.
If you perform a Scan operation, it fetches all items in the table and filters the data only after reading it.
Note: avoid Scan operations where you can; they can easily exceed your provisioned read capacity (RCUs).
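To illustrate that note, a Scan is also paginated: each call returns at most 1 MB of data plus a LastEvaluatedKey, so filtering the whole table means looping over pages and paying read capacity for every page, matches or not. A rough boto3 sketch against the example table (the table name is an assumption):

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource('dynamodb').Table('customers_by_country')  # assumed name

# Query: touches only the 'VV' partition.
vv_items = table.query(
    KeyConditionExpression=Key('customer_country').eq('VV')
)['Items']

# Scan: reads every item in every partition, page by page, and only then
# filters by name. Every page consumes read capacity regardless of matches.
matches, start_key = [], None
while True:
    kwargs = {'FilterExpression': Attr('name').eq('Tom')}
    if start_key:
        kwargs['ExclusiveStartKey'] = start_key
    page = table.scan(**kwargs)
    matches.extend(page['Items'])
    start_key = page.get('LastEvaluatedKey')
    if not start_key:
        break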
It is similar to a relational database.
For a Query you use the primary key in the condition; the computational complexity is about O(log n), since most key structures are tree-based.
With a Scan you have to read the whole table and then apply the filter to every single row to find the result, so the performance is O(n); that is much slower if your table is big.
In short, use Query whenever you know the primary key, and reserve Scan for the worst case.
Also, think about Global Secondary Indexes to support different kinds of queries on different keys and meet your performance objectives.
In terms of performance, I think it's good practice to design your table so applications can use Query instead of Scan, because a Scan operation always reads the entire table before filtering out the desired values, which means more time and capacity spent on data operations. For more information, please refer to the official documentation.
Query is much better than Scan, performance-wise. Scan, as its name implies, will scan the whole table. But you must be well aware of the table's key, sort key, indexes and their related sort keys in order to know when you can use Query.
If you filter your query using:
a key
a key and its sort key
an index
an index and its related sort key
then use Query! Otherwise use Scan, which is more flexible about which columns you can filter on.
You can NOT Query if the filter uses:
more than 2 fields (e.g. key, sort key and index)
a sort key only (of the primary key or an index)
regular fields (not a key, index or sort key)
a mixed index and sort key (index1 with the sort key of index2)
...
a good explanation:
https://medium.com/#amos.shahar/dynamodb-query-vs-scan-sql-syntax-and-join-tables-part-1-371288a7cb8f

Improving query speed: simple SELECT with LIKE

I have inherited a large legacy codebase which runs on Django 1.5, and my current task is to speed up a section of the site which takes ~1 min to load.
I profiled the app, and the culprit in particular is the following query (stripped for brevity):
SELECT COUNT(*) FROM "entities_entity" WHERE (
"entities_entity"."date_filed" <= '2016-01-21' AND (
UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
-- 34 more of these
UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
-- 34 more of these
)
)
which basically consists of a big LIKE query on two fields, entity_city_state_zip and agent_city_state_zip, both character varying(200) NOT NULL columns.
That query is executed twice (!), taking 18814.02 ms each time, and once more with the COUNT replaced by a SELECT, taking an extra 20216.49 ms (I'm going to cache the result of the COUNT).
The explain looks like this:
Aggregate (cost=175867.33..175867.34 rows=1 width=0) (actual time=17841.502..17841.502 rows=1 loops=1)
-> Seq Scan on entities_entity (cost=0.00..175858.95 rows=3351 width=0) (actual time=0.849..17818.551 rows=145075 loops=1)
Filter: ((date_filed <= '2016-01-21'::date) AND ((upper((entity_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((entity_city_state_zip)::text) ~~ '%BERKELEY%'::text) (..skipped..) OR (upper((agent_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BERKELEY%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BURLINGAME%'::text) ))
Rows Removed by Filter: 310249
Planning time: 2.110 ms
Execution time: 17841.944 ms
I've tried using an index on entity_city_state_zip and agent_city_state_zip using various combinations like:
CREATE INDEX ON entities_entity (upper(entity_city_state_zip));
CREATE INDEX ON entities_entity (upper(agent_city_state_zip));
or using varchar_pattern_ops, with no luck.
The server is using something like this:
qs = queryset.filter(Q(entity_city_state_zip__icontains = all_city_list) |
Q(agent_city_state_zip__icontains = all_city_list))
to generate that query.
I don't know what else to try.
Thanks!
I think the problem is in the multiple LIKEs and in the UPPER("entities_entity"...) calls.
You can use:
WHERE entities_entity.entity_city_state_zip SIMILAR TO '%Atherton%|%Berkeley%'
Or something like this:
WHERE entities_entity.entity_city_state_zip LIKE ANY(ARRAY['%Atherton%', '%Berkeley%'])
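Since the ORM in Django 1.5 cannot emit LIKE ANY directly, one option (a sketch, not from the original answer; the function name is made up) is a raw query with the city list passed as an array parameter. ILIKE stands in for the UPPER(...) LIKE UPPER(...) pattern in the original query:

from django.db import connection

def count_entities_in_cities(cities, date_filed):
    # psycopg2 adapts a Python list to a PostgreSQL array, so each row is
    # compared against every '%City%' pattern in one ILIKE ANY(...) test.
    patterns = ['%{0}%'.format(c) for c in cities]
    sql = """
        SELECT COUNT(*)
        FROM entities_entity
        WHERE date_filed <= %s
          AND (entity_city_state_zip ILIKE ANY(%s)
               OR agent_city_state_zip ILIKE ANY(%s))
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, [date_filed, patterns, patterns])
        return cursor.fetchone()[0]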
Edited
About raw SQL queries in Django:
https://docs.djangoproject.com/es/1.9/topics/db/sql/
How do I execute raw SQL in a django migration
Regards
I watched a course on Pluralsight that addressed a very similar issue. The course was "Postgres for .NET Developers", in the section "Fun With Simple SQL", "Full Text Search".
To summarize their solution, using your example:
Create a new column in your table that will represent your entity_city_state_zip as a tsvector:
create table entities_entity (
date_filed date,
entity_city_state_zip text,
csz_search tsvector not null -- add this column
);
Initially you might have to make it nullable, then populate the data and make it non-nullable.
update entities_entity
set csz_search = to_tsvector (entity_city_state_zip);
Next, create a trigger that will cause the new field to be populated any time a record is added or modified:
create trigger entities_insert_update
before insert or update on entities_entity
for each row execute procedure
tsvector_update_trigger(csz_search,'pg_catalog.english',entity_city_state_zip);
Your search queries can now query on the tsvector field rather than the city/state/zip field:
select * from entities_entity
where csz_search @@ to_tsquery('Atherton')
Some notes of interest on this:
to_tsquery, in case you haven't used it, is WAY more sophisticated than the example above. It allows AND conditions, partial matches, etc.
it is also case-insensitive, so there is no need for the UPPER calls you have in your query
As a final step, put a GIN index on the tsvector field:
create index entities_entity_ix1 on entities_entity
using gin(csz_search);
If I understand the course right, this should make your query fly, and it will overcome the issue of a B-tree index's inability to help with a LIKE '%...' query.
Here is the explain plan on such a query:
Bitmap Heap Scan on entities_entity (cost=56.16..1204.78 rows=505 width=81)
Recheck Cond: (csz_search @@ to_tsquery('Atherton'::text))
-> Bitmap Index Scan on entities_entity_ix1 (cost=0.00..56.04 rows=505 width=0)
Index Cond: (csz_search @@ to_tsquery('Atherton'::text))
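On the Django 1.5 side, the tsvector column can then be queried with raw SQL, roughly like this (a sketch; the function name is made up, and single-word city names are assumed, since multi-word names would need per-term quoting or plainto_tsquery):

from django.db import connection

def count_entities_fulltext(cities, date_filed):
    # to_tsquery's | operator ORs the terms, so the long LIKE list collapses
    # into a single indexed full-text match on the csz_search column.
    ts_query = ' | '.join(cities)  # e.g. 'Atherton | Berkeley | ...'
    sql = """
        SELECT COUNT(*)
        FROM entities_entity
        WHERE date_filed <= %s
          AND csz_search @@ to_tsquery('pg_catalog.english', %s)
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, [date_filed, ts_query])
        return cursor.fetchone()[0]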

Slow PostgreSQL query not using index

I have a simple Django site, using a PostgreSQL 9.3 database, with a single table storing user accounts (e.g. name, email, address, phone, active, etc). However, my user model is fairly large, and has around 2.6 million records. I noticed Django's admin was a little slow, so using django-debug-toolbar, I noticed that almost all queries ran in under 1 ms, except for:
SELECT COUNT(*) FROM "myapp_myuser" WHERE "myapp_myuser"."active" = true;
which took about 7000 ms. However, the active column is indexed using Django's standard db_index=True, which generates the index:
CREATE INDEX myapp_myuser_active
ON myapp_myuser
USING btree
(active);
Checking out the query with EXPLAIN via:
EXPLAIN ANALYZE VERBOSE
SELECT COUNT(*) FROM "myapp_myuser" WHERE "myapp_myuser"."active" = true;
returns:
Aggregate (cost=109305.45..109305.46 rows=1 width=0) (actual time=7342.973..7342.974 rows=1 loops=1)
Output: count(*)
-> Seq Scan on public.myapp_myuser (cost=0.00..102638.16 rows=2666916 width=0) (actual time=0.035..4765.059 rows=2666337 loops=1)
Output: id, created, category_id, name, email, address_1, address_2, city, active, (...)
Filter: myapp_myuser.active
Total runtime: 7343.031 ms
It appears to not be using the index at all. Am I reading this right?
Running just SELECT COUNT(*) FROM "myapp_myuser" completed in about 500 ms. Why such a disparity in run times, even though the only column being used is indexed?
How can I better optimize this query?
You're selecting a lot of columns out of a wide table. So this might not help, even though it does result in a bitmap index scan.
Try a partial index.
create index on myapp_myuser (active) where active = true;
I made a test table with a couple million rows.
explain analyze verbose
select count(*) from test where active = true;
"Aggregate (cost=41800.79..41800.81 rows=1 width=0) (actual time=500.756..500.756 rows=1 loops=1)"
" Output: count(*)"
" -> Bitmap Heap Scan on public.test (cost=8085.76..39307.79 rows=997200 width=0) (actual time=126.233..386.834 rows=1000000 loops=1)"
" Output: id, active"
" Filter: test.active"
" -> Bitmap Index Scan on test_active_idx1 (cost=0.00..7836.45 rows=497204 width=0) (actual time=123.398..123.398 rows=1000000 loops=1)"
" Index Cond: (test.active = true)"
"Total runtime: 500.794 ms"
When you write queries that you hope will use a partial index, you need to match the expression in the index's WHERE clause. Using WHERE active IS TRUE is valid in PostgreSQL, but it doesn't match the WHERE active = true clause in the partial index. That means you'll get a sequential scan again.
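A hedged sketch of how the partial index could be added from the Django side via a migration (RunSQL needs Django 1.7+; the app label, dependency and index name are assumptions):

from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('myapp', '0001_initial'),  # assumed previous migration
    ]

    operations = [
        migrations.RunSQL(
            sql=(
                "CREATE INDEX myapp_myuser_active_true "
                "ON myapp_myuser (active) WHERE active = true;"
            ),
            reverse_sql="DROP INDEX IF EXISTS myapp_myuser_active_true;",
        ),
    ]

The admin's MyUser.objects.filter(active=True).count() then produces a WHERE active = true condition, which matches the partial index's predicate, in line with the note above.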

Slow Postgres JOIN Query

I'm trying to optimize a slow query that was generated by the Django ORM. It is a many-to-many query. It takes over 1 min to run.
The tables have a good amount of data, but they aren't huge (400k rows in sp_article and 300k rows in sp_article_categories)
#categories.article_set.filter(post_count__lte=50)
EXPLAIN ANALYZE SELECT *
FROM "sp_article"
INNER JOIN "sp_article_categories" ON ("sp_article"."id" = "sp_article_categories"."article_id")
WHERE ("sp_article_categories"."category_id" = 1081
AND "sp_article"."post_count" <= 50 )
Nested Loop (cost=0.00..6029.01 rows=656 width=741) (actual time=0.472..25.724 rows=1266 loops=1)
-> Index Scan using sp_article_categories_category_id on sp_article_categories (cost=0.00..848.82 rows=656 width=12) (actual time=0.015..1.305 rows=1408 loops=1)
Index Cond: (category_id = 1081)
-> Index Scan using sp_article_pkey on sp_article (cost=0.00..7.88 rows=1 width=729) (actual time=0.014..0.015 rows=1 loops=1408)
Index Cond: (sp_article.id = sp_article_categories.article_id)
Filter: (sp_article.post_count <= 50)
Total runtime: 26.536 ms
I have an index on:
sp_article_categories.article_id (type: btree)
sp_article_categories.category_id
sp_article.post_count (type: btree)
Any suggestions on how I can tune this to get the query speedy?
Thanks!
You've provided the vital information here - the explain analyse. That isn't showing a one-minute runtime though; it's showing about 26 milliseconds. So either that isn't the query actually being run, or the problem is elsewhere.
The only difference between explain analyse and a real application is that the results aren't actually returned to the client. You would need a lot of data to slow things down that much, though.
The other suggestions are all off the mark since they're ignoring the fact that the query isn't slow. You have the relevant indexes (both sides of the join are using an index scan) and the planner is perfectly capable of filtering on the category table first (that's the whole point of having a half decent query planner).
So - you first need to figure out what exactly is slow...
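One low-tech way (not part of the original answer) to see where the time actually goes is to capture the queries Django ran, with their timings, around the slow code path. This only works with DEBUG = True (or in a shell/test setting), since connection.queries is only populated then:

from django.db import connection, reset_queries

reset_queries()

# Force evaluation of the queryset from the question.
articles = list(categories.article_set.filter(post_count__lte=50))

# Each entry records the SQL and the time (in seconds) Django measured for it.
for q in connection.queries:
    print(q['time'], q['sql'][:120])

If the ORM query really does come back in tens of milliseconds here, the slowness is most likely in rendering or in transferring the wide rows (width=729) rather than in PostgreSQL.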
Put an index on sp_article_categories.category_id
From a pure SQL perspective, your join is more efficient if your base table has fewer rows in it, and the WHERE conditions are performed on that table before it joins to another.
So see if you can get Django to select from the categories first, then filter the category_id before joining to the article table.
Pseudo-code follows:
SELECT * FROM categories c
INNER JOIN articles a
ON c.category_id = 1081
AND c.category_id = a.category_id
And put an index on category_id like Steven suggests.
You can use explicit field names instead of * too.
select [fields] from....
I assume you have run ANALYZE on the database to get fresh statistics.
It seems that the join between sp_article.id and sp_article_categories.article_id is costly. What data type is the article id - numeric? If it isn't, you should perhaps consider making it numeric - integer or bigint, whatever suits your needs. It can make a big difference in performance, in my experience. Hope it helps.
Cheers!
// John