Text searching in Django with trigram - django

I want to speed up search results in my application, however, I keep getting the same results no matter what method I use. Since it's Django application I'll provide both ORM commands and generated SQL code (PostgreSQL is used).
First, I have enabled GIN indexing and trigram operations on the database:
Second, I have create table that contains 2 varchar columns: first_name and last_name (plus an id field as primary key).
from django.db import models
class Author(models.Model):
first_name = models.CharField(max_length=100)
last_name = models.CharField(max_length=100)
I have also filled the database with 952 example records so that I don't have a situation, where Postgres avoids using index because of too small data set.
Next, I run following queries on non-indexed data.
Simple LIKE query:
In [50]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [51]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=T
...: rue))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.011..0.242 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.042 ms
Execution Time: 0.249 ms
Trigram similar:
In [55]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [56]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(ana
...: lyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.582..0.582 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.033 ms
Execution Time: 0.591 ms
And a more fancy query with sorting results:
In [58]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [59]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.680..0.683 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.021..0.657 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.062 ms
Execution Time: 0.693 ms
Next step was to create an index:
class Author(models.Model):
first_name = models.CharField(max_length=100)
last_name = models.CharField(max_length=100)
class Meta:
indexes = [GinIndex(fields=['last_name'])]
This resulted in a following SQL migration:
./manage.py sqlmigrate reviews 0004
BEGIN;
--
-- Alter field score on review
--
--
-- Create index reviews_aut_last_na_a89a84_gin on field(s) last_name of model author
--
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
COMMIT;
And now I run the same commands.
LIKE:
In [60]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [61]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=T
...: rue))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.009..0.237 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.089 ms
Execution Time: 0.244 ms
Trigram similar:
In [62]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [63]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(ana
...: lyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.740..0.740 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.056 ms
Execution Time: 0.750 ms
And the more complex query:
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [65]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari
...: ')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.659..0.662 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.024..0.643 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.052 ms
Execution Time: 0.674 ms
The changes in execution times seem to be insignificant. In the case of the last query the scan takes 0.643 units compared to 0.657 in the previous case. Times also differ by 0.02 miliseconds (and the second query run even a bit slower). Is there some option that I am missing that should be enabled to help with the performance? Is it too simple data set?
Docs I used:
Django's docs on text searching
Gitlab's docs on trigrams
EDIT
I've added a few houndred records (now there are nearly 259 000 records) and run tests again. First without indexes:
In [59]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.018..58.630 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.046 ms
Execution Time: 58.662 ms
In [60]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.555..80.710 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.503..78.743 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.039 ms
Execution Time: 80.740 ms
In [61]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.214..168.876 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.022..165.806 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.052 ms
Execution Time: 169.319 ms
And with it:
In [62]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.015..59.366 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.072 ms
Execution Time: 59.395 ms
In [63]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.545..80.337 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.292..78.502 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.035 ms
Execution Time: 80.369 ms
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.191..168.890 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.029..165.743 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.054 ms
Execution Time: 169.340 ms
Still very similar times and it seems to be avoiding using the gin index.

CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
This did not create a trigram index. It created a GIN index on the whole string, using the operators from btree_gin (which you don't seem to be using for any good purpose). To make a trigram index, it would need to look like this:
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name" gin_trgm_ops);
But I don't know how to get django to do that, I'm not a Django user.

Related

Sum distinct values based on two tables

I have two tables which looks like this:
Scenario
Key
Var Cost
First
full_New York_Automated
10000
First
full_New York_Automated
20000
First
full_Boston_Manual
12000
First
full_Boston_Manual
24000
Second
full_New York_Manual
12000
Second
full_New York_Manual
25000
Second
full_Dallas_Manual
12000
and:
Key
Fixed Cost
full_New York_Automated
40000
full_Boston_Manual
10000
full_Dallas_Manual
20000
full_New York_Manual
15000
I need to show in a card the total fixed cost for each scenario (which will be selected in a slicer). For example, if I select the "First" Scenario, my card will show "50000".
How can I do it?
Thanks.

DynamoDB date GSI

I have a DynamoDB table that stores executions of some programs, this is what it looks like:
Partition Key
Sort Key
StartDate
...
program-name
execution-id (uuid)
YYYY-MM-DD HH:mm:ss
...
I have two query scenarios for this table:
Query by program name and execution id (easy)
Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59
What is the correct way to perform the second query?
I understand I need to create a GSI to do that, but how should this GSI look like?
I was thinking about splitting the StartDate attribute into two, like this:
Partition Key
Sort Key
StartMonthYear
StartDayTime
...
program-name
execution-id (uuid)
YYYY-MM
DD HH:mm:ss
...
So I can define a GSI using the StartMonthYear as the partition key and the StartDayTime as the sort key.
The only problem with this approach is that I would have to write some extra logic in my application to identify all the partitions I would need to query in the requested range. For example:
If the range is: 2021-05-15 00:00:00 to 2021-07-15 23:59:59
I would need to query 2021-05, 2021-06 and 2021-07 partitions with the respective day/time restrictions (only the first and last partition is this example).
Is this the correct way of doing this or am I totally wrong?
If you quickly want to fetch all executions in a certain time-frame no matter the program, there are a few ways to approach this.
The easiest solution would be a setup like this:
PK
SK
GSI1PK
GSI1SK
StartDate
PROG#<name>
EXEC#<uuid>
ALL_EXECUTIONS
S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid>
yyyy-mm-ddThh:mm:ss
PK is the partition key for the base table
SK is the sort key for the base table
GSI1PK is the partition key for the global secondary index GSI1
GSI1SK is the sort key for the global secondary index GSI1
Query by program name and execution id (easy)
Still easy, do a GetItem based on the program name for <name> and uuid for <uuid>.
Query by start date range, for example: all executions from 2021-05-15 00:00:00 to 2021-07-15 23:59:59
Do a Query on GSI1 with the KeyConditionExpression: PK = ALL_EXECUTIONS AND SK >= 'S#2021-05-15 00:00:00' AND SK <= 'S#2021-07-15 23:59:59'. This would return all the executions in the given time range.
But: You'll also build a hot partition, since you effectively write all your data in a single partition in GSI1.
To avoid that, we can partition the data a bit and the partitioning depends on the number of executions you're dealing with. You can choose years, months, days, hours, minutes or seconds.
Instead of GSI1PK just being ALL_EXECUTIONS, we can set it to a subset of the StartDate.
PK
SK
GSI1PK
GSI1SK
StartDate
PROG#<name>
EXEC#<uuid>
EXCTS#<yyyy-mm>
S#<yyyy-mm-ddThh:mm:ss>#EXEC<uuid>
yyyy-mm-ddThh:mm:ss
In this case you'd have a monthly partition, i.e.: all executions per month are grouped. Now you would have to make multiple queries to DynamoDB and later join the results.
For the query range from 2021-05-15 00:00:00 to 2021-07-15 23:59:59 you'd have to do these queries on GSI1:
#GSI1: GSI1PK=EXCTS#2021-05 AND GSI1SK >= S#2021-05-15 00:00:00
#GSI1: GSI1PK=EXCTS#2021-06
#GSI1: GSI1PK=EXCTS#2021-07 AND GSI1SK <= S#2021-07-15 23:59:59
You can even parallelize these and later join the results together.
Again: Your partitioning scheme depends on the number of executions you have in a day and also which maximum query ranges you want to support.
This is a long-winded way of saying that your approach is correct in principle, but you can choose to tune it based on your use case.

does django db_index=True index null value?

for example if i have a field name slug = models.CharField(null=True, db_index=True,max_length=50) and while saving data if left slug empty. will database index this saved null value?
Yes Postgresql does index NULL values.
Here is a small test case:
select version();
version
-----------------------------------------------------------------------------------------------------------
PostgreSQL 9.5.21 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)
create table t(c1 serial, c2 text);
CREATE TABLE
insert into t(c2) select generate_series(1,1000000);
INSERT 0 1000000
create index on t(c2);
CREATE INDEX
analyze t;
ANALYZE
update t set c2=null where c1=123456;
UPDATE 1
explain analyze select count(*) from t where c2 is null;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
-------
Aggregate (cost=5.76..5.77 rows=1 width=0) (actual time=0.009..0.009 rows=1 loops=1)
-> Index Only Scan using t_c2_idx on t (cost=0.42..5.76 rows=1 width=0) (actual time=0.006..0.006 rows=1 lo
ops=1)
Index Cond: (c2 IS NULL)
Heap Fetches: 1
Planning time: 0.271 ms
Execution time: 0.035 ms
(6 rows)

How to make sum query with type casting and calculation in django views?

I'm calculating the sold items cost in django views and in django signals, and I want to calculate sold items cost on the fly. Price and quantity fields are integers. How can I convert one of them to the float and make sum query with some calculations like these sql queries below?
SELECT sum((t.price::FLOAT * t.quantity) / 1000) as cost FROM public."sold" t;
SELECT
t.id, t.price, t.quantity, sum((price::FLOAT * quantity) / 1000) as cost
FROM public."sold" t
GROUP BY t.id;
EDIT: Of course expected results are querysets of django
I expected the output of first query
cost
-----------------
5732594.000000002
and I expected the output of second query
id price quantity cost
------------------------------
846 1100 5000 5500
790 1500 1000 1500
828 2600 1000 2600
938 1000 5000 5000
753 1500 2000 3000
652 5000 1520 7600
EDIT 2: I solved this issue via raw() method like MyModel.objects.raw(
'SELECT sum((t.price::FLOAT * t.quantity) / 1000) as cost '
'FROM public."sold" t'
) instead of pythonic way
You'll need to have a look into a couple of things to do that. The first one will be aggregation and annotation. Also, you'll need to look into Cast and F functions. Please have a look at the links below:
https://docs.djangoproject.com/en/2.2/topics/db/aggregation/
https://docs.djangoproject.com/en/2.2/ref/models/database-functions/#cast
https://docs.djangoproject.com/en/2.2/ref/models/expressions/
DISCLAIMER: This is an example and might not work
Your queryset will look something like this:
from django.db.models import FloatField, Sum
from django.db.models.functions import Cast
qs = MyModel.objects.annotate(cost=Sum(F(Cast('price', FloatField())) * F('quantity') / 1000))

SparkSQL on pyspark: how to generate time series?

I'm using SparkSQL on pyspark to store some PostgreSQL tables into DataFrames and then build a query that generates several time series based on a start and stop columns of type date.
Suppose that my_table contains:
start | stop
-------------------------
2000-01-01 | 2000-01-05
2012-03-20 | 2012-03-23
In PostgreSQL it's very easy to do that:
SELECT generate_series(start, stop, '1 day'::interval)::date AS dt FROM my_table
and it will generate this table:
dt
------------
2000-01-01
2000-01-02
2000-01-03
2000-01-04
2000-01-05
2012-03-20
2012-03-21
2012-03-22
2012-03-23
but how to do that using plain SparkSQL? Will it be necessary to use UDFs or some DataFrame methods?
EDIT
This creates a dataframe with one row containing an array of consecutive dates:
from pyspark.sql.functions import sequence, to_date, explode, col
spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date")
+------------------------------------------+
| date |
+------------------------------------------+
| ["2018-01-01","2018-02-01","2018-03-01"] |
+------------------------------------------+
You can use the explode function to "pivot" this array into rows:
spark.sql("SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month) as date").withColumn("date", explode(col("date"))
+------------+
| date |
+------------+
| 2018-01-01 |
| 2018-02-01 |
| 2018-03-01 |
+------------+
(End of edit)
Spark v2.4 support sequence function:
sequence(start, stop, step) - Generates an array of elements from start to stop (inclusive), incrementing by step. The type of the returned elements is the same as the type of argument expressions.
Supported types are: byte, short, integer, long, date, timestamp.
Examples:
SELECT sequence(1, 5);
[1,2,3,4,5]
SELECT sequence(5, 1);
[5,4,3,2,1]
SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month);
[2018-01-01,2018-02-01,2018-03-01]
https://docs.databricks.com/spark/latest/spark-sql/language-manual/functions.html#sequence
The existing answers will work, but are very inefficient. Instead it is better to use range and then cast data. In Python
from pyspark.sql.functions import col
from pyspark.sql import SparkSession
def generate_series(start, stop, interval):
"""
:param start - lower bound, inclusive
:param stop - upper bound, exclusive
:interval int - increment interval in seconds
"""
spark = SparkSession.builder.getOrCreate()
# Determine start and stops in epoch seconds
start, stop = spark.createDataFrame(
[(start, stop)], ("start", "stop")
).select(
[col(c).cast("timestamp").cast("long") for c in ("start", "stop")
]).first()
# Create range with increments and cast to timestamp
return spark.range(start, stop, interval).select(
col("id").cast("timestamp").alias("value")
)
Example usage:
generate_series("2000-01-01", "2000-01-05", 60 * 60).show(5) # By hour
+-------------------+
| value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-01 01:00:00|
|2000-01-01 02:00:00|
|2000-01-01 03:00:00|
|2000-01-01 04:00:00|
+-------------------+
only showing top 5 rows
generate_series("2000-01-01", "2000-01-05", 60 * 60 * 24).show() # By day
+-------------------+
| value|
+-------------------+
|2000-01-01 00:00:00|
|2000-01-02 00:00:00|
|2000-01-03 00:00:00|
|2000-01-04 00:00:00|
+-------------------+
#Rakesh answer is correct, but I would like to share a less verbose solution:
import datetime
import pyspark.sql.types
from pyspark.sql.functions import UserDefinedFunction
# UDF
def generate_date_series(start, stop):
return [start + datetime.timedelta(days=x) for x in range(0, (stop-start).days + 1)]
# Register UDF for later usage
spark.udf.register("generate_date_series", generate_date_series, ArrayType(DateType()) )
# mydf is a DataFrame with columns `start` and `stop` of type DateType()
mydf.createOrReplaceTempView("mydf")
spark.sql("SELECT explode(generate_date_series(start, stop)) FROM mydf").show()
Suppose you have dataframe df from spark sql, Try this
from pyspark.sql.functions as F
from pyspark.sql.types as T
def timeseriesDF(start, total):
series = [start]
for i xrange( total-1 ):
series.append(
F.date_add(series[-1], 1)
)
return series
df.withColumn("t_series", F.udf(
timeseriesDF,
T.ArrayType()
) ( df.start, F.datediff( df.start, df.stop ) )
).select(F.explode("t_series")).show()
Building off of user10938362 answer, just showing a way to use range without a UDF, provided that you are trying to build a data frame of dates based off of some ingested dataset, rather than with a hardcoded start/stop.
# start date is min date
date_min=int(df.agg({'date': 'min'}).first()[0])
# end date is current date or alternatively could use max as above
date_max=(
spark.sql('select unix_timestamp(current_timestamp()) as date_max')
.collect()[0]['date_max']
)
# range is int, unix time is s so 60*60*24=day
df=spark.range(date_min, date_max, 60*60*24).select('id')