Does Django db_index=True index NULL values?

For example, if I have a field slug = models.CharField(null=True, db_index=True, max_length=50) and the slug is left empty while saving data, will the database index this saved NULL value?

Yes, PostgreSQL does index NULL values.
Here is a small test case:
select version();
version
-----------------------------------------------------------------------------------------------------------
PostgreSQL 9.5.21 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)
create table t(c1 serial, c2 text);
CREATE TABLE
insert into t(c2) select generate_series(1,1000000);
INSERT 0 1000000
create index on t(c2);
CREATE INDEX
analyze t;
ANALYZE
update t set c2=null where c1=123456;
UPDATE 1
explain analyze select count(*) from t where c2 is null;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=5.76..5.77 rows=1 width=0) (actual time=0.009..0.009 rows=1 loops=1)
   -> Index Only Scan using t_c2_idx on t (cost=0.42..5.76 rows=1 width=0) (actual time=0.006..0.006 rows=1 loops=1)
         Index Cond: (c2 IS NULL)
         Heap Fetches: 1
Planning time: 0.271 ms
Execution time: 0.035 ms
(6 rows)
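The flip side: if you would rather keep NULLs out of the index (for example, when most rows have an empty slug and you never search on it), Django can build a partial index via Index(condition=...). A minimal sketch, assuming Django 2.0+ and a hypothetical Article model:
from django.db import models
from django.db.models import Q

class Article(models.Model):
    slug = models.CharField(null=True, max_length=50)

    class Meta:
        indexes = [
            # Partial index: rows with a NULL slug are excluded, so the
            # index stays smaller; "name" is required with "condition".
            models.Index(
                fields=["slug"],
                name="article_slug_notnull_idx",
                condition=Q(slug__isnull=False),
            ),
        ]
On PostgreSQL this renders as a CREATE INDEX ... WHERE "slug" IS NOT NULL statement.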

Related

Cannot find running sum of a blended chart in Data Studio

I'm trying to create a chart for the following BigQuery query using Data Studio. Instead of auto-generating the chart from the GCP console, I'm trying to create the chart using the tools in Data Studio.
SELECT t.timestamp, sum(t.introduced_violation)
OVER(
PARTITION BY t.introduced_user_id
ORDER BY t.timestamp desc
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
as cumulative_introduced_violation,
sum(t.fixed_violation)
OVER(
PARTITION BY t.introduced_user_id
ORDER BY t.timestamp desc
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
as cumulative_fixed_violation,
FROM (SELECT SUM(CASE
WHEN is_fixed = 1 THEN 1
ELSE 0
END) AS fixed_violation,
SUM(1) AS introduced_violation,
timestamp, introduced_user_id
FROM `project_id.violation.table_name`
where introduced_user_id = 'username#company.com'
and timestamp >=1622556834000
and timestamp <=1631231999999
group by timestamp, introduced_user_id
order by timestamp desc) as t;
Expected output from the query: [screenshot in the original post]
At first, I tried to create a chart for the inner query (below). I succeeded at this step by creating 2 charts and blending them together.
SELECT SUM(CASE
WHEN is_fixed = 1 THEN 1
ELSE 0
END) AS fixed_violation,
SUM(1) AS introduced_violation,
timestamp, introduced_user_id
FROM `project_id.violation.table_name`
where introduced_user_id = 'username#company.com'
and timestamp >=1622556834000
and timestamp <=1631231999999
group by timestamp, introduced_user_id
order by timestamp desc;
Expected output from the inner query: [screenshot in the original post]
As seen in the query output, the introduced_violation and fixed_violation values are RunningSUM values.
Is there a way to find the RunningSUM of the introduced_violation and fixed_violation columns in the blended charts, or some other way to achieve the whole scenario?

Retrieving the row with the greatest timestamp in QuestDB

I'm currently running QuestDB 6.1.2 on Linux. How do I get the row with the maximum timestamp from a table? I have tried the following on a test table with around 5 million rows:
1. select * from table where cast(timestamp as symbol) in (select cast(max(timestamp) as symbol) from table);
2. select * from table inner join (select max(timestamp) mm from table) on timestamp >= mm;
3. select * from table where timestamp = max(timestamp);
4. select * from table where timestamp = (select max(timestamp) from table);
Query 1 is correct but runs in ~5 s; query 2 is correct and runs in ~500 ms, but looks unnecessarily verbose for such a query; query 3 compiles but returns an empty table; and query 4 is invalid syntax, although that's how SQL usually does it.
select * from table limit -1
works. QuestDB returns rows sorted by the designated timestamp by default, and limit -1 takes the last row, which is therefore the row with the greatest timestamp. To be explicit about ordering by timestamp, use
select * from table order by timestamp limit -1
instead. This query runs in around 300-400 ms on the same table.
As a side note, the third query using timestamp = max(timestamp) doesn't work because QuestDB does not yet support subqueries in where (as of QuestDB 6.1.2).
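Since QuestDB also speaks the PostgreSQL wire protocol, the same query can be run from Python. A minimal sketch, assuming a local instance with QuestDB's default credentials (admin/quest on port 8812) and a hypothetical trades table with a designated timestamp column:
import psycopg2

# QuestDB exposes a PostgreSQL-compatible endpoint on port 8812 by default.
conn = psycopg2.connect(
    host="localhost", port=8812,
    user="admin", password="quest", dbname="qdb",
)
with conn.cursor() as cur:
    # limit -1 keeps only the last row of the result; ordered by the
    # timestamp, that is the row with the greatest timestamp.
    cur.execute("select * from trades order by timestamp limit -1")
    print(cur.fetchone())
conn.close()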

Text searching in Django with trigram

I want to speed up search results in my application; however, I keep getting the same results no matter what method I use. Since it's a Django application, I'll provide both the ORM commands and the generated SQL code (PostgreSQL is used).
First, I have enabled GIN indexing and trigram operations on the database:
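A migration along these lines does that; a sketch, assuming the pg_trgm and btree_gin extensions (both are referenced later in this post):
from django.contrib.postgres.operations import BtreeGinExtension, TrigramExtension
from django.db import migrations

class Migration(migrations.Migration):
    # hypothetical dependency; point it at the app's previous migration
    dependencies = [("reviews", "0001_initial")]

    operations = [
        TrigramExtension(),   # CREATE EXTENSION IF NOT EXISTS pg_trgm;
        BtreeGinExtension(),  # CREATE EXTENSION IF NOT EXISTS btree_gin;
    ]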
Second, I have created a table that contains 2 varchar columns: first_name and last_name (plus an id field as the primary key).
from django.db import models

class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)
I have also filled the database with 952 example records so that I don't have a situation where Postgres avoids using the index because the data set is too small.
Next, I ran the following queries on the non-indexed data.
Simple LIKE query:
In [50]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [51]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.011..0.242 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.042 ms
Execution Time: 0.249 ms
Trigram similar:
In [55]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [56]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.582..0.582 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.033 ms
Execution Time: 0.591 ms
And a more fancy query with sorting results:
In [58]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [59]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.680..0.683 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.021..0.657 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.062 ms
Execution Time: 0.693 ms
The next step was to create an index:
class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [GinIndex(fields=['last_name'])]
This resulted in the following SQL migration:
./manage.py sqlmigrate reviews 0004
BEGIN;
--
-- Alter field score on review
--
--
-- Create index reviews_aut_last_na_a89a84_gin on field(s) last_name of model author
--
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
COMMIT;
And now I ran the same commands.
LIKE:
In [60]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [61]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.009..0.237 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.089 ms
Execution Time: 0.244 ms
Trigram similar:
In [62]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [63]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.740..0.740 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.056 ms
Execution Time: 0.750 ms
And the more complex query:
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [65]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.659..0.662 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.024..0.643 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.052 ms
Execution Time: 0.674 ms
The changes in execution times seem insignificant. In the case of the last query, the scan takes 0.643 ms compared to 0.657 ms in the previous case. Overall times also differ by only about 0.02 milliseconds (and the second query even ran a bit slower). Is there some option I am missing that should be enabled to help with performance? Is the data set too simple?
Docs I used:
Django's docs on text searching
Gitlab's docs on trigrams
EDIT
I've added a few hundred thousand records (there are now nearly 259,000) and ran the tests again. First, without the index:
In [59]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.018..58.630 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.046 ms
Execution Time: 58.662 ms
In [60]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.555..80.710 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.503..78.743 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.039 ms
Execution Time: 80.740 ms
In [61]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.214..168.876 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.022..165.806 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.052 ms
Execution Time: 169.319 ms
And with it:
In [62]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.015..59.366 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.072 ms
Execution Time: 59.395 ms
In [63]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.545..80.337 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.292..78.502 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.035 ms
Execution Time: 80.369 ms
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.191..168.890 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.029..165.743 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.054 ms
Execution Time: 169.340 ms
The times are still very similar, and Postgres seems to be avoiding the GIN index.
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
This did not create a trigram index. It created a GIN index on the whole string, using the operators from btree_gin (which you don't seem to be using for any good purpose). To make a trigram index, it would need to look like this:
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name" gin_trgm_ops);
But I don't know how to get Django to do that; I'm not a Django user.
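For the Django side, a hedged sketch: since Django 2.2, index classes accept an opclasses argument, which should produce exactly the CREATE INDEX above:
from django.contrib.postgres.indexes import GinIndex
from django.db import models

class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [
            # opclasses pairs positionally with fields, and "name" is
            # mandatory whenever opclasses is given.
            GinIndex(
                fields=["last_name"],
                name="author_last_name_gin_trgm",
                opclasses=["gin_trgm_ops"],
            ),
        ]
With gin_trgm_ops in place, the trigram_similar lookup (the % operator) can use the index instead of a sequential scan.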

BigQuery Limit Rows Scanned by Merge DML

Given the DML statement below, is there a way to limit the number of rows scanned in the target table? For example, let's say we have a field shard_id that the table is partitioned by. I know beforehand that all updates should happen in some range of shard_id. Is there a way to specify a where clause for the target to limit the number of rows that need to be scanned, so the update does not have to do a full table scan to look for an id?
MERGE dataset.table_target target
USING dataset.table_source source
ON target.id = "123"
WHEN MATCHED THEN
UPDATE SET some_value = source.some_value
WHEN NOT MATCHED BY SOURCE AND id = "123" THEN
DELETE
The ON condition is where you need to write your clause:
ON target.id = "123" AND DATE(target.shard_id) BETWEEN date1 AND date2
For your case, it's incorrect to do the partition pruning in the ON condition. Instead, you should do it in the WHEN clauses.
There is an example for exactly this scenario at https://cloud.google.com/bigquery/docs/using-dml-with-partitioned-tables#pruning_partitions_when_using_a_merge_statement.
Basically, the ON condition is used as the matching condition for joining the target & source tables in MERGE. The following two queries show the difference between a join condition and a where clause:
Query 1:
with
t1 as (
select '2018-01-01' pt, 10 v1 union all
select '2018-01-01', 20 union all
select '2000-01-01', 10),
t2 as (select 10 v2)
select * from t1 left outer join t2 on v1=v2 and pt = '2018-01-01'
Result:
pt v1 v2
2018-01-01 10 10
2018-01-01 20 NULL
2000-01-01 10 NULL
Query 2:
with
t1 as (
select '2018-01-01' pt, 10 v1 union all
select '2018-01-01', 20 union all
select '2000-01-01', 10),
t2 as (select 10 v2)
select * from t1 left outer join t2 on v1=v2 where pt = '2018-01-01'
Result:
pt v1 v2
2018-01-01 10 10
2018-01-01 20 NULL
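Applied to the original statement, the pruning conditions go into the WHEN clauses. A hedged sketch, run here through the BigQuery Python client (the shard_id bounds are made-up placeholders):
from google.cloud import bigquery

client = bigquery.Client()

# Sketch only: repeating the shard_id range in each WHEN clause lets
# BigQuery prune partitions of the target table instead of scanning it all.
merge_sql = """
MERGE dataset.table_target target
USING dataset.table_source source
ON target.id = "123"
WHEN MATCHED AND target.shard_id BETWEEN 100 AND 200 THEN
  UPDATE SET some_value = source.some_value
WHEN NOT MATCHED BY SOURCE AND id = "123"
  AND shard_id BETWEEN 100 AND 200 THEN
  DELETE
"""

job = client.query(merge_sql)
job.result()  # wait for the DML job to finish
print("bytes processed:", job.total_bytes_processed)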

SHOW PARTITIONS with order by in Amazon Athena

I have this query:
SHOW PARTITIONS tablename;
Result is:
dt=2018-01-12
dt=2018-01-20
dt=2018-05-21
dt=2018-04-07
dt=2018-01-03
This gives the list of partitions for the table. The partition field for this table is dt, which is a date column. I want to see the partitions ordered.
The documentation doesn't explain how to do it:
https://docs.aws.amazon.com/athena/latest/ug/show-partitions.html
I tried to add order by:
SHOW PARTITIONS tablename order by dt;
But it gives:
AmazonAthena; Status Code: 400; Error Code: InvalidRequestException;
AWS currently (as of Nov 2020) supports two versions of the Athena engines. How one selects and orders partitions depends upon which version is used.
Version 1:
Use the information_schema table. Assuming you have year, month as partitions (with one partition key, this is of course simpler):
WITH
a as (
SELECT partition_number as pn, partition_key as key, partition_value as val
FROM information_schema.__internal_partitions__
WHERE table_schema = 'my_database'
AND table_name = 'my_table'
)
SELECT
year, month
FROM (
SELECT val as year, pn FROM a WHERE key = 'year'
) y
JOIN (
SELECT val as month, pn FROM a WHERE key = 'month'
) m ON m.pn = y.pn
ORDER BY year, month
which outputs:
year month
2018    10
2018    11
2018    12
2019    01
...
Version 2:
Use the built-in $partitions functionality, where the partitions are explicitly available as columns and the syntax is much simpler:
SELECT year, month FROM my_database."my_table$partitions" ORDER BY year, month
year month
2018    10
2018    11
2018    12
2019    01
...
For more information, see:
https://docs.aws.amazon.com/athena/latest/ug/querying-glue-catalog.html#querying-glue-catalog-listing-partitions
From your comment it sounds like you're looking to sort the partitions as a way to figure out whether or not a specific partition exists. For this purpose I suggest you use the Glue API instead of querying Athena. Run aws glue get-partition help or check your preferred SDK's documentation for how it works.
There is also a variant that lists all partitions of a table; run aws glue get-partitions help to read more about it. I don't think it returns the partitions in alphabetical order, but it has operators for filtering.
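For illustration, the same two calls from Python with boto3 (a sketch; the database and table names are placeholders):
import boto3

glue = boto3.client("glue")

# Point lookup: does one specific partition exist? No Athena query needed.
try:
    glue.get_partition(
        DatabaseName="my_database",
        TableName="tablename",
        PartitionValues=["2018-01-12"],  # one value per partition key (dt)
    )
    print("partition exists")
except glue.exceptions.EntityNotFoundException:
    print("partition missing")

# Listing: fetch all partitions and sort client-side.
paginator = glue.get_paginator("get_partitions")
values = [
    p["Values"][0]
    for page in paginator.paginate(DatabaseName="my_database", TableName="tablename")
    for p in page["Partitions"]
]
print(sorted(values))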
The SHOW PARTITIONS command will not let you order the result, since it does not produce a result set to sort; it only produces string output.
You can, on the other hand, query the partition column and order the result by its value.
select distinct dt from tablename order by dt asc;