Athena Iceberg Slow On Empty Table - amazon-athena

I am looking at the new Iceberg tables for AWS Athena. I'm hoping to move my data lake over to Iceberg so that I can significantly reduce the complexity of table partition management and hopefully get better performance. I created a test Iceberg table with two fields: event_date and log.
CREATE TABLE ACME.iceberg_test (
  event_date timestamp,
  log string
)
PARTITIONED BY (
  hour(event_date)
)
LOCATION 's3://ACME/iceberg_test'
TBLPROPERTIES (
  'table_type'='ICEBERG',
  'compaction_bin_pack_target_file_size_bytes'='536870912'
);
This creates a new S3 prefix in ACME with metadata etc. I query this new, empty table and it takes 14s to produce 0 results.
I load it with 20 sample rows:
INSERT INTO iceberg_test
VALUES
(timestamp '2021-12-20 01:30:00', 'hello'),
(timestamp '2021-12-20 02:30:00', 'hello'),
(timestamp '2021-12-20 03:30:00', 'hello'),
(timestamp '2021-12-20 04:30:00', 'hello'),
(timestamp '2021-12-20 05:30:00', 'hello'),
(timestamp '2021-12-20 06:30:00', 'hello'),
(timestamp '2021-12-20 07:30:00', 'hello'),
(timestamp '2021-12-20 08:30:00', 'hello'),
(timestamp '2021-12-20 09:30:00', 'hello'),
(timestamp '2021-12-20 10:30:00', 'hello'),
(timestamp '2021-12-20 11:30:00', 'hello'),
(timestamp '2021-12-20 12:30:00', 'hello'),
(timestamp '2021-12-20 13:30:00', 'hello'),
(timestamp '2021-12-20 14:30:00', 'hello'),
(timestamp '2021-12-20 15:30:00', 'hello'),
(timestamp '2021-12-20 16:30:00', 'hello'),
(timestamp '2021-12-20 17:30:00', 'hello'),
(timestamp '2021-12-20 18:30:00', 'hello'),
(timestamp '2021-12-20 19:30:00', 'hello'),
(timestamp '2021-12-20 20:30:00', 'hello');
And for good measure I ran the OPTIMIZE command on it, which I think only does compaction, but I thought it might run some partition discovery as well.
OPTIMIZE iceberg_test REWRITE DATA
USING BIN_PACK;
But it still takes 14s to return my 20 rows here. It looks like I haven't configured my table properly and it is not using the partitions efficiently. I even tried adding a partition predicate that is well out of the bounds of the sample data I loaded:
SELECT * FROM iceberg_test WHERE event_date < timestamp '2021-10-10';
Still taking 14s.
I'm not sure what else I'm supposed to do other than register the partitions. Why is my Iceberg table not using the known partition metadata? How can I get Iceberg to do some partition discovery?
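For what it's worth, here is a minimal boto3 sketch (the database name and results location are placeholders for my setup) that pulls Athena's own timing breakdown for the query above; if DataScannedInBytes comes back as 0 for the out-of-range predicate, partition pruning is actually working and the 14s is mostly engine/startup overhead rather than a partitioning problem:
import time

import boto3

athena = boto3.client("athena")

# Placeholders: database name and query-results location depend on the workgroup setup.
qid = athena.start_query_execution(
    QueryString="SELECT * FROM iceberg_test WHERE event_date < timestamp '2021-10-10'",
    QueryExecutionContext={"Database": "acme"},
    ResultConfiguration={"OutputLocation": "s3://ACME/athena-results/"},
)["QueryExecutionId"]

# Wait for the query to finish, then print Athena's timing/scan statistics.
while True:
    execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
    if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

stats = execution.get("Statistics", {})
for key in (
    "TotalExecutionTimeInMillis",
    "QueryQueueTimeInMillis",
    "QueryPlanningTimeInMillis",
    "EngineExecutionTimeInMillis",
    "ServiceProcessingTimeInMillis",
    "DataScannedInBytes",
):
    print(key, stats.get(key))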

Related

Athena Table Timestamp With Time Zone Not Possible?

I am trying to create an Athena table with a timestamp column that has time zone information. The CREATE statement looks something like this:
CREATE EXTERNAL TABLE `tmp_123` (
  `event_datehour_tz` timestamp with time zone
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://...'
TBLPROPERTIES (
  'Classification'='parquet'
)
When I run this, I get the error:
line 1:8: mismatched input 'external'. expecting: 'or', 'schema', 'table', 'view' (service: amazonathena; status code: 400; error code: invalidrequestexception; request id: b7fa4045-a77e-4151-84d7-1b43db2b68f2; proxy: null)
If I remove the with time zone it will create the table. I've tried this and timestamptz. Is it not possible to create a table in athena that has a timestamp with time zone column?
Unfortunately Athena does not support timestamp with time zone.
What you can do is wrap the value in CAST(), which will change the type from timestamp with time zone to timestamp.
Or you can save it as a plain timestamp and use the AT TIME ZONE operator as shown below:
SELECT event_datehour_tz AT TIME ZONE 'America/Los_Angeles' AS la_time;
Just to give a complete solution after @AswinRajaram answered that Athena does not support timestamp with time zone: here is how one can CAST the timestamp from a string and use it with a time zone.
select
parse_datetime('2022-09-10_00', 'yyyy-MM-dd_H'),
parse_datetime('2022-09-10_00', 'yyyy-MM-dd_H') AT TIME ZONE 'Europe/Berlin',
at_timezone(CAST(parse_datetime('2022-09-10_00', 'yyyy-MM-dd_HH') AS timestamp), 'Europe/Berlin') AS date_partition_berlin,
CAST(parse_datetime('2022-09-10_00', 'yyyy-MM-dd_HH') AT TIME ZONE 'Europe/Berlin' AS timestamp) AS date_partition_timestamp;
2022-09-10 00:00:00.000 UTC
2022-09-10 02:00:00.000 Europe/Berlin // time zone conversion + 2 hours
2022-09-10 02:00:00.000 Europe/Berlin // time zone conversion + 2 hours
2022-09-10 00:00:00.000
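If it helps to sanity-check those values outside Athena, the same conversion in plain Python (standard library only, Python 3.9+ for zoneinfo) looks like this:
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Mirror parse_datetime('2022-09-10_00', 'yyyy-MM-dd_HH'), which yields a UTC timestamp.
utc_value = datetime.strptime("2022-09-10_00", "%Y-%m-%d_%H").replace(tzinfo=ZoneInfo("UTC"))
berlin_value = utc_value.astimezone(ZoneInfo("Europe/Berlin"))

print(utc_value)     # 2022-09-10 00:00:00+00:00
print(berlin_value)  # 2022-09-10 02:00:00+02:00 (the same +2 hour shift as AT TIME ZONE)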

Text searching in Django with trigram

I want to speed up search results in my application; however, I keep getting the same results no matter what method I use. Since it's a Django application, I'll provide both the ORM commands and the generated SQL (PostgreSQL is used).
First, I have enabled GIN indexing and trigram operations on the database:
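In Django terms that amounts to a migration along the following lines (the dependency below is a placeholder); it issues CREATE EXTENSION for btree_gin and pg_trgm:
from django.contrib.postgres.operations import BtreeGinExtension, TrigramExtension
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [
        ('reviews', '0001_initial'),  # placeholder dependency
    ]

    operations = [
        BtreeGinExtension(),  # CREATE EXTENSION IF NOT EXISTS btree_gin;
        TrigramExtension(),   # CREATE EXTENSION IF NOT EXISTS pg_trgm;
    ]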
Second, I have created a table that contains two varchar columns, first_name and last_name (plus an id field as the primary key).
from django.db import models

class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)
I have also filled the database with 952 example records so that I don't end up in a situation where Postgres avoids using the index because the data set is too small.
Next, I ran the following queries on the non-indexed data.
Simple LIKE query:
In [50]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [51]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.011..0.242 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.042 ms
Execution Time: 0.249 ms
Trigram similar:
In [55]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [56]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.582..0.582 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.033 ms
Execution Time: 0.591 ms
And a more fancy query with sorting results:
In [58]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [59]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.680..0.683 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.021..0.657 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.062 ms
Execution Time: 0.693 ms
The next step was to create an index:
from django.contrib.postgres.indexes import GinIndex

class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [GinIndex(fields=['last_name'])]
This resulted in the following SQL migration:
./manage.py sqlmigrate reviews 0004
BEGIN;
--
-- Alter field score on review
--
--
-- Create index reviews_aut_last_na_a89a84_gin on field(s) last_name of model author
--
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
COMMIT;
And now I run the same commands.
LIKE:
In [60]: print(Author.objects.filter(last_name__icontains='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE UPPER("reviews_author"."last_name"::text) LIKE UPPER(%ari%)
In [61]: print(Author.objects.filter(last_name__icontains='ari').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..24.28 rows=38 width=16) (actual time=0.009..0.237 rows=56 loops=1)
Filter: (upper((last_name)::text) ~~ '%ARI%'::text)
Rows Removed by Filter: 896
Planning Time: 0.089 ms
Execution Time: 0.244 ms
Trigram similar:
In [62]: print(Author.objects.filter(last_name__trigram_similar='ari').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name" FROM "reviews_author" WHERE "reviews_author"."last_name" % ari
In [63]: print(Author.objects.filter(last_name__trigram_similar='ari').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..21.90 rows=1 width=16) (actual time=0.740..0.740 rows=0 loops=1)
Filter: ((last_name)::text % 'ari'::text)
Rows Removed by Filter: 952
Planning Time: 0.056 ms
Execution Time: 0.750 ms
And the more complex query:
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').query)
SELECT "reviews_author"."id", "reviews_author"."first_name", "reviews_author"."last_name", SIMILARITY("reviews_author"."last_name", ari) AS "similar" FROM "reviews_author" WHERE SIMILARITY("reviews_author"."last_name", ari) > 0.0 ORDER BY "similar" DESC
In [65]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'ari')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=38.24..39.03 rows=317 width=20) (actual time=0.659..0.662 rows=84 loops=1)
Sort Key: (similarity((last_name)::text, 'ari'::text)) DESC
Sort Method: quicksort Memory: 31kB
-> Seq Scan on reviews_author (cost=0.00..25.07 rows=317 width=20) (actual time=0.024..0.643 rows=84 loops=1)
Filter: (similarity((last_name)::text, 'ari'::text) > '0'::double precision)
Rows Removed by Filter: 868
Planning Time: 0.052 ms
Execution Time: 0.674 ms
The changes in execution times seem to be insignificant. In the case of the last query, the scan takes 0.643 ms compared to 0.657 ms in the previous case. Overall times also differ by only about 0.02 milliseconds (and the second query actually ran a bit slower). Is there some option I am missing that should be enabled to help with performance? Is the data set too simple?
Docs I used:
Django's docs on text searching
Gitlab's docs on trigrams
EDIT
I've added a few hundred thousand records (there are now nearly 259,000) and ran the tests again. First, without indexes:
In [59]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.018..58.630 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.046 ms
Execution Time: 58.662 ms
In [60]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.555..80.710 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.503..78.743 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.039 ms
Execution Time: 80.740 ms
In [61]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.214..168.876 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.022..165.806 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.052 ms
Execution Time: 169.319 ms
And with it:
In [62]: print(Author.objects.filter(last_name__icontains='bar').explain(analyze=True))
Seq Scan on reviews_author (cost=0.00..5433.28 rows=10358 width=16) (actual time=0.015..59.366 rows=846 loops=1)
Filter: (upper((last_name)::text) ~~ '%BAR%'::text)
Rows Removed by Filter: 258106
Planning Time: 0.072 ms
Execution Time: 59.395 ms
In [63]: print(Author.objects.filter(last_name__trigram_similar='bar').explain(analyze=True))
Gather (cost=1000.00..4478.96 rows=259 width=16) (actual time=0.545..80.337 rows=698 loops=1)
Workers Planned: 1
Workers Launched: 1
-> Parallel Seq Scan on reviews_author (cost=0.00..3453.06 rows=152 width=16) (actual time=0.292..78.502 rows=349 loops=2)
Filter: ((last_name)::text % 'bar'::text)
Rows Removed by Filter: 129127
Planning Time: 0.035 ms
Execution Time: 80.369 ms
In [64]: print(Author.objects.annotate(similar=TrigramSimilarity('last_name', 'bar')).filter(similar__gt=0).order_by('-similar').explain(analyze=True))
Sort (cost=12725.93..12941.72 rows=86317 width=20) (actual time=168.191..168.890 rows=14235 loops=1)
Sort Key: (similarity((last_name)::text, 'bar'::text)) DESC
Sort Method: quicksort Memory: 1485kB
-> Seq Scan on reviews_author (cost=0.00..5649.07 rows=86317 width=20) (actual time=0.029..165.743 rows=14235 loops=1)
Filter: (similarity((last_name)::text, 'bar'::text) > '0'::double precision)
Rows Removed by Filter: 244717
Planning Time: 0.054 ms
Execution Time: 169.340 ms
Still very similar times, and it seems to be avoiding the GIN index.
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name");
This did not create a trigram index. It created a GIN index on the whole string, using the operators from btree_gin (which you don't seem to be using for any good purpose). To make a trigram index, it would need to look like this:
CREATE INDEX "reviews_aut_last_na_a89a84_gin" ON "reviews_author" USING gin ("last_name" gin_trgm_ops);
But I don't know how to get Django to do that; I'm not a Django user.
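For completeness, a hedged sketch of what that can look like on the Django side (Django 2.2+ accepts an opclasses argument on index classes, and an explicit name is required when opclasses is given):
from django.contrib.postgres.indexes import GinIndex
from django.db import models


class Author(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)

    class Meta:
        indexes = [
            # Generates: CREATE INDEX ... USING gin ("last_name" gin_trgm_ops)
            GinIndex(
                name='reviews_author_last_name_trgm',
                fields=['last_name'],
                opclasses=['gin_trgm_ops'],
            ),
        ]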

does django db_index=True index null value?

For example, if I have a field slug = models.CharField(null=True, db_index=True, max_length=50) and I leave slug empty while saving data, will the database index this saved null value?
Yes, PostgreSQL does index NULL values.
Here is a small test case:
select version();
version
-----------------------------------------------------------------------------------------------------------
PostgreSQL 9.5.21 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)
create table t(c1 serial, c2 text);
CREATE TABLE
insert into t(c2) select generate_series(1,1000000);
INSERT 0 1000000
create index on t(c2);
CREATE INDEX
analyze t;
ANALYZE
update t set c2=null where c1=123456;
UPDATE 1
explain analyze select count(*) from t where c2 is null;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5.76..5.77 rows=1 width=0) (actual time=0.009..0.009 rows=1 loops=1)
   ->  Index Only Scan using t_c2_idx on t  (cost=0.42..5.76 rows=1 width=0) (actual time=0.006..0.006 rows=1 loops=1)
         Index Cond: (c2 IS NULL)
         Heap Fetches: 1
 Planning time: 0.271 ms
 Execution time: 0.035 ms
(6 rows)
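The same check can be run from Django itself with QuerySet.explain() (Django 2.0+); the model name below is just illustrative:
# Assuming a model with: slug = models.CharField(null=True, db_index=True, max_length=50)
print(MyModel.objects.filter(slug__isnull=True).explain(analyze=True))
# With enough rows, the plan should show an index scan with
# "Index Cond: (slug IS NULL)" rather than a sequential scan.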

How to build Django Query Expression for Window function

I have a Postgres query, and I want to represent it using the Django QuerySet builder.
I have a table:
history_events
date                            amount
-----------------------------   ------
2019-03-16 16:03:11.49294+05 250.00
2019-03-18 14:56:30.224846+05 250.00
2019-03-18 15:07:30.579531+05 250.00
2019-03-18 20:52:53.581835+05 5.00
2019-03-18 22:33:21.598517+05 1000.00
2019-03-18 22:50:57.157465+05 1.00
2019-03-18 22:51:44.058534+05 2.00
2019-03-18 23:11:29.531447+05 255.00
2019-03-18 23:43:43.143171+05 250.00
2019-03-18 23:44:47.445534+05 500.00
2019-03-18 23:59:23.007685+05 250.00
2019-03-19 00:01:05.103574+05 255.00
2019-03-19 00:01:05.107682+05 250.00
2019-03-19 00:01:05.11454+05 500.00
2019-03-19 00:03:48.182851+05 255.00
and I need to build a graph from this data with a cumulative (running) sum of amount by date.
This SQL collects the correct data:
with data as (
  select
    date(date) as day,
    sum(amount) as day_sum
  from history_event
  group by day
)
select
  day,
  day_sum,
  sum(day_sum) over (order by day asc rows between unbounded preceding and current row)
from data
But I cannot work out how to build the correct QuerySet expression for this.
Another problem: there is no data for some days, and those days do not appear on my graph.
Nested queries like yours cannot be easily defined in ORM syntax. Subquery is limited to correlated subqueries returning a single value. This often results in contorted and inefficient ORM workarounds for queries that you can easily express in SQL.
In this case, you can use two Window functions combined with a distinct clause.
from django.db.models import F, Sum, Window
from django.db.models.expressions import RowRange

result = (Event.objects
    .values('date', 'amount')
    .annotate(day_sum=Window(
        expression=Sum('amount'),
        partition_by=[F('date')],
    ))
    .annotate(total=Window(
        expression=Sum('amount'),
        frame=RowRange(start=None, end=0),
        order_by=F('date').asc(),
    ))
    .distinct('date')
    .order_by('date', '-total')
)
You need to order by '-total' as otherwise distinct discards the wrong rows, leaving you with less than the correct amounts in total.
As to the missing days: SQL has no inherent concept of a calendar (and therefore of missing dates), and unless you have lots of data it should be easier to add the missing days in a Python loop. In SQL, you would do it with a calendar table.
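A minimal sketch of that loop, assuming each row of the queryset above exposes 'date' (as a date) and 'total' (the running sum); the variable names are illustrative:
from datetime import timedelta

rows = list(result)  # dicts with 'date', 'amount', 'day_sum', 'total'
totals_by_day = {row['date']: row['total'] for row in rows}

filled = []
running_total = 0
day, last_day = rows[0]['date'], rows[-1]['date']

while day <= last_day:
    # A day with no events carries the previous running total forward.
    running_total = totals_by_day.get(day, running_total)
    filled.append((day, running_total))
    day += timedelta(days=1)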

How to query historical table size of database in Redshift to determine database size growth

I want to project forward the size of my Amazon Redshift tables because I'm planning to expand my Redshift cluster size.
I know how to query the table sizes for today (see the query below), but how can I measure the growth of my table sizes over time without building an ETL job that snapshots day-by-day table sizes?
-- Capture table sizes
select
  trim(pgdb.datname) as Database,
  trim(pgn.nspname) as Schema,
  trim(a.name) as Table,
  b.mbytes,
  a.rows
from (
  select db_id, id, name, sum(rows) as rows
  from stv_tbl_perm a
  group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (
  select tbl, count(*) as mbytes
  from stv_blocklist
  group by tbl
) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
There is no historical table size information retained by Amazon Redshift. You would need to run a query on a regular basis, such as the one in your question.
You could wrap the query in an INSERT statement and run it on a weekly basis, inserting the results into a table. This way, you'll have historical table size information for each table each week that you can use to predict future growth.
It would be worth doing a VACUUM prior to such measurements, to remove deleted rows from storage.
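A sketch of that approach as a small script you could schedule weekly; the connection details and the admin.table_size_history table are placeholders, and psycopg2 is used here simply because Redshift speaks the Postgres wire protocol:
import psycopg2

# Placeholder connection details.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",
)

CREATE_HISTORY_TABLE = """
create table if not exists admin.table_size_history (
    captured_at   timestamp,
    database_name varchar(128),
    schema_name   varchar(128),
    table_name    varchar(128),
    size_mb       bigint,
    row_count     bigint
);
"""

SNAPSHOT_INSERT = """
insert into admin.table_size_history
select
    getdate(),
    trim(pgdb.datname),
    trim(pgn.nspname),
    trim(a.name),
    b.mbytes,
    a.rows
from (select db_id, id, name, sum(rows) as rows
      from stv_tbl_perm group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
      from stv_blocklist group by tbl) b on a.id = b.tbl;
"""

with conn, conn.cursor() as cur:
    cur.execute(CREATE_HISTORY_TABLE)
    cur.execute(SNAPSHOT_INSERT)
conn.close()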
The following metrics are available in CloudWatch:
RedshiftManagedStorageTotalCapacity (m1)
PercentageDiskSpaceUsed (m2)
Create a CloudWatch metric math expression m1*m2/100 to get this data for the past 3 months.
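A hedged boto3 sketch of the same metric math via GetMetricData (the cluster identifier is a placeholder; RedshiftManagedStorageTotalCapacity applies to clusters on managed storage, e.g. RA3 node types):
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
CLUSTER = "my-redshift-cluster"  # placeholder


def metric(query_id, metric_name):
    # Daily average of one AWS/Redshift metric for the cluster.
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Redshift",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "ClusterIdentifier", "Value": CLUSTER}],
            },
            "Period": 86400,
            "Stat": "Average",
        },
        "ReturnData": False,  # only the math expression below is returned
    }


response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        metric("m1", "RedshiftManagedStorageTotalCapacity"),
        metric("m2", "PercentageDiskSpaceUsed"),
        {"Id": "e1", "Expression": "m1*m2/100", "Label": "StorageUsed"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(days=90),
    EndTime=datetime.now(timezone.utc),
)

used = next(r for r in response["MetricDataResults"] if r["Id"] == "e1")
for timestamp, value in zip(used["Timestamps"], used["Values"]):
    print(timestamp, value)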