Amazon Athena scans lots of data when query involves only partitions - amazon-web-services

I have a table on Athena partitioned by day (huge table, TB of data). There's no day column on the table, at least not explicitly. I would expect that a query like the following:
select max(day) from my_table
would scan virtually no data. However, Athena reports that several hundreds of GB are scanned. Any idea why?
===== EDIT 2021-01-14 ===
I've recently bumped on this issue again. It turns out that when the underlying data is parquet then operations on partitions don't consume data. For other data formats that I've tried (including ORC) there is an associated data cost. It doesn't make any sense to me.

I don't know the answer for a fact but I guesstimate:
Athena just does not have the optimization of looking at the partition names only, when only they are queried. This is clear from its behaviour. So it scans everything.
Parquet has min/max for every column whereas ORC does it only if an index is present, AFAIU. Thus for Parquet Athena's query optimizer directs it to look directly at these rollup values, i.e., no scan is performed. It's different for ORC.

I know is a little late to answer this question for you Nicolas but it is important to keep here also some possible solutions.
Unfortunately, this is the way Athena works, Athena will read all data as a tableScan just to list the partitions values.
A possible workaround that works perfectly here is using the metadata of the partition instead of the data information, for example:
Instead of using this syntax:
select max(day) from my_table
Try to use this syntax:
SELECT day FROM my_schema."my_table$partitions" ORDER BY day DESC LIMIT 1
This second statement will read just metadata information and returns the same data you need.

It does not depend on the format but on the compression algorithm used. Snappy for ORC mostly & GZIP for parquet. This is what makes the difference

Related

Querying S3 using Athena

I have a setup with Kinesis Firehose ingesting data, AWS Lambda performing data transformation and dropping the incoming data into an S3 bucket. The S3 structure is organized by year/month/day/hour/messages.json, so all of the actual json files I am querying are at the 'hour' level with all year, month, day directories only containing sub directories.
My problem is I need to run a query to get all data for a given day. Is there an easy way to query at the 'day' directory level and return all files in its sub directories without having to run a query for 2020/06/15/00, 2020/06/15/01, 2020/06/15/02...2020/06/15/23?
I can successfully query the hour level directories since I can create a table and define the column name and type represented in my .json file, but I am not sure how to create a table in Athena (if possible) to represent a day directory with sub directories instead of actual files.
To query only the data for a day without making Athena read all the data for all days you need to create a partitioned table (look at the second example). Partitioned tables are like regular tables, but they contain additional metadata that describes where the data for a particular combination of the partition keys is located. When you run a query and specify criteria for the partition keys Athena can figure out which locations to read and which to skip.
How to configure the partition keys for a table depends on the way the data is partitioned. In your case the partitioning is by time, and the timestamp has hourly granularity. You can choose a number of different ways to encode this partitioning in a table, which one is the best depends on what kinds of queries you are going to run. You say you want to query by day, which makes sense, and will work great in this case.
There are two ways to set this up, the traditional, and the new way. The new way uses a feature that was released just a couple of days ago and if you try to find more examples of it you may not find many, so I'm going to show you the traditional too.
Using Partition Projection
Use the following SQL to create your table (you have to fill in the columns yourself, since you say you've successfully created a table already just use the columns from that table – also fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
TBLPROPERTIES (
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2020/06/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"storage.location.template" = "s3://cszlos-data/is/here/${date}"
)
This creates a table partitioned by date (please note that you need to quote this in queries, e.g. SELECT * FROM cszlos_firehose_data WHERE "date" = …, since it's a reserved word, if you want to avoid having to quote it use another name, dt seems popular, also note that it's escaped with backticks in DDL and with double quotes in DML statements). When you query this table and specify a criteria for date, e.g. … WHERE "date" = '2020/06/05', Athena will read only the data for the specified date.
The table uses Partition Projection, which is a new feature where you put properties in the TBLPROPERTIES section that tell Athena about your partition keys and how to find the data – here I'm telling Athena to assume that there exists data on S3 from 2020-06-01 up until the time the query runs (adjust the start date necessary), which means that if you specify a date before that time, or after "now" Athena will know that there is no such data and not even try to read anything for those days. The storage.location.template property tells Athena where to find the data for a specific date. If your query specifies a range of dates, e.g. … WHERE "date" > '2020/06/05' Athena will generate each date (controlled by the projection.date.interval property) and read data in s3://cszlos-data/is/here/2020-06-06, s3://cszlos-data/is/here/2020-06-07, etc.
You can find a full Kinesis Data Firehose example in the docs. It shows how to use the full hourly granularity of the partitioning, but you don't want that so stick to the example above.
The traditional way
The traditional way is similar to the above, but you have to add partitions manually for Athena to find them. Start by creating the table using the following SQL (again, add the columns from your previous experiments, and fix the S3 locations):
CREATE EXTERNAL TABLE cszlos_firehose_data (
-- fill in your columns here
)
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://cszlos-data/is/here/'
This is exactly the same SQL as above, but without the table properties. If you try to run a query against this table now you will not get any results. The reason is that you need to tell Athena about the partitions of a partitioned table before it knows where to look for data (partitioned tables must have a LOCATION, but it really doesn't mean the same thing as for regular tables).
You can add partitions in many different ways, but the most straight forward for interactive use is to use ALTER TABLE ADD PARTITION. You can add multiple partitions in one statement, like this:
ALTER TABLE cszlos_firehose_data ADD
PARTITION (`date` = '2020-06-06') LOCATION 's3://cszlos-data/is/here/2020/06/06'
PARTITION (`date` = '2020-06-07') LOCATION 's3://cszlos-data/is/here/2020/06/07'
PARTITION (`date` = '2020-06-08') LOCATION 's3://cszlos-data/is/here/2020/06/08'
PARTITION (`date` = '2020-06-09') LOCATION 's3://cszlos-data/is/here/2020/06/09'
If you start reading more about partitioned tables you will probably also run across the MSCK REPAIR TABLE statement as a way to load partitions. This command is unfortunately really slow, and it only works for Hive style partitioned data (e.g. …/year=2020/month=06/day=07/file.json) – so you can't use it.

BigQuery DML on column partitioned tables with streaming buffer

As far as I understand, UPDATEs and DELETEs are working on partitioned tables with streaming buffer, if query is not touching any records in streaming buffer. Otherwise, the following error is reported:
UPDATE or DELETE statement over table project.dataset.table would affect rows in the streaming buffer, which is not supported
Issue is similar to discussed this question, however it's about column partitioned tables, not about ingestion-time partitioned tables.
Problem is, that while ingestion-time partitioned have means to ignore data in streaming buffer via conditions on _PARTITIONTIME, it's not available for column-partitioned tables. Are there any other approaches that would allow to ignore streaming buffer data in DML statements?
At the moment you can only use Legacy SQL to get information about the streaming buffer.
Get all data from the streaming buffer like this:
#legacySQL
select MIN(partitioned_tstamp) AS min_tstamp
, MAX(partitioned_tstamp) AS max_tstamp
, COUNT(1) AS lines
FROM [my_dataset_id.mystreaming_data_table$__UNPARTITIONED__]
And get a summary of all partitions in the table like this:
#legacySQL
SELECT *
FROM [my_dataset_id.mystreaming_data_table$__PARTITIONS_SUMMARY__]
I have no idea why this isn't supported yet in standard SQL or when it will be.

AWS Redshift : DISTKEY / SORTKEY columns should be compressed?

Let me ask something about column compression on AWS Redshift.
Now we're verifying what can be made better performance using appropriate diststyle, sortkeys and column compression.
If my understanding is correct, the column compression can help to reduce IO cost. I tried "analyze compression table_name;". And mostly Redshift suggests to use 'zstd' or 'lzo' as compression method for our columns.
In general speaking, may I ask the columns set as DISTKEY/SORTKEY should be also compressed like other columns?
I'm totally new to Redshift and any advice would be appreciated.
Sincerly.
DISTKEY can be compressed but the first SORTKEY column should be uncompressed (ENCODE raw). If you have multiple sort keys (compound) the other sort key columns can be compressed.
Also, generally recommend using a commonly filtered date/timestamp column (if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables try using the same dist and sort keys on both tables so Redshift can use a faster merge join.

Redshift Query taking too much time

In Redshift, the queries are taking too much time to execute. Some queries keep on running or get aborted after some time.
I have very limited knowledge of Redshift and it is getting difficult to understand the Query plan to optimise the query.
Sharing one of the queries that we run, along with the Query Plan.
The query is taking 20 seconds to execute.
Query
SELECT
date_trunc('day',
ti) as date,
count(distinct deviceID) AS COUNT
FROM
live_events
WHERE
brandID = 3927
AND ti >= '2017-08-02T00:00:00+00:00'
AND ti <= '2017-09-02T00:00:00+00:00'
GROUP BY
1
Primary key
brandID
Interleaved Sort Keys
we have set following columns as interleaved sort keys -
brandID, ti, event_name
QUERY PLAN
You have 126 million rows in that table. It's going to take more than a second on a single dc1.large node.
Here's some ways you could improve the performance:
More nodes
Spreading data across more nodes allows more parallelization. Each node adds additional processing and storage. Even if your data volume only justifies one node, if you want more performance, add more nodes.
SORTKEY
For the right type of query, the SORTKEY can be the best way to improve query speed. Sorting data on disk allows Redshift to skip over blocks that it knows does not contain relevant data.
For example, your query has WHERE brandID = 3927, so having brandID as the SORTKEY would make this extremely efficient because very few disk blocks would contain data for one brand.
Interleaved sorting is rarely the best sorting method to use because it is less efficient than a single or compound sort key and takes a long time to VACUUM. If the query you have shown is typical of the type of queries you are running, then use a compound sort key of brandId, ti or ti, brandId. It will be much more efficient.
SORTKEYs are typically a date column, since they are often found in a WHERE clause and the table will be automatically sorted if data is always appended in time order.
The Interleaved Sort would be causing Redshift to read many more disk blocks to find your data, thereby significantly increasing query time.
DISTKEY
The DISTKEY should typically be set to the field that is most used in a JOIN statement on the table. This is because data relating to the same DISTKEY value is stored on the same slice. This won't have such a large impact on a single node cluster, but it is still worth getting right.
Again, you have only shown one type of query, so it is hard to recommend a DISTKEY. Based on this query alone, I would recommend DISTKEY EVEN so that all slices participate in the query. (It is also the default DISTKEY if no specific DISTKEY is selected.) Alternatively, set DISTKEY to a field not shown -- but certainly don't use brandId as the DISTKEY otherwise only one slice will participate in the query shown.
VACUUM
VACUUM your tables regularly so that the data is stored in SORTKEY order and deleted data is removed from storage.
Experiment!
Optimal settings depend upon your data and the queries you typically run. Perform some tests to compare SORTKEY and DISTKEY values and choose the settings that perform the best. Then, test again in 3 months to see if your queries or data has changed enough to make other settings more efficient.
Some time the issue could be due to locks being acquired by other processes. You can refer: https://aws.amazon.com/premiumsupport/knowledge-center/prevent-locks-blocking-queries-redshift/
I'd also like to add that in your query you are performing date transformations. Date operations are expensive in Redshift.
-- This date operation is expensive
date_trunc('day', ti) as date
If you have the luxury you should store the date in the format you need in an additional column.

Redshift performance: encoding on join column

would encoding on join column corrupts the query performance ? I let the "COPY command" to decide the encoding type.
In gernal no - since an encoding on your DIST KEY will even have a positive impact due to the reduction disk I/O.
According to the AWS table design playbook There are a few edge case were indeed an encoding on your DIST KEY will corrupt your query performance:
Your query patterns apply range restricted scans to a column that is
very well compressed.
The well compressed column’s blocks each contain a large number of values per block, typically many more values than the actual count of values your query is interested in.
The other columns necessary for the query pattern are large or don’t compress well. These columns are > 10x the size of the well
compressed column.
If you want to find the optimal encoding for your table you can use the Redshift column encoding utility.
Amazon Redshift is a column-oriented database, which means that rather than organising data on disk by rows, data is stored by column, and rows are extracted from column storage at runtime. This architecture is particularly well suited to analytics queries on tables with a large number of columns, where most queries only access a subset of all possible dimensions and measures. Amazon Redshift is able to only access those blocks on disk that are for columns included in the SELECT or WHERE clause, and doesn’t have to read all table data to evaluate a query. Data stored by column should also be encoded , which means that it is heavily compressed to offer high read performance. This further means that Amazon Redshift doesn’t require the creation and maintenance of indexes: every column is almost like its own index, with just the right structure for the data being stored.
Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers find large performance gains when they ensure that column encoding is optimally applied.
So your question it will not corrupt the query performance but not a best practice.
There are a couple of details on this by AWS respondants:
AWS Redshift : DISTKEY / SORTKEY columns should be compressed?
Generally:
DISTKEY can be compressed but the first SORTKEY column should be uncompressed (ENCODE raw).
If you have multiple sort keys (compound) the other sort key columns can be compressed.
Also, generally recommend using a commonly filtered date/timestamp column,
(if one exists) as the first sort key column in a compound sort key.
Finally, if you are joining between very large tables try using the same dist
and sort keys on both tables so Redshift can use a faster merge join.
Based on this, i think as long as both sides of the join have the same compression, i think redshift will join on the compressed value safely.