I have a data source of 1.4 TB of Parquet data on S3, partitioned by a timestamp field (so the partitions are year/month/day). I am querying a single day of data (2.6 GB) and retrieving all available fields in the Parquet files via Redshift Spectrum with this query:
SELECT *
FROM my_external_schema.my_external_table
WHERE year = '2020' and month = '01' and day = '01'
The table is made available by a Glue Crawler that points at the top-level "folder" in S3. The crawler creates a database, which I then link to the new external schema with this command:
create external schema my_external_schema from data catalog
database 'my_external_schema'
iam_role 'arn:aws:iam::123456789:role/my_role'
region 'my-region-9';
Analysing the table in my IDE I can see the table is generated by this statement:
create external table my_external_schema.my_external_table
(
id string,
my_value string,
my_nice_value string
)
partitioned by (year string, month string, day string)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/my/location/'
table properties ('CrawlerSchemaDeserializerVersion'='1.0', 'CrawlerSchemaSerializerVersion'='1.0', 'UPDATED_BY_CRAWLER'='my_crawler');
When I analyse the query in Redshift, I see that it scanned ~86 GB of data instead.
How is that possible? It is a concern because Redshift bills based on the amount of data scanned, and it looks like the service is scanning more than 30 times the amount of data that is actually in that partition.
I also tried executing the same query in Athena, and there only 2.55 GB of data is scanned (definitely more reasonable).
I can't give too many details on the cluster size, but assume that those 86 GB of scanned data would fit in the cluster's memory.
The problem seems to be in the AWS Redshift Console.
If I analyse the query from "query details" in the Redshift console, "Total data scanned" reports 86 GB. As Vzarr mentioned, I ran the same query on Athena to compare the performance. The execution time was basically the same, but the amount of data scanned was completely different: 2.55 GB.
I did the same comparison with other queries on the S3 external schema, with and without using partition columns. The total GB scanned differed in every test, sometimes by a lot (320 MB in Redshift Spectrum, 20 GB in Athena).
I decided to look at the system tables in Redshift in order to understand how the query on the external schema was working. I did a very simple test using SVL_S3QUERY:
SELECT (cast(s3_scanned_bytes as double precision) / 1024 / 1024 / 1024) as gb_scanned,
s3_scanned_rows,
query
FROM SVL_S3QUERY
WHERE query = '<my-query-id>'
The result was completely different from what the AWS Redshift Console reports for the same query. Not only was the console's figure for GB scanned wrong, its figure for rows scanned was too. The system view reports a total of 2.55 GB of data scanned, exactly what Athena reported.
To confirm the numbers in SVL_S3QUERY, I used AWS Cost Explorer to compare the total GB scanned in a day with how much we paid for Redshift Spectrum: the numbers were basically the same.
At this point, I don't know where, or from which table, the AWS Redshift Console takes the query details, but they seem to be completely wrong.
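To make that comparison concrete, here is a minimal Python sketch that converts a byte count such as s3_scanned_bytes into GB the same way the query above does (dividing by 1024 three times) and estimates the resulting Spectrum charge. The price of 5 USD per TB scanned and the helper names are assumptions for illustration only; check the current pricing for your region.

```python
BYTES_PER_GIB = 1024 ** 3
BYTES_PER_TIB = 1024 ** 4

# Assumption: 5 USD per TB scanned. Verify against current pricing for your region.
ASSUMED_USD_PER_TB = 5.0


def bytes_to_gib(scanned_bytes: int) -> float:
    """Same conversion as the SQL above: bytes / 1024 / 1024 / 1024."""
    return scanned_bytes / BYTES_PER_GIB


def estimated_cost_usd(scanned_bytes: int, usd_per_tb: float = ASSUMED_USD_PER_TB) -> float:
    """Rough charge for one query, ignoring any minimum-charge or rounding rules."""
    return scanned_bytes / BYTES_PER_TIB * usd_per_tb


if __name__ == "__main__":
    # Figures quoted in this thread: system view / Athena vs. the console.
    system_view_bytes = int(2.55 * BYTES_PER_GIB)
    console_bytes = 86 * BYTES_PER_GIB
    for label, value in (("system view", system_view_bytes), ("console", console_bytes)):
        print(label, round(bytes_to_gib(value), 2), "GB,",
              round(estimated_cost_usd(value), 4), "USD (estimated)")
```

Comparing a day's worth of such estimates against the Spectrum line in Cost Explorer is one way to see which of the two figures the bill actually follows.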
I am working with Delta tables and Redshift Spectrum, and I have noticed some strange behaviour.
I followed this article to set up a Redshift Spectrum to Delta Lake integration using manifest files and to query Delta tables: https://docs.delta.io/latest/redshift-spectrum-integration.html
Environment information
Delta Lake version: 1.0.1 (io.delta)
Spark version: 2.4.3
Scala version: 2.12.10
Describe the problem
In my use case, the Delta table is partitioned by 3 columns (year, month and day). The Delta table also has an "application_id" column, which is used as the key for insert/update operations.
CREATE EXTERNAL TABLE yyyyy.xxxxxxxx (
application_id string,
general_status_startingoffer string,
general_status_offer string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://xxxxxxx/v0/data/_symlink_format_manifest/'
TBLPROPERTIES ('delta.compatibility.symlinkFormatManifest.enabled'='true');
In addition, an external schema has been created in Redshift to mirror the data of the latest version of the Delta table.
create external schema yyyyy
from data catalog database 'yyyyy'
iam_role '${iam_role}';
We are using Athena and Redshift Editor to query the records.
The issue seems to be linked to an old partition that was deleted in the latest version of the Delta table. In particular, Redshift raises an error when fetching the Delta Lake manifest.
Steps to reproduce
The following steps reproduce the error:
1. The first Glue job run creates 1 record in the partition year=2022, month=10, day=05.
   - The symlink manifest is generated for Redshift.
   - The data has been written correctly.
   - The Delta log shows a newly added partition (year=2022, month=10, day=05).
   - The query on Athena shows correct results.
   - The query on Redshift shows correct results.
2. The second Glue job run updates this record, changing 2 columns (general_status_offer and day).
   - The symlink manifest is regenerated for Redshift.
   - The Redshift manifest seems to be updated correctly.
   - The Delta log shows a newly added partition (year=2022, month=10, day=13) and a deleted partition (year=2022, month=10, day=05).
   - The query on Athena shows correct results.
   - The query on Redshift fails with the following error message:
Caught exception in worker_thread loader thread for location=s3://xxxxxxx/v0/data/_symlink_format_manifest/year=2022/month=10/day=05: error=DeltaManifest context=Error fetching Delta Lake manifest xxxxx/v0/data/_symlink_format_manifest/year=2022/month=10/day=05/manifest Message: S3ServiceException:The specified key does not exist.,Status 404,Error NoSuchKey,Rid 8ZCB2E6TTZ7JMHFD,ExtRid vnD7YB7JAPkW/
It can be resolved either by adding a new symlink folder (year=2022, month=10, day=05) containing an empty manifest file, or by running this statement on Redshift:
ALTER TABLE xxxx DROP PARTITION (year='2022', month='10', day='05');
We need to automate this step, and it is not easy to recognize which partitions have been deleted. Are there any properties that I can set to force automatic resolution?
If not, this could be a good enhancement :)
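Until such a property exists, the manual workaround above could be scripted. The Python sketch below is only an illustration: it assumes you can obtain two lists yourself, the partitions registered in the catalog and the manifest keys currently present under the _symlink_format_manifest prefix, and it emits one DROP PARTITION statement for each partition that no longer has a manifest. All function names are hypothetical, and how the two lists are fetched is deliberately left out.

```python
from typing import Iterable, List, Tuple

Partition = Tuple[str, str, str]  # (year, month, day)


def parse_partition(key: str) -> Partition:
    """Extract (year, month, day) from a key like .../year=2022/month=10/day=13/manifest."""
    values = {}
    for segment in key.strip("/").split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            values[name] = value
    return values["year"], values["month"], values["day"]


def stale_partitions(catalog: Iterable[Partition], manifest_keys: Iterable[str]) -> List[Partition]:
    """Partitions known to the catalog that no longer have a manifest file."""
    live = {parse_partition(key) for key in manifest_keys}
    return sorted(set(catalog) - live)


def drop_statements(table: str, partitions: Iterable[Partition]) -> List[str]:
    """Build the same ALTER TABLE statement used as the manual fix above."""
    return [
        f"ALTER TABLE {table} DROP PARTITION (year='{y}', month='{m}', day='{d}');"
        for y, m, d in partitions
    ]


if __name__ == "__main__":
    catalog = [("2022", "10", "05"), ("2022", "10", "13")]
    manifests = ["v0/data/_symlink_format_manifest/year=2022/month=10/day=13/manifest"]
    for statement in drop_statements("yyyyy.xxxxxxxx", stale_partitions(catalog, manifests)):
        print(statement)
```

Running something like this after each manifest regeneration would keep the catalog in line with the manifests, at the cost of one extra step in the Glue job.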
I'm trying to introduce partitioning into an Athena table. By way of background, I have a Kinesis data stream receiving JSON events, and a delivery stream that converts the JSON text to ORC via a Glue table we'll call glue-table and writes the output to s3://bucket-name/path/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/.
With this setup, and no partitioning properties or fields set up in the glue table at all, there is automatically an external table in Athena named glue-table and with ~50 sample events run through kinesis, I have data in the appropriate path in S3 (aws s3 ls s3://bucket-name/path/year=2022/month=08/day=31/hour=19 for example lists results). I can execute SELECT * FROM glue-table in Athena and get ~50 results in ~900ms.
If I execute MSCK REPAIR TABLE glue_table the result is Partitions not in metastore: glue_table:year=2022/month=08/day=31/hour=19 which implies Athena is able to see that there ought to be partitions based on the templates it finds in the S3 path.
If I then do this to create a partitioned table using projections:
CREATE EXTERNAL TABLE partitioned_table (
action string,
createdat string
)
PARTITIONED BY (
year integer,
month integer,
day integer,
hour integer
)
STORED AS ORC
LOCATION 's3://bucket-name/path/'
TBLPROPERTIES (
'classification'='orc',
'projection.year.type'='integer',
'projection.year.range'='2022,9999',
'projection.month.type'='integer',
'projection.month.range'='01,12',
'projection.day.type'='integer',
'projection.day.range'='01,31',
'projection.hour.type'='integer',
'projection.hour.range'='00,24',
'projection.enabled'='true',
'storage.location.template'='s3://bucket-name/path/year=${year}/month=${month}/day=${day}/hour=${hour}/');
and then execute SELECT action, year FROM partitioned_table WHERE year=2022; then Athena chews on the query for 3 minutes and then returns no results.
If I go modify the glue table so it has the same projection settings from TBLPROPERTIES in the above example set as Table Properties, and then execute SELECT action, year FROM partitioned_table WHERE year=2022; then Athena chews on the query for 3 minutes and then returns no results.
Interestingly, if I drop and recreate partitioned_table so that its location is s3://bucket-does-not-exist, then SELECT action, year FROM partitioned_table WHERE year=2022; also runs for 3+ minutes and then returns no results.
I'm not able to see what I'm doing wrong here after referring to multiple pages of AWS documentation and a few official AWS videos describing this process. What's wrong here?
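One thing that may be worth checking here, offered as an assumption rather than a confirmed diagnosis: the S3 prefixes are zero-padded (month=08), while an integer projection column renders plain integers unless a projection.<column>.digits property is set for it. The Python sketch below only imitates that rendering and counts how many locations a set of ranges expands to; render_location and count_locations are hypothetical helpers, not part of any AWS SDK.

```python
from typing import Dict, Optional, Tuple

TEMPLATE = "s3://bucket-name/path/year=${year}/month=${month}/day=${day}/hour=${hour}/"


def render_location(template: str, values: Dict[str, int],
                    digits: Optional[Dict[str, int]] = None) -> str:
    """Fill the template, zero-padding only the columns listed in `digits`."""
    digits = digits or {}
    location = template
    for column, value in values.items():
        text = str(value).zfill(digits.get(column, 0))
        location = location.replace("${" + column + "}", text)
    return location


def count_locations(ranges: Dict[str, Tuple[int, int]]) -> int:
    """Number of value combinations for inclusive integer ranges."""
    total = 1
    for low, high in ranges.values():
        total *= high - low + 1
    return total


if __name__ == "__main__":
    event = {"year": 2022, "month": 8, "day": 31, "hour": 19}
    print(render_location(TEMPLATE, event))
    print(render_location(TEMPLATE, event, digits={"month": 2, "day": 2, "hour": 2}))
    # Ranges from the table definition above, with year pinned by WHERE year=2022.
    print(count_locations({"year": (2022, 2022), "month": (1, 12),
                           "day": (1, 31), "hour": (0, 24)}))
```

With the ranges from the table definition and year fixed to 2022, this comes to 1 × 12 × 31 × 25 = 9300 candidate locations (the hour range 00,24 contributes 25 values), and an unpadded location such as month=8 would not match the prefix month=08 written by the delivery stream.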
TL;DR: in Athena, a select with limit 10 scans more data for the Parquet format than for the CSV format. Shouldn't it be the other way round?
I am using Athena (V1) to query the following two datasets (the same data in two different file formats):
| Format | Size | Athena DB name | Athena table name | Dataset description |
|---|---|---|---|---|
| CSV | 91.3 MB | nycitytaxi | data | nycity taxi trip data, present in a public S3 bucket |
| Parquet | 19.4 MB | nycitytaxi | aws_glue_result_xxxx | Same data as above, converted to Parquet through a Glue Crawler job and stored in one of my S3 buckets |
Now I am executing the following query on both tables:
select lpep_pickup_datetime, lpep_dropoff_datetime
from nycitytaxi.<table_name>
limit 10
When I execute this query on the CSV-based table (table_name: data), the Athena console shows that it scanned 721.96 KB of data.
When I execute this query on the Parquet-based table (table_name: aws_glue_result_xxxx), the Athena console shows that it scanned 10.9 MB of data.
Shouldn't Athena be scanning far less data for the Parquet-based table, since Parquet is columnar, as opposed to the row-based storage of CSV?
It is due to your specific query.
select lpep_pickup_datetime, lpep_dropoff_datetime
from nycitytaxi.<table_name>
limit 10
In row-based formats like CSV, all data is stored row-wise. This means that as soon as you ask for any 10 rows, the engine can just start reading the CSV file from the beginning and stop after the first 10 rows, resulting in a very small data scan.
In columnar formats like Parquet, the records are stored column-wise. Let us assume the data has three columns, say id, name and number. All the id values are stored together, all the name values are stored together, and all the number values are stored together. So when you ask for 10 rows from Parquet, the engine has to fetch 10 values from each selected column, and these sit in different storage locations. Column values are also stored in compressed blocks, so even a handful of values means reading at least one whole block per column. As a result, more data is scanned.
More on parquet pros and cons here.
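The argument can be illustrated with a toy model in Python. This is not how Athena accounts for scanned bytes; the row size, read-chunk size and column-block size below are made-up numbers chosen only to show why a small LIMIT can favour a row-based file, and both function names are hypothetical.

```python
import math


def row_format_bytes(limit: int, avg_row_bytes: int, read_chunk_bytes: int) -> int:
    """Row-based file: read whole chunks from the start until `limit` rows are covered."""
    needed = limit * avg_row_bytes
    return math.ceil(needed / read_chunk_bytes) * read_chunk_bytes


def columnar_format_bytes(limit: int, selected_columns: int,
                          rows_per_block: int, block_bytes: int) -> int:
    """Columnar file: read whole blocks of every selected column that cover `limit` rows."""
    blocks_per_column = math.ceil(limit / rows_per_block)
    return selected_columns * blocks_per_column * block_bytes


if __name__ == "__main__":
    # Made-up sizes: 200-byte rows read in 64 KiB chunks,
    # versus two selected columns stored in 5 MiB blocks of 100,000 values.
    print(row_format_bytes(10, 200, 64 * 1024))
    print(columnar_format_bytes(10, 2, 100_000, 5 * 1024 * 1024))
```

The same model shows the opposite outcome for a full scan of two columns out of many, which is the case where a columnar layout pays off.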
I add files to Amazon S3 from time to time, and I use Amazon Athena to query this data and save the aggregated result as CSV in another S3 bucket. I'm trying to find a way for Athena to select only new data (data it has not queried before), in order to optimize cost and avoid duplicating data.
I tried to update the records after they had been selected by Athena, but update queries are not supported in Athena.
Any ideas on how to solve this?
Athena does not keep track of files on S3; it only figures out which files to read when you run a query.
When planning a query, Athena looks at the table metadata for the table location, lists that location, and finally reads all the files that it finds during query execution. If the table is partitioned, it lists the locations of all partitions that match the query.
The only way to control which files Athena will read during query execution is to partition a table and ensure that queries match the partitions you want it to read.
One common way of reading only new data is to put data into prefixes on S3 that include the date, and create tables partitioned by date. At query time you can then filter on the last week, month, or other time period to limit the amount of data read.
You can find more information about partitioning in the Athena documentation.
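As a rough illustration of the date-partition approach, the Python sketch below builds a query that touches only the most recent daily partitions. The table name, the partition column name dt and the ISO date format of the partition values are assumptions for the example; the resulting string would still need to be submitted to Athena by whatever client you use.

```python
from datetime import date, timedelta
from typing import List


def recent_partition_values(today: date, days: int) -> List[str]:
    """ISO dates for `today` and the `days - 1` days before it, newest first."""
    return [(today - timedelta(days=offset)).isoformat() for offset in range(days)]


def build_query(table: str, partition_column: str, today: date, days: int) -> str:
    """SELECT restricted to the most recent daily partitions."""
    values = ", ".join(f"'{value}'" for value in recent_partition_values(today, days))
    return f"SELECT * FROM {table} WHERE {partition_column} IN ({values})"


if __name__ == "__main__":
    # Hypothetical table and column names.
    print(build_query("my_table", "dt", date.today(), 7))
```

If each run records the last date it processed, the same helper can be pointed at exactly the dates added since then, which avoids both rescanning old data and duplicating results.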
I have partitioned the data by date and here is how it is stored in s3.
s3://dataset/date=2018-04-01
s3://dataset/date=2018-04-02
s3://dataset/date=2018-04-03
s3://dataset/date=2018-04-04
...
I created a Hive external table on top of this, and I am executing this query:
select count(*) from dataset where `date` ='2018-04-02'
This partition has two Parquet files, like this:
part1 -xxxx- .snappy.parquet
part2 -xxxx- .snappy.parquet
Each file is 297 MB, so the files are not big and there are not many of them to scan.
The query returns 12201724 records, but it takes 3.5 minutes to do so. Since a single partition takes this long, running even the count query on the whole dataset (7 years of data) takes hours to return results. Is there any way I can speed this up?
Amazon Athena is, effectively, a managed Presto service. It can query data stored in Amazon S3 without having to run any clusters.
It is charged based upon the amount of data read from disk, so it runs very efficiently when using partitions and parquet files.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog