AWS Athena partitioning - projected and otherwise - makes queries run forever

I'm trying to introduce partitioning into an Athena table. For background: I have a Kinesis data stream receiving JSON events, and a delivery stream that converts the JSON text to ORC via a Glue table (call it glue-table) and writes the output to s3://bucket-name/path/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/.
With this setup, and no partitioning properties or fields set up in the Glue table at all, an external table named glue-table automatically appears in Athena. After running ~50 sample events through Kinesis, I have data in the appropriate path in S3 (for example, aws s3 ls s3://bucket-name/path/year=2022/month=08/day=31/hour=19 lists results). I can execute SELECT * FROM glue-table in Athena and get ~50 results in ~900ms.
If I execute MSCK REPAIR TABLE glue_table the result is Partitions not in metastore: glue_table:year=2022/month=08/day=31/hour=19 which implies Athena is able to see that there ought to be partitions based on the templates it finds in the S3 path.
If I then do this to create a partitioned table using projections:
CREATE EXTERNAL TABLE partitioned_table (
action string,
createdat string
)
PARTITIONED BY (
year integer,
month integer,
day integer,
hour integer
)
STORED AS ORC
LOCATION 's3://bucket-name/path/'
TBLPROPERTIES (
'classification'='orc',
'projection.year.type'='integer',
'projection.year.range'='2022,9999',
'projection.month.type'='integer',
'projection.month.range'='01,12',
'projection.day.type'='integer',
'projection.day.range'='01,31',
'projection.hour.type'='integer',
'projection.hour.range'='00,24',
'projection.enabled'='true',
'storage.location.template'='s3://bucket-name/path/year=${year}/month=${month}/day=${day}/hour=${hour}/');
and then execute SELECT action, year FROM partitioned_table WHERE year=2022; then Athena chews on the query for 3 minutes and then returns no results.
If I instead modify the Glue table so that its Table Properties carry the same projection settings as the TBLPROPERTIES in the example above, and then execute the same SELECT action, year FROM partitioned_table WHERE year=2022; the outcome is identical: Athena chews on the query for 3 minutes and then returns no results.
Interestingly, if I drop and recreate partitioned_table so that its location is s3://bucket-does-not-exist, then SELECT action, year FROM partitioned_table WHERE year=2022; also runs for 3+ minutes and then returns no results.
I'm not able to see what I'm doing wrong here after referring to multiple pages of AWS documentation and a few official AWS videos describing this process. What's wrong here?
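A rough way to reason about the projection settings above (a sketch under stated assumptions, not a confirmed diagnosis): with partition projection, Athena computes partition locations from the table properties, so every key that the WHERE clause does not pin expands to its full range. Per the Athena documentation, integer ranges are inclusive, and values are rendered without leading zeros unless projection.<column>.digits is set. The helper names projected_count and render are hypothetical, used only for illustration.

```python
# Ranges copied from the TBLPROPERTIES in the question (inclusive on both ends).
RANGES = {"year": (2022, 9999), "month": (1, 12), "day": (1, 31), "hour": (0, 24)}
TEMPLATE = "s3://bucket-name/path/year=${year}/month=${month}/day=${day}/hour=${hour}/"


def projected_count(ranges, fixed):
    """Count partition locations implied by the ranges, given keys pinned by the WHERE clause."""
    total = 1
    for key, (low, high) in ranges.items():
        total *= 1 if key in fixed else (high - low + 1)
    return total


def render(template, values, digits=None):
    """Fill the location template; digits mimics projection.<column>.digits zero-padding."""
    path = template
    for key, value in values.items():
        text = str(value) if digits is None else str(value).zfill(digits)
        path = path.replace("${" + key + "}", text)
    return path


values = {"year": 2022, "month": 8, "day": 31, "hour": 19}
print(projected_count(RANGES, fixed={"year"}))  # candidate locations for WHERE year=2022
print(render(TEMPLATE, values))                 # unpadded: .../month=8/...
print(render(TEMPLATE, values, digits=2))       # padded:   .../month=08/...
```

Under these assumptions, WHERE year=2022 still implies 9300 candidate locations (12 x 31 x 25), and the unpadded month=8 would not match the month=08 prefix that the delivery stream writes. That may account for both the long runtime and the empty result, but it is worth verifying against the actual table.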

Related

How Redshift Spectrum scans data?

Given a data-source of 1.4 TB of Parquet data on S3 partitioned by a timestamp field (so partitions are year - month - day) I am querying a specific day of data (2.6 GB of data) and retrieving all available fields in the Parquet files via Redshift Spectrum with this query:
SELECT *
FROM my_external_schema.my_external_table
WHERE year = '2020' and month = '01' and day = '01'
The table is made available via a Glue Crawler that points at the top level "folder" in S3; this creates a Database and then via this command I link the Database to the new external schema:
create external schema my_external_schema from data catalog
database 'my_external_schema'
iam_role 'arn:aws:iam::123456789:role/my_role'
region 'my-region-9';
Analysing the table in my IDE I can see the table is generated by this statement:
create external table my_external_schema.my_external_table
(
id string,
my_value string,
my_nice_value string
)
partitioned by (year string, month string, day string)
row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
with serdeproperties ('serialization.format'='1')
stored as
inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
location 's3://my-bucket/my/location/'
table properties ('CrawlerSchemaDeserializerVersion'='1.0', 'CrawlerSchemaSerializerVersion'='1.0', 'UPDATED_BY_CRAWLER'='my_crawler');
When I analyse the query in Redshift, I see that it scanned ~86 GB of data instead.
How is that possible? It is a concern because Redshift Spectrum bills based on the amount of data scanned, and it looks like the service is scanning around 40 times the amount of data that is actually in that partition.
I also tried to execute the same query in Athena and there I get only 2.55 GB of data scanned (definitely more reasonable).
I can't give too many details on the cluster size but assume that those 86GB of scanned data would fit in the cluster's Memory.
The problem seems to be in the AWS Redshift Console.
If I analyse the query from "query details" in the Redshift console, "Total data scanned" reports 86GB. As Vzarr mentioned, I ran the same query on Athena to compare. The execution time was basically the same, but the amount of data scanned was completely different: 2.55GB.
I did the same comparison with other queries on the S3 external schema, with and without partition columns: the number of GB scanned differed in every test, sometimes by a lot (320MB in Redshift Spectrum, 20GB in Athena).
I decided to look at the system tables in Redshift in order to understand how the query on the external schema was working. I did a very simple test using SVL_S3QUERY:
SELECT (cast(s3_scanned_bytes as double precision) / 1024 / 1024 / 1024) as gb_scanned,
s3_scanned_rows,
query
FROM SVL_S3QUERY
WHERE query = '<my-query-id>'
The result was completely different from what the AWS Redshift console reports for the same query: both the GB scanned and the number of rows scanned differed. SVL_S3QUERY reported a total of 2.55GB of data scanned, exactly what Athena reported.
To confirm the numbers in SVL_S3QUERY, I used AWS Cost Explorer to compare the total GB scanned in a day against what we paid for Redshift Spectrum: the numbers were basically the same.
At this point, I don't know where the AWS Redshift console takes its query details from, but they seem to be completely wrong.
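The cross-check against the bill described above can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, and the price per TiB is a parameter you would need to take from the pricing page for your region rather than a value confirmed by this answer.

```python
def bytes_to_gib(num_bytes):
    """Same conversion as the SVL_S3QUERY query above: bytes / 1024 / 1024 / 1024."""
    return num_bytes / 1024 / 1024 / 1024


def estimated_scan_cost(num_bytes, usd_per_tib):
    """Estimate the scan cost; usd_per_tib is an assumed price supplied by the caller."""
    return num_bytes / 1024 ** 4 * usd_per_tib


# Hypothetical s3_scanned_bytes value summed over a day's queries.
scanned = 3 * 1024 ** 4
print(bytes_to_gib(scanned))               # GiB scanned
print(estimated_scan_cost(scanned, 5.0))   # cost at an assumed 5 USD per TiB
```

If the figure computed from SVL_S3QUERY lines up with Cost Explorer, that supports trusting the system table over the console's "Total data scanned".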

How does Amazon Athena select new files/records from S3?

I add files to Amazon S3 from time to time, and I use Amazon Athena to query this data and save the aggregated results to another S3 bucket in CSV format. I'm trying to find a way for Athena to select only new data (data that Athena has not queried before), in order to reduce cost and avoid duplicating data.
I tried updating the records after Athena selected them, but Athena does not support UPDATE queries.
Is there any way to solve this?
Athena does not keep track of files on S3; it only figures out which files to read when you run a query.
When planning a query, Athena looks up the table location in the table metadata, lists that location, and then reads all the files it finds during query execution. If the table is partitioned, it lists the locations of all partitions that match the query.
The only way to control which files Athena reads during query execution is to partition the table and make sure your queries match only the partitions you want it to read.
One common way of reading only new data is to put data into prefixes on S3 that include the date, and create tables partitioned by date. At query time you can then filter on the last week, month, or other time period to limit the amount of data read.
You can find more information about partitioning in the Athena documentation.
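As a minimal sketch of the date-filter idea, the helper below builds a WHERE-clause fragment covering the last few daily partitions. It assumes a hypothetical partition column named dt with values formatted as YYYY-MM-DD; neither the column name nor the layout comes from the question.

```python
from datetime import date, timedelta


def recent_dates(today, days):
    """Return the ISO dates of the last `days` days, oldest first, ending with `today`."""
    return [(today - timedelta(days=offset)).isoformat() for offset in range(days - 1, -1, -1)]


def partition_filter(column, today, days):
    """Build a WHERE-clause fragment that limits a query to recent date partitions."""
    quoted = ", ".join(f"'{d}'" for d in recent_dates(today, days))
    return f"{column} IN ({quoted})"


# "dt" is a hypothetical partition column name.
print(partition_filter("dt", date(2022, 8, 31), 3))
# prints: dt IN ('2022-08-29', '2022-08-30', '2022-08-31')
```

Because the filter names partition values explicitly, Athena only needs to list and read those partitions. Tracking which dates have already been processed is still up to the caller.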

Determine the table creation date in AWS Athena using information_schema catalog?

Does anyone know if it's possible to retrieve the creation date of a table in AWS Athena using SQL on the information_schema catalog? I know I can use show properties on an individual table basis, but I want to get the data for thousands of tables.
This can tell you when the data in a given table was populated.
1) list the s3 objects in the table
Some SQL to execute via Athena. I like knowing record counts as well, but really the "$PATH" thing is the important part.
select count(*) as record_cnt, "$PATH" as path
from myschema.mytable
group by "$PATH"
order by "$PATH"
record_cnt path
10000 s3://mybucket/data/foo/tables/01234567-890a-bcde/000000_00000-abcdef
..etc...
2) get the timestamp for the s3 objects
$ aws s3 ls s3://mybucket/data/foo/tables/01234567-890a-bcde/000000_00000-abcdef
2022-01-01 12:01:23 4456780 000000_00000-abcdef
$
You'll need to loop through the S3 objects and consolidate the timestamps.
I see +/- a few seconds on the backing S3 objects when I create tables, so you may want to round to the nearest 5 minutes... or hour. (shrug) Whatever fits the tempo of your table creation.
I've also seen tables where data is appended over time, so maybe the closest you could get to an actual table creation timestamp would be the oldest S3 object's timestamp.
PS: I haven't dug into the Glue table definition enough to know whether one can get useful metadata out of that.
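The consolidation step could look something like the sketch below. It assumes you have already captured the lines printed by aws s3 ls for the table's objects (object lines only, not "PRE" prefix lines); the function names and the second listing entry are hypothetical.

```python
from datetime import datetime


def parse_ls_line(line):
    """Parse the timestamp from one `aws s3 ls` object line (date, time, size, key)."""
    day, clock = line.split()[:2]
    return datetime.strptime(f"{day} {clock}", "%Y-%m-%d %H:%M:%S")


def estimated_creation(lines, bucket_minutes=5):
    """Oldest object timestamp, rounded down to a multiple of bucket_minutes."""
    oldest = min(parse_ls_line(line) for line in lines if line.strip())
    floored = oldest.minute - oldest.minute % bucket_minutes
    return oldest.replace(minute=floored, second=0, microsecond=0)


listing = [
    "2022-01-01 12:01:23 4456780 000000_00000-abcdef",
    "2022-01-01 12:01:25 4456112 000001_00000-abcdef",  # hypothetical second object
]
print(estimated_creation(listing))  # prints: 2022-01-01 12:00:00
```

Taking the minimum follows the suggestion of using the oldest object as the best available proxy when data is appended over time. Use bucket_minutes=60 to round down to the hour instead.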

Duplicate Table in AWS Glue using AWS Athena

I have a table in AWS Glue which uses an S3 bucket for its data location. I want to execute an Athena query on that existing table and use the query results to create a new Glue table.
I have tried creating a new Glue table, pointing it to a new location in S3, and piping the Athena query results to that S3 location. This almost accomplishes what I want, but:
1) A .csv.metadata file is put in this location along with the actual .csv output (and the Glue table reads it, since it reads all files in the specified S3 location).
2) The CSV file places double quotes around each field, which breaks any fieldSchema defined in the Glue table that uses numbers.
These services are all designed to work together, so there must be a proper way to accomplish this. Any advice would be much appreciated :)
The way to do that is by using CTAS query statements.
A CREATE TABLE AS SELECT (CTAS) query creates a new table in Athena from the results of a SELECT statement from another query. Athena stores data files created by the CTAS statement in a specified location in Amazon S3.
For example:
CREATE TABLE new_table
WITH (
external_location = 's3://my_athena_results/new_table_files/'
) AS (
-- Here goes your normal query
SELECT
*
FROM
old_table
)
There are some limitations, though. For your case the most important are:
1) The destination location for storing CTAS query results in Amazon S3 must be empty.
2) The new table name must not already exist in the AWS Glue Data Catalog.
In general, you don't have explicit control over how many files a CTAS query creates, since Athena is a distributed system.
However, you can try a workaround that uses the bucketed_by and bucket_count fields in the WITH clause:
CREATE TABLE new_table
WITH (
external_location = 's3://my_athena_results/new_table_files/',
bucketed_by=ARRAY['some_column_from_select'],
bucket_count=1
) AS (
-- Here goes your normal query
SELECT
*
FROM
old_table
)
Apart from creating new files and defining a table over them, CTAS can also convert your data to a different file format, e.g. Parquet, JSON, etc.
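As an illustration of the format conversion, here is a small sketch that assembles a CTAS statement with the format property set. The build_ctas helper is hypothetical and simply formats a string from trusted inputs; submitting the statement to Athena is left out.

```python
def build_ctas(new_table, select_sql, external_location, file_format="PARQUET"):
    """Assemble an Athena CTAS statement from trusted strings (no escaping is done)."""
    # Drop any trailing semicolon so the SELECT stays valid inside the parentheses.
    select_body = select_sql.strip().rstrip(";")
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (\n"
        f"  external_location = '{external_location}',\n"
        f"  format = '{file_format}'\n"
        f") AS (\n"
        f"{select_body}\n"
        f")"
    )


statement = build_ctas(
    "new_table",
    "SELECT * FROM old_table;",
    "s3://my_athena_results/new_table_files/",
)
print(statement)
```

Writing Parquet instead of the default CSV query output also sidesteps the quoted-field and .csv.metadata problems from the question, since the table's data files are produced by the CTAS statement itself.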
I guess you have to change your SerDe. If you are querying CSV data, either OpenCSVSerDe or LazySimpleSerDe should work for you.

Hive query on s3 partition is too slow

I have partitioned the data by date and here is how it is stored in s3.
s3://dataset/date=2018-04-01
s3://dataset/date=2018-04-02
s3://dataset/date=2018-04-03
s3://dataset/date=2018-04-04
...
Created hive external table on top of this. I am executing this query,
select count(*) from dataset where `date` ='2018-04-02'
This partition has two parquet files like this,
part1 -xxxx- .snappy.parquet
part2 -xxxx- .snappy.parquet
Each file is 297MB, so these are not big files and there are not many files to scan.
The query returns 12201724 records, but it takes 3.5 minutes to do so. Since a single partition takes this long, running even the count query on the whole dataset (7 years of data) takes hours to return results. Is there any way I can speed this up?
Amazon Athena is, effectively, a managed Presto service. It can query data stored in Amazon S3 without having to run any clusters.
It is charged based upon the amount of data read from disk, so it runs very efficiently when using partitions and parquet files.
See: Analyzing Data in S3 using Amazon Athena | AWS Big Data Blog