AWS Athena: Partition projection using date-hour with mixed ranges - amazon-web-services

I am trying to create an Athena table using partition projection. I am delivering records to S3 using Kinesis Firehose, grouped using a dynamic partitioning key. For example, the records look like the following:
period         item_id
2022/05        monthly_item_1
2022/05/04     daily_item_1
2022/05/04/02  hourly_item_1
2022/06        monthly_item_2
I want to partition the data in S3 by period, which can be monthly, daily or hourly. It is guaranteed that period would be in a supported Java date format. Therefore, I am writing these records to S3 in the below format:
s3://bucket/prefix/2022/05/monthly_items.gz
s3://bucket/prefix/2022/05/04/daily_items.gz
s3://bucket/prefix/2022/05/04/02/hourly_items.gz
s3://bucket/prefix/2022/06/monthly_items.gz
I want to run Athena queries for every partition scope, i.e. if my query is for a specific day, I want to fetch its daily_items and hourly_items. If I am running a query for a month, I want to fetch its monthly, daily and hourly items.
I've created an Athena table using below query:
create external table `my_table`(
`period` string COMMENT 'from deserializer',
`item_id` string COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.day.type'='integer',
'projection.day.digits' = '2',
'projection.day.range'='01,31',
'projection.hour.type'='integer',
'projection.hour.digits' = '2',
'projection.hour.range'='00,23',
'projection.month.type'='integer',
'projection.month.digits'='02',
'projection.month.range'='01,12',
'projection.year.format'='yyyy',
'projection.year.range'='2022,NOW',
'projection.year.type'='date',
'storage.location.template'='s3://bucket/prefix/${year}/${month}/${day}/${hour}')
However, running the query below against this table returns zero results:
select * from my_table where year = '2022' and month = '06';
I believe the reason is Athena expects all files to be present under the same prefix as defined by storage.location.template. Therefore, any records present under a month or day prefix are not projected.
I was wondering if it was possible to support such querying functionality in a single table with partition projection enabled, when data in S3 is in a folder type structure similar to the examples above.
Would be great if anyone can help me out!
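One possible way to keep a single table with this layout, offered as a sketch under assumptions rather than a verified answer: stop the partition keys at month, so the projected location for a month is the whole prefix s3://bucket/prefix/2022/05/. This assumes Athena scans every object under a partition's prefix, including those in nested day and hour folders. The table name my_table_by_month and the upper year bound 2030 are placeholders that do not come from the question. The period column stored in each record is then used to narrow a query to a day or an hour.

```sql
-- Sketch only: my_table_by_month and the year range are placeholders.
-- Partitions stop at month, so every object below
-- s3://bucket/prefix/<year>/<month>/ is assumed to belong to that partition.
CREATE EXTERNAL TABLE my_table_by_month (
  `period` string,
  `item_id` string
)
PARTITIONED BY (
  `year` string,
  `month` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/prefix/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2022,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'storage.location.template' = 's3://bucket/prefix/${year}/${month}/'
);

-- Run as a separate statement: daily and hourly items for one day.
SELECT *
FROM my_table_by_month
WHERE year = '2022'
  AND month = '05'
  AND period LIKE '2022/05/04%';
```

The trade-off is that a query for a single day or hour still scans the whole month's prefix, since the day and hour are filtered from the data rather than pruned as partitions.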

Related

How to transform data in Amazon Athena

I have some data in an S3 location in JSON format. It has 4 columns: val, time__stamp, name and type. I would like to create an external Athena table from this data with the transformations given below:
timestamp: the timestamp should be converted from Unix epoch to UTC; I did this by using the timestamp data type.
name: name should be filtered with the following SQL logic:
name not in ('abc','cdf','fgh') and name not like '%operator%'
type: type should not have values labeled as counter
I would like to add two partition columns date and hour which should be derived from time__stamp column
I started with the following:
CREATE EXTERNAL TABLE `airflow_cluster_data`(
`val` string COMMENT 'from deserializer',
`time__stamp` timestamp COMMENT 'from deserializer',
`name` string COMMENT 'from deserializer',
`type` string COMMENT 'from deserializer')
PARTITIONED BY (
date,
hour)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.time_stamp'='#timestamp')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket1/raw/airflow_data'
I tried various things but couldn't figure out the syntax. Using Spark might have been easier, but I don't want to run Amazon EMR every hour for a small data set. I would prefer to do it in Athena if possible.
Please have a look at some sample data:
1533,1636674330000,abc,counter
1533,1636674330000,xyz,timer
1,1636674330000,cde,counter
41,1636674330000,cde,timer
1,1636674330000,fgh,counter
231,1636674330000,xyz,timer
1,1636674330000,abc,counter
2431,1636674330000,cde,counter
42,1636674330000,efg,timer
Probably the simplest method is to create a View:
CREATE VIEW foo AS
SELECT
val,
cast(from_unixtime(time__stamp / 1000) as timestamp) as timestamp,
cast(from_unixtime(time__stamp / 1000) as date) as date,
hour(cast(from_unixtime(time__stamp / 1000) as timestamp)) as hour,
name,
type
FROM airflow_cluster_data
WHERE name not in ('abc','cdf','fgh')
AND name not like '%operator%'
AND type != 'counter'
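If the derived date and hour need to be real partitions rather than view columns, a CTAS statement is one option. The following is a minimal sketch, assuming time__stamp is readable as a bigint of epoch milliseconds (as the view above and the sample data suggest). The target table name, the external_location path and the partition column names dt and hr are hypothetical.

```sql
-- Sketch only: table name, S3 path and the dt/hr names are placeholders.
-- Partition columns must come last in the SELECT list.
CREATE TABLE airflow_cluster_data_clean
WITH (
  format = 'PARQUET',
  external_location = 's3://bucket1/clean/airflow_data/',
  partitioned_by = ARRAY['dt', 'hr']
) AS
SELECT
  val,
  cast(from_unixtime(time__stamp / 1000) AS timestamp) AS event_time,
  name,
  type,
  date_format(from_unixtime(time__stamp / 1000), '%Y-%m-%d') AS dt,
  date_format(from_unixtime(time__stamp / 1000), '%H') AS hr
FROM airflow_cluster_data
WHERE name NOT IN ('abc', 'cdf', 'fgh')
  AND name NOT LIKE '%operator%'
  AND type <> 'counter';
```

Note that a single CTAS query in Athena can write at most 100 partitions, so a large backfill would have to be split into several statements.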
You can create your own UDF for the transformation and use it in Athena: https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html

Create automatic partition for S3 year/month/day/hour folders

My S3 bucket has the following structure.
's3://st.pix/year/month/day/hour' for example
's3://st.pix/2022/09/01/06'
So I tried to create a partition table on this bucket using this code:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
`ad_id` string
)
PARTITIONED BY (year string, month string, day string , hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://st.pix/${year}/${month}/${day}/${hour}/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');
and ran this right afterwards:
MSCK REPAIR TABLE Accesn
But unfortunately, this query gets no result.
SELECT count(*) FROM `acco`.`Accesn` where year ='2022' and month= '03' and day ='01' and hour ='01'
Can I use ${year}/${month}/${day}/${hour}/ in my LOCATION?
If not, what are the options for doing this dynamically, without using ALTER TABLE .. ADD PARTITION for each specific partition?
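For what it's worth, placeholders such as ${year} are not expanded in LOCATION, and MSCK REPAIR TABLE only discovers Hive-style paths like year=2022/month=09/. One option that avoids ALTER TABLE .. ADD PARTITION is partition projection, where the placeholders go into the storage.location.template table property instead. A minimal sketch, assuming zero-padded two-digit month, day and hour folders as in the example path; the year range 2020 to 2030 is a placeholder:

```sql
-- Sketch only: the year range is an assumption, adjust it to the data.
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
  `ad_id` string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://st.pix/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://st.pix/${year}/${month}/${day}/${hour}/'
);
```

With projection enabled, partitions are computed at query time, so MSCK REPAIR TABLE is not needed for this table.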

Build Table in Glue Catalog

Is there a way to create table or update the table in Glue Catalog?
We are using the following DDL to create a table (and database) in Glue Catalog:
CREATE DATABASE IF NOT EXISTS glue_catalog;
CREATE EXTERNAL TABLE IF NOT EXISTS glue_catalog.date
(
file_dt date,
end_dt date
)
PARTITIONED BY (
year string,
month string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix1/date'
TBLPROPERTIES (
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'='|',
'skip.header.line.count'='1',
'typeOfData'='file');
When we need to update the schema (e.g. change a data type), we have to delete the table in Glue Catalog and re-execute the script above, because we're using the CREATE EXTERNAL TABLE IF NOT EXISTS statement. I'm curious whether there's a way to create a table if it does not exist, but update it if it does exist.
You can find the ALTER TABLE commands in the Athena documentation. Although they are documented under the Athena service, they apply to Glue tables as well, because Athena tables are Glue tables.
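As a rough illustration of what those commands look like for the table above (a sketch, not tested against this catalog): the type change to string and the added column load_ts are hypothetical examples, and the table name is quoted with backticks on the assumption that date may be treated as a keyword in DDL.

```sql
-- Sketch only: each statement is run separately in Athena.
-- Change the data type of an existing column (name stays the same).
ALTER TABLE glue_catalog.`date` CHANGE COLUMN end_dt end_dt string;

-- Append a new, hypothetical column after the existing ones.
ALTER TABLE glue_catalog.`date` ADD COLUMNS (load_ts timestamp);
```

These statements only modify the table metadata in the Glue Catalog; the files in S3 are not rewritten, so the new types still have to match what is actually stored there.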

Create external table from csv file in AWS Athena

I am trying to create an external table in AWS Athena from a csv file that is stored in my S3.
The csv file looks as follows. As you can see, the data is not enclosed in quotation marks (") and is delimited by commas (,).
ID,PERSON_ID,DATECOL,GMAT
612766604,54723367,2020-01-15,637
615921503,158634997,2020-01-25,607
610656030,90359154,2020-01-07,670
I tried the following code to create a table:
CREATE EXTERNAL TABLE my_table
(
ID string,
PERSON_ID int,
DATE_COL date,
GMAT int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES
(
'skip.header.line.count'='1'
)
;
I tried to preview the table with the following code:
select
*
from
my_table
limit 10
Which raises this error:
HIVE_BAD_DATA: Error parsing field value '2020-01-15' for field 2: For input string: "2020-01-15"
My question is: Am I passing the correct serde? And if so, how can I format the date column (DATE_COL) such that it reads and displays days in YYYY-MM-DD?
I replaced ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with
FIELDS TERMINATED BY ',' and enclosed the column names with "`". The following code creates the table correctly:
CREATE EXTERNAL TABLE my_table
(
`ID` string,
`PERSON_ID` int,
`DATE_COL` date,
`GMAT` int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
I do not understand the concept of a serde, but I suppose I did not need one to begin with.
Per the documentation, with the OpenCSVSerde a column of type DATE must hold values representing the number of days since January 1, 1970. For example, the date in the first row after your header would need to be stored as 18276. When the table is queried, that value is rendered as 2020-01-15.
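If you would rather keep the OpenCSVSerde and the YYYY-MM-DD values in the file, one workaround is to declare the column as a string and convert it at query time. This is a minimal sketch; the table name my_table_raw is a placeholder, and it assumes every value in the column is a valid YYYY-MM-DD date.

```sql
-- Sketch only: my_table_raw is a placeholder name.
-- The date column is read as plain text by the OpenCSVSerde.
CREATE EXTERNAL TABLE my_table_raw
(
`ID` string,
`PERSON_ID` int,
`DATE_COL` string,
`GMAT` int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://my_bucket/som_bucket/dat/'
TBLPROPERTIES ('skip.header.line.count'='1')
;

-- Run separately: convert the text to a DATE when querying.
SELECT id, person_id, CAST(date_col AS date) AS date_col, gmat
FROM my_table_raw
LIMIT 10;

-- The days-since-epoch value mentioned above can be checked like this:
SELECT date_diff('day', DATE '1970-01-01', DATE '2020-01-15');
```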

Query Google Big Query Table by today's date

I have a question about Google BigQuery tables. We want to query a BigQuery table based on the file uploaded to Cloud Storage on a given day.
Meaning:
I have to load each day's data from Cloud Storage into the BigQuery table.
When I query:
select * from BQT where load_date =<TODAY's DATE>
Can we achieve this without adding the date field into the file?
If you just don't want to add a date column, append the current date as a suffix to the table name, like BQT_20200112, when the GCS file is uploaded.
Then you can query the table for a specific date with the _TABLE_SUFFIX pseudo-column.
Below is an example query using _TABLE_SUFFIX:
SELECT
field1,
field2,
field3
FROM
`your_dataset.BQT_*`
WHERE
_TABLE_SUFFIX = '20200112'
As you can see, you don't need an additional field like load_date when you query the tables using a date suffix and a wildcard.
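To match today's table without hard-coding the date, the suffix can be computed in the query. A small sketch, assuming the tables are named with a YYYYMMDD suffix as above; field1 to field3 and your_dataset are the same placeholders used in the example:

```sql
-- Sketch only: selects from the table whose suffix is today's date.
SELECT
  field1,
  field2,
  field3
FROM
  `your_dataset.BQT_*`
WHERE
  _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE())
```

CURRENT_DATE() uses UTC unless a time zone is passed, so if the files are uploaded according to a local calendar day it may be worth passing that zone explicitly, e.g. CURRENT_DATE('Asia/Seoul').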