Build Table in Glue Catalog - amazon-web-services

Is there a way to create table or update the table in Glue Catalog?
We are using the following DDL to create a table (and database) in Glue Catalog:
CREATE DATABASE IF NOT EXISTS glue_catalog;
CREATE EXTERNAL TABLE IF NOT EXISTS glue_catalog.date
(
file_dt date,
end_dt date
)
PARTITIONED BY (
year string,
month string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix1/date'
TBLPROPERTIES (
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'='|',
'skip.header.line.count'='1',
'typeOfData'='file');
When we need to make an update to the schema (e.g., change a datatype), we have to delete the table in the Glue Catalog and re-execute the script above, because we're using the CREATE EXTERNAL TABLE IF NOT EXISTS statement. I'm curious whether there's a way to create a table if it does not exist, but update it if it does?

You can find the alter commands in the documentation. Although this appears under the Athena service, it applies to Glue tables: Athena tables are Glue tables.
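For example, to change a column's data type on the table above, something like the following should work for a CSV/LazySimpleSerDe table (a sketch only; ALTER TABLE REPLACE COLUMNS redefines the full list of non-partition columns, so every column must be listed, and the new timestamp type here is just an illustration):
ALTER TABLE glue_catalog.`date` REPLACE COLUMNS (
file_dt date,
end_dt timestamp
);
Alternatively, the Glue UpdateTable API (aws glue update-table on the CLI) can modify the schema in place without any DDL.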

Related

AWS Athena: Partition projection using date-hour with mixed ranges

I am trying to create an Athena table using partition projection. I am delivering records to S3 using Kinesis Data Firehose, grouped using a dynamic partitioning key. For example, the records look like the following:
period          item_id
2022/05         monthly_item_1
2022/05/04      daily_item_1
2022/05/04/02   hourly_item_1
2022/06         monthly_item_2
I want to partition the data in S3 by period, which can be monthly, daily or hourly. It is guaranteed that period will be in a supported Java date format. Therefore, I am writing these records to S3 in the format below:
s3://bucket/prefix/2022/05/monthly_items.gz
s3://bucket/prefix/2022/05/04/daily_items.gz
s3://bucket/prefix/2022/05/04/02/hourly_items.gz
s3://bucket/prefix/2022/06/monthly_items.gz
I want to run Athena queries for every partition scope, i.e. if my query is for a specific day, I want to fetch its daily_items and hourly_items; if I am running a query for a month, I want to fetch its monthly, daily, as well as hourly items.
I've created an Athena table using the query below:
create external table `my_table`(
`period` string COMMENT 'from deserializer',
`item_id` string COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.day.type'='integer',
'projection.day.digits' = '2',
'projection.day.range'='01,31',
'projection.hour.type'='integer',
'projection.hour.digits' = '2',
'projection.hour.range'='00,23',
'projection.month.type'='integer',
'projection.month.digits'='02',
'projection.month.range'='01,12',
'projection.year.format'='yyyy',
'projection.year.range'='2022,NOW',
'projection.year.type'='date',
'storage.location.template'='s3://bucket/prefix/${year}/${month}/${day}/${hour}')
However, with this table, running the query below outputs zero results:
select * from my_table where year = '2022' and month = '06';
I believe the reason is Athena expects all files to be present under the same prefix as defined by storage.location.template. Therefore, any records present under a month or day prefix are not projected.
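To make that concrete (the resolved locations below are just an illustration based on the table definition above): for year=2022 and month=06, the template only ever resolves to prefixes of the form s3://bucket/prefix/2022/06/<day>/<hour>/, while the only object for that month, s3://bucket/prefix/2022/06/monthly_items.gz, sits directly under the month prefix and is therefore never covered by any projected partition.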
I was wondering if it was possible to support such querying functionality in a single table with partition projection enabled, when data in S3 is in a folder type structure similar to the examples above.
Would be great if anyone can help me out!

Create automatic partition for S3 year/month/day/hour folders

My S3 bucket has the following structure:
's3://st.pix/year/month/day/hour', for example
's3://st.pix/2022/09/01/06'
So I tried to create a partition table on this bucket using this code:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
`ad_id` string
)
PARTITIONED BY (year string, month string, day string , hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://st.pix/${year}/${month}/${day}/${hour}/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');
and right after that ran
MSCK REPAIR TABLE Accesn
But unfortunately, the following query returns no results.
SELECT count(*) FROM `acco`.`Accesn` where year ='2022' and month= '03' and day ='01' and hour ='01'
Can I use ${year}/${month}/${day}/${hour}/ in my LOCATION?
If not, what are the options to do this dynamically, without using ALTER TABLE .. ADD PARTITION for each specific partition?
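One option, as a sketch only (assuming Athena partition projection is acceptable here; the year range below is a placeholder to adjust): keep LOCATION as the static bucket root and declare the partition values through projection properties, so neither MSCK REPAIR TABLE nor ALTER TABLE .. ADD PARTITION is needed:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
`ad_id` string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://st.pix/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'compressionType'='gzip',
'projection.enabled'='true',
'projection.year.type'='integer', 'projection.year.range'='2020,2030',
'projection.month.type'='integer', 'projection.month.range'='1,12', 'projection.month.digits'='2',
'projection.day.type'='integer', 'projection.day.range'='1,31', 'projection.day.digits'='2',
'projection.hour.type'='integer', 'projection.hour.range'='0,23', 'projection.hour.digits'='2',
'storage.location.template'='s3://st.pix/${year}/${month}/${day}/${hour}/');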

Snowflake date column have incorrect date from AVRO file

I have a Hive external table created in AVRO format, with the data stored in an S3 location. Now I am creating a Snowflake external table on the same AVRO data file stored on S3, but I am getting an issue with the date column. The date is not coming through correctly, although string and int data are coming through correctly in the Snowflake table.
Data in the Hive table (col1 is a timestamp data type in the Hive table):
col1
2021-02-04 10:02:31
Data in the Snowflake table:
col1
53066-07-15 12:56:40.000
SQL to create the Snowflake table:
CREATE OR REPLACE EXTERNAL TABLE test1
(
col1 timestamp as (value:col1::timestamp)
)
WITH LOCATION = @S3_location/folder/
AUTO_REFRESH = TRUE
FILE_FORMAT = 'AVRO';
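The 53066 year is the classic symptom of an epoch-milliseconds value (Avro timestamp-millis) being interpreted as epoch seconds. A possible fix, as a sketch only (assuming the Avro field really holds epoch milliseconds; use a scale of 6 instead of 3 if it turns out to be microseconds): cast the value to a number and let TO_TIMESTAMP apply the right scale:
CREATE OR REPLACE EXTERNAL TABLE test1
(
-- assumption: value:col1 holds epoch milliseconds; scale 3 = milliseconds
col1 timestamp as (TO_TIMESTAMP(value:col1::number, 3))
)
WITH LOCATION = @S3_location/folder/
AUTO_REFRESH = TRUE
FILE_FORMAT = 'AVRO';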

How to create parquet partitioned table

Here is a sample create table statement that works as expected.
CREATE EXTERNAL TABLE default.reviews(
marketplace varchar(10),
customer_id varchar(15),
review_id varchar(15),
product_id varchar(25),
product_parent varchar(15),
product_title varchar(50),
star_rating int,
helpful_votes int,
total_votes int,
vine varchar(5),
verified_purchase varchar(5),
review_headline varchar(25),
review_body varchar(1024),
review_date date,
year int)
PARTITIONED BY (
product_category varchar(25))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://amazon-reviews-pds/parquet/';
When I repair the table, I get an error:
MSCK REPAIR TABLE default.reviews
Partitions not in metastore: reviews:product_category=Apparel reviews:product_category=Automotive
If the partitions are not in the metastore, how do I get a count of 3.5 million?
SELECT
COUNT(*)
FROM
"default"."reviews"
WHERE
product_category='Automotive'
-- OUTPUT
3516476
How do I make sure that all the records are correctly read and available?
How was this parquet partitioned table created? I am asking because I have a csv table that I would like to partition exactly the same way.

The concept of partitioning is used in Athena only to restrict which "directories" should be scanned for data.
Since the MSCK REPAIR TABLE command failed, no partitions were created. Therefore, WHERE product_category='Automotive' doesn't have any effect, and I'd say that 3516476 is the total number of rows in all files under s3://amazon-reviews-pds/parquet/.
Note, MSCK REPAIR TABLE will only work if the "folder" structure on AWS S3 adheres to the Hive convention:
s3://amazon-reviews-pds/parquet/
|
├── product_category=Apparel
│   ├── file_1.csv
│   │   ...
│   └── file_N.csv
|
├── product_category=Automotive
│   ├── file_1.csv
│   │   ...
│   └── file_M.csv
In order to ensure that all records are correctly read, you have to make sure that the table definition is correct. In order to ensure that all records are available, you have to make sure that LOCATION points to the root "directory" where all the files are located on S3.
If you have a huge csv file with columns col_1, col_2, col_3, col_4 and you want to partition it by col_4, you would need to use CTAS query statements; however, keep in mind the limitations of such statements.
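A sketch of such a CTAS statement (the source table name and output location here are placeholders, and col_1..col_4 are the hypothetical columns from this answer; partition columns must come last in the SELECT list):
CREATE TABLE my_database.my_table_partitioned
WITH (
  format = 'PARQUET',
  -- placeholder output location for the partitioned copy
  external_location = 's3://my-bucket/my-output-prefix/',
  partitioned_by = ARRAY['col_4']
) AS
SELECT col_1, col_2, col_3, col_4
FROM my_database.my_unpartitioned_table;
One of the limitations referred to above is that a single CTAS (or INSERT INTO) query can write to at most 100 partitions.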
Alternatively, if you already have multiple csv files, each of which corresponds to a single value of col_4, then simply upload them onto S3 in the way mentioned above. Then you should use a combination of the following DDL statements:
-- FIRST STATEMENT
CREATE EXTERNAL TABLE `my_database`.`my_table`(
`col_1` string,
`col_2` string,
`col_3` string
)
PARTITIONED BY (
`col_4` string)
ROW FORMAT SERDE
-- CHANGE AS APPROPRIATE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
-- CHANGE AS APPROPRIATE
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
-- CHANGE AS APPROPRIATE
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://amazon-reviews-pds/parquet/';
-- SECOND STATEMENT
MSCK REPAIR TABLE `my_database`.`my_table`

AWS Athena query hanging and re-reading data for huge query sizes

I set up a new log table in Athena in an S3 bucket that looks like the structure below, where Athena is sitting on top of BucketName/.
I had a well-functioning Athena system based on the same data, but without the subdirectory structure listed below. Now, with this new subdirectory structure, I can see the data displays properly when I do select * from table_name limit 100, but when I do something like a count(x) by week, the query hangs.
The data in S3 doesn't exceed 100GB in gzipped folders, but the query was hanging for more than 20 minutes and said 6.5TB scanned, which sounds like it was looping over and scanning the same data. My guess is that it has to do with this directory structure, but from what I've seen in other threads, Athena should be able to parse through the subdirectories just by being pointed at the base folder BucketName/.
BucketName
|
|
|---Year(2016)
|   |
|   |---Month(11)
|   |   |
|   |   |---Daily File Format YYYY-MM-DD-Data000.gz
Any advice would be appreciated!
Create Table DDL
CREATE EXTERNAL TABLE `test_table`(
`foo1` string,
`foo2` string,
`foo3` string,
`date` string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
MAP KEYS TERMINATED BY '\u0003'
WITH SERDEPROPERTIES (
'collection.delim'='\u0002')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://Listen_Data/2018/01'
TBLPROPERTIES (
'has_encrypted_data'='false')
Fixed by adding
PARTITIONED BY (
`year` string,
`month` string)
after the schema definition in the DDL statement.
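As a sketch of where that clause goes (column and bucket names are taken from the question; LOCATION here assumes the table should point at the root of the Year/Month tree, and the map/collection delimiter settings from the original DDL are omitted for brevity). Because the prefixes are 2016/11/... rather than year=2016/month=11/..., MSCK REPAIR TABLE will not discover them, so each partition has to be registered explicitly:
CREATE EXTERNAL TABLE `test_table`(
`foo1` string,
`foo2` string,
`foo3` string,
`date` string
)
PARTITIONED BY (
`year` string,
`month` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://Listen_Data/';

ALTER TABLE `test_table` ADD IF NOT EXISTS
PARTITION (year = '2016', month = '11') LOCATION 's3://Listen_Data/2016/11/';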