Here is a sample create table statement that works as expected.
CREATE EXTERNAL TABLE default.reviews(
marketplace varchar(10),
customer_id varchar(15),
review_id varchar(15),
product_id varchar(25),
product_parent varchar(15),
product_title varchar(50),
star_rating int,
helpful_votes int,
total_votes int,
vine varchar(5),
verified_purchase varchar(5),
review_headline varchar(25),
review_body varchar(1024),
review_date date,
year int)
PARTITIONED BY (
product_category varchar(25))
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://amazon-reviews-pds/parquet/';
When I repair the table, I get an error:
MSCK REPAIR TABLE default.reviews
Partitions not in metastore: reviews:product_category=Apparel reviews:product_category=Automotive
If the partition is not in the metastore, how do I get a count of 3.5 million?
SELECT
COUNT(*)
FROM
"default"."reviews"
WHERE
product_category='Automotive'
-- OUTPUT
3516476
How do I make sure that all the records are correctly read and available?
How was this partitioned Parquet table created? I am asking because I have a CSV table that I would like to partition exactly the same way.
In Athena, partitioning is used only to restrict which "directories" are scanned for data.
Since the MSCK REPAIR TABLE command failed, no partitions were created. Therefore, WHERE product_category='Automotive'
doesn't have any effect, and I'd say that 3516476 is the total number of rows in all the files under s3://amazon-reviews-pds/parquet/.
Note that MSCK REPAIR TABLE only works if the "folder" structure on AWS S3 adheres to the Hive convention:
s3://amazon-reviews-pds/parquet/
|
├── product_category=Apparel
│ ├── file_1.csv
│ | ...
│ └── file_N.csv
|
├── product_category=Automotive
│ ├── file_1.csv
│ | ...
│ └── file_M.csv
To ensure that all records are correctly read, make sure that the table definition is correct.
To ensure that all records are available, make sure that LOCATION points to the root
"directory" under which all the files are located on S3.
If you have a single huge CSV file with columns col_1, col_2, col_3, col_4 and you want to partition it by col_4, you would
need to use a CTAS query statement; however, keep in mind the
limitations of such statements.
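For illustration, a minimal CTAS sketch that writes the data out as partitioned Parquet could look like the following. The table and bucket names here are hypothetical, and keep in mind that a single CTAS query can write at most 100 partitions and that external_location must point to an empty prefix:
CREATE TABLE my_database.my_table_partitioned
WITH (
format = 'PARQUET',
external_location = 's3://my-bucket/partitioned-output/',
partitioned_by = ARRAY['col_4']
) AS
SELECT
col_1,
col_2,
col_3,
col_4 -- partition columns must come last in the SELECT list
FROM my_database.my_flat_csv_table;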
Alternatively, if you already have multiple CSV files, each of which corresponds to a single value of col_4, then
simply upload them to S3 in the layout shown above. Then use a combination of the following DDL statements:
-- FIRST STATEMENT
CREATE EXTERNAL TABLE `my_database`.`my_table`(
`col_1` string,
`col_2` string,
`col_3` string)
PARTITIONED BY (
`col_4` string)
ROW FORMAT SERDE
-- CHANGE AS APPROPRIATE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
-- CHANGE AS APPROPRIATE
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
-- CHANGE AS APPROPRIATE
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://amazon-reviews-pds/parquet/';
-- SECOND STATEMENT
MSCK REPAIR TABLE `my_database`.`my_table`
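Once the repair succeeds, a quick way to check that the partitions were actually registered is:
SHOW PARTITIONS `my_database`.`my_table`;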
Related
I am trying to create an Athena table using partition projection. I am delivering records to S3 using Kinesis Firehose, grouped using a dynamic partitioning key. For example, the records look like the following:
period          item_id
2022/05         monthly_item_1
2022/05/04      daily_item_1
2022/05/04/02   hourly_item_1
2022/06         monthly_item_2
I want to partition the data in S3 by period, which can be monthly, daily or hourly. It is guaranteed that period would be in a supported Java date format. Therefore, I am writing these records to S3 in the below format:
s3://bucket/prefix/2022/05/monthly_items.gz
s3://bucket/prefix/2022/05/04/daily_items.gz
s3://bucket/prefix/2022/05/04/02/hourly_items.gz
s3://bucket/prefix/2022/06/monthly_items.gz
I want to run Athena queries for every partition scope, i.e. if my query is for a specific day, I want to fetch its daily_items and hourly_items. If I am running a query for a month, I want it to fetch monthly, daily, as well as hourly items.
I've created an Athena table using below query:
create external table `my_table`(
`period` string COMMENT 'from deserializer',
`item_id` string COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.day.type'='integer',
'projection.day.digits' = '2',
'projection.day.range'='01,31',
'projection.hour.type'='integer',
'projection.hour.digits' = '2',
'projection.hour.range'='00,23',
'projection.month.type'='integer',
'projection.month.digits'='02',
'projection.month.range'='01,12',
'projection.year.format'='yyyy',
'projection.year.range'='2022,NOW',
'projection.year.type'='date',
'storage.location.template'='s3://bucket/prefix/${year}/${month}/${day}/${hour}')
However, with this table, running the below query outputs zero results:
select * from my_table where year = '2022' and month = '06';
I believe the reason is Athena expects all files to be present under the same prefix as defined by storage.location.template. Therefore, any records present under a month or day prefix are not projected.
I was wondering if it was possible to support such querying functionality in a single table with partition projection enabled, when data in S3 is in a folder type structure similar to the examples above.
Would be great if anyone can help me out!
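To make the reported mismatch concrete: with the storage.location.template above, every projected partition resolves to a prefix that contains all four of year, month, day and hour, so (using the example keys from the question) only the hourly objects can ever be matched:
s3://bucket/prefix/2022/05/04/02/hourly_items.gz   (covered by the projected partition year=2022, month=05, day=04, hour=02)
s3://bucket/prefix/2022/05/monthly_items.gz        (not covered: no projected tuple resolves to this prefix)
s3://bucket/prefix/2022/05/04/daily_items.gz       (not covered: the template always appends an hour)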
My S3 bucket has the following structure.
's3://st.pix/year/month/day/hour' for example
's3://st.pix/2022/09/01/06'
So I tried to create a partitioned table on this bucket using this code:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
`ad_id` string,
)
PARTITIONED BY (year string, month string, day string , hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://st.pix/${year}/${month}/${day}/${hour}/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');
and right after that, run
MSCK REPAIR TABLE Accesn
But unfortunately, the following query returns no results.
SELECT count(*) FROM `acco`.`Accesn` where year ='2022' and month= '03' and day ='01' and hour ='01'
Can I use ${year}/${month}/${day}/${hour}/ in my LOCATION?
If not, what are the options to do it dynamically, without using ALTER TABLE .. ADD PARTITION for each specific partition?
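The ${...} placeholders are not expanded in LOCATION itself; in Athena they are only meaningful in the storage.location.template property of partition projection. A sketch of that approach (untested against this bucket; the projection ranges below are assumptions) would be to point LOCATION at the bucket root and declare the projection in TBLPROPERTIES:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
`ad_id` string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.format' = ',', 'field.delim' = ',')
LOCATION 's3://st.pix/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'compressionType'='gzip',
'projection.enabled'='true',
'projection.year.type'='integer',
'projection.year.range'='2022,2030',
'projection.month.type'='integer',
'projection.month.digits'='2',
'projection.month.range'='1,12',
'projection.day.type'='integer',
'projection.day.digits'='2',
'projection.day.range'='1,31',
'projection.hour.type'='integer',
'projection.hour.digits'='2',
'projection.hour.range'='0,23',
'storage.location.template'='s3://st.pix/${year}/${month}/${day}/${hour}/');
With projection enabled, neither MSCK REPAIR TABLE nor ALTER TABLE .. ADD PARTITION is needed.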
Is there a way to create a table or update the table in the Glue Catalog?
We are using the following DDL to create a table (and database) in Glue Catalog:
CREATE DATABASE IF NOT EXISTS glue_catalog;
CREATE EXTERNAL TABLE IF NOT EXISTS glue_catalog.date
(
file_dt date,
end_dt date
)
PARTITIONED BY (
year string,
month string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix1/date'
TBLPROPERTIES (
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'='|',
'skip.header.line.count'='1',
'typeOfData'='file');
When we need to make an update to the schema (i.e. a datatype), we have to delete the table in the Glue Catalog and re-execute the script above because we're using the CREATE EXTERNAL TABLE IF NOT EXISTS statement. I'm curious whether there's a way to create a table if it does not exist, but if it does exist, update the table?
You can find the alter commands in the documentation. Although this appears under the Athena service, it applies to Glue tables; Athena tables are Glue tables.
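For example, a sketch of a datatype change without dropping the table; REPLACE COLUMNS restates the full list of non-partition columns and is only supported for certain SerDes, so check the restrictions in the documentation, and the load_dt column in the ADD COLUMNS line is purely hypothetical:
-- add a new column (load_dt is a hypothetical name)
ALTER TABLE `glue_catalog`.`date` ADD COLUMNS (load_dt date);
-- change end_dt from date to timestamp by restating all non-partition columns
ALTER TABLE `glue_catalog`.`date` REPLACE COLUMNS (file_dt date, end_dt timestamp);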
New to Presto; any pointers on how I can use LATERAL VIEW explode in Presto for the below table?
I need to filter on names in my Presto query.
CREATE EXTERNAL TABLE `id`(
`id` string,
`names` map<string,map<string,string>>,
`tags` map<string,map<string,string>>)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test'
;
Sample names value:
{3081={short=Abbazia 81427 - Milan}, 2057={short=Abbazia 81427 - Milan}, 1033={short=Abbazia 81427 - Milan}, 4105={short=Abbazia 81427 - Milan}, 5129={short=Abbazia 81427 - Milan}}
From the documentation: https://trino.io/docs/current/appendix/from-hive.html
Trino [formerly PrestoSQL] supports UNNEST for expanding arrays and maps. Use UNNEST instead of LATERAL VIEW explode().
Hive query:
SELECT student, score
FROM tests
LATERAL VIEW explode(scores) t AS score;
Presto query:
SELECT student, score
FROM tests
CROSS JOIN UNNEST(scores) AS t (score);
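Applied to the names column above (a map<string,map<string,string>>), a sketch of filtering on the nested short name could look like this; the name_key/name_value aliases and the 'Milan' filter are only for illustration:
SELECT
id,
name_key,
element_at(name_value, 'short') AS short_name
FROM id
CROSS JOIN UNNEST(names) AS t (name_key, name_value)
WHERE element_at(name_value, 'short') LIKE '%Milan%';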
I am able to run the below query to get to the mapped data:
select
id
,names['1033']['short'] as srt_nm
from id;
I set up a new log table in Athena in an S3 bucket that looks like below, where Athena is sitting on top of BucketName/
I had a well-functioning Athena system based on the same data but without the subdirectory structure listed below. Now, with this new subdirectory structure, I can see the data is displaying properly when I do select * from table_name limit 100, but when I do something like count(x) by week the query hangs.
The data in S3 doesn't exceed 100GB in gzipped folders, but the query was hanging for more than 20 minutes and said 6.5TB scanned, which sounds like it was looping and scanning over the same data. My guess is that it has to do with this directory structure, but from what I've seen in other threads, Athena should be able to parse through the subdirectories by just being pointed at the base folder BucketName/.
BucketName
|
|
|---Year(2016)
| |
| |---Month(11)
| | |
| | |---Daily File Format YYYY-MM-DD-Data000.gz
Any advice would be appreciated!
Create Table DDL
CREATE EXTERNAL TABLE `test_table`(
`foo1` string,
`foo2` string,
`foo3` string,
`date` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
MAP KEYS TERMINATED BY '\u0003'
WITH SERDEPROPERTIES (
'collection.delim'='\u0002')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://Listen_Data/2018/01'
TBLPROPERTIES (
'has_encrypted_data'='false')
Fixed by adding
PARTITIONED BY (
`year` string,
`month` string)
after the schema definition in the DDL statement.
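Note that because the prefixes are plain year/month rather than Hive-style year=.../month=..., MSCK REPAIR TABLE will not discover these partitions; they have to be registered with an explicit location (or via partition projection). A sketch, assuming the data lives under s3://Listen_Data/<year>/<month>/ as in the DDL's LOCATION:
ALTER TABLE test_table ADD IF NOT EXISTS
PARTITION (year = '2016', month = '11') LOCATION 's3://Listen_Data/2016/11/';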