LATERAL VIEW EXPLODE in presto - amazon-web-services

I'm new to Presto. Any pointers on how I can use LATERAL VIEW EXPLODE in Presto for the table below?
I need to filter on names in my Presto query.
CREATE EXTERNAL TABLE `id`(
`id` string,
`names` map<string,map<string,string>>,
`tags` map<string,map<string,string>>)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test'
;
Sample names value:
{3081={short=Abbazia 81427 - Milan}, 2057={short=Abbazia 81427 - Milan}, 1033={short=Abbazia 81427 - Milan}, 4105={short=Abbazia 81427 - Milan}, 5129={short=Abbazia 81427 - Milan}}

From the documentation: https://trino.io/docs/current/appendix/from-hive.html
Trino [formerly PrestoSQL] supports UNNEST for expanding arrays and maps. Use UNNEST instead of LATERAL VIEW explode().
Hive query:
SELECT student, score
FROM tests
LATERAL VIEW explode(scores) t AS score;
Presto query:
SELECT student, score
FROM tests
CROSS JOIN UNNEST(scores) AS t (score);

I am able to run the query below to get to the mapped data:
select
id
,names['1033']['short'] as srt_nm
from id;
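
Applied to the table above, the UNNEST pattern from the documentation might look like the following. This is a minimal sketch, assuming the goal is to filter on the inner `short` value; the aliases `t`, `n`, `name_key`, and `name_value` and the `LIKE 'Abbazia%'` pattern are illustrative and not from the original post. Unnesting a map yields one row per entry with a key column and a value column, and `element_at` returns NULL instead of failing when the inner key is absent.

```sql
-- Expand each entry of the outer `names` map into its own row,
-- then filter on the inner map's 'short' value.
SELECT
  t.id,
  n.name_key,                                   -- outer map key, e.g. '1033'
  element_at(n.name_value, 'short') AS srt_nm   -- inner map lookup
FROM id AS t
CROSS JOIN UNNEST(t.names) AS n (name_key, name_value)
WHERE element_at(n.name_value, 'short') LIKE 'Abbazia%';  -- hypothetical filter
```

Because every outer key becomes a row, the query no longer needs to hard-code a key such as '1033'.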

Related

AWS Athena: Partition projection using date-hour with mixed ranges

I am trying to create an Athena table using partition projection. I am delivering records to S3 using Kinesis Firehose, grouped using a dynamic partitioning key. For example, the records look like the following:
period          item_id
2022/05         monthly_item_1
2022/05/04      daily_item_1
2022/05/04/02   hourly_item_1
2022/06         monthly_item_2
I want to partition the data in S3 by period, which can be monthly, daily or hourly. It is guaranteed that period would be in a supported Java date format. Therefore, I am writing these records to S3 in the below format:
s3://bucket/prefix/2022/05/monthly_items.gz
s3://bucket/prefix/2022/05/04/daily_items.gz
s3://bucket/prefix/2022/05/04/02/hourly_items.gz
s3://bucket/prefix/2022/06/monthly_items.gz
I want to run Athena queries for every partition scope, i.e. if my query is for a specific day, I want to fetch its daily_items and hourly_items. If I am running a query for a month, I want to fetch its monthly, daily, and hourly items.
I've created an Athena table using the query below:
create external table `my_table`(
`period` string COMMENT 'from deserializer',
`item_id` string COMMENT 'from deserializer')
PARTITIONED BY (
`year` string,
`month` string,
`day` string,
`hour` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix/'
TBLPROPERTIES (
'projection.enabled'='true',
'projection.day.type'='integer',
'projection.day.digits' = '2',
'projection.day.range'='01,31',
'projection.hour.type'='integer',
'projection.hour.digits' = '2',
'projection.hour.range'='00,23',
'projection.month.type'='integer',
'projection.month.digits'='02',
'projection.month.range'='01,12',
'projection.year.format'='yyyy',
'projection.year.range'='2022,NOW',
'projection.year.type'='date',
'storage.location.template'='s3://bucket/prefix/${year}/${month}/${day}/${hour}')
However, with this table, running the query below returns zero results:
select * from my_table where year = '2022' and month = '06';
I believe the reason is Athena expects all files to be present under the same prefix as defined by storage.location.template. Therefore, any records present under a month or day prefix are not projected.
I was wondering if it was possible to support such querying functionality in a single table with partition projection enabled, when data in S3 is in a folder type structure similar to the examples above.
Would be great if anyone can help me out!

Create automatic partition for S3 year/month/day/hour folders

My S3 bucket has the following structure.
's3://st.pix/year/month/day/hour' for example
's3://st.pix/2022/09/01/06'
So I tried to create a partition table on this bucket using this code:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
`ad_id` string,
)
PARTITIONED BY (year string, month string, day string , hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://st.pix/${year}/${month}/${day}/${hour}/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');
and right after that I ran
MSCK REPAIR TABLE Accesn
Unfortunately, this query returns no results:
SELECT count(*) FROM `acco`.`Accesn` where year ='2022' and month= '03' and day ='01' and hour ='01'
Can I use ${year}/${month}/${day}/${hour}/ in my LOCATION?
If not, what are the options for doing this dynamically, without using ALTER TABLE .. ADD PARTITION for each specific partition?
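
For context: LOCATION is taken literally, so `${year}`-style placeholders are not expanded there, and MSCK REPAIR TABLE only discovers Hive-style prefixes such as `year=2022/month=09/`. One option that avoids adding partitions by hand is Athena partition projection, where the placeholders go into the `storage.location.template` table property instead. The following is a sketch under assumptions, not a tested DDL: the year range `2022,2030` is hypothetical, the trailing comma after `ad_id` is dropped, and the `integer` projection type is assumed to be acceptable for `string` partition columns.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
  `ad_id` string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION 's3://st.pix/'                        -- plain bucket root, no placeholders
TBLPROPERTIES (
  'has_encrypted_data' = 'false',
  'compressionType' = 'gzip',
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2022,2030',       -- hypothetical upper bound
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',             -- zero-pad to match '09'
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://st.pix/${year}/${month}/${day}/${hour}/'
);
```

With projection enabled, Athena computes partition locations from the template at query time, so neither MSCK REPAIR TABLE nor ALTER TABLE .. ADD PARTITION should be needed.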

Build Table in Glue Catalog

Is there a way to create table or update the table in Glue Catalog?
We are using the following DDL to create a table (and database) in Glue Catalog:
CREATE DATABASE IF NOT EXISTS glue_catalog;
CREATE EXTERNAL TABLE IF NOT EXISTS glue_catalog.date
(
file_dt date,
end_dt date
)
PARTITIONED BY (
year string,
month string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/prefix1/date'
TBLPROPERTIES (
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'='|',
'skip.header.line.count'='1',
'typeOfData'='file');
When we need to make an update to the schema (e.g. a datatype), we have to delete the table in Glue Catalog and re-execute the script above, because we're using the CREATE EXTERNAL TABLE IF NOT EXISTS statement. I'm curious whether there's a way to create a table if it does not exist, but if it does exist, update the table?
You can find the alter commands in the documentation. Although this appears under Athena service, it applies to Glue tables. Athena tables are Glue tables.
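
As an illustration of those ALTER commands, a schema change could follow the unchanged CREATE EXTERNAL TABLE IF NOT EXISTS script, which is a no-op once the table exists. This is a sketch, assuming the change is turning `end_dt` from `date` into `string` (a hypothetical example, not from the question); the backticks around `date` are a precaution because it can be treated as a reserved word in DDL.

```sql
-- Change the type of a single column; CHANGE COLUMN takes the old name,
-- the new name (unchanged here), and the new type.
ALTER TABLE glue_catalog.`date` CHANGE COLUMN end_dt end_dt string;

-- Alternatively, restate the full list of non-partition columns at once.
ALTER TABLE glue_catalog.`date` REPLACE COLUMNS (file_dt date, end_dt string);
```

Both statements only update the table metadata in the Glue Catalog; the files in S3 are not rewritten, so the new type still has to match what the data can be read as.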

Updating Records Via RowVersion, using 'SQL WHERE' to filter for a MAX Value

I'm trying to update a table based on a RowVersion value in the existing table. My data lake updates once a week, with new data stored as a .json file, which holds any new RowVersions.
I need to:
1) Query the existing table in my data warehouse to find the most up-to-date RowVersion (i.e. the max)
2) Use that value to only filter/select the records in my data warehouse that are greater than the RowVersion I just identified
3) Update my table to include the new rows
My question is about the SQL below: I am not sure how to select the max RowVersion in the current table and then use that to filter what is returned when querying my S3 bucket:
create or replace temporary table UPDATE_CAR_SALES AS
SELECT
VALUE:CAR::string AS CARS,
VALUE:RowVersion::INT AS ROW_VERSION
having row_version > max(row_version)
from '#s3_bucket',
lateral flatten( input => $1:value);
It's not clear to me how you store the data. Is the CARS column unique? Do you need to find the maximum row version for each car or for all cars/rows? Either way, you can use a sub-query to keep only the rows whose row version is higher than the max value:
create or replace temporary table UPDATE_CAR_SALES AS
SELECT
VALUE:CAR::string AS CARS,
VALUE:RowVersion::INT AS ROW_VERSION
FROM #s3_bucket, lateral flatten( input => $1 )
where ROW_VERSION > (SELECT MAX(RowVersion)
from MAIN_TABLE);
If you need to filter the rows, based on row version of each car (of the existing table):
create or replace temporary table UPDATE_CAR_SALES AS
SELECT * FROM (SELECT
VALUE:CAR::string AS CARS,
VALUE:RowVersion::INT AS ROW_VERSION
FROM #s3_bucket, lateral flatten( input => $1 )) temp_table
where temp_table.ROW_VERSION > (SELECT MAX(RowVersion)
from MAIN_TABLE where cars = temp_table.CARS );
I needed to put the main query in brackets to be able to use alias. Hope it helps.

How to query historical table size of database in Redshift to determine database size growth

I want to project forward the size of my Amazon Redshift tables because I'm planning to expand my Redshift cluster size.
I know how to query the table size for today (see query below), but how can I measure the growth of my table sizes over time without building an ETL job that snapshots the table sizes day by day?
-- Capture table sizes
select
trim(pgdb.datname) as Database,
trim(pgn.nspname) as Schema,
trim(a.name) as Table,
b.mbytes,
a.rows
from (
select db_id, id, name, sum(rows) as rows
from stv_tbl_perm a
group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (
select tbl, count(*) as mbytes
from stv_blocklist
group by tbl
) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
There is no historical table size information retained by Amazon Redshift. You would need to run a query on a regular basis, such as the one in your question.
You could wrap the query in an INSERT statement and run it on a weekly basis, inserting the results into a table. This way, you'll have historical table size information for each table each week that you can use to predict future growth.
It would be worth doing a VACUUM prior to such measurements, to remove deleted rows from storage.
The following metrics are available in CloudWatch:
RedshiftManagedStorageTotalCapacity (m1)
PercentageDiskSpaceUsed (m2)
Create a CloudWatch math expression m1*m2/100 to get this data for the past 3 months.