Create a bucketed table in AWS Athena - amazon-web-services

I tried the query below to create a bucketed table but it failed. However, if I remove
the CLUSTERED BY clause, the query succeeds. Any suggestions? Thank you.
The error message: no viable alternative at input create external
CREATE EXTERNAL TABLE nation5(
  n_nationkey bigint,
  n_name string,
  n_rgionkey int,
  n_comment string)
CLUSTERED BY
  n_regionkey INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='s3://test/testbucket/nation5/')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://test/testbucket/nation5/'

The CLUSTERED BY column needs to be in parentheses; the following works:
CREATE EXTERNAL TABLE test.nation5(
  n_nationkey bigint,
  n_name string,
  n_regionkey int,
  n_comment string)
CLUSTERED BY
  (n_regionkey) INTO 256 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'path'='s3://test/testbucket/nation5/')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://test/testbucket/nation5/'
(You also have a spelling mistake in the column definition: n_rgionkey should be n_regionkey.)
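As an aside, if the table is populated from query results rather than pre-existing files, Athena's CTAS syntax also supports bucketing via the bucketed_by and bucket_count table properties. A sketch under that assumption (table name and S3 path are placeholders):

```sql
-- Hypothetical CTAS alternative: bucket by n_regionkey into 256 buckets.
CREATE TABLE nation5_bucketed
WITH (
  format = 'PARQUET',
  external_location = 's3://test/testbucket/nation5_bucketed/',
  bucketed_by = ARRAY['n_regionkey'],
  bucket_count = 256
) AS
SELECT n_nationkey, n_name, n_regionkey, n_comment
FROM nation5;
```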

Related

AWS Athena identify malformed CSVs

I am trying to combine CSV files, which I created and put into S3, into a single CSV file. I am using AWS Athena queries to do that, but I'm getting errors like
HIVE_BAD_DATA: Error parsing column '0': For input string: apples</li><li>Item 1</li><li>Item 2</li><div class=""atlanta"">organges</div> ...
How do I solve this?
This is my code; the description field/column contains HTML.
Creating table:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
`item_no` double,
`description` string,
`name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
Querying table:
select * from my_table
where "item_no" < 30000
limit 10
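The HIVE_BAD_DATA error above typically means unquoted commas inside the HTML description are shifting fields, so text ends up in the numeric item_no column. Since OpenCSVSerde honors quoting, one fix is to regenerate the CSVs with every field quoted before uploading. A minimal sketch in Python, assuming the rows can still be parsed correctly (e.g. when re-exporting from the original data):

```python
import csv
import io

def rewrite_fully_quoted(raw_text):
    """Re-serialize CSV text with every field quoted so commas inside
    the HTML description column no longer split rows for OpenCSVSerde."""
    rows = csv.reader(io.StringIO(raw_text))
    out = io.StringIO()
    csv.writer(out, quoting=csv.QUOTE_ALL).writerows(rows)
    return out.getvalue()

sample = '1,"<li>Item 1</li>, <li>Item 2</li>",apples\n'
print(rewrite_fully_quoted(sample))
```

With all fields quoted, a row's commas inside HTML stay inside one field, and item_no parses as a number again.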

Create automatic partition for S3 year/month/day/hour folders

My S3 bucket has the following structure:
's3://st.pix/year/month/day/hour', for example
's3://st.pix/2022/09/01/06'
So I tried to create a partitioned table on this bucket using this code:
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
  `ad_id` string
)
PARTITIONED BY (year string, month string, day string , hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://st.pix/${year}/${month}/${day}/${hour}/'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip');
and right after that run
MSCK REPAIR TABLE Accesn
But unfortunately, the following query returns no results.
SELECT count(*) FROM `acco`.`Accesn` where year ='2022' and month= '03' and day ='01' and hour ='01'
Can I use ${year}/${month}/${day}/${hour}/ in my LOCATION?
If not, what are the options to do this dynamically, without using ALTER TABLE .. ADD PARTITION for each specific partition?
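Placeholders like ${year} are not interpreted in LOCATION, and MSCK REPAIR TABLE only discovers Hive-style key=value folder names, which this bucket does not use. One commonly used alternative for plain year/month/day/hour layouts is Athena partition projection, which computes partitions from table properties and needs no ADD PARTITION calls at all. A sketch reusing the names from the question (the year range is an assumption):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS `acco`.`Accesn` (
  `ad_id` string
)
PARTITIONED BY (year string, month string, day string, hour string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.format' = ',', 'field.delim' = ',')
LOCATION 's3://st.pix/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2020,2030',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  -- Here ${...} placeholders ARE supported: this template tells Athena
  -- where each projected partition lives.
  'storage.location.template' = 's3://st.pix/${year}/${month}/${day}/${hour}/',
  'compressionType' = 'gzip'
);
```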

How to import nested json file in Athena

Below is the nested JSON I am trying to load into a table in Amazon Athena.
Can anyone help me create the table and query it?
{
"Environment": "agilent-aws-sbx-21",
"Source1": "sdx:HD3,dev:HQ2,test:HT1,prod:HP1",
"Source2": "",
"Source3": "",
"Source4": ""
}
I have tried the following, but the query did not execute:
CREATE EXTERNAL TABLE cfg_env (
environment string,
source1 string,
source2 string,
source3 string,
souce4 string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://agilent-aws-sbx-21-enterprise-analytics/it_share/config/config_env/'
TBLPROPERTIES ('classification'='json');
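One likely culprit, though this is a guess from the DDL alone: the column souce4 is a typo for source4 (the JSON key is "Source4"); the openx JsonSerDe matches keys case-insensitively, so the other columns should resolve. Once the table works, the comma-separated Source1 value can be unpacked at query time with Athena's split/UNNEST, e.g. this hypothetical query:

```sql
-- Split "sdx:HD3,dev:HQ2,test:HT1,prod:HP1" into one row per key:value pair.
SELECT environment,
       split_part(pair, ':', 1) AS env_alias,
       split_part(pair, ':', 2) AS sid
FROM cfg_env
CROSS JOIN UNNEST(split(source1, ',')) AS t(pair);
```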

AWS Redshift Spectrum (external table) - Varchar datatype assigned to a column is not able to handle both array and string data for the same column

Task: Trying to load a bunch of JSON files from S3 buckets to Redshift using Redshift Spectrum.
Problem: JSON objects in a few files have their data wrapped in square brackets, but other JSON files have the same objects without square brackets. Is there a way to consume data both with and without square brackets "[ ]" when creating an external table using Redshift Spectrum?
JSON FILES to be consumed:
File 1: "x":{"y":{ "z":["ABCD"]}}
File 2: "x":{"y":{ "z":"EFGH"}}
Case 1
When column z is defined as an array, I am missing the data from the JSON files that are "without square brackets".
CREATE EXTERNAL TABLE spectrum.table
(x struct<y:struct<z:array<varchar(256)>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'dots.in.keys'='true') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://****'
Query: Select c from spectrum.table t , t.x.y.z c;
Case 2
When column z is defined as varchar (without declaring it as an array), the following error occurs:
Create Statement:
CREATE EXTERNAL TABLE spectrum.table
(x struct<y:struct<z:varchar(256)>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'dots.in.keys'='true') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://****'
Query: Select regexp_replace( t.x.y.z ,'\\([\\"])', '' ) from spectrum.table t;
or Select t.x.y.z from spectrum.table t;
[XX000][500310] [Amazon](500310) Invalid operation: Spectrum Scan Error
Details:
-----------------------------------------------
error: Spectrum Scan Error
code: 15001
context: Unsupported implicit cast: Column ('x' 'y' 'z'), From Type: LIST, To Type: VARCHAR,
----------------------------------------------
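As the error shows, Spectrum will not implicitly cast a LIST to VARCHAR, so a single schema cannot read both shapes as-is. One workaround is to normalize the files before loading so the array<varchar> schema from Case 1 fits every file, i.e. wrap bare scalars in a one-element array. A sketch in Python (field names taken from the question):

```python
import json

def normalize_z(record):
    """Ensure x.y.z is always a list so an array<varchar> column
    in the Spectrum DDL can read every file."""
    z = record["x"]["y"]["z"]
    if not isinstance(z, list):
        record["x"]["y"]["z"] = [z]
    return record

# File 2's shape becomes File 1's shape:
print(json.dumps(normalize_z({"x": {"y": {"z": "EFGH"}}})))
# prints {"x": {"y": {"z": ["EFGH"]}}}
```

Running this over the objects in each file (e.g. in the job that writes them to S3) lets the Case 1 table and its UNNEST-style query work unchanged.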

AWS Athena null values are replaced by N after the table is created. How do I keep them as they are?

I'm creating a table in Athena from CSV data in S3. Some columns in the data are quoted, so I use:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
'serialization.null.format' = '')
The serde works fine, but the null values in the resulting table are replaced with N. How can I keep the null values empty, or as NULL, but not as N?
Thanks.
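A note in case it helps: the N is most likely Hive's default null marker \N with the backslash lost, and OpenCSVSerde is known not to honor serialization.null.format the way LazySimpleSerDe does. Pending a serde-level fix, one workaround is to map the marker back to NULL at query time, e.g. this hypothetical query (table and column names are placeholders):

```sql
-- Turn the literal \N marker (or an empty string) back into SQL NULL.
SELECT NULLIF(NULLIF(some_column, '\N'), '') AS some_column
FROM my_csv_table;
```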