How to import a nested JSON file in Athena - amazon-web-services

Below is the nested JSON format I am trying to load into a table in Amazon Athena.
Can anyone help me create the table and query it?
{
"Environment": "agilent-aws-sbx-21",
"Source1": "sdx:HD3,dev:HQ2,test:HT1,prod:HP1",
"Source2": "",
"Source3": "",
"Source4": ""
}

I have tried the following, but the query was not executing:
CREATE EXTERNAL TABLE cfg_env (
environment string,
source1 string,
source2 string,
source3 string,
souce4 string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://agilent-aws-sbx-21-enterprise-analytics/it_share/config/config_env/'
TBLPROPERTIES ('classification'='json');
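
Two things stand out in that attempt. First, souce4 in the column list is a typo for source4; a typo alone will not stop the DDL from executing, but it leaves that column permanently empty because it matches no JSON key. Second, the OpenX JSON SerDe reads one JSON object per line, so a pretty-printed file like the one shown above will not parse until each record is flattened onto a single line. A minimal sketch, assuming the S3 files are rewritten as one object per line:
CREATE EXTERNAL TABLE cfg_env (
  environment string,
  source1 string,
  source2 string,
  source3 string,
  source4 string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://agilent-aws-sbx-21-enterprise-analytics/it_share/config/config_env/';

-- Source1 holds comma-separated key:value pairs, so it can be unpacked
-- at query time with the built-in split_to_map function:
SELECT environment,
       split_to_map(source1, ',', ':') AS source1_map
FROM cfg_env;
Here split_to_map(source1, ',', ':') returns a map such as {sdx=HD3, dev=HQ2, test=HT1, prod=HP1}, from which individual entries can be read with source1_map['dev'].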

Related

AWS Athena identify malformed CSVs

I am trying to combine CSV files that I created and put into S3 into a single CSV file. I am using AWS Athena queries to do that, but I'm getting errors like:
HIVE_BAD_DATA: Error parsing column '0': For input string: apples</li><li>Item 1</li><li>Item 2</li><div class=""atlanta"">organges</div> ...
How do I solve this?
This is my code - the description field/column contains HTML.
Creating table:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table` (
`item_no` double,
`description` string,
`name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');
Querying table:
select * from my_table
where "item_no" < 30000
limit 10
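
The HIVE_BAD_DATA message means Athena tried to parse a fragment of the HTML description as the double column item_no, which typically happens when commas or quote characters inside the description shift the fields. Note also that serialization.format and field.delim are LazySimpleSerDe properties; OpenCSVSerde uses separatorChar, quoteChar, and escapeChar. A defensive sketch (the table name my_table_raw is made up here) reads every column as a string and casts at query time, so malformed values become NULL instead of failing the scan:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_table_raw` (
  `item_no` string,
  `description` string,
  `name` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://my_location/'
TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');

-- TRY(CAST(...)) yields NULL instead of an error for non-numeric values
SELECT *
FROM my_table_raw
WHERE TRY(CAST(item_no AS DOUBLE)) < 30000
LIMIT 10;
This only helps if the embedded commas in description are quoted in the source files; if they are not, the CSVs have to be regenerated with proper quoting, since no SerDe setting can recover the original field boundaries.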

The following query returns line 1:8: mismatched input 'EXTERNAL'. Expecting: 'OR', 'SCHEMA', 'TABLE', 'VIEW'

The following query returns:
line 1:8: mismatched input 'EXTERNAL'. Expecting: 'OR', 'SCHEMA', 'TABLE', 'VIEW'
CREATE EXTERNAL TABLE IF NOT EXISTS adult_data_clean(
age bigint,
workclass string,
education string,
relationship string,
occupation string,
country string,
income_cat string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
STORED AS OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://census-income-example/clean.data'
TBLPROPERTIES (
'classification'='csv'
'skip.header.line.count'='1')
There are two errors:
1. The TBLPROPERTIES clause is missing a comma between the parameters.
2. OUTPUTFORMAT should not have STORED AS before it (since it simply continues from INPUTFORMAT):
CREATE EXTERNAL TABLE IF NOT EXISTS adult_data_clean(
age bigint,
workclass string,
education string,
relationship string,
occupation string,
country string,
income_cat string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://census-income-example/clean.data'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')
Yes, those error messages were confusing and misleading.

create bucket table in AWS Athena

I tried the query below to create a bucketed table, but it failed. However, if I remove
the CLUSTERED BY clause, the query succeeds. Any suggestions? Thank you.
The error message: no viable alternative at input create external
CREATE EXTERNAL TABLE nation5(
n_nationkey bigint,
n_name string,
n_rgionkey int,
n_comment string)
CLUSTERED BY
n_regionkey INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='s3://test/testbucket/nation5/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test/testbucket/nation5/'
The CLUSTERED BY column needs to be in brackets; the following works:
CREATE EXTERNAL TABLE test.nation5(
n_nationkey bigint,
n_name string,
n_regionkey int,
n_comment string)
CLUSTERED BY
(n_regionkey) INTO 256 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'path'='s3://test/testbucket/nation5/')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://test/testbucket/nation5/'
(You also have a spelling mistake in the column definition: n_rgionkey should be n_regionkey.)
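
As a side note, if the data itself is produced with Athena, bucketing can also be declared through a CTAS statement instead of an external DDL. A sketch under assumed names (nation5_bucketed and its S3 prefix are made up):
CREATE TABLE test.nation5_bucketed
WITH (
  external_location = 's3://test/testbucket/nation5_bucketed/',
  format = 'PARQUET',
  bucketed_by = ARRAY['n_regionkey'],
  bucket_count = 256
) AS
SELECT n_nationkey, n_name, n_regionkey, n_comment
FROM test.nation5;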

AWS Redshift Spectrum (external table) - Varchar datatype assigned to a column cannot handle both array and string data for the same column

Task: Trying to load a bunch of JSON files from S3 buckets into Redshift using Redshift Spectrum.
Problem: The JSON objects in a few files have their data wrapped in square brackets, but other JSON files have the same objects without square brackets. Is there a way to consume data both with and without square brackets "[ ]" when creating an external table with Redshift Spectrum?
JSON FILES to be consumed:
File 1: "x":{"y":{ "z":["ABCD"]}}
File 2: "x":{"y":{ "z":"EFGH"}}
Case 1
When column z is defined as an array, I miss the data from the JSON files that are without square brackets:
CREATE EXTERNAL TABLE spectrum.table
(x struct<y:struct<z:array<varchar(256)>>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'dots.in.keys'='true') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://****'
Query: Select c from spectrum.table t , t.x.y.z c;
Case 2
When column z is defined as varchar (without declaring it as an array), below is the error:
Create Statement:
CREATE EXTERNAL TABLE spectrum.table
(x struct<y:struct<z:varchar(256)>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'dots.in.keys'='true') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://****'
Query: Select regexp_replace( t.x.y.z ,'\\([\\"])', '' ) from spectrum.table t;
or Select t.x.y.z from spectrum.table t;
[XX000][500310] [Amazon](500310) Invalid operation: Spectrum Scan Error
Details:
-----------------------------------------------
error: Spectrum Scan Error
code: 15001
context: Unsupported implicit cast: Column ('x' 'y' 'z'), From Type: LIST, To Type: VARCHAR,
----------------------------------------------

AWS Athena only returns the first value of my JSON content

I have this JSON content on S3:
{"foo":"anything", "bar": true }{"foo":"anything", "bar": false }...
My query to create the Athena table:
CREATE EXTERNAL TABLE IF NOT EXISTS db.sample (
`foo` string,
`bar` boolean
) PARTITIONED BY (
year string,
month string,
day string,
hour string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://bucket/sample/'
TBLPROPERTIES ('has_encrypted_data'='false');
When I run the query in Athena:
SELECT * FROM db.sample
it only returns the first value of my content: column foo = anything, column bar = true.
I can't get two or more values to show at the same time.
I already use ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true' ) and the problem continues.
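
That is the expected behavior here: the JSON SerDe processes exactly one record per line, and anything after the first JSON object on a line is ignored, which is why only the first foo/bar pair comes back. The fix is on the data side, not the DDL: store the objects as newline-delimited JSON, one per line, for example:
{"foo":"anything", "bar": true}
{"foo":"anything", "bar": false}
ignore.malformed.json does not help in this case because the first object on each line is perfectly well-formed; the files have to be split into one object per line before Athena reads them.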