Mismatched input error when creating Athena table

I'm getting the following error:
line 1:8: mismatched input 'EXTERNAL'. Expecting: 'OR', 'SCHEMA', 'TABLE', 'VIEW'
when creating an Athena table with the following statement:
CREATE EXTERNAL TABLE IF NOT EXISTS 'abcd_123' (Item:struct<Id:struct<S:string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://mybucket'
I've gone through other Q&As and none of the answers have helped me. Any pointers as to where the error might be?

Try putting a space between Item and struct instead of a colon, and quoting the table name with backticks rather than single quotes, like so:
CREATE EXTERNAL TABLE IF NOT EXISTS `abcd_123` (
  Item struct<
    Id:struct<
      S:string
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://mybucket'
This is taken from the AWS Athena docs. I believe the colon is only required between a struct's field names and their types, not between column names and their types:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` Date,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
...
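In other words, a minimal sketch combining both cases (table, column, and bucket names here are hypothetical):
CREATE EXTERNAL TABLE IF NOT EXISTS example_table (
  -- top-level columns: name and type are separated by a space;
  -- struct fields: name and type are separated by a colon
  RequestId string,
  Item struct<Id: string, Score: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://example-bucket/example-prefix/'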

Related

AWS Athena query returning empty string

I've seen other questions where the query returns no results; that is not what is happening with my query. The query itself runs, but it returns empty strings.
I have an 81.7 MB JSON file in my input bucket (input-data/test_data). I've set up the data source as JSON.
However, when I execute SELECT * FROM test_table; it shows (in green) that the data has been scanned and the query was successful, and there are results, but they are neither saved to the output bucket nor displayed in the GUI.
I'm not sure what I've done wrong in the setup.
This is my table creation:
CREATE EXTERNAL TABLE IF NOT EXISTS `test_db`.`test_data` (
`tbl_timestamp` timestamp,
`colmn1` string,
`colmn2` string,
`colmn3` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://input-data/test_data/'
TBLPROPERTIES ('has_encrypted_data'='false',
'skip.header.line.count'='1');
Resolved this issue: the column names in the table definition need to match the keys in the JSON file itself. Simple, really!
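For example, with the DDL above, each line of the file needs to use exactly those keys (a hypothetical sample line):
{"tbl_timestamp": "2020-01-01 00:00:00", "colmn1": "a", "colmn2": "b", "colmn3": "c"}
If the file's keys differ, the columns simply come back empty. If renaming isn't an option, the OpenX JSON SerDe also supports remapping a column to a differently named key via its mapping serde properties, e.g. adding 'mapping.colmn1' = 'someOtherKeyName' (a hypothetical key) to WITH SERDEPROPERTIES.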

How to skip documents that do not match schema in Athena?

Suppose I have an external table like this:
CREATE EXTERNAL TABLE my.data (
`id` string,
`timestamp` string,
`profile` struct<
`name`: string,
`score`: int>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
A few of my documents have an invalid profile.score (a string rather than an integer).
This causes queries in Athena to fail:
"Status": {
"State": "FAILED",
"StateChangeReason": "HIVE_BAD_DATA: Error parsing field value for field 0: For input string: \"4099999.9999999995\"",
How can I configure Athena to skip the documents that do not fit the external table schema?
A related question is about finding the problematic documents; this question is about skipping them.
Here is a sample of how to exclude a particular file:
SELECT *
FROM "some_database"."some_table"
WHERE "$PATH" != 's3://path/to/a/file'
I just tested this approach:
SELECT COUNT(*)
FROM "some_database"."some_table"
-- Result: 68491573

SELECT COUNT(*)
FROM "some_database"."some_table"
WHERE "$PATH" != 's3://path/to/a/file'
-- Result: 68041452

SELECT COUNT(*)
FROM "some_database"."some_table"
WHERE "$PATH" = 's3://path/to/a/file'
-- Result: 450121

Total: 450121 + 68041452 = 68491573
I have faced the same issue. Since I could not find a specific solution, I used a different approach; it might help you.
The error is caused by bad data in the profile field. Since profile is declared as a struct, Athena expects that field's data to be structured accordingly in the source files; any bad data in this field produces this error.
Try the queries below:
CREATE EXTERNAL TABLE my.data (
  `id` string,
  `timestamp` string,
  `profile` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 's3://my-bucket-of-data'
TBLPROPERTIES ('has_encrypted_data'='false');
and use the query below to get the expected result:
SELECT
  id,
  "timestamp",
  json_extract_scalar(profile, '$.name') AS profile_name,
  json_extract_scalar(profile, '$.score') AS profile_score
FROM my.data;
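If you also need profile_score as a number, convert it at query time; TRY_CAST returns NULL for the malformed values instead of failing the whole query (a sketch against the string-typed table above):
SELECT
  id,
  TRY_CAST(json_extract_scalar(profile, '$.score') AS integer) AS profile_score
FROM my.data;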
You can visit this link for more.

How to read quoted CSV with NULL values into Amazon Athena

I'm trying to create an external table in Athena using a quoted CSV file stored on S3. The problem is that my CSV contains missing values in columns that should be read as INTs. A simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
The CREATE TABLE statement runs fine, but as soon as I try to query the table, I get HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quoting the empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
I tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
I tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works when you specify all columns as STRING, but that's not what I need.
Therefore, the question is: is there any way to read a quoted CSV (quoting is important, as my real data is much more complex) into Athena with the correct column types?
A quick and dirty way to handle this data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that it does not handle the quote characters in the last field. Based on the documentation provided by AWS, this makes sense: ROW FORMAT DELIMITED uses LazySimpleSerDe, which has no support for quoted fields.
I suspect the solution is to use org.apache.hadoop.hive.serde2.RegexSerDe instead.
I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or to the SerDe itself. In your case, you can likely just exclude rows where id IS NULL, as shown below.
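For example (the header row's id fails the INT parse and comes back as NULL):
SELECT *
FROM stackoverflow.test_null_unquoted
WHERE id IS NOT NULL;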
Further Reading:
Stackoverflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV
Unfortunately there is no way to get both support for quoted fields and support for null values in Athena; you have to choose one or the other.
You can use OpenCSVSerDe and type all columns as string; that will give you support for quoted fields, with empty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.
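A sketch of that approach, assuming the table from the question is recreated with OpenCSVSerDe and every column typed as string (the view name is hypothetical):
CREATE OR REPLACE VIEW schema.test_null_typed AS
SELECT
  TRY_CAST(id AS integer) AS id,
  TRY_CAST(height AS integer) AS height, -- empty string casts to NULL
  TRY_CAST(age AS integer) AS age,
  name
FROM schema.test_null_unquoted;
Regular queries can then run against the view and get properly typed, nullable columns.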
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV
This worked for me: use OpenCSVSerDe and type all columns as string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/

Read CSV file where values contain commas in AWS Athena

Hi, currently I have created a table schema in AWS Athena as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
`clientcomputername` string,
`intjnltblrecid` bigint,
`processingstate` string,
`sessionid` int,
`sessionlogindatetime` string,
`sessionlogindatetimetzid` bigint,
`recidoriginal` bigint,
`modifieddatetime` string,
`modifiedby` string,
`createddatetime` string,
`createdby` string,
`dataareaid` string,
`recversion` int,
`partition` bigint,
`recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');
But one of the fields (processingstate) contains values with commas, such as "Europe, Middle East, & Africa", which displaces the column order.
So what would be the best way to read this file? Thanks.
When I removed this part
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
I was able to read quoted text with commas in it.
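That is, something like the following sketch, relying on the OpenCSVSerde defaults (comma separator, double-quote quoteChar), using the table from the question:
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
  `clientcomputername` string,
  `intjnltblrecid` bigint,
  -- ... remaining columns as in the question ...
  `recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');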
As a workaround, look at AWS Glue.
Instead of creating the table via CREATE EXTERNAL TABLE:
Invoke get-table for your table.
Then build the JSON input for create-table from the output.
Merge in the following StorageDescriptor part:
{
"StorageDescriptor": {
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde"
...
}
...
}
Perform the create via the AWS CLI (see the sketch below). You will get the table in AWS Glue, and Athena will be able to select the correct columns.
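A hedged sketch of those steps with the AWS CLI; the database and table names are placeholders, and note that the get-table output must be trimmed into a valid TableInput (drop read-only fields such as DatabaseName, CreateTime, UpdateTime, and CreatedBy) before passing it to create-table:
aws glue get-table --database-name mydb --name mytable > table.json
# edit table.json: unwrap the "Table" object, remove the read-only fields,
# and set StorageDescriptor.SerdeInfo.SerializationLibrary to
# "org.apache.hadoop.hive.serde2.OpenCSVSerde"
aws glue create-table --database-name mydb --table-input file://table.json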
Notes:
If your table is already defined with OpenCSVSerde, this issue may have been fixed since, and you can simply recreate the table.
I do not have much knowledge about Athena, but in AWS Glue you can delete or create a table without any data loss.
Before adding this table via create-table, first check how Glue and/or Athena handle table duplicates.
This is a common messy-CSV situation where certain values contain commas. The solution in Athena is to use OpenCSVSerDe with SERDEPROPERTIES, as described in the AWS doc https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html [the URL may change, so just search for 'OpenCSVSerDe for Processing'].
Following is the basic create table example provided there. Based on your data, you would have to ensure that the data types are specified correctly (e.g. string).
CREATE EXTERNAL TABLE test1 (
f1 string,
s2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\")
LOCATION 's3://user-test-region/dataset/test1/'

How can I create a table with only some specific files (wildcard) using Amazon Athena?

My bucket used to have this structure:
mybucket/raw/i1.json
mybucket/raw/i2.json
It was easy and straightforward to create the table in Amazon Athena using the code below.
CREATE EXTERNAL TABLE IF NOT EXISTS myclients.big_clients (
`id_number` string,
`txt` string,
...
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://mybucket/raw/'
TBLPROPERTIES ('has_encrypted_data'='false');
Now I'm facing some problems after a migration of the bucket structure.
The new structure of the bucket is shown below.
mybucket/raw/1/i1.json
mybucket/raw/1/docs/doc_1.json
mybucket/raw/1/docs/doc_2.json
mybucket/raw/1/docs/doc_3.json
mybucket/raw/2/i2.json
mybucket/raw/2/docs/doc_1.json
mybucket/raw/2/docs/doc_2.json
I now wish to create two tables: the same table I had before the migration, and a new one containing only the docs.
Is there any way I could do that without having to rearrange my files into another folder?
I'm searching for some kind of wildcard over the bucket keys at table creation, like this:
CREATE EXTERNAL TABLE IF NOT EXISTS myclients.big_clients (
`id_number` string,
`txt` string,
...
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = 'i*.json'
) LOCATION 's3://mybucket/raw/'
TBLPROPERTIES ('has_encrypted_data'='false');
CREATE EXTERNAL TABLE IF NOT EXISTS myclients.big_clients_docs (
`dt` date,
`txt` string,
`id_number` string,
`s3_doc_path` string,
`s3_doc_path_origin` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = 'doc_*.json'
) LOCATION 's3://mybucket/raw/'
TBLPROPERTIES ('has_encrypted_data'='false');
I was looking for the same thing. Unfortunately this is not possible, as the S3 API is not that wildcard-friendly (it would require scanning all the keys client-side, which is slow). The Athena documentation also states that this is not supported: a table's LOCATION must be a plain prefix.
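One workaround, building on the "$PATH" pseudo-column shown in an earlier answer: point a single table at the whole prefix and filter by object key at query time. A sketch, assuming big_clients_docs is created over s3://mybucket/raw/ without the input.regex property (note that Athena still scans every file under the prefix):
SELECT *
FROM myclients.big_clients_docs
WHERE "$PATH" LIKE '%/docs/doc_%.json';
You could wrap the filter in a view so regular queries don't have to repeat it.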