mismatched input 'EXTERNAL'. Expecting: 'OR', 'SCHEMA', 'TABLE', 'VIEW' - amazon-athena

I am trying to create a table in AWS Athena with the following command. However, I get the error: mismatched input 'EXTERNAL'. Expecting: 'OR', 'SCHEMA', 'TABLE', 'VIEW'
Can you help with this?
CREATE EXTERNAL TABLE IF NOT EXISTS 'transport_evaluator_prod' (
`messageId` STRING,
`type` STRING,
`causationId` STRING,
`correlationId` STRING,
`traceparent` STRING,
`data` STRUCT<
`evaluationOccurred`:STRING,
`eta`:STRUCT<
`distance`:INT,
`timeToDestination`:INT,
`eta`:STRING,
`destination`:STRUCT<
`latitude`:DOUBLE,
`longitude`:DOUBLE,
`altitude`:DOUBLE>,
`destinationEventId`:STRING,
`origin`:STRUCT<
`latitude`:DOUBLE,
`longitude`:DOUBLE,
`altitude`:DOUBLE>,
`originEventId`:STRING,
`plannedArrival`:STRING,
`locationActionReference`:STRING,
`resourceUrn`:STRING,
`eventProvider`:STRING,
`occured`:STRING,
`position`:STRUCT<
`latitude`:DOUBLE,
`longitude`:DOUBLE,
`altitude`:DOUBLE>,
`equipmentNumber`:STRING,
`received`:STRING>>)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
LOCATION
'for-security-pointing-to-folder'

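Try removing the quotes from the table name. In Athena DDL, single quotes denote string literals; identifiers are either left unquoted or wrapped in backticks: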
CREATE EXTERNAL TABLE IF NOT EXISTS transport_evaluator_prod (
`messageId` STRING,
`type` STRING,
...

For the sake of the next person who runs into a similar issue, here is what I changed to get it to work:
I removed all quotation marks from the table's column names as well as from the inner attributes of the 'data' struct.
Also make sure that every attribute name inside a struct is followed by ':'.
Finally, ensure that each declared type matches the type of the corresponding field in the JSON payload.

I faced the same issue, but in my case the problem was the name of my database. Once I used the correct name, the query worked.
Something like:
CREATE EXTERNAL TABLE IF NOT EXISTS `correct_db_name`.`transport_evaluator_prod`
...

Related

Mismatched input error when creating Athena table

Getting the following error,
line 1:8: mismatched input 'EXTERNAL'. Expecting: 'OR', 'SCHEMA', 'TABLE', 'VIEW'
when creating an Athena table with the following command,
CREATE EXTERNAL TABLE IF NOT EXISTS 'abcd_123' (Item:struct<Id:struct<S:string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://mybucket'
I've gone through other Q&As and none of the answers have helped me. Any pointers as to where the error might be here?
Try putting a space between Item and struct instead of a colon, and drop the single quotes around the table name, like so
CREATE EXTERNAL TABLE IF NOT EXISTS abcd_123 (
Item struct<
Id:struct<
S:string
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://mybucket'
I believe the colon is only required between the fields of a struct and their types, not between column names and their types. For comparison, this example is taken from the AWS Athena docs:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` Date,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
...

Regex for Parsing vertical CSV in Athena

So, I've been trying to load CSVs from an S3 bucket into Athena. However, the way the CSVs are designed looks like the following:
ns=2;s=A_EREG.A_EREG.A_PHASE_PRESSURE,102.19468,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE,50.0,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_CURRENT,15.919731,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE,0.22070877,12/12/19 00:00:01.2144275 GMT
ns=2;s=A_EREG.A_EREG.ACTIVE_PWR,0.0,12/12/19 00:00:01.2144275 GMT
The csv is just one record. Each column of the record has a value associated to it, which sits between two commas between the timestamp and the name, which I am trying to capture.
I've been trying to parse it using Regex Serde and I got to this Regular expression:
((?<=\,).*?(?=\,))
demo
I want the output of the above to be:
col_a col_b col_c col_d col_e
102.19468 50.0 15.919731 0.22070877 0.0
My DDL query looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS
(...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "\(?<=\,).*?(?=\,)"
) LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false');
The table creation query above runs successfully, but when I try to preview my table I get the following error:
HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
I am fairly new to Hive and Regex so I don't know what is going on. Can someone help me out here?
Thanks in advance,
BR
Each column in a Hive table corresponds to one capturing group in the regex. If you want to select a single column containing everything between the commas, then this will work:
'.*,(.*),.*'
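To see the "one column per capturing group" rule in action, here is a minimal sketch. It uses Python's re module as a stand-in for the Java regex engine behind RegexSerDe (an assumption; the two agree on this simple pattern), and fullmatch because, as far as I know, the SerDe matches the pattern against the whole line.

```python
import re

# The pattern from the answer above: exactly one capturing group,
# so it fits a table with exactly one column.
single_value = re.compile(r".*,(.*),.*")

# One line taken from the question's sample data.
line = "ns=2;s=A_EREG.A_EREG.A_PHASE_PRESSURE,102.19468,12/12/19 00:00:01.2144275 GMT"

# The number of capturing groups is what has to equal the number of columns.
print(single_value.groups)

# The single group captures the value between the two commas.
value = single_value.fullmatch(line).group(1)
print(value)
```

If the group count printed here differs from the number of columns in the DDL, that is the situation the HIVE_CURSOR_ERROR message describes.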
Athena serdes require that each record in the input is a single line. Multiline records are not supported.
What you can do instead is to create a table which maps each line in your data to a row in a table, and use a view to pivot the rows that belong together into a single row.
I'm going to assume that the ns field at the start of the lines is an ID, if not, I assume there is some other thing identifying which lines belong together that you can use.
I used your demo to create a regex that matched all the fields of each line and came up with ns=(\d);s=([^,]+),([^,]+),(.+) (see https://regex101.com/r/HnjnxK/5).
CREATE EXTERNAL TABLE my_data (
ns string,
s string,
v double,
dt string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = "ns=(\\d);s=([^,]+),([^,]+),(.+)"
)
LOCATION 's3://jackson-nifi-plc-data-1/2019-12-12/'
TBLPROPERTIES ('has_encrypted_data'='false')
Apologies if the regex isn't correctly escaped, I'm just typing this into Stack Overflow.
This table has four columns, corresponding to the four fields in each line. I've named them ns and s after the data, v for the numerical value, and dt for the date. The date needs to be typed as a string since it's not in a format Athena natively understands.
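As a quick sanity check of the four-group regex outside Athena, something like the following can be used. This is only a sketch: Python's re stands in for the Java regex engine, and parse_line and COLUMNS are hypothetical names made up for the illustration.

```python
import re

# The regex from the DDL above, without the extra escaping the DDL string needs.
LINE_PATTERN = re.compile(r"ns=(\d);s=([^,]+),([^,]+),(.+)")

# Column names in the same order as the capturing groups.
COLUMNS = ("ns", "s", "v", "dt")


def parse_line(line):
    """Map one input line to a row dict, or None if the line does not match."""
    match = LINE_PATTERN.fullmatch(line)
    if match is None:
        return None
    row = dict(zip(COLUMNS, match.groups()))
    row["v"] = float(row["v"])  # v is declared as double in the table
    return row


sample = "ns=2;s=A_EREG.A_EREG.ACTIVE_PWR,0.0,12/12/19 00:00:01.2144275 GMT"
row = parse_line(sample)
print(row)
```

Note that (\d) only accepts a single digit, so an ns value of 10 or more would not match with the pattern as written.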
Assuming that ns is a record identifier, you can then create a view that pivots rows with different values for s into columns. You will have to adapt this to your needs; the following is of course just a demonstration:
CREATE VIEW my_pivoted_data AS
WITH data_aggregated_by_ns AS (
SELECT
ns,
map_agg(s, v) AS s_and_v
FROM my_data
GROUP BY ns
)
SELECT
ns,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_PRESSURE') AS phase_pressure,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_REF_SIGNAL_TO_VALVE') AS phase_ref_signal_to_valve,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_CURRENT') AS phase_sec_current,
element_at(s_and_v, 'A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE') AS phase_sec_voltage,
element_at(s_and_v, 'A_EREG.A_EREG.ACTIVE_PWR') AS active_pwr
FROM data_aggregated_by_ns
Apologies if there are syntax errors in the SQL above.
This creates a view with two parts to it (but start by trying it out as a plain query, using everything from WITH onwards).
The first part, the first SELECT results in rows that aggregate all the s and v values for each value of ns into a map. Try to run this query by itself to see how the result looks.
The second part, the second SELECT uses the results of the first part and just picks out the different v values for a number of values of s that I chose from your question using the aggregated map.
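If it helps to see the two steps outside SQL, here is a rough Python equivalent. It is a sketch only: the rows list and the WANTED mapping are made-up illustration data, and a plain dict plays the role of the aggregated map.

```python
from collections import defaultdict

# Rows as they would come out of the my_data table: (ns, s, v).
rows = [
    ("2", "A_EREG.A_EREG.A_PHASE_PRESSURE", 102.19468),
    ("2", "A_EREG.A_EREG.A_PHASE_SEC_CURRENT", 15.919731),
    ("2", "A_EREG.A_EREG.ACTIVE_PWR", 0.0),
]

# Step 1 (the first SELECT): group by ns and collect s -> v into one map per ns.
s_and_v_by_ns = defaultdict(dict)
for ns, s, v in rows:
    s_and_v_by_ns[ns][s] = v

# Step 2 (the second SELECT): pick selected keys out of each map as columns.
# dict.get returns None for a missing key, much like element_at yields NULL.
WANTED = {
    "phase_pressure": "A_EREG.A_EREG.A_PHASE_PRESSURE",
    "phase_sec_voltage": "A_EREG.A_EREG.A_PHASE_SEC_VOLTAGE",
    "active_pwr": "A_EREG.A_EREG.ACTIVE_PWR",
}

pivoted = [
    {"ns": ns, **{column: s_and_v.get(key) for column, key in WANTED.items()}}
    for ns, s_and_v in s_and_v_by_ns.items()
]
print(pivoted)
```

In this sample phase_sec_voltage comes out empty because no row with that s value was supplied.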

Apache Hive regEx serde: proper regex for a mixed format (json)

I'm trying to create an AWS Athena table using RegexSerDe. Due to some export issues I cannot use JsonSerDe.
2019-04-11T09:05:16.775Z {"timestamp":"data0","level":"data1","thread":"data2","logger":"data3","message":"data4","context":"data5"}
I was trying to obtain json values with a regex, but without any luck.
CREATE EXTERNAL TABLE IF NOT EXISTS dsfsdfs.mecs3(
`timestamp` string,
`level` string,
`thread` string,
`logger` string,
`message` string,
`context` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "[ :]+(\\"[^\"]*\\")"
)LOCATION 's3://thisisates/'
Error: HIVE_CURSOR_ERROR: Number of matching groups doesn't match the number of columns
Some help would be great, as I'm not an expert in regex.
Thanks and BR.
Getting this working will probably be very hard - even if you can write a regex that will capture the columns out of the JSON structure, can you guarantee that all JSON documents will be rendered with the properties in the same order? JSON itself considers {"a": 1, "b": 2} and {"b": 2, "a": 1} to be equivalent, so many JSON libraries don't guarantee, or even care about ordering.
Another approach to this is to create a table with two columns: timestamp and data, as a regex table with a regex with two capture groups, the timestamp and the rest of the line – or possibly as a CSV table if the character after the timestamp is a tab (if it's a space it won't work since the JSON will contain spaces):
CREATE EXTERNAL TABLE IF NOT EXISTS mecs3_raw (
`timestamp` string,
`data` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(\\S+) (.+)$"
)
LOCATION 's3://thisisates/'
(the regex assumes that there is a space between the timestamp and the JSON structure, change it as needed).
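The two-group regex can be tried out locally before creating the table. A minimal sketch, with Python's re standing in for the Java regex engine and a shortened, made-up sample line:

```python
import re

# Same pattern as in the DDL above, without the DDL string escaping.
RAW_LINE = re.compile(r"^(\S+) (.+)$")

# Shortened sample line; the JSON part deliberately contains a space.
line = '2019-04-11T09:05:16.775Z {"level":"data1","message":"two words"}'

# Group 1 stops at the first whitespace, group 2 takes the rest of the line.
timestamp, data = RAW_LINE.fullmatch(line).groups()
print(timestamp)
print(data)
```

Because the first group is \S+, spaces inside the JSON do not affect where the split happens.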
That table will not be very usable by itself, but what you can do next is to create a view that extracts the properties from the JSON structure:
CREATE VIEW mecs3 AS
SELECT
"timestamp",
JSON_EXTRACT_SCALAR("data", '$.level') AS level,
JSON_EXTRACT_SCALAR("data", '$.thread') AS thread,
JSON_EXTRACT_SCALAR("data", '$.logger') AS logger,
JSON_EXTRACT_SCALAR("data", '$.message') AS message,
JSON_EXTRACT_SCALAR("data", '$.context') AS context
FROM mecs3_raw
(mecs3_raw is the table with timestamp and data columns)
This will give you what you want and will be much less error prone.
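To illustrate why extracting by key is less fragile than a positional regex, here is a small Python sketch. extract_scalar is a hypothetical helper that only loosely imitates what JSON_EXTRACT_SCALAR does for top-level string properties; it is not the Athena function.

```python
import json


def extract_scalar(data, key):
    """Return the string stored under a top-level key, or None.

    Loose stand-in for JSON_EXTRACT_SCALAR(data, '$.key'); invalid JSON,
    missing keys and non-string values all give None here.
    """
    try:
        value = json.loads(data).get(key)
    except (json.JSONDecodeError, AttributeError):
        return None
    return value if isinstance(value, str) else None


# The same document rendered with its properties in two different orders.
doc_a = '{"level":"data1","thread":"data2","message":"data4"}'
doc_b = '{"message":"data4","level":"data1","thread":"data2"}'

levels = [extract_scalar(doc, "level") for doc in (doc_a, doc_b)]
print(levels)
```

Both orderings give the same value for level, which a regex that relies on property position could not guarantee.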
Try Regex: (?<=")[^\"]*(?=\" *(?:,|}))
Demo

How to read quoted CSV with NULL values into Amazon Athena

I'm trying to create an external table in Athena from a quoted CSV file stored on S3. The problem is that my CSV contains missing values in columns that should be read as INTs. Simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
CREATE TABLE statement runs fine but as soon as I try to query the table, I'm getting HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quote empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works, when you specify all columns as STRING but that's not what I need.
Therefore, the question is, is there any way to read a quoted CSV (quoting is important as my real data is much more complex) to Athena with correct column specification?
Quick and dirty way to handle these data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that this does not handle the quote characters in the last field. Based on the documentation provided by AWS, this makes sense, since ROW FORMAT DELIMITED uses the LazySimpleSerDe, which does not strip quotes from fields.
I suspect the solution is using the following SerDe org.apache.hadoop.hive.serde2.RegexSerDe.
I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.
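For anyone who wants to check how that regex behaves on the sample file, here is a sketch using Python's re as a stand-in for the Java regex engine (an assumption; the pattern is simple enough that both should behave the same). split_line is a hypothetical helper name.

```python
import re

# The input.regex from the DDL above, without the DDL string escaping.
QUOTED_CSV = re.compile(r'(.*),(.*),(.*),"(.*)"')

lines = [
    "id,height,age,name",
    '1,,26,"Adam"',
    '2,178,28,"Robert"',
    '3,123,34,"Bill, Comma"',
]


def split_line(line):
    """Return the four captured fields, or None when the line does not match."""
    match = QUOTED_CSV.fullmatch(line)
    return match.groups() if match else None


parsed = [split_line(line) for line in lines]
for fields in parsed:
    print(fields)
```

The unquoted header line does not match the pattern at all. As far as I can tell, RegexSerDe returns NULL columns for such lines, which would fit the suggestion to exclude rows where ID IS NULL.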
Further Reading:
Stackoverflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV
Unfortunately there is no way to get both support for quoted fields and support for null values in Athena. You have to choose one or the other.
You can use OpenCSVSerDe and type all columns as string; that will give you support for quoted fields, and empty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.
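The "read everything as strings, cast later" idea can be sketched outside Athena like this. Python's csv module plays the role of the quote-aware reader, and try_cast_int is a hypothetical helper that mimics the intent of TRY_CAST(x AS INTEGER) by returning None instead of raising.

```python
import csv
import io


def try_cast_int(text):
    """Return int(text), or None when the text is empty or not an integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


raw = (
    "id,height,age,name\n"
    '1,,26,"Adam"\n'
    '2,178,28,"Robert"\n'
    '3,123,34,"Bill, Comma"\n'
)

# Step 1: every field arrives as a string; quotes are handled by the reader.
reader = csv.DictReader(io.StringIO(raw))

# Step 2: convert types afterwards, as a view over the string table would.
rows = [
    {
        "id": try_cast_int(record["id"]),
        "height": try_cast_int(record["height"]),
        "age": try_cast_int(record["age"]),
        "name": record["name"],
    }
    for record in reader
]
print(rows)
```

The empty height ends up as a missing value rather than an error, and the comma inside the quoted name survives.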
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV
This worked for me. Use OpenCSVSerDe and convert all columns into string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/

Does AWS Athena support Sequence Files?

Has anyone tried creating an AWS Athena table on top of sequence files? According to the documentation, it looks like it is possible. I was able to execute the create table statement below.
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
STORED AS sequencefile
location 's3://bucket/sequencefile/';
The statement executed successfully, but when I try to read data from the table it throws the error below:
Your query has the following error(s):
HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://viewershipforneo4j/2017-09-26/000030_0 (offset=372128055, length=62021342) using org.apache.hadoop.mapred.SequenceFileInputFormat: s3://viewershipforneo4j/2017-09-26/000030_0 not a SequenceFile
This query ran against the "default" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 9f0983b0-33da-4686-84a3-91b14a39cd09.
The sequence files are valid. The issue here is that no delimiter is defined,
i.e. the row format delimited fields terminated by clause is missing.
If, in your case, tab is the column delimiter and each row is on its own line, the statement will be:
create external table if not exists sample_sequence (
account_id string,
receiver_id string,
session_index smallint,
start_epoch bigint)
row format delimited fields terminated by '\t'
STORED AS sequencefile
location 's3://bucket/sequencefile/';
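Since the error message itself says the object is "not a SequenceFile", it may also be worth confirming the file format independently of the DDL. Hadoop SequenceFiles start with the bytes SEQ followed by a version byte, so a rough check on the first few bytes of the object could look like the sketch below. looks_like_sequence_file is a hypothetical helper, and how the bytes are fetched from S3 is left out.

```python
def looks_like_sequence_file(header: bytes) -> bool:
    """Check for the 'SEQ' magic plus a version byte at the start of a file."""
    return len(header) >= 4 and header[:3] == b"SEQ"


# Made-up headers for illustration: a SequenceFile-style start and plain
# tab-delimited text.
sequence_header = b"SEQ\x06\x19org.apache.hadoop.io.Text"
text_header = b"acc-1\trcv-9\t3\t1506384000\n"

print(looks_like_sequence_file(sequence_header))
print(looks_like_sequence_file(text_header))
```

If the check fails for the file named in the error, the data is probably plain text or another format, and STORED AS sequencefile would not be the right choice for it.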