Can't understand this Athena create table error - amazon-athena

Is there a better way to error-check my CREATE TABLE statement below? Athena keeps giving me this unhelpful error:
line 1:8: no viable alternative at input 'create external' (service: amazonathena; status code: 400; error code: invalidrequestexception;)
CREATE EXTERNAL TABLE IF NOT EXISTS database_1.table_1(
`id` string,
`starttime` string,
`endtime` string,
`projectedflow` int,
`volume` int,
`occupancy` int,
`averagespeed` int,
`maxlaneoccupancy` int,
`minlaneaveragespeed` int,
`maxflow` int,
`sustainableflow` int,
`site_id` int as SPLIT(id, ':')[1],
`region` string as SPLIT(id, ':')[3],
`starthour` int as SPLIT(SPLIT(starttime, 'T')[2], ':')[1],
`startminute` int as SPLIT(SPLIT(starttime, 'T')[2], ':')[2],
`starttime_adj` date as CONCAT(SPLIT(starttime, 'T')[1], ' ', SPLIT(SPLIT(starttime, 'T')[2], 'Z')[1]),
`endtime_adj` date as CONCAT(SPLIT(endtime, 'T')[1], ' ', SPLIT(SPLIT(endtime, 'T')[2], 'Z')[1])
)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket_1/traffic/v1/ds/csv'
TBLPROPERTIES (
'areColumnsQuoted'='false',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='gzip',
'delimiter'=',',
'skip.header.line.count'='4');
Any suggestions for strategies would be appreciated.

Since your table name table_1 contains a number, try enclosing it in double quotes, like "table_1".
Refer to the CREATE TABLE documentation here.

Related

Athena Create External Table ParseException

Hi, I am trying to run the following command in Athena:
CREATE EXTERNAL TABLE transport_evaluator_prod(
messageId STRING,
type STRING,
causationId STRING,
correlationId STRING,
traceparent STRING,
`data` struct <
evaluationOccurred: STRING,
eta struct < distance: INT,
timeToDestination: INT,
eta: STRING,
destination struct < latitude: DOUBLE,
longitude: DOUBLE,
altitude: DOUBLE >,
destinationEventId: STRING,
origin struct < latitude: DOUBLE,
longitude: DOUBLE,
altitude: DOUBLE >, originEventId: STRING,
plannedArrival: STRING,
locationActionReference: STRING,
resourceUrn: STRING,
eventProvider: STRING,
occured: STRING,
position struct < latitude: DOUBLE,
longitude: DOUBLE,
altitude: DOUBLE >,
equipmentNumber: STRING,
received: STRING > >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 'changed-for-security'
TBLPROPERTIES ('has_encrypted_data' = 'false')
Unfortunately, when I try to run this I get the following error:
FAILED: ParseException line 1:189 missing : at 'struct' near '' line 1:262 missing : at 'struct' near '' line 1:363 missing : at 'struct' near '' line 1:579 missing : at 'struct' near ''
Can someone please help?
Reading Create Tables in Amazon Athena from Nested JSON and Mappings Using JSONSerDe | AWS Big Data Blog, I notice that any field inside a STRUCT should be referenced as field_name:type.
This also applies to a struct nested inside another struct.
Therefore, this type of line (which is inside a struct):
destination struct < latitude: DOUBLE,
should be:
destination:struct < latitude: DOUBLE,
Thus, this seems to work:
CREATE EXTERNAL TABLE transport_evaluator_prod(
messageId STRING,
type STRING,
causationId STRING,
correlationId STRING,
traceparent STRING,
`data` struct <
evaluationOccurred: STRING,
eta:struct < distance: INT,
timeToDestination: INT,
eta: STRING,
destination:struct < latitude: DOUBLE,
longitude: DOUBLE,
altitude: DOUBLE >,
destinationEventId: STRING,
origin:struct < latitude: DOUBLE,
longitude: DOUBLE,
altitude: DOUBLE >,
originEventId: STRING,
plannedArrival: STRING,
locationActionReference: STRING,
resourceUrn: STRING,
eventProvider: STRING,
occured: STRING,
position:struct < latitude: DOUBLE,
longitude: DOUBLE,
altitude: DOUBLE >,
equipmentNumber: STRING,
received: STRING > >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'ignore.malformed.json' = 'true'
)
LOCATION 'changed-for-security'
TBLPROPERTIES ('has_encrypted_data' = 'false')
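For reference, each name:type pair in the struct maps to a key in a nested JSON object. A quick Python check against a hypothetical record shaped like the `data` struct above (all values are made up for illustration; this assumes, as with the OpenX JsonSerDe, one JSON object per line):

```python
import json

# Hypothetical record matching the `data` struct in the DDL above;
# all field values are made up for illustration.
record = json.loads("""
{
  "messageId": "m-1",
  "type": "EtaEvaluated",
  "data": {
    "evaluationOccurred": "2021-01-01T00:00:00Z",
    "eta": {
      "distance": 1200,
      "timeToDestination": 300,
      "eta": "2021-01-01T00:05:00Z",
      "destination": {"latitude": 59.3, "longitude": 18.1, "altitude": 10.0},
      "origin": {"latitude": 59.2, "longitude": 18.0, "altitude": 12.0},
      "equipmentNumber": "EQ-1"
    }
  }
}
""")

# Nested struct fields correspond to nested JSON objects, so a query path
# like data.eta.destination.latitude reads this value:
print(record["data"]["eta"]["destination"]["latitude"])
```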

How to create a table with a SerDe where the text file has fixed-width (non-delimited) columns?

I'm trying to create a table from a text file that has no delimiters (fixed-width columns).
Example row:
1000000000168999337200----------030420191455594197981209954------- 00000240000005010000000011800000000000
CREATE EXTERNAL TABLE IF NOT EXISTS p_bi.stg_cob (
  tp_registro string, seq string, num_a string, dt_chamada string,
  hr_chamada string, num_b string, pt_interconect string,
  dur_rel_chamada string, dur_tar_chamada string, tp_servico string,
  vl_liq_chamada string, vl_brt_chamada string, reserva string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.{1})(.{10})(.{21})(.{8})(.{6})(.{20})(.{10})(.{7})(.{7})(.{2})(.{11})(.{11})(.{29}).*")
STORED AS TEXTFILE
LOCATION '/user/Fin/Bat';
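The RegexSerDe approach above relies on the regex yielding exactly one capture group per column. One way to sanity-check the widths outside Hive is to apply the same pattern in Python; this is a sketch against a synthetic 143-character row, since the example row shown above appears truncated:

```python
import re

# Same pattern as in the DDL: one fixed-width capture group per column.
pattern = re.compile(
    r"(.{1})(.{10})(.{21})(.{8})(.{6})(.{20})(.{10})(.{7})(.{7})"
    r"(.{2})(.{11})(.{11})(.{29}).*"
)

# Synthetic 143-character row (1+10+21+8+6+20+10+7+7+2+11+11+29 = 143),
# with each field filled by a repeated digit so the split is visible.
widths = [1, 10, 21, 8, 6, 20, 10, 7, 7, 2, 11, 11, 29]
row = "".join(str(i % 10) * w for i, w in enumerate(widths))

m = pattern.fullmatch(row)
print(len(m.groups()))  # 13 capture groups -> 13 columns in the DDL
print(m.group(2))       # the 10-character `seq` field
```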

awk sql parsing

I have the following SQL, used to create a table:
CREATE TABLE "SCHEMA"."TABLE" (
"COL1" INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY (
START WITH +50000
INCREMENT BY +1
MINVALUE +1
MAXVALUE +2147483647
NO CYCLE
CACHE 20
NO ORDER ) ,
"COL2" INTEGER NOT NULL ,
"COL3" INTEGER ,
"COL4" VARCHAR(60 OCTETS) ,
"COL5" VARCHAR(60 OCTETS) ,
"COL6" VARCHAR(60 OCTETS) GENERATED ALWAYS AS (FUNCTION(COL1)) ,
"COL7" VARCHAR(60 OCTETS) GENERATED ALWAYS AS (FUNCTION(COL2)) ,
"COL8" VARCHAR(10 OCTETS) )
IN "TABLESPACE"
ORGANIZE BY ROW;
I'm trying to produce an ALTER TABLE statement, using a regex to match COL6 and COL7, which both contain the pattern "GENERATED ALWAYS AS".
The expected output is:
ALTER TABLE "SCHEMA"."TABLE" ALTER COLUMN "COL6" DROP GENERATED ALTER COLUMN "COL7" DROP GENERATED
So far I have this, but it only captures the last occurrence:
awk 'BEGIN{RS=";"; ORS=";\n"}
match( $0, /CREATE TABLE (.*) \(\n.*(".*").*GENERATED ALWAYS AS .*/, a){
print "ALTER TABLE "a[1] " ALTER COLUMN " a[2] " DROP GENERATED "
}'
Rather than using awk: if you have access to the database, why not generate the SQL you need from the catalog?
SELECT 'ALTER TABLE "' || TABSCHEMA || '"."' || TABNAME || '" '
|| LISTAGG('ALTER COLUMN "' || COLNAME || '" DROP GENERATED', ' ') || ';'
FROM SYSCAT.COLUMNS
WHERE GENERATED = 'A' AND IDENTITY = 'N'
GROUP BY
TABSCHEMA, TABNAME
which produces this against your test table
ALTER TABLE "SCHEMA "."TABLE" ALTER COLUMN "COL6" DROP GENERATED ALTER COLUMN "COL7" DROP GENERATED;

REGEX creating HIVE EXTERNAL TABLE - issue no results during SELECT

I have the following sample data:
"HD",003498,"20160913:17:04:10","D3ZYE",1
"EH","XXX-1985977-1",1,"01","20151215","20151215","20151229","20151215","2304",,,"36-126481000",1340.74,61808.00,1126.62,0.00,214.12,0.00,0.00,0.00,"30","20151229","00653845",,,"PARTS","001","ABI","20151215","Y","Y","N","36-126481000",
I created an input.regex to extract the fields, since the file has numerous record types (signified by the first two characters of the record).
Here is the statement that I have:
CREATE EXTERNAL TABLE EntryHeaderTable
(RecordType STRING
,EntryNumber STRING
,VersionNumber STRING
,EntryType STRING
,ImportDate STRING
,EntryDate STRING
,EntrySummaryDate STRING
,ForeignExportDate STRING
,PortCode STRING
,MasterBillofLading STRING
,ImporterofRecord STRING
,ImporterofRecord2 STRING
,TotalDue STRING
,EnteredValue STRING
,Duty STRING
,HarborMaintenanceFee STRING
,MerchandiseProcessingFee STRING
,DeferredTax STRING
,Tax STRING
,AD_CVD STRING
,ModeofTransportation STRING
,ACHPaymentDate STRING
,BrokerReferenceNumber STRING
,ReconciliationFlagforNAFTA STRING
,ReconciliationFlagforOther STRING
,CommodityDescriptionCode STRING
,SuretyCode STRING
,VersionReasonCode STRING
,VersionDate STRING
,ABIIndicator STRING
,PaperlessIndicator STRING
,IDProceduresFlag STRING
,UltimateConsignee STRING
,Filler STRING
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "(\EH\")(\,?.*)")
STORED AS TEXTFILE
LOCATION '/users/username/co/file'
;
It gives me a message saying that the groups do not match:
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns
But I counted the columns in the record definition and the count matches the total number of groups produced by the regex.
The regex recognizes the record in regex testers such as txt2re or RegExr.
UPDATE:
I also tried the following definition, to indicate that there are 33 occurrences of the second field:
with serdeproperties ("input.regex" = "(\EH\"){1,33}(\,?.*)")
The only reason I did not use a CSV format for this is that I want to treat the record type as a variable field that changes for each record type.
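For what it's worth, the SerDe's complaint can be reproduced outside Hive: RegexSerDe requires the number of capture groups to equal the number of columns, and the pattern above defines only two groups against 34 columns. A quick Python check (using a cleaned-up version of the pattern, since its escapes are not valid everywhere; the alternative per-field pattern is a sketch, not a full CSV grammar):

```python
import re

num_columns = 34  # number of columns in the EntryHeaderTable DDL above

# Cleaned-up version of the pattern from the DDL: only two capture groups.
attempted = re.compile(r'("EH")(,?.*)')
print(attempted.groups)  # 2 -> "Number of matching groups doesn't match ..."

# One group per column instead: a quoted-or-bare field, repeated explicitly.
# This is a sketch, not a full CSV grammar (no commas inside quoted fields).
field = r'"?([^,"]*)"?'
full = re.compile(field + (r',' + field) * (num_columns - 1))
print(full.groups)  # 34, matching the column count
```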

Error in DDL creation in Hive using RegexSerDe

I have data delimited by ",|".
I have created the Hive DDL as follows:
CREATE TABLE player_profile
(
player_id BIGINT COMMENT 'Player Profile Identifier',
change_ts STRING COMMENT 'Change Datetime',
child_birth_year INT COMMENT 'Child Birth Year',
country STRING COMMENT 'Country Code',
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex'='^(\\d+),\\|(.*),\\|(\\d+),\\|(.*)$')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
I am getting the following error while deploying this DDL.
FAILED: Error in metadata: java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.contrib.serde2.RegexSerDe only accepts string columns, but column[0] named player_id has type bigint)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
Is it an issue with the regex expression I gave? If so, what is the correct regex in this case?
Does the Hive (0.11) regex SerDe support BIGINT?
RegexSerDe explicitly checks that all columns are strings. See RegexSerDe.java.
You could change all columns to strings, and then cast between int and string when you are querying the data.
As the problem is stated in your exception:
RegexSerDe only accepts string columns, but column[0] named player_id has type bigint
RegexSerDe doesn't support bigint, only strings.
Have you tried changing:
player_id BIGINT COMMENT
to:
player_id STRING COMMENT
to check whether it solves your issue?
You can see that the RegexSerDe.java source code mentions it:
// All columns have to be of type STRING.
for (int c = 0; c < numColumns; c++) {
if (!columnTypes.get(c).equals(TypeInfoFactory.stringTypeInfo)) {
throw new SerDeException(getClass().getName()
+ " only accepts string columns, but column[" + c + "] named "
+ columnNames.get(c) + " has type " + columnTypes.get(c));
}
}
As you can see, that's the error you are getting. Hope this helps.