AWS Athena table creation fails with “no viable alternative at input 'create external'” - amazon-web-services

This is my first attempt with Athena, please be gentle :)
This query is now failing with the error -> No viable alternative
CREATE EXTERNAL TABLE IF NOT EXISTS dev.mytokendata (
'dateandtime' timestamp, 'requestid' string,
'ip' string, 'caller' string, 'token' string, 'requesttime' int,
'httpmethod' string, 'resourcepath' string, 'status' smallint,
'protocol' string, 'responselength' int )
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format'='%{TIMESTAMP_ISO8601:dateandtime}
\s+\{\"requestId\":\s+\"%{USERNAME:requestid}\",
\s+\"ip\":\s+\"%{IP:ip}\",
\s+\"caller\":\s+\"%{USERNAME:caller}\",
\s+\"token\":\s+\"%{USERNAME:token}\",
\s+\"requestTime\":\s+\"%{INT:requesttime}\",
\s+\"httpMethod\":\s+\"%{WORD:httpmethod}\",
\s+\"resourcePath\":\s+\"%{UNIXPATH:resourcepath}\",
\s+\"status\":\s+\"%{INT:status}\",
\s+\"protocol\":\s+\"%{UNIXPATH:protocol}\",
\s+\"responseLength:\"\s+\"%{INT:responselength}\"\s+\}' )
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://athena-abc/1234/'
TBLPROPERTIES ('has_encrypted_data'='false', 'compressionType'='gzip');
This is a line from the log file (a .gz file) that I am trying to parse:
2018-07-30T02:23:34.134Z { "requestId":
"810000-9100-1100-a100-f100000", "ip": "01.01.01.001", "caller": "-",
"token": "1234-5678-78910-abcd", "requestTime":
"1002917414000", "httpMethod": "POST", "resourcePath":
"/MyApp/v1.0/MyService.wsdl",
"status": "200", "protocol": "HTTP/1.1", "responseLength": "1000" }
Can anyone please point out what may be wrong? It would be of great help.

You have quoted the column names with apostrophes, which Athena's DDL does not accept; column names should be left unquoted or wrapped in backticks. There was also a problem with the escape characters: inside the SerDe property every backslash in the Grok pattern needs to be doubled.
The correct version:
CREATE EXTERNAL TABLE mytokendata (
`dateandtime` timestamp,
`requestid` string,
`ip` string,
`caller` string,
`token` string,
`requesttime` int,
`httpmethod` string,
`resourcepath` string,
`status` smallint,
`protocol` string,
`responselength` int )
ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
'input.format'='%{TIMESTAMP_ISO8601:dateandtime}
\\s+\\{\"requestId\":\\s+\"%{USERNAME:requestid}\",
\\s+\"ip\":\\s+\"%{IP:ip}\",
\\s+\"caller\":\\s+\"%{USERNAME:caller}\",
\\s+\"token\":\\s+\"%{USERNAME:token}\",
\\s+\"requestTime\":\\s+\"%{INT:requesttime}\",
\\s+\"httpMethod\":\\s+\"%{WORD:httpmethod}\",
\\s+\"resourcePath\":\\s+\"%{UNIXPATH:resourcepath}\",
\\s+\"status\":\\s+\"%{INT:status}\",
\\s+\"protocol\":\\s+\"%{UNIXPATH:protocol}\",
\\s+\"responseLength:\"\\s+\"%{INT:responselength}\"\\s+\\}' )
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://athena-abc/1234/'
TBLPROPERTIES ('has_encrypted_data'='false', 'compressionType'='gzip');

Related

How to resolve EOF in Athena? FAILED: ParseException line 1:6 missing EOF at '-' near 'cw'

My Athena query
CREATE EXTERNAL TABLE IF NOT EXISTS cwmilenko.new1 ( `ticket-id` string, `amount_stake_one` struct )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1') location 's3://cw-milenko-tests/1507tick2.parquet/part-00000-bbab8f70-4758-4041-9f69-c17f21c916dac000.snappy.parquet'
Error
FAILED: ParseException line 1:6 missing EOF at '-' near 'cw'
How to fix this problem?
This works. The ParseException complains about a '-' right after 'cw', which points to a hyphenated identifier that is not wrapped in backquotes in the statement Athena actually received (for example a database or table named along the lines of cw-milenko); Athena/Hive names cannot contain hyphens unless they are backquoted, and a bare struct with no field types will not parse either. Sticking to underscore names and concrete column types avoids both problems:
CREATE EXTERNAL TABLE parket3 (
`amount_stake_one` double,
`bet_event_etl_date` string,
`ticket_id` string
) STORED AS PARQUET LOCATION 's3://myparket/' tblproperties ("parquet.compression"="SNAPPY")

AWS Glue pushdown predicate not working properly

I'm trying to optimize my Glue/PySpark job by using push down predicates.
start = date(2019, 2, 13)
end = date(2019, 2, 27)
print(">>> Generate data frame for ", start, " to ", end, "... ")
relaventDatesDf = spark.createDataFrame([
Row(start=start, stop=end)
])
relaventDatesDf.createOrReplaceTempView("relaventDates")
relaventDatesDf = spark.sql("SELECT explode(generate_date_series(start, stop)) AS querydatetime FROM relaventDates")
relaventDatesDf.createOrReplaceTempView("relaventDates")
print("===LOG:Dates===")
relaventDatesDf.show()
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights", push_down_predicate="""
querydatetime BETWEEN '%s' AND '%s'
AND querydestinationplace IN (%s)
""" % (start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d"), ",".join(map(lambda s: str(s), arr))))
However, it appears that Glue still attempts to read data outside the specified date range:
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-01/part-00045-6cdebbb1-562c-43fa-915d-93b125aeee61.c000.snappy.parquet' for reading
INFO FileScanRDD: Reading File path: s3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet, range: 0-11797922, partition values: [12191,17965]
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
INFO S3NativeFileSystem: Opening 's3://.../flights/querydestinationplace=12191/querydatetime=2019-03-10/part-00021-34a13146-8fb2-43de-9df2-d8925cbe472d.c000.snappy.parquet' for reading
WARN S3AbortableInputStream: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
Notice that querydatetime=2019-03-01 and querydatetime=2019-03-10 are outside the specified range of 2019-02-13 to 2019-02-27. Is that why the next line says "aborting HTTP connection"? It goes on to say "This is likely an error and may result in sub-optimal behavior", so is something wrong?
I wonder if the problem is that BETWEEN or IN is not supported inside the predicate?
The table's CREATE DDL:
CREATE EXTERNAL TABLE `flights`(
`id` string,
`querytaskid` string,
`queryoriginplace` string,
`queryoutbounddate` string,
`queryinbounddate` string,
`querycabinclass` string,
`querycurrency` string,
`agent` string,
`quoteageinminutes` string,
`price` string,
`outboundlegid` string,
`inboundlegid` string,
`outdeparture` string,
`outarrival` string,
`outduration` string,
`outjourneymode` string,
`outstops` string,
`outcarriers` string,
`outoperatingcarriers` string,
`numberoutstops` string,
`numberoutcarriers` string,
`numberoutoperatingcarriers` string,
`indeparture` string,
`inarrival` string,
`induration` string,
`injourneymode` string,
`instops` string,
`incarriers` string,
`inoperatingcarriers` string,
`numberinstops` string,
`numberincarriers` string,
`numberinoperatingcarriers` string)
PARTITIONED BY (
`querydestinationplace` string,
`querydatetime` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://pinfare-glue/flights/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-parquet',
'averageRecordSize'='19',
'classification'='parquet',
'compressionType'='none',
'objectCount'='623609',
'recordCount'='4368434222',
'sizeKey'='86509997099',
'typeOfData'='file')
One of the issues I can see with the code is that you are using "today" instead of "end" in the BETWEEN clause. Although I don't see the today variable declared anywhere in your code, I am assuming it has been initialized with today's date. In that case the range is different, and the partitions being read by Glue/Spark are actually correct for that wider range.
In order to push down your condition, you also need to change the order of the columns in the PARTITIONED BY clause of the table definition: a condition with an IN predicate on the first partition column cannot be pushed down the way you are expecting.
Let me know if it helps.
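For illustration, here is a minimal sketch of the corrected call with end substituted for today; it assumes glueContext, start, end and arr are defined exactly as in the question, and everything else is left unchanged:
# Hypothetical sketch: the same call as in the question, but bounded by `end`
# instead of the undeclared `today`. Assumes glueContext, start, end and arr
# already exist in the job.
predicate = (
    "querydatetime BETWEEN '{0}' AND '{1}' "
    "AND querydestinationplace IN ({2})".format(
        start.strftime("%Y-%m-%d"),
        end.strftime("%Y-%m-%d"),
        ",".join(str(s) for s in arr),
    )
)

flightsGDF = glueContext.create_dynamic_frame.from_catalog(
    database="xxx",
    table_name="flights",
    transformation_ctx="flights",
    push_down_predicate=predicate,
)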
Pushdown predicates in a Glue DynamicFrame work fine with BETWEEN as well as with an IN clause, as long as you have the correct sequence of partition columns in the table definition and in the query.
I have a table with three levels of partitions:
s3://bucket/flights/year=2018/month=01/day=01 -> 50 records
s3://bucket/flights/year=2018/month=02/day=02 -> 40 records
s3://bucket/flights/year=2018/month=03/day=03 -> 30 records
Read the data into a DynamicFrame:
ds = glueContext.create_dynamic_frame.from_catalog(
database = "abc",table_name = "pqr", transformation_ctx = "flights",
push_down_predicate = "(year == '2018' and month between '02' and '03' and day in ('03'))"
)
ds.count()
Output:
30 records
So you will get the correct results if the sequence of columns is specified correctly. Also note that you need to quote the values in the IN clause, i.e. IN ('%s').
Partition columns in table:
querydestinationplace string,
querydatetime string
Data read in DynamicFrame:
flightsGDF = glueContext.create_dynamic_frame.from_catalog(database = "xxx", table_name = "flights", transformation_ctx="flights",
push_down_predicate=
"""querydestinationplace IN ('%s') AND
querydatetime BETWEEN '%s' AND '%s'
"""
%
( ",".join(map(lambda s: str(s), arr)),
start.strftime("%Y-%m-%d"), today.strftime("%Y-%m-%d")))
Try building the dates and the predicate like this:
start = str(date(2019, 2, 13))
end = str(date(2019, 2, 27))
# Set your push_down_predicate variable
pd_predicate = "querydatetime >= '" + start + "' and querydatetime < '" + end + "'"
#pd_predicate = "querydatetime between '" + start + "' AND '" + end + "'" # Or this one?
flightsGDF = glueContext.create_dynamic_frame.from_catalog(
database = "xxx"
, table_name = "flights"
, transformation_ctx="flights"
, push_down_predicate=pd_predicate)
The pd_predicate will be a string that will work as a push_down_predicate.
Here is a nice read about it if you like.
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/

Load array field in csv data file into Athena table

This is a sample row in the input data file, with two fields, dept and names:
dept,names
Mathematics,[foo,bar,alice,bob]
Here, 'names' is an array of strings, and I want to load it into Athena as an array of strings.
Any suggestions?
To have a valid CSV file, make sure you put quotes around your array:
Mathematics,"[foo,bar,alice,bob]"
If you can remove the "[" and "]", the solution below becomes even easier and you can just split without the regex.
Better: Mathematics,"foo,bar,alice,bob"
First create a simple table from CSV with just strings:
CREATE EXTERNAL TABLE IF NOT EXISTS test.mydataset (
`dept` string,
`names` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'quoteChar' = '"',
"separatorChar" = ',',
'collection.delim' = ',',
'mapkey.delim' = ':'
) LOCATION 's3://<your location>'
TBLPROPERTIES ('has_encrypted_data'='false')
Then create a view which uses a regex to remove your '[' and ']' characters, then splits the rest by ',' into an array.
CREATE OR REPLACE VIEW mydataview AS
SELECT dept,
split(regexp_extract(names, '^\[(.*)\]$', 1), ',') as names
FROM mydataset
Then use the view for your queries. I am not 100% sure as I've only spent like 12 hours using Athena.
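If you want to sanity-check that extraction logic outside Athena first, here is a small, purely illustrative Python sketch of the same regex-then-split transform (it is not part of the Athena setup):
import re

def parse_names(raw):
    # Same idea as the view: strip the surrounding [ ] with a regex,
    # then split the remainder on commas.
    match = re.match(r'^\[(.*)\]$', raw)
    inner = match.group(1) if match else raw
    return inner.split(',')

print(parse_names('[foo,bar,alice,bob]'))  # ['foo', 'bar', 'alice', 'bob']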
--
Note that in order to use the quotes you need OpenCSVSerde; the 'lazyserde' won't work, as it does not support quotes. lazyserde DOES support internal arrays, but you can't use ',' as the separator in that case. If you want to try that, your data would look like:
Better: Mathematics,foo|bar|alice|bob
In that case this MIGHT work directly:
CREATE EXTERNAL TABLE IF NOT EXISTS test.mydataset (
`dept` string,
`names` array<string>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'collection.delim' = '|',
'mapkey.delim' = ':'
) LOCATION 's3://<your location>'
TBLPROPERTIES ('has_encrypted_data'='false')
Note how collection.delim = '|', which should translate your field directly to an array.
Sorry, I don't have time to test this; I'll be happy to update my answer if you can confirm what works. Hopefully this gets you started.

Athena: Custom Apache log format regex issue

I have a custom-format Apache access log with lines like:
Jan 1 23:59:59 ip-172-70-12-5 172.70.1.146 - - [01/Jan/2018:23:59:59 +0000] "GET /someurl HTTP/1.1" 200 22854 "somdomain.com" "5a4acbvv7f7d222" "-" "" "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10" 486029 "180.80.80.62"
In Athena, I have a table set up to process this using a regex; here is the table DDL:
CREATE EXTERNAL TABLE `test_access_log`(
`remoteaddr` string COMMENT '',
`remotelogname` string COMMENT '',
`user` string COMMENT '',
`day` string COMMENT '',
`month` string COMMENT '',
`year` string COMMENT '',
`request` string COMMENT '',
`status` string COMMENT '',
`bytes_string` string COMMENT '',
`tenant` string COMMENT '',
`xrequestid` string COMMENT '',
`userid` string COMMENT '',
`referrer` string COMMENT '',
`browser` string COMMENT '',
`requesttime` string COMMENT '',
`originatorip` string COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='(\\S+) (\\S+) (\\S+) \\[([0-9]*)\\/([a-zA-Z]*)\\/([0-9]*):[\\w:]+\\s[+\\-]\\d{4}\\] \\\"(.+?)\\\" (\\S+) (\\S+) \\\"([^\\\"]*)\\\" \\\"([^\\\"]*)\\\" \\\"([^\\\"]*)\\\" \\\"([^\\\"]*)\\\" \\\"([^\\\"]*)\\\" (\\S+) \\\"([^\\\"]*)\\\"$')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/access-logs'
TBLPROPERTIES (
'has_encrypted_data'='false',
'transient_lastDdlTime'='1526982859')
The problem is that a preview of this table comes back with every column empty.
I have tested the regex via an online Java tester and the expression correctly returns all 16 elements as expected.
If I run SELECT COUNT(*) FROM test_access_log I get the number of rows I am expecting, so the files in my S3 bucket are being read. This leads me to think it is a regex issue.
Any ideas what is wrong with the regex?
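One thing that helps when debugging this kind of problem: Hive's RegexSerDe only populates columns when input.regex matches the entire line, and any line that does not fully match comes back as NULL/empty columns even though SELECT COUNT(*) still counts it. Many online testers use find-style matching, so they can report success on a pattern that Athena treats as a non-match. Here is a small, purely illustrative Python sketch of that difference, using the sample line and the question's regex with the Hive DDL escaping removed (a local check only, not part of the Athena setup):
import re

# Sample line from the question.
line = ('Jan 1 23:59:59 ip-172-70-12-5 172.70.1.146 - - '
        '[01/Jan/2018:23:59:59 +0000] "GET /someurl HTTP/1.1" 200 22854 '
        '"somdomain.com" "5a4acbvv7f7d222" "-" "" '
        '"Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) '
        'AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 '
        'Mobile/7B334b Safari/531.21.10" 486029 "180.80.80.62"')

# input.regex with the DDL-level escaping removed (\\S -> \S, \\\" -> ").
pattern = (r'(\S+) (\S+) (\S+) \[([0-9]*)/([a-zA-Z]*)/([0-9]*):[\w:]+\s'
           r'[+\-]\d{4}\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)" "([^"]*)" '
           r'"([^"]*)" "([^"]*)" (\S+) "([^"]*)"$')

print(bool(re.search(pattern, line)))     # True: a match is found mid-line
print(bool(re.fullmatch(pattern, line)))  # False: the whole line never matches
If the whole-line match fails like this, the leading syslog prefix (Jan 1 23:59:59 ip-172-70-12-5) that the pattern does not account for is the most likely culprit, and that would produce exactly the all-empty preview described above.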

How does Hive handle regular expressions with < and > symbols?

This is Siva Ramanjaneyulu; I am working on Hive. I have got the following problem with Hive:
sample.log: <ABC>
CREATE TABLE sample4( num1 STRING ) ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH
SERDEPROPERTIES ( "input.regex" = "<.*>", "output.format.string" =
"%1$s" ) STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH "../hive-0.9.0/sample.log" INTO TABLE sample4;
select * from sample4;
NULL
Expected output: ABC
Why does RegexSerDe not work with the regular expression <.*>?
How is it possible to remove the < and > symbols using a regular expression? Can you please provide a solution for this?
Try this:
hive> CREATE TABLE s(num1 STRING) ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH
SERDEPROPERTIES ( "input.regex" = "(<.*>)",
"output.format.string" = "%1$s" ) STORED AS TEXTFILE;
Mind the parentheses around the regex.
You are getting a NULL value because you didn't include parentheses in the regex definition. If you don't want the angle brackets to be included in the output, you need to put them outside the parentheses: the part inside the parentheses is what gets returned as the output.
CREATE TABLE sample4 (num1 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "<(.*)>"
, "output.format.string" = '%1$s'
)
STORED AS TEXTFILE;
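Purely as an illustration of the capture-group behaviour the answers above discuss (a local Python check, not a Hive command):
import re

line = "<ABC>"

# No capture group: the whole match, angle brackets included.
print(re.fullmatch(r'<.*>', line).group(0))    # <ABC>

# Parentheses around everything: the group still includes the brackets.
print(re.fullmatch(r'(<.*>)', line).group(1))  # <ABC>

# Brackets outside the group: only the inner text is captured,
# which is what the last CREATE TABLE above relies on.
print(re.fullmatch(r'<(.*)>', line).group(1))  # ABC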