I used a Glue Crawler to create a table on top of a folder containing a Snappy Parquet file in S3. Queries fail with "SYNTAX_ERROR: line 1:8: Column 'isfraud' cannot be resolved."
Yet when I replicate that exact table manually, the same query succeeds. I tried this with a crawler on the same underlying S3 path, and also by pointing a crawler at a copy of the same data in another path without special characters such as dashes.
SHOW CREATE TABLE ... seems to confirm that the automatically generated and manually defined schemas are identical. See below.
The same thing happens with CSV-formatted data.
Adding single quotes, double quotes, or backticks around the table name in the query (either together with the database name or separately) makes no difference; nor does prefixing the table name with the database name.
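For reference, the failing statement is of this general form (a minimal reconstruction; the exact query is not reproduced here):
SELECT isfraud
FROM mdforaugmentedparquet.augparquetsnappy
LIMIT 10;
-- fails with: SYNTAX_ERROR: line 1:8: Column 'isfraud' cannot be resolved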
How can I query a generated table?
The manually defined table where the query succeeds.
SHOW CREATE TABLE mdforaugmentedparquet.snappyparquet1;
CREATE EXTERNAL TABLE `mdforaugmentedparquet.snappyparquet1`(
`isfraud` int,
`step` int,
`hourof24` double,
`hourof24_nml` double,
`type` string,
`type_cash_out` int,
`type_transfer` int,
`amount` double,
`amount_nml` double,
`nameorig` string,
`oldbalanceorg` double,
`oldbalanceorg_nml` double,
`oldbalanceorigsign` int,
`newbalanceorig` double,
`newbalanceorig_nml` double,
`negdeltaorigin` double,
`negdeltaorigin_nml` double,
`namedest` string,
`oldbalancedest` double,
`oldbalancedest_nml` double,
`expectednewbaldest` double,
`expectednewbaldest_nml` double,
`newbalancedest` double,
`newbalancedest_nml` double,
`isflaggedfraud` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://sagemaker-819/augmented-parquet/'
TBLPROPERTIES (
'classification'='parquet')
An automatically crawled table where the query fails.
SHOW CREATE TABLE mdforaugmentedparquet.augparquetsnappy;
CREATE EXTERNAL TABLE `mdforaugmentedparquet.augparquetsnappy`(
`isfraud` int,
`step` int,
`hourof24` double,
`hourof24_nml` double,
`type` string,
`type_cash_out` int,
`type_transfer` int,
`amount` double,
`amount_nml` double,
`nameorig` string,
`oldbalanceorg` double,
`oldbalanceorg_nml` double,
`oldbalanceorigsign` int,
`newbalanceorig` double,
`newbalanceorig_nml` double,
`negdeltaorigin` double,
`negdeltaorigin_nml` double,
`namedest` string,
`oldbalancedest` double,
`oldbalancedest_nml` double,
`expectednewbaldest` double,
`expectednewbaldest_nml` double,
`newbalancedest` double,
`newbalancedest_nml` double,
`isflaggedfraud` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://sagemaker-819/augparquetsnappy/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='augmentedparquetsnappy',
'averageRecordSize'='125',
'classification'='parquet',
'compressionType'='none',
'objectCount'='1',
'recordCount'='2811841',
'sizeKey'='260257084',
'typeOfData'='file')
Another automatically crawled table where the query also fails.
SHOW CREATE TABLE mdforaugmentedparquet.augmented_parquet;
CREATE EXTERNAL TABLE `mdforaugmentedparquet.augmented_parquet`(
`isfraud` int,
`step` int,
`hourof24` double,
`hourof24_nml` double,
`type` string,
`type_cash_out` int,
`type_transfer` int,
`amount` double,
`amount_nml` double,
`nameorig` string,
`oldbalanceorg` double,
`oldbalanceorg_nml` double,
`oldbalanceorigsign` int,
`newbalanceorig` double,
`newbalanceorig_nml` double,
`negdeltaorigin` double,
`negdeltaorigin_nml` double,
`namedest` string,
`oldbalancedest` double,
`oldbalancedest_nml` double,
`expectednewbaldest` double,
`expectednewbaldest_nml` double,
`newbalancedest` double,
`newbalancedest_nml` double,
`isflaggedfraud` int)
PARTITIONED BY (
`partition_0` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://sagemaker-819/augmented-parquet/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='augmentedparquet',
'averageRecordSize'='125',
'classification'='parquet',
'compressionType'='none',
'objectCount'='1',
'recordCount'='2811841',
'sizeKey'='260257084',
'typeOfData'='file')
Here is the description of the autogenerated augparquetsnappy table (where queries fail).
DESCRIBE FORMATTED `mdforaugmentedparquet.augparquetsnappy`
# col_name data_type comment
isfraud int
step int
hourof24 double
hourof24_nml double
type string
type_cash_out int
type_transfer int
amount double
amount_nml double
nameorig string
oldbalanceorg double
oldbalanceorg_nml double
oldbalanceorigsign int
newbalanceorig double
newbalanceorig_nml double
negdeltaorigin double
negdeltaorigin_nml double
namedest string
oldbalancedest double
oldbalancedest_nml double
expectednewbaldest double
expectednewbaldest_nml double
newbalancedest double
newbalancedest_nml double
isflaggedfraud int
# Detailed Table Information
Database: mdforaugmentedparquet
Owner: owner
CreateTime: Tue Nov 17 10:55:56 UTC 2020
LastAccessTime: Tue Nov 17 10:55:55 UTC 2020
Protect Mode: None
Retention: 0
Location: s3://sagemaker-819/augparquetsnappy
Table Type: EXTERNAL_TABLE
Table Parameters:
CrawlerSchemaDeserializerVersion 1.0
CrawlerSchemaSerializerVersion 1.0
UPDATED_BY_CRAWLER augmentedparquetsnappy
averageRecordSize 125
classification parquet
compressionType none
objectCount 1
recordCount 2811841
sizeKey 260257084
typeOfData file
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
I further created a copy of the augparquetsnappy table by running the CREATE command above. Queries succeed on this copy, augparquetsnappy2. Here is the description of that table.
DESCRIBE FORMATTED `mdforaugmentedparquet.augparquetsnappy2`
# col_name data_type comment
isfraud int
step int
hourof24 double
hourof24_nml double
type string
type_cash_out int
type_transfer int
amount double
amount_nml double
nameorig string
oldbalanceorg double
oldbalanceorg_nml double
oldbalanceorigsign int
newbalanceorig double
newbalanceorig_nml double
negdeltaorigin double
negdeltaorigin_nml double
namedest string
oldbalancedest double
oldbalancedest_nml double
expectednewbaldest double
expectednewbaldest_nml double
newbalancedest double
newbalancedest_nml double
isflaggedfraud int
# Detailed Table Information
Database: mdforaugmentedparquet
Owner: hadoop
CreateTime: Thu Nov 19 18:31:47 UTC 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: s3://sagemaker-819/augmented-parquet
Table Type: EXTERNAL_TABLE
Table Parameters:
CrawlerSchemaDeserializerVersion 1.0
CrawlerSchemaSerializerVersion 1.0
EXTERNAL TRUE
UPDATED_BY_CRAWLER augmentedparquetsnappy
averageRecordSize 125
classification parquet
compressionType none
objectCount 1
recordCount 2811841
sizeKey 260257084
transient_lastDdlTime 1605810707
typeOfData file
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
The problem seems to be that the user executing the statement doesn't have permission to get the table metadata from Glue, since the table was created by a Glue Crawler.
Glue permissions can be assigned at different levels: catalog, database, and table. If the user executing the SELECT statement doesn't have permission on all tables, either explicitly or implicitly by having permission on the database, you won't be able to query certain tables; see also the IAM actions available in Glue (for example glue:GetTable and glue:GetPartitions).
This might be helpful as well: Attach a Policy to IAM Users That Access AWS Glue.
Related
I used this article to read my VPC flow logs and everything worked correctly.
https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/
But my question is that when I follow the documentation and run the CREATE TABLE statement below, querying the table does not return any records.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
`version` int,
`account_id` string,
`interface_id` string,
`srcaddr` string,
`dstaddr` string,
`srcport` int,
`dstport` int,
`protocol` bigint,
`packets` bigint,
`bytes` bigint,
`start` bigint,
`end` bigint,
`action` string,
`log_status` string,
`vpc_id` string,
`subnet_id` string,
`instance_id` string,
`tcp_flags` int,
`type` string,
`pkt_srcaddr` string,
`pkt_dstaddr` string,
`region` string,
`az_id` string,
`sublocation_type` string,
`sublocation_id` string,
`pkt_src_aws_service` string,
`pkt_dst_aws_service` string,
`flow_direction` string,
`traffic_path` int
)
PARTITIONED BY (
`aws-account-id` string,
`aws-service` string,
`aws-region` string,
`year` string,
`month` string,
`day` string,
`hour` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id={account_id}/aws-service=vpcflowlogs/aws-region={region_code}/'
TBLPROPERTIES (
'EXTERNAL'='true',
'skip.header.line.count'='1'
)
official doc:
https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html
This CREATE TABLE statement should work after changing the variables like DOC-EXAMPLE-BUCKET/prefix, account_id, and region_code. Why am I getting 0 rows returned for a SELECT * query?
You need to load the partitions manually before you can use them.
From the docs:
After you create the table, you load the data in the partitions for querying. For Hive-compatible data, you run MSCK REPAIR TABLE. For non-Hive compatible data, you use ALTER TABLE ADD PARTITION to add the partitions manually.
So if your structure is Hive-compatible, you can just run:
MSCK REPAIR TABLE `table name`;
And this will load all your new partitions.
Otherwise you'll have to load them manually using ADD PARTITION:
ALTER TABLE test ADD PARTITION (`aws-account-id`='1', `aws-service`='2', ...) LOCATION 's3://bucket/subfolder/data/accountid1/service2/'
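For the vpc_flow_logs table above, a complete statement might look roughly like this (the account ID, region, date parts, and bucket are placeholders to adapt):
ALTER TABLE vpc_flow_logs ADD PARTITION (
  `aws-account-id` = '123456789012',
  `aws-service` = 'vpcflowlogs',
  `aws-region` = 'us-east-1',
  `year` = '2021',
  `month` = '01',
  `day` = '15',
  `hour` = '00')
LOCATION 's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id=123456789012/aws-service=vpcflowlogs/aws-region=us-east-1/year=2021/month=01/day=15/hour=00/';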
Because manually adding partitions is tedious if your data structure is not Hive-compatible, I recommend you use partition projection for your table.
To avoid having to manage partitions, you can use partition projection. Partition projection is an option for highly partitioned tables whose structure is known in advance. In partition projection, partition values and locations are calculated from table properties that you configure rather than read from a metadata repository. Because the in-memory calculations are faster than remote look-up, the use of partition projection can significantly reduce query runtimes.
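As a rough sketch of what partition projection looks like in practice, here is a deliberately simplified, hypothetical table partitioned only by region and day; a real table needs projection properties for every partition column, and the bucket, region list, and date range below are placeholders:
-- Hypothetical, simplified example; adapt the column list, partitions,
-- bucket/prefix, region values, and date range to your own layout.
CREATE EXTERNAL TABLE vpc_flow_logs_projected (
  `srcaddr` string,
  `dstaddr` string,
  `bytes` bigint
)
PARTITIONED BY (
  `region` string,
  `day` string
)
STORED AS PARQUET
LOCATION 's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.region.type' = 'enum',
  'projection.region.values' = 'us-east-1,us-west-2,eu-west-1',
  'projection.day.type' = 'date',
  'projection.day.range' = '2021/01/01,NOW',
  'projection.day.format' = 'yyyy/MM/dd',
  'storage.location.template' = 's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/${region}/${day}'
);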
I am using the following DDL query to create a table:
CREATE EXTERNAL TABLE IF NOT EXISTS poi_test1(
'taxonomy_level_1' string,
'taxonomy_level_2' string,
'taxonomy_level_3' string,
'taxonomy_level_4' string,
'poi_name' string,
'mw_segment_name' string,
'latitude' double,
'longitude' double,
'city' string,
'state' string,
'country_code' string,
'default_radius' float
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mw.test/jishan1/qa1/poi1';
Error: line 1:8: no viable alternative at input 'create external' (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: 5dbd0eb8-6842-45ca-8f60-9f17fd2e4c04)
You should remove the single quotes around the column names. Column names only need quoting in special cases: enclose them in backticks if they are reserved keywords, and in double quotes if a column name starts with a digit.
Read this for the naming conventions to be used with Athena.
I ran your query as shown below, with the single quotes removed, and it created the table successfully:
CREATE EXTERNAL TABLE poi_test1(
taxonomy_level_1 string,
taxonomy_level_2 string,
taxonomy_level_3 string,
taxonomy_level_4 string,
poi_name string,
mw_segment_name string,
latitude double,
longitude double,
city string,
state string,
country_code string,
default_radius float
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://mw.test/jishan1/qa1/poi1';
I have two text files with the same structure that I extracted from SQL Server. One file is 1.5 GB while the other is 7.5 GB. I created a table in Hive and then copied these files to the corresponding GCS buckets. Now when I try to load the data into the tables, it fails for the 7.5 GB file. After running the LOAD DATA INPATH command, the 7.5 GB file in the bucket gets deleted, while the 1.5 GB file works perfectly fine. What alternative way should I try to fix this issue?
My HiveQL is below.
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable
( v_nbr int,
v_nm varchar(80),
p_nbr int,
r_nbr int,
a_account varchar(80),
a_amount decimal(13,4),
c_store int,
c_account int,
c_amount decimal(13,4),
rec_date date)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable1.5/file1.5gb.txt' OVERWRITE INTO TABLE myschema.table1.5;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable7.5/file7.5gb.txt' OVERWRITE INTO TABLE myschema.table7.5;
Note that LOAD DATA INPATH moves (rather than copies) the source file into the table's storage location, which is why the file disappears from its original path. Instead of loading the data, you can try defining the table as an external table whose LOCATION points directly at the directory containing the file:
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable
( v_nbr int,
v_nm varchar(80),
p_nbr int,
r_nbr int,
a_account varchar(80),
a_amount decimal(13,4),
c_store int,
c_account int,
c_amount decimal(13,4),
rec_date date)
row format delimited
fields terminated by ','
stored as textfile
-- LOCATION should point at the directory containing the data files, not at a single file
LOCATION 'gs://mybucket/myschema.db/mytable1.5/';
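As a quick sanity check after creating the table (a hypothetical verification query, using the table name from the DDL above):
SELECT COUNT(*) FROM myschema.mytable;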
I'm following the example AWS documentation gave for creating a CloudFront log table in Athena.
CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
`date` DATE,
time STRING,
location STRING,
bytes BIGINT,
requestip STRING,
method STRING,
host STRING,
uri STRING,
status INT,
referrer STRING,
useragent STRING,
querystring STRING,
cookie STRING,
resulttype STRING,
requestid STRING,
hostheader STRING,
requestprotocol STRING,
requestbytes BIGINT,
timetaken FLOAT,
xforwardedfor STRING,
sslprotocol STRING,
sslcipher STRING,
responseresulttype STRING,
httpversion STRING,
filestatus STRING,
encryptedfields INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://your_log_bucket/prefix/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
Creating the table with the time field as a string doesn't allow me to run conditional queries. I tried re-creating the table with the following:
CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
`date` DATE,
time timestamp,
....
Unfortunately this did not work and I received no results in the time field when I previewed the table.
Does anyone have any experience casting the time to something that I can use to query?
Concat the date and time into a timestamp in a subquery:
WITH ds AS (
  SELECT *,
    parse_datetime(
      concat(concat(format_datetime(date, 'yyyy-MM-dd'), '-'), time),
      'yyyy-MM-dd-HH:mm:ss'
    ) AS datetime
  FROM default.cloudfront_www
  WHERE requestip = '207.30.46.111'
)
SELECT *
FROM ds
WHERE datetime BETWEEN timestamp '2018-11-19 06:00:00'
                   AND timestamp '2018-11-19 12:00:00'
It's frustrating that there isn't a straightforward way to have usable timestamps (dates with times included) in a table based on CloudFront logs.
However, this is now my workaround:
I create a view based on the original table. Say my original table is cloudfront_prod_logs. I create a view, cloudfront_prod_logs_w_datetime, that has a proper datetime/timestamp field, and I use that in queries instead of the original table.
CREATE OR REPLACE VIEW cloudfront_prod_logs_w_datetime AS
SELECT
"date_parse"("concat"(CAST(date AS varchar), ' ', CAST(time AS varchar)), '%Y-%m-%d %H:%i:%s') datetime
, *
FROM
cloudfront_prod_logs
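Queries can then filter on the computed column directly, for example (the time range here is only an illustration):
SELECT *
FROM cloudfront_prod_logs_w_datetime
WHERE datetime BETWEEN timestamp '2018-11-19 06:00:00'
                   AND timestamp '2018-11-19 12:00:00'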
I have a .csv file containing data about crime incidents in Philadelphia.
I am using a Hive script on Amazon EMR to convert this data into a Hive table.
I am using the following Hive script:
CREATE EXTERNAL TABLE IF NOT EXISTS Crime(
Dc_Dist INT,
PSA INT,
Dispatch_Date_Time TIMESTAMP,
Dispatch_Date date,
Dispatch_Time STRING,
Hour INT,
Dc_Key BIGINT,
Location_Block STRING,
UCR_General INT,
Text_General_Code STRING,
Police_Districts INT,
Month STRING,
Lon STRING,
Lat STRING)
COMMENT 'Data about crime from a public database'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location 's3://dsabucket/crimeData/crime';
I run this script but I do not get a file or any data in my output folder. I am not sure whether the table was created properly or not. As I understand it, the 'STORED AS TEXTFILE' line should store this table as a text file.
To check whether the table was created or not, use DESCRIBE, i.e. DESCRIBE tableName;. Also note that CREATE EXTERNAL TABLE does not write any output files: it only registers table metadata over the data already sitting at the LOCATION, so you should not expect new files to appear in the output folder.
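For example, a minimal check against the table defined above (the SELECT assumes the data files are already under the LOCATION path):
DESCRIBE Crime;
SELECT * FROM Crime LIMIT 10;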