I'm trying to set up a table in Athena with partition projection.
My logs are in the format s3://bucket/folder/year/month/day/hour and then a json file inside that.
I have tried creating the table with partition projection as follows:
CREATE EXTERNAL TABLE `waf_logs_webacl1`(
`timestamp` bigint,
`formatversion` int,
`webaclid` string,
`terminatingruleid` string,
`terminatingruletype` string,
`action` string,
`terminatingrulematchdetails` array<
struct<
conditiontype:string,
location:string,
matcheddata:array<string>
>
>,
`httpsourcename` string,
`httpsourceid` string,
`rulegrouplist` array<
struct<
rulegroupid:string,
terminatingrule:struct<
ruleid:string,
action:string,
rulematchdetails:string
>,
nonterminatingmatchingrules:array<
struct<
ruleid:string,
action:string,
rulematchdetails:array<
struct<
conditiontype:string,
location:string,
matcheddata:array<string>
>
>
>
>,
excludedrules:array<
struct<
ruleid:string,
exclusiontype:string
>
>
>
>,
`ratebasedrulelist` array<
struct<
ratebasedruleid:string,
limitkey:string,
maxrateallowed:int
>
>,
`nonterminatingmatchingrules` array<
struct<
ruleid:string,
action:string
>
>,
`requestheadersinserted` string,
`responsecodesent` string,
`httprequest` struct<
clientip:string,
country:string,
headers:array<
struct<
name:string,
value:string
>
>,
uri:string,
args:string,
httpversion:string,
httpmethod:string,
requestid:string
>,
`labels` array<
struct<
name:string
>
>
)
PARTITIONED BY
(
day STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket/folder/'
TBLPROPERTIES
(
"projection.enabled" = "true",
"projection.day.type" = "date",
"projection.day.range" = "2021/01/01,NOW",
"projection.day.format" = "yyyy/MM/dd/HH",
"projection.day.interval" = "1",
"projection.day.interval.unit" = "YEARS",
"storage.location.template" = "s3://bucket/folder/${year}/${month}/${day}/${hour}/"
)
The table is created successfully, but when I load all the partitions it gives me the error:
Partitions not in metastore: waf_logs_webacl1:2021/05/16/23 waf_logs_webacl1:2021/05/17/00 waf_logs_webacl1:2021/05/17/01 waf_logs_webacl1:2021/05/17/02 waf_logs_webacl1:2021/05/17/03 etc
I have also tried setting storage.location.template to s3://bucket/folder/ and to s3://bucket/folder/${year}/, and I get the same error when loading partitions. Please help, thanks.
When you use partition projection you don't need to load partitions; they are found at query execution time.
The problem with your table is that it has one partition key, day, but the storage location template tells Athena that the data is stored in a directory structure containing /${year}/${month}/${day}/${hour}/, i.e. four partition keys.
Either create the table with all four partition keys and configure partition projection for each of them (e.g. projection.year.type, etc.), or remove the undefined keys from the storage location template.
I think the right course of action is the former, since that's how the data is organized. There is an example in the Athena docs that you should be able to use as a starting point here: https://docs.aws.amazon.com/athena/latest/ug/partition-projection-kinesis-firehose-example.html (edit: this page has been removed from the documentation and there is no updated version)
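As a rough, untested sketch of the first option, the table could declare all four partition keys and project each one as an integer. The column list is shortened to three fields from the question for brevity. The int partition column types, the upper bound of the year range, and the use of the digits property to zero-pad month, day and hour are assumptions chosen to match paths such as 2021/05/17/01; adjust them to your data. The `--` lines are explanatory only and can be dropped.

```sql
-- Sketch: four partition keys, each resolved by partition projection.
CREATE EXTERNAL TABLE `waf_logs_webacl1` (
  `timestamp` bigint,
  `webaclid` string,
  `action` string
)
PARTITIONED BY (
  `year` int,
  `month` int,
  `day` int,
  `hour` int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/folder/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.year.type' = 'integer',
  'projection.year.range' = '2021,2031',
  'projection.month.type' = 'integer',
  'projection.month.range' = '1,12',
  'projection.month.digits' = '2',
  'projection.day.type' = 'integer',
  'projection.day.range' = '1,31',
  'projection.day.digits' = '2',
  'projection.hour.type' = 'integer',
  'projection.hour.range' = '0,23',
  'projection.hour.digits' = '2',
  'storage.location.template' = 's3://bucket/folder/${year}/${month}/${day}/${hour}/'
);

-- Run separately: no MSCK REPAIR TABLE is needed, just filter on the keys.
SELECT "action", count(*) AS requests
FROM waf_logs_webacl1
WHERE "year" = 2021 AND "month" = 5 AND "day" = 17
GROUP BY "action";
```

Every placeholder in storage.location.template now corresponds to a declared partition key, which is the mismatch the original table had.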
I'm getting the following error:
line 1:8: mismatched input 'EXTERNAL'. Expecting: 'OR', 'SCHEMA', 'TABLE', 'VIEW'
when creating an Athena table with the following command:
CREATE EXTERNAL TABLE IF NOT EXISTS 'abcd_123' (Item:struct<Id:struct<S:string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://mybucket'
I've gone through other Q&As and none of the answers have helped me. Any pointers as to where the error might be?
Try putting a space between Item and struct instead of a colon, and quote the table name with backticks rather than single quotes (single quotes denote string literals, not identifiers), like so:
CREATE EXTERNAL TABLE IF NOT EXISTS `abcd_123` (
Item struct<
Id:struct<
S:string
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://mybucket'
I believe the colon is only required between the fields of a struct and their types, not between column names and their types. This example is taken from the AWS Athena docs:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` Date,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
...
I used this article to read my vpc flow logs and everything worked correctly.
https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/
However, when I follow the documentation and run the CREATE TABLE statement below, queries against the table do not return any records.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
`version` int,
`account_id` string,
`interface_id` string,
`srcaddr` string,
`dstaddr` string,
`srcport` int,
`dstport` int,
`protocol` bigint,
`packets` bigint,
`bytes` bigint,
`start` bigint,
`end` bigint,
`action` string,
`log_status` string,
`vpc_id` string,
`subnet_id` string,
`instance_id` string,
`tcp_flags` int,
`type` string,
`pkt_srcaddr` string,
`pkt_dstaddr` string,
`region` string,
`az_id` string,
`sublocation_type` string,
`sublocation_id` string,
`pkt_src_aws_service` string,
`pkt_dst_aws_service` string,
`flow_direction` string,
`traffic_path` int
)
PARTITIONED BY (
`aws-account-id` string,
`aws-service` string,
`aws-region` string,
`year` string,
`month` string,
`day` string,
`hour` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id={account_id}/aws-service=vpcflowlogs/aws-region={region_code}/'
TBLPROPERTIES (
'EXTERNAL'='true',
'skip.header.line.count'='1'
)
official doc:
https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html
This create table statement should work after changing the variables like DOC-EXAMPLE-BUCKET/prefix, account_id and region_code. Why am I getting 0 rows returned for select * query?
You need to load the partitions manually before you can query them.
From the docs:
After you create the table, you load the data in the partitions for querying. For Hive-compatible data, you run MSCK REPAIR TABLE. For non-Hive compatible data, you use ALTER TABLE ADD PARTITION to add the partitions manually.
So if your structure is Hive-compatible you can just run:
MSCK REPAIR TABLE `table name`;
And this will load all your new partitions.
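A minimal sketch of that workflow for the vpc_flow_logs table from the question, assuming the S3 prefixes under the table location follow the Hive key=value layout. Athena executes one statement per query, so run each of these separately.

```sql
-- Scan the table location for key=value prefixes and register them as partitions.
MSCK REPAIR TABLE vpc_flow_logs;

-- List the partitions the metastore now knows about; an empty result
-- means nothing was registered and SELECT will still return 0 rows.
SHOW PARTITIONS vpc_flow_logs;

-- Query the table again once partitions are listed.
SELECT * FROM vpc_flow_logs LIMIT 10;
```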
Otherwise you'll have to load them manually using ADD PARTITION (partition keys that contain hyphens must be quoted with backticks):
ALTER TABLE test ADD PARTITION (`aws-account-id`='1', `aws-service`='2' ...) LOCATION 's3://bucket/subfolder/data/accountid1/service2/'
Because adding partitions manually is so tedious when your data structure is not Hive-compatible, I recommend you use partition projection for your table.
To avoid having to manage partitions, you can use partition projection. Partition projection is an option for highly partitioned tables whose structure is known in advance. In partition projection, partition values and locations are calculated from table properties that you configure rather than read from a metadata repository. Because the in-memory calculations are faster than remote look-up, the use of partition projection can significantly reduce query runtimes.
I used Glue Crawler to create a table on top of a folder with Snappy Parquet file in S3. Queries fail with "SYNTAX_ERROR: line 1.8 Column 'isfraud' cannot be resolved."
Yet when I replicate that exact table manually, the same query succeeds. I tried this with a crawler on the same underlying S3 path, and also with a crawler on a copy of the same data under another path without special characters like dashes. See image.
SHOW CREATE TABLE ... seems to confirm that automatically generated and manually generated schemas are the same. See below.
The same thing happens with CSV-formatted data.
Adding single-quote, double-quote, or backtick around the table name in the query (either with the database name or separately) does not make a difference; nor does adding the Database name to the query.
How can I query a generated table?
The manually defined table where the query succeeds.
SHOW CREATE TABLE mdforaugmentedparquet.snappyparquet1;
CREATE EXTERNAL TABLE `mdforaugmentedparquet.snappyparquet1`(
`isfraud` int,
`step` int,
`hourof24` double,
`hourof24_nml` double,
`type` string,
`type_cash_out` int,
`type_transfer` int,
`amount` double,
`amount_nml` double,
`nameorig` string,
`oldbalanceorg` double,
`oldbalanceorg_nml` double,
`oldbalanceorigsign` int,
`newbalanceorig` double,
`newbalanceorig_nml` double,
`negdeltaorigin` double,
`negdeltaorigin_nml` double,
`namedest` string,
`oldbalancedest` double,
`oldbalancedest_nml` double,
`expectednewbaldest` double,
`expectednewbaldest_nml` double,
`newbalancedest` double,
`newbalancedest_nml` double,
`isflaggedfraud` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://sagemaker-819/augmented-parquet/'
TBLPROPERTIES (
'classification'='parquet')
An automatically crawled table where the query fails.
SHOW CREATE TABLE mdforaugmentedparquet.augparquetsnappy;
CREATE EXTERNAL TABLE `mdforaugmentedparquet.augparquetsnappy`(
`isfraud` int,
`step` int,
`hourof24` double,
`hourof24_nml` double,
`type` string,
`type_cash_out` int,
`type_transfer` int,
`amount` double,
`amount_nml` double,
`nameorig` string,
`oldbalanceorg` double,
`oldbalanceorg_nml` double,
`oldbalanceorigsign` int,
`newbalanceorig` double,
`newbalanceorig_nml` double,
`negdeltaorigin` double,
`negdeltaorigin_nml` double,
`namedest` string,
`oldbalancedest` double,
`oldbalancedest_nml` double,
`expectednewbaldest` double,
`expectednewbaldest_nml` double,
`newbalancedest` double,
`newbalancedest_nml` double,
`isflaggedfraud` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://sagemaker-819/augparquetsnappy/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='augmentedparquetsnappy',
'averageRecordSize'='125',
'classification'='parquet',
'compressionType'='none',
'objectCount'='1',
'recordCount'='2811841',
'sizeKey'='260257084',
'typeOfData'='file')
Another automatically crawled table where the query also fails.
SHOW CREATE TABLE mdforaugmentedparquet.augmented_parquet;
CREATE EXTERNAL TABLE `mdforaugmentedparquet.augmented_parquet`(
`isfraud` int,
`step` int,
`hourof24` double,
`hourof24_nml` double,
`type` string,
`type_cash_out` int,
`type_transfer` int,
`amount` double,
`amount_nml` double,
`nameorig` string,
`oldbalanceorg` double,
`oldbalanceorg_nml` double,
`oldbalanceorigsign` int,
`newbalanceorig` double,
`newbalanceorig_nml` double,
`negdeltaorigin` double,
`negdeltaorigin_nml` double,
`namedest` string,
`oldbalancedest` double,
`oldbalancedest_nml` double,
`expectednewbaldest` double,
`expectednewbaldest_nml` double,
`newbalancedest` double,
`newbalancedest_nml` double,
`isflaggedfraud` int)
PARTITIONED BY (
`partition_0` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://sagemaker-819/augmented-parquet/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='augmentedparquet',
'averageRecordSize'='125',
'classification'='parquet',
'compressionType'='none',
'objectCount'='1',
'recordCount'='2811841',
'sizeKey'='260257084',
'typeOfData'='file')
Here is the description of autogenerated augparquetsnappy table (where queries fail).
DESCRIBE FORMATTED `mdforaugmentedparquet.augparquetsnappy`
# col_name data_type comment
isfraud int
step int
hourof24 double
hourof24_nml double
type string
type_cash_out int
type_transfer int
amount double
amount_nml double
nameorig string
oldbalanceorg double
oldbalanceorg_nml double
oldbalanceorigsign int
newbalanceorig double
newbalanceorig_nml double
negdeltaorigin double
negdeltaorigin_nml double
namedest string
oldbalancedest double
oldbalancedest_nml double
expectednewbaldest double
expectednewbaldest_nml double
newbalancedest double
newbalancedest_nml double
isflaggedfraud int
# Detailed Table Information
Database: mdforaugmentedparquet
Owner: owner
CreateTime: Tue Nov 17 10:55:56 UTC 2020
LastAccessTime: Tue Nov 17 10:55:55 UTC 2020
Protect Mode: None
Retention: 0
Location: s3://sagemaker-819/augparquetsnappy
Table Type: EXTERNAL_TABLE
Table Parameters:
CrawlerSchemaDeserializerVersion 1.0
CrawlerSchemaSerializerVersion 1.0
UPDATED_BY_CRAWLER augmentedparquetsnappy
averageRecordSize 125
classification parquet
compressionType none
objectCount 1
recordCount 2811841
sizeKey 260257084
typeOfData file
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
I further created a copy of augparquetsnappy table, by running the CREATE command above. Queries succeed on this copy, augparquetsnappy2. Here is the description of that table.
DESCRIBE FORMATTED `mdforaugmentedparquet.augparquetsnappy2`
# col_name data_type comment
isfraud int
step int
hourof24 double
hourof24_nml double
type string
type_cash_out int
type_transfer int
amount double
amount_nml double
nameorig string
oldbalanceorg double
oldbalanceorg_nml double
oldbalanceorigsign int
newbalanceorig double
newbalanceorig_nml double
negdeltaorigin double
negdeltaorigin_nml double
namedest string
oldbalancedest double
oldbalancedest_nml double
expectednewbaldest double
expectednewbaldest_nml double
newbalancedest double
newbalancedest_nml double
isflaggedfraud int
# Detailed Table Information
Database: mdforaugmentedparquet
Owner: hadoop
CreateTime: Thu Nov 19 18:31:47 UTC 2020
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: s3://sagemaker-819/augmented-parquet
Table Type: EXTERNAL_TABLE
Table Parameters:
CrawlerSchemaDeserializerVersion 1.0
CrawlerSchemaSerializerVersion 1.0
EXTERNAL TRUE
UPDATED_BY_CRAWLER augmentedparquetsnappy
averageRecordSize 125
classification parquet
compressionType none
objectCount 1
recordCount 2811841
sizeKey 260257084
transient_lastDdlTime 1605810707
typeOfData file
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
The problem seems to be that the user executing the statement doesn't have permission to get the table metadata from Glue, since the table was created by the Glue crawler.
Glue permissions can be assigned at different levels: catalog, database, and table. If the user executing the SELECT statement doesn't have permission on all tables, either explicitly or implicitly through a permission on the database, they won't be able to query certain tables. See also the IAM actions available in Glue.
This might be helpful as well: Attach a Policy to IAM Users That Access AWS Glue
I have two text files with the same structure that I extracted from SQL Server. One file is 1.5 GB and the other is 7.5 GB. I created tables in Hive and then copied these files to the corresponding GCS buckets. When I try to load the data into the tables, it fails for the 7.5 GB file: after running the LOAD DATA INPATH command, my 7.5 GB file in the bucket gets deleted. For the 1.5 GB file it works perfectly fine. What alternative should I try to fix this issue?
My Hive QL is as below.
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable
( v_nbr int,
v_nm varchar(80),
p_nbr int,
r_nbr int,
a_account varchar(80),
a_amount decimal(13,4),
c_store int,
c_account int,
c_amount decimal(13,4),
rec_date date)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable1.5/file1.5gb.txt' OVERWRITE INTO TABLE myschema.table1.5;
LOAD DATA INPATH 'gs://mybucket/myschema.db/mytable7.5/file7.5gb.txt' OVERWRITE INTO TABLE myschema.table7.5;
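LOAD DATA INPATH moves the source file into the table's storage location rather than copying it, which is why the file disappears from the bucket after the command runs. Instead of loading, you can try pointing the external table at the directory that already contains the file. Note that LOCATION must be a directory, not a file:
CREATE EXTERNAL TABLE IF NOT EXISTS myschema.mytable
( v_nbr int,
v_nm varchar(80),
p_nbr int,
r_nbr int,
a_account varchar(80),
a_amount decimal(13,4),
c_store int,
c_account int,
c_amount decimal(13,4),
rec_date date)
row format delimited
fields terminated by ','
stored as textfile
LOCATION 'gs://mybucket/myschema.db/mytable1.5/';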
Hi, I have currently created a table schema in AWS Athena as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
`clientcomputername` string,
`intjnltblrecid` bigint,
`processingstate` string,
`sessionid` int,
`sessionlogindatetime` string,
`sessionlogindatetimetzid` bigint,
`recidoriginal` bigint,
`modifieddatetime` string,
`modifiedby` string,
`createddatetime` string,
`createdby` string,
`dataareaid` string,
`recversion` int,
`partition` bigint,
`recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');
But the value of one of the fields (processingstate) contains commas, e.g. "Europe, Middle East, & Africa", which shifts the column order.
So what would be the best way to read this file? Thanks
When I removed this part
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
I was able to read quoted text with commas in it
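For reference, a sketch of the same table from the question with the SERDEPROPERTIES block left out, so that OpenCSVSerde falls back to its documented defaults (comma separator, double-quote quote character, backslash escape character). Everything else is copied from the question; the final query is a hypothetical sanity check, not something from the original post.

```sql
-- Same columns as in the question; no WITH SERDEPROPERTIES clause,
-- so the SerDe defaults handle quoted fields that contain commas.
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
  `clientcomputername` string,
  `intjnltblrecid` bigint,
  `processingstate` string,
  `sessionid` int,
  `sessionlogindatetime` string,
  `sessionlogindatetimetzid` bigint,
  `recidoriginal` bigint,
  `modifieddatetime` string,
  `modifiedby` string,
  `createddatetime` string,
  `createdby` string,
  `dataareaid` string,
  `recversion` int,
  `partition` bigint,
  `recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');

-- Run separately: rows whose processingstate contains a comma should
-- still show sessionid in its own column if the quoting was honored.
SELECT processingstate, sessionid
FROM axlargetable.AEGIntJnlActivityLogStaging
WHERE processingstate LIKE '%,%'
LIMIT 10;
```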
As a workaround, look at AWS Glue.
Instead of creating the table via CREATE EXTERNAL TABLE:
Invoke get-table for your table.
Then build the JSON for create-table.
Merge in the following StorageDescriptor part:
{
"StorageDescriptor": {
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde"
...
}
...
}
Perform the create via the AWS CLI. You will get the table in AWS Glue, and Athena will be able to select the correct columns.
Notes
If your table already defines OpenCSVSerde, this issue may have been fixed and you can simply recreate the table.
I do not have much knowledge of Athena, but in AWS Glue you can delete or create a table without any data loss.
Before adding this table via create-table, first check how Glue and/or Athena handle duplicate tables.
This is a common messy CSV file situation where certain values contain commas. The solution in Athena for this is to use SERDEPROPERTIES as described in the AWS doc https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html [the url may change so just search for 'OpenCSVSerDe for Processing']
The following is the basic CREATE TABLE example provided there. Based on your data, you have to make sure that the data types are specified correctly (e.g. string).
CREATE EXTERNAL TABLE test1 (
f1 string,
s2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\")
LOCATION 's3://user-test-region/dataset/test1/'