I'm creating a table in Athena based on a list of CSV files in an S3 bucket. The files in the bucket are organized in folders like this:
$ aws s3 ls s3://bucket-name/ --recursive
2023-01-23 16:05:01 25601 logs2023/01/23/23/analytics_Log-1-2023-01-23-23-59-59-6dc5bd4c-f00f-4f34-9292-7bfa9ec33c55
2023-01-23 16:10:03 18182 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-05-01-aa2cb565-05c8-43e2-a203-96324f66a5a7
2023-01-23 16:15:05 20350 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-10-03-87b03989-c059-4fca-8e8b-909e787db889
2023-01-23 16:20:09 25187 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-15-06-6d9b39fb-c05f-4416-9b17-415f48e63591
2023-01-23 16:25:18 20590 logs2023/01/24/00/analytics_Log-1-2023-01-24-00-20-16-3939a0fe-8cfb-4168-bc8e-e71d2122add5
This is the format for the folder structure:
logs{year}/{month}/{day}/{hour}/<filename>
I would like to use Athena's partition projection and this is how I'm creating my table:
CREATE EXTERNAL TABLE analytics.logs (
id string,
...
type tinyint)
PARTITIONED BY (
year bigint COMMENT '',
month string COMMENT '',
day string COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-name/'
TBLPROPERTIES (
'classification'='csv',
'partition.day.values'='01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31',
'partition.day.type'='enum',
'partition.enable'='true',
'partition.month.values'='01,02,03,04,05,06,07,08,09,10,11,12',
'partition.month.type'='enum',
'partition.year.range'='2022,2100',
'partition.year.type'='integer',
'storage.location.template'='s3://bucket-name/logs${year}/${month}/${day}/')
As you can see, I'm trying to partition the data by year, month, and day. Even though there's also an hour folder, I'm not interested in that. The statement executes just fine and creates the table. But when I query the table:
SELECT * FROM analytics.logs LIMIT 10;
It returns nothing. But if I create the same table without the PARTITIONED BY clause, I can see the records. Can someone please help me understand what I'm doing wrong?
[UPDATE]
I simplified the folder structure to see if it works. It does not.
$ aws s3 ls s3://bucket-name/test --recursive
2023-01-24 07:03:30 0 test/
2023-01-24 07:03:59 0 test/2022/
2023-01-24 07:11:06 13889 test/2022/Log-1-2022-12-01-00-00-11-255f8d74-5417-42a0-8c09-97282a626903
2023-01-24 07:11:05 8208 test/2022/Log-1-2022-12-01-00-05-15-c34eda24-36d8-484c-b7b6-4861c297d857
CREATE EXTERNAL TABLE `log_2`(
`id` string,
...
`type` tinyint)
PARTITIONED BY (
`year` bigint COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-name/test'
TBLPROPERTIES (
'classification'='csv',
'partition.enable'='true',
'partition.year.range'='2021,2023',
'partition.year.type'='integer',
'storage.location.template'='s3://bucket-name/test/${year}/')
And still the following query returns nothing:
SELECT * FROM "analytics"."log_2" where year = 2022 limit 10;
You have a mismatch in data types: your year partition column is bigint while the projection type is integer. Make both integers.
"projection.enabled" = "true",
"projection.datehour.type" = "date",
"projection.datehour.format" = "yyyy/MM/dd/HH",
"projection.datehour.range" = "2021/01/01/00,NOW",
"projection.datehour.interval" = "1",
"projection.datehour.interval.unit" = "HOURS",
Change the partition word to projection.
For anyone else who might make my mistake: the problem is that I was (incorrectly) using partition in the TBLPROPERTIES section, while it should have been projection.
To provide a working example:
CREATE EXTERNAL TABLE `log_2`(
id string,
...
type tinyint)
PARTITIONED BY (
`year` bigint COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-name/test'
TBLPROPERTIES (
'classification'='csv',
'projection.enabled'='true',
'projection.year.range'='2021,2023',
'projection.year.type'='integer',
'storage.location.template'='s3://bucket-name/test/${year}/')
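For reference, applying the same rename to the original year/month/day layout would give something like this (same ranges and template as in the question, with only the first and last columns shown; a sketch, not tested against the real bucket):
CREATE EXTERNAL TABLE analytics.logs (
id string,
type tinyint)
PARTITIONED BY (
year bigint COMMENT '',
month string COMMENT '',
day string COMMENT '')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket-name/'
TBLPROPERTIES (
'classification'='csv',
'projection.enabled'='true',
'projection.year.type'='integer',
'projection.year.range'='2022,2100',
'projection.month.type'='enum',
'projection.month.values'='01,02,03,04,05,06,07,08,09,10,11,12',
'projection.day.type'='enum',
'projection.day.values'='01,02,03,04,05,06,07,08,09,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31',
'storage.location.template'='s3://bucket-name/logs${year}/${month}/${day}/')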
Related
I have done a lab question that requires me to find a string of text in some logs, which I did the long way (Excel) instead of the intelligent way (having the data populated nicely in a table in Athena). I have multiple files within an S3 folder that share a JSON format like this...
{"Records":[{"eventVersion":"1.05","userIdentity":{"type":"AssumedRole","principalId":"ARXXXXXXXXXXXXXXXXXFVEC:AWSConfig-Describe","arn":"arn:aws:sts::2CCCCCCCCC8:assumed-role/AWSServiceRoleForConfig/AWSConfig-Describe","accountId":"2CCCCCCCCC8","accessKeyId":"ASXXXXXXXXXXXXXXXQVL","sessionContext":{"sessionIssuer":{"type":"Role","principalId":"ARXXXXXXXXXXXXXXXXXFVEC","arn":"arn:aws:iam::2CCCCCCCCC8:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig","accountId":"2CCCCCCCCC8","userName":"AWSServiceRoleForConfig"},"attributes":{"creationDate":"2019-09-03T07:40:00Z","mfaAuthenticated":"false"}},"invokedBy":"AWS Internal"},"eventTime":"2019-09-03T07:40:00Z","eventSource":"s3.amazonaws.com","eventName":"HeadBucket","awsRegion":"us-west-2","sourceIPAddress":"172.18.87.252","userAgent":"[AWSConfig]","requestParameters":{"bucketName":"service_logs_10l51wolgib72","Host":"s3.us-west-2.amazonaws.com"},"responseElements":null,"additionalEventData":{"SignatureVersion":"SigV4","CipherSuite":"ECDHE-RSA-AES128-SHA","bytesTransferredIn":0.0,"AuthenticationMethod":"AuthHeader","x-amz-id-2":"JYEwSk6jEv2rB/MjwluNXcnKxRSo72GCOz8WP9OYXDFI2FxS1T81K7excoDuo36rJIQz9MWYKEE=","bytesTransferredOut":0.0},"requestID":"E224F90BD7370007","eventID":"77d7ea03-b8a2-4b50-8f81-b8217eacf008","readOnly":true,"resources":[{"type":"AWS::S3::Object","ARNPrefix":"arn:aws:s3:::service_logs_10l51wolgib72/"},{"accountId":"2CCCCCCCCC8","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::service_logs_10l51wolgib72"}],"eventType":"AwsApiCall","recipientAccountId":"2CCCCCCCCC8","vpcEndpointId":"vpce-3c0ee766"},{"eventVersion":"1.05","userIdentity":{"type":"AssumedRole","principalId":"ARXXXXXXXXXXXXXXXXXFVEC:AWSConfig-Describe","arn":"arn:aws:sts::2CCCCCCCCC8:assumed-role/AWSServiceRoleForConfig/AWSConfig-Describe","accountId":"2CCCCCCCCC8","accessKeyId":"ASXXXXXXXXXXXXXXXQVL","sessionContext":{"sessionIssuer":{"type":"Role","principalId":"ARXXXXXXXXXXXXXXXXXFVEC","arn":"arn:aws:iam::2CCCCCCCCC8:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig","accountId":"2CCCCCCCCC8","userName":"AWSServiceRoleForConfig"},"attributes":{"creationDate":"2019-09-03T07:40:00Z","mfaAuthenticated":"false"}},"invokedBy":"AWS 
Internal"},"eventTime":"2019-09-03T07:40:00Z","eventSource":"s3.amazonaws.com","eventName":"HeadBucket","awsRegion":"us-west-2","sourceIPAddress":"172.18.87.252","userAgent":"[AWSConfig]","requestParameters":{"bucketName":"service_logs_10l51wolgib72","Host":"s3.us-west-2.amazonaws.com"},"responseElements":null,"additionalEventData":{"SignatureVersion":"SigV4","CipherSuite":"ECDHE-RSA-AES128-SHA","bytesTransferredIn":0.0,"AuthenticationMethod":"AuthHeader","x-amz-id-2":"JYEwSk6jEv2rB/MjwluNXcnKxRSo72GCOz8WP9OYXDFI2FxS1T81K7excoDuo36rJIQz9MWYKEE=","bytesTransferredOut":0.0},"requestID":"E224F90BD7370020","eventID":"77d7ea03-b8a2-4b50-8f81-b8217eacf021","readOnly":true,"resources":[{"type":"AWS::S3::Object","ARNPrefix":"arn:aws:s3:::service_logs_10l51wolgib72/"},{"accountId":"2CCCCCCCCC8","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::service_logs_10l51wolgib72"}],"eventType":"AwsApiCall","recipientAccountId":"2CCCCCCCCC8","vpcEndpointId":"vpce-3c0ee766"},{"eventVersion":"1.05","userIdentity":{"type":"AssumedRole","principalId":"ARXXXXXXXXXXXXXXXXXFVEC:AWSConfig-Describe","arn":"arn:aws:sts::2CCCCCCCCC8:assumed-role/AWSServiceRoleForConfig/AWSConfig-Describe","accountId":"2CCCCCCCCC8","accessKeyId":"ASXXXXXXXXXXXXXXXQVL","sessionContext":{"sessionIssuer":{"type":"Role","principalId":"ARXXXXXXXXXXXXXXXXXFVEC","arn":"arn:aws:iam::2CCCCCCCCC8:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig","accountId":"2CCCCCCCCC8","userName":"AWSServiceRoleForConfig"},"attributes":{"creationDate":"2019-09-03T07:40:00Z","mfaAuthenticated":"false"}},"invokedBy":"AWS Internal"},"eventTime":"2019-09-03T07:40:00Z","eventSource":"s3.amazonaws.com","eventName":"HeadBucket","awsRegion":"us-west-2","sourceIPAddress":"172.18.87.252","userAgent":"[AWSConfig]","requestParameters":{"bucketName":"service_logs_10l51wolgib72","Host":"s3.us-west-2.amazonaws.com"},"responseElements":null,"additionalEventData":{"SignatureVersion":"SigV4","CipherSuite":"ECDHE-RSA-AES128-SHA","bytesTransferredIn":0.0,"AuthenticationMethod":"AuthHeader","x-amz-id-2":"JYEwSk6jEv2rB/MjwluNXcnKxRSo72GCOz8WP9OYXDFI2FxS1T81K7excoDuo36rJIQz9MWYKEE=","bytesTransferredOut":0.0},"requestID":"E224F90BD7370033","eventID":"77d7ea03-b8a2-4b50-8f81-b8217eacf034","readOnly":true,"resources":[{"type":"AWS::S3::Object","ARNPrefix":"arn:aws:s3:::service_logs_10l51wolgib72/"},{"accountId":"2CCCCCCCCC8","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::service_logs_10l51wolgib72"}],"eventType":"AwsApiCall","recipientAccountId":"2CCCCCCCCC8","vpcEndpointId":"vpce-3c0ee766"}]}
I am importing them (the whole directory, with the files sharing the same format) into Athena using this code from an online reference, but no matter how I adjust it, I still get a table with a single column...
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.testjson4 (
Records struct<eventVersion:string,
userIdentity:string,
eventTime:string,
eventSource:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json' = 'FALSE',
'dots.in.keys' = 'FALSE',
'case.insensitive' = 'TRUE',
'mapping' = 'TRUE'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://******************/CloudTrail/us-east-1/****/'
TBLPROPERTIES ('classification' = 'json');
I would like some help on how to do this correctly, thank you!
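For what it's worth, CloudTrail-style files wrap every event inside a top-level Records array, so one common pattern is to declare Records as an array of structs and unnest it at query time. A minimal sketch (only a handful of event fields, a hypothetical table name, and a placeholder bucket path because the original location is redacted):
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.testjson_records (
records array<struct<eventversion:string, eventtime:string, eventsource:string, eventname:string, awsregion:string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/CloudTrail/us-east-1/'
TBLPROPERTIES ('classification' = 'json');

-- each element of the Records array becomes one row
SELECT r.eventtime, r.eventsource, r.eventname
FROM "default".testjson_records
CROSS JOIN UNNEST(records) AS t(r);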
I have set up Firehose to deliver GZIP JSON files to S3 with a yyyy/MM/dd/HH/ path template:
s3://bucket-name/events/2022/11/08/16/file-2022-11-08-16-47-41-xxx.gz
If I create an external table in Athena with NO projection:
CREATE EXTERNAL TABLE `noproj` (
`ts` bigint,
`url` string,
`useragent` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket-name/events/'
TBLPROPERTIES (
'classification'='json',
'compressionType'='gzip',
'typeOfData'='file'
)
Everything works fine and I get results with SELECT * FROM noproj 👍
But if I create a partition projection:
CREATE EXTERNAL TABLE `proj` (
`ts` bigint,
`url` string,
`useragent` string
)
PARTITIONED BY (
`date_created` string,
`hour` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucket-name/events/'
TBLPROPERTIES (
'classification'='json',
'compressionType'='gzip',
'typeOfData'='file',
'projection.enabled' = 'true',
'projection.date_created.type' = 'date',
'projection.date_created.format' = 'yyyy/MM/dd',
'projection.date_created.interval' = '1',
'projection.date_created.interval.unit' = 'DAYS',
'projection.date_created.range' = '2022/01/01, NOW',
'projection.hour.type' = 'integer',
'projection.hour.range' = '0,23',
'projection.hour.digits' = '2',
'storage.location.template'='s3://bucket-name/events/${date_created}/${hour}/'
)
I get NO results at all 😭 I tried the following queries:
SELECT * FROM proj
SELECT * FROM proj WHERE date_created = '2022/11/08'
SELECT * FROM proj WHERE date_created = '2022/11/08' AND hour = '16'
SELECT * FROM proj WHERE date_created >= date('2022-01-01')
Note that the last query throws an error:
SYNTAX_ERROR: line 1:41: '>=' cannot be applied to varchar, date
Probably related to Athena Partition Projection for Date column vs. String
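That error is just because date_created is declared as a string partition column, so it has to be compared with string literals in the projection's yyyy/MM/dd format rather than with date values, e.g.:
SELECT * FROM proj
WHERE date_created >= '2022/01/01'
LIMIT 10;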
I tried the following but with no success:
https://stackoverflow.com/a/64220659/1480391
Athena with partition projection returns no results
Edit: I tried with a simpler projection and it also doesn't work:
CREATE EXTERNAL TABLE `year` (
`ts` bigint,
`url` string,
`useragent` string
)
PARTITIONED BY (
`year` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket-name/events/'
TBLPROPERTIES (
'classification'='json',
'compressionType'='gzip',
'projection.year.type' = 'integer',
'projection.year.range' = '2022,2023',
'projection.enabled' = 'true',
'storage.location.template'='s3://bucket-name/events/${year}/'
)
Edit 2: I also tried with no storage.location.template and paths like
s3://bucket-name/events/year=2022/month=11/day=08/hour=16/file-2022-11-08-16-47-41-xxx.gz
Edit 3: I also tried with Parquet format and get the exact same behaviour: results with NO partitioning, no results with a projected partition.
OK, I found the problem 🙂🔫
I'm using CloudFormation with YAML to set up AWS::KinesisFirehose::DeliveryStream.
I was using the >- multiline string notation:
Prefix: >-
events/
!{partitionKeyFromQuery:year}/
!{partitionKeyFromQuery:month}/
!{partitionKeyFromQuery:day}/
!{partitionKeyFromQuery:hour}/
But I forgot that the folded string gets joined with space characters!! And I didn't notice the spaces in the S3 console, because a space is hard to spot.
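So the Prefix value presumably came out looking something like this (note the spaces):
Prefix: events/ !{partitionKeyFromQuery:year}/ !{partitionKeyFromQuery:month}/ !{partitionKeyFromQuery:day}/ !{partitionKeyFromQuery:hour}/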
This is the fix:
Prefix: events/!{partitionKeyFromQuery:year}/!{partitionKeyFromQuery:month}/!{partitionKeyFromQuery:day}/!{partitionKeyFromQuery:hour}/
If this helps someone in the future, please let me know that I have not suffered for nothing 🙏
Thank you, fellow developers, for your time.
I am trying to load files from S3 into Athena to perform a query operation, but all of the column values are getting added to the first column.
I have a file in the following format:
id,user_id,personal_id,created_at,updated_at,active
34,34,43,31:28.4,27:07.9,TRUE
Table creation query:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES (
'transient_lastDdlTime'='1665356861')
Can someone please tell me where I am going wrong?
You should add skip.header.line.count to your table properties to skip the header row. Because you have defined all columns with the string data type, Athena cannot differentiate the header from a data row.
DDL with the property added:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')
The SerDe needs some parameters to recognize CSV files, such as:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
See: LazySimpleSerDe for CSV, TSV, and custom-delimited files - Amazon Athena
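Putting the two answers together, a sketch of the DDL with both the delimiter settings and the header-skip property (same columns and location as above) would be:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')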
An alternative method is to use AWS Glue to create the tables for you. In the AWS Glue console, you can create a Crawler and point it to your data. When you run the crawler, it will automatically create a table definition in Amazon Athena that matches the supplied data files.
I'm trying to create an external table in Athena. The problem is that the S3 bucket has different files in the same folder, so I can't use the folder as the location.
I can't modify the paths of the S3 files, but I have a CSV manifest. I tried to use it as the location, but Athena didn't allow me to do that.
CREATE EXTERNAL TABLE `my_DB`.`my_external_table`(
column1 string,
column2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mys3bucket/tables/my_table.csvmanifest'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
Any ideas on how to use my manifest? Or another way to solve this without Athena? The goal of using Athena was to avoid pulling all the data from the CSVs, since I only need a few records.
You'll need to make a couple of changes to your CREATE TABLE statement:
Use 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' as your INPUTFORMAT
Ensure your LOCATION points to a folder, not to the manifest file itself
So your statement would look like:
CREATE EXTERNAL TABLE `my_DB`.`my_external_table`(
column1 string,
column2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mys3bucket/tables/my_table/'
And s3://mys3bucket/tables/my_table/ would contain a single file with the S3 paths of the CSV files you want to query, one path per line. I'm unsure whether the skip.header.line.count setting will operate on the manifest file itself or on the CSV files, so you will have to test.
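For illustration, the manifest placed in that folder is just a plain-text file listing the objects to read, one per line (hypothetical paths):
s3://mys3bucket/some/prefix/file-001.csv
s3://mys3bucket/other/prefix/file-002.csv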
Alternatively, if you have a limited number of files, you could use S3 Select to query for specific columns in those files, one at a time. Using the AWS CLI, the command to extract the 2nd column would look something like:
aws s3api select-object-content \
--bucket mys3bucket \
--key path/to/your.csv.gz \
--expression "select _2 from s3object limit 100" \
--expression-type SQL \
--input-serialization '{"CSV": {}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV":{}}' \
sample.csv
(Disclaimer: AWS employee)
When I query my files from the Data Catalog using Athena, all the data appears wrapped in quotes. Is it possible to remove those quotes?
I tried adding the quoteChar option in the table settings, but it didn't help.
UPDATE
As requested, the DDL:
CREATE EXTERNAL TABLE `holidays`(
`id` bigint,
`start` string,
`end` string,
`createdat` string,
`updatedat` string,
`deletedat` string,
`type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES (
'quoteChar'='\"')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://pinfare-glue/holidays/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-holidays',
'averageRecordSize'='84',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'recordCount'='29',
'sizeKey'='2494',
'skip.header.line.count'='1',
'typeOfData'='file')
I know it's late, but I think the issue is with the "Serde serialization lib".
In AWS Glue --> click on the table --> Edit table --> check "Serde serialization lib"
Its value should be "org.apache.hadoop.hive.serde2.OpenCSVSerde".
Then click Apply.
This should solve your issue.
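If you would rather recreate the table than edit it in the Glue console, here is a sketch of the equivalent DDL (same columns and location as the question; note that OpenCSVSerDe reads every value as a string, so the columns are declared as string here):
CREATE EXTERNAL TABLE `holidays`(
`id` string,
`start` string,
`end` string,
`createdat` string,
`updatedat` string,
`deletedat` string,
`type` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar'=',',
'quoteChar'='\"')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://pinfare-glue/holidays/'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1')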