Need to skip CSV header when reading from s3 - amazon-athena

When I'm trying to load a CSV file from S3, the header row is being read into the columns as data. I tried to skip the header with
TBLPROPERTIES (
"skip.header.line.count"="1")
but it's still no use.
Any advice, please?
CREATE EXTERNAL TABLE skipheader(
permalink string,
company string,
numemps bigint,
category string,
city string,
state string,
fundeddate string,
raisedamt bigint,
raisedcurrency string,
round string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/filename/'
TBLPROPERTIES (
"skip.header.line.count"="1")

Looking at the release notes from when the feature was released, it says:
Support for ignoring headers. You can use the skip.header.line.count property when defining tables, to allow Athena to ignore headers. This is currently supported for queries that use the OpenCSV SerDe, and not for Grok or Regex SerDes.
My interpretation of this is that it won't work with LazySimpleSerDe, which is what you get when you say ROW FORMAT DELIMITED, and that you have to use the OpenCSV SerDe:
CREATE EXTERNAL TABLE skipheader ( … )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
STORED AS TEXTFILE
LOCATION 's3://bucketname/filename/'
TBLPROPERTIES ("skip.header.line.count"="1")
The OpenCSV SerDe works differently from LazySimpleSerDe: it has much more limited data type support, but on the other hand it is more configurable.

If you can use the OpenCSV SerDe and make it work for you as described by Theo, go for it. However, if you have other tables in other formats, you can get around the problem in the following way, even though it is a bit of a hack: simply add a WHERE clause that excludes the header row, like
SELECT * FROM skipheader WHERE permalink != 'permalink'. Recently, Athena added the ability to create a new table as the result of a query (https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html), so you could even filter out the headers and save the result to a new location using Athena, if that works better for you.
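For example, a minimal CTAS sketch that writes a header-free copy of the data to a new location (the target table name and S3 path here are made up for illustration):
-- Hypothetical CTAS: filters out the header row and stores the result
-- as Parquet at a new (made-up) S3 location.
CREATE TABLE skipheader_clean
WITH (
  format = 'PARQUET',
  external_location = 's3://bucketname/skipheader_clean/'
) AS
SELECT *
FROM skipheader
WHERE permalink != 'permalink';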

Related

Create Athena table with Json

I have done a lab question that requires me to find a string of text in logs, which I did the long way (Excel) instead of the intelligent way (having the data populated nicely in a table in Athena). I have multiple files within an S3 folder that have JSON content like this...
{"Records":[{"eventVersion":"1.05","userIdentity":{"type":"AssumedRole","principalId":"ARXXXXXXXXXXXXXXXXXFVEC:AWSConfig-Describe","arn":"arn:aws:sts::2CCCCCCCCC8:assumed-role/AWSServiceRoleForConfig/AWSConfig-Describe","accountId":"2CCCCCCCCC8","accessKeyId":"ASXXXXXXXXXXXXXXXQVL","sessionContext":{"sessionIssuer":{"type":"Role","principalId":"ARXXXXXXXXXXXXXXXXXFVEC","arn":"arn:aws:iam::2CCCCCCCCC8:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig","accountId":"2CCCCCCCCC8","userName":"AWSServiceRoleForConfig"},"attributes":{"creationDate":"2019-09-03T07:40:00Z","mfaAuthenticated":"false"}},"invokedBy":"AWS Internal"},"eventTime":"2019-09-03T07:40:00Z","eventSource":"s3.amazonaws.com","eventName":"HeadBucket","awsRegion":"us-west-2","sourceIPAddress":"172.18.87.252","userAgent":"[AWSConfig]","requestParameters":{"bucketName":"service_logs_10l51wolgib72","Host":"s3.us-west-2.amazonaws.com"},"responseElements":null,"additionalEventData":{"SignatureVersion":"SigV4","CipherSuite":"ECDHE-RSA-AES128-SHA","bytesTransferredIn":0.0,"AuthenticationMethod":"AuthHeader","x-amz-id-2":"JYEwSk6jEv2rB/MjwluNXcnKxRSo72GCOz8WP9OYXDFI2FxS1T81K7excoDuo36rJIQz9MWYKEE=","bytesTransferredOut":0.0},"requestID":"E224F90BD7370007","eventID":"77d7ea03-b8a2-4b50-8f81-b8217eacf008","readOnly":true,"resources":[{"type":"AWS::S3::Object","ARNPrefix":"arn:aws:s3:::service_logs_10l51wolgib72/"},{"accountId":"2CCCCCCCCC8","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::service_logs_10l51wolgib72"}],"eventType":"AwsApiCall","recipientAccountId":"2CCCCCCCCC8","vpcEndpointId":"vpce-3c0ee766"},{"eventVersion":"1.05","userIdentity":{"type":"AssumedRole","principalId":"ARXXXXXXXXXXXXXXXXXFVEC:AWSConfig-Describe","arn":"arn:aws:sts::2CCCCCCCCC8:assumed-role/AWSServiceRoleForConfig/AWSConfig-Describe","accountId":"2CCCCCCCCC8","accessKeyId":"ASXXXXXXXXXXXXXXXQVL","sessionContext":{"sessionIssuer":{"type":"Role","principalId":"ARXXXXXXXXXXXXXXXXXFVEC","arn":"arn:aws:iam::2CCCCCCCCC8:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig","accountId":"2CCCCCCCCC8","userName":"AWSServiceRoleForConfig"},"attributes":{"creationDate":"2019-09-03T07:40:00Z","mfaAuthenticated":"false"}},"invokedBy":"AWS 
Internal"},"eventTime":"2019-09-03T07:40:00Z","eventSource":"s3.amazonaws.com","eventName":"HeadBucket","awsRegion":"us-west-2","sourceIPAddress":"172.18.87.252","userAgent":"[AWSConfig]","requestParameters":{"bucketName":"service_logs_10l51wolgib72","Host":"s3.us-west-2.amazonaws.com"},"responseElements":null,"additionalEventData":{"SignatureVersion":"SigV4","CipherSuite":"ECDHE-RSA-AES128-SHA","bytesTransferredIn":0.0,"AuthenticationMethod":"AuthHeader","x-amz-id-2":"JYEwSk6jEv2rB/MjwluNXcnKxRSo72GCOz8WP9OYXDFI2FxS1T81K7excoDuo36rJIQz9MWYKEE=","bytesTransferredOut":0.0},"requestID":"E224F90BD7370020","eventID":"77d7ea03-b8a2-4b50-8f81-b8217eacf021","readOnly":true,"resources":[{"type":"AWS::S3::Object","ARNPrefix":"arn:aws:s3:::service_logs_10l51wolgib72/"},{"accountId":"2CCCCCCCCC8","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::service_logs_10l51wolgib72"}],"eventType":"AwsApiCall","recipientAccountId":"2CCCCCCCCC8","vpcEndpointId":"vpce-3c0ee766"},{"eventVersion":"1.05","userIdentity":{"type":"AssumedRole","principalId":"ARXXXXXXXXXXXXXXXXXFVEC:AWSConfig-Describe","arn":"arn:aws:sts::2CCCCCCCCC8:assumed-role/AWSServiceRoleForConfig/AWSConfig-Describe","accountId":"2CCCCCCCCC8","accessKeyId":"ASXXXXXXXXXXXXXXXQVL","sessionContext":{"sessionIssuer":{"type":"Role","principalId":"ARXXXXXXXXXXXXXXXXXFVEC","arn":"arn:aws:iam::2CCCCCCCCC8:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig","accountId":"2CCCCCCCCC8","userName":"AWSServiceRoleForConfig"},"attributes":{"creationDate":"2019-09-03T07:40:00Z","mfaAuthenticated":"false"}},"invokedBy":"AWS Internal"},"eventTime":"2019-09-03T07:40:00Z","eventSource":"s3.amazonaws.com","eventName":"HeadBucket","awsRegion":"us-west-2","sourceIPAddress":"172.18.87.252","userAgent":"[AWSConfig]","requestParameters":{"bucketName":"service_logs_10l51wolgib72","Host":"s3.us-west-2.amazonaws.com"},"responseElements":null,"additionalEventData":{"SignatureVersion":"SigV4","CipherSuite":"ECDHE-RSA-AES128-SHA","bytesTransferredIn":0.0,"AuthenticationMethod":"AuthHeader","x-amz-id-2":"JYEwSk6jEv2rB/MjwluNXcnKxRSo72GCOz8WP9OYXDFI2FxS1T81K7excoDuo36rJIQz9MWYKEE=","bytesTransferredOut":0.0},"requestID":"E224F90BD7370033","eventID":"77d7ea03-b8a2-4b50-8f81-b8217eacf034","readOnly":true,"resources":[{"type":"AWS::S3::Object","ARNPrefix":"arn:aws:s3:::service_logs_10l51wolgib72/"},{"accountId":"2CCCCCCCCC8","type":"AWS::S3::Bucket","ARN":"arn:aws:s3:::service_logs_10l51wolgib72"}],"eventType":"AwsApiCall","recipientAccountId":"2CCCCCCCCC8","vpcEndpointId":"vpce-3c0ee766"}]}
I am importing them (the whole directory, with all files sharing the same format) into Athena using this code from an online reference, but no matter how I adjust it, I still get a table with a single column...
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.testjson4 (
Records struct<eventVersion:string,
userIdentity:string,
eventTime:string,
eventSource:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json' = 'FALSE',
'dots.in.keys' = 'FALSE',
'case.insensitive' = 'TRUE',
'mapping' = 'TRUE'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://******************/CloudTrail/us-east-1/****/'
TBLPROPERTIES ('classification' = 'json');
I would like to get help on how to do this correctly, thank you!
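One thing that stands out is that the top-level Records key in the sample is a JSON array of events, while the DDL above models it as a single struct column, so Athena can only ever return that one column. A sketch of an alternative definition that keeps the array structure, assuming the same bucket path and listing only a few of the nested fields (the table name testjson_records is made up):
-- Sketch only: models Records as an array of structs so individual
-- events can be addressed at query time. The field list is abbreviated.
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.testjson_records (
  records array<struct<
    eventversion: string,
    eventtime: string,
    eventsource: string,
    eventname: string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://******************/CloudTrail/us-east-1/****/';

-- Example query: look at the first event in each file (array indexes are 1-based)
SELECT records[1].eventname, records[1].eventtime FROM testjson_records;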

Access S3 CSV file in Amazon Athena

I am trying to load files from S3 into Athena to perform a query operation. But all the column values are getting added to the first column.
I have file in the following format:
id,user_id,personal_id,created_at,updated_at,active
34,34,43,31:28.4,27:07.9,TRUE
This is the output I get:
Table creation query:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES (
'transient_lastDdlTime'='1665356861')
Can someone please tell me where I am going wrong?
You should add skip.header.line.count to your table properties to skip the first row. Since you have defined all columns with the string data type, Athena was unable to differentiate between the header and the first data row.
DDL with property added:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')
The SerDe needs some parameters to recognize CSV files, such as:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
See: LazySimpleSerDe for CSV, TSV, and custom-delimited files - Amazon Athena
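Putting the delimiter settings and the header-skipping property together, a sketch of how the table from the question might be redefined (same columns and bucket as above; treat this as an illustration, not the only way to do it):
-- Sketch: LazySimpleSerDe with an explicit comma delimiter, plus
-- skip.header.line.count so the header row is ignored.
CREATE EXTERNAL TABLE `testing`(
  `id` string,
  `user_id` string,
  `personal_id` string,
  `created_at` string,
  `updated_at` string,
  `active` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
LOCATION 's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1');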
An alternative method is to use AWS Glue to create the tables for you. In the AWS Glue console, you can create a Crawler and point it to your data. When you run the crawler, it will automatically create a table definition in Amazon Athena that matches the supplied data files.

AWS Glue/Data catalog showing quotes around data

When I query my files from the Data Catalog using Athena, all the data appears wrapped in quotes. Is it possible to remove those quotes?
I tried adding the quoteChar option in the table settings, but it didn't help.
UPDATE
As requested, the DDL:
CREATE EXTERNAL TABLE `holidays`(
`id` bigint,
`start` string,
`end` string,
`createdat` string,
`updatedat` string,
`deletedat` string,
`type` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES (
'quoteChar'='\"')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://pinfare-glue/holidays/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'UPDATED_BY_CRAWLER'='pinfare-holidays',
'averageRecordSize'='84',
'classification'='csv',
'columnsOrdered'='true',
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'recordCount'='29',
'sizeKey'='2494',
'skip.header.line.count'='1',
'typeOfData'='file')
I know it's late, but I think the issue is with the "Serde serialization lib".
In AWS Glue --> click on the table --> Edit Table --> check "Serde serialization lib";
its value should be "org.apache.hadoop.hive.serde2.OpenCSVSerde".
Then click Apply.
This should solve your issue.
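If you would rather redefine the table in DDL than edit it in the Glue console, here is a sketch of the same table using OpenCSVSerDe (same columns and location as the crawler-generated table; an illustration only):
-- Sketch: OpenCSVSerDe parses the quoted fields, so the surrounding
-- quotes no longer show up in query results.
CREATE EXTERNAL TABLE `holidays`(
  `id` bigint,
  `start` string,
  `end` string,
  `createdat` string,
  `updatedat` string,
  `deletedat` string,
  `type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"')
LOCATION 's3://pinfare-glue/holidays/'
TBLPROPERTIES ('skip.header.line.count'='1');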

How to read quoted CSV with NULL values into Amazon Athena

I'm trying to create an external table in Athena from a quoted CSV file stored on S3. The problem is that my CSV contains missing values in columns that should be read as INTs. Simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
The CREATE TABLE statement runs fine, but as soon as I try to query the table, I get HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quoting the empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works when you specify all columns as STRING, but that's not what I need.
Therefore, the question is: is there any way to read a quoted CSV (quoting is important, as my real data is much more complex) into Athena with the correct column specification?
A quick and dirty way to handle this data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that it is not handling the quote characters in the last field. Based on the AWS documentation, this makes sense, since LazySimpleSerDe does not support quoted fields.
I suspect the solution is to use the RegexSerDe, org.apache.hadoop.hive.serde2.RegexSerDe.
I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.
Further Reading:
Stackoverflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV
Unfortunately there is no way to get both support for quoted fields and support for null values in Athena. You have to choose one or the other.
You can use OpenCSVSerDe and type all columns as string; that will give you support for quoted fields, and empty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe, because you can always create a view with all the type conversions and use the view for your regular queries.
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV
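A sketch of such a view, assuming the table from the question has been redefined with OpenCSVSerDe and all columns typed as string (the view name is made up):
-- TRY_CAST returns NULL when a value is empty or cannot be parsed,
-- so empty CSV fields come through as NULL integers.
CREATE OR REPLACE VIEW test_null_typed AS
SELECT
  TRY_CAST(id AS INTEGER)     AS id,
  TRY_CAST(height AS INTEGER) AS height,
  TRY_CAST(age AS INTEGER)    AS age,
  name
FROM test_null_unquoted;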
This worked for me. Use OpenCSVSerDe and convert all columns into string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/

Getting Null Values in Hive Create & Load Query with REGEX

I have a log file from which I need to extract data with a regex. I tried the query below, but it loads all NULL values. I have checked the regex with http://www.regexr.com/; it works fine for my data.
CREATE EXTERNAL TABLE IF NOT EXISTS avl(imei STRING,packet STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(IMEI\\s\\d{15} (\\b(\\d{15})([A-Z0-9]+)) )",
"output.format.string" = "%1$s %2$s"
)
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs:/user/user1/data' OVERWRITE INTO TABLE avl;
Please correct me here.
Sample Log:
[INFO_|01/31 07:19:29] IMEI 356307043180842
[INFO_|01/31 07:19:33] PacketLength = 372
[INFO_|01/31 07:19:33] Recv HEXString : 0000000000000168080700000143E5FC86B6002F20BC400C93C6F000FF000E0600280007020101F001040914B34238DD180028CD6B7801C7000000690000000143E5FC633E002F20B3000C93A3B00105000D06002C0007020101F001040915E64238E618002CCD6B7801C7000000640000000143E5FC43FE002F20AA800C9381700109000F06002D0007020101F001040915BF4238D318002DCD6B7801C70000006C0000000143E5FC20D6002F20A1400C935BF00111000D0600270007020101F001040916394238B6180027CD6B7801C70000006D0000000143E5FBF5DE002F2098400C9336500118000B0600260007020101F0010409174D42384D180026CD6B7801C70000006E0000000143E5FBD2B6002F208F400C931140011C000D06002B0007020101F001040915624238C018002BCD6B7801C70000006F0000000143E5FBAF8E002F2085800C92EB10011E000D06002B0007020101F0010409154C4238A318002BCD6B7801C700000067000700005873
Thanks.
With your current table definition, no regex will do what you're looking for. The reason is that your file_format is set to TEXTFILE, which splits up the input file by line (\r, \n, or \r\n), before the data ever gets to the SerDe.
Each line is then individually passed to RegexSerDe, matched against your regex, and any non-matches return NULL. For this reason, multiline regexes will not work with STORED AS TEXTFILE. This is also why you received all NULL rows: no single line of the input matched your entire regex.
One solution here might be pre-processing the data such that each record is only on one line in the input file, but that's not what you're asking for.
The way to do this in Hive is to use a different file_format:
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
TextInputFormat reads a configuration variable named textinputformat.record.delimiter from the current configuration. If you're using TextInputFormat, this variable tells Hadoop and Hive where one record ends and the next one begins.
Consequently, setting this value to something like EOR would mean that the input files are split on EOR, rather than by line. Each chunk generated by the split would then get passed to RegexSerDe as a whole chunk, newlines & all.
You can set this variable in a number of places, but if this delimiter applies only to this query (and subsequent queries within the session), then you can do:
SET textinputformat.record.delimiter=EOR;
CREATE EXTERNAL TABLE ...
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = ...
"output.regex" = ...
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION ...;
In your specific scenario, I can't tell what you might use for textinputformat.record.delimiter instead of EOR, since we were only given one example record, and I can't tell from your regex which field you're trying to capture as the second group.
If you can provide these two items (sample data with more than one record, and what you're trying to capture for packet), I might be able to help out more. As it stands now, your regex does not match the sample data you provided -- not even on the site you linked.