AWS Athena case sensitive orc.column.index.access=false?

I've created a table in Athena with the configuration below:
CREATE EXTERNAL TABLE `extern` (`FirstName` STRING, `LastName` STRING, `Email` STRING, `Phone` STRING, `AddressLine1` STRING, `City` STRING, `State` STRING, `PostalCode` STRING,
`time_on_page` DECIMAL(10,3), `page` STRING, `login_time` TIMESTAMP) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
WITH SERDEPROPERTIES ('orc.column.index.access'='false')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 's3://location/' tblproperties ("orc.compress"="ZLIB")
When I query it with select * from extern, I get empty values for all columns except time_on_page, page and login_time. However, the query returns data in all columns if I use ('orc.column.index.access'='true').
The column names in the ORC file schema exactly match those defined in the create statement:
File Version: 0.12 with ORC_135
Rows: 13559
Compression: ZLIB
Compression size: 131072
Type: struct<FirstName:string,LastName:string,Email:string,Phone:string,AddressLine1:string,City:string,State:string,PostalCode:string,time_on_page:decimal(10,3),page:string,login_time:timestamp>
The question is: is it possible that, when orc.column.index.access=false, the engine matches column names case-sensitively?

Apparently Amazon's version of Presto is based on PrestoDB, and case-insensitive matching of ORC column names does not seem to be available in PrestoDB as of this writing. PrestoSQL, however, seems to have supported it from Release 312 onwards (ref: https://github.com/prestosql/presto/issues/802).
The solution is to make sure the schema in the ORC file uses lowercase column names, so that they match the names from the Hive metastore (which are always lowercased).
Relevant PrestoDB code for debugging: https://github.com/prestodb/presto/blob/6647e13f64883f7cfa89221d91b981bcc3a57618/presto-hive/src/main/java/com/facebook/presto/hive/HiveUtil.java#L984
https://github.com/prestodb/presto/blob/95bcb3947cad1570e19a0adaebc58009aa362ada/presto-orc/src/main/java/com/facebook/presto/orc/OrcReader.java#L130
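One way to get ORC files with lowercase column names without leaving Athena might be a CTAS rewrite. The sketch below rests on assumptions that are not in the original answer: the existing files stay readable through a table defined with 'orc.column.index.access'='true' (given the hypothetical name extern_by_index here), and s3://location-lowercase/ is a hypothetical, empty output prefix. Since Athena keeps column names in lowercase, the rewritten files should carry lowercase names, but it is worth confirming this with an ORC metadata dump before relying on it.

```sql
-- Sketch only: extern_by_index and the output prefix are hypothetical names.
-- Reads the mixed-case ORC files by column position and writes new ORC files.
CREATE TABLE extern_lowercase
WITH (
  format = 'ORC',
  external_location = 's3://location-lowercase/'
) AS
SELECT *
FROM extern_by_index;
```

A table that reads by name ('orc.column.index.access'='false') could then be pointed at the new location.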

Related

Access S3 CSV file in Amazon Athena

I am trying to load files from S3 into Athena to run queries on them, but all the column values end up in the first column.
I have a file in the following format:
id,user_id,personal_id,created_at,updated_at,active
34,34,43,31:28.4,27:07.9,TRUE
Table creation query:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES (
'transient_lastDdlTime'='1665356861')
Can someone please tell me where I am going wrong?
You should add skip.header.line.count to your table properties to skip the first row. Because you have defined all columns with the string data type, Athena cannot differentiate between the header and the first data row.
DDL with property added:
CREATE EXTERNAL TABLE `testing`(
`id` string,
`user_id` string,
`personal_id` string,
`created_at` string,
`updated_at` string,
`active` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count'='1')
The SerDe needs some parameters to recognize CSV files, such as:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
See: LazySimpleSerDe for CSV, TSV, and custom-delimited files - Amazon Athena
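Putting the two suggestions above together (a comma delimiter plus header skipping), the table definition might look like the sketch below. It reuses the table name, columns and bucket from the question; STORED AS TEXTFILE is used here as shorthand for the text input and output formats, which is an editorial choice rather than part of either answer.

```sql
-- Sketch: comma-delimited text table that skips the header row.
CREATE EXTERNAL TABLE `testing` (
  `id` string,
  `user_id` string,
  `personal_id` string,
  `created_at` string,
  `updated_at` string,
  `active` string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://testing2fa/'
TBLPROPERTIES ('skip.header.line.count' = '1');
```

The existing table would need to be dropped first (DROP TABLE `testing`), which removes only the metadata and leaves the files in S3 untouched.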
An alternative method is to use AWS Glue to create the tables for you. In the AWS Glue console, you can create a Crawler and point it to your data. When you run the crawler, it will automatically create a table definition in Amazon Athena that matches the supplied data files.

Read partitioned data of VPC flow logs

I used this article to read my VPC flow logs and everything worked correctly.
https://aws.amazon.com/blogs/big-data/optimize-performance-and-reduce-costs-for-network-analytics-with-vpc-flow-logs-in-apache-parquet-format/
However, when I follow the documentation and run the create table statement below, the table does not return any records.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
`version` int,
`account_id` string,
`interface_id` string,
`srcaddr` string,
`dstaddr` string,
`srcport` int,
`dstport` int,
`protocol` bigint,
`packets` bigint,
`bytes` bigint,
`start` bigint,
`end` bigint,
`action` string,
`log_status` string,
`vpc_id` string,
`subnet_id` string,
`instance_id` string,
`tcp_flags` int,
`type` string,
`pkt_srcaddr` string,
`pkt_dstaddr` string,
`region` string,
`az_id` string,
`sublocation_type` string,
`sublocation_id` string,
`pkt_src_aws_service` string,
`pkt_dst_aws_service` string,
`flow_direction` string,
`traffic_path` int
)
PARTITIONED BY (
`aws-account-id` string,
`aws-service` string,
`aws-region` string,
`year` string,
`month` string,
`day` string,
`hour` string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/aws-account-id={account_id}/aws-service=vpcflowlogs/aws-region={region_code}/'
TBLPROPERTIES (
'EXTERNAL'='true',
'skip.header.line.count'='1'
)
official doc:
https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html
This create table statement should work after changing the variables like DOC-EXAMPLE-BUCKET/prefix, account_id and region_code. Why am I getting 0 rows returned for a select * query?
You need to load the partitions before you can query them.
From the docs:
After you create the table, you load the data in the partitions for querying. For Hive-compatible data, you run MSCK REPAIR TABLE. For non-Hive compatible data, you use ALTER TABLE ADD PARTITION to add the partitions manually.
So if your structure is Hive-compatible, you can just run:
MSCK REPAIR TABLE `table name`;
And this will load all your new partitions.
Otherwise you'll have to load them manually using ADD PARTITION:
ALTER TABLE test ADD PARTITION (`aws-account-id`='1', `aws-service`='2' ...) LOCATION 's3://bucket/subfolder/data/accountid1/service2/'
Because adding partitions manually is so tedious, I recommend using partition projection for your table if your data structure is not Hive-compatible.
To avoid having to manage partitions, you can use partition projection. Partition projection is an option for highly partitioned tables whose structure is known in advance. In partition projection, partition values and locations are calculated from table properties that you configure rather than read from a metadata repository. Because the in-memory calculations are faster than remote look-up, the use of partition projection can significantly reduce query runtimes.
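As a rough illustration of partition projection, the sketch below configures a hypothetical table, vpc_flow_logs_projected, that is partitioned only by region and day (both string) over the non-Hive layout AWSLogs/{account_id}/vpcflowlogs/{region}/{yyyy}/{MM}/{dd}/. The table name, the region list and the start date are assumptions for illustration, and {account_id} must be replaced with a literal account ID. The table from the question has seven partition columns, and each of them would need its own projection.* entries.

```sql
-- Sketch: enable partition projection on a hypothetical two-partition table.
ALTER TABLE vpc_flow_logs_projected SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  -- region: a fixed list of values
  'projection.region.type' = 'enum',
  'projection.region.values' = 'us-east-1,eu-west-1',
  -- day: a date range that extends to the current date
  'projection.day.type' = 'date',
  'projection.day.range' = '2021/01/01,NOW',
  'projection.day.format' = 'yyyy/MM/dd',
  -- how partition values map to S3 prefixes
  'storage.location.template' = 's3://DOC-EXAMPLE-BUCKET/prefix/AWSLogs/{account_id}/vpcflowlogs/${region}/${day}'
);
```

With this in place, neither MSCK REPAIR TABLE nor ADD PARTITION should be needed; a query that filters on region and day (for example day >= '2021/06/01') only reads the matching prefixes.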

Full text query in Amazon Athena is timing-out when using `LIKE`

I'm getting a timeout error for a full-text query in Athena like this:
SELECT count(textbody) FROM "email"."some_table" where textbody like '% some text to seach%'
Is there any way to optimize it?
Update:
The create table statement:
CREATE EXTERNAL TABLE `email`.`email5_newsletters_04032019`(
`nesletterid` string,
`name` string,
`format` string,
`subject` string,
`textbody` string,
`htmlbody` string,
`createdate` string,
`active` string,
`archive` string,
`ownerid` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'ESCAPED BY' = '\\'
) LOCATION 's3://some_bucket/email_backup_updated/email5/'
TBLPROPERTIES ('has_encrypted_data'='false');
And S3 bucket contents:
# aws s3 ls s3://xxx/email_backup_updated/email5/ --human
2020-08-22 15:34:44 2.2 GiB email_newsletters_04032019_updated.csv.gz
There are 11 million records in this file. The file can be imported into Redshift within 30 minutes and everything works fine there, but I would prefer to use Athena!
CSV is not a format that integrates very well with the Presto engine, because queries need to read the full row to reach a single column. A way to optimize your use of Athena, which will also save you plenty of storage costs, is to switch to a columnar storage format such as Parquet or ORC. You can actually do the conversion with a query:
CREATE TABLE "email"."email5_newsletters_04032019_orc"
WITH (
external_location = 's3://my_orc_table/',
format = 'ORC')
AS SELECT *
FROM "email"."email5_newsletters_04032019";
Then rerun your query above on the new table:
SELECT count(textbody) FROM "email"."email5_newsletters_04032019_orc" where textbody like '% some text to seach%'

Need to skip CSV header when reading from s3

When I try to load a CSV file from S3, the headers are read into the columns as data. I tried to skip the header with
TBLPROPERTIES (
"skip.header.line.count"="1")
but it had no effect.
Any advice please?
CREATE EXTERNAL TABLE skipheader(
permalink string,
company string,
numemps bigint,
category string,
city string,
state string,
fundeddate string,
raisedamt bigint,
raisedcurrency string,
round string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucketname/filename/'
TBLPROPERTIES (
"skip.header.line.count"="1")
The release notes from when the feature was introduced say:
Support for ignoring headers. You can use the skip.header.line.count property when defining tables, to allow Athena to ignore headers. This is currently supported for queries that use the OpenCSV SerDe, and not for Grok or Regex SerDes.
My interpretation of this is that it won't work with LazySimpleSerde, which is what you get when you say ROW FORMAT DELIMITED, and that you have to use the OpenCSV serde:
CREATE EXTERNAL TABLE skipheader ( … )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://bucketname/filename/'
TBLPROPERTIES ("skip.header.line.count"="1")
The OpenCSV SerDe works differently from LazySimpleSerDe: it has much more limited data type support, but on the other hand it is more configurable.
If you can use the OpenCSV SerDe and make it work for you as described by Theo, go for it. However, if you have other tables in other formats, you can get around the problem in the following way, even though it is a bit of a hack. You can simply add a WHERE clause that excludes the headers, like
SELECT * FROM skipheader WHERE permalink != 'permalink'. Recently, Athena added the ability to create a new table from the result of a query (https://docs.aws.amazon.com/athena/latest/ug/create-table-as.html), so you could even filter out the headers and save the result to a new location using Athena, if that works better for you.
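The filter-and-save idea above could be sketched as a single CTAS query. The target table name skipheader_clean, the output prefix and the choice of Parquet are assumptions for illustration; the output prefix should be empty before the query runs.

```sql
-- Sketch: copy the data to a new location without the header rows.
CREATE TABLE skipheader_clean
WITH (
  format = 'PARQUET',
  external_location = 's3://bucketname/skipheader-clean/'
) AS
SELECT *
FROM skipheader
WHERE permalink <> 'permalink';  -- drops rows that repeat the header text
```

This assumes no real data row has the literal value 'permalink' in the permalink column.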

How to read quoted CSV with NULL values into Amazon Athena

I'm trying to create an external table in Athena using a quoted CSV file stored on S3. The problem is that my CSV contains missing values in columns that should be read as INTs. Simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
The CREATE TABLE statement runs fine, but as soon as I try to query the table, I get HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quote empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works when you specify all columns as STRING, but that's not what I need.
Therefore, the question is, is there any way to read a quoted CSV (quoting is important as my real data is much more complex) to Athena with correct column specification?
Quick and dirty way to handle these data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that this does not handle the quote characters in the last field. Based on the documentation provided by AWS, this makes sense, as LazySimpleSerDe (inherited from Hive) does not strip or interpret quote characters.
I suspect the solution is to use the following SerDe: org.apache.hadoop.hive.serde2.RegexSerDe.
I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.
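A minimal sketch of that workaround, assuming the header line comes back with a NULL id (it does not match the regex above because it contains no quotes, and RegexSerDe is expected to return NULL columns for non-matching lines):

```sql
-- Sketch: hide the header row by filtering on the NULL id it produces.
SELECT id, height, age, name
FROM stackoverflow.test_null_unquoted
WHERE id IS NOT NULL;
```

Any data line that fails to match the regex would be filtered out the same way, so it may be worth counting the NULL-id rows first.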
Further Reading:
Stackoverflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV
Unfortunately, there is no way to get both support for quoted fields and support for null values in Athena. You have to choose one or the other.
You can use OpenCSVSerDe and type all columns as string, which gives you support for quoted fields and empty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV
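The OpenCSVSerDe-plus-view approach described above might look like the sketch below. The names test_null_raw and test_null_typed are hypothetical; the location and the CSV layout are taken from the question. Athena runs one statement at a time, so the two statements would be submitted separately.

```sql
-- Sketch, statement 1: read every column as a string with OpenCSVSerDe.
CREATE EXTERNAL TABLE test_null_raw (
  id STRING,
  height STRING,
  age STRING,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
TBLPROPERTIES ('skip.header.line.count' = '1');

-- Sketch, statement 2: a view that converts types at query time.
-- TRY_CAST returns NULL instead of failing when a value such as '' cannot be cast.
CREATE OR REPLACE VIEW test_null_typed AS
SELECT
  TRY_CAST(id AS INTEGER) AS id,
  TRY_CAST(height AS INTEGER) AS height,
  TRY_CAST(age AS INTEGER) AS age,
  name
FROM test_null_raw;
```

Regular queries would then go against test_null_typed, where the missing height in the first row shows up as NULL.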
This worked for me. Use OpenCSVSerDe and convert all columns into string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/