How to handle embed line breaks in AWS Athena

How to handle embed line breaks in AWS Athena - amazon-web-services

I have created a table in AWS Athena like this:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
col1 string,
col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/test/'
In the bucket I put a simple CSV file with the following context:
rec1 col1,rec2 col2
rec2 col1,"rec2, col2"
rec3 col1,"rec3
col2"
When I run data preview request SELECT * FROM "default"."test_line_breaks" limit 10; then Athena returns the following response:
How should I set ROW FORMAT to properly handle line breaks within the field values? So that rec3\ncol2 appears in col2.

The problem here is that the OpenCSV Serializer-Deserializer
Does not support embedded line breaks in CSV files.
See this documentation from AWS.
However, it might be possible to use RegexSerDe. Just remember that this Deserializer will take "Java Flavored" Regex. So be sure to use an online Regex tool that supports that syntax in your debugging.
Edit: Still working on the syntax for dealing with the embedded line feed \n. However, here is a sample that handles two columns with optional quotes. The following regex "*([^"]*)"*,"*([^"]*)"* worked on your line with the embedded return carriage. However, I think the Presto Engine is only feeding it rec3 col1,"rec3. I continue working on it.
CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
col1 string,
col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
STORED AS TEXTFILE
LOCATION 's3://.../47936191';

Related

AWS Athena query returning empty string

I've seen other questions saying their query returns no results. This is not what is happening with my query. The query itself is returning empty strings/results.
I have an 81.7MB JSON file in my input bucket (input-data/test_data). I've setup the datasource as JSON.
However, when I execute SELECT * FROM test_table; it shows (in green) that the data has been scanned, the query was successful and there are results, but not saved to the output bucket or displayed in the GUI.
I'm not sure what I've done wrong in the setup?
This is my table creation:
CREATE EXTERNAL TABLE IF NOT EXISTS `test_db`.`test_data` (
`tbl_timestamp` timestamp,
`colmn1` string,
`colmn2` string,
`colmn3` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://input-data/test_data/'
TBLPROPERTIES ('has_encrypted_data'='false',
'skip.header.line.count'='1');

Resolved this issue. The labels of the table (e.g. the keys) need to be the same labels in the file itself. Simple really!

Full text query in Amazon Athena is timing-out when using `LIKE`

Getting timeout error for a full text query in Athena like this...
SELECT count(textbody) FROM "email"."some_table" where textbody like '% some text to seach%'
Is there any way to optimize it?
Update:
The create table statement:
CREATE EXTERNAL TABLE `email`.`email5_newsletters_04032019`(
`nesletterid` string,
`name` string,
`format` string,
`subject` string,
`textbody` string,
`htmlbody` string,
`createdate` string,
`active` string,
`archive` string,
`ownerid` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'ESCAPED BY' = '\\'
) LOCATION 's3://some_bucket/email_backup_updated/email5/'
TBLPROPERTIES ('has_encrypted_data'='false');
And S3 bucket contents:
# aws s3 ls s3://xxx/email_backup_updated/email5/ --human
2020-08-22 15:34:44 2.2 GiB email_newsletters_04032019_updated.csv.gz
There are 11 million records in this file. The file can be imported within 30 minutes in Redshift and everything works OK in redshift. I will prefer to use Athena!

CSV is not a format that integrates very well with the presto engine, as queries need to read the full row to reach a single column. A way to optimize usage of athena, which will also save you plenty of storage costs, is to switch to a columnar storage format, like parquet or orc, and you can actually do it with a query:
CREATE TABLE `email`.`email5_newsletters_04032019_orc`
WITH (
external_location = 's3://my_orc_table/',
format = 'ORC')
AS SELECT *
FROM `email`.`email5_newsletters_04032019`;
Then rerun your query above on the new table:
SELECT count(textbody) FROM "email"."email5_newsletters_04032019_orc" where textbody like '% some text to seach%'

How to read quoted CSV with NULL values into Amazon Athena

I'm trying to create an external table in Athena using quoted CSV file stored on S3. The problem is, that my CSV contain missing values in columns that should be read as INTs. Simple example:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
CREATE TABLE DEFINITION:
CREATE EXTERNAL TABLE schema.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'
CREATE TABLE statement runs fine but as soon as I try to query the table, I'm getting HIVE_BAD_DATA: Error parsing field value ''.
I tried making the CSV look like this (quote empty string):
"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"
But it's not working.
Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.
Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.
It works, when you specify all columns as STRING but that's not what I need.
Therefore, the question is, is there any way to read a quoted CSV (quoting is important as my real data is much more complex) to Athena with correct column specification?

Quick and dirty way to handle these data:
CSV:
id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"
DDL:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;
The issue is that it is not handling the quote characters in the last field. Based on the documentation provided by AWS, this makes sense as the LazySimpleSerDe given the following from Hive.
I suspect the solution is using the following SerDe org.apache.hadoop.hive.serde2.RegexSerDe.
I will work on the regex later.
Edit:
Regex as promised:
CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
id INT,
height INT,
age INT,
name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;
Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.
Further Reading:
Stackoverflow - remove surrounding quotes from fields while loading data into hive
Athena - OpenCSVSerDe for Processing CSV

Unfortunately there is no way to get both support for quoted fields and support for null values in Athena. You have to choose either or.
You can use OpenCSVSerDe and type all columns as string, that will give you support for quoted fields, and emtpty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.
Or you can use LazySimpleSerDe and strip quotes at query time.
I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.
You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV

This worked for me. Use OpenCSVSerDe and convert all columns into string. Read more: https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/

read csv file where value contain comma in AWS athena

Hi Currently I have created a table schema in AWS Athena as follow
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlActivityLogStaging (
`clientcomputername` string,
`intjnltblrecid` bigint,
`processingstate` string,
`sessionid` int,
`sessionlogindatetime` string,
`sessionlogindatetimetzid` bigint,
`recidoriginal` bigint,
`modifieddatetime` string,
`modifiedby` string,
`createddatetime` string,
`createdby` string,
`dataareaid` string,
`recversion` int,
`partition` bigint,
`recid` bigint
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://ax-large-table/AEGIntJnlActivityLogStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');
But one of the filed (processingstate) value contain comma as "Europe, Middle East, & Africa" which displace columns order.
So what would be the best way to read this file. Thanks

When I removed this part
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
I was able to read quoted text with commas in it

As workaround - look at aws glue project.
Instead of creating table via CREATE EXTERNAL TABLE:
invoke get-table for your table
Then make json for create-table
Merge the following StorageDescriptor part:
{
"StorageDescriptor": {
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde"
...
}
...
}
perform create via aws cli. You will get this table in aws glue and athena be able to select correct columns.
Notes
If your table already defined OpenCSVSerde - they may be fixed this issue and you can simple recreate this table.
I do not have much knoledge about athena, but in aws glue you can delete or create table without any data loss
Before adding this table via create-table you have to check first how glue or/and athena hadles table duplicates

This is a common messy CSV file situation where certain values contain commas. The solution in Athena for this is to use SERDEPROPERTIES as described in the AWS doc https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html [the url may change so just search for 'OpenCSVSerDe for Processing']
Following is a basic create table example provided. Based on your data you would have to ensure that the data type is specified correctly (eg string)
CREATE EXTERNAL TABLE test1 (
f1 string,
s2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\")
LOCATION 's3://user-test-region/dataset/test1/'

Getting Null Values in Hive Create & Load Query with REGEX

I have a Log file in which i need to store data with REGEX. I tried below query but loading all NULL values. I have checked REGEX with http://www.regexr.com/, its working fine for my data.
CREATE EXTERNAL TABLE IF NOT EXISTS avl(imei STRING,packet STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(IMEI\\s\\d{15} (\\b(\\d{15})([A-Z0-9]+)) )",
"output.format.string" = "%1$s %2$s"
)
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs:/user/user1/data' OVERWRITE INTO TABLE avl;
Please correct me here.
Sample Log:
[INFO_|01/31 07:19:29] IMEI 356307043180842
[INFO_|01/31 07:19:33] PacketLength = 372
[INFO_|01/31 07:19:33] Recv HEXString : 0000000000000168080700000143E5FC86B6002F20BC400C93C6F000FF000E0600280007020101F001040914B34238DD180028CD6B7801C7000000690000000143E5FC633E002F20B3000C93A3B00105000D06002C0007020101F001040915E64238E618002CCD6B7801C7000000640000000143E5FC43FE002F20AA800C9381700109000F06002D0007020101F001040915BF4238D318002DCD6B7801C70000006C0000000143E5FC20D6002F20A1400C935BF00111000D0600270007020101F001040916394238B6180027CD6B7801C70000006D0000000143E5FBF5DE002F2098400C9336500118000B0600260007020101F0010409174D42384D180026CD6B7801C70000006E0000000143E5FBD2B6002F208F400C931140011C000D06002B0007020101F001040915624238C018002BCD6B7801C70000006F0000000143E5FBAF8E002F2085800C92EB10011E000D06002B0007020101F0010409154C4238A318002BCD6B7801C700000067000700005873
Thanks.

With your current table definition, no regex will do what you're looking for. The reason is that your file_format is set to TEXTFILE, which splits up the input file by line (\r, \n, or \r\n), before the data ever gets to the SerDe.
Each line is then individually passed to RegexSerDe, matched against your regex, and any non-matches return NULL. For this reason, multiline regexes will not work using STORED AS TEXTFILE. This is also why you received all NULL rows: Because no single line of the input matched your entire regex.
One solution here might be pre-processing the data such that each record is only on one line in the input file, but that's not what you're asking for.
The way to do this in Hive is to use a different file_format:
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
TextInputFormat reads from the current configuration a configuration variable named textinputformat.record.delimiter. If you're using TextInputFormat, this variable tells Hadoop and Hive where one record ends and the next one begins.
Consequently, setting this value to something like EOR would mean that the input files are split on EOR, rather than by line. Each chunk generated by the split would then get passed to RegexSerDe as a whole chunk, newlines & all.
You can set this variable in a number of places, but if this is the delimiter for only this (and subsequent within the session) queries, then you can do:
SET textinputformat.record.delimiter=EOR;
CREATE EXTERNAL TABLE ...
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = ...
"output.regex" = ...
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION ...;
In your specific scenario, I can't tell what you might use for textinputformat.record.delimiter instead of EOF, since we were only given one example record, and I can't tell which field you're trying to capture second based on your regex.
If you can provide these two items (sample data with >1 records, and what you're trying to capture for packet), I might be able to help out more. As it stands now, your regex does not match the sample data you provided -- not even on the site you linked.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js