This Athena table correctly reads the first line of the file:
CREATE EXTERNAL TABLE `test_delete_email5`(
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string,
`col6` string,
`col7` string,
`col8` string,
`col9` string,
`col10` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'LINES TERMINATED BY' = '\n',
'ESCAPED BY' = '\\',
'quoteChar' = '\"'
) LOCATION 's3://testme162/email_backup/email5/'
TBLPROPERTIES ('has_encrypted_data'='false')
This table is not imported correctly due to HTML code found in the 5th column. Is there any other way?
It appears that your file contains a lot of multi-line text in the textbody field. This does not conform to the CSV standard (or, at least, it cannot be understood by the OpenCSVSerde).
As a test, I made a simple file:
"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"
"one","two","three","four","five","six","seven","eight","nine","ten"
"one","two","three","four","five \" quote \" five2","six","seven","eight","nine","ten"
"one","two","three","four","five \
five2","six","seven","eight","nine","ten"
Row 1 is the header
Row 2 is normal
Row 3 has a field with \" escaped quotes
Row 4 has escaped newlines
I then ran the command from your question and pointed it to this data file.
Result:
Rows 1-3 (including the header row) were returned
Row 4 only worked until the \ -- data after that was lost
Bottom line: your file format is not compatible with the CSV format.
You might be able to find a SerDe that can handle it, but OpenCSVSerde doesn't seem to understand it, because it expects rows to be split by newlines.
Related
I have the following query to create a table in Athena from existing files located in S3. As you can see, I am defining the line-break character and how to handle null values:
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(
`field1` STRING,
`field2` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
NULL DEFINED AS ' '
LOCATION 's3://bucket/prefix/'
TBLPROPERTIES ('skip.header.line.count'='1')
Now I also want to include the quotation character, but I don't see any property for that.
I tried using WITH SERDEPROPERTIES as shown below (where I can use quoteChar), but then I cannot find any SerDe property to define the line break and the NULL handling.
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(
`field1` STRING,
`field2` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
LOCATION 's3://bucket/prefix/'
TBLPROPERTIES ('skip.header.line.count'='1')
Is there any way to use a quotation character, field delimiter, line break, and NULL handling together?
I am creating a table in Athena using the script below:
CREATE EXTERNAL TABLE `itcfmetadata`(
`itcf id` string,
`itcf control name` string,
`itcf control description` string,
`itcf process` string,
`standard` string,
`controlid` string,
`threshold` string,
`status` string,
`date reported` string,
`remediation (accs specific)` string,
`aws account id` string,
`aws resource id` string,
`aws account owner` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION
's3://itcfmetadata/'
TBLPROPERTIES (
'skip.header.line.count'='1');
The S3 source file is a CSV file exported from Excel, so it is not cleanly comma-separated. The problem is that when a column contains text like "Hi, How are you", the value gets split on the embedded comma: "Hi" and "How are you" become two separate values and end up in two columns. How can I avoid this using the CREATE script above?
Try using
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
instead of DELIMITED
The DELIMITED deserializer just looks at the delimiters you provide; the CSV deserializer will only treat them as delimiters outside a pair of double quotes (").
See the docs: https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
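For example, the table from the question could be rewritten like this (a sketch; separatorChar, quoteChar, and escapeChar are the properties OpenCSVSerde accepts, and skip.header.line.count drops the header row):
CREATE EXTERNAL TABLE `itcfmetadata`(
`itcf id` string, `itcf control name` string, `itcf control description` string,
`itcf process` string, `standard` string, `controlid` string, `threshold` string,
`status` string, `date reported` string, `remediation (accs specific)` string,
`aws account id` string, `aws resource id` string, `aws account owner` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\'
)
LOCATION 's3://itcfmetadata/'
TBLPROPERTIES ('skip.header.line.count'='1');
With the SerDe in place, a quoted value like "Hi, How are you" stays in a single column.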
I have a table like so
CREATE EXTERNAL TABLE IF NOT EXISTS something (
...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://...'
TBLPROPERTIES ('has_encrypted_data'='false');
but some fields contain a comma, like (8-10,99), without quotes. The CSV is too large to open in Excel. Is there any way to change the delimiter or make Athena read this file?
If the fields are comma-separated but contain unescaped, unquoted commas, there is no way for any automated tool to distinguish a comma that separates fields from one that is part of the content. In other words, the files are malformed and have to be fixed. If you have the option of generating the files again, either make sure that fields are quoted, or use a separator that will not appear in the fields, such as a tab character.
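If you do regenerate the files tab-separated, a plain delimited table is then enough. A minimal sketch, assuming a hypothetical table name, columns, and location:
-- hypothetical table; adapt columns and location to your data
CREATE EXTERNAL TABLE IF NOT EXISTS my_data_tsv (
col1 string,
col2 string,
col3 string,
col4 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://example-bucket/tsv/';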
Here are two lines from my log files that I'm trying to match. I'm trying to separate each line into four columns (date, hostname, command, status).
Each line is tab-delimited between date, hostname, command, and status. The status column may contain spaces.
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
03-24-2014 fm504 CHECK-LOAD OK;SOFT;2;OK - load average: 54.61, 56.95
In Rubular (http://rubular.com/) my regex matches exactly as I want it; however, when I query my Hive table for the date column, I get the entire line, which leads me to believe that the regex doesn't match what Hive is looking for.
([^ ]*)\s([^ ]*)\s([^ ]*)\s(.*)
And this is my CREATE TABLE statement, with the result of the SELECT query below it:
CREATE EXTERNAL TABLE IF NOT EXISTS sys_results(
date STRING
,hostname STRING
,command STRING
,status STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*)\\s*([^ ]*)\\s*([^ ]*)\\s*(.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION '/user/sys_log_output/sys-results/';
select date from sys_results;
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
I figured it out. Hive regex recognizes tabs using '\t'. I changed my input.regex expression to this:
"input.regex" = "([^ ])\t([^ ])\t([^ ])\t([^ ].)"
I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields which are enclosed in double quotes and separated by commas.
While creating the external table in Hive, I am able to specify the delimiter as a comma, but how do I specify that fields are enclosed within quotes?
If I don't specify it, I see that the values in DynamoDB end up wrapped in two double quotes ("value"), which seems wrong.
I am using the following command to create the external table. Is there a way to specify that fields are enclosed within double quotes?
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' LOCATION 's3://emrTest/folder';
Any suggestions would be appreciated.
Thanks
Jitendra
I was also stuck with the same issue, as my fields are enclosed in double quotes and separated by a semicolon (;). My table name is employee1.
After searching around, I found the perfect solution for this.
We have to use a SerDe. Please download the SerDe jar from this link: https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar
Then follow the steps below at the Hive prompt:
add jar path/to/csv-serde.jar;
create table employee1(id string, name string, addr string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\;",
"quoteChar" = "\"")
stored as textfile
;
Then load data from your path using the query below:
load data local inpath 'path/xyz.csv' into table employee1;
and then run:
select * from employee1;
Now you will see the magic. Thanks.
The following code solved the same type of problem:
CREATE TABLE TableRowCSV2(
CODE STRING,
PRODUCTCODE STRING,
PRICE STRING
)
COMMENT 'row data csv'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
If you're stuck with the CSV file format, you'll have to use a custom SerDe; and here's some work based on the opencsv library.
But, if you can modify the source files, you can either select a new delimiter so that the quoted fields aren't necessary (good luck), or rewrite to escape any embedded commas with a single escape character, e.g. '\', which can be specified within the ROW FORMAT with ESCAPED BY:
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LOCATION 's3://emrTest/folder';
Hive now includes an OpenCSVSerde which will properly parse quoted fields without adding additional jars or error-prone and slow regexes:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
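Applied to the table from the question, that looks like this (a sketch; ',' and '"' are also the SerDe's defaults, so the SERDEPROPERTIES are shown only for clarity):
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
LOCATION 's3://emrTest/folder';
Note that OpenCSVSerde treats every column as a string, which the columns here already are.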
Hive doesn't support quoted strings right out of the box. There are two approaches to solving this:
Use a different field separator (e.g. a pipe).
Write a custom InputFormat based on OpenCSV.
The faster (and arguably more sane) approach is to modify your initial export process to use a different delimiter so that you can avoid quoted strings. Then you can tell Hive to use an external table with a tab or pipe delimiter:
CREATE TABLE foo (
col1 INT,
col2 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
Use the csv-serde-0.9.1.jar file in your Hive query; see
http://illyayalovyy.github.io/csv-serde/
add jar /path/to/jar_file;
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION 's3://emrTest/folder'
TBLPROPERTIES ("skip.header.line.count"="1"); -- skip the header row if the files have one
There can be multiple solutions to this problem.
Write a custom SerDe class
Use RegexSerDe
Remove escaped delimiter characters from the data
Read more at http://grokbase.com/t/hive/user/117t2c6zhe/urgent-hive-not-respecting-escaped-delimiter-characters
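As an illustration of option 2, here is a minimal RegexSerDe sketch for rows whose four fields are all quoted and comma-separated (the table name, columns, and location are hypothetical; adjust the pattern to your data):
-- illustrative only: each capture group grabs the text between one pair of quotes
CREATE EXTERNAL TABLE quoted_csv_demo(
col1 string,
col2 string,
col3 string,
col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\",\"([^\"]*)\""
)
STORED AS TEXTFILE
LOCATION 's3://example-bucket/quoted-csv/';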