I have the following query to create a table in Athena from existing files located in S3. As you can see, I am defining the line-break character and how null values should be handled:
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(
`field1` STRING,
`field2` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
NULL DEFINED AS ' '
LOCATION 's3://bucket/prefix/'
TBLPROPERTIES ('skip.header.line.count'='1')
Now I also want to specify the quotation character, but I don't see any property for that in the DELIMITED syntax.
I tried using WITH SERDEPROPERTIES as shown below (where I can use quoteChar), but then I cannot find any SerDe property to define the line break and the NULL handling.
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(
`field1` STRING,
`field2` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
LOCATION 's3://bucket/prefix/'
TBLPROPERTIES ('skip.header.line.count'='1')
Is there any way of using a quotation character, field delimiter, line break, and null handling together?
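The closest I have found is spelling out the same options as LazySimpleSerDe properties (a sketch only; I am assuming the property names field.delim, line.delim, and serialization.null.format from the Hive documentation), but that SerDe does not seem to have a quote-character property either:
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(
`field1` STRING,
`field2` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = ',',
'line.delim' = '\n',
'serialization.null.format' = ' '
)
LOCATION 's3://bucket/prefix/'
TBLPROPERTIES ('skip.header.line.count'='1')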
I have a table like so
CREATE EXTERNAL TABLE IF NOT EXISTS something (
...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://...'
TBLPROPERTIES ('has_encrypted_data'='false');
but some fields contain a comma, like (8-10,99), without quotes. The CSV is too large to open in Excel. Is there any way to change the delimiter or make Athena read this file?
If the fields are comma-separated but contain unescaped commas, there is no way for any automated tool to distinguish between a comma that separates two fields and one that is part of a field's content. In other words, the files are malformed and have to be fixed. If you have the option of generating the files again, either make sure that fields are quoted, or use a separator that will not appear in the fields, such as a tab character.
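For example, if you can re-export the data tab-delimited, a plain DELIMITED table is enough (a minimal sketch; the column names and location are placeholders):
CREATE EXTERNAL TABLE IF NOT EXISTS something_tab (
`field1` STRING,
`field2` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3://...'
TBLPROPERTIES ('skip.header.line.count'='1');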
This Athena table correctly reads the first line of the file.
CREATE EXTERNAL TABLE `test_delete_email5`(
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string,
`col6` string,
`col7` string,
`col8` string,
`col9` string,
`col10` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'LINES TERMINATED BY' = '\n',
'ESCAPED BY' = '\\',
'quoteChar' = '\"'
) LOCATION 's3://testme162/email_backup/email5/'
TBLPROPERTIES ('has_encrypted_data'='false')
This table is not imported correctly due to HTML code found in the 5th column. Is there any other way?
It appears that your file contains a lot of multi-line text in the textbody field. This does not conform to the CSV standard (or at least, it cannot be understood by the OpenCSVSerde).
As a test, I made a simple file:
"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"
"one","two","three","four","five","six","seven","eight","nine","ten"
"one","two","three","four","five \" quote \" five2","six","seven","eight","nine","ten"
"one","two","three","four","five \
five2","six","seven","eight","nine","ten"
Row 1 is the header
Row 2 is normal
Row 3 has a field with \" escaped quotes
Row 4 has escaped newlines
I then ran the command from your question and pointed it to this data file.
Result:
Rows 1-3 (including the header row) were returned
Row 4 only worked up to the \ character; data after that was lost
Bottom line: Your file format is not compatible with CSV format.
You might be able to find some Serde that can handle it, but OpenCSVSerde doesn't seem to understand it because rows are normally split by newlines.
My question is how to properly use SerDe properties to parse the lines below. I have tried multiple variations and my tables keep getting filled with null values. Below are the SerDe and the sample data. From my understanding, ([^\s]*) should match anything that is not (^) whitespace (\s), zero or more times (*). Likewise, the next group should put everything before the line break into the next column.
My intent is to put the numbers into one column and everything else into another column. What is wrong with my interpretation of the SerDe?
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^\s]*) ([^\n]*)");
1134999 06Crazy Life
6821360 Pang Nakarin
10113088 Terfel, Bartoli- Mozart: Don
10151459 The Flaming Sidebur
6826647 Bodenstandig 3000
10186265 Jota Quest e Ivete Sangalo
6828986 Toto_XX (1977
Try this (or something similar):
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d+) ([^\\n]*)",
"output.format.string" = "%1$s %2$s"
)
STORED AS TEXTFILE;
Modified from here.
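Put together as a complete statement, it would look roughly like this (a sketch only; the table and column names are made up, the location is a placeholder, and I am assuming the contrib RegexSerDe from the snippet above is available):
CREATE EXTERNAL TABLE artists (
`artist_id` STRING,
`artist_name` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d+) ([^\\n]*)",
"output.format.string" = "%1$s %2$s"
)
STORED AS TEXTFILE
LOCATION 's3://.../';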
While creating a table from a file in the Hue/Hive interface, we have to specify a delimiter (tab, space, comma, etc.). But my file is delimited by one or more spaces. How can I specify a delimiter of one or more spaces?
You can create the table using a regex as the delimiter, like this:
Data (put this data in HDFS):
1 2 3 4
a b c d
Create the table:
-- create table using RegexSerDe
CREATE TABLE test1(
a string,
b string,
c string,
d string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES
(
"input.regex" ="([^\\s]*)\\s+([^\\s]*)\\s+([^\\s]*)\\s+([^\\s]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
LOCATION '/test1/';
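Querying it should then give one value per column, split on runs of spaces (a quick check against the sample data above):
-- expected output: (1, 2, 3, 4) and (a, b, c, d)
SELECT a, b, c, d FROM test1;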