My question is how to properly use SERDEPROPERTIES to parse the lines below. I have tried multiple variations, and my tables keep filling with null values. Below are the SerDe definition and the sample data. From my understanding, ([^\s]*) should match everything before the first whitespace: ^ negates, \s is whitespace, and * matches zero or more characters. Likewise, the next group should put everything before the line break in the next column.
My intent is to divide the numbers into one column and everything else into another column. What is wrong with my interpretation of the SerDe?
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^\s]*) ([^\n]*)");
1134999 06Crazy Life
6821360 Pang Nakarin
10113088 Terfel, Bartoli- Mozart: Don
10151459 The Flaming Sidebur
6826647 Bodenstandig 3000
10186265 Jota Quest e Ivete Sangalo
6828986 Toto_XX (1977
Try this (or something similar):
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\d+) ([^\\n]*)",
"output.format.string" = "%1$s %2$s"
)
STORED AS TEXTFILE;
Modified from here.
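A pattern like this can be sanity-checked against the sample lines before creating the table. The following is a minimal sketch in Python, whose re module is only a stand-in for the Java regex engine that Hive uses. It calls re.fullmatch on the assumption that RegexSerDe needs the pattern to match the whole line and returns NULLs otherwise. The helper name split_line is made up for this illustration, and the pattern is written with single backslashes because the doubling is only needed inside the Hive string literal.

```python
import re

# Pattern from the answer, with the Hive string-literal escaping removed.
PATTERN = re.compile(r"(\d+) ([^\n]*)")

def split_line(line):
    """Return the captured columns, or None where the line does not match."""
    m = PATTERN.fullmatch(line)
    return m.groups() if m else None

samples = [
    "1134999 06Crazy Life",
    "10113088 Terfel, Bartoli- Mozart: Don",
    "6828986 Toto_XX (1977",
]
for s in samples:
    print(split_line(s))
```

Each sample line should come back as a (number, rest of line) pair; a line that does not start with digits followed by a space comes back as None.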
This Athena table correctly reads the first line of the file.
CREATE EXTERNAL TABLE `test_delete_email5`(
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string,
`col6` string,
`col7` string,
`col8` string,
`col9` string,
`col10` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'LINES TERMINATED BY' = '\n',
'ESCAPED BY' = '\\',
'quoteChar' = '\"'
) LOCATION 's3://testme162/email_backup/email5/'
TBLPROPERTIES ('has_encrypted_data'='false')
This table is not imported correctly because of the HTML code in the 5th column. Is there another way to do this?
It appears that your file contains a lot of multi-line text in the textbody field. This does not follow the CSV standard (or at least, it cannot be understood by the OpenCSVSerde).
As a test, I made a simple file:
"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"
"one","two","three","four","five","six","seven","eight","nine","ten"
"one","two","three","four","five \" quote \" five2","six","seven","eight","nine","ten"
"one","two","three","four","five \
five2","six","seven","eight","nine","ten"
Row 1 is the header
Row 2 is normal
Row 3 has a field with \" escaped quotes
Row 4 has escaped newlines
I then ran the command from your question and pointed it to this data file.
Result:
Rows 1-3 (including the header row) were returned
Row 4 only worked until the \ -- data after that was lost
Bottom line: Your file format is not compatible with CSV format.
You might be able to find some Serde that can handle it, but OpenCSVSerde doesn't seem to understand it because rows are normally split by newlines.
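The effect of splitting rows at newlines can be reproduced outside Athena. The sketch below uses Python's csv module purely as an illustration: it is not the OpenCSV parser, and the two-column sample data is invented. Parsing the whole file keeps a quoted multi-line field together, while parsing each physical line on its own (roughly what happens if the input is cut into records at newlines before the SerDe sees it) breaks the record apart.

```python
import csv
import io

# Invented sample: a header plus one record whose second field spans two lines.
data = '"id","textbody"\n"one","five\nfive2"\n'

# Whole-file parse: the quoted newline stays inside the field.
whole = list(csv.reader(io.StringIO(data, newline="")))

# Line-by-line parse: every physical line is treated as its own record.
per_line = [next(csv.reader([line])) for line in data.splitlines()]

print(len(whole), whole[1])
print(len(per_line), per_line)
```

The whole-file parse yields two records, while the line-by-line parse yields three fragments, none of which contains the complete textbody value.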
I am trying to parse the following string:
"297","298","Y","","299"
using the RegexSerDe, but I am unable to do so.
The table definition I have created is:
create external table test.test1
(a string,
b string,
c string,
d string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "\"\"|\"([^\"]+)\"")
The regex used in the SerDe properties looks promising on regex test websites, but I get an exception when I try to read the table. Kindly help me out with this.
I know that this can easily be done using the CSV SerDe, but I am trying to figure out a bigger part of the problem, for which I have to use the RegexSerDe.
Thanks
The regex should contain one capturing group per column.
Your data contains 5 columns and the table has 4, so you want to skip one column, right?
For example this regex will work: with serdeproperties ('input.regex' = '^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$')
You can easily check without creating table, like this:
select regexp_replace('"297","298","Y","","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
OK
_c0
297|298|Y|299
select regexp_replace('"297","298","Y","this column is skipped","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
OK
_c0
297|298|Y|299
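The same check can be done outside Hive. Here is a minimal sketch in Python (its re module stands in for Java's regex engine, so treat it as an approximation). It compares the number of capturing groups in the question's pattern and in the suggested one against the four table columns, then applies the suggested pattern to the sample row. The variable names are illustrative only.

```python
import re

# Pattern from the question: an alternation with a single capturing group.
question_regex = re.compile(r'""|"([^"]+)"')

# Pattern from the answer: four capturing groups, fifth field skipped.
answer_regex = re.compile(r'^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$')

line = '"297","298","Y","","299"'

print(question_regex.groups)   # number of capturing groups in the question's regex
print(answer_regex.groups)     # number of capturing groups in the answer's regex
print(answer_regex.fullmatch(line).groups())
```

The group count of the question's pattern does not equal the number of table columns, which is consistent with the advice above to use one capturing group per column.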
Hi, I have created a table in Athena with the following query, which reads a CSV file from S3.
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlTblStaging (
`filename` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"'
)
LOCATION 's3://ax-large-table/AEGIntJnlTblStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');
But a value in the filename field looks like "\\emdc1fas\HR_UK\ADPFreedom_Employee_20141114_11.04.00.csv"
When I read this table, the value appears as
"\emdc1fasHR_UKADPFreedom_Employee_20141114_11.04.00.csv"
where the escape characters (backslashes) are missing from the value.
How can I read the actual value, including the backslashes?
Thanks
As long as you don't need the escaping, you can set the escape character to something unrelated (for example "|").
CREATE EXTERNAL TABLE IF NOT EXISTS axlargetable.AEGIntJnlTblStaging (
filename string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '|'
)
LOCATION 's3://ax-large-table/AEGIntJnlTblStaging/'
TBLPROPERTIES ('has_encrypted_data'='false');
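To see why changing the escape character helps, the behaviour can be imitated with Python's csv module. This is only an analogy, not the OpenCSV parser itself, and read_field is a hypothetical helper. With backslash as the escape character, each backslash is consumed and the following character is kept, which matches the output in the question; with an unrelated escape character such as "|", the backslashes pass through untouched.

```python
import csv

# The value from the question, exactly as it appears in the file.
raw = r'"\\emdc1fas\HR_UK\ADPFreedom_Employee_20141114_11.04.00.csv"'

def read_field(line, escapechar):
    """Parse a single-column CSV line with the given escape character."""
    reader = csv.reader([line], quotechar='"', escapechar=escapechar)
    return next(reader)[0]

with_backslash = read_field(raw, "\\")  # backslash treated as escape character
with_pipe = read_field(raw, "|")        # unrelated escape character

print(with_backslash)
print(with_pipe)
```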
When creating a table from a file in the Hue Hive interface, we have to specify a delimiter (tab, space, comma, etc.). But my file is delimited by one or more spaces. How can I specify a delimiter of one or more spaces?
You can create a table that uses a regex as the delimiter like this:
Data (put this file in HDFS):
1 2 3 4
a b c d
Create the table:
-- syntax for creating the table
CREATE TABLE test1(
a string,
b string,
c string,
d string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES
(
"input.regex" ="([^\\s]*)\\s+([^\\s]*)\\s+([^\\s]*)\\s+([^\\s]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
LOCATION '/test1/';
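As a quick check of the pattern, here is a small sketch in Python (used as a stand-in for Hive's Java regex engine). The input lines are made-up examples with varying numbers of spaces between the four values; the backslashes are single here because the doubling is only needed inside the Hive string literal.

```python
import re

# Same pattern as input.regex above, without the string-literal escaping.
pattern = re.compile(r"([^\s]*)\s+([^\s]*)\s+([^\s]*)\s+([^\s]*)")

# Hypothetical rows separated by one or more spaces.
rows = ["1  2   3 4", "a b    c d"]

for row in rows:
    print(pattern.fullmatch(row).groups())
```

Each row splits into four values regardless of how many spaces separate them, because \s+ absorbs the whole run of whitespace.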
Here are two lines from my log files that I'm trying to match. I'm trying to separate each line into four columns (date, hostname, command, status).
Each line is tab-delimited between date, hostname, command, and status. The status column may contain spaces.
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
03-24-2014 fm504 CHECK-LOAD OK;SOFT;2;OK - load average: 54.61, 56.95
In Rubular (http://rubular.com/) my regular expression matches exactly as I want it to; however, when I query my Hive table for the date column, I get the entire line, which leads me to believe that the regex doesn't match the way Hive expects.
([^ ]*)\s*([^ ]*)\s*([^ ]*)\s*(.*)
And this is my create table statement with results from select query:
CREATE EXTERNAL TABLE IF NOT EXISTS sys_results(
date STRING
,hostname STRING
,command STRING
,status STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*)\\s*([^ ]*)\\s*([^ ]*)\\s*(.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION '/user/sys_log_output/sys-results/';
select date from sys_results;
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
I figured it out. Hive regex recognizes tabs using '\t', so I changed my input.regex expression to this:
"input.regex" = "([^ ]*)\t([^ ]*)\t([^ ]*)\t([^ ].*)"