Unable to Parse string using Hive Regex Serde


I am trying to parse the following string:
"297","298","Y","","299"
using RegexSerDe, but I am unable to do so.
The table definition I created is:
create external table test.test1
(a string,
b string,
c string,
d string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "\"\"|\"([^\"]+)\"")
The regex used in the serde properties looks promising on regex testing websites, but I get an exception when I try to read the table. Kindly help me out with this.
I know that this can easily be done with the CSV serde, but this is part of a bigger problem for which I have to use RegexSerDe.
Thanks

The regex should contain one capturing group per column.
Your data contains 5 columns and the table has 4, so you want to skip one column, right?
For example, this regex will work: with serdeproperties ('input.regex' = '^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$')
You can easily check it without creating a table, like this:
select regexp_replace('"297","298","Y","","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
OK
_c0
297|298|Y|299
select regexp_replace('"297","298","Y","this column is skipped","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
OK
_c0
297|298|Y|299
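The same check can be sketched outside Hive. The snippet below uses Python's re module as a stand-in for the Java regex engine (these particular patterns behave the same way in both, as far as I know). It assumes, based on my understanding of RegexSerDe rather than anything stated in this thread, that the SerDe needs the pattern to match the whole row and to have one capturing group per column. The variable names are made up for the illustration.

```python
import re

line = '"297","298","Y","","299"'

# Regex from the question: a single capturing group that describes one field.
question_regex = r'""|"([^"]+)"'
# Regex from the answer: four capturing groups, covering the whole line.
answer_regex = r'^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$'

# Assumption: RegexSerDe wants one group per column and a whole-row match.
question_groups = re.compile(question_regex).groups
question_match = re.fullmatch(question_regex, line)
answer_fields = re.fullmatch(answer_regex, line).groups()

print(question_groups)  # number of capturing groups in the question's regex
print(question_match)   # result of matching the question's regex against the whole row
print(answer_fields)    # one captured value per table column
```

The question's regex describes a single field, so it has one group for a four-column table and cannot cover the full row, while the answer's regex yields exactly four values.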

Related

Hive's RegexSerDe not giving the correct output

I tried to parse the input string below using Hive's RegexSerDe, but I am not getting the expected output. I don't know whether the problem lies in my regex or in RegexSerDe. My regex works as expected in online regex testers, but it does not work in Hive's RegexSerDe. Could someone please help me understand what is going wrong here?
I am using Apache Hive 0.9.0.
Input:
1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
My Expected output:
1 Toy Story 1995 Adventure|Animation|Children|Comedy|Fantasy
My hive query:
CREATE TABLE myMovie3(
id STRING,
name STRING,
year STRING,
category STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(.*?)::(.*)\(([0-9]*)\)::(.*)$","output.format.string" = "%1$s %2$s %3$s %4$s")
STORED AS TEXTFILE;
The actual output I got from the regex is:
hive> select * from mymovie3;
OK
1 Toy Story (1995)

The regex is the cause. Although it is fine in a normal context, it reaches RegexSerDe (a Java class) through a quoted string in the DDL, so the backslashes themselves need to be escaped. Use the following:
^(.*?)::(.*)\\(([0-9]*)\\)::(.*)$
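To see why the missing escape produces exactly the output in the question, here is a rough sketch using Python's re module as a stand-in for the Java regex engine. It assumes that Hive drops the backslash from "\(" when it reads the quoted string, so the engine sees plain parentheses; that is my reading of the behaviour, not something confirmed in this thread. The variable names are only for illustration.

```python
import re

line = "1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy"

# Assumption: what the engine receives when "\(" loses its backslash.
# The parentheses turn into an extra, nested capturing group.
unescaped = r"^(.*?)::(.*)(([0-9]*))::(.*)$"
# What the engine receives when the backslashes survive ("\\(" in the DDL).
escaped = r"^(.*?)::(.*)\(([0-9]*)\)::(.*)$"

unescaped_fields = re.fullmatch(unescaped, line).groups()
escaped_fields = re.fullmatch(escaped, line).groups()

# Only the first four groups would map onto the four table columns.
print(unescaped_fields[:4])
print(escaped_fields)
```

With the unescaped pattern the first four groups are the id, "Toy Story (1995)", and two empty strings, which lines up with the "1 Toy Story (1995)" row shown in the question. With the escaped pattern the year and category land in their own groups (the name keeps a trailing space).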

Hive's string parsing using regular expression with RegexSerDe

I'm trying to use RegexSerDe to parse an input string into different attributes of a table with Hive. The original string has the format
"... (A; B) (X; Y);"
And the output I expected is
foo = "A; B" and bar = "X; Y"
(as two separate attributes in the table). The regular expression I used as "input.regex" is
> CREATE EXTERNAL TABLE test(
...
foo STRING,
bar STRING
)ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "... \(([^\)]*?)\) \(([^\)]*?)\)",
"output.format.string" = "...%4$s %5$s"
)
stored as textfile;
which parses my string correctly in other web tools I found. But the SerDe wasn't able to match the string (it returned null). Using double backslashes doesn't help.
I also tried another expression,
"input.regex" = "... \(.*\) \(.*\)"
for the last two parentheses. Hive output the parsed string as
foo = "(A; B) (X" and bar = "Y)"
as it splits my string at the last space. I guess I didn't handle my right parentheses correctly, but I wasn't able to find the correct way.
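One possible cause of the null result, offered as a guess: the sample string ends with ";" after the last ")", but the pattern stops at ")". If RegexSerDe needs the pattern to cover the entire row (my assumption, not something stated in the question), the leftover ";" would make the match fail. The sketch below uses Python's re module as a stand-in for the Java engine. The sample line is hypothetical, and ".*" stands in for the prefix that the question elides.

```python
import re

# Hypothetical sample line; "prefix" replaces the part the question elides.
line = "prefix (A; B) (X; Y);"

# Pattern in the spirit of the question, ending at the last ")".
without_tail = r".* \(([^)]*)\) \(([^)]*)\)"
# Same pattern, but also consuming whatever follows the last ")".
with_tail = r".* \(([^)]*)\) \(([^)]*)\).*"

no_tail_match = re.fullmatch(without_tail, line)
tail_fields = re.fullmatch(with_tail, line).groups()

print(no_tail_match)  # whole-row match without handling the trailing ";"
print(tail_fields)    # the two parenthesised values
```

In the DDL the backslashes would still have to be doubled, as in the other answers on this page.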

Hive Serde Regex not recognizing string pattern

Here are two lines from my log files that I'm trying to match. I'm trying to separate each line into four columns (date, hostname, command, status).
Each line is tab-delimited between date, hostname, command, and status. The status column may contain spaces.
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes
03-24-2014 fm504 CHECK-LOAD OK;SOFT;2;OK - load average: 54.61, 56.95
In Rubular (http://rubular.com/) my regex matches exactly as I want; however, when I query my Hive table for the date column, I get the entire line, which leads me to believe that the regex doesn't match what Hive is looking for.
([^ ]*)\s*([^ ]*)\s*([^ ]*)\s*(.*)
And this is my create table statement with results from select query:
CREATE EXTERNAL TABLE IF NOT EXISTS sys_results(
date STRING
,hostname STRING
,command STRING
,status STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*)\\s*([^ ]*)\\s*([^ ]*)\\s*(.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION '/user/sys_log_output/sys-results/';
select date from sys_results;
03-24-2014 fm506 TOTAL-PROCESS OK;HARD;1;PROCS OK: 717 processes

I figured it out. Hive's regex recognizes tabs using '\t', so I changed my input.regex expression to this:
"input.regex" = "([^ ]*)\t([^ ]*)\t([^ ]*)\t([^ ].*)"

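A small sketch of why the space-based pattern misbehaves on tab-delimited data, using Python's re module as a stand-in for the Java engine. The sample line is rebuilt with real tab characters, as the question describes. Note that the tab-based pattern here uses [^\t]* for each field, which is my own variation and not the exact pattern from the answer.

```python
import re

# Sample row from the question, with the fields joined by real tabs.
line = "03-24-2014\tfm506\tTOTAL-PROCESS\tOK;HARD;1;PROCS OK: 717 processes"

# Original idea: fields are "anything but a space". Tabs are not spaces,
# so the first group runs straight through them.
space_based = r"([^ ]*)\s*([^ ]*)\s*([^ ]*)\s*(.*)"
# Variation (my assumption): fields are "anything but a tab", split on tabs.
tab_based = r"([^\t]*)\t([^\t]*)\t([^\t]*)\t(.*)"

space_first = re.fullmatch(space_based, line).group(1)
tab_fields = re.fullmatch(tab_based, line).groups()

print(repr(space_first))  # first column under the space-based pattern
print(tab_fields)         # four columns under the tab-based pattern
```

Under the space-based pattern the first group swallows the date, hostname, command, and the start of the status, which is consistent with the date column showing far more than the date.
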
Apache Hive regEx serde: data types

For processing logs I want to use the Apache Hive RegexSerDe, but I have only found examples that use STRING as the data type for the table columns.
Now my question is: are date-based types, integers, and arrays supported, or is it just strings?
This example (and others) only uses strings:
CREATE TABLE access_log (
remote_ip STRING,
request_date STRING,
method STRING,
request STRING,
protocol STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]) . . [([^]]+)] \"([^ ]) ([^ ]) ([^ \"])\" *",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
)
STORED AS TEXTFILE
;

Refer to the source code of the SerDe (RegexSerDe, also available on GitHub). A comment in the code says:
All columns have to be of type STRING.
If you want to change that, write a custom SerDe (if you are comfortable with Java) and add it as a custom SerDe jar, as in the custom CSV SerDe example.
Otherwise, keep the column types as STRING, and when you need to operate on a column, cast it in the query with Hive's cast() function.
Hope this helps :)

I haven't used the RegexSerDe personally, but I do notice that there are two classes for it:
serde/src/java/org/apache/hadoop/hive/serde2/RegexSerDe.java
contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
The second one, which you are referring to, does indeed appear to be restricted to strings. The other appears to support primitive types.
For whatever reason I only see the second one referenced in the API docs.

How to handle fields enclosed within quotes(CSV) in importing data from S3 into DynamoDB using EMR/Hive

I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields which are enclosed within double quotes and separated by comma.
While creating the external table in Hive, I am able to specify the delimiter as a comma, but how do I specify that fields are enclosed within quotes?
If I don’t specify it, I see that values in DynamoDB are populated within two double quotes ““value””, which seems to be wrong.
I am using the following command to create the external table. Is there a way to specify that fields are enclosed within double quotes?
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' LOCATION 's3://emrTest/folder';
Any suggestions would be appreciated.
Thanks
Jitendra

I was stuck on the same issue, as my fields are enclosed in double quotes and separated by semicolons (;). My table name is employee1.
I searched around and found a solution that works well for this.
We have to use a SerDe for this. Please download the SerDe jar from this link: https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar
Then follow the steps below at the Hive prompt:
add jar path/to/csv-serde.jar;
create table employee1(id string, name string, addr string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\;",
"quoteChar" = "\"")
stored as textfile
;
and then load data from your given path using below query:
load data local inpath 'path/xyz.csv' into table employee1;
and then run :
select * from employee1;
Now you will see the magic. Thanks.

The following code solved the same type of problem:
CREATE TABLE TableRowCSV2(
CODE STRING,
PRODUCTCODE STRING,
PRICE STRING
)
COMMENT 'row data csv'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");

If you're stuck with the CSV file format, you'll have to use a custom SerDe; there is some existing work based on the opencsv library.
But if you can modify the source files, you can either select a new delimiter so that the quoted fields aren't necessary (good luck), or rewrite them to escape any embedded commas with a single escape character, e.g. '\', which can be specified within the ROW FORMAT with ESCAPED BY:
CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' LOCATION 's3://emrTest/folder';

Hive now includes an OpenCSVSerde, which properly parses those quoted fields without additional jars or an error-prone and slow regex.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
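To illustrate what a quote-aware CSV parser does differently from plain delimiter splitting, here is a small sketch with Python's csv module. It is only an analogy for the behaviour described above, not OpenCSVSerde itself, and the sample row is borrowed from the first question on this page.

```python
import csv

line = '"297","298","Y","","299"'

# Plain delimiter splitting keeps the quote characters in every value,
# much like FIELDS TERMINATED BY ',' would.
naive_fields = line.split(",")
# A quote-aware parser strips the enclosing quotes from each field.
parsed_fields = next(csv.reader([line], delimiter=",", quotechar='"'))

print(naive_fields)
print(parsed_fields)
```

The naive split is where the doubled quotes around values come from; the quote-aware parse returns the bare values, including the empty fourth field.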

Hive doesn't support quoted strings right out of the box. There are two approaches to solving this:
Use a different field separator (e.g. a pipe).
Write a custom InputFormat based on OpenCSV.
The faster (and arguably more sane) approach is to modify your initial export process to use a different delimiter so you can avoid quoted strings. This way you can tell Hive to use an external table with a tab or pipe delimiter:
CREATE TABLE foo (
col1 INT,
col2 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

Use the csv-serde-0.9.1.jar file in your Hive query; see
http://illyayalovyy.github.io/csv-serde/
add jar /path/to/jar_file;
-- the tblproperties clause below skips the first line of each file, in case it is a header
create external table emrS3_import_1(col1 string, col2 string, col3 string, col4 string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties
(
"separatorChar" = "\;",
"quoteChar" = "\""
) stored as textfile
location 's3://emrTest/folder'
tblproperties("skip.header.line.count"="1");

There can be multiple solutions to this problem.
Write custom SerDe class
Use RegexSerde
Remove escaped delimiter chars from data
Read more at http://grokbase.com/t/hive/user/117t2c6zhe/urgent-hive-not-respecting-escaped-delimiter-characters
