Hive's RegexSerDe not giving the correct output

I am trying to parse the input string below with Hive's RegexSerDe, but I am not getting the expected output. I don't know whether the problem lies in my regex or in RegexSerDe: the regex works as expected in online regex testers, but not in Hive's RegexSerDe. Could someone help me understand what is going wrong?
I am using Apache Hive 0.9.0.
Input:
1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy
My Expected output:
1 Toy Story 1995 Adventure|Animation|Children|Comedy|Fantasy
My hive query:
CREATE TABLE myMovie3(
id STRING,
name STRING,
year STRING,
category STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "^(.*?)::(.*)\(([0-9]*)\)::(.*)$","output.format.string" = "%1$s %2$s %3$s %4$s")
STORED AS TEXTFILE;
The actual output I get is:
hive> select * from mymovie3;
OK
1 Toy Story (1995)

The regex is the cause. The pattern itself is fine in a normal regex context, but RegexSerDe is a Java class and the pattern reaches it through a quoted string, so each backslash has to be escaped with a second backslash. Use the following:
^(.*?)::(.*)\\(([0-9]*)\\)::(.*)$
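To see why the single-backslash version misbehaves, here is a minimal Python sketch. It assumes that the quoted string drops the backslash in front of ( and ), so the regex engine effectively receives plain grouping parentheses. Python's re module stands in for Java's regex engine here, which should treat these particular patterns the same way; the variable and function names are illustrative only.

```python
import re

line = "1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy"

# Assumption: what the engine sees when "\(" loses its backslash.
# The parentheses turn into extra capturing groups.
unescaped = r"^(.*?)::(.*)(([0-9]*))::(.*)$"

# What the engine sees when the quoted string contains "\\(" and "\\)".
escaped = r"^(.*?)::(.*)\(([0-9]*)\)::(.*)$"

def groups(pattern, text):
    """Return the captured groups, or None if the line does not match."""
    m = re.match(pattern, text)
    return m.groups() if m else None

broken = groups(unescaped, line)
fixed = groups(escaped, line)

print(broken)
print(fixed)
```

With the unescaped pattern, group 2 swallows "Toy Story (1995)" and groups 3 and 4 come back empty, which is consistent with the output shown in the question. With the escaped pattern the four groups line up with the four columns, although the name keeps its trailing space.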

Select the next line of the matched pattern in clob column using oracle regular expression

I have a CLOB column "details" in table xxx. I want to select the line that follows a matched pattern using a regular expression.
The input text (CLOB data) looks like this, with each value on its own line:
MODEL_DATA 1
TEST1:
NONE
TEST2:
NONE
INFO:
SERVICES,VALUED-YES
TYPE:
NONE
I tried to use INFO as the match string and retrieve the next line of text, but could not do it with the regular expression functions. Please help me resolve this.
Expected output:
SERVICES,VALUED-YES
You can use the following to get the details:
select replace(regexp_substr(details,'INFO:'||chr(10)||'.+'),'INFO:')
from your_table;
You can also try the following to be operating-system independent (it accepts both LF and CRLF line endings):
select replace(regexp_substr(details,'INFO:('||chr(10)||'|'||chr(13)||chr(10)||').+'),'INFO:')
from your_table;
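The same idea (match the label, then a line break, then the rest of the following line) can be sketched outside the database. This is a minimal Python illustration rather than Oracle code; the helper name line_after is hypothetical.

```python
import re

def line_after(label, text):
    """Return the line that follows `label`, or None if it is absent."""
    # \r?\n accepts both Unix (LF) and Windows (CRLF) line endings;
    # [^\r\n]+ captures the rest of the following line only.
    m = re.search(re.escape(label) + r"\r?\n([^\r\n]+)", text)
    return m.group(1) if m else None

details = (
    "MODEL_DATA 1\nTEST1:\nNONE\nTEST2:\nNONE\n"
    "INFO:\nSERVICES,VALUED-YES\nTYPE:\nNONE"
)

print(line_after("INFO:", details))
```

Capturing the following line in its own group avoids having to strip the label and the line break from the result afterwards.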

Unable to Parse string using Hive Regex Serde

I am trying to parse the following string:
"297","298","Y","","299"
using RegexSerDe, but I am unable to do so.
The table definition I created is:
create external table test.test1
(a string,
b string,
c string,
d string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ("input.regex" = "\"\"|\"([^\"]+)\"")
The regex used in the SerDe properties looks promising on regex test websites, but I get an exception when I try to read the table. Kindly help me out with this.
I know this can easily be done with the CSV SerDe, but I am trying to solve a larger problem for which I have to use RegexSerDe.
Thanks
The regex needs one capturing group per column.
Your data contains 5 columns and the table has 4, so you want to skip one column, right?
For example, this regex will work: with serdeproperties ('input.regex' = '^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$')
You can easily check it without creating a table, like this:
select regexp_replace('"297","298","Y","","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
OK
_c0
297|298|Y|299
select regexp_replace('"297","298","Y","this column is skipped","299"','^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$','$1|$2|$3|$4');
OK
_c0
297|298|Y|299
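The "one capturing group per column" rule can be checked with a short Python sketch. It assumes that the SerDe needs the whole line to match and expects as many groups as the table has columns; the parse helper is hypothetical and only mimics that behaviour.

```python
import re

original = r'""|"([^"]+)"'                            # pattern from the question
suggested = r'^"(.*?)","(.*?)","(.*?)",.*?,"(.*?)"$'  # pattern from the answer

def parse(pattern, line):
    """Return one value per group, or None unless the whole line matches."""
    m = re.fullmatch(pattern, line)
    return m.groups() if m else None

row = '"297","298","Y","","299"'

# Number of capturing groups in each pattern; the table has 4 columns.
original_groups = re.compile(original).groups
suggested_groups = re.compile(suggested).groups

print(original_groups, suggested_groups)
print(parse(original, row))
print(parse(suggested, row))
```

The pattern from the question has a single group and matches only one quoted field at a time, so it cannot describe a full four-column row, whereas the suggested pattern yields exactly four values.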

Log analysis in Pig

I have a .txt file which looks like this:
2017-06-22 23:19:05,758 use database stocks
2017-06-22 23:21:27,056 CREATE TABLE stocksdata ( stock_exchange string,
stock_symbol string, date TIMESTAMP,
The regex I wrote is ^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}:\\d{2}:\\d{2}),(\\d{3})\\s((?i)(create|select|use).*)$.
But my output is
2017-06-22 23:19:05,758 use database stocks
2017-06-22 23:21:27,056 CREATE TABLE stocksdata ( stock_exchange string,
It does not pick up the continuation line of the input, viz. stock_symbol string, date TIMESTAMP,. I need to capture this line as well.
Try using the following pattern:
^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}),(\d{3})\s((?i)(create|select|use)[\s\S]*)$
I replaced the .* at the end with [\s\S]*, because . does not match newline characters by default, whereas [\s\S] matches any character, newlines included.
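The difference between .* and [\s\S]* can be seen in a small Python sketch. It assumes the whole multi-line statement reaches the regex as a single string; if the loader splits the file line by line, the continuation line arrives as a separate record and no pattern can rejoin it. The inline (?i) is replaced by the re.IGNORECASE flag.

```python
import re

record = (
    "2017-06-22 23:21:27,056 CREATE TABLE stocksdata ( stock_exchange string,\n"
    "stock_symbol string, date TIMESTAMP,"
)

prefix = r"^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}),(\d{3})\s"

# Same pattern twice; only the tail after the keyword differs.
dot_version = re.compile(prefix + r"((create|select|use).*)$", re.IGNORECASE)
any_version = re.compile(prefix + r"((create|select|use)[\s\S]*)$", re.IGNORECASE)

dot_match = dot_version.match(record)   # .* stops at the newline, so $ fails
any_match = any_version.match(record)   # [\s\S]* runs across the newline

print(dot_match)
print(any_match.group(4))
```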
Finally, this expression worked:
(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}),(\d{3})\s(\w{4})\s(.)(()(create\s|select\s|use\s).(.\s\S?\D.\s\D)*)
Thank you for the replies.

Regular Expression in redshift

My data arrives in the format below:
2016-006-011 04:58:22.058
This is an incorrect date/timestamp format, and I want to convert it to the correct one:
2016-06-11 04:58:22.058
I'm trying to achieve this using regex in Redshift. Is there a way to remove the extra zero (0) in the month and day portions using regex? I need something generic and not tailored to this example alone, as the date will vary.
The function regexp_replace() (see documentation) should do the trick:
select
  regexp_replace(
    '2016-006-011 04:58:22.058'     -- use your date column here instead
    , '\-0([0-9]{2}\-)0([0-9]{2})'  -- matches "-006-011", captures "06-" in $1, "11" in $2
    , '-$1$2'                       -- inserts $1 and $2 to give "-06-11"
  )
;
And so the result is, as required:
regexp_replace
-------------------------
2016-06-11 04:58:22.058
(1 row)
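The same substitution can be tried out quickly in Python to confirm that it generalises beyond the single example. This is only a sketch of the pattern's behaviour, not Redshift code; fix_timestamp is a hypothetical helper, and Python writes the backreferences as \1 and \2.

```python
import re

def fix_timestamp(value):
    """Drop the extra leading zero from the month and day parts."""
    # "-0(dd-)0(dd)" matches "-006-011"; the groups keep "06-" and "11".
    return re.sub(r"-0(\d{2}-)0(\d{2})", r"-\1\2", value, count=1)

print(fix_timestamp("2016-006-011 04:58:22.058"))
print(fix_timestamp("2017-012-031 23:59:59.999"))
print(fix_timestamp("2016-06-11 04:58:22.058"))
```

A value that is already well formed passes through unchanged, because the pattern requires two digits after each removed zero.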

Oracle SQL regexp date formatting

I'm new to Oracle and am trying to select a badly formatted date in cleaned-up form.
For example,
my field is: 12.05.2010 dfsafs()F(Gf, 12:45
Can I select it as 12.05.2010 12:45 with a regexp or something else?
Thanks
Use the below regex to match date and time formats.
[0-9]{2}\.[0-9]{2}\.[0-9]{4}|[0-9]{2}:[0-9]{2}
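In Oracle the curly braces should not need escaping, because its regexp functions follow POSIX extended syntax; the escaped form below would match literal braces instead: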
[0-9]\{2\}\.[0-9]\{2\}\.[0-9]\{4\}|[0-9]\{2\}:[0-9]\{2\}
Something like this should work:
select regexp_substr(dat,'.*(\d{2}\.\d{2}\.\d{4}).*',1,1,'i',1) ||' '||
regexp_substr(dat,'.*(\d{2}:\d{2}).*',1,1,'i',1) datetime
from
(select '12.05.2010 dfsafs()F(Gf, 12:45' dat from dual);
Note that I extract the date and the time separately with regexp_substr and then concatenate the two values.
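The extract-then-concatenate approach can be sketched in Python as follows. The helper name clean_datetime is hypothetical, and unlike the greedy Oracle patterns above it takes the first date and the first time it finds, which gives the same result for this input.

```python
import re

def clean_datetime(raw):
    """Return 'DD.MM.YYYY HH:MM' built from the first date and time found."""
    date = re.search(r"\d{2}\.\d{2}\.\d{4}", raw)
    time = re.search(r"\d{2}:\d{2}", raw)
    if not (date and time):
        return None  # one of the two parts is missing
    return date.group(0) + " " + time.group(0)

print(clean_datetime("12.05.2010 dfsafs()F(Gf, 12:45"))
```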