Strange behaviour of regexp_replace in a Hive SQL query

I have some input data in which I'm trying to remove the trailing .0 from ID strings that end with .0.
select student_id, regexp_replace(student_id, '.0','') from school_result.credit_records where student_id like '%.0';
Input:
01-0230984.03
12345098.0
34567.0
Expected output:
01-0230984.03
12345098
34567
But the result I'm getting is as follows. It removes every occurrence of any character followed by a 0, instead of removing only a .0 at the end of the string:
0129843
123498
34567
What am I doing wrong? Can someone please help?

In a regexp, the dot has a special meaning: it matches any character. To match a literal dot (.), escape it with a double backslash (in Hive). Also add the end-of-line anchor ($):
with mydata as (
select stack(3,
'01-0230984.03',
'12345098.0',
'34567.0'
) as str
)
select regexp_replace(str,'\\.0$','') from mydata;
Result:
01-0230984.03
12345098
34567
The regexp '\\.0$' matches a literal dot followed by a zero (.0) at the end of the line ($).
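To see the difference between the two patterns outside Hive, here is a small sketch in Python, whose `re` module treats `.` and `$` the same way for this case. Python is only a stand-in for checking the pattern logic, not the Hive engine; the function names are made up for illustration. Note that a Python raw string needs a single backslash, whereas the Hive string literal needs it doubled.

```python
import re

ids = ["01-0230984.03", "12345098.0", "34567.0"]

def strip_unescaped(s):
    # Unescaped dot: matches ANY character followed by 0, anywhere in the string.
    return re.sub(r".0", "", s)

def strip_trailing_dot_zero(s):
    # Escaped dot plus end anchor: matches a literal ".0" only at the end.
    return re.sub(r"\.0$", "", s)

for s in ids:
    print(s, "|", strip_unescaped(s), "|", strip_trailing_dot_zero(s))
```

The first function reproduces the unwanted output from the question; the second leaves 01-0230984.03 untouched.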

Related

Oracle: Special characters filter with few exceptions

I need some quick help.
I want to filter the input string and remove special characters except space( ), period(.), comma(,), hyphen(-), ampersand(&) and apostrophe(').
I am using the query below, but it also removes the hyphen (-) and ampersand (&).
SELECT REGEXP_REPLACE('*Bruce*-*Martha*-&-*Thomas%* *Wyane''s* *Enterprises* ([#Pvt,Ltd.])', '[^0-9A-Za-z,.'' ]', '')
FROM dual;
Input String: *Bruce*-*Martha*-&-*Thomas%* *Wyane's* *Enterprises* ([#Pvt,Ltd.])
What I am expecting: Bruce-Martha-&-Thomas Wyane's Enterprises Pvt,Ltd.
What I am getting: BruceMarthaThomas Wyane's Enterprises Pvt,Ltd.
Thanks.
You may use
SELECT REGEXP_REPLACE('*Bruce*-*Martha*-&-*Thomas%* *Wyane''s* *Enterprises* ([#Pvt,Ltd.])', '[^&0-9A-Za-z,.'' -]+', '') FROM dual
The [^&0-9A-Za-z,.'' -]+ pattern will match one or more occurrences of any char but &, ASCII letter, digit, comma, dot, single apostrophe, space and hyphen.
To support any whitespace, replace the literal space with [:space:]:
'[^&0-9A-Za-z,.''[:space:]-]+'
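As a quick check of the character-class logic, here is a sketch in Python rather than Oracle. It assumes Python's `re` module as a stand-in: Python has no `[:space:]` POSIX class, so `\s` is used as the closest equivalent, and the function names are hypothetical.

```python
import re

text = "*Bruce*-*Martha*-&-*Thomas%* *Wyane's* *Enterprises* ([#Pvt,Ltd.])"

def original_filter(s):
    # The pattern from the question: hyphen and ampersand are not in the
    # negated class, so they get removed along with the other symbols.
    return re.sub(r"[^0-9A-Za-z,.' ]", "", s)

def fixed_filter(s):
    # & added to the class, and - placed last so it is a literal hyphen.
    return re.sub(r"[^&0-9A-Za-z,.' -]+", "", s)

def fixed_filter_any_whitespace(s):
    # \s stands in for Oracle's [:space:] here (an assumption of this sketch).
    return re.sub(r"[^&0-9A-Za-z,.'\s-]+", "", s)

print(original_filter(text))
print(fixed_filter(text))
```

Putting the hyphen at the end of the bracket expression keeps it from being read as a range operator.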

Regex match everything after first and until 2nd occurrence of a slash

Need to match everything after the first / and until the 2nd / or end of string. Given the following examples:
/US
/CA
/DE/Special1
/FR/Special 1/special2
Need the following returned:
US
CA
DE
FR
Was using this in DataStudio which worked:
^(.+?)/
However the same in BigQuery is just returning null. After trying dozens of other examples here, decided to ask myself. Thanks for your help.
For such a simple extraction, consider using cheaper string functions instead of more expensive regexp functions. See the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '/US' line UNION ALL
SELECT '/CA' UNION ALL
SELECT '/DE/Special1' UNION ALL
SELECT '/FR/Special 1/special2'
)
SELECT line, SPLIT(line, '/')[SAFE_OFFSET(1)] value
FROM `project.dataset.table`
with the result:
Row  line                    value
1    /US                     US
2    /CA                     CA
3    /DE/Special1            DE
4    /FR/Special 1/special2  FR
Your regex matches any 1 or more chars as few as possible at the start of a string (up to the first slash) and puts this value in Group 1. Then it consumes a / char. It does not actually match what you need.
You can use a regex in BigQuery that matches a string partially and capture the part you need to get as a result:
/([^/]+)
It will match the first occurrence of a slash followed by one or more chars other than a slash, and the captured substring is what you get as the result.
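The split-based and regex-based approaches can be compared in a short Python sketch. This is only an illustration of the logic, assuming Python's `re` module in place of BigQuery's regex engine; the function names are invented for the example.

```python
import re

lines = ["/US", "/CA", "/DE/Special1", "/FR/Special 1/special2"]

def by_split(line):
    # Same idea as SPLIT(line, '/')[SAFE_OFFSET(1)]: take the second piece.
    parts = line.split("/")
    return parts[1] if len(parts) > 1 else None

def by_regex(line):
    # First slash, then capture one or more non-slash characters.
    m = re.search(r"/([^/]+)", line)
    return m.group(1) if m else None

def original_pattern(line):
    # The pattern from the question: needs at least one char BEFORE a slash.
    m = re.search(r"^(.+?)/", line)
    return m.group(1) if m else None

for line in lines:
    print(line, by_split(line), by_regex(line), original_pattern(line))
```

In this sketch the original pattern finds nothing for /US and /CA, because those strings have no character in front of a slash, and it keeps the leading slash for the longer inputs.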

REGEXP_SUBSTR in Teradata

I have data in a column like XXX/XXXX/XXXX/XYYUX/YYY. I am trying to extract only the first two characters after the 3rd slash (/) in the column, which is 'XY' in this example. Can you please help?
Thanks!
Try this:
REGEXP_SUBSTR('XXX/XXXX/XXXX/XYYUX/YYY','^([^/]*/){3}\K..',1,1,'i')
'^' start of string
'([^/]*/){3}' looks for 0 or more non-slashes followed by a slash, 3 times
'\K' match reset operator drops the part of the string that has been matched up to this point
'..' grabs the next two characters in the string
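The same idea can be checked in Python, with one substitution: Python's `re` module does not support `\K`, so this sketch uses a capturing group for the two characters instead. It is an equivalent for this input under that assumption, not the Teradata call itself, and the function name is hypothetical.

```python
import re

def two_chars_after_third_slash(s):
    # (?:[^/]*/){3} consumes three "non-slashes followed by a slash" runs;
    # the capturing group (..) plays the role of \K.. in the pattern above.
    m = re.match(r"(?:[^/]*/){3}(..)", s)
    return m.group(1) if m else None

print(two_chars_after_third_slash("XXX/XXXX/XXXX/XYYUX/YYY"))
```

Strings with fewer than three slashes simply produce no match.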
Try using STRTOK:
STRTOK('/88/209/89/132]', ' /]', 3)
This returns the 3rd octet, '89'.

Regex in Oracle PL/SQL to remove unwanted characters from a string containing a phone number

I need to remove the characters -, +, (, ), and space from a string in Oracle. The other characters in the string will all be numbers.
The function that can do this is REGEXP_REPLACE. I need help writing the correct regex.
Examples:
string '23+(67 -90' should return '236790'
string '123456' should return '123456'
Something like
with data as (
  select 'abc123def456' str from dual union all
  select '23+(67 -90' from dual union all
  select '123456' from dual
)
select str,
       regexp_replace( str, '[^[:digit:]]', null ) just_numbers
  from data;
STR          JUST_NUMBERS
------------ --------------------
abc123def456 123456
23+(67 -90   236790
123456       123456
should do it. This will remove any non-digit character from the string.
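For readers who want to try the "remove every non-digit" idea without a database, here is a minimal Python sketch. It assumes `[^0-9]` as the counterpart of `[^[:digit:]]`; the function name is made up for illustration.

```python
import re

samples = ["abc123def456", "23+(67 -90", "123456"]

def just_numbers(s):
    # Replace every character that is not an ASCII digit with nothing.
    return re.sub(r"[^0-9]", "", s)

for s in samples:
    print(repr(s), "->", just_numbers(s))
```

Note that this also strips letters, which goes beyond the five characters listed in the question.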
regexp_replace is an amazing function, but it is a bit difficult.
You can use the TRANSLATE function to replace multiple characters within a string. TRANSLATE differs from REPLACE in that TRANSLATE performs single-character, one-to-one substitution, while REPLACE replaces one string with another.
Example:
SELECT TRANSLATE('23+(67 -90', '1-+() ', '1') "Replaced" FROM DUAL;
Output:
236790
In this example, '1' is replaced with '1', and the characters '-+() ' (including the space) are removed, since no corresponding characters are provided for them in the 'to' string.
This statement also answers your question without the use of regexp.
You might think you could pass an empty string as the last argument, but that doesn't work: Oracle treats an empty string as NULL, and when TRANSLATE receives a NULL argument it returns NULL, so we don't get the desired result.
So I use REPLACE if I need to replace one character, but TRANSLATE if I want to replace multiple characters.
Source: https://decipherinfosys.wordpress.com/2007/11/27/removing-un-wanted-text-from-strings-in-oracle/
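The character-by-character removal that TRANSLATE performs has a close analogue in Python's `str.translate`, sketched here purely as an illustration of the idea. Unlike Oracle, Python accepts an explicit "delete these characters" argument, so no dummy '1' to '1' mapping is needed; the names below are hypothetical.

```python
# Third argument of str.maketrans lists the characters to delete.
REMOVE = str.maketrans("", "", "-+() ")

def strip_phone_punctuation(s):
    # Deletes only the listed characters; everything else is kept as-is.
    return s.translate(REMOVE)

print(strip_phone_punctuation("23+(67 -90"))
```

Because only the listed characters are removed, letters and other symbols pass through unchanged.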
Search for \D or [-+() ] and replace with an empty string ''.
regexp_replace is an amazing function; it saves a lot of time when stripping the letters from an alphanumeric string to convert it to a number.

How to make regular expression correctly?

I need to get the data between the third and fourth occurrence of "*". This is what I do:
with t as (select 'T*76031*12558*test*received percents' as txt from dual)
select regexp_replace(txt, '.*(.{4})[*][^*].*$', '\1')
from t
I receive "test", which is right, but how do I get any number of characters, not just 4?
This should work given the example you have used:
REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
So the SELECT would be:
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
FROM t;
The regex looks for:
Group 1:
Start of string. Any number of characters up to a '*'. Any further characters up to another '*'. Any further characters up to the third '*'.
Group 2:
Any alphanumeric characters
Group 3:
A '*' followed by any other characters up to the end of the string.
Replace all of the above with whatever was found in Group 2.
Hope this helps.
EDIT:
Following on from a great answer from another thread by Rob van Wijk here:
Exracting substring from given string
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_SUBSTR( txt,'[^\*]+',1,4)
FROM t;
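The "fourth run of non-star characters" idea behind this REGEXP_SUBSTR call can be sketched in Python with `re.findall`. This is an illustration of the pattern logic only, with a made-up function name. One thing the sketch makes visible: empty fields are skipped, so the count shifts if two stars are adjacent.

```python
import re

def nth_token(txt, n):
    # Collect every maximal run of non-* characters, then pick the n-th one.
    tokens = re.findall(r"[^*]+", txt)
    return tokens[n - 1] if len(tokens) >= n else None

print(nth_token("T*76031*12558*test*received percents", 4))
# With an empty second field, the 4th non-empty run is no longer "test".
print(nth_token("T**12558*test*x", 4))
```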
How about the following?
^([^*]*[*]){3}([^*]*)
The first part matches three runs of non-* characters, each followed by a *, and the second part captures everything up to the next * or the end of the line.
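A small Python sketch of this pattern, assuming Python's `re` module behaves like the target engine for these constructs. The function name is hypothetical. Because `[^*]*` may match nothing, empty fields still count toward the three stars.

```python
import re

def fourth_field(txt):
    # Group 1 repeats three times ("anything but *" then "*");
    # group 2 is whatever follows, up to the next * or the end.
    m = re.match(r"^([^*]*[*]){3}([^*]*)", txt)
    return m.group(2) if m else None

print(fourth_field("T*76031*12558*test*received percents"))
```

If the string has fewer than three stars there is no match at all, so the function returns None in that case.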
You are assuming that the last * of your text is also the fourth. If this assumption is true, then this:
\b\w*\b(?=\*[^*]*$)
Will get you what you want. But of course this only matches the last word between * before the last star. It only matches test in this case or whatever word characters are inside the *.
Note: 10g REGEXP_SUBSTR doesn't support returning subexpressions, see comments below.
If you are really only selecting a part of the string I recommend using REGEXP_SUBSTR instead. I don't know if it's more efficient, but it will better document your intent:
SQL> select regexp_substr('T*76031*12558*test*received percents',
'^([^*]*[*]){3}([^*]*)', 1, 1, '', 2) from dual;
REGEXP_SUBST
------------
test
Above I have used regexp provided by Pieter-Bas.
See also http://www.regular-expressions.info/oracle.html