Oracle Database, extract string being between two other strings - regex

I need a regexp that, combined with regexp_substr(), would give me the word between two other specified words.
Example:
source_string => 'First Middle Last'
substring varchar2(100);
substring := regexp_substr(source_string, 'First (.*) Last'); <=== this doesn't work :(
dbms_output.put_line(substring) ===> output should be: 'Middle'
I know it looks simple and, to be honest, at the beginning I thought the same.
But now, after spending about 3 hours searching for a solution, I give up...

It's not working because the literal strings 'First' and 'Last' are being looked for. Assuming that the strings don't all literally begin 'First' you need to find another way to represent them. You've already done this by representing 'Middle' as (.*)
The next point is that you need to extract a sub-expression (the part in parentheses); this is selected by the 6th parameter of REGEXP_SUBSTR().
If you put these together then the following gives you what you want:
regexp_substr(source_string, '.*\s(.*)\s.*', 1, 1, 'i', 1)
An example of it working:
SQL> select regexp_substr('first middle last', '.*\s(.*)\s.*', 1, 1, 'i', 1)
2 from dual;
REGEXP
------
middle
You can also use an online regex tester to validate that 'middle' is the only captured group.
Depending on what your actual source strings look like you may not want to search for exactly spaces, but use \W (a non-word character) instead.
If you're expecting exactly three words I'd also anchor your expression to the start and end of the string: ^.*\s(.*)\s.*$
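To illustrate the capture-group idea outside Oracle, here is a minimal JavaScript sketch. The helper name between is hypothetical, and JavaScript's regex dialect differs from Oracle's, so this only mirrors the logic of asking for sub-expression 1 instead of the whole match:

```javascript
// Hypothetical helper: return the text captured between two literal words.
function between(source, first, last) {
  // Escape regex metacharacters so the delimiters are treated literally.
  const esc = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // "i" mirrors the case-insensitive match parameter used in the SQL above.
  const re = new RegExp(esc(first) + "\\s+(.*?)\\s+" + esc(last), "i");
  const m = re.exec(source);
  // m[0] is the whole match; m[1] is the first capture group.
  return m ? m[1] : null;
}

console.log(between("First Middle Last", "First", "Last")); // Middle
```

Returning m[0] here would give the whole 'First Middle Last' string, which is the same effect as calling regexp_substr() without the sub-expression argument.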

If the source string always looks the same, i.e. consists of 3 elements (words), then a simple regular expression like this does the job:
SQL> with t (str) as
2 (select 'First Middle Last' from dual)
3 select regexp_substr(str, '\w+', 1, 2) result from t;
RESULT
------
Middle
SQL>
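The same "take the second run of word characters" idea can be sketched in JavaScript. The function name nthWord is made up for this illustration, and n is 1-based like the occurrence argument of regexp_substr:

```javascript
// Hypothetical helper: return the n-th (1-based) run of word characters.
function nthWord(str, n) {
  // With the g flag, match() returns every non-overlapping occurrence.
  const words = str.match(/\w+/g) || [];
  return n >= 1 && n <= words.length ? words[n - 1] : null;
}

console.log(nthWord("First Middle Last", 2)); // Middle
```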

The (\S*) pattern can be used with regexp_replace and regexp_substr as follows to get the middle word:
with t(str) as
(
select 'First Middle Last' from dual
)
select regexp_substr(trim(regexp_replace(str, '^(\S*)', '')),'(\S*)')
as "Result String"
from t;
Result String
-------------
Middle
The inner regexp_replace removes the first word, First; the outer regexp_substr then returns the first remaining word, which leaves out Last.
Or, more directly, you can use regexp_replace alone:
with t(str) as
(
select 'First Middle Last' from dual
)
select regexp_replace(str,'(.*) (.*) (.*)','\2')
as "Result String"
from t;
Result String
-------------
Middle

Related

Get a match when there are duplicate letters in a string

I have a list of inputs in google sheets,
Input        Desired Output   Repeated letters (for demonstration only, not an input)
Outdoors     Match            o
dog          No Match
step         No Match
bee          Match            e
Chessboard   Match            s
Cookbooks    Match            o, k
How do I check, without splitting the string, whether all of its letters are unique?
In other words, if one or more letters occur twice or more in the string, return TRUE.
My process so far
I tried this solution, in addition to splitting the string and dividing the length of the string by the COUNTA of its unique letters: if = 1 "Match", else "No match"
Or using regex
I found a way to match a letter that occurs twice (consecutively) in a string, demonstrated here with REGEXEXTRACT. But what I need is to get TRUE whenever the letters in the string are not unique.
=REGEXEXTRACT(A1,"o{2}?")
Returns:
oo
Something like this would do
=REGEXMATCH(Input,"(anyletter){2}?")
OR like this
=REGEXMATCH(lower(A6),"[a-zA-Z]{2}?")
Notes
The third column, "Column C," is only for demonstration and not for input.
The match is case insensitive
The string shouldn't need to be split, to avoid heavy calculation (I have long lists)
Avoid using lambda and its helper functions (see why?)
It's OK to return TRUE or FALSE instead of Match or No Match to keep it simple.
More examples
Input            Desired Output
Professionally   Match
Attractiveness   Match
Uncontrollably   Match
disreputably     No Match
Recommendation   Match
Interrogations   Match
Aggressiveness   Match
doublethinks     No Match
You are explicitly asking for an answer using a single regular expression. Unfortunately there is no such thing as a backreference to a former capture group using RE2. So if you'd spell out the answer to your problem it would look like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)")))
Since you are looking for case-insensitive matching (?i) modifier will help to cut down the options to just the 26 letters of the alphabet. I suppose the above can be written a bit neater like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")")))
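To make it easier to see what the TEXTJOIN/REPLACE/REPT construction produces, here is a small JavaScript sketch that builds the same alternation string. The function name buildRepeatPattern is hypothetical. Because older JavaScript engines do not accept the inline (?i) modifier, the sketch passes the i flag separately, which is a deviation from the RE2 pattern above:

```javascript
// Hypothetical helper: build "(?:A.*A|B.*B|...|Z.*Z)" like the Sheets formula does.
function buildRepeatPattern() {
  const parts = [];
  for (let code = 65; code <= 90; code++) { // char codes for A..Z
    const ch = String.fromCharCode(code);
    parts.push(ch + ".*" + ch); // same letter twice with anything in between
  }
  return "(?:" + parts.join("|") + ")";
}

// Case-insensitive, standing in for the (?i) prefix of the RE2 pattern.
const repeatRe = new RegExp(buildRepeatPattern(), "i");

console.log(repeatRe.test("Outdoors"), repeatRe.test("dog"));
```

No backreference is involved, which is why this shape of pattern is usable in RE2.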
EDIT 1:
The only other reasonable way to do this with a single regex other than the above (until I learned about the PREG-supported syntax of the matches clause in QUERY() by #DoubleUnary) is to create your own UDF in GAS (AFAIK). It's going to be JavaScript-based, thus supporting backreferences. GAS is not my forte, but a simple example could be:
function REGEXMATCH_JS(s) {
  if (s.map) {
    return s.map(REGEXMATCH_JS);
  } else {
    return /([a-z]).*?\1/gi.test(s);
  }
}
The pattern ([a-z]).*?\1 means:
([a-z]) - Capture a single character in range a-z;
.*?\1 - Look for 0+ (lazy) characters up to a copy of this 1st captured character with a backreference.
The match is global and case-insensitive. You can now call:
=INDEX(IF(A2:A="","",REGEXMATCH_JS(A2:A)))
EDIT 2:
For those that are benchmarking speed, I am not testing this myself but maybe this would speed things up:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)"))
Or:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")"))
Or:
=REGEXMATCH_JS(A2:INDEX(A:A,COUNTA(A:A)))
Respectively, knowing there is a header in the 1st row.
Benchmark:
Created a benchmark here.
Methodology:
Use NOW() to create a timestamp, when checkbox is clicked.
Use NOW() to create another timestamp, when the last row is filled and the checkbox is on.
The difference between those two timestamps gives time taken for the formula to complete.
The sample is a random data created from Math.random between [A-Za-z] with 10 characters per word.
Results:
Formula                          Round1    Round2    Avg       % Slower than best
Sample size                      10006
[re2](a.*a|b.*b)JvDv             0:00:19   0:00:19   0:00:19   -15.15%
[re2+recursion]MASTERMATCH_RE2   0:00:27   0:00:24   0:00:26   -54.55%
[Find+recursion]MASTERMATCH      0:00:17   0:00:16   0:00:17   0.00%
[PREG]Doubleunary                0:00:57   0:00:53   0:00:55   -233.33%
Conclusion:
This varies greatly based on browser/device/mobile app and on non-randomized sample data. But I found PREG to be consistently slower than re2
Use recursion.
This seems to be much faster than the regex-based approach. Create a named function:
Name:
MASTERMATCH
Arguments(in this order):
word
The word to check
start
Starting at
Function:
=IF(
  MID(word,start,1)="",
  FALSE,
  IF(
    ISERROR(FIND(MID(word,start,1),word,start+1)),
    MASTERMATCH(word,start+1),
    TRUE
  )
)
Usage:
=ARRAYFORMULA(MASTERMATCH(A2:INDEX(A2:A,COUNTA(A2:A)),1))
Or without case sensitivity
=ARRAYFORMULA(MASTERMATCH(lower(A2:A),1))
Explanation:
It recurses through each character using MID and checks whether the same character is available after this position using FIND. If so, returns true and doesn't check anymore. If not, keeps checking until the last character using recursion.
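The recursion described above can be sketched in JavaScript as follows. This is an illustration of the logic only, not the named function itself: indexOf stands in for FIND, indices are 0-based instead of 1-based, and the name masterMatch is chosen here for the example.

```javascript
// Sketch of the MID/FIND recursion: does any character reappear later in the word?
function masterMatch(word, start = 0) {
  // Past the last character: nothing repeated.
  if (start >= word.length) return false;
  // Look for the current character anywhere after its own position.
  if (word.indexOf(word[start], start + 1) !== -1) return true;
  // Otherwise move on to the next character.
  return masterMatch(word, start + 1);
}

console.log(masterMatch("abacus"), masterMatch("dog"));
```

Like FIND, indexOf is case-sensitive, so the input needs to be lowercased first for a case-insensitive check.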
Or with regex,
Create a named function:
Name:
MASTERMATCH_RE2
Arguments(in this order):
word
The word to check
start
Starting at
Function:
IF(
  MID(word,start,1)="",
  FALSE,
  IF(
    REGEXMATCH(word,MID(word, start, 1)&"(?i).*"&MID(word,start,1)),
    TRUE,
    MASTERMATCH_RE2(word,start+1)
  )
)
Usage:
=ARRAYFORMULA(MASTERMATCH_RE2(A2:A,1))
Or
=ARRAYFORMULA(MASTERMATCH_RE2(lower(A2:A),1))
Explanation:
It recurses through each character and creates a regex for that character. Instead of a.*a, b.*b,..., it takes the first character(using MID), eg: o in outdoor and creates a regex o.*o. If regex is positive for that regex (using REGEXMATCH), returns true and doesn't check for other letters or create other regexes.
This uses lambda, but it's efficient. Loop through each row and every character with MAP and REDUCE. Remove every occurrence of each character from the word with REGEXREPLACE and compare the lengths. If the difference is more than 1, stop checking and return Match
=MAP(
  A2:INDEX(A2:A,COUNTA(A2:A)),
  LAMBDA(_,
    REDUCE(
      "No Match",
      SEQUENCE(LEN(_)),
      LAMBDA(a,c,
        IF(a="Match",a,
          IF(
            LEN(_)-LEN(
              REGEXREPLACE(_,"(?i)"&MID(_,c,1),)
            )>1,
            "Match",a
          )
        )
      )
    )
  )
)
If you do run into lambda limitations, remove the MAP and drag fill the REDUCE formula.
=REDUCE("No Match",SEQUENCE(LEN(A2)),LAMBDA(a,c,IF(a="Match",a,IF(LEN(A2)-LEN(REGEXREPLACE(A2, "(?i)"&MID(A2,c,1),))>1,"Match",a))))
The latter is preferred for conditional formatting as well.
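The length-difference idea behind the REDUCE formula can be sketched in JavaScript like this. The function name is hypothetical, and split/join is used in place of REGEXREPLACE so that no character needs regex escaping:

```javascript
// Sketch: a character is repeated if removing all its occurrences
// shortens the word by more than one character.
function hasRepeatByRemoval(word) {
  const lower = word.toLowerCase(); // stands in for the (?i) flag
  for (const ch of lower) {
    const without = lower.split(ch).join(""); // drop every occurrence of ch
    if (lower.length - without.length > 1) return true; // stop at first repeat
  }
  return false;
}

console.log(hasRepeatByRemoval("Outdoors"), hasRepeatByRemoval("step"));
```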
As Daniel Cruz said, Google Sheets functions such as regexmatch(), regexextract() and regexreplace() use RE2 regexes that do not support backreferences. However, the query() function uses Perl Compatible Regular Expressions that do support named capture groups and backreferences:
=arrayformula(
iferror( not( iserror(
match(
to_text(A3:A),
query(lower(unique(A3:A)), "where Col1 matches '.*?(?<char>.).*?\k<char>.*' ", 0),
0
)
) / (A3:A <> "") ) )
)
In my limited testing with a sample size of 1000 heterograms, pangrams, words with diacritic letters, and 10-character pseudo-random unique values from TheMaster's corpus, this PREG formula ran at about half the speed of the JvdV2 RE2 regex.
With Osm's sample of 50,000 highly repetitive sample values, the formula ran at 8x the speed of JvdV2.
A PREG regex is slower than a RE2 regex, but has the benefit that you can more easily check all characters for repeats. This lets you work with corpuses that include diacritic letters, numbers and other non-English alphabet characters:
Input            Output
Professionally   TRUE
disreputably     FALSE
Abacus           TRUE
Élysée           TRUE
naïve Ï          TRUE
määräävä         TRUE
121              TRUE
123              FALSE
You can also easily state which specific characters to check by replacing <char>. with something like <char>[\wéäåö] or <char>[^-;,.\s\d].
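JavaScript regexes also support named capture groups and \k<name> backreferences, so the core of the matches pattern can be tried out there. This is only a sketch: the names hasRepeat and hasRepeatedLetter are made up, and JavaScript's engine is not the one QUERY() uses, so behaviour may differ in edge cases.

```javascript
// Any character followed, possibly later, by a copy of itself.
const repeatedChar = /(?<char>.).*?\k<char>/u;
// Variant restricted to a-z, analogous to replacing <char>. with <char>[...].
const repeatedLetter = /(?<char>[a-z]).*?\k<char>/u;

function hasRepeat(value) {
  // Lowercasing mirrors the lower() call in the formula.
  return repeatedChar.test(String(value).toLowerCase());
}

function hasRepeatedLetter(value) {
  return repeatedLetter.test(String(value).toLowerCase());
}

console.log(hasRepeat("Abacus"), hasRepeat(123), hasRepeatedLetter("121"));
```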
try:
=INDEX(IF(IFERROR(LEN(REGEXREPLACE(A1:A6, "[^"&C1:C6&"]", )), -1)>=
(LEN(SUBSTITUTE(C1:C6, "|", ))*2), "Match", "No Match"))
update
create a query heat map, filter it and vlookup back row position
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&REGEXEXTRACT(a, REPT("(.)", LEN(a)))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
case insensitive would be:
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&LOWER(REGEXEXTRACT(a, REPT("(.)", LEN(a))))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
Just to illustrate another method - not likely to be scalable - try to substitute the second occurrence of the letter:
=ArrayFormula(if(isnumber(xmatch(len(A2)-1,len(substitute(upper(A2),char(sequence(1,26,65)),"",2)))),"Match","No match"))
If splitting were permitted, I would favour use of Frequency for speed, e.g.
=ArrayFormula(max(frequency(code(mid(upper(A2),sequence(len(A2)),1)),sequence(1,26,65)))>1)
You can try the regex /(\w).*?\1/g. Note that it relies on a backreference, which RE2 (and therefore REGEXMATCH in Google Sheets) does not support, so it has to run in an engine that does, such as JavaScript in Apps Script.
Explanation :
(\w) - matches a word character (a-z, A-Z, 0-9, _). If you are sure that the input will contain only letters, you can also use ([a-zA-Z]); then
.*? - zero or more characters (the ? makes the quantifier lazy; because it can match zero characters, the repeat may be adjacent or not); until
\1 - it finds a repeat of the first matched character.
Live Demo : regex101
Coming after the battle ^^ Why not simply compare the number of unique letters in the string and its original length ?
=COUNTUNIQUE(split(regexreplace(A2;"(.)"; "$1_"); "_")) < LEN(A2)
All my tests seem fine.
(split() provided by this answer)
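The same comparison of unique characters against total length is a one-liner in JavaScript using a Set. The function name hasDuplicateChar is hypothetical; like the formula above, it is case-sensitive unless the input is lowercased first:

```javascript
// Sketch: fewer unique characters than characters means something repeats.
function hasDuplicateChar(str) {
  // Both Set(str) and [...str] iterate by code point, so the counts are comparable.
  return new Set(str).size < [...str].length;
}

console.log(hasDuplicateChar("bee"), hasDuplicateChar("dog"));
```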

Regex match everything after first and until 2nd occurrence of a slash

Need to match everything after the first / and until the 2nd / or end of string. Given the following examples:
/US
/CA
/DE/Special1
/FR/Special 1/special2
Need the following returned:
US
CA
DE
FR
Was using this in DataStudio which worked:
^(.+?)/
However, the same regex in BigQuery just returns null. After trying dozens of other examples here, I decided to ask myself. Thanks for your help.
For such simple extraction - consider alternative of using cheaper string functions instead of more expensive regexp functions. See an example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT '/US' line UNION ALL
SELECT '/CA' UNION ALL
SELECT '/DE/Special1' UNION ALL
SELECT '/FR/Special 1/special2'
)
SELECT line, SPLIT(line, '/')[SAFE_OFFSET(1)] value
FROM `project.dataset.table`
with result
Row line value
1 /US US
2 /CA CA
3 /DE/Special1 DE
4 /FR/Special 1/special2 FR
Your regex matches any 1 or more chars as few as possible at the start of a string (up to the first slash) and puts this value in Group 1. Then it consumes a / char. It does not actually match what you need.
You can use a regex in BigQuery that matches a string partially and capture the part you need to get as a result:
/([^/]+)
It will match the first occurrence of a slash followed with one or more chars other than a slash placing the captured substring in the result you get.
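The difference between the two patterns is easy to check in any regex engine. Here is a short JavaScript sketch (not BigQuery; the function name firstSegment is made up) that applies the capturing pattern to the sample values:

```javascript
// Sketch: capture the run of non-slash characters after the first slash.
function firstSegment(path) {
  const m = /\/([^\/]+)/.exec(path);
  return m ? m[1] : null; // group 1 holds the part between the slashes
}

const samples = ["/US", "/CA", "/DE/Special1", "/FR/Special 1/special2"];
console.log(samples.map(firstSegment));

// The original ^(.+?)/ needs a slash AFTER at least one character,
// so it finds nothing at all in a value such as "/US".
console.log(/^(.+?)\//.test("/US")); // false
```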

Why do I get empty response for regexp_matches function while using positive lookahead (?=...)

Why does the following code return just empty braces, {""}? How can I make it return the matching strings?
SELECT regexp_matches('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT','(?=..CAA)','g');
Expected output is:
regexp_matches
----------------
{GCCAA}
{AACAA}
{AACAA}
{GTCAA}
(4 rows)
but instead it returns the following:
regexp_matches
----------------
{""}
{""}
{""}
{""}
(4 rows)
I actually have a bit more complicated query, which requires positive lookahead in order to cover all occurrences of patterns in the string even if they overlap.
Well, it's not pretty, but you can do it without regular expressions or custom functions.
WITH data(d) as (
SELECT * FROM (VALUES ('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT')) v
)
SELECT substr(d, x, 5) AS match
FROM data
JOIN LATERAL (SELECT generate_series(1, length(d))) g(x) ON TRUE
WHERE substr(d, x, 5) LIKE '__CAA'
;
match
-------
GCCAA
AACAA
AACAA
GTCAA
(4 rows)
Basically, get each five letter slice of the string and see if it matches __CAA.
You could change generate_series(1, length(d)) to generate_series(1, length(d)-4) because the last ones will never match, but you would have to remember to update this if the length of your matching string changes.
Using a lookahead has the problem that the lookahead itself is not part of the match but it allows overlapping searches
Without using a lookahead, you lose the ability for overlapping searches.
Using PowerShell, you can loop over the indexes returned from the lookaheads and use each one as an index into your search string to get the matches
$string = 'ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT'
$r = [regex]::new('(?=..CAA)')
$r.Matches($string) | % {$string.Substring($_.Index, 5)}
returns
GCCAA
AACAA
AACAA
GTCAA
I don't know how to translate this to PostgreSQL (or if that's even possible)
update:
Apparently it won't capture inside of an assertion, that's ok because
what you really need is the first 2 characters, which can safely be
consumed. It will only give you the first 2 characters per row, but
since you know the last 3, you can easily join the set elements
with the CAA constant.
Try this
..(?=CAA)
and you're done.
If I knew the bizarre sql language, I could show you how to do the join.
Output should now be
match
-------
GC
AA
AA
GT
(4 rows)
This is the regex you need for overlapped matches.
(?=(..CAA))
https://regex101.com/r/eJ36zb/1
I think you just need this sql statement which captures group 1:
SELECT regexp_matches('ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT','(?=(..CAA))','g');
Formatted regex
(?=
( . . CAA ) # (1)
)
The reason you got empty strings in your result is that
you didn't give the expression anything to consume and
nothing to capture.
I.e., it matched at the right places, but nothing was consumed or captured.
So, doing it this way allows the overlap and the capture so it
should show up on the output now.
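The capture-inside-a-lookahead technique can be checked in JavaScript, whose matchAll also steps past zero-width matches. This is a sketch in a different engine than PostgreSQL, with a made-up function name:

```javascript
// Sketch: the lookahead consumes nothing, so matches may overlap,
// while group 1 inside it still records the five characters.
function overlappingMatches(text) {
  return [...text.matchAll(/(?=(..CAA))/g)].map((m) => m[1]);
}

const dna = "ATGCATGCATGCCAACAACAACCTGTCAAGTGAGT";
console.log(overlappingMatches(dna)); // [ 'GCCAA', 'AACAA', 'AACAA', 'GTCAA' ]
```

Here m[0] is the empty string for every match, which corresponds to the {""} rows in the question.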
Lookahead is a zero-width assertion. It doesn't consume any characters, so the match itself is empty. If you change your regular expression to just a regular match/capture, you'll get a result. For matching any two characters that are followed by CAA in your case, lookahead probably isn't necessary.

Regex in Oracle PL/SQL to remove unwanted characters from a string containing a phone number

I need to remove the characters -, +, (, ), and space from a string in Oracle. The other characters in the string will all be numbers.
The function that can do this is REGEXP_REPLACE. I need help writing the correct regex.
Examples:
string '23+(67 -90' should return '236790'
string '123456' should return '123456'
Something like
SQL> ed
Wrote file afiedt.buf
1 with data as (
2 select 'abc123def456' str from dual union all
3 select '23+(67 -90' from dual union all
4 select '123456' from dual
5 )
6 select str,
7 regexp_replace( str, '[^[:digit:]]', null ) just_numbers
8* from data
SQL> /
STR JUST_NUMBERS
------------ --------------------
abc123def456 123456
23+(67 -90 236790
123456 123456
should do it. This will remove any non-digit character from the string.
regexp_replace is an amazing function, but it is a bit difficult.
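For comparison, the same "drop everything that is not a digit" replacement looks like this in JavaScript, where \D plays the role of [^[:digit:]] (the function name digitsOnly is made up for the sketch):

```javascript
// Sketch: remove every non-digit character from the string.
function digitsOnly(str) {
  return str.replace(/\D/g, ""); // g flag: replace all occurrences
}

console.log(digitsOnly("23+(67 -90")); // 236790
```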
You can use TRANSLATE function to replace multiple characters within a string. The way TRANSLATE function differs from REPLACE is that, TRANSLATE function provides single character one to one substitution while REPLACE allows you to replace one string with another.
Example:
SELECT TRANSLATE('23+(67 -90', '1-+() ', '1') "Replaced" FROM DUAL;
Output:
236790
In this example, '1' is replaced with '1', and '-', '+', '(', ')' and the space are removed, since we do not provide any corresponding characters for them in the 'to' string.
This statement also answers your question without the use of regexp.
You would think that you could use empty string as the last argument, but that doesn't work because when we pass NULL argument to TRANSLATE function, it returns null and hence we don’t get the desired result.
So I use REPLACE if I need to replace one character, but TRANSLATE if I want to replace multiple characters.
Source: https://decipherinfosys.wordpress.com/2007/11/27/removing-un-wanted-text-from-strings-in-oracle/
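The one-to-one mapping that TRANSLATE performs can be sketched in JavaScript as below. The function translate is a hypothetical imitation for illustration, not an Oracle API, and it does not reproduce Oracle's behaviour of returning NULL when an argument is NULL or empty:

```javascript
// Sketch of TRANSLATE semantics: map from[i] to to[i]; characters listed in
// `from` that have no counterpart in `to` are removed.
function translate(str, from, to) {
  let out = "";
  for (const ch of str) {
    const i = from.indexOf(ch);
    if (i === -1) {
      out += ch; // not listed in `from`: keep unchanged
    } else if (i < to.length) {
      out += to[i]; // listed with a counterpart: substitute
    }
    // listed without a counterpart: dropped
  }
  return out;
}

console.log(translate("23+(67 -90", "1-+() ", "1")); // 236790
```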
Search for \D or [-+() ] and replace with the empty string ''
regexp_replace is an amazing function; it saves a lot of time when stripping letters from an alphanumeric string to convert it to a number.

How to make regular expression correctly?

I need to get the data between the third and fourth occurrences of "*". This is what I have so far:
with t as (select 'T*76031*12558*test*received percents' as txt from dual)
select regexp_replace(txt, '.*(.{4})[*][^*].*$', '\1')
from t
I receive "test" - it's right, but how to get any number of characters, not just 4?
This should work given the example you have used:
REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
So the SELECT would be:
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_REPLACE( txt, '(^.*\*.*\*.*\*)([[:alnum:]]*)(\*.*$)', '\2')
FROM t;
The regex looks for:
Group 1:
Start of string. Any number of characters up to a '*'. Any further characters up to another '*'. Any further characters up to the third '*'.
Group 2:
Any alphanumeric characters
Group 3:
A '*' followed by any other characters up to the end of the string.
Replace all of the above with whatever was found in Group 2.
Hope this helps.
EDIT:
Following on from a great answer from another thread by Rob van Wijk here:
Exracting substring from given string
WITH t
AS (SELECT 'T*76031*12558*test*received percents' AS txt FROM DUAL)
SELECT REGEXP_SUBSTR( txt,'[^\*]+',1,4)
FROM t;
How about the following?
^([^*]*[*]){3}([^*]*)
The first part matches 3 groups of * and the second part matches everything until the next * or end of line.
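As a quick check of this pattern outside Oracle, here is a JavaScript sketch (the function name is hypothetical). Group 1 is overwritten on each of the three repetitions, so the wanted text is in group 2:

```javascript
// Sketch: skip three "text*" chunks, then capture up to the next * or the end.
function fieldAfterThirdStar(txt) {
  const m = /^([^*]*[*]){3}([^*]*)/.exec(txt);
  return m ? m[2] : null;
}

const txt = "T*76031*12558*test*received percents";
console.log(fieldAfterThirdStar(txt)); // test

// Splitting on the delimiter gives the same field for this input.
console.log(txt.split("*")[3]); // test
```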
You are assuming that the last * of your text is also the fourth. If this assumption is true then this :
\b\w*\b(?=\*[^*]*$)
Will get you what you want. But of course this only matches the last word between * before the last star. It only matches test in this case or whatever word characters are inside the *.
Note: 10g REGEXP_SUBSTR doesn't support returning subexpressions, see comments below.
If you are really only selecting a part of the string I recommend using REGEXP_SUBSTR instead. I don't know if it's more efficient, but it will better document your intent:
SQL> select regexp_substr('T*76031*12558*test*received percents',
'^([^*]*[*]){3}([^*]*)', 1, 1, '', 2) from dual;
REGEXP_SUBST
------------
test
Above I have used regexp provided by Pieter-Bas.
See also http://www.regular-expressions.info/oracle.html