plsql regex to remove text between quotes that has quotes - regex

I am struggling to find a regex replacement that removes all text between quotes from a VARCHAR2 field, even when the quoted text itself contains quoted text.
For example, the text:
'text start 'text inside' text end' leftover 'some other text'
after regex replacement should contain: leftover
What I have come up with is this code:
with tbl as (
select
'''text start ''text inside'' text end'' leftover ''some other text''' as str
,'\''(.*?)\''' as regex
from dual
)
select
tbl.str as strA
,regexp_replace(tbl.str,tbl.regex, '') as strB
from tbl;
but the text between the nested quotes still remains.
Is it even possible to achieve this with regular expressions, or should I split and analyze the contents in some loop?
An ideal solution would handle any number of levels of quoted text nested inside quoted text.

An ideal solution would be if it could handle infinite levels occurrences of quoted text inside quoted text.
That is impossible with a single regular expression.
Oracle supports neither recursive regular expressions nor recursive capture buffers.
Update:
It can, however, be done in SQL:
with tbl as (
select
'''text start ''text inside'' text end'' leftover ''some other text'''
as str
from dual
)
select
listagg(text) within group (order by n)
from
(
select
n,
sum(decode(regexp_replace(str, '^(.*?([<>])){'||n||'}.*$', '\2'),
'<', 1, '>', -1, 0)) over (order by n) as nest,
regexp_replace(str, '^(.*?[<>]){'||n||'}([^<>]*).*$', '\2') as text
from
( select regexp_replace(regexp_replace(str, '(\s|^)''', '\1<'),
'''(\s|$)', '>\1') as str from tbl ),
( select level-1 as n from dual
connect by level-1 <= (select regexp_count(str, '''') from tbl) )
)
where nest = 0
fiddle
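The nesting-depth idea behind the SQL above can also be sketched outside the database. The Python below is only an illustration under assumptions: strip_quoted is a hypothetical name, opening quotes are taken to follow whitespace or the start of the string, closing quotes to precede whitespace or the end, and the input is assumed to contain no literal < or > characters.

```python
import re


def strip_quoted(text: str) -> str:
    """Drop everything inside (possibly nested) single-quoted spans."""
    # Mark opening quotes (at the start or after whitespace) as '<'
    # and closing quotes (before whitespace or at the end) as '>'.
    # Assumption: the input itself contains no '<' or '>'.
    marked = re.sub(r"(\s|^)'", r"\1<", text)
    marked = re.sub(r"'(\s|$)", r">\1", marked)

    depth = 0
    kept = []
    for ch in marked:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth = max(depth - 1, 0)
        elif depth == 0:
            # Only characters outside every quoted span survive.
            kept.append(ch)
    return "".join(kept)


sample = "'text start 'text inside' text end' leftover 'some other text'"
print(strip_quoted(sample).strip())
```

Because the depth counter is an ordinary integer, the number of nesting levels is not limited by the regex engine.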

Try:
, '^[^'']*(''.*'')[^'']*$' as regex
Caveat: this will dumbly capture everything between the first and the last single quote in the tested text in capture group 1, including the outermost quotes themselves. In particular, it does not check for proper nesting.
More importantly, your replacement expression will be more complex:
, CASE WHEN REGEXP_INSTR(test, regex) > 0
THEN REPLACE ( test, REGEXP_REPLACE(test, regex, '\1'), '' )
ELSE test
END
If the regexp matches, the capture group is extracted first and then used in an ordinary replacement (this works because the matched portion is guaranteed to be maximal).
IMPORTANT: this solution won't produce the desired result for the particular input you have supplied. However, you cannot do any better with PL/SQL regexp functions, since the Oracle regex engine offers no extensions for expressing recursion in the pattern (as e.g. PCRE does). You need that facility to resolve nested constructs (i.e. to perform balanced counting).

Related

Get a match when there are duplicate letters in a string

I have a list of inputs in google sheets,
Input       Desired Output  Repeated letters (for demonstration only, not an input)
Outdoors    Match           o
dog         No Match
step        No Match
bee         Match           e
Chessboard  Match           s
Cookbooks   Match           o, k
How do I check whether all letters in a string are unique without splitting it?
In other words, if any letter occurs two or more times in the string, return TRUE.
My process so far
I tried this solution, as well as splitting the string and dividing its length by the COUNTA of its unique letters: if = 1 "Match", else "No match".
Or using regex
I found a way to match a letter that occurs 2 times in a string, demonstrated here with REGEXEXTRACT. But what I need is to get TRUE when the letters in the string are not unique.
=REGEXEXTRACT(A1,"o{2}?")
Returns:
oo
Something like this would do
=REGEXMATCH(Input,"(anyletter){2}?")
OR like this
=REGEXMATCH(lower(A6),"[a-zA-Z]{2}?")
Notes
The third column, "Column C," is only for demonstration and not for input.
The match is case insensitive
The string should not be split, to avoid heavy calculation (I have long lists).
Avoid using LAMBDA and its helper functions (see why).
It's OK to return TRUE or FALSE instead of Match or No Match to keep it simple.
More examples
Input           Desired Output
Professionally  Match
Attractiveness  Match
Uncontrollably  Match
disreputably    No Match
Recommendation  Match
Interrogations  Match
Aggressiveness  Match
doublethinks    No Match
You are explicitly asking for an answer using a single regular expression. Unfortunately there is no such thing as a backreference to a former capture group using RE2. So if you'd spell out the answer to your problem it would look like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)")))
Since you are looking for case-insensitive matching, the (?i) modifier helps cut the options down to just the 26 letters of the alphabet. I suppose the above can be written a bit more neatly like:
=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")")))
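To make it easier to see what the TEXTJOIN/REPT/REPLACE combination produces, here is a rough Python sketch that builds the same alternation. It is illustrative only: has_repeated_letter_alt is a hypothetical name, and Python's re module stands in for RE2 here.

```python
import re
import string

# Build "A.*A|B.*B|...|Z.*Z", one alternative per letter, which is what
# REPT(...,2) followed by REPLACE(...,2,0,".*") and TEXTJOIN("|",...) yield.
alternatives = "|".join(f"{letter}.*{letter}" for letter in string.ascii_uppercase)
pattern = re.compile(f"(?i)(?:{alternatives})")


def has_repeated_letter_alt(word: str) -> bool:
    """True when any letter a-z occurs at least twice, ignoring case."""
    return pattern.search(word) is not None


print(alternatives[:17])
print(has_repeated_letter_alt("Outdoors"), has_repeated_letter_alt("dog"))
```

No backreference is involved, so the same pattern is usable in engines that lack them.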
EDIT 1:
The only other reasonable way to do this with a single regex (until I learned from @DoubleUnary about the PREG syntax supported by the matches clause in QUERY()) is to create your own UDF in GAS (AFAIK). It will be JavaScript-based and will therefore support backreferences. GAS is not my forte, but a simple example could be:
function REGEXMATCH_JS(s) {
  if (s.map) {
    return s.map(REGEXMATCH_JS);
  } else {
    return /([a-z]).*?\1/gi.test(s);
  }
}
The pattern ([a-z]).*?\1 means:
([a-z]) - Capture a single character in range a-z;
.*?\1 - Look for 0+ (lazy) characters up to a copy of this 1st captured character with a backreference.
The match is global and case-insensitive. You can now call:
=INDEX(IF(A2:A="","",REGEXMATCH_JS(A2:A)))
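For comparison, the same backreference pattern can be tried in any engine that supports backreferences. A minimal Python sketch, where has_repeated_letter is a hypothetical helper and not part of the answer:

```python
import re

# ([a-z]) captures one letter; .*? lazily skips ahead; \1 requires the
# same letter again. IGNORECASE makes both the class and the
# backreference case-insensitive.
REPEATED_LETTER = re.compile(r"([a-z]).*?\1", re.IGNORECASE)


def has_repeated_letter(word: str) -> bool:
    return REPEATED_LETTER.search(word) is not None


for word in ["Outdoors", "dog", "step", "bee", "Chessboard", "Cookbooks"]:
    print(word, has_repeated_letter(word))
```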
EDIT 2:
For those that are benchmarking speed, I am not testing this myself but maybe this would speed things up:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)"))
Or:
=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")"))
Or:
=REGEXMATCH_JS(A2:INDEX(A:A,COUNTA(A:A)))
respectively, assuming there is a header in the first row.
Benchmark:
Created a benchmark here.
Methodology:
Use NOW() to create a timestamp, when checkbox is clicked.
Use NOW() to create another timestamp, when the last row is filled and the checkbox is on.
The difference between those two timestamps gives time taken for the formula to complete.
The sample is random data created with Math.random from [A-Za-z], with 10 characters per word.
Results:
Sample size: 10006
Formula                          Round1   Round2   Avg      % Slower than best
[re2](a.*a|b.*b)JvDv             0:00:19  0:00:19  0:00:19  -15.15%
[re2+recursion]MASTERMATCH_RE2   0:00:27  0:00:24  0:00:26  -54.55%
[Find+recursion]MASTERMATCH      0:00:17  0:00:16  0:00:17  0.00%
[PREG]Doubleunary                0:00:57  0:00:53  0:00:55  -233.33%
Conclusion:
This varies greatly based on browser/device/mobile app and on non-randomized sample data. But I found PREG to be consistently slower than re2
Use recursion.
This seems much faster than the regex-based approach. Create a named function:
Name:
MASTERMATCH
Arguments(in this order):
word
The word to check
start
Starting at
Function:
=IF(
MID(word,start,1)="",
FALSE,
IF(
ISERROR(FIND(MID(word,start,1),word,start+1)),
MASTERMATCH(word,start+1),
TRUE
)
)
Usage:
=ARRAYFORMULA(MASTERMATCH(A2:INDEX(A2:A,COUNTA(A2:A)),1))
Or without case sensitivity
=ARRAYFORMULA(MASTERMATCH(lower(A2:A),1))
Explanation:
It recurses through each character using MID and checks whether the same character is available after this position using FIND. If so, returns true and doesn't check anymore. If not, keeps checking until the last character using recursion.
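The recursion described above can be sketched in Python for readers who want to step through it. This is an illustrative translation, not the named function itself: master_match is a hypothetical name, indexing is 0-based rather than 1-based, and str.find plays the role of FIND.

```python
def master_match(word: str, start: int = 0) -> bool:
    """True if any character from position start onward occurs again later."""
    if start >= len(word):
        # Ran past the last character without finding a repeat.
        return False
    if word.find(word[start], start + 1) == -1:
        # This character does not occur again; move on to the next one.
        return master_match(word, start + 1)
    return True


print(master_match("outdoors"), master_match("dog"))
print(master_match("Aa"), master_match("Aa".lower()))
```

Like FIND, str.find is case-sensitive, which is why the case-insensitive usage lowercases the input first.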
Or with regex,
Create a named function:
Name:
MASTERMATCH_RE2
Arguments(in this order):
word
The word to check
start
Starting at
Function:
IF(
MID(word,start,1)="",
FALSE,
IF(
REGEXMATCH(word,MID(word, start, 1)&"(?i).*"&MID(word,start,1)),
TRUE,
MASTERMATCH_RE2(word,start+1)
)
)
Usage:
=ARRAYFORMULA(MASTERMATCH_RE2(A2:A,1))
Or
=ARRAYFORMULA(MASTERMATCH_RE2(lower(A2:A),1))
Explanation:
It recurses through each character and creates a regex for that character. Instead of a.*a, b.*b, ..., it takes one character at a time (using MID), e.g. o in outdoor, and creates the regex o.*o. If that regex matches (using REGEXMATCH), it returns TRUE and doesn't check other letters or create other regexes.
This uses lambda, but it's efficient. Loop through each row and every character with MAP and REDUCE. REPLACE each character in the word and find the difference in length. If it is more than 1, stop checking and return Match.
=MAP(
A2:INDEX(A2:A,COUNTA(A2:A)),
LAMBDA(_,
REDUCE(
"No Match",
SEQUENCE(LEN(_)),
LAMBDA(a,c,
IF(a="Match",a,
IF(
LEN(_)-LEN(
REGEXREPLACE(_,"(?i)"&MID(_,c,1),)
)>1,
"Match",a
)
)
)
)
)
)
If you do run into lambda limitations, remove the MAP and drag fill the REDUCE formula.
=REDUCE("No Match",SEQUENCE(LEN(A2)),LAMBDA(a,c,IF(a="Match",a,IF(LEN(A2)-LEN(REGEXREPLACE(A2, "(?i)"&MID(A2,c,1),))>1,"Match",a))))
The latter is preferred for conditional formatting as well.
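The length-difference idea in the REDUCE formula can be illustrated with a short Python sketch. It is an approximation under assumptions: match_by_removal is a hypothetical name, and it uses a plain string replace on a lowercased copy instead of REGEXREPLACE with (?i).

```python
def match_by_removal(word: str) -> str:
    """Return "Match" if removing some character shortens the word by more than 1."""
    lowered = word.lower()
    for ch in lowered:
        removed = lowered.replace(ch, "")
        if len(lowered) - len(removed) > 1:
            # The character occurred at least twice; stop checking.
            return "Match"
    return "No Match"


for word in ["Professionally", "disreputably", "Cookbooks"]:
    print(word, match_by_removal(word))
```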
As Daniel Cruz said, Google Sheets functions such as regexmatch(), regexextract() and regexreplace() use RE2 regexes that do not support backreferences. However, the query() function uses Perl Compatible Regular Expressions that do support named capture groups and backreferences:
=arrayformula(
iferror( not( iserror(
match(
to_text(A3:A),
query(lower(unique(A3:A)), "where Col1 matches '.*?(?<char>.).*?\k<char>.*' ", 0),
0
)
) / (A3:A <> "") ) )
)
In my limited testing with a sample size of 1000 heterograms, pangrams, words with diacritic letters, and 10-character pseudo-random unique values from TheMaster's corpus, this PREG formula ran at about half the speed of the JvdV2 RE2 regex.
With Osm's sample of 50,000 highly repetitive sample values, the formula ran at 8x the speed of JvdV2.
A PREG regex is slower than a RE2 regex, but has the benefit that you can more easily check all characters for repeats. This lets you work with corpuses that include diacritic letters, numbers and other non-English alphabet characters:
Input           Output
Professionally  TRUE
disreputably    FALSE
Abacus          TRUE
Élysée          TRUE
naïve Ï         TRUE
määräävä        TRUE
121             TRUE
123             FALSE
You can also easily state which specific characters to check by replacing <char>. with something like <char>[\wéäåö] or <char>[^-;,.\s\d].
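The named-group pattern can be tried outside Sheets as well. The Python sketch below is an assumption-laden illustration: has_repeated_char is a hypothetical name, Python spells the named backreference (?P=char) rather than \k<char>, and the input is normalized to NFC so that accented letters compare as single characters.

```python
import re
import unicodedata

# (?P<char>.) captures any single character; (?P=char) requires it again.
REPEAT = re.compile(r"(?P<char>.).*?(?P=char)", re.DOTALL)


def has_repeated_char(text: str) -> bool:
    """True if any character (letter, digit, accented letter...) repeats."""
    normalized = unicodedata.normalize("NFC", text).lower()
    return REPEAT.search(normalized) is not None


for value in ["Professionally", "disreputably", "121", "123"]:
    print(value, has_repeated_char(value))
```

Since the pattern does not restrict the character class, digits and diacritic letters are checked the same way as plain a-z.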
Try:
=INDEX(IF(IFERROR(LEN(REGEXREPLACE(A1:A6, "[^"&C1:C6&"]", )), -1)>=
(LEN(SUBSTITUTE(C1:C6, "|", ))*2), "Match", "No Match"))
Update:
Create a query heat map, filter it, and VLOOKUP back the row position:
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&REGEXEXTRACT(a, REPT("(.)", LEN(a)))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
The case-insensitive version would be:
=INDEX(LAMBDA(a, IF(""<>IFNA(VLOOKUP(ROW(a),
SPLIT(QUERY(QUERY(FLATTEN(ROW(a)&"​"&LOWER(REGEXEXTRACT(a, REPT("(.)", LEN(a))))),
"select Col1,count(Col1) where Col1 matches '.*\w+$' group by Col1"),
"select Col1 where Col2 > 1", ), "​"), 2, )), "Match", "No Match"))
(A2:INDEX(A:A, MAX((A:A<>"")*ROW(A:A)))))
Just to illustrate another method, not likely to be scalable, try substituting the second occurrence of each letter:
=ArrayFormula(if(isnumber(xmatch(len(A2)-1,len(substitute(upper(A2),char(sequence(1,26,65)),"",2)))),"Match","No match"))
If splitting were permitted, I would favour use of Frequency for speed, e.g.
=ArrayFormula(max(frequency(code(mid(upper(A2),sequence(len(A2)),1)),sequence(1,26,65)))>1)
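The frequency-counting approach translates directly to a counter over the letters. A small Python sketch, with has_repeated_letter_by_count as a hypothetical name and collections.Counter standing in for FREQUENCY:

```python
from collections import Counter


def has_repeated_letter_by_count(word: str) -> bool:
    """True if any letter A-Z occurs more than once, ignoring case."""
    # Count only A-Z, matching the 26 bins used in the formula.
    counts = Counter(ch for ch in word.upper() if "A" <= ch <= "Z")
    return max(counts.values(), default=0) > 1


for word in ["Cookbooks", "step"]:
    print(word, has_repeated_letter_by_count(word))
```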
You can try this regex: /(\w).*?\1/g in the REGEXMATCH function in Google Sheets.
Explanation:
(\w) - matches a word character (a-z, A-Z, 0-9, _). If you are sure that the input will contain only letters, you can also use ([a-zA-Z]); then
.*? - zero or more characters, matched lazily (so the repeat can be consecutive or non-consecutive); until
\1 - it finds a repeat of the first matched character.
Live Demo : regex101
Coming after the battle ^^ Why not simply compare the number of unique letters in the string and its original length ?
=COUNTUNIQUE(split(regexreplace(A2;"(.)"; "$1_"); "_")) < LEN(A2)
All my tests seem fine.
(split() provided by this answer)

Split single row string into multiple rows by multi-character delimiter Oracle

I have attempted to use the question Splitting string into multiple rows in Oracle and adjust it to my needs, but I'm not very confident with regex and have not been able to solve it by searching.
Currently that question's answers solve it with a lot of regexp_substr calls and so on, using [^,]+ as the pattern so that it splits on a single comma. I need to split on a multi-character delimiter (e.g. #;), but that kind of pattern matches any single character of the delimiter, so wherever there is a # or a ; elsewhere in the text it causes a split.
I've worked out that the pattern (#;+) will match every group of #;, but I cannot work out how to invert it, as done above, to split the row into multiple rows.
I'm sure I'm just missing something simple, so any help would be greatly appreciated!
I think you should use:
[^#;+]+
instead of
(#;+)
It will check for any one of the characters in the set, which can be #, ; or +, and then you can split accordingly.
You can change it according to your requirements, but in the regex I shared I am considering #, ; and + as delimiters.
So, in the end, the query would look something like this:
with tbl(str) as (
select ' My, Delimiter# Hello My; Delimiter World My Delimiter My Delimiter test My Delimiter ' from dual
)
SELECT LEVEL AS element,
REGEXP_SUBSTR( str ,'([^#;+]+)', 1, LEVEL, NULL, 1 ) AS element_value
FROM tbl
CONNECT BY LEVEL <= regexp_count(str, '[#;+]')+1
Output:
ELEMENT ELEMENT_VALUE
1 My, Delimiter
2 Hello My
3 Delimiter World My Delimiter My Delimiter test My Deli
-- EDIT --
If you want to split on runs of any number of # or ; characters, but not on a single occurrence, I found the regex below; again, it is not supported by Oracle.
(?:(?:(?![;#]+).#(?![;#]+).|(?![;#]+).;(?![;#]+).|(?![;#]+).)*)+
So I found no easy way other than the query below, which will not split on a single occurrence if there is only one such instance between two delimiters:
with tbl(str) as (
select ' My, Delimiter;# Hello My Delimiter ;;# World My Delimiter ; My Delimiter test#; My Delimiter ' from dual
)
SELECT LEVEL AS element,
REGEXP_SUBSTR( str ,'([^#;]+#?[^#;]+;?[^#;]+)', 1, LEVEL, NULL, 1 ) AS element_value
FROM tbl
CONNECT BY LEVEL <= regexp_count(str, '[#;]{2,}')+1
Output:
ELEMENT ELEMENT_VALUE
1 My, Delimiter
2 Hello My Delimiter
3 World My Delimiter ; My Delimiter test
4 My Delimiter
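To make the intended splitting behaviour concrete, here is a small Python sketch of the two variants discussed: splitting on the exact literal delimiter #;, and splitting on runs of two or more # or ; characters while leaving single ones alone. Both function names are hypothetical, and this only illustrates the logic; it is not an Oracle solution.

```python
import re


def split_on_literal(text: str, delimiter: str = "#;") -> list:
    """Split only where the whole multi-character delimiter appears."""
    return [part.strip() for part in text.split(delimiter)]


def split_on_delimiter_runs(text: str) -> list:
    """Split on runs of two or more '#' or ';', ignoring single ones."""
    return [part.strip() for part in re.split(r"[#;]{2,}", text)]


sample = (" My, Delimiter;# Hello My Delimiter ;;# World My Delimiter ; "
          "My Delimiter test#; My Delimiter ")
for number, value in enumerate(split_on_delimiter_runs(sample), start=1):
    print(number, value)
```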

replace regex does not work in postgresql

I have a table with a string column. The strings contain single quotes, and I want to get rid of all of them. For example:
"''hey, hey, we're the monkees''"
My regex works perfectly and selects all the values containing single quotes:
select regexp_replace(colName, '%''%', '') from tblName;
but it does not update my table when I try to replace the match with nothing:
UPDATE tblName SET colName = regexp_replace(colName, '%''%', '');
I also tried this one:
UPDATE tblName SET colName = replace(colName, '%''%', '');
Different functions and operators in Postgres use one of three different pattern matching languages, as described in a dedicated section of the manual.
The % form you are using here is the SQL LIKE syntax, where % represents "any number of any character". But the function you are using, regexp_replace, expects a Posix regular expression, where the equivalent would be .* (. meaning any character, * meaning repeat zero or more times).
Also note that LIKE expressions have to match the whole string, but a Posix regex doesn't, unless you explicitly match the start of the string with ^ and the end with $.
So the direct translation of '%''%' would be '^.*''.*$', giving you this:
UPDATE tblName SET colName = regexp_replace(colName, '^.*''.*$', '');
In practice, this would give the same effect as the simpler:
UPDATE tblName SET colname='' WHERE colname LIKE '%''%';
Your actual use case is much simpler: you want to replace all occurrences of a fixed string (', which will need to be quoted and escaped as '''') with another fixed string (the empty string, written ''). So you don't need any pattern matching at all, just straight replacement using replace:
UPDATE tblName SET colname=replace(colname, '''', '');
This will probably be faster if you limit it to rows that contain an apostrophe to begin with:
UPDATE tblName SET colname=replace(colname, '''', '') WHERE colname LIKE '%''%';
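The difference between the whole-string regex and the plain replacement is easy to see on the sample value. The Python sketch below only mirrors the logic with re.sub and str.replace; it is not PostgreSQL, and the variable names are made up for illustration.

```python
import re

value = "''hey, hey, we're the monkees''"

# Equivalent of the translated LIKE pattern: the regex spans the whole
# string, so the entire value is replaced by the empty string.
whole_string_replaced = re.sub(r"^.*'.*$", "", value)

# Equivalent of replace(colname, '''', ''): only the quote characters go.
quotes_removed = value.replace("'", "")

print(repr(whole_string_replaced))
print(repr(quotes_removed))
```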
% is not a regexp character.
Try this:
select regexp_replace(colName, $$'$$, '','g') from tblName;
($$ is used to surround your string instead of ' to simplify the query)
(,'g') is used to continue after the first quote is found.
UPDATE tblName SET colName = regexp_replace(colName, $$'$$, '','g');

How to perform operations on a selected piece of string after regex in clojure

Base String:
SELECT (sum([column.one]) / sum([column.two])) AS [sum / sum], [column.three] AS [column.three] FROM [database.table] GROUP BY [column.three] ORDER BY [column.three] ASC
Resultant String:
SELECT (sum([column.one]) / sum([column.two])) AS [sum___sum], [column.three] AS [column.three] FROM [database.table] GROUP BY [column.three] ORDER BY [column.three] ASC
Here [sum / sum] could change to some other format like [sum * distinct] or [max + min - distinct]
What I have till now:
Replace all the values with [] around them with _:
(s/replace sql #"\[(.*?)\]" "_")
What I am trying:
If I can get the value that got matched, I can replace all special characters except dot (.) with an underscore.
(s/replace sql #"\[(.*?)\]" #(s/replace "$1" #"[\/\*\-\+\(\)\\\s]" "_"))
More clarity:
In short, anything inside [] can only be a combination of alphanumeric, dots, and underscores. Otherwise replace that character with underscore (_).
[Repeating my answer from comments]
In this case "$1" is not valid syntax.
You are replacing within the literal string "$1", not within the matched string. In the second replace you should operate on the match passed in by the first one. Just replace "$1" with (second %).
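The general technique here is passing a function as the replacement so that it receives the match and can transform the captured group. A Python sketch of that idea, purely as an illustration: sanitize_bracketed is a hypothetical name, and it puts the brackets back around the cleaned text, which is an assumption based on the desired output in the question.

```python
import re


def sanitize_bracketed(sql: str) -> str:
    """Inside every [...] keep letters, digits, dots and underscores only."""

    def clean(match: re.Match) -> str:
        # group(1) is the text between the brackets, the counterpart
        # of (second %) in the Clojure version.
        inner = re.sub(r"[^A-Za-z0-9._]", "_", match.group(1))
        return f"[{inner}]"

    return re.sub(r"\[(.*?)\]", clean, sql)


print(sanitize_bracketed("SELECT 1 AS [sum / sum] FROM [database.table]"))
```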
An ugly way would be to simply split the string with subs into a first part and a second part, then insert your "sum___sum" between them.
That would be quite simple if the part to be replaced always follows the first "AS [" in your SQL query string. You can use that to find the right index-of in your string, so you wouldn't need the regexp.
As mentioned earlier, inserting a string straight into the query might open the door to attacking your database through SQL injection.
A better way would be to use parameter(s) in your original query or to create the query as a prepared statement.

How to create a generalized regex (POSIX) in PostgreSQL?

In PostgreSQL (9.5), PgAdmin III, I would like to generalize this POSIX statement for two words:
This works for the words 'new' and 'intermediate' with word boundaries:
select * from cpt where cdesc ~* '^(?=.*\mnew\M)(?=.*\mintermediate\M)'
This fails ( the "where" argument is seen as a text string):
select * from cpt where cdesc ~* '^(?=.*\m'||'new'||'\M)(?=.*\mintermediate\M)'
How can this be written as a generalized function, e.g.:
CREATE OR REPLACE FUNCTION getDesc(string1 text, string2 text)
RETURNS SETOF cpt AS
$BODY$
select * from cpt where cdesc ~* '^(?=.*\m$1\M)(?=.*\m$2\M)'
$BODY$
LANGUAGE sql VOLATILE;
(where $1 is string1 and $2 is string2)
TIA
Edit. Matching strings in cdesc would be:
"This is a new and intermediate art work"
"This is an intermediate and new piece of art"
Non-match would be:
"This is new art"
"This is intermediate art"
Please note the order of the words is not important as long as both are present. Also, either word may have a punctuation mark -- (comma or period)--immediately following the word (no space).
My first suggestion would be to split the expensive regex into two SQL WHERE clauses and either:
match with LIKE, which is much faster, and filter in code for more specific matches,
or match with a simple regex, something like '\m$1[\M,.]'
As for the regex you are using:
I have not used it in a while, but I think you need parentheses around the string concatenation:
~* ( '^(?=.*\m' || 'new' || '\M)(?=.*\mintermediate\M)' )
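The pattern-building step can be prototyped outside the database before wrapping it in a function. A hedged Python sketch: both_words_pattern is a hypothetical name, Python's \b stands in for PostgreSQL's \m and \M word-boundary escapes, and re.escape guards against regex metacharacters in the supplied words.

```python
import re


def both_words_pattern(word1: str, word2: str) -> re.Pattern:
    """Build a pattern requiring both whole words, in either order."""
    return re.compile(
        rf"^(?=.*\b{re.escape(word1)}\b)(?=.*\b{re.escape(word2)}\b)",
        re.IGNORECASE,
    )


pattern = both_words_pattern("new", "intermediate")
for text in [
    "This is a new and intermediate art work",
    "This is an intermediate and new piece of art",
    "This is new art",
    "This is intermediate art",
]:
    print(bool(pattern.search(text)), text)
```

The word boundaries also accept a comma or period directly after either word, since punctuation is not a word character.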