How to create a generalized regex (POSIX) in PostgreSQL?

In PostgreSQL 9.5 (pgAdmin III), I would like to generalize this POSIX regex, which matches two words:
This works for the words 'new' and 'intermediate' with word boundaries:
select * from cpt where cdesc ~* '^(?=.*\mnew\M)(?=.*\mintermediate\M)'
This fails (the "where" argument is seen as a text string):
select * from cpt where cdesc ~* '^(?=.*\m'||'new'||'\M)(?=.*\mintermediate\M)'
How can this be written as a generalized function, e.g.:
CREATE OR REPLACE FUNCTION getDesc(string1 text, string2 text)
RETURNS SETOF cpt AS
$BODY$
select * from cpt where cdesc ~* '^(?=.*\m$1\M)(?=.*\m$2\M)'
$BODY$
LANGUAGE sql VOLATILE;
(where $1 is string1 and $2 is string2)
TIA
Edit: Matching strings in cdesc would be:
"This is a new and intermediate art work"
"This is an intermediate and new piece of art"
Non-match would be:
"This is new art"
"This is intermediate art"
Please note that the order of the words is not important as long as both are present. Also, either word may be immediately followed by a punctuation mark (a comma or period) with no space in between.

My first suggestion would be to split the expensive regex into two SQL WHERE conditions joined with AND, each one either:
matching with LIKE, which is much faster (you can then filter in code for more specific matches),
or matching with a simple regex, something like '\m$1[\M,.]'
As for the regex you are using:
I have not used it in a while, but I think you need parentheses around the string concatenation, so that the whole pattern is built before ~* is applied:
~* ( '^(?=.*\m' || 'new' || '\M)(?=.*\mintermediate\M)' )
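The same "both words, in any order" test with the words passed as parameters can be sketched in Python rather than PostgreSQL. This is only an illustration under stated assumptions: Python's re module has no \m and \M, so \b stands in for both, and the helper name both_words is hypothetical. What it shows is that the pattern is assembled by concatenation first and only then handed to the matcher as a single expression.

```python
import re

def both_words(text, word1, word2):
    """Return True if both words occur in text as whole words, in any order."""
    # Build the complete pattern first, then match. re.escape keeps the
    # parameters from being interpreted as regex syntax.
    # Assumption: \b replaces PostgreSQL's \m (word start) and \M (word end).
    pattern = (
        r"^(?=.*\b" + re.escape(word1) + r"\b)"
        r"(?=.*\b" + re.escape(word2) + r"\b)"
    )
    return re.search(pattern, text, re.IGNORECASE | re.DOTALL) is not None

samples = [
    "This is a new and intermediate art work",
    "This is an intermediate and new piece of art",
    "This is new art",
    "This is intermediate art",
]
for sample in samples:
    print(both_words(sample, "new", "intermediate"), sample)
```

A word followed directly by a comma or period still counts, because the boundary sits between the last letter and the punctuation mark.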

Related

Postgres Regex Negative Lookahead

Scenario: Match any string that starts with "J01" except the string "J01FA09".
I'm baffled why the following code returns nothing:
SELECT 1
WHERE
'^J01(?!FA09).*' ~ 'J01FA10'
when I can see on regexr.com that it's working (I realize there are different flavors of regex and that could be the reason for the site working).
I have confirmed in the postgres documentation that negative look aheads are supported though.
Table 9-15. Regular Expression Constraints
(?!re) negative lookahead matches at any point where no substring
matching re begins (AREs only). Lookahead constraints cannot contain
back references (see Section 9.7.3.3), and all parentheses within them
are considered non-capturing.
Match any string that starts with "J01" except the string "J01FA09".
You can do without a regex using
WHERE s LIKE 'J01%' AND s != 'J01FA09'
Here, LIKE 'J01%' requires a string to start with J01 and then may have any chars after, and s != 'J01FA09' will filter out the matches.
If you want to achieve the same with a regex, use
WHERE s ~ '^J01(?!FA09$)'
The ^ matches the start of a string, J01 matches the literal J01 substring, and (?!FA09$) asserts that J01 is not immediately followed by FA09 and then the end of the string. If FA09 appears and the string ends right after it, no match is returned.
See the online demo:
CREATE TABLE table1
(s character varying)
;
INSERT INTO table1
(s)
VALUES
('J01NNN'),
('J01FFF'),
('J01FA09'),
('J02FA09')
;
SELECT * FROM table1 WHERE s ~ '^J01(?!FA09$)';
SELECT * FROM table1 WHERE s LIKE 'J01%' AND s != 'J01FA09';
The regular expression must be the right-hand operand of ~ (the string being tested goes on the left):
SELECT 1
WHERE 'J01FA10' ~ '^J01(?!FA09)';
?column?
----------
1
(1 row)
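For readers who want to try the lookahead outside PostgreSQL, here is a minimal Python sketch of the same pattern. It assumes Python's re module treats ^, $ and (?!...) the same way for these inputs; the row 'J01FA09X' is not in the original demo and is added only to show that the trailing $ limits the exclusion to the exact string. The last two lines mirror the operand mix-up from the question.

```python
import re

pattern = re.compile(r"^J01(?!FA09$)")

# Rows from the demo table, plus 'J01FA09X' (added for illustration).
rows = ["J01NNN", "J01FFF", "J01FA09", "J02FA09", "J01FA09X"]
matches = [s for s in rows if pattern.search(s)]
print(matches)

# The pattern comes first and the subject string second in re.search.
# Swapping them, as in the question, finds nothing.
print(re.search(r"^J01(?!FA09)", "J01FA10") is not None)
print(re.search("J01FA10", r"^J01(?!FA09).*") is not None)
```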

How to perform operations on a selected piece of string after regex in clojure

Base String:
SELECT (sum([column.one]) / sum([column.two])) AS [sum / sum], [column.three] AS [column.three] FROM [database.table] GROUP BY [column.three] ORDER BY [column.three] ASC
Resultant String:
SELECT (sum([column.one]) / sum([column.two])) AS [sum___sum], [column.three] AS [column.three] FROM [database.table] GROUP BY [column.three] ORDER BY [column.three] ASC
Here [sum / sum] could change to some other format like [sum * distinct] or [max + min - distinct]
What I have till now:
Replace all the values with [] around them with _:
(s/replace sql #"\[(.*?)\]" "_")
What I am trying:
If I can get the value that got matched, I can replace all special characters except dot (.) with an underscore.
(s/replace sql #"\[(.*?)\]" #(s/replace "$1" #"[\/\*\-\+\(\)\\\s]" "_"))
More clarity:
In short, anything inside [] can only be a combination of alphanumeric, dots, and underscores. Otherwise replace that character with underscore (_).
[Repeating my answer from comments]
In this case "$1" is not a valid syntax.
You are trying to replace something in the literal string "$1", not in the matched string. The second replace should operate on the match that the first replace passes in. Just replace "$1" with (second %).
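The idea of running a second replacement on each match can be sketched in Python, where re.sub accepts a function as the replacement. This is a sketch under assumptions, not the Clojure fix itself: the name sanitize_bracketed is hypothetical, and the character class follows the question's rule that only alphanumerics, dots and underscores may stay. Note that the callback puts the brackets back, since the outer pattern consumes them.

```python
import re

def sanitize_bracketed(sql):
    """Inside every [...] replace characters other than letters, digits, '.', '_'."""
    def fix(match):
        inner = match.group(1)  # the text between the brackets
        cleaned = re.sub(r"[^A-Za-z0-9._]", "_", inner)
        return "[" + cleaned + "]"  # restore the brackets the pattern consumed
    return re.sub(r"\[(.*?)\]", fix, sql)

base = ("SELECT (sum([column.one]) / sum([column.two])) AS [sum / sum], "
        "[column.three] AS [column.three] FROM [database.table] "
        "GROUP BY [column.three] ORDER BY [column.three] ASC")
print(sanitize_bracketed(base))
```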
An ugly way would be simple string splitting with subs into a first part and a second part, then adding your "sum___sum" between those parts.
That would be quite simple if the part to be replaced always follows the first "AS [" in your SQL query string. You can use that to find the right position with index-of, so you wouldn't need the regexp.
As mentioned earlier, inserting a string straight into the query might open your database to SQL injection.
A better way would be to use parameter(s) in your original query or to create the query as a prepared statement.

plsql regex to remove text between quotes that has quotes

I am struggling with a regex replacement that would remove all text between quotes from a VARCHAR2 field, even if the text between those quotes contains quoted text as well.
For example text:
'text start 'text inside' text end' leftover 'some other text'
after regex replacement should contain: leftover
What I have come up with is this code:
with tbl as (
select
'''text start ''text inside'' text end'' leftover ''some other text''' as str
,'\''(.*?)\''' as regex
from dual
)
select
tbl.str as strA
,regexp_replace(tbl.str,tbl.regex, '') as strB
from tbl;
but the text between subquotes still remains.
Is it even possible to achieve this with regular expressions, or should I split and analyze the contents in some loop ?
An ideal solution would handle arbitrarily many levels of quoted text inside quoted text.
It's impossible with a single regular expression.
Neither recursive regexps, nor recursive capture buffers are available in Oracle.
Update:
But it could be done by SQL:
with tbl as (
select
'''text start ''text inside'' text end'' leftover ''some other text'''
as str
from dual
)
select
listagg(text) within group (order by n)
from
(
select
n,
sum(decode(regexp_replace(str, '^(.*?([<>])){'||n||'}.*$', '\2'),
'<', 1, '>', -1, 0)) over (order by n) as nest,
regexp_replace(str, '^(.*?[<>]){'||n||'}([^<>]*).*$', '\2') as text
from
( select regexp_replace(regexp_replace(str, '(\s|^)''', '\1<'),
'''(\s|$)', '>\1') as str from tbl ),
( select level-1 as n from dual
connect by level-1 <= (select regexp_count(str, '''') from tbl) )
)
where nest = 0
Try:
, '^[^'']*(''.*'')[^'']*$' as regex
Caveat: this will dumbly capture all content between the first and the last occurrence of single quotes in the tested text in capture group 1, including the outermost quotes themselves. In particular, it does not check for proper nesting.
More importantly, your replacement expression will be more complex:
, CASE WHEN REGEXP_INSTR(test, regex) > 0
THEN REPLACE ( test, REGEXP_REPLACE(test, regex, '\1'), '' )
ELSE test
END
If the regexp matches, the capture group is extracted first and then used in an ordinary replacement (this works because the matched portion is guaranteed to be maximal).
IMPORTANT: this solution won't produce the desired result in the particular context you have supplied. However, you cannot fare any better with PL/SQL regexp functions, since the Oracle regex engine does not offer extensions to express recursion in the pattern (as, e.g., PCRE does). You need this facility to resolve nested constructs (i.e. perform balanced counting).
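The balanced counting mentioned above can also be done outside the regex engine. The Python sketch below is an illustration, not Oracle code. It borrows the heuristic from the SQL answer (a quote at the start or after whitespace opens, a quote at the end or before whitespace closes) and then keeps only the characters at nesting depth zero. It assumes the text contains no literal < or > characters; the function name strip_quoted is hypothetical.

```python
import re

def strip_quoted(text):
    """Remove quoted sections, including nested ones, using a depth counter."""
    # Mark opening quotes (start of text or after whitespace) as '<'.
    marked = re.sub(r"(\s|^)'", r"\1<", text)
    # Mark closing quotes (end of text or before whitespace) as '>'.
    marked = re.sub(r"'(\s|$)", r">\1", marked)
    depth = 0
    kept = []
    for ch in marked:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth = max(depth - 1, 0)  # ignore a stray closing marker
        elif depth == 0:
            kept.append(ch)  # only text outside every quoted section survives
    return "".join(kept).strip()

sample = "'text start 'text inside' text end' leftover 'some other text'"
print(strip_quoted(sample))
```

Because the depth is a plain integer, any number of nesting levels is handled, which is exactly what a single non-recursive regex cannot express.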

Find all (multiline) 'SELECT...' queries that don't contain a certain word

In my java file there are many sql queries assigned to java strings like:
/* ... */
String str1 = "SELECT item1, item2 from table1 where this=that and MYWORD=that2 and this3=that3";
/* ... */
String str2 = "SELECT item1, item2 from table1 where this=that and" +
" MYWORD=that2 and this3=that3 and this4=that4";
/* ... */
/* ... */
String str3 = "SELECT item1, item2 from table2 where this=that and this2=that2 and" +
" this3=that3 and this4=that4";
/* ... */
String str4 = "SELECT item1, item2 from table3 where this=that and MYWORD=that2" +
" and this3=that3 and this4=that4";
/* ... */
String str5 = "SELECT item1, item2 from table4 where this=that and this2=that2 and this3=that3";
/* ... */
Now I want to find the 'SELECT...' queries that don't contain the word 'MYWORD'.
From one of my previous S/O questions I got an answer on how to find all the 'SELECT...' queries, but I need to extend that solution to find the ones that don't contain a certain word.
I have tried the regex SELECT(?!.*MYWORD).*; but it can't find the multiline queries (like str3 above); it finds only the single-line ones.
I've also tried the regex SELECT[\s\S]*?(?!MYWORD).*(?<=;)$ which finds all the queries but is unable to determine whether the word 'MYWORD' is present in the query or not.
I know I'm very near to the solution, still can't figure it out.
Can anyone help me, please?
(I am using notepad++ on windows)
The problem with the first regex is that . doesn't match a newline. In normal regexes, there is an option to change that, but I don't know whether that feature exists in notepad++.
The problem with the second regex is that it matches "select, then some stuff, then anything that doesn't match MYWORD, then more stuff, then a semicolon". Even if MYWORD exists, the regex engine will happily match (?!MYWORD) at some other position in the string where MYWORD does not start.
Something like this should work (caveat: not tested on Notepad++):
SELECT(?![^;]*MYWORD)[^;]*;
Instead of ., match anything that is not a semicolon. This should allow you to match a newline.
Beyond that, it is also important that you don't allow a semicolon to be part of the match. Otherwise, the pattern can expand to gobble up multiple SELECT statements as it tries to match.
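Since the pattern above was not tested in Notepad++, here is a small Python sketch that exercises it with Python's re module instead. The Java snippet is a shortened, made-up version of the strings in the question. It illustrates the two points just made: [^;] crosses line breaks, and the lookahead cannot see past the statement's own semicolon.

```python
import re

# Shortened stand-in for the Java file from the question.
java_source = '''
String str1 = "SELECT item1 from table1 where this=that and MYWORD=that2";
String str2 = "SELECT item1 from table1 where this=that and" +
    " MYWORD=that2 and this3=that3";
String str3 = "SELECT item1 from table2 where this=that and" +
    " this3=that3 and this4=that4";
String str5 = "SELECT item1 from table4 where this=that";
'''

# After SELECT, refuse to match if MYWORD occurs before the next semicolon.
pattern = re.compile(r"SELECT(?![^;]*MYWORD)[^;]*;")
hits = pattern.findall(java_source)
for hit in hits:
    print(repr(hit))
```

Only the table2 and table4 statements are reported; the first one spans two lines.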
Try this (on a current version of Notepad++ using Perl-compatible regexes; older versions don't support multiline regexes):
SELECT (?:(?!MYWORD)[^"]|"\s*\+\s*")*"\s*;
Explanation:
SELECT # Match SELECT
(?: # Match either...
(?!MYWORD) # (as long as it's not the word MYWORD)
[^"] # any character except a quote
| # or
"\s* # an ending quote, optional whitespace,
\+\s* # a plus sign, optional whitespace (including newlines),
" # and another opening quote.
)* # Repeat as needed.
"\s*; # Match a closing quote, optional whitespace, and a semicolon.

Is there a way, using regular expressions, to match a pattern for text outside of quotes?

As stated in the title, is there a way, using regular expressions, to match a text pattern for text that appears outside of quotes. Ideally, given the following examples, I would want to be able to match the comma that is outside of the quotes, but not the one in the quotes.
This is some text, followed by "text, in quotes!"
or
This is some text, followed by "text, in quotes" with more "text, in quotes!"
Additionally, it would be nice if the expression would respect nested quotes as in the following example. However, if this is technically not feasible with regular expressions, then it would simply be nice to know that this is the case.
The programmer looked up from his desk, "This can't be good," he exclaimed, "the system is saying 'File not found!'"
I have found some expressions for matching something that would be in the quotes, but nothing quite for something outside of the quotes.
Easiest is matching both commas and quoted strings, and then filtering out the quoted strings.
/"[^"]*"|,/g
If you really can't have the quotes matching, you could do something like this:
/,(?=[^"]*(?:"[^"]*"[^"]*)*\Z)/g
This could become slow, because for each comma, it has to look at the remaining characters and count the number of quotes. \Z matches the end of the string. Similar to $, but will never match line ends.
If you don't mind an extra capture group, it could be done like this instead:
/\G((?:[^"]*"[^"]*")*?[^"]*?)(,)/g
This will only scan the string once. It counts the quotes from the beginning of the string instead. \G will match the position where last match ended.
The last pattern could use an example.
Input String: 'This is, some text, followed by "text, in quotes!" and more ,-as'
Matches:
1. ['This is', ',']
2. [' some text', ',']
3. [' followed by "text, in quotes!" and more ', ',']
It matches the string leading up to the comma, as well as the comma.
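The first approach in this answer (match both quoted strings and commas, then filter out the quoted strings) might look like this in Python. It is a sketch under the assumption that quotes are plain double quotes and are never nested or escaped; the function name commas_outside_quotes is hypothetical.

```python
import re

def commas_outside_quotes(text):
    """Return the indexes of commas that are not inside double quotes."""
    positions = []
    # A quoted string is tried first at each position, so any comma inside
    # it is consumed as part of that match and never reported on its own.
    for m in re.finditer(r'"[^"]*"|,', text):
        if m.group() == ",":
            positions.append(m.start())
    return positions

text = 'This is some text, followed by "text, in quotes!"'
print(commas_outside_quotes(text))
```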
This can be done with modern regexes due to the massive number of hacks to regex engines that exist, but let me be the one to post the "Don't Do This With Regular Expressions" answer.
This is not a job for regular expressions. This is a job for a full-blown parser. As an example of something you can't do with (classical) regular expressions, consider this:
()(())(()())
No (classical) regex can determine if those parentheses are matched properly, but doing so without a regex is trivial:
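/* C code */
#include <stdio.h>
int main(void)
{
    char string[] = "()(())(()())";
    int parens = 0;
    /* Walk the string up to the terminating NUL and track the net count.
       Note: only the net count is checked, so ")(" would also pass. */
    for(char *tmp = string; *tmp; tmp++)
    {
        if(*tmp == '(') parens++;
        if(*tmp == ')') parens--;
    }
    if(parens > 0)
    {
        printf("%d too many open parentheses.\n", parens);
    }
    else if(parens < 0)
    {
        printf("%d too many closing parentheses.\n", -parens);
    }
    else
    {
        printf("Parentheses match!\n");
    }
    return 0;
}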
# Perl code
my $string = "()(())(()())";
my $parens = 0;
for(split(//, $string)) {
$parens++ if $_ eq "(";
$parens-- if $_ eq ")";
}
die "Too many open parenthesis.\n" if $parens > 0;
die "Too many closing parenthesis.\n" if $parens < 0;
print "Parenthesis match!";
See how simple it was to write some non-regex code to do the job for you?
EDIT: Okay, back from seeing Adventureland. :) Try this (written in Perl, commented to help you understand what I'm doing if you don't know Perl):
# split $string into a list, split on the double quote character
my @temp = split(/"/, $string);
# iterate through a list of the indexes of our list
for(0 .. $#temp) {
    # skip odd-numbered elements - only process $temp[0], $temp[2], etc.
    # the reason is that, if we split on "s, every other element is a quoted string
    next if $_ & 1;
    if($temp[$_] =~ /regex/) {
        # do stuff
    }
}
Another way to do it:
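my $bool = 0;    # true while we are inside a quoted section
my $str = "";
my $match = 0;
# loop through the characters of a string
for(split(//, $string)) {
    if($_ eq '"') {
        $bool = !$bool;
        if($bool) {
            # a quote just opened: regex time for the unquoted text so far!
            $match += $str =~ /regex/;
            $str = "";
        }
        # never add the quote character itself to the test string
        next;
    }
    if(!$bool) {
        # add the current character to our test string
        $str .= $_;
    }
}
# get trailing string match
$match += $str =~ /regex/;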
(I give two because, in another language, one solution may be easier to implement than the other, not just because There's More Than One Way To Do It™.)
Of course, as your problems grow in complexity, there will arise certain benefits of constructing a full-blown parser, but that's a different horse. For now, this will suffice.
As mentioned before, a classical regex cannot match arbitrarily nested patterns, since nesting makes the language context-free rather than regular.
So if you have any nested quotes, you are not going to solve this with a regex.
(Except with the "balancing group" feature of the .NET regex engine, as mentioned by Daniel L in the comments, but I am not making any assumption about the regex flavor here.)
The exception is if you add a further rule, such as requiring a quote within a quote to be escaped.
In that case, the following:
text before string "string with \escape quote \" still
within quote" text outside quote "within quote \" still inside" outside "
inside" final outside text
would be matched successfully with:
(?ms)((?:\\(?=")|[^"])+)(?:"((?:[^"]|(?<=\\)")+)(?<!\\)")?
group1: text preceding a quoted text
group2: text within double quotes, even if \" are present in it.
Here is an expression that gets the match, but it isn't perfect, as the first match it gets is the whole string, removing the final ".
[^"].*(,).*[^"]
I have been using my Free RegEx tester to see what works.
Test Results
Group Match Collection # 1
Match # 1
Value: This is some text, followed by "text, in quotes!
Captures: 1
Match # 2
Value: ,
Captures: 1
You would be better off building a simple parser yourself (pseudo-code):
quoted := False
FOR char IN string DO
    IF char = '"'
        quoted := !quoted
    ELSE
        IF char = "," AND !quoted
            // not quoted comma found
        ENDIF
    ENDIF
ENDFOR
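The pseudo-code above translates almost line for line into Python. This is a minimal sketch: the function name unquoted_comma_positions is hypothetical, and collecting the indexes is just one possible action for the "not quoted comma found" branch.

```python
def unquoted_comma_positions(text):
    """Scan once, toggling a flag on each double quote."""
    quoted = False
    positions = []
    for index, char in enumerate(text):
        if char == '"':
            quoted = not quoted  # entering or leaving a quoted section
        elif char == "," and not quoted:
            positions.append(index)  # not quoted comma found
    return positions

line = 'Lorem ipsum "dolor, sit" amet, "consectetur, adipiscing" elit.'
print(unquoted_comma_positions(line))
```

The loop touches each character exactly once and needs no backtracking, but like the pseudo-code it does not handle nested or escaped quotes.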
This really depends on if you allow nested quotes or not.
In theory, with nested quotes you cannot do this (regular languages can't count)
In practice, you might manage if you can constrain the depth. It will get increasingly ugly as you add complexity. This is often how people get into grief with regular expressions (trying to match something that isn't actually regular in general).
Note that some "regex" libraries/languages have added non-regular features.
If this sort of thing gets complicated enough, you'll really have to write/generate a parser for it.
You need more in your description. Do you want any set of possible quoted strings and non-quoted strings like this ...
Lorem ipsum "dolor sit" amet, "consectetur adipiscing" elit.
... or simply the pattern you asked for? This is pretty close I think ...
(?<outside>.*?)(?<inside>(?=\"))
It does capture the "'s however.
Maybe you could do it in two steps?
First you replace the quoted text:
("[^"]*")
and then you extract what you want from the remaining string
,(?=(?:[^"]*"[^"]*")*[^"]*\z)
Regexes may not be able to count, but they can determine whether there's an odd or even number of something. After finding a comma, the lookahead asserts that, if there are any quotation marks ahead, there's an even number of them, meaning the comma is not inside a set of quotes.
This can be tweaked to handle escaped quotes if needed, though the original question didn't mention that. Also, if your regex flavor supports them, I would add atomic groups or possessive quantifiers to keep backtracking in check.
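The even-number-of-quotes lookahead can be tried in Python with one change: Python's re module spells the absolute end of string as \Z, so that replaces \z here. The sketch assumes unescaped, non-nested double quotes, and the helper name split_on_outside_commas is hypothetical.

```python
import re

# A comma followed by an even number of double quotes up to the end of the
# string; \Z is Python's end-of-string anchor, used in place of \z.
OUTSIDE_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*\Z)')

def split_on_outside_commas(text):
    """Split text at commas that are not inside double quotes."""
    return OUTSIDE_COMMA.split(text)

print(split_on_outside_commas('a, "b, c", d'))
```

The lookahead rescans the rest of the string for every comma, so on long inputs a single-pass scan is likely to be cheaper.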