complex regex to identify parameters with brackets only - regex

I'm trying to build a regex that will identify parameters inside brackets and ignore pl/sql comments (single line --, and multiple lines /* */)
For instance:
create or replace table_name ---sfjdslkfjslkfjslkfjdsfsdf
**(var1 in out number, var2 number)**
/* sdfls
sfdsd jfs
sfs f
sd f
sfsf */
AS
BEGIN
(var1 in out number, var2 number) should be matched only. It should also account for cases where:
There are not comments (single or multiple lines)
There are only single line comments either before or after the parameters
There are only multiple line comments either before or after the parameters
There are no parameters
Assumptions:
Parameters are always enclosed in brackets ()
Procedures can sometimes have no parameters but have comments (either single or multiple line comments) before the AS BEGIN clause
Procedures start with create or replace table_name
We're only interested to read the until the AS BEGIN clause
In other words, I need to find the index of the first opening bracket '(' that is outside any comments (single or multiple lines) and that comes before the AS BEGIN clause.
UPDATE:
I have managed to match the comments using the following regex:
(?:\/\*(?:[\s\S]*?)\*\/)|(?:\-\-(?:.*)$)
For instance here it will match all the comments:
create or replace table_name
-- sdlfksl kjs slkjslds js
/* lsdjfdkj
s fskjfs
sf sf
sdf;;''
sfs fs
*/
(hello number, var2 number)
--sdflksf
/*sl --sdflks s kdjfls())({fsfs */
AS
BEGIN
I can do this now in Java to identify the first opening bracket outside of any matching group. However it would be easier if I could just ignore the matching group and match the one parameters in the brackets only instead.
EDIT
This is not asking for a solution with pl/sql or sqlplus or whatever. I have a few pl/sql procedures stored in files that I need to modify and add new parameters to. I'm using Java to do that and inside Java im using a combination of loops and regexs.

Not sure why you want to reinvent the wheel. In Oracle, you could use the *_ARGUMENTS view to get the list of all the arguments.
For example,
SQL> CREATE OR REPLACE
2 PROCEDURE p(
3 in_date DATE,
4 out_text VARCHAR2)
5 AS
6 BEGIN
7 NULL;
8 END;
9 /
Procedure created.
SQL>
SQL> SELECT object_name,
2 package_name,
3 argument_name,
4 position,
5 data_type
6 FROM USER_ARGUMENTS
7 WHERE OBJECT_NAME = 'P';
OBJECT_NAM PACKAGE_NA ARGUMENT_NAME POSITION DATA_TYPE
---------- ---------- --------------- ---------- ---------
P OUT_TEXT 2 VARCHAR2
P IN_DATE 1 DATE
SQL>

I'm unfamiliar with Oracle's flavor of regexes, but I made a pattern that works in most regex-engines. I know MySQL has a wildly different syntax so this is quite possibly also the case in Oracle, but the general idea is this:
create or replace table_name(?:(?!AS\s+BEGIN)(?:--[^\n]*(?:\n|$)|\/\*(?:[^*]|\*(?!\/))*\*\/|(?!\/\*|--)[^(]))*\(([^)]+)
Debuggex Demo
It's a "match this except in contexts A, B or C" approach also used here and explained in depth in this SO answer. Combined with the requirement that whatever we're matching isn't followed by AS\s+BEGIN.
You can see in the Debuggex Demo that the correct part between brackets is matched in capture group 1. As well, the 2nd create doesn't have parameters, and indeed nothing is matched (even though there's brackets in comments and after the AS BEGIN.
The task at hand is porting this to Oracle's regex flavor.

Related

How to split a string in db2?

I've some URL's in my cas_fnd_dwd_det table,
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf
www.casiac.net/fnds/casi/as.pdf
www.casiac.net/fnds/casi/vindq.pdf
www.casiac.net/fnds/CASI/mnip.pdf
how do i copy the letters between last '/' and '.pdf' to another column
expected outcome
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf qnxp
www.casiac.net/fnds/casi/as.pdf as
www.casiac.net/fnds/casi/vindq.pdf vindq
www.casiac.net/fnds/CASI/mnip.pdf mnip
the below URL's are static
www.casiac.net/fnds/CASI/
www.casiac.net/fnds/casi/
Advise, how do i select the codes between last '/' and '.pdf' ?
I would recommend to take a look at REGEXP_SUBSTR. It allows to apply a regular expression. Db2 has string processing functions, but the regex function may be the easiest solution. See SO question on regex and URI parts for different ways of writing the expression. The following would return the last slash, filename and the extension:
SELECT REGEXP_SUBSTR('http://fobar.com/one/two/abc.pdf','\/(\w)*.pdf' ,1,1)
FROM sysibm.sysdummy1
/abc.pdf
The following uses REPLACE and the pattern is from this SO question with the pdf file extension added. It splits the string in three groups: everything up to the last slash, then the file name, then the ".pdf". The '$1' returns the group 1 (groups start with 0). Group 2 would be the ".pdf".
SELECT REGEXP_REPLACE('http://fobar.com/one/two/abc.pdf','(?:.+\/)(.+)(.pdf)','$1' ,1,1)
FROM sysibm.sysdummy1
abc
You could apply LENGTH and SUBSTR to extract the relevant part or try to build that into the regex.
For older Db2 versions than 11.1. Not sure if it works for 9.5, but definitely should work since 9.7.
Try this as is.
with cas_fnd_dwd_det (casi_imp_urls) as (values
'www.casiac.net/fnds/CASI/qnxp.pdf'
, 'www.casiac.net/fnds/casi/as.pdf'
, 'www.casiac.net/fnds/casi/vindq.pdf'
, 'www.casiac.net/fnds/CASI/mnip.PDF'
)
select
casi_imp_urls
, xmlcast(xmlquery('fn:replace($s, ".*/(.*)\.pdf", "$1", "i")' passing casi_imp_urls as "s") as varchar(50)) cas_code
from cas_fnd_dwd_det

How can I replace multiple words "globally" using regexp_replace in Oracle?

I need to replace multiple words such as (dog|cat|bird) with nothing in a string where there may be multiple consecutive occurrences of a word. The actual code is to remove salutations and suffixes from a name. Unfortunately the garbage data I get sometimes contains "SNERD JR JR."
I was able to create a regular expression pattern that accomplishes my goal but only for the first occurrence. I implemented a stupid hack to get rid of the second occurrence, but I believe there has to be a better way. I just can't figure it out.
Here is my "hacked" code;
FUNCTION REMOVE_SALUTATIONS(IN_STRING VARCHAR2) RETURN VARCHAR2 DETERMINISTIC
AS
REGEX_SALUTATIONS VARCHAR2(4000) := '(^|\s)(MR|MS|MISS|MRS|DR|MD|M D|SR|SIR|PHD|P H D|II|III|IV|JR)(\.?)(\s|$)';
BEGIN
RETURN TRIM(REGEXP_REPLACE(REGEXP_REPLACE(IN_STRING,REGEX_SALUTATIONS,' '),REGEX_SALUTATIONS,''));
END REMOVE_SALUTATIONS;
I was actually proud that I was able to get this far, as regular expression are not very regular to me. All help is appreciated.
EDIT:
The default for regexp_replace based on my understanding is to do a global replace. But on the outside chance my DB is configured different I did try;
select REGEXP_REPLACE('SNERD JR JR','(^|\s)(MR|MS|MISS|MRS|DR|MD|M D|SR|SIR|PHD|P H D|II|III|IV|JR)(\.?)(\s|$)',' ',1,0) from dual;
and the results are;
SNERD JR
Use occurrence parameter of REGEXP_REPLACE function. The docs says:
occurrence is a nonnegative integer indicating the occurrence of the replace operation:
If you specify 0, then Oracle replaces all occurrences of the match.
If you specify a positive integer n, then Oracle replaces the nth occurrenc
https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions137.htm#SQLRF06302
It should look like:
...
REGEXP_REPLACE(IN_STRING,REGEX_SALUTATIONS,' ', 1,0 )
...

Extract text up to the Nth character in a string

How can I extract the text up to the 4th instance of a character in a column?
I'm selecting text out of a column called filter_type up to the fourth > character.
To accomplish this, I've been trying to find the position of the fourth > character, but it's not working:
select substring(filter_type from 1 for position('>' in filter_type))
You can use the pattern matching function in Postgres.
First figure out a pattern to capture everything up to the fourth > character.
To start your pattern you should create a sub-group that captures non > characters, and one > character:
([^>]*>)
Then capture that four times to get to the fourth instance of >
([^>]*>){4}
Then, you will need to wrap that in a group so that the match brings back all four instances:
(([^>]*>){4})
and put a start of string symbol for good measure to make sure it only matches from the beginning of the String (not in the middle):
^(([^>]*>){4})
Here's a working regex101 example of that!
Once you have the pattern that will return what you want in the first group element (which you can tell at the online regex on the right side panel), you need to select it back in the SQL.
In Postgres, the substring function has an option to use a regex pattern to extract text out of the input using a 'from' statement in the substring.
To finish, put it all together!
select substring(filter_type from '^(([^>]*>){4})')
from filter_table
See a working sqlfiddle here
If you want to match the entire string whenever there are less than four instances of >, use this regular expression:
^(([^>]*>){4}|.*)
You can also use a simple, non-regex solution:
SELECT array_to_string((string_to_array(filter_type, '>'))[1:4], '>')
The above query:
splits your string into an array, using '>' as delimeter
selects only the first 4 elements
transforms the array back to a string
substring(filter_type from '^(([^>]*>){4})')
This form of substring lets you extract the portion of a string that matches a regex pattern.
You can also split the string, then choose the N'th element inside the result list. For example:
SELECT SPLIT_PART('aa,bb,cc', ',', 2)
will return: bb.
This function is defined as:
SPLIT_PART(string, delimiter, position)
In order to look at this problem, I did the following (all of the code below is available on the fiddle here):
CREATE TABLE s
(
a TEXT
);
I then created a PL/pgSQL function to generate random strings as follows.
CREATE FUNCTION f() RETURNS TEXT LANGUAGE SQL AS
$$
SELECT STRING_AGG(SUBSTR('abcdef>', CEIL(RANDOM() * 7)::INTEGER, 1), '')
FROM GENERATE_SERIES(1, 40)
$$;
I got the code from here and modified it so that it would produce strings with lots of > characters for testing purposes.
I then manually inserted a few strings at the beginning so that a quick look would tell me if the code was working as anticipated.
INSERT INTO s VALUES
('afsad>adfsaf>asfasf>afasdX>asdffs>asfdf>'),
('23433>433453>4>4559>455>3433>'),
('adfd>adafs>afadsf>'), -- only 3 '>'s!
('babedacfab>feaefbf>fedabbcbbcdcfefefcfcd'),
('e>>>>>'), -- edge case - multiple terminal '>'s
('aaaaaaa'); -- edge case - no '>'s whatsoever
The reason I put in the records with fewer than 4 >s is because the accepted answer (see discussion at the end of this answer) puts forward a solution which should return the entire string if this is the case!
On the fiddle, I then added 50,000 records as follows:
INSERT INTO s
SELECT f() FROM GENERATE_SERIES(1, 50000);
I also created a table s on a home laptop (16GB RAM, 500MB NVMe SSD) and populated it with 40,000,000 (50M) records - times also shown.
Now, my reading of the question is that we need to extract the string up to but not including the 4th > character.
The first solution (from treecon) was this one (I also show them running on the fiddle, but to save space here, I've only included the partial output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE)) - the times shown are typical over a few runs:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
ARRAY_TO_STRING((STRING_TO_ARRAY(a, '>'))[1:4], '>'),
a
FROM s;
Result (only key parts included):
Seq Scan on public.s
Execution Time: 81.807 ms
40M Time: 46 seconds
A regex solution which works (significantly faster):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
SUBSTRING(a FROM '^(?:[^>]*>){0,3}[^>]*'),
a
FROM s;
Result:
Seq Scan on public.s
Execution Time: 74.757 ms
40M Time: 32 seconds
The accepted answer fails on many levels (see the fiddle). It leaves a > at the end and fails on various strings even when modified. Also, the solution proposed to include strings with fewer than 4 >s (i.e. ^(([^>]*>){4}|.*)) merely returns the original string (see end of fiddle).

Remove substrings that vary in value in Oracle

I have a column in Oracle which can contain up to 5 separate values, each separated by a '|'. Any of the values can be present or missing. Here are come examples of how the data might look:
100-1
10-3|25-1|120/240
15-1|15-3|15-2|120/208
15-1|15-3|15-2|120/208|STA-2
112-123|120/208|STA-3
The values are arbitrary except for the order. The numerical values separated by dashes always come first. There can be 1 to 3 of these values present. The numerical values separated by a slash (if it is present) is next. The string, 'STA', and a numerical value separated by a dash is always last, if it is present.
What I would like to do is reformat this column to only ever include the first three possible values, those being the three numerical values separated by dashes. Afterwards, I want to replace 2nd numeric in each value (the numeric after the dash) using the following pattern:
1 = A
2 = B
3 = C
I would also like to remove the dash afterwards, but not the '|' that separates the values unless there is a trailing '|'.
To give you an idea, here's how the values at the beginning of the post would look after the reformatting:
100A
10C|25A
15A|15C|15B
15A|15C|15B
112ABC
I'm thinking this can be done with regex expressions but it's got me a little confused. Does anyone have a solution?
If I have to solve this problem I will solve it in following ways.
SELECT
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?',''),
REGEXP_REPLACE(column,'(\d+)-(1)([^\d])','\1A\3'),
REGEXP_REPLACE(column,'(\d+)-(2)([^\d])','\1B\3'),
REGEXP_REPLACE(column,'(\d+)-(3)([^\d])','\1C\3'),
REGEXP_REPLACE(column,'(\d+)-(123)([^\d])','\1ABC')
FROM table;
Explanation: Let us break down each REGEXP_REPLACE statement one by one.
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?','')
This will replace the end part like 120/208|STA-2 with empty string so that further processing is easy.
Finding match was easy but replacing A for 1, B for 2 and C for 3 was not possible ( as per my knowledge ) So I did those matching and replacements separately.
In each regex from second statement (\d+)-(yourNumber)([^\d]) first group is number before - then yourNumber is either 1,2,3 or 123 followed by |.
So the replacement will be according to yourNumber.
All demos here from version 1 to 5.
Note:- I have just done replacement for combination of yourNUmber for those present in question. You can do likewise for other combinations too.
you can do this in one line, but you can write simple function to do that
SELECT str, REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?','') cut
, REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4') rep3toC
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4') rep2toB
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4') rep1toA
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4'), '-', '') "rep-"
FROM (
SELECT '100-1' str FROM dual UNION
SELECT '10-3|25-1|120/240' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208|STA-2' str FROM dual UNION
SELECT '112-123|120/208|STA-3' FROM dual
) tab

pl/sql negative look behind regex

I'm trying to implement a regular expression in pl/sql which excludes any results which are preceeded by a string.
data:
exclude this: 3
include this: 3
3
cvxcvxcv3
34edfgdsfg3
Using this regexp:
(?<!exclude this: )3\d{0}(\s|$)
What I would expect to be returned is:
exclude this: 3 <-- nothing
include this: 3 <- 3
3 <- 3
cvxcvxcv3 <- 3
34edfgdsfg3 <- the second 3 only
34edfgdsfg33 <- the last 3 only
This works fine when tested in notepad++ however when implementing it in pl/sql it isn't working. Looking at similar questions it appears that pl/sql doesn't support negative lookback fully but does anyone know of a similar construct or a way to work around this?
While i am not aware of any general technique to emulate negative lookbehind by means of pl/sql regexen, in your particular case a solution is possible:
([^e].{13}|[^x].{12}|[^c].{11}|[^l].{10}|[^u].{9}|[^d].{8}|[^e].{7}|[^ ].{6}|[^t].{5}|[^h].{4}|[^i].{3}|[^s].{2}|[^:].|[^ ]|^)3[^0-9]?(\s|$)
The negative lookbehind applies to a literal. Therefore all forbidden prefixes of the first character that must match are known beforehand as are their lengths. this allows for a compact (well ...) specification as a regex that must match.
Not that I would recommend that for best practice ... or any practice at all ...
Update (processing advice):
The regex as it stands identifies matches without providing any further information for postprocessing. However, you can identify the offset of the match and the length of the forbidden prefix with the following code:
DECLARE
s_data VARCHAR2(4000); -- This will contain the line you match against
s_matchpos BINARY_INTEGER; -- Offset of the 'interesting' part (digit '3' under the various constraints) in s_data
s_prefix VARCHAR2(100); -- The prefix part of the match
s_re VARCHAR2(4000); -- The regex
BEGIN
s_re := '([^e].{13}|[^x].{12}|[^c].{11}|[^l].{10}|[^u].{9}|[^d].{8}|[^e].{7}|[^ ].{6}|[^t].{5}|[^h].{4}|[^i].{3}|[^s].{2}|[^:].|[^ ]|^)3[^0-9]?(\s|$)';
s_prefix := regexp_replace( s_data, s_re, '\1', 1, 1); -- start at offset 1 of the data and find the first match
s_matchpos := regexp_instr( s_data, s_re ) + length(s_prefix);
END;
As mentioned above, not necessarily to recommend as best practice ...