Oracle 11g - REGEXP_REPLACE - Subexpressions/different matches - regex

SQLFiddle: http://sqlfiddle.com/#!4/db1bd/49/0
I'm working on a query that returns an object's DN:(cn=name,ou=folder,dc=hostname,dc=com)
My goal is to return this information in a "prettier" output akin to AD:(name\folder\hostname.com)
I've accomplished this in a clunky way:
REGEXP_REPLACE(REGEXP_REPLACE(TEST, '.*CN=(.+?),DC=.*', '\1', 1, 1, 'i'), ',OU=', '\', 1, 0, 'i') -- grab everything between CN= and DC=, replace with \'s --
|| '\' ||
REGEXP_REPLACE(SUBSTR(TEST, REGEXP_INSTR(TEST, ',DC=', 1, 1, 0, 'i')+4),',DC=','.', 1, 0, 'i') -- grab everything after DC=, replace with .'s --
While that works I'm not thrilled with how overly complicated it is (and that it involves having to stitch two regex'd strings together).
I started clean and realized I was doing too much to get what I wanted and my starting point is now here:
REGEXP_REPLACE(test, '(,?(cn=|ou=)(.+?),)', '\3\')
I think I have a good understanding of how this one works but if I add an additional (...) it breaks what I already have working and returns the entire string. I've read that Oracle's regex engine is not as advanced as some others, but I'm struggling to grasp the order of how things are evaluated.
Example Input (can have multiple OUs/DCs):
cn=name,ou=subgroup,ou=group,dc=accounts,dc=hostname,dc=com
cn=name,ou=group,dc=hostname,dc=com
Expected Output
name\subgroup\group\accounts.hostname.com
name\group\hostname.com
The data coming in is dynamic and never a set number of OUs or DCs.

You may use
SELECT REPLACE(
REGEXP_REPLACE(
test,
'(^|,)(cn|ou)=([^,]*)(,dc=)?',
'\3\\'),
',dc=',
'.')
FROM regexTest
See the SQLFiddle.
The first (^|,)(cn|ou)=([^,]*)(,dc=)? regex matches , or start of string, then cn or ou, then =, then captures into Group 3 zero or more chars other than a comma, and then matches an optional ,dc= substring (thus, removing the first instance of ,dc=). The replacement is Group 3 contents and a backslash.
So, the second operation is easy, just replace all ,dc= with ., you do not even need a regex for this.
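As a rough cross-check outside Oracle, the same two-step logic can be sketched in Python. This uses Python's re module, not Oracle's regex engine, so treat it only as a way to reason about the pattern; the helper name dn_to_path is made up for this example, and the pattern is copied from the query above.

```python
import re

# Same pattern as in the Oracle query above; group 3 holds the cn/ou value.
PATTERN = re.compile(r'(^|,)(cn|ou)=([^,]*)(,dc=)?')

def dn_to_path(dn: str) -> str:
    # Hypothetical helper, not part of the original answer.
    # Step 1: replace each cn=/ou= component with its value plus a backslash.
    # The optional (,dc=)? swallows the first ",dc=" so no separator is left over.
    step1 = PATTERN.sub(lambda m: m.group(3) + '\\', dn)
    # Step 2: plain string replace, every remaining ",dc=" becomes a dot.
    return step1.replace(',dc=', '.')

print(dn_to_path('cn=name,ou=subgroup,ou=group,dc=accounts,dc=hostname,dc=com'))
print(dn_to_path('cn=name,ou=group,dc=hostname,dc=com'))
```

The two printed lines should correspond to the expected outputs listed in the question.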

Maybe something like this:
SELECT nvl(regexp_replace(
regexp_replace(
nullif(
regexp_replace(test, '^cn=(.+?),DC=(.+?)$', '\1 \2',1,1,'i')
, test
) , ' |,(CN|OU)=', '\\', 1, 0,'i'
), ',DC=', '.', 1, 0,'i'
),test) result
FROM regexTest
This query does not change the input if there is no DC=.
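To follow the nesting step by step, here is a sketch of the same chain in Python. It is an approximation using Python's re module rather than Oracle, and the helper name pretty_dn is hypothetical; the comparison against the input stands in for the NULLIF/NVL pair.

```python
import re

def pretty_dn(dn: str) -> str:
    # Hypothetical helper mirroring the nested query above.
    # Innermost REGEXP_REPLACE: "<cn/ou part> <dc part>", joined by a space.
    step1 = re.sub(r'^cn=(.+?),DC=(.+?)$', r'\1 \2', dn, count=1, flags=re.IGNORECASE)
    # NULLIF(..., test) plus NVL(..., test): if nothing changed, return the input.
    if step1 == dn:
        return dn
    # Replace the joining space and every ",CN=" / ",OU=" with a backslash.
    step2 = re.sub(r' |,(CN|OU)=', lambda m: '\\', step1, flags=re.IGNORECASE)
    # Replace every remaining ",DC=" with a dot.
    return re.sub(r',DC=', '.', step2, flags=re.IGNORECASE)

print(pretty_dn('cn=name,ou=group,dc=hostname,dc=com'))
print(pretty_dn('cn=name,ou=group'))  # no DC= part, so returned unchanged
```

One thing the sketch makes visible: because the second step replaces every space, a space inside a name would also turn into a backslash.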

Related

Regex (All after first match (without the first match))

I am struggling with a simple regular expression. Basically, I want everything after the first match of "_", without the "_" itself.
My current expression is like this: _(.*)
When I give input: AAA_BBB_CCC
The output is: _BBB_CCC
My ideal output would be: BBB_CCC
I am using a Snowflake database with its built-in regex function.
Unfortunately, I cannot use (?<=_).* because it does not support the "?<=" syntax. Is there some other way I can modify _(.*) to get the right output?
Thank you.
You can use a regular expression to achieve this; in JavaScript, for example, something like this will do the job:
"AAA_BBB_CCC".replace(/^[^_]*_/, '')
Use REGEXP_REPLACE with Snowflake
regexp_replace('AAA_BBB_CCC','^[^_]+_','')
https://docs.snowflake.net/manuals/sql-reference/functions/regexp_replace.html
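For comparison, the same anchored pattern can be tried in Python. This is only a sketch to show what the pattern removes; the function name is made up, and Python's re module is not Snowflake's engine.

```python
import re

def after_first_underscore(text: str) -> str:
    # Remove the leading run of non-underscore characters plus the first "_".
    # If there is no underscore, the pattern cannot match and the text
    # is returned unchanged.
    return re.sub(r'^[^_]+_', '', text)

print(after_first_underscore('AAA_BBB_CCC'))  # BBB_CCC
```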
But you can also find the index of the first _ and take the substring after it, an approach available in most languages:
const text = "AAA_BBB_CCC";
const index = text.indexOf('_');
// Everything after the first underscore; the whole string if there is none
const result = index !== -1 ? text.substring(index + 1) : text;
In Snowflake SQL, you may use REGEXP_SUBSTR, its syntax is
REGEXP_SUBSTR( <string> , <pattern> [ , <position> [ , <occurrence> [ , <regex_parameters> [ , <group_num> ] ] ] ] ).
The function allows you to return captured substrings:
By default, REGEXP_SUBSTR returns the entire matching part of the subject. However, if the e (for “extract”) parameter is specified, REGEXP_SUBSTR returns the part of the subject that matches the first group in the pattern. If e is specified but a group_num is not also specified, then the group_num defaults to 1 (the first group). If there is no sub-expression in the pattern, REGEXP_SUBSTR behaves as if e was not set.
So, you need to set the regex_parameters to e and - optionally - group_num argument to 1:
Select REGEXP_SUBSTR('AAA_BBB_CCC', '_(.*)', 1, 1, 'e', 1)
Select REGEXP_SUBSTR('AAA_BBB_CCC', '_(.*)', 1, 1, 'e')
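The difference between the whole match and the first group is easy to see in a small Python sketch. It is an analogy using Python's re module, not Snowflake itself: group(0) plays the role of the default result and group(1) the role of the 'e' parameter with group_num 1.

```python
import re

match = re.search(r'_(.*)', 'AAA_BBB_CCC')
whole = match.group(0)   # entire match, like REGEXP_SUBSTR without 'e'
first = match.group(1)   # first capture group, like 'e' with group_num 1

print(whole)  # _BBB_CCC
print(first)  # BBB_CCC
```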
Use a capture group:
\_(?<data>.*)
Which returns the capture group data containing BBB_CCC
Example:
https://regex101.com/r/xZaXKR/1
To get this actually working you need to use:
SELECT REGEXP_SUBSTR('AAA_BBB_CCC', '_(.*)', 1, 1, 'e', 1);
which gives:
REGEXP_SUBSTR('AAA_BBB_CCC', '_(.*)', 1, 1, 'E', 1)
BBB_CCC
You need to pass e in the <regex_parameters> argument of REGEXP_SUBSTR, since that is what extracts sub-matches. Thus Wiktor's answer is 95% correct.

Extract numbers from a field in PostgreSQL

I have a table with a column po_number of type varchar in Postgres 8.4. It stores alphanumeric values with some special characters. I want to ignore the characters [/alpha/?/$/encoding/.] and check whether the column contains a number. If it's a number, it needs to be cast to a number; otherwise null should be passed, as my output field po_number_new is a number field.
Below is the example:
SQL Fiddle.
I tried this statement:
select
(case when regexp_replace(po_number,'[^\w],.-+\?/','') then po_number::numeric
else null
end) as po_number_new from test
But I got an error for explicit cast:
Simply:
SELECT NULLIF(regexp_replace(po_number, '\D','','g'), '')::numeric AS result
FROM tbl;
\D being the class shorthand for "not a digit".
And you need the 4th parameter 'g' (for "globally") to replace all occurrences.
Details in the manual.
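The same idea (strip every non-digit, then turn an empty result into NULL before casting) can be sketched in Python. This is an illustration only: the helper name digits_only is hypothetical, and it converts to int where the SQL casts to numeric.

```python
import re
from typing import Optional

def digits_only(po_number: Optional[str]) -> Optional[int]:
    # Hypothetical helper: keep only ASCII digits, None when none remain.
    if po_number is None:
        return None
    # [^0-9] plays the role of \D here; re.sub replaces all occurrences,
    # which is what the 'g' flag asks regexp_replace to do.
    digits = re.sub(r'[^0-9]', '', po_number)
    # Like NULLIF(..., ''): an empty result becomes None instead of a cast error.
    return int(digits) if digits else None

print(digits_only('PO-123/45'))  # 12345
print(digits_only('abc?'))       # None
```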
For a known, limited set of characters to replace, plain string manipulation functions like replace() or translate() are substantially cheaper. Regular expressions are just more versatile, and we want to eliminate everything but digits in this case. Related:
Regex remove all occurrences of multiple characters in a string
PostgreSQL SELECT only alpha characters on a row
Is there a regexp_replace equivalent for postgresql 7.4?
But why Postgres 8.4? Consider upgrading to a modern version.
Consider pitfalls for outdated versions:
Order varchar string as numeric
WARNING: nonstandard use of escape in a string literal
I think you want something like this:
select (case when regexp_replace(po_number, '[^\w],.-+\?/', '') ~ '^[0-9]+$'
then regexp_replace(po_number, '[^\w],.-+\?/', '')::numeric
end) as po_number_new
from test;
That is, you need to do the conversion on the string after replacement.
Note: This assumes that the "number" is just a string of digits.
The logic I would use to determine if the po_number field contains numeric digits is that its length should decrease when attempting to remove numeric digits.
If so, then all non numeric digits ([^\d]) should be removed from the po_number column. Otherwise, NULL should be returned.
select case when char_length(regexp_replace(po_number, '\d', '', 'g')) < char_length(po_number)
then regexp_replace(po_number, '[^0-9]', '', 'g')
else null
end as po_number_new
from test
If you want to extract floating numbers try to use this:
SELECT NULLIF(regexp_replace(po_number, '[^\.\d]','','g'), '')::numeric AS result FROM tbl;
It's the same as Erwin Brandstetter answer but with different expression:
[^...] - match any character except a list of excluded characters, put the excluded characters instead of ...
\. - point character (also you can change it to , char)
\d - digit character
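One caveat with keeping dots is that the cleaned string is not always a valid number (for example "1.2.3" or a lone "."), in which case the SQL cast raises an error. A minimal Python sketch of the same cleaning step, with a hypothetical helper name and the assumption that invalid leftovers should become None:

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

def decimal_from_text(po_number: str) -> Optional[Decimal]:
    # Keep digits and dots only, as the [^\.\d] pattern does.
    cleaned = re.sub(r'[^.0-9]', '', po_number)
    if cleaned == '':
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # e.g. "1.2.3" or ".": not a valid number after cleaning.
        # Assumption of this sketch: return None rather than fail.
        return None

print(decimal_from_text('USD 12.50'))  # 12.50
print(decimal_from_text('v1.2.3'))     # None
```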
Since version 12 (released 2 years and 4 months ago at the time of writing, but after the last edit that I can see on the accepted answer), you can use a generated column to do this once, when the row is written, rather than having to calculate it each time you SELECT a new po_number.
Furthermore, you can use the TRANSLATE function to extract your digits, which is less expensive than the REGEXP_REPLACE solution proposed by Erwin Brandstetter!
I would do this as follows (all of the code below is available on the fiddle here):
CREATE TABLE s
(
num TEXT,
new_num INTEGER GENERATED ALWAYS AS
(NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER) STORED
);
You can add to the 'ABCDEFG... string in the TRANSLATE function as appropriate - I have decimal point (.) and a space ( ) at the end - you may wish to have more characters there depending on your input!
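Deleting a fixed list of characters works the same way in other languages. As a sketch of what the TRANSLATE expression computes, here is a Python equivalent; the names DELETE_TABLE and strip_to_int are made up for the example.

```python
from typing import Optional

# Characters to delete: uppercase letters, the decimal point and a space,
# the same list as in the TRANSLATE call above.
DELETE_TABLE = str.maketrans('', '', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ')

def strip_to_int(num: Optional[str]) -> Optional[int]:
    if num is None:
        return None
    cleaned = num.translate(DELETE_TABLE)
    # Mirrors NULLIF(..., ''): nothing left means no number.
    return int(cleaned) if cleaned else None

print(strip_to_int('AB12 3.'))  # 123
print(strip_to_int(' '))        # None
```

Any character missing from the list (a lowercase letter, say) stays in the string and makes the conversion fail, just as the cast would fail in SQL.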
And checking:
INSERT INTO s VALUES ('2'), (''), (NULL), (' ');
INSERT INTO t VALUES ('2'), (''), (NULL), (' ');
SELECT * FROM s;
SELECT * FROM t;
Result (same for both):
num new_num
2 2
NULL
NULL
NULL
So, I wanted to check how efficient my solution was, so I ran the following test inserting 10,000 records into both tables s and t as follows (from here):
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
INSERT INTO t
with symbols(characters) as
(
VALUES ('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')
)
select string_agg(substr(characters, (random() * length(characters) + 1) :: INTEGER, 1), '')
from symbols
join generate_series(1,10) as word(chr_idx) on 1 = 1 -- word length
join generate_series(1,10000) as words(idx) on 1 = 1 -- # of words
group by idx;
The differences weren't that huge but the regex solution was consistently slower by about 25% - even changing the order of the tables undergoing the INSERTs.
However, where the TRANSLATE solution really shines is when doing a "raw" SELECT as follows:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
NULLIF(TRANSLATE(num, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ. ', ''), '')::INTEGER
FROM s;
and the same for the REGEXP_REPLACE solution.
The differences were very marked, the TRANSLATE taking approx. 25% of the time of the other function. Finally, in the interests of fairness, I also did this for both tables:
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
num, new_num
FROM t;
Both extremely quick and identical!

Remove substrings that vary in value in Oracle

I have a column in Oracle which can contain up to 5 separate values, each separated by a '|'. Any of the values can be present or missing. Here are some examples of how the data might look:
100-1
10-3|25-1|120/240
15-1|15-3|15-2|120/208
15-1|15-3|15-2|120/208|STA-2
112-123|120/208|STA-3
The values are arbitrary except for the order. The numerical values separated by dashes always come first. There can be 1 to 3 of these values present. The numerical values separated by a slash (if present) come next. The string 'STA' and a numerical value, separated by a dash, always come last, if present.
What I would like to do is reformat this column to only ever include the first three possible values, those being the three numerical values separated by dashes. Afterwards, I want to replace the second number in each value (the number after the dash) using the following pattern:
1 = A
2 = B
3 = C
I would also like to remove the dash afterwards, but not the '|' that separates the values unless there is a trailing '|'.
To give you an idea, here's how the values at the beginning of the post would look after the reformatting:
100A
10C|25A
15A|15C|15B
15A|15C|15B
112ABC
I'm thinking this can be done with regex expressions but it's got me a little confused. Does anyone have a solution?
If I had to solve this problem, I would do it as follows.
SELECT
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?',''),
REGEXP_REPLACE(column,'(\d+)-(1)([^\d])','\1A\3'),
REGEXP_REPLACE(column,'(\d+)-(2)([^\d])','\1B\3'),
REGEXP_REPLACE(column,'(\d+)-(3)([^\d])','\1C\3'),
REGEXP_REPLACE(column,'(\d+)-(123)([^\d])','\1ABC')
FROM table;
Explanation: Let us break down each REGEXP_REPLACE statement one by one.
REGEXP_REPLACE(column,'\|\d+\/\d+(\|STA-\d+)?','')
This will replace the end part like 120/208|STA-2 with empty string so that further processing is easy.
Finding the match was easy, but replacing 1 with A, 2 with B and 3 with C in a single replacement was not possible (as far as I know), so I did those matches and replacements separately.
In each regex from the second statement onward, (\d+)-(yourNumber)([^\d]), the first group is the number before the -, then yourNumber is either 1, 2, 3 or 123, followed by a non-digit character such as |.
So the replacement will be according to yourNumber.
All demos here from version 1 to 5.
Note: I have only done the replacement for the combinations of yourNumber present in the question. You can do likewise for other combinations.
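To check the expected outputs from the question independently, here is a small Python sketch. Note that it uses a different technique from the chained REGEXP_REPLACE calls above: it splits on '|', keeps only the dash-separated numeric parts, and maps the digits after the dash. The names are hypothetical and this is not Oracle code.

```python
import re

# 1 -> A, 2 -> B, 3 -> C; any other digit is left untouched.
DIGIT_TO_LETTER = str.maketrans('123', 'ABC')

def reformat(value: str) -> str:
    kept = []
    for part in value.split('|'):
        # Keep only parts of the form <digits>-<digits>; this drops
        # entries such as 120/208 and STA-2.
        m = re.fullmatch(r'(\d+)-(\d+)', part)
        if m:
            kept.append(m.group(1) + m.group(2).translate(DIGIT_TO_LETTER))
    return '|'.join(kept)

for sample in ['100-1', '10-3|25-1|120/240', '112-123|120/208|STA-3']:
    print(reformat(sample))
```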
You can do this in one line, or you can write a simple function to do it.
SELECT str, REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?','') cut
, REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4') rep3toC
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4') rep2toB
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4') rep1toA
, REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(str,'(\|\d+\/\d+)?(\|STA-\d+)?',''), '(\-)([1,2]*)(3)([1,2]*)', '\1\2C\4'), '(\-)([1,C]*)(2)([1,C]*)', '\1\2B\4'), '(\-)([B,C]*)(1)([B,C]*)', '\1\2A\4'), '-', '') "rep-"
FROM (
SELECT '100-1' str FROM dual UNION
SELECT '10-3|25-1|120/240' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208' str FROM dual UNION
SELECT '15-1|15-3|15-2|120/208|STA-2' str FROM dual UNION
SELECT '112-123|120/208|STA-3' FROM dual
) tab

Need to parse the string separated by colon and eliminate part of the string if found the match in plsql

I know this topic was discussed multiple times, I looked at multiple posts and answers, but could not find exactly what I need to do.
I am trying to search the string, that has multiple values of varchar2 separated by ':', and if the match is found on another string, delete part of the string that matched, and update the table with the rest of the string.
I wrote the code using a combination of the substr and instr functions, but I am looking for a more elegant solution using regexp or collections.
For example, the input string looks like this: (abc:defg:klmnp). I need to find a piece of the string, for example defg (it could be at any position), and remove it, so that the result looks like this: (abc:klmnp).
EDIT - copied from comment:
The input string (abc:defg:klmn:defgb). Let's say I am looking for defg, and only defg will have to be removed, not defgb. Now, like I mentioned before, next time around, I might be looking for the value in position 1, or the last position. So the desired part of the string to be removed might not always be wrapped in ':' from the both sides, but depending where it is in the string, either from the right, or from the left, or from both sides.
You can do this with a combination of LIKE, REPLACE and TRIM functions.
select trim(':' from
replace(':'||your_column||':',':'||search_string||':',':')
) from table_name
where ':'||your_column||':' like '%:'||search_string||':%';
The idea is:
Surround the column with colons and use LIKE function to find the match.
And on such matched rows, use REPLACE to replace the search string along with surrounding colons, with a single colon.
And then use TRIM to remove the surrounding colons.
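The three steps above can be sketched in Python to see how the surrounding colons protect against partial matches such as defgb. The function name remove_item is made up for this illustration.

```python
def remove_item(value: str, target: str) -> str:
    # Surround the list with colons so every item, including the first
    # and the last, appears as ":item:".
    wrapped = ':' + value + ':'
    # Swap ":target:" for a single ":" and then trim the outer colons.
    return wrapped.replace(':' + target + ':', ':').strip(':')

print(remove_item('abc:defg:klmn:defgb', 'defg'))  # abc:klmn:defgb
print(remove_item('defg:klmnop:abc', 'defg'))      # klmnop:abc
print(remove_item('abc:klmnop:defg', 'defg'))      # abc:klmnop
```

In this sketch, two adjacent copies of the target (defg:defg) leave one behind, because the two matches would have to share the colon between them.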
Demo at sqlfiddle
EDIT (simplified) Perhaps this is what you need:
SELECT REGEXP_REPLACE(REPLACE('abc:defg:klmnop', ':defg:', ':'), '(^defg:|:defg$)', '')
, REGEXP_REPLACE(REPLACE('defg:klmnop:abc', ':defg:', ':'), '(^defg:|:defg$)', '')
, REGEXP_REPLACE(REPLACE('abc:klmnop:defg', ':defg:', ':'), '(^defg:|:defg$)', '')
, REGEXP_REPLACE(REPLACE('abc:klmnop:defgb:defg', ':defg:', ':'), '(^defg:|:defg$)', '')
FROM DUAL
;
which removes defg from start, middle, and end, and ignores defgb, giving:
abc:klmnop
klmnop:abc
abc:klmnop
abc:klmnop:defgb
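The same combination (a plain replace for the middle, then an anchored regex for the two ends) looks like this in Python. It is a sketch with a hypothetical function name, using Python's re module rather than Oracle's.

```python
import re

def remove_item_regex(value: str, target: str) -> str:
    t = re.escape(target)
    # Middle occurrences: ":target:" collapses to ":".
    value = value.replace(':' + target + ':', ':')
    # Then drop a leading "target:" or a trailing ":target".
    return re.sub(r'^' + t + r':|:' + t + r'$', '', value)

for s in ['abc:defg:klmnop', 'defg:klmnop:abc',
          'abc:klmnop:defg', 'abc:klmnop:defgb:defg']:
    print(remove_item_regex(s, 'defg'))
```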
And to update the table, you could:
UPDATE my_table
SET value = REGEXP_REPLACE(REGEXP_REPLACE(value, ':defg:', ':'), '(^defg:|:defg$)', '')
-- WHERE REGEXP_LIKE(value, '(^|.*:)defg(:.*|$)')
WHERE value LIKE '%defg%'
;
(though that final regex for the where may need to be tweaked to match, hard to test...)

Use REGEXP_SUBSTR like a Split function

I need to extract a text value from data in a VARCHAR2 column. Sample:
EDKES^Visit: ^PRIMARY INSURANCE COMMENTS: ^SECONDARY INSURANCE COMMENTS: ^TERTIARY INSURANCE COMMENTS: ^NO PRIMARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO SECONDARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO TERTIARY INS*
I need to get the text that precedes the 6th occurrence of the '^' (excluding the '^'). In this example, the text would be NO PRIMARY INSURANCE.
([\w\s\:\*]+(\^?)) mostly works, but doesn't exclude the '^'.
When I try to use this expression REGEXP_SUBSTR(VARCHAR_COL, '([\w\s\:\*]+(\^?))', 1, 6), I get a single character ('s'), rather than the expected match NO PRIMARY INSURANCE^.
What am I missing?
This should work pretty well:
REPLACE(REGEXP_SUBSTR(VARCHAR_COL, '[^^]+\^?', 1, 6), '^', '')
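To see what the pattern picks out, here is a Python sketch of the same idea: take the n-th match of "non-caret characters plus an optional caret" and strip the caret. The function name nth_field is hypothetical, the sample is a shortened copy of the data in the question, and Python's re module stands in for Oracle's engine.

```python
import re

def nth_field(text: str, n: int) -> str:
    # Each match is a run of non-caret characters plus an optional trailing "^".
    matches = re.findall(r'[^^]+\^?', text)
    # n is 1-based, like the occurrence argument of REGEXP_SUBSTR.
    return matches[n - 1].replace('^', '')

sample = ('EDKES^Visit: ^PRIMARY INSURANCE COMMENTS: ^SECONDARY INSURANCE COMMENTS: '
          '^TERTIARY INSURANCE COMMENTS: ^NO PRIMARY INSURANCE^NONE^NONE')
print(nth_field(sample, 6))  # NO PRIMARY INSURANCE
```

Because the pattern needs at least one non-caret character, empty fields are skipped and later positions shift: in this sketch nth_field('a^^b', 2) gives 'b', whereas 'a^^b'.split('^')[1] is the empty string.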
You might be able to account for blank columns as well. And if the engine only returns the capture groups, it will trim the delimiter.
([^^]*).?
This of course means that the last column found is always invalid.