Using Oracle Regular Expression - Masking based on pattern

Using Oracle Regular Expression - Masking based on pattern - regex

Cleaning up ,
With Oracle 11g PL/SQL, for below query, can I get the capture groups' positions (something like what Matcher.start() provides in java).
`select regexp_replace('1234bankzone1234', '^..(.*)bank(zone).(.*)..$', '\2') from dual`
Result should look like : "zone", 9(start of text "zone").
The bigger problem I was trying to solve is to mask data like account number using patterns like '^.....(.*)..$' (this pattern can vary depending on installation).

Will something like below work for you?
select regexp_replace('1234bankzone1234', '^..(.*)bank(zone).(.*)..$', '\2') expr
,instr('1234bankzone1234',regexp_replace('1234bankzone1234', '^..(.*)bank(zone).(.*)..$', '\2')) pos from dual
or more readable subquery like
select a.*, instr(a.value,a.expr) from (
select '1234bankzone1234' value,
regexp_replace('1234bankzone1234', '^..(.*)bank(zone).(.*)..$', '\2') expr from dual
) a
I couldn't find any direct equivalent of Matcher API like functionality and there is no way you can access the position group buffer in SQL.

1: Reverse pattern using this
regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace(
pattern, '(\()', '\1#') , '(\))', '#\1') , '\(#', ')#') , '\^\)#', '^') , '#\)\$', '$') , '#\)', '(#') , '#', '') , '\^([^\(]+\))', '^(\1') , '\(([^\)]+)\$', '(\1)$');
So, "^(.)..(.).$"; becomes "^.(..).(.)$";
2: Use this to bulk collect index and count of capture groups within both patterns
SELECT REGEXP_instr(pattern, '\(.*?\)+', 1, LEVEL) bulk collect into posCapture FROM v CONNECT BY LEVEL <= REGEXP_COUNT(pattern, '\(.*?\)');
3: Match both patterns against the text-to-be-masked. Merge them by the order found in step 2.
select regexp_replace(v_src, pattern, '\' || captureIndex) into tempStr from dual;

Related

Extract all substrings bounded by the same characters

Given a name_loc column of text like the following:
{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}
I'm trying to extract the names, ideally separated by commas:
Charlie, Wrigley, Ana
I've gotten this far:
SELECT SUBSTRING(CAST(name_loc AS VARCHAR) from '"([^ –]+)')
FROM table;
which returns
Charlie
How can I extend this query to extract all names?

You can do this with a combination of regexp_matches (to extract the names), array_agg (to regroup all matches in a row) and array_to_string (to format the array as you'd like, e.g. with a comma separator):
WITH input(name_loc) AS (
VALUES ('{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}')
, ('{"Other - somewhere}') -- added this to show multiple rows do not get merged
)
SELECT array_to_string(names, ', ')
FROM input
CROSS JOIN LATERAL (
SELECT array_agg(name)
FROM regexp_matches(name_loc, '"(\w+)', 'g') AS f(name)
) AS f(names);
array_to_string
Charlie, Wrigley, Ana
Other
View on DB Fiddle

My two cents, though I'm rather new to postgreSQL and I had to copy the 1st piece from #Marth's his answer:
WITH input(name_loc) AS (
VALUES ('{"Charlie – White Plains, NY","Wrigley – Minneapolis, MN","Ana – Decatur, GA"}')
, ('{"Other - somewhere"}')
)
SELECT REGEXP_REPLACE(name_loc, '{?(,)?"(\w+)[^"]+"}?','\1\2', 'g') FROM input;
regexp_replace
Charlie,Wrigley,Ana
Other

Your string literal happens to be a valid array literal.
(Maybe not by coincidence? And the column should be type text[] to begin with?)
If that's the reliable format, there is a safe and simple solution:
SELECT t.id, x.names
FROM tbl t
CROSS JOIN LATERAL (
SELECT string_agg(split_part(elem, ' – ', 1), ', ') AS names
FROM unnest(t.name_loc::text[]) elem
) x;
Or:
SELECT id, string_agg(split_part(elem, ' – ', 1), ', ') AS names
FROM (SELECT id, unnest(name_loc::text[]) AS elem FROM tbl) t
GROUP BY id;
db<>fiddle here
Steps
Unnest the array with unnest() in a LATERAL CROSS JOIN, or directly in the SELECT list.
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Take the first part with split_part(). I chose ' – ' as delimiter, not just ' ', to allow for names with nested space like "Anne Nicole". See:
Split comma separated column data into additional columns
Aggregate results with string_agg(). I added no particular order as you didn't specify one.
Concatenate multiple result rows of one column into one, group by another column

Regular expression to get last part of a string before a character

I have this string on a single column of a single row on an oracle table:
(test-1#gmail.com-1234567)
(testAAAcccc#gmail.com-7654321)
..
Above it's a single big string.
I need a regular expression to extract all the occurrences (could be 1 or more, 2 in above example) of the 7 numbers above, so the results should be:
1234567
7654321
I'm trying to to that with various regular expression or oracle functions, I'm not able to get both the occurrences.
Could you please help me?

If you need exactly regular expression:
select regexp_substr('(test-1#gmail.com-1234567)', '\d{7}' ) from dual
To find all occurences:
select *
from t,
lateral(select level occurence_number, regexp_substr(str, '(\d{7})',1,level ) digits7
from dual
connect by level<=regexp_count(str, '(\d{7})' )
);
or test query with test data (you can run it as-is to se how it works):
with t(str) as (
select '(test-1#gmail.com-1234567)' from dual union all
select '(testAAAcccc#gmail.com-7654321)' from dual union all
select '7654321 1234567 2345678' from dual
)
select *
from t,
lateral(select level occurence_number, regexp_substr(str, '(\d{7})',1,level ) digits7
from dual
connect by level<=regexp_count(str, '(\d{7})' )
);

If you are looking to extract only numbers, if you can do that by:
Select SUBSTR(column, INSTR(column,'-', -1) + 1)
from dual;
#It is fetching everything from column after dash(-).
If you have paranthesis, if you can replace it with a space and then TRIM:
Select TRIM(replace(SUBSTR(column, INSTR(column,'-', -1) + 1),')', ' '))
from dual;

Oracle regexp cant seem to get it right

I have some values like
CR-123456
ECR-12345
BCY-499494
134-ABC
ECW-ECR1233
CR-123344
I want to match all lines which do not start with ECR and the regex for doing so is ^((?!ECR)\w+) which seems to do what I want.
But then I want to replace the matched values which do not begin with ECR and replace them with ECR and i am blanked because the following doesn't seem to work
select regexp_replace('CR-123344','^((?!ECR)\w+)','ECR') from dual
Any ideas where i have gone wrong ?
I want the result to be
ECR-123456
ECR-12345
ECR-499494
ECR-ABC
ECR-ECR1233
ECR-123344

You don't absolutely need to use regex here, you can just use Oracle's base string functions.
SELECT
'ECR-' || SUBSTR(col,
INSTR(col, '-') + 1,
LENGTH(col) - INSTR(col, '-')) AS new_col
FROM yourTable
WHERE col NOT LIKE 'ECR-%'
The advantage of this approach is that it might run faster than a regex. The disadvantage is that the code is a bit less tidy, but if you understand how it works then this is the most important thing.

I would use substring and instr to replace everything before the dash, but here is your answer using regexp:
WITH aset
AS (SELECT 'CR-123456' a
FROM DUAL
UNION ALL
SELECT 'BCY-12345' a
FROM DUAL
UNION ALL
SELECT 'ECR-499494' a
FROM DUAL
UNION ALL
SELECT '134-ABC' a
FROM DUAL
UNION ALL
SELECT 'ECW-ECR1233' a
FROM DUAL
UNION ALL
SELECT 'CR-123344'
FROM DUAL)
SELECT a, regexp_replace(a, '^([^-]*)','ECR') b
FROM aset;
Results in
A,B
CR-123456,ECR-123456
BCY-12345,ECR-12345
ECR-499494,ECR-499494
134-ABC,ECR-ABC
ECW-ECR1233,ECR-ECR1233
CR-123344,ECR-123344

Looks like you are replacing characters before the '-' with ECR. Do you need to check if it does not match 'ECR' at all?
Because this will give you what you want, will it not?
select regexp_replace('CR-123344','(.*)-','ECR-')
from dual;

Oracle regex eliminate all duplicate words

I would like to eliminate all duplicate words in a comma separated list.
I've tried with:
SELECT
REGEXP_REPLACE(
'1234,234,1234,1234,928,1234,123,1234,Abcd,1234,1234',
'([^,\w]+)(,[ ]*[\1])+') AS r
FROM dual
It should return
1234,234,928,123,Abcd
But in fact it returns
1234,234,234,234
Also tried with ([^,\w]+)(,[ ]*\1)+ but with '1234,1234,1234' it returns (null)
Also tried with
SELECT
REGEXP_REPLACE(
'1234,234,1234,1234,928,1234,123,1234,Abcd,1234,1234',
'([^,\w]+)(,[ ]*[\1])+', '\1') AS r
FROM dual
and following replacements, even '\1\2' but none of them is giving the desired result.
Please, any ideas?

I know this isn't exactly the method you were asking for, but it still achieves the same result:
WITH DATA AS
( SELECT '1234,234,1234,1234,928,1234,123,1234,Abcd,1234,1234' str FROM dual)
SELECT DISTINCT trim(regexp_substr(str, '[^,]+', 1, LEVEL)) str
FROM DATA
CONNECT BY instr(str, ',', 1, LEVEL - 1) > 0

Last word in a sentence: In SQL (regular expressions possible?)

I need this to be done in Oracle SQL (10gR2). But I guess, I would rather put it plainly, any good, efficient algorithm is fine.
Given a line (or sentence, containing one or many words, English), how will you find the last word of the sentence?
Here is what I have tried in SQL. But, I would like to see an efficient way of doing this.
select reverse(substr(reverse(&p_word_in)
, 0
, instr(reverse(&p_word_in), ' ')
)
)
from dual;
The idea was to reverse the string, find the first occurring space, retrieve the substring and reverse the string. Is it quite efficient? Is a regular expression available? I am on Oracle 10g R2. But I dont mind seeing any attempt in other programming language, I wont mind writing a PL/SQL function if need be.
Update:
Jeffery Kemp has given a wonderful answer. This works perfectly.
Answer
SELECT SUBSTR(&sentence, INSTR(&sentence,' ',-1) + 1)
FROM dual

I reckon it's simpler with INSTR/SUBSTR:
WITH q AS (SELECT 'abc def ghi' AS sentence FROM DUAL)
SELECT SUBSTR(sentence, INSTR(sentence,' ',-1) + 1)
FROM q;

Not sure how it is performance wise, but this should do it:
select regexp_substr(&p_word_in, '\S+$') from dual;

I'm not sure if you can use a regex in oracle, but wouldn't
(\w+)\W*$
work?

This regex matches the last word on a line:
\w+$
And RegexBuddy gives this code for use in Oracle:
DECLARE
match VARCHAR2(255);
BEGIN
match := REGEXP_SUBSTR(subject, '[[:alnum:]]_+$', 1, 1, 'c');
END;

this leaves the punctuation but gets the final word
with datam as (
SELECT 'abc asdb.' A FROM DUAL UNION
select 'ipso factum' a from dual union
select 'ipso factum' a from dual union
SELECT 'ipso factum2' A FROM DUAL UNION
SELECT 'ipso factum!' A FROM DUAL UNION
SELECT 'ipso factum !' A FROM DUAL UNION
SELECT 'ipso factum/**//*/?.?' A FROM DUAL UNION
SELECT 'ipso factum ...??!?!**' A FROM DUAL UNION
select 'ipso factum ..d.../.>' a from dual
)
SELECT a,
--REGEXP_SUBSTR(A, '[[:alnum:]]_+$', 1, 1, 'c') , /** these are the other examples*/
--REGEXP_SUBSTR(A, '\S+$') , /** these are the other examples*/
regexp_substr(a, '[a-zA-Z]+[^a-zA-Z]*$')
from datam

This works too, even for non english words :
SELECT REGEXP_SUBSTR ('San maria Calle Cáceres Numéro 25 principal izquierda, España', '[^ .]+$') FROM DUAL;

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Using Oracle Regular Expression - Masking based on pattern - regex

Related

Extract all substrings bounded by the same characters

Regular expression to get last part of a string before a character

Oracle regexp cant seem to get it right

Oracle regex eliminate all duplicate words

Last word in a sentence: In SQL (regular expressions possible?)

Categories

Resources