Remove prefix and suffix with oracle regex

I have some records that may or may not come with a prefix and a suffix.
I'm trying to figure out a REGEXP_REPLACE that will always return a parametrized value.
My attempt so far is:
with teste as (
select '+123145#domain.com' as num from dual
union
select '0054321#domain.com' as num from dual
union
select '006789' as num from dual
union
select '+9876' as num from dual
union
select '13579#domain.com' as num from dual
union
select '123456789' as num from dual
)
select REGEXP_REPLACE(num,'^00(.*)\#.*$|^\+(.*)\#.*$','\1') from teste
but it is not quite there.
The output should be:
num
123145
54321
6789
9876
13579
123456789

Try this one here
REGEXP_REPLACE(num,'^(00|\+)?(\d*)(\#.*)?$','\2')
See it here online at Regexr
I am not sure what Oracle's regex engine supports. One critical point could be \d (a digit); if it does not work, replace \d with [0-9].
Oracle does not support non-capturing groups (?:), so the optional prefix and suffix are ordinary capturing groups and the digits you want are always in the second group, \2.
I also changed your alternatives using | into optional parts using ? after the groups. With the two branches of your "OR", the result is sometimes in group 1 (when the first alternative matches) and sometimes in group 2 (when the second alternative matches), so \1 is empty whenever the second alternative matches.
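As a quick sanity check outside the database, here is a minimal sketch that runs the same pattern through Python's re module. It assumes Python's engine handles this particular pattern the same way Oracle's does, which is not guaranteed in general; the helper name strip_affixes is made up for illustration.

```python
import re

# Optional prefix (00 or +), then the digits to keep, then an optional #suffix.
# Assumption: Python's re stands in for Oracle's REGEXP_REPLACE here.
PATTERN = re.compile(r'^(00|\+)?(\d*)(#.*)?$')

def strip_affixes(num: str) -> str:
    """Hypothetical helper: keep only the digits captured by group 2."""
    return PATTERN.sub(r'\2', num)

samples = ['+123145#domain.com', '0054321#domain.com', '006789',
           '+9876', '13579#domain.com', '123456789']
results = [strip_affixes(s) for s in samples]
print(results)
```

Because the digits always land in group 2, the replacement string does not depend on which optional parts were present.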

Related

matching patterns for regexp_like in Oracle to include a group of character conditionally

I tried to find good documentation on matching patterns for regexp_like in Oracle. I found some and followed the instructions, but it looks like I have missed something or the instructions are not comprehensive.
Let's look at this example:
SELECT * FROM
(
SELECT 'ABC' T FROM DUAL
UNION
SELECT 'WZY' T FROM DUAL
UNION
SELECT 'WZY_' T FROM DUAL
UNION
SELECT 'WZYEFG' T FROM DUAL
UNION
SELECT 'WZY_EFG' T FROM DUAL
) C
WHERE regexp_like(T, '(^WZY)+[_]{0,1}+[A-Z]{0,6}')
What I expect to receive are WZY and WZY_EFG. But what I got was:
What I would like is for the "_" to be optional, but if there are characters after the first group, it must be present exactly once.
Is there a clean way to do this?

Use a subexpression grouping to make sure the _ character appears only with Capitalized Alphabetical Characters
Yes, your pattern does not address the conditional logic you need (only see the _ when capitalized alphabetical characters follow).
Placing the _ character in with a capitalized alphabetical character list into a subexpression grouping forces this logic.
Finally, adding the end-of-line anchor $ rules out rows where the optional part matches zero times but extra characters still follow.
SELECT
*
FROM
(
SELECT 'ABC' t FROM dual
UNION ALL
SELECT 'WZY' t FROM dual
UNION ALL
SELECT 'WZY_' t FROM dual
UNION ALL
SELECT 'WZYEFG' t FROM dual
UNION ALL
SELECT 'WZY_EFG' t FROM dual
) c
WHERE
REGEXP_LIKE ( t, '^(WZY)+([_][A-Z]{1,6}){0,1}$' );
T
__________
WZY
WZY_EFG
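The same grouping idea can be tried outside Oracle. This is a minimal sketch using Python's re module as a stand-in (an assumption, since the two engines are not identical); the variable names are illustrative only.

```python
import re

# WZY, optionally followed by exactly one "_" plus one to six capital letters.
# The "_" lives inside the optional group, so it cannot appear on its own.
PATTERN = re.compile(r'^(WZY)+(_[A-Z]{1,6})?$')

rows = ['ABC', 'WZY', 'WZY_', 'WZYEFG', 'WZY_EFG']
matched = [t for t in rows if PATTERN.search(t)]
print(matched)
```

Dropping the trailing $ would let WZY_ and WZYEFG through again, because the optional group could match zero times and the leftover characters would be ignored.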

SQL Regex Pattern, How to match only a specific variable between two characters? (see Sample Output)

I have this inputs:
John/Bean/4000-M100
John/4000-M100
John/4000
How can I get just the 4000? Note that the 4000 changes from time to time (it can be 3000 or 2000), so how can I handle that with a regex pattern?
Here's my attempt so far. It satisfies John/4000-M100 and John/4000, but the input with two slashes does not match the regex I have:
REGEXP_REPLACE(REGEXP_SUBSTR(a.demand,'/(.*)-|/(.*)',1,1),'-|/','')

You can use this query to get the results you want:
select regexp_replace(data, '^.*/(\d{4})[^/]*$', '\1')
from test
The regex looks for a set of 4 digits following a / and then not followed by another / before the end of the line and replaces the entire content of the string with those 4 digits.
Demo on dbfiddle
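For readers without an Oracle instance at hand, here is a minimal sketch of the same replacement in Python. Python's re module is assumed to behave like REGEXP_REPLACE for this pattern, and extract_code is a hypothetical helper name.

```python
import re

# Greedy .* runs to the last "/", the group grabs the four digits after it,
# and [^/]* swallows the rest of the string so the whole value is replaced.
PATTERN = re.compile(r'^.*/(\d{4})[^/]*$')

def extract_code(value: str) -> str:
    """Hypothetical helper: return the four digits after the last slash."""
    return PATTERN.sub(r'\1', value)

inputs = ['John/Bean/4000-M100', 'John/4000-M100', 'John/4000']
codes = [extract_code(v) for v in inputs]
print(codes)
```

A value that does not match the pattern comes back unchanged, which mirrors how a regex replace behaves when nothing matches.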

This would also work, unless you need any digit followed by three zeros. See it in action here, for as long as it lives, http://sqlfiddle.com/#!4/23656/5
create table test_table
( data varchar2(200));
insert into test_table values('John/Bean/4000-M100');
insert into test_table values('John/4000-M100');
insert into test_table values('John/4000');
select a.*,
replace(REGEXP_SUBSTR(a.data,'/\d{4}'), '/', '')
from test_table a;

The following will match any multiple of 1000 less than 10000 when it is preceded by a slash:
\/[1-9]0{3}
To match any four-digit number preceded by a slash and not followed by another digit, such as the 4031 in
Sal_AS_180763852/4200009751_S5_154552/4031
try:
\/\d{3}(?:(?:\d[^\d])|(?:\d$))
https://regex101.com/r/Am34WO/1

Regex match numbers of at least 6 digits that only have up to 3 digits different from 0

EDIT
I need to identify all numbers of at least 6 digits and maximum 25 digits having only 1 to 3 digits that are different from 0.
Examples: 000123, 0103040000, 10320000, 70000000, 12000009000
I was trying something like this:
regexp_like(number, '[1-9]\d{1,3}') AND regexp_like(number,'(0){5,24}')
(it's ok to use more than one regular expression)
But this also matches numbers like:
0046700000031, 00394000007 - These should not match because they have more than 3 digits other than 0; a number must have at least 1 and at most 3 digits other than 0.
I'm using Oracle 12C.
SOLUTION
Here is an alternative I've found, which seems to work but I presume only in Oracle.
SELECT NUMBER
FROM TABLE t
WHERE LENGTH(NUMBER) > 5 HAVING(regexp_count(NUMBER, '0') > 2
AND regexp_count(NUMBER, '[1-9]') BETWEEN 1 AND 3)
GROUP BY NUMBER
Thanks

You cannot do this with a single regex in Oracle 12c, because the regex engine is POSIX-based and does not support lookarounds (neither lookbehinds nor lookaheads). Use one pattern to check the format of the string and the regular LENGTH function to check its length.
Here is a full demo:
WITH testdata(txt) AS (
SELECT '000123' from dual
UNION
SELECT '0103040000' from dual
UNION
SELECT '10320000' from dual
UNION
SELECT '70000000' from dual
UNION
SELECT '12000009000' from dual
UNION
SELECT '0046700000031' from dual
UNION
SELECT '00394000007' from dual
)
SELECT * FROM testdata WHERE REGEXP_LIKE(txt, '^(0*[1-9]){1,3}0*$') AND LENGTH(txt) > 5 AND LENGTH(txt) < 26
See the regex demo. Details:
^ - start of string
(0*[1-9]){1,3} - one, two or three repetitions of:
    0* - zero or more zeros
    [1-9] - a non-zero digit
0* - zero or more zeros
$ - end of string.
See the Oracle demo online.
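The pattern-plus-length check can also be exercised outside the database. The sketch below uses Python's re module as a stand-in for REGEXP_LIKE (an assumption) and a made-up function name, is_wanted, to combine the two conditions.

```python
import re

# One to three "zeros then a non-zero digit" chunks, then trailing zeros.
FORMAT = re.compile(r'^(0*[1-9]){1,3}0*$')

def is_wanted(txt: str) -> bool:
    """Hypothetical helper: format check plus the 6..25 length check."""
    return bool(FORMAT.match(txt)) and 5 < len(txt) < 26

testdata = ['000123', '0103040000', '10320000', '70000000',
            '12000009000', '0046700000031', '00394000007']
kept = [t for t in testdata if is_wanted(t)]
print(kept)
```

The length test is kept separate on purpose: the regex only counts non-zero digits, so the overall length has to be bounded by ordinary string functions.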

One option would be to use a positive lookahead to require at least one non-zero digit, and a negative lookahead to rule out four or more:
^(?=.*[1-9])(?!.*[1-9].*[1-9].*[1-9].*[1-9])[0-9]{6,25}$
In a database such as Postgres, we could try the following query:
SELECT *
FROM yourTable
WHERE number ~ '^(?=.*[1-9])(?!.*[1-9].*[1-9].*[1-9].*[1-9])[0-9]{6,25}$';
Using a database like SQL Server which does not directly support regex, but has some regex LIKE capability, we could try:
WHERE LEN(number) BETWEEN 6 AND 25 AND -- 6 to 25 digits
number LIKE '%[1-9]%' AND -- at least 1 non zero digit
number NOT LIKE '%[1-9]%[1-9]%[1-9]%[1-9]%' AND -- at most 3 non zero digits
number NOT LIKE '%[^0-9]%'; -- all numbers
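To see that the lookahead pattern agrees with a plain count of non-zero digits, here is a small sketch in Python, whose re module supports lookaheads. The function by_count is a hypothetical reference implementation written only for this comparison.

```python
import re

# Positive lookahead: at least one non-zero digit.
# Negative lookahead: no four non-zero digits anywhere in the string.
LOOKAHEAD = re.compile(
    r'^(?=.*[1-9])(?!.*[1-9].*[1-9].*[1-9].*[1-9])[0-9]{6,25}$')

def by_count(txt: str) -> bool:
    """Hypothetical reference check that uses no regex at all."""
    nonzero = sum(ch in '123456789' for ch in txt)
    all_digits = set(txt) <= set('0123456789')
    return all_digits and 6 <= len(txt) <= 25 and 1 <= nonzero <= 3

samples = ['000123', '0103040000', '10320000', '70000000', '12000009000',
           '0046700000031', '00394000007', '000000', '12345']
agree = all(bool(LOOKAHEAD.match(s)) == by_count(s) for s in samples)
print(agree)
```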

Try this pattern: ^(?=(0*[1-9]){1,3}0*$)\d{6,25}$
Explanation: the lookahead (?=(0*[1-9]){1,3}0*$) verifies that what follows contains one to three non-zero digits and otherwise only zeros; \d{6,25} then enforces the length.
Demo

Regular Expression - Not matching certain characters and position of characters

SOLUTION:
Finally solved it using the regex provided by Gary_W below and a simple PowerShell command that uses the discussed replacement function. So there was no need to use the built-in regex activity in the software we use. Here's the PowerShell:
"100,000.00" -replace "([,.]\d{2}$)|[,.]",""
Regular expressions are freaking me out. I cannot get used to that logic. However, I think my current RE problem is quite a simple one, but I cannot make it work :(
So here's what I want to achieve:
I want the RE to match only the digits before the last two decimal places.
Thus, the RE must ignore any "." and "," AND always the last two digits.
> Examples:
> 1.000.000,00 --> 1000000
> 123,456.00 --> 123456
> 100.000,00 --> 100000
> 10.000,00 --> 10000
> 10,000.00 --> 10000
> 1.000,00 --> 1000
> 100,00 --> 100
> 99.88 --> 99
> 99,88 --> 99
> 1,23 --> 1
> ...
Any ideas how to get this working?

Here's how I would do it in Oracle, for what it's worth. Maybe the regex used here will give you an idea. Read the regex as "look for a comma or a decimal point followed by 2 digits at the end of the line, OR a comma or a decimal point, and replace the match with nothing".
Note the match for the optional decimal places at the end needs to be first in the regex, otherwise the single characters are matched first, making the 2 decimal places non-existent and thus not matched.
SQL> with tbl(str) as (
select '1.000.000,00' from dual union all
select '123,456.00' from dual union all
select '100.000,00' from dual union all
select '10.000,00' from dual union all
select '10,000.00' from dual union all
select '1.000,00' from dual union all
select '100,00' from dual union all
select '99.88' from dual union all
select '99,88' from dual union all
select '1,23' from dual union all
select '3' from dual
)
select str,
regexp_replace(str, '([,.]\d{2}$)|[,.]') fixed
from tbl;
STR FIXED
------------ ------------
1.000.000,00 1000000
123,456.00 123456
100.000,00 100000
10.000,00 10000
10,000.00 10000
1.000,00 1000
100,00 100
99.88 99
99,88 99
1,23 1
3 3
11 rows selected.
SQL>
Just saw the regexr link, plugging in my regex looks like it works with the global flag. The characters you wish to remove are highlighted.
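The point about alternative order can be checked with a short sketch. This uses Python's re module, which tries alternatives left to right; treating that as representative of the Oracle behaviour described above is an assumption. The name strip_separators is illustrative only.

```python
import re

# First alternative: separator plus two digits at the end (the decimals).
# Second alternative: any other single separator.
PATTERN = re.compile(r'([,.]\d{2}$)|[,.]')

def strip_separators(s: str) -> str:
    """Hypothetical helper: drop separators and the two trailing decimals."""
    return PATTERN.sub('', s)

inputs = ['1.000.000,00', '123,456.00', '100.000,00', '99.88', '1,23', '3']
fixed = [strip_separators(s) for s in inputs]
print(fixed)

# With the alternatives swapped, the lone separator wins and the decimals stay.
swapped = re.sub(r'[,.]|([,.]\d{2}$)', '', '1.000,00')
print(swapped)
```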

In which language/with which tool? With sed, you can do:
sed 's/\(.*\)[\.,]../\1/;s/[\.,]//g'
In Perl it's similar, just without the backslashes before the parentheses:
perl -pe 's/(.*)[\.,]../\1/;s/[\.,]//g'
This is done with two regexes, by the way. The first one reads "save all that you can, up to a dot or a comma followed by two chars, and then replace the whole match with that". The second one reads "replace all dots and commas with nothing", that is, "remove all dots and commas".
In regexr.com you can use "Replace" in Tools to replace the match with the first capture group. Just put (.*)[\.,].. in Expression, and $1 in Replace, to see the first regex working. Then you can do something similar with the second one, as regexr doesn't support chaining of expressions, as far as I can see.
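The two-step approach translates directly to other tools. Below is a minimal Python sketch of the same two substitutions; two_pass is a hypothetical function name, and count=1 mirrors the fact that the first sed expression has no g flag.

```python
import re

def two_pass(s: str) -> str:
    """Hypothetical helper mirroring the two sed expressions."""
    # Pass 1: keep everything before the last separator that is
    # followed by two characters (the decimals).
    s = re.sub(r'(.*)[.,]..', r'\1', s, count=1)
    # Pass 2: remove every remaining dot or comma.
    return re.sub(r'[.,]', '', s)

inputs = ['1.000.000,00', '10,000.00', '99.88', '3']
outputs = [two_pass(s) for s in inputs]
print(outputs)
```

Note that this relies on every separated value ending in exactly two decimals; the first pass cannot tell decimals from a final thousands group.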

Matching a group that may or may not exist

My regex needs to parse an address which looks like this:
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
-------------------- ----- -------- -----
         1             2       3      4*
Groups one, two and three will always exist in an address. Group 4 may not exist. I've written a regex that helps me get the first, second and third part but I would also need the fourth part. Part 4 is the country name and can either be FINLAND or SUOMI. If the fourth part didn't exist in an address the fourth group would be empty. This is my regex so far but the third group captures the country too. Any help?
(.*?)\s(\d{5})\s(.*)$
(I'm going to use this with Oracle's REGEXP functions.)

Change the regex to:
(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$
Making group three non-greedy lets the optional space and country choices match. If group 4 doesn't match, I think it will be uninitialized rather than blank, but that depends on the language.
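To illustrate what happens to the optional fourth group, here is a minimal sketch in Python. How Oracle reports an unmatched group may differ, so treat Python's behaviour (the group comes back as None) as one example of the language dependence mentioned above.

```python
import re

# Same pattern as above: lazy group 3, optional space, optional country.
PATTERN = re.compile(r'(.*?)\s(\d{5})\s(.+?)\s?(FINLAND|SUOMI)?$')

with_country = PATTERN.match(
    'BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI').groups()
without_country = PATTERN.match(
    'BLOOKKOKATU 20 A 773 00810 HELSINKI').groups()

print(with_country)
print(without_country)
```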

To match a character (or in your case a group) that may or may not exist, you need to use ? after the character/subpattern/class in question. I'm answering now because RegEx is complicated and should be explained: posting only the fix without an explanation isn't enough!
A question mark matches zero or one of the preceding character, class, or subpattern. Think of this as "the preceding item is optional". For example, colou?r matches both color and colour because the "u" is optional.
Above quote from http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm

Try this:
(.*?)\s(\d{5})\s(.*?)\s?([^\s]*)?$

This will match your input more tightly and each of your groups is in its own regex group:
(\w+\s\d+\s\w\s\d+)\s(\d+)\s(\w+)\s(\w*)
or if space is OK instead of "whitespace":
(\w+ \d+ \w \d+) (\d+) (\w+) (\w*)
Group 1: BLOOKKOKATU 20 A 773
Group 2: 00810
Group 3: HELSINKI
Group 4: SUOMI (optional - doesn't have to match)

(.*?)\s(\d{5})\s(\w+)\s(\w*)
An example:
with t as
( select 'BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI' text from dual
)
select text
, regexp_replace(text,'(.*?)\s(\d{5})\s(\w+)\s(\w*)','\1**\2**\3**\4') new_text
from t
/
TEXT
-----------------------------------------
NEW_TEXT
-----------------------------------------------------------------------------------------
BLOOKKOKATU 20 A 773 00810 HELSINKI SUOMI
BLOOKKOKATU 20 A 773**00810**HELSINKI**SUOMI
1 row selected.
Regards,
Rob.
