Regexp in oracle to match a string - regex

I am trying to create a RegExp in Oracle to match a string with the following criteria:
Length is 11 characters.
Characters 2, 5, 8 and 9 are letters [A-Z] except S, L, O, I, B and Z.
Characters 1, 4, 7, 10 and 11 are numeric [0-9].
Characters 3 and 6 will be either a number or a letter.

You'll want to use the following regex with REGEXP_LIKE(), REGEXP_SUBSTR(), etc:
^[0-9][AC-HJKMNP-RT-Y][A-Z0-9][0-9][AC-HJKMNP-RT-Y][A-Z0-9][0-9][AC-HJKMNP-RT-Y]{2}[0-9]{2}$
Hope this helps.
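As a quick sanity check of the pattern above, here is a minimal sketch in Python. It uses Python's re module rather than Oracle's engine, so it only illustrates the character classes and positions. The helper name is_valid and the sample strings are assumptions made for illustration.

```python
import re

# Letters allowed in positions 2, 5, 8 and 9: A-Z minus B, I, L, O, S and Z.
ALLOWED = "AC-HJKMNP-RT-Y"

# Same pattern as in the answer above, assembled from the shared letter class.
PATTERN = re.compile(
    rf"^[0-9][{ALLOWED}][A-Z0-9][0-9][{ALLOWED}][A-Z0-9][0-9][{ALLOWED}]{{2}}[0-9]{{2}}$"
)


def is_valid(code: str) -> bool:
    """Return True when the whole string satisfies the positional rules."""
    return PATTERN.fullmatch(code) is not None


if __name__ == "__main__":
    for sample in ["1CC4DD7EE01", "1CB4DD7EE01", "1BC4DD7EE01", "123456789ab"]:
        print(sample, is_valid(sample))
```

Note that with this pattern a B is accepted in position 3 (any letter or digit is allowed there) but rejected in position 2.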

Make a fancy Character List
I just made a fancy character list that excludes the uppercase letters you cited. This is similar to David Faber's answer.
Here is my fancy character list:
'[AC-HJKMNPQRT-Y]'
Oracle's documentation states that the hyphen is special in that it forms a range when used inside a character list.
To keep the pattern succinct, I noticed that most of the string repeats a three-character sequence: a digit, a letter from the list, then either a digit or a letter from the list. Consequently, I placed this sequence in a subexpression that occurs 2 times (the quantifier follows the group).
WITH smple AS (
    SELECT '123456789ab' tst FROM dual
    UNION ALL
    SELECT '1CC4DD7EE01' FROM dual
    UNION ALL
    SELECT '1CB4DD7EE01' FROM dual
    UNION ALL
    SELECT '1C44D67EE01' FROM dual
)
SELECT
    smple.tst,
    regexp_substr(smple.tst, '^(\d[AC-HJKMNPQRT-Y](\d|[AC-HJKMNPQRT-Y])){2}\d[AC-HJKMNPQRT-Y]{2}\d{2}$') matching
FROM
    smple;

TST          MATCHING
-----------  -----------
123456789ab
1CC4DD7EE01  1CC4DD7EE01
1CB4DD7EE01
1C44D67EE01  1C44D67EE01

Related

Match street number from different formats without suffixes

We have a "street_number" field which has been filled in freely over the years and which we now want to format. Using regular expressions, we'd like to extract the real "street_number" and the "street_number_suffix".
Ex: 17 b, "street_number" would be 17, and "street_number_suffix" would be b.
As there are a dozen different patterns, I'm having trouble tuning the regular expression correctly. I am considering using 2 different regexes, one to extract the "street_number" and another to extract the "street_number_suffix".
Here's an exhaustive set of patterns we'd like to format and the expected output:
# Extract street_number using PCRE
input    street_number    street_number_suffix
19-21    19               null
2 G      2                G
A        null             A
1 bis    1                bis
3 C      3                C
N°10     10               null
17 b     17               b
76 B     76               B
7 ter    7                ter
9/11     9                null
21.3     21               3
42       42               null
I know I could use an expression that matches any digits up to a hyphen, using \d+(?=\-).
It could be extended to match up to a hyphen OR a slash using \d+(?=\-|\/), though once I include \s in this pattern, the 21 from 19-21 will match. Adding conditions may not be that simple, which is why I am asking for your help.
Could anyone give me a helping hand on this? If it can help, here's a draft: https://regex101.com/r/jGK5Sa/4
Edit: at the time I'm editing, here's the closest regex I could find:
(?:(N°|(?<!\-|\/|\.|[a-z]|.{1})))\d+
Though the full match of N°10 isn't 10 but N°10 (and our ETL doesn't support capturing groups, so I can't use /......(\d+)/)
To get the street numbers, you could update the pattern to:
(?<![-/.a-z\d])\d+
Explanation
(?<! Negative lookbehind
[-/.a-z\d] Match any of the listed characters using a character class
) Close the negative lookbehind
\d+ Match 1+ digits
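To check the lookbehind pattern against the inputs listed in the question, here is a minimal sketch in Python. Python's re module is used as a stand-in for the PCRE flavor mentioned in the question, and the helper name street_number is an assumption made for illustration.

```python
import re

# Digits that are NOT immediately preceded by '-', '/', '.', a lowercase
# letter or another digit. Only the full match is used, no capture groups.
STREET_NUMBER = re.compile(r"(?<![-/.a-z\d])\d+")


def street_number(raw: str):
    """Return the first street number found in raw, or None if there is none."""
    match = STREET_NUMBER.search(raw)
    return match.group(0) if match else None


if __name__ == "__main__":
    for raw in ["19-21", "2 G", "A", "1 bis", "N°10", "9/11", "21.3", "42"]:
        print(repr(raw), "->", street_number(raw))
```

Because the lookbehind also rejects a preceding digit, the engine cannot start a match in the middle of a number such as the 21 in 19-21.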

Validate one or two asterisks at the beginning of a word

I need to validate whether a word has one or at most two asterisks at its beginning; words with three or more should be ignored.
words:
[
'* 11 13 24.574 1,474.79'
'** 11 13 24.574 1,474.79'
'*** 11 13 24.574 1,474.79'
]
Test:
1. ^[**]
2. ^[*][*]
3. (^\*{1}\s)
4. ^\*|\*\s
Expected:
[
'* 11 13 24.574 1,474.79',
'** 11 13 24.574 1,474.79'
]
When you say words I'll assume you have all of the "words" listed in a vector. That should look like:
string_vector <- c("* 11 13 24.574 1,474.79", "** 11 13 24.574 1,474.79", "*** 11 13 24.574 1,474.79")
The problem with test 1 is that [] matches any one of the characters inside the brackets, so ^[**] just searches for one asterisk at the start of the string. All 3 words will be matched. Test 2 matches any string with 2 asterisks at the beginning, which includes the second and third strings in your vector but not the first. Test 3 matches exactly one asterisk at the beginning followed by a space, which will only return the first item. Test 4 matches either one asterisk at the beginning or an asterisk followed by a space anywhere in the string, which results in matches in all items in the vector. You would need to put ^ after the | as well to make the choice be between two different patterns at the start of the string. However, it's not clear why this would apply to your question, as 2 asterisks at the beginning wouldn't be matched. You can test all of this for yourself by using the "str_view_all" function in the stringr package. You will need to use two backslashes before the * and s if they're not in square brackets.
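The behavior of the four attempts can be checked mechanically. The sketch below uses Python's re module instead of R's stringr (an assumption made for illustration, as is the helper name matching_items) and reports which items of the vector each test pattern matches.

```python
import re

words = [
    "* 11 13 24.574 1,474.79",
    "** 11 13 24.574 1,474.79",
    "*** 11 13 24.574 1,474.79",
]

# The four patterns from the question, keyed by test number.
tests = {
    1: r"^[**]",
    2: r"^[*][*]",
    3: r"(^\*{1}\s)",
    4: r"^\*|\*\s",
}


def matching_items(pattern: str) -> list:
    """Return the 1-based positions of the words the pattern matches."""
    return [i for i, word in enumerate(words, start=1) if re.search(pattern, word)]


if __name__ == "__main__":
    for number, pattern in tests.items():
        print(f"test {number}: {pattern!r} matches items {matching_items(pattern)}")
```

None of the four selects exactly items 1 and 2, which is what the question asks for.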
I suggest using the following:
library(stringr)
str_subset(string_vector,"^\\*{1,2}[^*].+")
This matches all elements of your vector which have exactly 1 or 2 asterisks at the beginning ("^\\*{1,2}") that are not followed by any more asterisks ([^*]). The ".+" then means any other characters can occupy the rest of the string.
This command gives your desired output:
[1] "* 11 13 24.574 1,474.79" "** 11 13 24.574 1,474.79"
You can assign the result to an object if you want to do more with the resulting vector:
object <- str_subset(string_vector,"^\\*{1,2}[^*].+")
EDIT based on Cary Swoveland's helpful comments:
If just "**" and "*" are also supposed to be matched, then the following expression should work. Based on the provided data I assumed there would always be more characters following the *'s at the beginning, but I now see that nothing in the description explicitly supports this assumption.
object <- str_subset(string_vector,"^\\*(?!\\*)|^\\*{2}(?!\\*)")
This will match:
one * not followed by another * OR
two * not followed by another *
The (?!...) construct is a negative lookahead, i.e. the character(s) to be matched (in this case 1 or 2 *) cannot be immediately followed by the pattern in parentheses after ?! (in this case another *, which is escaped with \\). Cary is also correct in pointing out that since we're only interested in how the string begins, it doesn't matter if there are any more characters after the 1 or 2 * of interest.
This might work for you:
https://regex101.com/r/OyGxta/2
Test String:
* 11 13 24.574 1,474.79
** 11 13 24.574 1,474.79
*** 11 13 24.574 1,474.79
Pattern:
^\*{1,2}(?!\*).*
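For readers who want to try the lookahead pattern outside regex101, here is a minimal sketch in Python. The function name keep_one_or_two is an assumption made for illustration. The trailing .* is dropped because only the start of the string decides whether a line is kept.

```python
import re

# One or two leading asterisks that are not followed by a further asterisk.
LEADING_ASTERISKS = re.compile(r"^\*{1,2}(?!\*)")


def keep_one_or_two(lines):
    """Keep only the lines that start with exactly one or two asterisks."""
    return [line for line in lines if LEADING_ASTERISKS.search(line)]


if __name__ == "__main__":
    words = [
        "* 11 13 24.574 1,474.79",
        "** 11 13 24.574 1,474.79",
        "*** 11 13 24.574 1,474.79",
    ]
    for line in keep_one_or_two(words):
        print(line)
```

With three asterisks the quantifier first takes two, the lookahead sees the third and fails, the engine backtracks to one asterisk, and the lookahead fails again, so the line is rejected.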

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's regex engine does not work the way you intended. You could split the string and find the distinct rows using the general method described in "Splitting string into multiple rows in Oracle". Another option is to use XMLTABLE, which works for numbers and, with proper quoting, for strings as well.
SELECT LISTAGG(n, '.') WITHIN GROUP (ORDER BY n) AS n
FROM (
    SELECT DISTINCT TO_NUMBER(column_value) AS n
    FROM XMLTABLE(replace('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
Unfortunately, Oracle doesn't provide a token to match a word boundary position: neither the familiar \b token nor the older [[:<:]] and [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape the dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
The most natural way to write a regex catching the whole sequence of digits would be to use:
a lookbehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either the start of the string or a dot (group 1), instead of the lookbehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits that group 2 captured.
)+ - End of group 3, it may occur multiple times.
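The difference between the original pattern and the grouped one can be reproduced with a short sketch in Python. Python's re module is not Oracle's engine, so this only illustrates the matching logic. The replacement string \1\2 (the leading delimiter plus one copy of the number) and the function names are assumptions, since no replacement string is given above.

```python
import re

RAW = "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"


def dedupe_original(value: str) -> str:
    """Pattern from the question: the unescaped dot matches any character."""
    return re.sub(r"([^.]+)(.[ ]*\1)+", r"\1", value)


def dedupe_grouped(value: str) -> str:
    """Grouped pattern: a number must start at the string start or after a dot."""
    return re.sub(r"(^|\.)(\d+)(\.(\2))+", r"\1\2", value)


if __name__ == "__main__":
    print(dedupe_original(RAW))  # shows the 11.15 -> 115 problem
    print(dedupe_grouped(RAW))
```

On this input the grouped pattern can no longer start a match at the second digit of 11, because that digit is preceded by neither the start of the string nor a dot.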
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match 11.15 as follows:
SELECT
    '11.15' val,
    regexp_replace('11.15', '([^.]+)(\.[ ]*\1)+', '\1') deduplicated
FROM
    dual;

VAL    DEDUPLICATED
11.15  115
Here is a similar approach to address those problems:
Matching pattern composition
- Look for a non-period matching list of length 0 to N (this subexpression is referenced by \1).
'19' matches ([^.]*)
- Look for the repeats, which form the second subexpression, referenced by \2.
'19.19.19' matches ([^.]*)([.]\1)+
- Look for either a period or the end of the string. This is the subexpression referenced by \3. It prevents '11.15' from being replaced by '115'.
([.]|$)
Replacement string
I replace the matched pattern with the first instance of the non-period matching list, followed by whatever \3 matched.
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
WITH tst AS (
    SELECT '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val FROM dual
    UNION ALL
    SELECT '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val FROM dual
    UNION ALL
    SELECT '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val FROM dual
)
SELECT
    val,
    regexp_replace(val, '([^.]*)([.]\1)+([.]|$)', '\1\3') deduplicate
FROM
    tst;

VAL                                               DEDUPLICATE
------------------------------------------------  ---------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19                1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19  1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19          1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).
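The same pattern and replacement string can be exercised outside the database. The following is a minimal sketch in Python, assuming Python's re module behaves like Oracle's REGEXP_REPLACE for this particular pattern. The function name deduplicate is hypothetical.

```python
import re

# A value, one or more ".value" repeats, then a period or the end of the
# string. The trailing group stops a short value from matching the prefix
# of a longer one (the 11.15 case).
DUPLICATES = re.compile(r"([^.]*)([.]\1)+([.]|$)")


def deduplicate(value: str) -> str:
    """Collapse runs of adjacent equal period-delimited values into one."""
    return DUPLICATES.sub(r"\1\3", value)


if __name__ == "__main__":
    for val in [
        "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19",
        "1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19",
        "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19",
    ]:
        print(val, "->", deduplicate(val))
```

As with the SQL version, spaces inside the string are not handled and would need to be removed in a separate step.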

RegEx to allow 12 or 13 digits

In my MVC application I define validation using the following RegEx
[RegularExpression(@"\d{8}0[1-2]\d{3}", ErrorMessage = "Must be numeric, 12 or 13 characters long & Format xxxxxxxx[01 or 02]xxx")]
But I want to allow 12 or 13 characters. The \d{3} appears to force the input to be 13 characters overall.
To make it accept 12 or 13, I changed \d{3} to \d{2}, and it accepts 12 now.
But - can I be sure it will still take 13 characters?
Must be numeric, 12 or 13 characters long & Format xxxxxxxx[01 or 02]xxx
To allow the digit 1 or 2 after the first nine digits, followed by two or three more digits:
^\d{8}0[12]\d{2,3}$
       ^^^^          : Allow 1 or 2 after `0`
           ^^^^^^^   : Any two or three digits
Note that [12] can also be written as (1|2) using OR/alternation.
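To confirm that the {2,3} quantifier accepts both lengths and nothing else, here is a minimal sketch in Python. The question uses a C# validation attribute, so Python's re module is only a stand-in here, and the helper name is_valid_number and the sample values are assumptions made for illustration.

```python
import re

# Eight digits, a literal 0, then 1 or 2, then two or three more digits.
NUMBER = re.compile(r"^\d{8}0[12]\d{2,3}$")


def is_valid_number(text: str) -> bool:
    """Return True for 12- or 13-digit inputs of the form xxxxxxxx[01|02]xx(x)."""
    return NUMBER.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["123456780112", "1234567802123", "12345678011", "123456780312"]:
        print(text, len(text), is_valid_number(text))
```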

Regular expression for matching numbers and ranges of numbers

In an application I have the need to validate a string entered by the user.
One number
OR
a range (two numbers separated by a '-')
OR
a list of comma separated numbers and/or ranges
AND
any number must be between 1 and 999999.
A space is allowed before and after a comma and/or '-'.
I thought the following regular expression would do it.
(\d{1,6}\040?(,|-)?\040?){1,}
This matches the following (which is excellent). (\040 in the regular expression is the character for space).
00001
12
20,21,22
100-200
1,2-9,11-12
20, 21, 22
100 - 200
1, 2 - 9, 11 - 12
However, I also get a match on:
!!!12
What am I missing here?
You need to anchor your regex:
^(\d{1,6}\040?(,|-)?\040?){1,}$
Otherwise you will get a partial match on "!!!12": the regex matches only the trailing digits.
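The effect of the anchors can be seen by running both versions side by side. This is a minimal sketch in Python, whose re module also reads \040 as the octal escape for a space. The variable and function names are assumptions made for illustration.

```python
import re

# Pattern from the question, without and with anchors.
UNANCHORED = re.compile(r"(\d{1,6}\040?(,|-)?\040?){1,}")
ANCHORED = re.compile(r"^(\d{1,6}\040?(,|-)?\040?){1,}$")


def matched_text(pattern, text):
    """Return the part of text the pattern matched, or None if nothing matched."""
    match = pattern.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    samples = ["00001", "20, 21, 22", "100 - 200", "1, 2 - 9, 11 - 12", "!!!12"]
    for text in samples:
        print(repr(text), matched_text(UNANCHORED, text), matched_text(ANCHORED, text))
```

Without anchors, the search simply skips the leading "!!!" and reports a match on "12". With ^ and $ the whole input has to fit the pattern.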
/\d*[-]?\d*/
I have tested this with Perl:
> cat temp
00001
12
20,21,22
100-200
1,2-9,11-12
20, 21, 22
100-200
1, 2-9, 11-12
> perl -lne 'push @a,/\d*[-]?\d*/g;END{print "@a"}' temp
00001 12 20 21 22 100-200 1 2-9 11-12 20 21 22 100-200 1 2-9 11-12
As the result above shows, this puts all the regex matches in an array and finally prints the array elements.