Match street number from different formats without suffixes - regex

We have a "street_number" field that has been filled in freely over the years and that we now want to normalize. Using regular expressions, we'd like to extract the real "street_number" and the "street_number_suffix".
Example: for 17 b, "street_number" would be 17, and "street_number_suffix" would be b.
As there are a dozen different patterns, I'm having trouble tuning the regular expression correctly. I'm considering using 2 different regexes: one to extract the "street_number", and another to extract the "street_number_suffix".
Here's an exhaustive set of patterns we'd like to format and the expected output:
# Extract street_number using PCRE
input    street_number    street_number_suffix
19-21    19               null
2 G      2                G
A        null             A
1 bis    1                bis
3 C      3                C
N°10     10               null
17 b     17               b
76 B     76               B
7 ter    7                ter
9/11     9                null
21.3     21               3
42       42               null
I know I could use an expression that matches any digits up to a hyphen using \d+(?=\-).
It could be extended to match up to a hyphen OR a slash using \d+(?=\-|\/), though once I include \s in this pattern, 21 from 19-21 will match. Adding conditions may not be that simple, which is why I'm asking for your help.
Could anyone give me a helping hand on this? If it can help, here's a draft: https://regex101.com/r/jGK5Sa/4
Edit: at the time I'm editing, here's the closest regex I could find:
(?:(N°|(?<!\-|\/|\.|[a-z]|.{1})))\d+
However, the full match for N°10 isn't 10 but N°10 (and our ETL doesn't support capturing groups, so I can't use /......(\d+)/).

To get the street numbers, you could update the pattern to:
(?<![-/.a-z\d])\d+
Explanation
(?<! Negative lookbehind
[-/.a-z\d] Match any of the listed characters using a character class
) Close the negative lookbehind
\d+ Match 1+ digits
Regex demo
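As a quick check, the pattern can be run against the inputs from the question. This is a minimal sketch that uses Python's re module as a stand-in for the PCRE engine in the ETL (an assumption); the helper name street_number is hypothetical.

```python
import re

# Digits that are NOT preceded by a hyphen, slash, dot, lowercase letter or digit.
STREET_NUMBER = re.compile(r"(?<![-/.a-z\d])\d+")

def street_number(raw):
    """Return the first street number found in raw, or None if there is none."""
    match = STREET_NUMBER.search(raw)
    return match.group(0) if match else None

samples = ["19-21", "2 G", "A", "1 bis", "3 C", "N°10",
           "17 b", "76 B", "7 ter", "9/11", "21.3", "42"]
for raw in samples:
    print(repr(raw), "->", street_number(raw))
```

The full match for N°10 is 10, so no capturing group is needed, and A yields no match at all, which corresponds to the null in the expected output.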

Related

Why is this regex performing partial matches?

I have the following raw data:
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19 ...
I'm using this regex to remove duplicates:
([^.]+)(.[ ]*\1)+
which results in the following:
1.2.4.5.9.115.16.19 ...
The problem is how the regex handles 1.1 in the substring .11.15. What should be 9.11.15.16 becomes 9.115.16. How do I fix this?
The raw values are sorted in numeric order to accommodate the regex used for processing the duplicate values.
The regex is being used within Oracle's REGEXP_REPLACE
The decimal is a delimiter. I've tried commas and pipes but that doesn't fix the problem.
Oracle's regex engine does not work the way you intended. You could split the string and find distinct rows using the general method described in "Splitting string into multiple rows in Oracle". Another option is to use XMLTABLE, which works for numbers and also for strings with proper quoting.
SELECT LISTAGG(n, '.') WITHIN GROUP (ORDER BY n) AS n
FROM (
    SELECT DISTINCT TO_NUMBER(column_value) AS n
    FROM XMLTABLE(REPLACE('1.1.2.2.4.4.4.5.5.9.11.15.16.16.19', '.', ','))
);
Demo
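The split, distinct, and re-aggregate idea behind the query can also be sketched outside the database. The snippet below is only an illustration in Python, not Oracle code; the function name dedupe is hypothetical.

```python
raw = "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"

def dedupe(values, sep="."):
    """Split on sep, keep distinct numbers, and join them back in numeric order."""
    numbers = {int(token) for token in values.split(sep)}
    return sep.join(str(n) for n in sorted(numbers))

print(dedupe(raw))  # 1.2.4.5.9.11.15.16.19
```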
Unfortunately, Oracle doesn't provide a token to match a word boundary position: neither the familiar \b token nor the older [[:<:]] and [[:>:]].
But on this specific set you can use:
(\d+\.)(\1)+
Note: You forgot to escape dot.
Your regex caught:
a 1 - the second digit in 11,
then a dot,
and finally 1 - the first digit in 15.
So your regex failed to catch the whole sequence of digits.
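The partial match described above can be reproduced with a short sketch. It uses Python's re module as a stand-in for Oracle's REGEXP_REPLACE (an assumption; the two engines are not identical, but they agree on this input).

```python
import re

# The pattern from the question, with the unescaped dot left as-is.
original = re.compile(r"([^.]+)(.[ ]*\1)+")

match = original.search("9.11.15.16")
print(match.group(0), match.span())        # 1.1 (3, 6)
print(original.sub(r"\1", "9.11.15.16"))   # 9.115.16
```

The match starts at the second digit of 11 and ends at the first digit of 15, which is why the two numbers are fused into 115.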
The most natural way to write a regex catching the whole sequence of digits would be to use:
a lookbehind for either the start of the string or a dot,
then catch a sequence of digits,
and finally a lookahead for a dot.
But as I am not sure whether Oracle supports lookarounds, I wrote the regex another way:
(^|\.)(\d+)(\.(\2))+
Details:
(^|\.) - Either start of the string or a dot (group 1), instead of the lookbehind.
(\d+) - A sequence of digits (group 2).
( - Start of group 3, containing:
\.(\2) - A dot and the same sequence of digits which caught group 2.
)+ - End of group 3, it may occur multiple times.
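A minimal sketch of how this pattern could be applied, again with Python's re module standing in for Oracle. The replacement string \1\2 is an assumption, since the answer does not state which replacement to use.

```python
import re

# Group 1 anchors the match on the left; group 2 is the number being repeated.
anchored = re.compile(r"(^|\.)(\d+)(\.(\2))+")

raw = "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19"
print(anchored.sub(r"\1\2", raw))   # 1.2.4.5.9.11.15.16.19

# There is no boundary on the right-hand side, so a prefix can still match:
print(anchored.sub(r"\1\2", "1.15"))  # 15
```

The second call suggests that the pattern works on the sample data but may still misfire when a number is immediately followed by a longer number starting with the same digits.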
Group the repeating pattern and remove it
As revo has indicated, a big source of your difficulties came with not escaping the period. In addition, the resulting string having a 115 included can be explained as follows (Valdi_Bo made a similar observation earlier):
([^.]+)(.[ ]*\1)+ will match 11.15 as follows:
SELECT
    '11.15' val,
    regexp_replace('11.15','([^.]+)(\.[ ]*\1)+','\1') deduplicated
FROM
    dual;

VAL    DEDUPLICATED
11.15  115
Here is a similar approach to address those problems:
matching pattern composition
- Look for a non-period matching list of length 0 to N (the subexpression referenced by \1).
'19' which matches ([^.]*)
- Look for the repeats, which form our second matching list, associated with subexpression 2 and referenced by \2.
'19.19.19' which matches ([^.]*)([.]\1)+
- Look for either a period or the end of the string. This is the matching list referenced by \3. It prevents '11.15' from being collapsed into '115'.
([.]|$)
replacement string
I replace the matched pattern with a replacement string composed of the first instance of the non-period matching list, followed by the terminator captured in \3.
\1\3
Solution
regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3')
Here is an example using some permutations of your examples:
WITH tst AS (
    SELECT
        '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19' val
    FROM
        dual
    UNION ALL
    SELECT
        '1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19' val
    FROM
        dual
    UNION ALL
    SELECT
        '1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19' val
    FROM
        dual
) SELECT
    val,
    regexp_replace(val,'([^.]*)([.]\1)+([.]|$)','\1\3') deduplicate
FROM
    tst;

VAL                                               DEDUPLICATE
------------------------------------------------------------------------
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19                1.2.4.5.9.11.15.16.19
1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19  1.2.4.5.9.11.15.16.19
1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19          1.2.4.5.9.11.15.16.19
My approach does not address possible spaces in the string. One could just remove them separately (e.g. through a separate replace statement).
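For readers without an Oracle instance at hand, the same pattern and replacement can be tried in Python. This is a sketch that assumes Python's re module behaves like REGEXP_REPLACE for these inputs; it reuses the three sample strings from the query above.

```python
import re

# \1 = the value, group 2 = its repeats, \3 = the trailing period or end of string.
pattern = re.compile(r"([^.]*)([.]\1)+([.]|$)")

samples = [
    "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19",
    "1.1.1.1.2.2.4.4.4.4.4.5.5.9.11.11.11.15.16.16.19",
    "1.1.2.2.4.4.4.5.5.9.11.15.16.16.19.19.19",
]
for val in samples:
    print(pattern.sub(r"\1\3", val))
```

Each of the three inputs should print 1.2.4.5.9.11.15.16.19, matching the DEDUPLICATE column.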

Regular Expression for parsing a sports score

I'm trying to validate that a form field contains a valid score for a volleyball match. Here's what I have, and I think it works, but I'm not an expert on regular expressions, by any means:
r'^ *([0-9]{1,2} *- *[0-9]{1,2})((( *[,;] *)|([,;] *)|( *[,;])|[,;]| +)[0-9]{1,2} *- *[0-9]{1,2})* *$'
I'm using python/django, not that it really matters for the regex match. I'm also trying to learn regular expressions, so a more optimal regex would be useful/helpful.
Here are rules for the score:
1. There can be one or more valid set (set=game) results included
2. Each result must be of the form dd-dd, where 0 <= dd <= 99
3. Each additional result must be separated by any of [ ,;]
4. Allow any number of sets >=1 to be included
5. Spaces should be allowed anywhere except in the middle of a number
So, the following are all valid:
25-10 or 25 -0 or 25- 9 or 23 - 25 (could be one or more spaces)
25-10,25-15 or 25-10 ; 25-15 or 25-10 25-15 (again, spaces allowed)
25-1 2 -25, 25- 3 ;4 - 25 15-10
Also, I need each result as a separate unit for parsing. So in the last example above, I need to be able to separately work on:
25-1
2 -25
25- 3
4 - 25
15-10
It'd be great if I could strip the spaces from within each result. I can't just strip all spaces, because a space is a valid separator between result sets.
I think this is a solution to your problem. Since you're using Python, use re.sub (str.replace does not accept a regex), where text is the input string:
re.sub(r"(\d{1,2})\s*-\s*(\d{1,2})", r"\1-\2", text)
How it works:
(\d{1,2}) captures a group of 1 or 2 digits.
\s* matches 0 or more whitespace characters.
- matches a literal -.
\1 inserts the content of capture group 1.
\2 inserts the content of capture group 2.
You can also look at this.
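Putting the question's rules together, one possible approach is to validate the whole field first and then extract each result with the spaces stripped. This is a minimal sketch, not a definitive implementation; the names SET, SEP, SCORE_RE, PAIR_RE and parse_score are hypothetical.

```python
import re

SET = r"[0-9]{1,2} *- *[0-9]{1,2}"   # one result, e.g. "25 - 9"
SEP = r"(?: *[,;] *| +)"             # comma or semicolon with optional spaces, or spaces only
SCORE_RE = re.compile(rf"^ *{SET}(?:{SEP}{SET})* *$")
PAIR_RE = re.compile(r"([0-9]{1,2}) *- *([0-9]{1,2})")

def parse_score(text):
    """Return the results as a list of 'dd-dd' strings, or None if text is invalid."""
    if not SCORE_RE.match(text):
        return None
    return [f"{a}-{b}" for a, b in PAIR_RE.findall(text)]

print(parse_score("25-1 2 -25, 25- 3 ;4 - 25 15-10"))
# ['25-1', '2-25', '25-3', '4-25', '15-10']
print(parse_score("25-100"))  # None
```

Splitting the pattern into SET and SEP keeps the validation regex short and removes the redundant separator alternatives from the original.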

is it possible to solve this with just one regex?

I would like to know if there is a regular expression that given for example this input:
lkjs kjsfjk ijsfj á 13total wer6klje additional lñk jshv kjsdfjk dj d 22total kejk jksfljkakvhjr j 3total fkljbher jr6 hrew7 hwr 41total sfdkj additional iuwefjkwf7 7erfh sf 5total klj kj kjsef87 jhwfe7 89 jhf
could output these 3 matches, which are numbers followed by total, that do not contain the word additional after (and before finding the next number):
22
3
5
So, for example I didn't match 13 because
13total wer6klje additional lñk jshv kjsdfjk dj d 22total
contains the word additional
And I didn't match 41 because
41total sfdkj additional iuwefjkwf7 7erfh sf 5total
contains the word additional
Let me explain the input structure used in the example:
randomText 13total randomText additional randomText
22total randomText
3total randomText
41total randomText additional randomText
5total randomText
So basically the input is something like:
randomText X_total randomText_that_contains_or_not_'additional'
X_total randomText_that_contains_or_not_'additional'
....
X_total randomText_that_contains_or_not_'additional'
I know how to solve the problem using some additional code (several patterns and matches, if-else structures...), but the system I'm working with cannot make use of those. It can only be fed a single regular expression (it's a complicated system, not easy to modify).
So, for example, with the regular expression [0-9]+(?=total) I would get these matches: 13, 22, 3, 41, 5
but as I said I just need 22, 3, 5
Can anybody build a more complex regular expression that matches those 3 numbers?
Thanks!
Of course it is possible (given that your regex flavour supports lookahead assertions)
\d+(?=total(?!\D*additional))
See it here on regex101
\d+ matches one or more digits
(?=total(?!\D*additional)) uses nested lookaround assertions: the digits have to be followed by "total", which in turn must not be followed by "additional" (with only non-digits in between)
A more advanced example based on Bergi's comment:
\d+(?=total(?!(?:.(?!\d+total))*additional))
See it on regex101
Here I keep searching for additional as long as I don't reach the next \d+total.
You can use (the total will always be preceded by a digit, right?)
\d+(?=total(?!(?:\D|\d(?!total))*additional))
Explanation
The idea is to forbid any additional before the next <digit>total:
\d+                      # digits
(?=total                 # followed by total
  (?!                    # not followed by...
    (?:
      \D++               # not a digit (possessive quantifier)
      |                  # OR
      \d(?!total)        # a digit, but not followed by total
    )*+                  # any number of times
    additional
  )
)
The negative look ahead will fail the regex if it finds one, and we're sure not to pass over a <digit>total thanks to (?:\D|\d(?!total)).
See demo here.
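To compare the patterns on the question's input, here is a short sketch using Python's re module (an assumption about the target engine, which the question does not name). It uses the non-possessive form, since possessive quantifiers are only available in Python 3.11 and later.

```python
import re

text = ("lkjs kjsfjk ijsfj á 13total wer6klje additional lñk jshv kjsdfjk dj d "
        "22total kejk jksfljkakvhjr j 3total fkljbher jr6 hrew7 hwr "
        "41total sfdkj additional iuwefjkwf7 7erfh sf 5total "
        "klj kj kjsef87 jhwfe7 89 jhf")

# Only non-digits are allowed between "total" and "additional".
simple = re.compile(r"\d+(?=total(?!\D*additional))")
# Digits are allowed too, as long as they do not start the next "<digit>total".
robust = re.compile(r"\d+(?=total(?!(?:\D|\d(?!total))*additional))")

print(simple.findall(text))  # ['13', '22', '3', '5']
print(robust.findall(text))  # ['22', '3', '5']
```

The simpler pattern still reports 13 because the digit in wer6klje sits between total and additional; the second pattern handles that case.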

Regular expression for A123ABC

I have a string in the format A123ABC
First letter cannot contain <I,O,Q,U,Z>
Next 3 digits (0-9) from 21-998
Last 3 letters cannot include <I,Q,Z>
I used the following expression [A-HJ-NPR-TV-Y]{1}[0-9]{2,3}[A-HJ-PR-Y]{3}
But I am not able to restrict the number in the range 21-998.
Your letter part is fine, below is just the numbers portion:
regex = "(?:2[1-9]|[3-9][0-9]|[1-8][0-9][0-9]|9[0-8][0-9]|99[0-8])"
(?:...) group, but do not capture.
2[1-9] covers 21-29
[3-9][0-9] covers 30-99
[1-8][0-9][0-9] covers 100-899
9[0-8][0-9] covers 900-989
99[0-8] covers 990-998
| stands for "or"
Note: [0-9] may be replaced by \d. So, a more concise representation would be:
regex = "(?:2[1-9]|[3-9]\d|[1-8]\d{2}|9[0-8]\d|99[0-8])"
One option would be matching (\d+) and checking if that falls in the range 21 - 998 outside a regex, in the language you're using, if possible.
If that is not feasible, you have to break it up (just showing the middle part):
(2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])
Breakdown:
2[1-9] matches 21 - 29
[3-9]\d matches 30 - 99
[1-8]\d\d matches 100 - 899
9[0-8]\d matches 900 - 989
99[0-8] matches 990 - 998
Also, the {1} is superfluous and can be omitted, making the complete regex
[A-HJ-NPR-TV-Y](2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])[A-HJ-PR-Y]{3}
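One way to gain confidence in the numeric alternation is to test it exhaustively. The sketch below assumes Python's re module and uses fullmatch so that the whole string must conform; the names PLATE and is_valid are hypothetical.

```python
import re

PLATE = re.compile(
    r"[A-HJ-NPR-TV-Y](2[1-9]|[3-9]\d|[1-8]\d\d|9[0-8]\d|99[0-8])[A-HJ-PR-Y]{3}"
)

def is_valid(code):
    """True if code is one allowed letter, a number in 21-998, then three allowed letters."""
    return PLATE.fullmatch(code) is not None

# Try every number from 0 to 999 between fixed valid letters.
accepted = [n for n in range(1000) if is_valid(f"A{n}ABC")]
print(accepted[0], accepted[-1], len(accepted))  # 21 998 978
```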
Assuming the numbers between 21 and 99 are displayed with three digits (i.e. 021, 055, 099), here's a solution for the number part:
((02[1-9])|(0[3-9][0-9])|([1-8][0-9]{2})|(9[0-8][0-9])|(99[0-8]))
Entire regex:
[A-HJ-NPR-TV-Y]{1}((02[1-9])|(0[3-9][0-9])|([1-8][0-9]{2})|(9[0-8][0-9])|(99[0-8]))[A-HJ-PR-Y]{3}
There are probably easier ways to do this, but one way would be to use:
^((?=[^IOQUZ])([A-Z]))((02[1-9])|(0[3-9]\d)|([1-8]\d\d)|(9[0-8]\d)|(99[0-8]))((?=[^IQZ])([A-Z])){3}$
To explain:
^ denotes the beginning of the string.
((?=[^IOQUZ])([A-Z])) would give you any capital letter not in <I, O, Q, U, Z>.
((02[1-9])|(0[3-9]\d)|([1-8]\d\d)|(9[0-8]\d)|(99[0-8])) denotes any zero-padded three-digit number between ((021 to 029) or (030 to 099) or (100 to 899) or (900 to 989) or (990 to 998)).
((?=[^IQZ])([A-Z])){3} would match any three capital letters not in <I, Q, Z>.
$ would denote the end of the string.

How to combine these regex requirements?

I'm using an Asp.Net RegularExpressionValidator to validate phone numbers.
The check is quite basic - a number can be 10 or 11 characters in length, all numeric and starting 01 or 02.
Here's the regex:
^0[12]\d{8,9}$
However, I've recently started working with a 3rd party who enforce stricter rules. In my opinion it's a bad idea, partly because they don't even publish these rules, and partly because the rules are subject to change and therefore require maintenance across all their partners. However...
I now need to incorporate their additions into my regex, but I'm not sure where to start.
They currently do this using 2 separate regexes in an OR, however I'd like to do this in 1 if possible.
The additional syntax should ensure that 10-digit phone numbers also adhere to these additional rules. Here's their 10-digit syntax:
^01(204|208|254|276|297|298|363|364|384|386|404|420|460|461|480|488|524|527|562|566|606|629|635|647|659|695|726|744|750|768|827|837|884|900|905|935|946|949|963|995)[0-9]{5}$
Any ideas as to how to achieve this?
Disclaimer: This answer is based on the logic followed by this answer to demonstrate the "virtual" requirements (which we should drop anyways).
Let me explain what is going on:
^0[12]\d{8,9}$ What's going on here ?
^ : match begin of line
0 : match 0
[12] : match 1 or 2
\d{8,9} : match a digit 8 or 9 times
$ : match end of line
^01(204|20...3|995)[0-9]{5}$ What does this big regex do ?
^ : match begin of line
01 : match 01.
(204|20...3|995) : match certain 3 digit combination
[0-9]{5} : match a digit 5 times
$ : match end of line
Well, what if we merged these two in an OR statement ?
^
(?:
01(204|20...3|995)[0-9]{5}
)
|
(?:
0[12]\d{8,9}
)
$
I'll show you why it doesn't make sense.
How many digits does 0[12]\d{8,9} match ? 10 or 11 right ?
Now how many digits does the other regex match ?
01(204|20...3|995)[0-9]{5}
^^ ^-----\/-----^ ^--\/--^
2 + 3 + 5 = 10
Now compare the 2 regexes. It's clear that ^0[12]\d{8,9}$ will match every number that is valid for the other regex. So why in the world would you combine these 2?
To make the problem simpler, say you have regex1: abc and regex2: [a-z]+. What you want is something like abc|[a-z]+, but that doesn't make sense since [a-z]+ already matches abc, so we can get rid of abc.
On a side note, \d does match more than you think in some languages. Your final regex should be ^0[12][0-9]{8,9}$.
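The side note about \d can be illustrated in Python, where \d matches any Unicode decimal digit in a str pattern. This is only a sketch in Python; behavior in other engines depends on their own options.

```python
import re

loose = re.compile(r"^0[12]\d{8,9}$")
strict = re.compile(r"^0[12][0-9]{8,9}$")

# "\u0663" is ARABIC-INDIC DIGIT THREE, a decimal digit outside the ASCII range.
candidate = "01" + "\u0663" * 8

print(bool(loose.match(candidate)))   # True
print(bool(strict.match(candidate)))  # False
```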
You could merge them with an OR in the regex itself:
^(?:01(204|208|254|276|297|298|363|364|384|386|404|420|460|461|480|488|524|527|562|566|606|629|635|647|659|695|726|744|750|768|827|837|884|900|905|935|946|949|963|995)\d{5}|0[12]\d{9})$
Edited 11 digit regex.
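The behavior of the merged pattern can be checked with a few sample numbers. This is a minimal sketch using Python's re module rather than the Asp.Net validator (an assumption); the sample phone numbers and the names AREA_CODES and MERGED are made up for illustration.

```python
import re

AREA_CODES = ("204|208|254|276|297|298|363|364|384|386|404|420|460|461|480|488|"
              "524|527|562|566|606|629|635|647|659|695|726|744|750|768|827|837|"
              "884|900|905|935|946|949|963|995")

# 10 digits: 01 + a listed code + 5 digits. 11 digits: 01 or 02 + 9 digits.
MERGED = re.compile(r"^(?:01(" + AREA_CODES + r")\d{5}|0[12]\d{9})$")

for number in ["0120412345", "0120012345", "0212345678", "02123456789"]:
    print(number, bool(MERGED.match(number)))
```

With this pattern a 10-digit number is accepted only when it starts with 01 followed by one of the listed codes, while any 11-digit number starting with 01 or 02 is accepted.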