Regex to extract house number plus addition

I'm looking for a regex that matches house numbers combined with additions for all addresses below:
Breestraat 4
Breestraat 45
Breestraat 456
Dubbele Straat 4a
Dubbele Straat 4-a
5 meistraat 1a
5meistraat 12
5meistraat 12a
Teststraat 22-III
Now the following regex works, except in the first case. This is because the single-digit house number is missed because of the first \d in the regex (which prevents a starting digit from being captured).
\d?.(\d+.+)$
I'm scratching my head over how to get the house number '4' for the first line. So basically: how do I change "skip the starting digit" into "skip the starting digit, but still let a single digit end up in the capturing group"?

You can use
\d+\D*$
\d+\S*$
See the regex demo #1 and regex demo #2.
The pattern matches
\d+ - one or more digits
\D* - zero or more non-digit chars
\S* - zero or more non-whitespace chars
$ - end of string.
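As a rough check, here is a minimal Python sketch that runs the second pattern over the addresses from the question. The helper name extract_house_number is made up for this illustration, and Python is only an assumption since the question does not name a language.

```python
import re

# Pattern from the answer above: digits, then any non-whitespace chars, at end of string.
HOUSE_NUMBER = re.compile(r"\d+\S*$")

def extract_house_number(address):
    """Return the trailing house number plus addition, or None if nothing matches."""
    match = HOUSE_NUMBER.search(address)
    return match.group() if match else None

addresses = [
    "Breestraat 4",
    "Breestraat 45",
    "Breestraat 456",
    "Dubbele Straat 4a",
    "Dubbele Straat 4-a",
    "5 meistraat 1a",
    "5meistraat 12",
    "5meistraat 12a",
    "Teststraat 22-III",
]

for address in addresses:
    print(address, "->", extract_house_number(address))
```

The leading 5 in "5meistraat 12" is not returned because the match has to run all the way to the end of the string, and the space after "5meistraat" stops \S*.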

It's not perfectly clear what exactly you are asking for.
Anyway this is the pattern matching the house number at the end of the string:
\d+[-\da-zI]*$
https://regexr.com/6l0g7
Anyway I'm aware this is not a valid answer

Related

Negate a character group to replace all other characters

I have the following string:
"Thu Dec 31 22:00:00 UYST 2009"
I want to replace everything except for the hours and minutes so I get the following result:
"22:00"
I am using this regex :
(^([0-9][0-9]:[0-9][0-9]))
But it's not matching anything.
This would be my line of actual code :
println("Thu Dec 31 22:00:00 UYST 2009".replace("(^([0-9][0-9]:[0-9][0-9]))".toRegex(),""))
Can someone help me to correct the regex?
The reason the one you have isn't working is that you are asserting that the line starts right before the hours and minutes, which isn't the case. This can be fixed by removing the assertion (^).
If you need the assertion to remain, there is another way. In most languages, you wouldn't be able to use a variable-length positive lookbehind here, but lucky for you, it looks like you can in Kotlin.
A positive lookbehind basically tells the pattern "this comes before what I'm looking for". It's denoted by a group beginning with ?<=. In this case, you can use something like (?<=^[\w ]+). This will match all word characters or spaces between the beginning of the line and the point where the pattern that comes after it is able to match. Prepending it to your expression would look something like (?<=^[\w ]+)([0-9][0-9]:[0-9][0-9]) (note that you will have to escape the backslash in \w for it to be valid inside a string literal).
Side note, Yogesh_D is correct in saying that \d\d:\d\d is the same as your [0-9][0-9]:[0-9][0-9]. Using this, it would look more like (?<=^[\w ]+)\d\d:\d\d.
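To illustrate the first fix (dropping the ^), here is a small Python sketch. It is only an illustration in a different language than the question's Kotlin, and it leaves out the lookbehind variant because Python's re module requires fixed-width lookbehinds.

```python
import re

text = "Thu Dec 31 22:00:00 UYST 2009"

# With ^ the pattern can only match at the very start of the string,
# where "Thu" is found instead of two digits.
anchored = re.search(r"^\d\d:\d\d", text)

# Without ^ the engine scans forward and finds the first hh:mm pair.
unanchored = re.search(r"\d\d:\d\d", text)

print(anchored)            # None
print(unanchored.group())  # 22:00
```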
You may use various solutions, here are two:
val text = """Thu Dec 31 22:00:00 UYST 2009"""
val match = """\b(?:0?[1-9]|1\d|2[0-3]):[0-5]\d\b""".toRegex().find(text)
println(match?.value)
val match2 = """\b(\d{1,2}:\d{2}):\d{2}\b""".toRegex().find(text)
println(match2?.groupValues?.getOrNull(1))
Both return 22:00. See regex #1 demo and regex #2 demo.
The regex complexity should be selected based on how messy the input string is.
Details
\b - a word boundary
(?:0?[1-9]|1\d|2[0-3]) - an optional zero and then a non-zero digit, or 1 and any digit, or 2 and a digit from 0 to 3
: - a : char
[0-5]\d - 0, 1, 2, 3, 4 or 5 and then any one digit
\b - a word boundary.
If there is a match with this regex, you get it as a whole match, so you can access it via match?.value.
If you do not have to worry about any pre-validation when matching, you may simply match 3 colon-separated digit pairs and capture the first two, see the second regex:
\b - a word boundary
(\d{1,2}:\d{2}) - Group 1: one or two digits, : and two digits
:\d{2} - a : and two digits (not captured)
\b - a word boundary.
If there is a match, we need Group 1 value, hence match2?.groupValues?.getOrNull(1) is used.
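To see why the choice between the two patterns depends on how messy the input is, here is a hedged Python sketch that feeds both a clean timestamp and a made-up invalid one ("25:61:00", not from the question) to each regex.

```python
import re

# Stricter pattern: restricts the hour part to the ranges described above.
STRICT = re.compile(r"\b(?:0?[1-9]|1\d|2[0-3]):[0-5]\d\b")
# Looser pattern: any hh:mm followed by :ss, with hh:mm captured in group 1.
LOOSE = re.compile(r"\b(\d{1,2}:\d{2}):\d{2}\b")

clean = "Thu Dec 31 22:00:00 UYST 2009"
messy = "at 25:61:00"  # hypothetical invalid time, for illustration only

strict_clean = STRICT.search(clean)
loose_clean = LOOSE.search(clean)
strict_messy = STRICT.search(messy)
loose_messy = LOOSE.search(messy)

print(strict_clean.group())   # 22:00
print(loose_clean.group(1))   # 22:00
print(strict_messy)           # None
print(loose_messy.group(1))   # 25:61
```

On clean input both give the same answer; only the stricter one refuses the impossible time.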
I am not sure what language you are using, but why use negation when you can directly match the first digits in the hh:mm format?
Assuming that the date string format always is in the format with a hh:mm in there.
This regex snippet should have the first group match the hh:mm.
https://regex101.com/r/aHdehZ/1
The regex to use is (\d\d:\d\d)

RegEx match anything except linebreaks up to positive lookahead

I'm trying to match certain text lines up to a specific string in RegEx (PCRE). Here's an example:
000000
999999900
20.10.19
Amoxicillin 1000 Heumann 20 Filmtbl. N2 - PZN: 04472730
-
Dr. Max Mustermann
In this text, I'd like to match exactly this part:
Amoxicillin 1000 Heumann 20 Filmtbl. N2
The similarity is always the part with the PZN and a 7-8 digit number behind that at the end of every line I'd like to match. However, the PZN part might sometimes be in the next line instead of directly behind it:
000000
999999900
20.10.19
Amoxicillin 1000 Heumann 20 Filmtbl. N2
- PZN: 04472730
-
Dr. Max Mustermann
So it's either directly behind it or in the next line. I've tried to do so using this RegEx:
.*(?=[ \-\r\n]+PZN)
This does work, however, in the first example above, it matches this:
Amoxicillin 1000 Heumann 20 Filmtbl. N2 -
Notice the " -" at the end. This should not be included in the match. I suppose RegEx prioritizes the .* part since it's working from left to right, and therefore only strips the very last character of the lookahead. I can't wrap my head around as to how to do it otherwise though.
Any ideas?
One option is to use a capturing group and match 0+ whitespace chars before the - PZN: part.
^(?![^\S\r\n]*$)(.+)\s* - PZN: \d{7,8}$
^ Start of line
(?![^\S\r\n]*$) Assert not an empty line
(.+)\s* Capture in group 1 matching any char 1+ times followed by 0+ times a whitespace char
- PZN: Match a space - and space followed by PZN: and space
\d{7,8} Match 7-8 digits
$ End of line
Regex demo
Another option is the same pattern in the form of using a lookahead
^(?![^\S\r\n]*$).+(?=\s* - PZN: \d{7,8}$)
Regex demo
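A minimal Python sketch of both forms, run against the first sample from the question (PZN on the same line). Python's re is used here instead of PCRE as an assumption; both patterns only use syntax that the two engines share. Note that both patterns require a literal space before the hyphen.

```python
import re

sample = """000000
999999900
20.10.19
Amoxicillin 1000 Heumann 20 Filmtbl. N2 - PZN: 04472730
-
Dr. Max Mustermann"""

# Capturing-group form: the wanted text ends up in group 1.
CAPTURE = re.compile(r"^(?![^\S\r\n]*$)(.+)\s* - PZN: \d{7,8}$", re.MULTILINE)
# Lookahead form: the wanted text is the whole match.
LOOKAHEAD = re.compile(r"^(?![^\S\r\n]*$).+(?=\s* - PZN: \d{7,8}$)", re.MULTILINE)

captured = CAPTURE.search(sample).group(1)
looked_ahead = LOOKAHEAD.search(sample).group()

print(captured)
print(looked_ahead)
```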
This would work:
^(.+?)(?=\s?- PZN:)
^(.+?) - at the start of a line lazily match everything
(?=\s?- PZN:) - tell .+? to quit matching once we detect an upcoming PZN:
https://regex101.com/r/dhpth0/1/
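Here is a small Python sketch of the lazy pattern applied to both layouts from the question, with the PZN part on the same line and on the next line. The function name drug_line is invented for this example.

```python
import re

# Lazy match from the start of a line up to an optional whitespace char plus "- PZN:".
PZN_LINE = re.compile(r"^(.+?)(?=\s?- PZN:)", re.MULTILINE)

same_line = """000000
999999900
20.10.19
Amoxicillin 1000 Heumann 20 Filmtbl. N2 - PZN: 04472730
-
Dr. Max Mustermann"""

next_line = """000000
999999900
20.10.19
Amoxicillin 1000 Heumann 20 Filmtbl. N2
- PZN: 04472730
-
Dr. Max Mustermann"""

def drug_line(text):
    """Return the text in front of the PZN marker, or None if there is no marker."""
    match = PZN_LINE.search(text)
    return match.group(1) if match else None

print(drug_line(same_line))
print(drug_line(next_line))
```

Because \s also matches a newline, the single optional \s? covers both the space before the hyphen and the line break in the second layout.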

Find the first set of 5 digits in a text

I need to find the first set of 5 digits in a text like this:
;SUPER U CHARLY SUR MARNE;;;rte de Pavant CHARLY SUR MARNE Picardie 02310;Charly-sur-Marne;;;02310;;;;;;;;;;;;;;
I need to find the first 02310 only.
This is my regex, but it finds every set of 5 digits:
([^\d]|^)\d{5}([^\d]|$)
To match the first 5-digit number you may use
^.*?\K(?<!\d)\d{5}(?!\d)
See the regex demo. As you want to remove the match, simply keep the Replace With field blank. The ^ matches the start of a line, .*? matches any 0+ chars other than line break chars, as few as possible, and \K operator drops the text matched so far. Then, (?<!\d)\d{5}(?!\d) matches 5 digits not enclosed with other digits.
Another variation includes a capturing group/backreference:
Find What: ^(.*?)(?<!\d)\d{5}(?!\d)
Replace With: $1
See this regex demo.
Here, instead of dropping the found text before the number, (.*?) is captured into Group 1 and $1 in the replacement pattern puts it back.
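For engines without \K, the capturing-group variant carries over directly. The sketch below uses Python's re module (which does not support \K) purely as an illustration; the question itself appears to be about an editor's Find/Replace dialog.

```python
import re

text = (";SUPER U CHARLY SUR MARNE;;;rte de Pavant CHARLY SUR MARNE "
        "Picardie 02310;Charly-sur-Marne;;;02310;;;;;;;;;;;;;;")

# search() stops at the first hit, so the lookarounds alone find the first number.
first = re.search(r"(?<!\d)\d{5}(?!\d)", text)
print(first.group())

# Group-based removal: put back what came before the number (\1 here, $1 in the answer).
without_first = re.sub(r"^(.*?)(?<!\d)\d{5}(?!\d)", r"\1", text, count=1)
print(without_first)
```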
I would use
(^(?:(?!\d{5}).)+)(\d{5})(?!\d)
It matches the fragment from the beginning of the string to the end of the first 5-digit number, and in a replacement you can use $1 or $2 to substitute the corresponding part. For example, the replacement $1<$2> will surround the number with < and >.
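The $1<$2> replacement can be tried out with a short Python sketch. Python writes the backreferences as \1 and \2, and the choice of Python is an assumption made for this illustration.

```python
import re

text = (";SUPER U CHARLY SUR MARNE;;;rte de Pavant CHARLY SUR MARNE "
        "Picardie 02310;Charly-sur-Marne;;;02310;;;;;;;;;;;;;;")

# Group 1: everything up to the first run of 5 digits; group 2: those 5 digits.
pattern = re.compile(r"(^(?:(?!\d{5}).)+)(\d{5})(?!\d)")

# Wrap only the first number in angle brackets; ^ keeps later numbers from matching.
marked = pattern.sub(r"\1<\2>", text)
print(marked)
```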
To find the first 5 digits in the text, you could also match not a digit \D* or 1-4 digits followed by matching 5 digits:
^(?=.*\b\d{5}\b)(?:\D*|\d{1,4})*\K\d{5}(?!\d)
^ Start of string
(?=.*\b\d{5}\b) Assert that there are 5 consecutive digits between word boundaries
(?:\D*|\d{1,4})* Repeat matching 0+ times not a digit or 1-4 digits
\K\d{5} Forget what was matched, then match 5 digits
(?!\d) Assert what followed is not a digit
Regex demo

Regex lookahead part of group accepted

I'm using regex in powershell 5.1.
I need it to detect groups of numbers, but ignore groups followed or preceded by /, so from this it should detect only 9876.
[regex]::matches('9876 1234/56','(?<!/)([0-9]{1,}(?!(\/[0-9])))').value
As it is now, the result is:
9876
123
6
More examples: "13 17 10/20" should only match 13 and 17.
Tried using something like (?!(\/([0-9]{1,}))), but it did not help.
You may use
\b(?<!/)[0-9]+\b(?!/[0-9])
See the regex demo
Alternatively, if the numbers can be glued to text:
(?<![/0-9])[0-9]+(?!/?[0-9])
See this regex demo.
The first pattern is based on word boundaries \b that make sure there are no letters, digits and _ right before and after an expected match. The second one just makes sure there are no digits and / on both ends of the match.
Details
(?<![/0-9]) - a negative lookbehind making sure there is no digit or / immediately to the left of the current location
[0-9]+ - one or more digits
(?!/?[0-9]) - a negative lookahead making sure there is no optional / followed with a digit immediately to the right of the current location.
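A quick way to check the second pattern against the inputs from the question is the Python sketch below. The question uses PowerShell, so this is only a cross-check in another engine; the lookaround syntax used here is the same in both.

```python
import re

# No digit or "/" directly before the number, and no digit or "/digit" directly after it.
STANDALONE = re.compile(r"(?<![/0-9])[0-9]+(?!/?[0-9])")

def standalone_numbers(text):
    """Return all digit groups that are not part of an a/b pair."""
    return STANDALONE.findall(text)

print(standalone_numbers("9876 1234/56"))
print(standalone_numbers("13 17 10/20"))
```

The digit inside the lookbehind and lookahead is what stops the engine from backtracking into partial matches such as 123 or 6, which is the behaviour the original attempt showed.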

Phone regex validation for Argentina

I figured out a regular expression for my country's phone numbers, but I'm missing something.
The rule here is: (Area Code) Prefix - Suffix
Area Code could be 3 to 5 digits
Prefix could be 2 to 4 digits.
Area Code + Prefix is 7 digits long.
Suffix is always 4 digits long
Total digits are 11.
I figured I could have 3 simple regex chained with an OR "|" like this:
/(\(?\d{3}\)?[- .]?\d{4}[- .]?\d\d\d\d)|(\(?\d{4}\)?[- .]?\d{3}[- .]?\d\d\d\d)|(\(?\d{5}\)?[- .]?\d{2}[- .]?\d\d\d\d)/
The thing I'm doing wrong is that \d\d\d\d doesn't match only 4 digits for the suffix. For example, (011) 4740-5000, which is a valid phone number, works OK, but if I add extra digits it is still reported as a valid phone number, e.g. (011) 4740-5000000000
You should use ^ and $ to match whole string
For example ^\d{4}$ will match exactly 4 digits not more not less.
Here is the complete regex pattern
^((\(?\d{3}\)? \d{4})|(\(?\d{4}\)? \d{3})|(\(?\d{5}\)? \d{2}))-\d{4}$
Online demo
Since your regex pattern allows the delimiter to be -, . or a single space, try
^((\(?\d{3}\)?[-. ]?\d{4})|(\(?\d{4}\)?[-. ]?\d{3})|(\(?\d{5}\)?[-. ]?\d{2}))[-. ]?\d{4}$
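The effect of the anchors can be seen in a short Python sketch that compares the unanchored pattern from the question with the anchored one above. Python is an assumption here; the question's /.../ delimiters suggest JavaScript or PHP, and the delimiters are dropped in the raw strings.

```python
import re

# Pattern from the question, without anchors.
UNANCHORED = re.compile(
    r"(\(?\d{3}\)?[- .]?\d{4}[- .]?\d\d\d\d)"
    r"|(\(?\d{4}\)?[- .]?\d{3}[- .]?\d\d\d\d)"
    r"|(\(?\d{5}\)?[- .]?\d{2}[- .]?\d\d\d\d)"
)
# Anchored pattern from this answer.
ANCHORED = re.compile(
    r"^((\(?\d{3}\)?[-. ]?\d{4})|(\(?\d{4}\)?[-. ]?\d{3})|(\(?\d{5}\)?[-. ]?\d{2}))"
    r"[-. ]?\d{4}$"
)

good = "(011) 4740-5000"
too_long = "(011) 4740-5000000000"

print(bool(UNANCHORED.search(too_long)))  # True: a valid-looking prefix is enough
print(bool(ANCHORED.search(too_long)))    # False: the whole string must fit
print(bool(ANCHORED.search(good)))        # True
```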
This pattern works fine for me:
/^\(?(\d{3,5})?\)?\s?(15)?[\s|-]?(4)\d{2,3}[\s|-]?\d{4}$/
I've tested this in regex101:
/^((?:\(?\d{3}\)?[- .]?\d{4}|\(?\d{4}\)?[- .]?\d{3}|\(?\d{5}\)?[- .]?\d{2})[- .]?\d{4})$/
RegEx Demo
^ Matches the beginning of a string
( Beginning of capture group
(?: Beginning of non-capturing group
Your different options for area code & prefix
) End non-capturing group
[- .]?\d{4} The last four digits of the phone number
) End capture group
$ Matches the end of a string
If you're trying to validate such a phone number, then the following one should suit your needs:
^(?=.{15}$)[(]\d{3,5}[)] \d{2,4}-\d{4}$
Debuggex Demo
You need to match the complete expression by indicating the start and end with anchors. You also don't need alternation for the different lengths.
/^(?=(\D*\d){11}$)\(?\d{3,5}\)?[- .]?\d{2,4}[- .]?\d{4}$/
Here's the breakdown:
(?=(\D*\d){11}$) is a positive lookahead ensuring that there are 11 digits in total,
with any number of non-digits amongst them
\(?\d{3,5}\)?[- .]? matches 3-5 digits in parens (area code), followed by a separator
\d{2,4}[- .]? matches 2-4 digits (prefix), followed by a separator
\d{4} matches the suffix
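The breakdown above can be exercised with a small Python sketch. The function name is_valid_phone and the extra test numbers (other than the two from the question) are made up for illustration.

```python
import re

# Lookahead counts exactly 11 digits; the rest checks the 3-5 / 2-4 / 4 grouping.
PHONE = re.compile(r"^(?=(\D*\d){11}$)\(?\d{3,5}\)?[- .]?\d{2,4}[- .]?\d{4}$")

def is_valid_phone(number):
    """Return True if the number has 11 digits in an area/prefix/suffix layout."""
    return PHONE.match(number) is not None

for candidate in [
    "(011) 4740-5000",        # from the question: valid
    "(0111) 474-5000",        # hypothetical 4-digit area code
    "(01111) 47-5000",        # hypothetical 5-digit area code
    "(011) 4740-5000000000",  # from the question: too many digits
    "(011) 474-5000",         # hypothetical: only 10 digits
]:
    print(candidate, "->", is_valid_phone(candidate))
```

The single \d{3,5} and \d{2,4} ranges would on their own allow between 9 and 13 digits, which is why the lookahead is needed to pin the total at 11.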