Word find/replace not being fully lazy - regex

I am using a wildcard find/replace involving the following find field:
([0-9]*)
(Please note that there should be a space at the end of the field even though I can't get it to stick on here on SO)
When I search on the text:
13 April Boon 87 155
(Just because it's not visually clear here, everything should be tab-separated except for the "87 155" and "April Boon", which have spaces.)
Since post-star is (nominally) a lazy evaluator, I would expect this to match only "87 ". This is the result that I want!
But it is making 4 matches:
"13 April "
"3 April "
"87 "
"7 "
This is all the more mysterious to me because it is NOT matching "13 April Boon 87 " or "3 April Boon 87 "
What's going on here? How can I get the match that I seek?
Thanks in advance!

Your wildcard pattern works as expected. Your pattern ([0-9]*) matches:
([0-9] - (start of capture group 1, which can be referenced with \1) a digit
*) - any characters, but as few as possible, up to the first...
(space) - ...space.
Matches are found from left to right, and every digit that is eventually followed by a space can start one. That is why you get 4 matches.
To capture only 87, you need a pattern like (<[0-9]@>) <[0-9]@>^13.
(<[0-9]@>) - capture group 1: a whole "word" containing one or more digits (@ is the "one or more" quantifier in Word wildcards)
(space) - a space
<[0-9]@> - a whole "word" containing one or more digits
^13 - carriage return
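For readers more used to conventional regex engines, the behaviour described above can be approximated in JavaScript. This is only an analogy and not Word's wildcard engine: `[0-9].*? ` stands in for the Word pattern, `\b([0-9]+) [0-9]+$` stands in for the suggested fix, and the sample string is rebuilt from the question's description (tabs between fields).

```javascript
// Sample line rebuilt from the question: tab-separated fields,
// with plain spaces inside "April Boon" and "87 155".
const text = "13\tApril Boon\t87 155";

// Rough analogue of the Word pattern: one digit, then lazily
// anything up to the first space.
const lazy = /[0-9].*? /g;
const lazyMatches = text.match(lazy) || [];
console.log(lazyMatches);

// Rough analogue of the fix: a whole number, a space, and a
// whole number at the end of the line. Group 1 is the first number.
const anchored = /\b([0-9]+) [0-9]+$/;
const fixed = text.match(anchored);
console.log(fixed ? fixed[1] : null);
```

JavaScript reports non-overlapping matches only, so it shows two of the four hits listed in the question, but the first one still starts at the "1" of "13" because laziness only limits where a match ends, not where it begins.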

Related

regex to extract housenumber plus addition

I'm looking for a regex that matches housenumbers combined with additions for all addresses below:
Breestraat 4
Breestraat 45
Breestraat 456
Dubbele Straat 4a
Dubbele Straat 4-a
5 meistraat 1a
5meistraat 12
5meistraat 12a
Teststraat 22-III
Now the following regex works, except in the first case. This is because the single-digit house number is missed because of the first \d in the regex (which prevents a starting digit from being captured).
\d?.(\d+.+)$
I'm scratching my head over how to get the house number '4' for the first line. So basically: how do I change the "skip starting digit" part to "skip starting digit, but still let it end up in the capturing group"?
You can use
\d+\D*$
\d+\S*$
See the regex demo #1 and regex demo #2.
The pattern matches
\d+ - one or more digits
\D* - zero or more non-digit chars
\S* - zero or more non-whitespace chars
$ - end of string.
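As a quick check of the two patterns above, here is a minimal JavaScript sketch that runs both against the addresses from the question. The helper name `extractHouseNumber` is made up for this illustration.

```javascript
// Addresses taken from the question.
const addresses = [
  "Breestraat 4",
  "Breestraat 45",
  "Breestraat 456",
  "Dubbele Straat 4a",
  "Dubbele Straat 4-a",
  "5 meistraat 1a",
  "5meistraat 12",
  "5meistraat 12a",
  "Teststraat 22-III",
];

const digitsThenNonDigits = /\d+\D*$/; // digits, then any non-digits, up to the end
const digitsThenNonSpace = /\d+\S*$/;  // digits, then any non-whitespace, up to the end

// Hypothetical helper: returns the matched text, or null when nothing matches.
const extractHouseNumber = (address, pattern) => {
  const m = address.match(pattern);
  return m ? m[0] : null;
};

const results = addresses.map((address) => ({
  address,
  nonDigits: extractHouseNumber(address, digitsThenNonDigits),
  nonSpace: extractHouseNumber(address, digitsThenNonSpace),
}));

results.forEach((r) => console.log(r.address, "->", r.nonDigits, "|", r.nonSpace));
```

On this sample both patterns return the same text for every address; a leading digit such as the 5 in "5meistraat" is skipped because the match must run to the end of the string.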
It's not entirely clear what you are asking for.
Anyway, this pattern matches the house number at the end of the string:
\d+[-\da-zI]*$
https://regexr.com/6l0g7
Anyway I'm aware this is not a valid answer

Can I use a regular expression to help format this data to separate name, age, and address?

I am working on an assignment for class, and we need to format this data. I was thinking that regular expressions would be a very elegant way of formatting the data, but I ran into some trouble. This is my first time doing this and I do not know how to properly split the data. I want everything from the beginning up to the first digit to be the first section, the first digit up to the next whitespace to be the second section, and from there to the end of the line to be the third section. Here is my data:
Amber-Rose Bowen 53 123 Machinery Rd.
Joyce Kirkland 19 234 Cylinder Dr.
Seb Dotson 32 3456 Surgery Ln.
Dominique Hough 58 654 Election Rd.
Yasemin Mcleod 29 555 Cabinet Ave.
Nancy Lord 80 232 Highway Rd.
Tracy Mckenzie 72 101 Device Ave.
Alistair Salter 25 109 Guitar Ln.
Adeel Sears 42 222 Solitare Rd.
I have been using https://regex101.com/ to test my ideas. My starting point is ([a-zA-Z]+)([0-9]+), but I do not know how to get from the start of the line to the first digit (or how to do any other part of this).
You can use
^(.*?)[^\S\r\n]+(\d+)[^\S\r\n]+(\S.*)
See the regex demo. This regex can also be used with a multiline flag to extract data from a multiline string.
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
[^\S\r\n]+ - one or more horizontal whitespaces (in some regex flavors, you can use \h+ or [\p{Zs}\t]+ instead)
(\d+) - Group 2: one or more digits
[^\S\r\n]+ - one or more horizontal whitespaces
(\S.*) - Group 3: a non-whitespace char and then the rest of the line.
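A minimal JavaScript sketch of how the pattern above could be applied to a multiline string, using the `g` and `m` flags. The property names `name`, `age` and `address` are chosen for this illustration only, and the sample uses the first three lines of the question's data.

```javascript
// First three records from the question, as one multiline string.
const data = [
  "Amber-Rose Bowen 53 123 Machinery Rd.",
  "Joyce Kirkland 19 234 Cylinder Dr.",
  "Seb Dotson 32 3456 Surgery Ln.",
].join("\n");

// With the m flag, ^ matches at the start of every line.
const lineRe = /^(.*?)[^\S\r\n]+(\d+)[^\S\r\n]+(\S.*)/gm;

const records = [];
let m;
while ((m = lineRe.exec(data)) !== null) {
  // Group 1: name, Group 2: age, Group 3: rest of the line.
  records.push({ name: m[1], age: Number(m[2]), address: m[3] });
}

console.log(records);
```

Because Group 1 is lazy, it stops at the first place where whitespace, a run of digits and more whitespace follow, which is why the street number ends up in Group 3 rather than Group 2.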
If you merely wish to separate the string into full name, age and street address you may split the string on matches of the regular expression
(?i)(?<=[a-z]|\d) +(?=\d)
For example:
Amber-Rose Bowen 53 123 Machinery Rd.
^ ^^^^
Demo
The regular expression reads: "match one or more spaces preceded by a letter or digit and followed by a digit". (?i) makes the letter match case-insensitive. (?<=[a-z]|\d) is a positive lookbehind; (?=\d) is a positive lookahead.
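The split described above can be sketched in JavaScript as follows. JavaScript does not traditionally accept the inline (?i) modifier, so this sketch assumes the `i` flag as a substitute; lookbehind requires a reasonably recent engine.

```javascript
const line = "Amber-Rose Bowen 53 123 Machinery Rd.";

// One or more spaces, preceded by a letter or digit and followed by a digit.
// The i flag replaces the inline (?i) used in the answer.
const separator = /(?<=[a-z]|\d) +(?=\d)/i;

const parts = line.split(separator);
console.log(parts);
```

Only the spaces before "53" and before "123" qualify as separators, so the line falls into three pieces: full name, age and street address.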
You may use the following regular expression if you wish to extract first name, last name, age, street number and street name.
^(?<first_name>\S+) +(?<last_name>\S+) +(?<age>\d+) +(?<street_nbr>\d+) +(?<street_name>.*)
For example:
Amber-Rose Bowen 53 123 Machinery Rd.
^^^^^^^^^^ ^^^^^ ^^ ^^^ ^^^^^^^^^^^^^
1          2     3  4   5
1: first_name
2: last_name
3: age
4: street_nbr
5: street_name
Demo
I've used the PCRE regex engine with named capture groups. The expression would be similar for other regex engines, though some do not support named groups, in which case you would have to use numbered groups (group 1, group 2, and so forth.)
Note that this only works because of the consistent structure of your data. In real life some strings may contain such things as middle names or apartment numbers, which would complicate the parsing of the strings.
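The answer uses PCRE, but the same named-group expression should also work in engines such as JavaScript's. A minimal sketch, assuming an engine with named capture group support; the variable names are illustrative.

```javascript
const line = "Amber-Rose Bowen 53 123 Machinery Rd.";

// Same structure as the PCRE expression: five named groups separated by spaces.
const re = /^(?<first_name>\S+) +(?<last_name>\S+) +(?<age>\d+) +(?<street_nbr>\d+) +(?<street_name>.*)/;

const m = line.match(re);
// Copy the named groups into a plain object, or null when the line does not fit.
const fields = m ? { ...m.groups } : null;

console.log(fields);
```

All captured values are strings; converting `age` or `street_nbr` to numbers is left to the caller.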

regex to match the end of string doesn't work

I want to recognize the house number in a given string. Here you can find some sample inputs:
"My house number is 23"
"23"
"23a"
"23 a"
"The house number is 23 a and the street ist XY"
"The house number is 23 a"
I have the following regex:
\d+(([\s]{0,1}[a-zA-Z]{0,1}[\s])*|[\s]{0,1}[a-zA-Z]{0,1}$)
But it is not able to capture the inputs which have the number followed by a letter at the end of the line (e.g. the house number is 23 a).
Any help would be appreciated.
PS: Ultimately I need the regex in TypeScript.
If I got your problem correctly, this should work:
(\d+(\s?[a-zA-Z]?\s?|\s?[a-zA-Z]$))
Note: [\s]{0,1} is the same as \s?
https://regex101.com/r/r6WHFy/1
The issue with your regex is that, for the input The house number is 23 a, the first alternative ([\s]{0,1}[a-zA-Z]{0,1}[\s])* already succeeds (it consumes the space after 23), so the engine "does not need" to try the second alternative, the one with the end-of-string anchor.
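To make the difference visible, this small JavaScript sketch runs the question's regex and the corrected one on the same input. The results are printed through JSON.stringify so that a trailing space in a match is easy to spot.

```javascript
const input = "The house number is 23 a";

// Regex from the question: the first alternative wins after consuming one space.
const original = /\d+(([\s]{0,1}[a-zA-Z]{0,1}[\s])*|[\s]{0,1}[a-zA-Z]{0,1}$)/;

// Corrected regex from this answer.
const corrected = /(\d+(\s?[a-zA-Z]?\s?|\s?[a-zA-Z]$))/;

const originalMatch = input.match(original);
const correctedMatch = input.match(corrected);

console.log(JSON.stringify(originalMatch && originalMatch[0]));   // "23 "
console.log(JSON.stringify(correctedMatch && correctedMatch[0])); // "23 a"
```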
You could also write the pattern using word boundaries and without using an alternation |
\b\d+(?:\s*[a-zA-Z])?\b
\b A word boundary
\d+ Match 1+ digits
(?:\s*[a-zA-Z])? Optionally match zero or more whitespace chars followed by a single letter a-zA-Z
\b A word boundary
const regex = /\b\d+(?:\s*[a-zA-Z])?\b/;
[
"My house number is 23",
"23",
"23a",
"23 a",
"The house number is 23 a and the street ist XY",
"The house number is 23 a"
].forEach(s => console.log(s.match(regex)[0]));
Regex demo

Exclude word and quotes from regexp

I have the following phrases:
Mr "Smith"
MrS "Smith"
I need to retrieve only Smith from these phrases. I tried thousands of variants. I stopped at
(?!Mr|MrS)([^"]+).
Help, please.
The pattern (?!Mr|MrS)([^"]+) asserts from the current position that what is directly to the right is not Mr or MrS and then captures 1+ occurrences of any char except ".
So it will not start the match at M, but it will at r, because at the position before the r the lookahead assertion is true.
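The effect described above can be reproduced in JavaScript, whose lookahead behaves the same way for this pattern. This is a sketch in a different engine than the PostgreSQL one used in the question.

```javascript
const phrase = 'Mr "Smith"';

// The question's attempt: a negative lookahead followed by a capture of non-quote chars.
const attempt = /(?!Mr|MrS)([^"]+)/;

const m = phrase.match(attempt);

// The match starts one character in, at the "r", and stops before the first quote.
console.log(m.index, JSON.stringify(m[0])); // 1 "r "
```

The lookahead only rejects the starting position 0; it places no restriction on position 1, so the engine simply moves on and matches from there.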
Instead of using a lookaround, you could match either Mr or MrS and capture what is in between double quotes.
\mMrS? "([^"]+)"
\m A word boundary that matches only at the start of a word
MrS? Match Mr with an optional S
" Match a space and "
([^"]+) capture in group 1 what is between the "
" Match "
See a postgresql demo
For example
select REGEXP_MATCHES('Mr "Smith"', '\mMrS? "([^"]+)"');
select REGEXP_MATCHES('MrS "Smith"', '\mMrS? "([^"]+)"');
Output
regexp_matches
1 Smith
regexp_matches
1 Smith
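Outside PostgreSQL the same idea carries over. A minimal JavaScript sketch, assuming \b as the closest available substitute for \m (JavaScript has no start-of-word-only boundary):

```javascript
// Match Mr or MrS, a space, and capture what is between the double quotes.
const titled = /\bMrS? "([^"]+)"/;

const phrases = ['Mr "Smith"', 'MrS "Smith"'];

const names = phrases.map((s) => {
  const m = s.match(titled);
  return m ? m[1] : null; // group 1 holds the quoted name
});

console.log(names);
```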

Regex- Ignore a constant string that matches a pattern

I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
Observed output:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
The regex partly works, but I don't want two constant strings, JOHN and I, to match as part of this regex. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant"; you can shorten it considerably by grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed by an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
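The pattern details above can be checked with a short JavaScript sketch that runs the shortened regex over some of the sample lines from the question.

```javascript
// Shortened pattern with the negative lookahead excluding JOHN and I.
const pattern = /\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b/g;

// Sample lines taken from the question.
const lines = [
  "JOHN went to LONDON one fine day.",
  "I don't want JOHN to be highlighted.",
  "But THIS1 should match the pattern.",
  "Also the other 70 times that the pattern should match.",
];

// Collect all matches per line; an empty array means nothing is highlighted.
const highlighted = lines.map((line) => line.match(pattern) || []);

highlighted.forEach((found, i) => console.log(lines[i], "->", found));
```

Mixed-case words such as "But" or "Also" are not matched because the trailing \b fails right after the single uppercase letter.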