Regex capturing the first occurrence of every group in a recurring pattern - regex

Suppose I have the following text:
Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity
I have a regex (a bit more complex, but it boils down to this):
^(?:(?:(?:Name: (.+?))|(?:Address: (.+?))|(?:City: (.+?)))\t*)+$
which has three capturing groups, that can capture the values of Name, Address and City (if they occur in the text). A few more examples are here: https://regex101.com/r/37nemH/6. EDIT The ordering is not fixed beforehand, and it could also happen that the fields are not separated by \t characters.
Now this all works well, the only slight problem I have is when one field occurs twice in the same text, as can be seen in the last example I put on regex101:
Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address
What I would want is for the second capturing group to match the first address, i.e. Street 123 ABC, and preferably to let the second occurrence be matched within the "City" group, i.e.
1: John Doe
2: Street 123 ABC
3: MyCity\tAddress: Other Address
Conceptually, I tried doing this with a negative lookbehind, e.g. replacing (?:Address: (.+?)) with (?:(?<!.*Address: )Address: (.+?)), i.e. ensuring that an Address: match is not preceded somewhere in the text by another Address: tag. But negative lookbehind does not allow arbitrary-length patterns, so this obviously would not work.
Can this be achieved using regex, and how?

For your stated problem, you may use this regex with a conditional construct:
^.*?(?:(?:Name: (.+?)|(Address: )(.+?)|City: ((?(2).*?Address: )*.+?))\t*)+$
Your values are available in captured groups 1, 3, 4.
Capture group 2 is for literal label "Address: ".
Here, (?(2).*?Address: )* is a conditional construct: if capture group 2 has already participated in the match, group 4 may also consume text up to and including the next Address: label (zero or more times).
For the text Name: John Doe Address: Street 123 ABC City: MyCity Address: Second address, it will have following matches:
Group 1. 169-177 `John Doe`
Group 2. 178-187 `Address: `
Group 3. 187-201 `Street 123 ABC`
Group 4. 210-240 `MyCity Address: Second address`

If the fields can appear in any order and some or all of them can be missing, it is much easier to use three separate patterns to extract the bits you need.
Name (demo):
^.*?Name:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
City (demo):
^.*?City:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
Address (demo):
^.*?Address:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))
Details
^ - start of string
.*? - any 0+ chars other than line break chars, as few as possible
Address: - a keyword to stop at and look for the expected match
\s* - 0+ whitespaces
(.*?) - Group 1: any 0+ chars other than line break chars, as few as possible...
(?=\s*(?:Name:|Address:|City:|$)) - up to but excluding 0 or more whitespaces followed with Name:, Address:, City: or end of string.
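As a rough illustration, the three patterns can be generated from one template and applied with Python's re module. The helper name first_values and the sample string are assumptions made for this sketch; the regex is the one shown above.

```python
import re

FIELDS = ("Name", "Address", "City")

# Lookahead shared by all three patterns: stop before the next label or at the end.
STOP = r"(?=\s*(?:Name:|Address:|City:|$))"


def first_values(text):
    """Return the first value of each field, or None when a field is absent."""
    values = {}
    for field in FIELDS:
        match = re.search(r"^.*?" + field + r":\s*(.*?)" + STOP, text)
        values[field] = match.group(1) if match else None
    return values


sample = "Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address"
print(first_values(sample))
```

Note that with these patterns every value stops at the next label, so a repeated Address: is simply ignored rather than folded into the City value.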

Related

Can I use a regular expression to help format this data to separate name, age, and address?

I am working on an assignment for class, and we need to format this data. I was thinking that regular expressions would be a very elegant way of formatting the data. But I ran into some trouble. This is my first time doing this and I do not know how to properly split the data. I want everything from the beginning up to the first digit to be the first section, the first digit until the next white space to be the second section, and from there until the end of the line to be the third section. Here is my data:
Amber-Rose Bowen 53 123 Machinery Rd.
Joyce Kirkland 19 234 Cylinder Dr.
Seb Dotson 32 3456 Surgery Ln.
Dominique Hough 58 654 Election Rd.
Yasemin Mcleod 29 555 Cabinet Ave.
Nancy Lord 80 232 Highway Rd.
Tracy Mckenzie 72 101 Device Ave.
Alistair Salter 25 109 Guitar Ln.
Adeel Sears 42 222 Solitare Rd.
I have been using https://regex101.com/ to test my ideas. ([a-zA-Z]+)([0-9]+) this is my start, but I do not know how to go from the start to the first digit. (or any other part of this)
You can use
^(.*?)[^\S\r\n]+(\d+)[^\S\r\n]+(\S.*)
See the regex demo. This regex can also be used with a multiline flag to extract data from a multiline string.
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
[^\S\r\n]+ - one or more horizontal whitespaces (in some regex flavors, you can use \h+ or [\p{Zs}\t]+ instead)
(\d+) - Group 2: one or more digits
[^\S\r\n]+ - one or more horizontal whitespaces
(\S.*) - Group 3: a non-whitespace char and then the rest of the line.
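A minimal sketch of how this pattern might be applied to several lines at once in Python, assuming the data is held in one string with \n line endings. The variable names are illustrative only.

```python
import re

# Same pattern as above; MULTILINE makes ^ match at the start of every line.
LINE = re.compile(r"^(.*?)[^\S\r\n]+(\d+)[^\S\r\n]+(\S.*)", re.MULTILINE)

data = (
    "Amber-Rose Bowen 53 123 Machinery Rd.\n"
    "Joyce Kirkland 19 234 Cylinder Dr.\n"
    "Seb Dotson 32 3456 Surgery Ln."
)

# findall returns one (name, age, address) tuple per matching line.
records = LINE.findall(data)
for name, age, address in records:
    print(name, age, address, sep=" | ")
```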
If you merely wish to separate the string into full name, age and street address you may split the string on matches of the regular expression
(?i)(?<=[a-z]|\d) +(?=\d)
For example:
Amber-Rose Bowen 53 123 Machinery Rd.
^ ^^^^
The regular expression reads: "match one or more spaces preceded by a letter or digit and followed by a digit". (?i) causes the match of a letter to be case-indifferent. (?<=[a-z]|\d) is a positive lookbehind; (?=\d) is a positive lookahead.
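For instance, in Python the split could look like the sketch below. Python's re module accepts this lookbehind because both alternatives are exactly one character wide; the variable names are assumptions for the example.

```python
import re

# Spaces preceded by a letter or digit and followed by a digit act as separators.
SEPARATOR = re.compile(r"(?i)(?<=[a-z]|\d) +(?=\d)")

line = "Amber-Rose Bowen 53 123 Machinery Rd."
parts = SEPARATOR.split(line)
print(parts)
```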
You may use the following regular expression if you wish to extract first name, last name, age, street number and street name.
^(?<first_name>\S+) +(?<last_name>\S+) +(?<age>\d+) +(?<street_nbr>\d+) +(?<street_name>.*)
For example:
Amber-Rose Bowen 53 123 Machinery Rd.
^^^^^^^^^^ ^^^^^ ^^ ^^^ ^^^^^^^^^^^^^
1          2     3  4   5
1: first_name
2: last_name
3: age
4: street_nbr
5: street_name
I've used the PCRE regex engine with named capture groups. The expression would be similar for other regex engines, though some do not support named groups, in which case you would have to use numbered groups (group 1, group 2, and so forth.)
Note that this only works because of the consistent structure of your data. In real life some strings may contain such things as middle names or apartment numbers, which would complicate the parsing of the strings.
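As an example of how the syntax differs between engines, Python's re module spells named groups as (?P<name>...). A sketch of the same expression under that assumption:

```python
import re

# Python spelling of the named groups: (?P<name>...) instead of (?<name>...).
RECORD = re.compile(
    r"^(?P<first_name>\S+) +(?P<last_name>\S+) +(?P<age>\d+)"
    r" +(?P<street_nbr>\d+) +(?P<street_name>.*)"
)

match = RECORD.match("Amber-Rose Bowen 53 123 Machinery Rd.")
# groupdict() maps each group name to the text it captured.
fields = match.groupdict() if match else {}
print(fields)
```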

regex to split string into parts

I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The string will always come with the Name: pre appended.
My regex expression to get the fullname is (?<=Name:\s)(.*) works fine to get the whole name. This (?<=Name:\s)([a-zA-Z]+) seems to get the first name.
So an expression each to get for first,middle & last name would be ideal. Could someone guide me in the right direction?
Thank you
You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> re.search('(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you got the full name, you can apply split on the result, and get the names on a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.
As you stated in the comments that you use .NET you can make use of a quantifier in the lookbehind to select which part of a "word" you want to select after Name:
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non whitespace chars instead of word characters only, you can use \S+ instead of \w+
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier, to get the first name use {0})
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
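Engines that do not allow quantifiers inside a lookbehind (Python's re module is one) can get the same result with a capture group instead. This is a swapped-in technique, not the lookbehind above, and the helper name name_part is hypothetical:

```python
import re


def name_part(text, index):
    """Return the word at position `index` (0-based) after 'Name:', or None."""
    # Skip `index` words after the label, then capture the next one.
    pattern = r"\bName:(?:\s+\w+){%d}\s+(\w+)" % index
    match = re.search(pattern, text)
    return match.group(1) if match else None


text = "ID Number / 1234\nName: John Doe Smith\nNationality: US"
print([name_part(text, i) for i in range(3)])
```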

Regex to extract city names (.NET)

Looking for an expression to extract City Names from addresses. Trying to use this expression in WebHarvy which uses the .NET flavor of regex
Example address
1234 Savoy Dr Ste 123
New Houston, TX 77036-3320
or
1234 Savoy Dr Ste 510
Texas, TX 77036-3320
So the city name could be single or two words.
The expression I am trying is
(\w|\w\s\w)+(?=,\s\w{2})
When I am trying this on RegexStorm it seems to be working fine, but when I am using this in WebHarvy, it only captures the 'n' from the city name New Houston and 'n' from Austin
Where am I going wrong?
In WebHarvy, if a regex contains a capturing group, its contents are returned. Thus, you do not need a lookahead.
Another point is that you need to match one or more word chars, optionally followed by a chunk of whitespace and then one or more word chars. Your regex contains a repeated capturing group whose contents are overwritten on each iteration, so once the match is found, Group 1 only contains the last character matched, n:
Use
(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}
See the regex demo here
The [^\S\r\n]+ part matches any whitespace except CR and LF. You may use [\p{Zs}\t]+ to match any 1+ horizontal whitespaces.
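The difference between the two patterns can be reproduced outside WebHarvy. The sketch below uses Python's re module (an assumption; WebHarvy itself uses the .NET flavor) with the capture-group pattern written with balanced parentheses:

```python
import re

# Capture-group pattern: group 1 holds the whole city name.
CITY = re.compile(r"(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}")
# Pattern from the question: a repeated group keeps only its last iteration.
REPEATED = re.compile(r"(\w|\w\s\w)+(?=,\s\w{2})")

address = "1234 Savoy Dr Ste 123\nNew Houston, TX 77036-3320"
city_line = address.splitlines()[1]

city = CITY.search(address).group(1)
last_iteration = REPEATED.search(city_line).group(1)
print(city)
print(last_iteration)
```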

Regex- Ignore a constant string that matches a pattern

I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
Observed output:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
The regex works partly but I don't want two constant strings- JOHN and I to match as part of this regex. Please help.
You can use a negative lookahead to exclude those matches. Also, your pattern seems rather "redundant", you may shorten it considerably using grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed with an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
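A quick way to check the pattern against a few of the sample sentences, sketched in Python (the pattern uses no flavor-specific syntax; the variable names are illustrative):

```python
import re

PATTERN = re.compile(r"\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b")

lines = [
    "JOHN went to LONDON one fine day.",
    "I don't want JOHN to be highlighted.",
    "But THIS1 should match the pattern.",
    "Also the other 70 times that the pattern should match.",
]

# One list of matches per input line; JOHN and I are skipped by the lookahead.
matches = [PATTERN.findall(line) for line in lines]
for line, found in zip(lines, matches):
    print(found, "<-", line)
```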

Regex - find string by excluding part of it

I have text: "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
Is it possible to find Lastname and phone?
Exclude address?
Address regex is this: "([A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}" (two words from capital, housenumber, comma, postal code and city)
I want result string to be "Walker +123456789012"
This should do what you need, and also doesn't assume three names (works without a middle name present), so it's a little more flexible in case you run into entries for people who don't have a middle name:
.*?(\w+)\s*(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}\s*(\+\d+)
.*?(\w+)\s* - Capture the last word before the whitespace before the address. .*? will lazily match anything up to the word preceding the address, but not capture. \s* will match the whitespace between the word and the address.
(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,} - your address regex but using a non-capturing group (?:)
\s*(\+\d+) - Captures the + and following numbers. \s* will match the whitespace between the address and the +.
I reused your address regex, but made the capture group non-capturing. Then we match the last word before the address (the last name) using (\w+), and the + and following numbers after the address using (\+\d+).
Here it is in action: https://regex101.com/r/YGiaJT/1
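For reference, here is a sketch of the same pattern in Python, building the "Walker +123456789012" string from the two groups. The constant names are assumptions for the example:

```python
import re

# The address regex from the question, with its group made non-capturing.
ADDRESS = (
    r"(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}"
    r"[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}"
)
PATTERN = re.compile(r".*?(\w+)\s*" + ADDRESS + r"\s*(\+\d+)")

text = "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
match = PATTERN.search(text)
# Group 1 is the last name, group 2 the phone number.
result = " ".join(match.groups()) if match else None
print(result)
```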
You could do....
\w+\s+\w+\s+(\w+).*(\+\d+)
And your capture groups should match up pretty well with what you're trying to match...
Essentially this will "disregard" your first and second "words" (first / middle name) and then disregard EVERYTHING from in between until it finds a + then captures the digits after it.
Live example: https://regex101.com/r/MjJCSv/1
In theory if your last name and your address will always be separated by more than 1 space you can shorten this a little bit and write it as
(\w+)\s{2,}.*(\+\d+)
Live example of this functionality: https://regex101.com/r/vGGB4z/1
Example implementation of the latter in Java: http://ideone.com/RExAEO
You can use the following to capture just the surname and the phone number.
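The first part, (\w+\s){3}, matches three words, each followed by a whitespace character. Because the group is repeated, it retains only its last iteration, i.e. the third word (including the trailing whitespace).
The second part, .+?, lazily matches everything in between without capturing it.
The third part, (\+?\d+)$, captures an optional + (phone number prefix) and the rest of the phone number, up to the end of the string.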
(\w+\s){3}.+?(\+?\d+)$
\1 - The surname
\2 - The phone number
https://regex101.com/r/gqu0tt/4
But if the surname and the address are separated by more than one space, then you can use
(\w+)\s{2,}.+?(\+?\d+)$
\1 - The surname
\2 - The phone number
https://regex101.com/r/gqu0tt/5
I've tested these expressions on the Java engine, and they give back the correct match
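The first expression can also be tried in Python, as sketched below. One detail worth noting: group 1 includes the whitespace character that follows the surname, so the example strips it. The variable names are assumptions.

```python
import re

text = "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
match = re.search(r"(\w+\s){3}.+?(\+?\d+)$", text)

raw_surname = match.group(1)   # third word plus the whitespace after it
surname = raw_surname.strip()  # drop the trailing whitespace
phone = match.group(2)
print(surname, phone)
```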