Regex to extract city names (.NET) - regex

Looking for an expression to extract City Names from addresses. Trying to use this expression in WebHarvy which uses the .NET flavor of regex
Example address
1234 Savoy Dr Ste 123
New Houston, TX 77036-3320
or
1234 Savoy Dr Ste 510
Texas, TX 77036-3320
So the city name could be single or two words.
The expression I am trying is
(\w|\w\s\w)+(?=,\s\w{2})
When I am trying this on RegexStorm it seems to be working fine, but when I am using this in WebHarvy, it only captures the 'n' from the city name New Houston and 'n' from Austin
Where am I going wrong?

In WebHarvy, if a regex contains a capturing group, its contents are returned. Thus, you do not need a lookahead.
Another point is that you need to match 1 or more word chars, optionally followed by a chunk of whitespace and then 1 or more word chars. Your regex contains a repeated capturing group whose contents are overwritten on each iteration, so once a match is found, Group 1 only contains n.
Use
(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}
See the regex demo here
The [^\S\r\n]+ part matches any whitespace except CR and LF. You may use [\p{Zs}\t]+ to match any 1+ horizontal whitespaces.
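The overwriting behaviour is not specific to .NET. As a minimal sketch, assuming Python's re module (chosen only because it is easy to run; WebHarvy itself is not involved), the original pattern leaves just the last iteration in Group 1, while the suggested pattern captures the whole city name. The helper name extract_city is hypothetical.

```python
import re

LINE = "New Houston, TX 77036-3320"
ADDRESS = "1234 Savoy Dr Ste 123\n" + LINE

# Original pattern: the group repeats, so Group 1 keeps only its last iteration.
original = re.search(r"(\w|\w\s\w)+(?=,\s\w{2})", LINE)

# Suggested pattern: a single group spans the whole city name.
CITY_RE = re.compile(r"(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}")


def extract_city(address):
    """Return the city name found before ', ST', or None if nothing matches."""
    m = CITY_RE.search(address)
    return m.group(1) if m else None


print(repr(original.group(0)), repr(original.group(1)))
print(extract_city(ADDRESS))
```

The whole match of the original pattern is still the full city name; only the group is truncated, which would explain why a tool that reports Group 1 shows a single letter.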

Related

Can I use a regular expression to help format this data to separate name, age, and address?

I am working on an assignment for class, and we need to format this data. I was thinking that regular expressions would be a very elegant way of formatting the data. But, I ran into some trouble. This is my first time doing this and I do not know how to properly split the data. I want everything from the beginning up to the first digit to be the first section, from the first digit to the next whitespace to be the second section, and from there to the end of the line to be the third section. Here is my data:
Amber-Rose Bowen 53 123 Machinery Rd.
Joyce Kirkland 19 234 Cylinder Dr.
Seb Dotson 32 3456 Surgery Ln.
Dominique Hough 58 654 Election Rd.
Yasemin Mcleod 29 555 Cabinet Ave.
Nancy Lord 80 232 Highway Rd.
Tracy Mckenzie 72 101 Device Ave.
Alistair Salter 25 109 Guitar Ln.
Adeel Sears 42 222 Solitare Rd.
I have been using https://regex101.com/ to test my ideas. ([a-zA-Z]+)([0-9]+) this is my start, but I do not know how to go from the start to the first digit. (or any other part of this)
You can use
^(.*?)[^\S\r\n]+(\d+)[^\S\r\n]+(\S.*)
See the regex demo. This regex can also be used with a multiline flag to extract data from a multiline string.
Details:
^ - start of string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
[^\S\r\n]+ - one or more horizontal whitespaces (in some regex flavors, you can use \h+ or [\p{Zs}\t]+ instead)
(\d+) - Group 2: one or more digits
[^\S\r\n]+ - one or more horizontal whitespaces
(\S.*) - Group 3: a non-whitespace char and then the rest of the line.
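As a rough sketch of how the pattern above could be applied to several lines at once, here is a version assuming Python's re module with the multiline flag. The variable names are illustrative only and the data is a shortened copy of the sample.

```python
import re

DATA = """Amber-Rose Bowen 53 123 Machinery Rd.
Joyce Kirkland 19 234 Cylinder Dr.
Seb Dotson 32 3456 Surgery Ln."""

# re.MULTILINE makes ^ match at the start of every line, not just the string.
RECORD_RE = re.compile(r"^(.*?)[^\S\r\n]+(\d+)[^\S\r\n]+(\S.*)", re.MULTILINE)

# findall returns one (name, age, address) tuple per matching line.
records = RECORD_RE.findall(DATA)
for name, age, address in records:
    print(f"{name} | {age} | {address}")
```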
If you merely wish to separate the string into full name, age and street address you may split the string on matches of the regular expression
(?i)(?<=[a-z]|\d) +(?=\d)
For example:
Amber-Rose Bowen 53 123 Machinery Rd.
                ^  ^
Demo
The regular expression reads: "match one or more spaces preceded by a letter or digit and followed by a digit". (?i) causes the match of a letter to be case-indifferent. (?<=[a-z]|\d) is a positive lookbehind; (?=\d) is a positive lookahead.
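A small sketch of that split, assuming Python's re module (which accepts this lookbehind because both alternatives have the same fixed width):

```python
import re

# Split on runs of spaces that follow a letter or digit and precede a digit.
SPLIT_RE = re.compile(r"(?i)(?<=[a-z]|\d) +(?=\d)")

parts = SPLIT_RE.split("Amber-Rose Bowen 53 123 Machinery Rd.")
print(parts)
```

The spaces inside the name and inside the street name are left alone because they are not followed by a digit.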
You may use the following regular expression if you wish to extract first name, last name, age, street number and street name.
^(?<first_name>\S+) +(?<last_name>\S+) +(?<age>\d+) +(?<street_nbr>\d+) +(?<street_name>.*)
For example:
Amber-Rose Bowen 53 123 Machinery Rd.
^^^^^^^^^^ ^^^^^ ^^ ^^^ ^^^^^^^^^^^^^
1          2     3  4   5
1: first_name
2: last_name
3: age
4: street_nbr
5: street_name
Demo
I've used the PCRE regex engine with named capture groups. The expression would be similar for other regex engines, though some do not support named groups, in which case you would have to use numbered groups (group 1, group 2, and so forth.)
Note that this only works because of the consistent structure of your data. In real life some strings may contain such things as middle names or apartment numbers, which would complicate the parsing of the strings.
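For comparison, a sketch of the same named-group idea in Python's re module, which spells named groups as (?P<name>...). The group names follow the list above; everything else is an assumption for illustration.

```python
import re

# Python's re module uses (?P<name>...) for named groups.
PERSON_RE = re.compile(
    r"^(?P<first_name>\S+) +(?P<last_name>\S+) +(?P<age>\d+)"
    r" +(?P<street_nbr>\d+) +(?P<street_name>.*)"
)

m = PERSON_RE.match("Amber-Rose Bowen 53 123 Machinery Rd.")
fields = m.groupdict()  # maps each group name to the text it captured
print(fields)
```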

Python Regex some name + US Address

I have these kind of strings:
WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474
LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046
I want to make sure that the strings I receive follow the same format as the ones above.
Here's my regular expression:
[A-Z]+ [A-Z]+ [0-9]{3,4} [A-Z]+ [A-Z]{2,4} [A-Z]{2,4} [0-9]+ [A-Z]+ [A-Z]{2} [0-9]{5}-[0-9]{4}$
But my regular expression only matches the first example and does not match the second one.
Here's dawg's regex with capturing groups:
^([A-Z]+[ \t]+[A-Z]+)[ \t]+(\d+)[ \t](.*)[ \t]+([A-Z]{2})[ \t]+(\d{5}(?:-\d{4}))$
Here's the url.
UPDATE
sorry, I forgot to remove non-capturing group at the end of dawg's regex...
Here's new regex without non-capturing group: regex101
Try this:
^[A-Z]+[ \t]+[A-Z]+[ \t]+\d+.*[ \t]+[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?$
Demo
Explanation:
1. ^[A-Z]+[ \t]+[A-Z]+[ \t]+ Starting at the start of the line, two blocks of A-Z for the name (however, names are often more complicated...)
2. \d+.*[ \t]+[A-Z]{2}[ \t]+ Using the number at the start and the two-letter state code at the end to delimit the full address. Cities can contain spaces, such as 'Miami Beach'.
3. \d{5}(?:-\d{4})?$ Zip code with optional -NNNN, followed by the end anchor
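Since the question is tagged Python, here is a minimal validation sketch. It assumes the ZIP+4 suffix is meant to be optional, so the last group carries a ? quantifier; the function name looks_like_name_and_address is hypothetical.

```python
import re

# Name (two A-Z words), street number, anything, state code, ZIP with optional +4.
US_LINE_RE = re.compile(
    r"^[A-Z]+[ \t]+[A-Z]+[ \t]+\d+.*[ \t]+[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?$"
)


def looks_like_name_and_address(line):
    """Return True when the line has the expected name + US address shape."""
    return US_LINE_RE.match(line) is not None


samples = [
    "WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474",
    "LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046",
]
for line in samples:
    print(looks_like_name_and_address(line), line)
```

This checks only the overall shape; it does not verify that the state code or ZIP actually exist.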

Regex for all characters upto not including \n

Here I have a text string.
Serial#......... 12345678910123456\nCust#........... 654321\nCustomer Name... Some Customer\nBILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd\nDATE...... 01/01/00
I want to capture 2 parts of this string.
Cust#........... 654321 BILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd
using regex.
So far I have Cust#.*?\d+ which captures
Cust#........... 654321
However, I don't think this is the best approach.
Note: this is one string out of thousands, so the data within each string varies. Can I capture what lies between the \n end-of-line characters to achieve my result?
Try this regex: ^.*?\n(.*?)\n.*?\n(.*?)\n.*$ at least it should give you a different way of looking at the problem.
It describes the entire string, using newlines as element delimiters. The parentheses define the groups you want to save, which here are the 2nd and 4th lines.
Of course this depends on the elements you want always being the 2nd and 4th and being delimited by the newlines.
https://regex101.com/r/harmzn/1
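A quick sketch of this positional approach in Python, assuming the \n sequences in the sample are real newline characters rather than a literal backslash followed by n:

```python
import re

TEXT = (
    "Serial#......... 12345678910123456\n"
    "Cust#........... 654321\n"
    "Customer Name... Some Customer\n"
    "BILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd\n"
    "DATE...... 01/01/00"
)

# "." never crosses a newline here, so each .*? covers exactly one line.
m = re.match(r"^.*?\n(.*?)\n.*?\n(.*?)\n.*$", TEXT)
second_line, fourth_line = m.groups()
print(second_line)
print(fourth_line)
```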
You might use 2 capturing groups. In the first group, use your pattern without the lazy quantifier, as the digits are at the end of the line.
Then match (but do not capture) all the lines that do not start with BILL.
After that, capture in group 2 the whole line that starts with BILL.
^(Cust#.*\d+)(?:\r?\n(?!BILL ).*)*\r?\n(BILL .*)
Explanation
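^ Start of string (or start of a line when the multiline flag is set, which is needed here because Cust# is not on the first line)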
( Capture group 1
Cust#.*\d+ The pattern to match Cust# with the digits at the end
) Close group
(?:\r?\n(?!BILL ).*)*\r?\n Match all lines that do not start with BILL
( Capture group 2
BILL .* Match the line that starts with BILL
) Close group
Regex demo
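A sketch of this pattern in Python, assuming the \n sequences in the sample are real newlines. The multiline flag is set so that ^ can match at the start of the Cust# line.

```python
import re

TEXT = (
    "Serial#......... 12345678910123456\n"
    "Cust#........... 654321\n"
    "Customer Name... Some Customer\n"
    "BILL TO NO NAME. Bill To: 123456 - Some Company Pty Ltd\n"
    "DATE...... 01/01/00"
)

# Group 1: the Cust# line; then skip lines not starting with "BILL ";
# Group 2: the line that does start with "BILL ".
CUST_BILL_RE = re.compile(
    r"^(Cust#.*\d+)(?:\r?\n(?!BILL ).*)*\r?\n(BILL .*)", re.MULTILINE
)

m = CUST_BILL_RE.search(TEXT)
cust_line, bill_line = m.groups()
print(cust_line, bill_line)
```

Unlike a purely positional pattern, this one keeps working if extra lines appear between the Cust# line and the BILL line.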

Regex to match phone and fax numbers for WebHarvy

Sample text
5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions
Tried the following but it grabbed the first line of the address as well
(.*)(?=(\n.*){2}$)
Also tried
P\s(\(\d{3})\)\s\d+-\d+
but it doesn't work in WebHarvy even though it works on RegexStorm
Looking for an expression to match the phone and fax numbers from it. I would be using the expression in WebHarvy
https://www.webharvy.com/articles/regex.html
Thanks
Your second pattern is almost what you need. With P\s(\(\d{3})\)\s\d+-\d+, you captured only the (\(\d{3}) part into Group 1, while you need to capture the whole number.
I also suggest restricting the context: either match P as a whole word, or as the first word on a line:
\bP\s*(\(\d{3}\)\s*\d+-\d+)
or
(?m)^\s*P\s*(\(\d{3}\)\s*\d+-\d+)
See the regex demo, and here is what you need to pay attention to there:
The \b part matches a word boundary, while (?m)^\s* matches the start of a line ((?m) makes ^ match at the start of each line) followed by 0+ whitespaces. You may restrict \s* to horizontal whitespace only by replacing it with [\p{Zs}\t]*.
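To illustrate the line-anchored variant outside WebHarvy, here is a sketch using Python's re module. Swapping the leading P for F to get the fax number is an assumption based on the sample text, and number_for is a hypothetical helper name.

```python
import re

SAMPLE = """5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions"""


def number_for(label, text):
    """Return the number on the line that starts with `label`, or None."""
    pattern = r"(?m)^\s*" + re.escape(label) + r"\s*(\(\d{3}\)\s*\d+-\d+)"
    m = re.search(pattern, text)
    return m.group(1) if m else None


print(number_for("P", SAMPLE))  # phone
print(number_for("F", SAMPLE))  # fax
```

The whole number sits inside one capturing group, which is the part a tool that returns Group 1 would report.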

Regex - find string by excluding part of it

I have text: "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
Is it possible to find Lastname and phone?
Exclude address?
Address regex is this: "([A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}" (two words from capital, housenumber, comma, postal code and city)
I want result string to be "Walker +123456789012"
This should do what you need, and also doesn't assume three names (works without a middle name present), so it's a little more flexible in case you run into entries for people who don't have a middle name:
.*?(\w+)\s*(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}\s*(\+\d+)
.*?(\w+)\s* - Capture the last word before the whitespace before the address. .*? will lazily match anything up to the word preceding the address, but not capture. \s* will match the whitespace between the word and the address.
(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,} - your address regex but using a non-capturing group (?:)
\s*(\+\d+) - Captures the + and following numbers. \s* will match the whitespace between the address and the +.
I reused your address regex, but made the capture group non-capturing. Then we match the last word before the address (the last name) using (\w+), and the + and following numbers after the address using (\+\d+).
Here it is in action: https://regex101.com/r/YGiaJT/1
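As a sketch of how the two groups could be joined into the requested result, here is a Python version. The helper name surname_and_phone is hypothetical and the address sub-pattern is copied from the question.

```python
import re

ADDRESS = (
    r"(?:[A-Z]{1,}[a-z]{1,}\s){2}[0-9]{0,4}\,\s{1,}"
    r"[0-9]{4}\s[A-Z]{2}\s{1,}[a-zA-Z]{1,}"
)
PATTERN = re.compile(r".*?(\w+)\s*" + ADDRESS + r"\s*(\+\d+)")


def surname_and_phone(text):
    """Return 'Surname +phone', or None when the text does not fit."""
    m = PATTERN.match(text)
    return " ".join(m.groups()) if m else None


print(surname_and_phone(
    "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
))
```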
You could do....
\w+\s+\w+\s+(\w+).*(\+\d+)
And your capture groups should match up pretty well with what you're trying to match...
Essentially this will "disregard" your first and second "words" (first / middle name) and then disregard EVERYTHING from in between until it finds a + then captures the digits after it.
Live example: https://regex101.com/r/MjJCSv/1
In theory if your last name and your address will always be separated by more than 1 space you can shorten this a little bit and write it as
(\w+)\s{2,}.*(\+\d+)
Live example of this functionality: https://regex101.com/r/vGGB4z/1
Example implementation of the latter in Java: http://ideone.com/RExAEO
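One caveat with the word-counting pattern is that it relies on exactly two names before the surname. A short Python sketch (the second input, without a middle name, is made up for illustration) shows the difference:

```python
import re

PATTERN = re.compile(r"\w+\s+\w+\s+(\w+).*(\+\d+)")

with_middle = PATTERN.search(
    "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
).groups()

# Hypothetical input with no middle name: the third word is now part of the address.
without_middle = PATTERN.search(
    "Johnny Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"
).groups()

print(with_middle)
print(without_middle)
```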
You can use the following to capture just the surname and the phone number.
The first part ((\w+\s){3}) will capture the 3rd occurrence of a word followed by a space.
The second part (.+?) will lazily match, without capturing, everything in between
The third part ((\+?\d+)$) will capture an optional + (phone number prefix) and the rest of the phone number, up to the end of the string.
(\w+\s){3}.+?(\+?\d+)$
\1 - The surname (including the whitespace that follows it)
\2 - The phone number
https://regex101.com/r/gqu0tt/4
But, IF the surname and the address is separated with more than 1 space, then you can use
(\w+)\s{2,}.+?(\+?\d+)$
\1 - The surname
\2 - The phone number
https://regex101.com/r/gqu0tt/5
I've tested these expressions on the Java engine, and they give back the correct match
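For reference, a sketch of the first expression in Python's re module rather than Java. Because the repeated group ends in \s, the captured surname carries trailing whitespace, which is stripped here.

```python
import re

TEXT = "Johnny Alan Walker Sint Jansstraat 7, 1012 HG Amsterdam +123456789012"

m = re.search(r"(\w+\s){3}.+?(\+?\d+)$", TEXT)

raw_surname = m.group(1)       # last iteration of the repeated group
surname = raw_surname.strip()  # drop the whitespace matched by \s
phone = m.group(2)
print(repr(raw_surname), surname, phone)
```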