Python Regex some name + US Address


I have strings like these:
WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474
LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046
I want to make sure that the strings I receive have the same format as the ones above.
Here's my regular expression:
[A-Z]+ [A-Z]+ [0-9]{3,4} [A-Z]+ [A-Z]{2,4} [A-Z]{2,4} [0-9]+ [A-Z]+ [A-Z]{2} [0-9]{5}-[0-9]{4}$
But my regular expression only matches the first example and does not match the second one.

Here's dawg's regex with capturing groups:
^([A-Z]+[ \t]+[A-Z]+)[ \t]+(\d+)[ \t](.*)[ \t]+([A-Z]{2})[ \t]+(\d{5}(?:-\d{4}))$
Here's the url.
UPDATE
Sorry, I forgot to remove the non-capturing group at the end of dawg's regex...
Here's the new regex without the non-capturing group: regex101
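For anyone who wants to try the capturing-group version in Python, here is a minimal sketch. The pattern is copied unchanged from the answer above; the helper name split_record is hypothetical. Note that group 3 lumps the street and the city together, because the pattern has no way to tell where the city starts.

```python
import re

# dawg's pattern with capturing groups, copied as-is from the answer above.
ADDRESS_RE = re.compile(
    r"^([A-Z]+[ \t]+[A-Z]+)[ \t]+(\d+)[ \t](.*)[ \t]+([A-Z]{2})[ \t]+(\d{5}(?:-\d{4}))$"
)


def split_record(line):
    """Return (name, house_number, street_and_city, state, zip_code) or None.

    Hypothetical helper, only for illustration.
    """
    match = ADDRESS_RE.match(line)
    return match.groups() if match else None


print(split_record("LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046"))
# ('LINDSAY SCARPITTA', '655', 'W GRACE ST APT 418 CHICAGO', 'IL', '60613-4046')
```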

Try this:
^[A-Z]+[ \t]+[A-Z]+[ \t]+\d+.*[ \t]+[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?$
Demo
Explanation:
1. ^[A-Z]+[ \t]+[A-Z]+[ \t]+
   Starting at the start of the line, two blocks of A-Z for the name (however, names are often more complicated...).
2. \d+.*[ \t]+[A-Z]{2}[ \t]+
   The full address, from the leading number to the two-letter state code at the end. Cities can have spaces, such as 'Miami Beach', so the middle part is matched with .*
3. \d{5}(?:-\d{4})?$
   Zip code with an optional -NNNN suffix (the ? after the group is what makes it optional), followed by the end anchor.
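A quick way to check the pattern from Python is sketched below. It assumes the -NNNN part is meant to be optional, as the explanation says, so the last group is written as (?:-\d{4})?. The function name is_valid_record is hypothetical.

```python
import re

# Assumption: the ZIP+4 suffix is optional, so the group is followed by "?".
RECORD_RE = re.compile(
    r"^[A-Z]+[ \t]+[A-Z]+[ \t]+\d+.*[ \t]+[A-Z]{2}[ \t]+\d{5}(?:-\d{4})?$"
)


def is_valid_record(line):
    """Return True if the line looks like 'FIRST LAST <number> ... ST ZIP'."""
    return RECORD_RE.match(line) is not None


samples = [
    "WILLIAM SMITH 2345 GLENDALE DR RM 245 ATLANTA GA 30328-3474",
    "LINDSAY SCARPITTA 655 W GRACE ST APT 418 CHICAGO IL 60613-4046",
]
for sample in samples:
    print(is_valid_record(sample), sample)
```

Both sample strings from the question pass. A record without a house number, or with a zip code shorter than five digits, is rejected.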

Related

regex to split string into parts

I have a string that has the following value,
ID Number / 1234
Name: John Doe Smith
Nationality: US
The string will always come with Name: prepended.
My regex to get the full name, (?<=Name:\s)(.*), works fine and returns the whole name. This one, (?<=Name:\s)([a-zA-Z]+), seems to get the first name.
Ideally I would like one expression each for the first, middle and last name. Could someone guide me in the right direction?
Thank you

You can capture those into 3 different groups:
(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)
>>> import re
>>> re.search(r'(?<=Name:\s)([a-zA-Z]+)\s+([a-zA-Z]+)\s+([a-zA-Z]+)', 'Name: John Doe Smith').groups()
('John', 'Doe', 'Smith')
Or, once you have the full name, you can split the result to get the names in a list:
>>> re.split(r'\s+', 'John Doe Smith')
['John', 'Doe', 'Smith']
For some reason I assumed Python, but the above can be applied to almost any programming language.

Since you stated in the comments that you use .NET, you can put a quantifier inside the lookbehind to select which "word" after Name: you want to match.
For example, to get the 3rd part of the name, you can use {2} as the quantifier.
To match non whitespace chars instead of word characters only, you can use \S+ instead of \w+
(?<=\bName:(?:\s+\w+){2}\s+)\w+
(?<= Positive lookbehind, assert that from the current position directly to the left is:
\bName: A word boundary to prevent a partial match, match Name:
(?:\s+\w+){2} Repeat 2 times as a whole, matching 1+ whitespace chars and 1+ word chars. (To get the second name, use {1} or omit the quantifier, to get the first name use {0})
\s+ Match 1+ whitespace chars
) Close lookbehind
\w+ Match 1+ word characters
.NET regex demo
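The lookbehind above relies on .NET allowing variable-length lookbehinds. Python's built-in re module only accepts fixed-width lookbehinds, so the same idea has to be expressed with a capturing group instead. A minimal sketch (the function name nth_name_part is hypothetical):

```python
import re


def nth_name_part(text, index):
    """Return the word at 0-based position `index` after 'Name:', or None.

    Uses a capturing group instead of the .NET lookbehind, because
    Python's re module requires fixed-width lookbehind patterns.
    """
    pattern = r"\bName:(?:\s+\w+){%d}\s+(\w+)" % index
    match = re.search(pattern, text)
    return match.group(1) if match else None


text = "ID Number / 1234\nName: John Doe Smith\nNationality: US"
print([nth_name_part(text, i) for i in range(3)])
# ['John', 'Doe', 'Smith']
```

One caveat: \s also matches line breaks, so asking for a part that does not exist on the Name: line can run on into the next line. Replacing \s with [^\S\r\n] keeps the match on a single line.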

Regex to match phone and fax numbers for WebHarvy

Sample text
5950 S Willow Dr Ste 304
Greenwood Village, CO 80111
P (123) 456-7890
F (123) 456-7890
Get Directions
Tried the following but it grabbed the first line of the address as well
(.*)(?=(\n.*){2}$)
Also tried
P\s(\(\d{3})\)\s\d+-\d+
but it doesn't work in WebHarvy even though it works on RegexStorm
Looking for an expression to match the phone and fax numbers from it. I would be using the expression in WebHarvy
https://www.webharvy.com/articles/regex.html
Thanks

Your second pattern is almost what you need. With P\s(\(\d{3})\)\s\d+-\d+, you captured only the (\(\d{3}) part into Group 1, whereas you need to capture the whole number.
I also suggest restricting the context: either match P as a whole word, or as the first word on a line:
\bP\s*(\(\d{3}\)\s*\d+-\d+)
or
(?m)^\s*P\s*(\(\d{3}\)\s*\d+-\d+)
See the regex demo, and here is what you need to pay attention to there:
In the first pattern, \b matches a word boundary. In the second one, (?m) makes ^ match at the start of each line, and \s* then matches 0+ whitespace characters. To match only horizontal whitespace, replace \s* with [\p{Zs}\t]*.
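The line-anchored variant can be tried outside WebHarvy with a short Python sketch. Two things here are assumptions rather than part of the answer: Python's re module has no \p{Zs}, so [ \t] stands in for horizontal whitespace, and the label parameter (P or F) is added so the same pattern also picks up the fax line. The function name find_number is hypothetical.

```python
import re

SAMPLE = (
    "5950 S Willow Dr Ste 304\n"
    "Greenwood Village, CO 80111\n"
    "P (123) 456-7890\n"
    "F (123) 456-7890\n"
    "Get Directions"
)


def find_number(text, label):
    """Return the number on the line that starts with `label` ('P' or 'F')."""
    # [ \t]* is used instead of [\p{Zs}\t]*, which Python's re does not support.
    pattern = r"^[ \t]*%s[ \t]*(\(\d{3}\)[ \t]*\d+-\d+)" % re.escape(label)
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1) if match else None


print(find_number(SAMPLE, "P"))  # (123) 456-7890
print(find_number(SAMPLE, "F"))  # (123) 456-7890
```

Because the whole number sits inside Group 1, a tool that returns the first capturing group gets the full number and not just the area code.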

Regex to extract city names (.NET)

Looking for an expression to extract City Names from addresses. Trying to use this expression in WebHarvy which uses the .NET flavor of regex
Example address
1234 Savoy Dr Ste 123
New Houston, TX 77036-3320
or
1234 Savoy Dr Ste 510
Texas, TX 77036-3320
So the city name could be single or two words.
The expression I am trying is
(\w|\w\s\w)+(?=,\s\w{2})
When I am trying this on RegexStorm it seems to be working fine, but when I am using this in WebHarvy, it only captures the 'n' from the city name New Houston and 'n' from Austin
Where am I going wrong?

In WebHarvy, if a regex contains a capturing group, its contents are returned. Thus, you do not need a lookahead.
Another point is that you need to match 1 or more word chars, optionally followed by a chunk of whitespace and 1 or more word chars. Your regex contains a repeated capturing group whose contents are overwritten on each iteration, so once a match is found, Group 1 only contains the last character, n.
Use
(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}
See the regex demo here
The [^\S\r\n]+ part matches any whitespace except CR and LF. You may use [\p{Zs}\t]+ to match any 1+ horizontal whitespaces.
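The difference between the two patterns can be reproduced in Python, whose re module handles repeated capturing groups the same way here. This is only a sketch; the function name extract_city is hypothetical, and the pattern is written with balanced parentheses.

```python
import re

# Suggested pattern: the whole city name sits in Group 1.
CITY_RE = re.compile(r"(\w+(?:[^\S\r\n]+\w+)?),\s\w{2}")

# Pattern from the question: Group 1 is repeated, so only the last
# iteration (a single character) is kept.
REPEATED_GROUP_RE = re.compile(r"(\w|\w\s\w)+(?=,\s\w{2})")


def extract_city(address):
    """Return the one- or two-word city name before ', ST', or None."""
    match = CITY_RE.search(address)
    return match.group(1) if match else None


print(extract_city("1234 Savoy Dr Ste 123\nNew Houston, TX 77036-3320"))  # New Houston
print(extract_city("1234 Savoy Dr Ste 510\nTexas, TX 77036-3320"))        # Texas

old = REPEATED_GROUP_RE.search("New Houston, TX 77036-3320")
print(old.group(0), "|", old.group(1))  # New Houston | n
```

The pattern allows at most two words, so a three-word city would lose its first word.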

Regex- Ignore a constant string that matches a pattern

I have this regular expression:
\b[A-Z]{1}[A-Z]{0,7}[0-9]?\b|\b[0-9]{2,3}\b
The desired output is as highlighted:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
Observed output:
JOHN went to LONDON one fine day.
JOHN had lunch in a PUB.
JOHN then moved to CHICAGO.
I don't want JOHN to be highlighted.
John does not want this to match the pattern.
Neither this.
But THIS1 should match the pattern.
Also the other 70 times that the pattern should match.
The regex partly works, but I don't want two constant strings, JOHN and I, to match as part of this regex. Please help.

You can use a negative lookahead to exclude those matches. Also, your pattern is somewhat redundant; you can shorten it considerably by grouping and removing unnecessary subpatterns:
\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b
  ^^^^^^^^^^^^^^^^
See the regex demo
The (?!(?:JOHN|I)\b) is the negative lookahead that fails the match if the word matched is equal to I or JOHN.
Note that {1} can always be omitted as any unquantified pattern is matched once. [A-Z]{1}[A-Z]{0,7} is actually equal to [A-Z]{1,8}.
Pattern details:
\b - word boundary
(?!(?:JOHN|I)\b) - the word matched cannot be equal to JOHN or I
(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3}) - one of the two alternatives:
[A-Z]{1,8}[0-9]? - 1 to 8 uppercase ASCII letters followed with an optional (1 or 0) digit
| - or
[0-9]{2,3} - 2 to 3 digits
\b - trailing word boundary
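To see the lookahead at work, the pattern can be run over the sample sentences from the question. A minimal Python sketch (the variable names are hypothetical):

```python
import re

# Pattern from the answer: skip JOHN and I, keep other uppercase tokens
# (1 to 8 letters plus an optional digit) and 2 to 3 digit numbers.
TOKEN_RE = re.compile(r"\b(?!(?:JOHN|I)\b)(?:[A-Z]{1,8}[0-9]?|[0-9]{2,3})\b")

lines = [
    "JOHN went to LONDON one fine day.",
    "JOHN had lunch in a PUB.",
    "I don't want JOHN to be highlighted.",
    "But THIS1 should match the pattern.",
    "Also the other 70 times that the pattern should match.",
]
for line in lines:
    print(TOKEN_RE.findall(line))
# ['LONDON']
# ['PUB']
# []
# ['THIS1']
# ['70']
```

Since the pattern contains only non-capturing groups, findall returns the full matches.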

Regex for city with Apostrophe

Here is my JavaScript regex for a city name. It handles almost all cases except these.
^[a-zA-Z]+[\. - ']?(?:[\s-][a-zA-Z]+)*$
(Should pass)
Coeur d'Alene
San Tan Valley
St. Thomas
St. Thomas-Vincent
St. Thomas Vincent
St Thomas-Vincent
St-Thomas
anaconda-deer lodge county
(Should Fail)
San. Tan. Valley
St.. Thomas
St.. Thomas--Vincent
St.- Thomas -Vincent
St--Thomas

This matches all your names from the first list and not those from the second:
/^[a-zA-Z]+(?:\.(?!-))?(?:[\s-](?:[a-z]+')?[a-zA-Z]+)*$/
Multiline explanation:
^[a-zA-Z]+ # begins with a word
(?:\.(?!-))? # maybe a dot but not followed by a dash
(?:
[\s-] # whitespace or dash
(?:[a-z]+\')? # maybe a lowercase-word and an apostrophe
[a-zA-Z]+ # word
)*$ # repeated to the end
To allow the dots anywhere, but not two of them, use this:
/^(?!.*?\..*?\.)[a-zA-Z]+(?:(?:\.\s?|\s|-)(?:[a-z]+')?[a-zA-Z]+)*$/
^(?!.*?\..*?\.) # does not contain two dots
[a-zA-Z]+ # a word
(?:
(?:\.\s?|\s|-) # delimiter: dot with maybe whitespace, whitespace or dash
(?:[a-z]+\')? # maybe a lowercase-word and an apostrophe
[a-zA-Z]+ # word
)*$ # repeated to the end
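Both patterns use only syntax that Python's re module also supports, so they can be checked against the two lists from the question with a short script. This is a sketch for testing, not part of the original answer; the names STRICT_RE and ONE_DOT_RE are hypothetical.

```python
import re

# First pattern: a dot is allowed only after the first word.
STRICT_RE = re.compile(r"^[a-zA-Z]+(?:\.(?!-))?(?:[\s-](?:[a-z]+')?[a-zA-Z]+)*$")

# Second pattern: a dot is allowed anywhere, but at most once.
ONE_DOT_RE = re.compile(
    r"^(?!.*?\..*?\.)[a-zA-Z]+(?:(?:\.\s?|\s|-)(?:[a-z]+')?[a-zA-Z]+)*$"
)

SHOULD_PASS = [
    "Coeur d'Alene", "San Tan Valley", "St. Thomas", "St. Thomas-Vincent",
    "St. Thomas Vincent", "St Thomas-Vincent", "St-Thomas",
    "anaconda-deer lodge county",
]
SHOULD_FAIL = [
    "San. Tan. Valley", "St.. Thomas", "St.. Thomas--Vincent",
    "St.- Thomas -Vincent", "St--Thomas",
]

for name in SHOULD_PASS + SHOULD_FAIL:
    print(bool(STRICT_RE.match(name)), bool(ONE_DOT_RE.match(name)), name)
```

The two patterns agree on every name in the question's lists. They differ on names with a dot after a later word: "Monte St.Thomas" is accepted only by the second pattern.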

Try this regex:
^(?:[a-zA-Z]+(?:[.'\-,])?\s?)+$
This does match:
Coeur d'Alene
San Tan Valley
St. Thomas
St. Thomas-Vincent
St. Thomas Vincent
St Thomas-Vincent
St-Thomas
anaconda-deer lodge county
Monte St.Thomas
San. Tan. Valley
Washington, D.C.
But doesn't match:
St.. Thomas
St.. Thomas--Vincent
St.- Thomas -Vincent
St--Thomas
(I allowed it to match San. Tan. Valley, since there's probably a city name out there with 2 periods.)
How the regex works:
# ^ - Match the line start.
# (?: - Start a non-capturing group
# [a-zA-Z]+ - That starts with 1 or more letters.
# [.'\-,]? - Followed by one period, apostrophe, dash, or comma. (optional)
# \s? - Followed by a space (optional)
# )+ - End of the group, match the group one or more times.
# $ - Match the end of the line
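The match lists above can be verified with a few lines of Python, since the pattern uses nothing specific to JavaScript. A minimal sketch (the name LENIENT_RE is hypothetical):

```python
import re

LENIENT_RE = re.compile(r"^(?:[a-zA-Z]+(?:[.'\-,])?\s?)+$")

MATCHES = [
    "Coeur d'Alene", "San Tan Valley", "St. Thomas", "St. Thomas-Vincent",
    "St. Thomas Vincent", "St Thomas-Vincent", "St-Thomas",
    "anaconda-deer lodge county", "Monte St.Thomas", "San. Tan. Valley",
    "Washington, D.C.",
]
NON_MATCHES = [
    "St.. Thomas", "St.. Thomas--Vincent", "St.- Thomas -Vincent", "St--Thomas",
]

for name in MATCHES + NON_MATCHES:
    print(bool(LENIENT_RE.match(name)), name)
```

Two things to keep in mind with this pattern. It also accepts trailing punctuation such as "Paris,". And because a repeated group contains another repeated token with only optional separators, it can backtrack heavily on long strings that fail near the end, so it may be worth limiting the input length first.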

I think the following regexp fits your requirements :
^([Ss]t\. |[a-zA-Z ]|['-](?:[^-']))+$
On the other hand, you may question the idea of using a regexp for this at all. However complex your regexp is, someone will always find a new unwanted pattern that matches.
Usually, when you need valid city names, it's better to use a geocoding API, such as the Google Geocoding API.
