How to use regex to extract an address with optional street2 - regex

I need to extract name, street1, street2, city, state, zip
I have data in this form
JOHN m SMITH [1111 WEST OAK ROAD, SUITE 101, CITY, ST 55555]
GEORGE m JONES [222 MAIN STREET, CITY, ST 55555]
My results for JOHN should be
name="JOHN m SMITH"
street1="1111 WEST OAK ROAD"
street2="SUITE 101"
city = "CITY"
state = "ST"
zip = "55555"
This works with GEORGE's data
Regex r = new Regex(#"^(?<name>.*)\[(?<street>.*)[,]\s(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$");
var match = r.Match(fullNameAndAddress);
name = match.Groups["name"].Value;
street = match.Groups["street"].Value;
city = match.Groups["city"].Value;
state = match.Groups["state"].Value;
zip = match.Groups["zip"].Value;
How do I add the optional street2?
I want 1 and only 1 "street" group. I thought it should have this: (....){1}?
street2 is optional zero or 1 times. I thought it should have this (...)?
but it doesn't work with JOHN's data, both street1 & street2 are going into the street group:
^(?<name>.*)\[((?<street>.*)[,]\s){1}?((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$

Could you clarify what you want stored in street?
Do you want John's to look like '1111 WEST OAK ROAD, SUITE 101'?
Or do you want to stuff it into some variable you wont be using, so that street looks like '1111 WEST OAK ROAD'?
Edit: With clarification, check out this link
http://rubular.com/r/S4HaTMVFZl
What happens here I believe is that the * is greedy, grabbing as much as it can before finding the final occurence of [,]\s
Adding a ? after the .* makes it lazy, grabbing the least information possible.
The amended regex looks like this
^(?<name>.*)\[((?<street>.*?)[,]\s)((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.{2})\s(?<zip>\d{5})\]$
You'll notice I changed the Regex for state from .* to .{2}, forcing a 2-character state. Feel free to revert that if you don't want it :)

I made a couple of changes to your regex in rubular.com, and it seemed to be working on both the example strings:
^(?<name>.+)\s\[(?<street>[^,]+),\s((?<street2>[^,]+),\s+)?(?<city>[^,]+),\s(?<state>.+)\s(?<zip>\d{5})\]$
street2 = match.Groups["street2"].Value;
One trick I've learned with regex's is to use the negation of the divider (eg. [^,]* for anything but a comma) instead of .*, so it's impossible to capture multiple fields with one expression. Also, the + operator, which requires at least one match, is useful in most of the groups.
Also, the additional comma is only there if there's an street2 component of the address, which indicates that the comma should be in the same capture group as the street2 part. I added an extra capture group around the street2 capture group to account for this. You can make groups non-capturing in most languages, but it didn't seem necessary.

Related

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after them. Unfortunately there are cases, where the country after the to-be-extracted street information is only separated by a single space from the country, like e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET and the Country, separated by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*) match all characters from start of line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.

How to match an optional group after a compulsory one that could include it

I have a string input I need to parse, that has 2 different possible formats. It may look like either of the following:
2900 Sétubal (Portugal)
2900 Sétubal
I need a regex that will adequately split the postal code, city, and country (if provided) of both solutions.
This is the regex I've come up with so far.
(?P<postal_code>\d*) (?P<city>.*)( \((?P<country>.*)\))?
The problem is that regexes being read from left to right, the city group matches the country part of the string if it is provided, and I end up with something like :
postal_code = 2900
city = Sétubal (Portugal)
The output is right when I make the country group compulsory:
(?P<postal_code>\d*) (?P<city>.*)( \((?P<country>.*)\))
postal_code = 2900
city = Sétubal
country = Portugal
However, this regex does NOT match the 2nd possible format:
2900 Sétubal
I have tried using lookarounds, but I haven't succeeded. Any piece of advice will most definitely be welcome.
The following regex extracts your data:
(\d+)\s+([^()]*)\s+(\(([^()]+)\))?
Test here.
Based on your regex:
(?P<postal_code>\d+) +(?P<city>[^()]+)(?> +|$)(\((?P<country>[^()]+)\))?
Test here.

Regex pattern in salesforce apex

I am new to regex.
I have a String formatted like below
Street Name
City, StateCode ZipNumber
for example, the string can be like
50 Connecticut Avenue
Norwalk, CT 06850
or
123 6th Avenue
New York, NY 10013
or
4TH Highway 6
Rule, TX 79547
I am trying to construct a regex here.
But cannot proceed as I have a little idea about regex.
Can you please help me?
The following might be enough :
^(?<Street>[^\n]+)\n(?<City>[^,]+), (?<StateCode>[A-Z]{2}) (?<Zip>\d+)$
It captures the following segments in different groups :
the first line in a group named Street
the part of the second line which precedes the comma in a group named City
the next two capital letters in a group named StateCode
the following digits in a group named Zip

Using regex and vba, extracting parts of data

I have an excel spreadsheet and within its contents it is formatted like -
Street Name, Street Number Street Direction(may not be present represented be an NSWE)
So it could look like John Doe Ave, 900 E or Jane Doe DR, 100
However, the people who used this spreadsheet put business names or other information that shouldn't be present
The part I'm stuck at is using regex patterns I'm not familiar with it and it confuses me
I have this variable
Dim strPattern As String: strPattern = "^(.+),\s(\d+)\s([NWSEnwse])"
So, I have this its working SLIGHTLY I wanted to know what changes I could make to this so it would include or exlude NWSEnwse, because right now it detects the address only when street direction is present
You may use this regex pattern to match it.
^(.+),\s+(\d+)(\s+[NWSEnwse])?
The ? at the end signifies that that part is optional.
I also replaced \s with \s+ to account for any extra spaces that might have crept in.

Regular expression for address field validation

I am trying to write a regular expression that facilitates an address, example 21-big walk way or 21 St.Elizabeth's drive I came up with the following regular expression but I am not too keen to how to incorporate all the characters (alphanumeric, space dash, full stop, apostrophe)
"regexp=^[A-Za-z-0-99999999'
See the answer to this question on address validating with regex:
regex street address match
The problem is, street addresses vary so much in formatting that it's hard to code against them. If you are trying to validate addresses, finding if one isn't valid based on its format is mighty hard to do.
This would return the following address (253 N. Cherry St. ), anything with its same format:
\d{1,5}\s\w.\s(\b\w*\b\s){1,2}\w*\.
This allows 1-5 digits for the house number, a space, a character followed by a period (for N. or S.), 1-2 words for the street name, finished with an abbreviation (like st. or rd.).
Because regex is used to see if things meet a standard or protocol (which you define), you probably wouldn't want to allow for the addresses provided above, especially the first one with the dash, since they aren't very standard. you can modify my above code to allow for them if you wish--you could add
(-?)
to allow for a dash but not require one.
In addition, http://rubular.com/ is a quick and interactive way to learn regex. Try it out with the addresses above.
In case if you don't have a fixed format for the address as mentioned above, I would use regex expression just to eliminate the symbols which are not used in the address (like specialized sybmols - &(%#$^). Result would be:
[A-Za-z0-9'\.\-\s\,]
Just to add to Serzas' answer(since don't have enough reps. to comment).
alphabets and numbers can effectively be replaced by \w for words.
Additionally apostrophe,comma,period and hyphen doesn't necessarily need a backslash.
My requirement also involved front and back slashes so \/ and finally whitespaces with \s. The working regex for me ,as such was :
pattern: "[\w',-\\/.\s]"
Regular expression for simple address validation
^[#.0-9a-zA-Z\s,-]+$
E.g. for Address match case
#1, North Street, Chennai - 11
E.g. for Address not match case
$1, North Street, Chennai # 11
I have succesfully used ;
Dim regexString = New stringbuilder
With regexString
.Append("(?<h>^[\d]+[ ])(?<s>.+$)|") 'find the 2013 1st ambonstreet
.Append("(?<s>^.*?)(?<h>[ ][\d]+[ ])(?<e>[\D]+$)|") 'find the 1-7-4 Dual Ampstreet 130 A
.Append("(?<s>^[\D]+[ ])(?<h>[\d]+)(?<e>.*?$)|") 'find the Terheydenlaan 320 B3
.Append("(?<s>^.*?)(?<h>\d*?$)") 'find the 245e oosterkade 9
End With
Dim Address As Match = Regex.Match(DataRow("customerAddressLine1"), regexString.ToString(), RegexOptions.Multiline)
If Not String.IsNullOrEmpty(Address.Groups("s").Value) Then StreetName = Address.Groups("s").Value
If Not String.IsNullOrEmpty(Address.Groups("h").Value) Then HouseNumber = Address.Groups("h").Value
If Not String.IsNullOrEmpty(Address.Groups("e").Value) Then Extension = Address.Groups("e").Value
The regex will attempt to find a result, if there is none, it move to the next alternative. If no result is found, none of the 4 formats where present.
This one worked for me:
\d+[ ](?:[A-Za-z0-9.-]+[ ]?)+(?:Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\.?
The source: https://www.codeproject.com/Tips/989012/Validate-and-Find-Addresses-with-RegEx
Regex is a very bad choice for this kind of task. Try to find a web service or an address database or a product which can clean address data instead.
Related:
Address validation using Google Maps API
As a simple one line expression recommend this,
^([a-zA-z0-9/\\''(),-\s]{2,255})$
I needed
STREET # | STREET | CITY | STATE | ZIP
So I wrote the following regex
[0-9]{1,5}( [a-zA-Z.]*){1,4},?( [a-zA-Z]*){1,3},? [a-zA-Z]{2},? [0-9]{5}
This allows
1-5 Street #s
1-4 Street description words
1-3 City words
2 Char State
5 Char Zip code
I also added option , for separating street, city, state, zip
Here is the approach I have taken to finding addresses using regular expressions:
A set of patterns is useful to find many forms that we might expect from an address starting with simply a number followed by set of strings (ex. 1 Basic Road) and then getting more specific such as looking for "P.O. Box", "c/o", "attn:", etc.
Below is a simple test in python. The test will find all the addresses but not the last 4 items which are company names. This example is not comprehensive, but can be altered to suit your needs and catch examples you find in your data.
import re
strings = [
'701 FIFTH AVE',
'2157 Henderson Highway',
'Attn: Patent Docketing',
'HOLLYWOOD, FL 33022-2480',
'1940 DUKE STREET',
'111 MONUMENT CIRCLE, SUITE 3700',
'c/o Armstrong Teasdale LLP',
'1 Almaden Boulevard',
'999 Peachtree Street NE',
'P.O. BOX 2903',
'2040 MAIN STREET',
'300 North Meridian Street',
'465 Columbus Avenue',
'1441 SEAMIST DR.',
'2000 PENNSYLVANIA AVENUE, N.W.',
'465 Columbus Avenue',
'28 STATE STREET',
'P.O, Drawer 800889.',
'2200 CLARENDON BLVD.',
'840 NORTH PLANKINTON AVENUE',
'1025 Connecticut Avenue, NW',
'340 Commercial Street',
'799 Ninth Street, NW',
'11318 Lazarro Ln',
'P.O, Box 65745',
'c/o Ballard Spahr LLP',
'8210 SOUTHPARK TERRACE',
'1130 Connecticut Ave., NW, Suite 420',
'465 Columbus Avenue',
"BANNER & WITCOFF , LTD",
"CHIP LAW GROUP",
"HAMMER & ASSOCIATES, P.C.",
"MH2 TECHNOLOGY LAW GROUP, LLP",
]
patterns = [
"c\/o [\w ]{2,}",
"C\/O [\w ]{2,}",
"P.O\. [\w ]{2,}",
"P.O\, [\w ]{2,}",
"[\w\.]{2,5} BOX [\d]{2,8}",
"^[#\d]{1,7} [\w ]{2,}",
"[A-Z]{2,2} [\d]{5,5}",
"Attn: [\w]{2,}",
"ATTN: [\w]{2,}",
"Attention: [\w]{2,}",
"ATTENTION: [\w]{2,}"
]
contact_list = []
total_count = len(strings)
found_count = 0
for string in strings:
pat_no = 1
for pattern in patterns:
match = re.search(pattern, string.strip())
if match:
print("Item found: " + match.group(0) + " | Pattern no: " + str(pat_no))
found_count += 1
pat_no += 1
print("-- Total: " + str(total_count) + " Found: " + str(found_count))
UiPath Academy training video lists this RegEx for US addresses (and it works fine for me):
\b\d{1,8}(-)?[a-z]?\W[a-z|\W|\.]{1,}\W(road|drive|avenue|boulevard|circle|street|lane|waylrd\.|st\.|dr\.|ave\.|blvd\.|cir\.|In\.|rd|dr|ave|blvd|cir|ln)
I had a different use case - find any addresses in logs and scold application developers (favourite part of a devops job). I had the advantage of having the word "address" in the pattern but should work without that if you have specific field to scan
\baddress.[0-9\\\/# ,a-zA-Z]+[ ,]+[0-9\\\/#, a-zA-Z]{1,}
Look for the word "address" - skip this if not applicable
Look for first part numbers, letters, #, space - Unit Number / street number/suite number/door number
Separated by a space or comma
Look for one or more of rest of address numbers, letters, #, space
Tested against :
1 Sleepy Boulevard PO, Box 65745
Suite #100 /98,North St,Snoozepura
Ave., New Jersey,
Suite 420 1130 Connect Ave., NW,
Suite 420 19 / 21 Old Avenue,
Suite 12, Springfield, VIC 3001
Suite#100/98 North St Snoozepura
This worked for me when there were street addresses with unit/suite numbers, zip codes, only street. It also didn't match IP addresses or mac addresses. Worked with extra spaces.
This assumes users are normal people separate elements of a street address with a comma, hash sign, or space and not psychopaths who use characters like "|" or ":"!
For French address and some international address too, I use it.
[\\D+ || \\d]+\\d+[ ||,||[A-Za-z0-9.-]]+(?:[Rue|Avenue|Lane|... etcd|Ln|St]+[ ]?)+(?:[A-Za-z0-9.-](.*)]?)
I was inspired from the responses given here and came with those 2 solutions
support optional uppercase
support french also
regex structure
numbers (required)
letters, chars and spaces
at least one common address keyword (required)
as many chars you want before the line break
definitions:
accuracy
capacity of detecting addresses and not something that looks like an address which is not.
range
capacity to detect uncommon addresses.
Regex 1:
high accuracy
low range
/[0-9]+[ |[a-zà-ú.,-]* ((highway)|(autoroute)|(north)|(nord)|(south)|(sud)|(east)|(est)|(west)|(ouest)|(avenue)|(lane)|(voie)|(ruelle)|(road)|(rue)|(route)|(drive)|(boulevard)|(circle)|(cercle)|(street)|(cer\.)|(cir\.)|(blvd\.)|(hway\.)|(st\.)|(aut\.)|(ave\.)|(ln\.)|(rd\.)|(hw\.)|(dr\.)|(a\.))([ .,-]*[a-zà-ú0-9]*)*/i
regex 2:
low accuracy
high range
/[0-9]*[ |[a-zà-ú.,-]* ((highway)|(autoroute)|(north)|(nord)|(south)|(sud)|(east)|(est)|(west)|(ouest)|(avenue)|(lane)|(voie)|(ruelle)|(road)|(rue)|(route)|(drive)|(boulevard)|(circle)|(cercle)|(street)|(cer\.?)|(cir\.?)|(blvd\.?)|(hway\.?)|(st\.?)|(aut\.?)|(ave\.?)|(ln\.?)|(rd\.?)|(hw\.?)|(dr\.?)|(a\.))([ .,-]*[a-zà-ú0-9]*)*/i
This one works well for me
^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?<= )APT)? ?((?<= )\d*)?$
Source : https://community.alteryx.com/t5/Alteryx-Designer-Discussions/RegEx-Addresses-different-formats-and-headaches/td-p/360147
Here is my RegEx for address, city & postal validation rules
validation rules:
address -
1 - 40 characters length.
Letters, numbers, space and . , : ' #
city -
1 - 19 characters length
Only Alpha characters are allowed
Spaces are allowed
postalCode -
The USA zip must meet the following criteria and is required:
Minimum of 5 digits (9 digits if zip + 4 is provided)
Numeric only
A Canadian postal code is a six-character string.
in the format A1A 1A1, where A is a letter and 1 is a digit.
a space separates the third and fourth characters.
do not include the letters D, F, I, O, Q or U.
the first position does not make use of the letters W or Z.
address: ^[a-zA-Z0-9 .,#;:'-]{1,40}$
city: ^[a-zA-Z ]{1,19}$
usaPostal: ^([0-9]{5})(?:[-]?([0-9]{4}))?$
canadaPostal : ^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$
\b(\d{1,8}[a-z]?[0-9\/#- ,a-zA-Z]+[ ,]+[.0-9\/#, a-zA-Z]{1,})\n
A more dynamic approach to #micah would be the following:
(?'Address'(?'Street'[0-9][a-zA-Z\s]),?\s*(?'City'[A-Za-z\s]),?\s(?'Country'[A-Za-z])\s(?'Zipcode'[0-9]-?[0-9]))
It won't care about individual lengths of segments of code.
https://regex101.com/r/nuy7hB/1