Regex: Match only street name within address


I have a list of addresses and I would like to have a regular expression that is able to capture just the name of the street without the street type, address number, or cardinal direction. There are some errors in formatting but all characters are in capital letters. So,
2038 W MAIN AVE
2038QWEW S JEFFERSON AVENUE
33 NORTH CALIFORNIA STREET
53371 SOUTH WASHINGTON
53371 S WASHINGTON AVENUE
1600 E PENNSYLVANIA AVE
WEST9 67ST ST
E171 N 23RD STREET
G171 N121ST STREET
ought to return
MAIN
JEFFERSON
CALIFORNIA
WASHINGTON
WASHINGTON
PENNSYLVANIA
67ST
23RD
121ST
So far I've got
([^ W ]|[^ E ]|[^ S ]|[^ N ])([0-9])*([A-Z]+)[^ ]
But I can't seem to only capture the first match that occurs after the street number. I feel like I need the standard greedy operators (i.e. ?, *, or +) but I can't figure out how to incorporate them.
These two links have taken me close:
Matching on every second occurence
Simple regex for street address

To get the output you want from the given addresses, this regex should help: [\pL\pN]+(?=\h+[\pL\pN]+$)
It matches the second-to-last word on each line, where a word is one or more letters or digits in any language.
For reference, see https://superuser.com/questions/1361759/matching-second-last-word-in-sentence-through-regular-expression
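Python's built-in re module does not accept \pL, \pN or \h, so the following is only a rough sketch of the same idea. The substitutions ([^\W_] for "letter or digit", [ \t] for horizontal whitespace) and the helper name second_last_word are assumptions, not part of the answer above.

```python
import re

# Approximation of [\pL\pN]+(?=\h+[\pL\pN]+$) for Python's built-in re:
# [^\W_] stands in for "letter or digit", [ \t] for horizontal whitespace.
SECOND_LAST_WORD = re.compile(r"[^\W_]+(?=[ \t]+[^\W_]+$)")


def second_last_word(line):
    """Return the second-to-last word of a line, or None if there is none."""
    match = SECOND_LAST_WORD.search(line)
    return match.group(0) if match else None


print(second_last_word("2038 W MAIN AVE"))         # MAIN
print(second_last_word("53371 SOUTH WASHINGTON"))  # SOUTH
```

Note the second call: when a line has no street type at the end, the second-to-last word is the direction rather than the street name.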

Logic: capture the second-to-last word (a run of letters, digits, or hyphens), skipping an optional N that is glued to the front of it.
^.*?\s[N]{0,1}([-a-zA-Z0-9]+)\s*\w*$
Res:
Match 1
Full match 0-15 `2038 W MAIN AVE`
Group 1. 7-11 `MAIN`
Match 2
Full match 16-43 `2038QWEW S JEFFERSON AVENUE`
Group 1. 27-36 `JEFFERSON`
Match 3
Full match 44-70 `33 NORTH CALIFORNIA STREET`
Group 1. 53-63 `CALIFORNIA`
Match 4
Full match 71-93 `53371 SOUTH WASHINGTON`
Group 1. 83-93 `WASHINGTON`
Match 5
Full match 94-119 `53371 S WASHINGTON AVENUE`
Group 1. 102-112 `WASHINGTON`
Match 6
Full match 120-143 `1600 E PENNSYLVANIA AVE`
Group 1. 127-139 `PENNSYLVANIA`
Match 7
Full match 144-157 `WEST9 67ST ST`
Group 1. 150-154 `67ST`
Match 8
Full match 158-176 `E171 N 23RD STREET`
Group 1. 165-169 `23RD`
Match 9
Full match 177-195 `G171 N121ST STREET`
Group 1. 183-188 `121ST`
https://regex101.com/r/m2rmUQ/4
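To re-check the listing locally, here is a small Python sketch that applies the pattern exactly as printed above to each sample line. The names PATTERN, SAMPLES and capture are made up for illustration.

```python
import re

# The pattern exactly as printed in the answer above.
PATTERN = re.compile(r"^.*?\s[N]{0,1}([-a-zA-Z0-9]+)\s*\w*$")

SAMPLES = [
    "2038 W MAIN AVE",
    "2038QWEW S JEFFERSON AVENUE",
    "33 NORTH CALIFORNIA STREET",
    "53371 SOUTH WASHINGTON",
    "53371 S WASHINGTON AVENUE",
    "1600 E PENNSYLVANIA AVE",
    "WEST9 67ST ST",
    "E171 N 23RD STREET",
    "G171 N121ST STREET",
]


def capture(line):
    """Return group 1 for one address line, or None when nothing matches."""
    match = PATTERN.match(line)
    return match.group(1) if match else None


for line in SAMPLES:
    print(f"{line!r} -> {capture(line)!r}")
```

Tracing the pattern by hand under Python's re, the three-token line 53371 SOUTH WASHINGTON appears to yield SOUTH rather than WASHINGTON, because the lazy prefix stops at the first space that lets the rest of the pattern match. The regex101 link may have used a slightly different pattern or flags, so that line is worth re-running.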

I was able to figure this out in a slightly different way
[0-9A-Z]* [0-9A-Z]*$
and then I simply split the string it created by the space. Maybe one or two steps too many but it's transparent
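A minimal Python sketch of this match-then-split approach follows. The regex is the one from this answer; the STREET_TYPES set, the handling of a glued direction letter, and the function name street_name are assumptions added so that all nine sample lines from the question come out as requested.

```python
import re

# Assumption: an illustrative, incomplete set of street-type suffixes.
STREET_TYPES = {"AVE", "AVENUE", "ST", "STREET"}

# The pattern from the answer: the last two space-separated tokens.
LAST_TWO = re.compile(r"[0-9A-Z]* [0-9A-Z]*$")


def street_name(address):
    """Return the street name from an upper-case address line, or None."""
    match = LAST_TWO.search(address)
    if match is None:
        return None
    first, last = match.group(0).split(" ")
    # If the line has no street type, the last token is the street name.
    name = first if last in STREET_TYPES else last
    # Drop a direction letter glued to a numbered street, e.g. N121ST.
    return re.sub(r"^[NSEW](?=\d)", "", name)


print(street_name("53371 SOUTH WASHINGTON"))  # WASHINGTON
print(street_name("G171 N121ST STREET"))      # 121ST
```

The set lookup is what keeps the regex itself simple: the pattern only isolates the last two tokens, and plain string logic decides which of them is the name.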

Related

Regex pattern in salesforce apex

I am new to regex.
I have a String formatted like below
Street Name
City, StateCode ZipNumber
for example, the string can be like
50 Connecticut Avenue
Norwalk, CT 06850
or
123 6th Avenue
New York, NY 10013
or
4TH Highway 6
Rule, TX 79547
I am trying to construct a regex here.
But I cannot proceed, as I have little idea about regex.
Can you please help me?

The following might be enough :
^(?<Street>[^\n]+)\n(?<City>[^,]+), (?<StateCode>[A-Z]{2}) (?<Zip>\d+)$
It captures the following segments in different groups :
the first line in a group named Street
the part of the second line which precedes the comma in a group named City
the next two capital letters in a group named StateCode
the following digits in a group named Zip
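For readers who want to try the pattern outside Apex, here is a hedged Python sketch. Python spells named groups (?P<name>...), so the group syntax is adjusted; the function name parse_address is an assumption.

```python
import re

# Same structure as the pattern above, with Python's (?P<name>...) syntax.
ADDRESS = re.compile(
    r"^(?P<Street>[^\n]+)\n(?P<City>[^,]+), (?P<StateCode>[A-Z]{2}) (?P<Zip>\d+)$"
)


def parse_address(text):
    """Split a two-line address into its named parts, or return None."""
    match = ADDRESS.match(text)
    return match.groupdict() if match else None


print(parse_address("50 Connecticut Avenue\nNorwalk, CT 06850"))
print(parse_address("123 6th Avenue\nNew York, NY 10013"))
```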

Regex for date and 3-letter code

I've been trying to create a regex that will parse a date and a 3-letter code from a somewhat longer message. Here are examples of the messages and what I want to get:
AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX -> PEK, 14JUN18/1654
XXX/WWWW BY 05JUL 0900 BKK LT ELSE BKG WILL BE QQQQ -> BKK, 05JUL 0900
TO AZ BY 02AUG 1910 TYO OR AZ WWWW WILL BE XXX -> TYO, 02AUG 1910
BY TYO20JUL18/0355 OR CXL CA ALL QQQ -> TYO, 20JUL18/0355
BY AMS04JUL18/1954 OR CXL MF 812 L07JUL -> AMS, 04JUL18/1954
I want to be able to match the 3-letter code and the date for every message. The code is always next to the date but can be before or after the date part. Also, the date part can be with or without a year. Is it possible to have one regex to use for all the above examples?
I started with this:
(\s[A-Z]{3}\d\d|\d\d[A-Z]{3}\s)
(https://regex101.com/r/LPLjgf/1) but it's not working as it should and I'm not very experienced with regex to be honest.
EDIT:
Actually I would need to use only the 3-letter codes but I need them to be connected with a date - for example in:
AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX
the AAA, BBB or QQQ shouldn't match because they aren't right after / before the date as PEK is.
Same with BY TYO20JUL18/0355 OR CXL CA ALL QQQ -> only TYO should match because it's before a date and CXL shouldn't.

You may use the following pattern:
([A-Z]{3})(\d{2}[A-Z]{3}\d{2}\/\d{4})|(\d{2}[A-Z]{3} \d{4}) ([A-Z]{3})
([A-Z]{3}) Capturing group for three capital letters
(\d{2}[A-Z]{3}\d{2}\/\d{4}) Capturing group for two digits, three upper case letters, two digits, /, four digits.
| Logical OR, alternates pattern.
(\d{2}[A-Z]{3} \d{4}) Capturing group. Captures two digits, three upper case letters, whitespace and four digits.
([A-Z]{3}) Capturing group for three upper case letters.
You can try it live here.
Captured groups:
Group 1. 14-17 `PEK`
Group 2. 17-29 `14JUN18/1654`
Group 3. 83-93 `05JUL 0900`
Group 4. 94-97 `BKK`
Group 3. 151-161 `02AUG 1910`
Group 4. 162-165 `TYO`
Group 1. 211-214 `TYO`
Group 2. 214-226 `20JUL18/0355`
Group 1. 269-272 `AMS`
Group 2. 272-284 `04JUL18/1954`
Group 1. 342-345 `PEK`
Group 2. 345-357 `14JUN18/1654`
Group 1. 378-381 `TYO`
Group 2. 381-393 `20JUL18/0355`
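Because the code and date land in different group numbers depending on which alternative matched, a small wrapper can normalise them. This is a Python sketch using the pattern above (the slash is left unescaped, which Python allows); the name code_and_date is an assumption.

```python
import re

CODE_AND_DATE = re.compile(
    r"([A-Z]{3})(\d{2}[A-Z]{3}\d{2}/\d{4})"  # code glued in front of a date with year
    r"|(\d{2}[A-Z]{3} \d{4}) ([A-Z]{3})"     # date without year, then the code
)


def code_and_date(message):
    """Return (code, date) for the first match in message, or None."""
    match = CODE_AND_DATE.search(message)
    if match is None:
        return None
    # Groups 1/2 are filled by the first alternative, 3/4 by the second.
    code = match.group(1) or match.group(4)
    date = match.group(2) or match.group(3)
    return code, date


print(code_and_date("AAA BBB 1A BY PEK14JUN18/1654 OR QQQ MF 812 XXXXX"))
print(code_and_date("TO AZ BY 02AUG 1910 TYO OR AZ WWWW WILL BE XXX"))
```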

Firstly
(\s[A-Z]{3}\d\d|\d\d[A-Z]{3}\s)
The alternation (|) means that it will match \s[A-Z]{3}\d\d or \d\d[A-Z]{3}\s, which is certainly not what you want. To narrow the scope of an alternation, use grouping.
I would think you want to match this fairly directly:
([A-Z]{3})\d{2}[A-Z]{3}\d{2}
And that only captures the three letters in a group.

Try the following RegEx:
[A-Z]{3}(\d{2}[A-Z]{3}[\S]*)|(\d{2}[A-Z]{3}\s\d{4}\s[A-Z]{3})
It will match 3 letters followed by 2 digits, followed by 3 letters, OR 2 digits followed by 3 letters, a space, 4 digits, a space and 3 letters.
You can try it here

Regex for more than 1 First Name before the Middle Initial

I'm not that good with regular expressions and here is my problem:
I want to create a regex that matches a name with two or more first names (e.g. Francis Gabriel).
I came up with the regex ^[A-Z][a-z]{3,30}/s[A-Z][a-z]{3,30} but
it only matches two first names and not all first names.
The regex should match John John J. Johnny.

^[A-Z][a-z]{3,30}(\\s[A-Z](\\.|[a-z]{2,30})?)*$
In Java, \s must be written as \\s inside the string literal passed to Pattern.compile.
Each later part must be either an initial such as X. or a full name such as Xyz.
John Johny J.hny -> is wrong
So after the capital letter the group accepts either a . or [a-z]{2,30}. The leading [A-Z][a-z]{3,30} requires at least one first name, and the * at the end of the group lets it repeat zero or more times.
Since Java is not supported in this snippet, here is a JavaScript version of the same regex for illustration.
Check it here
var reg=/^[A-Z][a-z]{3,30}(\s[A-Z](\.|[a-z]{2,30})?)*$/;
console.log(reg.test("John john")); // false because second part start with small case
console.log(reg.test("John John"));
console.log(reg.test("John John J."));
console.log(reg.test("John John J. Johny"));

Use the following regex:
^\w+\s(\w+\s)+\w+\.\s\w+$
^\w+\s matches a name and a space
(\w+\s)+ followed by at least one more name and space
\w+\.\s followed by an initial with a dot, then a space (use \w\. to require a single letter)
\w+$ followed by a last name
Regex101
Test code:
String testInput = "John John P. Johnny";
if (testInput.matches("^\\w+\\s(\\w+\\s)+\\w+\\.\\s\\w+$")) {
System.out.println("We have a match");
}

Try this:
^(\S*\s+)(\S*)?\s+\S*?
Francis Gabriel - matches:
0: [0,10] Francis
1: [0,9] Francis
2: [9,9]
John John2 J. Johnny - matches:
0: [0,11] John John2
1: [0,5] John
2: [5,10] John2

Regex with negative lookahead across multiple lines

For the past few hours I've been trying to match address(es) from the following sample data and I can't get it to work:
medicalHistory None
address 24 Lewin Street, KUBURA,
NSW, Australia
email MaryBeor#spambob.com
address 16 Yarra Street,
LAWRENCE, VIC, Australia
name Mary Beor
medicalHistory None
phone 00000000000000000000353336907
birthday 26-11-1972
My plan was to find anything that starts with "address", is followed by any space followed by characters, numbers commas and newlines and ends with newline followed by a character. I came up with the following (and many variations of it):
address\s+([0-9a-zA-Z, \n\t]+)(?!\n\w)
Unfortunately that matches the following:
address 24 Lewin Street, KUBURA,
NSW, Australia
email MaryBeor
and
address 16 Yarra Street,
LAWRENCE, VIC, Australia
name Mary Beor
medicalHistory None
phone 00000000000000000000353336907
birthday 26
instead of
address 24 Lewin Street, KUBURA,
NSW, Australia
and
address 16 Yarra Street,
LAWRENCE, VIC, Australia
Can you please tell me what I'm doing wrong?

I would do it this way:
address\s+((?![\r\n]+\w)[0-9a-zA-Z, \r\n\t])+
See it here on Regexr.
This ((?![\r\n]+\w)[0-9a-zA-Z, \r\n\t])+ is the important part, where I say, match the next character from [0-9a-zA-Z, \r\n\t], if (?![\r\n]+\w) is not following. This is matching what you expect.
In both your cases the regex stopped matching because of a character that is not included in your character class. If you want to go that way, then you would need to combine a lazy quantifier and a positive lookahead:
address\s+([0-9a-zA-Z, \n\r\t]+?)(?=\r\w)
[0-9a-zA-Z, \n\r\t]+? matches as little as possible until the condition (?=\r\w) is true.
See it here at Regexr
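The first pattern in this answer (a negative lookahead checked before every character) can be tried in Python as follows. This is only a sketch: the SAMPLE text assumes that continuation lines of an address are indented, which the flattened sample data in the question no longer shows, and the name find_addresses is made up.

```python
import re

# Assumption: continuation lines of an address start with whitespace.
SAMPLE = (
    "medicalHistory None\n"
    "address 24 Lewin Street, KUBURA,\n"
    "    NSW, Australia\n"
    "email MaryBeor#spambob.com\n"
    "address 16 Yarra Street,\n"
    "    LAWRENCE, VIC, Australia\n"
    "name Mary Beor\n"
)

# Before consuming each character, check that we are not standing on a
# line break that is directly followed by a word character (a new field).
ADDRESS = re.compile(r"address\s+((?![\r\n]+\w)[0-9a-zA-Z, \r\n\t])+")


def find_addresses(text):
    """Return every full address match found in text."""
    return [m.group(0) for m in ADDRESS.finditer(text)]


for address in find_addresses(SAMPLE):
    print(repr(address))
```

If the continuation lines were not indented, the lookahead would stop at the first line break and only the first line of each address would be returned.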

The problem with your regex is that + is greedy and goes until it finds a character out of that group, the # in the first case and - in the second.
Another approach is to use a non-greedy quantifier and a positive look-ahead for a newline followed by a word-character, like (python version):
re.findall(r'address\s+.*?(?=\n\w)', s, re.DOTALL)
It yields:
['address 24 Lewin Street, KUBURA, \n NSW, Australia',
'address 16 Yarra Street, \n LAWRENCE, VIC, Australia']

Regular expression for address field validation

I am trying to write a regular expression that validates an address, for example 21-big walk way or 21 St.Elizabeth's drive. I came up with the following regular expression, but I am not sure how to incorporate all the characters (alphanumeric, space, dash, full stop, apostrophe).
"regexp=^[A-Za-z-0-99999999'

See the answer to this question on address validating with regex:
regex street address match
The problem is, street addresses vary so much in formatting that it's hard to code against them. If you are trying to validate addresses, finding if one isn't valid based on its format is mighty hard to do.
This would match the following address (253 N. Cherry St.) and anything with the same format:
\d{1,5}\s\w.\s(\b\w*\b\s){1,2}\w*\.
This allows 1-5 digits for the house number, a space, a character followed by a period (for N. or S.), 1-2 words for the street name, finished with an abbreviation (like st. or rd.).
Because regex is used to see if things meet a standard or protocol (which you define), you probably wouldn't want to allow for the addresses provided above, especially the first one with the dash, since they aren't very standard. You can modify my code above to allow for them if you wish; for example, you could add
(-?)
to allow for a dash but not require one.
In addition, http://rubular.com/ is a quick and interactive way to learn regex. Try it out with the addresses above.
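As a quick check of what this pattern accepts, here is a small Python sketch; the helper name looks_like_street and the choice of a full-string match are assumptions. It also shows one side effect of the unescaped dot after \w: it matches any character, not just a period.

```python
import re

# The pattern from the answer above, unchanged.
STREET = re.compile(r"\d{1,5}\s\w.\s(\b\w*\b\s){1,2}\w*\.")


def looks_like_street(text):
    """True when the whole string has the 253 N. Cherry St. shape."""
    return STREET.fullmatch(text) is not None


print(looks_like_street("253 N. Cherry St."))        # True
print(looks_like_street("21-big walk way"))          # False
print(looks_like_street("21 St.Elizabeth's drive"))  # False
print(looks_like_street("253 NW Cherry St."))        # True, the dot matched "W"
```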

If you don't have a fixed format for the address as mentioned above, I would use a regex just to filter out symbols that are not used in addresses (special symbols such as &(%#$^). The result would be:
[A-Za-z0-9'\.\-\s\,]

Just to add to Serzas' answer (since I don't have enough reputation to comment):
Letters and numbers can effectively be replaced by \w.
Additionally, apostrophe, comma, period and hyphen don't necessarily need a backslash.
My requirement also involved forward and back slashes, so \/, and finally whitespace with \s. The regex that worked for me was:
pattern: "[\w',-\\/.\s]"
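One thing to watch with unescaped hyphens: in the middle of a character class a hyphen forms a range. The Python sketch below assumes the pattern above is read character for character as a regex (if the host language first unescapes the string, the class may differ); the names AS_POSTED and HYPHEN_LAST are made up.

```python
import re

# Read literally, ",-\\" is a range from "," to "\", which also covers
# digits, upper-case letters and symbols such as ";", "<", "=" and "@".
AS_POSTED = re.compile(r"[\w',-\\/.\s]")

# Putting the hyphen last keeps it a literal hyphen.
HYPHEN_LAST = re.compile(r"[\w',\\/.\s-]")

for char in ["-", "/", "@", ";"]:
    print(char, bool(AS_POSTED.fullmatch(char)), bool(HYPHEN_LAST.fullmatch(char)))
```

Both classes accept the hyphen and the slash, but only the first one also lets "@" and ";" through.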

Regular expression for simple address validation
^[#.0-9a-zA-Z\s,-]+$
E.g. for Address match case
#1, North Street, Chennai - 11
E.g. for Address not match case
$1, North Street, Chennai # 11
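The two cases above can be checked with a few lines of Python; the function name is_simple_address is an assumption.

```python
import re

# The whitelist pattern from the answer above.
SIMPLE_ADDRESS = re.compile(r"^[#.0-9a-zA-Z\s,-]+$")


def is_simple_address(text):
    """True when text contains only the whitelisted address characters."""
    return SIMPLE_ADDRESS.match(text) is not None


print(is_simple_address("#1, North Street, Chennai - 11"))  # True
print(is_simple_address("$1, North Street, Chennai # 11"))  # False
```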

I have successfully used:
Dim regexString = New stringbuilder
With regexString
.Append("(?<h>^[\d]+[ ])(?<s>.+$)|") 'find the 2013 1st ambonstreet
.Append("(?<s>^.*?)(?<h>[ ][\d]+[ ])(?<e>[\D]+$)|") 'find the 1-7-4 Dual Ampstreet 130 A
.Append("(?<s>^[\D]+[ ])(?<h>[\d]+)(?<e>.*?$)|") 'find the Terheydenlaan 320 B3
.Append("(?<s>^.*?)(?<h>\d*?$)") 'find the 245e oosterkade 9
End With
Dim Address As Match = Regex.Match(DataRow("customerAddressLine1"), regexString.ToString(), RegexOptions.Multiline)
If Not String.IsNullOrEmpty(Address.Groups("s").Value) Then StreetName = Address.Groups("s").Value
If Not String.IsNullOrEmpty(Address.Groups("h").Value) Then HouseNumber = Address.Groups("h").Value
If Not String.IsNullOrEmpty(Address.Groups("e").Value) Then Extension = Address.Groups("e").Value
The regex will attempt to find a result; if there is none, it moves to the next alternative. If no result is found, none of the 4 formats were present.

This one worked for me:
\d+[ ](?:[A-Za-z0-9.-]+[ ]?)+(?:Avenue|Lane|Road|Boulevard|Drive|Street|Ave|Dr|Rd|Blvd|Ln|St)\.?
The source: https://www.codeproject.com/Tips/989012/Validate-and-Find-Addresses-with-RegEx

Regex is a very bad choice for this kind of task. Try to find a web service or an address database or a product which can clean address data instead.
Related:
Address validation using Google Maps API

As a simple one-line expression, I recommend this (A-z corrected to A-Z, and the hyphen moved to the end of the class so it is taken literally):
^([a-zA-Z0-9/\\'(),\s-]{2,255})$

I needed
STREET # | STREET | CITY | STATE | ZIP
So I wrote the following regex
[0-9]{1,5}( [a-zA-Z.]*){1,4},?( [a-zA-Z]*){1,3},? [a-zA-Z]{2},? [0-9]{5}
This allows
a street number of 1-5 digits
1-4 street description words
1-3 city words
a 2-character state
a 5-character zip code
I also made the commas separating street, city, state and zip optional.
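A short Python sketch for trying this pattern; the sample addresses and the name is_full_address are made up for illustration.

```python
import re

# The pattern from the answer above, unchanged.
US_ADDRESS = re.compile(
    r"[0-9]{1,5}( [a-zA-Z.]*){1,4},?( [a-zA-Z]*){1,3},? [a-zA-Z]{2},? [0-9]{5}"
)


def is_full_address(text):
    """True when the whole string is number, street, city, state and zip."""
    return US_ADDRESS.fullmatch(text) is not None


print(is_full_address("123 Main St., Springfield, IL 62704"))  # True
print(is_full_address("123 Main St Springfield IL 62704"))     # True
print(is_full_address("Main St, Springfield, IL 62704"))       # False
```

Because the word groups repeat, they only retain their last repetition, so this pattern is better suited to validation than to extracting the individual parts.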

Here is the approach I have taken to finding addresses using regular expressions:
A set of patterns is useful to find many forms that we might expect from an address starting with simply a number followed by set of strings (ex. 1 Basic Road) and then getting more specific such as looking for "P.O. Box", "c/o", "attn:", etc.
Below is a simple test in python. The test will find all the addresses but not the last 4 items which are company names. This example is not comprehensive, but can be altered to suit your needs and catch examples you find in your data.
import re
strings = [
'701 FIFTH AVE',
'2157 Henderson Highway',
'Attn: Patent Docketing',
'HOLLYWOOD, FL 33022-2480',
'1940 DUKE STREET',
'111 MONUMENT CIRCLE, SUITE 3700',
'c/o Armstrong Teasdale LLP',
'1 Almaden Boulevard',
'999 Peachtree Street NE',
'P.O. BOX 2903',
'2040 MAIN STREET',
'300 North Meridian Street',
'465 Columbus Avenue',
'1441 SEAMIST DR.',
'2000 PENNSYLVANIA AVENUE, N.W.',
'465 Columbus Avenue',
'28 STATE STREET',
'P.O, Drawer 800889.',
'2200 CLARENDON BLVD.',
'840 NORTH PLANKINTON AVENUE',
'1025 Connecticut Avenue, NW',
'340 Commercial Street',
'799 Ninth Street, NW',
'11318 Lazarro Ln',
'P.O, Box 65745',
'c/o Ballard Spahr LLP',
'8210 SOUTHPARK TERRACE',
'1130 Connecticut Ave., NW, Suite 420',
'465 Columbus Avenue',
"BANNER & WITCOFF , LTD",
"CHIP LAW GROUP",
"HAMMER & ASSOCIATES, P.C.",
"MH2 TECHNOLOGY LAW GROUP, LLP",
]
# Raw strings keep the regex backslashes from being read as string escapes.
patterns = [
r"c\/o [\w ]{2,}",
r"C\/O [\w ]{2,}",
r"P.O\. [\w ]{2,}",
r"P.O\, [\w ]{2,}",
r"[\w\.]{2,5} BOX [\d]{2,8}",
r"^[#\d]{1,7} [\w ]{2,}",
r"[A-Z]{2,2} [\d]{5,5}",
r"Attn: [\w]{2,}",
r"ATTN: [\w]{2,}",
r"Attention: [\w]{2,}",
r"ATTENTION: [\w]{2,}"
]
contact_list = []
total_count = len(strings)
found_count = 0
for string in strings:
    pat_no = 1
    for pattern in patterns:
        match = re.search(pattern, string.strip())
        if match:
            print("Item found: " + match.group(0) + " | Pattern no: " + str(pat_no))
            found_count += 1
        pat_no += 1
print("-- Total: " + str(total_count) + " Found: " + str(found_count))

UiPath Academy training video lists this RegEx for US addresses (and it works fine for me):
\b\d{1,8}(-)?[a-z]?\W[a-z|\W|\.]{1,}\W(road|drive|avenue|boulevard|circle|street|lane|way|rd\.|st\.|dr\.|ave\.|blvd\.|cir\.|ln\.|rd|dr|ave|blvd|cir|ln)

I had a different use case: finding any addresses in logs and scolding application developers (favourite part of a devops job). I had the advantage of having the word "address" in the pattern, but it should work without that if you have a specific field to scan.
\baddress.[0-9\\\/# ,a-zA-Z]+[ ,]+[0-9\\\/#, a-zA-Z]{1,}
Look for the word "address" - skip this if not applicable
Look for first part numbers, letters, #, space - Unit Number / street number/suite number/door number
Separated by a space or comma
Look for one or more of rest of address numbers, letters, #, space
Tested against :
1 Sleepy Boulevard PO, Box 65745
Suite #100 /98,North St,Snoozepura
Ave., New Jersey,
Suite 420 1130 Connect Ave., NW,
Suite 420 19 / 21 Old Avenue,
Suite 12, Springfield, VIC 3001
Suite#100/98 North St Snoozepura
This worked for me when there were street addresses with unit/suite numbers, zip codes, only street. It also didn't match IP addresses or mac addresses. Worked with extra spaces.
This assumes users are normal people who separate elements of a street address with a comma, hash sign, or space, and not psychopaths who use characters like "|" or ":"!

I use this for French addresses and some international addresses too.
[\\D+ || \\d]+\\d+[ ||,||[A-Za-z0-9.-]]+(?:[Rue|Avenue|Lane|... etcd|Ln|St]+[ ]?)+(?:[A-Za-z0-9.-](.*)]?)

I was inspired by the answers given here and came up with these 2 solutions, which
support optional uppercase
also support French
regex structure
numbers (required)
letters, chars and spaces
at least one common address keyword (required)
as many chars you want before the line break
definitions:
accuracy
the ability to detect real addresses and not text that merely looks like an address.
range
the ability to detect uncommon addresses.
Regex 1:
high accuracy
low range
/[0-9]+[ |[a-zà-ú.,-]* ((highway)|(autoroute)|(north)|(nord)|(south)|(sud)|(east)|(est)|(west)|(ouest)|(avenue)|(lane)|(voie)|(ruelle)|(road)|(rue)|(route)|(drive)|(boulevard)|(circle)|(cercle)|(street)|(cer\.)|(cir\.)|(blvd\.)|(hway\.)|(st\.)|(aut\.)|(ave\.)|(ln\.)|(rd\.)|(hw\.)|(dr\.)|(a\.))([ .,-]*[a-zà-ú0-9]*)*/i
regex 2:
low accuracy
high range
/[0-9]*[ |[a-zà-ú.,-]* ((highway)|(autoroute)|(north)|(nord)|(south)|(sud)|(east)|(est)|(west)|(ouest)|(avenue)|(lane)|(voie)|(ruelle)|(road)|(rue)|(route)|(drive)|(boulevard)|(circle)|(cercle)|(street)|(cer\.?)|(cir\.?)|(blvd\.?)|(hway\.?)|(st\.?)|(aut\.?)|(ave\.?)|(ln\.?)|(rd\.?)|(hw\.?)|(dr\.?)|(a\.))([ .,-]*[a-zà-ú0-9]*)*/i

This one works well for me
^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?<= )APT)? ?((?<= )\d*)?$
Source : https://community.alteryx.com/t5/Alteryx-Designer-Discussions/RegEx-Addresses-different-formats-and-headaches/td-p/360147

Here is my RegEx for address, city & postal validation rules
validation rules:
address -
1 - 40 characters length.
Letters, numbers, space and . , : ' #
city -
1 - 19 characters length
Only Alpha characters are allowed
Spaces are allowed
postalCode -
The USA zip must meet the following criteria and is required:
Minimum of 5 digits (9 digits if zip + 4 is provided)
Numeric only
A Canadian postal code is a six-character string.
in the format A1A 1A1, where A is a letter and 1 is a digit.
a space separates the third and fourth characters.
do not include the letters D, F, I, O, Q or U.
the first position does not make use of the letters W or Z.
address: ^[a-zA-Z0-9 .,#;:'-]{1,40}$
city: ^[a-zA-Z ]{1,19}$
usaPostal: ^([0-9]{5})(?:[-]?([0-9]{4}))?$
canadaPostal : ^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$
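The four rules can be wired into a small validator. This Python sketch uses the patterns exactly as listed; the RULES dictionary and the is_valid helper are assumptions added for illustration.

```python
import re

# The four patterns from the answer above, keyed by field name.
RULES = {
    "address": re.compile(r"^[a-zA-Z0-9 .,#;:'-]{1,40}$"),
    "city": re.compile(r"^[a-zA-Z ]{1,19}$"),
    "usaPostal": re.compile(r"^([0-9]{5})(?:[-]?([0-9]{4}))?$"),
    "canadaPostal": re.compile(r"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$"),
}


def is_valid(field, value):
    """Check one value against the rule registered for its field."""
    return RULES[field].match(value) is not None


print(is_valid("usaPostal", "06850-1234"))   # True
print(is_valid("canadaPostal", "K1A 0B1"))   # True
print(is_valid("canadaPostal", "W1A 0B1"))   # False, W cannot come first
print(is_valid("city", "St. Louis"))         # False, the period is not allowed
```

Note that the address pattern also lets ; and - through, which is slightly wider than the rule text above it.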

\b(\d{1,8}[a-z]?[0-9\/#- ,a-zA-Z]+[ ,]+[.0-9\/#, a-zA-Z]{1,})\n

A more dynamic approach than @micah's would be the following:
(?'Address'(?'Street'[0-9][a-zA-Z\s]),?\s*(?'City'[A-Za-z\s]),?\s(?'Country'[A-Za-z])\s(?'Zipcode'[0-9]-?[0-9]))
It won't care about individual lengths of segments of code.
https://regex101.com/r/nuy7hB/1
