Regex for words in street addresses - regex

I know the dangers of relying on regex for street addresses. However, I must use regex and my addresses are all Australian and come well formatted from one regulated source.
I am successfully using groups to return the street number and street name from the following:
1 Main Street, Sydney NSW 2000
1A Main Street, Sydney NSW 2000
1/20 Main Street, Sydney NSW 2000
1/20A Main Street, Sydney NSW 2000
U1/20A Main Street, Sydney NSW 2000
My (PHP) expression is ~([\w\d\-\/\.\&]*)\s*([\w\d '\-\.\ ()]+)~
But I am having trouble adapting that to work with:
Unit 1/20 Main Street, Sydney NSW 2000
My groups give me 'Unit' and '1'
The fiddle is here: https://regex101.com/r/aLRNgp/1

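I believe the regex in question is only intended to match the house number and street name part of your addresses. Your regex looks more complicated than it needs to be (for example, \w already includes \d), but to fix the problem with a Unit prefix, allow an optional literal "Unit " at the start of the first group: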
^((?:Unit )?[\w\-\/\.\&]*)\s*([\w '\-\.\ ()]+)
Demo
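As a rough check of the pattern above outside PHP, here is a minimal Python sketch. The pattern string is the one from the answer; the helper name `split_address` is made up for illustration, and Python's `re` module is assumed to behave like PCRE for this particular pattern.

```python
import re

# The answer's pattern, unchanged: an optional literal "Unit " prefix is
# allowed inside the first group, ahead of the unit/street-number token.
ADDRESS_RE = re.compile(r"^((?:Unit )?[\w\-\/\.\&]*)\s*([\w '\-\.\ ()]+)")


def split_address(address):
    """Return (number_part, street_name), or None if nothing matches."""
    match = ADDRESS_RE.match(address)
    if match is None:
        return None
    return match.group(1), match.group(2)


for sample in (
    "1 Main Street, Sydney NSW 2000",
    "U1/20A Main Street, Sydney NSW 2000",
    "Unit 1/20 Main Street, Sydney NSW 2000",
):
    print(split_address(sample))
```

The second group stops at the comma because a comma is not in its character class, so the suburb, state and postcode are left out.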

Related

How can a regex catch all parts before a keyword from a finite set, but sometimes separated only by a single space

This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address part is not known in advance. The only criterion I have found so far is the occurrence of a country name belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example to just those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after each of them. Unfortunately there are cases where the to-be-extracted street information is separated from the country by only a single space, e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them with a lookahead condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET, which are separated from the country by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it were possible, by properly using regular expressions, to replace each country name by the same one preceded by an additional space (or even to do this only for cases where there is only a single space in front of one of the country names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run, because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to apply two processing steps to the same input in sequence. But maybe someone has a better idea of how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to (.*?), making it non-greedy. That way the match stops at the last non-space character before the country instead of including some of the spaces.
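The lazy match plus lookahead can be tried out with a short Python sketch. The records are a handful taken from the question; the three spaces before USA in the second record are an assumption, added to show the multi-space case, and `re.MULTILINE` with `finditer` stands in for the 'multiline' and 'global' options.

```python
import re

# Pattern from the answer: lazy capture, then a lookahead for the country
# at the end of the line. The country itself is never consumed.
STREET_RE = re.compile(r"^(.*?)(?=\s+(USA|Canada)\s*$)", re.MULTILINE)

RECORDS = """12H MARKET ST. Canada
1234 Quantity Dr   USA
1234 TELEGRAM BLVD SUITE D USA
# USA
1234 E MAIN STREET USA"""

# group(1) is the street part; group(2) is the country seen by the lookahead.
streets = [match.group(1) for match in STREET_RE.finditer(RECORDS)]
print(streets)
```

Because the capture is lazy and the lookahead requires at least one whitespace character before the country, the captured text ends at the last non-space character whether one space or several precede the country.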

Regex for city and street name

Hi, I am looking for 2 regex which describe:
1) a valid name of a street
2) a valid name of a city
Valid street names are:
Mainstreet.
Mainstreet
Main Street
Big New mainstreet
Mainstreet-New
Mains Str.
St. Alexander Street
abcÜüßäÄöÖàâäèéêëîï ôœùûüÿçÀÂ-ÄÈÉÊËÎÏÔŒÙÛÜŸÇ.
John Kennedy Street
Not valid street names are:
Mainstreet #+;:_*´`?=)(/&%$§!
Mainstreet#+;:_*´`?=)(/&%$§!
Mainstreet 2
Mainstreet..
Mainstreet§
Valid cities are:
Edinôœùûüÿ
Berlin.
St. Petersburg
New-Berlin
Aue-Bad Schlema
Frankfurt am Main
Nürnberg
Ab
New York CityßäÄöÖàâäèéêëîïôœùûüÿçÀÂ-ÄÈÉÊËÎÏÔŒÙÛÜŸ
Not valid cities are:
Edingburgh 123
Edingburg123
St. Andrews 12
Berlin,#+;:_*´`?=)(/&%$§!
Berlin__
The solution that I have at the moment comes very close but does not match perfectly:
For city and street name:
^[^\W\d_]+(?:[-\s][^\W\d_]+)*[.]?$
Unfortunately no match for these examples (the rest works fine):
St. Alexander Street
St. Petersburg
If you have simpler solutions, I am happy to learn something new! :-)
To make it match St. Alexander Street and St. Petersburg, you just need to add an optional dot after the letter matching patterns:
^[^\W\d_]+\.?(?:[-\s][^\W\d_]+\.?)*$
#         ^^^                 ^^^
See the regex demo.
Also, it might make sense to add a single apostrophe to the regex:
^[^\W\d_]+\.?(?:[-\s'’][^\W\d_]+\.?)*$
See the regex demo.
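To see how the final pattern treats the sample names, here is a small Python sketch. It assumes Python's `re` module, where `\w` and `\d` are Unicode-aware for `str` patterns, so `[^\W\d_]` matches roughly any letter; the function name `is_valid_name` is made up for illustration.

```python
import re

# Letter runs, each optionally followed by one dot, joined by a single
# hyphen, whitespace character or apostrophe (straight or typographic).
NAME_RE = re.compile(r"^[^\W\d_]+\.?(?:[-\s'’][^\W\d_]+\.?)*$")


def is_valid_name(name):
    """True if the street or city name fits the pattern."""
    return NAME_RE.match(name) is not None


for name in ("St. Alexander Street", "St. Petersburg", "Mainstreet..", "Berlin__"):
    print(name, is_valid_name(name))
```

A second consecutive dot, a digit, or an underscore has nothing in the pattern that can consume it, which is why names such as Mainstreet.. and Berlin__ are rejected.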

Regex (Posix) to get first word only, not including numbers

New to Regex (which was recently added to SQL in DB2 for i). I don't know anything about the different engines but research indicates that it is "based on POSIX extended regular expressions".
I would like to get the street name (first non-numeric word) from an address.
e.g.
101 Main Street = Main
2/b Pleasant Ave = Pleasant
5H Unpleasant Crescent = Unpleasant
I'm sorry I don't have a string that isn't working, as suggested by the forum software. I don't even know where to start. I tried a few things I found in search but they either yielded nothing or the first "word" - i.e. the number (101, 2/b, 5H).
Thanks
Edit: Although it's looking as if IBM's implementation of regex on the DB2 family of databases may be too alien for many of the resident experts, I'll press ahead with some more detail in case it helps.
A plain English statement of the requirement would be:
Basic/acceptable: Find the first word/unbroken string that contains no numbers or special characters
Advanced/ideal: Find the first word that contains three or more characters, being only letters and zero or one embedded dash/hyphen, but no numbers or other characters.
Additional examples (original ones at top are still valid)
190 - 192 Tweety-bird avenue = Tweety-bird
190-192 Tweety-bird avenue = Tweety-bird
Charles Bronson Place = Charles
190H Charles-Bronson Place = Charles-Bronson
190 to 192 Charles Bronson Place = Charles
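The "advanced/ideal" requirement can be sketched in Python to pin down the expected results. This is not DB2 SQL and not a single regular expression: it splits the address on whitespace and tests each token, which is a swapped-in technique chosen for clarity. The function name `first_street_word` and the `min_length` parameter are made up for illustration.

```python
import re

# Letters only, with at most one embedded hyphen (e.g. "Tweety-bird").
WORD_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)?")


def first_street_word(address, min_length=3):
    """Return the first whitespace-separated token that qualifies, else None."""
    for token in address.split():
        if len(token) >= min_length and WORD_RE.fullmatch(token):
            return token
    return None


for address in (
    "101 Main Street",
    "190 - 192 Tweety-bird avenue",
    "190 to 192 Charles Bronson Place",
):
    print(address, "=", first_street_word(address))
```

The length check is what skips the word "to" in the last example, and the full-token match is what skips 2/b, 5H and 190H.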
Second Edit:
Mooching around on the internet and trying every vaguely connected expression that I could find, I stumbled on this one:
[a-zA-Z]+(?:[\s-][a-zA-Z]+)*
which actually works pretty well - it gives the street name and street type, which on reflection would actually suit my purpose as well as the street name alone (I can easily expand common abbreviations - e.g. RD to ROAD - on the fly).
Sample SQL:
select HAD1,
regexp_substr(HAD1, '[a-zA-Z]+(?:[\s-][a-zA-Z]+)*')
from ECH
where HEDTE > 20190601
Sample output
Ship To REGEXP_SUBSTR
Address
Line 1
32 CHRISTOPHER STREET CHRISTOPHER STREET
250 - 270 FEATHERSTON STREET FEATHERSTON STREET
118 MONTREAL STREET MONTREAL STREET
7 BIRMINGHAM STREET BIRMINGHAM STREET
59 MORRISON DRIVE MORRISON DRIVE
118 MONTREAL STREET MONTREAL STREET
MASON ROAD MASON ROAD
I know this wasn't exactly the question I asked, so apologies to anyone who could have done this but was following the original request faithfully.
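For anyone wanting to experiment with the pattern from the second edit outside DB2, here is a minimal Python sketch. It assumes that `re.search` returns the leftmost match, which is how it is used here to stand in for `regexp_substr`; the helper name `street_part` is made up.

```python
import re

# Pattern from the second edit: runs of letters joined by a single
# whitespace character or hyphen.
STREET_RE = re.compile(r"[a-zA-Z]+(?:[\s-][a-zA-Z]+)*")


def street_part(address_line):
    """Return the first run of letter-only words, or None if there is none."""
    match = STREET_RE.search(address_line)
    return match.group(0) if match else None


for line in ("32 CHRISTOPHER STREET", "250 - 270 FEATHERSTON STREET", "MASON ROAD"):
    print(line, "->", street_part(line))
```

One thing to watch, at least in this Python version: the match starts at the first letter anywhere in the line, so "5H Unpleasant Crescent" yields "H Unpleasant Crescent" and "190 to 192 Charles Bronson Place" yields just "to".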
Not sure if this is Posix compliant, but something like this could work: ^[\w\/]+?\s((\w+\s)+?)\s*\w+?$, example here.
The pattern assumes that the first chunk is the building number, the second chunk is the name of the street, and the last chunk is Road/Ave/Blvd/etc.
This should also cater for street names which have white spaces in them.
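Here is a quick Python sketch of how the three-chunk pattern behaves. Python's `re` is assumed here rather than the DB2 engine, and the helper name `street_name` is made up. The first capture group keeps the trailing space of the last word it matched, so the sketch strips it.

```python
import re

# Pattern from the answer: number chunk, then one or more middle words
# (captured in group 1), then a final word such as Road/Ave/Blvd.
CHUNK_RE = re.compile(r"^[\w\/]+?\s((\w+\s)+?)\s*\w+?$")


def street_name(address):
    """Return the middle chunk without its trailing space, or None."""
    match = CHUNK_RE.match(address)
    return match.group(1).strip() if match else None


for address in ("101 Main Street", "2/b Pleasant Ave", "12345 BERNARDO CNTR DRIVE"):
    print(address, "=", street_name(address))
```

Since `\w` does not match a hyphen, an address such as "190 - 192 Tweety-bird avenue" does not match at all in this sketch.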
The following regex matches your examples:
(?<=[^ ]+ )[^ ]*[ ]

Match 'State' text if written out or 2 digit?

New user, not much experience with regex.
I cannot match the state in the following text:
Lead status
53 W 70th St
New York, New York 10023
I have the following code
Lead status\s+(\w+.)\v+(\w+[^,]+).(\d{5})
I'd like to end up with
Street Address,
City,
State (whether spelled out or 2 digits),
Zip
I am only getting
Street Address,
City,
Zip
I appreciate any help.
Cheers

SAS Regex code to capture Business Address from 10-K company filings

Consider the following EDGAR 10-K SEC Company Filing
https://www.sec.gov/Archives/edgar/data/912382/000136231009004179/0001362310-09-004179.txt
BUSINESS ADDRESS:
STREET 1: 107 N PENNSYLVANIA ST
STREET 2: STE 600
CITY: INDIANAPOLIS
STATE: IN
ZIP: 46204
BUSINESS PHONE: 3172619000
MAIL ADDRESS:
STREET 1: 107 N PENNSYLVANIA ST
STREET 2: STE 600
CITY: INDIANAPOLIS
STATE: IN
ZIP: 46204
I need a regex in SAS to capture the fields STREET 1, STREET 2, CITY, STATE and ZIP under the Business Address, but not the Mailing Address. For example for STREET 1, I use STREET\s2\s*(.*) in SAS, but it ends up capturing the STREET 1 for Mailing address. Thanks!
This regex should work.
BUSINESS ADDRESS:\s*STREET\s1:\s*(.*)\s*STREET\s2:\s*(.*)
You can continue the pattern to capture each field you need in a new pair of parentheses. Basically you're just making sure that you get the first value after BUSINESS ADDRESS. The problem with the pattern you were using is that it can match in two separate locations, once under BUSINESS ADDRESS and once under MAIL ADDRESS, and nothing in it says which one you want. Anchoring the pattern on BUSINESS ADDRESS: specifies that.
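Continuing the pattern through CITY, STATE and ZIP might look like the following Python sketch. It is written in Python rather than SAS purely to show the shape of the extended regex. The BUSINESS ADDRESS values come from the question; the MAIL ADDRESS values are made up so the two blocks can be told apart.

```python
import re

# Header excerpt shaped like the one in the question. The MAIL ADDRESS
# values below are hypothetical.
HEADER = """\
BUSINESS ADDRESS:
STREET 1: 107 N PENNSYLVANIA ST
STREET 2: STE 600
CITY: INDIANAPOLIS
STATE: IN
ZIP: 46204
BUSINESS PHONE: 3172619000
MAIL ADDRESS:
STREET 1: 1 EXAMPLE PLAZA
STREET 2: FLOOR 2
CITY: SPRINGFIELD
STATE: IL
ZIP: 62701
"""

# The answer's pattern, extended with one capture group per remaining field.
BUSINESS_RE = re.compile(
    r"BUSINESS ADDRESS:\s*STREET\s1:\s*(.*)\s*STREET\s2:\s*(.*)"
    r"\s*CITY:\s*(.*)\s*STATE:\s*(.*)\s*ZIP:\s*(.*)"
)

match = BUSINESS_RE.search(HEADER)
street1, street2, city, state, zip_code = match.groups()
print(street1, street2, city, state, zip_code, sep=" | ")
```

Each `(.*)` stops at the end of its line because `.` does not match a newline by default in Python, and the following `\s*` then skips the line break. The sketch assumes every field, including STREET 2, is present.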
In SAS you can use the prxposn function, with the second argument indicating the capture buffer (parenthesized group) to retrieve, after a successful prxmatch call on the same pattern. For example:
address1=prxposn(regex_pattern, 1, edgar10);
Best.