I am new to regex.
I have a String formatted like below
Street Name
City, StateCode ZipNumber
for example, the string can be like
50 Connecticut Avenue
Norwalk, CT 06850
or
123 6th Avenue
New York, NY 10013
or
4TH Highway 6
Rule, TX 79547
I am trying to construct a regex here.
But cannot proceed as I have a little idea about regex.
Can you please help me?
The following might be enough :
^(?<Street>[^\n]+)\n(?<City>[^,]+), (?<StateCode>[A-Z]{2}) (?<Zip>\d+)$
It captures the following segments in different groups :
the first line in a group named Street
the part of the second line which precedes the comma in a group named City
the next two capital letters in a group named StateCode
the following digits in a group named Zip
Related
This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after them. Unfortunately there are cases, where the country after the to-be-extracted street information is only separated by a single space from the country, like e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET and the Country, separated by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*) match all characters from start of line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
I have a list of addresses and I would like to have a regular expression that is able to capture just the name of the street without the street type, address number, or cardinal direction. There are some errors in formatting but all characters are in capital letters. So,
2038 W MAIN AVE
2038QWEW S JEFFERSON AVENUE
33 NORTH CALIFORNIA STREET
53371 SOUTH WASHINGTON
53371 S WASHINGTON AVENUE
1600 E PENNSYLVANIA AVE
WEST9 67ST ST
E171 N 23RD STREET
G171 N121ST STREET
ought to return
MAIN
JEFFERSON
CALIFORNIA
WASHINGTON
WASHINGTON
PENNSYLVANIA
67ST
23RD
121ST
So far I've got
([^ W ]|[^ E ]|[^ S ]|[^ N ])([0-9])*([A-Z]+)[^ ]
But I can't seem to only capture the first match that occurs after the street number. I feel like I need the standard greedy operators (i.e. ?, *, or +) but I can't figure out how to incorporate them.
These two links have taken me close:
Matching on every second occurence
Simple regex for street address
For the output what you want from the given (address) input, this regex will surely help: [\pL\pN]+(?=\h+[\pL\pN]+$)
This regex will match the second last word in your line where a word is "1 or more any letter or digit in any language".
For reference you could https://superuser.com/questions/1361759/matching-second-last-word-in-sentence-through-regular-expression
Logic: we are looking for the second last word (set of characters) + possible border with the symbol N
^.*?\s[N]{0,1}([-a-zA-Z0-9]+)\s*\w*$
Res:
Match 1
Full match 0-15 `2038 W MAIN AVE`
Group 1. 7-11 `MAIN`
Match 2
Full match 16-43 `2038QWEW S JEFFERSON AVENUE`
Group 1. 27-36 `JEFFERSON`
Match 3
Full match 44-70 `33 NORTH CALIFORNIA STREET`
Group 1. 53-63 `CALIFORNIA`
Match 4
Full match 71-93 `53371 SOUTH WASHINGTON`
Group 1. 83-93 `WASHINGTON`
Match 5
Full match 94-119 `53371 S WASHINGTON AVENUE`
Group 1. 102-112 `WASHINGTON`
Match 6
Full match 120-143 `1600 E PENNSYLVANIA AVE`
Group 1. 127-139 `PENNSYLVANIA`
Match 7
Full match 144-157 `WEST9 67ST ST`
Group 1. 150-154 `67ST`
Match 8
Full match 158-176 `E171 N 23RD STREET`
Group 1. 165-169 `23RD`
Match 9
Full match 177-195 `G171 N121ST STREET`
Group 1. 183-188 `121ST`
https://regex101.com/r/m2rmUQ/4
I was able to figure this out in a slightly different way
[0-9A-Z]* [0-9A-Z]*$
and then I simply split the string it created by the space. Maybe one or two steps too many but it's transparent
I have an excel spreadsheet and within its contents it is formatted like -
Street Name, Street Number Street Direction(may not be present represented be an NSWE)
So it could look like John Doe Ave, 900 E or Jane Doe DR, 100
However, the people who used this spreadsheet put business names or other information that shouldn't be present
The part I'm stuck at is using regex patterns I'm not familiar with it and it confuses me
I have this variable
Dim strPattern As String: strPattern = "^(.+),\s(\d+)\s([NWSEnwse])"
So, I have this its working SLIGHTLY I wanted to know what changes I could make to this so it would include or exlude NWSEnwse, because right now it detects the address only when street direction is present
You may use this regex pattern to match it.
^(.+),\s+(\d+)(\s+[NWSEnwse])?
The ? at the end signifies that that part is optional.
I also replaced \s with \s+ to account for any extra spaces that might have crept in.
One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA
This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
See demo
The negative look ahead includes a word boundary to allow suburbs that start with a state abbreviation (see demo).
Expanding on Bohemian's answer, you can use groupings to do a REGEXP REPLACE in alteryx. So:
REGEX_Replace([Field1], "(.*)(\VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This will grab anything that matches in the first group (so just the suburb). The second and third groups match the state and the zip. Not a perfect regex, but should get you most of the way there.
I think this workflow will help you :
I need to extract name, street1, street2, city, state, zip
I have data in this form
JOHN m SMITH [1111 WEST OAK ROAD, SUITE 101, CITY, ST 55555]
GEORGE m JONES [222 MAIN STREET, CITY, ST 55555]
My results for JOHN should be
name="JOHN m SMITH"
street1="1111 WEST OAK ROAD"
street2="SUITE 101"
city = "CITY"
state = "ST"
zip = "55555"
This works with GEORGE's data
Regex r = new Regex(#"^(?<name>.*)\[(?<street>.*)[,]\s(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$");
var match = r.Match(fullNameAndAddress);
name = match.Groups["name"].Value;
street = match.Groups["street"].Value;
city = match.Groups["city"].Value;
state = match.Groups["state"].Value;
zip = match.Groups["zip"].Value;
How do I add the optional street2?
I want 1 and only 1 "street" group. I thought it should have this: (....){1}?
street2 is optional zero or 1 times. I thought it should have this (...)?
but it doesn't work with JOHN's data, both street1 & street2 are going into the street group:
^(?<name>.*)\[((?<street>.*)[,]\s){1}?((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.*)\s(?<zip>\d{5})\]$
Could you clarify what you want stored in street?
Do you want John's to look like '1111 WEST OAK ROAD, SUITE 101'?
Or do you want to stuff it into some variable you wont be using, so that street looks like '1111 WEST OAK ROAD'?
Edit: With clarification, check out this link
http://rubular.com/r/S4HaTMVFZl
What happens here I believe is that the * is greedy, grabbing as much as it can before finding the final occurence of [,]\s
Adding a ? after the .* makes it lazy, grabbing the least information possible.
The amended regex looks like this
^(?<name>.*)\[((?<street>.*?)[,]\s)((?<street2>.*)[,]\s)?(?<city>.*)[,]\s(?<state>.{2})\s(?<zip>\d{5})\]$
You'll notice I changed the Regex for state from .* to .{2}, forcing a 2-character state. Feel free to revert that if you don't want it :)
I made a couple of changes to your regex in rubular.com, and it seemed to be working on both the example strings:
^(?<name>.+)\s\[(?<street>[^,]+),\s((?<street2>[^,]+),\s+)?(?<city>[^,]+),\s(?<state>.+)\s(?<zip>\d{5})\]$
street2 = match.Groups["street2"].Value;
One trick I've learned with regex's is to use the negation of the divider (eg. [^,]* for anything but a comma) instead of .*, so it's impossible to capture multiple fields with one expression. Also, the + operator, which requires at least one match, is useful in most of the groups.
Also, the additional comma is only there if there's an street2 component of the address, which indicates that the comma should be in the same capture group as the street2 part. I added an extra capture group around the street2 capture group to account for this. You can make groups non-capturing in most languages, but it didn't seem necessary.