Regex optional everything separated by space or comma (city, state) - regex

I am trying to get the street, city, state and zip from a list of addresses that is not well formed. Everything but the "street" is optional, in sequence: I can have street, street+city, street+city+state, or street+city+state+zip. Separators are either a comma + space, or a space only.
So far, I have
^(?<STREET>.*?)(?<SEPARATOR1>(?: *-{1,2} *)|(?:, ?))(?<CITY>[a-z-' ]*)?((?<SEPARATOR2>(?: )|(?:, ))(?<STATE>AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY))?((?<SEPARATOR3>(?: )|(?:, ))(?<ZIP>[0-9]{5}(-[0-9]{4})?))?
I am having trouble getting a capture after the CITY capture if it's only separated by a space.
Test data:
123 Ave Ave - Hoquiam WA 98103
123 Ave Ave - Hoquiam, WA 98103
123 Ave Ave - Hoquiam, WA 98103-1345
123 Ave Ave - Hoquiam
123 Ave Ave - Ocean Shores WA
123 Ave Ave - Ocean Shores, WA
123 Ave Ave - D'ile, WA
123 Ave Ave
What am I doing wrong?
https://regex101.com/r/v476Gx/1

With some tweaking, the following updated regex should work for you:
^(?<STREET>.*?)(?:(?<SEPARATOR1>(?: *-{1,2} *)|(?:, ?))(?<CITY>[a-z-' ]*?)?((?<SEPARATOR2>(?: )|(?:, ))(?<STATE>AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY))?((?<SEPARATOR3>(?: )|(?:, ))(?<ZIP>[0-9]{5}(?:-[0-9]{4})?))?)?$
Updated RegEx Demo
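To check the updated pattern against the test data outside regex101, here is a minimal Python sketch. It is an adaptation rather than the answer's exact regex: Python's re module spells named groups as (?P<NAME>...), the city class is widened to [A-Za-z' -] on the assumption that the regex101 demo ran with the case-insensitive flag, and the helper name parse_address is made up for illustration.

```python
import re

# State/territory codes copied from the pattern in the question.
STATES = (
    "AL|AK|AS|AZ|AR|CA|CO|CT|DE|DC|FM|FL|GA|GU|HI|ID|IL|IN|IA|KS|KY|LA|ME|MH|"
    "MD|MA|MI|MN|MS|MO|MT|NE|NV|NH|NJ|NM|NY|NC|ND|MP|OH|OK|OR|PW|PA|PR|RI|SC|"
    "SD|TN|TX|UT|VT|VI|VA|WA|WV|WI|WY"
)

ADDRESS_RE = re.compile(
    r"^(?P<STREET>.*?)"                       # lazy: stop at the first separator
    r"(?:"                                    # everything after the street is optional
    r"(?P<SEPARATOR1> *-{1,2} *|, ?)"
    r"(?P<CITY>[A-Za-z' -]*?)"                # lazy, so a trailing " WA" is not swallowed
    rf"(?:(?P<SEPARATOR2> |, )(?P<STATE>{STATES}))?"
    r"(?:(?P<SEPARATOR3> |, )(?P<ZIP>[0-9]{5}(?:-[0-9]{4})?))?"
    r")?$"                                    # the end anchor forces the lazy parts to expand
)


def parse_address(line):
    """Return the STREET/CITY/STATE/ZIP groups for one line (hypothetical helper)."""
    m = ADDRESS_RE.match(line)
    if m is None:
        return None
    return {key: m.group(key) for key in ("STREET", "CITY", "STATE", "ZIP")}


for sample in (
    "123 Ave Ave - Hoquiam WA 98103",
    "123 Ave Ave - Ocean Shores WA",
    "123 Ave Ave - D'ile, WA",
    "123 Ave Ave",
):
    print(sample, "->", parse_address(sample))
```

The two changes that matter are the lazy quantifier on CITY and the closing $: because the city class contains a space, a greedy city would consume " WA" itself, and since the state and zip groups are optional nothing would force it to give those characters back.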

While you have your answer, this is probably more readable/maintainable:
^
(?P<street>[^-\n]+)
(?:-\h*)?
(?P<town>(?:(?!\b[A-Z]{2}\b).)*)
(?P<state>\b[A-Z]{2}\b)?\h*
(?P<zip>[-\d]*)
$
See a demo on regex101.com. It just needs a bit of cleaning on the town part.

Related

Using Regex in SOLR Query

I have a data set of street names and numbers which I need to search.
eg. 12 HILL STREET
12A HILL STREET
12B HILL STREET
123 HILL STREET
12 HILARY STREET
If I search as follows q=(street_name:12\ HILL*), I get
12 HILL STREET
I want to obtain the following results:
12 HILL STREET
12A HILL STREET
12B HILL STREET
Is there a way to query in SOLR to return the results as the above example shows?
I have tried querying as:
q=(street_name:/12[A-Z]\ HILL*/)
but don't get anything back.
You can use
q=(street_name:/12[A-Z]* HILL.*/)
Here, the pattern means
12 - string starts with 12
[A-Z]* - zero or more ASCII uppercase letters
" " - a space
HILL - HILL char sequence
.* - any zero or more chars other than line break chars as many as possible (so, the rest of the line).
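The pattern can be tried against the sample street names with a small Python sketch. This is only an approximation: Python's re module is not the Lucene regex engine that Solr uses, fullmatch stands in for Lucene matching the whole term, and the sketch assumes street_name is indexed as a single untokenized string.

```python
import re

streets = [
    "12 HILL STREET",
    "12A HILL STREET",
    "12B HILL STREET",
    "123 HILL STREET",
    "12 HILARY STREET",
]

# Same pattern as in the Solr query, without the surrounding slashes.
pattern = re.compile(r"12[A-Z]* HILL.*")

# fullmatch requires the whole value to match, not just a prefix.
matches = [s for s in streets if pattern.fullmatch(s)]
print(matches)
```

Under the same assumptions, the original attempt 12[A-Z]\ HILL* fails for two reasons: [A-Z] demands exactly one letter, and HILL* means "HIL followed by zero or more L", which leaves " STREET" unmatched.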

Regular expression working in Pythex.com but not in pandas

I'm having trouble applying a regex function to a column in a python dataframe. It works fine in Pythex online editor.
Here is the head of my dataframe -
ID
Text
1
UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE NORTH CAROLINA 28605
2
IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262
3
COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK 29875
I've tried the following code to create another column which gives us just the address, but the new column is showing up as empty.
df['Address']=df['Text'].str.findall('[0-9]{2,6}(?:\s+\S+){3,8}\s{1,}\b(?:FLORIDA|NORTH CAROLINA|NEW YORK)\b')
The desired output should look like -
ID
Text
Address
1
UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE NORTH CAROLINA 28605
25088 KITTAN DRIVE NORTH CAROLINA
2
IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262
26 W STREET 7TH HIGHWAY ORLANDO FLORIDA
3
COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK 29875
757979 EAST TYRON BLVD NEW YORK NEW YORK
Thanks in advance.
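If your text data follows the pattern in these examples, you can try the following code. The capture group has to include the leading digits, otherwise findall returns only the text after the house number:
df['Address']=df['Text'].str.findall(r'([0-9]{2,6}.*?)\s*\d+$')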
You could use a pattern to extract the values that you want from column Text:
\b([0-9]{2,6}\b.*?(?:FLORIDA|NORTH CAROLINA|NEW YORK)) \d
The pattern matches:
\b A word boundary to prevent a partial word match
( Capture group 1
[0-9]{2,6}\b Match 2-6 digits followed by a word boundary
.*?(?:FLORIDA|NORTH CAROLINA|NEW YORK) Match as few chars as possible until you can match one of the alternatives
) \d Close group 1, and match a space and a digit
See a regex demo.
For example
import pandas as pd
items = [
[1, "UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE NORTH CAROLINA 28605"],
[2, "IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262"],
[3, "COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK 29875"]
]
df = pd.DataFrame(items, columns=["ID", "Text"])
df["Address"] = df["Text"].str.extract(
r'\b([0-9]{2,6}\b.*?(?:FLORIDA|NORTH CAROLINA|NEW YORK)) \d'
)
print(df)
Output
ID Text Address
0 1 UMM SURE THE ADDRESS IS IN 25088 KITTAN DRIVE ... 25088 KITTAN DRIVE NORTH CAROLINA
1 2 IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORI... 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA
2 3 COOL 757979 EAST TYRON BLVD NEW YORK NEW YORK ... 757979 EAST TYRON BLVD NEW YORK NEW YORK
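A likely reason the original attempt came back empty, although the answers above do not say so explicitly, is that its pattern was not a raw string: in a normal Python string literal '\b' is the backspace character, not a regex word boundary. The sketch below only illustrates that difference on one of the sample rows; adding the r prefix may not be the only change the original pattern needs.

```python
import re

text = "IT IS ON 26 W STREET 7TH HIGHWAY ORLANDO FLORIDA 28262"

plain = '\bFLORIDA\b'    # '\b' becomes a single backspace character (0x08)
raw = r'\bFLORIDA\b'     # backslash + 'b' reach the regex engine as a word boundary

print(len('\b'), len(r'\b'))      # one character vs. two characters
print(re.findall(plain, text))    # no backspace in the text, so nothing is found
print(re.findall(raw, text))      # the word boundary version finds the state
```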

RegEx: Insert Double Quote to the left of third comma from the right [duplicate]

I have an issue with address data in a terrible format. The last three comma-separated fields are city, country, and postal code, but the address part before them can itself contain a couple of commas.
Example:
123 Oak St.,Appt 3,Suite 15,Paris,France,1234342
243 Oak St. Apt 4,New York,United States,12345
I would like to put a double quote before the third comma from the right.
Like this:
123 Oak St.,Appt 3,Suite 15",Paris,France,1234342
243 Oak St. Apt 4",New York,United States,12345
Then I can insert a double quote at the beginning of each line, as follows.
Find:
\r\n
Replace:
\r\n"
Final output:
"123 Oak St.,Appt 3,Suite 15",Paris,France,1234342
"243 Oak St. Apt 4",New York,United States,12345
Any help with this problem is greatly appreciated.
You can do this with Regex in JavaScript with the following example.
The important part is the look-ahead (?=(,[^,]*){3}$), which requires that the match be followed by three comma-prefixed fields running up to the end of the line. The first capture group (.+) matches everything before that.
Once you have this capture group you can wrap it up in double quotes with input.replace(expression, '"$1"').
const input = '123 Oak St.,Appt 3,Suite 15,Paris,France,1234342\n243 Oak St. Apt 4,New York,United States,12345';
const expression = /^(.+)(?=(,[^,]*){3}$)/gm;
console.log(input.replace(expression, '"$1"'));
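For comparison, the same look-ahead idea can be sketched in Python. This is an adaptation, not part of the original answer: the inner group is made non-capturing, and the field class is written as [^,\n] so that a field can never run across a line break.

```python
import re

text = (
    "123 Oak St.,Appt 3,Suite 15,Paris,France,1234342\n"
    "243 Oak St. Apt 4,New York,United States,12345"
)

# Group 1 is everything before the last three comma-prefixed fields of a line.
pattern = re.compile(r"^(.+)(?=(?:,[^,\n]*){3}$)", re.MULTILINE)

# Wrap group 1 in double quotes and leave the three trailing fields untouched.
quoted = pattern.sub(r'"\1"', text)
print(quoted)
```

Because the look-ahead does not consume the trailing fields, only the captured address part is replaced.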

Regex for city and street name

Hi, I am looking for 2 regex which describe:
1) a valid name of a street
2) a valid name of a city
Valid street names are:
Mainstreet.
Mainstreet
Main Street
Big New mainstreet
Mainstreet-New
Mains Str.
St. Alexander Street
abcÜüßäÄöÖàâäèéêëîï ôœùûüÿçÀÂ-ÄÈÉÊËÎÏÔŒÙÛÜŸÇ.
John Kennedy Street
Not valid street names are:
Mainstreet #+;:_*´`?=)(/&%$§!
Mainstreet#+;:_*´`?=)(/&%$§!
Mainstreet 2
Mainstreet..
Mainstreet§
Valid cities are:
Edinôœùûüÿ
Berlin.
St. Petersburg
New-Berlin
Aue-Bad Schlema
Frankfurt am Main
Nürnberg
Ab
New York CityßäÄöÖàâäèéêëîïôœùûüÿçÀÂ-ÄÈÉÊËÎÏÔŒÙÛÜŸ
Not valid cities are:
Edingburgh 123
Edingburg123
St. Andrews 12
Berlin,#+;:_*´`?=)(/&%$§!
Berlin__
The solution that I have at the moment comes very close but does not match perfectly:
For city and street name:
^[^\W\d_]+(?:[-\s][^\W\d_]+)*[.]?$
Unfortunately no match for these examples (the rest works fine):
St. Alexander Street
St. Petersburg
If you have simpler solutions, I am happy to learn something new! :-)
To make it match St. Alexander Street and St. Petersburg, you just need to add an optional dot after the letter matching patterns:
^[^\W\d_]+\.?(?:[-\s][^\W\d_]+\.?)*$
#         ^^^                 ^^^
See the regex demo.
Also, it might make sense to add a single apostrophe to the regex:
^[^\W\d_]+\.?(?:[-\s'’][^\W\d_]+\.?)*$
See the regex demo.
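A quick way to check the apostrophe-aware pattern against some of the examples from the question is a small Python sketch. The function name is_valid_name is hypothetical, and Python's re is assumed here because its \W, \d and \s are Unicode-aware for str patterns.

```python
import re

# Letters, an optional dot after each word, words joined by hyphen, whitespace or apostrophe.
NAME_RE = re.compile(r"^[^\W\d_]+\.?(?:[-\s'’][^\W\d_]+\.?)*$")


def is_valid_name(name):
    """Return True if the street or city name matches the pattern (hypothetical helper)."""
    return NAME_RE.match(name) is not None


valid = [
    "Mainstreet.", "Main Street", "Mainstreet-New", "Mains Str.",
    "St. Alexander Street", "St. Petersburg", "Aue-Bad Schlema", "Nürnberg",
]
invalid = [
    "Mainstreet 2", "Mainstreet..", "Mainstreet§",
    "Edingburg123", "St. Andrews 12", "Berlin__",
]

for name in valid + invalid:
    print(repr(name), is_valid_name(name))
```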

extract comma separated strings

I have a data frame as below. This sample looks uniform, but the full data set is not very uniform:
locationid address
1073744023 525 East 68th Street, New York, NY 10065, USA
1073744022 270 Park Avenue, New York, NY 10017, USA
1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA
1073744024 1251 Avenue of the Americas, New York, NY 10020, USA
1073744021 1301 Avenue of the Americas, New York, NY 10019, USA
1073744026 44 West 45th Street, New York, NY 10036, USA
I need to find the city and country name from this address. I tried the following:
1) strsplit
This gives me a list but I cannot access the last or third last element from this.
2) Regular expressions
finding country is easy
str_sub(str_extract(address, "\\d{5},\\s.*"),8,11)
but for city
str_sub(str_extract(address, ",\\s.+,\\s.+\\d{5}"),3,comma_pos)
I cannot find comma_pos as it leads me to the same problem again.
I believe there is a more efficient way to solve this using any of the above approaches.
Try this code:
library(gsubfn)
cn <- c("Id", "Address", "City", "State", "Zip", "Country")
pat <- "(\\d+) (.+), (.+), (..) (\\d+), (.+)"
read.pattern(text = Lines, pattern = pat, col.names = cn, as.is = TRUE)
giving the following data.frame from which it's easy to pick off components:
Id Address City State Zip Country
1 1073744023 525 East 68th Street New York NY 10065 USA
2 1073744022 270 Park Avenue New York NY 10017 USA
3 1073744025 Rockefeller Center, 50 Rockefeller Plaza New York NY 10020 USA
4 1073744024 1251 Avenue of the Americas New York NY 10020 USA
5 1073744021 1301 Avenue of the Americas New York NY 10019 USA
6 1073744026 44 West 45th Street New York NY 10036 USA
Explanation It uses this pattern (when within quotes the backslashes must be doubled):
(\d+) (.+), (.+), (..) (\d+), (.+)
explained in words as follows:
"(\\d+)" - one or more digits (representing the Id) followed by
" " a space followed by
"(.+)" - any non-empty string (representing the Address) followed by
", " - a comma and a space followed by
"(.+)" - any non-empty string (representing the City) followed by
", " - a comma and a space followed by
"(..)" - two characters (representing the State) followed by
" " - a space followed by
"(\\d+)" - one or more digits (representing the Zip) followed by
", " - a comma and a space followed by
"(.+)" - any non-empty string (representing the Country)
It works because regular expression quantifiers are greedy: each one tries to match the longest possible string, backtracking whenever a subsequent portion of the regular expression fails to match.
The advantage of this approach is that the regular expression is quite simple and straightforward, and the entire code is quite concise, as one read.pattern statement does it all.
Note: We used this for Lines:
Lines <- "1073744023 525 East 68th Street, New York, NY 10065, USA
1073744022 270 Park Avenue, New York, NY 10017, USA
1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA
1073744024 1251 Avenue of the Americas, New York, NY 10020, USA
1073744021 1301 Avenue of the Americas, New York, NY 10019, USA
1073744026 44 West 45th Street, New York, NY 10036, USA"
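The greedy backtracking described above can be observed with the same pattern in Python. This is a sketch using re.fullmatch on the one sample line whose address contains an extra comma; it is not a port of read.pattern.

```python
import re

line = "1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA"

# Same pattern as above; single backslashes suffice in a raw string.
pat = re.compile(r"(\d+) (.+), (.+), (..) (\d+), (.+)")

m = pat.fullmatch(line)
loc_id, address, city, state, zip_code, country = m.groups()

# The first (.+) is greedy, so it keeps the extra comma inside the address
# and only gives back as much as the later groups need.
print(address)
print(city, state, zip_code, country)
```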
Split the data
ss <- strsplit(data,",")
Then
n <- sapply(ss,length)
will give the number of elements (so you can work backward). Then
mapply("[[",ss,n)
gives you the last element. Or you could do
sapply(ss,tail,1)
to get the last element.
To get the second-to-last (or more generally) you need
sapply(ss,function(x) tail(x,2)[1])
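The same "split and count from the end" idea can be sketched in Python, where negative indices play the role of tail. The function name city_and_country is made up for illustration, and the sketch assumes every address ends with ", city, state zip, country".

```python
def city_and_country(address):
    """Split on ', ' and pick fields by position from the end (hypothetical helper)."""
    parts = address.split(", ")
    # parts[-1] is the country, parts[-2] is "state zip", parts[-3] is the city.
    return parts[-3], parts[-1]


addresses = [
    "525 East 68th Street, New York, NY 10065, USA",
    "Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA",
]

for address in addresses:
    print(city_and_country(address))
```

Counting from the end is what makes the extra comma in the Rockefeller Center address harmless.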
Here's an approach using the tidyr package. Personally, I'd just split the whole thing into all the various elements using just the tidyr package's extract. This uses regex but in a different way than you asked for.
library(tidyr)
extract(x, address, c("address", "city", "state", "zip", "country"),
"([^,]+),\\s([^,]+),\\s+([A-Z]+)\\s+(\\d+),\\s+([A-Z]+)")
## locationid address city state zip country
## 1 1073744023 525 East 68th Street New York NY 10065 USA
## 2 1073744022 270 Park Avenue New York NY 10017 USA
## 3 1073744025 50 Rockefeller Plaza New York NY 10020 USA
## 4 1073744024 1251 Avenue of the Americas New York NY 10020 USA
## 5 1073744021 1301 Avenue of the Americas New York NY 10019 USA
## 6 1073744026 44 West 45th Street New York NY 10036 USA
Here's a visual explanation of the regular expression taken from http://www.regexper.com/:
I think you want something like this.
> x <- "1073744026 44 West 45th Street, New York, NY 10036, USA"
> regmatches(x, gregexpr('^[^,]+, *\\K[^,]+', x, perl=T))[[1]]
[1] "New York"
> regmatches(x, gregexpr('^[^,]+, *[^,]+, *[^,]+, *\\K[^\n,]+', x, perl=T))[[1]]
[1] "USA"
Regex explanation:
^ Asserts that we are at the start.
[^,]+ Matches any character except , one or more times. Change it to [^,]* if your dataframe contains empty fields.
, Matches a literal ,
<space>* Matches zero or more spaces.
\K discards previously matched characters from printing. The characters matched by the pattern following \K will be shown as output.
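Python's built-in re module has no \K, so an equivalent sketch in Python uses a capture group for the part to keep instead. This swaps in a different technique from the R answer above; the patterns are otherwise the same.

```python
import re

x = "1073744026 44 West 45th Street, New York, NY 10036, USA"

# Group 1 captures what \K would have left as the reported match.
city = re.search(r"^[^,]+, *([^,]+)", x).group(1)
country = re.search(r"^[^,]+, *[^,]+, *[^,]+, *([^\n,]+)", x).group(1)

print(city)
print(country)
```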
How about this pattern :
,\s(?<city>[^,]+?),\s(?<shortCity>[^,]+?)(?i:\d{5},)(?<country>\s.*)
This pattern will match these three groups:
"group": "city", "value": "New York"
"group": "shortCity", "value": "NY "
"group": "country", "value": " USA"
Using rex to construct the regular expression may make this type of task a little simpler.
x <- data.frame(
locationid = c(
1073744023,
1073744022,
1073744025,
1073744024,
1073744021,
1073744026
),
address = c(
'525 East 68th Street, New York, NY 10065, USA',
'270 Park Avenue, New York, NY 10017, USA',
'Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA',
'1251 Avenue of the Americas, New York, NY 10020, USA',
'1301 Avenue of the Americas, New York, NY 10019, USA',
'44 West 45th Street, New York, NY 10036, USA'
))
library(rex)
sep <- rex(",", spaces)
re <-
rex(
capture(name = "address",
except_some_of(",")
),
sep,
capture(name = "city",
except_some_of(",")
),
sep,
capture(name = "state",
uppers
),
spaces,
capture(name = "zip",
some_of(digit, "-")
),
sep,
capture(name = "country",
something
))
re_matches(x$address, re)
#> address city state zip country
#>1 525 East 68th Street New York NY 10065 USA
#>2 270 Park Avenue New York NY 10017 USA
#>3 50 Rockefeller Plaza New York NY 10020 USA
#>4 1251 Avenue of the Americas New York NY 10020 USA
#>5 1301 Avenue of the Americas New York NY 10019 USA
#>6 44 West 45th Street New York NY 10036 USA
This regular expression will also handle 9 digit zip codes (12345-1234) and countries other than USA.