I am new to regular expressions, so please bear with me. I need to match only the town, region name and country name using a regex. Below is a sample from my dataset:
1 Cliff Street ; Fremantle, Western Australia ; AUSTRALIA
10 Montpelier Square, London SW7 1JU ;,; UNITED KINGDOM
125 Hay Street ; East Perth, Western Australia ; AUSTRALIA
1395 Brickell Ave 3404, Miami, FL 33131 ;,; USA
14 Save Ljuboje ; Banja Luka,; BOSNIA AND HERZEGOVINA
15 Grosvenor Street ; Beaconsfield, Western Australia ; AUSTRALIA
151 Royal Street, 2nd Floor ; East Perth, Western Australia ; AUSTRALIA
168-170 St Georges Terrace ; Perth, Western Australia ; AUSTRALIA
184 Bennet Street ; East Perth, Western Australia ; AUSTRALIA
189 Royal Street ; East Perth, Western Australia ; AUSTRALIA
197 St Georges Terrace ; Perth, Western Australia ; AUSTRALIA
Example: from 1 Cliff Street ; Fremantle, Western Australia ; AUSTRALIA I want only Fremantle, Western Australia ; AUSTRALIA, without the street address in front. This is just a sample of my dataset; I want only the last three parts (town, region, country) of each row. It would be great if anyone could help me.
You could use capturing groups for this...
(.*);(.*);(.*)
That regex splits the string into 3 groups. How you access the groups from the match object depends on your language's regex library.
As @sin suggested, a better approach would probably be to just split the string on the ; character. Search for "string splitting" to see how it is done in your language. Using a regex overcomplicates this problem.
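Since the question does not name a language, here is a minimal sketch of both suggestions in Python. The function names `last_fields` and `last_fields_regex` are made up for illustration, and stripping the surrounding whitespace is an assumption about what the asker wants.

```python
import re

def last_fields(row):
    """Split on ';' and return (town/region, country) with whitespace trimmed."""
    parts = [p.strip() for p in row.split(";")]
    # Everything before the last two fields is the street address.
    return parts[-2], parts[-1]

def last_fields_regex(row):
    """Same result via the three capturing groups (.*);(.*);(.*)."""
    m = re.fullmatch(r"(.*);(.*);(.*)", row)
    if m is None:
        return None  # the row does not contain two ';' separators
    return m.group(2).strip(), m.group(3).strip()

if __name__ == "__main__":
    row = "1 Cliff Street ; Fremantle, Western Australia ; AUSTRALIA"
    print(last_fields(row))
    print(last_fields_regex(row))
```

Both functions return the same pair for rows with exactly two separators; the split version is shorter and has no pattern to maintain.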
If you want to match them, use this regex:
[0-9a-zA-Z\s,]+;[0-9a-zA-Z\s]+$
Demo: https://regex101.com/r/cF1gW4/1
EDIT
If you want to keep them and remove the first part of the address instead, in SublimeText replace this:
^[0-9a-zA-Z\s,-]+;\s?
with nothing.
Demo: https://regex101.com/r/cF1gW4/3
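The same replace-with-nothing idea can be sketched outside an editor. This Python version is only an illustration: the helper name `strip_street` is hypothetical, and the character class here includes 0 and a hyphen, which rows such as "168-170 St Georges Terrace" need.

```python
import re

# Assumed pattern: letters, digits 0-9, whitespace, commas and hyphens,
# up to and including the first ';' and one optional space after it.
STREET_PREFIX = re.compile(r"^[0-9a-zA-Z\s,-]+;\s?")

def strip_street(row):
    """Remove the leading street part; rows without a match are returned unchanged."""
    return STREET_PREFIX.sub("", row, count=1)

if __name__ == "__main__":
    for row in [
        "1 Cliff Street ; Fremantle, Western Australia ; AUSTRALIA",
        "168-170 St Georges Terrace ; Perth, Western Australia ; AUSTRALIA",
    ]:
        print(strip_street(row))
```

Rows whose town and region are empty, such as the Miami one, are reduced to ",; USA", so a follow-up clean-up step may still be needed for those.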
I have a data frame as shown below. This sample has uniform-looking patterns, but the full data set is not very uniform:
locationid address
1073744023 525 East 68th Street, New York, NY 10065, USA
1073744022 270 Park Avenue, New York, NY 10017, USA
1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA
1073744024 1251 Avenue of the Americas, New York, NY 10020, USA
1073744021 1301 Avenue of the Americas, New York, NY 10019, USA
1073744026 44 West 45th Street, New York, NY 10036, USA
I need to find the city and country name from this address. I tried the following:
1) strsplit
This gives me a list but I cannot access the last or third last element from this.
2) Regular expressions
finding country is easy
str_sub(str_extract(address, "\\d{5},\\s.*"),8,11)
but for city
str_sub(str_extract(address, ",\\s.+,\\s.+\\d{5}"),3,comma_pos)
I cannot find comma_pos, as that leads me back to the same problem.
I believe there is a more efficient way to solve this using either of the above approaches.
Try this code:
library(gsubfn)
cn <- c("Id", "Address", "City", "State", "Zip", "Country")
pat <- "(\\d+) (.+), (.+), (..) (\\d+), (.+)"
read.pattern(text = Lines, pattern = pat, col.names = cn, as.is = TRUE)
giving the following data.frame, from which it's easy to pick off components:
          Id                                  Address     City State   Zip Country
1 1073744023                     525 East 68th Street New York    NY 10065     USA
2 1073744022                          270 Park Avenue New York    NY 10017     USA
3 1073744025 Rockefeller Center, 50 Rockefeller Plaza New York    NY 10020     USA
4 1073744024              1251 Avenue of the Americas New York    NY 10020     USA
5 1073744021              1301 Avenue of the Americas New York    NY 10019     USA
6 1073744026                      44 West 45th Street New York    NY 10036     USA
Explanation: It uses this pattern (inside a quoted R string the backslashes must be doubled):
(\d+) (.+), (.+), (..) (\d+), (.+)
visualized via the following Debuggex railroad diagram (for more, see this Debuggex Demo):
and explained in words as follows:
"(\\d+)" - one or more digits (representing the Id) followed by
" " - a space followed by
"(.+)" - any non-empty string (representing the Address) followed by
", " - a comma and a space followed by
"(.+)" - any non-empty string (representing the City) followed by
", " - a comma and a space followed by
"(..)" - two characters (representing the State) followed by
" " - a space followed by
"(\\d+)" - one or more digits (representing the Zip) followed by
", " - a comma and a space followed by
"(.+)" - any non-empty string (representing the Country)
It works because the quantifiers in a regular expression are greedy: each one first tries to match the longest possible string, then backtracks whenever a subsequent portion of the regular expression fails to match.
The advantage of this approach is that the regular expression is quite simple and straightforward, and the entire code is quite concise, as one read.pattern statement does it all.
Note: We used this for Lines:
Lines <- "1073744023 525 East 68th Street, New York, NY 10065, USA
1073744022 270 Park Avenue, New York, NY 10017, USA
1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA
1073744024 1251 Avenue of the Americas, New York, NY 10020, USA
1073744021 1301 Avenue of the Americas, New York, NY 10019, USA
1073744026 44 West 45th Street, New York, NY 10036, USA"
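The greedy matching and backtracking described above can be checked outside R. The following is a minimal sketch in Python's re module using the same pattern; `parse_line` is a hypothetical helper name, and anchoring with `fullmatch` is my assumption rather than something read.pattern is documented to do.

```python
import re

# Same pattern as in the answer; in a raw Python string the backslashes stay single.
PATTERN = re.compile(r"(\d+) (.+), (.+), (..) (\d+), (.+)")

def parse_line(line):
    """Return (Id, Address, City, State, Zip, Country) or None if the line does not fit."""
    m = PATTERN.fullmatch(line)
    return m.groups() if m else None

if __name__ == "__main__":
    line = "1073744025 Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA"
    # The first (.+) initially grabs as much as it can, then gives characters back
    # until the rest of the pattern (city, two-letter state, zip, country) fits.
    print(parse_line(line))
```

On the Rockefeller row the address group ends up containing a comma, because that is the longest prefix for which the remaining groups can still match.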
Split the data
ss <- strsplit(data, ",")
Then
n <- sapply(ss, length)
will give the number of elements in each row (so you can work backward). Then
mapply("[[", ss, n)
gives you the last element. Or you could do
sapply(ss,tail,1)
to get the last element.
To get the second-to-last (or more generally) you need
sapply(ss,function(x) tail(x,2)[1])
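The idea of indexing from the end of each split row carries over to other languages. Here is a small Python sketch for comparison; `city_and_country` is a made-up name, and trimming the space left after each comma is an assumption about the desired output.

```python
def city_and_country(address):
    """Split on ',' and count fields from the end: country is last, city is third from last."""
    parts = [p.strip() for p in address.split(",")]
    if len(parts) < 3:
        return None  # not enough fields to locate a city
    return parts[-3], parts[-1]

if __name__ == "__main__":
    print(city_and_country("Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA"))
    print(city_and_country("44 West 45th Street, New York, NY 10036, USA"))
```

Counting from the end is what makes the extra "Rockefeller Center" field at the front harmless.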
Here's an approach using the tidyr package. Personally, I'd just split the whole thing into its various elements using tidyr's extract. This uses a regex, but in a different way than you asked for.
library(tidyr)
extract(x, address, c("address", "city", "state", "zip", "country"),
        "([^,]+),\\s([^,]+),\\s+([A-Z]+)\\s+(\\d+),\\s+([A-Z]+)")
##   locationid                     address     city state   zip country
## 1 1073744023        525 East 68th Street New York    NY 10065     USA
## 2 1073744022             270 Park Avenue New York    NY 10017     USA
## 3 1073744025        50 Rockefeller Plaza New York    NY 10020     USA
## 4 1073744024 1251 Avenue of the Americas New York    NY 10020     USA
## 5 1073744021 1301 Avenue of the Americas New York    NY 10019     USA
## 6 1073744026         44 West 45th Street New York    NY 10036     USA
Here's a visual explanation of the regular expression, taken from http://www.regexper.com/:
I think you want something like this.
> x <- "1073744026 44 West 45th Street, New York, NY 10036, USA"
> regmatches(x, gregexpr('^[^,]+, *\\K[^,]+', x, perl=T))[[1]]
[1] "New York"
> regmatches(x, gregexpr('^[^,]+, *[^,]+, *[^,]+, *\\K[^\n,]+', x, perl=T))[[1]]
[1] "USA"
Regex explanation:
^ Asserts that we are at the start of the string.
[^,]+ Matches any character other than a comma, one or more times. Change it to [^,]* if your data frame contains empty fields.
, Matches a literal comma.
<space>* Matches zero or more spaces.
\K Discards the characters matched so far from the reported match, so only the text matched by the pattern following \K is returned.
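For readers working outside R: Python's built-in re module does not support \K, so the sketch below gets the same effect with a capturing group around the part that followed \K. The helper names `second_field` and `fourth_field` are hypothetical.

```python
import re

def second_field(address):
    """Text between the first and second comma (the city in the sample row)."""
    # re.match anchors at the start, playing the role of '^'.
    m = re.match(r"[^,]+, *([^,]+)", address)
    return m.group(1) if m else None

def fourth_field(address):
    """Text after the third comma (the country in the sample row)."""
    m = re.match(r"[^,]+, *[^,]+, *[^,]+, *([^\n,]+)", address)
    return m.group(1) if m else None

if __name__ == "__main__":
    x = "1073744026 44 West 45th Street, New York, NY 10036, USA"
    print(second_field(x))
    print(fourth_field(x))
```

Because both patterns count commas from the start, a row with an extra leading field (such as the Rockefeller Center address) would shift the result by one field.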
How about this pattern:
,\s(?<city>[^,]+?),\s(?<shortCity>[^,]+?)(?i:\d{5},)(?<country>\s.*)
This pattern captures these three groups:
"group": "city", "value": "New York"
"group": "shortCity", "value": "NY "
"group": "country", "value": " USA"
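As a rough cross-check, the pattern can be tried in Python, where named groups are written `(?P<name>...)` instead of `(?<name>...)`. The wrapper `city_state_country` and the trimming of the captured whitespace are my additions, not part of the answer.

```python
import re

# Same pattern, with only the named-group syntax adapted for Python.
PATTERN = re.compile(
    r",\s(?P<city>[^,]+?),\s(?P<shortCity>[^,]+?)(?i:\d{5},)(?P<country>\s.*)"
)

def city_state_country(address):
    """Return the three named groups with surrounding whitespace trimmed, or None."""
    m = PATTERN.search(address)
    if m is None:
        return None
    return {name: value.strip() for name, value in m.groupdict().items()}

if __name__ == "__main__":
    print(city_state_country("525 East 68th Street, New York, NY 10065, USA"))
```

The untrimmed groups keep the trailing space in "NY " and the leading space in " USA", which is why the sketch strips them.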
Using rex to construct the regular expression may make this type of task a little simpler.
x <- data.frame(
  locationid = c(
    1073744023,
    1073744022,
    1073744025,
    1073744024,
    1073744021,
    1073744026
  ),
  address = c(
    '525 East 68th Street, New York, NY 10065, USA',
    '270 Park Avenue, New York, NY 10017, USA',
    'Rockefeller Center, 50 Rockefeller Plaza, New York, NY 10020, USA',
    '1251 Avenue of the Americas, New York, NY 10020, USA',
    '1301 Avenue of the Americas, New York, NY 10019, USA',
    '44 West 45th Street, New York, NY 10036, USA'
  ))
library(rex)
sep <- rex(",", spaces)
re <-
  rex(
    capture(name = "address",
      except_some_of(",")
    ),
    sep,
    capture(name = "city",
      except_some_of(",")
    ),
    sep,
    capture(name = "state",
      uppers
    ),
    spaces,
    capture(name = "zip",
      some_of(digit, "-")
    ),
    sep,
    capture(name = "country",
      something
    ))
re_matches(x$address, re)
#>                      address     city state   zip country
#>1        525 East 68th Street New York    NY 10065     USA
#>2             270 Park Avenue New York    NY 10017     USA
#>3        50 Rockefeller Plaza New York    NY 10020     USA
#>4 1251 Avenue of the Americas New York    NY 10020     USA
#>5 1301 Avenue of the Americas New York    NY 10019     USA
#>6         44 West 45th Street New York    NY 10036     USA
This regular expression will also handle 9 digit zip codes (12345-1234) and countries other than USA.
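To illustrate the claim about ZIP+4 codes and other countries, here is a hand-written Python approximation of the pattern built above. The translation of each rex piece into plain regex syntax is my assumption about what rex generates, `parse_address` is a hypothetical name, and the test address with "10065-1234" and "Canada" is invented purely for the demonstration.

```python
import re

# Approximate plain-regex equivalent of the rex pattern (an assumption):
# address and city are runs of non-comma characters, state is uppercase letters,
# zip is digits and hyphens, country is whatever remains.
PATTERN = re.compile(
    r"(?P<address>[^,]+),\s+(?P<city>[^,]+),\s+"
    r"(?P<state>[A-Z]+)\s+(?P<zip>[\d-]+),\s+(?P<country>.+)"
)

def parse_address(address):
    """Return a dict of the five named fields, or None if the address does not fit."""
    m = PATTERN.search(address)
    return m.groupdict() if m else None

if __name__ == "__main__":
    # Made-up address combining a ZIP+4 code with a non-USA country.
    print(parse_address("525 East 68th Street, New York, NY 10065-1234, Canada"))
```

The zip group accepts any mix of digits and hyphens, so it is permissive rather than a strict ZIP validator.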