Regular expression to split up city, state - regex

I have a list of city, state data in a data frame. I need to extract only the state abbreviation and store it in a new column called state. From visual inspection, the state always appears to be the last 2 characters of the string, and both are capitalized. The city, state data looks like the following:
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")
I tried the following
pattern <- "[, (A-Z){2}]"
strsplit(test, pattern)
The output was:
[[1]]
[1] "Anchorage, "
[[2]]
[1] "New York City, "
[[3]]
[1] "Some Place, Another Place, "
EDIT:
I used another regular expression:
pattern2 <- "([a-z, ])"
sp <- strsplit(test, pattern2)
I get these results:
[[1]]
[1] "A" "" "" "" "" "" "" "" "" "" "AK"
[[2]]
[1] "N" "" "" "Y" "" "" "" "C" "" "" "" "" "NY"
[[3]]
[1] "S" "" "" "" "P" "" "" "" "" "" "A" "" "" "" "" "" ""
[18] "P" "" "" "" "" "" "LA"
So, the abbreviation is there, but when I try to extract using sapply(), I am not sure how to get the last element of a list. I know how to get the first:
sapply(sp, "[[", 1)

I'm not sure you really need a regular expression here. If you always just want the last two characters of the string, just use
substring(test, nchar(test)-1, nchar(test))
[1] "AK" "NY" "LA"
If you really insist on a regular expression, at least consider using regexec rather than strsplit since you're not really interested in splitting, you only want to extract the state.
m <- regexec("[A-Z]+$", test)
unlist(regmatches(test,m))
# [1] "AK" "NY" "LA"
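Since the question asks for the result in a new column, here is a minimal sketch of that step. The data frame name df and the column name city_state are assumptions made for illustration; sub() with an end-anchored capture group is one of several ways to do it.

```r
# Hypothetical data frame; the names df and city_state are assumptions.
df <- data.frame(
  city_state = c("Anchorage, AK", "New York City, NY",
                 "Some Place, Another Place, LA"),
  stringsAsFactors = FALSE
)

# Replace the whole string with the two capital letters anchored at its end.
df$state <- sub("^.*([A-Z]{2})$", "\\1", df$city_state)

df$state
```

Note that sub() returns a string unchanged when the pattern does not match, so rows that do not end in two capitals would need separate handling.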

Try:
tt = strsplit(test, ', ')
tt
[[1]]
[1] "Anchorage" "AK"
[[2]]
[1] "New York City" "NY"
[[3]]
[1] "Some Place" "Another Place" "LA"
z = list()
for(i in tt) z[length(z)+1] = i[length(i)]
z
[[1]]
[1] "AK"
[[2]]
[1] "NY"
[[3]]
[1] "LA"
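The loop above can also be written without growing a list, which additionally answers the question of how to take the last element of each split. This is a sketch using vapply, which returns a plain character vector:

```r
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")
tt <- strsplit(test, ", ", fixed = TRUE)

# x[length(x)] is the last element of each split vector;
# character(1) makes vapply check that each result is a single string.
state <- vapply(tt, function(x) x[length(x)], character(1))

state
```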

This can work:
regmatches(test, gregexpr("(?<=[,][\\s+])([A-Z]{2})", test, perl = TRUE))
## [[1]]
## [1] "AK"
##
## [[2]]
## [1] "NY"
##
## [[3]]
## [1] "LA"
Explanation courtesy of: http://liveforfaith.com/re/explain.pl
(?<=       look behind to see if there is:
[,]        any character of: ','
[\\s+]     any character of: whitespace (\n, \r, \t, \f, and " "), '+'
)          end of look-behind
(          group and capture to \1:
[A-Z]{2}   any character of: 'A' to 'Z' (2 times)
)          end of \1
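If only one state per string is expected, a shorter variant of the same look-behind idea may be enough. This sketch assumes the separator is always exactly a comma followed by one space, and it anchors the match at the end of the string:

```r
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# Look behind for a literal ", ", then match two capitals at the end of the string.
m <- regexpr("(?<=, )[A-Z]{2}$", test, perl = TRUE)

# regexpr finds at most one match per string, so the result is a plain vector.
states <- regmatches(test, m)
states
```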

I think you have the meanings of '[]' and '()' reversed. '()' groups a sequence of characters; '[]' matches any single character from a class. What you need is
"(, [A-Z]{2})".

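The difference between '[]' and '()' described above can be seen in a small sketch. The replacement character "_" is only there to make the matched part visible:

```r
x <- "Anchorage, AK"

# Character class: [AK] matches ONE character that is either A or K,
# so the first such character (the leading "A") is replaced.
class_result <- sub("[AK]", "_", x)   # "_nchorage, AK"

# Group: (AK) matches the two-character sequence "AK".
group_result <- sub("(AK)", "_", x)   # "Anchorage, _"

# Using the suggested pattern to pull out the state, then dropping the ", " prefix.
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")
m <- regexpr(", [A-Z]{2}", test)
states <- sub(", ", "", regmatches(test, m), fixed = TRUE)
states
```
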
library(stringr)
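# Note: perl() was deprecated in stringr 1.0.0; newer versions accept the pattern directly or via regex().
str_extract(test, perl('[A-Z]+(?=\\b$)'))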
#[1] "AK" "NY" "LA"

Here is a regex for the same:
Regex
(?'state'\w{2})(?=")
Test String
"Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA"
Result
MATCH 1
state [12-14] AK
MATCH 2
state [33-35] NY
MATCH 3
state [66-68] LA
You may remove the named capture to make it shorter if required, e.g.
(\w{2})(?=")

Related

Can I use an OR statement to indicate the pattern in stringr's str_extract_all function?

I'm looking at a number of cells in a data frame and am trying to extract any one of several sequences of characters; there's only one of these sequences per cell.
Here's what I mean:
dF$newColumn = str_extract_all(string = "dF$column1", pattern ="sequence_1|sequence_2")
Am I screwing the syntax up here? Can I pull this sort of thing with stringr? Please rectify my ignorance!
Yes, you can use | since it denotes logical or in regex. Here's an example:
vec <- c("abc text", "text abc", "def text", "text def text")
library(stringr)
str_extract_all(string = vec, pattern = "abc|def")
The result:
[[1]]
[1] "abc"
[[2]]
[1] "abc"
[[3]]
[1] "def"
[[4]]
[1] "def"
However, in your command, you should replace "dF$column1" with dF$column1 (without quotes).
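Because there is only one sequence per cell, a plain character column may be more convenient than the list that str_extract_all returns. Here is a base R sketch of that idea; the data frame contents are made up for illustration, and rows without a match are left as NA:

```r
# Hypothetical data; the last row deliberately contains no match.
dF <- data.frame(
  column1 = c("abc text", "text abc", "def text", "no match"),
  stringsAsFactors = FALSE
)

# regexpr finds the first match of either alternative in each string.
m <- regexpr("abc|def", dF$column1)

# regmatches only returns the strings that matched, so fill those rows only.
dF$newColumn <- NA_character_
dF$newColumn[m != -1] <- regmatches(dF$column1, m)

dF$newColumn
```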

Split keep repeated delimiter

I'm trying to use the stringi package to split on a delimiter (potentially the delimiter is repeated) yet keep the delimiter. This is similar to this question I asked moons ago: R split on delimiter (split) keep the delimiter (split) but the delimiter can be repeated. I don't think base strsplit can handle this type of regex. The stringi package can, but I can't figure out how to format the regex so that it splits on the delimiter when there are repeats and does not leave an empty string at the end of the string.
Base R, stringr, stringi etc. solutions are all welcome.
The latter problem occurs because I use greedy * on the \\s, but the space isn't guaranteed, so I could only think to leave it in:
MWE
text.var <- c("I want to split here.But also||Why?",
"See! Split at end but no empty.",
"a third string. It has two sentences"
)
library(stringi)
stri_split_regex(text.var, "(?<=([?.!|]{1,10}))\\s*")
# Outcome
## [[1]]
## [1] "I want to split here." "But also|" "|" "Why?"
## [5] ""
##
## [[2]]
## [1] "See!" "Split at end but no empty." ""
##
## [[3]]
## [1] "a third string." "It has two sentences"
# Desired Outcome
## [[1]]
## [1] "I want to split here." "But also||" "Why?"
##
## [[2]]
## [1] "See!" "Split at end but no empty."
##
## [[3]]
## [1] "a third string." "It has two sentences"
Using strsplit
strsplit(text.var, "(?<=[.!|])( +|\\b)", perl=TRUE)
#[[1]]
#[1] "I want to split here." "But also||" "Why?"
#[[2]]
#[1] "See!" "Split at end but no empty."
#[[3]]
#[1] "a third string." "It has two sentences"
Or
library(stringi)
stri_split_regex(text.var, "(?<=[.!|])( +|\\b)")
#[[1]]
#[1] "I want to split here." "But also||" "Why?"
#[[2]]
#[1] "See!" "Split at end but no empty."
#[[3]]
#[1] "a third string." "It has two sentences"
Just use a pattern that finds inter-character locations that: (1) are preceded by one of ?.!|; and (2) are not followed by one of ?.!|. Tack on \\s* to match and eat up any number of consecutive space characters, and you're good to go.
## (look-behind)(look-ahead)(spaces)
strsplit(text.var, "(?<=([?.!|]))(?!([?.!|]))\\s*", perl=TRUE)
# [[1]]
# [1] "I want to split here." "But also||" "Why?"
#
# [[2]]
# [1] "See!" "Split at end but no empty."
#
# [[3]]
# [1] "a third string." "It has two sentences"
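A different technique, not taken from the answers above, is to match the pieces instead of splitting between them. The sketch below assumes every piece starts with at least one non-delimiter character; leading delimiters at the very start of a string would be dropped.

```r
text.var <- c("I want to split here.But also||Why?",
              "See! Split at end but no empty.",
              "a third string. It has two sentences")

# Each piece is a run of non-delimiters followed by any trailing delimiters.
pieces <- regmatches(text.var, gregexpr("[^?.!|]+[?.!|]*", text.var))

# Remove the leading space left over from the separator between sentences.
result <- lapply(pieces, trimws)
result
```

Because nothing is split on a zero-width position, this avoids the empty trailing string without needing look-arounds.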

R stringr and str_extract_all: capturing contractions

I am doing a bit of NLP with R and am using the stringr package to tokenize some text.
I would like to be able to capture contractions, for example, won't, so that it is tokenized into "wo" and "n't".
Here is a sample of what I've got:
library(stringr)
s = "won't you buy my raspberries?"
foo = str_extract_all(s, "(n't)|[[:punct:]]" ) # captures the contraction OK...
foo[[1]]
>[1] "n't" "?"
foo = str_extract_all(s, "(n't)|\\w+|[[:punct:]]" ) # gets all words,
# but splits the contraction!
foo[[1]]
>[1] "won" "'" "t" "you" "buy" "my" "raspberries" "?"
I am trying to tokenize the above sentence into "wo", "n't", "you", "buy", "my", "raspberries", "?".
I am not too sure if I can do this with the default, extended regular expressions, or if I need to figure out some way to do this with a Perl-like pattern.
Does anyone out there know of a way to do tokenization as described above with the stringr package?
EDIT
To clarify, I am interested in Treebank tokenization.
You could do this through lookaheads, which are supported by the PCRE library.
> s = "won't you buy my raspberries?"
> s
[1] "won't you buy my raspberries?"
> m <- gregexpr("\\w+(?=n[[:punct:]]t)|n?[[:punct:]]t?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo" "n't" "you" "buy" "my"
[6] "raspberries" "?"
OR
> m <- gregexpr("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo" "n't" "you" "buy" "my"
[6] "raspberries" "?"
OR
Through the stringr library:
> s <- "won't you buy my raspberries?"
> str_extract_all(s, perl("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+") )[[1]]
[1] "wo" "n't" "you" "buy" "my"
[6] "raspberries" "?"
You could try the perl wrapper function when working with stringr package functions.
s <- "won't you buy my raspberries?"
pattern <- "(?=[a-z]'[a-z])|(\\s+)|(?=[!?.])"
library(stringr)
str_split(s, perl(pattern))[[1]]
# [1] "wo" "n't" "you" "buy" "my"
# [6] "raspberries" "?"
There are also other wrappers such as fixed and ignore.case

Using regexes in grep function in R

Does anyone know how to extract x and y from the string "x and y" using the grep function (not the stringi package), where x and y are arbitrary characters?
I am not skilled in regular expressions.
Thanks for any response.
The regex here matches letters, then "and", then letters again, and regmatches then extracts the two captured groups:
txt <- c("x and y", "a and b", " C and d", "qq and rr")
matches <- regexec("([[:alpha:]]+)[[:blank:]]+and[[:blank:]]+([[:alpha:]]+)", txt)
regmatches(txt, matches)[[1]][2:3]
## [1] "x" "y"
regmatches(txt, matches)[[2]][2:3]
## [1] "a" "b"
regmatches(txt, matches)[[3]][2:3]
## [1] "C" "d"
regmatches(txt, matches)[[4]][2:3]
## [1] "qq" "rr"
([[:alpha:]]+) matches one or more alpha characters and places it in a match group. [[:blank:]]+ matches one or more "whitespace" characters. There are less verbose ways to write these regexes but the expanded ones (to me) help make it easier to grok if there will be folks reading the code that aren't familiar with regexes.
I also didn't need to call regmatches 4x, but it was faster to cut/paste for a toy example.
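One way to avoid the repeated regmatches calls is to pull both capture groups out of every element in one pass. This is a sketch; the column names left and right are made up for illustration.

```r
txt <- c("x and y", "a and b", " C and d", "qq and rr")
matches <- regexec("([[:alpha:]]+)[[:blank:]]+and[[:blank:]]+([[:alpha:]]+)", txt)

# Elements 2 and 3 of each match are the two capture groups;
# vapply builds a 2 x n matrix, which t() turns into one row per input string.
pairs <- t(vapply(regmatches(txt, matches), function(m) m[2:3], character(2)))
colnames(pairs) <- c("left", "right")

pairs
```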
As @MrFlick commented, grep is not the right function to extract these substrings.
You can use regmatches and do something like this:
> x <- c('x and y', 'abc and def', 'foo and bar')
> regmatches(x, gregexpr('and(*SKIP)(*F)|\\w+', x, perl=T))
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"
Or if " and " is always constant, then use strsplit as suggested in the comments.
> x <- c('x and y', 'abc and def', 'foo and bar')
> strsplit(x, ' and ', fixed=T)
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"

Split string recursively

Say I have text like this:
pattern = "This_is some word/expression I'd like to parse:intelligently(using special symbols-like '.')"
The challenge is how to split it into words, using word separators from the
c(" ","-","/","\\","_",":","(",")",".",",")
family.
Desired result:
"This" "is" "some" "word" "expression" "I'd" "like" "to" "parse" "intelligently" "using" "special" "symbols" "like"
Methods:
I could do sapply or for loop using:
keywords = unlist(strsplit(pattern," "))
keywords = unlist(strsplit(keywords,"-"))
# etc.
Question:
But what's the solution using Reduce(f, x, init, accumulate=TRUE)?
You shouldn't need Reduce here. You should be able to do something like the following:
splitters <- c(" ","/","\\","_",":","(",")",".",",","-") # dash should come last
pattern <- paste0("[", paste(splitters, collapse = ""), "]")
string <- "This_is some word/expression I'd like to parse:intelligently(using special symbols-like '.')"
strsplit(string, pattern)[[1]]
# [1] "This" "is" "some" "word"
# [5] "expression" "I'd" "like" "to"
# [9] "parse" "intelligently" "using" "special"
# [13] "symbols" "like" "'" "'"
Note that a - in a regex character class should come first or last, so I've edited your vector of "splitters" accordingly. Also, you may want to add a + at the end of your "pattern" in case you want to collapse, say, multiple spaces into one.
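The effect of the trailing + can be seen on a short made-up input with repeated separators. The backslash is left out of the splitter vector here only to keep the sketch small:

```r
# Dash comes last so it is treated literally inside the character class.
splitters <- c(" ", "/", "_", ":", "(", ")", ".", ",", "-")
pattern <- paste0("[", paste(splitters, collapse = ""), "]")

x <- "one  two--three/four"

# Without +, each separator splits on its own, leaving empty strings.
no_plus <- strsplit(x, pattern)[[1]]
# "one" "" "two" "" "three" "four"

# With +, a run of separators counts as a single split point.
with_plus <- strsplit(x, paste0(pattern, "+"))[[1]]
# "one" "two" "three" "four"
```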
You can use option perl = TRUE and then split on punctuation or space
> strsplit(pattern, '[[:punct:]]|[[:space:]]', perl = TRUE)
[[1]]
[1] "This" "is" "some" "word" "expression"
[6] "I" "d" "like" "to" "parse"
[11] "intelligently" "using" "special" "symbols" "like"
[16] ""
I'd go with the following (it keeps "I'd" together):
strsplit(pattern, "[^[:alnum:][:digit:]']")
## [[1]]
## [1] "This" "is" "some" "word" "expression" "I'd" "like" "to" "parse"
## [10] "intelligently" "using" "special" "symbols" "like" "'" "'"
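For completeness, the Reduce-based version that the question asks about could look roughly like this. It is a sketch rather than a recommendation, since a single character class does the same job in one call; the helper name split_on is made up for illustration.

```r
string <- "This_is some word/expression I'd like to parse:intelligently(using special symbols-like '.')"
seps <- c(" ", "-", "/", "\\", "_", ":", "(", ")", ".", ",")

# One folding step: split every piece produced so far on one separator.
# fixed = TRUE treats the separator as a literal string, not a regex.
split_on <- function(pieces, sep) unlist(strsplit(pieces, sep, fixed = TRUE))

# Start from the whole string and apply the separators one after another.
keywords <- Reduce(split_on, seps, string)
keywords
```

As with the character-class answers, the quotes around the dot survive as two separate "'" elements at the end. Passing accumulate = TRUE would return the intermediate vectors after each separator instead of only the final one.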