I am doing a bit of NLP with R and am using the stringr package to tokenize some text.
I would like to be able to capture contractions, for example, won't, so that it is tokenized into "wo" and "n't".
Here is a sample of what I've got:
library(stringr)
s = "won't you buy my raspberries?"
foo = str_extract_all(s, "(n't)|[[:punct:]]" ) # captures the contraction OK...
foo[[1]]
>[1] "n't" "?"
foo = str_extract_all(s, "(n't)|\\w+|[[:punct:]]" ) # gets all words,
# but splits the contraction!
foo[[1]]
>[1] "won" "'" "t" "you" "buy" "my" "raspberries" "?"
I am trying to tokenize the above sentence into "wo", "n't", "you", "buy", "my", "raspberries", "?".
I am not too sure if I can do this with the default extended regular expressions, or if I need to figure out some way to do this with a Perl-like pattern.
Does anyone out there know of a way to do tokenization as described above with the stringr package?
EDIT
To clarify, I am interested in Treebank tokenization.
You could do this through lookaheads, which are supported by the PCRE library.
> s = "won't you buy my raspberries?"
> s
[1] "won't you buy my raspberries?"
> m <- gregexpr("\\w+(?=n[[:punct:]]t)|n?[[:punct:]]t?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo" "n't" "you" "buy" "my"
[6] "raspberries" "?"
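If n't is the only contraction of interest, a narrower pattern avoids letting other punctuation absorb neighbouring letters. Below is a minimal base R sketch under that assumption; the helper name tokenize_nt is made up for illustration, and this is not full Treebank tokenization.

```r
# Hypothetical helper: split "n't" contractions into a stem and "n't",
# and return words and punctuation marks as separate tokens.
tokenize_nt <- function(s) {
  # Alternatives are tried left to right at each position:
  #   \\w+(?=n't)   a word stem that is immediately followed by n't
  #   n't           the contraction suffix itself
  #   \\w+          any other word
  #   [[:punct:]]   a single punctuation character
  pattern <- "\\w+(?=n't)|n't|\\w+|[[:punct:]]"
  regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
}

tokens <- tokenize_nt("won't you buy my raspberries?")
tokens
# [1] "wo"          "n't"         "you"         "buy"         "my"
# [6] "raspberries" "?"
```

The lookahead makes \\w+ backtrack from "won" to "wo", which is what leaves "n't" available for the second alternative.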
OR
> m <- gregexpr("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "wo" "n't" "you" "buy" "my"
[6] "raspberries" "?"
OR
Through stringr library,
> s <- "won't you buy my raspberries?"
> str_extract_all(s, perl("\\w+(?=\\w[[:punct:]]\\w)|\\w?[[:punct:]]\\w?|\\w+") )[[1]]
[1] "wo" "n't" "you" "buy" "my"
[6] "raspberries" "?"
You could try the perl wrapper function when working with stringr package functions.
s <- "won't you buy my raspberries?"
pattern <- "(?=[a-z]'[a-z])|(\\s+)|(?=[!?.])"
library(stringr)
str_split(s, perl(pattern))[[1]]
# [1] "wo" "n't" "you" "buy" "my"
# [6] "raspberries" "?"
There are also other wrappers such as fixed and ignore.case
Suppose I have a string like "A B C (123-456-789)", I'm wondering what's the best way to retrieve "123-456-789" from it.
strsplit("A B C (123-456-789)", "\\(")
[[1]]
[1] "A B C" "123-456-789)"
If we want to extract the digits and - between the parentheses, one option is str_extract. If there are multiple matches within a string, use str_extract_all.
library(stringr)
str_extract(str1, '(?<=\\()[0-9-]+(?=\\))')
#[1] "123-456-789"
str_extract_all(str2, '(?<=\\()[0-9-]+(?=\\))')
In the code above, regex lookarounds restrict the match to the digits and hyphens between the parentheses. The positive lookbehind in (?<=\\()[0-9-]+ matches digits and - ([0-9-]+) only when they are preceded by an opening parenthesis, so it matches in (123-456-789 but not in 123-456-789. Similarly, the lookahead in [0-9-]+(?=\\)) matches digits and - only when they are followed by a closing parenthesis, so it matches in 123-456-789) but not in 123-456-789. Taken together, the pattern matches only when both conditions hold, as in (123-456-789), and extracts what lies between the lookarounds; it does not match cases like (123-456-789 or 123-456-789).
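The same lookaround pattern also works without stringr. This is a small base R sketch (the function name extract_in_parens is only illustrative) that shows which inputs match and which do not:

```r
# Sketch: extract every parenthesised run of digits and hyphens with base R.
extract_in_parens <- function(x) {
  # (?<=\\() requires "(" just before the match;
  # (?=\\)) requires ")" just after it. Neither is part of the result.
  regmatches(x, gregexpr("(?<=\\()[0-9-]+(?=\\))", x, perl = TRUE))
}

res <- extract_in_parens(c("A B C (123-456-789)",
                           "ABC (123-423-389) GR (124-233-848) AK",
                           "(123-456-789",    # no closing parenthesis
                           "123-456-789)"))   # no opening parenthesis
res[[1]]
# [1] "123-456-789"
res[[2]]
# [1] "123-423-389" "124-233-848"
```

The last two inputs return character(0), because only one of the two lookarounds is satisfied.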
With strsplit you can specify the split as [()]. Placing the parentheses inside square brackets makes them literal characters; otherwise we would have to escape them ('\\(|\\)').
strsplit(str1, '[()]')[[1]][2]
#[1] "123-456-789"
If there are multiple substrings to extract from a string, we could loop with lapply and extract the numeric split parts with grep
lapply(strsplit(str2, '[()]'), function(x) grep('\\d', x, value=TRUE))
Or we can use stri_split from stringi which has the option to remove the empty strings as well (omit_empty=TRUE).
library(stringi)
stri_split_regex(str1, '[()A-Z ]', omit_empty=TRUE)[[1]]
#[1] "123-456-789"
stri_split_regex(str2, '[()A-Z ]', omit_empty=TRUE)
Another option is rm_round from qdapRegex if we are interested in extracting the contents inside the brackets.
library(qdapRegex)
rm_round(str1, extract=TRUE)[[1]]
#[1] "123-456-789"
rm_round(str2, extract=TRUE)
data
str1 <- "A B C (123-456-789)"
str2 <- c("A B C (123-425-478) A", "ABC(123-423-428)",
"(123-423-498) ABCDD",
"(123-432-423)", "ABC (123-423-389) GR (124-233-848) AK")
or with sub from base R:
sub("[^(]+\\(([^)]+)\\).*", "\\1", "A B C (123-456-789)")
#[1] "123-456-789"
Explanation:
[^(]+ : matches anything except an opening bracket
\\( : matches an opening bracket, which is just before what you want
([^)]+) : matches the pattern you want to capture (which is then retrieved in replacement="\\1"), which is anything except a closing bracket
\\).* matches a closing bracket followed by anything, 0 or more times
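One thing to keep in mind with sub is that it returns the input unchanged when the pattern does not match. The sketch below is a variation on the pattern explained above, not the answer's exact code: it anchors the pattern, allows nothing before the opening bracket ([^(]* instead of [^(]+), and returns NA for non-matching strings. The helper name extract_first_paren is hypothetical.

```r
# Hypothetical helper: contents of the first (...) group, or NA if none.
extract_first_paren <- function(x) {
  pattern <- "^[^(]*\\(([^)]+)\\).*$"
  # sub() alone would hand back the original string on a failed match,
  # so test with grepl() first and substitute NA in that case.
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

out <- extract_first_paren(c("A B C (123-456-789)",
                             "(123-432-423)",
                             "no parentheses here"))
out
# [1] "123-456-789" "123-432-423" NA
```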
Another option with look-ahead and look-behind
sub(".*(?<=\\()(.+)(?=\\)).*", "\\1", "A B C (123-456-789)", perl=TRUE)
#[1] "123-456-789"
The capture groups in sub will target your desired output:
sub('.*\\((.*)\\).*', '\\1', str1)
[1] "123-456-789"
Extra check to make sure I pass @akrun's extended example:
sub('.*\\((.*)\\).*', '\\1', str2)
[1] "123-425-478" "123-423-428" "123-423-498" "123-432-423" "124-233-848"
You may try these gsub functions.
> gsub("[^\\d-]", "", x, perl=T)
[1] "123-456-789"
> gsub(".*\\(|\\)", "", x)
[1] "123-456-789"
> gsub("[^0-9-]", "", x)
[1] "123-456-789"
Few more...
> gsub("[0-9-](*SKIP)(*F)|.", "", x, perl=T)
[1] "123-456-789"
> gsub("(?:(?![0-9-]).)*", "", x, perl=T)
[1] "123-456-789"
Try this also:
k<-"A B C (123-456-789)"
regmatches(k,gregexpr("*.(\\d+).*",k))[[1]]
[1] "(123-456-789)"
With suggestion from @Arun:
regmatches(k, gregexpr('(?<=\\()[^A-Z ]+(?=\\))', k, perl=TRUE))[[1]]
With suggestion from @akrun:
regmatches(k, gregexpr('[0-9-]+', k))[[1]]
I am trying something I thought would be easy. I'm looking for a single regex solution (though others are welcomed for completeness). I want to split on n occurrences of a delimiter.
Here is some data:
x <- "I like_to see_how_too"
pat <- "_"
Desired outcome
Say I want to split on first occurrence of _:
[1] "I like" "to see_how_too"
Say I want to split on second occurrence of _:
[1] "I like_to see" "how_too"
Ideally the solution is a regex one-liner that generalizes to the nth occurrence and uses strsplit with a single regex.
Here's a solution that doesn't fit my parameters of single regex that works with strsplit
x <- "I like_to see_how_too"
y <- "_"
n <- 1
loc <- gregexpr("_", x)[[1]][n]
c(substr(x, 1, loc-1), substr(x, loc + 1, nchar(x)))
Here is another solution using the gsubfn package and some regex-fu. To change the nth occurrence of the delimiter, you can simply swap the number that is placed inside of the range quantifier — {n}.
library(gsubfn)
x <- 'I like_to see_how_too'
strapply(x, '((?:[^_]*_){1})(.*)', c, simplify =~ sub('_$', '', x))
# [1] "I like" "to see_how_too"
If you would like the nth occurrence to be user defined, you could use the following:
n <- 2
re <- paste0('((?:[^_]*_){',n,'})(.*)')
strapply(x, re, c, simplify =~ sub('_$', '', x))
# [1] "I like_to see" "how_too"
Non-Solution
Since R is using PCRE, you can use \K to remove everything that matches the pattern before \K from the main match result.
Below is the regex to split the string at the 3rd _
^[^_]*(?:_[^_]*){2}\K_
If you want to split at the nth occurrence of _, just change 2 to (n - 1).
Demo on regex101
That was the plan. However, strsplit seems to think differently.
Actual execution
Demo on ideone.com
x <- "I like_to see_how_too but_it_seems to_be_impossible"
strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# strsplit(x, "^[^_]*(?:_[^_]*)\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){1}\\K_", perl=TRUE)
# [[1]]
# [1] "I like_to see" "how_too but" "it_seems to" "be_impossible"
# strsplit(x, "^[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
It still fails to work on a stronger assertion \A
strsplit(x, "\\A[^_]*(?:_[^_]*){0}\\K_", perl=TRUE)
# [[1]]
# [1] "I like" "to see" "how" "too but" "it"
# [6] "seems to" "be" "impossible"
Explanation?
This behavior suggests that strsplit finds the first match, takes a substring to extract the first token and the remainder, and then looks for the next match in the remainder.
This discards all state from the previous matches, so the regex starts from a clean state on the remainder. That makes it impossible to stop strsplit after the first match with the regex alone, and strsplit has no parameter to limit the number of splits.
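Since sub replaces only the first match, one possible workaround is to let sub apply the \K pattern once and then split on a marker. This is a two-step sketch rather than the single-regex strsplit asked for, and it assumes the control character "\001" never occurs in the input.

```r
x <- "I like_to see_how_too"
n <- 2  # split at the nth underscore

# Same \K idea as above: everything before \K is matched but not replaced.
pattern <- sprintf("^[^_]*(?:_[^_]*){%d}\\K_", n - 1)

# Replace only the nth underscore with a marker, then split on the marker.
marked <- sub(pattern, "\001", x, perl = TRUE)
parts <- strsplit(marked, "\001", fixed = TRUE)[[1]]
parts
# [1] "I like_to see" "how_too"
```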
Rather than splitting, you can match to get the two pieces.
Try this regex:
^((?:[^_]*_){1}[^_]*)_(.*)$
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
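In R, this matching approach might be wrapped as follows. It is only a sketch: the function name split_at_nth is invented here, and delim is assumed to be a single character with no special meaning inside a bracket expression.

```r
# Hypothetical helper: split a single string at the nth occurrence of delim
# by capturing the text before and after it.
split_at_nth <- function(x, n, delim = "_") {
  stopifnot(length(x) == 1, n >= 1)
  pattern <- sprintf("^((?:[^%s]*%s){%d}[^%s]*)%s(.*)$",
                     delim, delim, n - 1, delim, delim)
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]]
  # Fewer than n delimiters: no match, so return the input unchanged.
  if (length(m) == 0) return(x)
  m[2:3]  # element 1 is the full match; 2 and 3 are the capture groups
}

split_at_nth("I like_to see_how_too", 1)
# [1] "I like"         "to see_how_too"
split_at_nth("I like_to see_how_too", 2)
# [1] "I like_to see" "how_too"
```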
RegEx Demo
Update: It seems R also supports PCRE and in that case you can do split as well using this PCRE regex:
^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_
Replace 1 by n-1 where you're trying to get split on nth occurrence of underscore.
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice way around the restriction that a lookbehind cannot be variable-length, which the regex above would otherwise need.
RegEx Demo2
x <- "I like_to see_how_too"
strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## > strsplit(x, "^((?:[^_]*_){0}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like" "to see" "how" "too"
## > strsplit(x, "^((?:[^_]*_){1}[^_]*)(*SKIP)(*F)|_", perl=TRUE)
## [[1]]
## [1] "I like_to see" "how_too"
This uses gsubfn to preprocess the input string so that strsplit can handle it. The main advantage is that one can specify a vector of numbers, k, indicating which underscores to split on.
It replaces the occurrences of underscore defined by k by a double underscore and then splits on double underscore. In this example we split at the 2nd and 4th underscore:
library(gsubfn)
k <- c(2, 4) # split at 2nd and 4th _
p <- proto(fun = function(., x) if (count %in% k) "__" else "_")
strsplit(gsubfn("_", p, "aa_bb_cc_dd_ee_ff"), "__")
giving:
[[1]]
[1] "aa_bb" "cc_dd" "ee_ff"
If empty fields are allowed then use any other character sequence not in the string, e.g. "\01" in place of the double underscore.
See section 4 of the gsubfn vignette for more info on using gsubfn with proto objects to retain state between matches.
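For comparison, splitting at chosen occurrences can also be sketched in base R by locating the delimiters and cutting with substring. This is an alternative technique, not the gsubfn approach itself, and the function name split_at_positions is hypothetical.

```r
# Hypothetical helper: split x at the k-th occurrences of a fixed delimiter.
split_at_positions <- function(x, k, delim = "_") {
  locs <- gregexpr(delim, x, fixed = TRUE)[[1]]
  if (locs[1] == -1) return(x)          # delimiter not present
  k <- sort(unique(k))
  cut <- locs[k[k <= length(locs)]]     # character positions to cut at
  starts <- c(1, cut + nchar(delim))
  ends <- c(cut - 1, nchar(x))
  substring(x, starts, ends)
}

split_at_positions("aa_bb_cc_dd_ee_ff", c(2, 4))
# [1] "aa_bb" "cc_dd" "ee_ff"
```

Because the pieces are cut by position, no placeholder sequence has to be chosen, and empty fields are kept as empty strings.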
I have searched and was able to find this forum discussion for achieving the effect of overlapping matches.
I also found the following SO question speaking of finding indexes to perform this task, but was not able to find anything concise about grabbing overlapping matches in the R language.
I can perform this task in almost any language that supports PCRE by using a positive lookahead assertion with a capturing group inside the lookahead to capture the overlapping matches.
But when I do this in R the same way I would in other languages, using perl=T, only empty strings are returned.
> x <- 'ACCACCACCAC'
> regmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
[1] "" "" "" "" "" "" ""
The same goes for using both the stringi and stringr package.
> library(stringi)
> library(stringr)
> stri_extract_all_regex(x, '(?=([AC]C))')[[1]]
[1] "" "" "" "" "" "" ""
> str_extract_all(x, perl('(?=([AC]C))'))[[1]]
[1] "" "" "" "" "" "" ""
The correct results that should be returned when executing this are:
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Edit
I am well aware that regmatches does not work well with captured matches, but what exactly causes this behavior in regmatches and why are no results returned? I am scavenging for a somewhat detailed answer.
Are the stringi and stringr packages any more capable of performing this than regmatches?
Please feel free to add to my answer or come up with a different workaround than I have found.
The standard regmatches does not work well with captured matches (specifically multiple captured matches in the same string). And in this case, since you're "matching" a lookahead (ignoring the capture), the match itself is zero-length. There is also a regmatches()<- function that may illustrate this. Observe:
x <- 'ACCACCACCAC'
m <- gregexpr('(?=([AC]C))', x, perl=T)
regmatches(x, m) <- "~"
x
# [1] "~A~CC~A~CC~A~CC~AC"
Notice how all the letters are preserved, we've just replaced the locations of the zero-length matches with something we can observe.
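The zero-length matches can also be seen directly in the attributes that gregexpr returns with perl = TRUE. A short sketch for inspection (variable names are arbitrary):

```r
x <- "ACCACCACCAC"
m <- gregexpr("(?=([AC]C))", x, perl = TRUE)[[1]]

starts    <- as.integer(m)                        # where each match begins
match_len <- attr(m, "match.length")              # length of the whole match
cap_start <- as.integer(attr(m, "capture.start")) # where the capture group begins
cap_len   <- as.integer(attr(m, "capture.length"))# length of the capture group

starts
# [1]  1  2  4  5  7  8 10
match_len
# [1] 0 0 0 0 0 0 0
```

The whole match has length 0 at every position, which is why regmatches returns empty strings, while the capture group attributes carry the two-character pieces.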
I've created a regcapturedmatches() function that I often use for such tasks. For example
x <- 'ACCACCACCAC'
regcapturedmatches(x, gregexpr('(?=([AC]C))', x, perl=T))[[1]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
The gregexpr call is grabbing all the data just fine, so you can extract it from that object any way you like if you prefer not to use this helper function.
As far as a workaround, this is what I have come up with to extract the overlapping matches.
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)
> mapply(function(X) substr(x, X, X+1), m[[1]])
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Please feel free to add or comment on a better way to perform this task.
A stringi solution using a capture group in the look-ahead part:
> stri_match_all_regex('ACCACCACCAC', '(?=([AC]C))')[[1]][,2]
## [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Another roundabout way of extracting the same information that I've done in the past is to replace the "match.length" with the "capture.length":
x <- c("ACCACCACCAC","ACCACCACCAC")
m <- gregexpr('(?=([AC]C))', x, perl=TRUE)
m <- lapply(m, function(i) {
attr(i,"match.length") <- attr(i,"capture.length")
i
})
regmatches(x,m)
#[[1]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
#
#[[2]]
#[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
It's not a regex solution, and doesn't really answer any of your more important questions, but you could also get your desired result by using the substrings of two characters at a time and then removing the unwanted CA elements.
x <- 'ACCACCACCAC'
y <- substring(x, 1:(nchar(x)-1), 2:nchar(x))
y[y != "CA"]
# [1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
An additional answer, based on @hwnd's own answer (the original didn't allow variable-length captured regions), using just built-in R functions:
> x <- 'ACCACCACCAC'
> m <- gregexpr('(?=([AC]C))', x, perl=T)[[1]]
> start <- attr(m,"capture.start")
> end <- attr(m,"capture.start") + attr(m,"capture.length") - 1
> sapply(seq_along(m), function(i) substr(x, start[i], end[i]))
[1] "AC" "CC" "AC" "CC" "AC" "CC" "AC"
Pretty ugly, which is why the stringr etc. packages exist.
Does anyone know how to extract x and y from the string "x and y" using the grep function (not the stringi package), where x and y are arbitrary characters?
I am so not skilled in regular expressions.
Thanks for any response.
The regex here matches one or more letters, then "and", then one or more letters, and regmatches extracts the captured groups:
txt <- c("x and y", "a and b", " C and d", "qq and rr")
matches <- regexec("([[:alpha:]]+)[[:blank:]]+and[[:blank:]]+([[:alpha:]]+)", txt)
regmatches(txt, matches)[[1]][2:3]
## [1] "x" "y"
regmatches(txt, matches)[[2]][2:3]
## [1] "a" "b"
regmatches(txt, matches)[[3]][2:3]
## [1] "C" "d"
regmatches(txt, matches)[[4]][2:3]
## [1] "qq" "rr"
([[:alpha:]]+) matches one or more alpha characters and places it in a match group. [[:blank:]]+ matches one or more "whitespace" characters. There are less verbose ways to write these regexes but the expanded ones (to me) help make it easier to grok if there will be folks reading the code that aren't familiar with regexes.
I also didn't need to call regmatches 4x, but it was faster to cut/paste for a toy example.
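To avoid the repeated calls, the captured groups can be collected for all strings at once. A minimal sketch, assuming every element of txt matches the pattern (a non-matching element would make vapply fail):

```r
txt <- c("x and y", "a and b", " C and d", "qq and rr")
pattern <- "([[:alpha:]]+)[[:blank:]]+and[[:blank:]]+([[:alpha:]]+)"
matches <- regmatches(txt, regexec(pattern, txt))

# Each list element is c(full match, group 1, group 2); keep the two groups
# and arrange them as one row per input string.
pairs <- t(vapply(matches, function(p) p[2:3], character(2)))
pairs
#      [,1] [,2]
# [1,] "x"  "y"
# [2,] "a"  "b"
# [3,] "C"  "d"
# [4,] "qq" "rr"
```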
As @MrFlick commented, grep is not the right function to extract these substrings.
You can use regmatches and do something like this:
> x <- c('x and y', 'abc and def', 'foo and bar')
> regmatches(x, gregexpr('and(*SKIP)(*F)|\\w+', x, perl=T))
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"
Or if " and " is always constant, then use strsplit as suggested in the comments.
> x <- c('x and y', 'abc and def', 'foo and bar')
> strsplit(x, ' and ', fixed=T)
# [[1]]
# [1] "x" "y"
# [[2]]
# [1] "abc" "def"
# [[3]]
# [1] "foo" "bar"
I have a list of city, state data in a data frame. I need to extract only the state abbreviation and store into a new variable column called state. From visual inspection it looks like the state is always the last 2 characters in the string and they are both capitalized. The city, state data looks like the following:
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")
I tried the following
pattern <- "[, (A-Z){2}]"
strsplit(test, pattern)
The output was:
[[1]]
[1] "Anchorage, "
[[2]]
[1] "New York City, "
[[3]]
[1] "Some Place, Another Place, "
EDIT:
I used another regular expression:
pattern2 <- "([a-z, ])"
sp <- strsplit(test, pattern2)
I get these results:
[[1]]
[1] "A" "" "" "" "" "" "" "" "" "" "AK"
[[2]]
[1] "N" "" "" "Y" "" "" "" "C" "" "" "" "" "NY"
[[3]]
[1] "S" "" "" "" "P" "" "" "" "" "" "A" "" "" "" "" "" ""
[18] "P" "" "" "" "" "" "LA"
So, the abbreviation is there, but when I try to extract using sapply(), I am not sure how to get the last element of a list. I know how to get the first:
sapply(sp, "[[", 1)
I'm not sure you really need a regular expression here. If you always just want the last two characters of the string, just use
substring(test, nchar(test)-1, nchar(test))
[1] "AK" "NY" "LA"
If you really insist on a regular expression, at least consider using regexec rather than strsplit since you're not really interested in splitting, you only want to extract the state.
m <- regexec("[A-Z]+$", test)
unlist(regmatches(test,m))
# [1] "AK" "NY" "LA"
Try:
tt = strsplit(test, ', ')
tt
[[1]]
[1] "Anchorage" "AK"
[[2]]
[1] "New York City" "NY"
[[3]]
[1] "Some Place" "Another Place" "LA"
z = list()
for(i in tt) z[length(z)+1] = i[length(i)]
z
[[1]]
[1] "AK"
[[2]]
[1] "NY"
[[3]]
[1] "LA"
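The loop can also be written as a single vectorized call that picks the last piece of each split. A small sketch using only base R:

```r
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

pieces <- strsplit(test, ", ", fixed = TRUE)
# Take the last element of each split result; vapply guarantees a
# character vector with one entry per input string.
state <- vapply(pieces, function(p) p[length(p)], character(1))
state
# [1] "AK" "NY" "LA"
```

The same anonymous function also answers how to get the last element with sapply, in place of "[[" with a fixed index.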
This can work:
regmatches(test, gregexpr("(?<=[,][\\s+])([A-Z]{2})", test, perl = TRUE))
## [[1]]
## [1] "AK"
##
## [[2]]
## [1] "NY"
##
## [[3]]
## [1] "LA"
Explanation courtesy of: http://liveforfaith.com/re/explain.pl
(?<= look behind to see if there is:
[,] any character of: ','
[\\s+] any character of: whitespace (\n, \r,
\t, \f, and " "), '+'
) end of look-behind
( group and capture to \1:
[A-Z]{2} any character of: 'A' to 'Z' (2 times)
) end of \1
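As the explanation shows, [\\s+] is a character class that matches one whitespace character or a literal '+', not "one or more whitespace". If the separator is always a comma followed by a single space, a simpler lookbehind can be anchored at the end of the string. This is a sketch under that assumption:

```r
test <- c("Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA")

# Two capital letters at the end of the string, preceded by ", ".
states <- regmatches(test, regexpr("(?<=, )[A-Z]{2}$", test, perl = TRUE))
states
# [1] "AK" "NY" "LA"

# Inside brackets the + is literal, so the class also matches a plus sign.
grepl("[\\s+]", c("+", " ", "a"), perl = TRUE)
# [1]  TRUE  TRUE FALSE
```

Note that regmatches with regexpr silently drops strings that do not match, so the result can be shorter than the input.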
I think you have the meanings of '[]' and '()' reversed. '()' groups a sequence of characters; '[]' matches any one character from a class. What you need is
"(, [A-Z]{2})".
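The difference is easy to check with grepl. In this short sketch the anchors ^ and $ are added only to make the behaviour visible; they are not part of the suggested pattern.

```r
# Inside [], parentheses are ordinary characters and the class matches
# exactly one character.
in_class <- grepl("^[(A-Z)]$", c("A", "(", "AK"))
in_class
# [1]  TRUE  TRUE FALSE

# Inside (), the pattern is a group: here two capital letters in sequence.
in_group <- grepl("^([A-Z]{2})$", c("AK", "A"))
in_group
# [1]  TRUE FALSE
```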
library(stringr)
str_extract(test, perl('[A-Z]+(?=\\b$)'))
#[1] "AK" "NY" "LA"
Here is a regex for the same task:
Regex
(?'state'\w{2})(?=")
Test String
"Anchorage, AK", "New York City, NY", "Some Place, Another Place, LA"
Result
MATCH 1
state [12-14] AK
MATCH 2
state [33-35] NY
MATCH 3
state [66-68] LA
live demo here
you may remove the named capture to make it smaller if required
eg
(\w{2})(?=")