Extract substrings starting with a specific character until the next space - regex

I want to extract the tags (twitter handles) from tweets.
tweet <- "#me bla bla bla bla #2_him some text #me_"
The following only extracts part of some substrings, due to the punctuation in some tags:
regmatches(tweet, gregexpr("#[[:alnum:]]*", tweet))[[1]]
[1] "#me" "#2" "#me"
I don't know what regular expression would return the entire string (#tag).
Thanks!

If you want to match all non-space characters, just use the corresponding regular expression:
regmatches(tweet, gregexpr("#[^ ]*", tweet))[[1]]
# [1] "#me" "#2_him" "#me_"

You can use the following. \S will match any non-whitespace character. Also, you want to use the + quantifier instead of *, otherwise you would end up matching a lone # character if one existed in the string.
> regmatches(tweet, gregexpr("#\\S+", tweet))[[1]]
# [1] "#me" "#2_him" "#me_"

Instead of [[:alnum:]]* use \w*, because _ is not in the alphanumeric character list (i.e. [[:alnum:]] matches the alphanumeric characters [A-Za-z0-9]) but it is in the word character list ([A-Za-z0-9_]).
> regmatches(tweet, gregexpr("#\\w*", tweet))[[1]]
[1] "#me" "#2_him" "#me_"

The qdapRegex package has a function, rm_tag, designed specifically for this task:
library(qdapRegex)
rm_tag(tweet, extract=TRUE)
## [[1]]
## [1] "#me" "#2_him" "#me_"


Return only matching portion of regular expression

I have:
> pattern
[1] "(/[[:digit:]]{4}/)"
so I want to extract only the matching portions: the digits plus the enclosing /.../. Here's what I tried:
> gsub(pattern, '\\1', grep(pattern, c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs"), value=TRUE))
[1] "t3tg3wgw/5764/" "grsgs/gwgew/5656/bfsbs"
However this still returns letters attached to the actual match that do not themselves match the regex. How can I extract only /5764/ and /5656/?
We could extract the pattern / followed by one or more digits ([0-9]+) followed by / using str_extract_all from library(stringr). This outputs a list, which can be unlisted to get a vector:
library(stringr)
unlist(str_extract_all(v1, '/[0-9]+/'))
#[1] "/5764/" "/5656/"
Or we can use the same pattern with regmatches/gregexpr from base R:
unlist(regmatches(v1, gregexpr('/[0-9]+/',v1)))
#[1] "/5764/" "/5656/"
data
v1 <- c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs")
Try changing the pattern to .*(/[[:digit:]]{4}/).*
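The leading and trailing .* then consume the text on either side, so the whole string gets replaced by the captured group. A minimal sketch of that suggestion, plugged into the asker's own grep/gsub call:
pattern <- ".*(/[[:digit:]]{4}/).*"
gsub(pattern, "\\1", grep(pattern, c("t3tg3wgw/5764/", "ggg", "grsgs/gwgew/5656/bfsbs"), value = TRUE))
# [1] "/5764/" "/5656/"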

Regex for known start and end characters in Perl and R-lang

I'm looking to match mentions of foo in a username. I need to be able to match text strings that start with '#' and contain the word 'foo' at any location within that username, ending with either a space or a punctuation mark.
I need to be able to match:
example1: #anycharacterhere_foo, anything else here
example2: #foo_anymorecharacters here
I'm looking to use the stringr library like so:
str_extract_all(x, perl("?<=#"))
What I don't understand is the match-all function.
Assuming that your usernames won't have special characters:
x <- "#anycharacterhere_foo, anything else here"
username <- str_extract_all(x, "\\w*(foo)\\w*")
which yields a string with your username. This will also pick up any additional foos in the remainder of the string, but you could fix that by using str_extract rather than the _all version (see the sketch after the next code block). I am not certain whether you really need every foo from the string or simply the username, which in your example data is at the beginning. You could also restrict the _all match by including the #, thus:
username <- str_extract_all(x, "\\#\\w*(foo)\\w*")
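For reference, a small sketch of the str_extract variant mentioned above; it returns only the first match per string, which here is the username:
library(stringr)
x <- "#anycharacterhere_foo, anything else here"
str_extract(x, "\\#\\w*(foo)\\w*")  # first match only, not all of them
# [1] "#anycharacterhere_foo"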
You need to look for "zero or more" word characters that precede or follow:
x <- '#anycharacterhere_foo #foo_anymorecharacters here anything else here'
str_extract_all(x, '#\\w*foo\\w*')[[1]]
# [1] "#anycharacterhere_foo" "#foo_anymorecharacters"
If you don't want to include the marker:
str_extract_all(x, '(?<=#)\\w*foo\\w*')[[1]]
# [1] "anycharacterhere_foo" "foo_anymorecharacters"
You could also use rm_tag from the qdapRegex package for this:
library(qdapRegex)
rm_tag(x, extract=TRUE)[[1]]
# [1] "#anycharacterhere_foo" "#foo_anymorecharacters"

Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens', but only if 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!
Remove the newlines and then extract the portion matched by the parenthesized part of the pattern pat. Then split such strings apart on commas and simplify the result into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Visualization: here is the regular expression in pat as the regex engine sees it (note that we need to double the backslash when the pattern is put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Note that .*? means match the shortest string of any characters such that the entire pattern matches; without the ? it would try to match the longest string.
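To see the difference on a small test string (a sketch, not part of the original data):
s <- "tokens([a,b]) tokens([c,d])"
regmatches(s, regexpr("tokens..(.*?)\\]", s, perl = TRUE))  # lazy: stops at the first ]
# [1] "tokens([a,b]"
regmatches(s, regexpr("tokens..(.*)\\]", s, perl = TRUE))   # greedy: runs on to the last ]
# [1] "tokens([a,b]) tokens([c,d]"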
How about something like this? Here I'll use the regcapturedmatches helper function to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m, rx), function(x) {
  unlist(strsplit(c(x), ","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.
Here is a one-liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers, using base R only:
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match, however that may be referenced in R - see the sketch below. :-)
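For instance, a minimal sketch in R using sub and a backreference (the codes vector here is just sample input taken from the question):
codes <- c("ZNH3", "ZN")
sub("(.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", "\\1", codes, perl = TRUE)
# [1] "ZN" "ZN"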

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
  toRead <- readLines("abc.txt")
  n <- as.character(num)
  x <- grep("{"n"} number",toRead,value=TRUE)
}
While grep-ing, I want the num passed to the function to dynamically create the pattern to be searched. How can this be done in R? The text file has a number and some text on every line.
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""),homicides,value=TRUE)
In order to build a regular expression from variables in R, in the current scenario, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), toRead, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
In case you use a list of items as an alternation list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting regex, Ben likes (bananas|mangoes|plums)\., will match Ben likes bananas., Ben likes mangoes. or Ben likes plums.
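A quick check of that pattern (a small sketch standing in for the original online demos):
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
grepl(regex, c("Ben likes bananas.", "Ben likes mangoes.", "Ben likes plums."))
# [1] TRUE TRUE TRUE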
NOTE: PCRE (when you pass perl=TRUE to base R regex functions) or ICU (the stringr/stringi regex functions) handle these scenarios better, so it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
Oftentimes, you will want to build a pattern from a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and on whether the words can contain special regex metacharacters or whitespace.
In the most general case, word boundaries (\b) work well.
examples <- 'I like bananas, mangoes and plums.' # sample sentence assumed here to reproduce the output below
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana, as the quick check below shows.
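A quick check of the whole-word behaviour (a small sketch with made-up test strings):
grepl("\\b(bananas|mangoes|plums)\\b",
      c("fresh bananas here", "a bananasplit", "one banana"), perl = TRUE)
# [1]  TRUE FALSE FALSE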
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. prepend \ to each metacharacter:
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
Build your own using a lookbehind/lookahead combination and your custom character class / bracket expression, or even more sophisticated patterns.
Example of the first two approaches in R (each match is replaced with the match enclosed in << and >>):
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."