Extract from string using regular expression - regex

I have a string:
s <- "test.test AS field1, ablh.blah AS field2, faslk.lsdf AS field3"
I want to convert to:
"field1, field2, field3"
I know that the regular expression (\w+)(?:,|$) will extract the strings I want ('field1,' etc) but I can't figure out how to extract it with gsub.

## Preparation
s <- "test.test AS field1, ablh.blah AS field2, faslk.lsdf AS field3"
pat <- "(\\w+)(?:,|$)" ## Note the doubly-escaped \\w
## Use the powerful gregexpr/regmatches one-two punch
m <- gregexpr(pat, s)
paste(regmatches(s, m)[[1]], collapse=" ")
# [1] "field1, field2, field3"

With strapplyc in the gsubfn package one can do it with a particularly simple regular expression which extracts each string of word characters that follows " AS " (If the field can contain non-word characters then replace \\w with the appropriate expression, for example any char that is not a space or comma: [^ ,]):
> library(gsubfn)
> strapplyc(s, " AS (\\w+)", simplify = toString)[[1]]
[1] "field1, field2, field3"

Related

Trying to use a regular expression in R to capture some data

So I have a table in R, and an example of of the string I am trying to capture is this:
C.Hale (79-83)
I want to write a regular expression to extract the (79-83).
How do I go about doing this?
We can use sub. We match one or more characters that are not a space ([^ ]+) from the beginning of the string (^) , followed by a space (\\s) and replace it with a ''.
sub('^[^ ]+\\s', '', str1)
#[1] "(79-83)"
Or another option is stri_extract_all from stringi
library(stringi)
stri_extract_all_regex(str1, '\\([^)]+\\)')[[1]]
#[1] "(79-83)"
data
str1 <- 'C.Hale (79-83)'
One possibility using the qdapRegex package I maintain:
x <- "C.Hale (79-83)"
library(qdapRegex)
rm_round(x, extract = TRUE, include.markers = TRUE)
## [[1]]
## [1] "(79-83)"

Combining regex with a literal string

I have the following code:
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
This outputs:
"I2-I3" "I3-I1" "I3-I2" "I2-I1-I3" "I3" "I2-I3"
However, I only want to extract matches to the regex that are following immediately to a specific string, e.g.:
only match the regex when it's preceded by the literal string FA-I2-I2-I2-EX.
This, for example, would be the first match of the regex, while the second match is preceded by FA-I1-I2-TR-I1-I2-FA.
The expected output is roughly the same as in the regex above, but only selecting one of the 5 matches, because it needs to be preceded by a specific literal string.
How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.
I don't know if I'm fully understanding what you mean, but it seems you could use positive lookbehind.
For instance:
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt
There should be something more intuitive but i think this will do the job
literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
a <- lapply(strsplit(innovation_patterns, literal )[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
b <- lapply(2:length(a), function(x){
a[[x]][[1]][1]
})
print(b)
Use (*SKIP)(*F)
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))
Syntax would be like,
partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
DEMO
Here's is another way you could go about this.
x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
CODE
substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"
Here's how to implement it:
lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"

Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens' only if the 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!
Remove newlines and then extract the portion matched to the part between parentheses in pattern pat. Then split apart such strings by commas and simplify into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Visualization: Here is the debuggex representation of the regular expression in pat. (Note that we need to double the backslash when put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Debuggex Demo
Note that .*? means match the shortetst string of any characters such that the entire pattern matches - without the ? it would try to match the longest string.
How about something like this. Here i'll use the regcatputedmatches helper function to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m,rx), function(x) {
unlist(strsplit(c(x),","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.
Here is a one liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"

gsub returns \n (newline)

I have this behaviour of regex which I can't explain. My goal is to parse only the text after the # yet when my string contains \n preceded by some words, gsub parses also \n:
string <- ".#address something \n"
gsub("^\\.?#([a-z0-9_]{1,15})[^a-z0-9_]+.*$", "\\1", string, perl=T);
# [1] "address\n"
string <- ".#address \n"
gsub("^\\.?#([a-z0-9_]{1,15})[^a-z0-9_]+.*$", "\\1", string, perl=T);
# [1] "address"
In Perl-compatible regular expressions . does not match \n. This is in contrast to "normal" regular expressions. Have a look at this example:
grepl(".", "\n", perl = FALSE)
# [1] TRUE
grepl(".", "\n", perl = TRUE)
# [1] FALSE
Your code will work if you specify perl = FALSE:
gsub("^\\.?#([a-z0-9_]{1,15})[^a-z0-9_]+.*$", "\\1", string, perl = FALSE)
# [1] "address"
To extract address, you could also use:
library(stringr)
str_extract(string, perl('(?<=#)[a-z0-9_]+(?= )'))
#[1] "address"

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
toRead <- readLines("abc.txt")
n <- as.character(num)
x <- grep("{"n"} number",toRead,value=TRUE)
}
While grep-ing, I want the num passed in the function to dynamically create the pattern to be searched? How can this be done in R? The text file has number and text in every line
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""),homicides,value=TRUE)
In order to build a regular expression from variables in R, in the current scenarion, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), homicides, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting Ben likes (bananas|mangoes|plums)\. regex will match Ben likes bananas., Ben likes mangoes. or Ben likes plums.. See the R demo and the regex demo.
NOTE: PCRE (when you pass perl=TRUE to base R regex functions) or ICU (stringr/stringi regex functions) have proved to better handle these scenarios, it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
Oftentimes, you will want to build a pattern with a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and whether the words can contain special regex metacharacters or not, whether they can contain whitespace or not.
In the most general case, word boundaries (\b) work well.
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana (see an R demo).
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. append \ before each of the metacharacter:
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
Build your own using the lookbehind/lookahead combination and your custom character class / bracket expression, or even more sophisticad patterns.
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."