Combining regex with a literal string - regex

I have the following code:
input <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
This outputs:
"I2-I3" "I3-I1" "I3-I2" "I2-I1-I3" "I3" "I2-I3"
However, I only want to extract matches to the regex that are following immediately to a specific string, e.g.:
only match the regex when it's preceded by the literal string FA-I2-I2-I2-EX.
This, for example, would be the first match of the regex, while the second match is preceded by FA-I1-I2-TR-I1-I2-FA.
The expected output is roughly the same as in the regex above, but only selecting one of the 5 matches, because it needs to be preceded by a specific literal string.
How can I modify this regex to achieve this purpose? I assume it needs to use a positive lookbehind to first identify the literal string, then execute the regex.

I don't know if I'm fully understanding what you mean, but it seems you could use positive lookbehind.
For instance:
(?<=a)b (positive lookbehind) matches the b (and only the b) in cab, but does not match bed or debt

There should be something more intuitive but i think this will do the job
literal <- "FA-I2-I2-I2-EX"
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
a <- lapply(strsplit(innovation_patterns, literal )[[1]], str_extract_all, '(?:I\\d-?)*I3(?:-?I\\d)*')
b <- lapply(2:length(a), function(x){
a[[x]][[1]][1]
})
print(b)

Use (*SKIP)(*F)
innovation_patterns <- gsub(input, pattern = "-1-", replacement = "-")
innovation_patterns <- lapply(innovation_patterns, str_extract_all, perl('FA-I1-I2-TR-I1-I2-FA.*(*SKIP)(*F)|(?:I\\d-?)*I3(?:-?I\\d)*'))
Syntax would be like,
partIDontWant.*(*SKIP)(*F)|choose from the string which exists before partIDontWant
DEMO

Here's is another way you could go about this.
x <- "1-FA-1-I2-1-I2-1-I2-1-EX-1-I2-1-I3-1-FA-1-I1-1-I2-1-TR-1-I1-1-I2-1-FA-1-I3-1-I1-1-FA-1-FA-1-NR-1-I3-1-I2-1-TR-1-I1-1-I2-1-I1-1-I2-1-FA-1-I2-1-I1-1-I3-1-FA-1-QU-1-I1-1-I2-1-I2-1-I2-1-NR-1-I2-1-I2-1-NR-1-I1-1-I2-1-I1-1-NR-1-I3-1-QU-1-I2-1-I3-1-QU-1-NR-1-I2-1-I1-1-NR-1-QU-1-QU-1-I2-1-I1-1-EX"
CODE
substr <- 'FA-I2-I2-I2-EX'
regex <- paste0(substr, '-?((?:I\\d-?)*I3(?:-?I\\d)*)')
gsubfn::strapply(gsub('-1-', '-', x), regex, simplify = c)
## [1] "I2-I3"

Here's how to implement it:
lapply(innovation_patterns, str_extract_all, '(?<=FA-I2-I2-I2-EX-?)(?:I\\d-?)*I3(?:-?I\\d)*');
## [[1]]
## [[1]][[1]]
## [1] "I2-I3"

Related

R regular expression to obtain all text before the second underscore

s <- "1-343-43Hello_2_323.14_fdh-99H"
In R I want to use a regex to get the substring before the, say 2nd, underscore. How can this be done with one regex ? The alternative would be to split by '_' and then paste the first two - something along;
paste(sapply(strsplit(s, "_"),"[", 1:2), collapse = "_")
Gives:
[1] "1-343-43Hello_2"
But how can I make a regex expression to do the same ?
In general, for answering to the question in title, is
sub("^(([^_]*_){n}[^_]*).*", "\\1", s)
where n is the number of _ you are allowing.
You can use a sub:
sub("^([^_]*_[^_]*).*", "\\1", s)
See the regex demo
R code demo:
s <- "1-343-43Hello_2_323.14_fdh-99H"
sub("^([^_]*_[^_]*).*", "\\1", s)
## => [1] "1-343-43Hello_2"
Pattern details:
^ - start of string
([^_]*_[^_]*) - Group 1 capturing 0+ characters other than _, then a _ and again 0+ non-_s
.* - rest of the string (note that the TRE regex . matches newlines, too).
The \\1 replacement only returns the value inside Group 1.
echo preg_replace("/([^_])_([^_]).*/" , "$1_$2" , "1-343-43Hello_2_323.14_fdh-99H");
Or if you are looking for just matching int /^[^]*[^_]*/ would be the regex string to match it
<?php
echo preg_match("/^[^_]*_[^_]*/" , "1-343-43Hello_2_323.14_fdh-99H" , $test );
var_dump( $test );
?>
or in javascript
"1-343-43Hello_2_323.14_fdh-99H".match(/^[^_]*_[^_]*/);
sub('\\_\\d+\\..*$','',s)
#[1] "1-343-43Hello_2"
here it is with gsub (in a data.table), in case you need perl=TRUE, (fx look-ahead and look-behind), which does not work in str_match, unfortunately
dtx[, var_stringr := stringr::str_match(string, '([^_]+)(?:_[^_]+){5}$')[,2]][]
dtx[
# first select the ones with '_' so that the third element is NA
grepl('_', string),
var_gsub := sub('(.*_)([^_]+)(_[^_]+){5}$', '\\2', string)][]
The disadvantage of this method is that if you select a number higher than the nth occurence, instead of giving back NA like str_match, it gives back the whole string.

Return the first occurrence of a character in a string

I have been trying to extract a portion of string after the occurrence of a first ^ sign. For example, the string looks like abc^28092015^def^1234. I need to extract 28092015 sandwiched between the 1st two ^ signs.
So, I need to extract 8 characters from the occurrence of the 1st ^ sign. I have been trying to extract the position of the first ^ sign and then use it as an argument in the substr function.
I tried to use this:
x=abc^28092015^def^1234 `rev(gregexpr("\\^", x)[[1]])[1]`
Referring the answer discussed here.
But it continues to return the last position. Can anyone please help me out?
I would use sub.
x <- "^28092015^def^1234"
sub("^.*?\\^(.*?)\\^.*", "\\1", x)
# [1] "28092015"
Since ^ is a special char in regex, you need to escape that in-order to match literal ^ symbols.
or
Do splitting on ^ and get the value of second index.
strsplit(x,"^", fixed=-T)[[1]][2]
# [1] "28092015"
or
You may use gsub aslo.
gsub("^.*?\\^|\\^.*", "", x, perl=T)
# [1] "28092015"
Here's one option with base R:
x <- "abc^28092015^def^1234"
m <- regexpr("(?<=\\^)(.+?)(?=\\^)", x, perl = TRUE)
##
R> regmatches(x, m)
#[1] "28092015"
Another option is stri_extract_first from library(stringi)
library(stringi)
stri_extract_first_regex(str1, '(?<=\\^)\\d+(?=\\^)')
#[1] "28092015"
If it is any character between two ^
stri_extract(str1, regex='(?<=\\^)[^^]+')
#[1] "28092015"
data
str1 <- 'abc^28092015^def^1234'
x <- 'abc^28092015^def^1234'
library(qdapRegex)
unlist(rm_between(x, '^', '^', extract=TRUE))[1]
# [1] "28092015"
It would be better if you split it using ^. But if you still want the pattern, you can try this.
^\S+\^(\d+)(?=\^)
Then match group 1.
OUTPUT
28092015
See DEMO

Extract subset of a string following specific text in R

I am trying to extract all of the words in the string below contained within the brackets following the word 'tokens' only if the 'tokens' occurs after 'tag(noun)'.
For example, I have the string:
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),
inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),
inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),
head([lexmatch([department]),inputmatch(['Department']),tag(noun),
tokens([department])])],0/29,[])."
I want to get a list of all of the words that occur within the brackets after the word 'tokens' only when the word tokens occurs after 'tag(noun)'.
Therefore, I want my output to be a vector of the following:
[1] new, york, state, department
How do I do this? I'm assuming I have to use a regular expression, but I'm lost on how to write this in R.
Thanks!
Remove newlines and then extract the portion matched to the part between parentheses in pattern pat. Then split apart such strings by commas and simplify into a character vector:
library(gsubfn)
pat <- "tag.noun.,tokens..(.*?)\\]"
strapply(gsub("\\n", "", m), pat, ~ unlist(strsplit(x, ",")), simplify = c)
giving:
[1] "new" "york" "state" "department"
Visualization: Here is the debuggex representation of the regular expression in pat. (Note that we need to double the backslash when put within R's double quotes):
tag.noun.,tokens..(.*?)\]
Debuggex Demo
Note that .*? means match the shortetst string of any characters such that the entire pattern matches - without the ? it would try to match the longest string.
How about something like this. Here i'll use the regcatputedmatches helper function to make it easier to extract the captured matches.
m<- "phrase('The New York State Department',[det([lexmatch(['THE']),inputmatch(['The']),tag(det),tokens([the])]),mod([lexmatch(['New York State']),inputmatch(['New','York','State']),tag(noun),tokens([new,york,state])]),head([lexmatch([department]),inputmatch(['Department']),tag(noun),tokens([department])])],0/29,[])."
rx <- gregexpr("tag\\(noun\\),tokens\\(\\[([^]]+)\\]\\)", m, perl=T)
lapply(regcapturedmatches(m,rx), function(x) {
unlist(strsplit(c(x),","))
})
# [[1]]
# [1] "new" "york" "state" "department"
The regular expression is a bit messy because your desired match contains many special regular expression symbols so we need to properly escape them.
Here is a one liner if you like:
paste(unlist(regmatches(m, gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T))), collapse=",")
[1] "new,york,state,department"
Broken down:
# Get match indices
indices <- gregexpr("(?<=tag\\(noun\\),tokens\\(\\[)[^\\]]*", m, perl=T)
# Extract the matches
matches <- regmatches(m, indices)
# unlist and paste together
paste(unlist(matches), collapse=",")
[1] "new,york,state,department"

Get Twitter #Username with Regex in R

How can I use regex in R to extract Twitter usernames from a string of text?
I've tried
library(stringr)
theString <- '#foobar Foobar! and #foo (#bar) but not foo#bar.com'
str_extract_all(string=theString,pattern='(?:^|(?:[^-a-zA-Z0-9_]))#([A-Za-z]+[A-Za-z0-9_]+)')
But I end up with #foobar, #foo and (#bar which contains an unwanted parenthesis.
How can I get just #foobar, #foo and #bar as output?
Here's one method that works in R:
theString <- '#foobar Foobar! and #foo (#bar) but not foo#bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex <- "(^|[^#\\w])#(\\w{1,15})\\b"
idx <- grep(regex, theString1, perl = T)
theString1[idx]
[1] "#foobar" "#foo" "(#bar)"
If you want to use #Jerry's answer in R:
regex <- "#([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)"
idx <- grep(regex, theString1, perl = T)
theString1[idx]
[1] "#foobar" "#foo" "(#bar)"
Both of these methods include the parenthesis that you don't want, however.
UPDATE This will get to you start-to-finish with no parentheses or any other kind of punctuation (except underscores, since they're allowed in usernames)
theString <- '#foobar Foobar! and #fo_o (#bar) but not foo#bar.com'
theString1 <- unlist(strsplit(theString, " "))
regex1 <- "(^|[^#\\w])#(\\w{1,15})\\b" # get strings with #
regex2 <- "[^[:alnum:]#_]" # remove all punctuation except _ and #
users <- gsub(regex2, "", theString1[grep(regex1, theString1, perl = T)])
users
[1] "#foobar" "#fo_o" "#bar"
#[a-zA-Z0-9_]{0,15}
Where:
# matches the character # literally (case sensitive).
[a-zA-Z0-15] match a single character present in the list
{0,15} Quantifier matches between 0 and 15 times, as many times as
possible, giving back as needed
It is working fine on selecting twitter usernames from a mixed dataset.
Try using a negative lookbehind so that characters are not consumed in your match:
(?:^|(?<![-a-zA-Z0-9_]))#([A-Za-z]+[A-Za-z0-9_]+)
^^^
EDIT: Since it seems lookbehinds don't work in R (I found somewhere here that lookbehinds worked on R, but apparently not...), try this one:
#([A-Za-z]+[A-Za-z0-9_]+)(?![A-Za-z0-9_]*\\.)
Edit: double escaped the dot
EDITv3... : Try turning on PCRE:
str_extract_all(string=theString,perl("(?:^|(?<![-a-zA-Z0-9_]))#([A-Za-z]+[A-Za-z0-9_]+)")

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
toRead <- readLines("abc.txt")
n <- as.character(num)
x <- grep("{"n"} number",toRead,value=TRUE)
}
While grep-ing, I want the num passed in the function to dynamically create the pattern to be searched? How can this be done in R? The text file has number and text in every line
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""),homicides,value=TRUE)
In order to build a regular expression from variables in R, in the current scenarion, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), homicides, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting Ben likes (bananas|mangoes|plums)\. regex will match Ben likes bananas., Ben likes mangoes. or Ben likes plums.. See the R demo and the regex demo.
NOTE: PCRE (when you pass perl=TRUE to base R regex functions) or ICU (stringr/stringi regex functions) have proved to better handle these scenarios, it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
Oftentimes, you will want to build a pattern with a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and whether the words can contain special regex metacharacters or not, whether they can contain whitespace or not.
In the most general case, word boundaries (\b) work well.
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana (see an R demo).
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. append \ before each of the metacharacter:
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
Build your own using the lookbehind/lookahead combination and your custom character class / bracket expression, or even more sophisticad patterns.
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."