R regex: grep excluding hyphen/dash as boundary - regex

I am trying to match an exact word in a vector of variable strings. For this I am using word boundaries. However, I would like the hyphen/dash not to be treated as a word boundary. Here is an example:
vector<-c(
"ARNT",
"ACF, ASP, ACF64",
"BID",
"KTN1, KTN",
"NCRNA00181, A1BGAS, A1BG-AS",
"KTN1-AS1")
To match strings that contain "KTN1" I am using:
grep("(?i)(?=.*\\bKTN1\\b)", vector, perl=T)
But this matches both "KTN1" and "KTN1-AS1".
Is there a way I could treat the dash as a character so that "KTN1-AS1" is considered a whole word?

To extract a particular word from a vector element, use functions such as regmatches or str_extract_all (from the stringr package) rather than grep, since grep returns only the index of each element where a match is found.
> vector<-c(
+ "ARNT",
+ "ACF, ASP, ACF64",
+ "BID",
+ "KTN1, KTN",
+ "NCRNA00181, A1BGAS, A1BG-AS",
+ "KTN1-AS1")
> regmatches(vector, regexpr("(?i)\\bKTN1[-\\w]*\\b", vector, perl=T))
[1] "KTN1" "KTN1-AS1"
OR
> library(stringr)
> unlist(str_extract_all(vector, regex("\\bKTN1[-\\w]*\\b", ignore_case = TRUE)))
[1] "KTN1" "KTN1-AS1"
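To make the difference between grep and regmatches concrete, here is a small sketch using the same pattern. It assumes PCRE (perl = TRUE) and passes ignore.case = TRUE instead of the inline (?i) flag; the variable names idx, elems and words_found are only illustrative.

```r
vector <- c("ARNT", "ACF, ASP, ACF64", "BID", "KTN1, KTN",
            "NCRNA00181, A1BGAS, A1BG-AS", "KTN1-AS1")
pattern <- "\\bKTN1[-\\w]*\\b"

# grep() reports WHERE a match occurs: the indices of matching elements
idx <- grep(pattern, vector, ignore.case = TRUE, perl = TRUE)

# grep(value = TRUE) returns the whole matching elements
elems <- grep(pattern, vector, ignore.case = TRUE, perl = TRUE, value = TRUE)

# regmatches() + regexpr() return only the matched substrings
words_found <- regmatches(vector,
                          regexpr(pattern, vector, ignore.case = TRUE, perl = TRUE))

idx          # 4 6
elems        # "KTN1, KTN" "KTN1-AS1"
words_found  # "KTN1" "KTN1-AS1"
```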
Update:
> grep("\\bKTN1(?=$|,)", vector, perl=T, value=T)
[1] "KTN1, KTN"
Returns the elements that contain the string KTN1 followed by a comma or the end of the line.
OR
> grep("\\bKTN1\\b(?!-)", vector, perl=T, value=T)
[1] "KTN1, KTN"
Returns the elements that contain the string KTN1 not followed by a hyphen.

I would keep this simple and create a DIY Boundary.
grep('(^|[^-\\w])KTN1([^-\\w]|$)', vector, ignore.case = TRUE)
We use capture groups to define the boundaries. Each group matches either the beginning or end of the string, or a character that is neither a hyphen nor a word character, which is closer to the intent of the \b boundary.
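The same idea can also be written with lookarounds, so that the boundary characters are checked but not consumed. This is only a sketch: it assumes PCRE (perl = TRUE), which is needed for lookbehind and guarantees that \w is interpreted as a shorthand class inside the bracket expression.

```r
vector <- c("ARNT", "ACF, ASP, ACF64", "BID", "KTN1, KTN",
            "NCRNA00181, A1BGAS, A1BG-AS", "KTN1-AS1")

# KTN1 must not be preceded or followed by a hyphen or a word character
pattern <- "(?<![-\\w])KTN1(?![-\\w])"

hits <- grep(pattern, vector, ignore.case = TRUE, perl = TRUE)
hits          # 4
vector[hits]  # "KTN1, KTN"
```

Because the lookarounds are zero-width, the same pattern can be reused with regmatches or gsub without picking up the neighbouring comma or space.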

Related

Split string on special character

I have a string (fasta format), something like this:
a = ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
and would like to separate at the character >, filter out the newlines and put the three substrings separated by > into a vector or list with three elements:
>atttaggaccttaattgtcggta
>ccattnnnncccatt
>ttaggccta
I tried strsplit:
unlist(strsplit(a, "(?<=>)", perl=T))
but this puts the delimiter > at the end of the each string.
I found related questions here and here, but I can't really get it to work without making a complicated construct.
Is there a simple solution to do this in one go?
Your regex only contains a lookbehind that matches any empty location after a > (see your regex demo). The engine processes the string from left to right, checks whether there is a > immediately to the left of the current location, and returns a valid empty-string match if a > is found.
You may use (?<=[^>])(?=>) regex:
> res <- unlist(strsplit(a, "(?<=[^>])(?=>)", perl=T))
> res
[1] ">atttaggacctta\nattgtcggta\n" ">ccattnnnn\ncccatt\n"
[3] ">ttaggccta"
> gsub("\n", "", res, fixed=TRUE)
[1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt"
[3] ">ttaggccta"
The pattern matches a location that is preceded by a non-> char and followed by a > char.
Note that using a lookbehind pattern only with strsplit often leads to unexpected behavior. See Why does strsplit use positive lookahead and lookbehind assertion matches differently?
library(stringi)
library(magrittr)
a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"
stri_replace_all_regex(a, "\\n", "") %>%
stri_extract_all_regex("(>[[:alpha:]]+)") %>%
unlist()
## [1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt" ">ttaggccta"
If one must use base only:
a <- gsub("\\n", "", a)
unlist(regmatches(a, gregexpr("(>[[:alpha:]]+)", a)))
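A variation on the base-only approach, sketched here as an alternative rather than a replacement: extract each record first (a > followed by everything up to the next >), then strip the newlines. It assumes perl = TRUE and that > never occurs inside a sequence; the name records is only illustrative.

```r
a <- ">atttaggacctta\nattgtcggta\n>ccattnnnn\ncccatt\n>ttaggccta"

# Each record: a ">" followed by any run of non-">" characters (newlines included)
records <- regmatches(a, gregexpr(">[^>]*", a, perl = TRUE))[[1]]

# Remove the line breaks inside each record
records <- gsub("\n", "", records, fixed = TRUE)
records
# [1] ">atttaggaccttaattgtcggta" ">ccattnnnncccatt" ">ttaggccta"
```

Unlike the [[:alpha:]] pattern, this version keeps any non-letter characters that might appear in a record.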

Regex matching all characters from the beginning of the string to the first underscore

I am trying to substring elements of a vector to only keep the part before the FIRST underscore. I am a bit of a newbie with taking substrings and don't fully understand all regex yet. I am close to the answer, I can get the part that I want to delete but still don't see how to get the opposite part. Any help and/or explanation of regex is appreciated!
my vector looks like the following, with multiple underscores in some elements
v = c("WL_Alk", "LQ_Frac_C_litter_origin", "MI_Nr_gat", "SED_C_N", "WL_CO2", "WL_S")
my desired output looks like
v_short = c("WL", "LQ", "MI", "SED", "WL", "WL")
The code that gets me the part I want to delete is sub("^[^_]*", "", v). I think I have to do something with $ in regex because sub("[_$]", "", v) deletes the first underscore, but I can't get it to delete the part behind it. Even with the regex helpfile I don't fully understand the meaning of ^, $ and * yet, so explanation on those is also appreciated!
You can use
> v = c("WL_Alk", "LQ_Frac_C_litter_origin", "MI_Nr_gat", "SED_C_N", "WL_CO2", "WL_S")
> sub("_.*", "", v)
[1] "WL" "LQ" "MI" "SED" "WL" "WL"
The "_.*" pattern matches the first _ and .* matches any 0+ characters up to the end of string greedily (that is, grabs them at one go).
With stringr str_extract, you can use your pattern:
> library(stringr)
> v_short = str_extract(v, "^[^_]*")
> v_short
[1] "WL" "LQ" "MI" "SED" "WL" "WL"
The ^[^_]* pattern matches the beginning of the string and 0 or more characters other than _.
If I understood correctly
gsub("(.*?)(_.*)","\\1",v, perl = TRUE)
Explanation:
(.*?) the first capturing group;
(_.*) the second capturing group;
\\1 return the first capturing group;
There are two ways to do it.
Either use ^[^_]+ to match the string before the first _,
OR
select the part starting at the first _ using \_.+$ and remove it.
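Since the question also asks what ^, $ and * mean, the following base R sketch puts the two strategies side by side with the anchors commented. The variable names by_deleting and by_keeping are only illustrative.

```r
v <- c("WL_Alk", "LQ_Frac_C_litter_origin", "MI_Nr_gat", "SED_C_N", "WL_CO2", "WL_S")

# Strategy 1: delete from the first "_" to the end of the string.
#   _   a literal underscore (the leftmost one is found first)
#   .*  any character, zero or more times
#   $   end of the string
by_deleting <- sub("_.*$", "", v)

# Strategy 2: keep only the leading run of non-underscore characters.
#   ^     start of the string
#   [^_]  any single character except "_"
#   *     zero or more repetitions of the preceding item
by_keeping <- regmatches(v, regexpr("^[^_]*", v))

by_deleting
# [1] "WL"  "LQ"  "MI"  "SED" "WL"  "WL"
identical(by_deleting, by_keeping)
# [1] TRUE
```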

Extract substrings starting with specific character until next space

I want to extract the tags (twitter handles) from tweets.
tweet <- "#me bla bla bla bla #2_him some text #me_"
The following only extracts part of some substrings due to the punctuation in some tags
regmatches(tweet, gregexpr("#[[:alnum:]]*", tweet))[[1]]
[1] "#me" "#2" "#me"
I don't know what regular expression would return the entire string (#tag).
Thanks!
If you want to match all non-spaces, just use the corresponding regular expression
regmatches(tweet, gregexpr("#[^ ]*", tweet))[[1]]
# [1] "#me" "#2_him" "#me_"
You can use the following. \S will match any non-whitespace character. Also, use the + quantifier instead of *; otherwise a lone # character in the string would be matched by itself.
> regmatches(tweet, gregexpr("#\\S+", tweet))[[1]]
# [1] "#me" "#2_him" "#me_"
Instead of [[:alnum:]]*, use \w*: the underscore is not an alphanumeric character ([[:alnum:]] matches [A-Za-z0-9]), but it is a word character ([A-Za-z0-9_]).
> regmatches(tweet, gregexpr("#\\w*", tweet))[[1]]
[1] "#me" "#2_him" "#me_"
The qdapRegex package has a function specifically designed for this task rm_tag:
library(qdapRegex)
rm_tag(tweet, extract=TRUE)
## [[1]]
## [1] "#me" "#2_him" "#me_"
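The \S and \w answers agree on the sample tweet but diverge once a tag is followed by punctuation. A small sketch with a made-up input (tweet2 is hypothetical, not from the question) shows the difference:

```r
# Hypothetical tweet where tags are directly followed by punctuation
tweet2 <- "see #tag, then #other_tag!"

# \S+ keeps everything up to the next whitespace, punctuation included
non_space <- regmatches(tweet2, gregexpr("#\\S+", tweet2, perl = TRUE))[[1]]

# \w+ stops at the first character that is not a letter, digit or underscore
word_chars <- regmatches(tweet2, gregexpr("#\\w+", tweet2, perl = TRUE))[[1]]

non_space   # "#tag,"  "#other_tag!"
word_chars  # "#tag"   "#other_tag"
```

Which one is appropriate depends on whether trailing punctuation should count as part of the tag.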

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)
Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"
Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"
A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just the root symbols (capture group 1)
# [1] "ZN" "CL"
Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$ matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)
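In R, the non-greedy pattern with an optional suffix could be applied with sub, roughly as follows. This is a sketch assuming PCRE (perl = TRUE) and the symbol format described in the question; strip_expiry is a hypothetical helper name.

```r
# Hypothetical helper: drop an optional expiration code from a futures symbol
strip_expiry <- function(symbols) {
  # ^([A-Z0-9]+?)                  root, as short as possible
  # ([FGHJKMNQUVXZ][0-9]{1,2})?    optional expiration code
  # $                              anchored at the end of the string
  sub("^([A-Z0-9]+?)([FGHJKMNQUVXZ][0-9]{1,2})?$", "\\1", symbols, perl = TRUE)
}

roots <- strip_expiry(c("ZNH3", "ZN", "CLZ4", "ZZZ33", "ABF"))
roots
# [1] "ZN"  "ZN"  "CL"  "ZZ"  "ABF"
```

Because the suffix group is optional, a symbol without an expiration code such as "ZN" is returned unchanged instead of failing to match.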

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
toRead <- readLines("abc.txt")
n <- as.character(num)
x <- grep("{"n"} number",toRead,value=TRUE)
}
While grep-ing, I want the num passed in the function to dynamically create the pattern to be searched? How can this be done in R? The text file has number and text in every line
You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""), toRead, value = TRUE)
To build a regular expression from variables in R, in the current scenario, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), toRead, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting Ben likes (bananas|mangoes|plums)\. regex will match Ben likes bananas., Ben likes mangoes. or Ben likes plums.. See the R demo and the regex demo.
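Applied to the function from the question, the same paste0 idea might look like the sketch below. It takes the lines as an argument instead of reading abc.txt so that it is self-contained, and it assumes the intended pattern is the number followed by the word "number"; find_number_lines and the sample lines are hypothetical.

```r
# Hypothetical variant of ncount(): the lines are passed in rather than read from a file
find_number_lines <- function(lines, num) {
  # \\b stops 5 from matching inside 15; num is converted to text by paste0()
  pattern <- paste0("\\b", num, " number")
  grep(pattern, lines, value = TRUE, perl = TRUE)
}

sample_lines <- c("5 number of items", "15 number of items", "no digits here")
found <- find_number_lines(sample_lines, 5)
found
# [1] "5 number of items"
```

With a real file, the first argument would simply be the result of readLines("abc.txt").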
NOTE: PCRE (when you pass perl=TRUE to base R regex functions) and ICU (stringr/stringi regex functions) have proved to handle these scenarios better, so it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
Oftentimes, you will want to build a pattern with a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and whether the words can contain special regex metacharacters or not, whether they can contain whitespace or not.
In the most general case, word boundaries (\b) work well.
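examples <- c('Ben likes bananas, mangoes and plums.')  # sample input assumed here; it is not shown in the original
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"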
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana (see an R demo).
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. prepend a \ to each metacharacter:
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use one of the following:
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string.
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string.
Your own boundaries, built using a lookbehind/lookahead combination and a custom character class / bracket expression, or even more sophisticated patterns.
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."
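The third approach (building your own boundaries) is not shown above, so here is one possible sketch. It assumes, purely for illustration, that a forward slash should count as part of a token in addition to word characters; the custom class [\w/] is that assumption, not something prescribed by the original answer.

```r
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')

# Custom boundaries: no word character or "/" immediately before or after the match
regex <- paste0('(?<![\\w/])(', paste(regex.escape(words), collapse='|'), ')(?![\\w/])')
result <- gsub(regex, "<<\\1>>", examples, perl=TRUE)
result
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and C++/CLI."
```

Here C++ followed by a comma is still matched, as with the unambiguous word boundaries, while C++/CLI is left alone because / is treated as part of the token.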