Regular Expression with pattern combination in R

Regular Expression with pattern combination in R - regex

Although I have consulted several threads, I cannot get my code to work, maybe someone can help me find a solution here.
I would like to look for files in a directory, which starts with
start.contain <- "VP01_SPSG2015_Experimental" ## beginning of the file name
and ends with
stop.contain <- ".vmrk" ## the file extension
What pattern do I have to feed to
findfile <- list.files(path, pattern = ???)
to find my file?

You can use
start.contain <- "VP01_SPSG2015_Experimental" ## beginning of the file name
stop.contain <- "[.]vmrk" ## the file extension
findfile <- list.files(path, pattern = paste0("^", start.contain, ".*", stop.contain, "$"))
The ^ means match at the beginning of the string, and $ means match at the end of the string. .* will match any zero or more characters.
Note that in a regex, . must be escaped or used in a character class ([.]) to be treated as a literal. Thus, you should use "[.]vmrk" or "\\.vmrk".

Related

R regex remove unicode apostrophe

Lets say I have the following string in R:
text <- "[Peanut M&M\u0092s]"
I've been trying to use regex to erase the apostrophe by searching for and deleting \u0092:
replaced <- gsub("\\\\u0092", "", text )
However, the above doesnt seem to work and results in the same line as the original. What is the correct way to do this removal?
Furthermore, if I wanted to remove the opening and closing [], is it more efficient to do it all in one go or on separate lines?

You can use a [^[:ascii:]] construct with a Perl-like regex to remove the non-ASCII codes from your input, and you can add an alternative [][] to also match square brackets:
text <- "[Peanut M&M\u0092s]"
replaced <- gsub("[][]|[^[:ascii:]]", "", text, perl=T)
replaced
## => [1] "Peanut M&Ms"
See IDEONE demo
If you only plan to remove the \0092 symbol, you do not need a Perl like regex:
replaced <- gsub("[][\u0092]", "", text)
See another demo
Note that [...] is a character class that matches 1 symbol, here, either a ] or [, or \u0092. If you place ] at the beginning of the character class, it does not need escaping. [ does not need escaping inside a character class (in R regex and in some other flavors, too).

Regex with stringr:: how to find first instance of pattern

Behind this question is an effort to extract all references created by knitr and latex. Not finding another way, my thought was to read into R the .Rnw script and use a regular expression to find references -- where the latex syntax is \ref{caption referenced to}. My script has 250+ references, and some are very close to each other.
The text.1 example below works, but not the text example. I think it has to do with R chugging along to the final closing brace. How do I stop at the first closing brace and extract what preceded it to the opening brace?
library(stringr)
text.1 <- c(" \\ref{test}", "abc", "\\ref{test2}", " \\section{test3}", "{test3")
# In the regular expression below, look back and if find "ref{", grab everything until look behind for } at end
# braces are special characters and require escaping with double backslacs for R to recognize them as braces
# unlist converts the list returned by str_extract to a vector
unlist(str_extract_all(string = text.1, pattern = "(?<=ref\\{).*(?=\\}$)"))
[1] "test" "test2"
# a more complicated string, with more than one set of braces in an element
text <- c("text \ref{?bar labels precision} and more text \ref{?table column alignment}", "text \ref{?table space} }")
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{).*(?=\\}$)"))
character(0)

The problem with text is the backslash in front of "ref" is being interpreted as a carriage return \r by the engine and R's parser; so you're trying to match "ref" but it's really (CR + "ef") ...
Also * is greedy by default, meaning it will match as much as it can and still allow the remainder of the regular expression to match. Use *? or a negated character class to prevent greediness.
unlist(str_extract_all(text, '(?<=\ref\\{)[^}]*'))
# [1] "?bar labels precision" "?table column alignment" "?table space"
As you can see, you can use a character class to match either (\r or r + "ef") ...
x <- c(' \\ref{test}', 'abc', '\\ref{test2}', ' \\section{test3}', '{test3',
'text \ref{?bar labels precision} and more text \ref{?table column alignment}',
'text \ref{?table space} }')
unlist(str_extract_all(x, '(?<=[\rr]ef\\{)[^}]*'))
# [1] "test" "test2" "?bar labels precision"
# [4] "?table column alignment" "?table space"

EDITED
The reason why it didn't capture what is before the closing brace } is because you added an end of line anchor $. Remove $ and it would work.
Therefore, you new code should be like this
unlist(str_extract_all(string = text, pattern = "(?<=ref\\{)[^}]*(?=\\})"))
See DEMO

regular expression -- greedy matching?

I am trying to extract a leading string by stripping off an optional trailing string, where the trailing strings are a subset of possible leading strings but not vice versa. Suppose the leading string is like [a-z]+ and the trailing string is like c. Thus from "abc" I want to extract "ab", and from "ab" I also want to get "ab". Something like this:
^([a-z]+)(?:c|)
The problem is that the [a-z]+ matches the entire string, using the empty option in the alternative, so the grabbed value is "abc" or "ab". (The (?: tells it not to grab the second part.) I want some way to make it take the longer option, or the first option, in the alternative, and use that to determine what matches the first part.
I have also tried putting the desired target inside both of the alternatives:
^([a-z]+)c|^([a-z]+)
I think that it should prefer to match the first one of the two possible alternatives, but I get the same results as above.
I am doing this in R, so I can use either the POSIX or the Perl regex library.
(The actual problem involves futures trading symbols. These have a root "instrument name" like [A-Z0-9]+, followed by an "expiration code" like [FGHJKMNQUVXZ][0-9]{1,2}. Given a symbol like "ZNH3", I want to strip the "H3" to get "ZN". But if I give it "ZN" I also want to get back "ZN".)

Try this:
> library(gsubfn)
> strapplyc(c("abc", "abd"), "^(\\w+?)c?$", simplify = TRUE)
[1] "ab" "abd"
and even easier:
> sub("c$", "", c("abc", "abd"))
[1] "ab" "abd"

Here's a working regular expression:
vec <- c("ZNH3", "ZN", "ZZZ33", "ABF")
sub("(\\w+)[FGHJKMNQUVXZ]\\d{1,2}", "\\1", vec)
# [1] "ZN" "ZN" "ZZ" "ABF"

A variation on the non-greedy answers using base code only.
codes <- c("ZNH3", "CLZ4")
matched <- regmatches(codes, regexec("^([A-Z0-9]+?)[FGHJKMNQUVXZ][0-9]{1,2}$", codes))
# [[1]]
# [1] "ZNH3" "ZN"
#
# [[2]]
# [1] "CLZ4" "CL"
sapply(matched, `[[`, 2) # extract just codes
# [1] "ZN" "CL"

Use a 'non-greedy' match for the first part of the regex, followed by the definitions of your 'optional allowed suffixes' anchored by the 'end-of-string'...
This regex (.+?)([FGHJKMNQUVXZ][0-9]{1,2})?$matches...
(.+?) as few characters as possible
([FGHJKMNQUVXZ][0-9]{1,2})? followed by an allowable (but optional) suffix
$ followed by the end of string
The required result is in the first captured element of the match (however that may be referenced in 'r') :-)

Using variable to create regular expression pattern in R

I have a function:
ncount <- function(num = NULL) {
toRead <- readLines("abc.txt")
n <- as.character(num)
x <- grep("{"n"} number",toRead,value=TRUE)
}
While grep-ing, I want the num passed in the function to dynamically create the pattern to be searched? How can this be done in R? The text file has number and text in every line

You could use paste to concatenate strings:
grep(paste("{", n, "} number", sep = ""),homicides,value=TRUE)

In order to build a regular expression from variables in R, in the current scenarion, you may simply concatenate string literals with your variable using paste0:
grep(paste0('\\{', n, '} number'), homicides, value=TRUE)
Note that { is a special character outside a [...] bracket expression (also called character class), and should be escaped if you need to find a literal { char.
In case you use a list of items as an alternative list, you may use a combination of paste/paste0:
words <- c('bananas', 'mangoes', 'plums')
regex <- paste0('Ben likes (', paste(words, collapse='|'), ')\\.')
The resulting Ben likes (bananas|mangoes|plums)\. regex will match Ben likes bananas., Ben likes mangoes. or Ben likes plums.. See the R demo and the regex demo.
NOTE: PCRE (when you pass perl=TRUE to base R regex functions) or ICU (stringr/stringi regex functions) have proved to better handle these scenarios, it is recommended to use those engines rather than the default TRE regex library used in base R regex functions.
Oftentimes, you will want to build a pattern with a list of words that should be matched exactly, as whole words. Here, a lot will depend on the type of boundaries and whether the words can contain special regex metacharacters or not, whether they can contain whitespace or not.
In the most general case, word boundaries (\b) work well.
regex <- paste0('\\b(', paste(words, collapse='|'), ')\\b')
unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE)))
## => [1] "bananas" "mangoes" "plums"
The \b(bananas|mangoes|plums)\b pattern will match bananas, but won't match banana (see an R demo).
If your list is like
words <- c('cm+km', 'uname\\vname')
you will have to escape the words first, i.e. append \ before each of the metacharacter:
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- c('Text: cm+km, and some uname\\vname?')
words <- c('cm+km', 'uname\\vname')
regex <- paste0('\\b(', paste(regex.escape(words), collapse='|'), ')\\b')
cat( unlist(regmatches(examples, gregexpr(regex, examples, perl=TRUE))) )
## => cm+km uname\vname
If your words can start or end with a special regex metacharacter, \b word boundaries won't work. Use
Unambiguous word boundaries, (?<!\w) / (?!\w), when the match is expected between non-word chars or start/end of string
Whitespace boundaries, (?<!\S) / (?!\S), when the match is expected to be enclosed with whitespace chars, or start/end of string
Build your own using the lookbehind/lookahead combination and your custom character class / bracket expression, or even more sophisticad patterns.
Example of the first two approaches in R (replacing with the match enclosed with << and >>):
regex.escape <- function(string) {
gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
examples <- 'Text: cm+km, +km and C++,Delphi,C++CLI and C++/CLI.'
words <- c('+km', 'C++')
# Unambiguous word boundaries
regex <- paste0('(?<!\\w)(', paste(regex.escape(words), collapse='|'), ')(?!\\w)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and <<C++>>,Delphi,C++CLI and <<C++>>/CLI."
# Whitespace boundaries
regex <- paste0('(?<!\\S)(', paste(regex.escape(words), collapse='|'), ')(?!\\S)')
gsub(regex, "<<\\1>>", examples, perl=TRUE)
# => [1] "Text: cm+km, <<+km>> and C++,Delphi,C++CLI and C++/CLI."

Matching non commented pattern in eclipse

I am having troubles with a regex syntax.
I want to match all occurrences of a certain word followed by a number, but exclude lines which are commented.
Comments are (multiple) # or ## or ### ...
Examples:
#This is a comment <- no match
#This is a comment myword 8 <- no match
my $var = 'myword 12'; <- match
my $var2 = 'myword'; <- no match
Until now I have
orignal pattern: ^[^(\#+)](.*?)(myword \d+)(.*?)$
new pattern: ^([^\#]*?)(myword\s+\d+)(.*?)$
Which should match lines which do no begin with one or more #, followed by something, then the word number combination I am searching for and finally something.
It would perhaps be good to match also parts of lines if the comment does not begin at the beginning of the line.
my $var3 = 'test';#myword 8 <- no match
What am I doing wrong?
I want to use it in Eclipse's file search (with Perl epic module).
Edit: The new pattern I got does no return false matches, but it return multiple the line which includes myword and several lines before that line. And I'm not sure it returns all matches.

Note that [] are character classes. You cannot use quantifiers in there. They are like the . – matches any character given in there. The dot itself, or a character class, can then be quantified.
In your example, [^(#+)] would match everything except (,), +, and depending on the flavour (I guess) # and \.
So what you want here is to match a line that starts with any character except for a #. (I think.)
A problem is that the # might occur in a string where it is not a comment. (Regarding comments not starting at the beginning of the line.)

Re: comments not at the beginning of the string.
To do this right (e.g. not to miss any valid matches) you pretty much have to parse a file's specific programming language's grammar properly, so you can't do this (easily, or even at all) with a RegEx.
If you don't, you risk missing valid search hits that follow a "#" used in a context other than comment start - as an example common to pretty much any language, after a string "this is my #hash".
It's even worse in Perl where "#" can also appear as a regex delimiter, as a $#myArr (index of the last element of an array), or - joy of joys - as a valid character in an identifyer name!

Of course, if You are aware of these problems and still want to use regexp to extract the content. Something like this may be useful:
^[^\#].[^\n\#]+myword\s\d+.[$;]+
This is a little bit complex but I hope it will works for You.
For me this matches as below:
my $var = 'myword 12'; <- match
my $var = 'myword 17'; <- match
my $var2 = 'myword'; <- no match
my $var = 'myword 9'; #'myword 17'; <- partly match
my $var = 'myword 8'; ##'myword 127'; <- partly match
my $var = ;#'myword 17'; <- no match
#my $var = 'myword 13'; <- no match
##my $var2 = 'myword 14'; <- no match

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Regular Expression with pattern combination in R - regex

Related

R regex remove unicode apostrophe

Regex with stringr:: how to find first instance of pattern

regular expression -- greedy matching?

Using variable to create regular expression pattern in R

Matching non commented pattern in eclipse

Categories

Resources