RegEx escape function in R [duplicate] - regex

This question already has answers here:
Is there an R function to escape a string for regex characters
(5 answers)
Closed last year.
In an R script, I need to build a regex that contains strings which may include special characters. So I should first escape those strings and then use them in the regex object.
pattern <- regex(paste('\\W', str, '\\W', sep = ''));
In this example, str should be treated as a fixed (literal) string. So I need a function that returns the escaped form of its input, for example 'c++' -> 'c\\+\\+'.

I think you have to escape only 12 characters, so a regular expression that alternates over those should do the trick. For example:
> gsub('(\\\\|\\^|\\$|\\.|\\||\\?|\\*|\\+|\\(|\\)|\\[|\\{)', '\\\\\\1', 'C++')
[1] "C\\+\\+"
Or you could build that regular expression from the list of special chars if you do not like the plethora of manual backslashes above -- such as:
> paste0('(', paste0('\\', strsplit('\\^$.|?*+()[{', '')[[1]], collapse = '|'), ')')
[1] "(\\\\|\\^|\\$|\\.|\\||\\?|\\*|\\+|\\(|\\)|\\[|\\{)"
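Wrapping this up as a function might look like the sketch below. The name escape_regex is my own invention (it is not a base R function), and the character list is simply the twelve characters from above, so treat it as an illustration rather than a vetted implementation.

```r
# Build the alternation from the list of special characters (same as above).
specials <- strsplit('\\^$.|?*+()[{', '')[[1]]
special_pattern <- paste0('(', paste0('\\', specials, collapse = '|'), ')')

# Hypothetical helper: prefix every special character with a backslash.
escape_regex <- function(x) {
  gsub(special_pattern, '\\\\\\1', x)
}

escaped <- escape_regex('c++')
print(escaped)                      # [1] "c\\+\\+"

# Use the escaped string inside a larger pattern, as in the question.
pattern <- paste0('\\W', escaped, '\\W')
print(grepl(pattern, 'I like c++ a lot'))   # TRUE
print(grepl(pattern, 'I like c a lot'))     # FALSE
```

The function is vectorised because gsub is, so it can be applied to a whole character vector of search terms at once.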

Related

Remove parenthesis and Characters inside it [duplicate]

This question already has answers here:
Remove text between parentheses in dart/flutter
(2 answers)
Regular expresion for RegExp in Dart
(2 answers)
JavaScript/regex: Remove text between parentheses
(5 answers)
Closed 3 months ago.
I want to remove the parentheses along with all the characters inside them...
var str = "B.Tech(CSE)2020";
print(str.replaceAll(new RegExp('/([()])/g'), ''));
// output => B.Tech(CSE)2020
// output required => B.Tech 2020
I tried a bunch of regexes, but nothing works...
I am using Dart...
In Dart, you don't use forward slashes / to delimit the pattern. You can pass a string, and prefixing it with r makes it a raw string, so you don't have to double-escape the backslashes.
In your pattern you have to:
escape the parentheses
negate the character class so it matches any character except parentheses
repeat the character class with a quantifier like *, or else it will match only a single character
The pattern will look like:
\([^()]*\)
Example
var str = "B.Tech(CSE)2020";
print(str.replaceAll(new RegExp(r'\([^()]*\)'), ' '));
Output
B.Tech 2020
Your Dart syntax is off, and seems to be confounded with JavaScript. Consider this version:
String str = "B.Tech(CSE)2020";
print(str.replaceAll(RegExp(r'\(.*?\)'), " ")); // B.Tech 2020
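The two answers use slightly different patterns, and as far as I can tell they only diverge on nested parentheses. To keep the extra snippets on this page in one language, here is a quick check in R rather than Dart. perl = TRUE is set so that PCRE handles the lazy .*?, and the nested input string is made up for illustration.

```r
strict <- "\\([^()]*\\)"   # no parentheses allowed inside the match
lazy   <- "\\(.*?\\)"      # anything up to the first closing parenthesis

flat   <- "B.Tech(CSE)2020"
nested <- "a(b(c)d)e"      # hypothetical input, not from the question

flat_strict   <- gsub(strict, " ", flat,   perl = TRUE)
flat_lazy     <- gsub(lazy,   " ", flat,   perl = TRUE)
nested_strict <- gsub(strict, " ", nested, perl = TRUE)
nested_lazy   <- gsub(lazy,   " ", nested, perl = TRUE)

print(flat_strict)     # [1] "B.Tech 2020"
print(flat_lazy)       # [1] "B.Tech 2020"
print(nested_strict)   # [1] "a(b d)e"  (only the innermost pair is removed)
print(nested_lazy)     # [1] "a d)e"    (stops at the first ")")
```

Neither pattern removes nested groups cleanly in one pass, so which one to pick depends on what the input can look like.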

Regex string in Go [duplicate]

This question already has answers here:
"Unknown escape sequence" error in Go
(2 answers)
Closed 4 years ago.
I am trying to use the string
"/{foo}/{bar:[a-zA-Z0-9=\-\/]+}.{vzz}"
in Go.
When I use double quotes ("), I see the error:
unknown escape sequence
When I use single quotes ('), I get:
cannot use '\u0000' (type rune) as type string in array or slice literal
unknown escape sequence
How can I use this regular expression for mux in my Go application?
When you mean a literal \ character in an interpreted (double-quoted) string literal, it must itself be escaped:
"/{foo}/{bar:[a-zA-Z0-9=\\-\\/]+}.{vzz}"
Alternatively, you could use backticks instead of double quotes:
`/{foo}/{bar:[a-zA-Z0-9=\-\/]+}.{vzz}`
According to the Go language specification:
string_lit = raw_string_lit | interpreted_string_lit .
raw_string_lit = "`" { unicode_char | newline } "`" .
interpreted_string_lit = `"` { unicode_value | byte_value } `"` .
So if you do not want to escape anything in your string literal, you need a raw one. The specification also says:
The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes
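Go does not use the single quote ' to delimit string literals; it delimits rune (single-character) literals. The error with the double quote " occurs because the compiler tries to interpret \- and \/ as escape sequences inside the interpreted string literal, before the string ever reaches the regex engine, and neither is a valid escape.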

Regular expression in R: gsub pattern

I'm learning R's regular expressions and I am having trouble understanding this
gsub example:
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
So far I think I get:
if x is alphanumeric it doesn't match, so nothing is modified
if x contains a . or | or ( or { or } or + or $ or ? it adds \\ in front of it
I can't explain:
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10\1')
[1] "10\001"
or
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10/1')
[1] "10/1"
I am also confused about why the replacement "\\\\\\1" adds only two backslashes.
I'm supposed to figure out what this function does, and I think it's supposed to escape certain special characters?
The entire pattern is wrapped in parentheses which allows back-references. This part:
[.|()\\^{}+$*?]
... is a "character class", so it matches any one of the characters inside the square brackets, and as you say it changes the way the pattern syntax interprets what would otherwise be meta-characters within the pattern definition.
The next part is a "pipe" character, which is the regex OR, followed by an escaped open-square-bracket, another "OR" pipe, and then an escaped close-square-bracket. Since both R and regex use backslashes as escapes, you need to double them to get an R+regex escape in patterns (replacement strings follow their own rules, covered below). The close-square-bracket can only be entered in a character class if it is placed first in the class, so that entire pattern could have been more compactly formed with:
"[][.|()\\^{}+$*?]" # without the "|\\[|\\])"
In replacement strings, the form "\\n" refers to whatever matched the n-th parenthesized group of the pattern; in this case "\\1" is the second part of the replacement. The first part is "\\\\": the first "\\" forms an escape and the second "\\" forms the backslash, so together they insert one literal backslash. Now get ready for the even weirder part ... how many characters are in that result?
> nchar( gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\1", '10\1') )
[1] 3
And then, of course, none of the characters in the pattern is equal to '\1', so nothing is replaced. Whoever wrote the tutorial you have in front of you (which I do not think is the gsub help page) has a weird sense of humor. Here are a couple of functions that may be useful if you need to create characters that would otherwise be intercepted by the system readline function:
> intToUtf8(1)
[1] "\001"
> ?intToUtf8
> 0x0
[1] 0
> intToUtf8(0)
[1] ""
> utf8ToInt("")
integer(0)
And do look at ?Quotes where a lot of useful information can be found (under what I would consider a rather unlikely title) about how R handles octal, hexadecimal and other numbers and special characters.
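To make the counting concrete, here is a small sketch that inspects both the odd input string and the replacement string character by character. The variable names and the "a.b" test input are mine, chosen only for illustration.

```r
x <- '10\1'
print(nchar(x))        # 3: "1", "0", and the character with code 1
print(utf8ToInt(x))    # 49 48 1
print(identical(x, paste0('10', intToUtf8(1))))   # TRUE

# The replacement string as R stores it: three backslashes and a "1".
repl <- "\\\\\\1"
print(nchar(repl))     # 4
cat(repl, "\n")        # \\\1  (what the regex engine actually sees)

# The engine reads \\ as one literal backslash and \1 as capture group 1.
escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", repl, "a.b")
print(nchar(escaped))  # 4
cat(escaped, "\n")     # a\.b
print(escaped)         # [1] "a\\.b"  (print() re-escapes the backslash)
```

The last two lines show the usual source of confusion: print() shows the escaped form of a string, while cat() and nchar() show what is really in it.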
The first regex broken down is this
( # (1 start)
[.|()\^{}+$*?]
| \[
| \]
) # (1 end)
It captures whatever is in the class, or '[' or ']', and then replaces it with \\\1, which is an escaped backslash plus whatever was in capture group 1.
So, basically, it just puts a backslash in front of each occurrence of one of those chars.
The regex could be better written as ([.|()^{}\[\]+$*?]) or within a
string as "([.|()^{}\\[\\]+$*?])"
Edit (promoting a comment) -
The regex won't match the string 10\1, so there should be no replacement. There must be some interpolation by the language when printing. It looks like it's converting it to octal \001: since it can't display the raw character with code 1, it shows its octal escape instead.

gsub("BLAH", "", "BLAH\WHAT") won't let x have a backslash? [duplicate]

This question already has answers here:
How to escape a backslash in R? [duplicate]
(2 answers)
Closed 8 years ago.
I'm doing some batch string clean up and a lot of the entries look like this:
"ABC\Company Co."
Which causes weird errors, and I can't seem to remove the backslash.
For example, try entering this into your console:
gsub("BLAH", "", "BLAH\WHAT")
and you get:
Error: '\W' is an unrecognized escape in character string starting ""BLAH\W"
I know that it thinks \W is a command. I'm actually surprised that gsub is 'interpreting' x, since x is just the string I want to sub out. I don't get why gsub cares what's actually in x, just that it should replace "BLAH" with "" within "BLAH\WHAT"...
The obvious solution would be to remove the \ from the string ahead of time.
gsub("\\", "", "BLAH\WHAT")
But then you get the exact same error message!
Thoughts? Thanks!
Use
gsub("\\\\", "", "BLAH\\WHAT")
which gives
[1] "BLAHWHAT"
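To produce one backslash in an R string literal, you need to escape it with another \. Thus "BLAH\\WHAT" contains a single backslash. The regex engine also treats the backslash as an escape character, so the pattern has to contain two backslashes to match one literal backslash, and writing those two in an R string literal takes four: "\\\\".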
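A quick way to convince yourself of the counts is to look at the strings with nchar and cat. The sketch below also shows two alternatives that avoid the quadruple backslash: fixed = TRUE, and the raw string syntax r"(...)", which I am assuming is available (it needs R 4.0.0 or later).

```r
x <- "BLAH\\WHAT"
print(nchar(x))    # 9, so there is only one backslash in the string
cat(x, "\n")       # BLAH\WHAT

# Regex version: four backslashes in the literal, one matched in the data.
res_regex <- gsub("\\\\", "", x)

# Literal version: no regex involved, so one escaped backslash is enough.
res_fixed <- gsub("\\", "", x, fixed = TRUE)

print(res_regex)   # [1] "BLAHWHAT"
print(res_fixed)   # [1] "BLAHWHAT"

# Raw string literal (assumes R >= 4.0.0): backslashes are taken as-is.
y <- r"(BLAH\WHAT)"
print(identical(x, y))   # TRUE
```

Note that none of this helps if the backslash never made it into the data in the first place: a string typed as "BLAH\WHAT" is rejected by the parser before gsub is ever called.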
See these related questions:
How to escape a backslash in R?
How to escape backslashes in R string

Is there an R function to escape a string for regex characters

I want to build a regular expression by substituting in some strings to search for, so these strings need to be escaped before I can put them in the regex; that way it still works if a search string contains regex characters.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
    vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
              "\\{", "\\}", "\\^", "\\$","\\*",
              "\\+", "\\?", "\\.", "\\|")
    replace.vals <- paste0("\\\\", vals)
    for(i in seq_along(vals)){
        strings <- gsub(vals[i], replace.vals[i], strings)
    }
    strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
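A minimal sketch of the \\Q...\\E approach follows. I am passing perl = TRUE because I am sure PCRE supports this quoting; the test strings are made up. One caveat worth keeping in mind: if the string itself contains \E, the quoting ends early.

```r
x <- "foo[bar]"

# Quote the whole string so the engine treats it literally.
quoted <- paste0("\\Q", x, "\\E")
cat(quoted, "\n")    # \Qfoo[bar]\E

hit_quoted  <- grepl(quoted, "see foo[bar] here", perl = TRUE)
miss_quoted <- grepl(quoted, "see foob here", perl = TRUE)

# Without quoting, [bar] is a character class, so "foob" matches.
hit_unquoted <- grepl(x, "see foob here", perl = TRUE)

print(hit_quoted)     # TRUE
print(miss_quoted)    # FALSE
print(hit_unquoted)   # TRUE

# If the whole pattern is literal, fixed = TRUE avoids regex altogether.
hit_fixed <- grepl(x, "see foo[bar] here", fixed = TRUE)
print(hit_fixed)      # TRUE
```

The quoted form is handy when the literal string is only one piece of a larger pattern; fixed = TRUE only works when the entire pattern is literal.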
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using a capture group, (\\W), we can detect the occurrences of non-word characters and escape them with the \\1 syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or, equivalently, substitute "([^[:alnum:]_])" for "(\\W)".
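As a sanity check, an escaped string used as a pattern should match the original string and nothing else. The sketch below wraps the gsub call in a helper; the name escape_nonword and the test inputs are my own. Because this approach also escapes harmless characters such as spaces, I match with perl = TRUE, where a backslash before any non-alphanumeric character is guaranteed to mean that literal character.

```r
# Hypothetical helper: put a backslash in front of every non-word character.
escape_nonword <- function(x) gsub("(\\W)", "\\\\\\1", x)

x <- "foo[bar]"
esc_x <- escape_nonword(x)
print(esc_x)                 # [1] "foo\\[bar\\]"

# Round trip: the anchored, escaped pattern matches only the literal string.
s <- "a.b (c)"
pat <- paste0("^", escape_nonword(s), "$")
cat(pat, "\n")               # ^a\.b\ \(c\)$

match_same  <- grepl(pat, s, perl = TRUE)
match_other <- grepl(pat, "aXb (c)", perl = TRUE)
print(match_same)            # TRUE
print(match_other)           # FALSE
```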