How to replace a symbol by a backslash in R? - regex

Could you help me to replace a char by a backslash in R? My trial:
gsub("D","\\","1D2")
Thanks in advance

You need to re-escape the backslash because it needs to be escaped once as part of a normal R string (hence '\\' instead of '\'), and in addition it’s handled differently by gsub in a replacement pattern, so it needs to be escaped again. The following works:
gsub('D', '\\\\', '1D2')
# "1\\2"
The result looks different from the desired output because R doesn't print the raw string; it prints an interpretable R string literal (note the surrounding quotation marks!). If you use cat or message, the actual content is printed:
cat(gsub('D', '\\\\', '1D2'), '\n')
# 1\2
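A quick way to convince yourself that the result holds only one backslash is to count its characters. This is a minimal sketch using base R only; the variable name x is just for illustration:

```r
x <- gsub("D", "\\\\", "1D2")

# The string has three characters: "1", one backslash, "2".
nchar(x)

# print() shows the escaped form, writeLines() shows the raw content.
print(x)
writeLines(x)

# "1\\2" typed as an R literal is the same three-character string.
identical(x, "1\\2")
```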

When inputting backslashes from the keyboard, always escape them:
gsub("D","\\\\","1D2")
#[1] "1\\2"
or,
gsub("D","\\","1D2", fixed=TRUE)
#[1] "1\\2"
or,
library(stringr)
str_replace("1D2","D","\\\\")
#[1] "1\\2"
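Note: print() will always display this string as "1\\2", because it shows backslashes in escaped form. The string itself contains a single backslash, which you can see with cat() or writeLines(). If the backslashes are for path names, you can use forward slashes to avoid the issue.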
For more information, refer to this issue raised in R help: How to replace double backslash with single backslash in R.

gsub("\\p{S}", "\\\\", text, perl=TRUE);
\p{S} ... Match a character from the Unicode category symbol.

Vim - removing leading and trailing spaces in a function

I am trying to remove leading and trailing spaces in a function but it does not work:
function Trim(s)
echo ">>>".a:s."<<<"
let ss = substitute(a:s, '\v^\s*([^\s]+)\s*$', '\1', '')
echo ">>>".ss."<<<"
endfunction
The regex \s*([^\s]+)\s* works ok on https://regex101.com/
Replacing * with + does not make any difference.
Testing:
: call Trim(" testing ")
Output:
>>> testing <<<
>>> testing <<<
Also it seems to matter if I use double quotes versus single quotes in substitute function.
Where are the problems and how can they be solved? Thanks.
Your issue is caused by your collection.
You should use [^ ] instead of [^\s]:
function! Trim(s)
echo ">>>".a:s."<<<"
let ss = substitute(a:s, '\v^\s*([^ ]+)\s*$', '\1', '')
echo ">>>".ss."<<<"
endfunction
This is because collections work on individual characters and \s is not an individual character; it's seen as \ followed by s, which doesn't resolve to anything because s is not a special character that needs escaping.
If you want your collection to include both spaces and tabs, use this:
[^ \t]
[ \t]
where \t represents a tab.
As romainl explained, [^\s] means neither \ nor s. The contrary of \s (i.e. anything but a space or a tab) would be \S.
Otherwise, here is another solution: in lh-vim-lib I've defined the following
function! lh#string#trim(string) abort
return matchstr(a:string, '^\v\_s*\zs.{-}\ze\_s*$')
endfunction
Regarding the difference(s) between the various kinds of quote characters, see this Q/A on vi.SE: https://vi.stackexchange.com/questions/9706/what-is-the-difference-between-single-and-double-quoted-strings
You are including what needs to be retained in your search/replace. It is much easier to look only for what needs to be removed and substitute that:
:%s/\v(^\s+|\s+$)//g
Breakdown
%s Start a global search/replace
\v Use Very Magic
(^\s+ search for leading spaces
| or
\s+$) search for trailing spaces
//g remove all search results from entire line

R regex remove unicode apostrophe

Let's say I have the following string in R:
text <- "[Peanut M&M\u0092s]"
I've been trying to use regex to erase the apostrophe by searching for and deleting \u0092:
replaced <- gsub("\\\\u0092", "", text )
However, the above doesn't seem to work and results in the same line as the original. What is the correct way to do this removal?
Furthermore, if I wanted to remove the opening and closing [], is it more efficient to do it all in one go or on separate lines?
You can use a [^[:ascii:]] construct with a Perl-like regex to remove the non-ASCII codes from your input, and you can add an alternative [][] to also match square brackets:
text <- "[Peanut M&M\u0092s]"
replaced <- gsub("[][]|[^[:ascii:]]", "", text, perl=TRUE)
replaced
## => [1] "Peanut M&Ms"
If you only plan to remove the \u0092 character (along with the square brackets), you do not need a Perl-like regex:
replaced <- gsub("[][\u0092]", "", text)
Note that [...] is a character class that matches 1 symbol, here, either a ] or [, or \u0092. If you place ] at the beginning of the character class, it does not need escaping. [ does not need escaping inside a character class (in R regex and in some other flavors, too).
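To see why the original attempt found nothing, it may help to check what the string actually contains. The following is a rough sketch in base R, done in two steps for readability; the variable names no_apostrophe and cleaned are made up for this example:

```r
text <- "[Peanut M&M\u0092s]"

# "\u0092" in the source is one character, not a backslash followed by "u0092".
nchar("\u0092")

# The pattern "\\\\u0092" searches for a literal backslash plus "u0092",
# which does not occur in text.
grepl("\\\\u0092", text)

# Matching the character itself works; fixed = TRUE skips regex parsing.
no_apostrophe <- gsub("\u0092", "", text, fixed = TRUE)

# "]" placed first in the class is a literal "]", so "[][]" matches either bracket.
cleaned <- gsub("[][]", "", no_apostrophe)
cleaned
```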

Regular expression in R: gsub pattern

I'm learning R's regular expressions and I am having trouble understanding this
gsub example:
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
So far I think I get:
if x is alphanumeric, nothing matches, so nothing is modified
if x contains a . or | or ( or { or } or + or $ or ? it adds \\ in front of it
I can't explain:
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10\1')
[1] "10\001"
or
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10/1')
[1] "10/1"
I am also confused why the replacement "\\\\\\1" adds only two backslashes.
I'm supposed to figure out what this function does, and I think it's supposed to escape certain special characters?
The entire pattern is wrapped in parentheses which allows back-references. This part:
[.|()\\^{}+$*?]
... is a "character class" so it matches any one of the characters inside the square brackets, and as you say it is changing the way the pattern syntax will interpret what would otherwise be meta-characters within the pattern definition.
The next part is a "pipe" character which is the regex-OR followed by an escaped open-square-bracket, another "OR"-pipe, and then an escaped close-square-bracket. Since both R and regex use backslashes as escapes, you need to double them to get an R+regex-escape in patterns ... but not in replacement strings. The close-square-bracket can only be entered in a character class if it is placed first in the class, so that entire pattern could have been more compactly formed with:
"[][.|()\\^{}+$*?]" # without the "|\\[|\\])"
In replacement strings the form "\\n" refers to whatever matched the n-th parenthesized portion of the pattern; in this case "\\1" is the second part of the replacement. The first part is "\\\\": the R parser reduces it to two backslashes, and gsub then treats that pair as one literal backslash. Now get ready for the even weirder part ... how many characters are in that result?
> nchar( gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\1", '10\1') )
[1] 3
And then of course none of the items in the match is equal to '\1'. Somebody writing whatever tutorial you have before you (which I do not think is the gsub help page) has a weird sense of humor. Here are a couple of functions that may be useful if you need to create characters that would otherwise be intercepted by the system readline function:
> intToUtf8(1)
[1] "\001"
> ?intToUtf8
> 0x0
[1] 0
> intToUtf8(0)
[1] ""
> utf8ToInt("")
integer(0)
And do look at ?Quotes where a lot of useful information can be found (under what I would consider a rather unlikely title) about how R handles octal, hexadecimal and other numbers and special characters.
The first regex broken down is this
( # (1 start)
[.|()\^{}+$*?]
| \[
| \]
) # (1 end)
It captures whatever is in the class, or '[' or ']', and then replaces it with \\\1, which is an escaping backslash plus whatever was in capture 1.
So, basically it just escapes a single occurrence of one of those chars.
The regex could be better written as ([.|()^{}\[\]+$*?]) or within a
string as "([.|()^{}\\[\\]+$*?])"
Edit (promoting a comment) -
The regex won't match the string 10\1, so there should be no replacement. There must be an interpolation (by the language) on the printout. It looks like it's converting it to octal \001: since it can't show binary 1, it shows its octal equivalent.
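The points made in the answers above can be checked directly. This is only a sketch in base R; the names pattern, s and escaped are chosen for the example, and "a.b" is an arbitrary test input:

```r
pattern <- "([.|()\\^{}+$*?]|\\[|\\])"

# '10\1' is "1", "0" and the control character with code 1; it holds no backslash.
s <- '10\1'
utf8ToInt(s)
grepl("\\", s, fixed = TRUE)

# Nothing matches, so gsub returns the input unchanged.
identical(gsub(pattern, "\\\\\\1", s), s)

# With a metacharacter present, exactly one backslash is inserted before it.
escaped <- gsub(pattern, "\\\\\\1", "a.b")
nchar(escaped)
writeLines(escaped)
```

print(escaped) would show "a\\.b", which is the same four-character string in escaped form.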

How to change "It's" to "It is" (without apostrophe) using str_replace?

I want to replace from Facebook's relationships string "It's complicated" to other text.
The line is like this:
$user->relationship = str_replace(array('single', 'It's complicated'), array('Soltero(a)', 'Es complicado'),$data['relationship_status']);
Using: 'It's complicated' , 'It&apos;s complicated' or 'It's complicated' ,
do not work.
Any suggestions?
Thanks a lot.
Regards.
If you want to use a literal single quote character (') inside a single-quoted string, you have to escape it,
like:
$str = '\''; // single quote
You could try this.
$user->relationship = str_replace(array('single', 'It\'s complicated'), array('Soltero(a)', 'Es complicado'),$data['relationship_status']);
PHP cannot recognize a literal single quote (') inside a single-quoted string unless it is preceded by the escape character. Here is the explanation:
Strings literal
The same happens with a literal double quote (") inside a double-quoted string.

Is there an R function to escape a string for regex characters

I want to build a regex by substituting in some strings to search for, so these strings need to be escaped before I can put them in the regex; that way it still works if a searched-for string contains regex characters.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
"\\{", "\\}", "\\^", "\\$","\\*",
"\\+", "\\?", "\\.", "\\|")
replace.vals <- paste0("\\\\", vals)
for(i in seq_along(vals)){
strings <- gsub(vals[i], replace.vals[i], strings)
}
strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than @ryanthompson's function is to simply prepend \\Q and append \\E to your string. See the help file ?base::regex.
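Here is a small sketch of that idea. I pass perl = TRUE because \Q...\E is a Perl-style construct, so that is the setting I am sure about; note also that this approach would break if the string itself contained \E. The name quoted is just for the example:

```r
x <- "foo[bar]"
quoted <- paste0("\\Q", x, "\\E")

# Unquoted, "[bar]" is a character class, so "foob" matches.
grepl(x, "foob", perl = TRUE)

# Quoted, only the literal text "foo[bar]" matches.
grepl(quoted, "foob", perl = TRUE)
grepl(quoted, "see foo[bar] here", perl = TRUE)
```

If the whole pattern is a literal string and nothing else, fixed = TRUE achieves the same without any quoting.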
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using capture groups, (\\W), we can detect the occurrences of non-word characters and escape them with the \\1-syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or similarly, substitute "([^[:alnum:]_])" for "(\\W)".
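Wrapped up as a helper, this covers the example from the question. A minimal sketch, assuming base R's default regex engine; escape_regex is a made-up name, not an existing function:

```r
# Hypothetical helper: put a backslash in front of every non-word character.
escape_regex <- function(s) gsub("(\\W)", "\\\\\\1", s)

x <- "foo[bar]"
y <- escape_regex(x)
writeLines(y)

# Same string as the literal the question asked for.
identical(y, "foo\\[bar\\]")

# The escaped string can be embedded in a larger pattern.
grepl(paste0("^", y, "$"), c("foo[bar]", "foob"))
```

The last call matches only the first element, since the brackets are now treated literally instead of as a character class.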