Replacing string with back reference preceeded with backslash - regex

I want to replace all _ and % in a string with \_ and \% respectively.
I tried
String.replace("_foo%_bar", ~r/_|%/, "\\\\0")
But this just produces "\\0foo\\0\\0bar".
How to properly escape the first backslash not to affect the back reference syntax?

You need to use
String.replace("_foo%_bar", ~r/_|%/, "\\\\\\0")
Here, "\\\\" defines 2 literal \ chars that are parsed as a single literal \ char in the replacement, and "\\0" is parsed as a \0, the backreference to the whole match value.
You may also use
String.replace("_foo%_bar", ~r/_|%/, ~S(\\\0))
to avoid overescaping, as ~S sigil does not allow escape sequences and backslashes have literal meanings inside them.

You need one more pair of backslashes:
iex(1)> IO.puts String.replace("_foo%_bar", ~r/_|%/, "\\\\\\0")
\_foo\%\_bar
But I'd suggest using Regex.replace/3 with a function as a callback here:
iex(2)> IO.puts Regex.replace(~r/_|%/, "_foo%_bar", &("\\" <> &1))
\_foo\%\_bar

Related

Matching a string containing special characters with regex in perl

I have a line in my file which contains the following string
$print = "SM_sdo_debugss_cxct6_CSCTM_4 \csctm_gen[4]_ctm_i_nctm_I_csctm (4+5)";
$my_meta = '\csctm_gen[4]_ctm_i_nctm_I_csctm';
print "I got this\n" if($print =~ /\Q$my_meta\E/);
But it's not able to find the $my_meta string in $print. Why?
Your first string is in double quotes, so backslash escape sequences are processed.
\cs stands for Ctrl-S, which can also be written chr(19) or "\x13".
Your second string is in single quotes, which ignores backslash escapes (apart from \\ and \').
So your regex ends up looking for a 3-character sequence \ c s, but your target string contains a single byte 0x13.
To fix this, either write "... \\cs ..." in your first string (the first backslash escapes the second one), or use single quotes for your first string ('... \cs ...').

Detect \ using regex in R [duplicate]

I'm writing strings which contain backslashes (\) to a file:
x1 = "\\str"
x2 = "\\\str"
# Error: '\s' is an unrecognized escape in character string starting "\\\s"
x2="\\\\str"
write(file = 'test', c(x1, x2))
When I open the file named test, I see this:
\str
\\str
If I want to get a string containing 5 backslashes, should I write 10 backslashes, like this?
x = "\\\\\\\\\\str"
[...] If I want to get a string containing 5 \ ,should i write 10 \ [...]
Yes, you should. To write a single \ in a string, you write it as "\\".
This is because the \ is a special character, reserved to escape the character that follows it. (Perhaps you recognize \n as newline.) It's also useful if you want to write a string containing a single ". You write it as "\"".
The reason why \\\str is invalid, is because it's interpreted as \\ (which corresponds to a single \) followed by \s, which is not valid, since "escaped s" has no meaning.
Have a read of this section about character vectors.
In essence, it says that when you enter character string literals you enclose them in a pair of quotes (" or '). Inside those quotes, you can create special characters using \ as an escape character.
For example, \n denotes new line or \" can be used to enter a " without R thinking it's the end of the string. Since \ is an escape character, you need a way to enter an actual . This is done by using \\. Escaping the escape!
Note that the doubling of backslashes is because you are entering the string at the command line and the string is first parsed by the R parser. You can enter strings in different ways, some of which don't need the doubling. For example:
> tmp <- scan(what='')
1: \\\\\str
2:
Read 1 item
> print(tmp)
[1] "\\\\\\\\\\str"
> cat(tmp, '\n')
\\\\\str
>

Regular expression in R: gsub pattern

I'm learning R's regular expression and I am having trouble understanding this
gsub example:
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
So far I think I get:
if x is alphanumeric it doesn't match so all nothing modified
if x contains a . or | or ( or { or } or + or $ or ? it adds \\ in front of it
I can't explain:
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10\1')
[1] "10\001"
or
> gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", '10/1')
[1] "10/1"
I am also confused why the replacement "\\\\\\1" add only two brackets.
I'm suppose to figure out what this function does and I think it's suppose to escape certain special characters ?
The entire pattern is wrapped in parentheses which allows back-references. This part:
[.|()\\^{}+$*?]
... is a "character class" so it matches any one of the characters inside teh square-brackets, and as you say it is changing the way the pattern syntax will interpret what would otherwise be meta-characters within the pattern definition.
The next part is a "pipe" character which is the regex-OR followed by an escaped open-square-bracket, another "OR"-pipe, and then an escaped close-square-bracket. Since both R and regex use backslashes as escapes, you need to double them to get an R+regex-escape in patterns ... but not in replacement strings. The close-square-bracket can only be entered in a character class if it is placed first in the string, sot that entire pattern could have been more compactly formed with:
"[][.|()\\^{}+$*?]" # without the "|\\[|\\])"
In replacement strings the form "\\n" refers to whatever matched the n-th parenthetical portion of the 'pattern', in this case '\1' is the second portion of the replacement. The first position is "\" which forms an escape and the second "\" forms the backslash. Now get ready to the even weirder part ... how many characters are in that result?
> nchar( gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\1", '10\1') )
[1] 3
And then of course none of the items in the match is equal to '\1". Somebody writing whatever tutorial you have before you (which I do not think is the gsub help page) has a weird sense of humor. Here are a couple of functions that may be useful if you need to create characters that would otherwise be intercepted by the system readline function:
> intToUtf8(1)
[1] "\001"
> ?intToUtf8
> 0x0
[1] 0
> intToUtf8(0)
[1] ""
> utf8ToInt("")
integer(0)
And do look at ?Quotes where a lot of useful information can be found (under what I would consider a rather unlikely title) about how R handles octal, hexadecimal and other numbers and special characters.
The first regex broken down is this
( # (1 start)
[.|()\^{}+$*?]
| \[
| \]
) # (1 end)
It captures any what's in the 'class' or '[' or ']' then it looks like it replaces it with \\\1 which is an escape plus whatever was in capture 1.
So, basically it just escapes a single occurrence of one of those chars.
The regex could be better written as ([.|()^{}\[\]+$*?]) or within a
string as "([.|()^{}\\[\\]+$*?])"
Edit (promoting a comment) -
The regex won't match string 10\1 so there should be no replacement. There must be an interpolation (language) on the print out. Looks like its converting it to octal \001. - Since it cant show binary 1 it shows its octal equivalent.

How to replace a symbol by a backslash in R?

Could you help me to replace a char by a backslash in R? My trial:
gsub("D","\\","1D2")
Thanks in advance
You need to re-escape the backslash because it needs to be escaped once as part of a normal R string (hence '\\' instead of '\'), and in addition it’s handled differently by gsub in a replacement pattern, so it needs to be escaped again. The following works:
gsub('D', '\\\\', '1D2')
# "1\\2"
The reason the result looks different from the desired output is that R doesn’t actually print the result, it prints an interpretable R string (note the surrounding quotation marks!). But if you use cat or message it’s printed correctly:
cat(gsub('D', '\\\\', '1D2'), '\n')
# 1\2
When inputting backslashes from the keyboard, always escape them:
gsub("D","\\\\","1D2")
#[1] "1\\2"
or,
gsub("D","\\","1D2", fixed=TRUE)
#[1] "1\\2"
or,
library(stringr)
str_replace("1D2","D","\\\\")
#[1] "1\\2"
Note: If you want something like "1\2" as output, I'm afraid you can't do that in R (at least in my knowledge). You can use forward slashes in path names to avoid this.
For more information, refer to this issue raised in R help: How to replace double backslash with single backslash in R.
gsub("\\p{S}", "\\\\", text, perl=TRUE);
\p{S} ... Match a character from the Unicode category symbol.

C++ Unrecognized escape sequence

I want to create a string that contains all possible special chars.
However, the compiler gives me the warning "Unrecognized escape sequence" in this line:
wstring s=L".,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`'" + wstring(1,34);
Can anybody please tell me which one of the characters I may not add to this string the way I did?
You have to escape \ as \\, otherwise \¯ will be interpreted as an (invalid) escape sequence:
wstring s=L".,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`'" + wstring(1,34);
Escape sequence is a character string that has a different meaning than the literal characters themselves. In C and C++ the sequence begins with \ so if your string contains a double quote or backslash it must be escaped properly using \" and \\
In long copy-pasted strings it may be difficult to spot those characters and it's also less maintainable in the future so it's recommended to use raw string literals with the prefix R so you don't need any escapes at all
wstring s = LR"(.,;*:-_⁊#‘⁂‡…–«»¤¤¡=„+-¶~´:№\¯/?‽!¡-¢–”¥—†¿»¤{}«[-]()·^°$§%&«|⸗<´>²³£­€™℗#©®~µ´`')"
+ wstring(1,34);
A special delimiter string may be inserted outside the parentheses like this LR"delim(special string)delim" in case your raw string contains a )" sequence