Escape character for Location of char in a string: R lang - regex

I'm trying to get the location of \ or / in a string. Below is the code I'm attempting:
x <- "<span id=\"ref_12590587_l\">6,803.61</span>_l>"
gregexpr("\\\", x)
which(strsplit(x, "")[[1]]=="\")
My problem is that when I run this code in RStudio, I get a continuation prompt: the REPL prompt becomes +. The same code works for other characters.
Why am I getting the continuation prompt, even though the \ is inside the quotes?
Edit: corrected the string after comment.

You have to add another slash (as stribizhev says in the comments). So you're looking for
gregexpr("\\\\", x)
The reason is that you need to escape the \ twice: once for R's string parser and once for the regex engine. So \\ gives you only 1 backslash. When you put 3 in, the 3rd backslash is actually escaping the quote!
See for an example:
gregexpr("\"", 'hello, "hello"')
This is searching for the quote in the string.

Just to formalize my comments:
Your x variable does not contain any backslashes. The ones you typed are escape characters that allow us to put literal quotation marks into a string.
gregexpr("\\\", x) contains a non-closed string literal because the quotation mark on the right is escaped, and thus is treated as a literal quotation mark, not the one that is used to "close" a string literal.
To search for a literal \ in gregexpr, we need 4 backslashes \\\\, as gregexpr expects a regular expression. In regular expressions, "\" is a special symbol, thus it must be escaped for the regex engine. But inside gregexpr, we pass a string that itself is using \ for escaping entities like \n. So, we need to escape the backslash for R first, and then for the regex engine.
That said, you can use
gregexpr("\\\\", x)
to get only literal backslashes, or
gregexpr("\\\\|/", x)
to also look for forward slashes.
See IDEONE demo
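For readers who want to see the two layers side by side, here is a minimal sketch in Python rather than R (the escaping rules for string literals and regexes are analogous, but this is not the asker's code). The sample string `text` is made up for illustration; a raw string is used to show what the pattern looks like once the string-literal layer is taken out of the picture.

```python
import re

# Hypothetical sample text containing one real backslash and one forward slash.
text = "a\\b/c"          # the literal "\\" is ONE backslash in the string
print(len(text))         # 5 characters: a, \, b, /, c

# Layer 1: the string literal "\\\\" holds two backslash characters.
# Layer 2: the regex engine reads those two characters as one literal backslash.
backslashes = [m.start() for m in re.finditer("\\\\", text)]

# A raw string removes the first layer, so only the regex escaping remains.
either_slash = [m.start() for m in re.finditer(r"\\|/", text)]

print(backslashes)   # [1]
print(either_slash)  # [1, 3]
```

Note that Python positions are 0-based, whereas gregexpr in R reports 1-based positions.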

Related

How to prepend a backslash using clojure.string/replace

I am trying to use clojure.string/replace to escape certain characters like asterisks and backticks with backslashes (like ex*mple -> ex\*mple), but I cannot make sense of the function's own escaping rules:
If I try (cs/replace "ex*mple" #"[\*`]" "\\$0"), it treats the $0 literally and returns ex$0mple.
If I try (cs/replace "ex*mple" #"[\*`]" "\\\\$0") it adds two slashes: ex\\*mple.
What is the right way to do it?
Your second approach, (cs/replace "ex*mple" #"[\*`]" "\\\\$0"), is correct. The reason you see two backslashes in the result is because that's how Clojure shows single backslashes in strings. If you print "ex\\*mple", you'll see ex\*mple.
Clojure uses backslash as an escape character in strings, so backslashes themselves have to be escaped. "ex\*mple" is not a valid string literal in Clojure because \* is an unsupported escape sequence.
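The difference between what a string contains and how it is displayed in source-code form can be sketched in Python (not Clojure; the replacement-template syntax differs, with `\g<0>` playing the role of `$0`). This is only an illustration of the same idea under that assumption.

```python
import re

# Replacement template: "\\" in the template is a literal backslash and
# "\g<0>" is the whole match; the raw string avoids a second layer of escaping.
escaped = re.sub(r"[*`]", r"\\\g<0>", "ex*mple")

# A function replacement sidesteps template escaping altogether.
escaped_fn = re.sub(r"[*`]", lambda m: "\\" + m.group(0), "ex*mple")

print(escaped)        # ex\*mple    (what the string actually contains)
print(repr(escaped))  # 'ex\\*mple' (source-code form, backslash shown doubled)
print(len(escaped))   # 8
```

The length of 8 confirms there is only one backslash in the result; the doubled backslash appears only when the string is shown as a literal.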

Got confused about Emacs Reg exp

I understand that in configuration files it's always \\( to escape a special character like (. However, when we do re-search-replace, should it be \( or \\(?
For M-x re-search-* and M-x replace-regexp, use \(. Generally, do so whenever you're entering a regular expression at a prompt. [1]
The reason you have to use \\( in configuration files (or any elisp) is that there the regexes will be encoded as strings, and in string literals backslashes have to be escaped to be distinguishable from other escape sequences (i.e., "\\n" is a backslash followed by an n, whereas "\n" is a newline).
[1] Thanks to @phils for pointing out that this should be mentioned explicitly.
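The same split between "typed in source code" and "typed at a prompt" can be sketched in Python, with a raw string standing in for prompt input (an assumption made for illustration; this is not Emacs Lisp). Also note that Python's regex dialect treats \( as a literal parenthesis, whereas Emacs regexps use \( for grouping, so only the string-literal layer carries over.

```python
import re

# Source-code form: the string literal layer consumes one level of backslashes.
in_source = "\\("      # two characters: backslash, open paren

# A raw string stands in for text typed at a prompt: nothing is unescaped.
at_prompt = r"\("      # the same two characters

print(in_source == at_prompt)  # True
print(len("\\n"))              # 2: backslash followed by n
print(len("\n"))               # 1: a single newline character

# In Python's regex dialect, backslash-paren matches a literal "(".
print(re.search(in_source, "f(x)").start())  # 1
```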

Escaping filename suffix in a regex?

I'm trying to match a variable length string followed by the filetype suffix in an XML filename using a regex:
varrrrrriableLengthString.xml
Currently I'm using this regex with a greedy match. The second backslash is there to escape the first, which in turn escapes the dot.
[A-Za-z0-9]+\\.[xX][mM][lL]
I've tested this on RegExr, and it matches with only one backslash. However my CPP parser requires the double backslash.
How can I properly escape the filename suffix?
You can also escape chars using the [] notation, in your case [.]. The main advantage is that there is no "one or two backslashes?" question anymore, and I find it more readable.
This does not work for the brackets themselves: to escape a [ (or ]), you still have to use \[ (or \\[ in a string literal) and not [[].
Backslashes still have to be escaped using another backslash too.
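A quick sketch of the two spellings, written in Python rather than C++ (the character-class trick works the same way in most regex dialects, but check your own engine). The file name is the one from the question; "reportXxml" is a made-up counterexample.

```python
import re

name = "varrrrrriableLengthString.xml"

# Character class: no backslash, so nothing to double in the string literal.
with_class = re.compile("[A-Za-z0-9]+[.][xX][mM][lL]")
# Backslash escape: doubled because this is a normal (non-raw) string literal.
with_escape = re.compile("[A-Za-z0-9]+\\.[xX][mM][lL]")

print(bool(with_class.fullmatch(name)))   # True
print(bool(with_escape.fullmatch(name)))  # True

# Without any escaping, the dot matches any character, including "X".
unescaped = re.compile("[A-Za-z0-9]+.[xX][mM][lL]")
print(bool(unescaped.fullmatch("reportXxml")))   # True
print(bool(with_class.fullmatch("reportXxml")))  # False
```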

In postgreSQL, why is \s treated differently from \w?

Here is the example that confuses me:
select ' w' ~ '^\s\w$';
This results in "false", but seems like it should be true.
select ' w' ~ '^\\s\w*$';
This results in "true", but:
Why does \s need the extra backslash?
If it truly does, why does \w not need the extra backslash?
Thanks for any help!
I think you may have tested it the wrong way, because I'm getting the opposite of the results you got.
select ' w' ~ '^\s\w$';
This returns 1 in my case, which makes sense because it matches the space at the beginning of the text, followed by the letter at the end.
select ' w' ~ '^\\s\w*$';
This returns 0, and that makes sense too. Here you're trying to match a backslash at the beginning of the text, followed by an s and then by any number of letters, numbers or underscores.
A piece of text that would match your second regex would be: '\sw'
Check the fiddle here.
The string constants are first parsed and interpreted as strings, including escaped characters. Escaping of unrecognized sequences is handled differently by different parsers, but generally, besides errors, the most common behavior is to ignore the backslash.
In the first example, the right-hand string constant is first interpreted as '^sw$', because neither \s nor \w is a recognized string escape sequence and each simply loses its backslash.
In the second example, the right-hand constant is interpreted as '^\sw*$', because the \\ in \\s collapses to a single \.
After the strings are interpreted, they are applied as regular expressions: '^\sw*$' matches ' w', whereas '^sw$' does not.
Some languages use backslash as an escape character. Regexes do that, C-like languages do that, and some rare and odd dialects of SQL do that. PostgreSQL does it. PostgreSQL translates the backslash escapes to arrive at a string value, and then feeds that string value to the regex parser, which AGAIN translates whatever backslashes survived the first translation -- if any. In your first regex, none did.
For example, in a string literal or a regex, \n doesn't mean a backslash followed by a lowercase n. It means a newline. Depending on the language, a backslash followed by a lowercase s will mean either just a lowercase s, or nothing. In PostgreSQL, an invalid escape sequence in a string literal translates as the escaped character: '\w' translates to 'w'. All the regex parser sees there is the w. By chance, you used the letter w in the string you're matching against. It's not matching that w in the left-hand string because it's a word character; it's matching it because it's a lowercase w. Change it to lowercase x and it'll stop matching.
If you want to put a backslash in a string literal, you need to escape it with another backslash: '\\'. This is why \\s in your second regex worked. Add a second backslash to \w if you want to match any word character with that one.
This is a horrible pain. It's why JavaScript, Perl, and other languages have special conventions for regex literals like /\s\w/, and why C# programmers use the @"string literal" feature to disable backslash escaping in strings they intend to use as regexes.
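The two-pass behaviour these answers describe can be modelled with a small Python sketch. The helper `legacy_unescape` is hypothetical and deliberately simplified: it only drops the backslash in front of any character (so \\ becomes \ and \w becomes w) and ignores real escapes such as \n. Whether a given PostgreSQL server actually processes backslashes in ordinary string constants depends on its version and settings, which may be why the first answer saw the opposite results.

```python
import re

def legacy_unescape(literal: str) -> str:
    """Simplified model of the string-literal pass described above:
    a backslash is dropped and the following character is kept as-is."""
    out = []
    i = 0
    while i < len(literal):
        ch = literal[i]
        if ch == "\\" and i + 1 < len(literal):
            out.append(literal[i + 1])  # keep the escaped character only
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# Raw strings hold exactly the characters written in the SQL constants.
first = legacy_unescape(r"^\s\w$")
second = legacy_unescape(r"^\\s\w*$")

print(first)   # ^sw$
print(second)  # ^\sw*$

# Second pass: the surviving text is handed to the regex engine.
print(bool(re.search(first, " w")))   # False
print(bool(re.search(second, " w")))  # True
```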

Regex match backslash star

Can't work this one out, this matches a single star:
// Escaped multiply
Text = Text.replace(new RegExp("\\*", "g"), '[MULTIPLY]');
But I need it to match \*, I've tried:
\\*
\\\\*
\\\\\*
Can't work it out, thanks for any help!
You were close, \\\\\\* would have done it.
Better use verbatim strings, that makes it easier:
RegExp(@"\\\*", "g")
\\ matches a literal backslash (\\\\ in a normal string), \* matches an asterisk (\\* in a normal string).
Remember that there are two 'levels' of escaping.
First, you are escaping your strings for the C# compiler, and you are also escaping your strings for the Regex engine.
If you want to match "\*" literally, then you need to escape both of these characters for the regex engine, since otherwise they mean something different. We escape these with backslashes, so you will have "\\\*".
Then, we have to escape the backslashes in order to write them as a literal string. This means replacing each backslash with two backslashes: "\\\\\\*".
Instead of this last part, we could use a "verbatim string", where no escapes are applied. In this case, you only need the result from the first escaping: @"\\\*".
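The two levels can be checked mechanically. The sketch below uses Python instead of C#, with a raw string playing the role of the verbatim string (an analogy, not the same feature); the sample `text` is made up for illustration.

```python
import re

# Hypothetical input: contains the two characters \* once, plus a lone *.
text = "a\\*b * c"

# Six backslashes in a normal literal equal three in a raw string.
print("\\\\\\*" == r"\\\*")  # True

# Regex level: \\ matches a literal backslash, \* matches a literal asterisk.
only_escaped = re.sub(r"\\\*", "[MULTIPLY]", text)
print(only_escaped)  # a[MULTIPLY]b * c

# For comparison, \* alone matches every asterisk and leaves the backslash.
every_star = re.sub(r"\*", "[STAR]", text)
print(every_star)    # a\[STAR]b [STAR] c
```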
Your syntax is completely wrong. It looks more like Javascript than C#.
This works fine:
string Text = "asdf*sadf";
Text = Regex.Replace(Text, "\\*", "[MULTIPLY]");
Console.WriteLine(Text);
Output:
asdf[MULTIPLY]sadf
To match \* you would use the pattern "\\\\\\*".