Got confused about Emacs Reg exp

Got confused about Emacs Reg exp - regex

I understand, in the configuration files, it's always \\( to escape special char like (, however, when we do re-search-replace, it should be \( or \\(?

For M-x re-search-* and M-x regexp-replace, use \(. Generally, do so whenever you're entering a regular expression at a prompt.1
The reason you have to use \\( in configuration files (or any elisp) is that there the regexes will be encoded as strings, and in string literals backslashes have to be escaped to be distinguishable from other escape sequences (i.e., "\\n" is a backslash followed by an n, whereas "\n" is a newline).
1 Thanks to #phils for pointing out that this should be mentioned explicitly.

Related

How to prepend a backslash using clojure.string/replace

I am trying to use clojure.string/replace to escape certain characters like asterisks and backticks with backslashes (like ex*mple -> ex\*mple), but I cannot make sense of the function's own escaping rules:
If I try (cs/replace "ex*mple" #"[\*`]" "\\$0"), it treats the $0 literally and returns ex$0mple.
If I try (cs/replace "ex*mple" #"[\*`]" "\\\\$0") it adds two slashes: ex\\*mple.
What is the right way to do it?

Your second approach, (cs/replace "ex*mple" #"[\*`]" "\\\\$0"), is correct. The reason you see two backslashes in the result is because that's how Clojure shows single backslashes in strings. If you print "ex\\*mple", you'll see ex\*mple.
Clojure uses backslash as an escape character in strings, so backslashes themselves have to be escaped. ex\*mple is not a valid string in Clojure because \* is an unsupported escape character.

Escape character for Location of char in a string: R lang

I'm trying to get the location of \ or / in a string. Below is the code I'm attempting:
x <- "<span id=\"ref_12590587_l\">6,803.61</span>_l>"
gregexpr("\\\", x)
which(strsplit(x, "")[[1]]=="\")
My problem is when I attempt these codes in Rstudio, I get a continue prompt, the REPL prompt becomes +. These codes work for other characters.
Why I'm getting the continue prompt, even though the \ is quoted in the inverted quotes?
Edit: corrected the string after comment.

You have to add another slash (as stribizhev says in the comments). So you're looking for
gregexpr("\\\\", x)
The reason why is that the you need to escape the \, twice. So \\ gives you only 1 backslash. When you put 3 in, the 3rd backslash is actually escaping the quote!
See for an example:
gregexpr("\"", 'hello, "hello"')
This is searching for the quote in the string.

Just to formalize my comments:
Your x variable does not contain any backslashes, these are escaping characters that allow us putting literal quotation marks into a string.
gregexpr("\\\", x) contains a non-closed string literal because the quotation mark on the right is escaped, and thus is treated as a literal quotation mark, not the one that is used to "close" a string literal.
To search for a literal \ in gregexpr, we need 4 backslashes \\\\, as gregexpr expects a regular expression. In regular expressions, "\" is a special symbol, thus it must be escaped for the regex engine. But inside gregexpr, we pass a string that itself is using \ for escaping entities like \n. So, we need to escape the backslash for R first, and then for the regex engine.
That said, you can use
gregexpr("\\\\", x)
to get only literal backslashes, or
gregexpr("\\\\|/", x)
to also look for forward slashes.
See IDEONE demo

Escaping filename suffix in a regex?

I'm trying to match a variable length string followed by the filetype suffix in an XML filename using a regex:
varrrrrriableLengthString.xml
Currently I'm using this regex with a greedy match, the second backslash is to escape the first, which is to escape the dot.
[A-Za-z0-9]+\\.[xX][mM][lL]
I've tested this on RegExr, and it matches with only one backslash. However my CPP parser requires the double backslash.
How can I properly escape the filename suffix?

You can also escape chars using the [] notation, in your case [.]. The main advantage is that there is no "one or two backslashes?" question anymore, and I find it more readable IMHO.
It just does not work with brackets, i.e. to escape a [ (or ]), you still have to use \[ (or \\[ for a string literal) and not [[].
Backslashes still have to be escaped using another backslash too.

In postgreSQL, why is \s treated differently from \w?

Here is the example that confuses me:
select ' w' ~ '^\s\w$';
This results in "false", but seems like it should be true.
select ' w' ~ '^\\s\w*$';
This results in "true", but:
Why does \s need the extra backslash?
If it truly does, why does \w not need the extra backslash?
Thanks for any help!

I think you have tested it the wrong way because I'm getting the opposite results that you got.
select ' w' ~ '^\s\w$';
Is returning 1 in my case. Which actually makes sense because it is matching the space at the beginning of the text, followed by the letter at the end.
select ' w' ~ '^\\s\w*$';
Is returning 0 and it makes sense too. Here you're trying to match a backslash at the beginning of the text followed by an s and then, by any number of letters, numbers or underscores.
A piece of text that would match your second regex would be: '\sw'
Check the fiddle here.

The string constants are first parsed and interpreted as strings, including escaped characters. Escaping of unrecognized sequences is handled differently by different parsers, but generally, besides errors, the most common behavior is to ignore the backslash.
In the first example, the right-hand string constant is first being interpreted as '^sw$', where both \s and \w are not recognized string escape sequences.
In the second example the right hand constant is interpreted as '^\sw*$' where \\s escapes the \
After the strings are interpreted they are then applied as a regular expression, '^\sw*$' matching ' w' where '^sw$' does not.

Some languages use backslash as an escape character. Regexes do that, C-like languages do that, and some rare and odd dialects of SQL do that. PostgresSQL does it. PostgresSQL is translating the backslash escaping to arrive at a string value, and then feeding that string value to the regex parser, which AGAIN translates whatever backslashes survived the first translation -- if any. In your first regex, none did.
For example, in a string literal or a regex, \n doesn't mean a backslash followed by a lowercase n. It means a newline. Depending on the language, a backslash followed by a lowercase s will mean either just a lowercase s, or nothing. In PostgresSQL, an invalid escape sequence in a string literal translates as the escaped character: '\w' translates to 'w'. All the regex parser sees there is the w. By chance, you used the letter w in the string you're matching against. It's not matching that w in the lvalue because it's a word character; it's matching it because it's a lowercase w. Change it to lowercase x and it'll stop matching.
If you want to put a backslash in a string literal, you need to escape it with another backslash: '\\'. This is why \\s in your second regex worked. Add a second backslash to \w if you want to match any word character with that one.
This is a horrible pain. It's why JavaScript, Perl, and other languages have special conventions for regex literals like /\s\w/, and why C# programmers use the #"string literal" feature to disable backslash escaping in strings they intend to use as regexes.

c++ regexp for not preceded by backslash and preceded by backslash

I can only find negative lookbehind for this , something like (?<!\\).
But this won't compile in c++ and flex. It seems like both regex.h nor flex support this?
I am trying to implement a shell which has to get treat special char like >, < of | as normal argument string if preceded by backslash. In other word, only treat special char as special if not preceded by 0 or even number of '\'
So echo \\>a or echo abc>a should direct output to a
but echo \>a should print >a
What regular expression should I use?
I'm using flex and yacc to parse the input.

In a Flex rule file, you'd use \\ to match a single backslash '\' character. This is because the \ is used as an escape character in Flex.
BACKSLASH \\
LITERAL_BACKSLASH \\\\
LITERAL_LESSTHAN \\\\<
LITERAL_GREATERTHAN \\\\>
LITERAL_VERTICALBAR \\\\|
If I follow you correctly, in your case you want "\>" to be treated as literal '>' but "\\>" to be treated as literal '\' followed by special redirect. You don't need negative look behind or anything particularly special to accomplish this as you can build one rule that would accept both your regular argument characters and also the literal versions of your special characters.
For purposes of discussion, let's assume that your argument/parameter can contain any character but ' ', '\t', and the special forms of '>', '<', '|'. The rule for the argument would then be something like:
ARGUMENT ([^ \t\\><|]|\\\\|\\>|\\<|\\\|)+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\\\ matches any instance of "\" (i.e. a literal backslash)
\\> matches any instance of ">" (i.e. a literal greater than)
\\< matches any instance of "\<" (i.e. a literal less than)
\\\| matches any instance of "\|" (i.e. a literal vertical bar/pipe)
Actually... You can probably just shorten that rule to:
ARGUMENT ([^ \t\\><|]|\\[^ \t\r\n])+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\[^ \t\r\n] matches any character preceded by a '\' in your input except for whitespace (which will handle all of your special characters and allow for literal forms of all other characters)
If you want to allow for literal whitespace in your arguments/parameters then you could shorten the rule even further but be careful with using \\. for the second half of the rule alternation as it may or may not match " \n" (i.e. eat your trailing command terminator character!).
Hope that helps!

You cannot easily extract single escaped characters from a command-line, since you will not know the context of the character. In the simplest case, consider the following:
LessThan:\<
BackslashFrom:\\<
In the first one, < is an escaped character; in the second one, it is not. If your language includes quotes (as most shells do), things become even more complicated. It's a lot better to parse the string left to right, one entity at a time. (I'd use flex myself, because I've stopped wasting my time writing and testing lexers, but you might have some pedagogical reason to do so.)
If you really need to find a special character which shouldn't be special, just search for it (in C++98, where you don't have raw literals, you'll have to escape all of the backslashes):
regex: (\\\\)*\\[<>|]
(An even number -- possibly 0 -- of \, then a \ and a <, > or |)
as a C string => "(\\\\\\\\)*\\\\[<>|]"

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js