How is python regex '\\\\' evaluated? [duplicate]


This question already has answers here:
Confused about backslashes in regular expressions [duplicate]
(3 answers)
Closed 4 years ago.
I'm reading python doc of re library and quite confused by the following paragraph:
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
How is \\\\ evaluated?
\\\\ -> \\\ -> \\ cascadingly
or \\\\ -> \\ in pairs?
I know \ is a meta character just like |, I can do
>>> re.split('\|', 'a|b|c|d') # split by literal '|'
['a', 'b', 'c', 'd']
but
>>> re.split('\\', 'a\b\c\d') # split by literal '\'
Traceback (most recent call last):
gives me an error. It seems that, unlike \|, the \\ is evaluated more than once.
and I tried
>>> re.split('\\\\', 'a\b\c\d')
['a\x08', 'c', 'd']
which makes me even more confused...

There are two things going on here: how string literals are evaluated, and how regexes are evaluated.
'a\b\c\d' in Python code represents the string a<backspace>\c\d
'\\\\' in Python code represents the string \\.
The string \\ is a regex pattern that matches the single character \
Your problem here is that the string you're searching is not what you expect.
\b is the backspace character, \x08. \c and \d are not recognized escape sequences, so Python leaves the backslash in place. Since Python 3.6 such sequences trigger a DeprecationWarning (a SyntaxWarning since 3.12), and they are planned to become an error in a future version.
I assume you meant to spell it r'a\b\c\d' or 'a\\b\\c\\d'
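The two layers can be checked directly in the interpreter. This is a minimal sketch; the variable names are made up for illustration, and the "mangled" string is written with explicit escapes so that it reproduces what the non-raw 'a\b\c\d' contains.

```python
import re

pattern = '\\\\'             # four backslashes in source -> the 2-character string \\
subject = r'a\b\c\d'         # raw string: the backslashes stay in the data

# Layer 1 (string literal) has already run; layer 2 (regex) reads \\ as one literal backslash.
parts = re.split(pattern, subject)
print(len(pattern), parts)   # 2 ['a', 'b', 'c', 'd']

# What the non-raw 'a\b\c\d' actually contains: \b became backspace (\x08).
mangled = 'a\x08\\c\\d'
mangled_parts = re.split(r'\\', mangled)
print(mangled_parts)         # ['a\x08', 'c', 'd']
```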

re.split('\\', 'a\b\c\d') # split by literal '\'
You forgot that '\' in the second argument is an escape character too. It works if the second argument is changed as well:
re.split(r'\\', 'a\\b\\c\\d')
The r prefix marks a "raw" string: backslash escape sequences are not processed.

Think about the implications of evaluating backslashes cascadingly:
If you wanted the string \n (not the newline symbol, but literally \n), you couldn't find a sequence of characters to get said string.
\n would be the newline symbol, \\n would be evaluated to \n, which in turn would become the newline symbol again. This is why escape sequences are evaluated in pairs.
So you need to write \\ within a string to get a single \, but you need to have two backslashes in your string so that the regex will match the literal \. Therefore you will need to write \\\\ to match a literal backslash.
You have a similar problem with your a\b\c\d string. The parser will try to evaluate the escape sequences, and \b is a valid sequence for 'backspace', represented as \x08. You will need to escape your backslashes here, too, like a\\b\\c\\d.
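A short sketch of the pairwise evaluation described above. The names are illustrative only; the checks compare string lengths before the regex engine is involved at all.

```python
import re

newline = '\n'         # one character: line feed
backslash_n = '\\n'    # two characters: a backslash and the letter n
four = '\\\\'          # pairs collapse once, left to right: two characters remain

lengths = (len(newline), len(backslash_n), len(four))
print(lengths)         # (1, 2, 2)

# The regex engine then applies its own escaping to the 2-character string \\,
# which matches exactly one literal backslash.
one_backslash = '\\'
matched = re.fullmatch(four, one_backslash) is not None
print(matched)         # True
```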

Related

What does the regex "/\\*{2,}/" mean? [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
I'm kinda new to regex, and specifically, I don't understand why there are 2 backslashes. I mean, I know the second one is to escape the character "*", but what does the first backslash do?
Well I'm passing this regex expression to the php function preg_match(), and I'm trying to find strings that include 2 or more consecutive "*".

That regex is invalid syntax.
You have this piece:
*{2,}
Which basically would read: match n-times, 2 or more times.
The following regex:
/\\*.{2,}/
Is the simplest and closest regex to the one you have, which would read as:
match 0 or more '\' and 2 or more characters that aren't newlines
If you are talking about the string itself, it may be interpreted in 2 ways:
/\\*{2,}/
Read as: match a single \ and another \ n-times 2 times or more
This is invalid syntax
/\*{2,}/
Read as match 2 or more *
This is valid syntax
It all varies, depending on the escape character.
Edit:
Since the question was updated to show which language and engine it is being used, I've updated to add the following information:
You have to pass the regex as '/\*{2,}/' OR as "/\\*{2,}/" (watch the quotes).
Both are very similar, except that single quotes ('') only support the following escape sequences:
\' - Produces '
\\- Produces \
Double-quoted strings are treated differently in PHP. They support more escape sequences, such as:
\" - Produces "
\\ - Produces \
\x<1-2 hex digits> - Same as chr(0x<hex number>)
\0 - Produces a null char
\1 - Produces a control char (same as chr(1))
\u{<hex code point>} - Produces that Unicode code point encoded as UTF-8 (PHP 7+)
\r - Produces a carriage return (the newline on old Mac OS)
\n - Produces a line feed (the newline on Linux/newer OSX, and on Windows when writing a file without b)
\t - Produces a tab
\<1-3 octal digits> - Same as \x, but the number is in octal (e.g.: "\75" and "\075" produce =)
... (some more that I probably forgot) ...
\<anything else> - Left as is, backslash included (so "\*" stays \* and "\'" stays \')
Read more about this on https://php.net/manual/en/language.types.string.php

Depending on the platform you're using, "/\\*{2,}/" may actually be a representation of the string /\*{2,}/. This is because languages like Java treat \ as an escape character in string literals, so to actually put that character into a regex, you need to escape it in the regex string.
So, we have the regex /\*{2,}/. \* matches the star character, and {2,} means at least two times. Your regex will match any two or more consecutive star characters.

Is it a string literal written in a program and if so which one? The double backslash may be to escape the escape char so that this regex matches at least 2 * star characters.
In JavaScript for example you need to escape the \ so that your string literal can express it as data before you transform it into a regular expression when using the RegExp constructor. See also: Why do regex constructors need to be double escaped?

For PHP, that regex matches a literal * repeated 2 or more times.
But when you have to code it in PHP you have to escape the backslash (with a backslash) to use it in string. For instance:
$re = "/\\*{2,}/";
$str = "...";
preg_match($re, $str, $matches);
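The same pattern can be tried in Python to see what it matches once the string-literal layer is out of the way. This is only a sketch in a different language than the PHP above, and the sample inputs are invented for illustration.

```python
import re

# A raw string hands \*{2,} to the regex engine unchanged;
# the doubled form '\\*{2,}' is the same 6-character string.
stars = re.compile(r'\*{2,}')
same_string = ('\\*{2,}' == r'\*{2,}')

hit = stars.search('a***b')
miss = stars.search('a*b')

print(same_string)   # True
print(hit.group())   # ***
print(miss)          # None
```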

why does this python regex return no match?

>>> match = re.search(r'\d', 'ca\d')
>>> type(match)
<type 'NoneType'>
From my understanding 'r' means don't do any special processing with backslashes and just return the raw string.
Also, why do i get the output below:
>>> match = re.search(r'\a', 'ca\a')
>>> match.group()
'\x07'

Because your input string has no digit. \d means match a digit.
If you want to match a literal \d, you should use the pattern \\d.
This program
import re
p = re.compile(r'\\d')  # regex \\d: a literal backslash followed by d
test_str = r"ca\d"      # raw string: c, a, backslash, d
print(re.search(p, test_str).group(0))
Will output \d.
As for r'', please check this re documentation:
The solution is to use Python’s raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character string
containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
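It does not mean that backslashes are not processed at all; it just lets you write a single backslash instead of a doubled one. The backslash before d is still meaningful in a regular expression.
As for \a: Python's re module accepts \a as an escape for the bell character (ASCII 7), and the non-raw string 'ca\a' ends with that same bell character, so the pattern matches it and group() returns '\x07'.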
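The cases from the question can be checked side by side. This is a small sketch for Python 3; the subject strings are written so that they contain the same characters as in the question, and the re module treats \a in a pattern as the bell character (ASCII 7).

```python
import re

subject = r'ca\d'                        # c, a, backslash, d: no digit anywhere

no_digit = re.search(r'\d', subject)     # \d means "a digit"
literal = re.search(r'\\d', subject)     # \\d means "a backslash, then d"
digit = re.search(r'\d', 'ca7')

# 'ca\a' (non-raw) ends with the bell character, which the regex \a matches.
bell = re.search(r'\a', 'ca\a')

print(no_digit)              # None
print(literal.group())       # \d
print(digit.group())         # 7
print(repr(bell.group()))    # '\x07'
```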

And in addition to stribizhev's comment, probably the 'r' (raw string indicator) is making you confused. It is used to avoid escape processing. Escaping is a way of writing special (unprintable) characters in code, like:
TAB - ASCII 9 - "\t"
CR - ASCII 13 - "\r" (carriage return)
But there's no special character written as "\d", so placing an r in front of it makes no difference: the string is still "\d" (2 chars), which as a regex matches a digit.

In postgreSQL, why is \s treated differently from \w?

Here is the example that confuses me:
select ' w' ~ '^\s\w$';
This results in "false", but seems like it should be true.
select ' w' ~ '^\\s\w*$';
This results in "true", but:
Why does \s need the extra backslash?
If it truly does, why does \w not need the extra backslash?
Thanks for any help!

I think you have tested it the wrong way because I'm getting the opposite results that you got.
select ' w' ~ '^\s\w$';
Is returning 1 in my case. Which actually makes sense because it is matching the space at the beginning of the text, followed by the letter at the end.
select ' w' ~ '^\\s\w*$';
Is returning 0 and it makes sense too. Here you're trying to match a backslash at the beginning of the text followed by an s and then, by any number of letters, numbers or underscores.
A piece of text that would match your second regex would be: '\sw'

The string constants are first parsed and interpreted as strings, including escaped characters. Escaping of unrecognized sequences is handled differently by different parsers, but generally, besides errors, the most common behavior is to ignore the backslash.
In the first example, the right-hand string constant is first being interpreted as '^sw$', where both \s and \w are not recognized string escape sequences.
In the second example the right hand constant is interpreted as '^\sw*$' where \\s escapes the \
After the strings are interpreted they are then applied as a regular expression, '^\sw*$' matching ' w' where '^sw$' does not.

Some languages use backslash as an escape character. Regexes do that, C-like languages do that, and some rare and odd dialects of SQL do that. PostgreSQL does it. PostgreSQL is translating the backslash escaping to arrive at a string value, and then feeding that string value to the regex parser, which AGAIN translates whatever backslashes survived the first translation -- if any. In your first regex, none did.
For example, in a string literal or a regex, \n doesn't mean a backslash followed by a lowercase n. It means a newline. Depending on the language, a backslash followed by a lowercase s will mean either just a lowercase s, or nothing. In PostgreSQL, an invalid escape sequence in a string literal translates as the escaped character: '\w' translates to 'w'. All the regex parser sees there is the w. By chance, you used the letter w in the string you're matching against. It's not matching that w in the left-hand operand because it's a word character; it's matching it because it's a lowercase w. Change it to lowercase x and it'll stop matching.
If you want to put a backslash in a string literal, you need to escape it with another backslash: '\\'. This is why \\s in your second regex worked. Add a second backslash to \w if you want to match any word character with that one.
This is a horrible pain. It's why JavaScript, Perl, and other languages have special conventions for regex literals like /\s\w/, and why C# programmers use the @"string literal" feature to disable backslash escaping in strings they intend to use as regexes.
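To see the second stage in isolation, the sketch below feeds a regex engine the patterns that are left after the string-literal step described in these answers (^sw$ and ^\sw*$). Python's re module stands in for PostgreSQL's regex engine here, which is an assumption made for illustration only.

```python
import re

first = r'^sw$'      # what '^\s\w$' is described as becoming
second = r'^\sw*$'   # what '^\\s\w*$' is described as becoming

first_result = re.search(first, ' w') is not None
second_result = re.search(second, ' w') is not None
# The w in the second pattern is a plain letter, so a different letter no longer matches.
second_with_x = re.search(second, ' x') is not None

print(first_result, second_result, second_with_x)   # False True False
```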

c++ regexp for not preceded by backslash and preceded by backslash

I can only find negative lookbehind for this , something like (?<!\\).
But this won't compile in C++ or flex. It seems that neither regex.h nor flex supports this?
I am trying to implement a shell which has to treat special characters like >, < or | as normal argument strings if preceded by a backslash. In other words, a special character is only special if it is preceded by zero or an even number of '\'.
So echo \\>a or echo abc>a should direct output to a
but echo \>a should print >a
What regular expression should I use?
I'm using flex and yacc to parse the input.

In a Flex rule file, you'd use \\ to match a single backslash '\' character. This is because the \ is used as an escape character in Flex.
BACKSLASH \\
LITERAL_BACKSLASH \\\\
LITERAL_LESSTHAN \\\\<
LITERAL_GREATERTHAN \\\\>
LITERAL_VERTICALBAR \\\\|
If I follow you correctly, in your case you want "\>" to be treated as literal '>' but "\\>" to be treated as literal '\' followed by special redirect. You don't need negative look behind or anything particularly special to accomplish this as you can build one rule that would accept both your regular argument characters and also the literal versions of your special characters.
For purposes of discussion, let's assume that your argument/parameter can contain any character but ' ', '\t', and the special forms of '>', '<', '|'. The rule for the argument would then be something like:
ARGUMENT ([^ \t\\><|]|\\\\|\\>|\\<|\\\|)+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\\\ matches any instance of "\\" (i.e. a literal backslash)
\\> matches any instance of "\>" (i.e. a literal greater than)
\\< matches any instance of "\<" (i.e. a literal less than)
\\\| matches any instance of "\|" (i.e. a literal vertical bar/pipe)
Actually... You can probably just shorten that rule to:
ARGUMENT ([^ \t\\><|]|\\[^ \t\r\n])+
Where:
[^ \t\\><|] matches any single character but ' ', '\t', and your special characters
\\[^ \t\r\n] matches any character preceded by a '\' in your input except for whitespace (which will handle all of your special characters and allow for literal forms of all other characters)
If you want to allow for literal whitespace in your arguments/parameters then you could shorten the rule even further but be careful with using \\. for the second half of the rule alternation as it may or may not match " \n" (i.e. eat your trailing command terminator character!).
Hope that helps!
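The shortened ARGUMENT rule can be tried outside Flex. The sketch below ports it to Python's re module, which is an assumption for illustration: Flex and Python agree on this character-class syntax, but a real lexer would also need rules for the operators themselves.

```python
import re

# Same alternation as the ARGUMENT rule: an ordinary character, or a backslash plus any non-whitespace character.
ARGUMENT = re.compile(r'(?:[^ \t\\><|]|\\[^ \t\r\n])+')

escaped = ARGUMENT.findall(r'echo \>a')      # \> stays inside the argument
doubled = ARGUMENT.findall(r'echo \\>a')     # \\ is an argument, > is left for another rule
plain = ARGUMENT.findall('echo abc>a')

print(escaped)   # ['echo', '\\>a']
print(doubled)   # ['echo', '\\\\', 'a']
print(plain)     # ['echo', 'abc', 'a']
```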

You cannot easily extract single escaped characters from a command-line, since you will not know the context of the character. In the simplest case, consider the following:
LessThan:\<
BackslashFrom:\\<
In the first one, < is an escaped character; in the second one, it is not. If your language includes quotes (as most shells do), things become even more complicated. It's a lot better to parse the string left to right, one entity at a time. (I'd use flex myself, because I've stopped wasting my time writing and testing lexers, but you might have some pedagogical reason to do so.)
If you really need to find a special character which shouldn't be special, just search for it (in C++98, where you don't have raw literals, you'll have to escape all of the backslashes):
regex: (\\\\)*\\[<>|]
(An even number -- possibly 0 -- of \, then a \ and a <, > or |)
as a C string => "(\\\\\\\\)*\\\\[<>|]"
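The left-to-right approach recommended above can be sketched as a tiny hand-written scanner. The function name tokenize and the ('arg', ...) / ('op', ...) token shapes are hypothetical, and quoting is ignored entirely; it only shows how the context of each backslash falls out of scanning in order.

```python
def tokenize(line):
    """Hypothetical helper: split a command line into ('arg', text) and ('op', char) tokens."""
    tokens = []
    current = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == '\\' and i + 1 < len(line):
            current.append(line[i + 1])   # the escaped character is taken literally
            i += 2
            continue
        if ch in '<>|' or ch in ' \t':
            if current:                   # close the argument collected so far
                tokens.append(('arg', ''.join(current)))
                current = []
            if ch in '<>|':
                tokens.append(('op', ch))
        else:
            current.append(ch)
        i += 1
    if current:
        tokens.append(('arg', ''.join(current)))
    return tokens


print(tokenize(r'echo \>a'))    # [('arg', 'echo'), ('arg', '>a')]
print(tokenize(r'echo \\>a'))   # [('arg', 'echo'), ('arg', '\\'), ('op', '>'), ('arg', 'a')]
```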

How are \\n and \\\n interpreted by an extended regular expression?

Within an ERE, a backslash character (\, \a, \b, \f, \n,
\r, \t, \v) is considered to begin an escape sequence.
Then I see \\n and [\\\n]. I guess both mean \ followed by a newline, but I'm confused about exactly how such a sequence is interpreted: how many \s are required at all?
UPDATE
I don't have a problem understanding regex in programming languages, so please keep the context within the lexer.
[root# ]# echo "test\
> hi"

This is dependent on the programming language and on its string handling options.
For example, in Java strings, if you need a literal backslash in a string, you need to double it. So the regex \n must be written as "\\n". If you plan to match a backslash using a regex, then you need to escape it twice - once for Java's string handler, and once for the regex engine. So, to match \, the regex is \\, and the corresponding Java string is "\\\\".
Many programming languages have special "verbatim" or "raw" strings where you don't need to escape backslashes. So the regex \n can be written as a normal Python string as "\\n" or as a Python raw string as r"\n". The Python string "\n" is the actual newline character.
This can become confusing, because sometimes not escaping the backslash happens to work. For example the Python string "\d\n" happens to work as a regex that's intended to match a digit, followed by a newline. This is because \d isn't a recognized character escape sequence in Python strings, so it's kept as a literal \d and fed that way to the regex engine. The \n is translated to an actual newline, but that happens to match the newline in the string that the regex is tested against.
However, if you forget to escape a backslash where the resulting sequence is a valid character escape sequence, bad things happen. For example, the regex \bfoo\b matches an entire word foo (but it doesn't match the foo in foobar). If you write the regex string as "\bfoo\b", the \bs are translated into backspace characters by the string processor, so the regex engine is told to match <backspace>foo<backspace> which obviously will fail.
Solution: Always use verbatim strings where you have them (e. g. Python's r"...", .NET's @"...") or use regex literals where you have them (e. g. JavaScript's and Ruby's /.../). Or use RegexBuddy to automatically translate the regex for you into your language's special format.
To get back to your examples:
\\n as a regex means "Match a backslash, followed by n"
[\\\n] as a regex means "Match either a backslash or a newline character".
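The \bfoo\b pitfall and the two regexes from the question can be reproduced in Python. This is a minimal sketch with invented sample text.

```python
import re

text = 'a foo b'
wrong = '\bfoo\b'     # normal string: each \b is already a backspace character
right = r'\bfoo\b'    # raw string: the regex engine sees word boundaries

wrong_hit = re.search(wrong, text)
right_hit = re.search(right, text)
in_foobar = re.search(right, 'foobar')

print(len(wrong), wrong_hit)          # 5 None
print(right_hit.group(), in_foobar)   # foo None

# The two regexes from the question, written as raw strings:
backslash_then_n = re.search(r'\\n', 'a\\nb').group()    # a backslash followed by n
either = re.findall(r'[\\\n]', 'a\\b\nc')                # a backslash or a newline
print(either)                         # ['\\', '\n']
```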

Actually, a regex specified by a string literal is processed by two compilers: the programming language compiler and the regexp compiler:
Original    Compiled       Regex compiled
"\n"        NL             NL
"\\n"       '\'+'n'        NL
"\\\n"      '\'+NL         NL
"\\\\n"     '\'+'\'+'n'    '\'+'n'
So you can use the shortest form "\n".
Code examples:
JavaScript:
'a\nb'.replace(RegExp("\n"),'<br>')
'a\nb'.replace(RegExp("\\n"),'<br>')
'a\nb'.replace(RegExp("\\\n"),'<br>')
but not:
'a\nb'.replace(/\\\n/,'<br>')
Java:
System.out.println("a\nb".replaceAll("\n","<br>"));
System.out.println("a\nb".replaceAll("\\n","<br>"));
System.out.println("a\nb".replaceAll("\\\n","<br>"));
Python:
import re
'<br>'.join(re.split('\n', 'a\nb'))
'<br>'.join(re.split('\\n', 'a\nb'))
'<br>'.join(re.split('\\\n', 'a\nb'))
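The rows of the table can be checked mechanically in Python. This is a sketch under the assumption that Python's string-literal and re escaping rules apply; other languages may treat the third form differently.

```python
import re

# Source literal -> number of characters after the string-literal step.
compiled_lengths = {
    'one backslash': len('\n'),         # NL
    'two backslashes': len('\\n'),      # backslash + n
    'three backslashes': len('\\\n'),   # backslash + NL
    'four backslashes': len('\\\\n'),   # backslash + backslash + n
}
print(compiled_lengths)

# The first three forms all end up matching a newline in the regex step.
newline_splits = [re.split(p, 'a\nb') for p in ('\n', '\\n', '\\\n')]
print(newline_splits)     # [['a', 'b'], ['a', 'b'], ['a', 'b']]

# The fourth form matches a backslash followed by the letter n instead.
literal_split = re.split('\\\\n', 'a\\nb')
print(literal_split)      # ['a', 'b']
```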
