Why do regexes and string literals use different escape sequences?

The handling of escape sequences varies across languages and between string literals and regular expressions. For example, in Python the \s escape sequence can be used in regular expressions but not in string literals, whereas in PHP the \f form feed escape sequence can be used in regular expressions but not in string literals.
In PHP, there is a dedicated page for PCRE escape sequences (http://php.net/manual/en/regexp.reference.escape.php) but it does not have an official list of escape sequences that are exclusive to string literals.
As a beginner in programming, I am concerned that I may not have a full understanding of the background and context of this topic. Are these concerns valid? Is this an issue that others are aware of?
Why do different programming languages handle escape sequences differently between regular expressions and string literals?

The escape sequences found in string literals are there to stop the programming language from getting confused. For example, in many languages a string literal is denoted as characters between quotes, like so:
my_string = 'x string'
But if your string contains a quote character then you need a way to tell the programming language that this should be interpreted as a literal character
my_string = 'x's string' # syntax error: the literal ends at the second quote
my_string = 'x\'s string' # lets the programming language know that the internal quote is literal and not the end of the string
I think that most programming languages have a similar set of escape sequences for string literals.
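As a small sketch in Python (the variable names are only illustrative), the escaped quote and the alternative of switching the outer quote style produce the same string:

```python
# Escaping the inner quote keeps it from ending the literal early.
escaped = 'x\'s string'
# Using the other quote style avoids the escape altogether.
alternative = "x's string"

print(escaped)                 # x's string
print(escaped == alternative)  # True
print(len(escaped))            # 10: the backslash is not part of the string
```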
Regexes are a different story: you can think of them as their own separate language that is written as a string literal. In a regex, some characters like the period (.) have a special meaning and must be escaped to match their literal counterpart, whereas other characters gain a special meaning when preceded by a backslash.
For example
regex_string = 'A.C' # match an A, followed by any character, followed by C
regex_string = 'A\.C' # match an A, followed by a period, followed by C
regex_string = 'AsC' # match an A, followed by s, followed by C
regex_string = 'A\sC' # match an A, followed by any whitespace character, followed by C
Because regexes are their own mini-language, it wouldn't make sense for all of the regex escape sequences to be available in normal string literals.
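The four patterns above can be tried out with Python's re module. This is only a sketch: the helper name full is made up for the example, and raw string literals (r'...') are used so that the backslashes reach the regex engine untouched.

```python
import re

def full(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(full(r'A.C', 'AxC'))    # True: '.' matches any character
print(full(r'A\.C', 'AxC'))   # False: '\.' matches only a period
print(full(r'A\.C', 'A.C'))   # True
print(full(r'AsC', 'A C'))    # False: 's' is just the letter s
print(full(r'A\sC', 'A\tC'))  # True: '\s' matches any whitespace, here a tab
```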

Regular expressions are best thought of as a language in themselves, with its own syntax. Some programming languages offer a literal syntax specifically for describing a regex, but usually a regex will be compiled from an existing string. If you create that string from literal syntax, that uses a different set of escape sequences because it is a different kind of thing, created with a different syntax, for a different context, in a different language. That's the simple and direct answer to the question.
There are different needs and requirements. Regexes have to be able to describe things that aren't a single, specific sequence of text. String literals obviously don't have that problem, but they do need a way to, say, include quotation marks in the text. That usually isn't a problem for regex syntax, because the content of the string is already determined by that point. (Some languages have a "regex literal" syntax, typically enclosing the regex in forward slashes. In these languages, forward slashes that are supposed to be part of the regex need to be escaped.)
Although I understand the obvious (\s represents multiple characters and would introduce ambiguity)
Ambiguity isn't actually a concern for most languages that support regex. It often happens that the string literal syntax and the regex syntax use the same sequence to mean different things. For example: \b represents a word boundary in regex syntax, but many languages' string literal syntax also uses it to represent a backspace character, Unicode code point 8. (Unless you meant that \s to mean "any whitespace character" doesn't make sense in the string literal context but only in the regex context - then yes, of course.)
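To make the \b example concrete, here is a short Python sketch (the variable names are illustrative). The same two typed characters produce a backspace in an ordinary literal and a word-boundary assertion once they reach the regex engine:

```python
import re

backspace = '\b'   # ordinary literal: a single backspace character
boundary = r'\b'   # raw literal: a backslash followed by the letter b

print(len(backspace), ord(backspace))  # 1 8
print(len(boundary))                   # 2

# Interpreted by the regex engine, the two-character form is a word boundary.
print(re.findall(r'\bcat\b', 'cat concat cat.'))  # ['cat', 'cat']
```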
But keep in mind - if the regex is being compiled from a string literal, then first the string literal is interpreted to figure out what the string actually contains, and then that string is used to create the regex. These are separate steps that can and do apply separate rules, so there is no conflict.
This sometimes means that code has to use a double escaping mechanism: first for the string literal, and then for the regex syntax. If you want a regex that matches a literal backslash, you might end up typing four backslashes in a string literal - since that code will create a string that actually contains only two backslashes, which in turn is what the regex syntax requires. (Some languages offer some kind of "raw" string literal facility to work around this.)
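A minimal Python sketch of the double escaping described above, with illustrative variable names. The ordinary literal needs four typed backslashes, while the raw literal needs only the two that the regex syntax itself requires:

```python
import re

doubled = '\\\\'   # four backslashes typed, two in the resulting string
raw = r'\\'        # raw literal: two typed, two in the resulting string

print(len(doubled), doubled == raw)  # 2 True

text = 'C:\\temp'  # the text C:\temp, which contains one backslash
print(len(re.findall(raw, text)))    # 1: the regex matches the single backslash
```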

read regular expression from .ini file [duplicate]

When I create a string containing backslashes, they get duplicated:
>>> my_string = "why\does\it\happen?"
>>> my_string
'why\\does\\it\\happen?'
Why?
What you are seeing is the representation of my_string created by its __repr__() method. If you print it, you can see that you've actually got single backslashes, just as you intended:
>>> print(my_string)
why\does\it\happen?
The string below has three characters in it, not four:
>>> 'a\\b'
'a\\b'
>>> len('a\\b')
3
You can get the standard representation of a string (or any other object) with the repr() built-in function:
>>> print(repr(my_string))
'why\\does\\it\\happen?'
Python represents backslashes in strings as \\ because the backslash is an escape character - for instance, \n represents a newline, and \t represents a tab.
This can sometimes get you into trouble:
>>> print("this\text\is\not\what\it\seems")
this ext\is
ot\what\it\seems
Because of this, there needs to be a way to tell Python you really want the two characters \n rather than a newline, and you do that by escaping the backslash itself, with another one:
>>> print("this\\text\is\what\you\\need")
this\text\is\what\you\need
When Python returns the representation of a string, it plays safe, escaping all backslashes (even if they wouldn't otherwise be part of an escape sequence), and that's what you're seeing. However, the string itself contains only single backslashes.
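As a further sketch (variable names are illustrative), a raw string literal and a literal with escaped backslashes build the same string, and the doubled backslashes only ever appear in the representation:

```python
path = r"why\does\it\happen?"    # raw literal: backslashes stay as typed
same = "why\\does\\it\\happen?"  # ordinary literal with escaped backslashes

print(path == same)      # True
print(path.count("\\"))  # 3 single backslashes in the string itself
# repr() adds two quotes and doubles each of the three backslashes.
print(len(repr(path)) - len(path))  # 5
```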
More information about Python's string literals can be found at: String and Bytes literals in the Python documentation.
As Zero Piraeus's answer explains, using single backslashes like this (outside of raw string literals) is a bad idea.
But there's an additional problem: in the future, it will be an error to use an undefined escape sequence like \d, instead of meaning a literal backslash followed by a d. So, instead of just getting lucky that your string happened to use \d instead of \t so it did what you probably wanted, it will definitely not do what you want.
As of 3.6, it already raises a DeprecationWarning, although most people don't see those. Since 3.12 it is a SyntaxWarning, which is shown by default. It will become a SyntaxError in some future version.
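One way to see how your own interpreter treats an undefined escape is to compile a small piece of source text and record what happens. This is a sketch only: the function name invalid_escape_outcome is made up, and the result depends on the Python version (a warning category on current versions, possibly "SyntaxError" on some future one).

```python
import warnings

def invalid_escape_outcome(source):
    """Compile source text and report how an invalid escape is treated."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        try:
            compile(source, "<example>", "exec")
        except SyntaxError:
            return "SyntaxError"
    return [w.category.__name__ for w in caught]

# The compiled source contains a backslash followed by d in a normal literal.
print(invalid_escape_outcome('x = "\\d"'))
# Escaping the backslash, or using a raw literal, triggers nothing.
print(invalid_escape_outcome('x = "\\\\d"'))  # []
print(invalid_escape_outcome('x = r"\\d"'))   # []
```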
In many other languages, including C, using a backslash that doesn't start an escape sequence means the backslash is ignored.
In a few languages, including Python, a backslash that doesn't start an escape sequence is a literal backslash.
In some languages, to avoid confusion about whether the language is C-like or Python-like, and to avoid the problem with \Foo working but \foo not working, a backslash that doesn't start an escape sequence is illegal.

C++ regex for properly matching strings that contain c-style escape characters (ECMAScript style, no look-behind)

I'm a regex noob attempting to match either the contents or the entirety of a quoted segment of text without breaking on escaped quotation marks.
Put another way, I need a regex that, between two quotation marks, will match all characters that are not quotation marks and also any quotation mark that has an odd number of consecutive backslashes preceding it. It has to be an odd number of backslashes as a pair of backslashes escapes to a single backslash.
I've successfully created a regex that does this but it relied on look-behind and because this project is in C++ and because the regex implementation of standard C++ does not have look-behind functionality, I could not use said regex.
Here is the regex with look-behind that I came up with: "(((?<!\\)(\\\\)*\\"|[^"])*)"
The following text should produce 8 matches:
"Woah. Look. A tab."
"This \\\\\\\\\\\\\" is all one string"
"This \"\"\"\" is\" also\"\\ \' one\"\\\" string."
"These \\""are separate strings"
"The cat said,\"Yo.\""
"
\"Shouldn't it work on multiple lines?\" he asked rhetorically.
\"Of course it should.\"
"
"If you don't have exactly 8 matches, then you've failed."
Here's a link to this example, showing my (probably naive) look-behind version: https://regex101.com/r/uOxqWl/1
If this is impossible to do without look-behind, please let me know.
Also, if there is a well-regarded C++ regex library that allows regex look-behind, please let me know (It doesn't have to be ECMAScript, though I would slightly prefer that).
Let's derive a garden variety regular expression for C-style strings from an English description.
A string is a quotation mark, followed by a sequence of string-characters, followed by another quotation mark.
std::regex stringMatcher ( R"("<string-character>*")" );
Obviously this doesn't work as we didn't define the string-character yet. We can do so piece by piece.
Firstly, a string character could be any character except a quotation mark and a backslash.
R"([^\\"])"
Secondly, a string character could be an escape sequence consisting of a backslash and a single other character from a fixed set.
R"(\\[abfnrtv'"\\?])"
Thirdly, it can be an octal escape sequence that consists of a backslash and three octal digits
R"(\\[0-7][0-7][0-7])"
(We simplify here a bit because the real C standard allows 1, 2 or 3 octal digits. This is easy to add.)
Fourthly, it can be a hexadecimal escape sequence that consists of a backslash, a letter x, and a hexadecimal number. The range of the number is implementation defined, so we need to accept any one.
R"(\\x[0-9a-fA-F][0-9a-fA-F]*)"
We omit universal character names; they could be added in exactly the same way. There are none in the given test example.
So, to bring this all together:
std::regex stringMatcher ( R"("([^\\"]|\\([abfnrtv'"\\?]|[0-7][0-7][0-7]|x[0-9a-fA-F][0-9a-fA-F]*))*")" );
// collapsed the leading backslashes of all the escape sequence types together
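The same construction can be tried in Python, whose re syntax accepts these alternatives unchanged. This is a port for experimentation, not the C++ code from the answer: the name STRING_LITERAL and the sample text are made up, non-capturing groups are used, and re.VERBOSE lets each piece carry a comment.

```python
import re

STRING_LITERAL = re.compile(r'''
    "                        # opening quotation mark
    (?:
        [^\\"]               # any character except backslash and quote
      | \\ (?:               # or a backslash followed by one of:
            [abfnrtv'"\\?]   #   a simple escape character
          | [0-7]{3}         #   three octal digits
          | x[0-9a-fA-F]+    #   the letter x and one or more hex digits
        )
    )*
    "                        # closing quotation mark
''', re.VERBOSE)

sample = r'''"plain" "with \"escaped\" quotes" "These \\""are separate strings" "hex \x41 and octal \101"'''

matches = STRING_LITERAL.findall(sample)
print(len(matches))  # 5
for m in matches:
    print(m)
```

A string containing an escape outside the fixed set, such as "bad \q escape", is not matched at all, which mirrors the strictness of the pattern above.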

Regex For Strings in C

I'm looking to make a regular expression for some strings in C.
This is what I have so far:
Strings in C are delimited by double quotes (") so the regex has to be surrounded by \" \".
The string may not contain newline characters so I need to do [^\n] ( I think ).
The string may also contain double quotes or back slash characters if and only if they're escaped. Therefore [\\ \"] (again I think).
Other than that anything else goes.
Any help is much appreciated I'm kind of lost on how to start writing this regex.
A simple flex pattern to recognize string literals (including literals with embedded line continuations):
["]([^"\\\n]|\\.|\\\n)*["]
That will allow
"string with \
line continuation"
But not
"C doesn't support
multiline strings"
If you don't want to deal with line continuations, remove the \\\n alternative. If you need trigraph support, it gets more irritating.
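The flex pattern carries over to Python's re module almost verbatim, which makes it easy to check the two examples above. This is a sketch rather than flex code: the name C_STRING is made up, and a non-capturing group replaces the plain parentheses. As in flex, the dot does not match a newline here, so the line continuation needs its own alternative.

```python
import re

C_STRING = re.compile(r'"(?:[^"\\\n]|\\.|\\\n)*"')

# A backslash immediately followed by a newline is a line continuation.
continued = '"string with \\\nline continuation"'
# A bare newline inside the quotes is not allowed.
broken = '"C doesn\'t support\nmultiline strings"'

print(C_STRING.fullmatch(continued) is not None)  # True
print(C_STRING.search(broken))                    # None
```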
Although that recognizes strings, it doesn't attempt to make sense of them. Normally, a C lexer will want to process strings with backslash sequences, so that "\"\n" is converted to the two characters "NL (0x22 0x0A). You might, at some point, want to take a look at, for example, Optimizing flex string literal parsing (although that will need to be adapted if you are programming in C).
Flex patterns are documented in the flex manual. It might also be worthwhile reading a good reference on regular expressions, such as John Levine's excellent book on Flex and Bison.

What does (^?)* mean in this regex?

I have this regex:
^(^?)*\?(.*)$
If I understand correctly, this is the breakdown of what it does:
^ - start matching from the beginning of the string
(^?)* - I don't know, but it stores it in $1
\? - matches a question mark
(.*)$ - matches anything until the end of the string
So what does (^?)* mean?
The (^?) is simply looking for the literal character ^. The ^ character in a regex pattern only has special meaning when used as the first character of the pattern or the first character in a grouping match []. When used outside those 2 positions the ^ is interpreted literally, meaning it looks for the ^ character in the input string.
Note: Whether or not ^ outside of the first and grouping position is interpreted literally is regex engine specific. I'm not familiar enough with Lua to state which it does.
Lua does not have a conventional regexp language, it has Lua patterns in its place. While they look a lot like regexp, Lua patterns are a distinct language of their own that has a simpler set of rules and most importantly lacks grouping and alternation features.
Interpreted as a Lua pattern, the example will surprise a longtime regexp user since so many details are different.
Lua patterns are described in PiL, and at a first glance are similar enough to a conventional regexp to cause confusion. The biggest differences are probably the lack of an alternation operator |, parentheses are only used to mark captures, quantifiers (?, -, +, and *) only apply to a character or character class, and % is the escape character not \. A big clue that this example was probably not written with Lua in mind is the lack of the Lua pattern quoting character % applied to any (or ideally, all) of the non-alphanumeric characters in the pattern string, and the suspicious use of \? which smells like a conventional regexp to match a single literal ?.
The simple answer to the question asked is: (^?)* is not a recommended form, and would match ^* or *, capturing the presence or absence of the caret. If that were the intended effect, then I would write it as (%^?)%* to make that clearer.
To see why this is the case, let's take the pattern given and analyze it as a Lua pattern. The entire pattern is:
^(^?)*\?(.*)$
Handed to string.match(), it would be interpreted as follows:
^ anchors the match to the beginning of the string.
( marks the beginning of the first capture.
^ is not at the beginning of the pattern or a character class, so it matches a literal ^ character. For clarity that should likely have been written as %^.
? matches exactly zero or one of the previous character.
) marks the end of the first capture.
* is not after something that can be quantified so it matches a literal * character. For clarity that should likely have been written as %*.
\ in a pattern matches itself; it is not an escape character in the pattern language. However, it is an escape character in a Lua short string literal, making the following character not special to the string literal parser, which in this case is moot because the ? that follows was not special to it in any case. So if the pattern were enclosed in double or single quotes, then the \ would be absorbed by string parsing. If written in a long string (as [[^(^?)*\?(.*)$]]), the backslash would survive the string parser, to appear in the pattern.
? matches exactly zero or one of the previous character.
( marks the beginning of the second capture.
. matches any character at all, effectively a synonym for the class [\000-\255] (remember, in Lua numeric escapes are in decimal not octal as in C).
* matches zero or more of the previous character, greedily.
) marks the end of the second capture.
$ anchors the pattern to the end of the string.
So it matches and captures an optional ^ at the beginning of the string, followed by *, then an optional \ which is not captured, and captures the entire rest of the string. string.match would return two strings on success (either or both of which might be zero length), or nil on failure.
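For readers more at home with conventional regexes, the behaviour just described can be approximated in Python's re syntax. This is only a rough translation for comparison, not Lua code: it assumes the long-string form in which the backslash reaches the pattern, and the names LUA_LIKE and lua_like_match are made up.

```python
import re

# Optional literal ^ (captured), literal *, optional backslash (not captured),
# then the rest of the string (captured). DOTALL mirrors Lua's '.', which
# matches any character including a newline.
LUA_LIKE = re.compile(r'^(\^?)\*\\?(.*)$', re.DOTALL)

def lua_like_match(text):
    """Return the two captures as a tuple, or None when there is no match."""
    m = LUA_LIKE.match(text)
    return m.groups() if m else None

print(lua_like_match('^*\\rest'))  # ('^', 'rest')
print(lua_like_match('*rest'))     # ('', 'rest')
print(lua_like_match('rest'))      # None
```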
Edit: I've fixed some typos, and corrected an error in my answer, noticed by Egor in a comment. I forgot that in patterns, special symbols lose their specialness when in a spot where it can't apply. That makes the first asterisk match a literal asterisk rather than be an error. The cascade of that falls through most of the answer.
Note that if you really want a true regexp in Lua, there are libraries available that will provide it. That said, the built-in pattern language is quite powerful. If it is not sufficient, then you might be best off adopting a full parser, and use LPeg which can do everything a regexp can and more. It even comes with a module that provides a complete regexp syntax that is translated into an LPeg grammar for execution.
In this case, the (^?) refers to the previous string "^" meaning the literal character ^ as Jared has said. Check out regexlib for any further deciphering.
For all your Regex needs: http://regexlib.com/CheatSheet.aspx
It looks to me like the intent of the creator of the expression was to match any number of ^ before the question mark, but only wanted to capture the first instance of ^. However, it may not be a valid expression depending on the engine, as others have stated.

Regular expression opening and closing characters

When I learned regular expressions I learned they should start and end with a slash character (followed by modifiers).
For example /dog/i
However, in many examples I see them starting and ending with other characters, such as @, #, and |.
For example |dog|
What's the difference?
This varies enormously from one regex flavor to the next. For example, JavaScript only lets you use the forward-slash (or solidus) as a delimiter for regex literals, but in Perl you can use just about any punctuation character--including, in more recent versions, non-ASCII characters like « and ». When you use characters that come in balanced pairs like braces, parentheses, or the double-arrow quotes above, they have to be properly balanced:
m«\d+»
s{foo}{bar}
Ruby also lets you choose different delimiters if you use the %r prefix, but I don't know if that extends to the balanced delimiters or non-ASCII characters. Many languages don't support regex literals at all; you just write the regexes as string literals, for example:
r'\d+' // Python
@"\d+" // C#
"\\d+" // Java
Note the double backslash in the Java version. That's necessary because the string gets processed twice: once by the Java compiler and once by the compile() method of the Pattern class. Most other languages provide a "raw" or "verbatim" form of string literal that all but eliminates such backslash-itis.
And then there's PHP. Its preg regex functions are built on top of the PCRE library, which closely imitates Perl's regexes, including the wide variety of delimiters. However, PHP itself doesn't support regex literals, so you have to write them as if they were regex literals embedded in string literals, like so:
'/\d+/i' // match modifiers go after the slash but inside the quotes
"{\\d+}" // double-quotes may or may not require double backslashes
Finally, note that even those languages which do support regex literals don't usually offer anything like Perl's s/…/…/ construct. The closest equivalent is a function call that takes a regex literal as the first argument and a string literal as the second, like so:
s = s.replace(/foo/i, 'bar') // JavaScript
s.gsub!(/foo/i, "bar") // Ruby
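Python, which has no regex literals, follows the same function-call shape with re.sub. A minimal sketch with made-up sample text; note that re.sub replaces every match unless count is given, so it behaves like Perl's s/foo/bar/gi by default:

```python
import re

s = 'Foo fighters eat food'

# Replace every case-insensitive match of foo.
replaced_all = re.sub(r'foo', 'bar', s, flags=re.IGNORECASE)
print(replaced_all)    # bar fighters eat bard

# Limit the substitution to the first match only.
replaced_first = re.sub(r'foo', 'bar', s, count=1, flags=re.IGNORECASE)
print(replaced_first)  # bar fighters eat food
```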
Some RE engines will allow you to use a different character so as to avoid having to escape those characters when used in the RE.
For example, with sed, you can use either of:
sed 's/\/path\/to\/directory/xx/g'
sed 's?/path/to/directory?xx?g'
The latter is often more readable. The former is sometimes called "leaning toothpicks". With Perl, you can use either of:
$x =~ /#!\/usr\/bin\/perl/;
$x =~ m!#\!/usr/bin/perl!;
but I still contend the latter is easier on the eyes, especially as the REs get very complex. Well, as easy on the eyes as any Perl code could be :-)