Why is only ) a special character and not } or ]? - regex

I'm reading Jan Goyvaerts' "Regular Expressions: The Complete Tutorial and Reference" to touch up on my Regex.
In the second chapter, Jan has a section on "special characters:"
Special Characters
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called “metacharacters”. Most of them are errors when used alone.
(emphasis mine)
I understand that only the opening square bracket and opening curly brace are special, since a closing brace or bracket is clearly a literal if there's no preceding opening one. However, why does Jan specify that the closing parenthesis is a special character if the other two closing characters aren't?

Short answer
The regex flavors in my book do not require } and ] to be escaped (except for ] in character classes in JavaScript). So I don't because I like to have as few backslashes in my regexes as possible. You can escape them if you find your regexes clearer that way.
Full answer
First of all, anyone learning about regular expressions needs to understand the importance of the qualifier "In the regex flavors discussed in this tutorial..." You cannot discuss regular expressions without stating which regex flavor(s) you're talking about.
What I wrote is true for the flavors my book (2006 edition) discusses. In those flavors, ) is treated as a token that closes a group. It is a syntax error if used without a corresponding (. So ) has a special meaning when used all on its own.
} does not have a special meaning when used all on its own. You never need to escape it with these flavors. If you wanted to match something like {7} or {7,42} literally, you only need to escape the opening {. If you want to argue that } is special because it sometimes has a special meaning, then you would have to say the same about , which becomes special in the same situation.
] does not have a special meaning outside character classes in these regex flavors. You never need to escape it outside character classes. The paragraph you quoted does not talk about special characters inside character classes. That's a totally different list (\, ], ^, and -) discussed in a later chapter.
Now as to why: most regular expressions have plenty of backslashes already. My preferred style is to escape as few characters as needed. So I never escape }. I escape ] in character classes when using JavaScript because that's the only way. But with other flavors I place ] at the start of the character class or after the negating caret so I don't need to escape it. My teaching materials teach this style. When my products RegexBuddy or RegexMagic convert or generate regular expressions, they also use as few backslashes as needed.
I often see people new to regular expressions needlessly escape characters like ", ', or / because they need to be escaped when the regular expression is quoted as a source code literal in certain programming languages. But the regular expression itself does not require these to be escaped.
I even see people escape characters like < or >. This is a bad habit because in some regex flavors \< and \> are word boundaries. This includes recent versions of PCRE (but not the PCRE that was current in 2006).
But, if you find it confusing to see unescaped } and ] used as literals, you are free to escape them in your regexes. Except for < and >, all the flavors discussed in my book allow you to escape any punctuation character to match that character literally, even if the character on its own would be a literal already.
So somebody saying that } and ] are special characters in regular expressions is not wrong if "special characters" means "characters that have a special meaning either on their own or when used in combination with other characters". But that list would also include , (quantifier), : (non-capturing group), - (mode modifier), ! (negative lookaround), < (lookbehind), and - (character class range).
But if "special characters" means "characters that have a special meaning on their own", then } and ] are not included in the list for the flavors my book covers.
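For anyone who wants to check this against a concrete engine, here is a minimal sketch using Python's re module. Python is not one of the flavors named in the answer above, so treat it as an illustration of the same behaviour rather than a statement about the book's flavors. The helper name compiles is made up for this example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A lone "}" or "]" is an ordinary literal.
brace_match = re.search(r"}", "a}b").group()
bracket_match = re.search(r"]", "a]b").group()

# Only the opening brace needs escaping to match "{7,42}" literally.
literal_quantifier = re.fullmatch(r"\{7,42}", "{7,42}") is not None

# "]" placed first in a character class is a member of the class, no backslash needed.
bracket_first = re.fullmatch(r"[]x]+", "x]]x") is not None

# A lone ")" is rejected, while the escaped form is fine.
lone_paren_ok = compiles(r")")
escaped_paren_ok = compiles(r"\)")
```

In this engine only the unescaped ) is refused; the other patterns compile and match as literals.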

The following paragraphs give an answer. I'm citing from Jan's website, not from the book, though:
If you forget to escape a special character where its use is not allowed, such as in +1, then you will get an error message.
Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like a{1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. But there are a few exceptions. Java requires literal opening braces to be escaped. Boost and std::regex require all literal braces to be escaped.
] is a literal outside character classes. Different rules apply inside character classes. Those are discussed in the topic about character classes. Again, there are exceptions. std::regex and Ruby require closing square brackets to be escaped even outside character classes.
It seems like he uses "needs to be escaped" as his definition for "special character", and unlike ), the ] and } characters need not be escaped in most flavours.
That said, you wouldn't be wrong calling them special characters as well. It's definitely a best practice to always escape them, and in no flavour do \] and \} mean anything other than a literal ] or }.
On the other hand, they have their special meaning only inside a specific (parsing) context, namely when they follow [ and { respectively. There are similar cases: :=><!#'&, all have a non-literal meaning inside a specific context, and we wouldn't normally call these "special characters" either.
And while we could say the same about ), almost no flavour allows for it to occur on its own outside of groups, because pairs of parentheses always need to match. Its only usage is in the special context, and therefore ) is considered a special character.

In most engines, a parenthesis anywhere in a regular expression must be escaped to mean a literal character, and that includes the closing parenthesis. POSIX regular expressions are an exception:
) The <right-parenthesis> shall be special when matched with a preceding <left-parenthesis>, both outside a bracket expression.
But the interesting part is that POSIX has a separate definition of when a right-parenthesis should be treated as a special character. It has no such definition for } or ].
Why don't other engines follow this rule?
Call it implementation peculiarities or historical reasons that have something to do with Perl as commented in PCRE source code:
/* It appears that Perl allows any characters whatsoever, other than
a closing parenthesis, to appear in arguments, so we no longer insist on
letters, digits, and underscores. */
It seems that, with all the special clusters in more advanced engines, treating a closing parenthesis as a special character costs much less than implementing the POSIX rule.

From experiments, it appears that unlike ), the characters ] and } are only interpreted as delimiters when the corresponding opening [ or { has been met.
Though IMO the same rule could apply to ), that's the way it is.
This might be due to the way the parser was written: parentheses can be nested, so their balancing needs to be checked, whereas brackets and curly braces are just flagged. (For instance, [[] is a valid class definition. [[]] is also a valid pattern but is understood as [\[]\].)
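The [[]] reading can be checked with a small sketch in Python's re module, which is an assumption on my part since the answer does not say which engine the experiments used. Python emits a FutureWarning for [[ ("possible nested set"), so the sketch silences it.

```python
import re
import warnings

with warnings.catch_warnings():
    # "[[" triggers a FutureWarning in Python 3.7+; it is only a warning.
    warnings.simplefilter("ignore", FutureWarning)
    terse = re.compile(r"[[]]")

# The explicit spelling: a class containing "[", followed by a literal "]".
explicit = re.compile(r"[\[]\]")

samples = ["[]", "[", "]", "[[]]"]
terse_results = [terse.fullmatch(s) is not None for s in samples]
explicit_results = [explicit.fullmatch(s) is not None for s in samples]
```

Both patterns accept exactly the two-character string [] and nothing else in the sample list, which matches the "flagged, not balanced" explanation.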

Related

Sed escaping special chars

To make sed work with an alternation construct, we must escape special chars like ( or |:
sed -n "/\(abc\|def\)/p"
Simple
sed -n "/(abc|def)/p"
doesn't work.
My question is: why does sed behave contrary to "normal" regex, where we escape special chars to give them a literal meaning?
What you call "normal" is a feature invented by Perl.
All traditional regex engines (e.g. the ones used by grep, sed, emacs, awk) have some special characters that match literally when escaped and normal characters that get a special meaning when escaped. My best guess for why this happened is evolution: Maybe the first implementation of regexes only supported [, ], and *, and everything else was matched literally. To introduce new features while keeping compatibility, the escaped syntax (\(, \), etc.) was invented.
Later on, other tools just copied the existing syntax.
As far as I know, Perl was the first language to make regex syntax more, well, regular:
All alphanumeric characters match themselves.
Escaping an alphanumeric character may have a special meaning (e.g. \n, \1, \z).
Punctuation characters may have a special meaning (e.g. (, +, ?).
Escaping a non-alphanumeric character always makes it match literally, even if it wasn't special in the first place (e.g. \:, \").
All "modern" regex engines (e.g. the ones used in JavaScript or .NET) copied Perl's behavior.
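As a rough illustration of the Perl-style rules listed above, here is a sketch in Python's re module, which follows the same convention. The variable names are only for this example.

```python
import re

# Unescaped punctuation is special: a group with alternation.
alternation = re.compile(r"(abc|def)")

# Escaped punctuation is literal: this only matches the text "(abc|def)".
literal = re.compile(r"\(abc\|def\)")

# Escaping punctuation that was never special still gives a literal.
harmless = re.compile(r'\:\"')

found_alternative = alternation.search("xxdefxx").group()
literal_on_plain_text = literal.search("abc")
literal_on_literal_text = literal.fullmatch("(abc|def)") is not None
harmless_matches = harmless.fullmatch(':"') is not None
```

Note that the escaped form, which is what sed's basic syntax needs for alternation, means the opposite here: it matches the parentheses and the pipe themselves.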

Period in .Net 3.5 Regex.IsMatch

I came across this regular expression in vb.net 3.5 code:
Regex.IsMatch(strString, "^[\w\s.+'\-\(\)\/\,\&\#]+$")
What is really confusing me is the ".+" part. I was under the impression that the period means any character and the plus sign means one or more. Following this, I feel like this regular expression should allow anything! But it doesn't, so I must be misunderstanding something. In testing it, it seems like the period and the plus sign are being taken as literals.
Could somebody help explain this to me?
Thanks!
The issue is that all of those characters are enclosed in a [character-group]. The escaping rules are different in character-groups than they are elsewhere in a RegEx expression. For instance, according to the MSDN documentation, \b inside a character-group means a backspace character whereas, outside of a character-group, it is an anchor that matches a word boundary.
According to the Regular-Expressions.info documentation:
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
Therefore, in your example RegEx expression, it looks for any one of the characters in that bracketed list, including either the literal . or + character. If you think about it, it wouldn't make any sense to use a . to mean "any character" inside of a character-group. Doing so would make the group, itself, moot. And certainly, using the + character to mean "one or more times" inside of a character-group really makes no sense.
.+ means any character, one or more times. Maybe you need to escape the dot, like \.+?
Within the square brackets, dot and plus don't have their special meaning. The square brackets define a "character class". It does not contain a string but a set of characters allowed at this position.
So the expression [\w\s.+'\-\(\)\/\,\&\#] creates a character class of letters, digits, underscores, whitespace, dots, pluses, single quotes, minuses, opening round brackets, closing round brackets, slashes, commas, ampersands and hash marks.
The + after the closing square bracket means you expect one or more characters of this character class.
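A quick way to convince yourself is to run the same pattern against a few inputs. The sketch below uses Python's re module rather than .NET (an assumption for convenience; the character-class rules involved are the same in both), and is_allowed is a hypothetical helper name.

```python
import re

# Same pattern as in the question: everything between [ and ] is a set of allowed characters.
allowed = re.compile(r"^[\w\s.+'\-\(\)\/\,\&\#]+$")

def is_allowed(text: str) -> bool:
    """True if every character of text belongs to the character class."""
    return allowed.search(text) is not None

dot_and_plus = is_allowed("a.b+c")            # "." and "+" are literal members
typical_input = is_allowed("O'Neil (R&D) #1")
exclamation = is_allowed("hello!")            # "!" is not in the class
asterisk = is_allowed("a*b")                  # "*" is not in the class
```

If the dot meant "any character" inside the class, the last two inputs would be accepted too.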

Why is the closing bracket a special character that must be escaped to be taken as a literal?

It is clear that an opening bracket "(", among other characters, must be escaped (prefixed by a backslash) for the regex to contain a "literal opening bracket": Because there are regex options for which "(" is a lead-in.
But how come the same holds true for the closing bracket ")"? There is no syntax construct that has ")" as a lead-in token, is there?
So why do I have to escape closing brackets for them to be taken literally?
Of course, the same question could be asked for the other closing brackets as well.
Sorry for this being a "why is this so?" question. It might possibly be un-answerable. But if there is a good reason, the only way to get to know it is by asking!
Addendum:
The rationale behind this question is:
For example, http://www.regexguru.com/2008/12/dont-escape-literal-characters-that-arent-metacharacters/ gives good reasons not to prefix characters that don't need prefixing.
And imho, the closing bracket does not need prefixing in most cases:
Since a closing bracket without an opening one is not part of a regex group, I find it totally illogical that it needs to be escaped in this case anyway.
Assume you want to match a group holding a closing bracket. Without escaping, this would look like this ()). Escaping the bracket like (\)) makes it much easier (if not even possible) for the regular expression to be parsed correctly and unambiguously.
In the (unescaped) regular expression (\w)), does the closing bracket belong to the group, or not, i.e., is the group closed by the first or the second )? E.g., for the string abc)d, does it match c or c)?
Of course one could omit some of the escape characters in case the meaning is not ambiguous (and the regex parser allows to do so) but what would it help? You save a character here and there, but each time you encounter a ) or another special character you have to think: "Is this a control character or a character to be matched? Is it ambiguous?" Better make it clear and consistent.
As a more specific example for tobias_k's answer:
Look at the following regex:
(a*))
Looking at the string bbaaa)bb, will it capture aaa or aaa)?
The result is clear with
(a*\))
versus
(a*)\)
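The two escaped variants can be run side by side. This is a small sketch in Python's re module (my choice of engine, not the answerer's), using the same input string.

```python
import re

text = "bbaaa)bb"

# Escaped ")" inside the group: the parenthesis is part of the capture.
inside = re.search(r"(a*\))", text).group(1)

# Escaped ")" after the group: the parenthesis is matched but not captured.
outside = re.search(r"(a*)\)", text).group(1)

# The unescaped form is rejected instead of being guessed at.
try:
    re.compile(r"(a*))")
    unescaped_accepted = True
except re.error:
    unescaped_accepted = False
```

Here inside is aaa) and outside is aaa, while the ambiguous pattern does not compile at all.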
Of course, the same question could be asked for the other closing brackets as well.
No that's not correct (or may vary with one regex engine to another).
In Javascript regex engine ] and } don't need to be escaped.
See this example:
var x = 'brackets)}]';
x.match(/]/); // works
x.match(/}/); // works
x.match(/)/); // fails
Only the third case fails, with the error Unmatched ')'.

What does (^?)* mean in this regex?

I have this regex:
^(^?)*\?(.*)$
If I understand correctly, this is the breakdown of what it does:
^ - start matching from the beginning of the string
(^?)* - I don't know, but it stores it in $1
\? - matches a question mark
(.*)$ - matches anything until the end of the string
So what does (^?)* mean?
The (^?) is simply looking for the literal character ^. The ^ character in a regex pattern only has special meaning when used as the first character of the pattern or the first character in a grouping match []. When used outside those 2 positions the ^ is interpreted literally, meaning it looks for the ^ character in the input string.
Note: Whether or not ^ outside of the first and grouping position is interpreted literally is regex engine specific. I'm not familiar enough with Lua to state which it does.
Lua does not have a conventional regexp language, it has Lua patterns in its place. While they look a lot like regexp, Lua patterns are a distinct language of their own that has a simpler set of rules and most importantly lacks grouping and alternation features.
Interpreted as a Lua pattern, the example will surprise a longtime regexp user since so many details are different.
Lua patterns are described in PiL, and at first glance are similar enough to a conventional regexp to cause confusion. The biggest differences are probably the lack of an alternation operator |, parentheses are only used to mark captures, quantifiers (?, -, +, and *) only apply to a character or character class, and % is the escape character, not \. A big clue that this example was probably not written with Lua in mind is the lack of the Lua pattern quoting character % applied to any (or ideally, all) of the non-alphanumeric characters in the pattern string, and the suspicious use of \? which smells like a conventional regexp to match a single literal ?.
The simple answer to the question asked is: (^?)* is not a recommended form, and would match ^* or *, capturing the presence or absence of the caret. If that were the intended effect, then I would write it as (%^?)%* to make that clearer.
To see why this is the case, let's take the pattern given and analyze it as a Lua pattern. The entire pattern is:
^(^?)*\?(.*)$
Handed to string.match(), it would be interpreted as follows:
^ anchors the match to the beginning of the string.
( marks the beginning of the first capture.
^ is not at the beginning of the pattern or a character class, so it matches a literal ^ character. For clarity that should likely have been written as %^.
? matches exactly zero or one of the previous character.
) marks the end of the first capture.
* is not after something that can be quantified so it matches a literal * character. For clarity that should likely have been written as %*.
\ in a pattern matches itself; it is not an escape character in the pattern language. However, it is an escape character in a Lua short string literal, making the following character not special to the string literal parser, which in this case is moot because the ? that follows was not special to it in any case. So if the pattern were enclosed in double or single quotes, then the \ would be absorbed by string parsing. If written in a long string (as [[^(^?)*\?(.*)$]]), the backslash would survive the string parser, to appear in the pattern.
? matches exactly zero or one of the previous character.
( marks the beginning of the second capture.
. matches any character at all, effectively a synonym for the class [\000-\255] (remember, in Lua numeric escapes are in decimal not octal as in C).
* matches zero or more of the previous character, greedily.
) marks the end of the second capture.
$ anchors the pattern to the end of the string.
So it matches and captures an optional ^ at the beginning of the string, followed by *, then an optional \ which is not captured, and captures the entire rest of the string. string.match would return two strings on success (either or both of which might be zero length), or nil on failure.
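For readers more at home with conventional regexes, the reading above can be approximated in Python. This is my own translation of the long-string case, not something Lua produces: every character that the walkthrough says is literal gets a backslash, re.DOTALL makes . match newlines as Lua's . does, and \Z stands in for Lua's end-of-string $. The name lua_like_match is hypothetical.

```python
import re

# Anchor, capture an optional "^", literal "*", optional uncaptured backslash,
# then capture the rest of the string.
translated = re.compile(r"^(\^?)\*\\?(.*)\Z", re.DOTALL)

def lua_like_match(subject: str):
    """Return the two captures as a tuple, or None where Lua would return nil."""
    m = translated.match(subject)
    return m.groups() if m else None

with_caret_and_backslash = lua_like_match("^*\\rest")
without_either = lua_like_match("*rest")
no_asterisk = lua_like_match("rest")
```

The first call captures the caret and the remainder, the second captures an empty string and the remainder, and the third fails because the literal * is required.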
Edit: I've fixed some typos, and corrected an error in my answer, noticed by Egor in a comment. I forgot that in patterns, special symbols lose their specialness when in a spot where it can't apply. That makes the first asterisk match a literal asterisk rather than be an error. The cascade of that falls through most of the answer.
Note that if you really want a true regexp in Lua, there are libraries available that will provide it. That said, the built-in pattern language is quite powerful. If it is not sufficient, then you might be best off adopting a full parser, and use LPeg which can do everything a regexp can and more. It even comes with a module that provides a complete regexp syntax that is translated into an LPeg grammar for execution.
In this case, the (^?) refers to the previous string "^" meaning the literal character ^ as Jared has said. Check out regexlib for any further deciphering.
For all your Regex needs: http://regexlib.com/CheatSheet.aspx
It looks to me like the intent of the creator of the expression was to match any number of ^ before the question mark, but only wanted to capture the first instance of ^. However, it may not be a valid expression depending on the engine, as others have stated.
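As one data point on the "depending on the engine" remark: Python's re module is an engine that refuses the expression as written, because it does not allow a quantifier directly after the ^ anchor. The sketch below assumes CPython's re; the escaped variant is my guess at the intent described above, and is_valid_pattern is a hypothetical helper.

```python
import re

def is_valid_pattern(pattern: str) -> bool:
    """True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The pattern from the question: "^?" puts a quantifier on an anchor.
original_is_valid = is_valid_pattern(r"^(^?)*\?(.*)$")

# With the caret escaped, it becomes "any number of ^, then ?, then the rest".
escaped = re.compile(r"^(\^?)*\?(.*)$")
m = escaped.match("^^?query")
rest_of_string = m.group(2) if m else None
```

Other engines may accept the original and treat the inner ^ as an anchor or as a literal, so the result here only speaks for Python.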

Regular expression explanation for vim

If I want all the lines with the text 'ruby' but not 'myruby' then this is what I would do.
:g/\<ruby\>/
My question is what is the meaning of lesser than and greater than symbol here? The only regular expression I have used is while programming in ruby.
Similarly if I want to find three consecutive blank lines then this is what I would do
/^\n\{3}
My question is why I am escaping the first curly brace ( opening curly brace ) but not escaping the second curly brace ( closing curly brace )?
Vim's rules for backslash-escaping in regexes are not consistent. You have to escape the opening brace of \{...}, but [...] requires no escaping at all, and a capture group is \(...\) (escaping both open and close paren). There are other inconsistencies as well.
Thankfully Vim lets you change this behavior, even on a regex-by-regex basis, via the magic settings. If you put \v at the beginning of a regex, the escaping rules become more consistent; everything is "magic" except numbers, letters, and underscores, so you don't need backslashes unless you want to insert a literal character other than those.
Your first example then becomes :g/\v<ruby>/ and your second example becomes /\v^\n{3}. See :h /magic and :h /\v for more information.
The \< and \> mean word boundaries. In Perl, grep and less (to name 3 off the top of my head) you use \b for this, so I imagine it's the same in Ruby.
Regarding your 2nd question, the escape is needed for the whole expression {3}. You're not escaping each curly brace, but rather the whole thing together.
See this question for more.
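To compare with a Perl-style engine, here is a rough equivalent of both Vim searches in Python's re module. It is only an analogy: \b plays the role of \< and \>, and {3} needs no backslash. The sample lines are made up.

```python
import re

lines = ["I like ruby", "myruby is mine", "ruby!", "rubyist"]

# Vim: :g/\<ruby\>/   ->  keep lines containing "ruby" as a whole word
whole_word_lines = [line for line in lines if re.search(r"\bruby\b", line)]

# Vim: /^\n\{3}       ->  three newlines starting at the beginning of a line
has_three_blank = re.search(r"^\n{3}", "a\n\n\n\nb", re.MULTILINE) is not None
has_only_one_blank = re.search(r"^\n{3}", "a\n\nb", re.MULTILINE) is not None
```

The whole-word search keeps "I like ruby" and "ruby!" and drops the two lines where ruby is glued to other letters.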
For your first regular expression, you could also do:
:g/[^\ ]ruby\ /
This would ensure there was a space before and after your ruby keyword.