Sed escaping special chars - regex

To make sed work with an alternation construct, we must escape special characters such as ( or |:
sed -n "/\(abc\|def\)/p"
The simpler form
sed -n "/(abc|def)/p"
doesn't work.
My question is: why does sed behave contrary to "normal" regex, where we escape special characters to give them a literal meaning?

What you call "normal" is a feature invented by Perl.
All traditional regex engines (e.g. the ones used by grep, sed, emacs, awk) have some special characters that match literally when escaped and normal characters that get a special meaning when escaped. My best guess for why this happened is evolution: Maybe the first implementation of regexes only supported [, ], and *, and everything else was matched literally. To introduce new features while keeping compatibility, the escaped syntax (\(, \), etc.) was invented.
Later on, other tools just copied the existing syntax.
As far as I know, Perl was the first language to make regex syntax more, well, regular:
All alphanumeric characters match themselves.
Escaping an alphanumeric character may have a special meaning (e.g. \n, \1, \z).
Punctuation characters may have a special meaning (e.g. (, +, ?).
Escaping a non-alphanumeric character always makes it match literally, even if it wasn't special in the first place (e.g. \:, \").
All "modern" regex engines (e.g. the ones used in JavaScript or .NET) copied Perl's behavior.
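As a small illustration of the Perl-style rules listed above, here is a sketch using Python's re module, which follows the same conventions. Python is my choice for the example only (the question is about sed), and the helper name matches is made up for the demonstration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# Unescaped punctuation is special: ( | ) build an alternation group.
print(matches(r"(abc|def)", "xxdefxx"))            # True

# Escaped punctuation is literal: only the text "(abc|def)" itself matches.
print(matches(r"\(abc\|def\)", "xxdefxx"))         # False
print(matches(r"\(abc\|def\)", "see (abc|def)"))   # True

# Escaping punctuation that was never special is harmless.
print(matches(r"a\:b\"c", 'a:b"c'))                # True

# An escaped alphanumeric can be special: \1 is a backreference.
print(matches(r"(ab)\1", "abab"))                  # True
```

In sed's default (basic) syntax the first two cases are the other way around, which is exactly what the question observed.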

Related

Transform Regexp to POSIX BRE

I'd like to put this expression into POSIX BRE.
HTTP\/[\d.]+.\s+(?:403)\s+(4[0-9])\s+
Here is what I've come up with so far.
HTTP\/[0-9.]{1,}.[[:blank:]]{1,}403[[:blank:]]{1,}(4[0-9])[[:blank:]]
Using a web based regex checker, both examples work quite well.
However, this regexp needs to be registered in SCOM, which seems to support only POSIX BRE for monitoring Linux servers.
Here's the POSIX documentation on Basic Regular Expressions. In particular, note:
When a BRE matching a single character, a subexpression, or a back-reference is followed by an interval expression of the format \{m\}, \{m,\}, or \{m,n\}, together with that interval expression it shall match what repeated consecutive occurrences of the BRE would match…
So [[:blank:]]{1,} isn't going to do what you think it will; the braces need to be preceded with backslashes.
On the other hand, most BRE implementations do allow you to use \+ to mean "one or more repetitions". At least, the BSD and GNU varieties do. So you might well be able to write that as [[:blank:]]\+ instead of using the numeric repetition operator [[:blank:]]\{1,\}.
Finally, [[:blank:]] might not be what you want. At least, it doesn't match the same thing as \s does. [[:blank:]] matches only space and tab characters ([ \t]). But in most regex libraries, \s is the same as [ \t\r\n\f\v], which is what is matched by [[:space:]] in a C regex (or by the isspace() function in C code). The most visible difference between [[:blank:]] and \s (or [[:space:]]) is that [[:blank:]] does not match newlines. Perhaps that's fine in your application.
Pedantic note: Some regex libraries define \s as [ \t\r\n\f], but you're unlikely to notice the difference. And all of those lists of characters assume that the regex has been compiled in the "C" locale. If the regex library is locale-aware and some other locale has been enabled, additional characters might match.
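To make the [[:blank:]] versus \s difference concrete, here is a rough sketch in Python. Python's re has no POSIX class syntax, so the two classes are spelled out by hand as they would be in the "C" locale; the names BLANK, SPACE and matched_chars are mine, not part of any library.

```python
import re

BLANK = r"[ \t]"            # what [[:blank:]] matches in the "C" locale
SPACE = r"[ \t\r\n\f\v]"    # what [[:space:]] matches in the "C" locale
WHITESPACE = " \t\r\n\f\v"  # the six candidate characters to test

def matched_chars(pattern: str, candidates: str, flags: int = 0) -> set:
    """Return the set of single characters that the pattern matches."""
    return {c for c in candidates if re.fullmatch(pattern, c, flags)}

blank = matched_chars(BLANK, WHITESPACE)
space = matched_chars(SPACE, WHITESPACE)
# re.ASCII restricts \s to the six ASCII whitespace characters.
ascii_s = matched_chars(r"\s", WHITESPACE, re.ASCII)

print("\n" in blank)      # False: [[:blank:]] does not cover newlines
print(ascii_s == space)   # True: \s agrees with [[:space:]] here
```

So if the log lines being monitored can contain a carriage return or newline where the original pattern had \s, [[:space:]] is the closer translation.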

Is plus (+) part of basic regular expressions?

Recently I was told that + (one or more occurrences of the previous pattern/character) is not part of basic regex. Not even when written as \+.
It was on a question about maximum compatibility.
I was under the impression that ...
echo "Hello World, I am an example-text" | sed 's#[^a-z0-9]\+#.#ig'
... always results in:
Hello.World.I.am.an.example.text
But then I was told that "it replaces every character not lowercase or a digit followed by + " and that it is the same as [^a-z0-9][+].
So my real question: is there any regex definition or implementation that does not treat either x+ or x\+ the same as xx*?
POSIX "basic" regular expressions support neither + nor ?. Most implementations of sed add support for \+, but it's not a POSIX standard feature. If your goal is maximum portability you should avoid using it. Notice that you have to use \+ rather than the more common +.
echo "Hello World, I am an example-text" | sed 's#[^a-z0-9]\+#.#ig'
The -E flag enables "extended" regular expressions, which are a lot closer to the syntax used in Perl, JavaScript, and most other modern regex engines. With -E you don't need to have a backslash; it's simply +.
echo "Hello World, I am an example-text" | sed -E 's#[^a-z0-9]+#.#ig'
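For comparison, the two readings of the pattern can be reproduced in an engine where + is unambiguous. This is only a sketch in Python (not sed), with variable names chosen for the example: the first substitution treats + as "one or more", the second treats it as a literal plus sign, which is the reading the question was warned about.

```python
import re

text = "Hello World, I am an example-text"

# + as a quantifier: one or more characters that are not letters or digits.
one_or_more = re.sub(r"[^a-z0-9]+", ".", text, flags=re.IGNORECASE)

# + as a literal: one such character followed by an actual "+" sign.
literal_plus = re.sub(r"[^a-z0-9][+]", ".", text, flags=re.IGNORECASE)

print(one_or_more)   # Hello.World.I.am.an.example.text
print(literal_plus)  # Hello World, I am an example-text
```

The second result is unchanged because the input contains no + at all, which is how a non-supporting sed would silently "do nothing".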
From https://www.regular-expressions.info/posix.html:
POSIX or "Portable Operating System Interface for uniX" is a collection of standards that define some of the functionality that a (UNIX) operating system should support. One of these standards defines two flavors of regular expressions. Commands involving regular expressions, such as grep and egrep, implement these flavors on POSIX-compliant UNIX systems. Several database systems also use POSIX regular expressions.
The Basic Regular Expressions or BRE flavor standardizes a flavor similar to the one used by the traditional UNIX grep command. This is pretty much the oldest regular expression flavor still in use today. One thing that sets this flavor apart is that most metacharacters require a backslash to give the metacharacter its flavor. Most other flavors, including POSIX ERE, use a backslash to suppress the meaning of metacharacters. Using a backslash to escape a character that is never a metacharacter is an error.
A BRE supports POSIX bracket expressions, which are similar to character classes in other regex flavors, with a few special features. Shorthands are not supported. Other features using the usual metacharacters are the dot to match any character except a line break, the caret and dollar to match the start and end of the string, and the star to repeat the token zero or more times. To match any of these characters literally, escape them with a backslash.
The other BRE metacharacters require a backslash to give them their special meaning. The reason is that the oldest versions of UNIX grep did not support these. The developers of grep wanted to keep it compatible with existing regular expressions, which may use these characters as literal characters. The BRE a{1,2} matches a{1,2} literally, while a\{1,2\} matches a or aa. Some implementations support \? and \+ as an alternative syntax to \{0,1\} and \{1,\}, but \? and \+ are not part of the POSIX standard. Tokens can be grouped with \( and \). Backreferences are the usual \1 through \9. Only up to 9 groups are permitted. E.g. \(ab\)\1 matches abab, while (ab)\1 is invalid since there's no capturing group corresponding to the backreference \1. Use \\1 to match \1 literally.
POSIX BRE does not support any other features. Even alternation is not supported.
(Emphasis mine.)
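The "backslash gives the metacharacter its meaning" rule can be sketched as a tiny translator from BRE to a Perl-style flavor. This is a hypothetical helper written in Python for illustration, not a complete converter: it assumes the GNU/BSD extensions \|, \+ and \? are wanted, and it ignores bracket expressions containing backslashes or POSIX classes as well as the context-dependent rules for * and ^.

```python
import re

# Characters whose escaped and unescaped roles are swapped between a BRE
# and Python's re. Treating \| \+ \? as operators is an assumption here:
# they are common extensions, not POSIX.
SWAPPED = set("(){}|+?")

def bre_to_python(bre: str) -> str:
    """Translate a simple BRE into Python re syntax (hypothetical helper)."""
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # \( becomes (, while \1, \. and \\ are kept unchanged.
            out.append(nxt if nxt in SWAPPED else ch + nxt)
            i += 2
        elif ch in SWAPPED:
            # A bare ( or { is literal in a BRE, so escape it for Python.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(bre_to_python(r"\(abc\|def\)"))  # (abc|def)
print(bre_to_python("a{1,2}"))         # a\{1,2\}
```

With this, the examples from the quote behave as described: the translation of a\{1,2\} matches a or aa, the translation of a{1,2} matches the text a{1,2} literally, and \(ab\)\1 matches abab.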
So my real question: is there any regex definition or implementation that does not treat either x+ or x\+ the same as xx*?
I can't think of any real world language or tool that supports neither + nor \+.
In the formal mathematical definition of regular expressions there are commonly only three operations defined:
Concatenation: AB matches A followed by B.
Alternation: A|B matches either A or B.
Kleene star: R* matches 0 or more repetitions of R.
These three operations are enough to give the full expressive power of regular expressions†. Operators like ? and + are convenient in programming but not necessary in a mathematical context. If needed, they are defined in terms of the others: R? is R|ε and R+ is RR*.
† Mathematically speaking, that is. Features like back references and lookahead/lookbehind don't exist in formal language theory. Those features add additional expressive power not available in mathematical definitions of regular expressions.
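The identities R+ = RR* and R? = R|ε can be spot-checked by brute force. The sketch below is a bounded test over short strings, not a proof, and same_language is a name invented for this example.

```python
import re
from itertools import product

def same_language(p1: str, p2: str, alphabet: str = "ab", max_len: int = 4) -> bool:
    """Check that two patterns accept the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            # fullmatch asks whether the whole string is in the language.
            if bool(re.fullmatch(p1, s)) != bool(re.fullmatch(p2, s)):
                return False
    return True

print(same_language("a+", "aa*"))            # True:  R+ is RR*
print(same_language("(?:ab)?", "(?:ab|)"))   # True:  R? is R|ε
print(same_language("a+", "a*"))             # False: differ on the empty string
```

The empty alternative in (?:ab|) plays the role of ε.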
In some traditional sed implementations, you have to enable "extended" regular expressions to get support for + to mean "one or more."
For evidence of this, see: sed plus sign doesn't work

Why is only ) a special character and not } or ]?

I'm reading Jan Goyvaerts' "Regular Expressions: The Complete Tutorial and Reference" to touch up on my Regex.
In the second chapter, Jan has a section on "special characters:"
Special Characters
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called “metacharacters”. Most of them are errors when used alone.
(emphasis mine)
I understand that only the opening square bracket and opening curly brace are special, since a closing brace or bracket is clearly a literal if there's no preceding opener. However, why does Jan specify that the closing parenthesis is a special character if the other two closing characters aren't?
Short answer
The regex flavors in my book do not require } and ] to be escaped (except for ] in character classes in JavaScript). So I don't because I like to have as few backslashes in my regexes as possible. You can escape them if you find your regexes clearer that way.
Full answer
First of all, anyone learning about regular expressions needs to understand the importance of the qualifier "In the regex flavors discussed in this tutorial..." You cannot discuss regular expressions without stating which regex flavor(s) you're talking about.
What I wrote is true for the flavors my book (2006 edition) discusses. In those flavors, ) is treated as a token that closes a group. It is a syntax error if used without a corresponding (. So ) has a special meaning when used all on its own.
} does not have a special meaning when used all on its own. You never need to escape it with these flavors. If you wanted to match something like {7} or {7,42} literally, you only need to escape the opening {. If you want to argue that } is special because it sometimes has a special meaning, then you would have to say the same about , which becomes special in the same situation.
] does not have a special meaning outside character classes in these regex flavors. You never need to escape it outside character classes. The paragraph you quoted does not talk about special characters inside character classes. That's a totally different list (\, ], ^, and -) discussed in a later chapter.
Now as to why: most regular expressions have plenty of backslashes already. My preferred style is to escape as few characters as needed. So I never escape }. I escape ] in character classes when using JavaScript because that's the only way. But with other flavors I place ] at the start of the character class or after the negating caret so I don't need to escape it. My teaching materials teach this style. When my products RegexBuddy or RegexMagic convert or generate regular expressions, they also use as few backslashes as needed.
I often see people new to regular expressions needlessly escape characters like ", ', or / because they need to be escaped when the regular expression is quoted as a source code literal in certain programming languages. But the regular expression itself does not require these to be escaped.
I even see people escape characters like < or >. This is a bad habit because in some regex flavors \< and \> are word boundaries. This includes recent versions of PCRE (but not the PCRE that was current in 2006).
But, if you find it confusing to see unescaped } and ] used as literals, you are free to escape them in your regexes. Except for < and >, all the flavors discussed in my book allow you to escape any punctuation character to match that character literally, even if the character on its own would be a literal already.
So somebody saying that } and ] are special characters in regular expressions is not wrong if "special characters" means "characters that have a special meaning either on their own or when used in combination with other characters". But that list would also include , (quantifier), : (non-capturing group), - (mode modifier), ! (negative lookaround), < (lookbehind), and - (character class range).
But if "special characters" means "characters that have a special meaning on their own", then } and ] are not included in the list for the flavors my book covers.
The following paragraphs give an answer. I'm citing from Jan's website, not from the book, though:
If you forget to escape a special character where its use is not allowed, such as in +1, then you will get an error message.
Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like a{1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. But there are a few exceptions. Java requires literal opening braces to be escaped. Boost and std::regex require all literal braces to be escaped.
] is a literal outside character classes. Different rules apply inside character classes. Those are discussed in the topic about character classes. Again, there are exceptions. std::regex and Ruby require closing square brackets to be escaped even outside character classes.
It seems like he uses "needs to be escaped" as his definition for "special character", and unlike ), the ] and } characters need not be escaped in most flavours.
That said, you wouldn't be wrong calling them special characters as well. It's definitely a best practice to always escape them, and in no flavour do \] and \} mean anything other than a literal ] or }.
On the other hand, they have their special meaning only inside a specific (parsing) context, namely when they follow [ and { respectively. There are similar cases: :=><!#'&, all have a non-literal meaning inside a specific context, and we wouldn't normally call these "special characters" either.
And while we could say the same about ), almost no flavour allows for it to occur on its own outside of groups, because pairs of parentheses always need to match. Its only usage is in the special context, and therefore ) is considered a special character.
Everywhere in a regular expression, regardless of engine and its standards, a parenthesis should be escaped to mean a literal character. Even the closing parenthesis. However, it doesn't apply to POSIX regular expressions:
) The <right-parenthesis> shall be special when matched with a preceding <left-parenthesis>, both outside a bracket expression.
But the interesting part is that POSIX has a separate definition for a right-parenthesis for times it should be treated as a special character. It doesn't have it for } or ].
Why don't other engines follow this rule?
Call it implementation peculiarities or historical reasons that have something to do with Perl as commented in PCRE source code:
/* It appears that Perl allows any characters whatsoever, other than
a closing parenthesis, to appear in arguments, so we no longer insist on
letters, digits, and underscores. */
It seems that, with all the special clusters in more advanced engines, treating a closing parenthesis as a special character costs much less than implementing the POSIX rule.
From experiments, it appears that unlike ), the characters ] and } are only interpreted as delimiters when the corresponding opening [ or { has been met.
Though IMO the same rule could apply to ), that's the way it is.
This might be due to the way the parser was written: parentheses can be nested so that the balancing needs to be checked, whereas brackets/curly braces are just flagged. (For instance, [[] is a valid class definition. [[]] is also a valid pattern but understood as [\[]\].)
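The experiment is easy to repeat. Here is a minimal sketch using Python's re module as the test engine (one flavor among many, so other engines may differ as noted above); compiles is a helper name made up for the example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the engine accepts the pattern, False on a syntax error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(")"))    # False: a lone closing parenthesis is an error
print(compiles("]"))    # True:  a lone closing bracket is a literal
print(compiles("}"))    # True:  a lone closing brace is a literal
print(compiles("+1"))   # False: a quantifier with nothing to repeat

# Only the opening brace needs escaping to match {7} literally.
print(re.fullmatch(r"\{7}", "{7}") is not None)   # True
```

This matches the behaviour described in the answers: ) is rejected on its own, while ] and } are only delimiters once their opener has been seen.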

How can I convert a Perl regex to work with Boost::Regex?

What is the Boost::Regex equivalent of this Perl regex for words that end with ing or ed or en?
/ing$|ed$|en$/
...
The most important difference is that regexps in C++ are strings, so every regex-specific backslash sequence (such as \w and \d) must be written with a doubled backslash ("\\w" and "\\d").
/^[\.:\,()\'\`-]/
should become
"^[.:,()'`-]"
The special Perl regex delimiter / doesn't exist in C++, so regexes are just a string. In those strings, you need to take care to escape backslashes correctly (\\ for every \ in your original regex). In your example, though, all those backslashes were unnecessary, so I dropped them completely.
There are other caveats; some Perl features (like variable-length lookbehind) don't exist in the Boost library, as far as I know. So it might not be possible to simply translate any regex. Your examples should be fine, though. Although some of them are weird. .*[0-9].* will match any string that contains a number somewhere, not all numbers.
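The same two points (no / delimiters, and backslashes that can be dropped inside a character class) can be checked in any string-based regex API. The sketch below uses Python rather than Boost, purely as an illustration; has_ending and same_first_char_matches are hypothetical helper names.

```python
import re
import string

# The Perl pattern /ing$|ed$|en$/ written as a plain string, no delimiters.
ENDING = re.compile(r"ing$|ed$|en$")

def has_ending(word: str) -> bool:
    """Return True if the word ends with ing, ed or en."""
    return ENDING.search(word) is not None

# The character class with and without the unnecessary backslashes.
ESCAPED = r"^[\.:\,()\'\`-]"
PLAIN = r"^[.:,()'`-]"

def same_first_char_matches() -> bool:
    """Compare both classes on every printable ASCII character."""
    return all(
        bool(re.match(ESCAPED, c)) == bool(re.match(PLAIN, c))
        for c in string.printable
    )

print(has_ending("walking"), has_ending("walks"))  # True False
print(same_first_char_matches())                   # True
```

In C++ the only extra step would be doubling any backslashes that remain, unless a raw string literal is used.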

Regular expression opening and closing characters

When I learned regular expressions I learned they should start and end with a slash character (followed by modifiers).
For example /dog/i
However, in many examples I see them starting and ending with other characters, such as @, #, and |.
For example |dog|
What's the difference?
This varies enormously from one regex flavor to the next. For example, JavaScript only lets you use the forward-slash (or solidus) as a delimiter for regex literals, but in Perl you can use just about any punctuation character--including, in more recent versions, non-ASCII characters like « and ». When you use characters that come in balanced pairs like braces, parentheses, or the double-arrow quotes above, they have to be properly balanced:
m«\d+»
s{foo}{bar}
Ruby also lets you choose different delimiters if you use the %r prefix, but I don't know if that extends to the balanced delimiters or non-ASCII characters. Many languages don't support regex literals at all; you just write the regexes as string literals, for example:
r'\d+' // Python
@"\d+" // C#
"\\d+" // Java
Note the double backslash in the Java version. That's necessary because the string gets processed twice: once by the Java compiler and once by the compile() method of the Pattern class. Most other languages provide a "raw" or "verbatim" form of string literal that all but eliminates such backslash-itis.
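The double processing is easy to see in a language that offers both kinds of literal. A small sketch in Python (variable names are mine): the ordinary literal needs the doubled backslash, the raw literal does not, and both produce the same three-character regex.

```python
import re

escaped = "\\d+"   # ordinary literal: the language turns \\ into one backslash
raw = r"\d+"       # raw literal: the backslash is passed through untouched

print(escaped == raw)                # True
print(len(raw))                      # 3
print(re.findall(raw, "a1b22c333"))  # ['1', '22', '333']
```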
And then there's PHP. Its preg regex functions are built on top of the PCRE library, which closely imitates Perl's regexes, including the wide variety of delimiters. However, PHP itself doesn't support regex literals, so you have to write them as if they were regex literals embedded in string literals, like so:
'/\d+/i' // match modifiers go after the slash but inside the quotes
"{\\d+}" // double-quotes may or may not require double backslashes
Finally, note that even those languages which do support regex literals don't usually offer anything like Perl's s/…/…/ construct. The closest equivalent is a function call that takes a regex literal as the first argument and a string literal as the second, like so:
s = s.replace(/foo/i, 'bar') // JavaScript
s.gsub!(/foo/i, "bar") // Ruby
Some RE engines will allow you to use a different character so as to avoid having to escape those characters when used in the RE.
For example, with sed, you can use either of:
sed 's/\/path\/to\/directory/xx/g'
sed 's?/path/to/directory?xx?g'
The latter is often more readable. The former is sometimes called "leaning toothpicks". With Perl, you can use either of:
$x =~ /#!\/usr\/bin\/perl/;
$x =~ m!#\!/usr/bin/perl!;
but I still contend the latter is easier on the eyes, especially as the REs get very complex. Well, as easy on the eyes as any Perl code could be :-)
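In languages where the regex is an ordinary string there is no delimiter at all, so the "leaning toothpicks" problem does not arise for slashes. A minimal sketch in Python, with a made-up input string:

```python
import re

command = "cd /path/to/directory && ls /path/to/directory"

# No delimiter, so the slashes need no escaping.
result = re.sub(r"/path/to/directory", "xx", command)
print(result)  # cd xx && ls xx

# Likewise for the shebang pattern: # and ! are plain literals here.
shebang = re.search(r"#!/usr/bin/perl", "#!/usr/bin/perl -w")
print(shebang is not None)  # True
```

The trade-off is that backslashes now have to survive the string literal, which is why raw strings are normally used.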