special characters in sed - regex

Does anybody know what the complete list of special characters in sed is?
Please don't give an answer like "it is the same list of special characters as for grep", because that just transforms my question into: Does anybody know what the complete list of special characters in grep is?

It depends. Strictly speaking, a standard-compliant sed must only use Basic Regular Expressions (BREs), for which the standard states:
The BRE special characters and the contexts in which they have their special meaning are as follows:
.[\ The period, left-square-bracket, and backslash shall be special except when used in a bracket expression (see RE Bracket Expression). An expression containing a '[' that is not preceded by a backslash and is not part of a bracket expression produces undefined results.
* The asterisk shall be special except when used in a bracket expression, as the first character of an entire BRE (after an initial '^', if any), or as the first character of a subexpression (after an initial '^', if any); see BREs Matching Multiple Characters.
^ The circumflex shall be special when used as an anchor (see BRE Expression Anchoring) or as the first character of a bracket expression (see RE Bracket Expression).
$ The dollar-sign shall be special when used as an anchor.
So the complete list is .[\*^$, but context matters. Also, many sed implementations provide an option to use extended regular expressions (EREs), which expands the list and changes the contexts in which characters are special. For example, without EREs groupings are formed using \( and \), but with EREs ( and ) by themselves are special and must be escaped to be matched literally.
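As a rough illustration of how parentheses change meaning between the two syntaxes, the sketch below runs sed on a made-up input string. It assumes a sed that accepts -E for EREs (GNU and BSD sed do); the input and the variable names are only for illustration.

```shell
# Hypothetical sample input containing literal parentheses.
input='f(x) = x'

# BRE (the default): bare ( and ) are ordinary characters.
bre_literal=$(printf '%s\n' "$input" | sed 's/(x)/[x]/')

# BRE: \( and \) form a group; \1 refers to what the group matched.
bre_group=$(printf '%s\n' "$input" | sed 's/f(\(.\))/g(\1\1)/')

# ERE (-E): ( and ) form groups, so literal parentheses must be escaped.
ere_literal=$(printf '%s\n' "$input" | sed -E 's/\(x\)/[x]/')

printf '%s\n' "$bre_literal" "$bre_group" "$ere_literal"
```

The first and third commands produce the same result; only the escaping differs.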

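I think the full list of characters that sed may treat differently from an ordinary character is [\^$.|?*+(). Several of these, such as |, ?, +, ( and ), are special only in extended regular expressions, where { is special as well.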
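One practical use of the BRE list .[\*^$ is escaping an arbitrary string before using it as a sed pattern. The following is a minimal sketch, not a hardened tool: escape_bre is a hypothetical helper name, it also escapes the / delimiter, and it assumes the string contains no newlines.

```shell
# Hypothetical helper: put a backslash before each BRE special character
# (. [ \ * ^ $) and before the / used as the s-command delimiter.
escape_bre() {
    printf '%s\n' "$1" | sed 's/[[\.*^$/]/\\&/g'
}

# Made-up literal text that is full of special characters.
literal='${a.b}[0]*'
pattern=$(escape_bre "$literal")

# Use the escaped text as a BRE; it now matches only the literal string.
result=$(printf '%s\n' "x ${literal} y" | sed "s/$pattern/OK/")

printf '%s\n' "$pattern" "$result"
```

Escaping ^, $ and * everywhere is slightly more than the standard requires, but it avoids having to reason about the contexts in which they are special.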

Related

sed matching "$" literally without considering it regex

I was trying to use $ in a sed -e command and it works, e.g.:
sed -e 's/world$/test/g' test.txt
The above command replaces "world" at the end of a line.
What confused me is that the following matched literally:
sed -e 's/${projects.version}/20.0/g' test.txt
The above command replaced ${projects.version}. I can't explain how sed matched the $ literally; I expected it to be treated as a special character.
As the POSIX spec says:
$
The <dollar-sign> shall be special when used as an anchor.
A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a
<dollar-sign> as an anchor when used as the last character of a
subexpression. The <dollar-sign> shall anchor the expression (or
optionally subexpression) to the end of the string being matched; the
<dollar-sign> can be said to match the end-of-string following the
last character.
so when it's not at the end of a BRE, it's just a literal $ character.
For EREs the 2nd paragraph is a little different:
A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the
last character of a string. For example, the EREs "ef$" and "(ef$)"
match "ef" in the string "abcdef", but fail to match in the string
"cdefab", and the ERE "e$f" is valid, but can never match because the
'f' prevents the expression "e$" from matching ending at the last
character.
Note that last sentence: it means that in an ERE a $ that is not at the end of a regexp is NOT treated literally. It is still an anchor, so the regexp just can't match anything.
You should never have to worry about this, though. For clarity if nothing else, always escape any regexp metacharacter you want treated literally, so don't write:
's/$foo/bar/'
but write either of these instead:
's/\$foo/bar/'
's/[$]foo/bar/'
and then none of the semantics mentioned above matter.
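A small sketch of both escaped forms next to the anchoring use, assuming a made-up input line modelled on the question (the variable names are illustrative only):

```shell
# Hypothetical input line, modelled on the question above.
line='version=${projects.version} hello world'

# \$ and [$] both match a literal dollar sign wherever they appear.
with_backslash=$(printf '%s\n' "$line" | sed 's/\${projects\.version}/20.0/')
with_bracket=$(printf '%s\n' "$line" | sed 's/[$]{projects[.]version}/20.0/')

# An unescaped $ at the very end of the pattern is an anchor.
anchored=$(printf '%s\n' "$line" | sed 's/world$/test/')

printf '%s\n' "$with_backslash" "$with_bracket" "$anchored"
```

The period is escaped as well, since an unescaped . would match any character rather than only a literal dot.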
The rationale for the difference between the way $ is handled in BREs vs EREs in this context is explained at https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_08, but basically it's just that the standards were written this way to accommodate the different historical behavior of the way people used $ in BREs vs EREs.
Thanks to @M.NejatAydin here on SO and @oguzismail in comp.unix.shell on usenet for helping clarify the rationale.

Which characters combined with ^ don't need to be escaped in sed?

I have checked that ^* and ^& match lines beginning with * and &, which I didn't expect since they are special characters. But ^[ doesn't work. Is this "standard" behavior? Is there any rationale behind this?
sed version used was "GNU sed 4.4".
From POSIX.1-2017:
The sed utility shall support the BREs described in XBD Basic Regular Expressions, ... [sed]
Reading the POSIX section on BREs, we read:
A BRE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character is a BRE that matches the special character itself. The BRE special characters and the contexts in which they have their special meaning are as follows:
.[\:
The <period>, <left-square-bracket>, and <backslash> shall be special except when used in a bracket expression (see RE Bracket Expression). An expression containing a '[' that is unescaped and is not part of a bracket expression produces undefined results.
*:
The <asterisk> shall be special except when used:
In a bracket expression
As the first character of an entire BRE (after an initial '^', if any)
As the first character of a subexpression (after an initial '^', if any); see BREs Matching Multiple Characters
^:
The <circumflex> shall be special when used as an anchor (see BRE Expression Anchoring). The <circumflex> shall signify a non-matching list expression when it occurs first in a list, immediately following a <left-square-bracket> (see RE Bracket Expression).
$:
The <dollar-sign> shall be special when used as an anchor.
source: Basic Regular Expressions, Special characters
So to answer the OP's question using the above:
& is not a special character, so ^& is expected to work
[ should always be escaped if it is not used as a bracket expression.
* is not special after an initial ^ when the latter is an anchor.
So everything the OP observed is standard behavior.
There is however still an interesting paragraph in RE Bracket Expression:
A bracket expression is either a matching list expression or a non-matching list expression. It consists of one or more expressions: ordinary characters, collating elements, collating symbols, equivalence classes, character classes, or range expressions. The <right-square-bracket> ( ] ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial <circumflex>( ^ ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as [.].] ) or is the ending <right-square-bracket> for a collating symbol, equivalence class, or character class. The special characters ., *, [, and \\ ( <period>, <asterisk>, <left-square-bracket>, and <backslash>, respectively) shall lose their special meaning within a bracket expression.
source: Basic Regular Expressions, RE Bracket Expression
This implies that ] cannot be escaped in a bracket expression. This means:
The following work:
$ echo '[]' | sed 's/[^]x]/a/'
a]
$ echo '[]' | sed 's/[^x[.].]]/a/'
a]
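but this does not work as expected, because the backslash is an ordinary character inside a bracket expression, so the pattern is read as [^x\] followed by a literal ]:
$ echo '[]' | sed 's/[^x\]]/a/'
a
So in a bracket expression, don't escape it, but collate it!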
See sed "3.3 Overview of Regular Expression Syntax" documentation.
The & char is not a special regex char, so it does not need escaping in a regex pattern. Note that & is treated as a special construct in the replacement pattern, where it refers to the whole match.
The * is not special when it is at the start in GNU sed (^* is a pattern that matches a * at the start of the string):
POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many non-GNU implementations do not support this and portable scripts should instead use \* in these contexts.
The [ starts a bracket expression and must have a paired ] to close the expression, hence it is an error.
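The three cases can be tried side by side. This is only a sketch on made-up input lines; it uses the escaped forms that should behave the same on any POSIX sed, and the variable names are illustrative.

```shell
# Three made-up lines, each starting with one of the characters in question.
sample=$(printf '%s\n' '*star' '&amp' '[bracket')

# \* is the portable way to match a literal * right after the ^ anchor.
star=$(printf '%s\n' "$sample" | sed -n '/^\*/p')

# & is an ordinary character in the pattern, so no escaping is needed.
amp=$(printf '%s\n' "$sample" | sed -n '/^&/p')

# In the replacement, however, & stands for the whole match.
wrapped=$(printf '%s\n' "$sample" | sed -n 's/^&.*/<&>/p')

# [ must be escaped, or written as [[], when it does not open a bracket expression.
bracket=$(printf '%s\n' "$sample" | sed -n '/^\[/p')
bracket_alt=$(printf '%s\n' "$sample" | sed -n '/^[[]/p')

printf '%s\n' "$star" "$amp" "$wrapped" "$bracket" "$bracket_alt"
```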

Strange behavior when using regex to match parentheses in vim

I'm having some trouble understanding why a regular expression is not working. I'm searching for the phrase #Test(groups = {"broken"}), and I'm not able to find it with this expression:
#Test\(groups = {"broken"}\)
However, this expression yields results:
#Test\(.*groups = {"broken"}\)
Why is this happening? I can't see why the first expression would not work, but I understand why the second one does.
\( starts a capture group in vim, because with the default "magic" setting parentheses are special only when backslashed (unlike in extended or Perl-style regexes). If you want to search for a literal paren, use a plain (.
The second expression works because .* matches (.
If you want to search for literal text, just prepend \V to the search pattern; then, only the backslash has special meaning and must be escaped:
/\V#Test(groups = {"broken"})
In contrast to most other regular expression dialects, many Vim atoms need to be prefixed with \ to be non-literal. To make Vim's patterns look more like Perl's, you can prepend \v; then, (...) does capture grouping (as you expected), and you need to write \( and \) to match literal parentheses.

Regular expressions documentation while using grep

I am trying to find some comprehensive documentation on the character classes in regular expressions that can be used with grep. I tried
info grep
man grep
man 7 regex
but could not find all the character classes listed in the documentation.
I am looking for some comprehensive documentation on the regex syntax that grep uses. Is such documentation available?
grep has three options for selecting the regex syntax: -E or --extended-regexp, -G or --basic-regexp, and -P or --perl-regexp.
Extended / Basic Regex Classes: Follow POSIX Classes
Perl Regex Classes: Follow Perl Classes
From the command line, POSIX regex information can be accessed via man 7 regex, whereas Perl regex documentation can be accessed via perldoc perlre.
http://linux.about.com/od/commands/l/blcmdl1_grep.htm
Finally, certain named classes of characters are predefined within
bracket expressions, as follows. Their names are self explanatory, and
they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:],
[:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].
For example, [[:alnum:]] means [0-9A-Za-z], except the latter form
depends upon the C locale and the ASCII character encoding, whereas
the former is independent of locale and character set.
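To see the named classes in action, here is a short sketch with grep on made-up input. Note the doubled brackets: the class name, including its own [: and :], sits inside an ordinary bracket expression.

```shell
# Made-up sample lines; the last one consists of three spaces.
sample=$(printf '%s\n' 'abc' '123' 'a1!' '   ')

# Lines made up entirely of digits.
digits_only=$(printf '%s\n' "$sample" | grep '^[[:digit:]]*$')

# Lines containing at least one punctuation character.
has_punct=$(printf '%s\n' "$sample" | grep '[[:punct:]]')

# A class can be negated like any other list member: count the lines
# that contain no whitespace at all.
no_space_count=$(printf '%s\n' "$sample" | grep -c '^[^[:space:]]*$')

printf '%s\n' "$digits_only" "$has_punct" "$no_space_count"
```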
When I do man grep, this is what I get:
REGULAR EXPRESSIONS
A regular expression is a pattern that describes a set of strings. Regular expressions are constructed
analogously to arithmetic expressions, by using various operators to combine smaller expressions.
grep understands two different versions of regular expression syntax: "basic" and "extended." In GNU grep,
there is no difference in available functionality using either syntax. In other implementations, basic
regular expressions are less powerful. The following description applies to extended regular expressions;
differences for basic regular expressions are summarized afterwards.
The fundamental building blocks are the regular expressions that match a single character. Most characters,
including all letters and digits, are regular expressions that match themselves. Any meta-character with
special meaning may be quoted by preceding it with a backslash.
The period . matches any single character.
Character Classes and Bracket Expressions
A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that
list; if the first character of the list is the caret ^ then it matches any character not in the list. For
example, the regular expression [0123456789] matches any single digit.
Within a bracket expression, a range expression consists of two characters separated by a hyphen. It
matches any single character that sorts between the two characters, inclusive, using the locale's collating
sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many
locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to
[abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of
bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.
Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their
names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:],
[:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z],
except the latter form depends upon the C locale and the ASCII character encoding, whereas the former is
independent of locale and character set. (Note that the brackets in these class names are part of the
symbolic names, and must be included in addition to the brackets delimiting the bracket expression.) Most
meta-characters lose their special meaning inside bracket expressions. To include a literal ] place it
first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a
literal - place it last.
Anchoring
The caret ^ and the dollar sign $ are meta-characters that respectively match the empty string at the
beginning and end of a line.
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b
matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the
edge of a word. The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum:]].
Repetition
A regular expression may be followed by one of several repetition operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{,m} The preceding item is matched at most m times.
{n,m} The preceding item is matched at least n times, but not more than m times.
Concatenation
Two regular expressions may be concatenated; the resulting regular expression matches any string formed by
concatenating two substrings that respectively match the concatenated expressions.
Alternation
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any
string matching either alternate expression.
Precedence
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole
expression may be enclosed in parentheses to override these precedence rules and form a subexpression.
Back References and Subexpressions
The back-reference \n, where n is a single digit, matches the substring previously matched by the nth
parenthesized subexpression of the regular expression.
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead
use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead,
so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start
of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character
string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an
extension, but portable scripts should avoid it.
Regular expressions are awesome!
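The basic vs. extended distinction described in the man page excerpt can be checked with a few one-liners. This is a sketch on made-up words; it sticks to constructs that POSIX defines for both syntaxes and avoids the \? and \+ forms, which the excerpt documents for GNU grep.

```shell
# Made-up sample lines.
sample=$(printf '%s\n' 'color' 'colour' 'aa' 'a+b')

# ERE: ? is a repetition operator, so the u is optional.
ere_optional=$(printf '%s\n' "$sample" | grep -E '^colou?r$')

# BRE: a bare + is an ordinary character and matches itself.
bre_plus=$(printf '%s\n' "$sample" | grep 'a+b')

# Intervals are written \{n\} in BRE but {n} in ERE.
bre_interval=$(printf '%s\n' "$sample" | grep '^a\{2\}$')
ere_interval=$(printf '%s\n' "$sample" | grep -E '^a{2}$')

printf '%s\n' "$ere_optional" "$bre_plus" "$bre_interval" "$ere_interval"
```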

Using escape characters inside grep

I have the following regular expression for eliminating spaces, tabs, and new lines: [^ \n\t]
However, I want to expand this for certain additional characters, such as > and <.
I tried [^ \n\t<>], which works well for now, but I want the expression to not match if the < or > is preceded by a \.
I tried [^ \n\t[^\\]<[^\\]>], but this did not work.
Can any one of the sequences below occur in your input?
\\>
\\\>
\\\\>
\blank
\tab
\newline
...
If so, how do you propose to treat them?
If not, then zero-width look-behind assertions will do the trick, provided that your regular expression engine supports them. This will be the case in any engine that supports Perl-style regular expressions (including Perl's, PHP, etc.):
(?<!\\)[ \n\t<>]
The above will match any unescaped space, newline, tab or angle bracket. More generically (using \s to denote any whitespace character, including \r):
(?<!\\)\s
Alternatively, using complementary notation without the need for a zero-width look-behind assertion (but arguably less efficiently):
(?:[^ \n\t<>]|\\[<>])
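You may also use a variation of the latter to handle the \\>, \\\>, \\\\> etc. cases as well, up to some finite number of preceding backslashes. For example, to accept 1, 3, 5, 7 or 9 backslashes (an odd count, so that the last one escapes the angle bracket):
(?:[^ \n\t<>]|(?:^|[^<>])\\(?:\\\\){0,4}[<>])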
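Since the question is about grep, the look-behind pattern can be tried with grep -P. This is a sketch under the assumption that the local grep was built with Perl-regex support, which the script checks first; the sample lines are made up.

```shell
# Made-up input: the second line contains only an escaped angle bracket.
sample=$(printf '%s\n' 'a<b' 'c\<d' 'plain')

# Not every grep supports -P, so probe for it before relying on it.
if printf 'x\n' | grep -P 'x' >/dev/null 2>&1; then
    # (?<!\\) rejects any angle bracket directly preceded by a backslash.
    pcre_unescaped=$(printf '%s\n' "$sample" | grep -P '(?<!\\)[<>]')
else
    pcre_unescaped='(grep -P not available)'
fi

printf '%s\n' "$pcre_unescaped"
```

Like the pattern it is based on, this does not distinguish \< from \\<, which is why the sequences listed at the top of this answer matter.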
According to the grep man page:
A bracket expression is a list of
characters enclosed by [ and ]. It
matches any single character in that
list; if the first character of the
list is the caret ^ then it matches
any character not in the list.
This means that you can't match a sequence of characters such as \< or \> with a bracket expression, only single characters.
If you have a version of grep built with Perl regex support, you can use lookarounds as one of the other posters mentioned. Not all versions of grep have this support, though.
Maybe you can use egrep and put your pattern string inside quotes. This should reduce the need for escaping.
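Without Perl regex support, one possible workaround is to spell out "preceded by something other than a backslash" with an ERE group instead of a lookaround. The following is a sketch on made-up input, not a complete solution: it selects lines with an unescaped < or >, and it does not handle an escaped backslash such as \\<.

```shell
# Made-up input: the second line contains only an escaped angle bracket.
sample=$(printf '%s\n' 'a<b' 'c\<d' 'plain')

# (^|[^\\]) means "start of line, or any character other than a backslash",
# so [<>] matches only when it is not directly preceded by a backslash.
unescaped=$(printf '%s\n' "$sample" | grep -E '(^|[^\\])[<>]')

printf '%s\n' "$unescaped"
```

Unlike a look-behind, the group consumes the preceding character, which matters if the match itself is extracted (for example with grep -o) rather than just the line.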