Which characters combined with ^ don't need to be escaped in sed?

Which characters combined with ^ don't need to be escaped in sed? - regex

I have checked that ^* and ^& match lines beginning by * and &, which I didn't since they are special characters. But ^[ doesn't work. Is this "standard" behavior? Is there any rationale behind this?
sed version used was "GNU sed 4.4".

From POSIX.1-2017:
The sed utility shall support the BREs described in XBD Basic Regular Expressions, ... [sed]
Reading the POSIX section on BREs, we read:
A BRE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character is a BRE that matches the special character itself. The BRE special characters and the contexts in which they have their special meaning are as follows:
.[\:
The <period>, <left-square-bracket>, and <backslash> shall be special except when used in a bracket expression (see RE Bracket Expression). An expression containing a '[' that is unescaped and is not part of a bracket expression produces undefined results.
*:
The <asterisk> shall be special except when used:
In a bracket expression
As the first character of an entire BRE (after an initial '^', if any)
As the first character of a subexpression (after an initial '^', if any); see BREs Matching Multiple Characters
^:
The <circumflex> shall be special when used as an anchor (see BRE Expression Anchoring). The <circumflex> shall signify a non-matching list expression when it occurs first in a list, immediately following a <left-square-bracket> (see RE Bracket Expression).
$:
The <dollar-sign> shall be special when used as an anchor.
source: Basic Regular Expressions, Special characters
So to answer the OPs question using the above:
& is not a special character, so ^& is expected to work
[ should always be escaped if it is not used as a bracket expression.
* is not special after an initial ^ when the latter is an anchor.
So all observed statements by the OP are therefore valid.
There is however still an interesting paragraph in RE Bracket Expression:
A bracket expression is either a matching list expression or a non-matching list expression. It consists of one or more expressions: ordinary characters, collating elements, collating symbols, equivalence classes, character classes, or range expressions. The <right-square-bracket> ( ] ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial <circumflex>( ^ ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as [.].] ) or is the ending <right-square-bracket> for a collating symbol, equivalence class, or character class. The special characters ., *, [, and \\ ( <period>, <asterisk>, <left-square-bracket>, and <backslash>, respectively) shall lose their special meaning within a bracket expression.
source: Basic Regular Expressions, RE Bracket Expression
This implies that ] cannot be escaped in a bracket expression. This means:
The following work:
$ echo '[]' | sed 's/[^]x]/a/'
a]
$ echo '[]' | sed 's/[^x[.].]]/a/'
a]
but this does not work as expected:
$ echo '[]' | sed 's/[^x\]]/a/'
[]
So in a Bracket Expression, dont escape it, but collate it!

See sed "3.3 Overview of Regular Expression Syntax" documentation.
The & char is not a special regex char, it does not need escaping in a regex pattern. Note that & can be parsed as a special construct in the replacement pattern where is refers to the whole match.
The * is not special when it is at the start in GNU sed (^* is a pattern that matches a * at the start of the string):
POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many nonGNU implementations do not support this and portable scripts should instead use \* in these contexts.
The [ starts a bracket expression and must have a paired ] to close the expression, hence it is an error.

Related

sed matching "$" literally without considering it regex

I was trying to use $ in the sed -e command and it works , eg:
sed -e 's/world$/test/g' test.txt
the above command will replace "world" at the end of string.
what confused me the following worked literally :
sed -e 's/${projects.version}/20.0/g' test.txt
the above command replaced ${projects.version}, I don't have any explanation how did the sed match the $ and didn't expect it to be a special character?

As the POSIX spec says:
$
The <dollar-sign> shall be special when used as an anchor.
A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a
<dollar-sign> as an anchor when used as the last character of a
subexpression. The <dollar-sign> shall anchor the expression (or
optionally subexpression) to the end of the string being matched; the
<dollar-sign> can be said to match the end-of-string following the
last character.
so when it's not at the end of a BRE, it's just a literal $ character.
For EREs the 2nd paragraph is a little different:
A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the
last character of a string. For example, the EREs "ef$" and "(ef$)"
match "ef" in the string "abcdef", but fail to match in the string
"cdefab", and the ERE "e$f" is valid, but can never match because the
'f' prevents the expression "e$" from matching ending at the last
character.
Note that last sentence - that means the $ is NOT treated literally in an ERE when not at the end of a regexp, it just can't match anything.
This is something you should never have to worry about, though, because for clarity if nothing else, you should always make sure you write your regexps to escape any regexp metachar you want treated literally so you shouldn't write:
's/$foo/bar/'
but write either of these instead:
's/\$foo/bar/'
's/[$]foo/bar/'
and then none of the semantics mentioned above matter.
The rationale for the difference between the way $ is handled in BREs vs EREs in this context is explained at https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_08, but basically it's just that the standards were written this way to accommodate the different historical behavior of the way people used $ in BREs vs EREs.
Thanks to #M.NejatAydin here on SO and #oguzismail in comp.unix.shell on usenet for helping clarify the rationale.

'1A' is not accepted in [:alnum:] POSIX Character Class under Oracle regular expresssion

Why is '1A' not considered in alphanumeric class?
SELECT 'yes'
FROM dual
WHERE REGEXP_LIKE('1A', '[:alnum:]');

A pair of unescaped [...] creates a bracket expression. POSIX character classes, like [:alpha:], [:alnum:], can only be declared inside a bracket expression.
[:digit:] is a POSIX character class, used inside a bracket expression like [x-z[:digit:]].
Use the POSIX character class inside a bracket expression:
SELECT 'yes' FROM dual WHERE REGEXP_LIKE('1A', '[[:alnum:]]')
See an online demo.
('1a', '[:alnum:]') seems to acceptable.
Note that when you use '[:alnum:]', the regex engine parses it as a regular bracket expression matching any characters/character ranges defined inside the bracket expression. I.e. the [:alnum:] regex matches any single char that is either :, a, l, n, u or m. Since your input has a (1a) the regex engine returns a valid match once it encounters a in the input.

What do these qr{} regular expressions mean?

What do these mean?
qr{^\Q$1\E[a-zA-Z0-9_\-]*\Q$2\E$}i
qr{^[a-zA-Z0-9_\-]*\Q$1\E$}i
If $pattern is a Perl regular expression, what is $identity in the code below?
$identity =~ $pattern;

When the RHS of =~ isn't m//, s/// or tr///, a match operator (m//) is implied.
$identity =~ $pattern;
is the same as
$identity =~ /$pattern/;
It matches the pattern or pre-compiled regex $pattern (qr//) against the value of $identity.

The binding operator =~ applies a regex to a string variable. This is documented in perldoc perlop
The \Q ... \E escape sequence is a way to quote meta characters (also documented in perlop). It allows for variable interpolation, though, which is why you can use it here with $1 and $2. However, using those variables inside a regex is somewhat iffy, because they themselves are defined during the use of a capture inside a regex.
The character class bracket [ ... ] defines a range of characters which it will match. The quantifier that follows it * means that particular bracket must match zero or more times. The dashes denote ranges, such as a-z meaning "from a through z". The escaped dash \- means a literal dash.
The ^ and $ (the dollar sign at the end) denotes anchors, beginning and end of string respectively. The modifier i at the end means the match is case insensitive.
In your example, $identity is a variable that presumably contains a string (or whatever it contains will be converted to a string).

The perlre documentation is your friend here. Search it for unfamiliar regex constructs.
A detailed explanation is below, but it is so hairy that I wonder whether using a module such as Text::Balanced would be a superior approach.
The first pattern matches possibly empty delimited strings, and the delimiters are in $1 and $2, which we do not know until runtime. Say $1 is ( and $2 is ), then the first pattern matches strings of the form
()
(a)
(9)
(abcABC_012-)
and so on …
The second pattern matches terminated strings, where the terminator is in $1—also not known until runtime. Assuming the terminator is ], then the second pattern matches strings of the form
]
a]
Aa9a_9]
Using \Q...\E around a pattern removes any special regex meaning from the characters inside, as documented in perlop:
For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed. This allows the pattern to match literally (except for $ and #). For example, the following matches:
'\s\t' =~ /\Q\s\t/
Because $ or # trigger interpolation, you'll need to use something like /\Quser\E\#\Qhost/ to match them literally.
The patterns in your question do want to trigger interpolation but do not want any regex metacharacters to have special meaning, as with parentheses and square brackets above that are meant to match literally.
Other parts:
Circumscribed brackets delimit a character class. For example, [a-zA-Z0-9_\-] matches any single character that is upper- or lowercase A through Z (but with no accents or other extras), zero through nine, underscore, or hyphen. Note that the hyphen is escaped at the end to emphasize that it matches a literal hyphen rather and does not specify part of a range.
The * quantifier means match zero or more of the preceding subpattern. In the examples from your question, the star repeats character classes.
The patterns are bracketed with ^ and $, which means an entire string must match rather than some substring to succeed.
The i at the end, after the closing curly brace, is a regex switch that makes the pattern case-insensitive. As TLP helpfully points out in the comment below, this makes the delimiters or terminators match without regard to case if they contain letters.
The expression $identity =~ $pattern tests whether the compiled regex stored in $pattern (created with $pattern = qr{...}) matches the text in $identity. As written above, it is likely being evaluated for its side effect of storing capture groups in $1, $2, etc. This is a red flag. Never use $1 and friends unconditionally but instead write
if ($identity =~ $pattern) {
print $1, "\n"; # for example
}

special characters in sed

Does anybody know what the complete list of special characters in sed are ?
Please don't give an answer like, it is the same list of special characters for grep, because that just transforms my question to: Does anybody know what the complete list of special characters in grep are?

It depends. Strictly speaking, a standard compliant sed must only use Basic Regular Expressions for which the standard states:
The BRE special characters and the contexts in which they have their special meaning are as follows:
.[\ The period, left-square-bracket, and backslash shall be special except when used in a bracket expression (see RE Bracket Expression ). An expression containing a '[' that is not preceded by a backslash and is not part of a bracket expression produces undefined results.
* The asterisk shall be special except when used in a bracket expression, as the first character of an entire BRE (after an initial '^' , if any), or as the first character of a subexpression (after an initial '^' , if any); see BREs Matching Multiple Characters
^ The circumflex shall be special when used as an anchor (see BRE Expression Anchoring )
or as the first character of a bracket expression (see RE Bracket Expression )
$ The dollar-sign shall be special when used as an anchor.
So the complete list is .[\*^$, but context matters. Also, many sed provide options to use extended regular expressions(EREs), which will expand the list and change the context in which characters are special. For example, without EREs groupings are formed using \( and \), but with EREs ( and ) by themselves are special and must be escaped to be matched literally.

I think this is the full list of characters [\^$.|?*+() on which sed will respond in a manner different than a normal character.

Regular expressions documentation while using grep

I am trying to find some comprehensive documentation on character classes in regular expressions that could be used while using grep. I tried
info grep
man grep
man 7 regex
but could not find all the characters classes listed down in the documentation.
I am looking for some comprehensive documentation on regex that grep uses. Is there such a documentation available?

grep has three options for regex -E or --extended-regexp -G or --basic-regexp and -P or --perl-regexp.
Extended / Basic Regex Classes: Follow POSIX Classes
Perl Regex Classes: Follow Perl Classes
From the command line POSIX regex information can be accessed via man 7 regex where as Perl regex data can be accessed via perldoc perlre

http://linux.about.com/od/commands/l/blcmdl1_grep.htm
Finally, certain named classes of characters are predefined within
bracket expressions, as follows. Their names are self explanatory, and
they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:],
[:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:].
For example, [[:alnum:]] means [0-9A-Za-z], except the latter form
depends upon the C locale and the ASCII character encoding, whereas
the former is independent of locale and character set.

When I do man grep, this is what I get:
REGULAR EXPRESSIONS
A regular expression is a pattern that describes a set of strings. Regular expressions are constructed
analogously to arithmetic expressions, by using various operators to combine smaller expressions.
grep understands two different versions of regular expression syntax: "basic" and "extended." In GNU grep,
there is no difference in available functionality using either syntax. In other implementations, basic
regular expressions are less powerful. The following description applies to extended regular expressions;
differences for basic regular expressions are summarized afterwards.
The fundamental building blocks are the regular expressions that match a single character. Most characters,
including all letters and digits, are regular expressions that match themselves. Any meta-character with
special meaning may be quoted by preceding it with a backslash.
The period . matches any single character.
Character Classes and Bracket Expressions
A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that
list; if the first character of the list is the caret ^ then it matches any character not in the list. For
example, the regular expression [0123456789] matches any single digit.
Within a bracket expression, a range expression consists of two characters separated by a hyphen. It
matches any single character that sorts between the two characters, inclusive, using the locale's collating
sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd]. Many
locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to
[abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of
bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.
Finally, certain named classes of characters are predefined within bracket expressions, as follows. Their
names are self explanatory, and they are [:alnum:], [:alpha:], [:cntrl:], [:digit:], [:graph:], [:lower:],
[:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means [0-9A-Za-z],
except the latter form depends upon the C locale and the ASCII character encoding, whereas the former is
independent of locale and character set. (Note that the brackets in these class names are part of the
symbolic names, and must be included in addition to the brackets delimiting the bracket expression.) Most
meta-characters lose their special meaning inside bracket expressions. To include a literal ] place it
first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a
literal - place it last.
Anchoring
The caret ^ and the dollar sign $ are meta-characters that respectively match the empty string at the
beginning and end of a line.
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b
matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the
edge of a word. The symbol \w is a synonym for [[:alnum:]] and \W is a synonym for [^[:alnum:]].
Repetition
A regular expression may be followed by one of several repetition operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{,m} The preceding item is matched at most m times.
{n,m} The preceding item is matched at least n times, but not more than m times.
Concatenation
Two regular expressions may be concatenated; the resulting regular expression matches any string formed by
concatenating two substrings that respectively match the concatenated expressions.
Alternation
Two regular expressions may be joined by the infix operator |; the resulting regular expression matches any
string matching either alternate expression.
Precedence
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole
expression may be enclosed in parentheses to override these precedence rules and form a subexpression.
Back References and Subexpressions
The back-reference \n, where n is a single digit, matches the substring previously matched by the nth
parenthesized subexpression of the regular expression.
Basic vs Extended Regular Expressions
In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead
use the backslashed versions \?, \+, \{, \|, \(, and \).
Traditional egrep did not support the { meta-character, and some egrep implementations support \{ instead,
so portable scripts should avoid { in grep -E patterns and should use [{] to match a literal {.
GNU grep -E attempts to support traditional usage by assuming that { is not special if it would be the start
of an invalid interval specification. For example, the command grep -E '{1' searches for the two-character
string {1 instead of reporting a syntax error in the regular expression. POSIX.2 allows this behavior as an
extension, but portable scripts should avoid it.
Regular expressions are awesome!

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js