sed matching "$" literally without treating it as a regex metacharacter

I was trying to use $ in a sed -e command and it works, e.g.:
sed -e 's/world$/test/g' test.txt
The above command replaces "world" at the end of a line.
What confused me is that the following matched literally:
sed -e 's/${projects.version}/20.0/g' test.txt
The above command replaced ${projects.version}. I can't explain how sed matched the $ literally; I expected it to be treated as a special character.

As the POSIX spec says:
$
The <dollar-sign> shall be special when used as an anchor.
A <dollar-sign> ( '$' ) shall be an anchor when used as the last
character of an entire BRE. The implementation may treat a
<dollar-sign> as an anchor when used as the last character of a
subexpression. The <dollar-sign> shall anchor the expression (or
optionally subexpression) to the end of the string being matched; the
<dollar-sign> can be said to match the end-of-string following the
last character.
so when it's not at the end of a BRE, it's just a literal $ character.
For EREs the 2nd paragraph is a little different:
A <dollar-sign> ( '$' ) outside a bracket expression shall anchor the
expression or subexpression it ends to the end of a string; such an
expression or subexpression can match only a sequence ending at the
last character of a string. For example, the EREs "ef$" and "(ef$)"
match "ef" in the string "abcdef", but fail to match in the string
"cdefab", and the ERE "e$f" is valid, but can never match because the
'f' prevents the expression "e$" from matching ending at the last
character.
Note that last sentence - that means the $ is NOT treated literally in an ERE when not at the end of a regexp, it just can't match anything.
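The BRE/ERE difference quoted above can be checked with a small sketch. It assumes GNU grep (other implementations may differ where the standard leaves room), and the variable names are only illustrative:

```shell
# One input line made of the three characters e, $, f.
input='e$f'

# BRE (plain grep): "$" is not the last character, so it is an ordinary character.
bre_count=$(printf '%s\n' "$input" | grep -c 'e$f' || true)

# ERE (grep -E): "$" is an anchor wherever it appears, so "e$f" cannot match.
ere_count=$(printf '%s\n' "$input" | grep -E -c 'e$f' || true)

echo "BRE matching lines: $bre_count"
echo "ERE matching lines: $ere_count"
```

The `|| true` only keeps a non-matching grep (exit status 1) from aborting a script that runs under `set -e`.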
You should never have to worry about this, though. For clarity if nothing else, always escape any regexp metacharacter you want treated literally. So don't write:
's/$foo/bar/'
but write either of these instead:
's/\$foo/bar/'
's/[$]foo/bar/'
and then none of the semantics mentioned above matter.
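As a minimal sketch of the two recommended spellings, applied to the placeholder from the question (the sample line and variable names are made up for illustration):

```shell
# Sample line containing a literal ${projects.version} placeholder.
line='version=${projects.version}'

# Backslash-escaped dollar sign: literal regardless of its position.
escaped=$(printf '%s\n' "$line" | sed 's/\${projects\.version}/20.0/')

# Bracket expressions: [$] and [.] each match exactly one literal character.
bracketed=$(printf '%s\n' "$line" | sed 's/[$]{projects[.]version}/20.0/')

printf '%s\n' "$escaped" "$bracketed"
```

Both commands should print version=20.0. The dot is escaped as well so that it cannot match an arbitrary character.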
The rationale for the difference between the way $ is handled in BREs vs EREs in this context is explained at https://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_08, but basically it's just that the standards were written this way to accommodate the different historical behavior of the way people used $ in BREs vs EREs.
Thanks to #M.NejatAydin here on SO and #oguzismail in comp.unix.shell on usenet for helping clarify the rationale.

Related

Which characters combined with ^ don't need to be escaped in sed?

I have checked that ^* and ^& match lines beginning with * and &, which I didn't expect since they are special characters. But ^[ doesn't work. Is this "standard" behavior? Is there any rationale behind this?
sed version used was "GNU sed 4.4".
From POSIX.1-2017:
The sed utility shall support the BREs described in XBD Basic Regular Expressions, ... [sed]
Reading the POSIX section on BREs, we read:
A BRE special character has special properties in certain contexts. Outside those contexts, or when preceded by a <backslash>, such a character is a BRE that matches the special character itself. The BRE special characters and the contexts in which they have their special meaning are as follows:
.[\:
The <period>, <left-square-bracket>, and <backslash> shall be special except when used in a bracket expression (see RE Bracket Expression). An expression containing a '[' that is unescaped and is not part of a bracket expression produces undefined results.
*:
The <asterisk> shall be special except when used:
In a bracket expression
As the first character of an entire BRE (after an initial '^', if any)
As the first character of a subexpression (after an initial '^', if any); see BREs Matching Multiple Characters
^:
The <circumflex> shall be special when used as an anchor (see BRE Expression Anchoring). The <circumflex> shall signify a non-matching list expression when it occurs first in a list, immediately following a <left-square-bracket> (see RE Bracket Expression).
$:
The <dollar-sign> shall be special when used as an anchor.
source: Basic Regular Expressions, Special characters
So to answer the OP's question using the above:
& is not a special character, so ^& is expected to work
[ should always be escaped if it does not start a bracket expression.
* is not special after an initial ^ when the latter is an anchor.
So everything the OP observed is standard behavior.
There is however still an interesting paragraph in RE Bracket Expression:
A bracket expression is either a matching list expression or a non-matching list expression. It consists of one or more expressions: ordinary characters, collating elements, collating symbols, equivalence classes, character classes, or range expressions. The <right-square-bracket> ( ] ) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list (after an initial <circumflex> ( ^ ), if any). Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as [.].] ) or is the ending <right-square-bracket> for a collating symbol, equivalence class, or character class. The special characters ., *, [, and \ ( <period>, <asterisk>, <left-square-bracket>, and <backslash>, respectively) shall lose their special meaning within a bracket expression.
source: Basic Regular Expressions, RE Bracket Expression
This implies that ] cannot be escaped in a bracket expression. This means:
The following work:
$ echo '[]' | sed 's/[^]x]/a/'
a]
$ echo '[]' | sed 's/[^x[.].]]/a/'
a]
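but this does not work as expected, because the backslash is literal inside the brackets, so the pattern is the bracket expression [^x\] followed by a literal ]:
$ echo '[]' | sed 's/[^x\]]/a/'
a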
So in a bracket expression, don't escape it, but collate it!
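The three spellings can be compared side by side. This is a sketch assuming GNU sed (collating symbols such as [.].] are not supported by every implementation), with illustrative variable names:

```shell
# "]" placed first in the list (after the "^") is an ordinary list member.
first=$(printf '%s\n' '[]' | sed 's/[^]x]/a/')

# "[.].]" is a collating symbol that names the "]" character.
collated=$(printf '%s\n' '[]' | sed 's/[^x[.].]]/a/')

# Backslash is literal inside brackets, so this is "[^x\]" followed by a literal "]".
escaped=$(printf '%s\n' '[]' | sed 's/[^x\]]/a/')

printf '%s\n' "first:    $first" "collated: $collated" "escaped:  $escaped"
```

Only the first two replace just the opening bracket; the third pattern consumes the closing bracket as well.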
See sed "3.3 Overview of Regular Expression Syntax" documentation.
The & char is not a special regex char, so it does not need escaping in a regex pattern. Note that & is special in the replacement, where it refers to the whole match.
The * is not special when it is at the start in GNU sed (^* is a pattern that matches a * at the start of the string):
POSIX 1003.1-2001 says that * stands for itself when it appears at the start of a regular expression or subexpression, but many non-GNU implementations do not support this and portable scripts should instead use \* in these contexts.
The [ starts a bracket expression and must have a paired ] to close the expression, hence it is an error.
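A short sketch of these points, assuming GNU sed as in the question (sample inputs and variable names are invented for illustration):

```shell
# "^*": right after the anchor, "*" has nothing to repeat and matches itself.
star=$(printf '%s\n' '*item' | sed 's/^*/-/')

# "^&": "&" is never special on the pattern side.
amp=$(printf '%s\n' '&item' | sed 's/^&/-/')

# On the replacement side "&" is the whole match; "\&" is a literal ampersand.
whole=$(printf '%s\n' 'item' | sed 's/item/<&>/')
plain=$(printf '%s\n' 'item' | sed 's/item/\&/')

printf '%s\n' "$star" "$amp" "$whole" "$plain"
```

For scripts that must run on other sed implementations, writing ^\* instead of ^* follows the portability advice quoted above.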

Regular Expression to follow a specific pattern

I'm trying to make sure the input to my shell script follows the format Name_Major_Minor.extension
where Name is any number of digits/characters/"-" followed by "_"
Major is any number of digits followed by "_"
Minor is any number of digits followed by "."
and Extension is any number of characters followed by the end of the file name.
I'm fairly certain my regular expression is just messed up slightly. Any file I currently run through it evaluates to "yes", but if I add "[A-Z]$" instead of "*$" it always evaluates to "no". Regular expressions confuse the hell out of me, as you can probably tell.
if echo $1 | egrep -q [A-Z0-9-]+_[0-9]+_[0-9]+\.*$
then
echo "yes"
else
echo "nope"
exit
fi
edit: realized I am missing the pattern for "minor". Still doesn't work after adding it though.
Use =~ operator
Bash supports regular expression matching through its =~ operator, and there is no need for egrep in this particular case:
if [[ "$1" =~ ^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$ ]]
Errors in your regular expression
The \.*$ sequence in your regular expression means "zero or more dots". You probably meant "a dot and some characters after it", i.e. \..*$.
Your regular expression is anchored only at the end of the string ($), so it can match a suffix of the input. You likely want to match the whole string. To do that, also add the ^ anchor at the beginning of the pattern.
Escape the command line arguments
If you still want to use egrep, you should escape its arguments, as you should any command line arguments, to avoid reinterpretation of special characters by the shell. Better still, wrap the argument in single or double quotes, e.g.:
if echo "$1" | egrep -q '^[A-Za-z0-9-]+_[0-9]+_[0-9]+\..*$'
Use printf instead of echo
Don't use echo, as its behavior is unreliable: how it treats arguments that start with - or contain backslashes differs between shells. Use printf instead:
printf '%s\n' "$1"
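Try this regex instead: ^[A-Za-z0-9-]+(?:_[0-9]+){2}\..+$. Note that the non-capturing group (?:...) is Perl-style syntax; egrep and Bash's =~ use POSIX EREs, where the plain group (_[0-9]+){2} does the same job.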
[A-Za-z0-9-]+ matches Name
_[0-9]+ matches _ followed by one or more digits
(?:...){2} matches the group two times: _Major_Minor
\..+ matches a period followed by one or more characters
The problem in your regex seems to be at the end with \.*, which matches a period \. any number of times. Also, [A-Z0-9-] matches only uppercase letters (plus digits and hyphens), not lowercase ones, which might not be what you wanted.
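Putting the pieces together, here is a minimal sketch of the check as a shell function. The name is_valid_name and the sample file names are hypothetical; since POSIX EREs have no (?:...) syntax, a plain group is used with grep -E:

```shell
# Hypothetical helper: succeeds when $1 looks like Name_Major_Minor.extension.
is_valid_name() {
    # ^ and $ force the whole name to match; the group repeats "_digits" twice.
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9-]+(_[0-9]+){2}\..+$'
}

for name in 'my-app_1_20.tar' 'my-app_1.tar' 'my-app_1_20'; do
    if is_valid_name "$name"; then
        echo "$name: yes"
    else
        echo "$name: nope"
    fi
done
```

Only the first name has both a major and a minor number followed by an extension, so it is the only one reported as "yes".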

End of line char ($) doesn't work inside square brackets

Putting $ inside square brackets doesn't work for grep.
~ $ echo -e "hello\nthere" > example.txt
~ $ grep "hello$" example.txt
hello
~ $ grep "hello[$]" example.txt
~ $
Is this a bug in grep or am I doing something wrong?
That's what it's supposed to do.
[$]
...defines a character class that matches one character, $.
Thus, this would match a line containing hello$.
See the POSIX RE Bracket Expression definition for the formal specification requiring that this be so. Quoting from that full definition:
A bracket expression (an expression enclosed in square brackets, "[]" ) is an RE that shall match a single collating element contained in the non-empty set of collating elements represented by the bracket expression.
Thus, any bracket expression matches a single element.
Moreover, in the BRE Anchoring Expression definition:
A dollar sign ( '$' ) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a dollar sign as an anchor when used as the last character of a subexpression. The dollar sign shall anchor the expression (or optionally subexpression) to the end of the string being matched; the dollar sign can be said to match the end-of-string following the last character.
Thus, in a BRE (the regexp format grep uses by default when given no options), a $ that is not at the end of the expression is not required to be recognized as an anchor.
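To see the difference between the anchor and the bracket expression, a small sketch with an extra input line that really ends in a dollar sign (the sample data and variable names are made up):

```shell
# Three sample lines; the second one ends with a literal "$".
sample=$(printf '%s\n' 'hello' 'hello$' 'there')

# "$" as the last character of the BRE anchors the match to the end of the line.
anchored=$(printf '%s\n' "$sample" | grep 'hello$')

# "[$]" matches exactly one literal "$" character.
literal=$(printf '%s\n' "$sample" | grep 'hello[$]')

echo "anchored: $anchored"
echo "literal:  $literal"
```

The patterns are single-quoted so that the shell leaves the $ alone before grep sees it.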
If you're trying to match end of line characters or the end of the string, you can use (|) like so "ABC($|\n)".
You can, however, use $ in a parenthesis grouping, which facilitates the use of | (or), which can accomplish the same idea as a square bracket group.
Something like the following might be of interest to you:
~ $ cat example.txt
hello
there
helloa
hellob
helloc
~ $ grep "hello\($\|[ab]\)" example.txt
hello
helloa
hellob

What do these qr{} regular expressions mean?

What do these mean?
qr{^\Q$1\E[a-zA-Z0-9_\-]*\Q$2\E$}i
qr{^[a-zA-Z0-9_\-]*\Q$1\E$}i
If $pattern is a Perl regular expression, what is $identity in the code below?
$identity =~ $pattern;
When the RHS of =~ isn't m//, s/// or tr///, a match operator (m//) is implied.
$identity =~ $pattern;
is the same as
$identity =~ /$pattern/;
It matches the pattern or pre-compiled regex $pattern (qr//) against the value of $identity.
The binding operator =~ applies a regex to a string variable. This is documented in perldoc perlop.
The \Q ... \E escape sequence is a way to quote meta characters (also documented in perlop). It allows for variable interpolation, though, which is why you can use it here with $1 and $2. However, using those variables inside a regex is somewhat iffy, because they themselves are defined during the use of a capture inside a regex.
The character class bracket [ ... ] defines a range of characters which it will match. The quantifier that follows it * means that particular bracket must match zero or more times. The dashes denote ranges, such as a-z meaning "from a through z". The escaped dash \- means a literal dash.
The ^ and $ (the dollar sign at the end) denote anchors, the beginning and end of the string respectively. The modifier i at the end means the match is case insensitive.
In your example, $identity is a variable that presumably contains a string (or whatever it contains will be converted to a string).
The perlre documentation is your friend here. Search it for unfamiliar regex constructs.
A detailed explanation is below, but it is so hairy that I wonder whether using a module such as Text::Balanced would be a superior approach.
The first pattern matches possibly empty delimited strings, and the delimiters are in $1 and $2, which we do not know until runtime. Say $1 is ( and $2 is ), then the first pattern matches strings of the form
()
(a)
(9)
(abcABC_012-)
and so on …
The second pattern matches terminated strings, where the terminator is in $1—also not known until runtime. Assuming the terminator is ], then the second pattern matches strings of the form
]
a]
Aa9a_9]
Using \Q...\E around a pattern removes any special regex meaning from the characters inside, as documented in perlop:
For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed. This allows the pattern to match literally (except for $ and #). For example, the following matches:
'\s\t' =~ /\Q\s\t/
Because $ or # trigger interpolation, you'll need to use something like /\Quser\E\#\Qhost/ to match them literally.
The patterns in your question do want to trigger interpolation but do not want any regex metacharacters to have special meaning, as with parentheses and square brackets above that are meant to match literally.
Other parts:
Square brackets delimit a character class. For example, [a-zA-Z0-9_\-] matches any single character that is upper- or lowercase A through Z (but with no accents or other extras), zero through nine, underscore, or hyphen. Note that the hyphen at the end is escaped to emphasize that it matches a literal hyphen rather than specifying part of a range.
The * quantifier means match zero or more of the preceding subpattern. In the examples from your question, the star repeats character classes.
The patterns are bracketed with ^ and $, which means an entire string must match rather than some substring to succeed.
The i at the end, after the closing curly brace, is a regex switch that makes the pattern case-insensitive. As TLP helpfully points out in the comment below, this makes the delimiters or terminators match without regard to case if they contain letters.
The expression $identity =~ $pattern tests whether the compiled regex stored in $pattern (created with $pattern = qr{...}) matches the text in $identity. As written above, it is likely being evaluated for its side effect of storing capture groups in $1, $2, etc. This is a red flag. Never use $1 and friends unconditionally but instead write
if ($identity =~ $pattern) {
print $1, "\n"; # for example
}

special characters in sed

Does anybody know what the complete list of special characters in sed is?
Please don't give an answer like, it is the same list of special characters for grep, because that just transforms my question to: Does anybody know what the complete list of special characters in grep are?
It depends. Strictly speaking, a standard compliant sed must only use Basic Regular Expressions for which the standard states:
The BRE special characters and the contexts in which they have their special meaning are as follows:
.[\ The period, left-square-bracket, and backslash shall be special except when used in a bracket expression (see RE Bracket Expression ). An expression containing a '[' that is not preceded by a backslash and is not part of a bracket expression produces undefined results.
* The asterisk shall be special except when used in a bracket expression, as the first character of an entire BRE (after an initial '^' , if any), or as the first character of a subexpression (after an initial '^' , if any); see BREs Matching Multiple Characters
^ The circumflex shall be special when used as an anchor (see BRE Expression Anchoring ) or as the first character of a bracket expression (see RE Bracket Expression )
$ The dollar-sign shall be special when used as an anchor.
So the complete list is .[\*^$, but context matters. Also, many sed implementations provide an option to use extended regular expressions (EREs), which expands the list and changes the contexts in which characters are special. For example, without EREs groupings are formed using \( and \), but with EREs ( and ) by themselves are special and must be escaped to be matched literally.
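The grouping difference can be sketched as follows, assuming a sed that accepts the -E option for EREs (GNU and BSD sed do); the input string and variable names are illustrative:

```shell
# BRE: bare parentheses are ordinary characters.
bre_literal=$(printf '%s\n' 'f(x)' | sed 's/(x)/[x]/')

# ERE: parentheses must be escaped to match literally.
ere_literal=$(printf '%s\n' 'f(x)' | sed -E 's/\(x\)/[x]/')

# ERE: bare parentheses form a group, so only the "x" is matched and captured.
ere_group=$(printf '%s\n' 'f(x)' | sed -E 's/(x)/<\1>/')

printf '%s\n' "$bre_literal" "$ere_literal" "$ere_group"
```

The first two commands replace the parenthesized text as a whole, while the third leaves the literal parentheses in place around the captured x.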
I think this is the full list of characters that sed treats differently from ordinary characters: [\^$.|?*+()