when to escape special character in shell - regex

guys:
It is hard for me to judge when to escape special characters in shell, and which characters should be escaped. For example:
sed '/[0-9]\{3\}/d' filename.txt
In the command above, why should we escape { while leaving [ unchanged? I think they are both special characters.
Can you help me with this?
/br
ruan

The general answer is that you need to escape characters that have special meaning when you want to treat them as literal characters, not for their special meaning. The rules for what characters have special meaning vary from program to program.
Your specific question involves characters that have special meaning to sed; single quotes prevent any enclosed characters from being interpreted by bash.
In this case, the backslashes before { and } control how sed interprets the braces, and which way that goes depends on the regex dialect sed is using. First, consider this command:
sed '/[0-9]{3}/d' filename.txt
If you are using sed in a mode that treats both [ and { specially (extended regular expressions, e.g. sed -E), this command says to delete any line which contains at least 3 digits in a row. The [0-9] is not a literal 5-character string; it's a bracket expression that matches any single digit. The {3} isn't a literal 3-character string; it's a modifier that matches exactly 3 repetitions of the preceding expression. Lines like the following will be matched:
593
3296
but not
34a7
because there aren't 3 digits in a row.
Now, consider your command:
sed '/[0-9]\{3\}/d' filename.txt
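The [0-9] is still a bracket expression that matches a single digit. But now the braces are escaped. In that same extended mode, \{3\} is no longer a modifier for the preceding expression; sed treats it as the literal characters {, 3, and }. (In sed's default basic mode, without -E or -r, the roles are reversed: \{3\} is the repetition modifier and {3} is literal.) In extended mode the escaped pattern will therefore match lines like the following: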
0{3}
1{3}
5{3}
but not lines like
346
because there are no braces.
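A quick way to check which reading applies on a given system is to run both spellings against a small sample. The sketch below assumes a sed whose default dialect is basic regular expressions (BRE) and which accepts -E for extended ones, as current GNU and BSD sed do; the function names and sample lines are only illustrative.

```shell
# Sample input: two digit runs, one interrupted run, one line with literal braces.
sample() { printf '%s\n' '593' '3296' '34a7' '5{3}'; }

# Default (BRE) mode: \{3\} is a repetition count, {3} is literal text.
bre_escaped() { sample | sed '/[0-9]\{3\}/d'; }    # deletes 593 and 3296
bre_plain()   { sample | sed '/[0-9]{3}/d'; }      # deletes only 5{3}

# Extended (ERE) mode with -E: the roles are swapped.
ere_plain()   { sample | sed -E '/[0-9]{3}/d'; }   # deletes 593 and 3296
ere_escaped() { sample | sed -E '/[0-9]\{3\}/d'; } # deletes only 5{3}

bre_escaped   # prints the lines that survive: 34a7 and 5{3}
```

If bre_escaped and ere_plain print the same lines, the sed in use follows the usual BRE/ERE split.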

This difference in behavior is specific to sed's regex modes.
In its default mode sed uses basic regular expressions (BRE), where { is matched literally unless it is escaped, as you noticed:
sed '/[0-9]\{3\}/d'
In extended regex mode both [ and { don't need escaping:
sed -r '/[0-9]{3}/d'
Or on OS X (GNU sed accepts -E as well):
sed -E '/[0-9]{3}/d'
[ and ] delimit a bracket expression (character class) in both basic and extended regex modes (even the shell's glob patterns support it).
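To illustrate the remark about globs: the same [0-9] notation works in a shell pattern with no sed involved. This is a minimal sketch; starts_with_digit is a made-up helper name.

```shell
# Classify a string with a shell glob. Here [0-9] is a glob bracket
# expression handled by the shell itself, and * means "any string".
starts_with_digit() {
  case $1 in
    [0-9]*) echo yes ;;
    *)      echo no ;;
  esac
}

starts_with_digit 42abc   # prints: yes
starts_with_digit abc42   # prints: no
```

Note that the glob * is not the regex *: in a glob it stands for any string on its own, not for a repetition of the preceding item.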

I think your question pertains to special characters in regular expressions. Check this out:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03

It mainly depends on the sed version (POSIX-compliant or extended behavior), and then you need to adapt to the shell because, as you say, the shell modifies the script before sed receives it. The best examples are the choice between single and double quotes at the shell level, and \( versus ( at the sed level.
so:
define the pattern (reg ex) you want
adapt for the sed version/option you are using
adapt for shell interpretation
As an exercise, try writing the sed command that substitutes \{ with &/$IFS (the literal text, not the value of IFS), with the sed script enclosed in double quotes in bash/ksh, for POSIX or GNU sed.
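The three steps can be sketched on a much simpler substitution than the exercise (this is not a solution to it). The sketch assumes a sed that accepts -E; the variable word and the function names are invented for illustration.

```shell
word=bar   # hypothetical shell variable to splice into the pattern

# Steps 1 and 2, BRE dialect: groups are written \( \). Single quotes make
# the shell pass the script to sed untouched.
bre_single() { printf 'foo bar\n' | sed 's/\(foo\) \(bar\)/\2 \1/'; }

# Step 2, ERE dialect via -E: groups are plain ( ).
ere_single() { printf 'foo bar\n' | sed -E 's/(foo) (bar)/\2 \1/'; }

# Step 3, double quotes: the shell expands $word first. \( and \2 still
# reach sed unchanged, because inside double quotes a backslash is only
# consumed before $, `, ", \ and newline.
bre_double() { printf 'foo bar\n' | sed "s/\(foo\) \($word\)/\2 \1/"; }

bre_single; ere_single; bre_double   # each prints: bar foo
```

Once the script is in double quotes, any literal $ or \ meant for sed has to be protected from the shell first, which is where the doubling of backslashes comes from.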

Related

How to comment a include line using sed [duplicate]

I am using sed in a shell script to edit filesystem path names. Suppose I want to replace
/foo/bar
with
/baz/qux
However, sed's s/// command uses the forward slash / as the delimiter. If I do that, I see an error message emitted, like:
▶ sed 's//foo/bar//baz/qux//' FILE
sed: 1: "s//foo/bar//baz/qux//": bad flag in substitute command: 'b'
Similarly, sometimes I want to select line ranges, such as the lines between a pattern foo/bar and baz/qux. Again, I can't do this:
▶ sed '/foo/bar/,/baz/qux/d' FILE
sed: 1: "/foo/bar/,/baz/qux/d": undefined label 'ar/,/baz/qux/d'
What can I do?
You can use an alternative regex delimiter as a search pattern by backslashing it:
sed '\,some/path,d'
And just use it as is for the s command:
sed 's,some/path,other/path,'
You probably want to protect other metacharacters, though; this is a good place to use Perl and quotemeta, or equivalents in other scripting languages.
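Applied to the paths from the question, that could look like the sketch below. The delimiters , and # are arbitrary choices, and the sample lines and function names are made up.

```shell
sample_lines() { printf '%s\n' 'keep 1' 'start foo/bar' 'inside' 'end baz/qux' 'keep 2'; }

# s command with , as the delimiter: slashes in the paths need no escaping.
swap_path() { printf '%s\n' 'cd /foo/bar' | sed 's,/foo/bar,/baz/qux,'; }

# Address range with # as the regex delimiter: each address starts with \#.
drop_range() { sample_lines | sed '\#foo/bar#,\#baz/qux#d'; }

swap_path    # prints: cd /baz/qux
drop_range   # prints: keep 1, then keep 2
```

Any delimiter works as long as it does not occur in the pattern or replacement itself.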
From man sed:
/regexp/
Match lines matching the regular expression regexp.
\cregexpc
Match lines matching the regular expression regexp. The c may be any character other than backslash or newline.
s/regular expression/replacement/flags
Substitute the replacement string for the first instance of the regular expression in the pattern space. Any character other than backslash or newline can be used instead of a slash to delimit the RE and the replacement. Within the RE and the replacement, the RE delimiter itself can be used as a literal character if it is preceded by a backslash.
Perhaps the closest to a standard, the POSIX/IEEE Open Group Base Specification says:
[2addr] s/BRE/replacement/flags
Substitute the replacement string for instances of the BRE in the
pattern space. Any character other than backslash or newline can
be used instead of a slash to delimit the BRE and the replacement.
Within the BRE and the replacement, the BRE delimiter itself can be
used as a literal character if it is preceded by a backslash.
When there is a slash / in the original string or the replacement string, we need to escape it with \. The following command works on Ubuntu 16.04 (sed 4.2.2).
sed 's/\/foo\/bar/\/baz\/qux/' file
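When the paths live in shell variables, the backslashes can be added programmatically before the values are spliced into the script. This is only a sketch: escape_slashes is a hypothetical helper, and it handles / alone, not other metacharacters such as ., *, [ or &.

```shell
# Hypothetical helper: put a backslash before every / in its argument.
escape_slashes() { printf '%s\n' "$1" | sed 's,/,\\/,g'; }

from=$(escape_slashes '/foo/bar')   # \/foo\/bar
to=$(escape_slashes '/baz/qux')     # \/baz\/qux

# Double quotes let the shell insert the escaped paths into the sed script.
rewrite() { printf '%s\n' 'cd /foo/bar' | sed "s/$from/$to/"; }

rewrite   # prints: cd /baz/qux
```

Picking a delimiter that cannot occur in the paths avoids the helper entirely, which is often the simpler choice.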

Recursively wrapping a regular expression with given text

For a given path, I wish to wrap a given regular expression in all files in that path or that path's sub-directories with some given text using standard Linux shell commands.
More specifically, wrap all my syslog commands with an assert command such as syslog(LOG_INFO,json_encode($obj)); becomes assert(syslog(LOG_INFO,json_encode($obj)));.
I thought the following might work, but received sed: -e expression #1, char 47: Invalid preceding regular expression error.
sed -i -E "s/(?<=syslog\()(.*)(?=\);)/assert(syslog(\1));/" /path/to/somewhere
BACKUP INFO IN RESPONSE TO Wiktor Stribiżew's ANSWER
I've never used sed before. Please confirm my understanding of your answer:
sed -i "s/syslog(\(.*\));/assert(syslog(\1));/g" /path/to/somewhere
-i edits files in place. You can leave it out first to see on screen what would be changed.
s substitute text
The three /'s surrounding the pattern and replacement (i.e. /pattern/replacement/) are delimiters and can be any single character, not just /.
syslog(\(.*\)); The pattern with one placeholder. Uses escaped parentheses.
assert(syslog(\1)); The replacement using escaped 1 (or 2, 3, etc) for replacement sub-strings.
g Replace all and not just the first match.
Would sed -i "s/syslog(.*);/assert(&);/g" /path/to/somewhere work as well?
sed patterns do not support lookarounds like (?<=...) and (?=...).
You may use a capturing group/replacement backreference:
sed -i "s/syslog(\(.*\));/assert(syslog(\1));/g" /path/to/somewhere
The pattern is of BRE POSIX flavor (no -E option is passed), so to define a capturing group you need to use escaped parentheses, and unescaped ones will match literal parentheses.
Details
syslog( - syslog( substring
\(.*\) - Group 1: any 0+ chars as many as possible
); - a ); substring
The replacement is assert(syslog(\1));, that is, the match is replaced with assert(syslog(, the contents of Group 1, and then ));.
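Before touching real files with -i, the substitution can be tried on a single line through a pipe. The sketch below does that with the line from the question, and also runs the & variant the question asks about; the function names are invented.

```shell
line='syslog(LOG_INFO,json_encode($obj));'

# BRE: \( \) captures, bare ( and ) match literal parentheses.
wrap_group() { printf '%s\n' "$line" | sed 's/syslog(\(.*\));/assert(syslog(\1));/g'; }

# & stands for the whole match, which here includes the trailing ";".
wrap_amp() { printf '%s\n' "$line" | sed 's/syslog(.*);/assert(&);/g'; }

wrap_group   # prints: assert(syslog(LOG_INFO,json_encode($obj)));
wrap_amp     # prints: assert(syslog(LOG_INFO,json_encode($obj)););
```

So the & version is close but leaves a stray ; inside the assert call, because the semicolon is part of what was matched.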
If you need Perl-compatible regex constructs, you can use Perl (sic).
perl -i -pe 's/(?<=syslog\()(.*)(?=\);)/assert(syslog($1));/' /path/to/somewhere
Regardless of this specific solution I switched to single quotes on the assumption that you are on a Unix-ish platform. Backslashes inside double quotes are pesky (sometimes you need to double them, sometimes not).
Perl prefers $1 over \1 in the replacement pattern, though the latter will also technically work.

Can anyone provide the regular expression for this: "datetime": "2014-11-28T00:00:00.000Z",

I need to search for and replace this complete line in a text file.
"datetime": "2014-11-28T00:00:00.000Z",
Where the date string can vary.
Trying different regex's but to no avail. I've tried:
"datetime": "[A-Z0-9:.]*",
Super simple solution:
"datetime": "[^"]+"
[^"] means "match any character that is not a quotation mark" and the + means it must match one or more of them. Note that + is extended regex syntax (you must use grep -E or egrep, since standard grep may not know it; the same goes for sed, so use sed -E on the command line).
Of course there is no syntax check here. That regex will also match:
"datetime": "banana"
If you need syntax verification as well, the regex would be:
"datetime": "[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[.][0-9]{3}Z"
{x} means "repeat x times" and [0-9] means "any digit". All other characters (-, T, Z) match themselves.
Some people may wonder why [.] near the end. Well, . actually means any character, and we don't want to match any character there but only a period. I could have written \. instead, but when used in the shell or within scripts, multiple backslashes may be required to get the correct escaping level (e.g. within double quotes it may need to be \\. and so on) and I don't like that; it's ugly and error-prone. Instead I put it into a character class, because within a character class a period is an ordinary character and needs no escaping.
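Both patterns can be checked against the two sample lines with grep -E. This is a rough sketch; check is a made-up helper that just reports whether a line matches.

```shell
# Loose and strict patterns from above, kept in single quotes so the shell
# leaves them alone. [.] matches a literal period without any backslash.
loose='"datetime": "[^"]+"'
strict='"datetime": "[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}[.][0-9]{3}Z"'

# check PATTERN LINE: report whether LINE matches PATTERN as an ERE.
check() { printf '%s\n' "$2" | grep -qE "$1" && echo match || echo 'no match'; }

check "$loose"  '"datetime": "banana",'                     # match
check "$strict" '"datetime": "banana",'                     # no match
check "$strict" '"datetime": "2014-11-28T00:00:00.000Z",'   # match
```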
A simple regex like
^"datetime":.*$
may help
$ echo '"datetime": "2014-11-28T00:00:00.000Z",' | sed 's/^"datetime":.*$/replaced/'
replaced

How do I write a SED regex to extract a string delimited by another string?

I am using GNU sed version 4.2.1 and I am trying to write a non-greedy sed regex to extract a string that is delimited by two other strings. This is easy when the delimiting strings are single-character:
s:{\([^}]*\)}:\1:g
In that example the string is delimited by '{' on the left and '}' on the right.
If the delimiting strings are multiple characters, say '{{{' and '}}}' I can adjust the above expression like this:
s:{{{\([^}}}]*\)}}}:\1:g
so the centre expression matches anything not containing the '}}}' closing string. But this only works if the match string does not contain '}' at all. Something like:
{{{cannot match {this broken} example}}}
will not work but
{{{can match this example}}}
does work. Of course
s:{{{\(.*\)}}}:\1:g
always works but is greedy so isn't suitable where multiple patterns occur on the same line.
I understand [^a] to mean anything except a and [^ab] to mean anything except a or b so, despite it appearing to work, I don't think [^}}}] is the correct way to exclude that sequence of 3 consecutive characters.
So how do I write a regex for sed that matches a string that is delimited by two other strings?
You are correct that [^}}}] doesn't work. A negated character class matches anything that is not one of the characters inside it. Repeating characters doesn't change the logic. So what you wrote is the same as [^}]. (It is easy to see why this works when there are no braces inside the expression).
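This is easy to confirm by running both spellings on the two example lines from the question. A small sketch, with invented function names:

```shell
ok='{{{can match this example}}}'
broken='{{{cannot match {this broken} example}}}'

# The expression from the question, and the same thing with a single }.
with_triple() { printf '%s\n' "$1" | sed 's:{{{\([^}}}]*\)}}}:\1:g'; }
with_single() { printf '%s\n' "$1" | sed 's:{{{\([^}]*\)}}}:\1:g'; }

with_triple "$ok"       # prints: can match this example
with_single "$ok"       # prints: can match this example
with_triple "$broken"   # line is printed unchanged, no match
with_single "$broken"   # line is printed unchanged, no match
```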
In Perl and compatible regular expressions, you can use ? to make a * or + non-greedy:
s:{{{(.*?)}}}:$1:g
This will always match the first }}} after the opening {{{.
However, non-greedy quantifiers are not available in sed. In fact, I don't think there is any direct way in sed of doing this match. The only other way to do this is to use advanced features like look-ahead, which sed also does not have.
You can easily use Perl in a sed-like fashion with the -pe options, which cause it to take a single line of code from the command line (-e) and automatically loop over each line and print the result (-p).
perl -pe 's:{{{(.*?)}}}:$1:g'
The -i option for in-place editing of files is also useful, but make sure your regex is correct first!
For more information see perlrun.
With sed you could do something like:
sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/ ; ta'
With:
{{{can match this example}}} {{{can match this 2nd example}}}
This gives:
can match this example can match this 2nd example
It is not lazy matching, but by replacing from right to left we can make use of sed's greediness.
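Wrapped in a function, the loop can also be tried on the example with inner braces from the question. This is a sketch: strip_markers is a made-up name, and the script is split across several -e options only to keep the label and branch on their own.

```shell
# Remove one {{{...}}} pair per pass, rightmost first, and loop (ta)
# for as long as a substitution was made.
strip_markers() { sed -e :a -e 's/\(.*\){{{\(.*\)}}}/\1\2/' -e ta; }

printf '%s\n' '{{{can match this example}}} {{{can match this 2nd example}}}' | strip_markers
# prints: can match this example can match this 2nd example

printf '%s\n' '{{{cannot match {this broken} example}}} and {{{second}}}' | strip_markers
# prints: cannot match {this broken} example and second
```

The first \(.*\) swallows everything up to the last {{{ on the line, so each pass only ever sees one pair of markers.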

sed: Can my pattern contain an "is not" character? How do I say "is not X"?

How do I say "is not" a certain character in sed?
[^x]
This is a character class that accepts any character except x.
For those not satisfied with the selected answer as per johnny's comment.
'su[^x]' will match 'sum' and 'sun' but not 'su'.
You can tell sed to not match lines with x using the syntax below:
sed '/x/! s/su//' file
See kkeller's answer for another example.
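Both forms can be compared on a few words. A minimal sketch; the word list and function names are only for illustration.

```shell
words() { printf '%s\n' sum sun su sux; }

# [^x] needs a character to be present, so "su" on its own does not match.
match_not_x() { words | grep 'su[^x]'; }

# /x/! applies the s command only to lines that do NOT contain x.
strip_unless_x() { words | sed '/x/! s/su//'; }

match_not_x      # prints: sum, sun
strip_unless_x   # prints: m, n, an empty line, sux
```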
There are two possible interpretations of your question. Like others have already pointed out, [^x] matches a single character which is not x. But an empty string also isn't x, so perhaps you are looking for [^x]\|^$.
Neither of these answers extend to multi-character sequences, which is usually what people are looking for. You could painstakingly build something like
[^s]\|s\($\|[^t]\|t\($\|[^r]\)\)
to compose a regular expression which doesn't match str, but a much more straightforward solution in sed is to delete any line which does match str, then keep the rest;
sed '/str/d' file
Perl 5 introduced a much richer regex engine, whose dialect has since become standard in Java, PHP, Python, etc. Because Perl helpfully supports a subset of sed syntax, you could probably convert a simple sed script to Perl to get to use a useful feature from this extended regex dialect, such as negative assertions:
perl -pe 's/(?:(?!str).)+/not/' file
will replace a string which is not str with not. The (?:...) is a non-capturing group (unlike in many sed dialects, an unescaped parenthesis is a metacharacter in Perl) and (?!str) is a negative assertion; the text immediately after this position in the string mustn't be str in order for the regex to match. The + repeats this pattern until it fails to match. Notice how the assertion needs to be true at every position in the match, so we match one character at a time with . (newbies often get this wrong, and erroneously only assert at e.g. the beginning of a longer pattern, which could however match str somewhere within, leading to a "leak").
From my own experience, and the post below supports this, sed has no direct way to negate a multi-character substring; the "^" negation only works inside a bracket expression such as [^x]. If you check the post below, you'll see some workarounds.
Sed regex and substring negation
In addition to all the provided answers, you can negate a named character class in sed using the notation [^[:CLASS:]]. For example, [^[:blank:]] will match any character that is not a blank (space or tab).
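A short sketch of that notation, replacing every non-blank character with X (mask_nonblank is an invented name):

```shell
# Input is: a, space, b, tab, c. Everything except the space and the tab
# is replaced, so the blanks stay where they were.
mask_nonblank() { printf 'a b\tc\n' | sed 's/[^[:blank:]]/X/g'; }

mask_nonblank   # prints: X, space, X, tab, X
```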