Why does this sed command output "[18" instead of "18"? - regex

echo [18%] | sed s:[\[%\]]::g
I'm really confused by this, because the same exact pattern successfully replaces [18%] in vim. I've also tested the expression in a few online regex tools and they all say that it will match on the [, %, and ] as intended. I have tried adding the -r option as well as surrounding the substitution command in quotes.
I know that there are other commands that I could use to accomplish this task, but I want to know why it is behaving this way so I can get a better understanding of sed.

$ echo [18%] | sed s:[][%]::g
18
sed supports POSIX.2 regular expression syntax: basic (BRE) syntax by default, extended syntax with the -r flag. In POSIX.2 syntax, basic or extended, you include a right square bracket by making it the first character in the character class. Backslashes do not help.
This is annoying because almost every other modern language and tool uses Perl or Perl-like regex syntax. POSIX syntax is an anachronism.
You can read about the POSIX.2 syntax in the regex(7) man page.
A bracket expression is a list of characters enclosed in "[]". It normally
matches any single character from the list (but see below). If the list begins
with '^', it matches any single character (but see below) not from the rest of
the list. If two characters in the list are separated by '-', this is shorthand
for the full range of characters between those two (inclusive) in the collating
sequence, for example, "[0-9]" in ASCII matches any decimal digit. It is illegal(!)
for two ranges to share an endpoint, for example, "a-c-e". Ranges are
very collating-sequence-dependent, and portable programs should avoid relying on
them.
To include a literal ']' in the list, make it the first character (following a
possible '^'). To include a literal '-', make it the first or last character, or
the second endpoint of a range. To use a literal '-' as the first endpoint of a
range, enclose it in "[." and ".]" to make it a collating element (see below).
With the exception of these and some combinations using '[' (see next paragraphs),
all other special characters, including '\', lose their special significance
within a bracket expression.
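The difference between the two commands is easy to reproduce. A minimal sketch, assuming a POSIX-style sed; the variable names are only for illustration:

```shell
# Backslash is an ordinary character inside a bracket expression, so
# [\[%\]] parses as the class [\[%\] (backslash, '[' or '%') followed by
# a literal ']'. Only the "%]" pair matches and gets deleted.
broken=$(printf '%s\n' '[18%]' | sed 's:[\[%\]]::g')

# Putting ']' first makes it a member of the class: ']', '[' or '%'.
fixed=$(printf '%s\n' '[18%]' | sed 's:[][%]::g')

printf '%s\n' "$broken" "$fixed"
```

This should print [18 and then 18, matching the behaviour described in the question.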

Related

Can someone break down this regular expression?

While looking for a way to format 'ifconfig' output and display only the network interfaces names, I found a regular expression that worked like a charm for OS X.
ifconfig -a | sed -E 's/[[:space:]:].*//;/^$/d'
How can I break down this regular expression so I can understand it?
Here is the sed command
s/[[:space:]:].*//;/^$/d
There is a semicolon in the middle, so it's actually two commands:
s/[[:space:]:].*//
/^$/d
First command is a substitution. What to substitute? It's between the 1st 2 slashes.
[[:space:]:].*
A bracket expression [] matching any kind of whitespace or a colon, followed by zero or more (*) of any character (.). This matches everything in a line from the first whitespace or colon onward.
Substitute with what? Between the 2nd two slashes: s/...//: Nothing. The matched strings are deleted from each line.
This leaves the interface names which start their lines, the other lines remain too, but they are empty, as they start with whitespace.
How to remove these empty lines? That's the second command:
/^$/d
Find empty lines that match regex with nothing between start of line ^ and end of line $. Then delete them with command d.
All that's left are the interface names.
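The two steps can be tried without a real ifconfig. A minimal sketch, assuming a sed that accepts -E (GNU and BSD sed both do); the sample text is made up to resemble ifconfig output:

```shell
# Hypothetical ifconfig-style output: interface lines start at column 0,
# detail lines start with a tab.
sample=$(printf 'lo0: flags=8049 mtu 16384\n\tinet 127.0.0.1 netmask 0xff000000\nen0: flags=8863 mtu 1500\n\tether 00:00:00:00:00:00\n')

# First command only: detail lines are reduced to empty lines.
# grep -c counts how many empty lines are left behind.
empty=$(printf '%s\n' "$sample" | sed -E 's/[[:space:]:].*//' | grep -c '^$')

# Both commands: the empty lines are deleted as well.
names=$(printf '%s\n' "$sample" | sed -E 's/[[:space:]:].*//;/^$/d')

printf 'empty lines after step 1: %s\n' "$empty"
printf '%s\n' "$names"
```

With this sample, only lo0 and en0 should remain after the second command.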
This is more a sequence of commands than it is a regular expression, but I suppose breaking the sequence down may be instructive.
Read the manpage on ifconfig to find this
Optionally, the -a flag may be used instead of an interface name. This
flag instructs ifconfig to display information about all interfaces in
the system. The -d flag limits this to interfaces that are down, and
-u limits this to interfaces that are up. When no arguments are given,
-a is implied.
That's one part done. The pipe (|) sends what ifconfig would normally print to the standard output to the standard input of sed instead.
You're passing sed the option -E. Again, man sed is your friend and tells you that this option means
Interpret regular expressions as extended (modern) regular
expressions rather than basic regular expressions (BRE's). The
re_format(7) manual page fully describes both formats.
This isn't all you need though... The first string that you're giving sed lets it know which operation to perform.
Search the same manual for the word "substitute" to reach this
paragraph:
[2addr]s/regular expression/replacement/flags
Substitute the replacement string for the first instance of
the regular expression in the pattern space. Any character other than
backslash or newline can be used instead of a slash to delimit the RE
and the replacement. Within the RE and the replacement, the RE
delimiter itself can be used as a literal character if it is preceded
by a backslash.
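The delimiter rule quoted above is what makes the s:...: form in other sed commands legal. A small sketch, using a made-up path, of the same substitution written with escaped slashes and with | as the delimiter:

```shell
# With / as the delimiter, every literal slash must be escaped.
escaped=$(printf '%s\n' '/usr/local/bin' | sed 's/\/usr\/local/\/opt/')

# With | as the delimiter, the slashes need no escaping.
piped=$(printf '%s\n' '/usr/local/bin' | sed 's|/usr/local|/opt|')

printf '%s\n' "$escaped" "$piped"
```

Both forms should produce the same result, /opt/bin.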
Now we can run man 7 re_format to decode the first command s/[[:space:]:].*// which means "for each line passed to standard input, substitute the part matching the extended regular expression [[:space:]:].* with the empty string"
[[:space:]:] = match either a : or any character in the character class [:space:]
.* = match any character (.), zero or more times (*)
To understand the second command look for the [2addr]d part of the sed manual page.
[2addr]d
Delete the pattern space and start the next cycle.
Let's then look at the next command /^$/d which says "for each line passed to standard input, delete it if it corresponds to the extended regex ^$"
^$ = a line that contains no characters between its start (^) and its end ($)
We've discussed how to start with man pages and follow the clues to "decode" commands you see in everyday life.
Thanks Benjamin and Xufox for the resources. After taking a look, this is my conclusion:
s/[[:space:]:].*//;
[[:space:]:] this will search for spaces and/or : and begin the execution of the command, and this and anything that comes afterwards(hence the '.*') will be substituted by nothing (because the next thing is //, which in between should be what we would want to substitute for, which in this case is nothing.).
;
marks the end of the first command
and then we have
/^$/d
where ^$ means search for all empty spaces and d to delete them.
This is half wrong. Take a look at the other answer which gives you the complete and correct response! Thanks guys.

Awk indicating the first character not to be #

Is there a way to specify the first character not to be something?
There are many ways to limit what it can be but I don't recall a way to say what it can't be.
for example if ! meant not to be
root 4
awk {/[!#][Rr][Oo][Oo][Tt]/{ }}
The symbol for "not" in a bracket expression is the caret (or "circumflex") ^, but it must be the first character inside the brackets in order to have this meaning. The example given in the comments above is [^#], which means one character that is not #. So the regular expression /[^#]/ would match any string that contains at least one character other than #, wherever that character appears. This is not all of what you asked for:
Is there a way to specify the first character not to be something?
One thing that makes regular expressions hard for some people to read is that many symbols have different meanings based on context. The caret ^ is also used to indicate the beginning of a line. With a regex in awk, you can specify that the first character on the line (the first thing after the beginning of the line ^) is not a # with:
awk '/^[^#]/{ ... }'
This would execute the block of code { ... } for every line of input whose first character is not #. Note that this would, however, match a line that starts with other characters, and then has a # somewhere in it. /^[^#]/ would also not match an empty line, since there is no character for [^#] to consume. As you can see, there are many nuances and subtleties to consider as you tailor your regex for your needs. For more, look up awk regex, POSIX regex, or just type man -s7 regex in your terminal.
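To see the difference between the anchored and unanchored forms, here is a small sketch with made-up input lines (the variable names are only for illustration):

```shell
# Five sample lines, one of them empty.
input=$(printf 'root 4\n#root 5\nx#root 6\n\nRoot 7\n')

# Anchored: the first character must exist and must not be '#'.
anchored=$(printf '%s\n' "$input" | awk '/^[^#]/ { print }')

# Unanchored: any line containing at least one non-'#' character.
loose=$(printf '%s\n' "$input" | awk '/[^#]/ { print }')

printf 'anchored:\n%s\n' "$anchored"
printf 'loose:\n%s\n' "$loose"
```

The anchored pattern should skip both "#root 5" and the empty line, while the unanchored one skips only the empty line.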

grep: filtering list with multiple special characters

Using grep or another command line tool I need to filter a list so that every line containing one or more of the following characters are excluded:
.
/
-
'
[space]
I'm having a hard time escaping special characters while searching for multiple expressions.
This isn't working:
grep -v '(.|/|-|'| )' input > output
By default, the grep command uses "Basic" regular expression format. The regex you've written is in "Extended" format. You can tell grep to use extended format with the -E option.
You've included a dot in your regex. Remember that a dot matches "any" character. To get a literal dot you can either escape it with a backslash (\.) or put it in a bracket expression ([.]). I prefer the latter notation because I find that backslashes make things more difficult to read. The choice is yours.
You have a single quote in your expression. As you've written it, the command line won't work because the embedded single quote exits the string begun with the first single quote. You can get around this by wrapping your regex in double quotes.
You also don't need the outer brackets with this regex.
So... You could write the whole thing in Basic notation:
grep -v "[.]\|/\|-\|'\| " input > output
Or you could write it in Extended notation:
grep -Ev "[.]|/|-|'| " input > output
Or alternately, you could put ALL these characters into a single bracket expression, which is written the same way in Basic and Extended:
grep -v "[./' -]" input > output
Note that the hyphen has moved to the END of the bracket expression so that it won't be interpreted as "the range of characters between a forward slash and a single quote". Note also that since this bracket expression is also compatible with Basic RE notation, I've removed the -E option.
See re_format(7) (man 7 re_format) for details.
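A quick way to check the bracket-expression version is to run it on a small made-up list; the words below are illustrative only:

```shell
# One entry per unwanted character, plus two clean entries.
list=$(printf '%s\n' 'alpha' 'be.ta' 'gam/ma' 'del-ta' "o'clock" 'two words' 'omega')

# -v drops every line containing '.', '/', a single quote, a space or '-'.
# The hyphen comes last so that it is taken literally.
kept=$(printf '%s\n' "$list" | grep -v "[./' -]")

printf '%s\n' "$kept"
```

Only alpha and omega should survive the filter.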

Grep for multiple strings with escaped pipe in each

I'm using Gitbash within Windows. I want to grep for a set of strings, each of which ends with a |
I think I can do each one singly with a backslash to escape the pipe:
grep abcdef\| filename.tsv
But to do them all together I end up with:
grep 'abcdef\|\|uvwxyz\|' filename.tsv
which fails. Any ideas?
I could just do each string individually and then concatenate the resulting files, but it would take days.
In POSIX basic regexes, which grep uses by default, you must not escape a literal |. You do need to escape it, as \|, when it is used as regex syntax to separate alternatives.
The following expression should work:
grep 'abcdef|\|uvwxyz|' filename.tsv
An ERE might be the way to go, for easier readability.
egrep '(abcdef|uvwxyz)[|]' filename.tsv
This lets you manage your string list a little more easily, and "escapes" the trailing vertical bar by putting it inside a bracket expression. (This works for dots, asterisks, etc, as well.)
If egrep isn't available on your system, you can check to see if your existing grep includes a -E option for extended regexes.
There are two competing effects here which you may be confusing. Firstly, the | must be escaped or quoted so that it is not interpreted by the shell. Secondly, depending on which regex mode you are using, escaping/unescaping the pipe changes whether it is a literal character or a metacharacter.
I would suggest that you change your pattern to this:
grep 'abcdef|\|uvwxyz|' file
In basic regex mode, an escaped pipe \| is a regex OR, so this matches either pattern followed by a literal pipe.
Alternatively, if all your patterns end in a pipe and you have more than just two, perhaps you could use this:
grep -E '(abc|def|ghi)\|' file
In extended mode, escaping the pipe has the opposite effect, so this pattern matches any of the sequences of letters followed by a literal pipe.
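The two modes can be compared side by side on a few made-up lines. This sketch assumes a grep that accepts \| as alternation in basic mode, as GNU grep (the one shipped with Git Bash) does:

```shell
# Sample data: only the first two lines contain a target string
# immediately followed by a literal pipe.
data=$(printf '%s\n' 'abcdef|1' 'uvwxyz|2' 'abcdef 3' 'other|4')

# Basic mode: bare | is literal, \| means "or".
bre=$(printf '%s\n' "$data" | grep 'abcdef|\|uvwxyz|')

# Extended mode: bare | means "or", \| is a literal pipe.
ere=$(printf '%s\n' "$data" | grep -E '(abcdef|uvwxyz)\|')

printf '%s\n' "$bre"
```

Both commands should select the same two lines.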

regex implementation to replace group with its lowercase version

Is there any implementation of regex that allow to replace group in regex with lowercase version of it?
If your regex implementation supports it, you can use \L. GNU sed does, as an extension beyond POSIX sed:
sed -r 's/(^.*)/\L\1/'
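As a quick check, the command can be run on a made-up string. This sketch assumes GNU sed, since \L in the replacement is a GNU extension. The tr line is a portable fallback for the case where the whole input should be lowercased rather than a single group:

```shell
# GNU sed: \L lowercases the text inserted by \1.
lower=$(printf '%s\n' 'Hello WORLD' | sed -E 's/(^.*)/\L\1/')

# Portable fallback without regex groups: lowercase the entire stream.
portable=$(printf '%s\n' 'Hello WORLD' | tr '[:upper:]' '[:lower:]')

printf '%s\n' "$lower" "$portable"
```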
In Perl, you can do:
$string =~ s/(some_regex)/lc($1)/ge;
The /e option causes the replacement expression to be interpreted as Perl code to be evaluated, whose return value is used as the final replacement value. lc($x) returns the lowercased version of $x. (Not sure but I assume lc() will handle international characters correctly in recent Perl versions.)
/g means match globally. Omit the g if you only want a single replacement.
If you're using an editor like SublimeText or TextMate [1], there's a good chance you may use
\L$1
as your replacement, where $1 refers to something from the regular expression that you put parentheses around. For example [2], here's something I used to downcase field names in some SQL, getting everything to the right of the 'as' at the end of any given line. First the "find" regular expression:
(as|AS) ([A-Za-z_]+)\s*,$
and then the replacement expression:
$1 '\L$2',
If you use Vim (or presumably gvim), then you'll want to use \L\1 instead of \L$1, but there's another wrinkle that you'll need to be aware of: Vim reverses the syntax between literal parenthesis characters and escaped parenthesis characters. So to designate a part of the regular expression to be included in the replacement ("captured"), you'll use \( at the beginning and \) at the end. Think of \ as—instead of escaping a special character to make it a literal—marking the beginning of a special character (as with \s, \w, \b and so forth). So it may seem odd if you're not used to it, but it is actually perfectly logical if you think of it in the Vim way.
[1] I've tested this in both TextMate and SublimeText and it works as-is, but some editors use \1 instead of $1. Try both and see which your editor uses.
[2] I just pulled this regex out of my history. I always tweak regexen while using them, and I can't promise this is the final version, so I'm not suggesting it's fit for the purpose described, and especially not with SQL formatted differently from the SQL I was working on, just that it's a specific example of downcasing in regular expressions. YMMV. UAYOR.
Several answers have noted the use of \L. However, \E is also worth knowing about if you use \L.
\L converts everything up to the next \U or \E to lowercase. ... \E turns off case conversion.
(Source: https://www.regular-expressions.info/replacecase.html )
So, suppose you wanted to use rename to lowercase part of some file names like this:
artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
artist_-_album_-_Another_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a
you could do something like:
rename -v 's/^(.*_-_)(.*)(_-_.*\.m4a)/$1\L$2\E$3/g' *
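The effect of \L ... \E can be previewed without renaming any files. This is only a sketch: it swaps in GNU sed for rename (so backreferences are written \1 rather than $1) and assumes GNU sed's \L and \E extensions:

```shell
# One of the example file names from above.
name='artist_-_album_-_Song_Title_to_be_Lowercased_-_MultiCaseHash.m4a'

# \L starts lowercasing at group 2; \E stops it before group 3,
# so the mixed-case hash and the extension are left untouched.
renamed=$(printf '%s\n' "$name" | sed -E 's/^(.*_-_)(.*)(_-_.*\.m4a)$/\1\L\2\E\3/')

printf '%s\n' "$renamed"
```

Only the song title in the middle should come out lowercased.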
In Perl, there's
$string =~ tr/A-Z/a-z/;
Many regex implementations let you pass a callback function when doing a replace (JavaScript's String.prototype.replace and Python's re.sub, for example), so you can simply return a lowercase version of the match from the callback.