Grep regex not working with square brackets - regex

I was trying to write a grep regex that matches square brackets, i.e. for the input [ad] it should match [ and ]. But I get different results depending on whether I use a capturing group or a character class. The result also changes when I wrap the regex string in single quotes (').
These are the different results I am getting.
Using capturing groups works fine
echo "[ad]" | grep -E '(\[|\])'
[ad]
Using capturing groups without ' gives syntax error
echo "[ad]" | grep -E (\[|\])
bash: syntax error near unexpected token `('
Using character class with [ followed by ] gives no output
echo "[ad]" | grep -E [\[\]]
Using character class with ] followed by [ works correctly
echo "[ad]" | grep -E [\]\[]
[ad]
Using character class with ] followed by [ and using ' does not work
echo "[ad]" | grep -E '[\]\[]'
It'd be great if someone could explain the difference between them.

You should know about:
BRE ( = Basic Regular Expression )
ERE ( = Extended Regular Expression )
In BRE, metacharacters such as ( ) and { } require a backslash to give them their special meaning, and grep uses BRE by default.
The ERE flavor standardizes a flavor similar to the one used by the UNIX egrep command.
Pay attention to -E and -G
grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c
Regexp selection and interpretation:
-E, --extended-regexp PATTERN is an extended regular expression (ERE)
-F, --fixed-strings PATTERN is a set of newline-separated strings
-G, --basic-regexp PATTERN is a basic regular expression (BRE)
-P, --perl-regexp PATTERN is a Perl regular expression
...
...
POSIX Basic Regular Expressions
POSIX Extended Regular Expressions
POSIX Bracket Expressions
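To make the BRE/ERE difference concrete, here is a minimal sketch using the interval operator. The variable names are only illustrative, and it assumes a grep that supports -E and -c:

```shell
sample='aa'

# BRE (grep default): braces need a backslash to act as an interval operator.
bre_escaped=$(printf '%s\n' "$sample" | grep -c 'a\{2\}' || true)

# ERE (grep -E): braces are an interval operator without a backslash.
ere_plain=$(printf '%s\n' "$sample" | grep -cE 'a{2}' || true)

# BRE with unescaped braces: they are literal, so this looks for the text "a{2}".
bre_plain=$(printf '%s\n' "$sample" | grep -c 'a{2}' || true)

# Number of matching lines for each variant.
printf '%s %s %s\n' "$bre_escaped" "$ere_plain" "$bre_plain"
```

The `|| true` is only there so that a zero-match count does not abort a script running under `set -e`.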
You should also know a bit about bash, since some of what you are seeing comes from the bash interpreter, not from grep or anything else.
echo "[ad]" | grep -E (\[|\])
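Here the parentheses are unquoted, so bash parses them as shell syntax before grep ever runs. Parentheses are special to the shell, for example in subshells or in arithmetic like:
echo $(( 10 * 10 ))
By using single quotes (') you tell bash to pass the text to grep as is, instead of treating it as a special operator. So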
echo "[ad]" | grep -E '(\[|\])'
is correct.
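One way to see what the shell does to an unquoted pattern is to print the argument instead of handing it to grep. This is a rough sketch; the variable names are made up for illustration, and `set -f` is added only so the result does not depend on which files happen to be in the current directory:

```shell
# Disable pathname expansion for the demo. Without it, an unquoted pattern
# could additionally be replaced by matching file names.
set -f

# Unquoted: the shell removes the backslashes before the command sees them.
unquoted_a=$(printf '%s' [\[\]])
unquoted_b=$(printf '%s' [\]\[])

# Single-quoted: the text is passed through unchanged.
quoted=$(printf '%s' '[\]\[]')

set +f

printf '%s\n' "$unquoted_a" "$unquoted_b" "$quoted"
```

So in the unquoted cases grep never sees the backslashes at all, which is why quoting changes the result.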

Firstly, always quote the regex pattern so the shell does not interpret it before grep sees it:
$ echo "[ad]" | grep -E '(\[|\])'
[ad]
Secondly, within a quoted bracket expression you don't need to escape [ and ] (a backslash is just a literal character there). Write them as is inside the outer [], with ] first so that it is taken literally:
$ echo "[ad]" | grep -E '[][]'
[ad]
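As a sketch of how grep reads the three bracket patterns from the question once they reach it (the variable names are only for illustration):

```shell
input='[ad]'

# '[[]]' is the bracket expression '[[]' followed by a literal ']',
# so it only matches the two-character text "[]".
count_a=$(printf '%s\n' "$input" | grep -cE '[[]]' || true)

# '[][]' has ']' first inside the brackets, so it is a set of ']' and '['.
count_b=$(printf '%s\n' "$input" | grep -cE '[][]' || true)

# '[\]\[]' is the set containing only a backslash, then an escaped '[' and a
# literal ']', so it only matches the three-character text '\[]'.
count_c=$(printf '%s\n' "$input" | grep -cE '[\]\[]' || true)

# Number of matching lines for each pattern.
printf '%s %s %s\n' "$count_a" "$count_b" "$count_c"
```

Only the second pattern matches the sample input, which lines up with the outputs shown in the question.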

Maybe you provided such a simple example on purpose (after all, it is minimal), but in case all you really want is to check for existence of square brackets (a fixed string, not regex pattern), you can use grep with -F/--fixed-strings and multiple -e options:
$ echo "[ad]" | grep -F -e '[' -e ']'
[ad]
Or, a little bit shorter with fgrep (a deprecated alias for grep -F; recent GNU grep versions print a warning when it is used):
$ echo "[ad]" | fgrep -e '[' -e ']'
[ad]
Or, even:
$ echo "[ad]" | fgrep -e[ -e]
[ad]

Related

Correct (?) regex not understood by sed

According to https://regex101.com/r/NLSymf/3, the following regex:
\[\[(foo)([^\]]+)\]\]
(full) matches the string [[foo>test1|test2]], but this seems to not be understood by sed, since:
echo "[[foo>test1|test2]]" | sed -E -e '/\[\[(foo)([^\]]+)\]\]/d'
(which should return an empty string) returns:
[[foo>test1|test2]]
What is the regex that matches [[foo>test1|test2]] from sed's point of view?
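The backslash character loses its escaping capability within a bracket expression, so [^\] is already a complete bracket expression (any character except a backslash) and the ] after it is a literal. Stray closing brackets in an RE need not be escaped, which is why grep doesn't fail on the first pipeline below. See RE Bracket Expression for reference.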
$ echo 'a]' | grep -Eo '[^\]]'
a]
$ echo 'a]' | grep -Eo '[^]]'
a
The correct regex would be:
\[\[(foo)([^]]+)]]
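A quick way to check this against the sed command from the question. This is a small sketch that assumes a sed supporting -E; the variable names are illustrative:

```shell
line='[[foo>test1|test2]]'

# Original pattern: '[^\]' means "any character except a backslash", and the
# ']' after it is a literal, so the line does not match and is kept.
kept=$(printf '%s\n' "$line" | sed -E -e '/\[\[(foo)([^\]]+)\]\]/d')

# Corrected pattern: '[^]]' means "any character except ']'", so the line
# matches and is deleted.
deleted=$(printf '%s\n' "$line" | sed -E -e '/\[\[(foo)([^]]+)]]/d')

printf 'old pattern: [%s]\nnew pattern: [%s]\n' "$kept" "$deleted"
```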

I get a blank echo from regex

If I run this file, it works fine and outputs the lines I expect:
workspaceFile=`cat tensorflow/workspace.bzl`
echo $workspaceFile | grep -oP '\/[a-z0-9]{12}.tar.gz'
However, if I run this, all I get is blank output in the terminal:
workspaceFile=`cat tensorflow/workspace.bzl`
TAR_FILE_WITH_SLASH=$workspaceFile | grep -oP '\/[a-z0-9]{12}.tar.gz'
echo $TAR_FILE_WITH_SLASH
The file is quite long so I'll add a shortened version here for simplicity's sake:
tf_http_archive(
name = "eigen_archive",
urls = [
"https://mirror.bazel.build/bitbucket.org/eigen/eigen/get/6913f0cf7d06.tar.gz",
"https://bitbucket.org/eigen/eigen/get/6913f0cf7d06.tar.gz",
],
You need to use the $() syntax: echo the contents of workspaceFile, pipe that to grep, and capture the output:
TAR_FILE_WITH_SLASH="$(echo "$workspaceFile" | grep -oE '/[a-z0-9]{12}\.tar\.gz')"
Also, note you need no PCRE regex here, you can use a POSIX ERE regex (that is, replace P with E). You may even use a POSIX BRE pattern here, like grep -o '/[a-z0-9]\{12\}\.tar\.gz'. The dot must be escaped to match a literal dot and the / is not special here and needs no escaping.
See the online demo.
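To see why the original line prints nothing, here is a minimal sketch with a shortened sample string standing in for the file contents (the sample value and variable names are assumptions for illustration):

```shell
text='https://bitbucket.org/eigen/eigen/get/6913f0cf7d06.tar.gz'

# An assignment used as a pipeline stage writes nothing to the pipe, so grep
# reads empty input. The assignment also happens in a subshell, so the
# variable stays unset in the current shell.
unset broken
broken=$text | grep -oE '/[a-z0-9]{12}\.tar\.gz' || true

# Command substitution captures grep's output into the variable.
fixed=$(printf '%s\n' "$text" | grep -oE '/[a-z0-9]{12}\.tar\.gz')

printf 'broken=[%s] fixed=[%s]\n' "${broken-}" "$fixed"
```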
What about the path?
workspaceFile=`cat ~/tensorflow/workspace.bzl`

Simple replacement with sed inside bash not working

Why is this simple replacement with sed inside bash not working?
echo '[a](!)' | sed 's/[a](!)/[a]/'
It returns [a](!) instead of [a]. But why, given that only three characters need to be escaped in a sed replacement string?
If I account for the case that additional characters need to be escaped in the regex string and try
echo '[a](!)' | sed 's/\[a\]\(!\)/[a]/'
it is still not working.
The point is that [a] in the regex pattern is a bracket expression that matches a single a; it does not match literal square brackets. Escape the first [ for it to be parsed as a literal [ symbol, and your replacement will work:
echo '[a](!)' | sed 's/\[a](!)/[a]/'
                       ^^
See this demo
sed uses BREs by default, and EREs can be enabled by escaping individual ERE metacharacters or by using the -E argument. [ and ] are BRE metacharacters, ( and ) are ERE metacharacters. When you wrote:
echo '[a](!)' | sed 's/\[a\]\(!\)/[a]/'
you were turning the [ and ] BRE metacharacters into literals, which is good, but you were turning the literal ( and ) into ERE metacharacters, which is bad. This is what you were trying to do:
echo '[a](!)' | sed 's/\[a\](!)/[a]/'
which you'd probably really want to write using a capture group:
echo '[a](!)' | sed 's/\(\[a\]\)(!)/\1/'
to avoid duplicating [a] on both sides of the substitution. With EREs enabled using the -E argument that last would be:
echo '[a](!)' | sed -E 's/(\[a\])\(!\)/\1/'
Read the sed man page and a regexp tutorial.
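For comparison, the variants discussed above can be run side by side. This is just a sketch with made-up variable names, assuming a sed that supports -E:

```shell
input='[a](!)'

# BRE: '\(' and '\)' form a group, so this looks for "[a]!" and changes nothing.
wrong=$(printf '%s\n' "$input" | sed 's/\[a\]\(!\)/[a]/')

# BRE: escaped brackets are literal, plain parentheses are literal.
bre=$(printf '%s\n' "$input" | sed 's/\[a\](!)/[a]/')

# BRE with a capture group around the part to keep.
bre_group=$(printf '%s\n' "$input" | sed 's/\(\[a\]\)(!)/\1/')

# ERE: plain parentheses group, escaped parentheses are literal.
ere_group=$(printf '%s\n' "$input" | sed -E 's/(\[a\])\(!\)/\1/')

printf '%s\n' "$wrong" "$bre" "$bre_group" "$ere_group"
```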
man echo says that the echo command displays a line of text, so [ and ( and their closing brackets are just text to echo.
If you open man grep and search there with /^\ *Character Classes and Bracket Expressions and /^\ *Basic vs Extended Regular Expressions, you can read about the difference. sed and other tools that use regexes interpret [a] as a bracket expression.
You can try this
$ echo '[a](!)' | sed 's/(!)//'

Non greedy matching using ? with grep

I'm writing a bash script which analyses an HTML file, and I want to get the content of each single <tr>...</tr>. So my command looks like:
$ tr -d \\012 < price.html | grep -oE '<tr>.*?</tr>'
But grep seems to give me the same result as:
$ tr -d \\012 < price.html | grep -oE '<tr>.*</tr>'
How can I make .* non-greedy?
If you have GNU Grep you can use -P to make the match non-greedy:
$ tr -d \\012 < price.html | grep -Po '<tr>.*?</tr>'
The -P option enables Perl Compatible Regular Expressions (PCRE), which are needed for non-greedy matching with ?, as Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) do not support it.
If you are using -P you could also use lookarounds to avoid printing the tags in the match, like so:
$ tr -d \\012 < price.html | grep -Po '(?<=<tr>).*?(?=</tr>)'
If you don't have GNU grep and the HTML is well formed you could just do:
$ tr -d \\012 < price.html | grep -o '<tr>[^<]*</tr>'
Note: The above example won't work with nested tags within <tr>.
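A small sketch of the difference between the greedy pattern and the negated-class workaround, using a made-up one-line sample instead of price.html and assuming a grep that supports -o:

```shell
html='<tr>one</tr><tr>two</tr>'

# Greedy: '.*' runs to the last '</tr>' on the line, giving a single match.
greedy=$(printf '%s\n' "$html" | grep -oE '<tr>.*</tr>')

# Negated class: '[^<]*' cannot cross a '<', so each row is matched separately.
per_row=$(printf '%s\n' "$html" | grep -o '<tr>[^<]*</tr>')

printf 'greedy:\n%s\nper row:\n%s\n' "$greedy" "$per_row"
```

As noted above, the second form stops working as soon as a row contains nested tags.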
Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep -E. Use grep -P instead if you have that, or switch to Perl / Python / Ruby / what have you. (Oh, and pcregrep.)
Of course, if you really mean
<tr>[^<>]*</tr>
you should say that instead; then plain old grep will work fine.
You could (tediously) extend the regex to accept nested tags which are not <tr> but of course, it's better to use a proper HTML parser than spend a lot of time rediscovering why regular expressions are not the right tool for this.
.*? is Perl regular expression syntax. Change your grep to
grep -oP '<tr>.*?</tr>'
Try a Perl-style regexp:
$ grep -Po '<tr>.*?</tr>' input
<tr>stuff</tr>
<tr>more stuff</tr>

bash, regex, return the matched regular expression(s)

How can I get bash to print the text matched by an expression like (?<=id=)[0-9]?
I'd also like the input to come from a pipe, and it will be a single line of text.
To print solely the matched expressions
(not the entire line; several matches within the same line may be displayed):
yourcommand | grep -P -o '(?<=id=)[0-9]'
bash's regular expressions aren't Perl-compatible. You could use grep:
grep -P -o '(?<=id=)[0-9]'
And in a pipeline:
number=$(echo "foo id=3 bar" | grep -Po '(?<=id=)[0-9]')
echo $number # => 3
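If grep -P is not available, a similar result can be had without a lookbehind by matching the prefix and then cutting it off. This is only a sketch: extract_ids is a hypothetical helper name, and it assumes a grep that supports -o:

```shell
# Hypothetical helper: reads text on stdin and prints each digit that follows
# "id=", one per line.
extract_ids() {
    # Match "id=" plus one digit, then keep only the part after the "=".
    grep -oE 'id=[0-9]' | cut -d= -f2
}

ids=$(printf '%s\n' 'foo id=3 bar id=7' | extract_ids)
printf '%s\n' "$ids"
```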