Does grep accept `+` as a regex - Better alternative to regex?

I am trying to use the following syntax with grep
cat Notes.rtf | grep -i "\(D*\)ters"
The result is fine. However, when I attempt to use
cat Notes.rtf | grep -i "\(D+\)ters"
There is no result.
I came across this page and it seems that regex does not support +
Is that correct? Is there an equivalent to + with grep? Is there a better alternative to grep for the Mac OS X terminal?

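By default grep uses basic regular expressions (BRE), in which an unescaped + is a literal plus sign. Use the -E flag (extended regex) or, with GNU grep, the -P flag (PCRE) to make + a quantifier. Perl-style classes such as \D (matching a non-digit) are PCRE syntax and need -P:
grep -Pi "\D+ters" Notes.rtf
OR, in portable extended-regex syntax:
grep -Ei "[^0-9]+ters" Notes.rtf
With either the -P or the -E flag, + does not need to be escaped.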
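
A quick way to see the difference between the two flavors is to feed grep a couple of sample lines. The sketch below uses made-up input (not the contents of Notes.rtf), and the variable names are only illustrative:

```shell
# Made-up sample input: one line with repeated D's, one with a literal "+".
sample=$(printf '%s\n' 'DDters' 'D+ters')

# BRE (plain grep): "+" is an ordinary character, so only "D+ters" matches.
bre_match=$(printf '%s\n' "$sample" | grep 'D+ters')

# ERE (grep -E): "+" means "one or more of the preceding item".
ere_match=$(printf '%s\n' "$sample" | grep -E 'D+ters')

printf 'BRE: %s\nERE: %s\n' "$bre_match" "$ere_match"
```

The BRE run should select only the line with the literal plus sign, while the ERE run should select only the line with the repeated D.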

Related

Using grep with a negative lookahead assertion

I have exactly the same question as in the post "RegExp exclusion, looking for a word not followed by another"; however, the regex isn't working for me in bash.
I want to include all lines of a csv file that include the word "Tom", except when it's followed by "Thumb".
Include: Tom sat by the seashore.
Don't include: Tom Thumb sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.
The regex Tom(?!\s+Thumb) works when I try it out on regex101.com.
But I've tried all these variations and none of them work. What am I missing and how can I work around this? I'm on a Mac.
cat inputfile.csv | grep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep “Tom(?!\s+Thumb)” > Tom.csv
cat inputfile.csv | grep -E Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | grep -E “Tom(?!\s+Thumb)” > Tom.csv
You can't do this with POSIX ERE.
There is no negative lookahead assertion in POSIX extended regular expressions, which is the syntax grep -E activates.
The closest you can get is to combine two separate regexes, one positive match and one negative:
grep -we 'Tom' inputfile.csv | grep -wvEe 'Tom[[:space:]]+Thumb'
grep -v excludes any line that matches the given expression; so here, we're first searching for Tom, and then removing Tom Thumb.
However, the intent to match Tom and Tom Thumb sat by the seashore makes this unworkable. In short: You can't do what you're asking for with standard grep, unless it has grep -P to make your original syntax valid. In that case you could use:
grep -Pwe 'Tom(?!\s+Thumb)' <inputfile.csv >Tom.csv
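
To see concretely why the two-stage filter is not enough, it can be run against the three sample lines from the question. This is only a sketch; the variable names are made up for illustration:

```shell
# The three sample lines from the question.
lines=$(printf '%s\n' \
  'Tom sat by the seashore.' \
  'Tom Thumb sat by the seashore.' \
  'Tom and Tom Thumb sat by the seashore.')

# Keep lines containing the word Tom, then drop any line containing "Tom Thumb".
two_stage=$(printf '%s\n' "$lines" \
  | grep -we 'Tom' \
  | grep -vE 'Tom[[:space:]]+Thumb')

printf '%s\n' "$two_stage"
```

Only the first line survives: the third line is discarded because grep -v rejects the whole line as soon as "Tom Thumb" appears anywhere in it, even though it also contains a standalone Tom.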
One hack might be a temporary substitution.
Assuming you have uuidgen available (it appears to be present in Big Sur) to generate a temporary, unpredictable sigil:
uuid=$(uuidgen)
sed -e "s/Tom Thumb/$uuid/g" <inputfile.csv \
| grep -we 'Tom' \
| sed -e "s/$uuid/Tom Thumb/g" >Tom.csv
How about a Perl solution:
perl -ne 'print if /Tom(?!\s+Thumb)/' inputfile.csv > Tom.csv
Perl natively supports the regex syntax that PCRE emulates, including lookaheads, and it is pre-installed on macOS.
The -n option wraps the code in a loop over the input lines without printing them automatically, much like sed -n.
The -e option supplies the code inline, which makes one-liners possible.
The code print if /pattern/ is an idiom for printing matching lines, so it can
stand in for the grep command.
Keep it simple and just use awk, e.g. using any awk in any shell on every Unix box:
$ awk '{orig=$0; gsub(/Tom Thumb/,"")} /Tom/{print orig}' file
Include: Tom sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.
Grep can use Perl regular expressions (PCRE). From man grep:
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs). This option is experimental when combined with the -z (--null-data) option, and grep -P may warn of unimplemented features.

I get a blank echo from regex

If I run this file, it works fine and outputs the lines I expect:
workspaceFile=`cat tensorflow/workspace.bzl`
echo $workspaceFile | grep -oP '\/[a-z0-9]{12}.tar.gz'
However, if I run this, all I get is blank output in the terminal:
workspaceFile=`cat tensorflow/workspace.bzl`
TAR_FILE_WITH_SLASH=$workspaceFile | grep -oP '\/[a-z0-9]{12}.tar.gz'
echo $TAR_FILE_WITH_SLASH
The file is quite long so I'll add a shortened version here for simplicity's sake:
tf_http_archive(
name = "eigen_archive",
urls = [
"https://mirror.bazel.build/bitbucket.org/eigen/eigen/get/6913f0cf7d06.tar.gz",
"https://bitbucket.org/eigen/eigen/get/6913f0cf7d06.tar.gz",
],
You need to use the $() command-substitution syntax: echo the contents of workspaceFile, pipe it to grep, and assign the captured output to the variable:
TAR_FILE_WITH_SLASH="$(echo "$workspaceFile" | grep -oE '/[a-z0-9]{12}\.tar\.gz')"
Also, note you need no PCRE regex here, you can use a POSIX ERE regex (that is, replace P with E). You may even use a POSIX BRE pattern here, like grep -o '/[a-z0-9]\{12\}\.tar\.gz'. The dot must be escaped to match a literal dot and the / is not special here and needs no escaping.
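
As a self-contained sketch of the same idea, the snippet below uses the two URLs from the question as a stand-in for the file contents. The unique_tar variable and the sort -u step are additions for illustration, not part of the original script:

```shell
# Stand-in for the contents of workspace.bzl (two URLs from the question).
workspaceFile='"https://mirror.bazel.build/bitbucket.org/eigen/eigen/get/6913f0cf7d06.tar.gz",
"https://bitbucket.org/eigen/eigen/get/6913f0cf7d06.tar.gz",'

# Command substitution captures the output of grep; the quotes preserve newlines.
TAR_FILE_WITH_SLASH="$(printf '%s\n' "$workspaceFile" | grep -oE '/[a-z0-9]{12}\.tar\.gz')"

# Both URLs end in the same archive name, so de-duplicate the matches.
unique_tar=$(printf '%s\n' "$TAR_FILE_WITH_SLASH" | sort -u)

printf '%s\n' "$unique_tar"
```

Without the $() wrapper, the original line assigns the variable in one pipeline stage and runs grep with empty input in another, which is why nothing was printed.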
What about the path?
workspaceFile=`cat ~/tensorflow/workspace.bzl`

Non greedy matching using ? with grep

I'm writing a bash script which analyses an HTML file and
I want to get the content of each single <tr>...</tr>. So my command looks like:
$ tr -d \\012 < price.html | grep -oE '<tr>.*?</tr>'
But it seems that grep gives me the result of:
$ tr -d \\012 < price.html | grep -oE '<tr>.*</tr>'
How can I make .* non-greedy?
If you have GNU Grep you can use -P to make the match non-greedy:
$ tr -d \\012 < price.html | grep -Po '<tr>.*?</tr>'
The -P option enables Perl Compatible Regular Expressions (PCRE), which are needed for non-greedy matching with ?, as Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE) do not support it.
If you are using -P you could also use lookarounds to avoid printing the tags in the match, like so:
$ tr -d \\012 < price.html | grep -Po '(?<=<tr>).*?(?=</tr>)'
If you don't have GNU grep and the HTML is well formed you could just do:
$ tr -d \\012 < price.html | grep -o '<tr>[^<]*</tr>'
Note: The above example won't work with nested tags within <tr>.
Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep -E. Use grep -P instead if you have that, or switch to Perl / Python / Ruby / what have you. (Oh, and pcregrep.)
Of course, if you really mean
<tr>[^<>]*</tr>
you should say that instead; then plain old grep will work fine.
You could (tediously) extend the regex to accept nested tags which are not <tr> but of course, it's better to use a proper HTML parser than spend a lot of time rediscovering why regular expressions are not the right tool for this.
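
Where grep -P is not available, the shortest-match behaviour can be approximated with awk by splitting on the closing tag. The following is a rough sketch under the assumption that rows are not nested and the input has already been joined into one line; the sample HTML and variable names are made up, and it is no substitute for a real HTML parser:

```shell
# Hypothetical one-line input standing in for price.html after `tr -d \\012`.
html='<table><tr><td>1</td></tr><tr><td>2</td></tr></table>'

# Split on the closing tag; within each piece, keep everything from the first <tr>.
rows=$(printf '%s\n' "$html" | awk '{
  n = split($0, parts, "</tr>")
  for (i = 1; i < n; i++) {            # the last piece follows the final </tr>
    start = index(parts[i], "<tr>")
    if (start) print substr(parts[i], start) "</tr>"
  }
}')

printf '%s\n' "$rows"
```

Each piece ends at the nearest closing tag, so every printed row stops at the first </tr> after its <tr>, which is the effect the non-greedy .*? was meant to achieve.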
.*? is Perl regular expression syntax. Change your grep to
grep -oP '<tr>.*?</tr>'
Try Perl-style regular expressions:
$ grep -Po '<tr>.*?</tr>' input
<tr>stuff</tr>
<tr>more stuff</tr>

Bash (grep) regex performing unexpectedly

I have a text file, which contains a date in the form of dd/mm/yyyy (e.g. 20/12/2012).
I am trying to use grep to parse the date and show it in the terminal, and it is successful,
until I meet a certain case:
These are my test cases:
grep -E "\d*" returns 20/12/2012
grep -E "\d*/" returns 20/12/2012
grep -E "\d*/\d*" returns 20/12/2012
grep -E "\d*/\d*/" returns nothing
grep -E "\d+" also returns nothing
Could someone explain to me why I get this unexpected behavior?
EDIT: I get the same behavior if I substitute the " (weak quotes) for ' (strong quotes).
The syntax you used (\d) is not recognised by grep's extended regex (ERE).
Use grep -P instead which uses Perl regex (PCRE). For example:
grep -P "\d+/\d+/\d+" input.txt
grep -P "\d{2}/\d{2}/\d{4}" input.txt # more restrictive
Or, to stick with extended regex, use [0-9] in place of \d:
grep -E "[0-9]+/[0-9]+/[0-9]+" input.txt
grep -E "[0-9]{2}/[0-9]{2}/[0-9]{4}" input.txt # more restrictive
You could also use -P instead of -E, which allows grep to use PCRE syntax:
grep -P "\d+/\d+" file
works too.
grep and egrep/grep -E don't recognize \d. Your first three patterns only appear to work because the asterisk makes \d optional: \d* matches zero characters, so the digits themselves are never actually matched.
Use [0-9] or [[:digit:]].
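
As a small sketch of both spellings, the snippet below extracts the date from a made-up sample line (the surrounding text and the variable names are assumptions for illustration):

```shell
# Sample line containing a dd/mm/yyyy date (made up around the example date).
line='Invoice issued on 20/12/2012 in London'

# [0-9] works in every grep flavor; -E lets the {n} interval be written unescaped.
date_found=$(printf '%s\n' "$line" | grep -Eo '[0-9]{2}/[0-9]{2}/[0-9]{4}')

# The POSIX class [[:digit:]] is an equivalent spelling.
date_posix=$(printf '%s\n' "$line" | grep -Eo '[[:digit:]]{2}/[[:digit:]]{2}/[[:digit:]]{4}')

printf '%s\n%s\n' "$date_found" "$date_posix"
```

Both commands print only the date, because -o restricts the output to the matched portion of the line.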
To help troubleshoot cases like this, the -o flag can be helpful as it shows only the matched portion of the line. With your original expressions:
grep -Eo "\d*" returns nothing - a clue that \d isn't doing what you thought it was.
grep -Eo "\d*/" returns / (twice) - confirmation that \d isn't matching while the slashes are.
As noted by others, the -P flag solves the issue by recognizing "\d", but to clarify Explosion Pills' answer, you could also use -E as follows:
grep -Eo "[[:digit:]]*/[[:digit:]]*/" returns 20/12/
EDIT: Per a comment by #shawn-chin (thanks!), --color can be used similarly to highlight the portions of the line that are matched while still showing the entire line:
grep -E --color "[[:digit:]]*/[[:digit:]]*/" returns 20/12/2012 (can't do color here, but the bold "20/12/" portion would be in color)

matching a specific substring with regular expressions using awk

I'm dealing with specific filenames, and need to extract information from them.
The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"
with RANDOMSTR a string of max 22 chars, and which may contain a substring (or not) with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the unique feature of starting with "-W".
The information I need to extract is the substring of RANDOMSTR without this optional substring.
I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:
gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045
The expected results are:
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING
How can I get the desired effect?
Thanks.
You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.
$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING
While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option seems to be available only in GNU grep (the default on Linux). It's also pretty simple to do this in awk.
$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$
Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:
$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'
The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.
If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.
str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo "${str%.raw.gz}" | # remove trailing .raw.gz
sed 's/-W.*$//' | # remove trailing -W.*, if any
sed -nE 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'
I used sed, but you can just as well use gawk/awk.
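
The same multi-pass idea can also be sketched with shell parameter expansion alone, with no external tools. The function name strip_name is hypothetical, and the sketch assumes the fixed-width prefix layout shown in the question and that "-W" occurs only at the start of the optional suffix:

```shell
# Hypothetical helper: print RANDOMSTR without the optional -W... suffix.
strip_name() {
  s=${1%.raw.gz}                          # drop the fixed extension
  s=${s#????????_M?_????????.???_?_}      # drop the date, M-code, ID, and flag fields
  printf '%s\n' "${s%-W*}"                # drop a trailing -W... suffix, if present
}

strip_name "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
strip_name "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
```

The ? wildcards match by position rather than by character class, so this does no validation of the prefix; if a name does not fit the layout, the prefix is simply left in place.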
Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:
sed -E -e 's/^.{27}(.*)\.raw\.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO