Using grep with a negative lookahead assertion

I have exactly the same question as in the post "RegExp exclusion, looking for a word not followed by another", however the regex isn't working for me in bash.
I want to include all lines of a csv file that include the word "Tom", except when it's followed by "Thumb".
Include: Tom sat by the seashore.
Don't include: Tom Thumb sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.
The regex Tom(?!\s+Thumb) works when I try it out on regex101.com.
But I've tried all these variations and none of them work. What am I missing and how can I work around this? I'm on a Mac.
cat inputfile.csv | grep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep “Tom(?!\s+Thumb)” > Tom.csv
cat inputfile.csv | grep -E Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | grep -E “Tom(?!\s+Thumb)” > Tom.csv

You can't do this with POSIX ERE.
There is no negative lookahead assertion in POSIX extended regular expressions, which is the syntax grep -E activates.
The closest you can get is to combine two separate regexes, one positive match and one negative:
grep -we 'Tom' inputfile.csv | grep -wvEe 'Tom[[:space:]]Thumb'
grep -v excludes any line that matches the given expression; so here, we're first searching for Tom, and then removing Tom Thumb.
However, the intent to match Tom and Tom Thumb sat by the seashore makes this unworkable. In short: You can't do what you're asking for with standard grep, unless it has grep -P to make your original syntax valid. In that case you could use:
grep -Pwe 'Tom(?!\s+Thumb)' <inputfile.csv >Tom.csv
One hack might be a temporary substitution.
Assuming you have uuidgen available (it appears to be present in Big Sur) to generate a temporary, unpredictable sigil:
uuid=$(uuidgen)
sed -e "s/Tom Thumb/$uuid/g" <inputfile.csv \
| grep -we 'Tom' \
| sed -e "s/$uuid/Tom Thumb/g" >tom.csv
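To see how the two-grep pipeline and the placeholder substitution differ on the question's three sample lines, here is a minimal sketch. The function names and the fixed placeholder string are hypothetical, chosen only to keep the example deterministic; the answer above uses uuidgen for an unpredictable one.

```shell
# The three sample lines from the question.
sample() {
  printf '%s\n' \
    'Tom sat by the seashore.' \
    'Tom Thumb sat by the seashore.' \
    'Tom and Tom Thumb sat by the seashore.'
}

# Two greps: keep lines with Tom, then drop every line that contains "Tom Thumb".
# This also drops the third line, which the question wants to keep.
two_greps() {
  grep -we 'Tom' | grep -wvEe 'Tom[[:space:]]Thumb'
}

# Placeholder substitution: hide "Tom Thumb", match the remaining Tom, restore.
# The placeholder is an assumed fixed string, not the uuidgen value used above.
with_placeholder() {
  ph='__TOM_THUMB_PLACEHOLDER__'
  sed -e "s/Tom Thumb/$ph/g" | grep -we 'Tom' | sed -e "s/$ph/Tom Thumb/g"
}

echo 'two greps:'
sample | two_greps
echo 'placeholder:'
sample | with_placeholder
```

The first pipeline prints only the first sample line, while the second prints the first and third, which is what the question asks for.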

How about a Perl solution:
perl -ne 'print if /Tom(?!\s+Thumb)/' inputfile.csv > Tom.csv
Perl supports lookahead assertions natively and comes pre-installed on macOS.
The -n option wraps the code in a loop over the input lines without printing them automatically, much like sed -n.
The -e option supplies the code on the command line, which is what makes one-liners possible.
The code print if /pattern/ is an idiom that prints each matching line, so it can stand in for the grep command.

Keep it simple and just use awk, e.g. using any awk in any shell on every Unix box:
$ awk '{orig=$0; gsub(/Tom Thumb/,"")} /Tom/{print orig}' file
Include: Tom sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.

Grep can use Perl regular expressions (PCRE). From man grep:
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs). This option is experimental when combined with the -z (--null-data) option, and grep -P may warn of unimplemented features.
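Since -P is not available in every grep (the BSD grep shipped with macOS rejects it), one possible approach is to probe for it and fall back to something portable. The sketch below is only an illustration: the function name tom_not_thumb is hypothetical, and the fallback reuses the awk idea from the answer above, with [ \t]+ standing in for \s+.

```shell
# Print lines containing "Tom" that is not followed by whitespace and "Thumb".
# Uses grep -P when this grep accepts it, otherwise falls back to awk.
tom_not_thumb() {
  if printf 'x\n' | grep -qP 'x(?!y)' 2>/dev/null; then
    grep -P 'Tom(?!\s+Thumb)' "$@"
  else
    # Remove every "Tom Thumb", then test what is left for a remaining "Tom".
    awk '{orig=$0; gsub(/Tom[ \t]+Thumb/,"")} /Tom/{print orig}' "$@"
  fi
}

printf '%s\n' \
  'Tom sat by the seashore.' \
  'Tom Thumb sat by the seashore.' \
  'Tom and Tom Thumb sat by the seashore.' | tom_not_thumb
```

Either branch should print the first and third lines; the function also accepts file names as arguments.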

Related

Extract few matching strings from matching lines in file using sed

I have a file with strings similar to this:
abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'
I have to find current_count and total_count for each line of the file. I am trying the command below but it's not working. Please help.
grep current_count file | sed "s/.*\('current_count': u'\d+'\).*/\1/"
It is outputting the whole line but I want something like this:
'current_count': u'3', 'total_count': u'3'

It's printing the whole line because the pattern in the s command doesn't match, so no substitution happens.
sed regexes don't support \d for digits, or x+ for xx*. GNU sed has a -r option to enable extended-regex support so + will be a meta-character, but \d still doesn't work. GNU sed also allows \+ as a meta-character in basic regex mode, but that's not POSIX standard.
So anyway, this will work:
echo -e "foo\nabcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" |
sed -nr "s/.*('current_count': u'[0-9]+').*/\1/p"
# output: 'current_count': u'2'
Notice that I skip the grep by using sed -n s///p. I could also have used /current_count/ as an address:
sed -r -e '/current_count/!d' -e "s/.*('current_count': u'[0-9]+').*/\1/"
Or with just grep printing only the matching part of the pattern, instead of the whole line:
grep -E -o "'current_count': u'[[:digit:]]+'"
(or egrep instead of grep -E). Note that grep -o is not specified by POSIX, although both GNU and BSD grep support it.
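The question asks for both counters on one line. A sketch of how the same s///p idea could be extended with two capture groups follows; the function name counts is hypothetical, and it assumes current_count always appears before total_count on the line. sed -E is accepted by both GNU and BSD sed.

```shell
# Print "'current_count': u'N', 'total_count': u'M'" for each line that has both fields.
# Lines without both fields are suppressed by -n and the trailing p flag.
counts() {
  sed -nE "s/.*('current_count': u'[0-9]+').*('total_count': u'[0-9]+').*/\1, \2/p" "$@"
}

printf '%s\n' \
  "foo" \
  "abcd u'current_count': u'2', u'total_count': u'3', u'order_id': u'90'" | counts
# output: 'current_count': u'2', 'total_count': u'3'
```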

To me this looks like some sort of serialized Python data. Basically I would try to find out the origin of that data and parse it properly.
However, while being hackish, sed can also be used here:
sed "s/.*current_count': [a-z]'\([0-9]\+\).*/\1/" input.txt
sed "s/.*total_count': [a-z]'\([0-9]\+\).*/\1/" input.txt

Match specific length words, anchored, without doing magic math

Let's say I wanted to find all 12-letter words in /usr/share/dict/words that started with c and ended with er. Off the top of my head, a workable pattern could look something like:
grep -E '^c.{9}er$' /usr/share/dict/words
It finds:
cabinetmaker
calcographer
calligrapher
campanologer
campylometer
...
But that .{9} bothers me. It feels too magical, subtracting the total length of all the anchor characters from the number defined in the original constraint.
Is there any way to rewrite this regex so it doesn't require doing this calculation up front, allowing a literal 12 to be used directly in the pattern?

You can use the -x option, which selects only matches that exactly match the whole line.
grep -xE '.{12}' /usr/share/dict/words | grep '^c.*er$'
Or use the -P option, which interprets the pattern as a Perl regular expression, and use a lookahead assertion.
grep -P '^(?=.{12}$)c.*er$' /usr/share/dict/words
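The two-grep form can be wrapped so that the literal length is passed as a parameter. This is a rough sketch only: the helper name words_of_len is hypothetical, and the prefix and suffix are interpolated into the pattern as regex text, so they should not contain metacharacters.

```shell
# Print words of exactly $1 characters that start with $2 and end with $3.
# The first grep checks the length (-x matches whole lines only),
# the second checks the anchored prefix and suffix.
words_of_len() {
  len=$1 prefix=$2 suffix=$3
  grep -xE ".{$len}" | grep -E "^${prefix}.*${suffix}\$"
}

printf '%s\n' cabinetmaker calligrapher caterer computer | words_of_len 12 c er
# output:
# cabinetmaker
# calligrapher
```

Against the real word list this would be used as words_of_len 12 c er < /usr/share/dict/words.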

You can use awk as an alternative and avoid this calculation:
awk -v len=12 'length($1)==len && $1 ~ /^c.*er$/' file

I don't know grep so well, but some more advanced NFA RegEx implementations provide you with lookaheads and lookbehinds. If you can figure out any means to make those available for you, you could write:
^(?=c).{12}(?<=er)$
Maybe as a perl one-liner like this?
perl -ne 'print if m/^(?=c).{12}(?<=er)$/' /usr/share/dict/words

One approach with GNU sed:
$ sed -nr '/^.{12}$/{/^c.*er$/p}' words
With BSD sed (Mac OS) it would be:
$ sed -nE '/^.{12}$/{/^c.*er$/p;}' words

Bash (grep) regex performing unexpectedly

I have a text file which contains a date in the form dd/mm/yyyy (e.g. 20/12/2012).
I am trying to use grep to parse the date and show it in the terminal, and it is successful,
until I meet a certain case:
These are my test cases:
grep -E "\d*" returns 20/12/2012
grep -E "\d*/" returns 20/12/2012
grep -E "\d*/\d*" returns 20/12/2012
grep -E "\d*/\d*/" returns nothing
grep -E "\d+" also returns nothing
Could someone explain to me why I get this unexpected behavior?
EDIT: I get the same behavior if I substitute the " (weak quotes) for ' (strong quotes).

The syntax you used (\d) is not recognised by grep's extended regular expressions (POSIX ERE).
Use grep -P instead which uses Perl regex (PCRE). For example:
grep -P "\d+/\d+/\d+" input.txt
grep -P "\d{2}/\d{2}/\d{4}" input.txt # more restrictive
Or, to stick with extended regex, use [0-9] in place of \d:
grep -E "[0-9]+/[0-9]+/[0-9]+" input.txt
grep -E "[0-9]{2}/[0-9]{2}/[0-9]{4}" input.txt # more restrictive

You could also use -P instead of -E, which makes grep use PCRE syntax. This works too:
grep -P "\d+/\d+" file

grep and egrep/grep -E don't recognize \d as a digit class. Your first three patterns appear to work only because the asterisk allows zero repetitions: \d* can match the empty string, so the line is selected without \d ever matching a digit.
Use [0-9] or [[:digit:]].
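As a small sketch of that advice, the date can be pulled out with a bracket expression and -o, which prints only the matched text. The function name extract_dates and the sample sentence are made up for illustration; -o is supported by GNU and BSD grep.

```shell
# Print every dd/mm/yyyy-shaped substring, one per line.
# [0-9] is used instead of \d so the pattern is valid POSIX ERE.
extract_dates() {
  grep -Eo '[0-9]{2}/[0-9]{2}/[0-9]{4}' "$@"
}

printf 'Backup finished on 20/12/2012 at noon\n' | extract_dates
# output: 20/12/2012
```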

To help troubleshoot cases like this, the -o flag can be helpful as it shows only the matched portion of the line. With your original expressions:
grep -Eo "\d*" returns nothing - a clue that \d isn't doing what you thought it was.
grep -Eo "\d*/" returns / (twice) - confirmation that \d isn't matching while the slashes are.
As noted by others, the -P flag solves the issue by recognizing "\d", but to clarify Explosion Pills' answer, you could also use -E as follows:
grep -Eo "[[:digit:]]*/[[:digit:]]*/" returns 20/12/
EDIT: Per a comment by @shawn-chin (thanks!), --color can be used similarly to highlight the portions of the line that are matched while still showing the entire line:
grep -E --color "[[:digit:]]*/[[:digit:]]*/" returns 20/12/2012 (can't do color here, but the bold "20/12/" portion would be in color)

grep: group capturing

I have following string:
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
and I need to get value of "scheme version", which is 1234 in this example.
I have tried
grep -Eo "\"scheme_version\":(\w*)"
however it returns
"scheme_version":1234
How can I get only the value? I know I can add a sed call, but I would prefer to do it with a single grep.

You'll need to use a look behind assertion so that it isn't included in the match:
grep -Po '(?<=scheme_version":)[0-9]+'

This might work for you:
echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' |
sed -n 's/.*"scheme_version":\([^}]*\)}/\1/p'
1234
Sorry it's not grep, so disregard this solution if you like.
Or stick with grep and add:
grep -Eo "\"scheme_version\":(\w*)"| cut -d: -f2

I would recommend that you use jq for the job. jq is a command-line JSON processor.
$ cat tmp
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
$ cat tmp | jq .scheme_version
1234

As an alternative to the positive lookbehind method suggested by SiegeX, you can reset the match starting point to directly after scheme_version": with the \K escape sequence. E.g.,
$ grep -Po 'scheme_version":\K[0-9]+'
This restarts the matching process after having matched scheme_version":, and tends to have far better performance than the positive lookbehind. Comparing the two on regex101 demonstrates that the reset match start method takes 37 steps and 1ms, while the positive lookbehind method takes 194 steps and 21ms.
You can compare the performance yourself on regex101 and you can read more about resetting the match starting point in the PCRE documentation.

To avoid relying on grep's PCRE feature, which is available in GNU grep but not in the BSD version, another option is ripgrep, e.g.
$ rg -o 'scheme_version.?:(\d+)' -r '$1' <file.json
1234
The -r (--replace) option replaces each match with the given text; capture group indices (e.g., $5) and names (e.g., $foo) can be used in the replacement.
Another example uses Python's json.tool module, which can validate and pretty-print JSON:
$ python -mjson.tool file.json | rg -o 'scheme_version[^\d]+(\d+)' -r '$1'
1234
Related: Can grep output only specified groupings that match?

You can do this:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | awk -F ':' '{print $4}' | tr -d '}'

Improving on @potong's answer, which only works for "scheme_version", you can use this expression to extract any of the keys:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_id":["]*\([^(",})]*\)[",}].*/\1/p'
scheme_version
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_rev":["]*\([^(",})]*\)[",}].*/\1/p'
4-cad1842a7646b4497066e09c3788e724
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"scheme_version":["]*\([^(",})]*\)[",}].*/\1/p'
1234
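The three commands above differ only in the key name, so they could be folded into one helper. This is a sketch under stated assumptions: the name json_field is hypothetical, the input is flat single-line JSON as in the question, and values contain no commas, braces, or escaped quotes. For anything more complex a real JSON parser such as jq is the safer choice.

```shell
# Print the value of key $1 from flat, single-line JSON read on stdin.
# The optional quote ("\{0,1\}) lets the same pattern handle strings and numbers.
json_field() {
  sed -n 's/.*"'"$1"'":"\{0,1\}\([^",}]*\)[",}].*/\1/p'
}

doc='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'
printf '%s\n' "$doc" | json_field _id
printf '%s\n' "$doc" | json_field _rev
printf '%s\n' "$doc" | json_field scheme_version
# output:
# scheme_version
# 4-cad1842a7646b4497066e09c3788e724
# 1234
```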

matching a specific substring with regular expressions using awk

I'm dealing with specific filenames, and need to extract information from them.
The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"
with RANDOMSTR a string of max 22 chars, and which may contain a substring (or not) with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the unique feature of starting with "-W".
The information I need to extract is the substring of RANDOMSTR without this optional substring.
I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:
gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045
The expected results are:
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING
How can I get the desired effect?
Thanks.

You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.
$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING

While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option seems to be available only in GNU grep (the default on Linux). It's also pretty simple to do this in awk.
$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$
Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:
$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'

The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.
If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.
str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo "${str%.raw.gz}" | # remove trailing .raw.gz
sed 's/-W.*$//' | # remove trailing -W.*, if any
sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'
I used sed, but you can just as well use gawk/awk.
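Since the question targets a bash script, the same multi-pass idea can also be sketched with shell parameter expansion alone, with no external commands. The function name random_part is hypothetical, and the sketch assumes "-W" occurs in the name only as the start of the optional suffix, as the question states.

```shell
# Print RANDOMSTR without the optional -W... suffix, given a file name as $1.
random_part() {
  name=${1%.raw.gz}                  # drop the trailing .raw.gz
  name=${name%-W*}                   # drop the optional -W... suffix, if present
  printf '%s\n' "${name#*_*_*_*_}"   # drop the first four underscore-separated fields
}

random_part "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
random_part "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
# output:
# SOME-STRING
# OTHER-STRING
```

Because ${name#...} removes the shortest matching prefix, underscores inside RANDOMSTR itself are left alone.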

Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:
sed -E -e 's/^.{27}(.*)\.raw\.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO
