Is this a grep bug?

I expect
egrep -i "((\w)\2){4,}" /usr/share/dict/words
to match the word 'subbookkeeper', but it does not.
Thoughts?

Apparently egrep doesn't support {m,n} repeat syntax:
$ egrep -i '((\w)\2)((\w)\4)((\w)\6)' words
bookkeeper
bookkeeping
subbookkeeper
$ egrep -i '((\w)\2)((\w)\4)((\w)\6)((\w)\8)' words
subbookkeeper
If you spell out the groups, it works.
This is on my Mac.

The problem seems to be that egrep is not resetting captured groups on repeats. Not sure if this is a bug or just ambiguity in what the notation implies. If you manually repeat then it should work:
egrep -i "(\w)\1(\w)\2(\w)\3(\w)\4" /usr/share/dict/words
However, it is strange that the quantified form does not work. It does work in Perl:
perl -lne "print if /((\w)\2){3}/" /usr/share/dict/words
BTW, egrep does support {m,n} syntax. This proves that:
egrep -i "a{2}" /usr/share/dict/words
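A minimal sketch of the spelled-out approach, assuming a grep that accepts back-references in -E patterns (GNU grep does, and the transcripts above suggest the Mac one does as well, although POSIX ERE does not require it). The helper name doubled_pairs_pattern and the inline word list are made up for illustration, and [[:alpha:]] stands in for \w:

```shell
# Build an ERE that matches n consecutive doubled letters by spelling out
# one group per pair: ([[:alpha:]])\1([[:alpha:]])\2 ...
# Only single-digit back-references are used, so n must be between 1 and 9.
doubled_pairs_pattern() {
    n=$1
    i=1
    pattern=''
    while [ "$i" -le "$n" ]; do
        pattern="${pattern}([[:alpha:]])\\${i}"
        i=$((i + 1))
    done
    printf '%s\n' "$pattern"
}

# Hypothetical word list standing in for /usr/share/dict/words.
printf '%s\n' bookkeeper bookkeeping subbookkeeper banana |
    grep -E -i "$(doubled_pairs_pattern 4)"
```

Because every pair gets its own numbered group, the result does not depend on how a particular grep treats a back-reference inside a repeated group.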

Your regex is correct and there is not a bug. /usr/share/dict/words does not contain the word "subbookkeeper".

On my FreeBSD system it did find a match:
[vaibhavc@freebsd-vai ~]$ cat acb
subbookkeeper
[vaibhavc@freebsd-vai ~]$ egrep "((\w)\2){4,}" -i acb
subbookkeeper

Related

Using grep with a negative lookahead assertion

I have exactly the same question as in this post, however the regex isn't working for me, in bash. RegExp exclusion, looking for a word not followed by another
I want to include all lines of a csv file that include the word "Tom", except when it's followed by "Thumb".
Include: Tom sat by the seashore.
Don't include: Tom Thumb sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.
The regex Tom(?!\s+Thumb) works when I try it out on regex101.com.
But I've tried all these variations and none of them work. What am I missing and how can I work around this? I'm on a Mac.
cat inputfile.csv | grep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep “Tom(?!\s+Thumb)” > Tom.csv
cat inputfile.csv | grep -E Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | grep -E “Tom(?!\s+Thumb)” > Tom.csv

You can't do this with POSIX ERE.
There is no negative lookahead assertion in POSIX extended regular expressions, which is the syntax grep -E activates.
The closest you can get is to combine two separate regexes, one positive match and one negative:
grep -we 'Tom' inputfile.csv | grep -wvEe 'Tom[[:space:]]Thumb'
grep -v excludes any line that matches the given expression; so here, we're first searching for Tom, and then removing Tom Thumb.
However, the requirement to keep "Tom and Tom Thumb sat by the seashore" makes this unworkable, because the second grep drops any line that mentions Tom Thumb at all. In short: you can't do what you're asking for with standard grep, unless your grep supports -P, which makes your original syntax valid. In that case you could use:
grep -Pwe 'Tom(?!\s+Thumb)' <inputfile.csv >Tom.csv
One hack might be a temporary substitution
Assuming you have uuidgen available (it appears to be present in Big Sur) to generate a temporary, unpredictable sigil:
uuid=$(uuidgen)
sed -e "s/Tom Thumb/$uuid/g" <inputfile.csv \
| grep -we 'Tom' \
| sed -e "s/$uuid/Tom Thumb/g" >tom.csv
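To see the difference between the two approaches on the three sample lines from the question, here is a minimal sketch. The function names and the fixed placeholder string are invented for illustration (the answer above generates the sigil with uuidgen instead):

```shell
# The three sample lines from the question.
sample_lines() {
    printf '%s\n' \
        'Tom sat by the seashore.' \
        'Tom Thumb sat by the seashore.' \
        'Tom and Tom Thumb sat by the seashore.'
}

# Positive match followed by a negative match. This also drops the third
# line, because that line contains "Tom Thumb" somewhere.
two_grep_filter() {
    grep -we 'Tom' | grep -wvEe 'Tom[[:space:]]Thumb'
}

# Hypothetical fixed sigil; it must not occur in the real data.
placeholder='__TOM_THUMB_PLACEHOLDER__'

# Hide every "Tom Thumb", keep lines that still contain "Tom", then restore.
placeholder_filter() {
    sed -e "s/Tom Thumb/$placeholder/g" |
        grep -we 'Tom' |
        sed -e "s/$placeholder/Tom Thumb/g"
}

echo 'two greps:'
sample_lines | two_grep_filter
echo 'placeholder:'
sample_lines | placeholder_filter
```

The placeholder variant keeps the first and third lines, which is what the question asks for; the two-grep variant keeps only the first.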

How about a Perl solution:
perl -ne 'print if /Tom(?!\s+Thumb)/' inputfile.csv > Tom.csv
Perl supports lookahead assertions natively (PCRE is modeled on Perl's syntax) and comes pre-installed on the Mac.
The -n option works much like sed's -n: it loops over the input lines without printing them automatically.
The -e option supplies the code on the command line, which is what makes a one-liner possible.
print if /pattern/ is the idiom for printing only the lines that match, so it can stand in for grep.

Keep it simple and just use awk, e.g. using any awk in any shell on every Unix box:
$ awk '{orig=$0; gsub(/Tom Thumb/,"")} /Tom/{print orig}' file
Include: Tom sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.

Grep can use Perl regular expressions (PCRE). From man grep:
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs). This option is experimental when combined with the -z (--null-data) option, and grep -P may warn of unimplemented features.

Regex doesn't work in grep

I am trying to search for filenames in file.txt with this regexp: ([\w.-]+)[.]\w+/gm
It works well on regexr.com, but when I run it through grep with this command I get nothing:
grep -E "([\w.-]+)[.]\w+/gm" file.txt
What am I doing wrong?
Input:
hello.py fasdfasdf
fadsfsdf
f
file.docx fsdfasdf
fadsfsdf.fds
FILE.mp3
Output:
hello.py
file.docx
fadsfsdf.fds
FILE.mp3

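\w is a Perl-style extension rather than part of POSIX ERE, and the trailing /gm is regexr's delimiter and flags, which grep would treat as literal characters to match. Either use the -P option with grep (if supported), or use a standard regular expression instead, adding -o to print only the matching part:
grep -Eo '[[:alnum:]_.-]+[.][[:alnum:]_]+' file.txt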
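A quick way to check a portable pattern against the input from the question is sketched below. It assumes the goal is exactly the output listed there; the function names are made up, [[:alnum:]_] is used as the POSIX spelling of \w, and the /gm suffix is left out because grep has no such flags:

```shell
# Sample input from the question.
sample_input() {
    printf '%s\n' \
        'hello.py fasdfasdf' \
        'fadsfsdf' \
        'f' \
        'file.docx fsdfasdf' \
        'fadsfsdf.fds' \
        'FILE.mp3'
}

# -o prints only the matched text instead of the whole line.
# A "-" placed last inside a bracket expression is a literal hyphen.
find_filenames() {
    grep -Eo '[[:alnum:]_.-]+[.][[:alnum:]_]+'
}

sample_input | find_filenames
```

Using [[:alnum:]_] rather than [[:alpha:]] matters for FILE.mp3, whose extension contains a digit.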

regex match specific pattern

I have
[root@centos64 ~]# cat /tmp/out
[
"i-b7a82af5",
"i-9d78f4df",
"i-92ea58d0",
"i-fa4acab8"
]
I would like to pipe it through sed or grep to match the format "x-xxxxxxxx", i.e. one character, a hyphen, then exactly 8 characters, all drawn from a-z and 0-9, and omit everything else
[root@centos64 ~]# cat /tmp/out| sed s/x-xxxxxxxx/
i-b7a82af5
i-9d78f4df
i-92ea58d0
i-fa4acab8
I know this is basic, but I can only find examples of text substitution.

grep -Eo '[a-z0-9]-[a-z0-9]{8}' file
The -E option makes it recognize extended regular expressions, so it can use {8} to match 8 repetitions.
The -o option makes it only print the part of the line that matches the regexp.
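As a small self-contained check of what -E and -o do together, the sketch below feeds the sample from the question through the same pattern. The function names are invented for illustration:

```shell
# The JSON-like list from the question, one element per line.
sample_ids() {
    printf '%s\n' \
        '[' \
        '"i-b7a82af5",' \
        '"i-9d78f4df",' \
        '"i-92ea58d0",' \
        '"i-fa4acab8"' \
        ']'
}

# -E enables the {8} interval; -o drops the quotes, commas and brackets
# because only the matched text is printed.
extract_ids() {
    grep -Eo '[a-z0-9]-[a-z0-9]{8}'
}

sample_ids | extract_ids
```

Lines that contain no match, such as the bare brackets, produce no output at all.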

Why not just print whatever's between the quotes:
$ sed -n 's/[^"]*"\([^"]*\).*/\1/p' file
i-b7a82af5
i-9d78f4df
i-92ea58d0
i-fa4acab8
$ awk -F\" 'NF>1{print $2}' file
i-b7a82af5
i-9d78f4df
i-92ea58d0
i-fa4acab8

Through GNU sed,
$ sed -nr 's/.*([a-z0-9]-[a-z0-9]{8}).*/\1/p' file
i-b7a82af5
i-9d78f4df
i-92ea58d0
i-fa4acab8

I think this is all you need: [0-9a-zA-Z]-[0-9a-zA-Z]{8}.

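This should work, provided each ID sits on a line by itself (the anchors will not match lines that still carry quotes or commas): ^[a-z0-9]-[a-zA-Z0-9]{8}$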

Bash (grep) regex performing unexpectedly

I have a text file that contains a date in the form dd/mm/yyyy (e.g. 20/12/2012).
I am trying to use grep to parse the date and show it in the terminal. This works
until I hit a certain case:
These are my test cases:
grep -E "\d*" returns 20/12/2012
grep -E "\d*/" returns 20/12/2012
grep -E "\d*/\d*" returns 20/12/2012
grep -E "\d*/\d*/" returns nothing
grep -E "\d+" also returns nothing
Could someone explain to me why I get this unexpected behavior?
EDIT: I get the same behavior if I substitute the " (weak quotes) for ' (strong quotes).

The syntax you used (\d) is not recognised by grep's extended regular expressions (the syntax that -E selects).
Use grep -P instead which uses Perl regex (PCRE). For example:
grep -P "\d+/\d+/\d+" input.txt
grep -P "\d{2}/\d{2}/\d{4}" input.txt # more restrictive
Or, to stick with extended regex, use [0-9] in place of \d:
grep -E "[0-9]+/[0-9]+/[0-9]+" input.txt
grep -E "[0-9]{2}/[0-9]{2}/[0-9]{4}" input.txt # more restrictive
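A minimal sketch of the more restrictive ERE form, combined with -o so that only the date itself is printed. The function name and the sample sentences are invented for illustration:

```shell
# Print every dd/mm/yyyy-shaped substring found on standard input.
# [0-9] is used instead of \d, which -E does not understand.
extract_dates() {
    grep -Eo '[0-9]{2}/[0-9]{2}/[0-9]{4}'
}

printf '%s\n' 'Invoice issued on 20/12/2012.' 'No date here.' | extract_dates
```

Note that this only checks the shape of the text, so something like 99/99/9999 would also be printed.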

You could also use -P instead of -E, which lets grep use PCRE syntax, so that
grep -P "\d+/\d+" file
works too.

grep and egrep/grep -E don't recognize \d as a digit class. Your first three patterns only appear to work because the asterisk lets \d match zero times; no digit is actually being matched by \d.
Use [0-9] or [[:digit:]].

To help troubleshoot cases like this, the -o flag can be helpful as it shows only the matched portion of the line. With your original expressions:
grep -Eo "\d*" returns nothing - a clue that \d isn't doing what you thought it was.
grep -Eo "\d*/" returns / (twice) - confirmation that \d isn't matching while the slashes are.
As noted by others, the -P flag solves the issue by recognizing "\d", but to clarify Explosion Pills' answer, you could also use -E as follows:
grep -Eo "[[:digit:]]*/[[:digit:]]*/" returns 20/12/
EDIT: Per a comment by @shawn-chin (thanks!), --color can be used similarly to highlight the portions of the line that are matched while still showing the entire line:
grep -E --color "[[:digit:]]*/[[:digit:]]*/" returns 20/12/2012 (can't do color here, but the bold "20/12/" portion would be in color)

grep: group capturing

I have following string:
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
and I need to get the value of "scheme_version", which is 1234 in this example.
I have tried
grep -Eo "\"scheme_version\":(\w*)"
however it returns
"scheme_version":1234
How can I get just the value? I know I can add a sed call, but I would prefer to do it with a single grep.

You'll need to use a look behind assertion so that it isn't included in the match:
grep -Po '(?<=scheme_version":)[0-9]+'

This might work for you:
echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' |
sed -n 's/.*"scheme_version":\([^}]*\)}/\1/p'
1234
Sorry it's not grep, so disregard this solution if you like.
Or stick with grep and add:
grep -Eo "\"scheme_version\":(\w*)"| cut -d: -f2
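The grep-plus-cut pipeline can be wrapped up and tried on the sample string as sketched below. The function name is hypothetical, and [0-9]+ is swapped in for \w* so the pattern does not depend on the \w extension:

```shell
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

# grep -o keeps only "scheme_version":1234; cut then takes the part after
# the colon. The "_id" value is not picked up because it is followed by a
# comma rather than a colon.
extract_scheme_version() {
    grep -Eo '"scheme_version":[0-9]+' | cut -d: -f2
}

printf '%s\n' "$json" | extract_scheme_version
```

This relies on the value being a bare integer; a quoted or nested value would need a different pattern or a real JSON parser.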

I would recommend that you use jq for the job. jq is a command-line JSON processor.
$ cat tmp
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
$ cat tmp | jq .scheme_version
1234

As an alternative to the positive lookbehind method suggested by SiegeX, you can reset the match starting point to directly after scheme_version": with the \K escape sequence. E.g.,
$ grep -Po 'scheme_version":\K[0-9]+'
This restarts the matching process after having matched scheme_version":, and tends to have far better performance than the positive lookbehind. Comparing the two on regex101 demonstrates that the reset match start method takes 37 steps and 1ms, while the positive lookbehind method takes 194 steps and 21ms.
You can compare the performance yourself on regex101 and you can read more about resetting the match starting point in the PCRE documentation.

To avoid relying on grep's PCRE feature, which is available in GNU grep but not in the BSD version, another option is ripgrep, e.g.
$ rg -o 'scheme_version.?:(\d+)' -r '$1' <file.json
1234
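The -r (--replace) option replaces each match with the given text; capture group indices (e.g., $5) and names (e.g., $foo) are supported in the replacement string.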
Another example with Python and json.tool module which can validate and pretty-print:
$ python -mjson.tool file.json | rg -o 'scheme_version[^\d]+(\d+)' -r '$1'
1234
Related: Can grep output only specified groupings that match?

You can do this:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | awk -F ':' '{print $4}' | tr -d '}'

To generalize @potong's answer, which only extracts "scheme_version", you can use this expression for any of the keys:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_id":["]*\([^(",})]*\)[",}].*/\1/p'
scheme_version
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_rev":["]*\([^(",})]*\)[",}].*/\1/p'
4-cad1842a7646b4497066e09c3788e724
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"scheme_version":["]*\([^(",})]*\)[",}].*/\1/p'
1234
