Match specific length words, anchored, without doing magic math

Let's say I wanted to find all 12-letter words in /usr/share/dict/words that started with c and ended with er. Off the top of my head, a workable pattern could look something like:
grep -E '^c.{9}er$' /usr/share/dict/words
It finds:
cabinetmaker
calcographer
calligrapher
campanologer
campylometer
...
But that .{9} bothers me. It feels too magical, subtracting the total length of all the anchor characters from the number defined in the original constraint.
Is there any way to rewrite this regex so it doesn't require doing this calculation up front, allowing a literal 12 to be used directly in the pattern?

You can use the -x option, which selects only matches that span the whole line, and apply the anchors in a second grep:
grep -xE '.{12}' /usr/share/dict/words | grep '^c.*er$'
Or use the -P option, which interprets the pattern as a Perl-compatible regular expression, and use a lookahead assertion:
grep -P '^(?=.{12}$)c.*er$' /usr/share/dict/words
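A minimal sketch of the two-stage filter, run on a small hypothetical word list instead of /usr/share/dict/words so the result can be checked by eye. The variable names words and by_pipeline are illustrative only.

```shell
# Hypothetical sample list standing in for /usr/share/dict/words
words='cabinetmaker
calligrapher
caterer
computer
photographer'

# Stage 1 keeps lines that are exactly 12 characters long;
# stage 2 applies the anchors, independently of the length.
by_pipeline=$(printf '%s\n' "$words" | grep -xE '.{12}' | grep -E '^c.*er$')
printf '%s\n' "$by_pipeline"
```

The second grep carries its own ^ and $ so that c and er are tied to the ends of the word; the literal 12 appears only once, in the first stage.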

You can use awk as an alternative and avoid this calculation:
awk -v len=12 'length($1)==len && $1 ~ /^c.*er$/' file
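The same idea can be sketched without any regex at all, assuming the prefix and suffix are plain strings. The helper name match_len and its argument order are hypothetical, not part of any tool.

```shell
# Hypothetical helper: match_len LEN PREFIX SUFFIX, reads one word per line on stdin.
match_len() {
  awk -v len="$1" -v pre="$2" -v suf="$3" '
    length($0) == len &&
    substr($0, 1, length(pre)) == pre &&
    substr($0, len - length(suf) + 1) == suf
  '
}

printf '%s\n' cabinetmaker caterer photographer calligrapher | match_len 12 c er
```

Because the comparison uses substr() rather than a pattern, the length, prefix, and suffix are each stated once and no subtraction has to be done by hand.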

I don't know grep so well, but some more advanced NFA RegEx implementations provide you with lookaheads and lookbehinds. If you can figure out any means to make those available for you, you could write:
^(?=c).{12}(?<=er)$
Maybe as a Perl one-liner like this?
perl -ne 'print if m/^(?=c).{12}(?<=er)$/' /usr/share/dict/words

One approach with GNU sed:
$ sed -nr '/^.{12}$/{/^c.*er$/p}' words
With BSD sed (Mac OS) it would be:
$ sed -nE '/^.{12}$/{/^c.*er$/p;}' words
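If neither lookaheads nor a second pass are wanted, one further option is to leave the subtraction to the shell, so the literal 12 still appears only once. This is a sketch under the assumption that the prefix and suffix contain no regex metacharacters; the variable names are illustrative.

```shell
len=12 prefix=c suffix=er                   # the constraint, stated once
mid=$(( len - ${#prefix} - ${#suffix} ))    # characters left for the middle
pattern="^${prefix}.{${mid}}${suffix}\$"    # assembled extended regex

printf '%s\n' "$pattern"
printf '%s\n' cabinetmaker caterer calligrapher | grep -E "$pattern"
```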

Related

Using grep with a negative lookahead assertion

I have exactly the same question as in this post, however the regex isn't working for me, in bash. RegExp exclusion, looking for a word not followed by another
I want to include all lines of a csv file that include the word "Tom", except when it's followed by "Thumb".
Include: Tom sat by the seashore.
Don't include: Tom Thumb sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.
The regex Tom(?!\s+Thumb) works when I try it out on regex101.com.
But I've tried all these variations and none of them work. What am I missing and how can I work around this? I'm on a Mac.
cat inputfile.csv | grep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep “Tom(?!\s+Thumb)” > Tom.csv
cat inputfile.csv | grep -E Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | grep -E “Tom(?!\s+Thumb)” > Tom.csv

You can't do this with POSIX ERE.
There is no negative lookahead assertion in POSIX extended regular expressions, which is the syntax grep -E activates.
The closest you can get is to combine two separate regexes, one positive match and one negative:
grep -we 'Tom' inputfile.csv | grep -wvEe 'Tom[[:space:]]Thumb'
grep -v excludes any line that matches the given expression; so here, we're first searching for Tom, and then removing Tom Thumb.
However, the intent to match Tom and Tom Thumb sat by the seashore makes this unworkable. In short: You can't do what you're asking for with standard grep, unless it has grep -P to make your original syntax valid. In that case you could use:
grep -Pwe 'Tom(?!\s+Thumb)' <inputfile.csv >Tom.csv
One hack might be a temporary substitution
Assuming you have uuidgen available (it appears to be present in Big Sur) to generate a temporary, unpredictable sigil:
uuid=$(uuidgen)
sed -e "s/Tom Thumb/$uuid/g" <inputfile.csv \
| grep -we 'Tom' \
| sed -e "s/$uuid/Tom Thumb/g" >tom.csv

How about a Perl solution:
perl -ne 'print if /Tom(?!\s+Thumb)/' inputfile.csv > Tom.csv
Perl supports lookahead assertions natively and comes pre-installed on macOS.
The -n option is roughly equivalent to sed's -n: it loops over the input lines without printing them automatically.
The -e option supplies the code inline as a one-liner.
The code print if /pattern/ is an idiom that prints each matching line, so it can stand in for grep.

Keep it simple and just use awk, e.g. using any awk in any shell on every Unix box:
$ awk '{orig=$0; gsub(/Tom Thumb/,"")} /Tom/{print orig}' file
Include: Tom sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.

Grep can use Perl regular expressions (PCRE). From man grep:
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs). This option is experimental when combined with the -z (--null-data) option, and grep -P may warn of unimplemented features.

sed regex with alternative on Solaris doesn't work

Currently I'm trying to use sed with regex on Solaris but it doesn't work.
I need to show only lines matching to my regex.
sed -n -E '/^[a-zA-Z0-9]*$|^a_[a-zA-Z0-9]*$/p'
input file:
grtad
a_pitr
_aupa
a__as
baman
12353
ai345
ki_ag
-MXx2
!!!23
+_)#*
I want to show only lines matching to above regex:
grtad
a_pitr
baman
12353
ai345
Is there another way to use alternative? Is it possible in perl?
Thanks for any solutions.

With Perl
perl -ne 'print if /^(a_)?[a-zA-Z0-9]*$/' input.txt
The (a_)? matches a_ zero or one time, i.e. optionally. It may or may not be there.
The (a_) also captures the match, which is not needed here, so you can use (?:a_)? instead. The ?: makes the parentheses group only (so the ? still applies to the whole a_) without remembering the match.

with grep
$ grep -xiE '(a_)?[a-z0-9]*' ip.txt
grtad
a_pitr
baman
12353
ai345
-x match the whole line
-i ignore case
-E extended regex; if not available, GNU grep also accepts the basic-regex form grep -xi '\(a_\)\?[a-z0-9]*' (\? is a GNU extension)
(a_)? match a_ zero or one time
[a-z0-9]* zero or more letters or digits
With sed
sed -nE '/^(a_)?[a-zA-Z0-9]*$/p' ip.txt
or, with GNU sed
sed -nE '/^(a_)?[a-z0-9]*$/Ip' ip.txt
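If the sed at hand has no -E at all (assumed here to be the problem on Solaris), the alternation can be sketched as two separate basic-regex expressions, which any POSIX sed should accept. The input below is the sample from the question.

```shell
input='grtad
a_pitr
_aupa
a__as
baman
12353
ai345
ki_ag
-MXx2
!!!23
+_)#*'

# Each -e holds one branch of the alternation, written as a basic regex.
# No line can match both branches, so nothing is printed twice.
result=$(printf '%s\n' "$input" | sed -n -e '/^[a-zA-Z0-9]*$/p' -e '/^a_[a-zA-Z0-9]*$/p')
printf '%s\n' "$result"
```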

Grep Regex Exclusion Special Character

I am having a difficult time trying to search for a phrase but exclude the phrase if it is directly followed by a colon-space.
I am looking for Delet! (i.e. "Delet.*" in regex syntax) but I do not want anything returned that is "Deleted: " (includes a space after the colon). However, I would like anything returned that is "Deleted" followed by anything other than a colon-space.
I have tried the following expressions
grep -ri 'delet.*[^:]'
grep -ri 'delet[a-zA-Z0-9\;\".....]{0,10}'
(including all special characters in the range preceded by escapes)

Using a lookahead expression:
grep -Pi 'Delet(?!ed: )'
Note the modification of the parameters of grep: -P enables the use of lookahead expressions.

Try this. The ? after the * makes the quantifier lazy, so it selects as few non-space characters as possible, followed by any one character that is not a colon, followed by a space. Lazy quantifiers are a Perl extension, so grep needs -P here:
grep -rPi 'delet[^ ]*?[^:] '

If I got you correctly you want anything starting with delet, and not starting with deleted::
grep -Ei '^delet((([^e]|e$)|e([^d]|d$)|ed([^:]|:$)|ed:[^ ]).*)?$'
This basically says:
Match [start]deletX[anything][end] or [start]delete[end] where X is not e
Match [start]deleteX[anything][end] or [start]deleted[end] where X is not d
Match [start]deletedX[anything][end] or [start]deleted:[end] where X is not :
Match [start]deleted:X[anything][end] where X is not space.
It would have been far easier to use pipe and second negative grep if that is applicable:
grep -i ^delet | grep -vi '^deleted: '

It sounds like all you need is:
awk -v IGNORECASE=1 '/delet/ && !/deleted: /' file
The above uses GNU awk for IGNORECASE, other awks would use tolower().
The benefit of awk over grep is that awk tests for conditions, not just regexps, so you can create compound conditions using && and || out of tests for regexps which makes it MUCH simpler and clearer to just code the condition you want to test - that the line contains delet and (&&) not (!) deleted:.
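For awks without IGNORECASE, the tolower() variant mentioned above might look like the following sketch. The sample lines and variable names are hypothetical.

```shell
# Hypothetical sample log lines
log='Deleted: old.txt
deleted old.txt
Deleting: now
kept new.txt'

# tolower() normalises the line once; the two tests mirror /delet/ && !/deleted: /
kept=$(printf '%s\n' "$log" | awk '{ l = tolower($0) } l ~ /delet/ && l !~ /deleted: /')
printf '%s\n' "$kept"
```

Note that, like the original command, this drops a line entirely if "deleted: " appears anywhere on it, even when another "delet" match is present.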

Bash (grep) regex performing unexpectedly

I have a text file, which contains a date in the form of dd/mm/yyyy (e.g 20/12/2012).
I am trying to use grep to parse the date and show it in the terminal, and it is successful,
until I meet a certain case:
These are my test cases:
grep -E "\d*" returns 20/12/2012
grep -E "\d*/" returns 20/12/2012
grep -E "\d*/\d*" returns 20/12/2012
grep -E "\d*/\d*/" returns nothing
grep -E "\d+" also returns nothing
Could someone explain to me why I get this unexpected behavior?
EDIT: I get the same behavior if I substitute the " (weak quotes) for ' (strong quotes).

The \d shorthand is not part of POSIX extended regular expressions, the syntax that grep -E uses (this is unrelated to Bash itself).
Use grep -P instead which uses Perl regex (PCRE). For example:
grep -P "\d+/\d+/\d+" input.txt
grep -P "\d{2}/\d{2}/\d{4}" input.txt # more restrictive
Or, to stick with extended regex, use [0-9] in place of \d:
grep -E "[0-9]+/[0-9]+/[0-9]+" input.txt
grep -E "[0-9]{2}/[0-9]{2}/[0-9]{4}" input.txt # more restrictive

You could also use -P instead of -E, which lets grep use PCRE syntax:
grep -P "\d+/\d+" file

grep and egrep/grep -E don't recognize \d as a digit class. Your first three patterns appear to work only because of the asterisk: \d* can match zero occurrences, so the pattern succeeds without \d ever matching a digit.
Use [0-9] or [[:digit:]].
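A short sketch of both portable spellings, using -o (available in GNU and BSD grep) to show exactly what each pattern consumed. The variable names are illustrative.

```shell
line='20/12/2012'

# -o prints only the matched text, which makes it obvious what the pattern consumed.
bracket=$(printf '%s\n' "$line" | grep -Eo '[0-9]{2}/[0-9]{2}/[0-9]{4}')
class=$(printf '%s\n' "$line" | grep -Eo '[[:digit:]]+/[[:digit:]]+/')

printf '%s\n%s\n' "$bracket" "$class"
```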

To help troubleshoot cases like this, the -o flag can be helpful as it shows only the matched portion of the line. With your original expressions:
grep -Eo "\d*" returns nothing - a clue that \d isn't doing what you thought it was.
grep -Eo "\d*/" returns / (twice) - confirmation that \d isn't matching while the slashes are.
As noted by others, the -P flag solves the issue by recognizing "\d", but to clarify Explosion Pills' answer, you could also use -E as follows:
grep -Eo "[[:digit:]]*/[[:digit:]]*/" returns 20/12/
EDIT: Per a comment by #shawn-chin (thanks!), --color can be used similarly to highlight the portions of the line that are matched while still showing the entire line:
grep -E --color "[[:digit:]]*/[[:digit:]]*/" returns 20/12/2012 (can't do color here, but the bold "20/12/" portion would be in color)

matching a specific substring with regular expressions using awk

I'm dealing with a specific filenames, and need to extract information from them.
The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"
with RANDOMSTR a string of max 22 chars, and which may contain a substring (or not) with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the unique feature of starting with "-W".
The information I need to extract is the substring of RANDOMSTR without this optional substring.
I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:
gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045
The expected results are:
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING
How can I get the desired effect.
Thanks.

You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.
$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING

While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option seems to be available only in GNU grep (the default on Linux). It's also pretty simple to do this in awk.
$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$
Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:
$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'

The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.
If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.
str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo "${str%.raw.gz}" | # remove trailing .raw.gz
sed 's/-W.*$//' | # remove trailing -W.*, if any
sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'
I used sed, but you can just as well use gawk/awk.
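Since the goal is a bash script, the same multi-pass idea can also be sketched with parameter expansion alone, with no external tools. This assumes the filename always has four underscore-delimited fields before RANDOMSTR and that -W followed by a digit only appears in the optional suffix; the function name extract is hypothetical.

```shell
# Hypothetical helper: print RANDOMSTR without the optional -W suffix.
extract() {
  name=${1%.raw.gz}        # drop the trailing .raw.gz
  name=${name#*_*_*_*_}    # drop the four leading underscore-delimited fields
  printf '%s\n' "${name%%-W[0-9]*}"   # drop everything from -W<digit> onwards, if present
}

extract "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
extract "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
```

The # expansion removes the shortest matching prefix, so underscores inside RANDOMSTR itself are left alone.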

Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:
sed -E -e 's/^.{27}(.*)\.raw\.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO
