Bash (grep) regex performing unexpectedly - regex

I have a text file, which contains a date in the form of dd/mm/yyyy (e.g 20/12/2012).
I am trying to use grep to parse the date and show it in the terminal, and it is successful,
until I meet a certain case:
These are my test cases:
grep -E "\d*" returns 20/12/2012
grep -E "\d*/" returns 20/12/2012
grep -E "\d*/\d*" returns 20/12/2012
grep -E "\d*/\d*/" returns nothing
grep -E "\d+" also returns nothing
Could someone explain to me why I get this unexpected behavior?
EDIT: I get the same behavior if I substitute the " (weak quotes) for ' (strong quotes).

The syntax you used (\d) is not recognised by Bash's Extended regex.
Use grep -P instead which uses Perl regex (PCRE). For example:
grep -P "\d+/\d+/\d+" input.txt
grep -P "\d{2}/\d{2}/\d{4}" input.txt # more restrictive
Or, to stick with extended regex, use [0-9] in place of \d:
grep -E "[0-9]+/[0-9]+/[0-9]" input.txt
grep -E "[0-9]{2}/[0-9]{2}/[0-9]{4}" input.txt # more restrictive

You could also use -P instead of -E which allows grep to use the PCRE syntax
grep -P "\d+/\d+" file
does work too.

grep and egrep/grep -E don't recognize \d. The reason your first three patterns work is because of the asterisk that makes \d optional. It is actually not found.
Use [0-9] or [[:digit:]].

To help troubleshoot cases like this, the -o flag can be helpful as it shows only the matched portion of the line. With your original expressions:
grep -Eo "\d*" returns nothing - a clue that \d isn't doing what you thought it was.
grep -Eo "\d*/" returns / (twice) - confirmation that \d isn't matching while the slashes are.
As noted by others, the -P flag solves the issue by recognizing "\d", but to clarify Explosion Pills' answer, you could also use -E as follows:
grep -Eo "[[:digit:]]*/[[:digit:]]*/" returns 20/12/
EDIT: Per a comment by #shawn-chin (thanks!), --color can be used similarly to highlight the portions of the line that are matched while still showing the entire line:
grep -E --color "[[:digit:]]*/[[:digit:]]*/" returns 20/12/2012 (can't do color here, but the bold "20/12/" portion would be in color)

Related

Why does this regex work in grep but not sed?

I have two regular expressions:
$ grep -E '\-\- .*$' *.sql
$ sed -E '\-\- .*$' *.sql
(I am trying to grep lines in sql files that have comments and remove lines in sql files that have comments)
The grep command works using this regex; however, the sed returns the following error:
sed: -e expression #1, char 7: unterminated address regex
What am I doing incorrectly with sed?
(The space after the two hyphens is required for sql comments if you are unfamiliar with MySql comments of this type)
You're trying to use:
sed -E '\-\- .*$' *.sql
Here sed command is not correct because you're not really telling sed to do something.
It should be:
sed -n '/-- /p' *.sql
and equivalent grep would be:
grep -- '-- ' *.sql
or even better with a fixed string search:
grep -F -- '-- ' *.sql
Using -- to separate pattern and arguments in grep command.
There is no need to escape - in a regex if it is outside bracket expression (or character class) i.e. [...].
Based on comments below it seems OP's intent is to remove commented section in all *.sql files that start with 2 hyphens.
You may use this sed for that:
sed -i 's/-- .*//g' *.sql
The problem here is not the regex, the problem is that sed requires a command. The equivalent of your grep would be:
sed -n '/\-\- .*$/p'
You suppress output for non-matching lines -n ... you search (wrap your regex in slashes) and you print p (after the last slash).
P.S.: As Anub pointed out, escaping the hyphens - inside the regex is unnecessary.
You are trying to use sed's \cregexpc syntax where with \-<...> you are telling sed the delimiter character you want use is a dash -, but you didn't terminate it where it should be: \-<...>- also add d command to delete those lines.
sed '\-\-\-.*$-d' infile
see man sed about that:
\cregexpc
Match lines matching the regular expression regexp. The c may be any character.
if default / was used this was not required so:
sed '/--.*$/d' infile
or simply:
sed '/^--/d' infile
and more accurately:
sed '/^[[:blank:]]*--/d' infile

grep regex for return numerical values in between string

so i'm running the linux command
ls /etc/systemd/system | grep -o -E "[0-9]+"
which should return just numerical values, the only problem it returns some unwanted numerical values from parts of results i dont want, i want only the numerical values between - and .service so in like test-blah4-1321.service i just want it to return 1321. What am i missing here?
example
$ ls /etc/systemd/system
test.service test-blah4-1321.service test-blah2.service test-blah5-1387.service test-blah3-1521.service
GNU grep has the -P option for perl-style regexes, and the -o option to print only what matches the pattern. These can be combined using look-around assertions (described under Extended Patterns in the perlre manpage) to remove part of the grep pattern from what is determined to have matched for the purposes of -o.
Source
Applied to your example this would be:
echo test-blah4-1321.service | grep -oP '(?<=-)\d+(?=\.service)'
when I need look ahead or look behind tests I usually switch to a Perl one-liner.
this should do the trick.
echo test-blah4-1321.service | perl -ne 'm/(?<=-)(\d+)(?=\.service)/g; print "$1\n";'

grep -ioP '(?<=Arguments=\")(.*)(?=":Language=)' in Bash scripting

I was going through a bash script and read a line which says:
echo "Some line..." | grep -ioP '(?<=Arguments=\")(.*)(?=":Language=)'
I understood the grep part i.e grep -ioP but the input to grep i.e
'(?<=Arguments=\")(.*)(?=":Language=)' type of expression is encountered for the first time.
What does it mean? Does it mean something special to grep or it's just the grepping of similar string from the echo string part?
Thanks!
These are look-around assertions. (?<...) is a look-behind (what precedes) and (?=...) is a look-forward (what follows). The reason for them is they aren't part of the match, so -o won't output them.

regex grep/egrep problems

Based on searches on stackoverflow, I found out the difference between grep/egrep, but I still can't determine why this doesn't work. I've even check it at https://regex101.com/ and it shows it checked out right.
Regex:
.*ping[\/] or ping\D
searching against (text.txt):
path=/bin/ping6
path=/bin/ping
I'm trying to skip the first and only find the 2nd.
If I would do grep ping text.txt it finds both which isn't what I want.
grep -e ".*ping[\/]" text.txt [doesn't work]
egrep ".*ping[\/]" text.txt [doesn't work]
grep -P ".*ping[\/]" text.txt [doesn't work]
I did get this to work but not sure why:
grep -P "ping\D" text.txt [worked]
grep -e "ping\D" text.txt [doesn't work]
What am I failing to understand with grep -e/egrep/grep -p/regex?
When I man grep -P it said it's highly experimental and not to use it.
\D is the letter "D" for extended regular expressions. For perl regexes, \D is a non-digit and \d is a digit.
To search for "ping" followed optionally by a digit followed by a slash, you want:
grep 'ping[[:digit:]]\?/'
grep -E 'ping[[:digit:]]?/'
To search for "ping" not followed by a digit:
grep -E 'ping($|[^[:digit:]])' # ping then end-of-line or non-digit
grep -P 'ping(?!\d)'
GNU extended regular expressions documented here.

matching a specific substring with regular expressions using awk

I'm dealing with a specific filenames, and need to extract information from them.
The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"
with RANDOMSTR a string of max 22 chars, and which may contain a substring (or not) with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the unique feature of starting with "-W".
The information I need to extract is the substring of RANDOMSTR without this optional substring.
I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:
gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045
The expected results are:
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING
How can I get the desired effect.
Thanks.
You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.
$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING
While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option only seems to be available in Linux. It's also pretty simple to do this in awk.
$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$
Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:
$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'
The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.
If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.
str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo ${str%.raw.gz} | # remove trailing .raw.gz
sed 's/-W.*$//' | # remove trainling -W.*, if any
sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'
I used sed, but you can just as well use gawk/awk.
Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:
sed -E -e 's/^.{27}(.*).raw.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO