ERE - adding quantifier to group with inner group and back-reference - regex

Was trying to get words with consecutive repeated letters occurring twice or thrice. Not able to find a way to use a quantifier together with a capture group and back-reference using ERE
$ grep --version | head -n1
grep (GNU grep) 2.25
$ # consecutive repeated letters occurring twice
$ grep -m5 -xiE '[a-z]*([a-z])\1[a-z]*[a-z]*([a-z])\2[a-z]*' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Works with -P though
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){3}' /usr/share/dict/words
Chattahoochee
McConnell
Mississippi
Mississippian
Mississippians
Thanks Casimir et Hippolyte for coming up with simpler input and regex to test this behavior
$ echo 'aazbb' | grep -E '(([a-z])\2[a-z]*){2}' || echo 'No match'
aazbb
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){2}([a-z])\3[a-z]*' || echo 'No match'
aazbbycc
$ echo 'aazbbycc' | grep -P '(([a-z])\2[a-z]*){3}' || echo 'No match'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){3}' || echo 'No match'
No match
Same behavior seen with sed as well
$ sed --version | head -n1
sed (GNU sed) 4.2.2
$ echo 'aazbb' | sed -E '/(([a-z])\2[a-z]*){2}/! s/.*/No match/'
aazbb
$ echo 'aazbbycc' | sed -E '/(([a-z])\2[a-z]*){2}([a-z])\3[a-z]*/! s/.*/No match/'
aazbbycc
$ # failing case
$ echo 'aazbbycc' | sed -E '/(([a-z])\2[a-z]*){3}/! s/.*/No match/'
No match
Related search links, I checked some of them, but didn't get anything close to this question
https://savannah.gnu.org/bugs/?group=grep
http://lists.gnu.org/archive/html/bug-sed/
If this is solved in a newer version of grep or sed, let me know. Also let me know if the issue is seen in non-GNU implementations.

I suppose -E doesn't handle a quantifier applied to a group that contains a back-reference, which is why it works only with -P
to match 2 or more consecutive groups of repeated letters:
grep -P '(?:([a-z])\1*([a-z])\2){1}' /usr/share/dict/words
to match 3 or more consecutive groups of repeated letters:
grep -P '(?:([a-z])\1*([a-z])\2){2}' /usr/share/dict/words
Options:
-P, --perl-regexp PATTERN is a Perl regular expression

Update
After searching around, I installed gnugrep32 on my windows box, then ran
some tests:
I read this from an old SO post:
Non-greedy matching is not part of the Extended Regular Expression syntax supported by grep
So, we use [a-z]{0,20} as a test instead of [a-z]* or [a-z]*? where the ? is ignored (wtf?)
Below are incremental tests using the overall (){n} to see how far it will go before it STOPS BACKTRACKING
into frames.
Min to work
(([a-z])\2[a-z]{0,20}){1} len = 2 rr
(([a-z])\2[a-z]{0,20}){2} len = 4 rrrr
(([a-z])\2[a-z]{0,20}){3} len = 25 rrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])\2[a-z]{0,20}){4} len = 47 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])\2[a-z]{0,20}){5} len = 69 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
(([a-z])\2[a-z]{0,20}){6} len = 91 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
From {3} to {6} the delta lengths are equal to 22.
This happens to be the full length of the capture frame expression ([a-z])\2[a-z]{0,20}
when it does not backtrack into previous frames.
Conclusion is that it automatically stops backtracking after 2 frames.
It makes sense given that, for example, out of 20 frames, it gets to 16 and finds it cannot match.
Should it go back to frame 1, adjust there, and try it all over again?
Why yes it should.
However, it has now consumed so much memory, the bloated pig has to unwind it all.
This could take forever with this old archaic utility.
Hey, better cap it to 2 frames.
Of course, there is no test case for (([a-z])\2[a-z]*){3} since the greedy quantifier *
will consume the entire line on the second frame if they are all [a-z] and never even
start a third frame.

$ # no output for this, why?
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Because you search for a double group (twice the same) that has (at least) a double letter inside, something like abbcabbc [(...) = "abbc" 2 times], and not 2 (possibly similar) groups that each have a double letter inside, like abbcdeef.
With 2 back-references:
$ grep -iE '[a-z]*([a-z])\1{1,}[a-z]*([a-z])\2{1,}[a-z]*'

I filed an issue https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864 and the manual is now updated to reflect such issues.
From https://www.gnu.org/software/grep/manual/grep.html#Known-Bugs:
Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2020 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.
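As a workaround sketch for the ERE case, the repetition can be unrolled so that each repeat gets its own capture group and back-reference, which is the shape of the pattern the question showed matching under -E. The helper name unrolled_pattern is hypothetical, and because back-references only go up to \9, this only covers 1 to 9 repetitions.

```shell
#!/bin/sh
# unrolled_pattern N: print N copies of "([a-z])\i[a-z]*", one group per repeat.
# Hypothetical helper; N must be between 1 and 9 (back-references stop at \9).
unrolled_pattern() {
    n=$1
    pat=''
    i=1
    while [ "$i" -le "$n" ]; do
        # "\\" inside double quotes yields one literal backslash before the index
        pat="${pat}([a-z])\\${i}[a-z]*"
        i=$((i + 1))
    done
    printf '%s\n' "$pat"
}

# Same input as the failing {3} case in the question
echo 'aazbbycc' | grep -E "$(unrolled_pattern 3)" || echo 'No match'
```

For whole-word matching against a dictionary, the generated pattern can be wrapped the same way as in the question, for example with -x and a leading [a-z]*.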

Related

Bash script to extract 10 most common double-vowel words from a file

So I have tried to write a Bash script to extract the 10 most common double-vowels words from a file, like good, teeth, etc.
Here is what I have so far:
grep -E -o '[aeiou]{2}' $1|tr 'A-Z' 'a-z' |sort|uniq -c|sort -n | tail -10
I tried to use grep with the -E flag to find pattern matches such as 'aa', 'ee', 'ii', etc., but it is not working at all.
What I got back was just 'ai', 'ea', and the like. Can anyone help me figure out how to do this pattern match in a Bash script?
You can simply match any number of letters before and after a repeated vowel with this POSIX ERE regex in GNU grep:
grep -oE '[[:alpha:]]*([aeiou])\1[[:alpha:]]*' words.txt
FreeBSD (non-GNU) grep does not support a backreference in the pattern, so you will have to list all possible vowel sequences:
grep -oE '[[:alpha:]]*(aa|ee|ii|oo|uu)[[:alpha:]]*' words.txt
See the online demo:
#!/bin/bash
s='Some good feed
Soot and weed'
grep -oE '[[:alpha:]]*([aeiou])\1[[:alpha:]]*' <<< "$s"
Details:
[[:alpha:]]* - zero or more letters
(aa|ee|ii|oo|uu) - one of the char sequences, aa, ee, ii, oo or uu (| is an alternation operator in a POSIX ERE regex)
([aeiou]) - Group 1: a vowel
\1 - the same vowel as in Group 1
[[:alpha:]]* - zero or more letters
Simple way to change your regex: replace [aeiou]{2} with aa|ee|ii|oo|uu. (This does not fix the issue of only finding the match rather than the full word.)
Building on Andrew's answer (re: matching double vowels):
$ cat words.txt
good food;foul make chicken,eek too brave
eye you yuu something:three food too tu too
$ grep -E -o '\<[[:alnum:]]*(aa|ee|ii|oo|uu)[[:alnum:]]*\>' words.txt
good
food
eek
too
yuu
three
food
too
too
The grep finds only words (\< and \> represent word boundaries) with letters and/or digits and containing a dual vowel, printing each word on a separate line.
Applying the rest of OP's counting/sorting logic:
$ grep -E -o '\<[[:alnum:]]*(aa|ee|ii|oo|uu)[[:alnum:]]*\>' words.txt | sort | uniq -c | sort -n
1 eek
1 good
1 three
1 yuu
2 food
3 too
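Putting the pieces of this thread together, a sketch of the full "top 10" pipeline could look like the following. The function name is hypothetical. It lowercases the input first so that Good and good count as the same word, and it sorts in descending order so the most common words come first. The trailing awk only normalizes the padding that uniq -c adds.

```shell
#!/bin/sh
# top_double_vowel_words FILE: up to 10 most frequent words containing aa, ee, ii, oo or uu
top_double_vowel_words() {
    tr 'A-Z' 'a-z' < "$1" |
        grep -oE '[[:alpha:]]*(aa|ee|ii|oo|uu)[[:alpha:]]*' |
        sort | uniq -c |
        sort -k1,1nr -k2 |
        head -n 10 |
        awk '{ print $1, $2 }'
}
```

It would be called as top_double_vowel_words words.txt. Ties in the count are broken alphabetically by the second sort key.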

Extract version using grep/regex in bash

I have a file that has a line stating
version = "12.0.08-SNAPSHOT"
The word version and quoted strings can occur on multiple lines in that file.
I am looking for a single line bash statement that can output the following string:
12.0.08-SNAPSHOT
The version can have RELEASE tag too instead of SNAPSHOT.
So to summarize, given
version = "12.0.08-SNAPSHOT"
expected output: 12.0.08-SNAPSHOT
And given
version = "12.0.08-RELEASE"
expected output: 12.0.08-RELEASE
The following command prints strings enquoted in version = "...":
grep -Po '\bversion\s*=\s*"\K.*?(?=")' yourFile
-P enables perl regexes, which allow us to use features like \K and so on.
-o only prints matched parts instead of the whole lines.
\b ensures that version starts at a word boundary and we do not match things like abcversion.
\s stands for any kind of whitespace.
\K lets grep forget that it matched the part before \K. The forgotten part will not be printed.
.*? matches as few characters as possible (the matching part will be printed) ...
(?=") ... until we see a ", which won't be included in the match either (this is called a lookahead).
Not all grep implementations support the -P option. Alternatively, you can use perl, as described in this answer:
perl -nle 'print $& if m{\bversion\s*=\s*"\K.*?(?=")}' yourFile
Seems like a job for cut:
$ echo 'version = "12.0.08-SNAPSHOT"' | cut -d'"' -f2
12.0.08-SNAPSHOT
$ echo 'version = "12.0.08-RELEASE"' | cut -d'"' -f2
12.0.08-RELEASE
Portable solution:
$ echo 'version = "12.0.08-RELEASE"' |sed -E 's/.*"(.*)"/\1/g'
12.0.08-RELEASE
or even:
$ perl -pe 's/.*"(.*)"/$1/g'
$ awk -F"\"" '{print $2}'
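Since quoted strings occur on many lines of the file, the cut, sed, and awk one-liners above would print something for every line they are fed. A sketch that stays portable (no -P) while only looking at version = "..." lines might be the following. extract_version is a hypothetical name, and the pattern assumes that version starts its line apart from optional indentation.

```shell
#!/bin/sh
# Print only the quoted value from lines of the form: version = "..."
# With no arguments the function reads standard input; otherwise it reads the named files.
extract_version() {
    sed -n 's/^[[:space:]]*version[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$@"
}

printf '%s\n' 'name = "demo"' 'version = "12.0.08-SNAPSHOT"' | extract_version
```

If version appears on several lines, each matching line produces one output line, so piping through head -n 1 would be one way to keep only the first.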

non matching groups in grep regex not working

I would like to extract 1, 10, and 100 from:
1 one -args 123
10 ten -args 123
100 one hundred -args 123
However this regex returns 100:
echo -e " 1 one\n 10 ten\n100 one hundred" | grep -Po '^(?=[ ]*)\d+(?=.*)'
100
Not ignoring the preceding spaces returns the numbers (but of course with undesired spaces):
echo -e " 1 one\n 10 ten\n100 one hundred" | grep -Po '^[ ]*\d+(?=.*)'
1
10
100
Have I misunderstood non capturing regex groups in grep / Perl (grep version 2.2, Perl as the -P flag should use its regex) or is this a bug? I notice the release notes for 2.6 says "This release fixes an unexpectedly large number of flaws, from outright bugs (surprisingly many, considering this is "grep")".
If someone with 2.6 could try these examples that would be valuable to determine if this is a bug (in 2.2) or intended behaviour.
The issue is what is considered a 'match' by grep. In the absence of telling grep part of the total match is not what you want, it prints everything up to the end of the match regardless of matching groups.
Given:
$ echo "$txt"
1 one -args 123
10 ten -args 123
100 one hundred -args 123
You can get just the first column of digits without leading spaces several ways.
With GNU grep:
$ echo "$txt" | grep -Po '^[ ]*\K\d+'
1
10
100
Here \K is equivalent to a look behind assertion that resets the match text of the match to be what comes after. The left hand, before the \K, is required to match, but is not included in match text printed by grep.
awk:
$ echo "$txt" | awk '/^[ ]*[0-9]+/{print $1}'
sed:
$ echo "$txt" | sed 's/^[ ]*\([0-9]*\).*/\1/'
Perl:
$ echo "$txt" | perl -lne 'print $1 if /^[ ]*\K(\d+)/'
And then if you want the matches on a single line, run through xargs:
$ echo "$txt" | grep -Po '^[ ]*\K(\d+)' | xargs
1 10 100
Or, if you are using awk or Perl, just change the way it is printed so that it does not include a newline.
You can delete the unwanted spaces this way:
echo -e " 1 one\n 10 ten\n100 one hundred" | grep -Po '^[ ]*(\d+)' | tr -d ' '
As for your question of why it is not working, it is not a bug, it is working as intended, you just misinterpreted how it should work.
If we focus on this ^(?=[ ]*)\d+:
The (?=[ ]*) part is a lookahead assertion. It means that the regex engine checks whether the ^ is followed by zero or more spaces. But the assertion itself is not part of the match and consumes nothing, so in reality this code means:
- Match a ^ that is followed by 0 or more spaces
- After this ^, match one or more digits
So your code will only match when a digit is the first character of the line. The lookahead won't help you on your use case.
I think the anchor messes with the lookahead, which could be a lookbehind, but they can't be ambiguous (I always run into that one). So the following would work:
echo -e " 1 one\n 10 ten\n100 one hundred" | grep -Po '(?=[ ]*)\d+(?=.*)'
As for a better tool, I would use awk as it is suited to any column driven data. So if you were running it off of ps you could do something like:
ps | awk '/stuff you want to look for here/{print $1}'
awk will take care of all the white space by default
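The explanation above comes down to this: a lookaround checks but never consumes, so the leading spaces have to be consumed by the pattern itself and then left out of the output. A portable sketch of that idea with a sed capture group, for greps without -P, might look like this (leading_numbers is a hypothetical name):

```shell
#!/bin/sh
# Consume optional leading whitespace, capture the digits, print only the capture.
# -n plus the p flag skips lines that do not start with a number.
leading_numbers() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$@"
}

printf '%s\n' ' 1 one' ' 10 ten' '100 one hundred' | leading_numbers
```

Unlike the unanchored grep -Po variant, this only looks at the start of each line, so the 123 after -args is never picked up.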

How to extract a complex version number using sed?

I use sed on CentOS to extract a version number, and it works fine:
echo "var/opt/test/war/test-webapp-4.1.56.war" | sed -nre 's/^[^0-9]*(([0-9]+\.)*[0-9]+).*/\1/p'
But my problem is that I am not able to extract it when the version looks like this:
var/opt/test/war/test-webapp-4.1.56-RC1.war
I want to extract 4.1.56-RC1 if it is present.
Any ideas?
EDIT 2
Ok to be clear take this example, with a path:
Sometimes the path contains only a series of numbers, like var/opt/test/war/test-webapp-4.1.56.war, and sometimes it contains a series of numbers and letters, like var/opt/test/war/test-webapp-4.1.56-RC1.war
The need is to recover either 4.1.56 or 4.1.56-RC1 depending on the version present in the path. With sed or grep, no preference.
This seems to work but the .war is shown at the end:
echo "var/opt/test/war/test-webapp-4.1.56.war" | egrep -o '[0-9]\S*'
Little unclear what you are after, but this seems to be in the general direction.
Given:
$ echo "$e"
/var/opt/test/war/test-webapp-4.1.56-RC1.war
/var/opt/test/war/test-webapp-RC1.war
Version 4.2.4 (test version)
Try:
$ echo "$e" | egrep -o '[0-9]+\.[0-9]+\.[0-9]+-?\w*'
4.1.56-RC1
4.2.4
The following matches a first number of 1 to 2 digits ({1,2}), a second number of 1 to 2 digits, and a last number of 1 to 4 digits, separated by literal dots. It does not capture a trailing suffix such as -RC1.
grep -o '[0-9]\{1,2\}\.[0-9]\{1,2\}\.[0-9]\{1,4\}'
Just add an optional (-[a-zA-Z]+[0-9]+)? to your regex, so that versions without a suffix still match:
echo "Version 4.2.4 (test version)" | sed -nre 's/^[^0-9]*(([0-9]+\.)*[0-9]+(-[a-zA-Z]+[0-9]+)?).*/\1/p'
What about just using whitespace as the delimiter like
echo "Version 4.2.4-RC1 (test version)" | grep -Po "Version\s+\K\S+"
For grep, -P says to use Perl-style regex, -o shows only the matching part, and the \K in the pattern says not to show everything before it as part of the match.
This passes both tests
egrep -o '[0-9]\S*'
Unfortunately, not all greps support -o, but grep in Linux does.
echo "Version 4.2.4 (test version)" | sed 's/Version[[:space:]]*\([^[:space:](]*\).*/\1/'
But as with every extraction, you need to define exactly what you want to extract, not what could exist (or change your request).
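As one way of pinning that definition down, the sketch below assumes a version is a dotted run of numbers optionally followed by a single -suffix made of letters and digits. The function name is hypothetical, and the pattern has only been reasoned through against the examples from this thread.

```shell
#!/bin/sh
# Dotted numbers (at least two components), then an optional "-suffix" such as -RC1.
# The suffix stops at the next dot, so a trailing ".war" is never included.
extract_dotted_version() {
    grep -Eo '[0-9]+(\.[0-9]+)+(-[[:alnum:]]+)?'
}

echo 'var/opt/test/war/test-webapp-4.1.56-RC1.war' | extract_dotted_version
echo 'var/opt/test/war/test-webapp-4.1.56.war' | extract_dotted_version
```

Requiring at least one dot keeps stray digits elsewhere in a path from matching, at the cost of ignoring single-number versions.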

Matching arbitrary number of digits using grep regex

I've got a file with lines in it that look like the following:
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
What I am looking to do is use regex to match any line that starts with data and ends with later AND has numbers in between. Here is what I've concocted so far:
^[D,d]ata[0-9]*later$
However the output includes all datalater lines. I suppose I could pipe the output and grep -v datalater, but I feel like a single expression should do the trick.
Use + instead of *.
+ matches one or more of the preceding.
* matches zero or more.
^[Dd]ata[0-9]+later$
In grep you need to escape the +, and we can use \d which is a character class and matches single digits.
^[Dd]ata\d\+later$
In your example file you also have a line:
datafhj893724897290384later
This currently will not be matched due to there being letters in-between data and the numbers. We can fix this by adding a [^0-9]* to match anything after data until the digits.
Our final command will be:
grep '^[Dd]ata[^0-9]*\d\+later$' filename
Using Cygwin, the above commands didn't work. I had to modify the commands given above to get the desired results.
$ cat > file.txt <<EOL
> data
> datalater
> 983290842
> Data387428later
> datafhj893724897290384later
> 4329804928later
> EOL
I always like to make sure my file has what I expect it to have:
$ cat file.txt
data
datalater
983290842
Data387428later
datafhj893724897290384later
4329804928later
$
I needed to run Perl-style expressions with the -P flag. This meant I couldn't use the [^0-9]*, whose necessity @Tom_Cammann aptly pointed out. Instead, I used .* which matches any sequence of characters before the next part of the pattern. Here are my command and output.
$ grep -P '^[Dd]ata.*\d+later$' file.txt
Data387428later
datafhj893724897290384later
$
I wish I could give a better explanation of WHY Perl expressions are needed, but I just know that Cygwin's grep works a bit differently.
System Info
$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
My Results from the previous answers
$ grep '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep '^[Dd]ata\d+later$' file2.txt
$ grep -P '^[Dd]ata[^0-9]*\d\+later$' file2.txt
$ grep -P '^[Dd]ata\d+later$' file2.txt
Data387428later
$
You're matching zero or more digits with the * quantifier. Try
^[Dd]ata\d+later$
instead. You were also finding commas at the beginning of the string (e.g. ",ata1234later"). And \d is a shortcut to finding any digit character. So I changed those as well.
You should put a "+" (which means one or several) instead of "*" (which means zero, one or several).
The "+" syntax only works for extended-regexp, not standard grep.
At least, that's my experience on RHEL.
To use extended-regexp, run egrep or pass "-E" / "--extended-regexp"
Examples...
Standard grep
echo abc123n1 | grep "abc[0-9]+n1"
<no output>
egrep
echo abc123n1 | egrep "abc[0-9]+n1"
abc123n1
grep with -E
echo abc123n1 | grep -E "abc[0-9]+n1"
abc123n1
HTH
grep -Eio "^(data)[0-9]+(later)$"
^[dD]ata=^d later$=r$
🎯 MOTIVATION
The rest of answers don't work on all systems.
🗒️ REQUISITES
grep
The option: --extended-regexp
Character groups, aka: [:group:]
Matching one or more of the preceding, aka: +
Optionally setting as starting or ending: ^whatever$
📟 COMMAND
grep --extended-regexp "[[:group:]]+"
🗂️ GROUPS
alnum
alpha
blank
cntrl
digit
graph
lower
print
punct
space
upper
xdigit
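Applied to the question's file, the template above might be filled in as follows. This is only a sketch: match_data_lines is a hypothetical name, and allowing letters between data and the digits is an assumption taken from the datafhj... line discussed earlier in the thread.

```shell
#!/bin/sh
# -E is the short form of --extended-regexp, so + needs no backslash.
# [[:alpha:]]* allows letters before the digits; [[:digit:]]+ requires at least one digit.
match_data_lines() {
    grep -E '^[Dd]ata[[:alpha:]]*[[:digit:]]+later$' "$@"
}
```

It would be called as match_data_lines file.txt, or fed from a pipe with no arguments. Dropping [[:alpha:]]* restricts the match to lines where the digits follow data directly.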