How to grep/perl/awk overlapping regex - regex

Trying to pipe a string into a grep/perl regex to pull out overlapping matches. Currently, the results only appear to pull out sequential matches without any "lookback":
Attempt using egrep (both on GNU and BSD):
$ echo "bob mary mike bill kim jim john" | egrep -io "[a-z]+ [a-z]+"
bob mary
mike bill
kim jim
Attempt using perl style grep (-P):
$ echo "bob mary mike bill kim jim john" | grep -oP "()[a-z]+ [a-z]+"
bob mary
mike bill
kim jim
Attempt using awk showing only the first match:
$ echo "bob mary mike bill kim jim john" | awk 'match($0, /[a-z]+ [a-z]+/) {print substr($0, RSTART, RLENGTH)}'
bob mary
The overlapping results I'd like to see from a simple working bash pipe command are:
bob mary
mary mike
mike bill
bill kim
kim jim
jim john
Any ideas?

Lookahead is your friend here
echo "bob mary mike bill kim jim john" |
perl -wnE'say "$1 $2" while /(\w+)\s+(?=(\w+))/g'
The point is that lookahead, as a "zero-width assertion," doesn't consume anything -- while it still allows us to capture a pattern in it.
So as the regex engine matches a word and spaces ((\w+)\s+), gobbling them up, it then stops there and "looks ahead," merely to "assert" that the sought pattern is there; it doesn't move from its spot between the last space and the next \w, doesn't "consume" that next word, as they say.
It is nice though that we can also capture that pattern that is "seen," even though it's not consumed! So we get our $1 and $2, two words.
Then, because of /g modifier, the engine moves on, to find another word+spaces, with yet another word following. That next word is the one our lookahead spotted -- so now that one is consumed, and yet next one "looked" for (and captured). Etc.
See Lookahead and lookbehind assertions in perlretut
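For comparison, the same "consume one word, peek at the next" idea can be sketched in pure bash, which has no lookahead. This is only an illustration, not part of the answer above: the function name overlapping_pairs is made up, and [[:alnum:]_] is assumed to be a close enough stand-in for \w. After each match the second word is put back at the front of the remaining text, so it becomes the first word of the next pair.

```shell
#!/usr/bin/env bash
# overlapping_pairs (hypothetical helper): print every pair of adjacent words.
overlapping_pairs() {
  local s=$1 rest
  local re='([[:alnum:]_]+)[[:space:]]+([[:alnum:]_]+)'
  while [[ $s =~ $re ]]; do
    printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
    # Drop everything up to the end of this match ...
    rest=${s#*"${BASH_REMATCH[0]}"}
    # ... then put the second word back so it can start the next pair.
    s=${BASH_REMATCH[2]}$rest
  done
}

overlapping_pairs "bob mary mike bill kim jim john"
```

Bash has no /g modifier, so the loop has to shorten the string itself on every pass; that is the part the lookahead handles implicitly in the Perl version.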

Use the Perl one-liners below, which avoid the lookahead (which can still be your friend):
For whitespace-delimited words:
echo "bob mary mike bill kim jim john" | perl -lane 'print "$F[$_] $F[$_+1]" for 0..($#F-1);'
For words defined as \w+ in Perl, delimited by the non-word characters \W+:
echo "bob.mary,mike'bill kim jim john" | perl -F'/\W+/' -lane 'print "$F[$_] $F[$_+1]" for 0..($#F-1);'
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/\W+/' : Split into @F on \W+ (one or more non-word characters), rather than on whitespace.
$#F : the last index of the array @F, into which the input line is split.
0..($#F-1) : the range of indexes (numbers), from the first (0) to the penultimate ($#F-1) index of the array @F.
$F[$_] and $F[$_+1]: two consecutive elements of the array @F, with indexes $_ and $_+1, respectively.
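The index arithmetic described above (split into an array, then walk from index 0 to the penultimate index) can also be sketched with a bash array. This is a rough equivalent for the whitespace-delimited case only; the function name adjacent_pairs is an assumption made for the example.

```shell
#!/usr/bin/env bash
# adjacent_pairs (hypothetical helper): split a line on whitespace into an
# array F and print F[i] with F[i+1] for every index except the last one.
adjacent_pairs() {
  local -a F
  local i
  read -r -a F <<< "$1"
  for (( i = 0; i < ${#F[@]} - 1; i++ )); do
    printf '%s %s\n' "${F[i]}" "${F[i+1]}"
  done
}

adjacent_pairs "bob mary mike bill kim jim john"
```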
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

You can also use awk
awk '{for(i=1;i<NF;i++) print $i,$(i+1)}' <<< 'bob mary mike bill kim jim john'
See the online demo. This solution iterates over all whitespace-separated fields and prints current field ($i) + field separator (a space here) + the subsequent field value ($(i+1)).
Or, another perl solution that uses a very common technique to capture the overlapping pattern inside a positive lookahead:
perl -lane 'while (/(?=\b(\p{L}+\s+\p{L}+))/g) {print $1}' <<< 'bob mary mike bill kim jim john'
See the online demo. Details:
(?= - start of a positive lookahead
\b - a word boundary
(\p{L}+\s+\p{L}+) - capturing group 1: one or more letters, one or more whitespaces, one or more letters
) - end of the lookahead.
Here, only Group 1 values are printed ({print $1}).
Performance consideration
Of the Perl solutions here, mine turns out to be the slowest and Timur's the fastest; the awk solution, however, is faster than any of the Perl solutions. Results:
# ./wiktor_awk.sh
real 0m17.069s
user 0m12.264s
sys 0m5.314s
# ./timur_perl.sh
real 0m18.201s
user 0m15.612s
sys 0m6.139s
# ./zdim.sh
real 0m23.559s
user 0m19.883s
sys 0m7.359s
# ./wiktor_perl.sh
real 2m12.528s
user 1m52.857s
sys 0m20.201s
Note I created *.sh files for each solution like
#!/bin/bash
N=10000
time(
for i in $(seq 1 $N); do
<SOLUTION_HERE> &>/dev/null;
done)
and ran for f in *.sh; do chmod +x "$f"; done (borrowed from here).

Related

Bash script to extract 10 most common double-vowels words from a file

So I have tried to write a Bash script to extract the 10 most common double-vowels words from a file, like good, teeth, etc.
Here is what I have so far:
grep -E -o '[aeiou]{2}' $1|tr 'A-Z' 'a-z' |sort|uniq -c|sort -n | tail -10
I tried to use grep with the -E flag to find pattern matches such as 'aa', 'ee', 'ii', etc., but it is not working at all.
What I got back was just 'ai', 'ea', and similar pairs. Can anyone help me figure out how to do this pattern match in a Bash script?
You can simply match any amount of letters before or after a repeated vowel with this POSIX ERE regex with a GNU grep:
grep -oE '[[:alpha:]]*([aeiou])\1[[:alpha:]]*' words.txt
FreeBSD (non-GNU) grep does not support a backreference in the pattern, so you will have to list all possible vowel sequences:
grep -oE '[[:alpha:]]*(aa|ee|ii|oo|uu)[[:alpha:]]*' words.txt
See the online demo:
#!/bin/bash
s='Some good feed
Soot and weed'
grep -oE '[[:alpha:]]*([aeiou])\1[[:alpha:]]*' <<< "$s"
Details:
[[:alpha:]]* - zero or more letters
(aa|ee|ii|oo|uu) - one of the char sequences, aa, ee, ii, oo or uu (| is an alternation operator in a POSIX ERE regex)
([aeiou]) - Group 1: a vowel
\1 - the same vowel as in Group 1
[[:alpha:]]* - zero or more letters
Simple way to change your regex: replace [aeiou]{2} with aa|ee|ii|oo|uu. (This does not fix the issue of only finding the match rather than the full word.)
Building on Andrew's answer (re: matching double vowels):
$ cat words.txt
good food;foul make chicken,eek too brave
eye you yuu something:three food too tu too
$ grep -E -o '\<[[:alnum:]]*(aa|ee|ii|oo|uu)[[:alnum:]]*\>' words.txt
good
food
eek
too
yuu
three
food
too
too
The grep finds only words (\< and \> represent word boundaries) with letters and/or digits and containing a dual vowel, printing each word on a separate line.
Applying the rest of OP's counting/sorting logic:
$ grep -E -o '\<[[:alnum:]]*(aa|ee|ii|oo|uu)[[:alnum:]]*\>' words.txt | sort | uniq -c | sort -n
1 eek
1 good
1 three
1 yuu
2 food
3 too
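Putting the pieces together, the whole "top 10" pipeline might look like the sketch below. It assumes words are runs of letters, uses the alternation form so it does not depend on backreference support, and lowercases before counting; the function name top_double_vowel_words is made up for the example.

```shell
#!/usr/bin/env bash
# top_double_vowel_words (hypothetical helper): print "count word" for the
# ten most frequent words in file $1 that contain a doubled vowel.
top_double_vowel_words() {
  grep -oiE '[[:alpha:]]*(aa|ee|ii|oo|uu)[[:alpha:]]*' "$1" |
    tr '[:upper:]' '[:lower:]' |          # count Good and good together
    sort | uniq -c |
    awk '{ print $1, $2 }' |              # normalise the padding from uniq -c
    sort -k1,1nr -k2,2 |                  # highest count first, ties by word
    head -n 10
}

# Usage: top_double_vowel_words words.txt
```

Sorting in descending order and taking head keeps the most common word on the first line; the original sort -n | tail -10 gives the same ten words in the opposite order.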

sed and Perl regexp replaces once, with multiple replacements flag

I have the string:
lopy,lopy1,sym,lopy,lopy1,sym
I want the line to be:
lopy,lopy1,sym,lady,lady1,sym
Which means that all "lopy" after the string sym should be replaced. So I ran:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | sed -r 's/(.*sym.*?)lopy/\1lad/g'
I get:
lopy,lopy1,sym,lopy,lad1,sym
Using Perl is not really better:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(.*sym.+?)lopy/${1}lad/g'
yields
lopy,lopy1,sym,lad,lopy1,sym
Not all "lopy" are replaced. What am I doing wrong?
The (.*sym.*?)lopy / (.*sym.+?)lopy patterns are almost the same, .+? matches one or more chars other than line break chars, but as few as possible, and .*? matches zero or more such chars. Mind that sed does not support lazy quantifiers, *? is the same as * in sed. However, the main problem with the regexps you used is that they match sym, then any text after it and then lopy, so when you added g, it just means you want to find more cases of lopy after sym....lopy. And there is only one such occurrence in your string.
You want to replace all lopy after sym, so you can use
perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
See the regex demo. Details:
(?:\G(?!^)|sym) - sym or end of the previous match (\G(?!^))
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that discards all text matched so far
lopy - a lopy string.
See the online demo:
#!/bin/bash
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
# => lopy,lopy1,sym,lad,lad1,sym
If the values are always comma separated, you may replace .*? with ,: (?:\G(?!^)|sym),\Klopy (see this regex demo).
Since the OP mentioned sed, I am adding an awk program here, which could be a better choice than sed. With the shown samples, please try the following awk program.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
awk -F',sym,' '
{
first=$1
$1=""
sub(/^[[:space:]]+/,"")
gsub(/lop/,"lad")
$0=first FS $0
}
1
'
Explanation: Adding detailed explanation for above.
echo "lopy,lopy1,sym,lopy,lopy1,sym" | ##Printing values and sending as standard output to awk program as an input.
awk -F',sym,' ' ##Making ,sym, as a field separator here.
{
first=$1 ##Creating first which has $1 of current line in it.
$1="" ##Nullifying $1 here.
sub(/^[[:space:]]+/,"") ##Substituting initial space in current line here.
gsub(/lop/,"lad") ##Globally substituting lop with lad in rest of line.
$0=first FS $0 ##Adding first FS to rest of edited line here.
}
1 ##Printing edited/non-edited line value here.
'
The problem is that the lopy(s) to replace are after sym, with a pattern like sym.*?lopy, so a global replacement looks for yet more of the whole sym+lopy-after-sym (not just for all lopys after that one sym).†
To replace all lopys (after the first sym, followed by another sym) we can capture the substring between syms and in the replacement side run code, in which a regex replaces all lopys
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe's{ sym,\K (.+?) (?=sym) }{ $1 =~ s/lop/lad/gr }ex'
To isolate the substring between syms I use \K after the first sym, which drops everything matched before it, and a positive lookahead for the sym after the substring, which doesn't consume anything. The /e modifier makes the replacement side be evaluated as code. In the replacement side's regex we need /r since $1 is read-only, and /r makes the substitution return the changed string instead of modifying its target. See perlretut.
† To match all of abbbb we can't say /ab/g, nor /(a)b/g nor /a(b)/g, because that would look for all repetitions of the whole ab in the string (and find only ab in the beginning).
sed does not support non-greedy wildcards at all. But your Perl script also fails for other reasons; you are saying "match all occurrences of this" but then you specify a regex which can only match once.
A common simple solution is to split the string, and then replace only after the match:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 'if (@x = /^(.*?sym,)(.*)/) { $x[1] =~ s/lop/lad/g; s/.*/$x[0]$x[1]/ }'
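The same split-then-replace idea can be sketched in pure bash with parameter expansion. This is an illustration under assumptions rather than a drop-in answer: the function name replace_after_first and its argument order are invented here, and all strings are treated literally, not as regexes.

```shell
#!/usr/bin/env bash
# replace_after_first (hypothetical helper): LINE MARKER FROM TO
# Replace every FROM with TO, but only in the text after the first MARKER.
replace_after_first() {
  local line=$1 marker=$2 from=$3 to=$4 head tail
  if [[ $line == *"$marker"* ]]; then
    head=${line%%"$marker"*}      # text before the first marker
    tail=${line#*"$marker"}       # text after the first marker
    tail=${tail//"$from"/$to}     # global literal replacement in the tail only
    printf '%s%s%s\n' "$head" "$marker" "$tail"
  else
    printf '%s\n' "$line"         # no marker: leave the line unchanged
  fi
}

replace_after_first "lopy,lopy1,sym,lopy,lopy1,sym" sym lopy lad
# prints: lopy,lopy1,sym,lad,lad1,sym
```

Because the head is set aside before any replacement happens, there is no need for \G, \K or a lookbehind; the trade-off is that it works on one string at a time rather than as a stream filter.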
If you want to be fancy, you can use a lookbehind to only replace the lop occurrences after the first sym.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 's/(?<=sym.{0,200})lop/lad/g'
The variable-length lookbehind generates a warning and is only supported in Perl 5.30+ (you can turn the warning off with no warnings qw(experimental::vlb);).
Since you have shown an attempted sed command and used sed tag, here is a sed loop based solution:
sed -E -e ':a' -e 's~(sym,.*)lopy~\1lady~g; ta' file
lopy,lopy1,sym,lady,lady1,sym
Explanation:
:a sets a label a before matching sym,.* pattern
ta jumps pattern matching back to label a after making a substitution
The loop stops when the s command has nothing left to match, i.e. there is no lopy substring after sym,

Using grep to remove text after the first, or second, occurrence of a four digit string. Issue with hyphenated text

I am trying to use grep and sed to format text and need help with my grep statement to include hyphens and preceding text in the output.
Example strings:
Merry.Ex-Mas.2014.1080p.Text.x265-JOHN
30.Rock.A.One-Time.Special.2020.1080p.Text.x265-JOHN
Creature.from.the.Black.Lagoon.REMASTERED.1954.1080p.BluRay.x265-JOHN
1984.1984.1080p.Text.x265-JOHN
The desired output would be:
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
Creature from the Black Lagoon 1954
1984 1984
Thanks to @grzegorz-pudłowski I have this line of code. (but for some reason hyphens and everything in front of the hyphen is being removed)
`grep -E -o '(\\w*[\\.]?)*(19|20)[0-9]{2}'`
(the extra escapes are needed in AppleScript)
Those grep commands result in:
Mas.2014
Time.Special.2020
Creature.from.the.Black.Lagoon.1954
1984.1984
I then pipe to sed to replace periods with spaces:
| sed 's/\\. */ /g'"
The original answer from @grzegorz-pudłowski that was removed from stackoverflow:
Better than sed should be grep in this situation. I guess you have a bunch of files and you want to rename them or what not. So I would use something like this:
echo "Title.Text.2012.1080p.text.text" | grep -E -o "(\w*[\.]?)*(19|20)[0-9]{2}"
So... -E is "regex extended" flag. You can use egrep instead. Next flag is -o and it makes grep print only matched expression (as you want to throw away rest of this string).
Regexp is simple:
(\w*[\.]?)* match zero or more groups of zero or more alphanumeric
characters with zero or one dot at the end.
(19|20) match 19 or 20 as you want to match a year (assuming years
1900-2099 so change this part if you want wider range)
[0-9]{2} match two digits from 0 to 9
After that you can pipe result to mv or what not. If you grep file however then just use:
grep -E -o "(\w*[\.]?)*(19|20)[0-9]{2}" filename.txt
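The hyphenated text goes missing because \w does not match "-", so the match cannot start before the hyphen and begins right after it instead. One possible adjustment, sketched below, is to use a bracket expression that also allows hyphens and dots. The function name title_year is made up for the example, and the year is still assumed to start with 19 or 20.

```shell
#!/usr/bin/env bash
# title_year (hypothetical helper): read release names on stdin and print
# everything up to the last 19xx/20xx year, with dots turned into spaces.
title_year() {
  grep -oE '[[:alnum:]_.-]*(19|20)[0-9]{2}' | tr '.' ' '
}

printf '%s\n' \
  'Merry.Ex-Mas.2014.1080p.Text.x265-JOHN' \
  '30.Rock.A.One-Time.Special.2020.1080p.Text.x265-JOHN' \
  '1984.1984.1080p.Text.x265-JOHN' | title_year
```

Like the original pattern, this keeps every token before the year, so a tag such as REMASTERED in front of the year would still appear in the output.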
EDIT2: In case OP wants to stick with his original solution with additional steps then try following.
grep -E -o "(\w+\.){1,}.*(19|20)[0-9]{2}" Input_file | sed 's/\./ /g'
EDIT: As per OP's comment adding more generic solution.
awk '
match($0,/[0-9]{4}\.[0-9]+[a-zA-Z]+\..*/){
val=substr($0,1,RSTART+3)
gsub(/\./," ",val)
print val
val=""
}
' Input_file
Could you please try following, written and tested with shown samples in GNU sed.
sed -E 's/\.[0-9]+p\.Text\..*Text//;s/\./ /g' Input_file
2nd solution: Using awk.
awk '
BEGIN{
FS="."
}
match($0,/\.[0-9]+p\.Text\..*Text/){
$1=$1
print substr($0,1,RSTART-1)
}
' Input_file
A sed expression using BRE (Basic Regular Expressions) can be written as:
sed 's/[.]/ /g;s/\w\w*p\s.*$//' file
Where the first substitution globally replaces each '.' with a space and then the second deletes from the word ending in 'p' to the end of line. \w matches [A-Za-z0-9_], so you can tighten the matching criteria by adjusting the match of characters before 'p' if needed.
Example Use/Output
$ sed 's/[.]/ /g;s/\w\w*p\s.*$//' file
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
1984 1984
Per-Edits To Include Additional Strings
Including additional strings such as:
"WALL-E.2008.1080p.BluRay.x265-JOHN", and
"WALL-E.2008.REMASTERED.1080p.BluRay.x265-RARBG"
To use BRE you would need:
sed 's/[.]/ /g;s/^[0-9][0-9]*[ ]\([0-9][0-9][0-9][0-9]\).*$/\1 \1/;s/[ ]\([0-9][0-9][0-9][0-9]\).*$/ \1/' file
Example Input File
$ cat file
Merry.Ex-Mas.2014.1080p.Text.x265.Text
30.Rock.A.One-Time.Special.2020.1080p.Text.x265.Text
1984.1984.1080p.Text.x265.Text
WALL-E.2008.1080p.BluRay.x265-JOHN
WALL-E.2008.REMASTERED.1080p.BluRay.x265-RARBG
Example Use/Output
$ sed 's/[.]/ /g;s/^[0-9][0-9]*[ ]\([0-9][0-9][0-9][0-9]\).*$/\1 \1/;s/[ ]\([0-9][0-9][0-9][0-9]\).*$/ \1/' file
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
1984 1984
WALL-E 2008
WALL-E 2008
This can be solved using sed substitution:
sed -E 's/(.*(19|20)[0-9]{2}).*/\1/; s/\./ /g' file
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
1984 1984
Details:
(.*(19|20)[0-9]{2}): Match longest string till we get a year string and capture in group #1
.*: Match remaining part till end
\1: Put 1st capture group back
s/\./ /g: replace each dot with a space
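If the names are already in shell variables (for example while looping over files), the same greedy "up to the last year" match can be sketched with bash's own =~ operator instead of spawning sed. The function name clean_title is an assumption for the example, and it shares the limitation of the sed command: any later four-digit token starting with 19 or 20 would extend the match.

```shell
#!/usr/bin/env bash
# clean_title (hypothetical helper): keep the name up to the last 19xx/20xx
# year and replace dots with spaces; print the input unchanged if no year.
clean_title() {
  local name=$1 title
  local re='^(.*(19|20)[0-9]{2})'
  if [[ $name =~ $re ]]; then
    title=${BASH_REMATCH[1]}
    printf '%s\n' "${title//./ }"
  else
    printf '%s\n' "$name"
  fi
}

clean_title 'Merry.Ex-Mas.2014.1080p.Text.x265-JOHN'
# prints: Merry Ex-Mas 2014
```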
You may use
sed -E 's/\.1080p\..*//g;s/\./ /g' file
See the online sed demo
Details
-E - enables POSIX ERE syntax
s/\.1080p\..*//g - removes the .1080p. and all text to the end of string
s/\./ /g - replaces dots with spaces.
Test:
#!/bin/bash
s='Merry.Ex-Mas.2014.1080p.
30.Rock.A.One-Time.Special.2020.1080p.
1984.1984.1080p.'
sed -E 's/\.1080p\..*//g;s/\./ /g' <<< "$s"
Output:
Merry Ex-Mas 2014
30 Rock A One-Time Special 2020
1984 1984

Bash Regex Capture Groups

I have a single string that is this kind of format:
"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"
If I was writing a normal regex in JS, C#, etc, I'd do this
(?:"(.+?)"|'(.+?)'|(\S+))
And iterate the match groups to grab each string, ideally without the quotes. I ultimately want to add each value to an array, so in the example, I'd end up with 3 items in an array as follows:
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
I can't figure out how to replicate this functionality with grep or sed or bash regex's. I've tried some things like
echo "$email" | grep -oP "\"\K(.+?)(?=\")|'\K(.+?)(?=')|(\S+)"
The problem with this is that while it kind of mimics the functionality of capture groups, it doesn't really work with multiples, so I get captures like
"Mike
H<michael.haken@email1.com>"
michael.haken@email2.com
If I remove the look ahead/behind logic, I at least get the 3 strings, but the first and last are still wrapped in quotes. In that approach, I pipe the output to read so I can individually add each string to the array, but I'm open to other options.
EDIT:
I think my input example may have been confusing, it's just a possible input. The real input could be double quoted, single quoted, or non-quoted (without spaces) strings in any order with any quantity. The Javascript/C# regex I provided is the real behavior I'm trying to achieve.
You can use Perl:
$ email='"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"'
$ echo "$email" | perl -lane 'while (/"([^"]+)"|(\S+)/g) {print $1 ? $1 : $2}'
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
Or in pure Bash, it gets kinda wordy:
re='\"([^\"]+)\"[[:space:]]*|([^[:space:]]+)[[:space:]]*'
while [[ $email =~ $re ]]; do
echo "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
i=${#BASH_REMATCH}
email=${email:i}
done
# same output
You may use sed to achieve that,
$ sed -r 's/"(.*)" (.*)"(.*)"/\1\n\2\n\3/g' <<< "$email"
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
gawk + bash solution (adding each item to array):
email_str='"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"'
readarray -t email_arr < <(awk -v FPAT="[^\"'[:space:]]+[^\"']+[^\"'[:space:]]+" \
'{ for(i=1;i<=NF;i++) print $i }' <<<$email_str)
Now, all items are in email_arr
Accessing the 2nd item:
echo "${email_arr[1]}"
michael.haken@email2.com
Accessing the 3rd item:
echo "${email_arr[2]}"
Mike H<hakenmt@email1.com>
Your first expression is fine; just be careful with the quotes (use single quotes when \ are present). In the end trim the " with sed.
$ echo "$email" | grep -Po '".*?"|\S+' | sed -r 's/"$|^"//g'
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
Using gawk, where RS can be a multi-character regex:
awk -v RS='"|" ' 'NF' inputfile
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
Modify your regex like this :
grep -oP '("?\s*)\K.*?(?=")' file
Output:
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
Using GNU awk and FPAT to define fields by content:
$ awk '
BEGIN { FPAT="([^ ]*)|(\"[^\"]*\")" } # define a field to be space-separated or in quotes
{
for(i=1;i<=NF;i++) { # iterate every field
gsub(/^\"|\"$/,"",$i) # remove leading and trailing quotes
print $i # output
}
}' file
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
What I was able to do that worked, but wasn't as concise as I wanted the code to be:
arr=()
while IFS= read -r line; do
line="${line//\"/}"
arr+=("${line//\'/}")
done < <(echo "$email" | grep -oP "\"(.+?)\"|'(.+?)'|(\S+)")
This gave me an array of the capturing group and handled the input in any order, wrapped in double or single quotes or none at all if it didn't have a space. It also provided the elements in the array without the wrapping quotes. Appreciate all of the suggestions.
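For reference, the three-alternative regex from the question can also be approximated with bash's =~ and BASH_REMATCH, filling the array directly with no grep or quote stripping. This is a sketch under assumptions: the function name split_quoted and the global array name items are invented here, and it relies on POSIX longest-match behaviour to prefer a complete quoted string over a bare run of non-space characters.

```shell
#!/usr/bin/env bash
# split_quoted (hypothetical helper): split $1 into "double quoted",
# 'single quoted' or bare tokens and store them, unquoted, in the global
# array "items".
split_quoted() {
  local s=$1
  local re="^[[:space:]]*(\"([^\"]+)\"|'([^']+)'|([^[:space:]]+))"
  items=()
  while [[ $s =~ $re ]]; do
    # Exactly one of groups 2, 3 and 4 is non-empty for each match.
    items+=("${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}")
    # Cut the matched token (and leading whitespace) off the front.
    s=${s:${#BASH_REMATCH[0]}}
  done
}

split_quoted "\"Mike H<mike@example.com>\" joe@example.com 'Ann B<ann@example.com>'"
printf '%s\n' "${items[@]}"
```

The anchored ^ plus the substring step plays the role of iterating matches in JS or C#; a token such as "a"b, where a quoted part is glued to other text, is an edge case this sketch does not try to handle the way those engines would.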

Removing both duplicates (not just the repeated) from a text file?

By this I mean, erase all rows in a text file that are repeated, NOT just the duplicates. I mean both the row that is duplicated and the duplicated row. This would leave me only with the list of rows that weren't repeated. Perhaps a regular expression could do this in notepad++? But which one? Any other methods?
If you're on a unix-like system, you can use the uniq command.
ezra@ubuntu:~$ cat test.file
ezra
ezra
john
user
ezra@ubuntu:~$ uniq -u test.file
john
user
Note that the duplicate rows must be adjacent. You'll have to sort the file first if they're not.
ezra@ubuntu:~$ cat test.file
ezra
john
ezra
user
ezra@ubuntu:~$ uniq -u test.file
ezra
john
ezra
user
ezra@ubuntu:~$ sort test.file | uniq -u
john
user
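If the original line order matters, sorting can be avoided by reading the file twice with awk: the first pass counts each line and the second prints only the lines seen exactly once. This is a common awk idiom rather than something from the answer above; the function name unique_lines is made up, and the input has to be a regular file because it is read twice.

```shell
#!/usr/bin/env bash
# unique_lines (hypothetical helper): print the lines of file $1 that occur
# exactly once, in their original order.
unique_lines() {
  # First pass (NR == FNR): count lines. Second pass: print those counted once.
  awk 'NR == FNR { count[$0]++; next } count[$0] == 1' "$1" "$1"
}

# Usage: unique_lines test.file
```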
If you have access to a regex engine that supports PCRE-style syntax, this is straightforward:
s/(?:^|(?<=\n))(.*)\n(?:\1(?:\n|$))+//g
(?:^|(?<=\n)) # Behind us is beginning of string or newline
(.*)\n # Capture group 1: all characters up until next newline
(?: # Start non-capture group
\1 # backreference to what was captured in group 1
(?:\n|$) # a newline or end of string
)+ # End non-capture group, do this 1 or more times
Context is a single string
use strict; use warnings;
my $str =
'hello
this is
this is
this is
that is';
$str =~ s/
(?:^|(?<=\n))
(.*)\n
(?:
\1
(?:\n|$)
)+
//xg;
print "'$str'\n";
__END__
output:
'hello
that is'