Using awk to find a domain name containing the longest repeated word - regex

For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use Linux awk regular expressions to find the line that contains the longest repeated[1] word, so in this case it should return the line
5,letswelcomewelcomeyou.org
How do I do that?
[1] Meaning "immediately repeated", i.e., abcabc, but not abcXabc.

A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I've added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number to avoid matches for line numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
... | sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
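For the overlapping case, one possible workaround is a brute-force search in plain awk, which needs no backreferences. The following is only a sketch under assumptions: the names longest_repeat_lines and longest_repeat are made up for illustration, the input is assumed to be the two-column CSV shown above, and the cubic-time search is likely too slow for very large files.

```shell
# Print every input line whose second field holds the longest
# immediately repeated substring (ties are all printed).
longest_repeat_lines() {
    awk -F, '
    # Length of the longest substring of s that is directly followed
    # by a copy of itself (0 if there is none).
    function longest_repeat(s,    n, len, i) {
        n = length(s)
        for (len = int(n / 2); len >= 1; len--)
            for (i = 1; i + 2 * len - 1 <= n; i++)
                if (substr(s, i, len) == substr(s, i + len, len))
                    return len
        return 0
    }
    {
        line[NR] = $0
        rep[NR] = longest_repeat($2)
        if (rep[NR] > max) max = rep[NR]
    }
    END {
        if (max > 0)
            for (i = 1; i <= NR; i++)
                if (rep[i] == max) print line[i]
    }' "$@"
}
```

Called as longest_repeat_lines infile, this tries every start position for every candidate length, so a line such as 1,welwelcomewelcome.com is ranked by welcomewelcome rather than by welwel.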
Alternative solution with more awk, less sort
As pointed out by tripleee in comments, this can be simplified: a single awk step can replace the sort step and both awk steps, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command for the first solution, or no keeping track of multiple longest matches with awk in the second solution), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't use grep -f - to read patterns from stdin. In that case, the output of the pipeline up to that point has to be redirected to a temporary file, which is then passed to grep -f.
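A minimal sketch of that temp-file workaround, assuming mktemp is available. The function name grep_patterns_from_stdin is hypothetical and not part of either solution above.

```shell
# Read patterns from stdin, store them in a temporary file, and run
# grep -f against the files given as arguments.
grep_patterns_from_stdin() (
    # The body runs in a subshell, so tmp and status stay local to the call.
    tmp=$(mktemp) || exit 1
    cat > "$tmp"                 # save the patterns arriving on stdin
    grep -f "$tmp" -- "$@"       # use them as an ordinary pattern file
    status=$?
    rm -f -- "$tmp"
    exit "$status"
)
```

The last stage of either pipeline would then become ... | grep_patterns_from_stdin infile instead of ... | grep -f - infile.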

A way with perl:
perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
@m=sort { length $b <=> length $a} @m;
$cl=length $m[0];
if ($l<$cl) { @res=($_); $l=$cl; } elsif ($l==$cl) { push @res, ($_); }
}
END { print @res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field, then the match array is sorted and the longest substring becomes the first item in the array ($m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
details:
command line option:
-F, set the field separator to ,
-ane (e executes the following code; n reads a line at a time and puts its content in $_; a autosplits the line, using the defined field separator, and puts the fields in the @F array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-known pattern to find all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position", but it doesn't match any character). To obtain the characters matched in the lookahead, all that you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care if the characters have been already captured in group 1 before).

Related

BASH: Search a string and exactly display the exact number of times a substring happens inside it

I've searched all over and still can't find this simple answer. I'm sure it's so easy. Please help if you know how to accomplish this.
sample.txt is:
AAAAA
I want to find the exact times the combination "AAA" happens. If you just use for example
grep -o 'AAA' sample.txt | wc -l
We receive a 1. This is the same as just searching the number of times AAA happens from with a standard text editor search box type search. However, I want the complete number of matches exactly, starting from each individual character which is exactly 3. We get this when we search from each character individually instead of treating each AAA hit like a box type block.
In other words, I want the number of occurrences of "AAA" counted from every character position in sample.txt, overlaps included (3 for AAAAA), not just the non-overlapping blocks that a normal text editor search would find.
How do we accomplish this, preferably in AWK? SED, GREP and anything else is fine as well, as long as I can include it in a Bash script.
This might work for you (GNU sed & wc):
sed -r 's/^[^A]*(AA?[^A]+)*AAA/AAA\nAA/;/^AAA/P;D' | wc -l
Lose any characters other than A's, and single or double A's. Then print a triple A and lose the first A and repeat. Finally count the number of lines printed.
This isn't a trivial problem in bash. As far as I know, standard utils don't support this kind of searching. You can however use standard bash features to implement this behavior in a function. Here's how I would attack the problem, but there are other ways:
#!/bin/bash
search_term="AAA"
text=$(cat sample.txt)
term_len=${#search_term}
occurences=0
# While the text is greater than or equal to the search term length
while [ "${#text}" -ge "$term_len" ]; do
# Look at just the length of the search term
text_substr=${text:0:${term_len}}
# If we see the search term, increment occurences
if [ "$text_substr" = "$search_term" ]; then
((occurences++))
fi
# Remove the first character from the main text
# (e.g. "AAAAA" becomes "AAAA")
text=${text:1}
done
printf "%d occurences of %s\n" "$occurences" "$search_term"
This is the awk version
echo "AAAAA AAA AAAABBAAA" \
| gawk -v pat="AAA" '{
for(i=1; i<=NF; i++){
# current field length
m=length($i)
#search pattern length
n=length(pat)
for(l=1 ; l<=m-n+1; l++){
sstr=substr($i,l,n)
#print i " " $i " sub:" sstr
# substring equals pattern (literal comparison, not a regex match)
if(sstr == pat){
count++
}else{
print "contiguous count on field " i " = " count
# uncomment next line if non-contiguous matches are not needed
#break
}
}
print "total count on field " i " = " count
count=0
}
}'
I posted this on another of the OP's posts, but it was ignored, maybe because I did not add notes and an explanation. It is just a different approach, and any discussion is welcome.
$ awk -v sample="$(<sample.txt)" '{ x=sample; n=0 }$0 != ""{
while(t=index(x,$0)){ n++; x=substr(x,t+1) }
print $0,n
}' combinations
Explanation:
The variables:
sample: the raw sample text, slurped in from the file sample.txt with the -v argument
x: the target string; before each test, its value is reset to sample
$0: the test string from the file combinations; each line supplies one test string
n: the counter, i.e. the number of occurrences of the test string ($0)
t: the position of the first character of the matched test string ($0) in the target string (x)
Update: Added $0 != "" before the main while loop to skip empty strings, which lead to an infinite loop.
The code:
awk -v sample="$(<sample.txt)" '
# reset the targeting string(with the sample text) and the counter "n"
{ x = sample; n = 0 }
# below the main block where $0 != "" to skip the EMPTY testing string
($0 != ""){
# the function index(x, $0) returns the position(assigned to "t") of the first character
# of the matched testing string($0) in the targeting string(x).
# when no match is found, it returns zero and thus step out of the while loop.
while(t=index(x,$0)) {
n++; # increment the number of matches
x = substr(x, t+1) # modify the targeting string to remove all characters before the position(t) inclusively
}
print $0, n # print the testing string and the counts
}
' combinations
awk's index() function is much faster than regex matching, and it avoids expensive brute-force string comparisons. Here are the sample.txt and combinations files I tested with:
$ more sample.txt
AAAAAHHHAAHH
HAAAAHHHAAHH
AAHH
$ more combinations
AA
HH
AAA
HHH
AAH
HHA
ZK
Tested Environment: GNU Awk 4.0.2, Centos 7.3

Regex match as many of strings as possible

I don't know if this is possible or makes sense, but what I'm trying to do is grep or awk a file matching for multiple strings, but only showing the match that matches the most strings.
So I would have a file like:
cat,dog,apple,bark,chair
apple,chair,wall
cat,wall
phone,key,bark,nut
cat,dog,key
phone,dog,key
table,key,chair
I want to match a single line that includes the most of these strings: cat|dog|table|key|wall. Not necessarily having to include all of them, but whatever line matches the most, print it.
So for example, I would want it to display this output:
cat,dog,key
Since it is the line that includes most of the strings that are being searched for.
I've tried using:
cat filename \
|egrep -iE 'cat' \
|egrep -iE 'dog' \
|egrep -iE 'table' \
|egrep -iE 'key' \
|egrep -iE 'wall'
But it will only display lines that show ALL strings, I have also tried:
egrep -iE 'cat|dog|table|key|wall' filename
But that shows any line that matches any one of those strings.
Is regex capable of doing something like this?
Use awk, and increment a counter for each word that matches. If the counter is higher than the highest count, save this line.
awk 'BEGIN {max = 0}
{ count=0;
if (/(^|,)cat(,|$)/) count++;   # \b is a backspace in awk, so anchor on the commas instead
if (/(^|,)dog(,|$)/) count++;
...
if (count > max) { saved = $0; max = count; }
}
END { print saved; }'
$ awk -F, -v r='^(cat|dog|table|key|wall)$' '{c=0;for (i=1;i<=NF;i++)if ($i~r)c++; if (c>max){max=c;most=$0}} END{print most}' file
cat,dog,key
How it works
-F,
This sets the field separator to a comma.
-v r='^(cat|dog|table|key|wall)$'
This sets the variable r to a regex matching your words of interest. The regex begins with ^ and ends with $. This assures that only whole words are matched.
c=0;for (i=1;i<=NF;i++)if ($i~r)c++
This sets the variable c to the number of matches on the current line.
if (c>max){max=c;most=$0}
If the number of matches on the current line, c, exceeds the previous maximum, max, then update max and set most to the current line.
END{print most}
When we are done reading the file, print the line with the most matches.
To make the problem more interesting I created two input files:
InFile1 ...
cat|dog|table|key|wall
InFile2 ...
cat,dog,apple,bark,chair
apple,chair,wall
cat,wall
phone,key,bark,nut
cat,dog,key
phone,dog,key
table,key,chair
Note that InFile2 differs from the original post
in that it contains two lines each with three matches.
Hence, there is a "tie" for first place and both are
reported.
This code ...
awk -F, '{if (NR==FNR) r=$0; else {count=0
for (j=1;j<=NF;j++) if ($j ~ r) count++
a[FNR]=count" matching words in "$0
if (max<count) max=count}}
END{for (j=1;j<=FNR;j++) if (1==index(a[j],max)) print a[j]}' \
$InFile1 $InFile2 >$OutFile
... produced this OutFile ...
3 matching words in cat,dog,key
3 matching words in table,key,dog,banana
Daniel B. Martin

sed join lines together

what would be the sed (or other tool) command to join lines together in a file that do not end w/ the character '0'?
I'll have lines like this
412|n|Leader Building Material||||||||||d|d|20||0
which need to be left alone, and then I'll have lines like this for example (which is 3 lines, but only one ends w/ 0)
107|n|Knot Tying Tools|||||Knot Tying Tools
|||||d|d|0||0
which need to be joined/combined into one line
107|n|Knot Tying Tools|||||Knot Tying Tools|||||d|d|0||0
sed ':a;/0$/{N;s/\n//;ba}'
In a loop (branch ba to label :a), if the current line ends in 0 (/0$/) append next line (N) and remove inner newline (s/\n//).
awk:
awk '{while(/0$/) { getline a; $0=$0 a; sub(/\n/,_) }; print}'
Perl:
perl -pe '$_.=<>,s/\n// while /0$/'
bash:
while read line; do
if [ ${line: -1:1} != "0" ] ; then
echo $line
else echo -n $line
fi
done
awk could be short too:
awk '!/0$/{printf "%s", $0}/0$/'
test:
kent$ cat t
#aasdfasdf
#asbbb0
#asf
#asdf0
#xxxxxx
#bar
kent$ awk '!/0$/{printf "%s", $0}/0$/' t
#aasdfasdf#asbbb0
#asf#asdf0
#xxxxxx#bar
The rating of this answer is surprising ;s (this surprised wink emoticon pun on sed substitution is intentional) given the OP specifications: sed join lines together.
This submission's last comment
"if that's the case check what @ninjalj submitted"
also suggests checking the same answer.
ie. Check using sed ':a;/0$/{N;s/\n//;ba}' verbatim
sed ':a;/0$/{N;s/\n//;ba}'
does
no one
ie. 0
people,
try
nothing,
ie. 0
things,
any more,
ie. 0
tests?
(^D aka eot 004 ctrl-D ␄ ... bash generate via: echo ^V^D)
which will not give (do the test ;):
does no one ie. 0
people, try nothing, ie. 0
things, any more, ie. 0
tests? (^D aka eot 004 ctrl-D ␄ ... bash generate via: echo ^V^D)
To get this use:
sed 'H;${z;x;s/\n//g;p;};/0$/!d;z;x;s/\n//g;'
or:
sed ':a;/0$/!{N;s/\n//;ba}'
not:
sed ':a;/0$/{N;s/\n//;ba}'
Notes:
sed 'H;${x;s/\n//g;p;};/0$/!d;z;x;s/\n//g;'
does not use branching and
is identical to:
sed '${H;z;x;s/\n//g;p;};/0$/!{H;d;};/0$/{H;z;x;s/\n//g;}'
H commences all sequences
d short circuits further script command execution on the current line and starts the next cycle so address selectors following /0$/! can only be /0$/!! so the address selector of
/0$/{H;z;x;s/\n//g;} is redundant and not needed.
if a line does not end with 0 save it in hold space
/0$/!{H;d;}
if a line does end with 0 save it too and then print flush (double entendre ie. purged and lines aligned)
/0$/{H;z;x;s/\n//g;}
NB ${H;z;x;s/\n//g;p;} uses /0$/ ... commands with an extra p to coerce the final print and with a now unnecessary z (to empty and reset pattern space like s/.*//)
A typically cryptic Perl one-liner:
perl -pe 'BEGIN{$/="0\n"}s/\n//g;$_.=$/'
This uses the sequence "0\n" as the record separator (by your question, I'm assuming that every line should end with a zero). Any record then should not have internal newlines, so those are removed, then print the line, appending the 0 and newline that were removed.
Another take to your question would be to ensure each line has 17 pipe-separated fields. This does not assume that the 17th field value must be zero.
awk -F \| '
NF == 17 {print; next}
prev {print prev $0; prev = ""; next}
{prev = $0}
'
If the line does not end with 0, append the next line and remove the newline:
sed '/0$/!N;s/\n//'
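The sed one-liner above appends at most one following line per cycle, so a record split across three or more lines may stay broken. Here is a minimal awk sketch that keeps accumulating until a line ends in 0. The name join_until_zero is a hypothetical helper, and flushing a trailing fragment at end of input is an assumption about the desired behaviour.

```shell
# Join physical lines until one ends in 0, then print the joined record.
join_until_zero() {
    awk '
        { buf = buf $0 }                   # accumulate the current physical line
        /0$/ { print buf; buf = "" }       # logical record complete: flush it
        END { if (buf != "") print buf }   # trailing fragment without a final 0
    ' "$@"
}
```

It reads the files named as arguments, or stdin when none are given.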

How do I remove duplicate characters and keep the unique one only in Perl?

How do I remove duplicate characters and keep the unique one only.
For example, my input is:
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
Expected output is:
EFUAH
UEH
UJHACDEF
I came across perl -pe's/$1//g while/(.).*\1/' which is wonderful, but it removes even the single occurrence of the character from the output.
This can be done using positive lookahead :
perl -pe 's/(.)(?=.*?\1)//g' FILE_NAME
The regex used is: (.)(?=.*?\1)
. : matches any char.
first () : remembers the matched single char.
(?=...) : positive lookahead.
.*? : matches anything in between.
\1 : the remembered match.
(.)(?=.*?\1) : match and remember any char only if it appears again later in the string.
s/// : the Perl way of doing a substitution.
g : do the substitution globally, that is, don't stop after the first substitution.
s/(.)(?=.*?\1)//g : deletes a char from the input string only if that char appears again later in the string.
This will not maintain the order of the char in the input because for every unique char in the input string, we retain its last occurrence and not the first.
To keep the relative order intact we can do what KennyTM tells in one of the comments:
reverse the input line
do the substitution as before
reverse the result before printing
The Perl one line for this is:
perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' FILE_NAME
Since we are doing print manually after reversal, we don't use the -p flag but use the -n flag.
I'm not sure if this is the best one-liner to do this. I welcome others to edit this answer if they have a better alternative.
If Perl is not a must, you can also use awk. Here's a fun benchmark of the Perl one-liners posted here against awk: awk is 10+ seconds faster for a file with 3 million+ lines.
$ wc -l <file2
3210220
$ time awk 'BEGIN{FS=""}{delete _;for(i=1;i<=NF;i++){if(!_[$i]++) printf $i};print""}' file2 >/dev/null
real 1m1.761s
user 0m58.565s
sys 0m1.568s
$ time perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}' file2 > /dev/null
real 1m32.123s
user 1m23.623s
sys 0m3.450s
$ time perl -ne '$_=reverse;s/(.)(?=.*?\1)//g;print scalar reverse;' file2 >/dev/null
real 1m17.818s
user 1m10.611s
sys 0m2.557s
$ time perl -ne'my%s;print grep!$s{$_}++,split//' file2 >/dev/null
real 1m20.347s
user 1m13.069s
sys 0m2.896s
perl -ne'my%s;print grep!$s{$_}++,split//'
Here is a solution, that I think should work faster than the lookahead one, but is not regexp-based and uses hashtable.
perl -n -e '%seen=();' -e 'for (split //) {print unless $seen{$_}++;}'
It splits every line into characters and prints only the first appearance by counting appearances inside %seen hashtable
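The same first-appearance idea can be sketched in portable awk, with an array playing the role of the %seen hash. This is an illustration only: dedupe_chars is a made-up name, and characters are taken one at a time with substr(), which assumes single-byte characters unless the awk in use is locale-aware.

```shell
# For each line, keep only the first appearance of every character.
dedupe_chars() {
    awk '{
        split("", seen)                # portable way to empty the array per line
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (!(c in seen)) {        # first appearance: remember and keep it
                seen[c] = 1
                out = out c
            }
        }
        print out
    }' "$@"
}
```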
Tie::IxHash is a good module to store hash order (but may be slow, you will need to benchmark if speed is important). Example with tests:
use Test::More 0.88;
use Tie::IxHash;
sub dedupe {
my $str=shift;
my $hash=Tie::IxHash->new(map { $_ => 1} split //,$str);
return join('',$hash->Keys);
}
{
my $str='EFUAHUU';
is(dedupe($str),'EFUAH');
}
{
my $str='EFUAHHUU';
is(dedupe($str),'EFUAH');
}
{
my $str='UJUJHHACDEFUCU';
is(dedupe($str),'UJHACDEF');
}
done_testing();
Use uniq from List::MoreUtils:
perl -MList::MoreUtils=uniq -ne 'print uniq split ""'
If the set of characters that can be encountered is restricted, e.g. only letters, then the easiest solution will be with tr
perl -p -e 'tr/a-zA-Z/a-zA-Z/s'
It will replace all the letters by themselves, leaving other characters unaffected and /s modifier will squeeze repeated occurrences of the same character (after replacement), thus removing duplicates
Me bad - it removes only adjoining appearances. Disregard
This looks like a classic application of positive lookbehind, but unfortunately Perl doesn't support variable-length lookbehind. In fact, doing this (matching the preceding text of a character in a string with a full regex whose length is indeterminable) can only be done with .NET regex classes, I think.
However, positive lookahead supports full regexes, so all you need to do is reverse the string, apply positive lookahead (like unicornaddict said):
perl -pe 's/(.)(?=.*?\1)//g'
And reverse it back, because without the reverse that'll only keep the duplicate character at the last place in a line.
MASSIVE EDIT
I've been spending the last half an hour on this, and this looks like this works, without the reversing.
perl -pe 's/\G$1//g while (/(.).*(?=\1)/g)' FILE_NAME
I don't know whether to be proud or horrified. I'm basically doing the positive lookahead, then substituting on the string with \G specified, which makes the regex engine start its matching from the last place matched (internally represented by the pos() variable).
With test input like this:
aabbbcbbccbabb
EFAUUUUH
ABCBBBBD
DEEEFEGGH
AABBCC
The output is like this:
abc
EFAUH
ABCD
DEFGH
ABC
I think it's working...
Explanation - Okay, in case my explanation last time wasn't clear enough - the lookahead will go and stop at the last match of a duplicate variable [in the code you can do a print pos(); inside the loop to check] and the s/\G//g will remove it [you don't need the /g really]. So within the loop, the substitution will continue removing until all such duplicates are zapped. Of course, this might be a little too processor intensive for your tastes... but so are most of the regex-based solutions you'll see. The reversing/lookahead method will probably be more efficient than this, though.
From the shell, this works:
sed -e 's/$/<EOL>/ ; s/./&\n/g' test.txt | uniq | sed -e :a -e '$!N; s/\n//; ta ; s/<EOL>/\n/g'
In words: mark every linebreak with a <EOL> string, then put every character on a line of its own, then use uniq to remove duplicate lines, then strip out all the linebreaks, then put back linebreaks instead of the <EOL> markers.
I found the -e :a -e '$!N; s/\n//; ta part in a forum post and I don't understand the separate -e :a part, or the $!N part, so if anyone can explain those, I'd be grateful.
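As for that part: :a defines a label, $!N appends the next input line to the pattern space on every line except the last, s/\n// deletes the newline that N inserted, and ta jumps back to :a whenever that substitution succeeded. Below is a small sketch of the same loop, split into one -e per command. The name join_all_lines is only illustrative.

```shell
# Join all input lines into a single line.
join_all_lines() {
    # :a      label marking the start of the loop
    # $!N     if this is not the last line, append the next line
    # s/\n//  remove the newline that N put between the two lines
    # ta      if the substitution happened, branch back to :a
    sed -e ':a' -e '$!N' -e 's/\n//' -e 'ta' "$@"
}
```

On the last line N is skipped, the substitution finds no newline, ta does not branch, and the accumulated pattern space is printed once.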
Hmm, that one does only consecutive duplicates; to eliminate all duplicates you could do this:
cat test.txt | while read line ; do echo $line | sed -e 's/./&\n/g' | sort | uniq | sed -e :a -e '$!N; s/\n//; ta' ; done
That puts the characters in each line in alphabetical order though.
use strict;
use warnings;
my ($uniq, $seq, @result);
$uniq ='';
sub uniq {
$seq = shift;
for (split'',$seq) {
$uniq .=$_ unless $uniq =~ /$_/;
}
push @result,$uniq;
$uniq='';
}
while(<DATA>){
uniq($_);
}
print @result;
__DATA__
EFUAHUU
UUUEUUUUH
UJUJHHACDEFUCU
The output:
EFUAH
UEH
UJHACDEF
for a file containing the data you list named foo.txt
python -c "print(set(open('foo.txt').read()))"

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n suppresses automatic printing of each line
-r enables extended regular expressions, so you don't have to escape the capture group parentheses ()
\1 the capture group match
/g global match
/p print the result
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
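Here is a sketch of the negated-character-class workaround mentioned above, assuming the goal is the first run of digits on each line. The name first_number is a hypothetical helper.

```shell
# Print the first run of digits found on each line that contains one.
first_number() {
    # ^[^0-9]* can only consume non-digits, so it stops right before the
    # first digit instead of greedily swallowing the whole line.
    sed -n 's/^[^0-9]*\([0-9][0-9]*\).*$/\1/p' "$@"
}
```

Because the leading part cannot match digits, the capture group always starts at the first digit. Lines without any digit are not printed at all.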
You can use GNU awk's match() with a third array argument to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
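Where only a POSIX awk without the array argument to match() is available, a comparable result can be had from RSTART and RLENGTH. This sketch assumes the fixed three-character abc prefix and xyz suffix of the example, and extract_number is a made-up name.

```shell
# Print the digits between abc and xyz on each matching line.
extract_number() {
    awk 'match($0, /abc[0-9]+xyz/) {
        # RSTART and RLENGTH describe the whole match; skip the 3-char
        # "abc" prefix and leave out the 3-char "xyz" suffix.
        print substr($0, RSTART + 3, RLENGTH - 6)
    }' "$@"
}
```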
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs within abc and xyz and just prints the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then one way to use gawk and components of a regex is the gensub function.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need to put the .* before abc and after xyz to get rid of text before and after the number in the substitution. The abc and xyz anchors matter: with a bare .*([0-9]+).* the greedy leading .* would leave only the last digit for the capture group.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
why even need match group
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default, do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file