awk - How to get only the matching portion of a regex - regex

I have code like this
echo abc | awk '$0 ~ "a\(b\)c" {print $0}'
What if I only wanted what's in the parentheses instead of the whole line? This is obviously very simplified, and there is really a lot of awk code so I don't want to switch to sed or grep or something. Thanks

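As far as I know, you cannot do it in the pattern part; you must do it in the action part with the match() function. Note that the three-argument form used here, which fills the array a with the captured groups, is a GNU awk extension: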
echo abc | awk '{ if ( match($0, /a(b)c/, a) > 0 ) { print a[1] } }'
It yields:
b

With GNU awk:
$ echo abc | awk '{print gensub(/a(b)c/,"\\1",1)}'
b
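Both answers above depend on GNU awk. For other awks, a minimal sketch of the same idea can use the two-argument match(), which sets RSTART and RLENGTH, together with substr(). The helper name extract_id and the sample input are made up for illustration, and the offsets are tied to the fixed prefix in each pattern, so this is not a general replacement for capture groups.

```shell
# Hypothetical helper: print the digits that follow "id=" on each input line.
# match() sets RSTART (where the match begins) and RLENGTH (its length).
extract_id() {
  awk 'match($0, /id=[0-9]+/) {
    # Skip the 3-character prefix "id=" and keep the rest of the match.
    print substr($0, RSTART + 3, RLENGTH - 3)
  }'
}

echo 'xx id=123; yy' | extract_id   # prints 123

# The same idea applied to the example from the question: keep the "b" of "abc".
echo abc | awk 'match($0, /abc/) { print substr($0, RSTART + 1, 1) }'   # prints b
```

Because the pattern is used as the rule's condition, lines without a match print nothing.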

Related

Awk regex expression to select for a certain number of delimiters in a field

I am trying to select for a field which has exactly a certain number of commas. For example, I can select for 1 comma in a field as follows:
$ echo jkl,abc | awk '$1 ~ /[a-z],[a-z]/{print $0}'
jkl,abc
The expected output, "jkl,abc", is seen.
However, when I try for 2 commas it doesn't work.
$ echo jkl,abc,xyz | awk '$1 ~ /[a-z],[a-z],[a-z]/{print $0}'
(no output)
Any thoughts?
Thanks!
/[a-z],[a-z],[a-z]/ doesn't match jkl,abc,xyz because it has no quantifiers: each [a-z] matches exactly one letter, but there are three letters between the commas. The right regex would be /^[a-z]+,[a-z]+,[a-z]+$/, e.g.
awk '/^[a-z]+,[a-z]+,[a-z]+$/' <<< 'jkl,abc,xyz'
However, to validate number of commas, it would be better to compare number of fields while using FS = "," like this:
awk -F, 'NF == 2' <<< 'jkl,abc'
awk -F, 'NF == 3' <<< 'jkl,abc,xyz'
jkl,abc
jkl,abc,xyz
It should be like:
echo jkl,abc,xyz | awk '/[a-z]+,[a-z]+,[a-z]+/{print $0}'
OR
echo jkl,abc,xyz | awk '/[a-z]+,[a-z]+,[a-z]+/'
Why the OP's code does not work:
The regex allows only a single [a-z] character between the commas, but the line has more than one character there, so it does not match. Also, $1 is not required in your code because you are matching against the whole line, so I have removed the $1 part from the solution.
In case you have multiple fields(separated by spaces) and you want to check condition on 1st part then you could go with:
echo "jkl,abc,xyz blabla" | awk '$1 ~ /[a-z]+,[a-z]+,[a-z]+/'
Your middle segment of the regexp wasn't accounting for more than one letter between the commas so you should have made just that one part of it [a-z]* or [a-z]+ depending on your requirements for handling the case of zero letters.
Some approaches to consider to find 2 or more commas in a field:
$ echo jkl,abc,xyz | awk '$1 ~ /[a-z],[a-z]*,[a-z]/'
jkl,abc,xyz
$ echo jkl,abc,xyz | awk '$1 ~ /([a-z]*,){2,}/'
jkl,abc,xyz
$ echo jkl,abc,xyz | awk '$1 ~ /[^,],[^,]*,[^,]/'
jkl,abc,xyz
$ echo jkl,abc,xyz | awk '$1 ~ /([^,]*,){2,}/'
jkl,abc,xyz
$ echo jkl,abc,xyz | awk 'gsub(/,/,"&",$1) > 1'
jkl,abc,xyz
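The approaches above test for two or more commas. If the goal is exactly N commas in the first field, as in the original question, one possible sketch counts them with the return value of gsub() on a copy of the field. The helper name count_commas and the sample lines are hypothetical.

```shell
# Hypothetical helper: print lines whose first field has exactly N commas.
count_commas() {
  awk -v want="$1" '{
    field = $1                  # work on a copy so $1 and $0 stay untouched
    n = gsub(/,/, "", field)    # gsub returns the number of replacements made
  }
  n == want'
}

printf '%s\n' 'jkl,abc' 'jkl,abc,xyz extra' 'a,b,c,d' | count_commas 2
# prints: jkl,abc,xyz extra
```

Copying the field first avoids assigning to $1, which would make awk rebuild $0 with single spaces between fields.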

what is regular expression to get the data after _

I am having a filename like:2015_q1_cricket_international.txt
How can I get the data after underscore(_).
my final output should be 2015internationalcricket
Using awk
Let's create a shell variable with your file name:
$ fname=2015_q1_cricket_international.txt
Now, let's extract the parts that you want:
$ echo "$fname" | awk -F'[_.]' '{print $1 $4 $3}'
2015internationalcricket
How it works:
-F'[_.]' tells awk to split the input anywhere it sees either a _ or a .
print $1 $4 $3 tells awk to print the parts that you asked for
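To see which field number each part of the name ends up in, a small sketch can print every field with its index (the variable name fields is only for illustration):

```shell
fname=2015_q1_cricket_international.txt

# Split on "_" or "." and list each field next to its number.
fields=$(echo "$fname" | awk -F'[_.]' '{ for (i = 1; i <= NF; i++) print i ": " $i }')
echo "$fields"
# 1: 2015
# 2: q1
# 3: cricket
# 4: international
# 5: txt
```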
Using shell
$ echo "$fname" | { IFS='_.' read a b c d e; echo "$a$d$c"; }
2015internationalcricket
Using sed
$ echo "$fname" | sed -E 's/^([^_.]*)_([^_.]*)_([^_.]*)_([^_.]*).*/\1\4\3/'
2015internationalcricket
Capturing to a shell variable
If we want to put the new string in a shell variable, we use command substitution:
var=$(echo "$fname" | awk -F'[_.]' '{print $1 $4 $3}')
var=$(echo "$fname" | { IFS='_.' read a b c d e; echo "$a$d$c"; })
var=$(echo "$fname" | sed -E 's/^([^_.]*)_([^_.]*)_([^_.]*)_([^_.]*).*/\1\4\3/')
If the shell is bash, we can do this more directly:
IFS='_.' read a b c d e <<<"$fname"
var="$a$d$c"
.*_([^_]*)_.* gets «cricket» as \1
You can use String.Split('_') to get an array of the parts, or you can use the regular expression _[A-Za-z0-9]*, which matches each underscore together with the characters that follow it; on this filename it produces three matches.
All the results are returned in an Array.

Search regex on a specific field using awk

In awk I can search a field for a value like:
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; $2=="eaae" {print $0};'
dd,eaae,ff
And I can search by regular expressions like
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; /[a]{2}/ {print $0};'
aa,bb,cc
dd,eaae,ff
Can I force the awk to apply the regexp search to a specific field ? I'm looking for something like
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; $2==/[a]{2}/ {print $0};'
expecting result:
dd,eaae,ff
Anyone know how to do it using awk?
Accepted response - Operator "~" (thanks to hek2mgl):
$ echo -e "aa,bb,cc\ndd,eaae,ff" | awk 'BEGIN{FS=",";}; $2 ~ /[a]{2}/ {print $0};'
You can use:
$2 ~ /REGEX/ {ACTION}
If the regex should apply to the second field (for example) only.
In your case this would lead to:
awk -F, '$2 ~ /[a]{2}/' <<< $'aa,bb,cc\ndd,eaae,ff'
You may wonder why I've just used the regex in the awk program and no print. This is because your action is print $0 - printing the current line - which is the default action in awk.
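Whether the regex is anchored decides if the field must merely contain the pattern or be equal to it, and !~ negates the test. A small sketch of the three cases follows; the third sample line and the variable names are added here purely for illustration.

```shell
data=$(printf '%s\n' 'aa,bb,cc' 'dd,eaae,ff' 'gg,aa,hh')

# Second field contains "aa" anywhere.
contains=$(printf '%s\n' "$data" | awk -F, '$2 ~ /aa/')
# Second field is exactly "aa".
exact=$(printf '%s\n' "$data" | awk -F, '$2 ~ /^aa$/')
# Second field does not contain "aa".
without=$(printf '%s\n' "$data" | awk -F, '$2 !~ /aa/')

echo "contains: $contains"   # dd,eaae,ff and gg,aa,hh
echo "exact:    $exact"      # gg,aa,hh
echo "without:  $without"    # aa,bb,cc
```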

awk regex can't match ip addresses when trying to find repeating digits

I can't get the following to match any IP addresses
awk '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/{print $0}' maillog
or this one...
awk '/[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}/' maillog
but this works...
awk '/127.0.0.1/{print $0}' maillog
and so does this...
awk '/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]/{print $0}' maillog
What am I doing wrong in the first two?
To use an interval such as {1,3} with GNU awk, you may need to enable it with --re-interval (versions before 4.0 do not enable interval expressions by default), like this:
awk --re-interval '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/{print $0}' maillog
They are just fine.
The following is working for me.
$ echo "2.168.1.1" | awk '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/{print $0}'
2.168.1.1
$ echo "2.1.1.1" | awk '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/{print $0}'
2.1.1.1
$ echo "22.1.1.1" | awk '/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/{print $0}'
22.1.1.1
I would investigate your maillog and make sure that everything there is in plaintext.
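If the awk in use does not support interval expressions at all, one possible workaround is to match with + and then check the length and range of each octet separately. The sketch below assumes that approach; the helper name find_ips and the sample log lines are invented for illustration and are not taken from a real maillog.

```shell
# Hypothetical helper: print every dotted quad whose octets are 0-255.
find_ips() {
  awk '{
    line = $0
    # Walk through all candidates on the line, left to right.
    while (match(line, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) {
      cand = substr(line, RSTART, RLENGTH)
      n = split(cand, oct, ".")
      ok = (n == 4)
      for (i = 1; i <= n; i++)
        if (length(oct[i]) > 3 || oct[i] + 0 > 255) ok = 0
      if (ok) print cand
      line = substr(line, RSTART + RLENGTH)   # continue after this match
    }
  }'
}

printf '%s\n' 'connect from unknown[192.168.1.10]' \
              'relay=10.0.0.300 status=sent' \
              'no address here' | find_ips
# prints: 192.168.1.10
```

The range check also rejects strings such as 10.0.0.300, which the {1,3} pattern alone would accept.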

AWK regex convert 3 letter word beginning with 'a' to uppercase

I have my regex expression to find 3 letter words beginning with "a"...
\b[aA][a-z]{2}\b
(seems to work, according to this! check it out: http://rubular.com/r/Jil0E4WZnW)
Now I need to know how to take that result and replace the lowercase word with the three letter word in uppercase.
Thanks!
call toupper function in awk:
echo "Abc" | awk '{print toupper($0)}'
gets you:
ABC
You can make use of Perl's uc($string) function.
You can do it with Sed like this:
echo 'Ass ass ant Ant' | sed -re 's/\ba[a-z]{2}\b/\U&/gI'   # GNU sed: \b, \U and the I flag are extensions
(with your example string)
Another way is to use tr:
echo "Abc" | tr 'a-z' 'A-Z'
This solution "cheats" because it uses a loop and sub instead of gsub, but it is in awk and it works.
echo "abc Ape baaa ab abcd ant" | awk '{for (i=1;i<=NF;i++) if (length($i)==3){sub(/[aA][a-z]{2}/,toupper($i),$i)};print}'
perl -pe '$_=~s/\b([aA][a-z]{2})\b/\U$1/g;' your_file
tested:
> echo "Abc ab Ab" | perl -pe '$_=~s/\b([aA][a-z]{2})\b/\U$1/g;'
ABC ab Ab
Taken from here
Here is the awk version:
awk '{for(i=1;i<=NF;i++)
if((length($i)==3) && $i~/[aA][a-zA-Z][a-zA-Z]/)
$i=toupper($i)
}1' your_file
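The field-based awk versions above rebuild the line, so runs of whitespace collapse to single spaces and a word followed by punctuation (such as "ant.") is not counted as three characters long. A sketch that avoids both issues walks the line with match() instead. It assumes a word is a run of ASCII letters and, like the last answer, accepts any letter case after the leading a or A. The helper name upcase_a_words is hypothetical.

```shell
# Hypothetical helper: uppercase 3-letter words that start with a or A,
# leaving spacing and punctuation exactly as they were.
upcase_a_words() {
  awk '{
    out = ""; rest = $0
    while (match(rest, /[A-Za-z]+/)) {
      word = substr(rest, RSTART, RLENGTH)
      if (RLENGTH == 3 && word ~ /^[aA]/) word = toupper(word)
      out = out substr(rest, 1, RSTART - 1) word   # text before the word, then the word
      rest = substr(rest, RSTART + RLENGTH)        # continue after the word
    }
    print out rest
  }'
}

echo 'abc Ape  baaa, ant.' | upcase_a_words
# prints: ABC APE  baaa, ANT.
```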