AWK: Access captured group from line pattern - regex

If I have an awk command
pattern { ... }
and pattern uses a capturing group, how can I access the string so captured in the block?

With gawk, you can use the match function to capture parenthesized groups.
gawk 'match($0, pattern, ary) {print ary[1]}'
example:
echo "abcdef" | gawk 'match($0, /b(.*)e/, a) {print a[1]}'
outputs cd.
Note the specific use of gawk which implements the feature in question.
For a portable alternative you can achieve similar results with match() and substr.
example:
echo "abcdef" | awk 'match($0, /b[^e]*/) {print substr($0, RSTART+1, RLENGTH-1)}'
outputs cd.
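The portable match()/substr() idiom above generalizes to "known prefix, variable value" patterns. The following is a minimal sketch, assuming a made-up input format with an `id=` key; the helper name `extract_id` and the `prefix` variable are hypothetical and only serve the illustration.

```shell
# Hypothetical helper: print the digits that follow "id=" on each line.
# Uses only POSIX awk features (match, RSTART, RLENGTH, substr).
extract_id() {
  awk -v prefix='id=' 'match($0, /id=[0-9]+/) {
    # Skip the fixed prefix so that only the "captured" part is printed.
    print substr($0, RSTART + length(prefix), RLENGTH - length(prefix))
  }'
}

printf 'user id=42 name=x\n' | extract_id
```

Because the prefix has a fixed length, the offset arithmetic plays the role of the capture group.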

That was a stroll down memory lane...
I replaced awk by perl a long time ago.
Apparently the standard AWK regular expression engine does not expose its capture groups.
You might consider using something like:
perl -n -e'/test(\d+)/ && print $1'
The -n flag causes perl to loop over every line like awk does.

This is something I need all the time so I created a bash function for it. It's based on glenn jackman's answer.
Definition
Add this to your .bash_profile etc.
function regex { gawk 'match($0,/'"$1"'/, ary) {print ary['"${2:-0}"']}'; }
Usage
Capture regex for each line in file
$ cat filename | regex '.*'
Capture 1st regex capture group for each line in file
$ cat filename | regex '(.*)' 1
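As a possible variation, the pattern can be handed to awk with -v instead of being pasted into the program text. This is only a sketch: the name `regex_match` is hypothetical, it prints the whole match rather than a capture group (so it runs in any POSIX awk), and note that -v interprets backslash escapes, so a backslash in the pattern has to be doubled.

```shell
# Hypothetical helper: print the text matched by an ERE on each input line.
# The pattern travels in the awk variable "re", not in the program source.
regex_match() {
  awk -v re="$1" 'match($0, re) { print substr($0, RSTART, RLENGTH) }'
}

printf 'request took 15ms\nno timing here\n' | regex_match '[0-9]+ms$'
```

Lines that do not match produce no output, as with the gawk-based function.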

You can use GNU awk:
$ cat hta
RewriteCond %{HTTP_HOST} !^www\.mysite\.net$
RewriteRule (.*) http://www.mysite.net/$1 [R=301,L]
$ gawk 'match($0, /.*(http.*)\$/, m) { print m[1]; }' < hta
http://www.mysite.net/

NOTE: gensub is a gawk extension and is not POSIX compliant, so the example below needs gawk.
You can simulate capturing without the array argument of match() too. It's not intuitive though:
step 1. Use gensub to surround matches with some character that doesn't appear in your string.
step 2. Use split against that character.
step 3. Every other element in the resulting array is your capture group.
$ echo 'ab cb ad' | gawk '{ split(gensub(/a./,SUBSEP"&"SUBSEP,"g",$0),cap,SUBSEP); print cap[2]"|" cap[4] ; }'
ab|ad
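The same surround-and-split idea can be sketched in POSIX awk by running gsub() on a copy of the record instead of calling gensub(). The function name `surround_and_split` is hypothetical, and the sketch assumes SUBSEP never occurs in the input.

```shell
# Hypothetical helper: wrap every match of /a./ in SUBSEP, then split on it.
surround_and_split() {
  awk '{
    s = $0                             # work on a copy so $0 stays intact
    gsub(/a./, SUBSEP "&" SUBSEP, s)   # "&" stands for the matched text
    split(s, cap, SUBSEP)              # even-numbered elements are the matches
    print cap[2] "|" cap[4]
  }'
}

echo 'ab cb ad' | surround_and_split
```

This prints ab|ad, the same result as the gensub version.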

I struggled a bit with coming up with a bash function that wraps Peter Tillemans' answer but here's what I came up with:
function regex
{
perl -n -e "/$1/ && printf \"%s\n\", "'$1'
}
I found this worked better than opsb's awk-based bash function for the following regular expression argument, because I do not want the "ms" to be printed.
'([0-9]*)ms$'

I think gawk's match()-to-array form only captures the first instance of the capture group.
If there are multiple things you'd like to capture, and you want to perform complex operations on them, perhaps:
gawk 'BEGIN { S = SUBSEP }
{
# wrap each capture in SUBSEP, then split the result into arr
nx = split(gensub(/(..(..)..(..))/,
"\\1" S "\\2" S "\\3", "g", $0),
arr, S)
# nx is the number of elements that split() produced
for (x = 1; x <= nx; x++) print arr[x] # replace print with your own operations
}'
This way you aren't constrained by either gensub(), which limits the complexity of your modifications, or by match().
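If only the matched text of each occurrence is needed, another option is to call match() in a loop and consume the string piece by piece. This is a minimal sketch in POSIX awk; the helper name `all_matches` and the digit pattern are assumptions chosen for the example.

```shell
# Hypothetical helper: print every run of digits on each line, one per line.
all_matches() {
  awk '{
    s = $0
    while (match(s, /[0-9]+/)) {
      print substr(s, RSTART, RLENGTH)   # the current match
      s = substr(s, RSTART + RLENGTH)    # continue after the match
    }
  }'
}

echo 'a12b345c6' | all_matches
```

The loop stops as soon as match() returns 0, that is, when no further occurrence is left in the remainder.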
By pure trial and error, one caveat I've noted about gawk in Unicode mode: for a valid Unicode string 뀇꿬 with the 6 octal byte codes shown below:
Scenario 1: matching an individual byte works, but match() reports the multi-byte RSTART of 1 instead of a byte-level answer of 2. It also won't tell you whether \207 is the first continuation byte or the second one, since RLENGTH will always be 1 here.
$ gawk 'BEGIN{ print match("\353\200\207\352\277\254", "\207") }'
1
Scenario 2: match() also works against Unicode-invalid patterns like this:
$ gawk 'BEGIN{ match("\353\200\207\352\277\254", "\207\352");
print RSTART, RLENGTH }'
1 2
Scenario 3: you can check for the existence of a pattern in a Unicode-illegal string (\300, i.e. \xC0, is UTF-8-invalid for all possible byte pairings):
$ gawk 'BEGIN{ print ("\300\353\200\207\352\277\254" ~ /\200/) }'
1
Scenarios 4/5/6: the warning shows up for (a) match() with a Unicode-invalid string, or (b) index() when either argument is Unicode-invalid or incomplete.
$ gawk 'BEGIN{ match("\300\353\200\207\352\277\254", "\207\352"); print RSTART, RLENGTH }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
2 2
$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\352") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0
$ gawk 'BEGIN{ print index("\353\200\207\352\277\254", "\200") }'
gawk: cmd. line:1: warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
0

Related

Catch specific string using regex

I have multiple boards. Inside my bash script, I want to catch my root filesystem name using regex. When I do a cat /proc/cmdline, I have this:
BOOT_IMAGE=/vmlinuz-5.15.0-57-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7
I just want to select /dev/mapper/vgubuntu-root
So far I have managed to catch root=/dev/mapper/vgubuntu-root using this command
\broot=[^ ]+
You can use your regex in sed with a capture group:
sed -E 's~.* root=([^ ]+).*~\1~' /proc/cmdline
/dev/mapper/vgubuntu-root
Another option is to use awk (should work in any awk):
awk 'match($0, /root=[^ ]+/) {
print substr($0, RSTART+5, RLENGTH-5)
}' /proc/cmdline
# if your string is always 2nd field then a simpler one
awk '{sub(/^[^=]+=/, "", $2); print $2}' /proc/cmdline
1st solution: With your shown samples, please try the following code in GNU awk.
awk -v RS='[[:space:]]+root=[^[:space:]]+' '
RT && split(RT,arr,"="){
print arr[2]
}
' Input_file
2nd solution: With GNU grep you could use the -oP options to enable PCRE and print only the matched text. In the regex ^.*?[[:space:]]root=\K\S+, \K discards everything matched up to and including root=, so only the value that follows is printed.
grep -oP '^.*?[[:space:]]root=\K\S+' Input_file
3rd solution: If your Input_file always has the same layout as the shown sample, try this simple awk that uses multiple field separators.
awk -F' |root=' '{print $3}' Input_file
If the second field has the value, using awk you can split and check for root
awk '
{
n=split($2,a,"=")
if (n==2 && a[1]=="root"){
print a[2]
}
}
' file
Output
/dev/mapper/vgubuntu-root
Or using GNU-awk with a capture group
awk 'match($0, /(^|\s)root=(\S+)/, a) {print a[2]}' file
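If root= is not guaranteed to be the second field, a loop over all fields avoids both the position assumption and the GNU-only array form of match(). This is a sketch in POSIX awk; the function name `root_device` is hypothetical.

```shell
# Hypothetical helper: print the value of the first root=... field per line.
root_device() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^root=/) {
        print substr($i, 6)   # "root=" is 5 characters, so the value starts at 6
        next
      }
  }' "$@"
}

echo 'BOOT_IMAGE=/vmlinuz-5.15.0-57-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7' | root_device
```

Called with a file name, for example `root_device /proc/cmdline`, it reads that file instead of standard input.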
Since you are using Linux, you can use a GNU grep:
grep -oP '\broot=\K\S+'
where -o prints only the matched text and -P sets the regex engine to PCRE. Details:
\b - word boundary
root= - a fixed string
\K - match reset operator discarding the text matched so far
\S+ - one or more non-whitespace chars.
another awk solution, using good ole' FS / OFS :
-- no PCRE, capture groups, match(), g/sub(), or substr() needed
echo 'BOOT_IMAGE=/vmlinuz-5.15.0-57-generic root=/dev/mapper/vgubuntu-root ro quiet splash vt.handoff=7' |
mawk NF=NF FS='^[^=]+=[^=]+=| [^/]+$' OFS=
/dev/mapper/vgubuntu-root
if you're very very certain the structure has root=, then :
gawk NF=NF FS='^.+root=| .+$' OFS=
/dev/mapper/vgubuntu-root
if you like doing it the RS way instead :
nawk '$!NF = $NF' FS== RS=' [^/]+\n'
/dev/mapper/vgubuntu-root

Using protected wildcard character in awk field separator doesn't work

I have a file that contains paragraphs separated by lines of *(any amount). When I use egrep with the regex of '^\*+$' it works as intended, only displaying the lines that contain only stars.
However, when I use the same expression in awk -F or awk FS, it doesn't work and just prints out the whole document, excluding the lines of stars.
Commands that I tried so far:
awk -F'^\*+$' '{print $1, $2}' msgs
awk -F'/^\*+$/' '{print $1, $2}' msgs
awk 'BEGIN{ FS="/^\*+$/" } ; { print $1,$2 }' msgs
Printing the first field always prints out the whole document, using the first version it excludes the lines with the stars, other versions include everything from the file.
Example input:
Par1 test teststsdsfsfdsf
fdsfdsfdsftesyt
fdsfdsfdsf fddsteste345sdfs
***
Par2 dsadawe232343a5edsfe
43s4esfsd s45s45e4t rfgsd45
***
Par3 dsadasd
fasfasf53sdf sfdsf s45 sdfs
dfsf dsf
***
Par4 dasdasda r3ar d afa fs
ds fgdsfgsdfaser ar53d f
***
Par 5 dasdawr3r35a
fsada35awfds46 s46 sdfsds5 34sdf
***
Expected output for print $1:
Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs
EDIT: Added example input and expected output
Strings used as regexps in awk are parsed twice:
to turn them into a regexp, and
to use them as a regexp.
So if you want to use a string as a regexp (including any time you assign a Field Separator or Record Separator as a regexp) then you need to double any escapes as each iteration of parsing will consume one of them. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps for details.
Good (a literal/constant regexp):
$ echo 'a(b)c' | awk '$0 ~ /\(b)/'
a(b)c
Bad (a poorly-written dynamic/computed regexp):
$ echo 'a(b)c' | awk '$0 ~ "\(b)"'
awk: cmd. line:1: warning: escape sequence `\(' treated as plain `('
a(b)c
Good (a well-written dynamic/computed regexp):
$ echo 'a(b)c' | awk '$0 ~ "\\(b)"'
a(b)c
but IMHO if you're having to double escapes to make a char literal then it's clearer to use a bracket expression instead:
$ echo 'a(b)c' | awk '$0 ~ "[(]b)"'
a(b)c
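The same rule applies when the string ends up in FS. As a small sketch (the function names are hypothetical), both forms below split on a run of literal stars: the first relies on a bracket expression, the second on a doubled backslash inside an awk string.

```shell
# Hypothetical helpers: print the second field when splitting on runs of "*".
split_on_stars_bracket() {
  awk -F'[*]+' '{ print $2 }'
}

split_on_stars_escaped() {
  # "\\*+" in awk source becomes the string \*+ and then the regexp \*+
  awk 'BEGIN { FS = "\\*+" } { print $2 }'
}

echo 'left***right' | split_on_stars_bracket
echo 'left***right' | split_on_stars_escaped
```

Both calls print right; the bracket form needs no counting of backslashes.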
Also, ^ in a regexp means "start of string" which is only matched at the start of all the input, just like $ would only be matched at the end of all of the input. ^ does not mean "start of line" as some documents/scripts may lead you to believe. It only appears to mean that in grep and sed because they are line-oriented and so usually the script is being compared to 1 line at a time, but awk isn't line-oriented, it's record-oriented and so the input being compared to the regexp isn't necessarily just a line (the same is true in sed if you read multiple lines into its hold space).
So to match a line of *s as a Record Separator (RS) assuming you're using gawk or some other awk that can treat a multi-char RS as a regexp, you'd have to write this regexp:
(^|\n)[*]+(\n|$)
but be aware that also matches the newlines before the first and after the last *s on the target lines so you need to handle that appropriately in your code.
It seems like this is what you're really trying to do:
$ awk -v RS='(^|\n)[*]+(\n|$)' 'NR==1{$1=$1; print}' file
Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs
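For an awk that does not treat a multi-character RS as a regexp, a line-by-line variant is possible. This is a sketch, not the approach from the answer above: with the default RS every record is a single line, so ^ and $ do anchor to that line. The function name `first_paragraph` is hypothetical.

```shell
# Hypothetical helper: join the lines before the first all-stars line.
first_paragraph() {
  awk '
    /^[*]+$/ { exit }                                 # stop at the separator
    { out = out (out == "" ? "" : " ") $0 }           # append line with a space
    END { print out }                                 # END still runs after exit
  ' "$@"
}

printf 'Par1 test\nmore text\n***\nPar2 other\n' | first_paragraph
```

Unlike `$1=$1`, this keeps the whitespace inside each line unchanged and only replaces the newlines with single spaces.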

How to create awk regex to match only one "space" between two words?

I have a sentence of the form 2016-23-12 90-34-23 and want to create an awk script to match it.
a.awk
$1 ~ /^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]][[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}/{
ts = $1 " " $2
print
}
Run using:
awk -f a.awk --posix
2016-23-12 90-34-23
Output:
Nothing
I assume your intention is to match the whole string, in which case $1 is incorrect; use $0 instead. $1 holds only the first whitespace-separated field (2016-23-12), so a regex that spans the space can never match it.
The right-hand side of ~ also doesn't have to be a /regex/ constant. You can store the pattern in a string (a dynamic regexp), in which case the // is not needed, and write your script as:
BEGIN {
dynamicRegex = "[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]][[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}"
}
$0 ~ dynamicRegex {
print "match success"
}
and now running the script as
echo "2016-23-12 90-34-23"| awk -f a.awk --posix
match success
Quoting from the page,
[..]The righthand side of a ~ or !~ operator need not be a regexp constant (i.e., a string of characters between slashes). It may be any expression. The expression is evaluated and converted to a string if necessary; the contents of the string are used as the regexp. A regexp that is computed in this way is called a dynamic regexp [..]
Another way would be to use the normal Regular Expression syntax over the POSIX character classes as a regexp constant as below,
$0 ~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}-[0-9]{2}-[0-9]{2}$/ {
print "match success"
}
Remember that with the above regex your script is no longer POSIX compatible and running it with --posix won't work, because the \s here is a GNU Awk specific construct. Running it as
echo "2016-23-12 90-34-23"| awk -f a.awk
match success
To also print the line on a successful match, add
print $1 FS $2
after the earlier print command.
Try this -
$ echo "2016-23-12 90-34-23" | awk '{if($0 ~ /^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]][[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}$/) {print $0}}'
2016-23-12 90-34-23
$ echo "2016-23-121 190-34-23" | awk '{if($0 ~ /^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}[[:space:]][[:digit:]]{2}-[[:digit:]]{2}-[[:digit:]]{2}$/) {print $0}}'
##### No result
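To require exactly one literal space between the two parts, the pattern can use a plain space instead of [[:space:]] and be anchored at both ends. The sketch below builds it as a dynamic regexp in a BEGIN block and spells out the digits, which avoids depending on interval expressions such as {4} that some older awks lack. The names `match_stamp`, `d2` and `re` are hypothetical.

```shell
# Hypothetical helper: report lines of the form NNNN-NN-NN NN-NN-NN
# with exactly one space between the two parts.
match_stamp() {
  awk '
    BEGIN {
      d2 = "[0-9][0-9]"                                   # two digits
      re = "^" d2 d2 "-" d2 "-" d2 " " d2 "-" d2 "-" d2 "$"
    }
    $0 ~ re { print "match success" }
  '
}

echo "2016-23-12 90-34-23" | match_stamp
```

A line with two spaces, a tab, or extra digits produces no output because of the ^ and $ anchors.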

How can I use bash variable in awk with regexp?

I have a file like this (this is sample):
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
21.23.51.22|212.152.22.12|71.13.54.12|8.8.8.8
...
I have iplist.txt like this:
71.13.55.
12.33.23.
8.8.
4.2.
...
I need to select the lines whose 3rd column starts with one of the prefixes in iplist.txt.
Like this:
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
I tried:
for ip in $(cat iplist.txt); do
awk -v var="$ip" -F '|' '{if ($3 ~ /^$var/) print $0;}' text.txt
done
But bash variable does not work in /^ / regex block. How can I do that?
First, you can use a concatenation of strings for the regular expression, it doesn't have to be a regex block. You can say:
'{if ($3 ~ "^" var) print $0;}'
Second, note above that you don't use a $ with variables inside awk. $ is only used to refer to fields by number (as in $3, or $somevar where somevar has a field number as its value).
Third, you can do everything in awk in which case you can avoid the shell loop and don't need the var:
awk -F'|' 'NR==FNR {a["^" $0]; next} { for (i in a) if ($3 ~ i) {print;next} }' iplist.txt r.txt
71.13.55.12|212.152.22.12|71.13.55.12|8.8.8.8
81.23.45.12|212.152.22.12|71.13.55.13|8.8.8.8
61.53.54.62|212.152.22.12|71.13.55.14|8.8.8.8
EDIT
As rightly pointed out in the comments, the .s in the patterns will match any character, not just a literal .. Thus we need to escape them before doing the match:
awk -F'|' 'NR==FNR {gsub(/\./,"\\."); a["^" $0]; next} { for (i in a) if ($3 ~ i) {print;next} }' iplist.txt r.txt
I'm assuming that you only want to output a given line once, even if it matches multiple patterns from iplist.txt. If you want to output a line multiple times for multiple matches (as your version would have done), remove the next from {print;next}.
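Since the prefixes are meant literally, a string comparison sidesteps the escaping question entirely. The following is a sketch of that alternative using index(), which returns 1 when the field starts with the prefix; the function name `filter_by_prefix` and its argument order are assumptions.

```shell
# Hypothetical helper: usage: filter_by_prefix PREFIX_FILE DATA_FILE
# Prints each data line whose 3rd |-separated field starts with a listed prefix.
filter_by_prefix() {
  awk -F'|' '
    NR == FNR { if ($0 != "") prefixes[$0]; next }   # first file: load prefixes
    {
      for (p in prefixes)
        if (index($3, p) == 1) { print; next }       # literal prefix test
    }
  ' "$1" "$2"
}
```

For example, `filter_by_prefix iplist.txt r.txt` would print each matching line once; no dot is ever treated as a wildcard because no regexp is involved.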
Use var directly, instead of in /^$var/ ( adding ^ to the variable first):
awk -v var="^$ip" -F '|' '$3 ~ var' text.txt
By the way, the default action for a true condition is to print the current record, so, {if (test) {print $0}} can often be contracted to just test.
Here is a way with bash, sed and grep. It's straightforward and I think it may be a bit cleaner than awk in this case:
IFS=$(echo -en "\n\b") && for ip in $(sed 's/\./\\&/g' iplist.txt); do
grep "^[^|]*|[^|]*|${ip}" r.txt
done

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is most elegant and simple method to achieve this; sed, awk or just unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
@EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1",1)}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
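The same field-based logic can be generalized to "keep the first N fields". This is a sketch under the assumption that the delimiter is a single character; the function name `keep_fields` and its parameters are hypothetical.

```shell
# Hypothetical helper: usage: keep_fields N [DELIM]   (DELIM defaults to "-")
# Prints the first N delimiter-separated fields of each line.
keep_fields() {
  awk -v n="$1" -F"${2:--}" '{
    out = $1
    # Append fields 2..N, but never more than the line actually has.
    for (i = 2; i <= n && i <= NF; i++) out = out FS $i
    print out
  }'
}

echo 'After-u-math-how-however' | keep_fields 2
echo 'After' | keep_fields 2
```

The first call prints After-u and the second prints After, since a line with fewer than N fields is passed through unchanged.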
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2 && i<=NF; i++) printf "-%s",$i; print ""}'