Curly braces in awk reg exp - regex

I am trying to match a fixed number of digits using curly braces in awk, but I get no result.
# This outputs nothing
echo "123" | awk '/^[0-9]{3}$/ {print $1;}'
# This outputs 123
echo "123" | awk '/^[0-9]+$/ {print $1;}'
Do I need to do something specific to use curly braces?

Mac OS X awk (BSD awk) works with the first command shown:
$ echo "123" | /usr/bin/awk '/^[0-9]{3}$/ {print $1;}'
123
$
GNU awk does not. Adding backslashes doesn't help GNU awk. Using option --re-interval does, and so does using --posix.
$ echo "123" | /usr/gnu/bin/awk --re-interval '/^[0-9]{3}$/ {print $1;}'
123
$ echo "123" | /usr/gnu/bin/awk --posix '/^[0-9]{3}$/ {print $1;}'
123
$
(I'm not sure where mawk 1.3.3 dated 1996 comes from, but it is probably time to get an updated version of awk for your machine.)
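If switching awk versions or adding options is not possible, one portable workaround is to spell out the repetition instead of using an interval expression. A minimal sketch (the function name three_digits is made up for illustration):

```shell
# Hypothetical helper: print the first field of lines made of exactly three digits.
# Repeating the bracket expression avoids {3}, so the pattern does not depend on
# interval-expression support in whichever awk is installed.
three_digits() {
  awk '/^[0-9][0-9][0-9]$/ { print $1 }'
}

echo "123"  | three_digits   # prints 123
echo "1234" | three_digits   # prints nothing
```

This gets unwieldy for large counts, so it is only a stopgap for short, fixed repetitions.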

The awk on Ubuntu 20.04.4 LTS is up to date (released in 2020), but it is mawk.
As Ed Morton stated in a comment above, "mawk is a minimal functionality awk, optimized for speed of execution,...".
It seems those optimizations came at the expense of functionality.
SOLUTION
Install GNU awk (gawk):
$ sudo apt install gawk -y
$ awk -W version
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
Copyright (C) 1989, 1991-2019 Free Software Foundation.

Related

Why isn't Mac sed matching what I expect?

echo 'iPhone 12 Pro Max (5EF5105C-7EED-4017-979C-A6185E927B84) (Booted)' | sed -En 's,(\w+-\w+-\w+-\w+-\w+),\1,p'
I'm using extended regex with -E (-r in GNU sed) and -n to print only the matched/replaced lines. Assuming my regex101 test is correct,
I expect 5EF5105C-7EED-4017-979C-A6185E927B84 in the output, but I get nothing.
If you're just trying to get the serial number out from inside the parens, and you're not actually modifying anything, then use grep
$ echo 'iPhone 12 Pro Max (5EF5105C-7EED-4017-979C-A6185E927B84) (Booted)' \
| grep -E '\w+-\w+-\w+-\w+-\w+' -o
5EF5105C-7EED-4017-979C-A6185E927B84
-o tells grep "Just output what matched, not the entire line".
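If sed is still preferred, two things are worth checking; both are offered as assumptions rather than a confirmed diagnosis. First, \w is an extension that BSD/macOS sed may not support. Second, a substitution that replaces the match with itself leaves the rest of the line in place. A sketch that uses POSIX character classes and matches the whole line (the function name extract_id is made up for illustration):

```shell
# Hypothetical helper: print the five-part, hyphen-separated ID found inside parentheses.
# [[:alnum:]] stands in for \w, and the leading/trailing .* make the
# substitution replace the entire line with the captured group.
extract_id() {
  sed -En 's/.*\(([[:alnum:]]+(-[[:alnum:]]+){4})\).*/\1/p'
}

echo 'iPhone 12 Pro Max (5EF5105C-7EED-4017-979C-A6185E927B84) (Booted)' | extract_id
# prints 5EF5105C-7EED-4017-979C-A6185E927B84
```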

Extract email addresses from log with grep or sed

Jan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<wanted1918_ke#yahoo.com>, relay=mta5.am0.yahoodns.net[98.138.112.35]:25, delay=5.4, delays=0.02/3.2/0.97/1.1, dsn=5.0.0, status=bounced (host mta5.am0.yahoodns.net[98.138.112.35] said: 554 delivery error: dd This user doesn't have a yahoo.com account (wanted1918_ke#yahoo.com) [0] - mta1321.mail.ne1.yahoo.com (in reply to end of DATA command))
Jan 23 00:46:24 portal postfix/smtp[31539]: AF40C3FE99: to=<devi_joshi#yahoo.com>, relay=mta7.am0.yahoodns.net[98.136.217.202]:25, delay=5.9, delays=0.01/3.1/0.99/1.8, dsn=5.0.0, status=bounced (host mta7.am0.yahoodns.net[98.136.217.202] said: 554 delivery error: dd This user doesn't have a yahoo.com account (devi_joshi#yahoo.com) [0] - mta1397.mail.gq1.yahoo.com (in reply to end of DATA command))
From the above maillog I would like to extract the email addresses enclosed in angle brackets < ... >, e.g. turn to=<wanted1918_ke#yahoo.com> into wanted1918_ke#yahoo.com
I am using cut -d' ' -f7 to extract emails but I am curious if there is a more flexible way.
With GNU grep, just use a regular expression containing a look behind and look ahead:
$ grep -Po '(?<=to=<).*(?=>)' file
wanted1918_ke#yahoo.com
devi_joshi#yahoo.com
This says: hey, extract all the strings preceded by to=< and followed by >.
You can use awk like this:
awk -F'to=<|>,' '{print $2}' the.log
I'm splitting the line on to=< or >, and printing the second field.
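To see what that field separator does, it can help to print every field of a line. A small sketch with a made-up, shortened log line (the address is a placeholder, not taken from the log above, and show_fields is a hypothetical name):

```shell
# Hypothetical helper: number and print each field produced by the regex separator.
show_fields() {
  awk -F'to=<|>,' '{ for (i = 1; i <= NF; i++) printf "%d: [%s]\n", i, $i }'
}

echo 'AF40C3FE99: to=<user@example.com>, relay=none' | show_fields
# 1: [AF40C3FE99: ]
# 2: [user@example.com]
# 3: [ relay=none]
```

The address lands in field 2 because the text before to=< is field 1 and the separator itself is discarded.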
Just to show a sed alternative (requires GNU or BSD/macOS sed due to -E):
sed -E 's/.* to=<(.*)>.*/\1/' file
Note how the regex must match the entire line so that the substitution of the capture-group match (the email address) yields only that match.
A slightly more efficient - but perhaps less readable - variation is
sed -E 's/.* to=<([^>]*).*/\1/' file
A POSIX-compliant formulation is a little more cumbersome due to the legacy syntax required by BREs (basic regular expressions):
sed 's/.* to=<\(.*\)>.*/\1/' file
A variation of fedorqui's helpful GNU grep answer:
grep -Po ' to=<\K[^>]*' file
\K, which drops everything matched up to that point, is not only syntactically simpler than a look-behind assertion such as (?<=...), but also more flexible (it supports variable-length expressions) and faster, though that may not matter in many real-world situations; if performance is paramount, see below.
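Where a grep with PCRE support is not available, a POSIX sed command with -n and the p flag gives a similar effect and, unlike the plain substitutions above, skips lines that contain no to=< field. A sketch with placeholder input (the function name extract_to and the sample lines are assumptions for illustration):

```shell
# Hypothetical helper: print only the address between " to=<" and ">".
# -n suppresses default output; the p flag prints lines where the
# substitution succeeded, so non-matching lines are dropped.
extract_to() {
  sed -n 's/.* to=<\([^>]*\)>.*/\1/p'
}

printf '%s\n' \
  'Jan 23 00:46:24 portal postfix/smtp[1]: ABC: to=<user@example.com>, status=bounced' \
  'Jan 23 00:46:25 portal postfix/qmgr[2]: ABC: removed' | extract_to
# prints only: user@example.com
```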
Performance comparison
Here's how the various solutions on this page compare in terms of performance.
Note that this may not matter much in many use cases, but gives insight into:
the relative performance of the various standard utilities
for a given utility, how tweaking the regex can make a difference.
The absolute values are not important, but the relative performance hopefully provides some insight. See the bottom for the script that produced these numbers, which were obtained on a late-2012 27" iMac running macOS 10.12.3, using a 250,000-line input file created by replicating the sample input from the question, averaging the timings of 10 runs each.
Mawk                               0.364s
GNU grep, \K, non-backtracking     0.392s
GNU awk                            0.830s
GNU grep, \K                       0.937s
GNU grep, (?<=...)                 1.639s
BSD grep + cut                     2.733s
GNU grep + cut                     3.697s
BSD awk                            3.785s
BSD sed, non-backtracking          7.825s
BSD sed                            8.414s
GNU sed                           16.738s
GNU sed, non-backtracking         17.387s
A few conclusions:
The specific implementation of a given utility matters.
grep is generally a good choice, even if it needs to be combined with cut
Tweaking the regex to avoid backtracking and look-behind assertions can make a difference.
GNU sed is surprisingly slow, whereas GNU awk is faster than BSD awk. Strangely, the (partially) non-backtracking solution is slower with GNU sed.
Here's the script that produced the timings above; note that the g-prefixed commands are GNU utilities that were installed on macOS via Homebrew; similarly, mawk was installed via Homebrew.
Note that "non-backtracking" only applies partially to some of the commands.
#!/usr/bin/env bash
# Define the test commands.
test01=( 'BSD sed' sed -E 's/.*to=<(.*)>.*/\1/' )
test02=( 'BSD sed, non-backtracking' sed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test03=( 'GNU sed' gsed -E 's/.*to=<(.*)>.*/\1/' )
test04=( 'GNU sed, non-backtracking' gsed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test05=( 'BSD awk' awk -F' to=<|>,' '{print $2}' )
test06=( 'GNU awk' gawk -F' to=<|>,' '{print $2}' )
test07=( 'Mawk' mawk -F' to=<|>,' '{print $2}' )
#--
test08=( 'GNU grep, (?<=...)' ggrep -Po '(?<= to=<).*(?=>)' )
test09=( 'GNU grep, \K' ggrep -Po ' to=<\K.*(?=>)' )
test10=( 'GNU grep, \K, non-backtracking' ggrep -Po ' to=<\K[^>]*' )
# --
test11=( 'BSD grep + cut' "{ grep -o ' to=<[^>]*' | cut -d'<' -f2; }" )
test12=( 'GNU grep + cut' "{ ggrep -o ' to=<[^>]*' | gcut -d'<' -f2; }" )
# Determine input and output files.
inFile='file'
# NOTE: Do NOT use /dev/null, because GNU grep apparently takes a shortcut
# when it detects stdout going nowhere, which distorts the timings.
# Use /dev/tty if you want to see stdout in the terminal (will print
# as a single block across all tests before the results are reported).
outFile="/tmp/out.$$"
# outFile='/dev/tty'
# Make `time` only report the overall elapsed time.
TIMEFORMAT='%6R'
# How many runs per test whose timings to average.
runs=10
# Read the input file up front to even the playing field, so that the first command
# doesn't take the hit of being the first to load the file from disk.
echo "Warming up the cache..."
cat "$inFile" >/dev/null
# Run the tests.
echo "Running $(awk '{print NF}' <<<"${!test*}") test(s), averaging the timings of $runs run(s) each; this may take a while..."
{
  for n in ${!test*}; do
    arrRef="$n[@]"
    test=( "${!arrRef}" )
    # Print test description.
    printf '%s\t' "${test[0]}"
    # Execute test command.
    if (( ${#test[@]} == 2 )); then # single-token command? assume `eval` must be used.
      time for (( i = 0; i < runs; i++ )); do eval "${test[@]: 1}" < "$inFile" >"$outFile"; done
    else # multiple command tokens? assume that they form a simple command that can be invoked directly.
      time for (( i = 0; i < runs; i++ )); do "${test[@]: 1}" "$inFile" >"$outFile"; done
    fi
  done
} 2>&1 |
sort -t$'\t' -k2,2n |
awk -v runs="$runs" '
BEGIN{FS=OFS="\t"} { avg = sprintf("%.3f", $2/runs); print $1, avg "s" }
' | column -s$'\t' -t
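The input file itself is not created by the script. One way to build a comparable file, assuming the sample lines from the question are saved in a file, is to cycle through them until the target line count is reached (the function name, file names, and argument order below are made up for illustration):

```shell
# Hypothetical helper: write TARGET lines to OUT by cycling through the lines of SAMPLE.
# Usage: make_input SAMPLE TARGET OUT
make_input() {
  awk -v target="$2" '
    { lines[NR] = $0 }                 # remember every sample line
    END {
      if (NR == 0) exit 1              # nothing to replicate
      for (i = 0; i < target; i++) print lines[i % NR + 1]
    }
  ' "$1" > "$3"
}

# Example: make_input sample.log 250000 file
```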
awk -F'[<>]' '{print $2}' file
wanted1918_ke#yahoo.com
devi_joshi#yahoo.com

Match only first occurrence of digit

After a few hours of fruitless searching, I can't figure this out.
I am piping input to grep, and what I want to get is the first occurrence of any digit.
Example:
nmcli --version
nmcli tool, version 1.1.93
Pipe to grep with regex
nmcli --version |grep -o '[[:digit:]]'
Output:
1
1
9
3
What I want:
1
Yes, there is a way to do that with another pipe, but is there a "pure" single-regex way to do it?
With GNU grep:
nmcli --version | grep -Po ' \K[[:digit:]]'
Output:
1
See: Support of \K in regex
Although you want to avoid another process, it seems simplest just to add a head to your existing command...
nmcli --version | grep -o '[[:digit:]]' | head -n1
echo "nmcli tool, version 1.1.93" |sed "s/[^0-9]//g" |cut -c1
1
echo "nmcli tool, version 1.1.93" |grep -o '[0-9]' |head -1
1
This can be seen as a stream editing task: reduce that one line to the first digit. Basic regex register-based referencing achieves the task:
$ echo "junk 1.2.3.4" | sed -e 's/.* \([0-9]\).*/\1/'
1
Traditionally, Grep is best for searching for files and lines which match a pattern. This is why the grep solution requires the use of Perl regex; Perl regex has features that, in combination with -o, allow grep to escape "out of the box" and be used in ways it wasn't really intended: match X, but then output a substring of X. The solution is terse, but not portable to grep implementations that don't have PCRE.
Use [0-9] to match ASCII digits, by the way. The purpose of [[:digit:]] is to bring in locale-specific behavior: to be able to match digits other than just the ASCII 0x30 through 0x39.
It's fairly safe to say that nmcli isn't going to put out its --version using, say, Devanagari numerals, like १.२.३.४.
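The sed command in this answer relies on a space directly before the first digit. A variant that instead skips any leading non-digits, and prints nothing for lines without a digit, might look like this (a sketch; the function name first_digit is made up for illustration):

```shell
# Hypothetical helper: print the first ASCII digit of each input line.
# ^[^0-9]* consumes everything before the first digit, \([0-9]\) captures it,
# and -n with the p flag suppresses lines that contain no digit at all.
first_digit() {
  sed -n 's/^[^0-9]*\([0-9]\).*/\1/p'
}

echo "nmcli tool, version 1.1.93" | first_digit   # prints 1
echo "v2.3" | first_digit                         # prints 2
```

Note that this works per line, so multi-line input yields one digit for every line that has one.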
You could use standard awk instead:
nmcli --version | awk 'match($0, /[[:digit:]]/) {print substr($0, RSTART, RLENGTH); exit}'
For example:
$ seq 11111 33333 | awk 'match($0, /[[:digit:]]/) {print substr($0, RSTART, RLENGTH); exit}'
1

awk regular expression doesn't work [duplicate]

echo xx y11y rrr | awk '{ if ($2 ~/y[1-5]{2}y/) print $3}'
Why can't I get any output?
Thank you.
You need to enable "interval expressions" in regular expression matching by specifying either the --posix or --re-interval option.
e.g.
echo xx y11y rrr | awk --re-interval '{ if ($2 ~ /y[1-5]{2}y/) print $3}'
From the man page:
--re-interval
Enable the use of interval expressions in regular expression matching (see Regular Expressions, below). Interval expressions were not traditionally available in the AWK language. The POSIX standard added them, to make awk and egrep consistent with each other. However, their use is likely to break old AWK programs, so gawk only provides them if they are requested with this option, or when --posix is specified.
You should force POSIX mode to use {} in awk:
echo xx y11y rrr | awk -W posix '{ if ($2 ~/y[1-5]{2}y/) print $3}'
On my machine:
$ echo xx y11y rrr | awk '{ if ($2 ~/y[1-5]{2}y/) print $3}'
rrr
Was this what you wanted? I'm using GNU awk 4.0.0 in Cygwin on Windows XP.
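Since the answers above show that behaviour differs between awk implementations and versions, a quick probe can tell whether the awk at hand handles interval expressions without extra options. A minimal sketch (the function name awk_supports_intervals and its optional argument are made up for illustration):

```shell
# Hypothetical helper: succeed if the given awk (default: awk) treats {2} as an interval.
# "aa" matches /^a{2}$/ only when {2} means "exactly two"; otherwise found stays
# unset and the exit status is non-zero.
awk_supports_intervals() {
  echo "aa" | "${1:-awk}" '/^a{2}$/ { found = 1 } END { exit !found }'
}

if awk_supports_intervals; then
  echo "intervals: supported"
else
  echo "intervals: not supported (with gawk, try --re-interval or --posix)"
fi
```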
