Extract email addresses from log with grep or sed - regex

Jan 23 00:46:24 portal postfix/smtp[31481]: 1B1653FEA1: to=<wanted1918_ke@yahoo.com>, relay=mta5.am0.yahoodns.net[98.138.112.35]:25, delay=5.4, delays=0.02/3.2/0.97/1.1, dsn=5.0.0, status=bounced (host mta5.am0.yahoodns.net[98.138.112.35] said: 554 delivery error: dd This user doesn't have a yahoo.com account (wanted1918_ke@yahoo.com) [0] - mta1321.mail.ne1.yahoo.com (in reply to end of DATA command))
Jan 23 00:46:24 portal postfix/smtp[31539]: AF40C3FE99: to=<devi_joshi@yahoo.com>, relay=mta7.am0.yahoodns.net[98.136.217.202]:25, delay=5.9, delays=0.01/3.1/0.99/1.8, dsn=5.0.0, status=bounced (host mta7.am0.yahoodns.net[98.136.217.202] said: 554 delivery error: dd This user doesn't have a yahoo.com account (devi_joshi@yahoo.com) [0] - mta1397.mail.gq1.yahoo.com (in reply to end of DATA command))
From the above maillog I would like to extract the email addresses enclosed in the angle brackets < ... >, e.g. to=<wanted1918_ke@yahoo.com> should yield wanted1918_ke@yahoo.com.
I am using cut -d' ' -f7 to extract the emails, but I am curious whether there is a more flexible way.

With GNU grep, just use a regular expression containing a look behind and look ahead:
$ grep -Po '(?<=to=<).*(?=>)' file
wanted1918_ke@yahoo.com
devi_joshi@yahoo.com
This says: hey, extract all the strings preceded by to=< and followed by >.

You can use awk like this:
awk -F'to=<|>,' '{print $2}' the.log
I'm splitting the line by to=< or >, and printing the second field.
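To see what that split actually produces, here is a toy line (not from the actual log) run through the same field separator:
$ echo 'x to=<a@b.com>, rest' | awk -F'to=<|>,' '{print $1 "|" $2 "|" $3}'
x |a@b.com| rest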

Just to show a sed alternative (requires GNU or BSD/macOS sed due to -E):
sed -E 's/.* to=<(.*)>.*/\1/' file
Note how the regex must match the entire line so that the substitution of the capture-group match (the email address) yields only that match.
A slightly more efficient - but perhaps less readable - variation is
sed -E 's/.* to=<([^>]*).*/\1/' file
A POSIX-compliant formulation is a little more cumbersome due to the legacy syntax required by BREs (basic regular expressions):
sed 's/.* to=<\(.*\)>.*/\1/' file
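One caveat that applies to all of the sed variants above: lines without a to=<...> field are printed unchanged, because the substitution simply doesn't apply to them. If the log can contain such lines, suppress the default output and print only successful substitutions (again requires GNU or BSD/macOS sed for -E):
sed -nE 's/.* to=<([^>]*).*/\1/p' file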
A variation of fedorqui's helpful GNU grep answer:
grep -Po ' to=<\K[^>]*' file
\K, which drops everything matched up to that point, is not only syntactically simpler than a look-behind assertion ((?<=...)), but also more flexible - it supports variable-length expressions - and is faster (though that may not matter in many real-world situations; if performance is paramount: see below).
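To illustrate the variable-length point with a contrived example (a hypothetical input, not from the question's log): the prefix below contains a run of whitespace of unknown length, which \K handles without trouble, whereas the equivalent look-behind (?<=to=\s*<) is traditionally rejected by PCRE because look-behinds must have a fixed length:
$ printf 'to= <a@b.com>\nto=    <c@d.com>\n' | grep -Po 'to=\s*<\K[^>]*'
a@b.com
c@d.com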
Performance comparison
Here's how the various solutions on this page compare in terms of performance.
Note that this may not matter much in many use cases, but gives insight into:
the relative performance of the various standard utilities
for a given utility, how tweaking the regex can make a difference.
The absolute values are not important, but the relative performance hopefully provides some insight. See the bottom for the script that produced these numbers, which were obtained on a late-2012 27" iMac running macOS 10.12.3, using a 250,000-line input file created by replicating the sample input from the question, averaging the timings of 10 runs each.
Mawk 0.364s
GNU grep, \K, non-backtracking 0.392s
GNU awk 0.830s
GNU grep, \K 0.937s
GNU grep, (?<=...) 1.639s
BSD grep + cut 2.733s
GNU grep + cut 3.697s
BSD awk 3.785s
BSD sed, non-backtracking 7.825s
BSD sed 8.414s
GNU sed 16.738s
GNU sed, non-backtracking 17.387s
A few conclusions:
The specific implementation of a given utility matters.
grep is generally a good choice, even if it needs to be combined with cut.
Tweaking the regex to avoid backtracking and look-behind assertions can make a difference.
GNU sed is surprisingly slow, whereas GNU awk is faster than BSD awk. Strangely, the (partially) non-backtracking solution is slower with GNU sed.
Here's the script that produced the timings above; note that the g-prefixed commands are GNU utilities that were installed on macOS via Homebrew; similarly, mawk was installed via Homebrew.
Note that "non-backtracking" only applies partially to some of the commands.
#!/usr/bin/env bash
# Define the test commands.
test01=( 'BSD sed' sed -E 's/.*to=<(.*)>.*/\1/' )
test02=( 'BSD sed, non-backtracking' sed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test03=( 'GNU sed' gsed -E 's/.*to=<(.*)>.*/\1/' )
test04=( 'GNU sed, non-backtracking' gsed -E 's/.*to=<([^>]*).*/\1/' )
# ---
test05=( 'BSD awk' awk -F' to=<|>,' '{print $2}' )
test06=( 'GNU awk' gawk -F' to=<|>,' '{print $2}' )
test07=( 'Mawk' mawk -F' to=<|>,' '{print $2}' )
#--
test08=( 'GNU grep, (?<=...)' ggrep -Po '(?<= to=<).*(?=>)' )
test09=( 'GNU grep, \K' ggrep -Po ' to=<\K.*(?=>)' )
test10=( 'GNU grep, \K, non-backtracking' ggrep -Po ' to=<\K[^>]*' )
# --
test11=( 'BSD grep + cut' "{ grep -o ' to=<[^>]*' | cut -d'<' -f2; }" )
test12=( 'GNU grep + cut' "{ ggrep -o ' to=<[^>]*' | gcut -d'<' -f2; }" )
# Determine input and output files.
inFile='file'
# NOTE: Do NOT use /dev/null, because GNU grep apparently takes a shortcut
# when it detects stdout going nowhere, which distorts the timings.
# Use /dev/tty if you want to see stdout in the terminal (will print
# as a single block across all tests before the results are reported).
outFile="/tmp/out.$$"
# outFile='/dev/tty'
# Make `time` only report the overall elapsed time.
TIMEFORMAT='%6R'
# How many runs per test whose timings to average.
runs=10
# Read the input file up front to even the playing field, so that the first command
# doesn't take the hit of being the first to load the file from disk.
echo "Warming up the cache..."
cat "$inFile" >/dev/null
# Run the tests.
echo "Running $(awk '{print NF}' <<<"${!test*}") test(s), averaging the timings of $runs run(s) each; this may take a while..."
{
for n in ${!test*}; do
  arrRef="$n[@]"
  test=( "${!arrRef}" )
  # Print test description.
  printf '%s\t' "${test[0]}"
  # Execute test command.
  if (( ${#test[@]} == 2 )); then # single-token command? assume `eval` must be used.
    time for (( n = 0; n < runs; n++ )); do eval "${test[@]: 1}" < "$inFile" >"$outFile"; done
  else # multiple command tokens? assume that they form a simple command that can be invoked directly.
    time for (( n = 0; n < runs; n++ )); do "${test[@]: 1}" "$inFile" >"$outFile"; done
  fi
done
} 2>&1 |
sort -t$'\t' -k2,2n |
awk -v runs="$runs" '
BEGIN{FS=OFS="\t"} { avg = sprintf("%.3f", $2/runs); print $1, avg "s" }
' | column -s$'\t' -t

awk -F'[<>]' '{print $2}' file
wanted1918_ke@yahoo.com
devi_joshi@yahoo.com

Related

Extract version using grep/regex in bash

I have a file that has a line stating
version = "12.0.08-SNAPSHOT"
The word version and quoted strings can occur on multiple lines in that file.
I am looking for a single line bash statement that can output the following string:
12.0.08-SNAPSHOT
The version can have RELEASE tag too instead of SNAPSHOT.
So to summarize, given
version = "12.0.08-SNAPSHOT"
expected output: 12.0.08-SNAPSHOT
And given
version = "12.0.08-RELEASE"
expected output: 12.0.08-RELEASE
The following command prints the strings enclosed in quotes in version = "...":
grep -Po '\bversion\s*=\s*"\K.*?(?=")' yourFile
-P enables perl regexes, which allow us to use features like \K and so on.
-o only prints matched parts instead of the whole lines.
\b ensures that version starts at a word boundary and we do not match things like abcversion.
\s stands for any kind of whitespace.
\K lets grep forget that it matched the part before \K. The forgotten part will not be printed.
.*? matches as few characters as possible (the matching part will be printed) ...
(?=") ... until we see a ", which won't be included in the match either (this is called a lookahead).
Not all grep implementations support the -P option. Alternatively, you can use perl, as described in this answer:
perl -nle 'print $& if m{\bversion\s*=\s*"\K.*?(?=")}' yourFile
Seems like a job for cut:
$ echo 'version = "12.0.08-SNAPSHOT"' | cut -d'"' -f2
12.0.08-SNAPSHOT
$ echo 'version = "12.0.08-RELEASE"' | cut -d'"' -f2
12.0.08-RELEASE
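Since the question mentions that quoted strings can occur on other lines of the file too, you may want to narrow things down to the version line first; a minimal sketch, assuming the line starts with the word version:
$ grep -m1 '^version[[:space:]]*=' yourFile | cut -d'"' -f2
12.0.08-SNAPSHOT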
Portable solution:
$ echo 'version = "12.0.08-RELEASE"' |sed -E 's/.*"(.*)"/\1/g'
12.0.08-RELEASE
or even:
$ perl -pe 's/.*"(.*)"/\1/g'
$ awk -F"\"" '{print $2}'

Match only first occurrence of digit

After a few hours of disappointing searching I can't figure this out.
I am piping input to grep; what I want to get is the first occurrence of any digit.
Example:
nmcli --version
nmcli tool, version 1.1.93
Pipe to grep with regex
nmcli --version |grep -o '[[:digit:]]'
Output:
1
1
9
3
What I want:
1
Yeah, there is a way to do that with another pipe, but is there a "pure" single-regex way to do it?
With GNU grep:
nmcli --version | grep -Po ' \K[[:digit:]]'
Output:
1
See: Support of \K in regex
Although you want to avoid another process, it seems simplest just to add a head to your existing command...
grep -o '[[:digit:]]' | head -n1
echo "nmcli tool, version 1.1.93" |sed "s/[^0-9]//g" |cut -c1
1
echo "nmcli tool, version 1.1.93" |grep -o '[0-9]' |head -1
1
This can be seen as a stream editing task: reduce that one line to the first digit. Basic regex register-based referencing achieves the task:
$ echo "junk 1.2.3.4" | sed -e 's/.* \([0-9]\).*/\1/'
1
Traditionally, grep is best for searching files for lines which match a pattern. This is why the grep solution requires the use of Perl regex; Perl regex has features that, in combination with -o, allow grep to escape "out of the box" and be used in ways it wasn't really intended: match X, but then output a substring of X. The solution is terse, but not portable to grep implementations that don't have PCRE.
Use [0-9] to match ASCII digits, by the way. The purpose of [[:digit:]] is to bring in locale-specific behavior: to be able to match digits other than just the ASCII 0x30 through 0x39.
It's fairly safe to say that nmcli isn't going to put out its --version using, say, Devanagari numerals, like १.२.३.४.
You could use standard awk instead:
nmcli --version | awk 'match($0, /[[:digit:]]/) {print substr($0, RSTART, RLENGTH); exit}'
For example:
$ seq 11111 33333 | awk 'match($0, /[[:digit:]]/) {print substr($0, RSTART, RLENGTH); exit}'
1

sed - exchange words with delimiter

I'm trying to swap words around with sed, not replace them, since replacing is all I keep finding on Google.
I don't know if it's the regex that I'm getting wrong. I did a search for everything before a char and everything after a char, so that's how I got the regex.
echo xxx,aaa | sed -r 's/[^,]*/[^,]*$/'
or
echo xxx/aaa | sed -r 's/[^\/]*/[^\/]*$/'
I am getting this output:
[^,]*$,aaa
or this:
[^/]*$/aaa
What am I doing wrong?
For the first sample, you should use:
echo xxx,aaa | sed 's/\([^,]*\),\([^,]*\)/\2,\1/'
For the second sample, simply use a character other than slash as the delimiter:
echo xxx/aaa | sed 's%\([^/]*\)/\([^/]*\)%\2/\1%'
You can also use \{1,\} to formally require one or more:
echo xxx,aaa | sed 's/\([^,]\{1,\}\),\([^,]\{1,\}\)/\2,\1/'
echo xxx/aaa | sed 's%\([^/]\{1,\}\)/\([^/]\{1,\}\)%\2/\1%'
This uses the most portable sed notation; it should work anywhere. With modern versions that support extended regular expressions (-r with GNU sed, -E with Mac OS X or BSD sed), you can lose some of the backslashes and use + in place of * which is more precisely what you're after (and parallels \{1,\} much more succinctly):
echo xxx,aaa | sed -E 's/([^,]+),([^,]+)/\2,\1/'
echo xxx/aaa | sed -E 's%([^/]+)/([^/]+)%\2/\1%'
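For reference, either extended-regex form produces the swapped output the question is after:
$ echo xxx,aaa | sed -E 's/([^,]+),([^,]+)/\2,\1/'
aaa,xxx
$ echo xxx/aaa | sed -E 's%([^/]+)/([^/]+)%\2/\1%'
aaa/xxx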
With sed it would be:
sed 's#\([[:alpha:]]\+\)/\([[:alpha:]]\+\)#\2/\1#' <<< 'xxx/aaa'
which is simpler to read if you use extended posix regexes with -r:
sed -r 's#([[:alpha:]]+)/([[:alpha:]]+)#\2/\1#' <<< 'xxx/aaa'
I'm using two subpatterns ([[:alpha:]]+) which can contain one or more letters and are separated by a /. In the replacement part I reassemble them in reverse order, \2/\1. Please also note that I'm using # instead of / as the delimiter for the s command, since / is already the field delimiter in the input data. This saves us from having to escape the / in the regex.
Btw, you can also use awk for that, which is pretty easy to read:
awk -F'/' '{print $2,$1}' OFS='/' <<< 'xxx/aaa'

matching a specific substring with regular expressions using awk

I'm dealing with a specific filenames, and need to extract information from them.
The structure of the filename is similar to: "20100613_M4_28007834.005_F_RANDOMSTR.raw.gz"
with RANDOMSTR a string of max 22 chars, and which may contain a substring (or not) with the format "-W[0-9].[0-9]{2}.[0-9]{3}". This substring also has the unique feature of starting with "-W".
The information I need to extract is the substring of RANDOMSTR without this optional substring.
I want to implement this in a bash script, and so far the best option I found is to use gawk with a regular expression. My best attempt so far fails:
gawk --re-interval '{match ($0,"([0-9]{8})_(M[0-9])_([0-9]{8}\\.[0-9]{3})_(.)_(.*)(-W.*)?.raw.gz",arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING-W0.40+045
The expected results are:
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_SOME-STRING.raw.gz"
SOME-STRING
gawk --re-interval '{match ($0,$regexp,arr); print arr[5]}' <<< "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
OTHER-STRING
How can I get the desired effect?
Thanks.
You need to be able to use look-arounds and I don't think awk/gawk supports that, but grep -P does.
$ pat='(?<=[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._)(.*?)(?=(-W.*)?\.raw\.gz)'
$ echo "20100613_M4_28007834.005_F_SOME-STRING.raw.gz" | grep -Po "$pat"
SOME-STRING
$ echo "20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz" | grep -Po "$pat"
OTHER-STRING
While the grep solution is very nice indeed, the OP didn't mention an operating system, and the -P option only seems to be available in Linux. It's also pretty simple to do this in awk.
$ awk -F_ '{sub(/(-W[0-9].[0-9]+.[0-9]+)?\.raw\.gz$/,"",$NF); print $NF}' <<EOT
> 20100613_M4_28007834.005_F_SOME-STRING.raw.gz
> 20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
> EOT
SOME-STRING
OTHER-STRING
$
Note that this breaks on "20100613_M4_28007834.005_F_OTHER-STRING-W0_40+045.raw.gz". If this is a risk, and -W only shows up in the place shown above, it might be better to use something like:
$ awk -F_ '{sub(/(-W[0-9.+]+)?\.raw\.gz$/,"",$NF); print $NF}'
The difficulty here seems to be the fact that the (.*) before the optional (-W.*)? gobbles up the latter text. Using a non-greedy match doesn't help either. My regex-fu is unfortunately too weak to combat this.
If you don't mind a multi-pass solution, then a simpler approach would be to first sanitise the input by removing the trailing .raw.gz and possible -W*.
str="20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz"
echo ${str%.raw.gz} | # remove trailing .raw.gz
sed 's/-W.*$//' | # remove trailing -W.*, if any
sed -nr 's/[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._(.*)/\1/p'
I used sed, but you can just as well use gawk/awk.
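For instance, a rough awk equivalent of the same multi-pass idea, doing the three strips in a single invocation (assumes gawk; interval expressions such as {8} are enabled by default as of gawk 4.0, older versions need --re-interval):
gawk '{
  sub(/\.raw\.gz$/, "")                              # strip the trailing .raw.gz
  sub(/-W.*$/, "")                                   # strip the optional -W... suffix, if any
  sub(/^[0-9]{8}_M[0-9]_[0-9]{8}\.[0-9]{3}_._/, "")  # strip the fixed-format prefix
  print
}' file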
Wasn't able to get reluctant quantifiers going, but running through two regexes in sequence does the job:
sed -E -e 's/^.{27}(.*).raw.gz$/\1/' << FOO | sed -E -e 's/-W[0-9.]+\+[0-9.]+$//'
20100613_M4_28007834.005_F_SOME-STRING.raw.gz
20100613_M4_28007834.005_F_OTHER-STRING-W0.40+045.raw.gz
FOO

Improving Shell Script Performance

This shell script is used to extract a line of data from $2 if it contains the pattern $line.
$line is constructed using the regular expression [A-Z0-9.-]+@[A-Z0-9.-]+ (a simple email match), from the lines in file $1.
#! /bin/sh
clear
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+"`
do
echo `cat "$2" | grep -m 1 "\b$line\b"`
done
File $1 has short lines of data (< 100 chars) and contains approx. 50k lines (approx. 1-1.5 MB).
File $2 has slightly longer lines of text (> 80 to < 200) and has 2M+ lines (approx. 200MB).
The desktops this is running on have plenty of RAM (6 Gig) and Xeon processors with 2-4 cores.
Are there any quick fixes to increase performance, as it currently takes 1-2 hours to run completely (and output to another file)?
NB: I'm open to all suggestions, but we're not in a position to completely re-write the whole system, etc. In addition, the data comes from a third party and is prone to random formatting.
Quick suggestions:
Avoid the useless use of cat and change cat X | grep Y to grep Y X.
You can process the grep output as it is produced by piping it rather than using backticks. Using backticks requires the first grep to complete before you can start the second grep.
Thus:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | while read line; do
grep -m 1 "\b$line\b" "$2"
done
Next step:
Don't process $2 repeatedly. It's huge. You can save up all your patterns and then execute a single grep over the file.
Replace loop with sed.
No more repeated grep:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\1/g' > patterns
grep -f patterns "$2"
Finally, using some bash fanciness (see man bash → Process Substitution) we can ditch the temporary file and do this in one long line:
grep -f <(grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\b/g') "$2"
That's great unless you have so many patterns that grep -f runs out of memory and barfs. If that happens you'll need to run it in batches. Annoying, but doable:
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" "$1" | sed -E 's/^|$/\\1/g' > patterns
while [ -s patterns ]; do
grep -f <(head -n 100 patterns) "$2"
sed -e '1,100d' -i patterns
done
That'll process 100 patterns at a time. The more it can do at once the fewer passes it'll have to make over your 2nd file.
The problem is that you are piping through too many shell commands, as well as making unnecessary use of cat.
One possible solution, using just awk:
awk 'FNR==NR{
  # get all email addresses from file1
  for(i=1;i<=NF;i++){
    if ( $i ~ /[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+/){
      email[$i]
    }
  }
  next
}
{
  for(i in email) {
    if ($0 ~ i) {
      print
    }
  }
}' file1 file2
I would take the loop out, since grepping a 2 million line file 50k times is probably pretty expensive ;)
To allow for you to take the loop out
First, create a file of all your email addresses with your outer grep command.
Then use this as a pattern file for your secondary grep, by using grep -f
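A minimal sketch of that two-step approach (the temporary file name is just for illustration; -F treats the extracted addresses as fixed strings, which is fast, though it drops the \b word-boundary anchoring used in the earlier answer):
grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+" "$1" | sort -u > /tmp/addresses
grep -F -f /tmp/addresses "$2"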
If $1 is a file, don't use "cat | grep". Instead, pass the file directly to grep. Should look like
grep -i -o -E "[A-Z0-9.-]+#[A-Z0-9.-]+" $1
Besides, you may want to adjust your regex. You should at least expect the underscore ("_") in an email address, so
grep -i -o -E "[A-Z0-9._-]+#[A-Z0-9.-]+" $1
As John Kugelman has already answered, process the grep output by piping it rather than using backticks. If you are using backticks the whole expression within the backticks will be run first, and then the outer expression will be run with the output from the backticks as arguments.
First of all, this will be a lot slower than necessary, as piping would allow the two programs to run simultaneously (which is really good if they are both CPU intensive and you have multiple CPUs). However, there is another very important aspect to this: the line
for line in `cat "$1" | grep -i -o -E "[A-Z0-9.-]+@[A-Z0-9.-]+"`
may become too long for the shell to handle. Most shells (to my knowledge at least) limit the length of a command line, or at least the arguments to a command, and I think this could become a problem for the for loop too.