Stripping email addresses from arbitrary file

Stripping email addresses from arbitrary file - regex

What's the best way to get user#host.com combinations from a large fileset?
I assume that sed/awk can do this, but I'm not very familiar with regexp.
We have a file i.e., Staff_data.txt that houses more than just emails, and would like to strip the rest of the data, and gather only the email addresses (i.e., h#south.com)
I figured the easiest way would be via sed/awk in a terminal, but seeing as how complex regexp can be, I'd appreciate some guidance.
Thanks.

Here's a somewhat embarrassing but apparently working script I wrote a few years ago to do this job:
# Get rid of any Message-Id line like this:
# Message-ID: <AANLkTinSDG_dySv_oy_7jWBD=QWiHUMpSEFtE-cxP6Y1#mail.gmail.com>
#
# Change any character that can't be in an email address to a space.
#
# Print just the character strings that look like email addresses.
#
# Drop anything with multple "#"s and change any domain names (i.e.
# the part after the "#") to all lower case as those are not case-sensitive.
#
# If we have a local mail box part (i.e. the part before the "#") that's
# a mix of upper/lower and another that's all lower, keep them both. Ditto
# for multiple versions of mixed case since we don't know which is correct.
#
# Sort uniquely.
cat "$#" |
awk '!/^Message-ID:/' |
awk '{gsub(/[^-_.#[:alnum:]]+/," ")}1' |
awk '{for (i=1;i<=NF;i++) if ($i ~ /.+#.+[.][[:alpha:]]+$/) print $i}' |
awk '
BEGIN { FS=OFS="#" }
NF != 2 { printf "Badly formatted %s skipped.\n",$0 | "cat>&2"; next }
{ $2=tolower($2); print }
' |
tr '[A-Z]' '[a-z]' |
sort -u
It's not pretty, but it seems to be robust.

You want grep here not sed or awk. For example to display all emails from the domain south.com:
grep -o '[^ ]*#south\.com ' file

Related

How to use 'sed' to add dynamic prefix to each number in integer list?

How can I use sed to add a dynamic prefix to each number in an integer list?
For example:
I have a string "A-1,2,3,4,5", I want to transform it to string "A-1,A-2,A-3,A-4,A-5" - which means I want to add prefix of first integer i.e. "A-" to each number of the list.
If I have string like "B-1,20,300" then I want to transform it to string "B-1,B-20,B-300".
I am not able to use RegEx Capturing Groups because for global match they do not retain their value in subsequent matches.

When it comes to looping constructs in sed, I like to use newlines as markers for the places I have yet to process. This makes matching much simpler, and I know they're not in the input because my input is a text line.
For example:
$ echo A-1,2,3,4,5 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
A-1,A-2,A-3,A-4,A-5
This works as follows:
s/,/\n/g # replace all commas with newlines (insert markers)
:a # label for looping
s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/ # replace the next marker with a comma followed
# by the prefix
ta # loop unless there's nothing more to do.
The approach is similar to #potong's, but I find the regex much more readable -- \([^0-9]*\) captures the prefix, \([^\n]*\) captures everything up to the next marker (i.e. everything that's already been processed), and then it's just a matter of reassembling it in the substitution.

Don't use sed, just use the other standard UNIX text manipulation tool, awk:
$ echo 'A-1,2,3,4,5' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
A-1,A-2,A-3,A-4,A-5
$ echo 'B-1,20,300' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
B-1,B-20,B-300

This might work for you (GNU sed):
sed -E ':a;s/^((([^-]+-)[^,]+,)+)([0-9])/\1\3\4/;ta' file
Uses pattern matching and a loop to replace a number following a comma by the first column prefix and that number.

Assuming this is for shell scripting, you can do so with 2 seds:
set string = "A1,2,3,4,5"
set prefix = `echo $string | sed 's/^\([A-Z]\).*/\1/'`
echo $string | sed 's/,\([0-9]\)/,'$prefix'-\1/g'
Output is
A1,A-2,A-3,A-4,A-5
With
set string = "B-1,20,300"
Output is
B-1,B-20,B-300

Could you please try following(if ok with awk).
awk '
BEGIN{
FS=OFS=","
}
{
for(i=1;i<=NF;i++){
if($i !~ /^A/&&$i !~ /\"A/){
$i="A-"$i
}
}
}
1' Input_file

if your data in 'd' file, tried on gnu sed:
sed -E 'h;s/^(\w-).+/\1/;x;G;:s s/,([0-9]+)(.*\n(.+))/,\3\1\2/;ts; s/\n.+//' d

Should I use AWK or SED to remove commas between quotation marks from a CSV file? (BASH)

I have a bunch of daily printer logs in CSV format and I'm writing a script to keep track of how much paper is being used and save the info to a database, but I've come across a small problem
Essentially, some of the document names in the logs include commas in them (which are all enclosed within double quotes), and since it's in comma separated format, my code is messing up and pushing everything one column to the right for certain records.
From what I've been reading, it seems like the best way to go about fixing this would be using awk or sed, but I'm unsure which is the best option for my situation, and how exactly I'm supposed to implement it.
Here's a sample of my input data:
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"MicrosoftWordDocument1",COMSYRWS14,A4,PCL6,,,NOT DUPLEX,GRAYSCALE,35kb,
And here's what I have so far:
#!/bin/bash
#Get today's file name
yearprefix="20"
currentdate=$(date +"%m-%d-%y");
year=${currentdate:6};
year="$yearprefix$year"
month=${currentdate:0:2};
day=${currentdate:3:2};
filename="papercut-print-log-$year-$month-$day.csv"
echo "The filename is: $filename"
# Remove commas in between quotes.
#Loop through CSV file
OLDIFS=$IFS
IFS=,
[ ! -f $filename ] && { echo "$Input file not found"; exit 99; }
while read time user pages copies printer document client size pcl blank1 blank2 duplex greyscale filesize blank3
do
#Remove headers
if [ "$user" != "" ] && [ "$user" != "User" ]
then
#Remove any file name with an apostrophe
if [[ "$document" =~ "'" ]];
then
document="REDACTED"; # Lazy. Need to figure out a proper solution later.
fi
echo "$time"
#Save results to database
mysql -u username -p -h localhost -e "USE printerusage; INSERT INTO printerlogs (time, username, pages, copies, printer, document, client, size, pcl, duplex, greyscale, filesize) VALUES ('$time', '$user', '$pages', '$copies', '$printer', '$document', '$client', '$size', '$pcl', '$duplex', '$greyscale', '$filesize');"
fi
done < $filename
IFS=$OLDIFS
Which option is more suitable for this task? Will I have to create a second temporary file to get this done?
Thanks in advance!

As I wrote in another answer:
Rather than interfere with what is evidently source data, i.e. the stuff inside the quotes, you might consider replacing the field-separator commas (with say |) instead:
s/,([^,"]*|"[^"]*")(?=(,|$))/|$1/g
And then splitting on | (assuming none of your data has | in it).
Is it possible to write a regular expression that matches a particular pattern and then does a replace with a part of the pattern

There is probably an easier way using sed alone, but this should work. Loop on the file, for each line match the parentheses with grep -o then replace the commas in the line with spaces (or whatever it is you would like to use to get rid of the commas - if you want to preserve the data you can use a non printable and explode it back to commas afterward).
i=1 && IFS=$(echo -en "\n\b") && for a in $(< test.txt); do
var="${a}"
for b in $(sed -n ${i}p test.txt | grep -o '"[^"]*"'); do
repl="$(sed "s/,/ /g" <<< "${b}")"
var="$(sed "s#${b}#${repl}#" <<< "${var}")"
done
let i+=1
echo "${var}"
done

Extracting string from html file or curl output

I have a html file where some of them are "minified", this means that a whole website can be in just one line.
I want to filter the value of ?idsite= which contains numbers. So a html contains something like this: img src="//stats.domains.com/piwik.php?idsite=44.
So the plain output should be "44".
I tried grep but it echos the whole line and just highlights the value.

With perl it could be something like:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| perl -nE 'say /.*idsite=(..)\"/ '
(assumes that idsite is always two characters ! :-). Your regex will need to be more sophisticated than this most likely).
Putting the snippet from the page you reference above in an HTML file (non-minified) and subsituting 44 for the parameter variable, this bit of perl will extract the "44":
perl -nE 'say /.*idsite=(..)/ if /idsite/ ' idsite.html
Translating the one liner to a sed command line would be similar:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| sed -En "s/^.*idsite=(..)\"/\1/p"
This is POSIXsed from FreeBSD (should work on OSX) the -E switch is to add "modern" regexes.
Doing it in awk is left as an exercise for another community member :-)

Here is a perl way to extract only the trailing digits of strings like src="//stats.domains.com/piwik.php?idsite=44" and run on a bash command line:
echo $src|perl -ne '$_ =~m /(\d+$)/; print $1'
Here is a python way to do the same thing:
import re
print ', '.join( re.findall(r'\d+$', src))
If there will be a lot of src strings to process, it would be best to compile the regex when using Python as follows:
import re
p = re.compile('\d+$')
print ', '.join(p.findall(src))
The import and the compilation only have to be done once.
Here is a Ruby way to do it:
puts src.scan( /\d+$/ ).first
In all cases the regexes end with "$" which matches the end of the string. That is why they match and extract only digits (\d+) at the end of the string.

If you don't need to check whether the idsite is in the value of a src attribute, then all you need is
perl -nE'say $1 if /\bidsite=(\d+)' myfile.html

$ cat site.html
lorem ipsum idsite='4934' fasdf a
other line
$ sed -n '/idsite/ { s/.*idsite=\([0-9]\+\).*$/\1/; p }' < site.html
4934
Let me know in case you need an explanation of what is going on.

Add specific source file types to svn recursively

I know this is an ugly script but it does the job.
What I am facing now is adding a few more extensions what would clutter the scrip even more.
How can I make it more modular?
Specifically, how can I write this long regular expression (source file extensions) on multiple lines? say one extension on each line. I guess I am doing something wrong with string concatenation but not quite sure what exactly.
Here's the original file:
#!/bin/bash
COMMAND='svn status'
XARGS='xargs'
SVN='svn add'
$COMMAND | grep -E '(\.m|\.mat|\.java|\.js|\.php|\.cpp|\.h|\.c|\.py|\.hs|\.pl|\.xml|\.html|\.sh|.\asm|\.s|\.tex|\.bib|.\Makefile|.\jpg|.\gif|.\png|.\css)'$ | awk ' { print$2 } ' | $XARGS $SVN
and here's roughly what I am aiming at
...code code
'(.\m|
\.mat|
\.js|
.
.
.
.\css)'
..more code here
Anybody?

I know this doesn't answer the question directly, but from a readability perspective, I think most developers would agree that a single-line regex is the most common way to do things and therefore the most maintainable approach. Further, I'm not sure why you're including a period in each of your extensions, this should only need to be used once.
I wrote this little script to automatically add all images to svn. You should be able to simply add extensions between the pipes in the regex to add or remove different file types. Note that it makes sure to only add files that are unrecognized by making sure each line starts with a "?" (^\?) and ends with a period (\.(extensions)$). Hope it's helpful!
#!/bin/bash
svn st | grep -E "^\?.*\.(png|jpg|jpeg|tiff|bmp|gif)$" > /tmp/svn-auto-add-img
while read output; do
FILE=$(echo $output | awk '{ print $2 }')
svn add $FILE
done < /tmp/svn-auto-add-img
exit 0

How about this:
PATTERNS="
\.foo
\.bar
\.baz"
# Put them into one list separated by or ("|").
PATTERNS=`echo $PATTERNS |sed 's/\s\+/|/g'`
$COMMAND | grep -E "($PATTERNS)"
(Note that this would not work if you put quotes around $PATTERNS in the call to echo -- echo is taking care of stripping whitespace and converting newlines to spaces for us.)

#!/bin/bash
COMMAND='svn status'
XARGS='xargs'
SVNADD='svn add'
pats=
pats+=' \.m'
pats+=' \.mat'
pats+=' \.java'
pats+=' \.js'
# add your 'or-able' sub patterns here
# build the full pattern
pattern='(';for pat in $pats;do pattern+="$pat|";done;pattern=${pattern%\|}')$'
# run grep with the generated pattern
files=$($COMMAND | grep -E ${pattern} | awk ' { print $NF } ')
if [ " $files" != " " ]
then
$COMMAND | grep -E ${pattern} | awk ' { print $NF } ' | $XARGS $SVNADD
fi

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing

My sed (Mac OS X) didn't work with +. I tried * instead and I added p tag for printing match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt

You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print the resulting line
-r this makes it so you don't have the escape the capture group parens().
\1 the capture group match
/g global match
/p print the result
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'

I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches prints out the contents of the first set of bracks ($1).
You can do this will multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt

If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" match ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match ... or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from with the pattern space (a line).

You can use awk with match() to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This checks the pattern [0-9]+ when it occurs within abc and xyz and just prints the digits.

perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to use gawk and components of a regex is to use the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*([0-9]+).*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need to put the .* before and after the ([0-9]+) to get rid of text before and after the number in the substitution.

If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if you actual situation is more complex, the REs will need to me modified. For example if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp

why even need match group
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution

there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.

you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##abc}"
echo "num is ${t%%xyz}";;
esac
done <"file"

For awk. I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
/* default, do nothing */
}

gawk '/.*abc([0-9]+)xyz.*/' file

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Stripping email addresses from arbitrary file - regex

You want grep here not sed or awk. For example to display all emails from the domain south.com: grep -o '[^ ]*#south\.com ' file

Related

How to use 'sed' to add dynamic prefix to each number in integer list?

Should I use AWK or SED to remove commas between quotation marks from a CSV file? (BASH)

Extracting string from html file or curl output

Add specific source file types to svn recursively

how to use sed, awk, or gawk to print only what is matched?

Categories

Resources