Count occurrences of an email address in a text file - regex

I have a .txt file with many emails including headers. I'm just wondering how I would use perl to find out how many occurrences of the same email address are found in this text file?
Would it involve regular expressions?

You might find cpan: Email::Find useful. You could store the addresses you find in a hash table with email as the key and counter as value. You should be able to do that with the callback. Can you get started with this?

How about this script:
#!/usr/bin/perl
use strict;
use Data::Dumper;
my @email_list = ();
my %count;
while (my $line = <>) {
    foreach my $email (split /\s+/, $line) {
        if ( $email =~ /^[-\w.]+\@([a-z0-9][a-z-0-9]+\.)+[a-z]{2,4}$/i ) {
            push(@email_list, $email);
        }
    }
}
print "Total Email Count: ".scalar(@email_list)."\n\n";
$count{$_}++ for @email_list;
print Dumper(\%count);
Save it to a file such as email.pl, make it executable with chmod +x email.pl, and run it:
./email.pl file.txt
It will print the total number of email addresses found and count per email address.

If you want to find all email addresses, I recommend trying a module rather than writing your own regex. Correctly matching all email addresses gets quite complicated.
However, if you simply want to search for a given email address, you can accomplish this with a fairly simple regex:
#!/usr/bin/perl
use strict;
use warnings;
my $count = 0;
my $email = 'foo@bar.com';
while (<DATA>)
{
    $count++ while (m/(^|\s)\K\Q$email\E(?=\s|$)/g);
}
print "Found $email $count times";
__DATA__
foo@bar.com foo@bar.com
mr-foo@bar.com #not a match
old.foo@bar.com #not a match
blah blah blah foo@bar.com blah blah
foo@bar.commmm #not a match
Note that this requires the email address to be separated from any other content by whitespace.
A couple of notes:
\Q...\E is the quote-literal escape. It ensures that nothing in the email address is treated as special regex characters (Without this, the . would match any character rather than a literal period).
(?=...) is a look-ahead assertion. It checks that its contents match at that point without including them in the actual match. This is important, because a single space may come after one occurrence of the email and before the next. In order to match both, you don't want the first match to "eat up" that space.
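To see why the look-ahead matters, here is a small sketch that counts the same address twice: once with a trailing group that consumes the whitespace, and once with the look-ahead. The sample text and variable names are made up for illustration. Non-capturing groups are used so that the list-context match returns one element per match.

```perl
use strict;
use warnings;

my $email = 'foo@bar.com';
my $text  = 'foo@bar.com foo@bar.com';   # illustrative sample text

# Consuming version: the trailing (?:\s|$) swallows the space between the
# two addresses, so the second address has no leading whitespace left.
my $consuming = () = $text =~ /(?:^|\s)\Q$email\E(?:\s|$)/g;

# Look-ahead version: the space is only inspected, not consumed, so it is
# still available as the leading whitespace of the second address.
my $lookahead = () = $text =~ /(?:^|\s)\K\Q$email\E(?=\s|$)/g;

print "consuming: $consuming, look-ahead: $lookahead\n";
# prints "consuming: 1, look-ahead: 2"
```

The `= () =` idiom forces list context and then counts the elements, which is a compact alternative to the `$count++ while ...` loop.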

Related

Perl regex output hyphens multiple "words"

When I have a string with multiple hyphens in it, I seem to be able to find the (only) desired value, but why are there multiple outputs? I'd like to just report the matched string entirely, with hyphens. I've included what the output probably is, along with a way to rebuild the string, but this method seems like unnecessary work.
my $string = "phonenumber123-456-7890";
my @secondStrings = $string =~ m/(\d+)-(\d+)-(\d+)/g;
foreach (@secondStrings){
    print $_, "\n";
}
if ($string =~ m/(\d+)-(\d+)-(\d+)/g){
    print $1."-".$2."-".$3, "\n";
}
I believe you just want to put the entire phone number (123-456-7890) into one capture group. Right now you are using three, and in list context the match returns one value per group.
my ($number) = $string =~ m/(\d+-\d+-\d+)/g;
Further reading can be found here: http://perldoc.perl.org/perlre.html#Capture-groups
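A minimal side-by-side sketch of the two approaches, using the string from the question (the variable names @parts and $number are only for illustration):

```perl
use strict;
use warnings;

my $string = "phonenumber123-456-7890";

# Three capture groups: list context yields three separate values.
my @parts = $string =~ /(\d+)-(\d+)-(\d+)/;

# One capture group around the whole number: a single value comes back.
my ($number) = $string =~ /(\d+-\d+-\d+)/;

print scalar(@parts), " parts: @parts\n";   # prints "3 parts: 123 456 7890"
print "whole: $number\n";                   # prints "whole: 123-456-7890"
```

The parentheses around $number matter: they put the match in list context so that the captured text, rather than a true/false result, is assigned.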

In Perl, how can I extract email addresses from lines in log files?

I am looking to trim a string that is created by reading a file line by line. I want to pull out only the email address from the string, but the address changes every time. The only constant is the domain, for example @domain.com.
So for the input string of
00:00:50,004 ERROR [SynchronousCallback] Cannot process resource: test.test@domain.com Channel: channel16
what regular expression will look for @domain.com and pull out the whole of test.test@domain.com? I've got a regex that looks for the string, m/@domain.com/i, but I don't know how to manipulate the string once @domain.com has been located in it.
The output I would like is just the email address, test.test@domain.com
#!/usr/bin/env perl
use strict; use warnings;
use Email::Address;
while (my $line = <DATA>) {
my ($addr) = Email::Address->parse($line);
print $addr->address, "\n";
}
__DATA__
00:00:50,004 ERROR [SynchronousCallback] Cannot process resource: test.test@domain.com Channel: channel16
Output:
C:\temp> tt
test.test@domain.com
Will there always be whitespace immediately preceding the e-mail address? If so, you can use something like:
m/\s([^\s\@]+\@domain\.com)/i
Then you can retrieve the whole e-mail address by looking at $1.
If you need all results (more than one email per line) from a regexp, you could do this. Note that it removes each address from $str as it goes:
while ($str =~ s# ([^ ]+\@domain.com)##i){
    my $email = $1;
    print $email."\n";
}
It looks like you simply need to capture all non-whitespace characters preceding the domain string using /\S+\@domain\.com/. This program shows the principle.
my $s = '00:00:50,004 ERROR [SynchronousCallback] Cannot process resource: test.test@domain.com Channel: channel16';
print "$_\n" for $s =~ /\S+\@domain\.com/gi;
output
test.test@domain.com
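If a line can contain several addresses and you also want to tally them, the same pattern can be wrapped in a small helper. This is only a sketch: emails_for_domain is a hypothetical name, the sample line is made up, and the \S+ pattern is a simplification rather than a full address parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return every address at the given domain found in a line.
sub emails_for_domain {
    my ($line, $domain) = @_;
    # \Q...\E keeps the dots in the domain literal; /g returns all matches
    # when the sub is called in list context.
    return $line =~ /(\S+\@\Q$domain\E)/gi;
}

# Made-up log line with two addresses
my $line = 'ERROR Cannot process: test.test@domain.com and other@domain.com Channel: channel16';

my %count;
$count{$_}++ for emails_for_domain($line, 'domain.com');
print "$_: $count{$_}\n" for sort keys %count;
```

Keeping the counts in a hash keyed by address also answers the "how many times does each address occur" variant of the question.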

Perl regex: How to find in a file a word typed by a user

I am writing a script to read a LOG file. I want the user to type a word and then look it up and print the line (from a string) matching the word.
I'm just learning Perl so please be very specific and simple so that I can understand it.
print "Please Enter the word to find: ";
chomp ($userInput = <STDIN>);
while ($line = <INPUT>)
if ($line =~ /userInput/)
print $line;
I know that this is not perfect but I'm just learning.
You were close. You need to expand the variable in the pattern match.
print "Please Enter the word to find: ";
chomp ($userInput = <STDIN>);
while ($line = <INPUT>) {
if ($line =~ /$userInput/) { # note extra dollar sign
print $line;
}
}
Be aware that that is a pattern match, so you are searching with a string that potentially contains wildcards in it. If you want a literal string, put a \Q in front of the variable as you interpolate it: /\Q$userInput/.
Something like .*\bWORD\b.* might work (though it is not tested):
print $line if ($line =~ /.*\bWORD\b/)
@NewLearner
\b is for word boundaries
http://www.regular-expressions.info/wordboundaries.html
If you're doing just one lookup, using a while loop is fine, though of course you'll need to fix your syntax.
You could also use grep:
print grep /$userInput/, <INPUT>;
If you want to do multiple lookups, you can either reopen the file handle (if the file is large), or store it in an array:
print grep /$userInput/, @array;
You'll have meta characters in your input, of course. This can be a good thing, or bad, depending on your users. For example, an experienced user would recognize the option to refine his search by entering a search term such as ^foo(?=bar), whereas other people may get very confused when they can't find the string foo+bar.
A way to escape meta characters is by using quotemeta on your input. Another is to use \Q ... \E inside your regex.
$userInput = quotemeta($userInput);
# or
print grep /\Q$userInput\E/, <INPUT>;
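A short sketch of the difference, using the foo+bar example from above. The @lines array is a made-up stand-in for the lines of the log file.

```perl
use strict;
use warnings;

my $userInput = 'foo+bar';                               # what the user typed
my @lines     = ("foo+bar\n", "foobar\n", "fooobar\n");  # stand-in for log lines

# Interpolated as a pattern: "+" means "one or more of the preceding o".
my @as_pattern = grep { /$userInput/ } @lines;

# Quoted with \Q...\E: "+" is just a plus sign.
my @as_literal = grep { /\Q$userInput\E/ } @lines;

print "as pattern: ", scalar(@as_pattern), " line(s)\n";  # prints "as pattern: 2 line(s)"
print "as literal: ", scalar(@as_literal), " line(s)\n";  # prints "as literal: 1 line(s)"
```

As a pattern, the input matches foobar and fooobar but not the literal text foo+bar, which is exactly the confusing case described above.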
I believe if I were you, I would use a subroutine for the lookup. That way you can perform as many lookups as you like rather handily.
use strict;
use warnings; # ALWAYS use these
my $inputfile = shift @ARGV; # assumption: the log file path is the first argument
print "Please Enter the word to find: ";
chomp (my $userInput = <>); # <> is a more flexible handle
print lookup($userInput);
sub lookup {
    my $word = shift;
    open my $fh, "<", $inputfile or die $!;
    my @hits;
    while (<$fh>) {
        push @hits, $_ if /\Q$word\E/;
    }
    return @hits;
}

Perl Arrays and grep

I think it's more a question of characters. Anyway, I have a text file consisting of something like this:
COMPANY NAME

City

Addresss,
Address number

Email

phone number
and so on... (it repeats itself, but with different data). Let's assume this text is now in the $string variable.
I want to have an array (@row), for example:
$row[0] = "COMPANY NAME";
$row[1] = "City";
$row[2] = "Addresss,
Address number";
$row[3] = "Email";
$row[4] = "phone number";
At first I thought, well, that can easily be done with grep, something like this:
1) @row = grep (/^^$/, $string);
No go!
2) @row = grep (/\n/, $string);
Still no go. I also tried split and such, still no go.
Any idea?
thanks,
FM has given an answer that works using split, but I wanted to point out that Perl makes this really easy if you're reading this data from a filehandle. All you need to do is to set the special variable $/ to an empty string. This puts Perl into "paragraph mode". In this mode each record returned by the file input operator will contain a paragraph of text rather than the usual line.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
local $/ = '';
my @row = <DATA>;
chomp @row;
print Dumper(\@row);
__DATA__
COMPANY NAME

City

Addresss,
Address number

Email

phone number
The output is:
$ ./addr
$VAR1 = [
'COMPANY NAME',
'City',
'Addresss,
Address number',
'Email ',
'phone number'
];
The way I understand your question, you want to grab the items separated by at least one blank line. Although /\n{2,}/ would be correct in a literal sense (split on two or more consecutive newlines), I would suggest the regex below, because it will also handle nearly blank lines (those containing only whitespace characters).
use strict;
use warnings;
my $str = 'COMPANY NAME

City

Addresss,
Address number

Email

phone number';
my @items = split /\n\s*\n/, $str;
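A quick sketch of where the two patterns differ. The sample string is made up and has a separator line containing a single space after the first item.

```perl
use strict;
use warnings;

# Illustrative sample: the line after "COMPANY NAME" holds one space,
# the line after "City" is truly empty.
my $str = "COMPANY NAME\n \nCity\n\nEmail";

my @strict  = split /\n{2,}/,  $str;   # only splits on truly empty lines
my @lenient = split /\n\s*\n/, $str;   # also splits on whitespace-only lines

print scalar(@strict), " vs ", scalar(@lenient), "\n";   # prints "2 vs 3"
```

With /\n{2,}/ the first two items stay glued together, because the space breaks up the run of newlines.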
Another way, splitting on two or more newlines:
use strict;
use warnings;
my $string = "COMPANY NAME

City

Addresss,
Address number

Email

phone number";
my @string_parts = split /\n\n+/, $string;
foreach my $test (@string_parts){
    print "$test\n";
}
OUTPUT:
COMPANY NAME
City
Addresss,
Address number
Email
phone number
grep filters a list. Given a single string, it tests that one string as a whole, so it cannot break the string into pieces.
This is why you need to split the string on the token that you're after (as FM shows).
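A small sketch of that point, using a shortened, made-up version of the data:

```perl
use strict;
use warnings;

my $string = "COMPANY NAME\n\nCity\n\nEmail";   # shortened sample data

# grep sees a one-element list and either keeps or drops the whole string.
my @from_grep = grep { /\n/ } $string;

# split actually cuts the string at the separators.
my @from_split = split /\n\n+/, $string;

print scalar(@from_grep), " element from grep, ",
      scalar(@from_split), " from split\n";
# prints "1 element from grep, 3 from split"
```

Once the string has been split, grep becomes useful again, for example to drop empty items from the resulting list.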
While it isn't clear what you need this for, I would strongly recommend considering the Tie::File module:

Perl RegEx to find the portion of the email address before the @

I have the following issue in Perl. I have a file from which I get a list of emails as input.
I would like to extract the string before the '@' of each email address. (Later I will store all the strings before the @ in an array.)
For example, from abcdefgh@gmail.com I would like to extract abcdefgh.
My intention is to get only the string before '@'. Now the question is how to do it using a regular expression. Or is there another method using substr?
When I use the regular expression $mail =~ "\@" in Perl, it's not giving me the result.
Also, how do I find the index of the character '@' in the string $mail?
I would appreciate it if anyone can help me out.
#!/usr/bin/perl
$mail = "abcdefgh@gmail.com";
if ($mail =~ "\@" ) {
    print("my name = You got it!");
}
else
{
    print("my name = Try again!");
}
In the above code $mail =~ "\@" is not giving me the desired output, but ($mail =~ "abc") does.
$mail =~ "@" will work only if the given string is $mail = "abcdefgh\@gmail.com";
But in my case, I will be getting the input with the email address as it is,
not with an escape character.
Thanks,
Tom
Enabling warnings would have pointed out your problem:
#!/usr/bin/perl
use warnings;
$mail = "abcdefgh@gmail.com";
__END__
Possible unintended interpolation of @gmail in string at - line 3.
Name "main::gmail" used only once: possible typo at - line 3.
and enabling strict would have prevented it from even compiling:
#!/usr/bin/perl
use strict;
use warnings;
my $mail = "abcdefgh@gmail.com";
__END__
Possible unintended interpolation of @gmail in string at - line 4.
Global symbol "@gmail" requires explicit package name at - line 4.
Execution of - aborted due to compilation errors.
In other words, your problem wasn't the regex working or not working, it was that the string you were matching against contained "abcdefgh.com", not what you expected.
The @ sign is a metacharacter in double-quoted strings, where it starts array interpolation. If you put your email address in single quotes, you won't get that problem.
Also, I should add the obligatory comment that this is fine if you're just experimenting, but in production code you should not parse email addresses using regular expressions, but instead use a module such as Mail::Address.
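A minimal sketch of the two safe ways to write the address as a literal; both compile cleanly under strict and warnings:

```perl
use strict;
use warnings;

my $single  = 'abcdefgh@gmail.com';    # single quotes: no interpolation at all
my $escaped = "abcdefgh\@gmail.com";   # double quotes: the @ must be escaped

print "same string\n" if $single eq $escaped;   # prints "same string"
print "contains \@\n" if $single =~ /\@/;       # prints "contains @"
```

Addresses read from a file or from user input are never interpolated, so this only concerns literals written in the source code.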
What if you tried this:
my $email = 'user@email.com';
$email =~ /^(.+?)@/;
print $1
$1 will be everything before the @.
If you want the index of a string, you can use the index() function. ie.
my $email = 'foo@bar';
my $index = index($email, '@');
If you want to return the former half of the email, I'd use split() over regular expressions.
my $email = 'foo@bar';
my @result = split '@', $email;
my $username = $result[0];
Or even better with substr
my $username = substr($email, 0, index($email, '@'))
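One thing to watch with the substr/index combination: index returns -1 when there is no @ in the string, and substr($email, 0, -1) then silently returns everything except the last character. A small sketch with a guard (local_part is a hypothetical helper name, and the sample inputs are made up):

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before the first @, or undef if
# the string contains no @ at all.
sub local_part {
    my ($email) = @_;
    my $at = index($email, '@');
    return if $at < 0;               # no @ found
    return substr($email, 0, $at);
}

for my $candidate ('abcdefgh@gmail.com', 'no-at-sign') {
    my $name = local_part($candidate);
    print defined $name
        ? "$candidate -> $name\n"
        : "$candidate -> (no \@ found)\n";
}
```

This only splits at the first @ and does not validate the address.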
$mail = 'abcdefgh@gmail.com';
$mail =~ /^([^@]*)@/;
print "$1\n"