Perl: Parse maillog to get date/recipient in a single regex statement

I'm trying to parse my maillog, which contains a number of lines which look similar to the following line:
Jun 6 17:52:06 host sendmail[30794]: p569q3sX030792: to=<person@recipient.com>, ctladdr=<apache@host.com> (48/48), delay=00:00:03, xdelay=00:00:03, mailer=esmtp, pri=121354, relay=gmail-smtp-in.l.google.com. [1.2.3.4], dsn=2.0.0, stat=Sent (OK 1307354043 x8si28599066ict.63)
The rules I'm trying to apply are:
The date is always the first 2 words
The email address always appears in the form " to=person@recipient.com, "; however, the address might be surrounded by <>
There are some lines in the log which do not relate to a recipient, so I'd like to ignore those lines entirely.
The following code works for either rule individually, however I'm having trouble combining them:
if($_ =~ m/\ to=([<>a-zA-Z0-9\.\@]*),\ /g) {
print "$1\n";
}
if($_ =~ /^+(\S+\s+\S+\s)/g) {
print "$1\n";
}
As always, I'm not sure whether the regex I'm using above is "best practice" so feel free to point out anything I'm doing badly there too :)
Thanks!

print substr($_, 0, 7), "$1\n" if / to=(.+?), /;
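Your date is in a fixed-length format, so you don't need a regular expression to match it: traditional syslog timestamps pad a single-digit day with an extra space, so the month, the day, and the trailing space always occupy the first seven characters.
For the address, you need the part between to= and the next comma, so a non-greedy match does the job.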
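If you would rather pull both values out with a single regex, as the question asks, the following is a minimal sketch. The helper name parse_maillog_line and the shortened sample line are made up for illustration, and the pattern assumes the address never contains a comma, whitespace, or angle brackets.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (date, recipient), or an empty list
# for lines that have no " to=" field.
sub parse_maillog_line {
    my ($line) = @_;
    # ^(\S+\s+\S+)  the first two words (month and day)
    # .*?\sto=<?    skip ahead to " to=", allowing an optional "<"
    # ([^<>,\s]+)   the address itself
    # >?,           an optional ">" and the comma that ends the field
    return $line =~ /^(\S+\s+\S+)\s.*?\sto=<?([^<>,\s]+)>?,/ ? ( $1, $2 ) : ();
}

# Shortened sample line (assumed layout, based on the line in the question).
my $line = 'Jun  6 17:52:06 host sendmail[30794]: p569q3sX030792: '
         . 'to=<person@recipient.com>, ctladdr=<apache@host.com> (48/48), stat=Sent';

if ( my ( $date, $rcpt ) = parse_maillog_line($line) ) {
    print "$date $rcpt\n";
}
```

Lines that do not relate to a recipient simply fail to match, so the if block skips them.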

To match either with one regex, OR them together using the syntax (regex1|regex2):
((?<=\ to=)[<>a-zA-Z0-9\.\@]*(?=,\ )|^\S+\s+\S+\s)
The outer parentheses ensure that $1 is assigned the match.
The lookbehind (?<=\ to=) and lookahead (?=,\ ) do not capture anything, so the regex captures only your target string.
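As a rough illustration of the alternation approach, the sketch below runs such a pattern with /g in list context, which returns one capture per successful match. It uses Perl's (?<=...) lookbehind syntax, and the sample line is a shortened, assumed version of the one in the question.

```perl
use strict;
use warnings;

# Shortened sample line (assumed layout, based on the line in the question).
my $line = 'Jun  6 17:52:06 host sendmail[30794]: p569q3sX030792: '
         . 'to=<person@recipient.com>, ctladdr=<apache@host.com> (48/48), stat=Sent';

# In list context, /g returns every capture: the date comes from the
# second alternative, the address from the first.
my @parts = $line =~ /((?<=\ to=)[<>a-zA-Z0-9.\@]*(?=,\ )|^\S+\s+\S+\s)/g;

print "[$_]\n" for @parts;
```

Note that the address keeps its surrounding <> here, because the character class includes them.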

Use Perl to check if a string has only English characters

I have a file with submissions like this
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
I am stripping everything but the song name by using this regex.
$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&\$#%#\\|]//g;
I want to make sure that the only strings printed are ones that contain only English characters, so in this case it would be the first song title, Ai Wo Qing shut up, and not the next one because of the è.
I have tried this
if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
print $line;
}
else {
print "Non-english\n";
}
I thought this would match just the English characters, but it always prints Non-english. I feel this is me being rusty with regex, but I cannot find my answer.
Following from the comments, your problem would appear to be:
$line =~ m/[^a-zA-z0-9_]*$/
Specifically, the ^ is inside the brackets, which means that it's not acting as an 'anchor'. It's actually a negation operator.
See: http://perldoc.perl.org/perlrecharclass.html#Negation
It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".
But the important part is - that without the 'start of line' anchor, your regular expression is zero-or-more instances (of whatever), so will match pretty much anything - because it can freely ignore the line content.
(Borodin's answer covers some of the other options for this sort of pattern match, so I shan't reproduce).
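To make the anchoring point concrete, here is a minimal sketch of a whole-string check. The helper name is_plain_ascii is made up, and treating "English characters" as printable ASCII (space through tilde) is an assumption about what the question needs.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains only printable ASCII.
sub is_plain_ascii {
    my ($text) = @_;
    # \A and \z anchor the match to the whole string, so every
    # character must fall in the range space (0x20) to tilde (0x7E).
    return $text =~ /\A[\x20-\x7E]*\z/ ? 1 : 0;
}

for my $title ( 'Ai Wo Qing shut up', "Ach\x{e8}te-moi" ) {
    print is_plain_ascii($title) ? "$title\n" : "Non-English\n";
}
```

Because the class is positive and anchored at both ends, a single character outside the range is enough to make the match fail.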
It's not clear exactly what you need, so here are a couple of observations that speak to what you have written.
It is probably best if you use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this
use strict;
use warnings;
use 5.010;
while ( <DATA> ) {
chomp;
my @fields = split /<SEP>/;
say $fields[3];
}
__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi
output
Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi
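Also, for ASCII text the word character class \w matches exactly [a-zA-Z0-9_] (and \W matches the complement), so you can rewrite your if statement like this. Note that on decoded Unicode strings \w also matches accented letters such as è unless you add the /a modifier.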
if ( $line =~ /\W/ ) {
print "Non-English\n";
}
else {
print $line;
}

Getting multiple matches within a string using regex in Perl

After having read this similar question and having tried my code several times, I keep on getting the same undesired output.
Let's assume the string I'm searching is "I saw wilma yesterday".
The regex should capture each word followed by an 'a' and its optional 5 following characters or spaces.
The code I wrote is the following:
$_ = "I saw wilma yesterday";
if (@m = /(\w+)a(.{5,})?/g){
print "found " . @m . " matches\n";
foreach(@m){
print "\t\"$_\"\n";
}
}
However, I kept on getting the following output:
found 2 matches
"s"
"w wilma yesterday"
while I expected to get the following one:
found 3 matches:
"saw wil"
"wilma yest"
"yesterday"
until I found out that the return values inside @m were $1 and $2, as you can notice.
Now, since the /g flag is on, and I don't think the problem is about the regex, how could I get the desired output?
You can try this pattern that allows overlapped results:
(?=\b(\w+a.{1,5}))
or
(?=(?i)\b([a-z]+a.{0,5}))
example:
use strict;
my $str = "I saw wilma yesterday";
my @matches = ($str =~ /(?=\b([a-z]+a.{0,5}))/gi);
print join("\n", @matches),"\n";
more explanations:
A regex normally can't return overlapping results, because once a character has been consumed by the regex engine it can't be consumed a second time. The trick to get around this constraint is to use a lookahead, which only tests whether its subpattern matches at the current position without consuming any characters, and to put a capturing group inside it. The engine can then try every position in the string.
For another example of this behaviour, you can try the example code without the word boundary (\b) to see the result.
Firstly you want to capture everything inside the expression, i.e.:
/(\w+a(?:.{5,})?)/
Next you want to start your search from one character past where the last expression's first character matched.
The pos() function allows you to specify where a /g regex starts its search from.
$s = "I saw wilma yesterday";
while ($s =~ /(\w+a(.{0,5}))/g){
print "\t\"$1\"\n";
pos($s) = pos($s) - length($2);
}
Gives you:
"saw wil"
"wilma yest"
"yesterday"
But I don't know why you should get day and not yesterday.

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain.com>\s\d+:
which will take the text before the : But I want the part after the email up to the :
Is there any easy regular expression to do this? or should I use split and do this
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last@site.domain.com> - Email address in format FirstName.LastName@sub.domain.com
1234, 4567 - database primary Keys
: xxxx - Headline
What I have to do is process the above and get the database IDs (in the example, 1234 and 4567 are two separate IDs) and query the tables
The above is the output (like this I will get many entries) from the tool which I am calling via my Perl script.
My idea was to use a regular expression to get the database id's. Guess I could use regular expression for this
You can fudge the stuff you don't care about to make the expression easier: just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two capture groups, the ID (e.g. 3333) and the number list (e.g. 1234, 4657); from your pattern I can only assume the second one means "a digit string, followed by zero or more comma-separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
my $string = shift;
if ($string =~ /<[^>]*> *([\d, ]+):/) {
return $1 =~ /\d+/g; # return the extracted digits in a list
# return $1; # just return the string as-is
} else { return } # a bare return yields an empty list in list context
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
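As a usage illustration, the sketch below runs this kind of function over the two sample lines from the question. The sub is repeated so the example is self-contained, with a bare return for the no-match case so that callers in list context get an empty list.

```perl
use strict;
use warnings;

sub extract_nums {
    my $string = shift;
    if ( $string =~ /<[^>]*> *([\d, ]+):/ ) {
        return $1 =~ /\d+/g;    # list of the digit runs
    }
    return;                     # empty list when nothing matches
}

# Sample lines taken from the question.
my @lines = (
    '(2222) First Last (ab-cd/ABC1), <first.last@site.domain.com> 1224: efadsfadsfdsf',
    '(3333) First Last (abcd/ABC12), <first.last@site.domain.com> 1234, 4657: efadsfadsfdsf',
);

for my $line (@lines) {
    my @ids = extract_nums($line);
    print join( ' ', @ids ), "\n";
}
```

Each ID then arrives as a separate list element, ready to be passed to a database query one at a time.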
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar@domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar@domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\@\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9, ]+):.+/;
$numbers = $1;
Not tested but you get the idea.

Is it possible to make conditional regex of following?

Hello I am wondering whether it is possible to do this type of regex:
I have certain characters representing objects, i.e. #, @, $, and operations that may be used on them like +, -, %, and so on. Every object has a different set of operations and I want my regex to find valid pairs.
So for example I want the pairs #+, #-, $+ to be matched, but the pair $- not to be matched, as it is invalid.
So is there any way to do this with regexes only, without doing some gymnastics inside the language using the regex engine?
Give every object its own rules in []:
/(#[+-]|\$[+]|\@[+-])/
You need to properly escape special characters.
Gymnastics is hard. Try something like /#\+|#-|\$\+/ or something like that.
Just remember, +, $, and ^ are reserved, so they'll need to be escaped.
Another approach is to mix a negative lookahead for the combinations that are not allowed with a character-class match of the raw combinations. This might be slower, though not if there are many alternations of the first character:
/(?!\$-|\$\%)([\@\$\#][+\-\%])/
my $str = '
#+, #-, $+ to be matched,
but yair $- not to be matched asit is invalid.
$% $- #% $%
';
my $regex =
qr/
(?!\$-|\$\%) # Specific combinations not allowed
(
[\@\$\#][+\-\%] # Raw combinations allowed
)
/x;
while ( $str =~ /$regex/g ) {
print "found: '$1'\n";
}
__END__
Output:
found: '#+'
found: '#-'
found: '$+'
found: '#%'
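If the list of objects grows, one option is to generate the alternation from a table instead of writing it by hand. This is only a sketch: the %ops_for table and the sample string are assumptions made up for illustration, not rules taken from the question.

```perl
use strict;
use warnings;

# Assumed rule table: object character => operators it accepts.
my %ops_for = (
    '#' => '+-',
    '@' => '+-',
    '$' => '+',
);

# Build one alternative per object, e.g. \$[\+], and join them with |.
my $alternation = join '|',
    map { quotemeta($_) . '[' . quotemeta( $ops_for{$_} ) . ']' }
    sort keys %ops_for;
my $pair_re = qr/($alternation)/;

my $str   = '#+ @- $+ $- @% #-';
my @found = $str =~ /$pair_re/g;
print "found: '$_'\n" for @found;
```

quotemeta takes care of the escaping, so adding an object or operator only means editing the table.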

perl regex grouping overload

I am using the following perl regex lines
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
$myalbum =~ m/([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+)/i;
to match the following strings
"30_Seconds_To_Mars_-_30_Seconds_To_Mars"
"30_Seconds_To_Mars_-_A_Beautiful_Lie"
"311_-_311"
"311_-_From_Chaos"
"311_-_Grassroots"
"311_-_Sound_System"
What I am experiencing is that for strings with less than 5 matching groups (ex. 311_-_311), attempting to print $1 $2 $3 prints nothing at all. Only strings with more than 5 matches will print.
How do I resolve this?
It looks like you just want the words in separate groups. To me, it seems like you're abusing regexes to do that when you could just run your substitutions and then split. Just do:
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
my @myalbum_list = split(' ', $myalbum); # ' ' splits on runs of whitespace, so no empty fields
#Print out whatever it is you want/ test length, etc...
print "$myalbum_list[0] $myalbum_list[1] $myalbum_list[2]";
The + character means at least one match, which means your regex m/([A-Z0-9\$]+) +([A-Z0-9\$]+) + ... requires all five fields to be there for it to be considered a match. The reason you are not capturing anything is that the regex is not actually matching.
You are probably looking for the * character, which means zero or more, not one or more like +.
I suppose your capturing groups are empty for "311 - 311" because this string doesn't match your regex.
How to resolve? Use * instead of + to permit empty sequences.
Edit: From your post I guess you want to extract the album name, i.e. the part before the minus sign.
Why not match against '(.*) - (.*)', with the first group being the album and the second the title? The problem is with strings like "Album with minus - sign - First track" or "My Album - Track is one - two - three". But even a human wouldn't know where the album ends and the track starts.
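A minimal sketch of that idea follows. The helper name split_on_dash is made up, and the non-greedy first group (which splits at the first " - " rather than the last) is an assumption about which separator should win when a name contains several.

```perl
use strict;
use warnings;

# Hypothetical helper: split "Left_-_Right" into its two parts.
sub split_on_dash {
    my ($raw) = @_;
    ( my $clean = $raw ) =~ tr/_/ /;    # underscores become spaces
    # Non-greedy first group: split at the first " - ".
    return $clean =~ /\A(.*?) - (.*)\z/ ? ( $1, $2 ) : ();
}

for my $name ( '30_Seconds_To_Mars_-_A_Beautiful_Lie', '311_-_311' ) {
    my ( $first, $second ) = split_on_dash($name);
    print "$first | $second\n";
}
```

Unlike the five-group pattern in the question, this matches however many words each side contains.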