help printing out hash keys to needed format - regex

I need help printing data from a hash/hash ref to STDOUT or a file,
with the fields in a specific order if possible.
I have a Perl routine that uses hash references like so:
#!/usr/local/bin/perl
use strict;
use warnings;
use File::Basename;
use Data::Dumper;
my %MyItems;
my $ARGV ="/var/logdir/server1.log";
my $mon = 'Aug';
my $day = '06';
my $year = '2010';
while (my $line = <>)
{
chomp $line;
if ($line =~ m/(.* $mon $day) \d{2}:\d{2}:\d{2} $year: ([^:]+):backup:/)
{
my $server = basename $ARGV, '.log';
my $BckupDate="$1 $year";
my $BckupSet =$2;
$MyItems{$server}{$BckupSet}->{'MyLogdate'} = $BckupDate;
$MyItems{$server}{$BckupSet}->{'MyDataset'} = $BckupSet;
$MyItems{$server}{$BckupSet}->{'MyHost'} = $server;
if ($line =~ m/(ERROR|backup-size|backup-time|backup-status)[:=](.+)/)
{
my $BckupKey=$1;
my $BckupVal=$2;
$MyItems{$server}{$BckupSet}->{$BckupKey} = $BckupVal;
}
}
}
foreach( values %MyItems ) {
print "MyHost=>$_->{MyHost};MyLogdate=>$_->{MyLogdate};MyDataset=>$_->{MyDataset};'backup-time'=>$_->{'backup-time'};'backup-status'=>$_->{'backup-status'}\n";
}
Output using dumper:
$VAR1 = 'server1';
$VAR2 = {
'abc1.mil.mad' => {
'ERROR' => ' If you are sure is not running, please remove the file and restart ',
'MyLogdate' => 'Fri Aug 06 2010',
'MyHost' => 'server1',
'MyDataset' => 'abc1.mil.mad'
},
'abc2.cfl.mil.mad' => {
'backup-size' => '187.24 GB',
'MyLogdate' => 'Fri Aug 06 2010',
'MyHost' => 'server1',
'backup-status' => 'Backup succeeded',
'backup-time' => '01:54:27',
'MyDataset' => 'abc2.cfl.mil.mad'
},
'abc4.mad_lvm' => {
'backup-size' => '422.99 GB',
'MyLogdate' => 'Fri Aug 06 2010',
'MyHost' => 'server1',
'backup-status' => 'Backup succeeded',
'backup-time' => '04:48:50',
'MyDataset' => 'abc4.mad_lvm'
}
};
Formatted output that I would like to see:
MyHost=>server1;MyLogdate=>Fri Aug 06 2010;MyDataset=>abc2.cfl.mil.mad;backup-time=>Fri Aug 06 2010;backup-status=>Backup succeeded
Just added (8/7/2010):
Sample raw log file I am using: (recently added to provide better representation of the source log)
Fri Aug 06 00:00:05 2010: abc2.cfl.mil.mad:backup:INFO: backup-set=abc2.cfl.mil.mad
Fri Aug 06 00:00:05 2010: abc2.cfl.mil.mad:backup:INFO: backup-date=20100806000004
Fri Aug 06 00:48:54 2010: abc4.mad_lvm:backup:INFO: backup-size=422.99 GB
Fri Aug 06 00:48:54 2010: abc4.mad_lvm:backup:INFO: PHASE END: Calculating backup size & checksums
Fri Aug 06 00:48:54 2010: abc4.mad_lvm:backup:INFO: backup-time=04:48:50
Fri Aug 06 00:48:54 2010: abc4.mad_lvm:backup:INFO: backup-status=Backup succeeded
Fri Aug 06 00:48:54 2010: abc4.mad_lvm:backup:INFO: Backup succeeded

I've spent some time looking at your code and I think I have it figured out.
The reason this was hard to answer is that you've unintentionally planted a red herring--the data dumper output.
Notice how it shows $VAR1 = 'server1'; and then $VAR2 = { blah };.
You called Dumper like so: print Dumper %MyItems;
The problem is that Dumper wants a list of values to dump. Since Perl flattens lists, complex structures must be passed by reference. So, you need to call Dumper like so:
print Dumper \%MyItems;
This shows the whole structure.
When you called Dumper earlier, you inadvertently stripped off one layer of your data structure. The proposed solutions and your own code are operating on this stripped structure.
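The flattening is easy to reproduce in isolation. The sketch below uses a small made-up hash (the name %items and its contents are illustrative only, not your real data) and compares the two calls:

```perl
use strict;
use warnings;
use Data::Dumper;

# Illustrative data only: one server, one backup set.
my %items = ( server1 => { 'abc4.mad_lvm' => { MyHost => 'server1' } } );

# A bare hash is flattened into a key/value list, so Dumper
# receives two arguments and labels them $VAR1 and $VAR2.
my $flattened = Dumper(%items);

# A reference is a single scalar, so the whole structure,
# including the top-level server key, appears under $VAR1.
my $whole = Dumper(\%items);

print "Flattened:\n$flattened\nBy reference:\n$whole";
```

In the flattened dump the server name shows up as a separate $VAR1 string, which is exactly the shape of the output in the question.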
Here I've bolted on some code to handle the additional layer of nesting (and made it Perl 5.8 compatible):
for my $server_items ( values %MyItems ) {
for my $record ( values %$server_items ) {
print join ';', map {
# Replace nonexistent values with 'undef'
my $val = exists $record->{$_} ? $record->{$_} : 'undef';
"'$_'=>$val" # <-- this is what we print for each field
} qw( MyHost MyLogdate MyDataset backup-time backup-status );
print "\n";
}
}
It looks like you have a lot of questions and need some help getting your head around a number of concepts. I suggest that you post a request on PerlMonks in Seekers of Perl Wisdom for help improving your code. SO is great for focused questions, but PM is more amenable to code rework.
** Original answer: **
To get around any parsing issues that I can't replicate, I just set %MyItems to the output of Dumper you provided.
The warnings you mention above come from all the complicated quoting and repetitive coding in your print statement. I have replaced your print statement with a map to simplify the code.
Holy crap, a big join map blah is not simpler, you might be thinking. But really, it is simpler because each individual unit of expression is smaller. What is easier to understand and get right? What is easier to alter and maintain in a correct and consistent manner?
print "'foo'=>$_->{foo};'bar'=>$_->{bar};'boo'=>$_->{boo};'far'=>$_->{far}\n";
or
say join ';', map {
"'$_'=>$item->{$_}"
} qw( foo bar boo far );
Here, you can add, remove or rearrange your output merely by changing the list of arguments passed to map. With the other style, you've got a bunch of copy/paste to do.
The map I use below is a bit more complex, in that it checks whether a given key is defined before printing its value, and assigns a default value if none is present.
#!perl
use strict;
use warnings;
use feature 'say';
my %MyItems = (
'abc1.mil.mad' => {
'ERROR' => ' If you are sure is not running, please remove the file and restart ',
'MyLogdate' => 'Fri Aug 06 2010',
'MyHost' => 'server1',
'MyDataset' => 'abc1.mil.mad'
},
'abc2.cfl.mil.mad' => {
'backup-size' => '187.24 GB',
'MyLogdate' => 'Fri Aug 06 2010',
'MyHost' => 'server1',
'backup-status' => 'Backup succeeded',
'backup-time' => '01:54:27',
'MyDataset' => 'abc2.cfl.mil.mad'
},
'abc3.mil.mad' => {
'backup-size' => '46.07 GB',
'MyLogdate' => 'Fri Aug 06 2010',
'MyHost' => 'server1',
'backup-status' => 'Backup succeeded',
'backup-time' => '00:41:06',
'MyDataset' => 'abc3.mil.mad'
},
'abc4.mad_lvm' => {
'backup-size' => '422.99 GB',
'MyLogdate' => 'Fri Aug 06 2010',
'MyHost' => 'server1',
'backup-status' => 'Backup succeeded',
'backup-time' => '04:48:50',
'MyDataset' => 'abc4.mad_lvm'
}
);
for my $record ( values %MyItems ) {
say join ';', map {
my $val = $record->{$_} // 'undef'; # defined-or requires perl 5.10 or newer.
"'$_'=>$val" # <-- this is what we print for each field
} qw( MyHost MyLogdate MyDataset backup-time backup-status );
}

Not tested, but it should work in theory. This prints one output line for each entry in the main MyItems hash. If you want it all on one line, drop the \n or use some other separator.
foreach( values %MyItems ) {
# Hyphenated keys must be quoted, otherwise Perl parses them as a subtraction.
print "MyHost=>$_->{MyHost};MyLogdate=>$_->{MyLogdate};MyDataset=>$_->{MyDataset};backup-time=>$_->{'backup-time'};backup-status=>$_->{'backup-status'}\n";
}

Not answering the question you asked, but this doesn't seem sensible to me.
You want an array of hashes not a hash of hashes.
Hashes are not ordered; if you want them ordered, use an array.
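A minimal sketch of that idea, assuming each backup set becomes one hash pushed onto an array (the names @records and @fields and the sample values are made up for illustration). The array keeps the records in insertion order, and a separate field list fixes the column order:

```perl
use strict;
use warnings;

# Records are stored in the order they are listed or pushed.
my @records = (
    { MyHost => 'server1', MyDataset => 'abc2.cfl.mil.mad', 'backup-status' => 'Backup succeeded' },
    { MyHost => 'server1', MyDataset => 'abc4.mad_lvm',     'backup-status' => 'Backup succeeded' },
);

# Hash keys have no order, so the output order comes from this list.
my @fields = qw( MyHost MyDataset backup-status );

my @lines = map {
    my $record = $_;
    join ';', map { "$_=>$record->{$_}" } @fields;
} @records;

print "$_\n" for @lines;
```

Reordering the output then only means reordering @fields.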

Thanks everyone for pitching in their help...
This works for me.
for my $Server ( keys %MyItems ) {
    for my $BckupSet ( keys %{ $MyItems{$Server} } ) {
        for ( sort keys %{ $MyItems{$Server}{$BckupSet} } ) {
            print $_, '=>', $MyItems{$Server}{$BckupSet}{$_}, ';';
        }
        print "\n";
    }
}

Related

Perl: file search regex multiple infos in multiple lines

Hello, I have a file with multiple lines like the ones below, and from them I want to get the user name and the version that user is running.
File
<W>2016-06-25 00:27:30.577 1 => <4:(-1)> Client version 1.2.10 (Win: 1.2.10)
<W>2016-06-25 00:27:30.635 1 => <4:[AAA] User1(1850)> Authenticated
<W>2016-06-25 00:27:30.635 1 => <2:(-1)> Client version 1.2.16 (Win: 1.2.16)
<W>2016-06-25 00:27:30.687 1 => <2:[AAA] User2(942)> Authenticated
Output wanted
4 : User1 : 1.2.10
2 : User2 : 1.2.16
So the data for one client is spread over two lines.
The first line gives the version number.
The second line gives the user name.
I noticed that both lines share a matching ID: in my example, the ID is 4 for User1 and 2 for User2.
So I started with something like this, but it doesn't really work as intended, and starting a second read loop to find the second line in the rest of the file seems excessive and unoptimized.
Perl Script
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'mylogfile.log';
open (my $fl, '<:encoding(UTF-8)', $file)
or die 'File not found';
while (my $row = <$fl>) {
if ($row =~ m/\<(\d+).*\>\sclient\sversion\s(\d+.\d+.\d+)\s/i) {
my $id = $1;
my $vers = $2;
while (my $row1 = <$fl>) {
if ($row1 =~ m/\<$id\:(.+)\(\d+\)\>/i) {
my $name = $1;
print "$id : $name : $vers\n";
}
}
}
}
If any Perl guru has an idea, thanks! :-)
I see in your log file that timestamps of corresponding rows are different.
So, I suppose, when two users log in at the same time, log records could get interspersed, for example:
<W>2016-06-25 00:27:30.577 1 => <4:(-1)> Client version 1.2.10 (Win: 1.2.10)
<W>2016-06-25 00:27:30.635 1 => <2:(-1)> Client version 1.2.16 (Win: 1.2.16)
<W>2016-06-25 00:27:30.635 1 => <4:[AAA] User1(1850)> Authenticated
<W>2016-06-25 00:27:30.687 1 => <2:[AAA] User2(942)> Authenticated
If this is the case, I would suggest using a hash to remember ids:
use strict;
use warnings;
my $file = 'mylogfile.log';
open (my $fl, '<:encoding(UTF-8)', $file)
or die 'File not found';
my %ids;
while (my $row = <$fl>) {
if ($row =~ m/\<(\d+).*\>\sclient\sversion\s(\d+\.\d+\.\d+)\s/i) {
my ($id,$vers)=($1,$2);
$ids{$id}=$vers;
}
elsif ($row =~ m/\<(\d+)\:(?:\[[^\]]*\]\s*)?(.+?)\(\d+\)\>.*authenticated/i) { # skip the optional [TAG] so only the user name is captured
if (defined $ids{$1}) {
print "$1 : $2 : $ids{$1}\n";
delete $ids{$1};
}
}
}
I don't know much about Perl, but I can offer the idea in pseudocode:
login= map();
while( row=readrow())
{
if(match(id version))
login[$1]=$2
else
if(match(id username userid ))
{
print "user: ", $2, "version:",login[$1], "userid: $3", "sessionid: ", $1
delete login[$1]
}
}
Running your code gave me the result
4 : [AAA] User1 : 1.2.10
Your second regular expression is capturing the bracketed letters and the user name. This isn't what your desired output looks like.
The second while loop exhausts the remainder of the file. And, this isn't what you want to do.
Here is a program that will produce the output you want. (I created a file at the top of the program. You would not use this but instead, open your file 'mylogfile.log' just as you did in your code).
#!/usr/bin/perl
use strict;
use warnings;
open my $fh, '<', \<<EOF;
<W>2016-06-25 00:27:30.577 1 => <4:(-1)> Client version 1.2.10 (Win: 1.2.10)
<W>2016-06-25 00:27:30.635 1 => <4:[AAA] User1(1850)> Authenticated
<W>2016-06-25 00:27:30.635 1 => <2:(-1)> Client version 1.2.16 (Win: 1.2.16)
<W>2016-06-25 00:27:30.687 1 => <2:[AAA] User2(942)> Authenticated
EOF
while (<$fh>) {
if (/<(\d+).+?Client version (\d+\.\d+\.\d+)/) {
my ($id, $vers) = ($1, $2);
# read next line and capture name
if (<$fh> =~ /<$id\S+ ([^(]+)/) {
my $name = $1;
print join(" : ", $id, $name, $vers), "\n";
}
}
}
In my second regular expression, the piece [^(]+ is called a negated character class. It matches one or more characters that are not a left parenthesis. This matches 'User1' and 'User2' in the lines of the file.
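To see the difference the negated class makes, here is a small sketch run against one of the sample lines. A greedy (.+) capture like the one in the question is shown alongside for comparison; the variable names are illustrative:

```perl
use strict;
use warnings;

my $line = '<W>2016-06-25 00:27:30.635 1 => <4:[AAA] User1(1850)> Authenticated';

# (.+) grabs everything between "<4:" and "(1850)", tag included.
my ($greedy) = $line =~ /<4:(.+)\(\d+\)>/;

# \S+ skips the ":[AAA]" token, then [^(]+ takes characters
# up to, but not including, the first left parenthesis.
my ($name) = $line =~ /<4\S+ ([^(]+)/;

print "greedy: $greedy\n";   # greedy: [AAA] User1
print "name:   $name\n";     # name:   User1
```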
Update: You can find info about character classes here.
Update2: Looking at wolfrevokcats's reply, I see he made a valid observation and his solution is the safer one.

How can I calculate time difference from file in Perl

This is my first post, so please bear with me. I want to calculate the time difference from an ssh log file which has a format like this:
Jan 10 hr:min:sec Failed password for invalid user root from "ip" port xxx ssh2
Jan 10 hr:min:sec sshd[]: User root from "ip" not allowed because none of user's groups are listed in AllowGroups
The script should alert when a user fails to log in x times within 10 minutes. Can anyone please teach me how to do this?
Thanks!!
Your specification is a little ambiguous - presumably there are going to be more than two lines in your log file - do you want the time difference between successive lines? Do you want a line parsed searching for a keyword (such as "failed login") and then the time difference to a different line similarly parsed?
Since I can't tell from what you've provided, I'm simply going to presume that there are two lines in a file which have a date at the start of each line and you want the time difference between those dates. You can then manipulate to do what you want. Alternatively, add to your question and define exactly what a "failed login" is.
There are many ways to skin this cat but I prefer the strptime function from DateTime::Format::Strptime which is described as;
This module implements most of strptime(3), the POSIX function that is the reverse of strftime(3), for DateTime. While strftime takes a DateTime and a pattern and returns a string, strptime takes a string and a pattern and returns the DateTime object associated.
This will do as I've described above;
use v5.12;
use DateTime::Format::Strptime;
my $logfile = "ssh.log";
open(my $fh, '<', $logfile) or die "$logfile: $!";
my $strp = DateTime::Format::Strptime->new(
pattern => '%h %d %T ',
time_zone => 'local',
on_error => 'croak',
);
my $dt1 = $strp->parse_datetime(scalar <$fh>);
my $dt2 = $strp->parse_datetime(scalar <$fh>);
my $duration = $dt2->subtract_datetime_absolute($dt1);
say "Number of seconds difference is: ", $duration->seconds ;
#
# Input
# Jan 10 14:03:18 Failed password for invalid user root from "ip" port xxx ssh2
# Jan 10 14:03:22 sshd[]: User root from "ip" not allowed because none of user's groups are listed in AllowGroups
#
# Outputs:
# Number of seconds difference is: 4
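If installing DateTime is not an option, the same two-line difference can be sketched with Time::Piece, which ships with Perl. This is a swapped-in alternative, not the module used above. It assumes both timestamps fall in the same year and ignores time zones, since the log lines carry neither; the sample lines are the two from the comment above:

```perl
use strict;
use warnings;
use Time::Piece;

my @lines = (
    q{Jan 10 14:03:18 Failed password for invalid user root from "ip" port xxx ssh2},
    q{Jan 10 14:03:22 sshd[]: User root from "ip" not allowed because none of user's groups are listed in AllowGroups},
);

# Pull out just the leading timestamp, then parse it.
# With no year in the input, both values get the same default year.
my ($t1, $t2) = map {
    my ($stamp) = /^(\w{3} +\d+ \d\d:\d\d:\d\d)/
        or die "No timestamp in: $_";
    Time::Piece->strptime($stamp, '%b %d %H:%M:%S');
} @lines;

# Subtracting two Time::Piece objects yields a Time::Seconds object.
my $diff = ($t2 - $t1)->seconds;
print "Number of seconds difference is: $diff\n";
```

The timestamp is extracted with a regex first because the rest of the log line is not part of the date format.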
A more comprehensive answer (making even more assumptions) is below:
use v5.12;
use DateTime::Format::Strptime;
use DateTime::Duration;
my $logfile = "ssh.log";
my $maximum_failures_allowed = 3 ;
my $minimum_time_frame = 10 * 60; # in seconds
my $limit = $maximum_failures_allowed - 1 ;
# The following is a list of rules indicating a
# failed login for the username captured in $1
my @rules = (
qr/Failed password for user (\S+) / ,
qr/User (\S+) from [\d\.]+ not allowed/
);
my $strp = DateTime::Format::Strptime->new(
pattern => '%h %d %T ',
time_zone => 'local',
on_error => 'croak',
);
my %usernames ;
open(my $fh, '<', $logfile) or die "$logfile: $!";
while (<$fh>) {
for my $rx (@rules) {
if ( /$rx/ ) {
# rule matched -> login fail for $1. Save log line.
my $user = $1 ;
push @{ $usernames{$user} } , $_ ;
# No point checking other rules...
last ;
}
}
}
close $fh ;
for my $user (keys %usernames) {
my @failed_logins = @{ $usernames{$user} };
# prime the loop; we know there is at least one failed login
my $this_line = shift @failed_logins ;
while ( @failed_logins > $limit ) {
my $other_line = $failed_logins[ $limit ] ;
my $this_time = $strp->parse_datetime($this_line) ;
my $other_time = $strp->parse_datetime($other_line) ;
# this produces a DateTime::Duration object with the difference in seconds
my $time_frame = $other_time->subtract_datetime_absolute( $this_time );
if ($time_frame->seconds < $minimum_time_frame) {
say "User $user had login failures at the following times:" ;
print " $_" for $this_line, @failed_logins[ 0 .. $limit ] ;
# (s)he may have more failures but let's not labour the point
last ;
}
# Here if user had too many failures but within a reasonable time frame
# Continue to move through the array of failures checking time frames
$this_line = shift @failed_logins ;
}
}
exit 0;
Ran on this data;
Jan 10 14:03:18 sshd[15798]: Failed password for user root from "ip" port xxx ssh2
Jan 10 14:03:22 sshd[15798]: User root from 188.124.3.41 not allowed because none of user's groups are listed in AllowGroups
Jan 10 20:31:12 sshd[15798]: Connection from 188.124.3.41 port 32889
Jan 10 20:31:14 sshd[15798]: Failed password for user root from 188.124.3.41 port 32889 ssh2
Jan 10 20:31:14 sshd[29323]: Received disconnect from 188.124.3.41: 11: Bye Bye
Jan 10 22:04:56 sshd[25438]: Connection from 200.54.84.233 port 45196
Jan 10 22:04:58 sshd[25438]: Failed password for user root from 200.54.84.233 port 45196 ssh2
Jan 10 22:04:58 sshd[30487]: Received disconnect from 200.54.84.233: 11: Bye Bye
Jan 10 22:04:59 sshd[21358]: Connection from 200.54.84.233 port 45528
Jan 10 22:05:01 sshd[21358]: Failed password for user root from 200.54.84.233 port 45528 ssh2
Jan 10 22:05:02 sshd[2624]: Received disconnect from 200.54.84.233: 11: Bye Bye
Jan 10 22:05:29 sshd[21358]: Connection from 200.54.84.233 port 45528
Jan 10 22:05:30 sshd[21358]: Failed password for user root from 200.54.84.233 port 45528 ssh2
Jan 10 22:05:33 sshd[2624]: Received disconnect from 200.54.84.233: 11: Bye Bye
Jan 10 22:06:49 sshd[21358]: Connection from 200.54.84.233 port 45528
Jan 10 22:06:51 sshd[21358]: Failed password for user root from 200.54.84.233 port 45528 ssh2
Jan 10 22:06:51 sshd[2624]: Received disconnect from 200.54.84.233: 11: Bye Bye
... it produces this output;
User root had login failures at the following times:
Jan 10 22:04:58 sshd[25438]: Failed password for user root from 200.54.84.233 port 45196 ssh2
Jan 10 22:05:01 sshd[21358]: Failed password for user root from 200.54.84.233 port 45528 ssh2
Jan 10 22:05:30 sshd[21358]: Failed password for user root from 200.54.84.233 port 45528 ssh2
Jan 10 22:06:51 sshd[21358]: Failed password for user root from 200.54.84.233 port 45528 ssh2
Note that the timezone and/or offset is not present in the logfile data - so, there is no way for this script to work correctly on the day you enter or leave "Daylight Savings."

Action Mailer not executing Proc to generate TO field

UP 04/08/2015 : Is it actually possible to use both .deliver_later and a proc ?
My problem :
My email doesn't execute the procedure to generate the TO field, thus sending a bad email on postfix
Postfix log
sudo tail -n 150 /var/log/mail.log
Jul 30 17:39:44 je postfix/pickup[3974]: 0DA531FC2AFA: uid=1030 from=<candidature@myorg.com>
Jul 30 17:39:44 je postfix/cleanup[8430]: 0DA531FC2AFA: message-id=<55ba453fc4fd1_20e02375a9c04882a@myorg.mail>
Jul 30 17:39:44 je postfix/qmgr[3506]: 0DA531FC2AFA: from=<candidature@telecom-etude.com>, size=18915, nrcpt=2 (queue active)
Jul 30 17:39:44 je postfix/error[8522]: 0DA531FC2AFA: to=<Proc:0xbbd2989c@/var/www/intranet_rails_production/releases/20150729170507/app/mailers/mailing_lists_mailer.rb:42>, relay=none, delay=0.41, delays=0.22/0.02/0/0.17, dsn=5.1.3, status=bounced (bad address syntax)
Jul 30 17:39:44 je postfix/error[8522]: 0DA531FC2AFA: to=<#<Proc:0xbbd2989c@/var/www/intranet_rails_production/releases/20150729170507/app/mailers/mailing_lists_mailer.rb:42>>, relay=none, delay=0.52, delays=0.22/0.02/0/0.28, dsn=5.1.3, status=bounced (bad address syntax)
Controller
...
if Rails.env.production? # Using :sendmail as delivery method
MailingListsMailer.action(@mail).deliver_later
else # Typically using :letter_opener or :file as delivery method
MailingListsMailer.action(@mail).deliver_now
...
Mailer
class MailingListsMailer < ActionMailer::Base
def action(message)
format_mail_params(message)
mail(
to: Proc.new {read_emails_file},
...
)
end
private
def read_emails_file
File.read('emails.txt').split('\n')
end
end
Config
config.action_mailer.delivery_method = :sendmail
config.action_mailer.smtp_settings = {
:address => "localhost",
:domain => "myorg.fr"
}
config.action_mailer.default_url_options = { host: 'myorg.fr', protocol: 'https' }
EDIT : I was using a proc as suggested in the ActionMailer Basics #2.3.3
You probably don't need the Proc wrapper at all, try this:
def action(message)
format_mail_params(message)
mail(
to: read_emails_file,
...
)
end
To debug it, do something like this:
def action(message)
format_mail_params(message)
puts "*"*50
emails = read_emails_file
puts emails
puts "*"*50
mail(
to: emails,
...
)
end
Then check the server logs for the message that looks something like:
*******************************************************************
[...]
*******************************************************************
Iterate the Emails
Because you don't want to divulge everyone's email address, and to avoid spam detection, you may want to iterate over the emails array:
def action(message)
format_mail_params(message)
read_emails_file.each do |email|
mail(
to: email,
...
)
end
end

Parsing and normalizing a file containing 2-3 million lines in Perl

I have a log file containing millions (2-4) of lines with special information such as IPs, ports, email IDs, domains, PIDs, etc.
I need to parse and normalize the file so that all of the above special tokens are replaced by constant strings such as IP, PORT, EMAIL, DOMAIN, etc., and I need to report the count of all duplicate lines.
For example, for a file with the content below:
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.3.1 is not reachable
Aug 19 10:22:48 user 10.1.4.1 is not reachable
Aug 19 10:22:48 user 10.1.1.5 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
the normalized output will be:
MONTH DAY TIME user IP is not reachable =======> Count = 14
A log line can contain multiple tokens to be searched and replaced, such as domains and email IDs.
The code I have written below takes 16 minutes for a 10 MB log file (mail server logs).
Is it possible to reduce that time in Perl when you have to parse that many lines with regex and substitution operations?
The code I have written is:
use strict;
use warnings;
use Tie::Hash::Sorted;
use Getopt::Long;
use Regexp::Common qw(net URI Email::Address );
use Email::Address;
my $ignore = 0;
my $threshold = 0;
my $normalize = 0;
GetOptions(
'ignore=s' => \$ignore,
'threshold=i' => \$threshold,
'normalize=i' => \$normalize,
);
my ( %initial_log, %Logs, %final_logs );
my ( $total_lines, $threshold_value );
my $file = shift or die "Usage: $0 FILE\n";
open my $fh, '<', $file or die "Could not open '$file' $!";
#Sort the results according to frequency
my $sort_by_numeric_value = sub {
my $hash = shift;
[ sort { $hash->{$b} <=> $hash->{$a} } keys %$hash ];
};
#Ignore "ignore" number fields from each line
while ( my $line = <$fh> ) {
my $skip_words = $ignore;
chomp $line;
$total_lines++;
if ($ignore) {
my @arr = split( /[\s\t]+/smx, $line );
while ( $skip_words-- != 0 ) { shift @arr; }
my $n_line = join( ' ', @arr );
$line = $n_line;
}
$initial_log{$line}++;
}
close $fh or die "unable to close: $!";
$threshold_value = int( ( $total_lines / 100 ) * $threshold );
tie my %sorted_init_logs, 'Tie::Hash::Sorted',
'Hash' => \%initial_log,
'Sort_Routine' => $sort_by_numeric_value;
%final_logs = %sorted_init_logs;
if ($normalize) {
# Normalize the logs
while ( my ( $line, $count ) = ( each %final_logs ) ) {
$line = normalize($line);
$Logs{$line} += $count;
}
%final_logs = %Logs;
}
tie my %sorted_logs, 'Tie::Hash::Sorted',
'Hash' => \%final_logs,
'Sort_Routine' => $sort_by_numeric_value;
my $reduced_lines = values(%final_logs);
my $reduction = int( 100 - ( ( values(%final_logs) / $total_lines ) * 100 ) );
print("Number of line in the original logs = $total_lines");
print("Number of line in the normalized logs = $reduced_lines");
print("Logs reduced after normalization = $reduction%\n");
# Show the logs below threshold value only
while ( my ( $log, $count ) = ( each %sorted_logs ) ) {
if ( $count >= $threshold_value ) {
printf "%-80s ===========> [%s]\n", $log, $sorted_logs{$log};
}
}
sub normalize {
my $input = shift;
# Remove unwanted charecters
$input =~ s/[()]//smxg;
# Normalize the URI
$input =~ s/$RE{URI}{HTTP}/URI/smxg;
# Normalize the IP Addresses
$input =~ s/$RE{net}{IPv4}/IP/smgx;
$input =~ s/IP(\W+)\d+/IP$1PORT/smxg;
$input =~ s/$RE{net}{IPv4}{hex}/HEX_IP/smxg;
$input =~ s/$RE{net}{IPv4}{bin}/BINARY_IP/smxg;
$input =~ s/\b$RE{net}{MAC}\b/MAC/smxg;
# Normalize the Email Addresses
$input =~ s/(\w+)=$RE{Email}{Address}/$1=EMAIL/smxg;
$input =~ s/$RE{Email}{Address}/EMAIL/smxg;
# Normalize the Domain name
$input =~ s/[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(?:\.[A-Za-z]{2,})/HOSTNAME/smxg;
return $input;
}
Especially if you do not know the exact types of queries you'll need to perform, you would be much better off putting parsed log data into an SQLite database. The following example illustrates this using a temporary database. If you want to run multiple different queries against the same data, parse once, load them up in the database, then query to your heart's content. This ought to be faster than what you are doing right now, but, obviously I haven't measured anything:
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect('dbi:SQLite::memory:', undef, undef,
{
RaiseError => 1,
AutoCommit => 0,
}
);
$dbh->do(q{
CREATE TABLE 'status' (
id integer primary key,
month char(3),
day char(2),
time char(8),
agent varchar(100),
ip char(15),
status varchar(100)
)
});
$dbh->commit;
my @cols = qw(month day time agent ip status);
my $inserter = $dbh->prepare(sprintf
q{INSERT INTO 'status' (%s) VALUES (%s)},
join(',', @cols),
join(',', ('?') x @cols)
);
while (my $line = <DATA>) {
$line =~ s/\s+\z//;
$inserter->execute(split ' ', $line, scalar @cols);
}
$dbh->commit;
my $summarizer = $dbh->prepare(q{
SELECT
month,
day,
time,
agent,
ip,
status,
count(*) as count
FROM status
GROUP BY month, day, time, agent, ip, status
}
);
$summarizer->execute;
my $result = $summarizer->fetchall_arrayref;
print "@$_\n" for @$result;
$dbh->disconnect;
__DATA__
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.3.1 is not reachable
Aug 19 10:22:48 user 10.1.4.1 is not reachable
Aug 19 10:22:48 user 10.1.1.5 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.4 is not reachable
Aug 19 10:22:48 user 10.1.1.1 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Aug 19 10:22:48 user 10.1.1.6 is not reachable
Output:
Aug 19 10:22:48 user 10.1.1.1 is not reachable 4
Aug 19 10:22:48 user 10.1.1.4 is not reachable 3
Aug 19 10:22:48 user 10.1.1.5 is not reachable 1
Aug 19 10:22:48 user 10.1.1.6 is not reachable 5
Aug 19 10:22:48 user 10.1.3.1 is not reachable 1
Aug 19 10:22:48 user 10.1.4.1 is not reachable 1
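If only the normalized counts from the question are needed, a plain hash filled in a single pass may be enough. The sketch below is not a benchmark and makes assumptions: the timestamp and IPv4 patterns are simplified hand-written stand-ins for the Regexp::Common ones, and the three sample lines are taken from the question:

```perl
use strict;
use warnings;

my @lines = (
    'Aug 19 10:22:48 user 10.1.1.1 is not reachable',
    'Aug 19 10:22:48 user 10.1.3.1 is not reachable',
    'Aug 19 10:22:48 user 10.1.1.6 is not reachable',
);

# Compile the patterns once, outside the loop.
my $stamp = qr/^\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d/;
my $ipv4  = qr/\b\d{1,3}(?:\.\d{1,3}){3}\b/;

my %count;
for my $line (@lines) {
    (my $normalized = $line) =~ s/$stamp/MONTH DAY TIME/;
    $normalized =~ s/$ipv4/IP/g;
    $count{$normalized}++;    # identical normalized lines share one key
}

# Most frequent first.
for my $key (sort { $count{$b} <=> $count{$a} } keys %count) {
    printf "%s =======> Count = %d\n", $key, $count{$key};
}
```

Normalizing each line as it is read keeps the hash small, because all lines that differ only in the replaced tokens collapse into one key.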

Need a Regex for parsing Apache files

I need a regex for parsing Apache files
For example:
Here is a portion of a /var/log/httpd/error_log
[Sun Sep 02 03:34:01 2012] [notice] Digest: done
[Sun Sep 02 03:34:01 2012] [notice] Apache/2.2.15 (Unix) DAV/2 mod_ssl/2.2.15 OpenSSL/1.0.0- fips SVN/1.6.11 configured -- resuming normal operations
[Sun Sep 02 03:34:01 2012] [error] avahi_entry_group_add_service_strlst("localhost") failed: Invalid host name
[Sun Sep 02 08:01:14 2012] [error] [client 216.244.73.194] File does not exist: /var/www/html/manager
[Sun Sep 02 11:04:35 2012] [error] [client 58.218.199.250] File does not exist: /var/www/html/proxy
I want a regex that uses space as the delimiter but keeps embedded spaces within a field. The Apache error log format alternates between
[DAY MMM DD HH:MM:SS YYYY] [MSG_TYPE] DESCRIPTOR: MESSAGE
[DAY MMM DD HH:MM:SS YYYY] [MSG_TYPE] [SOURCE IP] ERROR: DETAIL
I created two regexes. The first one is
^(\[[\w:\s]+\]) (\[[\w]+\]) (\[[\w\d.\s]+\])?([\w\s/.(")-]+[\-:]) ([\w/\s]+)$
This one is simple and just matches the contents as they are.
I want something like the following regex, which I created:
(?<=|\s)([\w:\S]+)
This one doesn't give me the desired output because it doesn't include embedded spaces. So I need a regex which groups each field, includes embedded spaces, and uses space as the delimiter. Please help me out with the logic!
my code
void regexparser( CharBuffer cb)
{ try{
Pattern linePattern = Pattern.compile(".*\r?\n");
Pattern csvpat = Pattern.compile( "^\\[([\\w:\\s]+)\\] \\[([\\w]+)\\] (\\[([\\w\\d.\\s]+)\\])?([\\w\\s/.(\")-]+[\\-:]) ([\\w/\\s].+)",Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.MULTILINE);
Matcher lm = linePattern.matcher(cb);
Matcher pm = null;
while(lm.find())
{ //System.out.print("1st loop");
CharSequence cs = lm.group();
if (pm==null)
pm = csvpat.matcher(cs);
else
pm.reset(cs);
while(pm.find())
{ // System.out.println("2nd loop");
//System.out.println(pm.groupCount());
//CharSequence ps = pm.group();
//System.out.print(ps);
if(pm.group(4)==null)
System.out.println(pm.group(1)+" "+pm.group(2)+" "+pm.group(5)+" "+pm.group(6));
else
System.out.println(pm.group(1)+" "+pm.group(2)+" "+pm.group(4)+" "+pm.group(5)+" "+pm.group(6));
I agree that this task should be done with an existing solution to parse Apache logs.
However, if you want to try something out for training purposes, maybe you want to start with this. Instead of parsing everything in one single huge regex, I do it in small steps that are much more readable:
Code
#!/usr/bin/env perl
use strict;
use warnings;
use DateTime::Format::Strptime;
use feature 'say';
# iterate log lines
while (defined(my $line = <DATA>)) {
chomp $line;
# prepare
my %data;
my $strp = DateTime::Format::Strptime->new(
pattern => '%a %b %d %H:%M:%S %Y',
);
# consume date/time
next unless $line =~ s/^\[(\w+ \w+ \d+ \d\d:\d\d:\d\d \d{4})\] //;
$data{date} = $strp->parse_datetime($1);
# consume message type
next unless $line =~ s/^\[(\w+)\] //;
$data{type} = $1;
# "[source ip]" alternative
if ($line =~ s/^\[(\w+) ([\d\.]+)\] //) {
@data{qw(source ip)} = ($1, $2);
# consume "error: detail"
next unless $line =~ s/([^:]+): (.*)//;
@data{qw(error detail)} = ($1, $2);
}
# "descriptor: message" alternative
elsif ($line =~ s/^([^:]+): (.*)//) {
@data{qw(descriptor message)} = ($1, $2);
}
# invalid
else {
next;
}
# something left: invalid
next if length $line;
# parsed ok: output
say "$_: $data{$_}" for keys %data;
say '-' x 40;
}
__DATA__
[Sun Sep 02 03:34:01 2012] [notice] Digest: done
[Sun Sep 02 03:34:01 2012] [notice] Apache/2.2.15 (Unix) DAV/2 mod_ssl/2.2.15 OpenSSL/1.0.0- fips SVN/1.6.11 configured -- resuming normal operations
[Sun Sep 02 03:34:01 2012] [error] avahi_entry_group_add_service_strlst("localhost") failed: Invalid host name
[Sun Sep 02 08:01:14 2012] [error] [client 216.244.73.194] File does not exist: /var/www/html/manager
[Sun Sep 02 11:04:35 2012] [error] [client 58.218.199.250] File does not exist: /var/www/html/proxy
Output
descriptor: Digest
date: 2012-09-02T03:34:01
type: notice
message: done
----------------------------------------
descriptor: avahi_entry_group_add_service_strlst("localhost") failed
date: 2012-09-02T03:34:01
type: error
message: Invalid host name
----------------------------------------
detail: /var/www/html/manager
source: client
ip: 216.244.73.194
date: 2012-09-02T08:01:14
error: File does not exist
type: error
----------------------------------------
detail: /var/www/html/proxy
source: client
ip: 58.218.199.250
date: 2012-09-02T11:04:35
error: File does not exist
type: error
----------------------------------------
Note that according to your format description, the second line is invalid and ignored by the program.
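For comparison, the two layouts from the question can also be captured by a single regex with an optional group, which is closer to what was originally asked for. This is only a sketch: the capture names (date, type, ip, label, rest) are invented here, named captures need Perl 5.10 or newer, and lines that fit neither layout simply fail to match:

```perl
use strict;
use warnings;

my $re = qr{
    ^ \[ (?<date> [^\]]+ ) \] \s                  # bracketed date and time
      \[ (?<type> \w+    ) \] \s                  # bracketed message type
      (?: \[ client \s (?<ip> [\d.]+ ) \] \s )?   # optional bracketed client IP
      (?<label> [^:]+ ) : \s                      # descriptor or error text, up to the colon
      (?<rest>  .*    ) $                         # message or detail, spaces included
}x;

my $line = '[Sun Sep 02 08:01:14 2012] [error] [client 216.244.73.194] File does not exist: /var/www/html/manager';

my %field;
if ($line =~ $re) {
    %field = %+;    # copy the named captures that took part in the match
    print "$_: $field{$_}\n" for sort keys %field;
}
```

Embedded spaces are kept because the fields are delimited by the brackets and the colon rather than by whitespace.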