parse Log File, check for date, report results - regex

I need to take the timestamp printed in the "After ftp connection" line and check whether it happened today.
I have a log file which contains the following:
---------------------------------------------------------------------
Opening connection for file1.dat
---------------------------------------------------------------------
---------------------------------------------------------------------
Before ftp connection -- time is -- Mon Oct 21 04:01:52 CEST 2013
---------------------------------------------------------------------
---------------------------------------------------------------------
After ftp connection -- time is Mon Oct 21 04:02:03 CEST 2013 .
---------------------------------------------------------------------
---------------------------------------------------------------------
Opening connection for file2.dat
---------------------------------------------------------------------
---------------------------------------------------------------------
Before ftp connection -- time is -- Wed Oct 23 04:02:03 CEST 2013
---------------------------------------------------------------------
---------------------------------------------------------------------
After ftp connection -- time is Wed Oct 23 04:02:04 CEST 2013 .
---------------------------------------------------------------------
Desired Output:
INPUT:file1.dat --> FAIL # since it is Oct 21st considering today is Oct 23.
INPUT:file2.dat --> PASS # since it is Oct 23rd.
INPUT:file3.dat --> FAIL # File information does not exist
What I tried so far:
grep "file1.dat\\|Before ftp connection\\|After ftp connection" logfilename
But this returns all the info that matches file1.dat OR Before ftp connection OR After ftp connection. Considering the above sample, I get 5 lines, of which the last 2 are from file2.dat:
Opening connection for file1.dat
Before ftp connection -- time is -- Mon Oct 21 04:01:52 CEST 2013
After ftp connection -- time is Mon Oct 21 04:02:03 CEST 2013 .
Before ftp connection -- time is -- Wed Oct 23 04:02:03 CEST 2013
After ftp connection -- time is Wed Oct 23 04:02:04 CEST 2013 .
I am stuck here. So ideally I need to take Mon Oct 21 04:02:03 CEST 2013, compare it against today's date, and print the result FAIL.

Defining the records correctly makes things a lot easier. With RS= (an empty record separator) awk reads in paragraph mode, treating each blank-line-separated block as one record, so this assumes the groups in the log are separated by blank lines:
$ awk '{print $5,($0~"After.*"d?"PASS":"FAIL")}' d="$(date +'%a %b %d')" RS= file
file1.dat FAIL
file2.dat PASS
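If missing files must be reported too (the file3.dat case), the same paragraph-mode idea can be extended. This is only a sketch, with the same blank-line assumption and the list of expected files supplied by hand:
awk -v files='file1.dat file2.dat file3.dat' -v d="$(date +'%a %b %d')" '
    { res[$5] = ($0 ~ "After.*" d) ? "PASS" : "FAIL" }   # $5 is the file name in each record
    END {
        n = split(files, f)
        for (i = 1; i <= n; i++)
            print "INPUT:" f[i] " --> " ((f[i] in res) ? res[f[i]] : "FAIL")
    }
' RS= file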

Use awk:
# read dates in shell variables
read x m d x x y < <(date)
awk -v f='file2.dat' -v m=$m -v d=$d -v y=$y '
    $0 ~ f { s = 1; next }
    s && /After ftp connection/ {
        res = ($8==m && $9==d && $12==y) ? "PASS" : "FAIL"
        print f, res; exit
    }' file.log
file2.dat PASS
FOLLOW UP by OP:
I achieved the intended results by this:
check_success ()
{
    CHK_DIR=/Archive
    if [[ ! -d ${CHK_DIR} ]]; then
        exit 1
    elif [[ ! -d ${LOG_FOLDER} ]]; then
        exit 1
    fi
    count_of_files=$(ls -al --time-style=+%D $CHK_DIR/*.dat | grep $(date +%D) | cut -f1 | awk '{ print $7 }' | wc -l)
    if [[ $count_of_files -lt 1 ]]; then
        exit 2
    fi
    list_of_files=$(ls -al --time-style=+%D $CHK_DIR/*.dat | grep $(date +%D) | cut -f1 | awk '{ print $7 }')
    for filename in $list_of_files
    do
        filename=$(basename $filename)
        lg_name=$(grep -El "Opening.*$filename" $LOG_FOLDER/* | head -1)
        m=$(date +%b)
        d=$(date +%d)
        y=$(date +%Y)
        output=$(awk -v f=$filename -v m=$m -v d=$d -v y=$y '$0 ~ f {s=1; next} s && /After ftp connection/ { res = ($8==m && $9==d && $12==y) ? "0" : "1"; print res; exit }' $lg_name)
        if [[ ${output} != 0 ]]; then
            exit 2
        fi
    done
    exit 0
}
I used Anubhava's snippet. Thanks to all three champs.

It was tricky!
$ awk -vtoday=$(date "+%Y%m%d") '
    /^Opening/ { file = $4 }
    /^After ftp connection/ {
        $1=$2=$3=$4=$5=$6=$NF=""
        r = "date -d \"" $0 "\" \"+%Y%m%d\""; r | getline dat; close(r)
        if (today==dat) { print file, "PASS" }
        else            { print file, "FAIL" }
    }
' file
file1.dat FAIL
file2.dat PASS
Explanation
-vtoday=$(date "+%Y%m%d") gives today's date with "20131023" format
/^Opening/ {file=$4} matches lines starting with Opening and stores the filename, which happens to be in the 4th field.
/^After ftp connection/ on lines starting with "After ftp connection...", do:
$1=$2=$3=$4=$5=$6=$NF="" clears fields 1 to 6 and the last one, so what remains is the date text.
r="date -d \"" $0 "\" \"+%Y%m%d\""; r | getline dat runs date -d on that text to get the date of the line in YYYYMMDD format (close(r) cleans up each spawned command).
if (today==dat) {print file, "PASS"} compares the dates.
else {print file, "FAIL"} idem.
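Spawning a date process for every matching line gets slow on big logs. As a sketch, the same check can be done in pure awk by mapping the month name with index(), so no external command is needed:
awk -v today="$(date +%Y%m%d)" '
    /^Opening/ { file = $4 }
    /^After ftp connection/ {
        # fields: After ftp connection -- time is Mon Oct 21 04:02:03 CEST 2013 .
        m = sprintf("%02d", (index("JanFebMarAprMayJunJulAugSepOctNovDec", $8) + 2) / 3)
        print file, (sprintf("%04d%02d%02d", $12, m, $9) == today ? "PASS" : "FAIL")
    }
' file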

Related

Howto grep over months with defined start and end date

So here's my problem: I have big log files and want a script that greps certain periods of time and saves them to a file (sorted). Basically,
bash script.sh Jul 4 Sep 30
will return for example
Sep 30 user0 logged in
Sep 15 user1 logged in
Aug 6 user0 logged in
Aug 3 user1 logged in
Jul 28 user2 logged in
Jul 27 user2 logged in
Jul 4 user0 logged in
My first attempt was to give every month and date its own variable, like
bash script.sh Jul 4 Sep 30
so I can use $1 for the start month (July), $2 for the start date (4) and so on in grep, like
for logs in logs*
do
grep -qEe "^\"$1\" [\"$2\"-9]\s" $messages >> result.txt
done
to get all logs from July 4 to 9, but I don't know how to get logs for a time period that spans months, or for day ranges other than simple ones like 1-9 or 10-19, and so on.
Any help greatly appreciated!
EDIT:
As some people asked, here's how my log files look like (just much bigger and not sorted):
Sep 30 user0 logged in
Jul 27 user2 logged in
Aug 6 user0 logged in
Aug 31 user1 logged in
Jul 8 user2 logged in
Sep 5 user1 logged in
Jul 27 user2 logged in
Jul 14 user0 logged in
[...]
Here's my take:
#!/bin/bash
year="$(date +"%Y")"
start="$(date -d"$1 $2, $year" +'%s')"
end="$(($(date -d"$3 $4, $year" +'%s')+86400))"
for log in logs*; do
    while IFS= read -r line; do
        d="$(date -d"$(cut -d' ' -f1,2 <<< "$line"), $year" +'%s')"
        if (( $start <= $d && $d < $end )); then
            echo "$line"
        fi
    done < "$log"
done
You run it like this: ./script.sh Jul 04 Sep 03. Since no year is included in the logs, it assumes that all dates (including the ones on the command line) are in the current year. It's probably not the most optimal solution, but it works. It relies on date, which it repeatedly calls to parse dates into Unix timestamps. Unix timestamps are nice because they are just numbers and can therefore be used in numeric comparisons, as below.
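In case the trick is unfamiliar, here is the comparison in isolation (the year is illustrative):
start="$(date -d'Jul 4, 2015' +%s)"   # seconds since the epoch
d="$(date -d'Aug 6, 2015' +%s)"
end="$(date -d'Sep 30, 2015' +%s)"
(( start <= d && d <= end )) && echo "Aug 6 is in range"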
$ range="Jul 4 Sep 30"
$ awk -v range="$range" '
    BEGIN {
        numMths = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m)
        for (i in m) {
            mths[m[i]] = i
        }
        split(range, r)
        beg = sprintf("%02d%02d", mths[r[1]], r[2])
        end = sprintf("%02d%02d", mths[r[3]], r[4])
    }
    { cur = sprintf("%02d%02d", mths[$1], $2) }
    (cur >= beg) && (cur <= end) { vals[$1,$2] = $0 }
    END {
        for (mthNr=numMths; mthNr>0; mthNr--) {
            for (dayNr=31; dayNr>0; dayNr--) {
                date = m[mthNr] SUBSEP dayNr
                if (date in vals) {
                    print vals[date]
                }
            }
        }
    }
' file
Sep 30 user0 logged in
Sep 5 user1 logged in
Aug 31 user1 logged in
Aug 6 user0 logged in
Jul 27 user2 logged in
Jul 14 user0 logged in
Jul 8 user2 logged in

How to filter logs easily with awk?

Suppose I have a log file mylog like this:
[01/Oct/2015:16:12:56 +0200] error number 1
[01/Oct/2015:17:12:56 +0200] error number 2
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
[01/Nov/2015:01:02:00 +0200] error number 9
[01/Jan/2016:01:02:00 +0200] error number 10
And I want to find those lines that occur between 1 Oct at 18.00 and 1 Nov at 1.00. That is, the expected output would be:
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
I have managed to convert the times to timestamps by using match() and then mktime(). The first finds the specified pattern, which is stored in the array a[] so it can be accessed (see glenn jackman's answer on accessing a captured group from a line pattern for a good example). Since mktime requires the format YYYY MM DD HH MM SS[ DST], I also have to convert the month from the form Xxx into a digit, for which I use an answer by Ed Morton to "convert month from Aaa to xx": awk '{printf "%02d\n",(match("JanFebMarAprMayJunJulAugSepOctNovDec",$0)+2)/3}'.
All together, finally I have the timestamp in the variable mytimestamp:
awk 'match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {
    day=a[1]; month=a[2]; year=a[3];
    hour=a[4]; min=a[5]; sec=a[6]; utc=a[7];
    month=sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3);
    mydate=sprintf("%s %s %s %s %s %s %s", year,month,day,hour,min,sec,utc);
    mytimestamp=mktime(mydate)
    print mytimestamp
}' mylog
Returns:
1443708776
1443712376
1443715676
etc.
So now I am ready to compare against the given dates. Since awk takes a lot of work to handle such a format, I prefer to provide them through external shell variables, using date -d"my date" +"%s" to print the timestamps:
start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")"
end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")"
All together, this works:
awk -v start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")" -v end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")" 'match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {day=a[1]; month=a[2]; year=a[3]; hour=a[4]; min=a[5]; sec=a[6]; utc=a[7]; month=sprintf("%02d",(match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3); mydate=sprintf("%s %s %s %s %s %s %s", year,month,day,hour,min,sec,utc); mytimestamp=mktime(mydate); if (start<=mytimestamp && mytimestamp<=end) print}' mylog
[01/Oct/2015:18:07:56 +0200] error number 3
[01/Oct/2015:18:12:56 +0200] error number 4
[02/Oct/2015:16:12:56 +0200] error number 5
[10/Oct/2015:16:12:58 +0200] error number 6
[10/Oct/2015:16:13:00 +0200] error number 7
[01/Nov/2015:00:10:00 +0200] error number 8
However, this seems to be quite a bit of work for something that should be more straightforward. Nonetheless, the introduction to the "Time functions" section in man gawk says:
Since one of the primary uses of AWK programs is processing log files
that contain time stamp information, gawk provides the following
functions for obtaining time stamps and formatting them.
So I wonder: is there any better way to do this? For example, what if the format, instead of dd/Mmm/YYYY:HH:MM:ss, were something like dd Mmm YYYY HH:MM:ss? Couldn't the match pattern be provided externally instead of having to change it every time this happens? Do I really have to use match() and then process that output to feed mktime()? Doesn't gawk provide a simpler way to do this?
Use ISO 8601 time format!
However, this seems to be quite a bit of work for something that should be more straight forward.
Yes, this should be straightforward, and the reason why it is not, is because the logs do not use ISO 8601. Application logs should use ISO format and UTC to display times, other settings should be considered broken and fixed.
Your request should be split into two parts. The first canonicalises the logs, converting dates to the ISO format; the second performs the search:
awk '
match($0, /([0-9]+)\/([A-Z][a-z]{2})\/([0-9]{4}):([0-9]{1,2}):([0-9]{1,2}):([0-9]{1,2}) ([+-][0-9]{4})/, a) {
    day=a[1]
    month=a[2]
    year=a[3]
    hour=a[4]
    min=a[5]
    sec=a[6]
    utc=a[7]
    month=sprintf("%02d", (match("JanFebMarAprMayJunJulAugSepOctNovDec",month)+2)/3)
    myisodate=sprintf("%04d-%02d-%02dT%02d:%02d:%02d%s", year,month,day,hour,min,sec,utc)
    $1 = myisodate
    print
}' mylog
The nice thing about ISO 8601 dates – besides them being a standard – is that chronological order coincides with lexicographic order; therefore, you can use the /…/,/…/ operator to extract the dates you are interested in. For instance, to find what happened between 1 Oct 2015 18:00 +0200 and 1 Nov 2015 01:00 +0200, append the following filter to the previous, standardising filter:
awk '/2015-10-01T18:00:00\+0200/,/2015-11-01T01:00:00\+0200/'
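Since the /…/,/…/ range only triggers if lines matching those exact boundary timestamps actually exist, a plain string comparison on the rewritten field is a more robust sketch of the same idea (canonise.awk is a hypothetical file holding the standardising filter above):
awk -f canonise.awk mylog |
awk -v beg='2015-10-01T18:00:00' -v end='2015-11-01T01:00:00' '$1 >= beg && $1 <= end'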
Without getting into time formats (assuming all records are formatted the same), you can use a sort | awk combination to achieve the same with ease.
This assumes the logs are not ordered. Based on your format, it uses sort's month-sort option (M) and awk to pick out the range of interest. The sort keys are year, month, and day, in that order.
$ sort -k1.9,1.12 -k1.5,1.7M -k1.2,1.3 log | awk '/01\/Oct\/2015/,/01\/Nov\/2015/'
You can easily extend it to include the time as well, and drop the sort if the file is already sorted.
The following has the time constraint as well
awk -F: '/01\/Oct\/2015/ && $2>=18{p=1}
/01\/Nov\/2015/ && $2>=1 {p=0} p'
I would use the date command inside awk to achieve this, though I have no idea how it would perform on large log files.
awk -F "[][]" -v start="$(date -d"1 Oct 2015 18:00 +0200" +"%s")"
-v end="$(date -d"1 Nov 2015 01:00 +0200" +"%s")" '{
gsub(/\//,"-",$2);sub(/:/," ",$2);
cmd="date -d\""$2"\" +%s" ;
cmd|getline mytimestamp;
close(cmd);
if (start<=mytimestamp && mytimestamp<=end) print
}' mylog

Use of uninitialized value $1 in addition - Perl

I am writing a program that somewhat mimics the last command in UNIX, and I am trying to use backreferencing in my solution. My program does exactly what it is supposed to do but I get a run time error/warning. My question is why is this error/warning coming up and how can I fix an issue like this?
If you need more information I can provide.
Program Execution
./last dodoherty
OUTPUT
Here is a listing of the logins for dodoherty:
1. dodohert pts/1 pc-618-012.omhq. Wed Feb 8 09:19 still logged in
2. dodohert pts/6 ip98-168-203-118 Tue Feb 7 19:19 - 20:50 (01:31)
3. dodohert pts/3 137.48.207.178 Tue Feb 7 14:00 - 15:06 (01:05)
4. dodohert pts/1 137.48.219.250 Tue Feb 7 12:32 - 12:36 (00:04)
5. dodohert pts/21 137.48.207.237 Tue Feb 7 12:07 - 12:23 (00:16)
6. dodohert pts/11 ip98-168-203-118 Mon Feb 6 20:50 - 23:29 (02:39)
7. dodohert pts/9 ip98-168-203-118 Mon Feb 6 20:31 - 22:57 (02:26)
8. dodohert pts/5 pc-618-012.omhq. Fri Feb 3 10:24 - 10:30 (00:05)
Use of uninitialized value $1 in addition (+) at ./odoherty_last.pl line 43.
Use of uninitialized value $2 in addition (+) at ./odoherty_last.pl line 44.
Here is a summary of the time spent on the system for dodoherty:
dodoherty
8
8:6
The code (a snippet of where the error comes from; this is also the only place $1 and $2 are used):
foreach my $line2 (@user)
{
    $line2 =~ /\S*\((\d{2,2})\:(\d{2,2})\)\s*/;
    $hours = $hours + $1;
    $mins = $mins + $2;
    if( $mins >= 60 )
    {
        $hours = $hours + 1;
        $mins = $mins - 60;
    }
}
I think the problem might be in the following line.
1. dodohert pts/1 pc-618-012.omhq. Wed Feb 8 09:19 still logged in
That is because nothing matches the pattern so $1 and $2 are undefined.
As has been noted in other answers, your regex does not match, and therefore $1 and $2 are undefined. It is necessary to always check to make sure the appropriate regex matches before using these variables.
Below I have upgraded your script with some proper Perl code. += and %= are handy operators in this case. You can read about them in perlop.
Your regex uses \S* and \s*, both of which are completely unnecessary here, since your regex is not anchored to anything else. In other words, \S*foo\s* will match any string that contains foo, since it can match the empty string around foo. Also, {2,2} means "match at least 2 times, max 2", which in effect is the same as {2} "match 2 times".
You will see that I changed your math around, and that is because it assumes that $mins will never be higher than 120. I suppose technically, that is a safe assumption, but doing it like below, it can handle all values of minutes and successfully turn them into hours.
The script below is for demonstration. If you remove DATA and leave <>, you can use this script as-is like so:
last user | perl script.pl
Code:
use strict;
use warnings;
use v5.10; # required for say()
my ($hours, $mins);
while (<DATA>) { # replace with while (<>) for live usage
if (/\((\d{2})\:(\d{2})\)/) {
$hours += $1;
$mins += $2;
if( $mins >= 60 ) {
$hours += int ($mins / 60); # take integer part of division
$mins %= 60; # remove excess minutes
}
}
}
say "Hours: $hours";
say "Mins : $mins";
__DATA__
1. dodohert pts/1 pc-618-012.omhq. Wed Feb 8 09:19 still logged in
2. dodohert pts/6 ip98-168-203-118 Tue Feb 7 19:19 - 20:50 (01:31)
3. dodohert pts/3 137.48.207.178 Tue Feb 7 14:00 - 15:06 (01:05)
4. dodohert pts/1 137.48.219.250 Tue Feb 7 12:32 - 12:36 (00:04)
5. dodohert pts/21 137.48.207.237 Tue Feb 7 12:07 - 12:23 (00:16)
6. dodohert pts/11 ip98-168-203-118 Mon Feb 6 20:50 - 23:29 (02:39)
7. dodohert pts/9 ip98-168-203-118 Mon Feb 6 20:31 - 22:57 (02:26)
8. dodohert pts/5 pc-618-012.omhq. Fri Feb 3 10:24 - 10:30 (00:05)
#!/usr/bin/perl
use strict;

my $hours = 0;
my $mins = 0;
my $loggedIn = 0;
while (<STDIN>)
{
    chomp;
    if (/\S*\((\d{2,2})\:(\d{2,2})\)\s*/)
    {
        $hours = $hours + $1;
        $mins = $mins + $2;
        if ($mins >= 60)
        {
            $hours = $hours + 1;
            $mins = $mins - 60;
        }
    }
    elsif (/still logged in$/)
    {
        $loggedIn = 1;
    }
}
print "Summary: $hours:$mins ", ($loggedIn) ? " (Currently logged in)" : "", "\n";
Whenever your RE fails to match, $1 and $2 are not reset: they are undefined if nothing has matched yet, and otherwise keep stale values from the last successful match.
For this reason, it's considered best practice to only ever use $1, $2, etc. inside a conditional which tests the success of the RE.
So don't do:
$string =~ m/(somepattern)/sx;
my $var = $1;
But instead to do something like:
my $var = 'some_default_value';
if ($string =~ m/(somepattern)/sx) {
    $var = $1;
}
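A related idiom (just a sketch, with a made-up input string): capturing in list context returns an empty list on failure, so the test and the assignment collapse into one step:
perl -we '
    my $string = "dodohert pts/6 ... 19:19 - 20:50 (01:31)";
    if ( my ($h, $m) = $string =~ /\((\d{2}):(\d{2})\)/ ) {
        print "hours=$h mins=$m\n";
    } else {
        warn "no match, nothing captured\n";
    }
'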

parsing ns2 trace file [closed]

I'm using NS 2.35 and am trying to determine the end-to-end delay of my routing algorithm.
I think anyone with some good scripting experience should be able to answer this question, sadly that person is not me.
I have a trace file, that looks something like this:
- -t 0.548 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1052 -a 0 -x {2.0 17.0 6 ------- null}
h -t 0.548 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1052 -a 0 -x {2.0 17.0 -1 ------- null}
+ -t 0.55 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1056 -a 0 -x {2.0 17.0 10 ------- null}
+ -t 0.555 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1057 -a 0 -x {2.0 17.0 11 ------- null}
r -t 0.556 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1047 -a 0 -x {2.0 17.0 1 ------- null}
+ -t 0.556 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1047 -a 0 -x {2.0 17.0 1 ------- null}
- -t 0.556 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1047 -a 0 -x {2.0 17.0 1 ------- null}
But here is what I need to do.
A line that starts with + is when a new packet is added to the network.
A line starting with r is when a packet has been received by the destination. The double-typed number after the -t is the time at which that event happened. And finally, after the -i is the identity of the packet.
For me to calculate average end-to-end delay, I need to find every line that has a certain id after the -i. from there I need to calculate the timestamp of the r minus the timestamp of the +
So I figure there could be a regular expression separated by spaces. I could put each of the segments into its own variable. Then I would check the 15th (the packet ID).
But I'm not sure where to go from there, or how to put it all together.
I know there are some AWK scripts on the web for doing this, but they are all outdated and don't fit the current format (and I'm not sure how to change them).
Any help would be greatly appreciated.
EDIT:
Here is an example of a full packet route that I'm looking to find.
I've taken out a lot of lines in between these ones, so that you can see a single packets events.
# a packet is enqueued from node 2 going to node 7. Its ID is 1636. This was at roughly 1.75 sec
+ -t 1.74499999999998 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# at 2.1s, it left node 2.
- -t 2.134 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# at 2.134 it hopped from 2 to 7 (not important)
h -t 2.134 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 -1 ------- null}
# at 2.182 it was received by node 7
r -t 2.182 -s 2 -d 7 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# it was the enqueued by node 7 to be sent to node 12
+ -t 2.182 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# slightly later it left node 7 on its was to node 12
- -t 2.1832 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# it hopped from 7 to 12 (not important)
h -t 2.1832 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 -1 ------- null}
# received by 12
r -t 2.2312 -s 7 -d 12 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# added to queue, heading to node 17
+ -t 2.2312 -s 12 -d 17 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# left for node 17
- -t 2.232 -s 12 -d 17 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
# hopped to 17 (not important)
h -t 2.232 -s 12 -d 17 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 -1 ------- null}
# received by 17 notice the time delay
r -t 2.28 -s 12 -d 17 -p cbr -e 500 -c 0 -i 1636 -a 0 -x {2.0 17.0 249 ------- null}
The ideal output of the script would recognize 2.134 as the start time, and 2.28 as the end, and then give me the delay of 0.146sec. It would do this for all packet IDs and only report the average.
It was requested that I expand a bit on how the file works, and what I am expecting.
The file is listing descriptions of about 10,000 packets. Each packet can be in a different state. The important states are + which means a packet has been enqueued at a router, and r which means the packet has been received by its destination.
It is possible that a packet that is enqueued (so a + entry) is not actually received and is instead dropped. This means we cannot assume that for every + entry there will be a r entry.
What I'm trying to measure is the average end to end delay. What this means, is that if you look at a single packet, it will have a time it was enqueued, and a time it was received. I need to make this calculation to find its end-to-end delay. But I also need to do it for 9,999 other packets to get an average.
I've thought about it more, and here's generally how I think the algorithm needs to work:
remove all lines that don't begin with a + or an r because they are unimportant.
go through all of the packet IDs (that is the numbers after -i, such as 1052 in the example), and put them into some sort of groups (multiple arrays perhaps).
each group should now contain all of the information about a particular packet.
inside the group, check if there is a +, ideally we want the very first +. Record its time.
look for any more + lines. Look at their time. It's possible the log is slightly jumbled, so it's possible there is a + line later on that is actually earlier in the simulation.
If this new + line has an earlier time, then update the time variable with that.
assuming there are no more + lines, look for an r line.
if there is no r line, the packet was dropped so don't worry about it.
for every r line you find, all we need to do is find the one that has the latest timestamp
The r line with the latest timestamp is where the packet was finally received.
subtract the + time from the r time, this gives us the time it took for the packet to travel.
Add this value to an array so that later it can be averaged.
repeat this process on every packet ID group, and then finally average the created array of delays.
That's a lot of typing, but I think it's as clear as I can be about what I want. I wish I were a regex master, but I just don't have time to learn it well enough to pull this off.
Thanks for all your help, and let me know if you have any questions.
There's not much to work with here, as Iain said in the comments to your question, but if I understand what you want to do correctly, something like this should work:
awk '/^[+r]/{$1~/r/?r[$15]=$3:r[$15]?d[$15]=r[$15]-$3:1} END {for(p in d){sum+=d[p];num++}print sum/num}' trace.file
It skips all lines not starting with '+' or 'r'. If the line starts with 'r' it adds time to the r array. Otherwise, it calculates the delay and adds it to the d array if the element is found in the r array. Finally it loops over the elements in the d array, adds up the total delay and number of elements and calculates the average from this. In your case the average is 0.
The :1 at the end of the main block is just in there so I can get away with a ternary expression instead of the significantly more verbose if statement.
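For readability, here is the same logic unrolled into an awk file (a sketch; in this trace format $3 is the time and $15 the packet id):
/^[+r]/ {
    if ($1 ~ /r/)
        r[$15] = $3                # record receive time for this packet id
    else if (r[$15])
        d[$15] = r[$15] - $3       # delay = received - enqueued
}
END {
    for (p in d) { sum += d[p]; num++ }
    print sum / num
}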
EDIT: New expression to work with the added conditions:
awk '/^[+r]/{$1~/r/?$3>r[$15]?r[$15]=$3:1:!a[$15]||$3<a[$15]?a[$15]=$3:1} END {for(i in r){sum+=r[i]-a[i];num++}print "Average delay", sum/num}'
or as an awk-file
/^[+r]/ {
    if ($1 ~ /r/) {
        if ($3 > received[$15])
            received[$15] = $3;
    } else {
        if (!added[$15] || $3 < added[$15])
            added[$15] = $3;
    }
}
END {
    for (packet in received) {
        sum += received[packet] - added[packet];
        num++
    }
    print "Average delay", sum/num
}
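Assuming the file above is saved as delay.awk (the name is illustrative), run it as:
awk -f delay.awk trace.file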
According to your algorithm it seems like 1.745 would be the start time, while you write that 2.134 is.

perl how to regex parts of data instead of entire string and then print out a csv file

I have a working Perl script that grabs the data I need and displays it to STDOUT, but now I need to change it to generate a data file (CSV, tab-delimited, any delimited file).
The regular expression is filtering the data that I need, but I don't want the entire string, just snippets of the output. I'm assuming I would need to store this in another variable to create my output file.
I need a good example of this or suggestions to alter this code. Thank you in advance. :-)
Here's my code:
#!/usr/bin/perl -w
# Usage: ./bakstatinfo.pl Jul 28 2010 /var/log/mybackup.log <server1> <server2>
use strict;
use warnings;

# This piece added to view the arguments passed in
$" = "][";
print "===================================================================================\n";
print "[@ARGV]\n";

# Declare variables
my ($mon,$day,$year,$file,$server) = @ARGV;
my $regex_flag = 0;
splice(@ARGV, 0, 4, ());
foreach my $server ( @ARGV ) {   # foreach will take Xn of server entries and add to the loop
    print "===================================================================================\n";
    print "REPORTING SUMMARY for SERVER : $server\n";
    open(my $fh,"ssh $server cat $file |") or die "can't open log $server:$file: $!\n";
    while (my $line = <$fh>) {
        if ($line =~ m/.* $mon $day \d{2}:\d{2}:\d{2} $year:.*(ERROR:|backup-date=|backup-size=|backup-time=|backup-status)/) {
            print $line;
            $regex_flag=1;   # set to true
        }
    }
    if ($regex_flag==0) {
        print "NOTHING TO REPORT FOR $server: $mon $day $year \n";
    }
    $regex_flag=0;
    close($fh);
}
Sample raw log file I am using: (recently added to provide better representation of log)
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:ERROR: mybak-abc appears to be already running for this backupset
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:ERROR: If you are sure mybak-abc is not running, please remove the file /etc/mybak-abc/test202.bak_lvm/.mybak-abc.pid and restart mybak-abc
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:INFO: PHASE START: Cleanup
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:INFO: PHASE END: Cleanup
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:INFO: END OF BACKUP
Wed Jul 28 00:00:04 2010: db9.abc.bak:backup:INFO: START OF BACKUP
Wed Jul 28 00:00:04 2010: db9.abc.bak:backup:INFO: PHASE START: Initialization
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:WARNING: Binary logging is off.
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: License check successful
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: License check successful for lvm-snapshot.pl
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-set=db9.abc.bak
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-date=20100728000004
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: SQL-server-os=Linux/Unix
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-type=regular
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: host=db9.abc.bak.test.com
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-date-epoch=1280300404
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: retention-policy=3D
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: mybak-abc-version=ABC for SQL Enterprise Edition - version 3.1
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: SQL-version=5.1.32-test-SMP-log
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-directory=/home/backups/db9.abc.bak/20100728000004
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-level=0
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-mode=raw
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE END: Initialization
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE START: Running pre backup plugin
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE START: Flushing logs
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE END: Flushing logs
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE START: Creating snapshot based backup
Wed Jul 28 00:00:11 2010: db9.abc.bak:backup:INFO: Wed Jul 28 00:49:53 2010: test203.bak_lvm:backup:INFO: raw-databases-snapshot=test SQL sgl
Wed Jul 28 00:49:53 2010: test203.bak_lvm:backup:INFO: PHASE END: Creating snapshot based backup
Wed Jul 28 00:49:53 2010: test203.bak_lvm:backup:INFO: PHASE START: Calculating backup size & checksums
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: last-backup=/home/backups/test203.bak_lvm/20100726200004
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: PHASE END: Calculating backup size & checksums
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: read-locks-time=00:00:05
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: flush-logs-time=00:00:00
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
My working output now:
===================================================================================
[Jul][28][2010][/var/log/mybackup.log][server1]
===================================================================================
REPORTING SUMMARY for SERVER : server1
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
The output I need to see would be something like this (a data file separated by ';', for example):
MyDate=Wed Jul 28;MyBackupSet= test203.bak_lvm;MyBackupSize=187.24 GB;MyBackupTime=04:49:51;MyBackupStat=Backup succeeded
Use 'capturing parentheses' to identify the bits you want to deal with.
if ($line =~ m/(.* $mon $day) \d{2}:\d{2}:\d{2} $year:.*
(ERROR:|backup-date=|backup-size=|
backup-time=|backup-status)/x) {
You will need to do some surgery on the second set of parentheses - those surrounding the start of the various keywords. You may have to chop those out in bits and pieces inside the condition.
When you have all the data extracted into variables, use Text::CSV to handle CSV output (and input).
There are myriad modules to handle HTML or XML (over 2000, and I think over 3000, with HTML in their name - I happened to look yesterday). Many of those won't be applicable, but CPAN is your friend.
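As a sketch of the Text::CSV suggestion (the field values and output file name are invented for illustration):
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, sep_char => ';', eol => "\n" })
    or die Text::CSV->error_diag;
open my $out, '>', 'report.csv' or die "cannot open report.csv: $!";
# one row per backup set, in the order MyDate;MyBackupSet;MyBackupSize;MyBackupTime;MyBackupStat
$csv->print($out, [ 'Wed Jul 28', 'test203.bak_lvm', '417.32 GB', '04:49:51', 'Backup succeeded' ]);
close $out;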
Answering questions posed by comments
Would I split them off into separate variables as well? The first part gives me the date/time that I need. The next filter then gives me 1) Error: 2)backup-date= 3)backup-size= ...etc.
More or less. Unfortunately, you don't show some representative input lines, which means it is hard to tell what might be best. However, it seems likely that a scheme such as:
while (my $line = <$fh>)
{
    chomp $line;
    if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year:/)
    {
        my $date = $1;
        my %items = ();
        $line =~ s/.* $mon $day \d\d:\d\d:\d\d $year://;
        while ($line =~ m/(ERROR|backup-date|backup-size|
                           backup-time|backup-status)
                          [:=]([^:]+)/x)
        {
            my $key = $1;
            my $val = $2;
            $items{$key} = $val;
            $line =~ s/$key[:=]$val[:=]?//;
        }
        # The %items hash contains the split-out information.
        # Now write the data for this line of the log file.
    }
}
There might well be better ways to handle the trimming (but it is Perl so TMTOWTDI), but the basic idea here is to catch the lines that are interesting, then progressively chop the bits of interest out of the line, so the line grows shorter on each iteration (therefore, eventually terminating the inner while loop).
Note the use of the /x modifier to allow for a more readable regex split over lines (I edited the original answer version to use that too). I've also allowed 'ERROR' to be followed by an '=' or the other keywords to be followed by ':'; it seems unlikely that you'd get false matches that way, and it simplifies the regex substitute operations. The initial pattern match no longer requires one of the subsections to be present, either. You must judge for yourself whether those small changes (which might pick up non-conforming information) matter or not. For most of my purposes, the chance of the mismatch is small enough not to be an issue - but for legal reasons, it might not be acceptable to you.
Answering questions posed by 'answer'
I manufactured some data:
Wed Jul 30 00:49:51 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
Wed Jul 30 00:49:52 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
Wed Jul 30 00:49:53 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
Wed Jul 30 00:49:51 2010: backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded
I took the script in the answer and hacked and instrumented it - making it standalone.
I also removed the dependency on specific files - it reads standard input and writes to standard output. It makes my testing easier - and the code more flexible.
use strict;
use warnings;
use constant debug => 0;

my $mon = 'Jul';
my $day = 30;
my $year = 2010;
while (my $line = <>)
{
    chomp $line;
    print "Line: $line\n" if debug;
    if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year:/)   # Mon Jul 26 22:00:02 2010:
    {
        print "### Scan\n";
        my $date = $1;
        print "$date\n";
        my %items = ();
        $line =~ s/.* $mon $day \d\d:\d\d:\d\d $year://;
        print "Line: $line\n" if debug;
        while ($line =~ m/(ERROR|backup-date|backup-size|backup-time|backup-status)[:=]([^:]+)/)
        {
            my $key = $1;
            my $val = $2;
            $items{$key} = $val;
            $line =~ s/$key[:=]$val[:=]?//;
            print "$key=$val\n";
            print "Line: $line\n" if debug;
        }
        print "### Verify\n";
        for my $key (sort keys %items)
        {
            print "$key = $items{$key}\n";
        }
    }
}
The output I get is:
### Scan
Wed Jul 30
backup-size=417.32 GB
### Verify
backup-size = 417.32 GB
### Scan
Wed Jul 30
backup-time=04
### Verify
backup-time = 04
### Scan
Wed Jul 30
backup-status=Backup succeeded
### Verify
backup-status = Backup succeeded
### Scan
Wed Jul 30
backup-size=417.32 GB
backup-time=04
backup-status=Backup succeeded
### Verify
backup-size = 417.32 GB
backup-status = Backup succeeded
backup-time = 04
The verify loop prints out the data from the '%items' hash quite happily. With the debug value set to 1 instead of 0, the output I get is:
Line: Wed Jul 30 00:49:51 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
### Scan
Wed Jul 30
Line: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
backup-size=417.32 GB
Line: test203.bak_lvm:backup:INFO:
### Verify
backup-size = 417.32 GB
Line: Wed Jul 30 00:49:52 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
### Scan
Wed Jul 30
Line: test203.bak_lvm:backup:INFO: backup-time=04:49:51
backup-time=04
Line: test203.bak_lvm:backup:INFO: 49:51
### Verify
backup-time = 04
Line: Wed Jul 30 00:49:53 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
### Scan
Wed Jul 30
Line: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
backup-status=Backup succeeded
Line: test203.bak_lvm:backup:INFO:
### Verify
backup-status = Backup succeeded
Line: Wed Jul 30 00:49:51 2010: backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded
### Scan
Wed Jul 30
Line: backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded
backup-size=417.32 GB
Line: backup-time=04:49:51:backup-status=Backup succeeded
backup-time=04
Line: 49:51:backup-status=Backup succeeded
backup-status=Backup succeeded
Line: 49:51:
### Verify
backup-size = 417.32 GB
backup-status = Backup succeeded
backup-time = 04
The substitute operations delete the previously matched part of the line. There are ways of continuing a match where you left off - see \G at the 'perlre' page.
Note that the regex is crafted to stop at the first colon after the 'colon or equals' after the keyword. That means it truncates the backup time. One moral is "do not use a separator that can appear in the data". Another is "provide sample data so people can help you more easily". Another is "provide complete but minimal working scripts where possible".
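A tiny sketch of that \G alternative: with a scalar-context //g match, each iteration resumes where the previous one stopped, so the pairs can be walked without substituting them away. The made-up input here uses ';' as the separator precisely because ':' appears in the data, per the moral above:
my $line = "backup-size=417.32 GB;backup-time=04:49:51;backup-status=Backup succeeded";
while ($line =~ /\G(backup-\w+)=([^;]+);?/g) {
    print "$1 => $2\n";   # prints all three pairs, times left intact
}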
Processing the sample data
Now that we have the sample input data, we can see that you need slightly different processing. This script:
use strict;
use warnings;
use constant debug => 0;

my $mon = 'Jul';
my $day = 28;
my $year = 2010;
my %items = ();
while (my $line = <>)
{
    chomp $line;
    print "Line: $line\n" if debug;
    if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year: ([^:]+):backup:/)   # Mon Jul 26 22:00:02 2010:
    {
        print "### Scan\n" if debug;
        my $date = $1;
        my $set = $2;
        print "$date ($set): " if debug;
        $items{$set}->{'a-logdate'} = $date;
        $items{$set}->{'a-dataset'} = $set;
        if ($line =~ m/(ERROR|backup-date|backup-size|backup-time|backup-status)[:=](.+)/)
        {
            my $key = $1;
            my $val = $2;
            $items{$set}->{$key} = $val;
            print "$key=$val\n" if debug;
        }
    }
}
print "### Verify\n";
for my $set (sort keys %items)
{
    print "Set: $set\n";
    my %info = %{$items{$set}};
    for my $key (sort keys %info)
    {
        printf "%s=%s;", $key, $info{$key};
    }
    print "\n";
}
produces this result on the sample data file.
### Verify
Set: db9.abc.bak
a-dataset=db9.abc.bak;a-logdate=Wed Jul 28;backup-date=20100728000004;
Set: test203.bak_lvm
a-dataset=test203.bak_lvm;a-logdate=Wed Jul 28;backup-size=417.32 GB;backup-status=Backup succeeded;backup-time=04:49:51;
Note that now that we have sample data, we can see that there is only one key/value pair per line, but there are multiple systems backed up per day. So, the inner while loop becomes a simple if. The printing out occurs at the end. And I'm using a 'two-tier' hash: %items contains an entry for each data set, and that entry is a reference to a hash. Not necessarily something for novices to play with, but it fell into place very naturally with the previous code. Note, too, that this version doesn't hack the line - there's no need, since there's only one lot of data per line.
Can it be improved - yes, undoubtedly. Does it work? Yes, more or less... Can it be hacked into shape? Yes, it can be hacked to work as you need.
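If the hash-of-hashes shape is unfamiliar, this minimal sketch (with invented names) shows the structure that code builds:
my %items;
$items{'test203.bak_lvm'}->{'backup-size'} = '417.32 GB';   # inner hash springs into existence (autovivification)
$items{'test203.bak_lvm'}->{'backup-time'} = '04:49:51';
for my $set (sort keys %items) {
    my %info = %{ $items{$set} };
    print "$set: $info{'backup-size'} in $info{'backup-time'}\n";
}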
@Jonathan - I wrote out the text file within the while loop. It seems to work. I tried doing it after the second while loop as you suggested in your comment; I'm not sure why it didn't work.
open(my $MYDATAFILE, ">/home/test/myout.txt") || die "cannot open $!";
open(my $fh, "ssh $server cat $file |") or die "can't open log $server:$file: $!\n";
while (my $line = <$fh>)
{
    chomp $line;
    if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year:/)   # Mon Jul 26 22:00:02 2010:
    {
        my $date = $1;
        #print $date;
        my %items = ();
        $line =~ s/.* $mon $day \d\d:\d\d:\d\d $year://;
        while ($line =~ m/(ERROR|backup-date|backup-size|backup-time|backup-status)[:=]([^:]+)/)
        {
            my $key = $1;
            my $val = $2;
            $items{$key} = $val;
            $line =~ s/$key[:=]$val[:=]?//;
            #print "[$key]";
            #print "[$val]";
            print $MYDATAFILE "$key=$val";
        }
        # The %items hash contains the split out information.
        # Now write the data for this line of the log file.
    }
}