split log of a multi-threaded application - regex

I have a multi-threaded application which generates logs like the following:
D Fri Feb 01 00:21:23 2013 <no machine> pin_deferred_act:10233 pin_mta_conf.c:636 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:0:1359658283:1
pin_mta_convert_cmdline_options_to_flist parameters flist
D Fri Feb 01 00:21:23 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pcpst.c(78):406 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:2:0:1359658283:0
connect to host=172.16.87.14, port=11962 OK
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:2479 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:0:1359658283:0
Config object search input flist
0 PIN_FLD_POID POID [0] 0.0.0.1 /search/pin -1 0
0 PIN_FLD_FLAGS INT [0] 0
0 PIN_FLD_TEMPLATE STR [0] "select X from /config/mta where F1 = V1 "
0 PIN_FLD_ARGS ARRAY [1] allocated 20, used 1
1 PIN_FLD_CONFIG_MTA ARRAY [0] allocated 20, used 1
2 PIN_FLD_NAME STR [0] "pin_deferred_act"
0 PIN_FLD_RESULTS ARRAY [0] allocated 20, used 1
1 PIN_FLD_POID POID [0] NULL poid pointer
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:2484 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:3:7:1359658284:2
Config object search output flist
0 PIN_FLD_POID POID [0] 0.0.0.1 /search/pin -1 0
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:3138 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:7:1359658284:2
So the threads write interleaved entries to the same logfile; in pin_deferred_act:10233:1:7, the 1 after the process ID identifies the first thread as the source of the entry.
I want to create a log file for each thread, where the start point should be:
1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:
and the end point should be:
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in
(where the date/timestamp keeps changing).
All the instances should go in one file.
For example:
D Fri Feb 01 00:21:23 2013 <no machine> pin_deferred_act:10233 pin_mta_conf.c:636 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:0:1359658283:1
pin_mta_convert_cmdline_options_to_flist parameters flist
D Fri Feb 01 00:21:23 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pcpst.c(78):406 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:2:0:1359658283:0
connect to host=172.16.87.14, port=11962 OK
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:2479 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:0:1359658283:0
Config object search input flist
0 PIN_FLD_POID POID [0] 0.0.0.1 /search/pin -1 0
0 PIN_FLD_FLAGS INT [0] 0
0 PIN_FLD_TEMPLATE STR [0] "select X from /config/mta where F1 = V1 "
0 PIN_FLD_ARGS ARRAY [1] allocated 20, used 1
1 PIN_FLD_CONFIG_MTA ARRAY [0] allocated 20, used 1
2 PIN_FLD_NAME STR [0] "pin_deferred_act"
0 PIN_FLD_RESULTS ARRAY [0] allocated 20, used 1
1 PIN_FLD_POID POID [0] NULL poid pointer
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:3138 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:7:1359658284:2
should go to one file - Thread1.log - and similarly for the other threads, files should be created as Threadn.log, respectively.

Files are a messy, non-scalable way of handling logs to begin with. A better approach is to handle logs as streams of log entry messages connected source(s) -> sink(s). Consider syslog, logplex or similar if Oracle provides alternative means of data collection. Custom re-implementation might be feasible depending on logging IOPS bottlenecks or other factors.
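For illustration, a minimal sketch of that source -> sink idea using Perl's core Sys::Syslog (the ident pin_deferred_act and facility local0 are placeholder choices):
use strict;
use warnings;
use Sys::Syslog qw(openlog syslog closelog);

# Forward each log entry read on STDIN to a syslog sink instead of a flat file.
openlog('pin_deferred_act', 'ndelay,pid', 'local0');
while (my $entry = <STDIN>) {
    chomp $entry;
    syslog('info', '%s', $entry);
}
closelog();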
Use of high-resolution monotonic clocks and/or globally-ordered GUID timestamps is highly recommended. With wall time, be sure to use non-backwards-compensated UTC everywhere, synced to low-strata time sources.
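A small sketch of both clock recommendations (Time::HiRes exposes clock_gettime where the operating system supports it):
use strict;
use warnings;
use POSIX qw(strftime);
use Time::HiRes qw(clock_gettime CLOCK_MONOTONIC);

my $t0 = clock_gettime(CLOCK_MONOTONIC);    # monotonic: never steps backwards
# ... do some work ...
my $elapsed = clock_gettime(CLOCK_MONOTONIC) - $t0;

# Non-backwards-compensated UTC wall time, ISO 8601 so entries sort lexically.
my $stamp = strftime('%Y-%m-%dT%H:%M:%SZ', gmtime);
printf "[%s] step took %.6f s\n", $stamp, $elapsed;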
These recommendations may vary according to the needs of the application, of course, so experiment and implement wisely.

I think Barry's advice is useful, but in the event that you can't alter the application's log output, here is a quick Perl solution:
#!/usr/bin/perl
use strict;
use warnings;

my %logs;       # thread number => open output handle
my $last_log;   # handle owning the current line

# Open the application's log here; the path is a placeholder.
open my $main_log_file, '<', 'application.log'
    or die "Can't open application log: $!";

while (<$main_log_file>)
{
    if (/pin_deferred_act:\d+:(\d+):\d/)
    {
        unless (defined $logs{$1})
        {
            open my $fh, '>', "Thread$1.log" or die "Can't open Thread $1 log: $!";
            $logs{$1} = $fh;
        }
        $last_log = $logs{$1};
    }
    if (defined $last_log)
    {
        print {$last_log} $_;
    }
    else
    {
        # Didn't find a starting line yet. Error handling?
    }
}
This solution maintains a hash of open file handles to the log files for all threads. I prefer this as it is more efficient if the input will have a lot of switching back and forth between the same threads. It would break, however, if the application has more threads than you are allowed to have files open on your system.
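If the open-file limit is a concern, one variation (a sketch under the same assumptions) is to reopen the per-thread file in append mode for every line instead of caching handles - slower, but it never holds more than one handle at a time:
use strict;
use warnings;

my $current;    # name of the file owning the current block of lines
while (<>) {
    $current = "Thread$1.log" if /pin_deferred_act:\d+:(\d+):\d/;
    next unless defined $current;
    open my $fh, '>>', $current or die "Can't append to $current: $!";
    print {$fh} $_;
    close $fh;
}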

Related

Options to execute unit Test cases in minimum Time for Time Based Unit Test cases

I have a requirement to lock out a user if three failed attempts are made within 15 minutes. The account will be automatically unlocked after a period. I am passing the maximum attempt count, the lockout window duration, and the lockout period as parameters to the class that implements the functionality. Even with values like 2s or 3s for the parameters, the unit test suite takes about 30 seconds to complete.
Are there any specific methods or strategies used in these scenarios to reduce the test execution time?
There are a few options:
Use a Test Double and inject an IClock that a test can control (a rough sketch follows this list).
Use a smaller time resolution than seconds. Perhaps define the window and the quarantine period in milliseconds instead of seconds.
Write the logic as one or more pure functions.
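Before expanding on pure functions, here is a rough sketch of the first option - an injected clock - written in Perl to match the rest of this page; all names (Lockout, record_failure, is_locked_out) are hypothetical, and only the failure window is modelled:
package Lockout;
use strict;
use warnings;

# The clock is a code ref, so a test can substitute a fake time source
# instead of sleeping through real seconds.
sub new {
    my ($class, %args) = @_;
    return bless {
        clock    => $args{clock}  // sub { time() },
        window   => $args{window} // 15 * 60,   # 15-minute window
        max      => $args{max}    // 3,         # 3 failed attempts
        failures => [],
    }, $class;
}

sub record_failure {
    my ($self) = @_;
    push @{ $self->{failures} }, $self->{clock}->();
}

sub is_locked_out {
    my ($self) = @_;
    my $now    = $self->{clock}->();
    my @recent = grep { $_ > $now - $self->{window} } @{ $self->{failures} };
    return scalar(@recent) >= $self->{max};
}

package main;

# The test drives the fake clock directly; no real waiting is involved.
my $t = 0;
my $lockout = Lockout->new(clock => sub { $t });
$lockout->record_failure for 1 .. 3;
print $lockout->is_locked_out ? "locked\n" : "unlocked\n";    # locked
$t += 16 * 60;                                                # jump 16 minutes forward
print $lockout->is_locked_out ? "locked\n" : "unlocked\n";    # unlocked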
Pure functions are intrinsically testable, so let me expand on that.
Pure functions
In order to ensure that we're working with pure functions, I'll write the tests and the System Under Test in Haskell.
I'm going to assume that some function exists that checks whether a single login attempt succeeds. How this works is a separate concern. I'm going to model the output of such a function like this:
data LoginResult = Success | Failure deriving (Eq, Show)
In other words, a login attempt either succeeds or fails.
You now need a function to determine the new state given a login attempt and a previous state.
Evaluate login
At first I thought this was going to be more complicated, but the nice thing about TDD is that once you write the first tests, you realise the potential for simplification. Fairly quickly, I realised that all that was required was a function like this:
evaluateLogin :: (UTCTime, LoginResult) -> [UTCTime] -> [UTCTime]
This function takes a current LoginResult and the time it was made (as a tuple: (UTCTime, LoginResult)), as well as a log of previous failures, and returns a new failure log.
After a few iterations, I'd written this inlined HUnit parametrised test:
"evaluate login" ~: do
(res, state, expected) <-
[
((at 2022 5 16 17 29, Success), [],
[])
,
((at 2022 5 16 17 29, Failure), [],
[at 2022 5 16 17 29])
,
((at 2022 5 16 18 6, Failure), [at 2022 5 16 17 29],
[at 2022 5 16 18 6, at 2022 5 16 17 29])
,
((at 2022 5 16 18 10, Success), [at 2022 5 16 17 29],
[])
]
let actual = evaluateLogin res state
return $ expected ~=? actual
The logic I found useful to tease out is that whenever there's a login failure, the evaluateLogin function adds the failure time to the failure log. If, on the other hand, there's a successful login, it clears the failure log:
evaluateLogin :: (UTCTime, LoginResult) -> [UTCTime] -> [UTCTime]
evaluateLogin ( _, Success) _ = []
evaluateLogin (when, Failure) failureLog = when : failureLog
This, however, tells you nothing about the quarantine status of the user. Another function can take care of that.
Quarantine status
The following parametrised test is the result of a few more iterations:
"is locked out" ~: do
(wndw, p, whn, l, expected) <-
[
(ndt 0 15, ndt 1 0, at 2022 5 16 19 59, [], False)
,
(ndt 0 15, ndt 1 0, at 2022 5 16 19 59, [
at 2022 5 16 19 54,
at 2022 5 16 19 49,
at 2022 5 16 19 45
],
True)
,
(ndt 0 15, ndt 1 0, at 2022 5 16 19 59, [
at 2022 5 16 19 54,
at 2022 5 16 19 49,
at 2022 5 16 18 59
],
False)
,
(ndt 0 15, ndt 1 0, at 2022 5 16 19 59, [
at 2022 5 16 19 54,
at 2022 5 16 19 52,
at 2022 5 16 19 49,
at 2022 5 16 19 45
],
True)
,
(ndt 0 15, ndt 1 0, at 2022 5 16 20 58, [
at 2022 5 16 19 54,
at 2022 5 16 19 49,
at 2022 5 16 19 45
],
False)
]
let actual = isLockedOut wndw p whn l
return $ expected ~=? actual
These tests drive the following implementation:
isLockedOut :: NominalDiffTime -> NominalDiffTime -> UTCTime -> [UTCTime] -> Bool
isLockedOut window quarantine when failureLog =
  case failureLog of
    [] -> False
    xs ->
      let latestFailure = maximum xs
          windowMinimum = addUTCTime (-window) latestFailure
          lockOut = 3 <= length (filter (windowMinimum <=) xs)
          quarantineEndsAt = addUTCTime quarantine latestFailure
          isStillQuarantined = when < quarantineEndsAt
      in
        lockOut && isStillQuarantined
Since it's a pure function, it can calculate quarantine status deterministically based exclusively on input.
Determinism
None of the above functions depend on the system clock. Instead, you pass the current time (when) as an input value, and the functions calculate the result based on the input.
Not only is this easy to unit test, it also enables you to perform simulations (a test is essentially a simulation) and calculate past results.
Time is an input
If you don't consider time an input value, think about it until you do -- it is an important concept -- John Carmack
Arrange your design so that you can check the complicated logic of deciding what to do independently from actually doing it.
Design the actually doing it parts such that they are "so simple there are obviously no deficiencies". Check that code occasionally, or as part of your system testing - those tests are still potentially going to be "slow", but they aren't going to be in the way (because you've found a more effective way to mitigate "the gap between decision and feedback").

Always "I have no answer for that" response in program ab

I have been trying to create a chatbot using Program AB. I created a simple AIML file and tried it, but it is not working. I am getting this:
Name = super Path = /aiml/bots/super
c:/ab
/aiml/bots
/aiml/bots/super
/aiml/bots/super/aiml
/aiml/bots/super/aimlif
/aiml/bots/super/config
/aiml/bots/super/logs
/aiml/bots/super/sets
/aiml/bots/super/maps
Preprocessor: 0 norms 0 persons 0 person2
Get Properties: /aiml/bots/super/config/properties.txt
addAIMLSets: /aiml/bots/super/sets does not exist.
addCategories: /aiml/bots/super/aiml does not exist.
AIML modified Thu Jan 01 05:30:00 IST 1970 AIMLIF modified Thu Jan 01 05:30:00 IST 1970
No deleted.aiml.csv file found
No deleted.aiml.csv file found
addCategories: /aiml/bots/super/aimlif does not exist.
Loaded 0 categories in 0.002 sec
No AIMLIF Files found. Looking for AIML
addCategories: /aiml/bots/super/aiml does not exist.
Loaded 0 categories in 0.001 sec
--> Bot super 0 completed 0 deleted 0 unfinished
Setting predicate topic to unknown
normalized = HELLO
No match.
writeCertainIFCaegories learnf.aiml size= 0
I have no answer for that.
Why is the file not loaded? I have included the simple AIML file below as well. The super folder has all the inbuilt AIML files I downloaded with Program AB.
Because the AIML files were not loaded properly, there are no answers with which to reply to any of the questions.
Preprocessor: 0 norms 0 persons 0 person2
This means no files were processed and added.
The .aiml files were most likely not loaded or found - perhaps a naming issue?
The Loaded 0 categories lines tell you that no categories were found in your .aiml files.
Also, --> Bot super 0 completed 0 deleted 0 unfinished again tells you that 0 categories were completed for your bot.
It may be that you missed out on setting up your .aiml.csv files.
Hope this helps.

Multiple line regex perl [duplicate]

This question already has answers here:
Extracting specific lines with Perl
(4 answers)
Closed 8 years ago.
I'm trying to parse out data from a log file spanning over multiple lines (shown below).
Archiver Started: Fri May 16 00:35:00 2014
Daily Archive for (Thu) May. 15, 2014 STATUS: Successful Fri May 16 00:37:43 2014
Daily Archive for (Thu) May. 15, 2014 STATUS: Successful Fri May 16 00:39:54 2014
Archiver Completed: Fri May 16 00:42:37 2014
I want to split on Archiver Started: on the first line and on Archiver Completed: on the last line, keeping anything in between those lines. So I would be left with the following:
Daily Archive for (Thu) May. 15, 2014 STATUS: Successful Fri May 16 00:37:43 2014
Daily Archive for (Thu) May. 15, 2014 STATUS: Successful Fri May 16 00:39:54 2014
Sometimes there can be a single entry or multiple entries for one day, week, or month.
Is this possible with a Regex?
Use a Range Operator ...
The return value of a flipflop is a sequence number (starting with 1), so you simply need to filter out 1 and the ending number, which has the string "E0" appended to it.
use strict;
use warnings;
while (<DATA>) {
if (my $range = /Archiver Started/ .. /Archiver Completed/ ) {
print if $range != 1 && $range !~ /E/;
}
}
__DATA__
stuff
more stuff
Archiver Started: Fri May 16 00:35:00 2014
Daily Archive for (Thu) May. 15, 2014 STATUS: Successful Fri May 16 00:37:43 2014
Daily Archive for (Thu) May. 15, 2014 STATUS: Successful Fri May 16 00:39:54 2014
Archiver Completed: Fri May 16 00:42:37 2014
other stuff
ending stuff
Outputs:
Daily Archive for (Thu) May. 15, 2014 STATUS: Successful Fri May 16 00:37:43 2014
Daily Archive for (Thu) May. 15, 2014 STATUS: Successful Fri May 16 00:39:54 2014
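The same flip-flop also works as a one-liner straight from the shell (a sketch; logfile stands in for your real file):
perl -ne 'my $r = /Archiver Started/ .. /Archiver Completed/;
          print if $r && $r != 1 && $r !~ /E/' logfile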
You can use this trick:
my @result = ();
my $catch;

LINE:
for my $line ( @lines ) {
    if ( $line =~ m/^Archiver Started/i ) {
        $catch = 1;
        next LINE;
    }
    elsif ( $line =~ m/^Archiver Completed/i ) {
        $catch = 0;
        next LINE;
    }
    next LINE unless $catch;
    push @result, $line;
}

perl how to regex parts of data instead of entire string and then print out a csv file

I have a working Perl script that grabs the data I need and displays it to STDOUT, but now I need to change it to generate a data file (CSV, tab-delimited, or any delimited format).
The regular expression is filtering the data that I need, but I don't want the entire string, just snippets of the output. I'm assuming I would need to store this in another variable to create my output file.
I need a good example of this or suggestions to alter this code. Thank you in advance. :-)
Here's my code:
#!/usr/bin/perl -w
# Usage: ./bakstatinfo.pl Jul 28 2010 /var/log/mybackup.log <server1> <server2>
use strict;
use warnings;

# This piece added to view the arguments passed in
$" = "][";
print "===================================================================================\n";
print "[@ARGV]\n";

# Declare variables
my ($mon, $day, $year, $file) = @ARGV;
my $regex_flag = 0;
splice(@ARGV, 0, 4, ());

foreach my $server (@ARGV) {    # loop over however many server names remain in @ARGV
    print "===================================================================================\n";
    print "REPORTING SUMMARY for SERVER : $server\n";
    open(my $fh, "ssh $server cat $file |") or die "can't open log $server:$file: $!\n";
    while (my $line = <$fh>) {
        if ($line =~ m/.* $mon $day \d{2}:\d{2}:\d{2} $year:.*(ERROR:|backup-date=|backup-size=|backup-time=|backup-status)/) {
            print $line;
            $regex_flag = 1;    # set to true
        }
    }
    if ($regex_flag == 0) {
        print "NOTHING TO REPORT FOR $server: $mon $day $year \n";
    }
    $regex_flag = 0;
    close($fh);
}
Sample raw log file I am using (recently added to provide a better representation of the log):
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:ERROR: mybak-abc appears to be already running for this backupset
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:ERROR: If you are sure mybak-abc is not running, please remove the file /etc/mybak-abc/test202.bak_lvm/.mybak-abc.pid and restart mybak-abc
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:INFO: PHASE START: Cleanup
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:INFO: PHASE END: Cleanup
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:INFO: END OF BACKUP
Wed Jul 28 00:00:04 2010: db9.abc.bak:backup:INFO: START OF BACKUP
Wed Jul 28 00:00:04 2010: db9.abc.bak:backup:INFO: PHASE START: Initialization
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:WARNING: Binary logging is off.
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: License check successful
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: License check successful for lvm-snapshot.pl
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-set=db9.abc.bak
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-date=20100728000004
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: SQL-server-os=Linux/Unix
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-type=regular
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: host=db9.abc.bak.test.com
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-date-epoch=1280300404
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: retention-policy=3D
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: mybak-abc-version=ABC for SQL Enterprise Edition - version 3.1
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: SQL-version=5.1.32-test-SMP-log
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-directory=/home/backups/db9.abc.bak/20100728000004
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-level=0
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-mode=raw
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE END: Initialization
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE START: Running pre backup plugin
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE START: Flushing logs
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE END: Flushing logs
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE START: Creating snapshot based backup
Wed Jul 28 00:00:11 2010: db9.abc.bak:backup:INFO: Wed Jul 28 00:49:53 2010: test203.bak_lvm:backup:INFO: raw-databases-snapshot=test SQL sgl
Wed Jul 28 00:49:53 2010: test203.bak_lvm:backup:INFO: PHASE END: Creating snapshot based backup
Wed Jul 28 00:49:53 2010: test203.bak_lvm:backup:INFO: PHASE START: Calculating backup size & checksums
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: last-backup=/home/backups/test203.bak_lvm/20100726200004
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: PHASE END: Calculating backup size & checksums
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: read-locks-time=00:00:05
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: flush-logs-time=00:00:00
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
My working output now:
===================================================================================
[Jul][28][2010][/var/log/mybackup.log][server1]
===================================================================================
REPORTING SUMMARY for SERVER : server1
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
The output I need to see would be something like this (a data file separated by ';', for example):
MyDate=Wed Jul 28;MyBackupSet= test203.bak_lvm;MyBackupSize=187.24 GB;MyBackupTime=04:49:51;MyBackupStat=Backup succeeded
Use 'capturing parentheses' to identify the bits you want to deal with.
if ($line =~ m/(.* $mon $day) \d{2}:\d{2}:\d{2} $year:.*
               (ERROR:|backup-date=|backup-size=|
                backup-time=|backup-status)/x) {
You will need to do some surgery on the second set of parentheses - those surrounding the start of the various keywords. You may have to chop those out in bits and pieces inside the condition.
When you have all the data extracted into variables, use Text::CSV to handle CSV output (and input).
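For example, a minimal sketch of that last step with Text::CSV, using ';' as the separator from your desired output (the file name and the literal field values are illustrative):
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, sep_char => ';', eol => "\n" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

open my $out, '>', 'backup_report.csv' or die "Can't write report: $!";

# One record per backup set, using fields extracted by the regex above.
$csv->print($out, [
    'MyDate=Wed Jul 28',
    'MyBackupSet=test203.bak_lvm',
    'MyBackupSize=417.32 GB',
    'MyBackupTime=04:49:51',
    'MyBackupStat=Backup succeeded',
]);
close $out;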
There are a myriad of modules to handle HTML or XML (over 2000, and I think over 3000, with HTML in their names - I happened to look yesterday). Many of those won't be applicable, but CPAN is your friend.
Answering questions posed by comments
Would I split them off into separate variables as well? The first part gives me the date/time that I need. The next filter then gives me 1) Error: 2)backup-date= 3)backup-size= ...etc.
More or less. Unfortunately, you don't show some representative input lines, which means it is hard to tell what might be best. However, it seems likely that a scheme such as:
while (my $line = <$fh>)
{
chomp $line;
if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year:/)
{
my $date = $1;
my %items = ();
$line =~ s/.* $mon $day \d\d:\d\d:\d\d $year://;
while ($line =~ m/(ERROR|backup-date|backup-size|
backup-time|backup-status)
[:=]([^:]+)/x)
{
my $key = $1;
my $val = $2;
$items{$key} = $val;
$line =~ s/$key[:=]$val[:=]?//;
}
# The %items hash contains the split out information.
# Now write the data for this line of the log file.
}
}
There might well be better ways to handle the trimming (but it is Perl so TMTOWTDI), but the basic idea here is to catch the lines that are interesting, then progressively chop the bits of interest out of the line, so the line grows shorter on each iteration (therefore, eventually terminating the inner while loop).
Note the use of the /x modifier to allow for a more readable regex split over lines (I edited the original answer version to use that too). I've also allowed 'ERROR' to be followed by an '=' or the other keywords to be followed by ':'; it seems unlikely that you'd get false matches that way, and it simplifies the regex substitute operations. The initial pattern match no longer requires one of the subsections to be present, either. You must judge for yourself whether those small changes (which might pick up non-conforming information) matter or not. For most of my purposes, the chance of the mismatch is small enough not to be an issue - but for legal reasons, it might not be acceptable to you.
Answering questions posed by 'answer'
I manufactured some data:
Wed Jul 30 00:49:51 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
Wed Jul 30 00:49:52 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
Wed Jul 30 00:49:53 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
Wed Jul 30 00:49:51 2010: backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded
I took the script in the answer and hacked and instrumented it - making it standalone.
I also removed the dependency on specific files - it reads standard input and writes to standard output. It makes my testing easier - and the code more flexible.
use strict;
use warnings;
use constant debug => 0;

my $mon  = 'Jul';
my $day  = 30;
my $year = 2010;

while (my $line = <>)
{
    chomp $line;
    print "Line: $line\n" if debug;
    if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year:/)    # Mon Jul 26 22:00:02 2010:
    {
        print "### Scan\n";
        my $date = $1;
        print "$date\n";
        my %items = ();
        $line =~ s/.* $mon $day \d\d:\d\d:\d\d $year://;
        print "Line: $line\n" if debug;
        while ($line =~ m/(ERROR|backup-date|backup-size|backup-time|backup-status)[:=]([^:]+)/)
        {
            my $key = $1;
            my $val = $2;
            $items{$key} = $val;
            $line =~ s/$key[:=]$val[:=]?//;
            print "$key=$val\n";
            print "Line: $line\n" if debug;
        }
        print "### Verify\n";
        for my $key (sort keys %items)
        {
            print "$key = $items{$key}\n";
        }
    }
}
The output I get is:
### Scan
Wed Jul 30
backup-size=417.32 GB
### Verify
backup-size = 417.32 GB
### Scan
Wed Jul 30
backup-time=04
### Verify
backup-time = 04
### Scan
Wed Jul 30
backup-status=Backup succeeded
### Verify
backup-status = Backup succeeded
### Scan
Wed Jul 30
backup-size=417.32 GB
backup-time=04
backup-status=Backup succeeded
### Verify
backup-size = 417.32 GB
backup-status = Backup succeeded
backup-time = 04
The verify loop prints out the data from the '%items' hash quite happily. With the debug value set to 1 instead of 0, the output I get is:
Line: Wed Jul 30 00:49:51 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
### Scan
Wed Jul 30
Line: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
backup-size=417.32 GB
Line: test203.bak_lvm:backup:INFO:
### Verify
backup-size = 417.32 GB
Line: Wed Jul 30 00:49:52 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
### Scan
Wed Jul 30
Line: test203.bak_lvm:backup:INFO: backup-time=04:49:51
backup-time=04
Line: test203.bak_lvm:backup:INFO: 49:51
### Verify
backup-time = 04
Line: Wed Jul 30 00:49:53 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
### Scan
Wed Jul 30
Line: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
backup-status=Backup succeeded
Line: test203.bak_lvm:backup:INFO:
### Verify
backup-status = Backup succeeded
Line: Wed Jul 30 00:49:51 2010: backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded
### Scan
Wed Jul 30
Line: backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded
backup-size=417.32 GB
Line: backup-time=04:49:51:backup-status=Backup succeeded
backup-time=04
Line: 49:51:backup-status=Backup succeeded
backup-status=Backup succeeded
Line: 49:51:
### Verify
backup-size = 417.32 GB
backup-status = Backup succeeded
backup-time = 04
The substitute operations delete the previously matched part of the line. There are ways of continuing a match where you left off - see \G at the 'perlre' page.
Note that the regex is crafted to stop at the first colon after the 'colon or equals' after the keyword. That means it truncates the backup time. One moral is "do not use a separator that can appear in the data". Another is "provide sample data so people can help you more easily". Another is "provide complete but minimal working scripts where possible".
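As a sketch of the \G alternative - which also sidesteps the truncation, because the time's colons can be matched explicitly (the pattern below is illustrative):
use strict;
use warnings;

my $line = 'backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded';

# Each iteration resumes at \G, where the previous match stopped, so no
# destructive s/// is needed; (?::\d\d)* lets values like 04:49:51 survive.
while ($line =~ m/\G:?(backup-\w+)=([^:]+(?::\d\d)*)/gc) {
    print "$1 => $2\n";
}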
Processing the sample data
Now that we have the sample input data, we can see that you need slightly different processing. This script:
use strict;
use warnings;
use constant debug => 0;

my $mon  = 'Jul';
my $day  = 28;
my $year = 2010;
my %items = ();

while (my $line = <>)
{
    chomp $line;
    print "Line: $line\n" if debug;
    if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year: ([^:]+):backup:/)    # Mon Jul 26 22:00:02 2010:
    {
        print "### Scan\n" if debug;
        my $date = $1;
        my $set  = $2;
        print "$date ($set): " if debug;
        $items{$set}->{'a-logdate'} = $date;
        $items{$set}->{'a-dataset'} = $set;
        if ($line =~ m/(ERROR|backup-date|backup-size|backup-time|backup-status)[:=](.+)/)
        {
            my $key = $1;
            my $val = $2;
            $items{$set}->{$key} = $val;
            print "$key=$val\n" if debug;
        }
    }
}

print "### Verify\n";
for my $set (sort keys %items)
{
    print "Set: $set\n";
    my %info = %{$items{$set}};
    for my $key (sort keys %info)
    {
        printf "%s=%s;", $key, $info{$key};
    }
    print "\n";
}
produces this result on the sample data file.
### Verify
Set: db9.abc.bak
a-dataset=db9.abc.bak;a-logdate=Wed Jul 28;backup-date=20100728000004;
Set: test203.bak_lvm
a-dataset=test203.bak_lvm;a-logdate=Wed Jul 28;backup-size=417.32 GB;backup-status=Backup succeeded;backup-time=04:49:51;
Note that now we have sample data, we can see that there is only one key/value pair per line, but there are multiple systems backed up per day. So, the inner while loop becomes a simple if. The printing out occurs at the end. And I'm using a 'two-tier' hash. The %items contains an entry for each data set; the entry, though, is a reference to a hash. Not necessarily something for novices to play with, but it fell into place very naturally with the previous code. Note, too, that this version doesn't hack the line - there's no need since there's only one lot of data per line.
Can it be improved - yes, undoubtedly. Does it work? Yes, more or less... Can it be hacked into shape? Yes, it can be hacked to work as you need.
@Jonathan - I wrote out the text file within the while loop. It seems to work. I tried doing it after the second while loop as you suggested in your comment. I'm not sure why it didn't work.
open(my $MYDATAFILE, '>', '/home/test/myout.txt') or die "cannot write $!";
open(my $fh, "ssh $server cat $file |") or die "can't open log $server:$file: $!\n";
while (my $line = <$fh>)
{
    chomp $line;
    if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year:/)    # Mon Jul 26 22:00:02 2010:
    {
        my $date = $1;
        #print $date;
        my %items = ();
        $line =~ s/.* $mon $day \d\d:\d\d:\d\d $year://;
        while ($line =~ m/(ERROR|backup-date|backup-size|backup-time|backup-status)[:=]([^:]+)/)
        {
            my $key = $1;
            my $val = $2;
            $items{$key} = $val;
            $line =~ s/$key[:=]$val[:=]?//;
            #print "[$key]";
            #print "[$val]";
            print $MYDATAFILE "$key=$val";
        }
        # The %items hash contains the split out information.
        # Now write the data for this line of the log file.
    }
}

Parsing a multiline variable-length log file

I want to be able to utilize a 'grep' or 'pcregrep -M' like solution that parses a log file that fits the following parameters:
Each log entry can be multiple lines in length
First line of log entry has the key that I want to search for
Each key appears on more than one line
So in the example below I would want to return every line that has KEY1 on it and all the supporting lines below it until the next log message.
Log file:
01 Feb 2010 - 10:39:01.755, DEBUG - KEY1:randomtext
blah
blah2 T
blah3 T
blah4 F
blah5 F
blah6
blah7
01 Feb 2010 - 10:39:01.757, DEBUG - KEY1:somethngelse
01 Feb 2010 - 10:39:01.758, DEBUG - KEY2:randomtest
this is a test
01 Feb 2010 - 10:39:01.760, DEBUG - KEY1:more logs here
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:eve more here
this is another multiline log entry
keeps on going
but not as long as before
01 Feb 2010 - 10:39:01.763, DEBUG - KEY2:testing
test test test
end of key2
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:but key 1 is still going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
okay enough
01 Feb 2010 - 10:39:01.762, DEBUG - KEY3:and so on
and on
Desired output of searching for KEY1:
01 Feb 2010 - 10:39:01.755, DEBUG - KEY1:randomtext
blah
blah2 T
blah3 T
blah4 F
blah5 F
blah6
blah7
01 Feb 2010 - 10:39:01.757, DEBUG - KEY1:somethngelse
01 Feb 2010 - 10:39:01.760, DEBUG - KEY1:more logs here
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:eve more here
this is another multiline log entry
keeps on going
but not as long as before
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:but key 1 is still going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
okay enough
I was trying to do something like:
pcregrep -M 'KEY1(.*\n)+' logfile
but it definitely doesn't work right.
If you are on *nix, you can use the shell:
#!/bin/bash
read -p "Enter key: " key
awk -v key="$key" '
    $0 ~ /DEBUG/ && $0 !~ key { f = 0 }   # a new entry without the key stops printing
    $0 ~ key                  { f = 1 }   # a line containing the key starts printing
    f { print }                           # print while the flag is set
' file
Output:
$ cat file
01 Feb 2010 - 10:39:01.755, DEBUG - KEY1:randomtext
blah
blah2 T
blah3 T
blah4 F
blah5 F
blah6
blah7
01 Feb 2010 - 10:39:01.757, DEBUG - KEY1:somethngelse
01 Feb 2010 - 10:39:01.758, DEBUG - KEY2:randomtest
this is a test
01 Feb 2010 - 10:39:01.760, DEBUG - KEY1:more logs here
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:eve more here
this is another multiline log entry
keeps on going
but not as long as before
01 Feb 2010 - 10:39:01.763, DEBUG - KEY2:testing
test test test
end of key2
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:but key 1 is still going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
okay enough
01 Feb 2010 - 10:39:01.762, DEBUG - KEY3:and so on
and on
$ ./shell.sh
Enter key: KEY1
01 Feb 2010 - 10:39:01.755, DEBUG - KEY1:randomtext
blah
blah2 T
blah3 T
blah4 F
blah5 F
blah6
blah7
01 Feb 2010 - 10:39:01.757, DEBUG - KEY1:somethngelse
01 Feb 2010 - 10:39:01.760, DEBUG - KEY1:more logs here
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:eve more here
this is another multiline log entry
keeps on going
but not as long as before
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:but key 1 is still going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
okay enough
I had a similar requirement and decided to code a little tool (in .NET) that parses log files for me and writes the result to standard output.
Maybe you'll find it useful. It works on Windows and Linux (Mono).
See here: https://github.com/iohn2000/ParLog
It is a tool to filter log files for log entries that contain a specific (regex) pattern. It also works with multiline log entries.
E.g.: show only log entries from a certain workflow instance.
It writes the result to standard output; use '>' to redirect into a file.
The default startPattern is:
^[0-9]{2} [\w]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}
which corresponds to date formats like: 04 Feb 2017 15:02:50,778
Parameters are:
f:wildcard a file name or wildcard for multiple files
p:pattern the regex pattern to filter the file(s)
s:startPattern regex pattern to define when a new log entry starts
Example:
ParLog.exe -f=*.log -p=findMe
Adding on to ghostdog74's answer (thank you very much, btw - it works great).
This version takes command-line input in the form of "./parse file key" and handles log levels of ERROR as well as DEBUG:
#!/bin/bash
# Usage: ./parse file key
awk -v key="$2" '
    $0 ~ /DEBUG|ERROR/ && $0 !~ key { f = 0 }   # a new entry without the key stops printing
    $0 ~ key                        { f = 1 }   # a line containing the key starts printing
    f { print }
' "$1"
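For completeness, a hedged Perl translation of the same state machine (parse.pl is a hypothetical name; it reads files named on the command line, so usage would be ./parse.pl KEY1 logfile):
#!/usr/bin/perl
use strict;
use warnings;

my $key = shift or die "Usage: $0 KEY [files...]\n";
my $printing = 0;

while (<>) {
    $printing = 0 if /DEBUG|ERROR/ && !/\Q$key\E/;   # a new entry without the key stops output
    $printing = 1 if /\Q$key\E/;                     # any line containing the key (re)starts output
    print if $printing;
}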