Parsing a multiline variable-length log file

Parsing a multiline variable-length log file - regex

I want to be able to utilize a 'grep' or 'pcregrep -M' like solution that parses a log file that fits the following parameters:
Each log entry can be multiple lines in length
First line of log entry has the key that I want to search for
Each key appears on more then one line
So in the example below I would want to return every line that has KEY1 on it and all the supporting lines below it until the next log message.
Log file:
01 Feb 2010 - 10:39:01.755, DEBUG - KEY1:randomtext
blah
blah2 T
blah3 T
blah4 F
blah5 F
blah6
blah7
01 Feb 2010 - 10:39:01.757, DEBUG - KEY1:somethngelse
01 Feb 2010 - 10:39:01.758, DEBUG - KEY2:randomtest
this is a test
01 Feb 2010 - 10:39:01.760, DEBUG - KEY1:more logs here
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:eve more here
this is another multiline log entry
keeps on going
but not as long as before
01 Feb 2010 - 10:39:01.763, DEBUG - KEY2:testing
test test test
end of key2
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:but key 1 is still going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
okay enough
01 Feb 2010 - 10:39:01.762, DEBUG - KEY3:and so on
and on
Desired output of searching for KEY1:
01 Feb 2010 - 10:39:01.755, DEBUG - KEY1:randomtext
blah
blah2 T
blah3 T
blah4 F
blah5 F
blah6
blah7
01 Feb 2010 - 10:39:01.757, DEBUG - KEY1:somethngelse
01 Feb 2010 - 10:39:01.760, DEBUG - KEY1:more logs here
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:eve more here
this is another multiline log entry
keeps on going
but not as long as before
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:but key 1 is still going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
okay enough
I was trying to do something like:
pcregrep -M 'KEY1(.*\n)+' logfile
but definitely doesn't work right.

if you are on *nix, you can use the shell
#!/bin/bash
read -p "Enter key: " key
awk -vkey="$key" '
$0~/DEBUG/ && $0 !~key{f=0}
$0~key{ f=1 }
f{print} ' file
output
$ cat file
01 Feb 2010 - 10:39:01.755, DEBUG - KEY1:randomtext
blah
blah2 T
blah3 T
blah4 F
blah5 F
blah6
blah7
01 Feb 2010 - 10:39:01.757, DEBUG - KEY1:somethngelse
01 Feb 2010 - 10:39:01.758, DEBUG - KEY2:randomtest
this is a test
01 Feb 2010 - 10:39:01.760, DEBUG - KEY1:more logs here
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:eve more here
this is another multiline log entry
keeps on going
but not as long as before
01 Feb 2010 - 10:39:01.763, DEBUG - KEY2:testing
test test test
end of key2
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:but key 1 is still going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
okay enough
01 Feb 2010 - 10:39:01.762, DEBUG - KEY3:and so on
and on
$ ./shell.sh
Enter key: KEY1
01 Feb 2010 - 10:39:01.755, DEBUG - KEY1:randomtext
blah
blah2 T
blah3 T
blah4 F
blah5 F
blah6
blah7
01 Feb 2010 - 10:39:01.757, DEBUG - KEY1:somethngelse
01 Feb 2010 - 10:39:01.760, DEBUG - KEY1:more logs here
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:eve more here
this is another multiline log entry
keeps on going
but not as long as before
01 Feb 2010 - 10:39:01.762, DEBUG - KEY1:but key 1 is still going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
and going
okay enough

I had a similar requirement and decided to code a little tool (in .net) that parses log files for me and write the result to standard output.
Maybe you find it useful. Works on Windows and Linux (Mono)
See here: https://github.com/iohn2000/ParLog
A tool to filter log files for log entries that contain a specific (regex) pattern. Works also with multiline log entries.
e.g.: show only log entries from a certain workflow instance.
Writes the result to standard output. Use '>' to redirect into a file
default startPattern is :
^[0-9]{2} [\w]{3} [0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}
this corresponds to date format: e.g.: 04 Feb 2017 15:02:50,778
Parameters are:
f:wildcard a file name or wildcard for multiple files
p:pattern the regex pattern to filter the file(s)
s:startPattern regex pattern to define when a new log entry starts
Example :
ParLog.exe -f=*.log -p=findMe

Adding on to ghostdog74's answer (thank you very much btw, it works great)
Now takes command line input in the form of "./parse file key" and handles loglevels of ERROR as well as DEBUG
#!/bin/bash
awk -vkey="$2" '
$0~/DEBUG|ERROR/ && $0 !~key{f=0}
$0~key{ f=1 }
f{print} ' $1

Related

Regex extracking information

Hope you are fine. I was trying to build a regex for extracting some strings in some documents. This is a sample of the strings I have to extract:
5003t00001fJK8EAAWFri Jan 27 2023 11:37:45 GMT-0600
0033t00004A596CAARFri Jan 27 2023 11:44:04 GMT-0600
The regex I wrote so far I have:
a) 500[\w._%+ :-]+(GMT)
b) (00).[0-9][\w._%+ :-]+(GMT)
However I am getting some observations with from 1) using the b) Regex that I build.
Any suggestion?

access read memo field

I have an access database that has a table with a memo field. Fields have been inserted in this format.
Apr 02 - some text
Feb 20 - some text
I would like to reverse the order of the inserts so the above would be:
Feb 20 - some text
Apr 02 - some text
I am thinking of reading line by line using regular expressions, anyone has a better path to achieve that

Your memo field contains 2 lines of text and you want to reverse their order. You can do that with a simple VBA procedure, which doesn't need a regular expression.
Here is a sample Immediate window session which demonstrates techniques you can use in a VBA procedure.
MyText = "Apr 02 - some text" & vbcrlf & "Feb 20 - some text"
? MyText
Apr 02 - some text
Feb 20 - some text
? Split(MyText, vbcrlf)(1)
Feb 20 - some text
? Split(MyText, vbcrlf)(0)
Apr 02 - some text
If the memo field can include more than two lines of text, you can load an array with the results from Split() and then loop through the array in reverse order.

split log of a multi-threaded application

I have a multi-threaded application which generates logs as mentioned:
D Fri Feb 01 00:21:23 2013 <no machine> pin_deferred_act:10233 pin_mta_conf.c:636 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:0:1359658283:1
pin_mta_convert_cmdline_options_to_flist parameters flist
D Fri Feb 01 00:21:23 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pcpst.c(78):406 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:2:0:1359658283:0
connect to host=172.16.87.14, port=11962 OK
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:2479 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:0:1359658283:0
Config object search input flist
0 PIN_FLD_POID POID [0] 0.0.0.1 /search/pin -1 0
0 PIN_FLD_FLAGS INT [0] 0
0 PIN_FLD_TEMPLATE STR [0] "select X from /config/mta where F1 = V1 "
0 PIN_FLD_ARGS ARRAY [1] allocated 20, used 1
1 PIN_FLD_CONFIG_MTA ARRAY [0] allocated 20, used 1
2 PIN_FLD_NAME STR [0] "pin_deferred_act"
0 PIN_FLD_RESULTS ARRAY [0] allocated 20, used 1
1 PIN_FLD_POID POID [0] NULL poid pointer
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:2484 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:3:7:1359658284:2
Config object search output flist
0 PIN_FLD_POID POID [0] 0.0.0.1 /search/pin -1 0
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:3138 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:7:1359658284:2
So the threads update the logs, like pin_deferred_act:10233:1:7 --> where 1 specifies the log from the first thread, in the logfile.
I want to create log file for each thread, where the start point should be:
1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:
and end- point should be:
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in
(where date/timestamp will keep on modifying).
All the instances should go in one file.
For e.g.:
D Fri Feb 01 00:21:23 2013 <no machine> pin_deferred_act:10233 pin_mta_conf.c:636 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:0:1359658283:1
pin_mta_convert_cmdline_options_to_flist parameters flist
D Fri Feb 01 00:21:23 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pcpst.c(78):406 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:2:0:1359658283:0
connect to host=172.16.87.14, port=11962 OK
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:2479 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:0:1359658283:0
Config object search input flist
0 PIN_FLD_POID POID [0] 0.0.0.1 /search/pin -1 0
0 PIN_FLD_FLAGS INT [0] 0
0 PIN_FLD_TEMPLATE STR [0] "select X from /config/mta where F1 = V1 "
0 PIN_FLD_ARGS ARRAY [1] allocated 20, used 1
1 PIN_FLD_CONFIG_MTA ARRAY [0] allocated 20, used 1
2 PIN_FLD_NAME STR [0] "pin_deferred_act"
0 PIN_FLD_RESULTS ARRAY [0] allocated 20, used 1
1 PIN_FLD_POID POID [0] NULL poid pointer
D Fri Feb 01 00:21:24 2013 App-BRM-Prod-Pri.acttv.in pin_deferred_act:10233 pin_mta.c:3138 1:App-BRM-Prod-Pri.acttv.in:pin_deferred_act:10233:1:7:1359658284:2
should go to one file - Thread1.log, and similarly for other threads, the file should be created as Threadn.log with the respectively.

Files are a messy, non-scalable way of handling logs to begin with. A better approach is to handle logs as streams of log entry messages connected source(s) -> sink(s). Consider syslog, logplex or similar if Oracle provides alternative means of data collection. Custom re-implementation might be feasible depending on logging IOPS bottlenecks or other factors.
Use of high resolution monotonic clocks and/or globally-ordered GUID timestamps are highly recommended. With wall time, be sure to use non-backwards compensated UTC everywhere synced to low strata time sources.
Above recommendations may vary according to needs of the application of course, so experiment and implement wisely.

I think Barry's advice is useful, but in the event that you can't alter the application's log output, here is a quick Perl solution:
#!usr/bin/perl
use strict;
use warnings;
my %logs;
my $last_log;
while (<$main_log_file>) #open that application's log in this variable.
{
if (/pin_deferred_act:\d+:(\d+):\d/)
{
unless (defined $logs{$1})
{
open $fh,'>',"Thread$1.log") or die "Can't open Thread $1 log: $!";
$logs{$1} = $fh;
}
$last_log = $logs{$1};
}
if (defined $last_log)
{
print {$last_log} $_;
}
else
{
#Didn't find starting line. Error handling?
}
}
This solution maintains a hash of open file handles to the log files for all threads. I prefer this as it is more efficient if the input will have a lot of switching back and forth between the same threads. It would break, however, if the application has more threads than you are allowed to have files open on your system.

RegEx String Validator

In my MVC3 application, on one of the entities I am saving the Date of Birth as a string. Why ? because my application allows the storage of the date of birth of people long dead, eg. Socrates, Plato, Epicurus ... etc and as far as I know the DateTime class doesn't allow that.
Now obviously we don't know the exact date of birth of Epicurus for example, we just know the year of birth [ 341 BCE ], so what I am thinking of doing is building a custom validator, that will validate the input string for the Date of Birth and make sure that they all match the following format:
12 Feb 1809
Feb 1809
341
341 BCE
Oct 341 BCE
11 Mar 5 BCE
I need to write a regular expression that will match any of the above, and of course not match anything else.
Update
Thank you very much, I wish I was as good as you lot in building RegExes ! Since my application is with ASP.net MVC3, I would like to stick with the .NET RegEx class (for convenience's sake).
luastoned answer seems to work; I can't seem to break its logic with all the test data I've thrown at it.
One thing though, can I also allow BC? Because some people use BC and others use BCE < would that be possible? And, am I right that the regular expression can not replace BC with BCE? I have to do that manually through my C# code - the RegEx would just either match or not, is that correct?
Update 2
M42's Regular Expression seems to be working better. I've just copied it and used it in my Custom Validator (code in PasteBin link below).

How about :
/^(?:\d+\s)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?(?:\s?\d+)(?:\sBCE)?$/
Here is a perl script with test cases:
#!/usr/local/bin/perl
use strict;
use warnings;
use Test::More;
my $re1 = qr/^(?:\d+\s)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?(?:\s?\d+)(?:\sBCE)?$/;
while(<DATA>) {
chomp;
if (/$re1/) {
ok(1, "input = $_");
} else {
ok(0, "input = $_");
}
}
done_testing;
__DATA__
12 Feb 1809
Feb 1809
341
341 BCE
Oct 341 BCE
11 Mar 5 BCE
12D09
1s909
A3 43 4 BCE
a 1
3F9
abc
BCE
123b456
output:
# Looks like you failed 9 tests of 15.
ok 1 - input = 12 Feb 1809
ok 2 - input = Feb 1809
ok 3 - input = 341
ok 4 - input = 341 BCE
ok 5 - input = Oct 341 BCE
ok 6 - input = 11 Mar 5 BCE
not ok 7 - input = 12D09
not ok 8 - input = 1s909
not ok 9 - input = A3 43 4 BCE
not ok 10 - input = a 1
not ok 11 - input = 3F9
not ok 12 - input =
not ok 13 - input = abc
not ok 14 - input = BCE
not ok 15 - input = 123b456

this looks likes the weirdest Regex I've ever made:
(\d+\s?)?(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?\s(\d+\s)?(BCE)?
I have no idea how many false positives would go through though..
You can check the sample on Regexr

Not quite a friend of RegExr (and not knowing the limitations of regexes in MVC3), allow me to present a PHP version with named captures (demo):
(?:(?:(?P<date>\d{1,2})\s)?(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))?(?:(?:^|\s)(?P<year>\d+))?(?:\s(?P<bce>BCE))?
This is based on #luastoned's answer.

Need to filter log to search for the lines from the last 5 minutes

2011-04-13 00:09:07,731 INFO [STDOUT] 04/13 00:09:07 Information...
Hi everyone. I would post some of my code, but I don't even think it's worthy of posting. What I'm trying to do is that I've got a log file with lines like above. What I need to do is take the last lines timestamp, and keep all the lines from the last 5 minutes (rather than the last 200 lines or whatever....which would be easier). Could anyone help? I've searched the web, some decent tips, but still nothing going and frustrated as hell. Thanks!

Here's a simple Perl script that iterates over the file and prints every line whose timestamp is within 5 minutes of the time at the start of execution. For more efficiency, and assuming that the lines are in timestamp order, you could modify this to set a boolean flag when it encounters the first printable line and skip the testing from that point forwards.
#!/usr/bin/perl
use POSIX qw(mktime);
$now = time();
while(<>)
{
($yy,$mm,$dd,$h,$m,$s,$t) = /^(\d+)-(\d+)-(\d+)\s+(\d+):(\d+):(\d+),(\d+)/;
$t = mktime($s+$t/1000, $m, $h, $dd, $mm-1, $yy-1900);
print "$_" if ($t >= $now-300);
}

I take it by your latest comment that you are interested in finding out how to find the timestamp that is last in your log, and the entries that are 5 minutes before that.
I think Jim Garrison's solution could be patched to replace this:
$now = time();
with this:
open F, "<server.log" or die $!;
seek F,-1000,2; # set pos to last 1000 bytes
my #f = <F>;
$_ = $f[$#f];
($yy,$mm,$dd,$h,$m,$s,$t) = /^(\d+)-(\d+)-(\d+)\s+(\d+):(\d+):(\d+),(\d+)/;
$now = mktime($s+$t/1000, $m, $h, $dd, $mm-1, $yy-1900);
$now should now contain the last timestamp in the log.
I approximated "-1000" to be long enough to go past at least one line in the log. You could set it much higher if you expect to have long lines in the log, but from what I saw, the last log entry "should" be fairly short.
If you have a huge log file and want to increase performance in the following search, you can use an estimation and perform a seek to find the last, say, 1000000 bytes in the file with:
seek F, -1000000, 2;
Good luck!

Iterate over all the lines, using regexp grab: 00:09:07, and check against current time (localtime, etc...).
if the file contains entries from different dates, then also grab the dates using regexp, and again compare using the output of locatime

How to modify your script to make it work with the logs below
Dec 18 09:41:18 sd
Dec 18 09:46:29 sds
Dec 18 09:48:39 sds
Dec 18 09:48:54 sds
Dec 18 09:54:47 sds
Dec 18 09:55:33 sds
Dec 18 09:55:38 sds
Dec 18 09:57:58 sds
Dec 18 09:58:10 sds
Dec 18 10:00:50 sdsd
Dec 18 10:03:43 sds
Dec 18 10:03:50 sdsd
Dec 18 10:04:06 sdsd
Dec 18 10:04:15 sdsd
Dec 18 10:14:50 wdad
Dec 18 10:19:16 sdadsa
Dec 18 10:19:23 dsds
Dec 18 10:21:03 sadsd
Dec 18 10:22:54 adas
Dec 18 10:27:32 qadad

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Parsing a multiline variable-length log file - regex

Adding on to ghostdog74's answer (thank you very much btw, it works great) Now takes command line input in the form of "./parse file key" and handles loglevels of ERROR as well as DEBUG #!/bin/bash awk -vkey="$2" ' $0~/DEBUG|ERROR/ && $0 !~key{f=0} $0~key{ f=1 } f{print} ' $1

Related

Regex extracking information

access read memo field

split log of a multi-threaded application

RegEx String Validator

Need to filter log to search for the lines from the last 5 minutes

Categories

Resources