so here's my problem: I have big log files and want a script to grep certain periods of time and safe them to a file (sorted), basically
bash script.sh Jul 4 Sep 30
will return for example
Sep 30 user0 logged in
Sep 15 user1 logged in
Aug 6 user0 logged in
Aug 3 user1 logged in
Jul 28 user2 logged in
Jul 27 user2 logged in
Jul 4 user0 logged in
My first attempt was that every month and date gets his own variable like
bash script.sh Jul 4 Sep 3 0
so I can use $1 for start month (July), $2 for start date (4) and so on in grep like
for logs in logs*
do
grep -qEe "^\"$1\" [\"$2\"-9]\s" $messages >> result.txt
done
to get all logs from July 4 to 9 but I don't know how to get logs from the entire time period that aren't in the same month nor in a period like 1-9 or 10-19 and so on
Any help greatly appreciated!
EDIT:
As some people asked, here's how my log files look like (just much bigger and not sorted):
Sep 30 user0 logged in
Jul 27 user2 logged in
Aug 6 user0 logged in
Aug 31 user1 logged in
Jul 8 user2 logged in
Sep 5 user1 logged in
Jul 27 user2 logged in
Jul 14 user0 logged in
[...]
Here's my take:
#/bin/bash
year="$(date +"%Y")"
start="$(date -d"$1 $2, $year" +'%s')"
end="$(($(date -d"$3 $4, $year" +'%s')+86400))"
for log in logs*; do
while IFS= read -r line; do
d="$(date -d"$(cut -d' ' -f1,2 <<< "$line"), $year" +'%s')"
if (( $start <= $d && $d < $end )); then
echo "$s"
fi
done < "$log"
done
You run it like that: ./script.sh Jul 04 Sep 03. Since no year is included in the logs, it assumes that all dates (including the ones in the command line) are for the current year. It's probably not the most optimal solution but it works. It relies on date which it repeatedly calls to parse dates into a unix timestamp. unix timestamps are nice because they are just numbers and thus can be used in numeric comparisons.
$ range="Jul 4 Sep 30"
$ awk -v range="$range" '
BEGIN {
numMths = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",m)
for (i in m) {
mths[m[i]] = i
}
split(range,r)
beg = sprintf("%02d%02d", mths[r[1]], r[2])
end = sprintf("%02d%02d", mths[r[3]], r[4])
}
{ cur = sprintf("%02d%02d", mths[$1], $2) }
(cur >= beg) && (cur <= end) { vals[$1,$2] = $0 }
END {
for (mthNr=numMths; mthNr>0; mthNr--) {
for (dayNr=31; dayNr>0; dayNr--) {
date = m[mthNr] SUBSEP dayNr
if (date in vals) {
print vals[date]
}
}
}
}
' file
Sep 30 user0 logged in
Sep 5 user1 logged in
Aug 31 user1 logged in
Aug 6 user0 logged in
Jul 27 user2 logged in
Jul 14 user0 logged in
Jul 8 user2 logged in
Related
I have some log files that I am trying to analyze. Using a little regex I have gotten the following structure:
Month/Year, URL, Count
Sep 2016,/,100513
Sep 2016,/,68221
Oct 2016,/,536365
Oct 2016,/,362350
Oct 2016,/,89203
Nov 2016,/,526455
Nov 2016,/,351360
Nov 2016,/,88279
Dec 2016,/,538702
Dec 2016,/,156063
Dec 2016,/,89094
Jan 2017,/,535684
Jan 2017,/,105867
Jan 2017,/,87492
Feb 2017,/,483897
Feb 2017,/,80502
Feb 2017,/,47554
Mar 2017,/,434830
Mar 2017,/,72355
Mar 2017,/,43036
It's several 100k lines long so I can't use Excel or Google Sheets so I am trying to aggregate the Count by both Month and URL in python. What is a good method to do this?
You can do this using pandas. Your example is a csv file so the following would work.
import pandas as pd
df = pd.read_csv('x.csv', parse_dates=True)
print df.groupby(['Month/Year', 'URL']).sum()
If you need a solution without external dependencies (maybe a strict corporate environment):
months = {}
urls = {}
with open ('./parsed-data.txt', 'r') as f:
lines = f.readlines()
for line in lines:
# [Month, URL, Count]
data = line.split(',')
months[data[0]] = months.setdefault(data[0], 0) + int(data[2])
urls[data[1]] = urls.setdefault(data[1], 0) + int(data[2])
# Do whatever with months and urls here
I would like to filter the output of the utility last based on a variable set of usernames.
This is sample output from last unfiltered,
reboot system boot server Wed Apr 6 13:15 - 14:24 (01:09)
user1 pts/0 server Wed Apr 6 13:08 - 13:15 (00:06)
reboot system boot system Wed Apr 6 13:08 - 13:15 (00:06)
user1 pts/0 server Wed Apr 6 13:06 - down (00:01)
reboot system boot system Wed Apr 6 13:06 - 13:07 (00:01)
user1 pts/0 server Wed Apr 6 12:59 - down (00:06)
What I would like to do is pipe the output of last to sed. Then, using sed I would print the first occurrence of each specified user name i.e. their last log entry in wtmp. The output should appear as so,
reboot system boot server Wed Apr 6 13:15 - 14:24 (01:09)
user1 pts/0 server Wed Apr 6 13:08 - 13:15 (00:06)
The sed expression that I particularly like is,
last|sed '/user1/{p;q;}'
Unfortunately this only gives me the ability to match the first occurrence of one username. Using this syntax is there a way I could specify a multiple of usernames? Thanks in advance!
awk is better fit here than sed due to awk's ability to use associative arrays:
last | awk '!seen[$1]++'
reboot system boot server Wed Apr 6 13:15 - 14:24 (01:09)
user1 pts/0 server Wed Apr 6 13:08 - 13:15 (00:06)
I'm trying to write a program in c++ which produce a report, which provide a report on the usage by time. Break the time into blocks of quarter of an hour
00:00-00:14, 00:15-00:29, …, 23:45-23:59.
I should provide number of incidents in each time break. This is my code so far. I appreciate if anyone come up with a solution.
string time = word;
size_t found2 = word.find(":");
string tmpH,tmpM;
tmpH = word.substr(0,found2);
tmpM = word.substr((found2+1),word.length());
cout<<" word= "<<word<<" tmpH= "<<tmpH<<" tmpM= "<<tmpM<<endl;
int h = atoi(tmpH.c_str());
int m = atoi(tmpM.c_str());
////
Input:
aa784 pts/30 Fri Mar 28 03:25 still logged in 101.175.22.198
aa784 sshd Fri Mar 28 03:25 still logged in 101.175.22.198
aa784 pts/30 Fri Mar 28 03:25 - 03:25 (00:00) 101.175.22.198
aa784 sshd Fri Mar 28 03:25 - 03:25 (00:00) 101.175.22.198
hmb183 sshd Fri Mar 28 03:24 still logged in c110-20-244-248.mirnd4.nsw.optusnet.com.au
bkg988 sshd Fri Mar 28 03:24 - 03:24 (00:00) 139.218.157.100
hmb183 sshd Fri Mar 28 03:21 - 03:22 (00:01) c110-20-244-248.mirnd4.nsw.optusnet.com.au
fmm290 pts/43 Fri Mar 28 03:11 still logged in 1002-wan-001.rhw.com.au
fmm290 sshd Fri Mar 28 03:11 still logged in 1002-wan-001.rhw.com.au
bkg988 sshd Fri Mar 28 03:09 - 03:09 (00:00) 139.218.157.100
pm554 pts/14 Fri Mar 28 02:22 still logged in ppp239-204.static.internode.on.net
pm554 sshd Fri Mar 28 02:22 still logged in ppp239-204.static.internode.on.net
bkg988 sshd Fri Mar 28 02:17 - 02:17 (00:00) 139.218.157.100
bkg988 sshd Fri Mar 28 02:12 - 02:12 (00:00) 139.218.157.100
bkg988 sshd Fri Mar 28 02:10 - 02:10 (00:00) 139.218.157.100
bx972 pts/12 Fri Mar 28 02:09 still logged in cpe-121-218-195-236.lnse4.cht.bigpond.net.au
bkg988 sshd Fri Mar 28 02:07 - 02:07 (00:00) 139.218.157.100
hmb183 sshd Fri Mar 28 02:05 - 02:06 (00:01) c110-20-244-248.mirnd4.nsw.optusnet.com.au
bkg988 sshd Fri Mar 28 02:04 - 02:04 (00:00) 139.218.157.100
output:
00:00-00:14 10 users logged in
00:15-00:29 15 users logged in
....
23:45-23:59 3 users logged in
Therefore I have 4 conditions in an hour which comes to 96 conditions of time?
First, you can convert each block of hour and minute into minutes , for example 23:45 equals 1095 in minutes. Storing all of these blocks into a list and sort them by its starting time.
For each event, convert each event time into number of minute and use binary search(or linear search) to search for a block that has largest starting time less than or equals to the event time,and that block will be the block this event belong to.
Time complexity to sort is O(1) as there is only few block, and, for all query will be O(n), with n is the number of query.(Binary search in this case can be considered take constant time).
Edit: As you have added another constraint, so , you need to sort all the event by date and time, and for each date, you can use the described approach.
Given lines like:
bkg988 sshd Fri Mar 28 02:17 - 02:17 (00:00) 139.218.157.100
You can do this:
std::string to_month_number(const std::string& name)
{
return name == "Jan" ? "01/" :
name == "Feb" ? "02/" :
...;
}
typedef std::pair<std::string, int> When;
typedef std::map<When, int> Num_Logins;
Num_Logins num_logins;
std::string user, term, day, month, dom;
int hour, min;
char c;
while (std::cin >> user >> term >> dow >> month >> dom >> hour >> c >> min && c == ':')
{
if (dom.length() == 1) dom = ' ' + dom; // standardise with for sorting...
When when = std::make_pair(to_month_number(month) + ' ' + dom, (hour * 60 + min) / 15);
++num_logins[when];
}
I suspect the actual input will be a bit more complex, with the date being formatted differently when the process started last year or intraday, so you'll need to tune the fields parsed out. To recreate the time when iterating over num_logins to print out results, just:
int hour = key->second / 4;
int min = (key->second % 4) * 15; // 00, 15, 30 or 45
Consider this log file
SN PID Date Status
1 P01 Fri Feb 14 19:32:36 IST 2014 Alive
2 P02 Fri Feb 14 19:32:36 IST 2014 Alive
3 P03 Fri Feb 14 19:32:36 IST 2014 Alive
4 P04 Fri Feb 14 19:32:36 IST 2014 Alive
5 P05 Fri Feb 14 19:32:36 IST 2014 Alive
6 P06 Fri Feb 14 19:32:36 IST 2014 Alive
7 P07 Fri Feb 14 19:32:36 IST 2014 Alive
8 P08 Fri Feb 14 19:32:36 IST 2014 Alive
9 P09 Fri Feb 14 19:32:36 IST 2014 Alive
10 P010 Fri Feb 14 19:32:36 IST 2014 Alive
When i do => grep "P01" File
output is : (as expected)
1 P01 Fri Feb 14 19:32:36 IST 2014 Alive
10 P010 Fri Feb 14 19:32:36 IST 2014 Alive
But when i do => grep " P01 " File (notice the space before and after P01)
I do not get any output!
Question : grep matches pattern in a line, so " P01 " ( with space around ) should match the first PID of P01 as it has spaces around it....but seems that this logic is wrong....what obvious thing i am missing here!!!?
If the log uses tabs not spaces, your grep pattern won't match. I would add word boundaries to the word you want to find:
grep '\<P01\>' file
If you really want to use whitespace in your pattern, use one of:
grep '[[:blank:]]P01[[:blank:]]' file # horizontal whitespace, tabs and spaces
grep -P '\sP01\s' file # using Perl regex
I have a working perl script that grabs the data I need and displays them to STDOUT, but now I need to change it to generate a data file (csv, tab dellimited, any delimiter file).
The regular expression is filtering the data that I need, but I don't want the entire string, just snippets of the output. I'm assuming I would need to store this in another variable to create my output file.
I need a good example of this or suggestions to alter this code. Thank you in advance. :-)
Here's my code:
#!/usr/bin/perl -w
# Usage: ./bakstatinfo.pl Jul 28 2010 /var/log/mybackup.log <server1> <server2>
use strict;
use warnings;
#This piece added to view the arguments passed in
$" = "][";
print "===================================================================================\n";
print "[#ARGV]\n";
#Declare Variables
my($mon,$day,$year,$file,$server) = #ARGV;
my $regex_flag = 0;
splice(#ARGV, 0, 4, ());
foreach my $server ( #ARGV ) { #foreach will take Xn of server entries and add to the loop
print "===================================================================================\n";
print "REPORTING SUMMARY for SERVER : $server\n";
open(my $fh,"ssh $server cat $file |") or die "can't open log $server:$file: $!\n";
while (my $line = <$fh>) {
if ($line =~ m/.* $mon $day \d{2}:\d{2}:\d{2} $year:.*(ERROR:|backup-date=|backup-size=|backup-time=|backup-status)/) {
print $line;
$regex_flag=1; #Set to true
}
}
if ($regex_flag==0) {
print "NOTHING TO REPORT FOR $server: $mon $day $year \n";
}
$regex_flag=0;
close($fh);
}
Sample raw log file I am using: (recently added to provide better representation of log)
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:ERROR: mybak-abc appears to be already running for this backupset
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:ERROR: If you are sure mybak-abc is not running, please remove the file /etc/mybak-abc/test202.bak_lvm/.mybak-abc.pid and restart mybak-abc
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:INFO: PHASE START: Cleanup
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:INFO: PHASE END: Cleanup
Tue Jul 27 23:00:06 2010: test202.bak_lvm:backup:INFO: END OF BACKUP
Wed Jul 28 00:00:04 2010: db9.abc.bak:backup:INFO: START OF BACKUP
Wed Jul 28 00:00:04 2010: db9.abc.bak:backup:INFO: PHASE START: Initialization
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:WARNING: Binary logging is off.
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: License check successful
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: License check successful for lvm-snapshot.pl
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-set=db9.abc.bak
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-date=20100728000004
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: SQL-server-os=Linux/Unix
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-type=regular
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: host=db9.abc.bak.test.com
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-date-epoch=1280300404
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: retention-policy=3D
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: mybak-abc-version=ABC for SQL Enterprise Edition - version 3.1
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: SQL-version=5.1.32-test-SMP-log
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-directory=/home/backups/db9.abc.bak/20100728000004
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-level=0
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: backup-mode=raw
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE END: Initialization
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE START: Running pre backup plugin
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE START: Flushing logs
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE END: Flushing logs
Wed Jul 28 00:00:05 2010: db9.abc.bak:backup:INFO: PHASE START: Creating snapshot based backup
Wed Jul 28 00:00:11 2010: db9.abc.bak:backup:INFO: Wed Jul 28 00:49:53 2010: test203.bak_lvm:backup:INFO: raw-databases-snapshot=test SQL sgl
Wed Jul 28 00:49:53 2010: test203.bak_lvm:backup:INFO: PHASE END: Creating snapshot based backup
Wed Jul 28 00:49:53 2010: test203.bak_lvm:backup:INFO: PHASE START: Calculating backup size & checksums
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: last-backup=/home/backups/test203.bak_lvm/20100726200004
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: PHASE END: Calculating backup size & checksums
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: read-locks-time=00:00:05
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: flush-logs-time=00:00:00
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
My working output now:
===================================================================================
[Jul][28][2010][/var/log/mybackup.log][server1]
===================================================================================
REPORTING SUMMARY for SERVER : server1
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
Wed Jul 28 00:49:54 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
The output I need to see would be something like this:(data file with separated by ';' for example)
MyDate=Wed Jul 28;MyBackupSet= test203.bak_lvm;MyBackupSize=187.24 GB;MyBackupTime=04:49:51;MyBackupStat=Backup succeeded
Use 'capturing parentheses' to identify the bits you want to deal with.
if ($line =~ m/(.* $mon $day) \d{2}:\d{2}:\d{2} $year:.*
(ERROR:|backup-date=|backup-size=|
backup-time=|backup-status)/x) {
You will need to do some surgery on the second set of parentheses - those surrounding the start of the various keywords. You may have to chop those out in bits and pieces inside the condition.
When you have all the data extracted into variables, use Text::CSV to handle CSV output (and input).
There are a myriad modules to handle HTML or XML (over 2000, and I think over 3000, with HTML in their name - I happened to look yesterday). Many of those won't be applicable, but CPAN is your friend.
Answering questions posed by comments
Would I split them off into separate variables as well? The first part gives me the date/time that I need. The next filter then gives me 1) Error: 2)backup-date= 3)backup-size= ...etc.
More or less. Unfortunately, you don't show some representative input lines, which means it is hard to tell what might be best. However, it seems likely that a scheme such as:
while (my $line = <$fh>)
{
chomp $line;
if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year:/)
{
my $date = $1;
my %items = ();
$line =~ s/.* $mon $day \d\d:\d\d:\d\d $year://;
while ($line =~ m/(ERROR|backup-date|backup-size|
backup-time|backup-status)
[:=]([^:]+)/x)
{
my $key = $1;
my $val = $2;
$items{$key} = $val;
$line =~ s/$key[:=]$val[:=]?//;
}
# The %items hash contains the split out information.
# Now write the data for this line of the log file.
}
}
There might well be better ways to handle the trimming (but it is Perl so TMTOWTDI), but the basic idea here is to catch the lines that are interesting, then progressively chop the bits of interest out of the line, so the line grows shorter on each iteration (therefore, eventually terminating the inner while loop).
Note the use of the /x modifier to allow for a more readable regex split over lines (I edited the original answer version to use that too). I've also allowed 'ERROR' to be followed by an '=' or the other keywords to be followed by ':'; it seems unlikely that you'd get false matches that way, and it simplifies the regex substitute operations. The initial pattern match no longer requires one of the subsections to be present, either. You must judge for yourself whether those small changes (which might pick up non-conforming information) matter or not. For most of my purposes, the chance of the mismatch is small enough not to be an issue - but for legal reasons, it might not be acceptable to you.
Answering questions posed by 'answer'
I manufactured some data:
Wed Jul 30 00:49:51 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
Wed Jul 30 00:49:52 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
Wed Jul 30 00:49:53 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
Wed Jul 30 00:49:51 2010: backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded
I took the script in the answer and hacked and instrumented it - making it standalone.
I also removed the dependency on specific files - it reads standard input and writes to standard output. It makes my testing easier - and the code more flexible.
use strict;
use warnings;
use constant debug => 0;
my $mon = 'Jul';
my $day = 30;
my $year = 2010;
while (my $line = <>)
{
chomp $line;
print "Line: $line\n" if debug;
if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year:/) #Mon Jul 26 22:00:02 2010:
{
print "### Scan\n";
my $date = $1;
print "$date\n";
my %items = ();
$line =~ s/.* $mon $day \d\d:\d\d:\d\d $year://;
print "Line: $line\n" if debug;
while ($line =~ m/(ERROR|backup-date|backup-size|backup-time|backup-status)[:=]([^:]+)/)
{
my $key = $1;
my $val = $2;
$items{$key} = $val;
$line =~ s/$key[:=]$val[:=]?//;
print "$key=$val\n";
print "Line: $line\n" if debug;
}
print "### Verify\n";
for my $key (sort keys %items)
{
print "$key = $items{$key}\n";
}
}
}
The output I get is:
### Scan
Wed Jul 30
backup-size=417.32 GB
### Verify
backup-size = 417.32 GB
### Scan
Wed Jul 30
backup-time=04
### Verify
backup-time = 04
### Scan
Wed Jul 30
backup-status=Backup succeeded
### Verify
backup-status = Backup succeeded
### Scan
Wed Jul 30
backup-size=417.32 GB
backup-time=04
backup-status=Backup succeeded
### Verify
backup-size = 417.32 GB
backup-status = Backup succeeded
backup-time = 04
The verify loop prints out the data from the '%items' hash quite happily. With the debug value set to 1 instead of 0, the output I get is:
Line: Wed Jul 30 00:49:51 2010: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
### Scan
Wed Jul 30
Line: test203.bak_lvm:backup:INFO: backup-size=417.32 GB
backup-size=417.32 GB
Line: test203.bak_lvm:backup:INFO:
### Verify
backup-size = 417.32 GB
Line: Wed Jul 30 00:49:52 2010: test203.bak_lvm:backup:INFO: backup-time=04:49:51
### Scan
Wed Jul 30
Line: test203.bak_lvm:backup:INFO: backup-time=04:49:51
backup-time=04
Line: test203.bak_lvm:backup:INFO: 49:51
### Verify
backup-time = 04
Line: Wed Jul 30 00:49:53 2010: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
### Scan
Wed Jul 30
Line: test203.bak_lvm:backup:INFO: backup-status=Backup succeeded
backup-status=Backup succeeded
Line: test203.bak_lvm:backup:INFO:
### Verify
backup-status = Backup succeeded
Line: Wed Jul 30 00:49:51 2010: backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded
### Scan
Wed Jul 30
Line: backup-size=417.32 GB:backup-time=04:49:51:backup-status=Backup succeeded
backup-size=417.32 GB
Line: backup-time=04:49:51:backup-status=Backup succeeded
backup-time=04
Line: 49:51:backup-status=Backup succeeded
backup-status=Backup succeeded
Line: 49:51:
### Verify
backup-size = 417.32 GB
backup-status = Backup succeeded
backup-time = 04
The substitute operations delete the previously matched part of the line. There are ways of continuing a match where you left off - see \G at the 'perlre' page.
Note that the regex is crafted to stop at the first colon after the 'colon or equals' after the keyword. That means it truncates the backup time. One moral is "do not use a separator that can appear in the data". Another is "provide sample data so people can help you more easily". Another is "provide complete but minimal working scripts where possible".
Processing the sample data
Now that we have the sample input data, we can see that you need slightly different processing. This script:
use strict;
use warnings;
use constant debug => 0;
my $mon = 'Jul';
my $day = 28;
my $year = 2010;
my %items = ();
while (my $line = <>)
{
chomp $line;
print "Line: $line\n" if debug;
if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year: ([^:]+):backup:/) #Mon Jul 26 22:00:02 2010:
{
print "### Scan\n" if debug;
my $date = $1;
my $set = $2;
print "$date ($set): " if debug;
$items{$set}->{'a-logdate'} = $date;
$items{$set}->{'a-dataset'} = $set;
if ($line =~ m/(ERROR|backup-date|backup-size|backup-time|backup-status)[:=](.+)/)
{
my $key = $1;
my $val = $2;
$items{$set}->{$key} = $val;
print "$key=$val\n" if debug;
}
}
}
print "### Verify\n";
for my $set (sort keys %items)
{
print "Set: $set\n";
my %info = %{$items{$set}};
for my $key (sort keys %info)
{
printf "%s=%s;", $key, $info{$key};
}
print "\n";
}
produces this result on the sample data file.
### Verify
Set: db9.abc.bak
a-dataset=db9.abc.bak;a-logdate=Wed Jul 28;backup-date=20100728000004;
Set: test203.bak_lvm
a-dataset=test203.bak_lvm;a-logdate=Wed Jul 28;backup-size=417.32 GB;backup-status=Backup succeeded;backup-time=04:49:51;
Note that now we have sample data, we can see that there is only one key/value pair per line, but there are multiple systems backed up per day. So, the inner while loop becomes a simple if. The printing out occurs at the end. And I'm using a 'two-tier' hash. The %items contains an entry for each data set; the entry, though, is a reference to a hash. Not necessarily something for novices to play with, but it fell into place very naturally with the previous code. Note, too, that this version doesn't hack the line - there's no need since there's only one lot of data per line.
Can it be improved - yes, undoubtedly. Does it work? Yes, more or less... Can it be hacked into shape? Yes, it can be hacked to work as you need.
#Jonathan- I wrote out the text file within the while loop. It seems to work. I tried doing it after the second while loop as you suggested in your comment. I'm not sure why it didn't work.
open (my $MYDATAFILE, ">/home/test/myout.txt") || die "cannot append $!";
open(my $fh,"ssh $server cat $file |") or die "can't open log $server:$file: $!\n";
while (my $line = <$fh>)
{
chomp $line;
if ($line =~ m/(.* $mon $day) \d\d:\d\d:\d\d $year:/) #Mon Jul 26 22:00:02 2010:
{
my $date = $1;
#print $date;
my %items = ();
$line =~ s/.* $mon $day \d\d:\d\d:\d\d $year://;
while ($line =~ m/(ERROR|backup-date|backup-size|backup-time|backup-status)[:=]([^:]+)/)
{
my $key = $1;
my $val = $2;
$items{$key} = $val;
$line =~ s/$key[:=]$val[:=]?//;
#print "[$key]";
#print "[$val]";
print $MYDATAFILE "$key=$val";
}
# The %items hash contains the split out information.
# Now write the data for this line of the log file.
}
}