I am trying to figure out how to take a log that has millions of lines in
a day and easily dump a range of lines (based on begin and end timestamps) to
another file. Here is an excerpt from the log to show how it is constructed:
00:04:59.703: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
00:04:59.703: 20121114070459 - XXX - 7028429950500220900257201211131000000003536
00:04:59.703: </abcxyz,v1>
00:04:59.711: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
00:04:59.711: 20121114070459 - XXX - 7028690080500220900257201211131000000003538
00:04:59.711: </abcxyz,v1>
00:04:59.723: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
00:04:59.723: 20121114070459 - XXX - 7028395150500220900257201211131000000003540
00:04:59.723: </abcxyz,v1>
00:04:59.744: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
As you can see, there are multiple lines per millisecond. What I would like to
do is be able to give a begin and an end timestamp as input, such as
begin=11:00: and end=11:45:, and have it dump all the lines in that range.
I have been racking my brain trying to figure this one out, but so far haven't
come up with a satisfactory result.
UPDATE: Of course, the first thing I try after posting the question seems to
work. Here is what I have:
sed -n '/^06:25/,/^08:25:/p' logFile > newLogFile
More than happy to take suggestions if there is a better way.
I think your sed one-liner is OK for the task.
Besides, you can optimize it for speed (considering the file has millions of lines) by exiting the sed script once the desired block has been printed (assuming there are no repeated blocks of time in the file).
sed -n '/^06:25/,/^08:25/{p;/^08:25/q}' logFile > newLogFile
This tells sed to quit when the last line of the block was found.
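If you want to feed the begin and end timestamps in as inputs, as the question asks, a minimal sketch with shell variables (the names b and e are my own):
b='06:25'; e='08:25'
sed -n "/^$b/,/^$e/{p;/^$e/q}" logFile > newLogFile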
You can use the following one-liner:
awk -v start='00:04:59.000' -v end='00:04:59.900' \
'{if(start <= $1 && end >= $1) print $0}' < your.log > reduced.log
Notice the full format of the start and end ranges: because the timestamps are fixed-width and zero-padded, plain string comparison sorts them correctly, so spelling the bounds out in full keeps it simple and doesn't cause much trouble IMO.
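If speed matters on a file with millions of lines, the same string comparison can be combined with an early exit, in the spirit of the sed answer above (a sketch assuming the log is sorted by time):
awk -v start='00:04:59.000' -v end='00:04:59.900' '$1 > end {exit} $1 >= start' your.log > reduced.log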
I'm trying to convert some multi-line git history info (extracting file name changes) into a CSV file. Here's my regex and sample file. It's working perfectly on that site.
Regex:
commit (.+)\n(?:.*\n)+?similarity index (\d+)+%\n(rename|copy) from (.+)\n\3 to (.+)\n
Sample input:
commit 2701af4b3b66340644b01835a03bcc760e1606f8
Author: ostrovsky.alex <ostrovsky.alex@a51b5712-02d0-11de-9992-cbdf800730d7>
Date: Sat Oct 16 20:44:32 2010 +0000
* Moved old sources to Maven src/main/java
diff --git a/alexo-chess/src/ao/chess/v2/move/Pawns.java b/alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
similarity index 100%
rename from alexo-chess/src/ao/chess/v2/move/Pawns.java
rename to alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
commit ea53898dcc969286078700f42ca5be36789e7ea7
Author: ostrovsky.alex <ostrovsky.alex@a51b5712-02d0-11de-9992-cbdf800730d7>
Date: Sat Oct 17 03:30:43 2009 +0000
synch
diff --git a/src/chess/v2/move/Pawns.java b/alexo-chess/src/ao/chess/v2/move/Pawns.java
similarity index 100%
copy from src/chess/v2/move/Pawns.java
copy to alexo-chess/src/ao/chess/v2/move/Pawns.java
commit b869f395429a2c1345ce100953bfc6038d9835f5
Author: ostrovsky.alex <ostrovsky.alex@a51b5712-02d0-11de-9992-cbdf800730d7>
Date: Wed Oct 7 22:43:06 2009 +0000
MctsPlayer works
diff --git a/ao/chess/v2/move/Pawns.java b/src/chess/v2/move/Pawns.java
similarity index 100%
copy from ao/chess/v2/move/Pawns.java
copy to src/chess/v2/move/Pawns.java
commit 4c697c510f5154d20be7500be1cbdecbaf99495c
Author: ostrovsky.alex <ostrovsky.alex@a51b5712-02d0-11de-9992-cbdf800730d7>
Date: Wed Sep 23 15:06:17 2009 +0000
* synch
diff --git a/v2/move/Pawns.java b/ao/chess/v2/move/Pawns.java
similarity index 95%
rename from v2/move/Pawns.java
rename to ao/chess/v2/move/Pawns.java
index e0172a3..e3659c5 100644
--- a/v2/move/Pawns.java
+++ b/ao/chess/v2/move/Pawns.java
However, when I try to run the following perl command (in git bash on Windows 10), I only get a single matching line (as opposed to the 4 matches in the sample that you can see on the site I linked to above).
I know it's probably something stupid, like it needs to be in a loop. But I'm confused about slurping with -0777 and applying a pattern multiple times. I tried the -p option, but it prints out the entire input, and I only want to see output from the print (i.e., the CSV lines). I also thought /g would make the pattern apply multiple times to the input file, but since -0777 makes it all one line, I'm not sure anymore.
<Pawns.java.history.txt perl -0777 -ne 'if (/commit (.+)\n(?:.*\n)+?similarity index (\d+)+%\n(rename|copy) from (.+)\n\3 to (.+)\n/g) { print $1.",".$2.",".$3.",".$4.",".$5."\n" }'
The output is only one line, whereas it should be 4 lines in total with the sample file:
2701af4b3b66340644b01835a03bcc760e1606f8,100,rename,alexo-chess/src/ao/chess/v2/move/Pawns.java,alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
Expected output:
2701af4b3b66340644b01835a03bcc760e1606f8,100,rename,alexo-chess/src/ao/chess/v2/move/Pawns.java,alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
ea53898dcc969286078700f42ca5be36789e7ea7,100,copy,src/chess/v2/move/Pawns.java,alexo-chess/src/ao/chess/v2/move/Pawns.java
b869f395429a2c1345ce100953bfc6038d9835f5,100,copy,ao/chess/v2/move/Pawns.java,src/chess/v2/move/Pawns.java
4c697c510f5154d20be7500be1cbdecbaf99495c,95,rename,v2/move/Pawns.java,ao/chess/v2/move/Pawns.java
You just need to convert your if into a while: in scalar context, a /g match resumes where the previous match left off, so the loop walks through all the matches.
perl -0777 -ne 'while (/commit (.+)\n(?:.*\n)+?similarity index (\d+)+%\n(rename|copy) from (.+)\n\3 to (.+)\n/g) { print $1.",".$2.",".$3.",".$4.",".$5."\n" }' file
2701af4b3b66340644b01835a03bcc760e1606f8,100,rename,alexo-chess/src/ao/chess/v2/move/Pawns.java,alexo-chess/src/main/java/ao/chess/v2/move/Pawns.java
ea53898dcc969286078700f42ca5be36789e7ea7,100,copy,src/chess/v2/move/Pawns.java,alexo-chess/src/ao/chess/v2/move/Pawns.java
b869f395429a2c1345ce100953bfc6038d9835f5,100,copy,ao/chess/v2/move/Pawns.java,src/chess/v2/move/Pawns.java
4c697c510f5154d20be7500be1cbdecbaf99495c,95,rename,v2/move/Pawns.java,ao/chess/v2/move/Pawns.java
The //g operator returns the captured results in list context. Since there are 5 sets of capturing parentheses and 4 matches, the returned list has 20 elements. You need to iterate over that list. Your code only looks at the first match. Here's one technique:
perl -0777 -nE '
    @matches = /commit (.+)\n(?:.*\n)+?similarity index (\d+)+%\n(rename|copy) from (.+)\n\3 to (.+)\n/g;
    $" = ",";
    while (@matches) {
        @thismatch = splice @matches, 0, 5;
        say "@thismatch";
    }
' Pawns.java.history.txt
I have a selection of log files containing, amongst other things, timestamps.
FWIW the format is YYYY-MM-DD HH:MM:SS.sss (i.e. millisecond granularity but no further).
Happily for me, I can reasonably expect these timestamps to be both sorted chronologically AND unique.
However, I am running into issues extracting the portion of the log file falling between two timestamps.
The first timestamp in my file is 21:27:57.545.
The last timestamp in my file is 21:28:45.631.
The syntax I am using is, e.g.:
sed -n '/21:28:10*/,/21:28:22*/p'
This is yielding some odd results (I am sure through user error):
a start time of 21:28:10* gives me timestamps starting at 21:28:10.043 (so far so good, as the prior one was 21:28:09.484, so it is starting in the right place);
however, a start time of 21:28:09* gives me timestamps starting at 21:28:00.003.
The end time is equally odd: an end time of 21:28:22* yields timestamps up to and including 21:28:20.050, yet I know for a fact that there are timestamps after that, as follows:
2017-05-10 21:28:21.278, 901
2017-05-10 21:28:21.303, 901
2017-05-10 21:28:21.304, 901
2017-05-10 21:28:21.483, 901
2017-05-10 21:28:22.448, 901
Therefore I am wondering if this has something to do with how sed interprets the strings: is it treating them as text? Is there a one-liner way to do what I am trying to do? Ideally I would be able to specify the start and end timestamps down to the same granularity as the actual data (i.e., in this case, milliseconds).
TIA
You should use .* instead of *.
The RE 21:28:10* matches strings containing 21:28:1 followed by zero or more 0 characters; likewise 21:28:09* matches 21:28:0 followed by zero or more 9s, which is why it starts matching at 21:28:00.003.
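For example, dropping the stray * (a trailing .* is redundant in an unanchored pattern anyway) should behave as expected; the file names here are placeholders:
sed -n '/21:28:10/,/21:28:22/p' logfile > extract.log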
If you want to get really crazy:
#!/bin/bash
T1="$(date -d '2017-05-10 21:28:21' +'%s').300"  # your start time, as epoch-seconds.millis
T2="$(date -d '2017-05-10 21:28:21' +'%s').400"  # your end time, as epoch-seconds.millis
while IFS= read -r L
do
    D="$(echo "$L" | cut -c1-19)"        # assuming the line starts with the timestamp
    T=$(date -d "$D" +'%s')              # convert to seconds since the epoch
    T="${T}.$(echo "$L" | cut -c21-23)"  # re-attach the milliseconds
    if [ "$(echo "$T > $T1" | bc -l)" == 1 ] && [ "$(echo "$T < $T2" | bc -l)" == 1 ]
    then
        echo "HIT: $L"
    else
        echo "NO!: $L"
    fi
done < your_log_file
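As a simpler alternative (my own suggestion, not part of the script above): because the YYYY-MM-DD HH:MM:SS.sss timestamps are fixed-width and zero-padded, plain string comparison already gives millisecond precision, e.g. in awk:
awk -v b='2017-05-10 21:28:21.300' -v e='2017-05-10 21:28:21.400' \
    '{ts=substr($0,1,23)} ts>=b && ts<=e' your_log_file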
I have a (GNU) bash script which establishes two variables to be matched in a file.
hour=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f1)
dom=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f4)
...and matches them to other occurrences in the file
grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log
Here is an example of the script calculating the mean for all values in field 2 of the input file for the given hour of day.
hMean=$(grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log | cut -f2 | awk '{sum+=$1}{count++}{mean=sum/count} END {printf("%.2f",mean)}')
Here is an example of the cleanup of the input file.
echo "removing: "$hour"th hour of the "$dom"th day of the "$month"th month"
sed -i -r '/'"$hour"'-[0-9]+-[0-9]+-'"$dom"'-'"$month"'-[0-9]{4}/d' sensorstest.log
And finally... Here is an example line in the file:
The format is:
status<tab>humidity<tab>temperature<tab>unix timestamp<tab>time/date
OK 94.4 16.9 1443058486 1-34-46-24-9-2015
I am attempting to match all instances of the hour on the day of the first entry in the file.
This works fine for numbers below 9; however, numbers over 9 are being matched as two single-digit numbers, resulting in 12 matching 1, 2, 12, 21... etc.
Here is an example of where it trips up:
OK 100 17.2 1442570381 9-59-41-18-9-2015
OK 100 17.1 1442570397 9-59-57-18-9-2015
Moistening 100 17.6 1442574014 11-0-14-18-9-2015
Moistening 100 17.6 1442574030 11-0-30-18-9-2015
Here the output skips to 0-0-0-19-9-2015 (and yes, I am missing an hour of entries from the log):
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 16.5,17.2,16.90,.7 1442566811 9-0-0-18-9-2015
removing: 9th hour of the 18th day of the 9th month
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 18.3,18.8,18.57,.5 1442620804 0-0-0-19-9-2015
removing: 0th hour of the 19th day of the 9th month
The problem is only happening with the hours. the day ($dom) is matching fine.
I have tried using the -w option with grep, but I think this only returns the exact match, whereas I need the whole line.
There's not much online about matching numbers literally in grep, and I found nothing on using bash variables as number literals.
Any help or relevant links would be greatly appreciated.
EDIT:
I have solved the problem after a night of dredging through the script.
The problem lay with my sed expression right at the end: I was single-quoting parts of the sed expression and double-quoting the variables so the shell would expand them, a pattern I took from a suggestion on another thread.
Double-quoting the whole expression solved the problem.
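For reference, the double-quoted version of the sed call above would look like this (a sketch assuming the same variables):
sed -i -r "/$hour-[0-9]+-[0-9]+-$dom-$month-[0-9]{4}/d" sensorstest.log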
The awk suggestion has greatly increased the efficiency and accuracy of the script. Thanks again.
awk to the rescue!
I think you can combine everything into a simple awk script without needing any regex. For example,
awk 'NR==1{split($NF,h,"-")} {split($NF,t,"-")} t[1]==h[1] && t[4]==h[4]'
will parse the timestamp on the first row of the file and filter only the records matching its hour and day.
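For example, with the four sample lines from the question saved as sample.log (a hypothetical file name), only the rows sharing the first row's hour (9) and day (18) survive:
$ awk 'NR==1{split($NF,h,"-")} {split($NF,t,"-")} t[1]==h[1] && t[4]==h[4]' sample.log
OK 100 17.2 1442570381 9-59-41-18-9-2015
OK 100 17.1 1442570397 9-59-57-18-9-2015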
This will take the average of field 2
awk 'NR==1 {
  split($NF,h,"-")
}
{
  split($NF,t,"-")
}
t[1]==h[1] && t[4]==h[4] {
  sum+=$2
  c++
}
END {
  print "Average: " sum/c
}' sensorstest.log
I'm trying to extract just the arrival times from this web page. I'm running this in terminal on OSX 10.9.5
http://www.flyokc.com/Arrivals.aspx
I've come as far as isolating just the relevant tags:
curl 'www.flyokc.com/arrivals.aspx' | grep 'labelTime'
However, I'm terrible at RegEx so I haven't figured out just to grab the times from these tags. Thoughts on how I can do that?
Eventually, I'd like to group them by the hour of the day and display the number of arrivals by hour, in descending order
Parsing HTML/XML with regex is bad. That being said, this seems to work at this moment for your use case:
gawk '
BEGIN{
    PROCINFO["sorted_in"]="@ind_num_asc"   # traverse a[] in ascending numeric key order
    FS="[<>: ]+"
}
/labelTime/&&/ContentPlaceHolderMain/{
    if($6=="PM") a[$4+12]+=1
    else a[$4]+=1
}
END{
    for(h in a)
        print h, a[h]
}' <(curl 'www.flyokc.com/arrivals.aspx' 2>/dev/null)
Edit: An account of what works and why:
Set the field separator to the HTML delimiters, the spacing, and the HH:MM separator.
Then grab the hour field ($4) and the AM/PM marker ($6)
(this is only in a sense the regex approach you asked for...).
If the marker is "PM", add 12 to the hour (you want to sort numerically in the end), and +1 the count for that hour.
After processing the input, display the results. Because the array traversal order has been defined to sort numerically on the keys, no external sort commands are necessary.
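Since the question asks for the counts in descending order, it is worth noting that gawk's predefined traversal orders also include descending variants. Swapping in
PROCINFO["sorted_in"]="@ind_num_desc"
reverses the hour order, while "@val_num_desc" would order by the counts themselves.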
If you're simply looking to grab the arrival times, such as 12:00 PM, etc., awk with curl should work:
curl -s 'http://flyokc.com/arrivals.aspx' | awk '/labelTime/{print substr($2,68,5),substr($3,1,2)}'
Output:
12:47 PM
...
How it works:
curl silently (-s) grabs the source of the web page, then awk uses "labelTime" to pick out the lines containing the arrival times. Since awk grabs the entire <span> in which the string resides, substr() is used to start at position 68, and the result is printed.
I'm trying to find the files with extensions sh, xls, etc., as shown in the FILTER variable below.
Following is the output of ls -ltr. The output of the script below comes out as hourly_space_update.sh and kent.ksh, but I don't want the .ksh file. Could you please tell me where I'm going wrong with my regex?
[root@SVRVSVN ~]# ls -ltr
total 20
-rw-r--r-- 1 root sqaadmin 44 Oct 9 18:24 hourly_space_update.sh
-rw-r--r-- 1 root sqaadmin 0 Oct 30 12:34 kent.ksh
-rw-r--r-- 1 root sqaadmin 0 Oct 30 12:34 a.abc
-rw-r--r-- 1 root sqaadmin 0 Oct 30 13:02 hh.h
#!/bin/sh
ls -ltr | awk '
BEGIN {
FILTER=".(sh|xls|xlsx|pdf)$"
}
{
for (i = 1; i < 9; i++) $i = ""; sub(/^ */, "");
if(match(tolower($1),FILTER))
{
print $1
}
}'
Try this regexp:
\.(sh|xls|xlsx|pdf)$
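Note that inside an awk string literal the backslash itself must be escaped, so in the script above the assignment would read:
FILTER="\\.(sh|xls|xlsx|pdf)$"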
See the comments I made in the answers you got so far, but more importantly: your approach of testing one of the fields will fail for file names that contain spaces, and any piped solution will fail if one of those whitespace characters is a newline. You should just use the shell:
ls -tr *.sh *.xls *.xlsx *.pdf
and get rid of the need for a filter at all.
If you MUST keep an awk script, though, then the way to write it is this if you can guarantee your file names don't contain any spaces:
ls -ltr | awk 'BEGIN{FILTER="\\.(sh|xlsx?|pdf)$"} tolower($NF) ~ FILTER { print $NF }'
Note that I abbreviated your RE, since "xlsx?" will match "xls" or "xlsx".
Before I give you a solution for file names that contain spaces or newlines, though - why are you using "ls -ltr" instead of simply "ls -tr" if you only want to process the file name?
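For what it's worth, one newline-safe way to collect such a list (a sketch of my own, not the solution promised above) is to skip ls for the selection entirely and let find emit NUL-terminated names:
find . -maxdepth 1 -type f \( -name '*.sh' -o -name '*.xls' -o -name '*.xlsx' -o -name '*.pdf' \) -print0 | xargs -0 ls -tr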
In bash/ksh/zsh, you can use brace expansion:
ls *.{sh,xls,xlsx,pdf}
Also don't parse ls.
Try with the filter (\bsh\b|\bxls\b|\bxlsx\b|\bpdf\b).
With your current filter you also match the .ksh file, because it contains the sh sequence.
Your code actually works in my gawk 4.0.1 running under cygwin.
But how come you don't want to do:
awk 'BEGIN {FILTER=".(sh|xls|xlsx|pdf)$"}{if(match(tolower($9),FILTER)){print $9}}'
This would make the for loop redundant and clean up the code a bit. I guess the output of ls -ltr uses the same format each time you execute it. :)
Unfortunately I do not have access to a clean awk for testing, but you could also try to double-escape the dot as \\. if that is the problem in your awk. A tip is to print $1 before the if statement to make sure it contains what you expect.