I have a selection of log files containing, amongst other things, time stamps.
Fwiw the format is YYYY-MM-DD HH:MM:SS.sss (i.e. millisecond granularity but no further).
Happily for me, I can reasonably expect these timestamps to be both sorted chronologically AND unique.
However, I am running into issues extracting the portion of the log file falling between two timestamps.
first timestamp in my file is 21:27:57.545
last timestamp in my file is 21:28:45.631
Syntax I am using is e.g.
sed -n '/21:28:10*/,/21:28:22*/p'
This is yielding some odd results (I am sure it is user error).
A start time of 21:28:10* gives me timestamps starting at 21:28:10.043 (so far so good, as the prior entry was 21:28:09.484, so it is starting in the right place).
However, a start time of 21:28:09* gives me timestamps starting at 21:28:00.003.
The end time is equally odd: an end time of 21:28:22* yields timestamps up to and including 21:28:20.050, however I know for a fact that there are timestamps after that, as follows:
2017-05-10 21:28:21.278, 901
2017-05-10 21:28:21.303, 901
2017-05-10 21:28:21.304, 901
2017-05-10 21:28:21.483, 901
2017-05-10 21:28:22.448, 901
Therefore I am wondering if this is something to do with how sed interprets the strings (is it treating them as plain text?). Is there a one-liner way to do what I am trying to do? Ideally I would be able to specify the start and end timestamps down to the same granularity as the actual data (i.e. in this case milliseconds).
TIA
You should use .* instead of *.
In a regex, * means "zero or more of the preceding item", so the RE 21:28:10* matches strings containing 21:28:1 followed by zero or more 0 characters. That is also why your start time of 21:28:09* matched 21:28:00.003: that line contains 21:28:0 followed by zero 9s.
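With .* the addresses behave as you intended (a sketch; logfile stands in for your file name, and the trailing .* is actually redundant inside a sed address):

sed -n '/21:28:10.*/,/21:28:22.*/p' logfile

Bear in mind that a sed range closes at the first line matching the end address, so any later 21:28:22.xxx lines will not be printed.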
If you want to get really crazy:
#!/bin/bash
T1="$(date -d '2017-05-10 21:28:21' +'%s').300" # your start time, as epoch-seconds.milliseconds
T2="$(date -d '2017-05-10 21:28:21' +'%s').400" # your end time
while IFS= read -r L
do
    D="$(echo "$L" | cut -c1-19)"        # assuming the line starts with the timestamp
    T=$(date -d "$D" +'%s')              # convert to epoch seconds
    T="${T}.$(echo "$L" | cut -c21-23)"  # re-attach the milliseconds
    if [ "$(echo "$T > $T1" | bc -l)" = 1 ] && [ "$(echo "$T < $T2" | bc -l)" = 1 ]
    then
        echo "HIT: $L"
    else
        echo "NO!: $L"
    fi
done < your_log_file
I have a (GNU) bash script which establishes two variables to be matched in a file.
hour=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f1)
dom=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f4)
...and matches them to other occurrences in the file
grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log
Here is an example of the script calculating the mean for all values in field 2 of the input file for the given hour of day.
hMean=$(grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log | cut -f2 | awk '{sum+=$1}{count++}{mean=sum/count} END {printf("%.2f",mean)}');
Here is an example of the cleanup of the input file.
echo "removing: "$hour"th hour of the "$dom"th day of the "$month"th month"
sed -i -r '/'"$hour"'-[0-9]+-[0-9]+-'"$dom"'-'"$month"'-[0-9]{4}/d' sensorstest.log
And finally... Here is an example line in the file:
The format is:
status<tab>humidity<tab>temperature<tab>unix timestamp<tab>time/date
OK 94.4 16.9 1443058486 1-34-46-24-9-2015
I am attempting to match all instances of the hour on the day of the first entry in the file.
This works fine for numbers below 9; however:
Problem: numbers over 9 are being matched as two single-digit numbers, resulting in 12 matching 1, 2, 12, 21... etc.
Here is an example of where it trips up:
OK 100 17.2 1442570381 9-59-41-18-9-2015
OK 100 17.1 1442570397 9-59-57-18-9-2015
Moistening 100 17.6 1442574014 11-0-14-18-9-2015
Moistening 100 17.6 1442574030 11-0-30-18-9-2015
Here the output skips to 0-0-0-19-9-2015 (and yes I am missing an hour of entries from the log)
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 16.5,17.2,16.90,.7 1442566811 9-0-0-18-9-2015
removing: 9th hour of the 18th day of the 9th month
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 18.3,18.8,18.57,.5 1442620804 0-0-0-19-9-2015
removing: 0th hour of the 19th day of the 9th month
The problem is only happening with the hours; the day ($dom) is matching fine.
I have tried using the -w option with grep, but I think that only returns the exact match, whereas I need the whole line.
There's not much online about matching numbers literally in grep. And I found nothing on using bash variables as a number literal.
Any help or relevant links would be greatly appreciated.
EDIT:
I have solved the problem after a night of dredging through the script.
The problem lay with my sed expression right at the end.
The issue was single-quoting parts of the sed expression and double-quoting the variables for expansion by the shell, which I had taken from a suggestion on another thread.
Double-quoting the whole expression solved the problem.
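For reference, the fixed line is the sed command from above with the whole expression double-quoted:

sed -i -r "/${hour}-[0-9]+-[0-9]+-${dom}-${month}-[0-9]{4}/d" sensorstest.log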
The awk suggestion has greatly increased the efficiency and accuracy of the script. Thanks again.
awk to the rescue!
I think you can combine everything to a simple awk script without needing any regex. For example,
awk 'NR==1{split($NF,h,"-")} {split($NF,t,"-")} t[1]==h[1] && t[4]==h[4]'
will parse the time stamp on the first row of the file and filters only the hour and day matching records.
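For example, if the four sample lines from the question were the whole of sensorstest.log, the first record's hour (9) and day (18) are captured into h, and only the matching records print:

$ awk 'NR==1{split($NF,h,"-")} {split($NF,t,"-")} t[1]==h[1] && t[4]==h[4]' sensorstest.log
OK 100 17.2 1442570381 9-59-41-18-9-2015
OK 100 17.1 1442570397 9-59-57-18-9-2015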
This will take the average of field 2
awk 'NR==1 {
    split($NF,h,"-")
}
{
    split($NF,t,"-")
}
t[1]==h[1] && t[4]==h[4] {
    sum+=$2
    c++
}
END {
    print "Average: " sum/c
}' sensorstest.log
I have a "srt" file(like standard movie-subtitle format) like shown in below link:http://pastebin.com/3k8a53SC
Excerpt:
1
00:00:53,000 --> 00:00:57,000
<any text that may span multiple lines>
2
00:01:28,000 --> 00:01:35,000
<any text that may span multiple lines>
But right now the subtitle timing is all wrong, as it lags behind by 9 seconds.
Is it possible to add 9 seconds (+9) to every time entry with regex?
Even if the milliseconds are set to 000 that's fine, but the addition of 9 seconds should adhere to the "60 seconds = 1 minute & 60 minutes = 1 hour" rules.
Also, the subtitle text after each timing entry must not get altered by the regex.
By the way, the time format of each time string is "Hours:Minutes:Seconds,Milliseconds".
Quick answer is "no", that's not an application for regex. A regular expression lets you MATCH text, but not change it. Changing things is outside the scope of the regex itself, and falls to the language you're using -- perl, awk, bash, etc.
For the task of adjusting the time within an SRT file, you could do this easily enough in bash, using the date command to adjust times.
#!/usr/bin/env bash
offset="${1:-0}"
datematch="^(([0-9]{2}:){2}[0-9]{2}),[0-9]{3} --> (([0-9]{2}:){2}[0-9]{2}),[0-9]{3}"
os=$(uname -s)
while IFS= read -r line; do
if [[ "$line" =~ $datematch ]]; then
# Gather the start and end times from the regex
start=${BASH_REMATCH[1]}
end=${BASH_REMATCH[3]}
# Replace the time in this line with a printf pattern
linefmt="${line//[0-2][0-9]:[0-5][0-9]:[0-5][0-9]/%s}\n"
# Calculate new times
case "$os" in
Darwin|*BSD)
newstart=$(date -v${offset}S -j -f "%H:%M:%S" "$start" '+%H:%M:%S')
newend=$(date -v${offset}S -j -f "%H:%M:%S" "$end" '+%H:%M:%S')
;;
Linux)
newstart=$(date -d "$start today ${offset} seconds" '+%H:%M:%S')
newend=$(date -d "$end today ${offset} seconds" '+%H:%M:%S')
;;
esac
# And print the result
printf "$linefmt" "$newstart" "$newend"
else
# No adjustments required, print the line verbatim.
echo "$line"
fi
done
Note the case statement. This script should auto-adjust for Linux, OSX, FreeBSD, etc.
You'd use this script like this:
$ ./srtadj -9 < input.srt > output.srt
Assuming you named it that, of course. Or more likely, you'd adapt its logic for use in your own script.
No, sorry, you can't. Regexes describe regular languages (see the Chomsky hierarchy, e.g. https://en.wikipedia.org/wiki/Chomsky_hierarchy) and cannot do arithmetic.
But with a full programming language like Perl it will work.
It could be a one-liner like this ;-)))
perl -n -e 'if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {print plus9($1).$2.plus9($3).$4."\n";}else{print $_} sub plus9{ ($h,$m,$s)=split(/:/,shift); $t=(($h*60+$m)*60+$s+9); $h=int($t/3600);$r=$t-($h*3600);$m=int($r/60);$s=$r-($m*60);return sprintf "%02d:%02d:%02d", $h, $m, $s;}' movie.srt
with movie.srt like
1
00:00:53,000 --> 00:00:57,000
hello
2
00:01:28,000 --> 00:01:35,000
I like perl
3
00:02:09,000 --> 00:02:14,000
and regex
you will get
1
00:01:02,000 --> 00:01:06,000
hello
2
00:01:37,000 --> 00:01:44,000
I like perl
3
00:02:18,000 --> 00:02:23,000
and regex
You can change the +9 in sub plus9{...} if you want a different delta.
How does it work?
We are looking for lines that match
dd:dd:dd something dd:dd:dd something
and then we call a sub which adds 9 seconds to matched group one ($1) and group three ($3). All other lines are printed unchanged.
Added:
If you want to put the Perl one-liner in a file, say plus9.pl, you can add newlines ;-)
if(/^(\d\d:\d\d:\d\d)([-,\d\s\>]*)(\d\d:\d\d:\d\d)(.*)/) {
print plus9($1).$2.plus9($3).$4."\n";
} else {
print $_
}
sub plus9{
($h,$m,$s)=split(/:/,shift);
$t=(($h*60+$m)*60+$s+9);
$h=int($t/3600);
$r=$t-($h*3600);
$m=int($r/60);
$s=$r-($m*60);
return sprintf "%02d:%02d:%02d", $h, $m, $s;
}
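Run it with -n so Perl supplies the read loop over the file (the output file name here is just an example):

perl -n plus9.pl movie.srt > shifted.srt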
Regular expressions strictly do matching and cannot add/subtract. You can match each datetime string using Python, for example, add 9 seconds to that, and then rewrite the string in the appropriate spot. The regular expression I would use to match it would be the following:
(?<hour>\d+):(?<minute>\d+):(?<second>\d+),(?<msecond>\d+)
It has labeled capture groups so it's really easy to get each section (you won't need msecond, but it's there for visualization, I guess). Note that Python's re module spells named groups as (?P<hour>...).
Regex101
I'm trying to extract just the arrival times from this web page. I'm running this in terminal on OSX 10.9.5
http://www.flyokc.com/Arrivals.aspx
I've come as far as isolating just the tags
curl 'www.flyokc.com/arrivals.aspx' | grep 'labelTime'
However, I'm terrible at RegEx so I haven't figured out just to grab the times from these tags. Thoughts on how I can do that?
Eventually, I'd like to group them by the hour of the day and display the number of arrivals by hour, in descending order
Parsing HTML/XML with regex is bad. That being said, this seems to work at this moment for your use case:
gawk '
BEGIN{
    PROCINFO["sorted_in"]="@ind_num_asc"
    FS="[<>: ]+"
}
/labelTime/&&/ContentPlaceHolderMain/{
    if($6=="PM") a[$4+12]+=1
    else a[$4]+=1
}
END{
    for(h in a)
        print h, a[h]
}' <(curl 'www.flyokc.com/arrivals.aspx' 2>/dev/null)
Edit: an account of what works and why:
Set the field separator to the HTML delimiters, the spacing, and the HH:MM separator.
Then grab the fourth field (the hours)
(this is only in a sense the regex you asked for...)
If the sixth field is "PM", add 12 hours to it (you want to sort numerically in the end), and add 1 to the count for that hour.
After processing the input, display the results. Because the array traversal order has been defined to sort numerically on the keys, no external sort command is necessary.
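Since you eventually want the hours in descending order, the same mechanism covers that; just swap the sort specification (gawk 4+; "@val_num_desc" would instead order by the counts):

PROCINFO["sorted_in"]="@ind_num_desc"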
If you're simply looking to grab the arrival times such as 12:00 PM, etc. awk with curl should work:
curl -s 'http://flyokc.com/arrivals.aspx' | awk '/labelTime/{print substr($2,68,5),substr($3,1,2)}'
Output:
12:47 PM
...
How it works:
CURL silently grabs the source of the webpage, then AWK takes the output and uses "labelTime" to pick out the line which contains the arrival times. Since awk grabs the entire <span> where the string resides, substring is used to start at position 68, then the result is printed.
I am trying to figure out how to take a log that has millions of lines in
a day and easily dump a range (based on begin and end timestamp) of lines to
another file. Here is an excerpt from the log to show how it is constructed:
00:04:59.703: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
00:04:59.703: 20121114070459 - XXX - 7028429950500220900257201211131000000003536
00:04:59.703: </abcxyz,v1>
00:04:59.711: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
00:04:59.711: 20121114070459 - XXX - 7028690080500220900257201211131000000003538
00:04:59.711: </abcxyz,v1>
00:04:59.723: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
00:04:59.723: 20121114070459 - XXX - 7028395150500220900257201211131000000003540
00:04:59.723: </abcxyz,v1>
00:04:59.744: <abcxyz,v1 from YYY::Process at 14 Nov 2012 07:04:59>
As you can see there are multiple lines per millisecond. What I would like to
do is be able to give as an input a begin and end timestamp such as
begin=11:00: and end=11:45: and have it dump all the lines in that range.
I have been racking my brain trying to figure this one out, but so far haven't
come up with a satisfactory result.
UPDATE: Of course just the first thing I try after I post the question seems to
work. Here is what I have:
sed -n '/^06:25/,/^08:25:/p' logFile > newLogFile
More than happy to take suggestions if there is a better way.
I think your sed one-liner is fine for the task.
Besides, you can optimize it for speed (considering the file has millions of lines) by exiting the sed script once the desired block has been printed (assuming there are no repeated blocks of time in a file).
sed -n '/^06:25/,/^08:25/{p;/^08:25/q}' logFile > newLogFile
This tells sed to quit when the last line of the block was found.
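With the begin=11:00: and end=11:45: bounds from the question, that becomes:

sed -n '/^11:00:/,/^11:45:/{p;/^11:45:/q}' logFile > newLogFile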
You can use the following one-liner:
awk -v start='00:04:59.000' -v end='00:04:59.900' \
'{if(start <= $1 && end >= $1) print $0}' < your.log > reduced.log
Notice the full format of the start and end ranges; keeping both bounds in the same fixed-width format as the log keeps things simple, because plain string comparison then orders the timestamps correctly.
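For example, with the log excerpt above (note that $1 carries a trailing colon, so pick an end bound just past the last millisecond you want):

awk -v start='00:04:59.700' -v end='00:04:59.720' \
    '{if(start <= $1 && end >= $1) print $0}' < your.log

prints only the 00:04:59.703 and 00:04:59.711 lines.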
I've got a huge pile of exported emails in .eml format that I'm grepping through for keywords with something like this:
egrep -iR "keyword|list|foo|bar" *
This results in a number of false positives when using relatively short keywords due to base64 encoded email attachments like this:
Inbox/Email Subject.eml:rcX2aiCZBfoogjNUShcWC64U7buTJE3rC5CeShpo/Uhz0SeGz290rljsr6woPNt3DQ0iFGzixrdj
Inbox/Email Subject.eml:3qHXNEj5sKXUa3LxfkmEAEWOpW301Pbarq2Jr2IswluaeKqCgeHIEFmFQLeY4HIcTBe3wCf6HzPL
Is there a regex I can write that will identify and exclude these matches, or can I tell grep to stop reading a file once it gets to a line that says "Content-Transfer-Encoding: base64"?
If you exclude any matches consisting entirely of base64, you should be left with only the interesting matches. As an approximation, excluding any line consisting entirely of base64 with a length longer than, say, 60 characters is probably good enough for immediate human consumption.
egrep -iR "keyword|list|foo|bar" . |
egrep -v ':[0-9A-Za-z+/]{60,}$' |
less
If you need improved accuracy, maybe prefilter the messages to exclude any attachments. You might also want to check that the excluded lines are an even multiple of 4 characters long, although it's unlikely that you have a lot of false positives for that particular criterion.
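A sketch of that refinement: base64 comes in 4-character groups, so require the matched tail to be a multiple of 4 characters and at least 60 long:

egrep -iR "keyword|list|foo|bar" . |
egrep -v ':([0-9A-Za-z+/]{4}){15,}$' |
less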
You might find the -w grep option useful (match only complete words), although it will only reduce and not eliminate false positives: only two of the 64 base-64 symbols (+ and /) are non-alphanumeric, so a keyword inside a base-64 string is still flanked by word boundaries with probability roughly (2/64)², about 1/1024.
You can get grep to stop matching when it finds a given string, such as Content-Transfer-Encoding: base64, but only at the cost of always stopping at the first match, by also matching that string and capping the match count at 1 with -m 1. However, you then have to filter the matches:
grep -EiR -m 1 -e "Content-Transfer-Encoding: base64" -e "foo|bar" * |
grep -v -i "Content-Transfer-Encoding: base64"
You could do this more easily and more precisely with gawk:
gawk 'BEGIN {IGNORECASE=1}
      /Content-Transfer-Encoding: base64/ {nextfile}
      /foo|bar/ {print FILENAME":"$0}' *
(Note: nextfile is a gawk extension. There are other ways to do this, but not as convenient.)
That's a bit much to type every time you want to do this, so you'd be better-off making it a shell function (or script, but I personally prefer functions.)
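A minimal sketch of such a function (the name mailgrep and its argument layout are just illustrative choices):

mailgrep() {
    # usage: mailgrep 'pattern' file...
    gawk -v pat="$1" 'BEGIN {IGNORECASE=1}
        /Content-Transfer-Encoding: base64/ {nextfile}  # skip the rest of a file once an attachment starts
        $0 ~ pat {print FILENAME ":" $0}' "${@:2}"
}
# e.g. mailgrep "keyword|list|foo|bar" Inbox/*.eml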