List of results from grep

The following grep command gives me the number of requests from July 1st to July 31st between 8 a.m. and 4 p.m.
zgrep -E "[01\-31]/Jul/2021:[08\-16]" localhost_access.log* | wc -l
I don't want the total for the whole month, though, but the number of requests per day. I could of course run the command 31 times, but that's tedious. Is there a way to display the requests per day one below the other, so that I get, for example, the following as a result (ideally sorted by count):
543
432
321
etc.
How can I do that?

You want to count lines based on a certain value in each line. That's a good job for awk. With grep alone, you would have to process the input files once per day. Either way, we need to fix your regex first:
zgrep -E "[01\-31]/Jul/2021:[08\-16]" localhost_access.log* | wc -l
[08\-16] is a bracket expression matching the single characters 0, 8, -, 1 and 6. What you want to match is (0[89])|(1[0-6]); that is, 0 followed by 8 or 9, or 1 followed by a digit in the range 0-6. Note that the alternation has to be grouped when it is embedded in a larger expression, otherwise the 1[0-6] branch matches on its own, detached from the date. To make it easier, we assume zero-padded days in the date and therefore match the day with [0-9]{2} (two digits).
Here's a complete awk for your task:
awk -F/ '/[0-9]{2}\/Jul\/2021:((0[89])|(1[0-6]))/{a[$1]++}END{for (i in a) print "day " i ": " a[i]}' localhost_access.log*
Explanation:
/[0-9]{2}\/Jul\/2021:((0[89])|(1[0-6]))/ matches date + time (at hours 08-16) for every day in July
{a[$1]++} builds an array with key=day and a counter of occurrences.
END{for (i in a) print "day " i ": " a[i]} prints the array once all input files have been processed
Because we've set the field separator to /, a[$1] assumes the day is the first /-separated field; if there are, say, two more slashes before the actual date, address the correct position with a[$3]. (Of course this can be solved in a more dynamic way.)
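One such dynamic variant (a sketch, assuming GNU awk for its three-argument match()) pulls the day out of the regex capture itself, so the field position no longer matters:
awk 'match($0, /([0-9]{2})\/Jul\/2021:((0[89])|(1[0-6]))/, m){a[m[1]]++}END{for (i in a) print "day " i ": " a[i]}' localhost_access.log*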
Example:
$ cat localhost_access.log
01/Jul/2021:08 log message
01/Jul/2021:08 log message
02/Jul/2021:08 log message
02/Jul/2021:07 log message
$ awk -F/ '/[0-9]{2}\/Jul\/2021:((0[89])|(1[0-6]))/{a[$1]++}END{for (i in a) print "day " i ": " a[i]}' localhost_access.log*
day 01: 2
day 02: 1
Run zcat | awk in case your log files are compressed, but remember the regex above searches for "Jul/2021" only.
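To get the per-day counts sorted by number, as the question asked, pipe the output through sort; with the "day DD: N" format above, the count is the second :-separated field:
awk -F/ '/[0-9]{2}\/Jul\/2021:((0[89])|(1[0-6]))/{a[$1]++}END{for (i in a) print "day " i ": " a[i]}' localhost_access.log* | sort -t: -k2 -rn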

Related

Sed extract section of log file with timestamp boundaries

I have a selection of log files containing amongst other things time stamps.
FWIW the format is YYYY-MM-DD HH:MM:SS.sss (i.e. millisecond granularity, but no further)
Happily for me I can reasonably expect these timestamps to be both sorted chronologically AND unique.
However I am running into issues extracting the portion of the log file falling between two timestamps.
first timestamp in my file is 21:27:57.545
last timestamp in my file is 21:28:45.631
Syntax I am using is e.g.
sed -n '/21:28:10*/,/21:28:22*/p'
This is yielding some odd results (I am sure user error)
A start time of 21:28:10* gives me timestamps starting at 21:28:10.043 (so far so good, as the prior entry was 21:28:09.484, so it is starting in the right place).
However, a start time of 21:28:09* gives me timestamps starting at 21:28:00.003.
The end time is equally odd: an end time of 21:28:22* yields timestamps up to and including 21:28:20.050, yet I know for a fact that there are timestamps after that, as follows:
2017-05-10 21:28:21.278, 901
2017-05-10 21:28:21.303, 901
2017-05-10 21:28:21.304, 901
2017-05-10 21:28:21.483, 901
2017-05-10 21:28:22.448, 901
Therefore I am wondering if this is something to do with how sed interprets the strings: does it treat them as plain text? Is there a one-liner way to do what I am trying to do? Ideally I would be able to specify the start and end timestamps down to the same granularity as the actual data (in this case, milliseconds).
TIA
You should use .* instead of *.
The RE 21:28:10* matches strings containing 21:28:1 followed by zero or more 0 characters. That explains both oddities: /21:28:09*/ means 21:28:0 followed by zero or more 9s, so the range starts at the first line containing 21:28:0 (21:28:00.003), and /21:28:22*/ means 21:28:2 followed by zero or more 2s, so the range ends at the first matching line after the start (21:28:20.050).
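For this data the simplest fix is to drop the trailing * entirely; and because the timestamps appear literally in the lines, the range addresses can be as precise as the data itself, down to milliseconds (a sketch using the timestamps quoted above; note that sed prints through to end of file if no line matches the end address):
sed -n '/21:28:09/,/21:28:22\.448/p' your_log_file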
If you want to get really crazy:
#!/bin/bash
T1="$(date -d '2017-05-10 21:28:21' +'%s').300" # your start time
T2="$(date -d '2017-05-10 21:28:21' +'%s').400" # your end time
while IFS= read -r L
do
    D="$(echo "$L" | cut -c1-19)"       # assuming the line starts with the timestamp
    T=$(date -d "$D" +'%s')             # convert to seconds since the epoch
    T="$T.$(echo "$L" | cut -c21-23)"   # re-attach the milliseconds
    if [ "$(echo "$T > $T1" | bc -l)" = 1 ] && [ "$(echo "$T < $T2" | bc -l)" = 1 ]
    then
        echo "HIT: $L"
    else
        echo "NO!: $L"
    fi
done < your_log_file
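Since these ISO-style timestamps sort lexicographically in chronological order, a plain string comparison gives the same millisecond-precision filter without any per-line date arithmetic (a sketch, assuming the timestamp occupies the first 23 characters of each line):
awk -v a='2017-05-10 21:28:21.300' -v b='2017-05-10 21:28:21.400' \
    '{ts=substr($0,1,23)} ts>a && ts<b' your_log_file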

Matching bash variables as number literals with grep

I have a (GNU) bash script which establishes two variables to be matched in a file.
hour=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f1)
dom=$(head -n 1 sensorstest.log | cut -f5 | cut -d"-" -f4)
...and matches them to other occurrences in the file
grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log
Here is an example of the script calculating the mean for all values in field 2 of the input file for the given hour of day.
hMean=$(grep -E [^0-9]+"$hour"-[0-9]+-[0-9]+-"$dom"-[0-9]+-[0-9]{4} sensorstest.log | cut -f2 | awk '{sum+=$1}{count++}{mean=sum/count} END {printf("%.2f",mean)}')
Here is an example of the cleanup of the input file.
echo "removing: "$hour"th hour of the "$dom"th day of the "$month"th month"
sed -i -r '/'"$hour"'-[0-9]+-[0-9]+-'"$dom"'-'"$month"'-[0-9]{4}/d' sensorstest.log
And finally... Here is an example line in the file:
The format is:
status<tab>humidity<tab>temperature<tab>unix timestamp<tab>time/date
OK 94.4 16.9 1443058486 1-34-46-24-9-2015
I am attempting to match all instances of the hour on the day of the first entry in the file.
This works fine for single-digit numbers; however:
Problem: numbers over 9 are matched as two single-digit numbers, so 12 matches 1, 2, 12, 21... etc.
Here is an example of where it trips up:
OK 100 17.2 1442570381 9-59-41-18-9-2015
OK 100 17.1 1442570397 9-59-57-18-9-2015
Moistening 100 17.6 1442574014 11-0-14-18-9-2015
Moistening 100 17.6 1442574030 11-0-30-18-9-2015
Here the output skips to 0-0-0-19-9-2015 (and yes I am missing an hour of entries from the log)
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 16.5,17.2,16.90,.7 1442566811 9-0-0-18-9-2015
removing: 9th hour of the 18th day of the 9th month
$ sudo statanhourtest.sh
100,1.4,1.40,-98.6 18.3,18.8,18.57,.5 1442620804 0-0-0-19-9-2015
removing: 0th hour of the 19th day of the 9th month
The problem is only happening with the hours. the day ($dom) is matching fine.
I have tried using the -w option with grep, but I think this only returns the exact match, whereas I need the whole line.
There's not much online about matching numbers literally in grep, and I found nothing on using bash variables as number literals.
Any help or relevant links would be greatly appreciated.
EDIT:
I have solved the problem after a night of dredging through the script.
The problem lay in my sed expression right at the end: following a suggestion from another thread, I had single-quoted parts of the sed expression and double-quoted the variables for expansion by the shell.
Double-quoting the whole expression solved the problem.
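Presumably the fixed line looks something like this (a sketch of the described change, not the asker's verbatim code):
sed -i -r "/${hour}-[0-9]+-[0-9]+-${dom}-${month}-[0-9]{4}/d" sensorstest.log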
The awk suggestion has greatly increased the efficiency and accuracy of the script. Thanks again.
awk to the rescue!
I think you can combine everything to a simple awk script without needing any regex. For example,
awk 'NR==1{split($NF,h,"-")} {split($NF,t,"-")} t[1]==h[1] && t[4]==h[4]'
will parse the time stamp on the first row of the file and filter only the records whose hour and day match it.
This will take the average of field 2:
awk 'NR==1{split($NF,h,"-")}
     {split($NF,t,"-")}
     t[1]==h[1] && t[4]==h[4]{
       sum+=$2
       c++
     }
     END{
       print "Average: " sum/c
     }'
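For example, run against the four sample log lines quoted earlier (assuming they are saved as sensorstest.log), the first row's hour and day are 9 and 18, so only the two hour-9 rows are averaged:
$ awk 'NR==1{split($NF,h,"-")} {split($NF,t,"-")} t[1]==h[1] && t[4]==h[4]{sum+=$2; c++} END{print "Average: " sum/c}' sensorstest.log
Average: 100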

grep through a file conditionally in both directions

I have a log file written to by several instances of a cgi script. I need to extract certain information, with the following typical workflow:
search for the first occurrence of RequestString
extract PID from that log line
search backwards for the first occurrence of PID<separator>ConnectionString, to identify the client that initiated the request
do something with ConnectionString and repeat the search from after 'RequestString'
What is the best way to do this? I was thinking of writing a perl script that caches the last N lines and then matches through those lines to perform step 3.
Is there any better way to do this? Like extended regex that would do exactly this?
Sample with line numbers for reference -- not part of the file:
1 date pid1 ConnectionString1
2 date pid2 ConnectionString2
3 date pid3 ConnectionString3
4 date pid2 SomeOutput2
5 date pid2 SomeOutput2
6 date pid4 ConnectionString4
7 date pid3 SomeOutput3
8 date pid4 RequestString4
9 date pid1 SomeOutput1
10 date pid1 ConnectionString1
11 date pid1 RequestString1
12 date pid5 RequestString5
When I grep through this sample file, I wish for the following to match:
line 8, paired with line 6
line 11, paired with line 10 (and not with line 1)
Specifically, the following shouldn't be matched:
line 12, because no matching ConnectionString with that pid is found (pid5)
line 1, because there is a newer ConnectionString for that pid before the next RequestString for that pid (line 10; imagine that the first connection attempt failed before the RequestString was logged)
any of the lines from pid2/pid3, because they don't have a RequestString logged.
I could imagine writing a regex with the option for . to match \n: ((pid\d)\s*(ConnectionString\d))(?!\1).*\2\s*RequestString\d, and then using \3 to identify the client.
However, there are disproportionately more (perhaps between 1000 and 10000 times more) ConnectionStrings than RequestStrings, so my intuition was to first go for the RequestString and then backtrack.
I guess I could play with (?<=...) for lookbehind, but the distance between ConnectionStrings and RequestStrings is essentially arbitrary -- will that work well?
Something along these lines:
#!/bin/bash
# Find and number all RequestStrings, then loop through them
grep -n RequestString file | while IFS=":" read -r n string; do
    echo "$n,$string" # Debug
    head -n "$n" file | tail -r | grep -m1 Connection
done
Output
4,RequestString 1
6189:Connection
7,RequestString 2
7230:Connection
9,RequestString 3
8280:Connection
with this input file
6189:Connection
RequestString 1
7229:Connection
7230:Connection
RequestString 2
8280:Connection
RequestString 3
Note: I used tail -r because OSX lacks tac which I would have preferred.
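The loop above pairs each RequestString with the nearest preceding Connection line regardless of PID. A PID-aware variant is straightforward in awk (a sketch, assuming the sample format above, where the PID is the second whitespace-separated field and the markers appear literally; deleting after a pairing is a design choice matching "repeat the search from after RequestString"):
awk '/ConnectionString/ { conn[$2] = $0 }        # remember the latest ConnectionString per PID
     /RequestString/ && ($2 in conn) {
         print $0 " -> " conn[$2]                # pair the request with its client
         delete conn[$2]                         # do not reuse a connection for a later request
     }' file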

Get time in HTML tags using curl and grep/sed/awk

I'm trying to extract just the arrival times from this web page. I'm running this in terminal on OSX 10.9.5
http://www.flyokc.com/Arrivals.aspx
I've come as far as isolating just the relevant tags:
curl 'www.flyokc.com/arrivals.aspx' | grep 'labelTime'
However, I'm terrible at regex, so I haven't figured out how to grab just the times from these tags. Any thoughts on how I can do that?
Eventually, I'd like to group them by the hour of the day and display the number of arrivals by hour, in descending order
Parsing HTML/XML with regex is bad. That being said, this seems to work at this moment for your use case:
gawk '
BEGIN{
PROCINFO["sorted_in"]="#ind_num_asc"
FS="[<>: ]+"
}
/labelTime/&&/ContentPlaceHolderMain/{
if($6="PM") a[$4+12]+=1
else a[$4]+=1
}
END{
for(h in a)
print h, a[h]
}' <(curl 'www.flyokc.com/arrivals.aspx' 2>/dev/null)
Edit: an account of what works and why:
Set the field separator to the HTML delimiters, the spacing, and the HH:MM separator.
Then grab the fourth field (the hours) and the sixth field (AM/PM)
(this is only in a sense the regex you asked for...)
If the sixth field is "PM", add 12 to the hour (you want to sort numerically in the end), and increment the count for that hour.
After all input has been processed, display the results. Because the array traversal order has been defined to sort numerically on the keys, no external sort command is necessary.
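If you would rather have the hours ordered by number of arrivals, as the question asked, gawk can pipe its own output through sort in the END block (a sketch; the count is the second output column):
END{
  for(h in a)
    print h, a[h] | "sort -k2,2 -rn"
}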
If you're simply looking to grab the arrival times such as 12:00 PM, etc. awk with curl should work:
curl -s 'http://flyokc.com/arrivals.aspx' | awk '/labelTime/{print substr($2,68,5),substr($3,1,2)}'
Output:
12:47 PM
...
How it works:
curl silently grabs the source of the web page, then awk uses "labelTime" to pick out the lines containing the arrival times. Since awk grabs the entire <span> where the string resides, substr is used to start at position 68, and the result is printed.

bash: Batch reformatting using sed + date?

I have a bunch of data that looks like this:
"2004-03-23 20:11:55" 3 3 1
"2004-03-23 20:12:20" 1 1 1
"2004-03-31 02:20:04" 15 15 1
"2004-04-07 14:33:48" 141 141 1
"2004-04-15 02:08:31" 2 2 1
"2004-04-15 07:56:01" 1 2 1
"2004-04-16 12:41:22" 4 4 1
and I need to feed this data to a program which only accepts time in UNIX (Epoch) format. Is there a way I can change all the dates in bash? My first instinct tells me to do something like this:
sed 's/"(.*)"/`date -jf "%Y-%m-%d %T" "\1" "+%s"`'
But I am not entirely sure that the \1 inside the date call will properly backreference the regex matched by sed. In fact, when I run this, I get the following response:
sed: 1: "s/(".*")/`date -jf "% ...": unterminated substitute in regular expression
Can anyone guide me in the right direction on this? Thank you.
Nothing is going to be expanded between single quotes. And even with double quotes it wouldn't work: the command substitution runs before sed ever sees the input, so \1 would have no value at that point. How about something like this (untested):
while read -r date time a b c    # default IFS splits the five fields
do
    epoch=$(date --date "${date:1} ${time::-1}" '+%s')   # strip the literal quotes, convert to epoch
    printf '%s %s %s %s\n' "$epoch" "$a" "$b" "$c"
done < file
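Alternatively, GNU awk's mktime() can do the whole conversion without spawning date once per line (a sketch, assuming GNU awk and the input shown above):
gawk '{
  gsub(/"/, "")              # drop the literal quotes
  ts = $1 " " $2
  gsub(/[-:]/, " ", ts)      # "2004-03-23 20:11:55" -> "2004 03 23 20 11 55"
  print mktime(ts), $3, $4, $5
}' file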