Retrieve log pattern via awk - regex

I would like to extract the date, the first five URI path segments, and the ab and cde values from the following logs:
- - [26/Oct/2020:19:50:13 +0000] "GET /five/six/seven/eight/nine/en?from=1603738800&to=1603785600ncludedInRange=false HTTP/1.1" 200 255441 "-" "Opera com.test.super/1.10.4;11072 (Linux;Neon KNWWWfj;0,02.2)" """f799b6b9-747f-4f22-a1bf-4b7de885fc60""-" "-" "-" "-"ab=0.110 cde=0.102
- - [26/Oct/2020:19:50:14 +0000] "GET /one/two/three/four/five/en HTTP/1.1" 200 2832 "-" "Opera com.test.super/1.10.4;11072 (Linux;Neon KNWWWfj;0,02.2)" """19a8ee3c-9cb3-4ba6-9732-eb4923601e92""-" "-" "-" "-"ab=0.111 cde=0.112
Expected output for the first line:
26/Oct/2020:19:50:13 /five/six/seven/eight/nine ab=0.110 cde=0.102
I have tried the following command, but I only get part of it. Can you please help?
awk '{print $4 "\t" $7 "\t" $(NF-1),"\t",$NF}' |sed 's/"-"//g'

$ awk -F'[][[:space:]"]+' -v OFS='\t' '{match($7,"(/[^/]*){5}"); print $4, substr($7,1,RLENGTH), $(NF-1), $NF}' file
Based on @Ed Morton's answer, but with FS built from five alternatives:
$ awk -v FS='[[]|\\+[[:digit:]]+[]]|GET |/en|"+-"' '{print $2,$4,$NF}' file
Thanks to @Ed Morton.
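For awk versions without interval expressions or POSIX character classes, the same extraction can be sketched with plain string functions. This is only an illustration, not either answerer's method: the function name extract_fields is made up, and it assumes the request is the first double-quoted string on the line and that the ab= and cde= values are the last two whitespace-separated fields.

```shell
# extract_fields: read access-log lines on stdin and print, tab-separated,
# the timestamp, the first five path segments, and the ab=/cde= values.
# (Hypothetical helper; assumes the layout of the sample lines above.)
extract_fields() {
  awk '{
    # Timestamp: text after the first "[" up to the next space.
    date = substr($0, index($0, "[") + 1)
    sub(/ .*/, "", date)

    # Request URI: second word of the first double-quoted string.
    req = substr($0, index($0, "\"") + 1)
    split(req, w, " ")
    uri = w[2]
    sub(/\?.*/, "", uri)              # drop any query string

    # Keep at most the first five segments; seg[1] is empty (leading slash).
    m = split(uri, seg, "/")
    path = ""
    for (i = 2; i <= m && i <= 6; i++) path = path "/" seg[i]

    # The ab= value is glued to a quoted dash, so strip through the last quote.
    ab = $(NF - 1)
    sub(/^.*"/, "", ab)

    printf "%s\t%s\t%s\t%s\n", date, path, ab, $NF
  }'
}

# Usage: extract_fields < file
```

Splitting the path on "/" instead of matching it with a regex keeps the segment count explicit, at the cost of a few more lines.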


Grep logs between two timestamps in Shell

I am writing a script in which I need to grep the log lines that fall exactly between two given timestamps. I don't want to use a regex, as it is not foolproof. Is there any other way to achieve this?
e.g. between 04:15:00 and 05:15:00
Log format:
- - [17/Dec/2015:04:00:00 -0500] "GET /abc/def/ghi/xyz.jsp HTTP/1.1" 200 337 3440 0000FqZTmTG2yuMTJeny7hPDOvG
- - [17/Dec/2015:05:10:09 -0500] "POST /abc/def/ghi/xyz.jsp HTTP/1.1" 200 27 21124 0000FqZTmTG2yuMTJ
This might be what you want to do, using GNU awk for time functions:
$ cat tst.awk
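BEGIN { FS="[][ ]+"; beg=t2s(beg); end=t2s(end) }
{ cur = t2s($4) }
(cur >= beg) && (cur <= end)
# Convert dd/Mon/yyyy:hh:mm:ss to seconds since the epoch.
function t2s(time,    t) {
    split(time, t, /[\/:]/)
    # Map the month name to its number, e.g. "Dec" -> 12.
    t[2] = (index("JanFebMarAprMayJunJulAugSepOctNovDec", t[2]) + 2) / 3
    return mktime(t[3]" "t[2]" "t[1]" "t[4]+0" "t[5]+0" "t[6]+0)
}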
$ awk -v beg="17/Dec/2015:04:15" -v end="17/Dec/2015:05:15" -f tst.awk file
access_log.aging.20151217040207: - - [17/Dec/2015:05:10:09 -0500] "POST /abc/def/ghi/xyz.jsp HTTP/1.1" 200 27 21124 0000FqZTmTG2yuMTJ
but it's hard to guess without more sample input and expected output.
If you want to use neither regular expressions nor patterns for matching lines, then grep alone is not enough.
Here's a Bash+date solution:
# start and stop may be parameters of your script ("$1" and "$2"),
# here they are hardcoded for convenience (same format as in the log).
start="17/Dec/2015:04:15:00 -0500"
stop="17/Dec/2015:05:15:00 -0500"
get_tstamp() {
    # '17/Dec/2015:05:10:09 -0500' -> '17/Dec/2015 05:10:09 -0500'
    datetime="${1/:/ }"
    # '17/Dec/2015 05:10:09 -0500' -> '17 Dec 2015 05:10:09 -0500'
    datetime="${datetime//// }"
    # datetime to unix timestamp (needs GNU date)
    date -d "$datetime" '+%s'
}
start=$(get_tstamp "$start")
stop=$(get_tstamp "$stop")
while read -r line
do
    datetime="${line#*\[}"        # remove everything up to and including '['
    datetime="${datetime%%\]*}"   # remove ']' and everything after it
    tstamp="$(get_tstamp "$datetime")"
    # $tstamp now contains a number like 1450347009;
    # check if it is in range $start..$stop
    [[ "$tstamp" -ge "$start" && "$tstamp" -le "$stop" ]] && echo "$line"
done < file
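Where neither GNU awk nor GNU date is available, a rough alternative is to turn each timestamp into a fixed-width sortable string and compare those. The sketch below is a swapped-in technique, not what either answer does: the function name in_range is made up, and it ignores the UTC offset, so it assumes every line and both bounds use the same time zone.

```shell
# in_range BEG END: print the stdin lines whose bracketed timestamp lies in
# [BEG, END]. Both bounds use the log format, e.g. 17/Dec/2015:04:15:00.
# (Hypothetical helper; the UTC offset is ignored.)
in_range() {
  awk -v beg="$1" -v end="$2" '
    # Turn 17/Dec/2015:05:10:09 into the sortable string 20151217051009.
    function key(ts,    t, mon) {
      split(ts, t, "[/:]")
      mon = (index("JanFebMarAprMayJunJulAugSepOctNovDec", t[2]) + 2) / 3
      return sprintf("%04d%02d%02d%02d%02d%02d", t[3], mon, t[1], t[4], t[5], t[6])
    }
    BEGIN { beg = key(beg); end = key(end) }
    {
      # Timestamp: text after the first "[" up to the next space.
      ts = substr($0, index($0, "[") + 1)
      sub(/ .*/, "", ts)
      k = key(ts)
      if (k >= beg && k <= end) print
    }
  '
}

# Usage: in_range 17/Dec/2015:04:15:00 17/Dec/2015:05:15:00 < file
```

Because every key has the same 14-digit width, comparing keys gives the same order as comparing the times themselves.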

awk regex magic (match first occurrence of character in each line)

Have been scratching my head over this one, hoping there's a simple solution that I've missed.
Simplified: the following code can't cope with IPv6 addresses in the (here abbreviated) Apache log passed to it. Do I run sed on the variable before passing it to AWK, or can I change the AWK regex to match only the first ":" on each line in $clog?
$ clog='- - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252
- - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262
- - [20/Nov/2015:01:06:29 +0100] "GET /another_url HTTP/1.1" 200 11142
2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "GET /some_url HTTP/1.1" 200 273'
$ echo "$clog" | awk -F '[: -]+' '{ vHost[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f\n", var, vHost[var] }}'
> 37262
> 48394
> 0
As can be seen in the last line of $clog, the vhost domain is caught but not the byte count, which should come out at 273 instead of 0.
Original long question
The problem I have is with the ":" character. In addition to the other two characters (space and dash), I need AWK to match only the first occurrence of ":" in each line it evaluates. The following splits each line on all three characters, which works fine until the log entries contain IPv6 addresses.
matrix=$( echo "$clog" | awk -F '[: -]+' '{ vHost[$1]++; Bytes[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
The above code converts the following log entries (contained in variable $clog):
- - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
- - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
- - [20/Nov/2015:01:06:29 +0100] "GET /wordpress/2014/ssl-intercept-headaches HTTP/1.1" 200 11142 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4"
- - [20/Nov/2015:01:06:30 +0100] "GET /some_other_url HTTP/1.1" 404 37264 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
into a table like so, containing vhost name (sans TCP port number), hits and cumulative byte count, one line per vhost:
3 85658
1 37262
But IPv6 addresses get unintentionally split due to their notation, and this causes AWK to produce bogus output when evaluating these log entries. Sample IPv6 log entry:
2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "POST /wordpress/wp-cron.php?doing_wp_cron=*** HTTP/1.0" 200 273 "-" "WordPress;"
I guess a workaround would be to mangle variable $clog to replace the first occurrence of ":" and remove this character from the AWK regex. But I don't think native bash substitution is capable of handling variables with multiple lines.
clog=$(sed 's/:/ /' <<< "$clog")
matrix=$( echo "$clog" | awk -F '[ -]+' '{ vHost[$1]++; Bytes[$1]+=$10 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
This works because $clog is quoted, which preserves the line feeds, so sed processes each line individually. As a result (and as shown), the AWK line needs to be adjusted to ignore ":" and grab $10 instead of $13 for the byte count.
So as it turns out, in writing this, I've already given myself a solution. But I'm sure someone will know of a better more efficient way.
Just don't split the entire line on colons. Remove the port number from the field you extract instead.
split($1, v, /:/); vHost[v[1]]++; ...
I don't see why you would split on dashes, either; either way, the field numbers will be renumbered, so you would end up with something like
awk '{ split($1, v, /:/); vHost[v[1]]++; Bytes[v[1]]+=$11 }
END { for (var in vHost)
printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }'
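The advice above (leave the field splitting alone and strip the port from the one field that carries it) could be sketched as follows. It assumes each entry starts with a vhost:port field followed by the client address, which is a guess at the original layout; the function name vhost_totals and the example.com hosts in the usage note are made up. Unlike split() on ":", it removes only a trailing numeric port, so a colon-heavy value such as an IPv6 address would not be cut at its first colon.

```shell
# vhost_totals: read log lines on stdin and print "vhost hits bytes",
# one line per vhost, sorted by name.
# (Hypothetical helper; assumes field 1 is vhost:port and field 11 the byte count.)
vhost_totals() {
  awk '{
    host = $1
    sub(/:[0-9]+$/, "", host)   # drop a trailing :port and nothing else
    hits[host]++
    bytes[host] += $11
  }
  END {
    for (h in hits) printf "%s %d %d\n", h, hits[h], bytes[h]
  }' | sort
}

# Usage: echo "$clog" | vhost_totals
```

With default whitespace splitting, a client address such as 2001:db8::1 stays in one field, so the byte count is always field 11 under the assumed layout.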

Grep removing lines that are semi-similar?

I am reading a file like so:
cat access_logs | grep Ruby
to determine which IPs are accessing one of my files. It returns a huge list. I want to remove semi-duplicates, i.e. the lines below are technically the same except for their time/date stamps. In a massive list with thousands of repeats, is there a way to get only the unique IP addresses?
- - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
- - [13/Apr/2014:14:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
- - [13/Apr/2014:15:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
- - [13/Apr/2014:15:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
So that for example those 4 lines would be trimmed into only one line?
You can do:
awk '/Ruby/{print $1}' file | sort -u
Or you can use grep + cut to get first column as suggested in the comment.
You can use awk:
awk '/Ruby/ && !seen[$1]++' access_logs
This will print only first line for each IP address even if timestamp is different for a given IP.
For your input it prints:
- - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
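If the number of repeats per address is also of interest, the same array idea can count instead of filter. This is a small variation on the answers above rather than something they propose; the function name ruby_hits_per_ip is made up, and it assumes the client address is the first field of each line.

```shell
# ruby_hits_per_ip: print "count ip" for every address with a Ruby user
# agent, most frequent first. Reads the named files, or stdin if none given.
# (Hypothetical helper; assumes the client address is field 1.)
ruby_hits_per_ip() {
  awk '/Ruby/ { count[$1]++ }
       END { for (ip in count) print count[ip], ip }' "$@" | sort -rn
}

# Usage: ruby_hits_per_ip access_logs
```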

access_log process in hive

I have access logs of around 500MB; here is a sample:
- - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
- - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
- - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
How can I extract the month from the timestamp?
Expected output :
year month day event occurrence
2009 jul 15 GET /favicon.ico HTTP/1.1
2009 apr 29 GET / HTTP/1.1
I tried this:
add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u2.jar;
create table log(ip string, gt string, gt1 string, timestamp string, id1 string, s1 string, s2 string) row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties('input.regex'= '^(\\S+) (\\S+) (\\S+) \\[([[\\w/]+:(\\d{2}:\\d{2}):\\d{2}\\s[+\\-]\\d{4}:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+)')location '/path';
If I understand correctly, string functions will not work in this situation. I am new to regex and Hive.
Please help. Thanks in advance.
I'm not familiar with hadoop/hive, but as far as regexes go, if I were using ruby:
log_file = %Q[
- - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
- - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
- - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
]
# Skip blank lines, then pull day, month, year and the rest out of each entry.
converted_lines = log_file.split("\n").reject { |line| line.strip.empty? }.map do |line|
  regex = /^.*?- - \[(\d+)\/(\w+)\/(\d{4}).*?\] (.*)/
  matches = regex.match(line)
  output = [
    [:year, matches[3]],
    [:month, matches[2]],
    [:day, matches[1]],
    [:event_occurrence, matches[4]],
  ]
end
p converted_lines
Hope that helps.

Regex to modify Domino HTTP log to common format for Piwik import

I need to import old HTTP log files from my Domino web server into my Piwik tracking.
The problem is the format of the log when a user is logged in.
Normal/good format example:
123.123.123 - [17/Mar/2013:00:00:39 +0100] "GET / HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +" 234 "" "example"
Bad format example, produced if a user is logged in:
123.123.123 "CN=SomeUser/OU=SomeOU/O=SomeO" - [17/Mar/2013:00:00:39 +0100] "GET / HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +" 234 "" "example"
I am looking for a bash one-liner to remove the CN information if it is included.
This is my solution, a one-liner to import a Domino log file into Piwik. Maybe someday someone finds this and needn't flip their table.
for i in `ls -v *.log`; do date && echo " bearbeite" $i && echo " " && awk '{sub(/ +"CN=[^"]+" +/," - ")}1' $i | grep -v http.monitor | grep -v nagios > $i.cleanTmp && python /var/www/piwik/misc/log-analytics/ --url= --idsite=8 $i.cleanTmp --dry-run && rm $i.cleanTmp; done;
If you need a pure bash solution, you can do something like this.
Example file:
cat >infile <<XXX
123.123.123 "CN=SomeUser/OU=SomeOU/O=SomeO" - [17/Mar/2013:00:00:39 +0100] "GET / HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +" 234 "" "example"
XXX
while read -r x; do
    [[ $x =~ \ +\"CN=[^\"]+\"\ + ]] && x=${x/$BASH_REMATCH/ }
    echo "$x"
done <infile
Output:
123.123.123 - [17/Mar/2013:00:00:39 +0100] "GET / HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +" 234 "" "example"
It looks for a string that starts with spaces, then "CN=, then any characters other than ", then a ", then some spaces. If this pattern is found, it is replaced with a single space.
If the log files are big ones (>1MB) and this should be done periodically, then use awk instead of the pure bash solution.
awk '{sub(/ +"CN=[^"]+" +/," ")}1' infile
So you just want to remove the "CN=SomeUser/OU=SomeOU/O=SomeO" part?
The regex to match that looks like this: