awk regex magic (match first occurrence of character in each line) - regex

Have been scratching my head over this one, hoping there's a simple solution that I've missed.
Summary
Simplified: the following code can't cope with IPv6 addresses in the (here abbreviated) Apache log passed to it. Do I run sed on the variable before passing it to awk, or can I change the awk regex to match only the first ":" on each line in $clog?
$ clog='djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /another_url HTTP/1.1" 200 11142
ipv6.com:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "GET /some_url HTTP/1.1" 200 273'
$ echo "$clog" | awk -F '[: -]+' '{ vHost[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f\n", var, vHost[var] }}'
> bogus.com 37262
> djerk.nl 48394
> ipv6.com 0
As can be seen from the last line of $clog, the vhost domain is captured but the byte count is not: it should come out as 273 instead of 0.
Original long question
The problem I have is with the ":" character. In addition to the other two characters (space and dash), I need awk to match only the first occurrence of ":" in each line it evaluates. The following splits each line on all three characters, which works fine until the log entries contain IPv6 addresses.
matrix=$( echo "$clog" | awk -F '[: -]+' '{ vHost[$1]++; Bytes[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
The above code converts the following log entries (contained in variable $clog):
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /wordpress/2014/ssl-intercept-headaches HTTP/1.1" 200 11142 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:30 +0100] "GET /some_other_url HTTP/1.1" 404 37264 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
Into a table like so, containing vhost name (sans TCP port number), hits and cumulative byte count. One line per vhost:
djerk.nl 3 85658
bogus.com 1 37262
But IPv6 addresses get unintentionally split because of their notation, and this causes awk to produce bogus output when evaluating these log entries. Sample IPv6 log entry:
djerk.nl:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "POST /wordpress/wp-cron.php?doing_wp_cron=*** HTTP/1.0" 200 273 "-" "WordPress; http://www.djerk.nl/wordpress"
I guess a workaround would be to mangle $clog to replace the first occurrence of ":" on each line and remove this character from the awk regex. But I don't think native bash substitution can handle variables with multiple lines.
clog=$(sed 's/:/ /' <<< "$clog")
matrix=$( echo "$clog" | awk -F '[ -]+' '{ vHost[$1]++; Bytes[$1]+=$10 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
This works because $clog is quoted, which preserves the line feeds, so sed runs on each line individually. As a result (and as shown), the awk field separator no longer includes ":" and the byte count is in $10 instead of $13.
So as it turns out, in writing this, I've already given myself a solution. But I'm sure someone will know of a better, more efficient way.

Just don't split the entire line on colons. Remove the port number from the field you extract instead.
split($1, v, /:/); vHost[v[1]]++; ...
I don't see why you would split on dashes, either. With the default whitespace splitting the fields are renumbered (the byte count becomes $11), so you would end up with something like
awk '{ split($1, v, /:/); vHost[v[1]]++; Bytes[v[1]]+=$11 }
  END { for (var in vHost)
          printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }'
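For checking the result outside awk, here is a rough Python sketch of the same idea: split each line on whitespace only and strip the port from the first field. The function name vhost_totals is made up for illustration, and the sketch assumes the byte count is always the 11th whitespace-separated field, as in the sample log.

```python
from collections import defaultdict

def vhost_totals(log_text):
    """Return {vhost: [hits, bytes]} for a vhost-prefixed Apache access log."""
    totals = defaultdict(lambda: [0, 0])
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip blank or truncated lines
        vhost = fields[0].split(":", 1)[0]  # drop the port, keep the host name
        size = fields[10]                   # same position as awk's $11
        totals[vhost][0] += 1
        # Apache logs "-" when no body was sent; count that as 0 bytes
        totals[vhost][1] += int(size) if size.isdigit() else 0
    return dict(totals)
```

Because the IPv6 address is never split on ":", the byte count stays in the same field for IPv4 and IPv6 clients alike.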

Related

Retrieve log pattern via awk

I would like to extract the date, the first five segments of the URI path, and the ab and cde values from the following logs:
10.10.10.10 - - [26/Oct/2020:19:50:13 +0000] "GET /five/six/seven/eight/nine/en?from=1603738800&to=1603785600ncludedInRange=false HTTP/1.1" 200 255441 "-" "Opera com.test.super/1.10.4;11072 (Linux;Neon KNWWWfj;0,02.2)" "10.10.10.10""f799b6b9-747f-4f22-a1bf-4b7de885fc60""-" "-" "-" "-"ab=0.110 cde=0.102
11.1.1.1 - - [26/Oct/2020:19:50:14 +0000] "GET /one/two/three/four/five/en HTTP/1.1" 200 2832 "-" "Opera com.test.super/1.10.4;11072 (Linux;Neon KNWWWfj;0,02.2)" "11.1.1.1""19a8ee3c-9cb3-4ba6-9732-eb4923601e92""-" "-" "-" "-"ab=0.111 cde=0.112
e.g.
26/Oct/2020:19:50:13 /five/six/seven/eight/nine ab=0.110 cde=0.102
I have tried the following command, but I get only a part of it. Can you please help?
awk '{print $4 "\t" $7 "\t" $(NF-1),"\t",$NF}' |sed 's/"-"//g'
$ awk -F'[][[:space:]"]+' -v OFS='\t' '{match($7,"(/[^/]*){5}"); print $4, substr($7,1,RLENGTH), $(NF-1), $NF}' file
26/Oct/2020:19:50:13 /five/six/seven/eight/nine ab=0.110 cde=0.102
26/Oct/2020:19:50:14 /one/two/three/four/five ab=0.111 cde=0.112
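The same extraction can be sketched in Python with a single regular expression, which may be easier to adapt if the fields move around. This is only a sketch: the name extract_fields is hypothetical, and it assumes every line has a bracketed timestamp, a quoted request with at least five path segments, and ab=/cde= pairs at the very end.

```python
import re

# Assumed layout: [date zone] "METHOD /path..." ... ab=<n> cde=<n> at end of line.
LINE_RE = re.compile(
    r'\[(?P<date>[^\s\]]+)[^\]]*\]\s+"\S+\s+'    # timestamp without the zone, then the method
    r'(?P<path>(?:/[^/\s?]*){5})'                 # exactly the first five path segments
    r'.*?(?P<ab>ab=\S+)\s+(?P<cde>cde=\S+)\s*$'   # trailing ab= and cde= pairs
)

def extract_fields(line):
    """Return (date, path, ab, cde) or None if the line does not fit the layout."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return m.group("date", "path", "ab", "cde")
```

The {5} quantifier plays the same role as the match($7, "(/[^/]*){5}") call in the awk answer.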
Based on @Ed Morton's answer, but setting FS to five alternatives:
$ awk -v FS='[[]|\\+[[:digit:]]+[]]|GET |/en|"+-"' '{print $2,$4,$NF}' file
26/Oct/2020:19:50:13 /five/six/seven/eight/nine ab=0.110 cde=0.102
26/Oct/2020:19:50:14 /one/two/three/four/five ab=0.111 cde=0.112
Updated, thanks to @Ed Morton.

Parsing corrupt Apache logs using regex

I'm writing a Python 3.7.2 program to parse Apache logs looking for all successful response codes. My regex currently parses every well-formed Apache log entry into a tuple of [origin] [date/time] [HTTP method/file/protocol] [response code] [file size], and then I just check whether the response code is 3xx.
The problem is that several entries are corrupt. Some are corrupt enough to be unreadable, so I strip them out in a different part of the program. Several others are only missing the closing " (quotation mark) on the method/protocol item, which throws an error each time I parse such a line.
I think I need a regex alternation for " OR whitespace, but that seems to split the quote into a separate tuple item instead of matching either "GET 613.html HTTP/1.0" or "GET 613.html HTTP/1.0 (without the closing quote). I'm new to regex and thoroughly stumped. Can anyone explain what I'm doing wrong?
I should note that the logs have been scrubbed of some info, instead of origin IP it only shows 'local' or 'remote' and the OS/browser info is removed entirely.
This is the regex for the relevant tuple item that works with valid entries: "(.*)?" I've also tried:
"(.*)?("|\s) - creates another tuple item and still throws error
Here's a snippet of the log entries, including the last entry, which is missing its closing ":
local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185
local - - [27/Oct/1994:18:48:53 -0600] "GET index.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:49:55 -0600] "GET index.html HTTP/1.0" 303 3185
local - - [27/Oct/1994:18:50:25 -0600] "GET 612.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:50:41 -0600] "GET index.html HTTP/1.0" 200 388
local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728
regex = r'([(\w+)]+) - - \[(.*?)\] "(.*)?" (\d+) (\S+)'
import re
with open("validlogs.txt") as validlogs:
    i = 0
    array = []
    successcodes = 0
    for line in validlogs:
        array.append(line)
    loglength = len(array)
    while (i < loglength):
        line = re.match(regex, array[i]).groups()
        if(line[3].startswith("3")):
            successcodes+=1
        i+=1
    print("Number of successcodes: ", successcodes)
Parsing the log entries above should give Number of successcodes: 2
Instead I get:
Traceback (most recent call last):
  File "test.py", line 24, in <module>
    line = re.match(regex, array[i]).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
because (I believe) the regex explicitly requires a closing " and can't handle the entry that's missing it.
I originally used re.match with ([(\w+)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+), wrapped in try/except with continue, to parse all the log lines that actually matched the pattern. Since ~100,000 of the ~750,000 lines didn't conform to the correct Apache log pattern, I wound up changing my code to use re.search on much smaller segments instead.
For instance:
import re
redirectCounter = 0
with open("./http_access_log.txt") as logs:
    for line in logs:
        if re.search(r'\s*(30\d)\s\S+', line):  # Checking for 30x redirect codes
            redirectCounter += 1
I've read that re.match is faster than re.search but I felt that being able to accurately capture the most possible log entries (this handles all but about 2000 lines, most of which have no usable info) was more important.
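An alternative that keeps the full tuple, as a minimal sketch: make the closing quote optional and anchor the status code and size at the end of the line, then skip lines where re.match returns None instead of calling .groups() on it. The names LOG_RE and count_3xx are made up for this example, and the pattern assumes the scrubbed "origin - - [date]" layout shown in the question.

```python
import re

# The closing quote is optional ("?), and status/size are anchored at end of line,
# so the lazy request group cannot swallow them when the quote is missing.
LOG_RE = re.compile(r'(\S+) - - \[(.*?)\] "(.*?)"? (\d{3}) (\S+)\s*$')

def count_3xx(lines):
    """Count entries whose response code starts with 3, ignoring unparseable lines."""
    count = 0
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # corrupt beyond this pattern; skip rather than raise
        if m.group(4).startswith("3"):
            count += 1
    return count
```

On the six sample entries from the question this counts both 303 responses, including the one with the missing quote.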

python regex to extract string from log file line

I want to extract data from a log file.
To open the file:
a = open('access.log','rb')
lines = a.readlines()
So suppose lines[0] is:
123.456.678.89 - - [04/Aug/2014:12:01:41 +0530] "GET /123456789_10.10.20.111 HTTP/1.1" 404 537 "-" "Wget/1.14 (linux-gnu)"
I want to extract only 123456789 and 10.10.20.111 from "GET /123456789_10.10.20.111 HTTP/1.1"
The pattern is: the string starts with /, followed by a run of digits, then an underscore, then an IP address.
I tried this, and it works, but I think it has unnecessary overhead:
node = re.search(r'\"(.*)\"', line).group(1)
node = node.split(" ")[1]
node,ip = node.split("_")
node = node[1:]
print node,ip
How can I get this with a single pattern?
Would you like to do this in one line?
nodeip = re.search(r'([\d]{9})_([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})', line)
Now your node and IP are in groups 1 and 2:
print nodeip.group(1), nodeip.group(2)
Outputs:
123456789 10.10.20.111
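If the node number is not always nine digits, a slightly looser variant may be safer. The sketch below is written for Python 3 and assumes the line is a str (a file opened with 'rb' yields bytes, which would need decoding first). The names NODE_IP_RE and node_and_ip are invented for the example.

```python
import re

# Anchor on the opening quote and method so only the request path is inspected:
# "<METHOD> /<digits>_<dotted quad> ...
NODE_IP_RE = re.compile(r'"\S+ /(\d+)_(\d{1,3}(?:\.\d{1,3}){3}) ')

def node_and_ip(line):
    """Return (node, ip) from the request path, or None if the line does not match."""
    m = NODE_IP_RE.search(line)
    return m.groups() if m else None
```

Returning None for non-matching lines avoids the AttributeError that calling .group() on a failed search would raise.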

Grep removing lines that are semi-similar?

I am reading a file like so:
cat access_logs | grep Ruby
to determine which IPs are accessing one of my files. It returns a huge list. I want to remove semi-duplicates, i.e. lines like the ones below that are effectively the same except for their time/date stamps. In a massive list with thousands of repeats, is there a way to get only the unique IP addresses?
1.2.3.4 - - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:14:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:15:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:15:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
So that for example those 4 lines would be trimmed into only one line?
You can do:
awk '/Ruby/{print $1}' file | sort -u
Or you can use grep + cut to get first column as suggested in the comment.
You can use awk:
awk '/Ruby/ && !seen[$1]++' access_logs
This will print only the first line for each IP address, even if the timestamp differs between lines for a given IP.
For your input it prints:
1.2.3.4 - - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
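The !seen[$1]++ idiom works because the counter for an IP is 0 (false) the first time it is seen and non-zero afterwards. Spelled out as a rough Python sketch (the function name first_line_per_ip and the needle parameter are made up for illustration):

```python
def first_line_per_ip(lines, needle="Ruby"):
    """Keep the first matching line for each client IP (the first field)."""
    seen = set()
    kept = []
    for line in lines:
        if needle not in line:
            continue  # same filter as the /Ruby/ pattern in awk
        fields = line.split(None, 1)
        if not fields:
            continue  # nothing but whitespace
        ip = fields[0]
        if ip not in seen:  # first time this IP shows up
            seen.add(ip)
            kept.append(line)
    return kept
```

Only the first field is used as the key, so differing timestamps on later lines do not matter.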

Regex to modify Domino HTTP log to common format for Piwik import

I need to import old HTTP log files from my Domino web server into my Piwik tracking.
The problem is the format of the log when a user is logged in.
Normal/good format example:
123.123.123 www.example.com - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
Bad format example (produced when a user is logged in):
123.123.123 www.example.com "CN=SomeUser/OU=SomeOU/O=SomeO" - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example
I am looking for a bash one-liner to remove the CN information if it is included.
UPDATE:
This is my solution: a one-liner to import a Domino log file into Piwik. Maybe someday someone finds this and needn't flip their table.
for i in `ls -v *.log`; do date && echo " processing" $i && echo " " && awk '{sub(/ +"CN=[^"]+" +/," - ")}1' $i | grep -v http.monitor | grep -v nagios > $i.cleanTmp && python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://127.0.0.1/piwik --idsite=8 $i.cleanTmp --dry-run && rm $i.cleanTmp; done;
If you need a pure bash solution, you can do something like this:
Example file
cat >infile <<XXX
123.123.123 www.example.com "CN=SomeUser/OU=SomeOU/O=SomeO" - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
XXX
while IFS= read -r x; do
  [[ $x =~ \ +\"CN=[^\"]+\"\ + ]] && x=${x/"$BASH_REMATCH"/ }
  echo "$x"
done <infile
Output:
123.123.123 www.example.com - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
It looks for a string that starts with spaces, then "CN=, then any characters other than ", then a ", then some spaces. If this pattern is found, it is replaced with a single space.
If the log files are big (>1MB) and this has to be done periodically, use awk instead of the pure bash solution:
awk '{sub(/ +"CN=[^"]+" +/," ")}1' infile
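Since the import itself is done by a Python script, the same substitution could also be sketched in Python. This is only an illustration of the pattern used above, not part of Piwik. The names CN_RE and strip_cn are hypothetical.

```python
import re

# One or more spaces, a quoted field starting with CN=, one or more spaces.
CN_RE = re.compile(r' +"CN=[^"]+" +')

def strip_cn(line):
    """Replace the first quoted CN=... field (and surrounding spaces) with one space."""
    return CN_RE.sub(" ", line, count=1)
```

Lines without a CN field pass through unchanged, so the function can be applied to every line of the log.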
So you just want to remove the "CN=SomeUser/OU=SomeOU/O=SomeO" part?
The regex to match that looks like this:
"CN=\w+\/OU=\w+\/O=\w+"