python regex to extract string from log file line - python-2.7

I want to extract the data from log file
For opening the file:
a = open('access.log','rb')
lines = a.readlines()
So suppose line[0]:
123.456.678.89 - - [04/Aug/2014:12:01:41 +0530] "GET /123456789_10.10.20.111 HTTP/1.1" 404 537 "-" "Wget/1.14 (linux-gnu)"
I want to extract only 123456789 and 10.10.20.111 from "GET /123456789_10.10.20.111 HTTP/1.1"
The pattern will be like string starts with /, repetition of digit then underscore then ip.
I tried this, and it works. I think it takes overhead
node = re.search(r'\"(.*)\"', line).group(1)
node = node.split(" ")[1]
node,ip = node.split("_")
node = node[1:]
print node,ip
How to get this with pattern ?

Would you like to do this in one line?
nodeip = re.search(r'([\d]{9})_([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})', line)
Now your node and IP in groups 1 and 2:
print nodeip.group(1), nodeip.group(2)
Outputs:
123456789 10.10.20.111

Related

scala apache access log regex not working

I have defined regex for apache access log as below:
val apacheLogPattern = """
^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$
""".r
And a function to parse the log:
def parse_log(line: String) = {
line match {
case apacheLogPattern(ipAddress, clientIdentity, userId, dateTime, method, endPoint,
protocol, responseCode, contentSize, browser, somethingElse) => "match"
}
}
val p = """66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"""
parse_log(p)
Calling the parse function gives MatchError
scala.MatchError:
66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
(of class java.lang.String)
at .parse_log(:13)
... 28 elided
Can someone help me where the scala regex is going wrong?
From The fourth bird's comment, the regex is lacking .r at the end, and has one too many capturing groups. The correct pattern is shown below.
val apacheLogPattern = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$""".r

Parsing corrupt Apache logs using regex

I'm writing a Python 3.7.2 program to parse Apache logs looking for all successful response codes. I've got regex written right now that will parse all correct Apache log entries into individual tuples of [origin] [date/time] [HTML method/file/protocol] [response code] and [file size] and then I just check to see if the response code is 3xx. The problem is there are several entries that are corrupt, some corrupt enough to be unreadable so I've stripped them out in a different part of the program. Several are just missing the closing " (quotation mark) on the method/protocol item causing it to throw an error each time I parse that line. I'm thinking I need to use a RegEx Or expression for " OR whitespace but that seems to break the quote into a different tuple item instead of looking for say, "GET 613.html HTTP/1.0" OR "GET 613.html HTTP/1.0 I'm new to regex and thoroughly stumped, can anyone explain what I'm doing wrong?
I should note that the logs have been scrubbed of some info, instead of origin IP it only shows 'local' or 'remote' and the OS/browser info is removed entirely.
This is the regex for the relevant tuple item that works with valid entries: "(.*)?" I've also tried:
"(.*)?("|\s) - creates another tuple item and still throws error
Here's a snippet of the log entries including the last entry which is missing it's closing "
local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185
local - - [27/Oct/1994:18:48:53 -0600] "GET index.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:49:55 -0600] "GET index.html HTTP/1.0" 303 3185
local - - [27/Oct/1994:18:50:25 -0600] "GET 612.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:50:41 -0600] "GET index.html HTTP/1.0" 200 388
local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728
regex = '([(\w+)]+) - - \[(.*?)\] "(.*)?" (\d+) (\S+)'
import re
with open("validlogs.txt") as validlogs:
i = 0
array = []
successcodes = 0
for line in validlogs:
array.append(line)
loglength = len(array)
while (i < loglength):
line = re.match(regex, array[i]).groups()
if(line[3].startswith("3")):
successcodes+=1
i+=1
print("Number of successcodes: ", successcodes)
Parsing the log responses above should give Number of success codes: 2
Instead I get: Traceback (most recent call last):
File "test.py", line 24, in
line = re.match(regex, array[i]).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
because (I believe) regex is looking explicitly for a " and can't handle the line entry that's missing it.
So I originally used re.match with ([(\w+)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) with a Try: / Except: continue code to parse all the logs that actually matched the pattern. Since ~100,000 of the ~750,000 lines didn't conform to the correct Apache logs pattern, I wound up changing my code to re.search with much smaller segments instead.
For instance:
with open("./http_access_log.txt") as logs:
for line in logs:
if re.search('\s*(30\d)\s\S+', line): #Checking for 30x redirect codes
redirectCounter += 1
I've read that re.match is faster than re.search but I felt that being able to accurately capture the most possible log entries (this handles all but about 2000 lines, most of which have no usable info) was more important.

fail2ban scan for 403 in nginx access logs

I have setup some specific rules on nginx, blocking some urls and some extensions (aspx, sh, jsp, etc..).
I have also enable a custom access log file only for 403|429|410 errors, so that in only 1 place i can have all my access denied log.
My goal is to have fail2ban read this log and for every GET/POST that ends in a 403 error, IP should be banned.
1) nginx.conf will be logging the custom error log file like this:
log_format limit '$time_local - $remote_addr "$request" $status';
and this is a log entry:
03/Jan/2017:15:53:01 +0100 - 1.2.3.4 "GET /aaa.jsp HTTP/1.1" 403
2) i have a fail2ban filter like this (taken from here)
^<HOST> .* "(GET|POST) [^"]+" 403
3) i have tried with fail2ban-regex
fail2ban-regex /var/log/nginx/access-live-limitbot-website.log /etc/fail2ban/filter.d/nginx-403.conf
and this is the output
Results
=======
Failregex: 0 total
Ignoreregex: 0 total
Date template hits:
|- [# of hits] date format
| [1] Day/MONTH/Year:Hour:Minute:Second
`-
Lines: 2 lines, 0 ignored, 0 matched, 2 missed
|- Missed line(s):
| 217.19.158.242 "POST /wp-login.php HTTP/1.1" 403
| 03/Jan/2017:15:53:01 +0100 - 217.19.158.242 "GET /aaa.jsp HTTP/1.1" 403
`-
and i will never get the entry matching the error code.
Will someone please help me with the regex based on my custom log?
thank you
Fail2ban is picky about the date format. Also, for ease of matching, I suggest reordering the items in the log.
For date format, see documentation here:
https://www.fail2ban.org/wiki/index.php/MANUAL_0_8
In order for a log line to match your failregex, it actually has to match in two parts: the beginning of the line has to match a timestamp pattern or regex, and the remainder of the line has to match your failregex. If the failregex is anchored with a leading ^, then the anchor refers to the start of the remainder of the line, after the timestamp and intervening whitespace.
The pattern or regex to match the time stamp is currently not documented, and not available for users to read or set. See Debian bug #491253. This is a problem if your log has a timestamp format that fail2ban doesn't expect, since it will then fail to match any lines. Because of this, you should test any new failregex against a sample log line, as in the examples below, to be sure that it will match. If fail2ban doesn't recognize your log timestamp, then you have two options: either reconfigure your daemon to log with a timestamp in a more common format, such as in the example log line above; or file a bug report asking to have your timestamp format included.
For the reorder, something like datetime - status - host (- other stuff), would help create a simple pattern such as 403.
Therefore your log should look like:
03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP/1.1"
and your pattern can be
403 <HOST>
You can run this from the command line to validate as:
fail2ban-regex '03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP/1.1"' '403 <HOST>'
Which produces the output:
Running tests
=============
Use regex line : 403 <HOST>
Use single line: 03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP...
Matched time template Day-Month-Year Hour:Minute:Second
Got time using template Day-Month-Year Hour:Minute:Second
Results
=======
Failregex: 1 total
|- #) [# of hits] regular expression
| 1) [1] 403 <HOST>
`-
Ignoreregex: 0 total
Summary
=======
Addresses found:
[1]
1.2.3.4 (Tue Jan 03 15:53:01 2017)
Date template hits:
2 hit(s): Day-Month-Year Hour:Minute:Second
Success, the total number of match is 1

awk regex magic (match first occurrence of character in each line)

Have been scratching my head over this one, hoping there's a simple solution that I've missed.
Summary
Simplified the following code can't cope with IPv6 addresses in the (here abbreviated) apache log parsed to it. Do I SED the variable before parsing to AWK or can I change the AWK regex to match only the first ":" on each line in $clog?
$ clog='djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /another_url HTTP/1.1" 200 11142
ipv6.com:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "GET /some_url HTTP/1.1" 200 273'
$ echo "$clog" | awk -F '[: -]+' '{ vHost[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f\n", var, vHost[var] }}'
> bogus.com 37262
> djerk.nl 48394
> ipv6.com 0
As can be seen the last line of variable $clog, the vhost domain is caught but not the byte count which should come out at 273 instead of 0.
Original long question
The problem I have is with the ":" character. In addition to the other two characters (space and dash), I need AWK to match only the first occurrence of ":" in each line it's evaluating. the following splits each line by three characters which works fine, until the log entries contain IPv6 addresses.
matrix=$( echo "$clog" | awk -F '[: -]+' '{ vHost[$1]++; Bytes[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
The above code converts the following log entries (contained in variable $clog):
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /wordpress/2014/ssl-intercept-headaches HTTP/1.1" 200 11142 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:30 +0100] "GET /some_other_url HTTP/1.1" 404 37264 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
Into a table like so, containing vhost name (sans TCP port number), hits and cumulative byte count. One line per vhost:
djerk.nl 3 85658
bogus.com 1 37262
But IPv6 addresses get unintentionally split due to their notation and this causes AWK to produce bogus output when evaluation these log entries. Sample IPv6 log entry:
djerk.nl:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "POST /wordpress/wp-cron.php?doing_wp_cron=*** HTTP/1.0" 200 273 "-" "WordPress; http://www.djerk.nl/wordpress"
I guess a work around would be to mangle variable $clog to replace the first occurrence of ":" and remove this character from the AWK regex. But I don't think native bash substitution is capable of negotiating variables with multiple lines.
clog=$(sed 's/:/ /' <<< "$clog")
matrix=$( echo "$clog" | awk -F '[ -]+' '{ vHost[$1]++; Bytes[$1]+=$10 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
This works because $clog is quoted which preserves the line feeds and runs sed on each line individually. As a result (and shown) the AWK line needs to be adjusted to ignore ":" and grab $10 instead of $13 for the byte count.
So as it turns out, in writing this, I've already given myself a solution. But I'm sure someone will know of a better more efficient way.
Just don't split the entire line on colons. Remove the port number from the field you extract instead.
split($1, v, /:/); vHost[v[1]]++; ...
I don't see why you would split on dashes, either; either way, the field numbers will be renumbered, so you would end up with something like
awk '{ split($1, v, /:/); vHost[v[1]]++; Bytes[v[1]]+=$11 }
END { for (var in vHost)
printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }'

Having Trouble with string::find

I'm writing a C++ program to parse pieces out of web logs, and one of the pieces I want is the requested page. I'm using string::find to define the beginning and end of the page, then using string::substr to extract it. Here is an example line:
172.138.80.174 - - [05/Aug/2001:21:06:27 -0300] "GET /~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/?Keywords=stringVar+%2B+savitch&view=2+80+0&did=" "Mozilla/4.61 [en] (Win98; I)"
The requested page is the part right after the GET, and the end is right before the HTTP is, So I do something like :
int beginning = log_entry.find("\"GET") + 5;
int end = log_entry.find("HTTP) - 5;
std::string requested_page = log_entry.substr(beginning, end);
This is then what would be contained within requested_page:
/~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/
Instead of
/~csc226
As you can see, the beginning is correct, but the end is not. I have a log of 3000 lines with the same syntax as the example entry above, and the beginnings of the requested pages in all of them are correct and the ends are not.
Any ideas as to what is going wrong?
Thanks!
Don't store the result of find in an int. use std::string::size_type aka std::size_t.
To test if it failed, then compare against std::string::npos.
Second, never ever manipulate the result of std::string::find until you both confirm it is not npos and know that the manipulation moves it within the valid range. +5 and -5 blindly is a no-go. I don't care if you "know" what your data is. Don't write buffer overflow culpable code.
Finally, substr( start, LENGTH ) not substr( start, end ).
std::string was imported from a different source library than the standard containers. So its conventions are very different (and often worse).
172.138.80.174 - - [05/Aug/2001:21:06:27 -0300] "GET /~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/?Keywords=stringVar+%2B+savitch&view=2+80+0&did=" "Mozilla/4.61 [en] (Win98; I)"
So:
log_entry.find("\"GET") + 5; will match: "GET and then move the iterator 5 places forward to the location:
172.138.80.174 - - [05/Aug/2001:21:06:27 -0300] "GET /~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/?Keywords=stringVar+%2B+savitch&view=2+80+0&did=" "Mozilla/4.61 [en] (Win98; I)"
^
Next `log_entry.find("HTTP"); will match HTTP:
172.138.80.174 - - [05/Aug/2001:21:06:27 -0300] "GET /~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/?Keywords=stringVar+%2B+savitch&view=2+80+0&did=" "Mozilla/4.61 [en] (Win98; I)"
^
You want to use (size_t length = log_entry.find("\"HTTP") - log_entry.find("\"GET") - 5;). Finally you need to use std::string::substr correctly here.