grok - how do you find a quoted string - regex

I am trying to grab the output from an nginx log file and send it to logstash.
10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000] “POST /main/foo.git/git-upload-pack HTTP/1.1” 200 3653189 “-” “git/1.8.3.4 (Apple Git–47)”
Grok is able to find the first 3 words fine
10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000]
%{IPV4:user_ip} - %{USERNAME:user_name} \[%{HTTPDATE:time_local}\]
Grok is able to find the 3rd and 4th words fine
[14/Feb/2014:18:57:05 +0000] “POST /main/foo.git/git-upload-pack HTTP/1.1”
\[%{HTTPDATE:time_local}\] %{QUOTEDSTRING:request}
However when I combine them, and try to find all 4, grok says there are no results (using http://grokdebug.herokuapp.com/ for testing)
10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000] “POST /main/foo.git/git-upload-pack HTTP/1.1”
%{IPV4:user_ip} - %{USERNAME:user_name} \[%{HTTPDATE:time_local}\] %{QUOTEDSTRING:request}
#not found
Anyone know how to get the quoted string in the above example?
I'm brand new to grok, so perhaps I'm not approaching this correctly.
Update
Interestingly, if I use the following log line and then manually type in the URL, it does work:
bob 14/Feb/2014:18:57:05 +0000 "herp"
#Once herp works, replace herp with POST
bob 14/Feb/2014:18:57:05 +0000 "POST"
#Once POST works, keep expanding until the whole thing is in place
autobuild 14/Feb/2014:18:57:05 +0000 "POST /main/builder.git/git-upload-pack HTTP/1.1"

"POST /main/builder.git/git-upload-pack HTTP/1.1" in pattern
"%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}"

The process of posting to Stack Overflow identified the problem.
If you look carefully, the double quotes are parsed differently
"POST
vs
“POST
Manually typing in the double quote fixes the problem
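If curly quotes can actually appear in the data (rather than only in the copy-pasted sample), one option is to normalize them before the pattern runs. A minimal Python sketch of the idea; the sample line and variable names are just for illustration, and in Logstash the rough equivalent would be a mutate/gsub step:
# Minimal sketch: replace the Unicode "smart" quotes U+201C/U+201D with
# plain ASCII quotes so that QUOTEDSTRING-style patterns can match.
raw_line = '10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000] \u201cPOST /main/foo.git/git-upload-pack HTTP/1.1\u201d 200 3653189'
normalized = raw_line.replace('\u201c', '"').replace('\u201d', '"')
print(normalized)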

Also, you can use this expression for cases where the log changes:
"%{WORD:verb}(?:| %{URIPATHPARAM:request})(?:| HTTP/%{NUMBER:httpversion})"
It matches:
"POST /main/builder.git/git-upload-pack HTTP/1.1"
or
"POST /main/builder.git/git-upload-pack"
or
"POST"
try it.. ;)
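For anyone who wants to check the same idea outside of grok, here is a rough Python equivalent of that expression, with the request path and HTTP version optional. This is a sketch only, not the regex that grok actually generates:
import re

# Rough Python equivalent of the grok pattern above: the verb is
# required, the request path and HTTP version are each optional.
pattern = re.compile(r'"(?P<verb>\w+)(?: (?P<request>\S+))?(?: HTTP/(?P<httpversion>[\d.]+))?"')

for sample in ['"POST /main/builder.git/git-upload-pack HTTP/1.1"',
               '"POST /main/builder.git/git-upload-pack"',
               '"POST"']:
    print(pattern.search(sample).groupdict())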

Related

regex content Monit

Sorry all,
I plan to set up a Monit content alert based on Apache log files, but I think this regex is wrong. I assumed it would work like the grep patterns I use from the shell:
check file vhost with path /var/log/apache2/other_vhosts_access.log
if content = '.*"POST /urlpath/ HTTP/1.1" 200.*' then alert
Is the above regex wrong? I only want to alert on lines that contain that pattern.
Is it possible for a Monit content regex to have multiple keywords, like the pattern below that I often use with grep?
grep -P '(?=.*urlpath.*)(?=.*200.*)(?=.*302.*)' logfile
I hope you can advise me regarding the above regex.
Many thanks
Have you read this? https://www.mmonit.com/monit/documentation/monit.html#FILE-CONTENT-TEST
Your grep sample will not work, but you can use a simple POSIX regex (see man 7 regex on a Linux system).
check file vhost with path /var/log/apache2/other_vhosts_access.log
if content = '.*"POST /urlpath/ HTTP/1.1" (200|302).*' then alert
Based on your sample message line, this will send alerts for status codes 200 and 302.
If message lines with status code 200 or 302 occur within a check interval, you will get only one alert. The log file used by Monit should contain the matching messages:
[2022-07-13T16:32:11MESZ] error : 'vhost' content match:
"POST /urlpath/ HTTP/1.1" 200
[2022-07-13T16:33:11MESZ] info : 'vhost' content doesn't match
[2022-07-13T16:33:11MESZ] info : 'vhost' content doesn't match
[2022-07-13T16:45:11MESZ] error : 'vhost' content match:
"POST /urlpath/ HTTP/1.1" 200
[2022-07-13T16:45:11MESZ] error : 'vhost' content match:
"POST /urlpath/ HTTP/1.1" 200
"POST /urlpath/ HTTP/1.1" 302
[2022-07-13T16:46:11MESZ] error : 'vhost' content doesn't match
But the alert sent will contain both/all message lines.
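If you want to sanity-check the alternation before putting it in the Monit config, a quick Python sketch works. Python's re is not POSIX ERE, but for this simple pattern the behaviour is the same; the sample lines below are invented:
import re

# Check the (200|302) alternation against some invented sample lines.
pattern = re.compile(r'.*"POST /urlpath/ HTTP/1.1" (200|302).*')

samples = [
    '1.2.3.4 - - [13/Jul/2022:16:32:11 +0200] "POST /urlpath/ HTTP/1.1" 200 512',
    '1.2.3.4 - - [13/Jul/2022:16:33:11 +0200] "POST /urlpath/ HTTP/1.1" 404 512',
    '1.2.3.4 - - [13/Jul/2022:16:45:11 +0200] "POST /urlpath/ HTTP/1.1" 302 0',
]
for line in samples:
    print(bool(pattern.search(line)), line)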

Parsing corrupt Apache logs using regex

I'm writing a Python 3.7.2 program to parse Apache logs looking for all successful response codes. I've got regex written right now that will parse all correct Apache log entries into individual tuples of [origin] [date/time] [HTML method/file/protocol] [response code] and [file size], and then I just check to see if the response code is 3xx. The problem is there are several entries that are corrupt, some corrupt enough to be unreadable, so I've stripped them out in a different part of the program. Several are just missing the closing " (quotation mark) on the method/protocol item, causing an error each time I parse that line. I'm thinking I need to use a regex OR expression for " OR whitespace, but that seems to break the quote into a different tuple item instead of looking for, say, "GET 613.html HTTP/1.0" OR "GET 613.html HTTP/1.0. I'm new to regex and thoroughly stumped; can anyone explain what I'm doing wrong?
I should note that the logs have been scrubbed of some info, instead of origin IP it only shows 'local' or 'remote' and the OS/browser info is removed entirely.
This is the regex for the relevant tuple item that works with valid entries: "(.*)?" I've also tried:
"(.*)?("|\s) - creates another tuple item and still throws error
Here's a snippet of the log entries including the last entry which is missing it's closing "
local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185
local - - [27/Oct/1994:18:48:53 -0600] "GET index.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:49:55 -0600] "GET index.html HTTP/1.0" 303 3185
local - - [27/Oct/1994:18:50:25 -0600] "GET 612.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:50:41 -0600] "GET index.html HTTP/1.0" 200 388
local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728
regex = '([(\w+)]+) - - \[(.*?)\] "(.*)?" (\d+) (\S+)'
import re
with open("validlogs.txt") as validlogs:
    i = 0
    array = []
    successcodes = 0
    for line in validlogs:
        array.append(line)
    loglength = len(array)
    while (i < loglength):
        line = re.match(regex, array[i]).groups()
        if(line[3].startswith("3")):
            successcodes += 1
        i += 1
    print("Number of successcodes: ", successcodes)
Parsing the log responses above should give Number of success codes: 2
Instead I get: Traceback (most recent call last):
File "test.py", line 24, in
line = re.match(regex, array[i]).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
because (I believe) regex is looking explicitly for a " and can't handle the line entry that's missing it.
So I originally used re.match with ([(\w+)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) and a try/except continue block to parse all the logs that actually matched the pattern. Since ~100,000 of the ~750,000 lines didn't conform to the correct Apache log pattern, I wound up changing my code to re.search with much smaller segments instead.
For instance:
with open("./http_access_log.txt") as logs:
for line in logs:
if re.search('\s*(30\d)\s\S+', line): #Checking for 30x redirect codes
redirectCounter += 1
I've read that re.match is faster than re.search but I felt that being able to accurately capture the most possible log entries (this handles all but about 2000 lines, most of which have no usable info) was more important.
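As a further illustration of the '" OR end-of-field' idea from the question, making the closing quote optional also works, assuming the status code and size are always the last two fields on the line. This is a sketch, not necessarily the most robust pattern:
import re

# Sketch: the closing quote is optional ("?), so re.match also accepts
# the corrupt line that lost its trailing quotation mark. This assumes
# the status code and size are always the last two fields.
tolerant = re.compile(r'(\S+) - - \[(.*?)\] "(.*?)"? (\d{3}) (\S+)$')

lines = [
    'local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185',
    'local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728',
]
for line in lines:
    m = tolerant.match(line)
    print(m.groups() if m else 'no match')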

Grok pattern assistance

Hi, I'm in need of some serious help.
I have logs that I wish to parse using grok, but the problem I'm having is that they are not always consistent in content or spacing. Here are some obfuscated examples:
title_access_log:ipaddress1, ipaddress2, ipaddress3 - - [14/Nov/2017:08:30:00 +0000] "GET /url HTTP/1.1" 200 198454 - 153261 - 0000fD5b5OSuS2C7ZdhgwqYufJk:GH809 url
title_access_log:ipaddress1, ipaddress2 - - [14/Nov/2017:08:30:00 +0000] "GET /url HTTP/1.1" 200 2326 - 20482 V22843489635e0e42e864037eccb8ad4857500ea 0000BDzHfUFhjJmcs9R4-CyglGS:GH806 url
title_access_log:ipaddress1, ipaddress2 - - [14/Nov/2017:08:30:00 +0000] "POST /url HTTP/1.1" 200 30031 - 17942 - 0000PjpQluI9BZ0w4EDB9o2fow-:GH809 url
I have managed to make a grok pattern that pulls out everything up to the time and date for logs that contain 2 IPs, but I get stuck going further, or when trying to handle logs with 3 IPs.
Has anyone got any advice on how to tackle this?
I'm using Graylog to extract the data, so I do have the option of using formats other than grok.
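One way to cope with the variable number of client IPs, shown here as a plain Python regex rather than Graylog/grok syntax, is to capture the whole comma-separated IP list as a single field and split it afterwards. The field names below are invented for illustration:
import re

# Sketch: capture the whole comma-separated IP list as one field and
# split it afterwards, instead of writing one pattern per IP count.
pattern = re.compile(
    r'^(?P<log_name>[^:]+):(?P<ips>[^[]+?) - - '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+)'
)

line = ('title_access_log:ipaddress1, ipaddress2, ipaddress3 - - '
        '[14/Nov/2017:08:30:00 +0000] "GET /url HTTP/1.1" 200 198454 '
        '- 153261 - 0000fD5b5OSuS2C7ZdhgwqYufJk:GH809 url')
m = pattern.match(line)
if m:
    print(m.group('ips').split(', '))  # ['ipaddress1', 'ipaddress2', 'ipaddress3']
    print(m.group('status'), m.group('bytes'))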

Filtering with Filebeat and regex

Saw a similar post regarding filtering Filebeat output, but my case is complicated by the existence of a double quote and slashes and backslashes within the message string.
Here is the message field in Filebeat:
"message":"162.246.216.28 - msca.operations [03/May/2017:17:21:21 +0000] \"GET /api/console/proxy?uri=aliases\u0026=1493758138223 HTTP/1.1\" 200 60 \"http
As seen above, to remove all successful HTTP requests, I need to capture both the "HTTP" and the returned status "200" and dump them.
I tried this line, and a few variations without success:
- drop_event:
    when:
      regexp:
        message: "*HTTP?1.1?? 200*"

fail2ban scan for 403 in nginx access logs

I have set up some specific rules in nginx, blocking some URLs and some extensions (aspx, sh, jsp, etc.).
I have also enabled a custom access log file only for 403|429|410 errors, so that I can have all my access-denied logging in one place.
My goal is to have fail2ban read this log and, for every GET/POST that ends in a 403 error, ban the IP.
1) nginx.conf will be logging to the custom log file like this:
log_format limit '$time_local - $remote_addr "$request" $status';
and this is a log entry:
03/Jan/2017:15:53:01 +0100 - 1.2.3.4 "GET /aaa.jsp HTTP/1.1" 403
2) I have a fail2ban filter like this (taken from here):
^<HOST> .* "(GET|POST) [^"]+" 403
3) I have tried it with fail2ban-regex:
fail2ban-regex /var/log/nginx/access-live-limitbot-website.log /etc/fail2ban/filter.d/nginx-403.conf
and this is the output
Results
=======
Failregex: 0 total
Ignoreregex: 0 total
Date template hits:
|- [# of hits] date format
| [1] Day/MONTH/Year:Hour:Minute:Second
`-
Lines: 2 lines, 0 ignored, 0 matched, 2 missed
|- Missed line(s):
| 217.19.158.242 "POST /wp-login.php HTTP/1.1" 403
| 03/Jan/2017:15:53:01 +0100 - 217.19.158.242 "GET /aaa.jsp HTTP/1.1" 403
`-
and I never get the entry matching the error code.
Will someone please help me with the regex based on my custom log?
Thank you
Fail2ban is picky about the date format. Also, for ease of matching, I suggest reordering the items in the log.
For date format, see documentation here:
https://www.fail2ban.org/wiki/index.php/MANUAL_0_8
In order for a log line to match your failregex, it actually has to match in two parts: the beginning of the line has to match a timestamp pattern or regex, and the remainder of the line has to match your failregex. If the failregex is anchored with a leading ^, then the anchor refers to the start of the remainder of the line, after the timestamp and intervening whitespace.
The pattern or regex to match the time stamp is currently not documented, and not available for users to read or set. See Debian bug #491253. This is a problem if your log has a timestamp format that fail2ban doesn't expect, since it will then fail to match any lines. Because of this, you should test any new failregex against a sample log line, as in the examples below, to be sure that it will match. If fail2ban doesn't recognize your log timestamp, then you have two options: either reconfigure your daemon to log with a timestamp in a more common format, such as in the example log line above; or file a bug report asking to have your timestamp format included.
For the reorder, something like datetime - status - host (- other stuff) would help create a simple pattern such as 403.
Therefore your log should look like:
03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP/1.1"
and your pattern can be
403 <HOST>
You can run this from the command line to validate as:
fail2ban-regex '03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP/1.1"' '403 <HOST>'
Which produces the output:
Running tests
=============
Use regex line : 403 <HOST>
Use single line: 03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP...
Matched time template Day-Month-Year Hour:Minute:Second
Got time using template Day-Month-Year Hour:Minute:Second
Results
=======
Failregex: 1 total
|- #) [# of hits] regular expression
| 1) [1] 403 <HOST>
`-
Ignoreregex: 0 total
Summary
=======
Addresses found:
[1]
1.2.3.4 (Tue Jan 03 15:53:01 2017)
Date template hits:
2 hit(s): Day-Month-Year Hour:Minute:Second
Success, the total number of match is 1
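Conceptually, the two-part matching described above works roughly like the following Python sketch. This is not fail2ban's actual implementation; the timestamp regex and the \S+ standing in for <HOST> are assumptions for illustration:
import re

# Rough simulation of the two-part match: first recognise and strip the
# timestamp at the start of the line, then apply the failregex (whose ^
# refers to the start of the remainder). Not fail2ban's real code; the
# timestamp regex and the \S+ in place of <HOST> are assumptions.
timestamp_re = re.compile(r'^\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}\s*')
failregex = re.compile(r'^403 (?P<host>\S+)')

line = '03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP/1.1"'

ts = timestamp_re.match(line)
if ts:
    remainder = line[ts.end():]
    m = failregex.match(remainder)
    print(m.group('host') if m else 'failregex did not match')  # 1.2.3.4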