scala apache access log regex not working - regex

I have defined regex for apache access log as below:
val apacheLogPattern = """
^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$
""".r
And a function to parse the log:
def parse_log(line: String) = {
line match {
case apacheLogPattern(ipAddress, clientIdentity, userId, dateTime, method, endPoint,
protocol, responseCode, contentSize, browser, somethingElse) => "match"
}
}
val p = """66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"""
parse_log(p)
Calling the parse function gives MatchError
scala.MatchError:
66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
(of class java.lang.String)
at .parse_log(:13)
... 28 elided
Can someone help me where the scala regex is going wrong?

From The fourth bird's comment, the regex is lacking .r at the end, and has one too many capturing groups. The correct pattern is shown below.
val apacheLogPattern = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$""".r

Related

Is it possible to write multiple regex for the same input in Fluent Bit?

My logs look like this:
200 59903 0.056 - [24/Jun/2020:00:06:56 +0530] "GET /xxxxx/xxxxx/xxxxx HTTP/1.1" xxxxx.com [xxxx:4900:xxxx:b798:xxxx:c8ba:xxxx:6a23] - - xxx.xxx.xxx.xxx - - - "http://xxxxx/xxxxx/xxxxx" 164551836 1 HIT "-" "-" "Mozilla/5.0 (Linux; Android 9; Mi A1 Build/PKQ1.180917.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/77.0.3865.92 Mobile Safari/537.36" "-" "-" "dhDebug=-" "-" - -
200 11485 0.000 - [24/Jun/2020:00:06:56 +0530] "GET /xxxxx/xxxxx/xxxxx/xxxxx HTTP/1.1" xxxxx.com xxx.xxx.xxx.xxx - - xxx.xxx.xxx.xxx - - - "-" 164551710 7 HIT "-" "-" "Dalvik/2.1.0 (Linux; U; Android 9; vivo 1915 Build/PPR1.180610.011)" "-" "-" "dhDebug=appVersion=13.0.8&osVersion=9&clientId=1271210612&conn_type=4G&conn_quality=NO_CONNECTION&sessionSource=organic&featureMask=1879044085&featureMaskV1=635" "-" 40 -
The two logs are almost same except the fact that the last one contains a detailed output of dhDebug.
This is how my parsers.conf looks like:
[PARSER]
Name head
Format regex
Regex (?<responseCode>\d{3})\s(?<responseSize>\d+)\s(?<responseTime>\d+.\d+)\s.*?\s\[(?<time>.*?)\]\s"(?<method>.*?)\s(?<url1>.*?)\s(?<protocol>.*?)"\s(?<servedBy>.*?)\s(?<Akamai_ip1>.*?)\s(?<ClientId_ip2>.*?)\s(?<ip3>.*?)\s(?<lb_ip4>.*?)\s(?<ip5>.*?)\s(?<ip6>.*?)\s(?<ip7>.*?)\s+"(?<url2>.*?)".*?".*?"\s".*?"\s"(?<agentInfo>.*?)"
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
Time_Keep On
Types responseTime:float
Please suggest any idea on how to implement the information of dhDebug in a separate key-value pair in the same regex that works on both the types of logs.
EDITED!!
You can use (?:case1|case2) for case1: is null and case2: is not null
So Regex will be:
(?<responseCode>\d{3})\s(?<responseSize>\d+)\s(?<responseTime>\d+.\d+)\s.*?\s\[(?<time>.*?)\]\s"(?<method>.*?)\s(?<url1>.*?)\s(?<protocol>.*?)"\s(?<servedBy>.*?)\s(?<Akamai_ip1>.*?)\s(?<ClientId_ip2>.*?)\s(?<ip3>.*?)\s(?<lb_ip4>.*?)\s(?<ip5>.*?)\s(?<ip6>.*?)\s(?<ip7>.*?)\s+"(?<url2>.*?)".*?".*?"\s".*?"\s"(?<agentInfo>.*?)"\s"-"\s"-"\s"dhDebug=(?:-|appVersion=(?<appVersion>.*?)&osVersion=(?<osVersion>.*?)&clientId=(?<clientId>.*?)&conn_type=(?<conn_type>.*?)&conn_quality=(?<conn_quality>.*?)&sessionSource=(?<sessionSource>.*?)&featureMask=(?<featureMask>.*?)&featureMaskV1=(?<featureMaskV1>.*?))"
With this you get null for each field name of dhDebug for the first log line and field names with values for the second one.
You can test it at http://grokdebug.herokuapp.com/

awk regex magic (match first occurrence of character in each line)

Have been scratching my head over this one, hoping there's a simple solution that I've missed.
Summary
Simplified the following code can't cope with IPv6 addresses in the (here abbreviated) apache log parsed to it. Do I SED the variable before parsing to AWK or can I change the AWK regex to match only the first ":" on each line in $clog?
$ clog='djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /another_url HTTP/1.1" 200 11142
ipv6.com:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "GET /some_url HTTP/1.1" 200 273'
$ echo "$clog" | awk -F '[: -]+' '{ vHost[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f\n", var, vHost[var] }}'
> bogus.com 37262
> djerk.nl 48394
> ipv6.com 0
As can be seen the last line of variable $clog, the vhost domain is caught but not the byte count which should come out at 273 instead of 0.
Original long question
The problem I have is with the ":" character. In addition to the other two characters (space and dash), I need AWK to match only the first occurrence of ":" in each line it's evaluating. the following splits each line by three characters which works fine, until the log entries contain IPv6 addresses.
matrix=$( echo "$clog" | awk -F '[: -]+' '{ vHost[$1]++; Bytes[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
The above code converts the following log entries (contained in variable $clog):
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /wordpress/2014/ssl-intercept-headaches HTTP/1.1" 200 11142 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:30 +0100] "GET /some_other_url HTTP/1.1" 404 37264 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
Into a table like so, containing vhost name (sans TCP port number), hits and cumulative byte count. One line per vhost:
djerk.nl 3 85658
bogus.com 1 37262
But IPv6 addresses get unintentionally split due to their notation and this causes AWK to produce bogus output when evaluation these log entries. Sample IPv6 log entry:
djerk.nl:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "POST /wordpress/wp-cron.php?doing_wp_cron=*** HTTP/1.0" 200 273 "-" "WordPress; http://www.djerk.nl/wordpress"
I guess a work around would be to mangle variable $clog to replace the first occurrence of ":" and remove this character from the AWK regex. But I don't think native bash substitution is capable of negotiating variables with multiple lines.
clog=$(sed 's/:/ /' <<< "$clog")
matrix=$( echo "$clog" | awk -F '[ -]+' '{ vHost[$1]++; Bytes[$1]+=$10 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
This works because $clog is quoted which preserves the line feeds and runs sed on each line individually. As a result (and shown) the AWK line needs to be adjusted to ignore ":" and grab $10 instead of $13 for the byte count.
So as it turns out, in writing this, I've already given myself a solution. But I'm sure someone will know of a better more efficient way.
Just don't split the entire line on colons. Remove the port number from the field you extract instead.
split($1, v, /:/); vHost[v[1]]++; ...
I don't see why you would split on dashes, either; either way, the field numbers will be renumbered, so you would end up with something like
awk '{ split($1, v, /:/); vHost[v[1]]++; Bytes[v[1]]+=$11 }
END { for (var in vHost)
printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }'

fail2ban need help to create regex rules

I try to protect my serveur from xmlrpc.php ddos.
I use fail2ban, but the regex I found dont seems to be ok. Can you have a look:
This is the log:
Aug 2 17:33:11 myserver pound: my.web.site 188.209.49.38 - -
[02/Aug/2015:17:33:11 +0200] "POST /xmlrpc.php HTTP/1.0" 404 410 ""
"Mozilla/5.0 (compatible; Googlebot/2.1;
http://www.google.com/bot.html)"
Aug 2 16:27:49 myserver pound:
(7fec610c5700) e503 no back-end "POST /xmlrpc.php HTTP/1.0" from
185.62.188.25
filter.d/xmlrpc.conf
[Definition]
failregex = ^<HOST> .*POST .*xmlrpc\.php.*
ignoreregex =
jail.local
[xmlrpc]
enabled = true
filter = xmlrpc
action = iptables[name=xmlrpc, port=http, protocol=tcp]
logpath = /var/log/pound.log
bantime = 43600
maxretry = 2
And the test
fail2ban-regex /var/log/pound.log /etc/fail2ban/filter.d/xmlrpc.conf
/usr/share/fail2ban/server/filter.py:442: DeprecationWarning: the md5 module is deprecated; use hashlib instead
import md5
Running tests
=============
Use regex file : /etc/fail2ban/filter.d/xmlrpc.conf
Use log file : /var/log/pound.log
Results
=======
Failregex
|- Regular expressions:
| [1] ^<HOST> .*POST .*xmlrpc\.php.*
|
`- Number of matches:
[1] 0 match(es)
Ignoreregex
|- Regular expressions:
|
`- Number of matches:
Summary
=======
Sorry, no match
Look at the above section 'Running tests' which could contain important
information.
root#myserver:/etc/fail2ban#
Any idea?
Thks
I edited the type format, so I have now this kind of log
Aug 3 06:25:51 ns111111 pound: 141.101.96.94 POST /xmlrpc.php HTTP/1.1 - HTTP/1.1 200 OK
So I tried this, and it's ok :
fail2ban-regex 'Aug 3 06:25:51 ns111111 pound: 141.101.96.94 POST /xmlrpc.php HTTP/1.1 - HTTP/1.1 200 OK' 'ns111111 pound: <HOST> .*POST .*xmlrpc\.php.*'

access_log process in hive

i have access_logs around 500MB,i am giving sample as
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
10.216.113.172 - - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
how can i extract month from timestamp?
Expected output :
year month day event occurrence
2009 jul 15 GET /favicon.ico HTTP/1.1
2009 apr 29 GET / HTTP/1.1
i tried this
add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u2.jar;
create table log(ip string, gt string, gt1 string, timestamp string, id1 string, s1 string, s2 string) row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties('input.regex'= '^(\\S+) (\\S+) (\\S+) \\[([[\\w/]+:(\\d{2}:\\d{2}):\\d{2}\\s[+\\-]\\d{4}:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+)')location '/path';
If i understand correctly string functions will not work in this situation.i am new to regex & hive.
help me..thanks in advance
I'm not familiar with hadoop/hive, but as far as regexes go, if I were using ruby:
log_file = %Q[
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
10.216.113.172 - - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
]
converted_lines = log_file.split("\n").map do |line|
regex = /^.*? - - \[(\d+)\/(\w+)\/(\d{4}).*?\] (.*)/
matches = regex.match(line)
output = [
[:year, matches[3]],
[:month, matches[2]],
[:day, matches[1]],
[:event_occurrence, matches[4]],
]
end
Hope that helps.

Assistance needed with regular expressions [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
Improve this question
The HTTP messages are listed below right after the questions.
I need a regular expression that finds the HTTP status codes within both messages.
Another one that finds the name of the requesting user in both messages.
A last one that finds the time stamp within both messages.
127.0.0.1 - Johny [17/Dec/2010:17:15:16 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
127.0.0.1 - debbie7 [19/Dec/2010:11:11:02 -0700] "GET /apache_pbs.gif
HTTP/1.0" 404 2336
Thanks!
Description
You can pull the values {username, date, and http code} in one pass using this regex:
^.*?-\s(\S*)\s+\[([^\]]*)\]\s"[^"]*"\s(\d+)\s\d+
Groups
Group 0 gets the entire line, while the other groups will individually get the respective matches.
gets the username
gets the date stamp
gets the http status code
PHP Code Example:
You didn't select a language so I present a php example to show how the regex works
Given input string, complete with link break in the middle of the message area
127.0.0.1 - Johny [17/Dec/2010:17:15:16 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
127.0.0.1 - debbie7 [19/Dec/2010:11:11:02 -0700] "GET /apache_pbs.gif
HTTP/1.0" 404 2336
Code Example
<?php
$sourcestring="your source string";
preg_match_all('/^.*?-\s(\S*)\s+\[([^\]]*)\]\s"[^"]*"\s(\d+)\s\d+/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => 127.0.0.1 - Johny [17/Dec/2010:17:15:16 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
[1] => 127.0.0.1 - debbie7 [19/Dec/2010:11:11:02 -0700] "GET /apache_pbs.gif
HTTP/1.0" 404 2336
)
[1] => Array
(
[0] => Johny
[1] => debbie7
)
[2] => Array
(
[0] => 17/Dec/2010:17:15:16 -0700
[1] => 19/Dec/2010:11:11:02 -0700
)
[3] => Array
(
[0] => 200
[1] => 404
)
)
HTTP status:
(?<=HTTP/1.0" )\d+
Requesting user (works for any ip address):
(?<=(\d\d?\d?\.){3}\d\d?\d? - )\w+(?= \[)
Timestamp:
(?<=\[).*(?=\])
You can try with this Regex to achieve this:
^.* (\w*) \[([^\]]*)] \"[\w.\/ ]*\" ([\d]+)
Input:
127.0.0.1 - Johny [17/Dec/2010:17:15:16 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
Output:
Group 1: Johny
Group 2: 17/Dec/2010:17:15:16 -0700
Group 3: 200
You can test the Regex here.
In Perl:
!([a-zA-Z]+) \W+
(.* -) [\w\W]+
HTTP/1.0" \ ([\d]+)
!x
$1 -> username
$2 -> timestamp
$3 -> status