access_log processing in Hive with regex

I have access logs of around 500 MB; here is a sample:
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
10.216.113.172 - - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
How can I extract the month from the timestamp?
Expected output:
year month day event occurrence
2009 jul 15 GET /favicon.ico HTTP/1.1
2009 apr 29 GET / HTTP/1.1
I tried this:
add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u2.jar;
create table log(ip string, gt string, gt1 string, timestamp string, id1 string, s1 string, s2 string) row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties('input.regex'= '^(\\S+) (\\S+) (\\S+) \\[([[\\w/]+:(\\d{2}:\\d{2}):\\d{2}\\s[+\\-]\\d{4}:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+)')location '/path';
If I understand correctly, string functions will not work in this situation. I am new to regex and Hive. Thanks in advance.

I'm not familiar with Hadoop/Hive, but as far as regexes go, here is what I would do in Ruby:
log_file = %Q[
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
10.216.113.172 - - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
]
regex = /^.*? - - \[(\d+)\/(\w+)\/(\d{4}).*?\] (.*)/
converted_lines = log_file.split("\n").map do |line|
  matches = regex.match(line)
  next unless matches # skip the blank line %Q[...] puts at the start
  [
    [:year, matches[3]],
    [:month, matches[2]],
    [:day, matches[1]],
    [:event_occurrence, matches[4]],
  ]
end.compact
Hope that helps.
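To double-check the same capture groups outside Ruby (and outside Hive), here is a minimal Python sketch; the pattern and names are illustrative, not Hive SerDe syntax:

```python
import re

# Illustrative sketch: capture day, month, year from the bracketed
# timestamp, plus the quoted request line.
LOG_RE = re.compile(
    r'^\S+ \S+ \S+ '                       # ip, identity, user
    r'\[(\d{2})/(\w{3})/(\d{4})[^\]]*\] '  # day, month, year
    r'"([^"]*)"'                           # request line
)

line = '10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779'
day, month, year, event = LOG_RE.match(line).groups()
print(year, month.lower(), day, event)  # 2009 jul 15 GET / HTTP/1.1
```

The same group structure should carry over to a Hive RegexSerDe `input.regex` (with backslashes doubled, as in the question's DDL).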

Related

Is it possible to write multiple regex for the same input in Fluent Bit?

My logs look like this:
200 59903 0.056 - [24/Jun/2020:00:06:56 +0530] "GET /xxxxx/xxxxx/xxxxx HTTP/1.1" xxxxx.com [xxxx:4900:xxxx:b798:xxxx:c8ba:xxxx:6a23] - - xxx.xxx.xxx.xxx - - - "http://xxxxx/xxxxx/xxxxx" 164551836 1 HIT "-" "-" "Mozilla/5.0 (Linux; Android 9; Mi A1 Build/PKQ1.180917.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/77.0.3865.92 Mobile Safari/537.36" "-" "-" "dhDebug=-" "-" - -
200 11485 0.000 - [24/Jun/2020:00:06:56 +0530] "GET /xxxxx/xxxxx/xxxxx/xxxxx HTTP/1.1" xxxxx.com xxx.xxx.xxx.xxx - - xxx.xxx.xxx.xxx - - - "-" 164551710 7 HIT "-" "-" "Dalvik/2.1.0 (Linux; U; Android 9; vivo 1915 Build/PPR1.180610.011)" "-" "-" "dhDebug=appVersion=13.0.8&osVersion=9&clientId=1271210612&conn_type=4G&conn_quality=NO_CONNECTION&sessionSource=organic&featureMask=1879044085&featureMaskV1=635" "-" 40 -
The two logs are almost the same, except that the second one contains a detailed dhDebug output.
This is what my parsers.conf looks like:
[PARSER]
Name head
Format regex
Regex (?<responseCode>\d{3})\s(?<responseSize>\d+)\s(?<responseTime>\d+.\d+)\s.*?\s\[(?<time>.*?)\]\s"(?<method>.*?)\s(?<url1>.*?)\s(?<protocol>.*?)"\s(?<servedBy>.*?)\s(?<Akamai_ip1>.*?)\s(?<ClientId_ip2>.*?)\s(?<ip3>.*?)\s(?<lb_ip4>.*?)\s(?<ip5>.*?)\s(?<ip6>.*?)\s(?<ip7>.*?)\s+"(?<url2>.*?)".*?".*?"\s".*?"\s"(?<agentInfo>.*?)"
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
Time_Keep On
Types responseTime:float
How can I capture the dhDebug information as separate key-value pairs, using a single regex that works on both types of logs?
You can use a (?:case1|case2) alternation, where case1 matches the null form (-) and case2 matches the populated form. The regex then becomes:
(?<responseCode>\d{3})\s(?<responseSize>\d+)\s(?<responseTime>\d+.\d+)\s.*?\s\[(?<time>.*?)\]\s"(?<method>.*?)\s(?<url1>.*?)\s(?<protocol>.*?)"\s(?<servedBy>.*?)\s(?<Akamai_ip1>.*?)\s(?<ClientId_ip2>.*?)\s(?<ip3>.*?)\s(?<lb_ip4>.*?)\s(?<ip5>.*?)\s(?<ip6>.*?)\s(?<ip7>.*?)\s+"(?<url2>.*?)".*?".*?"\s".*?"\s"(?<agentInfo>.*?)"\s"-"\s"-"\s"dhDebug=(?:-|appVersion=(?<appVersion>.*?)&osVersion=(?<osVersion>.*?)&clientId=(?<clientId>.*?)&conn_type=(?<conn_type>.*?)&conn_quality=(?<conn_quality>.*?)&sessionSource=(?<sessionSource>.*?)&featureMask=(?<featureMask>.*?)&featureMaskV1=(?<featureMaskV1>.*?))"
With this you get null for each dhDebug field on the first log line, and the fields with their values on the second.
You can test it at http://grokdebug.herokuapp.com/
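The alternation behaves the same way in any regex flavor with named groups. Here is a trimmed-down Python sketch (only two of the dhDebug fields are shown for brevity; the full Fluent Bit regex works the same way):

```python
import re

# (?:-|...) alternation: when dhDebug is "-", the first branch matches
# and every named group inside the second branch comes back as None.
DHDEBUG_RE = re.compile(
    r'"dhDebug=(?:-|appVersion=(?P<appVersion>[^&"]*)&osVersion=(?P<osVersion>[^&"]*))"'
)

hit  = '"dhDebug=appVersion=13.0.8&osVersion=9"'
miss = '"dhDebug=-"'

print(DHDEBUG_RE.search(hit).groupdict())   # {'appVersion': '13.0.8', 'osVersion': '9'}
print(DHDEBUG_RE.search(miss).groupdict())  # {'appVersion': None, 'osVersion': None}
```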

Regex to extract only the date from a string

I have strings with the pattern below, and I want to extract only the date.
199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245
199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085
Expected output
01/Jul/1995
01/Jul/1995
01/Jul/1995
Currently I am extracting it in two steps:
1. Extract everything between the square brackets: \[(.*?)\]
2. Extract the first 11 characters from the first step's output: ^.{1,11}
Wondering if it can be done in one step.
In Scala 2.13, consider pattern matching with interpolated string patterns, for example:
List(
  """199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245""",
  """199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985""",
  """199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085"""
) collect { case s"${head}[${day}/${month}/${year}:${tail}" => s"$day/$month/$year" }
outputs
res1: List[String] = List(01/Jul/1995, 01/Jul/1995, 01/Jul/1995)
If you aren't on Scala 2.13 yet, standard regex patterns still work.
val dateRE = "\\[([^:]+):".r.unanchored
List(
  """199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245""",
  """199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985""",
  """199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085"""
) collect { case dateRE(date) => date }
//res0: List[String] = List(01/Jul/1995, 01/Jul/1995, 01/Jul/1995)
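For comparison outside Scala, the same one-step idea works in any regex engine: the date is simply everything between the opening bracket and the first colon. A quick Python sketch:

```python
import re

# One step: capture everything between '[' and the first ':'.
date_re = re.compile(r'\[([^:]+):')

lines = [
    '199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245',
    '199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985',
]
dates = [date_re.search(l).group(1) for l in lines]
print(dates)  # ['01/Jul/1995', '01/Jul/1995']
```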

Scala Apache access log regex not working

I have defined a regex for the Apache access log as below:
val apacheLogPattern = """
^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$
""".r
And a function to parse the log:
def parse_log(line: String) = {
  line match {
    case apacheLogPattern(ipAddress, clientIdentity, userId, dateTime, method, endPoint,
                          protocol, responseCode, contentSize, browser, somethingElse) => "match"
  }
}
val p = """66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"""
parse_log(p)
Calling the parse function gives a MatchError:
scala.MatchError:
66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
(of class java.lang.String)
at .parse_log(:13)
... 28 elided
Can someone help me figure out where the Scala regex is going wrong?
As The fourth bird's comment points out, the pattern is defined in a triple-quoted string spanning several lines, so the pattern itself contains leading and trailing newlines and the anchored regex can never match the input. Defining the pattern on a single line (keeping the .r) fixes it:
val apacheLogPattern = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$""".r
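As a quick sanity check, the same pattern translated to Python (structure unchanged, group numbers instead of the Scala binding names) does match the sample line:

```python
import re

# Same structure as the Scala pattern, written on one line so no stray
# newlines end up inside the anchored regex.
PAT = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$'
)

line = ('66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] '
        '"GET /071300/242153 HTTP/1.1" 404 514 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
m = PAT.match(line)
print(m.group(1), m.group(8), m.group(9))  # 66.249.69.97 404 514
```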

Pig: issue with REPLACE

Below is what my data looks like:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
Below is the Pig code:
loadFulldata = LOAD '/root/Kennadi-Project/Kennadi-data.txt' USING PigStorage(',') AS (fullline:chararray);
extractData = FOREACH loadFulldata GENERATE FLATTEN (REGEX_EXTRACT_ALL(fullline,'(.*) - - (.*) -(.*)] "(.*)" (.*) (.*)'));
rowdata = FOREACH extractData GENERATE $0 as host,$1 as datetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as responsedata;
My extractData looks like:
(199.72.81.55,[01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,[01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,[01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,[01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
(199.120.110.21,[01/Jul/1995:00:00:11,0400,GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0,200,4179)
(burger.letters.com,[01/Jul/1995:00:00:12,0400,GET /images/NASA-logosmall.gif HTTP/1.0,304,0)
When I use REGEX_EXTRACT_ALL I cannot remove the '[' from the data; how can I achieve that?
I also tried to remove the '[' using the REPLACE function, like so:
rowdata = FOREACH extractData GENERATE $0 as host,$1 as datadatetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as responsedata;
newdata = FOREACH rowdata GENERATE REPLACE(datadatetime,'[','');
But I am getting the warnings below:
2016-01-05 05:10:13,758 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2016-01-05 05:10:13,758 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
I think it is because I haven't defined a datatype for datadatetime. How do I define a datatype in FOREACH?
You have a problem. You try solving it using a regular expression. You now have two problems.
Seriously though, after trying it, this seems to be just a problem with the regex. Using
REGEX_EXTRACT_ALL(fullline,'(.*) - - \\[(.*) -(.*)\\] "(.*)" (.*) (.*)')
did the trick for me.
Result:
(199.72.81.55,01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)

Grep removing lines that are semi-similar?

I am reading a file like so:
cat access_logs | grep Ruby
to determine which IPs are accessing one of my files. It returns a huge list. I want to remove semi-duplicates, i.e. lines like the ones below, which are technically the same except for different timestamps. In a massive list with thousands of repeats, is there a way to get only unique IP addresses?
1.2.3.4 - - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:14:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:15:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:15:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
So that, for example, those four lines would be collapsed into just one?
You can do:
awk '/Ruby/{print $1}' file | sort -u
Or you can use grep + cut to get the first column, as suggested in the comments.
You can use awk:
awk '/Ruby/ && !seen[$1]++' access_logs
This prints only the first line for each IP address, even when the timestamps differ for a given IP.
For your input it prints:
1.2.3.4 - - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"