Pig: issue with REPLACE

Below is what my data looks like:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
Below is the Pig code:
loadFulldata = LOAD '/root/Kennadi-Project/Kennadi-data.txt' USING PigStorage(',') AS (fullline:chararray);
extractData = FOREACH loadFulldata GENERATE FLATTEN (REGEX_EXTRACT_ALL(fullline,'(.*) - - (.*) -(.*)] "(.*)" (.*) (.*)'));
rowdata = FOREACH extractData GENERATE $0 as host,$1 as datetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as responsedata;
My extractData looks like:
(199.72.81.55,[01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,[01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,[01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,[01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
(199.120.110.21,[01/Jul/1995:00:00:11,0400,GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0,200,4179)
(burger.letters.com,[01/Jul/1995:00:00:12,0400,GET /images/NASA-logosmall.gif HTTP/1.0,304,0)
When I use REGEX_EXTRACT_ALL I cannot remove the '[' from the data. How can I achieve that?
In addition, I tried to remove the '[' using the REPLACE function like so:
rowdata = FOREACH extractData GENERATE $0 as host,$1 as datadatetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as responsedata;
newdata = FOREACH rowdata GENERATE REPLACE(datadatetime,'[','');
But I am getting the warnings below:
2016-01-05 05:10:13,758 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2016-01-05 05:10:13,758 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
I think it is because I haven't defined a datatype for datadatetime. How do I define a datatype in FOREACH?

You have a problem. You try solving it using a regular expression. You now have two problems.
Seriously though, after trying it, this seems to just be a problem with the regex: the opening bracket is never matched outside the capture group, so it ends up inside the datetime field. Matching the brackets explicitly with \\[ and \\] fixes that. Using
REGEX_EXTRACT_ALL(fullline,'(.*) - - \\[(.*) -(.*)\\] "(.*)" (.*) (.*)')
did the trick for me.
Result:
(199.72.81.55,01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
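If you want to sanity-check the pattern outside Pig first, here is a rough Python equivalent (my own addition, not part of the Pig script); the doubled backslashes in the Pig string literal become single ones in the raw string, and the groups come out as the same six fields:

import re

# Same pattern as in the answer above; \\[ and \\] in the Pig string are \[ and \] here.
pattern = re.compile(r'(.*) - - \[(.*) -(.*)\] "(.*)" (.*) (.*)')

line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
m = pattern.match(line)
if m:
    print(m.groups())
# ('199.72.81.55', '01/Jul/1995:00:00:01', '0400', 'GET /history/apollo/ HTTP/1.0', '200', '6245')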

Related

regex to find only date from a string

I have a string with the pattern below. I want to extract only the date from the string.
199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245
199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085
Expected output
01/Jul/1995
01/Jul/1995
01/Jul/1995
Currently I am extracting it in two steps.
Extract everything between the square brackets: \[(.*?)\]
Extract the first 11 characters from the first step's output string: ^.{1,11}
I'm wondering if it can be done in one step.
In Scala 2.13, consider pattern matching with interpolated string patterns, for example:
List(
"""199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245""",
"""199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985""",
"""199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085"""
) collect { case s"${head}[${day}/${month}/${year}:${tail}" => s"$day/$month/$year" }
outputs
res1: List[String] = List(01/Jul/1995, 01/Jul/1995, 01/Jul/1995)
If you aren't on Scala 2.13 yet, standard regex patterns still work.
val dateRE = "\\[([^:]+):".r.unanchored
List(
"""199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245""",
"""199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985""",
"""199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085"""
) collect { case dateRE(date) => date }
//res0: List[String] = List(01/Jul/1995, 01/Jul/1995, 01/Jul/1995)
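If Scala isn't a requirement and you just want a single regex, the \[([^:]+): pattern from the answer above works the same way elsewhere; here is a minimal Python sketch of the one-step extraction:

import re

lines = [
    '199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245',
    '199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985',
    '199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085',
]

# One step: capture everything between '[' and the first ':'.
date_re = re.compile(r'\[([^:]+):')
print([date_re.search(line).group(1) for line in lines])
# ['01/Jul/1995', '01/Jul/1995', '01/Jul/1995']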

Parsing corrupt Apache logs using regex

I'm writing a Python 3.7.2 program to parse Apache logs looking for all successful response codes. I have a regex right now that parses each valid Apache log entry into a tuple of [origin] [date/time] [HTML method/file/protocol] [response code] and [file size], and then I just check whether the response code is 3xx. The problem is that several entries are corrupt, some badly enough to be unreadable, so I've stripped those out in a different part of the program. Several others are just missing the closing " (quotation mark) on the method/protocol item, which makes the parse throw an error on those lines. I'm thinking I need a regex OR expression for " OR whitespace, but that seems to break the quote into a different tuple item instead of matching, say, "GET 613.html HTTP/1.0" OR "GET 613.html HTTP/1.0. I'm new to regex and thoroughly stumped; can anyone explain what I'm doing wrong?
I should note that the logs have been scrubbed of some info: instead of the origin IP they only show 'local' or 'remote', and the OS/browser info is removed entirely.
This is the regex for the relevant tuple item that works with valid entries: "(.*)?". I've also tried:
"(.*)?("|\s) - creates another tuple item and still throws the error
Here's a snippet of the log entries, including the last entry, which is missing its closing "
local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185
local - - [27/Oct/1994:18:48:53 -0600] "GET index.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:49:55 -0600] "GET index.html HTTP/1.0" 303 3185
local - - [27/Oct/1994:18:50:25 -0600] "GET 612.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:50:41 -0600] "GET index.html HTTP/1.0" 200 388
local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728
regex = '([(\w+)]+) - - \[(.*?)\] "(.*)?" (\d+) (\S+)'
import re
with open("validlogs.txt") as validlogs:
    i = 0
    array = []
    successcodes = 0
    for line in validlogs:
        array.append(line)
    loglength = len(array)
    while (i < loglength):
        line = re.match(regex, array[i]).groups()
        if(line[3].startswith("3")):
            successcodes+=1
        i+=1
    print("Number of successcodes: ", successcodes)
Parsing the log responses above should give Number of success codes: 2
Instead I get: Traceback (most recent call last):
File "test.py", line 24, in
line = re.match(regex, array[i]).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
because (I believe) the regex is looking explicitly for a " and can't handle the line entry that's missing it.
So I originally used re.match with ([(\w+)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) and a try/except-continue block to parse all the logs that actually matched the pattern. Since ~100,000 of the ~750,000 lines didn't conform to the correct Apache log pattern, I wound up changing my code to re.search with much smaller segments instead.
For instance:
with open("./http_access_log.txt") as logs:
for line in logs:
if re.search('\s*(30\d)\s\S+', line): #Checking for 30x redirect codes
redirectCounter += 1
I've read that re.match is faster than re.search, but I felt that accurately capturing as many log entries as possible (this handles all but about 2,000 lines, most of which have no usable info) was more important.
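If you still want one full-line pattern instead of re.search on small segments, another option (a rough sketch, not the only fix) is to make the closing quote itself optional, so the corrupt lines still match:

import re

# The "? makes the closing quote optional, so the line that lost its quote still matches;
# group 4 is the response code.
log_re = re.compile(r'(\S+) - - \[(.*?)\] "(.*?)"? (\d+) (\S+)')

lines = [
    'local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185',
    'local - - [27/Oct/1994:18:49:55 -0600] "GET index.html HTTP/1.0" 303 3185',
    'local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728',  # missing closing "
]

successcodes = 0
for line in lines:
    m = log_re.match(line)
    if m and m.group(4).startswith("3"):
        successcodes += 1
print("Number of success codes:", successcodes)  # Number of success codes: 2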

Grep removing lines that are semi-similar?

I am reading a file like so:
cat access_logs | grep Ruby
to determine which IPs are accessing one of my files. It returns a huge list. I want to remove semi-duplicates, i.e. lines that are technically the same except for different time/date stamps. In a massive list with thousands of repeats, is there a way to get only unique IP addresses?
1.2.3.4 - - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:14:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:15:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:15:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
So that for example those 4 lines would be trimmed into only one line?
You can do:
awk '/Ruby/{print $1}' file | sort -u
Or you can use grep + cut to get the first column, as suggested in the comments.
You can use awk:
awk '/Ruby/ && !seen[$1]++' access_logs
This will print only the first line for each IP address, even if the timestamps differ for a given IP.
For your input it prints:
1.2.3.4 - - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
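If you'd rather not use shell tools at all, the same first-line-per-IP idea is a few lines of Python; a minimal sketch, assuming the log file is named access_logs as above:

seen = set()
with open("access_logs") as logs:
    for line in logs:
        if "Ruby" not in line:
            continue                # same filter as grep Ruby / awk '/Ruby/'
        ip = line.split()[0]        # first field, like awk's $1
        if ip not in seen:          # keep only the first line per IP
            seen.add(ip)
            print(line, end="")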

access_log process in hive

I have access_logs of around 500 MB; here is a sample:
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
10.216.113.172 - - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
How can I extract the month from the timestamp?
Expected output:
year month day event occurrence
2009 jul 15 GET /favicon.ico HTTP/1.1
2010 apr 29 GET / HTTP/1.1
I tried this:
add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u2.jar;
create table log(ip string, gt string, gt1 string, timestamp string, id1 string, s1 string, s2 string) row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties('input.regex'= '^(\\S+) (\\S+) (\\S+) \\[([[\\w/]+:(\\d{2}:\\d{2}):\\d{2}\\s[+\\-]\\d{4}:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+)')location '/path';
If I understand correctly, string functions will not work in this situation. I am new to regex and Hive.
Help me, thanks in advance.
I'm not familiar with Hadoop/Hive, but as far as regexes go, here is what I'd do if I were using Ruby:
log_file = %Q[
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
10.216.113.172 - - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
]
# reject(&:empty?) drops the blank entry produced by the leading newline in log_file
converted_lines = log_file.split("\n").reject(&:empty?).map do |line|
  regex = /^.*? - - \[(\d+)\/(\w+)\/(\d{4}).*?\] (.*)/
  matches = regex.match(line)
  [
    [:year, matches[3]],
    [:month, matches[2]],
    [:day, matches[1]],
    [:event_occurrence, matches[4]],
  ]
end
Hope that helps.
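Along the same lines, if you just want to check the extraction logic outside Hive, here is a small Python sketch (sample lines copied from the question, regex my own) that pulls the year, month, and day out of the bracketed timestamp along with the request:

import re

lines = [
    '10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397',
    '10.216.113.172 - - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831',
]

# day/month/year come from the bracketed timestamp; the quoted request is the event
log_re = re.compile(r'^\S+ - - \[(\d{2})/(\w{3})/(\d{4})[^\]]*\] "([^"]*)"')
for line in lines:
    m = log_re.match(line)
    if m:
        day, month, year, event = m.groups()
        print(year, month.lower(), day, event)
# 2009 jul 15 GET /favicon.ico HTTP/1.1
# 2010 apr 29 GET / HTTP/1.1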

Assistance needed with regular expressions [closed]

The HTTP messages are listed below right after the questions.
I need a regular expression that finds the HTTP status codes within both messages.
Another one that finds the name of the requesting user in both messages.
A last one that finds the time stamp within both messages.
127.0.0.1 - Johny [17/Dec/2010:17:15:16 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
127.0.0.1 - debbie7 [19/Dec/2010:11:11:02 -0700] "GET /apache_pbs.gif
HTTP/1.0" 404 2336
Thanks!
Description
You can pull the values {username, date, and http code} in one pass using this regex:
^.*?-\s(\S*)\s+\[([^\]]*)\]\s"[^"]*"\s(\d+)\s\d+
Groups
Group 0 gets the entire line, while the other groups individually get the respective matches:
Group 1 gets the username
Group 2 gets the date stamp
Group 3 gets the HTTP status code
PHP Code Example:
You didn't select a language, so I present a PHP example to show how the regex works.
Given the input string, complete with a line break in the middle of the message area:
127.0.0.1 - Johny [17/Dec/2010:17:15:16 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
127.0.0.1 - debbie7 [19/Dec/2010:11:11:02 -0700] "GET /apache_pbs.gif
HTTP/1.0" 404 2336
Code Example
<?php
$sourcestring="your source string";
preg_match_all('/^.*?-\s(\S*)\s+\[([^\]]*)\]\s"[^"]*"\s(\d+)\s\d+/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => 127.0.0.1 - Johny [17/Dec/2010:17:15:16 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
[1] => 127.0.0.1 - debbie7 [19/Dec/2010:11:11:02 -0700] "GET /apache_pbs.gif
HTTP/1.0" 404 2336
)
[1] => Array
(
[0] => Johny
[1] => debbie7
)
[2] => Array
(
[0] => 17/Dec/2010:17:15:16 -0700
[1] => 19/Dec/2010:11:11:02 -0700
)
[3] => Array
(
[0] => 200
[1] => 404
)
)
HTTP status:
(?<=HTTP/1.0" )\d+
Requesting user (works for any ip address):
(?<=(\d\d?\d?\.){3}\d\d?\d? - )\w+(?= \[)
Timestamp:
(?<=\[).*(?=\])
You can try this regex to achieve it:
^.* (\w*) \[([^\]]*)] \"[\w.\/ ]*\" ([\d]+)
Input:
127.0.0.1 - Johny [17/Dec/2010:17:15:16 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
Output:
Group 1: Johny
Group 2: 17/Dec/2010:17:15:16 -0700
Group 3: 200
In Perl:
m!([a-zA-Z]+) \W+
  (.* -) [\w\W]+
  HTTP/1.0" \ ([\d]+)
!x
$1 -> username
$2 -> timestamp
$3 -> status
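Since the question doesn't pin down a language, here is the same one-pass extraction sketched in Python with named groups (the group names are my own choice), run against the two messages from the question:

import re

messages = '''127.0.0.1 - Johny [17/Dec/2010:17:15:16 -0700] "GET /apache_pb.gif
HTTP/1.0" 200 2326
127.0.0.1 - debbie7 [19/Dec/2010:11:11:02 -0700] "GET /apache_pbs.gif
HTTP/1.0" 404 2336'''

# The negated character classes ([^\]]+ and [^"]*) let the match span the
# line break inside each message.
log_re = re.compile(
    r'^\S+ - (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] "[^"]*" (?P<status>\d+)',
    re.MULTILINE,
)
for m in log_re.finditer(messages):
    print(m.group('user'), m.group('timestamp'), m.group('status'))
# Johny 17/Dec/2010:17:15:16 -0700 200
# debbie7 19/Dec/2010:11:11:02 -0700 404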