Grep removing lines that are semi-similar? - regex

I am reading a file like so:
cat access_logs | grep Ruby
to determine which IPs are accessing one of my files. It returns a huge list. I want to remove semi-duplicates, i.e. lines like the four below, which are technically the same except for their time/date stamps. In a massive list with thousands of repeats, is there a way to get only the unique IP addresses?
1.2.3.4 - - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:14:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:15:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
1.2.3.4 - - [13/Apr/2014:15:20:38 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
So that, for example, those four lines would be trimmed down to a single line?

You can do:
awk '/Ruby/{print $1}' file | sort -u
Or you can use grep + cut to get the first column, as suggested in the comments.
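For reference, a minimal sketch of the grep + cut variant (assuming the IP address is the first space-delimited field, as in the sample):
grep Ruby access_logs | cut -d' ' -f1 | sort -u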

You can use awk:
awk '/Ruby/ && !seen[$1]++' access_logs
This prints only the first line seen for each IP address, even when the timestamps differ.
For your input it prints:
1.2.3.4 - - [13/Apr/2014:14:20:17 -0400] "GET /color.txt HTTP/1.1" 404 207 "-" "Ruby"
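If the !seen[$1]++ idiom is unfamiliar: seen[$1] is 0 (false) the first time an IP shows up, so the negation is true and awk's default action prints the line; the post-increment then marks that IP as seen, suppressing later lines with the same first field. A commented long-hand equivalent:
awk '/Ruby/ {              # only look at lines mentioning Ruby
    if (seen[$1] == 0)     # first occurrence of this IP (field 1)?
        print              # then print the whole line
    seen[$1]++             # remember the IP for subsequent lines
}' access_logs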

Related

regex to find only date from a string

I have strings in the pattern below and want to extract only the date from each one.
199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245
199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085
Expected output
01/Jul/1995
01/Jul/1995
01/Jul/1995
Currently I am extracting it in two steps:
extract everything between the square brackets: \[(.*?)\]
extract the first 11 characters of that result: ^.{1,11}
Wondering if it can be done in one step.
In Scala 2.13, consider pattern matching with interpolated string patterns, for example:
List(
"""199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245""",
"""199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985""",
"""199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085"""
) collect { case s"${head}[${day}/${month}/${year}:${tail}" => s"$day/$month/$year" }
outputs
res1: List[String] = List(01/Jul/1995, 01/Jul/1995, 01/Jul/1995)
If you aren't on Scala 2.13 yet, standard regex patterns still work.
val dateRE = "\\[([^:]+):".r.unanchored
List(
"""199.120.110.23 - - [01/Jul/1995:00:00:01 -0400] "GET /medium/1/ HTTP/1.0" 200 6245""",
"""199.120.110.22 - - [01/Jul/1995:00:00:06 -0400] "GET /medium/2/ HTTP/1.0" 200 3985""",
"""199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /medium/3/stats/stats.html HTTP/1.0" 200 4085"""
) collect { case dateRE(date) => date }
//res0: List[String] = List(01/Jul/1995, 01/Jul/1995, 01/Jul/1995)
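If you only need the dates at a shell prompt rather than in Scala, the same one-step idea works with GNU grep (a sketch; input.txt stands in for your file, and -o with -P requires PCRE support):
grep -oP '\[\K[^:]+' input.txt
Here \[ anchors on the opening bracket, \K drops it from the reported match, and [^:]+ grabs everything up to the first colon, i.e. 01/Jul/1995.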

Pig: issue with REPLACE

Below is what my data looks like:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
Below is the Pig code:
loadFulldata = LOAD '/root/Kennadi-Project/Kennadi-data.txt' USING PigStorage(',') AS (fullline:chararray);
extractData = FOREACH loadFulldata GENERATE FLATTEN (REGEX_EXTRACT_ALL(fullline,'(.*) - - (.*) -(.*)] "(.*)" (.*) (.*)'));
rowdata = FOREACH extractData GENERATE $0 as host, $1 as datetime, $2 as timezone, $3 as responseurl, $4 as responsecode, $5 as responsedata;
My extractData looks like:
(199.72.81.55,[01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,[01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,[01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,[01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
(199.120.110.21,[01/Jul/1995:00:00:11,0400,GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0,200,4179)
(burger.letters.com,[01/Jul/1995:00:00:12,0400,GET /images/NASA-logosmall.gif HTTP/1.0,304,0)
When I use REGEX_EXTRACT_ALL I cannot strip the '[' from the data; how can I achieve that?
In addition, I tried to remove the '[' using the REPLACE function, like so:
rowdata = FOREACH extractData GENERATE $0 as host, $1 as datadatetime, $2 as timezone, $3 as responseurl, $4 as responsecode, $5 as responsedata;
newdata = FOREACH rowdata GENERATE REPLACE(datadatetime,'[','');
But I am getting the warnings below:
2016-01-05 05:10:13,758 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2016-01-05 05:10:13,758 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
I think it is because I haven't defined a datatype for datadatetime; how do I define a datatype inside a FOREACH?
You have a problem. You try solving it using a regular expression. You now have two problems.
Seriously though, after trying it, this seems to be just a problem with the regex. Using
REGEX_EXTRACT_ALL(fullline,'(.*) - - \\[(.*) -(.*)\\] "(.*)" (.*) (.*)')
did the trick for me.
Result:
(199.72.81.55,01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
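If you want to sanity-check the corrected pattern outside Pig first, the same regex can be exercised with GNU sed (a sketch; access.log stands in for the input file). Note the backslashes are doubled in Pig only because the pattern lives inside a quoted Java string; at the shell a single backslash suffices:
sed -E 's/(.*) - - \[(.*) -(.*)\] "(.*)" (.*) (.*)/\1,\2,\3,\4,\5,\6/' access.log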

awk regex magic (match first occurrence of character in each line)

Have been scratching my head over this one, hoping there's a simple solution that I've missed.
Summary
Simplified: the code below can't cope with IPv6 addresses in the (abbreviated) Apache log passed to it. Do I run the variable through sed before handing it to awk, or can I change the awk field separator to match only the first ":" on each line in $clog?
$ clog='djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /another_url HTTP/1.1" 200 11142
ipv6.com:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "GET /some_url HTTP/1.1" 200 273'
$ echo "$clog" | awk -F '[: -]+' '{ vHost[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f\n", var, vHost[var] }}'
> bogus.com 37262
> djerk.nl 48394
> ipv6.com 0
As can be seen from the last line of $clog, the vhost domain is caught, but not the byte count, which should come out as 273 instead of 0.
Original long question
The problem I have is with the ":" character. In addition to the other two characters (space and dash), I need awk to match only the first occurrence of ":" on each line it evaluates. The following splits each line on those three characters, which works fine until the log entries contain IPv6 addresses.
matrix=$( echo "$clog" | awk -F '[: -]+' '{ vHost[$1]++; Bytes[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
The above code converts the following log entries (contained in variable $clog):
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /wordpress/2014/ssl-intercept-headaches HTTP/1.1" 200 11142 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:30 +0100] "GET /some_other_url HTTP/1.1" 404 37264 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
Into a table like the one below, containing the vhost name (sans TCP port number), hit count, and cumulative byte count, one line per vhost:
djerk.nl 3 85658
bogus.com 1 37262
But IPv6 addresses get unintentionally split due to their notation, and this causes awk to produce bogus output when evaluating these log entries. Sample IPv6 log entry:
djerk.nl:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "POST /wordpress/wp-cron.php?doing_wp_cron=*** HTTP/1.0" 200 273 "-" "WordPress; http://www.djerk.nl/wordpress"
I guess a workaround would be to mangle $clog to replace the first occurrence of ":" and remove this character from the awk field separator. But I don't think native bash substitution can cope with multi-line variables.
clog=$(sed 's/:/ /' <<< "$clog")
matrix=$( echo "$clog" | awk -F '[ -]+' '{ vHost[$1]++; Bytes[$1]+=$10 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
This works because $clog is quoted, which preserves the line feeds, and sed operates on each line individually. As a result (and as shown) the awk command has to be adjusted to ignore ":" and grab $10 instead of $13 for the byte count.
So as it turns out, in writing this, I've already given myself a solution. But I'm sure someone will know of a better, more efficient way.
Just don't split the entire line on colons. Remove the port number from the field you extract instead.
split($1, v, /:/); vHost[v[1]]++; ...
I don't see why you would split on dashes either; with the default field separator the fields are renumbered anyway, so you would end up with something like
awk '{ split($1, v, /:/); vHost[v[1]]++; Bytes[v[1]]+=$11 }
END { for (var in vHost)
printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }'
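Run against the abbreviated $clog from the question, that gives the IPv6 line its byte count back (awk's for-in traversal order is unspecified, so the lines may come out in any order):
echo "$clog" | awk '{ split($1, v, /:/); vHost[v[1]]++; Bytes[v[1]]+=$11 }
END { for (var in vHost) printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }'
> bogus.com 1 37262
> djerk.nl 2 48394
> ipv6.com 1 273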

access_log process in hive

I have access logs of around 500MB. A sample looks like this:
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
10.216.113.172 - - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
How can I extract the month from the timestamp?
Expected output :
year month day event occurrence
2009 jul 15 GET /favicon.ico HTTP/1.1
2009 apr 29 GET / HTTP/1.1
I tried this:
add jar /usr/lib/hive/lib/hive-contrib-0.7.1-cdh3u2.jar;
create table log(ip string, gt string, gt1 string, timestamp string, id1 string, s1 string, s2 string) row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties('input.regex'= '^(\\S+) (\\S+) (\\S+) \\[([[\\w/]+:(\\d{2}:\\d{2}):\\d{2}\\s[+\\-]\\d{4}:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+)')location '/path';
If I understand correctly, string functions will not work in this situation. I am new to regex and Hive.
Thanks in advance.
I'm not familiar with Hadoop/Hive, but as far as regexes go, here is what I would do if I were using Ruby:
log_file = %Q[
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET / HTTP/1.1" 403 15779
10.223.157.186 - - [15/Jul/2009:14:58:59 -0700] "GET /favicon.ico HTTP/1.1" 404 5397
10.216.113.172 - - [29/Apr/2010:07:19:48 -0700] "GET / HTTP/1.1" 200 68831
]
converted_lines = log_file.split("\n").map do |line|
  regex = /^.*? - - \[(\d+)\/(\w+)\/(\d{4}).*?\] (.*)/
  matches = regex.match(line)
  next unless matches # skip the empty line produced by the leading newline
  [
    [:year, matches[3]],
    [:month, matches[2]],
    [:day, matches[1]],
    [:event_occurrence, matches[4]],
  ]
end.compact
Hope that helps.
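The same capture idea works as a shell one-liner if you just want to eyeball the transform before wiring it into Hive (a sketch with GNU sed; note the month keeps the log's capitalization rather than the lowercase form in the expected output):
sed -E 's|.*\[([0-9]+)/([A-Za-z]+)/([0-9]{4}).*\] "([^"]*)".*|\3 \2 \1 \4|' access_log
which prints:
2009 Jul 15 GET / HTTP/1.1
2009 Jul 15 GET /favicon.ico HTTP/1.1
2010 Apr 29 GET / HTTP/1.1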

regex to modify Domino HTTP log to common format for Piwik import

I need to import old HTTP log files from my Domino web server into my Piwik tracking.
The problem is the format of the log when a user is logged in.
Normal/good format example:
123.123.123 www.example.com - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
Bad format example, produced when a user is logged in:
123.123.123 www.example.com "CN=SomeUser/OU=SomeOU/O=SomeO" - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
i am looking for a one-liner bash to remove those CN information if it is included.
UPDATE:
This is my solution for a one-liner to import a Domino log file into Piwik. Maybe someday someone finds this thing and needn't flip their table:
for i in `ls -v *.log`; do date && echo " processing" $i && echo " " && awk '{sub(/ +"CN=[^"]+" +/," - ")}1' $i | grep -v http.monitor | grep -v nagios > $i.cleanTmp && python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://127.0.0.1/piwik --idsite=8 $i.cleanTmp --dry-run && rm $i.cleanTmp; done;
If you need a pure bash solution, you can do something like this:
Example file
cat >infile <<XXX
123.123.123 www.example.com "CN=SomeUser/OU=SomeOU/O=SomeO" - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
XXX
while read -r x; do
  [[ $x =~ \ +\"CN=[^\"]+\"\ + ]] && x=${x/$BASH_REMATCH/ }
  echo "$x"
done <infile
Output:
123.123.123 www.example.com - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
It looks for a string starting with spaces, then "CN=, then a run of non-quote characters, then a closing quote, then more spaces. If this pattern is found, it is replaced with a single space.
If the log files are big (>1MB) and this has to be done periodically, use awk instead of the pure bash solution:
awk '{sub(/ +"CN=[^"]+" +/," ")}1' infile
So you just want to remove the "CN=SomeUser/OU=SomeOU/O=SomeO" part?
The regex to match that looks like this:
"CN=\w+\/OU=\w+\/O=\w+"