I'm writing a C++ program to parse pieces out of web logs, and one of the pieces I want is the requested page. I'm using string::find to define the beginning and end of the page, then using string::substr to extract it. Here is an example line:
172.138.80.174 - - [05/Aug/2001:21:06:27 -0300] "GET /~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/?Keywords=stringVar+%2B+savitch&view=2+80+0&did=" "Mozilla/4.61 [en] (Win98; I)"
The requested page is the part right after the GET, and the end is right before the HTTP is, So I do something like :
int beginning = log_entry.find("\"GET") + 5;
int end = log_entry.find("HTTP) - 5;
std::string requested_page = log_entry.substr(beginning, end);
This is then what would be contained within requested_page:
/~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/
Instead of
/~csc226
As you can see, the beginning is correct, but the end is not. I have a log of 3000 lines with the same syntax as the example entry above, and the beginnings of the requested pages in all of them are correct and the ends are not.
Any ideas as to what is going wrong?
Thanks!
Don't store the result of find in an int. use std::string::size_type aka std::size_t.
To test if it failed, then compare against std::string::npos.
Second, never ever manipulate the result of std::string::find until you both confirm it is not npos and know that the manipulation moves it within the valid range. +5 and -5 blindly is a no-go. I don't care if you "know" what your data is. Don't write buffer overflow culpable code.
Finally, substr( start, LENGTH ) not substr( start, end ).
std::string was imported from a different source library than the standard containers. So its conventions are very different (and often worse).
172.138.80.174 - - [05/Aug/2001:21:06:27 -0300] "GET /~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/?Keywords=stringVar+%2B+savitch&view=2+80+0&did=" "Mozilla/4.61 [en] (Win98; I)"
So:
log_entry.find("\"GET") + 5; will match: "GET and then move the iterator 5 places forward to the location:
172.138.80.174 - - [05/Aug/2001:21:06:27 -0300] "GET /~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/?Keywords=stringVar+%2B+savitch&view=2+80+0&did=" "Mozilla/4.61 [en] (Win98; I)"
^
Next `log_entry.find("HTTP"); will match HTTP:
172.138.80.174 - - [05/Aug/2001:21:06:27 -0300] "GET /~csc226 HTTP/1.0" 301 303 "http://www.goto.com/d/search/?Keywords=stringVar+%2B+savitch&view=2+80+0&did=" "Mozilla/4.61 [en] (Win98; I)"
^
You want to use (size_t length = log_entry.find("\"HTTP") - log_entry.find("\"GET") - 5;). Finally you need to use std::string::substr correctly here.
Related
I'm writing a Python 3.7.2 program to parse Apache logs looking for all successful response codes. I've got regex written right now that will parse all correct Apache log entries into individual tuples of [origin] [date/time] [HTML method/file/protocol] [response code] and [file size] and then I just check to see if the response code is 3xx. The problem is there are several entries that are corrupt, some corrupt enough to be unreadable so I've stripped them out in a different part of the program. Several are just missing the closing " (quotation mark) on the method/protocol item causing it to throw an error each time I parse that line. I'm thinking I need to use a RegEx Or expression for " OR whitespace but that seems to break the quote into a different tuple item instead of looking for say, "GET 613.html HTTP/1.0" OR "GET 613.html HTTP/1.0 I'm new to regex and thoroughly stumped, can anyone explain what I'm doing wrong?
I should note that the logs have been scrubbed of some info, instead of origin IP it only shows 'local' or 'remote' and the OS/browser info is removed entirely.
This is the regex for the relevant tuple item that works with valid entries: "(.*)?" I've also tried:
"(.*)?("|\s) - creates another tuple item and still throws error
Here's a snippet of the log entries including the last entry which is missing it's closing "
local - - [27/Oct/1994:18:47:03 -0600] "GET index.html HTTP/1.0" 200 3185
local - - [27/Oct/1994:18:48:53 -0600] "GET index.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:49:55 -0600] "GET index.html HTTP/1.0" 303 3185
local - - [27/Oct/1994:18:50:25 -0600] "GET 612.html HTTP/1.0" 404 -
local - - [27/Oct/1994:18:50:41 -0600] "GET index.html HTTP/1.0" 200 388
local - - [27/Oct/1994:18:50:52 -0600] "GET 613.html HTTP/1.0 303 728
regex = '([(\w+)]+) - - \[(.*?)\] "(.*)?" (\d+) (\S+)'
import re
with open("validlogs.txt") as validlogs:
i = 0
array = []
successcodes = 0
for line in validlogs:
array.append(line)
loglength = len(array)
while (i < loglength):
line = re.match(regex, array[i]).groups()
if(line[3].startswith("3")):
successcodes+=1
i+=1
print("Number of successcodes: ", successcodes)
Parsing the log responses above should give Number of success codes: 2
Instead I get: Traceback (most recent call last):
File "test.py", line 24, in
line = re.match(regex, array[i]).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
because (I believe) regex is looking explicitly for a " and can't handle the line entry that's missing it.
So I originally used re.match with ([(\w+)]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+) with a Try: / Except: continue code to parse all the logs that actually matched the pattern. Since ~100,000 of the ~750,000 lines didn't conform to the correct Apache logs pattern, I wound up changing my code to re.search with much smaller segments instead.
For instance:
with open("./http_access_log.txt") as logs:
for line in logs:
if re.search('\s*(30\d)\s\S+', line): #Checking for 30x redirect codes
redirectCounter += 1
I've read that re.match is faster than re.search but I felt that being able to accurately capture the most possible log entries (this handles all but about 2000 lines, most of which have no usable info) was more important.
Consider an access log of a REST API, you will see lines (simplified) that looks like this:
2017-01-01T12:12:41Z "GET /api/posts" HTTP/1.1 200 "-"
2017-01-01T12:12:42Z "GET /api/posts/56/comments" HTTP/1.1 200 "-"
2017-01-01T12:12:42Z "GET /api/posts" HTTP/1.1 200 "-"
2017-01-01T12:12:56Z "POST /api/posts" HTTP/1.1 202 "Safari"
2017-01-01T12:12:58Z "GET /api/posts/134/comments" HTTP/1.1 200 "-"
To parse that you could write something like :
_collector=access.log | regex parse "(?<method>[A-Z]+) /api/(?<path>[\w\d\/]+) HTTP"
This would extract METHOD and PATH form the log lines, BUT you would see these unique values:
GET posts
POST posts
GET posts/56/comments
GET posts/134/comments
I wish to throw away all the dynamic parts of the url, so I could find the following instead:
GET posts
POST posts
GET posts/{id}/comments
I could figure out this in a search and replace regex easily enough, but is it even possible in Sumologic?
Below is how my data looks like:
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
Below is the Pig code:
loadFulldata = LOAD '/root/Kennadi-Project/Kennadi-data.txt' USING PigStorage(',') AS (fullline:chararray);
extractData = FOREACH loadFulldata GENERATE FLATTEN (REGEX_EXTRACT_ALL(fullline,'(.*) - - (.*) -(.*)] "(.*)" (.*) (.*)'));
rowdata = FOREACH extractData GENERATE $0 as host,$1 as datetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as response data;
My extractData looks like:
(199.72.81.55,[01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,[01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,[01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,[01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
(199.120.110.21,[01/Jul/1995:00:00:11,0400,GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0,200,4179)
(burger.letters.com,[01/Jul/1995:00:00:12,0400,GET /images/NASA-logosmall.gif HTTP/1.0,304,0)
When I use REGEX_EXTRACT_ALL I cannot remove '[' from the data, how can I achieve that?
In addition, I tried to remove '[' using REPLACE function like so:
rowdata = FOREACH extractData GENERATE $0 as host,$1 as datadatetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as response data;
newdata = FOREACH rowdata GENERATE REPLACE(datadatetime,'[','');
But I am getting below warning:
2016-01-05 05:10:13,758 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2016-01-05 05:10:13,758 [main] WARN org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
I think it is because I haven't defined any datatype for datadatetime, how do I define datatype in foreach?
You have a problem. You try solving it using a regular expression. You now have two problems.
Seriously though, after trying it this seems to just be a problem with the regex. Using
REGEX_EXTRACT_ALL(fullline,'(.*) - - \\[(.*) -(.*)\\] "(.*)" (.*) (.*)')
did the trick for me.
Result:
(199.72.81.55,01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
Have been scratching my head over this one, hoping there's a simple solution that I've missed.
Summary
Simplified the following code can't cope with IPv6 addresses in the (here abbreviated) apache log parsed to it. Do I SED the variable before parsing to AWK or can I change the AWK regex to match only the first ":" on each line in $clog?
$ clog='djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /another_url HTTP/1.1" 200 11142
ipv6.com:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "GET /some_url HTTP/1.1" 200 273'
$ echo "$clog" | awk -F '[: -]+' '{ vHost[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f\n", var, vHost[var] }}'
> bogus.com 37262
> djerk.nl 48394
> ipv6.com 0
As can be seen the last line of variable $clog, the vhost domain is caught but not the byte count which should come out at 273 instead of 0.
Original long question
The problem I have is with the ":" character. In addition to the other two characters (space and dash), I need AWK to match only the first occurrence of ":" in each line it's evaluating. the following splits each line by three characters which works fine, until the log entries contain IPv6 addresses.
matrix=$( echo "$clog" | awk -F '[: -]+' '{ vHost[$1]++; Bytes[$1]+=$13 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
The above code converts the following log entries (contained in variable $clog):
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:25 +0100] "GET /some_url HTTP/1.1" 404 37252 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
bogus.com:80 200.87.62.227 - - [20/Nov/2015:01:06:27 +0100] "GET /some_url HTTP/1.1" 404 37262 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:29 +0100] "GET /wordpress/2014/ssl-intercept-headaches HTTP/1.1" 200 11142 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4"
djerk.nl:80 200.87.62.227 - - [20/Nov/2015:01:06:30 +0100] "GET /some_other_url HTTP/1.1" 404 37264 "-" "Safari/11601.1.56 CFNetwork/760.0.5 Darwin/15.0.0 (x86_64)"
Into a table like so, containing vhost name (sans TCP port number), hits and cumulative byte count. One line per vhost:
djerk.nl 3 85658
bogus.com 1 37262
But IPv6 addresses get unintentionally split due to their notation and this causes AWK to produce bogus output when evaluation these log entries. Sample IPv6 log entry:
djerk.nl:80 2a01:3e8:abcd:320::1 - - [20/Nov/2015:01:35:24 +0100] "POST /wordpress/wp-cron.php?doing_wp_cron=*** HTTP/1.0" 200 273 "-" "WordPress; http://www.djerk.nl/wordpress"
I guess a work around would be to mangle variable $clog to replace the first occurrence of ":" and remove this character from the AWK regex. But I don't think native bash substitution is capable of negotiating variables with multiple lines.
clog=$(sed 's/:/ /' <<< "$clog")
matrix=$( echo "$clog" | awk -F '[ -]+' '{ vHost[$1]++; Bytes[$1]+=$10 } END { for (var in vHost) { printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }}' )
This works because $clog is quoted which preserves the line feeds and runs sed on each line individually. As a result (and shown) the AWK line needs to be adjusted to ignore ":" and grab $10 instead of $13 for the byte count.
So as it turns out, in writing this, I've already given myself a solution. But I'm sure someone will know of a better more efficient way.
Just don't split the entire line on colons. Remove the port number from the field you extract instead.
split($1, v, /:/); vHost[v[1]]++; ...
I don't see why you would split on dashes, either; either way, the field numbers will be renumbered, so you would end up with something like
awk '{ split($1, v, /:/); vHost[v[1]]++; Bytes[v[1]]+=$11 }
END { for (var in vHost)
printf "%s %.0f %.0f\n", var, vHost[var], Bytes[var] }'
I am trying to grab the output from an nginx log file and send it to logstash.
10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000] “POST /main/foo.git/git-upload-pack HTTP/1.1” 200 3653189 “-” “git/1.8.3.4 (Apple Git–47)”
Grock is able to find the first 3 words fine
10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000]
%{IPV4:user_ip} - %{USERNAME:user_name} \[%{HTTPDATE:time_local}\]
Grok is able to find the 3rd and 4th words fine
[14/Feb/2014:18:57:05 +0000] “POST /main/foo.git/git-upload-pack HTTP/1.1”
\[%{HTTPDATE:time_local}\] %{QUOTEDSTRING:request}
However when I combine them, and try to find all 4, grok says there are no results (using http://grokdebug.herokuapp.com/ for testing)
10.1.10.20 - bob [14/Feb/2014:18:57:05 +0000] “POST /main/foo.git/git-upload-pack HTTP/1.1”
%{IPV4:user_ip} - %{USERNAME:user_name} \[%{HTTPDATE:time_local}\] %{QUOTEDSTRING:request}
#not found
Anyone know how to get the quoted string in the above example?
I'm brand new to grok, so perhaps I'm not approaching this correctly.
Update
Interestingly if I use the following log line and then manually type in the url it does work
bob 14/Feb/2014:18:57:05 +0000 "herp"
#Once herp works, replace herp, with POST
bob 14/Feb/2014:18:57:05 +0000 "POST"
#Once POST works, keep expounding until the whole thing is in place
autobuild 14/Feb/2014:18:57:05 +0000 "POST /main/builder.git/git-upload-pack HTTP/1.1"
"POST /main/builder.git/git-upload-pack HTTP/1.1" in pattern
"%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}"
The process of posting to stack overflow identified the problem.
If you look carefully, the double quotes are parsed differently
"POST
vs
“POST
Manually typing in the double quote fixes the problem
Also you can use this expression for the cases where the log changes:
"%{WORD:verb}(?:| %{URIPATHPARAM:request})(?:| HTTP/%{NUMBER:httpversion})"
it matches with:
"POST /main/builder.git/git-upload-pack HTTP/1.1"
or
"POST /main/builder.git/git-upload-pack"
or
"POST"
try it.. ;)