I need to exclude some sensitive details from my Apache log, but I want to keep the log and the URIs in it. Is it possible to achieve the following? My access log currently contains:
127.0.0.1 - - [27/Feb/2012:13:18:12 +0100] "GET /api.php?param=secret HTTP/1.1" 200 7600 "http://localhost/api.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
I want to replace "secret" with "[FILTERED]" like this:
127.0.0.1 - - [27/Feb/2012:13:18:12 +0100] "GET /api.php?param=[FILTERED] HTTP/1.1" 200 7600 "http://localhost/api.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
I know I probably should have used POST to send this variable, but the damage is already done. I've looked at http://httpd.apache.org/docs/2.4/logs.html and LogFormat, but could not find any way to use a regular expression or similar. Any suggestions?
[edit]
Do NOT send sensitive variables as GET parameters if you have a choice.
I've found one way to solve the problem. If I pipe the log output to sed, I can perform a regex replace on the output before I append it to the log file.
Example 1
CustomLog "|/bin/sed -E s/'param=[^& \t\n]*'/'param=\[FILTERED\]'/g >> /your/path/access.log" combined
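The regex itself can be checked outside Apache. A minimal Python sketch of the same substitution, using a shortened hypothetical log line:

```python
import re

line = '127.0.0.1 - - [27/Feb/2012:13:18:12 +0100] "GET /api.php?param=secret HTTP/1.1" 200 7600'

# Same pattern as the sed command: everything after "param=" up to the
# next "&", space, tab, or newline is replaced with [FILTERED].
filtered = re.sub(r'param=[^& \t\n]*', 'param=[FILTERED]', line)
print(filtered)
```

The character class stops at `&` so only the one parameter value is masked, not the rest of the query string.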
Example 2
It's also possible to exclude several parameters:
exclude.sh
#!/bin/bash
while read x ; do
  result=$x
  for ARG in "$@"
  do
    cleanArg=`echo $ARG | sed -E 's|([^0-9a-zA-Z_])|\\\\\1|g'`
    result=`echo $result | sed -E s/$cleanArg'=[^& \t\n]*'/$cleanArg'=\[FILTERED\]'/g`
  done
  echo $result
done
Move the script above to the folder /opt/scripts/ or somewhere else, give it execute rights (chmod +x exclude.sh), and modify your Apache config like this:
CustomLog "|/opt/scripts/exclude.sh param param1 param2 >> /your/path/access.log" combined
Documentation
http://httpd.apache.org/docs/2.4/logs.html#piped
http://www.gnu.org/software/sed/manual/sed.html
If you want to exclude several parameters but don't want to use a script, you can use groups like this:
CustomLog "|$/bin/sed -E s/'(email|password)=[^& \t\n]*'/'\\\\\1=\[FILTERED\]'/g >> /var/log/apache2/${APACHE_LOG_FILENAME}.access.log" combined
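The grouped form with its backreference can also be sketched in Python, using a hypothetical request line:

```python
import re

line = '"GET /login.php?email=a@b.com&password=hunter2 HTTP/1.1"'

# Group the parameter names and reuse the matched name via a backreference,
# mirroring the sed expression (email|password)=[^& \t\n]* -> \1=[FILTERED]
filtered = re.sub(r'(email|password)=[^& \t\n]*', r'\1=[FILTERED]', line)
print(filtered)
```

Each parameter keeps its own name in the output while the value is masked.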
Related
I have the simple regex below that works pretty well to split the given sample log. It provides separate groups that I can access with $1, $2, $3, etc. I'm using this in Splunk.
Eg.
$1 = https
$2 = 2020-08-20T12:40:00.274478Z
$3 = app/my-aws-alb/e7538073dd1a6fd8
(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+?)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)(.*?\s+)
https 2020-08-20T12:40:00.274478Z app/my-aws-alb/e7538073dd1a6fd8 162.158.26.188:21098 172.0.51.37:80 0.000 0.004 0.000 405 405 974 424 "POST https://my-aws-alb-domain:443/api/ps/fpx/callback HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.2840.91 Safari/537.36" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:ap-southeast-1:111111111111:targetgroup/my-aws-target-group/41dbd234b301e3d84 "Root=1-5f3e6f20-3fdasdsfffdsf" "api.mydomain.com" "arn:aws:acm:ap-southeast-1:11111111111:certificate/be4344424-a40f-416e-8434c-88a8a3b072f5" 0 2020-08-20T12:40:00.270000Z "forward" "-" "-" "172.0.51.37:80" "405" "-" "-"
The problem is that I want to separate each IP:Port into its own groups. There are multiple places that contain an IP:Port, and I need each part as a separate group like the other fields.
Eg.
$4 = 162.158.26.188
$5 = 21098
$6 = 172.0.51.37
$7 = 80
Can anyone help on this? Thank you!
Here's a regex that will pull all of the ip:port values from a field:
| rex field=_raw max_match=0 "(?<ip_port>\d+\.\d+\.\d+\.\d+\:\d+)"
Now expand the ip_port field:
| mvexpand ip_port
And then extract from ip_port into ip & port:
| rex field=ip_port "(?<ip>\d+\.\d+\.\d+\.\d+):(?<port>\d+)"
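Outside Splunk, the same two-step extraction can be sketched in Python (sample line shortened from the question):

```python
import re

line = ('https 2020-08-20T12:40:00.274478Z app/my-aws-alb/e7538073dd1a6fd8 '
        '162.158.26.188:21098 172.0.51.37:80 0.000')

# Step 1: pull every ip:port token (the equivalent of max_match=0)
ip_ports = re.findall(r'\d+\.\d+\.\d+\.\d+:\d+', line)

# Step 2: split each token into ip and port via named groups
pairs = [re.match(r'(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)', t).groupdict()
         for t in ip_ports]
print(pairs)
```

The timestamp and decimal fields are not picked up because the pattern requires four dot-separated number groups followed by a colon.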
I am using the Telegraf plugin [[inputs.logparser]] to grab the access_log data from Apache, based on a local web page I have running.
Using ["%{COMBINED_LOG_FORMAT}"] patterns, I am able to retrieve the default measurements provided by the access_logs, including http_version, request, resp_bytes etc.
I have appended the LogFormat in my httpd.conf file to include the additional response time of each request by adding %D at the end; this has been successful when I look at the access_log afterwards.
However, I have so far been unable to tell Telegraf to acknowledge this new measurement with inputs.logparser. I am using a Grafana dashboard with InfluxDB to monitor this data, and it has not yet appeared as an additional measurement.
So far I have attempted the following:
The first [[inputs.logparser]] section remains the same throughout my attempts and is always present/active; this seems right in order to obtain the default measurements.
######## default logparser using COMBINED to obtain default access_log measurements ######
# Stream and parse log file(s).
[[inputs.logparser]]
  files = ["/var/log/httpd/access_log"]
  from_beginning = true
  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT}"]
    measurement = "apache_access_log"
    custom_patterns = '''
    '''
Attempt 1 at matching the response time appended to access_log:
############# Grok/RegEx for matching response time ######################
# Stream and parse log file(s).
[[inputs.logparser]]
  ## Log files to parse.
  files = ["/var/log/httpd/access_log"]
  from_beginning = true
  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{METRICS_INCLUDE_RESPONSE}"]
    measurement = "apache_access_log"
    custom_patterns = '''
      METRICS_INCLUDE_RESPONSE [%{NUMBER:resp}]
    '''
For my second attempt, I tried a normal regular expression:
############# Grok/RegEx for matching response time ######################
# Stream and parse log file(s).
[[inputs.logparser]]
  ## Log files to parse.
  files = ["/var/log/httpd/access_log"]
  from_beginning = true
  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{METRICS_INCLUDE_RESPONSE}"]
    measurement = "apache_access_log"
    custom_patterns = '''
      METRICS_INCLUDE_RESPONSE [%([0-9]{1,3})]
    '''
After both of these attempts, the default measurements are still recorded and grabbed fine by Telegraf, but the response time does not appear as an additional measurement.
I believe the issue is the syntax of my custom grok pattern: it is not matching as I intended because I am not telling it to pull the correct information. But I am unsure.
I have provided an example of the access_log output below. ALL details are pulled by Telegraf without issue under COMBINED_LOG_FORMAT, except for the number at the end, which represents the response time.
10.30.20.32 - - [09/Jan/2020:11:08:14 +0000] "POST /404.php HTTP/1.1" 200 252 "http://10.30.10.77/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" 600
10.30.20.32 - - [09/Jan/2020:11:08:15 +0000] "POST /boop.html HTTP/1.1" 200 76 "http://10.30.10.77/404.php" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" 472
You are essentially extending a predefined pattern, so the pattern should be written like so (assuming your response-time value is within square brackets in the log):
######## default logparser using COMBINED to obtain default access_log measurements ######
# Stream and parse log file(s).
[[inputs.logparser]]
  files = ["/var/log/httpd/access_log"]
  from_beginning = true
  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT} \\[%{NUMBER:responseTime:float}\\]"]
    measurement = "apache_access_log"
    custom_patterns = '''
    '''
You will get the response-time value in a field named 'responseTime' with float data type.
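Note that the sample lines in the question show a bare trailing number with no brackets; in that case the extended pattern would end in a plain `%{NUMBER:responseTime:float}`. What that trailing capture has to match can be sanity-checked with a quick Python sketch, where `.*` stands in for the combined-format prefix:

```python
import re

line = ('10.30.20.32 - - [09/Jan/2020:11:08:14 +0000] "POST /404.php HTTP/1.1" '
        '200 252 "http://10.30.10.77/" "Mozilla/5.0 ..." 600')

# Anchor the number at the very end of the line, after a space.
m = re.match(r'.* (\d+)$', line)
response_time = float(m.group(1))
print(response_time)
```

If this anchored match fails on your real lines, the grok pattern built the same way will fail too.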
Working on an input extractor issue with IIS logs, using an "advanced" IIS logging tool to collect more than the basic logs provide. It adds double quotes and spaces to many of the fields, and we are trying to use the extractor to correct this. This is the beginning of an example message:
2016-02-08 16:46:35.957 "SITE" "SOURCE" XX.XX.XX.XX GET /blah/etc/etc/file.ext - 80 - "XX.XX.XX.XX" "HTTP/1.1" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; yie11; rv:11.0) like Gecko"
We've already written an extractor to remove all the added quotes before running it through all the other extractors to populate the fields, etc., but we want to replace all spaces between the quotes with + before we do that to match the old logging style.
Can anyone point us in the right direction? The closest I've come so far is catching the " " between SITE and SOURCE and replacing it using something like "([\s]*)". Result:
2016-02-08 16:46:35.957 "SITE+SOURCE" XX.XX.XX.XX GET /blah/etc/etc/file.ext - 80 - "XX.XX.XX.XX+HTTP/1.1+Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; yie11; rv:11.0) like Gecko"
I can't seem to only look for spaces between the quotes.
Any help would be greatly appreciated. Thanks.
Further Clarification. This portion of the string:
"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; yie11; rv:11.0) like Gecko"
Should be:
"Mozilla/5.0+(Windows+NT+6.1;+WOW64;+Trident/7.0;+yie11;+rv:11.0)+like+Gecko"
Everything else should remain the same as those are the only spaces inside of a quoted section of the string.
Is this even possible with regex?
I'm afraid that regular expressions are not the best tool for this. You basically have to "count" quotes to determine whether a space is within quotes or not.
You can try something like this (Python):
text = '2016-02-08 16:46:35.957 "SITE" "SOURCE" XX.XX.XX.XX GET /blah/etc/etc/file.ext - 80 - "XX.XX.XX.XX" "HTTP/1.1" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; yie11; rv:11.0) like Gecko"'
escaped = ""
count = 0
for c in text:
    if c == '"':
        count += 1
    if c == " " and count % 2 == 1:
        escaped += "+"
    else:
        escaped += c
Afterwards, escaped is this:
2016-02-08 16:46:35.957 "SITE" "SOURCE" XX.XX.XX.XX GET /blah/etc/etc/file.ext - 80 - "XX.XX.XX.XX" "HTTP/1.1" "Mozilla/5.0+(Windows+NT+6.1;+WOW64;+Trident/7.0;+yie11;+rv:11.0)+like+Gecko"
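That said, if your tool supports a substitution callback (as Python's re.sub does), the quote pairing can be matched directly instead of counted: match each `"..."` span and rewrite only the spaces inside it. A sketch, with the sample line shortened:

```python
import re

text = ('2016-02-08 16:46:35.957 "SITE" "SOURCE" 10.0.0.1 GET /file.ext - 80 - '
        '"HTTP/1.1" "Mozilla/5.0 (Windows NT 6.1) like Gecko"')

# For every quoted span, replace its interior spaces with "+".
escaped = re.sub(r'"[^"]*"', lambda m: m.group(0).replace(' ', '+'), text)
print(escaped)
```

Spaces between quoted spans are untouched because the pattern only ever matches from an opening quote to its closing quote.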
Why does
AliasMatch .*\.(png|ico|gif|jpg|jpeg|js|css|woff|ttf|svg)$ /my-location/
+
GET /pages/index/index.js HTTP/1.1
=
[30/Jul/2014:12:55:28 -0700] "GET /pages/index/index.js HTTP/1.1" 404 433 "http://localhost/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
?
The solution was
AliasMatch (.*\.(png|ico|gif|jpg|jpeg|js|css|woff|ttf|svg))$ /my-location/$1
The reason for that is:"[...] Alias will automatically copy any additional part of the URI, past the part that matched, onto the end of the file path on the right side, while AliasMatch will not. This means that in almost all cases, you will want the regular expression to match the entire request URI from beginning to end, and to use substitution on the right side." (http://httpd.apache.org/docs/2.2/mod/mod_alias.html#aliasmatch)
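The difference the docs describe can be illustrated with Python's re.sub standing in for mod_alias's substitution (an illustrative sketch only; the trailing-slash handling of the real directive is simplified here):

```python
import re

uri = '/pages/index/index.js'

# Without a capture group, every matching URI collapses to the bare target,
# which is why the original directive produced 404s:
broken = re.sub(r'.*\.(png|ico|gif|jpg|jpeg|js|css|woff|ttf|svg)$', '/my-location/', uri)

# Capturing the whole URI and substituting it ($1 in the Apache config)
# preserves the path under the new prefix:
fixed = re.sub(r'(.*\.(png|ico|gif|jpg|jpeg|js|css|woff|ttf|svg))$', r'/my-location\1', uri)
print(broken, fixed)
```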
I need to import old http log files from my Domino webserver into my piwik tracking.
The problem is the format of the log when a user is logged in.
Normal/good format example:
123.123.123 www.example.com - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
Bad format example - produced if user is logged in
123.123.123 www.example.com "CN=SomeUser/OU=SomeOU/O=SomeO" - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
I am looking for a bash one-liner to remove the CN information if it is included.
UPDATE:
This is my solution: a one-liner to import a Domino log file into piwik. Maybe someday someone finds this thing and needn't flip his table.
for i in `ls -v *.log`; do date && echo " bearbeite" $i && echo " " && awk '{sub(/ +"CN=[^"]+" +/," - ")}1' $i | grep -v http.monitor | grep -v nagios > $i.cleanTmp && python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://127.0.0.1/piwik --idsite=8 $i.cleanTmp --dry-run && rm $i.cleanTmp; done;
If you need a pure bash solution, you can do something like this:
Example file
cat >infile <<XXX
123.123.123 www.example.com "CN=SomeUser/OU=SomeOU/O=SomeO" - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
XXX
while read x; do
  [[ $x =~ \ +\"CN=[^\"]+\"\ + ]] && x=${x/$BASH_REMATCH/ }
  echo $x
done <infile
Output:
123.123.123 www.example.com - [17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810 "" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)" 234 "" "example"
It looks for a string starting with spaces, then "CN=, then any non-" characters, then a ", then some more spaces. If this pattern is found, it is replaced with a single space.
If the log files are big ones (>1MB) and this should be done periodically, then use awk instead of the pure bash solution.
awk '{sub(/ +"CN=[^"]+" +/," ")}1' infile
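The awk substitution can be verified with an equivalent Python one-liner against the bad-format sample (line shortened):

```python
import re

line = ('123.123.123 www.example.com "CN=SomeUser/OU=SomeOU/O=SomeO" - '
        '[17/Mar/2013:00:00:39 +0100] "GET /example.org HTTP/1.1" 200 3810')

# Same substitution as the awk program: surrounding spaces plus the
# quoted CN=... token collapse to a single space.
cleaned = re.sub(r' +"CN=[^"]+" +', ' ', line)
print(cleaned)
```

Lines without a CN token pass through unchanged, so the same filter is safe to run over the whole log.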
So you just want to remove the "CN=SomeUser/OU=SomeOU/O=SomeO" part?
The regex to match that looks like this:
"CN=\w+\/OU=\w+\/O=\w+"