RegexFilter with RollingFileAppender not working properly

I am trying to use RegexFilter in a RollingFileAppender. For the first matching pattern it retrieved the logger, but after that I tried a different pattern and nothing was logged to the file. Here is what I am using:
Main Class:
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class MainApp {
    public static void main(String[] args) {
        final Logger logger = LogManager.getLogger(MainApp.class.getName());
        ApplicationContext context = new ClassPathXmlApplicationContext("Beans.xml");
        HelloWorld obj = (HelloWorld) context.getBean("helloWorld");
        logger.trace("NPF:Trace:Entering Log4j2 Example.");
        logger.debug("NTL:debug Entering Log4j2 Example.");
        obj.getMessage();
        Company comp = new Company();
        comp.setCompName("ANC");
        comp.setEstablish(1889);
        CompanyBusiness compBus = (CompanyBusiness) context.getBean("compBus");
        compBus.finaceBusiness(comp.getCompName(), comp.getEstablish());
        logger.trace("NTL: Trace: Exiting Log4j2 Example.");
    }
}
log4j2.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<Configuration>
  <Appenders>
    <Console name="STDOUT" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{yyyy-MM-dd [%t] HH:mm:ss} %-5p %c{1}:%L - %m%X%n" />
    </Console>
    <RollingFile name="RollingFile" fileName="C:\logTest\runtime\tla\els3.log" append="true" filePattern="C:\logTest\runtime\tla\els3-%d{yyyy-MM-dd}-%i.log">
      <PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %m%X%n" />
      <RegexFilter regex=".*business*." onMatch="ACCEPT" onMismatch="DENY"/>
      <Policies>
        <SizeBasedTriggeringPolicy size="20 MB" />
      </Policies>
    </RollingFile>
  </Appenders>
  <Loggers>
    <Logger name="com.anc" level="trace"/>
    <Root level="trace">
      <AppenderRef ref="STDOUT" />
      <AppenderRef ref="RollingFile"/>
    </Root>
  </Loggers>
</Configuration>
When I ran it for the first time, my log file contained only the "business" related lines. Later I changed the pattern from .*business*. (asterisks before and after the word business) to "business"; after that, nothing was logged to the file or to the console, and my application terminated without any logging at all.
Then I reverted the pattern to .*business*., after which nothing was logged to the log file, but all the log trace was printed on the console. After trying for a long time I commented out the RegexFilter, and my logs were printed in the log file again.
I am not sure whether this is a bug or whether RegexFilter works only once. Also, if we do not pass any pattern-matching characters, the application stops without printing any logs either to the console or to the file.

If you want to log all events containing the word "business", then you should use the regex .*business.* instead of .*business*.. Here is an example:
<RegexFilter regex=".*business.*" onMatch="ACCEPT" onMismatch="DENY"/>
For information, .*business*. means: anything, followed by the literal busines, followed by the character s repeated zero or more times, followed by any single character.
In more detail:
. means any single character
* means the preceding element repeated 0 or more times
so .* means any character, 0 or more times.
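To see the difference, here is a quick standalone check with java.util.regex (a sketch; the sample message is made up for illustration):

import java.util.regex.Pattern;

public class RegexFilterCheck {
    public static void main(String[] args) {
        String msg = "executing business logic";  // hypothetical log message

        // Broken pattern: the whole message must end with "busines",
        // optional extra "s" characters, and exactly one more character.
        System.out.println(Pattern.matches(".*business*.", msg));  // false

        // Fixed pattern: "business" may appear anywhere in the message.
        System.out.println(Pattern.matches(".*business.*", msg));  // true
    }
}

Like RegexFilter, Pattern.matches tests the regex against the whole string, which is why the trailing .* matters.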

Related

Request param is logged in access log with embedded jetty server of spring boot application

I have an issue with my application: it logs requests together with their query params, which may contain sensitive data, in the access log. The application is configured with logback.xml and embedded Jetty.
The Jetty server is customized with the accessLogCustomizer below:
public JettyServerCustomizer accessLogCustomizer() {
    return server -> {
        Slf4jRequestLog requestLog = new Slf4jRequestLog();
        requestLog.setExtended(true);
        requestLog.setLogLatency(true);
        requestLog.setPreferProxiedForAddress(true);
        requestLog.setLogTimeZone(userTimezone == null ? ZoneId.systemDefault().getId() : userTimezone);
        requestLog.setLogDateFormat("Y-MM-dd HH:mm:ss, SSS Z");
        RequestLogHandler requestLogHandler = new RequestLogHandler();
        requestLogHandler.setRequestLog(requestLog);
        requestLogHandler.setHandler(server.getHandler());
        server.setHandler(requestLogHandler);
    };
}
logback.xml
<appender name="access" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <File>${logs.dir}/abc-access.log</File>
  <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
    <layout class="ch.qos.logback.classic.PatternLayout">
      <Pattern>%m %n</Pattern>
    </layout>
    <charset>UTF-8</charset>
  </encoder>
  <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
    <FileNamePattern>${logs.dir}/abc-access.%d.log.gz</FileNamePattern>
  </rollingPolicy>
</appender>
<logger name="org.eclipse.jetty.server.RequestLog" additivity="false">
  <appender-ref ref="access"/>
</logger>
Request logged in the access log:
192.168.0.100 - - [2021-05-20 15:48:15,093 +0530] "POST /myAPI/v2/customer/message?myID=123&messageText=hello HTTP/1.0" 200 0 "-" "PostmanRuntime/7.26.8" 475
I am trying to keep messageText out of the access log, but have not found a solution.
Use the CustomRequestLog and Slf4jRequestLogWriter instead.
You'll want the special format option %U, which emits the URL path without the query string (which is available as %q, btw).
Your resulting configuration would look like this ...
Slf4jRequestLogWriter slfjRequestLogWriter = new Slf4jRequestLogWriter();
String format = "%{client}a - %u %t %m \"%U\" %s %O \"%{Referer}i\" \"%{User-Agent}i\"";
CustomRequestLog customRequestLog = new CustomRequestLog(slfjRequestLogWriter, format);
server.setRequestLog(customRequestLog);
Play with the format line, read the Javadoc on CustomRequestLog to know what you can do.
Some notes:
The example format is not strictly following the Extended NCSA format (as it's missing the HTTP version portion, and the HTTP method is outside of the quoted section, but that is usually not a problem for many users)
Slf4jRequestLogWriter is only concerned with taking the formatted log line and sending it to the slf4j-api, it does nothing else.
RequestLogHandler is deprecated and not a recommended usage anymore (as it does not log bad requests and context-less requests), use the Server.setRequestLog(RequestLog) instead.
Jetty will use the CustomRequestLog's Pattern to produce a String, this String is forwarded to the Slf4jRequestLogWriter as a slf4j logging event message, which is then logged per your existing slf4j + logback configuration.
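Putting the pieces together, the whole customizer might end up looking like this (an untested sketch assuming Jetty 9.4+ and the Spring Boot JettyServerCustomizer from the question; the wrapper class name is mine):

import org.eclipse.jetty.server.CustomRequestLog;
import org.eclipse.jetty.server.Slf4jRequestLogWriter;
import org.springframework.boot.web.embedded.jetty.JettyServerCustomizer;

class AccessLogConfig {
    JettyServerCustomizer accessLogCustomizer() {
        return server -> {
            Slf4jRequestLogWriter writer = new Slf4jRequestLogWriter();
            // %U emits only the URL path, so messageText and other query
            // params never reach the access log (add %q back if you want them).
            String format = "%{client}a - %u %t %m \"%U\" %s %O \"%{Referer}i\" \"%{User-Agent}i\"";
            // Replaces the deprecated RequestLogHandler wiring shown above.
            server.setRequestLog(new CustomRequestLog(writer, format));
        };
    }
}

Your existing logback appender should keep receiving these lines, since Slf4jRequestLogWriter logs to the org.eclipse.jetty.server.RequestLog logger by default (it has a setLoggerName method if you need a different one).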

nutch 1.16 parsechecker issue with file:/directory/ inputs

Building up from "nutch 1.16 skips file:/directory styled links in file system crawl", I have been trying (and failing) to get Nutch to crawl through different directories and subdirectories on a Windows 10 installation, calling commands with Cygwin.
The file dirs/seed.txt, used to initiate the crawl, contains the following:
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
Running cat ./dirs/seed.txt | ./bin/nutch normalizerchecker -stdin to check on how Nutch is normalizing (default regex-normalize.xml) yields
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:/localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
While running cat ./dirs/seed.txt | ./bin/nutch filterchecker -stdin returns:
+file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
Meaning all directories are seen as valid. So far, so good, but then, running the following:
cat ./dirs/seed.txt | ./bin/nutch parsechecker -stdin
yields the same error for all three directories, namely:
Fetch failed with protocol status: notfound(14), lastModified=0
The log files also do not really tell me what went wrong, just that it won't read the input no matter what; the logs only contain a "fetching directory X" message per entry...
So what exactly is going on here? I'll also leave the nutch-site.xml, regex-urlfilter.txt and regex-normalize.xml files, for completeness' sake.
nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>NutchSpiderTest</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>NutchSpiderTest,*</value>
    <description>The agent strings we'll look for in robots.txt files,
      comma-separated, in decreasing order of precedence. You should
      put the value of http.agent.name as the first agent name, and keep the
      default * at the end of the list. E.g.: BlurflDev,Blurfl,*
    </description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>I am just testing nutch, please tell me if it's bothering your website</value>
    <description>Further description of our bot- this text is used in
      the User-Agent header. It appears in parenthesis after the agent name.
    </description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      By default Nutch includes plugins to crawl HTML and various other
      document formats via HTTP/HTTPS and indexing the crawled content
      into Solr. More plugins are available to support more indexing
      backends, to fetch ftp:// and file:// URLs, for focused crawling,
      and many other use cases.
    </description>
  </property>
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
    <description>Needed to stop buffer overflow errors - Unable to read.....</description>
  </property>
  <property>
    <name>file.crawl.parent</name>
    <value>false</value>
    <description>The crawler is not restricted to the directories that you specified in the
      Urls file but it is jumping into the parent directories as well. For your own crawlings you can
      change this behavior (set to false) the way that only directories beneath the directories that you specify get
      crawled.</description>
  </property>
  <property>
    <name>parser.skip.truncated</name>
    <value>false</value>
    <description>Boolean value for whether we should skip parsing for truncated documents. By default this
      property is activated due to extremely high levels of CPU which parsing can sometimes take.
    </description>
  </property>
  <!-- the following is just an attempt at using a solution I found elsewhere, didn't work -->
  <property>
    <name>http.robot.rules.whitelist</name>
    <value>file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/</value>
    <description>Comma separated list of hostnames or IP addresses to ignore
      robot rules parsing for. Use with care and only if you are explicitly
      allowed by the site owner to ignore the site's robots.txt!
    </description>
  </property>
</configuration>
regex-urlfilter.txt:
# The default url filter.
# Better for whole-internet crawling.
# Please comment/uncomment rules to your needs.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip http: ftp: mailto: and https: urls
-^(http|ftp|mailto|https):
# This change is not necessary but may make your life easier.
# Any file types you do not want to index need to be added to the list otherwise
# Nutch will often try to parse them and fail in doing so as it doesnt know
# how to deal with a lot of binary file types.:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$
# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
# can be leaked by placing links pointing to web interfaces of services
# running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
# or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
# http://localhost:8080
# http://127.0.0.1/ .. http://127.255.255.255/
# http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
# 10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
# 192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# 172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# accept anything else
+.
regex-normalize.xml:
<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
This is intended so that users can specify substitutions to be
done on URLs using the Java regex syntax, see
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The rules are applied to URLs in the order they occur in this file. -->
<!-- WATCH OUT: an xml parser reads this file and ampersands must be
expanded to &amp; -->
<!-- The following rules show how to strip out session IDs, default pages,
interpage anchors, etc. Order does matter! -->
<regex-normalize>
  <!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
  <regex>
    <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
    <substitution>$4</substitution>
  </regex>
  <!-- changes default pages into standard for /index.html, etc. into /
  <regex>
    <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
    <substitution>/$3</substitution>
  </regex> -->
  <!-- removes interpage href anchors such as site.com#location -->
  <regex>
    <pattern>#.*?(\?|&amp;|$)</pattern>
    <substitution>$1</substitution>
  </regex>
  <!-- cleans ?&amp;var=value into ?var=value -->
  <regex>
    <pattern>\?&amp;</pattern>
    <substitution>\?</substitution>
  </regex>
  <!-- cleans multiple sequential ampersands into a single ampersand -->
  <regex>
    <pattern>&amp;{2,}</pattern>
    <substitution>&amp;</substitution>
  </regex>
  <!-- removes trailing ? -->
  <regex>
    <pattern>[\?&amp;\.]$</pattern>
    <substitution></substitution>
  </regex>
  <!-- normalize file:/// protocol prefix: -->
  <!-- keep one single slash (NUTCH-1483) -->
  <regex>
    <pattern>^file://+</pattern>
    <substitution>file:/</substitution>
  </regex>
  <!-- removes duplicate slashes but -->
  <!-- * allow 2 slashes after colon ':' (indicating protocol) -->
  <regex>
    <pattern>(?<!:)/{2,}</pattern>
    <substitution>/</substitution>
  </regex>
</regex-normalize>
Any idea what I'm doing wrong here?
Nutch's file: protocol implementation "fetches" local files by creating a File object using the path component of the URL: /cygdrive/c/Users/abc/Desktop/anotherdirectory/. As stated in the discussion "Is there a java sdk for cygwin?", Java does not translate the path, but replacing cygdrive/c/ by c:/ should work.
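For illustration, a minimal sketch of that translation as a standalone Java helper (the class name and regex are mine, not part of Nutch; the same rewrite could also be added as a rule in regex-normalize.xml so the seeds are fixed up during normalization):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CygwinPathFix {
    // Matches file:/cygdrive/<drive>/... as well as the file:/// and
    // file:/localhost/ forms seen in the normalizerchecker output above.
    private static final Pattern CYGDRIVE =
            Pattern.compile("^file:/+(?:localhost/)?cygdrive/([a-zA-Z])(/.*)$");

    static String translate(String url) {
        Matcher m = CYGDRIVE.matcher(url);
        // file:/cygdrive/c/Users/... -> file:/c:/Users/...
        return m.matches() ? "file:/" + m.group(1) + ":" + m.group(2) : url;
    }

    public static void main(String[] args) {
        System.out.println(translate("file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/"));
        // prints: file:/c:/Users/abc/Desktop/anotherdirectory/
    }
}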

Unable to match XML element using Python regular expression

I have an XML document with the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<!-- generated by CLiX/Wiki2XML [MPI-Inf, MMCI@UdS] $LastChangedRevision: 93 $ on 17.04.2009 12:50:48 [mciao0826] -->
<!DOCTYPE article SYSTEM "../article.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
<header>
<title>Postmodern art</title>
<id>192127</id>
<revision>
<id>244517133</id>
<timestamp>2008-10-11T05:26:50Z</timestamp>
<contributor>
<username>FairuseBot</username>
<id>1022055</id>
</contributor>
</revision>
<categories>
<category>Contemporary art</category>
<category>Modernism</category>
<category>Art movements</category>
<category>Postmodern art</category>
</categories>
</header>
<bdy>
Postmodernism preceded by Modernism '' Postmodernity Postchristianity Postmodern philosophy Postmodern architecture Postmodern art Postmodernist film Postmodern literature Postmodern music Postmodern theater Critical theory Globalization Consumerism
</bdy>
I am interested in capturing the text contained within <bdy>...</bdy>, and for that I wrote the following Python 3 regex code:
import re

file = open("sample_xml.xml", "r")
xml_doc = file.read()
file.close()
body_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc)
But body_text always comes back as an empty list. However, when I try to capture the text inside the <category>...</category> tags with
category_text = re.findall(r'<category>(.+)</category>', xml_doc)
this does the job.
Any idea(s) as to why the <bdy>...</bdy> version is not working?
Thanks!
The special character . does not match a newline, so that regex will not match across a multiline string.
You can change this behavior by specifying the DOTALL flag. One way to set it is to include (?s) at the start of your regular expression, e.g. re.findall(r'(?s)<bdy>(.+)</bdy>', xml_doc).
More information on Python's regular expression syntax can be found here: https://docs.python.org/3/library/re.html#regular-expression-syntax
You can use re.DOTALL
category_text = re.findall(r'<bdy>(.+)</bdy>', xml_doc, re.DOTALL)
Output:
[" Postmodernism preceded by Modernism '' Postmodernity\n> Postchristianity Postmodern philosophy Postmodern architecture\n> Postmodern art Postmodernist film Postmodern literature Postmodern\n> music Postmodern theater Critical theory Globalization Consumerism\n> "]

xbuild with [System.Text.RegularExpressions.Regex]::Match(string,string) parameters doesn't work properly (MSBuild is fine)

I have a target that reads a .proj file with ReadLinesFromFile and then tries to match a version number (e.g. 1.0.23) from the contained lines, like:
<Target Name="GetRevision">
  <ReadLinesFromFile File="$(MyDir)GetStuff.Data.proj">
    <Output TaskParameter="Lines" ItemName="GetStuffLines" />
  </ReadLinesFromFile>
  <PropertyGroup>
    <In>@(GetStuffLines)</In>
    <Out>$([System.Text.RegularExpressions.Regex]::Match($(In), "(\d+)\.(\d+)\.(\d+)"))</Out>
  </PropertyGroup>
  <Message Text="Revision number [$(Out)]" />
  <CreateProperty Value="$(Out)">
    <Output TaskParameter="Value" PropertyName="RevisionNumber" />
  </CreateProperty>
</Target>
The result is always empty. Even if I try a simple Match($(In), "somestring"), it does not work correctly on linux/xbuild; it does work on windows/msbuild.
Any tricks/ideas? An alternative would be to get the version property out of the first .proj file, instead of reading all lines and matching the number with a regex, but I don't even know if that is possible.
I am running versions:
XBuild Engine Version 12.0
Mono, Version 4.2.1.0
EDIT:
I've been able to trace it further down to the parameters that go into Match(); there is something wrong with the variable evaluation. The function does work with literals: for example, Match("foobar", "bar") gives me bar.
But weird things happen with other inputs: e.g. Match($(In), "Get") will match Get because it is actually matching against the literal string "@(GetStuffLines)".
When I do Match($(In), "@..") I get a match of @(G.
But then, when I do Match($(In), "@.*") I actually get the entire content of the input file GetStuff.Data.proj, which indicates that the variable was correctly expanded somewhere and the match then consumed the entire input string.
I needed to circumvent Match() because it seems to be bugged at this point.
The ugly solution I came up with was to use Exec and grep the pattern like:
<Exec Command="grep -o -P '[0-9]+[.][0-9]+[.][0-9]+' $(MyDir)GetStuff.Data.proj > extractedRevisionNumber.tmp" Condition="$(OSTYPE.Contains('linux'))"/>
<ReadLinesFromFile File="$(ComponentRootDir)extractedRevisionNumber.tmp" Condition="$(OSTYPE.Contains('linux'))">
  <Output TaskParameter="Lines" ItemName="GetExtractedRevisionNumber" />
</ReadLinesFromFile>
I couldn't even use the ConsoleToMSBuild and ConsoleOutput properties (https://msdn.microsoft.com/en-us/library/ms124731%28v=VS.110%29.aspx) because xbuild didn't recognize those. That's why I grep the pattern and save it into a temp file, which can be read with ReadLinesFromFile into ItemName="GetExtractedRevisionNumber" and used later.

Parse Data output from a remote STAF command

What would be the easiest way to parse the Data section from this STAF command? I cannot find a STAF parameter that I can pass to the command to do this automatically, so it looks like parsing with a regular expression might be the best option.
Note: I do not want to use any external libraries.
[root@source ~]# STAF target PROCESS START SHELL COMMAND "ls" WAIT RETURNSTDOUT
Response
--------
{
  Return Code: 0
  Key : <None>
  Files : [
    {
      Return Code: 0
      Data : myFile.txt
myFile2.txt
myFile3.txt
    }
  ]
}
Instead, I would like the output/result to be formatted like:
[root@source ~]# STAF target PROCESS START SHELL COMMAND "ls" WAIT RETURNSTDOUT
myFile.txt
myFile2.txt
myFile3.txt
The best way to do this is to create an XML file and use a Python script to access the Data part of STAFResult, since STAF returns data in marshalled form as "CONTENT" and Python can be used to grab that.
I will try to explain it with a simple example; it's an HTTP request to a server.
<stafcmd>
  <location>'%s' % machineName</location>
  <service>'http'</service>
  <request>'DOGET URL %s?phno=%s&amp;shortCode=%s&amp;query=%s' % (url, phno, shortCode, escapeQuery)</request>
</stafcmd>
<if expr="RC == 0">
  <sequence>
    <call function="'func_Script'"></call>
    <if expr="rc == 0"> <!-- Pass At First Query -->
      <sequence>
        <message>'PASS@First HTTPRequest: Keyword = %s,\nRequired Response = %s,\ncontent=%s' % (query, response, content)</message>
        <tcstatus result="'pass'">'Pass:'</tcstatus>
      </sequence>
      <else> <!-- Check For MORE -->
        <call function="'Validate_QueryMore'"></call>
      </else>
    </if>
  </sequence>
  <else>
    <message>'ERROR: HTTPRequest QUERY : RC = %s Result= %s' % (rc, STAFResult)</message>
  </else>
</if>
<function name="func_Script">
  <script>
import re
content = STAFResult['content'].lower()
response = response.lower()
test = content.find(response)
if test != -1:
    rc = 0
else:
    rc = 1
  </script>
</function>
Hope it gives you some help.
You can pipe the output of your command through a sed script that filters out only the filenames for you. Here's a first cut:
sed -ne '/^[a-z]/p;/Data/s/[^:]*: \(.*\)/\1/p'
The idea is: if a line starts with a lower-case letter, it is a file name (the expression up to the first semicolon). If the string "Data" is on the line, take everything that comes after the first colon on that line (the expression after the semicolon). Everything else is ignored.
You might want to be more specific than just expecting a lower-case letter at the beginning (this would filter out the "Response" line at the beginning, but if your filename might start with an upper-case letter, that won't work). Also, just looking for the string "Data" might be a bit too general -- that string might occur in the filename as well. But hopefully you get the idea. To use this, run your command like this:
STAF ... | sed -ne ...
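If you'd rather not depend on sed, the same filtering logic is small enough to port to plain Java (a hedged sketch against the sample output above; the class name is mine, and it shares the sed version's caveats about filenames that start with an upper-case letter or contain "Data"):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StafOutputFilter {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String t = line.trim();
            if (t.startsWith("Data")) {
                // "Data : myFile.txt" -> keep everything after the first colon
                System.out.println(t.substring(t.indexOf(':') + 1).trim());
            } else if (!t.isEmpty() && Character.isLowerCase(t.charAt(0))) {
                // continuation lines such as "myFile2.txt"
                System.out.println(t);
            }
        }
    }
}

Run it as STAF ... | java StafOutputFilter.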