Robot Framework parsing for Notepad++ Function List - regex

I'm trying to create a Notepad++ function list for Robot Framework scripts using the class structure to encapsulate the 4 different sections in a robot file:
Settings
Variables
Test Cases
Keywords
Using the documentation and some experimentation, I created a simple filter that returns the keywords and test cases, based on the fact that they start at the beginning of a line. For the more complex class grouping, however, I need some regex help. The *** markers should make the section boundaries easy to detect.
This is what I have thus far:
I have installed the User Defined Robot Syntax Highlighting and added the following section to %app%\notepad++\functionList.xml:
<association userDefinedLangName="Robotframework" id="robot_function"/>
And then in the parser section:
<parser
id="robot_function"
displayName="Robot Section"
commentExpr="((#.*?$)|(^Documentation*\w.*?$)|(^Meta*\w.*?$))|(^Library*\w.*?$)">
<function
mainExpr="^(\w.*?$)"
displayMode="$functionName">
<functionName>
<nameExpr expr="^(\w.*?$)"/>
</functionName>
</function>
</parser>
The part I'm having trouble with, and would appreciate some help on, is the classRange:
<classRange mainExpr="^(\*).*(?=\n\S|\Z)">
<className>
<nameExpr expr="^(\w.*?$)"/>
</className>
<function mainExpr="^(\w.*?$)">
<functionName>
<nameExpr expr="^(\w.*?$)"/>
</functionName>
</function>
</classRange>
Below is an example robot file:
*** Variables ***
${variable} variable value
*** Settings ***
Documentation multi
... line
... documentation.
Metadata Version 0.1
Library LibraryName some variable
Library String
*** Test Cases ***
Test Case RF 01
    Run Keyword    ${TEST_NAME}
Test Case RF 02
    Run Keyword    ${TEST_NAME}
*** Keywords ***
Test Case RF ${tc}
    Sleep    30ms
Test Keyword
    Sleep    300ms
I'm sure that if I can make it work for one of the sections, for example test cases, then that will allow me to also apply it to the other sections. Predominantly I'm interested in the test cases and keywords.

With the settings below I am able to use Notepad++ Function List for Robot Framework keywords and test cases:
<association id="robot_syntax" userDefinedLangName="Robotframework" />
<association id="robot_syntax" ext=".robot" />
<parser
displayName="Robot Framework"
id ="robot_syntax"
commentExpr="(^(\h*)|(#.*?)|(\[\w.*?)|(Documentation*\w.*?)|(Library*\w.*?)|(Metadata*\w.*?)|(Resource*\w.*?)|(Test (Setup|Teardown|Template|Timeout)*\w.*?)|(Suite (Setup|Teardown)*\w.*?)|((Force|Default) Tags*\w.*?))$"
>
<function mainExpr="(?m-s:(?:^)[A-Za-z0-9].*$)"/>
</parser>
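To sanity-check the grouping that a classRange is meant to produce, here is a minimal Python sketch (not Notepad++ configuration). It groups top-level names under their *** Section *** header. The function name group_robot_names and the sample text are illustrative assumptions, and it assumes that test case and keyword names start in column 0 while their steps are indented.

```python
import re

# Section headers look like "*** Test Cases ***"; capture the text between the asterisks.
SECTION_RE = re.compile(r"^\*+\s*(.*?)\s*\*+\s*$")


def group_robot_names(text, wanted=("Test Cases", "Keywords")):
    """Return {section name: [names starting in column 0]} for the wanted sections."""
    groups = {}
    current = None
    for line in text.splitlines():
        header = SECTION_RE.match(line)
        if header:
            current = header.group(1)
            if current in wanted:
                groups.setdefault(current, [])
            continue
        # Names are non-empty, unindented, non-comment lines inside a wanted section.
        if current in wanted and line and not line[0].isspace() and not line.startswith("#"):
            groups[current].append(line.rstrip())
    return groups


# Hypothetical sample, modelled on the robot file in the question.
SAMPLE = """\
*** Settings ***
Library    String
*** Test Cases ***
Test Case RF 01
    Run Keyword    ${TEST_NAME}
Test Case RF 02
    Run Keyword    ${TEST_NAME}
*** Keywords ***
Test Case RF ${tc}
    Sleep    30ms
Test Keyword
    Sleep    300ms
"""

if __name__ == "__main__":
    for section, names in group_robot_names(SAMPLE).items():
        print(section, names)
```

Lines in the Settings section are skipped because that section is not in the wanted tuple. That is roughly the role the commentExpr alternatives play in the parser above.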

Related

nutch 1.16 parsechecker issue with file:/directory/ inputs

Building on my earlier question, "nutch 1.16 skips file:/directory styled links in file system crawl", I have been trying (and failing) to get Nutch to crawl through different directories and subdirectories on a Windows 10 installation, running the commands from Cygwin.
The file dirs/seed.txt, used to initiate the crawl, contains the following:
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
Running cat ./dirs/seed.txt | ./bin/nutch normalizerchecker -stdin to check on how Nutch is normalizing (default regex-normalize.xml) yields
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
file:/localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
While running cat ./dirs/seed.txt | ./bin/nutch filterchecker -stdin returns:
+file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/
+file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/
This means all three URLs pass the filters. So far, so good, but then running the following:
cat ./dirs/seed.txt | ./bin/nutch parsechecker -stdin
yields the same error for all three directories, namely:
Fetch failed with protocol status: notfound(14), lastModified=0
The log files do not reveal what went wrong either: they only contain a "fetching directory X" message per entry.
So what exactly is going on here? For completeness, here are my nutch-site.xml, regex-urlfilter.txt and regex-normalize.xml files.
nutch-site.xml :
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>NutchSpiderTest</value>
</property>
<property>
<name>http.robots.agents</name>
<value>NutchSpiderTest,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.agent.description</name>
<value>I am just testing nutch, please tell me if it's bothering your website</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-file|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
By default Nutch includes plugins to crawl HTML and various other
document formats via HTTP/HTTPS and indexing the crawled content
into Solr. More plugins are available to support more indexing
backends, to fetch ftp:// and file:// URLs, for focused crawling,
and many other use cases.
</description>
</property>
<property>
<name>file.content.limit</name>
<value>-1</value>
<description> Needed to stop buffer overflow errors - Unable to read.....</description>
</property>
<property>
<name>file.crawl.parent</name>
<value>false</value>
<description>The crawler is not restricted to the directories that you specified in the
Urls file but it is jumping into the parent directories as well. For your own crawlings you can
change this behavior (set to false) the way that only directories beneath the directories that you specify get
crawled.</description>
</property>
<property>
<name>parser.skip.truncated</name>
<value>false</value>
<description>Boolean value for whether we should skip parsing for truncated documents. By default this
property is activated due to extremely high levels of CPU which parsing can sometimes take.
</description>
</property>
<!-- the following is just an attempt at using a solution I found elsewhere, didn't work -->
<property>
<name>http.robot.rules.whitelist</name>
<value>file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/</value>
<description>Comma separated list of hostnames or IP addresses to ignore
robot rules parsing for. Use with care and only if you are explicitly
allowed by the site owner to ignore the site's robots.txt!
</description>
</property>
</configuration>
regex-urlfilter.txt:
# The default url filter.
# Better for whole-internet crawling.
# Please comment/uncomment rules to your needs.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip http: ftp: mailto: and https: urls
-^(http|ftp|mailto|https):
# This change is not necessary but may make your life easier.
# Any file types you do not want to index need to be added to the list otherwise
# Nutch will often try to parse them and fail in doing so as it doesnt know
# how to deal with a lot of binary file types.:
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS
#|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|gz|GZ|rpm|RPM|tgz|TGZ|mov
#|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|asp|ASP|xxx|XXX|yyy|YYY
#|cs|CS|dll|DLL|refresh|REFRESH)$
# skip URLs longer than 2048 characters, see also db.max.outlink.length
#-^.{2049,}
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
#-(?i)\.(?:gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!#=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
+.*(/[^/]+)/[^/]+\1/[^/]+\1/
# For safe web crawling if crawled content is exposed in a public search interface:
# - exclude private network addresses to avoid that information
# can be leaked by placing links pointing to web interfaces of services
# running on the crawling machines (e.g., HDFS, Hadoop YARN)
# - in addition, file:// URLs should be either excluded by a URL filter rule
# or ignored by not enabling protocol-file
#
# - exclude localhost and loop-back addresses
# http://localhost:8080
# http://127.0.0.1/ .. http://127.255.255.255/
# http://[::1]/
#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
#
# - exclude private IP address spaces
# 10.0.0.0/8
#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
# 192.168.0.0/16
#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# 172.16.0.0/12
#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
# accept anything else
+.
regex-normalize.xml:
<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
This is intended so that users can specify substitutions to be
done on URLs using the Java regex syntax, see
https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The rules are applied to URLs in the order they occur in this file. -->
<!-- WATCH OUT: an XML parser reads this file, so ampersands must be
expanded to &amp; -->
<!-- The following rules show how to strip out session IDs, default pages,
interpage anchors, etc. Order does matter! -->
<regex-normalize>
<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
<pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
<substitution>$4</substitution>
</regex>
<!-- changes default pages into standard for /index.html, etc. into /
<regex>
<pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&|#|$)</pattern>
<substitution>/$3</substitution>
</regex> -->
<!-- removes interpage href anchors such as site.com#location -->
<regex>
<pattern>#.*?(\?|&amp;|$)</pattern>
<substitution>$1</substitution>
</regex>
<!-- cleans ?&var=value into ?var=value -->
<regex>
<pattern>\?&amp;</pattern>
<substitution>\?</substitution>
</regex>
<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
<pattern>&amp;{2,}</pattern>
<substitution>&amp;</substitution>
</regex>
<!-- removes trailing ? -->
<regex>
<pattern>[\?&amp;\.]$</pattern>
<substitution></substitution>
</regex>
<!-- normalize file:/// protocol prefix: -->
<!-- keep one single slash (NUTCH-1483) -->
<regex>
<pattern>^file://+</pattern>
<substitution>file:/</substitution>
</regex>
<!-- removes duplicate slashes but -->
<!-- * allow 2 slashes after colon ':' (indicating protocol) -->
<regex>
<pattern>(?&lt;!:)/{2,}</pattern>
<substitution>/</substitution>
</regex>
</regex-normalize>
Any idea what I'm doing wrong here?
Nutch's file: protocol implementation "fetches" local files by creating a Java File object from the path component of the URL, here /cygdrive/c/Users/abc/Desktop/anotherdirectory/. As stated in the discussion "Is there a java sdk for cygwin?", Java does not translate Cygwin paths, so from Java's point of view that path does not exist. Replacing cygdrive/c/ with c:/ in the seed URLs should work.
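As a rough illustration of that rewrite, the following Python sketch converts the three seed URL spellings from the question into the c:/ form before they are written to seed.txt. The helper name to_windows_file_url is hypothetical and not part of Nutch. It assumes every seed uses a /cygdrive/<drive letter>/ prefix.

```python
import re

# Matches file:/, file:/// and file://localhost/ prefixes followed by cygdrive/<drive>/.
_CYGDRIVE_RE = re.compile(r"^file:/+(?:localhost/)?cygdrive/([A-Za-z])/")


def to_windows_file_url(url):
    """Rewrite a Cygwin-style file: URL so its path starts with <drive>:/ instead."""
    return _CYGDRIVE_RE.sub(lambda m: "file:/" + m.group(1) + ":/", url)


if __name__ == "__main__":
    seeds = [
        "file:/cygdrive/c/Users/abc/Desktop/anotherdirectory/",
        "file:///cygdrive/c/Users/abc/Desktop/anotherdirectory/",
        "file://localhost/cygdrive/c/Users/abc/Desktop/anotherdirectory/",
    ]
    for seed in seeds:
        print(to_windows_file_url(seed))
```

URLs that do not contain a cygdrive prefix are returned unchanged, so the helper can be applied to a mixed seed list.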

JAVAMETHOD grok pattern with optional thread number at the end

I'm trying to parse log4j messages:
2019-12-02 20:48:20.198utc DEBUG UnknownElementContentHandler,streamLock-9-th-11:32 - blabla
2019-11-19 23:40:04.014utc WARN AnnotationBinder,localhost-startStop-1:611 - blabla
2019-11-19 23:40:04.014utc INFO CovImCtl,main:109 - blabla
with grok pattern
%{TIMESTAMP_ISO8601:timestamp}utc%{SPACE}%{LOGLEVEL:level}%{SPACE}%{JAVACLASS:class},%{JAVAMETHOD1:method}:%{POSINT:lineno}%{SPACE}-%{SPACE}%{GREEDYDATA:message}
using a variation on the standard JAVAMETHOD pattern:
JAVAMETHOD (?:(<(?:cl)?init>)|[a-zA-Z$_][a-zA-Z$_0-9]*)
JAVAMETHOD1 (?:(<(?:cl)?init>)|[a-zA-Z$_][a-zA-Z$_\-0-9]*)
The stock JAVAMETHOD matched "main" but not the others, because its character class lacks "-".
JAVAMETHOD1 matches all three, but I also need the optional trailing integer captured as a separate "thread_no" field (11 from streamLock-9-th-11, 1 from localhost-startStop-1).
I'm racking my brain over this: a name like streamLock-9-th-11 also contains an internal "-9" that matches "-\d+" but belongs to the method name "streamLock-9-th".
Any ideas?
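One way to reason about the requirement is a lazy method-name match followed by an optional "-<digits>" group that must sit directly before the ":" separator. The Python sketch below only demonstrates that idea on the method,lineno part of the sample lines. It is not a tested grok pattern: the field names and the helper split_method are assumptions, and Python writes named groups as (?P<name>...) where grok's regex engine uses (?<name>...).

```python
import re

# Lazy method name, then an optional trailing "-<digits>" thread number, then ":<lineno>".
# The lazy quantifier keeps internal "-9" parts inside the method name, because the
# optional group only succeeds when it is immediately followed by ":".
METHOD_RE = re.compile(
    r"(?P<method>[A-Za-z$_][A-Za-z$_0-9-]*?)"
    r"(?:-(?P<thread_no>\d+))?"
    r":(?P<lineno>\d+)"
)


def split_method(text):
    """Return (method, thread_no or None, lineno) for strings like 'main:109'."""
    match = METHOD_RE.fullmatch(text)
    if match is None:
        return None
    return match.group("method"), match.group("thread_no"), match.group("lineno")


if __name__ == "__main__":
    for sample in ("streamLock-9-th-11:32", "localhost-startStop-1:611", "main:109"):
        print(sample, "->", split_method(sample))
```

The <init> and <clinit> alternatives of the original pattern are left out here to keep the sketch short.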

Need to remove white spaces and PS dir when running from ant using filterchain and replaceregex pattern

Here is a sample log file that I am trying to parse:
Added Change Sets
Component PS
9476: Build changes to make for Ant task [Nov 12, 2015 12:02 PM]
Work Item 9476: Build changes to make for Ant task
/PS/build/AntTaskHelper.xml
9582: Testing for EBF and migration script changes [Nov 12, 2015 12:02 PM]
Work Item 9582: Testing for EBF and migration script changes
/PS/database/ebf-migration/EBF-RTC-9582.sql
/PS/database/sif-internal-migration-scripts/RTC-9582.sql
9583: PKB PKG and Image File testing [Nov 12, 2015 12:02 PM]
Work Item 9583: PKB PKG and Image File testing
/PS/database/src/program-units/RTC-9583-PKG_CDT.pkb
/templates/Images/RTC-9583-ABAKER.TIF
/templates/Templates/RTC-9583-A100_1_20090101.xdp
Ultimately I need the results to show the following:
/database/ebf-migration/EBF-RTC-9582.sql
/database/sif-internal-migration-scripts/RTC-9582.sql
/database/src/program-units/RTC-9583-PKG_CDT.pkb
/templates/Images/RTC-9583-ABAKER.TIF
/templates/Templates/RTC-9583-A100_1_20090101.xdp
My regular expression works perfectly in a regex tester, but it does not give me what I need when run in the build.
Here's my target:
<target name="Parse">
<loadfile property="textFile" srcfile="${deployDir}\buildChanges1.txt">
<filterchain>
<linecontainsregexp>
<regexp pattern="((/database/(ebf-migration|sif-internal-migration-scripts/|src/program-units/))|(/templates/)).*" />
</linecontainsregexp>
<replaceregex pattern="((/database/(ebf-migration|sif-internal-migration-scripts/|src/program-units/))|(/templates/)).*" replace="\0"/>
</filterchain>
</loadfile>
<echo message= "value based on regex =${textFile}"/>
</target>
Here's the output from the build.
Parse:
[echo] value based on regex = /PS/database/ebf-migration/EBF-RTC-9582.sql
[echo] /PS/database/sif-internal-migration-scripts/RTC-9582.sql
[echo] /PS/database/src/program-units/RTC-9583-PKG_CDT.pkb
[echo] /templates/Images/RTC-9583-ABAKER.TIF
[echo] /templates/Templates/RTC-9583-A100_1_20090101.xdp
Any help on getting this to run would be greatly appreciated.
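For reference, the intended transformation can be written down outside Ant. The Python sketch below is only an illustration of the desired result, not an Ant fix. It keeps the lines that contain one of the wanted directories and returns just the matched path, which drops both the leading whitespace and the /PS prefix. The helper name extract_paths is hypothetical, and the sketch assumes paths contain no spaces.

```python
import re

# Same directories as the Ant pattern; \S+ stops at whitespace, so nothing trailing is kept.
WANTED_RE = re.compile(
    r"/database/(?:ebf-migration|sif-internal-migration-scripts|src/program-units)/\S+"
    r"|/templates/\S+"
)


def extract_paths(log_text):
    """Return the wanted paths, without leading whitespace or the /PS prefix."""
    paths = []
    for line in log_text.splitlines():
        match = WANTED_RE.search(line)
        if match:
            paths.append(match.group(0))
    return paths


# Excerpt of the sample log from the question (indentation is assumed).
SAMPLE_LOG = """\
Component PS
  9476: Build changes to make for Ant task [Nov 12, 2015 12:02 PM]
    /PS/build/AntTaskHelper.xml
  9582: Testing for EBF and migration script changes [Nov 12, 2015 12:02 PM]
    /PS/database/ebf-migration/EBF-RTC-9582.sql
    /PS/database/sif-internal-migration-scripts/RTC-9582.sql
    /PS/database/src/program-units/RTC-9583-PKG_CDT.pkb
    /templates/Images/RTC-9583-ABAKER.TIF
    /templates/Templates/RTC-9583-A100_1_20090101.xdp
"""

if __name__ == "__main__":
    print("\n".join(extract_paths(SAMPLE_LOG)))
```

The key point is that the replacement keeps only the matched part of each line. Replacing a match with itself leaves any prefix outside the match in place.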

Configuring Notepad++ "Function List" for Perl

I'm trying to get the "Function List" feature in Notepad++ (v6.7.5) working for Perl with classes (or packages, in Perl parlance). By default, only regular subroutines outside of packages are supported.
Below is the XML snippet in question from the Function List config file (located on my Windows machine at C:\Users\user\AppData\Roaming\Notepad++\functionList.xml). I added the "classRange" node myself on top of the default "function" node.
EDIT: below is the corrected XML, thanks to user stribizhev
UPDATE: I've commented out the "normal" function section, because it was causing all my methods to appear twice in the function list.
<parser id="perl_function" displayName="Perl">
<classRange mainExpr="^package.*?(?=\npackage|\Z)">
<className>
<nameExpr expr="\s\K[^;]+"/>
</className>
<function mainExpr="^[\s]*(?<!#)[\s]*sub[\s]+[\w]+[\s]*\(?[^\)\(]*?\)?[\n\s]*\{" displayMode="$className->$functionName">
<functionName>
<funcNameExpr expr="(sub[\s]+)?\K[\w]+"/>
</functionName>
</function>
</classRange>
<!--
<function mainExpr="^[\s]*(?<!#)[\s]*sub[\s]+[\w]+[\s]*\(?[^\)\(]*?\)?[\n\s]*\{" displayMode="$className->$functionName">
<functionName>
<nameExpr expr="(sub[\s]+)?\K[\w]+"/>
</functionName>
</function>
-->
</parser>
The documentation for this is here.
I tried your XML in Notepad++ 6.8.1 and while it does work for Perl with 'packages', my plain scripts without packages fail to produce subs now. I uncommented the lines you commented out and it fixes that problem, but does exhibit the behavior you mentioned - doubled up subs within the 'packages'.
I found the following works nicely and even ignores subs in POD (which may be there as example usage) so they aren't added to the list:
<parser id="perl_function" displayName="Perl" commentExpr="(#.*?$|(__END__.*\Z))">
<classRange mainExpr="(?<=^package).*?(?=\npackage|\Z)">
<className>
<nameExpr expr="\s\K[^;]+"/>
</className>
<function mainExpr="^[\s]*(?<!#)[\s]*sub[\s]+[\w]+[\s]*\(?[^\)\(]*?\)?[\n\s]*\{">
<functionName>
<funcNameExpr expr="(sub[\s]+)?\K[\w]+"/>
</functionName>
</function>
</classRange>
<function mainExpr="^[\s]*(?<!#)[\s]*sub[\s]+[\w]+[\s]*\(?[^\)\(]*?\)?[\n\s]*\{">
<functionName>
<nameExpr expr="(?:sub[\s]+)?\K[\w]+"/>
</functionName>
</function>
</parser>
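To see what the classRange and function expressions select, here is a small Python approximation of the same two-step match. It is a sketch only: Notepad++ uses its own regex engine, Python's re module has no \K (a capture group is used instead), the (?<!#) lookbehind is omitted for brevity, and the helper name list_package_subs is hypothetical.

```python
import re

# A package block runs from a line starting with "package" up to the next one or end of text.
PACKAGE_RE = re.compile(r"^package.*?(?=\npackage|\Z)", re.M | re.S)
# Package name: everything after the first whitespace up to the semicolon.
NAME_RE = re.compile(r"\s([^;]+)")
# Sub definition: "sub name", optional parenthesised part, then the opening brace.
SUB_RE = re.compile(r"^\s*sub\s+(\w+)\s*\(?[^)(]*?\)?\s*\{", re.M)


def list_package_subs(source):
    """Return {package name: [sub names]} for each package block in the source."""
    result = {}
    for block in PACKAGE_RE.findall(source):
        name = NAME_RE.search(block).group(1)
        result[name] = SUB_RE.findall(block)
    return result


# Hypothetical Perl source used only for demonstration.
PERL_SOURCE = """\
package Foo;
sub new { return bless {}, shift; }
sub bar { return 1; }
package Baz::Qux;
sub run { return 2; }
"""

if __name__ == "__main__":
    print(list_package_subs(PERL_SOURCE))
```

A script with no package line yields no blocks at all here, which mirrors why a separate top-level function node is still needed for plain scripts.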
Most probably, you should use funcNameExpr instead of nameExpr:
Example:
<functionName>
<funcNameExpr expr="(sub[\s]+)?\K[\w]+"/>
</functionName>
Neither switching to funcNameExpr nor uncommenting the function section worked for me. What did work was commenting out the three regex lines that match parentheses:
#\(
#[^()]*
#\)
My subs are declared without parentheses after the name, so I do not need those patterns. In my code, parentheses appear where a function is called, not where it is defined.

Suppressing stack trace when Rails tests error

I'm a Ruby on Rails newbie and writing tests. Some of these generate exceptions; I would like the "rake test" output to give me the exception error message but not the whole backtrace. (I'd like to write tests which exercise unimplemented functionality, which I'll then fill in.)
For example, actual output:
Started
E
Finished in 0.081054 seconds.
1) Error:
test_should_fail(VersioningTest):
ActiveRecord::StatementInvalid: PGError: ERROR: null value in column "client_ip" violates not-null constraint
: INSERT INTO "revisions" ("created_at", "id") VALUES ('2011-02-03 20:14:17', 980190962)
/Users/rpriedhorsky/.rvm/gems/ruby-1.9.2-p136/gems/activerecord-3.0.3/lib/active_record/connection_adapters/abstract_adapter.rb:202:in `rescue in log'
/Users/rpriedhorsky/.rvm/gems/ruby-1.9.2-p136/gems/activerecord-3.0.3/lib/active_record/connection_adapters/abstract_adapter.rb:194:in `log'
/Users/rpriedhorsky/.rvm/gems/ruby-1.9.2-p136/gems/activerecord-3.0.3/lib/active_record/connection_adapters/postgresql_adapter.rb:496:in `execute'
[... etc. etc. etc. ...]
1 tests, 0 assertions, 0 failures, 1 errors, 0 skips
Desired output:
Started
E
Finished in 0.081054 seconds.
1) Error:
test_should_fail(VersioningTest):
ActiveRecord::StatementInvalid: PGError: ERROR: null value in column "client_ip" violates not-null constraint
1 tests, 0 assertions, 0 failures, 1 errors, 0 skips
I found info (e.g.) on the opposite direction, but not on suppressing stack traces.
Edit:
It would be nice to turn them on and off easily; as pointed out below, sometimes they are useful for tracking down bugs.
You could take a look at "backtrace silencers" - for me (Rails 2.3.8), this is the file config/initializers/backtrace_silencers.rb:
# Be sure to restart your server when you modify this file.
# You can add backtrace silencers for libraries that you're using but
# don't wish to see in your backtraces.
# Rails.backtrace_cleaner.add_silencer { |line| line =~ /my_noisy_library/ }
# You can also remove all the silencers if you're trying to debug a
# problem that might stem from framework code.
# Rails.backtrace_cleaner.remove_silencers!
Rails.backtrace_cleaner.add_silencer {|line| line =~ /gems/}
Rails.backtrace_cleaner.add_silencer {|line| line =~ /passenger/}
It looks like you should be able to put a line like
Rails.backtrace_cleaner.add_silencer {|line| true}
in your config/environments/test.rb file, and that would wipe your backtraces clean away (though it might only apply to the logger; I'm not very familiar with the method).
But ask yourself - do you really want to do away with backtraces entirely? They can be pretty useful for tracking down bugs...