Difficulty using JAPE Grammar - gate

I have a document which contains sections such as Assessments, HPI, ROS, Vitals etc.
I want to extract notes in each section. I am using GATE for this purpose. I have made a JAPE file which will extract notes in the Assessment section. Following is the grammar,
Input: Token
Options: control=appelt debug=true
Rule: Assess
({Token.string =~"(?i)diagnose[d]?"}{Token.string=="with"} | {Token.string=~"(?i)suffering"}{Token.string=~"(?i)from"} | {Token.string=~"(?i)suffering"}{Token.string=~"(?i)with"})
(
({Token})*
):assessments
({Token.string =~"(?i)HPI"} | {Token.string =~"(?i)ROS"} | {Token.string =~"(?i)EXAM"} | {Token.string =~"(?i)VITAL[S]"} | {Token.string =~"(?i)TREATMENT[s]"} |{Token.string=~"(?i)use[d]?"}{Token.string=~"(?i)orderset[s]?"} | {Token.string=~"$"})
-->
:assessments.Assessments = {}
Now, when the assessment section is at the end of the document, I can retrieve the notes properly. But if it sits between two other sections, the rule returns everything from the assessment section to the end of the file.
I have tried using {Token.string=~"$"} in different ways but could not extract only the assessment section irrespective of its place in the document.
Please explain how I can achieve this using JAPE grammar.

That is expected, since Appelt mode always prefers the longest possible overall match. Because any Token satisfies string =~ "$", the assessments label grabs everything up to, but not including, the final token in the document.
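To see why every token qualifies as the terminator, here is a minimal Python sketch. It assumes that JAPE's =~ operator uses "find" semantics (the pattern may match anywhere in the value) and uses Python's re.search as a stand-in. The token list and the helper name jape_find are made up for illustration.

```python
import re

def jape_find(pattern, value):
    """Stand-in for a find-style match: succeeds if pattern matches anywhere in value."""
    return re.search(pattern, value) is not None

# Hypothetical token strings from a document with an HPI section after the assessment.
tokens = ["diagnosed", "with", "asthma", "HPI", "Patient", "reports", "cough"]

# "$" matches the empty position at the end of any string, so no token is excluded.
dollar_hits = [t for t in tokens if jape_find(r"$", t)]

# A real heading pattern only hits the heading token.
hpi_hits = [t for t in tokens if jape_find(r"(?i)HPI", t)]

print(dollar_hits == tokens)  # every token satisfies "$"
print(hpi_hits)
```

Since the last alternative is satisfied by any token, the longest-match preference pushes the terminator as far right as it can go.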
I would adopt a two-pass approach: an initial gazetteer or JAPE phase annotates the "section headings", and a second phase lists only these Heading annotations on its Input line:
Imports: { import static gate.Utils.*; }
Phase: AnnotateBetweenHeadings
Input: Heading
Options: control = appelt
Rule: TwoHeadings
({Heading.type == "assessments"}):h1
(({Heading})?):h2
-->
{
Long endOffset = end(doc);
AnnotationSet h2Annots = bindings.get("h2");
if(h2Annots != null && !h2Annots.isEmpty()) {
endOffset = start(h2Annots);
}
outputAS.add(end(bindings.get("h1")), endOffset, "Assessments", featureMap());
}
This will annotate everything between the end of the assessments heading and the start of the following heading, or the end of the document if there is no following heading.
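The offset logic of that second phase can be sketched outside GATE as follows. This is only an illustration in Python, not GATE API code: the (type, start, end) tuples and the function name assessments_span are assumptions standing in for Heading annotations.

```python
def assessments_span(headings, doc_length):
    """Return (start, end) of the text after the 'assessments' heading.

    headings: list of (type, start_offset, end_offset), sorted by start_offset.
    The span runs from the end of the assessments heading to the start of the
    next heading, or to doc_length if no heading follows.
    """
    for i, (heading_type, _start, end) in enumerate(headings):
        if heading_type == "assessments":
            if i + 1 < len(headings):
                return (end, headings[i + 1][1])
            return (end, doc_length)
    return None  # no assessments heading in this document

# Assessments in the middle of the document: stops at the next heading.
middle = [("hpi", 0, 3), ("assessments", 40, 51), ("ros", 120, 123)]
print(assessments_span(middle, 200))

# Assessments as the last section: runs to the end of the document.
last = [("hpi", 0, 3), ("assessments", 40, 51)]
print(assessments_span(last, 200))
```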

Tyson Hamilton provides this alternative to annotating EOD since $ doesn't work in JAPE:
Rule: DOCMARKERS
// we need to match something even though we don't use it directly
(({Token})):doc
-->
:doc{
FeatureMap features = Factory.newFeatureMap();
features.put("rule", ruleName());
try {
outputAS.add(0L, 0L, "SOD", features);
outputAS.add(docAnnots.getDocument().getContent().size(), docAnnots.getDocument().getContent().size(), "EOD", features);
} catch (InvalidOffsetException ioe) {
throw new GateRuntimeException(ioe);
}
}
I found that EOD was only recognized in later rules by giving it some length. So I have this:
Rule: DOCMARKERS
Priority: 2
(
({Sentence}) // we need to match something even though we don't use it directly
):doc
-->
:doc{
FeatureMap features = Factory.newFeatureMap();
features.put("rule", "DOCMARKERS");
try {
outputAS.add(0L, 0L, "SOD", features);
long docsize = docAnnots.getDocument().getContent().size();
// The only way I could get EOD to be recognized in later rules was to
// give it some length, hence the -2 and -1
outputAS.add(docsize-2, docsize-1, "EOD", features);
System.err.println("Debug: added EOD");
} catch (InvalidOffsetException ioe) {
throw new GateRuntimeException(ioe);
}
}
And then you should be able to change the end of your rule to
...| {Token.string=~"$"})


Regex for finding the name of a method containing a string

I've got a Node module file containing about 100 exported methods, which looks something like this:
exports.methodOne = async user_id => {
// other method contents
};
exports.methodTwo = async user_id => {
// other method contents
fooMethod();
};
exports.methodThree = async user_id => {
// other method contents
fooMethod();
};
Goal: What I'd like to do is figure out how to grab the name of any method which contains a call to fooMethod, and return the correct method names: methodTwo and methodThree. I wrote a regex which gets kinda close:
exports\.(\w+).*(\n.*?){1,}fooMethod
Problem: using my example code from above, though, it would effectively match methodOne and methodThree because it finds the first instance of export and then the first instance of fooMethod and goes on from there. Here's a regex101 example.
I suspect I could make use of lookaheads or lookbehinds, but I have little experience with those parts of regex, so any guidance would be much appreciated!
Edit: Turns out regex is poorly suited to this type of task. @ctcherry advised using a parser, and using that as a springboard, I was able to learn about Abstract Syntax Trees (ASTs) and the recast tool, which lets you traverse the tree after using various tools (acorn and others) to parse your code into tree form.
With these tools in hand, I successfully built a script to parse and traverse my node app's files, and was able to find all methods containing fooMethod as intended.
Regex isn't the best tool to tackle all the parts of this problem. Ideally we would rely on something higher level: a parser.
One way to do this is to let JavaScript parse itself during load and execution. If your node module doesn't include anything that would execute on its own (or at least anything that would conflict with the below), you can put this at the bottom of your module, and then run the module with node mod.js.
console.log(Object.keys(exports).filter(fn => exports[fn].toString().includes("fooMethod(")));
(In the comments below it is revealed that the above isn't possible.)
Another option would be to use a library like https://github.com/acornjs/acorn (there are other options) to write some other javascript that parses your original target javascript, then you would have a tree structure you could use to perform your matching and eventually return the function names you are after. I'm not an expert in that library so unfortunately I don't have sample code for you.
This regex matches (only) the method names that contain a call to fooMethod();
(?<=exports\.)\w+(?=[^{]+\{[^}]+fooMethod\(\)[^}]+};)
See live demo.
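As a quick check of that lookbehind/lookahead pattern against the sample from the question, here is a small sketch. Python's re module is used only for illustration; the pattern itself is unchanged. It relies on the method bodies containing no nested braces, because [^}]+ stops at the first closing brace.

```python
import re

SOURCE = """exports.methodOne = async user_id => {
// other method contents
};
exports.methodTwo = async user_id => {
// other method contents
fooMethod();
};
exports.methodThree = async user_id => {
// other method contents
fooMethod();
};
"""

# The lookbehind anchors the name right after "exports.", and the lookahead
# requires fooMethod() to appear before the first closing brace.
CALLERS = re.compile(r"(?<=exports\.)\w+(?=[^{]+\{[^}]+fooMethod\(\)[^}]+};)")

def methods_calling_foo(source):
    """Return the exported method names whose body calls fooMethod()."""
    return CALLERS.findall(source)

print(methods_calling_foo(SOURCE))
```

A body that contains an if block or an object literal before the call would not be matched, which is one reason a parser is the more robust choice.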
Assuming that all methods have their body enclosed within { and }, I would make an approach to get to the final regex like this:
First, find a regex to get the individual methods. This can be done using this regex:
exports\.(\w+)(\s|.)*?\{(\s|.)*?\}
Next, we are interested in those methods that have fooMethod in them before they close. So, look for } or fooMethod.*}, in that order. So, let us name the group searching for fooMethod as FOO and the name of the method calling it as METH. When we iterate the matches, if group FOO is present in a match, we will use the corresponding METH group, else we will reject it.
exports\.(?<METH>\w+)(\s|.)*?\{(\s|.)*?(\}|(?<FOO>fooMethod)(\s|.)*?\})
Explanation:
exports\.(?<METH>\w+): Till the method name (you have already covered this)
(\s|.)*?\{(\s|.)*?: Some code before { and after, non-greedy so that the subsequent group is given preference
(\}|(?<FOO>fooMethod)(\s|.)*?\}): This has 2 parts:
\}: Match the method close delimiter, OR
(?<FOO>fooMethod)(\s|.)*?\}): The call to fooMethod followed by optional code and method close delimiter.
Here's JavaScript code that demonstrates this:
let p = /exports\.(?<METH>\w+)(\s|.)*?\{(\s|.)*?(\}|(?<FOO>fooMethod)(\s|.)*?\})/g
let input = `exports.methodOne = async user_id => {
// other method contents
};
exports.methodTwo = async user_id => {
// other method contents
fooMethod();
};
exports.methodThree = async user_id => {
// other method contents
fooMethod();
};`;
let match = p.exec( input );
while( match !== null) {
if( match.groups.FOO !== undefined ) console.log( match.groups.METH );
match = p.exec( input )
}

Custom validator to ban a specific wordlist

I need a custom validator to ban a specific list of banned words from a textarea field.
I need exactly this type of implementation, I know that it's not logically correct to let the user type part of a query but it's exactly what I need.
I tried with a regExp but it has a strange behaviour.
My RegExp
/(drop|update|truncate|delete|;|alter|insert)+./gi
my Validator
export function forbiddenWordsValidator(sqlRe: RegExp): ValidatorFn {
return (control: AbstractControl): { [key: string]: any } | null => {
const forbidden = sqlRe.test(control.value);
return forbidden ? { forbiddenSql: { value: control.value } } : null;
};
}
my formControl:
whereCondition: new FormControl("", [
Validators.required,
forbiddenWordsValidator(this.BAN_SQL_KEYWORDS)...
It works only in certain cases, and I don't understand why the same string is rejected one time but accepted if I delete a character and retype it; sometimes the validator also returns OK after I type a whitespace.
There are several issues here:
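The global g modifier makes RegExp#test stateful: after a successful match the regex's lastIndex is advanced, and the next call resumes searching from that position, so the same input can alternately pass and fail. It must be removed.
The . at the end requires one more character (anything except a line break) after the keyword, so a keyword at the very end of the input is not caught. It must be removed too.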
Use
/drop|update|truncate|delete|;|alter|insert/i
Or, to match the words as whole words use
/\b(?:drop|update|truncate|delete|alter|insert)\b|;/i
This way, insert in insertion and drop in dropout won't get "caught" (=matched).
See the regex demo.
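The whole-word pattern can be sanity-checked with a short sketch. Python's re module is used here purely for illustration (the pattern syntax is the same as in the JavaScript literal above), and the helper name is_forbidden is made up.

```python
import re

# Same alternation as above: banned keywords as whole words, or a semicolon anywhere.
BANNED_SQL = re.compile(
    r"\b(?:drop|update|truncate|delete|alter|insert)\b|;",
    re.IGNORECASE,
)

def is_forbidden(text):
    """Return True if text contains a banned keyword as a whole word, or a semicolon."""
    return BANNED_SQL.search(text) is not None

print(is_forbidden("id = 5"))            # plain condition
print(is_forbidden("1=1; DROP TABLE x")) # keyword and semicolon
print(is_forbidden("insertion_date > 0"))# 'insert' only as part of a longer word
```

Because the function searches from the start of the string on every call, repeated calls on the same input always give the same answer.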
It's not a great idea to give such power to the user.

FW/1 pattern matching N digits

I am trying to match routes where IDs have exactly 6 digits.
This does not work:
variables.framework.routes = [
{ "main/{id:[0-9]{6}}" = "main/home/eid/:id"},
{ "main/home" = "main/home"},
{ "*" = "main/404"}
];
This does:
variables.framework.routes = [
{ "main/{id:[0-9]+}" = "main/home/eid/:id"},
{ "main/home" = "main/home"},
{ "*" = "main/404"}
];
The second one of course matches on any number of digits. I wonder if I have to escape the {
It looks like FW/1 only allows a limited regular expression syntax for the routes declaration. So I don't think your first example will work. From what I could find the limited regular expression syntax in routes was added to FW/1 version 3.5. I found some discussion on the topic and this specific comment describing the requested behavior - https://github.com/framework-one/fw1/issues/325#issuecomment-118572702
{placeholder:regex}, so we could have product/{id:[0-9]+}-:name.html that targets product.detail?id={id:[0-9]+}&name=:name.
You need to repeat the placeholder with the regex in the target route too (could be changed).
You can't put } in your placeholder specific regex.
Let me know if a PR is welcome for this add-on.
Notice the second bullet point, which says that } (a closing brace) is not allowed in the placeholder regex. Your {6} quantifier contains one, which would explain why the first route does not match.
Here is a link to the code referenced by that pull-request which was included in 3.5 - https://github.com/framework-one/fw1/commit/9543b78552dbd27a526083ac72a3846bd86eeb90
And here is a link to the updated documentation for version 3.5 where some information was added about this feature - http://framework-one.github.io/documentation/developing-applications.html#url-routes
Snippet of that doc here:
Placeholder variables in the route are identified either by a leading colon or by braces (specifying a variable name and a regex to restrict matches) and can appear in the URL as well, for example { "/product/:id" = "/product/view/id/:id" } specifies a match for /product/something which will be treated as if the URL was /product/view/id/something - section: product, item: view, query string id=something. Similarly, { "/product/{id:[0-9]+}" = "/product/view/id/:id" } specifies a match for /product/42 which will be treated as if the URL was /product/view/id/42, and only numeric values will match the placeholder.
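Since } cannot appear inside the placeholder regex, one possible workaround is to spell the repetition out as six [0-9] classes, for example { "main/{id:[0-9][0-9][0-9][0-9][0-9][0-9]}" = "main/home/eid/:id" }. This is an untested assumption, not something the FW/1 documentation states. The Python sketch below only confirms that the two patterns are equivalent as regular expressions; whether FW/1 anchors the placeholder so that a seventh digit is rejected would still need to be verified.

```python
import re

# The quantifier form that FW/1 reportedly cannot parse because of the "}".
with_quantifier = re.compile(r"[0-9]{6}")

# The same pattern written without braces: six character classes in a row.
spelled_out = re.compile("[0-9]" * 6)

def same_verdict(candidate):
    """Return True if both patterns agree on whether candidate is exactly six digits."""
    return bool(with_quantifier.fullmatch(candidate)) == bool(spelled_out.fullmatch(candidate))

for candidate in ["123456", "12345", "1234567", "12a456"]:
    print(candidate, bool(spelled_out.fullmatch(candidate)), same_verdict(candidate))
```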

Pulling multiple values from JSON response using RegEx Extractor

I'm testing a web service that returns JSON responses and I'd like to pull multiple values from the response. A typical response would contain multiple values in a list. For example:
{
"name":"#favorites",
"description":"Collection of my favorite places",
"list_id":4894636,
}
A response would contain many sections like the above example.
What I'd like to do in Jmeter is go through the JSON response and pull each section outlined above in a manner that I can tie the returned name and description as one entry to iterate over.
What I've been able to do thus far is return the name value with regular expression extractor ("name":"(.+?)") using the template $1$. I'd like to pull both name and description but can't seem to get it to work. I've tried using a regex "name":"(.+?)","description":"(.+?)" with a template of $1$$2$ without any success.
Does anyone know how I might pull multiple values using regex in this example?
You can add (?s) to the regex so that . also matches line breaks.
E.g: (?s)"name":"(.+?)","description":"(.+?)"
It works for me on assertions.
It may be worth using BeanShell scripting to process the JSON response.
So if you need to get ALL the "name/description" pairs from response (for each section) you can do the following:
1. extract all the "name/description" pairs from response in loop;
2. save extracted pairs in csv-file in handy format;
3. read saved pairs from csv-file later in code - using CSV Data Set Config in loop, e.g.
JSON response processing can be implemented using BeanShell scripting (~ java) + any json-processing library (e.g. json-rpc-1.0):
- either in BeanShell Sampler or in BeanShell PostProcessor;
- all the required beanshell libs are currently provided in default jmeter delivery;
- to use json-processing library place jar into JMETER_HOME/lib folder.
Schematically it will look like:
in case of BeanShell PostProcessor:
Thread Group
. . .
YOUR HTTP Request
BeanShell PostProcessor // added as child
. . .
in case of BeanShell Sampler:
Thread Group
. . .
YOUR HTTP Request
BeanShell Sampler // added separate sampler - after your
. . .
In this case it makes no difference which one you use.
You can either put the code itself into the sampler body ("Script" field) or store in external file, as shown below.
Sampler code:
import java.io.*;
import java.util.*;
import org.json.*;
import org.apache.jmeter.samplers.SampleResult;
ArrayList nodeRefs = new ArrayList();
ArrayList fileNames = new ArrayList();
String extractedList = "extracted.csv";
StringBuilder contents = new StringBuilder();
try
{
if (ctx.getPreviousResult().getResponseDataAsString().equals("")) {
Failure = true;
FailureMessage = "ERROR: Response is EMPTY.";
throw new Exception("ERROR: Response is EMPTY.");
} else {
if ((ResponseCode != null) && (ResponseCode.equals("200") == true)) {
SampleResult result = ctx.getPreviousResult();
JSONObject response = new JSONObject(result.getResponseDataAsString());
FileOutputStream fos = new FileOutputStream(System.getProperty("user.dir") + File.separator + extractedList);
if (response.has("items")) {
JSONArray items = response.getJSONArray("items");
if (items.length() != 0) {
for (int i = 0; i < items.length(); i++) {
String name = items.getJSONObject(i).getString("name");
String description = items.getJSONObject(i).getString("description");
int list_id = items.getJSONObject(i).getInt("list_id");
if (i != 0) {
contents.append("\n");
}
contents.append(name).append(",").append(description).append(",").append(list_id);
System.out.println("\t " + name + "\t\t" + description + "\t\t" + list_id);
}
}
}
byte [] buffer = contents.toString().getBytes();
fos.write(buffer);
fos.close();
} else {
Failure = true;
FailureMessage = "Failed to extract from JSON response.";
}
}
}
catch (Exception ex) {
IsSuccess = false;
log.error(ex.getMessage());
System.err.println(ex.getMessage());
}
catch (Throwable thex) {
System.err.println(thex.getMessage());
}
See also these links on the topic:
JSON in JMeter
Processing JSON Responses with JMeter and the BSF Post Processor
Upd. on 08.2017:
At the moment JMeter has a set of built-in components (merged from 3rd party projects) to handle JSON without scripting:
JSON Path Extractor (contributed from ATLANTBH jmeter-components project);
JSON Extractor (contributed from UBIK Load Pack since JMeter 3.0) - see answer below.
I am assuming that JMeter uses Java-based regular expressions... This could mean no named capturing groups. Apparently, Java7 now supports them, but that doesn't necessarily mean JMeter would. For JSON that looks like this:
{
"name":"#favorites",
"description":"Collection of my favorite places",
"list_id":4894636,
}
{
"name":"#AnotherThing",
"description":"Something to fill space",
"list_id":0048265,
}
{
"name":"#SomethingElse",
"description":"Something else as an example",
"list_id":9283641,
}
...this expression:
\{\s*"name":"((?:\\"|[^"])*)",\s*"description":"((?:\\"|[^"])*)",(?:\\}|[^}])*}
...should match 3 times, capturing the "name" value into the first capturing group, and the "description" into the second capturing group, similar to the following:
1               2
--------------- ---------------------------------------
#favorites      Collection of my favorite places
#AnotherThing   Something to fill space
#SomethingElse  Something else as an example
Importantly, this expression supports quote escaping in the value portion (and really even in the identifier name portion as well), so that the JavaScript string I said, "What is your name?"! will be stored in JSON, and parsed correctly, as I said, \"What is your name?\"!
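The expression can be tried against the three sample sections with a short sketch. Python's re module is used here only to exercise the pattern; it is not what JMeter runs, and the variable names are made up.

```python
import re

RESPONSE = """{
"name":"#favorites",
"description":"Collection of my favorite places",
"list_id":4894636,
}
{
"name":"#AnotherThing",
"description":"Something to fill space",
"list_id":0048265,
}
{
"name":"#SomethingElse",
"description":"Something else as an example",
"list_id":9283641,
}
"""

# Group 1 captures the name value, group 2 the description value.
# The \s* parts absorb the line breaks between the fields.
SECTION = re.compile(
    r'\{\s*"name":"((?:\\"|[^"])*)",\s*"description":"((?:\\"|[^"])*)",(?:\\}|[^}])*}'
)

pairs = SECTION.findall(RESPONSE)
for name, description in pairs:
    print(name, "|", description)
```

Each match yields the name and description of one section together, which is the pairing the question asks for.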
Using the Ubik Load Pack plugin for JMeter, which has been donated to JMeter core and has been available as the JSON Extractor since version 3.0, you can do it with the following Test Plan:
namesExtractor_ULP_JSON_PostProcessor config:
descriptionExtractor_ULP_JSON_PostProcessor config:
Loop Controller to loop over results:
Counter config:
Debug Sampler showing how to use name and description in one iteration:
And here is what you get for the following JSON:
[{ "name":"#favorites", "description":"Collection of my favorite places", "list_id": 4894636 }, { "name":"#AnotherThing", "description":"Something to fill space", "list_id": 48265 }, { "name":"#SomethingElse", "description":"Something else as an example", "list_id":9283641 }]
Compared to Beanshell solution:
It is a more standard approach
It performs much better than Beanshell code
It is more readable
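Outside JMeter, the pairing that the Loop Controller and Counter perform (one name with its matching description per iteration) can be sketched with an ordinary JSON parser. This is plain Python for illustration, not JMeter configuration; it uses the JSON array shown above.

```python
import json

RESPONSE = (
    '[{ "name":"#favorites", "description":"Collection of my favorite places", "list_id": 4894636 }, '
    '{ "name":"#AnotherThing", "description":"Something to fill space", "list_id": 48265 }, '
    '{ "name":"#SomethingElse", "description":"Something else as an example", "list_id":9283641 }]'
)

# Parse once, then keep name and description together as one entry per item.
entries = [(item["name"], item["description"]) for item in json.loads(RESPONSE)]

for index, (name, description) in enumerate(entries, start=1):
    print(index, name, description)
```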

Regex to parse querystring values to named groups

I have a HTML with the following content:
... some text ...
link ... some text ...
... some text ...
link ... some text ...
... some text ...
I would like to parse that and get a match with named groups:
match 1
group["user"]=123
group["section"]=2
match 2
group["user"]=678
group["section"]=5
I can do it if parameters always go in order, first User and then Section, but I don't know how to do it if the order is different.
Thank you!
In my case I had to parse a URL because the utility HttpUtility.ParseQueryString is not available in WP7. So, I created an extension method like this:
public static class UriExtensions
{
private static readonly Regex queryStringRegex;
static UriExtensions()
{
queryStringRegex = new Regex(@"[\?&](?<name>[^&=]+)=(?<value>[^&=]+)");
}
public static IEnumerable<KeyValuePair<string, string>> ParseQueryString(this Uri uri)
{
if (uri == null)
throw new ArgumentException("uri");
var matches = queryStringRegex.Matches(uri.OriginalString);
for (int i = 0; i < matches.Count; i++)
{
var match = matches[i];
yield return new KeyValuePair<string, string>(match.Groups["name"].Value, match.Groups["value"].Value);
}
}
}
Then it's a matter of using it, for example:
var uri = new Uri(HttpUtility.UrlDecode(@"file.aspx?userId=123&section=2"),UriKind.RelativeOrAbsolute);
var parameters = uri.ParseQueryString().ToDictionary( kvp => kvp.Key, kvp => kvp.Value);
var userId = parameters["userId"];
var section = parameters["section"];
NOTE: I'm returning an IEnumerable instead of a dictionary directly because there might be duplicate parameter names. If there are duplicate names, ToDictionary will throw an exception.
Why use regex to split it out?
You could first extract the query string, split the result on &, and then create a map by splitting each of those parts on =.
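A minimal sketch of that split-based approach, written in Python for illustration; the function name parse_query and the second sample URL are made up. Parameter order does not matter because each pair is stored under its own key.

```python
def parse_query(url):
    """Split the query string of url into a {name: value} dictionary."""
    # Everything after the first "?" is the query string.
    _path, _sep, query = url.partition("?")
    params = {}
    for part in query.split("&"):
        if not part:
            continue  # skip empty segments, e.g. a URL without a query string
        name, _eq, value = part.partition("=")
        params[name] = value
    return params

print(parse_query("file.aspx?userId=123&section=2"))
print(parse_query("file.aspx?section=5&userId=678"))
```

Python's standard library also offers urllib.parse.parse_qs, which additionally handles percent-decoding and repeated parameter names.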
You didn't specify what language you are working in, but this should do the trick in C#:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace RegexTest
{
class Program
{
static void Main(string[] args)
{
string subjectString = @"... some text ...
link ... some text ...
... some text ...
link ... some text ...
... some text ...";
Regex regexObj =
new Regex(@"<a href=""file.aspx\?(?:(?:userId=(?<user>.+?)&section=(?<section>.+?)"")|(?:section=(?<section>.+?)&user=(?<user>.+?)""))");
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success)
{
string user = matchResults.Groups["user"].Value;
string section = matchResults.Groups["section"].Value;
Console.WriteLine(string.Format("User = {0}, Section = {1}", user, section));
matchResults = matchResults.NextMatch();
}
Console.ReadKey();
}
}
}
Using regex to first find the key value pairs and then doing splits... doesn't seem right.
I'm interested in a complete regex solution.
Anyone?
Check this out
\<a\s+href\s*=\s*["'](?<baseUri>.+?)\?(?:(?<key>.+?)=(?<value>.+?)[&"'])*\s*\>
You can get pairs with something like Groups["key"].Captures[i] & Groups["value"].Captures[i]
Perhaps something like this (I am rusty on regex, and wasn't good at them in the first place anyway. Untested):
/href="[^?]*([?&](userId=(?<user>\d+))|section=(?<section>\d+))*"/
(By the way, the XHTML is malformed; & should be &amp; in the attributes.)
Another approach is to put the capturing groups inside lookaheads:
Regex r = new Regex(@"<a href=""file\.aspx\?" +
@"(?=[^""<>]*?user=(?<user>\w+))" +
@"(?=[^""<>]*?section=(?<section>\w+))");
If there are only two parameters, there's no reason to prefer this way over the alternation-based approaches suggested by Mike and strager. But if you needed to match three parameters, the other regexes would grow to several times their current length, while this one would only need another lookahead just like the two existing ones.
By the way, contrary to your response to Claus, it matters quite a bit which language you're working in. There's a huge variation in capabilities, syntax, and API from one language to the next.
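The lookahead idea carries over to other flavors. Here is a sketch in Python, whose named-group syntax is (?P<name>...). The parameter name userId and the sample HTML are assumptions based on the URLs quoted elsewhere in this thread, since the original markup is not shown.

```python
import re

# Hypothetical sample: the two links carry the parameters in different orders.
HTML = (
    '... some text ... <a href="file.aspx?userId=123&section=2">link</a> ... '
    '<a href="file.aspx?section=5&userId=678">link</a> ... some text ...'
)

# Each lookahead scans forward inside the attribute value for one parameter,
# so the two parameters are captured regardless of their order.
LINK = re.compile(
    r'<a href="file\.aspx\?'
    r'(?=[^"<>]*?userId=(?P<user>\w+))'
    r'(?=[^"<>]*?section=(?P<section>\w+))'
)

results = [(m.group("user"), m.group("section")) for m in LINK.finditer(HTML)]
print(results)
```

Adding a third parameter only means adding a third lookahead of the same shape.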
You did not say which regex flavor you are using. Since your sample URL links to an .aspx file, I'll assume .NET. In .NET, a single regex can have multiple named capturing groups with the same name, and .NET will treat them as if they were one group. Thus you can use the regex
userID=(?<user>\d+)&section=(?<section>\d+)|section=(?<section>\d+)&userID=(?<user>\d+)
This simple regex with alternation will be far more efficient than any tricks with lookaround. You can easily expand it if your requirements include matching the parameters only if they're in a link.
A simple Python implementation that overcomes the ordering problem:
In [2]: x = re.compile('(?:(userId|section)=(\d+))+')
In [3]: t = 'href="file.aspx?section=2&userId=123"'
In [4]: x.findall(t)
Out[4]: [('section', '2'), ('userId', '123')]
In [5]: t = 'href="file.aspx?userId=123&section=2"'
In [6]: x.findall(t)
Out[6]: [('userId', '123'), ('section', '2')]