How to cache and use the cached regexes in perl6 grammar? - regex

My code spends a lot of time on regex interpolation. As the patterns rarely change, I guess caching these generated regexes should speed up the code. But I cannot figure out the right way to cache the regexes and then use the cached ones.
The code is used to parse some arithmetic expressions. As the users are allowed to define new operators, the parser must be ready to add new operators to the grammar. So the parser uses a table to record these new operators and generates regexes from the table on the fly.
#! /usr/bin/env perl6
use v6.c;
# the parser may add new operators to this table on the fly.
my %operator-table = %(
1 => $['"+"', '"-"'],
2 => $['"*"', '"/"'],
# ...
);
# original code, runnable but slow.
grammar Operator {
token operator(Int $level) {
<{%operator-table{$level}.join('|')}>
}
# ...
}
# usage:
say Operator.parse(
'+',
rule => 'operator',
args => \(1)
);
# output:
# 「+」
Here are some experiments:
# try to cache the generated regexes, but it does not work.
grammar CachedOperator {
my %cache-table = %();
method operator(Int $level) {
if (! %cache-table{$level}) {
%cache-table.append(
$level => rx { <{%operator-table{$level}.join('|')}> }
)
}
%cache-table{$level}
}
}
# test:
say CachedOperator.parse(
'+',
rule => 'operator',
args => \(1)
);
# output:
# Nil
# one more try
grammar CachedOperator_ {
my %cache-table = %();
token operator(Int $level) {
<create-operator($level)>
}
method create-operator(Int $level) {
if (! %cache-table{$level}) {
%cache-table.append(
$level => rx { <{%operator-table{$level}.join('|')}> }
)
}
%cache-table{$level}
}
}
# test:
say CachedOperator_.parse(
'+',
rule => 'operator',
args => \(1)
);
# compile error:
# P6opaque: no such attribute '$!pos' on type Match in a Regex when trying to get a value

The following doesn't directly answer your question but may be of interest.
User defined operators
The following code declares an operator in P6:
sub prefix:<op> ($operand) { " $operand prefixed by op" }
Now one can use the new operator:
say op 42; # 42 prefixed by op
A wide range of operator positions and arities are covered, including choice of associativity and precedence, parentheses for grouping, etc. So maybe this is an appropriate way to implement what you're implementing.
Although it's slow, it might be fast enough. Additionally, as Larry said in 2017 ...
we know some some places in the parser that are slower than they should be, for instance ... various lexers relook at various characters in your Perl 6 program, it averages 5 or 6 times on every character, which is obviously deeply sub-optimal, and we know how to fix it
... and with luck Jonathan will work on the P6 grammar parser this year.
DSLs and Slangs
Even if you aren't interested in using the main language's ability to declare user defined operators, or can't for some reason, the underlying mechanisms that make it work might be of interest/use. Here are some references:
Brian Duggan's Informal DSLs presentation (video, slides).
Mouq's 2014 gist Slangs.
Larry Wall's speculation from way back when in Switching parsers and Slangs.

Regex for finding the name of a method containing a string

I've got a Node module file containing about 100 exported methods, which looks something like this:
exports.methodOne = async user_id => {
// other method contents
};
exports.methodTwo = async user_id => {
// other method contents
fooMethod();
};
exports.methodThree = async user_id => {
// other method contents
fooMethod();
};
Goal: What I'd like to do is figure out how to grab the name of any method which contains a call to fooMethod, and return the correct method names: methodTwo and methodThree. I wrote a regex which gets kinda close:
exports\.(\w+).*(\n.*?){1,}fooMethod
Problem: using my example code from above, though, it would effectively match methodOne and methodThree because it finds the first instance of export and then the first instance of fooMethod and goes on from there. Here's a regex101 example.
I suspect I could make use of lookaheads or lookbehinds, but I have little experience with those parts of regex, so any guidance would be much appreciated!
Edit: Turns out regex is poorly suited for this type of task. @ctcherry advised using a parser, and using that as a springboard, I was able to learn about Abstract Syntax Trees (ASTs) and the recast tool which lets you traverse the tree after using various tools (acorn and others) to parse your code into tree form.
With these tools in hand, I successfully built a script to parse and traverse my node app's files, and was able to find all methods containing fooMethod as intended.
Regex isn't the best tool to tackle all the parts of this problem; ideally we would rely on something higher level: a parser.
One way to do this is to let JavaScript parse itself during load and execution. If your node module doesn't include anything that would execute on its own (or at least anything that would conflict with the below), you can put this at the bottom of your module, and then run the module with node mod.js.
console.log(Object.keys(exports).filter(fn => exports[fn].toString().includes("fooMethod(")));
(In the comments below it is revealed that the above isn't possible.)
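As a minimal sketch of what the one-liner above does, the following runs the same filter against a stand-in object. The name fakeExports and the three function bodies are hypothetical; they only mimic the shape of the module in the question.

```javascript
// Stand-in for the helper that the real module calls.
function fooMethod() {}

// Hypothetical replacement for the module's `exports` object.
const fakeExports = {
  methodOne: async user_id => { return user_id; },
  methodTwo: async user_id => { fooMethod(); },
  methodThree: async user_id => { fooMethod(); return user_id; },
};

// Function.prototype.toString() returns the function's source text,
// so a plain substring search finds the bodies that call fooMethod.
const callers = Object.keys(fakeExports)
  .filter(name => fakeExports[name].toString().includes("fooMethod("));

console.log(callers); // [ 'methodTwo', 'methodThree' ]
```

Note that this is a text search on the source: a comment or string literal containing "fooMethod(" would also count as a hit.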
Another option would be to use a library like https://github.com/acornjs/acorn (there are other options) to write some other JavaScript that parses your original target JavaScript; you would then have a tree structure you could use to perform your matching and eventually return the function names you are after. I'm not an expert in that library so unfortunately I don't have sample code for you.
This regex matches (only) the method names that contain a call to fooMethod();
(?<=exports\.)\w+(?=[^{]+\{[^}]+fooMethod\(\)[^}]+};)
See live demo.
Assuming that all methods have their body enclosed within { and }, I would make an approach to get to the final regex like this:
First, find a regex to get the individual methods. This can be done using this regex:
exports\.(\w+)(\s|.)*?\{(\s|.)*?\}
Next, we are interested in those methods that have fooMethod in them before they close. So, look for } or fooMethod.*}, in that order. So, let us name the group searching for fooMethod as FOO and the name of the method calling it as METH. When we iterate the matches, if group FOO is present in a match, we will use the corresponding METH group, else we will reject it.
exports\.(?<METH>\w+)(\s|.)*?\{(\s|.)*?(\}|(?<FOO>fooMethod)(\s|.)*?\})
Explanation:
exports\.(?<METH>\w+): Till the method name (you have already covered this)
(\s|.)*?\{(\s|.)*?: Some code before { and after, non-greedy so that the subsequent group is given preference
(\}|(?<FOO>fooMethod)(\s|.)*?\}): This has 2 parts:
\}: Match the method close delimiter, OR
(?<FOO>fooMethod)(\s|.)*?\}): The call to fooMethod followed by optional code and method close delimiter.
Here's JavaScript code that demonstrates this:
let p = /exports\.(?<METH>\w+)(\s|.)*?\{(\s|.)*?(\}|(?<FOO>fooMethod)(\s|.)*?\})/g
let input = `exports.methodOne = async user_id => {
// other method contents
};
exports.methodTwo = async user_id => {
// other method contents
fooMethod();
};
exports.methodThree = async user_id => {
// other method contents
fooMethod();
};`
let match = p.exec( input );
while( match !== null) {
if( match.groups.FOO !== undefined ) console.log( match.groups.METH );
match = p.exec( input )
}

Custom validator to ban a specific wordlist

I need a custom validator to ban a specific list of banned words from a textarea field.
I need exactly this type of implementation, I know that it's not logically correct to let the user type part of a query but it's exactly what I need.
I tried with a regExp but it has a strange behaviour.
My RegExp
/(drop|update|truncate|delete|;|alter|insert)+./gi
my Validator
export function forbiddenWordsValidator(sqlRe: RegExp): ValidatorFn {
return (control: AbstractControl): { [key: string]: any } | null => {
const forbidden = sqlRe.test(control.value);
return forbidden ? { forbiddenSql: { value: control.value } } : null;
};
}
my formControl:
whereCondition: new FormControl("", [
Validators.required,
forbiddenWordsValidator(this.BAN_SQL_KEYWORDS)...
It works only in certain cases. I don't understand why the same string works one time and then doesn't if I delete a character and retype it, or why the validator sometimes returns OK when I type a whitespace.
There are several issues here:
The global g modifier makes RegExp#test (and similar methods) advance the regex's lastIndex after each successful match, so repeated calls on the same regex object give alternating results. It must be removed.
The . at the end requires one more character (any character other than a line break) after the keyword, so it must be removed too.
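Both issues can be reproduced in a few lines. This is a small sketch with shortened patterns (only drop and ;) and made-up variable names, not the validator itself:

```javascript
// Issue 1: with the g flag, test() remembers where the last match ended.
const globalRe = /drop/gi;
const first = globalRe.test("drop table");  // true, lastIndex is now 4
const second = globalRe.test("drop table"); // false, search started at index 4
const third = globalRe.test("drop table");  // true, lastIndex was reset to 0
console.log(first, second, third); // true false true

// Without g, every call starts from index 0.
const plainRe = /drop/i;
const stable = [1, 2, 3].map(() => plainRe.test("drop table"));
console.log(stable); // [ true, true, true ]

// Issue 2: the trailing . demands one extra character after the keyword.
const trailingDot = /(drop|;)+./i;
const bareWord = trailingDot.test("drop");      // false, nothing follows "drop"
const withSpace = trailingDot.test("drop ");    // true, the space satisfies the .
console.log(bareWord, withSpace); // false true
```

This matches the symptoms in the question: the same input alternates between valid and invalid, and typing a whitespace changes the result.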
Use
/drop|update|truncate|delete|;|alter|insert/i
Or, to match the words as whole words use
/\b(?:drop|update|truncate|delete|alter|insert)\b|;/i
This way, insert in insertion and drop in dropout won't get "caught" (=matched).
See the regex demo.
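A quick way to check the whole-word variant is to run it over a few strings. The sample inputs below are made up for illustration:

```javascript
// Whole-word pattern from above; the ; alternative sits outside the \b...\b group.
const banned = /\b(?:drop|update|truncate|delete|alter|insert)\b|;/i;

const samples = [
  "insertion point",          // contains "insert" only as part of a longer word
  "dropout rate",             // contains "drop" only as part of a longer word
  "id = 1; DROP TABLE users", // contains ; and a whole-word DROP
  "please INSERT here",       // whole-word match, case-insensitive
];

const results = samples.map(text => banned.test(text));
console.log(results); // [ false, false, true, true ]
```

Since the regex has no g flag, calling test repeatedly on the same object is safe here.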
It's not a great idea to give such power to the user.

Does the Perl compiler need to be told not to optimize away function calls with ignored return values?

I am writing a new Perl 5 module, Class::Tiny::ConstrainedAccessor, to check type constraints when you touch object attributes, either by setting or by getting a default value. I am writing the unit tests and want to run the accessors for the latter case. However, I am concerned that Perl may optimize away my accessor-function call since the return value is discarded. Will it? If so, can I tell it not to? Is the corresponding behaviour documented? If the answer is as simple as "don't worry about it," that's good enough, but a reference to the docs would be appreciated :) .
The following MCVE succeeds when I run it on my Perl 5.26.2 x64 Cygwin. However, I don't know if that is guaranteed, or if it just happens to work now and may change someday.
use 5.006; use strict; use warnings; use Test::More; use Test::Exception;
dies_ok { # One I know works
my $obj = Klass->new; # Default value of "attribute" is invalid
diag $obj->accessor; # Dies, because the default is invalid
} 'Bad default dies';
dies_ok {
my $obj = Klass->new;
$obj->accessor; # <<< THE QUESTION --- Will this always run?
} 'Dies even without diag';
done_testing();
{ package Klass;
sub new { my $class = shift; bless {@_}, $class }
sub check { shift; die 'oops' if @_ and $_[0] eq 'bad' }
sub default { 'bad' }
sub accessor {
my $self = shift;
if(@_) { $self->check($_[0]); return $self->{attribute} = $_[0] } # W
elsif(exists $self->{attribute}) { return $self->{attribute} } # R
else {
# Request to read the attribute, but no value is assigned yet.
# Use the default.
$self->check($self->default); # <<<---- What I want to exercise
return $self->{attribute} = $self->default;
}
} #accessor()
} #Klass
This question deals with variables, but not functions. perlperf says that Perl will optimize away various things, but other than ()-prototyped functions, it's not clear to me what.
In JavaScript, I would say void obj.accessor();, and then I would know for sure it would run but the result would be discarded. However, I can't use undef $obj->accessor; for a similar effect; compilation legitimately fails with Can't modify non-lvalue subroutine call of &Klass::accessor.
Perl doesn't ever optimize away sub calls, and sub calls with side effects shouldn't be optimised away in any language.
undef $obj->accessor means something similar to $obj->accessor = undef

perl6 Need help to understand more about proto regex/token/rule

The following code is taken from the Perl 6 documentation, and I am trying to learn more about it before more experimentation:
proto token command {*}
token command:sym<create> { <sym> }
token command:sym<retrieve> { <sym> }
token command:sym<update> { <sym> }
token command:sym<delete> { <sym> }
Is the * in the first line a whatever-star? Can it be something else, such as
proto token command { /give me an apple/ }
Can "sym" be something else, such as
command:eat<apple> { <eat> } ?
{*} tells the runtime to call the correct candidate.
Rather than force you to write {{*}} for the common case of just calling the correct one, the compiler allows you to shorten it to just {*}.
That is the case for all proto routines like sub, method, regex, token, and rule.
In the case of the regex proto routines, only a bare {*} is allowed.
The main reason is probably because no-one has really come up with a good way to make it work sensibly in the regex sub-language.
So here is an example of a proto sub that does some things that are common to all of the candidates.
#! /usr/bin/env perl6
use v6.c;
for @*ARGS { $_ = '--stdin' when '-' }
# find out the number of bytes
proto sub MAIN (|) {
try {
# {*} calls the correct multi
# then we get the number of elems from its result
# and try to say it
say {*}.elems # <-------------
}
# if {*} returns a Failure note the error message to $*ERR
or note $!.message;
}
#| the number of bytes on the clipboard
multi sub MAIN () {
xclip
}
#| the number of bytes in a file
multi sub MAIN ( Str $filename ){
$filename.IO.slurp(:!chomp,:bin)
}
#| the number of bytes from stdin
multi sub MAIN ( Bool :stdin($)! ){
$*IN.slurp-rest(:bin)
}
sub xclip () {
run( «xclip -o», :out )
.out.slurp-rest( :bin, :close );
}
This answers your second question. Yes, it's late.
You have to distinguish two different syms (or eats). The one that's on the definition of the token as an "adverb" (or extended syntax identifier, whatever you want to call it), and the one that's on the token itself.
If you use <eat> in the token body, Perl 6 will simply not find it. You will get an error like
No such method 'eat' for invocant of type 'Foo'
Where Foo would be the name of the grammar. <sym> is a predefined token, which matches the value of the adverb (or pair value) in the token.
You could, in principle, use the extended syntax to define a multi token (or rule, or regex). However, if you try to define it in this way, you will get a different error:
Can only use <sym> token in a proto regex
So, the answer to your second question is no, and no.

Using regular expressions in python to determine C++ functions and their parameters

So I'm doing something wrong in this python script, but it's becoming convoluted and I'm losing sight of what I'm doing wrong.
I want a script to go through a file, find all the function definitions, and then pull out the name, return type, and parameters of the function, and output a "doxygen" style comment like this:
/******************************************************************************/
/*!
\brief
Main function for the file
\return
The exit code for the program
*/
/******************************************************************************/
But I'm doing something wrong with the regular expression in trying to parse the parameters... Here is the script so far:
import re
import sys
f = open(sys.argv[1])
functions = []
for line in f:
match = re.search(r'([\w]+)\s+([\S]+)\(([\w+\s+\w+])+\)',line)
if line.find("\\fn") < 0:
if match:
returntype = match.group(1)
funcname = match.group(2)
print '/********************************************************************'
print " \\fn " + match.group()
print ''
print ' \\brief'
print ' Function description for ' + funcname
print ''
if len(match.groups()) > 2:
params = []
count = len(match.groups()) - 2
while count > 0:
matchingstring = match.group(count + 2)
if matchingstring.find("void") < 0:
params.append(matchingstring)
count -= 1
for parameter in params:
print " \\param " + parameter
print ' Description of ' + parameter
print ''
print ' \\return'
print ' ' + returntype
print '********************************************************************/'
print ''
Any help would be appreciated. Thanks
The grammar of C++ is far too complex to be handled by simple
regular expressions. You'll need at least a minimal parser.
I've found that for restricted cases, where I'm not concerned
with C++ in general, but only my own style, I can often get away
with a flex based tokenizer and a simple state machine. This
will fail in many cases of legal C++—for starters, of
course, if someone uses the pre-processor to modify the syntax;
but also because < can have different meanings, depending on
whether what precedes it names a template or not. But it's often
adequate for a specific job.
I've used a PEG parser with great success when trying to do simple format parsing. pyPeg is a very simple implementation of such a parser written in Python.
Example Python code for C++ function parser:
EDIT: Address template parameters. Tested with input from SK-logic and output is correct.
import pyPEG
from pyPEG import parseLine
import re
def symbol(): return re.compile(r"[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ&*][\w:]+")
def type(): return symbol
def functionName(): return symbol
def templatedType(): return symbol, "<", -1, [templatedType, symbol, ","], ">"
def parameter(): return [templatedType, type], symbol
def template(): return "<", -1, [symbol, template], ">"
def function(): return [type, templatedType], functionName, -1, template, "(", -1, [",", parameter], ")" # -1 -> zero or more repetitions.
sourceCode = "std::string foobar(std::vector<int> &A, std::map<std::string, std::vector<std::string> > &B)"
results = parseLine(sourceCode, function(), [], packrat=True)
When this is executed results is:
([(u'type', [(u'symbol', 'std::string')]), (u'functionName', [(u'symbol', 'foobar')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'int')]), (u'symbol', '&A')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::map'), (u'symbol', 'std::string'), (u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'std::string')])]), (u'symbol', '&B')])], '')
C++ cannot really be parsed by a (sane) regular expression: they are a nightmare as soon as nesting is concerned.
There is another concern too, determining when to parse and when not to. A function may be declared:
at file scope
in a namespace
in a class
And the two last can be nested at arbitrary depths.
I would propose using Clang here. It's a real C++ front-end with a full-featured parser and there are:
a C API, with (notably) an API to the Indexing Library
Python bindings on top of the C API
The C API and Python bindings are far from fully exposing the underlying C++ model, but for a task as simple as listing functions it should be enough.
That said, I would question the usefulness of the project: if the documentation can be generated by a simple parser, then it is redundant with the code. And redundancy is at best useless and at worst dangerous: it introduces the potential threat of desynchronization...
If the function is tricky enough that its use requires documentation, then a developer who knows its limitations and so on has to write this documentation.