Using regular expressions in Python to determine C++ functions and their parameters

So I'm doing something wrong in this python script, but it's becoming convoluted and I'm losing sight of what I'm doing wrong.
I want a script to go through a file, find all the function definitions, and then pull out the name, return type, and parameters of the function, and output a "doxygen" style comment like this:
/******************************************************************************/
/*!
\brief
Main function for the file
\return
The exit code for the program
*/
/******************************************************************************/
But I'm doing something wrong with the regular expression in trying to parse the parameters... Here is the script so far:
import re
import sys
f = open(sys.argv[1])
functions = []
for line in f:
    match = re.search(r'([\w]+)\s+([\S]+)\(([\w+\s+\w+])+\)',line)
    if line.find("\\fn") < 0:
        if match:
            returntype = match.group(1)
            funcname = match.group(2)
            print '/********************************************************************'
            print " \\fn " + match.group()
            print ''
            print ' \\brief'
            print ' Function description for ' + funcname
            print ''
            if len(match.groups()) > 2:
                params = []
                count = len(match.groups()) - 2
                while count > 0:
                    matchingstring = match.group(count + 2)
                    if matchingstring.find("void") < 0:
                        params.append(matchingstring)
                    count -= 1
                for parameter in params:
                    print " \\param " + parameter
                    print ' Description of ' + parameter
                    print ''
            print ' \\return'
            print ' ' + returntype
            print '********************************************************************/'
            print ''
Any help would be appreciated. Thanks

The grammar of C++ is far too complex to be handled by simple
regular expressions. You'll need at least a minimal parser.
I've found that for restricted cases, where I'm not concerned
with C++ in general, but only my own style, I can often get away
with a flex-based tokenizer and a simple state machine. This
will fail in many cases of legal C++: for starters, of
course, if someone uses the preprocessor to modify the syntax;
but also because < can have different meanings, depending on
whether what precedes it names a template or not. But it's often
adequate for a specific job.
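To illustrate the "restricted cases" idea in Python rather than flex, here is a minimal sketch. It assumes one-line signatures with no default arguments, function pointers, or macros, and the names `SIGNATURE`, `split_params`, and `parse_signature` are made up for this example. The regex only locates the three parts of the signature; a small depth counter then splits the parameter list so that commas inside template brackets are not treated as separators.

```python
import re

# Assumption: the whole signature sits on one line and contains no nested parentheses.
SIGNATURE = re.compile(r'^\s*([\w:\s\*&<>,]+?)\s+([\w:~]+)\s*\(([^()]*)\)')


def split_params(raw):
    """Split a parameter list on commas that are not inside <...>."""
    params, current, depth = [], [], 0
    for ch in raw:
        if ch == '<':
            depth += 1
        elif ch == '>':
            depth -= 1
        if ch == ',' and depth == 0:
            params.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    params.append(''.join(current).strip())
    # Drop empty entries and the C-style "(void)" parameter list.
    return [p for p in params if p and p != 'void']


def parse_signature(line):
    """Return (return type, name, parameters) or None if the line does not match."""
    m = SIGNATURE.match(line)
    if m is None:
        return None
    return_type, name, raw_params = m.groups()
    return return_type.strip(), name, split_params(raw_params)


print(parse_signature("int main(int argc, char** argv)"))
# ('int', 'main', ['int argc', 'char** argv'])
```

This will still misfire on ordinary statements such as `return foo(x);`, which is exactly the kind of failure the answer warns about, so it is only suitable for code written in a known, consistent style.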

I've used a PEG parser with great success for simple format parsing. pyPEG is a very simple implementation of such a parser written in Python.
Example Python code for C++ function parser:
EDIT: Address template parameters. Tested with input from SK-logic and output is correct.
import pyPEG
from pyPEG import parseLine
import re
def symbol(): return re.compile(r"[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ&*][\w:]+")
def type(): return symbol
def functionName(): return symbol
def templatedType(): return symbol, "<", -1, [templatedType, symbol, ","], ">"
def parameter(): return [templatedType, type], symbol
def template(): return "<", -1, [symbol, template], ">"
def function(): return [type, templatedType], functionName, -1, template, "(", -1, [",", parameter], ")" # -1 -> zero or more repetitions.
sourceCode = "std::string foobar(std::vector<int> &A, std::map<std::string, std::vector<std::string> > &B)"
results = parseLine(sourceCode, function(), [], packrat=True)
When this is executed results is:
([(u'type', [(u'symbol', 'std::string')]), (u'functionName', [(u'symbol', 'foobar')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'int')]), (u'symbol', '&A')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::map'), (u'symbol', 'std::string'), (u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'std::string')])]), (u'symbol', '&B')])], '')

C++ cannot really be parsed by a (sane) regular expression: they are a nightmare as soon as nesting is concerned.
There is another concern too: determining when to parse and when not to. A function may be declared:
- at file scope
- in a namespace
- in a class
And the last two can be nested to arbitrary depth.
I would propose using Clang here. It's a real C++ front-end with a full-featured parser, and it offers:
- a C API, with (notably) an API to the Indexing Library
- Python bindings on top of the C API
The C API and Python bindings are far from fully exposing the underlying C++ model, but for a task as simple as listing functions it should be enough.
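As a rough sketch of what listing functions through the Python bindings might look like: this assumes the `clang` Python package and a matching libclang shared library are installed, and the helper names `describe_function` and `list_functions` as well as the `-std=c++11` flag are choices made for this example only.

```python
def describe_function(cursor):
    """Return (return type, name, [(param type, param name), ...]) for a function cursor."""
    params = [(arg.type.spelling, arg.spelling) for arg in cursor.get_arguments()]
    return cursor.result_type.spelling, cursor.spelling, params


def list_functions(path, clang_args=("-std=c++11",)):
    """Yield a description of every function or method declared in the file itself."""
    # Imported lazily so this module still loads where libclang is not installed.
    from clang.cindex import CursorKind, Index

    wanted = (CursorKind.FUNCTION_DECL, CursorKind.CXX_METHOD)
    unit = Index.create().parse(path, args=list(clang_args))
    for cursor in unit.cursor.walk_preorder():
        source_file = cursor.location.file
        # Skip declarations that were pulled in from included headers.
        if cursor.kind in wanted and source_file is not None and source_file.name == path:
            yield describe_function(cursor)
```

Because the traversal walks the whole translation unit, functions nested in namespaces and classes are found without any extra bookkeeping.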
That said, I would question the usefulness of the project: if the documentation can be generated by a simple parser, then it is redundant with the code. And redundancy is at best useless, and at worst dangerous: it introduces the potential threat of desynchronization...
If the function is tricky enough that its use requires documentation, then a developer who knows its limitations and the like has to write this documentation.

Related

Regex for finding the name of a method containing a string

I've got a Node module file containing about 100 exported methods, which looks something like this:
exports.methodOne = async user_id => {
    // other method contents
};
exports.methodTwo = async user_id => {
    // other method contents
    fooMethod();
};
exports.methodThree = async user_id => {
    // other method contents
    fooMethod();
};
Goal: What I'd like to do is figure out how to grab the name of any method which contains a call to fooMethod, and return the correct method names: methodTwo and methodThree. I wrote a regex which gets kinda close:
exports\.(\w+).*(\n.*?){1,}fooMethod
Problem: using my example code from above, though, it would effectively match methodOne and methodThree because it finds the first instance of export and then the first instance of fooMethod and goes on from there. Here's a regex101 example.
I suspect I could make use of lookaheads or lookbehinds, but I have little experience with those parts of regex, so any guidance would be much appreciated!
Edit: Turns out regex is poorly suited for this type of task. @ctcherry advised using a parser, and using that as a springboard, I was able to learn about Abstract Syntax Trees (ASTs) and the recast tool, which lets you traverse the tree after using various tools (acorn and others) to parse your code into tree form.
With these tools in hand, I successfully built a script to parse and traverse my node app's files, and was able to find all methods containing fooMethod as intended.
Regex isn't the best tool to tackle all the parts of this problem. Ideally we could rely on something higher level: a parser.
One way to do this is to let the JavaScript parse itself during load and execution. If your Node module doesn't include anything that would execute on its own (or at least anything that would conflict with the below), you can put this at the bottom of your module, and then run the module with node mod.js.
console.log(Object.keys(exports).filter(fn => exports[fn].toString().includes("fooMethod(")));
(In the comments below it is revealed that the above isn't possible.)
Another option would be to use a library like https://github.com/acornjs/acorn (there are other options) to write some other JavaScript that parses your original target JavaScript. You would then have a tree structure you could use to perform your matching and eventually return the function names you are after. I'm not an expert in that library, so unfortunately I don't have sample code for you.
This regex matches (only) the method names that contain a call to fooMethod();
(?<=exports\.)\w+(?=[^{]+\{[^}]+fooMethod\(\)[^}]+};)
See live demo.
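The pattern can also be checked outside the browser. A small sketch in Python, whose `re` module accepts the same lookbehind and lookahead syntax; the `SOURCE` string simply repeats the module from the question, and `methods_calling_foo` is a name invented for this example.

```python
import re

# Pattern copied unchanged from the answer above.
PATTERN = re.compile(r"(?<=exports\.)\w+(?=[^{]+\{[^}]+fooMethod\(\)[^}]+};)")

SOURCE = """\
exports.methodOne = async user_id => {
  // other method contents
};
exports.methodTwo = async user_id => {
  // other method contents
  fooMethod();
};
exports.methodThree = async user_id => {
  // other method contents
  fooMethod();
};
"""


def methods_calling_foo(source):
    """Return the exported method names whose body contains a fooMethod() call."""
    return PATTERN.findall(source)


print(methods_calling_foo(SOURCE))  # ['methodTwo', 'methodThree']
```

Note that `[^}]+` cannot cross a closing brace, so a method whose body contains a nested block (an `if` or an object literal) before the `fooMethod()` call will be missed.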
Assuming that all methods have their body enclosed within { and }, I would make an approach to get to the final regex like this:
First, find a regex to get the individual methods. This can be done using this regex:
exports\.(\w+)(\s|.)*?\{(\s|.)*?\}
Next, we are interested in those methods that have fooMethod in them before they close. So, look for } or fooMethod.*}, in that order. So, let us name the group searching for fooMethod as FOO and the name of the method calling it as METH. When we iterate the matches, if group FOO is present in a match, we will use the corresponding METH group, else we will reject it.
exports\.(?<METH>\w+)(\s|.)*?\{(\s|.)*?(\}|(?<FOO>fooMethod)(\s|.)*?\})
Explanation:
exports\.(?<METH>\w+): Till the method name (you have already covered this)
(\s|.)*?\{(\s|.)*?: Some code before { and after, non-greedy so that the subsequent group is given preference
(\}|(?<FOO>fooMethod)(\s|.)*?\}): This has 2 parts:
\}: Match the method close delimiter, OR
(?<FOO>fooMethod)(\s|.)*?\}): The call to fooMethod followed by optional code and method close delimiter.
Here's JavaScript code that demonstrates this:
let p = /exports\.(?<METH>\w+)(\s|.)*?\{(\s|.)*?(\}|(?<FOO>fooMethod)(\s|.)*?\})/g;
let input = `exports.methodOne = async user_id => {
    // other method contents
};
exports.methodTwo = async user_id => {
    // other method contents
    fooMethod();
};
exports.methodThree = async user_id => {
    // other method contents
    fooMethod();
};`;
let match = p.exec(input);
while (match !== null) {
    // Only report methods whose match included the FOO group.
    if (match.groups.FOO !== undefined) console.log(match.groups.METH);
    match = p.exec(input);
}

How to cache and use the cached regexes in perl6 grammar?

My code spends a lot of time on regex interpolation. As the patterns rarely change, I guess caching these generated regexes should speed up the code. But I cannot figure out the right way to cache and use the cached regexes.
The code is used to parse arithmetic expressions. As users are allowed to define new operators, the parser must be ready to add new operators to the grammar. So the parser uses a table to record these new operators and generates regexes from the table on the fly.
#! /usr/bin/env perl6
use v6.c;
# the parser may add new operators to this table on the fly.
my %operator-table = %(
    1 => $['"+"', '"-"'],
    2 => $['"*"', '"/"'],
    # ...
);
# original code, runnable but slow.
grammar Operator {
    token operator(Int $level) {
        <{%operator-table{$level}.join('|')}>
    }
    # ...
}
# usage:
say Operator.parse(
    '+',
    rule => 'operator',
    args => \(1)
);
# output:
# 「+」
Here are some experiments:
# try to cache the generated regexes but not work.
grammar CachedOperator {
    my %cache-table = %();
    method operator(Int $level) {
        if (! %cache-table{$level}) {
            %cache-table.append(
                $level => rx { <{%operator-table{$level}.join('|')}> }
            )
        }
        %cache-table{$level}
    }
}
# test:
say CachedOperator.parse(
    '+',
    rule => 'operator',
    args => \(1)
);
# output:
# Nil
# one more try
grammar CachedOperator_ {
    my %cache-table = %();
    token operator(Int $level) {
        <create-operator($level)>
    }
    method create-operator(Int $level) {
        if (! %cache-table{$level}) {
            %cache-table.append(
                $level => rx { <{%operator-table{$level}.join('|')}> }
            )
        }
        %cache-table{$level}
    }
}
# test:
say CachedOperator_.parse(
    '+',
    rule => 'operator',
    args => \(1)
);
# compile error:
# P6opaque: no such attribute '$!pos' on type Match in a Regex when trying to get a value
The following doesn't directly answer your question but may be of interest.
User defined operators
The following code declares an operator in P6:
sub prefix:<op> ($operand) { " $operand prefixed by op" }
Now one can use the new operator:
say op 42; # 42 prefixed by op
A wide range of operator positions and arities are covered, including choice of associativity and precedence, parentheses for grouping, etc. So maybe this is an appropriate way to implement what you're implementing.
Although it's slow, it might be fast enough. Additionally, as Larry said in 2017 ...
we know some some places in the parser that are slower than they should be, for instance ... various lexers relook at various characters in your Perl 6 program, it averages 5 or 6 times on every character, which is obviously deeply sub-optimal, and we know how to fix it
... and with luck Jonathan will work on the P6 grammar parser this year.
DSLs and Slangs
Even if you aren't interested in using the main language's ability to declare user defined operators, or can't for some reason, the underlying mechanisms that make it work might be of interest/use. Here are some references:
Brian Duggan's Informal DSLs presentation (video, slides).
Mouq's 2014 gist Slangs.
Larry Wall's speculation from way back when in Switching parsers and Slangs.

How to pass multiple arguments to custom written Robot Framework Keyword?

Custom Keyword written in python 2.7:
@keyword("Update ${filename} with ${properties}")
def set_multiple_test_properties(self, filename, properties):
    for each in values.split(","):
        each = each.replace(" ", "")
        key, value = each.split("=")
        self.set_test_properties(filename, key, value)
When we send the parameters on a single line as shown below, it works as expected:
Update sample.txt with "test.update=11,timeout=20,delay.seconds=10,maxUntouchedTime=10"
But when we split the above line across several lines (for better readability), it stops working.
Update sample.txt with "test.update = 11,
timeout=20,
delay.seconds=10,
maxUntouchedTime=10"
Any clue on this please?
I am not very sure whether it will work or not, but please try like this
Update sample.txt with "test.update = 11,
... timeout=20,
... delay.seconds=10,
... maxUntouchedTime=10"
Your approach doesn't work because the 2nd line is treated as a call to a keyword (named "timeout=20,"), the 3rd as another one, and so on. The three dots don't help either, because they act as cell separators, i.e. delimiters between arguments.
If you are going for readability, you can use the Catenate keyword (it's in the BuiltIn library):
${props}=    Catenate    SEPARATOR=${SPACE}
...    test.update = 11,
...    timeout=20,
...    delay.seconds=10,
...    maxUntouchedTime=10
, and then call your keyword with that variable:
Update sample.txt with "${props}"
By the way: a) I think your keyword declaration in the decorator has no double quotes, so when it is called as above the quotes are treated as part of the argument's value; b) there seems to be an error in the Python method: the argument's name is "properties" while the loop iterates over "values"; and c) you might want to consider using named varargs (**kwargs in Python, &{kwargs} in RF syntax) for this purpose (sorry, off-topic, but couldn't resist :)
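To make point b) concrete, here is a sketch of the parsing step as a plain Python function, independent of Robot Framework. The name `parse_properties` is hypothetical; the original keyword would call its own set_test_properties for each resulting pair.

```python
def parse_properties(properties):
    """Turn 'a=1, b = 2' into {'a': '1', 'b': '2'}."""
    result = {}
    # Iterate over the argument that was actually passed in ("properties", not "values").
    for item in properties.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma or an empty segment
        key, value = item.split("=", 1)
        result[key.strip()] = value.strip()
    return result


print(parse_properties("test.update = 11, timeout=20, delay.seconds=10"))
# {'test.update': '11', 'timeout': '20', 'delay.seconds': '10'}
```

Stripping whitespace around each key and value means the string produced by Catenate, with its extra spaces, parses the same way as the single-line form.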

Parsing non-mutually-exclusive groups of command-line arguments

I am trying to find a way of parsing sequences of related arguments, preferably using argparse.
For example:
command --global-arg --subgroup1 --arg1 --arg2 --subgroup2 --arg1 --arg3 --subgroup3 --arg4 --subcommand1 --arg1 --arg3
where --global-arg applies to the whole command, but each --subgroupN argument has sub-arguments that apply only to it (and may have the same name, such as --arg1 and --arg3 above), and where some sub-arguments are optional, so the number of sub-arguments is not constant. However, I know that each --subgroupN sub-argument set is complete either by the presence of another --subgroupN or the end of the argument list (I am not fussed if global arguments cannot appear at the end, although I imagine that is possible as long as they don't clash with sub-argument names).
The --subgroupN elements are essentially sub-commands, but I do not appear to be able to use the sub-parser ability of argparse as it slurps any following --subgroupN entries as well (and therefore barfs with unexpected arguments).
(An example of this style of argument list is used by xmlstarlet)
Are there any suggestions beyond writing my own parser? I assume I can at least leverage something out of argparse if that is the only option...
Examples
The examples below were an attempt to find a way to parse an argument structure along the following lines:
(a --name <name>|b --name <name>)+
in the first example I hoped to have --a and --b introduce a set of arguments that were processed by a subparser.
I was hoping to get something out perhaps along the lines of
Namespace(a=Namespace(name="dummya"), b=Namespace(name="dummyb"))
subparser example fails
parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()
parser_a = subparsers.add_parser("a")
parser_b = subparsers.add_parser("b")
parser_a.add_argument("--name")
parser_b.add_argument("--name")
parser.parse_args(["a", "--name", "dummy"])
> Namespace(name='dummy') (Good)
parser.parse_args(["b", "--name", "dummyb", "a", "--name", "dummya"])
> error: unrecognized arguments: a (BAD)
mutually exclusive group fails
parser = argparse.ArgumentParser()
g = parser.add_mutually_exclusive_group()
g1 = g.add_mutually_exclusive_group()
g1.add_argument("--name")
g2 = g.add_mutually_exclusive_group()
g2.add_argument("--name")
> ArgumentError: argument --name: conflicting option string(s): --name (BAD)
(I wasn't really expecting this to work, it was an attempt to see if I could have repetition of grouped arguments.)
Other than the subparser mechanism, argparse is not designed to handle groups of arguments. Other than the nargs grouping, it handles the arguments in the order that they appear in the argv list.
As I mentioned in the comments, there have been earlier questions, which can probably be found by searching with words like multiple. But one way or another they all try to work around the basic order-independent design of argparse.
https://stackoverflow.com/search?q=user%3A901925+[argparse]+multiple
I think the most straightforward solution is to process the sys.argv list beforehand, breaking it into groups, and then passing those sublists to one or more parsers.
parse [command --global-arg],
parse [--subgroup1 --arg1 --arg2],
parse [--subgroup2 --arg1 --arg3],
parse [--subgroup3 --arg4],
parse [--subcommand1 --arg1 --arg3]
In fact the only alternative is to use that subparser 'slurp everything else' behavior to get a remainder of arguments that can be parsed again. Use parse_known_args to return a list of unknown arguments (parse_args raises an error if that list is not empty).
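The pre-splitting idea can be sketched as follows. This is only an illustration: the marker flags `--a` and `--b`, the `--test` global option, and the helper names `split_groups`, `build_parsers`, and `parse_grouped` are assumptions made for the example.

```python
import argparse


def split_groups(argv, markers):
    """Split argv into (marker, [args]) pairs; leading global args get marker None."""
    groups = [(None, [])]
    for token in argv:
        if token in markers:
            groups.append((token, []))
        else:
            groups[-1][1].append(token)
    return groups


def build_parsers():
    """One small parser per group marker (hypothetical options)."""
    parser_a = argparse.ArgumentParser(prog="--a")
    parser_a.add_argument("--name")
    parser_b = argparse.ArgumentParser(prog="--b")
    parser_b.add_argument("--name")
    return {"--a": parser_a, "--b": parser_b}


def parse_grouped(argv):
    """Parse the global options first, then each group with its own parser, in order."""
    parsers = build_parsers()
    global_parser = argparse.ArgumentParser()
    global_parser.add_argument("--test")
    groups = split_groups(argv, parsers)
    parsed = [(None, global_parser.parse_args(groups[0][1]))]
    for marker, sub_args in groups[1:]:
        parsed.append((marker, parsers[marker].parse_args(sub_args)))
    return parsed
```

Returning a list of (marker, namespace) pairs keeps the original order of the groups and allows the same marker to appear more than once.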
Using hpaulj's reply above, I came up with the following:
import argparse

args = [
    "--a", "--name", "dummya",
    "--b", "--name", "dummyb",
    "--a", "--name", "another_a", "--opt"
]
parser_globals = argparse.ArgumentParser()
parser_globals.add_argument("--test")
parser_a = argparse.ArgumentParser()
parser_a.add_argument("--name")
parser_a.add_argument("--opt", action="store_true")
parser_b = argparse.ArgumentParser()
parser_b.add_argument("--name")
command_parsers = {
    "--a": parser_a,
    "--b": parser_b
}
# Global options are parsed first; everything unrecognized is left in rest.
the_namespace, rest = parser_globals.parse_known_args(args)
subcommand_dict = vars(the_namespace)
subcommand = []
# Walk the remaining arguments from the end, so each group is complete
# by the time its introducing flag (--a or --b) is reached.
val = rest.pop() if rest else None
while val:
    if val in command_parsers:
        the_args = command_parsers[val].parse_args(subcommand)
        if val in subcommand_dict:
            # Promote a single result to a list on the second occurrence.
            if not isinstance(subcommand_dict[val], list):
                subcommand_dict[val] = [subcommand_dict[val]]
            subcommand_dict[val].append(the_args)
        else:
            subcommand_dict[val] = the_args
        subcommand = []
    else:
        subcommand.insert(0, val)
    val = rest.pop() if rest else None
I end up with:
Namespace(
--a=[
Namespace(
name='another_a',
opt=True
),
Namespace(
name='dummya',
opt=False
)
],
--b=Namespace(
name='dummyb'
),
test=None
)
which seems to serve my purposes.

I am new to Ruby and I need to understand 3 functions

I have been given the 3 functions below. Can anybody please help me to understand these? I am trying to port an application to C++ using Qt, but I don't understand these functions. So please help me!
Thanks in advance.
function 1:
def read_key
  puts "read pemkey: \"#{@pkey}\"" if @verbose
  File.open(@pkey, 'rb') do |io|
    @key = OpenSSL::PKey::RSA.new(io)
  end
end
function 2:
def generate_key
  puts "generate pemkey to \"#{@pkey_o}\"" if @verbose
  @key = OpenSSL::PKey::RSA.generate(KEY_SIZE)
  # save key
  File.open(@pkey_o, 'wb') do |file|
    file << @key.export()
  end
end
function 3:
def sign_zip
  puts "sign zip" if @verbose
  plain = nil
  File.open(@zip, 'rb') do |file|
    plain = file.read
  end
  @sig = @key.sign(OpenSSL::Digest::SHA1.new, plain)
end
There are probably two things about the above code that are confusing you, which if clarified, will help understand it.
First, @verbose and @key are instance variables, what a C++ programmer might call "member variables." The "if @verbose" following the puts statement literally means only do the puts if @verbose is true. @verbose never needs to be declared a bool; you just start using it. If it's never initialized, it's "nil", which evaluates to false.
Second, the do/end parts are code blocks. Many Ruby methods take a code block and execute it with a variable declared in those pipe characters. An example would be "array.each do |s| puts s; end" which might look like "for(int i = 0; i < array.size(); ++i) { s = array[i]; puts(s); }" in C++. For File.open, |io| is the file instance opened, and "read" is one of its methods.
These are all methods. #{@pkey_o} is string interpolation, substituting in the contents of an instance variable (called pkey_o; Ruby instance variables begin with @ and class variables, unused here, begin with @@).
File.open(@pkey, 'rb') do |io|
  @key = OpenSSL::PKey::RSA.new(io)
end
That opens the file whose name is stored in @pkey, stores the file handle in io (a block-local variable) and uses that with OpenSSL::PKey::RSA.new, whose result is stored in @key. Finally, it closes the file handle when the block is finished (at the end), whether the block exits successfully or with an error (in which case the exception still propagates after the handle is closed). When translating this to C++, use of the RAII pattern is entirely reasonable (if you were going to Java, I'd say to use try/finally).