Regex for extracting functions from C++ code

I have sample C++ code (http://pastebin.com/6q7zs7tc) from which I have to extract function names as well as the number of parameters that each function requires. So far I have written this regex, but it's not working perfectly for me.
(?![a-z])[^\:,>,\.]([a-z,A-Z]+[_]*[a-z,A-Z]*)+[(]

You can't parse C++ reliably with regex.
In fact, you can't parse it with weak parsing technology (see Why can't C++ be parsed with a LR(1) parser?). If you expect to extract this information reliably from source files, you will need a time-tested C++ parser; see https://stackoverflow.com/a/28825789/120163
If you don't care that your extraction process is flaky, then you can use a regex and maybe some additional hackery. Your key problem for heuristic extraction is matching various kinds of brackets, e.g., [...], < ... > (which won't quite work for shift operators) and { ... }. Bracket matching requires you to keep a stack of seen brackets. And bracket matching may fail in the presence of macros and preprocessor conditionals.
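If flaky-but-useful is acceptable, here is a minimal self-contained sketch of that hackery (illustrative code of my own, not derived from the pastebin sample): a regex proposes identifier-followed-by-( candidates, then a depth counter stands in for the bracket stack to find the matching ) and count top-level commas. It will cheerfully misreport if (...), while (...) and ordinary function calls as "functions", which is exactly the flakiness described above.

#include <cctype>
#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::string src =
        "int add(int a, int b) { return a + b; }\n"
        "void hello() {}\n";

    // Candidate "functions": an identifier immediately followed by '('.
    std::regex fn(R"(([A-Za-z_]\w*)\s*\()");
    for (auto it = std::sregex_iterator(src.begin(), src.end(), fn);
         it != std::sregex_iterator(); ++it) {
        std::size_t pos = it->position(0) + it->length(0); // just past '('
        int depth = 1, commas = 0;
        bool sawText = false;
        while (pos < src.size() && depth > 0) {  // bracket matching by counting
            char c = src[pos++];
            if (c == '(') ++depth;
            else if (c == ')') --depth;
            else if (c == ',' && depth == 1) ++commas;
            else if (!std::isspace(static_cast<unsigned char>(c))) sawText = true;
        }
        std::cout << (*it)[1] << " takes "
                  << (sawText ? commas + 1 : 0) << " parameter(s)\n";
    }
}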

Related

Regex ignore redundant braces

I am building a lex program that will analyze something like the following...
function myFunc {
    if a = b {
        print "Cool"
    }
}
Is it possible, specifically using flex, to create a regex that will single out everything in the first { }
so I will get
{ if a = b { print "Cool" } }
instead of
{ if a = b { print "Cool" }
Currently in my flex file I have this regex
{[^\0]*}
One problem with what you are trying to do is that regex matching is greedy by default (you could do some tricks to change that, but you'll still have problems), and you will match more than intended if you run this on a file with multiple functions in it.

The deeper reason lies in the Chomsky hierarchy. Regular expressions are Type 3 (regular) grammars, while the nested structure of most programming languages requires at least a Type 2 (context-free) grammar. It is fundamentally impossible to directly parse the latter using the former without a LOT of work. The full explanation for that is ... long. But it boils down to this: a regular grammar cannot count, while a context-free grammar can track arbitrarily deep nesting. In your case, you don't want to match any ole' }, you want to match the } corresponding to an open {, which involves counting the number of { and } you have seen so far.
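Once you step outside pure regex, that counting is only a few lines. A minimal sketch (my own illustration in C++, not flex code):

#include <iostream>
#include <string>

// Return the first balanced { ... } block; the depth counter plays the
// role of the bracket stack, since only one bracket kind is involved.
std::string firstBraceBlock(const std::string& src) {
    std::size_t start = src.find('{');
    if (start == std::string::npos) return "";
    int depth = 0;
    for (std::size_t i = start; i < src.size(); ++i) {
        if (src[i] == '{') ++depth;
        else if (src[i] == '}' && --depth == 0)
            return src.substr(start, i - start + 1);
    }
    return ""; // unbalanced input
}

int main() {
    std::cout << firstBraceBlock(
        "function myFunc { if a = b { print \"Cool\" } }") << '\n';
    // prints: { if a = b { print "Cool" } }
}

In a flex file you can keep the same depth counter in your rule actions, incrementing on { and decrementing on }.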
If you really want to do code parsing without having to re-invent the wheel, the plow, fire, steel, and everything else up to electricity, I would recommend that you go check out ANTLR on GitHub. ANTLR lets you create a grammar (if one does not already exist) for the language you are trying to parse, and hands you the parsed source code in the form of a parse tree. Parse trees are very, very easy to use, and ANTLR already has grammars for almost every language imaginable, plus plugins for several languages.
Other than that: both the online regex tester I used and Notepad++ matched everything in your sample code with that pattern. You could try the regex {.*}, which also matches everything.

How to test if a string is a valid C++(ish) expression?

I am writing a program in C++ that needs to be able to test if a string (probably std::string) is a valid C++ expression. Variables can be checked to see whether they have been declared (bool variableDeclared(std::string identifier)), and their type can also be checked (std::string variableType(std::string identifier)). The variableType function returns a string based on how the variable would be declared in C++ ("bool", "double", "char", etc.).
The expression doesn't need to be evaluated, only tested to see if it is valid. The function only needs to support character literals, string literals, number literals, brackets, simple operators (+, -, *, /, ! (logical not), &&, ||, >, <, ==), and variables of type double, std::string (no function calls needed), bool, and char. It is also not required to support string concatenation.
The desired result would be a function that is something like bool validExpression(std::string expression). It is also preferable that it allows me to modify the operations (for example I could change "==" to "equal-to").
How would I implement this? Is there a library that could do something like this, a regex statement or is it simply a matter of a long function with lots of if statements?
Formally, your situation is: you have a grammar which describes the language of expressions which you want to validate, and a word for which you want to determine whether it belongs in that language. This is a job for a parser of that language.
You could hand-cook something like a recursive-descent LL(1) parser, or use a tool to generate a parser. A well-known example of such a tool is Bison for generating LALR(1) parsers. Wikipedia has a long parser generator list.
Technical terms are used above mainly to provide entry points for googling.
You would start from defining your language more or less formally. (A language is a set of strings). A good way to define a language is to specify its context-free grammar. Describe additional conditions (like the requirement that variables must be declared, and of the right type) informally in prose.
The next step would be building a parser for your grammar specified at the previous step. There are several tools for building parsers from grammars automatically, from yacc/bison to boost::spirit.
After building and checking the parser, implement the informally-specified rules and plug them into your parser code/data.
Normally the next step, building an evaluator, would probably be the easiest part of writing a simple interpreter, but you say you don't need one.
Describing your language as "just like C++ only with certain bits taken out" could be a preliminary step to the sequence outlined above. It is however not recommended to start out from C++ if you can help it. C++ is an extremely hard language to specify formally, and its parsers tend to be rather hairy, due to its convoluted declaration syntax.
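To make the recursive-descent option above concrete, here is a hedged sketch of bool validExpression(std::string) written that way. It is illustrative only: primaries are bare identifiers or number literals, and string/char literals plus the question's variableDeclared()/variableType() checks are left as hooks. Because each operator is matched via a token string, changing "==" to "equal-to" is a one-line edit, which covers the customization requirement.

#include <cctype>
#include <string>

// Illustrative recursive-descent validator for the operators listed in
// the question; not a library, and deliberately incomplete.
struct Parser {
    const std::string& s;
    std::size_t i = 0;
    explicit Parser(const std::string& str) : s(str) {}

    void skip() { while (i < s.size() && std::isspace((unsigned char)s[i])) ++i; }
    bool eat(const std::string& tok) {  // rename operators by changing tok
        skip();
        if (s.compare(i, tok.size(), tok) == 0) { i += tok.size(); return true; }
        return false;
    }
    bool primary() {
        if (eat("(")) return expr() && eat(")");
        skip();
        std::size_t start = i;          // identifier or number literal
        while (i < s.size() && (std::isalnum((unsigned char)s[i]) || s[i] == '_')) ++i;
        return i > start;               // hook variableDeclared()/type checks here
    }
    bool unary() { return eat("!") ? unary() : primary(); }
    bool mul()   { if (!unary()) return false; while (eat("*") || eat("/")) { if (!unary()) return false; } return true; }
    bool sum()   { if (!mul())   return false; while (eat("+") || eat("-")) { if (!mul())   return false; } return true; }
    bool cmp()   { if (!sum())   return false; if (eat("==") || eat("<") || eat(">")) return sum(); return true; }
    bool logicalAnd() { if (!cmp()) return false; while (eat("&&")) { if (!cmp()) return false; } return true; }
    bool expr()  { if (!logicalAnd()) return false; while (eat("||")) { if (!logicalAnd()) return false; } return true; }
};

bool validExpression(std::string expression) {
    Parser p(expression);
    if (!p.expr()) return false;
    p.skip();
    return p.i == expression.size();    // reject trailing garbage
}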
You can run the compiler as a sub-process of your application. All you have to do is pass the arguments and parse the response properly.
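A rough sketch of that idea, assuming g++ on the PATH and a POSIX-style shell (the parameter list is a stand-in for whatever variables your variableDeclared/variableType environment would report):

#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <string>

// Crude but effective: wrap the expression in a throwaway translation
// unit and let the real compiler judge it. -fsyntax-only parses and
// type-checks without generating code. Unbalanced parentheses in the
// input can of course break out of the wrapper.
bool validExpression(const std::string& expr) {
    std::ofstream("probe.cpp")
        << "#include <string>\n"
           "void probe(double a, double b, bool c, std::string s) {\n"
           "    (void)(" + expr + ");\n"
           "}\n";
    int rc = std::system("g++ -fsyntax-only probe.cpp 2>/dev/null");
    std::remove("probe.cpp");
    return rc == 0;
}

This is slow (a full compiler launch per check) and easy to fool, but it delegates all the grammar work to a battle-tested parser.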

How to parse DSL input to high performance expression template

(EDITED both title and main text and created a spin-off question that arose)
For our application it would be ideal to parse a simple DSL of logical expressions. However, the way I'd like to do this is to parse (at runtime) the input text containing the expressions into some lazily evaluated structure (an expression template) which can then later be used within more performance-sensitive code.
Ideally the evaluation should be as fast as possible using this technique, as it will be run a large number of times with different values substituted into the placeholders each time. I'm not expecting the expression template to be equivalent in performance to, say, a hardcoded function that models the same expression as the given input text string; i.e., there is no need to go down the route of actually compiling, say, C++ in situ in a running program (I believe other questions cover dynamic library compiling/loading).
My own thoughts reading examples from boost is that I can use boost::spirit to do the parsing of the input text and I'm confident I can develop the grammar I need. However, I'm not sure how I can combine the parser with boost::proto to build an executable expression template. Most examples of spirit that I've seen are merely interpreters or end up building some kind of syntax tree but go no further. Most examples of proto that I've seen assume the DSL is embedded in the host source code and does not need to be initially interpreted from a string. I'm aware that boost::spirit is actually implemented with boost::proto but not sure if this is relevant to the problem or whether that fact will suggest a convenient solution.
To reiterate, I need to be able to make something like the following real:
const std::string input_text("a && b || c");
// const std::string input_text(get_dsl_string_from_file("expression1.dsl"));
Expression expr(input_text);
while (keep_intensively_processing) {
    ...
    Context context(…);
    // e.g. context.a = false; context.b = false; context.c = true;
    bool result(evaluate(expr, context));
    ...
}
I would really appreciate a minimal example or even just a small kernel that I can build upon that creates an expression from input text which is evaluated later in context.
I don't think this is exactly the same question as posted here: parsing boolean expressions with boost spirit
as I'm not convinced this is necessarily the quickest executing way of doing this, even though it looks very clever. In time I'll try to do a benchmark of all answers posted.
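In the meantime, here is a minimal hand-rolled kernel in the spirit of the pseudocode above, using a plain AST rather than boost::spirit/boost::proto (so evaluation pays one virtual call per node instead of being a fused expression template, and only identifiers, && and || are handled, with && binding tighter). The names Expression, Context and evaluate mirror the question's pseudocode and are otherwise my own invention.

#include <cctype>
#include <map>
#include <memory>
#include <string>

using Context = std::map<std::string, bool>;

// The lazily evaluated structure: built once from text, evaluated many
// times against different contexts.
struct Node {
    virtual ~Node() = default;
    virtual bool eval(const Context&) const = 0;
};
using Expression = std::shared_ptr<Node>;

struct Var : Node {
    std::string name;
    explicit Var(std::string n) : name(std::move(n)) {}
    bool eval(const Context& c) const override { return c.at(name); }
};

struct BinOp : Node {
    bool isAnd;                       // true for &&, false for ||
    Expression lhs, rhs;
    BinOp(bool a, Expression l, Expression r) : isAnd(a), lhs(l), rhs(r) {}
    bool eval(const Context& c) const override {
        return isAnd ? lhs->eval(c) && rhs->eval(c)
                     : lhs->eval(c) || rhs->eval(c);
    }
};

// Grammar: parse := term ('||' term)* ; term := ident ('&&' ident)*
struct DslParser {
    const std::string& s;
    std::size_t i = 0;
    explicit DslParser(const std::string& str) : s(str) {}
    void ws() { while (i < s.size() && std::isspace((unsigned char)s[i])) ++i; }
    bool eat(const std::string& tok) {
        ws();
        if (s.compare(i, tok.size(), tok) == 0) { i += tok.size(); return true; }
        return false;
    }
    Expression ident() {
        ws();
        std::size_t b = i;
        while (i < s.size() && std::isalnum((unsigned char)s[i])) ++i;
        return std::make_shared<Var>(s.substr(b, i - b));
    }
    Expression term() {
        Expression e = ident();
        while (eat("&&")) e = std::make_shared<BinOp>(true, e, ident());
        return e;
    }
    Expression parse() {
        Expression e = term();
        while (eat("||")) e = std::make_shared<BinOp>(false, e, term());
        return e;
    }
};

bool evaluate(const Expression& e, const Context& c) { return e->eval(c); }

int main() {
    const std::string input_text("a && b || c");
    Expression expr = DslParser(input_text).parse();          // parse once
    Context context{{"a", false}, {"b", false}, {"c", true}};
    return evaluate(expr, context) ? 0 : 1;                   // evaluate many times
}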

Regular Expression Vs. String Parsing

At the risk of opening a can of worms and getting negative votes, I find myself needing to ask:
When should I use Regular Expressions and when is it more appropriate to use String Parsing?
And I'm going to need examples and reasoning as to your stance. I'd like you to address things like readability, maintainability, scaling, and probably most of all performance in your answer.
I found another question here that had only one answer that even bothered to give an example. I need more to understand this.
I'm currently playing around in C++, but regular expressions are in almost every higher-level language, and I'd also like to know how different languages use/handle regular expressions, though that's more of an afterthought.
Thanks for the help in understanding it!
Edit: I'm still looking for more examples and talk on this but the response so far has been great. :)
It depends on how complex the language you're dealing with is.
Splitting
This is great when it works, but only works when there are no escaping conventions.
It does not work for CSV, for example, because commas inside quoted strings are not proper split points.
foo,bar,baz
can be split, but
foo,"bar,baz"
cannot.
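A short demonstration of the failure (C++ chosen just for illustration):

#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::istringstream in("foo,\"bar,baz\"");
    std::string field;
    while (std::getline(in, field, ','))   // naive split on every comma
        std::cout << '[' << field << "]\n";
    // prints [foo] ["bar] [baz"] -- the quoted field is torn in two
}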
Regular
Regular expressions are great for simple languages that have a "regular grammar". Perl 5 regular expressions are a little more powerful due to back-references, but the general rule of thumb is this:
If you need to match brackets ((...), [...]) or other nesting like HTML tags, then regular expressions by themselves are not sufficient.
You can use regular expressions to break a string into a known number of chunks -- for example, pulling out the month/day/year from a date. They are the wrong tool for parsing complicated arithmetic expressions, though.
Obviously, if you write a regular expression, walk away for a cup of coffee, come back, and can't easily understand what you just wrote, then you should look for a clearer way to express what you're doing. Email addresses are probably at the limit of what one can correctly & readably handle using regular expressions.
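For instance, the date case above, where the number of chunks is known in advance (a C++ sketch; the M/D/Y layout is an assumption):

#include <iostream>
#include <regex>
#include <string>

int main() {
    const std::string date = "12/25/2024";   // assumed M/D/Y layout
    std::smatch m;
    if (std::regex_match(date, m, std::regex(R"((\d{1,2})/(\d{1,2})/(\d{4}))")))
        std::cout << "month=" << m[1] << " day=" << m[2]
                  << " year=" << m[3] << '\n';
}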
Context free
Parser generators and hand-coded pushdown/PEG parsers are great for dealing with more complicated input where you need to handle nesting so you can build a tree or deal with operator precedence or associativity.
Context free parsers often use regular expressions to first break the input into chunks (spaces, identifiers, punctuation, quoted strings) and then use a grammar to turn that stream of chunks into a tree form.
The rule of thumb for CF grammars is:
If regular expressions are insufficient, but all words in the language have the same meaning regardless of prior declarations, then CF works.
Non context free
If words in your language change meaning depending on context, then you need a more complicated solution. These are almost always hand-coded solutions.
For example, in C,
#ifdef X
typedef int foo;
#endif
foo * bar;
If foo is a type, then foo * bar is the declaration of a foo pointer named bar. Otherwise it is a multiplication of a variable named foo by a variable named bar.
It should be Regular Expressions AND String Parsing.
You can use both of them to your advantage! Many times programmers try to build a SINGLE regular expression for parsing a text and then find it very difficult to maintain. You should use both as and when required.
The regex engine is FAST. A simple match takes less than a microsecond. But it's not recommended for parsing HTML.

Generalized stream parsing?

Are there any libraries or technologies (in any language) that provide a regular-expression-like tool for any sort of stream-like or list-like data (as opposed to only character strings)?
For example, suppose you were writing a parser for your pet programming language. You've already got it lexed into a list of Common Lisp objects representing the tokens.
You might use a pattern like this to parse function calls (using C-style syntax):
(pattern (:var (:class ident)) (:class left-paren)
         (:optional (:var object)) (:star (:class comma) (:var object)) (:class right-paren))
This would bind variables for the function name and each of the function arguments (actually, it would probably be implemented so that the pattern binds a variable for the function name, one for the first argument, and a list for the rest, but that's not really an important detail).
Would something like this be useful at all?
I don't know how many replies you'll receive on a subject like this, as most languages lack the sort of robust stream APIs you seem to have in mind; thus, most of the people reading this probably don't know what you're talking about.
Smalltalk is a notable exception, shipping with a rich hierarchy of Stream classes that -- coupled with its Collection classes -- allow you to do some pretty impressive stuff. While most Smalltalks also ship with regex support (the pure ST implementation by Vassili Bykov is a popular choice), the regex classes unfortunately are not integrated with the Stream classes in the same way the Collection classes are. This means that using streams and regexes in Smalltalk usually involves reading character strings from a stream and then testing those strings separately with regex patterns -- not the sort of "read the next n characters up until a pattern matches" or "read the next n characters matching this pattern" functionality you likely have in mind.
I think a powerful stream API coupled with powerful regex support would be great. However, I think you'd have trouble generalizing about different stream types. A read stream on a character string would pose few difficulties, but file and TCP streams would have their own exceptions and latencies that you would have to handle gracefully.
Try looking at scala.util.regexp, both the API documentation and the code example at http://scala.sygneca.com/code/automata. I think it would allow a computational linguist to match strings of words by looking for part-of-speech patterns, for example.
This is the principle behind most syntactic parsers, which operate in two phases. The first phase is the lexer, where identifiers, language keywords, and other special characters (arithmetic operators, braces, etc) are identified and split into Token objects that typically have a numeric field indicating the type of the lexeme, and optionally another field indicating the text of the lexeme.
In the second phase, a syntactic parser operates on the Token objects, matching them by magic number alone, to parse phrases. (Software for doing this includes ANTLR, yacc/bison, Scala's scala.util.parsing.combinator.syntactical library, and plenty of others.) The two phases don't entirely have to depend on each other -- you can get your Token objects from anywhere else you like. The magic number aspect seems to be important, though, because the magic numbers are assigned to constants, and they're what make it easy to express your grammar in a readable language.
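As a sketch of what the question's pattern might look like hand-compiled against such Token objects (illustrative C++, not one of the libraries named above):

#include <string>
#include <vector>

// Tokens as described above: a numeric kind plus the lexeme's text.
enum Kind { IDENT, LPAREN, RPAREN, COMMA };
struct Token { Kind kind; std::string text; };

// Toy "regex over tokens": ident '(' [object (',' object)*] ')',
// binding the function name and the argument names.
bool matchCall(const std::vector<Token>& ts, std::string& fn,
               std::vector<std::string>& args) {
    std::size_t i = 0;
    auto at = [&](Kind k) { return i < ts.size() && ts[i].kind == k; };
    if (!at(IDENT)) return false;
    fn = ts[i++].text;
    if (!at(LPAREN)) return false;
    ++i;
    if (at(IDENT)) {
        args.push_back(ts[i++].text);
        while (at(COMMA)) {
            ++i;
            if (!at(IDENT)) return false;
            args.push_back(ts[i++].text);
        }
    }
    if (!at(RPAREN)) return false;
    return ++i == ts.size();
}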
And remember that anything you can accomplish with a regular expression can also be accomplished with a context-free grammar (usually just as easily).