I need to create a rule via boost spirit that should match situations like
return foo;
and
return (foo);
I tried something like this:
start %= "return" >> -boost::spirit::qi::char_('(') >> identifier >> -boost::spirit::qi::char_(')') >> ';';
but this also succeeds in cases like
return (foo;
and
return foo);
How can I solve it?
Your example only looks pathological, because you are using an overly specific example.
In practice, you don't "return" >> identifier;. Usually, the thing that's returned is just an expression. So, you'd say
expr = literal | variable | function_call;
Now the general way to cater for parenthesized expressions in one fell swoop is simply:
expr = literal | variable | function_call
| ('(' >> expr >> ')')
;
Bam. Done. It handles the balancing. It handles nested parentheses. It handles (((foo))) even. Not a whistle was given that day.
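To see why the recursive alternative balances the parentheses, here is a rough hand-written equivalent in plain C++. It is not Spirit code, just a recursive-descent sketch of the same idea, and the names ReturnParser and is_return_statement are made up for illustration. It assumes the returned expression is either an identifier or a parenthesized expression.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hand-written recursive-descent recognizer (plain C++, not Spirit) for
//   statement = "return" expr ';'
//   expr      = identifier | '(' expr ')'
class ReturnParser {
public:
    explicit ReturnParser(const std::string& text) : text_(text), pos_(0) {}

    // True only if the whole input is one well-formed return statement.
    bool parse_statement() {
        if (!keyword_return() || !parse_expr() || !symbol(';')) return false;
        skip_spaces();
        return pos_ == text_.size();
    }

private:
    static bool is_ident_start(char c) {
        return std::isalpha(static_cast<unsigned char>(c)) || c == '_';
    }
    static bool is_ident_char(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    }
    void skip_spaces() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
    }
    // Consumes the single character c, after optional whitespace.
    bool symbol(char c) {
        skip_spaces();
        if (pos_ < text_.size() && text_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    bool keyword_return() {
        skip_spaces();
        static const std::string kw = "return";
        if (text_.compare(pos_, kw.size(), kw) != 0) return false;
        std::size_t after = pos_ + kw.size();
        // Reject "returnfoo": the keyword must not run into an identifier.
        if (after < text_.size() && is_ident_char(text_[after])) return false;
        pos_ = after;
        return true;
    }
    bool identifier() {
        skip_spaces();
        if (pos_ >= text_.size() || !is_ident_start(text_[pos_])) return false;
        while (pos_ < text_.size() && is_ident_char(text_[pos_])) ++pos_;
        return true;
    }
    // The recursion is what balances the parentheses: every '(' that is
    // consumed must be closed by a ')' before the rule can succeed.
    bool parse_expr() {
        if (symbol('(')) return parse_expr() && symbol(')');
        return identifier();
    }

    std::string text_;
    std::size_t pos_;
};

bool is_return_statement(const std::string& text) {
    return ReturnParser(text).parse_statement();
}

void check_return_examples() {
    assert(is_return_statement("return (foo);"));
    assert(!is_return_statement("return (foo;"));
    assert(!is_return_statement("return foo);"));
}
```

The optional-parenthesis version from the question has no such link between the opening and the closing character, which is why it accepts the unbalanced inputs.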
I don't think there is /anything/ wrong at all. I've posted probably over 20 different recursive expression grammars in answers on this site. They should provide motivating examples (showing operator precedence and overriding it with these parentheses).
I am trying to write a grammar for the Java specification, for example:
COMPILATION_UNIT: PACKAGE_DEC? IMPORT_DECS? TYPE_DECS?
but it doesn't work
I have the following error:
invalid character: `?'
for each question mark I use in my file.y
I know that Bison has special characters and it should handle it
Please help
Bison does not support a ? to mark the preceding token as optional; you have to write out the grammar with the optional elements:
package_decl_opt: %empty
| SOME_TOKEN
;
package: package_decl_opt TOKEN_PACKAGE TOKEN_IDENTIFIER
;
would allow both of the following:
SOME_TOKEN TOKEN_PACKAGE TOKEN_IDENTIFIER
TOKEN_PACKAGE TOKEN_IDENTIFIER
As you have seen, bison does not implement the ? regular expression optionality operator. Nor does it implement the + or * repetition operators. That's because the right-hand sides of productions in context-free grammars are not regular expressions.
Yacc/bison context-free grammars do allow the | alternation operator, but as an abbreviation:
a : b | c
is exactly the same as writing
a : b
a : c
and semantic actions only apply to the alternative in which they are specified, so that
a : b | c { /* C action; */ }
is equivalent to:
a : b { /* Implicit default action*/ }
a : c { /* C action; */ }
It is tempting to create X_opt non-terminals to capture the semantics of X?:
X_opt: X | %empty { $$ = default_value; }
In many simple cases that will work fine, but there are also many grammars in which that introduces an unnecessary shift-reduce conflict. Consider, for example:
label: IDENT ':'
label_opt: label | %empty
statement: label_opt expr
Since expr can start with an identifier, there is no way to know if an IDENT token starts a label or if it starts an expr following an empty label_opt. But LR(1) requires that the empty label_opt be reduced before the IDENT is consumed. So the above grammar is LR(2) and cannot be correctly parsed by an LR(1) parser.
That problem does not occur without the use of the label_opt shortcut:
label: IDENT ':'
statement: label expr
| expr
That works because the parser no longer has to decide between label and expr before the ':' is encountered.
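The deferred decision can be mimicked by hand. The following is only an analogy in plain C++, not bison output and not a real LR automaton; the Tok enum and statement_has_label are hypothetical names. It consumes the identifier first and only then lets a single token of lookahead decide whether it was a label.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical token kinds for the label/expr example.
enum class Tok { Ident, Colon, Number };

// Analogy to the conflict-free grammar: the identifier is consumed first,
// and only then does a single token of lookahead decide label vs. expr.
bool statement_has_label(const std::vector<Tok>& tokens) {
    std::size_t pos = 0;
    if (pos < tokens.size() && tokens[pos] == Tok::Ident) {
        ++pos;  // "shift" IDENT without committing to label or expr yet
        return pos < tokens.size() && tokens[pos] == Tok::Colon;
    }
    return false;  // a statement that starts with anything else has no label
}

void check_label_examples() {
    assert(statement_has_label({Tok::Ident, Tok::Colon, Tok::Ident}));  // x: y
    assert(!statement_has_label({Tok::Ident}));                         // y
}
```

With label_opt in the grammar there is no such option, because the empty alternative has to be reduced before the identifier is consumed.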
I have been tasked with a project that involves me taking a grammar (in BNF form) and creating a lexical scanner (using lex) and a parser (using bison). I've never worked with any of these programs and I think a good reference would be to see how these items are created from a grammar. I am looking for a grammar and its associated .l and .ypp files, preferably in C++. I've been able to find sample files or sample grammars, but not both of them. I've spent some time searching and I could not find anything. I figured I'd post here in hopes that someone has something for me, but I will continue searching in the meantime.
I am currently reading Tom Niemann's
http://epaperpress.com/lexandyacc/download/LexAndYaccTutorial.pdf which seems to be pretty well written and understandable.
Thanks
Edit: I am still searching, I am starting to think that what I am looking for does not exist. Google usually never fails me!
Edit 2: Maybe if I provide some of the grammar, you folks could show me what the appropriate .l and .ypp files would look like. This is just a snippet of the grammar, I just need a little 'taste' of how this works and I think I can take it from there.
Grammar:
Program ::= Compound
Statements ::= Compound | Assignment | ...
Assignment ::= Var ASSIGN Expression
Expression ::= Var | Operator Expression Expression | Number
Compound ::= START Statements END
Number ::= NUMBER
Descriptions:
Assignment is the equal sign ":="
Var is an identifier that begins with a lower case letter and is followed by lower case letters or digits
START is the "start" keyword
END is the "end" keyword
Operator is "+", "-", "*", "/"
Number is decimal digits which could potentially be negative (minus sign in front)
Most of this is fairly straightforward. One part, however, is decidedly problematic. You've defined a number to (potentially) include a leading -, and that's a problem.
The problem is pretty simple. Given an input like 321-123, it's essentially impossible for the lexer (which won't normally keep track of current state) to guess whether that's supposed to be two tokens (321 and -123) or three (321, -, 123). In this case, the - is almost certainly intended to be separate from the 123, but if the input were 321 + -123 you'd apparently want -123 as a single token instead.
To deal with that, you probably want to change your grammar so the leading - isn't part of the number. Instead, you always want to treat the - as an operator, and the number itself is composed solely of the digits. Then it's up to the parser to sort out expressions where the - is unary vs. binary.
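As a rough illustration of that split of responsibilities, here is a hand-written sketch in plain C++ (not flex or bison output; tokenize and is_unary_minus are made-up names). The tokenizer always emits - as its own token, and a separate check, standing in for the parser, decides from the preceding token whether a given - is unary.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Lexer side: '-' is always a token of its own and a number is digits only,
// so the lexer never has to guess.
std::vector<std::string> tokenize(const std::string& input) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isdigit(c)) {
            std::size_t start = i;
            while (i < input.size() &&
                   std::isdigit(static_cast<unsigned char>(input[i]))) ++i;
            tokens.push_back(input.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, input[i]));  // any other single character
            ++i;
        }
    }
    return tokens;
}

// Parser side: a '-' is unary when no operand precedes it, i.e. at the
// very start or directly after another operator.
bool is_unary_minus(const std::vector<std::string>& tokens, std::size_t index) {
    if (tokens[index] != "-") return false;
    if (index == 0) return true;
    const std::string& prev = tokens[index - 1];
    return prev == "+" || prev == "-" || prev == "*" || prev == "/";
}

void check_minus_examples() {
    assert(tokenize("321-123").size() == 3);           // 321, -, 123
    assert(!is_unary_minus(tokenize("321-123"), 1));   // binary minus
    assert(is_unary_minus(tokenize("321 + -123"), 2)); // unary minus
}
```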
Taking that into account, the lexer file would look something like this:
%{
#include "y.tab.h"
%}
%option noyywrap case-insensitive
%%
:= { return ASSIGN; }
start { return START; }
end { return END; }
[+/*] { return OPERATOR; }
- { return MINUS; }
[0-9]+ { return NUMBER; }
[a-z][a-z0-9]* { return VAR; }
[ \t\r\n] { ; }
%%
void yyerror(char const *s) { fputs(s, stderr); }
The matching yacc file would look something like this:
%{
#include <stdio.h>
int yylex(void);
void yyerror(char const *s);
%}
%token ASSIGN START END OPERATOR MINUS NUMBER VAR
%left OPERATOR MINUS
%%
program : compound
statement : compound
| assignment
;
assignment : VAR ASSIGN expression
;
statements :
| statements statement
;
expression : VAR
| expression OPERATOR expression
| expression MINUS expression
| value
;
value: NUMBER
| MINUS NUMBER
;
compound : START statements END
%%
int main() {
yyparse();
return 0;
}
Note: I've tested these only minimally: enough to verify that input I believe is grammatical, such as start a:=1 b:=2 end and start a:=1+3*3 b:=a+4 c:=b*3 end, is accepted (no error message printed), and that input I believe is ungrammatical, such as 9:=13 and a=13, prints a syntax error message. Since this only recognizes whether the input is grammatical and does nothing more with the expressions, that's about the best we can do.
I'm trying to write a basic OBJ file loader using the Boost Spirit library. Although I got it working using the standard std::ifstreams, I'm wondering if it's possible to do a phrase_parse on the entire file using a memory mapped file, since it seems to provide the best performance as posted here.
I have the following code, which seems to work well, but it breaks when there is a comment in the file. So, my question is: how do you ignore a comment that starts with a '#' in the OBJ file using Spirit?
struct vertex {
double x, y, z;
};
BOOST_FUSION_ADAPT_STRUCT(
vertex,
(double, x)
(double, y)
(double, z)
)
std::vector<vertex> b_vertices;
boost::iostreams::mapped_file mmap(
path,
boost::iostreams::mapped_file::readonly);
const char* f = mmap.const_data();
const char* l = f + mmap.size();
using namespace boost::spirit::qi;
bool ok = phrase_parse(f,l,(("v" >> double_ >> double_ >> double_) |
("vn" >> double_ >> double_>> double_)) % eol ,
blank, b_vertices);
The above code works well when there are no comments or any other data except vertices/normals. But when there is a different type of data the parser fails (as it should) and I'm wondering if there is a way to make it work without going back to parsing every line as it is slower (almost 2.5x in my tests). Thank you!
The simplest way that comes to mind is to simply make comments skippable:
bool ok = qi::phrase_parse(
f,l,
(
("v" >> qi::double_ >> qi::double_ >> qi::double_) |
("vn" >> qi::double_ >> qi::double_ >> qi::double_)
)
% qi::eol,
('#' >> *(qi::char_ - qi::eol) >> qi::eol | qi::blank), b_vertices);
Note that this also 'recognizes' comments if # appears somewhere inside the line. This is probably just fine (as it would make the parsing fail, unless it was a comment trailing on an otherwise valid input line).
See it Live on Coliru
Alternatively, use some phoenix magic to handle "comment lines" just as you handle a "vn" or "v" line.
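For comparison, here is what "treat comments like whitespace" amounts to when written by hand. This is a plain C++ sketch without Spirit; parse_vertices is a hypothetical helper, and it takes a std::string rather than a mapped file so that strtod can rely on the terminating null. Unlike the grammar above it does not insist that the three numbers sit on one line.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

struct vertex { double x, y, z; };

// Collects "v"/"vn" triples and skips '#' comments the way a skipper would.
std::vector<vertex> parse_vertices(const std::string& text) {
    std::vector<vertex> out;
    const char* p = text.c_str();  // null-terminated, so strtod cannot overrun
    const char* end = p + text.size();
    while (p < end) {
        // Blanks and line breaks between records carry no meaning.
        if (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n') { ++p; continue; }
        if (*p == '#') {  // comment: drop everything up to the newline
            while (p < end && *p != '\n') ++p;
            continue;
        }
        if (*p != 'v') break;           // unknown record type: stop parsing
        ++p;
        if (p < end && *p == 'n') ++p;  // "vn" is read the same way as "v"
        double c[3];
        for (int i = 0; i < 3; ++i) {
            char* next = nullptr;
            c[i] = std::strtod(p, &next);
            if (next == p) return out;  // missing number: stop at the bad record
            p = next;
        }
        out.push_back(vertex{c[0], c[1], c[2]});
    }
    return out;
}

void check_vertex_examples() {
    assert(parse_vertices("# header\nv 1 2 3\nvn 0 0 1 # normal\n").size() == 2);
    assert(parse_vertices("# only a comment").empty());
}
```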
I realize that my comment/post is not directly related to the code, but I'm for not reinventing the wheel if possible, and I would have wanted to know about this library. I was working with a handwritten OBJ/Wavefront loader, but in my research I found the library Tiny Obj Loader. This library is written in C++ with no dependencies except the C++ STL. It handles the edge cases of the Wavefront spec fairly well and it is very fast. The only thing the user has to do is convert the Tiny OBJ objects into their own types. TinyObjLoader has been adopted by quite a number of projects as well. I do apologize for not directly answering the question; my desire is to get knowledge about this great library out.
I just started using Boost::xpressive and find it an excellent library... I went through the documentation and tried to use the ! operator (zero or one) but it doesn't compile (VS2008).
I want to match a sip address which may or may not start with "sip:"
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
using namespace boost::xpressive;
using namespace std;
int main()
{
sregex re = !"sip:" >> *(_w | '.') >> '#' >> *(_w | '.');
smatch what;
for(;;)
{
string input;
cin >> input;
if(regex_match(input, what, re))
{
cout << "match!\n";
}
}
return 0;
}
You just encountered a bug that plagues most DSELs.
The issue is that you want a specific operator to be called, the one actually defined in your specific language. However, this operator already exists in C++, and therefore the normal rules of lookup and overload resolution apply.
The selection of the right operator is done with ADL (Argument Dependent Lookup), which means that at least one of the objects to which the operator applies should be part of the DSEL itself.
For example, consider this simple code snippet:
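#include <iostream>
namespace dsel
{
class MyObject {};
class MyStream {};
// Found via ADL because MyObject is declared in namespace dsel.
MyStream operator<<(std::ostream&, MyObject) { return MyStream(); }
// The result is a dsel type, so the rest of the chain stays in the DSEL.
MyStream operator<<(MyStream s, char const*) { return s; }
}
int main(int, char*[])
{
std::cout << dsel::MyObject() << "other things here";
}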
Because the expression is evaluated from left to right, the presence of dsel::MyObject is viral, i.e. the DSEL is propagated through the rest of the expression.
Regarding Xpressive, most of the time it works because you use special "markers" that are Xpressive type instances like (_w) or because of the viral effect (for example "#" works because the expression on the left of >> is Xpressive-related).
Were you to use:
sregex re = "sip:" >> *(_w | '.') >> '#' >> *(_w | '.');
^^^^^^ ~~ ^^^^^^^^^^^
Regular Xpressive
It would work, because the right hand-side argument is "contaminated" by Xpressive thanks to the precedence rules of the operators.
However, here operator! has one of the highest precedences. As such, its scope is restricted to:
!"sip:"
And since "sip:" is of type char const[5], it just invokes the regular operator! which will rightly conclude that the expression to which it applies is true and thus evaluate to the bool value false.
By using as_xpr, you convert the C-string into an Xpressive object, and thus bring in the right operator! from the Xpressive namespace into consideration, and overload resolution kicks in appropriately.
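The difference can be checked at compile time. The sketch below does not use Xpressive: mini_dsel, Expr and as_expr are hypothetical stand-ins that only imitate the role of as_xpr, so treat it as an illustration of the overload-resolution argument rather than of the library itself.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// On an operand of type char const[5] (the type of "sip:"), '!' is the
// built-in operator and its result type is plain bool.
static_assert(
    std::is_same<decltype(!std::declval<const char (&)[5]>()), bool>::value,
    "built-in operator! yields bool");

// The array decays to a non-null pointer, so the built-in '!' gives false.
bool builtin_not(const char* s) { return !s; }

namespace mini_dsel {  // hypothetical stand-in for a DSEL such as Xpressive
struct Expr {
    std::string text;
    bool optional;
};

// Plays the role of as_xpr: wraps a C string in a DSEL type.
inline Expr as_expr(const char* s) { return Expr{s, false}; }

// Found through ADL as soon as the operand is a mini_dsel::Expr.
inline Expr operator!(Expr e) {
    e.optional = true;
    return e;
}
}  // namespace mini_dsel

static_assert(
    std::is_same<decltype(!mini_dsel::as_expr("sip:")), mini_dsel::Expr>::value,
    "the DSEL operator! is selected for wrapped operands");

void check_not_examples() {
    assert(builtin_not("sip:") == false);
    assert((!mini_dsel::as_expr("sip:")).optional);
}
```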
The as_xpr helper must be used:
!as_xpr("sip:")
I'd like to parse simple C++ typedef instructions such as
typedef Class NewNameForClass;
typedef Class::InsideTypedef NewNameForTypedef;
typedef TemplateClass<Arg1,Arg2> AliasForObject;
I have written the corresponding grammar that I'd like to see used in parsing.
Name <- ('_'|letter)('_'|letter|digit)*
Type <- Name
Type <- Type::Name
Type <- Name Templates
Templates <- '<' Type (',' Type)* '>'
Instruction <- "typedef" Type Name ';'
Once this is parsed, all I'll want to do is generate XML with the same information (but laid out differently).
What is the most effective language for writing such a program?
How can you achieve this?
EDIT: What I have come up with using Boost Spirit (it's not perfect, but it's good enough for me, at least for now)
rule<> sep_p = space_p;
rule<> name_p = (ch_p('_')|alpha_p) >> *(ch_p('_')|alpha_p|digit_p);
rule<> type_p = name_p
>> !(*sep_p >>str_p("::") >> *sep_p>> name_p)
>> *(*sep_p >> ch_p('*') )
>> !(*sep_p >> str_p("const"))
>> !(*sep_p >> ch_p('&'));
rule<> templated_type_p = name_p >> *sep_p
>> ch_p('<') >> *sep_p
>> (*sep_p>>type_p>>*sep_p)%ch_p(',')
>> ch_p('>') >> *sep_p;
rule<> typedef_p = *sep_p
>> str_p ("typedef")
>> +sep_p >> (type_p|templated_type_p)
>> +sep_p >> name_p
>> *sep_p >> ch_p(';') >> *sep_p;
rule<> typedef_list_p = *typedef_p;
I would alter the grammar slightly
ShortName <- ('_'|letter)('_'|letter|digit)*
Name <- ShortName
Name <- Name::ShortName
Type <- Name
Type <- Name Templates
Templates <- '<' Type (',' Type)* '>'
Instruction <- "typedef" Type Name ';'
Also your grammar leaves out the following cases
Multiple typedef targets.
Pointer targets
Function pointers (this is by far the most difficult)
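The altered grammar translates almost line by line into a hand-written recognizer. The following is a minimal sketch in plain C++ (TypedefParser and is_typedef_instruction are made-up names), covering only the rules listed in the grammar and none of the missing cases. The left-recursive rule Name <- Name::ShortName is rewritten as a loop, which accepts the same names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

class TypedefParser {
public:
    explicit TypedefParser(const std::string& text) : text_(text), pos_(0) {}

    // Instruction <- "typedef" Type Name ';'
    bool instruction() {
        if (!keyword_typedef() || !type() || !name() || !symbol(';')) return false;
        skip_spaces();
        return pos_ == text_.size();
    }

private:
    void skip_spaces() {
        while (pos_ < text_.size() &&
               std::isspace(static_cast<unsigned char>(text_[pos_]))) ++pos_;
    }
    bool symbol(char c) {
        skip_spaces();
        if (pos_ < text_.size() && text_[pos_] == c) { ++pos_; return true; }
        return false;
    }
    // ShortName <- ('_'|letter)('_'|letter|digit)*
    bool short_name() {
        skip_spaces();
        if (pos_ >= text_.size()) return false;
        unsigned char first = static_cast<unsigned char>(text_[pos_]);
        if (!std::isalpha(first) && first != '_') return false;
        while (pos_ < text_.size() &&
               (std::isalnum(static_cast<unsigned char>(text_[pos_])) ||
                text_[pos_] == '_')) ++pos_;
        return true;
    }
    bool keyword_typedef() {
        skip_spaces();
        std::size_t start = pos_;
        return short_name() && text_.compare(start, pos_ - start, "typedef") == 0;
    }
    // Name <- ShortName ("::" ShortName)*
    bool name() {
        if (!short_name()) return false;
        while (text_.compare(pos_, 2, "::") == 0) {
            pos_ += 2;
            if (!short_name()) return false;
        }
        return true;
    }
    // Type <- Name | Name Templates
    bool type() {
        if (!name()) return false;
        skip_spaces();
        if (pos_ < text_.size() && text_[pos_] == '<') return templates();
        return true;
    }
    // Templates <- '<' Type (',' Type)* '>'
    bool templates() {
        if (!symbol('<') || !type()) return false;
        while (symbol(',')) {
            if (!type()) return false;
        }
        return symbol('>');
    }

    std::string text_;
    std::size_t pos_;
};

bool is_typedef_instruction(const std::string& text) {
    return TypedefParser(text).instruction();
}

void check_typedef_examples() {
    assert(is_typedef_instruction("typedef Class::InsideTypedef NewNameForTypedef;"));
    assert(is_typedef_instruction("typedef TemplateClass<Arg1,Arg2> AliasForObject;"));
    assert(!is_typedef_instruction("typedef Class;"));
}
```

Generating XML from here would mean recording the matched names instead of just returning true or false.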
Parsing a grammar (I love the irony) is a fairly straightforward operation. If you wanted to actually use the grammar in a functional way, I would say the best bet is a lex/yacc combination.
But from your question it appears that you want to spit it out to another format. There really isn't a language designed for this so I would say use whatever language you're most comfortable with.
Edit
The OP asked about multiple typedef targets. It's perfectly legal for a typedef declaration to have more than one target. For example:
typedef struct _SomeStruct SomeStruct, *PSomeStruct;
This creates 2 typedef names.
SomeStruct which is equivalent to "struct _SomeStruct"
PSomeStruct which is equivalent to "struct _SomeStruct*"
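The two names can be verified with a couple of static assertions. In this sketch the tag is spelled SomeStructTag instead of _SomeStruct, because identifiers that start with an underscore followed by a capital letter are reserved in C++; the member value and the helper read_through_alias are made up for the example.

```cpp
#include <cassert>
#include <type_traits>

struct SomeStructTag { int value; };

// One typedef declaration, two declarators, two new names.
typedef struct SomeStructTag SomeStruct, *PSomeStruct;

static_assert(std::is_same<SomeStruct, SomeStructTag>::value,
              "SomeStruct names the struct itself");
static_assert(std::is_same<PSomeStruct, SomeStructTag*>::value,
              "PSomeStruct names a pointer to the struct");

// Uses both aliases together.
int read_through_alias() {
    SomeStruct s = {42};
    PSomeStruct p = &s;
    return p->value;
}

void check_alias_example() {
    assert(read_through_alias() == 42);
}
```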
Well, since you're apparently already working with/on C++, have you considered using Boost.Spirit? This allows you to hard-code the grammar inline in C++ as a domain-specific language and program against it in normal C++ code.