I am using a Flex-generated C lexer and a Bison-generated C++ parser, and I have modified the parser to accept only string input.
I call the parser function yyparse() in a loop, reading user input line by line, and I stop the loop when the input is "quit".
The problem I am facing is that when the input doesn't match any rule, the parser stops abruptly, and on the next iteration it resumes in the same state, expecting the rule that was interrupted by the syntax error to complete.
It works fine if the input is valid and matches a parser rule.
I have redefined the yyerror() function to display a simple error message on syntax errors.
How do I clear the state of the parser when the input doesn't match any rule, so that on the next iteration the parser starts afresh?
According to my Lex & Yacc book there is a function yyrestart(file).
Otherwise (and I quote a paragraph of the book):
This means that you cannot restart a lexer just by calling yylex(). You have to reset it into the default state using BEGIN INITIAL, discard any input text buffered up by unput(), and otherwise arrange so that the next call to input() will start reading the new input.
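In the string-input setup described in the question, the simplest reset is to hand the scanner a brand-new buffer for every line, so a failed parse cannot leave half-consumed input behind. A minimal sketch, assuming a flex-generated scanner compiled as C; the driver loop itself is illustrative:

#include <cstdio>
#include <cstring>

extern "C" {
    /* flex's string-buffer interface, present in the generated scanner */
    typedef struct yy_buffer_state *YY_BUFFER_STATE;
    YY_BUFFER_STATE yy_scan_string(const char *str);
    void yy_delete_buffer(YY_BUFFER_STATE buffer);
    int yyparse(void);   /* drop the extern "C" wrapper if built as C++ */
}

int main() {
    char line[1024];
    while (std::fgets(line, sizeof line, stdin)) {
        if (std::strncmp(line, "quit", 4) == 0)
            break;
        /* a fresh buffer per line: nothing survives a syntax error */
        YY_BUFFER_STATE buf = yy_scan_string(line);
        yyparse();
        yy_delete_buffer(buf);
    }
    return 0;
}

If your scanner uses start conditions, you will also want the BEGIN INITIAL reset from the quoted paragraph; BEGIN is only usable inside the .l file, so expose it through a small helper function there.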
Interesting question - I have a parser that can be compiled with Bison, Byacc, MKS Yacc or Unix Yacc, and I don't do anything special to reset the grammar whether it fails or succeeds. I don't use a Flex or Lex tokenizer; mine is hand-coded, and it works strictly off strings. So I have to agree with Gamecat: the most likely cause of the trouble is the lexical analyzer, rather than the parser proper.
(If you want to obtain my code, you can download SQLCMD from the IIUG (International Informix User Group) web site. Although the full product requires Informix ESQL/C, the grammar can, in principle, be converted into a standalone test program. Sadly, it appears I've not run that test for a while - there are some issues with the test compilation: some structure element names changed in April 2006, and there are linkage issues. I will need to reorganize the code so that the grammar can be tested standalone again.)
In my flex-generated lexer, I would like to store each line of the file, so that when reporting errors I can show the user the line the error occurred on.
I could of course do this with a vector, reading in all lines from the file before/after lexing, but that would just add to the time needed to parse a file.
What I thought I could do instead is store the current line whenever a newline character is matched, inserting it into a vector. So my question is: is there a variable/macro in flex that stores the current line? (Something like yyline, perhaps.)
Note: I am also using bison
By itself, lex/flex does not do what you ask. As noted, you want this for reporting error messages. (I do something like this in vile, the "vi like emacs" editor.)
With lex/flex, the only way to store the entire line is to record each token from the current line into your own line-buffer. That can be complicated, especially if your lexer has to handle multi-line content (such as comments or strings).
The yytext variable only shows you the most recently matched token (and yyleng, the corresponding length). If your lexer does a simple ECHO, that is a token just like the ones you pay attention to.
Reading the file in advance, as noted, is one way to simplify the problem. In vile, the lexers read from an in-memory buffer via a function rather than from an input stream, bypassing the normal stream-handling logic by redefining the YY_INPUT macro, e.g.,
#define YY_INPUT(buf,result,max_size) result = flt_input(buf,max_size)
Likewise, ECHO is redefined (since the editor reads the results back rather than letting them go to the standard output):
#define ECHO flt_echo(yytext, yyleng)
and it traps errors detected by the lexer with another redefinition:
#define YY_FATAL_ERROR(msg) flt_failed(msg);
However you do this, note that the yylineno value reported for a given token reflects the position at the end of that token, not its beginning.
While it is nice to report the entire line in context in an error message, it is also useful to track the line and column number of each token; various editors can deal with lines like this:
filename:line:col:message
If you build up your line-buffer by tracking tokens, it might be relatively simple to track the column on which each token begins as well.
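A rough sketch of that line-buffer idea (the names here are mine, not part of flex); each rule's action would feed its token through something like this, e.g. track_token(yytext, yyleng):

#include <string>
#include <vector>

static std::vector<std::string> lines;   // finished lines, kept for error reports
static std::string current_line;         // the line currently being scanned

// Returns the 1-based column at which the token started.
static int track_token(const char *text, int len) {
    int start_col = (int)current_line.size() + 1;
    for (int i = 0; i < len; ++i) {
        if (text[i] == '\n') {
            lines.push_back(current_line);   // line complete: stash it
            current_line.clear();
        } else {
            current_line += text[i];
        }
    }
    return start_col;
}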
I have a file which contains an ABNF grammar with tags, like in this simplified example:
$name = Bertha {userID=013} | Bob {userID=429} | ( Ben | Benjamin ) {userID=265};
$greet = Hi | Hello | Greetings;
$S = $greet $name;
Now the task is to obtain the userID by parsing a given sentence for this grammar. For example, parsing the sentence
Greetings Bob
should give us the userID 429. The grammars have to be read in at runtime because they can change between runs.
My approach for now is the following:
parse the grammar into one or more trees, putting the tags at the leaves or nodes they belong to (a possible node layout is sketched below)
parse the sentence with this/those tree(s) to construct a tree which derives the given sentence (I'm thinking about using Earley for this)
use this tree to obtain the tags (unlike in the example, there will be multiple different tags in such a tree)
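For step 1, the internal representation could be as simple as a tagged tree node along these lines (purely illustrative, not an existing component):

#include <memory>
#include <string>
#include <vector>

struct GrammarNode {
    enum Kind { Sequence, Alternative, Terminal, RuleRef } kind;
    std::string text;   // terminal word or referenced rule name
    std::string tag;    // attached tag, e.g. "userID=429", if any
    std::vector<std::unique_ptr<GrammarNode>> children;
};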
My question is: are there any software components that I can use, or at least modify, to solve this task? Steps 1 and 2 in particular seem quite generic (1. reading an ABNF grammar into an internal C++ representation, e.g. trees; 2. the Earley algorithm, or something like it, working with the representation from step 1), and writing a complete, fault-proof ABNF parser for step 1 would be a really time-consuming task for me.
I know that VoiceXML grammars work like this, but I was unable to find a parser for them. Basically all I could find were parser generators, which generate C++ code for a single grammar; that is not practical for me because the grammars are not known at compile time.
Any ideas?
Back in 2001 I wrote a C++ library that generates a parser from rules specified at run-time. It is available on SourceForge as the project BuildParse, under an LGPL license. I've used it in a couple of other projects, and I updated it in 2009 to work with the C++ of the time. If it doesn't matter whether the parser is fast, it might work for you or at least save you some work rolling your own.
Basically, you'd need a parser to parse your grammar into the data structures that BuildParse uses (you can use BuildParse for that as well), and then run the BuildParse parser generator to generate something that can recognize tokens.
For a school project, I need to parse a text/source file containing a simplified "fake" programming language in order to build an AST. I've looked at boost::spirit, but since this is a group project, most members seem reluctant to learn extra libraries, and the lecturer/TA recommended learning to write a simple parser in C++, so I thought of going that route. Are there examples out there, or ideas on how to start? I have made a few attempts, but nothing really successful yet...
Parse line by line:
Test each line against a bunch of regexes: one for procedure/function declarations, one for assignments, one for while, etc.
But I would need to assume there are no multiple statements on one line, e.g. a=b;x=1;
When I reach a container statement (a procedure, a while, etc.), I increase the indent, so all nested statements go under it.
When I reach a } I decrement the indent.
Any better ideas or suggestions? Example code I need to parse (very simplified here):
procedure Hello {
    a = 1;
    while a {
        b = a + 1 + z;
    }
}
Another idea was to read the whole file into a string and go top-down: match all procedures, then capture everything in { ... }, then start matching statements (ending with ;) or containers like while { ... }. This is similar to how PEG does things? But I would need to read the entire file.
Multipass makes things easier. On a first pass, split things into tokens, like "=", or "abababa", or a quote-delimited string, or a block of whitespace. Don't be destructive (keep the original data), but break things down into simple chunks, and maybe have a little struct or enum that describes what each token is (i.e., whitespace, a string literal, an identifier, etc.).
So your sample code gets turned into:
identifier(procedure) whitespace( ) identifier(Hello) whitespace( ) operation({) whitespace(\n\t) identifier(a) whitespace( ) operation(=) whitespace( ) number(1) operation(;) whitespace(\n\t) etc.
In those tokens, you might also want to store line number and offset on the line (this will help with error message generation later).
A quick test would be to turn this back into the original text. Another quick test might be to dump out a pretty-printed version in HTML or something (coloring whitespace with a pink background, identifiers light blue, operations light green, numbers light orange) and see if your tokenizer is making sense.
Now, your language may be whitespace-insensitive. If so, discard the whitespace! (C++ isn't, because you need newlines to know when // comments end.)
(Note: a professional language parser will be as close to one-pass as possible, because it is faster. But you are a student, and your goal should be to get it to work.)
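A rough sketch of such a tokenizer (the type and function names are made up for illustration):

#include <cctype>
#include <string>
#include <vector>

enum class TokKind { Whitespace, Identifier, Number, Operation };

struct Token {
    TokKind kind;
    std::string text;   // the original characters, kept non-destructively
    int line;           // where the token starts (for error messages later)
    int col;
};

std::vector<Token> tokenize(const std::string &src) {
    std::vector<Token> out;
    int line = 1, col = 1;
    size_t i = 0;
    auto take = [&](Token &t) {             // consume one character into t
        if (src[i] == '\n') { ++line; col = 1; } else { ++col; }
        t.text += src[i++];
    };
    while (i < src.size()) {
        Token t{TokKind::Operation, "", line, col};
        unsigned char c = src[i];
        if (std::isspace(c)) {
            t.kind = TokKind::Whitespace;
            while (i < src.size() && std::isspace((unsigned char)src[i])) take(t);
        } else if (std::isalpha(c) || c == '_') {
            t.kind = TokKind::Identifier;
            while (i < src.size() && (std::isalnum((unsigned char)src[i]) || src[i] == '_')) take(t);
        } else if (std::isdigit(c)) {
            t.kind = TokKind::Number;
            while (i < src.size() && std::isdigit((unsigned char)src[i])) take(t);
        } else {
            take(t);                        // a single-char operation: { } = ; +
        }
        out.push_back(t);
    }
    return out;
}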
So now you have a stream of such tokens. There are a bunch of approaches at this point. You could pull out some serious parsing chops and build a CFG to parse them. (Do you know what a CFG is? LR(1)? LL(1)?)
An easier method might be to do it a bit more ad hoc. Look for operation({) and find the matching operation(}) by counting up and down, as in the sketch below. Look for language keywords (like procedure), which then expect a name (the next token), then a block (a {). An ad-hoc parser for a really simple language may work fine.
I've done exactly this for a ridiculously simple language, where the parser consisted of a really simple PDA. It might work for you guys. Or it might not.
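For the brace matching mentioned above, counting up and down could look like this (again just a sketch, reusing the made-up Token type from before):

size_t match_brace(const std::vector<Token> &toks, size_t open) {
    int depth = 0;                      // toks[open] is the operation({)
    for (size_t i = open; i < toks.size(); ++i) {
        if (toks[i].kind != TokKind::Operation) continue;
        if (toks[i].text == "{") ++depth;
        else if (toks[i].text == "}" && --depth == 0)
            return i;                   // index of the matching operation(})
    }
    return toks.size();                 // no match: report a syntax error
}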
Since you mentioned PEG, I'd like to throw in my open source project: https://github.com/leblancmeneses/NPEG/tree/master/Languages/npeg_c++
Here is a visual tool that can export C++ version: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-language-workbench
Documentation for rule grammar: http://www.robusthaven.com/blog/parsing-expression-grammar/npeg-dsl-documentation
If I were writing my own language, I would probably look at the terminals/non-terminals found in System.Linq.Expressions, as these would be a great start for your grammar rules.
http://msdn.microsoft.com/en-us/library/system.linq.expressions.aspx
System.Linq.Expressions.Expression
System.Linq.Expressions.BinaryExpression
System.Linq.Expressions.BlockExpression
System.Linq.Expressions.ConditionalExpression
System.Linq.Expressions.ConstantExpression
System.Linq.Expressions.DebugInfoExpression
System.Linq.Expressions.DefaultExpression
System.Linq.Expressions.DynamicExpression
System.Linq.Expressions.GotoExpression
System.Linq.Expressions.IndexExpression
System.Linq.Expressions.InvocationExpression
System.Linq.Expressions.LabelExpression
System.Linq.Expressions.LambdaExpression
System.Linq.Expressions.ListInitExpression
System.Linq.Expressions.LoopExpression
System.Linq.Expressions.MemberExpression
System.Linq.Expressions.MemberInitExpression
System.Linq.Expressions.MethodCallExpression
System.Linq.Expressions.NewArrayExpression
System.Linq.Expressions.NewExpression
System.Linq.Expressions.ParameterExpression
System.Linq.Expressions.RuntimeVariablesExpression
System.Linq.Expressions.SwitchExpression
System.Linq.Expressions.TryExpression
System.Linq.Expressions.TypeBinaryExpression
System.Linq.Expressions.UnaryExpression
I am trying to design a server app which will read a command line over a socket stream (one character at a time). Obviously the simple way is to read characters up to the EOL and execute the command contained in the receive buffer.
Instead, I want it so that when a user starts entering a command line and then enters "?", the app generates a list of all the parameters which are syntactically correct from that point in the parsing of the command line (this is similar to the way it works on some embedded devices I have seen, like Cisco and NetScreen routers).
For example,
$ set interface ?
would display
> set interface [option] -- displays information about the specified interface.
>
> [option] must be one of the following:
> address [ip-addr]
> port [port-no]
> protocol [tcp|udp]
So basically, I would need to know where we were in the grammar, and what symbols are expected from that point forward.
It would also be nice if it could support simple line editing commands (BS, DEL, insert, left-arrow, right-arrow), and maybe even up-arrow/down-arrow for command history.
Can this be done using the boost spirit parser?
EDIT:
Simply put: is there a simple way to create a boost spirit parser which, in addition to having a set of rules, immediately executes an action any time '?' is encountered on the input stream, without having to explicitly encode the token '?' into the rules?
I'm writing a fairly simple program with LEX that, after parsing a few files, parses input from a user.
Now, with the files, everything works like a charm. However, when it comes to user input from stdin, the LEX rules won't run until an EOF (via Ctrl+D) is sent. When I do that, LEX parses everything I wrote and then waits for more input. A second consecutive EOF terminates the scanner.
The thing is, I want the program to react on \n, outputting some data. Is there a way to force a scan from inside a rule, or to configure LEX buffering somehow to match this behaviour?
Solved! This did the trick:
%option always-interactive
I'm leaving this here for future reference, in case... well, who knows.
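For context, %option always-interactive goes in the definitions section of the .l file; it makes the scanner treat its input as interactive and read one character at a time instead of filling a block buffer, so rules fire as soon as a token is complete. A minimal sketch with placeholder rules:

%option always-interactive
%%
[a-z]+      { printf("word: %s\n", yytext); }
\n          { printf("EOL\n"); }
.           { /* ignore anything else */ }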
Here is a snippet from a unix shell I wrote with lex and yacc. I think it'll do the trick.
"\n" |
";" {
//yylval.sb = getsb(yytext); for yacc stuff
fprintf(stderr,"EOL\n");
return(EOL);
}