Parsing a tokenized free form grammar with Boost.Spirit - c++

I've got stuck trying to create a Boost.Spirit parser for the callgrind tool's output which is part of valgrind. Callgrind outputs a domain specific embedded programming language (DSEL) which lets you do all sorts of cool stuff like custom expressions for synthetic counters, but it's not easy to parse.
I've placed some sample callgrind output at https://gist.github.com/ned14/5452719#file-sample-callgrind-output. I've placed my current best attempt at a Boost.Spirit lexer and parser at https://gist.github.com/ned14/5452719#file-callgrindparser-hpp and https://gist.github.com/ned14/5452719#file-callgrindparser-cxx. The Lexer part is straightforward: it tokenises tag-values, non-whitespace text, comments, end of lines, integers, hexadecimals, floats and operators (ignore the punctuators in the sample code, they're unused). White space is skipped.
So far so good. The problem is parsing the tokenised input stream. I haven't even attempted the main stanzas yet, I'm still trying to parse the tag-values which can occur at any point in the file. Tag values look like this:
tagtext: unknown series of tokens<eol>
It could be freeform text e.g.
desc: I1 cache: 32768 B, 64 B, 8-way associative, 157 picosec hit latency
In this situation you'd want to convert the set of tokens to a string i.e. to an iterator_range and extract.
It could however be an expression e.g.
event: EPpsec = 316 Ir + 1120 I1mr + 1120 D1mr + 1120 D1mw + 1362 ILmr + 1362 DLmr + 1362 DLmw
This says that from now on, event EPpsec is to be synthesised as Ir multiplied by 316 added to I1mr multiplied by 1120 added to ... etc.
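To make the post-processing step concrete, here is a minimal sketch in plain C++ (no Boost.Spirit) of how the value part of such an event: line could be turned into coefficient/counter pairs and evaluated. The names SyntheticEvent, parse_event and evaluate are hypothetical, and the sketch assumes whitespace-separated tokens and only the + operator; real callgrind expressions may need more than this.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// A synthetic event: name = sum of (coefficient * base counter).
struct SyntheticEvent {
    std::string name;
    std::vector<std::pair<long, std::string>> terms;
};

// Parses the value part of an "event:" line, e.g.
// "EPpsec = 316 Ir + 1120 I1mr". Returns false on malformed input.
// Assumption: tokens are whitespace-separated and only '+' is used.
bool parse_event(const std::string& value, SyntheticEvent& out) {
    std::istringstream in(value);
    std::string eq;
    if (!(in >> out.name >> eq) || eq != "=") return false;
    out.terms.clear();
    long coeff = 0;
    std::string base;
    while (in >> coeff >> base) {
        out.terms.emplace_back(coeff, base);
        std::string plus;
        if (!(in >> plus)) return true;  // no more tokens: done
        if (plus != "+") return false;   // only '+' is handled here
    }
    return false;  // a term was expected but not found
}

// Evaluates the event against measured counters.
// Assumption: a counter missing from the map counts as zero.
long evaluate(const SyntheticEvent& ev,
              const std::map<std::string, long>& counters) {
    assert(!ev.name.empty());  // expects a successfully parsed event
    long total = 0;
    for (const auto& term : ev.terms) {
        auto it = counters.find(term.second);
        if (it != counters.end()) total += term.first * it->second;
    }
    return total;
}
```

With counters Ir = 10 and I1mr = 2, evaluating "EPpsec = 316 Ir + 1120 I1mr" gives 316 * 10 + 1120 * 2.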
The point I want to make here is that tag-value pairs need to be accumulated as arbitrary sets of tokens, and post-processed into whatever we turn them into later.
To that end, Boost.Spirit's utree() class looked like exactly what I wanted, and that's what the sample code uses. But on VS2012 using the November CTP compiler with variadic templates I'm currently seeing this compile error:
1>C:\Users\ndouglas.RIMNET\documents\visual studio 2012\Projects\CallgrindParser\boost\boost/range/iterator_range_core.hpp(56): error C2440: 'static_cast' : cannot convert from 'boost::spirit::detail::list::node_iterator<const boost::spirit::utree>' to 'base_iterator_type'
1> No constructor could take the source type, or constructor overload resolution was ambiguous
1> C:\Users\ndouglas.RIMNET\documents\visual studio 2012\Projects\CallgrindParser\boost\boost/range/iterator_range_core.hpp(186) : see reference to function template instantiation 'IteratorT boost::iterator_range_detail::iterator_range_impl<IteratorT>::adl_begin<const Range>(ForwardRange &)' being compiled
1> with
1> [
1> IteratorT=base_iterator_type
1> , Range=boost::spirit::utree
1> , ForwardRange=boost::spirit::utree
1> ]
... which suggests that my base_iterator_type, which is a Boost.Spirit multi_pass<> wrap of an istreambuf_iterator for forward iterator nature, is somehow not understood by Boost.Spirit's utree() implementation. Thing is, I'm not sure if this is my bad code or bad Boost.Spirit code seeing as line_pos_iterator<> was failing to correctly specify its forward_iterator concept tag.
Thanks to past Stackoverflow help I could write a pure non-tokenised grammar, but it would be brittle. The right solution is to tokenise and use a freeform grammar capable of fairly arbitrary input. The number of examples of getting Boost.Spirit's Lex and Grammar working together in real world examples to achieve this rather than toy examples is sadly very few. Therefore any help would be greatly appreciated.
Niall

The token attribute exposes a variant, which in addition to the base-iterator range, can assume the types declared in the token_type typedef:
typedef lex::lexertl::token<base_iterator_type, mpl::vector<std::string, int, double>> token_type;
So: string, int and double. Note however that coercion into one of the possible types will only occur lazily, when the parser actually uses the value.
utrees are a very versatile container [1]. Hence, when you expose a spirit::utree attribute on a rule, and the token value variant contains an iterator_range, then it attempts to assign that into the utree object (this fails, because the iterators are ... 'funky').
The easiest way to get your desired behaviour is to force Qi to interpret the attribute of the tag token as a string, and have that assigned to the utree. Therefore the following line constitutes a fix that will make compilation succeed:
unknowntagvalue = qi::as_string[tok.tag] >> restofline;
Notes
Having said all this, I would indeed suggest the following:
Consider using the Nabialek Trick to dispatch different lazy rules depending on the tag matched - this makes it unnecessary to deal with raw utrees later on
You might have success specializing the boost::spirit::traits::assign_to_XXXXXX traits (see documentation)
Consider using a pure Qi parser. While I can "feel" your sentiment that "it is going to be brittle" [2], it seems you have already demonstrated that using a lexer raises the complexity to such a degree that it might not have net merit:
the unexpected ways in which attributes materialize (this question)
the problem with line-pos iterators (this is a frequently asked question, and AFAIR it has mostly hard or inelegant solutions)
the inflexibility regarding e.g. ad-hoc debugging (access to source data in SA), switching/disabling skippers etc.
my personal experience was that looking at lexer states to drive these isn't helpful, because switching lexer state can only work reliably from lexer token semantic actions, whereas often, the disambiguation would happen in the Qi phase
but I'm digressing :)
[1] e.g. they have facilities for very lightweight 'referencing' of iterator ranges (e.g. for symbols, or to avoid copying characters from a source buffer into the attribute unless wanted)
[2] In effect, only because using a sequential lexer (scanner) vastly reduces the number of backtrack opportunities, so it simplifies the mental model of the parser. However, you can use expectation points to much the same effect.

Related

Rules & Actions for Parser Generator, and

I am trying to wrap my head around an assignment question, therefore I would very highly appreciate any help in the right direction (and not necessarily a complete answer). I am being asked to write the grammar specification for this parser. The specification for the grammar that I must implement can be found here:
http://anoopsarkar.github.io/compilers-class/decafspec.html
Although the documentation is there, I do not understand a few things, such as how to write (in my .y file) things such as
{ identifier },+
I understand that this would mean a comma-separated list of 1 (or more) occurrences of an identifier, however when I write it as such, the compiler displays an error of unrecognized symbols '+' and ',', being mistaken as whitespace. I tried '{' identifier "},+", but I haven't the slightest clue whether that is correct or not.
I have written the lexical analyzer portion (as it was from the previous segment of the assignment) which returns tokens (T_ID, T_PLUS, etc.) accordingly, however there is this new notion that I must assign 'yylval' to be the value of the token itself. To my understanding, this is only necessary if I am in need of the actual value of the token, therefore I would need the value of an identifier token T_ID, but not necessarily the value of T_PLUS, being '+'. This is done by creating a %union in the parser generator file, which I have done, and have provided the tokens that I currently believe would require the literal token value with the proper yylval assignment.
Here is my lexical analysis code (I could not get it to format properly, I apologize): https://pastebin.com/XMZwvWCK
Here is my parser file decafast.y: https://pastebin.com/2jvaBFQh
And here is the final piece of code supplied to me, the C++ code to build an abstract syntax tree at the end:
https://pastebin.com/ELy53VrW?fbclid=IwAR2cFT_-pGKlVZ2liC-zAe3Fw0BWDlGjrrayqEGV4JuJq1_7nKoe9-TLTlA
To finalize my question, I do not know if I am creating my grammar rules correctly. I have tried my best to follow the specification in the above website, but I can't help but feel that what I am writing is completely wrong. My compiler is spitting out nothing but "warning: rule useless in grammar" for almost every (if not every) rule.
If anyone could help me out and point me in the right direction on how to make any progress, I would highly, highly appreciate it.
The decaf specification is written in (an) Extended Backus Naur Form (EBNF), which includes a number of convenience operators for repetition, optionality and grouping. These are not part of the bison/yacc syntax, which is pretty well limited to BNF. (Bison/yacc do allow the alternation operator |, but since there is no way to group subpatterns, alternation can only be used at the top-level, to combine two productions for the same non-terminal.)
The short section at the beginning of the specification which describes EBNF includes a grammar for the particular variety of EBNF that is being used. (Since this grammar is itself recursively written in the same EBNF, there is a need to apply a bit of inductive reasoning.) When it says, for example,
CommaList = "{" Expression "}+," .
it is not saying that "}+," is the peculiar spelling of a comma-repetition operator. What it is saying is that when you see something in the Decaf grammar surrounded by { and }+,, that should be interpreted as describing a comma-separated list.
For example, the Decaf grammar includes:
FieldDecl = var { identifier }+, Type ";" .
That means that a FieldDecl can be (amongst other possibilities) the token var followed by a comma-separated list of identifier tokens and then followed by a Type and finally a semicolon.
As I said, bison/yacc don't implement the EBNF operators, so you have to find an equivalent yourself. Since BNF doesn't allow any form of grouping -- and a list is a grouped subexpression -- we need to rewrite the subexpression of a production as a new non-terminal. Also, I suppose we need to use the tokens defined in spec (although bison allows a more readable syntax).
So to yacc-ify this EBNF production, we first introduce the new non-terminal and replace the token names:
FieldDecl: T_VAR IdentifierList Type T_SEMICOLON
Which leaves the definition of IdentifierList. Repetition in BNF is always produced with recursion, following a very simple model which uses two productions:
the base, which is the simplest possible repetition (usually either nothing or a single list item), and
the recursion, which describes a longer possibility by extending a shorter one.
In this case, the list must have at least one item, and we extend by adding a comma and another item:
IdentifierList
: T_ID /* base case */
| IdentifierList T_COMMA T_ID /* Recursive extension */
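For intuition only, the same base-plus-extension shape can be sketched as a hand-written C++ routine. Bison generates its own table-driven parser, so this is not what the generated code looks like; the TokenKind enum, the Token struct and parse_identifier_list are hypothetical names, with only T_ID and T_COMMA taken from the grammar above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds; T_ID and T_COMMA mirror the grammar above.
enum TokenKind { T_ID, T_COMMA, T_OTHER };

struct Token {
    TokenKind kind;
    std::string text;
};

// Recognises IdentifierList starting at tokens[pos].
// On success, appends the identifier names to out, advances pos past
// the list and returns true. On failure, pos is left unchanged.
bool parse_identifier_list(const std::vector<Token>& tokens,
                           std::size_t& pos,
                           std::vector<std::string>& out) {
    assert(pos <= tokens.size());
    // Base case: a single T_ID.
    if (pos >= tokens.size() || tokens[pos].kind != T_ID) return false;
    out.push_back(tokens[pos++].text);
    // Recursive extension, unrolled into a loop: every "T_COMMA T_ID"
    // pair extends the list that has been recognised so far.
    while (pos + 1 < tokens.size() &&
           tokens[pos].kind == T_COMMA &&
           tokens[pos + 1].kind == T_ID) {
        out.push_back(tokens[pos + 1].text);
        pos += 2;
    }
    return true;
}
```

The loop consumes tokens left to right, which matches the way the left-recursive production grows the list by one item at a time.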
The point of this exercise is to develop your skills in thinking grammatically: that is, factoring out the syntax and semantics of the language. So you should try to understand the grammars presented, both for Decaf and for the author's version of EBNF, and avoid blindly copying code (including grammars). Good luck!

How do c/c++ compilers know which line an error is on

There is probably a very obvious answer to this, but I was wondering how the compiler knows which line of code my error is on. In some cases it even knows the column.
The only way I can think to do this is to tokenize the input string into a 2D array. This would store [lines][tokens].
C/C++ could be tokenized into 1 long 1D array which would probably be more efficient. I am wondering what the usual parsing method would be that would keep line information.
Actually, most of this is covered in the Dragon Book.
Compilers perform lexing and parsing, i.e. they transform the source code into a tree representation.
While doing so, each keyword, variable, etc. is associated with a line and column number.
However, during parsing the exact origin of a failure might get lost, so the reported location might be off.
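As a rough illustration of how a position can travel with each token, here is a minimal C++ scanner sketch that keeps one flat token list (no 2D array needed) and stamps every token with the line and column where it starts. The Token struct and scan function are hypothetical, and the token rules are deliberately crude: runs of letters, digits and underscores form one token, and any other non-space character is a token by itself.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Token {
    std::string text;
    int line;    // 1-based line where the token starts
    int column;  // 1-based column where the token starts
};

// Splits src into tokens, recording the start position of each one.
std::vector<Token> scan(const std::string& src) {
    std::vector<Token> tokens;
    int line = 1, column = 1;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (c == '\n') { ++line; column = 1; ++i; continue; }
        if (std::isspace(c)) { ++column; ++i; continue; }
        Token tok{std::string(), line, column};
        if (std::isalnum(c) || c == '_') {
            // Word-like token: consume the whole run.
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) ||
                    src[i] == '_')) {
                tok.text += src[i++];
                ++column;
            }
        } else {
            // Any other character is a one-character token.
            tok.text += src[i++];
            ++column;
        }
        tokens.push_back(tok);
    }
    assert(i == src.size());  // every character was consumed
    return tokens;
}
```

A parser built on top of this can report tok.line and tok.column of the token it was looking at when it failed, which is why the reported location is sometimes a little after the real mistake.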
This is the first step in the long, complicated path towards "Engineering a Compiler" or Compilers Theory
The short answer to that is: there's a module called "front-end" that usually takes care of many phases:
Scanning
Parsing
IR generator
IR optimizer ...
The structure isn't fixed so each compiler will have its own set of modules but more or less the steps involved in the front-end processing are
Scanning - maps character streams into words (also ignores whitespaces/comments) or tokens
Parsing - this is where syntax and (some) semantic analysis take place and where syntax errors are reported
To sum this up: the compiler knows the location of your error because when something doesn't fit into a structure called the "abstract syntax tree" (i.e. it cannot be constructed) or doesn't follow any of the syntax-directed translation rules, something is wrong, and the compiler reports the location where the match failed. If the error involves just one word/token, even a precise column can be reported, since nothing matched a terminal: a basic token such as the if keyword in the C/C++ language.
If you want to know more about this topic my suggestion is to start with the classic academic approach of the "Compiler Book" or "Dragon Book" and then, later on, possibly study an open-source front-end like Clang

the generated file by the ocamllex

The theory says that a lex tool (I am reading about ocamllex) converts a collection of regular expressions into C (OCaml) code for a DFA (actually first into an NFA, which is then converted to a DFA). The formal definition of a DFA M is a 5-tuple M = (Q, Sigma, transition_function, q0, F). What I found in the generated file is the following:
a record called __ocaml_lex_tables with fields from Lexing module
a recursive function
Is there a mapping between the objects/structures of a DFA and the structures generated by ocamllex? I cannot 'see' it... I was also googling for some help and did not find any useful example.
The output of the ocamllex tool is meaningful in a DFA context, e.g. 7 states, 279 transitions, table size 1158 bytes.
Is it a state transition table? How do I 'read' it?
Thank you for any link/hint !
ocamllex is focused on speed, so it will not have explicit states visible in the generated code. The theoretical representation is not always the fastest one; in practice it is usually transformed to gain constant-factor speed improvements. The states are most probably represented as indexes into the generated arrays. You can think of it as mapping assembly code back to the real source code: in the general case it cannot be done directly, because the compiler performs optimizations and strives for the most compact and efficient code, and the same goes for ocamllex. And the interesting question is: why do you want to do that?

How do I associate changed lines with functions in a git repository of C code?

I'm attempting to construct a “heatmap” from a multi-year history stored in a git repository where the unit of granularity is individual functions. Functions should grow hotter as they change more times, more frequently, and with more non-blank lines changed.
As a start, I examined the output of
git log --patch -M --find-renames --find-copies-harder --function-context -- *.c
I looked at using Language.C from Hackage, but it seems to want a complete translation unit—expanded headers and all—rather than being able to cope with a source fragment.
The --function-context option is new since version 1.7.8. The foundation of the implementation in v1.7.9.4 is a regex:
PATTERNS("cpp",
/* Jump targets or access declarations */
"!^[ \t]*[A-Za-z_][A-Za-z_0-9]*:.*$\n"
/* C/++ functions/methods at top level */
"^([A-Za-z_][A-Za-z_0-9]*([ \t*]+[A-Za-z_][A-Za-z_0-9]*([ \t]*::[ \t]*[^[:space:]]+)?){1,}[ \t]*\\([^;]*)$\n"
/* compound type at top level */
"^((struct|class|enum)[^;]*)$",
/* -- */
"[a-zA-Z_][a-zA-Z0-9_]*"
"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lL]?"
"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->"),
This seems to recognize boundaries reasonably well but doesn’t always leave the function as the first line of the diff hunk, e.g., with #include directives at the top or with a hunk that contains multiple function definitions. An option to tell diff to emit separate hunks for each function changed would be really useful.
This isn’t safety-critical, so I can tolerate some misses. Does that mean I likely have Zawinski’s “two problems”?
I realise this suggestion is a bit tangential, but it may help in order to clarify and rank requirements. This would work for C or C++ ...
Instead of trying to find text blocks which are functions and comparing them, use the compiler to make binary blocks. Specifically, for every C/C++ source file in a change set, compile it to an object. Then use the object code as a basis for comparisons.
This might not be feasible for you, but IIRC there is an option on gcc to compile so that each function is compiled to an 'independent chunk' within the generated object code file. The linker can pull each 'chunk' into a program. (It is getting pretty late here, so I will look this up in the morning, if you are interested in the idea. )
So, assuming we can do this, you'll have lots of functions defined by chunks of binary code, so a simple 'heat' comparison is 'how much longer or shorter is the code between versions for any function?'
I am also thinking it might be practical to use objdump to reconstitute the assembler for the functions. I might use some regular expressions at this stage to trim off the register names, so that changes to register allocation don't cause too many false positive (changes).
I might even try to sort the assembler instructions in the function bodies, and diff them to get a pattern of "removed" vs "added" between two function implementations. This would give a measure of change which is pretty much independent of layout, and even somewhat independent of the order of some of the source.
So it might be interesting to see if two alternative implementations of the same function (i.e. from a different change set) are the same instructions :-)
This approach should also work for C++ because all names have been appropriately mangled, which should guarantee the same functions are being compared.
So, the regular expressions might be kept very simple :-)
Assuming all of this is straightforward, what might this approach fail to give you?
Side Note: This basic strategy could work for any language which targets machine code, as well as VM instruction sets like the Java VM Bytecode, .NET CLR code, etc too.
It might be worth considering building a simple parser, using one of the common tools, rather than just using regular expressions. Clearly it is better to choose something you are familiar with, or which your organisation already uses.
For this problem, a parser doesn't actually need to validate the code (I assume it is valid when it is checked in), and it doesn't need to understand the code, so it might be quite dumb.
It might throw away comments (retaining new lines), ignore the contents of text strings, and treat program text in a very simple way. It mainly needs to keep track of balanced '{' '}', balanced '(' ')' and all the other valid program text is just individual tokens which can be passed 'straight through'.
Its output might be a separate file/function to make tracking easier.
If the language is C or C++, and the developers are reasonably disciplined, they might never use 'non-syntactic macros'. If that is the case, then the files don't need to be preprocessed.
Then a parser is mostly just looking for the function name (an identifier) at file scope followed by ( parameter-list ) { ... code ... }
I'd SWAG it would be a few days' work using yacc & lex / flex & bison, and it might be so simple that there is no need for the parser generator.
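A minimal sketch of such a "dumb" scanner in C++ is shown below. It only tracks brace depth while skipping comments, string literals and character literals, and reports the line range of every top-level { ... } block; changed line numbers from a diff could then be matched against those ranges. The function name top_level_blocks is hypothetical, and the sketch assumes no preprocessor tricks, no raw string literals and no digit separators. It also reports struct and enum bodies, not only functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Returns (first_line, last_line) pairs, 1-based, for every top-level
// '{' ... '}' block in src. Braces inside comments, string literals
// and character literals are not counted.
std::vector<std::pair<int, int>> top_level_blocks(const std::string& src) {
    std::vector<std::pair<int, int>> blocks;
    int line = 1, depth = 0, start_line = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        if (c == '\n') {
            ++line;
        } else if (c == '/' && next == '/') {
            // Line comment: stop just before the newline so that the
            // next iteration counts it.
            while (i + 1 < src.size() && src[i + 1] != '\n') ++i;
        } else if (c == '/' && next == '*') {
            // Block comment: skip to the closing "*/", counting lines.
            i += 2;
            while (i < src.size() &&
                   !(src[i] == '*' && i + 1 < src.size() && src[i + 1] == '/')) {
                if (src[i] == '\n') ++line;
                ++i;
            }
            ++i;  // now on the '/', the loop increment steps past it
        } else if (c == '"' || c == '\'') {
            // Literal: skip to the matching quote, honouring backslashes.
            char quote = c;
            ++i;
            while (i < src.size() && src[i] != quote) {
                if (src[i] == '\\') ++i;  // skip the escaped character
                if (i < src.size() && src[i] == '\n') ++line;
                ++i;
            }
        } else if (c == '{') {
            if (depth++ == 0) start_line = line;
        } else if (c == '}') {
            if (depth > 0 && --depth == 0) blocks.emplace_back(start_line, line);
        }
    }
    assert(depth >= 0);  // unmatched '}' characters are ignored above
    return blocks;
}
```

To attribute a block to a function name, one could look backwards from the opening brace for the identifier that precedes the parameter list; that step is left out here.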
If the code is Java, then ANTLR is a possible, and I think there was a simple Java parser example.
If Haskell is your focus, there may be student projects published which have made a reasonable stab at a parser.

Finite State Machine parser

I would like to parse a self-designed file format with a FSM-like parser in C++ (this is a teach-myself-c++-the-hard-way-by-doing-something-big-and-difficult kind of project :)). I have a tokenized string with newlines signifying the end of a euh... line. See here for an input example. All the comments and junk are filtered out, so I have a std::string like this:
global \n { \n SOURCE_DIRS src \n HEADER_DIRS include \n SOURCES bitwise.c framing.c \n HEADERS ogg/os_types.h ogg/ogg.h \n } \n ...
Syntax explanation:
{ } are scopes, and capitalized words signify that a list of options/files is to follow.
\n are only important in a list of options/files, signifying the end of the list.
So I thought that a FSM would be simple/extensible enough for my needs/knowledge. As far as I can tell (and want my file design to be), I don't need concurrent states or anything fancy like that. Some design/implementation questions:
Should I use an enum or an abstract class + derivatives for my states? The first is probably better for small syntax, but could get ugly later, and the second is the exact opposite. I'm leaning to the first, for its simplicity. enum example and class example. EDIT: what about this suggestion for goto, I thought they were evil in C++?
When reading a list, I need to NOT ignore \n. My preferred way of using the string via stringstream, will ignore \n by default. So I need simple way of telling (the same!) stringstream to not ignore newlines when a certain state is enabled.
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
Here's the draft states I have in mind:
upper: reads global, exe, lib+ target names...
normal: inside a scope, can read SOURCES..., create user variables...
list: adds items to a list until a newline is encountered.
Each scope will have a kind of conditional (e.g. win32:global { gcc:CFLAGS = ... }) and will need to be handled in the exact same fashion everywhere (even in the list state, per item).
Thanks for any input.
If you have nesting scopes, then a Finite State Machine is not the right way to go, and you should look at a Context Free Grammar parser. An LL(1) parser can be written as a set of recursive functions, or an LALR(1) parser can be written using a parser generator such as Bison.
If you add a stack to an FSM, then you're getting into pushdown automaton territory. A nondeterministic pushdown automaton is equivalent to a context free grammar (though a deterministic pushdown automaton is strictly less powerful.) LALR(1) parser generators actually generate a deterministic pushdown automaton internally. A good compiler design textbook will cover the exact algorithm by which the pushdown automaton is constructed from the grammar. (In this way, adding a stack isn't "hacky".) This Wikipedia article also describes how to construct the LR(1) pushdown automaton from your grammar, but IMO, the article is not as clear as it could be.
If your scopes nest only finitely deep (i.e. you have the upper, normal and list levels but you don't have nested lists or nested normals), then you can use a FSM without a stack.
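To show what "adding a stack" looks like in practice, here is a small C++ sketch using the three draft states from the question (upper, normal, list) plus a stack of enclosing states, so that scopes can nest to any depth. It is an illustration under assumptions, not a full parser for the format: the names State, Result, tokenize, is_list_keyword and run are hypothetical, list keywords are assumed to be words made only of capital letters and underscores, and conditionals such as win32: are ignored.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class State { Upper, Normal, List };

struct Result {
    bool ok;         // braces balanced and input ended outside a list
    int max_depth;   // deepest scope nesting seen
    int list_items;  // total number of list items read
};

// Assumption: a list keyword consists only of capitals and underscores.
bool is_list_keyword(const std::string& w) {
    if (w.empty()) return false;
    for (char ch : w) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (!(std::isupper(c) || c == '_')) return false;
    }
    return true;
}

// Splits on spaces and tabs; every newline becomes its own "\n" token.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (c == ' ' || c == '\t' || c == '\n') {
            if (!current.empty()) { tokens.push_back(current); current.clear(); }
            if (c == '\n') tokens.push_back("\n");
        } else {
            current += c;
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

Result run(const std::vector<std::string>& tokens) {
    State state = State::Upper;
    std::vector<State> stack;  // enclosing states: FSM + stack = pushdown automaton
    Result r{true, 0, 0};
    for (const std::string& tok : tokens) {
        if (state == State::List) {
            if (tok == "\n") state = State::Normal;  // newline ends the list
            else ++r.list_items;
            continue;
        }
        assert(state != State::List);
        if (tok == "\n") continue;  // newlines only matter inside lists
        if (tok == "{") {
            stack.push_back(state);
            state = State::Normal;
            if (static_cast<int>(stack.size()) > r.max_depth)
                r.max_depth = static_cast<int>(stack.size());
        } else if (tok == "}") {
            if (stack.empty()) { r.ok = false; return r; }
            state = stack.back();
            stack.pop_back();
        } else if (state == State::Normal && is_list_keyword(tok)) {
            state = State::List;
        }
        // Any other word (e.g. a scope name such as "global") is accepted.
    }
    r.ok = stack.empty();
    return r;
}
```

The enum alone distinguishes only three situations; the stack is what remembers how many scopes are still open, which is exactly the part a plain FSM cannot do for unbounded nesting.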
There are two stages to analyzing a text input stream for parsing:
Lexical Analysis: This is where your input stream is broken into lexical units. It looks at a sequence of characters and generates tokens (analogous to words in spoken or written languages). Finite state machines are very good at lexical analysis provided you've made good design decisions about the lexical structure. From your data above, individual lexemes would be things like your keywords (e.g. "global"), identifiers (e.g. "bitwise", "SOURCES"), symbolic tokens (e.g. "{" "}", ".", "/"), numeric values, escape values (e.g. "\n"), etc.
Syntactic / Grammatic Analysis: Upon generating a sequence of tokens (or perhaps while you're doing so) you need to be able to analyze the structure to determine if the sequence of tokens is consistent with your language design. You generally need some sort of parser for this, though if the language structure is not very complicated, you may be able to do it with a finite state machine instead. In general (and since you want nesting structures in your case in particular) you will need to use one of the techniques Ken Bloom describes.
So in response to your questions:
Should I use an enum or an abstract class + derivatives for my states?
I found that for small tokenizers, a matrix of state / transition values is suitable, something like next_state = state_transitions[current_state][current_input_char]. In this case, the next_state and current_state are some integer types (including possibly an enumerated type). Input errors are detected when you transition to an invalid state. The end of a token is identified when the machine is in a valid end state and no valid transition to another state is available for the next input character. If you're concerned about space, you could use a vector of maps instead. Making the states classes is possible, but I think that's probably making things more difficult than you need.
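A minimal C++ sketch of that matrix approach follows. It recognises only identifiers and integers, treats a single space as the separator outside the table, and uses hypothetical names (State, Table, build_table, tokenize); the table layout is one possible choice, not the only one.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Start is the initial state; Ident and Number are valid end states;
// Invalid means "no transition for this character".
enum State : unsigned char { Start, Ident, Number, Invalid, NumStates };

using Table = std::array<std::array<State, 256>, NumStates>;

// Builds state_transitions[current_state][current_input_char].
Table build_table() {
    Table t;
    for (auto& row : t) row.fill(Invalid);
    for (int c = 0; c < 256; ++c) {
        if (std::isalpha(c) || c == '_') {
            t[Start][c] = Ident;
            t[Ident][c] = Ident;
        } else if (std::isdigit(c)) {
            t[Start][c] = Number;
            t[Ident][c] = Ident;    // digits may continue an identifier
            t[Number][c] = Number;
        }
    }
    return t;
}

struct Token {
    State kind;  // the end state the token finished in
    std::string text;
};

// Appends tokens to out; returns false on a character with no
// transition from the Start state.
bool tokenize(const std::string& input, std::vector<Token>& out) {
    static const Table table = build_table();
    std::size_t i = 0;
    while (i < input.size()) {
        if (input[i] == ' ') { ++i; continue; }  // separator
        State state = Start;
        std::size_t begin = i;
        while (i < input.size()) {
            State next = table[state][static_cast<unsigned char>(input[i])];
            if (next == Invalid) break;  // token ends before this character
            state = next;
            ++i;
        }
        if (state == Start) return false;  // input error
        assert(i > begin);  // an end state implies consumed input
        out.push_back({state, input.substr(begin, i - begin)});
    }
    return true;
}
```

Adding a new token type means adding a state and filling in its row, which is why this style stays manageable for small tokenizers.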
When reading a list, I need to NOT ignore \n.
You can either create a token called "\n", or a more generalized escape token (an identifier preceded by a backslash). If you're talking about identifying line breaks in the source, then those are simply characters you need to create transitions for in your state transition matrix (be aware of the difference between Unix and Windows line breaks, however; you could create an FSM that operates on either).
Will the simple enum states suffice for multi-level parsing (scopes within scopes {...{...}...}) or would that need hacky implementations?
This is where you will need a grammar or pushdown automaton unless you can guarantee that the nesting will not exceed a certain level. Even then, it will likely make your FSM very complex.
Here's the draft states I have in mind: ...
See my comments on lexical and grammatical analysis above.
For parsing I always try to use something already proven to work: ANTLR with ANTLRWorks which is of great help for designing and testing a grammar. You can generate code for C/C++ (and other languages) but you need to build the ANTLR runtime for those languages.
Of course if you find flex or bison easier to use you can use them too (I know that they generate only C and C++ but I may be wrong since I didn't use them for some time).