Given an input like this (USER DEFINED FORMAT):
type dog<
int years
char[] name
>
How can I generate 2 or more differents files like these:
file1.c
------------
struct dog{
int years
char name
}
file2.cpp
-------------
class dog{
int years
string name
%get and set methods
}
Is a parser generator like flex and bison the best way? Or are there better way?
It all depends on the complextity of your language and the transformations you intend.
Option 1: hand-crafted ad-hoc processing
If your transformations are simple and mostly convert some tokens from your source language into the target language (e.g "type" into "struct" and "<>" into "{}"), you could think of a hand-crafted parsing using string operations or regexes using std::regex.
But this is not sufficient here: you must identify statements in your source langage to generate semi-column separated c and c++ statements. Furhtermore, you must related the [] to the right element. Furthermore, to generate correct getter/setter your code must get understanding of the statement content.
So yes, here you'll need a lexer and a parser.
Option 2: hand-crafted parser
For "simple" languages, you could craft easily a recursive descent parser. What does "simple" mean ? It means that the language which can be described with a LL(1) grammar. Without entering now too much in compiler theory, it's more or less languages where one token ahead uniquely determine which grammar construct you have.
Note that you have a very simple calculator example designed this way in the "C++ Programming language" from Bjarne Stroustrup.
Option 3: use a compiler generator
While lexical scanners and recursive descent parsers are relatively easy to code, it's still somehow reinventing the wheel. And for languages which are not LL(1) compliant, you have no other choice than using a compiler generator.
Flex/bison is then a very reasonable and widely supported choice. But you have to become familiar with its logic, grammar files, and code templates. It's not something that you could master in a couple of hours.
If you have to support a custom language professionally, it's really worth the investment. If it's a small project, the learning curve could play against you.
Option 4: lightweight boost alternative
Another, lighter, approach could be to use boost spirit, which bases itself on c++ and is easier to learn: you express the grammar rule directly in C++. There are too main components: Lex for lexical scanner and Qi for parsers.
It's ok for small to medium parsers (like a data definition language), but it has some serious limitations if you have a full language. And by the way, as far as I know, it's limited to LL(k) grammars.
Related
I need to create a parser to C++ 14. This parser must be created using C++ in order to enable a legacy code generation reuse. I am thinking to implement this using ANTLR3 (because ANTLR4 doesn't target C++ code yet).
My doubt is if ANTLR3 can parse C++ since it doesn't use the Adaptive LL(*) algorithm like ANTLR4.
Most classic parser generators cannot generate a parser that will parse a grammar for an arbitrary context free language. The restrictions of the grammars they can parse often gives rise to the name of the class of parser generators: LL(k), LALR, ... ANTLR3 is essentially LL; ANTLR4 is better but still not context free.
Earley, GLR, and GLL parser generators can parse context free languages, sometimes with high costs. In practice, Earley tends to be pretty slow (but see the MARPA parser generator used with Perl6, which I understand to be an Earley variant that is claimed to be reasonably fast). GLR and GLL seem to produce working parsers with reasonable performance.
My company has built about 40 parsers for real languages using GLR, including all of C++14, so I have a lot of confidence in the utility of GLR.
When it comes to parsing C++, you're in a whole other world, mostly because C++ parsing seems to depend on collecting symbol table information at the same time. (It isn't really necessary to do that if you can parse context-free).
You can probably make ANTLR4 (and even ANTLR3) parse C++ if you are willing to fight it hard enough. Essentially what you do is build a parser which accepts too much [often due to limitations of the parser generator class], and then uses ad hoc methods to strip away the extra. This is essentially what the hand-written GCC and Clang parsers do; the symbol table information is used to force the parser down the right path.
If you choose to go down this path of building your own parser, no matter which parser generator you choose, you will invest huge amounts of energy to get a working parser. [Been here; done this]. This isn't a good way to get on with whatever your intended task motivates this parser.
I suggest you get one that already works. (I've already listed two; you can find out about our parser through my bio if you want).
That will presumably leave you with a working parser. Then you want to do something with the parse tree, and you'll discover that Life After Parsing requires a lot of machinery that the parsers don't provide. Google the phrase to find my essay on the topic or check my bio.
I noticed something when looking the kind of language of a software (openFOAM) which is written in C++. It was something like,
field value;
For example,
temperature 25;
I wonder how this works. I mean how temperature is set to 25 without using equality sign. Any idea?
Because openFOAM has a parser that understands that format.
Such a parser is not hard to make, and C++ (with the help of a parser-generator tool like yacc, bison) is a popular choice for parsers because it can be very fast.
C++ is a Turing-complete language, therefore it can do anything that any other language can. Specifically, it can process data that doesn't look like C++ code.
Programming languages are typically context free. If the language is context free it can be parsed (I won't say it's trivial but it's a problem that has been solved so many times before and if you take a compilers class you'll have to do it for simple langauges).
The parser is simply looking for a declarations of the format field value;. It appears there is no type check happening so it is simply splitting on semi colon then on the space. Read about parsing context-free langauges, push down automata, and context free grammars if you're interested in learning about parsing source code.
I need to parse a language which is similar to a minimalized version of Java. Since effiency is the most important factor I choose for a hand written parser instead of LRAR parser generators like GOLD, bison and yacc.
However I can't find the theory behind good hand written parsers. It seems like there are only tutorials on those generators and the mechanism behind it.
Do I have to drop using regular expressions? Because I can imaging they are slow compared to hand written tokiners.
Does anybody know a good class or tutorial for hand written parsing?
In case it helps, here is (not a class or a tutorial but) an example of a hand-written parser: https://github.com/tabatkins/css-parser (however it's explicitly coded for correct/simple correspondence to the specification, and not for optimized for high performance).
The bigger problem is, I expect, to develop the specification for the parsing. Examples of parser specifications include http://dev.w3.org/csswg/css3-syntax/ and a similar one for parsing HTML5.
The prerequisite to using a parser generator is that the language syntax has been defined by a grammar (where the grammar format is supported by the parser generator), instead of by a parsing algorithm.
I'm using boost::spirit to write a parser, lexer
here is what i want to do. i want to put functions and classes into data structures with the variables they use. so i want to know what would be the best way to do this.
and what parts of boost::spirit would be best to use for this?
the languages i want to use this on are C/C++/C#/Objective C/Objective C++.
also the language I'm writing it in is C++ only i'm not to good with the other languages i know
Spirit is a fine tool, but it is not the best tool for all parsing tasks. And for the task of parsing actual C++, it is pretty terrible.
Spirit excels at small-to-mid-level parsing tasks. Language that are fairly regular, with standardized grammars and so forth. C and C-derived languages are generally too complicated for Spirit to handle. It's not that you can't write Spirit parsing code for them. It's just that doing so would be too difficult to build and maintain, given Spirit's general design.
I would suggest downloading Clang if you want a good C or C++ (or Objective variants thereof) parser. It is a compiler as well, but it's designed to be modular, so you can just link to the parsing part of it.
Is it possible to create an AST for any arbitrary programming language or IR using C or C++ alone (without the help of tools like YACC and LEX )?
If so, how to implement the lexical and syntactic analysis ?
If not, what are the tools that have to augmented to C or C++ to successfully create an AST ?
Hope I made my doubt clear. If My question looks vague or out of context, please indicate the required.
P.S : I Am actually trying to create the AST for the LLVM's .ll format of IR representation. I do know that .ll is derived from AST. But I am trying out static analysis practices. So I am looking at creating the AST.
The most straight-forward methodology for creating the parser without a parser-generator is recursive descent. It is very well documented - the standard book in the field is The Dragon Book.
A scanner, that takes text as input and produces a string of tokens as output, can be written using standard string manipulation techniques.
I doubt there's a one-to-one mapping between your arbitrary langauge and LLVM's ASTs.
That means it is likely that you really want to do this in two stages:
Parse your 'arbitrary language' using the best parsing tools you can get to simplify the problem of parsing your language. Use that to build an AST for your language, following standard methods for parser generators producing ASTs. LEX/YACC are OK but there are plenty of good alternatives out there. Its pretty likely you'll need to build a symbol table.
Walk the AST of the your parsed langauge to build your LLVM AST. This won't be one-to-one, but the ability to look around the tree near a tree node in your AST to collect information need to generate the LLVM code will likely be extremely helpful.
This is a classic style for a simple compiler.
I suggest you read the Aho/Ullman Dragon book on syntax directed translation. A day's worth of education will save you months of wasted engineering time.