C++ BNF grammar with parsing/matching examples

I'm developing a C++ parser (for an IDE), so I'm now trying to understand the C++ grammar in detail.
While I've found an excellent grammar source at http://www.nongnu.org/hcb/, I'm having trouble understanding some parts of it - especially which "real" language constructs correspond to the various productions.
So I'm looking for a C/C++ BNF grammar guide with examples of code that match various productions/rules. Are there any?

A hyperlinked (purported) grammar is not necessarily one on which you can build a parser easily. That is determined by the nature of your parsing engine, and which real dialect of C and C++ you care about (ANSI? GNU? C99? C++11? MS?).
Building a working C++ parser is really hard. See my answer to "Why can't C++ be parsed with an LR(1) parser?" for some of the reasons. If you want a "good" parser, I suggest you use one of the existing ones. One worth looking at might be Elsa, since it is open source.

Related

Flex++ Bisonc++ parser

I'm trying to use flex and bison in my project to generate parser code for a file structure. The main programming language is C++, and the project has an OO design and runs mostly in parallel.
I've heard that the parsers flex and bison generate are C code and are not reentrant. Googling, I found flex++ and bisonc++. Unfortunately there is no simple tutorial to get started; most examples are based on bison/flex. Some people have somehow integrated bison/flex parsers into their C++ code, which is supposed to be "tricky"...
The documentation of flex++ and bisonc++ doesn't help me much. The tutorials and examples all get their input from stdin and print some messages to stdout.
I need these features in my parser:
The parser should be a C++ class, defined in the normal manner (a header and a .cpp file).
The parser receives data from a std::string, a std::stringstream, or a null-terminated char*.
I feel so confused. Should I use flex++/bisonc++ or flex/bison? And how do I do that while satisfying the above conditions?
There are flex/bison, flex++/bison++ and flexc++/bisonc++. I think it's best to pick one of these three pairs, instead of mixing/matching flex++ and bisonc++.
Here are the user guides for Flexc++ and Bisonc++.
From the Flexc++ website:
Flexc++, contrary to flex and flex++, generates code that is explicitly intended for use by C++ programs. The well-known flex(1) program generates C source-code and flex++(1) merely offers a C++-like shell around the yylex function generated by flex(1) and hardly supports present-day ideas about C++ software development. Contrary to this, flexc++ creates a C++ class offering a predefined member function lex matching input against regular expressions and possibly executing C++ code once regular expressions were matched. The code generated by flexc++ is pure C++, allowing its users to apply all of the features offered by that language.
From the Bisonc++ website:
Bisonc++ is a general-purpose parser generator that converts a grammar description for an LALR(1) context-free grammar into a C++ class to parse that grammar. Once you are proficient with bisonc++, you may use it to develop a wide range of language parsers, from those used in simple desk calculators to complex programming languages. Bisonc++ is highly comparable to the program bison++, written by Alain Coetmeur: all properly-written bison++ grammars ought to be convertible to bisonc++ grammars after very little or no change. Anyone familiar with bison++ or its precursor, bison, should be able to use bisonc++ with little trouble. You need to be fluent in using the C++ programming language in order to use bisonc++ or to understand this manual.
So flexc++/bisonc++ are more than just wrappers around the old flex/bison utilities. They generate complete C++ classes to be used for re-entrant scanning / parsing.
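As a rough illustration of what that looks like in practice (assuming flexc++'s default Scanner class name plus the std::istream-based constructor and lex()/matched() members its manual describes; check the generated Scanner.h for the exact interface), feeding the scanner from a std::string could look like this:

    // Hedged sketch only: the exact class and member names come from the
    // flexc++-generated Scanner.h, so treat this as an outline.
    #include <sstream>
    #include <string>
    #include "Scanner.h"            // generated by flexc++

    void tokenize(const std::string &text)
    {
        std::istringstream in(text);    // any std::istream can be the source
        Scanner scanner(in);            // the scanner reads from the string stream
        while (int token = scanner.lex()) {
            // scanner.matched() is expected to hold the matched token text
        }
    }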
Flex can generate a reentrant C scanner. See Section 19 Reentrant C scanners in the Flex manual.
Similarly, Bison can generate a reentrant C parser. See Section 3.8.11 A Pure (Reentrant) Parser in the Bison manual for details.
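As a rough sketch of how plain reentrant flex also satisfies the std::string requirement above: with %option reentrant (and, say, %option header-file="lexer.h", a name chosen here purely for illustration, and actions that return token codes), the generated scanner can be driven from an in-memory buffer like this:

    // Hedged sketch: drive a reentrant flex scanner from a std::string
    // instead of stdin.
    #include <string>
    #include "lexer.h"          // generated by flex; declares yyscan_t etc.

    void scan(const std::string &input)
    {
        yyscan_t scanner;
        yylex_init(&scanner);                               // per-scanner state
        YY_BUFFER_STATE buf = yy_scan_string(input.c_str(), scanner);
        while (int token = yylex(scanner)) {
            // handle 'token'; yyget_text(scanner) gives the matched lexeme
        }
        yy_delete_buffer(buf, scanner);
        yylex_destroy(scanner);
    }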
Do you absolutely need the parser to be a C++ class that takes its data from a std::string/std::stringstream?
Have you looked at Boost.Spirit as an alternative?
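For reference, here is a minimal Boost.Spirit (Qi) sketch that parses directly out of a std::string with no generated scanner at all; the grammar is just a comma-separated list of integers, chosen only to keep the example short:

    // Hedged sketch: Boost.Spirit.Qi parsing a comma-separated list of
    // integers straight out of a std::string.
    #include <boost/spirit/include/qi.hpp>
    #include <string>
    #include <vector>

    bool parse_numbers(const std::string &input, std::vector<int> &out)
    {
        namespace qi    = boost::spirit::qi;
        namespace ascii = boost::spirit::ascii;

        auto first = input.begin();
        auto last  = input.end();
        bool ok = qi::phrase_parse(first, last,
                                   qi::int_ % ',',   // grammar: int (',' int)*
                                   ascii::space,     // skip whitespace
                                   out);             // parsed values land here
        return ok && first == last;                  // require a full match
    }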
The LRSTAR product (an LR(k) parser and DFA lexer generator) is C++ based. It runs on Windows and ships with six Visual Studio projects. The code also compiles with gcc and other compilers. There are classes for the lexer and parser, the symbol table, and the AST. Complete source code is available. It gets good reviews. I should know, I am the author.

searching for a BNF (for yacc) grammar of C++

I found something similar here: Where can I find standard BNF or YACC grammar for C++ language?
But the download links don't work anymore; does anybody know where I can download it now?
C++ is not a context-free language, and therefore cannot be accurately described by a plain BNF grammar or parsed by a yacc-generated parser alone. However, it is possible to parse a superset of the language with those tools and then apply additional contextual processing to the parsed structure.
Looking here: http://www.parashift.com/c++-faq-lite/compiler-dependencies.html#faq-38.11, I found this: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
Depending on your task, you might want to use an existing C++ frontend instead.
The EDG Compiler Frontend and the Clang frontend have both been designed to be usable independently of "pure compilation".
Clang notably features accurate token locations and, for example, includes "rewrite" tools that can be used to modify existing code.

Is Vala a sane language to parse, compared to C++?

The problems with parsing C++ are well known. It can't be parsed purely based on syntax, it can't be done with LALR (whatever the exact term is, I'm not a language theorist), the language spec is a zillion pages, etc. For these and other reasons I'm deciding on an alternative language for my personal projects.
Vala looks like a good language. Although it provides many improvements over C++, is it just as troublesome to parse? Or does it have a neat, reasonably short formal grammar, or some other logical description, suitable for building parsers for compilers, source analyzers and other tools?
Whatever the answer, does that go for the Genie alternative syntax?
(I also wonder, albeit less intensely, about D and other post-C++ non-VM languages.)
C++ is one of the most complex (if not the most complex) programming languages in common use to parse. Of particular difficulty are its name lookup rules and template instantiation rules. C++ is not parsable using an LALR(1) parser (such as the parsers generated by Bison and Yacc), but it is by all means parsable; after all, people use parsers that have no problem parsing C++ every day. (In fact, early versions of G++ were built on top of Bison's Generalized LR parser framework (actually not, see comments) before it was more recently replaced with a hand-written recursive descent parser.)
On the other hand, I'm not sure I see what "improvements" Vala offers over C++. The two languages appear to be trying to accomplish the same goals. Also, you're probably not going to find much outside of GTK+ written with Vala interfaces; you're going to be using C interfaces to everything else, which really defeats the point of using such a language.
If you don't like C++ due to its complexity, it might be a good idea to consider Objective-C instead: it is a simple extension of C (like Vala), but it has a much larger community of programmers for you to draw upon, given its role as the foundation for everything in Mac land.
Finally, I don't see what the difficulty of parsing a language has to do with what a programmer should care about when deciding whether to use it. Just my 2 cents.
It's pretty simple. You can use libvala to do the parsing, semantic analysis and code generation instead of writing your own.

How to turn type-labeled tokens into a parse-tree?

So I'm writing a programming language in C++. I've written pretty much all of it except for one little bit where I need to turn my tokens into a parse tree.
The tokens are already type-labeled and ready to go, but I don't want to go through the effort of writing my own parse-tree generator. I've been looking around for tools to do this but keep running into very complicated or overzealous ones, and all I want is to turn a list of token types into a parse tree, nothing more, nothing less. Thanks in advance!
The simplest parser generator is yacc (or bison).
Bison is just a hairy yacc (i.e. it has more options).
One of those options is to generate a C++ parser object (rather than a C function).
Just add the following to the yacc file:
%skeleton "lalr1.cc"
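For context, a minimal driver for the class bison then generates might look like the sketch below. The header name is illustrative, and you still have to supply yylex and the parser's error() member yourself:

    // Hedged sketch: with %skeleton "lalr1.cc", bison emits a C++ parser
    // class (by default yy::parser) instead of a C yyparse() function.
    #include "parser.hh"        // header generated by bison (name may differ)

    int main()
    {
        // yylex() and yy::parser::error() must be provided elsewhere;
        // they are omitted from this sketch.
        yy::parser parser;      // default namespace/class name
        return parser.parse();  // returns 0 on success, non-zero on a syntax error
    }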
The canonical parser generator is called yacc. There's a GNU version of it called bison. These are both C based tools, so they should integrate nicely with your C++ code. There is a tool for Java called ANTLR, which I've heard very good things about (i.e. it's easy to use and powerful). Keep in mind that with yacc or bison you will have to write a grammar in their language. This is most certainly doable, but not always easy. It's important to have a theoretical background in LR(k) parsing so you can understand what it means when it tells you to fix your ambiguous grammar.
Depending on what exactly your requirements are, Boost.Spirit might be an alternative. It's modular, so you should be able to use only the components you need.

Good tools for creating a C/C++ parser/analyzer [closed]

What are some good tools for getting a quick start for parsing and analyzing C/C++ code?
In particular, I'm looking for open source tools that handle the C/C++ preprocessor and language. Preferably, these tools would use lex/yacc (or flex/bison) for the grammar, and not be too complicated. They should handle the latest ANSI C/C++ definitions.
Here's what I've found so far, but haven't looked at them in detail (thoughts?):
CScope - Old-school C analyzer. Doesn't seem to do a full parse, though. Described as a glorified 'grep' for finding C functions.
GCC - Everybody's favorite open source compiler. Very complicated, but seems to do it all. There's a related project for creating GCC extensions called GEM, but hasn't been updated since GCC 4.1 (2006).
PUMA - The PUre MAnipulator. (From the page: "The intention of this project is to provide a library of classes for the analysis and manipulation of C/C++ sources. For this purpose PUMA provides classes for scanning, parsing and of course manipulating C/C++ sources.") This looks promising, but hasn't been updated since 2001. Apparently PUMA has been incorporated into AspectC++, but even this project hasn't been updated since 2006.
Various C/C++ raw grammars. You can get c-c++-grammars-1.2.tar.gz, but this has been unmaintained since 1997. A little Google searching pulls up other basic lex/yacc grammars that could serve as a starting place.
Any others?
I'm hoping to use this as a starting point for translating C/C++ source into a new toy language.
Thanks!
-Matt
(Added 2/9) Just a clarification: I want to extract semantic information from the preprocessor in addition to the C/C++ code itself. I don't want "#define foo 42" to disappear into the integer "42", but to remain attached to the name "foo". (This, unfortunately, excludes several solutions that run the preprocessor first and only deliver the C/C++ parse tree.)
Parsing C++ is extremely hard because the grammar is undecidable. To quote Yossi Kreinin:
Outstandingly complicated grammar
"Outstandingly" should be interpreted literally, because all popular languages have context-free (or "nearly" context-free) grammars, while C++ has undecidable grammar. If you like compilers and parsers, you probably know what this means. If you're not into this kind of thing, there's a simple example showing the problem with parsing C++: is AA BB(CC); an object definition or a function declaration? It turns out that the answer depends heavily on the code before the statement - the "context". This shows (on an intuitive level) that the C++ grammar is quite context-sensitive.
You can look at Clang, which uses LLVM, for parsing.
It fully supports C++ now (link).
The ANTLR parser generator has a grammar for C/C++ as well as the preprocessor. I've never used it so I can't say how complete its parsing of C++ is going to be. ANTLR itself has been a useful tool for me on a couple of occasions for parsing much simpler languages.
Depending on your problem GCCXML might be your answer.
Basically it parses the source using GCC and then gives you an easily digestible XML version of the parse tree.
With GCCXML you are done once and for all.
pycparser is a complete parser for C (C99) written in Python. It has a fully configurable AST backend, so it can be used as a basis for any kind of language processing you might need.
It doesn't support C++, though. Granted, C++ is much harder to parse than C.
Update (2012): at this time the answer, without any doubt, would be Clang - it's modular, supports the full C++ standard (with many C++11 features), and has a relatively friendly code base. It also has a C API for bindings to high-level languages (e.g. for Python).
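For instance, here is a minimal sketch (not from the original answer) of walking an AST with Clang's stable C API, libclang; the file name input.cpp is just a placeholder:

    // Hedged sketch: dump the AST of input.cpp with libclang, Clang's C API.
    #include <clang-c/Index.h>
    #include <cstdio>

    static CXChildVisitResult visit(CXCursor c, CXCursor /*parent*/, CXClientData)
    {
        CXString kind = clang_getCursorKindSpelling(clang_getCursorKind(c));
        CXString name = clang_getCursorSpelling(c);
        std::printf("%s: %s\n", clang_getCString(kind), clang_getCString(name));
        clang_disposeString(kind);
        clang_disposeString(name);
        return CXChildVisit_Recurse;    // keep descending into the tree
    }

    int main()
    {
        CXIndex index = clang_createIndex(0, 0);
        CXTranslationUnit tu = clang_parseTranslationUnit(
            index, "input.cpp", nullptr, 0, nullptr, 0, CXTranslationUnit_None);
        if (tu) {
            clang_visitChildren(clang_getTranslationUnitCursor(tu), visit, nullptr);
            clang_disposeTranslationUnit(tu);
        }
        clang_disposeIndex(index);
        return 0;
    }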
Have a look at how doxygen works; full source code is available, and it's flex-based.
A misleading candidate is GOLD, a free Windows-based parser toolkit explicitly for creating translators. Its list of supported languages refers to the languages in which one can implement parsers, not the list of supported parse grammars.
It only has grammars for C and C#, no C++.
Parsing C++ is a very complex challenge.
There's the Boost.Spirit framework, and a couple of years ago they did play with the idea of implementing a C++ parser with it, but it's far from complete.
Fully and properly parsing ISO C++ is far from trivial, and there have in fact been many related efforts. But it is an inherently complex job that isn't easily accomplished without writing a full compiler frontend that understands all of C++ and the preprocessor. A preprocessor implementation called "wave" is available from the Spirit folks.
That said, you might want to have a look at pork/oink (Elsa-based), a C++ parser toolkit specifically meant to be used for source code transformation. It is being used by the Mozilla project to do large-scale static source code analysis and automated code rewriting; the most interesting part is that it supports not only most of C++, but also the preprocessor itself!
On the other hand there's indeed one single proprietary solution available: the EDG frontend, which can be used for pretty much all C++ related efforts.
Personally, I would check out the Elsa-based pork/oink suite which is used at Mozilla. Apart from that, the FSF has now approved work on gcc plugins using the runtime library license, so I'd assume that things are going to change rapidly once people can easily leverage the gcc-based C++ parser for such purposes using binary plugins.
So, in a nutshell: if you have the bucks: EDG; if you need something free/open source now: pork/oink are fairly promising; if you have some time, you might want to use gcc for your project.
Another option just for C code is cscout.
The grammar for C++ is sort of notoriously hairy. There's a good thread at Lambda about it, but the gist is that C++ grammar can require arbitrarily much lookahead.
For the kind of thing I imagine you might be doing, I'd think about hacking either Gnu CC, or Splint. Gnu CC in particular does separate out the language generation part pretty thoroughly, so you might be best off building a new g++ backend.
Actually, PUMA and AspectC++ are still both actively maintained and updated. I was looking into using AspectC++ and was wondering about the lack of updates myself. I e-mailed the author, who said that both AspectC++ and PUMA are still being developed. You can get the source code through SVN at https://svn.aspectc.org/repos/ or regular binary builds at http://akut.aspectc.org. As with a lot of excellent C++ projects these days, the author doesn't have time to keep up with web page maintenance. Makes sense if you've got a full time job and a life.
How about something easier to comprehend, like Tiny C or Small C?
Elsa beats everything else I know hands down for C++ parsing, even though it is not 100% compliant. I'm a fan. There's a module that prints out C++, so that may be a good starting point for your toy project.
See our C++ Front End for a full-featured C++ parser: it builds ASTs and symbol tables and does name and type resolution. You can even parse and retain the preprocessor directives. The C++ front end is built on top of our DMS Software Reengineering Toolkit, which allows you to use that information to carry out arbitrary source code changes using source-to-source transformations.
DMS is the ideal engine for implementing such a translator.
Having said that, I don't see much point in your imagined task; I don't see much value in trying to replace C++, and you'll find building a complete translator an enormous amount of work, especially if your target is a "toy" language. And there is likely little point in parsing C++ using a robust parser, if its only purpose is to produce an isomorphic version of C++ that is easier to parse (wait, we postulated a robust C++ parser already!).
EDIT May 2012: DMS's C++ front end now handles GCC3/GCC4/C++11, Microsoft VisualC 2005/2010. Robustly.
EDIT Feb 2015: Now handles C++14 in GCC and MS dialects.
EDIT August 2015: Now parses and captures both the code and the preprocessor directives in a unified tree.
EDIT May 2020: Has been doing C++17 for the past few years. C++20 in process.
A while back I attempted to write a tool that would automatically generate unit tests for C files.
For preprocessing I put the files through GCC. The output is ugly, but you can easily trace where each part of the preprocessed file came from in the original code. For your needs, though, you might need something else.
I used Metre as the base for a C parser. It is open source and uses lex and yacc. This made it easy to get up and running in a short time without fully understanding lex & yacc.
I also wrote a C app since the lex & yacc solution could not help me trace functionality across functions and parse the structure of the entire function in one pass. It became unmaintainable in a short time and was abandoned.
What about using a tool like GNU cflow, which can analyze the code and produce call graphs? Here's what the Open Group (man page) has to say about cflow. The GNU version of cflow comes with source and is open source as well...
Hope this helps,
Best regards,
Tom.