Partially parse C++ for a domain-specific language - c++

I would like to create a domain specific language as an augmented-C++ language. I will need mostly two types of contructs:
Top-level constructs for specialized types or declarations
In-code constructs, i.e. to add primitives to make functions calls or idiom easier
The language will be used for scientific computing purposes, and will ultimately be translated into plain C++. C++ has been chosen as it seems to offer a good compromise between: ease of use, efficiency and availability of a wide range of libraries.
A previous attempt using flex and bison failed due to the complexity of the C++ syntax. The existing parser can still fail on some constructs. So we want to start over, but on better bases.
Do you know about similar projects? And if you attempted to do so, what tools would you use? What would be the main pitfalls? Would you have recommendations in term of syntax?

There are many (clever) attempts to have domain specific languages within the C++ language.
It's usually called DSEL for Domain Specific Embedded Language. For example, you could look up the Boost.Spirit syntax, or Boost.rdb (in the boost vault).
Those are fully compliant C++ libraries which make use of C++ syntax.
If you want to hide some complexity, you might add in a few macros.
I would be happy to provide some examples if you gave us something to work with :)

You can try extending an open source Elsa C++ parser (it is now a part of a Mozilla's Pork project):
https://wiki.mozilla.org/Pork

The way to extend C++ is not to try to extend the language, which will be extremely difficult and probably break as new base compiler releases implement new features, but to write class libraries to support your problem domain. This has been what C++ programming has been all about since the language's inception.

If you really want to extend C++, you'll need a full C++ parser plus name and type resolution. As you've found out, this is pretty hard. Your best solution is to get an existing one and modify it.
Our DMS Software Reengineering Toolkit is an infrastructure for implementing langauge processors. It is
designed to support the construction of tools that parse languages, carry out transformations, and spit out the same language (with enhanced code) or a different language/dialect.
DMS has a full C++ Front End, that parses C++, builds abstract syntax trees and symbol tables (e.g., all that name and type resolution stuff).
The DMS/C++ front end is provided with DMS in source form, so that it can be customized to achieve the kind of effect you want. You'd define your DSL as an extension of the C++ front end, and then write transformations that convert your special constructs into "vanilla" C++ constructs, and then spit out compilable result.
DMS/C++ have been used for a wide variety of transformation tasks, including ones that involved extending C++ as you've described, and including tasks that carry out massive reorganizations of large C++ applications. (See the Publications at that website).

To solve you first bullet, maybe you can use C++0x new features "initializer lists", and "user defined litterals" avoiding the need for a new parser. They may help for the second bullet, too.

Related

Lisp/Scheme DSEL in C++

I came across the following post on the boost mailing lists (emphasis mine):
hello all,
does anybody know of an existing spirit/lisp implimentation, and is there
any interest in developing such a project in open source?
None yet, AFAIK.
I'll be writing an example for Spirit2
to complement the tiny-C virtual
machine in there. What's equally
interesting though is that scheme (or
at least a subset of it) can be
implemented in pure c++. No parsing,
just pure DSEL in C++. Now, imagine a
parser that targets this DSEL (through
C++) -- a source to source translator.
Essentially, your scheme code will be
compiled into highly efficient C++.
Has anyone actually done this? I would be very interested in such a DSEL.
I wrote a Lisp-like language called Funky using Spirit in C++. An Open Source version is available at http://funky.vlinder.ca. It wouldn't take too much to turn that into a Lisp-like to C++ translator.
Actually, what it would take is a run-time support library to provide generic closure times and somesuch: if you want to turn the Lisp code into efficient C++, you will basically need C++ classes (functors, etc.) to do the heavy lifting once you get to run-time, so your Lisp-to-C++ translator would need to:
parse the Lisp
create an AST from the Lisp
transform the AST to optimize it, if possible (optimizations in Lisp are different from optimizations in C++, so if you want rally fast C++, you have to optimize the Lisp and let your C++ compiler optimize the generated C++)
generate the C++, for which you'd rely on your run-time support library for things like built-in functions, functor types, etc.
If you were to start from Funky, you'd already have the parse and the AST (though Funky doesn't optimize the AST), so you could go from there an create the run-time and generate the C++...
It wouldn't be overly complicated to write one from scratch either: Lisp grammar isn't that difficult, so most of the work would go into the AST and the run-time support.
If I weren't writing an object-oriented DSL right now, I might try my hand at this.
scheme to (readable) c++
http://www.suri.cs.okayama-u.ac.jp/servlets/APPLICATION.rkt
How about this
Not sure if this is what you want, but:
http://howtowriteaprogram.blogspot.com/2010/11/lisp-interpreter-in-90-lines-of-c.html
It looks like a start, at least.

Need C++ parser

I need a good, stable and, maybe, easy to use C++ parser library with C/C++ interface (C is preferred).
I hear that cint is good c++ interpreter. Can I use it (or some part of it) for this purpose?
Any suggestions?
See: http://clang.llvm.org/
It has both a C++ and a C interface (libclang).
C++ parsing is famously hard. AFAIK there are only three parsers that are acceptable by todays standards: EDG (widely used as a frontend in popular C++ compilers), GCC's and Microsoft's. And apparently, Microsoft has started using EDG's parser in VS2010, for Intellisense.
When you're looking at the free options, you're pretty much stuck at GCC. It can produce XML, though, so the easy part is there. (Easy by C++ parsing standards, that is)
Clang is the most up-to-date and mature option, with a decent C++ API (but no plain C). Elsa is a bit out of date and unmaintained, but still a usable choice. Both could be used as libraries as well as standalone XML frontends.
If you want to parse C or C++ code, there are some options:
http://bellard.org/tcc/
http://students.ceid.upatras.gr/~sxanth/ncc/
If you want to create a parser using C/C++, you can try:
http://boost-spirit.com/home/
http://dinosaur.compilertools.net/ Lex and Yacc
http://www.codeguru.com/csharp/.net/net_general/patterns/article.php/c12805 Flex and Bison
Our C++ Front End is able to parse a variety of C++ dialects (ANSI, GCC, MSVS), automatically builds ASTs whose nodes are marked with precise source positions and are decorated with any nearby comment text, and builds a full symbol table. (EDIT Jan 2013: the C++ front end has been able to handle C++11 for quite awhile now).
The C++ front end is built on top of our DMS Software Reengineering Toolkit, generalized compiler technology for program analysis and transformation, designed to support custom tool building. The C++ front end includes a preprocessor, in which the preprocessor directives can be expanded or not collectively or individually as appropriate for the task. It also includes full symbol construction with all the nasty Koenig lookup stuff.
DMS accepts explicit language definitions (that's how it understands C++; there are also fron ends for C, C#, Java, COBOL, and variety of other languages). DMS provides general parsing, symbol table building, flow analysis machinery, procedural APIs for tree navigation/inspection/modification, source-to-source transformation, and AST-to-source text regeneration including the original comments, number radices, etc. All of these capabilities are available for use by the C++ Front End.
DMS is also designed to handle the scale required for serious tasks. Often you need not just one compilation unit (which is what GCC will give you at best) but access to an entire set. DMS has been used to analyze/transform thousands of C++ compilation units, and literally tens of thousands of C compilation units (on a 25 million line application).
"Easy to use library" is an oxymoron when it comes to program manipulation tools. The langauges themselves are complex (C++ being one of the most difficult and getting worse with C++0X) and that induces complexity in the nature of the questions you can ask and what the answers look like (e.g. "are there any template instantions that can modify local variable X in method Y in class C in any namespace N?"). The questions themselves are hard.
What you want is a library with the necessary complexity to let you carry off your task. DMS has been under continuous development for the last 15 years, to provide that necessary complexity. If you want to do serious program processing, I claim you will need that information.
As proof, DMS has been used to carry out massive automated reengineering of C++-based mission avionics software for Boeing. I don't believe there are any other tools that can do this. (Clang looks to be trying, but only for C++. YMMV).
I don't know for cint, but I heard people use gcc-xml for this.
I have been looking for a good stand-alone library too, but haven't found any.
If you're feeling brave the links in the answer to "is there a yacc-able C++ grammar?" might be helpful. Gcc-xml and clang have already been suggested and Swig also has an XML output which depending on what you're trying to achieve might be relevant.
I did not try it, but I think that best choice will be getting modules for parsing from some popular open source compiler like gcc for C++;
Maybe you'll find something interesting here http://www.nobugs.org/developer/parsingcpp/

Is Vala a sane language to parse, compared to C++?

The problems parsing C++ are well known. It can't be parsed purely based on syntax, it can't be done as LALR (whatever the term is, i'm not a language theorist), the language spec is a zillion pages, etc. For that and other reasons I'm deciding on an alternative language for my personal projects.
Vala looks like a good language. Although providing many improvements over C++, is just as troublesome to parse? Or does it have a neat, reasonable length formal grammar, or some logical description, suitable for building parsers for compilers, source analyzers and other tools?
Whatever the answer, does that go for the Genie alternative syntax?
(I also wonder albeit less intensely about D and other post-C++ non-VM languages.)
C++ is one of the most complex (if not the most complex) programming language to parse in common use. Of particular difficulty is it's name lookup rules and template instantiation rules. C++ is not parsable using a LALR(1) parser (such as the parsers generated by Bison and Yacc), but it is by all means parsable (after all, people use parsers which have no problem parsing C++ every day). (In fact, early versions of G++ were built on top of Bison's Generalized LR parser framework Actually not, see comments) before it was more recently replaced with a hand written recursive descent parser)
On the other hand, I'm not sure I see what "improvements" Vala offers over C++. The languages look to attempt to accomplish the same goals. On the other hand, you're probably not going to find much outside of GTK+ written with Vala interfaces. You're going to be using C interfaces to everything else, which really defeats the point of using such a language.
If you don't like C++ due to it's complexity, it might be a good idea to consider Objective-C instead, because it is a simple extension of C, (like Vala), but has a much larger community of programmers for you to draw upon given it's foundation for everything in Mac land.
Finally, I don't see why the difficulty of parsing the language itself has to do with what a programmer should be caring about in order to use the language. Just my 2 cents.
It's pretty simple. You can use libvala to do both parsing, semantic analyzing and code generation instead of writing your own.

Official C++ language subsets

I mainly use C++ to do scientific computing, and lately I've been restricting myself to a very C-like subset of C++ features; namely, no classes/inheritance except complex and STL, templates only used for find/replace kinds of substitutions, and a few other things I can't put in words off the top of my head. I am wondering if there are any official or well-documented subsets of the C++ language that I could look at for reference (as well as rationale) when I go about picking and choosing which features to use.
There is Embedded C++. It sounds mostly similar to what you're looking for.
Google publishes its internal C++ style guide, which is often referred to as such a subset: https://google.github.io/styleguide/cppguide.html . Ben Maurer, whose company reCAPTCHA was acquired by Google, describes it as follows in this post on Quora:
You can basically think of Google's
C++ subset as C plus a bit of sugar:
The ability to add methods to structs
Basic single inheritance.
Collection and string classes
Scope based resource management.
They also publish a lint tool, cpplint.py.
Not long ago I listened to this SE-Radio podcast - Episode 152: MISRA with Johan Bezem, which introduces MISRA, standard guidelines for C and C++ to ensure better quality, try looking at it.
The GCC developers are about to allow some C++ features. I'm not aware of any official guidelines, yet, but I am pretty sure that they will define some. Have a look at initial report on the mailing list.
Well, latest developments (TR1, C++0x) in C++ made it very much generic, allowing you to do imperative, OOP or even (limited) functional programming in C++. Libraries like Boost also enable you to do very power declarative template-based meta-programming.
I think Boost is the first thing to try out in C++. It's a comprehensive library, which also includes several modules that enable you to program in functional style (Boost.Functional) or making compile-time declarative meta-programming (Boost MPL).
OpenCL has been using C for writing kernels, but they have recently added (or will soon add) C++ bindings and perhaps Java. OpenCL leaves out a number of performance robbing features of C. Excluded are things like function pointers and recursion. Smart pointers and polymorphism also create overhead.
Restrictions on C:
SIMD programming languages
Slightly off topic: Here is a good discussion comparing OpenCL with CUDA using C.
OpenCL or CUDA Which way to go?
The SEI CERT C++ Coding Standard gives a list of rules for writing safe, reliable, and secure systems in C++14. This is not a subset of C++ per se, but as a coding standard like the other answers is a subset in effect by avoiding unsafe, undefined, or easily-misused features (including some common to C).

What is the preferred naming conventions du jour for c++?

I am quite confused by looking at the boost library, and stl, and then looking at people's examples. It seems that capitalized type names are interspersed with all lowercase, separated by underscores.
What exactly is the way things should be done these days? I know the .NET world has their own set of conventions, but it appears to be completely different than the C++ sphere.
What a can of worms you've opened.
The C++ standard library uses underscore_notation for everything, because that's what the C standard library uses.
So if you want your code to look consistent across the board (and actually aren't using external libraries), that is the only way to go.
You'll see boost use the same notation because often their libraries get considered for future standards.
Beyond that, there are many conventions, usually using different notations to designate different types of symbols. It is common to use CamelCase for custom types, such as classes and typedefs and mixedCase for variables, specifically to differentiate those two, but that is certainly not a universal standard.
There's also Hungarian Notation, which further differentiates specific variable types, although just mentioning that phrase can incite hostility from some coders.
The best answer, as a good C++ programmer, is to adopt whatever convention is being used in the code you're immersed in.
There is no good answer. If you're integrating with an existing codebase, it makes sense to match their style. If you're creating a new codebase, you might want to establish simple guidelines.
Google has some.
It's going to be different depending on the library and organization.
For the developer utility library I'm building, for example, I'm including friendly wrapper modules for various conventions in the style of the convention. So, for example, the MFC wrapper module uses the 'm_typeMemberVariable' notation for members, whereas the STL wrapper module uses 'member_variable'. I'm trying to build it so that whatever front-end is used will have the style typical for that type of front-end.
The problem with having a universal style is that everyone would have to agree, and (for example) for every person who detests Hungarian notation, there's someone else who thinks not using Hungarian notation detracts from the basic value of comprehensibility of the code. So it's unlikely there will be a universal standard for C++ any time soon.
Find something you feel comfortable with and stick with it. Some form of style is better than no style at all and don't get too hung up on how other libraries do it.
FWIW I use the Google C++ style guide (with some tweaks).
http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml