I want to read, and learn from, the source code of a scripting language's interpreter/compiler. What scripting language interpreter/compiler has the simplest, cleanest, and easiest to read source code? I would prefer it to be written in C/C++ (what else are compilers written in anyway?) because I'm planning on writing a compiler in C.
Take a look at Lua. You can go through the first versions of the language and see how it has evolved. It's written in C, and the code is clean and pleasant to read. You can write a compiler in almost any programming language, but C is the one most programmers have chosen.
The CPython interpreter has been around for quite some time, and I would imagine that it would be very useful to you.
AngelScript is a very good option for learning about compilers. It is a language with a familiar C/C++-like syntax and garbage collection; it is object-oriented with inheritance and polymorphism, cross-platform, and compiles to bytecode.
My second choice would be Lua.
I would recommend, as a gentle introduction, having a look at the LLVM Tutorial.
Chris Lattner creates a simple toy language Kaleidoscope to show the various phases of compilation:
Lexing
Parsing
Code Generation
He then demonstrates how to add JIT capabilities (which let the toy language evaluate code interactively, as an interpreter would).
The toy language is extremely simple, and thus the resulting code is simple as well, and demonstrates nicely the architecture without drowning you in implementation details.
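To make the three phases concrete, here is a minimal sketch in Python (not the tutorial's C++/LLVM code). It assumes a made-up language of integer arithmetic only, and the names `lex`, `parse`, `codegen` and `run` are my own. Code generation targets a tiny stack machine rather than LLVM IR.

```python
import operator
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.floordiv}


def lex(source):
    """Lexing: turn the character stream into (kind, value) tokens."""
    tokens = []
    for number, op in TOKEN_RE.findall(source):
        if number:
            tokens.append(("num", int(number)))
        elif op.strip():  # skip trailing whitespace
            tokens.append(("op", op))
    return tokens


def parse(tokens):
    """Parsing: build a nested-tuple AST with the usual precedence."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("eof", None)

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def primary():
        kind, value = advance()
        if kind == "num":
            return ("num", value)
        if (kind, value) == ("op", "("):
            node = expr()
            if advance() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")

    def term():
        node = primary()
        while peek() in (("op", "*"), ("op", "/")):
            op = advance()[1]
            node = (op, node, primary())
        return node

    def expr():
        node = term()
        while peek() in (("op", "+"), ("op", "-")):
            op = advance()[1]
            node = (op, node, term())
        return node

    tree = expr()
    if peek()[0] != "eof":
        raise SyntaxError("trailing input")
    return tree


def codegen(node):
    """Code generation: emit postfix instructions for a stack machine."""
    if node[0] == "num":
        return [("push", node[1])]
    op, left, right = node
    return codegen(left) + codegen(right) + [(op, None)]


def run(code):
    """Execute the generated instructions and return the result."""
    stack = []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(OPS[op](left, right))
    return stack[0]
```

Calling `run(codegen(parse(lex("1 + 2 * 3"))))` gives 7. Each phase only sees the output of the previous one, which is the architecture the tutorial walks through on a larger scale.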
I am not sure that the tutorial is fully up-to-date and can be used as is against a recent LLVM version, but I do advise at least reading it.
(And of course, reading the Dragon Book).
Take a look at V8 for JavaScript. Every interpreter has a component called a tokenizer (lexer). Flex generates tokenizers, and GNU Bison generates the parsers that consume their tokens; take a look at both, as they can be helpful. Chromium also uses a tokenizer for HTML in WebKit, but V8 is the JavaScript interpreter.
Claudio M. Souza Junior
PHP is a famous language, but not a simple one. You can still take advantage of its source code (PHP Source Code).
I inherited a C++ project which uses Ragel for string parsing.
This is the first time I have seen this being done and I would like to understand why someone would use Ragel instead of C++ to parse a string?
Parser generators (improperly called "compiler-compilers") are very handy to use and produce reliable and efficient C++ or C code (notably because parsing theory is well understood).
In general, using source code generators can be a wise thing to do. Sometimes, notably in large projects, it is sensible to write your own (read about metaprogramming, notably SICP and even J. Pitrat's blog). Good build automation tools like GNU make or ninja can easily be configured to run C or C++ code generators and use them at build time.
Read the Ragel introduction. Look also into flex, bison, ANTLR, rpcgen, Qt moc, swig, and gperf as common examples of C or C++ generators.
In some programs, you could even use a JIT compilation library (such as libgccjit or LLVM) to dynamically generate code at run time and use it. On POSIX systems you could also generate a temporary C or C++ file at run time, compile it as a plugin, and load that temporary plugin using dlopen & dlsym. A good grounding in compilers and interpreters (e.g. through the Dragon Book) is then worthwhile.
Embedding an interpreter (like Lua or Guile) in your application is also an interesting approach. But it is a strong architectural decision.
In many cases, generating source code is easier than hand writing it. Of course that is not always possible.
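As a rough illustration of why generating source can be easier than hand-writing it, here is a small sketch of a generator that emits C code from a list of names. It is written in Python for brevity. The function name `generate_c_enum` and the `token` enum are invented for this example and come from no particular tool.

```python
def generate_c_enum(enum_name, members):
    """Return C source for an enum and a matching name-lookup function."""
    prefix = enum_name.upper()
    lines = [f"enum {enum_name} {{"]
    lines += [f"    {prefix}_{m.upper()}," for m in members]
    lines += ["};", ""]
    lines.append(f"const char *{enum_name}_name(enum {enum_name} value) {{")
    lines.append("    switch (value) {")
    for m in members:
        # One case per member, so the enum and the names cannot drift apart.
        lines.append(f'    case {prefix}_{m.upper()}: return "{m}";')
    lines.append("    }")
    lines.append('    return "?";')
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(generate_c_enum("token", ["plus", "minus", "number"]))
```

A build tool such as make could run a script like this and compile its output, so adding a member means editing one list instead of two places in hand-written C.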
PS. I never heard of Ragel before reading your question!
Recently I picked up a copy of The Definitive ANTLR 4 Reference, and since I am comfortable working with grammars and languages, I wanted to work on a DSL that I once wrote using yacc and bison. The general idea is to write a translator (with included validation for type safety(1)) which translates the DSL to JavaScript at runtime, which is then executed by V8.
Although ANTLR was designed for inclusion in Java applications I would like to stay with native C++. Can ANTLR 4 produce such a C parser/lexer(2) which I can include using a C++-style wrapper? And how to do so?
(1) The book has some good examples which I will use as a template.
(2) I am not sure but I think that I read somewhere that ANTLR doesn't support output in C++, am I right?
I found the ANTLR 3 C/C++ target almost unusable. It contains so many hacks to circumvent the lack of exceptions in C that it was recommended for experts only. Though it's Terr's call, I hope ANTLR 4 doesn't support target languages without native exceptions unless it can isolate whatever hackery is required to do so from end users. The ANTLR 2 C++ target is cleaner than ANTLR 3's, but ANTLR 2 itself has limitations, including extremely messy licensing (making it difficult to use in commercial products).
ANTLR v3 has various different targets, most notably Java (of course), C, C#, JavaScript and Python. For a full list, see: http://www.antlr.org/wiki/display/ANTLR3/Code+Generation+Targets
ANTLR v4, however, only has a Java target at this moment.
In case you are still interested, version 4.7 of ANTLR does have a C++ target.
In reply to John G.'s answer:
I agree that the ANTLR3 C target is very hacky. Even as a C/C++ expert with 20 years of experience, I could not guess how to use it without answers from the author. The ideas were quite good, but without documentation it was nearly impossible to understand.
I do not agree that exceptions are the main trouble. In the days of ANTLR2 and its C++ implementation, exceptions did exist, and there was an opinion that removing them would make the parser faster. In v3 they tried to do that, but ...
But speed did not improve. We hoped to switch from ANTLR2 to ANTLR3 in our Valentina Database engine and spent months rewriting the grammar for v3, and got zero speed-up. Just zero. So to this day we use ANTLR v2.
I think the main speed issue in ANTLR is that it produces a separate function for each rule. This is its strength, and it is also its weakness.
In v4, Terence invented a way to use state machines in the lexer. If only we could get that for parsers as well. Ideally, I think, ANTLR would produce functions as it does now while we develop the grammar, and state machines for release. But this is a dream so far.
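For readers unfamiliar with the distinction, here is a minimal sketch of what "a state machine instead of one function per rule" can look like for a lexer. It is a hand-written table-driven automaton in Python, not ANTLR's generated code. The states, character classes and token kinds are assumptions made up for the example.

```python
def char_class(ch):
    """Map a character to the coarse class used by the transition table."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha() or ch == "_":
        return "letter"
    return "other"


# Transition table: (state, character class) -> next state.
TRANSITIONS = {
    ("start", "digit"): "int",
    ("start", "letter"): "ident",
    ("int", "digit"): "int",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
}
ACCEPTING = {"int", "ident"}


def tokenize(text):
    """Run the automaton repeatedly, emitting one token per accepted run."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        state, start = "start", i
        # A single generic loop drives every token kind from the table.
        while i < len(text) and (state, char_class(text[i])) in TRANSITIONS:
            state = TRANSITIONS[(state, char_class(text[i]))]
            i += 1
        if state not in ACCEPTING:
            raise ValueError(f"unexpected character {text[i]!r} at {i}")
        tokens.append((state, text[start:i]))
    return tokens
```

All the grammar-specific knowledge lives in the `TRANSITIONS` data, and the loop never makes a per-rule function call. That is the trade-off described above: less readable than one function per rule, but with less call overhead.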
Can a scripting language be translated into C, C++, or Java so it can be run on an IDE without rewriting the code?
In theory, yes, it is possible to translate any scripting language into C, C++, or Java code. A theoretically valid way of doing this would be to take the source code for the interpreter and then to hardcode in the script that it's going to be executing. The resulting code would then be "run the interpreter written in C/C++/Java on the specified source code."
In practice, there usually isn't a good way of translating from a scripting language to some other target language in a way that preserves the original coding style. Each language has its own constructs, idioms, and idiosyncrasies and in translating from the source scripting language to a target language much of the original structure is lost. That said, there are many projects that do this sort of conversion for performance reasons. For example, Facebook's HipHop compiler translates PHP into C++ for efficiency reasons. The resulting code is not intended to be read by humans, though.
So in short, yes, it can be done, but not in a way that's going to result in pretty code.
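The "hardcode the script into the interpreter" trick from the first paragraph can be sketched as follows. To keep it short and runnable, both the interpreter and the output are Python rather than C/C++/Java, and the scripting language is a made-up postfix calculator. The names `INTERPRETER_SOURCE`, `run_script` and `translate` are mine.

```python
# Source text of a tiny interpreter for a made-up postfix language.
INTERPRETER_SOURCE = '''
def run_script(script):
    """Evaluate integers and the words add, mul in postfix order."""
    stack = []
    for word in script.split():
        if word == "add":
            stack.append(stack.pop() + stack.pop())
        elif word == "mul":
            stack.append(stack.pop() * stack.pop())
        else:
            stack.append(int(word))
    return stack.pop()
'''


def translate(script):
    """'Translate' a script by bundling it, as a literal, with its interpreter."""
    return (INTERPRETER_SOURCE
            + f"\nSCRIPT = {script!r}\n"
            + "RESULT = run_script(SCRIPT)\n")


if __name__ == "__main__":
    program = translate("2 3 add 4 mul")
    namespace = {}
    exec(program, namespace)    # the output is a complete host-language program
    print(namespace["RESULT"])  # (2 + 3) * 4
```

The output is technically a program in the host language, but none of the script's structure is reflected in it, which is why this counts as a translation only in theory.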
Take a look at shedskin for an example of a Python to C++ translator. It isn't perfect. It has some limitations on what code can be translated. But in general it works.
The main reason to do so, in this case, is speed and ease of integration with other existing C++ software.
In theory, yes, it's possible. Depending on the scripting language and its supporting "virtual machine", there are tools to do this (semi-)automatically. The more heavily interpreted the language is, the less likely it is that you will be able to translate the code (for example, translating an HTML webpage into native C is kind of ridiculous, as opposed to translating MATLAB code into C or C++). In general, generic tools for translating code are rarely good enough that you can compile and run the code they produce. Very often they will do most of the syntax translation (basically find-and-replace operations) and maybe some more advanced work, but most of the time you will still have significant work to do. It is like using Google Translate to translate a webpage from one language to another: it is never perfect, and the result depends on how close the two languages are.
In my opinion, however, code translation is a very dangerous business. It is a lot easier to make typos or other mistakes when you are manually rewriting code that you know very little about. An automatic translation tool won't perform much better on that front either. And then, once you have translated the code, what if there are bugs? How are you supposed to find and fix them when you know very little about the actual code? This can very rapidly become a nightmarish experience. I have done it in the past, and don't do it anymore!
BTW: if you are looking to use code that is written in a script language inside a project written in another language, then you might consider interfacing the languages instead of translating one code to the other language. Most programming languages and scripting languages have facilities to interface with other languages (e.g. DLLs or COM/ActiveX components). It is always better to preserve the code in the language it was originally written if at all possible.
There is a programming language called Haxe that can be translated into C++, Java, Javascript, C# and several other languages. This language appeared relatively recently, and is designed to be translated into as many target languages as possible.
For C, there are some scripting languages which look a little like C. Maybe that's a starting point. Lua comes to my mind.
For Java, there is a scripting language, BeanShell (bsh), which runs simplified Java code as a script. But you would have to rewrite the script to make Java code out of it, and I don't know how easy it would be to automate that process.
A different approach would be to write a C interpreter, so you can use C code for scripting and just compile it when you need to.
After over a decade of C/C++ coding, I've noticed the following pattern - very good programmers tend to have detailed knowledge of the innards of the compiler.
I'm a reasonably good programmer, and I have an ad-hoc collection of compiler "superstitions", so I'd like to reboot my knowledge and start from the basics.
Can anyone recommend links to online resources or favorite books? I'm particularly interested in C/C++ compiling, optimization, GCC and LLVM.
Start with the Dragon Book (with emphasis on code optimization and code generation).
Go on to write a toy compiler for an educational programming language like Decaf or Cool. You may use parser generators (lex and yacc) for your front end, to make life easier and focus on the more important parts.
Then read the GCC internals book along with browsing the GCC source code.
GCC Internals Manual.
CPP Internals Manual
LLVM Documentation
Compiler texts are good, but they are a bit heavy for teaching yourself. Jack Crenshaw has a "book", originally a series of articles that you can download and read, called "Let's Build a Compiler." It follows a "learn by doing" methodology that is great if you didn't get anything out of taking formal classes on the subject, or if it's been WAY too many years since you took them (that's my case). It holds your hand and leads you through writing a compiler instead of smacking you around with lambda calculus and deep theoretical issues that only academia cares about. It was a good way to stir up those brain cells that only had a fuzzy memory of writing something on the VAX (YEAH, that's right, a VAX!) many, many moons ago at school. It's written very conversationally and is easy to just sit down and read, unlike most textbooks, which require several pots of coffee just to get past the first chapter. Once you have a basis for understanding, more traditional texts such as the Dragon Book are great references to expand on your understanding. (And personally I like the dead-tree versions. I printed out Jack's; it's much easier to read in a comfortable position than on a laptop. And the e-book readers are too expensive for something that doesn't actually feel like you're reading a real book yet.)
What some might call a "downside" is that it's written in Pascal, but I thought that just made me think about it more than if someone had given me a working C program to start with. Apart from that, it was written with the 68000 in mind, which is only being used in embedded systems at this point in time. Again, for me this wasn't a problem: I knew 68000 asm, and 68000 asm is easier to read than some other asm.
If you want dead-tree edition, try The Art of Compiler Design: Theory and Practice.
As noted by Pete Eddy, Jack Crenshaw's tutorial is excellent for newbies. But if you want to see how a real, production C compiler works (one which was designed by brilliant engineers instead of created by throwing code at the wall until something stuck), get yourself a copy of Fraser and Hanson's A Retargetable C Compiler: Design and Implementation, which contains the source code to the very clean lcc compiler. Explanations of the design and implementation are mixed in with the code. It is not a first book for a beginner, but it will repay careful study, and you can get a used copy for $35.
For a longer blurb about lcc, see Compile C Faster on Linux.
The lcc web page also has links to a number of good textbooks. I don't know of an intro text that I really like, however.
P.S. Sorry you got ripped off at Uni.
http://se-radio.net/podcast/2007-07/episode-61-internals-gcc
See Fabrice Bellard's otcc source code:
http://bellard.org/otcc/
Depending on what exactly you want to know, you should have a look at the pipes-and-filters pattern, because as far as I know this (or something similar) has been used in a lot of compilers in recent years.
If my compiler knowledge is not too outdated, it works like this:
Parse the source code into a symbolic representation
Clean up the symbolic representation, do some normalization
Optimize the symbolic tree based on certain rules
Write out executable code based on the symbolic tree
Of course dependencies etc. have to be resolved too.
And of course, having a look at the gcc or javac source code may help in getting a more detailed understanding.
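The filter stages listed above can be sketched as a chain of small functions, each taking a tree and returning a new one. This is a toy in Python under my own assumptions: the tree is a nested tuple that a parser is assumed to have produced already, only `+` and `*` exist, and the instruction names `PUSH`, `LOAD`, `ADD` and `MUL` are invented.

```python
import operator

FOLD = {"+": operator.add, "*": operator.mul}
IDENTITY = {"+": 0, "*": 1}
MNEMONIC = {"+": "ADD", "*": "MUL"}


def normalize(node):
    """Normalization: put constants on the right of commutative operators."""
    if node[0] in ("num", "var"):
        return node
    op, left, right = node[0], normalize(node[1]), normalize(node[2])
    if left[0] == "num" and right[0] != "num":
        left, right = right, left
    return (op, left, right)


def optimize(node):
    """Optimization: fold constant subtrees, drop x + 0 and x * 1."""
    if node[0] in ("num", "var"):
        return node
    op, left, right = node[0], optimize(node[1]), optimize(node[2])
    if left[0] == "num" and right[0] == "num":
        return ("num", FOLD[op](left[1], right[1]))
    if right == ("num", IDENTITY[op]):
        return left
    return (op, left, right)


def emit(node):
    """Output: write stack-machine pseudo-instructions for the tree."""
    if node[0] == "num":
        return [f"PUSH {node[1]}"]
    if node[0] == "var":
        return [f"LOAD {node[1]}"]
    return emit(node[1]) + emit(node[2]) + [MNEMONIC[node[0]]]


def compile_tree(tree):
    """Chain the filters: each stage only sees the previous stage's output."""
    return emit(optimize(normalize(tree)))
```

For example, `compile_tree(("*", ("num", 1), ("var", "x")))` reduces to a single `LOAD x`: normalization moves the constant to the right, and the optimizer then recognizes `x * 1`. Because the stages share nothing but the tree, a pass can be added or removed without touching the others.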
It may also be valuable to pick up and read the source code to a compiler. I doubt that GCC is the best first choice, since it is burdened with full compatibility to more than 20 years of evolution of the language. But I'm also sure that a reading of its source, guided by one of the internal reference manuals, would be educational.
I'd seriously consider looking at the source to a scripting language that is internally compiled to a bytecode for a virtual machine. Several languages fit that description, but I would start with Lua. The language is small, and the VM is novel. The source code is also small and the bits I've looked at have been very clear although lightly commented.
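To give a flavour of what "compiled to bytecode for a virtual machine" means before you dive into a real implementation, here is a minimal register-based VM sketch in Python. The instruction set (`LOADK`, `ADD`, `LT_JMP`, `RETURN`) is invented for this example and is not Lua's actual bytecode.

```python
def run(program, num_registers=4):
    """Execute a list of (opcode, a, b, c) instructions on a register file."""
    regs = [0] * num_registers
    pc = 0
    while pc < len(program):
        op, a, b, c = program[pc]
        pc += 1
        if op == "LOADK":      # regs[a] = constant b
            regs[a] = b
        elif op == "ADD":      # regs[a] = regs[b] + regs[c]
            regs[a] = regs[b] + regs[c]
        elif op == "LT_JMP":   # if regs[a] < regs[b], jump to instruction c
            if regs[a] < regs[b]:
                pc = c
        elif op == "RETURN":   # stop and hand back regs[a]
            return regs[a]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return None


# Hypothetical program: sum the integers 1..5.
SUM_TO_FIVE = [
    ("LOADK", 0, 0, None),     # r0 = 0   (running sum)
    ("LOADK", 1, 0, None),     # r1 = 0   (counter i)
    ("LOADK", 2, 5, None),     # r2 = 5   (limit)
    ("LOADK", 3, 1, None),     # r3 = 1   (step)
    ("ADD", 1, 1, 3),          # i = i + 1
    ("ADD", 0, 0, 1),          # sum = sum + i
    ("LT_JMP", 1, 2, 4),       # loop while i < limit
    ("RETURN", 0, None, None),
]

if __name__ == "__main__":
    print(run(SUM_TO_FIVE))    # 15
```

Instructions name their operands as register indices instead of pushing and popping a stack. A real VM adds function calls, value types and garbage collection on top of a dispatch loop shaped much like this one.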
Have a look at Kaleidoscope.
You can write your own compiler in just a few days with LLVM.
What are some good tools for getting a quick start for parsing and analyzing C/C++ code?
In particular, I'm looking for open source tools that handle the C/C++ preprocessor and language. Preferably, these tools would use lex/yacc (or flex/bison) for the grammar, and not be too complicated. They should handle the latest ANSI C/C++ definitions.
Here's what I've found so far, but haven't looked at them in detail (thoughts?):
CScope - Old-school C analyzer. Doesn't seem to do a full parse, though. Described as a glorified 'grep' for finding C functions.
GCC - Everybody's favorite open source compiler. Very complicated, but seems to do it all. There's a related project for creating GCC extensions called GEM, but hasn't been updated since GCC 4.1 (2006).
PUMA - The PUre MAnipulator. (from the page: "The intention of this project is to provide a library of classes for the analysis and manipulation of C/C++ sources. For this purpose PUMA provides classes for scanning, parsing and of course manipulating C/C++ sources."). This looks promising, but hasn't been updated since 2001. Apparently PUMA has been incorporated into AspectC++, but even this project hasn't been updated since 2006.
Various C/C++ raw grammars. You can get c-c++-grammars-1.2.tar.gz, but this has been unmaintained since 1997. A little Google searching pulls up other basic lex/yacc grammars that could serve as a starting place.
Any others?
I'm hoping to use this as a starting point for translating C/C++ source into a new toy language.
Thanks!
-Matt
(Added 2/9): Just a clarification: I want to extract semantic information from the preprocessor in addition to the C/C++ code itself. I don't want "#define foo 42" to disappear into the integer "42", but to remain attached to the name "foo". This, unfortunately, excludes several solutions that run the preprocessor first and only deliver the C/C++ parse tree.
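To illustrate the requirement in the clarification, here is a rough sketch in Python that records object-like `#define` macros and reports where they are used, without expanding them. It is a regex-based toy, not a real preprocessor: it ignores function-like macros, `#undef`, conditionals, comments and string literals. The names `collect_defines` and `annotate_uses` are made up.

```python
import re

# Object-like macros only: "#define NAME replacement text".
DEFINE_RE = re.compile(
    r"^[ \t]*#[ \t]*define[ \t]+([A-Za-z_]\w*)[ \t]+(.*?)[ \t]*$",
    re.MULTILINE,
)
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def collect_defines(source):
    """Map each object-like macro name to its replacement text."""
    return {name: body for name, body in DEFINE_RE.findall(source)}


def annotate_uses(source, macros):
    """List (line number, macro name, value) for uses outside directives."""
    uses = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip preprocessor directives themselves
        for word in IDENT_RE.findall(line):
            if word in macros:
                uses.append((lineno, word, macros[word]))
    return uses


if __name__ == "__main__":
    code = "#define foo 42\nint x = foo + 1;\n"
    macros = collect_defines(code)
    print(macros)                       # {'foo': '42'}
    print(annotate_uses(code, macros))  # [(2, 'foo', '42')]
```

The use of `foo` stays attached to its name and its definition instead of collapsing to 42, which is the information a preprocess-first tool throws away.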
Parsing C++ is extremely hard because the grammar is undecidable. To quote Yossi Kreinin:
Outstandingly complicated grammar
"Outstandingly" should be interpreted literally, because all popular languages have context-free (or "nearly" context-free) grammars, while C++ has undecidable grammar. If you like compilers and parsers, you probably know what this means. If you're not into this kind of thing, there's a simple example showing the problem with parsing C++: is AA BB(CC); an object definition or a function declaration? It turns out that the answer depends heavily on the code before the statement - the "context". This shows (on an intuitive level) that the C++ grammar is quite context-sensitive.
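The `AA BB(CC);` example can be made concrete with a small sketch in Python. The function `classify` and its `known_types` argument are hypothetical. A real C++ front end also has to handle templates, namespaces, scopes and more, so this only shows why the parser needs a symbol table at all.

```python
import re

DECL_RE = re.compile(r"^\s*(\w+)\s+(\w+)\s*\(\s*(\w*)\s*\)\s*;\s*$")


def classify(statement, known_types):
    """Decide what `AA BB(CC);` means given the type names declared so far."""
    match = DECL_RE.match(statement)
    if not match:
        raise ValueError("not of the form 'AA BB(CC);'")
    type_name, _name, arg = match.groups()
    if type_name not in known_types:
        raise ValueError(f"{type_name} is not a known type")
    # If CC names a type (or the parentheses are empty), this declares a
    # function BB returning AA; otherwise CC is an initializer expression.
    if arg == "" or arg in known_types:
        return "function declaration"
    return "object definition"


if __name__ == "__main__":
    print(classify("AA BB(CC);", {"AA", "CC"}))  # function declaration
    print(classify("AA BB(CC);", {"AA"}))        # object definition
```

The same token sequence gets two different parse trees depending on what was declared earlier, so the parser cannot be driven by the grammar alone.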
You can look at Clang, the C/C++ front end of the LLVM project, for parsing.
It fully supports C++ now.
The ANTLR parser generator has a grammar for C/C++ as well as the preprocessor. I've never used it so I can't say how complete its parsing of C++ is going to be. ANTLR itself has been a useful tool for me on a couple of occasions for parsing much simpler languages.
Depending on your problem, GCC-XML might be your answer.
Basically, it parses the source using GCC and then gives you an easily digestible XML version of the parse tree.
With GCCXML you are done once and for all.
pycparser is a complete parser for C (C99) written in Python. It has a fully configurable AST backend, so it can be used as a basis for any kind of language processing you might need.
Doesn't support C++, though. Granted, it's much harder than C.
Update (2012): at this time the answer, without any doubt, would be Clang. It's modular, supports the full C++ language (with many C++11 features) and has a relatively friendly code base. It also has a C API for bindings to high-level languages (e.g. for Python).
Have a look at how doxygen works, full source code is available and it's flex-based.
A misleading candidate is GOLD which is a free Windows-based parser toolkit explicitly for creating translators. Their list of supported languages refers to the languages in which one can implement parsers, not the list of supported parse grammars.
They only have grammars for C and C#, no C++.
Parsing C++ is a very complex challenge.
There's the Boost/Spirit framework, and a couple of years ago they did play with the idea of implementing a C++ parser, but it's far from complete.
Fully and properly parsing ISO C++ is far from trivial, and there were in fact many related efforts. But it is an inherently complex job that isn't easily accomplished, without rewriting a full compiler frontend understanding all of C++ and the preprocessor. A pre-processor implementation called "wave" is available from the Spirit folks.
That said, you might want to have a look at pork/oink (elsa-based), which is a C++ parser toolkit specifically meant for source code transformation. It is being used by the Mozilla project for large-scale static source code analysis and automated code rewriting. The most interesting part is that it supports not only most of C++, but also the preprocessor itself!
On the other hand there's indeed one single proprietary solution available: the EDG frontend, which can be used for pretty much all C++ related efforts.
Personally, I would check out the elsa-based pork/oink suite which is used at Mozilla. Apart from that, the FSF has now approved work on gcc plugins using the runtime library license, so I'd assume that things are going to change rapidly once people can easily leverage the gcc-based C++ parser for such purposes using binary plugins.
So, in a nutshell: if you have the bucks, EDG; if you need something free/open source now, elsa/oink are fairly promising; if you have some time, you might want to use gcc for your project.
Another option just for C code is cscout.
The grammar for C++ is sort of notoriously hairy. There's a good thread at Lambda about it, but the gist is that C++ grammar can require arbitrarily much lookahead.
For the kind of thing I imagine you might be doing, I'd think about hacking either GNU CC or Splint. GNU CC in particular does separate out the language generation part pretty thoroughly, so you might be best off building a new g++ backend.
Actually, PUMA and AspectC++ are still both actively maintained and updated. I was looking into using AspectC++ and was wondering about the lack of updates myself. I e-mailed the author who said that both AspectC++ and PUMA are still being developed. You can get to source code through SVN https://svn.aspectc.org/repos/ or you can get regular binary builds at http://akut.aspectc.org. As with a lot of excellent c++ projects these days, the author doesn't have time to keep up with web page maintenance. Makes sense if you've got a full time job and a life.
How about something easier to comprehend, like tiny-C or Small C?
Elsa beats everything else I know hands down for C++ parsing, even though it is not 100% compliant. I'm a fan. There's a module that prints out C++, so that may be a good starting point for your toy project.
See our C++ Front End for a full-featured C++ parser: builds ASTs, symbol tables, does name and type resolution. You can even parse and retain the preprocessor directives. The C++ front end is built on top of our DMS Software Reengineering Toolkit, which allows you to use that information to carry out arbitrary source code changes using source-to-source transformations. DMS is the ideal engine for implementing such a translator.
Having said that, I don't see much point in your imagined task; I don't see much value in trying to replace C++, and you'll find building a complete translator an enormous amount of work, especially if your target is a "toy" language. And there is likely little point in parsing C++ using a robust parser, if its only purpose is to produce an isomorphic version of C++ that is easier to parse (wait, we postulated a robust C++ already!).
EDIT May 2012: DMS's C++ front end now handles GCC3/GCC4/C++11, Microsoft VisualC 2005/2010. Robustly.
EDIT Feb 2015: Now handles C++14 in GCC and MS dialects.
EDIT August 2015: Now parses and captures both the code and the preprocessor directives in a unified tree.
EDIT May 2020: Has been doing C++17 for the past few years. C++20 in process.
A while back I attempted to write a tool that will automatically generate unit tests for c files.
For preprocessing I put the files through GCC. The output is ugly, but you can easily trace where you are in the original code from the preprocessed file. But for your needs you might need something else.
I used Metre as the base for a C parser. It is open source and uses lex and yacc. This made it easy to get up and running in a short time without fully understanding lex & yacc.
I also wrote a C app since the lex & yacc solution could not help me trace functionality across functions and parse the structure of the entire function in one pass. It became unmaintainable in a short time and was abandoned.
What about using a tool like GNU's cflow, which can analyse the code and produce charts of call graphs? Here's what the Open Group man page has to say about cflow. The GNU version of cflow comes with source, and is open source too ...
Hope this helps,
Best regards,
Tom.