How do C/C++ compilers work?

After over a decade of C/C++ coding, I've noticed the following pattern - very good programmers tend to have detailed knowledge of the innards of the compiler.
I'm a reasonably good programmer, and I have an ad-hoc collection of compiler "superstitions", so I'd like to reboot my knowledge and start from the basics.
Can anyone recommend links to online resources or favorite books? I'm particularly interested in C/C++ compiling, optimization, GCC and LLVM.

Start with the Dragon Book (stress code optimization and code generation).
Then go on to write a toy compiler for an educational programming language like Decaf or Cool. You may use parser generators (lex and yacc) for your front end, to make life easier and let you focus on the more important stuff.
Then read the GCC Internals book while browsing the GCC source code.

GCC Internals Manual.
CPP Internals Manual
LLVM Documentation

Compiler texts are good, but they are a bit heavy for teaching yourself. Jack Crenshaw has a "book" that was a series of articles you can download and read, called "Let's Build a Compiler." It follows a "learn by doing" methodology that is great if you didn't get anything out of taking formal classes on the subject, or if it's been WAY too many years since you took them (that's my case). It holds your hand and leads you through writing a compiler instead of smacking you around with lambda calculus and deep theoretical issues that only academia cares about. It was a good way to stir up those brain cells that only had a fuzzy memory of writing something on the VAX (yeah, that's right, a VAX!) many, many moons ago at school. It's written very conversationally and is easy to just sit down and read, unlike most textbooks, which require several pots of coffee just to get past the first chapter.
Once you have a basis for understanding, more traditional texts such as the Dragon Book are great references to expand your understanding. (And personally I like the dead-tree versions; I printed out Jack's. It's much easier to read in a comfortable position than on a laptop, and e-book readers are too expensive for something that doesn't actually feel like reading a real book yet.)
What some might call a "downside" is that it's written in Pascal, but I thought that just made me think about it more than if someone had given me a working C program to start with. Apart from that, it was written with the 68000 in mind, which is only used in embedded systems at this point in time. Again, for me this wasn't a problem: I knew 68000 asm, and 68000 asm is easier to read than some other asm.

If you want dead-tree edition, try The Art of Compiler Design: Theory and Practice.

As noted by Pete Eddy, Jack Crenshaw's tutorial is excellent for newbies. But if you want to see how a real, production C compiler works—one which was designed by brilliant engineers instead of created by throwing code at the wall until something stuck—get yourself a copy of Fraser and Hanson's A Retargetable C Compiler: Design and Implementation, which contains the source code to the very clean lcc compiler. Explanations of the design and implementation are mixed in with the code. It is not a first book for a beginner, but it will repay careful study, and you can get a used copy for $35.
For a longer blurb about lcc, see Compile C Faster on Linux.
The lcc web page also has links to a number of good textbooks. I don't know of an intro text that I really like, however.
P.S. Sorry you got ripped off at Uni.

http://se-radio.net/podcast/2007-07/episode-61-internals-gcc

See Fabrice Bellard's OTCC source code:
http://bellard.org/otcc/

Depending on what exactly you want to know, you should have a look at the pipes-and-filters pattern, because as far as I know this (or something similar) has been used in a lot of compilers in recent years.
If my compiler knowledge is not too outdated, it works like this:
Parse the source code into a symbolic representation
Clean up the symbolic representation and do some normalization
Optimize the symbolic tree based on certain rules
Write out executable code based on the symbolic tree
Of course, dependencies etc. have to be resolved too.
And of course, having a look at the gcc or javac source code may help in getting a more detailed understanding.
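A minimal sketch of those filter stages in C++, for a toy language of integer additions (entirely hypothetical, not taken from any real compiler):

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    // Stage 1: tokenize the source text into numbers and '+' operators.
    std::vector<std::string> tokenize(const std::string& src) {
        std::vector<std::string> tokens;
        for (std::size_t i = 0; i < src.size();) {
            if (std::isdigit(static_cast<unsigned char>(src[i]))) {
                std::size_t j = i;
                while (j < src.size() &&
                       std::isdigit(static_cast<unsigned char>(src[j])))
                    ++j;
                tokens.push_back(src.substr(i, j - i));
                i = j;
            } else {
                tokens.push_back(std::string(1, src[i++]));  // '+'
            }
        }
        return tokens;
    }

    // Stage 2: parse tokens into a symbolic representation (just the list
    // of operands here, since '+' is the only operator).
    std::vector<long> parse(const std::vector<std::string>& tokens) {
        std::vector<long> operands;
        for (const auto& t : tokens)
            if (t != "+") operands.push_back(std::stol(t));
        return operands;
    }

    // Stage 3: "optimize" -- constant-fold the whole expression.
    long optimize(const std::vector<long>& operands) {
        long sum = 0;
        for (long v : operands) sum += v;
        return sum;
    }

    // Stage 4: emit "code" -- a single pseudo-assembly instruction.
    void codegen(long folded) {
        std::cout << "mov eax, " << folded << '\n';
    }

    int main() {
        codegen(optimize(parse(tokenize("1+2+40"))));  // prints: mov eax, 43
    }

Each stage consumes the previous stage's output, which is exactly the pipes-and-filters shape; real compilers just use far richer intermediate representations between the stages.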

It may also be valuable to pick up and read the source code to a compiler. I doubt that GCC is the best first choice, since it is burdened with full compatibility to more than 20 years of evolution of the language. But I'm also sure that a reading of its source, guided by one of the internal reference manuals, would be educational.
I'd seriously consider looking at the source to a scripting language that is internally compiled to a bytecode for a virtual machine. Several languages fit that description, but I would start with Lua. The language is small, and the VM is novel. The source code is also small and the bits I've looked at have been very clear although lightly commented.

Have a look at Kaleidoscope.
You can write your own compiler in just a few days with LLVM.

Making a GUI with Turbo C++?

Note: I am new to programming and extremely new to C++. I have been scouring Google for a long time, and the only things I can come up with are external headers and very complicated code.
I want to do this at school, where we are provided with Turbo C++. We can't bring any external headers in there; I've got to work with whatever I have.
I want to create a GUI. I want to create something really good for our annual project, and I want to create a GUI.
I would like to make as detailed a GUI as possible, but I would be satisfied if I can create several text options where clicking on them triggers respective functions (I am sorry if that's not how GUIs work; I've never worked with one).
Again, help is highly appreciated. I understand most discussions on Stack Overflow are much more complicated than this; I appreciate you taking the time to read and (hopefully) answer a layman's question.
I'll suggest a non-technical solution, because this is largely not a technical problem. Much of your problem is that you have to use Turbo C++. Unfortunately, a number of poorer countries have, for some reason, stuck with extremely outdated software in education. I know because I'm originally from one of the "Turbo countries", and I know that the main technical university there still uses Turbo for undergrad courses.
This is bad. Large projects in school are there to teach you to work on software. A regular programming course should teach you to think like a programmer, and it doesn't matter what language you use. But term projects are supposed to be more practical. The problem is, with Turbo C++, not only will you fail to learn enough, you will learn things that are bad. You'd be writing a 16-bit program that takes effort to even run on modern hardware, while not being able to use the C++ language properly. The compiler is older than the first ISO C++ standard!
If you want to make an impressive project that will stand out in your studies, especially if you later want to continue at a foreign university, I urge you to talk to your professor, explain the situation, and ask if you can use something else for the project. A modern compiler with some framework like Qt. If you can reach an agreement to use something else, it will benefit you.
Otherwise, if you have no choice, get Turbo Vision. There are versions of Borland C++ packaged with it, or you can find it elsewhere. Turbo Vision is a fairly comprehensive interface framework for the dinosaur era.
First off, the toolchain and operating system you are using are outdated and incapable. And the language support Turbo C++ offers can hardly be called C++ at all; the code you will write will not be C++ code. At best, it will be "C with classes" code.
All that aside, there was a fairly capable Text-based User Interface (TUI) library available with Turbo C++ (as well as with Borland's Pascal-based toolchains), called Turbo Vision. You might be able to use that. It generates UIs quite similar to the Turbo C++ IDE itself.
But IIRC it was not trivial to use, so I advise you to find a book, reference, or comprehensive tutorial of some kind. However, since your environment precludes anything that is not already available with TC, I see no option for you other than using Turbo Vision or writing your own, which doesn't sound like something you can do or want to do.

Tokenization and AST

I have a rather abstract question for you all. I'm looking at getting involved in a static code analysis project. It uses C and C++ as the languages to develop in, so if any code in your response could be in either of those languages, that would be great.
My question:
I need to understand some of the basic concepts/constructs used to process code for static analysis. I have heard people mention things like ASTs and tokenization, etc. I was just wondering if anyone can clarify how these things are applied in creating a static analysis tool? I'd like more of an explanation of tokenization, as I don't really understand it so well. I understand it is a sort of way to process strings, but I'm not confident in that answer. Furthermore, I know that the project I'm looking at passes the code through the preprocessor before it is analyzed. Can anyone explain this? Surely if it is static code analysis, it needn't be preprocessed?
Hope someone can clear this up for me.
Cheers.
Tokenization is the act of breaking source text into language elements such as operators, variable names, numbers, etc. Parsing reads sequences of tokens and builds Abstract Syntax Trees (ASTs), which are a particular program representation. Tokenization and parsing are necessary for static analysis, but hardly interesting, in the same way that ante-to-the-pot is necessary to playing poker but isn't the interesting part of the game in any way.
If you are building a static analyzer (you imply you expect to work on one implemented in C or C++), you will need fundamental knowledge of compiling (parsing not so much, unless you are building a parser for the language to be statically analyzed), but certainly knowledge of program representations (ASTs, triples, control and data flow graphs, ...), type and property inference, and the limits on analysis accuracy (the cause of conservative analyses).
The program representations are fundamental because they are the data structures that most static analyzers really process; it's simply too hard to wring useful facts directly from program text. These concepts can be used to implement static analysis capabilities in any programming language; there's nothing special about implementing them in C or C++.
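A minimal sketch of these ideas (entirely hypothetical, not taken from any real analyzer): a toy AST for arithmetic expressions, plus an analysis pass that walks the tree instead of the raw text, flagging literal division by zero.

    #include <iostream>
    #include <memory>

    // Toy AST: integer literals and binary operators.
    struct Expr {
        virtual ~Expr() = default;
    };
    struct Literal : Expr {
        long value;
        explicit Literal(long v) : value(v) {}
    };
    struct BinaryOp : Expr {
        char op;  // '+', '-', '*', '/'
        std::unique_ptr<Expr> lhs, rhs;
        BinaryOp(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
            : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    };

    // The "analysis": recursively visit the tree looking for "x / 0".
    void checkDivByZero(const Expr& e) {
        if (auto* bin = dynamic_cast<const BinaryOp*>(&e)) {
            if (bin->op == '/') {
                auto* rhs = dynamic_cast<const Literal*>(bin->rhs.get());
                if (rhs && rhs->value == 0)
                    std::cout << "warning: literal division by zero\n";
            }
            checkDivByZero(*bin->lhs);
            checkDivByZero(*bin->rhs);
        }
    }

    int main() {
        // "8 / 0" -- the pattern matcher fires.
        BinaryOp direct('/', std::make_unique<Literal>(8),
                        std::make_unique<Literal>(0));
        checkDivByZero(direct);   // prints the warning

        // "8 / (4 - 4)" -- the divisor is zero, but not *syntactically*
        // zero, so this naive pass misses it. Catching it needs value
        // propagation: a small taste of why real analyses go deeper than
        // tree pattern matching, and why their results are conservative.
        BinaryOp indirect('/', std::make_unique<Literal>(8),
                          std::make_unique<BinaryOp>(
                              '-', std::make_unique<Literal>(4),
                              std::make_unique<Literal>(4)));
        checkDivByZero(indirect); // prints nothing
    }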
Run, don't walk, to your nearest compiler class for the first part of this. If you don't have it, you won't be able to do anything effective in tool building. The second part you will more likely find in a graduate computer science class.
If you get past that basic knowledge issue, you will either decide to implement an analysis tool from scratch, or build on existing analysis tool infrastructure. Few people decide to build one from scratch; it takes a huge amount of work (years or decades) to build robust parsers, flow analyzers, etc. needed as foundations for any particular static analysis. Mostly people try to use some existing infrastructure.
There's a huge list of candidates at: http://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis
If you insist on processing C or C++ and building your own custom sophisticated analysis, you really, really need a tool that can handle real C and C++ code. There are IMHO a limited number of good candidates:
GCC (and various graft-ons such as Starynkevitch's MELT, about which I know little)
Clang (pretty spectacular toolset)
DMS (and its C and C++ front ends) [my company's tool]
Open64 compiler infrastructure
Rose compiler infrastructure (based on EDG's industry-wide front end)
Each of these is a big system and requires a big investment to understand and begin to use. Don't underrate the learning curve.
There are lots of other tools out there that sort of process C and C++, but "sort of" is pretty useless for static analysis purposes.
If you intend to simply use a static analysis tool, you can avoid learning most of the parsing and program representation questions; instead you'll need to learn as much as you can about the specific analysis tool you intend to use. You'll still be a lot better off with the compiler background above because you will understand what the tool does, why it does it, and why it produces the kinds of answers that it does (usually it produces answers that are unsatisfying in a lot of ways due to the conservative limits on analysis accuracy).
Lastly, you should be clear that you understand the difference between static analysis and dynamic analysis [using data collected at runtime to determine program properties]. Most end users couldn't care less how you get information about their code, and each analysis approach has its strengths and weaknesses.
If you are interested in static analysis, you should not bother with syntax analysis (i.e. lexing, parsing, building the abstract syntax tree), because it won't teach you much.
So I suggest you use some existing parsing infrastructure. This is particularly true if you want to analyze C++ source code, because just coding a C++ parser is a difficult thing to do.
For example, you could build on the GCC compiler, or on the LLVM/Clang compiler. Be aware of their open source licenses: GCC is under GPLv3, and Clang/LLVM has a BSD-like license. GCC is able to handle not only C and C++ but also Fortran, Go, Ada, and Objective-C.
Because of GCC's GPLv3 license, your development on top of it should also be free software under GPLv3. On the other hand, the GCC community is quite big. If you know OCaml and are interested in static analysis of C only, you could consider Frama-C (LGPL licensed).
Actually, I am working on GCC, providing the MELT extension framework (MELT is also GPLv3 licensed). MELT is a high-level domain-specific language for extending GCC, and should be the ideal choice for working on static analysis using GCC's powerful framework and internal representations (Gimple, Tree, ...).
(Of course, the LLVM folks would be delighted to have their work used too; and nobody knows both LLVM and GCC well, because learning such big software is already a challenge, so people have to choose where they invest their time.)
So my suggestion is to join an existing static analysis or compiler effort rather than reinventing the wheel by starting from scratch (because that means years of effort just to have a good parser).
By doing so, you will take advantage of the existing infrastructure (so you won't have to bother with ASTs and parsing) and you will improve some existing project.
I would be very delighted if you are interested in MELT. FWIW, there is a MELT tutorial at the HiPEAC conference in Paris on January 24th, 2012.
If you are looking into static analysis for C and C++, your first stop should be libclang, along with its associated papers.
As for why the preprocessor is run: static analysis is meant to catch errors and possible unintended code behavior (very basically), thus it must analyse what the compiler will be compiling, not what the user sees.
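A tiny hypothetical illustration of that point: a macro can expand to something quite different from what appears in the source, so the analyzer has to look at the preprocessed token stream.

    // The analyzer must see the expansion, not the macro invocation.
    #define SQUARE(x) x * x

    int f(int n) {
        // Looks like (n + 1) squared, but expands to: n + 1 * n + 1,
        // i.e. 2*n + 1. Only a tool that runs after the preprocessor
        // can see (and warn about) the actual expression.
        return SQUARE(n + 1);
    }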
I'm pretty sure @Ira Baxter will pop up with a far more complete answer :)

Scripting language interpreter source code to learn from

I want to read, and learn from, the source code of a scripting language's interpreter/compiler. What scripting language interpreter/compiler has the simplest, cleanest, and easiest to read source code? I would prefer it to be written in C/C++ (what else are compilers written in anyway?) because I'm planning on writing a compiler in C.
Take a look at Lua. You can go through the first versions of the language and see how it has evolved. It's written in C and has clean, nice code. You can write a compiler in almost any programming language, but C has been the one most programmers choose.
The CPython interpreter has been around for quite some time, and I would imagine it would be very useful to you.
AngelScript is a very good option for learning about compilers. It is a language with familiar C/C++ syntax and garbage collection; it is object-oriented with inheritance and polymorphism, cross-platform, and compiles to bytecode.
My second choice would be Lua.
I would recommend, as a gentle introduction, having a look at the LLVM Tutorial.
Chris Lattner creates a simple toy language, Kaleidoscope, to show the various phases of compilation:
Lexing
Parsing
Code Generation
He then demonstrates how to add JIT capabilities (essential for an interpreter).
The toy language is extremely simple, and thus the resulting code is simple as well, and demonstrates nicely the architecture without drowning you in implementation details.
I am not sure that the tutorial is fully up to date and can be used as-is against a recent LLVM version, but I do advise at least reading it.
(And of course, reading the Dragon Book).
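To give a flavor of the code-generation phase the tutorial builds toward, here is a minimal sketch of my own (not taken from the tutorial; exact headers and API details vary between LLVM versions) that uses LLVM's IRBuilder to emit IR for a two-argument add function:

    #include "llvm/IR/BasicBlock.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/raw_ostream.h"

    int main() {
        llvm::LLVMContext ctx;
        llvm::Module module("toy", ctx);
        llvm::IRBuilder<> builder(ctx);

        // Declare: i32 add(i32, i32)
        llvm::Type* i32 = builder.getInt32Ty();
        auto* fnType = llvm::FunctionType::get(i32, {i32, i32},
                                               /*isVarArg=*/false);
        auto* fn = llvm::Function::Create(
            fnType, llvm::Function::ExternalLinkage, "add", &module);

        // Body: one basic block computing a + b.
        auto* entry = llvm::BasicBlock::Create(ctx, "entry", fn);
        builder.SetInsertPoint(entry);
        llvm::Value* a = fn->getArg(0);
        llvm::Value* b = fn->getArg(1);
        builder.CreateRet(builder.CreateAdd(a, b, "sum"));

        module.print(llvm::outs(), nullptr);  // dump the textual IR
    }

Something like clang++ toy.cpp $(llvm-config --cxxflags --ldflags --libs core) should build it against an installed LLVM.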
Take a look at V8 for JavaScript. Every interpreter has a component called a tokenizer. GNU has a parser generator called Bison; take a look at it too, it can be helpful. Chromium's WebKit also uses a tokenizer for interpreting HTML, but V8 is the JavaScript interpreter.
Claudio M. Souza Junior
A famous language, but not a simple one. You can take advantage of the source code:
PHP Source Code

Gang of Four: Lexi editor C++ source

I'm reading "Design Patterns: Elements of Reusable OOSW". In chapter two, the authors provide a case study of an editor they refer to as Lexi, which seems to be written in C++. I've looked around everywhere, but the only useful link I could find said this:
The GoF tell us in a note that Lexi is based on "Doc, a text editing application developed by Calder". But this paper only outlines an editor, without any source. And I even believe today that Lexi never truly existed as a program.
The link provides Delphi source. I'm after C++, because that's what I'm comfortable with, and that's what's used in the book.
Does anybody know where I can find C++ source for Lexi? If the original never existed, it would be good to find something that I can use as a base. I really don't feel like writing my own text editor from scratch just so I can work through the case study in this book.
Doc was developed using the InterViews UI toolkit. I believe the Doc source is part of the InterViews distribution. Doc was used to typeset Paul's thesis. (Paul Calder was my lecturer at Flinders University.)
If you look at the InterViews code you might be surprised. It was developed before modern C++ existed. For example, there are no templates. And there are no comments in the code.
To my understanding, Lexi never existed. It was created as an example for the book by GoF.
Maybe a Java implementation can help, it being more similar to C++. Here it is:
jexieditor - a WYSIWYG editor based on Java SE. I have not had a look at the code yet, though.
I may be showing my age here, but are you sure about C++? I have a funny feeling that when that book originally came out, it may have been oriented toward Smalltalk. It's just something nagging at the back of my mind; I can't substantiate it, I'm afraid.
I'm currently implementing a Lexi analog; please take a look: https://github.com/romaonishuk/LexI. The implementation is still in progress, but most of the patterns and concepts described in GoF are implemented using C++.
This is the source code of Lexi, written in Delphi unfortunately for you: LEXI sources.
It appears that the source code might be on the CD-ROM version of Design Patterns that came out in 1998. According to the Amazon listing, the CD contains (among other things):
Sample code demonstrating pattern implementation
Furthermore,
All patterns are compiled from real-world examples and include code that demonstrates how they may be implemented in object-oriented programming languages such as C++ and Smalltalk. Readers who already own the book will want the CD to take advantage of its dynamic search mechanism and ready-to-install patterns.
Whether these code samples include the full Lexi source is impossible to tell from the listing, and the current price of the CD (£86.87) is rather high. But it might be worth checking if any local libraries have the CD in stock.
I was just trying to find out whether a real working Lexi version exists, to have a concrete reference, but I didn't find one.
I found this Java version on GitHub: https://github.com/AmitDutta/lexi
I don't know; maybe it could be useful for someone's purpose here.

Dynamic slicing in C/C++

After reading the book of debugging from Andreas Zeller, I became interested in Dynamic Slicing.
At the moment I only found relevant tools for Java analysis. Do you know such tools for C/C++?
A little information in addition to Rob's:
The Wisconsin Program-Slicing Tool has evolved into a tool called CodeSurfer. Good news: it's commercially available and supported, and it works great for what it does. Bad news (perhaps): it does not actually produce a reduced program that computes the same value that you selected, but it's very convenient for navigating source code that you have not written.
Frama-C handles only C (no C++ for the foreseeable future). It is nice, not great, for navigating source code, but it can produce an equivalent smaller program for the criterion that you specify, if the original program is of the kind that it can analyze automatically (no recursion, no dynamic allocation). Frama-C is Open Source and has a mailing list in which your questions will be welcome if you are interested in the techniques it uses.
The reason CodeSurfer does not risk itself to produce an equivalent program and Frama-C can only do it for code with embedded-like restrictions is, in short, that doing so requires knowing the values of pointers, which can be arbitrarily difficult to compute with precision.
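To make the idea concrete, here is a hypothetical illustration of what a slicer computes (an example of the concept, not code from any of these tools). The slice with respect to the final value of z keeps only the statements that can influence it:

    #include <iostream>

    int main() {
        int x = 2;               // in the slice of z: x flows into z
        int y = 10;              // NOT in the slice: y never influences z
        int z = 0;               // in the slice
        for (int i = 0; i < 3; ++i)
            z += x;              // in the slice: defines z
        y *= 2;                  // NOT in the slice
        std::cout << z << '\n';  // slicing criterion: the value of z here
    }

A static slicer must keep everything that could influence z on any execution; a dynamic slicer works from one concrete execution trace, which is how it can often discard more.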
There's a tool listed on the Wikipedia page you cite. It's for C, so I guess it could work for whatever "C/C++" is.
Wisconsin Program-Slicing Tool
Also for C, and also mentioned on the Wikipedia page:
Frama-C
Giri implements dynamic backwards slicing in the LLVM compiler, which, as far as I know, is the latest effort to build a usable, effective, and thread-aware dynamic slicer in a modern compiler.