How can I make my own C++ compiler understand templates, nested classes, etc. strong features of C++? - c++

It is a university task in my group to write a compiler of C-like language. Of course I am going to implement a small part of our beloved C++.
The exact task is absolutely stupid, and the lecturer told us it need to be self-compilable (should be able to compile itself) - so, he meant not to use libraries such as Boost and STL. He also does not want us to use templates because it is hard to implement.
The question is - is it real for me, as I`m going to write this project on my own, with the deadline at the end of May - the middle of June (this year), to implement not only templates, but also nested classes, namespaces, virtual functions tables at the level of syntax analysis?
PS I am not noobie in C++

Stick to doing a C compiler.
Believe me, it's hard enough work building a decent C compiler, especially if its expected to compile itself. Trying to support all the C++ features like nested classes and templates will drive you insane. Perhaps a group could do it, but on your own, I think a C compiler is more than enough to do.
If you are dead set on this, at least implement a C-like language first (so you have something to hand in). Then focus on showing off.

"The exact task is absolutely stupid" - I don't think you're in a position to make that judgment fairly. Better to drop that view.
"I`m going to write this project on my own" - you said it's a group project. Are you saying that your group doesn't want to go along with your view that it should morph into C++, so you're taking off and working on your own? There's another bit I'd recommend changing.
It doesn't matter how knowledgable you are about C++. Your ability with grammars, parsers, lexers, ASTs, and code generation seems far more germane.
Without knowing more about you or the assignment, I'd say that you'd be doing well to have the original assignment done by the end of May. That's three months away. Stick to the assignment. It might surprise you with its difficulty.
If you finish early, and fulfill your obligation to your team, I'd say you should feel free to modify what's produced to add C++ features.
I'll bet it took Bjarne Stroustrup more than three months to add objects to C. Don't overestimate yourself or underestimate the original assignment.

No problem. And while you're at it, why not implement an operating system for it to run on too.

Follow the assignment. Write a compiler for a C-like language!
What I'd do is select a subset of C. Remove floating-point datatypes and every other feature that isn't necessary in building your compiler.
Writing a C compiler is a lot of work. You won't be able to do that in a couple of months.
Writing a C++ compiler is downright insane. You wouldn't be able to do that in 5 years.

I will like to stress a few points already mentioned and give a few references.
1) STICK TO THE 1989 ANSI C STANDARD WITH NO OPTIMIZATION.
2) Don't worry, with proper guidance, good organization and a fair amount of hard work this is doable.
3) Read the The C Programming Language cover to cover.
4) Understand important concepts of compiler development from the Dragon Book.
5) Take a look at lcc both the code as well as the book.
6) Take a look at Lex and Yacc (or Flex and Bison)
7) Writing a C compiler (up to the point it can self compile) is a rite of passage ritual among programmers. Enjoy it.

For a class project, I think that requiring the compiler to be able to compile itself is a bit much to ask. I assume that this is what was meant by stupid in the question. It means that you need to figure out in advance exactly how much of C you are going to implement, and stick to that in building the compiler. So, building a symbol table using primitives rather than just using an STL map. This might be useful for a data structure course, but misses the point for a compiler course. It should be about understanding the issues involved with the compiler, and chosing which data structures to use, not coding the data structures.
Building a compiler is a wonderful way to really understand what happens to your code once the compiler get a hold of it. What is the target language? When I took compilers, it took 3 of us all semester to build a compiler to go from sorta-pascal to assembly. Its not a trivial task. Its one of those things that seems simple at first, but the more you get into it, the more complicated things get.

You should be able to complete c-like language within the time frame. Assuming you are taking more than 1 course, that is exactly what you might be able to do in time. C++ is also doable but with a lot more extra hours to put it. Expecing to do c++ templates/virtual functions is overexpecting yourself and you might fail in the assignment all together. So it's better stick with a c subset compiler and finish it in time. You should also consider the time it takes for QA. If you want to be thorough QA itself will also take good time.

Namespaces or nested clases, either virtual functions are at syntax level quite simple, its just one or two more rules to parser. It is much more complicated at higher levels, at deciding, which function / class choose (name shadowing, ambiguous names between namespaces, etc.), or when compiling to bytecode/running AST. So - you may be able to write these, but if isn't necessary, skip it, and write just bare functional model.

If you are talking about a complete compiler, with code generation, then forget it. If you just intend to do the lexical & syntactic analysis side of things, then some form of templating may just about be doable in the time frame, depending on what compiler building tools you use.

Related

How much time would it take to write a C++ compiler using flex/yacc?

How much time would it take to write a C++ compiler using lex/yacc?
Where can I get started with it?
There are many parsing rules that cannot be parsed by a bison/yacc parser (for example, distinguishing between a declaration and a function call in some circumstances). Additionally sometimes the interpretation of tokens requires input from the parser, particularly in C++0x. The handling of the character sequence >> for example is crucially dependent on parsing context.
Those two tools are very poor choices for parsing C++ and you would have to put in a lot of special cases that escaped the basic framework those tools rely on in order to correctly parse C++. It would take you a long time, and even then your parser would likely have weird bugs.
yacc and bison are LALR(1) parser generators, which are not sophisticated enough to handle C++ effectively. As other people have pointed out, most C++ compilers now use a recursive descent parser, and several other answers have pointed at good solutions for writing your own.
C++ templates are no good for handling strings, even constant ones (though this may be fixed in C++0x, I haven't researched carefully), but if they were, you could pretty easily write a recursive descent parser in the C++ template language. I find that rather amusing.
It sounds like you're pretty new to parsing/compiler creation. If that's the case, I'd highly recommend not starting with C++. It's a monster of a language.
Either invent a trivial toy language of your own, or do something modeled on something much smaller and simpler. I saw a lua parser where the grammar definition was about a page long. That'd be much more reasonable as a starting point.
It will probably take you years, and you'll probably switch to some other parser generator in the process.
Parsing C++ is notoriously error-prone. The grammar is not fully LR-parsable, as many parts are context-sensitive. You won't be able to get it working right in flex/yacc, or at least it'll be really awkward to implement. There are only two front-ends I know of that get it right. Your best bet is to use one of these and focus on writing the back-end. That's where the interesting stuff is anyway :-).
Existing C++ Front Ends:
The EDG front-end is used by most of the commercial vendors (Intel, Portland Group, etc.) in their compilers. It costs money, but it's very thorough. People pay big bucks for it because they don't want to deal with the pain of writing their own C++ parser.
GCC's C++ front-end is thorough enough for production code, but you'd have to figure out how to integrate this into your project. I believe it's fairly involved to separate it from GCC. This would also be GPL, but I'm not sure whether that's a problem for you. You can use the GCC front-end in your project via gcc_xml, but this will only give you XML for classes, functions, namespaces, and typedefs. It won't give you a syntax tree for the code.
Another possibility is to use clang, but their C++ support is currently spotty. It'll be nice to see them get all the bugs out, but if you look at their C++ status page you'll notice there are more than a few test cases that still break. Take heed -- clang is a big project. If it's taking these guys years to implement a C++ front-end, it's going to take you longer.
Others have mentioned ANTLR, and there is a C++ grammar available for it, but I'm skeptical. I haven't heard of an ANTLR front end being used in any major compilers, though I do believe it's used in the NetBeans IDE. It might be suitable for an IDE, but I'm skeptical that you'd be able to use it on production code.
A long time, and lex and yacc won't help
If you have the skills to write a compiler for such a large language, you will not need the small amount of help that lex and yacc give you. In fact, while lex is OK it may take longer to use yacc, as it's not really quite powerful enough for C or C++, and you can end up spending far more time getting it to work right than it would take to just write a recursive descent parser.
I believe lex and yacc are best used for simple grammars, or when it is worth the extra effort to have a nicely readable grammar file, perhaps because the grammar is experimental and subject to change.
For that matter, the entire parser is possibly not the major part of your job, depending on exactly what goals you have for the code generator.
As others have already said, yacc is a poor choice for implementing a C++ parser. One can do it; the orginal GCC did so, before the GCC team got disgusted with how hard it was to maintain and extend. (Flex might be OK as a lexer).
Some say recursive descent parsers are best, because Bjarne Stroustrop said so. Our experience is the GLR parsing is the right answer for this, and our GLR-based C++ front end is a nice proof, as is the Elsa front end. Our front end has been used in anger on millions of lines of C++ (including Microsoft and GCC dialects) to carry out program analyses and massive source code transformation.
But what is not emphasized enough is that parsing is just a very small portion of what it takes to build a compiler, especially for C++. You need to also build symbol tables ("what does this identifier mean in this context?") and to do that you need to encode essentially most of several hundred pages of the C++ standard. We believe that the foundation on which we build compiler-like tools, DMS, is extremely good for doing this, and it took us over a man-year to get just this part right.
But then you have the rest of the compiler to consider:
Preprocessor
AST construction
Semantic analysis and type checking
Control, Data flow, and pointer analysis
Basic code generation
Optimizations
Register allocation
Final Code Generation
Debugging support
I keep saying this: building a parser (the BNF part) for a language is like climbing the foothills of the Himalayas. Building a full compiler is like climbing Everest. Pretty much any clod can do the former (although C++ is right at the edge). Only the really serious do the latter, and only when extremely well prepared.
Expect building a C++ compiler to take you years.
(The SD C++ front end handles lexing, parsing, AST generation, symbol tables, some type checking, and regeneration of compilable source text from the AST, including the original comments, for the major C++ dialects. It has been developed over a period of some 6 years).
EDIT: May, 2015. The original answer was written in 2010; we now have 11 years invested, taking us up through C++14. The point is that it is an endless, big effort to build one of these.
Firstly, the "flex" tag on SO is about Adobe's product, not the lexer generator. Secondly, Bjarne Stroustrup is on record as saying he wished he had implemented Cfront (the first C++ compiler) using recursive descent rather than a table driven tool. And thirdly, to answer your question directly - lots. If you feel you need to write one, take a look at ANTLR - not my favourite tool, but there are already C++ parsers for it.
This is a non-trivial problem, and would quite a lot of time to do correctly. For one thing, the grammar for C++ is not completely parseable by a LALR parser such as yacc. You can do subsets of the language, but getting the entire language specification correct is tricky.
You're not the first person to think that this is fun. Here's a nice blog-style article on the topic:
Parsing C++
Here's an important quote from the article:
"After lots of investigation, I
decided that writing a
parser/analysis-tool for C++ is
sufficiently difficult that it's
beyond what I want to do as a hobby."
The problem with that article is that it's a bit old, and several of the links are broken. Here are some links to some other resources on the topic of writing C++ parsers:
ANTLR Grammars (contain several grammars for C++)
A YACC-able C++ 2.1 Grammar and the resulting ambiguities
Parsing and Processing C++ Code (Wikipedia)
Lex,yacc will not be enough. You need a linker, assembler too.., c preprocessor.
It depends on how you do it.
How much pre-made components do you plan to use?
You need to get the description of the syntax and its token from somewhere.
For example, if you use LLVM, you can proceed faster. It already provides a lot of tools, assembler, linker, optimiser....
You can get a c preprocessor from boost project..
You need to create a test suite to test your compiler automatically.
It can take a year if you work on it each day or much less you have more talent and motivation.
Unless you have already written several other compilers; C++ is not a language you even want to start writing a compiler from scratch for, the language has a lot of places were the meaning requires a lot of context before the situation can be disambiguated.
Even if you have lots of experience writing compilers you are looking at several years for a team of developers. This is just to parse the code correctly into an intermediate format. Writing the backend to generate code is yet another specialized task (though you could steal the gcc backend).
If you do a google for "C++ grammars" there are a couple around to get you started.
C++ LEX Tokens: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxLexer.l
C++ YACC Grammer: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxTester.y
A few years - if you can get research grant to re-write new lex/yacc :-)
People keep chasing their tails on this a lot - starting with Stroustrup who was always fancied being a language "designer" rather than actual compiler writer (remember that his C++ was a mere codegen for ages andwould still be there if it wasn't for gcc and other folks).
The core issue is that real research on parser generators pretty much ceased to exist ever since CPU-s became fast enough to handle functional languages and brute-force recursive descent. Recursive descent is the last resort when you don't know what to do - it does exhaustive search till it nabs one "rule" that fires. Once you are content with that you kind of loose interest in researching how to do it efficiently.
What you'd essentially need is a reasonable middle-ground - like LALR(2) with fixed, limited backtraching (plus static checker to yell if "desiogner" splurges into a nondeterministic tree) and also limited and partitioned symbol table feedback (modern parser need to be concurrency-friendly).
Sounds like a research grant proposal, doesn't it :-) Now if we'd find someone to actually fund it, that would be something :-))
A C++ compiler is very complicated. To implement enough of C++ to be compatible with most C++ code out there would take several developers a couple of years full time. clang is a compiler project being funded by Apple to develop a new compiler for C, C++, and Objective-C, with several full-time developers, and the C++ support is still very far from being complete after a couple of years of development.
Recursive decent is a good choice to parse C++. GCC and clang use it.
The Elsa parser (and my ellcc compiler) use the Elkhound GLR compiler generator.
In either case, writing a C++ compiler is a BIG job.
Well, what do you mean by write a compiler?
I doubt any one guy has made a true C++ compiler that took it down all the way to assembly code, but I have used lex and yacc to make a C compiler and I have done it without.
Using both you can make a compiler that leaves out the semantics in a couple days, but figuring out how to use them can take weeks or months easily. Figuring out how to make a compiler at all will take weeks or months no matter what, but the figure I remember is once you know how it works it took a few days with lex and yacc and a few weeks without but the second had better results and fewer bugs so really it's questionable whether they are worth using at all.
The 'semantics' is the actual code production. That can be very simple code that's just enough to work and might not take long at all, or you could spend your whole life doing optimization on it.
With C++ the big issue is templates, but there's so many little issues and rules I can't imagine someone ever wanting to do this. Even if you DO finish, the problem is you won't necessarily have binary compatibility ie be able to be recognized as a runnable program by a linker or the OS because there's more to it than just C++ and its hard to pin down standard but there's also yet more standards to worry about which are even less widely available.

Learning C++ and overcautiousness

How should I learn C++? I hear that the language gives enough rope to shoot myself in the head, so should I treat every C++ line I write as a potential minefiled?
How should I learn C++?
Refer to:
Books to refer for learning OOP through C++
https://stackoverflow.com/questions/631793/good-book-to-learn-c-from
The Definitive C++ Book Guide and List
https://stackoverflow.com/questions/1122921/suggested-c-books
https://stackoverflow.com/questions/1686906/what-is-a-very-practical-c-book
https://stackoverflow.com/questions/681551/a-c-book-that-covers-non-syntax-related-problems
I hear that the language gives enough
rope to shoout myself in the head, so
should I treat every C++ line I write
as a potential minefiled?
A C++ statement can do either what you want it to do, or something else. This depends on your understanding of what that C++ statement means. But this is not specific to this language.
By learning the language, and using techniques for building correct software (like Object Oriented design, especially Design by Contract, and testing techniques), you will be able to guarantee that your program behaves as you intended it to.
I love your metaphor! What Stroustrup actually said was:
http://en.wikiquote.org/wiki/Bjarne_Stroustrup
C makes it easy to shoot yourself in
the foot; C++ makes it harder, but
when you do it blows your whole leg
off.
This was many years ago. I started learning C++ in ca. 1991 and it really was a minefield. There were no common libraries, no debuggers and the AT&T approach used a C code generator. There are now many good IDEs which support C++.
Personally I moved to Java because I find it a cleaner language but C++ is fine as long as you don't try to be tricky. Avoid native C constructs where there are existing class libraries (Stroustrup initially did not provide a String class as he though it a useful "rite of passage" to have to write one!) Now you can use a proven one.
I'm assuming you have no choice in the language. How you go about it depends on where you are coming from. C++ is not the easiest of the object-oriented languages to start on and Stroustrup's book is not necessarily the best intro.
UPDATE the OP is worried about blowing themselves up when learning the language. Generally it's a good idea to start with a subset of what one will do later. I assumed the OP is worried about:
Things which you have to know and use whatever level you program at (such as destructors)
Things which add additional complexity to the learning process and can be shelved until later (such as multiple inheritance)
what follows are some places where I blew myself up... They are not subjective, they happened!
There are some up-front gotchas that don't exist in Java or C#.
destructors. You have to manage your own memory. Failing to write destructors will blow your fingers and toes off.
equality. You will have to write an equals method (in simple Java you may get away without it)
copy constructor. Ditto. a = b will invoke this. Bites you in the bottom.
And I'd suggest avoiding multiple inheritance unless you really need it. Then avoid it anyway.
And avoid operator overloading. It looks cute to write:
vector1 = vector2 + vector3;
but
vector1 = vector2.plus(vector3);
is just as clear, only a few more characters, and you can search for it.
Well, it's not a minefield.
Really, the most problems are related to anything related to pointers, so you'll have to understand them (which it's not easy at first) and be careful when using them.
I think it's more a question of experience, having all the basics clear and trying to get a clear design since the beggining.
More than a minefield, I think it's like going to the most dangeours neighbourhood in your town. Yes, it's dangerous, but only for the ones without the attitude. :-D
I would say that that C++ is a challenging environment, if not a minefield. The fundamental issue is that problem symptoms and problem causes are not always easy to tie up. As Khelben has said one major reason for that is that we have pointers to deal with and hence we can do quite a lot of damage when pointers are not pointing where we think they are.
So you need to pay special attention when dealing with arrays and pointers, out-by-one errors can result in memory corruption and these then result in interesting problem manifestations.
Every formal language is a minefield. There're less mines in managed environments. For instance, in C# if you overblow an array you won't cause someone else's remote function to do strange things. You won't have code run differently in tests and prod because someone forgot to initialize a variable in constructor.
However, these are the easy ones. You learn to avoid them, and then you stay with the real mines, which are there in every language.
More specifically, these are some of the most important points when moving to C++:
always initialize variables. even theoretical possibility of having your program logic depend on what was in the memory beforehand is a nightmare.
dependencies: avoid data members of other compound types (classes) without pimpl idiom. This will make your users exposed to the inner workings of the types you use, and increase compilation time dramatically. Dependencies are your enemy.
in C++, you can optimize for performance in ridiculously huge number of ways. don't. Unless you are in the innermost loop of a heavy math software, and even then don't.
avoid DLLs on windows. They don't work with singletons, causing problems to popular libraries.
use boost, shared pointers whenever you can. avoid reinventing the wheel and regular pointers.
use std::string, smart containers instead of arrays. These are dangerous. It will be faster than managed containers anyway.
use RAII. This one is priceless.
prefer data members to inheritance, or you will expose the base type definition to your type's users.
learn to avoid nested includes with forward declarations.
How should I learn C++?
depends. where are you coming from? anyway, I'd suggest:
use an up-to-date compiler such as gcc-4.4 or 4.5
C++0x is worth it for the type inference alone (local variables don't need explicit type designations)
write small, standalone, short-lived utilities (try porting such tools written in other languages)
STL has complex parts, but the basic things are easy, don't shy away from it. FMPOV it embodies the spirit of C++
use state-of-the-art C++ libraries: stuff like Boost.Foreach, Boost.Tuple, Boost.Regex or Boost.Optional turn C++ into serious competition in the scripting department
when you're comfortable:
learn to generalize your code with templates
learn to use RAII
then:
add C libraries to the mix. this might be the first time you'll need to tinker with pointers and casts!
add OOP if you feel like it
should I treat every C++ line I write as a potential minefiled?
be cautious, but don't worry too much. it's true that you can't know what a + b means without knowing the whole program containing such an expression because of operator overloads and argument-dependent lookup, and I've seen many people whine about it. a killing counter-argument is that you cannot really know what a->plus(b) does in Java or a scripting language in the face of inheritance: all methods are virtual, yoyo effect in extremis! (this does kill me in large codebases with rampant inheritance written in languages w/o ADL or operator overloading!)
anecdotes from my experience learning basics of C and C++:
C: unless you do something really, really, really stupid, the program will compile just fine, and SIGSEGV or SIGBUS as soon as you run it
C++: unless you do something really, really, really "clever", the program will either fail to compile, or compile and do what you mean (a mantra Perl "inherited" from Interlisp as I've been told).
a ranty post scriptum:
C++ can be used as a much higher-level language than C: whereas you can't do almost anything beyond simple arithmetic without pointers in C, it's possible to write complete programs in C++ without a pointer in sight, save for char **argv.
there's a whole class of programs that can be implemented in C++ using it as a "scripting" language with unparalleled runtime speed and simple runtime environment (the "dll hell" is nothing compared to the volatility of real scripting languages).
however, the "scripting language" cloak is a leaky abstraction: it's built from native C++ mechanisms such as ADL, operator overloading and templates, and that has its price. get ready for abysmal compile times and unintelligible error messages. OTOH, at least the error messages can be greatly improved with tools like STLfilt, and I think it's well worth it overall.
one thing where C++ really shines in contrast to environments such as Java (perhaps C# too? don't know that one that well) is destructors (vs finalizers and GC). it's one of the pillars of the "scriptiness" of the language. whereas GC adds a whole level of semantic complexity (things don't cease to exist as soon as they're inaccessible from the program) and syntactic verbiage and duplication (finally), destructors are the workhorse of natural semantics and obviate code duplication that's unavoidable with finally.
BTW, "enough rope to shoot myself in the head" almost killed me. I think I'll borrow it. ;)
C++ has some pitfalls certainly, but writting safe code is certainly possible.
Some things to think about. There are far from the only things for writting safe C++ code but they seem like a good start.
Use std::string and std::vector to store strings and collections rather than C style strings and native arrays. They are much easier to get right.
When you allocate an object using new always think about who owns the pointer to this memory and is responsible for deleting it. If you can't think of a single owner for the data that manages it's lifetime, then either rethink your design or think about using a "smart pointer" to manage the lifetime.
Prefer indexing into arrays rather than using pointer arithmetic where possible. Whenever you index into an array every time ask your self "How do I know that this index can only index a valid index in the array".
If a class has a pointer to some data then write methods to act on that data. Don't write methods that return that pointer or at some point you'll end up using the pointer after the data has been deleted. (Not always possible but something to aim for)
If you write simple code that uses strings and vectors and as much as possible encapsule pointers as members of classes that both manage the lifetime of the data and provide the methods that act on that data then that's a good strarting point.
As others have said, read effective c++ and other books.
In C++, foot shoots YOU.
The question is do you need it for anything? If you want to make game code, 3d tools, or something similar you pretty much have to have it. If not, you don't. The errors people are afraid of are seldom big killers but there are plenty of other things that will come up if you make a large enough project.
You may find this spoof interview with Bjarne Stroustrup to be enlightening:
http://www-users.cs.york.ac.uk/susan/joke/cpp.htm
The syntax of C++ is easy, just like Java or C# with pointers. So learning C++ is fast.
The hard thing is that when it comes to a project, C++ is harder to use and more error prone compared to Java or C#. It is just too flexible and the programmer is responsible for too many things.
In a 100 lines of code, you don't need to worry about memory and null pointers at all as you can find them quickly. But when it comes to 10000 lines of code, memory management could be hard. The exception mechanism in C++ is also weak. Thirdly, you need to worry the null pointer problem in C++ in a big project.
I look at the dilemma from a different perspective. The more discipline you have in development the faster you can develop quality robust code. Assembly requires more discipline than C. C requires more discipline than C++.
Don't worry about hanging yourself, blowing your foot or leg off. Just work on improving your quality develop process. For example, a code review will help regardless of the language. Unit testing and test frameworks will also save some bloodshed. Everything boils down to project deadlines and money.

Write C++ in a graphical scratch-like way?

I am considering the possibility of designing an application that would allow people to develop C++ code graphically. I was amazed when I discovered Scratch (see site and tutorial videos).
I believe most of C++ can be represented graphically, with the exceptions of preprocessor instructions and possibly function pointers.
What C++ features do you think could be (or not be) represented by graphical items?
What would be the pros and cons of such an application ? How much simpler would it be than "plain" C++?
RECAP and MORE:
Pros:
intuitive
simple for small applications
helps avoid typos
Cons:
may become unreadable for large (medium?) - sized applications
manual coding is faster for experienced programmers
C++ is too complicated a language for such an approach
Considering that we -at my work- already have quite a bit of existing C++ code, I am not looking for a completely new way of programming. I am considering an alternate way of programming that is fully compatible with legacy code. Some kind of "viral language" that people would use for new code and, hopefully, would eventually use to replace existing code as well (where it could be useful).
How do you feel towards this viral approach?
When it comes to manual vs graphical programming, I tend to agree with your answers. This is why, ideally, I'll find a way to let the user always choose between typing and graphical programming. A line-by-line parser (+partial interpreter) might be able to convert typed code into graphical design. It is possible. Let's all cross our fingers.
Are there caveats to providing both typing and graphical programming capabilities that I should think about and analyze carefully?
I have already worked on template classes (and more generally type-level C++) and their graphical representation.
See there for an example of graphical representation of template classes. Boxes represent classes or class templates. First top node is the class itself, the next ones (if any) are typedef instructions inside the class. Bottom nodes are template arguments. Edges, of course, connect classes to template arguments for instantiations.
I already have a prototype for working on such type-level diagrams.
If you feel this way of representing template classes is plain wrong, don't hesitate to say so and why!
Much as I like Scratch, it is still much quicker for an experienced programmer to write code using a text editor than it is to drag blocks around, This has been proved time and again with any number of graphical programming environments.
Writing code is the easiest part of a developers day. I don't think we need more help with that. Reading, understanding, maintaining, comparing, annotating, documenting, and validating is where - despite a gargantuan amount of tools and frameworks - we still are lacking.
To dissect your pros:
Intuitive and simple for small applications - replace that with "misleading". It makes it look simple, but it isn't: As long as it is simple, VB.NET is simpler. When it gets complicated, visual design would get in the way.
Help avoid typos - that's what a good style, consistency and last not least intellisense are for. The things you need anyway when things aren't simple anymore.
Wrong level
You are thinking on the wrong level: C++ statements are not reusable, robust components, they are more like a big bag of gears that need to be put together correctly. C++ with it's complexity and exceptions (to rules) isn't even particulary suited.
If you want to make things easy, you need reusable components at a much higher level. Even if you have these, plugging them together is not simple. Despite years of struggle, and many attempts in many environments, this sometimes works and often fails.
Viral - You are correct IMO about that requriement: allow incremental adoption. This is closely related to switching smoothly between source code and visual representation, which in turn probably means you must be able to generate the visual representation from modified source code.
IDE Support - here's where most language-centered approaches go astray. A modern IDE is more than just a text editor and a compiler. What about debugging your graph - with breakpoints, data inspection etc? Will profilers, leak detectors etc. highlight nodes in your graph? Will source control give me a Visual Diff of yesterday's graph vs. today's?
Maybe you are on to something, despite all my "no"s: a better way to visualize code, a way to put different filters on it so that I see just what I need to see.
The early versions of C++ were originally written so that they compiled to C, then the C was compiled as normal.
What it sounds like you are describing is a graphical language that is compiled to C++, which will then be compiled as normal.
So really you are not creating a graphical C++, you are creating a new language that happens to be graphical. Nothing wrong with that, but don't let C++ restrict what you do, because eventually you may want to compile the graphical language straight to machine code, or even to something like CIL, Java ByteCode, or whatever else tickles your fancy.
Other graphical languages you may want to check out are LabVIEW, and more generally the category of visual programming languages.
Good luck in your efforts.
The complexity of a nontrivial program is usually too high to be represented with graphical symbols, which are low in their information content. Unless your approach is markedly different in some way, I am skeptical that this would be of value based on past efforts.
So, practically speaking, his will be useful only for instructional purposes and very simple programs. But that would still be a great target market for a product like this. sometimes people have trouble grasping the fundamentals, and a visual model might be just the thing to help things click.
Interesting idea. I doubt I'd use it though. I tend to prefer coding in a flat text editor, not even an IDE, and for tough problems I prefer a pad of paper. Most of the really experienced programmers I know work this way, Maybe it's because we grew up in a different environment, but I think it's also because of the way we think about programming. As you get more experience, you start seeing the code in your head more clearly than any GUI tool will show it to you.
As for your question, I'd nominate templates as one of the harder / more interesting sort of thing to try to represent well. They are ubiquitous and carry information that you won't have access to as you are designing your tool. Getting that to the user in a useful way should pose an interesting challenge.
What C++ features do you think could be [...] represented by graphical items?
Object Oriented Design. Hence classes, inheritance, polymorphism, mutability, const-ness etc. And, templates.
What would be the pros and cons of such an application?
It may be easier for beginners to start writing programs. For the experienced, it may be get rid of the boring parts of programming.
Think of any other code generator. They create a framework for you to write the more involved portion(s). They also lead to bloated-code (think of any WYSIWYG HTML editor).
The biggest challenge, as I see it, is that any such UI necessarily hinders the user's imagination.
How much simpler would it be than "plain" c++ ?
It can be a real pain, when you wade through truckloads of errors which is typical of code generators.
Further, since a lot of code is generated, you have no idea of what is going on -- debugging becomes difficult.
Also, for the experienced there may be some irritation to find that the generated code is not per their preferred coding style.
I prefer hot-keys instead graphical menus and buttons.
And I think same thing will happen with graphical development tool. Many peoples will prefer manual codding.
But, source code visualizer - should be nice thing.
I like the idea, but I suspect there comes a point where things get far too complicated to be represented graphically.
However, given recent experience at work; it would be useful to give such a graphical interface to a non-techie person to use to create basic drag-and-drop programs, leaving myself free to get on with some "proper" programming ;-) If it can do the job of allowing somebody non-skilled to build something functional it can be a very good thing (even if programming logic escapes them)
There comes a point in such a system where it becomes easier to define what you want to do using literal C++ code, rather than have a user interface getting in the way; it can get frustrating to the sessioned programmer knowing the precise code that needs to be written but then only being limited to the design GUI. I'm specifically thinking about a more common application, such as html editors/designers in which they allow newbies to build their websites without knowing any html at all.
It would be interesting to see how such a system would handle the dynamic allocation of memory, and the different states of a program as time progressed; I suspect that there are some very basic programming concepts that may be difficult to represent graphically.. Polymorphism..? Virtual Classes, LinkList, Stacks/Circular Queues. I wonder for a moment how you would explain an image compression algorithm (such as jpg) successfully too without the help of a gigantic display screen.
I also wonder if such a system would even go to such a low level, and whether you would be dealing with abstracted concepts and the compiler would be working out the best way to do something.
I've been working on a new model-driven software development paradigm named ABSE (http://www.abse.info) that supports end-user programming: It's a template-based system that can be complemented with transformation code. I also have an IDE (named AtomWeaver) implementing ABSE that is in pre-alpha stage right now.
With AtomWeaver, as an expert/architect, you build your knowledge Templates, and then the developers (or end-users if you make your meta-models simpler) can just "assemble" systems by building blocks, and then filling template parameters in form-style editors.
At the end, pressing the "Generate" button will create the final system as specified by the architect/expert.
I'm surprised you think function pointers would be a particular problem. How about anything at all to do with pointers?
A programming language can be represented by a hierarchy of nodes - that's exactly what the compiler turns it into. It is very strange that the UI for editing programs is still a sequence of characters that get parsed, because the degrees of freedom in the editor is way larger than the available set of allowed choices. But intellisense helps to reduce this problem a lot.
C++ would be a strange choice to base such a system on.
I think the major problem of this kind of IDEs are that the code generated becomes unmantainable easily.
This happened to Delphi. It's a really nice tool to develop some kind of applications, however, when we start adding complex relationships between the components, start adding Design Patterns, etc. the code grows to an unmantainable size.
I believe it's also because graphical tools don't apply the concept of MVC (or if they do, it's only in the way that the IDE understands).
It can be really helpful for prototypes and very small applications that don't tend to grow, otherwise it can become a mess for the developer(s)

Converting C source to C++

How would you go about converting a reasonably large (>300K), fairly mature C codebase to C++?
The kind of C I have in mind is split into files roughly corresponding to modules (i.e. less granular than a typical OO class-based decomposition), using internal linkage in lieu private functions and data, and external linkage for public functions and data. Global variables are used extensively for communication between the modules. There is a very extensive integration test suite available, but no unit (i.e. module) level tests.
I have in mind a general strategy:
Compile everything in C++'s C subset and get that working.
Convert modules into huge classes, so that all the cross-references are scoped by a class name, but leaving all functions and data as static members, and get that working.
Convert huge classes into instances with appropriate constructors and initialized cross-references; replace static member accesses with indirect accesses as appropriate; and get that working.
Now, approach the project as an ill-factored OO application, and write unit tests where dependencies are tractable, and decompose into separate classes where they are not; the goal here would be to move from one working program to another at each transformation.
Obviously, this would be quite a bit of work. Are there any case studies / war stories out there on this kind of translation? Alternative strategies? Other useful advice?
Note 1: the program is a compiler, and probably millions of other programs rely on its behaviour not changing, so wholesale rewriting is pretty much not an option.
Note 2: the source is nearly 20 years old, and has perhaps 30% code churn (lines modified + added / previous total lines) per year. It is heavily maintained and extended, in other words. Thus, one of the goals would be to increase mantainability.
[For the sake of the question, assume that translation into C++ is mandatory, and that leaving it in C is not an option. The point of adding this condition is to weed out the "leave it in C" answers.]
Having just started on pretty much the same thing a few months ago (on a ten-year-old commercial project, originally written with the "C++ is nothing but C with smart structs" philosophy), I would suggest using the same strategy you'd use to eat an elephant: take it one bite at a time. :-)
As much as possible, split it up into stages that can be done with minimal effects on other parts. Building a facade system, as Federico Ramponi suggested, is a good start -- once everything has a C++ facade and is communicating through it, you can change the internals of the modules with fair certainty that they can't affect anything outside them.
We already had a partial C++ interface system in place (due to previous smaller refactoring efforts), so this approach wasn't difficult in our case. Once we had everything communicating as C++ objects (which took a few weeks, working on a completely separate source-code branch and integrating all changes to the main branch as they were approved), it was very seldom that we couldn't compile a totally working version before we left for the day.
The change-over isn't complete yet -- we've paused twice for interim releases (we aim for a point-release every few weeks), but it's well on the way, and no customer has complained about any problems. Our QA people have only found one problem that I recall, too. :-)
What about:
Compiling everything in C++'s C subset and get that working, and
Implementing a set of facades leaving the C code unaltered?
Why is "translation into C++ mandatory"? You can wrap the C code without the pain of converting it into huge classes and so on.
Your application has lots of folks working on it, and a need to not-be-broken.
If you are serious about large scale conversion to an OO style, what
you need is massive transformation tools to automate the work.
The basic idea is to designate groups of data as classes, and then
get the tool to refactor the code to move that data into classes,
move functions on just that data into those classes,
and revise all accesses to that data to calls on the classes.
You can do an automated preanalysis to form statistic clusters to get some ideas,
but you'll still need an applicaiton aware engineer to decide what
data elements should be grouped.
A tool that is capable of doing this task is our DMS Software Reengineering
Toolkit.
DMS has strong C parsers for reading your code, captures the C code
as compiler abstract syntax trees, (and unlike a conventional compiler)
can compute flow analyses across your entire 300K SLOC.
DMS has a C++ front end that can be used as the "back" end;
one writes transformations that map C syntax to C++ syntax.
A major C++ reengineering task on a large avionics system gives
some idea of what using DMS for this kind of activity is like.
See technical papers at
www.semdesigns.com/Products/DMS/DMSToolkit.html,
specifically
Re-engineering C++ Component Models Via Automatic Program Transformation
This process is not for the faint of heart. But than anybody
that would consider manual refactoring of a large application
is already not afraid of hard work.
Yes, I'm associated with the company, being its chief architect.
I would write C++ classes over the C interface. Not touching the C code will decrease the chance of messing up and quicken the process significantly.
Once you have your C++ interface up; then it is a trivial task of copy+pasting the code into your classes. As you mentioned - during this step it is vital to do unit testing.
GCC is currently in midtransition to C++ from C. They started by moving everything into the common subset of C and C++, obviously. As they did so, they added warnings to GCC for everything they found, found under -Wc++-compat. That should get you on the first part of your journey.
For the latter parts, once you actually have everything compiling with a C++ compiler, I would focus on replacing things that have idiomatic C++ counterparts. For example, if you're using lists, maps, sets, bitvectors, hashtables, etc, which are defined using C macros, you will likely gain a lot by moving these to C++. Likewise with OO, you'll likely find benefits where you are already using a C OO idiom (like struct inheritence), and where C++ will afford greater clarity and better type checking on your code.
Your list looks okay except I would suggest reviewing the test suite first and trying to get that as tight as possible before doing any coding.
Let's throw another stupid idea:
Compile everything in C++'s C subset and get that working.
Start with a module, convert it in a huge class, then in an instance, and build a C interface (identical to the one you started from) out of that instance. Let the remaining C code work with that C interface.
Refactor as needed, growing the OO subsystem out of C code one module at a time, and drop parts of the C interface when they become useless.
Probably two things to consider besides how you want to start are on what you want to focus, and where you want to stop.
You state that there is a large code churn, this may be a key to focus your efforts. I suggest you pick the parts of your code where a lot of maintenance is needed, the mature/stable parts are apparently working well enough, so it is better to leave them as they are, except probably for some window dressing with facades etc.
Where you want to stop depends on what the reason is for wanting to convert to C++. This can hardly be a goal in itself. If it is due to some 3rd party dependency, focus your efforts on the interface to that component.
The software I work on is a huge, old code base which has been 'converted' from C to C++ years ago now. I think it was because the GUI was converted to Qt. Even now it still mostly looks like a C program with classes. Breaking the dependencies caused by public data members, and refactoring the huge classes with procedural monster methods into smaller methods and classes never has really taken off, I think for the following reasons:
There is no need to change code that is working and that does not need to be enhanced. Doing so introduces new bugs without adding functionality, and end users don't appreciate that;
It is very, very hard to do refactor reliably. Many pieces of code are so large and also so vital that people hardly dare touching it. We have a fairly extensive suite of functional tests, but sufficient code coverage information is hard to get. As a result, it is difficult to establish whether there are already sufficient tests in place to detect problems during refactoring;
The ROI is difficult to establish. The end user will not benefit from refactoring, so it must be in reduced maintenance cost, which will increase initially because by refactoring you introduce new bugs in mature, i.e. fairly bug-free code. And the refactoring itself will be costly as well ...
NB. I suppose you know the "Working effectively with Legacy code" book?
You mention that your tool is a compiler, and that: "Actually, pattern matching, not just type matching, in the multiple dispatch would be even better".
You might want to take a look at maketea. It provides pattern matching for ASTs, as well as the AST definition from an abstract grammar, and visitors, tranformers, etc.
If you have a small or academic project (say, less than 10,000 lines), a rewrite is probably your best option. You can factor it however you want, and it won't take too much time.
If you have a real-world application, I'd suggest getting it to compile as C++ (which usually means primarily fixing up function prototypes and the like), then work on refactoring and OO wrapping. Of course, I don't subscribe to the philosophy that code needs to be OO structured in order to be acceptable C++ code. I'd do a piece-by-piece conversion, rewriting and refactoring as you need to (for functionality or for incorporating unit testing).
Here's what I would do:
Since the code is 20 years old, scrap down the parser/syntax analyzer and replace it with one of the newer lex/yacc/bison(or anything similar) etc based C++ code, much more maintainable and easier to understand. Faster to develop too if you have a BNF handy.
Once this is retrofitted to the old code, start wrapping modules into classes. Replace global/shared variables with interfaces.
Now what you have will be a compiler in C++ (not quite though).
Draw a class diagram of all the classes in your system, and see how they are communicating.
Draw another one using the same classes and see how they ought to communicate.
Refactor the code to transform the first diagram to the second. (this might be messy and tricky)
Remember to use C++ code for all new code added.
If you have some time left, try replacing data structures one by one to use the more standardized STL or Boost.

Are C++ meta-templates required knowledge for programmers?

In my experience Meta-templates are really fun (when your compilers are compliant), and can give good performance boosts, and luckily I'm surrounded by seasoned C++ programmers that also grok meta-templates, however occasionally a new developer arrives and can't make heads or tails of some of the meta-template tricks we use (mostly Andrei Alenxandrescu stuff), for a few weeks until he gets initiated appropriately.
So I was wondering what's the situation for other C++ programmers out there?
Should meta-template programming be something C++ programmers should be "required" to know (excluding entry level students of course), or not?
Edit: Note my question is related to production code and not little samples or prototypes
If you can you find enough candidates who really know template meta-programing then by all means, require it. You will be showing a lot of qualified and potentially productive people the door (there are plenty of legitimate reasons not to know how to do this, namely that if you do it on a lot of platforms, you will create code that can't compile, or that average developers will have trouble understanding). Template meta-programming is great, but let's face it, it's pushing C++ to the limit.
Now, a candidate should probably understand basics (compute n! at compile time, or at least explain how it works if they are shown the code). If your new developers are reliably becoming productive within a few weeks, then your current recruiting is probably pretty good.
Yes, but I would not personally place a high priority on it. It's a nifty feature, but it's a bit situational, and good C++ code can be developed without it. I've personally used it once or twice, but haven't really found it to be valuable enough in my work to regularly use it. (Maybe that's a function of my lack of C++ production experience, though)
The only use I've ever made of template metaprogramming in production code is to unroll a critical loop which read a hardware register N times, followed by another M times, N, M different for different hardware and known at compile time. In general, the techniques don't seem a natural fit for our codebase, and I'd never get them through a code review.
Required? As always, it depends. For those of us in embedded land who are just now getting semi-decent C++ compilers for our tiny little DSPs and what not, we're just happy to be able to use classes.
However, if you've got a halfway decent C++ compiler, say gcc 3.3ish+, then yes, you should look at template metaprogramming. A good start is the boost library, of course, because it covers most of the templates you seem to look around for when the STL runs out of gas. It also serves as a great jumping off point.
However, sometimes I find that the advantages of template metaprogramming (lots of nice type safe code with a few lines of < and >) aren't worth the cost that it's going to take. Sometimes, a good old for( container::const_iterator iter = ... ) does what you need just fine.
18 months later and this topic is still very relevent. I would still say that template meta programming is not required knowledge, but you need to be able to at least read and explain the basics such as conditionals and the curiously repeating template pattern (looping). Beyond that, as long as you have a few people who can write a good interface to it, then just basic to intermediate template knowledge is all that is really required, though YMMV.
As someone who makes reasonable (although not extensive) use of templates and meta-programming, I go out of my way to try to make the interfaces (and by that I mean the internal usage interfaces) as normal as possible. Not everyone can understand templates, and even those that can sometimes cannot understand complex or obtuse meta programming paradigms.
With that said, if you want to dig in a modify my low-level libraries, you're going to have to know quite a bit. However, you should not even have to know templates (aside from baseline knowledge) to use them. That's how I draw the line at least, and the knowledge level I would expect in other developers (depending on how they are using the code).
I wouldn't consider template programming required, but it's definitely good to know. You should know enough about the subject to be able to effectively use template libraries such as the STL or Boost.
When I interview someone, I will always ask some questions about template meta-programming. If the candidate doesn't know about the subject, I would never hold that against them. But if they do, then it's a big plus in their favor.
It's not absolutely necessary to know how to use C++ templates. You can do most things without them. They are however a fantastic feature.
Since you roll your own templates, anyone new is going to have to come up to speed with them just like the rest of your code which is going to be the bigger chunk of the learning.
I encourage people to learn to use some of the features of the STL. I have used this library in production code and it does save time and simplify things quite a bit. I also roll my own when the need arises.
I've also heard good things about the boost library.
If I need to write portable code then I'll generally stick away from templates because many compilers still don't support them properly. If you need a portable STL then STLPort is the most portable.