Tool for model checking large, distributed C++ projects such as KDE? - c++

Is there a tool which can handle model checking large, real-world, mostly-C++, distributed systems, such as KDE?
(KDE is a distributed system in the sense that it uses IPC, although typically all of the processes are on the same machine. Yes, by the way, this is a valid usage of "distributed system" - check Wikipedia.)
The tool would need to be able to deal with intraprocess events and inter-process messages.
(Let's assume that if the tool supports C++, but doesn't support other stuff that KDE uses such as moc, we can hack something together to workaround that.)
I will happily accept less general (e.g. static analysers specialised for finding specific classes of bugs) or more general static analysis alternatives, in lieu of actual model checkers. But I am only interested in tools that can actually handle projects of the size and complexity of KDE.

You're obviously looking for a static analysis tool that can
parse C++ at scale
locate code fragments of interest
extract a model
pass that model to a model checker
report that result to you
A significant problem is that everybody has a different idea about what model they'd like to check. That alone likely kills your chance of finding exactly what you want, because each model extraction tool has generally made a choice as to what it wants to capture as a model, and the chances that it matches what you want precisely are IMHO close to zero.
You aren't clear on what specifically you want to model, but I presume you want to find the communication primitives and model the process interactions to check for something like deadlock?
The commercial static analysis tool vendors seem like a logical place to look, but I don't think they are there, yet. Coverity would seem like the best bet, but it appears they only have some kind of dynamic analysis for Java threading issues.
This paper claims to do this, but I have not looked at it in any detail: Compositional analysis of C/C++ programs with VeriSoft. Related is [PDF] Computer-Assisted Assume/Guarantee Reasoning with VeriSoft. It appears you have to hand-annotate the source code to indicate the modelling elements of interest. The VeriSoft tool itself appears to be proprietary to Bell Labs and is likely hard to obtain.
Similarly this one: Distributed Verification of Multi-threaded C++ Programs.
This paper also makes interesting claims, but doesn't process C++ in spite of the title:
Runtime Model Checking of Multithreaded C/C++ Programs.
While all the parts of this are difficult, an issue they all share is parsing C++ (as exemplified by the previously quoted paper) and finding the code patterns that provide the raw information for the model. You also need to parse the specific dialect of C++ you are using; it doesn't help that the C++ compilers all accept slightly different languages. And, as you have observed, processing large C++ code bases is necessary. Model checkers (SPIN and friends) are relatively easy to find.
Our DMS Software Reengineering Toolkit provides general-purpose parsing, with customizable pattern matching and fact extraction, and has a robust C++ front end that handles many dialects of C++ (EDIT Feb 2019: including C++17 in ANSI, GCC and MS flavors). It could likely be configured to find and extract the facts that correspond to the model you care about. But it doesn't do this off the shelf.
DMS with its C front end has been used to process extremely large C applications (19,000 compilation units!). The C++ front end has been used in anger on a variety of large-scale C++ projects (EDIT Feb 2019: including large-scale refactoring of APIs across 3000+ compilation units). Given DMS's general capability, I think it is likely capable of handling fairly big chunks of code. YMMV.

Static code analyzers, when used against a large code base for the first time, usually produce so many warnings and alerts that you won't be able to analyze all of them in a reasonable amount of time. It is hard to single out real problems from code that just looks suspicious to a tool.
You can try automatic invariant discovery tools like Daikon, which capture perceived invariants at run time. You can later check whether the discovered invariants (equivalence relations such as "a == b + 1", for example) make sense and then insert permanent asserts into your code. This way, when an invariant is violated as a result of a change, you get a warning that you may have broken something. This method helps you avoid restructuring or changing your code to add tests and mocks.
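For instance, if a tool like Daikon reports that "a == b + 1" holds at the entry of some function, you can freeze that observation as a permanent check. A minimal sketch (the function and variable names are placeholders, not from any real code base):

#include <cassert>

// Hypothetical function: the invariant "a == b + 1" was reported by a
// run-time invariant detector and is now asserted permanently, so any
// future change that breaks it fails loudly in debug builds.
int advance_cursor(int a, int b) {
    assert(a == b + 1 && "invariant discovered at run time was violated");
    // ... original logic ...
    return a;
}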

The usual way of applying formal techniques to large systems is to modularise them and write specifications for the interfaces of each module. Then you can verify each module independently (while verifying a module, you import the specifications - but not the code - of the other modules it calls). This approach makes verification scalable.
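In C++ terms, a module's "specification" is usually its interface plus the contract it promises. A crude but practical stand-in for a formal spec is an abstract interface with documented pre/postconditions, so a client module can be verified (or at least unit-tested) against the spec of its callees rather than their code. A hypothetical sketch, not tied to any particular tool:

#include <cassert>
#include <cstddef>

// Specification of a message-transport module: clients are checked
// against this interface, not against any particular implementation.
class IMessageTransport {
public:
    virtual ~IMessageTransport() = default;

    // Precondition:  data != nullptr and len > 0.
    // Postcondition: returns the number of bytes accepted, <= len.
    virtual std::size_t send(const void* data, std::size_t len) = 0;
};

// While verifying a client module, a stub that only checks the contract
// stands in for the real transport.
class ContractCheckingStub : public IMessageTransport {
public:
    std::size_t send(const void* data, std::size_t len) override {
        assert(data != nullptr && len > 0);  // the stated precondition
        return len;                          // trivially meets the postcondition
    }
};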

Related

Lua vs embedded Lisp and other potential candidates for set-based data processing

Current Choice: LuaJIT. Impressive benchmarks, and I am getting used to the syntax. Writing a high-performance ABI will require careful consideration of how I will structure my C++.
Other Questions of interest
Gambit-C and Guile as embeddable languages
Lua Performance Tips (I have the option of running with the collector disabled, and calling the collector at the end of a processing run is always an option).
Background
I'm working on a realtime high volume (complex) event processing system. I have a DSL that represents the schema of the event structure at source, the storage format, certain domain specific constructs, firing internal events (to structure and drive general purpose processing), and encoding certain processing steps that always happen.
The DSL looks pretty similar to SQL; in fact I am using Berkeley DB (via the sqlite3 interface) for long-term storage of events. The important part here is that the processing of events is done set-based, like SQL. I have come to the conclusion, however, that I should not add general-purpose processing logic to the DSL, and should rather embed Lua or Lisp to take care of this.
The processing core is built around boost::asio. It is multithreaded, RPC is done via protocol buffers, and events are encoded using the protocol buffer IO library -- i.e., the events are not structured as protocol buffer objects, they just use the same encoding/decoding library. I will create a dataset object that contains rows, pretty similar to how a database engine stores in-memory sets. Processing steps in the DSL will be taken care of first and then presented to the general-purpose processing logic.
Regardless of which embeddable scripting environment I use, each thread in my processing core will probably need its own embedded-language environment (that is how Lua requires it to be, at least, if you are doing multi-threaded work).
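To make the "one environment per thread" point concrete, here is a minimal sketch of what that looks like with plain Lua (5.1-style C API); the script file name and the thread count are made up:

#include <thread>
#include <vector>

extern "C" {
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
}

void worker(int /*id*/) {
    lua_State* L = luaL_newstate();  // private Lua environment for this thread
    luaL_openlibs(L);
    // Load the general-purpose processing script into this state
    // ("process.lua" is a hypothetical file name).
    if (luaL_dofile(L, "process.lua") != 0) {
        lua_pop(L, 1);  // drop the error message; real code would log it
    }
    // ... pull work items off a queue and hand them to the script ...
    lua_close(L);
}

int main() {
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker, i);
    for (auto& t : pool) t.join();
}

Since the states share nothing, no Lua-level locking is needed; any shared data has to cross the boundary explicitly.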
The Question(s)
The choice at the moment is between Lisp (ECL) and Lua. Keeping in mind that performance and throughput are strong requirements, which means minimising memory allocations is highly desired:
If you were in my position, which language would you choose?
Are there any alternatives I should consider (don't suggest languages that don't have an embeddable implementation)? JavaScript via V8 perhaps?
Does Lisp fit the domain better? I don't think Lua and Lisp are that different in terms of what they provide. Call me out :D
Are there any other properties (like the ones below) I should be thinking about?
I assert that any form of embedded database IO (see the example DSL below for context) dwarfs the scripting language call by orders of magnitude, and that picking either will not add much overhead to the overall throughput. Am I on the right track? :D
Desired Properties
I would like to map my dataset onto a Lisp list or Lua table, and I would like to minimise redundant data copies. For example, adding a row from one dataset to another should try to use reference semantics if both tables have the same shape.
I can guarantee that the dataset passed as input will not change while the Lua/Lisp call is in progress. I want Lua or Lisp to enforce that the dataset is not altered as well, if possible.
After the embedded call ends, the datasets should be destroyed; any references created would need to be replaced with copies (I guess).
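On the "enforce not altering the dataset" point: one way to approximate it (a sketch, not the only option) is to hand the script a proxy table whose metatable forwards reads to the real dataset but rejects writes. The helper below is plain Lua installed from C++ through the C API; nested row tables would need the same treatment if they must be immutable too, and iteration would need a small helper as well, since pairs() only sees the empty proxy.

extern "C" {
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
}

// Defines make_readonly(t): reads go through __index to the real table,
// writes trip __newindex and raise an error in the script.
static const char* kReadOnlyHelper =
    "function make_readonly(t)\n"
    "  return setmetatable({}, {\n"
    "    __index    = t,\n"
    "    __newindex = function() error('dataset is read-only', 2) end,\n"
    "  })\n"
    "end\n";

void install_readonly_helper(lua_State* L) {
    if (luaL_dostring(L, kReadOnlyHelper) != 0) {
        lua_pop(L, 1);  // drop the error message; real code would log it
    }
}

The proxy costs one extra table per dataset and an extra indirection per read, which keeps copies to a minimum while still protecting the input.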
DSL Example
I attach a DSL for your viewing pleasure so you can get an idea of what I am trying to achieve. Note: The DSL does not show general purpose processing.
// Derived Events : NewSession EndSession
NAMESPACE WebEvents
{
    SYMBOLTABLE DomainName(TEXT) AS INT4;
    SYMBOLTABLE STPageHitId(GUID) AS INT8;
    SYMBOLTABLE UrlPair(TEXT hostname, TEXT scriptname) AS INT4;
    SYMBOLTABLE UserAgent(TEXT UserAgent) AS INT4;

    EVENT 3:PageInput
    {
        //------------------------------------------------------------//
        REQUIRED 1:PagehitId GUID;
        REQUIRED 2:Attribute TEXT;
        REQUIRED 3:Value TEXT;
        FABRICATED 4:PagehitIdSymbol INT8;
        //------------------------------------------------------------//
        PagehitIdSymbol AS PROVIDED(INT8 ph_symbol)
                        OR Symbolise(PagehitId) USING STPagehitId;
    }

    // Derived Event : Pagehit
    EVENT 2:PageHit
    {
        //------------------------------------------------------------//
        REQUIRED 1:PageHitId GUID;
        REQUIRED 2:SessionId GUID;
        REQUIRED 3:DateHit DATETIME;
        REQUIRED 4:Hostname TEXT;
        REQUIRED 5:ScriptName TEXT;
        REQUIRED 6:HttpRefererDomain TEXT;
        REQUIRED 7:HttpRefererPath TEXT;
        REQUIRED 8:HttpRefererQuery TEXT;
        REQUIRED 9:RequestMethod TEXT; // or int4
        REQUIRED 10:Https BOOL;
        REQUIRED 11:Ipv4Client IPV4;
        OPTIONAL 12:PageInput EVENT(PageInput)[];
        FABRICATED 13:PagehitIdSymbol INT8;
        //------------------------------------------------------------//
        PagehitIdSymbol AS PROVIDED(INT8 ph_symbol)
                        OR Symbolise(PagehitId) USING STPagehitId;
        FIRE INTERNAL EVENT PageInput PROVIDE(PageHitIdSymbol);
    }

    EVENT 1:SessionGeneration
    {
        //------------------------------------------------------------//
        REQUIRED 1:BinarySessionId GUID;
        REQUIRED 2:Domain STRING;
        REQUIRED 3:MachineId GUID;
        REQUIRED 4:DateCreated DATETIME;
        REQUIRED 5:Ipv4Client IPV4;
        REQUIRED 6:UserAgent STRING;
        REQUIRED 7:Pagehit EVENT(pagehit);
        FABRICATED 8:DomainId INT4;
        FABRICATED 9:PagehitId INT8;
        //-------------------------------------------------------------//
        DomainId AS SYMBOLISE(domain) USING DomainName;
        PagehitId AS SYMBOLISE(pagehit:PagehitId) USING STPagehitId;
        FIRE INTERNAL EVENT pagehit PROVIDE (PagehitId);
    }
}
This project is a component of a Ph.D. research project and is/will be free software. If you're interested in working with me on (or contributing to) this project, please leave a comment :D
I strongly agree with jpjacobs's points. Lua is an excellent choice for embedding, unless there's something very specific about Lisp that you need (for instance, if your data maps particularly well to cons cells).
I've used lisp for many many years, BTW, and I quite like lisp syntax, but these days I'd generally pick Lua. While I like the lisp language, I've yet to find a lisp implementation that captures the wonderful balance of features/smallness/usability for embedded use the way Lua does.
Lua:
Is very small, both source and binary, an order of magnitude or more smaller than many more popular languages (Python etc). Because the Lua source code is so small and simple, it's perfectly reasonable to just include the entire Lua implementation in your source tree, if you want to avoid adding an external dependency.
Is very fast. The Lua interpreter is much faster than most scripting languages (again, an order of magnitude is not uncommon), and LuaJIT2 is a very good JIT compiler for some popular CPU architectures (x86, arm, mips, ppc). Using LuaJIT can often speed things up by another order of magnitude, and in many cases, the result approaches the speed of C. LuaJIT is also a "drop-in" replacement for standard Lua 5.1: no application or user code changes are required to use it.
Has LPEG. LPEG is a "Parsing Expression Grammar" library for Lua, which allows very easy, powerful, and fast parsing, suitable for both large and small tasks; it's a great replacement for yacc/lex/hairy-regexps. [I wrote a parser using LPEG and LuaJIT which is much faster than the yacc/lex parser I was trying to emulate, and was very easy and straightforward to create.] LPEG is an add-on package for Lua, but is well worth getting (it's one source file).
Has a great C interface, which makes it a pleasure to call Lua from C, or call C from Lua (a small sketch follows this list). For interfacing large/complex C++ libraries, one can use SWIG, or any one of a number of interface generators (one can also just use Lua's simple C interface with C++, of course).
Has liberal licensing ("BSD-like"), which means Lua can be embedded in proprietary projects if you wish, and is GPL-compatible for FOSS projects.
Is very, very elegant. It's not Lisp, in that it's not based around cons cells, but it shows clear influences from languages like Scheme, with a straightforward and attractive syntax. Like Scheme (at least in its earlier incarnations), it tends towards "minimal" but does a good job of balancing that with usability. For somebody with a Lisp background (like me!), a lot about Lua will seem familiar, and "make sense", despite the differences.
Is very flexible, and such features as metatables allow easily integrating domain-specific types and operations.
Has a simple, attractive, and approachable syntax. This might not be such an advantage over lisp for existing lisp users, but might be relevant if you intend to have end-users write scripts.
Is designed for embedding, and besides its small size and fast speed, has various features such as an incremental GC that make using a scripting language more viable in such contexts.
Has a long history, and responsible and professional developers, who have shown good judgment in how they've evolved the language over the last 2 decades.
Has a vibrant and friendly user-community.
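Regarding the C interface mentioned above, here is a minimal sketch of calling a Lua function from C++ through the plain C API (Lua 5.1 style; the function name "process_row" is invented):

#include <iostream>

extern "C" {
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
}

int main() {
    lua_State* L = luaL_newstate();
    luaL_openlibs(L);

    // A trivial chunk; in practice this would come from a script file.
    luaL_dostring(L, "function process_row(a, b) return a + b end");

    lua_getglobal(L, "process_row");   // push the function
    lua_pushnumber(L, 2);              // first argument
    lua_pushnumber(L, 40);             // second argument
    if (lua_pcall(L, 2, 1, 0) != 0) {  // 2 args, 1 result
        std::cerr << "error: " << lua_tostring(L, -1) << "\n";
    } else {
        std::cout << "result: " << lua_tonumber(L, -1) << "\n";  // 42
        lua_pop(L, 1);
    }
    lua_close(L);
}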
You don't state what platform you are using, but if it would be capable of using LuaJIT 2 I'd certainly go for that, since execution speeds approach that of compiled code, and interfacing with C code just got a whole lot easier with the FFI library.
I don't quite know the other embeddable scripting languages, so I can't really compare what they can do, or how they work with tables.
Lua mostly works with references: all functions, userdata, tables are used by reference, and are collected on the next gc run when no references to the data are left.
Strings are internalised, so a certain string is in the memory only once.
The thing to take into account is that you should avoid creating and subsequently discarding loads of tables, since this can slow down the GC cycle (as explained in the Lua gem you cited)
For parsing your code sample, I'd take a look at the LPEG library
There are a number of options for implementing high-performance embedded compilers. One is the Mono VM; it naturally comes with dozens of ready-made, high-quality languages implemented on top of it, and it is quite embeddable (see how Second Life is using it). It is also possible to use LLVM - your DSL does not look complicated, so implementing an ad hoc compiler would not be a big deal.
I happened to work on a project which has some parts similar to yours. It's a cross-platform system running on Windows CE, Android and iOS, and I needed to maximise the amount of cross-platform code, so C/C++ combined with an embeddable language was a good choice. Here is my solution in relation to your questions.
If you were in my position which language would you chose ?
The DSL in my project is similar to yours. For performance, I wrote a compiler with yacc/lex to compile the DSL to a binary form for runtime, plus a bunch of APIs to get information out of the binary, but it was annoying: whenever something in the DSL syntax changed, I had to modify both the compiler and the APIs. So I abandoned the DSL and turned to XML (don't write XML directly - a well-defined schema is worth it), wrote a general compiler converting the XML to a Lua table, and reimplemented the APIs in Lua. By doing this I got two benefits: readability and flexibility, without perceivable performance degradation.
Are there any alternatives I should consider (don't suggest languages that don't have an embeddable implementation). Javascript v8 perhaps ?
Before choosing Lua, I considered Embedded Ch (mostly used in industrial control systems), embedded Lisp and Lua. In the end Lua stood out, because Lua is well integrated with C, Lua has a prosperous community, and Lua is easy for other team members to learn. Regarding JavaScript/V8: it's like using a steam hammer to crack nuts if used in an embedded realtime system.
Does lisp fit the domain better ? I don't think lua and lisp are that different in terms of what they provide. Call me out :D
For my domain, Lisp and Lua have the same ability semantically: they can both handle an XML-based DSL easily, or you might even write a simple compiler converting the XML to a Lisp list or a Lua table, and they can both handle the domain logic easily. But Lua is better integrated with C/C++; that is what Lua aims for.
Are there any other properties (like the ones below) I should be thinking about ?
Working alone or with team members is also a weighting factor in solution selection. Nowadays not many programmers are familiar with Lisp-like languages.
I assert that any form of embedded database IO (see the example DSL below for context) dwarfs the scripting language call on orders of magnitude, and that picking either will not add much overhead to the overall throughput. Am I on the right track ? :D
Here is a list of programming language performance comparisons, and here is a list of access times of computer components. If your system is IO-bound, the overhead of the script is not the key point. My system is an O&M (Operation & Maintenance) system, so script performance is insignificant.

Is Communicating Sequential Processes ever used in large multi threaded C++ programs?

I'm currently writing a large multi threaded C++ program (> 50K LOC).
As such I've been motivated to read up a lot on various techniques for handling multi-threaded code. One theory I've found to be quite cool is:
http://en.wikipedia.org/wiki/Communicating_sequential_processes
And it's invented by a slightly famous guy, who's made other non-trivial contributions to concurrent programming.
However, is CSP used in practice? Can anyone point to any large application written in a CSP style?
Thanks!
CSP, as a process calculus, is fundamentally a theoretical thing that enables us to formalize and study some aspects of a parallel program.
If you instead want a theory that enables you to build distributed programs, then you should take a look at parallel structured programming.
Parallel structured programming is the basis of current HPC (high-performance computing) research and provides a methodology for how to approach and design parallel programs (essentially, flowcharts of communicating computing nodes) and runtime systems to implement them.
A central idea in parallel structured programming is that of the algorithmic skeleton, developed initially by Murray Cole. A skeleton is something like a parallel design pattern with an associated cost model and (usually) a run-time system that supports it. A skeleton models, studies and supports a class of parallel algorithms that have a certain "shape".
As a notable example, mapreduce (made popular by Google) is just a kind of skeleton that addresses data parallelism, where a computation can be described by a map phase (apply a function f to all elements that compose the input data) and a reduce phase (take all the transformed items and "combine" them using an associative operator +).
I find the idea of parallel structured programming both theoretically sound and practically useful, so I suggest giving it a look.
A word about multi-threading: since skeletons address massive parallelism, they are usually implemented on distributed memory rather than shared memory. Intel has developed a tool, TBB, which addresses multi-threading and (partially) follows the parallel structured programming framework. It is a C++ library, so you can probably just start using it in your projects.
Yes and no. The basic idea of CSP is used quite a bit. For example, thread-safe queues in one form or another are frequently used as the primary (often only) communication mechanism to build a pipeline out of individual processes (threads).
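A minimal sketch of that idea in C++11 (names and the fixed capacity are illustrative): a bounded, thread-safe queue as the only channel between two stages. Bounding it also gives you throttling for free, since a fast producer blocks when the slow consumer falls behind.

#include <condition_variable>
#include <cstddef>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Bounded thread-safe queue: the only communication channel between stages.
template <typename T>
class Channel {
public:
    explicit Channel(std::size_t capacity) : capacity_(capacity) {}

    void send(T value) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < capacity_; });
        q_.push(std::move(value));
        not_empty_.notify_one();
    }

    T receive() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return value;
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> q_;
    std::size_t capacity_;
};

int main() {
    Channel<int> ch(8);  // small capacity, so a fast producer blocks when ahead

    std::thread producer([&] {
        for (int i = 1; i <= 20; ++i) ch.send(i);
        ch.send(-1);  // sentinel: end of stream
    });
    std::thread consumer([&] {
        for (int v = ch.receive(); v != -1; v = ch.receive())
            std::cout << v << '\n';
    });

    producer.join();
    consumer.join();
}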
Hoare being Hoare, however, there's quite a bit more to his original theory than that. He invented a notation for talking about the processes, defined a specific set of signals that can be sent between the processes, and so on. The notation has since been refined in various ways, quite a bit of work put into proving various aspects, and so on.
Application of that relatively formal model of CSP (as opposed to just the general idea) is much less common. It's been used in a few systems where high reliability was considered extremely important, but few programmers appear interested in learning (yet another) formal design notation.
When I've designed systems like this, I've generally used an approach that's less rigorous, but (at least to me) rather easier to understand: a fairly simple diagram, with boxes representing the processes, and arrows representing the lines of communication. I doubt I could really offer much in the way of a proof about most of the designs (and I'll admit I haven't designed a really huge system this way), but it's worked reasonably well nonetheless.
Take a look at the website for a company called Verum. Their ASD technology is based on CSP and is used by companies like Philips Healthcare, Ericsson and NXP semiconductors to build software for all kinds of high-tech equipment and applications.
So to answer your question: Yes, CSP is used on large software projects in real-life.
Full disclosure: I do freelance work for Verum
Answering a very old question, yet one that seems important enough to revisit:
There is Go where CSPs are a fundamental part of the language. In the FAQ to Go, the authors write:
Concurrency and multi-threaded programming have a reputation for difficulty. We believe this is due partly to complex designs such as pthreads and partly to overemphasis on low-level details such as mutexes, condition variables, and memory barriers. Higher-level interfaces enable much simpler code, even if there are still mutexes and such under the covers.
One of the most successful models for providing high-level linguistic support for concurrency comes from Hoare's Communicating Sequential Processes, or CSP. Occam and Erlang are two well known languages that stem from CSP. Go's concurrency primitives derive from a different part of the family tree whose main contribution is the powerful notion of channels as first class objects. Experience with several earlier languages has shown that the CSP model fits well into a procedural language framework.
Projects implemented in Go are:
Docker
Google's download server
Many more
This style is ubiquitous on Unix, where many tools are designed to process from standard in to standard out. I don't have any first-hand knowledge of large systems built that way, but I've seen many small once-off systems that are.
For instance, this simple command line uses (at least) 3 processes:
cat list-1 list-2 list-3 | sort | uniq > final.list
One system I wrote is only moderately sized: a protocol processor that strips away and interprets successive layers of protocol in a message, and it used a style very similar to this. It was an event-driven system using something akin to cooperative threading, but I could've used multithreading fairly easily with a couple of added tweaks.
The program is proprietary (unfortunately) so I can't show off the source code.
In my opinion, this style is useful for some things, but usually best mixed with some other techniques. Often there is a core part of your program that represents a processing bottleneck, and applying various concurrency increasing techniques there is likely to yield the biggest gains.
Microsoft had a technology called ActiveMovie (if I remember correctly) that did sequential processing on audio and video streams. Data got passed from one filter to another to go from input to output format (and source/sink). Maybe that's a practical example??
The Wikipedia article looks to me like a lot of funny symbols used to represent somewhat pedestrian concepts. For very large or extensible programs, the formalism can be very important for checking how the (sub)processes are allowed to interact.
For a 50,000-line program, you're probably better off architecting it as you see fit.
In general, following ideas such as these is a good idea in terms of performance. Persistent threads that process data in stages will tend not to contend, and exploit data locality well. Also, it is easy to throttle the threads to avoid data piling up as a fast stage feeds a slow stage: just block the fast one if its output buffer grows too big.
A little bit off-topic, but for my thesis I used a tool framework called TERRA/LUNA, which is aimed at software development for embedded control systems but is used heavily for all sorts of software development at my institute (so only academic use here).
TERRA is a graphical CSP and software architecture editor, and LUNA is both the name of a C++ library for CSP-based constructs and of the plugin you'll find in TERRA to generate C++ code from your CSP models.
It becomes very handy in combination with FDR3 (a CSP refinement checker) to detect any sort of lock (deadlock, livelock, etc.) or even for profiling.

Write C++ in a graphical scratch-like way?

I am considering the possibility of designing an application that would allow people to develop C++ code graphically. I was amazed when I discovered Scratch (see site and tutorial videos).
I believe most of C++ can be represented graphically, with the exceptions of preprocessor instructions and possibly function pointers.
What C++ features do you think could be (or not be) represented by graphical items?
What would be the pros and cons of such an application ? How much simpler would it be than "plain" C++?
RECAP and MORE:
Pros:
intuitive
simple for small applications
helps avoid typos
Cons:
may become unreadable for large- (medium-?) sized applications
manual coding is faster for experienced programmers
C++ is too complicated a language for such an approach
Considering that we -at my work- already have quite a bit of existing C++ code, I am not looking for a completely new way of programming. I am considering an alternate way of programming that is fully compatible with legacy code. Some kind of "viral language" that people would use for new code and, hopefully, would eventually use to replace existing code as well (where it could be useful).
How do you feel towards this viral approach?
When it comes to manual vs graphical programming, I tend to agree with your answers. This is why, ideally, I'll find a way to let the user always choose between typing and graphical programming. A line-by-line parser (+partial interpreter) might be able to convert typed code into graphical design. It is possible. Let's all cross our fingers.
Are there caveats to providing both typing and graphical programming capabilities that I should think about and analyze carefully?
I have already worked on template classes (and more generally type-level C++) and their graphical representation.
See there for an example of a graphical representation of template classes. Boxes represent classes or class templates. The first (top) node is the class itself, the next ones (if any) are typedef instructions inside the class. Bottom nodes are template arguments. Edges, of course, connect classes to template arguments for instantiations.
I already have a prototype for working on such type-level diagrams.
If you feel this way of representing template classes is plain wrong, don't hesitate to say so and why!
Much as I like Scratch, it is still much quicker for an experienced programmer to write code using a text editor than it is to drag blocks around. This has been proved time and again with any number of graphical programming environments.
Writing code is the easiest part of a developers day. I don't think we need more help with that. Reading, understanding, maintaining, comparing, annotating, documenting, and validating is where - despite a gargantuan amount of tools and frameworks - we still are lacking.
To dissect your pros:
Intuitive and simple for small applications - replace that with "misleading". It makes it look simple, but it isn't: as long as it is simple, VB.NET is simpler. When it gets complicated, visual design would get in the way.
Helps avoid typos - that's what good style, consistency and, last but not least, IntelliSense are for. Those are the things you need anyway when things aren't simple anymore.
Wrong level
You are thinking on the wrong level: C++ statements are not reusable, robust components, they are more like a big bag of gears that need to be put together correctly. C++, with its complexity and exceptions (to rules), isn't even particularly suited to this.
If you want to make things easy, you need reusable components at a much higher level. Even if you have these, plugging them together is not simple. Despite years of struggle, and many attempts in many environments, this sometimes works and often fails.
Viral - You are correct IMO about that requirement: allow incremental adoption. This is closely related to switching smoothly between source code and visual representation, which in turn probably means you must be able to generate the visual representation from modified source code.
IDE Support - here's where most language-centered approaches go astray. A modern IDE is more than just a text editor and a compiler. What about debugging your graph - with breakpoints, data inspection etc? Will profilers, leak detectors etc. highlight nodes in your graph? Will source control give me a Visual Diff of yesterday's graph vs. today's?
Maybe you are on to something, despite all my "no"s: a better way to visualize code, a way to put different filters on it so that I see just what I need to see.
The early versions of C++ were originally written so that they compiled to C, then the C was compiled as normal.
What it sounds like you are describing is a graphical language that is compiled to C++, which will then be compiled as normal.
So really you are not creating a graphical C++, you are creating a new language that happens to be graphical. Nothing wrong with that, but don't let C++ restrict what you do, because eventually you may want to compile the graphical language straight to machine code, or even to something like CIL, Java ByteCode, or whatever else tickles your fancy.
Other graphical languages you may want to check out are LabVIEW, and more generally the category of visual programming languages.
Good luck in your efforts.
The complexity of a nontrivial program is usually too high to be represented with graphical symbols, which are low in their information content. Unless your approach is markedly different in some way, I am skeptical that this would be of value, based on past efforts.
So, practically speaking, this will be useful only for instructional purposes and very simple programs. But that would still be a great target market for a product like this. Sometimes people have trouble grasping the fundamentals, and a visual model might be just the thing to help things click.
Interesting idea. I doubt I'd use it though. I tend to prefer coding in a flat text editor, not even an IDE, and for tough problems I prefer a pad of paper. Most of the really experienced programmers I know work this way. Maybe it's because we grew up in a different environment, but I think it's also because of the way we think about programming. As you get more experience, you start seeing the code in your head more clearly than any GUI tool will show it to you.
As for your question, I'd nominate templates as one of the harder / more interesting sort of thing to try to represent well. They are ubiquitous and carry information that you won't have access to as you are designing your tool. Getting that to the user in a useful way should pose an interesting challenge.
What C++ features do you think could be [...] represented by graphical items?
Object Oriented Design. Hence classes, inheritance, polymorphism, mutability, const-ness etc. And, templates.
What would be the pros and cons of such an application?
It may be easier for beginners to start writing programs. For the experienced, it may get rid of the boring parts of programming.
Think of any other code generator. They create a framework for you to write the more involved portion(s). They also lead to bloated-code (think of any WYSIWYG HTML editor).
The biggest challenge, as I see it, is that any such UI necessarily hinders the user's imagination.
How much simpler would it be than "plain" c++ ?
It can be a real pain, when you wade through truckloads of errors which is typical of code generators.
Further, since a lot of code is generated, you have no idea of what is going on -- debugging becomes difficult.
Also, for the experienced there may be some irritation to find that the generated code is not per their preferred coding style.
I prefer hot-keys to graphical menus and buttons.
And I think the same thing will happen with a graphical development tool: many people will prefer manual coding.
But a source code visualizer - that would be a nice thing.
I like the idea, but I suspect there comes a point where things get far too complicated to be represented graphically.
However, given recent experience at work, it would be useful to give such a graphical interface to a non-techie person to use to create basic drag-and-drop programs, leaving myself free to get on with some "proper" programming ;-) If it can do the job of allowing somebody non-skilled to build something functional, it can be a very good thing (even if programming logic escapes them).
There comes a point in such a system where it becomes easier to define what you want to do using literal C++ code, rather than have a user interface getting in the way; it can get frustrating to the seasoned programmer who knows the precise code that needs to be written but is then limited to the design GUI. I'm specifically thinking of a more common application, such as HTML editors/designers that allow newbies to build their websites without knowing any HTML at all.
It would be interesting to see how such a system would handle dynamic allocation of memory, and the different states of a program as time progresses; I suspect there are some very basic programming concepts that may be difficult to represent graphically: polymorphism? Virtual classes, linked lists, stacks/circular queues? I wonder how you would successfully explain an image compression algorithm (such as JPEG) without the help of a gigantic display screen.
I also wonder if such a system would even go to such a low level, and whether you would be dealing with abstracted concepts and the compiler would be working out the best way to do something.
I've been working on a new model-driven software development paradigm named ABSE (http://www.abse.info) that supports end-user programming: It's a template-based system that can be complemented with transformation code. I also have an IDE (named AtomWeaver) implementing ABSE that is in pre-alpha stage right now.
With AtomWeaver, as an expert/architect, you build your knowledge Templates, and then the developers (or end-users if you make your meta-models simpler) can just "assemble" systems by building blocks, and then filling template parameters in form-style editors.
At the end, pressing the "Generate" button will create the final system as specified by the architect/expert.
I'm surprised you think function pointers would be a particular problem. How about anything at all to do with pointers?
A programming language can be represented by a hierarchy of nodes - that's exactly what the compiler turns it into. It is very strange that the UI for editing programs is still a sequence of characters that gets parsed, because the degrees of freedom in the editor are way larger than the available set of allowed choices. But IntelliSense helps to reduce this problem a lot.
C++ would be a strange choice to base such a system on.
I think the major problem with this kind of IDE is that the generated code easily becomes unmaintainable.
This happened with Delphi. It's a really nice tool for developing certain kinds of applications; however, when we start adding complex relationships between the components, start adding design patterns, etc., the code grows to an unmaintainable size.
I believe it's also because graphical tools don't apply the concept of MVC (or if they do, it's only in the way that the IDE understands).
It can be really helpful for prototypes and very small applications that don't tend to grow; otherwise it can become a mess for the developer(s).

Converting C source to C++

How would you go about converting a reasonably large (>300K), fairly mature C codebase to C++?
The kind of C I have in mind is split into files roughly corresponding to modules (i.e. less granular than a typical OO class-based decomposition), using internal linkage in lieu of private functions and data, and external linkage for public functions and data. Global variables are used extensively for communication between the modules. There is a very extensive integration test suite available, but no unit (i.e. module) level tests.
I have in mind a general strategy:
Compile everything in C++'s C subset and get that working.
Convert modules into huge classes, so that all the cross-references are scoped by a class name, but leaving all functions and data as static members, and get that working (a rough sketch of this step follows the list).
Convert huge classes into instances with appropriate constructors and initialized cross-references; replace static member accesses with indirect accesses as appropriate; and get that working.
Now, approach the project as an ill-factored OO application, and write unit tests where dependencies are tractable, and decompose into separate classes where they are not; the goal here would be to move from one working program to another at each transformation.
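To make step 2 concrete, here is roughly what "module becomes a class full of static members" looks like for a hypothetical parser module (all names invented for illustration):

// Before: a C "module" -- file-scope state plus externally visible functions.
//
//   static int token_count;      /* internal linkage = effectively private */
//   int  parser_errors;          /* external linkage = public              */
//   void parser_init(void);
//   int  parser_next_token(void);

// After step 2: the same names, now scoped by a class, everything still static.
class Parser {
public:
    static int  errors;        // was the global parser_errors
    static void init();        // was parser_init()
    static int  next_token();  // was parser_next_token()

private:
    static int token_count;    // was the file-static token_count
};

// Cross-references elsewhere change mechanically from parser_errors to
// Parser::errors; step 3 later turns these statics into real members
// accessed through an instance.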
Obviously, this would be quite a bit of work. Are there any case studies / war stories out there on this kind of translation? Alternative strategies? Other useful advice?
Note 1: the program is a compiler, and probably millions of other programs rely on its behaviour not changing, so wholesale rewriting is pretty much not an option.
Note 2: the source is nearly 20 years old, and has perhaps 30% code churn (lines modified + added / previous total lines) per year. It is heavily maintained and extended, in other words. Thus, one of the goals would be to increase maintainability.
[For the sake of the question, assume that translation into C++ is mandatory, and that leaving it in C is not an option. The point of adding this condition is to weed out the "leave it in C" answers.]
Having just started on pretty much the same thing a few months ago (on a ten-year-old commercial project, originally written with the "C++ is nothing but C with smart structs" philosophy), I would suggest using the same strategy you'd use to eat an elephant: take it one bite at a time. :-)
As much as possible, split it up into stages that can be done with minimal effects on other parts. Building a facade system, as Federico Ramponi suggested, is a good start -- once everything has a C++ facade and is communicating through it, you can change the internals of the modules with fair certainty that they can't affect anything outside them.
We already had a partial C++ interface system in place (due to previous smaller refactoring efforts), so this approach wasn't difficult in our case. Once we had everything communicating as C++ objects (which took a few weeks, working on a completely separate source-code branch and integrating all changes to the main branch as they were approved), it was very seldom that we couldn't compile a totally working version before we left for the day.
The change-over isn't complete yet -- we've paused twice for interim releases (we aim for a point-release every few weeks), but it's well on the way, and no customer has complained about any problems. Our QA people have only found one problem that I recall, too. :-)
What about:
Compiling everything in C++'s C subset and get that working, and
Implementing a set of facades leaving the C code unaltered?
Why is "translation into C++ mandatory"? You can wrap the C code without the pain of converting it into huge classes and so on.
Your application has lots of folks working on it, and a need to not-be-broken.
If you are serious about large-scale conversion to an OO style, what you need is massive transformation tools to automate the work.
The basic idea is to designate groups of data as classes, and then get the tool to refactor the code to move that data into classes, move functions that operate on just that data into those classes, and revise all accesses to that data into calls on the classes.
You can do an automated pre-analysis to form statistical clusters to get some ideas, but you'll still need an application-aware engineer to decide which data elements should be grouped.
A tool that is capable of doing this task is our DMS Software Reengineering Toolkit.
DMS has strong C parsers for reading your code, captures the C code as compiler abstract syntax trees, and (unlike a conventional compiler) can compute flow analyses across your entire 300K SLOC.
DMS has a C++ front end that can be used as the "back" end; one writes transformations that map C syntax to C++ syntax.
A major C++ reengineering task on a large avionics system gives some idea of what using DMS for this kind of activity is like. See the technical papers at www.semdesigns.com/Products/DMS/DMSToolkit.html, specifically Re-engineering C++ Component Models Via Automatic Program Transformation.
This process is not for the faint of heart. But then, anybody who would consider manual refactoring of a large application is already not afraid of hard work.
Yes, I'm associated with the company, being its chief architect.
I would write C++ classes over the C interface. Not touching the C code will decrease the chance of messing up and quicken the process significantly.
Once you have your C++ interface up; then it is a trivial task of copy+pasting the code into your classes. As you mentioned - during this step it is vital to do unit testing.
GCC is currently in mid-transition from C to C++. They started by moving everything into the common subset of C and C++, obviously. As they did so, they added warnings to GCC for everything they found, available under -Wc++-compat. That should get you started on the first part of your journey.
For the latter parts, once you actually have everything compiling with a C++ compiler, I would focus on replacing things that have idiomatic C++ counterparts. For example, if you're using lists, maps, sets, bitvectors, hashtables, etc. which are defined using C macros, you will likely gain a lot by moving these to C++. Likewise with OO, you'll likely find benefits where you are already using a C OO idiom (like struct inheritance), and where C++ will afford greater clarity and better type checking on your code.
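For example, the common C "struct inheritance" idiom (a base struct embedded as the first member, plus pointer casts) maps directly onto real inheritance once the code compiles as C++, and a hand-rolled intrusive list can become a standard container. A hypothetical before/after:

#include <string>
#include <vector>

// C idiom: "derived" structs embed the base as their first member and
// rely on pointer casts.
//
//   struct node      { int kind; struct node* next; };
//   struct call_node { struct node base; const char* callee; };

// C++ counterpart: the relationship is explicit and type-checked.
struct Node {
    int kind;
    virtual ~Node() = default;
};

struct CallNode : Node {
    std::string callee;
};

std::vector<Node*> worklist;  // replaces the macro/next-pointer list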
Your list looks okay except I would suggest reviewing the test suite first and trying to get that as tight as possible before doing any coding.
Let's throw another stupid idea:
Compile everything in C++'s C subset and get that working.
Start with a module, convert it into a huge class, then into an instance, and build a C interface (identical to the one you started from) out of that instance. Let the remaining C code work with that C interface.
Refactor as needed, growing the OO subsystem out of C code one module at a time, and drop parts of the C interface when they become useless.
Probably two things to consider besides how you want to start are on what you want to focus, and where you want to stop.
You state that there is a large code churn, this may be a key to focus your efforts. I suggest you pick the parts of your code where a lot of maintenance is needed, the mature/stable parts are apparently working well enough, so it is better to leave them as they are, except probably for some window dressing with facades etc.
Where you want to stop depends on what the reason is for wanting to convert to C++. This can hardly be a goal in itself. If it is due to some 3rd party dependency, focus your efforts on the interface to that component.
The software I work on is a huge, old code base which has been 'converted' from C to C++ years ago now. I think it was because the GUI was converted to Qt. Even now it still mostly looks like a C program with classes. Breaking the dependencies caused by public data members, and refactoring the huge classes with procedural monster methods into smaller methods and classes never has really taken off, I think for the following reasons:
There is no need to change code that is working and that does not need to be enhanced. Doing so introduces new bugs without adding functionality, and end users don't appreciate that;
It is very, very hard to refactor reliably. Many pieces of code are so large and so vital that people hardly dare touch them. We have a fairly extensive suite of functional tests, but sufficient code-coverage information is hard to get. As a result, it is difficult to establish whether there are already sufficient tests in place to detect problems during refactoring;
The ROI is difficult to establish. The end user will not benefit from refactoring, so it must be in reduced maintenance cost, which will increase initially because by refactoring you introduce new bugs in mature, i.e. fairly bug-free code. And the refactoring itself will be costly as well ...
NB. I suppose you know the "Working effectively with Legacy code" book?
You mention that your tool is a compiler, and that: "Actually, pattern matching, not just type matching, in the multiple dispatch would be even better".
You might want to take a look at maketea. It provides pattern matching for ASTs, as well as the AST definition from an abstract grammar, and visitors, transformers, etc.
If you have a small or academic project (say, less than 10,000 lines), a rewrite is probably your best option. You can factor it however you want, and it won't take too much time.
If you have a real-world application, I'd suggest getting it to compile as C++ (which usually means primarily fixing up function prototypes and the like), then work on refactoring and OO wrapping. Of course, I don't subscribe to the philosophy that code needs to be OO structured in order to be acceptable C++ code. I'd do a piece-by-piece conversion, rewriting and refactoring as you need to (for functionality or for incorporating unit testing).
Here's what I would do:
Since the code is 20 years old, scrap the parser/syntax analyzer and replace it with newer lex/yacc/bison (or similar) based C++ code, which is much more maintainable and easier to understand. It is faster to develop, too, if you have a BNF handy.
Once this is retrofitted to the old code, start wrapping modules into classes. Replace global/shared variables with interfaces.
Now what you have will be a compiler in C++ (not quite though).
Draw a class diagram of all the classes in your system, and see how they are communicating.
Draw another one using the same classes and see how they ought to communicate.
Refactor the code to transform the first diagram to the second. (this might be messy and tricky)
Remember to use C++ code for all new code added.
If you have some time left, try replacing data structures one by one to use the more standardized STL or Boost.

Self Testing Systems

I had an idea I was mulling over with some colleagues. None of us knew whether or not it exists currently.
The Basic Premise is to have a system that has 100% uptime but can become more efficient dynamically.
Here is the scenario:
* We hash out a system quickly to a specified set of interfaces. It has zero optimizations, yet we are confident that it is 100% stable (dubious, but for the sake of this scenario please play along).
* We then profile the original classes, and start to program replacements for the bottlenecks.
* The original and the replacement are initiated simultaneously and synchronized.
* An original is allowed to run to completion: if a replacement hasn't completed, it is vetoed by the system as a replacement for the original.
* A replacement must always return the same value as the original, for a specified number of times, and for a specific range of values, before it is adopted as a replacement for the original.
* If an exception occurs after a replacement is adopted, the system automatically retries the same operation with the class that the replacement superseded.
Have you seen a similar concept in practice? Critique please ...
Below are comments written after the initial question in response to posts:
* The system demonstrates a Darwinian approach to system evolution.
* The original and replacement would run in parallel, not in series.
* Race conditions are an inherent issue in multi-threaded apps, and I acknowledge them.
I believe this idea to be an interesting theoretical debate, but not very practical for the following reasons:
To make sure the new version of the code works well, you need to have superb automatic tests, which is a goal that is very hard to achieve and one that many companies fail to develop. You can only go on with implementing the system after such automatic tests are in place.
The whole point of this system is performance tuning, that is - a specific version of the code is replaced by a version that supersedes it in performance. For most applications today, performance is of minor importance. Meaning, the overall performance of most applications is adequate - just think about it, you probably rarely find yourself complaining that "this application is excruciatingly slow"; instead you usually find yourself complaining about the lack of a specific feature, stability issues, UI issues etc. Even when you do complain about slowness, it's usually the overall slowness of your system and not just a specific application (there are exceptions, of course).
For applications or modules where performance is a big issue, the way to improve them is usually to identify the bottlenecks, write a new version and test it independently of the system first, using some kind of benchmarking. Benchmarking the new version of the entire application might also be necessary, of course, but in general I think this process would only take place a very small number of times (following the 20%-80% rule). Doing this process "manually" in these cases is probably easier and more cost-effective than the described system.
What happens when you add features, fix non-performance-related bugs, etc.? You don't get any benefit from the system.
Running the two versions in conjunction to compare their performance has far more problems than you might think - not only might you have race conditions, but if the input is not an appropriate benchmark, you might get the wrong result (e.g. if you happen to get loads of small data packets while 90% of the time the input is large data packets). Furthermore, it might just be impossible (for example, if the actual code changes the data, you can't run them in conjunction).
The only "environment" where this sounds useful and actually "a must" is a "genetic" system that generates new versions of the code by itself, but that's a whole different story and not really widely applicable...
A system that runs performance benchmarks while operating is going to be slower than one that doesn't. If the goal is to optimise speed, why wouldn't you benchmark independently and import the fastest routines once they are proven to be faster?
And your idea of starting routines simultaneously could introduce race conditions.
Also, if a goal is to ensure 100% uptime you would not want to introduce untested routines since they might generate uncatchable exceptions.
Perhaps your ideas have merit as a harness for benchmarking rather than an operational system?
Have I seen a similar concept in practice? No. But I'll propose an approach anyway.
It seems like most of your objectives would be meet by some sort of super source control system, which could be implemented with CruiseControl.
CruiseControl can run unit tests to ensure correctness of the new version.
You'd have to write a CruiseControl builder plugin that would execute the new version of your system against a series of existing benchmarks to ensure that the new version is an improvement.
If the CruiseControl build loop passes, then the new version would be accepted. Such a process would take considerable effort to implement, but I think it feasible. The unit tests and benchmark builder would have to be pretty slick.
I think an Inversion of Control Container like OSGi or Spring could do most of what you are talking about. (dynamic loading by name)
You could build on top of their stuff. Then implement your code to
divide work units into discrete modules / classes (strategy pattern)
identify each module by unique name and associate a capability with it
when a module is requested it is requested by capability and at random one of the modules with that capability is used.
keep performance stats (get system tick before and after execution and store the result)
if an exception occurs mark that module as do not use and log the exception.
If the modules do their work by message passing you can store the message until the operation completes successfully and redo with another module if an exception occurs.
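A minimal sketch of those mechanics in C++ (no OSGi/Spring, all names invented): capabilities map to interchangeable modules, one is picked at random, its run time is recorded, and it is retired after an exception so the caller can retry with another.

#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

struct Module {
    std::string name;
    std::function<int(int)> run;  // the unit of work (illustrative signature)
    double total_ms = 0;          // accumulated run time
    long   calls    = 0;
    bool   disabled = false;      // set after a thrown exception
};

class Registry {
public:
    void add(const std::string& capability, Module m) {
        modules_[capability].push_back(std::move(m));
    }

    // Pick a random, still-enabled module for the capability and run it.
    int invoke(const std::string& capability, int input) {
        auto& candidates = modules_.at(capability);
        std::vector<Module*> enabled;
        for (auto& m : candidates)
            if (!m.disabled) enabled.push_back(&m);
        if (enabled.empty()) throw std::runtime_error("no module available");

        std::uniform_int_distribution<std::size_t> pick(0, enabled.size() - 1);
        Module& m = *enabled[pick(rng_)];

        auto t0 = std::chrono::steady_clock::now();
        try {
            int result = m.run(input);
            m.total_ms += std::chrono::duration<double, std::milli>(
                              std::chrono::steady_clock::now() - t0).count();
            ++m.calls;
            return result;
        } catch (...) {
            m.disabled = true;  // mark as "do not use"; the caller may retry
            throw;
        }
    }

private:
    std::map<std::string, std::vector<Module>> modules_;
    std::mt19937 rng_{std::random_device{}()};
};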
For design ideas for high availability systems, check out Erlang.
I don't think code will learn to be better by itself. However, some runtime parameters can easily be adjusted to optimal values, but that would be just regular programming, right?
About the on-the-fly change: I've wondered about the same thing and would build it on top of Lua, or a similar dynamic language. One could have parts that are loaded and, if they are replaced, reloaded into use. No rocket science in that, either. If the "old code" is still running, that's perfectly all right, since unlike with DLLs, the file is needed only when reading it in, not while executing code that came from it.
Usefulness? Naa...