I'm looking for some code analyzing tool (either static or dynamic analysis) that can detect concurrency issues like race conditions or deadlock in F# programs.
I know F# is based on the 'Actor model' and that it has some inherent concurrency support, so I'm assuming that there is a possibility of issues like race conditions and deadlock arising. If so, I'm wandering if there is any tool that can take in the F# source code and find program points where such issues can occur.
Does anyone have any ideas about this? Thanks in advance!
I don't think there are any tools for this. It's still a very hard problem (given the program state explosion.) Microsoft Research developed the CHESS tool for concurrency testing Win32 and .NET code (oriented towards locks and mutable data) The tool is still experimental and I don't think further versions have been developed or at least published.
IMHO The best approach for this problem is still model checking. That is develop an abstract model from your problem and use a specific tool for analyzing this model. For example CSP is a language for modeling interactions (oriented towards the actor model) and FDR is a tool for finding concurrency issues in those models.
On a sidenote F# is not based on the actor model. You can implement the actor model using mailboxes but you can also use locks and mutable data (relevant since a checker tool would need to check both approaches).
Related
Is there a tool which can handle model checking large, real-world, mostly-C++, distributed systems, such as KDE?
(KDE is a distributed system in the sense that it uses IPC, although typically all of the processes are on the same machine. Yes, by the way, this is a valid usage of "distributed system" - check Wikipedia.)
The tool would need to be able to deal with intraprocess events and inter-process messages.
(Let's assume that if the tool supports C++, but doesn't support other stuff that KDE uses such as moc, we can hack something together to workaround that.)
I will happily accept less general (e.g. static analysers specialised for finding specific classes of bugs) or more general static analysis alternatives, in lieu of actual model checkers. But I am only interested in tools that can actually handle projects of the size and complexity of KDE.
You're obviously looking for a static analysis tool that can
parse C++ on scale
locate code fragments of interest
extract a model
pass that model to a model checker
report that result to you
A significant problem is that everybody has a different idea about what model they'd like to check.
That alone likely kills your chance of finding exactly what you want, because each model extraction tool
has generally made a choice as to what it wants to capture as a model, and the chances that it matches
what you want precisely are IMHO close to zero.
You aren't clear on what specifically you want to model, but I presume you want to find the communication
primitives and model the process interactions to check for something like deadlock?
The commercial static analysis tool vendors seem like a logical place to look, but I don't think they are there, yet. Coverity would seem like the best bet, but it appears they only have some kind of dynamic analysis for Java threading issues.
This paper claims to do this, but I have not looked at in any detail: Compositional analysis of C/C++ programs
with VeriSoft. Related is [PDF] Computer-Assisted Assume/Guarantee Reasoning with VeriSoft. It appears you have to hand-annotate
the source code to indicate the modelling elements of interest. The Verifysoft tool itself appears to be proprietary to Bell Labs and is likely hard to obtain.
Similarly this one: Distributed Verification of Multi-threaded C++ Programs .
This paper also makes interesting claims, but doesn't process C++ in spite of the title:
Runtime Model Checking of Multithreaded C/C++ Programs.
While all the parts of this are difficult, an issue they all share is parsing C++ (as exemplified by
the previously quoted paper) and finding the code patterns that provide the raw information for the model.
You also need to parse the specific dialect of C++ you are using; its not nice that the C++ compilers all accept different languages. And, as you have observed, processing large C++ codes is necessary. Model checkers (SPIN and friends) are relatively easy to find.
Our DMS Software Reengineering Toolkit provides for general purpose parsing, with customizable pattern matching and fact extraction, and has a robust C++ Front End that handles many dialects of C++ (EDIT Feb 2019: including C++17 in Ansi, GCC and MS flavors). It could likely be configured to find and extract the facts that correspond to the model you care about. But it doesn't do this this off the shelf.
DMS with its C front end have been used to process extremely large C applications (19,000 compilation units!). The C++ front end has been used in anger on a variety of large-scale C++ projects (EDIT Feb 2019: including large scale refactoring of APIs across 3000+ compilation units). Given DMS's general capability, I think it likely capable of handling fairly big chunks of code. YMMV.
Static code analyzers when used against large code base first time usually produce so many warnings and alerts that you won't be able to analyze all of them in reasonable amount of time. It is hard to single out real problems from code that just look suspicious to a tool.
You can try to use automatic invariant discovery tools like "Daikon" that capture perceived invariants at run time. You can validate later if discovered invariants (equivalence of variables "a == b+1" for example) make sense and then insert permanent asserts into your code. This way when invariant is violated as result of your change you will get a warning that perhaps you broke something by your change. This method helps to avoid restructuring or changing your code to add tests and mocks.
The usual way of applying formal techniques to large systems is to modularise them and write specifications for the interfaces of each module. Then you can verify each module independently (while verifying a module, you import the specifications - but not the code - of the other modules it calls). This approach makes verification scalable.
I am currently looking for a discrete event simulator written for C++. I did not find much on the web written specifically in OO-style; there are some, but outdated. Some others, such as Opnet, Omnet and ns3 are way too complicated for what I need to do. And besides, I need to simulate agent-based algorithms capable of simulating systems of thousands of nodes.
Does anybody know anything suitable for my needs?
Others have good direct answers, but I'm going to suggest an alternative. If I understand you right, you want a system in C++ or such where you can post events that fire in the future, and code is run when those events fire.
I had a project to do like this, and I started out trying to write such an event system in C++ and then quickly realized I had a better solution.
Have you considered writing your program in behavioral Verilog? That may seem strange to write software in a hardware description language, but a Verilog simulator is an event-based system underneath, and behavioral Verilog is a very convenient way to express events, timing, triggers, etc. There is a free Verilog simulator (which is what I used) called Icarus Verilog. If you're not using Ubuntu or some Linux distro with Icarus already in a package, building from source is straightforward.
I would recommend having a second look to OmNet++. At first sight it may look quite complex, but if you look it into more detail you will find that most of the complexity is in the network add-on (the INET Framework). Unless you are going to do a detailed network simulation you do not need the INET.
Using OmNet++ core is not specially difficult and it may be simpler than other similar tools.
You may want to have a look to an intro.
One of the things that makes OmNet++ attractive to me is its scalability. Is possible to run large simulations in a desktop. Besides, it is possible to scale the same simulation to a cluster without rewriting the code.
You should consider SystemC, although I'd also recommend taking a second look at OmNet++.
We use SIMLIB at my school. It is very fast, easy to understand, object oriented, discrete and continuous simulator. It might look outdated but it is still maintained.
There is CSIM from Mesquite Software which supports developing models in C, C++ and Java. However, it is paid-commercial, AFAIK.
Take a look at GBL library. It's written in modern C++ and even supports C++0x features like move semantics and lambda functions. It offers several modeling mechanisms: synchronous and asynchronous event handlers, preemptive threads, and fibers. You can create purely behavioral, cycle accurate, and real-time models, or any mixture of those.
I've had heavy exposure to Miro Samek's "Quantum Hierarchical State Machine," but I'd like to know how it compares to Boost StateCharts - as told by someone who has worked with both. Any takers?
I know them both, although at different detail levels. But we can start with the differences I've came across, maybe there are more :-) .
Scope
First, the Quantum Platform provides a complete execution framework for UML state machines, whereas boost::statechart only helps the state machine implementations. As such, boost::statechart provides the same mechanism as the event processor of the Quantum Platform (QEP).
UML Conformance
Both approaches are designed to be UML conform. However, the Quantum Platform executes transition actions before exit actions of the respective state. This contradicts the UML, but in practice this is seldom a problem (if the developer is aware of it).
Boost::statechart is designed according to UML 1.4, but as far as I know the execution semantics did not change in UML 2.0 in an incompatible way (please correct me if I'm wrong), so this shouldn't be a problem either.
Supported UML features
Both implementations do not support the full UML state machine feature set. For example, parallel states (aka AND states) are not supported directly in QP. They have to be implemented manually by the user. Boost::statechart does not support internal transitions, because they were introduced in UML 2.0.
I think the exact features that each technique supports are easy to figure out in the documentation, so I don't list them here.
As a fact, both support the most important statechart features.
Other differences
Another difference is that QP is suitable for embedded applications, whereas boost::statechart maybe is, maybe not. The FAQ says "it depends" (see http://www.boost.org/doc/libs/1_44_0/libs/statechart/doc/faq.html#EmbeddedApplications), but to me this is already a big warning sign.
Also, you have to take special measurements to get boost::statechart real-time capable (see FAQ).
So much to the differences that I know, tell me if you find more, I'd be interested!
I have also worked with both, let me elaborate on theDmi's great answer:
Trace capability:
QP also implements a powerful trace capability called QSpy which enables very fine granularity traces with filter capabilities. With boost you have to roll your own and you never get beyond a certain granularity.
Modern C++ Style and Compile time error checking:
While Boost MSM and Statecharts give horrid and extremely long error messages if you mess up (as does all code written by template geniuses I envy), this is far better than runtime error detection. QP uses Q_ASSERT() and similar macros to do some error checking but in general you have to watch yourself more with QP and code is less durable.
I also find the extensive use of the preprocessor in QP takes a bit of getting used to. It may be warranted to use the preprocessor rather than templates, virtual functions etc. because of QPs use in embedded systems where the C++ compilers are often worse and the hardware is less virtual function friendly but sometimes I wish Mr. Samek would make a C, a C++ and a Modern C++ version ;) Rumor has it I'm not the only one who hates the preprocessor.
Scalability:
Boost MSM is not good for anything above 20 states, Statecharts has pretty much no limit on states but the amount of transitions a state can have is limited by the mpl::vector/list constraints. QP is scalable to an insane degree, virtually unlimited states and transitions are possible. It should also be noted that QP state machines can be spread over many many files with few dependencies.
Model driven development:
because of its extreme scalability and flexibility QP is much better suited for Model Driven Development, see this article for a lengthy comparison: http://security.hsr.ch/mse/projects/2011_Code_Generator_for_UML_State_Machines.pdf
Embedded designs:
QP is the only solution for any kind of embedded design in my mind. Its documented down to the bare bones so its easily portable, ports exist for many many common processors and it brings a lot of stuff with it beyond the state machine functionality. I particularly like the raw thread safe queues and memory management. I have never seen an embedded kernel I liked until I tried the RTC Kernel in QP (although it should be noted I have not used it in production code yet).
I am unfamiliar with Boost StateCharts, but something I feel Samek gets wrong is that he associates transition actions with state context. Transition actions should occur between states.
To understand why I don't like this style requires an example:
What if a state has two different transitions out? Then the events are different but the source state would be the same.
In Samek's formalism, transition actions are associated with a state context, so you have to have the same action on both transitions. In this way, Samek does not allow you to express a pure Mealy model.
While I have not provided a comparison to Boost StateCharts, I have provided you with some details on how to critique StateCharts implementations: by analyzing coupling among the various components that make up the implementation.
I'm currently writing a large multi threaded C++ program (> 50K LOC).
As such I've been motivated to read up alot on various techniques for handling multi-threaded code. One theory I've found to be quite cool is:
http://en.wikipedia.org/wiki/Communicating_sequential_processes
And it's invented by a slightly famous guy, who's made other non-trivial contributions to concurrent programming.
However, is CSP used in practice? Can anyone point to any large application written in a CSP style?
Thanks!
CSP, as a process calculus, is fundamentally a theoretical thing that enables us to formalize and study some aspects of a parallel program.
If you instead want a theory that enables you to build distributed programs, then you should take a look to parallel structured programming.
Parallel structural programming is the base of the current HPC (high-performance computing) research and provides to you a methodology about how to approach and design parallel programs (essentially, flowcharts of communicating computing nodes) and runtime systems to implements them.
A central idea in parallel structured programming is that of algorithmic skeleton, developed initially by Murray Cole. A skeleton is a thing like a parallel design pattern with a cost model associated and (usually) a run-time system that supports it. A skeleton models, study and supports a class of parallel algorithms that have a certain "shape".
As a notable example, mapreduce (made popular by Google) is just a kind of skeleton that address data parallelism, where a computation can be described by a map phase (apply a function f to all elements that compose the input data), and a reduce phase (take all the transformed items and "combine" them using an associative operator +).
I found the idea of parallel structured programming both theoretical sound and practical useful, so I'll suggest to give a look to it.
A word about multi-threading: since skeletons addresses massive parallelism, usually they are implemented in distributed memory instead of shared. Intel has developed a tool, TBB, which address multi-threading and (partially) follows the parallel structured programming framework. It is a C++ library, so probably you can just start using it in your projects.
Yes and no. The basic idea of CSP is used quite a bit. For example, thread-safe queues in one form or another are frequently used as the primary (often only) communication mechanism to build a pipeline out of individual processes (threads).
Hoare being Hoare, however, there's quite a bit more to his original theory than that. He invented a notation for talking about the processes, defined a specific set of signals that can be sent between the processes, and so on. The notation has since been refined in various ways, quite a bit of work put into proving various aspects, and so on.
Application of that relatively formal model of CSP (as opposed to just the general idea) is much less common. It's been used in a few systems where high reliability was considered extremely important, but few programmers appear interested in learning (yet another) formal design notation.
When I've designed systems like this, I've generally used an approach that's less rigorous, but (at least to me) rather easier to understand: a fairly simple diagram, with boxes representing the processes, and arrows representing the lines of communication. I doubt I could really offer much in the way of a proof about most of the designs (and I'll admit I haven't designed a really huge system this way), but it's worked reasonably well nonetheless.
Take a look at the website for a company called Verum. Their ASD technology is based on CSP and is used by companies like Philips Healthcare, Ericsson and NXP semiconductors to build software for all kinds of high-tech equipment and applications.
So to answer your question: Yes, CSP is used on large software projects in real-life.
Full disclosure: I do freelance work for Verum
Answering a very old question, yet it seems important that one
There is Go where CSPs are a fundamental part of the language. In the FAQ to Go, the authors write:
Concurrency and multi-threaded programming have a reputation for difficulty. We believe this is due partly to complex designs such as pthreads and partly to overemphasis on low-level details such as mutexes, condition variables, and memory barriers. Higher-level interfaces enable much simpler code, even if there are still mutexes and such under the covers.
One of the most successful models for providing high-level linguistic support for concurrency comes from Hoare's Communicating Sequential Processes, or CSP. Occam and Erlang are two well known languages that stem from CSP. Go's concurrency primitives derive from a different part of the family tree whose main contribution is the powerful notion of channels as first class objects. Experience with several earlier languages has shown that the CSP model fits well into a procedural language framework.
Projects implemented in Go are:
Docker
Google's download server
Many more
This style is ubiquitous on Unix where many tools are designed to process from standard in to standard out. I don't have any first hand knowledge of large systems that are build that way, but I've seen many small once-off systems that are
for instance this simple command line uses (at least) 3 processes.
cat list-1 list-2 list-3 | sort | uniq > final.list
This system is only moderately sized, but I wrote a protocol processor that strips away and interprets successive layers of protocol in a message that used a style very similar to this. It was an event driven system using something akin to cooperative threading, but I could've used multithreading fairly easily with a couple of added tweaks.
The program is proprietary (unfortunately) so I can't show off the source code.
In my opinion, this style is useful for some things, but usually best mixed with some other techniques. Often there is a core part of your program that represents a processing bottleneck, and applying various concurrency increasing techniques there is likely to yield the biggest gains.
Microsoft had a technology called ActiveMovie (if I remember correctly) that did sequential processing on audio and video streams. Data got passed from one filter to another to go from input to output format (and source/sink). Maybe that's a practical example??
The Wikipedia article looks to me like a lot of funny symbols used to represent somewhat pedestrian concepts. For very large or extensible programs, the formalism can be very important to check how the (sub)processes are allowed to interact.
For a 50,000 line class program, you're probably better off architecting it as you see fit.
In general, following ideas such as these is a good idea in terms of performance. Persistent threads that process data in stages will tend not to contend, and exploit data locality well. Also, it is easy to throttle the threads to avoid data piling up as a fast stage feeds a slow stage: just block the fast one if its output buffer grows too big.
A little bit off-topic but for my thesis I used a tool framework called TERRA/LUNA which aims for software development for Embedded Control Systems but is used heavily for all sorts of software development at my institute (so only academical use here).
TERRA is a graphical CSP and software architecture editor and LUNA is both the name for a C++ library for CSP based constructs and the plugin you'll find in TERRA to generate C++ code from your CSP models.
It becomes very handy in combination with FDR3 (a CSP refinement checker) to detect any sort of (dead/life/etc) lock or even profiling.
I saw a tool which can tell you if you have design problem in your project and I'm wondering if there is a tool which can tell you dynamically if there are some concurrency problems in your project.
Chess from MS Research is great (http://research.microsoft.com/en-us/projects/chess/)
It detects Concurrency Bugs with the help of Unittests and important: they are reproducable with chess.
I think you will find that detecting this type of thing by static code analysis is basically the Halting Problem in disguise and is thus undecidable in the general case. Such a tool almost certainly doesn't exist.
The closest thing to a proving tool that does exist is to model the computation as 'Communicating Sequential Processes', which can be subjected to formal mathematical reasoning. However, this doesn't allow you to produce a tool that can take an arbitrary program in an arbitrary language and compute a proof for it.