What is the state of OCaml's parallelization abilities? - ocaml

I'm interested in using OCaml for a project, however I'm not sure about where its parallelization capabilities are anymore. Is there a message passing ability in OCaml? Is OCaml able to efficiently use more than 1 CPU?
Most of what I have read on the subject was written in 2002-2006, and I haven't seen anything more recent.
Thanks!

This 2009 issue of the Caml weekly news ("CWN", a digest of interesting messages from the caml list) shows that:
the official party line on threads and Ocaml hasn't changed. A notable quote:
(...) in general, the whole standard library is not thread-safe. Probably that should be stated in the
documentation for the threads library, but there isn't much point in documenting it per standard library module. -- X. Leroy
(for how Ocaml threads can still be useful, see a remark by the culprit himself in another question on SO)
the most frequently adopted paradigm for parallelism is message-passing, and of note is X. Leroy's OcamlMPI, providing bindings for programming in SPMD style against the MPI standard. The same CWN issue I pointed to above provides references to examples, and numerous other related projects.
another message-passing solution is JoCaml, pioneering a new style of concurrent communication known as the join calculus. Note that it is binary-compatible with OCaml compilers.
that did not prevent the confection of a runtime whose GC is ok with parallelism, though: see a discussion of OCAML4MC in this other issue of the CWN.
There is also:
Netmulticore - multi-processing sharing ocaml values via mapped shared memory.
CamlP3l - compiler for Caml parallel programs.
OCaml-Java - an OCaml compiler that emits Java bytecode
I haven't followed more recent discussions about Ocaml & parallel programming, though. I'm leaving this CW so that others can update what I mention. It would be great if this question could reach the same level of completeness as the analogous one for Haskell.

At present, the OCaml runtime does not support running across multiple cores in parallel, so a single OCaml process cannot take advantage of multiple cores. This is unlikely to change directly; the direction the OCaml developers are most interested in taking for increased parallelism seems to be allowing multiple OCaml runtimes to run in parallel in a single process; this will allow for very fast message passing, but will not allow multiple threads to run in parallel in a shared-memory configuration. The major hangup is the garbage collector; some years ago, the team experimented with a concurrent GC, but it introduced unacceptable slowdowns in the single-threaded case.
There are a couple of projects, namely Functory and OCamlnet, which provide multicore-happy parallelism by using multiple processes.
In general, the OCaml community tends to favor message passing approaches, which can be done across process boundaries (like OCamlnet does), over single-process shared-memory multithreading. If your program can be split into multiple processes (many can!), then yes, you can efficiently use multiple CPUs.
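As a rough illustration of the multi-process, message-passing style described above, here is a minimal sketch in Python (not OCaml), using only the standard library. The worker program, the one-JSON-line-per-message protocol and the function name are all assumptions made for this example; it does not show the API of OCamlnet or Functory.

```python
import json
import subprocess
import sys

# Hypothetical worker program: reads one JSON list per line from stdin,
# replies with the sum of squares as one JSON line on stdout.
WORKER_SOURCE = r"""
import json, sys
for line in sys.stdin:
    numbers = json.loads(line)
    print(json.dumps(sum(n * n for n in numbers)), flush=True)
"""


def sum_of_squares_in_processes(chunks):
    """Send each chunk to its own OS process and add up the replies."""
    workers = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in chunks
    ]
    # Send one message per worker, then close stdin so the worker exits
    # once it has answered.
    for worker, chunk in zip(workers, chunks):
        worker.stdin.write(json.dumps(chunk) + "\n")
        worker.stdin.close()
    # Collect one reply message per worker.
    partial_sums = [json.loads(worker.stdout.readline()) for worker in workers]
    for worker in workers:
        worker.stdout.close()
        worker.wait()
    return sum(partial_sums)
```

No memory is shared between the processes: the only coupling is the messages on the pipes, which is why the operating system is free to schedule each worker on a different core.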

BSMLlib provides a simplified programming interface for data-parallel programming in OCaml.
Its execution amounts to BSP-style message passing but it is deterministic and even declarative for a subset of OCaml.
The key concept is the 'a par type which corresponds to a vector of values, one per process.
http://traclifo.univ-orleans.fr/BSML/
http://fr.wikipedia.org/wiki/Bulk_Synchronous_Parallel_ML
Gaétan Hains
University Paris-Est

Related

what is the relationship among different approaches of F# concurrency

I've recently been learning F# asynchronous workflows, which are an important feature of F# concurrency. What confuses me is how many approaches there are to writing concurrent code in F#. I have read Expert F# and some blogs about F# concurrency, and I know of things like background workers and IAsyncResult. When programming on a local machine, there is shared-memory concurrency in F#; when programming on a distributed system, there is message-passing concurrency. But I'm really not sure what the relationship between these techniques is, or how to classify them. I understand this is quite a "big" question that cannot be answered in one or two sentences, so I would definitely appreciate it if anyone could give me a specific answer or recommend some useful references.
I'm also rather new to F#, so I hope more answers come to complement this one :)
The first thing is you need to distinguish between .NET classes (which can be used from any .NET language) and F# unique ways to deal with asynchronous operations. In the first case as you mention and among others, you have:
System.ComponentModel.BackgroundWorker: This was used mainly in the first .NET versions with Windows Forms and it's not recommended anymore.
System.IAsyncResult: This is also an old .NET interface implemented by several classes (also Task) but I don't usually use it directly.
Windows.Foundation.IAsyncOperation: Another interface but used only in Windows Store apps. Most of the times you translate it directly to Task, so you don't have to worry too much about it.
System.Threading.Tasks.Task: This is the recommended way now to handle .NET asynchronous and parallel (with the Task Parallel Library) operations. It's the hidden force behind C# async/await keywords, which are just syntactic sugar to pass continuations to Tasks.
So now with F# unique ways: Asynchronous Workflows and MailboxProcessor. It can roughly be said the former corresponds to parallelism while the latter deals with concurrency.
Asynchronous Workflows: This is just a computation expression (aka monad) which happens to deal with asynchrony: operations that run in the background to prevent blocking the UI thread or parallel operations to get the maximum performance in multi-core systems.
It's more or less the equivalent of C# async/await, but we F# fans like to think it's a more elegant solution because it uses a more generic and flexible mechanism (computation expressions) which can be adapted, for example, to asynchronous sequences, events or even JavaScript callbacks. It also has other advantages, as Tomas Petricek explains here.
Within an asynchronous workflow most of the time you'll be using the methods in Control.Async or the extensions to .NET classes (like WebRequest.AsyncGetResponse) from the F# Core Library. If necessary, you can also interact directly with .NET Tasks (Async.AwaitTask and Async.StartAsTask) or even easily create your own async operations with Async.StartWithContinuations.
To learn more about asynchronous workflows you can consult the MSDN documentation, the magnificent Scott Wlaschin's site, Tomas Petricek's blog or the F# Wikibook.
Control.MailboxProcessor: Designed to deal with concurrency, that is, several processes running at the same time which usually need to share some information. The traditional .NET way to prevent memory corruption when several threads try to write a variable at the same time was the lock statement. Besides the fact that functional style prefers to use immutable values, memory locks are complicated to use properly and can also have a high performance penalty. So instead of this, MailboxProcessor uses an Erlang-like message-based (or actor-based) approach to concurrency.
I have not used MailboxProcessor myself that much, but for more info you can check Scott Wlaschin's site or the F# Wikibook.
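To make the message-based idea concrete, here is a minimal sketch of the pattern in Python rather than F#. It is not the MailboxProcessor API; the class name, the message tags and the reply-queue convention are assumptions chosen for this example. The point is that the counter is owned by a single thread and is changed only in response to messages, so no lock is needed around it.

```python
import queue
import threading


class CounterAgent:
    """Hypothetical agent: owns a counter and changes it only via messages."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        total = 0  # private state, touched only by the agent's own thread
        while True:
            kind, payload = self._inbox.get()
            if kind == "add":
                total += payload
            elif kind == "get":
                payload.put(total)  # payload is the caller's reply queue
            elif kind == "stop":
                return

    def add(self, amount):
        """Fire-and-forget message."""
        self._inbox.put(("add", amount))

    def get(self):
        """Request/reply message: block until the agent answers."""
        reply = queue.Queue(maxsize=1)
        self._inbox.put(("get", reply))
        return reply.get()

    def stop(self):
        self._inbox.put(("stop", None))
        self._thread.join()
```

Any number of threads can call `add` concurrently; the inbox serializes the messages, and `get` sees every `add` that was posted before it.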
I hope this helps! If someone sees something not completely correct in this answer, please feel free to edit it.
Cheers!

Common Lisp Parallel Programming

I want to implement my particle filtering algorithm in parallel in Common Lisp. Particle Filtering and sampling can be parallelized and I want to do this for my 4-core machine. My question is whether programming in parallel is feasible in CL or not and if it is feasible are there any good readings, tutorials about getting started to parallel computing in CL.
Definitely feasible!
The Bordeaux Threads project provides thread primitives for a number of implementations; I would suggest using it instead of SBCL's implementation-specific primitives (especially if you aren't on SBCL!).
The thread primitives provided by bt are, however, quite primitive. I've used and enjoyed Eager Future2, which builds on bt to provide concurrency features using futures. You can create futures that are computed lazily, eagerly (immediately), or speculatively. The speculative futures are computed by a thread pool whose size can be customized.
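For readers who have not met futures before, here is a minimal sketch of the idea applied to the particle-weighting step, written in Python with the standard `concurrent.futures` module rather than in Common Lisp. It is not the Eager Future2 API; the Gaussian `likelihood` and the function names are assumptions made for this example.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def likelihood(particle, observation):
    """Hypothetical observation model: unnormalized Gaussian likelihood."""
    return math.exp(-0.5 * (particle - observation) ** 2)


def weigh_chunk(chunk, observation):
    return [likelihood(p, observation) for p in chunk]


def weigh_particles(particles, observation, workers=4):
    """Return normalized weights, computing each chunk in its own future."""
    if not particles:
        return []
    size = max(1, -(-len(particles) // workers))  # ceiling division
    chunks = [particles[i:i + size] for i in range(0, len(particles), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # submit() returns immediately with a future; the pool starts
        # computing it eagerly in the background.
        futures = [pool.submit(weigh_chunk, chunk, observation) for chunk in chunks]
        # result() blocks until that particular future is done.
        weights = [w for future in futures for w in future.result()]
    total = sum(weights)
    return [w / total for w in weights]
```

In standard CPython builds the global interpreter lock keeps these threads from speeding up pure-Python arithmetic, so the sketch shows only the shape of the code. In a Lisp with native threads the same structure can use all four cores.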
I started a little project to provide parallel versions of CL functions using EF2, but it's only about three functions so far, so it won't be of much use to anyone. I do of course welcome other coders to hack on it and submit pull requests, and I hope to do more work on it in the future.
There are many other libraries listed on Cliki that I haven't tried myself.
As far as tutorials, I don't know of any, but the concurrency features provided are found in other languages as well and good algorithms and practices are not generally language-specific.
If you're interested in reading a book, I recommend The Concurrent C Programming Language. The authors describe a new programming language, based on C, with concurrency as a language feature. Of course, due to the nature of CL, it would likely be possible to implement these features without resorting to creating a new compiler. In my opinion the book presents excellent concurrency concepts, and addresses many of the problems you may encounter or fail to consider in writing concurrent programs.
SBCL has some multithreading support. It is too low level and, to my knowledge, does not include any parallel algorithms. It only offers the possibility of creating threads that execute some lambda function and of testing afterwards whether the thread has finished (joining it). I used that support to generate my blog pages with great speedup (each page or set of pages in a different thread). You can see the code here:
https://github.com/dsevilla/functional-mind-blog/blob/master/blog/process.lisp
For example, generating a thread for each page was something like:
#+sbcl
(defun generate-post-pages ()
  (map nil
       #'(lambda (post)
           (make-thread (lambda () (page-generation-function post))))
       *posts*))
You can also join-thread, and have mutexes, etc. You can read the documentation here: SBCL Threading. It is too low-level, though. You'll end up missing the fantastic features of Clojure for concurrency...
Check out bordeaux threads if you're looking for a single POSIX-threads-style interface to multi-threading primitives for different Lisps.
If I were looking for a reliable free Lisp implementation, I'd start with CCL and then try SBCL. I use CCL for almost all of my testing and SBCL and LispWorks for the remainder.
Sedach's futures library should provide a higher level interface. There are also some other contributions from various users in SBCL's contrib directory.
This coming from someone who has used neither bordeaux-threads nor Sedach's futures library and has written his own version of both of those. I could send you my implementation, but these two packages are also supposed to be good, and they're probably a better starting point.
LispWorks 6 comes with a nice set of primitives for concurrent programming.
Note though that to my knowledge none of the usual Common Lisp implementations has a concurrent Garbage Collector.
Documentation for LispWorks 6 and Multiprocessing
The MP Package
Multiprocessing

Erlang - C and Erlang

There are certain common library functions in erlang that are much slower than their c equivalent.
Is it possible to have c code do the binary parsing and number crunching, and have erlang spawn processes to run the c code?
Of course C would be faster, in the extreme case, after optimizations. If by faster you mean faster to run.
Erlang would be by far, faster to write. Depending on the speed requirements you have Erlang is probably "fast enough", and it will save you days of searching for bugs in C.
C code will only be faster after optimizations. If you spend the same amount of time on C and Erlang you will come out with about the same speed (note that I count time spent debugging and fixing errors in this estimate, which will be a lot less in Erlang).
So:
faster writing = Erlang
faster running (after optimisations) = C
faster running without optimisations = any of the two
Take your pick.
There are two rough rules of thumb based on Erlang FAQ:
Code which involves mainly number crunching and data processing will run about 10 times slower than an equivalent C program. This includes almost all "micro benchmarks".
Large systems which spend most of their time communicating with other systems, recovering from faults and making complex decisions run at least as fast as equivalent C programs.
However there are some official solutions to the lack of number crunching performance of Erlang:
Native Implemented Function (NIF):
Implementing a function in C and loading its object code into Erlang virtual machine to be like a standard Erlang function but with native performance.
Examples: Evedis, Bitcask, ElevelDB
Port:
A byte-oriented interface from Erlang virtual machine to external OS processes through standard input and output file descriptors. The communication with this port is through message passing from Erlang's point of view.
Port Driver:
A dynamically linked C object file which is loaded into Erlang virtual machine and acts like a port. The communication with this port driver is through message passing from Erlang's point of view.
Examples: OTP_Inet, ENanomsg, P1_TLS
C Node:
You can simply promote your Erlang runtime to a distributed node. This way there is a specification to implement an Erlang runtime in C and communicate with Erlang nodes with a single interface.
All of the aforementioned solutions have their own pros and cons and need to be used with extreme care.
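Of the options listed above, the port is the easiest to sketch from the external program's side, because it only has to read and write bytes on its standard streams. The following Python sketch assumes the Erlang side opened the port with the `{packet, 2}` option, so that every message is prefixed with a 2-byte big-endian length; that option is not mentioned above and is an assumption of this example, as are all function names. A real number-crunching helper would more likely be written in C, but the framing logic is the same.

```python
import io
import struct


def encode_packet(payload: bytes) -> bytes:
    """Prefix the payload with its length as a 2-byte big-endian integer."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length header")
    return struct.pack(">H", len(payload)) + payload


def read_packet(stream):
    """Read one length-prefixed packet; return None on a clean end of stream."""
    header = stream.read(2)
    if len(header) < 2:
        return None
    (length,) = struct.unpack(">H", header)
    body = stream.read(length)
    if len(body) < length:
        raise EOFError("stream ended in the middle of a packet")
    return body


def serve(stdin, stdout, handler):
    """Answer every request packet with handler(request) until stdin closes."""
    while True:
        request = read_packet(stdin)
        if request is None:
            return
        stdout.write(encode_packet(handler(request)))
        stdout.flush()


def demo():
    """Run the loop against in-memory streams instead of a real port."""
    requests = io.BytesIO(encode_packet(b"abc") + encode_packet(b""))
    replies = io.BytesIO()
    serve(requests, replies, bytes.upper)
    return replies.getvalue()
```

When started from a real port, the program would call `serve(sys.stdin.buffer, sys.stdout.buffer, handler)`. From Erlang's point of view each reply then arrives as an ordinary message from the port.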
First of all write whole logic of the system in Erlang, then implement handling binaries in C. Using NIFs (it is kind of interface to C) is pretty straight forward and transparent to the rest of the system. Here is another thread about talking to C Run C Code Block in Erlang.
Before hacking C, make sure you benchmarked current implementation. It is possible that it will satisfy your needs especially with the latest Erlang/OTP release (R14) which introduces great enhancements to binary handling.
Easy threading is not so interesting to Erlang. Easy threading + message passing and the OTP framework is what's awesome about Erlang. If you need number crunching, use something like OCaml, Python or Haskell. Erlang is not all that good at number crunching.
Parsing binaries is one of the things erlang is best at though, probably the best for it. Joe's book programming erlang covers everything really well, and is not so expensive used. It also talks about integrating C code and gives an example. the source is available from pragmatic programming without needing to buy the book, you can grep #include or something.
If you really look for speed you should try OpenMP or MPI parallel programming frameworks for C and C++. I recommend you to take a look at Patterns for Parallel Programming (link to amazon.com) for the details of OpenMP and MPI programming patterns.
The section of erl_nif in Erlang ERTS reference manual will be helpful.
If you like Erlang, but want C speed, why not go for JOCAML. It is an extension for OCAML (which is similar to Erlang but is near C in terms of speed) designed for the multicore revolution going on at the moment. I love it (and I know more than 10 programming languages...)
I used C over 20 years.
I have been using Erlang almost exclusively in recent years.
C is faster to run for obvious reasons.
However, Erlang is fast enough for most things when you do it right.
Also, writing Erlang is much faster and more fun.
For the piece of algorithms for which the run-time speed is critical, it surely can be written in C, which is the way of Erlang BIFs.
Yes,
But there's more than one way to this, loosely speaking, some or all of which are already listed.
We should ask:
Are those procedures really equivalent (how do the Erlang and C differ)?
Is there a better way to write Erlang for this task (other procedures/libraries or data-types)?
It may be helpful to consider this post: Scaling & Speed with Erlang.
To address the question: yes, it is possible to have Erlang call a C function to handle a specific task. The most common way is to use a NIF - http://erlang.org/doc/tutorial/nif.html. Before Erlang/OTP 20 or so, NIFs were recommended only for short-running functions (a few ms), because they blocked the scheduler thread they ran on, which does not work well with Erlang's preemptive scheduler. Now, with dirty schedulers, this is more flexible; you can read up on that.
Just to note, C may be faster at parsing binary, though you should run tests, Erlang is by far faster to write the code. Erlang does a great job parsing binaries by pattern matching.

Comparison of Boost StateCharts vs. Samek's "Quantum Statecharts"

I've had heavy exposure to Miro Samek's "Quantum Hierarchical State Machine," but I'd like to know how it compares to Boost StateCharts - as told by someone who has worked with both. Any takers?
I know them both, although at different levels of detail. But we can start with the differences I've come across; maybe there are more :-) .
Scope
First, the Quantum Platform provides a complete execution framework for UML state machines, whereas boost::statechart only helps the state machine implementations. As such, boost::statechart provides the same mechanism as the event processor of the Quantum Platform (QEP).
UML Conformance
Both approaches are designed to be UML conformant. However, the Quantum Platform executes transition actions before exit actions of the respective state. This contradicts the UML, but in practice this is seldom a problem (if the developer is aware of it).
Boost::statechart is designed according to UML 1.4, but as far as I know the execution semantics did not change in UML 2.0 in an incompatible way (please correct me if I'm wrong), so this shouldn't be a problem either.
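To see why the ordering of exit actions and transition actions can matter, consider this small Python sketch. It uses neither the Quantum Platform nor Boost APIs; the `valve_open` flag and the function name are made up for the illustration. The exit action changes some context, and the transition action observes it.

```python
def run_transition(uml_order=True):
    """Run one transition and report what the transition action observed."""
    context = {"valve_open": True}
    observed = {}

    def exit_action():
        # Leaving the source state closes the (hypothetical) valve.
        context["valve_open"] = False

    def transition_action():
        # The transition action records the valve state it sees.
        observed["valve_open"] = context["valve_open"]

    if uml_order:
        steps = [exit_action, transition_action]  # UML: exit, then action
    else:
        steps = [transition_action, exit_action]  # order described above for QP
    for step in steps:
        step()
    return observed["valve_open"]
```

With the UML order the transition action sees the valve already closed (`False`); with the other order it still sees it open (`True`). Code that is moved between the two frameworks therefore must not rely on state that exit actions modify.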
Supported UML features
Both implementations do not support the full UML state machine feature set. For example, parallel states (aka AND states) are not supported directly in QP. They have to be implemented manually by the user. Boost::statechart does not support internal transitions, because they were introduced in UML 2.0.
I think the exact features that each technique supports are easy to figure out in the documentation, so I don't list them here.
As a fact, both support the most important statechart features.
Other differences
Another difference is that QP is suitable for embedded applications, whereas boost::statechart maybe is, maybe not. The FAQ says "it depends" (see http://www.boost.org/doc/libs/1_44_0/libs/statechart/doc/faq.html#EmbeddedApplications), but to me this is already a big warning sign.
Also, you have to take special measures to make boost::statechart real-time capable (see FAQ).
So much for the differences that I know of; tell me if you find more, I'd be interested!
I have also worked with both, let me elaborate on theDmi's great answer:
Trace capability:
QP also implements a powerful trace capability called QSpy which enables very fine granularity traces with filter capabilities. With boost you have to roll your own and you never get beyond a certain granularity.
Modern C++ Style and Compile time error checking:
While Boost MSM and Statecharts give horrid and extremely long error messages if you mess up (as does all code written by template geniuses I envy), this is far better than runtime error detection. QP uses Q_ASSERT() and similar macros to do some error checking but in general you have to watch yourself more with QP and code is less durable.
I also find the extensive use of the preprocessor in QP takes a bit of getting used to. It may be warranted to use the preprocessor rather than templates, virtual functions etc. because of QP's use in embedded systems, where the C++ compilers are often worse and the hardware is less virtual-function friendly, but sometimes I wish Mr. Samek would make a C, a C++ and a Modern C++ version ;) Rumor has it I'm not the only one who hates the preprocessor.
Scalability:
Boost MSM is not good for anything above 20 states. Statecharts has pretty much no limit on states, but the number of transitions a state can have is limited by the mpl::vector/list constraints. QP is scalable to an insane degree; virtually unlimited states and transitions are possible. It should also be noted that QP state machines can be spread over many, many files with few dependencies.
Model driven development:
Because of its extreme scalability and flexibility, QP is much better suited for Model Driven Development; see this article for a lengthy comparison: http://security.hsr.ch/mse/projects/2011_Code_Generator_for_UML_State_Machines.pdf
Embedded designs:
QP is the only solution for any kind of embedded design in my mind. It's documented down to the bare bones, so it's easily portable; ports exist for many, many common processors, and it brings a lot of stuff with it beyond the state machine functionality. I particularly like the raw thread-safe queues and memory management. I have never seen an embedded kernel I liked until I tried the RTC Kernel in QP (although it should be noted I have not used it in production code yet).
I am unfamiliar with Boost StateCharts, but something I feel Samek gets wrong is that he associates transition actions with state context. Transition actions should occur between states.
To understand why I don't like this style requires an example:
What if a state has two different transitions out? Then the events are different but the source state would be the same.
In Samek's formalism, transition actions are associated with a state context, so you have to have the same action on both transitions. In this way, Samek does not allow you to express a pure Mealy model.
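For contrast, here is a minimal sketch of what a pure Mealy machine looks like when the action is attached to the (state, event) pair rather than to the state alone. It is written in Python, and the turnstile states, events and action names are invented for the example. It makes no claim about how QP or Boost StateCharts represent transitions.

```python
# Each (state, event) pair maps to its own (action, next_state), so two
# transitions leaving the same state can carry different actions.
TRANSITIONS = {
    ("locked", "coin"): ("unlock", "unlocked"),
    ("locked", "push"): ("alarm", "locked"),
    ("unlocked", "push"): ("lock", "locked"),
    ("unlocked", "coin"): ("refund", "unlocked"),
}


def step(state, event):
    """Return the action emitted on this transition and the next state."""
    return TRANSITIONS[(state, event)]


def run(state, events):
    """Feed a sequence of events; collect the emitted actions in order."""
    actions = []
    for event in events:
        action, state = step(state, event)
        actions.append(action)
    return actions, state
```

Starting from `"locked"`, the events `push`, `coin`, `push` produce the actions `alarm`, `unlock`, `lock`: the first two both leave `locked`, yet they emit different outputs.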
While I have not provided a comparison to Boost StateCharts, I have provided you with some details on how to critique StateCharts implementations: by analyzing coupling among the various components that make up the implementation.

Is Communicating Sequential Processes ever used in large multi threaded C++ programs?

I'm currently writing a large multi threaded C++ program (> 50K LOC).
As such I've been motivated to read up a lot on various techniques for handling multi-threaded code. One theory I've found to be quite cool is:
http://en.wikipedia.org/wiki/Communicating_sequential_processes
And it's invented by a slightly famous guy, who's made other non-trivial contributions to concurrent programming.
However, is CSP used in practice? Can anyone point to any large application written in a CSP style?
Thanks!
CSP, as a process calculus, is fundamentally a theoretical thing that enables us to formalize and study some aspects of a parallel program.
If you instead want a theory that enables you to build distributed programs, then you should take a look at parallel structured programming.
Parallel structured programming is the basis of current HPC (high-performance computing) research and gives you a methodology for approaching and designing parallel programs (essentially, flowcharts of communicating computing nodes) and the runtime systems that implement them.
A central idea in parallel structured programming is that of the algorithmic skeleton, developed initially by Murray Cole. A skeleton is something like a parallel design pattern with an associated cost model and (usually) a run-time system that supports it. A skeleton models, studies, and supports a class of parallel algorithms that have a certain "shape".
As a notable example, mapreduce (made popular by Google) is just a kind of skeleton that addresses data parallelism, where a computation can be described by a map phase (apply a function f to all elements that compose the input data) and a reduce phase (take all the transformed items and "combine" them using an associative operator +).
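The map and reduce phases can be sketched in a few lines. The following Python version is an illustration only, not Google's implementation and not a distributed one: the function names and the chunking strategy are assumptions of this example. It relies on exactly the property named above. Because the combining operator is associative, each worker can reduce its own chunk and the partial results can be combined afterwards without changing the answer.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_reduce(f, combine, items, workers=4):
    """Apply f to every item, then fold the results with an associative combine."""
    items = list(items)
    if not items:
        raise ValueError("map_reduce needs at least one item")
    size = -(-len(items) // workers)  # ceiling division
    # Contiguous chunks keep the original order, so combine only has to be
    # associative, not commutative.
    chunks = [items[i:i + size] for i in range(0, len(items), size)]

    def map_then_reduce(chunk):
        return reduce(combine, map(f, chunk))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_then_reduce, chunks))
    return reduce(combine, partials)


def sum_of_squares(numbers):
    """Example use: map = squaring, reduce = addition."""
    return map_reduce(lambda x: x * x, operator.add, numbers)
```

For the numbers 1 to 10, `sum_of_squares` returns 385 regardless of how many workers are used.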
I find the idea of parallel structured programming both theoretically sound and practically useful, so I suggest taking a look at it.
A word about multi-threading: since skeletons address massive parallelism, they are usually implemented in distributed memory rather than shared memory. Intel has developed a tool, TBB, which addresses multi-threading and (partially) follows the parallel structured programming framework. It is a C++ library, so you can probably just start using it in your projects.
Yes and no. The basic idea of CSP is used quite a bit. For example, thread-safe queues in one form or another are frequently used as the primary (often only) communication mechanism to build a pipeline out of individual processes (threads).
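A minimal sketch of that pipeline style, in Python rather than C++ to keep it short: each stage is a thread that owns nothing but its input and output queue, and a sentinel value tells the stages to shut down. The names `stage`, `run_pipeline` and `DONE` are assumptions of this example.

```python
import queue
import threading

DONE = object()  # sentinel passed down the pipeline to signal shutdown


def stage(func, inbox, outbox):
    """Read items from inbox, apply func, and pass the results on."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Push items through one thread per function, connected by queues."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        result = queues[-1].get()
        if result is DONE:
            break
        results.append(result)
    for thread in threads:
        thread.join()
    return results
```

The stages share no variables, only the queues. The sketch leaves out error handling: if a stage function raises, that stage stops and the pipeline stalls, so real code would need to forward failures as messages too.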
Hoare being Hoare, however, there's quite a bit more to his original theory than that. He invented a notation for talking about the processes, defined a specific set of signals that can be sent between the processes, and so on. The notation has since been refined in various ways, quite a bit of work put into proving various aspects, and so on.
Application of that relatively formal model of CSP (as opposed to just the general idea) is much less common. It's been used in a few systems where high reliability was considered extremely important, but few programmers appear interested in learning (yet another) formal design notation.
When I've designed systems like this, I've generally used an approach that's less rigorous, but (at least to me) rather easier to understand: a fairly simple diagram, with boxes representing the processes, and arrows representing the lines of communication. I doubt I could really offer much in the way of a proof about most of the designs (and I'll admit I haven't designed a really huge system this way), but it's worked reasonably well nonetheless.
Take a look at the website for a company called Verum. Their ASD technology is based on CSP and is used by companies like Philips Healthcare, Ericsson and NXP semiconductors to build software for all kinds of high-tech equipment and applications.
So to answer your question: Yes, CSP is used on large software projects in real-life.
Full disclosure: I do freelance work for Verum
Answering a very old question, yet it seems an important one.
There is Go where CSPs are a fundamental part of the language. In the FAQ to Go, the authors write:
Concurrency and multi-threaded programming have a reputation for difficulty. We believe this is due partly to complex designs such as pthreads and partly to overemphasis on low-level details such as mutexes, condition variables, and memory barriers. Higher-level interfaces enable much simpler code, even if there are still mutexes and such under the covers.
One of the most successful models for providing high-level linguistic support for concurrency comes from Hoare's Communicating Sequential Processes, or CSP. Occam and Erlang are two well known languages that stem from CSP. Go's concurrency primitives derive from a different part of the family tree whose main contribution is the powerful notion of channels as first class objects. Experience with several earlier languages has shown that the CSP model fits well into a procedural language framework.
Projects implemented in Go are:
Docker
Google's download server
Many more
This style is ubiquitous on Unix, where many tools are designed to process from standard in to standard out. I don't have any first-hand knowledge of large systems that are built that way, but I've seen many small one-off systems that are.
For instance, this simple command line uses (at least) 3 processes.
cat list-1 list-2 list-3 | sort | uniq > final.list
This system is only moderately sized, but I wrote a protocol processor that strips away and interprets successive layers of protocol in a message that used a style very similar to this. It was an event driven system using something akin to cooperative threading, but I could've used multithreading fairly easily with a couple of added tweaks.
The program is proprietary (unfortunately) so I can't show off the source code.
In my opinion, this style is useful for some things, but usually best mixed with some other techniques. Often there is a core part of your program that represents a processing bottleneck, and applying various concurrency increasing techniques there is likely to yield the biggest gains.
Microsoft had a technology called ActiveMovie (if I remember correctly) that did sequential processing on audio and video streams. Data got passed from one filter to another to go from input to output format (and source/sink). Maybe that's a practical example??
The Wikipedia article looks to me like a lot of funny symbols used to represent somewhat pedestrian concepts. For very large or extensible programs, the formalism can be very important to check how the (sub)processes are allowed to interact.
For a 50,000 line class program, you're probably better off architecting it as you see fit.
In general, following ideas such as these is a good idea in terms of performance. Persistent threads that process data in stages will tend not to contend, and exploit data locality well. Also, it is easy to throttle the threads to avoid data piling up as a fast stage feeds a slow stage: just block the fast one if its output buffer grows too big.
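The throttling described above falls out naturally from a bounded queue. Here is a minimal Python sketch, with invented names and an artificial 1 ms delay standing in for the slow stage: `put` blocks whenever the buffer is full, so the fast producer can never run more than `buffer_size` items ahead.

```python
import queue
import threading
import time


def run_throttled(n_items, buffer_size):
    """Fast producer, slow consumer, bounded buffer between them."""
    buffer = queue.Queue(maxsize=buffer_size)
    consumed = []
    high_water = 0  # largest backlog the consumer ever observed

    def fast_producer():
        for i in range(n_items):
            buffer.put(i)  # blocks while the buffer is full
        buffer.put(None)   # end-of-stream marker

    def slow_consumer():
        nonlocal high_water
        while True:
            high_water = max(high_water, buffer.qsize())
            item = buffer.get()
            if item is None:
                return
            time.sleep(0.001)  # simulate a slow processing stage
            consumed.append(item)

    threads = [
        threading.Thread(target=fast_producer),
        threading.Thread(target=slow_consumer),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return consumed, high_water
```

All items arrive in order, and the observed backlog never exceeds `buffer_size`, however much faster the producer is.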
A little bit off-topic but for my thesis I used a tool framework called TERRA/LUNA which aims for software development for Embedded Control Systems but is used heavily for all sorts of software development at my institute (so only academical use here).
TERRA is a graphical CSP and software architecture editor and LUNA is both the name for a C++ library for CSP based constructs and the plugin you'll find in TERRA to generate C++ code from your CSP models.
It becomes very handy in combination with FDR3 (a CSP refinement checker) for detecting any sort of lock (deadlock, livelock, etc.) or even for profiling.