How do I optimize the parallelization of Monte Carlo data generation with MPI?

How do I optimize the parallelization of Monte Carlo data generation with MPI? - c++

I am currently building a Monte Carlo application in C++ and I have a question regarding parallelization with MPI.
The process I want to parallelize is the MC generation of data. To have good precision in my final results, I specify the goal number of data points. Each data point is generated independently, but might require vastly differing amounts of time.
How do I organize the parallelization and workload distribution of the data generation most efficiently?
What I have done so far
So far I have come up with three possible ways of organizing the MPI part of the code:
The simplest way, but most likely inefficient way: I divide the desired sample size by the number of workers and let every worker generate that amount of data in isolation. However, when the slowest worker finishes, all other workers have been idling for a potentially long time. They could have been "supporting" the slowest worker by sharing its workload.
Use a master: A master communicates with the workers who work continuously until the master process registers that we have enough data and tells everybody to stop what they are doing. The disadvantage I see is that the master process might not be necessary and could be generating data instead (especially when I don't have a lot of workers).
A "ring communication" algorithm I came up with myself: A message is continuously sent and updated in a circle (1->2, 2->3, ... , N ->1). This message contains the global number of generated data point. Once the desired goal is met, the message is tagged, circles one more time and thereby tells everybody to stop working. Important here is I use non-blocking communication (with MPI_Iprobe before receiving via MPI_Recv, and sending via MPI_Isend). This way, everybody works, and no one ever idles.
No matter, which solution is chosen, in the end I reduce all data sets to one big set and continue to process the data.
The concrete questions:
Is there an "optimal" way of parallelizing such a fairly simple process? Would you prefer any of the proposed solutions for some reason?
What do you think of this "ring communication" solution?
I'm sure I'm not the first one to come up with e.g. the ring communication algorithm. I have tried to google this problem, but it seems to me that I do not know the right terminology in this context. I'm sure there must be a lot of material and literature on such simple algorithms, but I never had a formal course on MPI/parallelization. What are the "keywords" to look for?
Any advice and tips are much appreciated.

Related

Profiling a multiprocess system

I have a system that i need to profile.
It is comprised of tens of processes, mostly c++, some comprised of several threads, that communicate to the network and to one another though various system calls.
I know there are performance bottlenecks sometimes, but no one has put in the time/effort to check where they are: they may be in userspace code, inefficient use of syscalls, or something else.
What would be the best way to approach profiling a system like this?
I have thought of the following strategy:
Manually logging the roundtrip times of various code sequences (for example processing an incoming packet or a cli command) and seeing which process takes the largest time. After that, profiling that process, fixing the problem and repeating.
This method seems sorta hacky and guess-worky. I dont like it.
How would you suggest to approach this problem?
Are there tools that would help me out (multi-process profiler?)?
What im looking for is more of a strategy than just specific tools.
Should i profile every process separately and look for problems? if so how do i approach this?
Do i try and isolate the problematic processes and go from there? if so, how do i isolate them?
Are there other options?

I don't think there is a single answer to this sort of question. And every type of issue has it's own problems and solutions.
Generally, the first step is to figure out WHERE in the big system is the time spent. Is it CPU-bound or I/O-bound?
If the problem is CPU-bound, a system-wide profiling tool can be useful to determine where in the system the time is spent - the next question is of course whether that time is actually necessary or not, and no automated tool can tell the difference between a badly written piece of code that does a million completely useless processing steps, and one that does a matrix multiplication with a million elements very efficiently - it takes the same amount of CPU-time to do both, but one isn't actually achieving anything. However, knowing which program takes most of the time in a multiprogram system can be a good starting point for figuring out IF that code is well written, or can be improved.
If the system is I/O bound, such as network or disk I/O, then there are tools for analysing disk and network traffic that can help. But again, expecting the tool to point out what packet response or disk access time you should expect is a different matter - if you contact google to search for "kerflerp", or if you contact your local webserver that is a meter away, will have a dramatic impact on the time for a reasonable response.
There are lots of other issues - running two pieces of code in parallel that uses LOTS of memory can cause both to run slower than if they are run in sequence - because the high memory usage causes swapping, or because the OS isn't able to use spare memory for caching file-I/O, for example.
On the other hand, two or more simple processes that use very little memory will benefit quite a lot from running in parallel on a multiprocessor system.
Adding logging to your applications such that you can see WHERE it is spending time is another method that works reasonably well. Particularly if you KNOW what the use-case is where it takes time.
If you have a use-case where you know "this should take no more than X seconds", running regular pre- or post-commit test to check that the code is behaving as expected, and no-one added a lot of code to slow it down would also be a useful thing.

On a single node, is there any reason I should choose MPI over other inter-process mechanisms?

Currently I am implementing a multi-processing application. Different processes seldom share data, instead each process will work on its own pre-divided chunk of data for most of the time. With that said, these processes are pretty independent, except occasionally they communicate for control or feedback purposes, and collecting results in the end.
So my question is, is there any reason I should choose MPI over other inter-process mechanisms? Or is there any reason preventing me from choosing MPI?
I am not sure if on a single node, MPI is efficient enough.

MPI is an interface that let you to communicate between multiple process. When your processes are not communicating together (with the MPI code) and not doing parallelized code, they are not more or less eficient than a standalone process.
MPI is made for high performance, scalability, and portability. Most of the time MPI is used for scientific computing because of it's caracteristics that I enumeratate. It helps scientifics to save time.
The communication interface include function like broadcast, scatter, gather, reduction that help to reduce time of processing. In other communication mechanisme (most of them) you need to think about the efficient part of your communication by yourself.
If you only need a simple communication with two process, and the processing time is not a concern, you should use the inter-process message passing mechanismes of your OS. Sometimes MPI become tricky and difficult if you don't follow the idea for what MPI is made. And if you are juste begining to that, I think it's better for your to begin with OS mechanisms.
Hope that helps

Captain Wise spoke to the efficiency aspect. Consider the maintainability aspect as well.
If you use MPI and your code grows more complex and larger and some day needs to run on more than one node, then your porting and tuning effort might be as simple as "run on more than one node" -- presuming, of course, that your MPI implementation is clever enough to skip the networking stack when a message is passed on the same node.
If you want to show your code to a future collaborator, its likely they will know MPI if you are at all affiliated with HPC environments, and your possible future collaborator can just dive in and help you. The pool of Chapel or UPC users is a bit smaller (though if you use Chapel, Brad Chamberlain will work insanely hard to ensure you have a good experience)

mapreduce vs other parallel processing solutions

So, the questions are:
1. Is mapreduce overhead too high for the following problem? Does anyone have an idea of how long each map/reduce cycle (in Disco for example) takes for a very light job?
2. Is there a better alternative to mapreduce for this problem?
In map reduce terms my program consists of 60 map phases and 60 reduce phases all of which together need to be completed in 1 second. One of the problems I need to solve this way is a minimum search with about 64000 variables. The hessian matrix for the search is a block matrix, 1000 blocks of size 64x64 along a diagonal, and one row of blocks on the extreme right and bottom. The last section of : block matrix inversion algorithm shows how this is done. Each of the Schur complements S_A and S_D can be computed in one mapreduce step. The computation of the inverse takes one more step.
From my research so far, mpi4py seems like a good bet. Each process can do a compute step and report back to the client after each step, and the client can report back with new state variables for the cycle to continue. This way the process state is not lost computation can be continued with any updates.
http://mpi4py.scipy.org/docs/usrman/index.html
This wiki holds some suggestions, but does anyone have a direction on the most developed solution:
http://wiki.python.org/moin/ParallelProcessing
Thanks !

MPI is a communication protocol that allows for the implementation of parallel processing by passing messages between cluster nodes. The parallel processing model that is implemented with MPI depends upon the programmer.
I haven't had any experience with MapReduce but it seems to me that it is a specific parallel processing model and is designed to be simple to implement. This kind of abstraction should save you programming time and may or may not provide a suitable solution to your problem. It all depends on the nature of what you are trying to do.
The trick with parallel processing is that the most suitable solution is often problem specific and without knowing more specifics about your problem it is hard to make recommendations.
If you can tell us more about the environment that you are running your job on and where your program fits into Flynn's taxonomy, I might be able to provide some more helpful suggestions.

Organizing a task-based scientific computation

I have a computational algebra task I need to code up. The problem is broken into well-defined individuals tasks that naturally form a tree - the task is combinatorial in nature, so there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large.
I had previously coded this up in a functional fashion, calling the functions as needed and storing everything in RAM. This was a terrible approach, but I was more concerned about the theory then.
I'm planning to rewrite the code in C++ for a variety of reasons. I have a few requirements:
Checkpointing: The calculation takes a long time, so I need to be able to stop at any point and resume later.
Separate individual tasks as objects: This helps me keep a good handle of where I am in the computations, and offers a clean way to do checkpointing via serialization.
Multi-threading: The task is clearly embarrassingly parallel, so it'd be neat to exploit that. I'd probably want to use Boost threads for this.
I would like suggestions on how to actually implement such a system. Ways I've thought of doing it:
Implement tasks as a simple stack. When you hit a task that needs subcalculations done, it checks if it has all the subcalculations it requires. If not, it creates the subtasks and throws them onto the stack. If it does, then it calculates its result and pops itself from the stack.
Store the tasks as a tree and do something like a depth-first visitor pattern. This would create all the tasks at the start and then computation would just traverse the tree.
These don't seem quite right because of the problems of the lower levels requiring a vast number of subtasks. I could approach it in a iterator fashion at this level, I guess.
I feel like I'm over-thinking it and there's already a simple, well-established way to do something like this. Is there one?
Technical details in case they matter:
The task tree has 5 levels.
Branching factor of the tree is really small (say, between 2 and 5) for all levels except the lowest which is on the order of a few million.
Each individual task would only need to store a result tens of bytes large. I don't mind using the disk as much as possible, so long as it doesn't kill performance.
For debugging, I'd have to be able to recall/recalculate any individual task.
All the calculations are discrete mathematics: calculations with integers, polynomials, and groups. No floating point at all.

there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large... blah blah resuming, multi-threading, etc.
Correct me if I'm wrong, but it seems to me that you are exactly describing a map-reduce algorithm.
Just read what wikipedia says about map-reduce :
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
Using an existing mapreduce framework could save you a huge amount of time.
I just google "map reduce C++" and I start to get results, notably one in boost http://www.craighenderson.co.uk/mapreduce/

These don't seem quite right because of the problems of the lower levels requiring a vast number of subtasks. I could approach it in a iterator fashion at this level, I guess.
You definitely do not want millions of CPU-bound threads. You want at most N CPU-bound threads, where N is the product of the number of CPUs and the number of cores per CPU on your machine. Exceed N by a little bit and you are slowing things down a bit. Exceed N by a lot and you are slowing things down a whole lot. The machine will spend almost all its time swapping threads in and out of context, spending very little time executing the threads themselves. Exceed N by a whole lot and you will most likely crash your machine (or hit some limit on threads). If you want to farm lots and lots (and lots and lots) of parallel tasks out at once, you either need to use multiple machines or use your graphics card.

Advice for converting a large monolithic singlethreaded application to a multithreaded architecture?

My company's main product is a large monolithic C++ application, used for scientific data processing and visualisation. Its codebase goes back maybe 12 or 13 years, and while we have put work into upgrading and maintaining it (use of STL and Boost - when I joined most containers were custom, for example - fully upgraded to Unicode and the 2010 VCL, etc) there's one remaining, very significant problem: it's fully singlethreaded. Given it's a data processing and visualisation program, this is becoming more and more of a handicap.
I'm both a developer and the project manager for the next release where we want to tackle this, and this is going to be a difficult job in both areas. I'm seeking concrete, practical, and architectural advice on how to tackle the problem.
The program's data flow might go something like this:
a window needs to draw data
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This will go and calculate or read from file or whatever else is required (often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations)
Ie, the paint message handler will block while processing is done, and if the data hasn't already been calculated and cached, this can be a long time. Sometimes this is minutes. Similar paths occur for other parts of the program that perform lengthy processing operations - the program is unresponsive for the entire time, sometimes hours.
I'm seeking advice on how to approach changing this. Practical ideas. Perhaps things like:
design patterns for asynchronously requesting data?
storing large collections of objects such that threads can read and write safely?
handling invalidation of data sets while something is trying to read it?
are there patterns and techniques for this sort of problem?
what should I be asking that I haven't thought of?
I haven't done any multithreaded programming since my Uni days a few years ago, and I think the rest of my team is in a similar position. What I knew was academic, not practical, and is nowhere near enough to have confidence approaching this.
The ultimate objective is to have a fully responsive program, where all calculations and data generation is done in other threads and the UI is always responsive. We might not get there in a single development cycle :)
Edit: I thought I should add a couple more details about the app:
It's a 32-bit desktop application for Windows. Each copy is licensed. We plan to keep it a desktop, locally-running app
We use Embarcadero (formerly Borland) C++ Builder 2010 for development. This affects the parallel libraries we can use, since most seem (?) to be written for GCC or MSVC only. Luckily they're actively developing it and its C++ standards support is much better than it used to be. The compiler supports these Boost components.
Its architecture is not as clean as it should be and components are often too tightly coupled. This is another problem :)
Edit #2: Thanks for the replies so far!
I'm surprised so many people have recommended a multi-process architecture (it's the top-voted answer at the moment), not multithreading. My impression is that's a very Unix-ish program structure, and I don't know anything about how it's designed or works. Are there good resources available about it, on Windows? Is it really that common on Windows?
In terms of concrete approaches to some of the multithreading suggestions, are there design patterns for asynchronous request and consuming of data, or threadaware or asynchronous MVP systems, or how to design a task-oriented system, or articles and books and post-release deconstructions illustrating things that work and things that don't work? We can develop all this architecture ourselves, of course, but it's good to work from what others have done before and know what mistakes and pitfalls to avoid.
One aspect that isn't touched on in any answers is project managing this. My impression is estimating how long this will take and keeping good control of the project when doing something as uncertain as this may be hard. That's one reason I'm after recipes or practical coding advice, I guess, to guide and restrict coding direction as much as possible.
I haven't yet marked an answer for this question - this is not because of the quality of the answers, which is great (and thankyou) but simply that because of the scope of this I'm hoping for more answers or discussion. Thankyou to those who have already replied!

You have a big challenge ahead of you. I had a similar challenge ahead of me -- 15 year old monolithic single threaded code base, not taking advantage of multicore, etc. We expended a great deal of effort in trying to find a design and solution that was workable and would work.
Bad news first. It will be somewhere between impractical and impossible to make your single-threaded app multithreaded. A single threaded app relies on it's singlethreaded-ness is ways both subtle and gross. One example is if the computation portion requires input from the GUI portion. The GUI must run in the main thread. If you try to get this data directly from the computation engine, you will likely run in to deadlock and race conditions that will require major redesigns to fix. Many of these reliances will not crop up during the design phase, or even during the development phase, but only after a release build is put in a harsh environment.
More bad news. Programming multithreaded applications is exceptionally hard. It might seem fairly straightforward to just lock stuff and do what you have to do, but it is not. First of all if you lock everything in sight you end up serializing your application, negating every benefit of mutithreading in the first place while still adding in all the complexity. Even if you get beyond this, writing a defect-free MP application is hard enough, but writing a highly-performant MP application is that much more difficult. You could learn on the job in a kind of baptismal by fire. But if you are doing this with production code, especially legacy production code, you put your buisness at risk.
Now the good news. You do have options that don't involve refactoring your whole app and will give you most of what you seek. One option in particular is easy to implement (in relative terms), and much less prone to defects than making your app fully MP.
You could instantiate multiple copies of your application. Make one of them visible, and all the others invisible. Use the visible application as the presentation layer, but don't do the computational work there. Instead, send messages (perhaps via sockets) to the invisible copies of your application which do the work and send the results back to the presentation layer.
This might seem like a hack. And maybe it is. But it will get you what you need without putting the stability and performance of your system at such great risk. Plus there are hidden benefits. One is that the invisible engine copies of your app will have access to their own virtual memory space, making it easier to leverage all the resources of the system. It also scales nicely. If you are running on a 2-core box, you could spin off 2 copies of your engine. 32 cores? 32 copies. You get the idea.

So, there's a hint in your description of the algorithm as to how to proceed:
often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations
I'd look into making that data-flow graph be literally the structure that does the work. The links in the graph can be thread-safe queues, the algorithms at each node can stay pretty much unchanged, except wrapped in a thread that picks up work items from a queue and deposits results on one. You could go a step further and use sockets and processes rather than queues and threads; this will let you spread across multiple machines if there is a performance benefit in doing this.
Then your paint and other GUI methods need split in two: one half to queue the work, and the other half to draw or use the results as they come out of the pipeline.
This may not be practical if the app presumes that data is global. But if it is well contained in classes, as your description suggests it may be, then this could be the simplest way to get it parallelised.

Don't attempt to multithread everything in the old app. Multithreading for the sake of saying it's multithreaded is a waste of time and money. You're building an app that does something, not a monument to yourself.
Profile and study your execution flows to figure out where the app spends most of its time. A profiler is a great tool for this, but so is just stepping through the code in the debugger. You find the most interesting things in random walks.
Decouple the UI from long-running computations. Use cross-thread communications techniques to send updates to the UI from the computation thread.
As a side-effect of #3: think carefully about reentrancy: now that the compute is running in the background and the user can smurf around in the UI, what things in the UI should be disabled to prevent conflicts with the background operation? Allowing the user to delete a dataset while a computation is running on that data is probably a bad idea. (Mitigation: computation makes a local snapshot of the data) Does it make sense for the user to spool up multiple compute operations concurrently? If handled well, this could be a new feature and help rationalize the app rework effort. If ignored, it will be a disaster.
Identify specific operations that are candidates to be shoved into a background thread. The ideal candidate is usually a single function or class that does a lot of work (requires a "lot of time" to complete - more than a few seconds) with well defined inputs and outputs, that makes use of no global resources, and does not touch the UI directly. Evaluate and prioritize candidates based on how much work would be required to retrofit to this ideal.
In terms of project management, take things one step at a time. If you have multiple operations that are strong candidates to be moved to a background thread, and they have no interaction with each other, these might be implemented in parallel by multiple developers. However, it would be a good exercise to have everybody participate in one conversion first so that everyone understands what to look for and to establish your patterns for UI interaction, etc. Hold an extended whiteboard meeting to discuss the design and process of extracting the one function into a background thread. Go implement that (together or dole out pieces to individuals), then reconvene to put it all together and discuss discoveries and pain points.
Multithreading is a headache and requires more careful thought than straight up coding, but splitting the app into multiple processes creates far more headaches, IMO. Threading support and available primitives are good in Windows, perhaps better than some other platforms. Use them.
In general, don't do any more than what is needed. It's easy to severely over implement and over complicate an issue by throwing more patterns and standard libraries at it.
If nobody on your team has done multithreading work before, budget time to make an expert or funds to hire one as a consultant.

The main thing you have to do is to disconnect your UI from your data set. I'd suggest that the way to do that is to put a layer in between.
You will need to design a data structure of data cooked-for-display. This will most likely contain copies of some of your back-end data, but "cooked" to be easy to draw from. The key idea here is that this is quick and easy to paint from. You may even have this data structure contain calculated screen positions of bits of data so that it's quick to draw from.
Whenever you get a WM_PAINT message you should get the most recent complete version of this structure and draw from it. If you do this properly, you should be able to handle multiple WM_PAINT messages per second because the paint code never refers to your back end data at all. It's just spinning through the cooked structure. The idea here is that its better to paint stale data quickly than to hang your UI.
Meanwhile...
You should have 2 complete copies of this cooked-for-display structure. One is what the WM_PAINT message looks at. (call it cfd_A) The other is what you hand to your CookDataForDisplay() function. (call it cfd_B). Your CookDataForDisplay() function runs in a separate thread, and works on building/updating cfd_B in the background. This function can take as long as it wants because it isn't interacting with the display in any way. Once the call returns cfd_B will be the most up-to-date version of the structure.
Now swap cfd_A and cfd_B and InvalidateRect on your application window.
A simplistic way to do this is to have your cooked-for-display structure be a bitmap, and that might be a good way to go to get the ball rolling, but I'm sure with a bit of thought you can do a much better job with a more sophisticated structure.
So, referring back to your example.
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This is now 2 threads, the paint method refers to cfd_A and runs on the UI thread. Meanwhile cfd_B is being built by a background thread using GetData calls.
The quick-and-dirty way to do this is
Take your current WM_PAINT code, stick it into a function called PaintIntoBitmap().
Create a bitmap and a Memory DC, this is cfd_B.
Create a thread and pass it cfd_B and have it call PaintIntoBitmap()
When this thread completes, swap cfd_B and cfd_A
Now your new WM_PAINT method just takes the pre-rendered bitmap in cfd_A and draws it to the screen. Your UI is now disconnnected from your backend GetData() function.
Now the real work begins, because the quick-and-dirty way doesn't handle window resizing very well. You can go from there to refine what your cfd_A and cfd_B structures are a little at a time until you reach a point where you are satisfied with the result.

You might just start out breaking the the UI and the work task into separate threads.
In your paint method instead of calling getData() directly, it puts the request in a thread-safe queue. getData() is run in another thread that reads its data from the queue. When the getData thread is done, it signals the main thread to redraw the visualisation area with its result data using thread syncronization to pass the data.
While all this is going on you of course have a progress bar saying reticulating splines so the user knows something is going on.
This would keep your UI snappy without the significant pain of multithreading your work routines (which can be akin to a total rewrite)

It sounds like you have several different issues that parallelism can address, but in different ways.
Performance increases through utilizing multicore CPU Architecutres
You're not taking advantage of the multi-core CPU architetures that are becoming so common. Parallelization allow you to divide work amongst multiple cores. You can write that code through standard C++ divide and conquer techniques using a "functional" style of programming where you pass work to separate threads at the divide stage. Google's MapReduce pattern is an example of that technique. Intel has the new CILK library to give you C++ compiler support for such techniques.
Greater GUI responsiveness through asynchronous document-view
By separating the GUI operations from the document operations and placing them on different threads, you can increase the apparent responsiveness of your application. The standard Model-View-Controller or Model-View-Presenter design patterns are a good place to start. You need to parallelize them by having the model inform the view of updates rather than have the view provide the thread on which the document computes itself. The View would call a method on the model asking it to compute a particular view of the data, and the model would inform the presenter/controller as information is changed or new data becomes available, which would get passed to the view to update itself.
Opportunistic caching and pre-calculation
It sounds like your application has a fixed base of data, but many possible compute-intensive views on the data. If you did a statistical analysis on which views were most commonly requested in what situations, you could create background worker threads to pre-calculate the likely-requested values. It may be useful to put these operations on low-priority threads so that they don't interfere with the main application processing.
Obviously, you'll need to use mutexes (or critical sections), events, and probably semaphores to implement this. You may find some of the new synchronization objects in Vista useful, like the slim reader-writer lock, condition variables, or the new thread pool API. See Joe Duffy's book on concurrency for how to use these basic techniques.

There is something that no-one has talked about yet, but which is quite interesting.
It's called futures. A future is the promise of a result... let's see with an example.
future<int> leftVal = computeLeftValue(treeNode); // [1]
int rightVal = computeRightValue(treeNode); // [2]
result = leftVal + rightVal; // [3]
It's pretty simple:
You spin off a thread that starts computing leftVal, taking it from a pool for example to avoid the initialization problem.
While leftVal is being computed, you compute rightVal.
You add the two, this may block if leftVal is not computed yet and wait for the computation to end.
The great benefit here is that it's straightforward: each time you have one computation followed by another that is independent and you then join the result, you can use this pattern.
See Herb Sutter's article on futures, they will be available in the upcoming C++0x but there are already libraries available today even if the syntax is perhaps not as pretty as I would make you believe ;)

If it was my development dollars I was spending, I would start with the big picture:
What do I hope to accomplish, and how much will I spend to accomplish this, and how will I be further ahead? (If the answer to this is, my app will run 10% better on quadcore PCs, and I could have achieved the same result by spending $1000 more per customer PC , and spending $100,000 less this year on R&D, then, I would skip the whole effort).
Why am I doing multi-threaded instead of massively parallel distributed? Do I really think threads are better than processes? Multi-core systems also run distributed apps pretty well. And there are some advantages to message-passing process based systems that go beyond the benefits (and the costs!) of threading. Should I consider a process-based approach? SHould I consider a background running entirely as a service, and a foreground GUI? Since my product is node-locked and licensed, I think services would suit me (vendor) quite well. Also, separating stuff into two processes (background service and foreground) just might force the kind of rewrite and rearchitecting to occur that I might not be forced to do, if I was to just add threading into my mix.
This is just to get you thinking: What if you were to rewrite it as a service (background app) and a GUI, because that would actually be easier than adding threading, without also adding crashes, deadlocks, and race conditions?
Consider the idea that for your needs, perhaps threading is evil. Develop your religion, and stick with that. Unless you have a real good reason to go the other way. For many years, I religiously avoided threading. Because one thread per process is good enough for me.
I don't see any really solid reasons in your list why you need threading, except ones that could be more inexpensively solved by more expensive target computer hardware. If your app is "too slow" adding in threads might not even speed it up.
I use threads for background serial communications, but I would not consider threading merely for computationally heavy applications, unless my algorithms were so inherently parallel as to make the benefits clear, and the drawbacks minimal.
I wonder if the "design" problems that this C++Builder app has are like my Delphi "RAD Spaghetti" application disease. I have found that a wholesale refactor/rewrite (over a year per major app that I have done this to), was a minimum amount of time for me to get a handle on application "accidental complexity". And that was without throwing a "threads where possible" idea. I tend to write my apps with threads for serial communication and network socket handling, only. And maybe the odd "worker-thread-queue".
If there is a place in your app you can add ONE thread, to test the waters, I would look for the main "work queue" and I would create an experimental version control branch, and I would learn about how my code works by breaking it in the experimental branch. Add that thread. And see where you spend your first day of debugging. Then I might just abandon that branch and go back to my trunk until the pain in my temporal lobe subsides.
Warren

Here's what I would do...
I would start by profiling your and seeing:
1) what is slow and what the hot paths are
2) which calls are reentrant or deeply nested
you can use 1) to determine where the opportunity is for speedups and where to start looking for parallelization.
you can use 2) to find out where the shared state is likely to be and get a deeper sense of how much things are tangled up.
I would use a good system profiler and a good sampling profiler (like the windows perforamnce toolkit or the concurrency views of the profiler in Visual Studio 2010 Beta2 - these are both 'free' right now).
Then I would figure out what the goal is and how to separate things gradually to a cleaner design that is more responsive (moving work off the UI thread) and more performant (parallelizing computationally intensive portions). I would focus on the highest priority and most noticable items first.
If you don't have a good refactoring tool like VisualAssist, invest in one - it's worth it. If you're not familiar with Michael Feathers or Kent Beck's refactoring books, consider borrowing them. I would ensure my refactorings are well covered by unit tests.
You can't move to VS (I would recommend the products I work on the Asynchronous Agents Library & Parallel Pattern Library, you can also use TBB or OpenMP).
In boost, I would look carefully at boost::thread, the asio library and the signals library.
I would ask for help / guidance / a listening ear when I got stuck.
-Rick

You can also look at this article from Herb Sutter You have a mass of existing code and want to add concurrency. Where do you start?

Well, I think you're expecting a lot based on your comments here. You're not going to go from minutes to milliseconds by multithreading. The most you can hope for is the current amount of time divided by the number of cores. That being said, you're in a bit of luck with C++. I've written high performance multiprocessor scientific apps, and what you want to look for is the most embarrassingly parallel loop you can find. In my scientific code, the heaviest piece is calculating somewhere between 100 and 1000 data points. However, all of the data points can be calculated independently of the others. You can then split the loop using openmp. This is the easiest and most efficient way to go. If you're compiler doesn't support openmp, then you will have a very hard time porting existing code. With openmp (if you're lucky), you may only have to add a couple of #pragmas to get 4-8x the performance. Here's an example StochFit

I hope this will help you in understanding and converting your monolithic single threaded app to multi thread easily. Sorry it is for another programming language but never the less the principles explained are the same all over.
http://www.freevbcode.com/ShowCode.Asp?ID=1287
Hope this helps.

The first thing you must do is to separate your GUI from your data, the second is to create a multithreaded class.
STEP 1 - Responsive GUI
We can assume that the image you are producing is contained in the canvas of a TImage. You can put a simple TTimer in you form and you can write code like this:
if (CurrenData.LastUpdate>CurrentUpdate)
{
Image1->Canvas->Draw(0,0,CurrenData.Bitmap);
CurrentUpdate=Now();
}
OK! I know! Is a little bit dirty, but it's fast and is simple.The point is that:
You need an Object that is created in the main thread
The object is copied in the Form you need, only when is needed and in a safe way (ok, a better protection for the Bitmap may be is needed, but for semplicity...)
The object CurrentData is your actual project, single threaded, that produces an image
Now you have a fast and responsive GUI. If your algorithm as slow, the refresh is slow, but your user will never think that your program is freezed.
STEP 2 - Multithread
I suggest you to implement a class like the following:
SimpleThread.h
typedef void (__closure *TThreadFunction)(void* Data);
class TSimpleThread : public TThread
{
public:
TSimpleThread( TThreadFunction _Action,void* _Data = NULL, bool RunNow = true );
void AbortThread();
__property Terminated;
protected:
TThreadFunction ThreadFunction;
void* Data;
private:
virtual void __fastcall Execute() { ThreadFunction(Data); };
};
SimpleThread.c
TSimpleThread::TSimpleThread( TThreadFunction _Action,void* _Data, bool RunNow)
: TThread(true), // initialize suspended
ThreadFunction(_Action), Data(_Data)
{
FreeOnTerminate = false;
if (RunNow) Resume();
}
void TSimpleThread::AbortThread()
{
Suspend(); // Can't kill a running thread
Free(); // Kills thread
}
Let's explain. Now, in your simple threaded class you can create an object like this:
TSimpleThread *ST;
ST=new TSimpleThread( RefreshFunction,NULL,true);
ST->Resume();
Let's explain better: now, in your own monolithic class, you have created a thread. More: you bring a function (ie: RefreshFunction) in a separate thread. The scope of your funcion is the same, the class is the same, the execution is separate.

My number one suggestion, although it's very late (sorry for reviving old thread, it's interesting!) is seek out homogeneous transform loops where each iteration of the loop is mutating a completely independent piece of data from the other iterations.
Instead of thinking about how to turn this old codebase into an asynchronous one running all kinds of operations in parallel (which could be asking for all kinds of trouble from worse than single-threaded performance from poor locking patterns or exponentially worse, race conditions/deadlocks by trying to do this in hindsight to code you can't fully comprehend), stick to the sequential mindset for the overall application design for now but identify or extract simple, homogeneous transform loops. Don't go from intrusive broad design-level multithreading and then try to drill into details. Work from non-intrusive multithreading of fine implementation details and specific hotspots first.
What I mean by homogeneous loops is basically one that transforms data in a very straightforward way, like:
for each pixel in image:
make it brighter
That is very simple to reason about and you can safely parallelize this loop without any problems whatsoever using OMP or TBB or whatever and without getting tangled up in thread synchronization. It only takes one glance at this code to fully comprehend its side effects.
Try to find as many hotspots as you can which fit this type of simple homogeneous transform loop and if you have complex loops which update many different types of data with complex control flows that trigger complex side effects, then seek to refactor towards these homogeneous loops. Often a complex loop which causes 3 disparate side effects to 3 different types of data can be turned into 3 simple homogeneous loops which each trigger just one kind of side effect to one type of data with a simpler control flow. Doing multiple loops instead of one might seem a tad wasteful, but the loops become simpler, the homogeneity will often lead to more cache-friendly sequential memory access patterns vs. sporadic random-access patterns, and you then tend to find much more opportunities to safely parallelize (as well as vectorize) the code in a straightforward way.
First you have to thoroughly understand the side effects of any code you attempt to parallelize (and I mean thoroughly!!!), so seeking out these homogeneous loops gives you isolated areas of the codebase you can easily reason about in terms of the side effects to the point where you can confidently and safely parallelize those hotspots. It'll also improve the maintainability of the code by making it very easy to reason about the state changes going on in that particular piece of code. Save the dream of the uber multithreaded application running everything in parallel for later. For now, focus on identifying/extracting performance-critical, homogeneous loops with simple control flows and simple side effects. Those are your priority targets for parallelization with simple parallelized loops.
Now admittedly I somewhat dodged your questions, but most of them don't need apply if you do what I suggest, at least until you've kind of worked your way out to the point where you're thinking more about multithreading designs as opposed to simply parallelizing implementation details. And you might not even need to go that far to have a very competitive product in terms of performance. If you have beefy work to do in a single loop, you can devote the hardware resources to making that loop go faster instead of making many operations run simultaneously. If you have to resort to more async methods like if your hotspots are more I/O bound, seek an async/wait approach where you fire off an async task but do some things in the meantime and then wait on the async task(s) to complete. Even if that's not absolutely necessary, the idea is to section off isolated areas of your codebase where you can, with 100% confidence (or at least 99.9999999%) say that the multithreaded code is correct.
You don't ever want to gamble with race conditions. There's nothing more demoralizing than finding some obscure race condition that only occurs once in a full moon on some random user's machine while your entire QA team is unable to reproduce it, only to, 3 months later, run into it yourself except during that one time you ran a release build without debugging info available while you then toss and turn in your sleep knowing your codebase can flake out at any given moment but in ways that no one will ever be able to consistently reproduce. So take it easy with multithreading legacy codebases, at least for now, and stick to multithreading isolated but critical sections of the codebase where the side effects are dead simple to reason about. And test the crap out of it -- ideally apply a TDD approach where you write a test for the code you're going to multithread to ensure it gives the correct output after you finish... though race conditions are the types of things that easily fly under the radar of unit and integration testing, so again you absolutely need to be able to comprehend the entirety of the side effects that go on in a given piece of code before you attempt to multithread it. The best way to do that is to make the side effects as easy to comprehend as possible with the simplest control flows causing just one type of side effect for an entire loop.

It is hard to give you proper guidelines. But...
The easiest way out according to me is to convert your application to ActiveX EXE as COM has support for Threading, etc. built right into it your program will automatically become Multi Threading application. Of course you will have to make quite a few changes to your code. But this is the shortest and safest way to go.
I am not sure but probably RichClient Toolset lib may do the trick for you. On the site the author has written:
It also offers registration free Loading/Instancing-capabilities
for ActiveX-Dlls and new, easy to use Threading-approach,
which works with Named-Pipes under the
hood and works therefore also
cross-process.
Please check it out. Who knows it may be the right solution for your requirements.
As for Project management I think you can continue using what is provided in your choice IDE by integrating it with SVN through plugins.
I forgot to mention that we have completed an application for Share market that automatically trades (buys and sells based on lows and highs) into those scripts that are in user portfolio based on an algorithm that we have developed.
While developing this software we were facing the same kind of problem as you have illustrated here. To solve it we converted out application in ActiveX EXE and we converted all those parts that need to execute parallely into ActiveX DLLs. We have not used any third party libs for this!
HTH

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js