(preferably boost) lock-free array/vector/map/etc? - c++

Considering my lack of c++ knowledge, please try to read my intent and not my poor technical question.
This is the backbone of my program https://github.com/zaphoyd/websocketpp/blob/experimental/examples/broadcast_server/broadcast_server.cpp
I'm building a websocket server with websocket++ (and oh is websocket++ sweet; I highly recommend it). I can easily manipulate per-user data thread-safely because it really doesn't need to be touched by different threads; however, I do want to be able to write to an array (I'm using the catch-all term "array" from weaker languages like VB, PHP and JS) from one function's thread (with multiple iterations that could be running simultaneously) and also read it from one or more other threads.
Take Stack Overflow as an example: if I wanted to have all of the ids (the PRIMARY column of all articles) held in memory and sorted in a particular way, in this case by net votes, I'm thinking I would have a function that is called in its own boost::thread, fired whenever a vote comes in on the site, to reorder the array.
How can I do this without locking & blocking? I'm 100% fine with users reading from an old array while another is being built, but I absolutely do not want their reads or the thread writes to ever fail/be blocked.
Does a lock-free array exist? If not, is there some way to build the new array in a temporary array and then write it to the actual array when the building is finished without locking & blocking?

Have you looked at Boost.Lockfree?

Uh, uh, uh. Complicated.
Look here (for an example): RCU -- and this is only about multiple reads along with ONE write.
My guess is that multiple writers at once are not going to work. You should rather look for a more efficient representation than an array, one that allows for faster updates. How about a balanced tree? log(n) should never block anything in a noticeable fashion.
Regarding boost -- I'm happy that it finally has proper support for thread synchronization.
Of course, you could also keep a copy and batch the updates. Then a background process merges the updates and copies the result for the readers.
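To make the "build it elsewhere, then publish it" idea concrete, here is a minimal sketch using std::shared_ptr and the atomic_load/atomic_store overloads from <memory> (C++11; C++20 would use std::atomic<std::shared_ptr<T>> instead). The names rebuild_sorted_ids and current_ids are made up for illustration, and it assumes only one rebuild runs at a time -- readers are never blocked, they just keep the old snapshot alive until they drop it.

#include <algorithm>
#include <memory>
#include <vector>

std::shared_ptr<const std::vector<int>> g_sorted_ids =
    std::make_shared<const std::vector<int>>();

// Writer thread: rebuild the whole array, then swap the pointer in one atomic step.
void rebuild_sorted_ids(std::vector<int> ids)
{
    std::sort(ids.begin(), ids.end());   // stand-in for "order by net votes"
    auto fresh = std::make_shared<const std::vector<int>>(std::move(ids));
    std::atomic_store(&g_sorted_ids, fresh);   // readers now see the new array
}

// Reader threads: take a snapshot; it stays valid even if a rebuild happens meanwhile.
std::shared_ptr<const std::vector<int>> current_ids()
{
    return std::atomic_load(&g_sorted_ids);
}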

Related

Thread-safe read-only alternative to vtkUnstructuredGrid->GetPoint()

I have started working on multithreading and point cloud processing. The problem is that I have to add multithreading to an existing implementation, and there are so many read and write operations that using a mutex does not give me enough speed-up, because of the sheer number of reads from the grid.
In the end I modified the code so that I have one vtkSmartPointer<vtkUnstructuredGrid> which holds my point cloud. The only operation the threads have to do is access points using the GetPoint method. However, it is not thread-safe even for read-only operations, due to the smart pointers.
Because of that I had to copy my main point cloud for each thread, which eventually causes memory issues if I have too many threads and big clouds.
I tried to cut the point clouds into chunks, but that gets too complicated again when I have many threads. I cannot guarantee an optimal number of points to process for each thread. I also do a neighbour search for each point, so cutting the point cloud into chunks gets even more complicated, because I need overlap between chunks in order to get a proper neighbourhood search.
Since vtkUnstructuredGrid is memory-optimized, I could not replace it with some STL container. I would be happy if you could recommend data structures for point cloud processing that are thread-safe to read, or any other solution I could use.
Thanks in advance
I am not familiar with VTK or how it works.
In general, there are various techniques and methods to improve performance in multi-threading environment. The question is vague, so I can only provide a general vague answer.
Easy: In case there are many reads and few writes, use std::shared_mutex as it allows multiple reads simultaneously.
Moderate: If the threads work with distinct data most of the time -- they access the same data array but at distinct locations -- then you can implement a handler that ensures the threads concurrently work on distinct pieces of data without intersections; if a thread asks to work on a piece of data that is currently being processed, tell it to work on something else or to wait.
Hard: There are methods that allow efficient concurrency via std::atomic by utilizing various memory-ordering instructions. I am not too familiar with them and they are definitely not simple, but you can find tutorials on the internet. As far as I know, parts of such methods are still in research and development, and best practices aren't established yet.
P.S. If there are many reads/writes over the same data... is the implementation even aware of the fact that the data is shared over several threads? Does it even perform correctly? You might end up needing to rewrite the whole implementation.
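To make the "Easy" option concrete, here is a minimal sketch of std::shared_mutex (C++17) guarding a container; PointGrid and Point are made-up placeholders, not VTK types. Many readers can hold the shared lock at the same time, and only a writer takes the exclusive lock.

#include <cstddef>
#include <shared_mutex>
#include <vector>

struct Point { double x, y, z; };

class PointGrid {
public:
    Point get_point(std::size_t i) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);   // many concurrent readers
        return points_[i];
    }
    void set_point(std::size_t i, Point p) {
        std::unique_lock<std::shared_mutex> lock(mutex_);   // exclusive writer
        points_[i] = p;
    }
private:
    mutable std::shared_mutex mutex_;
    std::vector<Point> points_;
};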
I just thought I'd post the solution, because it was actually my own stupidity. I realized that in one part of my code I was using the double* vtkDataSet::GetPoint(vtkIdType ptId) overload of GetPoint(), which is not thread-safe.
For multithreaded code, void vtkDataSet::GetPoint(vtkIdType id, double x[3]) should be used.
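For contrast, a tiny sketch of the two call forms, assuming grid is a valid vtkUnstructuredGrid* (per the answer above, only the second form is safe to call from several threads):

double* p = grid->GetPoint(id);   // returns a pointer into shared internal storage --
                                  // not safe when several threads call it concurrently
double q[3];
grid->GetPoint(id, q);            // copies into caller-owned memory -- the form to use
                                  // from worker threads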

Implementing a lock free data structure on disk

I have an interesting challenge, for those with strong background in lock-free data structures, and disk based data structures.
I'm looking for a way to build in C++ a data structure to hold a varying amount of objects.
The limitation are such:
The data structure must reside on disk.
There is one thread writing to the data structure and many others reading from it.
Every read is atomic. (Let's assume I can atomically read a block of size 32/64 KB, and that all objects are smaller than that.)
A write should not block a read; for that, it is possible to assume that I can also write a 32/64 KB block atomically.
Locks cannot be used at all.
Any suggestions?
I was thinking of using something like a B-tree: when nodes need to be split and new data written, move them to new nodes at the end of the file and then just update the pointers to the nodes, which would reside, for example, in some other file (the original blocks would be marked as free and added to a freestore).
However, I run into a problem if my mapping file is greater than 32/64 KB. Let's say I want it to hold just 1 million object pointers: at 4 bytes per pointer that is 4 million bytes, roughly 4 MB (and with 1 billion objects even more than that), which means the mapping file cannot be written in an atomic manner.
So if someone has a better suggestion as to maybe how to implement the above - or even some direction it would be greatly appreciated.
As far as I know, all open-source/commercial implementations of B-trees use locks of some sort, which I cannot use.
Thanks,
Max.
You won't get very far by just assuming reads/writes are atomic -- mainly because they're not, and you'll end up emulating it in a way that'll kill performance.
It sounds like you want to research MVCC, which is the pretty standard mechanism to use when designing a lock-free database. The basic concept is that every read gets a "snapshot" of the database -- usually implemented in a lock-free way by leaving old pages alone and performing any modifications to new pages only. Once the old pages are finished being used by readers, they're finally marked for re-use.
While MVCC is significantly more involved than a CPU/RAM lock-free structure, once you have it many of the same optimistic lock-free patterns apply towards using it.
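A minimal shadow-paging sketch of that MVCC idea, under the same assumptions the question already makes about atomic block writes (the names, the 4 KB page size and the single-sector root record are illustrative, not from the question): new versions of nodes are only ever appended, and the one in-place write is a tiny root record that fits in a single sector, so readers either see the old tree or the new one.

#include <cstddef>
#include <cstdint>
#include <unistd.h>

struct RootRecord {            // kept well under one 512-byte sector
    std::uint64_t root_page;   // page number of the current tree root
    std::uint64_t version;     // monotonically increasing commit counter
};

void commit(int fd, const void* new_pages, std::size_t bytes,
            std::uint64_t first_new_page, RootRecord next_root)
{
    // 1. Append the freshly built pages; readers still follow the old root.
    ::pwrite(fd, new_pages, bytes, first_new_page * 4096);
    ::fsync(fd);                                   // make the new pages durable first
    // 2. Publish: overwrite the small root record at a fixed offset.
    ::pwrite(fd, &next_root, sizeof next_root, 0);
    ::fsync(fd);
}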
LMDB will do all of this with no problem. It is an MVCC B+tree and readers are completely lockless.

How to LRU-cache numerous objects made of C++ STL heavy structures?

I have big C++/STL data structures (myStructType) with nested lists and maps. I have many objects of this type that I want to LRU-cache by key. I can reload objects from disk when needed. Moreover, the cache has to be shared in a multiprocess, high-performance application running on a BSD platform.
I can see several solutions:
I can maintain a lifetime-sorted list of pair<size_t lifeTime, myStructType v>, plus a map for O(1) access to the index of the desired object in the list from its key; I can use shm and mmap to store everything, and a lock to manage access (cf. here).
I can use a redis server configured for LRU, and redesign my data structures to redis key/value and key/lists pairs.
I can use a redis server configured for LRU, and serialise my data structures (myStructType) to have a simple key/value to manage with redis.
There may be other solutions, of course. How would you do that, or better, how have you successfully done it, keeping high performance in mind?
In addition, I would like to avoid heavy dependencies like Boost.
I actually built caches (not only LRU) recently.
Options 2 and 3 are quite likely not faster than re-reading from disk. That's effectively no cache at all. They would also be a far heavier dependency than Boost.
Option 1 can be challenging. For instance, you suggest "a lock". That would be quite a contended lock, as it must protect every lifetime update, plus all LRU operations. Since your objects are already heavy, it may be worthwhile to have a unique lock per object. There are intermediate variants of this solution where there is more than one lock, but also more than one object per lock. (You still need a lock to protect the whole map, but that's for replacement only.)
You should also consider whether you really need strict LRU. That strategy assumes the chance of an object being reused decreases over time. If that's not actually true, random replacement is just as good. You can also consider evicting more than one element at a time. One of the challenges is that when an element needs removing, all threads will notice, but it's sufficient for one thread to remove it. That's why batch removal helps: if a thread tries to take the lock for a batch removal and fails, it can continue under the assumption that the cache will have free space soon.
One quick win is to not update the LRU timestamp of the most recently used element. It was already the newest; making it any newer won't help. This of course only has an effect if you often reuse that element quickly, but (as noted above) otherwise you'd just use random eviction.
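To ground the discussion, here is a minimal single-lock LRU sketch (in-process only; the cross-process sharing from option 1 would still need shm/mmap on top of this, and the single mutex below is exactly the contended lock discussed above -- lock striping or per-object locks would split it up). KeyT, ValueT and the capacity are placeholders.

#include <cstddef>
#include <list>
#include <mutex>
#include <optional>
#include <unordered_map>
#include <utility>

template <class KeyT, class ValueT>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    std::optional<ValueT> get(const KeyT& key) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        order_.splice(order_.begin(), order_, it->second);   // move to the front (most recent)
        return it->second->second;
    }

    void put(const KeyT& key, ValueT value) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = std::move(value);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() == capacity_) {                     // evict the least recently used
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, std::move(value));
        index_[key] = order_.begin();
    }

private:
    using Entry = std::pair<KeyT, ValueT>;
    std::size_t capacity_;
    std::mutex mutex_;
    std::list<Entry> order_;                                  // front = most recently used
    std::unordered_map<KeyT, typename std::list<Entry>::iterator> index_;
};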

What is the defacto standard for sharing variables between programs in different languages?

I've never had formal training in this area so I'm wondering what do they teach in school (if they do).
Say you have two programs written in two different languages, C++ and Python or some other combination, and you want to share a constantly updated variable on the same machine. What would you use, and why? The information need not be secured, but it must be isochronous and reliable.
E.g. program A will get a value from a hardware device and update variable X every 0.1 ms; I'd like to be able to access X from program B as often as possible and obtain the latest value. Programs A and B are written and compiled in two different (robust) languages. How do I access X from program B? Assume I have the source code for A and B and I do not want to completely rewrite or port either of them.
The methods I've seen used thus far include:
File buffer - read and write to a single file (e.g. C:\temp.txt).
Create a wrapper - from A to B, or B to A.
Memory buffer - designate a specific memory address (mutex?).
UDP packets via sockets - haven't tried it yet, but looks good.
Firewall?
Sorry for just throwing this out there, I don't know what the name of this technique is so I have trouble searching.
Well, you can write XML and use some basic message queuing (like RabbitMQ) to pass messages around.
Don't know if this will be helpful, but I'm also a student, and this is what I think you mean.
I've used marshalling to take a Java class and import it into a C# program.
With marshalling you use XML to transfer code in a way that can be read by other coding environments.
When asking particular questions, you should aim at providing as much information as possible. You have added a use case, but the use case is incomplete.
Your particular use case seems to be a very small amount of data that has to be available at a high frequency (10 kHz). I would first try to determine whether I can actually make both pieces of code part of a single process, rather than two different processes. Depending on the languages (missing from the question) it might even be simple, or it might turn the impossible into the possible -- depending on the OS (missing from the question), the scheduler might not be fast enough switching from one process to another, and that might impact the availability of the latest read. Switching between threads is usually much faster.
If you cannot turn them into a single process, then you will have to use some sort of IPC (inter-process communication). Due to the frequency I would rule out most heavyweight protocols (avoid XML, CORBA), as the overhead will probably be too high. If the receiving end only needs access to the latest value, and that access may be less frequent than every 0.1 ms, then you don't want any protocol that involves queueing: you do not care about the next element in the queue, only the last one, and if you did not read an element while it was fresh, you want to avoid the cost of processing it once it is already stale -- i.e. it does not make sense to loop extracting from the queue and discarding.
I would be inclined to use shared memory, or a memory-mapped shared file (they are probably quite similar; it depends on the platform, missing from the question). Depending on the size of the element and the exact hardware architecture (missing from the question), you may be able to avoid locking with a mutex. As an example, on current Intel processors, read/write access to a 32-bit integer in memory is guaranteed to be atomic if the variable is correctly aligned, so in that case you would not be locking.
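A minimal POSIX sketch of that shared-memory idea, C++ side only (the names, like /latest_sample, are made up; on the Python side the mmap module could map the same object, and error handling is omitted). A lock-free 32-bit std::atomic placed in the mapped region gives torn-free reads of the latest value without any mutex:

#include <atomic>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

std::atomic<std::uint32_t>* map_latest_value()
{
    int fd = ::shm_open("/latest_sample", O_CREAT | O_RDWR, 0600);
    ::ftruncate(fd, sizeof(std::uint32_t));
    void* mem = ::mmap(nullptr, sizeof(std::uint32_t),
                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    // Strictly, one process should construct the atomic in place (placement new);
    // for a lock-free 32-bit atomic this cast works in practice on mainstream platforms.
    return static_cast<std::atomic<std::uint32_t>*>(mem);
}

// Program A (producer):  map_latest_value()->store(sample, std::memory_order_release);
// Program B (consumer):  auto v = map_latest_value()->load(std::memory_order_acquire);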
At my school they teach CORBA. They shouldn't; it's an ancient, hideous technology from the eon of mainframes, a classic case of design-by-committee: every feature you don't want is included, and some that you probably do want (asynchronous calls?) aren't. If you think the C++ specification is big, think again.
Don't use it.
That said though, it does have a nice, easy-to-use interface for doing simple things.
But don't use it.
It almost always passes through a C binding.

Advice for converting a large monolithic singlethreaded application to a multithreaded architecture?

My company's main product is a large monolithic C++ application, used for scientific data processing and visualisation. Its codebase goes back maybe 12 or 13 years, and while we have put work into upgrading and maintaining it (use of STL and Boost - when I joined most containers were custom, for example - fully upgraded to Unicode and the 2010 VCL, etc) there's one remaining, very significant problem: it's fully singlethreaded. Given it's a data processing and visualisation program, this is becoming more and more of a handicap.
I'm both a developer and the project manager for the next release where we want to tackle this, and this is going to be a difficult job in both areas. I'm seeking concrete, practical, and architectural advice on how to tackle the problem.
The program's data flow might go something like this:
a window needs to draw data
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This will go and calculate or read from file or whatever else is required (often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations)
I.e., the paint message handler blocks while processing is done, and if the data hasn't already been calculated and cached, this can take a long time -- sometimes minutes. Similar paths occur for other parts of the program that perform lengthy processing operations; the program is unresponsive for the entire time, sometimes hours.
I'm seeking advice on how to approach changing this. Practical ideas. Perhaps things like:
design patterns for asynchronously requesting data?
storing large collections of objects such that threads can read and write safely?
handling invalidation of data sets while something is trying to read it?
are there patterns and techniques for this sort of problem?
what should I be asking that I haven't thought of?
I haven't done any multithreaded programming since my Uni days a few years ago, and I think the rest of my team is in a similar position. What I knew was academic, not practical, and is nowhere near enough to have confidence approaching this.
The ultimate objective is to have a fully responsive program, where all calculations and data generation is done in other threads and the UI is always responsive. We might not get there in a single development cycle :)
Edit: I thought I should add a couple more details about the app:
It's a 32-bit desktop application for Windows. Each copy is licensed. We plan to keep it a desktop, locally-running app
We use Embarcadero (formerly Borland) C++ Builder 2010 for development. This affects the parallel libraries we can use, since most seem (?) to be written for GCC or MSVC only. Luckily they're actively developing it and its C++ standards support is much better than it used to be. The compiler supports these Boost components.
Its architecture is not as clean as it should be and components are often too tightly coupled. This is another problem :)
Edit #2: Thanks for the replies so far!
I'm surprised so many people have recommended a multi-process architecture (it's the top-voted answer at the moment), not multithreading. My impression is that's a very Unix-ish program structure, and I don't know anything about how it's designed or works. Are there good resources available about it, on Windows? Is it really that common on Windows?
In terms of concrete approaches to some of the multithreading suggestions: are there design patterns for asynchronously requesting and consuming data, or thread-aware or asynchronous MVP systems, or for designing a task-oriented system, or articles, books and post-release deconstructions illustrating things that work and things that don't? We can develop all this architecture ourselves, of course, but it's good to work from what others have done before and know what mistakes and pitfalls to avoid.
One aspect that isn't touched on in any answers is project managing this. My impression is estimating how long this will take and keeping good control of the project when doing something as uncertain as this may be hard. That's one reason I'm after recipes or practical coding advice, I guess, to guide and restrict coding direction as much as possible.
I haven't yet marked an answer for this question - not because of the quality of the answers, which is great (thank you!), but simply because, given the scope of this, I'm hoping for more answers or discussion. Thank you to those who have already replied!
You have a big challenge ahead of you. I had a similar challenge ahead of me -- 15 year old monolithic single threaded code base, not taking advantage of multicore, etc. We expended a great deal of effort in trying to find a design and solution that was workable and would work.
Bad news first. It will be somewhere between impractical and impossible to make your single-threaded app multithreaded. A single-threaded app relies on its single-threadedness in ways both subtle and gross. One example is if the computation portion requires input from the GUI portion. The GUI must run in the main thread. If you try to get this data directly from the computation engine, you will likely run into deadlocks and race conditions that will require major redesigns to fix. Many of these reliances will not crop up during the design phase, or even during the development phase, but only after a release build is put into a harsh environment.
More bad news. Programming multithreaded applications is exceptionally hard. It might seem fairly straightforward to just lock stuff and do what you have to do, but it is not. First of all, if you lock everything in sight you end up serializing your application, negating every benefit of multithreading in the first place while still adding in all the complexity. Even if you get beyond this, writing a defect-free MP application is hard enough, and writing a highly performant MP application is that much more difficult. You could learn on the job in a kind of baptism by fire. But if you are doing this with production code, especially legacy production code, you put your business at risk.
Now the good news. You do have options that don't involve refactoring your whole app and will give you most of what you seek. One option in particular is easy to implement (in relative terms), and much less prone to defects than making your app fully MP.
You could instantiate multiple copies of your application. Make one of them visible, and all the others invisible. Use the visible application as the presentation layer, but don't do the computational work there. Instead, send messages (perhaps via sockets) to the invisible copies of your application which do the work and send the results back to the presentation layer.
This might seem like a hack. And maybe it is. But it will get you what you need without putting the stability and performance of your system at such great risk. Plus there are hidden benefits. One is that the invisible engine copies of your app will have access to their own virtual memory space, making it easier to leverage all the resources of the system. It also scales nicely. If you are running on a 2-core box, you could spin off 2 copies of your engine. 32 cores? 32 copies. You get the idea.
So, there's a hint in your description of the algorithm as to how to proceed:
often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations
I'd look into making that data-flow graph be literally the structure that does the work. The links in the graph can be thread-safe queues, the algorithms at each node can stay pretty much unchanged, except wrapped in a thread that picks up work items from a queue and deposits results on one. You could go a step further and use sockets and processes rather than queues and threads; this will let you spread across multiple machines if there is a performance benefit in doing this.
Then your paint and other GUI methods need to be split in two: one half to queue the work, and the other half to draw or use the results as they come out of the pipeline.
This may not be practical if the app presumes that data is global. But if it is well contained in classes, as your description suggests it may be, then this could be the simplest way to get it parallelised.
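A minimal sketch of one node in such a data-flow graph, built around a thread-safe blocking queue (BlockingQueue and make_node are made-up names, and a real pipeline also needs a way to signal shutdown). The existing algorithm stays unchanged and is simply wrapped in a loop that pulls work from its input queue and pushes results downstream:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

template <class T>
class BlockingQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(mutex_); items_.push(std::move(item)); }
        ready_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};

// One graph node: pull an item, run the unchanged algorithm, push the result on.
template <class In, class Out, class Fn>
std::thread make_node(BlockingQueue<In>& in, BlockingQueue<Out>& out, Fn algorithm) {
    return std::thread([&in, &out, algorithm] {
        for (;;) out.push(algorithm(in.pop()));   // real code would also handle a stop signal
    });
}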
1. Don't attempt to multithread everything in the old app. Multithreading for the sake of saying it's multithreaded is a waste of time and money. You're building an app that does something, not a monument to yourself.
2. Profile and study your execution flows to figure out where the app spends most of its time. A profiler is a great tool for this, but so is just stepping through the code in the debugger. You find the most interesting things in random walks.
3. Decouple the UI from long-running computations. Use cross-thread communications techniques to send updates to the UI from the computation thread.
4. As a side-effect of #3: think carefully about reentrancy: now that the compute is running in the background and the user can smurf around in the UI, what things in the UI should be disabled to prevent conflicts with the background operation? Allowing the user to delete a dataset while a computation is running on that data is probably a bad idea. (Mitigation: the computation makes a local snapshot of the data.) Does it make sense for the user to spool up multiple compute operations concurrently? If handled well, this could be a new feature and help rationalize the app rework effort. If ignored, it will be a disaster.
5. Identify specific operations that are candidates to be shoved into a background thread. The ideal candidate is usually a single function or class that does a lot of work (requires a "lot of time" to complete - more than a few seconds) with well-defined inputs and outputs, that makes use of no global resources, and does not touch the UI directly. Evaluate and prioritize candidates based on how much work would be required to retrofit to this ideal.
6. In terms of project management, take things one step at a time. If you have multiple operations that are strong candidates to be moved to a background thread, and they have no interaction with each other, these might be implemented in parallel by multiple developers. However, it would be a good exercise to have everybody participate in one conversion first so that everyone understands what to look for and to establish your patterns for UI interaction, etc. Hold an extended whiteboard meeting to discuss the design and process of extracting the one function into a background thread. Go implement that (together or dole out pieces to individuals), then reconvene to put it all together and discuss discoveries and pain points.
Multithreading is a headache and requires more careful thought than straight up coding, but splitting the app into multiple processes creates far more headaches, IMO. Threading support and available primitives are good in Windows, perhaps better than some other platforms. Use them.
In general, don't do any more than what is needed. It's easy to severely over implement and over complicate an issue by throwing more patterns and standard libraries at it.
If nobody on your team has done multithreading work before, budget time to make an expert or funds to hire one as a consultant.
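As a small Win32-flavoured sketch of point 3 above: the computation thread never touches the UI, it just posts a message to the UI thread when it has something to show (WM_APP_RESULT and ComputationResult are made-up names; PostMessage is safe to call from any thread).

#include <windows.h>

const UINT WM_APP_RESULT = WM_APP + 1;

struct ComputationResult { /* whatever the compute thread produced */ };

// Running on the background thread:
void worker(HWND ui_window)
{
    ComputationResult* result = new ComputationResult();   // filled in during the long computation
    ::PostMessage(ui_window, WM_APP_RESULT, 0, reinterpret_cast<LPARAM>(result));   // hand it off
}

// In the window procedure, on the UI thread:
//   case WM_APP_RESULT: {
//       std::unique_ptr<ComputationResult> r(reinterpret_cast<ComputationResult*>(lParam));
//       update_views(*r);   // safe: we are on the UI thread now
//       break;
//   }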
The main thing you have to do is to disconnect your UI from your data set. I'd suggest that the way to do that is to put a layer in between.
You will need to design a data structure of data cooked-for-display. This will most likely contain copies of some of your back-end data, but "cooked" to be easy to draw from. The key idea here is that this is quick and easy to paint from. You may even have this data structure contain calculated screen positions of bits of data so that it's quick to draw from.
Whenever you get a WM_PAINT message you should get the most recent complete version of this structure and draw from it. If you do this properly, you should be able to handle multiple WM_PAINT messages per second because the paint code never refers to your back-end data at all. It's just spinning through the cooked structure. The idea here is that it's better to paint stale data quickly than to hang your UI.
Meanwhile...
You should have 2 complete copies of this cooked-for-display structure. One is what the WM_PAINT message looks at. (call it cfd_A) The other is what you hand to your CookDataForDisplay() function. (call it cfd_B). Your CookDataForDisplay() function runs in a separate thread, and works on building/updating cfd_B in the background. This function can take as long as it wants because it isn't interacting with the display in any way. Once the call returns cfd_B will be the most up-to-date version of the structure.
Now swap cfd_A and cfd_B and InvalidateRect on your application window.
A simplistic way to do this is to have your cooked-for-display structure be a bitmap, and that might be a good way to go to get the ball rolling, but I'm sure with a bit of thought you can do a much better job with a more sophisticated structure.
So, referring back to your example.
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This is now 2 threads, the paint method refers to cfd_A and runs on the UI thread. Meanwhile cfd_B is being built by a background thread using GetData calls.
The quick-and-dirty way to do this is
Take your current WM_PAINT code, stick it into a function called PaintIntoBitmap().
Create a bitmap and a Memory DC, this is cfd_B.
Create a thread and pass it cfd_B and have it call PaintIntoBitmap()
When this thread completes, swap cfd_B and cfd_A
Now your new WM_PAINT method just takes the pre-rendered bitmap in cfd_A and draws it to the screen. Your UI is now disconnected from your back-end GetData() function.
Now the real work begins, because the quick-and-dirty way doesn't handle window resizing very well. You can go from there to refine what your cfd_A and cfd_B structures are a little at a time until you reach a point where you are satisfied with the result.
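A compact sketch of the two-buffer scheme described above (CookedFrame stands in for the cooked-for-display structure or the bitmap, and WM_APP_COOKED is a made-up message id). Doing the swap on the UI thread, in response to a "cooking finished" message, means no lock is needed around the paint path:

#include <utility>
#include <windows.h>

struct CookedFrame { /* pre-positioned data, or an HBITMAP in the quick-and-dirty version */ };

const UINT WM_APP_COOKED = WM_APP + 1;   // made-up message id
CookedFrame cfd_A;                       // read only by WM_PAINT, on the UI thread
CookedFrame cfd_B;                       // written only by the cooking thread

// Background thread: may take minutes, never touches the window.
void cook(HWND window)
{
    // ... CookDataForDisplay(cfd_B) ...          // the existing GetData() calls live in here
    ::PostMessage(window, WM_APP_COOKED, 0, 0);   // tell the UI thread the frame is ready
}

// UI thread, in the handler for WM_APP_COOKED (the cooking thread is idle at this point):
void on_cooked(HWND window)
{
    std::swap(cfd_A, cfd_B);                      // cheap swap, no lock needed
    ::InvalidateRect(window, nullptr, FALSE);     // repaint quickly from the new cfd_A
    // ... then kick off the next cooking pass into the (now stale) cfd_B if needed ...
}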
You might just start out by breaking the UI and the work task into separate threads.
In your paint method, instead of calling getData() directly, put the request into a thread-safe queue. getData() runs in another thread that reads its requests from the queue. When the getData thread is done, it signals the main thread to redraw the visualisation area with its result data, using thread synchronization to pass the data.
While all this is going on you of course have a progress bar saying reticulating splines so the user knows something is going on.
This would keep your UI snappy without the significant pain of multithreading your work routines (which can be akin to a total rewrite)
It sounds like you have several different issues that parallelism can address, but in different ways.
Performance increases through utilizing multicore CPU architectures
You're not taking advantage of the multi-core CPU architectures that are becoming so common. Parallelization allows you to divide work amongst multiple cores. You can write that code through standard C++ divide-and-conquer techniques using a "functional" style of programming where you pass work to separate threads at the divide stage. Google's MapReduce pattern is an example of that technique. Intel has the new Cilk library to give you C++ compiler support for such techniques.
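A tiny divide-and-conquer sketch in the style described above, using std::async from <future> (sum_range is a made-up stand-in for real work; production code would cap the recursion depth or use a thread pool rather than spawning a thread per split):

#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

double sum_range(const std::vector<double>& v, std::size_t lo, std::size_t hi)
{
    if (hi - lo < 100000)                        // small enough: just do it serially
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,   // "divide": one half goes to another core
                           sum_range, std::cref(v), lo, mid);
    double right = sum_range(v, mid, hi);        // this thread keeps the other half
    return left.get() + right;                   // "conquer": join the two results
}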
Greater GUI responsiveness through asynchronous document-view
By separating the GUI operations from the document operations and placing them on different threads, you can increase the apparent responsiveness of your application. The standard Model-View-Controller or Model-View-Presenter design patterns are a good place to start. You need to parallelize them by having the model inform the view of updates rather than have the view provide the thread on which the document computes itself. The View would call a method on the model asking it to compute a particular view of the data, and the model would inform the presenter/controller as information is changed or new data becomes available, which would get passed to the view to update itself.
Opportunistic caching and pre-calculation
It sounds like your application has a fixed base of data, but many possible compute-intensive views on the data. If you did a statistical analysis on which views were most commonly requested in what situations, you could create background worker threads to pre-calculate the likely-requested values. It may be useful to put these operations on low-priority threads so that they don't interfere with the main application processing.
Obviously, you'll need to use mutexes (or critical sections), events, and probably semaphores to implement this. You may find some of the new synchronization objects in Vista useful, like the slim reader-writer lock, condition variables, or the new thread pool API. See Joe Duffy's book on concurrency for how to use these basic techniques.
There is something that no-one has talked about yet, but which is quite interesting.
It's called futures. A future is the promise of a result... let's see with an example.
std::future<int> leftVal = std::async(std::launch::async, computeLeftValue, treeNode); // [1]
int rightVal = computeRightValue(treeNode);                                             // [2]
int result = leftVal.get() + rightVal;   // [3] get() blocks here if [1] hasn't finished yet
It's pretty simple:
You spin off a thread that starts computing leftVal, taking it from a pool for example to avoid the initialization problem.
While leftVal is being computed, you compute rightVal.
You add the two (via get()); this may block and wait for the computation to end if leftVal is not ready yet.
The great benefit here is that it's straightforward: each time you have one computation followed by another that is independent and you then join the result, you can use this pattern.
See Herb Sutter's article on futures; they will be available in the upcoming C++0x, but there are already libraries available today, even if the syntax is perhaps not as pretty as I would have you believe ;)
If it was my development dollars I was spending, I would start with the big picture:
What do I hope to accomplish, how much will I spend to accomplish it, and how will I be further ahead? (If the answer is: my app will run 10% better on quad-core PCs, I could have achieved the same result by spending $1000 more per customer PC, and I could have spent $100,000 less this year on R&D -- then I would skip the whole effort.)
Why am I doing multi-threaded instead of massively parallel distributed? Do I really think threads are better than processes? Multi-core systems also run distributed apps pretty well, and there are some advantages to message-passing, process-based systems that go beyond the benefits (and the costs!) of threading. Should I consider a process-based approach? Should I consider a background component running entirely as a service, and a foreground GUI? Since my product is node-locked and licensed, I think a service would suit me (the vendor) quite well. Also, separating stuff into two processes (background service and foreground GUI) just might force the kind of rewrite and rearchitecting that I might not be forced to do if I were merely to add threading into my mix.
This is just to get you thinking: What if you were to rewrite it as a service (background app) and a GUI, because that would actually be easier than adding threading, without also adding crashes, deadlocks, and race conditions?
Consider the idea that, for your needs, perhaps threading is evil. Develop your religion and stick with it, unless you have a really good reason to go the other way. For many years I religiously avoided threading, because one thread per process was good enough for me.
I don't see any really solid reasons in your list why you need threading, except ones that could be more inexpensively solved by more expensive target computer hardware. If your app is "too slow" adding in threads might not even speed it up.
I use threads for background serial communications, but I would not consider threading merely for computationally heavy applications, unless my algorithms were so inherently parallel as to make the benefits clear, and the drawbacks minimal.
I wonder if the "design" problems this C++ Builder app has are like my Delphi "RAD spaghetti" application disease. I have found that a wholesale refactor/rewrite (over a year per major app that I have done this to) was the minimum amount of time for me to get a handle on the application's "accidental complexity". And that was without throwing in a "threads where possible" idea. I tend to write my apps with threads for serial communication and network socket handling only, and maybe the odd worker-thread queue.
If there is a place in your app where you can add ONE thread, to test the waters, I would look for the main "work queue". I would create an experimental version-control branch and learn how my code works by breaking it in that branch. Add that thread. And see where you spend your first day of debugging. Then I might just abandon that branch and go back to my trunk until the pain in my temporal lobe subsides.
Warren
Here's what I would do...
I would start by profiling your app and seeing:
1) what is slow and what the hot paths are
2) which calls are reentrant or deeply nested
you can use 1) to determine where the opportunity is for speedups and where to start looking for parallelization.
you can use 2) to find out where the shared state is likely to be and get a deeper sense of how much things are tangled up.
I would use a good system profiler and a good sampling profiler (like the Windows Performance Toolkit or the concurrency views of the profiler in Visual Studio 2010 Beta 2 - both are 'free' right now).
Then I would figure out what the goal is and how to separate things gradually into a cleaner design that is more responsive (moving work off the UI thread) and more performant (parallelizing the computationally intensive portions). I would focus on the highest-priority and most noticeable items first.
If you don't have a good refactoring tool like Visual Assist, invest in one - it's worth it. If you're not familiar with Michael Feathers' or Kent Beck's refactoring books, consider borrowing them. I would ensure my refactorings are well covered by unit tests.
Since you can't move to VS (where I would recommend the products I work on, the Asynchronous Agents Library and Parallel Patterns Library), you can also use TBB or OpenMP.
In Boost, I would look carefully at Boost.Thread, the Asio library and the Signals library.
I would ask for help / guidance / a listening ear when I got stuck.
-Rick
You can also look at this article from Herb Sutter You have a mass of existing code and want to add concurrency. Where do you start?
Well, I think you're expecting a lot based on your comments here. You're not going to go from minutes to milliseconds by multithreading. The most you can hope for is the current amount of time divided by the number of cores. That being said, you're in a bit of luck with C++. I've written high-performance multiprocessor scientific apps, and what you want to look for is the most embarrassingly parallel loop you can find. In my scientific code, the heaviest piece calculates somewhere between 100 and 1000 data points. However, all of the data points can be calculated independently of the others. You can then split the loop using OpenMP. This is the easiest and most efficient way to go. If your compiler doesn't support OpenMP, then you will have a very hard time porting existing code. With OpenMP (if you're lucky), you may only have to add a couple of #pragmas to get 4-8x the performance. Here's an example: StochFit.
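A minimal OpenMP sketch of that kind of embarrassingly parallel loop (compute_point and the vectors are placeholders for the real per-data-point work). Compile with the compiler's OpenMP switch (e.g. -fopenmp); without it the pragma is simply ignored and the loop runs serially, which is a convenient migration property for legacy code:

#include <vector>

double compute_point(double x)
{
    return x * x;   // stand-in for the real, expensive, independent calculation
}

void compute_all(const std::vector<double>& in, std::vector<double>& out)
{
    #pragma omp parallel for        // safe because every iteration is independent
    for (long i = 0; i < static_cast<long>(in.size()); ++i)
        out[i] = compute_point(in[i]);
}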
I hope this will help you understand and convert your monolithic single-threaded app to a multithreaded one easily. Sorry, it is for another programming language, but nevertheless the principles explained are the same all over.
http://www.freevbcode.com/ShowCode.Asp?ID=1287
Hope this helps.
The first thing you must do is separate your GUI from your data; the second is to create a multithreaded class.
STEP 1 - Responsive GUI
We can assume that the image you are producing is contained in the canvas of a TImage. You can put a simple TTimer in your form and write code like this:
if (CurrentData.LastUpdate > CurrentUpdate)
{
    Image1->Canvas->Draw(0, 0, CurrentData.Bitmap);
    CurrentUpdate = Now();
}
OK! I know! It's a little bit dirty, but it's fast and simple. The point is that:
You need an object that is created in the main thread
The object is copied into the form you need, only when needed and in a safe way (OK, better protection for the Bitmap may be needed, but for simplicity...)
The object CurrentData is your actual project, single-threaded, that produces an image
Now you have a fast and responsive GUI. If your algorithm is slow, the refresh is slow, but your user will never think that your program is frozen.
STEP 2 - Multithread
I suggest you implement a class like the following:
SimpleThread.h
typedef void (__closure *TThreadFunction)(void* Data);

class TSimpleThread : public TThread
{
public:
    TSimpleThread(TThreadFunction _Action, void* _Data = NULL, bool RunNow = true);
    void AbortThread();
    __property Terminated;
protected:
    TThreadFunction ThreadFunction;
    void* Data;
private:
    virtual void __fastcall Execute() { ThreadFunction(Data); }
};
SimpleThread.c
TSimpleThread::TSimpleThread(TThreadFunction _Action, void* _Data, bool RunNow)
    : TThread(true),   // initialize suspended
      ThreadFunction(_Action), Data(_Data)
{
    FreeOnTerminate = false;
    if (RunNow) Resume();
}

void TSimpleThread::AbortThread()
{
    Suspend();   // Can't kill a running thread
    Free();      // Kills thread
}
Let's explain. Now, in your single-threaded class you can create an object like this:
TSimpleThread *ST;
ST=new TSimpleThread( RefreshFunction,NULL,true);
ST->Resume();
Let's explain better: now, in your own monolithic class, you have created a thread. More: you have moved a function (e.g. RefreshFunction) into a separate thread. The scope of your function is the same, the class is the same, but the execution is separate.
My number one suggestion, although it's very late (sorry for reviving an old thread, it's interesting!), is to seek out homogeneous transform loops where each iteration of the loop mutates a piece of data completely independent of the other iterations.
Instead of thinking about how to turn this old codebase into an asynchronous one running all kinds of operations in parallel (which could be asking for all kinds of trouble, from worse-than-single-threaded performance due to poor locking patterns to, exponentially worse, race conditions and deadlocks from doing this in hindsight to code you can't fully comprehend), stick to the sequential mindset for the overall application design for now, but identify or extract simple, homogeneous transform loops. Don't start from intrusive, broad, design-level multithreading and then try to drill into details. Work from non-intrusive multithreading of fine implementation details and specific hotspots first.
What I mean by homogeneous loops is basically one that transforms data in a very straightforward way, like:
for each pixel in image:
make it brighter
That is very simple to reason about and you can safely parallelize this loop without any problems whatsoever using OMP or TBB or whatever and without getting tangled up in thread synchronization. It only takes one glance at this code to fully comprehend its side effects.
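The pixel-brightening pseudocode above, written as a homogeneous parallel loop with TBB (an OpenMP parallel-for would look much the same; Image and brighten are placeholders for the real types):

#include <cstddef>
#include <cstdint>
#include <vector>
#include <tbb/parallel_for.h>

struct Image { std::vector<std::uint8_t> pixels; };

void brighten(Image& img, int amount)
{
    tbb::parallel_for(std::size_t(0), img.pixels.size(), [&](std::size_t i) {
        int v = img.pixels[i] + amount;      // each iteration touches only its own pixel,
        img.pixels[i] = v > 255 ? 255 : v;   // so there is no shared state to synchronize
    });
}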
Try to find as many hotspots as you can which fit this type of simple homogeneous transform loop and if you have complex loops which update many different types of data with complex control flows that trigger complex side effects, then seek to refactor towards these homogeneous loops. Often a complex loop which causes 3 disparate side effects to 3 different types of data can be turned into 3 simple homogeneous loops which each trigger just one kind of side effect to one type of data with a simpler control flow. Doing multiple loops instead of one might seem a tad wasteful, but the loops become simpler, the homogeneity will often lead to more cache-friendly sequential memory access patterns vs. sporadic random-access patterns, and you then tend to find much more opportunities to safely parallelize (as well as vectorize) the code in a straightforward way.
First you have to thoroughly understand the side effects of any code you attempt to parallelize (and I mean thoroughly!!!), so seeking out these homogeneous loops gives you isolated areas of the codebase you can easily reason about in terms of the side effects to the point where you can confidently and safely parallelize those hotspots. It'll also improve the maintainability of the code by making it very easy to reason about the state changes going on in that particular piece of code. Save the dream of the uber multithreaded application running everything in parallel for later. For now, focus on identifying/extracting performance-critical, homogeneous loops with simple control flows and simple side effects. Those are your priority targets for parallelization with simple parallelized loops.
Now admittedly I somewhat dodged your questions, but most of them needn't apply if you do what I suggest, at least until you've worked your way up to the point where you're thinking more about multithreading designs as opposed to simply parallelizing implementation details. And you might not even need to go that far to have a very competitive product in terms of performance. If you have beefy work to do in a single loop, you can devote the hardware resources to making that loop go faster instead of making many operations run simultaneously. If you have to resort to more async methods, as when your hotspots are more I/O-bound, seek an async/wait approach where you fire off an async task, do some other things in the meantime, and then wait on the async task(s) to complete. Even if that's not absolutely necessary, the idea is to section off isolated areas of your codebase where you can, with 100% confidence (or at least 99.9999999%), say that the multithreaded code is correct.
You don't ever want to gamble with race conditions. There's nothing more demoralizing than finding some obscure race condition that only occurs once in a full moon on some random user's machine while your entire QA team is unable to reproduce it, only to, 3 months later, run into it yourself except during that one time you ran a release build without debugging info available while you then toss and turn in your sleep knowing your codebase can flake out at any given moment but in ways that no one will ever be able to consistently reproduce. So take it easy with multithreading legacy codebases, at least for now, and stick to multithreading isolated but critical sections of the codebase where the side effects are dead simple to reason about. And test the crap out of it -- ideally apply a TDD approach where you write a test for the code you're going to multithread to ensure it gives the correct output after you finish... though race conditions are the types of things that easily fly under the radar of unit and integration testing, so again you absolutely need to be able to comprehend the entirety of the side effects that go on in a given piece of code before you attempt to multithread it. The best way to do that is to make the side effects as easy to comprehend as possible with the simplest control flows causing just one type of side effect for an entire loop.
It is hard to give you proper guidelines. But...
The easiest way out, in my opinion, is to convert your application to an ActiveX EXE: since COM has support for threading etc. built right into it, your program will automatically become a multithreaded application. Of course you will have to make quite a few changes to your code, but this is the shortest and safest way to go.
I am not sure, but the RichClient Toolset lib may do the trick for you. On the site the author has written:
It also offers registration free Loading/Instancing-capabilities for ActiveX-Dlls and a new, easy to use Threading-approach, which works with Named-Pipes under the hood and works therefore also cross-process.
Please check it out. Who knows it may be the right solution for your requirements.
As for project management, I think you can continue using what is provided in your IDE of choice, integrating it with SVN through plugins.
I forgot to mention that we have completed an application for the share market that automatically trades (buys and sells based on lows and highs) those scripts that are in the user's portfolio, based on an algorithm that we have developed.
While developing this software we faced the same kind of problem you have illustrated here. To solve it, we converted our application into an ActiveX EXE and converted all those parts that need to execute in parallel into ActiveX DLLs. We did not use any third-party libs for this!
HTH