As far as I can tell it is possible for a queue family to support presenting to the screen but not support graphics. Say I have a queue family that supports both graphics and presenting, and another queue family that only supports presenting. Should I use the first queue family for both processes or should I delegate the first to graphics and the latter to presenting? Or would there be no noticeable difference between these two approaches?
No such HW exists, so best approach is no approach. If you want to be really nice, you can handle the separate present queue family case with expending minimal brain-power on it. Though you have no way to test it on real HW that needs it. So I would say abort with a nice error message would be as adequate, until you can get your hands on actual HW that does it.
I think there is bit of a design error here on Khronoses part. Separate present queue does look like a more explicit way. But then, present op itself is not a queue operation, so the driver can use whatever it wants anyway. Also separate present requires extra semaphore, and Queue Family Ownership Transfer (or VK_SHARING_MODE_CONCURRENT resource). The history went the way that no driver is so extremist to report a separate present queue. So I made KhronosGroup/Vulkan-Docs#1234.
For rough notion of what happens at vkQueuePresentKHR, you can inspect Mesa code: https://github.com/mesa3d/mesa/blob/bf3c9d27706dc2362b81aad12eec1f7e48e53ddd/src/vulkan/wsi/wsi_common.c#L1120-L1232. There's probably no monkey business there using the queue you provided except waiting on your semaphore, or at most making a blit of the image. If you (voluntarily) want to use separate present queue, you need to measure and whitelist it only for drivers (and probably other influences) it actually helps (if any such exist, and if it is even worth your time).
First off, I assume you mean "beneficial" in terms of performance, and whenever it comes to questions like that you can never have a definite answer except by profiling the different strategies. If your application needs to run on a variety of hardware, you can have it profile the different strategies the first time it's run and save the results locally for repeated use, provide the user with a benchmarking utility they can run if they see poor performance, etc. etc. Trying to reason about it in the abstract can only get you so far.
That aside, I think the easiest way to think about questions like this is to remember that when it comes to graphics programming, you want to both maximize the amount of work that can be done in parallel and minimize the amount of work overall. If you want to present an image from a non-graphics queue and you need to perform graphics operations on it, you'll need to transfer ownership of it to the non-graphics queue when graphics operations on it have finished. Presumably, that will take a bit of time in the driver if nothing else, so it's only worth doing if it will save you time elsewhere somehow.
A common situation where this would probably save you time is if the device supports async compute and also lets you present from the compute queue. For example, a 3D game might use the compute queue for things like lighting, blur, UI, etc. that make the most sense to do after geometry processing is finished. In this case, the game engine would transfer ownership of the image to be presented to the compute queue first anyway, or even have the compute queue own the swapchain image from beginning to end, so presenting from the compute queue once its work for the frame is done would allow the graphics queue to stay busy with the next frame. AMD and NVIDIA recommend this sort of approach where it's possible.
If your application wouldn't otherwise use the compute queue, though, I'm not sure how much sense it makes or not to present on it when you have the option. The advantage of that approach is that once graphics operations for a given frame are over, you can have the graphics queue immediately release ownership of the image for it and acquire the next one without having to pause to present it, which would allow presentation to be done in parallel with rendering the next frame. On the other hand, you'll have to transfer ownership of it to the compute queue first and set up presentation there, which would add some complexity and overhead. I'm not sure which approach would be faster and I wouldn't be surprised if it varies with the application and environment. Of course, I'm not sure how many realtime Vulkan applications of any significant complexity fit this scenario today, and I'd guess it's not very many as "per-pixel" things tend to be easier and faster to do with a compute shader.
I develop a Hand-tracking application, with C++ and OpenGL. (and QT, Eigen, OpenCV)
OpenGL is used in order to render a 3D model (for every iteration of the tracking loop).
The application runs in just 1 thread.
I'm interested in doing some very time-consuming experiments, so I was wondering if it is possible to parallelize things, in the sense of starting many instances of the same executable, and running them with different parameters.
Just by trying to do this, it seems that it works, but I'm not sure if different instances interfere with each other on the GPU. To be more descriptive, I wonder about the following:
if I do some experiments by running only one instance at each time, and then repeat the same experiments by running many instances concurrently, are the results going to be the same numerically?
Of course I'll try to verify it through experiments, I was wondering though if anybody can pinpoint me to a suitable read (I didn't find something truly relevant).
Any ideas on this matter?
Answering the first comment (#KillianDS)
The details of the experiment are really mathematical and would cause 'noise' in the topic.
The idea is that you have a tracking-algorithm that tries to find correspondences between the previous and the current frame. By using these correspondences, the algorithm takes the 3D model from the pose of the previous frame (already known), and it transforms it in such a way, so that it fits the current frame. There are some (mathematical) parameters affecting this, and the experiments are about having a lot of testing frames, and running the algorithm on these with many different parameters, so that you can find the optimal parameters (value or range of values)
During the experiments, you use OpenGL to project the 3D model, so that it fits the current frame-image. What you see is rendered as usual, the actual job though is done with the use of an offscreen buffer in GPU.
Up to now I run the experiments using multiprocessing (many instances at the same time), but when I run just one instance at the time, I couldn't reproduce the exact same number, because of a bug that I just found. (same test is now ongoing - but very time-consuming)
However I was wondering if you can really trust GPU when you run many instances at the same time, or things in GPU-memory can be messed up
Answering the second comment (#Lajos Arpad)
To redefine in short the problem, I don't want to share things, but be sure that different instances (in case of multiprocessing, that you mention) don't affect each other (that is no sharing at all)
There are two possible parallelizations:
Multithreading
Multiprocessing
You can use one of them or both.
I don't really understand your task, as you were not too specific, but in general you can share values in the following way:
Use a file which will be used by your threads/processes.
Use a service, like a database where your threads/processes will manipulate data.
Use variables which are declaratively or by definition shared (like static members for C++ classes).
Using communication between your processes/threads by a listening port and sockets.
In this moment nothing else comes to my mind. Everything else is not shared.
For more information, read this and this.
My company's main product is a large monolithic C++ application, used for scientific data processing and visualisation. Its codebase goes back maybe 12 or 13 years, and while we have put work into upgrading and maintaining it (use of STL and Boost - when I joined most containers were custom, for example - fully upgraded to Unicode and the 2010 VCL, etc) there's one remaining, very significant problem: it's fully singlethreaded. Given it's a data processing and visualisation program, this is becoming more and more of a handicap.
I'm both a developer and the project manager for the next release where we want to tackle this, and this is going to be a difficult job in both areas. I'm seeking concrete, practical, and architectural advice on how to tackle the problem.
The program's data flow might go something like this:
a window needs to draw data
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This will go and calculate or read from file or whatever else is required (often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations)
Ie, the paint message handler will block while processing is done, and if the data hasn't already been calculated and cached, this can be a long time. Sometimes this is minutes. Similar paths occur for other parts of the program that perform lengthy processing operations - the program is unresponsive for the entire time, sometimes hours.
I'm seeking advice on how to approach changing this. Practical ideas. Perhaps things like:
design patterns for asynchronously requesting data?
storing large collections of objects such that threads can read and write safely?
handling invalidation of data sets while something is trying to read it?
are there patterns and techniques for this sort of problem?
what should I be asking that I haven't thought of?
I haven't done any multithreaded programming since my Uni days a few years ago, and I think the rest of my team is in a similar position. What I knew was academic, not practical, and is nowhere near enough to have confidence approaching this.
The ultimate objective is to have a fully responsive program, where all calculations and data generation is done in other threads and the UI is always responsive. We might not get there in a single development cycle :)
Edit: I thought I should add a couple more details about the app:
It's a 32-bit desktop application for Windows. Each copy is licensed. We plan to keep it a desktop, locally-running app
We use Embarcadero (formerly Borland) C++ Builder 2010 for development. This affects the parallel libraries we can use, since most seem (?) to be written for GCC or MSVC only. Luckily they're actively developing it and its C++ standards support is much better than it used to be. The compiler supports these Boost components.
Its architecture is not as clean as it should be and components are often too tightly coupled. This is another problem :)
Edit #2: Thanks for the replies so far!
I'm surprised so many people have recommended a multi-process architecture (it's the top-voted answer at the moment), not multithreading. My impression is that's a very Unix-ish program structure, and I don't know anything about how it's designed or works. Are there good resources available about it, on Windows? Is it really that common on Windows?
In terms of concrete approaches to some of the multithreading suggestions, are there design patterns for asynchronous request and consuming of data, or threadaware or asynchronous MVP systems, or how to design a task-oriented system, or articles and books and post-release deconstructions illustrating things that work and things that don't work? We can develop all this architecture ourselves, of course, but it's good to work from what others have done before and know what mistakes and pitfalls to avoid.
One aspect that isn't touched on in any answers is project managing this. My impression is estimating how long this will take and keeping good control of the project when doing something as uncertain as this may be hard. That's one reason I'm after recipes or practical coding advice, I guess, to guide and restrict coding direction as much as possible.
I haven't yet marked an answer for this question - this is not because of the quality of the answers, which is great (and thankyou) but simply that because of the scope of this I'm hoping for more answers or discussion. Thankyou to those who have already replied!
You have a big challenge ahead of you. I had a similar challenge ahead of me -- 15 year old monolithic single threaded code base, not taking advantage of multicore, etc. We expended a great deal of effort in trying to find a design and solution that was workable and would work.
Bad news first. It will be somewhere between impractical and impossible to make your single-threaded app multithreaded. A single threaded app relies on it's singlethreaded-ness is ways both subtle and gross. One example is if the computation portion requires input from the GUI portion. The GUI must run in the main thread. If you try to get this data directly from the computation engine, you will likely run in to deadlock and race conditions that will require major redesigns to fix. Many of these reliances will not crop up during the design phase, or even during the development phase, but only after a release build is put in a harsh environment.
More bad news. Programming multithreaded applications is exceptionally hard. It might seem fairly straightforward to just lock stuff and do what you have to do, but it is not. First of all if you lock everything in sight you end up serializing your application, negating every benefit of mutithreading in the first place while still adding in all the complexity. Even if you get beyond this, writing a defect-free MP application is hard enough, but writing a highly-performant MP application is that much more difficult. You could learn on the job in a kind of baptismal by fire. But if you are doing this with production code, especially legacy production code, you put your buisness at risk.
Now the good news. You do have options that don't involve refactoring your whole app and will give you most of what you seek. One option in particular is easy to implement (in relative terms), and much less prone to defects than making your app fully MP.
You could instantiate multiple copies of your application. Make one of them visible, and all the others invisible. Use the visible application as the presentation layer, but don't do the computational work there. Instead, send messages (perhaps via sockets) to the invisible copies of your application which do the work and send the results back to the presentation layer.
This might seem like a hack. And maybe it is. But it will get you what you need without putting the stability and performance of your system at such great risk. Plus there are hidden benefits. One is that the invisible engine copies of your app will have access to their own virtual memory space, making it easier to leverage all the resources of the system. It also scales nicely. If you are running on a 2-core box, you could spin off 2 copies of your engine. 32 cores? 32 copies. You get the idea.
So, there's a hint in your description of the algorithm as to how to proceed:
often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations
I'd look into making that data-flow graph be literally the structure that does the work. The links in the graph can be thread-safe queues, the algorithms at each node can stay pretty much unchanged, except wrapped in a thread that picks up work items from a queue and deposits results on one. You could go a step further and use sockets and processes rather than queues and threads; this will let you spread across multiple machines if there is a performance benefit in doing this.
Then your paint and other GUI methods need split in two: one half to queue the work, and the other half to draw or use the results as they come out of the pipeline.
This may not be practical if the app presumes that data is global. But if it is well contained in classes, as your description suggests it may be, then this could be the simplest way to get it parallelised.
Don't attempt to multithread everything in the old app. Multithreading for the sake of saying it's multithreaded is a waste of time and money. You're building an app that does something, not a monument to yourself.
Profile and study your execution flows to figure out where the app spends most of its time. A profiler is a great tool for this, but so is just stepping through the code in the debugger. You find the most interesting things in random walks.
Decouple the UI from long-running computations. Use cross-thread communications techniques to send updates to the UI from the computation thread.
As a side-effect of #3: think carefully about reentrancy: now that the compute is running in the background and the user can smurf around in the UI, what things in the UI should be disabled to prevent conflicts with the background operation? Allowing the user to delete a dataset while a computation is running on that data is probably a bad idea. (Mitigation: computation makes a local snapshot of the data) Does it make sense for the user to spool up multiple compute operations concurrently? If handled well, this could be a new feature and help rationalize the app rework effort. If ignored, it will be a disaster.
Identify specific operations that are candidates to be shoved into a background thread. The ideal candidate is usually a single function or class that does a lot of work (requires a "lot of time" to complete - more than a few seconds) with well defined inputs and outputs, that makes use of no global resources, and does not touch the UI directly. Evaluate and prioritize candidates based on how much work would be required to retrofit to this ideal.
In terms of project management, take things one step at a time. If you have multiple operations that are strong candidates to be moved to a background thread, and they have no interaction with each other, these might be implemented in parallel by multiple developers. However, it would be a good exercise to have everybody participate in one conversion first so that everyone understands what to look for and to establish your patterns for UI interaction, etc. Hold an extended whiteboard meeting to discuss the design and process of extracting the one function into a background thread. Go implement that (together or dole out pieces to individuals), then reconvene to put it all together and discuss discoveries and pain points.
Multithreading is a headache and requires more careful thought than straight up coding, but splitting the app into multiple processes creates far more headaches, IMO. Threading support and available primitives are good in Windows, perhaps better than some other platforms. Use them.
In general, don't do any more than what is needed. It's easy to severely over implement and over complicate an issue by throwing more patterns and standard libraries at it.
If nobody on your team has done multithreading work before, budget time to make an expert or funds to hire one as a consultant.
The main thing you have to do is to disconnect your UI from your data set. I'd suggest that the way to do that is to put a layer in between.
You will need to design a data structure of data cooked-for-display. This will most likely contain copies of some of your back-end data, but "cooked" to be easy to draw from. The key idea here is that this is quick and easy to paint from. You may even have this data structure contain calculated screen positions of bits of data so that it's quick to draw from.
Whenever you get a WM_PAINT message you should get the most recent complete version of this structure and draw from it. If you do this properly, you should be able to handle multiple WM_PAINT messages per second because the paint code never refers to your back end data at all. It's just spinning through the cooked structure. The idea here is that its better to paint stale data quickly than to hang your UI.
Meanwhile...
You should have 2 complete copies of this cooked-for-display structure. One is what the WM_PAINT message looks at. (call it cfd_A) The other is what you hand to your CookDataForDisplay() function. (call it cfd_B). Your CookDataForDisplay() function runs in a separate thread, and works on building/updating cfd_B in the background. This function can take as long as it wants because it isn't interacting with the display in any way. Once the call returns cfd_B will be the most up-to-date version of the structure.
Now swap cfd_A and cfd_B and InvalidateRect on your application window.
A simplistic way to do this is to have your cooked-for-display structure be a bitmap, and that might be a good way to go to get the ball rolling, but I'm sure with a bit of thought you can do a much better job with a more sophisticated structure.
So, referring back to your example.
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This is now 2 threads, the paint method refers to cfd_A and runs on the UI thread. Meanwhile cfd_B is being built by a background thread using GetData calls.
The quick-and-dirty way to do this is
Take your current WM_PAINT code, stick it into a function called PaintIntoBitmap().
Create a bitmap and a Memory DC, this is cfd_B.
Create a thread and pass it cfd_B and have it call PaintIntoBitmap()
When this thread completes, swap cfd_B and cfd_A
Now your new WM_PAINT method just takes the pre-rendered bitmap in cfd_A and draws it to the screen. Your UI is now disconnnected from your backend GetData() function.
Now the real work begins, because the quick-and-dirty way doesn't handle window resizing very well. You can go from there to refine what your cfd_A and cfd_B structures are a little at a time until you reach a point where you are satisfied with the result.
You might just start out breaking the the UI and the work task into separate threads.
In your paint method instead of calling getData() directly, it puts the request in a thread-safe queue. getData() is run in another thread that reads its data from the queue. When the getData thread is done, it signals the main thread to redraw the visualisation area with its result data using thread syncronization to pass the data.
While all this is going on you of course have a progress bar saying reticulating splines so the user knows something is going on.
This would keep your UI snappy without the significant pain of multithreading your work routines (which can be akin to a total rewrite)
It sounds like you have several different issues that parallelism can address, but in different ways.
Performance increases through utilizing multicore CPU Architecutres
You're not taking advantage of the multi-core CPU architetures that are becoming so common. Parallelization allow you to divide work amongst multiple cores. You can write that code through standard C++ divide and conquer techniques using a "functional" style of programming where you pass work to separate threads at the divide stage. Google's MapReduce pattern is an example of that technique. Intel has the new CILK library to give you C++ compiler support for such techniques.
Greater GUI responsiveness through asynchronous document-view
By separating the GUI operations from the document operations and placing them on different threads, you can increase the apparent responsiveness of your application. The standard Model-View-Controller or Model-View-Presenter design patterns are a good place to start. You need to parallelize them by having the model inform the view of updates rather than have the view provide the thread on which the document computes itself. The View would call a method on the model asking it to compute a particular view of the data, and the model would inform the presenter/controller as information is changed or new data becomes available, which would get passed to the view to update itself.
Opportunistic caching and pre-calculation
It sounds like your application has a fixed base of data, but many possible compute-intensive views on the data. If you did a statistical analysis on which views were most commonly requested in what situations, you could create background worker threads to pre-calculate the likely-requested values. It may be useful to put these operations on low-priority threads so that they don't interfere with the main application processing.
Obviously, you'll need to use mutexes (or critical sections), events, and probably semaphores to implement this. You may find some of the new synchronization objects in Vista useful, like the slim reader-writer lock, condition variables, or the new thread pool API. See Joe Duffy's book on concurrency for how to use these basic techniques.
There is something that no-one has talked about yet, but which is quite interesting.
It's called futures. A future is the promise of a result... let's see with an example.
future<int> leftVal = computeLeftValue(treeNode); // [1]
int rightVal = computeRightValue(treeNode); // [2]
result = leftVal + rightVal; // [3]
It's pretty simple:
You spin off a thread that starts computing leftVal, taking it from a pool for example to avoid the initialization problem.
While leftVal is being computed, you compute rightVal.
You add the two, this may block if leftVal is not computed yet and wait for the computation to end.
The great benefit here is that it's straightforward: each time you have one computation followed by another that is independent and you then join the result, you can use this pattern.
See Herb Sutter's article on futures, they will be available in the upcoming C++0x but there are already libraries available today even if the syntax is perhaps not as pretty as I would make you believe ;)
If it was my development dollars I was spending, I would start with the big picture:
What do I hope to accomplish, and how much will I spend to accomplish this, and how will I be further ahead? (If the answer to this is, my app will run 10% better on quadcore PCs, and I could have achieved the same result by spending $1000 more per customer PC , and spending $100,000 less this year on R&D, then, I would skip the whole effort).
Why am I doing multi-threaded instead of massively parallel distributed? Do I really think threads are better than processes? Multi-core systems also run distributed apps pretty well. And there are some advantages to message-passing process based systems that go beyond the benefits (and the costs!) of threading. Should I consider a process-based approach? SHould I consider a background running entirely as a service, and a foreground GUI? Since my product is node-locked and licensed, I think services would suit me (vendor) quite well. Also, separating stuff into two processes (background service and foreground) just might force the kind of rewrite and rearchitecting to occur that I might not be forced to do, if I was to just add threading into my mix.
This is just to get you thinking: What if you were to rewrite it as a service (background app) and a GUI, because that would actually be easier than adding threading, without also adding crashes, deadlocks, and race conditions?
Consider the idea that for your needs, perhaps threading is evil. Develop your religion, and stick with that. Unless you have a real good reason to go the other way. For many years, I religiously avoided threading. Because one thread per process is good enough for me.
I don't see any really solid reasons in your list why you need threading, except ones that could be more inexpensively solved by more expensive target computer hardware. If your app is "too slow" adding in threads might not even speed it up.
I use threads for background serial communications, but I would not consider threading merely for computationally heavy applications, unless my algorithms were so inherently parallel as to make the benefits clear, and the drawbacks minimal.
I wonder if the "design" problems that this C++Builder app has are like my Delphi "RAD Spaghetti" application disease. I have found that a wholesale refactor/rewrite (over a year per major app that I have done this to), was a minimum amount of time for me to get a handle on application "accidental complexity". And that was without throwing a "threads where possible" idea. I tend to write my apps with threads for serial communication and network socket handling, only. And maybe the odd "worker-thread-queue".
If there is a place in your app you can add ONE thread, to test the waters, I would look for the main "work queue" and I would create an experimental version control branch, and I would learn about how my code works by breaking it in the experimental branch. Add that thread. And see where you spend your first day of debugging. Then I might just abandon that branch and go back to my trunk until the pain in my temporal lobe subsides.
Warren
Here's what I would do...
I would start by profiling your and seeing:
1) what is slow and what the hot paths are
2) which calls are reentrant or deeply nested
you can use 1) to determine where the opportunity is for speedups and where to start looking for parallelization.
you can use 2) to find out where the shared state is likely to be and get a deeper sense of how much things are tangled up.
I would use a good system profiler and a good sampling profiler (like the windows perforamnce toolkit or the concurrency views of the profiler in Visual Studio 2010 Beta2 - these are both 'free' right now).
Then I would figure out what the goal is and how to separate things gradually to a cleaner design that is more responsive (moving work off the UI thread) and more performant (parallelizing computationally intensive portions). I would focus on the highest priority and most noticable items first.
If you don't have a good refactoring tool like VisualAssist, invest in one - it's worth it. If you're not familiar with Michael Feathers or Kent Beck's refactoring books, consider borrowing them. I would ensure my refactorings are well covered by unit tests.
You can't move to VS (I would recommend the products I work on the Asynchronous Agents Library & Parallel Pattern Library, you can also use TBB or OpenMP).
In boost, I would look carefully at boost::thread, the asio library and the signals library.
I would ask for help / guidance / a listening ear when I got stuck.
-Rick
You can also look at this article from Herb Sutter You have a mass of existing code and want to add concurrency. Where do you start?
Well, I think you're expecting a lot based on your comments here. You're not going to go from minutes to milliseconds by multithreading. The most you can hope for is the current amount of time divided by the number of cores. That being said, you're in a bit of luck with C++. I've written high performance multiprocessor scientific apps, and what you want to look for is the most embarrassingly parallel loop you can find. In my scientific code, the heaviest piece is calculating somewhere between 100 and 1000 data points. However, all of the data points can be calculated independently of the others. You can then split the loop using openmp. This is the easiest and most efficient way to go. If you're compiler doesn't support openmp, then you will have a very hard time porting existing code. With openmp (if you're lucky), you may only have to add a couple of #pragmas to get 4-8x the performance. Here's an example StochFit
I hope this will help you in understanding and converting your monolithic single threaded app to multi thread easily. Sorry it is for another programming language but never the less the principles explained are the same all over.
http://www.freevbcode.com/ShowCode.Asp?ID=1287
Hope this helps.
The first thing you must do is to separate your GUI from your data, the second is to create a multithreaded class.
STEP 1 - Responsive GUI
We can assume that the image you are producing is contained in the canvas of a TImage. You can put a simple TTimer in you form and you can write code like this:
if (CurrenData.LastUpdate>CurrentUpdate)
{
Image1->Canvas->Draw(0,0,CurrenData.Bitmap);
CurrentUpdate=Now();
}
OK! I know! Is a little bit dirty, but it's fast and is simple.The point is that:
You need an Object that is created in the main thread
The object is copied in the Form you need, only when is needed and in a safe way (ok, a better protection for the Bitmap may be is needed, but for semplicity...)
The object CurrentData is your actual project, single threaded, that produces an image
Now you have a fast and responsive GUI. If your algorithm as slow, the refresh is slow, but your user will never think that your program is freezed.
STEP 2 - Multithread
I suggest you to implement a class like the following:
SimpleThread.h
typedef void (__closure *TThreadFunction)(void* Data);
class TSimpleThread : public TThread
{
public:
TSimpleThread( TThreadFunction _Action,void* _Data = NULL, bool RunNow = true );
void AbortThread();
__property Terminated;
protected:
TThreadFunction ThreadFunction;
void* Data;
private:
virtual void __fastcall Execute() { ThreadFunction(Data); };
};
SimpleThread.c
TSimpleThread::TSimpleThread( TThreadFunction _Action,void* _Data, bool RunNow)
: TThread(true), // initialize suspended
ThreadFunction(_Action), Data(_Data)
{
FreeOnTerminate = false;
if (RunNow) Resume();
}
void TSimpleThread::AbortThread()
{
Suspend(); // Can't kill a running thread
Free(); // Kills thread
}
Let's explain. Now, in your simple threaded class you can create an object like this:
TSimpleThread *ST;
ST=new TSimpleThread( RefreshFunction,NULL,true);
ST->Resume();
Let's explain better: now, in your own monolithic class, you have created a thread. More: you bring a function (ie: RefreshFunction) in a separate thread. The scope of your funcion is the same, the class is the same, the execution is separate.
My number one suggestion, although it's very late (sorry for reviving old thread, it's interesting!) is seek out homogeneous transform loops where each iteration of the loop is mutating a completely independent piece of data from the other iterations.
Instead of thinking about how to turn this old codebase into an asynchronous one running all kinds of operations in parallel (which could be asking for all kinds of trouble from worse than single-threaded performance from poor locking patterns or exponentially worse, race conditions/deadlocks by trying to do this in hindsight to code you can't fully comprehend), stick to the sequential mindset for the overall application design for now but identify or extract simple, homogeneous transform loops. Don't go from intrusive broad design-level multithreading and then try to drill into details. Work from non-intrusive multithreading of fine implementation details and specific hotspots first.
What I mean by homogeneous loops is basically one that transforms data in a very straightforward way, like:
for each pixel in image:
make it brighter
That is very simple to reason about and you can safely parallelize this loop without any problems whatsoever using OMP or TBB or whatever and without getting tangled up in thread synchronization. It only takes one glance at this code to fully comprehend its side effects.
Try to find as many hotspots as you can which fit this type of simple homogeneous transform loop and if you have complex loops which update many different types of data with complex control flows that trigger complex side effects, then seek to refactor towards these homogeneous loops. Often a complex loop which causes 3 disparate side effects to 3 different types of data can be turned into 3 simple homogeneous loops which each trigger just one kind of side effect to one type of data with a simpler control flow. Doing multiple loops instead of one might seem a tad wasteful, but the loops become simpler, the homogeneity will often lead to more cache-friendly sequential memory access patterns vs. sporadic random-access patterns, and you then tend to find much more opportunities to safely parallelize (as well as vectorize) the code in a straightforward way.
First you have to thoroughly understand the side effects of any code you attempt to parallelize (and I mean thoroughly!!!), so seeking out these homogeneous loops gives you isolated areas of the codebase you can easily reason about in terms of the side effects to the point where you can confidently and safely parallelize those hotspots. It'll also improve the maintainability of the code by making it very easy to reason about the state changes going on in that particular piece of code. Save the dream of the uber multithreaded application running everything in parallel for later. For now, focus on identifying/extracting performance-critical, homogeneous loops with simple control flows and simple side effects. Those are your priority targets for parallelization with simple parallelized loops.
Now admittedly I somewhat dodged your questions, but most of them don't need apply if you do what I suggest, at least until you've kind of worked your way out to the point where you're thinking more about multithreading designs as opposed to simply parallelizing implementation details. And you might not even need to go that far to have a very competitive product in terms of performance. If you have beefy work to do in a single loop, you can devote the hardware resources to making that loop go faster instead of making many operations run simultaneously. If you have to resort to more async methods like if your hotspots are more I/O bound, seek an async/wait approach where you fire off an async task but do some things in the meantime and then wait on the async task(s) to complete. Even if that's not absolutely necessary, the idea is to section off isolated areas of your codebase where you can, with 100% confidence (or at least 99.9999999%) say that the multithreaded code is correct.
You don't ever want to gamble with race conditions. There's nothing more demoralizing than finding some obscure race condition that only occurs once in a full moon on some random user's machine while your entire QA team is unable to reproduce it, only to, 3 months later, run into it yourself except during that one time you ran a release build without debugging info available while you then toss and turn in your sleep knowing your codebase can flake out at any given moment but in ways that no one will ever be able to consistently reproduce. So take it easy with multithreading legacy codebases, at least for now, and stick to multithreading isolated but critical sections of the codebase where the side effects are dead simple to reason about. And test the crap out of it -- ideally apply a TDD approach where you write a test for the code you're going to multithread to ensure it gives the correct output after you finish... though race conditions are the types of things that easily fly under the radar of unit and integration testing, so again you absolutely need to be able to comprehend the entirety of the side effects that go on in a given piece of code before you attempt to multithread it. The best way to do that is to make the side effects as easy to comprehend as possible with the simplest control flows causing just one type of side effect for an entire loop.
It is hard to give you proper guidelines. But...
The easiest way out according to me is to convert your application to ActiveX EXE as COM has support for Threading, etc. built right into it your program will automatically become Multi Threading application. Of course you will have to make quite a few changes to your code. But this is the shortest and safest way to go.
I am not sure but probably RichClient Toolset lib may do the trick for you. On the site the author has written:
It also offers registration free Loading/Instancing-capabilities
for ActiveX-Dlls and new, easy to use Threading-approach,
which works with Named-Pipes under the
hood and works therefore also
cross-process.
Please check it out. Who knows it may be the right solution for your requirements.
As for Project management I think you can continue using what is provided in your choice IDE by integrating it with SVN through plugins.
I forgot to mention that we have completed an application for Share market that automatically trades (buys and sells based on lows and highs) into those scripts that are in user portfolio based on an algorithm that we have developed.
While developing this software we were facing the same kind of problem as you have illustrated here. To solve it we converted out application in ActiveX EXE and we converted all those parts that need to execute parallely into ActiveX DLLs. We have not used any third party libs for this!
HTH
It was hard for me to come up with a real-world example for a concurrency:
Imagine the above situation, where
there are many lanes, many junctions
and a great amount of cars. Besides,
there is a human factor.
The problem is a hard research area for traffic engineers. When I investigated it a some time ago, I noticed that many models failed on it. When people are talking about functional programming, the above problem tends to pop up to my mind.
Can you simulate it in Haskell? Is Haskell really so concurrent? What are the limits to parallelise such concurrent events in Haskell?
I'm not sure what the question is exactly. Haskell 98 doesn't specify anything for concurrency. Specific implementations, like GHC, provide extensions that implement parallelism and concurrency.
To simulate traffic, it would depend on what you needed out of the simulation, e.g. if you wanted to track individual cars or do it in a general statistical way, whether you wanted to use ticks or a continuous model for time, etc. From there, you could come up with a representation of your data that lent itself to parallel or concurrent evaluation.
GHC provides several methods to leverage multiple hardware execution units, ranging from traditional semaphores and mutexes, to channels with lightweight threads (which could be used to implement an actor model like Erlang), to software transactional memory, to pure functional parallel expression evaluation, with strategies, and experimental nested data parallelism.
So yes, Haskell has many approaches to parallel execution that could certainly be used in traffic simulations, but you need to have a clear idea of what you're trying to do before you can choose the best digital representation for your concurrent simulation. Each approach has its own advantages and limits, including learning curve. You may even learn that concurrency is overkill for the scale of your simulations.
It sounds to me like you are trying to do a simulation, rather than real-world concurrency. This kind of thing is usually tackled using discrete event simulation. I did something similar in Haskell a few years ago, and rolled my own discrete event simulation library based on the continuation monad transformer. I'm afraid its owned by my employer, so I can't post it, but it wasn't too difficult. A continuation is effectively a suspended thread, so define something like this (from memory):
type Sim r a = ContT r (StateT ThreadQueue IO a)
newtype ThreadQueue = TQ [() -> Sim r ()]
The ThreadQueue inside the state holds the queue of currently scheduled threads. You can also have other types of thread queue to hold threads that are not scheduled, for instance in a semaphore (based on "IORef (Int, ThreadQueue)"). Once you have semaphores you can build the equivalent of MVars and MQueues.
To schedule a thread use "callCC". The argument to "callCC" is a function "f1" that itself takes a function "c" as an argument. This inner argument "c" is the continuation: calling it resumes the thread. When you do this, from that thread's point of view "callCC" just returned the value you gave as an argument to "c". In practice you don't need to pass values back to the suspended threads, so the parameter type is null.
So your argument to "callCC" is a lambda function that takes "c" and puts it on the end of whatever queue is appropriate for the action you are doing. Then it takes the head of the ThreadQueue from inside the state and calls that. You don't need to worry about this function returning: it never does.
If you need a concurrent programming language with a functional sequential subset, consider Erlang.
More about Erlang
I imagine you're asking if you could have one thread for each object in the system?
The GHC runtime scales nicely to millions of threads, and multiplexes those threads onto the available hardware, via the abstractions Chris Smith mentioned. So it certainly is possible to have thousands of threads in your system, if you're using Haskell/GHC.
Performance-wise, it tends to be a good deal faster than Erlang, but places less emphasis on distribution of processes across multiple nodes. GHC in particular, is more targetted towards fast concurrency on shared memory multicore systems.
Erlang, Scala, Clojure are languages that might suit you.
But I think what you need more is to find a suitable Multi-Agents simulation library or toolkit, with bindings to your favourite language.
I can tell you about MASON, Swarm and Repast. But these are Java and C libaries...
I've done one answer on this, but now I'd like to add another from a broader perspective.
It sounds like the thing that make this a hard problem is that each driver is basing their actions on mental predictions of what other drivers are going to do. For instance when I am driving I can tell when a car is likely to pull in front of me, even before he indicates, based on the way he is lining himself up with the gap between me and the car in front. He in turn can tell that I have seen him from the fact that I'm backing off to make room for him, so its OK to pull in. A good driver picks up lots of these subtle clues, and its very hard to model.
So the first step is to find out what aspects of real driving are not included in the failed models, and work out how to put them in.
(Clue: all models are wrong, but some models are useful).
I suspect that the answer is going to involve giving each simulated driver one or more mental models of what each other driver is going to do. This involves running the planning algorithm for Driver 2 using several different assumptions that Driver 1 might make about the intentions of Driver 2. Meanwhile Driver 2 is doing the same about Driver 1.
This is the kind of thing that can be very difficult to add to an existing simulator, especially if it was written in a conventional language, because the planning algorithm may well have side effects, even if its only in the way it traverses a data structure. But a functional language may well be able to do better.
Also, the interdependence between drivers probably means there is a fixpoint somewhere in there, which lazy languages tend to do better with.