Setting a memory limit on my AI algorithm?

I am working on a Tetris AI implementation. It is a GUI application that plays the game by itself. The user can manipulate a few parameters that influence the decisions made by the AI. The basic algorithm goes as follows:
Start a new thread and clone the current game state to avoid excessive locking.
Generate a list of all possible future game states. These become child nodes of the current game state.
For each child node generate its future game states.
Keep doing this recursively until a predefined depth has been reached.
Once the requested depth has been reached, find the best end node and recursively find its parent node until you have the first level child.
Delete all child nodes that are not on the path between the child node and the end node.
This path is now the list of precalculated moves.
The main game executes the list of precalculated moves (with some fancy animation).
This is working pretty well up until search depth 4. After that I start to get memory problems. The number of possible game states can go from 9 to 34. So the worst case scenario for a level 4 search would be 34^4 (about 1.3 million) game states. Windows XP seems unable to deal with level 5 searches (it hangs at 2+ GB).
So if I want to use deeper searches I'll need to use a strategy where I delete the non-promising branches and continue with the ones that will lead to a good score. But this makes it harder to estimate a maximum acceptable search depth. Therefore I think that it would be better to specify a memory limit instead of a depth limit.
I am considering using a memory pool and "placement new" to create my objects on the pool's memory segments. However, the game grid is implemented as an STL vector. So in order to allocate it on the pool I need to implement a custom allocator.
This seems quite a challenge and perhaps I'm overlooking a simpler solution. So I'd like your insights on how to best deal with this.
Can boost, or another library, provide me some of these facilities? (I already found Poco's MemoryPool.) Are there any good online resources to help me get going?
FYI: here's the source code and a sample binary for Windows.

You can create a memory pool, etc, but that won't really make it any easier or harder to count game state instances. You do need to make sure you don't go over a certain number of active states in your decision tree, with or without a pool. And Boost does have one: http://www.boost.org/doc/libs/1_44_0/libs/pool/doc/index.html
It sounds like you're not really doing any pruning of the tree, which would allow you to get much deeper. Evaluate each future game state and drop ones unlikely to develop into anything useful, and don't waste your time going down that branch.
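A rough sketch of that kind of pruning, assuming a beam-style search that keeps only the best few states per level. `GameState`, `beamSearch`, `expand` and `beamWidth` are made-up names for illustration, not taken from the asker's source, and a real version would also carry the list of moves in each state:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a Tetris game state; only the score matters here.
struct GameState {
    int score;
};

// Expands the tree level by level but keeps at most `beamWidth` states alive
// after each level, so memory no longer grows exponentially with depth.
GameState beamSearch(const GameState& root,
                     const std::function<std::vector<GameState>(const GameState&)>& expand,
                     int depth, std::size_t beamWidth) {
    if (beamWidth == 0) beamWidth = 1;  // always keep at least one state
    std::vector<GameState> frontier;
    frontier.push_back(root);
    for (int level = 0; level < depth; ++level) {
        std::vector<GameState> next;
        for (const GameState& state : frontier) {
            std::vector<GameState> children = expand(state);
            next.insert(next.end(), children.begin(), children.end());
        }
        if (next.empty()) break;  // no further moves possible
        if (next.size() > beamWidth) {
            // Move the most promising states to the front and drop the rest.
            std::partial_sort(next.begin(),
                              next.begin() + static_cast<std::ptrdiff_t>(beamWidth),
                              next.end(),
                              [](const GameState& a, const GameState& b) {
                                  return a.score > b.score;
                              });
            next.resize(beamWidth);
        }
        frontier.swap(next);
    }
    return *std::max_element(frontier.begin(), frontier.end(),
                             [](const GameState& a, const GameState& b) {
                                 return a.score < b.score;
                             });
}
```

With a branching factor of at most 34, the peak number of live states is roughly `beamWidth * 34`, which gives you a memory bound that does not depend on the search depth.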

Despite the lack of context [what kind of search problem are you doing? Depth first, breadth first, A*?....] My suggestion is:
Use semaphores to limit the amount of processing that is done at one time, and release them once the processing has been evaluated. I can't really recommend a specific library that includes semaphores, as threading was not built into C++ before C++11, so check your framework's documentation.
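For illustration, here is a minimal sketch of the idea, assuming a C++11 compiler. The `Semaphore` class and `runLimited` are hypothetical names written for this example (C++20 also ships `std::counting_semaphore`), and the "work" is just a placeholder:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore built from C++11 primitives.
class Semaphore {
public:
    explicit Semaphore(int permits) : permits_(permits) {}

    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return permits_ > 0; });
        --permits_;
    }

    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++permits_;
        }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int permits_;
};

// Starts `taskCount` threads but lets at most `limit` of them be inside the
// guarded region at once. Returns the highest concurrency actually observed.
int runLimited(int taskCount, int limit) {
    Semaphore slots(limit);
    std::atomic<int> active{0};
    std::atomic<int> peak{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < taskCount; ++i) {
        threads.emplace_back([&] {
            slots.acquire();
            const int now = ++active;
            int seen = peak.load();
            while (now > seen && !peak.compare_exchange_weak(seen, now)) {}
            std::this_thread::yield();  // placeholder for evaluating a branch
            --active;
            slots.release();
        });
    }
    for (std::thread& t : threads) t.join();
    return peak.load();
}
```

Note that this limits how much work runs at once, not how many game states are held in memory, so it would probably need to be combined with a cap on live nodes.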

In Vulkan, is it beneficial for the graphics queue family to be separate from the present queue family?

As far as I can tell it is possible for a queue family to support presenting to the screen but not support graphics. Say I have a queue family that supports both graphics and presenting, and another queue family that only supports presenting. Should I use the first queue family for both processes or should I delegate the first to graphics and the latter to presenting? Or would there be no noticeable difference between these two approaches?

No such HW exists, so the best approach is no approach. If you want to be really nice, you can handle the separate present queue family case while expending minimal brain-power on it, though you have no way to test it on real HW that needs it. So I would say aborting with a nice error message would be adequate until you can get your hands on actual HW that does it.
I think there is a bit of a design error here on Khronos's part. A separate present queue does look like a more explicit way. But then, the present op itself is not a queue operation, so the driver can use whatever it wants anyway. Also, separate present requires an extra semaphore and a Queue Family Ownership Transfer (or a VK_SHARING_MODE_CONCURRENT resource). As history went, no driver is so extremist as to report a separate present queue. So I made KhronosGroup/Vulkan-Docs#1234.
For rough notion of what happens at vkQueuePresentKHR, you can inspect Mesa code: https://github.com/mesa3d/mesa/blob/bf3c9d27706dc2362b81aad12eec1f7e48e53ddd/src/vulkan/wsi/wsi_common.c#L1120-L1232. There's probably no monkey business there using the queue you provided except waiting on your semaphore, or at most making a blit of the image. If you (voluntarily) want to use separate present queue, you need to measure and whitelist it only for drivers (and probably other influences) it actually helps (if any such exist, and if it is even worth your time).
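The "minimal brain-power" handling mentioned above could look roughly like this. It is only a sketch of the selection logic: `QueueFamilyInfo`, `QueueChoice` and `chooseQueueFamilies` are hypothetical names, and the two flags are assumed to have been filled in beforehand from `vkGetPhysicalDeviceQueueFamilyProperties` (the `VK_QUEUE_GRAPHICS_BIT` flag) and `vkGetPhysicalDeviceSurfaceSupportKHR`:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-family summary, filled in from the Vulkan queries.
struct QueueFamilyInfo {
    bool graphics;  // family has VK_QUEUE_GRAPHICS_BIT
    bool present;   // family can present to the target surface
};

struct QueueChoice {
    bool found;
    std::uint32_t graphicsFamily;
    std::uint32_t presentFamily;
};

// Prefers one family that does both graphics and present. Falls back to two
// separate families only when no combined family exists.
QueueChoice chooseQueueFamilies(const std::vector<QueueFamilyInfo>& families) {
    bool haveGraphics = false;
    bool havePresent = false;
    std::uint32_t graphicsIndex = 0;
    std::uint32_t presentIndex = 0;
    for (std::size_t i = 0; i < families.size(); ++i) {
        const std::uint32_t index = static_cast<std::uint32_t>(i);
        if (families[i].graphics && families[i].present) {
            return QueueChoice{true, index, index};
        }
        if (families[i].graphics && !haveGraphics) {
            haveGraphics = true;
            graphicsIndex = index;
        }
        if (families[i].present && !havePresent) {
            havePresent = true;
            presentIndex = index;
        }
    }
    return QueueChoice{haveGraphics && havePresent, graphicsIndex, presentIndex};
}
```

If `graphicsFamily != presentFamily` comes back, the application can either take the extra-semaphore and ownership-transfer path or simply report an error, as suggested above.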

First off, I assume you mean "beneficial" in terms of performance, and whenever it comes to questions like that you can never have a definite answer except by profiling the different strategies. If your application needs to run on a variety of hardware, you can have it profile the different strategies the first time it's run and save the results locally for repeated use, provide the user with a benchmarking utility they can run if they see poor performance, etc. etc. Trying to reason about it in the abstract can only get you so far.
That aside, I think the easiest way to think about questions like this is to remember that when it comes to graphics programming, you want to both maximize the amount of work that can be done in parallel and minimize the amount of work overall. If you want to present an image from a non-graphics queue and you need to perform graphics operations on it, you'll need to transfer ownership of it to the non-graphics queue when graphics operations on it have finished. Presumably, that will take a bit of time in the driver if nothing else, so it's only worth doing if it will save you time elsewhere somehow.
A common situation where this would probably save you time is if the device supports async compute and also lets you present from the compute queue. For example, a 3D game might use the compute queue for things like lighting, blur, UI, etc. that make the most sense to do after geometry processing is finished. In this case, the game engine would transfer ownership of the image to be presented to the compute queue first anyway, or even have the compute queue own the swapchain image from beginning to end, so presenting from the compute queue once its work for the frame is done would allow the graphics queue to stay busy with the next frame. AMD and NVIDIA recommend this sort of approach where it's possible.
If your application wouldn't otherwise use the compute queue, though, I'm not sure how much sense it makes or not to present on it when you have the option. The advantage of that approach is that once graphics operations for a given frame are over, you can have the graphics queue immediately release ownership of the image for it and acquire the next one without having to pause to present it, which would allow presentation to be done in parallel with rendering the next frame. On the other hand, you'll have to transfer ownership of it to the compute queue first and set up presentation there, which would add some complexity and overhead. I'm not sure which approach would be faster and I wouldn't be surprised if it varies with the application and environment. Of course, I'm not sure how many realtime Vulkan applications of any significant complexity fit this scenario today, and I'd guess it's not very many as "per-pixel" things tend to be easier and faster to do with a compute shader.

Approach for finding file in huge tree structure using multithreading

I have a tree which has all the directories and files as its nodes. I want to search for a particular file. Say the tree is spread widely and I want to do a breadth-first search to find some particular file, and do that using multithreading. How should I do that? What is a good approach?

There are some cases where multiThreading the search will provide a useful speedup - if the tree spans more than one disk, for example, or if some of the disks/nodes are indirected over some network.
I certainly would not want to try creating threads for every folder. That's thousands of create/run/terminate, thousands of stack allocation/free etc. Gross, avoidable overhead.
A multiThreaded search can be done, but as other posters have said, look at available alternatives first. Then read the rest of this post. Then look again.
I have done something like this once using a queue approach similar to that suggested by Matt.
I don't want to ever do it again:
I used a producer-consumer work queue on which 6 threads waited for work, (6 because testing showed this to be optimum with my problem). The threads were all created once at startup and never terminated. None of this continual create/load/run/waitFor/getResult/terminateIfYoureLucky stuff that unaccountably seems to be popular with developers despite poor performance, shutdown AVs, 216/217 messageBoxes etc etc.
The work came in the form of a 'folderSearch' class that contained the path to be searched, a file match function event to call and the FindFirst/FindNext loop method to do the searching. I created a couple hundred of these at startup in a pool, (ie pushed onto another P-C pool queue:). When the FF/FN iterated the files in the folder to look for matching files, encountering a sub-folder resulted in extracting another folderSearch instance from the pool, loading it up with the new path & pushing it onto the work queue - some thread would then pick it up and iterate the sub-folder. The class had a list for the paths to matching files and a 'results' event to call, (with 'this' as parameter, of course), if it found something of interest. If a folderSearch got to the end of a twig, having found nothing and with nothing left to search, it would release itself back to the pool, (well, OK, the thread would do it, but you know what I mean:).
There was no need for any explicit 'load balancing'. If one node was exceptionally deep, it would naturally end up with all six threads working on its subtrees because the other paths are exhausted.
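To make the shape of this concrete, here is a much simplified sketch of the work-queue idea. It is not the original code: it walks a hypothetical in-memory `Folder` tree instead of calling FindFirst/FindNext, leaves out the pool of reusable folderSearch objects, and uses a `pending` counter (my own addition) so the workers know when the whole search is finished:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical in-memory stand-in for a directory on disk.
struct Folder {
    std::string path;
    std::vector<std::string> files;
    std::vector<Folder> subfolders;
};

// A fixed set of worker threads pulls folders from a shared queue. Each
// sub-folder found is pushed back onto the queue for any idle worker.
std::vector<std::string> parallelSearch(const Folder& root, const std::string& target,
                                        std::size_t threadCount) {
    if (threadCount == 0) threadCount = 1;
    std::mutex workMutex;
    std::condition_variable workReady;
    std::deque<const Folder*> workQueue;
    workQueue.push_back(&root);
    std::size_t pending = 1;  // folders queued or currently being scanned
    std::vector<std::string> matches;

    auto worker = [&] {
        for (;;) {
            const Folder* folder = nullptr;
            {
                std::unique_lock<std::mutex> lock(workMutex);
                workReady.wait(lock, [&] { return !workQueue.empty() || pending == 0; });
                if (workQueue.empty()) return;  // pending == 0: search is over
                folder = workQueue.front();
                workQueue.pop_front();
            }
            // Scan this folder without holding the lock.
            std::vector<std::string> found;
            for (const std::string& file : folder->files) {
                if (file == target) found.push_back(folder->path + "/" + file);
            }
            {
                std::lock_guard<std::mutex> lock(workMutex);
                matches.insert(matches.end(), found.begin(), found.end());
                for (const Folder& sub : folder->subfolders) {
                    workQueue.push_back(&sub);
                    ++pending;
                }
                --pending;  // this folder is done
            }
            workReady.notify_all();
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < threadCount; ++i) threads.emplace_back(worker);
    for (std::thread& t : threads) t.join();
    return matches;
}
```

As in the description above, the load balances itself: a deep subtree keeps feeding the queue, so idle threads drift onto it. The matches come back in no particular order.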
Searching 3 disks in their entirety meant popping 3 folderSearch from the pool, loading them up with 'C:\', 'E:\', 'F:\' and the file match method and then pushing them onto the work queue. The disks then made rattling noises and the event would eventually fire with results. In my case, (Windows), the event PostMessaged the folderSearch objects to a UI thread where the results were displayed in a treeView before repooling the folderSearch's for re-use.
This system was ~ 2.5 times as fast as a simple sequential search across 3 disks, even on my old development box that only had one core, simply because all 3 disks were searched in parallel. I suspect it would show the same sort of advantage on a modern box because the limiting factor is probably dominated by IO waiting on the disks.
Surprisingly, there was also a speedup with only one disk, but not that much. Don't know why - should be slower, by rights, due to all the extra complication.
Naturally, there were issues. One was that, with a search that fired lots of results, the pool would empty because the UI could not keep up with the threads and so all the folderSearch objects got stuck in the PostMessages queued to the UI, so slowing down the search threads as they had to wait on the pool queue until the PostMessages got handled and so returned folderSearch's to the pool. This also meant that the UI was effectively blocked until the search was over and it could catch up, negating one of the advantages of threading off the search in the first place :( With small result sets, it worked fine.
Another possible issue is that the results come back in an 'unnatural' manner, interleaved in such an apparently confusing manner that things like assembling a tree view are much more complex than with a single-threaded recursive search - you have to flit about all over the place to stuff the results into the treeView in the right place. This loads up the GUI with extra work and can negate the searching speed advantage with large numbers of results, as I found out.
This design could run multiple searches concurrently. As a test, I would load on several 3-disk searches at once, (no, not while loading up the treeView - I just dumped the number of files found onto a memo line in the GUI message-handler). This made a huge amount of rattling and slowed everything to a crawl, but it did eventually complete all the searches without crashing. I didn't do this often as I was afraid for my poor disks. Don't try this at home.
I was never sure how many threads to hang off the queue. Six was about the optimum on my old box with local disks. If there are networked disks in the mix, then more is probably better since a network call will tend to block one thread for much longer periods than a local disk read. Never tried that, but loading on more threads did not affect the performance any with local disks, just used more memory for no extra advantage.
Another problem is finding out if the search is actually over - are all the results in .. or is some thread still waiting on a network drive that's slow or actually unreachable? With only one search, I could tell because the pool became full again when the search was over, (I dumped the pool level to a statusBar on a 1s GUI timer). It didn't matter in my app, but in others, it might...
Cancelling a search is a similar issue. These sorts of things would need another 'searchClass' to control each search. Each folderSearch allocated to a search would have to keep a reference to the searchClass so that any thread handling the folderSearch could find out if an abort had been set and if so stop doing stuff with that folderSearch. I didn't need this, so did not implement it.
Then there's error reporting. If a network drive connection fails, for example, several, (most likely all!), threads can block up for a long time before an exception is raised. Then they all except at once. The catch messages get loaded into an 'errorMess' field in the folderSearch and the results event fired. Human-detectable evidence - the rattling stops. Nothing happens for a minute, then [no of threads] errors appear all at once.
Note well the caveats from the other posters and my experiences. Only attempt something like this if you really, really need it for some special search purpose and you are 100% happy with multiThreaded apps. If you can get away with a single-threaded search, or a shell call to a File Explorer, or almost anything else, do it that way!
I've used the same approach since with an FTP server to generate trees. That was much faster as well, though the server admins were probably not happy about the multiple connections.
Rgds,
Martin

Multithreading a tree-search task with unknown work distribution in each branch is non-trivial (this comes up a lot in say, constraint satisfaction problems.)
The easiest way is to create a task queue (protected by a mutex.) Fill this queue with all the children of the root node. Spawn N threads (one for each available CPU core) and have them search through each node. There are various tricks you can do to avoid some bad scenarios (if any thread finds that its node is "unexpectedly deep" you can have it add new tasks to the queue corresponding to subdirectories it wants other threads to explore.) If your node depths are well distributed and the root node has lots of children you can avoid the queue entirely -- just assign each thread with index i (0 <= i < N) the root children whose index j satisfies j % N == i.
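The queue-free variant is small enough to sketch. This assumes the tree is already in memory; `Node`, `countMatches` and `countMatchesPartitioned` are illustrative names, and the function counts matching nodes rather than returning paths to keep the example short:

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical in-memory tree node (a directory or a file).
struct Node {
    std::string name;
    std::vector<Node> children;
};

// Plain sequential depth-first count of nodes named `target`.
std::size_t countMatches(const Node& node, const std::string& target) {
    std::size_t count = (node.name == target) ? 1 : 0;
    for (const Node& child : node.children) count += countMatches(child, target);
    return count;
}

// Static partition without a queue: thread i explores the root children
// whose index j satisfies j % threadCount == i.
std::size_t countMatchesPartitioned(const Node& root, const std::string& target,
                                    std::size_t threadCount) {
    if (threadCount == 0) threadCount = 1;
    std::vector<std::size_t> partial(threadCount, 0);  // one slot per thread
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < threadCount; ++i) {
        threads.emplace_back([&, i] {
            for (std::size_t j = i; j < root.children.size(); j += threadCount) {
                partial[i] += countMatches(root.children[j], target);
            }
        });
    }
    for (std::thread& t : threads) t.join();
    std::size_t total = (root.name == target) ? 1 : 0;
    for (std::size_t p : partial) total += p;
    return total;
}
```

Each thread writes only to its own slot in `partial`, so no locking is needed. The weakness is the one described above: if one child of the root holds most of the tree, one thread ends up doing most of the work.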

My first response is to say "just use nftw and forget about doing it multi-threaded". If you happen to have an implementation of nftw that does the tree walk in a multi-threaded fashion, then you get multi-threading for free (I'm not aware of any such implementation). If you really want to do multiple threads, I would suggest using nftw and spawning a new thread for each directory within the call back, but it's not immediately clear that that would be any easier (or any different) than following Kanopus' suggestion. And after thinking about it for a few moments, I fall back to my first suggestion and wonder why you want to do this with multiple threads. Having more threads is unlikely to speed up the search. Use nftw. Don't worry about threading.

Assuming each node in the tree represents a directory (and the files within it), and also assuming there is no limit in the number of threads you can open:
Enter the root node, if it has n subdirectories, create n - 1 threads to search in the first n - 1 and continue the search through the last subdirectory. Repeat as needed.

Tree-structures don't typically lend themselves to parallelization. Assuming you have all the nodes loaded into memory, try and organize them such that they occupy an array - after all, they need to live in RAM which is serial - and ignore their tree structure for the purpose of your search. Then iterate over the elements of the array using some sort of parallel for loop. A popular choice for this is OpenMP or you might try parallel_for_each in Visual Studio.

Multi-agent system in C++ code design

I have a simulation written in C++ in which I need to maintain a variable number of agents, and I am having trouble deciding how to implement it well. Every agent looks something similar to:
class Agent {
public:
    Vector2f pos;
    float health;
    float data[DATASIZE];
    vector<Rule> rules;
};
I need to maintain a variable number of agents in my simulation such that:
Preferably, there is no upper bound on the number of agents
I can easily add an Agent
I can easily remove any agent under some condition (say health<0)
I can easily iterate all agents and do something (say health--)
Preferably, I can parallelize the work using openMP, because many updates are somewhat costly, but completely independent of other agents.
(edit) the order of the agents doesn't matter at all
What kind of container or design principles should I use for the agents? Until now I was using a vector, but I think it is pretty hard to erase from this structure: something I need to do quite often, as things die all the time. Are there any alternatives I should look at? I thought of something like List, but I don't think they can be parallelized because they are implemented as linked lists with iterator objects?
Thank you

You could leave the agent in the list when dead, ready for re-use. No worries about shrinking your container, and you retain the benefits of a vector. You could keep a separate stack of pointers to dead/reusable agents, just push onto it when an agent dies, pop one off to reclaim for a new agent.
for (Agent& agent : agents) {
    if (agent.health > 0) {      // skip dead agents
        processRules(agent);     // hypothetical per-agent update
    }
}
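A slightly fuller sketch of the reuse idea, under a few assumptions of mine: `AgentPool` is a made-up class, `Agent` is cut down to just `health`, new agents are spawned with positive health, and the free stack stores indices rather than pointers because pointers into a `std::vector` are invalidated when it grows:

```cpp
#include <cstddef>
#include <vector>

// Cut-down agent; the real class has more fields.
struct Agent {
    float health;
};

class AgentPool {
public:
    // Returns the index of the new agent, reusing a dead slot when one exists.
    std::size_t spawn(float health) {
        if (!freeSlots_.empty()) {
            const std::size_t index = freeSlots_.back();
            freeSlots_.pop_back();
            agents_[index].health = health;
            return index;
        }
        agents_.push_back(Agent{health});
        return agents_.size() - 1;
    }

    // One simulation step: every live agent loses one health point, and
    // agents that die are pushed onto the free stack for later reuse.
    void step() {
        for (std::size_t i = 0; i < agents_.size(); ++i) {
            Agent& agent = agents_[i];
            if (agent.health <= 0) continue;  // skip dead agents
            agent.health -= 1.0f;
            if (agent.health <= 0) freeSlots_.push_back(i);
        }
    }

    std::size_t capacity() const { return agents_.size(); }
    std::size_t liveCount() const { return agents_.size() - freeSlots_.size(); }

private:
    std::vector<Agent> agents_;
    std::vector<std::size_t> freeSlots_;  // stack of reusable indices
};
```

The container never shrinks, so iteration cost is proportional to the peak population rather than the current one. That is fine as long as the population stays roughly stable.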

Until now I was using a vector, but I think it pretty hard to erase from this structure: something I need to do quite often, as things die all the time.
How many do you actually expect to die per step of your simulation? What seems like "all the time" to a human could still be considered very infrequent to a computer. For instance, if each step of your simulation processes thousands of agents but on average only 1 agent dies every few steps, then agent death is a minor incident. With those kinds of numbers, your program spends far more time processing live agents than it does dealing with dead agents, and so worrying about the performance of removing a dead agent may not be worthwhile at all. If making agent removal more efficient would end up making normal agent iteration and processing less efficient (yet agent removal is relatively rare), then that would probably be a poor trade-off.
On the other hand, if large numbers of agents are born and die every simulation step, then you might want to make sure those events can be handled efficiently. So it really depends on the kind of numbers you expect to be dealing with.
My general advice would be to proceed with using std::vector (as long as it fits the rest of your design) unless you really expect a significant number of agent deaths per step compared to the number of agents in total.
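Since the question says the order of agents doesn't matter, one way to keep removal from a `std::vector` cheap is the swap-and-pop trick. This is a sketch with a cut-down `Agent` and a hypothetical `removeDead` helper, using the question's `health < 0` condition:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Cut-down agent; the real class has more fields.
struct Agent {
    float health;
};

// Removes every agent with health < 0 by overwriting it with the last
// element and shrinking the vector by one. Element order is not preserved.
void removeDead(std::vector<Agent>& agents) {
    std::size_t i = 0;
    while (i < agents.size()) {
        if (agents[i].health < 0) {
            agents[i] = std::move(agents.back());
            agents.pop_back();  // do not advance: slot i holds an unchecked agent
        } else {
            ++i;
        }
    }
}
```

Each removal is constant time, so a whole pass is linear in the number of agents. If order did matter, the usual alternative is the erase-remove idiom with `std::remove_if`, which is also linear but shifts the survivors down.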

List should work pretty well. It can be parallelized, because inserting or removing an element does not invalidate other iterators (except of course iterators pointing to an element being removed).
If you don't need backward traversal, slist is as good as list, and a little faster.
If you don't care about the order of elements, use set.

Use a quadtree like in video games. Then searching on pos is fast and removal is fast too. (Plus you can parallelize across child nodes).

Advice for converting a large monolithic singlethreaded application to a multithreaded architecture?

My company's main product is a large monolithic C++ application, used for scientific data processing and visualisation. Its codebase goes back maybe 12 or 13 years, and while we have put work into upgrading and maintaining it (use of STL and Boost - when I joined most containers were custom, for example - fully upgraded to Unicode and the 2010 VCL, etc) there's one remaining, very significant problem: it's fully singlethreaded. Given it's a data processing and visualisation program, this is becoming more and more of a handicap.
I'm both a developer and the project manager for the next release where we want to tackle this, and this is going to be a difficult job in both areas. I'm seeking concrete, practical, and architectural advice on how to tackle the problem.
The program's data flow might go something like this:
a window needs to draw data
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This will go and calculate or read from file or whatever else is required (often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations)
Ie, the paint message handler will block while processing is done, and if the data hasn't already been calculated and cached, this can be a long time. Sometimes this is minutes. Similar paths occur for other parts of the program that perform lengthy processing operations - the program is unresponsive for the entire time, sometimes hours.
I'm seeking advice on how to approach changing this. Practical ideas. Perhaps things like:
design patterns for asynchronously requesting data?
storing large collections of objects such that threads can read and write safely?
handling invalidation of data sets while something is trying to read it?
are there patterns and techniques for this sort of problem?
what should I be asking that I haven't thought of?
I haven't done any multithreaded programming since my Uni days a few years ago, and I think the rest of my team is in a similar position. What I knew was academic, not practical, and is nowhere near enough to have confidence approaching this.
The ultimate objective is to have a fully responsive program, where all calculations and data generation is done in other threads and the UI is always responsive. We might not get there in a single development cycle :)
Edit: I thought I should add a couple more details about the app:
It's a 32-bit desktop application for Windows. Each copy is licensed. We plan to keep it a desktop, locally-running app
We use Embarcadero (formerly Borland) C++ Builder 2010 for development. This affects the parallel libraries we can use, since most seem (?) to be written for GCC or MSVC only. Luckily they're actively developing it and its C++ standards support is much better than it used to be. The compiler supports these Boost components.
Its architecture is not as clean as it should be and components are often too tightly coupled. This is another problem :)
Edit #2: Thanks for the replies so far!
I'm surprised so many people have recommended a multi-process architecture (it's the top-voted answer at the moment), not multithreading. My impression is that's a very Unix-ish program structure, and I don't know anything about how it's designed or works. Are there good resources available about it, on Windows? Is it really that common on Windows?
In terms of concrete approaches to some of the multithreading suggestions, are there design patterns for asynchronous request and consuming of data, or threadaware or asynchronous MVP systems, or how to design a task-oriented system, or articles and books and post-release deconstructions illustrating things that work and things that don't work? We can develop all this architecture ourselves, of course, but it's good to work from what others have done before and know what mistakes and pitfalls to avoid.
One aspect that isn't touched on in any answers is project managing this. My impression is estimating how long this will take and keeping good control of the project when doing something as uncertain as this may be hard. That's one reason I'm after recipes or practical coding advice, I guess, to guide and restrict coding direction as much as possible.
I haven't yet marked an answer for this question - this is not because of the quality of the answers, which is great (and thank you) but simply that because of the scope of this I'm hoping for more answers or discussion. Thank you to those who have already replied!

You have a big challenge ahead of you. I had a similar challenge ahead of me -- 15 year old monolithic single threaded code base, not taking advantage of multicore, etc. We expended a great deal of effort in trying to find a design and solution that was workable and would work.
Bad news first. It will be somewhere between impractical and impossible to make your single-threaded app multithreaded. A single-threaded app relies on its single-threadedness in ways both subtle and gross. One example is if the computation portion requires input from the GUI portion. The GUI must run in the main thread. If you try to get this data directly from the computation engine, you will likely run into deadlocks and race conditions that will require major redesigns to fix. Many of these reliances will not crop up during the design phase, or even during the development phase, but only after a release build is put in a harsh environment.
More bad news. Programming multithreaded applications is exceptionally hard. It might seem fairly straightforward to just lock stuff and do what you have to do, but it is not. First of all, if you lock everything in sight you end up serializing your application, negating every benefit of multithreading in the first place while still adding in all the complexity. Even if you get beyond this, writing a defect-free MP application is hard enough, but writing a highly-performant MP application is that much more difficult. You could learn on the job in a kind of baptism by fire. But if you are doing this with production code, especially legacy production code, you put your business at risk.
Now the good news. You do have options that don't involve refactoring your whole app and will give you most of what you seek. One option in particular is easy to implement (in relative terms), and much less prone to defects than making your app fully MP.
You could instantiate multiple copies of your application. Make one of them visible, and all the others invisible. Use the visible application as the presentation layer, but don't do the computational work there. Instead, send messages (perhaps via sockets) to the invisible copies of your application which do the work and send the results back to the presentation layer.
This might seem like a hack. And maybe it is. But it will get you what you need without putting the stability and performance of your system at such great risk. Plus there are hidden benefits. One is that the invisible engine copies of your app will have access to their own virtual memory space, making it easier to leverage all the resources of the system. It also scales nicely. If you are running on a 2-core box, you could spin off 2 copies of your engine. 32 cores? 32 copies. You get the idea.

So, there's a hint in your description of the algorithm as to how to proceed:
often quite a complex data flow - think of this as data flowing through a complex graph, each node of which performs operations
I'd look into making that data-flow graph be literally the structure that does the work. The links in the graph can be thread-safe queues, the algorithms at each node can stay pretty much unchanged, except wrapped in a thread that picks up work items from a queue and deposits results on one. You could go a step further and use sockets and processes rather than queues and threads; this will let you spread across multiple machines if there is a performance benefit in doing this.
Then your paint and other GUI methods need to be split in two: one half to queue the work, and the other half to draw or use the results as they come out of the pipeline.
This may not be practical if the app presumes that data is global. But if it is well contained in classes, as your description suggests it may be, then this could be the simplest way to get it parallelised.
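As a rough illustration of "links are thread-safe queues, nodes are threads", here is a two-stage pipeline sketch. It assumes a C++11 compiler (with C++ Builder 2010 you would probably reach for Boost.Thread equivalents instead), and `BlockingQueue` and `runPipeline` are made-up names; the two stages just square a number and add one:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Blocking queue used as a link between two nodes of the data-flow graph.
template <typename T>
class BlockingQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        cv_.notify_one();
    }

    // Marks the end of the stream and wakes any waiting consumer.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Blocks until an item is available. Returns false once the queue has
    // been closed and drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Two nodes wired in series: stage 1 squares each input, stage 2 adds one.
std::vector<int> runPipeline(const std::vector<int>& inputs) {
    BlockingQueue<int> raw;
    BlockingQueue<int> squared;
    std::vector<int> results;
    std::thread stage1([&] {
        int value = 0;
        while (raw.pop(value)) squared.push(value * value);
        squared.close();  // propagate end-of-stream downstream
    });
    std::thread stage2([&] {
        int value = 0;
        while (squared.pop(value)) results.push_back(value + 1);
    });
    for (int value : inputs) raw.push(value);
    raw.close();
    stage1.join();
    stage2.join();
    return results;
}
```

The per-node algorithm stays a plain function call inside the loop, which is the attraction of this design: the threading lives in the links, not in the processing code.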

Don't attempt to multithread everything in the old app. Multithreading for the sake of saying it's multithreaded is a waste of time and money. You're building an app that does something, not a monument to yourself.
Profile and study your execution flows to figure out where the app spends most of its time. A profiler is a great tool for this, but so is just stepping through the code in the debugger. You find the most interesting things in random walks.
Decouple the UI from long-running computations. Use cross-thread communications techniques to send updates to the UI from the computation thread.
As a side-effect of decoupling the UI from long-running computations: think carefully about reentrancy. Now that the compute is running in the background and the user can smurf around in the UI, what things in the UI should be disabled to prevent conflicts with the background operation? Allowing the user to delete a dataset while a computation is running on that data is probably a bad idea. (Mitigation: computation makes a local snapshot of the data.) Does it make sense for the user to spool up multiple compute operations concurrently? If handled well, this could be a new feature and help rationalize the app rework effort. If ignored, it will be a disaster.
Identify specific operations that are candidates to be shoved into a background thread. The ideal candidate is usually a single function or class that does a lot of work (requires a "lot of time" to complete - more than a few seconds) with well defined inputs and outputs, that makes use of no global resources, and does not touch the UI directly. Evaluate and prioritize candidates based on how much work would be required to retrofit to this ideal.
In terms of project management, take things one step at a time. If you have multiple operations that are strong candidates to be moved to a background thread, and they have no interaction with each other, these might be implemented in parallel by multiple developers. However, it would be a good exercise to have everybody participate in one conversion first so that everyone understands what to look for and to establish your patterns for UI interaction, etc. Hold an extended whiteboard meeting to discuss the design and process of extracting the one function into a background thread. Go implement that (together or dole out pieces to individuals), then reconvene to put it all together and discuss discoveries and pain points.
Multithreading is a headache and requires more careful thought than straight up coding, but splitting the app into multiple processes creates far more headaches, IMO. Threading support and available primitives are good in Windows, perhaps better than some other platforms. Use them.
In general, don't do any more than what is needed. It's easy to severely over implement and over complicate an issue by throwing more patterns and standard libraries at it.
If nobody on your team has done multithreading work before, budget time to make an expert or funds to hire one as a consultant.

The main thing you have to do is to disconnect your UI from your data set. I'd suggest that the way to do that is to put a layer in between.
You will need to design a data structure of data cooked-for-display. This will most likely contain copies of some of your back-end data, but "cooked" to be easy to draw from. The key idea here is that this is quick and easy to paint from. You may even have this data structure contain calculated screen positions of bits of data so that it's quick to draw from.
Whenever you get a WM_PAINT message you should get the most recent complete version of this structure and draw from it. If you do this properly, you should be able to handle multiple WM_PAINT messages per second because the paint code never refers to your back end data at all. It's just spinning through the cooked structure. The idea here is that it's better to paint stale data quickly than to hang your UI.
Meanwhile...
You should have 2 complete copies of this cooked-for-display structure. One is what the WM_PAINT message looks at. (call it cfd_A) The other is what you hand to your CookDataForDisplay() function. (call it cfd_B). Your CookDataForDisplay() function runs in a separate thread, and works on building/updating cfd_B in the background. This function can take as long as it wants because it isn't interacting with the display in any way. Once the call returns cfd_B will be the most up-to-date version of the structure.
Now swap cfd_A and cfd_B and InvalidateRect on your application window.
A simplistic way to do this is to have your cooked-for-display structure be a bitmap, and that might be a good way to go to get the ball rolling, but I'm sure with a bit of thought you can do a much better job with a more sophisticated structure.
So, referring back to your example.
In the paint method, it will call a GetData method, often hundreds of times for hundreds of bits of data in one paint operation
This is now 2 threads, the paint method refers to cfd_A and runs on the UI thread. Meanwhile cfd_B is being built by a background thread using GetData calls.
The quick-and-dirty way to do this is
Take your current WM_PAINT code, stick it into a function called PaintIntoBitmap().
Create a bitmap and a Memory DC, this is cfd_B.
Create a thread and pass it cfd_B and have it call PaintIntoBitmap()
When this thread completes, swap cfd_B and cfd_A
Now your new WM_PAINT method just takes the pre-rendered bitmap in cfd_A and draws it to the screen. Your UI is now disconnected from your backend GetData() function.
Now the real work begins, because the quick-and-dirty way doesn't handle window resizing very well. You can go from there to refine what your cfd_A and cfd_B structures are a little at a time until you reach a point where you are satisfied with the result.
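To make the cfd_A / cfd_B hand-off concrete, here is a minimal sketch in standard C++. It is not the literal two-buffer swap described above: it publishes each finished structure through a mutex-protected shared_ptr, so the paint code can keep drawing from the old copy while a new one is installed. CookedFrame, FrameExchange, publish() and latest() are names made up for this illustration, and the InvalidateRect call you would issue after publishing is left out.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical "cooked-for-display" data: whatever the paint code needs, precomputed.
struct CookedFrame {
    std::vector<std::string> labels;  // e.g. text already formatted for drawing
    int version = 0;                  // lets the caller tell frames apart
};

class FrameExchange {
public:
    // Called by the background thread once a new frame (cfd_B) is complete.
    void publish(std::shared_ptr<const CookedFrame> fresh) {
        assert(fresh && "publish() expects a completed frame");
        std::lock_guard<std::mutex> lock(mutex_);
        front_ = std::move(fresh);  // the old frame stays alive while a painter holds it
    }

    // Called by the paint handler: returns the most recent complete frame (cfd_A).
    std::shared_ptr<const CookedFrame> latest() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return front_;
    }

private:
    mutable std::mutex mutex_;
    // Start with an empty frame so the painter always has something to draw from.
    std::shared_ptr<const CookedFrame> front_{std::make_shared<CookedFrame>()};
};
```

The lock is held only long enough to copy or replace a pointer, so the paint path never waits on the slow cooking step.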

You might just start out by breaking the UI and the work task into separate threads.
In your paint method, instead of calling getData() directly, it puts the request in a thread-safe queue. getData() is run in another thread that reads its requests from the queue. When the getData thread is done, it signals the main thread to redraw the visualisation area with its result data, using thread synchronization to pass the data.
While all this is going on you of course have a progress bar saying reticulating splines so the user knows something is going on.
This would keep your UI snappy without the significant pain of multithreading your work routines (which can be akin to a total rewrite)
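A rough sketch of that queue idea in standard C++ (std::thread, std::mutex, std::condition_variable) follows. RequestQueue, processOnWorker and the toy getData() body are invented for the example, and the final join() stands in for whatever "please redraw" signal your GUI toolkit provides.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking queue: the UI thread pushes requests, the worker pops them.
template <typename T>
class RequestQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Blocks until an item is available or close() has been called.
    // Returns false once the queue is closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = std::move(items_.front());
        items_.pop();
        return true;
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
    bool closed_ = false;
};

// Hypothetical stand-in for the slow getData() call.
inline int getData(int key) { return key * key; }

// Runs getData() for every request on a worker thread; results come back in request order.
std::vector<int> processOnWorker(const std::vector<int>& keys) {
    RequestQueue<int> requests;
    std::vector<int> results;
    std::thread worker([&] {
        int key = 0;
        while (requests.pop(key)) results.push_back(getData(key));
    });
    for (int key : keys) requests.push(key);
    requests.close();
    worker.join();  // a real GUI would post a redraw message instead of blocking here
    assert(results.size() == keys.size());
    return results;
}
```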

It sounds like you have several different issues that parallelism can address, but in different ways.
Performance increases through utilizing multicore CPU architectures
You're not taking advantage of the multi-core CPU architectures that are becoming so common. Parallelization allows you to divide work amongst multiple cores. You can write that code through standard C++ divide and conquer techniques using a "functional" style of programming where you pass work to separate threads at the divide stage. Google's MapReduce pattern is an example of that technique. Intel has the new CILK library to give you C++ compiler support for such techniques.
Greater GUI responsiveness through asynchronous document-view
By separating the GUI operations from the document operations and placing them on different threads, you can increase the apparent responsiveness of your application. The standard Model-View-Controller or Model-View-Presenter design patterns are a good place to start. You need to parallelize them by having the model inform the view of updates rather than have the view provide the thread on which the document computes itself. The View would call a method on the model asking it to compute a particular view of the data, and the model would inform the presenter/controller as information is changed or new data becomes available, which would get passed to the view to update itself.
Opportunistic caching and pre-calculation
It sounds like your application has a fixed base of data, but many possible compute-intensive views on the data. If you did a statistical analysis on which views were most commonly requested in what situations, you could create background worker threads to pre-calculate the likely-requested values. It may be useful to put these operations on low-priority threads so that they don't interfere with the main application processing.
Obviously, you'll need to use mutexes (or critical sections), events, and probably semaphores to implement this. You may find some of the new synchronization objects in Vista useful, like the slim reader-writer lock, condition variables, or the new thread pool API. See Joe Duffy's book on concurrency for how to use these basic techniques.

There is something that no-one has talked about yet, but which is quite interesting.
It's called futures. A future is the promise of a result... let's see with an example.
future<int> leftVal = computeLeftValue(treeNode); // [1]
int rightVal = computeRightValue(treeNode); // [2]
result = leftVal + rightVal; // [3]
It's pretty simple:
You spin off a thread that starts computing leftVal, taking it from a pool for example to avoid the initialization problem.
While leftVal is being computed, you compute rightVal.
You add the two, this may block if leftVal is not computed yet and wait for the computation to end.
The great benefit here is that it's straightforward: each time you have one computation followed by another that is independent and you then join the result, you can use this pattern.
See Herb Sutter's article on futures, they will be available in the upcoming C++0x but there are already libraries available today even if the syntax is perhaps not as pretty as I would make you believe ;)
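For what it's worth, the three numbered steps map almost one-to-one onto std::async and std::future as they were eventually standardized in C++11. The sketch below assumes a toy TreeNode and a sumTree() helper, both invented for the example; only std::async, std::launch::async and std::future are real library names.

```cpp
#include <cassert>
#include <future>

// Hypothetical tree node; only what the example needs.
struct TreeNode {
    int value;
    const TreeNode* left;
    const TreeNode* right;
};

// Plain recursive sum, used for both halves of the work.
int sumTree(const TreeNode* node) {
    if (!node) return 0;
    return node->value + sumTree(node->left) + sumTree(node->right);
}

int sumBothSides(const TreeNode& root) {
    // [1] start computing the left subtree on another thread
    std::future<int> leftVal = std::async(std::launch::async, sumTree, root.left);
    // [2] meanwhile, compute the right subtree on this thread
    int rightVal = sumTree(root.right);
    // [3] get() blocks only if the left result is not ready yet
    assert(leftVal.valid());
    return root.value + leftVal.get() + rightVal;
}
```

Whether std::async takes its thread from a pool is up to the implementation, so don't count on that part of the description.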

If it was my development dollars I was spending, I would start with the big picture:
What do I hope to accomplish, and how much will I spend to accomplish this, and how will I be further ahead? (If the answer to this is, my app will run 10% better on quadcore PCs, and I could have achieved the same result by spending $1000 more per customer PC , and spending $100,000 less this year on R&D, then, I would skip the whole effort).
Why am I doing multi-threaded instead of massively parallel distributed? Do I really think threads are better than processes? Multi-core systems also run distributed apps pretty well. And there are some advantages to message-passing process based systems that go beyond the benefits (and the costs!) of threading. Should I consider a process-based approach? Should I consider a background running entirely as a service, and a foreground GUI? Since my product is node-locked and licensed, I think services would suit me (vendor) quite well. Also, separating stuff into two processes (background service and foreground) just might force the kind of rewrite and rearchitecting to occur that I might not be forced to do, if I was to just add threading into my mix.
This is just to get you thinking: What if you were to rewrite it as a service (background app) and a GUI, because that would actually be easier than adding threading, without also adding crashes, deadlocks, and race conditions?
Consider the idea that for your needs, perhaps threading is evil. Develop your religion, and stick with that. Unless you have a real good reason to go the other way. For many years, I religiously avoided threading. Because one thread per process is good enough for me.
I don't see any really solid reasons in your list why you need threading, except ones that could be more inexpensively solved by more expensive target computer hardware. If your app is "too slow" adding in threads might not even speed it up.
I use threads for background serial communications, but I would not consider threading merely for computationally heavy applications, unless my algorithms were so inherently parallel as to make the benefits clear, and the drawbacks minimal.
I wonder if the "design" problems that this C++Builder app has are like my Delphi "RAD Spaghetti" application disease. I have found that a wholesale refactor/rewrite (over a year per major app that I have done this to), was a minimum amount of time for me to get a handle on application "accidental complexity". And that was without throwing a "threads where possible" idea. I tend to write my apps with threads for serial communication and network socket handling, only. And maybe the odd "worker-thread-queue".
If there is a place in your app you can add ONE thread, to test the waters, I would look for the main "work queue" and I would create an experimental version control branch, and I would learn about how my code works by breaking it in the experimental branch. Add that thread. And see where you spend your first day of debugging. Then I might just abandon that branch and go back to my trunk until the pain in my temporal lobe subsides.
Warren

Here's what I would do...
I would start by profiling your app and seeing:
1) what is slow and what the hot paths are
2) which calls are reentrant or deeply nested
you can use 1) to determine where the opportunity is for speedups and where to start looking for parallelization.
you can use 2) to find out where the shared state is likely to be and get a deeper sense of how much things are tangled up.
I would use a good system profiler and a good sampling profiler (like the Windows Performance Toolkit or the concurrency views of the profiler in Visual Studio 2010 Beta2 - these are both 'free' right now).
Then I would figure out what the goal is and how to separate things gradually to a cleaner design that is more responsive (moving work off the UI thread) and more performant (parallelizing computationally intensive portions). I would focus on the highest priority and most noticeable items first.
If you don't have a good refactoring tool like VisualAssist, invest in one - it's worth it. If you're not familiar with Michael Feathers or Kent Beck's refactoring books, consider borrowing them. I would ensure my refactorings are well covered by unit tests.
You can't move to VS (I would recommend the products I work on, the Asynchronous Agents Library & Parallel Pattern Library; you can also use TBB or OpenMP).
In boost, I would look carefully at boost::thread, the asio library and the signals library.
I would ask for help / guidance / a listening ear when I got stuck.
-Rick

You can also look at this article from Herb Sutter You have a mass of existing code and want to add concurrency. Where do you start?

Well, I think you're expecting a lot based on your comments here. You're not going to go from minutes to milliseconds by multithreading. The most you can hope for is the current amount of time divided by the number of cores. That being said, you're in a bit of luck with C++. I've written high performance multiprocessor scientific apps, and what you want to look for is the most embarrassingly parallel loop you can find. In my scientific code, the heaviest piece is calculating somewhere between 100 and 1000 data points. However, all of the data points can be calculated independently of the others. You can then split the loop using OpenMP. This is the easiest and most efficient way to go. If your compiler doesn't support OpenMP, then you will have a very hard time porting existing code. With OpenMP (if you're lucky), you may only have to add a couple of #pragmas to get 4-8x the performance. Here's an example StochFit
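As a rough illustration of the "couple of #pragmas" claim, here is a sketch of an independent per-point loop split with OpenMP. evaluatePoint() is a made-up placeholder for the expensive calculation, not anything taken from StochFit. If the compiler is run without OpenMP enabled (for example without -fopenmp on GCC), the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical per-point model; stands in for the expensive calculation.
inline double evaluatePoint(double x) { return std::sqrt(x) * 2.0; }

std::vector<double> evaluateAll(const std::vector<double>& xs) {
    std::vector<double> ys(xs.size());
    const long n = static_cast<long>(xs.size());
    // Each iteration reads xs[i] and writes only ys[i], so iterations are independent.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        ys[i] = evaluatePoint(xs[i]);
    }
    assert(ys.size() == xs.size());
    return ys;
}
```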

I hope this will help you understand and convert your monolithic single-threaded app to a multithreaded one easily. Sorry, it is for another programming language, but nevertheless the principles explained are the same all over.
http://www.freevbcode.com/ShowCode.Asp?ID=1287
Hope this helps.

The first thing you must do is to separate your GUI from your data, the second is to create a multithreaded class.
STEP 1 - Responsive GUI
We can assume that the image you are producing is contained in the canvas of a TImage. You can put a simple TTimer in your form and write code like this in its timer handler:
if (CurrentData.LastUpdate > CurrentUpdate)
{
    // Redraw only when the data is newer than what is on screen
    Image1->Canvas->Draw(0, 0, CurrentData.Bitmap);
    CurrentUpdate = Now();
}
OK! I know! It is a little bit dirty, but it's fast and simple. The point is that:
You need an object that is created in the main thread
The object is copied into the form you need, only when it is needed and in a safe way (OK, better protection for the Bitmap may be needed, but for simplicity...)
The object CurrentData is your actual project, single threaded, that produces an image
Now you have a fast and responsive GUI. If your algorithm is slow, the refresh is slow, but your user will never think that your program has frozen.
STEP 2 - Multithread
I suggest you to implement a class like the following:
SimpleThread.h
typedef void (__closure *TThreadFunction)(void* Data);
class TSimpleThread : public TThread
{
public:
TSimpleThread( TThreadFunction _Action,void* _Data = NULL, bool RunNow = true );
void AbortThread();
__property Terminated;
protected:
TThreadFunction ThreadFunction;
void* Data;
private:
virtual void __fastcall Execute() { ThreadFunction(Data); };
};
SimpleThread.c
TSimpleThread::TSimpleThread( TThreadFunction _Action,void* _Data, bool RunNow)
: TThread(true), // initialize suspended
ThreadFunction(_Action), Data(_Data)
{
FreeOnTerminate = false;
if (RunNow) Resume();
}
void TSimpleThread::AbortThread()
{
Suspend(); // Can't kill a running thread
Free(); // Kills thread
}
Let's explain. Now, in your single-threaded class you can create an object like this:
TSimpleThread *ST;
ST = new TSimpleThread(RefreshFunction, NULL, true);
ST->Resume(); // redundant here because RunNow is true; needed only when RunNow is false
Let's explain better: now, in your own monolithic class, you have created a thread. Moreover, you have moved a function (i.e. RefreshFunction) into a separate thread. The scope of your function is the same, the class is the same, but the execution is separate.

My number one suggestion, although it's very late (sorry for reviving old thread, it's interesting!) is to seek out homogeneous transform loops where each iteration of the loop is mutating a completely independent piece of data from the other iterations.
Instead of thinking about how to turn this old codebase into an asynchronous one running all kinds of operations in parallel (which could be asking for all kinds of trouble from worse than single-threaded performance from poor locking patterns or exponentially worse, race conditions/deadlocks by trying to do this in hindsight to code you can't fully comprehend), stick to the sequential mindset for the overall application design for now but identify or extract simple, homogeneous transform loops. Don't go from intrusive broad design-level multithreading and then try to drill into details. Work from non-intrusive multithreading of fine implementation details and specific hotspots first.
What I mean by homogeneous loops is basically one that transforms data in a very straightforward way, like:
for each pixel in image:
make it brighter
That is very simple to reason about and you can safely parallelize this loop without any problems whatsoever using OMP or TBB or whatever and without getting tangled up in thread synchronization. It only takes one glance at this code to fully comprehend its side effects.
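To show how little synchronization such a loop needs, here is a hand-rolled sketch using plain std::thread instead of OMP or TBB. The image is assumed to be a flat vector of 8-bit values, and brightenRange / brightenParallel are names invented for the example. Each worker gets its own disjoint slice, so no locks are involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Brightens pixels in [begin, end); touches no other data.
void brightenRange(std::vector<std::uint8_t>& pixels,
                   std::size_t begin, std::size_t end, int amount) {
    for (std::size_t i = begin; i < end; ++i) {
        const int raised = pixels[i] + amount;
        pixels[i] = static_cast<std::uint8_t>(std::max(0, std::min(255, raised)));
    }
}

// Splits the image into contiguous chunks and brightens each chunk on its own thread.
void brightenParallel(std::vector<std::uint8_t>& pixels, int amount, unsigned threadCount) {
    assert(threadCount > 0);
    const std::size_t chunk = (pixels.size() + threadCount - 1) / threadCount;
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < pixels.size(); begin += chunk) {
        const std::size_t end = std::min(pixels.size(), begin + chunk);
        workers.emplace_back(brightenRange, std::ref(pixels), begin, end, amount);
    }
    for (std::thread& w : workers) w.join();  // all slices done before returning
}
```

With OMP the same thing is a single pragma on the loop; the point is that the side effects are obvious either way.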
Try to find as many hotspots as you can which fit this type of simple homogeneous transform loop and if you have complex loops which update many different types of data with complex control flows that trigger complex side effects, then seek to refactor towards these homogeneous loops. Often a complex loop which causes 3 disparate side effects to 3 different types of data can be turned into 3 simple homogeneous loops which each trigger just one kind of side effect to one type of data with a simpler control flow. Doing multiple loops instead of one might seem a tad wasteful, but the loops become simpler, the homogeneity will often lead to more cache-friendly sequential memory access patterns vs. sporadic random-access patterns, and you then tend to find many more opportunities to safely parallelize (as well as vectorize) the code in a straightforward way.
First you have to thoroughly understand the side effects of any code you attempt to parallelize (and I mean thoroughly!!!), so seeking out these homogeneous loops gives you isolated areas of the codebase you can easily reason about in terms of the side effects to the point where you can confidently and safely parallelize those hotspots. It'll also improve the maintainability of the code by making it very easy to reason about the state changes going on in that particular piece of code. Save the dream of the uber multithreaded application running everything in parallel for later. For now, focus on identifying/extracting performance-critical, homogeneous loops with simple control flows and simple side effects. Those are your priority targets for parallelization with simple parallelized loops.
Now admittedly I somewhat dodged your questions, but most of them don't need to apply if you do what I suggest, at least until you've kind of worked your way out to the point where you're thinking more about multithreading designs as opposed to simply parallelizing implementation details. And you might not even need to go that far to have a very competitive product in terms of performance. If you have beefy work to do in a single loop, you can devote the hardware resources to making that loop go faster instead of making many operations run simultaneously. If you have to resort to more async methods, for example if your hotspots are more I/O bound, seek an async/wait approach where you fire off an async task but do some things in the meantime and then wait on the async task(s) to complete. Even if that's not absolutely necessary, the idea is to section off isolated areas of your codebase where you can, with 100% confidence (or at least 99.9999999%) say that the multithreaded code is correct.
You don't ever want to gamble with race conditions. There's nothing more demoralizing than finding some obscure race condition that only occurs once in a full moon on some random user's machine while your entire QA team is unable to reproduce it, only to, 3 months later, run into it yourself except during that one time you ran a release build without debugging info available while you then toss and turn in your sleep knowing your codebase can flake out at any given moment but in ways that no one will ever be able to consistently reproduce. So take it easy with multithreading legacy codebases, at least for now, and stick to multithreading isolated but critical sections of the codebase where the side effects are dead simple to reason about. And test the crap out of it -- ideally apply a TDD approach where you write a test for the code you're going to multithread to ensure it gives the correct output after you finish... though race conditions are the types of things that easily fly under the radar of unit and integration testing, so again you absolutely need to be able to comprehend the entirety of the side effects that go on in a given piece of code before you attempt to multithread it. The best way to do that is to make the side effects as easy to comprehend as possible with the simplest control flows causing just one type of side effect for an entire loop.

It is hard to give you proper guidelines. But...
The easiest way out, in my opinion, is to convert your application to an ActiveX EXE. Since COM has support for threading etc. built right into it, your program will automatically become a multithreaded application. Of course you will have to make quite a few changes to your code. But this is the shortest and safest way to go.
I am not sure but probably RichClient Toolset lib may do the trick for you. On the site the author has written:
"It also offers registration free Loading/Instancing-capabilities for ActiveX-Dlls and new, easy to use Threading-approach, which works with Named-Pipes under the hood and works therefore also cross-process."
Please check it out. Who knows it may be the right solution for your requirements.
As for project management, I think you can continue using what is provided in the IDE of your choice by integrating it with SVN through plugins.
I forgot to mention that we have completed an application for the share market that automatically trades (buys and sells based on lows and highs) into those scripts that are in the user's portfolio, based on an algorithm that we have developed.
While developing this software we were facing the same kind of problem as you have illustrated here. To solve it we converted our application into an ActiveX EXE and we converted all those parts that need to execute in parallel into ActiveX DLLs. We have not used any third party libs for this!
HTH

Making a game in C++ using parallel processing

I wanted to "emulate" a popular flash game, Chronotron, in C++ and needed some help getting started. (NOTE: Not for release, just practicing for myself)
Basics:
Player has a time machine. On each iteration of using the time machine, a parallel state
is created, co-existing with a previous state. One of the states must complete all the
objectives of the level before ending the stage. In addition, all the stages must be able
to end the stage normally, without causing a state paradox (wherein they should have
been able to finish the stage normally but, due to the interactions of another state,
were not).
So, that sort of explains how the game works. You should play it a bit to really
understand what my problem is.
I'm thinking a good way to solve this would be to use linked lists to store each state,
which will probably either be a hash map, based on time, or a linked list that iterates
based on time. I'm still unsure.
ACTUAL QUESTION:
Now that I have some rough specs, I need some help deciding on which data structures to use for this, and why. Also, I want to know what Graphics API/Layer I should use to do this: SDL, OpenGL, or DirectX (my current choice is SDL). And how would I go about implementing parallel states? With parallel threads?
EDIT (To clarify more):
OS -- Windows (since this is a hobby project, may do this in Linux later)
Graphics -- 2D
Language -- C++ (must be C++ -- this is practice for a course next semester)
Q-Unanswered: SDL : OpenGL : Direct X
Q-Answered: Avoid Parallel Processing
Q-Answered: Use STL to implement time-step actions.
So far from what people have said, I should:
1. Use STL to store actions.
2. Iterate through actions based on time-step.
3. Forget parallel processing -- period. (But I'd still like some pointers as to how it
could be used and in what cases it should be used, since this is for practice).
Appending to the question, I've mostly used C#, PHP, and Java before so I wouldn't describe myself as a hotshot programmer. What C++ specific knowledge would help make this project easier for me? (ie. Vectors?)

What you should do is first to read and understand the "fixed time-step" game loop (Here's a good explanation: http://www.gaffer.org/game-physics/fix-your-timestep).
Then what you do is to keep a list of list of pairs of frame counter and action. STL example:
std::list<std::list<std::pair<unsigned long, Action> > > state;
Or maybe a vector of lists of pairs. To create the state, for every action (player interaction) you store the frame number and what action is performed, most likely you'd get the best results if action simply was "key <X> pressed" or "key <X> released":
state.back().push_back(std::make_pair(currentFrame, VK_LEFT | KEY_PRESSED));
To play back the previous states, you'd have to reset the frame counter every time the player activates the time machine and then iterate through the state list for each previous state and see if any matches the current frame. If there is, perform the action for that state.
To optimize you could keep a list of iterators to where you are in each previous state-list. Here's some pseudo-code for that:
typedef std::list<std::pair<unsigned long, Action> > StateList;
std::list<StateList::iterator> stateIteratorList;
//
foreach(it in stateIteratorList)
{
if(it->first == currentFrame)
{
performAction(it->second);
++it;
}
}
I hope you get the idea...
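In case the pseudo-code is too terse, here is one way it could look as compilable C++. Note two details the pseudo-code glosses over: the cursors must be advanced by reference (otherwise the ++ is lost), and each cursor needs to know where its list ends. Action is assumed to be a plain int here (a key code combined with a pressed/released flag), and Cursor, makeCursors and actionsForFrame are names made up for the sketch.

```cpp
#include <cassert>
#include <list>
#include <utility>
#include <vector>

using Action = int;  // hypothetical: e.g. a key code combined with a pressed/released flag
using StateList = std::list<std::pair<unsigned long, Action>>;

// One playback cursor per recorded state: where we are and where the recording ends.
struct Cursor {
    StateList::const_iterator next;
    StateList::const_iterator end;
};

// Call this each time the player activates the time machine and the frame counter resets.
std::vector<Cursor> makeCursors(const std::list<StateList>& states) {
    std::vector<Cursor> cursors;
    for (const StateList& s : states) cursors.push_back({s.begin(), s.end()});
    return cursors;
}

// Returns the actions due on currentFrame and advances each cursor past them.
std::vector<Action> actionsForFrame(std::vector<Cursor>& cursors, unsigned long currentFrame) {
    std::vector<Action> due;
    for (Cursor& c : cursors) {  // by reference, so advancing the cursor persists
        // Frames must be replayed in increasing order, with no frame skipped.
        assert(c.next == c.end || c.next->first >= currentFrame);
        while (c.next != c.end && c.next->first == currentFrame) {
            due.push_back(c.next->second);
            ++c.next;
        }
    }
    return due;
}
```

In the game loop you would call actionsForFrame() once per fixed time-step and feed each returned action to performAction().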
Separate threads would simply complicate the matter greatly, this way you get the same result every time, which you cannot guarantee by using separate threads (can't really see how that would be implemented) or a non-fixed time-step game loop.
When it comes to graphics API, I'd go with SDL as it's probably the easiest thing to get you started. You can always use OpenGL from SDL later on if you want to go 3D.

This sounds very similar to Braid. You really don't want parallel processing for this - parallel programming is hard, and for something like this, performance should not be an issue.
Since the game state vector will grow very quickly (probably on the order of several kilobytes per second, depending on the frame rate and how much data you store), you don't want a linked list, which has a lot of overhead in terms of space (and can introduce big performance penalties due to cache misses if it is laid out poorly). For each parallel timeline, you want a vector data structure. You can store each parallel timeline in a linked list. Each timeline knows at what time it began.
To run the game, you iterate through all active timelines and perform one frame's worth of actions from each of them in lockstep. No need for parallel processing.
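A small sketch of that layout, assuming one recorded action per frame and an int standing in for whatever an action really is. Timeline, record and stepAll are invented names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

using Action = int;  // hypothetical placeholder for a real action type

struct Timeline {
    unsigned long startFrame;      // frame at which this timeline began
    std::vector<Action> perFrame;  // one recorded action per frame since startFrame
};

// Appends the action for `frame`; recording must proceed one frame at a time.
void record(Timeline& t, unsigned long frame, Action a) {
    assert(frame == t.startFrame + t.perFrame.size() && "one action per frame, in order");
    t.perFrame.push_back(a);
}

// Collects one frame's worth of actions from every timeline that is active at `frame`.
std::vector<Action> stepAll(const std::list<Timeline>& timelines, unsigned long frame) {
    std::vector<Action> actions;
    for (const Timeline& t : timelines) {
        if (frame < t.startFrame) continue;          // this timeline has not started yet
        const std::size_t offset = frame - t.startFrame;
        if (offset >= t.perFrame.size()) continue;   // this recording has run out
        actions.push_back(t.perFrame[offset]);
    }
    return actions;
}
```

Everything runs on one thread in a fixed order, so the replay gives the same result every time.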

I have played this game before. I don't necessarily think parallel processing is the way to go. You have shared objects in the game (levers, boxes, elevators, etc) that will need to be shared between processes, possibly with every delta, thereby reducing the effectiveness of the parallelism.
I would personally just keep a list of actions, then for each subsequent iteration start interleaving them together. For example, if the list is in the format of <[iteration.action]> then the 3rd time thru would execute actions 1.1, 2.1, 3.1, 1.2, 2.2, 3.2, etc.

After briefly glossing over the description, I think you have the right idea. I would have a state object that holds the state data, and place this into a linked list... I don't think you need parallel threads...
As far as the graphics API, I have only used OpenGL, and can say that it is pretty powerful and has a good C / C++ API. OpenGL would also be more cross platform, as you can use the Mesa library on *Nix computers.

A very interesting game idea. I think you are right that parallel computing would be beneficial to this design, but no more than any other high resource program.
The question is a bit ambiguous. I see that you are going to write this in C++ but what OS are you coding it for? Do you intend on it being cross platform and what kind of graphics would you like, i.e. 3D, 2D, high end, web based?
So basically we need a lot more information.

Parallel processing isn't the answer. You should simply "record" the player's actions then play them back for the "previous actions".
So you create a vector (singly linked list) of vectors that holds the actions. Simply store the frame number that the action was taken (or the delta) and complete that action on the "dummy bot" that represents the player during that particular instance. You simply loop through the states and trigger them one after another.
You get a side effect of easily "breaking" the game when a state paradox happens simply because the next action fails.

Unless you're desperate to use C++ for your own education, you should definitely look at XNA for your game & graphics framework (it uses C#). It's completely free, it does a lot of things for you, and soon you'll be able to sell your game on Xbox Live.
To answer your main question, nothing that you can already do in Flash would ever need to use more than one thread. Just store a list of positions in an array and loop through with a different offset for each robot.
