I have started working on multithreading and point cloud processing. The problem is that I have to add multithreading to an existing implementation, and there are so many read and write operations that using a mutex does not give me enough speed-up, because of the many read operations on the grid.
In the end I modified the code so that I have one vtkSmartPointer<vtkUnstructuredGrid> which holds my point cloud. The only operation the threads have to do is access points using the GetPoint method. However, it is not thread-safe even for read-only operations, due to smart pointers.
Because of that I had to copy my main point cloud for each thread, which causes memory issues if I have too many threads and big clouds.
I tried to cut the point cloud into chunks, but that gets too complicated again when I have too many threads: I cannot guarantee an optimal number of points for each thread to process. I also do a neighbour search for each point, so cutting the point cloud into chunks gets even more complicated, because the chunks need to overlap for the neighbourhood search to work properly.
Since vtkUnstructuredGrid is memory-optimized, I could not replace it with STL containers. I would be happy if you could recommend data structures for point cloud processing that are thread-safe to read, or any other solution I could use.
Thanks in advance
I am not familiar with VTK or how it works.
In general, there are various techniques and methods to improve performance in multi-threading environment. The question is vague, so I can only provide a general vague answer.
Easy: If there are many reads and few writes, use std::shared_mutex (C++17), as it allows multiple simultaneous readers while writers still get exclusive access.
Moderate: If the threads work with distinct data most of the time (they access the same data array but at distinct locations), you can implement a handler that ensures the threads work concurrently on non-overlapping pieces of data. If a thread asks for a piece of data that is currently being processed, tell it to work on something else or wait.
Hard: There are methods that allow efficient concurrency via std::atomic by relying on memory-ordering guarantees. I am not too familiar with them and they are definitely not simple, but you can look for tutorials on the internet. As far as I know, certain parts of such methods are still in research and development, and best practices aren't yet established.
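To illustrate the "Easy" option, here is a minimal sketch of a reader/writer lock, assuming C++17. The SharedGrid class and the sumFromReaders helper are made-up names for this example and have nothing to do with VTK.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical container: many readers may hold the lock at once,
// while a writer gets exclusive access.
class SharedGrid {
public:
    explicit SharedGrid(std::size_t n) : values_(n, 0.0) {}

    double read(std::size_t i) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);  // shared: readers do not block each other
        return values_[i];
    }

    void write(std::size_t i, double v) {
        std::unique_lock<std::shared_mutex> lock(mutex_);  // exclusive: blocks readers and writers
        values_[i] = v;
    }

    // The size never changes after construction, so no lock is taken here.
    std::size_t size() const { return values_.size(); }

private:
    mutable std::shared_mutex mutex_;
    std::vector<double> values_;
};

// Every reader thread sums the whole grid; returns the sum of the per-thread sums.
double sumFromReaders(const SharedGrid& grid, unsigned readers) {
    std::vector<double> partial(readers, 0.0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < readers; ++t) {
        threads.emplace_back([&grid, &partial, t] {
            for (std::size_t i = 0; i < grid.size(); ++i) partial[t] += grid.read(i);
        });
    }
    for (auto& th : threads) th.join();
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

Taking a lock on every single read is still not free, so if the data is never written while the threads run, dropping the lock entirely is usually the faster choice.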
P.S. If there are many reads/writes over the same data... is the implementation even aware of the fact that the data is shared over several threads? Does it even perform correctly? You might end up needing to rewrite the whole implementation.
I just thought I'd post the solution, because it was actually my own stupidity. I realized that in one part of my code I was using the double* vtkDataSet::GetPoint(vtkIdType ptId) version of GetPoint(), which is not thread-safe.
For multithreaded code, void vtkDataSet::GetPoint(vtkIdType id, double x[3]) should be used.
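The difference between the two overloads can be sketched without VTK. The PointStore class below is a hypothetical stand-in, written under the assumption that a pointer-returning getter hands every caller the same internal scratch buffer; it is not VTK's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a point container; this is not VTK code.
class PointStore {
public:
    explicit PointStore(std::size_t n) : coords_(3 * n) {
        for (std::size_t i = 0; i < n; ++i)
            coords_[3 * i] = coords_[3 * i + 1] = coords_[3 * i + 2] = static_cast<double>(i);
    }

    // Unsafe pattern: every caller gets a pointer to the same scratch buffer,
    // so two threads calling this concurrently overwrite each other's result.
    const double* getPoint(std::size_t id) {
        for (int k = 0; k < 3; ++k) scratch_[k] = coords_[3 * id + k];
        return scratch_;
    }

    // Safe pattern for concurrent reads: the result goes into caller-owned storage.
    void getPoint(std::size_t id, double x[3]) const {
        for (int k = 0; k < 3; ++k) x[k] = coords_[3 * id + k];
    }

    std::size_t size() const { return coords_.size() / 3; }

private:
    std::vector<double> coords_;
    double scratch_[3] = {0.0, 0.0, 0.0};
};

// Each thread reads every point into its own local buffer and counts mismatches.
std::size_t countCorruptedReads(const PointStore& store, unsigned threadCount) {
    std::vector<std::size_t> bad(threadCount, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t) {
        threads.emplace_back([&store, &bad, t] {
            double x[3];  // storage owned by this thread
            for (std::size_t id = 0; id < store.size(); ++id) {
                store.getPoint(id, x);
                const double expected = static_cast<double>(id);
                if (x[0] != expected || x[1] != expected || x[2] != expected) ++bad[t];
            }
        });
    }
    for (auto& th : threads) th.join();
    std::size_t total = 0;
    for (std::size_t b : bad) total += b;
    return total;
}
```

With the caller-buffer overload, countCorruptedReads should report zero as long as nobody writes to the store while the threads run.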
I have a computational algebra task I need to code up. The problem is broken into well-defined individual tasks that naturally form a tree: the task is combinatorial in nature, so there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large.
I had previously coded this up in a functional fashion, calling the functions as needed and storing everything in RAM. This was a terrible approach, but I was more concerned about the theory then.
I'm planning to rewrite the code in C++ for a variety of reasons. I have a few requirements:
Checkpointing: The calculation takes a long time, so I need to be able to stop at any point and resume later.
Separate individual tasks as objects: This helps me keep a good handle of where I am in the computations, and offers a clean way to do checkpointing via serialization.
Multi-threading: The task is clearly embarrassingly parallel, so it'd be neat to exploit that. I'd probably want to use Boost threads for this.
I would like suggestions on how to actually implement such a system. Ways I've thought of doing it:
Implement tasks as a simple stack. When you hit a task that needs subcalculations done, it checks if it has all the subcalculations it requires. If not, it creates the subtasks and throws them onto the stack. If it does, then it calculates its result and pops itself from the stack.
Store the tasks as a tree and do something like a depth-first visitor pattern. This would create all the tasks at the start and then computation would just traverse the tree.
These don't seem quite right, because the lower levels require a vast number of subtasks. I could approach it in an iterator fashion at this level, I guess.
I feel like I'm over-thinking it and there's already a simple, well-established way to do something like this. Is there one?
Technical details in case they matter:
The task tree has 5 levels.
Branching factor of the tree is really small (say, between 2 and 5) for all levels except the lowest which is on the order of a few million.
Each individual task would only need to store a result tens of bytes large. I don't mind using the disk as much as possible, so long as it doesn't kill performance.
For debugging, I'd have to be able to recall/recalculate any individual task.
All the calculations are discrete mathematics: calculations with integers, polynomials, and groups. No floating point at all.
there's a main task which requires a small number of sub-calculations to get its results. Those sub-calculations have sub-sub-calculations and so on. Each calculation only depends on the calculations below it in the tree (assuming the root node is the top). No data sharing needs to happen between branches. At lower levels the number of subtasks may be extremely large... blah blah resuming, multi-threading, etc.
Correct me if I'm wrong, but it seems to me that you are exactly describing a map-reduce algorithm.
Just read what Wikipedia says about map-reduce:
"Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
Using an existing mapreduce framework could save you a huge amount of time.
I just googled "map reduce C++" and started to get results, notably one in Boost: http://www.craighenderson.co.uk/mapreduce/
These don't seem quite right because of the problems of the lower levels requiring a vast number of subtasks. I could approach it in a iterator fashion at this level, I guess.
You definitely do not want millions of CPU-bound threads. You want at most N CPU-bound threads, where N is the product of the number of CPUs and the number of cores per CPU on your machine. Exceed N by a little bit and you are slowing things down a bit. Exceed N by a lot and you are slowing things down a whole lot. The machine will spend almost all its time swapping threads in and out of context, spending very little time executing the threads themselves. Exceed N by a whole lot and you will most likely crash your machine (or hit some limit on threads). If you want to farm lots and lots (and lots and lots) of parallel tasks out at once, you either need to use multiple machines or use your graphics card.
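As a rough sketch of that advice, the following runs a large number of small tasks on a fixed number of threads taken from std::thread::hardware_concurrency(). The leafTask function is a placeholder invented for this example; the real lowest-level calculation would go there.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical leaf computation; stands in for one of the millions of lowest-level tasks.
std::uint64_t leafTask(std::uint64_t index) { return index % 7; }

// Runs `taskCount` leaf tasks on a fixed number of threads and combines the results.
std::uint64_t runLeafTasks(std::uint64_t taskCount) {
    // hardware_concurrency() may return 0 when the value is unknown.
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<std::uint64_t> next{0};  // next unclaimed task index
    std::vector<std::uint64_t> partial(workers, 0);
    std::vector<std::thread> threads;

    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            const std::uint64_t batch = 1024;  // claim tasks in batches to reduce contention
            for (;;) {
                const std::uint64_t begin = next.fetch_add(batch);
                if (begin >= taskCount) break;
                const std::uint64_t end = std::min(begin + batch, taskCount);
                for (std::uint64_t i = begin; i < end; ++i) partial[w] += leafTask(i);
            }
        });
    }
    for (auto& t : threads) t.join();

    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}
```

The tasks are never materialised as objects; each thread just claims the next range of indices, which also fits the iterator-style approach mentioned in the question.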
I have a tree which has all the directories and files as its nodes. I want to search for a particular file. Say the tree is spread widely and I want to do a breadth-first search to find that file, using multithreading. How should I do that? What is a good approach?
There are some cases where multithreading the search will provide a useful speedup: if the tree spans more than one disk, for example, or if some of the disks/nodes are accessed over a network.
I certainly would not want to try creating threads for every folder. That's thousands of create/run/terminate, thousands of stack allocation/free etc. Gross, avoidable overhead.
A multithreaded search can be done, but as other posters have said, look at available alternatives first. Then read the rest of this post. Then look again.
I have done something like this once using a queue approach similar to that suggested by Matt.
I don't want to ever do it again:
I used a producer-consumer work queue on which 6 threads waited for work, (6 because testing showed this to be optimum with my problem). The threads were all created once at startup and never terminated. None of this continual create/load/run/waitFor/getResult/terminateIfYoureLucky stuff that unaccountably seems to be popular with developers despite poor performance, shutdown AVs, 216/217 messageBoxes etc etc.
The work came in the form of a 'folderSearch' class that contained the path to be searched, a file match function event to call and the FindFirst/FindNext loop method to do the searching. I created a couple hundred of these at startup in a pool, (ie pushed onto another P-C pool queue:). When the FF/FN iterated the files in the folder to look for matching files, encountering a sub-folder resulted in extracting another folderSearch instance from the pool, loading it up with the new path & pushing it onto the work queue - some thread would then pick it up and iterate the sub-folder. The class had a list for the paths to matching files and a 'results' event to call, (with 'this' as parameter, of course), if it found something of interest. If a folderSearch got to the end of a twig, having found nothing and with nothing left to search, it would release itself back to the pool, (well, OK, the thread would do it, but you know what I mean:).
There was no need for any explicit 'load balancing'. If one node was exceptionally deep, it would naturally end up with all six threads working on its subtrees because the other paths are exhausted.
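The queue-based design described above can be sketched roughly as follows. This is not the original code: the Folder struct is a hypothetical in-memory stand-in for a directory (a real version would run a FindFirst/FindNext-style loop), and the object pool and result events are left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical in-memory stand-in for a folder; a real search would list a directory instead.
struct Folder {
    std::string path;
    std::vector<std::string> files;
    std::vector<Folder> subfolders;
};

// Searches the tree with a fixed pool of threads fed from one shared work queue.
std::vector<std::string> findFile(const Folder& root, const std::string& name, unsigned threadCount) {
    std::mutex m;
    std::condition_variable cv;
    std::deque<const Folder*> queue{&root};
    std::size_t pending = 1;  // folders queued or currently being scanned
    std::vector<std::string> matches;

    auto worker = [&] {
        for (;;) {
            const Folder* folder = nullptr;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return !queue.empty() || pending == 0; });
                if (queue.empty()) return;  // pending == 0: the whole tree has been scanned
                folder = queue.front();
                queue.pop_front();
            }

            std::vector<std::string> found;  // scan the folder outside the lock
            for (const std::string& f : folder->files)
                if (f == name) found.push_back(folder->path + "/" + f);

            {
                std::lock_guard<std::mutex> lock(m);
                for (const Folder& sub : folder->subfolders) queue.push_back(&sub);
                pending += folder->subfolders.size();
                --pending;  // this folder is finished
                matches.insert(matches.end(), found.begin(), found.end());
            }
            cv.notify_all();
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return matches;
}
```

The pending counter is one possible way to tell when a search is over: it only reaches zero once every queued folder has been scanned. Matches arrive in whatever order the threads finish, not in tree order.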
Searching 3 disks in their entirety meant popping 3 folderSearch from the pool, loading them up with 'C:\', 'E:\', 'F:\' and the file match method and then pushing them onto the work queue. The disks then made rattling noises and the event would eventually fire with results. In my case, (Windows), the event PostMessaged the folderSearch objects to a UI thread where the results were displayed in a treeView before repooling the folderSearch's for re-use.
This system was ~ 2.5 times as fast as a simple sequential search across 3 disks, even on my old development box that only had one core, simply because all 3 disks were searched in parallel. I suspect it would show the same sort of advantage on a modern box because the limiting factor is probably dominated by IO waiting on the disks.
Surprisingly, there was also a speedup with only one disk, but not that much. Don't know why - should be slower, by rights, due to all the extra complication.
Naturally, there were issues. One was that, with a search that fired lots of results, the pool would empty because the UI could not keep up with the threads and so all the folderSearch objects got stuck in the PostMessages queued to the UI, so slowing down the search threads as they had to wait on the pool queue until the PostMessages got handled and so returned folderSearch's to the pool. This also meant that the UI was effectively blocked until the search was over and it could catch up, negating one of the advantages of threading off the search in the first place :( With small result sets, it worked fine.
Another possible issue is that the results come back in an 'unnatural' manner, interleaved in such an apparently confusing way that things like assembling a tree view are much more complex than with a single-threaded recursive search: you have to flit about all over the place to stuff the results into the treeView in the right place. This loads up the GUI with extra work and can negate the searching speed advantage with large numbers of results, as I found out.
This design could run multiple searches concurrently. As a test, I would load on several 3-disk searches at once, (no not while loading up the treeView - I just dumped the number of files found onto a memo line in the GUI message-handler). This made a huge amount of rattling and slowed everything to a crawl, but it did eventually complete all the searches without crashing. I didn't do this often as I was afraid for my poor disks. Don't try this at home
I was never sure how many threads to hang off the queue. Six was about the optimum on my old box with local disks. If there are networked disks in the mix, then more is probably better since a network call will tend to block one thread for much longer periods than a local disk read. Never tried that, but loading on more threads did not affect the performance any with local disks, just used more memory for no extra advantage.
Another problem is finding out if the search is actually over: are all the results in, or is some thread still waiting on a network drive that's slow or actually unreachable? With only one search, I could tell because the pool became full again when the search was over (I dumped the pool level to a statusBar on a 1s GUI timer). It didn't matter in my app, but in others, it might...
Cancelling a search is a similar issue. These sorts of things would need another 'searchClass' to control each search. Each folderSearch allocated to a search would have to keep a reference to the searchClass so that any thread handling the folderSearch could find out if an abort had been set and if so stop doing stuff with that folderSearch. I didn't need this, so did not implement it.
Then there's error reporting. If a network drive connection fails, for example, several (most likely all!) threads can block for a long time before an exception is raised. Then they all raise exceptions at once. The catch messages get loaded into an 'errorMess' field in the folderSearch and the results event is fired. Human-detectable evidence: the rattling stops. Nothing happens for a minute, then [no of threads] errors appear all at once.
Note well the caveats from the other posters and my experiences. Only attempt something like this if you really, really need it for some special search purpose and you are 100% happy with multiThreaded apps. If you can get away with a single-threaded search, or a shell call to a File Explorer, or almost anything else, do it that way!
I've used the same approach since with an FTP server to generate trees. That was much faster as well, though the server admins were probably not happy about the multiple connections
Rgds,
Martin
Multithreading a tree-search task with unknown work distribution in each branch is non-trivial (this comes up a lot in say, constraint satisfaction problems.)
The easiest way is to create a task queue (protected by a mutex). Fill this queue with all the children of the root node. Spawn N threads (one for each available CPU core) and have each thread take nodes from the queue and search them. There are various tricks to avoid some bad scenarios: if a thread finds that its node is "unexpectedly deep", it can add new tasks to the queue for the subdirectories it wants other threads to explore. If your node depths are well distributed and the root node has lots of children, you can avoid the queue entirely: give the thread with index i the children whose index j satisfies j % N == i, for j from 0 to X - 1 (where X is the number of children of the root node).
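The queue-free variant can be sketched with a strided split, where thread i takes children i, i + N, i + 2N and so on. This is one common way to divide the children evenly; exploreChild is a placeholder invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-child work; stands in for searching one subtree of the root.
long exploreChild(std::size_t childIndex) { return static_cast<long>(childIndex) * 2; }

// Thread i handles children i, i + N, i + 2N, ... so no queue or lock is needed.
long exploreAll(std::size_t childCount, unsigned threadCount) {
    std::vector<long> partial(threadCount, 0);
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < threadCount; ++i) {
        threads.emplace_back([&partial, childCount, threadCount, i] {
            for (std::size_t j = i; j < childCount; j += threadCount)
                partial[i] += exploreChild(j);
        });
    }
    for (auto& t : threads) t.join();

    long total = 0;
    for (long p : partial) total += p;
    return total;
}
```

This only balances well when the subtrees are of similar size; one unusually deep child leaves the other threads idle, which is the case the shared queue handles better.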
My first response is to say "just use nftw and forget about doing it multi-threaded". If you happen to have an implementation of nftw that does the tree walk in a multi-threaded fashion, then you get multi-threading for free (I'm not aware of any such implementation). If you really want to do multiple threads, I would suggest using nftw and spawning a new thread for each directory within the call back, but it's not immediately clear that that would be any easier (or any different) than following Kanopus' suggestion. And after thinking about it for a few moments, I fall back to my first suggestion and wonder why you want to do this with multiple threads. Having more threads is unlikely to speed up the search. Use nftw. Don't worry about threading.
Assuming each node in the tree represents a directory (and the files within it), and also assuming there is no limit in the number of threads you can open:
Enter the root node, if it has n subdirectories, create n - 1 threads to search in the first n - 1 and continue the search through the last subdirectory. Repeat as needed.
Tree-structures don't typically lend themselves to parallelization. Assuming you have all the nodes loaded into memory, try and organize them such that they occupy an array - after all, they need to live in RAM which is serial - and ignore their tree structure for the purpose of your search. Then iterate over the elements of the array using some sort of parallel for loop. A popular choice for this is OpenMP or you might try parallel_for_each in Visual Studio.
I am working on a Tetris AI implementation. It is a GUI application that plays the game by itself. The user can manipulate a few parameters that influence the decisions made by the AI. The basic algorithm goes as follows:
Start a new thread and clone the current game state to avoid excessive locking.
Generate a list of all possible future game states. These become child nodes of the current game state.
For each child node, generate its future game states.
Keep doing this recursively until a predefined depth has been reached.
Once the requested depth has been reached, find the best end node and recursively find its parent node until you have the first-level child.
Delete all child nodes that are not on the path between the child node and the end node.
This path is now the list of precalculated moves.
The main game executes the list of precalculated moves (with some fancy animation).
This is working pretty well up until search depth 4. After that I start to get memory problems. The number of possible game states can go from 9 to 34. So the worst case scenario for a level 4 search would be 34^4 game states. Windows XP seems unable to deal with level 5 searches (it hangs at 2+ GB).
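To put numbers on that growth, here is a small sketch that counts the worst-case states for a fixed branching factor of 34. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Number of states exactly `depth` levels below the current state.
std::uint64_t statesAtDepth(std::uint64_t branching, unsigned depth) {
    std::uint64_t count = 1;
    for (unsigned d = 0; d < depth; ++d) count *= branching;
    return count;
}

// Total number of states on levels 1..depth when no branch is ever deleted.
std::uint64_t statesUpToDepth(std::uint64_t branching, unsigned depth) {
    std::uint64_t total = 0;
    for (unsigned d = 1; d <= depth; ++d) total += statesAtDepth(branching, d);
    return total;
}
```

In the worst case, statesAtDepth(34, 4) is 1,336,336 and statesAtDepth(34, 5) is 45,435,424, so each extra level multiplies the memory needed by up to 34.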
So if I want to use deeper searches I'll need a strategy where I delete the non-promising branches and continue with the ones that will lead to a good score. But this makes it harder to estimate a maximum acceptable search depth. Therefore I think it would be better to specify a memory limit instead of a depth limit.
I am considering using a memory pool and "placement new" to create my objects in the pool's memory segments. However, the game grid is implemented as an STL vector, so in order to allocate it from the pool I need to implement a custom allocator.
This seems quite a challenge and perhaps I'm overlooking a simpler solution. So I'd like your insights on how to best deal with this.
Can boost, or another library, provide me some of these facilities? (I already found Poco's MemoryPool.) Are there any good online resources to help me get going?
FYI: here's the source code and a sample binary for Windows.
You can create a memory pool, etc, but that won't really make it any easier or harder to count game state instances. You do need to make sure you don't go over a certain number of active states in your decision tree, with or without a pool. And Boost does have one: http://www.boost.org/doc/libs/1_44_0/libs/pool/doc/index.html
It sounds like you're not really doing any pruning of the tree, which would allow you to get much deeper. Evaluate each future game state and drop ones unlikely to develop into anything useful, and don't waste your time going down that branch.
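One simple way to prune is to keep only the best few states at each level, often called a beam search. The sketch below is an illustration under that assumption rather than the only way to prune; State, Expand and beamSearch are hypothetical names, and the score is assumed to come from whatever evaluation function the AI already has.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical game state: just an evaluation and the root move that led here.
struct State {
    int score;      // evaluation of this position (higher is better)
    int firstMove;  // index of the first move on the path from the root
};

// Caller-supplied move generator: returns the children of a state.
using Expand = std::function<std::vector<State>(const State&)>;

// Expands level by level, keeping at most `beamWidth` states after each level.
// `peakStates`, if given, receives the largest number of states alive at once.
State beamSearch(const State& root, const Expand& expand, std::size_t beamWidth,
                 unsigned depth, std::size_t* peakStates = nullptr) {
    assert(beamWidth > 0);
    std::vector<State> frontier{root};
    std::size_t peak = 1;
    auto better = [](const State& a, const State& b) { return a.score > b.score; };

    for (unsigned level = 0; level < depth; ++level) {
        std::vector<State> next;
        for (const State& s : frontier) {
            std::vector<State> children = expand(s);
            next.insert(next.end(), children.begin(), children.end());
        }
        if (next.empty()) break;  // no legal moves left
        peak = std::max(peak, next.size());
        if (next.size() > beamWidth) {
            // Drop everything except the beamWidth most promising states.
            std::partial_sort(next.begin(), next.begin() + static_cast<std::ptrdiff_t>(beamWidth),
                              next.end(), better);
            next.resize(beamWidth);
        }
        frontier.swap(next);
    }
    if (peakStates) *peakStates = peak;
    return *std::min_element(frontier.begin(), frontier.end(), better);
}
```

With a branching factor of at most 34, the number of states alive at once never exceeds 34 times the beam width, whatever the depth, which gives a memory bound instead of a depth bound. The trade-off is that a branch that looks poor early on is dropped even if it would have paid off later.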
Despite the lack of context [what kind of search problem are you doing? Depth first, breadth first, A*?....] My suggestion is:
Use semaphores to limit the amount of processing that is done at one time, releasing each one once its processing has been evaluated. I can't really recommend a specific library that includes semaphores: threading was not built into C++ before C++11, and a standard semaphore (std::counting_semaphore) only arrived in C++20, so check your framework's documentation.
I wanted to "emulate" a popular flash game, Chrontron, in C++ and needed some help getting started. (NOTE: Not for release, just practicing for myself)
Basics:
Player has a time machine. On each iteration of using the time machine, a parallel state
is created, co-existing with a previous state. One of the states must complete all the
objectives of the level before ending the stage. In addition, all the stages must be able
to end the stage normally, without causing a state paradox (wherein they should have
been able to finish the stage normally but, due to the interactions of another state,
were not).
So, that sort of explains how the game works. You should play it a bit to really
understand what my problem is.
I'm thinking a good way to solve this would be to use linked lists to store each state,
which will probably either be a hash map, based on time, or a linked list that iterates
based on time. I'm still unsure.
ACTUAL QUESTION:
Now that I have some rough specs, I need some help deciding on which data structures to use for this, and why. Also, I want to know what Graphics API/Layer I should use to do this: SDL, OpenGL, or DirectX (my current choice is SDL). And how would I go about implementing parallel states? With parallel threads?
EDIT (To clarify more):
OS -- Windows (since this is a hobby project, may do this in Linux later)
Graphics -- 2D
Language -- C++ (must be C++ -- this is practice for a course next semester)
Q-Unanswered: SDL : OpenGL : Direct X
Q-Answered: Avoid Parallel Processing
Q-Answered: Use STL to implement time-step actions.
So far from what people have said, I should:
1. Use STL to store actions.
2. Iterate through actions based on time-step.
3. Forget parallel processing -- period. (But I'd still like some pointers as to how it
could be used and in what cases it should be used, since this is for practice).
Appending to the question, I've mostly used C#, PHP, and Java before so I wouldn't describe myself as a hotshot programmer. What C++ specific knowledge would help make this project easier for me? (ie. Vectors?)
What you should do is first to read and understand the "fixed time-step" game loop (Here's a good explanation: http://www.gaffer.org/game-physics/fix-your-timestep).
Then what you do is to keep a list of list of pairs of frame counter and action. STL example:
std::list<std::list<std::pair<unsigned long, Action> > > state;
Or maybe a vector of lists of pairs. To create the state, for every action (player interaction) you store the frame number and what action is performed, most likely you'd get the best results if action simply was "key <X> pressed" or "key <X> released":
state.back().push_back(std::make_pair(currentFrame, VK_LEFT | KEY_PRESSED));
To play back the previous states, you'd have to reset the frame counter every time the player activates the time machine and then iterate through the state list for each previous state and see if any matches the current frame. If there is, perform the action for that state.
To optimize you could keep a list of iterators to where you are in each previous state-list. Here's some pseudo-code for that:
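typedef std::list<std::pair<unsigned long, Action> > StateList;
// One (current position, end) iterator pair per previous state-list.
std::list<std::pair<StateList::iterator, StateList::iterator> > stateIteratorList;
// Once per frame:
for (auto& cursor : stateIteratorList)  // by reference, so the position is remembered
{
    // Stop at the end of the list; handle several actions on the same frame.
    while (cursor.first != cursor.second && cursor.first->first == currentFrame)
    {
        performAction(cursor.first->second);
        ++cursor.first;
    }
}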
I hope you get the idea...
Separate threads would simply complicate matters greatly. This way you get the same result every time, which you cannot guarantee with separate threads (I can't really see how that would be implemented) or with a non-fixed time-step game loop.
When it comes to graphics API, I'd go with SDL as it's probably the easiest thing to get you started. You can always use OpenGL from SDL later on if you want to go 3D.
This sounds very similar to Braid. You really don't want parallel processing for this - parallel programming is hard, and for something like this, performance should not be an issue.
Since the game state vector will grow very quickly (probably on the order of several kilobytes per second, depending on the frame rate and how much data you store), you don't want a linked list, which has a lot of overhead in terms of space (and can introduce big performance penalties due to cache misses if it is laid out poorly). For each parallel timeline, you want a vector data structure. You can store each parallel timeline in a linked list. Each timeline knows at what time it began.
To run the game, you iterate through all active timelines and perform one frame's worth of actions from each of them in lockstep. No need for parallel processing.
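A minimal sketch of that lockstep loop, assuming each timeline stores (frame, action) pairs sorted by frame and that frames are stepped one at a time without skipping. Timeline, stepFrame and the string Action type are illustrative names only, and the timelines are held in a vector here for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical action type; a real game would store key presses or commands.
using Action = std::string;

struct Timeline {
    std::vector<std::pair<unsigned long, Action>> events;  // (frame, action), sorted by frame
    std::size_t cursor = 0;                                 // next event to replay
};

// Replays one frame's worth of actions from every timeline, in timeline order.
// Assumes it is called for consecutive frames, so no recorded frame is skipped.
std::vector<Action> stepFrame(std::vector<Timeline>& timelines, unsigned long frame) {
    std::vector<Action> performed;
    for (Timeline& t : timelines) {
        while (t.cursor < t.events.size() && t.events[t.cursor].first == frame) {
            performed.push_back(t.events[t.cursor].second);
            ++t.cursor;
        }
    }
    return performed;
}
```

Because every timeline is advanced in the same order on every frame, a replay always produces the same sequence of actions, which is the determinism that separate threads would not give you.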
I have played this game before. I don't necessarily think parallel processing is the way to go. You have shared objects in the game (levers, boxes, elevators, etc) that will need to be shared between processes, possibly with every delta, thereby reducing the effectiveness of the parallelism.
I would personally just keep a list of actions, then for each subsequent iteration start interleaving them together. For example, if the list is in the format of <[iteration.action]>, then the 3rd time through would execute actions 1.1, 2.1, 3.1, 1.2, 2.2, 3.2, etc.
After briefly glossing over the description, I think you have the right idea. I would have a state object that holds the state data, and place this into a linked list. I don't think you need parallel threads.
As far as the graphics API goes, I have only used OpenGL, and can say that it is pretty powerful and has a good C/C++ API. OpenGL would also be more cross-platform, as you can use the Mesa library on *nix computers.
A very interesting game idea. I think you are right that parallel computing would be beneficial to this design, but no more than for any other resource-intensive program.
The question is a bit ambiguous. I see that you are going to write this in C++, but what OS are you coding it for? Do you intend it to be cross-platform, and what kind of graphics would you like, i.e. 3D, 2D, high-end, web-based?
So basically we need a lot more information.
Parallel processing isn't the answer. You should simply "record" the player's actions, then play them back for the "previous actions".
So you create a vector (or a list) of vectors that holds the actions. Simply store the frame number at which each action was taken (or the delta) and perform that action on the "dummy bot" that represents the player during that particular instance. You simply loop through the states and trigger them one after another.
You get a side effect of easily "breaking" the game when a state paradox happens simply because the next action fails.
Unless you're desperate to use C++ for your own education, you should definitely look at XNA for your game & graphics framework (it uses C#). It's completely free, it does a lot of things for you, and soon you'll be able to sell your game on Xbox Live.
To answer your main question, nothing that you can already do in Flash would ever need to use more than one thread. Just store a list of positions in an array and loop through with a different offset for each robot.