Tracking global index in MPI - c++

Suppose I am running an MPI program on 5 nodes, which will run a simulation 50 times. However, simulations may take notably different amounts of time. If I have a set of initial conditions, say ic1,ic2,...ic50, when one node/process completes a simulation, I would like it to run a new simulation using the next (not yet used) set of initial conditions. My initial thought was to use MPI_Bcast or MPI_Gather s.t. all nodes have an int holding the next index to run, and update this before starting a new simulation. Is this a reasonable idea/are there other and/or better solutions?

You need a shared work queue. There are lots of ways to do this. The simplest way is to designate one MPI rank the "master" and the others the "workers". Workers ask the master for a work unit (one of your initial conditions), go to work, and when done ask the master for a new one. Some additional detail here (it says OpenMPI, but nothing about it is specific to OpenMPI): Non-blocking data sharing through OpenMPI
There is a project called ADLB (http://www.mcs.anl.gov/project/adlb-asynchronous-dynamic-load-balancer) that is a work queue on steroids. Possibly (definitely) overkill for a 5 node job, but might make sense for a 5,000 node job.
You could use RMA shared memory to keep track of an index. If the process owning the RMA window is itself busy computing on a work unit, it may be a bit slow to respond to requests, but the benefit is that you would not need to burn one MPI process as a master process.
You could use the MPI shared file pointer routines to keep a work queue on disk. I'm generally quite down on the MPI shared file pointer routines, but in this case where the I/O is small (read an initial condition) relative to the amount of CPU work, it will be ok.
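The shared-index idea can be tried out without MPI at all. The following is a minimal sketch that uses threads in one process as a stand-in for MPI ranks, with a std::atomic counter playing the role of the index kept in an RMA window (in MPI the increment would be done with something like MPI_Fetch_and_op). The name run_dynamic and the simulate callback are hypothetical and only serve the illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs tasks 0..num_tasks-1 on num_workers threads. Each worker claims the
// next unused index from a shared counter, so fast workers simply take more.
// Returns, for each task index, the id of the worker that ran it.
std::vector<int> run_dynamic(std::size_t num_tasks, int num_workers,
                             const std::function<void(std::size_t)>& simulate) {
    std::atomic<std::size_t> next{0};  // the shared "global index"
    std::vector<int> owner(num_tasks, -1);
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            for (;;) {
                // fetch_add returns the old value and increments atomically,
                // so no two workers ever receive the same index.
                std::size_t i = next.fetch_add(1);
                if (i >= num_tasks) break;
                simulate(i);
                owner[i] = w;  // each index is written by exactly one worker
            }
        });
    }
    for (auto& t : workers) t.join();
    return owner;
}
```

With 50 initial conditions and 5 workers this would be called as run_dynamic(50, 5, simulate). Nobody has to broadcast the index, because each worker only needs the value it claimed.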

Related

How do I optimize the parallelization of Monte Carlo data generation with MPI?

I am currently building a Monte Carlo application in C++ and I have a question regarding parallelization with MPI.
The process I want to parallelize is the MC generation of data. To have good precision in my final results, I specify the goal number of data points. Each data point is generated independently, but might require vastly differing amounts of time.
How do I organize the parallelization and workload distribution of the data generation most efficiently?
What I have done so far
So far I have come up with three possible ways of organizing the MPI part of the code:
The simplest way, but most likely inefficient way: I divide the desired sample size by the number of workers and let every worker generate that amount of data in isolation. However, when the slowest worker finishes, all other workers have been idling for a potentially long time. They could have been "supporting" the slowest worker by sharing its workload.
Use a master: A master communicates with the workers who work continuously until the master process registers that we have enough data and tells everybody to stop what they are doing. The disadvantage I see is that the master process might not be necessary and could be generating data instead (especially when I don't have a lot of workers).
A "ring communication" algorithm I came up with myself: A message is continuously sent and updated in a circle (1->2, 2->3, ... , N->1). This message contains the global number of generated data points. Once the desired goal is met, the message is tagged, circles one more time and thereby tells everybody to stop working. The important part is that I use non-blocking communication (with MPI_Iprobe before receiving via MPI_Recv, and sending via MPI_Isend). This way, everybody works, and no one ever idles.
No matter which solution is chosen, in the end I reduce all data sets to one big set and continue to process the data.
The concrete questions:
Is there an "optimal" way of parallelizing such a fairly simple process? Would you prefer any of the proposed solutions for some reason?
What do you think of this "ring communication" solution?
I'm sure I'm not the first one to come up with e.g. the ring communication algorithm. I have tried to google this problem, but it seems to me that I do not know the right terminology in this context. I'm sure there must be a lot of material and literature on such simple algorithms, but I never had a formal course on MPI/parallelization. What are the "keywords" to look for?
Any advice and tips are much appreciated.

erlang processes and message passing architecture

The task I have in hand is to read the lines of large file, process them, and return ordered results.
My algorithm is:
start with master process that will evaluate the workload (written in the first line of the file)
spawn worker processes: each worker will read part of the file using pread/3, process this part, and send results to master
master receives all sub-results, sort, and return
so basically no communication needed between workers.
My questions:
How do I find the optimal balance between the number of Erlang processes and the number of cores? If I spawn one process for each processor core I have, would that underutilize my CPU?
How does pread/3 reach the specified line? Does it iterate over all lines in the file? And is pread/3 a good way to parallelize file reading?
Is it better to send one big message from process A to B or send N small messages? I have found part of the answer in the below link, but I would appreciate further elaboration
erlang message passing architecture

Erlang processes are cheap. You're free (and encouraged) to use more than however many cores you have. There might be an upper limit to what is practical for your problem (loading 1TB of data with one process per line is asking a bit much, depending on line size).
The easiest way to do it when you don't know is to let the user decide. This means you could decide to spawn N workers, and distribute work between them, waiting to hear back. Re-run the program while changing N if you don't like how it runs.
A trickier way to do it is to benchmark a bunch of times, pick what you think makes sense as a maximal value, stick it in a pool library (if you want to; some pools go for preallocated resources, some for a resizable amount), and settle for what would be a one-size-fits-all solution.
But really, there is no easy 'optimal number of cores'. You can run it on 50 processes as well as on 65,000 of them if you want; if the task is embarrassingly parallel, the VM should be able to make usage of most of them and saturate the cores anyway.
-
Parallel file reads is an interesting question. It may or may not be faster (as direct comments have mentioned) and it may only represent a speed up if the work on each line is minimal enough that reading the file has the biggest cost.
The tricky bit is really that functions like pread/2-3 take a byte offset. Your question is worded such that you are worried about lines of the file. The byte offsets you hand off to workers may therefore end up straddling a line. If your block ends at the word my in this is my line\nhere it goes\n, one worker will find itself with an incomplete line, while the other will report only on my line\n, missing the prior this is.
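One way around the straddling problem is to move each cut forward to the next newline before handing ranges to workers. The sketch below shows that adjustment in C++ rather than Erlang, on a buffer already held in memory; split_on_lines is a hypothetical helper, not something provided by the file module.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Splits `data` into at most `parts` half-open byte ranges [begin, end),
// moving every cut forward to just past the next '\n' so that no line
// straddles two ranges.
std::vector<std::pair<std::size_t, std::size_t>>
split_on_lines(const std::string& data, std::size_t parts) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (data.empty() || parts == 0) return ranges;
    const std::size_t target = (data.size() + parts - 1) / parts;  // ceiling
    std::size_t begin = 0;
    while (begin < data.size()) {
        std::size_t end = std::min(begin + target, data.size());
        if (end < data.size()) {
            // Find the newline that ends the line containing byte end - 1.
            std::size_t nl = data.find('\n', end - 1);
            end = (nl == std::string::npos) ? data.size() : nl + 1;
        }
        ranges.emplace_back(begin, end);
        begin = end;
    }
    return ranges;
}
```

For the example text split in two, the naive cut would land inside the first line; after adjustment the first range covers this is my line\n and the second covers here it goes\n. Fewer ranges than requested can come back when lines are long relative to the chunk size.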
Generally, this kind of annoying stuff is what will lead you to have the first process own the file and sift through it, only to hand off bits of text to process to workers; that process will then act as some sort of coordinator.
The nice aspect of this strategy is that if the main process knows everything that was sent as a message, it also knows when all responses have been received, making it easy to know when to return the results. If everything is disjoint, you have to trust both the starter and the workers to tell you "we're all out of work" as a distinct set of independent messages to know.
In practice, you'll probably find that what helps the most is doing file operations in a way that is kind to your hardware, more than "how many people can read the file at once". There's only one hard disk (or SSD) and all data has to go through it anyway, so parallelism may end up limited by access there.
-
Use messages that make sense for your program. The most performant program would have a lot of processes able to do work without ever needing to pass messages, communicate, or acquire locks.
A more realistic very performant program would use very few messages of a very small size.
The fun thing here is that your problem is inherently data-based. So there's a few things you can do:
Make sure you read text in a binary format; large binaries (> 64 bytes) get allocated on a global binary heap, are shared around and GC'd with reference counting
Hand in information on what needs to be done rather than the data for doing it; this one would need measuring, but the lead process could go over the file, note where lines end, and just hand byte offsets to the workers so they can go and read the file themselves; do note that you'll end up reading the file twice, so if memory allocation is not your main overhead, this will likely be slower
Make sure the file is read in raw or ram mode; other modes use a middle-man process to read and forward data (this is useful if you read files over a network in clustered Erlang nodes); raw and ram modes give the file descriptor directly to the calling process and are a lot faster.
First worry about writing a clear, readable and correct program. Only if it is too slow should you attempt to refactor and optimize it; you may very well find it good enough on the first try.
I hope this helps.
P.S. You can try the really simple stuff at first:
either:
read the whole file at once with {ok, Bin} = file:read_file(Path) and split lines (with binary:split(Bin, <<"\n">>, [global])),
use {ok, Io} = file:open(File, [read,ram]) and then use file:read_line(Io) on the file descriptor repeatedly
use {ok, Io} = file:open(File, [read,raw,{read_ahead,BlockSize}]) and then use file:read_line(Io) on the file descriptor repeatedly
call rpc:pmap({?MODULE, Function}, ExtraArgs, Lines) to run everything in parallel automatically (it will spawn one process per line)
call lists:sort/1 on the result.
Then from there you can refine each step if you identify them as problematic.

Concurrent Programming act on each element in array

I have a question related to parallel programming. If I have a program that acts on each and every element of an array, why might it not be advantageous to use all the available processors?
I was thinking maybe because of the significant overhead of setting up and managing multiple threads, or because the array size didn't warrant a concurrent solution. Can anyone think of anything else?

Some processors may already be busy doing important things, or you may want to leave spare capacity just in case they need to respond quickly to new workloads. For example, in a desktop system with 8 processors, you may want to leave 1 free to keep the UI responsive, while you fork out 7 "batch-processing" threads on the others. In a non-UI system, you may still want to keep one or more cores listening to OS interrupts or doing network IO.
A particularly frustrating example would be starting a parallel computation on all your cores, finding that you should have tweaked a parameter before launching it, and not being able to interrupt the computation because there is no spare computing power left to allow the UI to respond to your 'cancel' button.
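The "leave a core free" rule is easy to encode when sizing a thread pool. This is a small sketch; worker_count and default_worker_count are hypothetical helper names, and reserving exactly one core is only the example from the answer, not a general recommendation.

```cpp
#include <thread>

// Number of worker threads to start when `reserved` cores should stay free.
// Always returns at least 1, so the program still makes progress on small
// machines or when the hardware count is unknown (reported as 0).
unsigned worker_count(unsigned hardware, unsigned reserved) {
    if (hardware <= reserved) return 1;
    return hardware - reserved;
}

// Convenience wrapper: asks the standard library for the core count and
// keeps one core free for the UI, interrupts, or network I/O.
unsigned default_worker_count() {
    return worker_count(std::thread::hardware_concurrency(), 1);
}
```

On the 8-processor desktop from the example, worker_count(8, 1) gives the 7 batch-processing threads.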

I would have made that array a static variable and, according to its size, divided the task and assigned multiple threads to carry out the work for each set of elements in the array.
For example, if I have 100 elements in the array, I would have divided it into sets of 10,
and with 10 different threads I would have carried out my work.
Correct me if I am not getting you.
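The division into sets can be written down as index ranges, one per thread. A minimal sketch, assuming a hypothetical helper chunk_bounds that also copes with sizes that do not divide evenly:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Divides indices 0..n-1 into `chunks` contiguous half-open ranges
// [begin, end). When n is not a multiple of `chunks`, the first n % chunks
// ranges get one extra element, so sizes differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>>
chunk_bounds(std::size_t n, std::size_t chunks) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    if (chunks == 0) return out;
    const std::size_t base = n / chunks;
    const std::size_t extra = n % chunks;
    std::size_t begin = 0;
    for (std::size_t c = 0; c < chunks; ++c) {
        const std::size_t len = base + (c < extra ? 1 : 0);
        out.emplace_back(begin, begin + len);
        begin += len;
    }
    return out;
}
```

chunk_bounds(100, 10) yields ten ranges of ten elements each; each thread would then loop over its own range only.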
EDITED:-
The OS already does precisely that for you. It doesn't guarantee that each thread will stay on the same core forever (and in nearly all cases, there's no need for that either), but it does try to keep as many cores busy as possible. Which means giving all available threads their own core as much as possible.
Note:- Direct correlation between program threads and OS threads is not guaranteed, at least according to this for .net : http://msdn.microsoft.com/en-us/library/74169f59.aspx
Hope this makes some sense.

Poor performance in multi-threaded C++ program

I have a C++ program running on Linux in which a new thread is created to do some computationally expensive work independent of the main thread (The computational work completes by writing the results to files, which end up being very large). However, I'm getting relatively poor performance.
If I implement the program straightforward (without introducing other threads), it completes the task in roughly 2 hours. With the multi-threaded program it takes around 12 hours to do the same task (this was tested with only one thread spawned).
I've tried a couple of things, including pthread_setaffinity_np to set the thread to a single CPU (out of the 24 available on the server I'm using), as well as pthread_setschedparam to set the scheduling policy (I've only tried SCHED_BATCH). But the effects of these have so far been negligible.
Are there any general causes for this kind of problem?
EDIT: I've added some example code that I'm using, which is hopefully the most relevant parts. The function process_job() is what actually does the computational work, but it would be too much to include here. Basically, it reads in two files of data, and uses these to perform queries on an in-memory graph database, in which the results are written to two large files over a period of hours.
EDIT part 2: Just to clarify, the problem is not that I want to use threads to increase the performance of an algorithm I have. But rather, I want to run many instances of my algorithm simultaneously. Therefore, I expect the algorithm would run at a similar speed when put in a thread as it would if I didn't use multi-threads at all.
EDIT part 3: Thanks for the suggestions all. I'm currently doing some unit tests (seeing which parts are slowing down) as some have suggested. As the program takes a while to load and execute, it is taking time to see any results from the tests and therefore I apologize for late responses. I think the main point I wanted to clarify is possible reasons why threading could cause a program to run slowly. From what I gather from the comments, it simply shouldn't be. I'll post when I can find a reasonable resolution, thanks again.
(FINAL) EDIT part 4: It turns out that the problem was not related to threading after all. Describing it would be too cumbersome at this point (including the use of compiler optimization levels), but the ideas posted here were very useful and appreciated.
struct sched_param sched_param = {
    sched_get_priority_min(SCHED_BATCH)
};
int set_thread_to_core(const long tid, const int &core_id) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    return pthread_setaffinity_np(tid, sizeof(mask), &mask);
}
void *worker_thread(void *arg) {
    job_data *temp = (job_data *)arg; // get the information for the task passed in
    ...
    long tid = pthread_self();
    int set_thread = set_thread_to_core(tid, slot_id); // assume slot_id is 1 (it is in the test case I run)
    sched_get_priority_min(SCHED_BATCH);
    pthread_setschedparam(tid, SCHED_BATCH, &sched_param);
    int success = process_job(...); // this is where all the work actually happens
    pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
    ...
    pthread_t temp;
    pthread_create(&temp, NULL, worker_thread, (void *) &jobs[i]); // jobs is a vector of a class type containing information for the task
    ...
    return 0;
}

If you have plenty of CPU cores, and have plenty of work to do, it should not take longer to run in multithreaded than single threaded mode - the actual CPU time may be a fraction longer, but the "wall-clock time" should be shorter. I'm pretty sure that your code has some sort of bottleneck where one thread is blocking the other.
This is because of one or more of these things - I'll list them first, then go into detail below:
Some lock in a thread is blocking the second thread from running.
Sharing of data between threads (either true or "false" sharing)
Cache thrashing.
Competition for some external resource causing thrashing and/or blocking.
Badly designed code in general...
Some lock in a thread is blocking the second thread from running.
If there is a thread that takes a lock, and another thread wants to use the resource that is locked by this thread, it will have to wait. This obviously means the thread isn't doing anything useful. Locks should be kept to a minimum by only taking the lock for a short period. You can use some code to identify whether locks are holding up your code, such as:
while (!tryLock(some_lock))
{
    tried_locking_failed[lock_id][thread_id]++;
}
total_locks[lock_id]++;
Printing some stats of the locks would help to identify where the locking is contentious - or you can try the old trick of "Press break in the debugger and see where you are" - if a thread is constantly waiting for some lock, then that's what's preventing progress...
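The counting idea above can be made concrete with the standard library. This is a sketch under assumptions: CountingMutex is a hypothetical wrapper, not part of any library, and it spins with try_lock, which is fine for diagnostics but is not how a production lock should wait. Note that std::mutex::try_lock is allowed to fail spuriously, so the failure count is an upper bound on real contention.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// A mutex wrapper that records how often lock() had to retry because
// another thread held the lock, and how often it was acquired in total.
class CountingMutex {
public:
    void lock() {
        while (!mutex_.try_lock()) {
            failed_.fetch_add(1, std::memory_order_relaxed);
            std::this_thread::yield();  // let the current owner make progress
        }
        acquired_.fetch_add(1, std::memory_order_relaxed);
    }
    void unlock() { mutex_.unlock(); }

    unsigned long failed() const { return failed_.load(); }
    unsigned long acquired() const { return acquired_.load(); }

private:
    std::mutex mutex_;
    std::atomic<unsigned long> failed_{0};
    std::atomic<unsigned long> acquired_{0};
};
```

Because it provides lock() and unlock(), it can be used with std::lock_guard<CountingMutex>. A high ratio of failed() to acquired() after a run suggests that this lock is contended.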
Sharing of data between threads (either true or "false" sharing)
If two threads use the same variable [and update its value frequently], then the two threads will have to swap "I've updated this" messages, and the CPUs have to fetch the data from the other CPU before they can continue with their use of the variable. Since "data" is shared on a "per cache-line" level, and a cache-line is typically 64 bytes on current x86 processors (32 bytes on some older ones), something like:
int var[NUM_THREADS];
...
var[thread_id]++;
would classify as something called "false sharing" - the ACTUAL data updated is unique per CPU, but since the data is within the same cache-line-sized region, the cores will still have updated the same area of memory.
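A common remedy is to give each thread's counter its own cache line. The sketch below assumes a 64-byte line (kCacheLine is an assumption to adjust for the actual hardware) and a fixed thread count kThreads; both names and the function count_in_parallel are made up for the illustration.

```cpp
#include <array>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes
constexpr int kThreads = 4;             // hypothetical fixed thread count

// alignas pads and aligns the struct so that each counter occupies a cache
// line of its own; neighbouring counters can no longer share a line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");

// Each thread increments only its own padded counter; the per-thread
// results are summed after all threads have finished.
long count_in_parallel(long increments) {
    std::array<PaddedCounter, kThreads> counters{};
    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t)
        threads.emplace_back([&counters, t, increments] {
            for (long i = 0; i < increments; ++i) ++counters[t].value;
        });
    for (auto& th : threads) th.join();
    long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

The cost is memory: each counter now takes a full line instead of a few bytes. An alternative with the same effect is to accumulate into a local variable and write the shared slot once at the end.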
Cache thrashing.
If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don't run in "lockstep" on which part of cache the CPU uses. If the data is 2^n (power of two) and fairly large (a multiple of the cache-size), it's a good idea to "add an offset" for each thread - for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.
Competition for some external resource causing thrashing and/or blocking.
If two threads are reading or writing from/to the hard-disk, network card, or some other shared resource, this can lead to one thread blocking another thread, which in turn means lower performance. It is also possible that the code detects different threads and does some extra flushing to ensure that data is written in the correct order or similar, before starting work with the other thread.
It is also possible that there are locks internally in the code that deals with the resource (user-mode library or kernel mode drivers) that block when more than one thread is using the same resource.
Generally bad design
This is a "catchall" for "lots of other things that can be wrong". If the result from one calculation in one thread is needed to progress the other, obviously, not a lot of work can be done in that thread.
Too small a work-unit, so all the time is spent starting and stopping the thread, and not enough work is being done. Say for example that you dole out numbers to each thread to "calculate if this is a prime", one number at a time: it will probably take a lot longer to give the number to the thread than to calculate "is this actually a prime number". The solution is to give a set of numbers (perhaps 10, 20, 32, 64 or such) to each thread, and then report back the result for the whole lot in one go.
There are plenty of other kinds of "bad design". Without understanding your code it's quite hard to say for sure.
It is entirely possible that your problem is none of the ones I've mentioned here, but most likely it is one of these. Hopefully this answer is helpful to identify the cause.
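The batching advice for the prime example could look like the following sketch. The names is_prime and count_primes are hypothetical, trial division is used only to keep the example short, and the code assumes limits far below the range of unsigned so that the shared counter cannot wrap.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

bool is_prime(unsigned n) {
    if (n < 2) return false;
    for (unsigned d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Counts the primes in [0, limit). Threads claim `batch` consecutive numbers
// at a time and report one result per batch, instead of synchronising once
// per number.
unsigned count_primes(unsigned limit, unsigned batch, unsigned num_threads) {
    assert(batch > 0);
    std::atomic<unsigned> next{0};   // start of the next unclaimed batch
    std::atomic<unsigned> total{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
        threads.emplace_back([&] {
            for (;;) {
                const unsigned begin = next.fetch_add(batch);
                if (begin >= limit) break;
                const unsigned end = std::min(begin + batch, limit);
                unsigned local = 0;
                for (unsigned n = begin; n < end; ++n)
                    if (is_prime(n)) ++local;
                total.fetch_add(local);  // one shared update per batch
            }
        });
    for (auto& th : threads) th.join();
    return total.load();
}
```

With batch set to 1 the result is the same, but every number costs two atomic operations; larger batches spread that cost over more useful work.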

Read CPU Caches and Why You Care to understand why a naive port of an algorithm from one thread to multiple threads will more often than not result in greatly reduced performance and negative scalability. Algorithms that are specifically designed for parallelism take care of overactive interlocked operations, false sharing and other causes of cache pollution.

Here are a few things you might wanna look into.
1°) Do you enter any critical section (locks, semaphores, etc.) between your worker thread and your main thread? (This should be the case if your queries modify the graph.) If so, that could be one of the sources of the multithreading overhead: threads competing for a lock usually degrade performance.
2°) You're using a 24-core machine, which I assume is NUMA (Non-Uniform Memory Access). Since you set the thread affinities during your tests, you should pay close attention to the memory topology of your hardware. Looking at the files in /sys/devices/system/cpu/cpuX/ can help you with that (beware that cpu0 and cpu1 aren't necessarily close together, and thus do not necessarily share memory). Threads heavily using memory should use local memory (allocated in the same NUMA node as the core they're executing on).
3°) You are heavily using disk I/O. Which kind of I/O is that? If every thread performs synchronous I/O every time, you might want to consider asynchronous system calls, so that the OS stays in charge of scheduling those requests to the disk.
4°) Some cache issues have already been mentioned in other answers. From experience, false sharing can hurt performance as much as you're observing. My last recommendation (which should have been my first) is to use a profiler tool, such as Linux Perf or OProfile. With the performance degradation you're experiencing, the cause will certainly appear quite clearly.

The other answers have all addressed the general guidelines that can cause your symptoms. I will give my own, hopefully not excessively redundant version. Then I will talk a bit about how you can get to the bottom of the problem with everything discussed in mind.
In general, there's a few reasons you'd expect multiple threads to perform better:
A piece of work is dependent on some resources (disk, memory, cache, etc.) while other pieces can proceed independently of these resources or said workload.
You have multiple CPU cores that can process your workload in parallel.
The main reasons, enumerated above, you'd expect multiple threads to perform less well are all based on resource contention:
Disk contention: already explained in detail and can be a possible issue, especially if you are writing small buffers at a time instead of batching
CPU time contention if the threads are scheduled onto the same core: probably not your issue if you're setting affinity. However, you should still double check
Cache thrashing: similarly probably not your problem if you have affinity, though this can be very expensive if it is your problem.
Shared memory: again talked about in detail and doesn't seem to be your issue, but it wouldn't hurt to audit the code to check it out.
NUMA: again talked about. If your worker thread is pinned to a different core, you will want to check whether the work it needs to access is local to the main core.
Ok so far not much new. It can be any or none of the above. The question is, for your case, how can you detect where the extra time is coming from. There's a few strategies:
Audit the code and look for obvious areas. Don't spend too much time doing this as it's generally unfruitful if you wrote the program to begin with.
Refactor the single threaded code and the multi-threaded code to isolate one process() function, then profile at key checkpoints to try to account for the difference. Then narrow it down.
Refactor the resource access into batches, then profile each batch on both the control and the experiment to account for the difference. Not only will this tell you which areas (disk access vs memory access vs spending time in some tight loop) you need to focus your efforts on, doing this refactor might even improve your running time overall. Example:
First copy the graph structure to thread-local memory (perform a straight-up copy in the single-threaded case)
Then perform the query
Then setup an asynchronous write to disk
Try to find a minimally reproducible workload with the same symptoms. This means changing your algorithm to do a subset of what it already does.
Make sure there's no other noise in the system that could've caused the difference (for example, some other user running a similar workload on the worker core).
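For the "profile at key checkpoints" strategy, a few lines of timing code are often enough before reaching for a full profiler. The sketch below uses std::chrono::steady_clock; the Checkpoints class and the labels used with it are hypothetical.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Records how much wall-clock time passes between successive checkpoints.
class Checkpoints {
    using Clock = std::chrono::steady_clock;

public:
    Checkpoints() : last_(Clock::now()) {}

    // Stores the time in milliseconds since the previous mark (or since
    // construction, for the first mark) under the given label.
    void mark(const std::string& label) {
        const Clock::time_point now = Clock::now();
        const std::chrono::duration<double, std::milli> lap = now - last_;
        laps_.emplace_back(label, lap.count());
        last_ = now;
    }

    const std::vector<std::pair<std::string, double>>& laps() const {
        return laps_;
    }

private:
    Clock::time_point last_;
    std::vector<std::pair<std::string, double>> laps_;
};
```

Calling mark("copy graph"), mark("query") and mark("write") at the same points in the single-threaded and the threaded build gives two lists whose entries can be compared phase by phase.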
My own intuition for your case:
Your graph structure is not NUMA friendly for your worker core.
The kernel can actually schedule other work onto the core you pinned your worker thread to. This can happen if you don't have isolcpus set for the core you're pinning to.

I can't tell you what's wrong with your program because you haven't shared enough of it to do a detailed analysis.
What I can tell you is if this was my problem the first thing I would try is to run two profiler sessions on my application, one on the single threaded version and another on the dual thread configuration. The profiler report should give you a pretty good idea of where the extra time is going. Note that you may not need to profile the entire application run, depending on the problem the time difference may become obvious after you profile for a few seconds or minutes.
As far as profiler choices for Linux you may want to consider oprofile or as a second choice gprof.
If you find you need help interpreting the profiler output feel free to add that to your question.

It can be a right pain in the rear to track down why threads aren't working as planned. One can do so analytically, or one can use a tool to show what's going on. I've had very good mileage out of ftrace, Linux's clone of Solaris's dtrace (which in turn is based on what VxWorks, Greenhill's Integrity OS and Mercury Computer Systems Inc have been doing for a looong time.)
In particular I found this page very useful: http://www.omappedia.com/wiki/Installing_and_Using_Ftrace, particularly this and this section. Don't worry about it being an OMAP orientated website; I've used it on X86 Linuxes just fine (though you may have to build a kernel to include it). Also remember that the GTKWave viewer is primarily intended for looking at log traces from VHDL developments, which is why it looks 'odd'. It's just that someone realised that it would be a usable viewer for sched_switch data too, and that saved them writing one.
Using the sched_switch tracer you can see when (but not necessarily why) your threads are running, and that might be enough to give you a clue. The 'why' can be revealed by careful examination of some of the other tracers.

If you are getting slowdown from using 1 thread, it is likely due to overhead from using thread safe library functions, or from thread setup. Creating a thread for each job will cause significant overhead, but probably not as much as you refer to.
In other words, it is probably some overhead from some thread safe library function.
The best thing to do, is to profile your code to find out where time is spent. If it is in a library call, try to find a replacement library or implement it yourself. If the bottleneck is thread creation/destruction try reusing threads, for instance using OpenMP tasks or std::async in C++11.
Some libraries are really nasty wrt thread safe overhead. For instance, many rand() implementations use a global lock, rather than using thread local PRNGs. Such locking overhead is much larger than generating a number, and is hard to track without a profiler.
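A thread-local generator avoids the shared lock entirely. This is a minimal sketch using the C++11 <random> facilities; the function name thread_local_uniform is hypothetical, and seeding from std::random_device is one possible choice, not a requirement.

```cpp
#include <random>

// Returns a uniformly distributed integer in [lo, hi]. Each thread lazily
// creates its own engine on first use, so threads never contend for a lock
// or for shared generator state.
int thread_local_uniform(int lo, int hi) {
    thread_local std::mt19937 engine{std::random_device{}()};
    std::uniform_int_distribution<int> dist(lo, hi);
    return dist(engine);
}
```

Since every thread has its own stream, results are not reproducible across runs unless each thread is given a fixed seed instead.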
The slowdown could also stem from small changes you have made, for instance declaring variables volatile, which generally should not be necessary.

I suspect you're running on a machine with one single-core processor. This problem is not parallelizable on that kind of system. Your code is constantly using the processor, which has a fixed number of cycles to offer to it. It actually runs more slowly because the additional thread adds expensive context switching to the problem.
The only kinds of problems that parallelize well on a single-processor machine are those that allow one path of execution to run while another is blocked waiting for I/O, and situations (such as keeping a responsive GUI) where allowing one thread to get some processor time is more important than executing your code as quickly as possible.

If you only want to run many independent instances of your algorithm can you just submit multiple jobs (with different parameters, can be handled by a single script) to your cluster? That would eliminate the need to profile and debug your multithreaded program. I don't have much experience with multithreaded programming but if you use MPI or OpenMP then you'd have to write less code for the book keeping too. For example, if some common initialization routine is needed and the processes can run independently thereafter you can just do that by initializing in one thread and doing a broadcast. No need for maintaining locks and such.

Approach for finding file in huge tree structure using multithreading

I have a tree which has all the directories and files as its nodes. I want to search a particular file. Say the tree is spread widely and I want to do a breadth first search to find some particular file and that too using multithreading. How should I do that using multithreading ? What is a good approach ?

There are some cases where multithreading the search will provide a useful speedup - if the tree spans more than one disk, for example, or if some of the disks/nodes are indirected over some network.
I certainly would not want to try creating threads for every folder. That's thousands of create/run/terminate, thousands of stack allocation/free etc. Gross, avoidable overhead.
A multithreaded search can be done, but as other posters have said, look at available alternatives first. Then read the rest of this post. Then look again.
I have done something like this once using a queue approach similar to that suggested by Matt.
I don't want to ever do it again:
I used a producer-consumer work queue on which 6 threads waited for work, (6 because testing showed this to be optimum with my problem). The threads were all created once at startup and never terminated. None of this continual create/load/run/waitFor/getResult/terminateIfYoureLucky stuff that unaccountably seems to be popular with developers despite poor performance, shutdown AVs, 216/217 messageBoxes etc etc.
The work came in the form of a 'folderSearch' class that contained the path to be searched, a file match function event to call and the FindFirst/FindNext loop method to do the searching. I created a couple hundred of these at startup in a pool, (ie pushed onto another P-C pool queue:). When the FF/FN iterated the files in the folder to look for matching files, encountering a sub-folder resulted in extracting another folderSearch instance from the pool, loading it up with the new path & pushing it onto the work queue - some thread would then pick it up and iterate the sub-folder. The class had a list for the paths to matching files and a 'results' event to call, (with 'this' as parameter, of course), if it found something of interest. If a folderSearch got to the end of a twig, having found nothing and with nothing left to search, it would release itself back to the pool, (well, OK, the thread would do it, but you know what I mean:).
There was no need for any explicit 'load balancing'. If one node was exceptionally deep, it would naturally end up with all six threads working on its subtrees because the other paths are exhausted.
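The queue-based design described here can be sketched in portable C++. This is not the original Windows implementation: the tree is a hypothetical in-memory Node structure instead of FindFirst/FindNext, there is no object pool, and the code assumes C++17 so that Node may hold a std::vector of itself. It does keep the key points: a fixed set of threads created once, one shared work queue, and sub-folders pushed back as new jobs.

```cpp
#include <algorithm>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct Node {                     // hypothetical stand-in for a directory tree
    std::string name;
    std::vector<Node> children;   // empty for a file
};

// Returns the sorted paths of all nodes below `root` named `target`.
std::vector<std::string> parallel_find(const Node& root,
                                       const std::string& target,
                                       int num_threads) {
    struct Job { const Node* dir; std::string path; };
    std::mutex m;
    std::condition_variable cv;
    std::deque<Job> queue;        // FIFO, so roughly breadth-first
    int pending = 1;              // jobs queued or currently being processed
    std::vector<std::string> found;
    queue.push_back({&root, root.name});

    auto worker = [&] {
        for (;;) {
            Job job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [&] { return !queue.empty() || pending == 0; });
                if (queue.empty()) return;  // pending == 0: search finished
                job = std::move(queue.front());
                queue.pop_front();
            }
            // Scan one folder without holding the lock.
            std::vector<Job> subdirs;
            std::vector<std::string> hits;
            for (const Node& child : job.dir->children) {
                std::string path = job.path + "/" + child.name;
                if (child.name == target) hits.push_back(path);
                if (!child.children.empty()) subdirs.push_back({&child, path});
            }
            {
                std::lock_guard<std::mutex> lock(m);
                for (auto& h : hits) found.push_back(std::move(h));
                pending += static_cast<int>(subdirs.size()) - 1;
                for (auto& s : subdirs) queue.push_back(std::move(s));
            }
            cv.notify_all();
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < num_threads; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    std::sort(found.begin(), found.end());  // results arrive in any order
    return found;
}
```

The pending counter is what lets idle threads tell "queue momentarily empty" apart from "search finished". Load balancing falls out of the shared queue: whichever thread is free takes the next folder, so a deep subtree ends up occupying all threads once the rest is exhausted.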
Searching 3 disks in their entirety meant popping 3 folderSearch from the pool, loading them up with 'C:\', 'E:\', 'F:\' and the file match method and then pushing them onto the work queue. The disks then made rattling noises and the event would eventually fire with results. In my case, (Windows), the event PostMessaged the folderSearch objects to a UI thread where the results were displayed in a treeView before repooling the folderSearch's for re-use.
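For anyone who wants to see the shape of this, here is a minimal sketch of the same idea in C++17. It is not my original code (that was a Windows FindFirst/FindNext app with pooled objects); the names parallel_search and SearchResult are made up for illustration, std::filesystem stands in for the FF/FN loop, and the match function is assumed to be thread-safe and not to throw. A fixed set of threads waits on one queue, a sub-folder becomes a new queue entry, and the search is over when the queue is empty and no folder is still being processed.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <filesystem>
#include <functional>
#include <mutex>
#include <system_error>
#include <thread>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

struct SearchResult {
    std::vector<fs::path> matches;     // files accepted by the match function
    std::size_t folders_searched = 0;  // folders taken off the work queue
};

// Hypothetical helper: searches all `roots` with a fixed pool of threads.
SearchResult parallel_search(const std::vector<fs::path>& roots,
                             const std::function<bool(const fs::path&)>& match,
                             unsigned n_threads = 6) {
    assert(n_threads > 0);
    std::mutex m;
    std::condition_variable cv;
    std::deque<fs::path> work(roots.begin(), roots.end());  // folders waiting
    std::size_t in_flight = 0;                              // folders being searched
    SearchResult result;

    auto worker = [&] {
        for (;;) {
            fs::path dir;
            {
                std::unique_lock<std::mutex> lock(m);
                // Sleep until there is work, or until nobody can produce any.
                cv.wait(lock, [&] { return !work.empty() || in_flight == 0; });
                if (work.empty()) return;  // queue drained and all threads idle
                dir = std::move(work.front());
                work.pop_front();
                ++in_flight;
            }
            // Scan one folder without holding the lock.
            std::vector<fs::path> subdirs, hits;
            std::error_code ec;
            for (fs::directory_iterator it(dir, ec), end; !ec && it != end; it.increment(ec)) {
                std::error_code ignored;
                if (it->is_symlink(ignored)) continue;  // skip links to avoid cycles
                if (it->is_directory(ignored)) subdirs.push_back(it->path());
                else if (it->is_regular_file(ignored) && match(it->path())) hits.push_back(it->path());
            }
            {
                std::lock_guard<std::mutex> lock(m);
                for (auto& d : subdirs) work.push_back(std::move(d));
                for (auto& h : hits) result.matches.push_back(std::move(h));
                ++result.folders_searched;
                --in_flight;
            }
            cv.notify_all();  // new work may exist, or the search may be finished
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n_threads; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    return result;
}
```

Seeding it with several roots, e.g. parallel_search({"C:\\", "E:\\", "F:\\"}, my_match), corresponds to pushing three folderSearch objects onto the work queue. Unlike my app, this sketch creates the threads per call and collects results in one vector instead of firing an event; a long-lived pool would keep the worker loop and move the queue out of the function.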
This system was ~ 2.5 times as fast as a simple sequential search across 3 disks, even on my old development box that only had one core, simply because all 3 disks were searched in parallel. I suspect it would show the same sort of advantage on a modern box because the limiting factor is probably dominated by IO waiting on the disks.
Surprisingly, there was also a speedup with only one disk, but not that much. Don't know why - should be slower, by rights, due to all the extra complication.
Naturally, there were issues. One was that, with a search that fired lots of results, the pool would empty because the UI could not keep up with the threads and so all the folderSearch objects got stuck in the PostMessages queued to the UI, so slowing down the search threads as they had to wait on the pool queue until the PostMessages got handled and so returned folderSearch's to the pool. This also meant that the UI was effectively blocked until the search was over and it could catch up, negating one of the advantages of threading off the search in the first place :( With small result sets, it worked fine.
Another possible issue is that the results come back in an 'unnatural' order, interleaved in such an apparently confusing manner that things like assembling a tree view are much more complex than with a single-threaded recursive search - you have to flit about all over the place to stuff the results into the treeView in the right place. This loads up the GUI with extra work and can negate the searching speed advantage with large numbers of results, as I found out.
This design could run multiple searches concurrently. As a test, I would load on several 3-disk searches at once, (no, not while loading up the treeView - I just dumped the number of files found onto a memo line in the GUI message-handler). This made a huge amount of rattling and slowed everything to a crawl, but it did eventually complete all the searches without crashing. I didn't do this often as I was afraid for my poor disks. Don't try this at home.
I was never sure how many threads to hang off the queue. Six was about the optimum on my old box with local disks. If there are networked disks in the mix, then more is probably better since a network call will tend to block one thread for much longer periods than a local disk read. Never tried that, but loading on more threads did not affect the performance any with local disks, just used more memory for no extra advantage.
Another problem is finding out if the search is actually over - are all the results in, or is some thread still waiting on a network drive that's slow or actually unreachable? With only one search, I could tell because the pool became full again when the search was over, (I dumped the pool level to a statusBar on a 1s GUI timer). It didn't matter in my app, but in others, it might...
Cancelling a search is a similar issue. These sorts of things would need another 'searchClass' to control each search. Each folderSearch allocated to a search would have to keep a reference to the searchClass so that any thread handling the folderSearch could find out if an abort had been set and if so stop doing stuff with that folderSearch. I didn't need this, so did not implement it.
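Since I never built it, the following is only a sketch of how such a control object might look, not something I ran in that app. SearchControl, FolderTask, make_task and run_task are all hypothetical names. Each task keeps a reference to its search's control block; the block carries an abort flag and a count of outstanding tasks, which also answers the "is the search over?" question from the previous point.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical per-search control block shared by every task of one search.
class SearchControl {
public:
    void request_cancel() { cancelled_.store(true); }
    bool cancelled() const { return cancelled_.load(); }
    void task_created() { outstanding_.fetch_add(1); }
    // True for exactly one caller: the one that retires the last task.
    bool task_finished() {
        std::size_t before = outstanding_.fetch_sub(1);
        assert(before > 0);
        return before == 1;
    }
    std::size_t outstanding() const { return outstanding_.load(); }

private:
    std::atomic<bool> cancelled_{false};
    std::atomic<std::size_t> outstanding_{0};
};

// Stand-in for a folderSearch: a path plus a reference to its search.
struct FolderTask {
    std::string path;
    std::shared_ptr<SearchControl> control;
};

// Registers the task with its search before it can be queued anywhere.
FolderTask make_task(std::string path, std::shared_ptr<SearchControl> control) {
    control->task_created();
    return FolderTask{std::move(path), std::move(control)};
}

// What a worker thread would do with a task it popped from the queue.
// `work` must create any child tasks (via make_task) before it returns, so the
// outstanding count cannot reach zero while sub-folders are still pending.
// Returns true if this was the last task of the search.
template <class Work>
bool run_task(const FolderTask& task, Work&& work) {
    if (!task.control->cancelled()) work(task);  // aborted searches do nothing
    return task.control->task_finished();
}
```

The thread for which run_task returns true is the natural place to fire a "search complete" event. Cancelling is then just request_cancel(): already queued tasks still flow through the workers, but they retire without touching the disk.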
Then there's error reporting. If a network drive connection fails, for example, several, (most likely all!), threads can block up for a long time before an exception is raised. Then they all except at once. The catch messages get loaded into an 'errorMess' field in the folderSearch and the results event fired. Human-detectable evidence - the rattling stops. Nothing happens for a minute, then [no of threads] errors appear all at once.
Note well the caveats from the other posters and my experiences. Only attempt something like this if you really, really need it for some special search purpose and you are 100% happy with multiThreaded apps. If you can get away with a single-threaded search, or a shell call to a File Explorer, or almost anything else, do it that way!
I've used the same approach since with an FTP server to generate trees. That was much faster as well, though the server admins were probably not happy about the multiple connections.
Rgds,
Martin

Multithreading a tree-search task with unknown work distribution in each branch is non-trivial (this comes up a lot in say, constraint satisfaction problems.)
The easiest way is to create a task queue (protected by a mutex). Fill this queue with all the children of the root node. Spawn N threads (one for each available CPU core) and have each one repeatedly take a node from the queue and search it. There are various tricks you can do to avoid some bad scenarios (if any thread finds that its node is "unexpectedly deep", it can add new tasks to the queue for the subdirectories it wants other threads to explore). If your node depths are well distributed and the root node has lots of children, you can avoid the queue entirely: just assign thread i (for i from 0 to N - 1) every child whose index j satisfies j % N == i, where j runs from 0 to X - 1 and X is the number of children of the root node.
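A minimal sketch of the queue-free variant, reading the static assignment as "thread i takes children i, i + N, i + 2N, ...". The function name explore_round_robin and the explore callback are placeholders, and the callback is assumed to be safe to call from several threads at once.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Statically splits the root's children over n_threads threads:
// thread i handles every child j with j % n_threads == i.
// explore(child, thread_index) does the actual (sequential) subtree search.
template <class Child, class Explore>
void explore_round_robin(const std::vector<Child>& children, unsigned n_threads,
                         Explore explore) {
    assert(n_threads > 0);
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n_threads; ++i) {
        pool.emplace_back([&children, &explore, i, n_threads] {
            for (std::size_t j = i; j < children.size(); j += n_threads)
                explore(children[j], i);
        });
    }
    for (auto& t : pool) t.join();  // all subtrees done once every thread returns
}
```

There is no locking at all here, which is the attraction, but one unusually deep child keeps its thread busy while the others sit idle. That is the case where the mutex-protected queue earns its keep.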

My first response is to say "just use nftw and forget about doing it multi-threaded". If you happen to have an implementation of nftw that does the tree walk in a multi-threaded fashion, then you get multi-threading for free (I'm not aware of any such implementation). If you really want to do multiple threads, I would suggest using nftw and spawning a new thread for each directory within the call back, but it's not immediately clear that that would be any easier (or any different) than following Kanopus' suggestion. And after thinking about it for a few moments, I fall back to my first suggestion and wonder why you want to do this with multiple threads. Having more threads is unlikely to speed up the search. Use nftw. Don't worry about threading.
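For reference, a single-threaded nftw walk is only a few lines. This is a sketch assuming a POSIX system (nftw is not available with MSVC); walk_with_nftw, WalkResult and the suffix match are invented for the example. Because nftw passes no user-data pointer to its callback, the state goes through file-local pointers, which makes the sketch non-reentrant.

```cpp
#include <ftw.h>       // POSIX nftw()
#include <sys/stat.h>

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct WalkResult {
    std::vector<std::string> matches;  // regular files whose path ends with the suffix
    std::size_t directories = 0;       // readable directories visited, root included
};

namespace {
// State for the callback; set only while walk_with_nftw is running.
WalkResult* g_result = nullptr;
const std::string* g_suffix = nullptr;

bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Called by nftw once per directory entry.
int on_entry(const char* path, const struct stat* /*sb*/, int type, struct FTW* /*info*/) {
    assert(g_result != nullptr && g_suffix != nullptr);
    if (type == FTW_D) {
        ++g_result->directories;
    } else if (type == FTW_F && ends_with(path, *g_suffix)) {
        g_result->matches.emplace_back(path);
    }
    return 0;  // a non-zero return would stop the walk
}
}  // namespace

WalkResult walk_with_nftw(const std::string& root, const std::string& suffix) {
    WalkResult result;
    g_result = &result;
    g_suffix = &suffix;
    // 16 = directories nftw may keep open at once; FTW_PHYS = do not follow symlinks.
    nftw(root.c_str(), on_entry, 16, FTW_PHYS);  // returns -1 if root cannot be walked
    g_result = nullptr;
    g_suffix = nullptr;
    return result;
}
```

Calling walk_with_nftw("/home", ".txt") would collect every .txt file below /home. The return value of nftw is ignored here for brevity; real code would want to check it and errno.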

Assuming each node in the tree represents a directory (and the files within it), and also assuming there is no limit in the number of threads you can open:
Enter the root node, if it has n subdirectories, create n - 1 threads to search in the first n - 1 and continue the search through the last subdirectory. Repeat as needed.

Tree-structures don't typically lend themselves to parallelization. Assuming you have all the nodes loaded into memory, try and organize them such that they occupy an array - after all, they need to live in RAM which is serial - and ignore their tree structure for the purpose of your search. Then iterate over the elements of the array using some sort of parallel for loop. A popular choice for this is OpenMP or you might try parallel_for_each in Visual Studio.
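A rough sketch of what I mean, assuming the tree is already in memory. Node, flatten and count_matches are illustrative names, and the "search" is just a substring test on the node name. The tree is flattened once into an array of pointers, then an OpenMP parallel for iterates over the array. Build with OpenMP enabled (e.g. -fopenmp with GCC or Clang); without it the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Node {
    std::string name;
    std::vector<Node> children;
};

// Pre-order walk that records a pointer to every node in one flat array.
void flatten(const Node& node, std::vector<const Node*>& out) {
    out.push_back(&node);
    for (const Node& child : node.children) flatten(child, out);
}

// Counts nodes whose name contains `needle`, ignoring the tree structure.
std::size_t count_matches(const Node& root, const std::string& needle) {
    std::vector<const Node*> flat;
    flatten(root, flat);
    assert(!flat.empty());  // the root itself is always in the array

    long long hits = 0;
    const long long n = static_cast<long long>(flat.size());
    // Each thread gets a slice of the array; partial counts are summed at the end.
    #pragma omp parallel for reduction(+ : hits)
    for (long long i = 0; i < n; ++i) {
        if (flat[i]->name.find(needle) != std::string::npos) ++hits;
    }
    return static_cast<std::size_t>(hits);
}
```

The flattening pass itself is sequential, so this only pays off when the per-node work in the loop is much more expensive than visiting the node.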
