I have a tree which has all the directories and files as its nodes. I want to search a particular file. Say the tree is spread widely and I want to do a breadth first search to find some particular file and that too using multithreading. How should I do that using multithreading ? What is a good approach ?
There are some case where multiThreading the search will provide a useful speedup - if the tree spans more than one disk, for example, or if some of the disks/nodes are indirected over some network.
I certainly would not want to try creating threads for every folder. That's thousands of create/run/terminate, thousands of stack allocation/free etc. Gross, avoidable overhead.
A multiThreaded search can be done, but as other posters have said, look at available alternatives first. Then read the rest of this post. Then look again
I have done something like this once using a queue approach similar to that suggested by Matt.
I don't want to ever do it again:
I used a producer-consumer work queue on which 6 threads waited for work, (6 because testing showed this to be optimum with my problem). The threads were all created once at startup and never terminated. None of this continual create/load/run/waitFor/getResult/terminateIfYoureLucky stuff that unaccountably seems to be popular with developers despite poor performance, shutdown AVs, 216/217 messageBoxes etc etc.
The work came in the form of a 'folderSearch' class that contained the path to be searched, a file match function event to call and the FindFirst/FindNext loop method to do the searching. I created a couple hundred of these at startup in a pool, (ie pushed onto another P-C pool queue:). When the FF/FN iterated the files in the folder to look for matching files, encountering a sub-folder resulted in extracting another folderSearch instance from the pool, loading it up with the new path & pushing it onto the work queue - some thread would then pick it up and iterate the sub-folder. The class had a list for the paths to matching files and a 'results' event to call, (with 'this' as parameter, of course), if it found something of interest. If a folderSearch got to the end of a twig, having found nothing and with nothing left to search, it would release itself back to the pool, (well, OK, the thread would do it, but you know what I mean:).
There was no need for any explicit 'load balancing'. If one node was exceptionally deep, it would naturally end up with all six threads working on its subtrees because the other paths are exhausted.
Searching 3 disks in their entirety meant popping 3 folderSearch from the pool, loading them up with 'C:\', 'E:\', 'F:\' and the file match method and then pushing them onto the work queue. The disks then made rattling noises and the event would eventually fire with results. In my case, (Windows), the event PostMessaged the folderSearch objects to a UI thread where the results were displayed in a treeView before repooling the folderSearch's for re-use.
This system was ~ 2.5 times as fast as a simple sequential search across 3 disks, even on my old development box that only had one core, simply because all 3 disks were searched in parallel. I suspect it would show the same sort of advantage on a modern box because the limiting factor is probably dominated by IO waiting on the disks.
Surprisingly, there was also a speedup with only one disk, but not that much. Don't know why - should be slower, by rights, due to all the extra complication.
Naturally, there were issues. One was that, with a search that fired lots of results, the pool would empty because the UI could not keep up with the threads and so all the folderSearch objects got stuck in the PostMessages queued to the UI, so slowing down the search threads as they had to wait on the pool queue until the PostMessages got handled and so returned folderSearch's to the pool. This also meant that the UI was effectively blocked until the search was over and it could catch up, negating one of the advantages of threading off the search in the first place :( With small result sets, it worked fine.
Another possible issue is that the results come back in an 'unnatural' manner, interleaved in such an aparrently confusing manner that things like assembling a tree view are much more complex than with a single-threaded recursive search - you have to flit about all over the place to stuff the results into the treeView in the right place. This loads up the GUI with extra work and can negate the searching speed advantage with large numbers of results, as I found out
This design could run multiple searches concurrently. As a test, I would load on several 3-disk searches at once, (no not while loading up the treeView - I just dumped the number of files found onto a memo line in the GUI message-handler). This made a huge amount of rattling and slowed everything to a crawl, but it did eventually complete all the searches without crashing. I didn't do this often as I was afraid for my poor disks. Don't try this at home
I was never sure how many threads to hang off the queue. Six was about the optimum on my old box with local disks. If there are networked disks in the mix, then more is probably better since a network call will tend to block one thread for much longer periods than a local disk read. Never tried that, but loading on more threads did not affect the performance any with local disks, just used more memory for no extra advantage.
Another problem is finding out if the search is actually over - are all the results in .. or is some thread still waiting on a network drive that's slow or actually unreachable? With only one search, I could tell because the pool became full again when the search was over, (I dumped the pool level to a stausBar on a 1s GUI timer). It didn't matter in my app, but in others, it might...
Cancelling a search is a similar issue. These sorts of things would need another 'searchClass' to control each search. Each folderSearch allocated to a search would have to keep a reference to the searchClass so that any thread handling the folderSearch could find out if an abort had been set and if so stop doing stuff with that folderSearch. I didn't need this, so did not implement it.
The there's error reporting. If a network drive connection fails, for example, several, (most likely all!), threads can block up for a long time before an exception is raised. Then they all except at once. The catch messages get loaded into an 'errorMess' field in the folderSearch and the results event fired. Human-detectable evidence - the rattling stops. Nothing happens for a minute, then [no of threads] errors appear all at once.
Note well the caveats from the other posters and my experiences. Only attempt something like this if you really, really need it for some special search purpose and you are 100% happy with multiThreaded apps. If you can get away with a single-threaded search, or a shell call to a File Explorer, or almost anything else, do it that way!
I've used the same approach since with an FTP server to generate trees. That was much faster as well, though the server admins were probably not happy about the multiple connections
Rgds,
Martin
Multithreading a tree-search task with unknown work distribution in each branch is non-trivial (this comes up a lot in say, constraint satisfaction problems.)
The easiest way is to create a task queue (protected by a mutex.) Fill this queue with all the children of the root node. Spawn N threads (one for each available CPU core) and have them search through each node. There are various tricks you can do to avoid some bad scenarios (if any thread finds that its node is "unexpectedly deep" you can have them add new tasks to the queue corresponding to subdirectories it wants other threads to explore.) If your node depths are well distributed and the root node has lots of children you can avoid the queue entirely -- just assign each thread with index i the task of exploring X % N + i (where X is the number of children of the root node.)
My first response is to say "just use nftw and forget about doing it multi-threaded". If you happen to have an implementation of nftw that does the tree walk in a multi-threaded fashion, then you get multi-threading for free (I'm not aware of any such implementation). If you really want to do multiple threads, I would suggest using nftw and spawning a new thread for each directory within the call back, but it's not immediately clear that that would be any easier (or any different) than following Kanopus' suggestion. And after thinking about it for a few moments, I fall back to my first suggestion and wonder why you want to do this with multiple threads. Having more threads is unlikely to speed up the search. Use nftw. Don't worry about threading.
Assuming each node in the tree represents a directory (and the files within it), and also assuming there is no limit in the number of threads you can open:
Enter the root node, if it has n subdirectories, create n - 1 threads to search in the first n - 1 and continue the search through the last subdirectory. Repeat as needed.
Tree-structures don't typically lend themselves to parallelization. Assuming you have all the nodes loaded into memory, try and organize them such that they occupy an array - after all, they need to live in RAM which is serial - and ignore their tree structure for the purpose of your search. Then iterate over the elements of the array using some sort of parallel for loop. A popular choice for this is OpenMP or you might try parallel_for_each in Visual Studio.
Related
The task I have in hand is to read the lines of large file, process them, and return ordered results.
My algorithm is:
start with master process that will evaluate the workload (written in the first line of the file)
spawn worker processes: each worker will read part of the file using pread/3, process this part, and send results to master
master receives all sub-results, sort, and return
so basically no communication needed between workers.
My questions:
How to find the optimal balance between the number of erlang processes and the number of cores? so if I spawn one process for each processor core I have would that be under utilizing of my cpu?
How does pread/3 reach the specified line; does it iterate over all lines in file ? and is pread/3 a good plan to parallel file reading?
Is it better to send one big message from process A to B or send N small messages? I have found part of the answer in the below link, but I would appreciate further elaboration
erlang message passing architecture
Erlang processes are cheap. You're free (and encouraged) to use more than however many cores you have. There might be an upper limit to what is practical for your problem (loading 1TB of data in one process per line is asking a bit for much, depending on line size).
The easiest way to do it when you don't know is to let the user decide. This means you could decide to spawn N workers, and distribute work between them, waiting to hear back. Re-run the program while changing N if you don't like how it runs.
Trickier ways to do it is to benchmark a bunch of time, pick what you think makes sense as a maximal value, stick it in a pool library (if you want to; some pool go for preallocated resources, some for a resizable amount), and settle for what would be a one-size-fits-all solution.
But really, there is no easy 'optimal number of cores'. You can run it on 50 processes as well as on 65,000 of them if you want; if the task is embarrassingly parallel, the VM should be able to make usage of most of them and saturate the cores anyway.
-
Parallel file reads is an interesting question. It may or may not be faster (as direct comments have mentioned) and it may only represent a speed up if the work on each line is minimal enough that reading the file has the biggest cost.
The tricky bit is really that functions like pread/2-3 takes a byte offset. Your question is worded such that you are worried about lines of the file. The byte offsets you hand off to workers may therefore end up straddling a line. If your block ends up at the word my in this is my line\nhere it goes\n, one worker will see itself have an incomplete line, while the other will report only on my line\n, missing the prior this is.
Generally, this kind of annoying stuff is what will lead you to have the first process own the file and sift through it, only to hand off bits of text to process to workers; that process will then act as some sort of coordinator.
The nice aspect of this strategy is that if the main process knows everything that was sent as a message, it also knows when all responses have been received, making it easy to know when to return the results. If everything is disjoint, you have to trust both the starter and the workers to tell you "we're all out of work" as a distinct set of independent messages to know.
In practice, you'll probably find that what helps the most will be to know do operations that help the life of your hardware regarding file operations, more than "how many people can read the file at once". There's only one hard disk (or SSD), all data has to go through it anyway; parallelism may be limited in the end for the access there.
-
Use messages that make sense for your program. The most performant program would have a lot of processes able to do work without ever needing to pass messages, communicate, or acquire locks.
A more realistic very performant program would use very few messages of a very small size.
The fun thing here is that your problem is inherently data-based. So there's a few things you can do:
make sure you read text in a binary format; large binaries (> 64b) get allocated on a global binary heap, are shared around and GC'd with reference counting
Hand in information on what needs to be done rather than the data for doing it; this one would need measuring, but the lead process could go over the file, note where lines end, and just hand byte offsets to the workers so they can go and read the file themselves; do note that you'll end up reading the file twice, so if memory allocation is not your main overhead, this will likely be slower
Make sure the file is read in raw or ram mode; other modes use a middle-man process to read and forward data (this is useful if you read files over a network in clustered Erlang nodes); raw and ram modes gives the file descriptor directly to the calling process and is a lot faster.
First worry about writing a clear, readable and correct program. Only if it is too slow should you attempt to refactor and optimize it; you may very well find it good enough on the first try.
I hope this helps.
P.S. You can try the really simple stuff at first:
either:
read the whole file at once with {ok, Bin} = file:read_file(Path) and split lines (with binary:split(Bin, <<"\n">>, [global])),
use {ok, Io} = file:open(File, [read,ram]) and then use file:read_line(Io) on the file descriptor repeatedly
use {ok, Io} = file:open(File, [read,raw,{read_ahead,BlockSize}]) and then use file:read_line(Io) on the file descriptor repeatedly
call rpc:pmap({?MODULE, Function}, ExtraArgs, Lines) to run everything in parallel automatically (it will spawn one process per line)
call lists:sort/1 on the result.
Then from there you can refine each step if you identify them as problematic.
Suppose I am running an MPI program on 5 nodes, which will run a simulation 50 times. However, simulations may take notably different amounts of time. If I have a set of initial conditions, say ic1,ic2,...ic50, when one node/process completes a simulation, I would like it to run a new simulation using the next (not yet used) set of initial conditions. My initial thought was to use MPI_Bcast or MPI_Gather s.t. all nodes have an int holding the next index to run, and update this before starting a new simulation. Is this a reasonable idea/are there other and/or better solutions?
you need a shared work queue. There are lots of ways to do this. The simplest way is to designate one mpi rank the "master" and the others the "workers". Workers ask the master for a work unit (one of your initial conditions), go to work, then when done ask master for new one. Some additional detail here (it says OpenMPI, but there's nothing about it that's specific to OpenMPI): Non-blocking data sharing through OpenMPI
There is a project called ADLB (http://www.mcs.anl.gov/project/adlb-asynchronous-dynamic-load-balancer) that is a work queue on steroids. Possibly (definitely) overkill for a 5 node job, but might make sense for a 5,000 node job.
you could use RMA shared memory and use that to keep track of an index. If the process owning the RMA window is itself busy computing on a work unit, it may be a bit slow to respond to requests, but the benefit you would not need to burn one MPI process as a master process.
You could use the MPI shared file pointer routines to keep a work queue on disk. I'm generally quite down on the MPI shared file pointer routines, but in this case where the I/O is small (read an initial condition) relative to the amount of CPU work, it will be ok.
I am trying to use google perf tools CPU profiler for debugging performance issues on a multi-threaded program. With single thread it take 250 ms while 4 threads take around 900ms.
My program has a mmap'ed file which is shared across threads and all operations are read only. Also my program creates large number of objects which are not shared across threads. (Specifically my program uses CRF++ library to do some querying). I am trying to figure out how to make my program perform better with multi threads. Call graph produced by CPU profiler of gperf tools shows that my program spends a lot of time (around 50%) of time in _L_unlock_16.
Searching web for _L_unlock_16 pointed to some bug reports with canonical suggesting that its associated with libpthread. But other than that I was not able to find any useful information for debugging.
A brief description of what my program does. I have few words in a file (4). In my program I have a processWord() which processes a single word using CRF++. This processWord() is what each thread executes. My main() reads words from the file and each threads runs processWord() in parallel. If I process a single word(hence only 1 thread) it takes 250ms and so if I process all 4 words(and hence 4 threads) I expected it to finish by same time 250 ms, however as I mentioned above it's taking around 900ms.
This is the callgraph of the execution - https://www.dropbox.com/s/o1mkh477i7e9s4m/cgout_n2.png
I want to understand why my program is spending lot of time at _L_unlock_16 and what I can do to mitigate it.
Yet again the _L_unlock_16 is not a function of your code. Have you looked at the stracktraces above that function? What are the callers of it when the program waits? You've said that the program wastes 50% waiting inside. But, which part of the program ordered that operation? Is it again from memory alloc/dealloc ops?
The function seems to come from libpthread. Does CRF+ handle threads/libpthread in any way? If yes, then maybe the library is illconfigured? Or maybe it implements some 'basic threadsafety' by adding locks everywhere and simply is not built well for multithreading? What does the docs say about that?
Personally, I'd guess that it ignores threads and that you have added all the threading. I may be wrong, but if that's true, then the CRF++ probably will not call that 'unlock' function at all, and the 'unlock' is somwhow called from your code that orchestrates the threads/locks/queues/messages etc? Halt the program a few times and look at who called the unlock. If it really spends 50% sitting in the unlock, you will very quickly know who causes the lock to be used and you will be able to either eliminate it or at least perform a more refined research..
EDIT #1:
Eh.. when I said "stacktrace" I meant stacktrace, not callgraph. Callgraph may look nice in trivial cases, but in more complex ones, it will be mangled and unreadable and will hide the precious details into "compacted" form.. But, fortunatelly, here the case looks simple enough.
Please observe the beginning: "Process word, 99x". I assume that the "99x" is the call count. Then, look at "tagger-parse": 97x. From that:
61x into rebuildFeatures from which 41x goes right into unlock and 20(13) indirectly into it
23x goes to buildLattice fro which 21x goes into unlock
I'd guess that it was the CRF++ uses locking quite heavily. For me, it seems that you simply observe the effects of CRF's internal locking. It certainly is not lockless internally.
It seems to lock at least once per "processWord". It's hard to say without looking at code (is it opensource? I've not checked..), from stacktraces it would be more obvious, but IF it really locks once per "processWord" that it could even be a sort of a "global lock" that protects "everything" from "all threads" and causes all jobs to serialize. Whatever. Anyways, clearly, it's the CRF++'s internals that lock and wait.
If your CRF objects are really (really) not shared across threads, then remove threading configuration flags from CRF, pray that they were sane enough to not use any static variables nor global objects, add some own locking (if needed) at the topmost job/result level and retry. You should have it now much faster.
If the CRF objects are shared, unshare them and see above.
But, if they are shared behind the scenes, then there's little doable. Change your library to a one that has a better threading support, OR fix the library, OR ignore and use it with current performance.
The last advice may sound strange (it works slowly, right? so why to ignore it?), but in fact is the most important one, and you should try it first. If the parallel tasks have similar "data profile", then there is very probable that they will try to hit the same locks in the same approximate moment of time. Imagine a medium-sized cache that holds words sorted by its first letter. At the toplevel there's array of, say, 26 entries. Each entry has a lock and a list of words inside. If you run 100 threads that will each first check "mom" then "dad" then "son" - then all of that 100 threads will first hit and wait for each other at "M", then at "D" then at "S". Well, approximately/probably of course. But you get the idea. If the data profile were more random then they'd block each other far less. Mind that processing ONE word is a .. small task and that you try to process the same word(s). Even if the internal CRF's locking is smart, it is just bound to hit the same areas. Try again with a more dispersed data.
Add to that the fact that threading costs. If something was guarded against races with use of locks, then every lock/unlock costs because at least they have to "halt and check if the lock is open" (sorry very imprecise wording). If the data-to-process is small in relation to the-amount-of-lockchecks, then adding more threads will not help and will just waste the time. For checking one word, it may even happen that the sole handling of a single lock takes longer than processing the word! But, if the amount of data to be processed were larger, then the cost of flipping a lock compared to processing the data might start being neglible.
Prepare a set of 100 or more words. Run and measure it on one thread. Then partition the words at random and run it on 2 and 4 threads. And measure. If not better, try at 1000 and 10000 words. The more the better, of course, keeping in mind that the test should not last till your next birthday ;)
If you notice that 10k words split over 4 threads (2500w per th) works about 40%-30%-or even 25% faster than on one thread - here you go! You simply gave it a too small job. It was tailored and optimized for bigger ones!
But, on the other hand, it may happen that 10k words split over 4 threads does not work faster, or worse, works slower - then it might indicate that the library handles multithreading very wrong. Now try the other things like stripping threading from it or repairing it.
I have a C++ program running on Linux in which a new thread is created to do some computationally expensive work independent of the main thread (The computational work completes by writing the results to files, which end up being very large). However, I'm getting relatively poor performance.
If I implement the program straightforward (without introducing other threads), it completes the task in roughly 2 hours. With the multi-threaded program it takes around 12 hours to do the same task (this was tested with only one thread spawned).
I've tried a couple of things, including pthread_setaffinity_np to set the thread to a single CPU (out of the 24 available on the server I'm using), as well as pthread_setschedparam to set the scheduling policy (I've only tried SCHED_BATCH). But the effects of these have so far been negligible.
Are there any general causes for this kind of problem?
EDIT: I've added some example code that I'm using, which is hopefully the most relevant parts. The function process_job() is what actually does the computational work, but it would be too much to include here. Basically, it reads in two files of data, and uses these to perform queries on an in-memory graph database, in which the results are written to two large files over a period of hours.
EDIT part 2: Just to clarify, the problem is not that I want to use threads to increase the performance of an algorithm I have. But rather, I want to run many instances of my algorithm simultaneously. Therefore, I expect the algorithm would run at a similar speed when put in a thread as it would if I didn't use multi-threads at all.
EDIT part 3: Thanks for the suggestions all. I'm currently doing some unit tests (seeing which parts are slowing down) as some have suggested. As the program takes a while to load and execute, it is taking time to see any results from the tests and therefore I apologize for late responses. I think the main point I wanted to clarify is possible reasons why threading could cause a program to run slowly. From what I gather from the comments, it simply shouldn't be. I'll post when I can find a reasonable resolution, thanks again.
(FINAL) EDIT part 4: It turns out that the problem was not related to threading after all. Describing it would be too cumbersome at this point (including the use of compiler optimization levels), but the ideas posted here were very useful and appreciated.
struct sched_param sched_param = {
sched_get_priority_min(SCHED_BATCH)
};
int set_thread_to_core(const long tid, const int &core_id) {
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(core_id, &mask);
return pthread_setaffinity_np(tid, sizeof(mask), &mask);
}
void *worker_thread(void *arg) {
job_data *temp = (job_data *)arg; // get the information for the task passed in
...
long tid = pthread_self();
int set_thread = set_thread_to_core(tid, slot_id); // assume slot_id is 1 (it is in the test case I run)
sched_get_priority_min(SCHED_BATCH);
pthread_setschedparam(tid, SCHED_BATCH, &sched_param);
int success = process_job(...); // this is where all the work actually happens
pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
...
pthread_t temp;
pthread_create(&temp, NULL, worker_thread, (void *) &jobs[i]); // jobs is a vector of a class type containing information for the task
...
return 0;
}
If you have plenty of CPU cores, and have plenty of work to do, it should not take longer to run in multithreaded than single threaded mode - the actual CPU time may be a fraction longer, but the "wall-clock time" should be shorter. I'm pretty sure that your code has some sort of bottleneck where one thread is blocking the other.
This is because of one or more of these things - I'll list them first, then go into detail below:
Some lock in a thread is blocking the second thread from running.
Sharing of data between threads (either true or "false" sharing)
Cache thrashing.
Competition for some external resource causing thrashing and/or blocking.
Badly designed code in general...
Some lock in a thread is blocking the second thread from running.
If there is a thread that takes a lock, and another thread wants to use the resource that is locked by this thread, it will have to wait. This obviously means the thread isn't doing anything useful. Locks should be kept to a minimum by only taking the lock for a short period. Using some code to identify if locks are holding your code, such as:
while (!tryLock(some_some_lock))
{
tried_locking_failed[lock_id][thread_id]++;
}
total_locks[some_lock]++;
Printing some stats of the locks would help to identify where the locking is contentious - or you can try the old trick of "Press break in the debugger and see where you are" - if a thread is constantly waiting for some lock, then that's what's preventing progress...
Sharing of data between threads (either true or "false" sharing)
If two threads use [and update the value of it frequently] the same variable, then the two threads will have to swap "I've updated this" messages, and the CPU's have to fetch the data from the other CPU before it can continue with it's use of the variable. Since "data" is shared on a "per cache-line" level, and a cache-line is typically 32-bytes, something like:
int var[NUM_THREADS];
...
var[thread_id]++;
would classify as something called "false sharing" - the ACTUAL data updated is unique per CPU, but since the data is within the same 32-byte region, the cores will still have updated the same are of memory.
Cache thrashing.
If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don't run in "lockstep" on which part of cache the CPU uses. If the data is 2^n (power of two) and fairly large (a multiple of the cache-size), it's a good idea to "add an offset" for each thread - for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.
Competition for some external resource causing thrashing and/or blocking.
If two threads are reading or writing from/to the hard-disk, network card, or some other shared resource, this can lead to one thread blocking another thread, which in turn means lower performance. It is also possible that the code detects different threads and does some extra flushing to ensure that data is written in the correct order or similar, before starting work with the other thread.
It is also possible that there are locks internally in the code that deals with the resource (user-mode library or kernel mode drivers) that block when more than one thread is using the same resource.
Generally bad design
This is a "catchall" for "lots of other things that can be wrong". If the result from one calculation in one thread is needed to progress the other, obviously, not a lot of work can be done in that thread.
Too small a work-unit, so all the time is spent starting and stopping the thread, and not enough work is being done. Say for example that you dole out small numbers to be "calculate if this is a prime" to each thread, one number at a time, it will probably take a lot longer to give the number to the thread than the calculation of "is this actually a prime-number" - the solution is to give a set of numbers (perhaps 10, 20, 32, 64 or such) to each thread, and then report back the result for the whole lot in one go.
There are plenty of other "bad design". Without understanding your code it's quite hard to say for sure.
It is entirely possible that your problem is none of the ones I've mentioned here, but most likely it is one of these. Hopefully this asnwer is helpful to identify the cause.
Read CPU Caches and Why You Care to understand why a naive port of an algorithm from one thread to multiple threads will more often than not result in greatly reduced performance and negative scalability. Algorithms that are specififcally designed for parallelism take care of overactive interlocked operations, false sharing and other causes of cache pollution.
Here are a few things you might wanna look into.
1°) Do you enter any critical section (locks, semaphores, etc.) between your worker thread and your main thread? (this should be the case if your queries modify the graph). If so, that could be one of the sources of the multithreading overhead : threads competing for a lock usually degrades performances.
2°) You're using a 24 cores machines, which I assume would be NUMA (Non-Uniform Memory Access). Since you set the threads affinities during your tests, you should pay close attention to the memory topology of your hardware. Looking at the files in /sys/devices/system/cpu/cpuX/ can help you with that (beware that cpu0 and cpu1 aren't necessarily close together, and thus does not necessarily share memory). Threads heavily using memory should use local memory (allocated in the same NUMA node as the core they're executing on).
3°) You are heavily using disk I/O. Which kind of I/O is that? if every thread perform every time some synchronous I/O, you might wanna consider asynchronous system calls, so that the OS stays in charge of scheduling those requests to the disk.
4°) Some caches issues have already been mentionned in other answers. From experience, false sharing can hurt performances as much as you're observing. My last recommendation (which should have been my first) is to use a profiler tool, such as Linux Perf, or OProfile. With such performance degradation you're experiencing, the cause will certainly appear quite clearly.
The other answers have all addressed the general guidelines that can cause your symptoms. I will give my own, hopefully not excessively redundant version. Then I will talk a bit about how you can get to the bottom of the problem with everything discussed in mind.
In general, there's a few reasons you'd expect multiple threads to perform better:
A piece of work is dependent on some resources (disk, memory, cache, etc.) while other pieces can proceed independently of these resources or said workload.
You have multiple CPU cores that can process your workload in parallel.
The main reasons, enumerated above, you'd expect multiple threads to perform less well are all based on resource contention:
Disk contention: already explained in detail and can be a possible issue, especially if you are writing small buffers at a time instead of batching
CPU time contention if the threads are scheduled onto the same core: probably not your issue if you're setting affinity. However, you should still double check
Cache thrashing: similarly probably not your problem if you have affinity, though this can be very expensive if it is your problem.
Shared memory: again talked about in detail and doesn't seem to be your issue, but it wouldn't hurt to audit the code to check it out.
NUMA: again talked about. If your worker thread is pinned to a different core, you will want to check whether the work it needs to access is local to the main core.
Ok so far not much new. It can be any or none of the above. The question is, for your case, how can you detect where the extra time is coming from. There's a few strategies:
Audit the code and look for obvious areas. Don't spend too much time doing this as it's generally unfruitful if you wrote the program to begin with.
Refactor the single threaded code and the multi-threaded code to isolate one process() function, then profile at key checkpoints to try to account for the difference. Then narrow it down.
Refactor the resource access into batches, then profile each batch on both the control and the experiment to account for the difference. Not only will this tell you which areas (disk access vs memory access vs spending time in some tight loop) you need to focus your efforts on, doing this refactor might even improve your running time overall. Example:
First copy the graph structure to thread-local memory (perform a straight-up copy in the single-threaded case)
Then perform the query
Then setup an asynchronous write to disk
Try to find a minimally reproducible workload with the same symptoms. This means changing your algorithm to do a subset of what it already does.
Make sure there's no other noise in the system that could've caused the difference (if some other user is running a similar system on the work core).
My own intuition for your case:
Your graph structure is not NUMA friendly for your worker core.
The kernel can actually scheduled your worker thread off the affinity core. This can happen if you don't have isolcpu on for the core you're pinning to.
I can't tell you what's wrong with your program because you haven't shared enough of it to do a detailed analysis.
What I can tell you is if this was my problem the first thing I would try is to run two profiler sessions on my application, one on the single threaded version and another on the dual thread configuration. The profiler report should give you a pretty good idea of where the extra time is going. Note that you may not need to profile the entire application run, depending on the problem the time difference may become obvious after you profile for a few seconds or minutes.
As far as profiler choices for Linux you may want to consider oprofile or as a second choice gprof.
If you find you need help interpreting the profiler output feel free to add that to your question.
It can be a right pain in the rear to track down why threads aren't working as planned. One can do so analytically, or one can use tool to show what's going on. I've had very good mileage out of ftrace, Linux's clone of Solaris's dtrace (which in turn is based on what VxWorks, Greenhill's Integrity OS and Mercury Computer Systems Inc have been doing for a looong time.)
In particular I found this page very useful: http://www.omappedia.com/wiki/Installing_and_Using_Ftrace, particularly this and this section. Don't worry about it being an OMAP orientated website; I've used it on X86 Linuxes just fine (though you may have to build a kernel to include it). Also remember that the GTKWave viewer is primarily intended for looking at log traces from VHDL developments, which is why it looks 'odd'. It's just that someone realised that it would be a usable viewer for sched_switch data too, and that saved them writing one.
Using the sched_switch tracer you can see when (but not necessarily why) your threads are running, and that might be enough to give you a clue. The 'why' can be revealed by careful examination of some of the other tracers.
If you are getting slowdown from using 1 thread, it is likely due to overhead from using thread safe library functions, or from thread setup. Creating a thread for each job will cause significant overhead, but probably not as much as you refer to.
In other words, it is probably some overhead from some thread safe library function.
The best thing to do, is to profile your code to find out where time is spent. If it is in a library call, try to find a replacement library or implement it yourself. If the bottleneck is thread creation/destruction try reusing threads, for instance using OpenMP tasks or std::async in C++11.
Some libraries are really nasty wrt thread safe overhead. For instance, many rand() implementations use a global lock, rather than using thread local prgn's. Such locking overhead is much larger than generating a number, and is hard to track without a profiler.
The slowdown could also stem from small changes you have made, for instance declaring variables volatile, which generally should not be necessary.
I suspect you're running on a machine with one single-core processor. This problem is not parallelizable on that kind of system. Your code is constantly using the processor, which has a fixed number of cycles to offer to it. It actually runs more slowly because the additional thread adds expensive context switching to the problem.
The only kinds of problems that parallelize well on a single-processor machine are those that allow one path of execution to run while another is blocked waiting for I/O, and situations (such as keeping a responsive GUI) where allowing one thread to get some processor time is more important than executing your code as quickly as possible.
If you only want to run many independent instances of your algorithm can you just submit multiple jobs (with different parameters, can be handled by a single script) to your cluster? That would eliminate the need to profile and debug your multithreaded program. I don't have much experience with multithreaded programming but if you use MPI or OpenMP then you'd have to write less code for the book keeping too. For example, if some common initialization routine is needed and the processes can run independently thereafter you can just do that by initializing in one thread and doing a broadcast. No need for maintaining locks and such.
I am working on a Tetris AI implementation. It is a GUI application that plays the game by itself. The user can manipulate a few parameters that influence the decisions made by the AI. The basic algorithm goes as follows:
Start a new thread and clone the current game state to avoid excessive locking.
Generate a list of all possible future game states. These become child nodes of the current game state.
For each child node generate it's future game states.
Keep doing this recursively until a predefined depth has been reached.
Once the requested depth has been reached, find the best end node and recursively find it's parent node until you have the first level child.
Delete all child nodes that are not on the path between the child node and the end node.
This path is now the list of precalculated moves.
The main game executes the list of precalculated moves (with some fancy animation).
This is working pretty well up until search depth 4. After that I start to get memory problems. The number of possible game states can go from 9 to 34. So the worst case scenario for a level 4 search would be 34^4 game states. Windows XP seems unable to deal with level 5 searches (it hangs at 2+ GB).
So if I want to use deeper searches I'll need to use a strategy where I delete the non-promising branches and continue with the ones that will lead to a good score. But this makes it harder to estimate a maximum acceptable search depth. Therefore I think that I would be better to specify a memory limit instead of a depth limit.
I considering to use a memory pool and use "placement new" to create my objects on the pool's memory segments. However the game grid is implemented as a STL vector. So in order to allocate it on the pool I need to implement a custom allocator.
This seems quite a challenge and perhaps I'm overlooking a simpler solution. So I'd like your insights on how to best deal with this.
Can boost, or another library, provide me some of these facilities? (I already found Poco's MemoryPool.) Are there any good online resources to help me get going?
FYI: here's the source code and a sample binary for Windows.
You can create a memory pool, etc, but that won't really make it any easier or harder to count game state instances. You do need to make sure you don't go over a certain number of active states in your decision tree, with or without a pool. And Boost does have one: http://www.boost.org/doc/libs/1_44_0/libs/pool/doc/index.html
It sounds like you're not really doing any pruning of the tree, which would allow you to get much deeper. Evaluate each future game state and drop ones unlikely to develop into anything useful, and don't waste your time going down that branch.
Despite the lack of context [what kind of search problem are you doing? Depth first, breadth first, A*?....] My suggestion is:
Use semaphores to limit the amount of processing that is done at one time, and then release it once the processing has been evaluated. I can't really recommend a specific library that includes Semaphores, as that threading is not built in to C++, however check with your framework's documentation.