VS 7.1 Release compile and multiple threads

VS 7.1 Release compile and multiple threads - c++

VS 7.1 release mode does not seem to be properly parallelizing threads while debug mode does. Here is a summary of what is happening.
First, for what it's worth, here is the main piece of code that parallelizes, but I don't think it's an issue:
// parallelize the search
CWinThread* thread[THREADS];
for ( i = 0; i < THREADS; i++ ) {
thread[i] = AfxBeginThread( game_search, &parallel_params[i],
THREAD_PRIORITY_NORMAL, 0, CREATE_SUSPENDED );
thread[i]->m_bAutoDelete = FALSE;
thread[i]->ResumeThread();
}
for ( i = 0; i < THREADS; i++ ) {
WaitForSingleObject(thread[i]->m_hThread, INFINITE);
delete(thread[i]);
}
THREADS is a global variable that I set and I recompile if I want to change the number of threads. To give a bit of context this is a game playing program that searches game positions.
Here is what happens that doesn't make sense to me.
First, compiling in debug mode. If I set THREADS to 1 the one thread manages to search about 13,000 positions. If I set THREADS to 2, each thread searches about 13,000 positions. Great!
If I compile in release mode and set THREADS to 1 the thread manages to search about 30,000 positions, a typical speedup I'm used to seeing when moving from debug to release. But here is the kicker. When I compile with THREADS = 2 each thread only searches about 15,000 positions. Obviously half of what THREADS = 1 does, so effectively a release compile gives me no effective speedup whatsoever. :(
Watching task manager when these things run, with THREADS = 1 I see 50% CPU usage on my dual core machine and when THREADS = 2 I see 100% CPU usage. But the release compile seems to be giving me an effective CPU usage of 50%. Or something?!
Any thoughts? Is there something I should be setting in the Property Pages?
Update: The following is also posted below but it was suggested I update this post. It was also suggested I post code, but it is a quite large project. I'm hoping others have run into this kind of behavior themselves in the past and can shed some light on what going on.
I ran the program on a quad core system and got consistent but still confusing results. I know I am verging on getting away from a specific programming question and becoming a bit abstract, but I'd really like to hear any comments you might have to help explain the numbers I am seeing. For all of these tests I run for 30 seconds and according to task manager all threads are running full power for the entire 30 seconds.
When running in Debug mode, if I run with 1 thread it gets X amount of work done. If I run 2 threads each thread gets X amount of work done. Similarly with 3 and 4 threads. Scaling is perfect.
When running in Release mode, this is what happens:
With 1 thread: it gets Y amount of work done, where Y is nearly double X.
With 2 threads: Each thread gets Y amount of work done. Again, perfect scaling.
With 3 threads: 1 thread gets Y amount of work done, the other 2 threads get 2/3 Y amount of work done. I've lost about 2/3 of a CPU even though one is presumable completely idle. Task Manager shows 75% CPU usage.
With 4 threads: 1 thread gets Y amount of work done. The other 3 threads get 1/2 Y amount of work done. Now I've lost about 1.5 CPU's worth of computing. The Task Manager shows 100% CPU usage.
The obvious questions are:
(1) Repeating the earlier question, was does Debug mode scale so well, but not Release?
(2) Why is one core always able to get full usage but the others seem to fall off? This lack of symmetry is disturbing.
(3) Why are the others falling off? Memory bandwidth was suggested earlier but that seem like an awfully steep price.
Any comments or insights are most welcome. And, as always, thanks!

I think you should be using WaitForMultipleObjects().

The problem with multi-threading is that it is non-deterministic.
First of all, the DEBUG target doesn't optimize the code. It also adds additional code for runtime checks (e.g. asserts, traces in MFC, etc.).
The RELEASE target is optimized. So in release mode, the binary can be slightly different than in case of DEBUG mode.
What is the job executed by the thread is also important. For example, if your threads are using some IO operations, they will have some idle times, waiting for those IO operations to complete. Since in RELEASE mode the code to be executed is expected to be more efficient, the ratio between idle time and execution time might be different than in DEBUG mode.
I am only guessing possible explanations, given the provided information.
Later update:
You can use WaitForMultipleObjects to wait for all the threads to finish:
DWORD result = WaitForMultipleObjects(
numberOfThreads, // Number of thread handles in the array
threadHandleArray, // the array of thread handles
true, // true means wait for all the threads to finish
INFINITE); // wait indefinetly
if( result == WAIT_FAILED)
// Some error handling here

I'm not sure I understand why there are a different number of positions searched in Debug vs. Release. You are waiting for the threads to complete, so I would just expect the Release version to finish faster but for both versions to generate the same results.
Are you imposing a per-thread time limit? If so what is the mechanism for this?
In the absence of logic bugs, it would appear that your processing is CPU limited for the Debug case in both single and double threaded versions. In the release case, you are not getting any effective speedup which means that either the processing is more efficient and the processing is now limitied by something else (e.g. IO or memory bandwidth) or that any gains that you are making are offset by frequent context switching between the threads which might happen if you have a poor synchronization strategy between the threads.
It would be helpful to know exact what processing each thread does, what shared data they have and how often they need to synchronize with each other.

As Charles Bailey said, from you description it seems like you are imposing a per-thread time limit.
It could be the case that the timing mechanism you use references wall clock time in debug mode and CPU time (which sums across all processors/cores in use) in release mode. Thus, when THREADS = 2 in release mode, you use the total allotment of CPU time twice as fast, doing half as much work on each core.
Just an idea. Can you give more detail on your timing mechanism?

The fact that you get 30k positions from both 1 and 2 threads looks suspicious to me. Could that limit come from another component in your system? You mention each thread is totaly independent, but are you by any chance using any of the Interlocked* functions? They look innocent, but they actually force a synchronization of all CPU caches, which can be painful when trying to squeeze the most out of the CPU.
What I would do is have each thread do some dummy action such (string manipulation or so), just to waste some time. If that scales well, add a portion of the thread's real code to the dummy action, and test again. Repeat until the performance stops scaling, which means the latest code addition is the bottleneck.
Another direction I'd look into is making sure both threads are actually running concurrently, on different CPUs. Try bounding each thread to a single CPU. This is not something I'd leave in production, but if your system is loaded by other processes, you might not get the gain you expect from dual CPUs. After all, on a single CPU machine you'll probably get a lower throughput using two thread than what you'd get using one.

I ran the program on a quad core system and got consistent but still confusing results. I know I am verging on getting away from a specific programming question and becoming a bit abstract, but I'd really like to hear any comments you might have to help explain the numbers I am seeing. For all of these tests I run for 30 seconds and according to task manager all threads are running full power for the entire 30 seconds.
When running in Debug mode, if I run with 1 thread it gets X amount of work done. If I run 2 threads each thread gets X amount of work done. Similarly with 3 and 4 threads. Scaling is perfect.
When running in Release mode, this is what happens:
With 1 thread: it gets Y amount of work done, where Y is nearly double X.
With 2 threads: Each thread gets Y amount of work done. Again, perfect scaling.
With 3 threads: 1 thread gets Y amount of work done, the other 2 threads get 2/3 Y amount of work done. I've lost about 2/3 of a CPU even though one is presumable completely idle. Task Manager shows 75% CPU usage.
With 4 threads: 1 thread gets Y amount of work done. The other 3 threads get 1/2 Y amount of work done. Now I've lost about 1.5 CPU's worth of computing. The Task Manager shows 100% CPU usage.
The obvious questions are:
(1) Repeating the earlier question, was does Debug mode scale so well, but not Release?
(2) Why is one core always able to get full usage but the others seem to fall off? This lack of symmetry is disturbing.
(3) Why are the others falling off? Memory bandwidth was suggested earlier but that seem like an awfully steep price.
Any comments or insights are most welcome. And, as always, thanks!

There are many things that may hamper your performance.
One problem might be false sharing of cache lines.
When you have something like :
struct data
{
int cnt_parsed_thread[THREADS];
// ...
};
static data;
and in the threads itself :
threadFunc( int threadNum )
{
while( !end )
{
// ...
// do something
++data.cnt_parsed_thread[num];
}
}
You force both processors to send the cache line after each increment to the other processor, stalling computation enormously.
This problem can be worked around by spreading the falsely shared data into separate cachelines.
e.g. :
struct data
{
int cnt_parsed_thread[THREADS*CACHELINESIZE];
// ...
int& at( int k ) { return cnt_parsed_thread[k*CACHELINESIZE}; }
};
(CACHELINE size should be 64 bytes (I think), maybe play around with that.)

Related

Multitasking and measuring time difference

I understand that a preemptive multitasking OS can interrupt a process at any "code position".
Given the following code:
int main() {
while( true ) {
doSthImportant(); // needs to be executed at least each 20 msec
// start of critical section
int start_usec = getTime_usec();
doSthElse();
int timeDiff_usec = getTime_usec() - start_usec;
// end of critical section
evalUsedTime( timeDiff_usec );
sleep_msec( 10 );
}
}
I would expect this code to usually produce proper results for timeDiff_usec, especially in case that doSthElse() and getTime_usec() don't take much time so they get interrupted rarely by the OS scheduler.
But the program would get interrupted from time to time somewhere in the "critical section". The context switch will do what it is supposed to do, and still in such a case the program would produce wrong results for the timeDiff_usec.
This is the only example I have in mind right now but I'm sure there would be other scenarios where multitasking might get a program(mer) into trouble (as time is not the only state that might be changed at re-entry).
Is there a way to ensure that measuring the time for a certain action works fine?
Which other common issues are critical with multitasking and need to be considered? (I'm not thinking of thread safety - but there might be common issues).
Edit:
I changed the sample code to make it more precise.
I want to check the time being spent to make sure that doSthElse() doesn't take like 50 msec or so, and if it does I would look for a better solution.

Is there a way to ensure that measuring the time for a certain action works fine?
That depends on your operating system and your privilege level. On some systems, for some privilege levels, you can set a process or thread to have a priority that prevents it from being preempted by anything at lower priority. For example, on Linux, you might use sched_setscheduler to give a thread real-time priority. (If you're really serious, you can also set the thread affinity and SMP affinities to prevent any interrupts from being handled on the CPU that's running your thread.)
Your system may also provide time tracking that accounts for time spent preempted. For example, POSIX defines the getrusage function, which returns a struct containing ru_utime (the amount of time spent in “user mode” by the process) and ru_stime (the amount of time spent in “kernel mode” by the process). These should sum to the total time the CPU spent on the process, excluding intervals during which the process was suspended. Note that if the kernel needs to, for example, spend time paging on behalf of your process, it's not defined how much (if any) of that time is charged to your process.
Anyway, the common way to measure time spent on some critical action is to time it (essentially the way your question presents) repeatedly, on an otherwise idle system, throw out outlier measurements, and take the mean (after eliminating outliers), or take the median or 95th percentile of the measurements, depending on why you need the measurement.
Which other common issues are critical with multitasking and need to be considered? (I'm not thinking of thread safety - but there might be common issues).
Too broad. There are whole books written about this subject.

Why is 6-7 threads faster than 20?

In school we were introduced to C++11 threads. The teacher gave us a simple assessment to complete which was to make a basic web crawler using 20 threads. To me threading is pretty new, although I do understand the basics.
I would like to mention that I am not looking for someone to complete my assessment as it is already done. I only want to understand the reason why using 6 threads is always faster than using 20.
Please see code sample below.
main.cpp:
do
{
for (size_t i = 0; i < THREAD_COUNT; i++)
{
threads[i] = std::thread(SweepUrlList);
}
for (size_t i = 0; i < THREAD_COUNT; i++)
{
threads[i].join();
}
std::cout << std::endl;
WriteToConsole();
listUrl = listNewUrl;
listNewUrl.clear();
} while (listUrl.size() != 0);
Basically this assigns to each worker thread the job to complete which is the method SweepUrlList that can be found below and then join all thread.
while (1)
{
mutextGetNextUrl.lock();
std::set<std::string>::iterator it = listUrl.begin();
if (it == listUrl.end())
{
mutextGetNextUrl.unlock();
break;
}
std::string url(*it);
listUrl.erase(*it);
mutextGetNextUrl.unlock();
ExtractEmail(url, listEmail);
std::cout << ".";
}
So each worker thread loop until ListUrl is empty. ExtractEmail is a method that downloads the webpage (using curl) and parse it to extract emails from mailto links.
The only blocking call in ExtractEmail can be found below:
if(email.length() != 0)
{
mutextInsertNewEmail.lock();
ListEmail.insert(email);
mutextInsertNewEmail.unlock();
}
All answers are welcome and if possible links to any documentation you found to answer this question.

This is a fairly universal problem with threading, and at its core:
What you are demonstrating is thread Scheduling. The operating system is going to work with the various threads, and schedule work where there is currently not work.
Assuming you have 4 cores and hyper threading you have 8 processors that can carry the load, but also that of other applications (Operating System, C++ debugger, and your application to start).
In theory, you would probably be OK on performance up until about 8 intensive threads. After you reach the most threads your processor can effectively use, then threads begin to compete against each other for resources. This can be seen (especially with intensive applications and tight loops) by poor performance.
Finally, this is a simplified answer but I suspect what you are seeing.

The simple answer is choke points. Something that you are doing is causing a choke point. When this occurs there is a slow down. It could be in the number of active connections you are making to something, or merely the extra overhead of the number and memory size of the threads (see the below answer about cores being one of these chokes).
You will need to set up a series of monitors to investigate where your choke point is, and what needs to change in order to achieve scale. Many systems across every industry face this problem every day. Opening up the throttle at one end does not equal the same increase in the output at the other end. In cases it can decrease the output at the other end.
Take for example individuals leaving a hall. The goal is to get 100 people out of the building as quickly as possible. If single file produces a rate of 1 person every 1 second therefore 100 seconds to clear the building. We many be able to half that time by sending them out 2 abreast, so 50 seconds to clear the building. What if we then sent them out as 8 abreast. The door is only 2m wide, so with 8 abreast being equivalent to 4m, only 50% of the first row would make it through. The other 4 would then cause a blockage for the next row and so on. Depending on the rate, this could cause temporary blockages and increase the time 10 fold.

Threads are an operating system construct. Basically, each thread's state (which is basically all the CPU's registers and virtual memory mapping [which is a part of the process construct]) is saved by the operating system. Once the OS gives that specific thread "execution time" it restores this state and let it run. Once this time is finished, it has to save this state. The process of saving a specific thread's state and restoring another is called Context Switching, and it takes a significant amount of time (usually between a couple of hundreds to thousand of CPU cycles).
There are also additional penalties to context switching. Some of the processor's cache (like the virtual memory translation cache, called the TLB) has to be flushed, pipelining instruction to be discarded and more. Generally, you want to minimize context switching as much as possible.
If your CPU has 4 cores, than 4 threads can run simultaneously. If you try to run 20 threads on a 4 core system, then the OS has to manage time between those threads so it will seem like they run in parallel. E.g, threads 1-4 will run for 50 milliseconds, then 5-9 will run for 50 milliseconds, etc.
Therefore, if all of your threads are running CPU intensive operations, it is generally most efficient to make your program use the same amount of threads as cores (sometimes called 'processors' in windows). If you have more threads than cores, than context switching must happen, and it is overhead that can be minimized.

In general, more threads is not better. More threading provides value in two ways higher parallelism and less blocking. More threading hurts by higher memory, higher context switching and higher resource contention.
The value of more threads for higher parallelism is generally maximized between 1-2x the number of actual cores that you have available. If your threads are already CPU bound the maximum value is generally 1x number of cores.
The value of less blocking is much harder to quantify and depends on the type of work you are performing. If you are IO bound and your threads are primarily waiting for IO to be ready then a larger number of threads could be beneficial.
However if you have shared state between threads, or you are doing some form of message passing between threads then you will run into synchronization and contention issues. As the number of threads increases, the more these types of overhead as well as context switches dominates the time spent doing your task.
Amdahl's law is a useful measure to determine if higher parallelism will actually improve the total runtime of your job.
You also must be careful that your increased parallelism doesn't exceed some other resource like total memory or disk or network throughput. Once you have saturated the current bottleneck, you will not see improved performance by increasing the number of threads.
Before doing any performance tuning, it is important to understand what the dominant resource bottleneck is. There are lots of tools for doing system-wide resource monitoring. On Linux, one very useful tool is dstat. On Windows, you can use the Task Manager to monitor many of these resources.

Multithreading crowds out other processes

I have added multithreading to a raytracer I am writing, and while it does run much faster now, when it's running, my computer is almost unusably slow. Obviously I want to use all my PC's compute power, but I don't want it to prevent any other application from getting access to the CPUs.
I thought about having the threads sleep, but unless they all sleep at the same time, then the other threads would just eat up the extra time. Also, I don't necessarily want to give up a certain percentage of available compute power if I'm not going to use it.
Also, (This is not my official question) I've noticed that for some reason the first thread launched does more work than the second, and the second more than the third, and so on until like the last 5 threads (out of 32) won't actually get a crack at any work, despite the fact that there's plenty to go a around (there's at least 0.5M work items for them to chew through). If someone would like to venture a guess in the comments, it would be appreciated.

If you use the standard threads, you could try to use thread::hardware_concurrency to find out an estimate of the maximul number of threads that are really supported by hardware, in order not to overload your cpu.
If it returns 0 the information is not available. In other cases you could limit yourself to this number or a little bit below (thinking that other processes might use these as well).
If limiting the number of threads does not improve responsiveness, you can also consider calling from time to time this_thread::yield() to give opportunity to reschedule threads. But depending on the kind of job and synchronisation you use, this second alternative might decrease performance.

As requested, my comment as an answer:
It sounds like you've oversubscribed your poor CPU. Try reducing the number of threads?
If there's significantly more threads than hardware cores, a lot of time is going to be wasted switching between threads, scheduling them in the OS, and in contention over shared variables. It would also cause the general slowdown of the other running programs, because they have to contend with the high number of threads from your program (which by default all have the same priority as the other programs' threads in the eyes of the OS scheduler).

Poor performance in multi-threaded C++ program

I have a C++ program running on Linux in which a new thread is created to do some computationally expensive work independent of the main thread (The computational work completes by writing the results to files, which end up being very large). However, I'm getting relatively poor performance.
If I implement the program straightforward (without introducing other threads), it completes the task in roughly 2 hours. With the multi-threaded program it takes around 12 hours to do the same task (this was tested with only one thread spawned).
I've tried a couple of things, including pthread_setaffinity_np to set the thread to a single CPU (out of the 24 available on the server I'm using), as well as pthread_setschedparam to set the scheduling policy (I've only tried SCHED_BATCH). But the effects of these have so far been negligible.
Are there any general causes for this kind of problem?
EDIT: I've added some example code that I'm using, which is hopefully the most relevant parts. The function process_job() is what actually does the computational work, but it would be too much to include here. Basically, it reads in two files of data, and uses these to perform queries on an in-memory graph database, in which the results are written to two large files over a period of hours.
EDIT part 2: Just to clarify, the problem is not that I want to use threads to increase the performance of an algorithm I have. But rather, I want to run many instances of my algorithm simultaneously. Therefore, I expect the algorithm would run at a similar speed when put in a thread as it would if I didn't use multi-threads at all.
EDIT part 3: Thanks for the suggestions all. I'm currently doing some unit tests (seeing which parts are slowing down) as some have suggested. As the program takes a while to load and execute, it is taking time to see any results from the tests and therefore I apologize for late responses. I think the main point I wanted to clarify is possible reasons why threading could cause a program to run slowly. From what I gather from the comments, it simply shouldn't be. I'll post when I can find a reasonable resolution, thanks again.
(FINAL) EDIT part 4: It turns out that the problem was not related to threading after all. Describing it would be too cumbersome at this point (including the use of compiler optimization levels), but the ideas posted here were very useful and appreciated.
struct sched_param sched_param = {
sched_get_priority_min(SCHED_BATCH)
};
int set_thread_to_core(const long tid, const int &core_id) {
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(core_id, &mask);
return pthread_setaffinity_np(tid, sizeof(mask), &mask);
}
void *worker_thread(void *arg) {
job_data *temp = (job_data *)arg; // get the information for the task passed in
...
long tid = pthread_self();
int set_thread = set_thread_to_core(tid, slot_id); // assume slot_id is 1 (it is in the test case I run)
sched_get_priority_min(SCHED_BATCH);
pthread_setschedparam(tid, SCHED_BATCH, &sched_param);
int success = process_job(...); // this is where all the work actually happens
pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
...
pthread_t temp;
pthread_create(&temp, NULL, worker_thread, (void *) &jobs[i]); // jobs is a vector of a class type containing information for the task
...
return 0;
}

If you have plenty of CPU cores, and have plenty of work to do, it should not take longer to run in multithreaded than single threaded mode - the actual CPU time may be a fraction longer, but the "wall-clock time" should be shorter. I'm pretty sure that your code has some sort of bottleneck where one thread is blocking the other.
This is because of one or more of these things - I'll list them first, then go into detail below:
Some lock in a thread is blocking the second thread from running.
Sharing of data between threads (either true or "false" sharing)
Cache thrashing.
Competition for some external resource causing thrashing and/or blocking.
Badly designed code in general...
Some lock in a thread is blocking the second thread from running.
If there is a thread that takes a lock, and another thread wants to use the resource that is locked by this thread, it will have to wait. This obviously means the thread isn't doing anything useful. Locks should be kept to a minimum by only taking the lock for a short period. Using some code to identify if locks are holding your code, such as:
while (!tryLock(some_some_lock))
{
tried_locking_failed[lock_id][thread_id]++;
}
total_locks[some_lock]++;
Printing some stats of the locks would help to identify where the locking is contentious - or you can try the old trick of "Press break in the debugger and see where you are" - if a thread is constantly waiting for some lock, then that's what's preventing progress...
Sharing of data between threads (either true or "false" sharing)
If two threads use [and update the value of it frequently] the same variable, then the two threads will have to swap "I've updated this" messages, and the CPU's have to fetch the data from the other CPU before it can continue with it's use of the variable. Since "data" is shared on a "per cache-line" level, and a cache-line is typically 32-bytes, something like:
int var[NUM_THREADS];
...
var[thread_id]++;
would classify as something called "false sharing" - the ACTUAL data updated is unique per CPU, but since the data is within the same 32-byte region, the cores will still have updated the same are of memory.
Cache thrashing.
If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don't run in "lockstep" on which part of cache the CPU uses. If the data is 2^n (power of two) and fairly large (a multiple of the cache-size), it's a good idea to "add an offset" for each thread - for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.
Competition for some external resource causing thrashing and/or blocking.
If two threads are reading or writing from/to the hard-disk, network card, or some other shared resource, this can lead to one thread blocking another thread, which in turn means lower performance. It is also possible that the code detects different threads and does some extra flushing to ensure that data is written in the correct order or similar, before starting work with the other thread.
It is also possible that there are locks internally in the code that deals with the resource (user-mode library or kernel mode drivers) that block when more than one thread is using the same resource.
Generally bad design
This is a "catchall" for "lots of other things that can be wrong". If the result from one calculation in one thread is needed to progress the other, obviously, not a lot of work can be done in that thread.
Too small a work-unit, so all the time is spent starting and stopping the thread, and not enough work is being done. Say for example that you dole out small numbers to be "calculate if this is a prime" to each thread, one number at a time, it will probably take a lot longer to give the number to the thread than the calculation of "is this actually a prime-number" - the solution is to give a set of numbers (perhaps 10, 20, 32, 64 or such) to each thread, and then report back the result for the whole lot in one go.
There are plenty of other "bad design". Without understanding your code it's quite hard to say for sure.
It is entirely possible that your problem is none of the ones I've mentioned here, but most likely it is one of these. Hopefully this asnwer is helpful to identify the cause.

Read CPU Caches and Why You Care to understand why a naive port of an algorithm from one thread to multiple threads will more often than not result in greatly reduced performance and negative scalability. Algorithms that are specififcally designed for parallelism take care of overactive interlocked operations, false sharing and other causes of cache pollution.

Here are a few things you might wanna look into.
1°) Do you enter any critical section (locks, semaphores, etc.) between your worker thread and your main thread? (this should be the case if your queries modify the graph). If so, that could be one of the sources of the multithreading overhead : threads competing for a lock usually degrades performances.
2°) You're using a 24 cores machines, which I assume would be NUMA (Non-Uniform Memory Access). Since you set the threads affinities during your tests, you should pay close attention to the memory topology of your hardware. Looking at the files in /sys/devices/system/cpu/cpuX/ can help you with that (beware that cpu0 and cpu1 aren't necessarily close together, and thus does not necessarily share memory). Threads heavily using memory should use local memory (allocated in the same NUMA node as the core they're executing on).
3°) You are heavily using disk I/O. Which kind of I/O is that? if every thread perform every time some synchronous I/O, you might wanna consider asynchronous system calls, so that the OS stays in charge of scheduling those requests to the disk.
4°) Some caches issues have already been mentionned in other answers. From experience, false sharing can hurt performances as much as you're observing. My last recommendation (which should have been my first) is to use a profiler tool, such as Linux Perf, or OProfile. With such performance degradation you're experiencing, the cause will certainly appear quite clearly.

The other answers have all addressed the general guidelines that can cause your symptoms. I will give my own, hopefully not excessively redundant version. Then I will talk a bit about how you can get to the bottom of the problem with everything discussed in mind.
In general, there's a few reasons you'd expect multiple threads to perform better:
A piece of work is dependent on some resources (disk, memory, cache, etc.) while other pieces can proceed independently of these resources or said workload.
You have multiple CPU cores that can process your workload in parallel.
The main reasons, enumerated above, you'd expect multiple threads to perform less well are all based on resource contention:
Disk contention: already explained in detail and can be a possible issue, especially if you are writing small buffers at a time instead of batching
CPU time contention if the threads are scheduled onto the same core: probably not your issue if you're setting affinity. However, you should still double check
Cache thrashing: similarly probably not your problem if you have affinity, though this can be very expensive if it is your problem.
Shared memory: again talked about in detail and doesn't seem to be your issue, but it wouldn't hurt to audit the code to check it out.
NUMA: again talked about. If your worker thread is pinned to a different core, you will want to check whether the work it needs to access is local to the main core.
Ok so far not much new. It can be any or none of the above. The question is, for your case, how can you detect where the extra time is coming from. There's a few strategies:
Audit the code and look for obvious areas. Don't spend too much time doing this as it's generally unfruitful if you wrote the program to begin with.
Refactor the single threaded code and the multi-threaded code to isolate one process() function, then profile at key checkpoints to try to account for the difference. Then narrow it down.
Refactor the resource access into batches, then profile each batch on both the control and the experiment to account for the difference. Not only will this tell you which areas (disk access vs memory access vs spending time in some tight loop) you need to focus your efforts on, doing this refactor might even improve your running time overall. Example:
First copy the graph structure to thread-local memory (perform a straight-up copy in the single-threaded case)
Then perform the query
Then setup an asynchronous write to disk
Try to find a minimally reproducible workload with the same symptoms. This means changing your algorithm to do a subset of what it already does.
Make sure there's no other noise in the system that could've caused the difference (if some other user is running a similar system on the work core).
My own intuition for your case:
Your graph structure is not NUMA friendly for your worker core.
The kernel can actually scheduled your worker thread off the affinity core. This can happen if you don't have isolcpu on for the core you're pinning to.

I can't tell you what's wrong with your program because you haven't shared enough of it to do a detailed analysis.
What I can tell you is if this was my problem the first thing I would try is to run two profiler sessions on my application, one on the single threaded version and another on the dual thread configuration. The profiler report should give you a pretty good idea of where the extra time is going. Note that you may not need to profile the entire application run, depending on the problem the time difference may become obvious after you profile for a few seconds or minutes.
As far as profiler choices for Linux you may want to consider oprofile or as a second choice gprof.
If you find you need help interpreting the profiler output feel free to add that to your question.

It can be a right pain in the rear to track down why threads aren't working as planned. One can do so analytically, or one can use tool to show what's going on. I've had very good mileage out of ftrace, Linux's clone of Solaris's dtrace (which in turn is based on what VxWorks, Greenhill's Integrity OS and Mercury Computer Systems Inc have been doing for a looong time.)
In particular I found this page very useful: http://www.omappedia.com/wiki/Installing_and_Using_Ftrace, particularly this and this section. Don't worry about it being an OMAP orientated website; I've used it on X86 Linuxes just fine (though you may have to build a kernel to include it). Also remember that the GTKWave viewer is primarily intended for looking at log traces from VHDL developments, which is why it looks 'odd'. It's just that someone realised that it would be a usable viewer for sched_switch data too, and that saved them writing one.
Using the sched_switch tracer you can see when (but not necessarily why) your threads are running, and that might be enough to give you a clue. The 'why' can be revealed by careful examination of some of the other tracers.

If you are getting slowdown from using 1 thread, it is likely due to overhead from using thread safe library functions, or from thread setup. Creating a thread for each job will cause significant overhead, but probably not as much as you refer to.
In other words, it is probably some overhead from some thread safe library function.
The best thing to do, is to profile your code to find out where time is spent. If it is in a library call, try to find a replacement library or implement it yourself. If the bottleneck is thread creation/destruction try reusing threads, for instance using OpenMP tasks or std::async in C++11.
Some libraries are really nasty wrt thread safe overhead. For instance, many rand() implementations use a global lock, rather than using thread local prgn's. Such locking overhead is much larger than generating a number, and is hard to track without a profiler.
The slowdown could also stem from small changes you have made, for instance declaring variables volatile, which generally should not be necessary.

I suspect you're running on a machine with one single-core processor. This problem is not parallelizable on that kind of system. Your code is constantly using the processor, which has a fixed number of cycles to offer to it. It actually runs more slowly because the additional thread adds expensive context switching to the problem.
The only kinds of problems that parallelize well on a single-processor machine are those that allow one path of execution to run while another is blocked waiting for I/O, and situations (such as keeping a responsive GUI) where allowing one thread to get some processor time is more important than executing your code as quickly as possible.

If you only want to run many independent instances of your algorithm can you just submit multiple jobs (with different parameters, can be handled by a single script) to your cluster? That would eliminate the need to profile and debug your multithreaded program. I don't have much experience with multithreaded programming but if you use MPI or OpenMP then you'd have to write less code for the book keeping too. For example, if some common initialization routine is needed and the processes can run independently thereafter you can just do that by initializing in one thread and doing a broadcast. No need for maintaining locks and such.

Simple multi-threading confusion for C++

I am developing a C++ application in Qt.
I have a very basic doubt, please forgive me if this is too stupid...
How many threads should I create to divide a task amongst them for minimum time?
I am asking this because my laptop is 3rd gen i5 processor (3210m). So since it is dual core & NO_OF_PROCESSORS environment variable is showing me 4. I had read in an article that dynamic memory for an application is only available for that processor which launched that application. So should I create only 1 thread (since env variable says 4 processors) or 2 threads (since my processor is dual core & env variable might be suggesting the no of cores) or 4 threads (if that article was wrong)?
Please forgive me since I am a beginner level programmer trying to learn Qt.
Thank You :)

Although hyperthreading is somewhat of a lie (you're told that you have 4 cores, but you really only have 2 cores, and another two that only run on what resources the former two don't use, if there's such a thing), the correct thing to do is still to use as many threads as NO_OF_PROCESSORS tells you.
Note that Intel isn't the only one lying to you, it's even worse on recent AMD processors where you have 6 alleged "real" cores, but in reality only 4 of them, with resources shared among them.
However, most of the time, it just more or less works out. Even in absence of explicitly blocking a thread (on a wait function or a blocking read), there's always a point where a core is stalled, for example in accessing memory due to a cache miss, which gives away resources that can be used by the hyperthreaded core.
Therefore, if you have a lot of work to do, and you can parallelize it nicely, you should really have as many workers as there are advertized cores (whether they're "real" or "hyper"). This way, you make maximum use of the available processor resources.
Ideally, one would create worker threads early at application startup, and have a task queue to hand tasks to workers. Since synchronization is often non-neglegible, the task queue should be rather "coarse". There is a tradeoff in maximum core usage and synchronization overhead.
For example, if you have 10 million elements in an array to process, you might push tasks that refer to 100,000 or 200,000 consecutive elements (you will not want to push 10 million tasks!). That way, you make sure that no cores stay idle on the average (if one finishes earlier, it pulls another task instead of doing nothing) and you only have a hundred or so synchronizations, the overhead of which is more or less neglegible.
If tasks involve file/socket reads or other things that can block for indefinite time, spawning another 1-2 threads is often no mistake (takes a bit of experimentation).

This totally depends on your workload, if you have a workload which is very cpu intensive you should stay closer to the number of threads your cpu has(4 in your case - 2 core * 2 for hyperthreading). A small oversubscription might be also be ok, as that can compensate for times where one of your threads waits for a lock or something else.
On the other side, if your application is not cpu dependent and is mostly waiting, you can even create more threads than your cpu count. You should however notice that thread creation can be quite an overhead. The only solution is to measure were your bottleneck is and optimize in that direction.
Also note that if you are using c++11 you can use std::thread::hardware_concurrency to get a portable way to determine the number of cpu cores you have.
Concerning your question about dynamic memory, you must have misunderstood something there.Generally all threads you create can access the memory you created in your application. In addition, this has nothing to do with C++ and is out of the scope of the C++ standard.

NO_OF_PROCESSORS shows 4 because your CPU has Hyper-threading. Hyper-threading is the Intel trademark for tech that enables a single core to execute 2 threads of the same application more or less at the same time. It work as long as e.g. one thread is fetching data and the other one accessing the ALU. If both need the same resource and instructions can't be reordered, one thread will stall. This is the reason you see 4 cores, even though you have 2.
That dynamic memory is only available to one of the Cores is IMO not quite right, but register contents and sometimes cache content is. Everything that resides in the RAM should be available to all CPUs.
More threads than CPUs can help, depending on how you operating systems scheduler works / how you access data etc. To find that you'll have to benchmark your code. Everything else will just be guesswork.
Apart from that, if you're trying to learn Qt, this is maybe not the right thing to worry about...
Edit:
Answering your question: We can't really tell you how much slower/faster your program will run if you increase the number of threads. Depending on what you are doing this will change. If you are e.g. waiting for responses from the network you could increase the number of threads much more. If your threads are all using the same hardware 4 threads might not perform better than 1. The best way is to simply benchmark your code.
In an ideal world, if you are 'just' crunching numbers should not make a difference if you have 4 or 8 threads running, the net time should be the same (neglecting time for context switches etc.) just the response time will differ. The thing is that nothing is ideal, we have caches, your CPUs all access the same memory over the same bus, so in the end they compete for access to resources. Then you also have an operating system that might or might not schedule a thread/process at a given time.
You also asked for an Explanation of synchronization overhead: If all your threads access the same data structures, you will have to do some locking etc. so that no thread accesses the data in an invalid state while it is being updated.
Assume you have two threads, both doing the same thing:
int sum = 0; // global variable
thread() {
int i = sum;
i += 1;
sum = i;
}
If you start two threads doing this at the same time, you can not reliably predict the output: It might happen like this:
THREAD A : i = sum; // i = 0
i += 1; // i = 1
**context switch**
THREAD B : i = sum; // i = 0
i += 1; // i = 1
sum = i; // sum = 1
**context switch**
THREAD A : sum = i; // sum = 1
In the end sum is 1, not 2 even though you started the thread twice.
To avoid this you have to synchronize access to sum, the shared data. Normally you would do this by blocking access to sum as long as needed. Synchronization overhead is the time that threads would be waiting until the resource is unlocked again, doing nothing.
If you have discrete work packages for each thread and no shared resources you should have no synchronization overhead.

The easiest way to get started with dividing work among threads in Qt is to use the Qt Concurrent framework. Example: You have some operation that you want to perform on every item in a QList (pretty common).
void operation( ItemType & item )
{
// do work on item, changing it in place
}
QList<ItemType> seq; // populate your list
// apply operation to every member of seq
QFuture<void> future = QtConcurrent::map( seq, operation );
// if you want to wait until all operations are complete before you move on...
future.waitForFinished();
Qt handles the threading automatically...no need to worry about it. The QFuture documenation describes how you can handle the map completion asymmetrically with signals and slots if you need to do that.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js