Parallelism on a single processor system

Parallelism on a single processor system - concurrency

Can we apply the concept of parallelism on a single processor system? For example, if we have two processes A and B that are independent of each other, can they be executed simultaneously, and if so, how? Please explain in terms of the execution cycle that would follow.

They cannot run simultaneously if there is only a single processor. If you have a multi-thread or multi-process environment, it will time slice each process and/or thread. Only one will run at any given time, and there is overhead at each context switch.
The precise meaning of "context switch" varies significantly in usage, most often to mean "thread switch or process switch" or "process switch only", either of which may be referred to as a "task switch". More finely, one can distinguish thread switch (switching between two threads within a given process), process switch (switching between two processes), mode switch (domain crossing: switching between user mode and kernel mode within a given thread), register switch, a stack frame switch, and address space switch (memory map switch: changing virtual memory to physical memory map). The computational cost of context switches varies significantly depending on what precisely it entails, from little more than a subroutine call for light-weight user processes, to very expensive, though typically much less than that of saving or restoring a process image.
On an interesting historical note, there were even multi-threading libraries available for MS-DOS before Windows became popular. Many mainframe and mini computers from the same era employed the technique as well.

On a single processor, the nearest thing to parallelism is called multitasking. With one core, no matter how many processes (tasks) there are in the system, only one can execute at a time.
But if a process has threads, its threads are assigned to the CPU one at a time, and the user gets the impression that all the threads of the process are running at once.

The CPU has to be switched between processes. In an OS this is called context switching.
Which process gets the CPU next is decided by a scheduling algorithm, for example:
round robin, priority queue
The chosen algorithm decides which process will use the CPU.
But the CPU can't be used by two processes at the same time.
In modern operating systems the task scheduler is responsible for assigning processes to the CPU.

Related

How to multithread core-schedule onto different cores (ideally in C++)

I have a large C++11 multithreaded application where the threads are always active, communicating to each other constantly, and should be scheduled on different physical CPUs for reasonable performance.
The default Linux behavior AFAIK is that threads will typically/often get scheduled onto the same CPU, causing horrible performance.
To solve this, I understand how to attach threads to specific physical CPUs in C++, e.g.:
std::cout << "Assign to thread cpu " << cpu << "\n";
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
int rc = pthread_setaffinity_np(thread.native_handle(), sizeof(cpu_set_t), &cpuset);
and can use this to pin to specific CPUs, e.g. attach 4 threads to CPUs 0,2,4,6.
However this approach requires a specific CPU number which is a problem in that there may be many programs running on a host using other CPUs. These might be my program or other programs. As just one example an 8 core machine might have two copies of my 4-threaded application so obviously having both of those two programs pick the same 4 CPUs is a problem.
I'd thus like a way to say "schedule the threads in this set all on different CPUs, without caring about the CPU numbers". Is this possible in C++(11)?
If not, is this possible with numactl or another utility? E.g. I don't want "numactl -C 0,2,4,6" but rather "numactl -C W,X,Y,Z" where the scheduler can pick arbitrary W,X,Y,Z subject to W!=X!=Y!=Z.
I'm most interested in Linux behavior. I cannot change the OS configuration. I don't want the separate applications to cross communicate (nor can they as they might be other applications I do not control.)
Once I have the answer to this, the follow-up is: how do I modify this to add, e.g., a fifth thread that I do want to schedule on the same CPU as the first thread?

My problem in a specific Boost ASIO multithreaded application is that, even with a limited number of threads (say ten) on a system with many more cores, the threads get pushed around onto different cores all the time, which seriously reduces performance due to a high number of L1/L2 cache misses.
I have not searched much yet, but there is a getcpu() system call on Linux that returns the CPU ID and NUMA node ID of the calling thread. To get a set of unique CPU IDs, one could create all the threads first, let them all wait at a barrier via pthread_barrier_wait(), and after that call getcpu() repeatedly in each thread until the returned values have stabilized. Stability has been reached when each thread has received the same CPU ID for at least its last 1000 calls to getcpu() AND the answers of all the threads differ from one another. It is extremely important to use non-blocking techniques such as std::atomic values to synchronize during this testing phase, because if you wait on mutexes instead, the likelihood is high that your threads get re-mixed by the scheduler.
After stability has been reached, each thread just sets its CPU affinity to its current CPU-ID and you are done.
In many cases, where you do not dynamically start and stop a lot of applications, hand-binding the threads to certain cores might be the easiest solution, though. And if you do start and stop a lot of apps dynamically, the "pick N free cores" algorithm described above will fail miserably anyway if there aren't enough free cores left.

Should I "bind" a "spinning" thread to a certain core?

My application contains several latency-critical threads that "spin", i.e. never block.
Such a thread is expected to take 100% of one CPU core. However, it seems modern operating systems often move threads from one core to another. So, for example, with this Windows code:
void Processor::ConnectionThread()
{
    while (work)
    {
        Iterate();
    }
}
I do not see a "100% occupied" core in Task Manager, and the overall system load is 36-40%.
But if I change it to this:
void Processor::ConnectionThread()
{
    SetThreadAffinityMask(GetCurrentThread(), 2);
    while (work)
    {
        Iterate();
    }
}
then I do see that one of the CPU cores is 100% occupied, and the overall system load is reduced to 34-36%.
Does this mean that I should prefer SetThreadAffinityMask for "spin" threads? Did I improve latency by adding SetThreadAffinityMask in this case? What else should I do for "spin" threads to improve latency?
I'm in the middle of porting my application to Linux, so this question is more about Linux, if that matters.
Update: I found a slide which shows that binding a busy-waiting thread to a CPU may help.

Running a thread locked to a single core gives the best latency for that thread in most circumstances if this is the most important thing in your code.
The reasons (R) are:
your code is likely to be in your iCache
the branch predictors are tuned to your code
your data is likely to be ready in your dCache
the TLB points to your code and data.
Unless:
You're running an SMT system (e.g. hyperthreaded), in which case the evil twin will "help" you by causing your code to be washed out of the iCache, your branch predictors to be tuned to its code, your data to be pushed out of the dCache, and your TLB to be impacted by its use.
The cost is unknown: each cache miss costs ~4 ns, ~15 ns or ~75 ns for data, which quickly runs up to several thousand ns.
You still save for each reason R mentioned above that remains in effect.
If the evil twin also just spins, the costs should be much lower.
Or you're allowing interrupts on your core, in which case you get the same problems and
your TLB is flushed
you take a 1000 ns to 20000 ns hit on the context switch; most should be at the low end if the drivers are well programmed.
Or you allow the OS to switch your process out, in which case you have the same problems as with the interrupt, just at the high end of the range.
Switching out could also cause the thread to pause for the entire slice, as it can only be run on one (or two) hardware threads.
Or you use any system calls that cause context switches.
No disk I/O at all.
Only async I/O otherwise.
Having more active (non-paused) threads than cores increases the likelihood of problems.
So if you need less than 100 ns latency to keep your application from exploding, you need to prevent or lessen the impact of SMT, interrupts and task switching on your core.
The perfect solution would be a real-time operating system with static scheduling. This is a nearly perfect match for your target, but it's a new world if you have mostly done server and desktop programming.
The disadvantages of locking a thread to a single core are:
It will cost some total throughput,
as some threads might have run if the context could have been switched,
but latency is more important in this case.
If the thread gets context switched out, it will take some time before it can be scheduled again, potentially one or more time slices, typically 10-16 ms, which is unacceptable in this application.
Locking it to a core and its SMT sibling will lessen this problem, but not eliminate it. Each added core will lessen the problem.
Setting its priority higher will lessen the problem, but not eliminate it.
Scheduling with SCHED_FIFO and the highest priority will prevent most context switches; interrupts can still cause temporary switches, as can some system calls.
If you have a multi-CPU setup, you might be able to take exclusive ownership of one of the CPUs through cpuset. This prevents other applications from using it.
Using pthread_setschedparam with SCHED_FIFO and the highest priority, running as superuser, and locking the thread to the core and its evil twin should secure the best latency of all of these; only a real-time operating system can eliminate all context switches.
Other links:
Discussion on interrupts.
Your Linux might accept a call to sched_setscheduler using SCHED_FIFO, but this demands that you have your own PID, not just a TID, or that your threads use cooperative multitasking.
This might not be ideal, as all your threads would only be switched "voluntarily", which removes the kernel's flexibility in scheduling them.
Interprocess communication in 100ns

Pinning a task to a specific processor will generally give better performance for the task. But there are a lot of nuances and costs to consider when doing so.
When you force affinity, you restrict the operating system's scheduling choices. You increase CPU contention for the remaining tasks. So EVERYTHING else on the system is impacted, including the operating system itself. You also need to consider that if tasks need to communicate through memory, and affinities are set to CPUs that don't share a cache, you can drastically increase the latency of communication between tasks.
One of the biggest reasons setting task CPU affinity is beneficial, though, is that it gives more predictable cache and TLB (translation lookaside buffer) behavior. When a task switches CPUs, the operating system can move it to a CPU that doesn't have access to the last CPU's cache or TLB. This can increase cache misses for the task. It's particularly an issue when communicating between tasks, as it takes more time to communicate through higher-level caches and, in the worst case, main memory. To measure cache statistics (and performance in general) on Linux, I recommend using perf.
The best suggestion is really to measure before you try to fix affinities. A good way to quantify latency is the rdtsc instruction (at least on x86). This reads the CPU's time stamp counter, which will generally give the highest precision. Measuring across events will give roughly nanosecond accuracy.
#include <cstdint>
static inline uint64_t rdtsc() {
    uint32_t eax, edx;
    // rdtsc returns the low 32 bits in EAX and the high 32 bits in EDX
    asm volatile ("rdtsc" : "=a"(eax), "=d"(edx));
    return ((uint64_t) edx << 32) | (uint64_t) eax;
}
Note: the rdtsc instruction needs to be combined with a load fence to ensure all previous instructions have completed (or use rdtscp).
Also note: if rdtsc is used without an invariant time source (on Linux, grep constant_tsc /proc/cpuinfo), you may get unreliable values across frequency changes and if the task switches CPU (time source).
So, in general, yes, setting the affinity does give lower latency, but this is not always true, and there are very serious costs when you do it.
Some additional reading...
Intel 64 Architecture Processor Topology Enumeration
What Every Programmer Needs to Know About Memory (Parts 2, 3, 4, 6, and 7)
Intel Software Developer Reference (Vol. 2A/2B)
Acquire and Release Fences
TCMalloc

I came across this question because I'm dealing with exactly the same design problem. I'm building HFT systems where each nanosecond counts.
After reading all the answers, I decided to implement and benchmark 4 different approaches:
busy wait with no affinity set
busy wait with affinity set
observer pattern
signals
The unbeatable winner was "busy wait with affinity set". No doubt about it.
Now, as many have pointed out, make sure to leave a couple of cores free so that the OS can run freely.
My only concern at this point is whether there is any physical harm to cores that run at 100% for hours.

Binding a thread to a specific core is probably not the best way to get the job done. You can do that; it will not harm a multi-core CPU.
The best way to reduce latency is to raise the priority of the process and the polling thread(s). Normally the OS will interrupt your threads hundreds of times a second and let other threads run for a while. Your thread may not run for several milliseconds.
Raising the priority will reduce the effect (but not eliminate it).
Read more about SetThreadPriority and SetProcessPriorityBoost.
There are some details in the docs that you need to understand.

This is simply foolish. All it does is reduce the scheduler's flexibility. Whereas before it could run it on whatever core it thought was best, now it can't. Unless the scheduler was written by idiots, it would only move the thread to a different core if it had a good reason to do that.
So you're just saying to the scheduler, "even if you have a really good reason to do this, don't do it anyway". Why would you say that?

Am I disturbing other programs with OpenMP?

I'm using OpenMP for a loop like this:
#pragma omp parallel for
for (int out = 1; out <= matrix.rows; out++)
{
    ...
}
I'm doing a lot of computations on a machine with 64 CPUs. This works quite well, but my question is:
Am I disturbing other users on this machine? Usually they only run single-threaded programs. Will those still run at 100%? Obviously I will disturb other multithreaded programs, but will I disturb single-threaded programs?
If yes, can I prevent this? I think I can set the maximum number of CPUs with omp_set_num_threads. I can set this to 60, but I don't think this is the best solution.
The ideal solution would disturb no other single-threaded programs but take as many CPUs as possible.

Every multitasking OS has something called a process scheduler. This is an OS component that decides where and when to run each process. Schedulers are usually quite stubborn in the decisions they make but those could often be influenced by various user-supplied policies and hints. The default configuration for almost any scheduler is to try and spread the load over all available CPUs, which often results in processes migrating from one CPU to another. Fortunately, any modern OS except "the most advanced desktop OS" (a.k.a. OS X) supports something called processor affinity. Every process has a set of processors on which it is allowed to execute - the so-called CPU affinity set of that process. By configuring disjoint affinity sets to various processes, those could be made to execute concurrently without stealing CPU time from each other. Explicit CPU affinity is supported on Linux, FreeBSD (with the ULE scheduler), Windows NT (this also includes all desktop versions since Windows XP), and possibly other OSes (but not OS X). Every OS then provides a set of kernel calls to manipulate the affinity and also an instrument to do that without writing a special program. On Linux this is done using the sched_setaffinity(2) system call and the taskset command line instrument. Affinity could also be controlled by creating a cpuset instance. On Windows one uses the SetProcessAffinityMask() and/or SetThreadAffinityMask() and affinities can be set in Task Manager from the context menu for a given process. Also one could specify the desired affinity mask as a parameter to the START shell command when starting new processes.
What this all has to do with OpenMP is that most OpenMP runtimes for the listed OSes support under one form or another ways to specify the desired CPU affinity for each OpenMP thread. The simplest control is the OMP_PROC_BIND environment variable. This is a simple switch - when set to TRUE, it instructs the OpenMP runtime to "bind" each thread, i.e. to give it an affinity set that includes a single CPU only. The actual placement of threads to CPUs is implementation dependent and each implementation provides its own way to control it. For example, the GNU OpenMP runtime (libgomp) reads the GOMP_CPU_AFFINITY environment variable, while the Intel OpenMP runtime (open-source since not long ago) reads the KMP_AFFINITY environment variable.
The rationale here is that you could limit your program's affinity in such a way as to only use a subset of all the available CPUs. The remaining processes would then predominantly get scheduled on the rest of the CPUs, though this is only guaranteed if you manually set their affinity (which is only doable if you have root/Administrator access, since otherwise you can modify the affinity only of processes that you own).
It is worth mentioning that it often (but not always) makes no sense to run with more threads than the number of CPUs in the affinity set. For example, if you limit your program to run on 60 CPUs, then using 64 threads would result in some CPUs being oversubscribed and in timesharing between the threads. This will make some threads run slower than the others. The default scheduling for most OpenMP runtimes is schedule(static), and therefore the total execution time of the parallel region is determined by the execution time of the slowest thread. If one thread timeshares with another one, then both threads will execute slower than those threads that do not timeshare, and the whole parallel region will be delayed. Not only does this reduce the parallel performance, it also results in wasted cycles, since the faster threads simply wait doing nothing (possibly busy looping at the implicit barrier at the end of the parallel region). The solution is to use dynamic scheduling, i.e.:
#pragma omp parallel for schedule(dynamic,chunk_size)
for (int out = 1; out <= matrix.rows; out++)
{
    ...
}
where chunk_size is the size of the iteration chunk that each thread gets. The whole iteration space is divided into chunks of chunk_size iterations, which are handed to the worker threads on a first-come, first-served basis. The chunk size is an important parameter. If it is too low (the default is 1), there could be a huge overhead from the OpenMP runtime managing the dynamic scheduling. If it is too high, there might not be enough work available for each thread. It makes no sense to have a chunk size bigger than matrix.rows / #threads.
Dynamic scheduling allows your program to adapt to the available CPU resources, even if they are not uniform, e.g. if there are other processes running and timesharing with the current one. But it comes with a catch: big systems like your 64-core one are usually ccNUMA (cache-coherent non-uniform memory access) systems, which means that each CPU has its own memory block and access to the memory block(s) of other CPU(s) is costly (e.g. takes more time and/or provides less bandwidth). Dynamic scheduling tends to destroy data locality, since one cannot be sure that a block of memory that resides on one NUMA node won't get used by a thread running on another NUMA node. This is especially important when data sets are large and do not fit in the CPU caches. Therefore YMMV.

Put your process on low priority within the operating system. Use as many resources as you like. If someone else needs those resources, the OS will make sure to provide them, because their processes run at a higher (i.e. normal) priority. If there are no other users, you will get all the resources.

Poor performance in multi-threaded C++ program

I have a C++ program running on Linux in which a new thread is created to do some computationally expensive work independent of the main thread (The computational work completes by writing the results to files, which end up being very large). However, I'm getting relatively poor performance.
If I implement the program straightforward (without introducing other threads), it completes the task in roughly 2 hours. With the multi-threaded program it takes around 12 hours to do the same task (this was tested with only one thread spawned).
I've tried a couple of things, including pthread_setaffinity_np to set the thread to a single CPU (out of the 24 available on the server I'm using), as well as pthread_setschedparam to set the scheduling policy (I've only tried SCHED_BATCH). But the effects of these have so far been negligible.
Are there any general causes for this kind of problem?
EDIT: I've added some example code that I'm using, which is hopefully the most relevant parts. The function process_job() is what actually does the computational work, but it would be too much to include here. Basically, it reads in two files of data, and uses these to perform queries on an in-memory graph database, in which the results are written to two large files over a period of hours.
EDIT part 2: Just to clarify, the problem is not that I want to use threads to increase the performance of an algorithm I have. But rather, I want to run many instances of my algorithm simultaneously. Therefore, I expect the algorithm would run at a similar speed when put in a thread as it would if I didn't use multi-threads at all.
EDIT part 3: Thanks for the suggestions all. I'm currently doing some unit tests (seeing which parts are slowing down) as some have suggested. As the program takes a while to load and execute, it is taking time to see any results from the tests and therefore I apologize for late responses. I think the main point I wanted to clarify is possible reasons why threading could cause a program to run slowly. From what I gather from the comments, it simply shouldn't be. I'll post when I can find a reasonable resolution, thanks again.
(FINAL) EDIT part 4: It turns out that the problem was not related to threading after all. Describing it would be too cumbersome at this point (including the use of compiler optimization levels), but the ideas posted here were very useful and appreciated.
struct sched_param sched_param = {
    sched_get_priority_min(SCHED_BATCH)
};
int set_thread_to_core(const long tid, const int &core_id) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    return pthread_setaffinity_np(tid, sizeof(mask), &mask);
}
void *worker_thread(void *arg) {
    job_data *temp = (job_data *)arg; // get the information for the task passed in
    ...
    long tid = pthread_self();
    int set_thread = set_thread_to_core(tid, slot_id); // assume slot_id is 1 (it is in the test case I run)
    sched_get_priority_min(SCHED_BATCH);
    pthread_setschedparam(tid, SCHED_BATCH, &sched_param);
    int success = process_job(...); // this is where all the work actually happens
    pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
    ...
    pthread_t temp;
    pthread_create(&temp, NULL, worker_thread, (void *) &jobs[i]); // jobs is a vector of a class type containing information for the task
    ...
    return 0;
}

If you have plenty of CPU cores, and have plenty of work to do, it should not take longer to run in multithreaded than single threaded mode - the actual CPU time may be a fraction longer, but the "wall-clock time" should be shorter. I'm pretty sure that your code has some sort of bottleneck where one thread is blocking the other.
This is because of one or more of these things - I'll list them first, then go into detail below:
Some lock in a thread is blocking the second thread from running.
Sharing of data between threads (either true or "false" sharing)
Cache thrashing.
Competition for some external resource causing thrashing and/or blocking.
Badly designed code in general...
Some lock in a thread is blocking the second thread from running.
If there is a thread that takes a lock, and another thread wants to use the resource that is locked by this thread, it will have to wait. This obviously means the waiting thread isn't doing anything useful. Locking should be kept to a minimum by only holding each lock for a short period. You can use some code to identify whether locks are holding up your code, such as:
while (!tryLock(some_lock))
{
    tried_locking_failed[lock_id][thread_id]++;
}
total_locks[lock_id]++;
Printing some stats of the locks would help to identify where the locking is contentious - or you can try the old trick of "Press break in the debugger and see where you are" - if a thread is constantly waiting for some lock, then that's what's preventing progress...
Sharing of data between threads (either true or "false" sharing)
If two threads use the same variable [and update its value frequently], then the two threads will have to swap "I've updated this" messages, and each CPU has to fetch the data from the other CPU before it can continue with its use of the variable. Since "data" is shared on a "per cache-line" level, and a cache line is typically 64 bytes on current x86 CPUs, something like:
int var[NUM_THREADS];
...
var[thread_id]++;
would classify as something called "false sharing" - the ACTUAL data updated is unique per CPU, but since the data is within the same cache line, the cores will still have updated the same area of memory.
Cache thrashing.
If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don't run in "lockstep" on which part of cache the CPU uses. If the data is 2^n (power of two) and fairly large (a multiple of the cache-size), it's a good idea to "add an offset" for each thread - for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.
Competition for some external resource causing thrashing and/or blocking.
If two threads are reading or writing from/to the hard-disk, network card, or some other shared resource, this can lead to one thread blocking another thread, which in turn means lower performance. It is also possible that the code detects different threads and does some extra flushing to ensure that data is written in the correct order or similar, before starting work with the other thread.
It is also possible that there are locks internally in the code that deals with the resource (user-mode library or kernel mode drivers) that block when more than one thread is using the same resource.
Generally bad design
This is a "catchall" for "lots of other things that can be wrong". If the result from one calculation in one thread is needed to progress the other, obviously, not a lot of work can be done in that thread.
Too small a work-unit, so all the time is spent starting and stopping the thread, and not enough work is being done. Say, for example, that you dole out numbers to each thread one at a time, asking it to "calculate if this is a prime": it will probably take a lot longer to give the number to the thread than to calculate "is this actually a prime number". The solution is to give a set of numbers (perhaps 10, 20, 32, 64 or such) to each thread, and then report back the result for the whole lot in one go.
There are plenty of other kinds of "bad design". Without understanding your code it's quite hard to say for sure.
It is entirely possible that your problem is none of the ones I've mentioned here, but most likely it is one of these. Hopefully this answer is helpful in identifying the cause.

Read CPU Caches and Why You Care to understand why a naive port of an algorithm from one thread to multiple threads will more often than not result in greatly reduced performance and negative scalability. Algorithms that are specifically designed for parallelism take care to avoid overactive interlocked operations, false sharing and other causes of cache pollution.

Here are a few things you might wanna look into.
1°) Do you enter any critical section (locks, semaphores, etc.) between your worker thread and your main thread? (This should be the case if your queries modify the graph.) If so, that could be one of the sources of the multithreading overhead: threads competing for a lock usually degrade performance.
2°) You're using a 24-core machine, which I assume is NUMA (Non-Uniform Memory Access). Since you set the thread affinities during your tests, you should pay close attention to the memory topology of your hardware. Looking at the files in /sys/devices/system/cpu/cpuX/ can help you with that (beware that cpu0 and cpu1 aren't necessarily close together, and thus do not necessarily share memory). Threads that use memory heavily should use local memory (allocated in the same NUMA node as the core they're executing on).
3°) You are heavily using disk I/O. Which kind of I/O is that? If every thread performs synchronous I/O every time, you might want to consider asynchronous system calls, so that the OS stays in charge of scheduling those requests to the disk.
4°) Some cache issues have already been mentioned in other answers. From experience, false sharing can hurt performance as much as you're observing. My last recommendation (which should have been my first) is to use a profiler tool, such as Linux perf or OProfile. With the kind of performance degradation you're experiencing, the cause will certainly appear quite clearly.

The other answers have all addressed the general guidelines that can cause your symptoms. I will give my own, hopefully not excessively redundant version. Then I will talk a bit about how you can get to the bottom of the problem with everything discussed in mind.
In general, there's a few reasons you'd expect multiple threads to perform better:
A piece of work is dependent on some resources (disk, memory, cache, etc.) while other pieces can proceed independently of these resources or said workload.
You have multiple CPU cores that can process your workload in parallel.
The main reasons, enumerated above, you'd expect multiple threads to perform less well are all based on resource contention:
Disk contention: already explained in detail and can be a possible issue, especially if you are writing small buffers at a time instead of batching
CPU time contention if the threads are scheduled onto the same core: probably not your issue if you're setting affinity. However, you should still double check
Cache thrashing: similarly probably not your problem if you have affinity, though this can be very expensive if it is your problem.
Shared memory: again talked about in detail and doesn't seem to be your issue, but it wouldn't hurt to audit the code to check it out.
NUMA: again talked about. If your worker thread is pinned to a different core, you will want to check whether the work it needs to access is local to the main core.
Ok so far not much new. It can be any or none of the above. The question is, for your case, how can you detect where the extra time is coming from. There's a few strategies:
Audit the code and look for obvious areas. Don't spend too much time doing this as it's generally unfruitful if you wrote the program to begin with.
Refactor the single threaded code and the multi-threaded code to isolate one process() function, then profile at key checkpoints to try to account for the difference. Then narrow it down.
Refactor the resource access into batches, then profile each batch on both the control and the experiment to account for the difference. Not only will this tell you which areas (disk access vs. memory access vs. time spent in some tight loop) you need to focus your efforts on, but doing this refactor might even improve your running time overall. Example:
First copy the graph structure to thread-local memory (perform a straight-up copy in the single-threaded case)
Then perform the query
Then set up an asynchronous write to disk
Try to find a minimal reproducible workload with the same symptoms. This means changing your algorithm to do a subset of what it already does.
Make sure there's no other noise in the system that could have caused the difference (for example, some other user running a similar job on the worker core).
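The "profile at key checkpoints" strategy can be sketched roughly as follows. This is a minimal Python illustration, not the asker's code: `PhaseTimer`, `process()` and the phase names are hypothetical, and the three phases are placeholders for the copy, query and write batches described above. Run the same instrumented function in the single-threaded and the multi-threaded build and compare the per-phase totals.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start


def process(graph, timer):
    """Stand-in for one unit of work, split into three timed batches."""
    with timer.phase("copy"):
        local = dict(graph)  # shallow copy stands in for the thread-local copy
    with timer.phase("query"):
        result = sum(len(v) for v in local.values())  # placeholder query
    with timer.phase("write"):
        _ = str(result)  # placeholder for the disk write
    return result


timer = PhaseTimer()
graph = {i: list(range(i % 5)) for i in range(1000)}
answer = process(graph, timer)
for name, seconds in timer.totals.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```

If one phase accounts for most of the gap between the two configurations, that is where to narrow down next.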
My own intuition for your case:
Your graph structure is not NUMA friendly for your worker core.
The kernel can actually schedule your worker thread off the affinity core, that is, preempt it in favor of other tasks. This can happen if you haven't isolated the core you're pinning to (for example with the isolcpus boot parameter).
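As a first sanity check on the affinity assumption, one can at least query which CPUs the process is currently allowed to run on. The sketch below uses Python's `os.sched_getaffinity`, which is only available on some Unix platforms such as Linux; `allowed_cpus()` is a hypothetical helper name. Note that it only reports the affinity mask; it does not show whether other tasks are competing for the same core.

```python
import os


def allowed_cpus():
    """Return the set of CPUs this process may run on, or None if the
    platform does not provide os.sched_getaffinity."""
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(0)  # 0 means "the calling process"
    return None


cpus = allowed_cpus()
if cpus is None:
    print("affinity query not available on this platform")
else:
    print(f"allowed CPUs: {sorted(cpus)}")
```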

I can't tell you what's wrong with your program because you haven't shared enough of it to do a detailed analysis.
What I can tell you is that if this were my problem, the first thing I would try is to run two profiler sessions on my application, one on the single-threaded version and another on the dual-thread configuration. The profiler report should give you a pretty good idea of where the extra time is going. Note that you may not need to profile the entire application run; depending on the problem, the time difference may become obvious after you profile for a few seconds or minutes.
As far as profiler choices for Linux go, you may want to consider oprofile or, as a second choice, gprof.
If you find you need help interpreting the profiler output feel free to add that to your question.

It can be a right pain in the rear to track down why threads aren't working as planned. One can do so analytically, or one can use a tool to show what's going on. I've had very good mileage out of ftrace, the Linux kernel's built-in tracer, which plays a role similar to Solaris's DTrace (and to what VxWorks, Green Hills' INTEGRITY OS and Mercury Computer Systems Inc have been doing for a looong time).
In particular I found this page very useful: http://www.omappedia.com/wiki/Installing_and_Using_Ftrace, particularly this and this section. Don't worry about it being an OMAP-oriented website; I've used it on x86 Linux systems just fine (though you may have to build a kernel to include it). Also remember that the GTKWave viewer is primarily intended for looking at log traces from VHDL developments, which is why it looks 'odd'. It's just that someone realised that it would be a usable viewer for sched_switch data too, and that saved them writing one.
Using the sched_switch tracer you can see when (but not necessarily why) your threads are running, and that might be enough to give you a clue. The 'why' can be revealed by careful examination of some of the other tracers.

If you are getting a slowdown from using just 1 thread, it is likely due to overhead from thread-safe library functions or from thread setup. Creating a thread for each job will cause significant overhead, but probably not as much as you describe.
In other words, it is probably some overhead from some thread safe library function.
The best thing to do is to profile your code to find out where the time is spent. If it is in a library call, try to find a replacement library or implement it yourself. If the bottleneck is thread creation/destruction, try reusing threads, for instance with a thread pool, OpenMP tasks, or std::async in C++11 (whether std::async actually reuses threads depends on the implementation).
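To illustrate the thread-reuse idea, here is a minimal sketch using Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in for the C++ options named above. The `job()` function and the pool size of four are assumptions made for the example; the point is only that a fixed set of workers handles all jobs, so no thread is created or destroyed per job.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def job(n):
    """Placeholder job: returns the worker thread's name and a result."""
    return threading.current_thread().name, n * n


# At most four long-lived workers handle all 100 jobs.
with ThreadPoolExecutor(max_workers=4) as pool:
    outcomes = list(pool.map(job, range(100)))

results = [value for _, value in outcomes]
workers_used = {name for name, _ in outcomes}
print(f"{len(results)} jobs ran on {len(workers_used)} worker thread(s)")
```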
Some libraries are really nasty with respect to thread-safety overhead. For instance, many rand() implementations use a global lock rather than thread-local PRNGs. Such locking overhead can be much larger than the cost of generating a number, and it is hard to track down without a profiler.
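The thread-local generator pattern looks roughly like this. It is a Python sketch of the structure only (`thread_rng()` and `worker()` are hypothetical names), and since CPython has its own global interpreter lock it should not be read as a benchmark; in C or C++ the equivalent would be one generator object per thread instead of the shared state behind rand().

```python
import random
import threading

_local = threading.local()


def thread_rng():
    """Return this thread's private generator, creating it on first use."""
    rng = getattr(_local, "rng", None)
    if rng is None:
        rng = random.Random()  # seeded independently from OS entropy
        _local.rng = rng
    return rng


def worker(out, index, count):
    rng = thread_rng()  # no state shared with other threads
    out[index] = [rng.random() for _ in range(count)]


samples = [None] * 4
threads = [threading.Thread(target=worker, args=(samples, i, 1000))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print([len(s) for s in samples])
```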
The slowdown could also stem from small changes you have made, for instance declaring variables volatile, which generally should not be necessary.

I suspect you're running on a machine with one single-core processor. This problem is not parallelizable on that kind of system. Your code is constantly using the processor, which has a fixed number of cycles to offer to it. It actually runs more slowly because the additional thread adds expensive context switching to the problem.
The only kinds of problems that parallelize well on a single-processor machine are those that allow one path of execution to run while another is blocked waiting for I/O, and situations (such as keeping a responsive GUI) where allowing one thread to get some processor time is more important than executing your code as quickly as possible.
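The I/O case can be demonstrated with a small sketch. Here `time.sleep` stands in for a blocking read or write (an assumption made for the example; real I/O timing will vary), and the helper names are hypothetical. While one thread is blocked, the others can be scheduled, so the waits overlap even on a single processor.

```python
import threading
import time


def blocking_io(delay):
    """Stand-in for a blocking read or write; the thread gives up the CPU."""
    time.sleep(delay)


def run_sequential(n, delay):
    start = time.perf_counter()
    for _ in range(n):
        blocking_io(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    start = time.perf_counter()
    threads = [threading.Thread(target=blocking_io, args=(delay,))
               for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


sequential = run_sequential(4, 0.1)
threaded = run_threaded(4, 0.1)
print(f"sequential: {sequential:.2f} s, threaded: {threaded:.2f} s")
```

Replacing the sleep with a CPU-bound loop would remove the benefit, which is the situation described in the previous paragraph.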

If you only want to run many independent instances of your algorithm, can you just submit multiple jobs (with different parameters; this can be handled by a single script) to your cluster? That would eliminate the need to profile and debug your multithreaded program. I don't have much experience with multithreaded programming, but if you use MPI or OpenMP then you'd have to write less bookkeeping code too. For example, if some common initialization routine is needed and the processes can run independently thereafter, you can just initialize in one process and do a broadcast. There is no need to maintain locks and such.

Is concurrent programming the same as parallel programming?

Are they both the same thing? Looking just at what concurrent or parallel means in geometry, I'd definitely say no:
In geometry, two or more lines are said to be concurrent if they intersect at a single point.
and
Two lines in a plane that do not intersect or meet are called parallel lines.
Again, in programming, do they have the same meaning? If yes, why?
Thanks

I agree that the geometry vocabulary is in conflict. Think of train tracks instead: Two trains which are on parallel tracks can run independently and simultaneously with little or no interaction. These trains run concurrently, in parallel.
The basic usage difficulty is that "concurrent" can mean "at the same time" (with the trains or code) or "at the same place" (with the geometric lines). For many practical purposes (trains, thread resources) these two notions are directly in conflict.
Natural language is supposed to be silly, ambiguous, and confusing. But we're programmers. We can take refuge in the clarity, simplicity, and elegance of our formal programming languages. Like perl.

From Wikipedia:
Concurrent computing is a form of computing in which programs are designed as collections of interacting computational processes that may be executed in parallel.
Basically, programs can be written as concurrent programs if they are made up of smaller interacting processes. Parallel programming is actually doing these processes at the same time.
So I suppose that concurrent programming is really a style that lends itself to processes being executed in parallel to improve performance.

No, concurrent is definitely different from parallel. Here is exactly how.
Concurrency refers to the sharing of resources in the same time frame. As an example, several processes may share the same CPU or share memory or an I/O device.
Now, by definition two processes are concurrent if and only if the second starts execution before the first has terminated (on the same CPU). If the two processes both run on the same single-core CPU, the processes are concurrent but not parallel: in this case, parallelism is only virtual and refers to the OS doing timesharing. The OS seems to be executing several processes simultaneously. If there is only one single-core CPU, only one instruction from only one process can be executing at any particular time. Since the human time scale is many orders of magnitude slower than that of modern computers, the OS can rapidly switch between processes to give the appearance of several processes executing at the same time.
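The timesharing described above can be mimicked with a toy round-robin scheduler. This is only a sketch: Python generators stand in for processes, each `yield` marks the end of a time slice, and the names `task` and `round_robin` are hypothetical. A real OS preempts processes with timer interrupts rather than relying on them to yield.

```python
from collections import deque


def task(name, steps):
    """A process that needs `steps` time slices; each yield ends one slice."""
    for i in range(1, steps + 1):
        yield f"{name}{i}"


def round_robin(tasks):
    """Run one slice of each ready task in turn until all have finished."""
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # run one time slice
            ready.append(current)        # back of the queue: context switch
        except StopIteration:
            pass                         # task finished; drop it
    return trace


trace = round_robin([task("A", 3), task("B", 2)])
print(trace)  # ['A1', 'B1', 'A2', 'B2', 'A3']
```

B starts before A has terminated, so A and B are concurrent, yet at no point do two slices execute at the same time.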
If you instead run the two processes on two different CPUs, the processes are parallel: there is no sharing in the same time frame, because each process runs on its own CPU. The parallelism in this case is not virtual but physical. It is worth noting here that processes running on different cores of the same multi-core CPU do execute in parallel, but not fully independently, because they typically share some of the CPU caches (and memory bandwidth) and may contend for them.
