Signal handling and check pointing for mpif90

Signal handling and check pointing for mpif90 - fortran

I have written a code for trapping the signal for CTRL+C for gfortran and it works.
program trap
external trap_term
call signal(2, trap_term)
call sleep(60)
end program trap
function trap_term()
integer::trap_term
print*,'done'
call exit(trap_term)
end function trap_term
How would one write exactly same thing for mpif90 ? Also, what is the best way to include checkpoints and restart (probably automatic) the code (from where left before) in parallel processors.
This is required because I have allocated time on clusters. Jobs are kicked out after fixed number of hours and a new resubmission is required.

Writing your software to checkpoint on receipt of a kill signal from the operating system is likely to be far less useful than you probably hope it will be. Suppose that you can code your program to write a full checkpoint in the time available to it when it is told to stop. You are then left with restarting your program from the arbitrary point at which it was previously stopped. That's a far from trivial problem.
Why not do what many of us used to do, and many of us still do, in this domain ? Write your code to checkpoint every X iterations or at intervals of approximately Y minutes (you choose X and Y) ? And write routines to restart from one of those checkpoints in the event that a previous execution has been prematurely halted. This way you only have to restart from a single defined state of execution.
You should probably be writing these checkpoint and restart routines anyway to guard against hardware problems, which only become worse as the CPU count rises and the number of network connections multiplies.
I suppose you could write your code to keep an eye on the wall-clock, as it were, and tell it, on start-up, that it had an allowance of N hours so to checkpoint at N-n hours, where n is long enough to do the checkpointing with a small margin of error. But this approach won't help if a CPU fails mid-computation.

tl;dr; Do as High Performance Mark and francescalus suggest.
In addition to what HPM says in his answer, keep in mind that what you're allowed to do in a signal handler is extremely limited. For instance, allocating memory is not allowed, which in turn rules out a lot of other things such as Fortran (or C stdio) I/O because the Fortran I/O routines may allocate memory for their own use. You can see a list of so-called 'async-signal-safe' POSIX functions e.g. at http://man7.org/linux/man-pages/man7/signal.7.html .
Among the few things which you can reliably do in a signal handler is to set some flag variable, which you then check later on in your main program. E.g. after an iteration is finished, you check the flag whether to checkpoint and exit, and then do all the I/O and whatever in the "normal" context, not in the signal handler context. This is essentially what francescalus explained in his comment to HPM's answer.

Related

How to Read the program counter / Instruction pointer of a specific core in Kernel Mode?

Windows 10, x64 , x86
My current knowledge
Lets say it is quad core, there will be 4 individual program counters which will point to 4 different locations of code for parallel execution.
Each of this program counters indicates where a computer is in its program sequence.
The address it points to changes after a context switch where another threads program counter gets placed onto the program counter to execute.
What I want to do:
Im in Kernel Mode my thread is running on core 1 and I want to read the current instruction pointer of core 2.
Expected Results:
0x203123 is the address of the instruction pointer and this address belongs to this thread and this thread belongs to this process... etc.
Anyone knows how to do it or can give me good book references, links etc...

Although I don't believe it's officially documented, there is a ZwGetContextThread exported from ntdll.dll. Being undocumented, things can change (and I haven't tried it in quite a while) but at least when I last tried it, you called it with a thread handle and a pointer to a CONTEXT structure, and it would return that thread's context.
I'm not certain exactly how up-to-date that is though. It's never mattered to me, so I haven't checked, but my guess would be that the IP in the CONTEXT you get is whatever was saved the last time the thread was suspended. So, if you want something (reasonably) current, you'd use ZwSuspendThread, get the context, then ZwResumeThread to start it running again.
Here I suppose I'm probably supposed to give the standard lines about undocumented function being subject to change, using them being a bad idea, and that you should generally leave all of this alone. Ah well, I been disappointing teachers and other authority figures for years, and I guess I'm not changing right now.
On the other hand, there may be a practical problem here. If you really need data that's really current, this probably isn't going to work very well for you. What it gives you will be kind of current at best. On the other hand, really current is almost a meaningless concept with information that goes out of date every clock cycle.

Anyone knows how to do it or can give me good book references, links etc...
For 80x86 hardware (regardless of operating system); there are only 3 ways to do this (that I know of):
a) send an inter-processor interrupt to the other CPU, and have an interrupt handler that stores the "return EIP" (from its stack) at a known address in memory so that your CPU can read "value of EIP immediately before interrupt" (with synchronization so that your CPU doesn't read before the value is written, etc).
b) put the other CPU into some kind of "debug mode" (single-stepping, last branch recording, ...) so that (either code in a debug exception handler or the CPU's hardware itself) is constantly writing EIP values to memory that you can read.
Of course both of these options will ruin performance, and the value you get will probably be useless (because EIP would've changed after you obtain it but before you can use the obtained value). To ensure the value is still useful; you'd need the other CPU to wait until after you've consumed the obtained value (and are ready for the next value); and to do that you'd have to resort to single-step debugging facilities (with the waiting in the debug exception handler), where you'll be lucky if you can get performance better than a thousand times slower (and can probably improve performance by simply disabling other CPUs completely).
Also note that they still won't accurately tell you EIP in all cases (e.g. if the CPU is in SMM/System Management Mode and is beyond the control of the OS); and I doubt Windows kernel supports any of it (e.g. kernel should support single-stepping of user-space processes/threads to allow debuggers to work, but won't support single-stepping of kernel and will probably lock up the computer due to various "waiting for lock to be released for 6 days" problems).
The last of the 3 options is:
c) Run the OS inside an emulator/simulator instead of running it on real hardware. In that case you can probably modify the emulator/simulator's code to inject EIP values somewhere (maybe some kind of virtual "EIP reporting device"?). This will ruin performance of the emulator/simulator, but you may be able to hide that (e.g. "virtual time inside the emulator passes at a rate of one second per 1000 seconds of real time outside the emulator").

Are there some common techniques to profile coroutine based code?

It's pretty obvious how to visualize a regular call stack and count internal and external execution times. However, if one have dealt with coroutines, the call stack can look pretty messy. I mean, a coroutine may yield execution not to its parent but to another coroutine (eg. greenlet). Are there some common ways to make consistent profiling output for such scenarios?

Think about a single sample, of the stack for all threads at the same time.
What you need to know is - who's waiting for whom, and why.
Normally if function A is above B on a stack, it means A is waiting for B to return, and the reason is that A wanted B to do something.
If you look at a whole stack, for one thread, you get a chain of reasons why that particular nanosecond is being spent, by that thread.
If you're looking for speed, you're looking for chains of reasons that, altogether, you don't really need (because there is a weak link).
This works even if the chain ends in I/O.
If it is user input it's simply waiting for the user.
But if it's output, or disk I/O, or plain old CPU cranking, you might be able to do something to reduce it, and get a performance gain (if you see the same problem on 2 or more samples).
What if thread A is waiting for thread B?
Then what you see at the bottom of A's stack is a function that waits for the other thread.
You need to figure out which is thread B, and look at its stack, because the longer it takes, the longer A takes.
So this is more difficult, but surely you're not afraid of that.
I'm talking about manual profiling here, where you take samples yourself, in a debugger, and apply your full attention to each sample.
Profiling tools tend to assume you're lazy and only want numbers, and if nothing jumps out of those numbers you will be happy because you found nothing.
In fact, if some silly needless activity is taking 30% of time, then on average the number of samples you require to see it twice is 2/0.3 = 6.67 samples (not a big number), and it is quite likely that you will see it and the profiler will not.
That's random pausing.

c++ time() function performance in solaris

We have a multi-threaded C++ application running on Solaris (5.10, sparc platform). As per "pstack" most of the threads seem to be waiting on the below call often for little too long. This corresponds to "time_t currentTime = time(NULL) ;" function in the application code to get the current time in seconds.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff76c63508, ffffffff76e3e000, 2000) + 8
The timezone is "Asia/Riyadh". I tried setting the TZ variable to both "Asia/Riyadh" as well as '<GMT+3>-3'. But there is no obvious improvement with either option. Changing the server code (even if there is an alternative) is rather difficult at this point. A test program (single thread, compiled without -O2) having 1 million "time(NULL)" invocations came out rather quickly. The application & test program are compiled using gcc 4.5.1.
Is there anything else that I can try out?
I agree that it is a rather broad question. I will try out the valid suggestions and close this as soon as there is adequate improvement to handle current load.
Edit 1 :
Please ignore the reference to time(NULL) above, as a possible cause for __time stack. I made the inference based on the signature, and finding the same invocation in the source method.
Following is another stack leading to __time.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff773e5cc0, ffffffff76e3e000, 2000) + 8
ffffffff76c9c7f0 getnow (ffffffff704fb180, ffffffff773c6384, 1a311c, 2, ffffffff76e4eae8, fffc00) + 4
ffffffff76c9af0c strptime_recurse (ffffffff76e4cea0, 1, 104980178, ffffffff704fb938, ffffffff704fb180, ffffffff704fb1a4) + 34
ffffffff76c9dce8 __strptime_std (ffffffff76e4cea0, 10458b2d8, 104980178, ffffffff704fb938, 2400, 1a38ec) + 2c

You (and we) are not going to be able to make time faster.
From your message, I gather that you are calling it from many
different threads at once. This may be a problem; it's quite
possible that Solaris serializes these calls, so you end up with
a lot of threads waiting for the others to complete.
How much accuracy do you need? A possible solution might be to
have one thread loop on reading the time, sleeping maybe 10 ms
between each read, and putting the results in a global variable,
which the other threads read. (Don't forget that you'll need to
synchronize all accesses to the variable, unless you have some
sort of atomic variables, like std::atomic<time_t> in C++11.)

Keep in mind that pstack doesn't just immediately interrupt your program and generate a stack. It has to grab debug-level control and if time calls are sufficiently frequent it may drastically over-indicate calls to time as it utilizes those syscalls to take control of your application to print the stack.
Most likely the time calls are not the source of your real performance problem. I suspect you'll want to utilize a profiler such as gprof (with g++ -p). Alternately you could utilize some of the dtrace kits and use the hotuser dtrace script which will do basic statistical profiling on your running application's user code.
time returns UTC time so any changes to TZ should have no effect on its call time whatsoever.
If, after profiling, it turns out that time really is the culprit you may be able to cache the value from the time call since it won't change more than once a second.

Performance bottleneck at _L_unlock_16

I am trying to use google perf tools CPU profiler for debugging performance issues on a multi-threaded program. With single thread it take 250 ms while 4 threads take around 900ms.
My program has a mmap'ed file which is shared across threads and all operations are read only. Also my program creates large number of objects which are not shared across threads. (Specifically my program uses CRF++ library to do some querying). I am trying to figure out how to make my program perform better with multi threads. Call graph produced by CPU profiler of gperf tools shows that my program spends a lot of time (around 50%) of time in _L_unlock_16.
Searching web for _L_unlock_16 pointed to some bug reports with canonical suggesting that its associated with libpthread. But other than that I was not able to find any useful information for debugging.
A brief description of what my program does. I have few words in a file (4). In my program I have a processWord() which processes a single word using CRF++. This processWord() is what each thread executes. My main() reads words from the file and each threads runs processWord() in parallel. If I process a single word(hence only 1 thread) it takes 250ms and so if I process all 4 words(and hence 4 threads) I expected it to finish by same time 250 ms, however as I mentioned above it's taking around 900ms.
This is the callgraph of the execution - https://www.dropbox.com/s/o1mkh477i7e9s4m/cgout_n2.png
I want to understand why my program is spending lot of time at _L_unlock_16 and what I can do to mitigate it.

Yet again the _L_unlock_16 is not a function of your code. Have you looked at the stracktraces above that function? What are the callers of it when the program waits? You've said that the program wastes 50% waiting inside. But, which part of the program ordered that operation? Is it again from memory alloc/dealloc ops?
The function seems to come from libpthread. Does CRF+ handle threads/libpthread in any way? If yes, then maybe the library is illconfigured? Or maybe it implements some 'basic threadsafety' by adding locks everywhere and simply is not built well for multithreading? What does the docs say about that?
Personally, I'd guess that it ignores threads and that you have added all the threading. I may be wrong, but if that's true, then the CRF++ probably will not call that 'unlock' function at all, and the 'unlock' is somwhow called from your code that orchestrates the threads/locks/queues/messages etc? Halt the program a few times and look at who called the unlock. If it really spends 50% sitting in the unlock, you will very quickly know who causes the lock to be used and you will be able to either eliminate it or at least perform a more refined research..
EDIT #1:
Eh.. when I said "stacktrace" I meant stacktrace, not callgraph. Callgraph may look nice in trivial cases, but in more complex ones, it will be mangled and unreadable and will hide the precious details into "compacted" form.. But, fortunatelly, here the case looks simple enough.
Please observe the beginning: "Process word, 99x". I assume that the "99x" is the call count. Then, look at "tagger-parse": 97x. From that:
61x into rebuildFeatures from which 41x goes right into unlock and 20(13) indirectly into it
23x goes to buildLattice fro which 21x goes into unlock
I'd guess that it was the CRF++ uses locking quite heavily. For me, it seems that you simply observe the effects of CRF's internal locking. It certainly is not lockless internally.
It seems to lock at least once per "processWord". It's hard to say without looking at code (is it opensource? I've not checked..), from stacktraces it would be more obvious, but IF it really locks once per "processWord" that it could even be a sort of a "global lock" that protects "everything" from "all threads" and causes all jobs to serialize. Whatever. Anyways, clearly, it's the CRF++'s internals that lock and wait.
If your CRF objects are really (really) not shared across threads, then remove threading configuration flags from CRF, pray that they were sane enough to not use any static variables nor global objects, add some own locking (if needed) at the topmost job/result level and retry. You should have it now much faster.
If the CRF objects are shared, unshare them and see above.
But, if they are shared behind the scenes, then there's little doable. Change your library to a one that has a better threading support, OR fix the library, OR ignore and use it with current performance.
The last advice may sound strange (it works slowly, right? so why to ignore it?), but in fact is the most important one, and you should try it first. If the parallel tasks have similar "data profile", then there is very probable that they will try to hit the same locks in the same approximate moment of time. Imagine a medium-sized cache that holds words sorted by its first letter. At the toplevel there's array of, say, 26 entries. Each entry has a lock and a list of words inside. If you run 100 threads that will each first check "mom" then "dad" then "son" - then all of that 100 threads will first hit and wait for each other at "M", then at "D" then at "S". Well, approximately/probably of course. But you get the idea. If the data profile were more random then they'd block each other far less. Mind that processing ONE word is a .. small task and that you try to process the same word(s). Even if the internal CRF's locking is smart, it is just bound to hit the same areas. Try again with a more dispersed data.
Add to that the fact that threading costs. If something was guarded against races with use of locks, then every lock/unlock costs because at least they have to "halt and check if the lock is open" (sorry very imprecise wording). If the data-to-process is small in relation to the-amount-of-lockchecks, then adding more threads will not help and will just waste the time. For checking one word, it may even happen that the sole handling of a single lock takes longer than processing the word! But, if the amount of data to be processed were larger, then the cost of flipping a lock compared to processing the data might start being neglible.
Prepare a set of 100 or more words. Run and measure it on one thread. Then partition the words at random and run it on 2 and 4 threads. And measure. If not better, try at 1000 and 10000 words. The more the better, of course, keeping in mind that the test should not last till your next birthday ;)
If you notice that 10k words split over 4 threads (2500w per th) works about 40%-30%-or even 25% faster than on one thread - here you go! You simply gave it a too small job. It was tailored and optimized for bigger ones!
But, on the other hand, it may happen that 10k words split over 4 threads does not work faster, or worse, works slower - then it might indicate that the library handles multithreading very wrong. Now try the other things like stripping threading from it or repairing it.

Poor performance in multi-threaded C++ program

I have a C++ program running on Linux in which a new thread is created to do some computationally expensive work independent of the main thread (The computational work completes by writing the results to files, which end up being very large). However, I'm getting relatively poor performance.
If I implement the program straightforward (without introducing other threads), it completes the task in roughly 2 hours. With the multi-threaded program it takes around 12 hours to do the same task (this was tested with only one thread spawned).
I've tried a couple of things, including pthread_setaffinity_np to set the thread to a single CPU (out of the 24 available on the server I'm using), as well as pthread_setschedparam to set the scheduling policy (I've only tried SCHED_BATCH). But the effects of these have so far been negligible.
Are there any general causes for this kind of problem?
EDIT: I've added some example code that I'm using, which is hopefully the most relevant parts. The function process_job() is what actually does the computational work, but it would be too much to include here. Basically, it reads in two files of data, and uses these to perform queries on an in-memory graph database, in which the results are written to two large files over a period of hours.
EDIT part 2: Just to clarify, the problem is not that I want to use threads to increase the performance of an algorithm I have. But rather, I want to run many instances of my algorithm simultaneously. Therefore, I expect the algorithm would run at a similar speed when put in a thread as it would if I didn't use multi-threads at all.
EDIT part 3: Thanks for the suggestions all. I'm currently doing some unit tests (seeing which parts are slowing down) as some have suggested. As the program takes a while to load and execute, it is taking time to see any results from the tests and therefore I apologize for late responses. I think the main point I wanted to clarify is possible reasons why threading could cause a program to run slowly. From what I gather from the comments, it simply shouldn't be. I'll post when I can find a reasonable resolution, thanks again.
(FINAL) EDIT part 4: It turns out that the problem was not related to threading after all. Describing it would be too cumbersome at this point (including the use of compiler optimization levels), but the ideas posted here were very useful and appreciated.
struct sched_param sched_param = {
sched_get_priority_min(SCHED_BATCH)
};
int set_thread_to_core(const long tid, const int &core_id) {
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(core_id, &mask);
return pthread_setaffinity_np(tid, sizeof(mask), &mask);
}
void *worker_thread(void *arg) {
job_data *temp = (job_data *)arg; // get the information for the task passed in
...
long tid = pthread_self();
int set_thread = set_thread_to_core(tid, slot_id); // assume slot_id is 1 (it is in the test case I run)
sched_get_priority_min(SCHED_BATCH);
pthread_setschedparam(tid, SCHED_BATCH, &sched_param);
int success = process_job(...); // this is where all the work actually happens
pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
...
pthread_t temp;
pthread_create(&temp, NULL, worker_thread, (void *) &jobs[i]); // jobs is a vector of a class type containing information for the task
...
return 0;
}

If you have plenty of CPU cores, and have plenty of work to do, it should not take longer to run in multithreaded than single threaded mode - the actual CPU time may be a fraction longer, but the "wall-clock time" should be shorter. I'm pretty sure that your code has some sort of bottleneck where one thread is blocking the other.
This is because of one or more of these things - I'll list them first, then go into detail below:
Some lock in a thread is blocking the second thread from running.
Sharing of data between threads (either true or "false" sharing)
Cache thrashing.
Competition for some external resource causing thrashing and/or blocking.
Badly designed code in general...
Some lock in a thread is blocking the second thread from running.
If there is a thread that takes a lock, and another thread wants to use the resource that is locked by this thread, it will have to wait. This obviously means the thread isn't doing anything useful. Locks should be kept to a minimum by only taking the lock for a short period. Using some code to identify if locks are holding your code, such as:
while (!tryLock(some_some_lock))
{
tried_locking_failed[lock_id][thread_id]++;
}
total_locks[some_lock]++;
Printing some stats of the locks would help to identify where the locking is contentious - or you can try the old trick of "Press break in the debugger and see where you are" - if a thread is constantly waiting for some lock, then that's what's preventing progress...
Sharing of data between threads (either true or "false" sharing)
If two threads use [and update the value of it frequently] the same variable, then the two threads will have to swap "I've updated this" messages, and the CPU's have to fetch the data from the other CPU before it can continue with it's use of the variable. Since "data" is shared on a "per cache-line" level, and a cache-line is typically 32-bytes, something like:
int var[NUM_THREADS];
...
var[thread_id]++;
would classify as something called "false sharing" - the ACTUAL data updated is unique per CPU, but since the data is within the same 32-byte region, the cores will still have updated the same are of memory.
Cache thrashing.
If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don't run in "lockstep" on which part of cache the CPU uses. If the data is 2^n (power of two) and fairly large (a multiple of the cache-size), it's a good idea to "add an offset" for each thread - for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.
Competition for some external resource causing thrashing and/or blocking.
If two threads are reading or writing from/to the hard-disk, network card, or some other shared resource, this can lead to one thread blocking another thread, which in turn means lower performance. It is also possible that the code detects different threads and does some extra flushing to ensure that data is written in the correct order or similar, before starting work with the other thread.
It is also possible that there are locks internally in the code that deals with the resource (user-mode library or kernel mode drivers) that block when more than one thread is using the same resource.
Generally bad design
This is a "catchall" for "lots of other things that can be wrong". If the result from one calculation in one thread is needed to progress the other, obviously, not a lot of work can be done in that thread.
Too small a work-unit, so all the time is spent starting and stopping the thread, and not enough work is being done. Say for example that you dole out small numbers to be "calculate if this is a prime" to each thread, one number at a time, it will probably take a lot longer to give the number to the thread than the calculation of "is this actually a prime-number" - the solution is to give a set of numbers (perhaps 10, 20, 32, 64 or such) to each thread, and then report back the result for the whole lot in one go.
There are plenty of other "bad design". Without understanding your code it's quite hard to say for sure.
It is entirely possible that your problem is none of the ones I've mentioned here, but most likely it is one of these. Hopefully this asnwer is helpful to identify the cause.

Read CPU Caches and Why You Care to understand why a naive port of an algorithm from one thread to multiple threads will more often than not result in greatly reduced performance and negative scalability. Algorithms that are specififcally designed for parallelism take care of overactive interlocked operations, false sharing and other causes of cache pollution.

Here are a few things you might wanna look into.
1°) Do you enter any critical section (locks, semaphores, etc.) between your worker thread and your main thread? (this should be the case if your queries modify the graph). If so, that could be one of the sources of the multithreading overhead : threads competing for a lock usually degrades performances.
2°) You're using a 24 cores machines, which I assume would be NUMA (Non-Uniform Memory Access). Since you set the threads affinities during your tests, you should pay close attention to the memory topology of your hardware. Looking at the files in /sys/devices/system/cpu/cpuX/ can help you with that (beware that cpu0 and cpu1 aren't necessarily close together, and thus does not necessarily share memory). Threads heavily using memory should use local memory (allocated in the same NUMA node as the core they're executing on).
3°) You are heavily using disk I/O. Which kind of I/O is that? if every thread perform every time some synchronous I/O, you might wanna consider asynchronous system calls, so that the OS stays in charge of scheduling those requests to the disk.
4°) Some caches issues have already been mentionned in other answers. From experience, false sharing can hurt performances as much as you're observing. My last recommendation (which should have been my first) is to use a profiler tool, such as Linux Perf, or OProfile. With such performance degradation you're experiencing, the cause will certainly appear quite clearly.

The other answers have all addressed the general guidelines that can cause your symptoms. I will give my own, hopefully not excessively redundant version. Then I will talk a bit about how you can get to the bottom of the problem with everything discussed in mind.
In general, there's a few reasons you'd expect multiple threads to perform better:
A piece of work is dependent on some resources (disk, memory, cache, etc.) while other pieces can proceed independently of these resources or said workload.
You have multiple CPU cores that can process your workload in parallel.
The main reasons, enumerated above, you'd expect multiple threads to perform less well are all based on resource contention:
Disk contention: already explained in detail and can be a possible issue, especially if you are writing small buffers at a time instead of batching
CPU time contention if the threads are scheduled onto the same core: probably not your issue if you're setting affinity. However, you should still double check
Cache thrashing: similarly probably not your problem if you have affinity, though this can be very expensive if it is your problem.
Shared memory: again talked about in detail and doesn't seem to be your issue, but it wouldn't hurt to audit the code to check it out.
NUMA: again talked about. If your worker thread is pinned to a different core, you will want to check whether the work it needs to access is local to the main core.
Ok so far not much new. It can be any or none of the above. The question is, for your case, how can you detect where the extra time is coming from. There's a few strategies:
Audit the code and look for obvious areas. Don't spend too much time doing this as it's generally unfruitful if you wrote the program to begin with.
Refactor the single threaded code and the multi-threaded code to isolate one process() function, then profile at key checkpoints to try to account for the difference. Then narrow it down.
Refactor the resource access into batches, then profile each batch on both the control and the experiment to account for the difference. Not only will this tell you which areas (disk access vs memory access vs spending time in some tight loop) you need to focus your efforts on, doing this refactor might even improve your running time overall. Example:
First copy the graph structure to thread-local memory (perform a straight-up copy in the single-threaded case)
Then perform the query
Then setup an asynchronous write to disk
Try to find a minimally reproducible workload with the same symptoms. This means changing your algorithm to do a subset of what it already does.
Make sure there's no other noise in the system that could've caused the difference (if some other user is running a similar system on the work core).
My own intuition for your case:
Your graph structure is not NUMA friendly for your worker core.
The kernel can actually scheduled your worker thread off the affinity core. This can happen if you don't have isolcpu on for the core you're pinning to.

I can't tell you what's wrong with your program because you haven't shared enough of it to do a detailed analysis.
What I can tell you is if this was my problem the first thing I would try is to run two profiler sessions on my application, one on the single threaded version and another on the dual thread configuration. The profiler report should give you a pretty good idea of where the extra time is going. Note that you may not need to profile the entire application run, depending on the problem the time difference may become obvious after you profile for a few seconds or minutes.
As far as profiler choices for Linux you may want to consider oprofile or as a second choice gprof.
If you find you need help interpreting the profiler output feel free to add that to your question.

It can be a right pain in the rear to track down why threads aren't working as planned. One can do so analytically, or one can use tool to show what's going on. I've had very good mileage out of ftrace, Linux's clone of Solaris's dtrace (which in turn is based on what VxWorks, Greenhill's Integrity OS and Mercury Computer Systems Inc have been doing for a looong time.)
In particular I found this page very useful: http://www.omappedia.com/wiki/Installing_and_Using_Ftrace, particularly this and this section. Don't worry about it being an OMAP orientated website; I've used it on X86 Linuxes just fine (though you may have to build a kernel to include it). Also remember that the GTKWave viewer is primarily intended for looking at log traces from VHDL developments, which is why it looks 'odd'. It's just that someone realised that it would be a usable viewer for sched_switch data too, and that saved them writing one.
Using the sched_switch tracer you can see when (but not necessarily why) your threads are running, and that might be enough to give you a clue. The 'why' can be revealed by careful examination of some of the other tracers.

If you are getting slowdown from using 1 thread, it is likely due to overhead from using thread safe library functions, or from thread setup. Creating a thread for each job will cause significant overhead, but probably not as much as you refer to.
In other words, it is probably some overhead from some thread safe library function.
The best thing to do, is to profile your code to find out where time is spent. If it is in a library call, try to find a replacement library or implement it yourself. If the bottleneck is thread creation/destruction try reusing threads, for instance using OpenMP tasks or std::async in C++11.
Some libraries are really nasty wrt thread safe overhead. For instance, many rand() implementations use a global lock, rather than using thread local prgn's. Such locking overhead is much larger than generating a number, and is hard to track without a profiler.
The slowdown could also stem from small changes you have made, for instance declaring variables volatile, which generally should not be necessary.

I suspect you're running on a machine with one single-core processor. This problem is not parallelizable on that kind of system. Your code is constantly using the processor, which has a fixed number of cycles to offer to it. It actually runs more slowly because the additional thread adds expensive context switching to the problem.
The only kinds of problems that parallelize well on a single-processor machine are those that allow one path of execution to run while another is blocked waiting for I/O, and situations (such as keeping a responsive GUI) where allowing one thread to get some processor time is more important than executing your code as quickly as possible.

If you only want to run many independent instances of your algorithm can you just submit multiple jobs (with different parameters, can be handled by a single script) to your cluster? That would eliminate the need to profile and debug your multithreaded program. I don't have much experience with multithreaded programming but if you use MPI or OpenMP then you'd have to write less code for the book keeping too. For example, if some common initialization routine is needed and the processes can run independently thereafter you can just do that by initializing in one thread and doing a broadcast. No need for maintaining locks and such.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js