Locking a process to Cuda core

Locking a process to Cuda core - c++

I'm just getting into GPU processing.
I was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
For example you may have a small C program that performs an image filter on an index of images. Can you have that program running on each CUDA core that essentially runs forever - reading/writing from it's own memory to system memory and disk?
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
My semantics here are probably way off. I apologize if what i've said requries some interpretation. I'm not that used to GPU stuff yet.
Thanks.

All of my comments here should be prefaced with "at the moment". Technology is constantly evolving.
was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
process is mostly a (host) operating system term. CUDA doesn't define a process separately from the host operating system definition of it, AFAIK. CUDA threadblocks, once launched on a Streaming Multiprocessor (or SM, a hardware execution resource component inside a GPU), in many cases will stay on that SM for their "lifetime", and the SM includes an array of "CUDA cores" (a bit of a loose or conceptual term). However, there is at least one documented exception today to this in the case of CUDA Dynamic Parallelism, so in the most general sense, it is not possible to "lock" a CUDA thread of execution to a CUDA core (using core here to refer to that thread of execution forever remaining on a given warp lane within a SM).
Can you have that program running on each CUDA core that essentially runs forever
You can have a CUDA program that runs essentially forever. It is a recognized programming technique sometimes referred to as persistent threads. Such a program will naturally occupy/require one or more CUDA cores (again, using the term loosely). As already stated, that may or may not imply that the program permanently occupies a particular set of physical execution resources.
reading/writing from it's own memory to system memory
Yes, that's possible, extending the train of thought. Writing to it's own memory is obviously possible, by definition, and writing to system memory is possible via the zero-copy mechanism (slides 21/22), given a reasonable assumption of appropriate setup activity for this mechanism.
and disk?
No, that's not directly possible today, without host system interaction, and/or without a significant assumption of atypical external resources such as a disk controller of some sort connected via a GPUDirect interface (with a lot of additional assumptions and unspecified framework). The GPUDirect exception requires so much additional framework, that I would say, for typical usage, the answer is "no", not without host system activity/intervention. The host system (normally) owns the disk drive, not the GPU.
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
In my opinion, the CPU must still be considered. One consideration is if you need to write to disk. Even if you don't, most programs derive I/O from somewhere (e.g. MPI) and so the implication of a larger framework of some sort is there. Secondly, and relatedly, the persistent threads programming model usually implies a producer/consumer relationship, and a work queue. The GPU is on the processing side (consumer side) of the work queue, but something else (usually) is on the producer side, typically the host system CPU. Again, it could be another GPU, either locally or via MPI, that is on the producer side of the work queue, but that still usually implies an ultimate producer somewhere else (i.e. the need for system I/O).
Additionally:
Can CUDA threads send packets over a network?
This is like the disk question. These questions could be viewed in a general way, in which case the answer might be "yes". But restricting ourselves to formal definitions of what a CUDA thread can do, I believe the answer is more reasonably "no". CUDA provides no direct definitions for I/O interfaces to disk or network (or many other things, such as a display!). It's reasonable to conjecture or presume the existence of a lightweight host process that simply copies packets between a CUDA GPU and a network interface. With this presumption, the answer might be "yes" (and similarly for disk I/O). But without this presumption (and/or a related, perhaps more involved presumption of a GPUDirect framework), I think the most reasonable answer is "no". According to the CUDA programming model, there is no definition of how to access a disk or network resource directly.

Related

How can I distinguish between high- and low-performance cores/threads in C++?

When talking about multi-threading, it often seems like threads are treated as equal - just the same as the main thread, but running next to it.
On some new processors, however, such as the Apple "M" series and the upcoming Intel Alder Lake series not all threads are equally as performant as these chips feature separate high-performance cores and high-efficiency, slower cores.
It’s not to say that there weren’t already things such as hyper-threading, but this seems to have a much larger performance implication.
Is there a way to query std::thread‘s properties and enforce on which cores they’ll run in C++?

How to distinguish between high- and low-performance cores/threads in C++?
Please understand that "thread" is an abstraction of the hardware's capabilities and that something beyond your control (the OS, the kernel's scheduler) is responsible for creating and managing this abstraction. "Importance" and performance hints are part of that abstraction (typically presented in the form of a thread priority).
Any attempt to break the "thread" abstraction (e.g. determine if the core is a low-performance or high-performance core) is misguided. E.g. OS could change your thread to a low performance core immediately after you find out that you were running on a high performance core, leading you to assume that you're on a high performance core when you are not.
Even pinning your thread to a specific core (in the hope that it'll always be using a high-performance core) can/will backfire (cause you to get less work done because you've prevented yourself from using a "faster than nothing" low-performance core when high-performance core/s are busy doing other work).
The biggest problem is that C++ creates a worse abstraction (std::thread) on top of the "likely better" abstraction provided by the OS. Specifically, there's no way to set, modify or obtain the thread priority using std::thread; so you're left without any control over the "performance hints" that are necessary (for the OS, scheduler) to make good "load vs. performance vs. power management" decisions.
When talking about multi-threading, it often seems like threads are treated as equal
Often people think we're still using time-sharing systems from the 1960s. Stop listening to these fools. Modern systems do not allow CPU time to be wasted on unimportant work while more important work waits. Effective use of thread priorities is a fundamental performance requirement. Everything else ("load vs. performance vs. power management" decisions) is, by necessity, beyond your control (on the other side of the "thread" abstraction you're using).

Is there any way to query std::thread‘s properties and enforce on which cores they’ll run in C++?
No. There is no standard API for this in C++.
Platform-specific APIs do have the ability to specify a specific logical core (or a set of such cores) for a software thread. For example, GNU has pthread_setaffinity_np.
Note that this allows you to specify "core 1" for your thread, but that doesn't necessarily help with getting the "performance" core unless you know which core that is. To figure that out, you may need to go below OS level and into CPU-specific assembly programming. In the case of Intel to my understanding, you would use the Enhanced Hardware Feedback Interface.

No, the C++ standard library has no direct way to query the sub-type of CPU, or state you want a thread to run on a specific CPU.
But std::thread (and jthread) does have .native_handle(), which on most platforms will let you do this.
If you know the threading library implementation of your std::thread, you can use native_handle() to get at the underlying primitives, then use the underlying threading library to do this kind of low-level work.
This will be completely non-portable, of course.

iPhones, iPads, and newer Macs have high- and low-performance cores for a reason. The low-performance cores allow some reasonable amount of work to be done while using the smallest possible amount of energy, making the battery of the device last longer. These additional cores are not there just for fun; if you try to get around them, you can end up with a much worse experience for the user.
If you use the C++ standard library for running multiple threads, the operating system will detect what you are doing, and act accordingly. If your task only takes 10ms on a high-performance core, it will be moved to a low-performance core; it's fast enough and saves battery life. If you have multiple threads using 100% of the CPU time, the high-performance cores will be used automatically (plus the low-performance cores as well). If your battery runs low, the device can switch to all low-performance cores which will get more work done with the battery charge you have.
You should really think about what you want to do. You should put the needs of the user ahead of your perceived needs. Apart from that, Apple recommends assigning OS-specific priorities to your threads, which improves behaviour if you do it right. Giving a thread the highest priority so you can get better benchmark results is usually not "doing it right".

You can't select the core that a thread will be physically scheduled to run on using std::thread. See here for more. I'd suggest using a framework like OpenMP, MPI, or you will have dig into the native Mac OS APIs to select the core for your thread to execute on.

macOS provides a notion of "Quality of Service" for tasks, task queues and run loops, and threads. If you use libdispatch/GCD then the queue priorities map to the QoS as well. This article describes the QoS system in detail.
Using the macOS pthreads interface you can set a thread QoS before creating a thread, query a thread's QoS, or temporarily override a thread's QoS level (not visible in the query function though) using the non-portable functions in pthread/qos.h
This system by no means offers guarantees about how your threads will be scheduled, but can be used to make a hint to the scheduler.
I'm not aware of any way to get a similar interface on other systems, but that doesn't mean they don't exist. I imagine they'll become more widely discussed as these hybrid CPUs befome more common.
EDIT: Intel provides information here about how to query this information for their hybrid processors on Windows and for the current CPU using cpuid, haven't had a chance to play with this though.

Can more than one Load/Store instructions can be executed at the same instance of time in Multiprocessor Environment

I believe in single processor systems, more than one Store will happen one after the other,
but what is the case for multi processsor systems?
Adding to the question, also if the machine is 32bit and when we try to write
a long int(64 bit) value to the memory, how will the Load/Store instructions behave?
The reason for the above two questions is, if someone tries to read the same memory (a memory
of size 32bit/64 bit, in 32 bit systems), in another
thread will this be safe, or do i need to consider using locks.?
Added:
I wanted to do with minimum locks possible since ours is time critical execution.
Hence I wanted to understand is there ever a possibility of executing two Store/Load instructions
at the same instant of time to the same memory location, if things gets executed in multi processor
environment.

You are wrong if you only look on load/store cpu instructions.
The compiler and your os and cpu can:
change execution order to optimize the code
can hold values in separate caches
can store data in cpu registers without accessing cache or other memory
can optimize access complete away
... a lot more I believe!
If you want to access the same variable from different threads you must use a synchronization mechanism which is provided from your language or a library which fits to your os. Nothing else will give you a guarantee to work.
The problem is not the real access to any kind of memory. You definitely must ensure that your code contains memory barriers as needed for the underlying libraries and OS support. If there are no barriers between multi thread access you will maybe not see any change from a write in one thread while read it from a second one.
This will also be a problem on a single core cpu because the compiler have no idea that you modify a variable from two threads if you don't use any kind of synchronization.
To your add on:
You simply have no control over any kind of memory access without writing your code in assembler. And if you write it in assembler, you! have to deal with registers L1/L2/Lx Caching, Memory Mapping, Inter-CPU-Communication and so on. Forget all about load/store instruction. This is only 1% of the job!
If you have time critical jobs:
try to fix the core were the thread runs on ( see detailed description for threading libraries like posix pthreads or whatever lib you are running on )
it can be much faster to run a single process with a single thread and program it in a cooperative fashion. No locks, no memory barriers, no ipc. But you have to deal with all the thread a like problems. But it is fast!
Often it is much faster to split you problem in some processes each only with one thread and make the ipc minimal. This needs a deep understanding how you can scale your algorithms.
Often a very! simple 8/16 bit cpu runs much faster in special environments in comparison to a fat 8 core cpu with fat os on it.
But you don't tell us the rest of your environment and requirements so the answer never can give a full answer to your real problem. But keep in mind: load/store was yesterday.

This can not be answered generically. You have to know which model of what design of processor it is. An AMD Opteron will be different from an Intel Pentium which is different from a Intel Core2, and all of those are different from an ARMv7 design. [They are probably fairly similar, but there's details that you may care about if you REALLY want to rely on these operations to be performed in a specific way]. And of course, if you share memory between, say, a GPU (graphics processing unit) and a CPU, you have even more possible scenarios of "different design".
There are single core "superscalar" (more than one execution unit) and "out of order execution" (processors that reorder instructions), so more than one execution unit (including more than one load/store unit), and thus more than one instruction (including load or store) can be performed at the same time.
Obviously, once the processor determines that the memory operation needs to go "outside" (that is, the value is not available in the cache), it has to be serialized, but there is no guarantee that a load or store as sequenced by you or the compiler won't be re-ordered between loads and stores. If the processor has instructions to support "data wider than the bus" (e.g. 32-bit processor loading 64-bit word), these are typically atomic to that processor. If the processor does not in itself support 64-bit words, then the load of a 64-bit value would encompass two 32-bit loads.
[When I write "load", the same applies for "store"]
In case of multiprocessor or multicore architectures, it becomes a system architecture question, which makes it even more complicated than "we can't answer this without understanding the processor design", since there are now more components involved: memory design (one lump of memory shared between processors, several lumps of memory that are not directly shared, etc).
In general, if you have multiple threads, you will need to use atomic operations - most processors have a way to say "I want this to happen without someone else interfering" - in the old days, it would be a "lock" pin on the processor(s) that would be wired to anything else that could access the memory bus, and when that pin was active, all other devices has to wait for it to become inactive before accessing the memory bus. These days, it's a fair bit more sophisticated, since there are caches involved. Most systems use an "exclusive cache content" method: The processor signals all it's peers that "I want this address to be exclusive in my cache", at which point all other processors will "flush and invalidate" that particular address in their caches. Then the atomic operation is performed in the cache, and the results available to be read by other processors only when the atomic operation is completed. This is a pretty simplified view of how it works - modern processors are very complex, and there is a lot of work involved with such seemingly simple things as "make sure this value gets updated in a way that doesn't get interrupted by some other processor writing to the same thing".
If there isn't support in the processor for "atomic" operations, then there has to be proper locks (and any processor designed for use in a multicore/multicpu environment will have operations to support locks in some way), where the lock is taken before updating something, and then released after the update. This is clearly more complex than having builtin atomic operations, but it makes the design of the processor simpler. Also, for more complex updates (where more than one 32- or 64-bit value needs updating) this sort of locking is still required - for example, if we have a "queue" where you have a "where we're writing" and "elements in queue" that both need to be updated on write, you can't do that in a single operation [without being VERY clever about it, at least].
In heterogeneous systems, such as GPU + CPU combinations, you can't do atomics between different devices, because the cache of one device doesn't "understand the language" of the other device. So when the CPU says "I want this as exclusive", the GPU sees "Hurdi gurdi meatballs" and thinks "I have no idea what that is about, I'll just ignore it" [or something like that]. In this case, there has to be some other way to access shared data, but it's typically not atomic ever, you have to send commands (via other means than the interprocessor signalling system) to the GPU to say "flush your cache, and tell me when you're done with that", and when the CPU has written something the GPU needs, the CPU will flush it's cache before telling the GPU that it can use the data. This can get pretty messy, and takes a fair amount of time.

I believe in single processor systems, more than one Store will happen one after the other,
False. Most machines are set up like that, but for performance reasons many CPUs can be configured to have a much more relaxed store ordering. This is almost never a problem for an application on a single CPU (because the CPU will make it look like you expect) but it's really critical to understand when talking to hardware.
Here's a wikipedia article: http://en.wikipedia.org/wiki/Memory_ordering
This gets doubly complicated on CPUs with non-coherent local caches. Because then you can have strong ordering as seen from this CPU while other CPUs will see totally different results depending on the cache flush order.
Adding to the question, also if the machine is 32bit and when we try to write a long int(64 bit) value to the memory, how will the Load/Store instructions behave?
Some 32 bit CPUs have instructions to do atomic 64 bit writes, others don't. Those that don't will do two separate writes where a partial result can be seen by other CPUs or threads (if you get unlucky with context switching) or signal handlers or interrupt handlers.
The reason for the above two questions is, if someone tries to read the same memory (a memory of size 32bit/64 bit, in 32 bit systems), in another thread will this be safe, or do i need to consider using locks.?
Yes, no, maybe. If it's just one value and it doesn't tell the other thread that some other memory might be in a certain state, then yes, it can be safe in certain circumstances. You're not guaranteed that the other thread will see the changed value in memory for a long time, but eventually it should see it.
Generally, you can't reason about the behavior of access to shared memory in a threaded environment without strictly following the documentation of the thread model you're using. And most of those say something like without locks the behavior is undefined, with locks everything that happened before the lock is guaranteed to happen before the lock and everything that happens after the lock is guaranteed to happen after the lock. This is not only because of differences between CPUs, but also because the operating system can do something funny and the locking code needs to be designed to convince the compiler to not do something funny either (which is surprisingly hard with modern compilers).

Poor performance in multi-threaded C++ program

I have a C++ program running on Linux in which a new thread is created to do some computationally expensive work independent of the main thread (The computational work completes by writing the results to files, which end up being very large). However, I'm getting relatively poor performance.
If I implement the program straightforward (without introducing other threads), it completes the task in roughly 2 hours. With the multi-threaded program it takes around 12 hours to do the same task (this was tested with only one thread spawned).
I've tried a couple of things, including pthread_setaffinity_np to set the thread to a single CPU (out of the 24 available on the server I'm using), as well as pthread_setschedparam to set the scheduling policy (I've only tried SCHED_BATCH). But the effects of these have so far been negligible.
Are there any general causes for this kind of problem?
EDIT: I've added some example code that I'm using, which is hopefully the most relevant parts. The function process_job() is what actually does the computational work, but it would be too much to include here. Basically, it reads in two files of data, and uses these to perform queries on an in-memory graph database, in which the results are written to two large files over a period of hours.
EDIT part 2: Just to clarify, the problem is not that I want to use threads to increase the performance of an algorithm I have. But rather, I want to run many instances of my algorithm simultaneously. Therefore, I expect the algorithm would run at a similar speed when put in a thread as it would if I didn't use multi-threads at all.
EDIT part 3: Thanks for the suggestions all. I'm currently doing some unit tests (seeing which parts are slowing down) as some have suggested. As the program takes a while to load and execute, it is taking time to see any results from the tests and therefore I apologize for late responses. I think the main point I wanted to clarify is possible reasons why threading could cause a program to run slowly. From what I gather from the comments, it simply shouldn't be. I'll post when I can find a reasonable resolution, thanks again.
(FINAL) EDIT part 4: It turns out that the problem was not related to threading after all. Describing it would be too cumbersome at this point (including the use of compiler optimization levels), but the ideas posted here were very useful and appreciated.
struct sched_param sched_param = {
sched_get_priority_min(SCHED_BATCH)
};
int set_thread_to_core(const long tid, const int &core_id) {
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(core_id, &mask);
return pthread_setaffinity_np(tid, sizeof(mask), &mask);
}
void *worker_thread(void *arg) {
job_data *temp = (job_data *)arg; // get the information for the task passed in
...
long tid = pthread_self();
int set_thread = set_thread_to_core(tid, slot_id); // assume slot_id is 1 (it is in the test case I run)
sched_get_priority_min(SCHED_BATCH);
pthread_setschedparam(tid, SCHED_BATCH, &sched_param);
int success = process_job(...); // this is where all the work actually happens
pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
...
pthread_t temp;
pthread_create(&temp, NULL, worker_thread, (void *) &jobs[i]); // jobs is a vector of a class type containing information for the task
...
return 0;
}

If you have plenty of CPU cores, and have plenty of work to do, it should not take longer to run in multithreaded than single threaded mode - the actual CPU time may be a fraction longer, but the "wall-clock time" should be shorter. I'm pretty sure that your code has some sort of bottleneck where one thread is blocking the other.
This is because of one or more of these things - I'll list them first, then go into detail below:
Some lock in a thread is blocking the second thread from running.
Sharing of data between threads (either true or "false" sharing)
Cache thrashing.
Competition for some external resource causing thrashing and/or blocking.
Badly designed code in general...
Some lock in a thread is blocking the second thread from running.
If there is a thread that takes a lock, and another thread wants to use the resource that is locked by this thread, it will have to wait. This obviously means the thread isn't doing anything useful. Locks should be kept to a minimum by only taking the lock for a short period. Using some code to identify if locks are holding your code, such as:
while (!tryLock(some_some_lock))
{
tried_locking_failed[lock_id][thread_id]++;
}
total_locks[some_lock]++;
Printing some stats of the locks would help to identify where the locking is contentious - or you can try the old trick of "Press break in the debugger and see where you are" - if a thread is constantly waiting for some lock, then that's what's preventing progress...
Sharing of data between threads (either true or "false" sharing)
If two threads use [and update the value of it frequently] the same variable, then the two threads will have to swap "I've updated this" messages, and the CPU's have to fetch the data from the other CPU before it can continue with it's use of the variable. Since "data" is shared on a "per cache-line" level, and a cache-line is typically 32-bytes, something like:
int var[NUM_THREADS];
...
var[thread_id]++;
would classify as something called "false sharing" - the ACTUAL data updated is unique per CPU, but since the data is within the same 32-byte region, the cores will still have updated the same are of memory.
Cache thrashing.
If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don't run in "lockstep" on which part of cache the CPU uses. If the data is 2^n (power of two) and fairly large (a multiple of the cache-size), it's a good idea to "add an offset" for each thread - for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.
Competition for some external resource causing thrashing and/or blocking.
If two threads are reading or writing from/to the hard-disk, network card, or some other shared resource, this can lead to one thread blocking another thread, which in turn means lower performance. It is also possible that the code detects different threads and does some extra flushing to ensure that data is written in the correct order or similar, before starting work with the other thread.
It is also possible that there are locks internally in the code that deals with the resource (user-mode library or kernel mode drivers) that block when more than one thread is using the same resource.
Generally bad design
This is a "catchall" for "lots of other things that can be wrong". If the result from one calculation in one thread is needed to progress the other, obviously, not a lot of work can be done in that thread.
Too small a work-unit, so all the time is spent starting and stopping the thread, and not enough work is being done. Say for example that you dole out small numbers to be "calculate if this is a prime" to each thread, one number at a time, it will probably take a lot longer to give the number to the thread than the calculation of "is this actually a prime-number" - the solution is to give a set of numbers (perhaps 10, 20, 32, 64 or such) to each thread, and then report back the result for the whole lot in one go.
There are plenty of other "bad design". Without understanding your code it's quite hard to say for sure.
It is entirely possible that your problem is none of the ones I've mentioned here, but most likely it is one of these. Hopefully this asnwer is helpful to identify the cause.

Read CPU Caches and Why You Care to understand why a naive port of an algorithm from one thread to multiple threads will more often than not result in greatly reduced performance and negative scalability. Algorithms that are specififcally designed for parallelism take care of overactive interlocked operations, false sharing and other causes of cache pollution.

Here are a few things you might wanna look into.
1°) Do you enter any critical section (locks, semaphores, etc.) between your worker thread and your main thread? (this should be the case if your queries modify the graph). If so, that could be one of the sources of the multithreading overhead : threads competing for a lock usually degrades performances.
2°) You're using a 24 cores machines, which I assume would be NUMA (Non-Uniform Memory Access). Since you set the threads affinities during your tests, you should pay close attention to the memory topology of your hardware. Looking at the files in /sys/devices/system/cpu/cpuX/ can help you with that (beware that cpu0 and cpu1 aren't necessarily close together, and thus does not necessarily share memory). Threads heavily using memory should use local memory (allocated in the same NUMA node as the core they're executing on).
3°) You are heavily using disk I/O. Which kind of I/O is that? if every thread perform every time some synchronous I/O, you might wanna consider asynchronous system calls, so that the OS stays in charge of scheduling those requests to the disk.
4°) Some caches issues have already been mentionned in other answers. From experience, false sharing can hurt performances as much as you're observing. My last recommendation (which should have been my first) is to use a profiler tool, such as Linux Perf, or OProfile. With such performance degradation you're experiencing, the cause will certainly appear quite clearly.

The other answers have all addressed the general guidelines that can cause your symptoms. I will give my own, hopefully not excessively redundant version. Then I will talk a bit about how you can get to the bottom of the problem with everything discussed in mind.
In general, there's a few reasons you'd expect multiple threads to perform better:
A piece of work is dependent on some resources (disk, memory, cache, etc.) while other pieces can proceed independently of these resources or said workload.
You have multiple CPU cores that can process your workload in parallel.
The main reasons, enumerated above, you'd expect multiple threads to perform less well are all based on resource contention:
Disk contention: already explained in detail and can be a possible issue, especially if you are writing small buffers at a time instead of batching
CPU time contention if the threads are scheduled onto the same core: probably not your issue if you're setting affinity. However, you should still double check
Cache thrashing: similarly probably not your problem if you have affinity, though this can be very expensive if it is your problem.
Shared memory: again talked about in detail and doesn't seem to be your issue, but it wouldn't hurt to audit the code to check it out.
NUMA: again talked about. If your worker thread is pinned to a different core, you will want to check whether the work it needs to access is local to the main core.
Ok so far not much new. It can be any or none of the above. The question is, for your case, how can you detect where the extra time is coming from. There's a few strategies:
Audit the code and look for obvious areas. Don't spend too much time doing this as it's generally unfruitful if you wrote the program to begin with.
Refactor the single threaded code and the multi-threaded code to isolate one process() function, then profile at key checkpoints to try to account for the difference. Then narrow it down.
Refactor the resource access into batches, then profile each batch on both the control and the experiment to account for the difference. Not only will this tell you which areas (disk access vs memory access vs spending time in some tight loop) you need to focus your efforts on, doing this refactor might even improve your running time overall. Example:
First copy the graph structure to thread-local memory (perform a straight-up copy in the single-threaded case)
Then perform the query
Then setup an asynchronous write to disk
Try to find a minimally reproducible workload with the same symptoms. This means changing your algorithm to do a subset of what it already does.
Make sure there's no other noise in the system that could've caused the difference (if some other user is running a similar system on the work core).
My own intuition for your case:
Your graph structure is not NUMA friendly for your worker core.
The kernel can actually scheduled your worker thread off the affinity core. This can happen if you don't have isolcpu on for the core you're pinning to.

I can't tell you what's wrong with your program because you haven't shared enough of it to do a detailed analysis.
What I can tell you is if this was my problem the first thing I would try is to run two profiler sessions on my application, one on the single threaded version and another on the dual thread configuration. The profiler report should give you a pretty good idea of where the extra time is going. Note that you may not need to profile the entire application run, depending on the problem the time difference may become obvious after you profile for a few seconds or minutes.
As far as profiler choices for Linux you may want to consider oprofile or as a second choice gprof.
If you find you need help interpreting the profiler output feel free to add that to your question.

It can be a right pain in the rear to track down why threads aren't working as planned. One can do so analytically, or one can use tool to show what's going on. I've had very good mileage out of ftrace, Linux's clone of Solaris's dtrace (which in turn is based on what VxWorks, Greenhill's Integrity OS and Mercury Computer Systems Inc have been doing for a looong time.)
In particular I found this page very useful: http://www.omappedia.com/wiki/Installing_and_Using_Ftrace, particularly this and this section. Don't worry about it being an OMAP orientated website; I've used it on X86 Linuxes just fine (though you may have to build a kernel to include it). Also remember that the GTKWave viewer is primarily intended for looking at log traces from VHDL developments, which is why it looks 'odd'. It's just that someone realised that it would be a usable viewer for sched_switch data too, and that saved them writing one.
Using the sched_switch tracer you can see when (but not necessarily why) your threads are running, and that might be enough to give you a clue. The 'why' can be revealed by careful examination of some of the other tracers.

If you are getting slowdown from using 1 thread, it is likely due to overhead from using thread safe library functions, or from thread setup. Creating a thread for each job will cause significant overhead, but probably not as much as you refer to.
In other words, it is probably some overhead from some thread safe library function.
The best thing to do, is to profile your code to find out where time is spent. If it is in a library call, try to find a replacement library or implement it yourself. If the bottleneck is thread creation/destruction try reusing threads, for instance using OpenMP tasks or std::async in C++11.
Some libraries are really nasty wrt thread safe overhead. For instance, many rand() implementations use a global lock, rather than using thread local prgn's. Such locking overhead is much larger than generating a number, and is hard to track without a profiler.
The slowdown could also stem from small changes you have made, for instance declaring variables volatile, which generally should not be necessary.

I suspect you're running on a machine with one single-core processor. This problem is not parallelizable on that kind of system. Your code is constantly using the processor, which has a fixed number of cycles to offer to it. It actually runs more slowly because the additional thread adds expensive context switching to the problem.
The only kinds of problems that parallelize well on a single-processor machine are those that allow one path of execution to run while another is blocked waiting for I/O, and situations (such as keeping a responsive GUI) where allowing one thread to get some processor time is more important than executing your code as quickly as possible.

If you only want to run many independent instances of your algorithm can you just submit multiple jobs (with different parameters, can be handled by a single script) to your cluster? That would eliminate the need to profile and debug your multithreaded program. I don't have much experience with multithreaded programming but if you use MPI or OpenMP then you'd have to write less code for the book keeping too. For example, if some common initialization routine is needed and the processes can run independently thereafter you can just do that by initializing in one thread and doing a broadcast. No need for maintaining locks and such.

c++ Distributed computing of an executable program

I was wondering if it is possible to run an executable program without adding to its source code, like running any game across several computers. When i was programming in c# i noticed a process method, which lets you summon or close any application or process, i was wondering if there was something similar with c++ which would let me transfer the processes of any executable file or game to other computers or servers minimizing my computer's processor consumption.
thanks.

Everything is possible, but this would require a huge amount of work and would almost for sure make your program painfully slower (I'm talking about a factor of millions or billions here). Essentially you would need to make sure every layer that is used in the program allows this. So you'd have to rewrite the OS to be able to do this, but also quite a few of the libraries it uses.
Why? Let's assume you want to distribute actual threads over different machines. It would be slightly more easy if it were actual processes, but I'd be surprised many applications work like this.
To begin with, you need to synchronize the memory, more specifically all non-thread-local storage, which often means 'all memory' because not all language have a thread-aware memory model. Of course, this can be optimized, for example buffer everything until you encounter an 'atomic' read or write, if of course your system has such a concept. Now can you imagine every thread blocking for synchronization a few seconds whenever a thread has to be locked/unlocked or an atomic variable has to be read/written?
Next to that there are the issues related to managing devices. Assume you need a network connection: which device will start this, how will the ip be chosen, ...? To seamlessly solve this you probably need a virtual device shared amongst all platforms. This has to happen for network devices, filesystems, printers, monitors, ... . And as you kindly mention games: this should happen for a GPU as well, just imagine how this would impact performance in only sending data from/to the GPU (hint: even 16xpci-e is often already a bottleneck).
In conclusion: this is not feasible, if you want a clustered application, you have to build it into the application from scratch.

I believe the closest thing you can do is MapReduce: it's a paradigm which hopefully will be a part of the official boost library soon. However, I don't think that you would want to apply it to a real-time application like a game.
A related question may provide more answers: https://stackoverflow.com/questions/2168558/is-there-anything-like-hadoop-in-c
But as KillianDS pointed out, there is no automagical way to do this, nor does it seem like is there a feasible way to do it. So what is the exact problem that you're trying to solve?

The current state of research is into practical means to distribute the work of a process across multiple CPU cores on a single computer. In that case, these processors still share RAM. This is essential: RAM latencies are measured in nanoseconds.
In distributed computing, remote memory access can take tens if not hundreds of microseconds. Distributed algorithms explicitly take this into account. No amount of magic can make this disappear: light itself is slow.

The Plan 9 OS from AT&T Bell Labs supports distributed computing in the most seamless and transparent manner. Plan 9 was designed to take the Unix ideas of breaking jobs into interoperating small tasks, performed by highly specialised utilities, and "everything is a file", as well as the client/server model, to a whole new level. It has the idea of a CPU server which performs computations for less powerful networked clients. Unfortunately the idea was too ambitious and way beyond its time and Plan 9 remained largerly a research project. It is still being developed as open source software though.
MOSIX is another distributed OS project that provides a single process space over multiple machines and supports transparent process migration. It allows processes to become migratable without any changes to their source code as all context saving and restoration are done by the OS kernel. There are several implementations of the MOSIX model - MOSIX2, openMosix (discontinued since 2008) and LinuxPMI (continuation of the openMosix project).
ScaleMP is yet another commercial Single System Image (SSI) implementation, mainly targeted towards data processing and Hight Performance Computing. It not only provides transparent migration between the nodes of a cluster but also provides emulated shared memory (known as Distributed Shared Memory). Basically it transforms a bunch of computers, connected via very fast network, into a single big NUMA machine with many CPUs and huge amount of memory.
None of these would allow you to launch a game on your PC and have it transparently migrated and executed somewhere on the network. Besides most games are GPU intensive and not so much CPU intensive - most games are still not even utilising the full computing power of multicore CPUs. We have a ScaleMP cluster here and it doesn't run Quake very well...

shared memory, MPI and queuing systems

My unix/windows C++ app is already parallelized using MPI: the job is splitted in N cpus and each chunk is executed in parallel, quite efficient, very good speed scaling, the job is done right.
But some of the data is repeated in each process, and for technical reasons this data cannot be easily splitted over MPI (...).
For example:
5 Gb of static data, exact same thing loaded for each process
4 Gb of data that can be distributed in MPI, the more CPUs are used, smaller this per-CPU RAM is.
On a 4 CPU job, this would mean at least a 20Gb RAM load, most of memory 'wasted', this is awful.
I'm thinking using shared memory to reduce the overall load, the "static" chunk would be loaded only once per computer.
So, main question is:
Is there any standard MPI way to share memory on a node? Some kind of readily available + free library ?
If not, I would use boost.interprocess and use MPI calls to distribute local shared memory identifiers.
The shared-memory would be read by a "local master" on each node, and shared read-only. No need for any kind of semaphore/synchronization, because it wont change.
Any performance hit or particular issues to be wary of?
(There wont be any "strings" or overly weird data structures, everything can be brought down to arrays and structure pointers)
The job will be executed in a PBS (or SGE) queuing system, in the case of a process unclean exit, I wonder if those will cleanup the node-specific shared memory.

One increasingly common approach in High Performance Computing (HPC) is hybrid MPI/OpenMP programs. I.e. you have N MPI processes, and each MPI process has M threads. This approach maps well to clusters consisting of shared memory multiprocessor nodes.
Changing to such a hierarchical parallelization scheme obviously requires some more or less invasive changes, OTOH if done properly it can increase the performance and scalability of the code in addition to reducing memory consumption for replicated data.
Depending on the MPI implementation, you may or may not be able to make MPI calls from all threads. This is specified by the required and provided arguments to the MPI_Init_Thread() function that you must call instead of MPI_Init(). Possible values are
{ MPI_THREAD_SINGLE}
Only one thread will execute.
{ MPI_THREAD_FUNNELED}
The process may be multi-threaded, but only the main thread will make MPI calls (all MPI calls are ``funneled'' to the main thread).
{ MPI_THREAD_SERIALIZED}
The process may be multi-threaded, and multiple threads may make MPI calls, but only one at a time: MPI calls are not made concurrently from two distinct threads (all MPI calls are ``serialized'').
{ MPI_THREAD_MULTIPLE}
Multiple threads may call MPI, with no restrictions.
In my experience, modern MPI implementations like Open MPI support the most flexible MPI_THREAD_MULTIPLE. If you use older MPI libraries, or some specialized architecture, you might be worse off.
Of course, you don't need to do your threading with OpenMP, that's just the most popular option in HPC. You could use e.g. the Boost threads library, the Intel TBB library, or straight pthreads or windows threads for that matter.

I haven't worked with MPI, but if it's like other IPC libraries I've seen that hide whether other threads/processes/whatever are on the same or different machines, then it won't be able to guarantee shared memory. Yes, it could handle shared memory between two nodes on the same machine, if that machine provided shared memory itself. But trying to share memory between nodes on different machines would be very difficult at best, due to the complex coherency issues raised. I'd expect it to simply be unimplemented.
In all practicality, if you need to share memory between nodes, your best bet is to do that outside MPI. i don't think you need to use boost.interprocess-style shared memory, since you aren't describing a situation where the different nodes are making fine-grained changes to the shared memory; it's either read-only or partitioned.
John's and deus's answers cover how to map in a file, which is definitely what you want to do for the 5 Gb (gigabit?) static data. The per-CPU data sounds like the same thing, and you just need to send a message to each node telling it what part of the file it should grab. The OS should take care of mapping virtual memory to physical memory to the files.
As for cleanup... I would assume it doesn't do any cleanup of shared memory, but mmaped files should be cleaned up since files are closed (which should release their memory mappings) when a process is cleaned up. I have no idea what caveats CreateFileMapping etc. have.
Actual "shared memory" (i.e. boost.interprocess) is not cleaned up when a process dies. If possible, I'd recommend trying killing a process and seeing what is left behind.

With MPI-2 you have RMA (remote memory access) via functions such as MPI_Put and MPI_Get. Using these features, if your MPI installation supports them, would certainly help you reduce the total memory consumption of your program. The cost is added complexity in coding but that's part of the fun of parallel programming. Then again, it does keep you in the domain of MPI.

MPI-3 offers shared memory windows (see e.g. MPI_Win_allocate_shared()), which allows usage of on-node shared memory without any additional dependencies.

I don't know much about unix, and I don't know what MPI is. But in Windows, what you are describing is an exact match for a file mapping object.
If this data is imbedded in your .EXE or a .DLL that it loads, then it will automatically be shared between all processes. Teardown of your process, even as a result of a crash will not cause any leaks or unreleased locks of your data. however a 9Gb .dll sounds a bit iffy. So this probably doesn't work for you.
However, you could put your data into a file, then CreateFileMapping and MapViewOfFile on it. The mapping can be readonly, and you can map all or part of the file into memory. All processes will share pages that are mapped the same underlying CreateFileMapping object. it's good practice to close unmap views and close handles, but if you don't the OS will do it for you on teardown.
Note that unless you are running x64, you won't be able to map a 5Gb file into a single view (or even a 2Gb file, 1Gb might work). But given that you are talking about having this already working, I'm guessing that you are already x64 only.

If you store your static data in a file, you can use mmap on unix to get random access to the data. Data will be paged in as and when you need access to a particular bit of the data. All that you will need to do is overlay any binary structures over the file data. This is the unix equivalent of CreateFileMapping and MapViewOfFile mentioned above.
Incidentally glibc uses mmap when one calls malloc to request more than a page of data.

I had some projects with MPI in SHUT.
As i know , there are many ways to distribute a problem using MPI, maybe you can find another solution that does not required share memory,
my project was solving an 7,000,000 equation and 7,000,000 variable
if you can explain your problem,i would try to help you

I ran into this problem in the small when I used MPI a few years ago.
I am not certain that the SGE understands memory mapped files. If you are distributing against a beowulf cluster, I suspect you're going to have coherency issues. Could you discuss a little about your multiprocessor architecture?
My draft approach would be to set up an architecture where each part of the data is owned by a defined CPU. There would be two threads: one thread being an MPI two-way talker and one thread for computing the result. Note that MPI and threads don't always play well together.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js