How to set number of threads in C++ - c++

I have written the following multi-threaded program for sorting with std::sort. In my program, grainSize is a parameter. Since grainSize (or the number of threads that can be spawned) is a system-dependent feature, I cannot work out what the optimal value to set grainSize to should be. I work on Linux.
int compare(const char*,const char*)
{
    // some complex user-defined logic
}

void multThreadedSort(vector<unsigned>::iterator data, int len, int grainsize)
{
    if(len < grainsize)
    {
        std::sort(data, data + len, compare);
    }
    else
    {
        auto future = std::async(multThreadedSort, data, len/2, grainsize);
        multThreadedSort(data + len/2, len/2, grainsize); // No need to spawn another thread just to block the calling thread which would do nothing.
        future.wait();
        std::inplace_merge(data, data + len/2, data + len, compare);
    }
}

int main(int argc, char** argv) {
    vector<unsigned> items;
    int grainSize = 10;
    multThreadedSort(items.begin(), items.size(), grainSize);
    std::sort(items.begin(), items.end(), CompareSorter(compare));
    return 0;
}
I need to perform multi-threaded sorting, so that for sorting large vectors I can take advantage of the multiple cores present in today's processors. If anyone is aware of an efficient algorithm, please do share.
I don't know why the result produced by multThreadedSort() is not sorted. If you see some logical error in it, please let me know.

This gives you the optimal number of threads (such as the number of cores):
unsigned int nThreads = std::thread::hardware_concurrency();
As you wrote it, your effective thread count is not equal to grainSize: it depends on the list size, and can potentially be much larger than grainSize.
Just replace grainSize with:
unsigned int grainSize = std::max<size_t>(items.size()/nThreads, 40);
The 40 is arbitrary, but it is there to avoid starting threads to sort too few items, which would be suboptimal (the time spent starting the thread would be larger than the time spent sorting the few items). It may be tuned by trial and error, and is potentially larger than 40.
You have at least a bug there:
multThreadedSort(data + len/2, len/2, grainsize);
If len is odd (for instance 9), you do not include the last item in the sort. Replace by:
multThreadedSort(data + len/2, len-(len/2), grainsize);
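Putting both fixes together, here is a minimal compilable sketch of the whole approach. The comparator is a hypothetical stand-in, since a compare(const char*, const char*) cannot be used to sort a vector<unsigned>, and std::launch::async is my choice to force real concurrency (the original call left the launch policy to the implementation):
#include <algorithm>
#include <future>
#include <thread>
#include <vector>

// hypothetical comparator for unsigned elements
bool compareUnsigned(unsigned a, unsigned b) { return a < b; }

void multThreadedSort(std::vector<unsigned>::iterator data, int len, int grainsize)
{
    if (len < grainsize)
    {
        std::sort(data, data + len, compareUnsigned);
    }
    else
    {
        // sort the first half asynchronously, the second half on this thread
        auto future = std::async(std::launch::async, multThreadedSort, data, len/2, grainsize);
        multThreadedSort(data + len/2, len - len/2, grainsize); // len-(len/2) handles odd len
        future.wait();
        std::inplace_merge(data, data + len/2, data + len, compareUnsigned);
    }
}

int main()
{
    std::vector<unsigned> items = {5, 3, 9, 1, 7, 2, 8, 4, 6};
    unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    int grainSize = std::max<int>(static_cast<int>(items.size()) / nThreads, 40);
    multThreadedSort(items.begin(), static_cast<int>(items.size()), grainSize);
    return 0;
}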

Unless you use a compiler with a totally broken implementation (broken is the wrong word, a better match would be... shitty), several invocations of std::future should already do the job for you, without having to worry.
Note that std::future is something that conceptually runs asynchronously, i.e. it may spawn another thread to execute concurrently. May, not must, mind you.
This means that it is perfectly "legitimate" for an implementation to simply spawn one thread per future, and it is also legitimate to never spawn any threads at all and simply execute the task inside wait().
In practice, sane implementations avoid spawning threads on demand and instead use a threadpool where the number of workers is set to something reasonable according to the system the code runs on.
Note that trying to optimize threading with std::thread::hardware_concurrency() does not really help you because the wording of that function is too loose to be useful. It is perfectly allowable for an implementation to return zero, or a more or less arbitrary "best guess", and there is no mechanism for you to detect whether the returned value is a genuine one or a bullshit value.
There also is no way of discriminating hyperthreaded cores, or any such thing as NUMA awareness, or anything the like. Thus, even if you assume that the number is correct, it is still not very meaningful at all.
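If you do call it anyway, at least guard against the zero case; a minimal sketch:
#include <thread>

// Fall back to an arbitrary guess when the implementation reports 0 ("unknown"),
// which the standard explicitly permits.
unsigned threadCountGuess() {
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : 4; // 4 is an arbitrary fallback, not a recommendation
}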
On a more general note
The problem "What is the correct number of threads" is hard to solve, if there is a good universal answer at all (I believe there is not). A couple of things to consider:
Work groups of 10 are certainly way, way too small. Spawning a thread is an immensely expensive thing (yes, contrary to popular belief, that's true for Linux, too) and switching or synchronizing threads is expensive as well. Try "tens of thousands", not "tens".
Hyperthreaded cores only execute while the other core in the same group is stalled, most commonly on memory I/O (or, when spinning, by the explicit execution of an instruction such as e.g. REP-NOP on Intel). If you do not have a significant number of memory stalls, extra threads running on hyperthreaded cores will only add context switches, but will not run any faster. For something like sorting (which is all about accessing memory!), you're probably good to go as far as that one goes.
Memory bandwidth is usually saturated by one, sometimes 2 cores, rarely more (depends on the actual hardware). Throwing 8 or 12 threads at the problem will usually not increase memory bandwidth but will heighten pressure on shared cache levels (such as L3 if present, and often L2 as well) and the system page manager. For the particular case of sorting (very incoherent access, lots of stalls), the opposite may be the case. May, but needs not be.
Due to the above, for the general case "number of real cores" or "number of real cores + 1" is often a much better recommendation.
Accessing huge amounts of data with poor locality like with your approach will (single-threaded or multi-threaded) result in a lot of cache/TLB misses and possibly even page faults. That may not only undo any gains from thread parallelism, but it may indeed execute 4-5 orders of magnitude slower. Just think about what a page fault costs you. During a single page fault, you could have sorted a million elements.
Contrary to the above "real cores plus 1" general rule, for tasks that involve network or disk I/O which may block for a long time, even "twice the number of cores" may as well be the best match. So... there is really no single "correct" rule.
What's the conclusion of the somewhat self-contradicting points above? After you've implemented it, be sure to benchmark whether it really runs faster, because this is by no means guaranteed to be the case. And unluckily, there's no way of knowing with certitude what's best without having measured.
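For example, a minimal timing harness (assuming the multThreadedSort and the vector from the question are in scope) could be as simple as:
#include <chrono>
#include <iostream>
#include <vector>

// time one call and print milliseconds; run it for both the parallel version
// and plain std::sort, and compare
void benchmarkSort(std::vector<unsigned>& items, int grainSize) {
    auto t0 = std::chrono::steady_clock::now();
    multThreadedSort(items.begin(), static_cast<int>(items.size()), grainSize);
    auto t1 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms\n";
}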
As another thing, consider that sorting is by no means trivial to parallelize. You are already using std::inplace_merge so you seem to be aware that it's not just "split subranges and sort them".
But think about it, what exactly does your approach really do? You are subdividing (recursively descending) up to some depth, then sorting the subranges concurrently, and merging -- which means overwriting. Then you are sorting (recursively ascending) larger ranges and merging them, until the whole range is sorted. Classic fork-join.
That means you touch some part of memory to sort it (in a pattern which is not cache-friendly), then touch it again to merge it. Then you touch it yet again to sort the larger range, and you touch it yet another time to merge that larger range. With any "luck", different threads will be accessing the memory locations at different times, so you'll have false sharing.
Also, if your understanding of "large data" is the same as mine, this means you are overwriting every memory location between 20 and 30 times, possibly more often. That's a lot of traffic.
So much memory being read and written to repeatedly, over and over again, and the main bottleneck is memory bandwidth. See where I'm going? Fork-join looks like an ingenious thing, and in academics it probably is... but it isn't certain at all that this runs any faster on a real machine (it might quite possibly be many times slower).

Ideally, you should not assume more than n*2 threads running in your system, where n is the number of CPU cores.
Modern OSes use the concept of hyper-threading, so one CPU core can run 2 threads at a time.
As mentioned in another answer, in C++11 you can get the optimal number of threads using std::thread::hardware_concurrency();

Related

Missing small primes in C++ atomic prime sieve

I try to develop a concurrent prime sieve implementation using C++ atomics. However, when core_count is increased, more and more small primes are missing from the output.
My guess is that the producer threads overwrite each other's results before they are read by the consumer, even though the construction should protect against that by using the magic number 0 to indicate it's ready to accept the next prime. It seems compare_exchange_weak is not really atomic in this case.
Things I've tried:
Replacing compare_exchange_weak with compare_exchange_strong
Changing the memory_order to anything else.
Swapping around the 'crossing-out' and the write.
I have tested it with Microsoft Visual Studio 2019, Clang 12.0.1 and GCC 11.1.0, but to no avail.
Any ideas on this are welcome, including some best practices I might have missed.
#include <algorithm>
#include <atomic>
#include <future>
#include <iostream>
#include <iterator>
#include <thread>
#include <vector>

int main() {
    using namespace std;
    constexpr memory_order order = memory_order_relaxed;
    atomic<int> output{0};
    vector<atomic_bool> sieve(10000);
    for (auto& each : sieve) atomic_init(&each, false);
    atomic<unsigned> finished_worker_count{0};

    auto const worker = [&output, &sieve, &finished_worker_count]() {
        for (auto current = next(sieve.begin(), 2); current != sieve.end();) {
            current = find_if(current, sieve.end(), [](atomic_bool& value) {
                bool untrue = false;
                return value.compare_exchange_strong(untrue, true, order);
            });
            if (current == sieve.end()) break;
            int const claimed = static_cast<int>(distance(sieve.begin(), current));
            int zero = 0;
            while (!output.compare_exchange_weak(zero, claimed, order))
                ;
            for (auto product = 2 * claimed; product < static_cast<int>(sieve.size());
                 product += claimed)
                sieve[product].store(true, order);
        }
        finished_worker_count.fetch_add(1, order);
    };

    const auto core_count = thread::hardware_concurrency();
    vector<future<void>> futures;
    futures.reserve(core_count);
    generate_n(back_inserter(futures), core_count,
               [&worker]() { return async(worker); });

    vector<int> result;
    while (finished_worker_count < core_count) {
        auto current = output.exchange(0, order);
        if (current > 0) result.push_back(current);
    }
    sort(result.begin(), result.end());
    for (auto each : result) cout << each << " ";
    cout << '\n';
    return 0;
}
compare_exchange_weak will update (change) the "expected" value (the local variable zero) if the update cannot be made. This will allow overwriting one prime number with another if the main thread doesn't quickly handle the first prime.
You'll want to reset zero back to zero before rechecking:
while (!output.compare_exchange_weak(zero, claimed, order))
    zero = 0;
Even with correctness bugs fixed, I think this approach is going to be low performance with multiple threads writing to the same cache lines.
As 1201ProgramAlarm points out in their answer, the CAS needs its expected value reset on retry, but even with that fixed I wouldn't expect good performance! Having multiple threads storing to the same cache lines will create big stalls. I'd normally write it as follows, so you only need to write zero = 0 once in the source, but it still happens before every CAS.
do {
    zero = 0;
} while (!output.compare_exchange_weak(zero, claimed, order));
Caleth pointed out in comments that it's also Undefined Behaviour for a predicate to modify the element (like in your find_if). That's almost certainly not a problem in practice in this case; find_if is just written in C++ in a header (in mainstream implementations) and likely in a way that there isn't actually any UB in the resulting program.
And it would be straightforward to replace the find_if with a loop. In fact, that would probably make the code more readable, since you can just use array indexing the whole time instead of iterators; let the compiler optimize that to a pointer, and then pointer subtraction, if it wants.
Scan read-only until you find a candidate to try to claim, don't try to atomic-RMW every true element until you get to a false one. Especially on x86-64, lock cmpxchg is way slower than read-only access to a few contiguous bytes. It's a full memory barrier; there's no way to do an atomic RMW on x86 that isn't seq_cst.
You might still lose the race, so you do still need to try to claim it with an RMW and keep looping on failure. And CAS is a good choice for that.
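A hedged sketch of that idea, using plain indices instead of iterators (claim_next is a hypothetical helper, not part of your code):
#include <atomic>
#include <vector>

// Scan read-only until a not-yet-claimed slot is found, then try to claim it
// with a CAS; keep scanning if another thread wins the race.
int claim_next(std::vector<std::atomic_bool>& sieve, int start, std::memory_order order) {
    for (int i = start; i < static_cast<int>(sieve.size()); ++i) {
        if (sieve[i].load(order))      // cheap read-only check, no lock cmpxchg
            continue;
        bool untrue = false;
        if (sieve[i].compare_exchange_strong(untrue, true, order))
            return i;                  // we won the race for this index
        // lost the race: another thread claimed it, keep scanning
    }
    return -1;                         // nothing left to claim
}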
Correctness seems plausible with this strategy, but I'd avoid it for performance reasons.
Multiple threads storing to the array will cause contention
Expect cache lines to be bouncing around between cores, with most RFOs (MESI Read For Ownership) having to wait to get the data for a cache line from another core that had it in Modified state. A core can't modify a cache line until after it gets exclusive ownership of that cache line. (Usually 64 bytes on modern systems.)
Your sieve size is only 10k bools, so 10 kB, comfortably fitting into L1d cache on modern CPUs. So a single-threaded implementation would get all L1d cache hits when looping over it (in the same thread that just initialized it all to zero).
But with other threads writing the array, at best you'll get hits in L3 cache. But since the sieve size is small, other threads won't be evicting their copies from their own L1d caches, so the RFO (read for ownership) from a core that wants to write will typically find that some other core has it Modified, so the L3 cache (or other tag directory) will have to send on a request to that core to write back or transfer directly. (Intel CPUs from Nehalem onwards use Inclusive L3 cache where the tags also keep track of which cores have the data. They changed that for server chips with Skylake and later, but client CPUs still I think have inclusive L3 cache where the tags also work as a snoop filter / coherence directory.)
With 1 whole byte per bool, and not even factoring out multiples of 2 from your sieve, crossing off multiples of a prime is very high bandwidth. For primes between 32 and 64, you touch every cache line 1 to 2 times. (And you only start at prime*2, not prime*prime, so even for large strides, you still start very near the bottom of the array and touch most of it.)
A single-threaded sieve can use most of L3 cache bandwidth, or saturate DRAM, on a large sieve, even using a bitmap instead of 1 bool per byte. (I made some benchmarks of a hand-written x86 asm version that used a bitmap version in comments on a Codereview Q&A; https://godbolt.org/z/nh39TWxxb has perf stat results in comments on a Skylake i7-6700k with DDR4-2666. My implementation also has some algorithmic improvements, like not storing the even numbers, and starting the crossing off at i*i).
Although to be fair, L3 bandwidth scales with number of cores, especially if different pairs are bouncing data between each other, like A reading lines recently written by B, and B reading lines recently written by C. Unlike with DRAM where the shared bus is the bottleneck, unless per-core bandwidth limits are lower. (Modern server chips need multiple cores to saturate their DRAM controllers, but modern client chips can nearly max out DRAM with one thread active).
You'd have to benchmark to see whether all the threads pile up in a bad way or not, i.e. whether they tend to end up close together, or whether one with a larger stride can pull ahead and get enough distance that its write-prefetches don't create extra contention.
The cache-miss delays in committing the store to cache can be hidden some by the store buffer and out-of-order exec (especially since it's relaxed, not seq_cst), but it's still not good.
(Using a bitmap with 8 bools per byte would require atomic RMWs for this threading strategy, which would be a performance disaster. If you're going to thread this way, 1 bool per byte is by far the least bad.)
At least if you aren't reading part of the array that's still being written, you might not be getting memory-order mis-speculation on x86. (x86's memory model disallows LoadStore and LoadLoad reordering, but real implementations speculatively load early, and have to roll back if the value they loaded has been invalidated by the time the load is architecturally allowed to happen.)
Better strategy: each thread owns a chunk of the sieve
Probably much better would be segmenting regions and handing out lists of primes to cross off, with each thread marking off multiples of primes in its own region of the output array. So each cache line of the sieve is only touched by one thread, and each thread only touches a subset of the sieve array. (A good chunk size would be half to 3/4 of the L1d or L2 cache size of a core.)
You might start with a small single-threaded sieve, or a fixed list of the first 10 or 20 primes to get the threads started, and have the thread that owns the starting chunk generate more primes. Perhaps appending them to an array of primes, and updating a valid-index (with a release store so readers can keep reading in that array up to that point, then spin-wait or use C++20 .wait() for a value other than what they last saw. But .wait would need a .notify in the writer, like maybe every 10 primes?)
If you want to move on in a larger sieve, divide up the next chunk of the full array between threads and have them each cross off the known primes from the first part of the array. No thread has to wait for any other, the first set of work already contains all the primes that need to be crossed off from an equal-sized chunk of the sieve.
Probably you don't want an actual array of atomic_int; probably all threads should be scanning the sieve to find not-crossed-off positions. Especially if you can do that efficiently with SIMD, or with bit-scan like tzcnt if you use packed bitmaps for this.
(I assume there are some clever algorithmic ideas for segmented sieves; this is just what I came up with off the top of my head.)
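For illustration only, a rough sketch of what the per-chunk crossing-off could look like (hypothetical names; each thread would call this on its own thread-private segment, so no atomics are needed inside the segment):
#include <algorithm>
#include <cstdint>
#include <vector>

// Mark composites in segment, which represents the numbers [low, low + segment.size()),
// using a list of primes already found by the single-threaded prefix.
void sieve_segment(std::vector<char>& segment, int64_t low,
                   const std::vector<int64_t>& primes) {
    const int64_t high = low + static_cast<int64_t>(segment.size());
    for (int64_t p : primes) {
        int64_t first = std::max(p * p, (low + p - 1) / p * p); // first multiple of p in range
        for (int64_t m = first; m < high; m += p)
            segment[m - low] = 1;                               // plain byte store, no contention
    }
}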

OpenMP first kernel much slower than the second kernel

I have a huge 98306 by 98306 2D array initialized. I created a kernel function that counts the total number of elements below a certain threshold.
#pragma omp parallel for reduction(+:num_below_threshold)
for (row)
    for (col)
        index = get_corresponding_index(row, col);
        if (array[index] < threshold)
            num_below_threshold++;
For benchmarking purposes I measured the execution time of the kernel with the number of threads set to 1. I noticed that the first time the kernel executes it takes around 11 seconds. The next call to the kernel, executing on the same array with one thread, only took around 3 seconds. I thought it might be a problem related to the cache, but it doesn't seem to be. What are the possible reasons that caused this?
This array is initialized as:
float *array = malloc(sizeof(float) * 98306 * 98306);
for (int i = 0; i < 98306 * 98306; i++) {
    array[i] = rand() % 10;
}
This same kernel is applied to this array twice, and the second execution is much faster than the first. I thought of lazy allocation on Linux, but that shouldn't be a problem because of the initialization loop. Any explanations will be helpful. Thanks!
Since you don't provide any Minimal, Complete and Verifiable Example, I'll have to make some wild guesses here, but I'm pretty confident I have the gist of the issue.
First, you have to notice that 98,306 x 98,306 is 9,664,069,636 which is way larger than the maximum value a signed 32 bit integer can store (which is 2,147,483,647). Therefore, the upper limit of your for initialization loop, after overflowing, could become 1,074,135,044 (as on my machines, although it is undefined behavior so strictly speaking, anything could happen), which is roughly 9 times smaller than what you expected.
So now, after the initialization loop, only about 11% of the memory you thought you allocated has actually been allocated and touched by the operating system. However, your first reduction loop does a good job of going over the various elements of the array, and since for about 89% of them it's the first time, the OS does the actual memory allocation there and then, which takes a significant amount of time.
And now, for your second reduction loop, all memory has been properly allocated and touched, which makes it much faster.
So that's what I believe happened. That said, many other parameters can enter into play here, such as:
Swapping: the array you try to allocate represents about 36GB of memory. If your machine doesn't have that much memory available, then your code might swap, which will potentially make a big mess of whatever performance measurement you can come up with
NUMA effect: if your machine has multiple NUMA nodes, then thread pinning and memory affinity, when not managed properly, can have a large impact on performance between loop occurrences
Compiler optimization: you didn't mention which compiler you used and which level of optimization you requested. Depending on that, you'd be amazed at how much of your code could disappear. For example, the compiler could remove the second loop entirely, as it does the same thing as the first and becomes useless since the result will be the same... and many other interesting and unexpected things which render your benchmark meaningless.
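As for the overflow itself, a hedged sketch of the fix is simply to do the element-count arithmetic in 64 bits:
#include <cstdlib>

int main() {
    const size_t n = static_cast<size_t>(98306) * 98306;  // 9,664,069,636 elements, no overflow
    float *array = static_cast<float *>(std::malloc(sizeof(float) * n)); // ~36 GB, may well fail
    if (!array) return 1;
    for (size_t i = 0; i < n; i++)
        array[i] = std::rand() % 10;
    std::free(array);
    return 0;
}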

Will a modern processor (like the i7) follow pointers and prefetch their data while iterating over a list of them?

I want to learn how to write better code that takes advantage of the CPU's cache. Working with contiguous memory seems to be the ideal situation. That being said, I'm curious if there are similar improvements that can be made with non-contiguous memory, but with an array of pointers to follow, like:
struct Position {
    int32_t x, y, z;
};
...
std::vector<Position*> posPointers;
...
void updatePosition() {
    for (uint32_t i = 0; i < posPointers.size(); i++) {
        Position& nextPos = *posPointers[i];
        nextPos.x++;
        nextPos.y++;
        nextPos.z++;
    }
}
This is just some rough mock-up code, and for the sake of learning this properly let's just say that all Position structs were created randomly all over the heap.
Can modern, smart, processors such as Intel's i7 look ahead and see that it's going to need X_ptr's data very shortly? Would the following line of code help?
... // for loop
Position& nextPos1 = *posPointers[i];
Position& nextPos2 = *posPointers[i+1];
Position& nextPos3 = *posPointers[i+2];
Position& nextPos4 = *posPointers[i+3];
... // Work on data here
I had read some presentation slides that seemed to indicate code like this would cause the processor to prefetch some data. Is that true? I am aware there are non-standard, platform specific, ways to call prefetching like __builtin_prefetch, but throwing that all over the place just seems like an ugly premature optimization. I am looking for a way I can subconsciously write cache-efficient code.
I know you didn't ask (and probably don't need) a sermon on the proper treatment of caches, but I thought I'd contribute my two cents anyway. Note that all this only applies in hot code. Remember that premature optimization is the root of all evil.
As has been pointed out in the comments, the best way is to have containers of actual data. Generally speaking, flat data structures are much preferable to "pointer spaghetti", even if you have to duplicate some data and/or pay a price for resizing/moving/defragmenting your data structures.
And as you know, flat data structures (e.g. an array of data) only pay off if you access them linearly and sequentially most of the time.
But this strategy may not always be usable. In lieu of actual linear data, you can use other strategies like employing pool allocators, and iterating over the pools themselves, instead of over the vector holding the pointers. This of course has its own disadvantages and can be a bit more complicated.
I'm sure you know this already, but it bears mentioning again that one of the most effective techniques for getting the most out of your cache is having smaller data! In the above code, if you can get away with int16_t instead of int32_t, you should definitely do so. You should pack your many bools and flags and enums into bit-fields, use indexes instead of pointers (especially on 64-bit systems), use fixed-size hash values in your data structures instead of strings, etc.
Now, about your main question that whether the processor can follow random pointers around and bring the data into cache before they are needed. To a very limited extent, this does happen. As probably you know, modern CPUs employ a lot of tricks to increase their speed (i.e. increase their instruction retire rate.) Tricks like having a store buffer, out-of-order execution, superscalar pipelines, multiple functional units of every kind, branch prediction, etc. Most of the time, these tricks all just help the CPU to keep executing instructions, even if the current instructions have stalled or take too long to finish. For memory loads (which is the slowest thing to do, iff the data is not in the cache,) this means that the CPU should get to the instruction as soon as possible, calculate the address, and request the data from the memory controller. However, the memory controller can have only a very limited number of outstanding requests (usually two these days, but I'm not sure.) This means that even if the CPU did very sophisticated stuff to look ahead into other memory locations (e.g. the elements of your posPointers vector) and deduce that these are the addresses of new data that your code is going to need, it couldn't get very far ahead because the memory controller can have only so many requests pending.
In any case, AFAIK, I don't think that CPUs actually do that yet. Note that this is a hard case, because the addresses of your randomly distributed memory locations are themselves in memory (as opposed to being in a register or calculable from the contents of a register.) And if the CPUs did it, it wouldn't have that much of an effect anyways because of memory interface limitations.
The prefetching technique you mentioned seems valid to me and I've seen it used, but it only yields a noticeable effect if your CPU has something to do while waiting for the future data to arrive. Incrementing three integers takes a lot less time than loading 12 bytes from memory (loading one cache line, actually) and therefore it won't mean much for the execution time. But if you had something worthwhile and more heavyweight to overlay on top of the memory prefetches (e.g. calculating a complex function that doesn't require data from memory!) then you could get very nice speedups. You see, the time to go through the above loop is essentially the sum of the time of all the cache misses; and you are getting the coordinate increments and the loop bookkeeping for free. So, you'd have won more if the free stuff were more valuable!
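For reference, a hedged sketch of that explicit-prefetch variant, using the GCC/Clang builtin mentioned in the question (the look-ahead distance of 8 elements is an arbitrary guess you would have to tune):
#include <cstddef>
#include <cstdint>
#include <vector>

struct Position { int32_t x, y, z; }; // as in the question

void updatePositionPrefetch(std::vector<Position*>& posPointers) {
    for (std::size_t i = 0; i < posPointers.size(); ++i) {
        if (i + 8 < posPointers.size())
            __builtin_prefetch(posPointers[i + 8], 1); // 1 = prefetch with intent to write
        Position& nextPos = *posPointers[i];
        nextPos.x++;
        nextPos.y++;
        nextPos.z++;
    }
}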
Modern processors have hardware prefetching mechanisms: Intel Hardware prefetcher. They infer stride access patterns to memory and prefetch memory locations that are likely to be accessed in the near future.
However in the case of totally random pointer chasing such techniques can not help. The processor does not know that the program in execution is performing pointer chasing, therefore it can not prefetch accordingly. In such cases hardware mechanisms are detrimental for performance as they would prefetch values that are not likely to be used.
The best that you can do is try to organize your data structures in memory in such a way that accesses to contiguous portions of memory are more likely.
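A minimal sketch of that recommendation: store the Position values themselves contiguously, so the hardware prefetcher can follow the stride-1 access pattern (repeating the struct from the question for completeness):
#include <cstdint>
#include <vector>

struct Position { int32_t x, y, z; };

void updatePositions(std::vector<Position>& positions) {
    for (Position& p : positions) { // linear, contiguous access: prefetcher-friendly
        p.x++;
        p.y++;
        p.z++;
    }
}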

Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. I.e. It's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real time, which is unusual when using multiple cores: user should be larger than real, as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());
    int i, j;
    #pragma omp parallel private(i, j)
    {
        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
When reading this question I thought I had found my answer. It talks about the glibc implementation of rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand(), replacing them with a single value, but using multiple threads was still slower. EDIT: oops, turns out I didn't test this correctly; it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread would try to write to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for compute pixel:
class Camera {
    // constructors destructors
private:
    // this is the array that is being written to, but not read from.
    Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {
    Colour col;
    // simple code to construct appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses the scene.
    _sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works- there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
for (int i = 0; i < maxThreadNum; ++i) {
randThreadStates[i].reset(new unsigned int(std::rand()));
}
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
int i = omp_get_thread_num();
return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
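A hypothetical usage sketch (the loop body is just a placeholder for the Monte Carlo sampling):
#include "threadrand.h"
#include <omp.h>

int main() {
    init_threadrand();            // seed one independent state per possible thread
    #pragma omp parallel for
    for (int i = 0; i < 1000000; ++i) {
        int r = threadrand();     // no locking: each thread uses its own state
        (void)r;                  // ... feed r into the sampling code ...
    }
    return 0;
}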
The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There are quite a few factors that could affect the performance of your code. One thing that comes to mind is cache alignment. Perhaps your data structures (you did mention a tree) are not really ideal for caching, and the CPU ends up waiting for the data to come from RAM, since it cannot fit things into the cache. Wrong cache-line alignments could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out and another will be run.
Your OS thread scheduler is non-deterministic, therefore, when a thread will run is not a predictable thing, so if it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity also plays a role. A thread will be scheduled on a particular core, and normally the scheduler will attempt to keep it on the same core. If more than one of your threads is running on a single core, they will have to share that core. This is another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.
There's some other factors, which I don't remember off the top of my head, however, I suggest doing some reading on threading. It's a complicated and extensive subject. There's lots of material out there.
Is the data being written at the end data that other threads need in order to do computePixel?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for like 1..10 11..20 21..30 31..40). This would greatly reduce the sharing.
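One hedged way to get contiguous blocks of rows per thread is simply a static schedule with no chunk size; OpenMP then hands each thread one contiguous slice of the iteration space (this reuses the loop from the question, so cam is assumed to be in scope):
// each thread gets one contiguous block of rows, so writes to _sensor from
// different threads end up far apart in memory
#pragma omp parallel for schedule(static)
for (int i = 0; i < cam.height(); ++i) {
    for (int j = 0; j < cam.width(); ++j) {
        cam.computePixel(i, j);
    }
}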
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
I just looked, and the Intel i3-2310M doesn't actually have 4 cores; it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I find in general that hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times for my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.

Simple operation to waste time?

I'm looking for a simple operation / routine which can "waste" time if repeated continuously.
I'm researching how gprof profiles applications, so this "time waster" needs to waste time in user space and should not require external libraries. I.e., calling sleep(20) will "waste" 20 seconds of time, but gprof will not record this time because it occurred within another library.
Any recommendations for simple tasks which can be repeated to waste time?
Another variant on Tomalak's solution is to set up an alarm, so that in your busy-wait loop you don't need to keep issuing a system call, but instead just check whether the signal has been sent.
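A minimal sketch of that idea, assuming a POSIX system (alarm() and SIGALRM are POSIX, not standard C++):
#include <csignal>
#include <unistd.h>

volatile sig_atomic_t expired = 0;
void on_alarm(int) { expired = 1; } // async-signal-safe: only sets a flag

int main() {
    std::signal(SIGALRM, on_alarm);
    alarm(20);              // one syscall up front: deliver SIGALRM in 20 seconds
    while (!expired) {}     // busy-wait entirely in user space
    return 0;
}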
The simplest way to "waste" time without yielding CPU is a tight loop.
If you don't need to restrict the duration of your waste (say, you control it by simply terminating the process when done), then go C style*:
for (;;) {}
(Be aware, though, that the standard allows the implementation to assume that programs will eventually terminate, so technically speaking this loop — at least in C++0x — has Undefined Behaviour and could be optimised out!**)
Otherwise, you could time it manually:
time_t s = time(0);
while (time(0) - s < 20) {}
Or, instead of repeatedly issuing the time syscall (which will lead to some time spent in the kernel), if on a GNU-compatible system you could make use of signal.h "alarms" to end the loop:
alarm(20);
while (true) {}
There's even a very similar example on the documentation page for "Handler Returns".
(Of course, these approaches will all send you to 100% CPU for the intervening time and make fluffy unicorns fall out of your ears.)
* {} rather than trailing ; used deliberately, for clarity. Ultimately, there's no excuse for writing a semicolon in a context like this; it's a terrible habit to get into, and becomes a maintenance pitfall when you use it in "real" code.
** See [n3290: 1.10/2] and [n3290: 1.10/24].
A simple loop would do.
If you're researching how gprof works, I assume you've read the paper, slowly and carefully.
I also assume you're familiar with these issues.
Here's a busy loop which runs at one cycle per iteration on modern hardware, at least as compiled by clang or gcc or probably any reasonable compiler with at least some optimization flag:
void busy_loop(uint64_t iters) {
    volatile int sink;
    do {
        sink = 0;
    } while (--iters > 0);
    (void)sink;
}
The idea is just to store to the volatile sink every iteration. This prevents the loop from being optimized away, and makes each iteration have a predictable amount of work (at least one store). Modern hardware can do one store per cycle, and the loop overhead generally can complete in parallel in that same cycle, so usually achieves one cycle per iteration. So you can ballpark the wall-clock time in nanoseconds a given number of iters will take by dividing by your CPU speed in GHz. For example, a 3 GHz CPU will take about 2 seconds (2 billion nanos) to busy_loop when iters == 6,000,000,000.