Here is what I'm doing:
I'm taking in bezier points and running bezier interpolation then storing the result in a std::vector<std::vector<POINT>.
The bezier calculation was slowing me down so this is what I did.
I start with a std::vector<USERPOINT> which is a struct with a point and 2 other points for bezier handles.
I divide these up into ~4 groups and assign each thread to do 1/4 of the work. To do this I created 4 std::vector<std::vector<POINT> > to store the results from each thread.In the end all the points have to be in 1 continuous vector, before I used multithreading I accessed this directly but now I reserve the size of the 4 vectors produced by the threads and insert them into the original vector, in the correct order. This works, but unfortunatly the copy part is very slow and makes it slower than without multithreading. So now my new bottleneck is copying the results to the vector. How could I do this way more efficiently?
Thanks
Have all the threads put their results into a single contiguous vector just like before. You have to ensure each thread only accesses parts of the vector that are separate from the others. As long as that's the case (which it should be regardless -- you don't want to generate the same output twice) each is still working with memory that's separate from the others, and you don't need any locking (etc.) for things to work. You do, however, need/want to ensure that the vector for the result has the correct size for all the results first -- multiple threads trying (for example) to call resize() or push_back() on the vector will wreak havoc in a hurry (not to mention causing copying, which you clearly want to avoid here).
Edit: As Billy O'Neal pointed out, the usual way to do this would be to pass a pointer to each part of the vector where each thread will deposit its output. For the sake of argument, let's assume we're using the std::vector<std::vector<POINT> > mentioned as the original version of things. For the moment, I'm going to skip over the details of creating the threads (especially since it varies across systems). For simplicity, I'm also assuming that the number of curves to be generated is an exact multiple of the number of threads -- in reality, the curves won't divide up exactly evenly, so you'll have to "fudge" the count for one thread, but that's really unrelated to the question at hand.
std::vector<USERPOINT> inputs; // input data
std::vector<std::vector<POINT> > outputs; // space for output data
const int thread_count = 4;
struct work_packet { // describe the work for one thread
USERPOINT *inputs; // where to get its input
std::vector<POINT> *outputs; // where to put its output
int num_points; // how many points to process
HANDLE finished; // signal when it's done.
};
std::vector<work_packet> packets(thread_count); // storage for the packets.
std::vector<HANDLE> events(thread_count); // storage for parent's handle to events
outputs.resize(inputs.size); // can't resize output after processing starts.
for (int i=0; i<thread_count; i++) {
int offset = i * inputs.size() / thread_count;
packets[i].inputs = &inputs[0]+offset;
packets[i].outputs = &outputs[0]+offset;
packets[i].count = inputs.size()/thread_count;
events[i] = packets[i].done = CreateEvent();
threads[i].process(&packets[i]);
}
// wait for curves to be generated (Win32 style, for the moment).
WaitForMultipleObjects(&events[0], thread_count, WAIT_ALL, INFINITE);
Note that although we have to be sure that the outputs vector doesn't get resized while be operated on by multiple threads, the individual vectors of points in outputs can be, because each will only ever be touched by one thread at a time.
If the simple copy in between things is slower than before you started using Mulitthreading, it's entirely likely that what you're doing simple isn't going to scale to multiple cores. If it's something simple like bezier stuff I suspect that's going to be the case.
Remember that the overhead of making the threads and such has an impact on total run time.
Finally.. for the copy, what are you using? Is it std::copy?
Multithreading is not going to speed up your process. Processing the data in different cores, could.
Related
I have a piece of code in my full code:
const unsigned int GL=8000000;
const int cuba=8;
const int cubn=cuba+cuba;
const int cub3=cubn*cubn*cubn;
int Length[cub3];
int Begin[cub3];
int Counter[cub3];
int MIndex[GL];
struct Particle{
int ix,jy,kz;
int ip;
};
Particle particles[GL];
int GetIndex(const Particle & p){return (p.ix+cuba+cubn*(p.jy+cuba+cubn*(p.kz+cuba)));}
...
#pragma omp parallel for
for(int i=0; i<cub3; ++i) Length[i]=Counter[i]=0;
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
int ic=GetIndex(particles[i]);
#pragma omp atomic update
Length[ic]++;
}
Begin[0]=0;
#pragma omp single
for(int i=1; i<cub3; ++i) Begin[i]=Begin[i-1]+Length[i-1];
#pragma omp parallel for
for(int i=0; i<N; ++i)
{
if(particles[i].ip==3)
{
int ic=GetIndex(particles[i]);
if(ic>cub3 || ic<0) printf("ic=%d out of range!\n",ic);
int cnt=0;
#pragma omp atomic capture
cnt=Counter[ic]++;
MIndex[Begin[ic]+cnt]=i;
}
}
If to remove
#pragma omp parallel for
the code works properly and the output results are always the same.
But with this pragma there is some undefined behaviour/race condition in the code, because each time it gives different output results.
How to fix this issue?
Update: The task is the following. Have lots of particles with some random coordinates. Need to output to the array MIndex the indices in the array particles of the particles, which are in each cell (cartesian cube, for example, 1×1×1 cm) of the coordinate system. So, in the beginning of MIndex there should be the indices in the array particles of the particles in the 1st cell of the coordinate system, then - in the 2nd, then - in the 3rd and so on. The order of indices within given cell in the area MIndex is not important, may be arbitrary. If it is possible, need to make this in parallel, may be using atomic operations.
There is a straight way: to traverse across all the coordinate cells in parallel and in each cell check the coordinates of all the particles. But for large number of cells and particles this seems to be slow. Is there a faster approach? Is it possible to travel across the particles array only once in parallel and fill MIndex array using atomic operations, something like written in the code piece above?
You probably can't get a compiler to auto-parallelize scalar code for you if you want an algorithm that can work efficiently (without needing atomic RMWs on shared counters which would be a disaster, see below). But you might be able to use OpenMP as a way to start threads and get thread IDs.
Keep per-thread count arrays from the initial histogram, use in 2nd pass
(Update: this might not work: I didn't notice the if(particles[i].ip==3) in the source before. I was assuming that Count[ic] will go as high as Length[ic] in the serial version. If that's not the case, this strategy might leave gaps or something.
But as Laci points out, perhaps you want that check when calculating Length in the first place, then it would be fine.)
Manually multi-thread the first histogram (into Length[]), with each thread working on a known range of i values. Keep those per-thread lengths around, even as you sum across them and prefix-sum to build Begin[].
So Length[thread][ic] is the number of particles in that cube, out of the range of i values that this thread worked on. (And will loop over again in the 2nd loop: the key is that we divide the particles between threads the same way twice. Ideally with the same thread working on the same range, so things may still be hot in L1d cache.)
Pre-process that into a per-thread Begin[][] array, so each thread knows where in MIndex to put data from each bucket.
// pseudo-code, fairly close to actual C
for(ic < cub3) {
// perhaps do this "vertical" sum into a temporary array
// or prefix-sum within Length before combining across threads?
int pos = sum(Length[0..nthreads-1][ic-1]) + Begin[0][ic-1];
Begin[0][ic] = pos;
for (int t = 1 ; t<nthreads ; t++) {
pos += Length[t][ic]; // prefix-sum across threads for this cube bucket
Begin[t][ic] = pos;
}
}
This has a pretty terrible cache access pattern, especially with cuba=8 making Length[t][0] and Length[t+1][0] 4096 bytes apart from each other. (So 4k aliasing is a possible problem, as are cache conflict misses).
Perhaps each thread can prefix-sum its own slice of Length into that slice of Begin, 1. for cache access pattern (and locality since it just wrote those Lengths), and 2. to get some parallelism for that work.
Then in the final loop with MIndex, each thread can do int pos = --Length[t][ic] to derive a unique ID from the Length. (Like you were doing with Count[], but without introducing another per-thread array to zero.)
Each element of Length will return to zero, because the same thread is looking at the same points it just counted. With correctly-calculated Begin[t][ic] positions, MIndex[...] = i stores won't conflict. False sharing is still possible, but it's a large enough array that points will tend to be scattered around.
Don't overdo it with number of threads, especially if cuba is greater than 8. The amount of Length / Begin pre-processing work scales with number of threads, so it may be better to just leave some CPUs free for unrelated threads or tasks to get some throughput done. OTOH, with cuba=8 meaning each per-thread array is only 4096 bytes (too small to parallelize the zeroing of, BTW), it's really not that much.
(Previous answer before your edit made it clearer what was going on.)
Is this basically a histogram? If each thread has its own array of counts, you can sum them together at the end (you might need to do that manually, not have OpenMP do it for you). But it seems you also need this count to be unique within each voxel, to have MIndex updated properly? That might be a showstopper, like requiring adjusting every MIndex entry, if it's even possible.
After your update, you are doing a histogram into Length[], so that part can be sped up.
Atomic RMWs would be necessary for your code as-is, performance disaster
Atomic increments of shared counters would be slower, and on x86 might destroy the memory-level parallelism too badly. On x86, every atomic RMW is also a full memory barrier, draining the store buffer before it happens, and blocking later loads from starting until after it happens.
As opposed to a single thread which can have cache misses to multiple Counter, Begin and MIndex elements outstanding, using non-atomic accesses. (Thanks to out-of-order exec, the next iteration's load / inc / store for Counter[ic]++ can be doing the load while there are cache misses outstanding for Begin[ic] and/or for Mindex[] stores.)
ISAs that allow relaxed-atomic increment might be able to do this efficiently, like AArch64. (Again, OpenMP might not be able to do that for you.)
Even on x86, with enough (logical) cores, you might still get some speedup, especially if the Counter accesses are scattered enough they cores aren't constantly fighting over the same cache lines. You'd still get a lot of cache lines bouncing between cores, though, instead of staying hot in L1d or L2. (False sharing is a problem,
Perhaps software prefetch can help, like prefetchw (write-prefetching) the counter for 5 or 10 i iterations later.
It wouldn't be deterministic which point went in which order, even with memory_order_seq_cst increments, though. Whichever thread increments Counter[ic] first is the one that associates that cnt with that i.
Alternative access patterns
Perhaps have each thread scan all points, but only process a subset of them, with disjoint subsets. So the set of Counter[] elements that any given thread touches is only touched by that thread, so the increments can be non-atomic.
Filtering by p.kz ranges maybe makes the most sense since that's the largest multiplier in the indexing, so each thread "owns" a contiguous range of Counter[].
But if your points aren't uniformly distributed, you'd need to know how to break things up to approximately equally divide the work. And you can't just divide it more finely (like OMP schedule dynamic), since each thread is going to scan through all the points: that would multiply the amount of filtering work.
Maybe a couple fixed partitions would be a good tradeoff to gain some parallelism without introducing a lot of extra work.
Re: your edit
You already loop over the whole array of points doing Length[ic]++;? Seems redundant to do the same histogramming work again with Counter[ic]++;, but not obvious how to avoid it.
The count arrays are small, but if you don't need both when you're done, you could maybe just decrement Length to assign unique indices to each point in a voxel. At least the first histogram could benefit from parallelizing with different count arrays for each thread, and just vertically adding at the end. Should scale perfectly with threads since the count array is small enough for L1d cache.
BTW, for() Length[i]=Counter[i]=0; is too small to be worth parallelizing. For cuba=8, it's 8*8*16 * sizeof(int) = 4096 bytes, just one page, so it's just two small memsets.
(Of course if each thread has their own separate Length array, they each need to zero it). That's small enough to even consider unrolling with maybe 2 count arrays per thread to hide store/reload serial dependencies if a long sequence of points are all in the same bucket. Combining count arrays at the end is a job for #pragma omp simd or just normal auto-vectorization with gcc -O3 -march=native since it's integer work.
For the final loop, you could split your points array in half (assign half to each thread), and have one thread get unique IDs by counting down from --Length[i], and another counting up from 0 in Counter[i]++. With different threads looking at different points, this could give you a factor of 2 speedup. (Modulo contention for MIndex stores.)
To do more than just count up and down, you'd need info you don't have from just the overall Length array... but which you did have temporarily. See the section at the top
You are right to make the update Counter[ic]++ atomic, but there is an additional problem on the next line: MIndex[Begin[ic]+cnt]=i; Different iterations can write into the same location here, unless you have mathematical proof that this is never the case from the structure of MIndex. So you have to make that line atomic too. And then there is almost no parallel work left in your loop, so your speed up if probably going to be abysmal.
EDIT the second line however is not of the right form for an atomic operation, so you have to make it critical. Which is going to make performance even worse.
Also, #Laci is correct that since this is an overwrite statement, the order of parallel scheduling is going to influence the outcome. So either live with that fact, or accept that this can not be parallelized.
general question: the number of threads must be equal to the size of the elements i want to deal with? exmaple: if i have matrix M[a][b]. i must allocate (aXb) threads or i can allocate more threads than i need(more than ab)? because the thread that will focus on element aXb+1 will throw us out, doesnt he? or the solution is to put a condition(only if in range(ab))?
specific question: let be M[x][y] matrix with x rows and y columns. consider that 1000 <= x <= 300000 and y <= 100. how can i organize the threads in that way that it will be general for each input for x and y. i want that each thread will focus on one element in the matrix. CC = 2.1 thanks!
General answer: It depends on a problem.
In most cases natural one-to-one mapping of the problem to the grid of threads is fine to start with, but what you want to keep in mind is:
Achieving high occupancy.
Maximizing GPU resources usage and memory throughput.
Working with valid data.
Sometimes it may require using single thread to process many elements or many threads to process single element.
For instance, you can imagine an series of independent operations A,B and C that need to be applied on array of elements. You could run three different kernels, but it might be better choice to allocate the grid to contain three times more threads than there is elements and distinguish operations by one of the dimensions of the grid (or anything else). On the other side - you might have a problem that could use maximizing the usage of shared memory (e.g transforming the image) - you could use block of 16 threads to process 5x5 image window where each thread would calculate some statistics of each 2x2 slice.
The choice is yours - the best advice is not always go with the obvious. Try different approaches and choose what works best.
I have a data structure which consists of 1,000 array elements, each array element is a smaller array of 8 ints:
std::array<std::array<int, 8>, 1000>
The data structure contains two "pointers", which track the largest and smallest populated array elements (within the "outer", 1000-element array). So for example they might be:
min = 247
max = 842
How can I read and write to this data structure from multiple threads? I am worried about race conditions between pushing/popping elements and maintaining the two "pointers". My basic mode of operation is:
// Pop element from current index
// Calculate new index
// Write element to new index
// Update min and max "pointers"
You are correct that your current algorithm is not thread safe, there are a number of places where contention could occur.
This is impossible to optimize without more information though. You need to know where the slow-down is happening before you can improve it - and for that you need metrics. Profile your code and find out what bits are actually taking the time, because you can only gain by parallelizing those bits and even then you may find that it's actually memory or something else that is the limiting factor, not CPU.
The simplest approach will then be to just lock the entire structure for the full process. This will only work if the threads are doing a lot of other processing as well, if not you will actually lose performance compared to single threading.
After that you can consider having a separate lock for different sections of the data structure. You will need to properly analyse what you are using when and where and work out what would be useful to split. For example you might have chunks of the sub arrays with each chunk having its own lock.
Be careful of deadlocks in this situation though, you might have a thread claim 32 then want 79 while another thread already has 79 and then wants 32. Make sure you always claim locks in the same order.
The fastest option (if possible) may even be to give each thread it's own copy of the data structure, each processes 1/N of the work and then merge the results at the end. This way no synchronization is needed at all during processing.
But again it all comes back to the metrics and profiling. This is not a simple problem.
I'm trying to figure out the best way to do this, but I'm getting a bit stuck in figuring out exactly what it is that I'm trying to do, so I'm going to explain what it is, what I'm thinking I want to do, and where I'm getting stuck.
I am working on a program that has a single array (Image really), which per frame can have a large number of objects placed on an image array. Each object is completely independent of all other objects. The only dependency is the output, in theory possible to have 2 of these objects placed on the same location on the array. I'm trying to increase the efficiency of placing the objects on the image, so that I can place more objects. In order to do that, I'm wanting to thread the problem.
The first step that I have taken towards threading it involves simply mutex protecting the array. All operations which place an object on the array will call the same function, so I only have to put the mutex lock in one place. So far, it is working, but it is not seeing the improvements that I would hope to have. I am hypothesizing that this is because most of the time, the limiting factor is the image write statement.
What I'm thinking I need to do next is to have multiple image buffers that I'm writing to, and to combine them when all of the operations are done. I should say that obscuration is not a problem, all that needs to be done is to simply add the pixel counts together. However, I'm struggling to figure out what mechanism I need to use in order to do this. I have looked at semaphores, but while I can see that they would limit a number of buffers, I can envision a situation in which two or more programs would be trying to write to the same buffer at the same time, potentially leading to inaccuracies.
I need a solution that does not involve any new non-standard libraries. I am more than willing to build the solution, but I would very much appreciate a few pointers in the right direction, as I'm currently just wandering around in the dark...
To help visualize this, imagine that I am told to place, say, balls at various locations on the image array. I am told to place the balls each frame, with a given brightness, location, and size. The exact location of the balls is dependent on the physics from the previous frame. All of the balls must be placed on a final image array, as quickly as they possibly can be. For the purpose of this example, if two balls are on top of each other, the brightness can simply be added together, thus there is no need to figure out if one is blocking the other. Also, no using GPU cards;-)
Psuedo-code would look like this: (Assuming that some logical object is given for location, brightness, and size). Also, assume, that isValidPoint simply finds if the point should be on the circle, given the location and radius of said circle.
global output_array[x_arrLimit*y_arrLimit)
void update_ball(int ball_num)
{
calc_ball_location(ball_num, *location, *brightness, *size); // location, brightness, size all set inside function
place_ball(location,brightness,size)
}
void place_ball(location,brighness,size)
{
get_bounds(location,size,*xlims,*ylims)
for (int x=xlims.min;x<xlims.max;y++)
{
for (int y=ylims.min;y<ylims.max;y++)
{
if (isValidPoint(location,size,x,y))
{
output_array(x,y)+=brightness;
}
}
}
}
The reason you're not seeing any speed up with the current design is that, with a single mutex for the entire buffer, you might as well not bother with threading, as all the objects have to be added serially anyway (unless there's significant processing being done to determine what to add, but it doesn't sound like that's the case). Depending on what it takes to "add an object to the buffer" (do you use scan-line algorithms, flood fill, or something else), you might consider having one mutex per row or a range of rows, or divide the image into rectangular tiles with one mutex per region or something. That would allow multiple threads to add to the image at the same time as long as they're not trying to update the same regions.
OK, you have an image member in some object. Add the, no doubt complex, code to add other image/objects to it. maipulate it, whatever. Aggregate in all the other objects that may be involved, add some command enun to tell threads what op to do and an 'OnCompletion' event to call when done.
Queue it to a pool of threads hanging on the end of a producer-consumer queue. Some thread will get the *object, perform the operation on the image/set and then call the event, (pass the completed *object as a parameter). In the event, you can do what you like, according to the needs of your app. Maybe you will add the processed images into a (thread-safe!!), vector or other container or queue them off to some other thread - whatever.
If the order of processing the images must be preserved, (eg. video stream), you could add an incrementing sequence-number to each object that is submitted to the pool, so enabling your 'OnComplete' handler to queue up 'later' images until all earlier ones have come in.
Since no two threads ever work on the same image, you need no locking while processing. The only locks you should, (may), need are those internal the queues, and they only lock for the time taken to push/pop object pointers to/from the queue - contention will be very rare.
I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. I.e. It's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores- user should be larger than real as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
// let the camera know to use this raytracer for probing the scene
cam.setSamplingFunc(getSamplingFunction());
int i, j;
#pragma omp parallel private(i, j)
{
// Construct a ray for each pixel.
#pragma omp for schedule(dynamic, 4)
for (i = 0; i < cam.height(); ++i) {
for (j = 0; j < cam.width(); ++j) {
cam.computePixel(i, j);
}
}
}
}
When reading this question I thought I had found my answer. It talks about the implementation of gclib rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for monte carlo sampling, so i thought that was the problem. I got rid of calls to rand, replacing them with a single value, but using multiple threads is still slower. EDIT: oops turns out I didn't test this correctly, it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection, however, no writes are done to this tree or any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for compute pixel:
class Camera {
// constructors destructors
private:
// this is the array that is being written to, but not read from.
Colour* _sensor; // allocated using new at construction.
}
void Camera::computePixel(int i, int j) const {
Colour col;
// simple code to construct appropriate ray for the pixel
Ray3D ray(/* params */);
col += _sceneSamplingFunc(ray); // calls a const method that traverses scene.
_sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works- there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
for (int i = 0; i < maxThreadNum; ++i) {
randThreadStates[i].reset(new unsigned int(std::rand()));
}
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
int i = omp_get_thread_num();
return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There is quite a few factors that could affect the performance of your code, one thing that comes to mind is the cache alignment. Perhaps your data structures, and you did mention a tree, are not really ideal for caching, and the CPU ends up waiting for the data come from the RAM, since it cannot fit things into the cache. Wrong cache-line alignments could cause something like that. If the CPU has to wait for things to come from RAM, it is likely, that the thread will be context-switched out, and another will be run.
Your OS thread scheduler is non-deterministic, therefore, when a thread will run is not a predictable thing, so if it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity, also plays a role. A thread will be scheduled on a particular core, and normally it will be attempted to keep this thread on the same core. If more then one of your threads are running on a single core, they will have to share the same core. Another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.
There's some other factors, which I don't remember off the top of my head, however, I suggest doing some reading on threading. It's a complicated and extensive subject. There's lots of material out there.
Is the data being written at the end, data that other threads need to be able to do computePixel ?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for like 1..10 11..20 21..30 31..40). This would greatly reduce the sharing.
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
I just looked and the Intel i3-2310M doesn't actually have 4 cores, it has 2 cores and hyper-threading. Try running your code with just 2 threads and see it that helps. I find in general hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times of my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.