Here is a little program that runs two for loops on separate threads:
#include <iostream>
#include <thread>
void print_one() {
for (int i = 0; i < 1000; i++) {
std::cout << i << "\n";
}
}
void print_two() {
for (int i = 0; i > -1000; i--) {
std::cout << i << "\n";
}
}
int main(){
std::thread t1(print_one);
std::thread t2(print_two);
t1.join();
t2.join();
}
The program follows a pattern: one for loop runs for about 30-400 iterations (usually 30-200), after which the other loop gets its "turn" and does the same. This continues until the program exits.
I left the mutex out on purpose to show that it's not about locks. In all the examples I've seen on YouTube, the loops usually alternate after every 1-3 iterations. On my PC it's as if two threads simply can't run simultaneously, and one thread has to take time off while the other one is working. I'm not experiencing any performance issues in general, so it doesn't seem like a hardware failure to me.
A thread not doing anything while another one has time to execute std::cout and add a new line hundreds of times in a row just doesn't sound right to me. Is this normal behavior?
I have a Ryzen 5 1600 processor, if that makes any difference.
A thread not doing anything while another one has time to execute std::cout and add a new line hundreds of times in a row just doesn't sound right to me. Is this normal behavior?
There is no other sensible possibility here. Each thread only needs to do a single integer increment before it needs access to the terminal, something that only one thread can access at a time.
There are only two other possibilities, and they're both obviously terrible:
A single core runs the two threads, switching threads after every single output. This provides atrocious performance as 90+% of the time, the core is switching from one thread to the other.
The threads run on two cores. Each core does a single increment, then waits for the other core to finish writing to the terminal, and then writes to the terminal. Each core spends at least half its time waiting for the other core to release the terminal. This takes up two cores and provides fairly poor performance because the threads are spending a lot of time stopping and starting each other.
What you are seeing is the best possible behavior. The only sensible thing to do is to allow each thread to run long enough that the cost of switching is drowned out.
Answer to the first part:
The behaviour you observed is normal.
Each thread gets a certain amount of time on the core, decided dynamically by the OS based on the current situation.
Answer to the second part:
The cout operation is buffered.
Even if the calls to write (or whatever it is that accomplishes that effect in that particular implementation) are guaranteed to be mutually exclusive, the buffer might be shared by the different threads. This will quickly lead to corruption of the internal state of the stream. [1]
That is why you can sometimes see a bare newline: the number and the "\n" are written by two separate insertions, which can interleave between threads.
The solution to this issue is to use printf(); its behaviour is atomic.
EDIT:
The point about printf() behaving atomically is based on my own experiments. The code at [2] demonstrates it; you can test it yourself.
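For anyone who wants to try this, here is a minimal sketch of the original loops rewritten to use printf(). Each printf call emits its whole line in one go, so lines from the two threads should not get torn apart mid-line; the long bursts from one thread are a scheduling effect and will still be there.
#include <cstdio>
#include <thread>

void print_one() {
    for (int i = 0; i < 1000; i++)
        std::printf("%d\n", i);   // one call per line, so a line is never split between threads
}

void print_two() {
    for (int i = 0; i > -1000; i--)
        std::printf("%d\n", i);
}

int main() {
    std::thread t1(print_one);
    std::thread t2(print_two);
    t1.join();
    t2.join();
}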
References:
[1] https://stackoverflow.com/a/6374525/7422352
[2] https://www.researchgate.net/publication/351979208_Code_to_demonstrate_sequential_execution_of_2_parallel_regions_created_using_OpenMP_API
Related
I am writing a program that does heavy calculations on large arrays in real time. The task can be split into several sub-arrays for multithreading. However, I cannot make it run any faster using threads.
Here is a sample dummy code which was created for demonstration (same problem).
The two-thread version takes 39 seconds, which is a couple of seconds longer than if the arrays were computed one after another(!). It doesn't matter whether the arrays are global etc. I also tested constructing the threads only once, with the same result.
I'm using Xcode (5.1.1) and a MacBook Air (2013 model, Core i5, OS X 10.8.5). Yes, it's an old computer; I rarely program...
So, can you find any mistake in the logic of the code, or could the problem be somewhere in the Xcode settings, etc.?
#include <ctime>
#include <iostream>
#include <thread>

using namespace std;

class Value
{
public:
    float a[3000000];
};

float n = 0;

void cycle(Value *val)
{
    int i;
    for (i = 0; i < 3000000; i++)
    {
        val->a[i] = n;
        n += 0.0001;
    }
}

int main()
{
    Value *val1 = new Value, *val2 = new Value;
    clock_t start, stop;
    start = clock();
    for (int i = 0; i < 1000; i++)
    {
        thread first(cycle, val1);
        thread second(cycle, val2);
        first.join();
        second.join();
    }
    stop = clock();
    float tdiff = (((float)stop - (float)start) / 1000000.0F);
    std::cout << endl << "This took " << tdiff << " seconds...";
    return 0;
}
There is a joke that goes something like this:
One programmer needs 1 day to finish a program, so how many days do 10 programmers need for the same program? Answer: 10 days.
The work in your code is done in this loop:
for (int i=0; i<1000; i++)
{
thread first (cycle,val1);
thread second (cycle,val2);
first.join();
second.join();
}
Now consider that spawning and joining threads is overhead. In total, your parallel code does more work than a sequential version would have to; in general there is no way around that. And you are not creating and joining threads once, but 1000 times, i.e. you pay that overhead 1000 times.
Don't expect code to run faster simply by adding more threads to it. I refer you to Amdahl's law or Gustafson's law (which states basically the same thing, just a bit more positively).
I suggest you experiment with sequential vs. threaded with only one thread, to get a feeling for the overhead. You can compare this:
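For reference, Amdahl's law bounds the speedup: if a fraction p of the work can be parallelized over N threads, the speedup is at most 1 / ((1 - p) + p / N). Any fixed per-iteration cost such as thread creation effectively shrinks p, which is exactly what happens here.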
for (int i=0; i<1000; i++)
{
thread first (cycle,val1);
first.join();
}
with a sequential version that does not use any threads. You will be surprised by the difference.
You get the most out of multithreading when the threads do lots of work (cf. Amdahl/Gustafson) and when there is no synchronisation between different threads. Joining the threads 1000 times is basically a barrier: each iteration cannot proceed until both first and second have finished, so the faster thread sits idle while the slower one completes. Such barriers are best avoided.
Last but not least, as mentioned in a comment, your benchmark is rather questionable because you are not using the results of the computations. Either you didn't turn on optimizations, which makes the numbers rather meaningless, or you did turn on optimizations and the compiler may have optimized work away without you noticing. I am also not sure whether you are comparing two versions that do the same work, or whether your parallel version is doing twice the work. Moreover, when measuring, take care to measure wall-clock time, not CPU time: CPU time adds up the time spent on all cores, while wall-clock time is what you actually want to compare.
TL;DR: More threads != automatically less runtime.
If you read the documentation for clock, you might notice that it says that time can appear to go faster if the process executes on multiple cores; clock is a total CPU-use approximation, not "wall clock time", and one "CPU tick" on two cores in parallel is the same amount of "time" as two sequential "ticks" on one core.
(By the way: in order to get the time in seconds, you should be dividing by CLOCKS_PER_SEC.)
Using a more appropriate timer, like std::chrono::steady_clock, will show that the sequential variant takes almost twice as long as the multithreaded version.
The difference can be explained completely by the overhead of creating and destroying threads.
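As a rough sketch of what a wall-clock measurement might look like (keeping the rest of the program from the question unchanged; the elided part is where the threads are created and joined):
#include <chrono>
#include <iostream>

int main() {
    auto start = std::chrono::steady_clock::now();

    // ... create and join the threads exactly as in the question ...

    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed = stop - start;  // wall-clock seconds
    std::cout << "This took " << elapsed.count() << " seconds...\n";
    return 0;
}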
Contrary to what the other comments suggest, n is not the culprit, and neither are the sequential joins.
Each thread performs the same operation cycle, so how could there be any improvement?
You have to manually split the workload between the two threads, so that each one works on half the data.
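A minimal sketch of that idea, using the dummy workload from the question (the starting value of n for each half is computed directly rather than accumulated, which is what makes the split possible):
#include <thread>

const int N = 3000000;

void cycle_range(float *a, int begin, int end) {
    float n = begin * 0.0001f;        // same value the sequential loop would have reached
    for (int i = begin; i < end; i++) {
        a[i] = n;
        n += 0.0001f;
    }
}

int main() {
    float *a = new float[N];
    std::thread first(cycle_range, a, 0, N / 2);      // first half
    std::thread second(cycle_range, a, N / 2, N);     // second half
    first.join();
    second.join();
    delete[] a;
    return 0;
}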
So I have a Kinect program that has three main functions that collect data and save it. I want one of these functions to execute as much as possible, while the other two run maybe 10 times every second.
while(1)
{
    ...
    //multi-threading to make sure color and depth events are aligned -> get skeletal data
    if (WaitForSingleObject(colorEvent, 0) == 0 && WaitForSingleObject(depthEvent, 0) == 0)
    {
        std::thread first(getColorImage, std::ref(colorEvent), std::ref(colorStreamHandle), std::ref(colorImage));
        std::thread second(getDepthImage, std::ref(depthEvent), std::ref(depthStreamHandle), std::ref(depthImage));

        if (WaitForSingleObject(skeletonEvent, INFINITE) == 0)
        {
            first.join();
            second.join();

            std::thread third(getSkeletonImage, std::ref(skeletonEvent), std::ref(skeletonImage), std::ref(colorImage), std::ref(depthImage), std::ref(myfile));
            third.join();
        }

        //if (check == 1)
        //check = 2;
    }
}
Currently my threads make them all run at exactly the same time, but this slows down my computer a lot. I only need to run getColorImage and getDepthImage maybe 5-10 times/second, whereas I want getSkeletonImage to run as much as possible.
I want getSkeletonImage to run at maximum frequency (~30 times/second through the while loop), and getColorImage and getDepthImage to stay time-synchronized (~5-10 times/second through the while loop).
What is a way I can do this? I am already using threads, but I need one to run consistently, and then the other two to join in intermittently essentially. Thank you for your help.
Currently, your main loop is creating the threads every iteration, which suggests each thread function runs once to completion. That introduces the overhead of creating and destroying threads every time.
Personally, I wouldn't bother with threads at all. Instead, in the main thread I'd do
void runSkeletonEvent(int n)
{
    for (int i = 0; i < n; ++i)
    {
        // wait required time (i.e. to next multiple of 1/30 second)
        skeletonEvent();
    }
}

// and, in your main function ....

while (termination_condition_not_met)
{
    runSkeletonEvent(3);
    colorEvent();
    runSkeletonEvent(3);
    depthEvent();
}
This interleaves the events, so skeletonEvent() runs six times for every time depthEvent() and colorEvent() are run. Just adjust the numbers as needed to get required behaviour.
You'll need to design the code for all the events so they don't run over time (if they do, all subsequent events will be delayed - there is no means to stop that).
The problem you'll then need to resolve is how to wait for the time to fire the skeleton event. A process of retrieving clock time, calculating how long to wait, and sleeping for that interval will do it. By sleeping (the thread yielding its time slice) your program will also be a bit better mannered (e.g. it won't be starving other processes of processor time).
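A sketch of that pacing using the standard library (getSkeletonImage stands in for the per-tick work; the 33 ms period and the 300-tick bound are arbitrary examples, not values from the question):
#include <chrono>
#include <thread>

void runSkeletonLoop() {
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::milliseconds(33);    // roughly 1/30 second
    auto next = clock::now() + period;
    for (int tick = 0; tick < 300; ++tick) {               // replace with the real termination condition
        // getSkeletonImage(...);                           // do the per-tick work here
        std::this_thread::sleep_until(next);                // yield the CPU until the next tick
        next += period;
    }
}

int main() { runSkeletonLoop(); }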
One advantage is that, if data is to be shared between the "events" (e.g. all of the events modify some global data) there is no need for synchronisation, because the looping above guarantees that only one "event" accesses shared data at one time.
Note: your usage of WaitForSingleObject() indicates you are using Windows. Windows (except, arguably, CE in a weak sense) is not really a real-time system, so it does not guarantee precise timing. In other words, the actual intervals you achieve will vary.
It is still possible to restructure to use threads. From your description, there is no evidence you really need anything like that, so I'll leave this reply at that.
This question already has answers here:
C++: Simple return value from std::thread?
(8 answers)
Closed 9 years ago.
What's the most efficient way to return a value from a thread in C++11?
vector<thread> t(6);
for(int i = 0; i < 6; i++)
t[i] = thread(do_c);
for(thread& t_now : t)
t_now.join();
for(int i = 0; i < 6; i++)
cout << /*the return of function do_c*/
Also, if it would benefit performance, feel free to recommend something other than std::thread.
First of all, std::thread doesn't return a value, but the function that is passed to it on construction may very well do so.
There's no way to access the function's return value from the std::thread object unless you save it somehow after calling the function on the thread.
A simple solution would e.g. be to pass a reference to the thread and store the result in the memory pointed to by the reference. With threads though one must be careful not to introduce a data race.
Consider a simple function:
int func() {
return 1;
}
And this example:
std::atomic<int> x{0}; // Use std::atomic to prevent data race.
std::thread t{[&x] { // Simple lambda that captures a reference of x.
x = func(); // Call function and assign return value.
}};
/* Do something while thread is running... */
t.join();
std::cout << "Value: " << x << std::endl;
Now, instead of dealing with this low-level concurrency stuff yourself, you can use the Standard Library, as someone (as always) has already solved it for you. There are std::packaged_task and std::future, which are designed to work with std::thread for this particular type of issue. They should also be just as efficient as the custom solution in most cases.
Here's an equivalent example using std::packaged_task and std::future:
std::packaged_task<int()> task{func}; // Create task using func.
auto future = task.get_future(); // Get the future object.
std::thread t{std::move(task)}; // std::packaged_task is move-only.
/* Do something while thread is running... */
t.join();
std::cout << "Value: " << future.get() << std::endl; // Get result atomically.
Don't always assume something is less efficient just because it is considered as "high level".
Lauching a thread and terminating it require many hundreds of machine cycles. But that's only a beginning. Context switches between threads, that are bound to happen if the threads are doing anything useful, will repeatedly consume even more many hundreds of machine cycles. The execution context of all these threads will consume many a byte of memory, which in turn will mess up many a line of cache, thus hindering the CPU efforts for yet another great deal of many hundreds of machine cycles.
As a matter of fact, doing anything with multitasking is a great consumer of many hundreds of machine cycles. Multitasking only becomes profitable in terms of CPU power usage when you manage to get enough processors working on lumps of data that are conceptually independent (so parallel processing won't threaten their integrity) and big enough to show a net gain compared with a monoprocessor version.
In all other cases, multitasking is inherently inefficient in all domains but one: reactivity. A task can react very quickly and precisely to an external event, that ultimately comes from some external H/W component (be it the internal clock for timers or your WiFi/Ethernet controller for network traffic).
This ability to wait for external events without wasting CPU is what increases the overall CPU efficiency. And that's it.
In terms of other performance parameters (memory consumption, time wasted inside kernel calls, etc), launching a new thread is always a net loss.
In a nutshell, the art of multitasking programming boils down to:
identifying the external I/O flows you will have to handle
taking reactivity requirements into account (remembering that more reactive = less CPU/memory efficient 99% of the time)
setting up handlers for the required events with a reasonable efficiency/ease of maintenance compromise.
Multiprocessor architectures are adding a new level of complexity, since any program can now be seen as a process having a number of external CPUs at hand, that could be used as additional power sources. But your problem does not seem to have anything to do with that.
A measure of multitasking efficiency will ultimately depend on the number of external events a given program is expected to cope with simultaneously and within a given set of reactivity limits.
At last I come to your particular question.
To react to external events, launching a task each time a new twig or bit of dead insect has to be moved around the anthill is a very coarse and inefficient approach.
You have many powerful synchronization tools at your disposal, that will allow you to react to a bunch of asynchronous events from within a single task context with (near) optimal efficiency at (virtually) no cost.
Typically, blocking waits on multiple inputs, like for instance the Unix-flavoured select() or Microsoft's WaitForMultipleObjects() counterpart.
Using these will give you a performance boost incomparably greater than the few dozen CPU cycles you could squeeze out of this task-result-gathering-optimization project of yours.
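To make that concrete, here is a minimal POSIX sketch (not the asker's code; WaitForMultipleObjects would be the Windows analogue) where a single thread blocks on several event sources at once. It only watches stdin, with a one-second timeout standing in for a timer event:
#include <cstdio>
#include <sys/select.h>
#include <unistd.h>

int main() {
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(STDIN_FILENO, &readfds);

        timeval timeout = {1, 0};                              // wake up at least once per second
        int ready = select(STDIN_FILENO + 1, &readfds, NULL, NULL, &timeout);

        if (ready < 0) { perror("select"); return 1; }
        if (ready == 0) { std::printf("timer tick\n"); continue; }

        char buf[256];
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n <= 0) break;                                     // EOF or error: stop
        std::printf("got %zd bytes of input\n", n);
    }
    return 0;
}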
So my answer is: don't bother with optimizing thread setup at all. It's a non-issue.
Your time would be better spent rethinking your architecture so that a handful of well thought out threads could replace the hordes of useless CPU and memory hogs your current design would spawn.
I have a project that I have summarized here with some pseudo code to illustrate the problem.
I do not have a compiler issue and my code compiles well whether it be using boost or pthreads. Remember this is pseudo code designed to illustrate the problem and not directly compilable.
The problem I am having is that for a multithreaded function the memory usage and processing time are always greater than if the same work is done with serial programming, e.g. a for/while loop.
Here is a simplified version of the problem I am facing:
class aproject {
public:
    typedef struct
    {
        char **somedata;
        double output, fitness;
    } entity;

    entity **entity_array;
    int whichthread, numthreads;
    pthread_mutex_t mutexdata;

    aproject() {
        numthreads = 100;
        entity_array = new entity*[numthreads];
        for (int i = 0; i < numthreads; i++) {
            entity_array[i] = new entity;
            entity_array[i]->somedata = new char*[2];
            entity_array[i]->somedata[0] = new char[100];
            entity_array[i]->somedata[1] = new char[100];
        }
        /*.....more memory allocations for entity_array.......*/
        this->initdata();
        this->eval_thread();
    }

    void initdata() {
        /**put zeros and ones in entity_array**/
    }

    float somefunc(char *somedata) {
        float output = countzero(); //some other function not listed
        return output;
    }

    void *thread_function() {
        pthread_mutex_lock(&mutexdata);
        int currentthread = this->whichthread;
        this->whichthread += 1;
        pthread_mutex_unlock(&mutexdata);

        entity *ent = this->entity_array[currentthread];
        double A = somefunc(ent->somedata[0]);
        double B = somefunc(ent->somedata[1]);
        double t4 = anotherfunc(A, B);
        ent->output = t4;
        ent->fitness = sqrt(pow(t4, 2));
        return NULL;
    }

    static void *staticthreadproc(void *p) {
        return reinterpret_cast<aproject*>(p)->thread_function();
    }

    void eval_thread() {
        //use multithreading to evaluate individuals in parallel
        int nthreads = this->numthreads;
        pthread_t threads[nthreads];

        //create threads
        pthread_mutex_init(&this->mutexdata, NULL);
        this->whichthread = 0;
        for (int i = 0; i < nthreads; i++) {
            pthread_create(&threads[i], NULL, &aproject::staticthreadproc, this);
            //printf("creating thread, %d\n",i);
        }
        //join threads
        for (int i = 0; i < nthreads; i++) {
            pthread_join(threads[i], NULL);
        }
    }
};
I am using pthreads here because it works better than boost on machines with less memory.
Each thread is started in eval_thread and terminated there as well. I am using a mutex to ensure every thread starts with the correct index into entity_array, as each thread only applies its work to its respective entity_array element indexed by the variable this->whichthread. This variable is the only thing that needs to be protected by the mutex, as it is updated for every thread and must not be changed by other threads at the same time. You can happily ignore everything apart from thread_function, eval_thread, and staticthreadproc, as they are the only relevant functions; assume that all the other functions apart from initdata are both processor and memory intensive.
So my question is: why is using multithreading in this way more costly in memory and speed than the traditional method of not using threads at all?
I MUST REITERATE: THE CODE IS PSEUDO CODE AND THE PROBLEM ISN'T WHETHER IT WILL COMPILE
Thanks, I would appreciate any suggestions you might have for pthreads and/or boost solutions.
Each thread requires its own call stack, which consumes memory. Every local variable of your function (and of all other functions on the call stack) counts toward that memory.
When creating a new thread, space for its call stack is reserved. I don't know what the default value is for pthreads, but you might want to look into that. If you know you require less stack space than is reserved by default, you might be able to reduce memory consumption significantly by explicitly specifying the desired stack size when spawning the thread.
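A rough sketch of what that looks like with pthreads (the 256 KiB figure is an arbitrary example; PTHREAD_STACK_MIN is the lower bound, and the right value depends on how deep your call chains actually go):
#include <pthread.h>

void *worker(void *) {
    // real per-thread work goes here
    return NULL;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024);   // request a 256 KiB stack instead of the default

    pthread_t tid;
    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}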
As for the performance-part - it could depend on several issues. Generally, you should not expect a performance boost from parallelizing number-crunching operations onto more threads than you have cores (don't know if that is the case here). This might end up being slower, due to the additional overhead of context-switches, increased amount of cache-misses, etc. There are ways to profile this, depending on your platform (for instance, the Visual Studio profiler can count cache-misses, and there are tools for Linux as well).
Creating a thread is quite an expensive operation. If each thread only does a very small amount of work, then your program may be dominated by the time taken to create them all. Also, a large number of active threads can increase the work needed to schedule them, degrading system performance. And, as another answer mentions, each thread requires its own stack memory, so memory usage will grow with the number of threads.
Another issue can be cache invalidation; when one thread writes its results in to its entity structure, it may invalidate nearby cached memory and force other threads to fetch data from higher-level caches or main memory.
You may have better results if you use a smaller number of threads, each processing a larger subset of the data. For a CPU-bound task like this, one thread per CPU is probably best - that means that you can keep all CPUs busy, and there's no need to waste time scheduling multiple threads on each. Make sure that the entities each thread works on are located together in the array, so it does not invalidate those cached by other threads.
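As a sketch of that layout (process_entity() is a hypothetical stand-in for the real per-entity work, and the thread count is taken from the number of online processors):
#include <pthread.h>
#include <unistd.h>

struct Range { int begin, end; };

void process_entity(int /*index*/) {
    // the real processor- and memory-intensive work would go here
}

void *worker(void *arg) {
    Range *r = static_cast<Range *>(arg);
    for (int i = r->begin; i < r->end; ++i)
        process_entity(i);                        // contiguous block of entities per thread
    return NULL;
}

int main() {
    const int numEntities = 100;
    int numThreads = static_cast<int>(sysconf(_SC_NPROCESSORS_ONLN));
    if (numThreads < 1) numThreads = 1;
    if (numThreads > 64) numThreads = 64;

    pthread_t threads[64];
    Range ranges[64];
    const int chunk = (numEntities + numThreads - 1) / numThreads;

    for (int t = 0; t < numThreads; ++t) {
        ranges[t].begin = t * chunk;
        ranges[t].end = (t + 1) * chunk < numEntities ? (t + 1) * chunk : numEntities;
        pthread_create(&threads[t], NULL, worker, &ranges[t]);
    }
    for (int t = 0; t < numThreads; ++t)
        pthread_join(threads[t], NULL);
    return 0;
}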
I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. I.e. It's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores; user should be larger than real, as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {

    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());

    int i, j;

    #pragma omp parallel private(i, j)
    {
        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
When reading this question I thought I had found my answer. It talks about the glibc implementation of rand() synchronizing calls to itself to preserve the random number generator's state between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand(), replacing them with a single value, but using multiple threads was still slower. EDIT: oops, turns out I didn't test this correctly; it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, it was stupid of me not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for compute pixel:
class Camera {
    // constructors, destructors
private:
    // this is the array that is being written to, but not read from.
    Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {
    Colour col;

    // simple code to construct the appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses the scene.

    _sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays); could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on Stack Overflow are all local, i.e. they deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works: there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
for (int i = 0; i < maxThreadNum; ++i) {
randThreadStates[i].reset(new unsigned int(std::rand()));
}
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
int i = omp_get_thread_num();
return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
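A usage sketch (the Monte Carlo-style averaging loop is just an illustration, not code from the question; compile with OpenMP enabled, e.g. -fopenmp):
#include <cstdio>
#include <cstdlib>
#include "threadrand.h"

int main() {
    init_threadrand();                        // seed one state per possible thread

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; ++i) {
        // each thread only reads and updates its own rand_r state, so there is no locking
        sum += threadrand() / (double)RAND_MAX;
    }
    std::printf("average = %f\n", sum / 1000000.0);
    return 0;
}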
The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There are quite a few factors that could affect the performance of your code; one thing that comes to mind is cache alignment. Perhaps your data structures (you did mention a tree) are not really ideal for caching, and the CPU ends up waiting for data to come from RAM since it cannot fit things into the cache. Wrong cache-line alignment could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out and another will be run.
Your OS thread scheduler is non-deterministic, so when a thread will run is not predictable; if it happens that your threads are not running much, or are contending for CPU cores, this could also slow things down.
Thread affinity also plays a role. A thread will be scheduled on a particular core, and normally the scheduler will try to keep it on the same core. If more than one of your threads ends up on a single core, they will have to share it; another reason things could slow down. For performance reasons, once a particular thread has run on a core it is normally kept there, unless there's a good reason to swap it to another core.
There are some other factors that I don't remember off the top of my head; however, I suggest doing some reading on threading. It's a complicated and extensive subject, and there's lots of material out there.
Is the data being written at the end data that other threads need in order to run computePixel?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are, but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (e.g. rows 1..10, 11..20, 21..30, 31..40). This would greatly reduce the sharing.
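One way to try that, keeping the structure of the posted render() (the chunk size of 32 rows is an arbitrary example, not a measured value):
void Raytracer::render(Camera& cam) {
    cam.setSamplingFunc(getSamplingFunction());

    // Each thread gets contiguous bands of 32 rows, so the pixels written by
    // different threads are far apart in _sensor and rarely share a cache line.
    #pragma omp parallel for schedule(static, 32)
    for (int i = 0; i < cam.height(); ++i) {
        for (int j = 0; j < cam.width(); ++j) {
            cam.computePixel(i, j);
        }
    }
}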
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
I just looked, and the Intel i3-2310M doesn't actually have 4 cores; it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I find in general hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times for my projects.
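(Mirroring the timing runs in the question, that would be export OMP_NUM_THREADS=2; time ./raytracer, which should show whether two threads beat one on this chip.)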
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.