OpenMP and optimising vector operations - c++

I'm running an algorithm at the moment that is very heavy but extremely parallel.
I've been looking at ways to speed it up and I've noticed that the slowest operation I have is my VecAdd function (It gets called thousands of times on a 6000 or so wide vector).
It is implemented as follows:
bool VecAdd( float* pOut, const float* pIn1, const float* pIn2, unsigned int num )
{
    for( int idx = 0; idx < num; idx++ )
    {
        pOut[idx] = pIn1[idx] + pIn2[idx];
    }
    return true;
}
It's a very simple loop, but all the additions can be performed in parallel. My first optimisation option is to move over to using SIMD, as I can easily get a near 4 times speed-up doing this.
However I'm also interested in the possibility of using OpenMP and having it automatically thread the for loop (potentially giving me a further 4x speedup for a total of 16x with SIMD).
However it really runs slowly. With the loop straight it takes around 3.2 seconds to process my example data. If I insert
#pragma omp parallel for
before the for loop I was assuming it would farm out several blocks of additions to other threads.
Unfortunately the result is that it takes ~7 seconds to process my example data.
Now I understand that a lot of my problem here will be caused by overheads with setting up threads and so forth but I'm still surprised just how much slower it makes things run.
Is it possible to speed this up by somehow setting up the thread pool in advance or will I never be able to combat these overheads?
Any thoughts or advice as to whether I can thread this nicely with OpenMP will be much appreciated!

Your loop should parallelize fine with the #pragma omp parallel for.
However, I think the problem is that you shouldn't parallelize at that level. You said that the function gets called thousands of times, but only operates on 6000 floats. Parallelize at the higher level, so that each thread is responsible for thousands/4 calls to VecAdd. Right now you have this algorithm:
serial execution
(re) start threads
do short computation
synchronize threads (at the end of the for loop)
back to serial code
Change it so that it's parallel at the highest possible level.
Memory bandwidth of course matters, but there is no way it would result in slower than serial execution.
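A minimal sketch of what parallelising at the higher level could look like, assuming the thousands of VecAdd calls are independent of one another and can be collected into a batch. AddJob and VecAddBatch are hypothetical names invented for this illustration; they do not come from the question. Without an OpenMP compile flag the pragma is ignored and the batch simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Serial kernel, essentially the VecAdd from the question.
void VecAdd(float* pOut, const float* pIn1, const float* pIn2, std::size_t num)
{
    for (std::size_t idx = 0; idx < num; ++idx)
        pOut[idx] = pIn1[idx] + pIn2[idx];
}

// Hypothetical description of one independent call to VecAdd.
struct AddJob
{
    float* out;
    const float* in1;
    const float* in2;
    std::size_t num;
};

// One parallel region for the whole batch: each thread executes many
// complete VecAdd calls instead of a slice of a single short one.
// Assumes no job writes to memory that another job reads or writes.
void VecAddBatch(const std::vector<AddJob>& jobs)
{
    const long count = static_cast<long>(jobs.size());
    #pragma omp parallel for schedule(static)
    for (long j = 0; j < count; ++j)
        VecAdd(jobs[j].out, jobs[j].in1, jobs[j].in2, jobs[j].num);
}
```

The fork and the final synchronisation are then paid once per batch rather than once per 6000-element addition.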

Related

Multithreading slower than single threading

I am writing a program that does heavy calculations on large arrays in real time. The task can be split into several sub-arrays for multithreading. However, I cannot make it run any faster using threads.
Here is a sample dummy code which was created for demonstration (same problem).
The two-thread version ends up taking 39 seconds, which is a couple of seconds longer than if the arrays were computed one after another(!). It doesn't matter if the arrays are global etc. I also tested using the "thread constructors" only once, but with the same result.
I'm using Xcode (5.1.1) on a MacBook Air (2013 model, Core i5, OS X 10.8.5). Yes, this is an old computer; I rarely program...
So, could you find any mistake in the logic I have in the code or could it be somewhere in the settings of Xcode etc?
#include <ctime>
#include <iostream>
#include <thread>
using namespace std; // the code below uses thread and endl unqualified
class Value
{
public:
    float a[3000000];
};
// n is not declared in the code as posted; it is assumed here to be a
// global shared by both threads, which is what the answers refer to.
float n = 0;
void cycle(Value *val)
{
    int i;
    for (i=0; i<3000000; i++)
    {
        val->a[i]=n;
        n+=0.0001;
    }
}
int main()
{
    Value *val1=new Value, *val2=new Value;
    clock_t start,stop;
    start=clock();
    for (int i=0; i<1000; i++)
    {
        thread first (cycle,val1);
        thread second (cycle,val2);
        first.join();
        second.join();
    }
    stop=clock();
    float tdiff=(((float)stop - (float)start) / 1000000.0F);
    std::cout<<endl<<"This took "<<tdiff<<" seconds...";
    return 0;
}
There is a joke that goes something like this:
One programmer needs 1 day to finish a program. How many days do 10 programmers need for the same program? 10 days.
The work in your code is done in this loop:
for (int i=0; i<1000; i++)
{
    thread first (cycle,val1);
    thread second (cycle,val2);
    first.join();
    second.join();
}
Now consider that spawning and joining threads is overhead. In total, your parallel code does more work than a sequential version would have to do; in general there is no way around that. And you are not creating and joining threads once, but 1000 times, i.e. you add that overhead 1000 times.
Don't expect code to run faster simply by adding more threads to it. I refer you to Amdahl's law or Gustafson's law (which basically states the same thing, just a bit more positively).
I suggest you experiment with sequential vs. threaded code using only one thread to get a feeling for the overhead. You can compare this:
for (int i=0; i<1000; i++)
{
    thread first (cycle,val1);
    first.join();
}
With a sequential version that does not use any threads. You will be surprised by the difference.
You get the most out of multithreading when the threads do lots of work (cf. Amdahl/Gustafson) and when there is no synchronisation between different threads. Joining the threads 1000 times is basically a barrier, where second has to wait, doing nothing, until first is finished. Such barriers are best avoided.
Last but not least, as mentioned in a comment, your benchmark is rather questionable, because you are not using the result of the computations. That is, either you didn't turn on optimizations, which makes the results rather meaningless, or you did turn on optimizations and the compiler might optimize things away without you noticing it. And actually I am not sure whether you are comparing two versions that do the same work, or if perhaps your parallel version is doing twice the work. Moreover, when measuring time you need to take care to measure wall clock time, not CPU time, because CPU time adds up the time spent on multiple cores, while you want to compare wall clock time.
TL;DR: More threads != automatically less runtime.
If you read the documentation for clock, you might notice that it says that time can appear to go faster if the process executes on multiple cores; clock is a total CPU-use approximation, not "wall clock time", and one "CPU tick" on two cores in parallel is the same amount of "time" as two sequential "ticks" on one core.
(By the way: in order to get the time in seconds, you should be dividing by CLOCKS_PER_SEC.)
Using a more appropriate timer, like std::chrono::steady_clock, will show that the sequential variant takes almost twice as long as the multithreaded version.
The difference can be explained completely by the overhead of creating and destroying threads.
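A small sketch of wall-clock timing with std::chrono::steady_clock, as suggested above. The helper name wall_seconds is made up for this example.

```cpp
#include <chrono>

// Runs work() once and returns the elapsed wall-clock time in seconds.
// std::chrono::steady_clock is monotonic and measures elapsed real time,
// not the CPU time that clock() approximates.
template <typename Work>
double wall_seconds(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

The 1000-iteration loop from the question could be wrapped in a lambda and passed to wall_seconds in both the threaded and the sequential variant, so that the two numbers are directly comparable.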
Contrary to what the other comments suggest, n is not the culprit, and neither are the sequential joins.
Each thread performs the same operation cycle, so how could there be any improvement?
You have to manually split the workload between the two threads, so that each one works on half the data.
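A rough sketch of such a manual split, assuming a single array whose two halves can be filled independently. fill_range and fill_with_two_threads are hypothetical names, and the ramp is only a stand-in for the work done in cycle(). With GCC or Clang this typically needs to be built with -pthread.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Fills data[begin, end) with a ramp; a stand-in for the question's cycle().
void fill_range(std::vector<float>& data, std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i)
        data[i] = static_cast<float>(i) * 0.0001f;
}

// Splits one array between two threads so that each does half of the work.
// The two ranges do not overlap, so no synchronisation is needed until join().
void fill_with_two_threads(std::vector<float>& data)
{
    const std::size_t mid = data.size() / 2;
    std::thread first(fill_range, std::ref(data), std::size_t{0}, mid);
    std::thread second(fill_range, std::ref(data), mid, data.size());
    first.join();
    second.join();
}
```

Each value is derived from its index rather than from a shared counter, so the two threads never write to the same variable.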

Performance issues of multiple independent for loop with openMp

I am planning to use OpenMP threads for an intensive computation. However, I did not get the performance I expected in my first trial. I suspect there are several issues, but I have not confirmed any of them yet. In general, I think the performance bottleneck is caused by the fork-join model. Can you help me in some way?
First, in a routine cycle running on a consumer thread, there are two independent for loops and some additional functions. The functions are located at the end of the routine cycle and between the for loops, as shown below:
void routineFunction(short* xs, float* xf, float* yf, float* h)
{
    // Casting
    #pragma omp parallel for
    for (int n = 0; n<1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }
    memset(yf,0,1024*1024*sizeof( float ));
    // Filtering
    #pragma omp parallel for
    for (int n = 0; n<1024*1024-1024; n++)
    {
        for(int nn = 0; nn<1024; nn++)
        {
            yf[n]+=xf[n+nn]*h[nn];
        }
    }
    status = DftiComputeBackward(hand, yf, yf); // Compute backward transform
}
Note: This code cannot be compiled as is, because I removed details to make it more readable.
The number of OpenMP threads is set to 8 dynamically. I observed the threads in use in the Windows taskbar. Although the number of threads increased significantly, I did not observe any performance improvement. I have some guesses, but I would still like to discuss them with you before going further.
My questions are these.
Does the fork and join model correspond to thread creation and termination? Does it have the same cost for the software?
Once routineFunction is called by the consumer, does OpenMP fork and join threads every time?
While routineFunction is running, does OpenMP fork and join at each for loop, or does the compiler let the second loop reuse the existing threads? If the two loops cause two forks and joins, how should I restructure the code? Is combining the two loops into a single loop sensible, or is a parallel region (#pragma omp parallel) with #pragma omp for (instead of #pragma omp parallel for) a better way to share the work? My concern is that this would force me into static scheduling based on thread id and thread count. According to the document at page 34, static scheduling can cause load imbalance. I am familiar with static scheduling from CUDA programming, but I would still like to avoid it if it has performance issues. I also read an answer on Stack Overflow by Alexey Kukanov, whose last paragraph points out that smart OpenMP implementations do not join the master thread after a parallel region is completed. How can I use the busy-wait and sleep settings of OpenMP to avoid joining the master thread after the first loop is completed?
Is there another reason for the performance issue in the code?
This is mostly memory-bound code. Its performance and scalability are limited by the amount of data the memory channel can transfer per unit time. xf and yf take 8 MiB in total, which fits in the L3 cache of most server-grade CPUs but not of most desktop or laptop CPUs. If two or three threads are already able to saturate the memory bandwidth, adding more threads is not going to bring additional performance. Also, casting short to float is a relatively expensive operation - 4 to 5 cycles on modern CPUs.
Does the fork and join model correspond to thread creation and termination? Does it have the same cost for the software?
Once routineFunction is called by the consumer, does OpenMP fork and join threads every time?
No, basically all OpenMP runtimes, including that of MSVC++, implement parallel regions using thread pools, as this is the easiest way to satisfy the requirement of the OpenMP specification that thread-private variables retain their value between different parallel regions. Only the very first parallel region suffers the full cost of starting new threads. Subsequent regions reuse those threads, and an additional price is paid only if more threads are needed than in any of the previously executed parallel regions. There is still some overhead, but it is far lower than that of starting new threads each time.
While routineFunction is running, does OpenMP fork and join at each for loop, or does the compiler let the second loop reuse the existing threads?
Yes, in your case two separate parallel regions are created. You can manually merge them into one:
#pragma omp parallel
{
    #pragma omp for
    for (int n = 0; n<1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }
    #pragma omp single
    {
        memset(yf,0,1024*1024*sizeof( float ));
        //
        // Other code that was between the two parallel regions
        //
    }
    // Filtering
    #pragma omp for
    for (int n = 0; n<1024*1024-1024; n++)
    {
        for(int nn = 0; nn<1024; nn++)
        {
            yf[n]+=xf[n+nn]*h[nn];
        }
    }
}
Is there another reason for performance issue in the code?
It is memory-bound, or at least the two loops shown here are.
Alright, it's been a while since I did OpenMP stuff so hopefully I didn't mess any of this up... but here goes.
Forking and joining is the same thing as creating and destroying threads. How the cost compares to other threads (such as a C++11 thread) will be implementation dependent. I believe in general OpenMP threads might be slightly lighter-weight than C++11 threads, but I'm not 100% sure about that. You'd have to do some testing.
Currently each time routineFunction is called you will fork for the first for loop, join, do a memset, fork for the second loop, join, and then call DftiComputeBackward
You would be better off creating a parallel region as you stated. Not sure why the scheduling is an extra concern. It should be as easy as moving your memset to the top of the function, starting a parallel region using your noted command, and making sure each for loop is marked with #pragma omp for as you mentioned. You may need to put an explicit #pragma omp barrier in between the two for loops to make sure all threads finish the first for loop before starting the second... OpenMP has some implicit barriers but I forgot if #pragma omp for has one or not.
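On the barrier question: as far as the OpenMP specification goes, a #pragma omp for worksharing loop ends with an implicit barrier unless a nowait clause is added, so an explicit #pragma omp barrier between two such loops should not be necessary. The sketch below illustrates this with a simplified pair of dependent loops; cast_then_scale and its parameters are invented for the example and are not the filter from the question.

```cpp
#include <cstddef>
#include <vector>

// Two dependent loops inside one parallel region. The implicit barrier at
// the end of the first "omp for" guarantees that xf is completely written
// before any thread starts reading it in the second loop.
// Assumes xf and yf hold at least xs.size() elements.
void cast_then_scale(const std::vector<short>& xs, std::vector<float>& xf,
                     std::vector<float>& yf, float gain)
{
    const long n = static_cast<long>(xs.size());
    #pragma omp parallel
    {
        #pragma omp for
        for (long i = 0; i < n; ++i)
            xf[i] = static_cast<float>(xs[i]);
        // implicit barrier here (a "nowait" clause would remove it)

        #pragma omp for
        for (long i = 0; i < n; ++i)
            yf[i] = gain * xf[n - 1 - i];   // may read elements written by other threads
    }
}
```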
Make sure that the OpenMP compile flag is turned on for your compiler. If it isn't, the pragmas will be ignored, it will compile, and nothing will be different.
Your operations are prime for SIMD acceleration. You might want to see if your compiler supports auto-vectorization and if it is doing it. If not, I'd look into SIMD a bit, perhaps using intrinsics.
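As a hedged illustration of the SIMD suggestion, the casting loop could be given an explicit vectorisation hint. This is only a sketch: cast_to_float is a made-up name, #pragma omp simd requires OpenMP 4.0 or later (compilers that do not understand it ignore it), and __restrict is a widely supported compiler extension rather than standard C++.

```cpp
// Converts count shorts to floats with a vectorisation hint.
// __restrict promises the compiler that the two arrays do not overlap,
// which removes one common obstacle to vectorising the loop.
void cast_to_float(const short* __restrict xs, float* __restrict xf, long count)
{
    #pragma omp simd
    for (long n = 0; n < count; ++n)
        xf[n] = static_cast<float>(xs[n]);
}
```

Whether the loop was actually vectorised is best checked in the compiler's optimisation report or in the generated assembly.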
How much time does DftiComputeBackward take relative to this code?

Performance problems using OpenMP in nested loops

I'm using the following code, which contains an OpenMP parallel for loop nested in another for loop. Somehow this code is 4 times slower than the sequential version (omitting #pragma omp parallel for).
Is it possible that OpenMP has to create threads every time the method is called? In my test it is called 10000 times in direct succession.
I heard that sometimes OpenMP will keep the threads spinning. I also tried setting OMP_WAIT_POLICY=active and GOMP_SPINCOUNT=INFINITE. When I remove the OpenMP pragmas, the code is about 10 times faster. Note that the method containing this code will be called 10000 times.
for (round k = 1; k < processor.max; ++k) {
    initialise_round(k);
    for (std::vector<int> bucket : color_buckets) {
        #pragma omp parallel for schedule (dynamic)
        for (int i = 0; i < bucket.size(); ++i) {
            if (processor.mark.is_marked_item(bucket[i])) {
                processor.process(k, bucket[i]);
            }
        }
        processor.finish_round(k);
    }
}
You say that your sequential code is much faster, which makes me think that your processor.process function executes too few instructions and runs too briefly. In that case handing the data to each thread does not pay off (the data exchange overhead is simply larger than the actual computation on that thread).
Other than that, I think that parallelizing the middle loop won't affect the algorithm but will increase the amount of work per thread.
I think you are creating a team of threads on each iteration of the loop... (although I'm not sure what for alone does - I thought it should be parallel for). In this case, it would probably be better to separate the parallel from the for so the work of forking and creating the threads is done just once rather than being repeated in the other loops. So you could try to put a parallel pragma before your outermost loop so the overhead of forking and thread creation is just done once.
The actual problem was not related to OpenMP directly.
As the system has two CPUs, half of the threads were spawned on one CPU and the other half on the other. Therefore they did not share an L3 cache. In combination with the fact that the algorithm doesn't scale well, this led to a performance decrease, especially when using 2-4 threads.
The solution was to use thread pinning, for example via the Linux tool taskset.

OpenMP and C++ parallel for loop: why does my code slow down when using OpenMP?

I have a simple question about using OpenMP (with C++) that I hoped someone could help me with. I've included a small example below to illustrate my problem.
#include<iostream>
#include<vector>
#include<ctime>
#include<omp.h>
using namespace std;
int main(){
    srand(time(NULL));//Seed random number generator
    vector<int>v;//Create vector to hold random numbers in interval [0,9]
    vector<int>d(10,0);//Vector to hold counts of each integer initialized to 0
    for(int i=0;i<1e9;++i)
        v.push_back(rand()%10);//Push back random numbers [0,9]
    clock_t c=clock();
    #pragma omp parallel for
    for(int i=0;i<v.size();++i)
        d[v[i]]+=1;//Count number stored at v[i]
    cout<<"Seconds: "<<(clock()-c)/CLOCKS_PER_SEC<<endl;
    for(vector<int>::iterator i=d.begin();i!=d.end();++i)
        cout<<*i<<endl;
    return 0;
}
The above code creates a vector v that contains 1 billion random integers in the range [0,9]. Then, the code loops through v counting how many instances of each different integer there is (i.e., how many ones are found in v, how many twos, etc.)
Each time a particular integer is encountered, it is counted by incrementing the appropriate element of a vector d. So, d[0] counts how many zeroes, d[6] counts how many sixes, and so on. Make sense so far?
My problem is when I try to make the counting loop parallel. Without the #pragma OpenMP statement, my code takes 20 seconds, yet with the pragma it takes over 60 seconds.
Clearly, I've misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone explain my error please or point me in the direction of some insightful literature with appropriate keywords to help my search?
Your code exhibits:
race conditions due to unsynchronised access to a shared variable
false and true sharing cache problems
wrong measurement of run time
Race conditions arise because you are concurrently updating the same elements of vector d in multiple threads. Comment out the srand() line and run your code several times with the same number of threads (but with more than one thread). Compare the outputs from different runs.
False sharing occurs when two threads write to memory locations that are close enough to one another to end up on the same cache line. This results in the cache line constantly bouncing from core to core (or from CPU to CPU in multisocket systems) and in an excess of cache coherency messages. With 32 bytes per cache line, 8 elements of the vector fit in one cache line. With 64 bytes per cache line, the whole vector d fits in one cache line. This makes the code slow on Core 2 processors and slightly slower (but not as slow as on Core 2) on Nehalem and post-Nehalem (e.g. Sandy Bridge) ones. True sharing occurs at those elements that are accessed by two or more threads at the same time. You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to elements of d (faster or slower, depending on your OpenMP runtime), or accumulate local values and then do a final synchronised reduction (fastest). The first one is implemented like this:
#pragma omp parallel for
for(int i=0;i<v.size();++i)
    #pragma omp atomic
    d[v[i]]+=1;//Count number stored at v[i]
The second is implemented like this:
omp_lock_t locks[10];
for (int i = 0; i < 10; i++)
    omp_init_lock(&locks[i]);
#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
    int vv = v[i];
    omp_set_lock(&locks[vv]);
    d[vv]+=1;//Count number stored at v[i]
    omp_unset_lock(&locks[vv]);
}
for (int i = 0; i < 10; i++)
    omp_destroy_lock(&locks[i]);
(include omp.h to get access to the omp_* functions)
I leave it up to you to come up with an implementation of the third option.
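For reference, one possible shape of the third option (local counts followed by a synchronised reduction) is sketched below. It is not the answerer's implementation; count_values and its parameters are names chosen for this example, and the input is assumed to contain only values in [0, bins).

```cpp
#include <cstddef>
#include <vector>

// Counts how often each value in [0, bins) occurs in v.
// Each thread fills a private histogram; the private histograms are merged
// into the shared result once per thread inside a critical section.
// Assumes every element of v lies in [0, bins).
std::vector<long> count_values(const std::vector<int>& v, std::size_t bins)
{
    std::vector<long> total(bins, 0);
    const long n = static_cast<long>(v.size());
    #pragma omp parallel
    {
        std::vector<long> local(bins, 0);   // private to this thread

        #pragma omp for nowait
        for (long i = 0; i < n; ++i)
            ++local[v[i]];

        // Only this short merge is serialised, once per thread.
        #pragma omp critical
        for (std::size_t b = 0; b < bins; ++b)
            total[b] += local[b];
    }
    return total;
}
```

The hot loop touches only thread-private memory, so it has neither races nor sharing of the result's cache line; the shared vector is written just bins times per thread.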
You are measuring elapsed time using clock(), but it measures CPU time, not run time. If you have one thread running at 100% CPU usage for 1 second, then clock() would indicate an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() would indicate an increase in CPU time of 8 seconds (that is, 8 threads times 1 CPU second per thread). Use omp_get_wtime() or gettimeofday() (or some other high resolution timer API) instead.
EDIT
Once your race condition is resolved via correct synchronization, the following paragraph applies; until then, the data races unfortunately make speed comparisons moot:
Your program is slowing down because there are 10 possible output elements, which are accessed randomly inside the parallel section. The threads cannot safely update any of those elements without a lock (which you would need to provide via synchronization), and that locking gives your threads more overhead than you gain from counting in parallel.
A solution that speeds this up is to give each OpenMP thread a local variable that counts all of the 0-9 values that particular thread has seen, and then to sum those up in the master count vector. This is easily parallelized and much faster, as the threads don't need to lock on a shared write vector. I would expect close to an Nx speed-up, where N is the number of OpenMP threads, as very little locking should be required. This solution also avoids a lot of the race conditions currently in your code.
See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more details on thread local OpenMP

Splitting up a program into 4 threads is slower than a single thread

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.
Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. I.e. It's the same compiled code on both runs.
Running the program with time:
> export OMP_NUM_THREADS=1; time ./raytracer
real 0m34.344s
user 0m34.310s
sys 0m0.008s
> export OMP_NUM_THREADS=4; time ./raytracer
real 0m53.189s
user 0m20.677s
sys 0m0.096s
User time is a lot smaller than real, which is unusual when using multiple cores- user should be larger than real as several cores are running at the same time.
Code that I have parallelized using OpenMP
void Raytracer::render( Camera& cam ) {
    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());
    int i, j;
    #pragma omp parallel private(i, j)
    {
        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
When reading this question I thought I had found my answer. It talks about the glibc implementation of rand() synchronizing calls to itself to preserve the state for random number generation between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand, replacing them with a single value, but using multiple threads is still slower. EDIT: oops, it turns out I didn't test this correctly; it was the random values!
Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.
In my raytracer I essentially have a scene tree, with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or to any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part that I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere, since no two threads can write to the same cell in the pixel array.
Can anyone suggest places where there could be some kind of contention? Things to try?
Thank you in advance.
EDIT:
Sorry, was stupid not to provide more info on my system.
Compiler gcc 4.6 (with -O2 optimization)
Ubuntu Linux 11.10
OpenMP 3
Intel i3-2310M Quad core 2.1Ghz (on my laptop at the moment)
Code for compute pixel:
class Camera {
    // constructors destructors
private:
    // this is the array that is being written to, but not read from.
    Colour* _sensor; // allocated using new at construction.
};
void Camera::computePixel(int i, int j) const {
    Colour col;
    // simple code to construct appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses scene.
    _sensor[i*_scrWidth+j] += col;
}
From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays)- could this cause these problems?
Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.
As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.
I have tested the code, and it works- there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.
threadrand.h
#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_
// max number of thread states to store
const int maxThreadNum = 100;
void init_threadrand();
// requires openmp, for thread number
int threadrand();
#endif // _THREAD_RAND_H_
threadrand.cpp
#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>
// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
    for (int i = 0; i < maxThreadNum; ++i) {
        randThreadStates[i].reset(new unsigned int(std::rand()));
    }
}
// requires openmp, for thread number, to index into array of states.
int threadrand() {
    int i = omp_get_thread_num();
    return rand_r(randThreadStates[i].get());
}
Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
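As an alternative sketch (not the code above, and not tested inside the raytracer), C++11 thread_local storage can give each thread its own generator without a fixed-size state array or a dependency on the OpenMP thread number. thread_engine and thread_rand are hypothetical names, and seeding from std::random_device is an assumption made for the example.

```cpp
#include <random>

// Each thread lazily constructs its own engine the first time it calls this
// function, so no generator state is shared and no locking is required.
// Seeding from std::random_device is an assumption; a reproducible run
// would need a deterministic per-thread seed instead.
std::mt19937& thread_engine()
{
    thread_local std::mt19937 engine{std::random_device{}()};
    return engine;
}

// Uniform integer in [lo, hi], drawn from the calling thread's engine.
int thread_rand(int lo, int hi)
{
    std::uniform_int_distribution<int> dist(lo, hi);
    return dist(thread_engine());
}
```

Because every engine is a separate thread-local object, the generator states are not packed next to each other in one shared array the way a plain array of seeds would be.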
The answer is, without knowing what machine you're running this on, and without really seeing the code of your computePixel function, that it depends.
There are quite a few factors that could affect the performance of your code. One thing that comes to mind is cache alignment. Perhaps your data structures (and you did mention a tree) are not really ideal for caching, and the CPU ends up waiting for the data to come from RAM, since it cannot fit things into the cache. Wrong cache-line alignment could cause something like that. If the CPU has to wait for things to come from RAM, it is likely that the thread will be context-switched out and another will be run.
Your OS thread scheduler is non-deterministic, therefore, when a thread will run is not a predictable thing, so if it so happens that your threads are not running a lot, or are contending for CPU cores, this could also slow things down.
Thread affinity also plays a role. A thread will be scheduled on a particular core, and normally the scheduler will try to keep this thread on the same core. If more than one of your threads is running on a single core, they will have to share that core, which is another reason things could slow down. For performance reasons, once a particular thread has run on a core, it is normally kept there, unless there's a good reason to swap it to another core.
There's some other factors, which I don't remember off the top of my head, however, I suggest doing some reading on threading. It's a complicated and extensive subject. There's lots of material out there.
Is the data being written at the end, data that other threads need to be able to do computePixel ?
One strong possibility is false sharing. It looks like you are computing the pixels in sequence, thus each thread may be working on interleaved pixels. This is usually a very bad thing to do.
What could be happening is that each thread is trying to write the value of a pixel beside one written in another thread (they all write to the sensor array). If these two output values share the same CPU cache-line this forces the CPU to flush the cache between the processors. This results in an excessive amount of flushing between CPUs, which is a relatively slow operation.
To fix this you need to ensure that each thread truly works on an independent region. Right now it appears you divide on rows (I'm not positive since I don't know OMP). Whether this works depends on how big your rows are -- but still the end of each row will overlap with the beginning of the next (in terms of cache lines). You might want to try breaking the image into four blocks and have each thread work on a series of sequential rows (for like 1..10 11..20 21..30 31..40). This would greatly reduce the sharing.
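The band idea can be sketched as a small helper that computes which contiguous rows a given thread would own. row_band is a name invented for this illustration, and the split shown is just one reasonable way to distribute the remainder rows.

```cpp
#include <utility>

// Half-open range [first, last) of rows assigned to one thread when
// `height` rows are divided into `threads` contiguous bands.
// The first (height % threads) bands receive one extra row.
std::pair<int, int> row_band(int thread, int threads, int height)
{
    const int base  = height / threads;
    const int extra = height % threads;
    const int first = thread * base + (thread < extra ? thread : extra);
    const int last  = first + base + (thread < extra ? 1 : 0);
    return {first, last};
}
```

With OpenMP, a similar contiguous split is what schedule(static) without a chunk size produces, so replacing schedule(dynamic, 4) in the render loop would be a way to try this without computing the bands by hand. Neighbouring bands then meet at only threads - 1 row boundaries.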
Don't worry about reading constant data. So long as the data block is not being modified each thread can read this information efficiently. However, be leery of any mutable data you have in your constant data.
I just looked, and the Intel i3-2310M doesn't actually have 4 cores; it has 2 cores and hyper-threading. Try running your code with just 2 threads and see if that helps. I find in general hyper-threading is totally useless when you have a lot of calculations, and on my laptop I turned it off and got much better compilation times for my projects.
In fact, just go into your BIOS and turn off HT -- it's not useful for development/computation machines.