Before I start, let me say that I've only used threads once when we were taught about them in university. Therefore, I have almost zero experience using them and I don't know if what I'm trying to do is a good idea.
I'm doing a project of my own and I'm trying to make a for loop run fast because I need the calculations in the loop for a real-time application. After "optimizing" the calculations in the loop, I've gotten closer to the desired speed. However, it still needs improvement.
Then I remembered threading. I thought I could make the loop run even faster if I split it into 4 parts, one for each core of my machine. So this is what I tried to do:
#include <thread>

void doYourThing(int size, int threadNumber, int numOfThreads) {
    int start = (threadNumber - 1) * size / numOfThreads;
    int end = threadNumber * size / numOfThreads;
    for (int i = start; i < end; i++) {
        // Calculations...
    }
}

int main(void) {
    int size = 100000;
    int numOfThreads = 4;
    int start = 0;
    int end = size / numOfThreads;
    std::thread coreB(doYourThing, size, 2, numOfThreads);
    std::thread coreC(doYourThing, size, 3, numOfThreads);
    std::thread coreD(doYourThing, size, 4, numOfThreads);
    for (int i = start; i < end; i++) {
        // Calculations...
    }
    coreB.join();
    coreC.join();
    coreD.join();
}
With this, computation time changed from 60ms to 40ms.
Questions:
1) Do my threads really run on different cores? If that's true, I would expect a greater increase in speed; more specifically, I assumed it would take close to 1/4 of the initial time.
2) If they don't, should I use even more threads to split the work? Will that make my loop faster or slower?
(1).
The question @François Andrieux asked is a good one: in the original code there is a well-structured for loop, and if you compile with -O3 the compiler may be able to vectorize the computation. That vectorization alone will give you a speedup.
It also depends on what the critical path of your computation is. According to Amdahl's law, the achievable speedup is limited by the non-parallelisable part: with a parallel fraction p and N threads, the speedup is at most 1 / ((1 - p) + p/N). You should also check whether the computation touches a variable protected by a lock; in that case time can also be spent spinning on the lock.
(2). To find out the total number of cores and threads on your machine you can use the lscpu command, which shows the core and thread information for your computer/server.
(3). It is not necessarily true that more threads will give better performance.
There is a header-only library on GitHub which may be just what you need. Presumably your doYourThing processes an input vector (of size 100000 in your code) and stores the results in another vector. In that case, all you need to do is say
auto vectorOut = Lazy::runForAll(vectorIn, myFancyFunction);
The library will decide how many threads to use based on how many cores you have.
On the other hand, if the compiler is able to vectorize your algorithm and it still looks like it is a good idea to split the work into 4 chunks like in your example code, you could do it for example like this:
#include "Lazy.h"
void doYourThing(const MyVector& vecIn, int from, int to, MyVector& vecOut)
{
    for (int i = from; i < to; ++i) {
        // Calculate vecOut[i]
    }
}

int main(void) {
    int size = 100000;
    MyVector vecIn(size), vecOut(size);
    // Load vecIn vector with input data...
    Lazy::runForAll({std::pair{0, size/4}, {size/4, size/2}, {size/2, 3*size/4}, {3*size/4, size}},
        [&](auto indexPair) {
            doYourThing(vecIn, indexPair.first, indexPair.second, vecOut);
        });
    // Now the results are in vecOut
}
README.md gives further examples on parallel execution which you might find useful.
#include <math.h>
#include <sstream>
#include <iostream>
#include <mutex>
#include <stdlib.h>
#include <chrono>
#include <thread>
bool isPrime(int number) {
int i;
for (i = 2; i < number; i++) {
if (number % i == 0) {
return false;
}
}
return true;
}
std::mutex myMutex;
int pCnt = 0;
int icounter = 0;
int limit = 0;
int getNext() {
std::lock_guard<std::mutex> guard(myMutex);
icounter++;
return icounter;
}
void primeCnt() {
std::lock_guard<std::mutex> guard(myMutex);
pCnt++;
}
void primes() {
while (getNext() <= limit)
if (isPrime(icounter))
primeCnt();
}
int main(int argc, char *argv[]) {
std::stringstream ss(argv[2]);
int tCount;
ss >> tCount;
std::stringstream ss1(argv[4]);
int lim;
ss1 >> lim;
limit = lim;
auto t1 = std::chrono::high_resolution_clock::now();
std::thread *arr;
arr = new std::thread[tCount];
for (int i = 0; i < tCount; i++)
arr[i] = std::thread(primes);
for (int i = 0; i < tCount; i++)
arr[i].join();
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << "Primes: " << pCnt << std::endl;
std::cout << "Program took: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() <<
" milliseconds" << std::endl;
return 0;
}
Hello, I'm trying to find the number of primes in a user-specified range (e.g., 1-1000000) with a user-specified number of threads to speed up the process. However, it seems to take the same amount of time for any number of threads compared to a single thread. I'm not sure whether it's supposed to be that way or whether there's a mistake in my code. Thank you in advance!
You don't see a performance gain because the time spent in isPrime() is much smaller than the time the threads spend fighting over the mutex.
One possible solution is to use atomic operations, as @The Badger suggested. The other way is to partition your task into smaller ones and distribute them over your thread pool.
For example, if you have n threads, then each thread should test the numbers from i*(limit/n) to (i+1)*(limit/n), where i is the thread number. This way you wouldn't need to do any synchronization at all, and your program would (theoretically) scale linearly.
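A rough sketch of that partitioning idea (the helper names are illustrative, and isPrime is the function from the question; each thread keeps a purely local count, and the results are only combined after the joins):
#include <functional>
#include <thread>
#include <vector>

// Each thread tests its own contiguous range with a local counter,
// so no mutex is needed while the work is running.
void countRange(int from, int to, int &result) {
    int local = 0;
    for (int n = from; n <= to; ++n)
        if (isPrime(n))        // isPrime from the question, unchanged
            ++local;
    result = local;            // each thread writes only to its own slot
}

int countPrimes(int limit, int numThreads) {
    std::vector<std::thread> workers;
    std::vector<int> counts(numThreads, 0);
    for (int i = 0; i < numThreads; ++i) {
        int from = i * (limit / numThreads) + 1;
        int to   = (i == numThreads - 1) ? limit : (i + 1) * (limit / numThreads);
        workers.emplace_back(countRange, from, to, std::ref(counts[i]));
    }
    for (auto &t : workers) t.join();
    int total = 0;
    for (int c : counts) total += c;
    return total;
}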
Multithreaded algorithms work best when threads can do a lot of work on their own.
Imagine doing this in real life: you have a group of 20 humans that will do work for you, and you want them to test whether each number up to 1000 is prime. How will you do this?
Would you hand each person a single number at a time, and ask them to come back to you to tell you whether it's prime and to receive another number?
Surely not; you would give each person a bunch of numbers to work on at once, and have them come back and tell you how many were prime and to receive another bunch of numbers.
Maybe you'd even divide the entire set of numbers into 20 groups and tell each person to work on one group (but then you run the risk of one person being slow, with everyone else sitting idle while you wait for that one person to finish... there are so-called "work stealing" algorithms to handle this, but that's complicated).
The same thing applies here; you want each thread to do a lot of work on its own and keep its own tally, and only have to check back with the centralized information once in a while.
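To make the "hand out a bunch at a time" idea concrete, here is a rough sketch (the names and the batch size are just illustrative, and isPrime is the function from the question); each thread reserves a whole batch from a shared atomic cursor, counts locally, and updates the shared tally only once per batch:
#include <atomic>

std::atomic<int> nextCandidate{2};
std::atomic<int> primeTotal{0};

void worker(int limit, int batchSize) {
    for (;;) {
        int first = nextCandidate.fetch_add(batchSize); // reserve a whole batch
        if (first > limit) break;
        int local = 0;
        for (int n = first; n < first + batchSize && n <= limit; ++n)
            if (isPrime(n))                             // isPrime from the question
                ++local;
        primeTotal += local;                            // one shared update per batch
    }
}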
A better solution would be to use the Sieve of Atkin to find the primes (even the Sieve of Eratosthenes, which is easier to understand, is better); your basic algorithm is very poor to start with. For every number n in your interval it does up to n checks to determine whether n is prime, and it does this limit times. This means you're doing roughly limit*limit/2 checks - that's what we call O(n^2) complexity. The Sieve of Atkin, on the other hand, only has to do about O(n) operations to find all primes. If n is large, it is hard to beat the algorithm with fewer steps by performing the steps faster. Trying to fix a poor algorithm by throwing more resources at it is a bad strategy.
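For comparison, a minimal Sieve of Eratosthenes sketch (single-threaded, roughly O(n log log n) work instead of repeated trial division):
#include <vector>

int countPrimesSieve(int limit) {
    std::vector<bool> composite(limit + 1, false);
    int count = 0;
    for (int n = 2; n <= limit; ++n) {
        if (!composite[n]) {
            ++count;                        // n is prime
            for (long long m = (long long)n * n; m <= limit; m += n)
                composite[m] = true;        // cross out its multiples
        }
    }
    return count;
}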
Another problem with your implementation is that it has race conditions and is therefore broken to start with. There's often little use in optimizing something unless you first make sure it's working correctly. The problem is in the primes function:
void primes() {
while (getNext() <= limit)
if( isPrime(icounter) )
primeCnt();
}
Between the getNext() call and the isPrime(icounter) call another thread may have increased icounter, causing the program to skip candidates (or test some twice). This is why the program gives a different result each time. In addition, icounter is read in isPrime(icounter) without holding the mutex, so that read is itself a data race.
Since the problem is CPU-intensive, that is, almost all of the time is spent executing CPU instructions, multithreading won't help unless you have multiple CPUs (or cores) on which the OS schedules threads of the same process. This means there is a limit to the number of threads (it can be as low as 1; for example, I see an improvement only up to two threads, beyond that there's none) for which you can expect improved performance. If you have more threads than cores, the OS will just let one thread run for a while on a core, then switch it out and let the next thread execute for a while.
An additional problem when scheduling threads on different cores is that each core may have a separate cache (which is faster than the shared cache). In effect, if two threads access the same memory, the separate caches have to be flushed as part of synchronizing the data involved, and this can be time-consuming.
That is, you have to strive to keep the data the different threads work on separate and minimize frequent use of shared variables. In your example this means avoiding the global data as much as possible. The counter, for example, only needs to be accessed when the counting has finished (to add the thread's contribution to the count). You could also minimize the use of icounter by not reading it for each candidate, but reserving a bunch of candidates in one go. Something like:
void primes() {
int next;
int count=0;
while( (next = getNext(1000)) <= limit ) {
for( int j = next; j < next+1000 && j <= limit ; j++ ) {
if( isPrime(j) )
count++;
}
}
primeCnt(count);
}
where getNext is the same as before, except that it reserves a number of candidates (by increasing icounter by the supplied count), and primeCnt adds count to pCnt.
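One way those modified helpers could look (a sketch only, reusing the globals and mutex from the question):
int getNext(int count) {
    std::lock_guard<std::mutex> guard(myMutex);
    int first = icounter + 1;   // first candidate reserved for this caller
    icounter += count;          // reserve 'count' candidates in one locked step
    return first;
}

void primeCnt(int count) {
    std::lock_guard<std::mutex> guard(myMutex);
    pCnt += count;              // add the thread's whole tally at once
}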
Consequently you may end up in a situation where a core runs one thread, then after a while switches to another thread, and so on. The result is that you have to run all the code for your problem plus the code for switching between threads. Add to that the extra cache misses you will probably get, and the whole thing may well end up even slower.
Perhaps instead of a mutex, try using an atomic integer for the counter. It might speed things up a bit, though I'm not sure by how much.
#include <atomic>

std::atomic<uint64_t> pCnt;     // made uint64_t for a bigger range, as @IgnisErus mentioned
std::atomic<uint64_t> icounter;

uint64_t getNext() {
    return ++icounter;          // pre-increment is faster
}

void primeCnt() {
    ++pCnt;
}
On benchmarking: most of the time the processor needs to warm up to reach its best performance, so taking the time only once is not always a good representation of the actual performance. Try running the code many times and taking an average. You can also do some heavy work before the measured calculation (a long for loop computing powers of some counter, perhaps?).
Getting accurate benchmark results is also a topic of interest for me since I do not yet know how to do it.
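A small sketch of how the repetition could be done (the helper name is made up; it warms up once, then reports the best wall-clock time over a few runs):
#include <chrono>

template <typename F>
double bestOfRunsMs(F &&work, int runs = 5) {
    work();                                          // warm-up pass, not timed
    double best = 1e300;
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best) best = ms;
    }
    return best;
}
You would call it as bestOfRunsMs([&]{ /* start the threads and join them */ });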
I encountered this weird bug in a C++ multithreaded program on Linux. The multithreaded part basically executes a loop. A single iteration first loads a SIFT file containing some features and then queries these features against a tree. Since I have a lot of images, I used multiple threads to do this querying. Here are the code snippets.
struct MultiMatchParam
{
int thread_id;
float *scores;
double *scores_d;
int *perm;
size_t db_image_num;
std::vector<std::string> *query_filenames;
int start_id;
int num_query;
int dim;
VocabTree *tree;
FILE *file;
};
// multi-thread will do normalization anyway
void MultiMatch(MultiMatchParam &param)
{
// Clear scores
for(size_t t = param.start_id; t < param.start_id + param.num_query; t++)
{
for (size_t i = 0; i < param.db_image_num; i++)
param.scores[i] = 0.0;
DTYPE *keys;
int num_keys;
keys = ReadKeys_sfm((*param.query_filenames)[t].c_str(), param.dim, num_keys);
int normalize = true;
double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
delete [] keys;
}
}
I run this on an 8-core CPU. At first it runs perfectly and the CPU usage is nearly 100% on all 8 cores. After each thread has queried several images (about 20), all of a sudden the performance (CPU usage) drops drastically, down to about 30% across all eight cores.
I suspect the key to this bug lies in this line of code:
double mag = param.tree->MultiScoreQueryKeys(num_keys, normalize, keys, param.scores);
If I replace it with another costly operation (e.g., a large for loop containing sqrt), the CPU usage stays at nearly 100%. This MultiScoreQueryKeys function performs a complex operation on a tree. Since all eight cores may read the same tree (there is no write operation to this tree), I wonder whether the read operation has some kind of blocking effect. But it shouldn't, because I don't have write operations in this function. Also, the operations in the loop are basically the same; if they were going to lower the CPU usage, that would happen in the first few iterations. If you need to see the details of this function or other parts of this project, please let me know.
Use std::async() instead of the zeta::SimpleLock lock.
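The zeta::SimpleLock mentioned here doesn't appear in the code shown in the question, so purely as an illustration of the std::async suggestion, here is a sketch of handing each query range to std::async instead of managing threads by hand (MultiMatch and MultiMatchParam are from the question; how the params vector is filled is assumed):
#include <future>
#include <vector>

void runAllMatches(std::vector<MultiMatchParam> &params) {
    std::vector<std::future<void>> futures;
    for (auto &p : params)                       // one task per query range
        futures.push_back(std::async(std::launch::async,
                                     [&p] { MultiMatch(p); }));
    for (auto &f : futures)
        f.get();                                 // wait for all ranges (rethrows exceptions)
}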
I've been coding in C++ for years, and I've used threads in the past, but I'm just now starting to learn about multithreaded programming and how it actually works.
So far I'm doing okay with understanding the concepts, but one thing has me stumped.
What are parallel for loops, and how do they work?
Can any for loop be made parallel?
What are the uses for them? Performance?
Other functionality?
I can't find anything online that explains it well enough for me to understand.
I code in C++, but I'm sure this question can apply to many different programming languages.
What are parallel for loops, and how do they work?
A parallel for loop is a for loop in which the statements in the loop can be run in parallel: on separate cores, processors or threads.
Let us take a summing code:
unsigned int numbers[] = { 1, 2, 3, 4, 5, 6};
unsigned int sum = 0;
const unsigned int quantity = sizeof(numbers) / sizeof (numbers[0]);
for (unsigned int i = 0; i < quantity; ++i)
{
sum = sum + numbers[i];
};
Calculating a sum does not depend on the order. The sum only cares that all numbers have been added.
The loop could be split into two loops that are executed by separate threads or processors:
// Even loop:
unsigned int even_sum = 0;
for (unsigned int e = 0; e < quantity; e += 2)
{
even_sum += numbers[e];
}
// Odd summation loop:
unsigned int odd_sum = 0;
for (unsigned int odd = 1; odd < quantity; odd += 2)
{
odd_sum += numbers[odd];
}
// Create the sum
sum = even_sum + odd_sum;
The even and odd summing loops are independent of each other. They do not access any of the same memory locations.
The summing for loop can be considered as a parallel for loop because its statements can be run by separate processes in parallel, such as separate CPU cores.
Somebody else can supply a more detailed definition, but this is the general example.
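To make that concrete, here is a minimal sketch of actually running the two halves on separate std::thread objects (reusing numbers and quantity from the snippet above; each lambda writes only to its own local sum, so nothing needs synchronizing until after the joins):
#include <thread>

unsigned int even_sum = 0, odd_sum = 0;
std::thread evenThread([&] {                    // sums the even indices
    for (unsigned int e = 0; e < quantity; e += 2)
        even_sum += numbers[e];
});
std::thread oddThread([&] {                     // sums the odd indices
    for (unsigned int odd = 1; odd < quantity; odd += 2)
        odd_sum += numbers[odd];
});
evenThread.join();
oddThread.join();
unsigned int sum = even_sum + odd_sum;          // combine after both threads finish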
Edit 1:
Can any for loop be made parallel?
No, not every loop can be made parallel. The iterations of the loop must be independent of each other; that is, one CPU core should be able to run one iteration without any side effects on another CPU core running a different iteration.
What are the use for them?
Performance?
In general, the reason is performance. However, the overhead of setting up the loop must be less than the execution time of the iterations. There is also the overhead of waiting for the parallel execution to finish and joining the results together.
Usually data moving and matrix operations are good candidates for parallelism. For example, moving a bitmap or applying a transformation to the bitmap. Huge quantities of data need all the help they can get.
Other functionality?
Yes, there are other possible uses of parallel for loops, such as updating more than one hardware device at the same time. However, the general case is for improving data processing performance.
(I have tried to simplify this as much as I could to find out where I'm doing something wrong.)
The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges) and I try to create 2 threads, each sorting the first half or the second half respectively by calling the function merge_sort() with the corresponding parameters.
On the threaded run, I see the process going to 80-100% CPU usage (on a dual-core CPU) while on the non-threaded run it stays at only 50%, yet the run times are very close.
This is the (relevant) code:
//These are the 2 sorting functions; each thread will call merge_sort(..). Is this a problem? Both threads calling the same (normal) function?
void merge (int *v, int start, int middle, int end) {
//dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
//copies the original values into the 2 halves
//then sorts them back into the v array
}
void merge_sort (int *v, int start, int end) {
//recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
//calls merge(start, middle, end)
}
//here i'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code to find the bug easier)
void* mergesort_t2(void * arg) {
t_data* th_info = (t_data*)arg;
merge_sort(v, th_info->a, th_info->b);
return (void*)0;
}
//in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
//some stuff
//getting the clock to calculate run time
clock_t t_inceput, t_sfarsit;
t_inceput = clock();
//ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
//the a and b are the range of values the created thread will have to sort
pthread_t thread[2];
t_data next_info[2];
next_info[0].crt_depth = 1;
next_info[0].a = 0;
next_info[0].b = n/2;
next_info[1].crt_depth = 1;
next_info[1].a = n/2+1;
next_info[1].b = n-1;
for (int i=0; i<2; i++) {
if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
cerr<<"error\n;";
return err;
}
}
for (int i=0; i<2; i++) {
if (pthread_join(thread[i], &status) != 0) {
cerr<<"error\n;";
return err;
}
}
//now i merge the 2 sorted halves
merge(v, 0, n/2, n-1);
//calculate end time
t_sfarsit = clock();
cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too, as they hinted at this from the beginning.
I've used the inplace_merge(..) function instead of my own, and the program's run times are now as they should be.
Here's my initial merge function (not really sure if it's the initial one; I've probably modified it a few times since. Also, the array indices might be wrong right now, as I went back and forth between [a,b] and [a,b); this was just the last commented-out version):
void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]
    //create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];
    //copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];
    //merge them back together in sorted order
    int is = 0, id = 0;
    for (int i = 0; i <= c-a; i++) {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }
    delete[] st;
    delete[] dr;
}
all this was replaced with:
inplace_merge(v+a, v+m, v+c);
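For reference, std::inplace_merge works on half-open ranges [first, middle) and [middle, last), so with the inclusive v[a..m] / v[m+1..c] convention used by the original merge above, the equivalent call would look like this (just a note on the range convention, not a claim about what fixed the timings):
#include <algorithm>

// merges the sorted sub-ranges v[a..m] and v[m+1..c] in place
std::inplace_merge(v + a, v + m + 1, v + c + 1);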
Edit, some times on my 3ghz dual core cpu:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s
There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.
Note: since OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for sake of those who might find the information useful.
clock() is a wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in case of multiple threads is the sum of CPU time for all threads. You need to measure elapsed, or wallclock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells what API can be used instead of clock().
In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wallclock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for e.g. a network event or a message from another MPI process, it still spends time - but not CPU time.
Update: In Microsoft's implementation of C run-time library, clock() returns wall-clock time, so is OK to use for your purpose. It's unclear though if you use Microsoft's toolchain or something else, like Cygwin or MinGW.
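Regardless of toolchain, a portable way to measure elapsed (wall-clock) time in this program is std::chrono; a minimal sketch matching the question's output format:
#include <chrono>
#include <iostream>

auto t0 = std::chrono::steady_clock::now();
// ... create the threads, run the sort, join the threads ...
auto t1 = std::chrono::steady_clock::now();
std::cout << "Sort time (s): "
          << std::chrono::duration<double>(t1 - t0).count() << std::endl;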
On my laptop with an Intel Pentium dual-core processor T2370 (Acer Extensa) I ran a simple multithreading speedup test. I am using Linux. The code is pasted below. While I was expecting a speedup of 2-3 times, I was surprised to see a slowdown by a factor of 2. I tried the same with gcc optimization levels -O0 ... -O3, but every time I got the same result. I am using pthreads. I also tried the same with only two threads (instead of the 3 threads in the code), but the performance was similar.
What could be the reason? The faster version took reasonably long - about 20 secs - so it seems is not an issue of startup overhead.
NOTE: This code is quite buggy (indeed, it does not make much sense, as the outputs of the serial and parallel versions would differ). The intention was just to get a speedup comparison for the same number of instructions.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
class Thread{
private:
pthread_t thread;
static void *thread_func(void *d){ ((Thread *)d)->run(); return NULL; }
public:
Thread(){}
virtual ~Thread(){}
virtual void run(){}
int start(){return pthread_create(&thread, NULL, Thread::thread_func, (void*)this);}
int wait(){return pthread_join(thread, NULL);}
};
#include <iostream>
const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];
int main(void)
{
class Thread_a:public Thread{
public:
Thread_a(int* a): arr_(a) {}
void run()
{
for(int n = 0; n<N; n++)
for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
}
private:
int* arr_;
};
class Thread_b:public Thread{
public:
Thread_b(int* a): arr_(a) {}
void run()
{
for(int n = 0; n<N; n++)
for(int i=ARR_SIZE/3; i<2*ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
}
private:
int* arr_;
};
class Thread_c:public Thread{
public:
Thread_c(int* a): arr_(a) {}
void run()
{
for(int n = 0; n<N; n++)
for(int i=2*ARR_SIZE/3; i<ARR_SIZE; i++){ arr_[i] += arr_[i-1];}
}
private:
int* arr_;
};
{
Thread *a=new Thread_a(arr);
Thread *b=new Thread_b(arr);
Thread *c=new Thread_c(arr);
clock_t start = clock();
if (a->start() != 0) {
return 1;
}
if (b->start() != 0) {
return 1;
}
if (c->start() != 0) {
return 1;
}
if (a->wait() != 0) {
return 1;
}
if (b->wait() != 0) {
return 1;
}
if (c->wait() != 0) {
return 1;
}
clock_t end = clock();
double duration = (double)(end - start) / CLOCKS_PER_SEC;
std::cout << duration << "seconds\n";
delete a;
delete b;
delete c;
}
{
clock_t start = clock();
for(int n = 0; n<N; n++)
for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
clock_t end = clock();
double duration = (double)(end - start) / CLOCKS_PER_SEC;
std::cout << "serial: " << duration << "seconds\n";
}
return 0;
}
See also: What can make a program run slower when using more threads?
The times you are reporting are measured using the clock function:
The clock() function returns an approximation of processor time used by the program.
$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds
real 0m5.247s
user 0m9.025s
sys 0m0.304s
The real time will be less for multiprocessor tasks, but the processor time will typically be greater.
When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources. clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread.
It's a little hard to tell from the question whether you knew this, and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or you were expecting it to be less.
Using clock_gettime() instead (you'll need the realtime library librt, g++ -lrt etc.) gives:
$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds
real 0m5.326s
user 0m9.057s
sys 0m0.344s
which still is less of a speed-up than one might hope for, but at least the numbers make some sense.
100000000*20/2.5s = 800 MHz; the bus frequency is 1600 MHz, so I suspect that with a read and a write for each iteration (assuming some caching), you're memory-bandwidth limited as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (Does anyone know whether clock() time includes such stalls?)
The only thing your threads do is add some elements, so your application should be memory-bound. When you add an extra thread, you have 2 CPUs sharing the memory bus, so it won't go faster; instead, you'll have cache misses etc.
I believe that your algorithm essentially makes your cache memory useless.
Probably what you are seeing is the effect of (non-)locality of reference between the three threads. Essentially, because each thread is operating on a different section of data that is widely separated from the others, you are causing cache misses as the data section for one thread replaces that of another thread in your cache. If your program were constructed so that the threads operated on sections of data that were smaller (so that they could all be kept in memory) or closer together (so that all threads could use the same in-cache pages), you'd see a performance boost. As it is, I suspect that your slowdown is because a lot of memory references have to be satisfied from main memory instead of from your cache.
Not related to your threading issues, but there is a bounds error in your code.
You have:
for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
When i is zero you will be doing
arr[0] += arr[-1];
Also see Herb Sutter's article on how multiple CPUs and cache lines interfere in multithreaded code, especially the section "All Sharing Is Bad -- Even of 'Unshared' Objects...".
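As a small illustration of that point (not taken from the article; the 64-byte line size is an assumption typical of x86), keeping each thread's hot counter on its own cache line avoids the kind of ping-ponging described in the other answers:
#include <functional>
#include <thread>
#include <vector>

struct alignas(64) PaddedCounter {    // one cache line per counter (assumed 64-byte lines)
    long value = 0;
};

void countUp(PaddedCounter &c, long iterations) {
    for (long i = 0; i < iterations; ++i)
        ++c.value;                     // each thread touches only its own line
}

int main() {
    const int numThreads = 4;
    std::vector<PaddedCounter> counters(numThreads);
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i)
        threads.emplace_back(countUp, std::ref(counters[i]), 100000000L);
    for (auto &t : threads) t.join();
}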
As others have pointed out, threads don't necessarily provide improvements to speed. In this particular example, the amount of time spent in each thread is significantly less than the amount of time required to perform context switches and synchronization.
tstenner has got it mostly right.
This is mainly a benchmark of your OS's "allocate and map a new page" algorithm. That array allocation allocates 800MB of virtual memory; the OS won't actually allocate real physical memory until it's needed. "Allocate and map a new page" is usually protected by a mutex, so more cores won't help.
Your benchmark also stresses the memory bus (minimum 800MB transferred; on OSs that zero memory just before they give it to you, the worst case is 800MB * 7 transfers). Adding more cores isn't really going to help if the bottleneck is the memory bus.
You have 3 threads trampling all over the same memory. The cache lines are being read and written by different threads, so they will be ping-ponging between the L1 caches of the two CPU cores. (A cache line that is to be written to can only be in one L1 cache, and that must be the L1 cache attached to the CPU core doing the write.) This is not very efficient. The CPU cores are probably spending most of their time waiting for the cache line to be transferred, which is why this is slower with threads than if you single-threaded it.
Incidentally, the code is also buggy because the same array is read & written from different CPUs without locking. Proper locking would have an effect on performance.
Threads take you to the promised land of speed boosts(TM) when you have a proper vector implementation. Which means that you need to have:
a proper parallelization of your algorithm
a compiler that knows and can spread your algorithm out on the hardware as a parallel procedure
hardware support for parallelization
It is difficult to come up with the first. You need to be able to have redundancy and make sure it's not eating into your performance, proper merging of data for processing the next batch of data, and so on...
But this is then only a theoretical standpoint.
Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Remember -- there is only one processor, so your threads have to wait for a time slice and essentially you are doing sequential processing.