My OpenMP implementation shows really bad performance. When I profile it with VTune, I see very low CPU usage and I don't know why. Does anyone have an idea?
Hardware:
NUMA architecture with 28 cores (56 Threads)
Implementation:
struct Lineitem {
int64_t l_quantity;
int64_t l_extendedprice;
float l_discount;
unsigned int l_shipdate;
};
Lineitem* array = (Lineitem*)malloc(sizeof(Lineitem) * array_length);
// array will be filled
#pragma omp parallel for num_threads(48) shared(array, array_length, date1, date2) reduction(+: sum)
for (unsigned long i = 0; i < array_length; i++)
{
if (array[i].l_shipdate >= date1 && array[i].l_shipdate < date2 &&
array[i].l_discount >= 0.08f && array[i].l_discount <= 0.1f &&
array[i].l_quantity < 24)
{
sum += (array[i].l_extendedprice * array[i].l_discount);
}
}
For additional context, I am using CMake and Clang.
I was able to find the cause of my poor OpenMP performance. I am running my OpenMP code inside a thread pinned to a core. If I don't pin the thread to a core, then the OpenMP code is fast.
Presumably the threads created by OpenMP inherit the affinity of the pinned thread and are also executed on the core it is pinned to. Consequently, the whole OpenMP region runs on only one core with many threads.
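To verify this, a minimal diagnostic sketch (assuming Linux and an OpenMP-enabled compiler; the file name is made up) prints each OpenMP thread's allowed CPU set from inside a parallel region. If the runtime inherited the single-core mask of the pinned parent thread, every line will report exactly one allowed CPU:
// affinity_check.cpp - diagnostic sketch, not part of the original program.
// Build: clang++ -fopenmp affinity_check.cpp -o affinity_check   (Linux only)
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>   // sched_getaffinity, sched_getcpu, CPU_* macros
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        sched_getaffinity(0, sizeof(mask), &mask);  // 0 = the calling thread

        int allowed = 0;                            // count the CPUs this thread may run on
        for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
            if (CPU_ISSET(cpu, &mask)) ++allowed;

        #pragma omp critical
        std::printf("OpenMP thread %d: %d allowed CPU(s), currently on CPU %d\n",
                    omp_get_thread_num(), allowed, sched_getcpu());
    }
    return 0;
}
If every line shows a single allowed CPU, the OpenMP worker threads inherited the parent thread's pin, which matches the behaviour described above.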
Modern CPUs will only show high performance if there is lots of cached data to be reused. Since you are only streaming linearly over an array, there is no reuse and you are limited by memory bandwidth. Your cores will indeed be operating at a small fraction of their full utilization.
Things may be even worse: you have an array of structures from which you use only certain fields. If there are other fields that you don't use, you don't fully use the cache lines you load from memory, dividing the effective bandwidth yet again. Please amend your question by including the data layout of your structure/class.
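For illustration only (a sketch built around the Lineitem fields shown above, not the author's actual code), a structure-of-arrays layout guarantees that every byte of a fetched cache line is data the loop actually uses, which matters as soon as the real table carries additional, untouched columns:
// Sketch: structure-of-arrays layout for the hot columns of the lineitem table.
#include <cstddef>
#include <cstdint>

struct LineitemColumns {
    int64_t*      l_quantity;
    int64_t*      l_extendedprice;
    float*        l_discount;
    unsigned int* l_shipdate;
};

double filtered_sum(const LineitemColumns& c, std::size_t n,
                    unsigned int date1, unsigned int date2)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+: sum)
    for (std::size_t i = 0; i < n; ++i) {
        // Same predicate as in the question, but each column streams contiguously.
        if (c.l_shipdate[i] >= date1 && c.l_shipdate[i] < date2 &&
            c.l_discount[i] >= 0.08f && c.l_discount[i] <= 0.1f &&
            c.l_quantity[i] < 24) {
            sum += c.l_extendedprice[i] * c.l_discount[i];
        }
    }
    return sum;
}
This does not lift the memory-bandwidth ceiling, but it stops paying for bytes that are never used.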
Related
I'm attempting to create a std::vector<std::set<int>> with one set per NUMA node, containing the thread IDs obtained using omp_get_thread_num().
Topo:
Idea:
Create data which is larger than L3 cache,
set first touch using thread 0,
perform multiple experiments to determine the minimum access time of each thread,
extract the threads into nodes based on sorted access times and information about the topology.
Code: (Intel compiler, OpenMP)
// create data which will be shared by multiple threads
const auto part_size = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto size = 2 * part_size;
auto container = std::unique_ptr<double[]>(new double[size]);
// open a parallel section
auto thread_count = 0;
auto thread_id_min_duration = std::multimap<double, int>{};
#ifdef DECIDE_THREAD_COUNT
#pragma omp parallel num_threads(std::thread::hardware_concurrency())
#else
#pragma omp parallel
#endif
{
// perform first touch using thread 0
const auto thread_id = omp_get_thread_num();
if (thread_id == 0)
{
thread_count = omp_get_num_threads();
for (auto index = std::size_t{}; index < size; ++index)
{
container.get()[index] = static_cast<double>(std::rand() % 10 + 1);
}
}
#pragma omp barrier
// access the data using all threads individually
#pragma omp for schedule(static, 1)
for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
{
// calculate the minimum access time of this thread
auto this_thread_min_duration = std::numeric_limits<double>::max();
for (auto experiment_counter = std::size_t{}; experiment_counter < 250; ++experiment_counter)
{
const auto* data = experiment_counter % 2 == 0 ? container.get() : container.get() + part_size;
const auto start_timestamp = omp_get_wtime();
for (auto index = std::size_t{}; index < part_size; ++index)
{
static volatile auto exceedingly_interesting_value_wink_wink = data[index];
}
const auto end_timestamp = omp_get_wtime();
const auto duration = end_timestamp - start_timestamp;
if (duration < this_thread_min_duration)
{
this_thread_min_duration = duration;
}
}
#pragma omp critical
{
thread_id_min_duration.insert(std::make_pair(this_thread_min_duration, thread_id));
}
}
} // #pragma omp parallel
Not shown here is code which outputs the minimum access times sorted into the multimap.
Environment and Output
How do OMP_PLACES and OMP_PROC_BIND work?
I am attempting to avoid using SMT with export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=24. However, I'm getting this output:
What's puzzling me is that I'm having the same access times on all threads. Since I'm trying to spread them across the 2 NUMA nodes, I expect to neatly see 12 threads with access time, say, x and another 12 with access time ~2x.
Why is the above happening?
Additional Information
Even more puzzling are the following environments and their outputs:
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=26
export OMP_PLACES=cores OMP_PROC_BIND=spread OMP_NUM_THREADS=48
Any help in understanding this phenomenon would be much appreciated.
In short, the benchmark is flawed.
perform multiple experiments to determine the minimum access time of each thread
The term "minimum access time" is unclear here. I assume you mean "latency". The thing is your benchmark does not measure the latency. volatile tell to the compiler to read store data from the memory hierarchy. The processor is free to store the value in its cache and x86-64 processors actually do that (like almost all modern processors).
How do OMP_PLACES and OMP_PROC_BIND work?
You can find the documentation of both here and there. In short, I strongly advise you to set OMP_PROC_BIND=TRUE and OMP_PLACES="{0},{1},{2},..." based on the values retrieved from hwloc. More specifically, you can get them from hwloc-calc, which is a really great tool (consider using --li --po, and PU, not CORE, because that is what OpenMP runtimes expect). For example, you can query the PU identifiers of a given NUMA node. Note that some machines have very weird non-linear OS PU numbering, and OpenMP runtimes sometimes fail to map the threads correctly. IOMP (the OpenMP runtime of ICC) should use hwloc internally, but I found some bugs related to that in the past. To check that the mapping is correct, I advise you to use hwloc-ps.
Note that OMP_PLACES=cores does not guarantee that threads will not migrate from one core to another (even one on a different NUMA node) unless OMP_PROC_BIND=TRUE (or a similar setting) is set. You can also use numactl to control the NUMA policies of your process; for example, you can tell the OS not to use a given NUMA node, or to interleave the allocations. The first-touch policy is not the only one and may not be the default on all platforms (on some Linux platforms, the OS can move pages between NUMA nodes to improve locality).
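For illustration (a sketch, not part of the question's benchmark; it assumes an OpenMP 4.5+ runtime), the places API lets each thread report where it believes it has been placed, which quickly exposes an OMP_PLACES/OMP_PROC_BIND setting that did not take effect:
// places_report.cpp - sketch: report the place assigned to each OpenMP thread.
// Build: icpc -qopenmp places_report.cpp   (or clang++/g++ -fopenmp)
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    #pragma omp parallel
    {
        const int place = omp_get_place_num();   // -1 if the thread is not bound to a place
        #pragma omp critical
        {
            if (place < 0) {
                std::printf("thread %2d: not bound to any place\n", omp_get_thread_num());
            } else {
                const int nprocs = omp_get_place_num_procs(place);
                std::vector<int> ids(nprocs);
                omp_get_place_proc_ids(place, ids.data());  // OS processor (PU) ids
                std::printf("thread %2d -> place %2d, PUs:", omp_get_thread_num(), place);
                for (const int id : ids) std::printf(" %d", id);
                std::printf("\n");
            }
        }
    }
    return 0;
}
Cross-checking this output against hwloc-ps tells you whether the binding you asked for is the binding you actually got.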
Why is the above happening?
The code takes 4.38 ms to read 50 MiB in memory in each thread. This means 1200 MiB read from node 0, assuming the first-touch policy is applied. Thus the throughput should be about 267 GiB/s. While this seems fine at first glance, it is a pretty big throughput for such a processor, especially assuming only one NUMA node is used. This is certainly because part of the fetches are served from the L3 cache and not from RAM. Indeed, the cache can hold part of the array and certainly does, resulting in faster fetches thanks to cache associativity and a good replacement policy. This is especially true as the cache lines are never invalidated, since the array is only read. I advise you to use a significantly bigger array to prevent this effect.
You certainly expect one NUMA node to have a smaller throughput due to remote NUMA memory accesses. This is not always true in practice. In fact, it is often wrong on modern 2-socket systems, where the socket interconnect is usually not a limiting factor (when it does saturate, the interconnect is the main source of throughput slowdown on NUMA systems).
NUMA effects arise on modern platforms because of unbalanced NUMA memory-node saturation and non-uniform latency. The former is not a problem in your application since all the PUs use the same NUMA memory node. The latter is not a problem either because of the linear memory access pattern, the CPU caches and the hardware prefetchers: the latency should be completely hidden.
Even more puzzling are the following environments and their outputs
Using 26 threads on a 24-core machine means that 4 threads have to share two cores. The thing is, hyper-threading should not help much in such a case, so threads sharing a core will be slowed down. Because IOMP almost certainly pins threads to cores and the workload is unbalanced, those 4 threads will be about twice as slow.
Having 48 threads causes all of the threads to be slower, because the total workload is twice as big for the same number of cores.
Let me address your first sentence. A C++ std::vector is different from a C malloc. Malloc'ed space is not "instantiated": only when you touch the memory does the virtual-to-physical address mapping get established. This is known as "first touch", and it is why in C with OpenMP you initialize an array in parallel, so that the socket whose thread touches a part of the array gets the pages of that part. In C++, the "array" inside a vector is value-initialized by a single thread, so the pages wind up on that thread's socket.
Here's a solution:
template<typename T>
struct uninitialized {
  uninitialized() {}
  T val;
  constexpr operator T() const { return val; }
  T operator=(const T& v) { val = v; return val; }
};
Now you can create a vector<uninitialized<double>> and the array memory is not touched until you explicitly initialize it:
vector<uninitialized<double>> x(N),y(N);
#pragma omp parallel for
for (int i=0; i<N; i++)
y[i] = x[i] = 0.;
x[0] = 0; x[N-1] = 1.;
Now, I'm not sure how this goes if you have a vector of sets. Just thought I'd point out the issue.
After more investigation, I note the following:
work-load managers on clusters can and will disregard/reset OMP_PLACES/OMP_PROC_BIND,
memory page migration is a thing on modern NUMA systems.
Following this, I started using the work-load manager's own thread binding/pinning system, and adapted my benchmark to lock the memory page(s) on which my data lay. Furthermore, giving in to my programmer's paranoia, I ditched the std::unique_ptr for fear that it may lay its own first touch after allocating the memory.
// create data which will be shared by multiple threads
const auto size_per_thread = std::size_t{50 * 1024 * 1024 / sizeof(double)}; // 50 MB
const auto total_size = thread_count * size_per_thread;
double* data = nullptr;
const auto alloc_result = posix_memalign(reinterpret_cast<void**>(&data), sysconf(_SC_PAGESIZE), total_size * sizeof(double));
if (alloc_result != 0 || data == nullptr)
{
throw std::runtime_error("could_not_allocate_memory_error");
}
// perform first touch using thread 0
#pragma omp parallel num_threads(thread_count)
{
if (omp_get_thread_num() == 0)
{
#pragma omp simd safelen(8)
for (auto d_index = std::size_t{}; d_index < total_size; ++d_index)
{
data[d_index] = -1.0;
}
}
} // #pragma omp parallel
mlock(data, total_size * sizeof(double)); // page migration is a real thing...
// open a parallel section
auto thread_id_avg_latency = std::multimap<double, int>{};
auto generator = std::mt19937(); // heavy object can be created outside parallel
#pragma omp parallel num_threads(thread_count) private(generator)
{
// access the data using all threads individually
#pragma omp for schedule(static, 1)
for (auto thread_counter = std::size_t{}; thread_counter < thread_count; ++thread_counter)
{
// seed each thread's generator
generator.seed(thread_counter + 1);
// calculate the minimum access latency of this thread
auto this_thread_avg_latency = 0.0;
const auto experiment_count = 250;
for (auto experiment_counter = std::size_t{}; experiment_counter < experiment_count; ++experiment_counter)
{
const auto start_timestamp = omp_get_wtime() * 1E+6;
for (auto counter = std::size_t{}; counter < size_per_thread / 100; ++counter)
{
const auto index = std::uniform_int_distribution<std::size_t>(0, size_per_thread-1)(generator);
auto& datapoint = data[thread_counter * size_per_thread + index];
datapoint += index;
}
const auto end_timestamp = omp_get_wtime() * 1E+6;
this_thread_avg_latency += end_timestamp - start_timestamp;
}
this_thread_avg_latency /= experiment_count;
#pragma omp critical
{
thread_id_avg_latency.insert(std::make_pair(this_thread_avg_latency, omp_get_thread_num()));
}
}
} // #pragma omp parallel
std::free(data);
With these changes, I am noticing the difference I expected.
Further notes:
this experiment shows that the latency of non-local access is 1.09 - 1.15 times that of local access on the cluster that I'm using,
there is no reliable cross-platform way of doing this (requires kernel-APIs),
OpenMP seems to number the threads exactly as hwloc/lstopo, numactl and lscpu number them (logical IDs?)
The most astonishing things are that the difference in latencies is very low and that memory-page migration may happen, which raises the question: why should we care about first touch and all the other NUMA concerns at all?
The program
I have a C++ program that looks something like the following:
<load data from disk, etc.>
// Get some buffers aligned to 4 KiB
double* const x_a = static_cast<double*>(std::aligned_alloc(......));
double* const p = static_cast<double*>(std::aligned_alloc(......));
double* const m = static_cast<double*>(std::aligned_alloc(......));
double sum = 0.0;
const auto timerstart = std::chrono::steady_clock::now();
for(uint32_t i = 0; i<reps; i++){
uint32_t pos = 0;
double factor;
if((i%2) == 0) factor = 1.0; else factor = -1.0;
for(uint32_t j = 0; j<xyzvec.size(); j++){
pos = j*basis::ndist; //ndist is a compile-time constant == 36
for(uint32_t k =0; k<basis::ndist; k++) x_a[k] = distvec[k+pos];
sum += factor*basis::energy(x_a, &coeff[0], p, m);
}
}
const auto timerstop = std::chrono::steady_clock::now();
<free memory, print stats, etc.>
where reps is a single-digit number, xyzvec has ~15k elements, and a single call to basis::energy(...) takes about 100 µs to return. The energy function is huge in terms of code size (~5 MiB of source code that looks something like this; it comes from a code generator).
Edit: The m array is somewhat large, ~270 KiB for this test case.
Edit 2: Source code of the two functions responsible for ~90% of execution time
All of the pointers entering energy are __restrict__-qualified and declared to be aligned via __assume_aligned(...), the object files are generated with -Ofast -march=haswell to allow the compiler to optimize and vectorize at will. Profiling suggests the function is currently frontend-bound (L1i cache miss, and fetch/decode).
energy does no dynamic memory allocation or IO, and mostly reads/writes x_a, m and p (x_a is const), all of which are aligned to 4 KiB page boundaries. Its execution time ought to be pretty consistent.
The strange timing behaviour
Running the program many times, and looking at the time elapsed between the timer start/stop calls above, I have found it to have a strange bimodal distribution.
Calls to energy are either "fast" or "slow": fast ones take ~91 µs, slow ones ~106 µs on an Intel Skylake-X 7820X.
All calls to energy in a given process are either fast or slow; the metaphorical coin is flipped once, when the process starts.
The outcome is not entirely random and can be heavily biased towards the "fast" case by purging all kernel caches via echo 3 | sudo tee /proc/sys/vm/drop_caches immediately before execution.
The effect may be CPU-dependent. Running the same executable on a Ryzen 1700X yields both faster and much more consistent execution; the "slow" runs either don't happen or their prominence is much reduced. Both machines run the same OS (Ubuntu 20.04 LTS, 5.11.0-41-generic kernel, mitigations=off).
What could be the cause?
Data alignment (dubious, the arrays intensively used are aligned)
Code alignment (maybe, but I have tried printing the function pointer of energy, no correlation with speed)
Cache aliasing?
JCC erratum?
Interrupts, scheduler activity?
Some cores turbo boosting higher? (probably not, tried launching it bound to a core with taskset and tried all cores one by one, could not find one that was always "fast")
???
Edit
Zero-filling x_a, p and m before first use appears to make no difference to the timing pattern.
Replacing (i % 2) with factor *= -1.0 appears to make no difference to the timing pattern.
I have the following code:
omp_set_num_threads(10);
int N = 10000;
int chunk = 1000;
int i;
float a[N];
for (i=0; i < N; i++)
{
a[i] = i;
}
#pragma omp parallel private(i)
{
#pragma omp for schedule(dynamic,chunk) nowait
for (i=0; i < N; i++) a[i] = a[i] + a[i];
}
The sequential code is the same, without the two pragma directives.
sequential ~150us
parallel ~1100us
That's a huge gap, and I expected it to be the other way around.
Does someone have a clue what's wrong, or does OpenMP have so much to do in the background?
Does someone have an example, where I can see that the parallelized for loop is faster?
Thanks
Launching ten threads and managing them is more expensive than adding 10,000 elements. If you're on a system with fewer than 10 physical cores, they're going to compete for time slices and incur context switches and (potentially) more cache misses. And you chose dynamic scheduling, which means there has to be some synchronization to help the threads figure out which chunks they're going to do (not a lot, but enough to slow things down when the work being distributed is as trivial as this).
In one anecdote, launching a no-op thread and immediately detaching it cost about 10 μs, even though those threads didn't need to do anything. In your case, that's roughly 100 μs just for launching the ten threads, ignoring all the other inefficiencies threading potentially introduces.
Parallelizing helps for big workloads, or when the workers are used many times for many moderate sized tasks (so the cost of launching the threads is a fraction of the work they do). But performing 10,000 additions is chump change to a CPU; you're just not doing enough to benefit from parallelizing it.
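To give an example where the parallel loop does win, the work per iteration has to dwarf the thread-management overhead. A minimal sketch under that assumption (the heavy per-element loop is artificial, purely to create enough work):
// heavy_loop.cpp - sketch: enough work per element that thread startup cost is negligible.
// Build: g++ -O2 -fopenmp heavy_loop.cpp -o heavy_loop
#include <omp.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1000000;
    std::vector<double> a(N, 1.0);

    const double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
        // A few hundred floating-point operations per element instead of one addition.
        double x = a[i];
        for (int k = 0; k < 100; ++k)
            x = std::sqrt(x + 1.0);
        a[i] = x;
    }
    const double t1 = omp_get_wtime();

    std::printf("parallel loop took %.3f ms (checksum %.6f)\n",
                (t1 - t0) * 1e3, a[N / 2]);
    return 0;
}
Remove the pragma and rerun: with this much work per element, the parallel version should scale close to the number of physical cores rather than losing to the serial loop.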
I am trying to parallelize a code for particle-based simulations and experiencing poor performance of an OpenMP based approach. By that I mean:
Displaying CPU usage with the Linux tool top, the CPUs running the OpenMP threads show an average usage of about 50%.
With an increasing number of threads, the speed-up converges to a factor of about 1.6. Convergence is quite fast, i.e. I already reach a speed-up of 1.5 using 2 threads.
The following pseudo code illustrates the basic template for all parallel regions implemented.
Note that during a single time step, 5 parallel regions of the fashion shown below are executed. Basically, the force acting on a particle i < N is a function of several field properties of neighboring particles j < NN(i).
omp_set_num_threads(ncpu);
#pragma omp parallel shared( quite_a_large_amount_of_readonly_data, force )
{
int i,j,N,NN;
#pragma omp for
for( i=0; i<N; i++ ){ // Looping over all particles
for ( j=0; j<NN(i); j++ ){ // Nested loop over all neighbors of i
// No communications between threads, atomic regions,
// barriers whatsoever.
force[i] += function(j);
}
}
}
I am trying to sort out the cause for the observed bottleneck. My naive initial guess for an explanation:
As stated, there is a large amount of memory being shared between threads for read-only access. It is quite possible that different threads try to read the same memory location at the same time. Is this causing a bottleneck? Should I rather let OpenMP allocate private copies?
How large is N, and how intensive is NN(i)?
You say nothing is shared, but force[i] is probably within the same cache line as force[i+1]. This is what's known as false sharing and can be pretty detrimental. OpenMP should batch iterations together to compensate for this, so with a large enough N I don't think this would be your problem.
If NN(i) isn't very CPU intensive, you might have a simple memory bottleneck -- in which case throwing more cores at it won't solve anything.
Assuming that force[i] is a plain array of 4- or 8-byte data, you definitely have false sharing, no doubt about it.
Assuming that function(j) is independently calculated, you may want to do something like this:
for( i=0; i<N; i+=STEP ){ // Looping over all particles
for ( j=0; j<NN(i); j+=STEP ){ // Nested loop over all neighbors of i
// No communications between threads, atomic regions,
// barriers whatsoever.
calc_next(i, j);
}
}
void calc_next(int i, int j)
{
int ii, jj;
for(ii = 0; ii < STEP; ii++)
{
for(jj = 0; jj < STEP; jj++)
{
force[i+ii] += function(j+jj);
}
}
}
That way, you calculate a bunch of things on one thread, and a bunch of things on the next thread, and each bunch is far enough apart that you don't get false sharing.
If you can't do it this way, try to split it up in some other way that leads to larger sections being calculated each time.
As others have stated, false sharing on force could be a reason. Try it this simple way:
#pragma omp for
for( i=0; i<N; i++ ){
int sum = force[i];
for ( j=0; j<NN(i); j++ ){
sum += function(j);
}
force[i] = sum;
}
Technically, it's possible that force[i] = sum still causes false sharing. But it's highly unlikely, because with a static schedule the nearest other thread is writing around force[i + N/omp_get_num_threads()], which is pretty far from force[i].
If scalability is still poor, try a profiler such as Intel VTune (formerly Parallel Amplifier) to see how much memory bandwidth each thread needs. If bandwidth is the limit, put some more DRAM modules in your computer :) That will really boost memory bandwidth.
On my laptop with an Intel Pentium dual-core processor T2370 (Acer Extensa) I ran a simple multithreading speedup test. I am using Linux. The code is pasted below. While I was expecting a speedup of 2-3 times, I was surprised to see a slowdown by a factor of 2. I tried the same with gcc optimization levels -O0 ... -O3, but every time I got the same result. I am using pthreads. I also tried the same with only two threads (instead of 3 threads in the code), but the performance was similar.
What could be the reason? The faster version took reasonably long - about 20 secs - so it does not seem to be an issue of startup overhead.
NOTE: This code is quite buggy (indeed, it does not make much sense, since the output of the serial and parallel versions would differ). The intention was just to get a speedup comparison for the same number of instructions.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
class Thread{
private:
pthread_t thread;
static void *thread_func(void *d){((Thread *)d)->run(); return NULL;}
public:
Thread(){}
virtual ~Thread(){}
virtual void run(){}
int start(){return pthread_create(&thread, NULL, Thread::thread_func, (void*)this);}
int wait(){return pthread_join(thread, NULL);}
};
#include <iostream>
const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];
int main(void)
{
class Thread_a:public Thread{
public:
Thread_a(int* a): arr_(a) {}
void run()
{
for(int n = 0; n<N; n++)
for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
}
private:
int* arr_;
};
class Thread_b:public Thread{
public:
Thread_b(int* a): arr_(a) {}
void run()
{
for(int n = 0; n<N; n++)
for(int i=ARR_SIZE/3; i<2*ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
}
private:
int* arr_;
};
class Thread_c:public Thread{
public:
Thread_c(int* a): arr_(a) {}
void run()
{
for(int n = 0; n<N; n++)
for(int i=2*ARR_SIZE/3; i<ARR_SIZE; i++){ arr_[i] += arr_[i-1];}
}
private:
int* arr_;
};
{
Thread *a=new Thread_a(arr);
Thread *b=new Thread_b(arr);
Thread *c=new Thread_c(arr);
clock_t start = clock();
if (a->start() != 0) {
return 1;
}
if (b->start() != 0) {
return 1;
}
if (c->start() != 0) {
return 1;
}
if (a->wait() != 0) {
return 1;
}
if (b->wait() != 0) {
return 1;
}
if (c->wait() != 0) {
return 1;
}
clock_t end = clock();
double duration = (double)(end - start) / CLOCKS_PER_SEC;
std::cout << duration << "seconds\n";
delete a;
delete b;
}
{
clock_t start = clock();
for(int n = 0; n<N; n++)
for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
clock_t end = clock();
double duration = (double)(end - start) / CLOCKS_PER_SEC;
std::cout << "serial: " << duration << "seconds\n";
}
return 0;
}
See also: What can make a program run slower when using more threads?
The times you are reporting are measured using the clock function:
The clock() function returns an approximation of processor time used by the program.
$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds
real 0m5.247s
user 0m9.025s
sys 0m0.304s
The real time will be less for multiprocessor tasks, but the processor time will typically be greater.
When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources. clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread.
It's a little hard to tell from the question whether you knew this, and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or you were expecting it to be less.
Using clock_gettime() instead (you'll need the realtime library librt, g++ -lrt etc.) gives:
$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds
real 0m5.326s
user 0m9.057s
sys 0m0.344s
which still is less of a speed-up than one might hope for, but at least the numbers make some sense.
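For reference, the timing swap described above looks roughly like this (a sketch; the thread start/join code from the question would go between the two calls):
// Sketch: wall-clock timing with clock_gettime(CLOCK_MONOTONIC) instead of clock().
// On older glibc versions, link with -lrt.
#include <time.h>
#include <iostream>

int main() {
    timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // ... start the threads and wait for them, as in the question ...

    clock_gettime(CLOCK_MONOTONIC, &stop);
    const double seconds = (stop.tv_sec - start.tv_sec)
                         + (stop.tv_nsec - start.tv_nsec) * 1e-9;
    std::cout << seconds << " seconds\n";
    return 0;
}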
100000000 * 20 / 2.5 s = 800 million iterations per second, while the bus frequency is 1600 MHz, so I suspect that, with a read and a write for each iteration (assuming some caching), you're memory-bandwidth limited as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (Does anyone know whether clock() time includes such stalls?)
The only thing your threads do is add some elements, so your application should be memory-bound. When you add an extra thread, you have 2 CPUs sharing the memory bus, so it won't go faster; instead, you'll have cache misses etc.
I believe that your algorithm essentially makes your cache memory useless.
Probably what you are seeing is the effect of (non-)locality of reference between the three threads. Essentially, because each thread is operating on a different section of data that is widely separated from the others, you are causing cache misses as the data section for one thread replaces that for another thread in your cache. If your program were constructed so that the threads operated on sections of data that were smaller (so that they could all be kept in cache) or closer together (so that all threads could use the same in-cache pages), you'd see a performance boost. As it is, I suspect your slowdown is because a lot of memory references have to be satisfied from main memory instead of from your cache.
Not related to your threading issues, but there is a bounds error in your code.
You have:
for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
When i is zero you will be doing
arr[0] += arr[-1];
Also see Herb Sutter's article on how multiple CPUs and cache lines interfere in multithreaded code, especially the section "All Sharing Is Bad -- Even of 'Unshared' Objects...".
As others have pointed out, threads don't necessarily provide improvements to speed. In this particular example, the amount of time spent in each thread is significantly less than the amount of time required to perform context switches and synchronization.
tstenner has got it mostly right.
This is mainly a benchmark of your OS's "allocate and map a new page" algorithm. That array reserves roughly 400 MB of virtual memory (100,000,000 ints); the OS won't actually allocate real physical memory until it's needed. "Allocate and map a new page" is usually protected by a mutex, so more cores won't help.
Your benchmark also stresses the memory bus (minimum 800MB transferred; on OSs that zero memory just before they give it to you, the worst case is 800MB * 7 transfers). Adding more cores isn't really going to help if the bottleneck is the memory bus.
You have 3 threads that are trampling all over the same memory. The cache lines are being read and written by different threads, so they will be ping-ponging between the L1 caches on the two CPU cores. (A cache line that is to be written to can only be in one L1 cache, and that must be the L1 cache attached to the CPU core that's doing the write.) This is not very efficient. The CPU cores are probably spending most of their time waiting for the cache line to be transferred, which is why this is slower with threads than if you single-threaded it.
Incidentally, the code is also buggy because the same array is read & written from different CPUs without locking. Proper locking would have an effect on performance.
Threads take you to the promised land of speed boosts(TM) when you have a proper vector implementation. Which means that you need to have:
a proper parallelization of your algorithm
a compiler that knows and can spread your algorithm out on the hardware as a parallel procedure
hardware support for parallelization
It is difficult to come up with the first. You need redundancy while making sure it's not eating into your performance, proper merging of data for processing the next batch of data, and so on ...
But this is then only a theoretical standpoint.
Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Remember -- there is only one processor, so your threads have to wait for a time slice and essentially you are doing sequential processing.