Mergesort pThread implementation taking same time as single-threaded

Mergesort pThread implementation taking same time as single-threaded - c++

(I have tried to simplify this as much as i could to find out where I'm doing something wrong.)
The ideea of the code is that I have a global array *v (I hope using this array isn't slowing things down, the threads should never acces the same value because they all work on different ranges) and I try to create 2 threads each one sorting the first half, respectively the second half by calling the function merge_sort() with the respective parameters.
On the threaded run, i see the process going to 80-100% cpu usage (on dual core cpu) while on the no threads run it only stays at 50% yet the run times are very close.
This is the (relevant) code:
//These are the 2 sorting functions, each thread will call merge_sort(..). Is this a problem? both threads calling same (normal) function?
void merge (int *v, int start, int middle, int end) {
//dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
//copies the original values into the 2 halves
//then sorts them back into the v array
}
void merge_sort (int *v, int start, int end) {
//recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
//calls merge(start, middle, end)
}
//here i'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code to find the bug easier)
void* mergesort_t2(void * arg) {
t_data* th_info = (t_data*)arg;
merge_sort(v, th_info->a, th_info->b);
return (void*)0;
}
//in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
//some stuff
//getting the clock to calculate run time
clock_t t_inceput, t_sfarsit;
t_inceput = clock();
//ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
//the a and b are the range of values the created thread will have to sort
pthread_t thread[2];
t_data next_info[2];
next_info[0].crt_depth = 1;
next_info[0].a = 0;
next_info[0].b = n/2;
next_info[1].crt_depth = 1;
next_info[1].a = n/2+1;
next_info[1].b = n-1;
for (int i=0; i<2; i++) {
if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
cerr<<"error\n;";
return err;
}
}
for (int i=0; i<2; i++) {
if (pthread_join(thread[i], &status) != 0) {
cerr<<"error\n;";
return err;
}
}
//now i merge the 2 sorted halves
merge(v, 0, n/2, n-1);
//calculate end time
t_sfarsit = clock();
cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too as they've hinted to this from the beginning.
I've used the inplace_merge(..) function instead of my own and the program run times are as they should now.
Here's my initial merge function (not really sure if the initial, i've probably modified it a few times since, also array indices might be wrong right now, i went back and forth between [a,b] and [a,b), this was just the last commented-out version):
void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]
//create the 2 new arrays
int *st = new int[m-a+1];
int *dr = new int[c-m+1];
//copy the values
for (int i1 = 0; i1 <= m-a; i1++)
st[i1] = v[a+i1];
for (int i2 = 0; i2 <= c-(m+1); i2++)
dr[i2] = v[m+1+i2];
//merge them back together in sorted order
int is=0, id=0;
for (int i=0; i<=c-a; i++) {
if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
v[a+i] = st[is];
is++;
}
else {
v[a+i] = dr[id];
id++;
}
}
delete st, dr;
}
all this was replaced with:
inplace_merge(v+a, v+m, v+c);
Edit, some times on my 3ghz dual core cpu:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s

There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten starting half-sentence for any big-O complexity measurements: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.

Note: since OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for sake of those who might find the information useful.
clock() is a wrong interface for measuring time on Linux: it measures CPU time used by the program (see http://linux.die.net/man/3/clock), which in case of multiple threads is the sum of CPU time for all threads. You need to measure elapsed, or wallclock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells what API can be used instead of clock().
In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wallclock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for e.g. a network event or a message from another MPI process, it still spends time - but not CPU time.
Update: In Microsoft's implementation of C run-time library, clock() returns wall-clock time, so is OK to use for your purpose. It's unclear though if you use Microsoft's toolchain or something else, like Cygwin or MinGW.

Related

C++ Wait in main thread for future without while(true)

Question
I want to know if it is possible to wait in the main-Thread without any while(1)-loop.
I launch a few threads via std::async() and do calculation of numbers on each thread. After i start the threads i want to receive the results back. I do that with a std::future<>.get().
My problem
When i receive the result i call std::future.get(), which blocks the main thread until the calculation on the thread is done. This leads to some slower execution time, if one thread needs considerably more time then the following, where i could do some calculation with the finished results instead and then when the slowest thread is done i maybe have some some further calculation.
Is there a way to idle the main thread until ANY of the threads has finished running? I have thought of a callback function which wakes the main thread up, but i still don't know how to idle the main function without making it unresponsive for i.e. a second and not running a while(true) loop instead.
Current code
#include <iostream>
#include <future>
uint64_t calc_factorial(int start, int number);
int main()
{
uint64_t n = 1;
//The user entered number
uint64_t number = 0;
// get the user input
printf("Enter number (uint64_t): ");
scanf("%lu", &number);
std::future<uint64_t> results[4];
for (int i = 0; i < 4; i++)
{
// push to different cores
results[i] = std::async(std::launch::async, calc_factorial, i + 2, number);
}
for (int i = 0; i < 4; i++)
{
//retrieve result...I don't want to wait here if one threads needs more time than usual
n *= results[i].get();
}
// print n or the time needed
return 0;
}
uint64_t calc_factorial(int start, int number)
{
uint64_t n = 1;
for (int i = start; i <= number; i+=4) n *= i;
return n;
}
I prepared a code snippet which runs fine, I am using the GMP Lib for the big results, but the code runs with uint64_t instead if you enter small numbers.
Note
If you have compiled the GMP library for whatever reason on your PC already you could replace every uint64_t with mpz_class

I'd approach this somewhat differently.
Unless I have a fairly specific reason to do otherwise, I tend to approach most multithreaded code the same general way: use a (thread-safe) queue to transmit results. So create an instance of a thread-safe queue, and pass a reference to it to each of the threads that's doing to generate the data. The have whatever thread is going to collect the results grab them from the queue.
This makes it automatic (and trivial) that you create each result as it's produced, rather than getting stuck waiting for one after another has produced results.

Making a for loop faster by splitting it in threads

Before I start, let me say that I've only used threads once when we were taught about them in university. Therefore, I have almost zero experience using them and I don't know if what I'm trying to do is a good idea.
I'm doing a project of my own and I'm trying to make a for loop run fast because I need the calculations in the loop for a real-time application. After "optimizing" the calculations in the loop, I've gotten closer to the desired speed. However, it still needs improvement.
Then, I remembered threading. I thought I could make the loop run even faster if I split it in 4 parts, one for each core of my machine. So this is what I tried to do:
void doYourThing(int size,int threadNumber,int numOfThreads) {
int start = (threadNumber - 1) * size / numOfThreads;
int end = threadNumber * size / numOfThreads;
for (int i = start; i < end; i++) {
//Calculations...
}
}
int main(void) {
int size = 100000;
int numOfThreads = 4;
int start = 0;
int end = size / numOfThreads;
std::thread coreB(doYourThing, size, 2, numOfThreads);
std::thread coreC(doYourThing, size, 3, numOfThreads);
std::thread coreD(doYourThing, size, 4, numOfThreads);
for (int i = start; i < end; i++) {
//Calculations...
}
coreB.join();
coreC.join();
coreD.join();
}
With this, computation time changed from 60ms to 40ms.
Questions:
1)Do my threads really run on a different core? If that's true, I would expect a greater increase in speed. More specifically, I assumed it would take close to 1/4 of the initial time.
2)If they don't, should I use even more threads to split the work? Will it make my loop faster or slower?

(1).
The question #François Andrieux asked is good. Because in the original code there is a well-structured for-loop, and if you used -O3 optimization, the compiler might be able to vectorize the computation. This vectorization will give you speedup.
Also, it depends on what is the critical path in your computation. According to Amdahl's law, the possible speedups are limited by the un-parallelisable path. You might check if the computation are reaching some variable where you have locks, then the time could also spend to spin on the lock.
(2). to find out the total number of cores and threads on your computer you may have lscpu command, which will show you the cores and threads information on your computer/server
(3). It is not necessarily true that more threads will have a better performance

There is a header-only library in Github which may be just what you need. Presumably your doYourThing processes an input vector (of size 100000 in your code) and stores the results into another vector. In this case, all you need to do is to say is
auto vectorOut = Lazy::runForAll(vectorIn, myFancyFunction);
The library will decide how many threads to use based on how many cores you have.
On the other hand, if the compiler is able to vectorize your algorithm and it still looks like it is a good idea to split the work into 4 chunks like in your example code, you could do it for example like this:
#include "Lazy.h"
void doYourThing(const MyVector& vecIn, int from, int to, MyVector& vecOut)
{
for (int i = from; i < to; ++i) {
// Calculate vecOut[i]
}
}
int main(void) {
int size = 100000;
MyVector vecIn(size), vecOut(size)
// Load vecIn vector with input data...
Lazy::runForAll({{std::pair{0, size/4}, {size/4, size/2}, {size/2, 3*size/4}, {3*size/4, size}},
[&](auto indexPair) {
doYourThing(vecIn, indexPair.first, indexPair.second, vecOut);
});
// Now the results are in vecOut
}
README.md gives further examples on parallel execution which you might find useful.

C++ Multithreaded prime counter between specified range

#include <math.h>
#include <sstream>
#include <iostream>
#include <mutex>
#include <stdlib.h>
#include <chrono>
#include <thread>
bool isPrime(int number) {
int i;
for (i = 2; i < number; i++) {
if (number % i == 0) {
return false;
}
}
return true;
}
std::mutex myMutex;
int pCnt = 0;
int icounter = 0;
int limit = 0;
int getNext() {
std::lock_guard<std::mutex> guard(myMutex);
icounter++;
return icounter;
}
void primeCnt() {
std::lock_guard<std::mutex> guard(myMutex);
pCnt++;
}
void primes() {
while (getNext() <= limit)
if (isPrime(icounter))
primeCnt();
}
int main(int argc, char *argv[]) {
std::stringstream ss(argv[2]);
int tCount;
ss >> tCount;
std::stringstream ss1(argv[4]);
int lim;
ss1 >> lim;
limit = lim;
auto t1 = std::chrono::high_resolution_clock::now();
std::thread *arr;
arr = new std::thread[tCount];
for (int i = 0; i < tCount; i++)
arr[i] = std::thread(primes);
for (int i = 0; i < tCount; i++)
arr[i].join();
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << "Primes: " << pCnt << std::endl;
std::cout << "Program took: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() <<
" milliseconds" << std::endl;
return 0;
}
Hello , im trying to find the amount of prime numbers between the user specified range, i.e., 1-1000000 with a user specified amount of threads to speed up the process, however, it seems to take the same amount of time for any amount of threads compared to one thread. Im not sure if its supposed to be that way or if theres a mistake in my code. thank you in advance!

You don't see performance gain because time spent in isPrime() is much smaller than time which threads take when fighting on mutex.
One possible solution is to use atomic operations, as #The Badger suggested. The other way is to partition your task into smaller ones and distribute them over your thread pool.
For example, if you have n threads, then each thread should test numbers from i*(limit/n) to (i+1)*(limit/n), where i is thread number. This way you wouldn't need to do any synchronization at all and your program would (theoretically) scale linearly.

Multithreaded algorithms work best when threads can do a lot of work on their own.
Imagine doing this in real life: you have a group of 20 humans that will do work for you, and you want them to test whether each number up to 1000 is prime. How will you do this?
Would you hand each person a single number at a time, and ask them to come back to you to tell you if its prime and to receive another number?
Surely not; you would give each person a bunch of numbers to work on at once, and have them come back and tell you how many were prime and to receive another bunch of numbers.
Maybe even you'd divide up the entire set of numbers into 20 groups and tell each person to work on a group. (but then you run the risk of one person being slow and having everyone else sitting idle while you wait for that one person to finish... although there are so-called "work stealing" algorithms, but that's complicated)
The same thing applies here; you want each thread to do a lot of work on its own and keep its own tally, and only have to check back with the centralized information once in a while.

A better solution would be to use the Sieve of Atkin to find the primes (even the Sieve of Eratosthenes which is easier to understand is better), your basic algorithm is very poor to start with. It will for every number n in your interval do n checks in order to determine if it's prime and do this limit times. This means that you're doing about limit*limit/2 checks - that's what we call O(n^2) complexity. The Sieve of Atkins OTOH only have to do O(n) operations to find all primes. If n is large it is hard to beat the algorithm that has fewer steps by performing the steps faster. Trying to fix a poor algorithm by throwing more resources on it is a bad strategy.
Another problem with your implementation is that it has race conditions and therefore is broken to start with. It's often little use in optimizing something unless you first make sure it's working correctly. The problem is in the primes function:
void primes() {
while (getNext() <= limit)
if( isPrime(icounter) )
primeCnt();
}
Between the getNext() and isPrime another thread may have increased the icounter and cause the program to skip candidates. This results in the program giving different result each time. In addition neither icounter nor pCnt is declared volatile so there's actually no guarantee that the value gets to the global storage location as part of the mutex lock.
Since the problem is CPU intensive, that is almost all of the time is spent executing CPU instructions multi threading won't help unless you have multiple CPU's (or cores) which the OS are scheduling threads of the same process on. This means that there is a limit of number of threads (that can be as low as 1 - I fx see only a improvement for two threads, beyond that theres none) where you can expect an improved performance. What happens if you have more threads than cores is that the OS will just let one thread run for a while on a core and then switch the thread an let the next thread execute for a while.
The problem that may arise when scheduling threads on different cores is in addition that each core may have separate cache (which is faster than the shared cache). In effect if two threads are going to access the same memory the separated cache has to be flushed as part of the synchronization of the data involved - this may be time consuming.
That is you have to strive to keep the data that the different threads are working on separate and minimize the frequent use of common variable data. In your example it would mean that you should avoid the global data as much as possible. The counter for example need only be accessed when the counting has finished (to add the threads contribution to the count). Also you could minimize the use of icounter by not reading it for each candidate, but get a bunch of candidates in one go. Something like:
void primes() {
int next;
int count=0;
while( (next = getNext(1000)) <= limit ) {
for( int j = next; j < next+1000 && j <= limit ; j++ ) {
if( isPrime(j) )
count++;
}
}
primeCnt(count);
}
where getNext is the same, but it reserves a number of candidates (by increasing icounter by the supplied count) and primeCnt adds count to pCnt.
Consequently you may end up in a situation where the core runs one thread, then after a while switch to another thread and so on. The result of this is that you will have to run all the code for your problem plus code for switching between the thread. Add that you will probably have more cache hits, then this will probably even be slower.

Perhaps instead of a mutex try to use an atomic integer for the counter. It might speed it up a bit, not sure by how much.
#include <atomic>
std::atomic<uint64_t> pCnt; // Made uint64 for bigger range as #IgnisErus mentioned
std::atomic<uint64_t> icounter;
int getNext() {
return ++icounter; // Pre increment is faster
}
void primeCnt() {
++pCnt;
}
On benchmarking, most of the time the processor need to warm up to get the best performance, so to take the time once is not always a good representation of the actual performance. Try to run the code many times and get an average. You can also try to do some heavy work before you do the calculation (A long for-loop calculating the power of some counter?)
Getting accurate benchmark results is also a topic of interest for me since I do not yet know how to do it.

Interesting processing time results

I've made a small application that averages the numbers between 1 and 1000000. It's not hard to see (using a very basic algebraic formula) that the average is 500000.5 but this was more of a project in learning C++ than anything else.
Anyway, I made clock variables that were designed to find the amount of clock steps required for the application to run. When I first ran the script, it said that it took 3770000 clock steps, but every time that I've run it since then, it's taken "0.0" seconds...
I've attached my code at the bottom.
Either a.) It's saved the variables from the first time I ran it, and it's just running quickly to the answer...
or b.) something is wrong with how I'm declaring the time variables.
Regardless... it doesn't make sense.
Any help would be appreciated.
FYI (I'm running this through a Linux computer, not sure if that matters)
double avg (int arr[], int beg, int end)
{
int nums = end - beg + 1;
double sum = 0.0;
for(int i = beg; i <= end; i++)
{
sum += arr[i];
}
//for(int p = 0; p < nums*10000; p ++){}
return sum/nums;
}
int main (int argc, char *argv[])
{
int nums = 1000000;//atoi(argv[0]);
int myarray[nums];
double timediff;
//printf("Arg is: %d\n",argv[0]);
printf("Nums is: %d\n",nums);
clock_t begin_time = clock();
for(int i = 0; i < nums; i++)
{
myarray[i] = i+1;
}
double average = avg(myarray, 0, nums - 1);
printf("%f\n",average);
clock_t end_time = clock();
timediff = (double) difftime(end_time, begin_time);
printf("Time to Average: %f\n", timediff);
return 0;
}

You are measuring the I/O operation too (printf), that depends on external factors and might be affecting the run time. Also, clock() might not be as precise as needed to measure such a small task - look into higher resolution functions such as clock_get_time(). Even then, other processes might affect the run time by generating page fault interrupts and occupying the memory BUS, etc. So this kind of fluctuation is not abnormal at all.

On the machine I tested, Linux's clock call was only accurate to 1/100th of a second. If your code runs in less than 0.01 seconds, it will usually say zero seconds have passed. Also, I ran your program a total of 50 times in .13 seconds, so I find it suspicous that you claim it takes 2 seconds to run it once on your computer.
Your code incorrectly uses the difftime, which may display incorrect output as well if clock says time did pass.
I'd guess that the first timing you got was with different code than that posted in this question, becase I can't think of any way the code in this question could produce a time of 3770000.
Finally, benchmarking is hard, and your code has several benchmarking mistakes:
You're timing how long it takes to (1) fill an array, (2) calculate an average, (3) format the result string (4) make an OS call (slow) that prints said string in the right language/font/colo/etc, which is especially slow.
You're attempting to time a task which takes less than a hundredth of a second, which is WAY too small for any accurate measurement.
Here is my take on your code, measuring that the average takes ~0.001968 seconds on this machine.

Code runs 6 times slower with 2 threads than with 1

Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting with two threads that the code would run about 2 times as fast. Measuring it with a stopwatch I think it runs about 6 times slower! EDIT: Now using the computer and clock() function to tell the time.
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);
int main(int argn, char** argv)
{
// Program entry point
std::cout << "Generating data..." << std::endl;
// Create a vector containing many variables
std::vector<double> data;
for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);
// Calculate mean using 1 core
double mean = 0;
std::cout << "Calculating mean, 1 Thread..." << std::endl;
findmean(&data, 0, data.size(), &mean);
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Repeat, using two threads
std::vector<std::thread> thread;
std::vector<double> result;
result.push_back(0.0);
result.push_back(0.0);
std::cout << "Calculating mean, 2 Threads..." << std::endl;
// Run threads
uint32_t halfsize = data.size() / 2;
uint32_t A = 0;
uint32_t B, C, D;
// Split the data into two blocks
if(data.size() % 2 == 0)
{
B = C = D = halfsize;
}
else if(data.size() % 2 == 1)
{
B = C = halfsize;
D = hsz + 1;
}
// Run with two threads
thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));
// Join threads
thread[0].join();
thread[1].join();
// Calculate result
mean = result[0] + result[1];
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Return
return EXIT_SUCCESS;
}
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
}
I don't think this code is exactly wonderful, if you could suggest ways of improving it then I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
register double holding = *result;
for(uint32_t i = 0; i < length; i ++) {
holding += (*datavec).at(start + i);
}
*result = holding;
}
I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original Code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization:
1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds 2 Threads is now slower?
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector:
1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB):
1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB):
1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds

Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.
Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.

As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this is happening. The std::vector<T>::at() function (as well as the std::vector<T>::operator[]()) access the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
*result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.

More threads doesn't mean faster! There is an overhead in creating and context-switching threads, even the hardware in which this code run is influencing the results. For such a trivial work like this it's better probably a single thread.

This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data size is 128MB, which is not alot for modern processors to process in a single loop.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js