C++: stress a Windows system

I want to write a program in C++ that is able to stress a Windows 7 system. My intention is that this program brings CPU usage to 100% and uses all of the installed RAM.
I have tried a big for loop that runs a simple multiplication at every step: CPU usage increases, but the RAM used remains low.
What is the best approach to reach my target?

In an OS-agnostic way, you could allocate and fill heap memory obtained with malloc(3) (then do some computation with that zone), or in C++ with operator new. Be sure to test for failure of malloc, and increase the size of your zone progressively.
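A minimal sketch of that approach (my illustration; the 64 MiB step size and the endless busy-loop at the end are arbitrary choices, not anything prescribed above):
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
int main()
{
    const size_t step = 64 * 1024 * 1024; // grow the committed zone by 64 MiB per round
    std::vector<char*> blocks;
    for (;;)
    {
        char* p = static_cast<char*>(std::malloc(step));
        if (p == nullptr) // always test for malloc failure
        {
            std::puts("malloc failed; holding on to what we have");
            break;
        }
        std::memset(p, 0xA5, step); // touch every page so it actually counts as used RAM
        blocks.push_back(p);
    }
    // Keep computing over the allocated zones so the CPU stays at 100% too.
    volatile unsigned long sum = 0;
    for (;;)
        for (char* block : blocks)
            for (size_t i = 0; i < step; ++i)
                sum += block[i];
}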

If your goal is the ability to stress CPU and RAM utilization (and not necessarily writing a program), try HeavyLoad, which is free and does just that: http://www.jam-software.com/heavyload/

I wrote a test program for this using multithreading and Mersenne primes as the algorithm for stress testing.
The code first determines the number of cores on the machine (Windows only, guys, sorry!!):
NtQuerySystemInformation(SystemBasicInformation, &BasicInformation, sizeof(SYSTEM_BASIC_INFORMATION), NULL);
It then runs a test method to spawn a thread for each core:
// make sure we've got at least as many threads as cores
for (int i = 0; i < this->BasicInformation.NumberOfProcessors; i++)
{
    threads.push_back(thread(&CpuBenchmark::MersennePrimes, CpuBenchmark()));
}
running our load
BOOL CpuBenchmark::IsPrime(ULONG64 n) // determines whether the number n is prime
{
    // the bound must be i * i <= n; with a strict < the squares of primes (4, 9, 25, ...) would pass as prime
    for (ULONG64 i = 2; i * i <= n; i++)
        if (n % i == 0)
            return false;
    return true;
}
ULONG64 CpuBenchmark::Power2(ULONG64 n) // returns 2 raised to the power of n (equivalent to 1ULL << n)
{
    ULONG64 result = 1;
    for (ULONG64 i = 0; i < n; i++)
    {
        result *= 2;
    }
    return result;
}
VOID CpuBenchmark::MersennePrimes()
{
    ULONG64 i;
    ULONG64 n;
    for (n = 2; n <= 61; n++)
    {
        if (IsPrime(n))
        {
            i = Power2(n) - 1; // candidate Mersenne number 2^n - 1
            if (IsPrime(i))
            {
                // don't care about the result, just want to stress the CPU
            }
        }
    }
}
and waits for all threads to complete
// wait for them all to finish
for (auto& th : threads)
    th.join();
Full article and source code on my blog site
http://malcolmswaine.com/c-cpu-stress-test-algorithm-mersenne-primes/

Related

Memory occupation increase

I am trapped in a weird situation: my C++ code keeps consuming more memory (reaching around 70 GB) until the whole process gets killed.
I am invoking the C++ code from Python; it implements the longest common subsequence (LCS) length algorithm.
The C++ code is shown below:
#define MAX(a,b) (((a)>(b))?(a):(b))
#include <stdio.h>
int LCSLength(long unsigned X[], long unsigned Y[], int m, int n)
{
    int** L = new int*[m+1];
    for(int i = 0; i < m+1; ++i)
        L[i] = new int[n+1];
    printf("i am hre\n");
    int i, j;
    for(i=0; i<=m; i++)
    {
        printf("i am hre1\n");
        for(j=0; j<=n; j++)
        {
            if(i==0 || j==0)
                L[i][j] = 0;
            else if(X[i-1]==Y[j-1])
                L[i][j] = L[i-1][j-1]+1;
            else
                L[i][j] = MAX(L[i-1][j],L[i][j-1]);
        }
    }
    int tt = L[m][n];
    printf("i am hre2\n");
    for (i = 0; i < m+1; i++)
        delete [] L[i];
    delete [] L;
    return tt;
}
And my Python code is like this:
from ctypes import cdll
import ctypes
lib = cdll.LoadLibrary('./liblcs.so')
la = 36840
lb = 833841
a = (ctypes.c_ulong * la)()
b = (ctypes.c_ulong * lb)()
for i in range(la):
    a[i] = 1
for i in range(lb):
    b[i] = 1
print "test"
lib._Z9LCSLengthPmS_ii(a, b, la, lb)
IMHO, in the C++ code, after the new operations (which can allocate a large amount of memory on the heap), there should be no additional memory consumption inside the loop.
However, to my surprise, I observed that the used memory keeps increasing during the loop. (I am watching top on Linux, and the program keeps printing "i am hre1" before the process gets killed.)
This really confuses me: after the memory allocation there are only some arithmetic operations inside the loop, so why does the code take more memory?
Am I clear enough? Could anyone give me some help on this issue? Thank you!
You're consuming too much memory. The reason the system does not kill the process at allocation time is that Linux allows you to allocate more memory than you can actually use (overcommit); physical pages are only committed when they are first written.
http://serverfault.com/questions/141988/avoid-linux-out-of-memory-application-teardown
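A quick sketch (mine, not from the answer) that makes the overcommit behavior visible; the 64 GB figure is arbitrary, and the exact behavior depends on the vm.overcommit_memory setting:
#include <cstdio>
#include <cstdlib>
#include <cstring>
int main()
{
    // On a typical Linux configuration this malloc succeeds even on a machine
    // with far less RAM: pages are only committed when they are touched.
    size_t huge = 64ULL * 1024 * 1024 * 1024;
    char* p = static_cast<char*>(std::malloc(huge));
    std::printf("malloc(64 GB) %s\n", p ? "succeeded" : "failed");
    if (p)
        std::memset(p, 1, huge); // touching the pages is what eventually summons the OOM killer
    return 0;
}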
I just ran your scenario on a test machine. I was able to get past the calls to new and start the loop; only when the system decided that I was eating too much of the available RAM did it kill me.
This is what I got. A lovely OOM message in dmesg.
[287602.898843] Out of memory: Kill process 7476 (a.out) score 792 or sacrifice child
[287602.899900] Killed process 7476 (a.out) total-vm:2885212kB, anon-rss:907032kB, file-rss:0kB, shmem-rss:0kB
On Linux you would see something like this in your kernel logs or as the output from dmesg...
[287585.306678] Out of memory: Kill process 7469 (a.out) score 787 or sacrifice child
[287585.307759] Killed process 7469 (a.out) total-vm:2885208kB, anon-rss:906912kB, file-rss:4kB, shmem-rss:0kB
[287602.754624] a.out invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
[287602.755843] a.out cpuset=/ mems_allowed=0
[287602.756482] CPU: 0 PID: 7476 Comm: a.out Not tainted 4.5.0-x86_64-linode65 #2
[287602.757592] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[287602.759461] 0000000000000000 ffff88003d845780 ffffffff815abd27 0000000000000000
[287602.760689] 0000000000000282 ffff88003a377c58 ffffffff811d0e82 ffff8800397f8270
[287602.761915] 0000000000f7d192 000105902804d798 ffffffff81046a71 ffff88003d845780
[287602.763192] Call Trace:
[287602.763532] [<ffffffff815abd27>] ? dump_stack+0x63/0x84
[287602.774614] [<ffffffff811d0e82>] ? dump_header+0x59/0x1ed
[287602.775454] [<ffffffff81046a71>] ? kvm_clock_read+0x1b/0x1d
[287602.776322] [<ffffffff8112b046>] ? ktime_get+0x49/0x91
[287602.777127] [<ffffffff81156c83>] ? delayacct_end+0x3b/0x60
[287602.777970] [<ffffffff81187c11>] ? oom_kill_process+0xc0/0x367
[287602.778866] [<ffffffff811882c5>] ? out_of_memory+0x3bf/0x406
[287602.779755] [<ffffffff8118c646>] ? __alloc_pages_nodemask+0x8fc/0xa6b
[287602.780756] [<ffffffff811c095d>] ? alloc_pages_current+0xbc/0xe0
[287602.781686] [<ffffffff81186c1d>] ? filemap_fault+0x2d3/0x48b
[287602.782561] [<ffffffff8128adea>] ? ext4_filemap_fault+0x37/0x51
[287602.783511] [<ffffffff811a9d56>] ? __do_fault+0x68/0xb1
[287602.784310] [<ffffffff811adcaa>] ? handle_mm_fault+0x6a4/0xd1b
[287602.785216] [<ffffffff810496cd>] ? __do_page_fault+0x33d/0x398
[287602.786124] [<ffffffff819c6ab8>] ? async_page_fault+0x28/0x30
Take a look at what you are doing:
#include <iostream>
int main(){
    int m = 36840;
    int n = 833841;
    unsigned long total = 0;
    total += (sizeof(int*) * (m+1));    // the array of row pointers
    for(int i = 0; i < m+1; ++i){
        total += (sizeof(int) * (n+1)); // one row of the DP table
    }
    std::cout << total << '\n';
}
You're simply consuming too much memory.
If the size of your int is 4 bytes, you are allocating 122 GB.
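Not part of the original answers, but the standard way out of this blow-up: each row of the DP table only reads the row directly above it, so two rows suffice, shrinking the ~122 GB of tables to a few megabytes. A sketch:
#include <vector>
#include <algorithm>
int LCSLength(const unsigned long X[], const unsigned long Y[], int m, int n)
{
    // prev is row i-1 of the original table, cur is row i
    std::vector<int> prev(n + 1, 0), cur(n + 1, 0);
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            if (X[i - 1] == Y[j - 1])
                cur[j] = prev[j - 1] + 1;
            else
                cur[j] = std::max(prev[j], cur[j - 1]);
        }
        std::swap(prev, cur); // the current row becomes the previous row
    }
    return prev[n]; // after the final swap, prev holds the last computed row
}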

Why is filtering by primality in an infinite stream of numbers taking forever if processed in parallel?

I'm creating an infinite stream of Integers starting at 200 million, filtering this stream using a naive primality test implementation to generate load, and limiting the result to 10.
Predicate<Integer> isPrime = new Predicate<Integer>() {
    @Override
    public boolean test(Integer n) {
        for (int i = 2; i < n; i++) {
            if (n % i == 0) return false;
        }
        return true;
    }
};
Stream.iterate(200_000_000, n -> ++n)
      .filter(isPrime)
      .limit(10)
      .forEach(i -> System.out.print(i + " "));
This works as expected.
Now, if I add a call to parallel() before filtering, nothing is produced and the processing does not complete.
Stream.iterate(200_000_000, n -> ++n)
      .parallel()
      .filter(isPrime)
      .limit(10)
      .forEach(i -> System.out.print(i + " "));
Can someone point me in the right direction of what's happening here?
EDIT: I am not looking for a better primality test implementation (it is intended to be long-running) but for an explanation of the negative impact of using a parallel stream.
Processing actually completes, though it may take quite a long time depending on the number of hardware threads on your machine. The API documentation for limit warns that it might be slow for parallel streams.
Actually, a parallel stream first splits the computation into several parts according to the available parallelism level, performs the computation for every part, then joins the results together. How many parts does your task have? One per common FJP thread (= Runtime.getRuntime().availableProcessors()), plus (sometimes?) one for the current thread if it's not in the FJP. You can control it by adding
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");
Practically, for your task, the lower the number you set, the faster it will compute.
How is the unlimited task split? Your particular task is handled by IteratorSpliterator, whose trySplit method creates chunks of ever-increasing size, starting from 1024. You may try it yourself:
Spliterator<Integer> spliterator = Stream.iterate(200_000_000, n -> ++n).spliterator();
Spliterator[] spliterators = new Spliterator[10];
for (int i = 0; i < spliterators.length; i++) {
    spliterators[i] = spliterator.trySplit();
}
for (int i = 0; i < spliterators.length; i++) {
    System.out.print((i + 1) + ": ");
    spliterators[i].tryAdvance(System.out::println);
}
So the first chunk handles numbers in the range 200000000-200001023, the second handles the range 200001024-200003071, and so on. If you have only 1 hardware thread, your task will be split into two chunks, so 3072 numbers will be checked. If you have 8 hardware threads, your task will be split into 9 chunks and 46080 numbers will be checked. Only after all the chunks are processed does the parallel computation stop. The heuristic of splitting the task into such big chunks doesn't work well in your case, but you would see a performance boost if prime numbers around that region appeared once in every few thousand numbers.
Probably your particular scenario could be optimized internally (i.e., the computation could stop as soon as the first thread finds that the limit condition has already been reached). Feel free to report a bug to the Java bug tracker.
Update: after digging deeper into the Stream API, I concluded that the current behavior is a bug; I raised an issue and posted a patch. It's likely that the patch will be accepted for JDK 9 and probably even backported to the JDK 8u branch. With my patch the parallel version still does not improve performance, but at least its running time is comparable to the sequential stream's running time.
The reason the parallel stream takes so long is that all parallel streams use the common fork-join thread pool, and since you are submitting a long-running task (because your implementation of the isPrime method is not efficient), you block all threads in the pool; as a result, all other tasks using parallel streams are blocked.
To make the parallel version faster, you can implement isPrime more efficiently. For example:
Predicate<Integer> isPrime = new Predicate<Integer>() {
    @Override
    public boolean test(Integer n) {
        if (n < 2) return false;
        if (n == 2 || n == 3) return true;
        if (n % 2 == 0 || n % 3 == 0) return false;
        long sqrtN = (long) Math.sqrt(n) + 1;
        for (long i = 6L; i <= sqrtN; i += 6) {
            if (n % (i - 1) == 0 || n % (i + 1) == 0) return false;
        }
        return true;
    }
};
You will immediately notice the improvement in performance. In general, avoid using parallel streams when there is a possibility of blocking threads in the pool.

Is super long redundant code efficient?

I'm trying to find some primes with the sieve of the Greek guy (Eratosthenes) algorithm. I have some efficiency concerns. Here's the code:
void check_if_prime(unsigned number)
{
    unsigned index = 0;
    // primes is a global container holding the primes found so far
    while (primes[index] <= std::sqrt(number))
    {
        if (number % primes[index] == 0) return;
        ++index;
    }
    primes.push_back(number);
}
And, because I coded a huge 2/3/5/7/11/13 prime wheel, the code is 5795 lines long.
for (unsigned i = 0; i < selection; ++i)
{
    unsigned multiple = i * 30030;
    if (i != 0) check_if_prime(multiple + 1);
    check_if_prime(multiple + 17);
    check_if_prime(multiple + 19);
    check_if_prime(multiple + 23);
    // ...and so on until 30029
}
Optimization flags: -O3, -fexpensive-optimizations, -march=pentium2
25 million primes in 20 minutes, with the CPU stuck at 50% (no idea why; I tried real-time priority but it didn't change much). The output text file is 256 MB (I'm going to change to binary later on).
Compilation takes ages! Is that okay? How can I make it faster without compromising efficiency?
Is that if statement at the start of the for loop OK? I've read that if statements take the longest.
Anything else concerning the code, not the algorithm? Anything to make it faster? Which statements are faster than others?
Would an even bigger wheel (up to 510510, not just 30030, a hell of a lot of lines) compile within a day?
I want to find all primes up to 2^32, and little optimizations would save some hours and electricity. Thank you in advance!
EDIT: I'm not looking for a different algorithm, just for code improvements, if any can be made!
Here is what I can say about the performance of your program:
Likely your main problem is the call to std::sqrt(). This is a floating-point function, designed for full precision of the result, and it definitely takes quite a few cycles. I bet you'll be much faster if you use this check instead:
while (primes[index]*primes[index] <= number)
(note the <=: with a strict <, squares of primes would be pushed as primes). That way you are using an integer multiplication, which is trivial for modern CPUs.
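Putting that into the question's function, as a sketch (the cast to unsigned long long is an extra precaution I'm adding, since primes[index]*primes[index] can overflow a 32-bit unsigned once number approaches 2^32):
void check_if_prime(unsigned number)
{
    unsigned index = 0;
    // integer multiply instead of std::sqrt(); widen to 64 bits to avoid overflow
    while ((unsigned long long)primes[index] * primes[index] <= number)
    {
        if (number % primes[index] == 0) return;
        ++index;
    }
    primes.push_back(number);
}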
The if statement at the start of your for loop is irrelevant to performance; it isn't executed nearly enough times to matter. Your inner loop is the while loop within check_if_prime(). That's the one you need to optimize.
I can't see how you are doing output. There are ways of doing output that can severely slow you down, but I don't think that's the main issue (if it is an issue at all).
Code size can be an issue: your CPU has an instruction cache with limited capacity. If your 6k lines don't fit into the first-level instruction cache, the penalty can be severe. If I were you, I'd reimplement the wheel using data instead of code, i.e.:
unsigned const wheel[] = {1, 17, 19, 23, ...}; // add the rest of your wheel residues here
for (unsigned i = 0; i < selection; ++i)
{
    unsigned multiple = i * 30030;
    for (unsigned j = 0; j < sizeof(wheel)/sizeof(*wheel); j++) {
        check_if_prime(multiple + wheel[j]);
    }
}
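A possible refinement of the same idea (my sketch, not the answerer's): the wheel residues don't have to be spelled out at all. They are exactly the numbers in [1, 30030) coprime to 2, 3, 5, 7, 11 and 13, so the table can be generated at startup:
#include <vector>
std::vector<unsigned> make_wheel()
{
    std::vector<unsigned> wheel;
    for (unsigned r = 1; r < 30030; ++r)
        if (r % 2 && r % 3 && r % 5 && r % 7 && r % 11 && r % 13)
            wheel.push_back(r); // 5760 entries: 1, 17, 19, 23, 29, ...
    return wheel;
}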
Get it running under a debugger and single-step it, instruction by instruction; at each point, understand what it is doing and why. This makes you walk in the shoes of the CPU, and you will see all the silliness that nutty programmer is making you do, and what you could do better.
That's one way to make your code go as fast as possible.
Program size, by itself, only affects speed once you've got the code so fast that caching becomes an issue.
Here's a stab at some techniques for checking whether a number is prime:
#include <set>
bool is_prime(unsigned int number) // parameter is unsigned: negative numbers are not prime.
{
    // A data store for primes already calculated.
    static std::set<unsigned int> calculated_primes;
    // Simple checks first:
    //   Primes must be >= 2.
    //   Primes greater than 2 are odd.
    if ((number < 2)
        || ((number > 2) && ((number & 1) == 0)))
    {
        return false;
    }
    // Initialize the set with a few prime numbers, if necessary.
    if (calculated_primes.empty())
    {
        static const unsigned int primes[] =
            { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };
        static const unsigned int known_primes_quantity =
            sizeof(primes) / sizeof(primes[0]);
        calculated_primes.insert(&primes[0], &primes[known_primes_quantity]);
    }
    // Check if the number is a prime that is already calculated:
    if (calculated_primes.find(number) != calculated_primes.end())
    {
        return true;
    }
    // Every prime up to the largest calculated one is already in the set,
    // so a smaller number that wasn't found above must be composite:
    unsigned int prime_candidate = *calculated_primes.rbegin();
    if (number < prime_candidate)
    {
        return false;
    }
    // Extend the sieve of calculated primes until we reach the number:
    while (prime_candidate < number)
    {
        prime_candidate += 2;
        bool candidate_is_prime = true;
        for (std::set<unsigned int>::iterator prime_iter = calculated_primes.begin();
             prime_iter != calculated_primes.end();
             ++prime_iter)
        {
            if ((prime_candidate % (*prime_iter)) == 0)
            {
                candidate_is_prime = false;
                break;
            }
        }
        if (candidate_is_prime)
        {
            calculated_primes.insert(prime_candidate);
            if (prime_candidate == number)
            {
                return true;
            }
        }
    }
    return false;
}
Note: This is untested code but demonstrates some techniques for checking if a number is prime.

Mergesort pThread implementation taking same time as single-threaded

(I have tried to simplify this as much as I could to find out where I'm doing something wrong.)
The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges), and I try to create 2 threads, each one sorting the first half and the second half respectively, by calling the function merge_sort() with the respective parameters.
On the threaded run, I see the process going to 80-100% CPU usage (on a dual-core CPU), while on the no-threads run it only stays at 50%, yet the run times are very close.
This is the (relevant) code:
// These are the 2 sorting functions; each thread will call merge_sort(..). Is this a problem? (both threads calling the same (normal) function)
void merge (int *v, int start, int middle, int end) {
    // dynamically creates 2 new arrays for v[start..middle] and v[middle+1..end]
    // copies the original values into the 2 halves
    // then sorts them back into the v array
}
void merge_sort (int *v, int start, int end) {
    // recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
    // calls merge(start, middle, end)
}
// here I'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code, to find the bug more easily)
void* mergesort_t2(void * arg) {
    t_data* th_info = (t_data*)arg;
    merge_sort(v, th_info->a, th_info->b);
    return (void*)0;
}
// in main I simply create 2 threads calling the above function
int main (int argc, char* argv[])
{
    // some stuff (v, n, err and status are declared in the omitted parts)
    // getting the clock to calculate run time
    clock_t t_inceput, t_sfarsit;
    t_inceput = clock();
    // ignore crt_depth for this example (in the full code I'm recursively creating new threads and I need this to know when to stop)
    // a and b are the range of values the created thread will have to sort
    pthread_t thread[2];
    t_data next_info[2];
    next_info[0].crt_depth = 1;
    next_info[0].a = 0;
    next_info[0].b = n/2;
    next_info[1].crt_depth = 1;
    next_info[1].a = n/2+1;
    next_info[1].b = n-1;
    for (int i=0; i<2; i++) {
        if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
            cerr<<"error\n";
            return err;
        }
    }
    for (int i=0; i<2; i++) {
        if (pthread_join(thread[i], &status) != 0) {
            cerr<<"error\n";
            return err;
        }
    }
    // now I merge the 2 sorted halves
    merge(v, 0, n/2, n-1);
    // calculate end time
    t_sfarsit = clock();
    cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
    delete [] v;
}
Output (on 1 million values):
Sort time (s): 1.294
Output with direct calling of merge_sort, no threads:
Sort time (s): 1.388
Output (on 10 million values):
Sort time (s): 12.75
Output with direct calling of merge_sort, no threads:
Sort time (s): 13.838
Solution:
I'd like to thank WhozCraig and Adam too, as they hinted at this from the beginning.
I've used the inplace_merge(..) function instead of my own, and the program run times are now what they should be.
Here's my initial merge function (not really sure if it's the initial one; I've probably modified it a few times since, and the array indices might be wrong right now, since I went back and forth between [a,b] and [a,b). This was just the last commented-out version):
void merge (int *v, int a, int m, int c) { // sorts v[a,m] - v[m+1,c] into v[a,c]
    // create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];
    // copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];
    // merge them back together in sorted order
    int is=0, id=0;
    for (int i=0; i<=c-a; i++) {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }
    // arrays allocated with new[] need delete[], and each must be freed separately
    delete [] st;
    delete [] dr;
}
all this was replaced with:
inplace_merge(v+a, v+m, v+c);
Edit: some timings on my 3 GHz dual-core CPU:
1 million values:
1 thread : 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s
10 million values:
1 thread : 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s
There's one thing that struck me: "dynamically creates 2 new arrays [...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular, the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
Another thing is the often-forgotten starting half-sentence of any big-O complexity statement: "There is an n0 such that for all n > n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their result was that these limits are surprisingly high.
Note: since the OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for the sake of those who might find the information useful.
clock() is the wrong interface for measuring time on Linux: it measures the CPU time used by the program (see http://linux.die.net/man/3/clock), which in the case of multiple threads is the sum of the CPU time of all threads. You need to measure elapsed, or wall-clock, time. See more details in this SO question: C: using clock() to measure time in multi-threaded programs, which also tells you what API can be used instead of clock().
In the MPI-based implementation that you are trying to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included - so the CPU time is close to wall-clock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs; for one reason, if a program waits for, e.g., a network event or a message from another MPI process, it still spends time - but not CPU time.
Update: in Microsoft's implementation of the C run-time library, clock() returns wall-clock time, so it is OK to use for your purpose. It's unclear, though, whether you use Microsoft's toolchain or something else, like Cygwin or MinGW.
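For the Linux case described above, a minimal sketch (my illustration) of measuring wall-clock time instead of CPU time; clock_gettime with CLOCK_MONOTONIC is the usual POSIX choice, and on older glibc you may need to link with -lrt:
#include <ctime>
#include <cstdio>
double wallclock_seconds()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts); // wall-clock, unaffected by the number of threads
    return ts.tv_sec + ts.tv_nsec / 1e9;
}
// usage around the threaded sort:
//   double t0 = wallclock_seconds();
//   /* create, run and join the threads */
//   double t1 = wallclock_seconds();
//   printf("Sort time (s): %f\n", t1 - t0);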

C++0x threads give no speed up

I've written a program to search for the maximum in arrays using C++0x threads (for learning purposes). For the implementation I used the standard thread and future classes. However, the parallelized function consistently shows the same or worse run time than the non-parallelized one.
The code is below. I tried storing the data in a one-dimensional array, then in a multi-dimensional array, and ended up with several arrays. However, no option gave good results. I tried compiling and running my code from Eclipse and from the command line, still with no success. I also tried a similar test without array usage; parallelization gave only a 20% speed-up there. From my point of view, I'm running a very simple parallel program, without locks and with almost no resource sharing (each thread operates on its own array). What is the bottleneck?
My machine has an Intel Core i7 processor at 2.2 GHz with 8 GB of RAM, running Ubuntu 12.04.
// headers used by this excerpt
#include <iostream>
#include <future>
#include <algorithm>
#include <cstdlib>
#include <sys/time.h>
using namespace std;

const int n = 100000000;
int a[n], b[n], c[n], d[n];
int find_max_usual() {
    int res = 0;
    for (int i = 0; i < n; ++i) {
        res = max(res, a[i]);
        res = max(res, b[i]);
        res = max(res, c[i]);
        res = max(res, d[i]);
    }
    return res;
}
int find_max(int *a) {
    int res = 0;
    for (int i = 0; i < n; ++i)
        res = max(res, a[i]);
    return res;
}
int find_max_parallel() {
    future<int> res_a = async(launch::async, find_max, a);
    future<int> res_b = async(launch::async, find_max, b);
    future<int> res_c = async(launch::async, find_max, c);
    future<int> res_d = async(launch::async, find_max, d);
    int res = max(max(res_a.get(), res_b.get()), max(res_c.get(), res_d.get()));
    return res;
}
double get_time() {
    timeval tim;
    gettimeofday(&tim, NULL);
    double t = tim.tv_sec + (tim.tv_usec / 1000000.0);
    return t;
}
int main() {
    for (int i = 0; i < n; ++i) {
        a[i] = rand();
        b[i] = rand();
        c[i] = rand();
        d[i] = rand();
    }
    double start = get_time();
    int x = find_max_usual();
    cerr << x << " " << get_time() - start << endl;
    start = get_time();
    x = find_max_parallel();
    cerr << x << " " << get_time() - start << endl;
    return 0;
}
Timing showed that almost all the time in find_max_parallel is consumed by
int res = max(max(res_a.get(), res_b.get()), max(res_c.get(), res_d.get()));
Compilation command line
g++ -O3 -std=c++0x -pthread x.cpp
Update: the problem is solved. I got the desired results with the same test. 4 threads give about a 3.3x speed-up, 3 threads give about a 2.5x speed-up, and 2 threads behave almost ideally with a 1.9x speed-up. I had just rebooted the system with some new updates; I haven't seen any significant difference in CPU load or running programs.
Thanks to all for the help.
You have to explicitly set std::launch::async.
future<int> res_c = async(std::launch::async, find_max, c);
If you omit the flag, std::launch::async | std::launch::deferred is assumed, which leaves it up to the implementation to choose whether to start the task asynchronously or deferred.
Current versions of gcc use std::launch::deferred; MSVC has a runtime scheduler which decides at runtime how the task should be run.
Also note that if you try:
std::async(find_max, c);
this will also block, because the destructor of the returned std::future waits for the task to finish.
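A small sketch (my illustration) of the deferred behavior: with launch::deferred the task does not start until get() is called, and then it runs synchronously in the calling thread, so no parallel speed-up is possible.
#include <future>
#include <cstdio>
int work() { std::puts("running"); return 42; }
int main() {
    auto f = std::async(std::launch::deferred, work);
    std::puts("before get()");
    std::printf("%d\n", f.get()); // "running" is printed only now, from this thread
}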
I just ran the same test with gcc-4.7.1, and the threaded version is roughly 4 times faster (on a 4-core server).
So the problem is obviously not in the std::future implementation, but in choosing threading settings that are not optimal for your environment. As noted above, your test is not CPU-intensive but memory-intensive, so the bottleneck is definitely memory access.
You'd probably want to run some CPU-intensive test (like computing pi to high precision) to benchmark threading properly.
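A sketch of such a CPU-bound benchmark (my illustration, summing slices of the Leibniz series for pi; it is arithmetic-heavy and touches no large arrays):
#include <future>
#include <cstdio>
double pi_slice(long begin, long end) {
    double sum = 0.0;
    for (long k = begin; k < end; ++k)
        sum += (k % 2 == 0 ? 4.0 : -4.0) / (2.0 * k + 1.0);
    return sum;
}
int main() {
    const long terms = 400000000;
    // two async tasks, each summing half of the series; should scale near-linearly
    auto f1 = std::async(std::launch::async, pi_slice, 0L, terms / 2);
    auto f2 = std::async(std::launch::async, pi_slice, terms / 2, terms);
    std::printf("pi ~= %.10f\n", f1.get() + f2.get());
}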
Without experimenting with different numbers of threads and different array sizes, it's hard to say where exactly the bottleneck is, but there are probably a few things in play:
- You probably have a 2-channel memory controller (it's either 2 or 3), so going above 2 threads will just introduce additional contention around memory access. Thus your thesis about having no locking and no resource sharing is not correct: at the hardware level there's contention around concurrent memory access.
- The non-parallel version will be efficiently optimized by pre-fetching data into the cache. In the parallel version, on the other hand, there's a chance you end up with intensive context switching, and as a result thrash the CPU cache.
For both factors, you are likely to see a speedup if you tune the number of threads down to 2.