Why does similar code running in multiple threads have different running times? - c++

I met a very strange problem about a C++ multi-thread program which as belows.
#include<iostream>
#include<thread>
using namespace std;
int* counter = new int[1024];
void updateCounter(int position)
{
for (int j = 0; j < 100000000; j++)
{
counter[position] = counter[position] + 8;
}
}
int main() {
time_t begin, end;
begin = clock();
thread t1(updateCounter, 1);
thread t2(updateCounter, 2);
thread t3(updateCounter, 3);
thread t4(updateCounter, 4);
t1.join();
t2.join();
t3.join();
t4.join();
end = clock();
cout<<end-begin<<endl; //1833
begin = clock();
thread t5(updateCounter, 16);
thread t6(updateCounter, 32);
thread t7(updateCounter, 48);
thread t8(updateCounter, 64);
t5.join();
t6.join();
t7.join();
t8.join();
end = clock();
cout<<end-begin<<endl; //358
}
the first code block run about 1833 seconds,but the second which is almost same with the first one
run about 358 seconds.Beg for an answer!Thank you!

Writing to nearby variables from multiple threads is slow due to "false sharing" which is described here: What is "false sharing"? How to reproduce / avoid it?
Your offsets of 16/32/48/64 are 64 bytes apart because the int values are (on most common platforms) 4 bytes each. And 64 bytes is a common cache line size, so this puts each target value on its own cache line.
The performance difference is not nearly as large if you compile with optimization. Which of course you should always do when measuring performance. But there's still a difference, and it may get worse the more threads you have.
Finally, your benchmark is unfair because you always run the "slow" code first. That means the code and data are "cold" for the first experiment and "hot" for the second experiment. This is a common mistake in benchmarking, and may even be the dominant factor in the performance difference you're seeing, depending on your system.

Related

Number of Threads higher than the Number of Cores

#include <iostream>
#include <thread>
#include <unistd.h>
using namespace std;
void taskA() {
for(int i = 0; i < 10; ++i) {
sleep(1);
printf("TaskA: %d\n", i*i);
fflush(stdout);
}
}
void taskB() {
for(int i = 0; i < 10; ++i) {
sleep(1);
printf("TaskB: %d\n", i*i);
fflush(stdout);
}
}
void taskC() {
for(int i = 0; i < 10; ++i) {
sleep(1);
printf("TaskC: %d\n", i*i);
fflush(stdout);
}
}
void taskD() {
for(int i = 0; i < 10; ++i) {
sleep(1);
printf("TaskD: %d\n", i*i);
fflush(stdout);
}
}
void taskE() {
for(int i = 0; i < 10; ++i) {
sleep(1);
printf("TaskE: %d\n", i*i);
fflush(stdout);
}
}
void taskF() {
for(int i = 0; i < 10; ++i) {
sleep(1);
printf("TaskF: %d\n", i*i);
fflush(stdout);
}
}
void taskG() {
for(int i = 0; i < 10; ++i) {
sleep(1);
printf("TaskG: %d\n", i*i);
fflush(stdout);
}
}
void taskH() {
for(int i = 0; i < 10; ++i) {
sleep(1);
printf("TaskH: %d\n", i*i);
fflush(stdout);
}
}
int main(void) {
thread t1(taskA);
thread t2(taskB);
thread t3(taskC);
thread t4(taskD);
thread t5(taskE);
thread t6(taskF);
thread t7(taskG);
thread t8(taskH);
t1.join();
t2.join();
t3.join();
t4.join();
t5.join();
t6.join();
t7.join();
t8.join();
return 0;
}
In this C++ program, I created 8 similar functions (taskA to taskH) and 8 threads, one for each. When I executed, I got outputs of all the 8 functions parallely. But my Laptop has only 4 cores.
So the problem is how is it happening? 4 cores running 8 threads parallely, I didn't understand it! Please explain what's happening inside?
Thanks for your explanation!
Each core can run 1 thread at a time, or 2 threads in parallel if hyperthreading is enabled. So, on a system with 4 cores installed, there can be 4 or 8 threads running in parallel, max.
However, even so, your app is not the only one running threads. Every running process has at least 1 thread, maybe more. And the OS itself has dozens, maybe hundreds, of threads running. So clearly way way more threads total than the number of cores that are installed.
So, the OS has a built-in scheduler that is actively scheduling all of these running threads in such a way that cores will switch between threads at regular intervals, known as "time slices". This scheduling process is commonly known as "task switching".
This means that when a time slice on a core elapses, the core will temporarily pause the thread currently running on it, save that thread's state, and then resume an earlier paused thread for the next time slice, then pause and save that thread, switch to another thread, and so on, dozens/hundreds of times a second. Spread out over however many cores are installed.
Most systems are not real-time, so true parallel processing is just an illusion. Just a lot of switching between threads as core time slices become available.
That is it, in a nutshell. Obviously, things are more complex in practical use. There are lengthy articles, research papers, even books, on this topic, if you really want to know the gritty implementation details.
A bit about human physiology. Every person in the world has this thing called "reaction time". It is time between an event actually occuring and your brain realizing it and reacting to it. It varies between people, but it is extremely rare for a human to have a reaction time below 100ms, and below 10ms is unheard of. This means that if two events happens one after another, but the time between them is less than 10ms then a human will think that these events happened simultaneously.
What does it have to do with threading and cores? A CPU can only run N threads in parallel, where N is the number of cores (well, technically it is more complicated due to features like hyperthreading, but lets forget about it for now). So when you fire, say 10*N threads, then these cannot run in parallel. It is technically impossible. What actually happens is that the operating system has this internal piece of code called the scheduler, which controls which threads runs on which core at a given moment. And it jumps from one thread to another, so that every thread has some CPU time and can actually progress.
But you say "outputs of all the functions are coming at a same time". No they don't. The CPU processes billions of instructions per second. Or equivalently one or more instructions every 1/1bln second. The exact number depends on what exactly the CPU does, for example printing stuff to a monitor requires much more time, but still it can print probably thousands or tens of thousands characters into monitor below 10ms. And since this is below your reaction time, you only think that it happened in parallel, while in reality it did not.
and I am also using a sleep of 1sec
Sleep is not an action. It is a lack of action. And as such does not require CPU. The operations of "falling asleep" and "waking up" require some CPU (even though very little), but not waiting itself. And indeed, waiting does happen truely parallely, regardless of how many threads you have.

using multiple threads simultaneously c++

I'm having issues with using multiple threads for my madelbrot program.
One of the ways I tired following a tutorial
int sliceSize = 800 / threads;
double start = 0, end = 0;
for (int i = 0; i < threads; i++)
{
start = i * sliceSize;
end = ((1 + i) * sliceSize);
thrd.push_back(thread(compute_mandelbrot, left, right, top, bottom, start, end));
}
for (int i = 0; i < threads; i++)
{
thrd[i].join();
}
thrd.clear();
but the code takes only half the time to compute, while using 8 threads.
I also tried something more complicated but it doesn't work at all
void slicer(double left, double right, double top, double bottom)
{
/*promise<int> prom;
future<int> fut = prom.get_future();*/
int test = -1;
double start = 0, end = 0;
const size_t nthreads = std::thread::hardware_concurrency(); //detect how many threads cpu has
{
int sliceSize = 800 / nthreads;
std::cout << "CPU has " << nthreads << " threads" << std::endl;
std::vector<std::thread> threads(nthreads);
for (int t = 0; t < nthreads; t++)
{
threads[t] = std::thread(std::bind(
[&]()
{
mutex2.lock();
test++;
start = (test) * sliceSize;
end = ((test + 1) * sliceSize);
mutex2.unlock();
compute_mandelbrot(left, right, top, bottom, start, end);
}));
}
std::for_each(threads.begin(), threads.end(), [](std::thread& x) {x.join(); }); //join threads
}
}
but it seems while it is computing 8 things at once they tend to over lap even after using a mutex, and it's not any faster.
This has given me a headache for the last 7h and I want to kill myself. Help.
There's a lot at play when you're trying to speed up a workload by multi-threading, and in the perfect world it's pretty much impossible to get an Nx speed-up when multiplying by N threads. Some things to bear in mind:
If you're making use of hyperthreading (so using 1 thread per virtual core on the system, not just per physical core), then you don't get the equivalent performance of 2 real cores - you'll get some percentage (probably around 1.2x or so).
The operating system (Windows) is going to be doing stuff while your workloads are executing. It's fairly random what and when these OS tasks cut into your app time, but it's going to make a difference. Always expect some percentage of your CPU time is going to be stolen by windows.
Any kind of synchronization is going to heavily impact performance. In your second example, mutexes are pretty hefty and are likely going to impact performance.
Memory accesses, cache access, etc, are going to come in to play. Multiple threads accessing memory all over the place is going to result in pressure on the cache, which is going to have a (potential) impact.
I'm curious - what sort of times are you looking at here? And how many iterations are you passing on each thread? To dig in and see what's happening timing-wise, you could try something like recording the start/end time of each thread using queryPerformanceCounter to see how long each is running, when they start, etc. Posting the times here for 1, 2, 4 and 8 threads would maybe shed a little light.
Hopefully this at least helps a little...

For loop performance and multithreaded performance questions

I was kind of bored so I wanted to try using std::thread and eventually measure performance of single and multithreaded console application. This is a two part question. So I started with a single threaded sum of a massive vector of ints (800000 of ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added range based and iterator based for loop and measured the same way with chrono::high_resolution_clock.
for (auto& val : ints)
sum += val;
for (auto it = ints.begin(); it != ints.end(); ++it)
sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug version, so I changed to release and the difference was ~1ms in favor of index based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a difference in 1ms in release mode?
I moved on to the thread creation, and tried to do a parallel sum of the array with this lambda (captured everything by reference so I could use vector of ints and a mutex previously declared) using index based for.
auto func = [&](int start, int total, int index)
{
int partial_sum = 0;
auto s = chrono::high_resolution_clock::now();
for (int i = start; i < start + total; ++i)
partial_sum += ints[i];
auto e = chrono::high_resolution_clock::now();
auto d = e - s;
m.lock();
cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
sum += partial_sum;
m.unlock();
};
for (int i = 0; i < 8; ++i)
threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is what's going on here? Solution with 2 threads ended with ~30ms as well. Cache ping pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also if It's relevant, I was trying this on an i7 with 8 threads, so yes I know I didn't count the main thread, but tried it with 7 separate threads and pretty much got the same result.
EDIT: Sorry forgot the mention this was on Windows 7 with Visual Studio 2013 and Visual Studio's v120 compiler or whatever it's called.
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation not turned on, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but really called. For template code, you really need to turn on optimisation to avoid that all the code is taken literally. For example, it's likely that your iterator code will call iter.end () 800,000 times, and operator!= for the comparison 800,000 times, which calls operator== and so on and so on.
For the multithreaded code, processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, change into turbo mode, change into heat protection mode. And rounding the times to milliseconds isn't really helpful. Could be one thread to 6.49 milliseconds and another too 6.51 and it got rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls and function calls have certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
sum += val;
end = ...
diff = end - start;
sum = 0;
Doesn't use the result of the loop at all so when optimized, the compiler should simply choose to throw away the code resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
The difference of 1ms may be produced by high granularity of the "high_resolution_clock" in the used implementation of the standard library and by differences in process scheduling during the execution. I measured the index based for being 0.04 ms slower, but that result is meaningless.
Aside from how std::thread is implemented on Windows I would to point your attention to your available execution units and context switching.
An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading. And HT does not magically double the available number of threads, no matter how it's advertised. It's a really clever system which tries to fit in instructions from an extra pipeline whenever possible. But in the end all instructions go through only four execution units.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS, admittedly most of which are asleep, that need time and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check if Visual Studio has Iterator checking enabled in debugging. When it's enabled every time an iterator is used it is bounds-checked and such. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
Lastly: have a look at the -openmp switch. If you enable that and apply the OpenMP #pragmas to your for-loops you can do away with all the manual thread creation. I toyed around with similar threading tests (because it's cool. :) ) and OpenMPs performance is pretty damn good.
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much which would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than I'm seeing, and the 8-thread time totals being so much greater, which as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this is demonstrating just how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they get instantiated. (I see no thread.start() or threads.StartAll() or alike). So, during the time the first thread instance gets active, the main thread might (or might not) be preempted. There is no guarantee that the threads are getting spawned on individual cores, after all (thread affinity).
If you have a closer look at POSIX APIs, there is the notion of "application context" and "system context", which basically implies, that there might be an OS policy in place which would not use all cores for 1 application.
On Windows (this is where you were testing), maybe the threads are not being spawned directly but via a thread pool, maybe with some extra std::thread functionality, which could produce overhead/delay. (Such as completion ports etc.).
Unfortunately my machine is pretty fast so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out, that typically, it starts to pay off to go parallel when the computation times are way beyond the time of a time slice (rule of thumb).
Here my "native" Windows implementation, which - for a large enough array finally makes the threads win over a single threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>
struct Range
{
Range( const int32_t *p, size_t l)
: data(p)
, length(l)
, result(0)
{}
const int32_t *data;
size_t length;
int32_t result;
};
static int32_t Sum(const int32_t * data, size_t length)
{
int32_t sum = 0;
const int32_t *end = data + length;
for (; data != end; data++)
{
sum += *data;
}
return sum;
}
static int32_t TestSingleThreaded(const Range& range)
{
return Sum(range.data, range.length);
}
DWORD
WINAPI
CalcThread
(_In_ LPVOID lpParameter
)
{
Range * myRange = reinterpret_cast<Range*>(lpParameter);
myRange->result = Sum(myRange->data, myRange->length);
return 0;
}
static int32_t TestWithNCores(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<HANDLE> threadHandles;
threadHandles.reserve(ncores);
for (size_t i = 0; i < ncores; ++i)
{
threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
if (WAIT_OBJECT_0 == waitResult)
{
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
}
else
{
throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
}
for (auto& h : threadHandles)
{
::CloseHandle(h);
}
return result;
}
static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<std::thread> threads;
for (size_t i = 0; i < ncores; ++i)
{
threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
for (auto& t : threads)
{
t.join();
}
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
return result;
}
void TestNativeThreads()
{
const size_t DATA_SIZE = 800000000ULL;
typedef std::vector<int32_t> DataVector;
DataVector data;
data.reserve(DATA_SIZE);
for (size_t i = 0; i < DATA_SIZE; ++i)
{
data.push_back(static_cast<int32_t>(i));
}
Range r = { data.data(), data.size() };
std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
int32_t result = TestSingleThreaded(r);
std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Single threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithNCores(r, 8);
std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Multi threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithSTLThreads(r, 8);
std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "std::thread sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
}
Here the output on my machine of this code:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last not least, I feel urged to mention that the way this code is written it is rather a memory IO performance benchmark than a core CPU computation benchmark.
Better computation benchmarks would use small amounts of data which is local, fits into CPU caches etc.
Maybe it would be interesting to experiment with the splitting of the data into ranges. What if each thread were "jumping" over the data from the start to an end with a gap of ncores? Thread 1: 0 8 16... Thread 2: 1 9 17 ... etc.? Maybe then the "locality" of the memory could gain extra speed.

Code runs 6 times slower with 2 threads than with 1

Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting with two threads that the code would run about 2 times as fast. Measuring it with a stopwatch I think it runs about 6 times slower! EDIT: Now using the computer and clock() function to tell the time.
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);
int main(int argn, char** argv)
{
// Program entry point
std::cout << "Generating data..." << std::endl;
// Create a vector containing many variables
std::vector<double> data;
for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);
// Calculate mean using 1 core
double mean = 0;
std::cout << "Calculating mean, 1 Thread..." << std::endl;
findmean(&data, 0, data.size(), &mean);
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Repeat, using two threads
std::vector<std::thread> thread;
std::vector<double> result;
result.push_back(0.0);
result.push_back(0.0);
std::cout << "Calculating mean, 2 Threads..." << std::endl;
// Run threads
uint32_t halfsize = data.size() / 2;
uint32_t A = 0;
uint32_t B, C, D;
// Split the data into two blocks
if(data.size() % 2 == 0)
{
B = C = D = halfsize;
}
else if(data.size() % 2 == 1)
{
B = C = halfsize;
D = hsz + 1;
}
// Run with two threads
thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));
// Join threads
thread[0].join();
thread[1].join();
// Calculate result
mean = result[0] + result[1];
mean /= (double)data.size();
// Print result
std::cout << " Mean=" << mean << std::endl;
// Return
return EXIT_SUCCESS;
}
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
}
I don't think this code is exactly wonderful, if you could suggest ways of improving it then I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
register double holding = *result;
for(uint32_t i = 0; i < length; i ++) {
holding += (*datavec).at(start + i);
}
*result = holding;
}
I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original Code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization:
1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds 2 Threads is now slower?
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector:
1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB):
1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB):
1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.
Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
Memory Bandwidth:
Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.
Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this is happening. The std::vector<T>::at() function (as well as the std::vector<T>::operator[]()) access the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
*result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.
More threads doesn't mean faster! There is an overhead in creating and context-switching threads, even the hardware in which this code run is influencing the results. For such a trivial work like this it's better probably a single thread.
This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data size is 128MB, which is not alot for modern processors to process in a single loop.

Negative Speedup on Multithreading my Program

On my laptop with Intel Pentium dual-core processor T2370 (Acer Extensa) I ran a simple multithreading speedup test. I am using Linux. The code is pasted below. While I was expecting a speedup of 2-3 times, I was surprised to see a slowdown by a factor of 2. I tried the same with gcc optimization levels -O0 ... -O3, but everytime I got the same result. I am using pthreads. I also tried the same with only two threads (instead of 3 threads in the code), but the performance was similar.
What could be the reason? The faster version took reasonably long - about 20 secs - so it seems is not an issue of startup overhead.
NOTE: This code is a lot buggy (indeed it does not make much sense as the output of serial and parallel versions would be different). The intention was just to "get" a speedup comparison for the same number of instructions.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>
class Thread{
private:
pthread_t thread;
static void *thread_func(void *d){((Thread *)d)->run();}
public:
Thread(){}
virtual ~Thread(){}
virtual void run(){}
int start(){return pthread_create(&thread, NULL, Thread::thread_func, (void*)this);}
int wait(){return pthread_join(thread, NULL);}
};
#include <iostream>
const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];
int main(void)
{
class Thread_a:public Thread{
public:
Thread_a(int* a): arr_(a) {}
void run()
{
for(int n = 0; n<N; n++)
for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
}
private:
int* arr_;
};
class Thread_b:public Thread{
public:
Thread_b(int* a): arr_(a) {}
void run()
{
for(int n = 0; n<N; n++)
for(int i=ARR_SIZE/3; i<2*ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
}
private:
int* arr_;
};
class Thread_c:public Thread{
public:
Thread_c(int* a): arr_(a) {}
void run()
{
for(int n = 0; n<N; n++)
for(int i=2*ARR_SIZE/3; i<ARR_SIZE; i++){ arr_[i] += arr_[i-1];}
}
private:
int* arr_;
};
{
Thread *a=new Thread_a(arr);
Thread *b=new Thread_b(arr);
Thread *c=new Thread_c(arr);
clock_t start = clock();
if (a->start() != 0) {
return 1;
}
if (b->start() != 0) {
return 1;
}
if (c->start() != 0) {
return 1;
}
if (a->wait() != 0) {
return 1;
}
if (b->wait() != 0) {
return 1;
}
if (c->wait() != 0) {
return 1;
}
clock_t end = clock();
double duration = (double)(end - start) / CLOCKS_PER_SEC;
std::cout << duration << "seconds\n";
delete a;
delete b;
}
{
clock_t start = clock();
for(int n = 0; n<N; n++)
for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
clock_t end = clock();
double duration = (double)(end - start) / CLOCKS_PER_SEC;
std::cout << "serial: " << duration << "seconds\n";
}
return 0;
}
See also: What can make a program run slower when using more threads?
The times you are reporting are measured using the clock function:
The clock() function returns an approximation of processor time used by the program.
$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds
real 0m5.247s
user 0m9.025s
sys 0m0.304s
The real time will be less for multiprocessor tasks, but the processor time will typically be greater.
When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources. clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread.
It's a little hard to tell from the question whether you knew this, and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or you were expecting it to be less.
Using clock_gettime() instead (you'll need the realtime library librt, g++ -lrt etc.) gives:
$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds
real 0m5.326s
user 0m9.057s
sys 0m0.344s
which still is less of a speed-up than one might hope for, but at least the numbers make some sense.
100000000*20/2.5s = 800Hz, the bus frequency is 1600 MHz, so I suspect with a read and a write for each iteration (assuming some caching), you're memory bandwidth limited as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (does anyone know whether clock() time includes such stalls?)
The only thing your thread does is adding some elements, so your application should be IO-bound. When you add an extra thread, you have 2 CPUs sharing the memory bus, so it won't go faster, instead, you'll have cache misses etc.
I believe that your algorithm essentially makes your cache memory useless.
Probably what you are seeing is the effect of (non)locality of reference between the three threads. Essentially because each thread is operating on a different section of data that is widely separated from the others you are causing cache misses as the data section for one thread replaces that for another thread in your cache. If your program was constructed so that the threads operated on sections of data that were smaller (so that they could all be kept in memory) or closer together (so that all threads could use the same in-cache pages), you'd see a performance boost. As it is I suspect that your slow down is because a lot of memory references are having to be satisifed from main memory instead of from your cache.
Not related to your threading issues, but there is a bounds error in your code.
You have:
for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
When i is zero you will be doing
arr[0] += arr[-1];
Also see herb's article on how multi cpu and cache lines interference in multithreaded code specially the section `All Sharing Is Bad -- Even of "Unshared" Objects...'
As others have pointed out, threads don't necessarily provide improvements to speed. In this particular example, the amount of time spent in each thread is significantly less than the amount of time required to perform context switches and synchronization.
tstenner has got it mostly right.
This is mainly a benchmark of your OS's "allocate and map a new page" algorithm. That array allocation allocates 800MB of virtual memory; the OS won't actually allocate real physical memory until it's needed. "Allocate and map a new page" is usually protected by a mutex, so more cores won't help.
Your benchmark also stresses the memory bus (minimum 800MB transferred; on OSs that zero memory just before they give it to you, the worst case is 800MB * 7 transfers). Adding more cores isn't really going to help if the bottleneck is the memory bus.
You have 3 threads that are trampling all over the same memory. The cache lines are being read and written to by different threads, so will be ping-ponging between the L1 caches on the two CPU cores. (A cache line that is to be written to can only be in one L1 cache, and that must be the L1 cache that is attached to the CPU code that's doing the write). This is not very efficient. The CPU cores are probably spending most of their time waiting for the cache line to be transferred, which is why this is slower with threads than if you single-threaded it.
Incidentally, the code is also buggy because the same array is read & written from different CPUs without locking. Proper locking would have an effect on performance.
Threads take you to the promised land of speed boosts(TM) when you have a proper vector implementation. Which means that you need to have:
a proper parallelization of your algorithm
a compiler that knows and can spread your algorithm out on the hardware as a parallel procedure
hardware support for parallelization
It is difficult to come up with the first. You need to be able to have redundancy and make sure that it's not eating in your performance, proper merging of data for processing the next batch of data and so on ...
But this is then only a theoretical standpoint.
Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Remember -- there is only one processor, so your threads have to wait for a time slice and essentially you are doing sequential processing.