performance difference of pthreads - c++

I am writing performance-sensitive code. I implemented a simple scheduler to distribute workloads, and a master thread takes charge of the scheduler.
cpu_set_t cpus;
pthread_attr_t attr;
pthread_attr_init(&attr);

for (int i_group = 0; i_group < n_groups; i_group++) {
    std::cout << i_t << "\t" << i_group << "th group of cpu" << std::endl;

    for (int i = index; i < index + group_size[i_group]; i++) {
        struct timeval start, end;
        double spent_time;
        gettimeofday(&start, NULL);

        arguments[i].i_t = i_t;
        arguments[i].F_x = F_xs[i_t];
        arguments[i].F_y = F_ys[i_t];
        arguments[i].F_z = F_zs[i_t];

        // Pin the new thread to the core given by its thread_id
        CPU_ZERO(&cpus);
        CPU_SET(arguments[i].thread_id, &cpus);
        int err = pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
        if (err != 0) {
            std::cout << err << std::endl;
            exit(-1);
        }

        arguments[i].i_t = i_t;
        pthread_create(&threads[i], &attr, &cpu_work, &arguments[i]);

        gettimeofday(&end, NULL);
        spent_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
        std::cout << "create: " << spent_time << "s " << std::endl;
    }
    i_t++;
    cpu_count++;
    arr_finish[i_group] = false;
}
}   // closes an enclosing loop that is not shown in this excerpt
The master thread creates child threads as shown above. For a simple explanation, I will assume i_group = 1. The child threads divide and conquer a bunch of matrix-matrix multiplications. Here, rank means thread_id.
int local_first = size[2] * (rank - 1) / n_compute_thread;
int local_end   = size[2] * rank / n_compute_thread - 1;
//mkl_set_num_threads_local(10);

gettimeofday(&start, NULL);
for (int i_z = local_first; i_z <= local_end; i_z++) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                size[0], size[1], size[0], 1.0, F_x, size[0],
                rho[i_z], size[1], 0.0, T_gamma[i_z], size[1]);
}
for (int i_z = local_first; i_z <= local_end; i_z++) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                size[0], size[1], size[1], 1.0, T_gamma[i_z], size[0],
                F_y, size[1], 0.0, T_gamma2[i_z], size[0]);
}
gettimeofday(&end, NULL);
spent_time = ((end.tv_sec - start.tv_sec) * 1000000u + end.tv_usec - start.tv_usec) / 1.e6;
std::cout << i_t << "\t" << arg->thread_id << "\t" << sched_getcpu() << "\t"
          << "compute: " << spent_time << "s" << std::endl;
Even though the workload is fairly distributed, the performance of each thread varies too much. See the result below:
5 65 4 4 compute: 0.270229s
5 64 1 1 compute: 0.284958s
5 65 2 2 compute: 0.741197s
5 65 3 3 compute: 0.76302s
The second column shows how many matrix-matrix multiplications are done in a particular thread. The last column shows the consumed time.
When I first saw this result, I thought it was related to thread affinity. Thus, I added several lines to control the binding of threads. However, it did not change the trend in the last column.
My computer has 20 physical cores and 20 virtual cores. I made only 4 child threads to test. Of course, it was tested on a Linux machine.
Why does the performance of the threads vary so much, and how can I fix it?

First, are you actually creating a scheduler? Your code sample suggests that you are using the Linux scheduler and setting a thread attributes object, thread affinity parameters, etc. This difference is relevant in choosing how to approach the problem.
In any case, the question is big, and there are several additional questions/topics that could be raised to help clarify the conditions, and move closer to a real answer. To start, here are some things to consider:
1 - Length of benchmark test. Sub-second evaluation of thread performance within a pool of threads seems inadequate. Stretch the evaluation time to allow the scheduler time to settle. Perhaps several minutes.
( For an example of typical duration times used in an existing benchmarking utility, read this )
2 - Thread priorities. Your threads are not the only ones. Is it possible the kernel scheduler is periodically moving your threads around because threads belonging to other processes (other than the ones you have created) have a higher priority, and are therefore displacing yours, skewing the task completion times?
3 - Size of task. Is the number of operations required to complete each task small enough to fit within the time slice allocated by the scheduler? This could possibly contribute to the perception of thread-to-thread performance issues, especially if there are any differences in the number of operations between each task. (Processes that exceed the allotted CPU time slice are automatically moved down to a lower “tier,” while processes that make I/O requests or block will be moved to higher “tiers.”)
4 - Equality of tasks. You mention dividing and conquering a bunch of matrix-matrix multiplications, but are the matrices identical in size and similar in content? I.e., are you certain the count of operations in each task is equivalent to the count of operations in all the other tasks? Given the time slice the scheduler allocates to each equally prioritized thread, a task whose operation count is too large to complete in a single time slice will be more susceptible to longer completion times (context switching due to higher priority of other OS processes) than one with few enough operations to fit comfortably within one time slice.
5 - Other processes. I have mentioned this in other items above, but it deserves its own number. In order to use multiple cores, multiple threads are required at the same time. But the inverse is not true. A single core is not limited to a single thread. The OS can at any time preemptively interrupt one of your processes (threads) on a particular core with a higher-priority process (while not interrupting any other core), possibly skewing your time measurement. Again, a longer benchmarking time will serve to reduce the impact of thread-to-thread differences caused by this particular phenomenon.
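As a rough way to separate "the thread really did more work" from "the thread was descheduled or waiting", you can time each worker with both a wall clock and the per-thread CPU clock. This is only a sketch of the measurement idea (the function and variable names are mine, not from the question); on Linux, CLOCK_THREAD_CPUTIME_ID counts only the CPU time consumed by the calling thread:
// Minimal sketch (not the original code): compare wall time with per-thread
// CPU time inside the worker. If wall time is much larger than CPU time, the
// thread was descheduled or waiting; if both are large, it really did more work.
#include <ctime>
#include <cstdio>

static double diff_sec(const timespec &a, const timespec &b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

void *cpu_work_timed(void *arg) {
    timespec wall0, wall1, cpu0, cpu1;
    clock_gettime(CLOCK_MONOTONIC, &wall0);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu0);

    // ... the actual dgemm loop would go here ...

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu1);
    clock_gettime(CLOCK_MONOTONIC, &wall1);
    std::printf("wall: %.3fs  cpu: %.3fs\n",
                diff_sec(wall0, wall1), diff_sec(cpu0, cpu1));
    return nullptr;
}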

Related

Using threads to split up task is slowing down my work

So I had this question: simple-division-of-labour-over-threads-is-not-reducing-the-time-taken. I thought I had it sorted, but going back to revisit this work, I am not getting the crazy slowdown I was seeing before (due to the mutex within rand()), but nor am I getting any improvement in total time taken.
In the code I am splitting up a task of x iterations of work over y threads.
So if I want to do 100'000'000 calculations and one thread takes ~350ms, then my hope is that 2 threads (doing 50'000'000 calcs each) would take ~175ms, three threads ~115ms, and so on...
I know using threads won't perfectly split the work due to thread overheads and such, but I was hoping for at least some performance gain.
My slightly updated code is here:
Results
1thread:
starting thread: 1 workload: 100000000
thread: 1 finished after: 303ms val: 7.02066
==========================
thread overall_total_time time: 304ms
3 threads
starting thread: 1 workload: 33333333
starting thread: 3 workload: 33333333
starting thread: 2 workload: 33333333
thread: 3 finished after: 363ms val: 6.61467
thread: 1 finished after: 368ms val: 6.61467
thread: 2 finished after: 365ms val: 6.61467
==========================
thread overall_total_time time: 368ms
You can see the 3 threads actually take slightly longer than 1 thread, even though each thread is only doing 1/3 of the work iterations. I see a similar lack of performance gain on my PC at home, which has 8 CPU cores.
It's not like threading overhead should take more than a few milliseconds (IMO), so I can't see what is going on here. I don't believe there are any resource-sharing conflicts, because this code is quite simple and uses no external outputs/inputs (other than RAM).
Code For Reference
in godbolt: https://godbolt.org/z/bGWdxE
In main() you can tweak the number of threads and amount of work (loop iterations).
#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <cstdint>
#include <math.h>

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;

    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();

    // do some work for the required number of iterations
    double val{0};
    for (auto i = 1u; i <= iterations; i++)
    {
        val += i / (2.2 * i) / (1.23 * i); // some work
    }

    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;

    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << " val: " << val << std::endl;
}
int main()
{
    uint32_t num_threads = 3; // Max 3 in godbolt
    uint32_t total_work = 100'000'000;

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        task_list.emplace_back(std::thread([=]() { thread_func(total_work / num_threads, thread_id); }));
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;
    return 0;
}
Update
I think I have narrowed down my issue:
On my 64Bit VM I see:
Compiling for 32-bit no optimisation: more threads = runs slower!
Compiling for 32-bit with optimisation: more threads = runs a bit faster
Compiling for 64-bit no optimisation: more threads = runs faster (as expected)
Compiling for 64-bit with optimisation: more threads = same as without optimisation, except everything takes less time in general.
So my issue might just be from running 32-bit code on a 64-bit VM. But I don't really understand why adding threads does not work very well if my executable is 32-bit running on a 64-bit architecture...
There are many possible reasons that could explain the observed results, so I do not think that anyone could give you a definitive answer. Also, the majority of reasons have to do with peculiarities of the hardware architecture, so different answers might be right or wrong on different machines.
As already mentioned in the comments, it could very well be that there is something wrong with thread allocation, so you are not really enjoying any benefit from using multiple threads. Godbolt.org is a cloud service, so it is most probably very heavily virtualized, meaning that your threads are competing against who knows how many hundreds of other threads, so I would assign a zero amount of trust to any results from running on godbolt.
A possible reason for the bad performance of unoptimized 32-bit code on a 64-bit VM could be that the unoptimized 32-bit code is not making efficient use of registers, so it becomes memory-bound. The code looks like it would all nicely fit within the CPU cache, but even the cache is considerably slower than direct register access, and the difference is more pronounced in a multi-threaded scenario where multiple threads are competing for access to the cache.
A possible reason for the still not stellar performance of optimized 32-bit code on a 64-bit VM could be that the CPU is optimized for 64-bit use, so instructions are not efficiently pipelined when running in 32-bit mode, or that the arithmetic unit of the CPU is not being used efficiently. It could be that these divisions in your code make all threads contend for the divider circuitry, of which the CPU may have only one, or of which the CPU may have only one when running in 32-bit mode. That would mean that most threads do nothing but wait for the divider to become available.
Note that these situations where a thread is being slowed down due to contention for CPU circuitry are very different from situations where a thread is being slowed down due to waiting for some device to respond. When a device is busy, the thread that waits for it is placed by the scheduler in I/O wait mode, so it is not consuming any CPU. When you have contention for CPU circuitry, the stalling happens inside the CPU; there is no thread context switch, so the thread slows down while it appears as if it is running full speed (consuming a full CPU core.) This may be an explanation for your CPU utilization observations.
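One way to probe the divider-contention hypothesis is to compare the same loop against a multiply-only variant of similar length and see which one keeps scaling as threads are added. This is my own experiment sketch (the function names are made up, and the multiply-only body is arbitrary work, not an equivalent computation):
#include <cstdint>

// Two divisions per iteration, as in the question's loop.
double work_div(uint32_t iterations)
{
    double val = 0;
    for (uint32_t i = 1; i <= iterations; i++)
        val += i / (2.2 * i) / (1.23 * i);
    return val;
}

// Multiplications only; roughly the same number of floating-point operations.
double work_mul(uint32_t iterations)
{
    double val = 0;
    for (uint32_t i = 1; i <= iterations; i++)
        val += i * 2.2 * i * 1.23 * 1e-9;
    return val;
}
If work_div stops scaling with thread count while work_mul continues to, contention for (or the latency of) the divide unit is a plausible explanation; if both flatten out equally, look elsewhere.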

C++ multithreads run time issue

I have been studying C++ multithreading and have a question about it.
Here is what I understand about multithreading.
One of the reasons we use multiple threads is to reduce the run time, right?
For example, I think if we use two threads we can expect half of the execution time.
So, I tried to code to prove it.
Here is the code.
#include <vector>
#include <iostream>
#include <thread>
#include <future>
#include <ctime>
#include <cstdlib>
#include <algorithm>
#include <functional>

using namespace std;

#define iterationNumber 1000000

void myFunction(const int index, const int numberInThread, promise<unsigned long>&& p, const vector<int>& numberList) {
    clock_t begin, end;
    int firstIndex = index * numberInThread;
    int lastIndex = firstIndex + numberInThread;
    vector<int>::const_iterator first = numberList.cbegin() + firstIndex;
    vector<int>::const_iterator last = numberList.cbegin() + lastIndex;
    vector<int> numbers(first, last);

    unsigned long result = 0;
    begin = clock();
    for (int i = 0; i < numbers.size(); i++) {
        result += numbers.at(i);
    }
    end = clock();

    cout << "thread" << index << " took " << ((float)(end - begin)) / CLOCKS_PER_SEC << endl;
    p.set_value(result);
}

int main(void)
{
    vector<int> numberList;
    vector<thread> t;
    vector<future<unsigned long>> futures;
    vector<unsigned long> result;
    const int NumberOfThreads = thread::hardware_concurrency() ? thread::hardware_concurrency() : 2;
    int numberInThread = iterationNumber / NumberOfThreads;
    clock_t begin, end;

    for (int i = 0; i < iterationNumber; i++) {
        int randomN = rand() % 10000 + 1;
        numberList.push_back(randomN);
    }

    for (int j = 0; j < NumberOfThreads; j++) {
        promise<unsigned long> promises;
        futures.push_back(promises.get_future());
        t.push_back(thread(myFunction, j, numberInThread, std::move(promises), numberList));
    }

    for_each(t.begin(), t.end(), std::mem_fn(&std::thread::join));

    for (int i = 0; i < futures.size(); i++) {
        result.push_back(futures.at(i).get());
    }

    unsigned long RRR = 0;
    begin = clock();
    for (int i = 0; i < numberList.size(); i++) {
        RRR += numberList.at(i);
    }
    end = clock();

    cout << "not by thread took " << ((float)(end - begin)) / CLOCKS_PER_SEC << endl;
}
Because the hardware concurrency of my laptop is 4, it will create 4 threads, and each takes a quarter of numberList and sums up the numbers.
However, the result was different than I expected.
thread0 took 0.007232
thread1 took 0.007402
thread2 took 0.010035
thread3 took 0.011759
not by thread took 0.009654
Why? Why did it take more time than the serial version (not by thread)?
For example, I think if we use two threads we can expect half of the execution time.
You'd think so, but sadly, that is often not the case in practice. The ideal "N cores means 1/Nth the execution time" scenario occurs only when the N cores can execute completely in parallel, without any core's actions interfering with the performance of the other cores.
But what your threads are doing is just summing up different sub-sections of an array... surely that can benefit from being executed in parallel? The answer is that in principle it can, but on a modern CPU, simple addition is so blindingly fast that it isn't really a factor in how long it takes a loop to complete. What really does limit the execute speed of a loop is access to RAM. Compared to the speed of the CPU, RAM access is very slow -- and on most desktop computers, each CPU has only one connection to RAM, regardless of how many cores it has. That means that what you are really measuring in your program is the speed at which a big array of integers can be read in from RAM to the CPU, and that speed is roughly the same -- equal to the CPU's memory-bus bandwidth -- regardless of whether it's one core doing the reading-in of the memory, or four.
To demonstrate how much RAM access is a factor, below is a modified/simplified version of your test program. In this version of the program, I've removed the big vectors, and instead the computation is just a series of calls to the (relatively expensive) sin() function. Note that in this version, the loop is only accessing a few memory locations, rather than thousands, and thus a core that is running the computation loop will not have to periodically wait for more data to be copied in from RAM to its local cache:
#include <vector>
#include <iostream>
#include <thread>
#include <chrono>
#include <math.h>

using namespace std;

static int iterationNumber = 1000000;

unsigned long long threadElapsedTimeMicros[10];
unsigned long threadResults[10];

void myFunction(const int index, const int numberInThread)
{
    unsigned long result = 666;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < numberInThread; i++) result += 100 * sin(result);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    threadResults[index] = result;
    threadElapsedTimeMicros[index] = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();

    // We'll print out the value of threadElapsedTimeMicros[index] later on,
    // after all the threads have been join()'d.
    // If we printed it out now it might affect the timing of the other threads
    // that may still be executing
}

int main(void)
{
    vector<thread> t;
    const int NumberOfThreads = thread::hardware_concurrency();
    const int numberInThread = iterationNumber / NumberOfThreads;

    // Multithreaded approach
    std::chrono::steady_clock::time_point allBegin = std::chrono::steady_clock::now();
    for (int j = 0; j < NumberOfThreads; j++) t.push_back(thread(myFunction, j, numberInThread));
    for (int j = 0; j < NumberOfThreads; j++) t[j].join();
    std::chrono::steady_clock::time_point allEnd = std::chrono::steady_clock::now();

    for (int j = 0; j < NumberOfThreads; j++) cout << " The computations in thread #" << j << ": result=" << threadResults[j] << ", took " << threadElapsedTimeMicros[j] << " microseconds" << std::endl;
    cout << " Total time spent doing multithreaded computations was " << std::chrono::duration_cast<std::chrono::microseconds>(allEnd - allBegin).count() << " microseconds in total" << std::endl;

    // And now, the single-threaded approach, for comparison
    unsigned long result = 666;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < iterationNumber; i++) result += 100 * sin(result);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    cout << "result=" << result << ", single-threaded computation took " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;

    return 0;
}
When I run the above program on my dual-core Mac mini (i7 with hyperthreading), here are the results I get:
Jeremys-Mac-mini:~ lcsuser1$ g++ -std=c++11 -O3 ./temp.cpp
Jeremys-Mac-mini:~ lcsuser1$ ./a.out
The computations in thread #0: result=1062, took 11718 microseconds
The computations in thread #1: result=1062, took 11481 microseconds
The computations in thread #2: result=1062, took 11525 microseconds
The computations in thread #3: result=1062, took 11230 microseconds
Total time spent doing multithreaded computations was 16492 microseconds in total
result=1181, single-threaded computation took 49846 microseconds
So in this case the results are more like what you'd expect -- because memory access was not a bottleneck, each core was able to run at full speed, and complete its 25% portion of the total calculations in about 25% of the time that it took a single thread to complete 100% of the calculations... and since the four cores were running truly in parallel, the total time spent doing the calculations was about 33% of the time it took for the single-threaded routine to complete (ideally it would be 25% but there's some overhead involved in starting up and shutting down the threads, etc).
This is an explanation for the beginner.
It's not technically accurate, but IMHO it is not so far off that anyone is harmed by reading it.
It provides an entry point into understanding parallel processing terms.
Threads, Tasks, and Processes
It is important to know the difference between threads and processes.
By default, starting a new process allocates dedicated memory for that process. So processes share memory with no other processes, and could (in theory) be run on separate computers.
(You can share memory with other processes, via the operating system or "shared memory", but you have to add these features; they are not available for your process by default.)
Having multiple cores means that each running process can be executed on any idle core.
So basically one program runs on one core, another program runs on a second core, and a background service doing something for you runs on a third (and so on and so forth).
Threads are something different.
For instance, every process runs in a main thread.
The operating system implements a scheduler that allocates CPU time to programs. In principle it will say:
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
you get the idea..
The scheduler typically can prioritize between threads, so some programs get more CPU time than others.
The scheduler can of course schedule threads on all cores, but if it does this within a process (splits a process's threads over multiple cores), there can be a performance penalty, as each core holds its own very fast memory cache.
Since threads from the same process can access the same cache, sharing memory between threads is quite fast.
Accessing another core's cache is not as fast (if even possible without going via RAM), so in general schedulers prefer not to split a process over multiple cores.
The result is that the threads belonging to a process tend to run on the same core.
| Core 1 | Core 2 | Core 3 |
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
A process can spawn multiple threads; they all share the parent thread's memory area and will normally all run on the core the parent was running on.
It makes sense to spawn threads within a process if you have an application that needs to respond to something whose timing it cannot control.
E.g., the user presses a cancel button or attempts to move a window while the application is running calculations that take a long time to complete.
Responsiveness of the UI requires the application to spend time reading and handling what the user is attempting to do. This could be achieved in a main loop, if the program does parts of the calculation in each iteration.
However, that gets complicated very fast. So instead of having the calculation code exit in the middle of a calculation to check and update the UI and then continue, you run the calculation code in another thread.
The scheduler then makes sure that the UI thread and the calculation thread both get CPU time, so the UI responds to user input while the calculation continues.
And your code stays fairly simple.
But I want to run my calculations on another core to gain speed
To distribute calculations over multiple cores, you could spawn a new process for each calculation job. In this way the scheduler will know that each process gets its own memory, and each can easily be launched on an idle core.
However, you have a problem: you need to share memory with the other process so it knows what to do.
A simple way of doing this is sharing memory via the filesystem.
You could create a file with the data for the calculation, and then spawn a thread governing the execution of (and communication with) the other program (so your UI stays responsive while we wait for the results).
The governing thread runs the other program via system commands, which starts it as another process.
The other program will be written such that it takes the input file as an input argument, so we can run multiple instances of it on different files.
If the program terminates itself when it's done and creates an output file, it can run on any core (or several), and your process can read the output file.
This actually works, and should the calculation take a long time (like many minutes), this is perhaps OK, even though we use files to communicate between our processes.
For calculations that only take seconds, however, the file system is slow, and waiting for it will almost erase the performance gained by using processes instead of just threads. So other, more efficient memory sharing is used in real life, for instance creating a shared memory area in RAM.
The pattern "create a governing thread, spawn a subprocess, allow communication with the process via the governing thread, collect data when the process is complete, and expose it via the governing thread" can be implemented in multiple ways.
Tasks
Well "tasks" is ambiguous.
In general it means "Process or thread that solves a task".
However, in certain languages like C#, it is something that implements a thread-like thing that the scheduler can treat as a process. Other languages that provide a similar feature typically dub this either tasks or workers.
So with workers/tasks it appears to the programmer as if it were merely a thread that you can share memory with easily, via references, and control like any other thread by invoking methods on it.
But it appears to the scheduler as if it were a process that can be run on any core.
It solves the shared memory problem in a fairly efficient way, as part of the language, so the programmer won't have to reinvent this wheel for all tasks.
This is often referred to as "hybrid threading" or simply "parallel threads".
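For readers coming from C++ rather than C#, the closest standard facility to this "task" idea is probably std::async with a std::future; a small sketch of my own, not tied to any particular framework:
#include <future>
#include <numeric>
#include <vector>
#include <iostream>

int main()
{
    std::vector<int> data(1'000'000, 1);

    // Launch the summation as a task; std::launch::async requests a separate thread.
    auto task = std::async(std::launch::async, [&data] {
        return std::accumulate(data.begin(), data.end(), 0LL);
    });

    // The caller can do other work here, then collect the result.
    std::cout << "sum = " << task.get() << '\n';
    return 0;
}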
It seems that you have some misconceptions about multithreading. Simply using two threads cannot halve the processing time.
Multithreading is a somewhat complicated concept, and you can easily find related materials on the web; you should read one of them first. But I will try to give a simple explanation with an example.
No matter how many CPUs (or cores) you have, the total handling capacity of the CPU will always be the same whether you use multithreading or not, right? Then where does the performance difference come from?
When a program runs on a device (computer), it uses not only the CPU but also other system resources such as networks, RAM, hard drives, etc. If the flow of the program is serialized, there will be certain points in time when the CPU is idle, waiting for other system resources to finish. But if the program runs with multiple threads (multiple flows), when one thread turns idle (waiting for some task to be done by another system resource), the other threads can use the CPU. Therefore, you can minimize the idle time of the CPU and improve the time performance. This is one of the simplest examples of the benefit of multithreading.
Since your sample code is almost entirely CPU-consuming, using multiple threads may bring little performance improvement. Sometimes it can be worse, because multithreading also comes with the time cost of context switching.
FYI, parallel processing is not the same as multi-threading.
It is very good to point out the problems with Macs.
Provided you use an OS that can schedule threads in a useful manner, you have to consider whether a problem is basically the same sub-problem repeated many times. An example is matrix multiplication. When you multiply two matrices, certain parts of the computation are independent of the others. A 3x3 matrix times another 3x3 matrix requires 9 dot products, which can be computed independently of each other; each dot product itself requires 3 multiplications and 2 additions, but there the multiplications must be done first. So if we wanted to utilize a multithreaded processor for this task, we could use 9 cores or threads, and given that they get equal compute time or have the same priority level (which is adjustable on Windows), we would reduce the time to multiply the 3x3 matrices by a factor of 9. This is because we are essentially doing something 9 times that can be done at the same time by 9 people.
Now, for each of the 9 threads we could have 3 cores perform the multiplications, 3x9 = 27 cores altogether, reducing the time further, ideally towards t/27. But we have 18 additions, and here we can get no gain from more cores: within each dot product one addition must be piped into another. So the problem takes time t with one core, or ideally something like t/27 with 27 cores working together. Now you can see why 'linear' (embarrassingly parallel) problems are often sought out: they can be parallelized pretty well, like graphics for example (though some things, like backface culling, are sorting problems and inherently not linear, so parallel processing gives diminished performance boosts).
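To make the independence point concrete, here is a toy sketch (mine, not the answerer's) that computes the 9 entries of a 3x3 matrix product in 9 separate threads. For matrices this small the thread overhead dwarfs the arithmetic, so it only illustrates the structure of the work, not a speedup recipe:
#include <thread>
#include <array>
#include <iostream>

int main()
{
    using Mat = std::array<std::array<double, 3>, 3>;
    Mat a{{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}};
    Mat b{{{9, 8, 7}, {6, 5, 4}, {3, 2, 1}}};
    Mat c{};

    std::array<std::thread, 9> workers;
    for (int r = 0; r < 3; r++)
        for (int col = 0; col < 3; col++)
            workers[r * 3 + col] = std::thread([&, r, col] {
                double dot = 0;
                for (int k = 0; k < 3; k++) dot += a[r][k] * b[k][col];
                c[r][col] = dot;   // each thread writes a distinct entry
            });
    for (auto &w : workers) w.join();

    for (auto &row : c) {
        for (double v : row) std::cout << v << ' ';
        std::cout << '\n';
    }
    return 0;
}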
Then there is the added overhead of starting threads and how they are scheduled by the OS and processor. Hope this helps.

Code runs 6 times slower with 2 threads than with 1

Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting that with two threads the code would run about 2 times as fast. Measuring it with a stopwatch, I think it runs about 6 times slower! EDIT: Now using the computer and the clock() function to tell the time.
#include <iostream>
#include <vector>
#include <thread>
#include <cstdint>
#include <cstdlib>

void findmean(std::vector<double>*, std::size_t, std::size_t, double*);

int main(int argn, char** argv)
{
    // Program entry point
    std::cout << "Generating data..." << std::endl;

    // Create a vector containing many variables
    std::vector<double> data;
    for (uint32_t i = 1; i <= 1024 * 1024 * 128; i++) data.push_back(i);

    // Calculate mean using 1 core
    double mean = 0;
    std::cout << "Calculating mean, 1 Thread..." << std::endl;
    findmean(&data, 0, data.size(), &mean);
    mean /= (double)data.size();

    // Print result
    std::cout << " Mean=" << mean << std::endl;

    // Repeat, using two threads
    std::vector<std::thread> thread;
    std::vector<double> result;
    result.push_back(0.0);
    result.push_back(0.0);
    std::cout << "Calculating mean, 2 Threads..." << std::endl;

    // Run threads
    uint32_t halfsize = data.size() / 2;
    uint32_t A = 0;
    uint32_t B, C, D;

    // Split the data into two blocks
    if (data.size() % 2 == 0)
    {
        B = C = D = halfsize;
    }
    else if (data.size() % 2 == 1)
    {
        B = C = halfsize;
        D = halfsize + 1;
    }

    // Run with two threads
    thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
    thread.push_back(std::thread(findmean, &data, C, D, &(result[1])));

    // Join threads
    thread[0].join();
    thread[1].join();

    // Calculate result
    mean = result[0] + result[1];
    mean /= (double)data.size();

    // Print result
    std::cout << " Mean=" << mean << std::endl;

    // Return
    return EXIT_SUCCESS;
}

void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    for (uint32_t i = 0; i < length; i++) {
        *result += (*datavec).at(start + i);
    }
}
I don't think this code is exactly wonderful; if you could suggest ways of improving it, I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    register double holding = *result;
    for (uint32_t i = 0; i < length; i++) {
        holding += (*datavec).at(start + i);
    }
    *result = holding;
}
I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original Code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds
With reg variable and -O2 optimization:
1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds (2 Threads is now slower?)
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds
With TAS 's suggestion of using iterators to access contents of the vector:
1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (single channel memory 4GB):
1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds
Same as above on Core i7 920 (dual channel memory 2x2GB):
1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.
Each thread is running this loop:
for(uint32_t i = 0; i < length; i ++) {
*result += (*datavec).at(start + i);
}
And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.
Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
Memory Bandwidth:
Once you have eliminated the false sharing and gotten rid of the 6x slowdown, the reason you're not getting further improvement is that you've maxed out your memory bandwidth.
Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.
In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
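For reference, one common way to remove the false sharing is to force each thread's accumulator onto its own cache line and to accumulate into a local variable inside the loop. This is a sketch that adapts the findmean signature from the question; the 64-byte line size is an assumption (typical on x86, not guaranteed):
#include <vector>
#include <cstddef>

struct alignas(64) PaddedSum {   // one accumulator per cache line (assumed 64 bytes)
    double value = 0.0;
};

void findmean(const std::vector<double> *datavec, std::size_t start,
              std::size_t length, PaddedSum *result)
{
    double local = 0.0;                       // accumulate in a register-friendly local
    for (std::size_t i = 0; i < length; i++)
        local += (*datavec)[start + i];
    result->value = local;                    // single write-back at the end
}
Usage would be something like PaddedSum results[2]; passing &results[0] and &results[1] to the two threads keeps their accumulators on separate cache lines.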
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other place where extra work is happening. The std::vector<T>::at() function checks the index against the length of the vector on each element access (and, in unoptimized builds, even std::vector<T>::operator[]() goes through a function call for every access). To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
    *result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.
More threads doesn't mean faster! There is overhead in creating and context-switching threads, and even the hardware on which this code runs influences the results. For trivial work like this, a single thread is probably better.
This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data is 128M doubles (about 1 GiB), which is not a lot for modern processors to process in a single loop.

pthread_join is being a bottleneck

I have an application where pthread_join is the bottleneck. I need help to resolve this problem.
void *calc_corr(void *t) {
    begin = clock();
    // do work
    end = clock();
    duration = (double) (1000 * ((double)end - (double)begin) / CLOCKS_PER_SEC);
    cout << "Time is " << duration << "\t" << h << endl;
    pthread_exit(NULL);
}

int main() {
    start_t = clock();
    for (ii = 0; ii < 16; ii++)
        pthread_create(&threads.p[ii], NULL, &calc_corr, (void *)ii);
    for (i = 0; i < 16; i++)
        pthread_join(threads.p[15 - i], NULL);
    stop_t = clock();
    duration2 = (double) (1000 * ((double)stop_t - (double)start_t) / CLOCKS_PER_SEC);
    cout << "\n Time is " << duration2 << "\t" << endl;
    return 0;
}
The time printed in the thread function is in the range of 40 ms - 60 ms, whereas the time printed in the main function is in the range of 650 ms - 670 ms. The irony is, my serial code runs in 650 ms - 670 ms. What can I do to reduce the time taken by pthread_join?
Thanks in advance!
On Linux, clock() measures the combined CPU time. It does not measure the wall time.
This explains why you get ~640 ms = 16 * 40 ms (as pointed out in the comments).
To measure wall time, you should be using something like:
gettimeofday()
clock_gettime()
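For instance, a minimal sketch of the wall-clock measurement with clock_gettime(CLOCK_MONOTONIC); the thread creation and joining are elided here:
#include <ctime>
#include <cstdio>

int main()
{
    timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // ... create and join the 16 threads here ...

    clock_gettime(CLOCK_MONOTONIC, &stop);
    double ms = (stop.tv_sec - start.tv_sec) * 1000.0 +
                (stop.tv_nsec - start.tv_nsec) / 1e6;
    std::printf("Wall time: %.2f ms\n", ms);   // elapsed real time, not summed CPU time
    return 0;
}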
By creating threads you are adding overhead to your system: creation time and scheduling time. Creating a thread requires allocating the stack, etc.; scheduling means more context switching. Also, pthread_join suspends execution of the calling thread until the target thread terminates. This means you wait for thread 1 to finish; when it does, you are rescheduled as quickly as possible, but not instantly; then you wait for thread 2, etc.
Now, your computer has a few cores, like one or two, and you are creating 16 threads. At best two threads of your program will run at the same time, and just by adding their clock measurements you get something around 400 ms.
Again, it depends on a lot of things, so I have only quickly flown over what is happening.

C , C++ unsynchronized threads returning a strange result

Okay, I have a question regarding threads.
There are two unsynchronized threads running simultaneously and using a global resource, "int num":
1st:
void Thread()
{
    int i;
    for (i = 0; i < 100000000; i++)
    {
        num++;
        num--;
    }
}
2nd:
void Thread2()
{
    int j;
    for (j = 0; j < 100000000; j++)
    {
        num++;
        num--;
    }
}
The question states: what are the possible values of the variable "num" at the end of the program.
Now, I would say 0 will be the value of num at the end of the program. But try and run this code and you will find out that the result is quite random,
and I can't understand why.
The full code:
#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <stdlib.h>

static int num = 0;

void Thread(void *)
{
    int i;
    for (i = 0; i < 100000000; i++)
    {
        num++;
        num--;
    }
}

void Thread2(void *)
{
    int j;
    for (j = 0; j < 100000000; j++)
    {
        num++;
        num--;
    }
}

int main()
{
    uintptr_t handle, handle2;
    DWORD code, code2;

    handle  = _beginthread(Thread, 0, NULL);
    handle2 = _beginthread(Thread2, 0, NULL);

    // Busy-wait until both threads have exited
    while ((GetExitCodeThread((HANDLE)handle, &code) || GetExitCodeThread((HANDLE)handle2, &code2)) != 0);

    TerminateThread((HANDLE)handle, code);
    TerminateThread((HANDLE)handle2, code2);

    printf("%d ", num);
    system("pause");
}
num++ and num-- don't have to be atomic operations. To take num++ as an example, this is probably implemented like:
int tmp = num;
tmp = tmp + 1;
num = tmp;
where tmp is held in a CPU register.
Now let's say that num == 0, both threads try to execute num++, and the operations are interleaved as follows:
Thread A                 Thread B
int tmp = num;
tmp = tmp + 1;
                         int tmp = num;
                         tmp = tmp + 1;
num = tmp;
                         num = tmp;
The result at the end will be num == 1 even though it should have been incremented twice. Here, one increment is lost; in the same way, a decrement could be lost as well.
In pathological cases, all increments of one thread could be lost, resulting in num == -100000000, or all decrements of one thread could be lost, resulting in num == +100000000. There may even be more extreme scenarios lurking out there.
Then there's also other business going on, because num isn't declared as volatile. Both threads will therefore assume that the value of num doesn't change, unless they are the one changing it. This allows the compiler to optimize away the entire for loop, if it feels so inclined!
The possible values for num include all possible int values, plus floating point values, strings, and jpegs of nasal demons. Once you invoke undefined behavior, all bets are off.
More specifically, modifying the same object from multiple threads without synchronization results in undefined behavior. On most real-world systems, the worst effects you see will probably be missing or double increments or decrements, but it could be much worse (memory corruption, crashing, file corruption, etc.). So just don't do it.
The C11 and C++11 standards include atomic types which can be safely accessed from multiple threads without any synchronization API.
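As a sketch of what the same program looks like with a C++11 std::atomic<int>, where no updates can be lost (at the cost of much slower iterations):
#include <atomic>
#include <thread>
#include <iostream>

std::atomic<int> num{0};

void worker()
{
    for (int i = 0; i < 100000000; i++)
    {
        num++;   // atomic read-modify-write (fetch_add)
        num--;   // atomic read-modify-write (fetch_sub)
    }
}

int main()
{
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    std::cout << num << std::endl;   // always prints 0
}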
You speak of threads running simultaneously, which actually might not be the case if you only have one core in your system. Let's assume that you have more than one.
In the case of multiple devices having access to main memory, whether CPUs, bus-mastering devices, or DMA, they must be synchronized. This is handled by the lock prefix (implicit for the xchg instruction). It asserts a physical wire on the system bus which essentially signals all devices present to stay away. It is, for example, part of the Win32 function EnterCriticalSection.
So in the case of two cores on the same chip accessing the same position, the result would be undefined, which may seem strange considering that some synchronization should occur since they share the same L3 cache (if there is one). Seems logical, but it doesn't work that way. Why? Because a similar case occurs when the two cores are on different chips (i.e., don't have a shared L3 cache). You can't expect them to be synchronized. Well, you can, but consider all the other devices having access to main memory. If you plan to synchronize between two CPU chips you can't stop there: you have to perform a full-blown synchronization that blocks out all devices with access, and to ensure a successful synchronization all the other devices need time to recognize that a synchronization has been requested, and that takes a long time, especially if a device has been granted access and is performing a bus-mastering operation which must be allowed to complete. The PCI bus will perform an operation every 0.125 us (8 MHz), and considering that your CPUs run at 400 times that frequency, you're looking at A LOT of wait states. Then consider that several PCI clock cycles might be required.
You could argue that a medium-weight (memory-bus-only) lock should exist, but this means an additional pin on every processor and additional logic in every chipset just to handle a case which is really a misunderstanding on the programmer's part. So it's not implemented.
To sum it up: a generic synchronization that would handle your situation would render your PC useless due to its always having to wait for the last device to check in and OK the synchronization. It is a better solution to make it optional and only insert wait states when the developer has determined that it is absolutely necessary.
This was so much fun that I played a little with the example code and added spinlocks to see what would happen. The spinlock components were
// prototypes
char spinlock_failed (spinlock *);
void spinlock_leave (spinlock *);
// application code
while (spinlock_failed (&sl)) ++n;
++num;
spinlock_leave (&sl);
while (spinlock_failed (&sl)) ++n;
--num;
spinlock_leave (&sl);
spinlock_failed was constructed around the "xchg mem,eax" instruction. Once it stopped failing (failing means it did not get to set the spinlock; succeeding means the xchg found the lock free and set it), the protected operation ran and spinlock_leave released the lock with a plain "mov mem,0". The "++n" counts the total number of retries.
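For anyone who wants to reproduce the experiment without writing the assembly, a portable near-equivalent of an xchg-based spinlock can be built from C++11 std::atomic_flag; this sketch is mine, not the answerer's original code:
#include <atomic>

std::atomic_flag sl = ATOMIC_FLAG_INIT;

// Returns true while the lock is already held (i.e., acquiring failed),
// mirroring the spinlock_failed() used above.
inline bool spinlock_failed(std::atomic_flag &lock)
{
    return lock.test_and_set(std::memory_order_acquire);
}

inline void spinlock_leave(std::atomic_flag &lock)
{
    lock.clear(std::memory_order_release);
}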
I changed the loop to 2.5 million (because with two threads and two spinlocks per loop I get 10 million spinlocks, nice and easy to round with) and timed the sequences with the "rdtsc" count on a dual-core Athlon II M300 @ 2 GHz, and this is what I found:
Running one thread without timing (except for the main loop) and without locks (as in the original example): 33748884 cycles <=> 16.9 ms => 13.5 cycles/loop.

Running one thread with spinlocks, i.e. with no other core trying: 210917969 cycles <=> 105.5 ms => 84.4 cycles/loop <=> 0.042 us/loop. The spinlocks required 112581340 cycles <=> 22.5 cycles per spinlocked sequence. Still, the slowest spinlock required 1334208 cycles: that's 667 us, or only 1500 every second.

So, the addition of spinlocks unaffected by another CPU added several hundred percent to the total execution time. The final value in num was 0.

Running two threads without spinlocks took 171157957 cycles <=> 85.6 ms => 68.5 cycles/loop. num contained 10176.

Two threads with spinlocks took 4099370103 cycles <=> 2049 ms => 1640 cycles/loop <=> 0.82 us/loop. The spinlocks required 3930091465 cycles => 786 cycles per spinlocked sequence. The slowest spinlock required 27038623 cycles: that's 13.52 ms, or only 74 every second. num contained 0.
Incidentally the 171157957 cycles for two threads without spinlocks compares very favorably to two threads with spinlocks where the spinlock time has been removed: 4099370103-3930091465 = 169278638 cycles.
For my sequence the spinlock competition caused 21-29 million retries per thread which comes out to 4.2-5.8 retries per spinlock or 5.2-6.8 tries per spinlock. Addition of spinlocks caused an execution time penalty of 1927% (1500/74-1). The slowest spinlock required 5-8% of all tries.
As Thomas said, the results are unpredictable because your increment and decrement are non-atomic. You can use InterlockedIncrement and InterlockedDecrement -- which are atomic -- to see a predictable result.
Interlocked Variable Access (MSDN)
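A sketch of what the loop body from the question looks like with those interlocked functions (num must be a 32-bit LONG for these calls):
#include <windows.h>

static volatile LONG num = 0;

void Thread()
{
    for (int i = 0; i < 100000000; i++)
    {
        InterlockedIncrement(&num);   // atomic ++num
        InterlockedDecrement(&num);   // atomic --num
    }
}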