I am scanning through every permutation of vectors and I would like to multithread this process (each thread would scan all the permutations of some of the vectors).
I managed to extract a minimal piece of code that does not speed up (I know it does not do anything useful, but it reproduces my problem).
#include <algorithm>
#include <string>
#include <vector>
#include <omp.h>

int main(int argc, char *argv[]) {
    std::vector<std::string *> myVector;
    for (int i = 0; i < 8; ++i) {
        myVector.push_back(new std::string("myString" + std::to_string(i)));
    }
    std::sort(myVector.begin(), myVector.end());

    omp_set_dynamic(0);
    omp_set_num_threads(8);
    #pragma omp parallel for shared(myVector)
    for (int i = 0; i < 100; ++i) {
        std::vector<std::string*> test(myVector);
        do { // here is a permutation
        } while (std::next_permutation(test.begin(), test.end())); // tests all the permutations of this combination
    }
    return 0;
}
The results are:
1 thread : 15 seconds
2 threads : 8 seconds
4 threads : 15 seconds
8 threads : 18 seconds
16 threads : 20 seconds
I am working on an i7 processor with 8 cores. I can't understand how it could be slower with 8 threads than with 1. I don't think the cost of creating the threads is higher than the cost of going through 40320 permutations, so what is happening?
Thanks to everyone's help, I finally managed to find the answer.
There were two problems:
A quick performance profile showed that most of the time was spent in std::_Lockit, which is part of the iterator debugging machinery of Visual Studio debug builds. To prevent that, just add the compiler options /D "_HAS_ITERATOR_DEBUGGING=0" /D "_SECURE_SCL=0". That was why adding more threads resulted in a loss of time.
Switching optimization on also helped improve the performance.
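For reference, a possible MSVC command line combining both fixes (the source file name permutations.cpp is just a placeholder):

cl /O2 /openmp /D "_HAS_ITERATOR_DEBUGGING=0" /D "_SECURE_SCL=0" permutations.cpp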
I am trying to write code using OpenMP, a class, and Eigen.
The overall code is at the bottom.
The code is a simplified version of what I want to do in my project.
The simplified code is just a matrix inversion and multiplication using a class and OpenMP.
If I don't use OpenMP, I get the following results:
solution in for loop at 0th iteration is... 1 1 1
solution in class at 0th iteration is... 1 1 1
solution in for loop at 1th iteration is... 4 4 4
solution in class at 1th iteration is... 4 4 4
solution in for loop at 2th iteration is... 9 9 9
solution in class at 2th iteration is... 9 9 9
solution in for loop at 3th iteration is... 16 16 16
solution in class at 3th iteration is... 16 16 16
However, if I use OpenMP, I get the following results:
9 11 91
1
16 16
solution in class at 3th iteration is... 116 16 161
1 1
solution in class at 2th iteration is... 1
9 9 9
solution in for loop at 1th iteration is... solution in for loop at 9
13th iteration is... 1
solution in class at 0th iteration is... 4 4 4 16 solution in for loop at 1
solution in for loop at 01 11
1 solution in for loop at 1 solution in for loop at 41 3th iteration is... th iteration is...
16 9 solution in for loop at th iteration is... 1 solution in class at 33th iteration is... th iteration is... 161
16 solution in for loop at 016 16 2th iteration is... 1616 1616 16 161616th iteration is...
solution in for loop at 19 solution in class at 3th iteration is... 9 9
16
The results are corrupted.
Why does this happen?
Also, at first, I found that only using "#pragma omp parallel for" results in the error that occurs when LDLT is not initialized.
So, I added the following instead of "#pragma omp parallel for".
#pragma omp parallel for private(j,k) firstprivate(foo)
class Foo {
private:
    std::vector<Eigen::Matrix<double, 3, 3>> a;
    Eigen::LDLT<Eigen::MatrixXd> Minv; // single shared LDLT factorization
    std::vector<Eigen::Matrix<double, 3, 1>> b;
    std::vector<Eigen::Matrix<double, 3, 1>> solution;
public:
    Foo() {}
    ~Foo() {}
    void Initialization() {
        a.resize(4);
        b.resize(4);
        solution.resize(4);
        for (int i = 0; i < 4; i++) {
            a.at(i).setZero();
            b.at(i).setZero();
            solution.at(i).setZero();
        }
    }
    void SetInv(int idx) {
        a.at(idx) = (1.0/(idx+1.0))*Eigen::Matrix3d::Identity();
        b.at(idx) = (idx+1)*Eigen::Vector3d::Ones();
        Minv.compute(a.at(idx));
    }
    void Calculation(int idx) {
        solution.at(idx) = Minv.solve(b.at(idx));
    }
    Eigen::Matrix<double, 3, 1> &ReturnSolution(int idx) {
        return solution.at(idx);
    }
    void ShowSolution(int idx) {
        std::cout << " solution in class at " << idx << "th iteration is... " << solution.at(idx).transpose() << std::endl;
    }
};
I would like to know how the corrupted results are produced.
First of all, the results are corrupted because, by using OpenMP, you made your code multithreaded. Several threads are writing to cout at the same time, hence the mess.
If you want clear output, you need to use a mutex to make the printing thread-safe. Simplified example:
#include <mutex>

std::mutex mtx;

#pragma omp parallel for private(j,k) firstprivate(foo)
for (int i = 0; i < n; ++i) // the parallel for needs an actual loop to apply to; n is your loop bound
{
    /* Your work happens here */
    { // use braces to limit the scope in which the mutex is locked
        std::lock_guard<std::mutex> lock(mtx);
        std::cout << "Some result you want to print";
    }
}
And link to doc: https://en.cppreference.com/w/cpp/thread/lock_guard
Also, at first, I found that only using "#pragma omp parallel for"
results in the error that occurs when LDLT is not initialized.
Please provide sample code with your implementation of #pragma omp... so it will be easier to answer your question.
I am trying to measure the execution time.
I'm on Windows 10 and use the GCC compiler.
start_t = chrono::system_clock::now();
tree->insert();
end_t = chrono::system_clock::now();
rslt_period = chrono::duration_cast<chrono::nanoseconds>(end_t - start_t);
This is my code to measure the time of tree->insert().
The insert function works internally as follows (just pseudocode):
insert() {
    _load_node(node);
    // do something //
    _save_node(node, addr);
}

_save_node(n) {
    ofstream file(name);
    file.write(n);
    file.close();
}

_load_node(n, addr) {
    ifstream file(name);
    file.read_from(n, addr);
    file.close();
}
The actual results are below.
read is the number of _load_node executions.
write is the number of _save_node executions.
time is in nanoseconds.
read write time
1 1 1000000
1 1 0
2 1 0
1 1 0
1 1 0
1 1 0
2 1 0
1 1 1004000
1 1 1005000
1 1 0
1 1 0
1 1 15621000
I have no idea why these results come out this way and would like to understand them.
What you are trying to measure is ill-defined.
"How long did this code take to run" can seem simple. In practice, though, do you mean "how many CPU cycles my code took" ? Or how many cycles between my program and the other running programs ? Do you account for the time to load/unload it on the CPU ? Do you account for the CPU being throttled down when on battery ? Do you want to account for the time to access the main clock located on the motherboard (in terms of computation that is extremely far).
So, in practice timing will be affected by a lot of factors and the simple fact of measuring it will slow everything down. Don't expect nanosecond accuracy. Micros, maybe. Millis, certainly.
So, that leaves you in a position where any measurement will fluctuate a lot. The sane way is to average it out over multiple measurement. Or, even better, do the same operation (on different data) a thousand (million?) times and divide the results by a thousand.
Then, you'll get significant improvement on accuracy.
In code:
start_t = chrono::system_clock::now();
for(int i = 0; i < 1000000; i++)
tree->insert();
end_t = chrono::system_clock::now();
You are using the wrong clock. system_clock is not useful for timing intervals due to low resolution and its non-monotonic nature.
Use steady_clock instead. It is guaranteed to be monotonic and to have a fine enough resolution to be useful.
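A minimal sketch of the same measurement using steady_clock and the averaging loop suggested above (tree->insert() stands in for whatever call you are timing):

#include <chrono>

auto start_t = std::chrono::steady_clock::now();
for (int i = 0; i < 1000000; i++)
    tree->insert();
auto end_t = std::chrono::steady_clock::now();
auto rslt_period = std::chrono::duration_cast<std::chrono::nanoseconds>(end_t - start_t);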
The input consists of a set of tasks given in increasing order of start time, and each task has a certain duration associated with it.
The first line is the number of tasks; for example:
3
2 5
4 23
7 4
This means that there are 3 tasks. The first one starts at time 2, and ends at 7 (2+5). Second starts at 4, ends at 27. Third starts at 7, ends at 11.
We assume each task starts as soon as it is ready, and does not need to wait for a processor or anything else to free up.
This means we can keep track of the number of active tasks:
Time #tasks
0 - 2 0
2 - 4 1
4 - 11 2
11 - 27 1
I need to find 2 numbers:
Max number of active tasks at any given time (2 in this case) and
Average number of active tasks over the entire duration, computed here as:
[ 0*(2-0) + 1*(4-2) + 2*(11-4) + 1*(27-11) ] / 27
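which works out to (0 + 2 + 14 + 16) / 27 = 32/27 ≈ 1.19 active tasks on average.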
For this, I have first found the time at which all tasks have ended, using the code below:
#include "stdio.h"
#include "stdlib.h"
typedef struct
{
long int start;
int dur;
} task;
int main()
{
long int num_tasks, endtime;
long int maxtime = 0;
scanf("%ld",&num_tasks);
task *t = new task[num_tasks];
for (int i=0;i<num_tasks;i++)
{
scanf("%ld %d",&t[i].start,&t[i].dur);
endtime = t[i].start + t[i].dur;
if (endtime > maxtime)
maxtime = endtime;
}
printf("%ld\n",maxtime);
}
Can this be done using priority queues implemented as heaps?
Your question is rather broad, so I am just going to give you a teaser answer that will hopefully get you started, attempting to answer the first part of your question with a not necessarily optimized solution.
In your toy input, you have:
2 5
4 23
7 4
thus you can compute and store, in the array of structs that you have, the end time of each task rather than its duration, for later use. That gives an array like this:
2 7
4 27
7 11
Your array is already sorted (because the input is given in that order) by start time, and that's useful. Use std::sort to sort the array, if needed.
Observe how you could check the end time of the first task against the start times of the other tasks. With the right comparison, you can determine the number of tasks active together with the first task: if the end time of the first task is greater than the start time of the second task, those two tasks are active together at some point.
Then you would do the same for the comparison of the first task with the third task. After that you would know how many tasks were active together with the first task.
Afterwards, you are going to follow the same procedure for the second task, and so on.
Putting all that together in code, we get:
#include "stdio.h"
#include "stdlib.h"
#include <algorithm>
typedef struct {
int start;
int dur;
int end;
} task;
int compare (const task& a, const task& b) {
return ( a.start < b.start );
}
int main() {
int num_tasks;
scanf("%d",&num_tasks);
task *t = new task[num_tasks];
for (int i=0;i<num_tasks;i++) {
scanf("%d %d",&t[i].start,&t[i].dur);
t[i].end = t[i].start + t[i].dur;
}
std::sort(t, t + num_tasks, compare);
for (int i=0;i<num_tasks;i++) {
printf("%d %d\n", t[i].start, t[i].end);
}
int max_noOf_tasks = 0;
for(int i = 0; i < num_tasks - 1; i++) {
int noOf_tasks = 1;
for(int j = i + 1; j < num_tasks; j++) {
if(t[i].end > t[j].start)
noOf_tasks++;
}
if(max_noOf_tasks < noOf_tasks)
max_noOf_tasks = noOf_tasks;
}
printf("Max. number of active tasks: %d\n", max_noOf_tasks);
delete [] t;
}
Output:
2 7
4 27
7 11
Max. number of active tasks: 2
Now, good luck with the second part of your question.
PS: Since this is C++, you could have used an std::vector to store your structs rather than a plain array. That way you would also avoid manual memory management, since the vector takes care of that for you automatically.
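As for the priority-queue idea from the original question: one possible approach (a sketch of mine, not part of the answer above, and assuming the input really is given in increasing start-time order as stated) keeps a min-heap of end times while scanning the tasks. The heap size after pushing a task is the number of tasks active at that moment.

#include <cstdio>
#include <queue>
#include <vector>
#include <functional>

int main() {
    int num_tasks;
    scanf("%d", &num_tasks);

    // Min-heap of end times of tasks that are still active.
    std::priority_queue<long, std::vector<long>, std::greater<long>> ends;
    int max_active = 0;

    for (int i = 0; i < num_tasks; i++) {
        long start;
        int dur;
        scanf("%ld %d", &start, &dur);
        while (!ends.empty() && ends.top() <= start) // drop tasks that finished before this one starts
            ends.pop();
        ends.push(start + dur);                      // the current task becomes active
        if ((int)ends.size() > max_active)
            max_active = (int)ends.size();
    }
    printf("Max. number of active tasks: %d\n", max_active);
}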
I have a C++ program which basically performs some matrix calculations. For these I use LAPACK/BLAS and usually link against MKL or ACML, depending on the platform. A lot of these matrix calculations operate on different, independent matrices, and hence I use std::thread to let these operations run in parallel. However, I noticed that I get no speed-up when using more threads. I traced the problem down to the daxpy BLAS routine. It seems that if two threads use this routine in parallel, each thread takes twice the time, even though the two threads operate on different arrays.
The next thing I tried was writing a new simple method to perform vector additions to replace the daxpy routine. With one thread this new method is as fast as the BLAS routine, but, when compiling with gcc, it suffers from the same problem as the BLAS routine: doubling the number of threads running in parallel also doubles the amount of time each thread needs, so no speed-up is gained. However, using the Intel C++ Compiler this problem vanishes: with an increasing number of threads, the time a single thread needs is constant.
However, I also need to compile on systems where no Intel compiler is available. So my questions are: why is there no speed-up with gcc, and is there any possibility of improving the gcc performance?
I wrote a small program to demonstrate the effect:
// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread
#include <iostream>
#include <thread>
#include <vector>
#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"
void simplesum(double* a, double* b, std::size_t dim);
int main() {
    for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
        const std::size_t N { 936 };
        std::vector<std::size_t> times(num_threads, 0);

        auto threadfunction = [&](std::size_t tid)
        {
            const std::size_t dim { N * N };
            double* pA = new double[dim];
            double* pB = new double[dim];

            // initialize the full arrays
            for (std::size_t i {0}; i < dim; ++i) {
                pA[i] = i;
                pB[i] = 2 * i;
            }

            boost::posix_time::ptime now1 =
                boost::posix_time::microsec_clock::universal_time();

            for (std::size_t n {0}; n < 1000; ++n) {
                simplesum(pA, pB, dim);
            }

            boost::posix_time::ptime now2 =
                boost::posix_time::microsec_clock::universal_time();
            boost::posix_time::time_duration dur = now2 - now1;
            times[tid] += dur.total_milliseconds();

            delete[] pA;
            delete[] pB;
        };

        std::vector<std::thread> mythreads;

        // start threads
        for (std::size_t n {0}; n < num_threads; ++n)
        {
            mythreads.emplace_back(threadfunction, n);
        }

        // wait for threads to finish
        for (std::size_t n {0}; n < num_threads; ++n)
        {
            mythreads[n].join();
            std::cout << " Thread " << n+1 << " of " << num_threads
                      << " took " << times[n] << "msec" << std::endl;
        }
    }
}

void simplesum(double* a, double* b, std::size_t dim) {
    for (std::size_t i {0}; i < dim; ++i)
        { *(a++) += *(b++); }
}
The output with gcc:
Thread 1 of 1 took 532msec
Thread 1 of 2 took 1104msec
Thread 2 of 2 took 1103msec
Thread 1 of 3 took 1680msec
Thread 2 of 3 took 1821msec
Thread 3 of 3 took 1808msec
Thread 1 of 4 took 2542msec
Thread 2 of 4 took 2536msec
Thread 3 of 4 took 2509msec
Thread 4 of 4 took 2515msec
The output with icc:
Thread 1 of 1 took 663msec
Thread 1 of 2 took 674msec
Thread 2 of 2 took 674msec
Thread 1 of 3 took 681msec
Thread 2 of 3 took 681msec
Thread 3 of 3 took 681msec
Thread 1 of 4 took 688msec
Thread 2 of 4 took 689msec
Thread 3 of 4 took 687msec
Thread 4 of 4 took 688msec
So, with icc the time needed for one thread to perform the computations is constant (as I would have expected; my CPU has 4 physical cores), and with gcc the time for one thread increases. Replacing the simplesum routine with BLAS::daxpy yields the same results for icc and gcc (no surprise, as most time is spent in the library), which are almost the same as the gcc results stated above.
The answer is fairly simple: Your threads are fighting for memory bandwidth!
Consider that you perform one floating point addition per 2 stores (one initialization, one after the addition) and 2 reads (in the addition). Most modern systems providing multiple cpus actually have to share the memory controller among several cores.
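To put rough numbers on that (my own back-of-the-envelope estimate, not from the original answer): with N = 936, each array is 936² × 8 bytes ≈ 6.7 MiB, so one simplesum call streams roughly 20 MiB (two arrays read, one written back), and the 1000-call loop moves on the order of 20 GB per thread. A few threads doing that concurrently can easily saturate a shared memory controller.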
The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits exactly your problem:
Thread 1 of 1 took 657msec
Thread 1 of 2 took 1447msec
Thread 2 of 2 took 1463msec
[...]
Thread 1 of 8 took 5516msec
Thread 2 of 8 took 5587msec
Thread 3 of 8 took 5205msec
Thread 4 of 8 took 5311msec
Thread 5 of 8 took 2731msec
Thread 6 of 8 took 5545msec
Thread 7 of 8 took 5551msec
Thread 8 of 8 took 4903msec
However, by simply increasing the arithmetic density, we can see a significant increase in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation: *(a++) += std::exp(*(b++));. The result shows almost perfect scaling:
Thread 1 of 1 took 7671msec
Thread 1 of 2 took 7759msec
Thread 2 of 2 took 7759msec
[...]
Thread 1 of 8 took 9997msec
Thread 2 of 8 took 8135msec
Thread 3 of 8 took 10625msec
Thread 4 of 8 took 8169msec
Thread 5 of 8 took 10054msec
Thread 6 of 8 took 8242msec
Thread 7 of 8 took 9876msec
Thread 8 of 8 took 8819msec
But what about ICC?
First, ICC inlines simplesum. Proving that inlining happens is simple: using icc, I disabled multi-file interprocedural optimization and moved simplesum into its own translation unit. The difference is astonishing. The performance went from
Thread 1 of 1 took 687msec
Thread 1 of 2 took 688msec
Thread 2 of 2 took 689msec
[...]
Thread 1 of 8 took 690msec
Thread 2 of 8 took 697msec
Thread 3 of 8 took 700msec
Thread 4 of 8 took 874msec
Thread 5 of 8 took 878msec
Thread 6 of 8 took 874msec
Thread 7 of 8 took 742msec
Thread 8 of 8 took 868msec
To
Thread 1 of 1 took 1278msec
Thread 1 of 2 took 2457msec
Thread 2 of 2 took 2445msec
[...]
Thread 1 of 8 took 8868msec
Thread 2 of 8 took 8434msec
Thread 3 of 8 took 7964msec
Thread 4 of 8 took 7951msec
Thread 5 of 8 took 8872msec
Thread 6 of 8 took 8286msec
Thread 7 of 8 took 5714msec
Thread 8 of 8 took 8241msec
This already explains why the library performs badly: ICC cannot inline it, and therefore whatever it is that makes ICC perform better than g++ here cannot happen.
It also gives a hint as to what ICC might be doing right here... What if instead of executing simplesum 1000 times, it interchanges the loops so that it
Loads two doubles
Adds them 1000 times (or even performs a = 1000 * b)
Stores two doubles
This would increase arithmetic density without adding any exponentials to the function... How to prove this? Well, to begin let us simply implement this optimization and see what happens! To analyse, we will look at the g++ performance. Recall our benchmark results:
Thread 1 of 1 took 640msec
Thread 1 of 2 took 1308msec
Thread 2 of 2 took 1304msec
[...]
Thread 1 of 8 took 5294msec
Thread 2 of 8 took 5370msec
Thread 3 of 8 took 5451msec
Thread 4 of 8 took 5527msec
Thread 5 of 8 took 5174msec
Thread 6 of 8 took 5464msec
Thread 7 of 8 took 4640msec
Thread 8 of 8 took 4055msec
And now, let us exchange
for (std::size_t n{0}; n < 1000; ++n){
simplesum(pA, pB, dim);
}
with the version in which the inner loop was made the outer loop:
double* a = pA;
double* b = pB;
for (std::size_t i {0}; i < dim; ++i, ++a, ++b)
{
    double x = *a, y = *b;
    for (std::size_t n {0}; n < 1000; ++n)
    {
        x += y;
    }
    *a = x;
}
The results show that we are on the right track:
Thread 1 of 1 took 693msec
Thread 1 of 2 took 703msec
Thread 2 of 2 took 700msec
[...]
Thread 1 of 8 took 920msec
Thread 2 of 8 took 804msec
Thread 3 of 8 took 750msec
Thread 4 of 8 took 943msec
Thread 5 of 8 took 909msec
Thread 6 of 8 took 744msec
Thread 7 of 8 took 759msec
Thread 8 of 8 took 904msec
This proves that the loop interchange optimization is indeed the main source of the excellent performance ICC exhibits here.
Note that none of the tested compilers (MSVC, ICC, g++ and clang) will replace the loop with a multiplication, which improves performance by 200x in the single threaded and 15x in the 8-threaded cases. This is due to the fact that the numerical instability of the repeated additions may cause wildly differing results when replaced with a single multiplication. When testing with integer data types instead of floating point data types, this optimization happens.
How can we force g++ to perform this optimization?
Interestingly enough, the true killer for g++ is not an inability to perform loop interchange. When called with -floop-interchange, g++ can perform this optimization as well, but only when the odds are significantly stacked in its favor.
Instead of std::size_t all bounds were expressed as ints. Not long, not unsigned int, but int. I still find it hard to believe, but it seems this is a hard requirement.
Instead of incrementing pointers, index them: a[i] += b[i];
G++ needs to be told -floop-interchange. A simple -O3 is not enough.
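A minimal sketch of what those three changes amount to (my own rendering for illustration, not verbatim from this experiment; the surrounding 1000-iteration loop and dim would likewise use int bounds):

void simplesum(double* a, double* b, int dim) {
    for (int i = 0; i < dim; ++i)  // int bounds, not std::size_t
        a[i] += b[i];              // indexing instead of pointer increments
}
// built with something like: g++ -std=c++11 -O3 -floop-interchange threadmatrixsum.cpp -pthread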
When all three criteria are met, the g++ performance is similar to what ICC delivers:
Thread 1 of 1 took 714msec
Thread 1 of 2 took 724msec
Thread 2 of 2 took 721msec
[...]
Thread 1 of 8 took 782msec
Thread 2 of 8 took 1221msec
Thread 3 of 8 took 1225msec
Thread 4 of 8 took 781msec
Thread 5 of 8 took 788msec
Thread 6 of 8 took 1262msec
Thread 7 of 8 took 1226msec
Thread 8 of 8 took 820msec
Note: The version of g++ used in this experiment is 4.9.0 on x64 Arch Linux.
OK, I came to the conclusion that the main problem is that the processor acts on different parts of memory in parallel, and hence I assume one has to deal with lots of cache misses, which slow the process down further. Putting the actual sum function in a critical section
summutex.lock();
simplesum(pA, pB, dim);
summutex.unlock();
solves the problem of the cache misses, but of course does not yield an optimal speed-up. Anyway, because the other threads are now blocked, the simplesum method might as well use all available threads for the sum:
void simplesum(double* a, double* b, std::size_t dim, std::size_t numberofthreads) {
    omp_set_num_threads(numberofthreads);
    #pragma omp parallel
    {
        #pragma omp for
        for (std::size_t i = 0; i < dim; ++i)
        {
            a[i] += b[i];
        }
    }
}
In this case all the threads work on the same chunk of memory: it should be in the processor cache, and if the processor needs to load some other part of memory into its cache, the other threads benefit from this as well (depending on whether this is the L1 or L2 cache, but I reckon the details do not really matter for the sake of this discussion).
I don't claim that this solution is perfect or anywhere near optimal, but it seems to work much better than the original code. And it does not rely on loop-interchange tricks, which I cannot apply in my actual code.
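For completeness, a small sketch of how the two pieces combine at the call site (using std::lock_guard instead of the explicit lock()/unlock() shown above; summutex, pA, pB, dim and numberofthreads are the names used earlier):

{
    std::lock_guard<std::mutex> lock(summutex); // only one thread performs a sum at a time...
    simplesum(pA, pB, dim, numberofthreads);    // ...but that sum itself uses all available threads
}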
I have two shared global variables
int a = 0;
int b = 0;
and two threads
// thread 1
for (int i = 0; i < 10; ++i) {
    EnterCriticalSection(&sect);   // sect is a CRITICAL_SECTION initialized elsewhere
    a++;
    b++;
    std::cout << a << " " << b << std::endl;
    LeaveCriticalSection(&sect);
}

// thread2
for (int i = 0; i < 10; ++i) {
    EnterCriticalSection(&sect);
    a--;
    b--;
    std::cout << a << " " << b << std::endl;
    LeaveCriticalSection(&sect);
}
The code always prints the following output
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
That is quite strange; it looks like the threads are working sequentially. What is the problem here?
Thanks.
Each thread has a specific time slice during which it executes before being preempted. In your example, the time slice seems to be longer than the time required to complete the loop.
However, you can actively yield control by calling Sleep(0) after leaving the critical section inside the loop.
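A minimal sketch of what that could look like in thread 1 (assuming sect is the CRITICAL_SECTION from the question):

for (int i = 0; i < 10; ++i) {
    EnterCriticalSection(&sect);
    a++;
    b++;
    std::cout << a << " " << b << std::endl;
    LeaveCriticalSection(&sect);
    Sleep(0); // give up the rest of the time slice so the other thread can run
}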
IMO the leave/enter of the critical section in your example is so fast that the other thread does not get a chance to enter it in that moment.
Try putting in some (maybe random) sleeps to slow the code down and see the desired effect.
Note:
The default timeout for EnterCriticalSection is something like 30 days (which effectively means infinity), so you cannot expect the function to time out. And the documentation says:
There is no guarantee about the order in which threads will obtain ownership of the critical section, however, the system will be fair to all threads.
For me it looks like the topic discussed in http://social.msdn.microsoft.com/forums/en-US/windowssdk/thread/980e5018-3ade-4823-a6dc-5ddbcc3091d5/
Please look at the example from June 28, 2006.
(unfortunately I cannot find the original article by Microsoft telling about the change of CriticalSection)
Could you try your code on Windows XP? What does it show?
I guess that I/O operations (cout) affect the scheduling similarly to a Sleep() call, so starting with Windows Vista a thread could cause starvation of other threads when doing I/O inside a CS.