I am new to Eigen and am writing some simple code to test its performance. I am using a MacBook Pro with the M1 Pro chip (I do not know whether the ARM architecture is part of the problem). The code is a simple Laplace equation solver:
#include <iostream>
#include "mpi.h"
#include "Eigen/Dense"
#include <chrono>
using namespace Eigen;
using namespace std;
const size_t num = 1000UL;
MatrixXd initilize(){
    MatrixXd u = MatrixXd::Zero(num, num);
    u(seq(1, fix<num-2>), seq(1, fix<num-2>)).setConstant(10);
    return u;
}

void laplace(MatrixXd &u){
    setNbThreads(8);
    MatrixXd u_old = u;
    u(seq(1,last-1),seq(1,last-1)) =
        (( u_old(seq(0,last-2,fix<1>),seq(1,last-1,fix<1>)) + u_old(seq(2,last,fix<1>),seq(1,last-1,fix<1>)) +
           u_old(seq(1,last-1,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(1,last-1,fix<1>),seq(2,last,fix<1>)) )*4.0 +
           u_old(seq(0,last-2,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(0,last-2,fix<1>),seq(2,last,fix<1>)) +
           u_old(seq(2,last,fix<1>),seq(0,last-2,fix<1>)) + u_old(seq(2,last,fix<1>),seq(2,last,fix<1>)) ) /20.0;
}
int main(int argc, const char * argv[]) {
    initParallel();
    setNbThreads(0);
    cout << nbThreads() << endl;
    MatrixXd u = initilize();
    auto start = std::chrono::high_resolution_clock::now();
    for (auto i=0UL; i<100; i++) {
        laplace(u);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    // cout << u(seq(0, fix<10>), seq(0, fix<10>)) << endl;
    cout << "Execution time (ms): " << duration.count() << endl;
    return 0;
}
Compile with gcc and enable OpenMPI:
james@MBP14 tests % g++-11 -fopenmp -O3 -I/usr/local/include -I/opt/homebrew/Cellar/open-mpi/4.1.3/include -o test4 test.cpp
Run the binary directly:
james@MBP14 tests % ./test4
8
Execution time (ms): 273
Run with mpirun and specify 8 threads:
james@MBP14 tests % mpirun -np 8 test4
8
8
8
8
8
8
8
8
Execution time (ms): 348
Execution time (ms): 347
Execution time (ms): 353
Execution time (ms): 356
Execution time (ms): 350
Execution time (ms): 353
Execution time (ms): 357
Execution time (ms): 355
So obviously the matrix operation is not running in parallel; instead, every thread is running the same copy of the code.
What should be done to solve this problem? Do I have some misunderstanding about using OpenMPI?
You are confusing OpenMPI with OpenMP.
The gcc flag -fopenmp enables OpenMP. It is one way to parallelize an application by using special #pragma omp statements in the code. The parallelization happens within a single compute node (which may have one or more CPUs), allowing the application to employ all cores of that node. OpenMP cannot be used to parallelize an application across multiple compute nodes.
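For illustration, here is a minimal OpenMP sketch (a generic loop, not the solver above; the reduction clause and omp_get_max_threads are standard OpenMP, not taken from the question). Compiled with g++ -fopenmp, the loop iterations are split across the cores of one machine:
#include <omp.h>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> v(1000000, 1.0);
    double sum = 0.0;
    // OpenMP divides the iterations among threads; the reduction merges the partial sums.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)v.size(); ++i)
        sum += v[i];
    std::cout << "max threads: " << omp_get_max_threads() << ", sum = " << sum << std::endl;
    return 0;
}
Without -fopenmp the pragma is simply ignored and the same code runs serially.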
On the other hand, MPI (where OpenMPI is one particular implementation) can be used to parallelize a code over multiple compute nodes (i.e., roughly speaking, over multiple computers that are connected). It can also be used to parallelize some code over multiple cores on a single computer. So MPI is more general, but also much more difficult to use.
To use MPI, you need to call "special" functions and do the hard work of distributing data yourself. If you do not do this, calling an application with mpirun simply creates several identical processes (not threads!) that perform exactly the same computation. You have not parallelized your application; you have just executed it 8 times.
There are no compiler flags that enable MPI. MPI is not built into any compiler. Rather, MPI is a standard and OpenMPI is one specific library that implements that standard. You should read a tutorial or book about MPI and OpenMPI; a web search turns up several good ones.
Note: Usually, MPI libraries such as OpenMPI ship with executables/scripts (e.g. mpicc) that behave like compilers. But they are just thin wrappers around compilers such as gcc. These wrappers are used to automatically tell the actual compiler the include directories and libraries to link with. But again, the compilers themselves do not know anything about MPI.
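For contrast, a minimal MPI sketch (a generic "hello world", not a parallelized version of the solver above) shows the kind of explicit setup MPI requires; each process learns its rank, and distributing the matrix between ranks is entirely up to you:
#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // id of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

    // A real solver would give each rank its own block of rows here and
    // exchange boundary rows with its neighbours via MPI_Send/MPI_Recv.
    std::cout << "process " << rank << " of " << size << std::endl;

    MPI_Finalize();
    return 0;
}
It would typically be built with the wrapper and launched with mpirun, e.g. mpicxx hello.cpp -o hello && mpirun -np 8 ./hello, which starts 8 cooperating processes rather than 8 copies of the same serial computation.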
Related
I am developing a C++ application that needs to process different images at the same time. The processing algorithm is built on top of OpenCV and uses its parallelism functionality.
The application works in the following way: for each image it has, it spawns a thread to execute the processing algorithm. Unfortunately, this scheme does not seem to work well with OpenCV's internal multithreading.
Minimal example:
#include <iostream>
#include <thread>
#include <chrono>
#include <opencv2/core.hpp>
void run(int thread_id, cv::Mat& mat)
{
    auto start = std::chrono::steady_clock::now();
    // multithreaded operation on mat
    mat.forEach<float>([](float& pixel, int const* position) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    });
    auto end = std::chrono::steady_clock::now();
    std::cout << "thread " << thread_id << " took "
              << (end - start).count() * 1e-9 << " sec"
              << std::endl;
}

int main()
{
    cv::Mat mat1(100, 100, CV_32F), mat2(100, 100, CV_32F);
    std::thread t1(run, 1, std::ref(mat1));
    std::thread t2(run, 2, std::ref(mat2));
    t1.join();
    t2.join();
    return 0;
}
Output on my machine:
thread 1 took 1.42477 sec
thread 2 took 12.1963 sec
It seems that the second operation is not taking advantage of multithreading. Looking at my CPU usage, I have the feeling that OpenCV assigns all its internal threads to the first operation and, when the second one arrives, there is no internal thread left. Thus, the second operation is executed sequentially in the application thread body.
Firstly, I would appreciate it if someone who has already faced similar issues with OpenCV could confirm that my hypothesis is correct.
Secondly, is there a way to dispatch the internal OpenCV resources more intelligently? For example, by assigning half of the threads to the first operation and half to the second one?
Multithreading objective
After writing my question, I realize that the purpose of doing multithreading at the application level might be unclear. Some people may argue that it suffices to run the two operations sequentially at the application level to take full advantage of internal OpenCV multithreading. This is true for the minimal example I posted here, but typically not all parts of processing algorithms can be run in parallel.
The idea behind multithreading at the application level is to try to run as many of the 'unparallelisable' parts as possible at the same time:
Operations 1 and 2 sequentially:
[-----seq 1----][-par 1 (full power)-][-----seq 2----][-par 2 (full power)-]
Operations 1 and 2 in parallel:
[-----seq 1----][------------par 1 (half power)------------]
[-----seq 2----][------------par 2 (half power)------------]
seq X = sequential task of operation X
par X = parallelisable task of operation X
We can see that application-level multithreading reduces the total computation time, because the sequential parts of different operations run concurrently.
I think your approach to multithreading is correct. I ran the code you provided and here's my output:
thread 1 took 2.30654 sec
thread 2 took 2.63872 sec
Maybe you should check the number of available threads for your program?
std::cout << std::thread::hardware_concurrency() << std::endl;
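If the internal pool really is being monopolised by the first operation, one thing worth experimenting with is OpenCV's global thread count (a sketch; cv::setNumThreads is process-wide, not a per-operation scheduler):
#include <opencv2/core.hpp>
#include <algorithm>
#include <iostream>
#include <thread>

int main() {
    std::cout << "hardware_concurrency: " << std::thread::hardware_concurrency() << std::endl;
    std::cout << "cv::getNumThreads() default: " << cv::getNumThreads() << std::endl;

    // Leave roughly half of the cores for the second application-level thread.
    cv::setNumThreads(std::max(1u, std::thread::hardware_concurrency() / 2));
    std::cout << "cv::getNumThreads() now: " << cv::getNumThreads() << std::endl;
    return 0;
}
Because the setting is global rather than per-operation, it is a blunt instrument; finer control would mean restructuring the processing around cv::parallel_for_ yourself.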
So I had this question: simple-division-of-labour-over-threads-is-not-reducing-the-time-taken. I thought I had it sorted, but coming back to this work I am no longer getting the crazy slowdown I saw before (caused by a mutex within rand()), yet I am not getting any improvement in total time taken either.
In the code I am splitting up a task of x iterations of work over y threads.
So if I want to do 100'000'000 calculations, in one thread that might take ~350ms; my hope is that with 2 threads (doing 50'000'000 calcs each) it would take ~175ms, with three threads ~115ms, and so on...
I know using threads won't perfectly split the work due to thread overheads and such, but I was hoping for at least some performance gain.
My slightly updated code is below (see "Code For Reference").
Results
1 thread:
starting thread: 1 workload: 100000000
thread: 1 finished after: 303ms val: 7.02066
==========================
thread overall_total_time time: 304ms
3 threads
starting thread: 1 workload: 33333333
starting thread: 3 workload: 33333333
starting thread: 2 workload: 33333333
thread: 3 finished after: 363ms val: 6.61467
thread: 1 finished after: 368ms val: 6.61467
thread: 2 finished after: 365ms val: 6.61467
==========================
thread overall_total_time time: 368ms
You can see that the 3-thread run actually takes slightly longer than 1 thread, even though each thread is only doing 1/3 of the work iterations. I see a similar lack of performance gain on my PC at home, which has 8 CPU cores.
It's not as if threading overhead should cost more than a few milliseconds (IMO), so I can't see what is going on here. I don't believe there are any resource-sharing conflicts, because this code is quite simple and uses no external inputs/outputs (other than RAM).
Code For Reference
in godbolt: https://godbolt.org/z/bGWdxE
In main() you can tweak the number of threads and amount of work (loop iterations).
#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <cstdint>
#include <math.h>

void thread_func(uint32_t iterations, uint32_t thread_id)
{
    // Print the thread id / workload
    std::cout << "starting thread: " << thread_id << " workload: " << iterations << std::endl;

    // Get the start time
    auto start = std::chrono::high_resolution_clock::now();

    // do some work for the required number of iterations
    double val{0};
    for (auto i = 1u; i <= iterations; i++)
    {
        val += i / (2.2 * i) / (1.23 * i); // some work
    }

    // Get the time taken
    auto total_time = std::chrono::high_resolution_clock::now() - start;

    // Print it out
    std::cout << "thread: " << thread_id << " finished after: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
              << "ms" << " val: " << val << std::endl;
}

int main()
{
    uint32_t num_threads = 3; // Max 3 in godbolt
    uint32_t total_work = 100'000'000;

    // Store the start time
    auto overall_start = std::chrono::high_resolution_clock::now();

    // Start all the threads doing work
    std::vector<std::thread> task_list;
    for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
    {
        task_list.emplace_back(std::thread([=](){ thread_func(total_work / num_threads, thread_id); }));
    }

    // Wait for the threads to finish
    for (auto &task : task_list)
    {
        task.join();
    }

    // Get the end time and print it
    auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
    std::cout << "\n==========================\n"
              << "thread overall_total_time time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
              << "ms" << std::endl;
    return 0;
}
Update
I think I have narrowed down my issue:
On my 64Bit VM I see:
Compiling for 32-bit no optimisation: more threads = runs slower!
Compiling for 32-bit with optimisation: more threads = runs a bit faster
Compiling for 64-bit no optimisation: more threads = runs faster (as expected)
Compiling for 64-bit with optimisation: more threads = same as without optimisation, except everything takes less time in general.
So my issue might just be from running 32-bit code on a 64-bit VM. But I don't really understand why adding threads does not work very well if my executable is 32-bit running on a 64-bit architecture...
There are many possible reasons that could explain the observed results, so I do not think that anyone could give you a definitive answer. Also, the majority of reasons have to do with peculiarities of the hardware architecture, so different answers might be right or wrong on different machines.
As already mentioned in the comments, it could very well be that there is something wrong with thread allocation, so you are not really enjoying any benefit from using multiple threads. Godbolt.org is a cloud service, so it is most probably very heavily virtualized, meaning that your threads are competing against who knows how many hundreds of other threads, so I would assign a zero amount of trust to any results from running on godbolt.
A possible reason for the bad performance of unoptimized 32-bit code on a 64-bit VM could be that the unoptimized 32-bit code is not making efficient use of registers, so it becomes memory-bound. The code looks like it would all nicely fit within the CPU cache, but even the cache is considerably slower than direct register access, and the difference is more pronounced in a multi-threaded scenario where multiple threads are competing for access to the cache.
A possible reason for the still-not-stellar performance of optimized 32-bit code on a 64-bit VM could be that the CPU is optimized for 64-bit use, so instructions are not pipelined as efficiently in 32-bit mode, or the arithmetic units are not used efficiently. It could also be that the divisions in your code make all threads contend for the divider circuitry, of which the CPU may have only one (or only one usable in 32-bit mode). That would mean that most threads do nothing but wait for the divider to become available.
Note that these situations where a thread is being slowed down due to contention for CPU circuitry are very different from situations where a thread is being slowed down due to waiting for some device to respond. When a device is busy, the thread that waits for it is placed by the scheduler in I/O wait mode, so it is not consuming any CPU. When you have contention for CPU circuitry, the stalling happens inside the CPU; there is no thread context switch, so the thread slows down while it appears as if it is running full speed (consuming a full CPU core.) This may be an explanation for your CPU utilization observations.
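One way to probe the divider-contention hypothesis (a rough experiment of my own, not part of the original code) is to time the same summation written with fewer divisions and see whether the scaling with thread count changes:
#include <chrono>
#include <cstdint>
#include <iostream>

// Two divisions per iteration, as in the question.
double work_div(uint32_t iterations)
{
    double val{0};
    for (uint32_t i = 1; i <= iterations; i++)
        val += i / (2.2 * i) / (1.23 * i);
    return val;
}

// Algebraically the same sum, but only one division per iteration.
double work_mul(uint32_t iterations)
{
    const double c = 1.0 / (2.2 * 1.23); // hoisted constant
    double val{0};
    for (uint32_t i = 1; i <= iterations; i++)
        val += c / i;
    return val;
}

int main()
{
    const uint32_t n = 100'000'000;
    auto time_it = [&](double (*fn)(uint32_t), const char *name) {
        auto t0 = std::chrono::high_resolution_clock::now();
        double val = fn(n);
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::high_resolution_clock::now() - t0).count();
        std::cout << name << ": " << ms << " ms, val = " << val << std::endl;
    };
    time_it(work_div, "division-heavy");
    time_it(work_mul, "fewer divisions");
    return 0;
}
Dropping these kernels into the threaded harness above (in place of the existing loop) and comparing how each scales from 1 to 3 threads would show whether the divide unit is the shared resource; if both variants behave the same, look elsewhere, e.g. at the 32-bit code generation.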
Working with a small program (in terms of run time), for example the C++ program below, is easy even though I only have a few cores on my computer.
#include <bits/stdc++.h>
using namespace std;

int main()
{
    vector<int> g1;
    for (int i = 1; i <= 10; i++)
        g1.push_back(i * 10);
    for (int i = 0; i < 10; i++){
        std::cout << g1[i] << endl;
    }
    return 0;
}
But now I am going to work with a program that has a very big vector (more than a million elements). There are a lot of other processes running as well, which makes it harder to finish the program on my computer (a MacBook) in a small run time. Is there any way I can do it in parallel (I mean with multiple threads)? That is, run the same program but reduce the time taken by processing in multiple threads. I'm very new to parallel computing, so let me know if the question is not clear enough.
Memory of my computer (MacBook): 8 GB 1600 MHz DDR3
Processor: 1.6 GHz Dual-Core Intel Core i5
If all the threads are using the same resource (the vector g1), then unfortunately there will not be a significant time saving.
Threads work best when they run asynchronously on separate resources.
Here is another question that goes more into depth of accessing the STL vector with threads: C++ Access to vector from multiple threads
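That said, when the per-element work is independent, one common pattern (a generic sketch, not specific to the code above) is to give each thread its own disjoint index range of the vector, so no two threads ever touch the same elements:
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main()
{
    std::vector<int> data(10'000'000, 1);
    unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<long long> partial(num_threads, 0); // one result slot per thread
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        // Each thread reads only its own range and writes only its own slot.
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        });
    }
    for (auto &w : workers) w.join();

    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::cout << "sum = " << total << std::endl;
    return 0;
}
Whether this actually reduces the wall-clock time still depends on the point above: if the per-element work is as trivial as a copy or an addition, memory bandwidth rather than the number of cores is usually the limit on a dual-core laptop.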
I developed a cross-platform C++ library which spawns threads at runtime.
I used a concurrent queue to dispatch computing tasks, so every thread is busy most of the time.
Now the question is how to pick a proper number of threads at runtime. As my tasks involve no I/O or networking operations, only calculations and heap-memory allocations, the best strategy would seem to be to spawn one thread per CPU core.
My code looks like this:
#include "concurrentqueue.h"
#include <algorithm>
#include <thread>
#include <vector>
#include <iostream>
#include <mutex>
std::mutex io_m;
struct Task {
int n;
};
void some_time_consuming_operations(Task &t) {
std::vector<int> vec;
for (int i = 0; i < t.n; ++i)
vec.push_back(1);
{
std::lock_guard<std::mutex> g(io_m);
std::cout << "thread " << std::this_thread::get_id() << " done, vec size:" << vec.size() << std::endl;
}
}
int main() {
// moodycamel's lockfree queue: https://github.com/cameron314/concurrentqueue
moodycamel::ConcurrentQueue<Task> tasks;
for (int i = 0; i < 100; ++i)
tasks.enqueue(Task{(i % 5) * 1000000 + 1000000});
// I left 2 threads for ui and other usages
std::vector<std::thread> jobs(std::max((size_t)2, (size_t)std::thread::hardware_concurrency() - 2));
std::cout << "thread num:" << jobs.size() << std::endl;
for (auto &job : jobs) {
job = std::thread([&tasks]() {
Task task;
while (tasks.try_dequeue(task))
some_time_consuming_operations(task);
});
}
for (auto &job : jobs)
job.join();
return 0;
}
However, when enabling multi-threading on my iOS device (iPhone XR, A12), the test program is 2 times slower than in single-thread mode. I've tested it on my Windows machine with a 4-core 8-thread Intel CPU, and there it is 6 times faster than in single-thread mode.
On my iPhone, the hardware_concurrency function returns 6, which is exactly the number of cores of the Apple A12. On my Windows machine, the number is 8.
I understand there are 4 energy-efficient cores (called Tempest) in Apple's A12, but Apple claims that the A11/A12 can use all six cores simultaneously (and I kept the device charging during the test), so I have no idea why it is slower than single-thread mode.
The test program is a game app built with UE4.
The four slower cores are a lot slower than the fast cores. So if you took a task that takes 6 seconds on a fast core and ran one second's worth of work on each core, then the two fast cores would finish after a second, while the four slow cores would take maybe ten seconds.
If you use GCD, iOS will shuffle these six threads between the cores, so you can gain up to a factor of 2.4 in speed. If your thread implementation doesn't do this, then you are slowing things down.
The solutions: Either use GCD (and get a speedup of 2.4) or use only two threads (and get a speedup of 2.0). That's on an iPhone XR; you'd need to find out the number of fast cores somehow.
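As a rough sketch of the GCD route (a hypothetical task loop using Apple's libdispatch, which needs Apple's clang with blocks support; this is not the questioner's UE4 code), dispatch_apply lets the system decide how to spread iterations across the asymmetric cores:
#include <dispatch/dispatch.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t task_count = 100;
    std::vector<double> results(task_count, 0.0);
    double *out = results.data(); // blocks capture by value, so write through a pointer

    // libdispatch chooses how many worker threads to run and schedules them
    // across the performance and efficiency cores.
    dispatch_queue_t queue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
    dispatch_apply(task_count, queue, ^(size_t i) {
        double val = 0.0;
        for (size_t k = 1; k <= 1000000; ++k) // placeholder for the real per-task work
            val += 1.0 / static_cast<double>(k);
        out[i] = val;
    });

    std::printf("done, results[0] = %f\n", results[0]);
    return 0;
}
With plain std::thread, the alternative suggested above is to size the pool by the number of fast cores only, which you would have to detect yourself.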
Consider the following code. It runs nThreads threads to copy floats from data1 to data2. It appears to get no speedup as nThreads increases; it even runs slower. I thought this might be related to thread-creation overhead, so I increased the sizes of the arrays to insane values, but it still doesn't speed up. Then I read about false sharing, but that only seems to matter when the falsely shared data are close enough to fit in one cache line, definitely not hundreds of megabytes apart.
#include <iostream>
#include <thread>
#include <cstring>
#include <time.h>
#include <sys/time.h>

static inline long double currentTime()
{
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC,&ts);
    return ts.tv_sec+(long double)(ts.tv_nsec)*1e-9;
}

void mythread(float* timePrev, float* timeNext, int kMin, int kMax)
{
    for(int q=0;q<16;++q) // take more time
        for(int k=kMin;k<kMax;++k)
            timeNext[k]=timePrev[k];
}

static inline void runParallelJob(float* timePrev, float* timeNext, int W, int H, int nThreads)
{
    std::thread* threads[nThreads];
    int total=W*H;
    int perThread=total/nThreads;
    for(int t=0;t<nThreads;++t)
    {
        int k0=t*perThread;
        int k1=(t+1)*perThread;
        threads[t]=new std::thread(mythread,timePrev,timeNext,k0,k1);
    }
    for(int t=0;t<nThreads;++t)
    {
        threads[t]->join();
        delete threads[t];
    }
}

int main()
{
    size_t W=20000,H=10000;
    float* data1=new float[W*H];
    float* data2=new float[W*H];
    memset(data1,0,W*H*sizeof(float));
    memset(data2,0,W*H*sizeof(float));
    for(float nThreads=1;nThreads<=8;++nThreads)
    {
        long double time1=currentTime();
        runParallelJob(data1, data2, W, H, nThreads);
        long double time2=currentTime();
        std::cerr << nThreads << " threads: " << (time2-time1)*1e+3 << " ms\n";
    }
}
I compile this program with g++ 4.5.1, with command g++ main.cpp -o threads -std=c++0x -O3 -lrt && ./threads. The output of this program on my Core i7 930 (quad core with HyperThreading) reads:
1 threads: 5426.82 ms
2 threads: 5298.8 ms
3 threads: 5865.99 ms
4 threads: 5845.62 ms
5 threads: 5827.3 ms
6 threads: 5843.36 ms
7 threads: 5919.97 ms
8 threads: 5862.17 ms
Originally the program, which was reduced to this test case, did a few multiplications, divisions and additions in the thread loop instead of plain copying, with the same performance.
Interestingly, if I omit -O3 from the compiler command line, 1 thread appears to execute for 11303 ms while 2 threads for 6398 ms (~2x speedup), but more threads still execute for about 5700 ms (no more speedup).
So, my question is: what am I missing? Why doesn't the performance scale with the number of threads in my case?
I think that the factor that limits copy speed here is memory bandwidth. Therefore, having multiple cores copy the data makes no difference, since all the threads have to share the same memory bandwidth.
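A back-of-the-envelope check (my own arithmetic, using the numbers above) supports this: each call to runParallelJob streams both 20000×10000 float arrays (about 800 MB each) 16 times, i.e. roughly 16 × (800 MB read + 800 MB written) ≈ 25.6 GB of memory traffic, and doing that in ~5.4 s works out to roughly 4-5 GB/s. That is already a sizeable fraction of what a DDR3 system of that generation can sustain for a combined read+write stream, so adding threads mostly just queues more requests against the same memory bus.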
In general, throwing more threads at a given task won't necessarily make it faster. The overhead of managing threads and context switches is expensive, so there needs to be a specific reason to go this route. Waiting for some I/O (database calls, service calls, disk access) would be one common reason to use additional threads, high-CPU tasks could benefit from threading on a multicore machine, and ensuring the user maintains control of an app with a dedicated UI thread is another case.