I am trying to compute eigenvectors of a matrix in C++ as efficiently as possible. The problem is that the most representative C++ libraries (OpenCV, Eigen and Armadillo) are considerably slower than MATLAB's equivalent eig function. This is very concerning to me: I would never have expected MATLAB to beat C++ code, especially with libraries as well known and widely used as these. To give an idea of the performance difference, below are the computation times for the eigenvector computation in the different libraries (units are seconds; some of them run into hours and days, yes, days!). All of them, including MATLAB, give me the expected results; only the execution time differs vastly.
Unfortunately, for my purposes I need to compute the eigenvalues of a matrix whose dimensions are (8192, 8192). From the table you can see that MATLAB takes only 17 secs while the second best (Armadillo) takes 47 secs. That might not sound like a big deal, but I need to repeat this operation thousands of times, so the time adds up quickly. I would tremendously appreciate it if someone could tell me what I am doing wrong, or whether this is just the sad reality I have to face: C++ code that is slower than MATLAB (by the way, props to the MATLAB coders who managed to make the eig function considerably faster than any of these C++ libraries). For those interested in the code and in how I compute the eigenvectors, I leave it below.
PS: All these methods were tested using release builds of the libraries (including libopenblas.dll for Armadillo; Eigen is the exception because it is a header-only library).
PS: All these computation times were obtained on the same computer.
// COV is a covariance matrix, meaning it is a symmetric matrix
// that just holds a bunch of double numbers (CV_64FC1).
// For those interested, this covariance matrix COV is just
// the result of the operation COV = M' * M, where M = X.t() is literally
// any double (CV_64FC1) matrix of dimensions (m, 8192); m can be any positive
// value, to be honest.
cv::Mat X = read_opencv_matrix(); // dim(X) = (8192, m)
cv::Mat COV, MEAN;
// dim(COV) = (8192, 8192)
cv::calcCovarMatrix(X.t(), COV, MEAN, cv::COVAR_NORMAL | cv::COVAR_ROWS);
int numRows = X.rows; // should be 8192
int numCols = X.cols; // can be anything to be honest
// computing eigen values using different libraries (opencv, armadillo, eigen)
// all of them give me the same results (including MATLAB function eig())
// the problem is that all of them are considerably slower than MATLAB
///////////////////////////////// OPENCV //////////////////////////
// opencv + eigen
if (do_opencv_eigen)
{
cv::Mat D, V;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
cv::eigen(cov, D, V);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] OpenCV + cv::eigen = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
// opencv + SVD
if (do_opencv_svd) // separate flag so the SVD run is not tied to the cv::eigen run
{
cv::Mat u, w, vt;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
cv::SVD::compute(cov, w, u, vt);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] OpenCV + cv::SVD = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
// opencv + SVD + MODIFY_A flag
if (do_opencv_svd_mod_a)
{
cv::Mat u2, w2, vt2;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
cv::SVD::compute(cov, w2, u2, vt2, cv::SVD::MODIFY_A);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] OpenCV + cv::SVD::MODIFY_A = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
///////////////////////// ARMADILLO /////////////////////////////
arma::mat arma_cov = Utils::opencv2arma(cov); // helper function to convert cv::mat to arma::mat
arma::mat arma_X = Utils::opencv2arma(X);
arma::mat arma_col_mean_rep = Utils::opencv2arma(col_mean_rep);
// compute arma eigen gen vectors
if (do_arma_eigen_gen)
{
arma::cx_vec arma_Dc;
arma::cx_mat arma_Vc;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
arma::eig_gen(arma_Dc, arma_Vc, arma_cov);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] Arma + arma::eig_gen = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
// compute arma eigen vectors with the symmetric solver (eig_sym)
if (do_arma_eigen_sym)
{
arma::vec arma_D;
arma::mat arma_V;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
arma::eig_sym(arma_D, arma_V, arma_cov);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] Arma + arma::eig_sym = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
// armadillo + svd
if (do_arma_svd)
{
arma::mat arma_U2;
arma::vec arma_s2;
arma::mat arma_V2;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
arma::svd(arma_U2, arma_s2, arma_V2, arma_cov);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] Arma + arma::svd = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
// armadillo + svd + econ
if (do_arma_svd_econ)
{
arma::mat arma_U2_econ;
arma::vec arma_s2_econ;
arma::mat arma_V2_econ;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
arma::svd_econ(arma_U2_econ, arma_s2_econ, arma_V2_econ, arma_cov);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] Arma + arma::svd_econ = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
/////////////////// EIGEN /////////////////////////////
Eigen::MatrixXd eig_cov = Utils::opencv2eigen(cov); // helper function to convert cv::Mat to Eigen::MatrixXd
Eigen::MatrixXd eig_X = Utils::opencv2eigen(X);
Eigen::MatrixXd eige_col_mean_rep = Utils::opencv2eigen(col_mean_rep);
//Eigen general eigen function
if (do_eigen_eigen)
{
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
Eigen::EigenSolver<Eigen::MatrixXd> es(eig_cov);
Eigen::MatrixXcd eig_VC = es.eigenvectors();
Eigen::MatrixXd eig_V = eig_VC.real();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] Eigen + Eigen::EigenSolver = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
// eigen library + SVD
if (do_eigen_SVD)
{
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
Eigen::BDCSVD<Eigen::MatrixXd> SVD(eig_cov, Eigen::ComputeThinU | Eigen::ComputeThinV);
Eigen::MatrixXd eig_V2 = SVD.matrixV();
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] Eigen + Eigen::BDCSVD = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
//Eigen library + SelfAdjointEigenSolver
if (do_eigen_SelfAdjoin)
{
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> esa;
esa.compute(eig_cov);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
auto count = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
std::cout << "[TIME] Eigen + Eigen::SelfAdjointEigenSolver = " << static_cast<double>(count) / 1000.0 << "[sec]" << std::endl;
}
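Every block above repeats the same steady_clock boilerplate. As a minimal sketch, a small helper could factor that out. The names time_seconds and report_time are hypothetical; they are not part of OpenCV, Armadillo or Eigen.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <iostream>
#include <string>

// Hypothetical helper: runs `work` once and returns the elapsed wall time in seconds.
inline double time_seconds(const std::function<void()>& work) {
    assert(static_cast<bool>(work) && "work must be callable");
    const auto begin = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}

// Hypothetical helper: times `work` and prints it in the same format as the blocks above.
inline double report_time(const std::string& label, const std::function<void()>& work) {
    const double seconds = time_seconds(work);
    std::cout << "[TIME] " << label << " = " << seconds << "[sec]" << std::endl;
    return seconds;
}
```

With it, a measurement would shrink to a single call such as report_time("Arma + arma::eig_sym", [&] { arma::eig_sym(arma_D, arma_V, arma_cov); });, provided the output variables are declared beforehand.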
UPDATE 1
OS : Windows 10
IDE : Visual Studio 2022
CPU : Intel Xeon CPU E5 2687W v4
OpenCV Version: 4.5.3 (compiled following the official instructions from OpenCV https://docs.opencv.org/4.x/d3/d52/tutorial_windows_install.html)
Armadillo: uses the default libopenblas.dll that comes with it
Eigen: I downloaded the latest version available as of today
Turning my comments into an answer now that a more complete understanding of the platform is available:
Your praise shouldn't go to Matlab but to Intel MKL, which Matlab uses. When you use Eigen with the MKL backend you get the same performance, at least for straightforward LAPACK calls. Make sure to compile with one of the parallelized implementations. If you see only one CPU core in use for most of the run time, fix your linker and compile flags.
In particular for Visual Studio 2022, and, I assume OneMKL 2023, the compile flags taken from MKL's Link Line Advisor and Eigen's documentation look like this.
For compilation: -I"%MKLROOT%\include" -DEIGEN_USE_MKL_ALL
Link line: mkl_intel_lp64_dll.lib mkl_tbb_thread_dll.lib mkl_core_dll.lib tbb12.lib
Alternatively you may use this line for OpenMP parallelization instead of TBB:
mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
No further changes such as including <Eigen/src/Core/util/MKL_support.h> should be necessary.
Compile errors
I get the following error Intel MKL ERROR: Parameter 6 was incorrect on entry to SGEMV (I used the link advisor to link the corresponding libraries)
I believe that error (or similarly worded ones with other function names) occurs when you link with the ILP64 interface of MKL instead of the LP64 interface. The wording in the Eigen documentation could be clearer in this regard. On the link advisor, use "Select interface layer: C API with 32 bit integer".
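To illustrate why an integer-width mismatch can end in a "parameter N was incorrect" error, here is a minimal sketch that does not call MKL at all. It assumes the caller stores 32-bit integers (the LP64 convention) while the callee reads 64-bit integers (the ILP64 convention) through the same pointer, as happens with by-reference BLAS arguments. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a callee built with 64-bit integers (ILP64):
// it reads 8 bytes from an address where the caller only stored a 4-byte int,
// so the neighbouring 4 bytes end up in the value as well.
inline std::int64_t read_as_ilp64(const std::int32_t* two_adjacent_ints) {
    assert(two_adjacent_ints != nullptr);
    std::int64_t wide = 0;
    // memcpy keeps the reinterpretation well defined in C++.
    std::memcpy(&wide, two_adjacent_ints, sizeof wide);
    return wide;
}
```

With std::int32_t args[2] = {8192, 1}; the callee does not see 8192 for the first argument. On a little-endian machine it sees 8192 + 2^32, a value that would plausibly fail the library's argument checks.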
Performance comparisons
Whether performance is exactly the same depends on whatever else Matlab decides to do. To the best of my knowledge its exact choice of decomposition is not documented. Eigen simply calls the syev family of functions, as far as I can tell. On my system (Linux with Intel i7-11800H), this computes the selfadjoint 2048x2048 decomposition in 0.6 seconds. So it is at least in the same ballpark and likely the exact same thing.
Alternatives
MKL is known to optimize primarily for Intel and may artificially cripple AMD, something that has affected Matlab as well. On AMD CPUs you may use the plain old LAPACKE and BLAS interface provided by AMD's AOCL library, in particular AOCL-BLIS as the BLAS backend and AOCL-libFLAME as LAPACK.
Do you think it is possible to use Armadillo with Intel MKL?
According to its FAQ, yes. Link against MKL as its LAPACK library.
Related
I am using cv::matchTemplate to track a moving object in a video.
However, running OpenCV's template matching on a small picture can be slower on a better/newer Intel CPU. The code snippet below typically runs 2 times slower on an i9-7920x (0.28ms/match) than on an i7-9700k (0.14ms/match).
#include <chrono>
#include <fstream>
#include <iostream> // std::cout
#include <thread>   // std::this_thread::sleep_for
#include <opencv2/opencv.hpp>
#pragma optimize("", off)
int main()
{
cv::Mat haystack;
cv::Mat needle;
cv::Mat result;
cv::Rect rect;
//https://en.wikipedia.org/wiki/Barack_Obama#/media/File:President_Barack_Obama.jpg
haystack = cv::imread("C:/President_Barack_Obama.jpg");
rect.width = 64;
rect.height = 64;
haystack = haystack(rect);
rect.width = 12;
rect.height = 12;
rect.x = 50;
rect.y = 50;
needle = haystack(rect);
auto start = std::chrono::high_resolution_clock::now();
int nbmatch = 10000;
for (int i = 0; i < nbmatch; i++) {
cv::matchTemplate(haystack, needle, result, cv::TemplateMatchModes::TM_CCOEFF_NORMED);
}
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> diff = end - start;
std::cout << "time per match: " << (diff.count() / nbmatch) * 1000 << " ms\n";
std::this_thread::sleep_for(std::chrono::seconds(500));
}
In my real application, I noticed this:
i7-9700k: 1ms;
i7-6800k: 1.3ms;
i9-7920x: 2.8ms;
i9-9820x: 2.8ms.
Both i9s are slower by a fair amount that cannot be explained by the slight difference in clock speed.
Win 7 or 10 does not make a difference. It is compiled with Visual Studio 2019 (v142). OpenCV is compiled from the source with the pre-built libraries (building it myself did not help).
Edit:
The ability to scale the frequency seems to have an important impact. When run single-threaded, the i9-7920x still takes 2.8ms if I sleep regularly, but if I yield instead (CPU load of 100%) it drops to 1.9ms.
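The two waiting strategies compared in this edit can be sketched as follows. This only illustrates the difference and is not the asker's application code; the function names are hypothetical. Sleeping lets the core go idle, and possibly clock down, while a yield loop keeps it busy.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Waits by sleeping: the core may go idle and lower its frequency.
inline void wait_sleeping(std::chrono::microseconds d) {
    assert(d.count() >= 0);
    std::this_thread::sleep_for(d);
}

// Waits by spinning with yield: the core stays at 100% load.
inline void wait_yielding(std::chrono::microseconds d) {
    assert(d.count() >= 0);
    const auto deadline = Clock::now() + d;
    while (Clock::now() < deadline) {
        std::this_thread::yield();
    }
}
```

Swapping one for the other between matches is a cheap way to check whether frequency scaling, rather than cv::matchTemplate itself, accounts for the difference.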
Question:
What could explain this?
Do you think it is possible to bring all processors into the same range of computation time using cv::matchTemplate?
What else could I do to reduce my computation time?
Over the past few weeks I have been wondering why people try to re-invent the wheel and spend hours writing their own sqrt function, for example. The built-in version is well optimized, precise and stable enough.
I am speaking about the Carmack-style square root, for example. What is the point? It loses precision in the approximation and it uses casting.
The Intel-style SSE square root gave precise results, but was slower in my calculations than the standard sqrt.
On average, all of the above tricks were beaten by far by the standard sqrt. So my question is, what is the point?
My PC has the below CPU:
Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz.
I've got the following results for each method (I've fixed the performance test according to the below suggestion by the helpful comment, thanks for that n.m.):
(Please keep in mind that if you're using approximation like Newton method, you'll lose precision so you must align your calculation accordingly.)
You can find the source code below for reference.
#include <chrono>
#include <cmath>
#include <cstdint> // int64_t, uint64_t
#include <cstdio>  // std::getchar
#include <deque>
#include <iomanip>
#include <iostream>
#include <immintrin.h>
#include <random>
using f64 = double;
using s64 = int64_t;
using u64 = uint64_t;
static constexpr u64 cycles = 24;
static constexpr u64 sample_max = 1000000;
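f64 sse_sqrt(const f64 x) {
// _mm_set_sd places only x in the low lane (upper lane zeroed);
// _mm_load_pd(&x) would read 16 aligned bytes, i.e. past the single double
__m128d root = _mm_sqrt_pd(_mm_set_sd(x));
return _mm_cvtsd_f64(root);
}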
constexpr f64 carmack_sqrt(const f64 x) {
union {
f64 x;
s64 i;
} u = {};
u.x = x;
u.i = 0x5fe6eb50c7b537a9 - (u.i >> 1);
f64 xhalf = 0.5 * x;
u.x = u.x * (1.5 - xhalf * u.x * u.x);
// u.x = u.x * (1.5 - xhalf * u.x * u.x);
// u.x = u.x * (1.5 - xhalf * u.x * u.x);
// ... and so on, if you want a more precise result ...
return u.x * x;
}
int main(int /* argc */, char ** /*argv*/) {
std::random_device r;
std::default_random_engine e(r());
std::uniform_real_distribution<f64> dist(1, sample_max);
std::deque<f64> samples(sample_max);
for (auto& sample : samples) {
sample = dist(e);
}
// std sqrt
{
std::cout << "> Measuring std sqrt.\r\n> Please wait . . .\r\n";
f64 result = 0;
auto t1 = std::chrono::high_resolution_clock::now();
for (auto cycle = 0; cycle < cycles; ++cycle) {
for (auto& sample : samples) {
result += std::sqrt(static_cast<f64>(sample));
}
}
auto t2 = std::chrono::high_resolution_clock::now();
auto dt = t2 - t1;
std::cout << "> Accumulated result: " << std::setprecision(19) << result << "\n";
std::cout << "> Total execution time: " <<
std::chrono::duration_cast<std::chrono::milliseconds>(dt).count() << " ms.\r\n\r\n";
}
// sse sqrt
{
std::cout << "> Measuring sse sqrt.\r\n> Please wait . . .\r\n";
f64 result = 0;
auto t1 = std::chrono::high_resolution_clock::now();
for (auto cycle = 0; cycle < cycles; ++cycle) {
for (auto& sample : samples) {
result += sse_sqrt(static_cast<f64>(sample));
}
}
auto t2 = std::chrono::high_resolution_clock::now();
auto dt = t2 - t1;
std::cout << "> Accumulated result: " << std::setprecision(19) << result << "\n";
std::cout << "> Total execution time: " <<
std::chrono::duration_cast<std::chrono::milliseconds>(dt).count() << " ms.\r\n\r\n";
}
// carmack sqrt
{
std::cout << "> Measuring carmack sqrt.\r\n> Please wait . . .\r\n";
f64 result = 0;
auto t1 = std::chrono::high_resolution_clock::now();
for (auto cycle = 0; cycle < cycles; ++cycle) {
for (auto& sample : samples) {
result += carmack_sqrt(static_cast<f64>(sample));
}
}
auto t2 = std::chrono::high_resolution_clock::now();
auto dt = t2 - t1;
std::cout << "> Accumulated result: " << std::setprecision(19) << result << "\n";
std::cout << "> Total execution time: " <<
std::chrono::duration_cast<std::chrono::milliseconds>(dt).count() << " ms.\r\n\r\n";
}
std::cout << "> Press any key to exit . . .\r\n";
std::getchar();
return 0;
}
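Reading u.i after writing u.x, as carmack_sqrt does above, is undefined behaviour in C++ (C permits it). As a minimal sketch, the same approximation can be written with std::memcpy for the bit copy and a configurable number of Newton steps, so the precision loss can be checked directly. The names fast_sqrt and relative_error are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Same approximation as carmack_sqrt above, with the bit reinterpretation
// done through memcpy. `newton_steps` refines the reciprocal square root.
inline double fast_sqrt(double x, int newton_steps) {
    assert(x > 0.0 && newton_steps >= 0);
    std::int64_t i = 0;
    std::memcpy(&i, &x, sizeof i);
    i = 0x5fe6eb50c7b537a9 - (i >> 1);   // initial guess for 1/sqrt(x)
    double y = 0.0;
    std::memcpy(&y, &i, sizeof y);
    const double xhalf = 0.5 * x;
    for (int k = 0; k < newton_steps; ++k) {
        y = y * (1.5 - xhalf * y * y);   // one Newton-Raphson step
    }
    return y * x;                        // x * (1/sqrt(x)) = sqrt(x)
}

// Relative deviation of the approximation from std::sqrt.
inline double relative_error(double x, int newton_steps) {
    const double exact = std::sqrt(x);
    return std::fabs(fast_sqrt(x, newton_steps) - exact) / exact;
}
```

Each additional Newton step roughly squares the relative error of the estimate, at the cost of a few more multiplications, so the step count is the knob that trades speed against precision.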
Please note that I am not here to criticize anybody; I am here to learn, experiment and figure out my own method and the best toolset to choose from.
I am writing my own game engine for my portfolio. I appreciate your kind answers and I am open to any suggestions.
Have a nice day.
That fast reciprocal square root trick is mostly obsolete. SSE's built-in approximate reciprocal square root, which has existed since the Pentium 3, has completely replaced it on the PC platform. Other platforms usually have their own reciprocal square root; ARM, for example, has VRSQRTE and a handy instruction that does the Newton step too.
By the way, turning the result into a non-reciprocal square root usually makes it less useful: the primary use case is normalizing a vector, where a "straight" square root is annoying (it would have to be divided by) while a reciprocal square root fits exactly (then it's a multiply).
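A scalar sketch of that point, in plain C++ rather than intrinsics (the Vec3 type and the function names are hypothetical). With a straight square root the components have to be divided by the length; with a reciprocal square root they are only multiplied. Here the reciprocal is computed exactly as 1/sqrt; in fast code an approximate hardware instruction would take its place.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// "Straight" square root: three divisions by the length.
inline Vec3 normalize_with_sqrt(Vec3 v) {
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    assert(len > 0.0f);
    return {v.x / len, v.y / len, v.z / len};
}

// Reciprocal square root: one reciprocal, then three multiplications.
inline Vec3 normalize_with_rsqrt(Vec3 v) {
    const float len2 = v.x * v.x + v.y * v.y + v.z * v.z;
    assert(len2 > 0.0f);
    const float inv_len = 1.0f / std::sqrt(len2);  // stand-in for an rsqrt instruction
    return {v.x * inv_len, v.y * inv_len, v.z * inv_len};
}
```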
As is often the case, your benchmark isn't quite accurate. I happen to have done some relevant tests a while ago, where the relevant parts look like this:
std::sqrt based:
HMM_INLINE float HMM_LengthVec4(hmm_vec4 A)
{
float Result = std::sqrt(HMM_LengthSquaredVec4(A));
return(Result);
}
HMM_INLINE hmm_vec4 HMM_NormalizeVec4(hmm_vec4 A)
{
hmm_vec4 Result = {0};
float VectorLength = HMM_LengthVec4(A);
/* NOTE(kiljacken): We need a zero check to not divide-by-zero */
if (VectorLength != 0.0f)
{
float Multiplier = 1.0f / VectorLength;
#ifdef HANDMADE_MATH__USE_SSE
__m128 SSEMultiplier = _mm_set1_ps(Multiplier);
Result.InternalElementsSSE = _mm_mul_ps(A.InternalElementsSSE, SSEMultiplier);
#else
Result.X = A.X * Multiplier;
Result.Y = A.Y * Multiplier;
Result.Z = A.Z * Multiplier;
Result.W = A.W * Multiplier;
#endif
}
return (Result);
}
SSE reciprocal square root plus Newton step:
HMM_INLINE hmm_vec4 HMM_NormalizeVec4_new(hmm_vec4 A)
{
hmm_vec4 Result;
// square elements and add them together, result is in every lane
__m128 t0 = _mm_mul_ps(A.InternalElementsSSE, A.InternalElementsSSE);
__m128 t1 = _mm_add_ps(t0, _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(2, 3, 0, 1)));
__m128 sq = _mm_add_ps(t1, _mm_shuffle_ps(t1, t1, _MM_SHUFFLE(0, 1, 2, 3)));
// compute reciprocal square root with Newton step for ~22bit accuracy
__m128 rLen = _mm_rsqrt_ps(sq);
__m128 half = _mm_set1_ps(0.5);
__m128 threehalf = _mm_set1_ps(1.5);
__m128 t = _mm_mul_ps(_mm_mul_ps(sq, half), _mm_mul_ps(rLen, rLen));
rLen = _mm_mul_ps(rLen, _mm_sub_ps(threehalf, t));
// multiply elements by the reciprocal of the vector length
__m128 normed = _mm_mul_ps(A.InternalElementsSSE, rLen);
// normalize zero-vector to zero, not to NaN
__m128 zero = _mm_setzero_ps();
Result.InternalElementsSSE = _mm_andnot_ps(_mm_cmpeq_ps(A.InternalElementsSSE, zero), normed);
return (Result);
}
SSE reciprocal square root without Newton step:
HMM_INLINE hmm_vec4 HMM_NormalizeVec4_lowacc(hmm_vec4 A)
{
hmm_vec4 Result;
// square elements and add them together, result is in every lane
__m128 t0 = _mm_mul_ps(A.InternalElementsSSE, A.InternalElementsSSE);
__m128 t1 = _mm_add_ps(t0, _mm_shuffle_ps(t0, t0, _MM_SHUFFLE(2, 3, 0, 1)));
__m128 sq = _mm_add_ps(t1, _mm_shuffle_ps(t1, t1, _MM_SHUFFLE(0, 1, 2, 3)));
// compute reciprocal square root without Newton step for ~12bit accuracy
__m128 rLen = _mm_rsqrt_ps(sq);
// multiply elements by the reciprocal of the vector length
__m128 normed = _mm_mul_ps(A.InternalElementsSSE, rLen);
// normalize zero-vector to zero, not to NaN
__m128 zero = _mm_setzero_ps();
Result.InternalElementsSSE = _mm_andnot_ps(_mm_cmpeq_ps(A.InternalElementsSSE, zero), normed);
return (Result);
}
(quick-bench)
As you can see I measured throughput and latency separately, and the distinction mattered a lot. Reciprocal square root with a Newton step takes a long time, about as long as using a normal square root, but can be processed at a higher throughput. Without Newton step, a single vector-normalize operation takes less time from start to finish too, and the throughput becomes even better than before. Anyway this should demonstrate that there is some point to doing something about your square roots.
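The latency/throughput distinction can be sketched with two loops over std::sqrt. This is a simplified illustration with hypothetical function names, not the benchmark behind the measurements discussed above. In the first loop every square root needs the previous result, so the loop time reflects latency. In the second the square roots are independent, so the CPU can overlap them and the loop time reflects throughput.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Latency-bound: each iteration depends on the result of the previous one.
inline double dependent_chain(double x, std::size_t n) {
    assert(x >= 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        x = std::sqrt(x + 1.0);
    }
    return x;
}

// Throughput-bound: the square roots are independent of each other;
// only the cheap additions form a dependency chain.
inline double independent_sum(const std::vector<double>& xs) {
    double sum = 0.0;
    for (double x : xs) {
        assert(x >= 0.0);
        sum += std::sqrt(x);
    }
    return sum;
}
```

Timing both loops with the same number of square roots gives two different per-operation costs, which is why a single "time per sqrt" figure can mislead.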
By the way the above code is not meant to be Good Practice, that would be normalizing 4 vectors simultaneously, so as to not waste a 4-wide SIMD operation on calculating a single (reciprocal) square root. That's not really the issue here though.
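As a rough sketch of what "normalizing 4 vectors simultaneously" implies for the data layout, the four vectors can be stored component-wise so that lane i of every operation belongs to vector i. The code below is plain scalar C++ with hypothetical names; it shows the layout only, and whether a compiler turns the loop into 4-wide SIMD instructions is not guaranteed.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Four 4-component vectors stored component-wise (structure of arrays):
// x[i], y[i], z[i], w[i] belong to vector i.
struct Vec4x4 {
    std::array<float, 4> x, y, z, w;
};

inline Vec4x4 normalize4(const Vec4x4& v) {
    Vec4x4 out{};
    for (int i = 0; i < 4; ++i) {
        const float len2 = v.x[i] * v.x[i] + v.y[i] * v.y[i]
                         + v.z[i] * v.z[i] + v.w[i] * v.w[i];
        assert(std::isfinite(len2));
        // One reciprocal square root per vector; zero vectors stay zero.
        const float inv = len2 > 0.0f ? 1.0f / std::sqrt(len2) : 0.0f;
        out.x[i] = v.x[i] * inv;
        out.y[i] = v.y[i] * inv;
        out.z[i] = v.z[i] * inv;
        out.w[i] = v.w[i] * inv;
    }
    return out;
}
```

In this layout a single 4-wide reciprocal square root would produce the four lengths at once, so none of its lanes would be wasted.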
For fun and profit?
Based on your question, there is no reason to do so, but if you want to learn a language, it is recommended to solve mathematical problems, because they rely on integers/floats (which are primitives in (mostly) any language) and the algorithms are well documented.
In "real" code, one should use the methods provided by libc as long as there is one. Embedded platforms usually lack a libc or roll their own, so there you have to implement your own.
The point of Carmack’s technique was to extract better performance from integer operations than could be got from floating-point operations in the 1990s. Since that time, floating-point performance has massively improved, as you can see in your own benchmarks. There would be no practical reason to use that technique in new code unless you faced a similar constraint in your hardware, which the i7 doesn’t have.
Limitation of standard library usage due to scope of project
What is the point implementing custom math functions in C++ (like SQRT)?
In addition to what has already been mentioned in the other answers, the choice of a project implementing their own (custom) math functions could be due to:
Limitations placed on the project, e.g. due to some standard, to not make use of the math library shipped with the compiler.
One example could be an ASIL classified project adhering to the ISO 26262 Standard, say, making use of a compiler that provides adequate qualification w.r.t. correctly compiling the project's source code, but that does not provide adequate qualification for the shipped standard library, where e.g. the math library could be linked in only by object and not source code (for the latter, appropriate tests and source code qualification could be written by the project themselves).
I might have discovered a huge performance issue with OpenCV's own implementation of matrix multiplication / summation, and wanted to check with you guys whether I am missing something:
In advance: All runs were done in (OpenCV's) Release Mode.
Setup:
(a) I'll do 10 million times a matrix-vector multiplication with a 3-by-3 matrix and a 3-by-1 vector. The implementation follows the code: res = mat * vec;
(b) I'll do the same with my own implementation of accessing the elements individually and then doing the multiplication process using pointer-arithmetic. [basically just multiplying out the process and writing down the equations for each row for the result vector]
I tested these variants with the compiler flags -O0, -O1, -O2, -O3, -Ofast and for OpenCV 3.1 & 3.2.
The timings are done using chrono (high_resolution_clock) on Ubuntu 16.04.
Findings:
In all cases the non-optimized method (b) outperforms the OpenCV method (a) by a factor of ~100 to ~1000.
Question:
How can that be the case? Shouldn't OpenCV be optimized for these kinds of procedures? Should I raise an issue on Github, or is there something I'm totally missing?
Code: [Ready to copy and test on your machine]
#include <chrono>
#include <iostream>
#include <vector> // std::vector is used for the timestamps
#include "opencv2/core/cvstd.hpp"
#include "opencv2/core.hpp"
#include "opencv2/imgproc.hpp"
#include "opencv2/highgui.hpp"
int main()
{
// 1. Setup:
std::vector<std::chrono::high_resolution_clock::time_point> timestamp_vec_start(2);
std::vector<std::chrono::high_resolution_clock::time_point> timestamp_vec_end(2);
std::vector<double> timestamp_vec_total(2);
cv::Mat test_mat = (cv::Mat_<float>(3,3) << 0.023, 232.33, 0.545,
22.22, 0.1123, 4.444,
0.012, 3.4521, 0.202);
cv::Mat test_vec = (cv::Mat_<float>(3,1) << 5.77,
1.20,
0.03);
cv::Mat result_1 = cv::Mat(3, 1, CV_32FC1);
cv::Mat result_2 = cv::Mat(3, 1, CV_32FC1);
cv::Mat temp_test_mat_results = cv::Mat(3, 3, CV_32FC1);
cv::Mat temp_test_vec_results = cv::Mat(3, 1, CV_32FC1);
auto ptr_test_mat_res_0 = temp_test_mat_results.ptr<float>(0);
auto ptr_test_mat_res_1 = temp_test_mat_results.ptr<float>(1);
auto ptr_test_mat_res_2 = temp_test_mat_results.ptr<float>(2);
auto ptr_test_vec_res_0 = temp_test_vec_results.ptr<float>(0);
auto ptr_test_vec_res_1 = temp_test_vec_results.ptr<float>(1);
auto ptr_test_vec_res_2 = temp_test_vec_results.ptr<float>(2);
auto ptr_res_0 = result_2.ptr<float>(0);
auto ptr_res_1 = result_2.ptr<float>(1);
auto ptr_res_2 = result_2.ptr<float>(2);
// 2. OpenCV Basic Matrix Operations:
timestamp_vec_start[0] = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 10000000; ++i)
{
// factor of up to 5000 here:
// result_1 = (test_mat + test_mat + test_mat) * (test_vec + test_vec);
// factor of 30~100 here:
result_1 = test_mat * test_vec;
}
timestamp_vec_end[0] = std::chrono::high_resolution_clock::now();
timestamp_vec_total[0] = static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(timestamp_vec_end[0] - timestamp_vec_start[0]).count());
// 3. Pixel-Wise Operations:
timestamp_vec_start[1] = std::chrono::high_resolution_clock::now();
for(int i = 0; i < 10000000; ++i)
{
auto ptr_test_mat_0 = test_mat.ptr<float>(0);
auto ptr_test_mat_1 = test_mat.ptr<float>(1);
auto ptr_test_mat_2 = test_mat.ptr<float>(2);
auto ptr_test_vec_0 = test_vec.ptr<float>(0);
auto ptr_test_vec_1 = test_vec.ptr<float>(1);
auto ptr_test_vec_2 = test_vec.ptr<float>(2);
ptr_test_mat_res_0[0] = ptr_test_mat_0[0] + ptr_test_mat_0[0] + ptr_test_mat_0[0];
ptr_test_mat_res_0[1] = ptr_test_mat_0[1] + ptr_test_mat_0[1] + ptr_test_mat_0[1];
ptr_test_mat_res_0[2] = ptr_test_mat_0[2] + ptr_test_mat_0[2] + ptr_test_mat_0[2];
ptr_test_mat_res_1[0] = ptr_test_mat_1[0] + ptr_test_mat_1[0] + ptr_test_mat_1[0];
ptr_test_mat_res_1[1] = ptr_test_mat_1[1] + ptr_test_mat_1[1] + ptr_test_mat_1[1];
ptr_test_mat_res_1[2] = ptr_test_mat_1[2] + ptr_test_mat_1[2] + ptr_test_mat_1[2];
ptr_test_mat_res_2[0] = ptr_test_mat_2[0] + ptr_test_mat_2[0] + ptr_test_mat_2[0];
ptr_test_mat_res_2[1] = ptr_test_mat_2[1] + ptr_test_mat_2[1] + ptr_test_mat_2[1];
ptr_test_mat_res_2[2] = ptr_test_mat_2[2] + ptr_test_mat_2[2] + ptr_test_mat_2[2];
ptr_test_vec_res_0[0] = ptr_test_vec_0[0] + ptr_test_vec_0[0];
ptr_test_vec_res_1[0] = ptr_test_vec_1[0] + ptr_test_vec_1[0];
ptr_test_vec_res_2[0] = ptr_test_vec_2[0] + ptr_test_vec_2[0];
ptr_res_0[0] = ptr_test_mat_res_0[0]*ptr_test_vec_res_0[0] + ptr_test_mat_res_0[1]*ptr_test_vec_res_1[0] + ptr_test_mat_res_0[2]*ptr_test_vec_res_2[0];
ptr_res_1[0] = ptr_test_mat_res_1[0]*ptr_test_vec_res_0[0] + ptr_test_mat_res_1[1]*ptr_test_vec_res_1[0] + ptr_test_mat_res_1[2]*ptr_test_vec_res_2[0];
ptr_res_2[0] = ptr_test_mat_res_2[0]*ptr_test_vec_res_0[0] + ptr_test_mat_res_2[1]*ptr_test_vec_res_1[0] + ptr_test_mat_res_2[2]*ptr_test_vec_res_2[0];
}
timestamp_vec_end[1] = std::chrono::high_resolution_clock::now();
timestamp_vec_total[1] = static_cast<double>(std::chrono::duration_cast<std::chrono::microseconds>(timestamp_vec_end[1] - timestamp_vec_start[1]).count());
// 4. Printout Timing Results:
std::cout << "\n\nTimings:\n\n";
std::cout << "Time spent in OpenCV's implementation: " << timestamp_vec_total[0]/1000.0 << " ms.\n";
std::cout << "Time spent in element-wise implementation: " << timestamp_vec_total[1]/1000.0 << " ms.\n\n";
std::cin.get();
return 0;
}
OpenCV is not optimized for small matrix operations.
You can reduce your overhead a little by using cv::gemm, which avoids allocating a new matrix for the result inside the loop.
But if small matrix operations are a bottleneck for you I recommend using Eigen.
Using a quick Eigen implementation like:
Eigen::Matrix3d mat;
mat << 0.023, 232.33, 0.545,
22.22, 0.1123, 4.444,
0.012, 3.4521, 0.202;
Eigen::Vector3d vec3;
vec3 << 5.77,
1.20,
0.03;
Eigen::Vector3d result_e;
for (int i = 0; i < 10000000; ++i)
{
result_e = (mat *3 ) * (vec3 *2);
}
gives me the following numbers with VS2015 (obviously the difference might be less dramatic in GCC or Clang):
Timings:
Time spent in OpenCV's implementation: 2384.45 ms.
Time spent in element-wise implementation: 78.653 ms.
Time spent in Eigen implementation: 36.088 ms.
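Part of the gap is per-call overhead that does not depend on the matrix size. As a library-independent sketch (the type aliases and the function name are hypothetical), a fixed-size 3x3 product needs no heap allocation and no runtime size checks, and the compiler can unroll it completely. Fixed-size types such as Eigen::Matrix3d are designed to allow the same thing.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Matrix3x3 = std::array<std::array<double, 3>, 3>;
using Vector3 = std::array<double, 3>;

// Writes m * v into `out`; no allocation happens inside the call.
inline void multiply_into(const Matrix3x3& m, const Vector3& v, Vector3& out) {
    assert(&v != &out && "output must not alias the input vector");
    for (std::size_t row = 0; row < 3; ++row) {
        double acc = 0.0;
        for (std::size_t col = 0; col < 3; ++col) {
            acc += m[row][col] * v[col];
        }
        out[row] = acc;
    }
}
```

The output is passed in by reference so that a loop of ten million products reuses the same storage, which mirrors the advice about not allocating the result inside the loop.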
I am totally surprised by all of your answers. Thank you very much!
The buggy code was the following:
percentage = (double)kk * 100.0 / (double)totalnum;
After I modified it to:
percentage = (double)kk * 100.0 / totalnum;
The problem is SOLVED. This simple division consumed about 90s out of the 150s. Maybe division between a double and an int is faster than division between two doubles.
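For what it is worth, the C++ usual arithmetic conversions turn an integer operand into a double before a double/int division, so the two expressions should describe the same computation, and the speed-up may have come from something else that changed between the two runs. A small sketch to check this, assuming kk and totalnum are ints (their declarations are not shown in the post); the function names are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Assumption: kk and totalnum are ints; their declarations are not in the post.
inline double percentage_cast_both(int kk, int totalnum) {
    assert(totalnum != 0);
    return (double)kk * 100.0 / (double)totalnum;
}

inline double percentage_cast_one(int kk, int totalnum) {
    assert(totalnum != 0);
    return (double)kk * 100.0 / totalnum;  // totalnum is converted to double implicitly
}

// After conversion both divisions are double / double.
static_assert(std::is_same<decltype(1.0 / 2), double>::value,
              "double / int yields double");
```

Comparing the generated assembly of the two functions would be a direct way to confirm whether the compiler treats them differently.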
Again, thanks for all of your answers!
I'm trying to get the average image from a set of pictures that come from a video. There are only 2 steps for this job:
Sum up all the images into a matrix.
Divide the matrix by the number of images.
I used the following code in OpenCV (C++):
Mat avIM = Mat::zeros(IMG_HEIGHT, IMG_WIDTH, CV_32FC3);
for (ii = startnum; ii <= endnum; ii += interval) {
string fullname = argv[1];
sprintf(filename, "\\%d.png", ii);
fullname.append(filename);
Mat tempIM = imread(fullname.c_str());
if (tempIM.empty()) { cout << "Can't open image!\n"; return -1; }
tempIM.convertTo(tempIM, CV_32FC3);
avIM += tempIM; //Sum up every image
++kk;
}
avIM = avIM * (double)(1.0 / kk); //get average
And the following code in MatLab (2015a):
avIM = zeros(size(imread([im.dir,'\',num2str(startnum),'.png'])));
pointIdx = startnum:interval:endnum;
for j=pointIdx,
IM = imread([im.dir,'\',num2str(j),'.png']);
avIM = avIM + double(IM); %Sum up every image
end
avIM = uint8(round(avIM./size(pointIdx,2))); %get average
But when I ran those two programs on 2,100 images, OpenCV took 150.3s (Release) and MatLab took 103.1s. It really confused me that a C++ program runs slower than a MatLab script.
So what's slowing down my OpenCV program? If it's caused by my method of matrix accessing, what should I do to improve the efficiency?
Your code seems good enough, and in my tests I found it's running 10 times faster than Matlab code.
However, here is a slightly optimized version that performs a little faster than yours.
Notes
Please note that I don't have a folder with images named like yours, so I used cv::glob in the C++ version and dir in the Matlab version to get the names of the images in the folder.
In my folder I have 82 small images, so the running time is obviously smaller than yours, but the relative performance should be reliable.
Execution time
                      Sum only        (Get filenames + Sum)
Matlab:               0.173543 s      (0.185308 s)
OpenCV @Seven Wang:   0.0145206 s     (0.0155748 s)
OpenCV @Miki:         0.0128943 s     (0.013333 s)
Considerations
Be sure that you're computing the running time consistently in OpenCV and Matlab.
Code
Matlab code:
tic
folder = 'D:\\SO\\temp\\old_075_6\\';
filenames = dir([folder '*.bmp']);
% Get rows and cols from 1st image
img = imread([folder filenames(1).name]); % 'name' is not defined yet at this point
S = zeros(size(img));
for ii = 1 : length(filenames)
name = filenames(ii).name;
currentImage = imread([folder name]);
S = S + double(currentImage);
end
S = uint8(round(S / length(filenames)));
toc
C++ code:
#include <opencv2/opencv.hpp>
#include <cstdio>   // getchar
#include <iostream>
#include <string>
#include <vector>
int main()
{
double ticLoad = double(cv::getTickCount());
std::string folder = "D:\\SO\\temp\\old_075_6\\*.bmp";
std::vector<cv::String> filenames;
cv::glob(folder, filenames);
int rows, cols;
{
// Just load the first image to get rows and cols
cv::Mat3b img = cv::imread(filenames[0]);
rows = img.rows;
cols = img.cols;
}
/*{
double tic = double(cv::getTickCount());
cv::Mat3d S(rows, cols, 0.0);
for (const auto& name : filenames)
{
cv::Mat currentImage = cv::imread(name);
currentImage.convertTo(currentImage, CV_64F);
S += currentImage;
}
S = S * double(1.0 / filenames.size());
cv::Mat3b avg;
S.convertTo(avg, CV_8U);
double toc = double(cv::getTickCount());
double timeLoad = (toc - ticLoad) / cv::getTickFrequency();
double time = (toc - tic) / cv::getTickFrequency();
std::cout << "#Seven Wang: " << time << " s (" << timeLoad << " s)" << std::endl;
}*/
{
double tic = double(cv::getTickCount());
cv::Mat3d S(rows, cols, 0.0);
cv::Mat3b currentImage;
for (const auto& name : filenames)
{
currentImage = cv::imread(name);
cv::add(S, currentImage, S, cv::noArray(), CV_64F);
}
S /= filenames.size();
cv::Mat3b avg;
S.convertTo(avg, CV_8U);
double toc = double(cv::getTickCount());
double timeLoad = (toc - ticLoad) / cv::getTickFrequency();
double time = (toc - tic) / cv::getTickFrequency();
std::cout << "#Miki: " << time << " s (" << timeLoad << " s)" << std::endl;
}
getchar();
return 0;
}
One point that drew my attention is the type CV_32FC3. Do you specifically need a 32-bit float matrix, and are you sure Matlab reads the pixel values the same way?
You have this extra step
tempIM.convertTo(tempIM, CV_32FC3);
in your C++ code, whereas Matlab operates on the image as soon as it is read, without any conversion, which might be slowing down your C++ code. Furthermore, if Matlab is not working on floating-point values, that might contribute to the speed difference, since floating-point arithmetic is generally more expensive for the CPU than integer arithmetic.
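As an illustration of avoiding a separate conversion pass, the sketch below widens each 8-bit sample while it is being accumulated, instead of first converting the whole image to a floating-point copy. It is plain C++ on flat buffers, not OpenCV code: `Image8` and `averageImages` are hypothetical names, and whether this actually helps in your program is something you would need to measure.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an 8-bit image: a flat buffer of samples.
using Image8 = std::vector<std::uint8_t>;

// Averages equally sized 8-bit images. Each sample is widened to double
// at the moment it is added, so no temporary floating-point image is made.
std::vector<std::uint8_t> averageImages(const std::vector<Image8>& images)
{
    assert(!images.empty());
    const std::size_t n = images.front().size();

    std::vector<double> sum(n, 0.0);
    for (const Image8& img : images)
    {
        assert(img.size() == n);
        for (std::size_t i = 0; i < n; ++i)
            sum[i] += img[i];  // implicit uint8 -> double widening
    }

    // Round to nearest (halves away from zero) and store back as 8-bit.
    const double count = static_cast<double>(images.size());
    std::vector<std::uint8_t> avg(n);
    for (std::size_t i = 0; i < n; ++i)
        avg[i] = static_cast<std::uint8_t>(std::lround(sum[i] / count));
    return avg;
}
```

The final rounding step mirrors the uint8(round(...)) call in the Matlab script from the question.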
Let's say A and B are matrices of the same size.
In Matlab, I could use simple indexing as below.
idx = A>0;
B(idx) = 0
How can I do this in OpenCV? Should I just use
for (i=0; ... rows)
for(j=0; ... cols)
if (A.at<double>(i,j)>0) B.at<double>(i,j) = 0;
something like this? Is there a better (faster and more efficient) way?
Moreover, in OpenCV, when I try
Mat idx = A>0;
the variable idx seems to be a CV_8U matrix (not boolean but integer).
You can easily convert this MATLAB code:
idx = A > 0;
B(idx) = 0;
% same as
B(A>0) = 0;
to OpenCV as:
Mat1d A(...);
Mat1d B(...);
Mat1b idx = A > 0;
B.setTo(0, idx);
// or
B.setTo(0, A > 0);
Regarding performance, in C++ it's usually faster (it depends on the enabled optimizations) to work on raw pointers (but is less readable):
for (int r = 0; r < B.rows; ++r)
{
double* pA = A.ptr<double>(r);
double* pB = B.ptr<double>(r);
for (int c = 0; c < B.cols; ++c)
{
if (pA[c] > 0.0) pB[c] = 0.0;
}
}
Also note that OpenCV has no boolean matrix type: the mask is a CV_8UC1 matrix (a single-channel matrix of unsigned char), where 0 means false and any value >0 means true (typically 255).
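To show what these mask semantics amount to, here is a library-free sketch that builds a 0/255 mask from a comparison and then assigns a value wherever the mask is nonzero. It is only a conceptual model on flat vectors, not how OpenCV implements the comparison or setTo; `greaterThanZero` and `setWhereMasked` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Conceptual model of `A > 0`: one unsigned char per element,
// 255 where the comparison holds and 0 elsewhere.
std::vector<unsigned char> greaterThanZero(const std::vector<double>& a)
{
    std::vector<unsigned char> mask(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        mask[i] = (a[i] > 0.0) ? 255 : 0;
    return mask;
}

// Conceptual model of a masked assignment: every element whose mask
// entry is nonzero is overwritten, all other elements are left alone.
void setWhereMasked(std::vector<double>& b,
                    const std::vector<unsigned char>& mask,
                    double value)
{
    assert(b.size() == mask.size());
    for (std::size_t i = 0; i < b.size(); ++i)
        if (mask[i] != 0)
            b[i] = value;
}
```

Because only "zero or nonzero" matters when the mask is applied, a mask filled with 1s would select the same elements as one filled with 255s.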
Evaluation
Note that this may vary according to optimization enabled with OpenCV. You can test the code below on your PC to get accurate results.
Time in ms:
            my results             my results        #AdrienDescamps
            (OpenCV 3.0 No IPP)    (OpenCV 2.4.9)
Matlab:     13.473
C++ Mask:   640.824                5.81815           ~5
C++ Loop:   5.24414                4.95127           ~4
Note: I'm not entirely sure about the performance drop with OpenCV 3.0, so I just remark: test the code below on your PC to get accurate results.
As #AdrienDescamps stated in comments:
It seems that the performance drop with OpenCV 3.0 is related to the OpenCL option, that is now enabled in the comparison operator.
C++ Code
#include <opencv2/opencv.hpp>
#include <iostream>
using namespace std;
using namespace cv;
int main()
{
// Random initialize A with values in [-100, 100]
Mat1d A(1000, 1000);
randu(A, Scalar(-100), Scalar(100));
// B initialized with some constant (5) value
Mat1d B(A.rows, A.cols, 5.0);
// Operation: B(A>0) = 0;
{
// Using mask
double tic = double(getTickCount());
B.setTo(0, A > 0);
double toc = (double(getTickCount()) - tic) * 1000 / getTickFrequency();
cout << "Mask: " << toc << endl;
}
{
// Using for loop
double tic = double(getTickCount());
for (int r = 0; r < B.rows; ++r)
{
double* pA = A.ptr<double>(r);
double* pB = B.ptr<double>(r);
for (int c = 0; c < B.cols; ++c)
{
if (pA[c] > 0.0) pB[c] = 0.0;
}
}
double toc = (double(getTickCount()) - tic) * 1000 / getTickFrequency();
cout << "Loop: " << toc << endl;
}
getchar();
return 0;
}
Matlab Code
% Random initialize A with values in [-100, 100]
A = (rand(1000) * 200) - 100;
% B initialized with some constant (5) value
B = ones(1000) * 5;
tic
B(A>0) = 0;
toc
UPDATE
OpenCV 3.0 uses IPP optimization in the function setTo. If you have that enabled (you can check with cv::getBuildInformation()), you'll have a faster computation.
Miki's answer is very good, but I just want to add some clarification about the performance problem to avoid any confusion.
It is true that the best way to implement an image filter (or any algorithm) with OpenCV is to use the raw pointers, as shown in the second C++ example of Miki (C++ Loop).
Using the at function is also correct, but significantly slower.
However, most of the time you don't need to worry about that, and you can simply use the high-level functions of OpenCV (Miki's first example, C++ Mask). They are well optimized and will usually be almost as fast as a low-level loop on pointers, or even faster.
Of course, there are exceptions (we just found one), and you should always test for your specific problem.
Now, regarding this specific problem:
The example here, where the high-level function was much slower (about 100x) than the low-level loop, is NOT a normal case, as the much lower timings obtained with other versions and configurations of OpenCV demonstrate.
The problem seems to be that when OpenCV 3.0 is compiled with OpenCL, there is a huge overhead the first time a function that uses OpenCL is called. The simplest solution is to disable OpenCL at compile time if you use OpenCV 3.0 (see also here for other possible solutions if you are interested).
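One way to check whether a first-call overhead like this is distorting a benchmark is to time the first call separately from the following ones. The sketch below does that with the standard library only; `WarmupTiming` and `timeFirstAndSteady` are hypothetical names, and the callable you pass in (for example a lambda wrapping the masked setTo call) is up to you.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical result type for this illustration.
struct WarmupTiming
{
    double firstCallMs;   // first call, including any one-time setup
    double steadyMeanMs;  // mean time of the calls that follow
};

// Calls `fn` once and times it, then calls it `steadyRuns` more times
// and reports the mean of those later calls.
template <typename Fn>
WarmupTiming timeFirstAndSteady(Fn&& fn, int steadyRuns)
{
    assert(steadyRuns > 0);
    using Clock = std::chrono::steady_clock;
    using Ms = std::chrono::duration<double, std::milli>;

    auto start = Clock::now();
    fn();
    const double first = Ms(Clock::now() - start).count();

    start = Clock::now();
    for (int i = 0; i < steadyRuns; ++i)
        fn();
    const double steady = Ms(Clock::now() - start).count() / steadyRuns;

    return {first, steady};
}
```

If `firstCallMs` is far larger than `steadyMeanMs`, the single-shot timing is dominated by initialization rather than by the operation itself.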